Vietnam National University - Ho Chi Minh City
Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering
GRADUATE THESIS
GROUNDED LANGUAGE LEARNING: IMPROVE TEXT
REPRESENTATION WITH VISUAL INFORMATION
Major: Computer science
Council: Computer Science 11
Supervisor: Assoc. Prof. Quan Thanh Tho
Reviewer: Mr. Le Dinh Thuan
Student: Nguyen Tran Cong Duy (1710043)
Grounded Language Learning: Improve Text
Representation with Visual Information
Thesis
Nguyen Tran Cong Duy
Supervisors
Assoc. Prof. Quan Thanh Tho
Assurance
I hereby declare that, except for results referenced from other related works as specified in the thesis, the content presented in this thesis is my own work,
and no part of it has been submitted for a degree at another institution.
Ho Chi Minh City, July 11, 2021
Acknowledgement
First and foremost, I am deeply grateful to my supervisor, Assoc. Prof. Quan Thanh
Tho, for his consistent help and guidance throughout my thesis, and for the
opportunity to study and do research that he gave me. I also thank my parents and
my best friends for their constant encouragement, support, and care.
Abstract
People learn languages through listening, speaking, reading, writing, and
multimodal interactions with the real world. From a young age, a child is taught
to listen and speak, is taught gestures, and learns through pictures; people do
not learn language only from text-only resources or books, but from a combination
of images, stories, and descriptive sentences. Today's language models, by
contrast, mostly do not learn from real-world signals and are trained on purely
linguistic data sources. A few approaches have combined language with other
modalities in applications such as robotics and vision-and-language tasks, with
positive results, but doing so remains a major challenge in industry, machine
learning, and deep learning today.
Contents
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Our contribution
1.3 The Scope of the Thesis
1.4 Organization of the Thesis
2 Foundations
2.1 Neural Network - Multilayer perceptron
2.2 Convolutional Neural Network
2.3 Object Detection
2.4 Recurrent Neural Network
2.5 Transformer
2.6 BERT
3 Related work
3.1 Overall
3.2 Grounded Language Learning Approaches
4 Motivation
4.1 Motivation
4.2 Propose
5 Methodology
5.1 Approach 1
5.2 Approach 2
6 Experiments
6.1 Experimental Setup
6.2 Experimental Results
7 Analysis
7.1 Impact of visual dimension
7.2 The impact of visual grounding
7.3 Visualization of alignment between tokens and objects
7.4 Evaluation on Pre-training Tasks
7.5 Visualization of token-level scoring on first approach
8 Application
8.1 Django
8.2 System description
8.3 Mockup
8.4 Demo results
9 Conclusion
9.1 What have we done?
9.2 Future direction
References
List of Tables
4.1 Statistics of some common datasets used in visually grounded language learning tasks.
6.1 Task descriptions and statistics.
6.2 Downstream task results of BERT and our GroundedBERT. We conduct the experiments on the BERT-base and BERT-large architectures. MRPC and QQP results are F1 scores, STS-B results are Pearson correlations, and SQuAD v1.1 and SQuAD v2.0 results are exact match and F1 score, respectively. Results that outperform the other model are marked in bold, and all results are scaled to the range 0-100. The ∆base and ∆large columns show the difference between our model and the baseline.
6.3 Downstream task results of BERT, V&L pretrained models, and our ObjectGroundedBERT (OGBERT). We conduct the experiments on the BERT-base architecture. MRPC and QQP results are F1 scores, STS-B results are Pearson correlations, and SQuAD v1.1 and SQuAD v2.0 results are exact match and F1 score, respectively. Results that outperform the other models are marked in bold, and all results are scaled to the range 0-100. The ∆base column shows the difference between our model and the baseline.
7.1 Downstream task results of our ObjectGroundedBERT with different dimensions of the visual embedding. The metrics and results are set up similarly to Table 6.3.
7.2 Downstream task results and comparison of our ObjectGroundedBERT without training the Text-ground-image Module. The metrics and results are set up similarly to Table 6.3. The first four rows report the fine-tuned results of our model without training on the visually grounded datasets; the last four rows show the difference from the results reported in Table 7.1.
7.3 Downstream task results on different pretraining tasks.
List of Figures
1.1 Wikipedia, BookCorpus datasets
1.2 Wikipedia, BookCorpus datasets with additional visual datasets
2.1 A simple neural network.
2.2 MLP with 3 layers.
2.3 Visualization of max pooling
2.4 LeNet architecture
2.5 AlexNet architecture
2.6 VGG16 architecture
2.7 Inception cell
2.8 Inception architecture
2.9 ResNets block
2.10 ResNets architecture
2.11 R-CNN
2.12 Fast R-CNN
2.13 Faster R-CNN
2.14 Recurrent Neural Network
2.15 RNN calculation in one node.
2.16 RNN architectures
2.17 Encoder-Decoder
2.18 Encoder-Decoder with Attention
2.19 Transformer: attention is all you need
2.20 Transformer calculation
2.21 Transformer calculation
2.22 Transformer calculation
2.23 Transformer calculation
2.24 Transformer multihead attention
2.25 BERT model
3.1 IMAGINET architecture from Collell et al. (2017)
3.2 Cap2Both is the combination of Cap2Cap and Cap2Img; the figure is from Kiela et al. (2018)
3.3 Architecture from Bordes et al. (2019)
3.4 Illustration of the BERT transformer model trained with a visually-supervised language model with two objectives: masked language model (on the left) and voken classification (on the right). The figure is from Tan & Bansal (2020)
4.1 A grounded language learning example.
5.1
5.2 Implementation of our ObjectGroundedBERT. The model consists of two components, i.e., the Language encoder and the Text-ground-image part. The new representation of the language model combines the Textual embedding and the Visual embedding.
5.3 Implementation of our pretraining framework for ObjectGroundedBERT. The model consists of four components, i.e., the Object detection model (Faster-RCNN), the Object encoder, the Cross Modal Transformer, and ObjectGroundedBERT. The detail of the Cross Modal Transformer layer is shown on the right. The pretraining tasks are Masked Visual Feature Prediction, Image-Text Matching, and Masked Language Modeling.
6.1 MSCOCO Dataset
6.2 GLUE Benchmark
7.1 Illustration of the attention map of the Cross modal layers.
7.2 First visualization of token-level scoring. This example uses the caption "there is a clean bathroom counter and sink" with 3 images.
7.3 Second visualization of token-level scoring. This example uses the caption "bicycle parked in the grass by a tree" with 3 images.
8.1 Django framework
8.2 Architecture diagram
8.3 Activity diagram
8.4 Home page
8.5 Model page - result page
8.6 Input of CoLA task at home page
8.7 Result of CoLA task at home page
8.8 Input of MNLI task at home page
8.9 Result of MNLI task at home page
Chapter 1
Introduction
1.1 Motivation
Humans learn language through listening, speaking, reading, writing, and multimodal
interactions with the real world. From a young age, a child is taught to listen
and speak, is taught gestures, and learns through pictures; people do not learn
language only from text-only resources or books, but from a combination of images,
stories, and descriptive sentences. Today's language models, by contrast, mostly
do not learn from real-world signals and are trained only on pure-language
(text-only) data sources such as Wikipedia and BookCorpus (Figure 1.1).
Figure 1.1: Wikipedia, BookCorpus datasets
1.2 Our contribution
Previous studies of visually grounded language learning train the language encoder
with both a language objective and visual grounding examples. However, due to the
differences in distribution and scale between the visually grounded datasets and
language corpora, the language model tends to mix up the context of the tokens
that occur in the grounded data with those that do not. As a result, there is
confusion between the visual information and the contextual meaning of text during
embedding training. To overcome this limitation, we propose GroundedBERT, a
grounded language learning method that enhances the BERT representation with
visually grounded information. GroundedBERT comprises two components: (i) a
text-ground-image part that captures the global and local semantic mapping between
the visual and textual representations, learned via sentence-level and token-level
mechanisms, and (ii) the original BERT embedding, which captures the contextual
representation of words learned from textual language corpora. Our proposed method
significantly outperforms the baseline language models on various language tasks
of the GLUE and SQuAD datasets.
Moreover, these studies use a convolutional neural network (CNN) to extract
features from the whole image for grounding with the sentence description. However,
this approach has two main drawbacks: (i) the whole image usually contains more
objects and background than the sentence describes, so matching them together
confuses the grounded model; (ii) the CNN only extracts the features of the image
but not the relationships between the objects inside it, which limits the grounded
model's ability to learn complicated contexts. To overcome these shortcomings, we
propose a novel object-level grounded language learning framework that empowers the
language representation with visual object-grounded information. The framework
comprises two main components: (i) ObjectGroundedBERT, which captures visual-object
relations and textual descriptions through cross-modal pretraining via a
text-ground-image mechanism, and (ii) a Cross-modal Transformer, which helps the
object encoder and ObjectGroundedBERT learn the alignment and representation of
the image-text context.
Experimental results show that our proposed framework ObjectGroundedBERT
and GroundedBERT consistently outperform the baseline language models on
various language tasks of GLUE and SQuAD datasets.
Our works have been submitted to two conferences: EMNLP 2021 (https://2021.emnlp.org/) and NeurIPS 2022 (https://nips.cc/).
Figure 1.2: Wikipedia, BookCorpus datasets with additional visual datasets
1.3 The Scope of the Thesis
This thesis covers the following:
• We propose GroundedBERT, a grounded language learning approach that
enhances the BERT representation with visually grounded information. Instead
of grounding visual information directly in the language model, which changes
the original contextual representation, the visually grounded representation
is first learned from text-image pairs and then joined with the contextual
representation to form a unified visual-textual representation. Moreover, with
ObjectGroundedBERT, to the best of our knowledge, this study is the first to
investigate grounded language learning at the object level, using rich
grounded information that contains object features, attributes, and positions.
By doing so, we enhance the ability of the grounded language model to capture
more complex relations and avoid confusion during the learning process.
• To this end, we introduce a Text-ground-image module that captures both
global and local semantics between the contextual relations of words and the
image via a novel token-level and sentence-level learning mechanism. We also
propose a novel grounded language framework that enhances the language
representation with visual object-grounded information. Instead of using a CNN
to encode the whole image, we embed the features of objects from an
off-the-shelf object detector into the encoder and connect them with the
language modality via a cross-modal Transformer. A Text-ground-image mechanism
is also proposed to capture the visual object information and the relations
among objects, derived from the semantic correlation of words and image, via a
multi-task pretraining strategy.
• We conduct extensive experiments on various downstream language tasks in
the GLUE and SQuAD datasets, and our models significantly outperform the
baselines on these tasks.
• We build a demo app for the CoLA and MNLI tasks using our GroundedBERT.
1.4 Organization of the Thesis
• In Chapter 2, we provide some basic math and machine learning concepts
that may help the reader understand the rest of the text.
• In Chapter 3, we summarize prior research in the area in a consistent way.
• In Chapter 4, we briefly describe the problems of previous works and the
motivation for our novel idea.
• In Chapter 5, we describe the details of our proposed models and training
strategies.
• In Chapter 6, we present our results on many downstream tasks, along with
the implementation and hyper-parameter configuration.
• In Chapter 7, we further analyze our work from several aspects to show its
effectiveness and visualize the results of some examples.
• In Chapter 8, we describe and illustrate the application that demonstrates
our work on downstream tasks.
• Finally, in Chapter 9, we conclude by discussing what we have done in the
Pre-Thesis and the Thesis, as well as the future direction of our work.
Chapter 2
Foundations
In this chapter, I present the background knowledge used in this thesis, including
commonly used concepts in deep learning and natural language processing, popular
network architectures in image processing, and the models used for language.
2.1 Neural Network - Multilayer perceptron
2.1.1 Logistic regression
Logistic regression is a binary classification method built on a function (the
sigmoid function) that takes any real value and returns a number between 0 and 1,
which is interpreted as the probability of an event occurring given the input
data. This is equivalent to a single-layer neural network. Figure 2.1 illustrates
the use of logistic regression as a simple neural network.
In Figure 2.1, the vector x contains the attributes (features) of an input,
the vector w contains the weights of those attributes, and b is the bias. After
the calculation steps, the result ŷ is the probability that the label y has the
value 1 given x and w:
$$P(y = 1 \mid w, x) = \hat{y} \tag{2.1}$$
2.1.2 Multilayer perceptron
A multilayer perceptron (MLP) is a multi-layer neural network, usually consisting
of an input layer, one or more hidden layers, and an output layer.
Figure 2.1: A simple neural network.
Figure 2.2 illustrates an MLP with 3 layers (input, hidden, output).
Figure 2.2: MLP with 3 layers.
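To make the two models of this section concrete, the following is a minimal Python sketch of their forward passes. It assumes the usual formulation in which each unit computes a weighted sum of its inputs plus a bias and passes it through the sigmoid function; the helper names (`dense`, `logistic_regression`, `mlp_forward`) and all parameter values are hypothetical and chosen only for illustration, not taken from the thesis.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases):
    """One fully connected layer of sigmoid units: sigmoid(W x + b).

    `weights` holds one row of input weights per output unit.
    """
    return [
        sigmoid(sum(w_i * x_i for w_i, x_i in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def logistic_regression(x, w, b):
    """A single sigmoid unit: y_hat = P(y = 1 | w, x)."""
    return dense(x, [w], [b])[0]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = dense(x, hidden_w, hidden_b)
    return dense(hidden, out_w, out_b)


# Hand-picked illustrative parameters (assumptions, not trained values).
x = [1.0, 2.0]
y_hat = logistic_regression(x, w=[0.5, -0.25], b=0.0)  # weighted sum is 0
mlp_out = mlp_forward(
    x,
    hidden_w=[[0.1, 0.2], [-0.3, 0.4], [0.5, -0.5]],  # 3 hidden units
    hidden_b=[0.0, 0.0, 0.0],
    out_w=[[1.0, -1.0, 0.5]],                          # 1 output unit
    out_b=[0.0],
)
print(y_hat, mlp_out)
```

In this sketch, logistic regression is simply the special case of a network with no hidden layer and a single output unit, which is the correspondence Figure 2.1 describes.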
2.1.3 Activation functions
2.1.3.1 Sigmoid function
Equation:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.2}$$
The sigmoid function takes a real value x and returns a value in the range
(0, 1). If x is a negative number of very large magnitude, the result of the
sigmoid function approaches 0 asymptotically; conversely, if x is a very large
positive number, the result approaches 1.
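As a quick numerical check of the limiting behaviour described above, the following Python sketch evaluates the sigmoid at a few points. Splitting the computation into two branches is an implementation choice made here (an assumption, not something the thesis specifies) to avoid floating-point overflow for inputs of large magnitude.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable sigmoid: avoids overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so exp(x) is below 1 and cannot overflow
    return e / (1.0 + e)


for x in (-50.0, -5.0, 0.0, 5.0, 50.0):
    print(f"sigmoid({x:6.1f}) = {sigmoid(x):.6g}")
```

Mathematically the output never reaches 0 or 1, but in floating-point arithmetic it rounds to exactly those values once |x| is large enough.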
2.1.3.2 Tanh function
Equation:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2.3}$$
The tanh function takes a real number and returns a value in the range (-1, 1).
The tanh function can also be represented by the sigmoid function as follows:
$$\tanh(x) = 2\sigma(2x) - 1 \tag{2.4}$$
2.1.3.3 ReLU function
Equation:
$$f(x) = \max(0, x) \tag{2.5}$$
Compared with sigmoid and tanh, the ReLU function does not saturate for positive
inputs, so it does not suffer from the vanishing gradient problem there. The ReLU
function is also faster to compute than the previous two functions. However, ReLU
has a drawback: for any x less than 0, the output of the ReLU function is 0. A
node whose value is set to 0 contributes nothing to the next layer, and the
corresponding weights of that node are not updated by the gradient. This
phenomenon is called Dying ReLU.
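The relationship between tanh and sigmoid, and the gradient behaviour that distinguishes ReLU, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and taking the ReLU derivative at exactly x = 0 to be 0 is a convention assumed here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh_via_sigmoid(x: float) -> float:
    """tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x = 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0


xs = [-3.0, -0.5, 0.0, 0.5, 3.0]
# Largest deviation between the library tanh and the sigmoid-based form.
identity_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in xs)
print("max |tanh(x) - (2*sigmoid(2x) - 1)| =", identity_gap)

# Sigmoid saturates for large inputs; ReLU keeps a gradient of 1 there.
print("sigmoid_grad(10) =", sigmoid_grad(10.0), " relu_grad(10) =", relu_grad(10.0))
# For negative inputs ReLU outputs 0 and passes no gradient (Dying ReLU).
print("relu(-2) =", relu(-2.0), " relu_grad(-2) =", relu_grad(-2.0))
```

The zero gradient for negative inputs is what prevents the weights feeding a "dead" unit from being updated.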
2.2 Convolutional Neural Network
2.2.1 Introduction
Depending on whether we are handling black-and-white or color images, each pixel
location may be associated with either one or multiple numerical values,
respectively. Until now, our way of dealing with this rich structure was deeply
unsatisfying. We simply discarded each image's spatial structure by flattening it
into a one-dimensional vector and feeding it through a fully connected MLP.
Because these networks are invariant to the order of the features, we could get
similar results regardless of whether we preserve an order corresponding to the
spatial structure of the pixels or permute the columns of our design matrix before
fitting the MLP's parameters. Ideally, we would leverage our prior knowledge that
nearby pixels are typically related to each other to build efficient models for
learning from image data.
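The order-invariance claim can be illustrated with a small Python sketch. It shows that, for a fully connected layer, shuffling the input features and shuffling the weight columns in the same way yields the same output, so a fully connected model has no built-in notion of which pixels are neighbours. The tiny "image", the random weights, and the helper name `linear` are assumptions made purely for illustration.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer without activation: W x + b."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


random.seed(0)
n_pixels, n_out = 6, 3

# A flattened 2x3 "image" and arbitrary layer parameters (illustrative only).
image = [random.random() for _ in range(n_pixels)]
weights = [[random.uniform(-1, 1) for _ in range(n_pixels)] for _ in range(n_out)]
bias = [0.1] * n_out

# Apply the same random permutation to the pixels and to the weight columns.
perm = list(range(n_pixels))
random.shuffle(perm)
image_perm = [image[j] for j in perm]
weights_perm = [[row[j] for j in perm] for row in weights]

out = linear(image, weights, bias)
out_perm = linear(image_perm, weights_perm, bias)
max_diff = max(abs(a - b) for a, b in zip(out, out_perm))
print("largest output difference:", max_diff)
```

Because every set of weights has an exact counterpart for the permuted input, training on permuted pixels can reach the same solutions, which is why the spatial layout carries no information for such a model.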
This section presents convolutional neural networks (CNNs), a powerful family of
neural networks designed for precisely this purpose. CNN-based models are now
ubiquitous in the field of computer vision and have become so dominant that hardly
anyone today would develop a commercial application or enter a competition related
to image recognition, object detection, or semantic segmentation without building
on this approach.
Modern CNNs, as they are called colloquially, owe their design to inspirations
from biology, group theory, and a healthy dose of experimental tinkering. In
addition to their sample efficiency in achieving accurate models, CNNs tend to be
computationally efficient, both because they require fewer parameters than fully
connected architectures and because convolutions are easy to parallelize across
GPU cores. Consequently, practitioners often apply CNNs whenever possible, and
increasingly they have emerged as credible competitors even on tasks with a
one-dimensional sequence structure, such as audio, text, and time series analysis,
where recurrent neural networks are conventionally used. Some clever adaptations
of CNNs have also brought them to bear on graph-structured data and in recommender
systems.
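To give a rough sense of the parameter savings mentioned above, the following Python sketch counts the learnable parameters of a fully connected layer and of a convolutional layer that map the same input to an output of the same size. The image size, channel counts, and kernel size are arbitrary assumptions chosen for illustration, and the helper names are hypothetical.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a 2D convolutional layer with square kernels.

    Each filter has in_channels * kernel_size^2 weights and one bias,
    and the same filter is reused at every spatial position.
    """
    return out_channels * (in_channels * kernel_size * kernel_size) + out_channels


# Illustrative setting: a 32x32 RGB image mapped to 16 feature maps of 32x32.
height = width = 32
c_in, c_out, k = 3, 16, 3

dense_count = dense_params(height * width * c_in, height * width * c_out)
conv_count = conv2d_params(c_in, c_out, k)

print("fully connected:", dense_count)
print("convolutional:  ", conv_count)
```

The convolutional count does not depend on the image size at all, because the filters are shared across positions.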
2.2.2 Detail
The convolutional layer is the core component of a CNN and performs its most
important operations. The concepts and parameters for setting up a convolutional
layer include the filter size (F), the input size (W), the stride (S), i.e., the
displacement of the filter, the zero-padding (P), and the depth, i.e., the number
of filters we use.
• Depth of the output volume is equal to the number of filters we decide to
use. Each of these filters learns to detect a different characteristic of the
input. For example, with an image input, each filter in the first convolutional
layer learns to detect edges and corners with different orientations, etc. We
call the set of neurons connected to the same region of the input image a
depth column.
• Stride (displacement) is the step with which the filter moves over the
input. For example, with a stride of 1, we shift the filter 1 pixel at a time.
With a stride of 2, we shift the filter by 2 pixels at a time.
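The parameters listed above can be tied together with a short Python sketch. It assumes the commonly used output-size relation floor((W - F + 2P) / S) + 1, which is not stated in this section, and implements a naive single-channel convolution (computed as cross-correlation, as is usual in deep learning libraries). All function names and the example values are hypothetical.

```python
def conv_output_size(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """Output width for input size W, filter size F, stride S, padding P."""
    if f > w + 2 * p:
        raise ValueError("filter is larger than the padded input")
    return (w - f + 2 * p) // s + 1


def conv2d_single(image, kernel, stride=1, padding=0):
    """Naive single-channel convolution with a square kernel."""
    h, w, f = len(image), len(image[0]), len(kernel)
    # Zero-pad the image on all four sides.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    out_h = conv_output_size(h, f, stride, padding)
    out_w = conv_output_size(w, f, stride, padding)
    out = []
    for oi in range(out_h):
        row = []
        for oj in range(out_w):
            acc = 0.0
            for ki in range(f):
                for kj in range(f):
                    acc += padded[oi * stride + ki][oj * stride + kj] * kernel[ki][kj]
            row.append(acc)
        out.append(row)
    return out


def conv_layer(image, kernels, stride=1, padding=0):
    """Apply several filters; the output depth equals the number of filters."""
    return [conv2d_single(image, k, stride, padding) for k in kernels]


image = [[1.0] * 4 for _ in range(4)]                # 4x4 input of ones
kernels = [
    [[1.0, 1.0], [1.0, 1.0]],                        # sums each 2x2 window
    [[-1.0, 1.0], [-1.0, 1.0]],                      # responds to vertical edges
]
feature_maps = conv_layer(image, kernels, stride=2)  # depth 2, each map 2x2
print(len(feature_maps), feature_maps[0], feature_maps[1])
```

With a stride of 2 and no padding, the 2x2 filters visit non-overlapping windows of the 4x4 input, so each feature map is 2x2; the second filter returns zeros because a constant image contains no edges.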