Grounded Language Learning: Improve Text Representation with Visual Information

Vietnam National University - Ho Chi Minh City
Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering

GRADUATE THESIS

GROUNDED LANGUAGE LEARNING: IMPROVE TEXT REPRESENTATION WITH VISUAL INFORMATION

Major: Computer Science
Council: Computer Science 11
Supervisor: Assoc. Prof. Quan Thanh Tho
Reviewer: Mr. Le Dinh Thuan
Student: Nguyen Tran Cong Duy (1710043)

Assurance

I hereby declare that, except for the results referenced from other related works as specified in the thesis, the contents presented in this thesis are my own work, and no part of the content has been submitted for a degree at another school.

Ho Chi Minh City, July 11, 2021

Acknowledgement

First and foremost, I am deeply grateful to my supervisor, Assoc. Prof. Quan Thanh Tho, for his consistent help and guidance throughout my thesis, and for the opportunity to study and do research that he gave me. Second, I thank my parents and my best friends for their constant encouragement, support, and care.

Abstract

People learn languages through listening, speaking, reading, writing, and multimodal interaction with the real world. From a young age, children are taught to listen, to speak, and to gesture, and they learn through pictures. They do not learn language only from word-only resources such as books, but from a combination of images, stories, and descriptive sentences. Today's language models, in contrast, mostly do not learn from real-world signals; they are trained on purely linguistic data sources. A few systems have combined language with other modalities in applications such as robotics and vision-and-language tasks, with positive results, but this remains a major challenge in machine learning and deep learning today.

Contents

List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Our contribution
  1.3 The Scope of the Thesis
  1.4 Organization of the Thesis
2 Foundations
  2.1 Neural Network - Multilayer perceptron
  2.2 Convolutional Neural Network
  2.3 Object Detection
  2.4 Recurrent Neural Network
  2.5 Transformer
  2.6 BERT
3 Related work
  3.1 Overall
  3.2 Grounded Language Learning Approaches
4 Motivation
  4.1 Motivation
  4.2 Propose
5 Methodology
  5.1 Approach 1
  5.2 Approach 2
6 Experiments
  6.1 Experimental Setup
  6.2 Experimental Results
7 Analysis
  7.1 Impact of visual dimension
  7.2 The impact of visual grounding
  7.3 Visualization of alignment between tokens and objects
  7.4 Evaluation on Pre-training Tasks
  7.5 Visualization of token-level scoring on first approach
8 Application
  8.1 Django
  8.2 System description
  8.3 Mockup
  8.4 Demo results
9 Conclusion
  9.1 What have we done?
  9.2 Future direction
References

List of Tables

4.1 Statistics of some common datasets used in the visual grounded language learning task.
6.1 Task descriptions and statistics.
6.2 Downstream task results of BERT and our GroundedBERT. We conduct the experiments on the BERT-base and BERT-large architectures. MRPC and QQP results are F1 scores, STS-B results are Pearson correlations, and SQuAD v1.1 and SQuAD v2.0 results are exact match and F1 score respectively. All results are scaled to the range 0-100, and the better result is marked in bold. The ∆base and ∆large columns show the difference between our model and the baseline.
6.3 Downstream task results of BERT, V&L pretrained models, and our ObjectGroundedBERT (OGBERT). We conduct the experiments on the BERT-base architecture. MRPC and QQP results are F1 scores, STS-B results are Pearson correlations, and SQuAD v1.1 and SQuAD v2.0 results are exact match and F1 score respectively. All results are scaled to the range 0-100, and the better result is marked in bold. The ∆base column shows the difference between our model and the baseline.
7.1 Downstream task results of our ObjectGroundedBERT with different dimensions of the visual embedding. The metrics and results are set up as in Table 6.3.
7.2 Downstream task results and comparison of our ObjectGroundedBERT without training the Text-ground-image Module. The metrics and results are set up as in Table 6.3. The first four rows report the fine-tuned results of our model without training on the visual grounded datasets; the last four rows show the difference from the results reported in Table 7.1.
7.3 Downstream task results on different pretraining tasks.

List of Figures

1.1 Wikipedia, BookCorpus datasets
1.2 Wikipedia, BookCorpus datasets with other visual datasets
2.1 A simple neural network.
2.2 MLP with 3 layers.
2.3 Visualization of max pooling
2.4 LeNet architecture
2.5 AlexNet architecture
2.6 VGG16 architecture
2.7 Inception cell
2.8 Inception architecture
2.9 ResNets block
2.10 ResNets architecture
2.11 R-CNN
2.12 Fast R-CNN
2.13 Faster R-CNN
2.14 Recurrent Neural Network
2.15 RNN calculation in one node.
2.16 RNN architectures
2.17 Encoder-Decoder
2.18 Encoder-Decoder with Attention
2.19 Transformer: attention is all you need
2.20 Transformer calculation
2.21 Transformer calculation
2.22 Transformer calculation
2.23 Transformer calculation
2.24 Transformer multihead attention
2.25 BERT model
3.1 IMAGINET architecture, from Collell et al. (2017)
3.2 Cap2Both is the combination of Cap2Cap and Cap2Img; the figure is from Kiela et al. (2018)
3.3 Architecture from Bordes et al. (2019)
3.4 Illustration of the BERT transformer model trained with a visually-supervised language model with two objectives: masked language model (on the left) and voken classification (on the right). The figure is from Tan & Bansal (2020)
4.1 A grounded language learning example.
5.1
5.2 Implementation of our ObjectGroundedBERT. The model consists of two components, i.e. a Language encoder and a Text-ground-image part. The new representation of the language model combines the Textual embedding and the Visual embedding.
5.3 Implementation of our pretraining framework for ObjectGroundedBERT.
    The model consists of four components: an object detection model (Faster R-CNN), an Object encoder, a Cross-modal Transformer, and ObjectGroundedBERT. The detail of the Cross-modal Transformer layer is shown on the right. The pretraining tasks are Masked Visual Feature Prediction, Image-Text Matching, and Masked Language Modeling.
6.1 MSCOCO Dataset
6.2 GLUE Benchmark
7.1 Illustration of the attention map of the cross-modal layers.
7.2 First visualization of token-level scoring. This example uses the caption "there is a clean bathroom counter and sink" with 3 images.
7.3 Second visualization of token-level scoring. This example uses the caption "bicycle parked in the grass by a tree" with 3 images.
8.1 Django framework
8.2 Architecture diagram
8.3 Activity diagram
8.4 Home page
8.5 Model page - result page
8.6 Input of CoLA task at home page
8.7 Result of CoLA task at home page
8.8 Input of MNLI task at home page
8.9 Result of MNLI task at home page

Chapter 1

Introduction

1.1 Motivation

Humans learn language through listening, speaking, reading, writing, and multimodal interaction with the real world. From a young age, children are taught to listen, to speak, and to gesture, and they learn through pictures. They do not learn language only from word-only resources such as books, but from a combination of images, stories, and descriptive sentences. Today's language models, in contrast, mostly do not learn from real-world signals; they are trained only on pure-language (text-only) data sources such as Wikipedia and BookCorpus (Figure 1.1).

Figure 1.1: Wikipedia, BookCorpus datasets

1.2 Our contribution

Previous studies of visual grounded language learning train the language encoder with both a language objective and visual grounding examples. However, because the visual-grounded datasets and the language corpora differ in distribution and scale, the language model tends to mix up the context of tokens that occur in the grounded data with the context of tokens that do not. As a result, visual information and the contextual meaning of text are confused during embedding training. To overcome this limitation, we propose GroundedBERT, a grounded language learning method that enhances the BERT representation with visual grounded information. GroundedBERT comprises two components: (i) a text-ground-image part, which captures the global and local semantic mapping between the visual and textual representations, learned via sentence-level and token-level mechanisms, and (ii) the original BERT embedding, which captures the contextual representation of words learned from textual language corpora. Our proposed method significantly outperforms the baseline language models on various language tasks from the GLUE and SQuAD datasets. Moreover, these previous studies use a convolutional neural network (CNN) to extract features from the whole image for grounding with the sentence description.
However, this approach has two main drawbacks: (i) the whole image usually contains more objects and background than the sentence describes, so matching the two can confuse the grounded model; and (ii) a CNN extracts only the features of the image, not the relationships between the objects in it, which limits the grounded model's ability to learn complicated contexts. To overcome these shortcomings, we propose a novel object-level grounded language learning framework that enriches the language representation with visual object-grounded information. The framework comprises two main components: (i) ObjectGroundedBERT, which captures visual-object relations and textual representations through cross-modal pretraining via a text-ground-image mechanism, and (ii) a Cross-modal Transformer, which helps the object encoder and ObjectGroundedBERT learn the alignment and representation of the image-text context. Experimental results show that our proposed frameworks, ObjectGroundedBERT and GroundedBERT, consistently outperform the baseline language models on various language tasks from the GLUE and SQuAD datasets. Our works have been submitted to two conferences: EMNLP 2021 (https://2021.emnlp.org/) and NeurIPS 2022 (https://nips.cc/).

Figure 1.2: Wikipedia, BookCorpus datasets with other visual datasets

1.3 The Scope of the Thesis

This thesis covers the following:

• We propose GroundedBERT - a grounded language learning approach that enhances the BERT representation with visual-grounded information. Instead of grounding visual information directly into the language model, which changes the original contextual representation, the visual-grounded representation is first learned from text-image pairs and then joined to the contextual representation to form a unified visual-textual representation.
  Moreover, with ObjectGroundedBERT, this study is, to the best of our knowledge, the first to investigate grounded language learning at the object level, using rich grounded information that contains object features, attributes, and positions. By doing so, we enhance the ability of grounded language models to capture more complex relations and avoid confusion during the learning process.
• To this end, we introduce a Text-ground-image module that captures both global and local semantics between the contextual relations of words and the image via a novel token-level and sentence-level learning mechanism. We also propose a novel grounded language framework that enhances the language representation with visual-object-grounded information. Instead of using a CNN to encode the whole image, we embed the features of objects from an off-the-shelf object detector into the encoder and connect them with the language modality via a cross-modal Transformer. A Text-ground-image mechanism is also proposed to capture the visual object information and the object relations found from the semantic correlation of words and image via a multi-task pretraining strategy.
• We conduct extensive experiments on various downstream language tasks from the GLUE and SQuAD datasets, and our models significantly outperform the baselines on these tasks.
• We build a demo app for the CoLA and MNLI tasks using our GroundedBERT.

1.4 Organization of the Thesis

• In chapter 2, we provide some basic math and machine learning concepts that may help the reader understand the rest of the text.
• In chapter 3, we summarize prior research in the area in a consistent way.
• In chapter 4, we briefly describe the problems of previous works and the motivation for our novel idea.
• In chapter 5, we describe our proposed models and training strategies in detail.
• In chapter 6, we show our results on many downstream tasks, together with the implementation details and the hyperparameter configuration.
• In chapter 7, we further analyze our work from several aspects to show its effectiveness, and we visualize some results on our examples.
• In chapter 8, we describe and illustrate the application that demonstrates our work on downstream tasks.
• Finally, in chapter 9, we conclude by discussing what we have done in the Pre-Thesis and Thesis stages and the future direction of our work.

Chapter 2

Foundations

In this chapter, I present the background knowledge used in this thesis, including commonly used concepts in deep learning and natural language processing, popular network architectures in image processing, and the models used for language.

2.1 Neural Network - Multilayer perceptron

2.1.1 Logistic regression

Logistic regression is a binary classification method built on a function (the sigmoid function) that takes any real value and returns a number between 0 and 1, which is interpreted as the probability of the event given the input data. This is equivalent to a single-layer neural network. Figure 2.1 illustrates logistic regression as a simple neural network. In Figure 2.1, the vector x contains the attributes (features) of an input, the vector w contains the weights of those attributes, and b is the bias. After the calculation steps, the result ŷ is the probability that the label y has the value 1 given x and w:

$$ P(y = 1 \mid w, x) = \hat{y} \tag{2.1} $$

Figure 2.1: A simple neural network.

2.1.2 Multilayer perceptron

A multilayer perceptron (MLP) is a multi-layer neural network, usually consisting of an input layer, one or more hidden layers, and an output layer. Figure 2.2 illustrates an MLP with 3 layers (input, hidden, output).

Figure 2.2: MLP with 3 layers.

2.1.3 Activation functions

2.1.3.1 Sigmoid function

Equation:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.2} $$

The sigmoid function takes a real value x and returns a value in the range (0, 1). As x becomes a very large negative number, σ(x) approaches 0; conversely, as x becomes a very large positive number, σ(x) approaches 1.

2.1.3.2 Tanh function

Equation:

$$ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2.3} $$

The tanh function takes a real number and returns a value in the range (-1, 1). The tanh function can also be expressed in terms of the sigmoid function as follows:

$$ \tanh(x) = 2\sigma(2x) - 1 \tag{2.4} $$

2.1.3.3 ReLU function

Equation:

$$ f(x) = \max(0, x) \tag{2.5} $$

Compared with sigmoid and tanh, the ReLU function does not saturate for positive inputs, so it does not suffer from the vanishing gradient problem in the same way. It is also faster to compute than the previous two functions. However, ReLU has a drawback: for any x less than 0, the output is 0. A node whose value becomes 0 contributes nothing to the next layer, and the weights feeding that node receive zero gradient and are not updated. This phenomenon is called Dying ReLU.

2.2 Convolutional Neural Network

2.2.1 Introduction

Depending on whether we are handling black-and-white or color images, each pixel location is associated with either one or several numerical values, respectively. Until now, our way of dealing with this rich structure has been deeply unsatisfying: we simply discarded each image's spatial structure by flattening it into a one-dimensional vector and feeding that vector through a fully connected MLP. Because such networks are invariant to the order of the features, we would get similar results whether we preserved an order corresponding to the spatial structure of the pixels or permuted the columns of our design matrix before fitting the MLP's parameters. Ideally, we would use our prior knowledge that nearby pixels are typically related to one another to build efficient models for learning from image data.
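
The order-invariance claim above can be checked numerically. The following is a minimal sketch, not part of the thesis implementation: the layer sizes, the random parameters, and names such as `dense_layer` are illustrative assumptions. It flattens a toy image, shuffles the pixel order, and reorders the weights of a fully connected layer in the same way. The layer output stays the same, which is why an MLP gains nothing from the spatial arrangement of the pixels.

```python
import math
import random

random.seed(0)

def dense_layer(x, weights, bias):
    """Fully connected layer followed by ReLU: max(0, W x + b)."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

# Toy 4x4 grayscale "image", flattened row by row into 16 features.
image = [[random.random() for _ in range(4)] for _ in range(4)]
x = [pixel for row in image for pixel in row]

# Hypothetical layer with 8 hidden units and random parameters.
weights = [[random.gauss(0.0, 1.0) for _ in x] for _ in range(8)]
bias = [random.gauss(0.0, 1.0) for _ in range(8)]

# Shuffle the pixel order and reorder every weight row the same way.
perm = list(range(len(x)))
random.shuffle(perm)
x_shuffled = [x[i] for i in perm]
weights_shuffled = [[row[i] for i in perm] for row in weights]

original = dense_layer(x, weights, bias)
shuffled = dense_layer(x_shuffled, weights_shuffled, bias)

# Compare with a tolerance, since the summation order differs.
outputs_match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(original, shuffled)
)
print(outputs_match)
```

A convolutional layer does not have this property: its filters only combine neighbouring pixels, so shuffling the pixels changes what each filter sees.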
This section introduces convolutional neural networks (CNNs), a powerful family of neural networks designed for precisely this purpose. CNN-based models are now ubiquitous in the field of computer vision and have become so dominant that hardly anyone today would develop a commercial application or enter a competition related to image recognition, object detection, or semantic segmentation without building on this approach. Modern CNNs, as they are called colloquially, owe their design to inspiration from biology, group theory, and a healthy dose of experimental tinkering. In addition to their sample efficiency in achieving accurate models, CNNs tend to be computationally efficient, both because they require fewer parameters than fully connected architectures and because convolutions are easy to parallelize across GPU cores. Consequently, practitioners often apply CNNs whenever possible, and CNNs have increasingly emerged as credible competitors even on tasks with a one-dimensional sequence structure, such as audio, text, and time series analysis, where recurrent neural networks are conventionally used. Some clever adaptations of CNNs have also brought them to bear on graph-structured data and on recommender systems.

2.2.2 Detail

The convolutional layer is the core component of a CNN and performs its most important operations. The concepts and parameters needed to set up a convolutional layer include the filter size (F), the input size (W), the stride, i.e. the displacement of the filter (S), the zero-padding (P), and the depth, i.e. the number of filters we use.

• Depth of the output volume is equal to the number of filters we decide to use. Each of these filters learns to detect a different characteristic of the input.
  For example, with an image input, each filter in the first convolutional layer learns to detect edges and corners with different orientations, and so on. We call the set of neurons that are all connected to the same region of the input a depth column.
• Stride (displacement) is the step by which the filter moves across the input. For example, with a stride of 1 we shift the filter 1 pixel at a time; with a stride of 2 we shift the filter by 2 pixels at a time.
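
To see how these parameters interact, the sketch below uses the standard output-size relation (W - F + 2P)/S + 1. This relation is not stated in the text above and is added here as background. The function names and the sample sizes are hypothetical, and raising an error when the filter does not tile the input evenly is a choice made for this sketch (many libraries round down instead).

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size for input size w, filter size f, stride s, zero-padding p."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // s + 1

def conv2d_single(image, kernel, stride=1, padding=0):
    """Naive 2D cross-correlation of one square channel with one square filter."""
    w, f = len(image), len(kernel)
    # Zero-pad the image on all four sides.
    size = w + 2 * padding
    padded = [[0.0] * size for _ in range(size)]
    for r in range(w):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out = conv_output_size(w, f, stride, padding)
    result = []
    for i in range(out):
        row = []
        for j in range(out):
            top, left = i * stride, j * stride
            row.append(sum(
                padded[top + a][left + b] * kernel[a][b]
                for a in range(f) for b in range(f)
            ))
        result.append(row)
    return result

# Hypothetical example: 7x7 input and a 3x3 filter of ones.
image = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
kernel = [[1.0] * 3 for _ in range(3)]

same_size = conv2d_single(image, kernel, stride=1, padding=1)  # (7-3+2)/1+1 = 7
strided = conv2d_single(image, kernel, stride=2, padding=0)    # (7-3+0)/2+1 = 3

# The depth of the output volume equals the number of filters applied.
filters = [kernel, kernel, kernel, kernel]
volume = [conv2d_single(image, k, stride=2) for k in filters]
print(len(same_size), len(strided), len(volume))
```

With a stride of 1 and a padding of 1, the 3x3 filter preserves the 7x7 spatial size. A stride of 2 without padding shrinks it to 3x3, and four filters give an output volume of depth 4.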
