VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
GRADUATION THESIS
EFFECTIVELY APPLY VIETNAMESE FOR
VISUAL QUESTION ANSWERING SYSTEM
(Old title: Development of a VQA system)
Major: Computer Science
Council: Software Engineering
Instructor: Dr. Quan Thanh Tho
Reviewer: Mr. Le Dinh Thuan
Authors: Nguyen Bao Phuc (1712674)
         Tran Hoang Nguyen (1712396)
Ho Chi Minh City, July 2021
VIETNAM NATIONAL UNIVERSITY HCMC
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science & Engineering
DEPARTMENT: Computer Science

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION THESIS ASSIGNMENT
(Note: students must attach this sheet to the first page of the thesis report)

FULL NAME: Trần Hoàng Nguyên    Student ID: 1712396
MAJOR: Computer Science         Class: MT17KH03
FULL NAME: Nguyễn Bảo Phúc      Student ID: 1712674
MAJOR: Computer Science         Class: MT17KH04

1. Thesis title:
Development of a Visual Question Answering system

2. Tasks (initial content and data requirements):
- Study the foundational deep learning theory applied in this topic.
- Study the theory of Visual Question Answering.
- Preprocess the data.
- Investigate and apply the bottom-up method for extracting feature vectors from images.
- Study the Co-Attention approach for extracting information based on question content, following published research.
- Build and train a machine learning model on the processed data.
- Analyze and design a complete Visual Question Answering system.
- Build a website: design the interface, develop the front end and back end, and deploy the system.
- Evaluate the system.

3. Thesis assignment date: 01/02/2021
4. Completion date: 01/08/2021
5. Supervisors and their parts:
1) Assoc. Prof. Dr. Quản Thành Thơ
2) __________________________________________________________________________
3) __________________________________________________________________________

The thesis content and requirements have been approved by the Department.
Date: ________________
HEAD OF DEPARTMENT (sign and write full name)
MAIN SUPERVISOR (sign and write full name): Assoc. Prof. Dr. Quản Thành Thơ

SECTION RESERVED FOR THE FACULTY / DEPARTMENT
Preliminary grader: ________________________
Unit: _______________________________________
Defense date: __________________________________
Final grade: _________________________________
Thesis archive location: _____________________________
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date: ____________

THESIS DEFENSE GRADING SHEET
(For the supervisor/reviewer)
1. Students: Trần Hoàng Nguyên, Student ID: 1712396, Major: Computer Science
   Nguyễn Bảo Phúc, Student ID: 1712674, Major: Computer Science
2. Topic: Development of a Visual Question Answering system
3. Supervisor/reviewer: MSc. Lê Đình Thuận
4. Report overview (number of pages, chapters, data tables, figures, references, computation software, artifacts/products): (left blank)
5. Drawings overview (total drawings; A1, A2, other sizes; hand-drawn; computer-drawn): (left blank)
6. Main strengths of the thesis:
- The topic builds an intelligent system that answers questions about image content. The students built a working VQA system and trained the model successfully. The topic also extends model training to support Vietnamese.
- The topic is assessed as difficult, with a large workload, demanding the students' self-study ability to combine many areas of knowledge.
- The thesis results have been written up as a scientific paper for the FAIR 2021 conference. (Note: at the time of review, the conference's decision on the paper was not yet available.)
- The thesis is presented fully and clearly. The students should arrange the demo to highlight the core work of the topic.
7. Main shortcomings of the thesis: (left blank)
8. Recommendation: Approved for defense ☐   Needs additions before defense ☐   Not approved for defense ☐
9. Three questions the students must answer before the committee: (left blank)
10. Overall assessment (in words: excellent, good, average):
Grade: 10/10
Signed (full name): MSc. Lê Đình Thuận
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date: ____________

THESIS DEFENSE GRADING SHEET
(For the supervisor/reviewer)
1. Students: Trần Hoàng Nguyên, Student ID: 1712396, Major: Computer Science
   Nguyễn Bảo Phúc, Student ID: 1712674, Major: Computer Science
2. Topic: Development of a Visual Question Answering system
3. Supervisor/reviewer: Assoc. Prof. Dr. Quản Thành Thơ
4. Report overview (number of pages, chapters, data tables, figures, references, computation software, artifacts/products): (left blank)
5. Drawings overview (total drawings; A1, A2, other sizes; hand-drawn; computer-drawn): (left blank)
6. Main strengths of the thesis:
- The students completed a VQA system as required. They have a firm grasp of the theoretical content and successfully developed and applied the model. They also translated the training dataset into Vietnamese to support answering questions in Vietnamese.
- Part of the thesis work has been written up as a scientific paper and submitted to the FAIR conference.
- The thesis work is also being extended into a collaborative research project with another professor's research group in Taiwan.
- The thesis is written in relatively standard and clear English.
7. Main shortcomings of the thesis: (left blank)
8. Recommendation: Approved for defense ☐   Needs additions before defense ☐   Not approved for defense ☐
9. Three questions the students must answer before the committee: (left blank)
10. Overall assessment (in words: excellent, good, average):
Grade: 9.8/10
Signed (full name): Assoc. Prof. Dr. Quản Thành Thơ
Declaration of Authenticity
We certify that the graduation thesis "Effectively apply Vietnamese for Visual Question Answering system" is the original report of our own research. We completed this graduation thesis honestly and guarantee the truthfulness of our work. We are solely responsible for the precision and reliability of the above information.

Ho Chi Minh City, August 9th, 2021
Acknowledgements
First and foremost, we would like to thank Dr. Quan Thanh Tho, Associate Professor in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), for his support throughout our research work. It has been our great fortune to work on and finish our thesis under his supervision. He is the most knowledgeable and insightful person we have ever met, and he helped us throughout the project with his wisdom and enthusiasm for deep learning. From him, we have learned how to do deep learning research in a critical way and have had the chance to widen our knowledge. He also let us join his research group, URA. This opportunity not only allowed us to get useful suggestions about our thesis from everyone in the group, but also let us learn new things and new skills day by day, for example by joining seminars held by group members. From that, we could develop more interesting ideas for our thesis.

Our sincere thanks also go to Mr. Le Dinh Thuan, Master of Engineering in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, for being our reviewer. His feedback, suggestions, and advice were essential and influential for the completion of our thesis. We are thankful for having such a good reviewer.

Last but not least, we would like to thank all the teachers at Ho Chi Minh City University of Technology, especially in the Faculty of Computer Science and Engineering, where it has been our pleasure and honor to study for the last four years. We also thank our beloved friends and family, who always support us with constant love and encouragement.
AUTHORS
Abstract
In recent years, deep learning has emerged as a promising technology with the hope that it can be designed to tackle practical problems that were considered inconceivable for previous approaches. In particular, blind and visually impaired people are often afraid of being burdensome to their family and friends when they need visual guidance. However, there is still a lack of modern systems that can act as a virtual friend to help them interact with the surrounding environment. Therefore, we research and develop a novel deep learning application that can capture the complex relationships between surrounding objects and deliver assistance to the blind and visually impaired. In this dissertation, we propose a novel visual question answering model for Vietnamese, along with practical systems that utilize our model to address the aforementioned problems.
CONTENTS

List of figures
List of tables

Chapter 1 INTRODUCTION
    1.1 Motivation
    1.2 Topic’s scientific and practical importance
    1.3 Thesis objectives and scope
    1.4 Our contribution
    1.5 Thesis structure

Chapter 2 THEORETICAL OVERVIEW
    2.1 Deep learning neural network
        2.1.1 Perceptron
        2.1.2 Multi layer perceptron
        2.1.3 Activation functions
        2.1.4 Loss functions
        2.1.5 Backpropagation and optimization
    2.2 Computer vision theoretical background
        2.2.1 Convolutional Network
        2.2.2 Pooling
        2.2.3 CNNs variants
        2.2.4 Regional-based Convolutional Neural Networks
    2.3 Natural language processing theoretical background
        2.3.1 Word Embedding
        2.3.2 Recurrent Neural Network (RNN)
        2.3.3 LSTM - Long Short Term Memory
        2.3.4 GRU - Gated Recurrent Network
        2.3.5 Attention mechanism
        2.3.6 Bidirectional Encoder Representations from Transformers
    2.4 Visual and Language tasks related to VQA
        2.4.1 Image Captioning
        2.4.2 Visual Commonsense Reasoning
        2.4.3 Other Visual and Language tasks

Chapter 3 RELATED WORK
    3.1 Overall
    3.2 Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
    3.3 Pythia v0.1: the Winning Entry to the VQA Challenge 2018
    3.4 Deep Modular Co-Attention Networks for Visual Question Answering
    3.5 ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Chapter 4 METHODOLOGY
    4.1 Feature extraction and co-attention method
        4.1.1 Visual feature
        4.1.2 Textual feature
    4.2 Co-attention layer
    4.3 Our proposal model

Chapter 5 VIETNAMESE VQA DATASET
    5.1 VQA-v2 dataset
    5.2 Visual Genome dataset
    5.3 Challenge
    5.4 Automatic data generation
    5.5 Data refinement
    5.6 Statistical Analysis
    5.7 Data sample

Chapter 6 EXPERIMENTAL ANALYSIS
    6.1 Experimental setup
        6.1.1 Computing Resources
        6.1.2 Dataset
        6.1.3 Evaluation metric
        6.1.4 Implementation details
        6.1.5 Training strategy
    6.2 Experimental results

Chapter 7 APPLICATION
    7.1 Technology terminology
        7.1.1 Flask
        7.1.2 ReactJS
        7.1.3 React Native
        7.1.4 Docker
        7.1.5 C4 Model: Describing Software Architecture
    7.2 System functionality
        7.2.1 Web application system
        7.2.2 Mobile application system
    7.3 System diagram
        7.3.1 Overview
        7.3.2 Use case diagram
        7.3.3 Activity diagram
    7.4 System architecture
        7.4.1 System components
        7.4.2 Our result

Chapter 8 CONCLUSION
    8.1 Summary
    8.2 Limitation and broader future works
        8.2.1 Improve existing Vietnamese VQA models
        8.2.2 Give Vietnamese VQA a new direction

References

Appendices
Chapter A FAIR 2021 CONFERENCE PAPER
Chapter B SATU PROJECT

LIST OF FIGURES

1.1 Overview of the Visual Question Answering task.
2.1 Perceptron
2.2 Symbols and calculating process in Multilayer Perceptron
2.3 Multi layer perceptron with two hidden layers
2.4 Convolution operation between 2-D input image and 2-D kernel
2.5 The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers.
2.6 Receptive field of one output unit in CNNs.
2.7 Apply 2 × 2 pooling layer to 6 × 6 input
2.8 AlexNet’s architecture.
2.9 VGG16 (left) and VGG19 (right) architecture.
2.10 Residual function (left) and ResNet-18 architecture.
2.11 The architecture of R-CNN
2.12 The architecture of Fast R-CNN
2.13 The architecture of Faster R-CNN
2.14 Word2Vec overview
2.15 CBOW and Skip-gram Architecture
2.16 Recurrent network architecture
2.17 LSTM architecture
2.18 GRU architecture
2.19 Self-attention
2.20 Multihead-attention
2.21 BERT for Masked LM
2.22 BERT for Next Sentence Prediction
3.1 Typically, attention models operate on CNN features corresponding to a uniform grid of equally-sized image regions (left). The bottom-up approach enables attention to be calculated at the level of objects and other salient image regions (right).
3.2 Bottom-up attention in VQA task.
3.3 The overall flowchart of the deep Modular Co-attention Networks. They also provide two different strategies for deep co-attention learning, namely stacking and encoder-decoder.
3.4 Architecture of the ImageBERT model.
4.1 Our proposed question processor for Vietnamese VQA task
4.2 Architecture of multi-head attention module
4.3 Architecture of self-attention unit (left) and guided-attention unit (right)
4.4 Architecture of our proposed model
5.1 Sample image in VQA-v2 dataset.
5.2 The list of answers that have the most occurrences.
5.3 The list of answers that have the least occurrences.
5.4 Sample image in VQA-v2 dataset.
6.1 Our learning rate is controlled by the Adam optimizer and a warmup scheduler.
6.2 Accuracy and co-attention depth relationship. All of these experiments are on the test set and use the small specs.
6.3 Our loss values on the train and validation sets, covering epochs 1 to 18.
6.4 Our loss values on the train and validation sets, covering epochs 2 to 18.
6.5 The overall accuracies.
6.6 The accuracies of Yes/No questions.
6.7 The accuracies of number questions.
6.8 The accuracies of other questions.
7.1 Components of a C4 model.
7.2 Use case diagram of Vietnamese VQA Web System
7.3 Use case diagram of Vietnamese VQA Mobile System
7.4 Activity diagram of Vietnamese VQA Web System.
7.5 Activity diagram of Vietnamese VQA Mobile System.
7.6 Component level description of our whole system.
7.7 Homepage of our web application.
7.8 Introduction for Vietnamese VQA - Give a VQA example
7.9 Introduction for Vietnamese VQA - What is VQA?
7.10 Choose an image and enter a question in Vietnamese; both are used as input for VQA. In this case, the question we enter is "Ở đây có thứ gì?".
7.11 Top-5 answers generated from our Vietnamese VQA model. In this case, for the above question, the top-5 answers sound quite good. The top-1 answer is "Sách".
7.12 Overview of our VQA web system on mobile. Within the image, the question and the answer generated from VQA are very clear and helpful.
7.13 User can upload a favorite image and then ask a question. The VQA system will respond after a few seconds.
7.14 Our application on a mobile device. User can ask a question about visual information (left) or daily information like datetime, weather, position, ... (right).
B.1 Our application of VQA for predicting the potential of a natural disaster. For example, given the question "what is the overall condition of the given image?", VQA can generate an answer based on the visual content of the image. The answer here is "Non-flooded".
B.2 Our proposed wildfires surveillance system pipeline

LIST OF TABLES

5.1 Summary of one sample in the VQA-v2 dataset.
5.2 Statistical description of our Vietnamese dataset.
6.1 Summary of our model with the large specs and BERT as our language processor. The experiments are conducted on our validation set.
6.2 The results of our models with 4 variants. All of the results are conducted on our Vietnamese VQA test set.
CHAPTER 1: INTRODUCTION
This chapter gives an outline of the thesis topic, including its research aims, research scope,
and scientific and practical value.
1.1 Motivation
Vision impairment severely impacts quality of life in both adult and young populations. Young children with early-onset vision impairment can experience limited cognitive development, with lifelong consequences. Adults with vision impairment have lower productivity as well as higher rates of anxiety and depression. For older visually impaired people, it can result in social isolation and difficulty navigating their surroundings. The number of visually impaired people was estimated at about 285 million, of whom 39 million are blind and 246 million have low vision. 1
In recent years, Artificial Intelligence (AI) has not only made our lives easier by automating tedious and dangerous tasks, but also unlocked myriad possibilities for people with disabilities, promising them unique ways of experiencing the world. More and more AI-powered applications in the assistive technology industry have been put into practice and shown their benefits. In this regard, one of the most promising AI tasks that can help the visually impaired in their daily life is Visual Question Answering [1].
Visual Question Answering (VQA) is a research field of multimodal learning in artificial intelligence. Multimodal learning requires us to propose a deep neural network that can model features over multiple modalities, i.e., multiple data sources (text, audio, images, numbers, ...), to solve problems with high accuracy, which makes it an interesting and challenging task. A typical VQA input consists of two objects: an image and a text question. The VQA task is formulated as follows: given a question and an image, the model must predict the correct answer. This answer generally needs to be chosen from a defined set of possible choices. For example, in fig. 1.1, given the image below and the question "What is the moustache made of?", the model must answer "Banana".
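This formulation, choosing an answer from a fixed candidate set, amounts to classification over an answer vocabulary. The snippet below is only a toy sketch of that idea: the features, the five-word answer set, and the element-wise-product fusion are made-up stand-ins, not the architecture proposed later in this thesis, where features would come from learned vision and language models.

```python
import numpy as np

rng = np.random.default_rng(0)

ANSWERS = ["banana", "yes", "no", "red", "two"]  # hypothetical fixed answer set
DIM = 8                                          # toy feature dimension

def fuse(image_feat, question_feat):
    """Element-wise product: one simple multimodal fusion scheme."""
    return image_feat * question_feat

def predict(image_feat, question_feat, answer_weights):
    """Score every candidate answer and return the argmax."""
    fused = fuse(image_feat, question_feat)   # shape (DIM,)
    scores = answer_weights @ fused           # shape (len(ANSWERS),)
    return ANSWERS[int(np.argmax(scores))]

# Random stand-ins for learned parameters and extracted features.
W = rng.normal(size=(len(ANSWERS), DIM))
image_feat = rng.normal(size=DIM)
question_feat = rng.normal(size=DIM)
answer = predict(image_feat, question_feat, W)
```

Whatever fusion and scoring functions are used, the output is always one member of the predefined answer set.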
This task is challenging on many levels. First, the model needs to understand the text of the question and the visual signals from the image. Second, it must correctly correlate the text with those visual signals. On top of understanding text and visual signals, the model also needs to use common sense reasoning and knowledge base reasoning, and to identify the context of the image. This means that a VQA system needs to be capable of processing images, such as detecting objects and recognizing entities and activities. At the same time, the system must be able to process text written in natural human language. The real challenge in VQA is the combination of techniques from both computer vision and natural language processing to produce a meaningful and accurate answer that provides relevant information and is beneficial to humans. In this graduation thesis, we aim to build a VQA system that can understand Vietnamese questions and give a meaningful, accurate answer written in Vietnamese.

1 The number is obtained from WHO, 2010.

Figure 1.1: Overview of the Visual Question Answering task.
Since the VQA task appeared, there have been many proposals for VQA models that are increasingly complex, capable of answering previously unseen questions, and able to obtain higher and higher accuracy. A number of recent works have proposed attention models for VQA. In particular, the co-attention mechanism has been applied to VQA models more and more widely, and it achieves better results in VQA challenges. We use it to build our Vietnamese VQA model so that its accuracy is as high as possible.
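The attention mechanisms underlying such co-attention models weight one modality's features by their relevance to the other. As a rough illustration only (not the exact architecture developed in this thesis), a minimal scaled dot-product attention step can be written as follows, where a question vector attends over hypothetical image-region features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """Scaled dot-product attention: weight `values` by query-key similarity.

    query:  (d,)   e.g. a question representation
    keys:   (n, d) e.g. one row per image region
    values: (n, d)
    """
    d = query.shape[-1]
    weights = softmax(keys @ query / np.sqrt(d))  # (n,) attention distribution
    return weights @ values, weights              # weighted sum of regions

rng = np.random.default_rng(1)
regions = rng.normal(size=(5, 16))  # 5 hypothetical image-region features
question = rng.normal(size=16)      # hypothetical question vector
context, w = attend(question, regions, regions)
```

The attention weights form a distribution over regions, so the returned context vector emphasizes the regions most relevant to the question; co-attention applies this idea in both directions between the question and the image.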
1.2 Topic’s scientific and practical importance
There are many potential applications for VQA. Probably the most direct application is to
help blind and visually impaired users. A VQA system could provide information about an
image on the internet or any social media.
Another obvious application is to integrate VQA into image retrieval systems. This
could have a huge impact on social media or e-commerce. VQA can also be used for educational or recreational purposes.
Furthermore, as we know, natural disasters, including wildfires, flooding, ice jams, etc., cause great damage in our lives and seriously disrupt the functioning of a community or society. A natural disaster's impact usually includes loss of life, injury, disease, and other negative effects. Therefore, we need a disaster risk reduction approach to save lives and minimize disability and disease. Disaster surveillance allows us to identify risk factors, track disease trends, determine action items, and target interventions. We can apply VQA to develop a tool that tells whether there is ice on a water body, or whether there is smoke and fire in a forest. From that, we can gain an early trigger for a potential disaster.
1.3 Thesis objectives and scope
In this dissertation, we try our best to obtain results corresponding to the following objectives:
• Release a novel VQA dataset written in Vietnamese and provide an in-depth analysis
of our dataset.
• Gain a deep understanding of the co-attention mechanism and apply it effectively for
the VQA task.
• Design and implement a deep learning model to apply Vietnamese effectively in VQA. The result consists of a pipeline for implementing, training, and evaluating our Vietnamese VQA model.
• Design, develop and apply our models to one mobile app and one web app, which can help the visually impaired in their daily life.
• Elaborate on our solutions and examine the limitations of our models.
1.4 Our contribution
Thesis’s contribution
Our first contribution is a new Vietnamese VQA dataset that we built by taking advantage of the previous VQA-v2 dataset. The dataset consists of images from MS-COCO 1 and about a million question-answer pairs. By thoroughly examining our dataset, we find that it can significantly boost performance and play a key role in addressing the Vietnamese VQA task.
Moreover, we propose an effective pipeline for developing a Vietnamese VQA model. By using some modern techniques, our model can overcome the problems of a complex language like Vietnamese and efficiently capture relationships between textual and visual representations. We also quantitatively and qualitatively evaluate our models to show how effectively they deal with the Vietnamese VQA task.
Furthermore, we design and develop two applications, on both mobile and web, which use our models as their core component, in the hope that they can make the daily life of visually impaired people easier.
Paper’s contribution
As part of our work, we submitted one paper to the FAIR 2021 conference. The paper summarizes our thesis work, from constructing a novel Vietnamese VQA dataset to proposing a pipeline that can effectively apply our model to the Vietnamese VQA task.
We hope that our paper will make a small contribution to the knowledge of the artificial intelligence, computer vision, and natural language processing communities, and also help advance the field of Vietnamese Visual Question Answering.
1 https://cocodataset.org/
1.5 Thesis structure
This dissertation consists of 8 chapters, with the introductory chapter serving as chapter 1 and the conclusion as chapter 8. Chapters 2-7 present the key materials that we use throughout our thesis. A brief overview of the contents of each chapter is presented below.
Chapter 2 presents the theoretical background for the thesis: the foundational knowledge necessary to gain a deep understanding of VQA.
Chapter 3 reviews relevant related work and points out some problems of traditional approaches to the VQA task.
Chapter 4 describes the feature extraction method, the co-attention method, and our proposed architecture of the Vietnamese VQA model.
Chapter 5 describes the process of building our Vietnamese dataset and provides an in-depth analysis of the dataset.
Chapter 6 gives an analysis of our conducted experiments and results.
Chapter 7 describes our AI applications that utilize the VQA model.
Chapter 8 is the end of our thesis. We discuss and summarize the achievements and
drawbacks of our VQA system and present future plans for our thesis.