VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

EFFECTIVELY APPLY VIETNAMESE FOR VISUAL QUESTION ANSWERING SYSTEM
(Old title: Development of a VQA system)

Major: Computer Science
Council: Software Engineering
Instructor: Dr. Quan Thanh Tho
Reviewer: Mr. Le Dinh Thuan
Authors: Nguyen Bao Phuc (1712674), Tran Hoang Nguyen (1712396)

Ho Chi Minh City, July 2021

GRADUATION THESIS ASSIGNMENT

VIETNAM NATIONAL UNIVERSITY HCMC - HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
Faculty: Computer Science and Engineering - Department: Computer Science
THE SOCIALIST REPUBLIC OF VIETNAM - Independence - Freedom - Happiness
Note: The student must attach this sheet to the first page of the report.

Full name: Tran Hoang Nguyen - Student ID: 1712396 - Major: Computer Science - Class: MT17KH03
Full name: Nguyen Bao Phuc - Student ID: 1712674 - Major: Computer Science - Class: MT17KH04

Thesis title: Development of a Visual Question Answering system

Tasks (required content and initial data):
- Study the foundational deep learning theory applied in this topic.
- Study the theory of Visual Question Answering.
- Preprocess the data.
- Study and apply the bottom-up method for extracting feature vectors from images.
- Study the Co-Attention approach, which extracts information based on the content of the question, through published research.
- Build and train a machine learning model on the processed data.
- Analyze and design a complete Visual Question Answering system.
- Build the website: design the interface, develop the front end and back end, and deploy the system.
- Evaluate the system.

Date of assignment: 01/02/2021
Date of completion: 01/08/2021
Supervisor: Assoc. Prof. Dr. Quan Thanh Tho
The content and requirements of the graduation thesis have been approved by the Department.
Head of Department (signature and full name):
Main supervisor (signature and full name): Assoc. Prof. Dr. Quan Thanh Tho
Section for the Faculty/Department: Preliminary reviewer:   Unit:   Defense date:   Final grade:   Thesis storage location:

THESIS DEFENSE GRADING SHEET (for the supervisor/reviewer)
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY - FACULTY OF COMPUTER SCIENCE AND ENGINEERING
THE SOCIALIST REPUBLIC OF VIETNAM - Independence - Freedom - Happiness

1. Students: Tran Hoang Nguyen (ID: 1712396, major: Computer Science); Nguyen Bao Phuc (ID: 1712674, major: Computer Science)
2. Topic: Development of a Visual Question Answering system
3. Supervisor/reviewer: M.Eng. Le Dinh Thuan
4. Report overview: Number of pages:   Number of chapters:   Number of data tables:   Number of figures:   Number of references:   Computational software:   Artifacts (products):
5. Drawings overview: Number of drawings:   A1:   A2:   Other sizes:   Hand-drawn:   Computer-drawn:
6. Main strengths of the thesis:
- The topic builds an intelligent system that answers questions about image content. The students built the VQA system and trained the model successfully. The topic was further developed by training the model to support Vietnamese.
- The topic is assessed as difficult and the workload as heavy; it demanded self-study from the students to combine many areas of knowledge.
- The results of the topic were summarized into a scientific paper for the FAIR 2021 scientific conference. (Note: at the time of review, the conference's decision on accepting the paper was not yet available.)
- The thesis is presented fully and clearly. The students should arrange the demo to highlight the core work of the topic.
7. Main shortcomings of the thesis:
8. Recommendation: Eligible for defense □   Needs additions before defense □   Not eligible for defense □
9. Three questions the students must answer before the Council:
10. Overall assessment (in words: excellent, good, average):   Score: 10/10
Signature (full name): M.Eng. Le Dinh Thuan

THESIS DEFENSE GRADING SHEET (for the supervisor/reviewer)
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY - FACULTY OF COMPUTER SCIENCE AND ENGINEERING
THE SOCIALIST REPUBLIC OF VIETNAM - Independence - Freedom - Happiness

1. Students: Tran Hoang Nguyen (ID: 1712396, major: Computer Science); Nguyen Bao Phuc (ID: 1712674, major: Computer Science)
2. Topic: Development of a Visual Question Answering system
3. Supervisor/reviewer: Assoc. Prof. Dr. Quan Thanh Tho
4. Report overview: Number of pages:   Number of chapters:   Number of data tables:   Number of figures:   Number of references:   Computational software:   Artifacts (products):
5. Drawings overview: Number of drawings:   A1:   A2:   Other sizes:   Hand-drawn:   Computer-drawn:
6. Main strengths of the thesis:
- The students completed a VQA system as required. They mastered and clearly understood the theoretical content, and successfully developed and applied the model.
- The students also translated the training dataset into Vietnamese to support answering questions in Vietnamese.
- Part of the thesis work was written up as a scientific paper and submitted to the FAIR conference. The work is also being extended into a collaborative research project with another professor's research group in Taiwan.
- The thesis is written in relatively standard, clear English.
7. Main shortcomings of the thesis:
8. Recommendation: Eligible for defense □   Needs additions before defense □   Not eligible for defense □
9. Three questions the students must answer before the Council:
10. Overall assessment (in words: excellent, good, average):   Score: 9.8/10
Signature (full name): Assoc. Prof. Dr. Quan Thanh Tho

Declaration of Authenticity

We assure that the graduation thesis "Effectively apply Vietnamese for Visual Question Answering system" is the original report of our research. We have finalized our graduation thesis honestly and guarantee the truthfulness of the work presented in it. We are solely responsible for the precision and reliability of the above information.

Ho Chi Minh City, August 9th, 2021

Acknowledgements

First and foremost, we would like to thank Dr. Quan Thanh Tho, Associate Professor in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), for his support throughout our research work. It has been our great fortune to work on and finish our thesis under his supervision. He is the most knowledgeable and insightful person we have ever met, and he helped us throughout the project with his wise knowledge of and enthusiasm for deep learning. From him, we have learned how to do deep learning research in a critical way and had the chance to widen our knowledge. He also let us join his research group, URA. This opportunity not only allowed us to get more useful suggestions about our thesis from everybody in the group, but also to learn new things and new skills day by day, for example by joining seminars held by group members. From that, we could develop more interesting ideas for our thesis.
Our sincere thanks also go to Mr. Le Dinh Thuan, Master of Engineering in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, for being our reviewer. His feedback, suggestions, and advice were essential and influential for the completion of our thesis. We are thankful to have such a good reviewer.

Last but not least, we would like to thank all the teachers at HCMC University of Technology, especially in the Faculty of Computer Science and Engineering, where it has been our pleasure and honor to study for the last four years. We also thank our beloved friends and family, who always support us with constant love and encouragement.

AUTHORS

Abstract

In recent years, deep learning has emerged as a promising technology, with the hope that it can be designed to tackle practical problems that had been considered inconceivable for previous approaches. In particular, blind and visually impaired people are often afraid of being a burden to their family and friends when they need visual guidance. However, there is still a lack of modern systems that can act as a virtual companion to help them interact with the surrounding environment. Therefore, we research and develop a novel deep learning application that can capture the complex relationships between surrounding objects and deliver assistance to the blind and visually impaired. In this dissertation, we propose a novel Visual Question Answering model for Vietnamese and develop practical systems that utilize our model to address the aforementioned problems.

CONTENTS

List of figures
List of tables

Chapter 1  INTRODUCTION
  1.1 Motivation
  1.2 Topic's scientific and practical importance
  1.3 Thesis objectives and scope
  1.4 Our contribution
  1.5 Thesis structure

Chapter 2  THEORETICAL OVERVIEW
  2.1 Deep learning neural network
    2.1.1 Perceptron
    2.1.2 Multi layer perceptron
    2.1.3 Activation functions
    2.1.4 Loss functions
    2.1.5 Backpropagation and optimization
  2.2 Computer vision theoretical background
    2.2.1 Convolutional Network
    2.2.2 Pooling
    2.2.3 CNNs variants
    2.2.4 Regional-based Convolutional Neural Networks
  2.3 Natural language processing theoretical background
    2.3.1 Word Embedding
    2.3.2 Recurrent Neural Network (RNN)
    2.3.3 LSTM - Long Short Term Memory
    2.3.4 GRU - Gated Recurrent Network
    2.3.5 Attention mechanism
    2.3.6 Bidirectional Encoder Representations from Transformers
  2.4 Visual and Language tasks related to VQA
    2.4.1 Image Captioning
    2.4.2 Visual Commonsense Reasoning
    2.4.3 Other Visual and Language tasks

Chapter 3  RELATED WORK
  3.1 Overall
  3.2 Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
  3.3 Pythia v0.1: the Winning Entry to the VQA Challenge 2018
  3.4 Deep Modular Co-Attention Networks for Visual Question Answering
  3.5 ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Chapter 4  METHODOLOGY
  4.1 Feature extraction and co-attention method
    4.1.1 Visual feature
    4.1.2 Textual feature
  4.2 Co-attention layer
  4.3 Our proposal model

Chapter 5  VIETNAMESE VQA DATASET
  5.1 VQA-v2 dataset
  5.2 Visual Genome dataset
  5.3 Challenge
  5.4 Automatic data generation
  5.5 Data refinement
  5.6 Statistical Analysis
  5.7 Data sample

Chapter 6  EXPERIMENTAL ANALYSIS
  6.1 Experimental setup
    6.1.1 Computing Resources
    6.1.2 Dataset
    6.1.3 Evaluation metric
    6.1.4 Implementation details
    6.1.5 Training strategy
  6.2 Experimental results

Chapter 7  APPLICATION
  7.1 Technology terminology
    7.1.1 Flask
    7.1.2 ReactJS
    7.1.3 React Native
    7.1.4 Docker
    7.1.5 C4 Model: Describing Software Architecture
  7.2 System functionality
    7.2.1 Web application system
    7.2.2 Mobile application system
  7.3 System diagram
    7.3.1 Overview
    7.3.2 Use case diagram
    7.3.3 Activity diagram
  7.4 System architecture
    7.4.1 System components
    7.4.2 Our result

Chapter 8  CONCLUSION
  8.1 Summary
  8.2 Limitation and broader future works
    8.2.1 Improve existing Vietnamese VQA models
    8.2.2 Give Vietnamese VQA a new direction

References
Appendices
Chapter A  FAIR 2021 CONFERENCE PAPER
Chapter B  SATU PROJECT
LIST OF FIGURES

1.1 Overview of the Visual Question Answering task.
2.1 Perceptron.
2.2 Symbols and calculating process in a multilayer perceptron.
2.3 Multilayer perceptron with two hidden layers.
2.4 Convolution operation between a 2-D input image and a 2-D kernel.
2.5 The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers.
2.6 Receptive field of one output unit in CNNs.
2.7 Applying a 2 × 2 pooling layer to a 6 × 6 input.
2.8 AlexNet's architecture.
2.9 VGG16 (left) and VGG19 (right) architectures.
2.10 Residual function (left) and ResNet-18 architecture.
2.11 The architecture of R-CNN.
2.12 The architecture of Fast R-CNN.
2.13 The architecture of Faster R-CNN.
2.14 Word2Vec overview.
2.15 CBOW and Skip-gram architectures.
2.16 Recurrent network architecture.
2.17 LSTM architecture.
2.18 GRU architecture.
2.19 Self-attention.
2.20 Multi-head attention.
2.21 BERT for Masked LM.
2.22 BERT for Next Sentence Prediction.
3.1 Typically, attention models operate on CNN features corresponding to a uniform grid of equally-sized image regions (left). The bottom-up approach enables attention to be calculated at the level of objects and other salient image regions (right).
3.2 Bottom-up attention in the VQA task.
3.3 The overall flowchart of the deep Modular Co-attention Networks. The authors also provide two different strategies for deep co-attention learning, namely stacking and encoder-decoder.
3.4 Architecture of the ImageBERT model.
4.1 Our proposed question processor for the Vietnamese VQA task.
4.2 Architecture of the multi-head attention module.
4.3 Architecture of the self-attention unit (left) and guided-attention unit (right).
4.4 Architecture of our proposed model.
5.1 Sample image in the VQA-v2 dataset.
5.2 The list of answers with the most occurrences.
5.3 The list of answers with the fewest occurrences.
5.4 Sample image in the VQA-v2 dataset.
6.1 Our learning rate is controlled by the Adam optimizer and a warmup scheduler.
6.2 Accuracy and co-attention depth relationship. All of these experiments were run on the test set using the small specs.
6.3 Our loss values on the train and validation sets, from epoch 1 to 18.
6.4 Our loss values on the train and validation sets, from epoch 2 to 18.
6.5 The overall accuracies.
6.6 The accuracies on Yes/No questions.
6.7 The accuracies on number questions.
6.8 The accuracies on other questions.
7.1 Components of a C4 model.
7.2 Use case diagram of the Vietnamese VQA web system.
7.3 Use case diagram of the Vietnamese VQA mobile system.
7.4 Activity diagram of the Vietnamese VQA web system.
7.5 Activity diagram of the Vietnamese VQA mobile system.
7.6 Component-level description of our whole system.
7.7 Homepage of our web application.
7.8 Introduction to Vietnamese VQA - a VQA example.
7.9 Introduction to Vietnamese VQA - what is VQA?
7.10 Choose an image and enter a question in Vietnamese; both are used as input for VQA. In this case, the question we enter is "Ở đây có thứ gì?".
7.11 Top-5 answers generated by our Vietnamese VQA model. In this case, for the above question, the top-5 answers sound quite good. The top-1 answer is "Sách" ("books").
7.12 Overview of our VQA web system on mobile. Given the image, the question and the answer generated by VQA are very clear and helpful.
7.13 The user can upload a favorite image and then ask a question. The VQA system responds after a few seconds.
7.14 Our application on a mobile device. The user can ask a question about visual information (left) or daily information like date and time, weather, position, etc. (right).
B.1 Our application of VQA for predicting potential natural disasters. For example, given the question "What is the overall condition of the given image?", VQA can generate an answer based on the visual content of the image. The answer here is "Non-flooded".
B.2 Our proposed wildfire surveillance system pipeline.

LIST OF TABLES

5.1 Summary of one sample in the VQA-v2 dataset.
5.2 Statistical description of our Vietnamese dataset.
6.1 Summary of our model with the large specs and BERT as our language processor. The experiments were conducted on our validation set.
6.2 The results of our models with 4 variants. All results were obtained on our Vietnamese VQA test set.
1 INTRODUCTION

This chapter gives an outline of the thesis topic, including its research aims, research scope, and scientific and practical value.

1.1 Motivation

Vision impairment severely impacts quality of life among both adult and young populations. Young children with early-onset vision impairment can experience limited cognitive development, with lifelong consequences. Adults with vision impairment have lower productivity as well as higher rates of anxiety and depression. For older visually impaired people, it can lead to social isolation and reduced mobility. The number of visually impaired people was estimated at about 285 million, of whom 39 million are blind and 246 million have low vision (figures obtained from WHO, 2010).

In recent years, Artificial Intelligence (AI) has not just made our lives easier by automating tedious and dangerous tasks; it has also unlocked myriad possibilities for people with disabilities, promising them unique ways of experiencing the world. More and more AI-powered applications in the assistive technology industry have been put into practice and shown their benefits. In this regard, one of the most promising AI tasks that can help the visually impaired in their daily life is Visual Question Answering [1].

Visual Question Answering (VQA) is a research field of multimodal learning in artificial intelligence. Multimodal learning requires us to propose a deep neural network that can model features over multiple modalities - multiple data sources (text, audio, images, numbers, ...) - to solve problems and achieve high accuracy, which makes it an especially interesting and challenging task. A typical VQA input consists of two objects: an image and a text question. The task of VQA is formulated as follows: given a question and an image, the model must predict the correct answer. This answer generally needs to be chosen from a defined set of possible choices. For example, in fig. 1.1, given the image and the question "What is the moustache made of?", the model must answer "Banana".

Figure 1.1: Overview of the Visual Question Answering task.

This task is challenging on many levels. First of all, the model needs to understand the text of the question and the visual signals from the image. Secondly, it should correctly correlate the text with the visual signals. On top of understanding text and visual signals, the model also needs to use common-sense reasoning and knowledge-base reasoning, and identify the context of the image. This means that a VQA system needs to be capable of processing images, such as detecting objects and recognizing entities and activities. At the same time, the system must be able to process text as natural human language.
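To make this formulation concrete, the sketch below frames VQA as classification over a fixed answer set: one encoder summarizes the question, another projects precomputed image features, the two summaries are fused, and a classifier scores every candidate answer. This is only an illustrative toy; all layer sizes, names, and the simple elementwise fusion are assumptions of the sketch, not our actual model, which is presented in chapter 4.

```python
# Toy VQA-as-classification sketch (illustrative only, not the thesis model).
import torch
import torch.nn as nn

class ToyVQA(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048, num_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # question tokens -> vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)   # precomputed CNN feature -> hidden
        self.classifier = nn.Linear(hidden_dim, num_answers)  # score every candidate answer

    def forward(self, img_feats, question_tokens):
        # Summarize the question with the LSTM's final hidden state.
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = h_n[-1]                    # (batch, hidden_dim) question summary
        v = self.img_proj(img_feats)   # (batch, hidden_dim) image summary
        fused = q * v                  # naive elementwise fusion of the two modalities
        return self.classifier(fused)  # logits over the fixed answer set

model = ToyVQA()
logits = model(torch.randn(1, 2048), torch.randint(0, 10000, (1, 12)))
predicted_answer = logits.argmax(dim=-1)  # index of the most likely answer, e.g. "banana"
```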
The real challenge in VQA is combining techniques from both computer vision and natural language processing to produce a meaningful, accurate answer that provides relevant information and benefits the user. In this graduation thesis, we aim to build a VQA system that can understand Vietnamese questions and give a meaningful, accurate answer written in Vietnamese.

Since the VQA task appeared, there have been many proposals for VQA models that are increasingly complex, capable of answering previously unseen questions, and able to obtain higher and higher accuracy. A number of recent works have proposed attention models for VQA. In particular, the co-attention mechanism is applied to VQA models more and more widely and achieves better results in VQA challenges. We use it to build our Vietnamese VQA model so that we can make the model's accuracy as high as possible.

1.2 Topic's scientific and practical importance

There are many potential applications for VQA. Probably the most direct application is to help blind and visually impaired users. A VQA system could provide information about an image on the internet or any social media. Another obvious application is to integrate VQA into image retrieval systems, which could have a huge impact on social media or e-commerce. VQA can also be used for educational or recreational purposes.

Furthermore, as we know, natural disasters, including wildfires, flooding, and ice jams, cause great damage in our lives and seriously disrupt the functioning of a community or society. A natural disaster's impact usually includes loss of life, injury, disease, and other negative effects. Therefore, an approach to disaster risk reduction is needed to save lives and minimize disability and disease. Disaster surveillance allows us to identify risk factors, track disease trends, determine action items, and target interventions. We can apply VQA to develop a tool that tells whether there is ice on a water body, or whether there is smoke and fire in a forest. From that, we can gain an early trigger for a potential disaster.

1.3 Thesis objectives and scope

In this dissertation, we try our best to obtain results corresponding to the following objectives:

• Release a novel VQA dataset written in Vietnamese and provide an in-depth analysis of our dataset.
• Gain a deep understanding of the co-attention mechanism and apply it effectively to the VQA task.
• Design and implement a deep learning model that applies Vietnamese effectively in VQA. The result consists of a pipeline for implementing, training, and evaluating our Vietnamese VQA model.
• Design, develop, and apply our models to one mobile app and one web app, which can help the visually impaired in their daily life.
• Elaborate on our solutions and examine the limitations of our models.

1.4 Our contribution

Thesis's contribution

Our first contribution is a new Vietnamese VQA dataset that we built by taking advantage of the previous VQA-v2 dataset. The dataset consists of images from MS-COCO (https://cocodataset.org/) and about a million question-answer pairs. By thoroughly examining our dataset, we find that it can significantly accelerate performance and play a key role in addressing the Vietnamese VQA task.

Moreover, we propose an effective pipeline for developing a Vietnamese VQA model. By using some modern techniques, our model can overcome the difficulties of a complex language like Vietnamese and efficiently capture relationships between textual and visual representations; a rough sketch of this question-guided attention idea follows below. We also quantitatively and qualitatively evaluate our models to show how effectively they deal with the Vietnamese VQA task.
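For illustration only: in question-guided attention, the building block behind the co-attention models referred to above, a question summary vector scores each image-region feature, and the model pools the regions most relevant to the question. The shapes below (for example, 36 bottom-up regions) and all names are assumptions of this sketch, not the exact architecture of our model (see chapter 4).

```python
# Question-guided attention sketch (illustrative shapes and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores one region at a time

    def forward(self, region_feats, question_vec):
        # region_feats: (batch, num_regions, dim); question_vec: (batch, dim)
        scores = self.score(torch.tanh(region_feats + question_vec.unsqueeze(1)))
        weights = F.softmax(scores, dim=1)          # attention distribution over regions
        return (weights * region_feats).sum(dim=1)  # question-aware image summary

attend = QuestionGuidedAttention()
v_att = attend(torch.randn(2, 36, 512), torch.randn(2, 512))  # e.g. 36 bottom-up regions
```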
Furthermore, we design and develop two applications, one mobile and one web, which use our models as their core component, in the hope that they can make the daily life of visually impaired people easier.

Paper's contribution

As part of our work, we submitted a paper to the FAIR 2021 Conference. The paper summarizes our thesis work, from constructing a novel Vietnamese VQA dataset to proposing a pipeline that can effectively apply our model to the Vietnamese VQA task. We hope that our paper adds a small contribution to the knowledge of the artificial intelligence, computer vision, and natural language processing communities, and also helps advance the field of Vietnamese Visual Question Answering.

1.5 Thesis structure

This dissertation consists of 8 chapters, with the introduction serving as chapter 1 and the conclusion as chapter 8. In chapters 2-7, we present the key material that we use throughout our thesis. A brief overview of the contents of each chapter is presented below.

Chapter 2 presents the theoretical background for the thesis. It is the foundation of knowledge necessary to gain a deep understanding of VQA.

Chapter 3 reviews relevant related work and raises some problems of traditional approaches to the VQA task.

Chapter 4 describes the feature extraction method, the co-attention method, and the proposed architecture of our Vietnamese VQA model.

Chapter 5 describes the process of building our Vietnamese dataset and provides an in-depth analysis of the dataset.

Chapter 6 gives an analysis of our conducted experiments and results.

Chapter 7 describes our AI applications that utilize the VQA model.

Chapter 8 concludes the thesis. We discuss and summarize the achievements and drawbacks of our VQA system and present future plans for our thesis.