Document: Aspect based sentiment analysis for text documents


Description:

HO CHI MINH CITY NATIONAL UNIVERSITY
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATE THESIS

ASPECT BASED SENTIMENT ANALYSIS FOR TEXT DOCUMENTS

Major: Computer Science
Council: KHMT 5
Supervisor: Dr. Le Thanh Van
Examiner: Dr. Nguyen Quang Hung
Students: Tran Cong Toan Tri (1713657), Nguyen Phu Thien (1713304)

Ho Chi Minh, December 2021

HO CHI MINH CITY NATIONAL UNIVERSITY, UNIVERSITY OF TECHNOLOGY
Faculty: Computer Science and Engineering
Department: Systems and Networking

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION THESIS ASSIGNMENT

Note: Students must attach this sheet to the first page of the thesis report.

Full name: TRẦN CÔNG TOÀN TRÍ, Student ID: 1713657
Full name: NGUYỄN PHÚ THIỆN, Student ID: 1713304
Major: Computer Science
Class: MTKH03

1. Thesis title: Phân tích cảm xúc theo khía cạnh từ dữ liệu văn bản (Aspect based sentiment analysis for text documents)
2. Tasks (requirements on content and initial data):
- Study the characteristics of the aspect-based sentiment analysis problem on text data.
- Survey related work.
- Research and propose a model that can identify aspects, and the sentiment of each aspect, from text data.
- Collect data for training and testing the model.
- Implement the proposed model, run experiments, compare, and evaluate.
3. Assignment date: 30/08/2021
4. Completion date: 31/12/2021
5. Supervisor and share of supervision: 1) Dr. Lê Thanh Vân, 100%

The content and requirements of the thesis have been approved by the Department.
Date: ........
HEAD OF DEPARTMENT (signature and full name)
PRINCIPAL SUPERVISOR (signature and full name): Lê Thanh Vân

FOR FACULTY AND DEPARTMENT USE:
Reviewer (preliminary grading): ________
Unit: ________
Defense date: ________
Final grade: ________
Thesis archive location: ________

UNIVERSITY OF TECHNOLOGY, FACULTY OF COMPUTER SCIENCE AND ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
27 December 2021

THESIS DEFENSE EVALUATION FORM (for supervisor/reviewer)

1. Students: Trần Công Toàn Trí, Nguyễn Phú Thiện. Student IDs: 1713657, 1713304. Major: Computer Science
2. Topic: Phân tích cảm xúc theo khía cạnh từ dữ liệu văn bản (Aspect based sentiment analysis for text documents)
3. Supervisor/reviewer: Dr. Lê Thanh Vân
4. Overview of the report: Pages: 88. Chapters: 7 (including 1 appendix chapter). Data tables: 16. Figures: 38. References: (blank). Software: (blank). Artifacts (products): (blank).
5. Overview of drawings: Number of drawings: (blank). A1: (blank). A2: (blank). Other sizes: (blank). Hand-drawn: (blank). Computer-drawn: (blank).
6. Main strengths of the thesis:
The thesis aims to propose a model for aspect-based sentiment analysis on text data. To reach this goal, the students carried out the following work well:
- Studied the characteristics of sentiment analysis in general, and of aspect-based sentiment analysis in particular, on text data.
- Surveyed notable related research from recent years.
- Proactively contacted the VLSP and UIT research groups to obtain sample datasets for building the training and test sets. Built a crawler to collect data from booking.com so that real-world data was available for evaluating the proposed model.
- Studied and analyzed well the strengths of natural language processing models such as BERT and PhoBert, models that integrate hidden layers, hierarchical classification models over entity, aspect, and sentiment, and the Bert-based auxiliary sentence model.
- Applied the NLI-B auxiliary sentence model, built on PhoBert, to Vietnamese.
- Proposed the HSUM-HC model, which combines and makes good use of the strengths of PhoBert and of the model's hidden layers to improve semantic understanding, integrated with a hierarchical classifier.
- Evaluated NLI-B, HSUM-HC with 4 hidden layers, and HSUM-HC with 8 hidden layers on three datasets (VLSP, UIT, and Booking.com) for two domains, restaurant and hotel. The experiments give better results in most cases when compared with Linear SVM, Multilayer Perceptron, CNN, BiLSTM+CNN, PhoBert, and viBErt. In addition, the model scores highly when the data is represented at the document level, because it captures the semantic links between sentences well.
In addition, once the model was seen to give good evaluation results, the thesis was extended in its final stage with a simple search application that lets users enter any request about hotels, without the predefined criteria used by current booking sites. The application analyzes the request, identifies what is being asked for, and returns the hotels that match the criteria. It is meant to show the practical applicability of the proposed problem and will be developed further within a recommender system by later project groups. The students also wrote the paper "HSUM-HC: Integrating Bert-based hidden aggregation to hierarchical classifier for Vietnamese aspect-based sentiment analysis", which was accepted and presented at the IEEE 2021 8th NAFOSTED Conference on Information and Computer Science (NICS) on 21/12/2021 in Hanoi, Vietnam.
7. Main shortcomings of the thesis: Some sentences in the thesis report are too long and should be split so that the meaning is clearer and easier to read.
8. Recommendation: Approved for defense [x]. Needs additions before defense [ ]. Not approved for defense [ ].
9. Three questions the students must answer before the Council: a. (blank) b. (blank) c. (blank)
10. Overall assessment (in words: excellent, good, average): Excellent. Score: 10/10
Signature (full name): Lê Thanh Vân

UNIVERSITY OF TECHNOLOGY, FACULTY OF COMPUTER SCIENCE AND ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date: [illegible in source]

THESIS DEFENSE EVALUATION FORM (for supervisor/reviewer)

1. Students: TRẦN CÔNG TOÀN TRÍ and NGUYỄN PHÚ THIỆN. Major: Computer Science
2. Topic: Phân tích cảm xúc theo khía cạnh từ dữ liệu văn bản (Aspect based sentiment analysis for text documents)
3. Supervisor/reviewer: Dr. Nguyễn Quang Hùng
4. Overview of the report (pages, chapters, data tables, figures, references, software, artifacts): [values illegible in source]
5. Overview of drawings: [values illegible in source]
6. Main strengths of the thesis:
- Both students show good teamwork, read English-language material well, and wrote the thesis well in English.
- The thesis shows that the two students are capable of scientific research; they published and presented a paper at the NICS conference.
- The students are able to formulate a research problem and to study technologies and machine learning models, and from that to improve on them and produce a new model, HSUM-HC, which builds on PhoBert and performs better. Applying it to Vietnamese Aspect Based Sentiment Analysis (ABSA) is an interesting direction.
- The students ran evaluations to validate the proposed HSUM-HC model and obtained better results than other works.
7. Main shortcomings of the thesis:
- PhoBert is an existing and fairly well-known model from VinAI's research. The thesis stops at building layers on top of PhoBert to create the new HSUM-HC model for the thesis topic (the ABSA problem). This is entirely acceptable at the level of an undergraduate thesis. The thesis leans somewhat more toward a research project than a graduation thesis.
- When the HSUM-HC model is applied to the real-world TravelLink problem (review data from Booking.com), many limitations remain. The TravelLink web interface is too sketchy, and the analysis from user requirements through to system design is missing. Software testing is inadequate; for example, the students only collected comments and scores from a few of their fellow students, so the results look fairly good. The ordering of search results is not yet really reasonable. The sample queries are limited and do not yet match real use.
8. Recommendation: Approved for defense [x]. Needs additions before defense [ ]. Not approved for defense [ ].
9. Questions the students must answer before the Council:
a. Explain why the work is based on PhoBert. Without PhoBert, how would the results of the thesis be affected?
b. How do the comment-based search results of the thesis on TravelLink differ from keyword-based search results that also take distance into account? For example: a good restaurant near the University of Technology (ĐHBK) that has parking.
10. Overall assessment (in words: excellent, good, average): Excellent. Score: [illegible in source]
Signature (full name): Dr. Nguyễn Quang Hùng

DECLARATION

We, Tran Cong Toan Tri and Nguyen Phu Thien, declare that this thesis titled "Sentiment Analysis of User Comments" and the work presented in it are our own and that, to the best of our knowledge and belief, it contains no material previously published or written by another person (except where explicitly defined in the acknowledgments), nor material which to a substantial extent has been submitted for the award of any other degree or diploma of a university or other institution of higher learning.

Acknowledgements

It gives us great pleasure and satisfaction to present our thesis on "Aspect based sentiment analysis for text documents". This is our last project as undergraduate students, and it reflects what we have learned and the skills we acquired during our years at the University of Technology. For that reason, we would like to express our deepest gratitude to our supervisor, Dr. Le Thanh Van, for allowing us to work on this project, for her continuous support throughout our study and research, and for the patience, motivation, and knowledge that she has given us.
We could not have imagined a better advisor and mentor for our thesis.

Besides our advisor, we would like to thank the teachers and staff of the Faculty of Computer Science and Engineering for all the knowledge, experience, and support they have given us during our time at university. Their teaching helped us acquire the foundational knowledge for this thesis.

Lastly, we would like to thank all of our friends and family, who have motivated us every step of the way. We would not have been able to finish this without them, and we are grateful for everything we have been given to this day.

Abstract

In today's world, customers and their feedback are vital to any business's survival. Competition is harsh in every market, and the competitor that best understands and pleases its customers will be more successful. For that to happen, businesses need to gather information about their customers' opinions on a large scale. ABSA is one method for achieving this. It has been studied rigorously by researchers in the past, and since the creation of Bert, ABSA methods have become more and more advanced, showing better results in recent years. For Vietnamese, however, ABSA is still not as developed, due to limited resources and the nuances of the language. In our work, we aim to improve the capabilities of Vietnamese ABSA. We use PhoBert, the SOTA pre-trained model for Vietnamese, and build two models on it. One has a custom classifier made from a combination of previous methods that proved effective. The other utilizes Bert's sequence-pair feature, constructing auxiliary sentences and turning ABSA into a question-answering problem. With our work, we hope to set new baseline results for the Vietnamese ABSA datasets and to provide useful knowledge for researchers who want to improve on them.
Our implementation achieved SOTA scores on both public Vietnamese ABSA datasets, considerably higher than previous works. To demonstrate that our model works not only on filtered data but also on actual user reviews, we collected reviews from a booking site and applied our model to them. From that data, we built a profile for each hotel, identifying its pros and cons, and then built a search engine that helps users book accommodation by giving immediate access to the necessary information. In this work, we present the acquisition of these data, their evaluation results, and our process of designing the search engine.

Contents

1 Introduction
  1.1 Why we chose this project
  1.2 Project goal
  1.3 Project scope
  1.4 Project structure
2 Aspect Based Sentiment Analysis
  2.1 What is ABSA
  2.2 ABSA research overview
  2.3 Vietnamese ABSA shared task
  2.4 Related work
3 Our proposed models
  3.1 Bert sequence-pair with auxiliary sentences
  3.2 HSUM-HC
4 Experimental results and discussion
  4.1 Exploratory data analysis
  4.2 Training process
  4.3 Training cost
  4.4 Experimental Results
  4.5 Analysis and Discussion
  4.6 Evaluation on real-life data
  4.7 Survey results
5 Model application for a recommender system
  5.1 Inspiration
  5.2 Overview
  5.3 Technology
  5.4 Design
  5.5 Results
  5.6 Evaluation
6 Conclusion
  6.1 Our contribution
  6.2 Research Paper
  6.3 Limitations
  6.4 Future work
A Our Research Paper

List of Tables

2.1 Possible entity-attribute pairs for Hotel domain
2.2 Possible entity-attribute pairs for Restaurant domain
3.1 Translation for Hotel domain entities
3.2 Translation for Hotel domain attributes
3.3 Translation for Restaurant domain entities
3.4 Translation for Restaurant domain attributes
3.5 Translation for Sentiments
4.1 Dataset details for VLSP 2018 ABSA
4.2 Dataset details for UIT ABSA
4.3 Training parameters for HSUM-HC and NLI_B
4.4 Training costs
4.5 Results on the test set of VLSP 2018 Dataset, Hotel domain
4.6 Results on the test set of UIT ABSA Dataset, Hotel domain
4.7 Results on the test set of VLSP 2018 Dataset, Restaurant domain
4.8 Results on the test set of UIT ABSA Dataset, Restaurant domain
4.9 Real-life dataset details

List of Figures

2.1 Paperswithcode's summary of SemEval-2014 ABSA research
2.2 Example of a review and expected labels
2.3 Multitask BiLSTM-CNN model for ABSA
2.4 The architecture of intra attention
2.5 The architecture of global attention
2.6 BERT input representation [1]
2.7 The architecture of Bert Encoder layer
2.8 Example of word segmentation
2.9 Thin et al. Bert implementation
2.10 Hierarchical Hidden level aggregation for Bert
2.11 Hierarchical approach for a Bert-based ABSA task
3.1 QA_M auxiliary sentence format
3.2 NLI_M auxiliary sentence format
3.3 QA_B auxiliary sentence format
3.4 NLI_B auxiliary sentence format
3.5 HSUM-HC model for the ABSA task
4.1 Aspect distribution of the VLSP-2018 ABSA dataset, hotel domain
4.2 Aspect distribution of the VLSP-2018 ABSA dataset, restaurant domain
4.3 Aspect distribution of the UIT ABSA dataset, hotel domain
4.4 Aspect distribution of the UIT ABSA dataset, restaurant domain
4.5 Sentiment distribution for the VLSP-2018 dataset hotel domain (left) and restaurant domain (right)
4.6 Sentiment distribution for the UIT ABSA dataset hotel domain (left) and restaurant domain (right)
4.7 The loss curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Hotel domain
4.8 The Phase B validation curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Hotel domain
4.9 The loss curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Restaurant domain
4.10 The Phase B validation curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Restaurant domain
4.11 HSUM-HC and NLI_B F1 score differences for VLSP-2018 hotel domain
4.12 HSUM-HC and NLI_B F1 score differences for UIT ABSA hotel domain
4.13 Survey results
5.1 Score calculation for each hotel
5.2 The search bar
5.3 A hotel in the results
5.4 The comment window when opened
5.5 Travel Link result page
5.6 Travel Link result page with comment window
5.7 Top 4 recommended hotels for the query "chỗ ở gần trung tâm, nhân viên thân thiện, phòng rộng" (accommodation near the center, friendly staff, spacious rooms)
5.8 Comments of the top recommended hotel
5.9 Comments of the second recommended hotel

Acronyms

ABSA  Aspect Based Sentiment Analysis
NLP   Natural Language Processing
SOTA  State Of The Art
UIT   University of Information Technology
VLSP  Association for Vietnamese Language and Speech Processing

Ho Chi Minh city National University - University of Technology
Faculty of Computer Science and Engineering

Chapter 1
Introduction

In this chapter, we give a brief introduction to the project: the reason we chose it, its goal, and its scope in real-life usage.

1.1 Why we chose this project

With the development of the Internet and eCommerce, shopping is no longer as simple as picking what you want in a store. Shopping can now be done at home, through phones, computers, and other devices. With online shopping, customers cannot try out a product before they buy it, nor can they feel its material or quality. This is especially true in today's situation, when Covid-19 is plaguing many countries and forcing people to stay indoors: the only way to judge a product before buying is through past customers' experience. Almost every eCommerce application has a function that lets customers leave their opinions on the service they received.
Beyond that, review websites now exist for almost every domain, for example Foody.vn for restaurants, agoda.com, booking.com, and mytour.vn for hotels, and tinhte.vn for technology. Most customers are likely to look up reviews there for any product or service they plan to purchase, even when they intend to shop in person. Any customer's opinion can be read by everyone very quickly, and a business can lose a large portion of its customers to a bad review on the internet without even knowing about it. Therefore, learning about customers' opinions is one of the top priorities for any business that wants to succeed: a company must make sure it is always aware of the general opinion. Doing so not only gives it a better overview of its market growth, but also shows what it can improve.

One approach that analyzes customer opinions at this level of detail is Aspect Based Sentiment Analysis (ABSA). With ABSA, an opinion written in text can be classified into labels, so that we learn not only what the opinion is about but also its sentiment (positive, neutral, or negative). Combined with the abundance of reviews and opinions online, this can be invaluable to a business, helping it get an accurate view of its customer base.

With every decision a business makes, it has to track its customers' responses and adjust accordingly, which can keep it from a marketing catastrophe. On occasion, a user opinion circulates quickly on the internet, shows up on front pages, and appears to be shared by many people. Whether a company should make changes in response to such an opinion is a separate question, however, because it may come from a "loud minority".
Listening to this loud crowd may actually dissatisfy the majority of the customer base. This kind of decision can only be made with sufficient information and enough coverage of customer opinions. The practice is called "Brand Monitoring" and is used by big brands: it is the process of tracking different channels to identify where a brand is mentioned and to understand how people perceive it, which lets the company keep an eye on potential crises and respond to questions or criticism before they get out of control.

The benefit is not limited to the service business. Any field that needs the general public's endorsement to succeed can benefit from understanding its audience. Politics is a prime example. Governments and campaigns need to gauge public opinion and act accordingly, especially during presidential elections, and nowadays most of these opinions are online and in quantities too large for any human to sort through. ABSA can therefore be valuable in this field, helping a candidate gain an advantage over opponents by knowing what the public wants and making the right statements.

With all these potential fields of application, ABSA is useful to anyone who can apply it. ABSA for English has been extensively developed and applied in real-life use. For Vietnamese, a low-resource language, further research and development are still required to bring ABSA to the point of effective commercial use. That is why our work focuses on improving on past work and developing a more effective system for Vietnamese ABSA tasks.

1.2 Project goal

The goal of our project is to build a model capable of classifying aspects and sentiments given a review. We believe that with this system, customer opinions can be explored on a large scale.
More detailed profiles can be built for each user. Understanding customers' shopping habits and preferences not only gives a more accurate overview of the customer base but also makes better recommendations possible, increasing sales and satisfaction.

With our work, we also hope to improve Vietnamese ABSA capabilities. We experiment with applying Transformer models to ABSA and with maximizing the potential of PhoBert on a monolingual dataset. We also experiment with PhoBert's sequence-pair input: we build auxiliary sentences for each review and treat ABSA as a question-answering task, hoping to better capture the aspect-sentiment relationships in each sequence.

In our project, we study past work to learn its methodology and advantages, and from that knowledge we develop our own method, improving on previous models. Our method is a combination of components made specifically to improve the performance of Bert. We test our work not only on public datasets but also on real-life data crawled from review sites, to see how well our model handles unfiltered data.

We also demonstrate our model's potential by using it in a recommendation system for the hotel domain. We use the model to build hotel profiles from past customer reviews. Given a query from a user, we analyze the aspects and sentiments of that query and suggest suitable hotels. The system also makes it convenient to view past reviews by sorting them by relevance to the query, so that users see relevant information first.

1.3 Project scope

In our work, we build a system to solve the ABSA task of classifying user reviews into aspects and sentiments.
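The sequence-pair idea mentioned under the project goal can be illustrated with a small sketch. It pairs a review with one auxiliary sentence per candidate aspect and polarity, and attaches a binary label saying whether that combination holds, so that the task becomes a yes/no decision over sentence pairs. The function name `build_sequence_pairs`, the "entity - attribute - polarity" pseudo-sentence layout, and the English label names are assumptions made for illustration. They are not taken from the thesis, whose actual auxiliary sentence formats and Vietnamese label translations are defined in Chapter 3.

```python
from itertools import product

# Polarity labels used in the thesis's task definition.
POLARITIES = ("positive", "neutral", "negative")


def build_sequence_pairs(review, aspects, gold):
    """Return (review, auxiliary_sentence, label) triples.

    `aspects` is the predefined list of entity#attribute labels for the domain.
    `gold` maps each aspect mentioned in the review to its polarity.
    The auxiliary sentence layout below is a hypothetical pseudo-sentence.
    """
    pairs = []
    for aspect, polarity in product(aspects, POLARITIES):
        entity, attribute = aspect.split("#")
        auxiliary = f"{entity} - {attribute} - {polarity}"
        # Label 1 only when this aspect is mentioned with exactly this polarity.
        label = 1 if gold.get(aspect) == polarity else 0
        pairs.append((review, auxiliary, label))
    return pairs


aspects = ["HOTEL#DESIGN&FEATURES", "SERVICE#GENERAL"]
gold = {"SERVICE#GENERAL": "negative"}
pairs = build_sequence_pairs("The service is bad", aspects, gold)
for _, auxiliary, label in pairs:
    print(auxiliary, label)
```

Each triple would then be fed to a sequence-pair classifier, with the review as the first segment and the auxiliary sentence as the second. The trade-off of this design is that one review expands into as many inputs as there are aspect and polarity combinations.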
The datasets we use for training and evaluation are public datasets from the Association for Vietnamese Language and Speech Processing (VLSP) and the University of Information Technology Natural Language Processing group (UIT NLP group). All training and evaluation data are in Vietnamese. Our model is expected to classify Vietnamese text with proper accent marks and clearly expressed ideas, that is, any text a human can interpret without prior knowledge of slang or abbreviations.

1.4 Project structure

Our thesis includes 7 chapters, counting this one and the appendix that contains our research paper. The content of each chapter is as follows:

1. Chapter 1: Introduction
   We introduce our project, giving more insight into the reason we chose it, its goal and scope, and our general direction.
2. Chapter 2: Aspect based sentiment analysis
   We go into detail about our task, ABSA. We explain what ABSA is and what its goal is, and present past work on this task along with its methodology and results. The datasets we use for training and evaluation are also introduced in this chapter.
3. Chapter 3: Model Architecture
   We introduce our model, its inspiration, and its architecture, explaining in more detail how it functions and what each component of the system does.
4. Chapter 4: Experimental results and discussion
   We present our experimental setups and results, comparing them with past work on the same datasets. We also present our real-life dataset crawled from review sites and evaluate our model on it.
5. Chapter 5: Model Application
   We apply our model to a real-life task, building an application that serves as a recommendation system for hotel booking. We explain our goals for this application and how it works in detail.
6. Chapter 6: Conclusion
   We summarize our results and the pros and cons of our project, and discuss possible future developments.

Chapter 2
Aspect Based Sentiment Analysis

2.1 What is ABSA

Sentiment Analysis (also known as Opinion Mining) is the process of determining what a user thinks about a certain product, service, or any particular domain. It is a classification task that annotates a portion of text with a positive, negative, or neutral sentiment. ABSA is an evolution of Sentiment Analysis that aims at capturing the aspect-level opinions expressed in natural language texts [2]. A user opinion or review can run to dozens or hundreds of words covering multiple aspects, each with a different sentiment, and determining which sentiment words go with which aspect can be very difficult. With ABSA, reviews about a product can be analyzed in detail, showing the reviewer's opinion on each aspect of that product.

The main process of ABSA is as follows. Given a customer review about a domain (e.g. hotel or restaurant), the goal is to identify the set of (Aspect, Polarity) pairs that fit the opinions expressed in the review. Each aspect is a pair of an entity and an attribute, and the polarity is one of negative, neutral, or positive. For each domain, all possible combinations of entities and attributes are predefined. The ABSA task is divided into two phases: (i) identify the entity-attribute pairs, and (ii) analyze the sentiment polarity of each aspect (entity#attribute) identified in the previous phase.
For example, for the review "This place has an amazing view, the food is great too but the service is bad", the entity-attribute pairs after phase (i) are {Hotel#Design&Features, Food&Drinks#Quality, Service#General}, and the output after phase (ii) is Hotel#Design&Features: Positive, Food&Drinks#Quality: Positive, Service#General: Negative.
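The two phases can be illustrated with a toy sketch. It is not the model proposed in this thesis: the keyword table `ASPECT_KEYWORDS` and the two word lists are hypothetical stand-ins for the learned classifiers, chosen only so that the example review yields the labels given in the text. What the sketch is meant to show is the shape of the data, namely a predefined set of entity#attribute aspects, a phase (i) that selects the aspects mentioned, and a phase (ii) that assigns one polarity to each selected aspect.

```python
import re

# Hypothetical keyword table standing in for the phase (i) classifier.
ASPECT_KEYWORDS = {
    "Hotel#Design&Features": {"view"},
    "Food&Drinks#Quality": {"food"},
    "Service#General": {"service"},
}
# Hypothetical sentiment lexicon standing in for the phase (ii) classifier.
POSITIVE_WORDS = {"amazing", "great"}
NEGATIVE_WORDS = {"bad"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def identify_aspects(review):
    """Phase (i): map each mentioned aspect to the clause that mentions it."""
    found = {}
    # Split on commas and on the word "but" to get rough clauses.
    for clause in re.split(r",|\bbut\b", review):
        words = tokenize(clause)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:
                found[aspect] = clause.strip()
    return found


def classify_polarity(clause):
    """Phase (ii): assign a polarity to the clause of one aspect."""
    words = tokenize(clause)
    if words & POSITIVE_WORDS:
        return "Positive"
    if words & NEGATIVE_WORDS:
        return "Negative"
    return "Neutral"


def analyze(review):
    """Run phase (i), then phase (ii) on every aspect it found."""
    aspects = identify_aspects(review)
    return {aspect: classify_polarity(clause) for aspect, clause in aspects.items()}


review = "This place has an amazing view, the food is great too but the service is bad"
result = analyze(review)
print(result)
```

A keyword lookup like this breaks as soon as a sentiment word and its aspect fall in different clauses, which is the difficulty described earlier in this section and the reason learned models are used instead.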
