HO CHI MINH CITY NATIONAL UNIVERSITY
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
GRADUATE THESIS
ASPECT BASED SENTIMENT ANALYSIS
FOR TEXT DOCUMENTS
Major: Computer Science
Council: KHMT 5
Supervisor: Dr. Le Thanh Van
Examiner: Dr. Nguyen Quang Hung
Students: Tran Cong Toan Tri (1713657)
Nguyen Phu Thien (1713304)
Ho Chi Minh, December 2021
HO CHI MINH CITY NATIONAL UNIVERSITY
UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science and Engineering
DEPARTMENT: Systems and Networking
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
GRADUATION THESIS ASSIGNMENT
Note: Students must attach this sheet to the first page of the thesis report
FULL NAME: TRẦN CÔNG TOÀN TRÍ, STUDENT ID: 1713657
FULL NAME: NGUYỄN PHÚ THIỆN, STUDENT ID: 1713304
MAJOR: COMPUTER SCIENCE
CLASS: MTKH03
1. Thesis title:
Vietnamese title: Phân tích cảm xúc theo khía cạnh từ dữ liệu văn bản
English title: Aspect based sentiment analysis for text documents
2. Tasks (requirements on content and initial data):
- Study the characteristics of the aspect-based sentiment analysis problem for text data.
- Study related work.
- Research and propose a model that can identify aspects and the sentiment of each aspect from text data.
- Collect data to train and test the model.
- Implement the proposed model, run experiments, compare, and evaluate.
3. Assignment date: 30/08/2021
4. Completion date: 31/12/2021
5. Supervisor's full name (portion supervised: 100%):
1) Dr. Lê Thanh Vân
The content and requirements of the thesis have been approved by the Department.
Date: ........ / ......... / ..........
HEAD OF DEPARTMENT (sign and print full name)
PRINCIPAL SUPERVISOR (sign and print full name): Lê Thanh Vân
FOR FACULTY AND DEPARTMENT USE:
Reviewer (preliminary grading): ________________________
Unit: _______________________________________
Defense date: ___________________________________
Final grade: _________________________________
Thesis archive location: _____________________________
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
27 December 2021
THESIS DEFENSE EVALUATION FORM
(For the supervisor/reviewer)
1. Students' full names: Trần Công Toàn Trí, Nguyễn Phú Thiện
Student IDs: 1713657, 1713304
Major (specialization): Computer Science
2. Thesis title: Aspect based sentiment analysis for text documents
3. Supervisor/reviewer's full name: Dr. Lê Thanh Vân
4. Overview of the thesis report:
Number of pages: 88
Number of chapters: 7 (including 1 appendix chapter)
Number of tables: 16
Number of figures: 38
Number of references:
Computational software:
Artifacts (products):
5. Overview of drawings:
- Number of drawings:
A1 sheets:
A2 sheets:
Other sizes:
- Number of hand-drawn drawings:
Number of computer-drawn drawings:
6. Main strengths of the thesis:
The thesis aims to propose a model for aspect-based sentiment analysis of text data. To achieve this goal, the students carried out the following work well:
- Studied the characteristics of sentiment analysis in general, and of aspect-based sentiment analysis in particular, for text data.
- Studied prominent related research from recent years.
- Proactively contacted the VLSP and UIT research groups to collect sample datasets for building the training and test sets. Built a crawler to collect data from booking.com, providing real-world data for evaluating the proposed model.
- Studied and analyzed well the strengths of natural language processing models such as BERT and PhoBert, models that aggregate hidden layers, hierarchical classification models over entity, aspect, and sentiment, and the Bert-based auxiliary-sentence model.
- Applied the NLI-B auxiliary-sentence model, based on PhoBert, to Vietnamese.
- Proposed the HSUM-HC model, which combines and exploits the strengths of PhoBert and of the model's hidden layers to improve semantic understanding, and integrates them with a hierarchical classifier.
- Experimentally evaluated NLI-B, HSUM-HC with 4 hidden layers, and HSUM-HC with 8 hidden layers on 3 datasets (VLSP, UIT, and Booking.com) for 2 domains, restaurant and hotel. The experiments give better results in most cases when compared with Linear SVM, Multilayer Perceptron, CNN, BiLSTM+CNN, PhoBert, and viBErt. In addition, the model achieves high evaluation scores when the data is represented at the document level, because it captures the semantic links between sentences well.
Furthermore, once the model was seen to give good evaluation results, the thesis was extended in its final stage with a simple search application that lets users enter any request about a hotel, without following predefined criteria as on current booking sites. The application analyzes and interprets the request and returns the hotels that match the requested criteria. It is meant to demonstrate the practical applicability of the proposed approach and will be developed further within a recommender system by later project groups.
The students also wrote the paper "HSUM-HC: Integrating Bert-based hidden aggregation to hierarchical classifier for Vietnamese aspect-based sentiment analysis", which was accepted and presented at the IEEE 2021 8th NAFOSTED Conference on Information and Computer Science (NICS) on 21/12/2021 in Hanoi, Vietnam.
7. Main shortcomings of the thesis:
Some sentences in the thesis report are too long; they should be split to make the meaning clearer and easier to read.
8. Recommendation: Approved for defense [x]
Needs additions before defense [ ]
Not approved for defense [ ]
9. 3 questions the students must answer before the Council:
a.
b.
c.
10. Overall assessment (in words: excellent, good, average): Excellent
Score: 10/10
Signature (print full name)
Lê Thanh Vân
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date:
THESIS DEFENSE EVALUATION FORM
(For the supervisor/reviewer)
1. Student's full name: TRẦN CÔNG TOÀN TRÍ
Student ID:
Major (specialization): Computer Science
Student's full name: NGUYỄN PHÚ THIỆN
Student ID:
Major (specialization): Computer Science
2. Thesis title: Aspect based sentiment analysis for text documents
3. Supervisor/reviewer's full name: Dr. Nguyễn Quang Hùng
4. Overview of the thesis report:
Number of pages:
Number of chapters:
Number of tables:
Number of figures:
Number of references:
Computational software:
Artifacts (products):
5. Overview of drawings:
- Number of drawings:
A1 sheets:
A2 sheets:
Other sizes:
- Number of hand-drawn drawings:
Number of computer-drawn drawings:
6. Main strengths of the thesis:
Both students show good teamwork skills, good comprehension of English-language materials, and good writing of the thesis in English.
The thesis shows that the two students are capable of scientific research; they have published and presented a paper at the NICS conference.
The students are skilled at formulating a research problem and at studying technologies and machine learning models, and from there improved on them to produce a new model, HSUM-HC, with better capability. Building it on PhoBert and applying it to Vietnamese Aspect Based Sentiment Analysis (ABSA) is an interesting point. The students carried out evaluations to validate the proposed HSUM-HC model and obtained better results than other works.
7. Main shortcomings of the thesis:
The PhoBert model already exists and is fairly well known from VinAI's research results. The thesis stops at building upper layers on top of PhoBert to create the new HSUM-HC model suited to the thesis topic (the ABSA problem). This amount of work is entirely acceptable for an undergraduate thesis. The thesis leans somewhat more toward a research project than a graduation thesis.
When the HSUM-HC model is applied to a practical problem (TravelLink, with review data from Booking.com), many limitations remain. The TravelLink web interface is too rudimentary and lacks analysis from user requirements through to system design. Software testing is not adequate. For example, the students only collected comments and ratings from some of the authors' fellow students, so the results came out fairly good. The ordering of search results is not yet truly reasonable. The sample query sentences are limited and not yet realistic.
8. Recommendation: Approved for defense [x]
Needs additions before defense [ ]
Not approved for defense [ ]
9. Questions the students must answer before the Council:
a. Explain why the work is based on PhoBert. Without PhoBert, how would the results of the thesis be affected?
b. How do the comment-based search results of the thesis on TravelLink differ from the results of a keyword-based search that also takes distance into account? Example: a good restaurant near the University of Technology that has parking.
10. Overall assessment (in words: excellent, good, average): Excellent
Score:
Signature (print full name)
Dr. Nguyễn Quang Hùng
DEDICATION
We, Tran Cong Toan Tri and Nguyen Phu Thien, declare that this thesis titled "Sentiment Analysis of User Comments" and the work presented in it are our own and that,
to the best of our knowledge and belief, it contains no material previously published or
written by another person (except where explicitly defined in the acknowledgments), nor
material which to a substantial extent has been submitted for the award of any other
degree or diploma of a university or other institution of higher learning.
Acknowledgements
It gives us great pleasure and satisfaction to present our thesis, “Aspect based
sentiment analysis for text documents”. This is our last project as bachelor
students at the university, and it reflects what we have learned and the skills we have
acquired during our years at the University of Technology.
For that reason, we would like to express our deepest gratitude to our
supervisor, Dr. Le Thanh Van, for allowing us to work on this project, for her
continuous support throughout our study and research, and for the patience,
motivation, and knowledge that she has given us. We could not have imagined a better
advisor and mentor for our thesis.
Besides our advisor, we would like to thank the teachers and staff of the Computer
Science and Engineering Faculty for all the knowledge, experience, and support
they have given us during our time at university. Their teaching has helped us acquire
the foundational knowledge for this thesis.
Lastly, we would like to thank all of our friends and family, who have motivated us every
step of the way. We would not have been able to finish this without them, and we are
grateful for everything we have been given to this day.
Abstract
In today’s world, customers and their feedback are vital to any business’s survival.
Competition is harsh in every market, and the competitor who best understands and pleases
its customers will be more successful. For that to happen, businesses need to gather
information about their customers’ opinions on a large scale. Aspect Based Sentiment
Analysis (ABSA) is one method to achieve this. It has been studied rigorously by
researchers in the past, and since the creation of Bert, ABSA methods have become more
and more advanced, showing better and better results in recent years. For Vietnamese,
however, ABSA is still not as developed, due to limited resources and the nuances of
the language.
In our work, we aim to improve the capabilities of Vietnamese ABSA. We use PhoBert, the
Vietnamese SOTA pre-trained model, and build two models from it. One has a custom
classifier made from a combination of previous methods that proved effective. The
other utilizes Bert’s sequence-pair feature, constructing auxiliary sentences and
turning ABSA into a question-answering problem. With our work, we hope to set new
baseline results for the Vietnamese ABSA datasets and to provide useful knowledge
for researchers who want to improve them further. Our implementation achieved SOTA
scores on both public Vietnamese ABSA datasets, scoring considerably higher
than previous works.
To demonstrate that our model works not only on filtered data but also on actual user
reviews, we also obtained reviews from a booking site and applied our model to them. From
that data, we built a profile for each hotel, identifying its pros and cons, and then built
a search engine that helps users book accommodation by providing immediate
access to the necessary information. In this work, we describe how these data were acquired,
their evaluation results, and our process of designing the search engine.
Contents
1 Introduction
  1.1 Why we chose this project
  1.2 Project goal
  1.3 Project scope
  1.4 Project structure
2 Aspect Based Sentiment Analysis
  2.1 What is ABSA
  2.2 ABSA research overview
  2.3 Vietnamese ABSA shared task
  2.4 Related work
3 Our proposed models
  3.1 Bert sequence-pair with auxiliary sentences
  3.2 HSUM-HC
4 Experimental results and discussion
  4.1 Exploratory data analysis
  4.2 Training process
  4.3 Training cost
  4.4 Experimental Results
  4.5 Analysis and Discussion
  4.6 Evaluation on real-life data
  4.7 Survey results
5 Model application for a recommender system
  5.1 Inspiration
  5.2 Overview
  5.3 Technology
  5.4 Design
  5.5 Results
  5.6 Evaluation
6 Conclusion
  6.1 Our contribution
  6.2 Research Paper
  6.3 Limitations
  6.4 Future work
A Our Research Paper
List of Tables
2.1 Possible entity-attribute pairs for Hotel domain
2.2 Possible entity-attribute pairs for Restaurant domain
3.1 Translation for Hotel domain entities
3.2 Translation for Hotel domain attributes
3.3 Translation for Restaurant domain entities
3.4 Translation for Restaurant domain attributes
3.5 Translation for Sentiments
4.1 Dataset details for VLSP 2018 ABSA
4.2 Dataset details for UIT ABSA
4.3 Training parameters for HSUM-HC and NLI_B
4.4 Training costs
4.5 Results on the test set of VLSP 2018 Dataset, Hotel domain
4.6 Results on the test set of UIT ABSA Dataset, Hotel domain
4.7 Results on the test set of VLSP 2018 Dataset, Restaurant domain
4.8 Results on the test set of UIT ABSA Dataset, Restaurant domain
4.9 Real-life dataset details
List of Figures
2.1 Paperswithcode’s summary of SemEval-2014 ABSA research
2.2 Example of a review and expected labels
2.3 Multitask BiLSTM-CNN model for ABSA
2.4 The architecture of intra attention
2.5 The architecture of global attention
2.6 BERT input representation [1]
2.7 The architecture of Bert Encoder layer
2.8 Example of word segmentation
2.9 Thin et al. Bert implementation
2.10 Hierarchical Hidden level aggregation for Bert
2.11 Hierarchical approach for a Bert-based ABSA task
3.1 QA_M auxiliary sentence format
3.2 NLI_M auxiliary sentence format
3.3 QA_B auxiliary sentence format
3.4 NLI_B auxiliary sentence format
3.5 HSUM-HC model for the ABSA task
4.1 Aspect distribution of the VLSP-2018 ABSA dataset, hotel domain
4.2 Aspect distribution of the VLSP-2018 ABSA dataset, restaurant domain
4.3 Aspect distribution of the UIT ABSA dataset, hotel domain
4.4 Aspect distribution of the UIT ABSA dataset, restaurant domain
4.5 Sentiment distribution for the VLSP-2018 dataset hotel domain (left) and restaurant domain (right)
4.6 Sentiment distribution for the UIT ABSA dataset hotel domain (left) and restaurant domain (right)
4.7 The loss curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Hotel domain
4.8 The Phase B validation curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Hotel domain
4.9 The loss curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Restaurant domain
4.10 The Phase B validation curves on the validation and test sets for VLSP 2018 (left) and UIT ABSA dataset (right), Restaurant domain
4.11 HSUM-HC and NLI_B F1 score differences for VLSP-2018 hotel domain
4.12 HSUM-HC and NLI_B F1 score differences for UIT ABSA hotel domain
4.13 Survey results
5.1 Score calculation for each hotel
5.2 The search bar
5.3 A hotel in the results
5.4 The comment window when opened
5.5 Travel Link result page
5.6 Travel Link result page with comment window
5.7 Top 4 recommended hotels for the query "chỗ ở gần trung tâm, nhân viên thân thiện, phòng rộng" (a place near the city center, friendly staff, spacious room)
5.8 Comments of the top recommended hotel
5.9 Comments of the second recommended hotel
Acronyms
ABSA: Aspect Based Sentiment Analysis
NLP: Natural Language Processing
SOTA: State Of The Art
UIT: University of Information Technology
VLSP: Association for Vietnamese Language and Speech Processing
Ho Chi Minh city National University - University of Technology
Faculty of Computer Science and Engineering
Chapter 1
Introduction
In this chapter, we give a brief introduction to the project: the reason we chose it,
its goal, and its scope in real-life usage.
1.1 Why we chose this project
Nowadays, with the development of the Internet and e-commerce, shopping is no longer as
simple as picking what you want in a store. Shopping can now be done at home,
through phones, computers, and other devices. With online shopping, however, customers
cannot try out a product before they buy it, nor can they feel its material or quality.
Especially in today’s situation, when Covid-19 is plaguing many countries and forcing
people to stay indoors, the only way to judge a product before buying is through past
customers’ experience. Almost every e-commerce application lets customers leave their
opinions on the service they received. In addition, review websites are growing in
popularity in every domain, for example Foody.vn for restaurant reviews, agoda.com,
booking.com, and mytour.vn for hotel reviews, and tinhte.vn for tech reviews. The
majority of customers likely use them to look up reviews of any product or service they
plan to purchase, even when they plan to shop in person. Any customer’s
opinion can be read by everyone very quickly, and a business can lose a large portion of
its customers to a bad review on the internet without even knowing about it. Therefore,
learning about customers’ opinions is one of the top priorities for any business that wants
to succeed: a company must ensure it is always aware of the general opinion. Doing
so not only gives it a better overview of its market growth, but also shows
what it can improve. One approach to analyzing customer opinions in as much
detail as possible is Aspect Based Sentiment Analysis (ABSA). With ABSA, an opinion
written in text can be classified into labels, so that we learn not only what the opinion is
about but also its sentiment (positive, neutral, or negative). Combined with
the abundance of reviews and opinions online, this can be priceless to any business, helping
it get an accurate view of its customer base.
With every decision a business makes, it has to track its customers’
responses and make changes accordingly, which can keep it from a marketing
catastrophe. On occasion, a user opinion circulates quickly on the internet, shows
up on front pages, and is seemingly shared by many people. Whether a
company should make changes in response to such an opinion is another question, because it
may come from a "loud minority", in which case listening to this loud crowd will actually
dissatisfy the majority of the customer base. This kind of decision can only be made with
sufficient information and broad enough coverage of customer opinions. This practice is
called "Brand Monitoring" and is used by every big brand name. It is the process of tracking
different channels to identify where a brand is mentioned and to understand how people
perceive it, which lets the brand keep an eye on potential crises and respond to questions
or criticism before they get out of control.
Not only the service business but any field that needs the general public’s
endorsement to succeed can benefit from learning about its audience. Politics is a prime
example. Almost every government in recent years employs a system to collect general
opinion, especially during presidential elections, when candidates need to gauge public
opinion and act accordingly. Nowadays most of these opinions are online and in large
quantities, too many for any human to sort through. ABSA is therefore highly valuable in
this field: it can help a government or candidate gain a significant advantage over an
opponent simply by knowing what the public wants and making the right statements.
With all these potential fields of application, ABSA is very useful for anyone who can
apply it. English ABSA systems have been extensively developed and applied to real-life
usage. However, for Vietnamese, a low-resource language, further research and development
are still required to bring ABSA to the point of effective commercial use. That is why, in
our work, we focus on improving on past works and developing a more effective system to
handle the Vietnamese ABSA tasks.
1.2 Project goal
The goal of our project is to build a model capable of classifying the aspects and sentiments
of a given review. We believe that with this system, customer opinions can be explored on
a large scale. More detailed profiles can be built for each user, and understanding customers’
shopping habits and preferences not only gives a more accurate overview of the customer
base but also enables better recommendations to customers, increasing sales and satisfaction.
With our work, we also hope to improve Vietnamese ABSA capabilities. We experiment
with applying Transformer models to ABSA and maximizing the potential of PhoBert
on a monolingual dataset. We also experiment with utilizing PhoBert’s sequence-pair input,
building auxiliary sentences for each review and treating ABSA like a question-answering
task, with the hope of better capturing aspect-sentiment relationships in each sequence.
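As a minimal sketch of how auxiliary sentences can turn one review into sequence-pair inputs, the Python snippet below pairs a review with one pseudo-sentence per candidate aspect and polarity and attaches a binary label. The template string, the function name `build_sequence_pairs`, and the binary labelling scheme are illustrative assumptions rather than the formats defined in Chapter 3, and tokenization and PhoBert itself are left out.

```python
from typing import Dict, List, Tuple

POLARITIES = ["positive", "neutral", "negative"]

def build_sequence_pairs(
    review: str,
    candidate_aspects: List[str],
    gold: Dict[str, str],
) -> List[Tuple[str, str, int]]:
    """Return (review, auxiliary sentence, label) triples.

    label is 1 when the (aspect, polarity) pair is annotated for the
    review and 0 otherwise. The "aspect - polarity" template is a
    placeholder chosen for illustration only.
    """
    pairs = []
    for aspect in candidate_aspects:
        for polarity in POLARITIES:
            auxiliary = f"{aspect} - {polarity}"  # hypothetical template
            label = 1 if gold.get(aspect) == polarity else 0
            pairs.append((review, auxiliary, label))
    return pairs

review = "the food is great too but the service is bad"
candidates = ["Service#General", "Food&Drinks#Quality"]
gold = {"Service#General": "negative"}
pairs = build_sequence_pairs(review, candidates, gold)
for _, auxiliary, label in pairs:
    print(auxiliary, label)
```

Each triple would then be encoded as a two-segment input for the pre-trained model, so the classifier only has to answer a yes/no question per pair.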
In our project, we study past works to learn their methodology and
advantages, and from that knowledge we develop our own method, improving on previous
models. Our method is a combination of components made specifically to improve the
performance of Bert. We not only test our work on public datasets but also
perform evaluations on real-life data crawled from review sites, in order to see
how well our model handles unfiltered data.
For our thesis, we also demonstrate our model’s potential by using it in a recommendation
system for the hotel domain. We use our model to build hotel profiles
from past customer reviews and, given a query from a user, analyze the aspects
and sentiments of that query and suggest suitable hotels. Our system also makes it
convenient for users to view past reviews by sorting them by relevance to the query,
so that users always see relevant information first.
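The idea of matching the aspects found in a query against hotel profiles can be illustrated with the following Python sketch. The profile layout, the scoring rule (net share of positive reviews for each requested aspect), and all names are hypothetical and chosen for illustration only; the score calculation actually used by our system is described in Chapter 5.

```python
from typing import Dict, List

# Hypothetical profile layout: aspect -> counts of positive / negative reviews.
Profile = Dict[str, Dict[str, int]]

def hotel_score(profile: Profile, query_aspects: List[str]) -> float:
    """Sum, over the requested aspects, of (positive - negative) / total."""
    score = 0.0
    for aspect in query_aspects:
        counts = profile.get(aspect, {})
        pos = counts.get("positive", 0)
        neg = counts.get("negative", 0)
        total = pos + neg
        if total:  # aspects never mentioned in reviews contribute nothing
            score += (pos - neg) / total
    return score

def rank_hotels(profiles: Dict[str, Profile], query_aspects: List[str]) -> List[str]:
    """Return hotel names ordered from best to worst match."""
    return sorted(
        profiles,
        key=lambda name: hotel_score(profiles[name], query_aspects),
        reverse=True,
    )

profiles = {
    "Hotel A": {
        "Hotel#Design&Features": {"positive": 8, "negative": 2},
        "Service#General": {"positive": 1, "negative": 9},
    },
    "Hotel B": {
        "Hotel#Design&Features": {"positive": 5, "negative": 5},
        "Service#General": {"positive": 9, "negative": 1},
    },
}
query_aspects = ["Hotel#Design&Features", "Service#General"]
ranking = rank_hotels(profiles, query_aspects)
print(ranking)
```

In this toy data, Hotel B ranks first because its strong service reviews outweigh Hotel A's advantage in design.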
1.3 Project scope
In our work, we build a system to solve the ABSA task of classifying user reviews
into aspects and sentiments. The datasets we use for training and evaluation are
public datasets from the Association for Vietnamese Language and Speech Processing
(VLSP) and the University of Information Technology Natural Language Processing club
(UIT NLP group).
The data we use for training and evaluation is in Vietnamese. Our
model is expected to classify Vietnamese text written with proper diacritics and clearly
expressed ideas, that is, any text a human can interpret without prior
knowledge of slang or abbreviations.
1.4 Project structure
Our thesis consists of six chapters, including this one, followed by an appendix containing
our research paper. The content of each chapter is as follows:
1. Chapter 1: Introduction. In this chapter, we introduce our
project, explaining the reason we chose it, its goal and
scope, and our general direction.
2. Chapter 2: Aspect based sentiment analysis. In this chapter, we describe
our task, ABSA, in detail, explaining what ABSA is and what it aims to achieve.
We also present past works on this task along with their methodology and
results. The datasets we use for training and evaluation are also introduced
in this chapter.
3. Chapter 3: Model Architecture. In this chapter, we introduce our model,
its inspiration, and its architecture, explaining in more detail
how it functions and what each component of the system does.
4. Chapter 4: Experimental results and discussion. In this chapter, we
present our experimental setups and results, comparing
them with past works on the same datasets. We also present our real-life
dataset crawled from review sites and evaluate our model on it.
5. Chapter 5: Model Application. In this chapter, we apply our model to a
real-life task, building an application that serves as a recommendation system for
hotel booking. We explain our goals for this application and how it works in detail.
6. Chapter 6: Conclusion. In this chapter, we summarize our results and our
project’s pros and cons. We also discuss possible future developments of this
project.
Chapter 2
Aspect Based Sentiment Analysis
2.1 What is ABSA
Sentiment Analysis (also known as Opinion Mining) is the process of determining
what a user thinks about a certain product, service, or any particular domain. It is
a classification task whose purpose is to annotate a portion of text with a positive,
negative, or neutral sentiment. ABSA is an evolution of Sentiment Analysis that aims at
capturing the aspect-level opinions expressed in natural language texts [2]. A user opinion
or review can consist of dozens or hundreds of words about multiple aspects, with a
different sentiment toward each, and determining which sentiment words go with which aspect
can be very difficult. With ABSA, reviews of a product can be analyzed in detail,
showing the reviewer’s opinion on each aspect of that product.
The main process of ABSA is as follows. Given a customer review in a domain
(e.g. hotel or restaurant), the goal is to identify the set of (Aspect, Polarity) pairs that
fit the opinions expressed in the review. Each aspect is a pair of an entity and an attribute,
and the polarity is one of negative, neutral, or positive. For each domain, all possible
combinations of entities and attributes are predefined. The ABSA task is divided
into two phases: (i) identify the entity-attribute pairs, and (ii) determine the sentiment
polarity of each aspect (entity#attribute) identified in the previous phase.
For example, take the review "This place has an amazing view, the food is great too but the
service is bad". After phase (i), the entity-attribute pairs are {Hotel#Design&Features,
Food&Drinks#Quality, Service#General}, and after phase (ii), the output is
Hotel#Design&Features: Positive, Food&Drinks#Quality: Positive, Service#General:
Negative.
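The two-phase process can be sketched in Python as follows. This is only an illustration of the input and output shapes: `stub_detect_aspects` and `stub_classify_polarity` are hypothetical stand-ins that return the labels of the example above, where a real system would call trained classifiers.

```python
from typing import Callable, Dict, List, Tuple

def split_aspect(aspect: str) -> Tuple[str, str]:
    """Split an 'Entity#Attribute' label into (entity, attribute)."""
    entity, attribute = aspect.split("#", 1)
    return entity, attribute

def run_two_phases(
    review: str,
    detect_aspects: Callable[[str], List[str]],
    classify_polarity: Callable[[str, str], str],
) -> Dict[str, str]:
    """Phase (i): find the aspects. Phase (ii): assign a polarity to each."""
    aspects = detect_aspects(review)
    return {aspect: classify_polarity(review, aspect) for aspect in aspects}

def format_result(result: Dict[str, str]) -> str:
    return ", ".join(f"{aspect}: {polarity}" for aspect, polarity in result.items())

# Hypothetical stand-ins for the two trained classifiers.
def stub_detect_aspects(review: str) -> List[str]:
    return ["Hotel#Design&Features", "Food&Drinks#Quality", "Service#General"]

def stub_classify_polarity(review: str, aspect: str) -> str:
    return "Negative" if aspect.startswith("Service") else "Positive"

review = ("This place has an amazing view, the food is great too "
          "but the service is bad")
result = run_two_phases(review, stub_detect_aspects, stub_classify_polarity)
print(format_result(result))
```

Keeping the two phases as separate functions mirrors the task definition: the polarity classifier is only ever asked about aspects that phase (i) has already identified.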