Transductive Support Vector Machines
for Cross-lingual Sentiment Classification
Nguyen Thi Thuy Linh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Professor Ha Quang Thuy
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
December, 2009
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and, to the best of my
knowledge, it contains no materials previously published or written by another
person, or substantial proportions of material which have been accepted for the
award of any other degree or diploma at the University of Engineering and Technology
or any other educational institution, except where due acknowledgement is made
in the thesis. I also declare that the intellectual content of this thesis is the
product of my own work, except to the extent that assistance from others in the
project’s design and conception or in style, presentation and linguistic expression
is acknowledged.’
Signed ........................................................................
Abstract
Sentiment classification has received much attention and has many useful applications
in business and intelligence. This thesis investigates the sentiment classification problem
using machine learning techniques. Because Vietnamese sentiment corpora are limited,
while many English sentiment corpora are available on the Web, we
combine English corpora as training data with a number of unlabeled Vietnamese
data in a semi-supervised model. Machine translation bridges the language gap between
the training set and the test set in our model. Moreover, we also examine types
of features to obtain the best performance.
The results show that the semi-supervised classifier is quite good at leveraging
the cross-lingual corpus compared with the classifier without the cross-lingual corpus. In
terms of features, we find that using only the unigram model turns out to perform best.
Acknowledgements
I am grateful to my advisor, Associate Professor Ha Quang Thuy, who has guided and
encouraged me since I was an undergraduate student. I have learned much about machine
learning and natural language processing from him, and I appreciate his guidance
and assistance.
I thank Assistant Professor Nguyen Le Minh of JAIST (Japan Advanced Institute of Science and Technology) for his valuable comments and helpful suggestions since I started research on my thesis.
I am also thankful to the members of the Smart Integrated Systems Laboratory (SIS Lab) and
the Information Systems Department at the University of Engineering and Technology,
all of whom have always supported me in work and study.
I thank the members of the Natural Language Processing Laboratory at JAIST
for their collaboration during the time I was an exchange student at JAIST.
I would like to dedicate this thesis to my wonderful family - from whom I’ve
learnt many things about life including the process of scientific thought.
Hanoi, December 2009
Nguyen Thi Thuy Linh
Table of Contents

1 Introduction
   1.1 Introduction
   1.2 What might be involved?
   1.3 Our approach
   1.4 Related works
      1.4.1 Sentiment classification
         1.4.1.1 Sentiment classification tasks
         1.4.1.2 Sentiment classification features
         1.4.1.3 Sentiment classification techniques
         1.4.1.4 Sentiment classification domains
      1.4.2 Cross-domain text classification

2 Background
   2.1 Sentiment Analysis
      2.1.1 Applications
   2.2 Support Vector Machines
   2.3 Semi-supervised techniques
      2.3.1 Generate maximum-likelihood models
      2.3.2 Co-training and bootstrapping
      2.3.3 Transductive SVM

3 The semi-supervised model for cross-lingual approach
   3.1 The semi-supervised model
   3.2 Review Translation
   3.3 Features
      3.3.1 Words Segmentation
      3.3.2 Part of Speech Tagging
      3.3.3 N-gram model

4 Experiments
   4.1 Experimental set up
   4.2 Data sets
   4.3 Evaluation metric
   4.4 Features
   4.5 Results
      4.5.1 Effect of cross-lingual corpus
      4.5.2 Effect of extraction features
         4.5.2.1 Using stopword list
         4.5.2.2 Segmentation and Part of speech tagging
         4.5.2.3 Bigram
      4.5.3 Effect of features size

5 Conclusion and Future Works

Appendix A
Appendix B
List of Figures

1.1 An application of sentiment classification
2.1 Visualization of opinion summary and comparison
2.2 Hyperplanes separate data points
3.1 Semi-supervised model with cross-lingual corpus
4.1 The effects of feature size
4.2 The effects of training size
List of Tables

3.1 An example of Vietnamese Words Segmentation
3.2 An example of Vietnamese Words Segmentation
3.3 An example of Unigrams and Bigrams
4.1 Tools and Application in Usage
4.2 The effect of cross-lingual corpus
4.3 The effect of selection features
A.1 Vietnamese Stopwords List by (Dan, 1987)
B.1 POS List by (VLSP, 2009)
B.2 subPos list by (VLSP, 2009)
Chapter 1
Introduction

1.1 Introduction
“What other people think” has always been an important source of information for
most of us during the decision-making process. Long before the explosion of the
World Wide Web, we asked our friends to recommend a car, or to explain
the movie they were planning to watch, or we consulted Consumer Reports to
determine which television to buy. But now, with the explosion of Web 2.0
platforms such as blogs, discussion forums, review sites and various other types of social
media, consumers have unprecedented power to share their brand
experiences and opinions. This development has made it possible to find out the biases
and the recommendations of a vast pool of people with whom we have no acquaintance.
In such social websites, users post comments regarding the subject that
is being discussed. Blogs are one example: each entry or posted article is a subject, and readers
express their opinions on it, whether they agree or disagree. Another
example is a commercial website where products are purchased online. Each product
is a subject, and consumers leave comments about their experience
after acquiring and using the product. There are plenty of instances of opinions created
in online documents in that way. However, with such a large amount of
information available on the Internet, it must be organized to be put to the best
use. As part of the effort to better exploit this information to support
users, researchers have been actively investigating the problem of automatic sentiment
classification.
Sentiment classification is a type of text categorization which labels a posted
comment as belonging to the positive or the negative class; in some cases it also includes a neutral class.
We focus on just the positive and negative classes in this work. In fact, labeling posted
comments with consumer sentiment provides succinct summaries to readers.
Sentiment classification has many important applications in business and intelligence (Pang & Lee, 2008); therefore we need to consider looking into this matter.
Vietnam is no exception: by now there are more and more Vietnamese social websites
and online commercial products that have attracted much interest from
the youth. Facebook [1] is a social network that now has about 10 million users. Youtube [2]
is a famous website supplying clips that users watch and comment
on. Nevertheless, Vietnamese data has so far received little attention, so we investigate
sentiment classification on Vietnamese data as the work of this thesis.
We consider one application for merchant sites. A popular product may receive
hundreds of consumer reviews. This makes it very hard for a potential customer to
read them all when deciding whether to buy the product. In order to
support customers, product review summarization systems are built. For example,
assume that we summarize the reviews of a particular digital camera, a Canon 8.1, as in
Figure 1.1.
Canon 8.1:
Aspect: picture quality
- Positive:
- Negative:
Aspect: size
- Positive:
- Negative:
Figure 1.1: An application of sentiment classification
Picture quality and size are aspects of the product. A number of tasks are involved in
such summarization systems, among which sentiment classification is a crucial job; it
is one of the steps in this summarizer.
[1] http://www.facebook.com
[2] http://www.youtube.com
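The aggregation behind Figure 1.1 can be sketched in a few lines; the comments and labels below are invented for illustration, and the sentiment classifier that produces the labels is assumed to exist already.

```python
# A toy sketch of the aspect-based review summary of Figure 1.1:
# count classifier-assigned positive/negative labels per product aspect.
# The (aspect, label) pairs are invented sample data.
from collections import defaultdict

labeled_comments = [
    ("picture quality", "positive"),
    ("picture quality", "negative"),
    ("size", "positive"),
    ("picture quality", "positive"),
]

summary = defaultdict(lambda: {"positive": 0, "negative": 0})
for aspect, label in labeled_comments:
    summary[aspect][label] += 1

for aspect, counts in summary.items():
    print(aspect, counts)
```

In a full system, the labels would come from the sentiment classifier studied in this thesis rather than being given by hand.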
1.2 What might be involved?
As mentioned in the previous section, sentiment classification is a specific kind of text
classification in machine learning. The number of classes in this task is commonly two:
the positive and the negative class. Consequently, many machine learning techniques can be applied to sentiment classification. Text categorization in general is
topic-based text categorization, where each word receives a topic distribution, whereas
in sentiment classification consumers express their bias through sentiment words.
This difference should be examined and considered in order to obtain better performance.
On the other hand, annotated Vietnamese data is limited. That makes
learning based on supervised methods challenging. In previous Vietnamese text
classification research, the learning phase employed a training set of approximately
8000 documents (Linh, 2006). Because annotating is expert work
and labor intensive, Vietnamese sentiment classification is all the more
challenging.
1.3 Our approach
To date, a variety of corpus-based methods have been developed for sentiment classification. These methods usually rely heavily on an annotated corpus for training the
sentiment classifier; sentiment corpora are considered the most valuable
resources for the sentiment classification task. However, such resources are very
imbalanced across languages. Because most previous work studies English
sentiment classification, many annotated corpora for English sentiment classification are freely available on the Internet. In order to face the challenge of the limited
Vietnamese corpus, we propose to leverage rich English corpora for Vietnamese sentiment classification. In this thesis, we examine the effects of cross-lingual sentiment
classification, which leverages only English training data for learning the classifier without using any Vietnamese resources. To achieve better performance, we employ
semi-supervised learning in which we utilize 960 annotated Vietnamese reviews. We
also examine the effect of feature selection in Vietnamese sentiment classification
by applying natural language processing techniques. Although we study the Vietnamese domain, this approach can be applied to many other languages.
1.4 Related works
1.4.1 Sentiment classification

1.4.1.1 Sentiment classification tasks
Sentiment categorization can be conducted at the document, sentence, or phrase (part
of a sentence) level. Document-level categorization attempts to classify sentiments in
movie reviews, product reviews, news articles, or Web forum posts (Pang et al.,
2002; Hu & Liu, 2004b; Pang & Lee, 2004). Sentence-level categorization classifies
positive or negative sentiments for each sentence (Pang & Lee, 2004; Mullen &
Collier, 2004). Work on phrase-level categorization captures multiple sentiments
that may be present within a single sentence. In this study we focus on document-level
sentiment categorization.
1.4.1.2 Sentiment classification features
The types of features used in previous sentiment classification work include
syntactic, semantic, link-based, and stylistic features. Along with semantic features,
syntactic properties are the most commonly used set of features for sentiment
classification. These include word n-grams (Pang et al., 2002; Gamon et al., 2005)
and part-of-speech tags (Pang et al., 2002).
Semantic features integrate manual or semi-automatic annotation to add polarity
or scores to words and phrases. Turney (Turney, 2002) used a mutual information
calculation to automatically compute the SO score for each word and phrase, while
Hu and Liu (Hu & Liu, 2004b; Hu & Liu, 2004a) made use of synonyms and
antonyms in WordNet to identify sentiment.
1.4.1.3 Sentiment classification techniques
Previously used techniques for sentiment classification can be classified into three groups: machine learning, link analysis methods, and score-based
approaches.
Many studies used machine learning algorithms such as support vector machines
(SVM) (Pang et al., 2002; Wan, 2009; Efron, 2004) and Naive Bayes (NB) (Pang
et al., 2002; Pang & Lee, 2004). SVMs have surpassed other machine
learning techniques such as NB or Maximum Entropy (Pang et al., 2002).
Link analysis methods for sentiment classification are grounded in link-based features and metrics. (Efron, 2004) used co-citation analysis for sentiment
classification of Website opinions.
Score-based methods are typically used in conjunction with semantic features.
These techniques classify review sentiments by the total sum of the positive or negative sentiment features they comprise (Turney & Littman, 2002).
1.4.1.4 Sentiment classification domains
Sentiment classification has been applied to numerous domains, including reviews,
Web discussion groups, etc. Reviews include movie, product and music reviews (Pang
et al., 2002; Hu & Liu, 2004b; Wan, 2008). Web discussion groups include Web forums,
newsgroups and blogs.
In this thesis, we investigate sentiment classification using semantic features in
comparison to syntactic features. Because of the outperformance of the SVM algorithm,
we apply machine learning with an SVM classifier. We study product
reviews, for which corpora are available on the Internet.
1.4.2 Cross-domain text classification
Cross-domain text classification can be considered a more general task than cross-lingual sentiment classification. In the case of cross-domain text classification, the
labeled and unlabeled data originate from different domains. Correspondingly, in the case
of cross-lingual sentiment classification, the labeled data come from one language and
the unlabeled data come from another.
In particular, several previous studies focus on the problem of cross-lingual text
classification, which can be considered a special case of general cross-domain text
classification. A few novel models have been proposed for this problem,
for example, the information bottleneck approach, multilingual domain models,
and the co-training algorithm.
Chapter 2
Background

2.1 Sentiment Analysis
The Web, with Web 2.0 technology, has dramatically changed the way that people
express their opinions. Now they can post comments on products at merchant sites
and express their views on almost anything in Internet forums, discussion groups,
blogs, etc., which are generally called user-generated content or user-generated
media. Along with this so-called user-generated content, sentiment analysis has
drawn much attention in the Natural Language Processing (NLP) field. Sentiment
analysis attempts to identify and analyze opinions and emotions. Hearst and Wiebe
originally proposed the idea of mining direction-based text, namely, text containing
opinions, sentiments, affects, and biases. In some documents, the concepts “sentiment
analysis” and “opinion mining” are interchangeable, although their original meanings
differ slightly.
There are several tasks attracting much interesting research in the sentiment analysis field,
of which sentiment classification is one of the major tasks. This task treats opinion mining
as a text classification problem. It classifies an evaluative text as being positive or
negative. For example, given a product review, the system determines whether the
review expresses a positive or a negative sentiment of the reviewer.
Given a set of evaluative texts D, a sentiment classifier categorizes each document
d ∈ D into one of the two classes, positive and negative. Positive means that d
expresses a positive opinion. Negative means that d expresses a
negative opinion.
2.1.1 Applications
Opinions are so important that whenever one needs to make a decision, one wants
to hear others’ opinions. This is true for both individuals and organizations. The
technology of opinion mining thus has a tremendous scope for practical applications.
Individual consumers: If an individual wants to purchase a product, it is useful
to see a summary of the opinions of existing users so that he/she can make an informed
decision. This is better than reading a large number of reviews to form a mental
picture of the strengths and weaknesses of the product. He/she can also compare
the summaries of opinions of competing products, which is even more useful. An
example in Figure 2.1 shows this.
Organizations and businesses: Opinion mining is equally, if not more, important to businesses and organizations. For example, it is critical for a product
manufacturer to know how consumers perceive its product and those of its competitors. This information is useful not only for marketing and product benchmarking
but also for product design and product development.
The major application of sentiment classification is to give a quick view of the
prevailing opinion on an object so that people can see “what others think” easily.
The task is similar to, but different from, classic topic-based text classification, which
classifies documents into predefined topic classes, e.g., politics, sport, education, science, etc. In topic-based classification, topic-related words are important. However,
in sentiment classification, topic-related words are unimportant. Instead, sentiment
words that indicate positive or negative opinions are important, e.g., great, interesting, good, terrible, worst, etc.
2.2 Support Vector Machines
The SVM algorithm was first developed in 1963 by Vapnik and Lerner. However,
SVMs only began to attract wide attention in 1995 with the appearance of Vapnik’s book “The
Nature of Statistical Learning Theory”. Among the many learning algorithms
for text classification, SVMs have performed very successfully. In text classification,
suppose some given data points each belong to one of two classes; the classification
task is to decide which class a new data point belongs to. For a support vector
machine, each data point is viewed as a p-dimensional vector, and the goal
becomes finding a (p − 1)-dimensional hyperplane that can separate such points.

Figure 2.1: Visualization of opinion summary and comparison

Figure 2.2: Hyperplanes separate data points

This hyperplane is a classifier, or in other words a linear classifier. Obviously,
there are many such hyperplanes separating the data. However, maximum separation
between the two classes is what we desire. Accordingly, we choose the hyperplane such that
the distance from it to the nearest data point on each side is maximized.
Given a set of points D = {(x_i, y_i) | x_i ∈ R^p, y_i ∈ {−1, 1}}, i = 1, ..., n, where y_i is either
1 or −1, indicating the class to which the point x_i belongs, we seek a
hyperplane w that not only separates the data vectors in one class from those in the
other, but for which the separation, or margin, is as large as possible. Searching for such a
hyperplane corresponds to a constrained optimization problem. The solution can be
written as

    w = Σ_j α_j c_j x_j,    α_j ≥ 0,

where the coefficients α_j greater than zero are obtained by solving a dual optimization problem. Those x_j with α_j > 0 are called support vectors, since they are the only data vectors contributing to
w. Classifying new instances consists simply of determining which side of the
hyperplane w they fall on.
The above formulation is the primal form. Writing the classification rule in its
unconstrained dual form reveals that the maximum-margin hyperplane, and therefore the classification task, is only a function of the support vectors, the training
data that lie on the margin.

Using the fact that ‖w‖² = w · w and substituting w = Σ_j α_j c_j x_j with α_j ≥ 0,
one can show that the dual of the SVM boils down to the following optimization
problem:

    Maximize (in α_j):    Σ_{j=1}^n α_j − (1/2) Σ_{i,j} α_i α_j c_i c_j x_i^T x_j
    subject to (for any j = 1, ..., n):    α_j ≥ 0    and    Σ_{j=1}^n α_j c_j = 0.

The α terms constitute a dual representation for the weight vector in terms of
the training set:

    w = Σ_j α_j c_j x_j,    α_j ≥ 0.
For simplicity, it is sometimes required that the hyperplane pass through
the origin of the coordinate system. Such hyperplanes are called unbiased, whereas
general hyperplanes not necessarily passing through the origin are called biased. An
unbiased hyperplane can be enforced by setting b = 0 in the primal optimization
problem. The corresponding dual is identical to the dual given above but without the
equality constraint Σ_{j=1}^n α_j c_j = 0.

There are extensions to the linear SVM, namely soft margins and non-linear
classification. We do not describe them in detail in this thesis; see (Vapnik, 1998)
for more.
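As a concrete illustration of the linear SVM as a text classifier, the sketch below uses scikit-learn (an off-the-shelf library, not a tool from this thesis) with unigram bag-of-words features; the four toy reviews and their labels are invented.

```python
# A minimal sketch of a linear SVM text classifier on unigram
# bag-of-words counts; the toy reviews below are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["great picture quality", "terrible battery life",
        "good camera, love it", "worst product ever"]
labels = [1, -1, 1, -1]  # 1 = positive, -1 = negative

vec = CountVectorizer()   # unigram bag-of-words features
X = vec.fit_transform(docs)

clf = LinearSVC(C=1.0)    # C trades margin size against training errors
clf.fit(X, labels)

print(clf.predict(vec.transform(["great camera"])))
```

The learned weight vector w plays exactly the role of the dual expansion above: each support vector contributes α_j c_j x_j to it.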
2.3 Semi-supervised techniques

2.3.1 Generate maximum-likelihood models
From early research in semi-supervised learning, the Expectation Maximization (EM)
algorithm has been studied for several Natural Language Processing (NLP) tasks.
EM has also been successful in text classification (Nigam et al., 2000).
EM is an iterative method which alternates between performing an expectation
(E) step and a maximization (M) step. The goal is to find maximum likelihood
estimates of parameters in probabilistic models. One problem with this approach, and with
other generative models, is that it is difficult to incorporate arbitrary, interdependent
features that may be useful for solving the task.
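The E/M alternation above can be sketched for text classification as a small semi-supervised Naive Bayes (multinomial mixture) model. This is our own simplified illustration, not the exact model of (Nigam et al., 2000); the function name and toy interface are assumptions.

```python
# A compact EM sketch for semi-supervised text classification with a
# multinomial Naive Bayes mixture: labeled responsibilities stay fixed,
# unlabeled responsibilities are re-estimated each iteration.
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unl, n_iter=20, alpha=1.0):
    """X_* are dense doc-by-vocab count matrices; y_lab in {0, 1}."""
    n_classes = 2
    Y_lab = np.eye(n_classes)[y_lab]                   # one-hot, fixed
    R_unl = np.full((len(X_unl), n_classes), 0.5)      # uniform init
    X = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R = np.vstack([Y_lab, R_unl])
        # M-step: class priors and Laplace-smoothed word probabilities
        prior = R.sum(axis=0) / R.sum()
        counts = R.T @ X + alpha
        word_p = counts / counts.sum(axis=1, keepdims=True)
        # E-step: posterior class probabilities for the unlabeled docs
        log_post = np.log(prior) + X_unl @ np.log(word_p).T
        log_post -= log_post.max(axis=1, keepdims=True)
        R_unl = np.exp(log_post)
        R_unl /= R_unl.sum(axis=1, keepdims=True)
    return prior, word_p, R_unl
```

On separable toy counts, the responsibilities of the unlabeled documents converge to the intuitively correct class.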
2.3.2 Co-training and bootstrapping
A number of semi-supervised approaches are grounded in the co-training framework
(Blum & Mitchell, 1998), which assumes each document in the input domain can be
separated into two independent views conditioned on the output class. One important
aspect to take into account when applying co-training is this assumption. In
fact, the co-training algorithm is a typical bootstrapping method, which starts with
a set of labeled data and increases the amount of annotated data incrementally using some amount
of unlabeled data. To date, co-training has been successfully
applied to named-entity classification, statistical parsing, part-of-speech tagging and
sentiment classification.
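The co-training loop can be sketched as follows. This is a minimal illustration of the Blum & Mitchell scheme, with Naive Bayes standing in for the view classifiers; the function name and toy interface are our own assumptions, and a real application would use two genuinely independent feature views.

```python
# A minimal co-training sketch: two classifiers, each trained on its own
# view, take turns labeling the unlabeled examples they are most
# confident about and adding them to the labeled pool.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, lab_idx, unl_idx, rounds=5, k=1):
    """X1, X2: the two feature views; y holds -1 for unlabeled examples."""
    y = np.asarray(y).copy()
    lab, unl = list(lab_idx), list(unl_idx)
    for _ in range(rounds):
        for X in (X1, X2):
            if not unl:
                return y, lab
            clf = MultinomialNB().fit(X[lab], y[lab])
            proba = clf.predict_proba(X[unl])
            # pick the k unlabeled examples this view is most confident about
            order = np.argsort(proba.max(axis=1))[::-1][:k]
            newly = [unl[i] for i in order]
            y[newly] = clf.predict(X[newly])
            lab += newly
            unl = [j for j in unl if j not in newly]
    return y, lab
```

Each view teaches the other: labels confidently predicted from one view become training data for both.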
2.3.3 Transductive SVM
Thorsten Joachims (Joachims, 1999) proposed a widely used semi-supervised version of the SVM
algorithm. Suppose that we have l labeled examples {(x_i, y_i)}, i = 1, ..., l,
called the set L, and u unlabeled examples {x*_j}, j = 1, ..., u, called the set U, where x_i, x*_j ∈ R^d and
y_i ∈ {−1, 1}. The goal is to construct a learner by making use of both L and U.
The optimization problem is as follows:

OP: Transductive SVM

    Minimize over (y*_1, ..., y*_u, w, b, ξ_1, ..., ξ_l, ξ*_1, ..., ξ*_u):
        (1/2) ‖w‖² + C Σ_{i=1}^l ξ_i + C* Σ_{j=1}^u ξ*_j
    subject to:
        for all i = 1, ..., l:    y_i [w · x_i + b] ≥ 1 − ξ_i
        for all j = 1, ..., u:    y*_j [w · x*_j + b] ≥ 1 − ξ*_j
        for all i = 1, ..., l:    ξ_i > 0
        for all j = 1, ..., u:    ξ*_j > 0
C and C* are set by the user. They allow trading off margin size against misclassifying training data or excluding test data. Training a transductive SVM means
solving the combinatorial optimization problem OP. For a small number of test examples, this problem can be solved simply by trying all possible assignments of
y*_1, ..., y*_u to the two classes. However, when the amount of test data is large, we can only find
an approximate solution to optimization problem OP using a form of local search.
The key idea of the algorithm is that it begins with a labeling of the examples
in the set U based on the classification of an inductive SVM. It then improves
the solution by switching the labels of pairs of test examples that are misclassified.
After that, the algorithm retrains the model, taking the labeled data in both the L and U sets as input.
The loop stops after a finite number of iterations, since C*_− and C*_+ are bounded by C*. In each iteration, the algorithm relabels
a pair of misclassified examples, so the number of such wrongly classified pairs bounds the number
of iterations.
TSVM has been successful for text classification (Joachims, 1998; Pang et al.,
2002). That is the reason we employ this semi-supervised algorithm.
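The label-switching local search can be sketched as below. This is a simplified illustration only: Joachims' SVMlight implements the actual TSVM algorithm, whereas here a standard inductive SVM (scikit-learn's SVC) is refit inside the loop, with per-sample weights standing in for the C versus C* trade-off, and the pair-switch test ξ_i + ξ_j > 2 taken from the description above.

```python
# A simplified local-search sketch of the transductive SVM: start from
# an inductive labeling of the test set, then repeatedly switch one
# positive/negative pair of test labels whose combined slack exceeds 2.
import numpy as np
from sklearn.svm import SVC

def tsvm_local_search(X_lab, y_lab, X_unl, C=1.0, C_star=0.1, max_iter=50):
    n_unl = len(X_unl)
    X = np.vstack([X_lab, X_unl])
    # sample weights approximate C for labeled and C* for unlabeled data
    sw = np.r_[np.full(len(X_lab), 1.0), np.full(n_unl, C_star / C)]
    # initial labeling of U from an inductive SVM trained on L only
    y_unl = SVC(kernel="linear", C=C).fit(X_lab, y_lab).predict(X_unl)
    for _ in range(max_iter):
        y = np.concatenate([y_lab, y_unl])
        svm = SVC(kernel="linear", C=C).fit(X, y, sample_weight=sw)
        f = svm.decision_function(X_unl)
        xi = np.maximum(0.0, 1.0 - y_unl * f)     # slacks of test examples
        pos = [i for i in range(n_unl) if y_unl[i] == 1]
        neg = [i for i in range(n_unl) if y_unl[i] == -1]
        if not pos or not neg:
            break
        i = max(pos, key=lambda t: xi[t])
        j = max(neg, key=lambda t: xi[t])
        if xi[i] + xi[j] <= 2.0:                  # no improving pair switch
            break
        y_unl[i], y_unl[j] = -1, 1                # switch one pos/neg pair
    return y_unl
```

On cleanly separable data the initial inductive labeling already satisfies the margin constraints, so the loop terminates without switching any labels.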