Transductive Support Vector Machines
for Cross-lingual Sentiment Classification
Nguyen Thi Thuy Linh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Professor Ha Quang Thuy
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
December, 2009
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and, to the best of my
knowledge, it contains no materials previously published or written by another
person, or substantial proportions of material which have been accepted for the
award of any other degree or diploma at the University of Engineering and Technology
or any other educational institution, except where due acknowledgement is made
in the thesis. I also declare that the intellectual content of this thesis is the
product of my own work, except to the extent that assistance from others in the
project’s design and conception or in style, presentation and linguistic expression
is acknowledged.’
Signed ........................................................................
Abstract
Sentiment classification has received much attention and has many useful applications
in business and intelligence. This thesis investigates the sentiment classification problem
using machine learning techniques. Because Vietnamese sentiment corpora are limited,
while many English sentiment corpora are available on the Web, we
combine English corpora as training data with a number of unlabeled Vietnamese
data in a semi-supervised model. Machine translation bridges the language gap between
the training set and the test set in our model. Moreover, we also examine types
of features to obtain the best performance.
The results show that the semi-supervised classifier is quite good at leveraging
the cross-lingual corpus compared with the classifier without the cross-lingual corpus. In
terms of features, we find that using only the unigram model turns out to perform best.
Acknowledgements
I am grateful to my advisor, Associate Professor Ha Quang Thuy, who has guided and
encouraged me since I was an undergraduate student. I have learned much about machine
learning and natural language processing from him, and I appreciate his guidance
and assistance.
I thank Assistant Professor Nguyen Le Minh of JAIST (Japan Advanced Institute of Science and Technology) for his valuable comments and helpful suggestions since I started research on my thesis.
I am also thankful to the members of the Smart Integrated Systems Laboratory (SIS Lab) and
the Information Systems Department at the University of Engineering and Technology,
all of whom have always supported me in work and study.
I thank the members of the Natural Language Processing Laboratory at JAIST
for their collaboration during the time I was an exchange student at JAIST.
I would like to dedicate this thesis to my wonderful family - from whom I’ve
learnt many things about life including the process of scientific thought.
Hanoi, December 2009
Nguyen Thi Thuy Linh
Table of Contents

1 Introduction
   1.1 Introduction
   1.2 What might be involved?
   1.3 Our approach
   1.4 Related works
      1.4.1 Sentiment classification
         1.4.1.1 Sentiment classification tasks
         1.4.1.2 Sentiment classification features
         1.4.1.3 Sentiment classification techniques
         1.4.1.4 Sentiment classification domains
      1.4.2 Cross-domain text classification

2 Background
   2.1 Sentiment Analysis
      2.1.1 Applications
   2.2 Support Vector Machines
   2.3 Semi-supervised techniques
      2.3.1 Generate maximum-likelihood models
      2.3.2 Co-training and bootstrapping
      2.3.3 Transductive SVM

3 The semi-supervised model for cross-lingual approach
   3.1 The semi-supervised model
   3.2 Review Translation
   3.3 Features
      3.3.1 Words Segmentation
      3.3.2 Part of Speech Tagging
      3.3.3 N-gram model

4 Experiments
   4.1 Experimental set up
   4.2 Data sets
   4.3 Evaluation metric
   4.4 Features
   4.5 Results
      4.5.1 Effect of cross-lingual corpus
      4.5.2 Effect of extraction features
         4.5.2.1 Using stopword list
         4.5.2.2 Segmentation and Part of speech tagging
         4.5.2.3 Bigram
      4.5.3 Effect of features size

5 Conclusion and Future Works

Appendix A
Appendix B
List of Figures

1.1 An application of sentiment classification
2.1 Visualization of opinion summary and comparison
2.2 Hyperplanes separate data points
3.1 Semi-supervised model with cross-lingual corpus
4.1 The effects of feature size
4.2 The effects of training size
List of Tables

3.1 An example of Vietnamese Words Segmentation
3.2 An example of Vietnamese Words Segmentation
3.3 An example of Unigrams and Bigrams
4.1 Tools and Application in Usage
4.2 The effect of cross-lingual corpus
4.3 The effect of selection features
A.1 Vietnamese Stopwords List by (Dan, 1987)
B.1 POS List by (VLSP, 2009)
B.2 subPos list by (VLSP, 2009)
Chapter 1
Introduction

1.1 Introduction
“What other people think” has always been an important source of information for
most of us during the decision-making process. Long before the explosion of the
World Wide Web, we asked our friends to recommend a car, or to explain
the movie they were planning to watch, or we consulted Consumer Reports to
determine which television to buy. But now, with the explosion of Web 2.0
platforms such as blogs, discussion forums, review sites and various other types of social
media, consumers have unprecedented power to share their brand
experiences and opinions. This development has made it possible to find out the biases
and the recommendations of a vast pool of people with whom we have no acquaintance.
In such social websites, users post comments regarding the subject that
is being discussed. Blogs are one example: each entry or posted article is a subject, and readers
express their opinions on it, whether they agree or disagree. Another
example is a commercial website where products are purchased online. Each product
is a subject, and consumers leave comments about their experience
after acquiring and using the product. There are plenty of instances of opinions created
in online documents in that way. However, with such a large amount of
information available on the Internet, it must be organized to be put to the best
use. As part of the effort to better exploit this information to support
users, researchers have been actively investigating the problem of automatic sentiment
classification.
Sentiment classification is a type of text categorization which labels a posted
comment as belonging to the positive or the negative class; in some cases it also includes a neutral class.
We focus on just the positive and negative classes in this work. In fact, labeling posted
comments with consumer sentiment provides succinct summaries to readers.
Sentiment classification has many important applications in business and intelligence (Pang & Lee, 2008); therefore we need to consider looking into this matter.
Vietnam is no exception: by now there are more and more Vietnamese social websites
and online commercial products that have attracted much interest from
the youth. Facebook [1] is a social network that now has about 10 million users. Youtube [2]
is a famous website supplying clips that users watch and comment
on. Nevertheless, Vietnamese data has so far received little attention, so we investigate
sentiment classification on Vietnamese data as the work of this thesis.
We consider one application for merchant sites. A popular product may receive
hundreds of consumer reviews. This makes it very hard for a potential customer to
read them all when deciding whether to buy the product. In order to
support customers, product review summarization systems are built. For example,
assume that we summarize the reviews of a particular digital camera, a Canon 8.1, as in
Figure 1.1.
Canon 8.1:
Aspect: picture quality
- Positive:
- Negative:
Aspect: size
- Positive:
- Negative:
Figure 1.1: An application of sentiment classification
Picture quality and size are aspects of the product. A number of tasks are involved in
such summarization systems, among which sentiment classification is a crucial job; it
is one of the steps in this summarizer.
[1] http://www.facebook.com
[2] http://www.youtube.com
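The aggregation behind Figure 1.1 can be sketched in a few lines; the comments and labels below are invented for illustration, and the sentiment classifier that produces the labels is assumed to exist already.

```python
# A toy sketch of the aspect-based review summary of Figure 1.1:
# count classifier-assigned positive/negative labels per product aspect.
# The (aspect, label) pairs are invented sample data.
from collections import defaultdict

labeled_comments = [
    ("picture quality", "positive"),
    ("picture quality", "negative"),
    ("size", "positive"),
    ("picture quality", "positive"),
]

summary = defaultdict(lambda: {"positive": 0, "negative": 0})
for aspect, label in labeled_comments:
    summary[aspect][label] += 1

for aspect, counts in summary.items():
    print(aspect, counts)
```

In a full system, the labels would come from the sentiment classifier studied in this thesis rather than being given by hand.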
1.2 What might be involved?
As mentioned in the previous section, sentiment classification is a specific kind of text
classification in machine learning. The number of classes in this task is commonly two:
the positive and the negative class. Consequently, many machine learning techniques can be applied to sentiment classification. Text categorization in general is
topic-based text categorization, where each word receives a topic distribution, whereas
in sentiment classification consumers express their bias through sentiment words.
This difference should be examined and considered in order to obtain better performance.
On the other hand, annotated Vietnamese data is limited. That makes
learning based on supervised methods challenging. In previous Vietnamese text
classification research, the learning phase employed a training set of approximately
8000 documents (Linh, 2006). Because annotating is expert work
and labor intensive, Vietnamese sentiment classification is all the more
challenging.
1.3 Our approach
To date, a variety of corpus-based methods have been developed for sentiment classification. These methods usually rely heavily on an annotated corpus for training the
sentiment classifier; sentiment corpora are considered the most valuable
resources for the sentiment classification task. However, such resources are very
imbalanced across languages. Because most previous work studies English
sentiment classification, many annotated corpora for English sentiment classification are freely available on the Internet. In order to face the challenge of the limited
Vietnamese corpus, we propose to leverage rich English corpora for Vietnamese sentiment classification. In this thesis, we examine the effects of cross-lingual sentiment
classification, which leverages only English training data for learning the classifier without using any Vietnamese resources. To achieve better performance, we employ
semi-supervised learning in which we utilize 960 annotated Vietnamese reviews. We
also examine the effect of feature selection in Vietnamese sentiment classification
by applying natural language processing techniques. Although we study the Vietnamese domain, this approach can be applied to many other languages.
1.4 Related works
1.4.1 Sentiment classification

1.4.1.1 Sentiment classification tasks
Sentiment categorization can be conducted at the document, sentence, or phrase (part
of a sentence) level. Document-level categorization attempts to classify sentiments in
movie reviews, product reviews, news articles, or Web forum posts (Pang et al.,
2002; Hu & Liu, 2004b; Pang & Lee, 2004). Sentence-level categorization classifies
positive or negative sentiments for each sentence (Pang & Lee, 2004; Mullen &
Collier, 2004). Work on phrase-level categorization captures multiple sentiments
that may be present within a single sentence. In this study we focus on document-level
sentiment categorization.
1.4.1.2 Sentiment classification features
The types of features used in previous sentiment classification work include
syntactic, semantic, link-based, and stylistic features. Along with semantic features,
syntactic properties are the most commonly used set of features for sentiment
classification. These include word n-grams (Pang et al., 2002; Gamon et al., 2005)
and part-of-speech tags (Pang et al., 2002).
Semantic features integrate manual or semi-automatic annotation to add polarity
or scores to words and phrases. Turney (Turney, 2002) used a mutual information
calculation to automatically compute the SO score for each word and phrase, while
Hu and Liu (Hu & Liu, 2004b; Hu & Liu, 2004a) made use of synonyms and
antonyms in WordNet to identify sentiment.
1.4.1.3 Sentiment classification techniques
Previously used techniques for sentiment classification can be classified into three groups: machine learning, link analysis methods, and score-based
approaches.
Many studies used machine learning algorithms such as support vector machines
(SVM) (Pang et al., 2002; Wan, 2009; Efron, 2004) and Naive Bayes (NB) (Pang
et al., 2002; Pang & Lee, 2004). SVMs have surpassed other machine
learning techniques such as NB or Maximum Entropy (Pang et al., 2002).
Link analysis methods for sentiment classification are grounded in link-based features and metrics. (Efron, 2004) used co-citation analysis for sentiment
classification of Website opinions.
Score-based methods are typically used in conjunction with semantic features.
These techniques classify review sentiments by the total sum of the positive or negative sentiment features they comprise (Turney & Littman, 2002).
1.4.1.4 Sentiment classification domains
Sentiment classification has been applied to numerous domains, including reviews,
Web discussion groups, etc. Reviews include movie, product and music reviews (Pang
et al., 2002; Hu & Liu, 2004b; Wan, 2008). Web discussion groups include Web forums,
newsgroups and blogs.
In this thesis, we investigate sentiment classification using semantic features in
comparison to syntactic features. Because of the outperformance of the SVM algorithm,
we apply machine learning with an SVM classifier. We study product
reviews, for which corpora are available on the Internet.
1.4.2 Cross-domain text classification
Cross-domain text classification can be considered a more general task than cross-lingual sentiment classification. In the case of cross-domain text classification, the
labeled and unlabeled data originate from different domains. Correspondingly, in the case
of cross-lingual sentiment classification, the labeled data come from one language and
the unlabeled data come from another.
In particular, several previous studies focus on the problem of cross-lingual text
classification, which can be considered a special case of general cross-domain text
classification. A few novel models have been proposed for this problem,
for example, the information bottleneck approach, multilingual domain models,
and the co-training algorithm.
Chapter 2
Background

2.1 Sentiment Analysis
The Web, with Web 2.0 technology, has dramatically changed the way that people
express their opinions. Now they can post comments on products at merchant sites
and express their views on almost anything in Internet forums, discussion groups,
blogs, etc., which are generally called user-generated content or user-generated
media. Along with this so-called user-generated content, sentiment analysis has
drawn much attention in the Natural Language Processing (NLP) field. Sentiment
analysis attempts to identify and analyze opinions and emotions. Hearst and Wiebe
originally proposed the idea of mining direction-based text, namely, text containing
opinions, sentiments, affects, and biases. In some documents, the concepts “sentiment
analysis” and “opinion mining” are interchangeable, although their original meanings
differ slightly.
There are several tasks attracting much interesting research in the sentiment analysis field,
of which sentiment classification is one of the major tasks. This task treats opinion mining
as a text classification problem. It classifies an evaluative text as being positive or
negative. For example, given a product review, the system determines whether the
review expresses a positive or a negative sentiment of the reviewer.
Given a set of evaluative texts D, a sentiment classifier categorizes each document
d ∈ D into one of the two classes, positive and negative. Positive means that d
expresses a positive opinion. Negative means that d expresses a
negative opinion.
2.1.1 Applications
Opinions are so important that whenever one needs to make a decision, one wants
to hear others’ opinions. This is true for both individuals and organizations. The
technology of opinion mining thus has a tremendous scope for practical applications.
Individual consumers: If an individual wants to purchase a product, it is useful
to see a summary of the opinions of existing users so that he/she can make an informed
decision. This is better than reading a large number of reviews to form a mental
picture of the strengths and weaknesses of the product. He/she can also compare
the summaries of opinions of competing products, which is even more useful. An
example in Figure 2.1 shows this.
Organizations and businesses: Opinion mining is equally, if not more, important to businesses and organizations. For example, it is critical for a product
manufacturer to know how consumers perceive its product and those of its competitors. This information is useful not only for marketing and product benchmarking
but also for product design and product development.
The major application of sentiment classification is to give a quick view of the
prevailing opinion on an object so that people can see “what others think” easily.
The task is similar to, but different from, classic topic-based text classification, which
classifies documents into predefined topic classes, e.g., politics, sport, education, science, etc. In topic-based classification, topic-related words are important. However,
in sentiment classification, topic-related words are unimportant. Instead, sentiment
words that indicate positive or negative opinions are important, e.g., great, interesting, good, terrible, worst, etc.
2.2 Support Vector Machines
The SVM algorithm was first developed in 1963 by Vapnik and Lerner. However,
SVMs only began to attract wide attention in 1995 with the appearance of Vapnik’s book “The
Nature of Statistical Learning Theory”. Among the many learning algorithms
for text classification, SVMs have performed very successfully. In text classification,
suppose some given data points each belong to one of two classes; the classification
task is to decide which class a new data point belongs to. For a support vector
machine, each data point is viewed as a p-dimensional vector, and the goal
becomes finding a (p − 1)-dimensional hyperplane that can separate such points.

Figure 2.1: Visualization of opinion summary and comparison

Figure 2.2: Hyperplanes separate data points

This hyperplane is a classifier, or in other words a linear classifier. Obviously,
there are many such hyperplanes separating the data. However, maximum separation
between the two classes is what we desire. Accordingly, we choose the hyperplane such that
the distance from it to the nearest data point on each side is maximized.
Given a set of points D = {(x_i, y_i) | x_i ∈ R^p, y_i ∈ {−1, 1}}, i = 1, ..., n, where y_i is either
1 or −1, indicating the class to which the point x_i belongs, we seek a
hyperplane w that not only separates the data vectors in one class from those in the
other, but for which the separation, or margin, is as large as possible. Searching for such a
hyperplane corresponds to a constrained optimization problem. The solution can be
written as

    w = Σ_j α_j c_j x_j,    α_j ≥ 0,

where the coefficients α_j greater than zero are obtained by solving a dual optimization problem. Those x_j with α_j > 0 are called support vectors, since they are the only data vectors contributing to
w. Classifying new instances consists simply of determining which side of the
hyperplane w they fall on.
The above formulation is the primal form. Writing the classification rule in its
unconstrained dual form reveals that the maximum-margin hyperplane, and therefore the classification task, is only a function of the support vectors, the training
data that lie on the margin.

Using the fact that ‖w‖² = w · w and substituting w = Σ_j α_j c_j x_j with α_j ≥ 0,
one can show that the dual of the SVM boils down to the following optimization
problem:

    Maximize (in α_j):    Σ_{j=1}^n α_j − (1/2) Σ_{i,j} α_i α_j c_i c_j x_i^T x_j
    subject to (for any j = 1, ..., n):    α_j ≥ 0    and    Σ_{j=1}^n α_j c_j = 0.

The α terms constitute a dual representation for the weight vector in terms of
the training set:

    w = Σ_j α_j c_j x_j,    α_j ≥ 0.
For simplicity, it is sometimes required that the hyperplane pass through
the origin of the coordinate system. Such hyperplanes are called unbiased, whereas
general hyperplanes not necessarily passing through the origin are called biased. An
unbiased hyperplane can be enforced by setting b = 0 in the primal optimization
problem. The corresponding dual is identical to the dual given above but without the
equality constraint Σ_{j=1}^n α_j c_j = 0.

There are extensions to the linear SVM, namely soft margins and non-linear
classification. We do not describe them in detail in this thesis; see (Vapnik, 1998)
for more.
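As a concrete illustration of the linear SVM as a text classifier, the sketch below uses scikit-learn (an off-the-shelf library, not a tool from this thesis) with unigram bag-of-words features; the four toy reviews and their labels are invented.

```python
# A minimal sketch of a linear SVM text classifier on unigram
# bag-of-words counts; the toy reviews below are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["great picture quality", "terrible battery life",
        "good camera, love it", "worst product ever"]
labels = [1, -1, 1, -1]  # 1 = positive, -1 = negative

vec = CountVectorizer()   # unigram bag-of-words features
X = vec.fit_transform(docs)

clf = LinearSVC(C=1.0)    # C trades margin size against training errors
clf.fit(X, labels)

print(clf.predict(vec.transform(["great camera"])))
```

The learned weight vector w plays exactly the role of the dual expansion above: each support vector contributes α_j c_j x_j to it.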
2.3 Semi-supervised techniques

2.3.1 Generate maximum-likelihood models
From early research in semi-supervised learning, the Expectation Maximization (EM)
algorithm has been studied for several Natural Language Processing (NLP) tasks.
EM has also been successful in text classification (Nigam et al., 2000).
EM is an iterative method which alternates between performing an expectation
(E) step and a maximization (M) step. The goal is to find maximum likelihood
estimates of parameters in probabilistic models. One problem with this approach, and with
other generative models, is that it is difficult to incorporate arbitrary, interdependent
features that may be useful for solving the task.
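The E/M alternation above can be sketched for text classification as a small semi-supervised Naive Bayes (multinomial mixture) model. This is our own simplified illustration, not the exact model of (Nigam et al., 2000); the function name and toy interface are assumptions.

```python
# A compact EM sketch for semi-supervised text classification with a
# multinomial Naive Bayes mixture: labeled responsibilities stay fixed,
# unlabeled responsibilities are re-estimated each iteration.
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unl, n_iter=20, alpha=1.0):
    """X_* are dense doc-by-vocab count matrices; y_lab in {0, 1}."""
    n_classes = 2
    Y_lab = np.eye(n_classes)[y_lab]                   # one-hot, fixed
    R_unl = np.full((len(X_unl), n_classes), 0.5)      # uniform init
    X = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R = np.vstack([Y_lab, R_unl])
        # M-step: class priors and Laplace-smoothed word probabilities
        prior = R.sum(axis=0) / R.sum()
        counts = R.T @ X + alpha
        word_p = counts / counts.sum(axis=1, keepdims=True)
        # E-step: posterior class probabilities for the unlabeled docs
        log_post = np.log(prior) + X_unl @ np.log(word_p).T
        log_post -= log_post.max(axis=1, keepdims=True)
        R_unl = np.exp(log_post)
        R_unl /= R_unl.sum(axis=1, keepdims=True)
    return prior, word_p, R_unl
```

On separable toy counts, the responsibilities of the unlabeled documents converge to the intuitively correct class.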
2.3.2 Co-training and bootstrapping
A number of semi-supervised approaches are grounded in the co-training framework
(Blum & Mitchell, 1998), which assumes each document in the input domain can be
separated into two independent views conditioned on the output class. One important
aspect to take into account when applying co-training is this assumption. In
fact, the co-training algorithm is a typical bootstrapping method, which starts with
a set of labeled data and increases the amount of annotated data incrementally using some amount
of unlabeled data. To date, co-training has been successfully
applied to named-entity classification, statistical parsing, part-of-speech tagging and
sentiment classification.
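The co-training loop can be sketched as follows. This is a minimal illustration of the Blum & Mitchell scheme, with Naive Bayes standing in for the view classifiers; the function name and toy interface are our own assumptions, and a real application would use two genuinely independent feature views.

```python
# A minimal co-training sketch: two classifiers, each trained on its own
# view, take turns labeling the unlabeled examples they are most
# confident about and adding them to the labeled pool.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, lab_idx, unl_idx, rounds=5, k=1):
    """X1, X2: the two feature views; y holds -1 for unlabeled examples."""
    y = np.asarray(y).copy()
    lab, unl = list(lab_idx), list(unl_idx)
    for _ in range(rounds):
        for X in (X1, X2):
            if not unl:
                return y, lab
            clf = MultinomialNB().fit(X[lab], y[lab])
            proba = clf.predict_proba(X[unl])
            # pick the k unlabeled examples this view is most confident about
            order = np.argsort(proba.max(axis=1))[::-1][:k]
            newly = [unl[i] for i in order]
            y[newly] = clf.predict(X[newly])
            lab += newly
            unl = [j for j in unl if j not in newly]
    return y, lab
```

Each view teaches the other: labels confidently predicted from one view become training data for both.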
2.3.3 Transductive SVM
Thorsten Joachims (Joachims, 1999) proposed a widely used semi-supervised version of the SVM
algorithm. Suppose that we have l labeled examples {(x_i, y_i)}, i = 1, ..., l,
called the set L, and u unlabeled examples {x*_j}, j = 1, ..., u, called the set U, where x_i, x*_j ∈ R^d and
y_i ∈ {−1, 1}. The goal is to construct a learner by making use of both L and U.
The optimization problem is as follows:

OP: Transductive SVM

    Minimize over (y*_1, ..., y*_u, w, b, ξ_1, ..., ξ_l, ξ*_1, ..., ξ*_u):
        (1/2) ‖w‖² + C Σ_{i=1}^l ξ_i + C* Σ_{j=1}^u ξ*_j
    subject to:
        for all i = 1, ..., l:    y_i [w · x_i + b] ≥ 1 − ξ_i
        for all j = 1, ..., u:    y*_j [w · x*_j + b] ≥ 1 − ξ*_j
        for all i = 1, ..., l:    ξ_i > 0
        for all j = 1, ..., u:    ξ*_j > 0
C and C* are set by the user. They allow trading off margin size against misclassifying training data or excluding test data. Training a transductive SVM means
solving the combinatorial optimization problem OP. For a small number of test examples, this problem can be solved simply by trying all possible assignments of
y*_1, ..., y*_u to the two classes. However, when the amount of test data is large, we can only find
an approximate solution to optimization problem OP using a form of local search.
The key idea of the algorithm is that it begins with a labeling of the examples
in the set U based on the classification of an inductive SVM. It then improves
the solution by switching the labels of pairs of test examples that are misclassified.
After that, the algorithm retrains the model, taking the labeled data in both the L and U sets as input.
The loop stops after a finite number of iterations, since C*_− and C*_+ are bounded by C*. In each iteration, the algorithm relabels
a pair of misclassified examples, so the number of such wrongly classified pairs bounds the number
of iterations.
TSVM has been successful for text classification (Joachims, 1998; Pang et al.,
2002). That is the reason we employ this semi-supervised algorithm.
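The label-switching local search can be sketched as below. This is a simplified illustration only: Joachims' SVMlight implements the actual TSVM algorithm, whereas here a standard inductive SVM (scikit-learn's SVC) is refit inside the loop, with per-sample weights standing in for the C versus C* trade-off, and the pair-switch test ξ_i + ξ_j > 2 taken from the description above.

```python
# A simplified local-search sketch of the transductive SVM: start from
# an inductive labeling of the test set, then repeatedly switch one
# positive/negative pair of test labels whose combined slack exceeds 2.
import numpy as np
from sklearn.svm import SVC

def tsvm_local_search(X_lab, y_lab, X_unl, C=1.0, C_star=0.1, max_iter=50):
    n_unl = len(X_unl)
    X = np.vstack([X_lab, X_unl])
    # sample weights approximate C for labeled and C* for unlabeled data
    sw = np.r_[np.full(len(X_lab), 1.0), np.full(n_unl, C_star / C)]
    # initial labeling of U from an inductive SVM trained on L only
    y_unl = SVC(kernel="linear", C=C).fit(X_lab, y_lab).predict(X_unl)
    for _ in range(max_iter):
        y = np.concatenate([y_lab, y_unl])
        svm = SVC(kernel="linear", C=C).fit(X, y, sample_weight=sw)
        f = svm.decision_function(X_unl)
        xi = np.maximum(0.0, 1.0 - y_unl * f)     # slacks of test examples
        pos = [i for i in range(n_unl) if y_unl[i] == 1]
        neg = [i for i in range(n_unl) if y_unl[i] == -1]
        if not pos or not neg:
            break
        i = max(pos, key=lambda t: xi[t])
        j = max(neg, key=lambda t: xi[t])
        if xi[i] + xi[j] <= 2.0:                  # no improving pair switch
            break
        y_unl[i], y_unl[j] = -1, 1                # switch one pos/neg pair
    return y_unl
```

On cleanly separable data the initial inductive labeling already satisfies the margin constraints, so the loop terminates without switching any labels.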