Graduation thesis, Computer Science: Finding the semantic similarity in Vietnamese


VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Tien Dat

FINDING THE SEMANTIC SIMILARITY IN VIETNAMESE

GRADUATION THESIS
Major Field: Computer Science
Supervisor: Dr. Phạm Bảo Sơn

Ha Noi – 2010

Abstract

This thesis examines the quality of semantic vector representations built with random projection and the Hyperspace Analogue to Language (HAL) model when applied to Vietnamese. The main goal is to find semantically similar words, that is, to study synonyms in Vietnamese. We are also interested in the stability of our approach, which uses Random Indexing and HAL to represent the semantics of words or documents. We build a system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System, and we evaluate the synonyms it returns.

Keywords: Semantic vector, Word space model, Random projection, Apache Lucene

Acknowledgments

First of all, I wish to express my respect and my deepest thanks to my advisor Pham Bao Son, University of Engineering and Technology, Viet Nam National University, Ha Noi, for his enthusiastic guidance, warm encouragement and useful research experience.

I would like to thank all the teachers of the University of Engineering and Technology, VNU, for the invaluable knowledge they have given me during the past four academic years.

I would also like to send my special thanks to my friends in the K51CA class and the HMI Lab.

Last but not least, my family is my biggest motivation. My parents and my brother always encourage me in times of stress and difficulty. I would like to send them my love and gratitude.
Ha Noi, May 19, 2010
Nguyen Tien Dat

Contents

Abstract
Acknowledgments
Contents
Figure List
Table List
Chapter 1 Introduction
Chapter 2 Background Knowledge
  2.1 Lexical relations
    2.1.1 Synonym and Hyponymy
    2.1.2 Antonym and Opposites
  2.2 Word-space model
    2.2.1 Definition
    2.2.2 Semantic similarity
    2.2.3 Document-term matrix
    2.2.4 Example: tf-idf weights
    2.2.5 Applications
  2.3 Word space model algorithms
    2.3.1 Context vector
    2.3.2 Word co-occurrence Matrices
    2.3.3 Similarity Measure
  2.4 Implementation of word space model
    2.4.1 Problems
      High dimensional matrix
      Data sparseness
      Dimension reductions
    2.4.2 Latent Semantic Indexing
    2.4.3 Hyperspace Analogue to Language
    2.4.4 Random Indexing
Chapter 3 Semantic Similarity Finding System
  3.1 System Description
  3.2 System Processes Flow
  3.3 Lucene Indexing
  3.3 Semantic Vector Package
  3.4 System output
Chapter 4 Experimental setup and Evaluations
  4.1 Data setup
  4.2 Experimental measure
    4.2.1 Test Corpus
    4.2.2 Experimental Metric
  4.3 Experiment 1: Two kinds of context vector
  4.4 Experiment 2: Context-size Evaluation
  4.5 Experiment 3: Performance of system
  4.6 Discussion
Chapter 5 Conclusion and Future work
References
Appendix

Figure List

Figure 2.1: Word geometric representation
Figure 2.2: Cosine distance
Figure 2.3: The processes of Random Indexing
Figure 3.1: The processes of Semantic Similarity Finding System
Figure 3.2: Lucene Index Toolbox - Luke
Figure 4.1: Context size
Figure 4.2: P1 when context-size changes
Figure 4.3: Pn, n = 1..19 for each kind of word
Figure 4.4: Average synonym for Test Corpus

Table List

Table 2.1: An example of a document-term matrix
Table 2.2: Word co-occurrence table
Table 2.3: Co-occurrence matrix
Table 4.1: All words in Test Corpus – Target words
Table 4.2: Results of Mode 1 and 2 on Test Corpus
Table 4.3: Results of Context-size Experiment
Table 4.4: P1(TestCorpus) for each context-size
Table 4.5: The best synonyms of all target words returned by our system
Table 4.6: Result output for nouns returned by system
Table 4.7: Result output for pronouns returned by system
Table 4.8: Result output for Verbs returned by system
Table 4.9: Result output for Verbs returned by system
Table 4.10: Pn, n = 1, 2..19 of each kind of word and Test Corpus
Table 4.11: Some interesting results

Chapter 1
Introduction

Finding semantic similarity is an interesting project in Natural Language Processing (NLP).
Determining the semantic similarity of a pair of words is an important problem in many NLP applications, such as web mining [18] (search and recommendation systems), targeted advertisement and domains that need semantic content matching, word sense disambiguation, and text categorization [28][30]. Little research has been done on semantic similarity for Vietnamese, even though semantic similarity plays a crucial role in human categorization [11] and reasoning, and computational similarity measures have been applied to many fields, such as semantics-based information retrieval [4][29], information filtering [9] and ontology engineering [19].

The word space model is often used in current research on semantic similarity. Specifically, there are several well-known approaches for representing the context vector of words, such as Latent Semantic Indexing (LSI) [17], Hyperspace Analogue to Language (HAL) [21] and Random Indexing (RI) [26]. These approaches have proven useful in implementing the word space model [27].

In this thesis, we adopt the word space model and its implementations for computing semantic similarity. We studied each method and investigated its advantages and disadvantages in order to select a suitable technique for Vietnamese text data. We then built a complete system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System. The system combines several processes and approaches to return the synonyms of a given word. Our experimental results on the task of finding synonyms are promising.

The thesis is organized as follows. In Chapter 2, we introduce background knowledge about the word space model and review some of the solutions that have been proposed for implementing it. In Chapter 3, we describe our Semantic Similarity Finding System for finding synonyms in Vietnamese.
Chapter 4 describes the experiments we carry out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusions and future work.

Chapter 2
Background Knowledge

2.1 Lexical relations

In this first section, we describe lexical relations in order to clarify the concepts of synonymy and hyponymy. Lexical relations are difficult to define in a common way. One definition is given by Cruse (1986) [35]: a lexical relation is a culturally recognized pattern of association that exists between lexical units in a language.

2.1.1 Synonym and Hyponymy

Synonymy is the equality, or at least similarity, of the meaning of different linguistic units. Two words are synonymous if they have the same meaning [15]. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy. For example, in English the words “car” and “automobile” are synonyms. In a figurative sense, two words are often said to be synonyms if they have the same extended meaning or connotation. Synonyms can be any part of speech (e.g. noun, verb, adjective or pronoun), as long as the two words of a pair are the same part of speech. More examples of Vietnamese synonyms:

độc giả – bạn đọc (noun)
chung quanh – xung quanh (pronoun)
bồi thường – đền bù (verb)
an toàn – đảm bảo (adjective)

In the linguistics dictionary, a synonym is defined in three ways:

1. A word having the same or nearly the same meaning as another word or other words in a language.
2. A word or an expression that serves as a figurative or symbolic substitute for another.
3. Biology: A scientific name of an organism or of a taxonomic group that has been superseded by another name at the same rank.

In linguistics, a hyponym is a word whose meaning is included in that of another word [14]. Some examples in English: “scarlet”, “vermilion” and “crimson” are hyponyms of “red”.
In Vietnamese, “vàng cốm”, “vàng choé” and “vàng lụi” are hyponyms of “vàng” when “vàng” denotes the color. In this thesis, we do not distinguish sharply between synonyms and hyponyms; we treat a hyponym as a kind of synonym.

2.1.2 Antonym and Opposites

In lexical semantics, opposites are words that stand in a relationship of binary incompatibility, such as female – male, long – short and to love – to hate. The notion of incompatibility refers to the fact that one word in an opposite pair entails that it is not the other pair member. For example, “something that is long” entails that “it is not short”. There are two members in a set of opposites, so the relationship is referred to as binary. The relationship between opposites is called opposition.

Opposites are simultaneously similar and different in meaning [12]. Usually they differ in only one dimension of meaning but are similar in most other respects, including grammar and positions of semantic abnormality. Some words are non-opposable. For example, animal or plant species have no binary opposites or antonyms. Opposites may be viewed as a special type of incompatibility. For example, incompatibility is found in the opposite pair “fast – slow”: “It’s fast” entails “It’s not slow”.

Some features of opposites are given by Cruse (2004): binarity, inherentness and patency. Antonyms are gradable opposites: they lie at opposite ends of a continuous spectrum of meaning.

An antonym is defined as “a word having a meaning opposite to that of another word” [12]. The concept is commonly used together with that of synonyms.
More antonym examples in Vietnamese:

đi – đứng
nam – nữ
yêu – ghét
...

A word can have several different antonyms, depending on its meaning or context. We study antonyms to make clear a fundamental part of the language, in contrast to synonyms.

2.2 Word-space model

The word-space model is an algebraic model for representing text documents or other objects (phrases, paraphrases, terms, ...). It uses vectors as a mathematical model to identify or index the terms in text documents. The model is useful in information retrieval, information filtering and indexing. Its origin can be traced to Salton's introduction of the vector space model for information retrieval [29].

The term “word space” is due to Hinrich Schütze (1993): “Vector similarity is the only information present in Word Space: semantically related words are close, unrelated words are distant” (p. 896).

2.2.1 Definition

Word-space models comprise a family of related methods for representing concepts in a high-dimensional vector space. In this thesis, we use the name semantic vector model throughout our work. These models include well-known approaches such as Latent Semantic Indexing [17] and Hyperspace Analogue to Language [21].

Documents and queries are represented as vectors:

$$d_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$$
$$q = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$$

Each dimension corresponds to a separate term. If a document does not include a term, the term's value in the vector is zero. In contrast, if a term occurs in the document, its value is non-zero. There are many ways to compute these values; we study one well-known scheme, tf-idf weighting [36] (see Section 2.2.4).

The core principle is that semantic similarity can be represented as proximity in an n-dimensional vector space, where n can be 1 or a large number. We consider 1-dimensional and 2-dimensional word spaces in Figure 2.1.

Figure 2.1: Word geometric representation

This geometric representation shows a few simple Vietnamese words. In both semantic spaces, for example, “ô_tô” is closer in meaning to “xe_hơi” than to “xe_đạp” or “xe_máy”.

The definition of a term depends on the application. Typically, terms are single words or longer phrases. If words are chosen as terms, the dimensionality of the vector is the number of words in the vocabulary.

2.2.2 Semantic similarity

As we have seen in the definition, the word-space model is a model of semantic similarity. In other words, the geometric metaphor of meaning is that meanings are locations in a semantic space, and semantic similarity is proximity between those locations. The term-document vector represents the context of a term at low granularity. Alternatively, a term vector can be created from the few words surrounding the term to compute a semantic vector [21]. This is also a kind of semantic vector model. To compare semantic similarity in the semantic vector model, we use the cosine distance:

Figure 2.2: Cosine distance

In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself:

$$\cos\theta = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}$$

A cosine value of zero means that the query and document vectors are orthogonal and have no match (that is, the query terms do not occur in the document). The higher the cosine value, the closer the semantic similarity of the two terms.

2.2.3 Document-term matrix

A document-term matrix and a term-document matrix are mathematical matrices that show the frequency of the terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. In a term-document matrix, rows correspond to words or terms and columns correspond to documents. One scheme for determining the values of these matrices is tf-idf [36].
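The matrix definition above, together with the cosine measure of Section 2.2.2, can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation used in the thesis: the function names (tokenize, document_term_matrix, transpose, cosine) are hypothetical, the tokenizer is a toy whitespace splitter, and the two sentences are the example documents D1 and D2 of this section with the vocabulary fixed in the column order of Table 2.1.

```python
import math
from collections import Counter


def tokenize(text):
    """Toy tokenizer (assumption): lowercase, drop periods, split on spaces."""
    return text.lower().replace(".", " ").split()


def document_term_matrix(documents, vocabulary):
    """One row of raw term counts per document, one column per term."""
    rows = []
    for text in documents:
        counts = Counter(tokenize(text))  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return rows


def transpose(matrix):
    """Turn a document-term matrix into a term-document matrix."""
    return [list(column) for column in zip(*matrix)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


documents = ["tôi thích chơi ghita.", "tôi ghét ghét chơi ghita."]
vocabulary = ["tôi", "thích", "ghét", "chơi", "ghita"]

doc_term = document_term_matrix(documents, vocabulary)  # rows = documents
term_doc = transpose(doc_term)                           # rows = terms
similarity = cosine(doc_term[0], doc_term[1])

for name, row in zip(["D1", "D2"], doc_term):
    print(name, row)
print("cosine(D1, D2) =", round(similarity, 4))
```

Raw counts are used here on purpose; replacing them with tf-idf weights changes only the cell values, not the shape of either matrix.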
A simple example of a document-term matrix:

D1 = “tôi thích chơi ghita.”
D2 = “tôi ghét ghét chơi ghita.”

The document-term matrix is then:

Table 2.1: An example of a document-term matrix

      tôi   thích   ghét   chơi   ghita
D1    1     1       0      1      1
D2    1     0       2      1      1

The matrix shows how many times each term appears in each document. We describe the more elaborate tf-idf weighting in the next section.

2.2.4 Example: tf-idf weights

In the classic semantic vector model [31], the term weights in the document vectors are products of local and global parameters. This is called the term frequency-inverse document frequency (tf-idf) model. The weight vector of document d is

$$v_d = (w_{1,d}, w_{2,d}, \dots, w_{N,d})^T$$

where N is the number of terms and

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

Here $\mathrm{tf}_{t,d}$ is the term frequency of term t in document d, and $\log \frac{|D|}{|\{d' \in D : t \in d'\}|}$ is the inverse document frequency. $|D|$ is the number of documents, and $|\{d' \in D : t \in d'\}|$ is the number of documents in which the term t occurs.

The distance between document $d_j$ and query q can be calculated as [36]:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert} = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$$

2.2.5 Applications

Over the past 20 years, the semantic vector model has been developed extensively, and it is useful for many important natural language processing tasks. Applications in which semantic vector models play a major role include:

Information retrieval [7]: The model is a basic foundation for applications that are fully automatic and widely applicable across different languages or in cross-language settings. The system has flexible input and output options: typically, a user queries with any combination of words or documents, and the system returns documents or words. It is therefore easy to build a web interface for users. For cross-language information retrieval, semantic vector models are more convenient than other systems for issuing a query in one language and matching relevant documents or articles in the same or other languages, because they rely on fully automatic corpus analysis, whereas machine translation requires vast lexical resources. Some machine translation systems are very expensive to develop and lack coverage of the full lexicon of a language.

Information filtering [9]: This is an area of great interest. Information retrieval assumes a relatively stable database and depends on user queries, whereas information filtering (IF) serves relatively stable information needs while the data stream changes rapidly. IF also uses further techniques such as information routing and text categorization or classification.

Word sense discrimination and disambiguation [28]: The main idea is to cluster the weighted sums of the vectors of the words found in a paragraph of text; such a sum is called the context vector of a word. The co-occurrence matrix is also calculated (see Section 2.3.2), and each appearance of an ambiguous word can then be mapped to one of the resulting word senses.

Document segmentation [3]: Computing the context vector of a region of text helps to categorize which kind of document the text belongs to. Given a document, the system can indicate whether it is about a topic such as sport, policy or law.

Lexical and ontology acquisition [19]: Starting from a few given words, called seed words, and their relationships, many other similar words are acquired whose semantic vectors lie nearby.
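The information retrieval use described above, combined with the tf-idf weights of Section 2.2.4 and the cosine measure of Section 2.2.2, can be sketched as follows. This is a minimal illustration under stated assumptions, not the system built in this thesis: the three pre-tokenized documents, the query and all function names are hypothetical, the natural logarithm is assumed for the inverse document frequency, and the query is weighted with the same idf values as the documents.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, idf, tf-idf vector per document)."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = {t: sum(1 for doc in tokenized_docs if t in doc)
                for t in vocabulary}
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term frequency
        vectors.append([tf[t] * idf[t] for t in vocabulary])
    return vocabulary, idf, vectors


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, tokenized_docs):
    """Return (document index, score) pairs, best match first."""
    vocabulary, idf, doc_vectors = tf_idf_vectors(tokenized_docs)
    q_tf = Counter(query_tokens)
    # Query terms outside the vocabulary are simply ignored.
    q_vec = [q_tf[t] * idf[t] for t in vocabulary]
    scores = [cosine(q_vec, d) for d in doc_vectors]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["tôi", "thích", "chơi", "ghita"],
    ["tôi", "ghét", "ghét", "chơi", "ghita"],
    ["tôi", "thích", "xe_đạp"],
]
query = ["thích", "ghita"]

ranking = rank(query, corpus)
for index, score in ranking:
    print(f"document {index}: {score:.4f}")
```

Because “tôi” occurs in every document of this toy corpus, its inverse document frequency is log(1) = 0, so it contributes nothing to any score; this is the effect of the global parameter in the weighting scheme.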
