Lexicalized Statistical Parsing for Vietnamese

Lexicalized Statistical Parsing for Vietnamese
Pham Thi Minh Thu
Faculty of Information Technology
Hanoi University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by Doctor Le Anh Cuong
A thesis submitted in fulfillment of the requirements for the degree of Master of Computer Science
June, 2010

Table of Contents

Acknowledgements
1 Introduction
  1.1 What is syntactic parsing?
  1.2 Current Studies in Parsing
  1.3 Vietnamese syntactic parsing
  1.4 Objective of the Thesis
  1.5 Thesis structure
2 Parsing approaches
  2.1 Context Free Grammar (CFG)
  2.2 Parsing Algorithms
    2.2.1 Top-down parsing
    2.2.2 Bottom-up parsing
    2.2.3 Comparison between top-down parsing and bottom-up parsing
    2.2.4 CYK algorithm (Cocke-Younger-Kasami)
    2.2.5 Earley algorithm
  2.3 Probabilistic context-free grammar (PCFGs)
    2.3.1 The concept of PCFG
    2.3.2 Disadvantages of PCFGs
  2.4 Lexical Probabilistic Context Free Grammar (LPCFGs)
    2.4.1 Head structure
    2.4.2 The concept of Lexical Probabilistic Context Free Grammar (LPCFGs)
    2.4.3 Three models of Collins
3 Vietnamese parsing and our approach
  3.1 Vietnamese characteristics
  3.2 Penn Treebank
    3.2.1 POS tagging
    3.2.2 Bracketing
  3.3 Viet Treebank
    3.3.1 Objectives
    3.3.2 The POS tagset and Syntax tagset for Vietnamese
  3.4 Our approach in building a Vietnamese parser
    3.4.1 Adapting Bikel's parser for Vietnamese
    3.4.2 Analyzing errors and proposing heuristic rules
4 Experiments and Discussion
  4.1 Data
  4.2 Bikel's parsing tool
  4.3 Adapting Bikel's tool to Vietnamese
    4.3.1 Investigating different configurations
    4.3.2 Training
    4.3.3 Parsing
    4.3.4 Evaluation of the parser
    4.3.5 Results
  4.4 Experimental results on using heuristic rules
5 Conclusions and Future Work
  5.1 Summary
  5.2 Contribution
  5.3 Future work
List of Figures

1.1 The parse tree of the sentence "I go to school"
1.2 A parse tree in Vietnamese
2.1 The parse tree of the Vietnamese sentence "mèo bắt chuột"
2.2 Two derivations of the sentence "Tôi hiểu Lan hơn Nga"
2.3 A parse tree of Vietnamese in LPCFG
2.4 A tree with the "C" suffix used to identify complements
3.1 Set of tags in the Penn Treebank
3.2 A sample of labeled data in the Penn Treebank before manual treatment
3.3 A sample of labeled data in the Penn Treebank after manual treatment
3.4 Tagset of the Penn Treebank
3.5 A sample of complete data in English and Vietnamese
4.1 Overview of Bikel's system
4.2 Results of testing the standard Collins model 2 with the training data size varied from 60% to 100% of the full data, where series 1 and series 2 stand for testing on sentences of length at most 40 and at most 100, respectively

List of Tables

2.1 Analysis table with the CYK algorithm
3.1 POS tagset in the Viet Treebank
3.2 Phrase tagset in the Viet Treebank
3.3 Clause tagset in the Viet Treebank
3.4 Syntax function tagset in the Viet Treebank
4.1 The initial results on the Viet Treebank with different configurations. Key: CB = average crossing brackets, 0CB = zero crossing brackets, ≤2CB = ≤2 crossing brackets. All results are percentages, except those in the CB column
4.2 Number of sentences for training
4.3 The results as the training data set changes
4.4 The error rate. We use 520 sentences for development testing, then filter the sentences whose F-score is less than 70%; as a result, we collect 147 sentences into the set of error sentences. The percentage for an error is the number of sentences exhibiting that error divided by 147. Because a sentence may contain several errors, the total percentage may exceed 100
4.5 The results obtained after applying some proposed rules to correct wrong syntactic parses

Chapter 1
Introduction

For a long time, human beings have dreamed of an intelligent machine that can listen to, understand, and carry out human requests. Many scientists have worked toward that dream and contributed many achievements to the science of artificial intelligence. Within artificial intelligence, natural language processing (NLP) is the field that studies how to automatically understand and generate human language. NLP has many practical applications, such as machine translation, information extraction, discourse analysis, and text summarization. These applications share the same basic problems, including lexical analysis, syntactic parsing, and semantic analysis. Among them, syntactic parsing plays the central role, and it is also the goal of this thesis.

1.1 What is syntactic parsing?

Syntactic parsing (also called parsing or syntactic analysis) is the process of analyzing a given sequence of tokens (i.e., a sentence) to identify its grammatical structure with respect to a given grammar. The grammatical structure is often represented in a form that visually displays the dependence of components: a tree (called a parse tree or syntactic tree).
In other words, parsing takes a given sequence of words as input and outputs the parse trees corresponding to that sequence. Figure 1.1 shows examples of parse trees: a) an English parse tree in the usual form and b) a Vietnamese tree in another form. Parsing is the major module of a grammar-checking system: in order to check grammar, we need to parse the input sentences and then examine the correctness of the structures in the output. Furthermore, a sentence which cannot be parsed may have grammatical errors.

Figure 1.1: The parse tree of the sentence "I go to school"

Figure 1.2: A parse tree in Vietnamese

Parsing is also an important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question answering, and information extraction. For example, in transfer-based machine translation the system analyzes the source sentence into a parse tree and then constructs the equivalent parse tree in the target language; the output sentence is generated mainly from this equivalent parse tree. It is easy to see that in a question answering system we need parsing to find out which part is the subject, the object, or the action. It is also interesting that parsing can help speech processing: it helps correct faults of the speech recognition process, and in speech synthesis it helps place stress on the correct positions in the sentence. Through the examples above, we can see that constructing an accurate and effective parser will bring great benefits to many applications of natural language processing.

1.2 Current Studies in Parsing

As one of the basic and central problems of NLP, parsing has attracted many studies. They belong to one of two approaches: rule-based and statistics-based.
In conventional parsing systems, a grammar is hand-crafted and often involves a large amount of lexically specific information in the form of subcategorization information. There, ambiguity, a major problem in parsing, is resolved through selectional restrictions. For example, a lexicon might specify that "eat" must take an object with the feature +food. In (Collins, 1999), the author showed several problems with selectional restrictions, such as the growing volume of information required as the vocabulary becomes large. In other words, the biggest challenge posed by a large vocabulary is that both selectional restrictions and structural preferences should be encoded as soft preferences instead of hard constraints.

To overcome these obstacles, researchers began to explore machine-learning approaches to the parsing problem, primarily through statistical models. In these approaches, a set of example pairs of a sentence and its corresponding syntactic tree is annotated by hand and used to train parsing models. Such a set of trees is called a "treebank". Several parts of the treebank are reserved as test data for evaluating the model's accuracy.

Early works investigated the use of probabilistic context-free grammar (PCFG). PCFGs are considered the next generation of parsing and the beginning step of statistical parsing. In a PCFG, each grammar rule is associated with a probability, and the probability of a parse tree is the product of the probabilities of all rules used in that tree. Parsing is then essentially the process of searching for the tree with the maximum probability. However, a simple PCFG often fails due to its lack of sensitivity to lexical information and structural preferences. Some solutions were then proposed to resolve this problem.
Several directions were listed in (Collins, 1999), such as moving toward probabilistic versions of lexicalized grammars, using supervised training algorithms, constructing models with increased structural sensitivity, and looking into history-based models. Among them, lexicalized probabilistic context-free grammar (LPCFG) is a promising approach: it can resolve many ambiguity phenomena in parsing, and some works based on it achieved high performance, such as (Collins, 1997). After this research, Daniel M. Bikel and his coworkers developed Collins' models and designed a parser for multiple languages. It has been applied successfully to several languages, including English, Chinese, and Arabic (Bikel, 2004). The concrete results reported in (Bikel, 2004) are: for English, an F-measure of 90.01%; for Chinese, 81.2%; and for Arabic, 75.7%. According to these results and comparisons with other current parsers (e.g., the Charniak parser, the Berkeley parser, and the Stanford parser), Bikel's parser is still rated among the best parsers at present.

Recently, the LPCFG approach has continued to be applied to many languages. Moreover, a number of new strategies have been proposed to improve the accuracy of parsers. In several studies, using semi-supervised training methods has become a promising approach; the experimental results show that it outperforms the supervised one without much additional computational cost. Other studies have integrated semantic information into parsing in order to fully exploit the benefits of lexical resources and upgrade the parser, such as (Xiong et al., 2005) and (Agirre & Baldwin, 2008). (Xiong et al., 2005) incorporated semantic knowledge as follows: first, they used two Chinese electronic semantic dictionaries and heuristic rules to extract semantic categories.
Then they built a selectional preference sub-model based on the extracted semantic categories. Similarly, in (Agirre & Baldwin, 2008), sense information was added to parsing by substituting the original words with semantic tags corresponding to their semantic classes; for example, knife and scissors belong to the TOOL class, while cake and pork are assigned to the FOOD class. In addition, other tactics have been suggested to enhance the performance of parsing, such as a powerful learning technique (sample selection) for reducing the amount of human-labeled training data (Carreras et al., 2008), or a strategy for utilizing POS tag resources to annotate parser input (Watson et al., 2007). Through this review of parsing approaches, and especially of recent studies, we found that LPCFG appears in all state-of-the-art parsing systems. Therefore, in our opinion, LPCFG is a good choice for Vietnamese parsing.

1.3 Vietnamese syntactic parsing

In Vietnam, work on natural language processing (i.e., computational linguistics) in general, and on parsing in particular, has only been motivated very recently. A few parsers following the knowledge-based approach have been constructed with manual grammar rules. Since the construction of grammar rules is manual, the accuracy of these parsers is not high; they only analyze a limited number of sentences generated by the grammar. The statistical approach has also been studied, but only briefly and without experimental results. For example, (Quoc-The & Thanh-Huong, 2008) presented LPCFG but, surprisingly, provided no experiments; only some examples were given to illustrate the syntactic ambiguity of Vietnamese. With such restricted results, no Vietnamese parser has been published widely. One can say that while many countries have gone a long way in parsing, Vietnam is just at the starting stage.
The precondition for deploying these models for Vietnamese is a corpus of parsed sentences, which is the crucial resource for statistical parsing. For lack of such a corpus, previous works on Vietnamese parsing had no significant experimental results. Fortunately, there is now a standard Vietnamese parsed corpus, called the Viet Treebank, developed in a project supported by the Vietnamese government. This corpus contains about 10,000 parsed sentences following the Penn Treebank's format, and therefore we can apply it to Bikel's tool. As mentioned above, lexicalized models have been applied successfully to multiple languages. Among these languages, Chinese obtained an 81.2% F-score on average, as shown in (Bikel, 2004). This result also motivated our study, since the syntactic structure of Chinese is similar to that of Vietnamese.

1.4 Objective of the Thesis

This thesis focuses on building a syntactic parser for Vietnamese using the LPCFG approach. We will use the Viet Treebank as the parsed corpus and adapt Bikel's parsing tool for Vietnamese. We will also investigate some common errors appearing in the parses and propose a solution to improve the accuracy of the parser. To achieve this objective, this study has to answer the following questions: How can Bikel's system be adapted for Vietnamese? What is the initial result? Which model/configuration is appropriate for Vietnamese? How can the performance of the system be improved based on analyzing errors?

In summary, in this study we try to carry out the following tasks:
- Study the basic techniques and methods in parsing, focusing on lexicalized statistical approaches;
- Analyze and adapt Bikel's parser (Bikel, 2004) for Vietnamese; with this aim we try to build and publish a Vietnamese parsing tool useful for many tasks of Vietnamese processing;
- Investigate different parsing models and different linguistic features to discover the best configuration for Vietnamese;
- Analyze grammatical errors from a development test set and find a solution to improve the accuracy of the parser.

1.5 Thesis structure

The rest of this thesis is organized as follows. Chapter 2 introduces basic parsing approaches, from classical methods such as the top-down and bottom-up strategies to statistics-based methods like probabilistic context-free grammar (PCFG) and lexicalized probabilistic context-free grammar (LPCFG). In this chapter, we also introduce important parsing algorithms, including CYK, Earley, and chart parsing. Chapter 3 presents Vietnamese parsing and our approach; the characteristics of Vietnamese and the Viet Treebank are introduced in comparison with the Penn Treebank. Chapter 4 describes our experiments and discussion: after introducing the Bikel parsing tool, we describe the process of applying and developing it for Vietnamese, from adapting the tool and investigating it to find the best configuration, to handling several grammatical errors to reduce the error rate and enhance the parser's performance. Chapter 5 summarizes the obtained results, gives some conclusions about our work, and shows our plan for future work.

Chapter 2
Parsing approaches

In the previous chapter, we introduced the concept of parsing and its role in natural language processing. This chapter first presents context-free grammar and then common parsing methods, including the two classical parsing strategies (top-down and bottom-up), the CYK (Cocke-Younger-Kasami) algorithm, chart parsing, and the Earley algorithm. At the end of this chapter, we present probabilistic context-free grammar and the lexicalized statistical model for parsing.
2.1 Context Free Grammar (CFG)

To analyze the syntax of a language, we first need to represent the language in a form a computer can understand. The most popular formal representation for the grammar of a language is the context-free grammar (introduced by Chomsky). A language is defined as a set of strings, where each string is generated from a finite, nonempty set of elements called the alphabet, such as the Vietnamese alphabet for the Vietnamese language.

A context-free grammar (CFG) is a four-tuple, denoted G = (T, N, S, R), where:
- T ≠ ∅ is a finite set of elements called terminals (the lexicon). The set of terminals is the alphabet (or the words) of the language defined by the grammar.
- N ≠ ∅ is a finite set of non-terminal symbols (variables), with T ∩ N = ∅. They represent the different types of phrases or clauses in the sentence.
- S is one of the non-terminals (S ∈ N), called the start variable (or start symbol), used to represent the whole sentence.
- R is a finite set of rules (productions) of the grammar. Each rule in R has the form X → α, where X is a non-terminal and α is a sequence of terminals and non-terminals.

A grammar G generates a language L. In parsing, CFGs or their variations are used to represent the grammar of a language, which is the grammatical basis for constructing methods to solve the parsing problem. The next section presents these methods.

2.2 Parsing Algorithms

2.2.1 Top-down parsing

Top-down parsing analyzes a given sentence in the following way: beginning with the start symbol, at each step it expands one of the remaining non-terminals (from left to right) by replacing it with the right-hand side of one of its productions in the grammar, until the desired string is reached.
In other words, a top-down parser generates a parse tree by constructing the tree from the start symbol (the root), using rules in the grammar to grow the tree from the root down to the leaves (the words, or lexicon). In top-down parsing, when several rules share the same left-hand side, rule selection can be simplified based on the size of the right-hand side (comparing the right-hand-side strings) or simply the order of the symbols on the right-hand side. When a top-down analysis cannot be completed, we backtrack to search for an appropriate rule to continue the parse construction.

2.2.2 Bottom-up parsing

Contrary to the top-down strategy, bottom-up parsing (also known as shift-reduce parsing) begins with the input sentence and uses two main actions (shift and reduce) to reduce the input string back to the start symbol (the root of the parse tree). Using a stack, the words of the input string are pushed onto the stack from left to right (shift); whenever the top of the stack contains the right-hand side of a rule, it can be replaced by the left-hand side of that rule (reduce). As in the top-down strategy, when errors occur or no analysis is possible, we backtrack and try a different rule. This process continues until no further backtracking is possible; at that point, if the stack has not been reduced to the start symbol, the bottom-up parser cannot analyze the input string.

2.2.3 Comparison between top-down parsing and bottom-up parsing

Both methods have advantages and disadvantages. The top-down strategy does not waste time examining trees that do not begin with the start symbol S; it never visits subtrees without root S. However, while it wastes no time on trees not rooted at S, the top-down parser wastes resources on trees that do not match the input string.
This weakness is a consequence of generating the parse tree before examining the input string. Conversely, in bottom-up parsing, although the generated parse trees may not be rooted at the start symbol S, they are always guaranteed to agree with the input string. In short, each strategy has advantages and disadvantages; combining the two approaches yields a better method.

2.2.4 CYK algorithm (Cocke-Younger-Kasami)

The CYK algorithm, sometimes known as the CKY algorithm, identifies whether an input string can be generated by a given CFG and, if so, how it can be generated. It is a form of bottom-up parsing using dynamic programming. The CYK algorithm operates on a CFG in Chomsky Normal Form (CNF), i.e., a CFG whose rules have the form R = {A → B C, A → a | A, B, C ∈ N, a ∈ T}.

Pseudo-code of the CYK algorithm:

Let the input be a string S consisting of n words: a1 ... an.
Let the grammar contain r non-terminal symbols R1 ... Rr, with a subset Rs of start symbols.
Let P[n, n, r] be an array of booleans; initialize all elements of P to false.
For each i = 1 to n:
    For each unit production Rj → ai, set P[i, 1, j] = true.
For each i = 2 to n:                      -- length of span
    For each j = 1 to n − i + 1:          -- start of span
        For each k = 1 to i − 1:          -- partition of span
            For each production RA → RB RC:
                If P[j, k, B] and P[j + k, i − k, C], then set P[j, i, A] = true.
If any P[1, n, x] is true (x iterated over the indices of the start symbols in Rs):
    Then S is a member of the language;
Else S is not a member of the language.

It is easy to see that the time complexity of this algorithm is O(n³). A table data structure is used to keep track of the parsing process in the CYK algorithm.

Example. We use the following grammar to analyze the Vietnamese sentence "mèo bắt chuột" (cats catch mice) using the CYK algorithm.
Grammar:
S → NP VP (1)
NP → N (2)
N → "tôi" (3)
N → "mèo" (4)
N → "chuột" (5)
VP → V (6)
VP → V PP (7)
V → "bắt" (8)
PP → N (9)

First, note that this grammar is not in CNF. It is converted into CNF as follows:
S → NP VP (1a)
NP → "tôi" (2a)
NP → "mèo" (3a)
NP → "chuột" (4a)
N → "tôi" (5a)
N → "mèo" (6a)
N → "chuột" (7a)
VP → "bắt" (8a)
VP → V PP (9a)
V → "bắt" (10a)
PP → "tôi" (11a)
PP → "mèo" (12a)
PP → "chuột" (13a)

We obtain the following table (chart):

Table 2.1: Analysis table with the CYK algorithm

span 3:  S
span 2:  S          | VP
span 1:  NP, N, PP  | VP, V  | NP, N, PP
words:   mèo        | bắt    | chuột

2.2.5 Earley algorithm

The Earley algorithm is a top-down parsing algorithm using dynamic programming. It carries the typical feature of dynamic programming: reducing the running time from exponential to polynomial by avoiding the re-generation of solutions through backtracking. The algorithm runs in O(N³) time, where N is the length of the input sequence. Unlike CYK, it does not require the grammar to be in CNF, and so overcomes the main disadvantage of the CKY approach.

The main idea of the Earley algorithm is to travel from left to right and create a chart consisting of N + 1 entries. For each word in the sentence, the chart contains a list of states which represent each component of the generated tree. When a sentence is completely parsed, the chart marks the end of the analysis process. Each subtree is constructed only once and can be reused by the parser. Each state is a chart entry with three parameters: a subtree corresponding to a grammar rule, information about how far the rule has been developed, and the location of the subtree relative to the input string. We place a dot (•) in the right-hand side of a grammar rule to mark how much of it has already been parsed. This structure is called the dotted rule.
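Before continuing with Earley parsing, the CYK procedure of Section 2.2.4 can be sketched in a few lines of Python over the CNF grammar of the example. The dictionary layout and function name here are our own illustration, not part of the thesis:

```python
# A compact CYK recognizer for the CNF grammar of Section 2.2.4.
unary = {                      # preterminal rules A -> word
    "tôi":   {"NP", "N", "PP"},
    "mèo":   {"NP", "N", "PP"},
    "chuột": {"NP", "N", "PP"},
    "bắt":   {"VP", "V"},
}
binary = [("S", "NP", "VP"),   # binary rules A -> B C
          ("VP", "V", "PP")]

def cyk(words, start="S"):
    n = len(words)
    # table[(i, l)] = nonterminals deriving the span of length l starting at i
    table = {(i, 1): set(unary.get(w, ())) for i, w in enumerate(words)}
    for length in range(2, n + 1):          # span length
        for i in range(n - length + 1):     # span start
            cell = set()
            for k in range(1, length):      # partition point
                for a, b, c in binary:
                    if b in table[(i, k)] and c in table[(i + k, length - k)]:
                        cell.add(a)
            table[(i, length)] = cell
    return start in table[(0, n)]

print(cyk(["mèo", "bắt", "chuột"]))   # True: the sentence is in the language
```

Running it fills exactly the cells of Table 2.1: the length-1 spans hold {NP, N, PP}, {VP, V}, {NP, N, PP}, the span "bắt chuột" holds VP, and the full span holds S.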
The position of a state is expressed by two factors: the location where the state starts and the position of the dot. Using the grammar of Section 2.2.4, several examples of dotted rules are:

S → • NP VP [0, 0]
VP → V • PP [1, 2]
N → "chuột" • [2, 3]

The basic principle of the Earley parser is to develop the parse tree through the set of N + 1 chart entries from left to right, processing each state in each set. At each step, one of the three operations described below is applied to the state at hand. In each case, the result is to add a new state to the current or the next entry in the chart. The algorithm always proceeds by adding new information to the chart: states are never removed, and the parser never backtracks to a previous chart entry. A state S → α •, [0, N] in the final chart entry shows that the input string has been parsed successfully.

The three main operators of the Earley algorithm are PREDICTOR, SCANNER, and COMPLETER. These operators take a state as input and produce new states. PREDICTOR and COMPLETER add states to the current chart entry, while SCANNER adds states to the next one.

+ Predictor. As its name suggests, Predictor is responsible for creating new states that represent expectations arising during the analysis. It is applied to any state in which a non-terminal that is not a part-of-speech category stands to the right of the dot. The result is a new state for each possible expansion of that non-terminal in the grammar.

+ Scanner. When a state has a part-of-speech category to the right of the dot, Scanner is called to check the input and, if the next word matches that category, add the corresponding state to the next chart entry, with the dot advanced over the predicted category.
Here, the Earley parser uses the input like a top-down parser to avoid ambiguity: only terminals predicted by existing states are entered into the chart.

+ Completer. The Completer operator is applied to states whose dot has reached the end of the rule; such a state shows that the analysis of that constituent has succeeded. Its purpose is to find all earlier states that were waiting for this constituent and advance them: a new state is generated by taking each such old state, shifting its dot over the completed constituent, and adding the new state to the current chart entry.

With the grammar below, we analyze the sentence "Tôi hát" ("I sing" in English) using the Earley algorithm:

S → NP VP (1)
NP → N (2)
N → "tôi" (3)
VP → V (4)
VP → V PP (5)
V → "hát" (6)
PP → N (7)

Chart[0]:
S → • NP VP [0, 0]
NP → • N [0, 0]
N → • "tôi" [0, 0]

Examine token "tôi":
Chart[1]:
N → "tôi" • [0, 1]
NP → N • [0, 1]
S → NP • VP [0, 1]
VP → • V [1, 1]
VP → • V PP [1, 1]
V → • "hát" [1, 1]

Examine token "hát":
Chart[2]:
V → "hát" • [1, 2]
VP → V • [1, 2]
VP → V • PP [1, 2]
PP → • N [2, 2]
N → • "tôi" [2, 2]
S → NP VP • [0, 2]

In Chart[2], there is the state S → NP VP • [0, 2], and the length of the input string is 2; thus the analysis is complete and successful.

2.3 Probabilistic context-free grammar (PCFGs)

2.3.1 The concept of PCFG

As mentioned in Chapter 1, conventional parsing systems used CFGs and followed the rule-based approach. This approach encountered several obstacles with selectional restrictions. A new approach was then proposed to overcome the disadvantages of CFGs. In this approach, the parsing problem is treated as a machine learning problem.
Through a training process, it aims to construct a probabilistic model, which is then used to produce the best parse tree for a test sentence. In this section we introduce the probabilistic context-free grammar (PCFG). In a PCFG, each rule has a probability, and the probability of a parse tree is the product of the probabilities of the rules used in that tree. The parser itself is an algorithm which searches for the tree T_best that maximizes P(T | S). A generative model uses the observation that maximizing P(T, S) is equivalent to maximizing P(T | S):

T_best = argmax_T P(T | S) = argmax_T P(T, S) / P(S) = argmax_T P(T, S)

Example: For the following PCFG, compute the probability of the parse tree (Figure 2.1) for the sentence "mèo bắt chuột" ("cats catch mice" in English).

S → NP VP    1.0  (1)
NP → N       1.0  (2)
N → "tôi"    0.33 (3)
N → "mèo"    0.33 (4)
N → "chuột"  0.34 (5)
VP → V       0.5  (6)
VP → V PP    0.5  (7)
V → "bắt"    0.5  (8)
PP → N       1.0  (9)

The probability of the tree is calculated as follows:

ρ = ρ(S → NP VP) · ρ(NP → N) · ρ(N → "mèo") · ρ(VP → V PP) · ρ(V → "bắt") · ρ(PP → N) · ρ(N → "chuột")
ρ = 1.0 · 1.0 · 0.33 · 0.5 · 0.5 · 1.0 · 0.34
ρ ≈ 0.028

2.3.2 Disadvantages of PCFGs

Although PCFGs can partly resolve ambiguity through the probability attached to each rule, they lack sensitivity to lexical information and structural preferences. In natural language, the semantics and the structure depend strongly on the context of the sentence.

Figure 2.1: The parse tree of the Vietnamese sentence "mèo bắt chuột"

Consider the sentence "Tôi hiểu Lan hơn Nga" as an example showing that PCFGs lack sensitivity to lexical information. This sentence has two parse trees, as shown in Figure 2.2. If the two trees have the same probability, we must use lexical context information to distinguish them. Thus, adding lexical information to PCFGs brings many benefits.
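The product-of-rule-probabilities calculation above can be reproduced mechanically by walking the tree. A small Python sketch, with a tree encoding of our own choosing (each node is a tuple of label and children):

```python
# Rule probabilities of the example PCFG in Section 2.3.1.
p = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("N",)):       1.0,
    ("N",  ("tôi",)):     0.33,
    ("N",  ("mèo",)):     0.33,
    ("N",  ("chuột",)):   0.34,
    ("VP", ("V",)):       0.5,
    ("VP", ("V", "PP")):  0.5,
    ("V",  ("bắt",)):     0.5,
    ("PP", ("N",)):       1.0,
}

def tree_prob(tree):
    """P(T): product of the probabilities of all rules used in the tree."""
    label, *children = tree
    if all(isinstance(c, str) for c in children):      # preterminal: N -> "mèo"
        return p[(label, tuple(children))]
    prob = p[(label, tuple(c[0] for c in children))]   # rule at this node
    for c in children:
        prob *= tree_prob(c)
    return prob

# Parse tree of "mèo bắt chuột" from Figure 2.1.
t = ("S", ("NP", ("N", "mèo")),
          ("VP", ("V", "bắt"), ("PP", ("N", "chuột"))))
print(round(tree_prob(t), 5))   # 0.02805, i.e. about 0.028
```

The result matches the hand calculation: 1.0 · 1.0 · 0.33 · 0.5 · 0.5 · 1.0 · 0.34 ≈ 0.028.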
2.4 Lexical Probabilistic Context Free Grammar (LPCFGs)

In the previous section, we introduced the probabilistic model for the parsing problem. However, this model still has some drawbacks. M. Collins proposed a new approach for parsing: the Lexical Probabilistic Context Free Grammar (LPCFG). In his paper (Collins, 1997), Collins introduced three models under this approach. In these three models, Collins added to the PCFG a new structure called the head.
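The effect of adding heads can be illustrated with a small sketch: each nonterminal is annotated with a head word propagated upward from a designated head child. The head-child table below is a simplified stand-in of our own, not Collins' actual head rules:

```python
# Head-word propagation, the core idea of a lexicalized PCFG.
# This head-child table is an illustrative simplification, not Collins' rules.
head_child = {"S": "VP", "VP": "V", "NP": "N", "PP": "N"}

def lexicalize(tree):
    """Turn (label, children...) into ((label, head_word), children...)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return ((label, children[0]), children[0])   # a word heads its own tag
    kids = [lexicalize(c) for c in children]
    # take the head word of the designated head child
    head = next(c[0][1] for c in kids if c[0][0] == head_child[label])
    return ((label, head), *kids)

t = ("S", ("NP", ("N", "mèo")),
          ("VP", ("V", "bắt"), ("PP", ("N", "chuột"))))
print(lexicalize(t)[0])   # ('S', 'bắt'): the verb heads the whole sentence
```

After lexicalization, the rule S → NP VP becomes S(bắt) → NP(mèo) VP(bắt), so rule probabilities can be conditioned on head words, which is what gives an LPCFG its sensitivity to lexical information.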