Lexicalized Statistical Parsing for Vietnamese
Pham Thi Minh Thu
Faculty of Information Technology
Hanoi University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Doctor Le Anh Cuong
A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
June, 2010
Table of Contents
Acknowledgements
1 Introduction
1.1 What is syntactic parsing?
1.2 Current Studies in Parsing
1.3 Vietnamese syntactic parsing
1.4 Objective of the Thesis
1.5 Thesis structure
2 Parsing approaches
2.1 Context Free Grammar (CFG)
2.2 Parsing Algorithms
2.2.1 Top-down parsing
2.2.2 Bottom-up parsing
2.2.3 Comparison between top-down parsing and bottom-up parsing
2.2.4 CYK algorithm (Cocke-Younger-Kasami)
2.2.5 Earley algorithm
2.3 Probabilistic context-free grammar (PCFGs)
2.3.1 The concept of PCFG
2.3.2 Disadvantages of PCFGs
2.4 Lexicalized Probabilistic Context Free Grammar (LPCFGs)
2.4.1 Head structure
2.4.2 The concept of Lexicalized Probabilistic Context Free Grammar (LPCFGs)
2.4.3 Three models of Collins
3 Vietnamese parsing and our approach
3.1 Vietnamese characteristics
3.2 Penn Treebank
3.2.1 POS tagging
3.2.2 Bracketing
3.3 Viet Treebank
3.3.1 Objectives
3.3.2 The POS tagset and syntax tagset for Vietnamese
3.4 Our approach in building a Vietnamese parser
3.4.1 Adapting Bikel's parser for Vietnamese
3.4.2 Analyzing errors and proposing heuristic rules
4 Experiments and Discussion
4.1 Data
4.2 Bikel's parsing tool
4.3 Adapting Bikel's tool to Vietnamese
4.3.1 Investigating different configurations
4.3.2 Training
4.3.3 Parsing
4.3.4 Evaluation of the parser
4.3.5 Results
4.4 Experimental results on using heuristic rules
5 Conclusions and Future Work
5.1 Summary
5.2 Contributions
5.3 Future work
List of Figures
1.1 The parse tree of the sentence "I go to school"
1.2 A parse tree in Vietnamese
2.1 The parse tree of the Vietnamese sentence "mèo bắt chuột"
2.2 Two derivations of the sentence "Tôi hiểu Lan hơn Nga"
2.3 A parse tree of Vietnamese in LPCFG
2.4 A tree with the "C" suffix used to identify complements
3.1 Set of tags in Penn Treebank
3.2 A sample of labeled data in Penn Treebank before manual treatment
3.3 A sample of labeled data in Penn Treebank after manual treatment
3.4 Tagset of Penn Treebank
3.5 A sample of complete data in English and Vietnamese
4.1 Overview of Bikel's system
4.2 Results of testing the standard Collins model 2 as the training data size changes from 60% to 100% of the full data; series 1 and series 2 stand for testing on sentences of length at most 40 and at most 100, respectively
List of Tables
2.1 Analysis table with the CYK algorithm
3.1 POS tagset in Viet Treebank
3.2 Phrase tagset in Viet Treebank
3.3 Clause tagset in Viet Treebank
3.4 Syntax function tagset in Viet Treebank
4.1 The initial results on Viet Treebank with different configurations. Key: CB = average crossing brackets, 0CB = zero crossing brackets, ≤2CB = at most two crossing brackets. All results are percentages, except those in the CB column
4.2 Number of sentences for training
4.3 Results as the training data set changes
4.4 The error rate. We use 520 sentences for development testing, then filter out the sentences with an F-score below 70%; as a result, we collect 147 sentences into the set of error sentences. The percentage for an error is the number of sentences committing that error divided by 147. Because a sentence may contain several errors, the total percentage may exceed 100
4.5 The results obtained after applying some proposed rules to correct wrong syntactic parses
Chapter 1
Introduction
For a long time, human beings have dreamed of an intelligent machine that can listen to, understand, and carry out human requests. Many scientists have worked toward that dream and contributed many achievements to the science of artificial intelligence. Within artificial intelligence, natural language processing (NLP) is the field that studies how to automatically understand and generate human language. NLP has many practical applications, such as machine translation, information extraction, discourse analysis, and text summarization. These applications share the same basic problems, such as lexical analysis, syntactic parsing, and semantic analysis. Among them, syntactic parsing plays the central role, and it is also the subject of this thesis.
1.1 What is syntactic parsing?
Syntactic parsing (also called parsing or syntactic analysis) is the process of analyzing a given sequence of tokens (i.e., a sentence) to identify its grammatical structure with respect to a given grammar. The grammatical structure is often represented in a form that visually displays the dependencies among components as a tree (called a parse tree or syntactic tree). In other words, parsing is the problem of taking a given sequence of words as input and producing as output the parse trees corresponding to that sequence.
Figures 1.1 and 1.2 show examples of parse trees: an English parse tree in the usual form and a Vietnamese tree in another form.
Parsing is the major module of a grammar checking system. In order to check grammar, we need to parse the input sentences and then examine the correctness of the structures in the output. Furthermore, a sentence which cannot be parsed may contain grammatical errors.
Figure 1.1: The parse tree of the sentence "I go to school"
Figure 1.2: A parse tree in Vietnamese
Parsing is also an important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question answering, and information extraction. For example, in transfer-based machine translation the system analyzes the source sentence to produce a parse tree and then constructs the equivalent parse tree in the target language; the output sentence is generated mainly from this equivalent parse tree. It is easy to see that in a question answering system we need parsing to identify the subject, the object, and the action. It is also interesting that parsing can help speech processing: it supports correcting faults of the speech recognition process. In speech synthesis, on the other hand, parsing helps place stress at the correct positions in the sentence.
From the examples above, we can see that constructing an accurate and effective parser will bring great benefits to many applications of natural language processing.
1.2 Current Studies in Parsing
As one of the basic and central problems of NLP, parsing has attracted many studies. They belong to one of two approaches: rule-based and statistics-based.
In conventional parsing systems, a grammar is hand-crafted and often involves a large amount of lexically specific information in the form of subcategorization information. There, ambiguity, a major problem in parsing, is resolved through selectional restrictions. For example, a lexicon might specify that "eat" must take an object with the feature +food. In (Collins, 1999), the author showed several problems with selectional restrictions, such as the growing volume of information required as the vocabulary becomes large. In other words, the biggest challenge is that, with a large vocabulary, both selectional restrictions and structural preferences should be encoded as soft preferences instead of hard constraints.
To overcome these obstacles, researchers began to explore machine-learning approaches to the parsing problem, primarily through statistical models. In these approaches, a set of example pairs of sentences and their corresponding syntactic trees is annotated by hand and used to train parsing models. Such a set of trees is called a "treebank". Several parts of the treebank are reserved as test data for evaluating the model's accuracy. Early works investigated the use of the probabilistic context-free grammar (PCFG). Using PCFGs is considered the next generation of parsing and also the beginning step of statistical parsing. In a PCFG, each grammar rule is associated with a probability, and the probability of a parse tree is the product of the probabilities of all rules used in that tree. In this case, parsing is essentially the process of searching for the tree that has the maximum probability.
However, a simple PCFG often fails due to its lack of sensitivity to lexical information and structural preferences. Several solutions were then proposed to resolve this problem. Some directions were listed in (Collins, 1999), such as: moving toward probabilistic versions of lexicalized grammars; using supervised training algorithms; constructing models with increased structural sensitivity; and looking into history-based models. Among them, the lexicalized probabilistic context-free grammar (LPCFG) is a promising approach. It can resolve many ambiguity phenomena in parsing, and some works based on it achieved high performance, such as (Collins, 1997). After this research, Daniel M. Bikel and
his coworkers developed Collins' models and designed a parser for multiple languages. It has been applied successfully to several languages, such as English, Chinese, and Arabic (Bikel, 2004). The concrete results of the parser for these languages are reported in (Bikel, 2004): for English, the F-measure is 90.01%; for Chinese, 81.2%; and for Arabic, 75.7%. According to these results and a comparison among current parsers (e.g., the Charniak parser, the Berkeley parser, and the Stanford parser), Bikel's parser is still rated one of the best parsers at present.
Recently, the LPCFG approach has continued to be applied to many languages. Moreover, a number of new strategies have been proposed to improve the accuracy of parsers. In several studies, semi-supervised training methods have become a promising direction; the experimental results show that this approach outperforms the supervised one without much additional computational cost. Other studies have integrated semantic information into parsing in order to fully exploit the benefits of lexical resources and upgrade the parser, such as (Xiong et al., 2005) and (Agirre & Baldwin, 2008). (Xiong et al., 2005) described the incorporation of semantic knowledge as follows: first, they used two Chinese electronic semantic dictionaries and heuristic rules to extract semantic categories; then they built a selectional preference sub-model based on the extracted semantic categories. Similarly, in (Agirre & Baldwin, 2008), sense information was added to parsing by substituting the original words with semantic tags corresponding to their semantic classes; for example, knife and scissors belong to the TOOL class, while cake and pork are assigned to the FOOD class. In addition, other proposed tactics have enhanced the performance of parsing, such as a powerful learning technique (sample selection) for reducing the amount of human-labeled training data (Carreras et al., 2008), or a strategy for utilizing POS tag resources to annotate parser input (Watson et al., 2007).
Through this review of parsing approaches, and especially of some recent studies, we found that LPCFG appears in all state-of-the-art parsing systems. Therefore, in our opinion, LPCFG is a good choice for Vietnamese parsing.
1.3 Vietnamese syntactic parsing
In Vietnam, work on natural language processing (i.e., computational linguistics) in general, and on parsing in particular, has started only very recently. A few parsers following the knowledge-based approach have been constructed with manually crafted grammar rules. Since the construction of grammar rules is manual, the accuracy of these parsers is not high: they only analyze a limited number of sentences generated by the grammar. The statistical approach has also been studied, but only briefly and without experimental results. For example, (Quoc-The & Thanh-Huong, 2008) presented LPCFG but, surprisingly, provided no experiments; only some examples were given to illustrate the syntactic ambiguity of Vietnamese. With such restricted results, no Vietnamese parser has been published widely.
It can be said that while many countries have gone a long way in parsing, Vietnam is just at the starting stage. The precondition for deploying these models for Vietnamese is a corpus of parsed sentences, which is the crucial resource for statistical parsing. Because of the lack of such a corpus, previous works on Vietnamese parsing have not had significant experimental results. Fortunately, there is now a standard Vietnamese parsed corpus, called the Viet Treebank, developed in a project supported by the Vietnamese government. This corpus contains about 10,000 parsed sentences following the Penn Treebank format, and therefore we can apply it to Bikel's tool.
As mentioned above, lexicalized models have been applied successfully to multiple languages. Among these languages, Chinese obtained an 81.2% F-score on average, as Bikel showed in (Bikel, 2004). This result also motivated our study, since the syntactic structure of Chinese is similar to that of Vietnamese.
1.4 Objective of the Thesis
This thesis focuses on building a syntactic parser for Vietnamese using the LPCFG approach. We will use the Viet Treebank as the parsed corpus and adapt Bikel's parsing tool for Vietnamese. We will also investigate some common errors appearing in the parses and propose a solution to improve the accuracy of the parser.
To achieve this objective, this study has to answer the following questions: How can Bikel's system be adapted for Vietnamese? What are the initial results? Which model and configuration are appropriate for Vietnamese? How can the performance of the system be improved based on error analysis?
In summary, in this study we carry out the following tasks:
- Study the basic techniques and methods in parsing, focusing on lexicalized statistical approaches;
- Analyze and adapt Bikel's parser (Bikel, 2004) for Vietnamese; with this aim, we try to build and publish a Vietnamese parsing tool that is useful for many tasks in Vietnamese processing;
- Investigate different parsing models and different linguistic features to discover the best configuration for Vietnamese;
- Analyze grammatical errors on a development test set and find a solution to improve the accuracy of the parser.
1.5 Thesis structure
The rest of this thesis is organized as follows:
Chapter 2 introduces basic parsing approaches, from classical methods such as the top-down and bottom-up strategies to statistics-based methods like the probabilistic context-free grammar (PCFG) and the lexicalized probabilistic context-free grammar (LPCFG). In this chapter, we also introduce important parsing algorithms, including CYK, Earley, and chart parsing.
Chapter 3 presents Vietnamese parsing and our approach. The characteristics of Vietnamese and the Viet Treebank are introduced in comparison with the Penn Treebank.
Chapter 4 describes our experiments and discussion. After introducing Bikel's parsing tool, we describe the process of applying and developing it for Vietnamese: adapting the tool for Vietnamese, investigating it in order to find the best configuration, and finally handling several grammatical errors to reduce the error rate and enhance the parser's performance.
Chapter 5 summarizes the obtained results, gives some conclusions of our work, and presents our plan for future work.
Chapter 2
Parsing approaches
In the previous chapter, we introduced the concept of parsing and its role in natural language processing. This chapter first presents the context-free grammar, and then presents common parsing methods, including the two classical parsing strategies (top-down and bottom-up), the CYK (Cocke-Younger-Kasami) algorithm, chart parsing, and the Earley algorithm. At the end of this chapter, we present the probabilistic context-free grammar and the lexicalized statistical model for parsing.
2.1 Context Free Grammar (CFG)
To analyze the syntax of a language, we first need to represent the language in a form that a computer can understand. The most popular formal representation of the grammar of a language is the context-free grammar (introduced by Chomsky). A language is defined as a set of strings, each generated from a finite, nonempty set of elements called the alphabet, as with the Vietnamese alphabet and the Vietnamese language.
A context-free grammar (CFG) consists of four components and is denoted G = (T, N, S, R), where:
- T ≠ ∅ is a finite set of elements called terminals (the lexicon). The set of terminals is the alphabet (the words) of the language defined by the grammar.
- N ≠ ∅ is a finite set of non-terminal symbols (variables), with N ∩ T = ∅. They represent the different types of phrases or clauses in a sentence.
- S ∈ N is a distinguished non-terminal called the start variable (or start symbol), used to represent the whole sentence.
- R is a finite set of rules (productions) of the grammar. Each rule in R has the form X → α, where X is a non-terminal and α is a sequence of terminals and non-terminals.
A grammar G generates a language L.
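To make the definition concrete, the following is a minimal sketch (in Python; ours, not from the thesis) of how the four components of a CFG might be represented. All names and the toy grammar are illustrative; this dictionary representation is reused in the parsing sketches later in this chapter.

    # A minimal sketch of G = (T, N, S, R) as a Python structure; names are illustrative.
    grammar = {
        "T": {"mèo", "bắt", "chuột"},                # terminals (the lexicon)
        "N": {"S", "NP", "VP", "N", "V", "PP"},      # non-terminals
        "S": "S",                                    # start symbol
        "R": {                                       # productions X -> alpha
            "S":  [("NP", "VP")],
            "NP": [("N",)],
            "VP": [("V",), ("V", "PP")],
            "PP": [("N",)],
            "N":  [("mèo",), ("chuột",)],
            "V":  [("bắt",)],
        },
    }
    assert grammar["S"] in grammar["N"]              # the start symbol is a non-terminal
    assert not (grammar["T"] & grammar["N"])         # T and N are disjoint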
In parsing, CFGs or their variations are used to represent the grammar of a language, which is the grammatical basis for constructing methods to solve the parsing problem. The next section presents these methods.
2.2 Parsing Algorithms
2.2.1 Top-down parsing
Top-down parsing is a strategy that analyzes a given sentence in the following way: beginning with the start symbol, at each step one of the remaining non-terminals (from left to right) is expanded by replacing it with the right-hand side of one of its productions, until the desired string is obtained. In other words, a top-down parser generates a parse tree by constructing the tree from the start symbol (the root) and using rules of the grammar to grow it from the root to the leaves (words, or lexical items).
In top-down parsing, when several rules share the same left-hand side, the choice among them can be simplified based on the size of the right-hand side (a comparison between the right-hand side strings) or simply on the order of the symbols in the right-hand sides. If a top-down analysis cannot be completed, we backtrack to search for another rule suitable for constructing the parse; a minimal sketch of such a parser is shown below.
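As an illustration only (our sketch, not part of the thesis), the following Python recognizer implements the naive top-down strategy with backtracking, using the rule dictionary sketched in Section 2.1. It assumes the grammar is not left-recursive, since left recursion would make this naive strategy loop forever.

    def topdown_recognize(rules, symbols, words):
        # rules: dict mapping a non-terminal to a list of right-hand sides (tuples).
        # symbols: the sequence of symbols still to be expanded (starts as [start symbol]).
        if not symbols:
            return not words                      # success only if all input is consumed
        head, rest = symbols[0], list(symbols[1:])
        if head in rules:                         # non-terminal: try its productions in order
            return any(topdown_recognize(rules, list(rhs) + rest, words)
                       for rhs in rules[head])
        # terminal: it must match the next input word
        return bool(words) and words[0] == head and topdown_recognize(rules, rest, words[1:])

    rules = {"S": [("NP", "VP")], "NP": [("N",)], "VP": [("V",), ("V", "PP")],
             "PP": [("N",)], "N": [("mèo",), ("chuột",)], "V": [("bắt",)]}
    print(topdown_recognize(rules, ["S"], ["mèo", "bắt", "chuột"]))   # True

Note how the recognizer builds structure first and only checks the input at the leaves; this is exactly the weakness discussed in the comparison below.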
2.2.2 Bottom-up parsing
Contrary to the top-down strategy, bottom-up parsing (also known as shift-reduce parsing) begins with the input sentence and uses two main actions (shift and reduce) to rewrite the input string backward into the start symbol (the root of the parse tree). Using a stack, the words of the input string are pushed onto the stack from left to right (shift); whenever the top of the stack contains the right-hand side of a rule, it can be replaced by the left-hand side of that rule (reduce).
As in the top-down strategy, when an error occurs or no analysis is possible, we backtrack and try a different rule. This process continues until no further backtracking is possible; at that point, if the stack has not been reduced to the start symbol, the bottom-up parser cannot analyze the input string. A sketch of this procedure follows.
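The sketch below (ours, in Python) explores every shift/reduce choice with backtracking. It assumes the grammar has no cyclic unary rules, which would make this blind search loop:

    def shift_reduce_recognize(productions, words, start="S"):
        # productions: list of (lhs, rhs) pairs, e.g. ("S", ("NP", "VP")).
        def search(stack, remaining):
            if stack == (start,) and not remaining:
                return True                        # stack reduced to the start symbol
            for lhs, rhs in productions:           # reduce: a rule's RHS tops the stack
                n = len(rhs)
                if n <= len(stack) and stack[-n:] == rhs:
                    if search(stack[:-n] + (lhs,), remaining):
                        return True
            if remaining:                          # shift: push the next word
                return search(stack + (remaining[0],), remaining[1:])
            return False
        return search((), tuple(words))

    productions = [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V",)),
                   ("VP", ("V", "PP")), ("PP", ("N",)),
                   ("N", ("mèo",)), ("N", ("chuột",)), ("V", ("bắt",))]
    print(shift_reduce_recognize(productions, ["mèo", "bắt", "chuột"]))   # True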
2.2.3 Comparison between top-down parsing and bottom-up parsing
Both methods have advantages and disadvantages.
The top-down strategy does not waste time examining trees that do not begin with the start symbol (S); that is, it never visits sub-trees whose root is not S. However, this approach has a weakness: while it does not waste time on trees not rooted in S, a top-down parser wastes resources on trees that do not match the input string. This weakness is a consequence of generating the parse tree before examining the input string.
Conversely, in bottom-up parsing, although the generated parse trees may not be rooted in the start symbol S, they are always guaranteed to agree with the input string.
In short, each strategy has advantages and disadvantages; by combining the two approaches, we can obtain a good method.
2.2.4 CYK algorithm (Cocke-Younger-Kasami)
The CYK algorithm, sometimes known as the CKY algorithm, determines whether an input string can be generated by a given CFG and, if so, how it can be generated. This algorithm is a form of bottom-up parsing using dynamic programming. The CYK algorithm operates on CFGs in Chomsky Normal Form (CNF), i.e., grammars whose rules have the form A → B C or A → a, where A, B, C ∈ N and a ∈ T.
- Pseudo-code of the CYK algorithm
The algorithm in pseudocode is as follows:
Let the input be a string S consisting of n characters: a1 ... an.
Let the grammar contain r nonterminal symbols R1 ... Rr, with a subset Rs of start symbols.
Let P[n, n, r] be an array of booleans; initialize all elements of P to false.
For each i = 1 to n:
  For each unit production Rj → ai, set P[i, 1, j] = true.
For each i = 2 to n:            -- length of span
  For each j = 1 to n − i + 1:  -- start of span
    For each k = 1 to i − 1:    -- partition of span
      For each production RA → RB RC:
        If P[j, k, B] and P[j + k, i − k, C], then set P[j, i, A] = true.
If any P[1, n, x] is true (x iterated over the indices of the start symbols Rs),
then S is a member of the language; else S is not a member of the language.
It is easy to see that the time complexity of this algorithm is O(n³). A table data structure is used to keep track of the parsing process in the CYK algorithm.
Example
We use the following grammar to analyze the Vietnamese sentence "mèo bắt chuột" (cats catch mice) using the CYK algorithm.
Grammar:
S → NP VP (1)
NP → N (2)
N → "tôi" (3)
N → "mèo" (4)
N → "chuột" (5)
VP → V (6)
VP → V PP (7)
V → "bắt" (8)
PP → N (9)
First, we can see that this grammar is not in CNF. It is converted into CNF as follows:
S → NP VP (1a)
NP → "tôi" (2a)
NP → "mèo" (3a)
NP → "chuột" (4a)
N → "tôi" (5a)
N → "mèo" (6a)
N → "chuột" (7a)
VP → "bắt" (8a)
VP → V PP (9a)
V → "bắt" (10a)
PP → "tôi" (13a)
PP → "mèo" (14a)
PP → "chuột" (15a)
We obtain the following table (chart):

Table 2.1: Analysis table with the CYK algorithm

  span 3:  S
  span 2:  S          VP
  span 1:  NP, N, PP  VP, V   NP, N, PP
  words:   mèo        bắt     chuột
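The pseudo-code above translates directly into the following runnable Python sketch (our illustration, using the CNF grammar of this example). Here the table is indexed by span start and end rather than start and length:

    binary_rules = [("S", "NP", "VP"), ("VP", "V", "PP")]       # A -> B C rules
    lexical = {"tôi": {"NP", "N", "PP"}, "mèo": {"NP", "N", "PP"},
               "chuột": {"NP", "N", "PP"}, "bắt": {"VP", "V"}}  # A -> word rules

    def cyk_recognize(words, start="S"):
        # table[i][j] holds the non-terminals that derive words[i..j] (inclusive).
        n = len(words)
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            table[i][i] = set(lexical.get(w, ()))
        for length in range(2, n + 1):               # span length
            for i in range(n - length + 1):          # span start
                j = i + length - 1                   # span end
                for k in range(i, j):                # partition point
                    for a, b, c in binary_rules:
                        if b in table[i][k] and c in table[k + 1][j]:
                            table[i][j].add(a)
        return start in table[0][n - 1]

    print(cyk_recognize(["mèo", "bắt", "chuột"]))    # True: the cell for the
                                                     # whole sentence contains S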
2.2.5 Earley algorithm
The Earley algorithm is a top-down parsing algorithm using dynamic programming. It carries the typical feature of dynamic programming, namely reducing the running time from exponential to polynomial by avoiding the re-generation of solutions during backtracking. Here, dynamic programming yields a running time of O(N³), where N is the length of the input sequence. Unlike CYK, it does not need the grammar to be in CNF, and so it overcomes the main disadvantage of the CYK approach.
The main idea of the Earley algorithm is to move from left to right and build a chart consisting of N + 1 entries. For each word in the sentence, the chart contains a list of states representing the components of the generated tree. When a sentence is completely parsed, the chart marks the end of the analysis of the input sentence. Each sub-tree is constructed only once and can be reused by the parser.
Each state is a chart entry with three pieces of information: a sub-tree corresponding to a grammar rule, information about how far that rule has been developed, and the location of the sub-tree with respect to the input string. We place a dot (•) in the right-hand side of a grammar rule to indicate how much of it has already been parsed; this structure is called a dotted rule. The position of a state is expressed by two indices: the start position of the state and the position of the dot.
Using the grammar of Section 2.2.4, we have several examples of dotted rules:
S → • NP VP [0, 0]
VP → V • PP [1, 2]
N → "chuột" • [2, 3]
The basic principle of the Earley parser is to develop the parse through the N + 1 state sets in the chart from left to right, processing each state in each set. At each step, one of the three operations described below is applied to a state; in each case, the result is to add a new state to the current or to the next chart entry. The algorithm always progresses by adding new information to the chart: states are never cancelled, and the parser never returns to a previous chart entry. A state of the form S → α • [0, N] in the last chart entry shows that the input string has been parsed successfully.
The three main operators of the Earley algorithm are PREDICTOR, COMPLETER, and SCANNER. These operators take a state as input and produce new states. PREDICTOR and COMPLETER add states to the current chart entry, while SCANNER adds states to the next one.
+ Predictor
As its name suggests, the Predictor is responsible for creating new states representing hypotheses that may occur during the analysis. It is applied to any state in which a non-terminal that is not a part-of-speech category appears to the right of the dot. The result is a new state for each expansion of that non-terminal in the grammar, with the dot at the beginning of its right-hand side.
+ Scanner
When a part-of-speech category appears to the right of the dot, the Scanner is called to check the input; if the next word matches, it creates a corresponding state in the next chart entry, with the dot advanced past the predicted category. In this way, the Earley parser uses the input like a top-down parser to avoid ambiguity: only terminals (categories) predicted by some state are entered into the chart.
+ Completer
The Completer is applied to states whose dot has reached the end of the rule; such a state represents a successfully analyzed constituent. The purpose of this operator is to find earlier states that were waiting for this constituent and to advance them. New states are generated by taking those older states, shifting their dots past the completed category, and adding the resulting states to the current chart entry.
With the grammar below, we analyze the sentence "Tôi hát" ("I sing" in English) using the Earley algorithm:
S → NP VP (1)
NP → N (2)
N → "tôi" (3)
VP → V (4)
VP → V PP (5)
V → "hát" (6)
PP → N (7)
Chart[0]:
S → • NP VP [0, 0]
NP → • N [0, 0]
N → • "tôi" [0, 0]
Examine token "tôi":
Chart[1]:
N → "tôi" • [0, 1]
NP → N • [0, 1]
S → NP • VP [0, 1]
VP → • V [1, 1]
VP → • V PP [1, 1]
V → • "hát" [1, 1]
Examine token "hát":
Chart[2]:
V → "hát" • [1, 2]
VP → V • [1, 2]
VP → V • PP [1, 2]
PP → • N [2, 2]
N → • "tôi" [2, 2]
S → NP VP • [0, 2]
In Chart[2] there is the state S → NP VP • [0, 2], and the length of the input string equals 2; thus the analysis is complete and successful.
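The whole procedure fits in a short program. Below is a minimal Earley recognizer in Python (our sketch, not code from the thesis): a state is a tuple (LHS, RHS, dot, origin), and the three branches correspond to PREDICTOR, SCANNER, and COMPLETER. It assumes the grammar has no ε-productions, which would require extra care in the completer.

    def earley_recognize(rules, start, words):
        # rules: dict non-terminal -> list of right-hand sides (tuples of symbols);
        # a symbol is a non-terminal iff it is a key of `rules`.
        n = len(words)
        chart = [set() for _ in range(n + 1)]
        chart[0] = {(start, rhs, 0, 0) for rhs in rules[start]}
        for k in range(n + 1):
            agenda = list(chart[k])
            while agenda:
                lhs, rhs, dot, origin = agenda.pop()
                if dot < len(rhs) and rhs[dot] in rules:       # PREDICTOR
                    for prod in rules[rhs[dot]]:
                        state = (rhs[dot], prod, 0, k)
                        if state not in chart[k]:
                            chart[k].add(state); agenda.append(state)
                elif dot < len(rhs):                           # SCANNER
                    if k < n and words[k] == rhs[dot]:
                        chart[k + 1].add((lhs, rhs, dot + 1, origin))
                else:                                          # COMPLETER
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            state = (l2, r2, d2 + 1, o2)
                            if state not in chart[k]:
                                chart[k].add(state); agenda.append(state)
        return any(l == start and d == len(r) and o == 0
                   for l, r, d, o in chart[n])

    rules = {"S": [("NP", "VP")], "NP": [("N",)], "N": [("tôi",)],
             "VP": [("V",), ("V", "PP")], "V": [("hát",)], "PP": [("N",)]}
    print(earley_recognize(rules, "S", ["tôi", "hát"]))        # True

Running this on "tôi hát" produces exactly the chart entries shown in the worked example above.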
2.3 Probabilistic context-free grammar (PCFGs)
2.3.1 The concept of PCFG
As mentioned in Chapter 1, conventional parsing systems used CFGs and followed the rule-based approach. This approach encountered several obstacles with selectional restrictions. A new approach was therefore proposed to overcome the disadvantages of CFGs. In this approach, the parsing problem is treated as a machine learning problem: through a training process, it aims to construct a probabilistic model, which is then used to produce the best parse tree for a test sentence. In this section we introduce the Probabilistic Context-Free Grammar (PCFG).
In a PCFG, each rule has a probability. The probability of a parse tree, P(T|S), is the product of the probabilities of the rules used in that tree. The parser itself is an algorithm which searches for the tree Tbest that maximizes P(T|S). A generative model uses the observation that maximizing P(T, S) is equivalent to maximizing P(T|S):

Tbest = argmax_T P(T|S) = argmax_T P(T, S) / P(S) = argmax_T P(T, S)

since P(S) is constant for a given sentence.
Example: for the following PCFG, we compute the probability of the parse tree (Figure 2.1) for the sentence "Mèo bắt chuột" ("Cats catch mice" in English).
S → NP VP     1.0  (1)
NP → N        1.0  (2)
N → "tôi"     0.33 (3)
N → "mèo"     0.33 (4)
N → "chuột"   0.34 (5)
VP → V        0.5  (6)
VP → V PP     0.5  (7)
V → "bắt"     0.5  (8)
PP → N        1.0  (9)
The probability of the tree is calculated as follows:
ρ = ρ(S → NP VP) × ρ(NP → N) × ρ(N → "mèo") × ρ(VP → V PP) × ρ(V → "bắt") × ρ(PP → N) × ρ(N → "chuột")
ρ = 1.0 × 1.0 × 0.33 × 0.5 × 0.5 × 1.0 × 0.34
ρ ≈ 0.028
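This computation is easy to mechanize. The following Python sketch (ours, with an illustrative tuple encoding of trees) scores the parse tree of Figure 2.1 under the PCFG above:

    # Rule probabilities of the PCFG above, keyed by (LHS, RHS).
    rule_prob = {("S", ("NP", "VP")): 1.0, ("NP", ("N",)): 1.0,
                 ("N", ("tôi",)): 0.33, ("N", ("mèo",)): 0.33, ("N", ("chuột",)): 0.34,
                 ("VP", ("V",)): 0.5, ("VP", ("V", "PP")): 0.5,
                 ("V", ("bắt",)): 0.5, ("PP", ("N",)): 1.0}

    def tree_prob(node):
        # A node is (label, child_1, ..., child_k); a leaf child is a plain word string.
        label, children = node[0], node[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = rule_prob[(label, rhs)]                 # probability of the rule at this node
        for c in children:
            if not isinstance(c, str):
                p *= tree_prob(c)                   # multiply in the sub-trees
        return p

    tree = ("S", ("NP", ("N", "mèo")),
                 ("VP", ("V", "bắt"), ("PP", ("N", "chuột"))))
    print(round(tree_prob(tree), 3))                # 0.028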
2.3.2 Disadvantages of PCFGs
Although PCFGs can partly resolve ambiguity by relying on the probability attached to each rule, they lack sensitivity to lexical information and structural preferences. In natural language, the semantics and the structure of a sentence depend strongly on its context.
Figure 2.1: The parse tree of the Vietnamese sentence "mèo bắt chuột"
Consider, for example, the sentence "Tôi hiểu Lan hơn Nga" ("I understand Lan more than Nga"), which shows that PCFGs lack sensitivity to lexical information. This sentence can be assigned two parse trees, as in Figure 2.2. If the two parses have the same probability, we must use lexical context information to distinguish them. Thus, adding lexical information to a PCFG brings many benefits.
2.4 Lexicalized Probabilistic Context Free Grammar (LPCFGs)
In the previous section, we introduced the probabilistic model for the parsing problem. However, this model still has some drawbacks. M. Collins proposed a new approach to parsing: the Lexicalized Probabilistic Context-Free Grammar (LPCFG). Collins introduced three models under this approach (Collins, 1997). Into the PCFG, these models put a new structure called the head.