Bilingual sentence alignment based on sentence length and word translation


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

HAI-LONG TRIEU

BILINGUAL SENTENCE ALIGNMENT BASED ON SENTENCE LENGTH AND WORD TRANSLATION

Major: Computer Science
Code: 60 48 01

MASTER THESIS OF INFORMATION TECHNOLOGY

Supervisor: PhD. Phuong-Thai Nguyen

Hanoi - 2014

ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."

Signed ........................................................................

Acknowledgements

I would like to thank my advisor, PhD. Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge I have received throughout my Master's course. I would also like to express my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation process.
I would like to thank PhD. Van-Vinh Nguyen for examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting me and checking several issues in my research. In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who taught and helped me throughout my studies at UET. Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.

Abstract

Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to apply these abundant materials to useful applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in texts of the source language to their corresponding units in texts of the target language. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including Statistical Machine Translation, word sense disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.

A number of algorithms with different approaches have been proposed for sentence alignment. However, they may be classified into a few major categories. First, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs that have a high similarity in sentence lengths. The second set of methods is based on word correspondences or lexicons.
These methods take lexical information about the texts into account, matching content words or using cognates. An external dictionary may be used, so these methods are more accurate but slower than the first group. There are also hybrid methods that combine the advantages of the first two approaches and therefore obtain high-quality alignments.

In this thesis, I summarize general issues related to sentence alignment, evaluate approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. From analyzing the limits of this method, I propose an algorithm using a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge and algorithm of bilingual word clustering, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, along with evaluations that demonstrate the benefits of the proposed method.

Keywords: sentence alignment, parallel corpora, natural language processing, word clustering.

Table of Contents

ORIGINALITY STATEMENT
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables

CHAPTER ONE. Introduction
  1.1. Background
  1.2. Parallel Corpora
    1.2.1. Definitions
    1.2.2. Applications
    1.2.3. Aligned Parallel Corpora
  1.3. Sentence Alignment
    1.3.1. Definition
    1.3.2. Types of Alignments
    1.3.3. Applications
    1.3.4. Challenges
    1.3.5. Algorithms
  1.4. Thesis Contents
    1.4.1. Objectives of the Thesis
    1.4.2. Contributions
    1.4.3. Outline
  1.5. Summary

CHAPTER TWO. Related Works
  2.1. Overview
  2.2. Overview of Approaches
    2.2.1. Classification
    2.2.2. Length-based Methods
    2.2.3. Word Correspondences Methods
    2.2.4. Hybrid Methods
  2.3. Some Important Problems
    2.3.1. Noise of Texts
    2.3.2. Linguistic Distances
    2.3.3. Searching
    2.3.4. Resources
  2.4. Length-based Proposals
    2.4.1. Brown et al., 1991
    2.4.2. Vanilla: Gale and Church, 1993
    2.4.3. Wu, 1994
  2.5. Word-based Proposals
    2.5.1. Kay and Roscheisen, 1993
    2.5.2. Chen, 1993
    2.5.3. Melamed, 1996
    2.5.4. Champollion: Ma, 2006
  2.6. Hybrid Proposals
    2.6.1. Microsoft's Bilingual Sentence Aligner: Moore, 2002
    2.6.2. Hunalign: Varga et al., 2005
    2.6.3. Deng et al., 2007
    2.6.4. Gargantua: Braune and Fraser, 2010
    2.6.5. Fast-Champollion: Li et al., 2010
  2.7. Other Proposals
    2.7.1. Bleu-align: Sennrich and Volk, 2010
    2.7.2. MSVM and HMM: Fattah, 2012
  2.8. Summary

CHAPTER THREE. Our Approach
  3.1. Overview
  3.2. Moore's Approach
    3.2.1. Description
    3.2.2. The Algorithm
  3.3. Evaluation of Moore's Approach
  3.4. Our Approach
    3.4.1. Framework
    3.4.2. Word Clustering
    3.4.3. Proposed Algorithm
    3.4.4. An Example
  3.5. Summary

CHAPTER FOUR. Experiments
  4.1. Overview
  4.2. Data
    4.2.1. Bilingual Corpora
    4.2.2. Word Clustering Data
  4.3. Metrics
  4.4. Discussion of Results
  4.5. Summary

CHAPTER FIVE. Conclusion and Future Work
  5.1. Overview
  5.2. Summary
  5.3. Contributions
  5.4. Future Work
    5.4.1. Better Word Translation Models
    5.4.2. Word-Phrase

Bibliography

List of Figures

Figure 1.1. A sequence of beads (Brown et al., 1991)
Figure 2.1. Paragraph length (Gale and Church, 1993)
Figure 2.2. Equation in dynamic programming (Gale and Church, 1993)
Figure 2.3. A bitext space in Melamed's method (Melamed, 1996)
Figure 2.4. The method of Varga et al., 2005
Figure 2.5. The method of Braune and Fraser, 2010
Figure 2.6. Sentence Alignment Approaches Review
Figure 3.1. Framework of sentence alignment in our algorithm
Figure 3.2. An example of Brown's cluster algorithm
Figure 3.3. English word clustering data
Figure 3.4. Vietnamese word clustering data
Figure 3.5. Bilingual dictionary
Figure 3.6. Looking up the probability of a word pair
Figure 3.7. Looking up in a word cluster
Figure 3.8. Handling the case where one word is contained in the dictionary
Figure 4.1. Comparison in Precision
Figure 4.2. Comparison in Recall
Figure 4.3. Comparison in F-measure

List of Tables

Table 1.1. Frequency of alignments (Gale and Church, 1993)
Table 1.2. Frequency of beads (Ma, 2006)
Table 1.3. Frequency of beads (Moore, 2002)
Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993)
Table 2.1. Alignment pairs (Sennrich and Volk, 2010)
Table 4.1. Training data-1
Table 4.2. Topics in Training data-1
Table 4.3. Training data-2
Table 4.4. Topics in Training data-2
Table 4.5. Input data for training clusters
Table 4.6. Topics for Vietnamese input data to train clusters
Table 4.7. Word clustering data sets

CHAPTER ONE
Introduction

1.1. Background

Parallel corpora play an important role in a number of tasks such as machine translation, cross-language information retrieval, word sense disambiguation, bilingual lexicography, automatic translation verification, and automatic acquisition of knowledge about translation. Building a parallel corpus, therefore, helps connect the languages under consideration [1, 5, 7, 12-13, 15-16]. Parallel texts, however, are useful only once they are sentence-aligned. A parallel corpus is first collected from various resources, and the translated segments forming it are very large, usually on the order of entire documents, which makes learning word correspondences an ambiguous task.
The solution to reduce the ambiguity is first to decrease the size of the segments within each pair, which is known as the sentence alignment task [7, 12-13, 16].

Sentence alignment is a process that maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task constructs a detailed map of the correspondence between a text and its translation (a bitext map) [14]. It is the first stage for Statistical Machine Translation. With aligned sentences, we can perform further analyses such as phrase and word alignment, bilingual terminology extraction, and collocation extraction, as well as other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms, therefore, become increasingly important.

A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17, 20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word correspondences [5, 11, 13-14]; some are hybrids of these two approaches [2, 6, 15, 19]. Additionally, there are some other outstanding methods for this task [7, 17]. For details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.

I propose an improvement to an effective hybrid algorithm [15] used in sentence alignment. For details of our approach, see Section 3.4. I also design experiments to illustrate my research. For details of the corpora used in the experiments, see Section 4.2. For results and discussion of the experiments, see Sections 4.4 and 4.5.

In the rest of this chapter, I describe some issues related to the sentence alignment task. In addition, I introduce the objectives of the thesis and our contributions. Finally, I describe the structure of this thesis.

1.2. Parallel Corpora

1.2.1. Definitions

Parallel corpora are collections of documents which are translations of each other [16].
Aligned parallel corpora are collections of pairs of sentences where one sentence is a translation of the other [1].

1.2.2. Applications

Bilingual corpora are an essential resource in multilingual natural language processing systems. This resource helps to develop data-driven natural language processing approaches, and it also contributes to applying machine learning to machine translation [15-16].

1.2.3. Aligned Parallel Corpora

A parallel text provides its maximum utility once it is sentence-aligned [13]. This makes the task of aligning parallel corpora of considerable interest, and a number of approaches have been proposed and developed to resolve this issue.

1.3. Sentence Alignment

1.3.1. Definition

Sentence alignment is the task of extracting pairs of sentences that are translations of one another from parallel corpora. Given a pair of texts, this process maps sentences in the text of the source language to their corresponding units in the text of the target language [3, 8, 13].

1.3.2. Types of Alignments

Aligning sentences amounts to finding a sequence of alignments. This section provides further definitions of "alignment" and issues related to it. Brown et al., 1991, assumed that every parallel corpus can be aligned as a sequence of minimal alignment segments, which they call "beads", in which sentences align 1-to-1, 1-to-2, 2-to-1, 1-to-0, or 0-to-1.

Figure 1.1. A sequence of beads (Brown et al., 1991).

Groups of sentence lengths are circled to show the correct alignment. Each grouping is called a bead, and each number shows the length of a sentence in the bead. In Figure 1.1, "17e" means the sentence length (17 words) of an English sentence, and "19f" means the sentence length (19 words) of a French sentence.
There is a sequence of beads as follows:

- an ef-bead (one English sentence aligned with one French sentence), followed by
- an eff-bead (one English sentence aligned with two French sentences), followed by
- an e-bead (one English sentence), followed by
- a ¶e¶f-bead (one English paragraph and one French paragraph).

An alignment, then, is simply a sequence of beads that accounts for the observed sequences of sentence lengths and paragraph markers [3]. There are many possible beads, but it is feasible to consider only some of them, including 1-to-1 (one sentence of the source language aligned with one sentence of the target language), 1-to-2 (one sentence of the source language aligned with two sentences of the target language), and so on. Brown et al., 1991 [3] considered the beads 1-to-1, 1-to-0, 0-to-1, 1-to-2, 2-to-1, and paragraph beads (¶e, ¶f, ¶ef), because their method also considers alignments by paragraphs. Moore, 2002 [15] considers only five of these beads, named as follows:

- 1-to-1 bead (a match)
- 1-to-0 bead (a deletion)
- 0-to-1 bead (an insertion)
- 1-to-2 bead (an expansion)
- 2-to-1 bead (a contraction)

A common piece of information related to this is the frequency of beads. Table 1.1 shows the frequencies of bead types reported by Gale and Church, 1993 [8].

Table 1.1. Frequency of alignments (Gale and Church, 1993)

  Category     Frequency   Prob(match)
  1-1          1167        0.89
  1-0 or 0-1   13          0.0099
  2-1 or 1-2   117         0.089
  2-2          15          0.011
  Total        1312        1.00

Meanwhile, the frequencies reported by Ma, 2006 [13] are shown in Table 1.2:

Table 1.2. Frequency of beads (Ma, 2006)

  Category     Frequency   Percentage
  1-1          1306        89.4%
  1-0 or 0-1   93          6.4%
  1-2 or 2-1   60          4.1%
  Others       2           0.1%
  Total        1461

Table 1.3 describes the frequencies of bead types in Moore, 2002 [15]:

Table 1.3.
Frequency of beads (Moore, 2002)

  Category   Percentage
  1-1        94%
  1-2        2%
  2-1        2%
  1-0        1%
  0-1        1%
  Total      100%

Generally, the 1-to-1 bead is the most frequent type in almost all corpora, with a frequency of around 90%, whereas each other type accounts for only a few percent.

1.3.3. Applications

Sentence alignment is an important topic in machine translation. It is an important first step for Statistical Machine Translation. It is also the first stage in extracting structural and semantic information and deriving statistical parameters from bilingual corpora [17, 20]. Moreover, it is the first step toward constructing a probabilistic dictionary (Table 1.4) for use in aligning words in machine translation, or toward constructing a bilingual concordance for use in lexicography.

Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993)

  English   French   Prob(French|English)
  the       le       0.610
  the       la       0.178
  the       l'       0.083
  the       les      0.023
  the       ce       0.013
  the       il       0.012
  the       de       0.009
  the       à        0.007
  the       que      0.007

1.3.4. Challenges

Although this process might seem very easy, it has some important challenges which make the task difficult [9]. The sentence alignment task is non-trivial because sentences do not always align 1-to-1. At times, a single sentence in one language might be translated as two or more sentences in the other language.

The input text also affects accuracy. The performance of sentence alignment algorithms decreases significantly when the input data becomes very noisy. Noisy data means that there are more 1-0 and 0-1 alignments in the data. For example, there are 89% 1-1 alignments in the English-French corpus (Gale and Church, 1991), and 1-0 and 0-1 alignments make up only 1.3% of that corpus, whereas in the UN Chinese-English corpus (Ma, 2006), there are 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods work very well on clean data, their performance degrades quickly as data becomes noisy [13].
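The bead-type percentages quoted above, and the share of 1-0 and 0-1 beads that signals noisy data, can be computed directly from a reference alignment. The sketch below is illustrative only; the sample corpus is hypothetical and merely mirrors the proportions in Moore's Table 1.3:

```python
from collections import Counter

# Bead types as (source sentences, target sentences), named after
# Moore (2002): match, deletion, insertion, expansion, contraction.
BEAD_NAMES = {(1, 1): "match", (1, 0): "deletion", (0, 1): "insertion",
              (1, 2): "expansion", (2, 1): "contraction"}

def bead_distribution(beads):
    # Percentage of each bead type in a reference alignment.
    counts = Counter(BEAD_NAMES.get(b, "other") for b in beads)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

# Hypothetical 100-bead corpus with the same proportions as Table 1.3.
beads = [(1, 1)] * 94 + [(1, 2)] * 2 + [(2, 1)] * 2 + [(1, 0), (0, 1)]
dist = bead_distribution(beads)
print(dist["match"])   # 94.0

# Share of 1-0 and 0-1 beads, the usual indicator of noise.
noise = dist.get("deletion", 0.0) + dist.get("insertion", 0.0)
print(noise)           # 2.0
```

A real evaluation would read the bead sequence from a hand-aligned reference file rather than construct it in code.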
In addition, it is difficult to achieve perfectly accurate alignments even if the texts are easy and "clean". For instance, the accuracy of an alignment program may decline dramatically when it is applied to a novel or a philosophical text, even though the same program gives excellent results on a scientific text. Alignment performance also depends on the languages of the corpus. For example, an algorithm based on cognates (words in a language pair that resemble each other phonetically) is likely to work better for English-French than for English-Hindi, because there are fewer cognates for English-Hindi [1].

1.3.5. Algorithms

A sentence alignment program is called "ideal" if it is fast, highly accurate, and requires no special knowledge about the corpus or the two languages [2, 9, 15]. A common requirement for sentence alignment approaches is to achieve both high accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a method for sentence alignment should also work in an unsupervised fashion and be language-pair independent, in order to be applicable to parallel corpora in any language without requiring a separate training set. A method is unsupervised if it learns an alignment model directly from the data set to be aligned. Language-pair independence means that the approach requires no specific knowledge about the languages of the parallel texts to be aligned.

1.4. Thesis Contents

This section introduces the organization of this thesis: objectives, our contributions, and the outline.

1.4.1. Objectives of the Thesis

In this thesis, I report the results of my study of sentence alignment and the approaches proposed for this task. In particular, I focus on Moore's method (2002), an outstanding method with a number of advantages. I also explore a new feature, word clustering, which may be applied to this task to improve alignment accuracy.
I examine this proposal in experiments and compare the results to those of the baseline method to demonstrate the advantages of my approach.

1.4.2. Contributions

My main contributions are as follows:

- Evaluating methods in sentence alignment and introducing an algorithm that improves Moore's method.
- Using a new feature, word clustering, to improve the accuracy of alignment. This contributes complementary strategies to the sentence alignment problem.

1.4.3. Outline

The rest of the thesis is organized as follows:

Chapter 2 - Related Works. In this chapter I introduce some recent research on sentence alignment. To give a general view of the methods proposed to deal with this problem, an overall presentation of sentence alignment methods is provided. Methods are classified into several types, and each method is presented by describing its algorithm along with related evaluations.

Chapter 3 - Our Approach. This chapter describes the method we propose to improve Moore's method. Initially, an analysis and evaluation of Moore's method are presented. The major content of this chapter is the framework of the proposed method, an algorithm using bilingual word clustering. An example is given to illustrate the approach clearly.

Chapter 4 - Experiments. This chapter presents the experiments performed with our approach. The data corpora used in the experiments are described in full. The results of the experiments, as well as discussions of them, are described to evaluate our approach against the baseline method.

Chapter 5 - Conclusions and Future Works. In this last chapter, the advantages and limitations of my work are summarized in a general conclusion. In addition, some research directions are mentioned for improving the current model in the future. Finally, references are given to the published research that my work draws on.

1.5. Summary

This chapter introduces my research work.
I have given background information about parallel corpora and sentence alignment, definitions of key terms, and some initial problems related to sentence alignment algorithms. The alignment terms used in this task have been defined in this chapter. In addition, an outline of my research work in this thesis has been provided, together with a brief discussion of proposed future work.

CHAPTER TWO
Related Works

2.1. Overview

This chapter introduces some recent research in sentence alignment, along with evaluations of these approaches. A number of problems related to this work are also discussed: factors that affect the performance of alignment algorithms, and the searching strategies and resources of each method. Evaluations of the algorithms are given to provide a general view of the advantages and weaknesses of each.

Section 2.2 provides an overview of sentence alignment approaches. Section 2.3 discusses some important problems. Section 2.4 introduces and evaluates the primary length-based proposals. Section 2.5 introduces and evaluates word-correspondence-based proposals. Hybrid proposals, with evaluations of each, are presented in Section 2.6. Some other outstanding approaches to this task are introduced in Section 2.7. Section 2.8 concludes this chapter.

2.2. Overview of Approaches

2.2.1. Classification

Since the first approaches proposed in the 1990s, a number of publications on sentence alignment with different techniques have appeared. Among the various sentence alignment algorithms that have been proposed, there are three widespread approaches, based respectively on a comparison of sentence lengths, on lexical correspondence, and on a combination of the first two. There are also some other techniques, such as methods based on the BLEU score, support vector machines, and hidden Markov model classifiers.

2.2.2.
Length-based Methods

Length-based approaches model the relationship between the lengths of sentences that are mutual translations, where length is measured in characters or words. These approaches do not consider the semantics of the text; statistical properties are used instead of the content of the texts. In other words, these methods consider only the lengths of sentences when making alignment decisions. They rely on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and shorter sentences into shorter sentences.

A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences (in characters) and the variance of this difference. There are two random variables l1 and l2, the lengths of the two sentences under consideration. These random variables are assumed to be independent and identically distributed with a normal distribution [8]. Given two parallel texts ST (source text) and TT (target text), the goal of this task is to find the alignment A with the highest probability:

    max_A P(A, ST, TT)

To estimate this probability, the aligned text is decomposed into a sequence of aligned sentence beads, where each bead is assumed to be independent of the others.

Algorithms of this type were first proposed by Brown et al., 1991 and Gale and Church, 1993. These approaches use sentence-length statistics to model the relationship between groups of sentences that are translations of each other. Wu (Wu, 1994) also uses the length-based method, applying the algorithm proposed by Gale and Church, and further uses lexical cues from a corpus-specific bilingual lexicon to improve alignment.
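As a concrete illustration of this length-based model, the following is a minimal sketch of a Gale-and-Church-style aligner: the cost of a bead combines its type prior with the normally distributed, scaled length difference, and dynamic programming finds the lowest-cost bead sequence. The ratio and variance constants are the English-French character-count values from Gale and Church; the per-type priors are an assumption (the paper's combined 1-0/0-1 and 2-1/1-2 probabilities split evenly), so treat this as a sketch, not their exact implementation:

```python
import math

C, S2 = 1.0, 6.8   # mean target/source length ratio and variance of delta
# Prior probabilities of bead types (cf. Table 1.1; splits are assumed).
PRIOR = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
         (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.011}

def bead_cost(ls, lt, bead):
    # -log probability of matching ls source chars with lt target chars.
    mean = (ls + lt / C) / 2.0
    delta = (lt - ls * C) / math.sqrt(max(mean, 1.0) * S2)
    # Two-sided tail probability of the standard normal.
    p_delta = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    return -math.log(max(p_delta, 1e-10)) - math.log(PRIOR[bead])

def align(src, tgt):
    # src, tgt: sentence lengths in characters; returns the best bead sequence.
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj) in PRIOR:          # try every bead type
                if i + di > n or j + dj > m:
                    continue
                c = cost[i][j] + bead_cost(sum(src[i:i + di]),
                                           sum(tgt[j:j + dj]), (di, dj))
                if c < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = c
                    back[i + di][j + dj] = (di, dj)
    beads, i, j = [], n, m
    while i > 0 or j > 0:                   # recover the bead sequence
        di, dj = back[i][j]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]

print(align([20, 40], [21, 42]))   # [(1, 1), (1, 1)]
```

Because longer texts make the quadratic table expensive, practical aligners of this family restrict the search to a band around the diagonal or use paragraph anchors, as discussed below.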
The methods of this type are based solely on the lengths of sentences, so they require almost no prior knowledge. Furthermore, these methods are highly accurate despite their simplicity, and they are also fast. When aligning texts whose languages are similar or have a high length correlation, such as English, French, and German, these approaches are especially useful and work remarkably well. They also perform fairly well if the input text is clean, as in the Canadian Hansards corpus [3]. The Gale and Church algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).

Nevertheless, these methods are not robust, since they use only sentence-length information. They are no longer reliable if there is too much noise in the input bilingual texts. As shown in (Chen, 1993) [5], the accuracy of sentence-length-based methods decreases drastically when aligning texts containing small deletions or free translations; they can easily misalign small passages because they ignore word identities. The algorithm of Brown et al. requires corpus-dependent anchor points, while the method