Parallel Texts Extraction from the Web

by Le Quang Hung
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr. Le Anh Cuong

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

December, 2010

Contents

ORIGINALITY STATEMENT
Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Parallel corpus and its role
  1.2 Current studies on automatically extracting parallel corpus
  1.3 Objectives of the thesis
  1.4 Contributions
  1.5 Thesis' structure

2 Related works
  2.1 The general framework
  2.2 Structure-based methods
  2.3 Content-based methods
  2.4 Hybrid methods
  2.5 Summary

3 The proposed approach
  3.1 The proposed model
    3.1.1 Host crawling
    3.1.2 Content-based filtering module
      3.1.2.1 The method based on cognation
      3.1.2.2 The method based on identifying translation segments
    3.1.3 Structure analysis module
    3.1.4 Classification modeling
  3.2 Summary

4 Experiment
  4.1 Evaluation measures
  4.2 Experimental setup
  4.3 Experimental results
  4.4 Discussion

5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future works

A LIBSVM tool
B Relevant publications
Bibliography

List of Figures

1.1 An example of English-Vietnamese parallel texts.
2.1 General architecture in building parallel corpus.
2.2 The STRAND architecture [1].
2.3 An example of aligning two documents.
2.4 The workflow of the PTMiner system [2].
2.5 The algorithm of translation pairs finder [3].
2.6 Architecture of the PTI system [4].
2.7 An example of the two links in the text.
3.1 Architecture of the Parallel Text Mining system.
3.2 Architecture of a standard Web crawler.
3.3 An example of a candidate pair.
3.4 Description of the process of the content-based filtering module.
3.5 An example of two corresponding texts of English and Vietnamese.
3.6 The algorithm measures similarity of cognates between a text pair (Etext, Vtext).
3.7 Relationships between bilingual web pages.
3.8 The paragraphs can be denoted in HTML pages based on the <p> tag.
3.9 Identifying translation paragraphs.
3.10 A sample code written in Java to perform translation from English into Vietnamese via the Google AJAX API.
3.11 Web documents and the source HTML code for two parallel translated texts.
3.12 An example of the publication date feature extracted from an HTML page.
3.13 Classification model.
4.1 Figure for precision and recall measures.
4.2 The format of training and testing data.
4.3 Performance of the identifying-translation-segments method.
4.4 Comparison of the methods.

List of Tables

1.1 Europarl parallel corpus: 10 aligned language pairs, all of which include English.
3.1 Symbols and descriptions.
4.1 URLs from three sites: BBC, VOA News and VietnamPlus.
4.2 Number of pages downloaded and number of candidate pairs.
4.3 Structure-based method.
4.4 Content-based method.
4.5 Method based on cognation.
4.6 Combining structural features and cognate information.
4.7 Identifying translation at document level.
4.8 Identifying translation at paragraph level.
4.9 Identifying translation at sentence level.
4.10 Overall results of each method (P = Precision, R = Recall, F = F-score).

Chapter 1
Introduction

In this chapter, we first introduce the parallel corpus and its role in NLP applications. Current studies, the objectives of the thesis, and its contributions are then presented. Finally, the structure of the thesis is briefly described.

1.1 Parallel corpus and its role

Parallel text. Different definitions of the term "parallel text" (also known as bitext) can be found in the literature. As commonly understood, a parallel text is a text in one language together with its translation in another language. Dan Tufis [5] gives the definition: "parallel text is an association between two texts in different languages that represent translations of each other". Figure 1.1 shows an example of English-Vietnamese parallel texts.

Parallel corpus. A parallel corpus is a collection of parallel texts. According to [6], the simplest case is one where only two languages are involved and one of the corpora is an exact translation of the other (e.g., the COMPARA corpus [7]). However, some parallel corpora exist in several languages. For instance, the Europarl parallel corpus [8] includes versions in 11 European languages, as reported in Table 1.1. In addition, the direction of the translation need not be constant, so that some texts in a parallel corpus may have been translated from language L1 to language L2 and others the other way around. The direction of the translation may not even be known.
Figure 1.1: An example of English-Vietnamese parallel texts.

Parallel corpora exist in several formats. They can be raw parallel texts, or they can be aligned texts. The texts can be aligned at the paragraph level, the sentence level, or even at the phrase and word level. The alignment of the texts is useful for different NLP tasks. Statistical machine translation [9, 10] uses parallel sentences as the input for the alignment module, which produces word translation probabilities. Cross-language information retrieval [11–13] uses parallel texts to determine corresponding information in both the question and the answer. Extracting semantically equivalent components of parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [14, 15]. Parallel texts are also used for the acquisition of lexical translations [16] and for word sense disambiguation [17]. For most of the mentioned tasks, parallel corpora currently play a crucial role in NLP applications.

Table 1.1: Europarl parallel corpus: 10 aligned language pairs, all of which include English.

Parallel Corpus (L1-L2)    Sentences    L1 Words      English Words
Danish-English             1,684,664    43,692,760    46,282,519
German-English             1,581,107    41,587,670    43,848,958
Greek-English                960,356    -             27,468,389
Spanish-English            1,689,850    48,860,242    46,843,295
Finnish-English            1,646,143    32,355,142    45,136,552
French-English             1,723,705    51,708,806    47,915,991
Italian-English            1,635,140    46,380,851    47,236,441
Dutch-English              1,715,710    47,477,378    47,166,762
Portuguese-English         1,681,991    47,621,552    47,000,805
Swedish-English            1,570,411    38,537,243    42,810,628

1.2 Current studies on automatically extracting parallel corpus

Nowadays, along with the development of the Internet, the Web has become a huge database of multilingual documents, and it is therefore useful for bilingual text processing. For that reason, many studies [1–4, 18–22] have paid attention to mining parallel corpora from the Web. Basically, these studies can be classified into three groups: content-based (CB) [3, 4, 22], structure-based (SB) [1, 2, 18], and hybrid, i.e., a combination of both [19–21].

The CB approach uses the textual content of the document pairs being evaluated. It usually relies on lexical translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts. When a bilingual dictionary is available, documents are translated word by word into the target language. The translated documents are then used to find the best-matching parallel documents by applying similarity functions such as cosine, Jaccard, Dice, etc. However, using a bilingual dictionary may cause difficulty because a word usually has many translations.

Meanwhile, the SB approach relies on analyzing the HTML structure of pages. It uses the hypothesis that parallel web pages are presented with similar structures, so the similarity of two web pages is estimated from their HTML structure. Note that this approach does not require linguistic knowledge. In addition, it is very effective at filtering out a large number of unmatched documents, as it is quite fast yet reasonably accurate. Nevertheless, it has the drawback of requiring that two pages with similar content be presented with the same structure.
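As a rough illustration of how structural similarity can be estimated, the sketch below (not the implementation of any cited system) linearizes each page into its HTML tag sequence and compares the two sequences; the tag extraction and the similarity ratio are illustrative choices only.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequencer(HTMLParser):
    """Collects the sequence of start/end tags of an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")


def tag_sequence(html_text):
    parser = TagSequencer()
    parser.feed(html_text)
    return parser.tags


def structural_similarity(html_a, html_b):
    """Ratio in [0, 1]; 1.0 means identical tag sequences."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


if __name__ == "__main__":
    page_en = "<html><body><h1>Title</h1><p>Some English text.</p></body></html>"
    page_vi = "<html><body><h1>Tiêu đề</h1><p>Một đoạn văn tiếng Việt.</p></body></html>"
    print(structural_similarity(page_en, page_vi))  # 1.0: identical tag sequences
```

Note that two pages generated from the same template score close to 1.0 even when their texts are unrelated.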
From our observation, many sites use the same template to design their pages: the structure of the pages is similar, but their content is different. For that reason, the HTML structure-based approach is not applicable in some cases.

1.3 Objectives of the thesis

As we have introduced, a parallel corpus is a valuable resource for different NLP tasks. Unfortunately, the available parallel corpora are not only relatively small, but also unbalanced, even for the major languages [3]. Where resources are available, such as for English-French, the data are usually restricted to government documents (e.g., the Hansard corpus) or newswire texts. Others have limited availability due to licensing restrictions, as in [23]. According to [24], there are now some reliable parallel corpora: the Hansard Corpus (http://www.isi.edu/natural-language/download/hansard/), the JRC-Acquis Parallel Corpus (http://langtech.jrc.it/JRC-Acquis.html), Europarl (http://www.statmt.org/europarl/), and COMPARA (http://www.linguateca.pt/COMPARA/). However, these resources exist only for some language pairs. In Vietnam, NLP is at an early stage, and the lack of parallel corpora is even more severe. The lack of such resources has been an obstacle to the development of data-driven NLP technologies. There are a few studies on mining parallel corpora from the Web; one of them is presented in [22] (for the English-Vietnamese language pair). On the other hand, the current studies [1–4, 18–21], while extremely useful, have a few drawbacks, as mentioned in Section 1.2. So, obtaining a parallel corpus of high quality is still a challenge, and this remains a strong motivation for further work on this problem.

The objective of this research is to extract parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose two new methods of designing content-based features: (1) a method based on cognation, and (2) a method based on identifying translation segments. Then, we combine the content-based features with structural features under a machine learning framework.

1.4 Contributions

In our work, we aim to automatically extract English-Vietnamese parallel texts. Encouraged by [20], we formulate this problem as a classification problem in order to exploit, as much as possible, both structural information and content similarity. The most important contribution of our work is that we propose two new methods of designing content-based features and combine them with structure-based features to extract parallel texts from bilingual web sites.

• The first method is based on cognation. It is worth emphasizing that, differently from previous studies [2, 20], we use cognate information instead of word-by-word translation. From our observation, when a text is translated from one language to another, some special parts are kept unchanged or changed only slightly. These parts are usually abbreviations, proper nouns, and numbers (a small illustrative sketch is given after this list). We also use other content-based features, such as the length of tokens and the length of paragraphs, which likewise do not require any linguistic analysis. It is worth noting that this approach does not need any dictionary, so we think it can be applied to other language pairs.

• The second method is based on identifying translation segments and is used to match translation paragraphs. That will help us to extract proper translation units from bilingual web pages.
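As a rough illustration of the first method, the sketch below extracts numbers, all-capital abbreviations, and capitalized words (a crude proxy for proper nouns) from each text and scores their overlap. The regular expressions and the Dice-style score are illustrative assumptions, not the exact features used in the thesis.

```python
import re


def cognate_tokens(text):
    """Numbers, all-capital abbreviations, and capitalized words: parts of a
    text that tend to survive translation unchanged or nearly unchanged."""
    numbers = re.findall(r"\d+(?:[.,]\d+)*", text)
    abbreviations = re.findall(r"\b[A-Z]{2,}\b", text)
    capitalized = re.findall(r"\b[A-Z][a-z]+\b", text)  # crude proper-noun proxy
    return set(numbers) | set(abbreviations) | set(capitalized)


def cognate_similarity(e_text, v_text):
    """Dice-style overlap of cognate tokens between an English text and a
    Vietnamese text; the value lies in [0, 1]."""
    e_set, v_set = cognate_tokens(e_text), cognate_tokens(v_text)
    if not e_set and not v_set:
        return 0.0
    return 2 * len(e_set & v_set) / (len(e_set) + len(v_set))


print(cognate_similarity(
    "NATO spent 2 billion USD in 2010, said John Smith.",
    "NATO đã chi 2 tỷ USD trong năm 2010, ông John Smith cho biết."))  # 1.0
```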
For the second method, previous studies usually use lexical translations obtained from a bilingual dictionary to measure the similarity of the content of two texts, as in [4, 20]. This approach may cause difficulty because a word usually has many translations. Differently, we use the Google translator, because it lets us exploit the advantages of statistical machine translation: it helps with resolving lexical ambiguity, translating phrases, and reordering.

1.5 Thesis' structure

Given below is a brief outline of the topics discussed in the following chapters of this thesis:

Chapter 2 - Related works. The studies that are closely related to our work are introduced in this chapter.

Chapter 3 - The proposed approach. We present our proposed model, including the general architecture of the model and how the structural features and content-based features are designed and estimated.

Chapter 4 - Experiment. This chapter evaluates the effectiveness of our proposed method for extracting parallel texts from the Web. The performance of our proposed method and of the baseline is presented here.

Chapter 5 - Conclusion and Future works. Final conclusions about our work as a whole and an evaluation of the results in particular are presented, followed by suggestions of possible future work.

Finally, the references list research that is closely related to our work.

Chapter 2
Related works

In this chapter, we outline the general framework for building a parallel corpus. Then, we review the studies that are closely related to our work.

2.1 The general framework

Figure 2.1: General architecture in building parallel corpus.

In general, there are two approaches to building a parallel corpus (illustrated in Figure 2.1). The first is to automatically collect bilingual documents from the Web. The process of identifying parallel texts is a simple step-by-step procedure: (1) locate bilingual web sites, (2) crawl for URLs of possible parallel web pages, and (3) match parallel pages. Content features and structural features are used to extract the parallel texts (the details of this task are presented in the following sections). The second approach is based on monolingual corpora [25]. As seen from the diagram, starting with two large monolingual corpora (a non-parallel corpus) divided into documents, this approach consists of three steps: (1) select pairs of similar documents; (2) from each such pair, generate all possible sentence pairs and pass them through a simple word-overlap-based filter, obtaining candidate sentence pairs; and (3) present the candidates to a maximum entropy (ME) classifier that decides whether the sentences in each pair are mutual translations of each other. The next section presents some related works on mining parallel corpora from the Web.

2.2 Structure-based methods

Parallel web pages within a site generally have comparable structures and contents. Therefore, a large number of studies focus on characteristics of HTML structure such as URL links, filenames, and HTML tags. Several systems have been developed to find parallel web pages on the Web. In this section, we describe two of these systems: the original STRAND [1, 18] and PTMiner [2].

The original STRAND is an architecture for structural translation recognition, acquiring natural data. Its goal is to identify pairs of web pages that are mutual translations.
In order to do this, it exploits an observation about the way web page authors disseminate information in multiple languages: when presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure. STRAND therefore locates pages that might be translations of each other, via a number of different strategies, and filters out page pairs whose structures diverge too much. The STRAND architecture has three basic steps (illustrated in Figure 2.2):

Figure 2.2: The STRAND architecture [1].

• Location of pages that might have parallel translations,
• Generation of candidate pairs that might be translations, and
• Structural filtering out of non-translation candidate pairs.

The heart of STRAND is a structural filtering process that relies on analysis of the pages' underlying HTML to determine a set of pair-specific structural values, and then uses those values to decide whether the pages are translations of one another. The first step in this process is to linearize the HTML structure while ignoring the actual linguistic content of the documents. Both documents in a candidate pair are run through a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of token:

[START:element label]  e.g., [START:H3]
[END:element label]    e.g., [END:H3]
[Chunk:length]         e.g., [Chunk:250]

The chunk length is measured in non-whitespace bytes, and the HTML tags are normalized for case. Attribute-value pairs within the tags are treated as non-markup text (e.g., a FONT tag whose attribute text is 12 non-whitespace bytes produces [START:FONT] followed by [Chunk:12]).

The second step is to align the linearized sequences using a standard dynamic programming technique. For example, consider two documents that begin as in Figure 2.3.

Figure 2.3: An example of aligning two documents.

Using this alignment, the authors compute four values from the aligned structures, which indicate the amount of non-shared material, the number of aligned non-markup text chunks of unequal length, the correlation of lengths of the aligned non-markup chunks, and the significance level of that correlation. Machine learning, namely decision trees, is then used for filtering based on these four values.

The PTMiner system [2] extracts bilingual English-Chinese documents. It uses a search engine to locate hosts that may contain parallel web pages. To generate candidate pairs, PTMiner uses a URL-matching process (e.g., the Chinese counterpart of a URL such as "http://www.XXXX.com/../eng/..e.html" might be "http://www.XXXX.com/../chi/..c.html") together with other features such as size, date, etc. Note that the URLs do not match in most bilingual English-Vietnamese web sites.

Figure 2.4: The workflow of the PTMiner system [2].

PTMiner implements the following steps (illustrated in Figure 2.4):

1. Search for candidate sites - Using existing Web search engines, search for candidate sites that may contain parallel pages.
2. Filename fetching - For each candidate site, fetch the URLs of web pages that are indexed by the search engines.
3. Host crawling - Starting from the URLs collected in the previous step, search through each candidate site separately for more URLs.
4. Pair scan - From the obtained URLs of each site, scan for possible parallel pairs.
5. Download and verification - Download the parallel pages, determine the file size, language, and character set of each page, and filter out non-parallel pairs.
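The pair-scan step can be pictured with a small sketch that proposes counterpart URLs by substituting language markers and keeps only those that actually exist among the crawled URLs. The substitution pairs and URLs below are illustrative assumptions, not PTMiner's actual rules.

```python
# Illustrative language-marker substitutions; PTMiner uses its own rules and
# additional features (size, date), which are not reproduced here.
SUBSTITUTIONS = [
    ("/eng/", "/chi/"), ("/en/", "/zh/"),
    ("_e.html", "_c.html"), ("-english", "-chinese"),
]


def candidate_counterparts(url):
    """Propose URLs that might point to the Chinese version of an English page."""
    candidates = set()
    for eng_marker, chi_marker in SUBSTITUTIONS:
        if eng_marker in url:
            candidates.add(url.replace(eng_marker, chi_marker))
    return candidates


def pair_scan(english_urls, crawled_urls):
    """Pair scan: keep (English, Chinese) URL pairs whose proposed counterpart
    actually exists among the crawled URLs."""
    url_set = set(crawled_urls)
    return [(url, c) for url in english_urls
            for c in candidate_counterparts(url) if c in url_set]


print(pair_scan(
    ["http://www.example.com/news/eng/page1_e.html"],
    ["http://www.example.com/news/eng/page1_e.html",
     "http://www.example.com/news/chi/page1_e.html"]))
```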
In the experiments, several hundred selected pairs were evaluated manually. The results were quite promising: from a corpus of 250 MB of English-Chinese text, statistical evaluation showed that 90% of the identified pairs were correct.

2.3 Content-based methods

The approach discussed thus far relies heavily on document structure. However, as Ma and Liberman [3] point out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only to corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags. All these considerations motivate an approach to matching translations that pays attention to similarity of content, whether or not similarities of structure exist. In this section, we describe three systems: Bilingual Internet Text Search (BITS) [3], Parallel Text Identification (PTI) [4], and Dang's system [22].

The BITS system starts with a given list of domains to search for parallel text. In this system, a translation lexicon (each entry lists a word in language L1 and its translation in language L2) is used to find translation token pairs. For a given text A in language L1, the system first tokenizes A and every text B in language L2. The similarity between A and every text B in language L2 is measured by the algorithm in Figure 2.5. The system then finds the B that is most similar to A; if the similarity between A and B is greater than a given threshold t, then A and B are declared a translation pair. The similarity between A and B is defined as

    sim(A, B) = (Number of translation token pairs) / (Number of tokens in text A)    (2.1)

Figure 2.5: The algorithm of the translation pairs finder [3].

In the experiments, Ma and Liberman use an English-German bilingual lexicon of 117,793 entries. The authors report 99.1% precision and 97.1% recall on a hand-picked set of 600 documents (half in each language) containing 240 translation pairs (as judged by humans).

The PTI system (illustrated in Figure 2.6) crawls the Web to fetch parallel multilingual documents using a Web spider. To determine the parallelism between potential bilingual document pairs, two different modules are developed. A filename comparison module checks filename resemblance. A content analysis module measures the degree of semantic similarity: it incorporates a novel content-based similarity scoring method for measuring the degree of parallelism of every potential document pair based on their semantic content, using a bilingual wordlist. The results showed that the PTI system achieves a precision rate of 93% and a recall rate of 96% (180 correct instances among a total of 193 extracted pairs).

Figure 2.6: Architecture of the PTI system [4].

To our knowledge, there are few studies in this field related to Vietnamese. [22] built an English-Vietnamese parallel corpus based on content-based matching. First, candidate web page pairs are found using sentence length and date features. Then, the similarity of content is measured using a bilingual English-Vietnamese dictionary, and the decision on whether two pages are parallel is made based on thresholds over this measure. Note that this system only searches for parallel pages that are good translations of each other and that are written in the same style.
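Both BITS and the system in [22] rest on this kind of lexicon lookup. The sketch below is a simplified rendering of Equation (2.1); the whitespace tokenizer and the toy lexicon are illustrative assumptions, and the full matching algorithm of Figure 2.5 is not reproduced.

```python
def sim(text_a, text_b, lexicon):
    """Equation (2.1): fraction of tokens in text A that can be paired with a
    translation appearing in text B.  `lexicon` maps an L1 word to the set of
    its L2 translations."""
    tokens_a = text_a.lower().split()
    tokens_b = set(text_b.lower().split())
    translation_pairs = sum(
        1 for word in tokens_a
        if any(t in tokens_b for t in lexicon.get(word, ())))
    return translation_pairs / len(tokens_a) if tokens_a else 0.0


# A toy English-Vietnamese lexicon (illustrative only).
lexicon = {"new": {"mới"}, "books": {"sách"}, "read": {"đọc"}}
print(sim("the new books", "những cuốn sách mới", lexicon))  # 2/3
```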
Moreover, using word-by-word translation causes much ambiguity. Therefore, this approach is difficult to extend when the data grow, as well as when it is applied to bilingual web sites with various styles.

Another instance of this approach is that, instead of using a bilingual dictionary, a simple word-based statistical machine translation system is used to translate texts from one language into the other. [26] uses this method to build an English-Chinese parallel corpus from a huge text collection of Xinhua Web bilingual news corpora collected by LDC (Linguistic Data Consortium, at http://www.ldc.upenn.edu/). By adding the newly built parallel corpus to their existing corpus, they reported an increase in the translation quality of their word-based statistical machine translation in terms of word alignment. A bootstrapping approach [27] can also be applied to incrementally increase the number of both parallel sentences and bilingual lexical entries.

2.4 Hybrid methods

The last version of STRAND [20] is another well-known web parallel text mining system. Its goal is to identify pairs of web pages that are mutual translations. The authors used the AltaVista search engine to search for multilingual web sites and generated candidate pairs based on manually created substitution rules. The heart of STRAND is a structural filtering process that relies on analysis of the pages' underlying HTML to determine a set of pair-specific structural values, and then uses those values to filter the candidate pairs. This system also proposes a new method that combines content-based and structural matching by using a cross-language similarity score as an additional parameter of the structure-based method. A translation lexicon is used to link tokens between the documents of a candidate pair; a link is a pair (x, y) in which x is a word in language L1 and y is a word in L2. An example of two texts with links is illustrated in Figure 2.7. Using the result of maximum cardinality bipartite matching (MCBM), they define the translational similarity measure tsim as

    tsim = (Number of two-word links in best matching) / (Number of links in best matching)    (2.2)

Figure 2.7: An example of the two links in the text.
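A minimal sketch of how tsim could be computed is given below. The matching routine is a standard augmenting-path algorithm for maximum cardinality bipartite matching, and treating every unmatched token as a one-word link is our reading of the formula rather than the exact accounting used in [20].

```python
def max_bipartite_matching(links):
    """Maximum cardinality bipartite matching via simple augmenting paths.
    `links` maps each L1 token index to the set of L2 token indices whose
    words are listed as translations in the lexicon."""
    match_right = {}  # L2 index -> L1 index

    def try_assign(u, visited):
        for v in links.get(u, ()):
            if v not in visited:
                visited.add(v)
                if v not in match_right or try_assign(match_right[v], visited):
                    match_right[v] = u
                    return True
        return False

    for u in links:
        try_assign(u, set())
    return {u: v for v, u in match_right.items()}  # L1 index -> L2 index


def tsim(tokens_a, tokens_b, lexicon):
    links = {i: {j for j, wb in enumerate(tokens_b) if wb in lexicon.get(wa, ())}
             for i, wa in enumerate(tokens_a)}
    two_word_links = len(max_bipartite_matching(links))
    # Assumption: every token left out of the matching counts as a one-word link.
    one_word_links = (len(tokens_a) - two_word_links) + (len(tokens_b) - two_word_links)
    total = two_word_links + one_word_links
    return two_word_links / total if total else 0.0


lexicon = {"read": {"đọc"}, "new": {"mới"}, "books": {"sách"}}
print(tsim("i read new books".split(), "tôi đọc sách mới".split(), lexicon))  # 3/5
```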
In the experiments, approximately 400 pairs were evaluated by human annotators. STRAND produced fewer than 3,500 English-Chinese pairs, with a precision of 98% and a recall of 61%.

Among other systems, [19] proposed a method that combines length-based and content-based matching, exploiting only the title part of a web page. They achieved 100% accuracy, but the recall is not high because, in many cases, the title of the corresponding text is not well translated. In [21], URL-based, length-based, content-based, and HTML structure features are incorporated into a k-nearest-neighbours classifier to match parallel English-Chinese texts. To identify a bilingual web site, they use the anchor and ALT text information within an HTML page: if some pages contain text matching a list of predefined strings that indicate English and Chinese, the page is considered bilingual. [28] proposed a similar approach. The author presents a system that automatically collects bilingual texts from the Internet; the criteria for parallel text detection are based on size, HTML structure, and a word-by-word translation model.

2.5 Summary

In this chapter, we presented related works on mining parallel corpora from the Web. The content-based approach usually uses a bilingual dictionary to match word-word pairs between the two languages. Meanwhile, the structure-based approach relies on analyzing the HTML structure of pages. In real implementations, both approaches are usually employed together to obtain good performance. Generally, the structure-based methods are applied first to quickly filter out the documents that clearly do not match a given document, and the content-based methods are then applied to find the right translational document pairs.
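The division of labour described in this summary can be condensed into a small driver routine; `structure_sim` and `content_sim` stand for any of the measures sketched earlier (e.g., the tag-sequence similarity and the lexicon-based score), and the thresholds are illustrative assumptions.

```python
def mine_parallel_pairs(candidate_pairs, structure_sim, content_sim,
                        structure_threshold=0.8, content_threshold=0.5):
    """Two-stage filtering: a cheap structural filter first discards pairs
    whose page structures clearly do not match; a content-based check then
    keeps only the pairs that look like mutual translations."""
    structurally_similar = [(a, b) for a, b in candidate_pairs
                            if structure_sim(a, b) >= structure_threshold]
    return [(a, b) for a, b in structurally_similar
            if content_sim(a, b) >= content_threshold]
```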