Enhancing the Quality of Machine Translation Systems Using Cross-Lingual Word Embedding Models

Nguyen Minh Thuan
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Associate Professor Nguyen Phuong Thai

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science

November 2018

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

Hanoi, November 15th, 2018
Signed ........................................................................

ABSTRACT

In recent years, Machine Translation has shown promising results and received much interest from researchers. Two approaches that have been widely used for machine translation are Phrase-Based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT). During translation, both approaches rely heavily on large amounts of bilingual corpora, which require much effort and financial support to build. The lack of bilingual data leads to a poor phrase-table, which is one of the main components of PBSMT, and to the unknown word problem in NMT. In contrast, monolingual data are available for most languages. Thanks to this advantage, many word embedding and cross-lingual word embedding models have appeared and improved the quality of various tasks in natural language processing. The purpose of this thesis is to propose two models that use cross-lingual word embedding models to address the above impediments. The first model enhances the quality of the phrase-table in SMT, and the second tackles the unknown word problem in NMT.

Publications:

- Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen and Chi-Mai Luong. Enhancing the Quality of Phrase-table in Statistical Machine Translation for Less-Common and Low-Resource Languages. In the 2018 International Conference on Asian Language Processing (IALP 2018).

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my lecturers at the university, and especially to my supervisors, Assoc. Prof. Nguyen Phuong Thai, Dr. Nguyen Van Vinh and MSc. Vu Huy Hien. They are my inspiration, guiding me past many obstacles in the completion of this thesis. I am grateful to my family, who always encourage and motivate me and create the best conditions for me to accomplish this thesis. I would also like to thank my brother, Nguyen Minh Thong, and my friends, Tran Minh Luyen and Hoang Cong Tuan Anh, for giving me much useful advice and for supporting my thesis, my studies and my life.
Finally, I sincerely acknowledge Vietnam National University, Hanoi, and especially the project TC.02-2016-03, "Building a machine translation system to support translation of documents between Vietnamese and Japanese to help managers and businesses in Hanoi approach the Japanese market", for financially supporting my master's study.

To my family ♥

Table of Contents

1 Introduction
2 Literature Review
  2.1 Machine Translation
    2.1.1 History
    2.1.2 Approaches
    2.1.3 Evaluation
    2.1.4 Open-Source Machine Translation
      2.1.4.1 Moses - an Open Statistical Machine Translation System
      2.1.4.2 OpenNMT - an Open Neural Machine Translation System
  2.2 Word Embedding
    2.2.1 Monolingual Word Embedding Models
    2.2.2 Cross-Lingual Word Embedding Models
3 Using Cross-Lingual Word Embedding Models for Machine Translation Systems
  3.1 Enhancing the Quality of the Phrase-Table in SMT Using Cross-Lingual Word Embedding
    3.1.1 Recomputing Phrase-Table Weights
    3.1.2 Generating New Phrase Pairs
  3.2 Addressing the Unknown Word Problem in NMT Using Cross-Lingual Word Embedding Models
4 Experiments and Results
  4.1 Settings
  4.2 Results
    4.2.1 Word Translation Task
    4.2.2 Impact of Enriching the Phrase-Table on the SMT System
    4.2.3 Impact of Removing Unknown Words on the NMT System
5 Conclusion

List of Figures

2.1 The CBOW model predicts the current word based on the context, and the Skip-gram model predicts surrounding words based on the current word.
2.2 Toy illustration of the cross-lingual embedding model.
3.1 Flow of the training phase.
3.2 Flow of the testing phase.
3.3 Example in the testing phase.

List of Tables

3.1 Sample of new phrase pairs generated using projections of word vector representations.
4.1 Monolingual corpora.
4.2 Bilingual corpora.
4.3 Bilingual dictionaries.
4.4 Precision of word translation retrieval among the top-k nearest neighbors for the Vietnamese-English and Japanese-Vietnamese language pairs.
4.5 Results on the UET and TED datasets in the PBSMT system for Vietnamese-English and Japanese-Vietnamese respectively.
4.6 Translation examples of the PBSMT system in Vietnamese-English.
4.7 Results of removing unknown words on the UET and TED datasets in the NMT system for Vietnamese-English and Japanese-Vietnamese respectively.
4.8 Translation examples of the NMT system in Vietnamese-English.

List of Abbreviations

MT     Machine Translation
SMT    Statistical Machine Translation
PBSMT  Phrase-Based Statistical Machine Translation
NMT    Neural Machine Translation
NLP    Natural Language Processing
RNN    Recurrent Neural Network
CNN    Convolutional Neural Network
UNMT   Unsupervised Neural Machine Translation

Chapter 1 Introduction

Machine Translation (MT) is a sub-field of computational linguistics. It is automated translation: translating text or speech from one natural language to another using computer software. Nowadays, machine translation systems attain much success in practice, and two approaches that have been widely used for MT are Phrase-Based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT).

In the PBSMT system, the core is the phrase-table, which contains the words and phrases the SMT system uses to translate. In the translation process, sentences are split into distinct parts, as shown in (Koehn et al., 2007) and (Koehn, 2010). At each step, for a given source phrase, the system tries to find the best candidate among many target phrases as its translation, based mainly on the phrase-table. Hence, a good phrase-table can improve the quality of translation. However, attaining a rich phrase-table is a challenge, since the phrase-table is extracted and trained from large amounts of bilingual corpora, which require much effort and financial support, especially for less-common languages such as Vietnamese and Lao.

In the NMT system, the two main components are the encoder and the decoder. The encoder uses a neural network, such as a recurrent neural network (RNN), to encode the source sentence, and the decoder also uses a neural network to predict words in the target language. Some NMT models incorporate attention mechanisms to improve translation quality. To reduce computational complexity, conventional NMT systems often limit their vocabularies to the top 30K-80K most frequent words in the source and target languages, and all words outside the vocabulary, called unknown words, are replaced with a single unk symbol. This approach leads to the inability to generate proper translations for these unknown words during testing, as shown in (Luong et al., 2015b) and (Li et al., 2016).
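To see how unknown words arise in practice, the following is a minimal sketch (illustrative Python, not from the thesis; all names and data are made up) of the conventional vocabulary truncation: only the most frequent words receive ids, and every out-of-vocabulary word collapses to a single unk token.

```python
from collections import Counter

def build_vocab(sentences, max_size=30000):
    """Keep only the max_size most frequent words; everything else becomes <unk>."""
    counts = Counter(word for sent in sentences for word in sent.split())
    vocab = {"<unk>": 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map each word to its id, replacing out-of-vocabulary words with the <unk> id."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]

corpus = ["the cat sat", "the dog sat", "a rare aardvark appeared"]
vocab = build_vocab(corpus, max_size=4)
print(encode("the aardvark sat", vocab))  # the rare word maps to <unk> (id 0)
```

With a realistic 30K-80K limit, it is exactly the rare words — the ones that matter most for low-resource language pairs — that collapse to unk.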
Recently, several approaches have been proposed to address the above impediments. For the problem in the PBSMT system, (Passban et al., 2016) proposed a method using new scores, generated by a convolutional neural network, that indicate the semantic relatedness of phrase pairs. They attained an improvement of approximately 0.55 BLEU points. However, their method is suited to medium-size corpora, and it adds more scores to the phrase-table, which can increase the computational complexity of the whole translation system. (Cui et al., 2013) utilized pivot-language techniques to enrich their phrase-table, which is built by combining source-pivot and pivot-target phrase-tables. As a result of this combination, they attained a significant improvement in translation quality. Similarly, (Zhu et al., 2014) used a method based on pivot languages to calculate the translation probabilities of source-target phrase pairs and achieved a slight enhancement. Unfortunately, methods based on pivot languages cannot be applied to Vietnamese because of the less-common nature of this language. (Vogel and Monson, 2004) improved translation quality by using phrase pairs from an augmented dictionary. They first augmented the dictionary using simple morphological variations and then assigned probabilities to the entries of this dictionary using co-occurrence frequencies collected from bilingual data. However, their method needs a lot of bilingual data to estimate the probabilities of dictionary entries accurately, which is not available for low-resource languages.

To address the unknown word problem in the NMT system, (Luong et al., 2015b) annotated the training bilingual corpus with explicit alignment information that allows the NMT system to emit, for each unknown word in the target sentence, the position of its corresponding word in the source sentence. This information is then used in a post-processing step to translate every unknown word with a bilingual dictionary. The method showed a substantial improvement of up to 2.8 BLEU points over various NMT systems on the WMT'14 English-French translation task. However, obtaining the good dictionary used in the post-processing step is also costly and time-consuming.

(Sennrich et al., 2016) introduced a simple approach to handling the translation of unknown words in NMT by encoding unknown words as sequences of subword units. The method is based on the intuition that various word classes are translated via smaller units than words; for example, names are translated by character copying or transliteration, and compounds are translated via compositional translation. The approach yielded an improvement of up to 1.3 BLEU points over a back-off dictionary baseline on the WMT'15 English-Russian translation task.
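The subword vocabulary in (Sennrich et al., 2016) is learned with byte-pair encoding (BPE). As a sketch of that published algorithm (the toy vocabulary below is illustrative, not the thesis's data), each iteration finds the most frequent adjacent symbol pair and merges it into a new symbol:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a vocabulary of space-separated symbols."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every occurrence of the pair as a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are pre-split into characters, with an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(4):  # the number of merge operations is a hyperparameter
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)  # ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')
```

Frequent words survive as whole units, while rare words decompose into learned subwords, so nothing needs to be mapped to unk.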
(Li et al., 2016) proposed a novel substitution-translation-restoration method to tackle the unknown word problem in NMT. In this method, the substitution step replaces the unknown words in a test sentence with similar in-vocabulary words, based on a similarity model learned from monolingual data. The translation step then translates the test sentence with a model trained on bilingual data in which unknown words have been replaced. Finally, the restoration step substitutes the translations of the replaced words with translations of the original ones. This method demonstrated a significant improvement of up to 4 BLEU points over attention-based NMT on Chinese-to-English translation.

Recently, techniques using word embeddings have received much interest from the natural language processing community. A word embedding is a vector representation of a word which preserves semantic information about the word and its context words. Additionally, we can exploit the similar structure of embedding spaces across different languages, as shown in (Mikolov et al., 2013b). Cross-lingual word embedding models are also receiving a lot of interest; they learn cross-lingual representations of words in a joint embedding space in order to represent meaning and transfer knowledge in cross-lingual scenarios.
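A concrete instance of such a model is the linear mapping of (Mikolov et al., 2013b): given a seed dictionary of word pairs, learn a matrix W that projects source-language vectors into the target space, then translate by nearest-neighbor search. The sketch below is a minimal reading of that idea; the least-squares solver and the random toy demo are my own simplification, not the thesis's implementation.

```python
import numpy as np

def learn_mapping(X_src, Z_tgt):
    """Least-squares map W minimizing ||X_src W - Z_tgt||_F over seed dictionary pairs.
    X_src, Z_tgt: (n_pairs, dim) arrays of source and target embeddings."""
    W, *_ = np.linalg.lstsq(X_src, Z_tgt, rcond=None)
    return W  # (dim, dim); a source vector x is projected as x @ W

def translate(word_vec, tgt_vocab, tgt_matrix, W, k=5):
    """Return the k target words whose embeddings are nearest (cosine) to the projection."""
    proj = word_vec @ W
    sims = tgt_matrix @ proj / (np.linalg.norm(tgt_matrix, axis=1)
                                * np.linalg.norm(proj) + 1e-9)
    return [tgt_vocab[i] for i in np.argsort(-sims)[:k]]

# Toy demo with random vectors standing in for trained embeddings (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Z = X @ rng.normal(size=(8, 8))  # a "target space" related to X by a linear map
W = learn_mapping(X, Z)
tgt_vocab = [f"word{i}" for i in range(50)]
print(translate(X[0], tgt_vocab, Z, W, k=3))  # expect 'word0' among the top candidates
```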
Inspired by the advantages of cross-lingual embedding models and by the work of (Mikolov et al., 2013b) and (Li et al., 2016), we propose a model that enhances the quality of a phrase-table by recomputing the phrase weights and generating new phrase pairs, and a model that addresses the unknown word problem in the NMT system by replacing unknown words with the most appropriate in-vocabulary words.

The rest of this thesis is organized as follows: Chapter 2 gives an overview of the related background. In Chapter 3, we describe our two proposed models: one enhances the quality of the phrase-table in SMT, and the other tackles the unknown word problem in NMT. The settings and results of our experiments are presented in Chapter 4. We state our conclusions and future work in Chapter 5.

Chapter 2 Literature Review

In this chapter, we give an overview of Machine Translation (MT) research and word embedding models in Sections 2.1 and 2.2 respectively. Section 2.1 covers the history, approaches, evaluation, and open-source systems of MT. In Section 2.2, we introduce word embeddings, including monolingual and cross-lingual word embedding models.

2.1 Machine Translation

2.1.1 History

Machine Translation is a sub-field of computational linguistics: automated translation of text or speech from one natural language to another using computer software. The first ideas of machine translation may have appeared in the seventeenth century, when Descartes and Leibniz proposed theories of how to create dictionaries using universal numerical codes. In the mid-1930s, Georges Artsrouni attempted to build "translation machines" that used paper tape to create an automatic dictionary. After that, Peter Troyanskii proposed a model comprising a bilingual dictionary and a method for handling grammatical issues between languages based on Esperanto's grammatical system.

On January 7th, 1954, at the head office of IBM in New York, the first machine translation system was demonstrated in the Georgetown-IBM experiment. It automatically translated 60 sentences from Russian to English for the first time and opened a race for machine translation in many countries, such as Canada, Germany, and Japan. However, in 1966, the Automatic Language Processing Advisory Committee (ALPAC) reported that the ten-year-long research effort had failed to fulfill expectations (Vogel et al., 1996).

During the 1980s, a lot of MT activity took place, especially in Japan. At this time, research in MT typically depended on translation through a variety of intermediary linguistic representations involving syntactic, morphological, and semantic analysis. At the end of the 1980s, as computational power increased and became less expensive, more research was attempted on the statistical approach to MT. During the 2000s, research in MT saw major changes: much of it focused on example-based machine translation and statistical machine translation (SMT), and researchers also took more interest in hybridization, combining morphological and syntactic knowledge with statistical systems, as well as combining statistics with existing rule-based systems. Recently, the major trend in MT has been using large artificial neural networks, an approach called Neural Machine Translation (NMT). In 2014, (Cho et al., 2014) published the first paper on using neural networks in MT, followed by a lot of research in the following few years. Apart from the research on bilingual machine translation systems, in 2018 researchers paid much attention to unsupervised neural machine translation (UNMT), which uses only monolingual data to train the MT system.

2.1.2 Approaches

In this section, we present the typical approaches to MT, based on linguistic rules, statistics, and neural networks.

Rule-based

Rule-based Machine Translation (RBMT) is the earliest approach to MT. It relies on linguistic information about the source and target languages, such as morphological and syntactic rules and semantic analysis. The basic approach involves parsing and analyzing the structure of the source sentence and then converting it into the target language based on a manually determined set of rules created by linguistic experts. The key advantage of RBMT is that it can translate a wide range of text without requiring a bilingual corpus. However, creating rules for an RBMT system is costly and time-consuming. Additionally, when translating real texts, the rules cannot cover all possible linguistic phenomena, and they can conflict with each other. Therefore, RBMT has mostly been replaced by SMT or hybrid systems.

Statistical

A Statistical Machine Translation (SMT) system uses statistical models to generate translations from bilingual and monolingual corpora. The basic idea of SMT comes from information theory. A sentence f in the source language is translated into a sentence e in the target language based on the probability distribution p(e|f). A simple way to model p(e|f) is to apply Bayes' theorem:

\[ p(e|f) \propto p(f|e)\, p(e) \]

where p(f|e) is the translation model, which estimates the probability of the source sentence f given the target sentence e, and p(e) is the language model, which gives the probability of seeing sentence e in the target language. Finding the best translation ê is then done by maximizing the product p(f|e) p(e):

\[ \hat{e} = \operatorname*{argmax}_{e \in e^{*}} p(e|f) = \operatorname*{argmax}_{e \in e^{*}} p(f|e)\, p(e) \]

In order to search efficiently in the huge search space e*, machine translation decoders trade off quality against time by using heuristics and other methods to limit the search space. Some efficient search algorithms currently used in decoders are Viterbi beam search, A* stack decoding, and graph models. SMT has been used as the core of systems such as Google Translate and Bing Translator.
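As a toy illustration of the noisy-channel ranking above (the probabilities are invented for the example, not from any trained model), a decoder conceptually scores each candidate e by log p(f|e) + log p(e) and keeps the best:

```python
import math

# Illustrative toy scores: "tm" stands in for p(f|e), "lm" for p(e).
candidates = {
    "the house is small":  {"tm": 0.04, "lm": 0.002},
    "the house is little": {"tm": 0.05, "lm": 0.0004},
    "small is the house":  {"tm": 0.01, "lm": 0.00001},
}

def score(entry):
    # argmax_e p(f|e) p(e), computed in log space for numerical stability
    return math.log(entry["tm"]) + math.log(entry["lm"])

best = max(candidates, key=lambda e: score(candidates[e]))
print(best)  # "the house is small": the highest combined log-probability
```

Note how the language model overrules the slightly better translation-model score of the second candidate, which is exactly the fluency/adequacy trade-off the decomposition is designed to capture.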
Example-based

In an example-based machine translation (EBMT) system, a sentence is translated using the idea of analogy. The corpus used in this approach is a large collection of existing translation pairs of source and target sentences. Given a new source sentence to be translated, the corpus is searched for sentences that contain similar sub-sentential parts. The similar sentences are then used to translate the sub-sentential parts of the original source sentence into the target language, and these parts are put together to generate a complete translation.

Neural Machine Translation

Neural Machine Translation (NMT) is the newest approach to MT and is based on machine learning models. This approach uses a large artificial neural network to predict the likelihood of a sequence of words, typically encoding whole sentences in a single integrated model. The structure of NMT models is simpler than that of SMT models: NMT uses vector representations ("embeddings", "continuous space representations") for words and internal states, and it consists of a single sequence model that predicts one word at a time, with no separate translation model, language model, or reordering model. The first NMT models used recurrent neural networks (RNNs): a bidirectional RNN, known as the encoder, encodes the source sentence, and a second RNN, known as the decoder, predicts words in the target language. NMT systems can continuously learn and be adjusted to generate the best output, but they require a lot of computing power, which is why these models have only been developed intensively in recent years.

2.1.3 Evaluation

Machine translation evaluation is essential for examining the quality of an MT system or comparing different MT systems. The simplest method to evaluate MT output is to use human judges. However, human evaluation is costly and time-consuming, and thus unsuitable for the frequent evaluation needed while developing and researching an MT system. Therefore, various automatic methods have been studied to evaluate translation quality, such as Word Error Rate (WER), Position-independent word Error Rate (PER), the NIST score (Doddington, 2002), and the BLEU score (Papineni et al., 2002). In our work, we use BLEU to automatically evaluate our MT system configurations.

BLEU is a popular method for automatically evaluating MT output that is quick, inexpensive, and language-independent, as shown in (Papineni et al., 2002). The basic idea of this method is to compare n-grams of the MT output with n-grams of the reference translation and count the number of matches; the more matches, the better the MT output. The BLEU formula is as follows. The n-gram precisions p_n are computed by summing the n-gram matches over all candidate sentences C in the test corpus:

\[ p_n = \frac{\sum_{C \in \{\mathrm{Candidates}\}} \sum_{\mathrm{ngram} \in C} \mathrm{Count}_{\mathrm{matched}}(\mathrm{ngram})}{\sum_{C \in \{\mathrm{Candidates}\}} \sum_{\mathrm{ngram} \in C} \mathrm{Count}(\mathrm{ngram})} \qquad (2.1) \]

Next, the brevity penalty (BP) is calculated as:

\[ BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \qquad (2.2) \]

where c and r are the lengths of the candidate translation and the reference translation respectively. Then, the BLEU score is computed as:

\[ \mathrm{BLEU} = BP \times \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big) \qquad (2.3) \]

where N is the maximum n-gram order considered for p_n and w_n are the weights assigned to the n-gram precisions. In the baseline, N = 4 and the weights are uniformly distributed.
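A direct, minimal implementation of Equations (2.1)-(2.3) at the sentence level is sketched below. The 1e-9 floor for zero n-gram matches is an ad hoc smoothing choice for the sketch; real toolkits aggregate counts over the whole test corpus, as Eq. (2.1) specifies.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU following Eqs. (2.1)-(2.3), with uniform weights w_n = 1/N."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # clipped counts: a candidate n-gram matches at most as often as it occurs in the reference
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(matched, 1e-9) / total))
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty, Eq. (2.2)
    return bp * math.exp(sum(log_precisions) / max_n)  # Eq. (2.3)

print(bleu("the cat sat on the mat", "the cat sat on a mat"))  # about 0.54
```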
2.1.4 Open-Source Machine Translation

In order to stimulate the development of the MT research community, a variety of free and complete toolkits for MT have been provided. For the statistical (data-driven) approach to MT, some notable systems are:

- Moses (http://www.statmt.org/moses/): a complete SMT system.
- UCAM-SMT (http://ucam-smt.github.io/): the Cambridge SMT system.
- Phrasal (https://nlp.stanford.edu/phrasal/): a toolkit for phrase-based SMT.
- Joshua (https://cwiki.apache.org/confluence/display/JOSHUA/): a decoder for syntax-based SMT.
- Pharaoh (https://www.isi.edu/licensed-sw/pharaoh/): a decoder for IBM Model 4.

Besides, because of the superiority of NMT over SMT, NMT has received much attention from researchers and companies. The following state-of-the-art NMT systems are totally free and easy to set up:

- OpenNMT (http://opennmt.net/): a system designed to be simple to use and easy to extend, developed by Harvard University and SYSTRAN.
- Google-GNMT (https://github.com/tensorflow/nmt): a competitive sequence-to-sequence model developed by Google.
- Facebook-fairseq (https://github.com/facebookresearch/fairseq): a system implemented with convolutional neural networks (CNNs), developed by Facebook AI Research, which can achieve performance similar to RNN-based NMT while running nine times faster.
- Amazon-Sockeye (https://github.com/awslabs/sockeye): a sequence-to-sequence framework based on Apache MXNet, developed by Amazon.

In this part, we introduce the two MT systems used in our work: Moses, an open system for SMT, and OpenNMT, an open system for NMT.

2.1.4.1 Moses - an Open Statistical Machine Translation System

Moses, introduced by (Koehn et al., 2007), is a complete open-source toolkit for statistical machine translation. It can automatically train translation models for any language pair from a collection of translated sentences (parallel data). Given the trained model, an efficient search algorithm quickly finds the highest-probability translation among an exponential number of candidates.

There are two main components in Moses: the training pipeline and the decoder. The training pipeline contains a variety of tools which take the parallel data and turn it into a translation model. First, the data needs to be cleaned by inserting spaces between words and punctuation (tokenisation), removing long and empty sentences, and so on. Second, external tools such as GIZA++ (Och and Ney, 2003) and MGIZA++ are used for word alignment. These word alignments are then used to extract phrase translation pairs or hierarchical rules, which are scored using corpus-wide statistics. Finally, the weights of the different statistical models are tuned to generate the best possible translations; MERT (Och, 2003) is used to tune weights in Moses.

In the decoding process, Moses uses the trained translation model to translate the source sentence into the target sentence. To overcome the huge search problem in decoding, Moses implements several different search algorithms, such as stack-based decoding, cube pruning, and chart parsing. An important part of the decoder is the language model, which is trained on monolingual data in the target language to ensure the fluency of the output. Moses supports many kinds of language model tools, such as KenLM (Heafield, 2011), SRILM (Stolcke, 2002), and IRSTLM (Federico et al., 2008).
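To relate Moses to the phrase-table rescoring proposed in this thesis, the sketch below parses Moses's plain-text phrase-table format (fields separated by `|||`: source phrase, target phrase, feature scores, and optionally alignment and counts) and attaches a cosine similarity computed from cross-lingual embeddings. Averaging word vectors into a phrase vector and the toy embeddings are assumptions made for illustration, not the thesis's actual model.

```python
import numpy as np

def phrase_vector(phrase, embeddings, dim=4):
    """Average the (cross-lingually mapped) word vectors of a phrase; zeros if all words are OOV.
    `embeddings` is a dict word -> np.ndarray, standing in for a real embedding model."""
    vecs = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rescore_phrase_table(lines, src_emb, tgt_emb):
    """Yield (source, target, similarity) for each Moses phrase-table line."""
    for line in lines:
        fields = [x.strip() for x in line.split("|||")]
        src, tgt = fields[0], fields[1]
        v_s, v_t = phrase_vector(src, src_emb), phrase_vector(tgt, tgt_emb)
        denom = np.linalg.norm(v_s) * np.linalg.norm(v_t)
        sim = float(v_s @ v_t / denom) if denom > 0 else 0.0
        yield src, tgt, sim  # could be added as an extra feature or used to recompute a weight

# Toy demo: one phrase pair with identical placeholder vectors, so similarity is 1.0.
lines = ["ngôi nhà ||| the house ||| 0.6 0.4 0.5 0.3"]
src_emb = {"ngôi": np.ones(4), "nhà": np.ones(4)}
tgt_emb = {"the": np.ones(4), "house": np.ones(4)}
print(list(rescore_phrase_table(lines, src_emb, tgt_emb)))
```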