Towards a framework for building an annotated named entities corpus


Table of Contents

1 Introduction
  1.1 Overview of Named Entity Recognition (NER)
  1.2 NER Approach
    1.2.1 Rule based approach
    1.2.2 Machine learning approach
    1.2.3 Comparing
  1.3 Thesis contribution
  1.4 Thesis structure
2 Related Work
  2.1 Overview of our problem
  2.2 Building NER corpus research
  2.3 Research about the corpus building process
  2.4 Overview of annotation tools
  2.5 Summary
3 Corpus building process
  3.1 Corpus building process
    3.1.1 Objective
    3.1.2 Build annotation guideline
    3.1.3 Annotate documents
    3.1.4 Quality control
  3.2 Building Vietnamese NER corpus by off-line tools
    3.2.1 Build annotation guideline
    3.2.2 Annotate documents
    3.2.3 Quality control
  3.3 Discussion of the Vietnamese NER corpus building process
  3.4 Conclusion
4 Online Annotation Framework
  4.1 Introduction
  4.2 Training section
  4.3 Annotation documents
    4.3.1 Online annotation interface
    4.3.2 Automate file distribution for annotator
    4.3.3 Automate save and manage files
  4.4 Quality control
    4.4.1 Document level
    4.4.2 Corpus level
    4.4.3 Explain unusual entity
  4.5 Conclusion
5 Evaluation
  5.1 Introduction
  5.2 Corpus evaluation
    5.2.1 Inter-annotator agreement
    5.2.2 Offline corpus evaluation
    5.2.3 Online corpus
  5.3 Time costing
    5.3.1 Overview
    5.3.2 Offline process
    5.3.3 Online framework
  5.4 Named entity recognition system
    5.4.1 Preprocessing
    5.4.2 Gazetteer
    5.4.3 Transducer
    5.4.4 Experiment
  5.5 Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work
    6.2.1 Create a bigger and higher-quality corpus
    6.2.2 Improve online annotation framework
    6.2.3 Build a statistical NER system
A Named Entity guideline
  A.1 Basic concepts
    A.1.1 Entity and Entity Name
    A.1.2 Instance of entity
    A.1.3 List of Entities
    A.1.4 Entity recognition rules
  A.2 Entity classification
    A.2.1 Person
    A.2.2 Organization
    A.2.3 Location
    A.2.4 Facility
    A.2.5 Religion

List of Figures

3.1 Process of building the annotation guideline
3.2 Callisto formatting
3.3 Callisto interface
3.4 Comparing two users' corpora
4.1 Online annotation process
4.2 Online annotation tool interface
4.3 Annotation guideline form interface
4.4 Review tool interface
4.5 Compare two documents interface
5.1 Inter-annotator agreement result of two users
5.2 Accuracy rate for each entity kind
5.3 Online corpus accuracy rate for each entity kind
5.4 Named entity recognition system architecture
5.5 JAPE rule to recognize Person entity
5.6 Performance on the training data using strict criteria
5.7 Performance on test data using strict criteria
5.8 Performance on the test data using lenient criteria

List of Tables

5.1 An example of part of the corpus annotated by two users (User A and User B)
5.2 Frequency of annotated documents
5.3 Inter-annotator agreement in online annotation
5.4 User corpus accuracy rate in the online method
5.5 Time spent on corpus quality control
5.6 Time spent during the annotation process
5.7 Quality control time in the online framework

Chapter 1 Introduction

1.1 Overview of Named Entity Recognition (NER)

The ability to determine the named entities in a text has been established as an important task for several natural language processing areas, including information retrieval, machine translation, information extraction and language understanding. The term "Named Entity", widely used in Natural Language Processing (NLP), was coined for the Sixth Message Understanding Conference (MUC-6).
At the time, MUC was focusing on Information Extraction (IE) tasks, in which structured information about company activities and defense-related activities is extracted from unstructured text such as newspaper articles. In defining the tasks, people noticed that it is essential to recognize information units such as names, including person, organization and location names, and numeric expressions, including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called "Named Entity Recognition and Classification".

The computational research aiming at automatically identifying named entities in texts forms a vast and heterogeneous pool of strategies, methods and representations. One of the first research papers in the field was presented by Lisa F. Rau (1991) at the Seventh IEEE Conference on Artificial Intelligence Applications. In general, each piece of NER research has to address four questions: language, input format, kind of entity, and learning method.

Languages: NER has been applied to several languages. There is much good research on English NER, and work in this area has also addressed language independence and multilingualism problems. German is well studied in CONLL-2003 and in earlier works. Similarly, Spanish and Dutch are strongly represented, boosted by a major devoted conference: CONLL-2002 (Collins, 2002). Chinese is studied in some research (Wang et al., 1992), (Computer et al., 1996), (Yu et al., 1998), and so are French (Petasis et al., 2001), (And, 2003), Greek (Karkaletsis et al., 1999) and Italian (Black et al., 1998), (Cucchiarelli & Velardi, ). Many other languages received some attention as well: Basque (Whitelaw & Patrick, 2003), Bulgarian (Silva et al., 2004), Catalan (Carreras et al., 2003), Hindi (Cucerzan & Yarowsky, 1999), Romanian (Cucerzan & Yarowsky, 1999), Swedish (Kokkinakis, 1998) and Turkish (Cucerzan & Yarowsky, 1999).
Portuguese was examined by (Palmer et al., 1997). For Vietnamese, some NER research has been applied, for example the VN-KIM IE system (Nguyen & Cao, 2007).

Input format: NER research has been applied to many document formats (general text, email, scientific text, journalistic text, etc.) and many domains (sport, business, literature, etc.). Each system usually targets a specific format and domain. (Maynard et al., 2001) designed a system for email, scientific texts and religious texts; (Minkov & Wang, 2005) created a system specifically designed for email documents. Nowadays, studies aim to apply NER to newer kinds of formats and domains. For example, work has covered both the MUC-6 collection, composed of newswire texts, and a proprietary corpus made of manual translations of phone conversations and technical emails.

Kind of entity: Although the list of entities depends on the kind of problem and on the domain, NER systems usually record some common entities: Person, Location, Organization, Date, Time, Money and Percent. Ambiguity appears mostly with Person, Location and Organization, and less with the others. In each domain, NER systems target specific entities. For instance, in the medical domain, an entity can be the name of a disease or the name of a medicine.

1.2 NER Approach

Similar to other NLP problems, NER research has developed along two main approaches:
• Rule based approach.
• Machine learning approach.

1.2.1 Rule based approach

Using an expert system to build a rule system is the traditional approach, and it has been applied in NLP in general and in NER in particular. A rule system is a set of rules built by people (ordinarily experts) for a particular target. Rules are created from features such as part of speech, context (the words and phrases in front of and behind a word), properties (uppercase, lowercase, etc.) and special dictionaries. For example:

President Busto leave Iraq said Monday's talks will include discussion on security, a timetable for U.S forces

In this example, "Busto" appears after "President", so "Busto" is annotated as a Person entity. Similarly, "Iraq" appears after the verb "leave", so it is treated as a Location entity.

In this approach, we do not need an annotated corpus. The system can identify and classify entities immediately using the set of rules. The advantage of this approach is that a rule-based system is easy to build, which is why many NER systems since the early period have been rule-based. However, it is difficult to improve the accuracy rate, because organizing the set of rules is difficult. If the rules are not organized appropriately, they overlap each other, and the system cannot identify and classify entities correctly.

1.2.2 Machine learning approach

Nowadays, machine learning is a common approach to solving NLP problems. In NER, it is used to improve accuracy. Models that have been applied include support vector machines, hidden Markov models, decision trees, etc. Three kinds of learning methods have been applied: unsupervised, supervised and semi-supervised. However, unsupervised and semi-supervised systems are not widely used for NER problems. Only a few studies apply these methods, for example Collins, with a system that used an unannotated corpus (Collins & Singer, 1999), and Kim, with a system using proper names and an unannotated corpus. Supervised systems are used more widely in NER, for example Bikel with a hidden Markov model (Black et al., 1998) and Borthwick with Maximum Entropy (Borthwick et al., 1998).

In machine learning systems, we must build three sets: a training set, a test set and a practice set.
• A training set consists of an input vector and an answer vector, and is used together with a supervised learning method to train knowledge for the system. In NER, a training set is a corpus which has been annotated with standard labels.
• A test set is similar to the training set, but its purpose is to check and evaluate the system's accuracy. In NER, the test set is a corpus similar to the training set.
• A practice set is the set to which the machine learning system is applied in order to identify and classify entities automatically. Processing the practice set is the goal of building the system.

1.2.3 Comparing

Annotation-based learning has some advantages over manual rule writing:
• Annotation-based learning can continue indefinitely, over weeks and months, with relatively self-contained annotation decisions at each point. In contrast, rule writing must remain cognizant of potential interdependencies with previous rules when adding and revising rules, ultimately bounding continued rule system growth by cognitive load factors.
• Annotation-based learning can more effectively combine the efforts of multiple people. The tagged sentences from different data sets can simply be concatenated to form larger data sets with broader coverage.
• Rule writers require considerable skill, including not only the linguistic knowledge needed for annotation, but also competence with regular expressions and the ability to grasp the complex interactions within a rule list. In machine learning approaches, by contrast, annotators only need to be fluent in the language.
• The performance of systems built by rule writers tends to exhibit considerably more variance, while machine learning systems tend to give much more consistent results.

Although the machine learning approach has many advantages, it meets one main barrier: machine learning needs a high-quality corpus. The problem is therefore how to build a high-quality corpus. For Vietnamese, no NER corpus has been published. Although some systems have been built on the machine learning approach, their authors do not share their corpora, which makes it difficult for other research to improve NER accuracy. For this reason, my thesis focuses on:
• Solutions to build a Vietnamese NER corpus.
• Quality control and evaluation of the corpus.
• Applying the corpus to the NER problem.

1.3 Thesis contribution

The thesis contributions include:
• We release a corpus building process.
• We apply the process to build an NER corpus with the offline tools method. The offline tools method is a manual approach that uses desktop programs, for example the Callisto tool.
• To overcome the disadvantages of offline tools, we build an online annotation framework. The online framework has the following features:
  – Annotation is carried out over the Internet (annotate anytime, anywhere).
  – All steps in the process are automated: managing files, distributing them to annotators, etc.
  – A larger number of annotators is supported.
  – Corpus quality is controlled at several levels.
• We apply the corpus to evaluate our NER system.

1.4 Thesis structure

My thesis includes five chapters:
• Chapter one, Introduction: an overview of NER research and some approaches to building NER systems, and a statement of the problem.
• Chapter two, Related work: an overview of research around the world on building NLP corpora in general and NER corpora in particular, from which we position our own study.
• Chapter three, Corpus building process: describes a process for building a general corpus, which we then apply to build a Vietnamese corpus with off-line tools.
• Chapter four, Online annotation framework: based on the corpus building process, we build an online framework for annotating that overcomes the disadvantages of off-line tools.
• Chapter five, Evaluation: presents my experiments and evaluates the results, and describes our NER system using the corpus we built.

Chapter 2 Related Work

In this chapter, we discuss some published research on corpus building, including NER corpora and NLP corpora. We discuss the following factors:
• Building process
• Support tools
• Quality control
By learning from existing research, we build our own strategy to solve our problem.

2.1 Overview of our problem

As we saw in the last chapter, building a high-quality NER corpus is very important.
The corpus will be used in several ways in an NER system:
• Testing the system: the corpus is used to evaluate the system's accuracy rate.
• Training the system: the corpus is used to build the system's knowledge (machine learning approach).
However, building a high-quality corpus is not easy. Without a suitable method, the result is a low-accuracy corpus. Our problem is therefore how to build a high-quality NER corpus and how to control its quality. Solving it requires three things: releasing a corpus building process, supplying supporting tools, and controlling quality.

2.2 Building NER corpus research

Surveying the problem, we see that building an NER corpus is not a new problem. For example, Jana Kravalová and Zdeněk Žabokrtský built a Czech named entity corpus (Kravalová & Žabokrtský, 2009). In this research, 6000 sentences were manually annotated with named entities, yielding about 33000 entities. They use the corpus to train and evaluate a named entity recognizer based on the Support Vector Machine classification technique. The presented recognizer outperforms the results previously reported for NE recognition in Czech.

Furthermore, Asif Ekbal and Sivaji Bandyopadhyay built a Bengali named entity tagged corpus (Asif Ekbal, 2008). A Bengali news corpus was developed from the web archive of a widely read Bengali newspaper. They used the tool "Sanchay Editor" to annotate manually; Sanchay Editor is a text editor for Indian languages. Their corpus has been used to develop NER systems for Bengali using a pattern-directed shallow parsing approach, Hidden Markov Model (HMM), Maximum Entropy (ME) model, Conditional Random Field (CRF) and Support Vector Machine (SVM).

No NER corpus has been published for the Vietnamese language.
As a result, some NER systems have been based on the rule-writing approach, for example VN-KIM (Nguyen & Cao, 2007), which uses JAPE grammar. Releasing a Vietnamese NER corpus will be useful for developing automatic NER research.

2.3 Research about the corpus building process

Much research on corpus building has been published, and many corpora have been created: POS corpora, treebank corpora, and even newer kinds such as parallel language corpora, opinion corpora, etc. For example:
• "Towards the National Corpus of Polish" (Adam Przepiorkowski & Lazinski, 2008) studies the building of the National Corpus of Polish, which is used to build a Polish dictionary. The corpus is very big, about a billion words. It has been built by four partners, who annotate various features: the entire corpus will be annotated linguistically, structurally and with metadata. During the building period, they plan to carefully consider the recommendations of the ISO/TC 37/SC 4 subcommittee, the TEI guidelines, and any future recommendations of the CLARIN project (more information at http://www.clarin.eu/).
• "Building a Greek corpus for Textual Entailment" (Evi Marzelou & Piperidis, 2008) studies the building of a Greek corpus. The annotation process in this research includes several steps: creating guidelines and annotating (by expert and non-expert human annotators). They then compare the results and release the gold entailment annotation.
• "Opinion annotation in On-line Chinese Product Reviews" (Ruifeng Xu & Li, 2008) focuses on opinion annotation. The research explains the annotation schema, which includes seven steps.

In summary, after reviewing some research on creating annotated corpora, we see that the corpus building schema includes three main steps: building the annotation guideline, annotating, and controlling corpus quality. Our corpus building follows these steps.

2.4 Overview of annotation tools

Many annotation tools exist that we can refer to:
• GATE (http://gate.ac.uk/): a framework written in Java. It includes many functions, and annotation is one of them.
• Callisto (http://callisto.mitre.org/index.html): the Callisto annotation tool was developed to support linguistic annotation of textual sources for any Unicode-supported language.
• EasyRef (http://atoll.inria.fr/easyrefpub/): a web service to handle (view, edit, import, export, bug reports) syntactic annotations.
• SACODEYL Annotator (http://www.um.es/sacodeyl/en/pages/software.html): an open source application for annotating documents on the desktop; it can also be a web application.
• WordFreak (http://wordfreak.sourceforge.net/): a Java-based linguistic annotation tool designed to support human and automatic annotation of linguistic data, as well as to employ active learning for human correction of automatically annotated data.
We will refer to all of these tools when building our own tools for the annotation process.

2.5 Summary

In this chapter, we focused on related work around the thesis, including the corpus building process and annotation tools. It is useful for directing our work: towards a framework for building an annotated named entities corpus. In the next chapters, we explain our work to build a Vietnamese NER corpus.

Chapter 3 Corpus building process

In this chapter, we present the corpus building process. Similar to other annotation processes, the corpus building process includes three steps: building the annotation guideline, annotating documents, and quality control. Then, we apply the process to build a Vietnamese NER corpus. We use some off-line tools and discuss their advantages and disadvantages.

3.1 Corpus building process

3.1.1 Objective

In this subsection, we explain the importance of the building process. If you want to build a small corpus (one containing a few documents), you do not need a corpus building process: you simply annotate each document with annotation tools. If you want the corpus to be more accurate, the documents are annotated several times.
However, when you want to build a large corpus, the work becomes complex, and many people need to join in the job. That is why a corpus building process is defined. Based on the corpus building process, people know what work they have to do, and the manager can more easily manage all the work and the corpus quality. The requirements of the corpus building process are:
• Every person takes part in the corpus building.
• Each document has to be annotated several times.
• The administrator can control and evaluate the quality of the corpus as well as the quality of each annotator's work.

As in the research we studied in chapter two, such as the National Corpus of Polish (Adam Przepiorkowski & Lazinski, 2008), building a Greek corpus for Textual Entailment (Evi Marzelou & Piperidis, 2008), and opinion annotation in On-line Chinese Product Reviews (Ruifeng Xu & Li, 2008), the corpus building process includes three steps:
• Building the annotation guideline.
• Annotating documents.
• Quality control of the corpus.
In the next sections we present each step.

3.1.2 Build annotation guideline

The annotation guideline is essentially a user manual for annotators. They rely on the instructions contained in the guideline to find and classify entities. In building an NER corpus, the guideline includes the definition of a named entity, the classification of entities, and the signs of an entity in documents. The annotation guideline is very important because:
• Annotators treat the guideline as their user manual for annotating correctly. Before the annotation process, they have to read and study the guideline carefully. They have to know which words or phrases can be considered entities and how to identify each kind of entity. If they do not understand it clearly, they will face many problems when annotating and will make many errors.
• When facing an ambiguous case (one that can be understood in several ways), the annotator relies on the rules in the guideline to decide which way is the most correct.
For example, when annotating the sentence:

Trưởng công an huyện Kỳ Sơn dẫn tôi đi tới kho chứa hàng trăm khẩu súng tự chế được gom lại trong chiến dịch thu hồi vũ khí và vật liệu nổ trái phép

(The Kỳ Sơn district police chief took me to a storehouse holding hundreds of homemade guns that had been gathered in the campaign to recover illegal weapons and explosive materials.)

There are two ways to annotate this sentence. In the first, "Kỳ Sơn" is a Location entity because it is a district name. In the second, "Công an huyện Kỳ Sơn" is an Organization entity because it is the name of an office. Which way do we choose? The annotation guideline states: "Entities are not annotated with overlap, and the most correct entity is the longest entity". So the second way is applied.
• Because there is only one correct entity in a given context, when we compare two documents annotated by two different annotators and find differences, this shows that one is correct and the other is wrong, or even that both are wrong. To repair them, we have to rely on the guideline. For example, when annotating the sentence:
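
The longest-entity rule quoted above, and the comparison of two annotators' versions of a document, can be illustrated with a minimal Python sketch. It assumes that entities are stored as (start, end, label) character spans; that layout and the helper names keep_longest and disagreements are hypothetical, chosen only for illustration, and are not taken from the thesis or from any annotation tool.

```python
from typing import List, Tuple

# (start, end, label): character offsets, end exclusive. This tuple layout is
# an assumption made for illustration, not the storage format of the thesis.
Span = Tuple[int, int, str]


def keep_longest(candidates: List[Span]) -> List[Span]:
    """Drop overlapping candidates, preferring the longest span."""
    chosen: List[Span] = []
    # Visit longer spans first so that they win any overlap.
    for span in sorted(candidates, key=lambda s: s[1] - s[0], reverse=True):
        start, end, _ = span
        if all(end <= c[0] or start >= c[1] for c in chosen):
            chosen.append(span)
    return sorted(chosen)


def disagreements(first: List[Span], second: List[Span]) -> List[Span]:
    """Return the spans marked by exactly one of two annotators."""
    return sorted(set(first) ^ set(second))


# The two candidate annotations discussed for the example sentence.
text = "Trưởng công an huyện Kỳ Sơn dẫn tôi đi"
org = "công an huyện Kỳ Sơn"
loc = "Kỳ Sơn"
candidates = [
    (text.index(loc), text.index(loc) + len(loc), "Location"),
    (text.index(org), text.index(org) + len(org), "Organization"),
]
resolved = keep_longest(candidates)

# Two hypothetical annotators who disagree on the label of one span.
user_a = [(0, 5, "Person"), (10, 16, "Location")]
user_b = [(0, 5, "Person"), (10, 16, "Organization")]
to_review = disagreements(user_a, user_b)
```

Under these assumptions, resolved keeps only the Organization span, matching the second way chosen in the text, and to_review lists the spans that a reviewer would have to settle by consulting the guideline.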
