
Document: Towards a framework for building an annotated named entities corpus (Master's thesis, Information Technology)


Description:

Towards a framework for building an Annotated Named Entities Corpus

Hoang Huu Son
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr. Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

June, 2010

Table of Contents

1 Introduction
  1.1 Overview of Named Entity Recognition (NER)
  1.2 NER approaches
    1.2.1 Rule-based approach
    1.2.2 Machine learning approach
    1.2.3 Comparison
  1.3 Thesis contribution
  1.4 Thesis structure
2 Related Work
  2.1 Overview of our problem
  2.2 Research on building NER corpora
  2.3 Research on the corpus building process
  2.4 Overview of annotation tools
  2.5 Summary
3 Corpus building process
  3.1 Corpus building process
    3.1.1 Objective
    3.1.2 Building the annotation guideline
    3.1.3 Annotating documents
    3.1.4 Quality control
  3.2 Building a Vietnamese NER corpus with offline tools
    3.2.1 Building the annotation guideline
    3.2.2 Annotating documents
    3.2.3 Quality control
  3.3 Discussion of the Vietnamese NER corpus building process
  3.4 Conclusion
4 Online Annotation Framework
  4.1 Introduction
  4.2 Training section
  4.3 Annotating documents
    4.3.1 Online annotation interface
    4.3.2 Automatic file distribution to annotators
    4.3.3 Automatic saving and management of files
  4.4 Quality control
    4.4.1 Document level
    4.4.2 Corpus level
    4.4.3 Explaining unusual entities
  4.5 Conclusion
5 Evaluation
  5.1 Introduction
  5.2 Corpus evaluation
    5.2.1 Inter-annotator agreement
    5.2.2 Offline corpus evaluation
    5.2.3 Online corpus
  5.3 Time cost
    5.3.1 Overview
    5.3.2 Offline process
    5.3.3 Online framework
  5.4 Named entity recognition system
    5.4.1 Preprocessing
    5.4.2 Gazetteer
    5.4.3 Transducer
    5.4.4 Experiment
  5.5 Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work
    6.2.1 Creating a bigger, higher-quality corpus
    6.2.2 Improving the online annotation framework
    6.2.3 Building a statistical NER system
A Named Entity guideline
  A.1 Basic concepts
    A.1.1 Entity and entity name
    A.1.2 Instance of entity
    A.1.3 List of entities
    A.1.4 Entity recognition rules
  A.2 Entity classification
    A.2.1 Person
    A.2.2 Organization
    A.2.3 Location
    A.2.4 Facility
    A.2.5 Religion

Toward a Framework for building Named Entity Corpus

Hoang Huu Son
University of Engineering and Technology
Vietnam National University, Hanoi
144, Xuan Thuy, Cau Giay, Hanoi, Vietnam

Abstract

Named entity recognition (NER) is one of the most interesting problems in the natural language processing domain.
However, a main barrier to NER research is the difficulty of building an NER corpus, and no NER corpus has been published. Therefore, in this thesis we present a corpus building process and a framework for building NER corpora, in particular a Vietnamese named entity corpus.

1. Introduction

Please note the following points.
- The context of the research and its role/importance
- Related studies and their methods/solutions/approaches
- The remaining problems and the objective of this study/thesis
- Your proposal. What will be carried out?

2. ...

- You can arrange one or more sections after the Introduction.
- You can use subsections.
- Show how the problem is formulated. You may give some foundations if necessary.
- Show different aspects of the problem, for example: feature selection, learning algorithms, etc.
- Show your proposal; it is good if you can present the differences between your proposal and previous studies. It is also important to show/analyze the solution in a reasonable way.
- Show how features are selected/built, and the algorithms/methods you will use.

3. Experiments

Jana Kravalová and Zdeněk Žabokrtský built the Czech Named Entity Corpus, presented in [?]. This recently released corpus consists of Czech sentences with manually annotated named entities and uses a rich two-level classification scheme.

You should give the following information:
- How are the models designed? You can design different models/parameters, so please describe them in detail.
- How are the data prepared?
- The results should be presented in tables and graphs.
- It is important to give a discussion after obtaining the experimental results.

4. Conclusions

- With regard to the objective of this study as stated in the introduction, what has been done?
- The contribution of your work and the meaning of the obtained results.
- Present future work if needed.

Publications

- Give here your publications during this master course.
- You can also give here your submissions and their status (i.e. submitted, revising, in press, ...).