VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES

Council: Information System
Instructor: Assoc. Prof. Dr. Dang Tran Khanh
Reviewer: Dr. Phan Trong Nhan
Student: Nguyen Dinh Khuong - 1752306

Ho Chi Minh City, August 2021

VIETNAM NATIONAL UNIVERSITY HCMC - UNIVERSITY OF TECHNOLOGY
Faculty of Computer Science and Engineering - Department of Computer Science

GRADUATION THESIS TASK SHEET
(Note: the student must attach this sheet to the first page of the report.)

Student: NGUYEN DINH KHUONG - Student ID: 1752306 - Class: KHM2 - Major: Computer Science

1. Thesis title: OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES

2. Tasks and initial requirements:
- Learn the requirements, analysis, design, and implementation of a data cleansing system running as a web application.
- Research and apply edit-based similarity algorithms, using knowledge and methodologies from Algorithm Design and Analysis, Database Management Systems, Clustering Methods, and Web Development, to provide a reasonable and optimized approach to detecting and clustering anomalous data, ready for the next processing steps.
- Read scientific papers and propose a solution to prevent inconsistent and duplicate data based on clustering methods.
- Research related work - other data cleansing systems such as GoogleRefine, BigDansing, NADEEF, ... - and make reasonable assessments and comparisons of the advantages and disadvantages of the current system; afterwards, develop further functions and optimize system performance.
- Apply k-NN methods (Levenshtein Distance, Damerau-Levenshtein Distance, Hamming Distance), Similarity methods (Jaro, Jaro-Winkler), and Key Collision methods (Fingerprint, N-gram Fingerprint) for detecting and clustering.
- Test and evaluate the proposed system.
3. Thesis assignment date: 02/02/2021
4. Completion date: 26/07/2021
5. Supervisor: Assoc. Prof. Dr. Dang Tran Khanh - Supervised part: the whole thesis
The content and requirements of the thesis have been approved by the department.

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ho Chi Minh City, August 10, 2021

THESIS DEFENSE EVALUATION SHEET (for the supervisor)

1. Student: Nguyen Dinh Khuong - Student ID: 1752306 - Major: Computer Science
2. Thesis: OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES
3. Supervisor: Assoc. Prof. Dr. Dang Tran Khanh
4. Summary of the report - computing software used: Windows, Python, ...
6. Main strengths of the thesis: The student developed a cleansing tool for improving (big) data quality in order to achieve high utility in business. Moreover, the student finished the following:
- Studying Pandas, NumPy, the JSON Python library, and other relevant programming tools.
- Investigating algorithms for measuring text similarity using different methods.
- Studying data cleansing and validation tools such as OpenRefine and Cerberus.
- Reading scientific papers and proposing a solution to prevent inconsistent and duplicate data based on the clustering method.
- Building a visualization method giving users a better view of the collected data.
- Building an API-based library for the developer community.
7. Main shortcomings of the thesis: The thesis presentation can be improved.
8. Recommendation: Eligible for defense ☐ / Needs additions before defense ☐ / Not eligible for defense ☐
9. Three questions the student must answer before the council:
a.
Point out a better functionality of OpenK compared with the known existing work/systems.
10. Overall assessment: Excellent. Score: 10/10.
Signature: Assoc. Prof. Dr. Dang Tran Khanh

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ho Chi Minh City, August 3, 2021

THESIS DEFENSE EVALUATION SHEET (for the reviewer)

Student: Nguyen Dinh Khuong - Student ID: 1752306 - Major: Computer Science
Thesis: OpenK: Data Cleansing System - A Clustering-based Approach for Detecting Data Anomalies
Reviewer: Dr. Phan Trong Nhan
Main strengths of the thesis:
- The student has developed a web application that supports users in recognizing data anomalies through a clustering-based approach with several built-in methods.
- The student has employed modern technologies for development (Flask, Jinja, pandas, NumPy, HTML, CSS, JavaScript) and performed some basic experiments (loading time, error, running time).
- The system can connect to files and cloud-based database management systems.
Main shortcomings of the thesis:
- The way of identifying data anomalies based on a clustering approach does not really show the anomalies; for example, it reports two different texts as abnormal.
- The evaluation and comparison are simple and oriented toward time rather than accuracy. In addition, it is not clearly shown how effectively the system helps in anomaly detection.
- The system is inflexible for adding more methods. Moreover, choosing parameter values is a problem for the user (e.g., the k parameter, or the limit on records loaded from the Azure database).
Recommendation: Eligible for defense [x]
Questions the student must answer before the council:
a. Would you please show a use case in which a user can benefit from your system?
b.
Any comparison with some related work (e.g., OpenRefine)?
c.
Overall assessment: Good. Score: 9/10.
Signature: Dr. Phan Trong Nhan

Ho Chi Minh City University of Technology, VNU-HCM
Faculty of Computer Science and Engineering

Acknowledgements

First and foremost, I would like to thank my supervisor, Assoc. Prof. Dr. Dang Tran Khanh, not only for his academic guidance and assistance, but also for his patience and personal support, for which I am truly grateful.

I guarantee that this research is my own, conducted under the supervision and guidance of Dr. Dang Tran Khanh. The results of my research are legitimate and have not been published in any form prior to this. All materials used within this research were collected by myself from various sources and are appropriately listed in the references section. In addition, within this research I also used the results of several other authors and organizations; they have all been aptly referenced. In any case of plagiarism, I stand by my actions and take responsibility for them; Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringement committed within my research.

GRADUATION THESIS Page 4/83

Abstract

At the moment, massive amounts of data are created every second over the internet, and making the most efficient decisions has become a critical goal. Even if we had all of the information, extracting the valuable knowledge from it would be extremely difficult, because data is not always clean or even correct: data obtained from many sources may be redundant, and some of it may be duplicated. These data must be cleaned before they can be utilized for further processing, and any inconsistencies or duplications in the datasets should be identified by a detection procedure.
Windowing, blocking, and machine learning are among the methods utilized to identify anomalous data. The goal of this thesis is to offer OpenK, a simple yet efficient data cleansing system based on clustering approaches. In this scenario, a cluster comprises all data assumed similar under a similarity-based assumption, detected by several techniques: Nearest Neighbor (Levenshtein Distance, Damerau-Levenshtein Distance, Hamming Distance), Similarity Measurement (Jaro Similarity, Jaro-Winkler Similarity), and Key Collision (Fingerprints, N-gram Fingerprints). The tool is evaluated to measure its efficiency and is compared with another tool for a better assessment. We used the airlines dataset from https://assets.datacamp.com/production/repositories/5737/datasets and, as a special case study, a Real Estate dataset crawled from https://batdongsan.com.vn. OpenK also aids the user in loading and viewing data; besides that, CRUD procedures, pagination, toggling columns on/off, sorting columns, and searching keywords are provided for analyzing and wrangling input data.

Keywords: Data Cleansing, Levenshtein Distance, Jaro-Winkler Similarity, Fingerprints, Anomaly Detection

Contents

1 Introduction 10
  1.1 Problem Statement 10
  1.2 Objective 10
  1.3 Scope 10
  1.4 Thesis Structure 11
2 Theoretical Background 13
  2.1 Related Works 13
  2.2 Data Anomalies Detection 13
    2.2.1 Conception 13
    2.2.2 Existing methods 14
  2.3 Clustering Methods 15
    2.3.1 Key Collision 15
      2.3.1.a Fingerprint 16
      2.3.1.b N-gram Fingerprint 17
    2.3.2 Nearest neighbors 18
      2.3.2.a Hamming distance 19
      2.3.2.b Levenshtein distance 20
      2.3.2.c Damerau-Levenshtein distance 22
      2.3.2.d Jaro Distance - Jaro-Winkler Distance 23
3 Methodologies And Design 26
  3.1 General Architecture 26
    3.1.1 Main components 26
    3.1.2 Detecting anomaly execution flow 28
    3.1.3 Use-case of clustering data site 29
      3.1.3.a Actor determination and following use-case 29
      3.1.3.b Use-case diagram and specification 30
  3.2 Existing System and Design 36
4 System Implementation 39
  4.1 Technologies and Framework 39
  4.2 Function implementation 40
5 System Evaluation 50
6 Thesis Denouement 54
  6.1 Achievements 54
  6.2 Assessment of Thesis Connotation 55
  6.3 Future Advancement 55
7 References
APPENDIX A: USER MANUAL 59

List of Figures

2.1 Example of applying the fingerprint algorithm to a name 16
2.2 Formula of Hamming distance calculation 19
2.3 3-bit binary cube for finding Hamming distance 20
2.4 Levenshtein distance calculation formula 21
2.5 Example of a Levenshtein distance calculation table 22
2.6 Example of a Damerau-Levenshtein distance calculation table 22
2.7 Formula of Jaro similarity calculation 23
2.8 Jaro-Winkler similarity calculation example 24
2.9 Comparison of barcode correction using different techniques 25
3.1 Overall architecture of the OpenK system 26
3.2 Data type format converter 27
3.3 Data cleansing component illustration 27
3.4 Clustering operations illustration 28
3.5 Activity diagram of the OpenK system 28
3.6 Use case diagram of the Data site of the OpenK system 31
3.7 Use case specification of viewing data 32
3.8 Use case specification of paging data 32
3.9 Use case specification of searching data keywords 32
3.10 Use case specification of sorting a data column 33
3.11 Use case specification of exporting data 33
3.12 Use case specification of hiding column data 34
3.13 Use case specification of managing data clusters 34
3.14 Use case specification of clustering data using the kNN method 35
3.15 Use case specification of clustering data using the similarity method 35
3.16 Use case specification of clustering data using the key collision method 36
3.17 Overall architecture of BigDansing 37
3.18 Overall architecture of NADEEF 38
4.1 Relation diagram of the OpenK routing system 42
4.2 Flow chart of the Upload function implementation 43
4.3 Flow chart of the Data function implementation 44
4.4 Class diagram of the clustering method 45
4.5 Flow of clustering data with the KNN class 46
4.6 Flow of clustering data with the Similarity class 47
4.7 Implementation code of clustering data with the Fingerprint algorithm 48
4.8 Flow of clustering data with the Fingerprint algorithm 49
5.1 Time performance for loading & visualizing the input dataset, OpenK vs. OpenRefine 51
5.2 Time performance for detecting & clustering the input dataset, OpenK vs. OpenRefine 52
5.3 Error percentage for detecting & clustering the input dataset, OpenK vs. OpenRefine 52

1 Introduction

“Most of the world will make decisions by either guessing or using their gut.
They will be either lucky or wrong.” – Suhail Doshi, CEO of Mixpanel

1.1 Problem Statement

Enormous amounts of data are available to companies and organizations and influence their business decisions. Data collected from various sources may be dirty, and this affects the precision of predicted results. Data cleansing offers better data quality, which is a great help for an organization in making sure its data is ready for the analysis phase. For these reasons, the job of cleaning the collected data and detecting its anomalies is of the utmost importance.

1.2 Objective

The main objective of the thesis is to develop a cleansing tool that improves data quality in order to achieve high performance in business. The system should also be able to resolve problems that might occur during the cleaning phase, such as unusual or duplicate data. Moreover, this thesis proposes an API library for the developer community.

1.3 Scope

In the scope of this thesis, we will carry out the following tasks:
• Study data analysis tools such as Pandas, NumPy, and the JSON Python library.
• Study algorithms for measuring text similarity using different methods, such as Levenshtein distance, Damerau-Levenshtein distance, the Jaro algorithm, Jaro-Winkler distance with a similarity-based metric, and the fingerprint method.
• Study data cleansing and validation tools such as OpenRefine and Cerberus.
• Propose a solution to prevent inconsistent and duplicate data.
• Build a visualization method giving users a better view of the collected data.
• Build an API library for the developer community.

1.4 Thesis Structure

• Chapter 1: Introduction - In this chapter, the thesis covers (i) the problem statement - the problem this thesis aims to solve.
(ii) The list of missions to be accomplished; (iii) the scope of the thesis - the list of tasks carried out; and (iv) the structure of the thesis.
• Chapter 2: Theoretical Background - In this chapter, (i) the related works of other authors are mentioned, giving an overview of what has been done previously; (ii) data anomaly detection is presented - the concept and existing methods; and (iii) clustering methods are acknowledged - Key Collision and Nearest Neighbors - including their concepts and how to work with them.
• Chapter 3: Methodologies and Design - This chapter consists of (i) the system objectives, clearly showing the target of the cleansing system; (ii) the general architecture, giving an overall view of the system design; (iii) the execution flow, showing how individual statements, components, or function calls of the program are executed or evaluated; (iv) existing designs; and (v) achievements and discussion.
• Chapter 4: System Implementation - This chapter covers (i) the technologies and frameworks used, namely Pandas, NumPy, JSON web libraries for Python, the fingerprints library, and more; and (ii) the function implementation - the functions, classes, and parameters adopted.
• Chapter 5: System Evaluation - This chapter covers (i) the measurements, i.e., the criteria for evaluating the efficiency of the system, and (ii) the efficiency scores for those measurements - also known as the appraisal.
• Chapter 6: Thesis Denouement - In this chapter, the summary of the thesis is written, including (i) the final culmination; (ii) the thesis evaluation, to confirm the contribution of the thesis; and (iii) future advancements for the later development of the cleansing system.
• Chapter 7: References - In this chapter, all the referred results and references to other papers are denoted.

2 Theoretical Background

2.1 Related Works

At present, there are many systems and studies on data cleansing. In [1], BIGDANSING - the Big Data Cleansing System - was presented to tackle the problem of cleaning big data, detecting anomalies and generating possible fixes for them. Moreover, [2] presented NADEEF - an open-source single-node platform supporting both declarative and user-defined quality rules - whose research efforts targeted the usability of a data cleansing system, but at the expense of performance and scalability. Google Refine, later known as OpenRefine [3], is a powerful Java-based tool that allows the user to load data, understand it, clean it up, reconcile it, and augment it with data coming from the web - all from a web browser and with the comfort and privacy of the user's computer. OpenRefine notes that its clustering works only at the syntactic level (the character composition of the cell value) and, while very useful for spotting errors, typos, and inconsistencies, it is by no means enough to perform effective semantically-aware reconciliation - but one cannot deny that it is very effective at treating duplicate and inconsistent data points. Finally, [4] provides powerful yet simple and lightweight data validation functionality out of the box and is designed to be easily extensible, allowing for custom validation. In this thesis, we approach the problem of detecting data anomalies using clustering-based techniques, as OpenRefine did previously.
2.2 Data Anomalies Detection

2.2.1 Conception

In data analysis, anomaly detection (also outlier detection) is the identification of rare items, events, or observations which raise suspicions by differing significantly from the majority of the data. Typically the anomalous items translate to some kind of problem, such as bank fraud, a structural defect, medical problems, or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.

Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least with the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems being the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by that model.
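As a minimal illustration (not part of the thesis's method), the unsupervised setting described above can be sketched in a few lines: flag the instances that fit least with the rest of the data, here with a simple z-score rule on numeric values. The threshold and sample values are assumptions chosen for the example.

```python
# Illustration only: unsupervised anomaly detection flags instances
# that fit least with the rest of the data. A simple z-score rule
# marks values far from the mean, measured in standard deviations.

def zscore_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` std devs from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return []  # all values identical: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 95]  # 95 is the anomalous burst
print(zscore_outliers(readings))            # -> [7]
```

Note that such a rule is entirely unsupervised: no labels are needed, only the assumption that most instances are normal.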
2.2.2 Existing methods

Several designs [10] have been proposed for detecting data anomalies:
1. One-class support vector machines [11].
2. Fuzzy logic-based outlier detection.
3. Deviations from association rules and frequent itemsets.
4. Bayesian networks [12].
5. Hidden Markov models (HMMs) [12].
6. Replicator neural networks [12] and autoencoders / variational autoencoders [13].
7. Long short-term memory networks [14].

The performance of the different methods depends heavily on the data set and parameters, and no method has a systematic advantage over the others when compared across many data sets and parameter settings [15]. In this context, we use clustering-based techniques - Key Collision and K-Nearest Neighbors - for detecting anomalies.

2.3 Clustering Methods

In OpenK, clustering on a column is a great way to look for inconsistencies in your data and fix them. Clustering uses a variety of comparison methods to find text entries that are similar but not exact matches, then shares those results with you so that you can merge the cells that should match. Where editing one cell or text facet at a time can be time-consuming and difficult, clustering is quick and streamlined. OpenK always requires the user to merge a cluster and edit its name: the value the user chooses becomes the cluster's name and is applied to all elements of the cluster. In order to do its analysis, clustering performs a lot of cleaning actions behind the scenes, but only the merges that you accept affect your data. Understanding the various behind-the-scenes cleanups can help you determine which clustering approach is the most accurate and successful.
2.3.1 Key Collision

Key Collision methods are based on the idea of creating an alternative representation of a value (a "key") that contains only the most valuable or meaningful part of the string, and of binning together different values whose keys are the same (hence the name "key collision").

2.3.1.a Fingerprint

Fingerprinting is the method least likely to produce false positives, so it is a good place to start. It does the same kind of data cleaning behind the scenes that you might think to do manually: collapse whitespace into single spaces, lowercase all uppercase letters, discard punctuation, remove diacritics (e.g. accents) from characters, and split strings into words sorted alphabetically (so "Khương, Nguyễn Đình" becomes "dinh khuong nguyen"). The process that generates the key from a string value is the following (note that the order of these operations is significant):
+ Remove leading and trailing whitespace.
+ Change all characters to their lowercase representation.
+ Remove all punctuation and control characters.
+ Normalize extended Western characters to their ASCII representation (for example, "gödel" → "godel").
+ Split the string into whitespace-separated tokens.
+ Sort the tokens and remove duplicates.
+ Join the tokens back together.

Figure 2.1: Example of applying the fingerprint algorithm to a name

Several factors need to be considered in this method. Because whitespace is normalized, characters are lowercased, and punctuation is deleted, those parts of a string do not differentiate fingerprints.
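The key-generation steps listed above can be sketched in Python. This is a sketch consistent with the listed steps, not OpenK's actual implementation; note that the ASCII normalization shown handles decomposable accents like "ö", while letters without a Unicode decomposition (such as "đ") would need an extra mapping.

```python
import re
import unicodedata

def fingerprint(value: str) -> str:
    """Fingerprint key: trim, lowercase, strip punctuation, ASCII-fold,
    then sort and de-duplicate the whitespace-separated tokens."""
    key = value.strip().lower()
    key = re.sub(r"[^\w\s]", "", key)             # drop punctuation/control chars
    key = unicodedata.normalize("NFKD", key)      # decompose accented characters
    key = key.encode("ascii", "ignore").decode()  # e.g. "gödel" -> "godel"
    tokens = sorted(set(key.split()))
    return " ".join(tokens)

# Values that collide on the same key land in the same cluster:
print(fingerprint("Cruise, Tom"))    # -> "cruise tom"
print(fingerprint("  Tom Cruise "))  # -> "cruise tom"
print(fingerprint("gödel"))          # -> "godel"
```

Because the tokens are sorted and de-duplicated, reordered and repeated words collapse onto the same key, which is exactly the collision behavior the method relies on.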
Because these string features are the least significant for distinguishing meaning, yet the most variable parts of the strings, removing them gives a considerable advantage in forming clusters. Since the parts of the string are sorted, the order of the tokens does not matter ("Cruise, Tom", "Tom Cruise" and "Tom Cruise," all end up with the same fingerprint and in the same cluster). Normalizing extended Western characters reproduces data entry mistakes made when entering extended characters with an ASCII-only keyboard. Note that this procedure can also lead to false positives: for example, "gödel" and "godél" would both end up with "godel" as their fingerprint, but they are likely to be different names, so this may work less effectively for data sets where extended characters play a substantial differentiating role.

2.3.1.b N-gram Fingerprint

The N-gram Fingerprint method allows us to set the n value to anything we want; it generates n-grams of size n (after cleaning), alphabetizes them, and then reassembles them into a fingerprint. A 1-gram fingerprint, for example, simply sorts all of the letters in the cell into alphabetical order by dividing the value into segments of one character. A 2-gram fingerprint locates all two-character segments, eliminates duplicates, alphabetizes them, and reassembles them (for example, "banana" yields the 2-grams "ba an na an na", which become "anbana"). This can aid in matching cells with typos and extra spaces (for example, matching "lookout" and "look out", which whole-word fingerprinting does not identify because it separates words). The greater the n value, the fewer values fall into each cluster. With 1-grams, keep a watch on quite different values that happen to share the same letters (for example, "wellington" and "Elgin Town" have identical 1-gram fingerprints).
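The n-gram variant just described can be sketched as follows (again an illustration consistent with the description, not OpenK's exact code):

```python
import re
import unicodedata

def ngram_fingerprint(value: str, n: int = 2) -> str:
    """N-gram fingerprint: clean the string, collect its n-grams,
    de-duplicate and sort them, then join them back together."""
    key = value.lower()
    key = unicodedata.normalize("NFKD", key)
    key = key.encode("ascii", "ignore").decode()  # ASCII-fold accents
    key = re.sub(r"[^a-z0-9]", "", key)           # drop punctuation AND whitespace
    grams = sorted({key[i:i + n] for i in range(len(key) - n + 1)})
    return "".join(grams)

print(ngram_fingerprint("Paris", 2))   # -> "arispari"
print(ngram_fingerprint("Paris", 1))   # -> "aiprs"
print(ngram_fingerprint("banana", 2))  # -> "anbana"
print(ngram_fingerprint("look out", 2) == ngram_fingerprint("lookout", 2))  # -> True
```

Because whitespace is stripped before the n-grams are taken, "look out" and "lookout" collide, which is precisely what lets this method catch spacing typos that the word-level fingerprint misses.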
The n-gram fingerprint method does the following:
+ Change all characters to their lowercase representation.
+ Remove all punctuation, whitespace, and control characters.
+ Normalize extended Western characters to their ASCII representation (for example, "gödel" → "godel").
+ Obtain all the string's n-grams.
+ Sort the n-grams and remove duplicates.
+ Join the sorted n-grams back together.

So, for example, the 2-gram fingerprint of "Paris" is "arispari" and the 1-gram fingerprint is "aiprs". Check the code if you're curious about the details. Why is this useful? In practice, using big values of n doesn't yield any advantage over the previous fingerprint method, but 2-grams and 1-grams, while yielding many false positives, can find clusters that the previous method missed, even for strings with small differences, at a very small performance price. For example, "Krzysztof", "Kryzysztof", and "Krzystof" have different lengths and different regular fingerprints, but share the same 1-gram fingerprint because they use the same letters.

2.3.2 Nearest neighbors

While key collision methods are very fast, they tend to be either too strict or too lax, with no way to fine-tune how much difference between strings we are willing to tolerate. The Nearest Neighbor methods (also known as kNN), on the other hand, provide a parameter (the radius, or k) representing a distance threshold: any pair of strings closer than this value will be binned together.

2.3.2.a Hamming distance

Definition. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ.
In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for quantifying the edit distance between two equal-length sequences. It is named after Richard Hamming, an American mathematician.

Figure 2.2: Formula of Hamming distance calculation

Applications of the Hamming distance

A key application is coding theory, especially block codes, in which the equal-length strings are vectors over a finite field. The minimum Hamming distance is used to establish several basic coding-theory concepts, for example error-detecting and error-correcting codes. Specifically, a code C is said to be k-error-detecting if and only if the minimum Hamming distance between any two of its codewords is at least k + 1. Consider a code consisting of the two codewords "000" and "111". Because the Hamming distance between these two words is 3, the code is k = 2 error-detecting: the error can be identified even if one or two bits are flipped. When three bits are inverted, "000" becomes "111" and the error is undetectable. As a result, a code having a minimum Hamming distance d between its codewords can detect at most d − 1 errors.
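The definition and the "000"/"111" example above can be checked with a short sketch (illustrative, not the thesis's implementation):

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is defined for equal-length strings")
    return sum(x != y for x, y in zip(a, b))

# Codewords "000" and "111" are at distance 3, so the code is
# k = 3 - 1 = 2 error-detecting: one or two flipped bits never
# produce another valid codeword, but three flips go undetected.
print(hamming_distance("000", "111"))          # -> 3
print(hamming_distance("000", "110"))          # -> 2 (two flips: detectable)
print(hamming_distance("karolin", "kathrin"))  # -> 3
```

In the clustering context, the same function serves as the distance metric for the kNN radius: two values are binned together when their Hamming distance falls below the chosen threshold.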