VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
——————– * ———————
THESIS
DEVELOPING A HADOOP-BASED
DISTRIBUTED SYSTEM FOR
METAGENOMIC BINNING
Major: Computer Engineering
Committee: Computer Engineering 1
(English Program)
Supervisor: Assoc. Prof. Dr. Tran Van Hoai
Reviewer: Assoc. Prof. Dr. Thoai Nam
—o0o—
Student 1: Tran Duong Huy (1752242)
Student 2: Pham Nhat Phuong (1752042)
Student 3: Nguyen Huu Trung Nhan (1752392)
VIETNAM NATIONAL UNIVERSITY HCMC
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science & Engineering
DEPARTMENT: Computer Science
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
GRADUATION THESIS TASK SHEET
Note: students must attach this sheet to the first page of the thesis report.
FULL NAME: _______________________________________ STUDENT ID: ________________
FULL NAME: _______________________________________ STUDENT ID: ________________
FULL NAME: _______________________________________ STUDENT ID: ________________
MAJOR: ____________________________________ CLASS: _______________________
1. Thesis title:
Developing a Hadoop-based Distributed System for Metagenomic Binning
2. Tasks (requirements on content and initial data):
- Study metagenomic binning and the BiMeta algorithm.
- Study Hadoop and set up a demonstration system.
- Develop the BiMeta algorithm on Hadoop.
- Build a user interface so that users can run BiMeta on the Hadoop computing system.
- Evaluate the suitability of Hadoop for the metagenomic binning problem on datasets used by the community.
3. Date of assignment: …
4. Date of completion: …
5. Supervisors and their parts:
1) Assoc. Prof. Dr. Tran Van Hoai, supervising the whole thesis.
2) ____________________________________________________________________________
3) ____________________________________________________________________________
The content and requirements of the thesis have been approved by the Department.
Date ........ / ......... / ..........
HEAD OF DEPARTMENT
MAIN SUPERVISOR
(Sign and write full name)
(Sign and write full name)
FOR FACULTY / DEPARTMENT USE:
Preliminary grader: _________________________
Unit: _________________________________________
Defense date: ____________________________________
Final grade: ___________________________________
Thesis archive location: _______________________________
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date ..... / ..... / .....
THESIS DEFENSE EVALUATION SHEET
(For the supervisor / reviewer)
Student: Tran Duong Huy. ID: 1752242. Major: Computer Engineering.
Student: Pham Nhat Phuong. ID: 1752042. Major: Computer Engineering.
Student: Nguyen Huu Trung Nhan. ID: 1752392. Major: Computer Engineering.
Topic: Developing a Hadoop-based Distributed System for Metagenomic Binning
Supervisor / reviewer: Assoc. Prof. Dr. Tran Van Hoai
Overview of the report:
Number of pages: ..... Number of chapters: .....
Number of data tables: .....
Number of figures: .....
Number of references: 15
Computational software: .....
Products / artifacts: .....
Overview of drawings:
- Number of drawings (by size): .....
- Hand-drawn: ..... Computer-drawn: .....
Main strengths of the thesis:
- The students implemented a system with all required components: a Hadoop-based computing infrastructure; an algorithm developed on Hadoop in which all steps of BiMeta were converted to Hadoop and Spark; and a user interface for interacting with the system.
- The students grasped the basic knowledge of Hadoop and Spark needed to set up a demonstration system.
- The students grasped the basic knowledge needed to deploy a sequential algorithm onto a Hadoop-style distributed computing environment.
- The students experimented with datasets used in reputable publications to evaluate the capability of the system.
Main shortcomings of the thesis:
- Because of the students' limited background in biology, the conversion of the algorithm to the Hadoop and Spark programming models still has some limitations; in particular, the computation time does not yet demonstrate the capability of the distributed system.
- Because a large computing infrastructure was not easily accessible, the computational results are still limited.
- The students have only basic knowledge of metagenomic binning, so the user-interface functions of the system are not very creative.
Recommendation: Eligible for defense [x]. Additional work required before defense [ ]. Not eligible for defense [ ].
Questions the students must answer before the Committee:
a) If you continued development to increase computing capability, what specifically would you do?
Overall assessment (in words: very good / good / average): very good
Grade: .....
Signature (full name):
Tran Van Hoai
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date ..... / ..... / .....
THESIS DEFENSE EVALUATION SHEET
(For the reviewer)
Student: Tran Duong Huy. ID: ..... Major: Computer Engineering.
Student: Pham Nhat Phuong. ID: ..... Major: Computer Engineering.
Student: Nguyen Huu Trung Nhan. ID: ..... Major: Computer Engineering.
Topic: Developing a Hadoop-based distributed system for metagenomic binning
Supervisor / reviewer: Assoc. Prof. Dr. Tran Van Hoai
Overview of the report:
Number of pages: ..... Number of chapters: ..... Number of data tables: ..... Number of figures: ..... Number of references: ..... Computational software: ..... Products / artifacts: .....
Overview of drawings:
- Number of drawings (by size): .....
- Hand-drawn: ..... Computer-drawn: .....
Main strengths of the thesis:
- The students mastered the computing techniques and technologies on Hadoop and Spark well.
- The metagenomic binning problem was also well understood by the students, who implemented a stepwise computational solution with corresponding algorithms on MapReduce and Spark.
- A computing environment on Hadoop and Spark was built and deployed to solve the metagenomic binning problem.
- The experiments follow defined scenarios and were carried out on datasets.
Main shortcomings of the thesis:
- Due to limits on time and the available environment, the evaluation of the solution to the metagenomic binning problem on a distributed environment using Hadoop and Spark is still simple, so the effectiveness of the solution has not been demonstrated, nor have the results been analyzed with a view to improving its performance.
Recommendation: Eligible for defense [x]. Additional work required before defense [ ]. Not eligible for defense [ ].
Questions the students must answer before the Committee: .....
Overall assessment (in words: very good / good / average): very good
Grade: .....
Signature:
Thoai Nam
Declaration
We declare that the topic "Developing a Hadoop-based distributed system for metagenomic binning" is our own thesis work, carried out with our own effort and time under the guidance of our supervisor, Assoc. Prof. Dr. Tran Van Hoai.
All of the research results were produced by ourselves and were not copied from any other source. If there is any evidence of plagiarism, we accept responsibility for all consequences.
Ho Chi Minh City, 2020,
Tran Duong Huy, Pham Nhat Phuong, Nguyen Huu Trung Nhan.
Acknowledgement
This thesis would not have been possible without the help and support of many others. First and foremost, we would like to express our sincere gratitude to our supervisor, Assoc. Prof. Dr. Tran Van Hoai. His insight and expert knowledge of the field improved our research work, and he helped us organize our thinking both in research and in writing this thesis.
We sincerely thank the teachers of the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology for their enthusiasm in imparting knowledge during our time at the university. The knowledge accumulated throughout our studies helped us complete this thesis.
In the end, we would like to wish the teachers and our supervisor good health and success in
their noble careers.
Abstract
Bioinformatics is an area in which biological data are vast, extensive, and complex. Biological data are constantly growing and often unlabeled, so supervised methods cannot be applied directly. One of the most pressing problems in this area is detecting new strains of a virus, or at least grouping them together (for example, determining the proximity of SARS-CoV-2 to bat viruses). One way to address this problem is to create scalable clustering tools that can handle very large amounts of data. Next-generation sequencing technologies such as Illumina and Roche 454 produce enormous volumes of sequence data every week, which demands efficient computing. Our goal in this thesis, inspired by previous research, is to create a Hadoop-based tool for metagenomic binning.
Contents

Declaration
Acknowledgement
Abstract
1 Introduction
  1.1 Overview of this thesis
  1.2 Scope and Objectives
  1.3 Thesis Outline
2 Background
  2.1 Metagenomic
    2.1.1 Background
    2.1.2 Basic Concepts
    2.1.3 Metagenomic Binning
  2.2 Hadoop
    2.2.1 Hadoop Components
    2.2.2 HDFS - Hadoop Distributed File System
    2.2.3 Hadoop MapReduce
    2.2.4 YARN
  2.3 Spark
    2.3.1 What is Spark?
    2.3.2 Introduction to Spark
    2.3.3 How do Sparks run on a cluster?
3 Related Work
  3.1 A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
    3.1.1 Background
    3.1.2 Method
    3.1.3 Fundamentals of proposed method
    3.1.4 Datasets
    3.1.5 Result
    3.1.6 Conclusion
  3.2 Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons
    3.2.1 Background
    3.2.2 Method
    3.2.3 Datasets and Hadoop cluster configuration
    3.2.4 Result
4 Methodology
  4.1 BiMetaReduce
    4.1.1 Step 1: Load Fasta file
    4.1.2 Step 2: Create Document
    4.1.3 Step 3: Create Corpus
    4.1.4 Step 4: Build Overlap Graph
    4.1.5 Step 5 and 6: Find Connected Component and Clustering
  4.2 System Design
    4.2.1 Overview
    4.2.2 Proposed System Architecture
    4.2.3 Devices and Components
    4.2.4 Web Application Design
5 Experimental and Evaluation
  5.1 Datasets
  5.2 Experiments
6 Conclusions
References
List of Figures

2.1 Reads and Sequence
2.2 Hadoop Components
2.3 A Basic Hadoop Cluster
2.4 Upload A File To HDFS
2.5 Read A File From HDFS
2.6 The first few lines of the file S8.fna
2.7 The pipeline of the phase Reading Fasta File
2.8 Map-Reduce Logical Data Flow for Reading Fasta File
2.9 Hadoop YARN - How YARN manages a running job
2.10 Spark's toolkit
2.11 The architecture of a Spark Application
2.12 A narrow dependency
2.13 A wide dependency
2.14 A cluster driver and worker (no Spark Application yet)
2.15 Spark's Cluster mode
2.16 Spark's Client mode
2.17 Requesting resources for a driver
2.18 Launching the Spark Application
2.19 Application execution
2.20 Shutting down the application
3.1 Binning process of BiMeta
3.2 The Libra workflow
3.3 Scalability testing for Libra
4.1 Workflow of the BiMetaReduce program
4.2 Overview of the proposed system
4.3 Python & Django frameworks
4.4 HTML, CSS & Javascript
4.5 Model View Template Model
4.6 Use case diagram of web application
4.7 Login and Register Sequence Diagram
4.8 Homepage, "About" page and "Your Project" page Sequence Diagram
4.9 "System" page Sequence Diagram
4.10 Homepage
4.11 Homepage 2
4.12 Homepage 3
4.13 "System" page
4.14 "Your Project" page UI
4.15 "About" page UI
5.1 Sequential and MapReduce Runtime
5.2 Runtime of different settings
5.3 Runtime of setting 3 and 4
5.4 Runtime of setting 7 and 8
5.5 Fruchterman-Reingold graph of the R4 file
5.6 Kamada-Kawai graph of the R4 file
5.7 Circular graph of the R4 file
Chapter 1
Introduction
1.1
Overview of this thesis
Bioinformatics research is considered an area that includes large, expanding, and complex biological data sets. The biological data are increasingly large and often unlabeled, making it impossible to use supervised learning methods. A challenging problem in this domain is detecting a new strain of a virus, or at least clustering it into a group of species, which is a very urgent need (such as identifying the proximity of SARS-CoV-2 to a group of bat viruses). An alternative approach to the problem is to build a scalable clustering tool that can work with very large datasets.
Next-generation sequencing platforms such as Illumina and Roche 454 produce enormous volumes of sequence data every week, demanding efficient computation. Inspired by previous studies, our goal in this thesis is to create a Hadoop-based tool for metagenomic binning.
1.2
Scope and Objectives
This work aims to develop a scalable clustering tool for working with large datasets. The objectives of this thesis are:
• Learn the fundamentals of genes and genomes.
• Research genomic technologies.
• Research the Hadoop distributed processing framework for data processing and storage in big data applications, and related distributed computing environments.
• Learn about metagenomic clustering problems and algorithms capable of distributed computing.
• Develop distributed clustering tools and run experiments on real microbiological datasets from NCBI's database.
1.3
Thesis Outline
This thesis contains six main parts. Chapter 1 is the introduction of the project. Chapter 2 presents the background of the research field. Chapter 3 reviews related work that proposes different methods to solve the problem. Chapter 4 presents the proposed methods and discussion. Chapter 5 covers our experimental results and evaluation. The last chapter is the conclusion of our work.
Chapter 2
Background
2.1
2.1.1
Metagenomic
Background
Microbes run the world, even though we usually do not see them. They are essential to every area of human life, in fact to all life on Earth. Almost all knowledge about microbes is based on laboratory work performed under unusual and unnatural conditions: optimal growth in artificial environments, in pure cultures, without ecological context [8]. Metagenomic science, which is only a few years old, makes it possible to study microbes in their natural environment, the complex communities in which they usually live. Understanding microbial communities brings changes to many fields, such as biology, medicine, ecology, and biotechnology.
This leads to the question: what is metagenomics? The word is made up of two parts: meta and genomics. Genomics refers to collections of DNA sequences, and meta indicates that we obtain them from many organisms together. Metagenomics is a molecular tool used to analyze DNA from environmental samples in order to study the existing microbial communities without needing pure cultures [4].
Sanger sequencing, also known as the "chain termination method", is a well-known method for determining the nucleotide sequence of DNA. Many years after the development of Sanger sequencing, massively parallel sequencing technology, known as next-generation sequencing (NGS), revolutionized the biological sciences. Next-generation sequencing technologies generate millions of reads faster and at lower cost. Recently, many projects have used Roche 454, the Illumina Genome Analyzer, and AB SOLiD [11].
The analysis of microbial communities is usually carried out by a process referred to as binning, in which reads from related organisms are grouped together. Binning methods can be broadly classified into two categories: supervised and unsupervised methods. Among the most important supervised methods we can recall Mega [6], Kraken [14], CLARK [9], and MetaPhlAn [10]. The other category, unsupervised methods, includes MetaProb [5], BiMeta [12], MetaCluster [13], and AbundanceBin [15].
Because of the lack of labeled datasets, unsupervised methods are a better approach to the binning problem. Among the existing studies, we found that BiMeta and MetaProb achieve significant results in metagenomic clustering. In this thesis, we develop a Hadoop-based tool for metagenomic binning, using BiMeta as the base clustering model.
2.1.2
Basic Concepts
1. Definition Of Bioinformatics
The term “bioinformatics” was coined by Paulien Hogeweg and Ben Hesper in 1978.
Bioinformatics is fundamentally informatics as applied to biology or computer-aided
analysis of biological data [2].
2. Goals Of Bioinformatics Analysis
The main goal of bioinformatics is to be able to predict biological processes in health and disease. To do so, an understanding of the biological processes involved is necessary. Thus, a further goal of bioinformatics is to develop such an understanding through analysis and integration of the information obtained on genes and proteins. Furthermore, bioinformatics develops new tools and continuously improves the existing set of tools for diverse types of analyses [2].
3. DNA & Nucleobases
DNA is a double-stranded right-handed helix; the two strands are complementary because
of complementary base pairing, and antiparallel because the two strands have opposite
5’-3’ orientation.
DNA is composed of structural units called nucleotides (deoxyribonucleotides). Each nucleotide is composed of a pentose sugar (2′-deoxy-D-ribose); one of the four nitrogenous bases: adenine (A), thymine (T), guanine (G), or cytosine (C); and a phosphate. The pentose sugar has five carbon atoms, numbered 1′ (1-prime) through 5′ (5-prime). The base is attached to the 1′ carbon atom of the sugar, and the phosphate is attached to the 5′ carbon atom.
In double-stranded DNA, A pairs with T by two hydrogen bonds and G pairs with C by three hydrogen bonds; thus GC-rich regions of DNA have more hydrogen bonds and consequently are more resistant to thermal denaturation [2]. Other nucleotide symbols in common use include R (which denotes G or A) and its complement Y (which denotes C or T).
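The pairing rules above can be illustrated with a short sketch. This is a hypothetical helper, not part of the thesis software; it computes the reverse complement of a strand (the complementary strand read in the 5′-3′ direction), including the ambiguity codes R and Y, and the GC fraction that governs resistance to thermal denaturation.

```python
# Sketch: reverse complement of a DNA strand, based on the pairing
# rules above (A-T, G-C) plus the ambiguity codes R (G or A) and
# Y (C or T). Illustrative helper, not part of the thesis system.

COMPLEMENT = str.maketrans("ATGCRY", "TACGYR")

def reverse_complement(strand: str) -> str:
    """Complement each base, then reverse to get 5'-3' orientation."""
    return strand.translate(COMPLEMENT)[::-1]

def gc_content(strand: str) -> float:
    """Fraction of G/C bases; GC-rich DNA resists thermal denaturation."""
    return (strand.count("G") + strand.count("C")) / len(strand)

print(reverse_complement("ATGCR"))  # -> YGCAT
print(gc_content("ATGC"))           # -> 0.5
```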
4. Sequencing & Read
Genome sequencing is the most direct method of detecting mutations, such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). The development of the dideoxy method of DNA sequencing was a major step forward for molecular biology; it was published by Sanger and colleagues in 1977. About 20 years after the development of Sanger's dideoxy sequencing, Pal Nyren introduced the pyrosequencing technique, which paved the way for the development and commercialization of large-scale, high-throughput, massively parallel sequencing technology, popularly referred to as next-generation sequencing (NGS) technology [2].
Most recent projects have made use of next-generation sequencing technologies such as 454 pyrosequencing, the Illumina Genome Analyzer, and AB SOLiD. These technologies can generate millions of reads much faster and at lower cost. However, the length of the sequences they produce varies greatly: Illumina read lengths range from 50 to 300 bp, while the Roche 454 system can generate reads up to 700 bp. As a result, both long-read and short-read research tools are needed for metagenomic projects [12].
4
In next-generation sequencing, a read refers to the DNA sequence obtained from one fragment (a small section of DNA). Prior to sequencing, most next-generation sequencing methods fragment the genome, and each sequenced fragment produces a read. The length of the reads and the number of reads generated are determined by the fragment size and the technology used. Since DNA fragments usually overlap, the reads can be pieced together to recreate the genome. Next-generation sequencing techniques that do not fragment the genome are known as long-read sequencing because they yield very long reads.
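As a toy illustration of the fragmentation described above, the sketch below cuts a genome string into overlapping fixed-length reads; because consecutive reads share sequence, they could later be pieced back together. The read length and step are made-up parameters, not values from any particular sequencing platform.

```python
# Sketch: simulate shotgun-style reads by sliding a fixed-length
# window over a genome string. `read_len` and `step` are illustrative
# parameters only.

def make_reads(genome: str, read_len: int, step: int) -> list[str]:
    """Cut `genome` into overlapping reads of length `read_len`,
    starting a new read every `step` bases."""
    reads = []
    for start in range(0, len(genome) - read_len + 1, step):
        reads.append(genome[start:start + read_len])
    return reads

genome = "ATGCGTACGTTAGC"
print(make_reads(genome, read_len=6, step=3))
# -> ['ATGCGT', 'CGTACG', 'ACGTTA']  (each read overlaps the next by 3)
```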
Figure 2.1: Reads and Sequence
2.1.3
Metagenomic Binning
1. Binning Method & Clustering Taxonomy
The analysis of microbial communities is usually carried out by a process referred to as binning, which groups reads into clusters that represent genomes from closely related organisms. Binning methods can be roughly classified into three main categories: supervised, semi-supervised, and unsupervised methods. Among these, unsupervised methods base classification on features derived from the reads themselves, which is particularly useful when reference database availability is limited. There are many clustering algorithms; their processes can be broadly classified into five categories: partitioning-based, hierarchical-based, density-based, grid-based, and model-based. Clustering can be supervised or unsupervised.
In supervised clustering, the expression pattern of the gene(s) is known, and this knowledge is used to group genes into clusters. The most widely used method of unsupervised clustering is hierarchical clustering [2]. However, because this thesis applies the BiMeta binning algorithm, the unsupervised clustering method used is k-means, which belongs to the partitioning-based category.
2. K-means Clustering
K-means is a centroid-based (distance-based) algorithm that computes distances to allocate each point to a cluster. Each cluster in K-means is associated with a centroid. There are five main steps in the K-means algorithm:
• Step 1: Choose the number of clusters k
• Step 2: Select k random points from the data as centroids
• Step 3: Assign all the points to the closest cluster centroid
• Step 4: Recompute the centroids of the newly formed clusters
• Step 5: Repeat steps 3 and 4
There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:
• The centroids of newly formed clusters do not change
• Points remain in the same cluster
• The maximum number of iterations is reached
The K-means clustering algorithm in detail is given as Algorithm 1.
Algorithm 1 k-means clustering
Input: dataset, number of clusters k
Output: centroids of the k clusters, labels for the training data
1: Obtain a sample by constructing the coreset of the input dataset.
2: For a given cluster assignment C, the total cluster variance is minimized with respect to {m_1, ..., m_K}, yielding the means of the currently assigned clusters.
3: Given a current set of means {m_1, ..., m_K}, the total cluster variance is minimized by assigning each observation to the closest (current) cluster mean. That is,
   C(i) = argmin_{1 ≤ k ≤ K} ||x_i − m_k||²
4: Steps 2 and 3 are iterated until the assignments do not change.
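The assignment/update loop of Algorithm 1 can be sketched as follows. This is an illustrative implementation of the generic k-means iteration (the coreset sampling of step 1 is omitted), not the code used in the thesis system.

```python
# Sketch of the k-means loop (steps 2-4 of Algorithm 1). Illustrative only.
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct starting centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:               # stop when assignments settle
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, labels

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, labels = kmeans(pts, k=2)
print(labels)  # the two nearby pairs end up in the same cluster
```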
3. Metrics in the binning problem
Three commonly used performance metrics, namely precision, recall, and F-measure, are used to evaluate a binning algorithm. Let m be the number of species in a metagenomic dataset, and k the number of clusters returned by the binning algorithm. Let A_ij be the number of reads from species j assigned to cluster i. A cluster i is said to represent a species j_0 when A_{ij_0} = max_j A_ij. The precision and recall are defined as follows:
precision = (Σ_{i=1}^{k} max_j A_ij) / (Σ_{i=1}^{k} Σ_{j=1}^{m} A_ij)
recall = (Σ_{j=1}^{m} max_i A_ij) / (Σ_{i=1}^{k} Σ_{j=1}^{m} A_ij + #unassigned reads)
Precision shows the ratio of reads assigned to a cluster that belong to the same species, while recall presents the ratio of reads from the same species that are assigned to the same cluster.
If the number of clusters k is much larger than the number of species m, the majority of reads in each cluster probably belongs to a single genome, and thus precision would be high; however, recall would be low, as some genomes are represented by multiple clusters. If k is much smaller than m, some clusters would contain reads from multiple genomes, and precision would be low. Thus, precision increases while recall decreases with the number of predicted clusters. Besides precision and recall, we also use the F-measure, which emphasizes both:
F-measure = 2 / (1/precision + 1/recall)
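Given the assignment matrix A (clusters by species), the formulas above translate directly into code. The sketch below is illustrative; the example matrix and unassigned-read count are made up.

```python
# Sketch: precision, recall, and F-measure from the assignment matrix
# A, where A[i][j] = number of reads of species j assigned to cluster i.
# The example matrix below is made up for illustration.

def binning_metrics(A, unassigned=0):
    total = sum(sum(row) for row in A)
    # precision: best-represented species per cluster, over all assigned reads
    precision = sum(max(row) for row in A) / total
    # recall: best cluster per species, over all reads incl. unassigned ones
    m = len(A[0])
    recall = sum(max(A[i][j] for i in range(len(A)))
                 for j in range(m)) / (total + unassigned)
    # harmonic mean of precision and recall
    f_measure = 2 / (1 / precision + 1 / recall)
    return precision, recall, f_measure

# Two clusters, two species: each cluster is dominated by one species.
A = [[90, 10],
     [5, 95]]
p, r, f = binning_metrics(A, unassigned=0)
print(p, r)  # -> 0.925 0.925
```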
2.2
Hadoop
In this section, we cover the fundamental components of Hadoop. Hadoop is open-source software (an open-source big data framework) that allows users to store and process big data in a distributed environment, across a group of many computers, using a simple programming model (the MapReduce programming model). It is designed to scale up from a single computer to thousands of computers, where each computer offers local computation and storage.
2.2.1
Hadoop Components
Figure 2.2: Hadoop Components
The basic components of the Hadoop software are:
• For data storage: Hadoop Distributed File System (HDFS)
• For resource management (monitor memory, disk, cpu of a group of computers): YARN
(this is an acronym of Yet Another Resource Negotiator)
• For data processing: Hadoop MapReduce (Hadoop MR)
Other big data processing frameworks that can work together with Hadoop's basic components include Apache Spark, Pig, Hive, and others.
2.2.2
HDFS - Hadoop Distributed File System
The Hadoop Distributed File System is the file system of the Hadoop software that manages storage across the group of computers in a Hadoop cluster (Figure 2.2). HDFS supports all the basic operations (reading files, creating directories, moving files, deleting data, and listing directories) that a normal file system has.
Figure 2.3: A Basic Hadoop Cluster
A block (the minimum amount of data that can be read or written) in HDFS has a default size of 128 MB. Regarding blocks, HDFS has the property that "a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage" (for example, a 10 KB file stored with a block size of 128 MB uses 10 KB of disk space, not 128 MB). The block size of HDFS is large in order to reduce the cost of disk seeks, because disk seeks are generally expensive operations.
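The block arithmetic described above can be sketched as follows; 128 MB is the HDFS default block size, and the helper is illustrative rather than part of any Hadoop API.

```python
# Sketch: how a file of a given size maps onto HDFS blocks. A file
# smaller than one block still uses only its actual size on disk;
# only the *number* of blocks is rounded up.
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def hdfs_blocks(file_size: int) -> tuple[int, int]:
    """Return (number of blocks, bytes actually stored on disk)."""
    n_blocks = max(1, math.ceil(file_size / BLOCK_SIZE))
    return n_blocks, file_size  # storage used equals the file size

print(hdfs_blocks(10 * 1024))          # 10 KB file -> (1, 10240)
print(hdfs_blocks(300 * 1024 * 1024))  # 300 MB file -> (3, 314572800)
```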
When discussing HDFS, the master computer is referred to as the namenode, and each slave computer as a datanode.
The namenode manages the filesystem namespace. It maintains the filesystem tree and the
metadata for all the files and directories in the tree. This information is stored persistently on
the local disk in the form of two files: the namespace image and the edit log. The namenode
also knows the datanodes on which all the blocks for a given file are located; however, it does
not store block locations persistently, because this information is reconstructed from datanodes
when the system starts.
Datanodes are the workers of the filesystem. They store and retrieve blocks when they are
told to by the namenode, and they report back to the namenode periodically with lists of blocks
that they are storing.