VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
——————– * ———————
THESIS
DEVELOPING A HADOOP-BASED
DISTRIBUTED SYSTEM FOR
METAGENOMIC BINNING
Major: Computer Engineering
Committee: Computer Engineering 1
(English Program)
Supervisor: Assoc. Prof. Dr. Tran Van Hoai
Reviewer: Assoc. Prof. Dr. Thoai Nam
—o0o—
Student 1: Tran Duong Huy (1752242)
Student 2: Pham Nhat Phuong (1752042)
Student 3: Nguyen Huu Trung Nhan (1752392)
VIETNAM NATIONAL UNIVERSITY HCMC
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science & Engineering
DEPARTMENT: Computer Science
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
GRADUATION THESIS TASK SHEET
Note: students must attach this sheet to the first page of the thesis report.
FULL NAME: _______________________________________ STUDENT ID: ________________
FULL NAME: _______________________________________ STUDENT ID: ________________
FULL NAME: _______________________________________ STUDENT ID: ________________
MAJOR: ____________________________________ CLASS: _______________________
1. Thesis title:
Developing a Hadoop-based Distributed System for Metagenomic Binning
2. Tasks (requirements on content and initial data):
- Study metagenomic binning and the BiMeta algorithm.
- Study Hadoop and set up a demonstration system.
- Develop the BiMeta algorithm on Hadoop.
- Build a user interface so that users can run BiMeta on the Hadoop computing system.
- Evaluate the suitability of Hadoop for the metagenomic binning problem on datasets used by the community.
3. Date of assignment: …
4. Date of completion: …
5. Supervisors and their parts:
1) Assoc. Prof. Dr. Tran Van Hoai, supervising the whole thesis.
2) ____________________________________________________________________________
3) ____________________________________________________________________________
The content and requirements of the thesis have been approved by the Department.
Date ........ / ......... / ..........
HEAD OF DEPARTMENT
MAIN SUPERVISOR
(Sign and write full name)
(Sign and write full name)
FOR FACULTY / DEPARTMENT USE:
Preliminary grader: _________________________
Unit: _________________________________________
Defense date: ____________________________________
Final grade: ___________________________________
Thesis archive location: _______________________________
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date ..... / ..... / .....
THESIS DEFENSE EVALUATION SHEET
(For the supervisor / reviewer)
Student: Tran Duong Huy. ID: 1752242. Major: Computer Engineering.
Student: Pham Nhat Phuong. ID: 1752042. Major: Computer Engineering.
Student: Nguyen Huu Trung Nhan. ID: 1752392. Major: Computer Engineering.
Topic: Developing a Hadoop-based Distributed System for Metagenomic Binning
Supervisor / reviewer: Assoc. Prof. Dr. Tran Van Hoai
Overview of the report:
Number of pages: ..... Number of chapters: .....
Number of data tables: .....
Number of figures: .....
Number of references: 15
Computational software: .....
Products / artifacts: .....
Overview of drawings:
- Number of drawings (by size): .....
- Hand-drawn: ..... Computer-drawn: .....
Main strengths of the thesis:
- The students implemented a system with all required components: a Hadoop-based computing infrastructure; an algorithm developed on Hadoop in which all steps of BiMeta were converted to Hadoop and Spark; and a user interface for interacting with the system.
- The students grasped the basic knowledge of Hadoop and Spark needed to set up a demonstration system.
- The students grasped the basic knowledge needed to deploy a sequential algorithm onto a Hadoop-style distributed computing environment.
- The students experimented with datasets used in reputable publications to evaluate the capability of the system.
Main shortcomings of the thesis:
- Because of the students' limited background in biology, the conversion of the algorithm to the Hadoop and Spark programming models still has some limitations; in particular, the computation time does not yet demonstrate the capability of the distributed system.
- Because a large computing infrastructure was not easily accessible, the computational results are still limited.
- The students have only basic knowledge of metagenomic binning, so the user-interface functions of the system are not very creative.
Recommendation: Eligible for defense [x]. Additional work required before defense [ ]. Not eligible for defense [ ].
Questions the students must answer before the Committee:
a) If you continued development to increase computing capability, what specifically would you do?
Overall assessment (in words: very good / good / average): very good
Grade: .....
Signature (full name):
Tran Van Hoai
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Date ..... / ..... / .....
THESIS DEFENSE EVALUATION SHEET
(For the reviewer)
Student: Tran Duong Huy. ID: ..... Major: Computer Engineering.
Student: Pham Nhat Phuong. ID: ..... Major: Computer Engineering.
Student: Nguyen Huu Trung Nhan. ID: ..... Major: Computer Engineering.
Topic: Developing a Hadoop-based distributed system for metagenomic binning
Supervisor / reviewer: Assoc. Prof. Dr. Tran Van Hoai
Overview of the report:
Number of pages: ..... Number of chapters: ..... Number of data tables: ..... Number of figures: ..... Number of references: ..... Computational software: ..... Products / artifacts: .....
Overview of drawings:
- Number of drawings (by size): .....
- Hand-drawn: ..... Computer-drawn: .....
Main strengths of the thesis:
- The students mastered the computing techniques and technologies on Hadoop and Spark well.
- The metagenomic binning problem was also well understood by the students, who implemented a stepwise computational solution with corresponding algorithms on MapReduce and Spark.
- A computing environment on Hadoop and Spark was built and deployed to solve the metagenomic binning problem.
- The experiments follow defined scenarios and were carried out on datasets.
Main shortcomings of the thesis:
- Due to limits on time and the available environment, the evaluation of the solution to the metagenomic binning problem on a distributed environment using Hadoop and Spark is still simple, so the effectiveness of the solution has not been demonstrated, nor have the results been analyzed with a view to improving its performance.
Recommendation: Eligible for defense [x]. Additional work required before defense [ ]. Not eligible for defense [ ].
Questions the students must answer before the Committee: .....
Overall assessment (in words: very good / good / average): very good
Grade: .....
Signature:
Thoai Nam
Declaration
We declare that the topic "Developing a Hadoop-based distributed system for metagenomic binning" is our own thesis work, carried out with our own effort and time under the guidance of our supervisor, Assoc. Prof. Dr. Tran Van Hoai.
All of the research results were produced by ourselves and were not copied from any other source. If there is any evidence of plagiarism, we accept responsibility for all consequences.
Ho Chi Minh City, 2020,
Tran Duong Huy, Pham Nhat Phuong, Nguyen Huu Trung Nhan.
Acknowledgement
This thesis would not have been possible without the help and support of many others. First and foremost, we would like to express our sincere gratitude to our supervisor, Assoc. Prof. Dr. Tran Van Hoai. His insight and expert knowledge of the field improved our research work, and he helped us organize our thinking both in research and in writing this thesis.
We sincerely thank the teachers of the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology for their enthusiasm in imparting knowledge during our time at the university. The knowledge accumulated throughout our studies helped us complete this thesis.
In the end, we would like to wish the teachers and our supervisor good health and success in
their noble careers.
Abstract
Bioinformatics is an area in which biological data are vast, extensive, and complex. Biological data are constantly growing and often unlabeled, so supervised methods cannot be applied directly. One of the most pressing problems in this area is detecting new strains of a virus, or at least grouping them together (for example, determining the proximity of SARS-CoV-2 to bat viruses). One way to address this problem is to create scalable clustering tools that can handle very large amounts of data. Next-generation sequencing technologies such as Illumina and Roche 454 produce enormous volumes of sequence data every week, which demands efficient computing. Our goal in this thesis, inspired by previous research, is to create a Hadoop-based tool for metagenomic binning.
Contents

Declaration
Acknowledgement
Abstract
1 Introduction
  1.1 Overview of this thesis
  1.2 Scope and Objectives
  1.3 Thesis Outline
2 Background
  2.1 Metagenomic
    2.1.1 Background
    2.1.2 Basic Concepts
    2.1.3 Metagenomic Binning
  2.2 Hadoop
    2.2.1 Hadoop Components
    2.2.2 HDFS - Hadoop Distributed File System
    2.2.3 Hadoop MapReduce
    2.2.4 YARN
  2.3 Spark
    2.3.1 What is Spark?
    2.3.2 Introduction to Spark
    2.3.3 How do Sparks run on a cluster?
3 Related Work
  3.1 A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
    3.1.1 Background
    3.1.2 Method
    3.1.3 Fundamentals of proposed method
    3.1.4 Datasets
    3.1.5 Result
    3.1.6 Conclusion
  3.2 Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons
    3.2.1 Background
    3.2.2 Method
    3.2.3 Datasets and Hadoop cluster configuration
    3.2.4 Result
4 Methodology
  4.1 BiMetaReduce
    4.1.1 Step 1: Load Fasta file
    4.1.2 Step 2: Create Document
    4.1.3 Step 3: Create Corpus
    4.1.4 Step 4: Build Overlap Graph
    4.1.5 Step 5 and 6: Find Connected Component and Clustering
  4.2 System Design
    4.2.1 Overview
    4.2.2 Proposed System Architecture
    4.2.3 Devices and Components
    4.2.4 Web Application Design
5 Experimental and Evaluation
  5.1 Datasets
  5.2 Experiments
6 Conclusions
References
List of Figures

2.1 Reads and Sequence
2.2 Hadoop Components
2.3 A Basic Hadoop Cluster
2.4 Upload A File To HDFS
2.5 Read A File From HDFS
2.6 The first few lines of the file S8.fna
2.7 The pipeline of the phase Reading Fasta File
2.8 Map-Reduce Logical Data Flow for Reading Fasta File
2.9 Hadoop YARN - How YARN manages a running job
2.10 Spark's toolkit
2.11 The architecture of a Spark Application
2.12 A narrow dependency
2.13 A wide dependency
2.14 A cluster driver and worker (no Spark Application yet)
2.15 Spark's Cluster mode
2.16 Spark's Client mode
2.17 Requesting resources for a driver
2.18 Launching the Spark Application
2.19 Application execution
2.20 Shutting down the application
3.1 Binning process of BiMeta
3.2 The Libra workflow
3.3 Scalability testing for Libra
4.1 Workflow of the BiMetaReduce program
4.2 Overview of the proposed system
4.3 Python & Django frameworks
4.4 HTML, CSS & Javascript
4.5 Model View Template Model
4.6 Use case diagram of web application
4.7 Login and Register Sequence Diagram
4.8 Homepage, "About" page and "Your Project" page Sequence Diagram
4.9 "System" page Sequence Diagram
4.10 Homepage
4.11 Homepage 2
4.12 Homepage 3
4.13 "System" page
4.14 "Your Project" page UI
4.15 "About" page UI
5.1 Sequential and MapReduce Runtime
5.2 Runtime of different settings
5.3 Runtime of setting 3 and 4
5.4 Runtime of setting 7 and 8
5.5 Fruchterman-Reingold graph of the R4 file
5.6 Kamada-Kawai graph of the R4 file
5.7 Circular graph of the R4 file
Chapter 1
Introduction
1.1
Overview of this thesis
Bioinformatics research is considered an area that includes large, expanding, and complex biological data sets. The biological data are increasingly large and often unlabeled, making it impossible to use supervised learning methods. A challenging problem in this domain is detecting a new strain of a virus, or at least clustering it into a group of species, which is a very urgent need (such as identifying the proximity of SARS-CoV-2 to a group of bat viruses). An alternative approach to the problem is to build a scalable clustering tool that can work with very large datasets.
Next-generation sequencing platforms such as Illumina and Roche 454 produce enormous volumes of sequence data every week, demanding efficient computation. Inspired by previous studies, our goal in this thesis is to create a Hadoop-based tool for metagenomic binning.
1.2
Scope and Objectives
This work aims to develop a scalable clustering tool for working with large datasets. The objectives of this thesis are:
• Learn the fundamentals of genes and genomes.
• Research genomic technologies.
• Research the Hadoop distributed processing framework for data processing and storage in big data applications, and related distributed computing environments.
• Learn about metagenomic clustering problems and algorithms capable of distributed computing.
• Develop distributed clustering tools and run experiments on real microbiological datasets from NCBI's database.
1.3
Thesis Outline
This thesis contains six main parts. Chapter 1 is the introduction of the project. Chapter 2 presents the background of the research field. Chapter 3 reviews related work that proposes different methods to solve the problem. Chapter 4 presents the proposed methods and discussion. Chapter 5 covers our experimental results and evaluation. The last chapter is the conclusion of our work.
Chapter 2
Background
2.1
2.1.1
Metagenomic
Background
Microbes run the world, even though we usually do not see them. They are essential to every area of human life, in fact to all life on Earth. Almost all knowledge about microbes is based on laboratory work performed under unusual and unnatural conditions: optimal growth in artificial environments, in pure cultures, without ecological context [8]. Metagenomic science, which is only a few years old, makes it possible to study microbes in their natural environment, the complex communities in which they usually live. Understanding microbial communities brings changes to many fields, such as biology, medicine, ecology, and biotechnology.
This leads to the question: what is metagenomics? The word is made up of two parts: meta and genomics. Genomics refers to collections of DNA sequences, and meta indicates that we obtain them from many organisms together. Metagenomics is a molecular tool used to analyze DNA from environmental samples in order to study the existing microbial communities without needing pure cultures [4].
Sanger sequencing, also known as the "chain termination method", is a well-known method for determining the nucleotide sequence of DNA. Many years after the development of Sanger sequencing, massively parallel sequencing technology, known as next-generation sequencing (NGS), revolutionized the biological sciences. Next-generation sequencing technologies generate millions of reads faster and at lower cost. Recently, many projects have used Roche 454, the Illumina Genome Analyzer, and AB SOLiD [11].
The analysis of microbial communities is usually carried out by a process referred to as binning, in which reads from related organisms are grouped together. Binning methods can be broadly classified into two categories: supervised and unsupervised methods. Among the most important supervised methods we can recall Mega [6], Kraken [14], CLARK [9], and MetaPhlAn [10]. The other category, unsupervised methods, includes MetaProb [5], BiMeta [12], MetaCluster [13], and AbundanceBin [15].
Because of the lack of labeled datasets, unsupervised methods are a better approach to the binning problem. Among the existing studies, we found that BiMeta and MetaProb achieve significant results in metagenomic clustering. In this thesis, we develop a Hadoop-based tool for metagenomic binning, using BiMeta as the base clustering model.
2.1.2
Basic Concepts
1. Definition Of Bioinformatics
The term “bioinformatics” was coined by Paulien Hogeweg and Ben Hesper in 1978.
Bioinformatics is fundamentally informatics as applied to biology or computer-aided
analysis of biological data [2].
2. Goals Of Bioinformatics Analysis
The main goal of bioinformatics is to be able to predict biological processes in health and disease. To do so, an understanding of the biological processes involved is necessary. Thus, a further goal of bioinformatics is to develop such an understanding through analysis and integration of the information obtained on genes and proteins. Furthermore, bioinformatics develops new tools and continuously improves the existing set of tools for diverse types of analyses [2].
3. DNA & Nucleobases
DNA is a double-stranded right-handed helix; the two strands are complementary because
of complementary base pairing, and antiparallel because the two strands have opposite
5’-3’ orientation.
DNA is composed of structural units called nucleotides (deoxyribonucleotides). Each nucleotide is composed of a pentose sugar (2′-deoxy-D-ribose); one of the four nitrogenous bases: adenine (A), thymine (T), guanine (G), or cytosine (C); and a phosphate. The pentose sugar has five carbon atoms, numbered 1′ (1-prime) through 5′ (5-prime). The base is attached to the 1′ carbon atom of the sugar, and the phosphate is attached to the 5′ carbon atom.
In double-stranded DNA, A pairs with T by two hydrogen bonds and G pairs with C by three hydrogen bonds; thus GC-rich regions of DNA have more hydrogen bonds and consequently are more resistant to thermal denaturation [2]. Other nucleotide symbols in common use include R (which denotes G or A) and its complement Y (which denotes C or T).
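The pairing rules above can be illustrated with a short sketch. This is a hypothetical helper, not part of the thesis software; it computes the reverse complement of a strand (the complementary strand read in the 5′-3′ direction), including the ambiguity codes R and Y, and the GC fraction that governs resistance to thermal denaturation.

```python
# Sketch: reverse complement of a DNA strand, based on the pairing
# rules above (A-T, G-C) plus the ambiguity codes R (G or A) and
# Y (C or T). Illustrative helper, not part of the thesis system.

COMPLEMENT = str.maketrans("ATGCRY", "TACGYR")

def reverse_complement(strand: str) -> str:
    """Complement each base, then reverse to get 5'-3' orientation."""
    return strand.translate(COMPLEMENT)[::-1]

def gc_content(strand: str) -> float:
    """Fraction of G/C bases; GC-rich DNA resists thermal denaturation."""
    return (strand.count("G") + strand.count("C")) / len(strand)

print(reverse_complement("ATGCR"))  # -> YGCAT
print(gc_content("ATGC"))           # -> 0.5
```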
4. Sequencing & Read
Genome sequencing is the most direct method of detecting mutations, such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). The development of the dideoxy method of DNA sequencing was a major step forward for molecular biology; it was published by Sanger and colleagues in 1977. About 20 years after the development of Sanger's dideoxy sequencing, Pal Nyren introduced the pyrosequencing technique, which paved the way for the development and commercialization of large-scale, high-throughput, massively parallel sequencing technology, popularly referred to as next-generation sequencing (NGS) technology [2].
Most recent projects have made use of next-generation sequencing technologies such as 454 pyrosequencing, the Illumina Genome Analyzer, and AB SOLiD. These technologies can generate millions of reads much faster and at lower cost. However, the length of the sequences they produce varies greatly: Illumina read lengths range from 50 to 300 bp, while the Roche 454 system can generate reads up to 700 bp. As a result, both long-read and short-read research tools are needed for metagenomic projects [12].
4
In next-generation sequencing, a read refers to the DNA sequence obtained from one fragment (a small section of DNA). Prior to sequencing, most next-generation sequencing methods fragment the genome, and each sequenced fragment produces a read. The length of the reads and the number of reads generated are determined by the fragment size and the technology used. Since DNA fragments usually overlap, the reads can be pieced together to recreate the genome. Next-generation sequencing techniques that do not fragment the genome are known as long-read sequencing because they yield very long reads.
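As a toy illustration of the fragmentation described above, the sketch below cuts a genome string into overlapping fixed-length reads; because consecutive reads share sequence, they could later be pieced back together. The read length and step are made-up parameters, not values from any particular sequencing platform.

```python
# Sketch: simulate shotgun-style reads by sliding a fixed-length
# window over a genome string. `read_len` and `step` are illustrative
# parameters only.

def make_reads(genome: str, read_len: int, step: int) -> list[str]:
    """Cut `genome` into overlapping reads of length `read_len`,
    starting a new read every `step` bases."""
    reads = []
    for start in range(0, len(genome) - read_len + 1, step):
        reads.append(genome[start:start + read_len])
    return reads

genome = "ATGCGTACGTTAGC"
print(make_reads(genome, read_len=6, step=3))
# -> ['ATGCGT', 'CGTACG', 'ACGTTA']  (each read overlaps the next by 3)
```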
Figure 2.1: Reads and Sequence
2.1.3
Metagenomic Binning
1. Binning Method & Clustering Taxonomy
The analysis of microbial communities is usually carried out by a process referred to as binning, which groups reads into clusters that represent genomes from closely related organisms. Binning methods can be roughly classified into three main categories: supervised, semi-supervised, and unsupervised methods. Among these, unsupervised methods base classification on features derived from the reads themselves, which is particularly useful when reference database availability is limited. There are many clustering algorithms; their processes can be broadly classified into five categories: partitioning-based, hierarchical-based, density-based, grid-based, and model-based. Clustering can be supervised or unsupervised.
In supervised clustering, the expression pattern of the gene(s) is known, and this knowledge is used to group genes into clusters. The most widely used method of unsupervised clustering is hierarchical clustering [2]. However, because this thesis applies the BiMeta binning algorithm, the unsupervised clustering method used is k-means, which belongs to the partitioning-based category.
2. K-means Clustering
K-means is a centroid-based (distance-based) algorithm that computes distances to allocate each point to a cluster. Each cluster in K-means is associated with a centroid. There are five main steps in the K-means algorithm:
• Step 1: Choose the number of clusters k
• Step 2: Select k random points from the data as centroids
• Step 3: Assign all the points to the closest cluster centroid
• Step 4: Recompute the centroids of the newly formed clusters
• Step 5: Repeat steps 3 and 4
There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:
• The centroids of newly formed clusters do not change
• Points remain in the same cluster
• The maximum number of iterations is reached
The K-means clustering algorithm in detail is given as Algorithm 1.
Algorithm 1 k-means clustering
Input: dataset, number of clusters k
Output: centroids of the k clusters, labels for the training data
1: Obtain a sample by constructing the coreset of the input dataset.
2: For a given cluster assignment C, the total cluster variance is minimized with respect to {m_1, ..., m_K}, yielding the means of the currently assigned clusters.
3: Given a current set of means {m_1, ..., m_K}, the total cluster variance is minimized by assigning each observation to the closest (current) cluster mean. That is,
   C(i) = argmin_{1 ≤ k ≤ K} ||x_i − m_k||²
4: Steps 2 and 3 are iterated until the assignments do not change.
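The assignment/update loop of Algorithm 1 can be sketched as follows. This is an illustrative implementation of the generic k-means iteration (the coreset sampling of step 1 is omitted), not the code used in the thesis system.

```python
# Sketch of the k-means loop (steps 2-4 of Algorithm 1). Illustrative only.
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct starting centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:               # stop when assignments settle
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, labels

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, labels = kmeans(pts, k=2)
print(labels)  # the two nearby pairs end up in the same cluster
```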
3. Metrics in the binning problem
Three commonly used performance metrics, namely precision, recall, and F-measure, are used to evaluate a binning algorithm. Let m be the number of species in a metagenomic dataset, and k the number of clusters returned by the binning algorithm. Let A_ij be the number of reads from species j assigned to cluster i. A cluster i is said to represent a species j_0 when A_{ij_0} = max_j A_ij. The precision and recall are defined as follows:
precision = (Σ_{i=1}^{k} max_j A_ij) / (Σ_{i=1}^{k} Σ_{j=1}^{m} A_ij)
recall = (Σ_{j=1}^{m} max_i A_ij) / (Σ_{i=1}^{k} Σ_{j=1}^{m} A_ij + #unassigned reads)
Precision shows the ratio of reads assigned to a cluster that belong to the same species, while recall presents the ratio of reads from the same species that are assigned to the same cluster.
If the number of clusters k is much larger than the number of species m, the majority of reads in each cluster probably belongs to a single genome, and thus precision would be high; however, recall would be low, as some genomes are represented by multiple clusters. If k is much smaller than m, some clusters would contain reads from multiple genomes, and precision would be low. Thus, precision increases while recall decreases with the number of predicted clusters. Besides precision and recall, we also use the F-measure, which emphasizes both:
F-measure = 2 / (1/precision + 1/recall)
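Given the assignment matrix A (clusters by species), the formulas above translate directly into code. The sketch below is illustrative; the example matrix and unassigned-read count are made up.

```python
# Sketch: precision, recall, and F-measure from the assignment matrix
# A, where A[i][j] = number of reads of species j assigned to cluster i.
# The example matrix below is made up for illustration.

def binning_metrics(A, unassigned=0):
    total = sum(sum(row) for row in A)
    # precision: best-represented species per cluster, over all assigned reads
    precision = sum(max(row) for row in A) / total
    # recall: best cluster per species, over all reads incl. unassigned ones
    m = len(A[0])
    recall = sum(max(A[i][j] for i in range(len(A)))
                 for j in range(m)) / (total + unassigned)
    # harmonic mean of precision and recall
    f_measure = 2 / (1 / precision + 1 / recall)
    return precision, recall, f_measure

# Two clusters, two species: each cluster is dominated by one species.
A = [[90, 10],
     [5, 95]]
p, r, f = binning_metrics(A, unassigned=0)
print(p, r)  # -> 0.925 0.925
```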
2.2
Hadoop
In this section, we cover the fundamental components of Hadoop. Hadoop is open-source software (an open-source big data framework) that allows users to store and process big data in a distributed environment, across a group of many computers, using a simple programming model (the MapReduce programming model). It is designed to scale up from a single computer to thousands of computers, where each computer offers local computation and storage.
2.2.1
Hadoop Components
Figure 2.2: Hadoop Components
The basic components of the Hadoop software are:
• For data storage: Hadoop Distributed File System (HDFS)
• For resource management (monitor memory, disk, cpu of a group of computers): YARN
(this is an acronym of Yet Another Resource Negotiator)
• For data processing: Hadoop MapReduce (Hadoop MR)
Other big data processing frameworks that can work together with Hadoop's basic components include Apache Spark, Pig, Hive, and others.
2.2.2
HDFS - Hadoop Distributed File System
The Hadoop Distributed File System is the file system of the Hadoop software that manages storage across the group of computers in a Hadoop cluster (Figure 2.2). HDFS supports all the basic operations (reading files, creating directories, moving files, deleting data, and listing directories) that a normal file system has.
Figure 2.3: A Basic Hadoop Cluster
A block (the minimum amount of data that can be read or written) in HDFS has a default size of 128 MB. Regarding blocks, HDFS has the property that "a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage" (for example, a 10 KB file stored with a block size of 128 MB uses 10 KB of disk space, not 128 MB). The block size of HDFS is large in order to reduce the cost of disk seeks, because disk seeks are generally expensive operations.
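The block arithmetic described above can be sketched as follows; 128 MB is the HDFS default block size, and the helper is illustrative rather than part of any Hadoop API.

```python
# Sketch: how a file of a given size maps onto HDFS blocks. A file
# smaller than one block still uses only its actual size on disk;
# only the *number* of blocks is rounded up.
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def hdfs_blocks(file_size: int) -> tuple[int, int]:
    """Return (number of blocks, bytes actually stored on disk)."""
    n_blocks = max(1, math.ceil(file_size / BLOCK_SIZE))
    return n_blocks, file_size  # storage used equals the file size

print(hdfs_blocks(10 * 1024))          # 10 KB file -> (1, 10240)
print(hdfs_blocks(300 * 1024 * 1024))  # 300 MB file -> (3, 314572800)
```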
When discussing HDFS, the master computer is referred to as the namenode, and each slave computer as a datanode.
The namenode manages the filesystem namespace. It maintains the filesystem tree and the
metadata for all the files and directories in the tree. This information is stored persistently on
the local disk in the form of two files: the namespace image and the edit log. The namenode
also knows the datanodes on which all the blocks for a given file are located; however, it does
not store block locations persistently, because this information is reconstructed from datanodes
when the system starts.
Datanodes are the workers of the filesystem. They store and retrieve blocks when they are
told to by the namenode, and they report back to the namenode periodically with lists of blocks
that they are storing.