VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
!
NGUYEN DUC CANH
AMINO ACID SUBSTITUTION MODEL
FOR ROTAVIRUS
MASTER THESIS
Major: Computer Science
HA NOI - 2019
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Nguyen Duc Canh
AMINO ACID SUBSTITUTION MODEL
FOR ROTAVIRUS
MASTER THESIS
Major: Computer Science
Supervisor:
Assoc. Prof. Le Sy Vinh
HA NOI - 2019
Abstract
Modeling protein evolution has been a major field of research in bioinformatics
for decades. One popular method to approximate the evolution of proteins is to
use an amino-acid substitution model which can reveal the instantaneous rate that
an amino acid is changed into another amino acid. Such kind of model is useful
in many different ways and has become a main component in a variety of bioinformatic systems. Many models such as JTT, WAG and LG have been estimated
using data from various species. Recent research showed that these models might
be inappropriate for analysis of some specific species. Meanwhile, the world has
witnessed a series of emerging epidemics caused by viruses, notably rotavirus - a
contagious virus that can cause gastroenteritis. These epidemics raise a need for
modeling the evolution of these emerging viruses. In this thesis, using the data
from the Viral Genome Resource at National Center for Biotechnology Information (NCBI), we propose the ROTA model that has been specifically estimated for
modeling the evolution of rotavirus. Analysis revealed significant differences between ROTA and existing models in amino acid frequencies, exchangeability coefficients as well as inferred phylogenies. Experiments showed that ROTA better
characterizes the evolutionary patterns of rotavirus than other models and should
be useful in most systems that requires an accurate description of rotavirus evolution.
iii
Acknowledgements
I would like to express my sincere gratitude to my advisor Assoc. Prof. Le Sy Vinh
for the continuous support of my study and research, for his patience, motivation,
enthusiasm, and immense knowledge. His guidance helped me in all the time
of research and writing of this thesis. I could not have imagined having a better
advisor and mentor for my Master study.
Besides my advisor, I would like to thank Dr. Dang Cao Cuong and MSc. Le
Kim Thu for giving devoted explanations to my questions and guiding me to solve
various problems that I had to face.
I also thank my friends: Can Duy Cat, Nguyen Minh Trang, Le Hai Nam for
their continuous motivations without which I would never be able to complete this
thesis.
My sincere thanks also goes to Information Faculty of University of Engineering and Technology, Vietnam National University, Hanoi for providing me the
necessary facilities to conduct experiments.
Last but not the least, I would like to thank my parents for giving birth to me at
the first place and supporting me spiritually throughout my life.
iv
Declaration
I hereby declare that this thesis was entirely my own work and that any additional
sources of information have been properly cited.
I certify that, to the best of my knowledge, my thesis does not infringe upon
anyone’s copyright nor violate any proprietary rights and that any ideas, techniques or any other material from the work of other people included in my thesis,
published or otherwise, are fully acknowledged in accordance with the standard
referencing practices.
I declare that this thesis has not been submitted for a higher degree to any other
University or Institution.
v
Table of Contents
Abstract
iii
Acknowledgements
iv
Declaration
v
Table of Contents
vii
Acronyms
viii
List of Figures
ix
List of Tables
x
Introduction
1
1
Background
5
1.1
Sequence evolution . . . . . . . . . . . . . . . . . . . . . . . .
5
1.1.1
Heredity materials . . . . . . . . . . . . . . . . . . . . .
5
1.1.2
Evolution and homologous sequences . . . . . . . . . . .
6
Modeling sequence evolution . . . . . . . . . . . . . . . . . . .
9
1.2.1
Sequence alignment . . . . . . . . . . . . . . . . . . . .
9
1.2.2
General time-reversible substitution model . . . . . . . .
11
1.2.3
Model of rate heterogeneity . . . . . . . . . . . . . . . .
15
1.2.4
Available amino acid substitution models . . . . . . . . .
16
Phylogenetic trees . . . . . . . . . . . . . . . . . . . . . . . . .
17
1.3.1
17
1.2
1.3
Overview . . . . . . . . . . . . . . . . . . . . . . . . .
vi
2
3
1.3.2
Phylogenetic tree reconstruction . . . . . . . . . . . . .
18
1.3.3
Robinson-Foulds distance . . . . . . . . . . . . . . . . .
20
1.3.4
Phylogenetic hypothesis testing . . . . . . . . . . . . . .
20
Method
24
2.1
Modeling method . . . . . . . . . . . . . . . . . . . . . . . . .
24
2.2
Model estimation process . . . . . . . . . . . . . . . . . . . . .
26
Results and discussion
28
3.1
Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.2
Model analysis . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.3
Performance on testing alignments . . . . . . . . . . . . . . . .
31
3.4
Tree topology analysis . . . . . . . . . . . . . . . . . . . . . . .
32
3.4.1
Robinson-Foulds distance . . . . . . . . . . . . . . . . .
32
3.4.2
Shimodaira-Hasegawa test . . . . . . . . . . . . . . . .
33
Protein-specific models . . . . . . . . . . . . . . . . . . . . . .
35
3.5
Conclusions
39
References
44
A ROTA model
45
vii
Acronyms
GTR
General time-reversible
ME
Minimum evolution
ML
Maximum likelihood
MP
Maximum parsimony
MSA
Multiple sequence alignment
NCBI National Center for Biotechnology
Information
RF
Robinson-Foulds
viii
List of Figures
1.1
A sample phylogenetic tree of 5 sequences. . . . . . . . . . . . .
18
1.2
Two trees T1 and T2 describe the same set of 5 sequences {1, 2, 3, 4, 5}
but have different topologies. Tree T1 has two bipartitions {1, 2}|{3, 4, 5}
and {1, 2, 3}|{4, 5} while tree T2 has two bipartitions {1, 3}|{2, 4, 5}
and {1, 2, 3}|{4, 5}. Bipartition {1, 2}|{3, 4, 5} is present only in
T1 whereas bipartition {1, 3}|{2, 4, 5} is present only in T2 . Bipartition {1, 2, 3}|{4, 5} occurs in both T1 and T2 . The standardized Robinson and Foulds distance between T1 and T2 is 2/4 . . .
3.1
21
The exchangeability coefficients in ROTA, FLU and JTT models.
The black (gray or white) bubble at the intersection of row X and
column Y presents the exchange rate between amino acid X and
amino acid Y in ROTA (FLU or JTT). . . . . . . . . . . . . . . .
3.2
32
The relative differences between exchangeability coefficients in
ROTA and other two models. The size of the bubble corresponds
to the value (ROT AXY − MXY )/(ROT AXY + MXY ) where X,
Y is one of 20 amino acids and M is FLU for the subfigure a) or
JTT for the subfigure b). The black bubble with value 2/3 (1/3)
indicates the coefficient in ROTA is 5 (2) times larger than the
corresponding one in M whereas the white bubble with value 2/3
(1/3) indicates the coefficient in ROTA is 5 (2) times smaller than
3.3
the corresponding one in M. . . . . . . . . . . . . . . . . . . . .
33
Amino acid frequencies of ROTA, FLU and JTT models . . . . .
34
ix
List of Tables
1.1
Twenty differenct amino acids. . . . . . . . . . . . . . . . . . .
3.1
Number of sequences grouped by protein types in training and
testing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2
34
Comparisons of ROTA and 10 other models in constructing maximum likelihood trees . . . . . . . . . . . . . . . . . . . . . . .
3.5
30
Comparisons of ROTA and 10 other models in constructing maximum likelihood trees . . . . . . . . . . . . . . . . . . . . . . .
3.4
29
The Pearson’s correlations between ROTA and 10 widely used
models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3
7
35
The normalized Robinson-Foulds distance between trees inferred
using ROTA versus FLU and JTT models for 12 testing multiple
alignments. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.6
36
The number of test alignments that trees inferred from existing
models are significantly worse than those from ROTA for the 11
testing alignments that ROTA is the best-fit model. . . . . . . . .
3.7
37
Comparison of log-likelihood per site between ROTA and proteinspecific model. For each alignment of protein P, ROTAP is the
model estimated using only sequences of protein P from training
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
A.1 Amino acid exchangeability matrix (the first 20 rows) and frequency vector (the last row) of ROTA . . . . . . . . . . . . . . .
x
46
Introduction
Motivation
A major field of research in bioinformatics is to understand the evolutionary relationship among species. In the recent decades, protein data have been accumulated in a large scale and are used in many methods to model the evolutionary
process. One popular method is to use protein data to estimate amino acid substitution models which can reveal the instantaneous substitution rates that amino
acids change into the other amino acids. This kind of model has become crucial
for a wide range of bioinformatics sytems thanks to its various applications.
The first application of amino acid substitution model is to help create protein
sequence alignments [1]. Methods used to align sequences often involve the usage
of a score matrix which represents the penalty when an mutation occurs. Typically
for nucleotide models, a score matrix can be simply +1 for a matched site, -1 for
a mismatched site and -2 for an indel of two nucleotide sequences. Such a simple scoring scheme, however, is not suitable for comparing protein sequences.
Amino acids, the building blocks that make up protein sequences, have biochemical properties that influence their exchangeability in the process of evolution. For
instance, amino acids of similar molecular sizes get substituted more often than
those of widely different molecular sizes. Other properties such as the tendency
to bind with water molecules also influence the probability of substitution. Therefore, it is important to use a scoring system that reflects these properties which
can be derived from amino acid substitution models.
Amino acid substitution models are also frequently used to infer protein phylogenetic trees using maximum likelihood approach [2]. In this method, given a
phylogeny of protein sequences with branch lengths, an evolutionary model that
1
allows to compute the probability of this tree is needed to select the maximum
likelihood tree. A substitution model allows us to compute the transition probabilities Pij (t), the probability that state j will exist at the end of a branch of length
t, if the state at the start of the branch is i, thus makes it possible to calculate the
likelihood of data given a phylogenetic tree.
Moreover, these models are used to estimate pairwise genetic distances between
protein sequences that subsequently serve as inputs for distance-based phylogenetic analysis [3]. Distance methods’ objective is to fit a tree to a matrix of pairwise genetic distances. For every two sequences, the observed distance is a single
value based on the number of different positions between the two sequences. The
observed distance is an underestimation of the true genetic distance because some
of the positions may have experienced multiple substitution events. In distancebased methods, a better estimate of the number of substitutions that have actually
occurred is often done by applying a substitution model.
Many methods have been proposed to estimate general amino acid substitution models from large and diverse databases [2] [4]. The common methods are
counting or maximum likelihood approaches. Estimating amino acid substitution
models is much more challenging than estimating nucleotide substitution models
due to a large number of parameters to be optimized. For example, the general
time reversible model for nucleotides contains 8 parameters in comparison to 208
parameters for models of amino acid substitutions. Thus, amino acid substitution
models are typically estimated from large datasets. The first model is Dayhoff
model [5] using highly similar protein sequences. As more protein sequences
became available, JTT model was estimated [6] which used the same counting
method. However, the counting methods are limited to only closely related protein sequences.
The maximum likelihood method was proposed later [7] to estimate the mtREV
model from 20 complete vertebrate mtDNA-encoded protein sequences. WAG
model [8] was estimated using a maximum likelihood method from 182 globular protein families. The WAG model produced better likelihood trees than the
Dayhoff and JTT models for a large number of globular protein families. The
maximum likelihood method is improved by incorporating the variability of evolutionary rates across sites into the estimation process to create LG model [4] from
2
the Pfam database. Experiments showed that the LG model gave better results than
other models both in terms of constructing maximum likelihood trees.
Although a number of general models have been estimated from protein sequences of various species, they might be inappropriate for a particular set of
species due to the differences in the evolutionary processes of these species. A
number of specific amino acid substitution models for important species have been
introduced such as rtREV model [9] for retrovirus and HIVb, HIVw models [10]
for HIV related analysis. Recently, FLU model is estimated for Influenza virus
which was shown to be highly different from widely used existing models and
more suitable for Influenza protein analysis [11].
Meanwhile, the world has witnessed a series of emerging epidemics caused by
viruses such as Ebolavirus, MERS-CoV - Middle East respiratory syndrome coronavirus, or Rotavirus - a contagious virus that can cause gastroenteritis. These epidemics raise a need to modeling the evolution of these viruses. The Viral Genome
Resource [12] at National Center for Biotechnology Information (NCBI) acting
as a storage for protein sequences of various emerging viruses provides necessary
data to model viral protein evolution.
Problem statement
Existing amino acid substitution models do not have the ability to accurately characterize the evolution of rotavirus. An amino acid substitution model for rotavirus
is necessary to better understand its viral evolution and medical impacts. Many
protein analysis systems can utilize an rotavirus-specifc amino acid substitution
model to analyse rotavirus protein sequences and find out the patterns in rotavirus
viral evolution and infection process.
In this thesis, the data from the Viral Genome Resource at NCBI is used to
estimate such an amino acid substitution model for rotavirus, an emerging virus
that is the leading cause of the Diarrhea in young children. A maximum likelihood
approach which allows the variability of evolutionary rates across sites are used
as the method of model estimation.
The main contribution of this work is to provide a rotavirus-specific substitution model which would benefit subsequent research on understanding the evo3
lutionary patterns of this type of virus. A detailed investigation of the model is
conducted to understand the difference of this model from existing ones and to
evaluate its performance on other dataset.
Outline
This thesis is divided into 3 main chapters. The first chapter provides background
knowledge needed to understand the estimation process in the next chapter. Chapter 2 describes how the evolution of rotavirus can be modeled as well as which
steps the model estimation process follows. Chapter 3 describes the data preparation process, then investigates the rotavirus-specific model obtained from the
process described in Chapter 2 (called ROTA for short) to highlight its inner characteristics and compare its performance on a testing dataset with other existing
models. Some discussions on the results are also given in Chapter 3.
4
Chapter 1
Background
This chapter explains necessary terminologies and concepts to understand the estimation process of an amino acid substitution model. Firstly, section 1.1 describes the evolution of sequences in which we explains heredity materials that
pass down from generations to generations and homologous sequences that are the
main subjects of investigation to understand evolutionary process. Subsequently,
section 1.2 explains popular methods to model sequence evolution as well as revises existing amino acid substitution models which are commonly used in practice. The final section discusses phylogenetic trees in detail which are used as the
visual representation of evolutionary process and then addresses the methods that
we use later to compare these phylogenetic trees.
1.1
Sequence evolution
1.1.1
Heredity materials
In molecular biology, nucleic acids are known to be the heredity materials which
are replicated and pass down from parents to children. There are two types of nucleic acids: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). For living
organisms, DNA is the molecule that carries genetic information. Many viruses
use RNA as their genetic material instead.
DNA and RNA are composed of smaller subunits called nucleotides [13]. Nucleotides are linked together through chemical bonds in a linear fashion and form
sequences of DNAs and RNAs. There are 4 types of nucleotides in DNA which
5
are named after 4 different nitrogenous bases making up these nucleotides: A, C,
G and T (short for adenine, cytosine, guanine and thymine, respectively). In terms
of RNA, base T is substituted by base U (uracil).
DNA (or RNA in some viruses) instructs cells to produce one of the most important types of macro-molecules: protein. Proteins play a critical role in the
continuity of life since they form the structure of organisms, provide a lot of funtionalities as well as control the regulation of the body. Proteins are sequences of
amino acids. There are 20 different amino acids found in nature (see Table 1.1),
each of which has different chemical properties. One change of amino acids in a
protein sequence could either have a subtle effect, change the protein function or
entirely cause loss of function.
Only some regions in DNA and RNA contain the instructions to encode proteins. Those coding regions are called genes. Within a gene, three consecutive
nucleotides, often called a codon, encode one amino acid. The process of creating
a protein based on a gene is called translation. Different codons can encode the
same amino acid, which explains why although there are 64 different combinations of 3 nucleotides, only 20 types of amino acids are present. To be accurate,
61 codons encode for amino acids while the other three are stop-codons. As their
name suggests, these stop-codons instruct cells to stop the translation process.
1.1.2
Evolution and homologous sequences
The well-known theory about evolution is proposed by Darwin [14] which suggests that species have evolved from a common ancestor. At molecular level,
living organisms share:
• The same genetic material (nucleic acids)
• The same table of encodings (universal genetic codes)
• The same molecular building blocks (such as nucleotides and amino acids).
These shared features highly suggest that living things today are descendants
of a common ancestor, or at least highly similar ancestors. Heredity materials,
genetic codes and the process of translation are passed down from generations to
6
Table 1.1: Twenty differenct amino acids.
Name
Three-letter code One-letter code
Alanine
Ala
A
Cysteine
Cys
C
Aspartic Acid
Asp
D
Glutamic Acid
Glu
E
Phenylalanine
Phe
F
Glycine
Gly
G
Histidine
His
H
Isoleucine
Ile
I
Lysine
Lys
K
Leucine
Leu
L
Methionine
Met
M
Asparagine
Asn
N
Proline
Pro
P
Glutamine
Gln
Q
Arginine
Arg
R
Serine
Ser
S
Threonine
Thr
T
Valine
Val
V
Tryptophan
Trp
W
Tyrosine
Tyr
Y
generations, which results in present-day organisms having many similar features
at molecular level.
Children, however, often inherit genes from their parents with modifications
(also called mutations) due to the error-prone replication process. The mutations
can be harmful or beneficial. When mutations occur, the mutated organisms may
either die, develop new traits or even evolve into new species. Millions of years
of replication and mutation result in a diversity of species as we see at the present
days.
7
Mutations can be in small-scale or large-scale. In small-scale mutations, genes
are affected by changes in one or a small number of nucleotides. If only one single
nucleotide is mutated, it is called a point mutation. There are three type of point
mutations:
• Substitutions are changes from one nucleotide to the others. A substitution
may have no effect if it does not change the corresponding gene product (synonymous substitution) or may alter the amino acid in the sequence of protein
(nonsynonymous substitution). In theory, every type of nucleotides can be
substituted by any other type at any position in the sequence. Lethal substitutions, however, do not replicate and are hardly found in nature.
• Insertions add extra nucleotides into the sequence. This may shift the reading
frame of codons or change the splice sites of a gene (the positions genes are
chopped up and then reassemble before translation occurs), both of which
greatly affect encoded proteins.
• Deletions remove nucleotides from the original sequence. As insertions, deletions may change reading frame or splice sites and alter the gene products.
Large-scale mutations are more complicated and affect one or many genes.
Some types of large-scale mutations include: duplications which create many
copies of the same sequence, deletions of large region on a gene or deletions of
many genes and translocations which exchange regions between different DNA
molecules.
Sequences of nucleotides or amino acids are said to be homologous if they
derive from an ancestral sequence. Homologous sequences are not necessarily the
same as their ancestor due to mutations. Investigating homologous sequences is
an important task as understanding the differences among homologous sequences
can reveal the evolutionary relationships among species or individuals of a species.
Identifying homology, however, is a hard task because of various mutations in the
replication process.
In the following sections, we use the term character to denote a nucleotide
or an amino acid. Homologous characters are nucleotides or amino acids which
derive from a common nucleotide or amino acid.
8
1.2
Modeling sequence evolution
1.2.1
Sequence alignment
Homologous sequences may not have the same length because of different number
of insertions and deletions in these sequences. Before the evolutionary relationships can be analysed, homologous sequences need to be aligned first. Aligning is
the process of inserting character ’-’ to the input character sequences so the these
sequences have an equal length.
There may be many different ways to align two sequences. For example, with
two sequences TGAATAGCC and TGAAAGTCC, these are acceptable alignments:
1
2
3
4
5
6
7
8
9
Sequence 1 T G A A T A G C C
Sequence 2 T G A A A G T C C
1
2
3
4
5
6
7
8 9 10
Sequence 1 T G A A T A G - C
C
Sequence 2 T G A A - A G T C C
With the absence of ancestral sequence, the length of the real alignment and
the types of occurred mutations are unknown. In the above example, the first
alignment has a length of 9 nucleotides and there are 3 substitutions at position 5,
6 and 7. In contrast, the second alignment has a length of 10 nucletides and there
is one deletion (or one insertion of T) at position 5 and one insertion of T (or one
deletion) at position 8. Since it is impossible to distinguish whether insertions or
deletions have actually occured, they are often referred to as indels.
The construction of pairwise alignment can be done in O(pq) time using NeedlemanWunsch algorithm [15] where p and q are the lengths of the two sequences. The
Needleman-Wunsch algorithm is a dynamic programmign algorithm which tries
to assign a score to each alignment and find the alignments with the highest score.
A so-called score matrix is used which defines scores for every event. Different
scoring systems exist. One simple system is: +1 for a match, −1 for a mismatch
and −2 for an indel.
Given two DNA sequences X = (x1 , x2 , . . . , xp ) and Y = (y1 , y2 , . . . , yq )
together with a scoring matrix C and the character set A = {A, C, G, T, −}, the
9
highest score alignments can be found using a recursive formula:
F (0, 0) = 0
F (i − 1, j − 1) + C(xi , yj )
F (i, j) = max F (i − 1, j) + C(xi , −)
F (i, j − 1) + C(−, y )
(1.1)
j
where:
• C(xi , yj ) is score to match xi to yj
• C(xi , −) or C(−, yi ) is score when there is an indel
• F (i, j) is the best score when aligning two sequences Xi = (x1 , x2 , . . . , xi )
and Yj = (y1 , y2 , . . . , yj )
Amino acid sequences can be aligned in a similar fashion using a larger character set A = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, −} and
a different score matrix. A notable observation is that the score matrix of amino
acids is much more complicated due to the difference in properties of amino acids.
The more similar two amino acids are, the higher score an alignment for them
should be. Also, the score matrix may be different for different types of proteins
or species. Score matrices can be derived from substitution models.
The pairwise sequence alignment only reveals the relationship between two
sequences whereas many bioinformatic analysis objective is to investigate the relationship among a set of sequences. A multiple sequence alignment (MSA) is
required for such analysis which is a matrix of characters such that row i represents characters of sequence i and column j represents homologous characters at
position j .
The construction of multiple sequence alignments can be performed using dynamic programming [16]. The time complexity for this approach is O(mn 2n )
where n is the number of sequences and m is the length of the alignment. This
computational burden limits this approach to only a few sequences. Many approximate methods have been proposed for larger number of sequences such as
CLUSTALW [17] or MUSCLE [18].
10
- Xem thêm -