Đăng ký Đăng nhập

Tài liệu Mô hình thay thế axit amin cho virut rota

.PDF
56
144
60

Mô tả:

VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY ! NGUYEN DUC CANH AMINO ACID SUBSTITUTION MODEL
 FOR ROTAVIRUS
 MASTER THESIS Major: Computer Science HA NOI - 2019 VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY Nguyen Duc Canh AMINO ACID SUBSTITUTION MODEL
 FOR ROTAVIRUS
 MASTER THESIS Major: Computer Science Supervisor: Assoc. Prof. Le Sy Vinh HA NOI - 2019 Abstract Modeling protein evolution has been a major field of research in bioinformatics for decades. One popular method to approximate the evolution of proteins is to use an amino-acid substitution model which can reveal the instantaneous rate that an amino acid is changed into another amino acid. Such kind of model is useful in many different ways and has become a main component in a variety of bioinformatic systems. Many models such as JTT, WAG and LG have been estimated using data from various species. Recent research showed that these models might be inappropriate for analysis of some specific species. Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses, notably rotavirus - a contagious virus that can cause gastroenteritis. These epidemics raise a need for modeling the evolution of these emerging viruses. In this thesis, using the data from the Viral Genome Resource at National Center for Biotechnology Information (NCBI), we propose the ROTA model that has been specifically estimated for modeling the evolution of rotavirus. Analysis revealed significant differences between ROTA and existing models in amino acid frequencies, exchangeability coefficients as well as inferred phylogenies. Experiments showed that ROTA better characterizes the evolutionary patterns of rotavirus than other models and should be useful in most systems that requires an accurate description of rotavirus evolution. iii Acknowledgements I would like to express my sincere gratitude to my advisor Assoc. Prof. Le Sy Vinh for the continuous support of my study and research, for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me in all the time of research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Master study. Besides my advisor, I would like to thank Dr. Dang Cao Cuong and MSc. Le Kim Thu for giving devoted explanations to my questions and guiding me to solve various problems that I had to face. I also thank my friends: Can Duy Cat, Nguyen Minh Trang, Le Hai Nam for their continuous motivations without which I would never be able to complete this thesis. My sincere thanks also goes to Information Faculty of University of Engineering and Technology, Vietnam National University, Hanoi for providing me the necessary facilities to conduct experiments. Last but not the least, I would like to thank my parents for giving birth to me at the first place and supporting me spiritually throughout my life. iv Declaration I hereby declare that this thesis was entirely my own work and that any additional sources of information have been properly cited. I certify that, to the best of my knowledge, my thesis does not infringe upon anyone’s copyright nor violate any proprietary rights and that any ideas, techniques or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices. I declare that this thesis has not been submitted for a higher degree to any other University or Institution. v Table of Contents Abstract iii Acknowledgements iv Declaration v Table of Contents vii Acronyms viii List of Figures ix List of Tables x Introduction 1 1 Background 5 1.1 Sequence evolution . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.1 Heredity materials . . . . . . . . . . . . . . . . . . . . . 5 1.1.2 Evolution and homologous sequences . . . . . . . . . . . 6 Modeling sequence evolution . . . . . . . . . . . . . . . . . . . 9 1.2.1 Sequence alignment . . . . . . . . . . . . . . . . . . . . 9 1.2.2 General time-reversible substitution model . . . . . . . . 11 1.2.3 Model of rate heterogeneity . . . . . . . . . . . . . . . . 15 1.2.4 Available amino acid substitution models . . . . . . . . . 16 Phylogenetic trees . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.3.1 17 1.2 1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . vi 2 3 1.3.2 Phylogenetic tree reconstruction . . . . . . . . . . . . . 18 1.3.3 Robinson-Foulds distance . . . . . . . . . . . . . . . . . 20 1.3.4 Phylogenetic hypothesis testing . . . . . . . . . . . . . . 20 Method 24 2.1 Modeling method . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2 Model estimation process . . . . . . . . . . . . . . . . . . . . . 26 Results and discussion 28 3.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Model analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3 Performance on testing alignments . . . . . . . . . . . . . . . . 31 3.4 Tree topology analysis . . . . . . . . . . . . . . . . . . . . . . . 32 3.4.1 Robinson-Foulds distance . . . . . . . . . . . . . . . . . 32 3.4.2 Shimodaira-Hasegawa test . . . . . . . . . . . . . . . . 33 Protein-specific models . . . . . . . . . . . . . . . . . . . . . . 35 3.5 Conclusions 39 References 44 A ROTA model 45 vii Acronyms GTR General time-reversible ME Minimum evolution ML Maximum likelihood MP Maximum parsimony MSA Multiple sequence alignment NCBI National Center for Biotechnology Information RF Robinson-Foulds viii List of Figures 1.1 A sample phylogenetic tree of 5 sequences. . . . . . . . . . . . . 18 1.2 Two trees T1 and T2 describe the same set of 5 sequences {1, 2, 3, 4, 5} but have different topologies. Tree T1 has two bipartitions {1, 2}|{3, 4, 5} and {1, 2, 3}|{4, 5} while tree T2 has two bipartitions {1, 3}|{2, 4, 5} and {1, 2, 3}|{4, 5}. Bipartition {1, 2}|{3, 4, 5} is present only in T1 whereas bipartition {1, 3}|{2, 4, 5} is present only in T2 . Bipartition {1, 2, 3}|{4, 5} occurs in both T1 and T2 . The standardized Robinson and Foulds distance between T1 and T2 is 2/4 . . . 3.1 21 The exchangeability coefficients in ROTA, FLU and JTT models. The black (gray or white) bubble at the intersection of row X and column Y presents the exchange rate between amino acid X and amino acid Y in ROTA (FLU or JTT). . . . . . . . . . . . . . . . 3.2 32 The relative differences between exchangeability coefficients in ROTA and other two models. The size of the bubble corresponds to the value (ROT AXY − MXY )/(ROT AXY + MXY ) where X, Y is one of 20 amino acids and M is FLU for the subfigure a) or JTT for the subfigure b). The black bubble with value 2/3 (1/3) indicates the coefficient in ROTA is 5 (2) times larger than the corresponding one in M whereas the white bubble with value 2/3 (1/3) indicates the coefficient in ROTA is 5 (2) times smaller than 3.3 the corresponding one in M. . . . . . . . . . . . . . . . . . . . . 33 Amino acid frequencies of ROTA, FLU and JTT models . . . . . 34 ix List of Tables 1.1 Twenty differenct amino acids. . . . . . . . . . . . . . . . . . . 3.1 Number of sequences grouped by protein types in training and testing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 34 Comparisons of ROTA and 10 other models in constructing maximum likelihood trees . . . . . . . . . . . . . . . . . . . . . . . 3.5 30 Comparisons of ROTA and 10 other models in constructing maximum likelihood trees . . . . . . . . . . . . . . . . . . . . . . . 3.4 29 The Pearson’s correlations between ROTA and 10 widely used models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 7 35 The normalized Robinson-Foulds distance between trees inferred using ROTA versus FLU and JTT models for 12 testing multiple alignments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6 36 The number of test alignments that trees inferred from existing models are significantly worse than those from ROTA for the 11 testing alignments that ROTA is the best-fit model. . . . . . . . . 3.7 37 Comparison of log-likelihood per site between ROTA and proteinspecific model. For each alignment of protein P, ROTAP is the model estimated using only sequences of protein P from training dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 A.1 Amino acid exchangeability matrix (the first 20 rows) and frequency vector (the last row) of ROTA . . . . . . . . . . . . . . . x 46 Introduction Motivation A major field of research in bioinformatics is to understand the evolutionary relationship among species. In the recent decades, protein data have been accumulated in a large scale and are used in many methods to model the evolutionary process. One popular method is to use protein data to estimate amino acid substitution models which can reveal the instantaneous substitution rates that amino acids change into the other amino acids. This kind of model has become crucial for a wide range of bioinformatics sytems thanks to its various applications. The first application of amino acid substitution model is to help create protein sequence alignments [1]. Methods used to align sequences often involve the usage of a score matrix which represents the penalty when an mutation occurs. Typically for nucleotide models, a score matrix can be simply +1 for a matched site, -1 for a mismatched site and -2 for an indel of two nucleotide sequences. Such a simple scoring scheme, however, is not suitable for comparing protein sequences. Amino acids, the building blocks that make up protein sequences, have biochemical properties that influence their exchangeability in the process of evolution. For instance, amino acids of similar molecular sizes get substituted more often than those of widely different molecular sizes. Other properties such as the tendency to bind with water molecules also influence the probability of substitution. Therefore, it is important to use a scoring system that reflects these properties which can be derived from amino acid substitution models. Amino acid substitution models are also frequently used to infer protein phylogenetic trees using maximum likelihood approach [2]. In this method, given a phylogeny of protein sequences with branch lengths, an evolutionary model that 1 allows to compute the probability of this tree is needed to select the maximum likelihood tree. A substitution model allows us to compute the transition probabilities Pij (t), the probability that state j will exist at the end of a branch of length t, if the state at the start of the branch is i, thus makes it possible to calculate the likelihood of data given a phylogenetic tree. Moreover, these models are used to estimate pairwise genetic distances between protein sequences that subsequently serve as inputs for distance-based phylogenetic analysis [3]. Distance methods’ objective is to fit a tree to a matrix of pairwise genetic distances. For every two sequences, the observed distance is a single value based on the number of different positions between the two sequences. The observed distance is an underestimation of the true genetic distance because some of the positions may have experienced multiple substitution events. In distancebased methods, a better estimate of the number of substitutions that have actually occurred is often done by applying a substitution model. Many methods have been proposed to estimate general amino acid substitution models from large and diverse databases [2] [4]. The common methods are counting or maximum likelihood approaches. Estimating amino acid substitution models is much more challenging than estimating nucleotide substitution models due to a large number of parameters to be optimized. For example, the general time reversible model for nucleotides contains 8 parameters in comparison to 208 parameters for models of amino acid substitutions. Thus, amino acid substitution models are typically estimated from large datasets. The first model is Dayhoff model [5] using highly similar protein sequences. As more protein sequences became available, JTT model was estimated [6] which used the same counting method. However, the counting methods are limited to only closely related protein sequences. The maximum likelihood method was proposed later [7] to estimate the mtREV model from 20 complete vertebrate mtDNA-encoded protein sequences. WAG model [8] was estimated using a maximum likelihood method from 182 globular protein families. The WAG model produced better likelihood trees than the Dayhoff and JTT models for a large number of globular protein families. The maximum likelihood method is improved by incorporating the variability of evolutionary rates across sites into the estimation process to create LG model [4] from 2 the Pfam database. Experiments showed that the LG model gave better results than other models both in terms of constructing maximum likelihood trees. Although a number of general models have been estimated from protein sequences of various species, they might be inappropriate for a particular set of species due to the differences in the evolutionary processes of these species. A number of specific amino acid substitution models for important species have been introduced such as rtREV model [9] for retrovirus and HIVb, HIVw models [10] for HIV related analysis. Recently, FLU model is estimated for Influenza virus which was shown to be highly different from widely used existing models and more suitable for Influenza protein analysis [11]. Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses such as Ebolavirus, MERS-CoV - Middle East respiratory syndrome coronavirus, or Rotavirus - a contagious virus that can cause gastroenteritis. These epidemics raise a need to modeling the evolution of these viruses. The Viral Genome Resource [12] at National Center for Biotechnology Information (NCBI) acting as a storage for protein sequences of various emerging viruses provides necessary data to model viral protein evolution. Problem statement Existing amino acid substitution models do not have the ability to accurately characterize the evolution of rotavirus. An amino acid substitution model for rotavirus is necessary to better understand its viral evolution and medical impacts. Many protein analysis systems can utilize an rotavirus-specifc amino acid substitution model to analyse rotavirus protein sequences and find out the patterns in rotavirus viral evolution and infection process. In this thesis, the data from the Viral Genome Resource at NCBI is used to estimate such an amino acid substitution model for rotavirus, an emerging virus that is the leading cause of the Diarrhea in young children. A maximum likelihood approach which allows the variability of evolutionary rates across sites are used as the method of model estimation. The main contribution of this work is to provide a rotavirus-specific substitution model which would benefit subsequent research on understanding the evo3 lutionary patterns of this type of virus. A detailed investigation of the model is conducted to understand the difference of this model from existing ones and to evaluate its performance on other dataset. Outline This thesis is divided into 3 main chapters. The first chapter provides background knowledge needed to understand the estimation process in the next chapter. Chapter 2 describes how the evolution of rotavirus can be modeled as well as which steps the model estimation process follows. Chapter 3 describes the data preparation process, then investigates the rotavirus-specific model obtained from the process described in Chapter 2 (called ROTA for short) to highlight its inner characteristics and compare its performance on a testing dataset with other existing models. Some discussions on the results are also given in Chapter 3. 4 Chapter 1 Background This chapter explains necessary terminologies and concepts to understand the estimation process of an amino acid substitution model. Firstly, section 1.1 describes the evolution of sequences in which we explains heredity materials that pass down from generations to generations and homologous sequences that are the main subjects of investigation to understand evolutionary process. Subsequently, section 1.2 explains popular methods to model sequence evolution as well as revises existing amino acid substitution models which are commonly used in practice. The final section discusses phylogenetic trees in detail which are used as the visual representation of evolutionary process and then addresses the methods that we use later to compare these phylogenetic trees. 1.1 Sequence evolution 1.1.1 Heredity materials In molecular biology, nucleic acids are known to be the heredity materials which are replicated and pass down from parents to children. There are two types of nucleic acids: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). For living organisms, DNA is the molecule that carries genetic information. Many viruses use RNA as their genetic material instead. DNA and RNA are composed of smaller subunits called nucleotides [13]. Nucleotides are linked together through chemical bonds in a linear fashion and form sequences of DNAs and RNAs. There are 4 types of nucleotides in DNA which 5 are named after 4 different nitrogenous bases making up these nucleotides: A, C, G and T (short for adenine, cytosine, guanine and thymine, respectively). In terms of RNA, base T is substituted by base U (uracil). DNA (or RNA in some viruses) instructs cells to produce one of the most important types of macro-molecules: protein. Proteins play a critical role in the continuity of life since they form the structure of organisms, provide a lot of funtionalities as well as control the regulation of the body. Proteins are sequences of amino acids. There are 20 different amino acids found in nature (see Table 1.1), each of which has different chemical properties. One change of amino acids in a protein sequence could either have a subtle effect, change the protein function or entirely cause loss of function. Only some regions in DNA and RNA contain the instructions to encode proteins. Those coding regions are called genes. Within a gene, three consecutive nucleotides, often called a codon, encode one amino acid. The process of creating a protein based on a gene is called translation. Different codons can encode the same amino acid, which explains why although there are 64 different combinations of 3 nucleotides, only 20 types of amino acids are present. To be accurate, 61 codons encode for amino acids while the other three are stop-codons. As their name suggests, these stop-codons instruct cells to stop the translation process. 1.1.2 Evolution and homologous sequences The well-known theory about evolution is proposed by Darwin [14] which suggests that species have evolved from a common ancestor. At molecular level, living organisms share: • The same genetic material (nucleic acids) • The same table of encodings (universal genetic codes) • The same molecular building blocks (such as nucleotides and amino acids). These shared features highly suggest that living things today are descendants of a common ancestor, or at least highly similar ancestors. Heredity materials, genetic codes and the process of translation are passed down from generations to 6 Table 1.1: Twenty differenct amino acids. Name Three-letter code One-letter code Alanine Ala A Cysteine Cys C Aspartic Acid Asp D Glutamic Acid Glu E Phenylalanine Phe F Glycine Gly G Histidine His H Isoleucine Ile I Lysine Lys K Leucine Leu L Methionine Met M Asparagine Asn N Proline Pro P Glutamine Gln Q Arginine Arg R Serine Ser S Threonine Thr T Valine Val V Tryptophan Trp W Tyrosine Tyr Y generations, which results in present-day organisms having many similar features at molecular level. Children, however, often inherit genes from their parents with modifications (also called mutations) due to the error-prone replication process. The mutations can be harmful or beneficial. When mutations occur, the mutated organisms may either die, develop new traits or even evolve into new species. Millions of years of replication and mutation result in a diversity of species as we see at the present days. 7 Mutations can be in small-scale or large-scale. In small-scale mutations, genes are affected by changes in one or a small number of nucleotides. If only one single nucleotide is mutated, it is called a point mutation. There are three type of point mutations: • Substitutions are changes from one nucleotide to the others. A substitution may have no effect if it does not change the corresponding gene product (synonymous substitution) or may alter the amino acid in the sequence of protein (nonsynonymous substitution). In theory, every type of nucleotides can be substituted by any other type at any position in the sequence. Lethal substitutions, however, do not replicate and are hardly found in nature. • Insertions add extra nucleotides into the sequence. This may shift the reading frame of codons or change the splice sites of a gene (the positions genes are chopped up and then reassemble before translation occurs), both of which greatly affect encoded proteins. • Deletions remove nucleotides from the original sequence. As insertions, deletions may change reading frame or splice sites and alter the gene products. Large-scale mutations are more complicated and affect one or many genes. Some types of large-scale mutations include: duplications which create many copies of the same sequence, deletions of large region on a gene or deletions of many genes and translocations which exchange regions between different DNA molecules. Sequences of nucleotides or amino acids are said to be homologous if they derive from an ancestral sequence. Homologous sequences are not necessarily the same as their ancestor due to mutations. Investigating homologous sequences is an important task as understanding the differences among homologous sequences can reveal the evolutionary relationships among species or individuals of a species. Identifying homology, however, is a hard task because of various mutations in the replication process. In the following sections, we use the term character to denote a nucleotide or an amino acid. Homologous characters are nucleotides or amino acids which derive from a common nucleotide or amino acid. 8 1.2 Modeling sequence evolution 1.2.1 Sequence alignment Homologous sequences may not have the same length because of different number of insertions and deletions in these sequences. Before the evolutionary relationships can be analysed, homologous sequences need to be aligned first. Aligning is the process of inserting character ’-’ to the input character sequences so the these sequences have an equal length. There may be many different ways to align two sequences. For example, with two sequences TGAATAGCC and TGAAAGTCC, these are acceptable alignments: 1 2 3 4 5 6 7 8 9 Sequence 1 T G A A T A G C C Sequence 2 T G A A A G T C C 1 2 3 4 5 6 7 8 9 10 Sequence 1 T G A A T A G - C C Sequence 2 T G A A - A G T C C With the absence of ancestral sequence, the length of the real alignment and the types of occurred mutations are unknown. In the above example, the first alignment has a length of 9 nucleotides and there are 3 substitutions at position 5, 6 and 7. In contrast, the second alignment has a length of 10 nucletides and there is one deletion (or one insertion of T) at position 5 and one insertion of T (or one deletion) at position 8. Since it is impossible to distinguish whether insertions or deletions have actually occured, they are often referred to as indels. The construction of pairwise alignment can be done in O(pq) time using NeedlemanWunsch algorithm [15] where p and q are the lengths of the two sequences. The Needleman-Wunsch algorithm is a dynamic programmign algorithm which tries to assign a score to each alignment and find the alignments with the highest score. A so-called score matrix is used which defines scores for every event. Different scoring systems exist. One simple system is: +1 for a match, −1 for a mismatch and −2 for an indel. Given two DNA sequences X = (x1 , x2 , . . . , xp ) and Y = (y1 , y2 , . . . , yq ) together with a scoring matrix C and the character set A = {A, C, G, T, −}, the 9 highest score alignments can be found using a recursive formula: F (0, 0) = 0    F (i − 1, j − 1) + C(xi , yj )   F (i, j) = max F (i − 1, j) + C(xi , −)    F (i, j − 1) + C(−, y ) (1.1) j where: • C(xi , yj ) is score to match xi to yj • C(xi , −) or C(−, yi ) is score when there is an indel • F (i, j) is the best score when aligning two sequences Xi = (x1 , x2 , . . . , xi ) and Yj = (y1 , y2 , . . . , yj ) Amino acid sequences can be aligned in a similar fashion using a larger character set A = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, −} and a different score matrix. A notable observation is that the score matrix of amino acids is much more complicated due to the difference in properties of amino acids. The more similar two amino acids are, the higher score an alignment for them should be. Also, the score matrix may be different for different types of proteins or species. Score matrices can be derived from substitution models. The pairwise sequence alignment only reveals the relationship between two sequences whereas many bioinformatic analysis objective is to investigate the relationship among a set of sequences. A multiple sequence alignment (MSA) is required for such analysis which is a matrix of characters such that row i represents characters of sequence i and column j represents homologous characters at position j . The construction of multiple sequence alignments can be performed using dynamic programming [16]. The time complexity for this approach is O(mn 2n ) where n is the number of sequences and m is the length of the alignment. This computational burden limits this approach to only a few sequences. Many approximate methods have been proposed for larger number of sequences such as CLUSTALW [17] or MUSCLE [18]. 10
- Xem thêm -

Tài liệu liên quan