Tài liệu Detecting bad snps from illumina beadchips using jeffreys distance, phát hiện các snp xấu từ illumina beadchips sử dụng khoảng cách jeffreys

.PDF

120

hoangtuavartar Báo vi phạm

Tải xuống 120

Mô tả:

VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY -------- NGUYEN HOANG SON Detecting bad SNPs from Illumina BeadChips using Jeffreys distance MASTER THESIS OF INFORMATION TECHNOLOGY Hanoi 1 - 2012 VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY -------- NGUYEN HOANG SON DETECTING BAD SNPS FROM ILLUMINA BEADCHIPS USING JEFFREYS DISTANCE MASTER THESIS Sector: Information Technology Major: Computer Science Code : 60 48 01 Supervised by: Dr. Le Sy Vinh Hanoi - 2012 1 ORIGINALITY STATEMENT ‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UET or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’ Signed ........................................................................ i Abstract Current microarray technologies are able to assay thousands of samples over million of SNPs simultaneously. Computational approaches have been developed to analyse a huge amount of data from microarray chips to understand sophisticated human genomes. The data from microarray chips might contain errors due to bad samples or bad SNPs. In this thesis, a novel method is proposed to detect bad SNPs from the probe intensities data of Illumina Beadchips. This approach measures the difference among results determined by three software Illuminus, GenoSNP and Gencall to detect the unstable SNPs. Experiment with SNP data in chromosome 20 of Kenyan people demonstrates the usefulness of our method. This approach reduces the number of SNPs that are needed to check manually. Furthermore, it has the ability in detecting bad SNPs that have not been recognized by other criteria. Acknowledgements Apart from the efforts of myself, the success of any project depends largely on the encouragement and guidelines of many others. First and foremost, I would like to thank to my supervisor Dr. Sy Vinh Le for the valuable guidance and advice. This research project would not have been finished successfully without his continuously support and assistance. His enthusiastic supervision helped me in all the time of research and writing of this thesis. I also would like to gratefully acknowledge the tremendous encouragement and insightful comments of Dr. Si Quang Le during the time of research of this thesis. His brilliant ideas helped me so much to overcome numerous problems and difficulties. Any word is inadequate for his helpful aids. The author would also like to convey thanks to the Department of Computer Science for providing the useful preferences and laboratory facilities. I also wish to express love and gratitude to my beloved families; for their understanding and endless love, through the duration of my studies. My thanks and appreciations also go to my colleague in developing the project and people who have willingly helped me out with their abilities. i Table of Contents Overview 1 1 Introduction 1.1 Biological Background . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 SNP Genotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Quality Control and Quality Assurance . . . . . . . . . . . . . . . . . 2 Related Work 2.1 Naive method for SNP genotyping 2.2 GenCall . . . . . . . . . . . . . . 2.3 Illuminus . . . . . . . . . . . . . . 2.4 GenoSNP . . . . . . . . . . . . . 2.5 Discussion . . . . . . . . . . . . . 3 3 5 7 . . . . . 10 10 11 12 12 13 . . . . . 15 15 16 16 17 19 4 Experimental Results 4.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 23 24 26 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Method 3.1 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . . 3.2 Approximate relative entropy between two Student distributions 3.2.1 Approximate Student distribution . . . . . . . . . . . . . 3.2.2 The matched bound approximation . . . . . . . . . . . . 3.3 Estimate conflict degree between three callers . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . 34 ii List of Figures 1.1 1.2 An example of a SNP detected by an alignment of two DNA sequences. BeadChip work flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 3.2 Approximate t-distribution by a mixture of Gaussians . . . . . . . . 18 An example of Matched Bound Approximate method with two mixtures of three components . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1 4.2 4.3 4.4 Input file format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Histograms of three metrics in term of conflict score. . . . . . . . . Erroneous loci filtered by applied two functions but minimum. . . . HWE, MAF and missing rate histogram of three callers, with original and filtered data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii 4 6 . 23 . 26 . 27 . 29 List of Tables 4.1 4.2 4.3 4.4 Result Result Result Result of of of of missing rate filter. . . MAF filter. . . . . . . HWE exact test filter. synthesis criteria. . . . . . . . iv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 28 30 30 Overview Recently, Genome-Wide Association Study (GWAS), also known as Whole Genome Association Study (WGAS), proved to be an successful strategy in identifying genetic variants associated with common diseases or complicated phenotypes. This method focuses on associations between Single Nucleotide Polymorphisms (SNP) and traits. Because GWAS investigates the entire genome rather than just testing one or a few genetic regions, there will be a large extreme amount of SNP genotype data needed to be called. As the consequence, different impressive methods have been developed to deal with the problem of genotype calling automatically and efficiently. In general, these programs try to translate probe hybridization intensities outputed from microarray chips into SNP genotypes. In the ideal case, a call at a SNP loci generates three clusters of signals, so that SNP genotype for a certain sample could be determined according to which cluster it belongs to. However, in fact, the ambiguous data and errors happen frequently due to many reasons. Quality Control (QC), as an indispensable step, is included in every studies to remove these error data from the dataset as much as possible. Unfortunately, this kind of work requires significant time and effort because it is hardly automated completely. It is also difficult to find the bad data among the good ones when the applied criteria are not clear for these cases. Although many statistical methods have been proposed, expert-guided evaluations are usually needed to determine the ultimate results. On top of that, the statistical variables used in considering SNP genotype quality have thresholds that depending on conditions, making obstacles on the way of finding a automatic solution based on them. Therefore, an novel approach beside the traditional QC process is necessary to make this important step self-regulating as much as possible. In our work, we developed a new method to detect bad SNPs. Our method takes advantage of three available genotype callers for Illumina BeadChip, namely GenCall, GenoSNP and Illuminus. The general idea comes from the fact that data 1 LIST OF TABLES 2 of bad SNPs tend to confuse the callers, so the calling results of different methods are usually not consistent. That is, by comparing visually the cluster clouds of three callers SNP by SNP, we could recognize these bad loci. The distance in informatics theory (relative entropy) is applied to examine these dissimilarities so that bad SNPs could be detected automatically. We applied our method with real-life data and it proved to be a good protocol with satisfactory results. Despite of the fact that there still a lot of work to do, this method really has the ability to help experts in confirming the filtered data, also suggests potential bad SNPs that hard to find by traditional QC process. The thesis is organized in five chapters. An introduction to the definitions and background knowledge is given in Chapter 1. At first, this chapter will provide the very first images in bioinformatics by explaining briefly some terminologies in molecular biology, such as human genome, DNA, SNP and SNP genotyping . . . It also offer further information relating to our work, about solutions for genotype calling problem and quality control process. Attending these sections are necessary for better understanding the later content of the thesis. Chapter 2 will describe in more details about the available genotype callers. The general idea for a simple caller will be given at the beginning to give the most basic solution for the genotype calling problem. After that, three callers used in our algorithm are explained thoroughly. Information provided in this chapter will help reader get the first step further toward our solution. The proposed method is presented in the third chapter. We will review the definitions about relative entropy (also known as Kullback-Leibler divergence) and its application with Normal distribution. The next section will state an available method to calculate relative entropy between Student distributions, which is required to deal with our problem. Finally, an estimate technique is proposed to measure the dissimilarities between outputs of three callers. The experiment process of applying our protocol with real data is offered in Chapter 4. At first, input data description is given. After that, there is a step of try-and-error experiment to find the most appropriate parameters that should be used in our program. Finally, the result of bad SNPs detected by our method is evaluated by compared with other traditional QC criteria. The final chapter will give you the overall conclusion and further discussion about the thesis. Chapter 1 Introduction This chapter will provide some background knowledge in molecular biology and bioinformatics ralated to my thesis topic. 1.1 Biological Background Human Genome In the nucleus of almost every cell, there exist a common substrate that code for entire body of human beings - known as DNA. This large and complex-structure molecule is composed of numerous pairs of nucleotides, namely Adenine, Cytosine, Guanine and Thymine. DNA has two strands, but tracking one of them is sufficient (the other could be determined by pairing rule). Human genome refers to the whole genetic makeups of human species, which means every DNA sequences stored on 23 chromosome pairs and small mitochondrial DNA. They are encoded and represented as databases of very long strings of four alphabets (or bases) {A, C, G, T} which stand for four types of nucleotides respectively. There are nearly three billions characters in total for a completed genome, which are inherited and passed down generations by generations. Single Nucleotide Polymorphisms : Human genomes from different individuals are about 99.9% identical. The only 0.1% difference make particular characteristics of individuals (Collins & McKusick, 2001). Single Nucleotide Polymorphisms (SNP) are of these important biological different markers. Knowing information about SNP and SNP genotype would contributes significantly to Genome-wide association studies (GWAS), helping scientists understand better the complicated hu- 3 1.1. Biological Background 4 man genomes and their inheritance mechanism. As extremely important biological markers, these tiny changes could influence decisively to various exposures, especially the ones related to diseases are in the top hot topics recently A SNP is defined as a single base change in a DNA sequence. The positions where SNP occurred are Figure 1.1: An example of a SNP detected by an alignment of two DNA sequences. called SNP site. Note that there usually happen only two possible nucleotide variants for any SNP site. The nucleotide variants at SNP sites are called alleles. The rare polymorphisms with more than two bases are out of consideration, therefore we could just use 0 and 1 to denote the two possible alleles. For instance, Figure 1.1 shows two possible DNA sequences at the same locus in the human genome. Considering the green strands, we have two sequences that differ in one location (G versus A): 1 : . . . AACGGATCCAC . . . 2 : . . . AACGAATCGAC . . . 1.2. SNP Genotyping 5 The SNP in this site could be written along with the sequence as follow: G . . . AACG[ ]ATCCAC . . . A SNP Genotype : A pair of alleles each from one chromosome copy in a deploy organism at a SNP site is called SNP genotype. Since there are two different alleles at a SNP site, there are three different types for a SNP genotype and denoted as: 00, 01, and 11 (equivalent to the terminology of allele aa, aA, and AA respectively) where 0 and 1 represents from two different nucleotide variants at this SNP site. 00 and 11 are called homozygous genotypes while 01 is called heterozygous genotype. Analyzing SNPs and SNP genotypes helps us better understand the complicated human genomes and their inheritance mechanisms. These nucleotide changes could relate to diseases and studying these changes is recently a hot topic. For example, in Figure 1.1, if these two DNA sequences above come from a common locus of two chromatids in a chromosome of one individual, then the SNP genotype for that person is 01. 1.2 SNP Genotyping Illumina BeadChips The need of simultaneous assaying a large number of individuals (samples) over million of SNP sites has led to the innovation of biological integrated chips industry. To date, there are many different manufacturers working on this field, produce various type of microarrays such as Affymetrix GeneChip, Illumina Infinium Beadchips, Perlegen, Invader, etc. The microarray beadchip contains a number of well plate, each well has thousands of beads. A known 25-mer oligonucleotides (a predefined short sequence DNA of 25 nucleotides of human genome) is embedded in each bead. The fluorescent DNA fragments containing genetic information of individual will match with their complementary on the beads in hybridization step. A computational system then scan and analyse the coloured signal emitted from the beads to output the allele densities. The number of well per plate and bead per well is integrated increasingly day by day, allowing us to interrogate millions of SNP locus simutaneously. The Illumina whole-genome BeadChips (Steemers et al., 2006; Peiffer et al., 2006) is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as the input and captures genotypes across 1.2. SNP Genotyping 6 Figure 1.2: BeadChip work flow. n SNP sites. After several chemical and preprocessing steps, it outputs a matrix G = {gij,i=1,...,m;j=1,...,n } where gij = (xij , yij ) are the raw intensities which indicate the SNP genotype of sample i at SNP site j. The intensities xij , yij represent for allele 0, 1 of SNP genotype gij . Methods In genotyping problem, if microarray chips are hardwares, then it is necessary to have appropriate softwares to deal with the data. Computational methods to determine genotype types from raw intensity data have been developed along with each type of chip. There are several methods available (also known as callers) for Illumina Bead Array. In this work, we only consider the three most popular methods, namely Gencall (in GenomeStudio software), Illuminus and GenoSNP. Gencall(Illumina Inc., 2005) is the earliest tool still being in use in GenomeStudio tool set. Illuminus(Teo et al., 2007) was developed by Yik Teo et al. as a further tool focusing on the ability to process large amount of data available from advanced chips. GenoSNP(Giannoulatou et al., 2008) is a novel approach that could work out with little experimented samples available. 1.3. Quality Control and Quality Assurance 1.3 7 Quality Control and Quality Assurance Large-scale stastical studies are supceptible to errors. Quality Control and Quality Assurance (QC/QA) process is considered as a critical phase that has an significant affect to the accuracy of computational model. There are a number of proposed protocols have been made to ensure the data quality, avoiding spurious conclusions and mistaken results. QC is defined as a process of monitoring and controlling the quality of data as it is being generated, whereas QA is used to review the product quality after that (Laurie et al., 2010). The content of this thesis is related to QC procedure of genotype calling problem. Frequently in cleaning genotype data, it consists of two separate phases: samples filter and SNPs filter. We just focus on the later in this article. In which, low-quality SNPs are called bad SNPs and should be filtered out. The most common and widely protocol used in evaluating SNPs quality is based on measurements such as HardyWeinberg Equilibrium, missing rate, MAF (minor allele frequency), Mendel error etc. Although all three callers have applied their own advanced methods to filter out these bad SNPs from the data set, it is highly possible that a number of faulty ones still remained in the dataset. In a typical QC process for genotyping problem, there are three variables considered. They are SNP’s missing rate, HWE (Hardy-Weinberg equilibrium) and MAF (minor allele frequency). In fact, to improve the efficiency of filtering process, several extra conditions are also included. For instance, there is an suggestion about association tests based on duplicate discordance, Mendelian errors, sex differences and heterozygosity rate in addition (Laurie et al., 2010). Briefly, in a SNP locus: • Missing rate or missing proportion (MSP) at a SNP indicates how much samples failed in this call: MSP = number of no calls total of samples As could be seen clearly, the higher missing rate indicates the poorer genotype calling performance. The loci with MSP greater than 5% usually are considered as problematic SNPs. • In a biallelic locus which is in Hardy-Weinberg equilibrium and minor allele frequency is q then the probalilities of three possible genotype 0/0, 0/1 and 1.3. Quality Control and Quality Assurance 8 1/1 are (1−q)2 , 2q(1−q), q 2 . These probabilities should be stable, meaning the real values are nearly the same with the expected values. Significant deviation of HWE tests is typically imply gross genotyping error. A probalility test value, or p-value, is used to estimate this difference. • MAF is the ratio of minor (smaller) alleles counted in the whole set of alleles: MAF = number of minor alleles total alleles Low MAF means that there exist a cluster among three with fewer samples. As the consequence, SNPs with low MAF are more prone to error, since almost calling methods that based on clustering do not work well in these cases. An proposed criteria for a standard-quality SNP genotyping are HWE p-value greater than 0.00033, MSP in average is less than 3% but maximum must not pass 10%, while MAF and quality score greater than a pre-defined figure which varies from study to study (Group, 2007; Laurie et al., 2010). In fact, a combination of these variables are required to generate more dependable results of filtering. However, as stated before, these thresholds are not fixed in every experiments and very hard to be automated. These numbers depends on many factors, giving a rough evaluation to the data quality. The final conclusions of bad SNPs have to be made manually by an expert-guided process to avoid false positive cases which are potentially missed disease variants in the study. A typical marker (SNP in this case) QC process of GWAS data includes at least four steps as follow (Anderson Carl A, ): • Filter out SNPs with excessive missing rate. • Identify SNPs with extreme deviation from HWE. • Find every SNPs that have discrepancy of missing rates between cases and controls. • Detect low-MAF markers. In this suggestion, the limit for missing rate is 5% (1% for SNPs of low frequency M AF ≤ 5%). Every SNPs with HWE p-value less than 0.001 will be examined by experts for quality. The final step is to remove SNPs showing very low MAF, the threshold of 1-2% is applied typically. The authors also emphasize that checking cluster plots manually is the best safeguard for quality evaluation. 1.3. Quality Control and Quality Assurance 9 In this thesis, we propose a new method to measure the difference among results created from different callers. At first, the next section will provide information about genotype calling problem and current callers. Then we will introduce our method of utilizing Jeffreys distance to find bad SNPs. At the end, our experimental results will be evaluated statistically and plotted visually to estimate the effectiveness of this approach. Lastly, there comes our conclusion and further discussion about the work. Chapter 2 Related Work Given an matrix G = {gij,i=1,...,m;j=1,...,n } with gij = (xij , yij ), representing the genotype intensity of m samples at n SNP loci. The task is to assign every pairs to its most suitable genotype label, 00, 01 or 11 (or no call cluster if impossible). This problem is call SNP genotype calling or in short, SNP genotyping. 2.1 Naive method for SNP genotyping The most simple method for SNP genotyping problem use the correlation between two allelic intensities (xij , yij ). For example, with a data point (x, y): • If xij yij then this SNP genotype could be labelled as 00 (homozygous of allele 0) • If xij yij then it should belong to genotype class 11 (homozygous of allele 1) • Or if xij ' yij then this SNP calling should be 01 (heterozygous) However, this naive method could not work well with the ambiguous cases and its performance is poor with the tremendous amount of data. In fact, statistical approach are used to overcome this problem. Illuminus, GenoSNP and Gencall are there most popular callers for 2.5M BeadChips of Illumina. They use different clustering algorithms to group samples into three clusters 00, 01, 11. Samples which do not belong any clusters are called outlier. 10 2.2. GenCall 2.2 11 GenCall Gencall (Illumina Inc., 2005) is the canonical caller developed for Illumina microarray from the very beginning. The work-flow of this software in SNP genotype calling is described as follow: • Firstly, the programs read the input data of bead intensities from idats file (output by scanning process). Then the data belonging to each bead pool (group of beads that highly related to each others) is normalized and converted to affine coordinate (Ritchie et al., 2011). • Secondly, an algorithm known as GenTrain is utilized for the unsupervised clustering step. The transformed data of every samples of each SNP are modelled, then combined with several heuristic information in order to be analysed by a neural network. The centroids for each of three clusters are estimated, forming three genotype groups with the shape and position of them. After the clustering process, a statistical score (called GenTrain score) that combined a number of factors, penalty terms... is assigned to each SNP locus in order to evaluate the resulting clusters in a way similar to human expert’s visual. • The GenTrain score, together with clustering information is then used with a Bayesian model to establish the classification. Finally, the calling score for each call (GenCall Score) is generated as a confident degree for that call. GenCall score is not a probability value, but is used to represent the quality of genotyping and might be varied between different loci or chromosomes. However any call with score less than 0.2 will be considered as a failure due to the lack of confidence (assigned as no call in this case), while good calls are indicated by the score greater than 0.7. Every SNP genotypes that have GenCall score fall between 0.2 and 0.7 are qualified and needed to be checked carefully from other point of views. It worth noting that GenCall run the top/bottom strand correlation studies, in which each SNP is genotyped twice for two strands respectively and then compared to ensure the agreement. This mechanism also helps Gencall to filter out some faulty calls from the last result. 2.3. Illuminus 2.3 12 Illuminus Illuminus (Teo et al., 2007) is one of the most popular callers for 2.5M BeadChip of Illumina. The whole working process of genotype calling methods are quite the same with several overall common phases, namely data normalization, clustering, calling and sometimes validation. In this second tool of interests, much efforts in preprocessing data is spent by adding further procedures to ease the clustering process. Assume that normalized data generated from Illumina BeadStudio software (also the input of GenCall algorithm) are in form (xij , yij ), where (xij and yij ) are the signal intensities for the two alleles 0 and 1 respectively, belong to a certain sample i at a certain SNP j. Illuminus then converts those figures into new measures, namely contrast and strength defined as follow: cij = xij − yij xij + yij sij = log(xij + yij ) Secondly, the clustering algorithm take those values as input and by using EM (Expectation Maximization) framework on them, it try to fit a bivariate mixture model with three t-distributed components (as three genotype clusters) and a Gaussian component for outliers (corresponding to no call cluster). The parameters that determine the shape and location of components are then re-estimated iteratively until they maximize the fitness likelihood and become stable. Next, the posterior probabilities are calculated to classify the samples into appropriate genotype states. For details of this technique, please refer to its publication(Teo et al., 2007). There is also a method to ensure the stability of calling results in Illuminus, called pertubation analysis. This process means that every SNPs has been called twice, in original data (xij , yij ) and pertubated data (xij + , yij + ) respectively, then the concordance between the two result is estimated to determine whether this SNP is valid or not. 2.4 GenoSNP GenoSNP (Giannoulatou et al., 2008) is another prominent caller that work with Illumina BeadChip. Firstly, the data transformation method of this tool is different compared to the other two. Start with pairs of raw intensities (x, y), it generates

- Xem thêm -

Tài liệu liên quan

Tài liệu vừa đăng

Tài liệu xem nhiều nhất