Tài liệu Phát hiện các SNP xấu từ Illumina BeadChips sử dụng khoảng cách Jeffreys

.PDF

371

147

tailieuonline Báo vi phạm

Tải xuống 147

Mô tả:

Table of Contents Overview 1 1 Introduction 1.1 Biological Background . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 SNP Genotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Quality Control and Quality Assurance . . . . . . . . . . . . . . . . . 2 Related Work 2.1 Naive method for SNP genotyping 2.2 GenCall . . . . . . . . . . . . . . 2.3 Illuminus . . . . . . . . . . . . . . 2.4 GenoSNP . . . . . . . . . . . . . 2.5 Discussion . . . . . . . . . . . . . 3 3 5 7 . . . . . 10 10 11 12 12 13 . . . . . 15 15 16 16 17 19 4 Experimental Results 4.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 23 24 26 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Method 3.1 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . . 3.2 Approximate relative entropy between two Student distributions 3.2.1 Approximate Student distribution . . . . . . . . . . . . . 3.2.2 The matched bound approximation . . . . . . . . . . . . 3.3 Estimate conflict degree between three callers . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . 34 ii List of Figures 1.1 1.2 An example of a SNP detected by an alignment of two DNA sequences. BeadChip work flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 3.2 Approximate t-distribution by a mixture of Gaussians . . . . . . . . 18 An example of Matched Bound Approximate method with two mixtures of three components . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1 4.2 4.3 4.4 Input file format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Histograms of three metrics in term of conflict score. . . . . . . . . Erroneous loci filtered by applied two functions but minimum. . . . HWE, MAF and missing rate histogram of three callers, with original and filtered data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii 4 6 . 23 . 26 . 27 . 29 List of Tables 4.1 4.2 4.3 4.4 Result Result Result Result of of of of missing rate filter. . . MAF filter. . . . . . . HWE exact test filter. synthesis criteria. . . . . . . . iv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 28 30 30 Overview Recently, Genome-Wide Association Study (GWAS), also known as Whole Genome Association Study (WGAS), proved to be an successful strategy in identifying genetic variants associated with common diseases or complicated phenotypes. This method focuses on associations between Single Nucleotide Polymorphisms (SNP) and traits. Because GWAS investigates the entire genome rather than just testing one or a few genetic regions, there will be a large extreme amount of SNP genotype data needed to be called. As the consequence, different impressive methods have been developed to deal with the problem of genotype calling automatically and efficiently. In general, these programs try to translate probe hybridization intensities outputed from microarray chips into SNP genotypes. In the ideal case, a call at a SNP loci generates three clusters of signals, so that SNP genotype for a certain sample could be determined according to which cluster it belongs to. However, in fact, the ambiguous data and errors happen frequently due to many reasons. Quality Control (QC), as an indispensable step, is included in every studies to remove these error data from the dataset as much as possible. Unfortunately, this kind of work requires significant time and effort because it is hardly automated completely. It is also difficult to find the bad data among the good ones when the applied criteria are not clear for these cases. Although many statistical methods have been proposed, expert-guided evaluations are usually needed to determine the ultimate results. On top of that, the statistical variables used in considering SNP genotype quality have thresholds that depending on conditions, making obstacles on the way of finding a automatic solution based on them. Therefore, an novel approach beside the traditional QC process is necessary to make this important step self-regulating as much as possible. In our work, we developed a new method to detect bad SNPs. Our method takes advantage of three available genotype callers for Illumina BeadChip, namely GenCall, GenoSNP and Illuminus. The general idea comes from the fact that data 1 LIST OF TABLES 2 of bad SNPs tend to confuse the callers, so the calling results of different methods are usually not consistent. That is, by comparing visually the cluster clouds of three callers SNP by SNP, we could recognize these bad loci. The distance in informatics theory (relative entropy) is applied to examine these dissimilarities so that bad SNPs could be detected automatically. We applied our method with real-life data and it proved to be a good protocol with satisfactory results. Despite of the fact that there still a lot of work to do, this method really has the ability to help experts in confirming the filtered data, also suggests potential bad SNPs that hard to find by traditional QC process. The thesis is organized in five chapters. An introduction to the definitions and background knowledge is given in Chapter 1. At first, this chapter will provide the very first images in bioinformatics by explaining briefly some terminologies in molecular biology, such as human genome, DNA, SNP and SNP genotyping . . . It also offer further information relating to our work, about solutions for genotype calling problem and quality control process. Attending these sections are necessary for better understanding the later content of the thesis. Chapter 2 will describe in more details about the available genotype callers. The general idea for a simple caller will be given at the beginning to give the most basic solution for the genotype calling problem. After that, three callers used in our algorithm are explained thoroughly. Information provided in this chapter will help reader get the first step further toward our solution. The proposed method is presented in the third chapter. We will review the definitions about relative entropy (also known as Kullback-Leibler divergence) and its application with Normal distribution. The next section will state an available method to calculate relative entropy between Student distributions, which is required to deal with our problem. Finally, an estimate technique is proposed to measure the dissimilarities between outputs of three callers. The experiment process of applying our protocol with real data is offered in Chapter 4. At first, input data description is given. After that, there is a step of try-and-error experiment to find the most appropriate parameters that should be used in our program. Finally, the result of bad SNPs detected by our method is evaluated by compared with other traditional QC criteria. The final chapter will give you the overall conclusion and further discussion about the thesis. Chapter 1 Introduction This chapter will provide some background knowledge in molecular biology and bioinformatics ralated to my thesis topic. 1.1 Biological Background Human Genome In the nucleus of almost every cell, there exist a common substrate that code for entire body of human beings - known as DNA. This large and complex-structure molecule is composed of numerous pairs of nucleotides, namely Adenine, Cytosine, Guanine and Thymine. DNA has two strands, but tracking one of them is sufficient (the other could be determined by pairing rule). Human genome refers to the whole genetic makeups of human species, which means every DNA sequences stored on 23 chromosome pairs and small mitochondrial DNA. They are encoded and represented as databases of very long strings of four alphabets (or bases) {A, C, G, T} which stand for four types of nucleotides respectively. There are nearly three billions characters in total for a completed genome, which are inherited and passed down generations by generations. Single Nucleotide Polymorphisms : Human genomes from different individuals are about 99.9% identical. The only 0.1% difference make particular characteristics of individuals (Collins & McKusick, 2001). Single Nucleotide Polymorphisms (SNP) are of these important biological different markers. Knowing information about SNP and SNP genotype would contributes significantly to Genome-wide association studies (GWAS), helping scientists understand better the complicated hu- 3 1.1. Biological Background 4 man genomes and their inheritance mechanism. As extremely important biological markers, these tiny changes could influence decisively to various exposures, especially the ones related to diseases are in the top hot topics recently A SNP is defined as a single base change in a DNA sequence. The positions where SNP occurred are Figure 1.1: An example of a SNP detected by an alignment of two DNA sequences. called SNP site. Note that there usually happen only two possible nucleotide variants for any SNP site. The nucleotide variants at SNP sites are called alleles. The rare polymorphisms with more than two bases are out of consideration, therefore we could just use 0 and 1 to denote the two possible alleles. For instance, Figure 1.1 shows two possible DNA sequences at the same locus in the human genome. Considering the green strands, we have two sequences that differ in one location (G versus A): 1 : . . . AACGGATCCAC . . . 2 : . . . AACGAATCGAC . . . 1.2. SNP Genotyping 5 The SNP in this site could be written along with the sequence as follow: G . . . AACG[ ]ATCCAC . . . A SNP Genotype : A pair of alleles each from one chromosome copy in a deploy organism at a SNP site is called SNP genotype. Since there are two different alleles at a SNP site, there are three different types for a SNP genotype and denoted as: 00, 01, and 11 (equivalent to the terminology of allele aa, aA, and AA respectively) where 0 and 1 represents from two different nucleotide variants at this SNP site. 00 and 11 are called homozygous genotypes while 01 is called heterozygous genotype. Analyzing SNPs and SNP genotypes helps us better understand the complicated human genomes and their inheritance mechanisms. These nucleotide changes could relate to diseases and studying these changes is recently a hot topic. For example, in Figure 1.1, if these two DNA sequences above come from a common locus of two chromatids in a chromosome of one individual, then the SNP genotype for that person is 01. 1.2 SNP Genotyping Illumina BeadChips The need of simultaneous assaying a large number of individuals (samples) over million of SNP sites has led to the innovation of biological integrated chips industry. To date, there are many different manufacturers working on this field, produce various type of microarrays such as Affymetrix GeneChip, Illumina Infinium Beadchips, Perlegen, Invader, etc. The microarray beadchip contains a number of well plate, each well has thousands of beads. A known 25-mer oligonucleotides (a predefined short sequence DNA of 25 nucleotides of human genome) is embedded in each bead. The fluorescent DNA fragments containing genetic information of individual will match with their complementary on the beads in hybridization step. A computational system then scan and analyse the coloured signal emitted from the beads to output the allele densities. The number of well per plate and bead per well is integrated increasingly day by day, allowing us to interrogate millions of SNP locus simutaneously. The Illumina whole-genome BeadChips (Steemers et al., 2006; Peiffer et al., 2006) is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as the input and captures genotypes across 1.2. SNP Genotyping 6 Figure 1.2: BeadChip work flow. n SNP sites. After several chemical and preprocessing steps, it outputs a matrix G = {gij,i=1,...,m;j=1,...,n } where gij = (xij , yij ) are the raw intensities which indicate the SNP genotype of sample i at SNP site j. The intensities xij , yij represent for allele 0, 1 of SNP genotype gij . Methods In genotyping problem, if microarray chips are hardwares, then it is necessary to have appropriate softwares to deal with the data. Computational methods to determine genotype types from raw intensity data have been developed along with each type of chip. There are several methods available (also known as callers) for Illumina Bead Array. In this work, we only consider the three most popular methods, namely Gencall (in GenomeStudio software), Illuminus and GenoSNP. Gencall(Illumina Inc., 2005) is the earliest tool still being in use in GenomeStudio tool set. Illuminus(Teo et al., 2007) was developed by Yik Teo et al. as a further tool focusing on the ability to process large amount of data available from advanced chips. GenoSNP(Giannoulatou et al., 2008) is a novel approach that could work out with little experimented samples available. 1.3. Quality Control and Quality Assurance 1.3 7 Quality Control and Quality Assurance Large-scale stastical studies are supceptible to errors. Quality Control and Quality Assurance (QC/QA) process is considered as a critical phase that has an significant affect to the accuracy of computational model. There are a number of proposed protocols have been made to ensure the data quality, avoiding spurious conclusions and mistaken results. QC is defined as a process of monitoring and controlling the quality of data as it is being generated, whereas QA is used to review the product quality after that (Laurie et al., 2010). The content of this thesis is related to QC procedure of genotype calling problem. Frequently in cleaning genotype data, it consists of two separate phases: samples filter and SNPs filter. We just focus on the later in this article. In which, low-quality SNPs are called bad SNPs and should be filtered out. The most common and widely protocol used in evaluating SNPs quality is based on measurements such as HardyWeinberg Equilibrium, missing rate, MAF (minor allele frequency), Mendel error etc. Although all three callers have applied their own advanced methods to filter out these bad SNPs from the data set, it is highly possible that a number of faulty ones still remained in the dataset. In a typical QC process for genotyping problem, there are three variables considered. They are SNP’s missing rate, HWE (Hardy-Weinberg equilibrium) and MAF (minor allele frequency). In fact, to improve the efficiency of filtering process, several extra conditions are also included. For instance, there is an suggestion about association tests based on duplicate discordance, Mendelian errors, sex differences and heterozygosity rate in addition (Laurie et al., 2010). Briefly, in a SNP locus: • Missing rate or missing proportion (MSP) at a SNP indicates how much samples failed in this call: MSP = number of no calls total of samples As could be seen clearly, the higher missing rate indicates the poorer genotype calling performance. The loci with MSP greater than 5% usually are considered as problematic SNPs. • In a biallelic locus which is in Hardy-Weinberg equilibrium and minor allele frequency is q then the probalilities of three possible genotype 0/0, 0/1 and 1.3. Quality Control and Quality Assurance 8 1/1 are (1−q)2 , 2q(1−q), q 2 . These probabilities should be stable, meaning the real values are nearly the same with the expected values. Significant deviation of HWE tests is typically imply gross genotyping error. A probalility test value, or p-value, is used to estimate this difference. • MAF is the ratio of minor (smaller) alleles counted in the whole set of alleles: MAF = number of minor alleles total alleles Low MAF means that there exist a cluster among three with fewer samples. As the consequence, SNPs with low MAF are more prone to error, since almost calling methods that based on clustering do not work well in these cases. An proposed criteria for a standard-quality SNP genotyping are HWE p-value greater than 0.00033, MSP in average is less than 3% but maximum must not pass 10%, while MAF and quality score greater than a pre-defined figure which varies from study to study (Group, 2007; Laurie et al., 2010). In fact, a combination of these variables are required to generate more dependable results of filtering. However, as stated before, these thresholds are not fixed in every experiments and very hard to be automated. These numbers depends on many factors, giving a rough evaluation to the data quality. The final conclusions of bad SNPs have to be made manually by an expert-guided process to avoid false positive cases which are potentially missed disease variants in the study. A typical marker (SNP in this case) QC process of GWAS data includes at least four steps as follow (Anderson Carl A, ): • Filter out SNPs with excessive missing rate. • Identify SNPs with extreme deviation from HWE. • Find every SNPs that have discrepancy of missing rates between cases and controls. • Detect low-MAF markers. In this suggestion, the limit for missing rate is 5% (1% for SNPs of low frequency M AF ≤ 5%). Every SNPs with HWE p-value less than 0.001 will be examined by experts for quality. The final step is to remove SNPs showing very low MAF, the threshold of 1-2% is applied typically. The authors also emphasize that checking cluster plots manually is the best safeguard for quality evaluation. 1.3. Quality Control and Quality Assurance 9 In this thesis, we propose a new method to measure the difference among results created from different callers. At first, the next section will provide information about genotype calling problem and current callers. Then we will introduce our method of utilizing Jeffreys distance to find bad SNPs. At the end, our experimental results will be evaluated statistically and plotted visually to estimate the effectiveness of this approach. Lastly, there comes our conclusion and further discussion about the work. Chapter 2 Related Work Given an matrix G = {gij,i=1,...,m;j=1,...,n } with gij = (xij , yij ), representing the genotype intensity of m samples at n SNP loci. The task is to assign every pairs to its most suitable genotype label, 00, 01 or 11 (or no call cluster if impossible). This problem is call SNP genotype calling or in short, SNP genotyping. 2.1 Naive method for SNP genotyping The most simple method for SNP genotyping problem use the correlation between two allelic intensities (xij , yij ). For example, with a data point (x, y): • If xij yij then this SNP genotype could be labelled as 00 (homozygous of allele 0) • If xij yij then it should belong to genotype class 11 (homozygous of allele 1) • Or if xij ' yij then this SNP calling should be 01 (heterozygous) However, this naive method could not work well with the ambiguous cases and its performance is poor with the tremendous amount of data. In fact, statistical approach are used to overcome this problem. Illuminus, GenoSNP and Gencall are there most popular callers for 2.5M BeadChips of Illumina. They use different clustering algorithms to group samples into three clusters 00, 01, 11. Samples which do not belong any clusters are called outlier. 10 2.2. GenCall 2.2 11 GenCall Gencall (Illumina Inc., 2005) is the canonical caller developed for Illumina microarray from the very beginning. The work-flow of this software in SNP genotype calling is described as follow: • Firstly, the programs read the input data of bead intensities from idats file (output by scanning process). Then the data belonging to each bead pool (group of beads that highly related to each others) is normalized and converted to affine coordinate (Ritchie et al., 2011). • Secondly, an algorithm known as GenTrain is utilized for the unsupervised clustering step. The transformed data of every samples of each SNP are modelled, then combined with several heuristic information in order to be analysed by a neural network. The centroids for each of three clusters are estimated, forming three genotype groups with the shape and position of them. After the clustering process, a statistical score (called GenTrain score) that combined a number of factors, penalty terms... is assigned to each SNP locus in order to evaluate the resulting clusters in a way similar to human expert’s visual. • The GenTrain score, together with clustering information is then used with a Bayesian model to establish the classification. Finally, the calling score for each call (GenCall Score) is generated as a confident degree for that call. GenCall score is not a probability value, but is used to represent the quality of genotyping and might be varied between different loci or chromosomes. However any call with score less than 0.2 will be considered as a failure due to the lack of confidence (assigned as no call in this case), while good calls are indicated by the score greater than 0.7. Every SNP genotypes that have GenCall score fall between 0.2 and 0.7 are qualified and needed to be checked carefully from other point of views. It worth noting that GenCall run the top/bottom strand correlation studies, in which each SNP is genotyped twice for two strands respectively and then compared to ensure the agreement. This mechanism also helps Gencall to filter out some faulty calls from the last result. 2.3. Illuminus 2.3 12 Illuminus Illuminus (Teo et al., 2007) is one of the most popular callers for 2.5M BeadChip of Illumina. The whole working process of genotype calling methods are quite the same with several overall common phases, namely data normalization, clustering, calling and sometimes validation. In this second tool of interests, much efforts in preprocessing data is spent by adding further procedures to ease the clustering process. Assume that normalized data generated from Illumina BeadStudio software (also the input of GenCall algorithm) are in form (xij , yij ), where (xij and yij ) are the signal intensities for the two alleles 0 and 1 respectively, belong to a certain sample i at a certain SNP j. Illuminus then converts those figures into new measures, namely contrast and strength defined as follow: cij = xij − yij xij + yij sij = log(xij + yij ) Secondly, the clustering algorithm take those values as input and by using EM (Expectation Maximization) framework on them, it try to fit a bivariate mixture model with three t-distributed components (as three genotype clusters) and a Gaussian component for outliers (corresponding to no call cluster). The parameters that determine the shape and location of components are then re-estimated iteratively until they maximize the fitness likelihood and become stable. Next, the posterior probabilities are calculated to classify the samples into appropriate genotype states. For details of this technique, please refer to its publication(Teo et al., 2007). There is also a method to ensure the stability of calling results in Illuminus, called pertubation analysis. This process means that every SNPs has been called twice, in original data (xij , yij ) and pertubated data (xij + , yij + ) respectively, then the concordance between the two result is estimated to determine whether this SNP is valid or not. 2.4 GenoSNP GenoSNP (Giannoulatou et al., 2008) is another prominent caller that work with Illumina BeadChip. Firstly, the data transformation method of this tool is different compared to the other two. Start with pairs of raw intensities (x, y), it generates 2.5. Discussion 13 (log10 (x + 1), log10 (y + 1)) This method also utilizes the EM framework for mixture model of Student distributed components as in Illuminus. However it uses a different method to fit intensities data to the model. Instead of clustering the probe intensities across of individuals at each SNP as in two method stated above, GenoSNP develops the model within a single individual based on the log scale of the normalized figures. This novel approach is able to overcome the problem of other methods in that the accuracy is highly depend on the number of control samples, which sometimes in reality cannot be afforded. Moreover, this method gives a perfect solution for studies where data is typed by different chips. It is worth noting that GenoSNP has two versions of using EM algorithm for re-estimate parameters step. Beside the original version, there is another approach called Variational Bayes expectation Maximization (VB-EM) that differ from the former by E-step. VB-EM is proved to be more robust about the uncertainty of model parameters (Giannoulatou et al., 2008), therefore improving the performance of the original algorithm. After running the calling algorithm, there also measures of confidence for each call, which is the posterior probability representing the possibility the call coming from the class assigned. This confident degree is also used in filter the poorer data of SNP or samples. 2.5 Discussion Many comparisons have been made between these three callers (Giannoulatou et al., 2008; Teo et al., 2007; Ritchie et al., 2011). In general, each of these methods have their own advantages and disadvantages. For example, Illuminus is the caller with highest call rate (equivalent to lowest missing rate), but the accuracy is not the best. This drawback happens due to the bad effect of noisy samples, or outliers to the clustering algorithm of Illuminus is more critical than the other methods. GenCall, on the other hand, is appreciated by its high quality calling, but the call rate is relative low among three. Finally, GenoSNP has more ability in balancing the trade-off between the two factors. In addition, it could work well with fewer-sample dataset. However, for a sufficient dataset, the efficiency of this method will falls behind the two remaining. To conclude, all of these callers in general could work quite well with arbitrary dataset with call rate and accuracy usually exceed 95%. A combination of all three 2.5. Discussion 14 callers, also known as a consensus calling, will increases the performance of genotype calling problem with no doubt. This very interesting idea has become motivation for our work. That is, by considering the results of three callers and making cross references, we could find the problematic SNPs that have bad effect to the calling process. Chapter 3 Method The main idea of our method was risen from the fact that the failed SNPs tend to confuse the clustering algorithms of callers due to their noisy informations, also known as outliers. As the consequence, the three genotype classes clustered by Illuminus, GenoSNP and Gencall for these SNPs are not consistent. They are conflict or diverged when compare to each others. By estimating this kind of quantity in the whole dataset using informative distance, we could compare and determine the SNPs with largest conflicts and assign them as bad SNPs. More details are described below. 3.1 Kullback-Leibler divergence Kullback-Leibler divergence is widely used in statistical analysis to estimate the similarity between two distributions. We also use this measurement to compute the degree of divergence among genotype callers. Kullback-Leibler divergence or relative entropy D(P ||Q) between two distributions with probability mass functions f (x) and g(x) respectively, could be computed as follows: D(P ||Q) = X f (x) log x∈X 15 f (x) g(x) (3.1) 3.2. Approximate relative entropy between two Student distributions 16 For two γ-dimensional normal distributions f (µf , Σf ) and g(µg , Σg ), there is a closed form expression given as: D(f ||g) = 1 |Σg | (log + Tr[Σ−1 g Σf ] 2 |Σf | + (µf − µg )T Σ−1 g (µf − µg ) − γ) (3.2) in which γ = 2 for our genotyping problem. Using this formula, we could directly estimate with ease how different two bivariate Gaussian distributions are. In fact, each of three genotype classes should be fit with an Student distribution, which is supposed to be more robust and outliers-tolerated than a Normal. Unfortunately, there is no such closed form expression exists to estimate KL divergence between these distributions of interest, hence an approximation is sufficient. One solution comes straightly from expression 3.1, by applying Monte Carlo sampling method. Briefly, a sample xi is drawn from the pdf f such that Ef [log f (xi )/g(xi )] = D(f ||g) and by drawing n i.i.d samples {xi }ni=1 we have DM C (f ||g) = n 1X f (xi ) log → D(f ||g) n i=1 g(xi ) as n → ∞ (Hershey & Olsen, 2007). This method will return arbitrary accuracy as wished, however its expensive calculation did not meet the criteria to large amount of data in 2.5M chip. Thus another approximation from (El Attar et al., 2009) are taken into consideration, and it proved to be more suitable than the former in balancing the trade-off between accuracy and computational cost. 3.2 Approximate relative entropy between two Student distributions 3.2.1 Approximate Student distribution The first step of this method is to identify the relationship between a Student distribution with appropriate Gaussian densities, so that we could take advances of the closed form formula 3.2. This could be done based on an Equation that express a Student-t distribution S as an infinite mixture of Gaussian distributions with the 3.2. Approximate relative entropy between two Student distributions 17 same mean µ, but different precision matrices Λ: S(x; µ, Λ, υ) = +∞ Z υ υ N (x; µ, uΛ)G(u; , )du 2 2 0 (3.3) or could be rewritten equivalently as in (El Attar et al., 2009), when we consider the variance distributed according to an inverse gamma distribution rather than with precision distributed as a gamma distribution. S(x; µ, Σ, υ) = +∞ Z 0 υ υ N (x; µ, Σ/u)G(u; , )du 2 2 (3.4) where N (x; µ, Σ/u) is a Normal distribution of x with mean and covariance matrices µ and Σ/u; while G(u; υ2 , υ2 ) is a Gamma distribution of u with two given parameters shape and rate respectively. By simplifying the above Equation, a Student distribution could be approximated by a finite mixture of P Norms: S(x) = P Σ 1 X N (x; µ, ) P i=1 ui (3.5) where {ui }Pi=1 are randomly draw from Gamma distribution G(u; υ2 , υ2 ). The value of P is significantly affect the accuracy of the approximation and should be chosen according to the degree of freedom υ. We could see the effectiveness of this approximation via Figure 3.1. In which, the first graph shows several Normal components (grey curves) making the approximation of t-distribution (red curve), while the second compares the real Student distribution (black curve) versus its approximation (the red one). 3.2.2 The matched bound approximation There exist several approaches to estimate how different two mixture of normal distribution components are, and they are all listed and compared in (Hershey & Olsen, 2007). Among them, the matched bound approximation (Goldberger et al., 2003) are proved to be one of the two best approaches with lower cost but better result in almost case than the remainings. Because each of Student distribution is approximated by mixture of same number

- Xem thêm -

Tài liệu Phát hiện các SNP xấu từ Illumina BeadChips sử dụng khoảng cách Jeffreys

Tài liệu liên quan

Tài liệu vừa đăng

Tài liệu xem nhiều nhất