Table of Contents
Overview
1
1 Introduction
1.1 Biological Background . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 SNP Genotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Quality Control and Quality Assurance . . . . . . . . . . . . . . . . .
2 Related Work
2.1 Naive method for SNP genotyping
2.2 GenCall . . . . . . . . . . . . . .
2.3 Illuminus . . . . . . . . . . . . . .
2.4 GenoSNP . . . . . . . . . . . . .
2.5 Discussion . . . . . . . . . . . . .
3
3
5
7
.
.
.
.
.
10
10
11
12
12
13
.
.
.
.
.
15
15
16
16
17
19
4 Experimental Results
4.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
23
24
26
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3 Method
3.1 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . .
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution . . . . . . . . . . . . .
3.2.2 The matched bound approximation . . . . . . . . . . . .
3.3 Estimate conflict degree between three callers . . . . . . . . . .
Conclusion
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
34
ii
List of Figures
1.1
1.2
An example of a SNP detected by an alignment of two DNA sequences.
BeadChip work flow. . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1
3.2
Approximate t-distribution by a mixture of Gaussians . . . . . . . . 18
An example of Matched Bound Approximate method with two mixtures of three components . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1
4.2
4.3
4.4
Input file format. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Histograms of three metrics in term of conflict score. . . . . . . . .
Erroneous loci filtered by applied two functions but minimum. . . .
HWE, MAF and missing rate histogram of three callers, with original
and filtered data. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
iii
4
6
. 23
. 26
. 27
. 29
List of Tables
4.1
4.2
4.3
4.4
Result
Result
Result
Result
of
of
of
of
missing rate filter. . .
MAF filter. . . . . . .
HWE exact test filter.
synthesis criteria. . . .
.
.
.
.
iv
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
28
28
30
30
Overview
Recently, Genome-Wide Association Study (GWAS), also known as Whole Genome
Association Study (WGAS), proved to be an successful strategy in identifying genetic variants associated with common diseases or complicated phenotypes. This
method focuses on associations between Single Nucleotide Polymorphisms (SNP)
and traits. Because GWAS investigates the entire genome rather than just testing
one or a few genetic regions, there will be a large extreme amount of SNP genotype
data needed to be called. As the consequence, different impressive methods have
been developed to deal with the problem of genotype calling automatically and efficiently. In general, these programs try to translate probe hybridization intensities
outputed from microarray chips into SNP genotypes. In the ideal case, a call at
a SNP loci generates three clusters of signals, so that SNP genotype for a certain
sample could be determined according to which cluster it belongs to. However, in
fact, the ambiguous data and errors happen frequently due to many reasons.
Quality Control (QC), as an indispensable step, is included in every studies to
remove these error data from the dataset as much as possible. Unfortunately, this
kind of work requires significant time and effort because it is hardly automated
completely. It is also difficult to find the bad data among the good ones when the
applied criteria are not clear for these cases. Although many statistical methods
have been proposed, expert-guided evaluations are usually needed to determine the
ultimate results. On top of that, the statistical variables used in considering SNP
genotype quality have thresholds that depending on conditions, making obstacles
on the way of finding a automatic solution based on them. Therefore, an novel
approach beside the traditional QC process is necessary to make this important
step self-regulating as much as possible.
In our work, we developed a new method to detect bad SNPs. Our method
takes advantage of three available genotype callers for Illumina BeadChip, namely
GenCall, GenoSNP and Illuminus. The general idea comes from the fact that data
1
LIST OF TABLES
2
of bad SNPs tend to confuse the callers, so the calling results of different methods
are usually not consistent. That is, by comparing visually the cluster clouds of three
callers SNP by SNP, we could recognize these bad loci. The distance in informatics
theory (relative entropy) is applied to examine these dissimilarities so that bad SNPs
could be detected automatically. We applied our method with real-life data and it
proved to be a good protocol with satisfactory results. Despite of the fact that
there still a lot of work to do, this method really has the ability to help experts in
confirming the filtered data, also suggests potential bad SNPs that hard to find by
traditional QC process.
The thesis is organized in five chapters. An introduction to the definitions and
background knowledge is given in Chapter 1. At first, this chapter will provide
the very first images in bioinformatics by explaining briefly some terminologies in
molecular biology, such as human genome, DNA, SNP and SNP genotyping . . . It
also offer further information relating to our work, about solutions for genotype
calling problem and quality control process. Attending these sections are necessary
for better understanding the later content of the thesis.
Chapter 2 will describe in more details about the available genotype callers.
The general idea for a simple caller will be given at the beginning to give the most
basic solution for the genotype calling problem. After that, three callers used in our
algorithm are explained thoroughly. Information provided in this chapter will help
reader get the first step further toward our solution.
The proposed method is presented in the third chapter. We will review the
definitions about relative entropy (also known as Kullback-Leibler divergence) and
its application with Normal distribution. The next section will state an available
method to calculate relative entropy between Student distributions, which is required
to deal with our problem. Finally, an estimate technique is proposed to measure the
dissimilarities between outputs of three callers.
The experiment process of applying our protocol with real data is offered in
Chapter 4. At first, input data description is given. After that, there is a step of
try-and-error experiment to find the most appropriate parameters that should be
used in our program. Finally, the result of bad SNPs detected by our method is
evaluated by compared with other traditional QC criteria.
The final chapter will give you the overall conclusion and further discussion about
the thesis.
Chapter 1
Introduction
This chapter will provide some background knowledge in molecular biology and
bioinformatics ralated to my thesis topic.
1.1
Biological Background
Human Genome In the nucleus of almost every cell, there exist a common substrate that code for entire body of human beings - known as DNA. This large and
complex-structure molecule is composed of numerous pairs of nucleotides, namely
Adenine, Cytosine, Guanine and Thymine. DNA has two strands, but tracking one
of them is sufficient (the other could be determined by pairing rule).
Human genome refers to the whole genetic makeups of human species, which
means every DNA sequences stored on 23 chromosome pairs and small mitochondrial DNA. They are encoded and represented as databases of very long strings of
four alphabets (or bases) {A, C, G, T} which stand for four types of nucleotides
respectively. There are nearly three billions characters in total for a completed
genome, which are inherited and passed down generations by generations.
Single Nucleotide Polymorphisms : Human genomes from different individuals are about 99.9% identical. The only 0.1% difference make particular characteristics of individuals (Collins & McKusick, 2001). Single Nucleotide Polymorphisms
(SNP) are of these important biological different markers. Knowing information
about SNP and SNP genotype would contributes significantly to Genome-wide association studies (GWAS), helping scientists understand better the complicated hu-
3
1.1. Biological Background
4
man genomes and their inheritance mechanism. As extremely important biological
markers, these tiny changes could influence decisively to various exposures, especially the ones related to diseases are in the top hot topics recently A SNP is defined
as a single base change in a DNA sequence. The positions where SNP occurred are
Figure 1.1: An example of a SNP detected by an alignment of two DNA sequences.
called SNP site. Note that there usually happen only two possible nucleotide variants for any SNP site. The nucleotide variants at SNP sites are called alleles. The
rare polymorphisms with more than two bases are out of consideration, therefore
we could just use 0 and 1 to denote the two possible alleles.
For instance, Figure 1.1 shows two possible DNA sequences at the same locus
in the human genome. Considering the green strands, we have two sequences that
differ in one location (G versus A):
1 : . . . AACGGATCCAC . . .
2 : . . . AACGAATCGAC . . .
1.2. SNP Genotyping
5
The SNP in this site could be written along with the sequence as follow:
G
. . . AACG[ ]ATCCAC . . .
A
SNP Genotype : A pair of alleles each from one chromosome copy in a deploy
organism at a SNP site is called SNP genotype. Since there are two different alleles
at a SNP site, there are three different types for a SNP genotype and denoted as:
00, 01, and 11 (equivalent to the terminology of allele aa, aA, and AA respectively)
where 0 and 1 represents from two different nucleotide variants at this SNP site. 00
and 11 are called homozygous genotypes while 01 is called heterozygous genotype.
Analyzing SNPs and SNP genotypes helps us better understand the complicated
human genomes and their inheritance mechanisms. These nucleotide changes could
relate to diseases and studying these changes is recently a hot topic. For example,
in Figure 1.1, if these two DNA sequences above come from a common locus of
two chromatids in a chromosome of one individual, then the SNP genotype for that
person is 01.
1.2
SNP Genotyping
Illumina BeadChips The need of simultaneous assaying a large number of individuals (samples) over million of SNP sites has led to the innovation of biological
integrated chips industry. To date, there are many different manufacturers working
on this field, produce various type of microarrays such as Affymetrix GeneChip, Illumina Infinium Beadchips, Perlegen, Invader, etc. The microarray beadchip contains
a number of well plate, each well has thousands of beads. A known 25-mer oligonucleotides (a predefined short sequence DNA of 25 nucleotides of human genome) is
embedded in each bead. The fluorescent DNA fragments containing genetic information of individual will match with their complementary on the beads in hybridization
step. A computational system then scan and analyse the coloured signal emitted
from the beads to output the allele densities. The number of well per plate and bead
per well is integrated increasingly day by day, allowing us to interrogate millions of
SNP locus simutaneously.
The Illumina whole-genome BeadChips (Steemers et al., 2006; Peiffer et al.,
2006) is one of the most popular microarray technologies used to study human
SNPs. It takes m individuals (samples) as the input and captures genotypes across
1.2. SNP Genotyping
6
Figure 1.2: BeadChip work flow.
n SNP sites. After several chemical and preprocessing steps, it outputs a matrix
G = {gij,i=1,...,m;j=1,...,n } where gij = (xij , yij ) are the raw intensities which indicate
the SNP genotype of sample i at SNP site j. The intensities xij , yij represent for
allele 0, 1 of SNP genotype gij .
Methods In genotyping problem, if microarray chips are hardwares, then it is necessary to have appropriate softwares to deal with the data. Computational methods
to determine genotype types from raw intensity data have been developed along
with each type of chip. There are several methods available (also known as callers)
for Illumina Bead Array. In this work, we only consider the three most popular
methods, namely Gencall (in GenomeStudio software), Illuminus and GenoSNP.
Gencall(Illumina Inc., 2005) is the earliest tool still being in use in GenomeStudio
tool set. Illuminus(Teo et al., 2007) was developed by Yik Teo et al. as a further
tool focusing on the ability to process large amount of data available from advanced
chips. GenoSNP(Giannoulatou et al., 2008) is a novel approach that could work out
with little experimented samples available.
1.3. Quality Control and Quality Assurance
1.3
7
Quality Control and Quality Assurance
Large-scale stastical studies are supceptible to errors. Quality Control and Quality
Assurance (QC/QA) process is considered as a critical phase that has an significant
affect to the accuracy of computational model. There are a number of proposed
protocols have been made to ensure the data quality, avoiding spurious conclusions
and mistaken results. QC is defined as a process of monitoring and controlling the
quality of data as it is being generated, whereas QA is used to review the product
quality after that (Laurie et al., 2010). The content of this thesis is related to QC
procedure of genotype calling problem.
Frequently in cleaning genotype data, it consists of two separate phases: samples
filter and SNPs filter. We just focus on the later in this article. In which, low-quality
SNPs are called bad SNPs and should be filtered out. The most common and widely
protocol used in evaluating SNPs quality is based on measurements such as HardyWeinberg Equilibrium, missing rate, MAF (minor allele frequency), Mendel error
etc. Although all three callers have applied their own advanced methods to filter
out these bad SNPs from the data set, it is highly possible that a number of faulty
ones still remained in the dataset.
In a typical QC process for genotyping problem, there are three variables considered. They are SNP’s missing rate, HWE (Hardy-Weinberg equilibrium) and MAF
(minor allele frequency). In fact, to improve the efficiency of filtering process, several extra conditions are also included. For instance, there is an suggestion about
association tests based on duplicate discordance, Mendelian errors, sex differences
and heterozygosity rate in addition (Laurie et al., 2010). Briefly, in a SNP locus:
• Missing rate or missing proportion (MSP) at a SNP indicates how much samples failed in this call:
MSP =
number of no calls
total of samples
As could be seen clearly, the higher missing rate indicates the poorer genotype calling performance. The loci with MSP greater than 5% usually are
considered as problematic SNPs.
• In a biallelic locus which is in Hardy-Weinberg equilibrium and minor allele
frequency is q then the probalilities of three possible genotype 0/0, 0/1 and
1.3. Quality Control and Quality Assurance
8
1/1 are (1−q)2 , 2q(1−q), q 2 . These probabilities should be stable, meaning the
real values are nearly the same with the expected values. Significant deviation
of HWE tests is typically imply gross genotyping error. A probalility test
value, or p-value, is used to estimate this difference.
• MAF is the ratio of minor (smaller) alleles counted in the whole set of alleles:
MAF =
number of minor alleles
total alleles
Low MAF means that there exist a cluster among three with fewer samples.
As the consequence, SNPs with low MAF are more prone to error, since almost
calling methods that based on clustering do not work well in these cases.
An proposed criteria for a standard-quality SNP genotyping are HWE p-value
greater than 0.00033, MSP in average is less than 3% but maximum must not pass
10%, while MAF and quality score greater than a pre-defined figure which varies from
study to study (Group, 2007; Laurie et al., 2010). In fact, a combination of these
variables are required to generate more dependable results of filtering. However, as
stated before, these thresholds are not fixed in every experiments and very hard to
be automated. These numbers depends on many factors, giving a rough evaluation
to the data quality. The final conclusions of bad SNPs have to be made manually by
an expert-guided process to avoid false positive cases which are potentially missed
disease variants in the study. A typical marker (SNP in this case) QC process of
GWAS data includes at least four steps as follow (Anderson Carl A, ):
• Filter out SNPs with excessive missing rate.
• Identify SNPs with extreme deviation from HWE.
• Find every SNPs that have discrepancy of missing rates between cases and
controls.
• Detect low-MAF markers.
In this suggestion, the limit for missing rate is 5% (1% for SNPs of low frequency
M AF ≤ 5%). Every SNPs with HWE p-value less than 0.001 will be examined by
experts for quality. The final step is to remove SNPs showing very low MAF, the
threshold of 1-2% is applied typically. The authors also emphasize that checking
cluster plots manually is the best safeguard for quality evaluation.
1.3. Quality Control and Quality Assurance
9
In this thesis, we propose a new method to measure the difference among results
created from different callers. At first, the next section will provide information
about genotype calling problem and current callers. Then we will introduce our
method of utilizing Jeffreys distance to find bad SNPs. At the end, our experimental results will be evaluated statistically and plotted visually to estimate the
effectiveness of this approach. Lastly, there comes our conclusion and further discussion about the work.
Chapter 2
Related Work
Given an matrix G = {gij,i=1,...,m;j=1,...,n } with gij = (xij , yij ), representing the
genotype intensity of m samples at n SNP loci. The task is to assign every pairs to
its most suitable genotype label, 00, 01 or 11 (or no call cluster if impossible). This
problem is call SNP genotype calling or in short, SNP genotyping.
2.1
Naive method for SNP genotyping
The most simple method for SNP genotyping problem use the correlation between
two allelic intensities (xij , yij ). For example, with a data point (x, y):
• If xij yij then this SNP genotype could be labelled as 00 (homozygous of
allele 0)
• If xij yij then it should belong to genotype class 11 (homozygous of allele
1)
• Or if xij ' yij then this SNP calling should be 01 (heterozygous)
However, this naive method could not work well with the ambiguous cases and
its performance is poor with the tremendous amount of data. In fact, statistical
approach are used to overcome this problem. Illuminus, GenoSNP and Gencall
are there most popular callers for 2.5M BeadChips of Illumina. They use different
clustering algorithms to group samples into three clusters 00, 01, 11. Samples which
do not belong any clusters are called outlier.
10
2.2. GenCall
2.2
11
GenCall
Gencall (Illumina Inc., 2005) is the canonical caller developed for Illumina microarray from the very beginning. The work-flow of this software in SNP genotype calling
is described as follow:
• Firstly, the programs read the input data of bead intensities from idats file
(output by scanning process). Then the data belonging to each bead pool
(group of beads that highly related to each others) is normalized and converted
to affine coordinate (Ritchie et al., 2011).
• Secondly, an algorithm known as GenTrain is utilized for the unsupervised
clustering step. The transformed data of every samples of each SNP are modelled, then combined with several heuristic information in order to be analysed
by a neural network. The centroids for each of three clusters are estimated,
forming three genotype groups with the shape and position of them. After the
clustering process, a statistical score (called GenTrain score) that combined a
number of factors, penalty terms... is assigned to each SNP locus in order to
evaluate the resulting clusters in a way similar to human expert’s visual.
• The GenTrain score, together with clustering information is then used with
a Bayesian model to establish the classification. Finally, the calling score for
each call (GenCall Score) is generated as a confident degree for that call.
GenCall score is not a probability value, but is used to represent the quality of
genotyping and might be varied between different loci or chromosomes. However
any call with score less than 0.2 will be considered as a failure due to the lack of
confidence (assigned as no call in this case), while good calls are indicated by the
score greater than 0.7. Every SNP genotypes that have GenCall score fall between
0.2 and 0.7 are qualified and needed to be checked carefully from other point of
views.
It worth noting that GenCall run the top/bottom strand correlation studies, in
which each SNP is genotyped twice for two strands respectively and then compared
to ensure the agreement. This mechanism also helps Gencall to filter out some faulty
calls from the last result.
2.3. Illuminus
2.3
12
Illuminus
Illuminus (Teo et al., 2007) is one of the most popular callers for 2.5M BeadChip
of Illumina. The whole working process of genotype calling methods are quite the
same with several overall common phases, namely data normalization, clustering,
calling and sometimes validation. In this second tool of interests, much efforts
in preprocessing data is spent by adding further procedures to ease the clustering
process. Assume that normalized data generated from Illumina BeadStudio software
(also the input of GenCall algorithm) are in form (xij , yij ), where (xij and yij ) are the
signal intensities for the two alleles 0 and 1 respectively, belong to a certain sample i
at a certain SNP j. Illuminus then converts those figures into new measures, namely
contrast and strength defined as follow:
cij =
xij − yij
xij + yij
sij = log(xij + yij )
Secondly, the clustering algorithm take those values as input and by using EM
(Expectation Maximization) framework on them, it try to fit a bivariate mixture
model with three t-distributed components (as three genotype clusters) and a Gaussian component for outliers (corresponding to no call cluster). The parameters that
determine the shape and location of components are then re-estimated iteratively
until they maximize the fitness likelihood and become stable. Next, the posterior
probabilities are calculated to classify the samples into appropriate genotype states.
For details of this technique, please refer to its publication(Teo et al., 2007).
There is also a method to ensure the stability of calling results in Illuminus,
called pertubation analysis. This process means that every SNPs has been called
twice, in original data (xij , yij ) and pertubated data (xij + , yij + ) respectively,
then the concordance between the two result is estimated to determine whether this
SNP is valid or not.
2.4
GenoSNP
GenoSNP (Giannoulatou et al., 2008) is another prominent caller that work with
Illumina BeadChip. Firstly, the data transformation method of this tool is different
compared to the other two. Start with pairs of raw intensities (x, y), it generates
2.5. Discussion
13
(log10 (x + 1), log10 (y + 1)) This method also utilizes the EM framework for mixture
model of Student distributed components as in Illuminus. However it uses a different
method to fit intensities data to the model. Instead of clustering the probe intensities
across of individuals at each SNP as in two method stated above, GenoSNP develops
the model within a single individual based on the log scale of the normalized figures.
This novel approach is able to overcome the problem of other methods in that
the accuracy is highly depend on the number of control samples, which sometimes
in reality cannot be afforded. Moreover, this method gives a perfect solution for
studies where data is typed by different chips. It is worth noting that GenoSNP
has two versions of using EM algorithm for re-estimate parameters step. Beside
the original version, there is another approach called Variational Bayes expectation
Maximization (VB-EM) that differ from the former by E-step. VB-EM is proved
to be more robust about the uncertainty of model parameters (Giannoulatou et al.,
2008), therefore improving the performance of the original algorithm.
After running the calling algorithm, there also measures of confidence for each
call, which is the posterior probability representing the possibility the call coming
from the class assigned. This confident degree is also used in filter the poorer data
of SNP or samples.
2.5
Discussion
Many comparisons have been made between these three callers (Giannoulatou et al.,
2008; Teo et al., 2007; Ritchie et al., 2011). In general, each of these methods have
their own advantages and disadvantages. For example, Illuminus is the caller with
highest call rate (equivalent to lowest missing rate), but the accuracy is not the
best. This drawback happens due to the bad effect of noisy samples, or outliers
to the clustering algorithm of Illuminus is more critical than the other methods.
GenCall, on the other hand, is appreciated by its high quality calling, but the call
rate is relative low among three. Finally, GenoSNP has more ability in balancing the
trade-off between the two factors. In addition, it could work well with fewer-sample
dataset. However, for a sufficient dataset, the efficiency of this method will falls
behind the two remaining.
To conclude, all of these callers in general could work quite well with arbitrary
dataset with call rate and accuracy usually exceed 95%. A combination of all three
2.5. Discussion
14
callers, also known as a consensus calling, will increases the performance of genotype
calling problem with no doubt. This very interesting idea has become motivation
for our work. That is, by considering the results of three callers and making cross
references, we could find the problematic SNPs that have bad effect to the calling
process.
Chapter 3
Method
The main idea of our method was risen from the fact that the failed SNPs tend
to confuse the clustering algorithms of callers due to their noisy informations, also
known as outliers. As the consequence, the three genotype classes clustered by
Illuminus, GenoSNP and Gencall for these SNPs are not consistent. They are conflict
or diverged when compare to each others. By estimating this kind of quantity in
the whole dataset using informative distance, we could compare and determine the
SNPs with largest conflicts and assign them as bad SNPs. More details are described
below.
3.1
Kullback-Leibler divergence
Kullback-Leibler divergence is widely used in statistical analysis to estimate the
similarity between two distributions. We also use this measurement to compute
the degree of divergence among genotype callers. Kullback-Leibler divergence or
relative entropy D(P ||Q) between two distributions with probability mass functions
f (x) and g(x) respectively, could be computed as follows:
D(P ||Q) =
X
f (x) log
x∈X
15
f (x)
g(x)
(3.1)
3.2. Approximate relative entropy between two Student distributions 16
For two γ-dimensional normal distributions f (µf , Σf ) and g(µg , Σg ), there is a closed
form expression given as:
D(f ||g) =
1
|Σg |
(log
+ Tr[Σ−1
g Σf ]
2
|Σf |
+ (µf − µg )T Σ−1
g (µf − µg ) − γ)
(3.2)
in which γ = 2 for our genotyping problem. Using this formula, we could directly
estimate with ease how different two bivariate Gaussian distributions are. In fact,
each of three genotype classes should be fit with an Student distribution, which is
supposed to be more robust and outliers-tolerated than a Normal. Unfortunately,
there is no such closed form expression exists to estimate KL divergence between
these distributions of interest, hence an approximation is sufficient. One solution
comes straightly from expression 3.1, by applying Monte Carlo sampling method.
Briefly, a sample xi is drawn from the pdf f such that Ef [log f (xi )/g(xi )] = D(f ||g)
and by drawing n i.i.d samples {xi }ni=1 we have
DM C (f ||g) =
n
1X
f (xi )
log
→ D(f ||g)
n i=1
g(xi )
as n → ∞ (Hershey & Olsen, 2007).
This method will return arbitrary accuracy as wished, however its expensive
calculation did not meet the criteria to large amount of data in 2.5M chip. Thus
another approximation from (El Attar et al., 2009) are taken into consideration, and
it proved to be more suitable than the former in balancing the trade-off between
accuracy and computational cost.
3.2
Approximate relative entropy between two Student distributions
3.2.1
Approximate Student distribution
The first step of this method is to identify the relationship between a Student distribution with appropriate Gaussian densities, so that we could take advances of the
closed form formula 3.2. This could be done based on an Equation that express a
Student-t distribution S as an infinite mixture of Gaussian distributions with the
3.2. Approximate relative entropy between two Student distributions 17
same mean µ, but different precision matrices Λ:
S(x; µ, Λ, υ) =
+∞
Z
υ υ
N (x; µ, uΛ)G(u; , )du
2 2
0
(3.3)
or could be rewritten equivalently as in (El Attar et al., 2009), when we consider the
variance distributed according to an inverse gamma distribution rather than with
precision distributed as a gamma distribution.
S(x; µ, Σ, υ) =
+∞
Z
0
υ υ
N (x; µ, Σ/u)G(u; , )du
2 2
(3.4)
where N (x; µ, Σ/u) is a Normal distribution of x with mean and covariance matrices
µ and Σ/u; while G(u; υ2 , υ2 ) is a Gamma distribution of u with two given parameters
shape and rate respectively.
By simplifying the above Equation, a Student distribution could be approximated
by a finite mixture of P Norms:
S(x) =
P
Σ
1 X
N (x; µ, )
P i=1
ui
(3.5)
where {ui }Pi=1 are randomly draw from Gamma distribution G(u; υ2 , υ2 ). The value
of P is significantly affect the accuracy of the approximation and should be chosen
according to the degree of freedom υ.
We could see the effectiveness of this approximation via Figure 3.1. In which,
the first graph shows several Normal components (grey curves) making the approximation of t-distribution (red curve), while the second compares the real Student
distribution (black curve) versus its approximation (the red one).
3.2.2
The matched bound approximation
There exist several approaches to estimate how different two mixture of normal
distribution components are, and they are all listed and compared in (Hershey &
Olsen, 2007). Among them, the matched bound approximation (Goldberger et al.,
2003) are proved to be one of the two best approaches with lower cost but better
result in almost case than the remainings.
Because each of Student distribution is approximated by mixture of same number
- Xem thêm -