VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
--------
NGUYEN HOANG SON
Detecting bad SNPs from Illumina
BeadChips using Jeffreys distance
MASTER THESIS OF INFORMATION
TECHNOLOGY
Hanoi
1 - 2012
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
--------
NGUYEN HOANG SON
DETECTING BAD SNPS FROM ILLUMINA
BEADCHIPS USING JEFFREYS DISTANCE
MASTER THESIS
Sector:
Information Technology
Major:
Computer Science
Code :
60 48 01
Supervised by:
Dr. Le Sy Vinh
Hanoi - 2012
1
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my
knowledge it contains no materials previously published or written by another
person, or substantial proportions of material which have been accepted for the
award of any other degree or diploma at UET or any other educational institution,
except where due acknowledgement is made in the thesis. Any contribution made
to the research by others is explicitly acknowledged in the thesis. I also declare
that the intellectual content of this thesis is the product of my own work, except
to the extent that assistance from others in the project’s design and conception
or in style, presentation and linguistic expression is acknowledged.’
Signed ........................................................................
i
Abstract
Current microarray technologies are able to assay thousands of samples over million
of SNPs simultaneously. Computational approaches have been developed to analyse
a huge amount of data from microarray chips to understand sophisticated human
genomes. The data from microarray chips might contain errors due to bad samples
or bad SNPs. In this thesis, a novel method is proposed to detect bad SNPs from the
probe intensities data of Illumina Beadchips. This approach measures the difference
among results determined by three software Illuminus, GenoSNP and Gencall to
detect the unstable SNPs. Experiment with SNP data in chromosome 20 of Kenyan
people demonstrates the usefulness of our method. This approach reduces the number of SNPs that are needed to check manually. Furthermore, it has the ability in
detecting bad SNPs that have not been recognized by other criteria.
Acknowledgements
Apart from the efforts of myself, the success of any project depends largely on the
encouragement and guidelines of many others. First and foremost, I would like to
thank to my supervisor Dr. Sy Vinh Le for the valuable guidance and advice. This
research project would not have been finished successfully without his continuously
support and assistance. His enthusiastic supervision helped me in all the time of
research and writing of this thesis.
I also would like to gratefully acknowledge the tremendous encouragement and
insightful comments of Dr. Si Quang Le during the time of research of this thesis. His
brilliant ideas helped me so much to overcome numerous problems and difficulties.
Any word is inadequate for his helpful aids.
The author would also like to convey thanks to the Department of Computer
Science for providing the useful preferences and laboratory facilities. I also wish
to express love and gratitude to my beloved families; for their understanding and
endless love, through the duration of my studies.
My thanks and appreciations also go to my colleague in developing the project
and people who have willingly helped me out with their abilities.
i
Table of Contents
Overview
1
1 Introduction
1.1 Biological Background . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 SNP Genotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Quality Control and Quality Assurance . . . . . . . . . . . . . . . . .
2 Related Work
2.1 Naive method for SNP genotyping
2.2 GenCall . . . . . . . . . . . . . .
2.3 Illuminus . . . . . . . . . . . . . .
2.4 GenoSNP . . . . . . . . . . . . .
2.5 Discussion . . . . . . . . . . . . .
3
3
5
7
.
.
.
.
.
10
10
11
12
12
13
.
.
.
.
.
15
15
16
16
17
19
4 Experimental Results
4.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
23
24
26
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3 Method
3.1 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . .
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution . . . . . . . . . . . . .
3.2.2 The matched bound approximation . . . . . . . . . . . .
3.3 Estimate conflict degree between three callers . . . . . . . . . .
Conclusion
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
34
ii
List of Figures
1.1
1.2
An example of a SNP detected by an alignment of two DNA sequences.
BeadChip work flow. . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1
3.2
Approximate t-distribution by a mixture of Gaussians . . . . . . . . 18
An example of Matched Bound Approximate method with two mixtures of three components . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1
4.2
4.3
4.4
Input file format. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Histograms of three metrics in term of conflict score. . . . . . . . .
Erroneous loci filtered by applied two functions but minimum. . . .
HWE, MAF and missing rate histogram of three callers, with original
and filtered data. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
iii
4
6
. 23
. 26
. 27
. 29
List of Tables
4.1
4.2
4.3
4.4
Result
Result
Result
Result
of
of
of
of
missing rate filter. . .
MAF filter. . . . . . .
HWE exact test filter.
synthesis criteria. . . .
.
.
.
.
iv
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
28
28
30
30
Overview
Recently, Genome-Wide Association Study (GWAS), also known as Whole Genome
Association Study (WGAS), proved to be an successful strategy in identifying genetic variants associated with common diseases or complicated phenotypes. This
method focuses on associations between Single Nucleotide Polymorphisms (SNP)
and traits. Because GWAS investigates the entire genome rather than just testing
one or a few genetic regions, there will be a large extreme amount of SNP genotype
data needed to be called. As the consequence, different impressive methods have
been developed to deal with the problem of genotype calling automatically and efficiently. In general, these programs try to translate probe hybridization intensities
outputed from microarray chips into SNP genotypes. In the ideal case, a call at
a SNP loci generates three clusters of signals, so that SNP genotype for a certain
sample could be determined according to which cluster it belongs to. However, in
fact, the ambiguous data and errors happen frequently due to many reasons.
Quality Control (QC), as an indispensable step, is included in every studies to
remove these error data from the dataset as much as possible. Unfortunately, this
kind of work requires significant time and effort because it is hardly automated
completely. It is also difficult to find the bad data among the good ones when the
applied criteria are not clear for these cases. Although many statistical methods
have been proposed, expert-guided evaluations are usually needed to determine the
ultimate results. On top of that, the statistical variables used in considering SNP
genotype quality have thresholds that depending on conditions, making obstacles
on the way of finding a automatic solution based on them. Therefore, an novel
approach beside the traditional QC process is necessary to make this important
step self-regulating as much as possible.
In our work, we developed a new method to detect bad SNPs. Our method
takes advantage of three available genotype callers for Illumina BeadChip, namely
GenCall, GenoSNP and Illuminus. The general idea comes from the fact that data
1
LIST OF TABLES
2
of bad SNPs tend to confuse the callers, so the calling results of different methods
are usually not consistent. That is, by comparing visually the cluster clouds of three
callers SNP by SNP, we could recognize these bad loci. The distance in informatics
theory (relative entropy) is applied to examine these dissimilarities so that bad SNPs
could be detected automatically. We applied our method with real-life data and it
proved to be a good protocol with satisfactory results. Despite of the fact that
there still a lot of work to do, this method really has the ability to help experts in
confirming the filtered data, also suggests potential bad SNPs that hard to find by
traditional QC process.
The thesis is organized in five chapters. An introduction to the definitions and
background knowledge is given in Chapter 1. At first, this chapter will provide
the very first images in bioinformatics by explaining briefly some terminologies in
molecular biology, such as human genome, DNA, SNP and SNP genotyping . . . It
also offer further information relating to our work, about solutions for genotype
calling problem and quality control process. Attending these sections are necessary
for better understanding the later content of the thesis.
Chapter 2 will describe in more details about the available genotype callers.
The general idea for a simple caller will be given at the beginning to give the most
basic solution for the genotype calling problem. After that, three callers used in our
algorithm are explained thoroughly. Information provided in this chapter will help
reader get the first step further toward our solution.
The proposed method is presented in the third chapter. We will review the
definitions about relative entropy (also known as Kullback-Leibler divergence) and
its application with Normal distribution. The next section will state an available
method to calculate relative entropy between Student distributions, which is required
to deal with our problem. Finally, an estimate technique is proposed to measure the
dissimilarities between outputs of three callers.
The experiment process of applying our protocol with real data is offered in
Chapter 4. At first, input data description is given. After that, there is a step of
try-and-error experiment to find the most appropriate parameters that should be
used in our program. Finally, the result of bad SNPs detected by our method is
evaluated by compared with other traditional QC criteria.
The final chapter will give you the overall conclusion and further discussion about
the thesis.
Chapter 1
Introduction
This chapter will provide some background knowledge in molecular biology and
bioinformatics ralated to my thesis topic.
1.1
Biological Background
Human Genome In the nucleus of almost every cell, there exist a common substrate that code for entire body of human beings - known as DNA. This large and
complex-structure molecule is composed of numerous pairs of nucleotides, namely
Adenine, Cytosine, Guanine and Thymine. DNA has two strands, but tracking one
of them is sufficient (the other could be determined by pairing rule).
Human genome refers to the whole genetic makeups of human species, which
means every DNA sequences stored on 23 chromosome pairs and small mitochondrial DNA. They are encoded and represented as databases of very long strings of
four alphabets (or bases) {A, C, G, T} which stand for four types of nucleotides
respectively. There are nearly three billions characters in total for a completed
genome, which are inherited and passed down generations by generations.
Single Nucleotide Polymorphisms : Human genomes from different individuals are about 99.9% identical. The only 0.1% difference make particular characteristics of individuals (Collins & McKusick, 2001). Single Nucleotide Polymorphisms
(SNP) are of these important biological different markers. Knowing information
about SNP and SNP genotype would contributes significantly to Genome-wide association studies (GWAS), helping scientists understand better the complicated hu-
3
1.1. Biological Background
4
man genomes and their inheritance mechanism. As extremely important biological
markers, these tiny changes could influence decisively to various exposures, especially the ones related to diseases are in the top hot topics recently A SNP is defined
as a single base change in a DNA sequence. The positions where SNP occurred are
Figure 1.1: An example of a SNP detected by an alignment of two DNA sequences.
called SNP site. Note that there usually happen only two possible nucleotide variants for any SNP site. The nucleotide variants at SNP sites are called alleles. The
rare polymorphisms with more than two bases are out of consideration, therefore
we could just use 0 and 1 to denote the two possible alleles.
For instance, Figure 1.1 shows two possible DNA sequences at the same locus
in the human genome. Considering the green strands, we have two sequences that
differ in one location (G versus A):
1 : . . . AACGGATCCAC . . .
2 : . . . AACGAATCGAC . . .
1.2. SNP Genotyping
5
The SNP in this site could be written along with the sequence as follow:
G
. . . AACG[ ]ATCCAC . . .
A
SNP Genotype : A pair of alleles each from one chromosome copy in a deploy
organism at a SNP site is called SNP genotype. Since there are two different alleles
at a SNP site, there are three different types for a SNP genotype and denoted as:
00, 01, and 11 (equivalent to the terminology of allele aa, aA, and AA respectively)
where 0 and 1 represents from two different nucleotide variants at this SNP site. 00
and 11 are called homozygous genotypes while 01 is called heterozygous genotype.
Analyzing SNPs and SNP genotypes helps us better understand the complicated
human genomes and their inheritance mechanisms. These nucleotide changes could
relate to diseases and studying these changes is recently a hot topic. For example,
in Figure 1.1, if these two DNA sequences above come from a common locus of
two chromatids in a chromosome of one individual, then the SNP genotype for that
person is 01.
1.2
SNP Genotyping
Illumina BeadChips The need of simultaneous assaying a large number of individuals (samples) over million of SNP sites has led to the innovation of biological
integrated chips industry. To date, there are many different manufacturers working
on this field, produce various type of microarrays such as Affymetrix GeneChip, Illumina Infinium Beadchips, Perlegen, Invader, etc. The microarray beadchip contains
a number of well plate, each well has thousands of beads. A known 25-mer oligonucleotides (a predefined short sequence DNA of 25 nucleotides of human genome) is
embedded in each bead. The fluorescent DNA fragments containing genetic information of individual will match with their complementary on the beads in hybridization
step. A computational system then scan and analyse the coloured signal emitted
from the beads to output the allele densities. The number of well per plate and bead
per well is integrated increasingly day by day, allowing us to interrogate millions of
SNP locus simutaneously.
The Illumina whole-genome BeadChips (Steemers et al., 2006; Peiffer et al.,
2006) is one of the most popular microarray technologies used to study human
SNPs. It takes m individuals (samples) as the input and captures genotypes across
1.2. SNP Genotyping
6
Figure 1.2: BeadChip work flow.
n SNP sites. After several chemical and preprocessing steps, it outputs a matrix
G = {gij,i=1,...,m;j=1,...,n } where gij = (xij , yij ) are the raw intensities which indicate
the SNP genotype of sample i at SNP site j. The intensities xij , yij represent for
allele 0, 1 of SNP genotype gij .
Methods In genotyping problem, if microarray chips are hardwares, then it is necessary to have appropriate softwares to deal with the data. Computational methods
to determine genotype types from raw intensity data have been developed along
with each type of chip. There are several methods available (also known as callers)
for Illumina Bead Array. In this work, we only consider the three most popular
methods, namely Gencall (in GenomeStudio software), Illuminus and GenoSNP.
Gencall(Illumina Inc., 2005) is the earliest tool still being in use in GenomeStudio
tool set. Illuminus(Teo et al., 2007) was developed by Yik Teo et al. as a further
tool focusing on the ability to process large amount of data available from advanced
chips. GenoSNP(Giannoulatou et al., 2008) is a novel approach that could work out
with little experimented samples available.
1.3. Quality Control and Quality Assurance
1.3
7
Quality Control and Quality Assurance
Large-scale stastical studies are supceptible to errors. Quality Control and Quality
Assurance (QC/QA) process is considered as a critical phase that has an significant
affect to the accuracy of computational model. There are a number of proposed
protocols have been made to ensure the data quality, avoiding spurious conclusions
and mistaken results. QC is defined as a process of monitoring and controlling the
quality of data as it is being generated, whereas QA is used to review the product
quality after that (Laurie et al., 2010). The content of this thesis is related to QC
procedure of genotype calling problem.
Frequently in cleaning genotype data, it consists of two separate phases: samples
filter and SNPs filter. We just focus on the later in this article. In which, low-quality
SNPs are called bad SNPs and should be filtered out. The most common and widely
protocol used in evaluating SNPs quality is based on measurements such as HardyWeinberg Equilibrium, missing rate, MAF (minor allele frequency), Mendel error
etc. Although all three callers have applied their own advanced methods to filter
out these bad SNPs from the data set, it is highly possible that a number of faulty
ones still remained in the dataset.
In a typical QC process for genotyping problem, there are three variables considered. They are SNP’s missing rate, HWE (Hardy-Weinberg equilibrium) and MAF
(minor allele frequency). In fact, to improve the efficiency of filtering process, several extra conditions are also included. For instance, there is an suggestion about
association tests based on duplicate discordance, Mendelian errors, sex differences
and heterozygosity rate in addition (Laurie et al., 2010). Briefly, in a SNP locus:
• Missing rate or missing proportion (MSP) at a SNP indicates how much samples failed in this call:
MSP =
number of no calls
total of samples
As could be seen clearly, the higher missing rate indicates the poorer genotype calling performance. The loci with MSP greater than 5% usually are
considered as problematic SNPs.
• In a biallelic locus which is in Hardy-Weinberg equilibrium and minor allele
frequency is q then the probalilities of three possible genotype 0/0, 0/1 and
1.3. Quality Control and Quality Assurance
8
1/1 are (1−q)2 , 2q(1−q), q 2 . These probabilities should be stable, meaning the
real values are nearly the same with the expected values. Significant deviation
of HWE tests is typically imply gross genotyping error. A probalility test
value, or p-value, is used to estimate this difference.
• MAF is the ratio of minor (smaller) alleles counted in the whole set of alleles:
MAF =
number of minor alleles
total alleles
Low MAF means that there exist a cluster among three with fewer samples.
As the consequence, SNPs with low MAF are more prone to error, since almost
calling methods that based on clustering do not work well in these cases.
An proposed criteria for a standard-quality SNP genotyping are HWE p-value
greater than 0.00033, MSP in average is less than 3% but maximum must not pass
10%, while MAF and quality score greater than a pre-defined figure which varies from
study to study (Group, 2007; Laurie et al., 2010). In fact, a combination of these
variables are required to generate more dependable results of filtering. However, as
stated before, these thresholds are not fixed in every experiments and very hard to
be automated. These numbers depends on many factors, giving a rough evaluation
to the data quality. The final conclusions of bad SNPs have to be made manually by
an expert-guided process to avoid false positive cases which are potentially missed
disease variants in the study. A typical marker (SNP in this case) QC process of
GWAS data includes at least four steps as follow (Anderson Carl A, ):
• Filter out SNPs with excessive missing rate.
• Identify SNPs with extreme deviation from HWE.
• Find every SNPs that have discrepancy of missing rates between cases and
controls.
• Detect low-MAF markers.
In this suggestion, the limit for missing rate is 5% (1% for SNPs of low frequency
M AF ≤ 5%). Every SNPs with HWE p-value less than 0.001 will be examined by
experts for quality. The final step is to remove SNPs showing very low MAF, the
threshold of 1-2% is applied typically. The authors also emphasize that checking
cluster plots manually is the best safeguard for quality evaluation.
1.3. Quality Control and Quality Assurance
9
In this thesis, we propose a new method to measure the difference among results
created from different callers. At first, the next section will provide information
about genotype calling problem and current callers. Then we will introduce our
method of utilizing Jeffreys distance to find bad SNPs. At the end, our experimental results will be evaluated statistically and plotted visually to estimate the
effectiveness of this approach. Lastly, there comes our conclusion and further discussion about the work.
Chapter 2
Related Work
Given an matrix G = {gij,i=1,...,m;j=1,...,n } with gij = (xij , yij ), representing the
genotype intensity of m samples at n SNP loci. The task is to assign every pairs to
its most suitable genotype label, 00, 01 or 11 (or no call cluster if impossible). This
problem is call SNP genotype calling or in short, SNP genotyping.
2.1
Naive method for SNP genotyping
The most simple method for SNP genotyping problem use the correlation between
two allelic intensities (xij , yij ). For example, with a data point (x, y):
• If xij yij then this SNP genotype could be labelled as 00 (homozygous of
allele 0)
• If xij yij then it should belong to genotype class 11 (homozygous of allele
1)
• Or if xij ' yij then this SNP calling should be 01 (heterozygous)
However, this naive method could not work well with the ambiguous cases and
its performance is poor with the tremendous amount of data. In fact, statistical
approach are used to overcome this problem. Illuminus, GenoSNP and Gencall
are there most popular callers for 2.5M BeadChips of Illumina. They use different
clustering algorithms to group samples into three clusters 00, 01, 11. Samples which
do not belong any clusters are called outlier.
10
2.2. GenCall
2.2
11
GenCall
Gencall (Illumina Inc., 2005) is the canonical caller developed for Illumina microarray from the very beginning. The work-flow of this software in SNP genotype calling
is described as follow:
• Firstly, the programs read the input data of bead intensities from idats file
(output by scanning process). Then the data belonging to each bead pool
(group of beads that highly related to each others) is normalized and converted
to affine coordinate (Ritchie et al., 2011).
• Secondly, an algorithm known as GenTrain is utilized for the unsupervised
clustering step. The transformed data of every samples of each SNP are modelled, then combined with several heuristic information in order to be analysed
by a neural network. The centroids for each of three clusters are estimated,
forming three genotype groups with the shape and position of them. After the
clustering process, a statistical score (called GenTrain score) that combined a
number of factors, penalty terms... is assigned to each SNP locus in order to
evaluate the resulting clusters in a way similar to human expert’s visual.
• The GenTrain score, together with clustering information is then used with
a Bayesian model to establish the classification. Finally, the calling score for
each call (GenCall Score) is generated as a confident degree for that call.
GenCall score is not a probability value, but is used to represent the quality of
genotyping and might be varied between different loci or chromosomes. However
any call with score less than 0.2 will be considered as a failure due to the lack of
confidence (assigned as no call in this case), while good calls are indicated by the
score greater than 0.7. Every SNP genotypes that have GenCall score fall between
0.2 and 0.7 are qualified and needed to be checked carefully from other point of
views.
It worth noting that GenCall run the top/bottom strand correlation studies, in
which each SNP is genotyped twice for two strands respectively and then compared
to ensure the agreement. This mechanism also helps Gencall to filter out some faulty
calls from the last result.
2.3. Illuminus
2.3
12
Illuminus
Illuminus (Teo et al., 2007) is one of the most popular callers for 2.5M BeadChip
of Illumina. The whole working process of genotype calling methods are quite the
same with several overall common phases, namely data normalization, clustering,
calling and sometimes validation. In this second tool of interests, much efforts
in preprocessing data is spent by adding further procedures to ease the clustering
process. Assume that normalized data generated from Illumina BeadStudio software
(also the input of GenCall algorithm) are in form (xij , yij ), where (xij and yij ) are the
signal intensities for the two alleles 0 and 1 respectively, belong to a certain sample i
at a certain SNP j. Illuminus then converts those figures into new measures, namely
contrast and strength defined as follow:
cij =
xij − yij
xij + yij
sij = log(xij + yij )
Secondly, the clustering algorithm take those values as input and by using EM
(Expectation Maximization) framework on them, it try to fit a bivariate mixture
model with three t-distributed components (as three genotype clusters) and a Gaussian component for outliers (corresponding to no call cluster). The parameters that
determine the shape and location of components are then re-estimated iteratively
until they maximize the fitness likelihood and become stable. Next, the posterior
probabilities are calculated to classify the samples into appropriate genotype states.
For details of this technique, please refer to its publication(Teo et al., 2007).
There is also a method to ensure the stability of calling results in Illuminus,
called pertubation analysis. This process means that every SNPs has been called
twice, in original data (xij , yij ) and pertubated data (xij + , yij + ) respectively,
then the concordance between the two result is estimated to determine whether this
SNP is valid or not.
2.4
GenoSNP
GenoSNP (Giannoulatou et al., 2008) is another prominent caller that work with
Illumina BeadChip. Firstly, the data transformation method of this tool is different
compared to the other two. Start with pairs of raw intensities (x, y), it generates
- Xem thêm -