
Document: Final exam solutions (compiled)


Description:

1. Given the accession number J00306, design a primer pair for the exon spanning positions 1231 to 1368. Validate the quality of the primers you obtain.

Step1: Go to NCBI and enter the accession number J00306.
Step2: Press Ctrl + F and type "1231..1368" to find the exon we want to design primers for. (If you cannot see the picture clearly, zoom in.)
Step3: Open "Pick Primers" on the right side.
Step4: Enter the exon range in the box.
Step5: Get the primers (make sure the highlighted option is checked). The first primer pair is usually the best one.

Under the general settings, a good primer should meet criteria such as:
- Length of 18-30 bases.
- Melting temperature (Tm) of 50-60 °C.
- GC content between 45% and 55%.
- A maximum Tm difference between the two primers of only 3 °C.

Both primers are 20 bases long. The forward primer starts at position 1184, while the reverse primer starts at position 1463. The melting temperatures of the forward and reverse primers are 54.6 °C and 55.9 °C, respectively. The GC content is the same, 50%. Neither primer forms stable hairpins or dimers. However, neither has a GC clamp at its 3' end. Overall, these primers can be considered a good pair.

2. Given the accession number NC_000009.12 and the sequence range 94603133 to 94640249, what does this sequence encode? List the values of the BLAST output.

Step1: Go to NCBI to inspect what NC_000009.12 is.
Step2: Run BLAST.
Step3: Choose blastn, enter the sequence range, and select the Human genomic plus transcript (Human G+T) database.
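The general primer criteria listed under question 1 (length, GC content, Tm window, Tm difference, GC clamp) can also be checked in a few lines of code. The following is only a rough sketch: the two primer sequences are made up for illustration (they are not the primers returned for J00306), the Tm formula is a simple textbook approximation that will not reproduce the values reported by Primer-BLAST, and "GC clamp" is simplified here to "the last 3' base is G or C".

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def rough_tm(seq: str) -> float:
    """Rough Tm estimate: 64.9 + 41 * (G + C - 16.4) / N.

    A textbook approximation for oligos longer than 13 nt. It is NOT the
    nearest-neighbour calculation used by primer design tools.
    """
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq)
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def check_primer(seq: str) -> dict:
    """Apply the single-primer criteria from the list above."""
    seq = seq.upper()
    return {
        "length_ok": 18 <= len(seq) <= 30,
        "gc_ok": 45.0 <= gc_content(seq) <= 55.0,
        "tm_ok": 50.0 <= rough_tm(seq) <= 60.0,
        # Simplification (assumption): clamp = last 3' base is G or C.
        "gc_clamp": seq[-1] in "GC",
    }


def check_pair(forward: str, reverse: str, max_tm_diff: float = 3.0) -> dict:
    """Check both primers plus the maximum Tm difference between them."""
    return {
        "forward": check_primer(forward),
        "reverse": check_primer(reverse),
        "tm_diff_ok": abs(rough_tm(forward) - rough_tm(reverse)) <= max_tm_diff,
    }


if __name__ == "__main__":
    # Hypothetical primers, for illustration only.
    fwd = "ATGCATGCATGCATGCATGC"
    rev = "TTGACCGTAAGCTAGGCTAA"
    print(check_pair(fwd, rev))
```

A check like this only screens the simple numeric criteria; hairpins and dimers still need a dedicated tool.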
According to the BLAST result, the sequence from 94603133 to 94640249 of accession number NC_000009.12 encodes the following (only the first 3 results with a query cover higher than 99% are listed):
- Homo sapiens fructose-bisphosphatase 1 (FBP1), RefSeqGene on chromosome 9
- Human DNA sequence from clone RP11-342C23 on chromosome 9, complete sequence
- Homo sapiens fructose-1,6-bisphosphatase 1 (FBP1) gene, complete cds

3. The INS gene encodes insulin in humans. Given the accession number NC_000011.10 with the range 2159779 to 2161209, find a primer pair with a product length of 500 ± 50 bp.

4. Given the accession number NC_000009.12 and the sequence range 94603133 to 94640249, answer the following questions:
a. Using the Human G+T database for BLAST, what does this sequence encode?
b. Briefly describe how your BLAST output supports your answer to question a.
In the first result, we get the following information:
- Max score: 89.8
- Total score: 743
- Query cover: 0%
- Identity: 100%
- Accession number: NC_000006.12
- Gaps: 1/67 (1%)
Meaning: The E value is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The smaller the E value, the better the result. An E value of 0 means that the result is good and can be used. In this case, E value = 1e-13 << 0.05, so the alignment is a significant match. The identity of 91% (> 75%) shows that mouse GULO (chromosome 14) is highly identical to human GULO (chromosome 6).
- The first result is the most identical gene.
c. Use the mRNA of your sequence (from your BLAST result) to design a primer pair and report your result.

5. The following picture shows the phylogenetic and modular analysis of C. militaris (CCM) polyketide synthases (PKSs) compared with those involved in the production of human mycotoxins.
(a) A neighbor-joining tree showing the relationship of ketoacyl CoA synthase (KS) domain sequences.
(b) Modular comparison of C. militaris PKSs with those involved in the production of mycotoxins. Domain definitions: ACP, acyl carrier protein domain; AT, acyltransferase domain; CYC, cyclase domain; DH, dehydratase domain; TE, thioesterase domain. CCM_00603 lacks the gene cluster for patulin biosynthesis.
a. Starting from raw samples of C. militaris, what bioinformatic approaches can be used to reconstruct the phylogenetic tree in figure (a)?
b. Researchers are concerned about the possibility of harmful side effects of the Chinese [...] support this hypothesis? Explain your answer.

6. The table below shows the values of different coding statistics in the 223 bp second coding exon of the human β-globin gene, and in a 223 bp sequence from the middle of the second intron of the same gene.

                | Position asymmetry | Periodic asymmetry index | Average mutual information | Fourier spectrum
Exon sequence   | 0.0957             | 1.159                    | 0.00681                    | 2.278
Intron sequence | 0.0211             | 1.009                    | 0.000344                   | 0.892

a. What are the methods in this table used for?
They are sequence-based measures indicative of protein-coding function in genomic DNA. A good knowledge of the core coding statistics is important to understand how gene identification programs work and to interpret their predictions. The main distinction here is between measures that depend on a model of coding DNA and measures independent of such a model. The model of coding DNA is always probabilistic, allowing one to compute the probability of a DNA sequence, given that the sequence is coding. Although in practice the values (scores) of a given coding statistic in a query sequence can be computed in a number of different ways, here for the model-based coding statistics we will compute scores based on such a probability. Indeed, given a query sequence, we will compute the probability of the sequence under the model of coding DNA, and under an alternative model of non-coding DNA (which here, for illustration purposes, will simply be random DNA). We will take the logarithm of the ratio of these two probabilities, the log-likelihood ratio, as the score of the coding statistic in the query sequence.
b. Based on this result, why is average mutual information the most sensitive method?

7. The protein VanA, with the help of two other proteins, adds a lactate group instead of alanine to the end of the peptidoglycan chain. This helps bacteria resist vancomycin. Given two VanA structures, from a modern sample and a 30,000-year-old DNA sample, with PDB IDs 1E4E and 3SE7 respectively, use appropriate bioinformatic tool(s) to answer the following questions:
a. Which class does VanA belong to (according to shape and secondary structure)?
- VanA is a D-alanine-D-lactate ligase, indicating that it adds lactate to the growing peptidoglycan chain.
b. How different are their primary structures (single chain only)?
- The enzyme that makes the normal peptidoglycan is a D-alanine-D-alanine ligase, which adds alanine to the chain.
- Surprisingly, the ancient VanA is very similar to the VanA made by modern bacteria, showing that this war of antibiotics and resistance began long before medical science discovered the utility of antibiotics.
c. How different are their tertiary structures (single chain only)?
We compare the ancient and modern proteins using the Structure Comparison Tool.
- http://www.rcsb.org/pdb/workbench/showPrecalcAlignment.do?action=pw_fatcat&name1=1E4E.A&name2=3SE7.A
d. In terms of evolution, make at least 2 assumptions based on your previous comparisons.
VanA reconstructed from a 30,000-year-old bacterium, with bound ATP.

8. There are several types of genetic variation, such as tandem repeat polymorphisms, insertion/deletion polymorphisms, and single nucleotide polymorphisms (SNPs). In your opinion, why do researchers focus so extensively on SNPs nowadays?
Tandem repeat polymorphism: Tandem repeats, or variable number of tandem repeats (VNTRs), are a very common class of polymorphism, consisting of variable-length sequence motifs that are repeated in tandem in a variable copy number. VNTRs are subdivided into two subgroups based on the size of the tandem repeat unit:
- Microsatellites, or short tandem repeats (STRs): repeat unit of 1-6 bp (e.g. the dinucleotide repeat CACACACACACA).
- Minisatellites: repeat unit of 14-100 bp.
For example: Spinocerebellar ataxia type 10 (SCA10) (OMIM: +603516) is caused by the largest tandem repeat seen in the human genome. The normal population has 10-22 copies of the pentanucleotide repeat ATTCT in intron 9 of the SCA10 gene, whereas SCA10 patients have 800-4500 repeat units, which makes the disease allele up to 22.5 kb larger than the normal one.

Insertion/deletion polymorphism: Insertion/deletion (INDEL) polymorphisms are quite common and widely distributed throughout the human genome. Sequence repetitiveness in the form of direct or inverted tandem repeats has been shown to predispose DNA to localized rearrangements between homologous repeats. Such rearrangements are thought to be one of the reasons INDEL polymorphisms arise.
For example: An association between coronary heart disease and a 287 bp INDEL polymorphism located in intron 16 of the angiotensin converting enzyme (ACE) gene has been reported (OMIM 106180). This INDEL, known as ACE/ID, is responsible for 50% of the inter-individual variability of plasma ACE concentration.
In silico estimates of potentially polymorphic VNTRs exceed 100,000 across the human genome. Short insertions/deletions are very difficult to quantify, and their number is likely to fall between those of SNPs and VNTRs.

Single nucleotide polymorphisms (SNPs):
- Responsible for 90% of all human genetic variation.
- A SNP occurs every 100-300 base pairs.
- Almost 12 million SNPs are currently in the NCBI SNP database.
- May be within genes (coding SNP, cSNP) or outside genes (non-coding, the majority).
- May or may not cause an amino acid change. If it causes an amino acid change, it is called non-synonymous (nsSNP).
- Most SNPs are not responsible for a disease. Like microsatellites, they are used as markers for pinpointing a disease on the genome map.
- SNPs make particularly good markers because they occur frequently throughout the genome, and are older and more stable genetically.

The most common polymorphisms (or genetic differences) in the human genome are single base-pair differences. When two different haploid genomes are compared, SNPs occur, on average, about every 1,000 bases. SNP-based studies make no biological assumptions and can identify novel genes/pathways, offer an excellent chance to identify risk alleles, and are useful in individual risk assessment.

SNPs are important to:
- The organism: SNPs are mutations, so they can alter DNA function. Depending on where they are, this can potentially cause critical illness by altering an important genetic feature. At the other end of the spectrum, they may have no discernible impact. Based on population genetics theory, SNPs with severe disease-causing effects are likely to be bred out of gene pools.
- Genetic epidemiologists: they use SNPs as genetic markers to track disease. Large studies called genome-wide association studies (GWAS) examine the SNPs in tens to hundreds of thousands of people and find associations between particular SNPs and disease.

Most SNPs have no effect on health or development. Some of these genetic differences, however, have proven to be very important in the study of human health. Researchers have found SNPs that may help predict an individual's response to certain drugs, susceptibility to environmental factors such as toxins, and risk of developing particular diseases. SNPs can also be used to track the inheritance of disease genes within families. Future studies will work to identify SNPs associated with complex diseases such as heart disease, diabetes, and cancer.

9. What is GWAS?
In general, how do scientists conduct a GWAS study? At least how many patients are needed for a GWAS?

A genome-wide association study is defined as any study of genetic variation across the entire human genome that is designed to identify genetic associations with observable traits (such as blood pressure or weight), or with the presence or absence of a disease (such as cancer) or condition. It is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.

A GWAS needs at least 1,000-3,000 patients and at most 100,000-200,000 patients.

In general, to conduct a GWAS, scientists first collect a large cohort of cases and controls. Second, microarray-based SNP genotyping is performed. After haplotypes are derived, association signals are detected. Then the association signal is fine-mapped. Finally, the association is replicated and goes through biological validation.

10. Briefly list out the procedure of genome assembly and the specific software needed in each step (if any).

11. Explain why repetitive sequences are a challenge to genome assembly.
Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results.
Repetitive DNA consists of sequences that are similar or identical to sequences elsewhere in the genome. Although some repeats appear to be nonfunctional, others have played a part in human evolution, at times creating novel functions, but also acting as independent, 'selfish' sequence elements. Repeats arise from a variety of biological mechanisms that result in extra copies of a sequence being produced and inserted into the genome. Repeats come in all shapes and sizes: they can be widely interspersed repeats, tandem repeats or nested repeats; they may comprise just two copies or millions of copies; and they can range in size from 1-2 bases (mono- and dinucleotide repeats) to millions of bases. Repeats can also take the form of large-scale segmental duplications, such as those found on some human chromosomes, and even whole-genome duplication.

For de novo assembly, repeats that are longer than the read length create gaps in the assembly. In addition to creating gaps, repeats can be erroneously collapsed on top of one another and can cause complex, misassembled rearrangements. Because the sequence is cut into many small fragments, repeat regions can cause wrong alignments or make overlaps difficult to resolve.

Repetitive sequences, which permeate the genomes of species from across the tree of life, create ambiguities in the processes of aligning and assembling NGS data. They are a huge challenge because the reads associated with them cannot be assigned to just one location in the genome. Each copy of a repetitive element is flanked on each side by a unique sequence. Consider that half the human genome consists of repetitive DNA and other genomes have even more; transposable elements span over 80% of the maize genome.
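The ambiguity described above can be made concrete with a toy example. This is a minimal sketch on a made-up "genome" containing two copies of a short repeat separated by unique sequence; real aligners use indexed, error-tolerant search rather than exact string matching, so treat it purely as an illustration.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Toy genome (made up): unique flank, repeat, unique spacer, repeat, unique flank.
REPEAT = "ACGTACGTAC"
GENOME = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CCTAG"

read_in_repeat = "GTACGT"      # lies entirely inside the repeat
read_spanning = "GCAACGTACG"   # covers unique flank plus part of the repeat

if __name__ == "__main__":
    # A read inside the repeat has more than one possible origin,
    # while a read that reaches into the unique flank has only one.
    print("inside repeat:", find_all(GENOME, read_in_repeat))
    print("spanning flank:", find_all(GENOME, read_spanning))
```

The read that lies inside the repeat matches both copies, so nothing in the read itself says which copy it came from; the read that extends into the unique flank can be placed unambiguously. This is why repeats longer than the read length are the problematic ones.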
Beyond assembly, this also leads to a tremendous technical challenge for alignment to a reference genome: repeats introduce ambiguous assemblies and alignments, sometimes producing biases and errors.

12. FASTQ is considered the raw data generated by a next-generation sequencing machine. What is the difference between the FASTA format and the FASTQ format? Briefly describe the structure of the FASTQ format.

- The difference between FASTQ format and FASTA format:
FASTQ: FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. The sequence letter and the quality score are each encoded with a single ASCII character for brevity. Like the FASTA format, the FASTQ format includes a sequence string consisting of the nucleotide sequence of each read. FASTQ also includes an associated quality score for every base, which makes it appropriate for reads from an Illumina machine or other brands.
FASTA: FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.

- The structure of the FASTQ format: Each FASTQ file has records in blocks four lines long.
- The first line, beginning with the @ symbol, is the UNIQUE sequence name that identifies the record. It may optionally include information about the sequence length or the machine used for sequencing.
- The second line has the sequence (in upper case), made up of the nucleotides G, A, T, C; there may also be an N for an unknown nucleotide (as is the case in the second position of the example).
- The third line begins with the + symbol and typically contains just that character (as in the example), or it can have more information.
- The fourth line includes the quality scores (ASCII characters) corresponding to every base.
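The four-line record structure described above can be read with a short parser. This is a minimal sketch under two assumptions that the text does not state: quality characters use the common Phred+33 encoding (score = ASCII code - 33), and each record occupies exactly four lines with no line wrapping. The record in the usage example is made up.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: list[int]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (four lines per record)."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        # The quality string must be exactly as long as the sequence string.
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33 encoding of quality scores.
        yield FastqRecord(header[1:], seq, [ord(c) - 33 for c in qual])


if __name__ == "__main__":
    example = ["@read1", "GNATTC", "+", "II#5II"]  # made-up record
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

The length check mirrors the rule that every base must have exactly one quality character.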
Each quality score is assigned a single character, and the length of the entire quality score string must equal the length of the sequence string.

13. Briefly describe how an Illumina sequencing machine works.
Illumina sequencing works on the principle of cyclic reversible termination.
(a) Genomic DNA is purified and then randomly fragmented. This can be accomplished mechanically by methods such as sonication, shearing, or nebulization, often followed by size selection of the randomly fragmented DNA. Adapters are attached to both ends.
(b) Single-stranded DNA fragments are covalently attached to the surface of flow cell channels.
(c) The addition of DNA polymerase and unlabeled deoxynucleotides creates solid-phase "bridge amplification", in which the template DNA makes U-shaped loops with both ends attached to the surface of the channel.
(d) Double-stranded bridges are formed. The double-stranded molecules are denatured and then amplified to generate dense clusters of template DNA.
(e) Four labeled reversible terminators are added (with primer and DNA polymerase). Only a single reversible terminator will be added to each template in a given cycle. As with Sanger sequencing, chain termination will occur at specific bases that cannot elongate.
(f) Following laser excitation, the identity of the first base is recorded.
(g) For the second cycle, the reversible terminators are removed (by deprotection). All four labeled reversible terminators and the polymerase are again added to the flow cell. The cycles are repeated.
Sequencing over multiple chemistry cycles: The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.
Align data: The data are aligned and compared to a reference, and sequencing differences are identified.

14. How many kinds of data does an Illumina sequencing machine generate? What is the function of paired-end reads?
There are two kinds of data that an Illumina machine generates: single-end reads and paired-end reads.
Paired-end reads are useful for identifying deletions (as well as insertions) because such reads have an expected distance (depending on the size of the library inserts) and orientation. Paired-end sequencing allows users to sequence both ends of a fragment and generate high-quality, alignable sequence data. It facilitates detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts. Paired-end reads provide superior alignment across DNA regions containing repetitive sequences, and produce longer contigs for de novo sequencing by filling gaps in the consensus sequence.

15. What are the methods for structural prediction of proteins, and what are their drawbacks?
In structural biology, there are two main approaches to determining protein structure: X-ray crystallography and nuclear magnetic resonance spectroscopy (NMR). Structures can also be predicted computationally using three approaches: homology modeling, threading, and ab initio prediction.
Structure prediction is a major goal of proteomics. There are three principal ways to predict the structure of a protein. First, for a protein target that shares substantial similarity with other proteins of known structure, homology modeling (also called comparative modeling) is applied. Second, for proteins that share folds but are not necessarily homologous, threading is a major approach. Proteins that are analogous (related by convergent evolution rather than homology) can be studied this way. Third, for targets lacking identifiable homology (or analogy) to proteins of known structure, ab initio approaches are applied.

Homology modeling (comparative modeling)
There are several principal types of errors that occur in comparative modeling (see Marti-Renom et al., 2000):
• errors in side-chain packing;
• distortions within correctly aligned regions;
• errors in regions of a target that lack a match to a template;
• errors in sequence alignment; and
• use of incorrect templates.
Each target undergoes comparative modeling using an existing experimental structure as a guide that may be superimposed on the target.

Fold recognition (threading)
The target might assume a fold that occurs in a characterized protein because of convergent evolution, or because the two proteins are homologous but extremely distantly related. An input sequence is parsed into subfragments and "threaded" onto a library of known folds. Scoring functions allow an assessment of how compatible the sequence is with known structures.

Ab initio prediction (template-free modeling)
The resolution of ab initio methods is generally low.
Knowledge-based approaches would fail under the following conditions:
- Structure homologues are not available.
- A possible undiscovered new fold exists.
Anfinsen's theory: The native structure of a protein corresponds to the state with the lowest free energy of the protein-solvent system.
Limitations of de novo prediction methods:
o A major limitation of de novo protein prediction methods is the extraordinary amount of computer time required to successfully solve for the native conformation of a protein.
o Another way of circumventing the computational power limitations is using coarse-grained modeling. Coarse-grained protein models allow for de novo structure prediction of small proteins, or large protein fragments, in a short computational time.

Gene Prediction
The principle
• Identify common genetic features of known genes.
• Generate genetic profiles.
• Compare the profiles to an uncharacterized gene as a prediction.
• Test and validate the prediction.

Computational Methods for Gene Prediction
• Gene prediction methods
Extrinsic/homology method: Based on sequence similarity. The assumptions of the homology method:
- Coding regions evolve more slowly than non-coding regions.
- Homologous sequences reflect a common ancestry and therefore gene structure.
Software: AAT, EbEST, GeneSeqer, ORFgene2, SYM4, GeneWise, SYNCOD
Intrinsic/ab initio method: Based on statistical profiles. Predicts genes based on the statistical properties of the uncharacterized sequence.
- Software: FGENESH, GeneID, GeneMark.hmm, GENSCAN, xpound, VEIL, TWINSCAN, HMMgene.
- Challenges in eukaryotes:
- Protein-coding genes are separated by intergenic regions.
- The presence of exons and introns.
- Signal sequences are difficult to identify.

Features for gene prediction in prokaryotes
Promoter elements:
- -35 region.
- -10 region.
- Transcriptional start site.
- ORFs.
- Translation stop sites.
THE EXISTENCE OF CONSENSUS SEQUENCES (ESPECIALLY PROMOTER SEQUENCES) FACILITATES GENE PREDICTION IN PROKARYOTES.
• Software: AMIGene, EasyGene, GeneMark.hmm-P, Glimmer, SGfinder, MEDstart, REGANOR, TICO, ZCURVE.

Challenges in prokaryotic gene prediction
• Prokaryotes pose difficulties due to high gene density and simple gene structure:
– Little information from short genes.
– Reduced detection accuracy due to overlapping genes.

Features for gene prediction in eukaryotes
• Prediction for eukaryotes is a whole lot more complicated than for prokaryotes.
• This is because of the large amount of information involved:
– splice sites,
– start and stop codons,
– branch points, promoters,
– terminators, polyA sites,
– ribosomal binding sites,
– topoisomerase II binding sites,
– topoisomerase I cleavage sites,
– transcriptional factor binding sites,
– etc.
Software: FGENESH, GeneID, GeneMark.hmm, GENSCAN, xpound, VEIL, TWINSCAN, HMMgene.

Challenges in eukaryotic gene prediction
• Low gene density and complex gene structure.
• Presence of alternative splicing mechanisms.
• Presence of pseudogenes.

Why is gene prediction difficult?
• DNA signals have low information content (degenerate and highly unspecific).
• It is difficult to discriminate real signals.
• Sequencing errors.

16. According to your understanding of read mapping analysis, is it possible to estimate the number of duplications by using coverage? Briefly explain your answer.

17. The functional site (or active site) of a protein consists of only a few amino acids.
Use your understanding of the biochemical nature of proteins to discuss the functional importance of the other residues that do NOT belong to the functional site.

18. Algorithms in genome assembly largely depend on overlapping sequences of the reads (known sequences of DNA fragments after sequencing). Based on the human genome project, briefly discuss the genomic feature(s) of the human genome that may challenge the process of genome assembly.
The human genome, with approximately 35 million reads, needed large computing farms and distributed computing. Since 2006, the Illumina (previously Solexa) technology has been available and can generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project, which needed several years to be produced on hundreds of sequencing machines.
Human contamination in other mammalian genome sequences will be particularly problematic, as such contamination is expected to be common due to handling of the samples. For parts of a de novo-sequenced mammalian genome, the best BLAST hit will be against a human or mouse sequence simply because the region in question has not been sequenced and annotated in any other mammal.

[Genome assembly]
_ Definition of genome assembly
In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 30000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing of genomic DNA or gene transcripts (ESTs).
Genome assemblies offer a consensus representation of a genome, spanning all the chromosomes (and extrachromosomal elements such as organellar genomes and plasmids).
When next-generation sequencing is performed on a previously assembled genome (e.g., when we sequence a person's genome), alignment to the reference genome is performed, but that human reference has already been assembled, so further assembly is not required. In contrast, when we sequence the genome of a species that has not previously been characterized, de novo ("from new") assembly is required.

Genome assembly: Challenges
Errors in assembly are important because we rely on each assembly for all aspects of the genomic landscape, including the locations of genes. Genomes can be assembled de novo ("anew," without referring to other completed genomes) or by mapping reads onto a reference genome. The assembly process involves the collection of individual sequences, the closing of gaps, and the lowering of the error rate.
The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.
Whole-genome assembly involves fragmenting genomic DNA from an organism, then constructing libraries of various sizes (often from 2 kb to 50 kb or even >100 kb). In one approach the ends of cloned inserts are sequenced (producing mate pair reads). As reads are aligned they are organized into contigs such as those found in the Whole-Genome Shotgun (WGS) division of NCBI. Contigs can be ordered and oriented to assemble scaffolds (also called supercontigs). These may contain gaps whose sizes can be estimated.
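Sets of contigs or scaffolds like these are usually summarized by their N50: the length such that sequences of this length or longer include 50% of the bases in the assembly. A minimal sketch of that calculation follows; the contig lengths in the usage example are made up.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


if __name__ == "__main__":
    contig_lengths = [80, 70, 50, 40, 30, 20]  # made-up lengths in bp
    print("contig N50:", n50(contig_lengths))
```

The same function applies to scaffold lengths to give the scaffold N50.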
Global statistics for assemblies include: (1) the total number of scaffolds (including those with or without known placement or orientation); (2) the scaffold N50 (the length in base pairs such that scaffolds of this length or longer include 50% of the bases in the assembly); (3) the total number of contigs; and (4) the contig N50 (here the length such that contigs of this length or longer include 50% of the bases in the assembly). N50 is therefore a measure of contiguity, with larger values denoting more complete assemblies. The Genome Reference Consortium (GRC), which is responsible for human genome assemblies, lists the N50 for each human chromosome. For chromosome 11 (harboring the HBB gene cluster) the N50 is about 41.5 megabases, while in earlier assemblies (such as NCBI35) it was millions of base pairs shorter.

_ Genome assembly procedure or pipeline
Flowchart describing assembly and annotation procedures: the steps involved in creating a high-quality genome. Sequencing can include the conventional Sanger technique and/or several NextGen technologies including 454, Illumina, and Ion Torrent (see Table 1). Contig and scaffold assembly can utilize several assemblers including: Atlas (Havlak et al. 2004), ABySS (Simpson et al. 2009), ALLPATHS-LG (Gnerre et al. 2011), Celera assembler (Myers et al. 2000), MaSuRCA (accessed on July 19, 2013), and SOAPdenovo (Li et al. 2010). Chromosome mapping can use genetic information, radiation hybrids or fluorescence in situ hybridization (FISH). "Breaking" misassembled scaffolds and placing them on chromosomes can involve extensive manual work. Expressed sequence tags (ESTs) are usually partial transcripts obtained from Sanger sequencing. mRNA-seq is often performed with Illumina technology but can also be conducted with Ion Torrent machines.

OR
_ Types of read data
_ De novo assembly and reference mapping assembly (De novo = "new", Reference = "something already exists". So one assembly is built based on the known genome, the other is built based on nothing but itself)

In sequence assembly, two different types can be distinguished:
1. de novo: assembling short reads to create full-length (sometimes novel) sequences (see de novo sequence assemblers, de novo transcriptome assembly)
2. mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence

In terms of complexity and time requirements, de novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every read with every other read (an operation that has a naive time complexity of O(n^2); using a hash this can be reduced significantly). Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as a template (perhaps with the names of the main characters and a few locations changed), the de novo assemblies are more hardcore in a sense, as one would not know beforehand whether this would become a science book, a novel, a catalogue, or even several books. Also, every shred would be compared with every other shred.
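The remark that a hash can reduce the all-against-all comparison can be illustrated as follows. This is only a sketch of the idea, not how any particular assembler works: each read is indexed by its first k bases, and suffix-prefix overlap candidates are then found by dictionary lookup instead of by comparing every pair of reads. The function name, the value of k, and the reads are made up for illustration, and the sketch assumes error-free reads.

```python
from collections import defaultdict


def find_overlaps(reads: list[str], k: int = 4) -> list[tuple[int, int, int]]:
    """Return (i, j, n) where the last n bases of reads[i] equal the first n of reads[j]."""
    # Index every read by its k-base prefix (the "hash" step).
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        if len(read) >= k:
            prefix_index[read[:k]].append(j)

    overlaps = []
    for i, read in enumerate(reads):
        # Consider every proper suffix of at least k bases, longest first.
        for start in range(1, len(read) - k + 1):
            suffix = read[start:]
            # Only reads whose prefix matches the suffix's first k bases are checked.
            for j in prefix_index.get(suffix[:k], []):
                if j != i and reads[j].startswith(suffix):
                    overlaps.append((i, j, len(suffix)))
    return overlaps


if __name__ == "__main__":
    toy_reads = ["ATGCGTAC", "GTACCTGA", "CTGATTTG"]  # made-up reads
    for i, j, n in find_overlaps(toy_reads):
        print(f"read {i} overlaps read {j} by {n} bases")
```

Each suffix is verified only against the few reads sharing its k-base prefix, which is where the saving over pairwise comparison comes from. Sequencing errors and repeats, ignored here, are what make the real problem hard.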
Paste your primer sequence in, one primer at a time, then click Hairpin and Self-Dimer to check it.
A hairpin is the primer folding back on itself through hydrogen bonds. When a primer folds up, it cannot anneal to our gene segment to pick out the target gene, so hairpins need to be limited. In this picture, the primer is 20 nt long and has two loops, which is not acceptable. We can also rely on delta G: the delta G of a hairpin must not be lower than -1.
This is the self-dimer result. A self-dimer means that primers of the same kind bind to each other. The delta G of a self-dimer must not be lower than -9; if it is, the primer fails validation. Self-dimers must be limited because if primers of the same kind all bind to each other, they will not bind to the gene when PCR is run. Pairing in the middle of the primers is tolerable but should be limited; the worst case, to be avoided above all, is two 3' ends pairing with each other. Once the two 3' ends have paired, the primers can no longer anneal to the gene during PCR.

For hairpin analysis, you can change the default concentrations provided to match your reaction conditions. The most valuable piece of information on this screen is the Tm for each of your structures. If the Tm of the structure is lower than the temperature of your reaction conditions, then this structure will not cause any problems. If it is higher, this oligo may be problematic and should be redesigned.
For self-dimer analysis, click on 'Self-Dimer' to bring up a new window with each possible self-dimer your oligo can form. For each diagram you will be able to see the calculated delta G value for this secondary structure. If you have a strong delta G (-9 kcal/mol or more negative), this oligo could be problematic.
For hetero-dimer analysis, enter the sequence of your forward primer into the sequence box, and then click 'Hetero-Dimer.' This will open a second box below the original sequence box, in which you enter the sequence of your reverse primer. Then click the "Calculate" button below the second box. In general, a primer pair with a delta G of -9 kcal/mol or more negative will be problematic.
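The case singled out above as the worst one, two 3' ends pairing with each other, can be screened for with a crude sequence check. This is only a heuristic sketch: it looks for perfect Watson-Crick pairing between the terminal bases of two primers, computes no delta G, and is no substitute for the Hairpin, Self-Dimer and Hetero-Dimer analyses. The function names, the `max_len` cut-off, and the primer sequences are all made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def three_prime_overlap(primer_a: str, primer_b: str, max_len: int = 8) -> int:
    """Largest n (up to max_len) such that the last n bases of primer_a can
    pair perfectly, antiparallel, with the last n bases of primer_b.

    Returns 0 when no such terminal pairing exists. Pass the same primer
    twice to screen for a 3'-end self-dimer.
    """
    a, b = primer_a.upper(), primer_b.upper()
    best = 0
    for n in range(1, min(len(a), len(b), max_len) + 1):
        # The two 3' tails pair when one is the reverse complement of the other.
        if reverse_complement(a[-n:]) == b[-n:]:
            best = n
    return best


if __name__ == "__main__":
    # Hypothetical primer ending in a palindromic stretch (GAATTC).
    risky = "ACGTTGCAGGAATTC"
    print("self 3' overlap:", three_prime_overlap(risky, risky))
```

A long terminal overlap is a reason to inspect the primer in a proper dimer analysis, not a verdict on its own; mismatched or offset pairings are not detected by this check.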