Thursday, 16 May 2013

bioinformatics - What's the use of DNA sequencing results?


It's just a "string" where nucleotides encode something but I have no idea what they encode specifically.




It sounds like you are describing a computer program, represented by a string of bytes on the hard drive.



Unfortunately, the analogy breaks down very quickly because DNA is vastly more complex and a lot of aspects are still poorly understood.



But the basis is the same: a string of symbols encodes information. In the case of DNA, different parts encode different things in a different manner.



The elephant in the room is, of course, coding sequences: stretches of DNA, contained in so-called “genes”, which code for proteins and other products (such as functional RNA).



Coding sequences use a fairly simple encoding scheme known as the genetic code, first deciphered in 1961. The genetic code has the nice property of being (almost) universal across all species, and it is easy enough for a child to understand:



Three consecutive DNA base pairs form a codon. Each codon stands for one amino acid (except for the special “stop codons”, which mark the end of the chain). Amino acids form so-called polypeptide chains (proteins). There are codon tables, just as you have manuals for assembly mnemonics in computing:



[Figure: codon table of the genetic code]
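To make the decoding step concrete, here is a minimal Python sketch of translating a coding sequence with the standard codon table. The names `CODON_TABLE` and `translate` are made up for this illustration, and the sketch assumes the sequence is already in the correct reading frame and contains only the letters A, C, G and T.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Assumes the sequence starts in the correct reading frame.
    Translation stops at the first stop codon; trailing bases that
    do not fill a whole codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAGTGA"))  # MAK
```

The whole “decoder” is a 64-entry lookup table, which is what makes this part of the problem the easy one. In practice you would use an established library rather than rolling your own.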



Unfortunately, it’s not trivial to know where genes are on the genome. Just by eyeballing the sequence, there is nothing to distinguish one stretch of DNA from its surroundings. But there are certain recognisable stretches of DNA (“motifs”) which we can use to locate genes and other interesting regions.



Grossly simplified, a gene is preceded by a promoter region which is highly conserved between species (but gene-specific). Once you’ve identified a promoter in one species, you can recognise its counterpart in other species. Furthermore, many promoters share highly similar elements, for instance the TATA box – literally the occurrence of “TATA” in the genome.



Of course, just looking for occurrences of “TATA” would yield vastly more spurious hits than actual promoters, but combined with other information you get a gene model: a statistical machine which can tell you with high confidence where on a genome the genes are located.
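A rough sketch of why a bare motif search is not enough: the code below scans a sequence for every occurrence of “TATA” and estimates how many hits you would expect by pure chance. The function names are hypothetical, and the chance estimate assumes that all four bases are equally likely and independent of each other, which real genomes are not, so treat the number as an order of magnitude only.

```python
import re


def find_motif(sequence, motif="TATA"):
    """Return the 0-based start of every occurrence of motif, overlaps included."""
    # A zero-width lookahead lets the regex engine report overlapping matches.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


def expected_random_hits(sequence_length, motif_length=4):
    """Expected number of chance hits in a uniformly random sequence.

    Assumption: each base is drawn independently with probability 1/4.
    """
    positions = sequence_length - motif_length + 1
    return positions / 4 ** motif_length


print(find_motif("GGTATATACC"))  # [2, 4]

# A genome of about three billion base pairs (roughly the human genome)
# would contain on the order of 11.7 million chance occurrences.
print(expected_random_hits(3_000_000_000))
```

Millions of chance hits against a much smaller number of genes is why the motif is only one input to a gene model, not a gene finder by itself.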




Once you have found the genes in a sequence, you can trivially translate them to amino acid sequences and try to form inferences about the function of the proteins (a protein’s function is an almost-direct consequence of its sequence). Unfortunately, discovering the function of a protein from its sequence alone generally isn’t possible, but we do know the functions of many proteins.



When we now look at a genome and find a mutation in a protein-coding gene, we can infer that the function of this protein is probably modified. Most such mutations are so-called deleterious mutations, meaning that they destroy the protein’s function (e.g. by deleting a single nucleotide you get a frame shift in the codons, and all the subsequent codons no longer make sense).
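The frame shift is easy to see in a toy example. The sketch below translates a short made-up sequence, deletes a single nucleotide and translates again; `translate` and `delete_base` are hypothetical helpers written for this illustration, and the sequence is invented rather than taken from a real gene.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame from position 0 until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete_base(dna, index):
    """Return a copy of dna with the nucleotide at the given 0-based index removed."""
    return dna[:index] + dna[index + 1:]


original = "ATGGCCAAGGGTTGA"        # codons: ATG GCC AAG GGT TGA
mutated = delete_base(original, 3)  # drops the first base of the second codon

print(translate(original))  # MAKG
print(translate(mutated))   # MPRV
```

Every amino acid after the deletion changes, and the original stop codon is no longer read in frame, so the cell would keep translating into whatever sequence follows.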



Other, much rarer, mutations modify the protein’s function, making it more or less efficient, or giving it another function altogether.



In the simplest case (but these are rare), a single such mutation can explain a complex phenotype. This is known as a Mendelian trait and can be used to explain phenotypes such as eye colour, but also hereditary diseases.



Usually, though, it is merely one of many adjusting screws which skew your susceptibility to a certain phenotype in one direction. For instance, you might be slightly more susceptible to breast cancer or diabetes.




This is one use of the DNA sequence, but there are many more. In the last decade, we have realised that regulation of gene activity plays a much larger role than anticipated, and most research today looks at regulatory patterns on the DNA level. This is also done with sequence data, but there doesn’t appear to be a simple scheme analogous to the genetic code to understand regulation. Instead, it’s a complex interplay of many completely unrelated mechanisms.
