Chapters 3 and 4 of the textbook discuss Open Reading Frames (ORFs) and some of the challenges in extending knowledge of ORFs to the prediction of genes.
Although we could (and perhaps will) write our own code to compute ORFs, we will begin by relying on the publicly accessible NCBI ORF Finder to help us explore some data sets.
One important goal in genetic analysis is to take a successfully sequenced genome and better differentiate the "coding" and "noncoding" portions of its DNA. We are especially interested in protein-coding genes within a DNA sequence.
For prokaryotes, it is estimated that about 80% of the DNA is coding, but for eukaryotes often only 1-3% of the DNA is coding.
While strings are convenient for representing a sequence of characters, Python allows for representation of sequences of arbitrary types of data as well. The primary structure for such a sequence is a Python list.
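As a small illustration of a list holding pieces of a sequence, the sketch below splits a DNA string into a list of three-letter codons. The variable names and the example string are chosen here for illustration only.

```python
# A DNA sequence stored as a string.
dna = "ATGCATGCATAA"

# Build a list of codons by slicing the string three characters at a time.
codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]

print(codons)       # ['ATG', 'CAT', 'GCA', 'TAA']
print(len(codons))  # 4
print(codons[0])    # ATG
```

Unlike a string, a list can hold elements of any type, and its elements can be replaced, appended, or removed.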
Recall from the central dogma that coding regions of DNA are converted to RNA and then to proteins, with each triple of nucleotides (codon) leading to a specific amino acid in the protein sequence. There are some cases where several distinct codons end up producing the same amino acid. There is also a particular codon (ATG in the original DNA sequence) that is known as the "start codon". It produces the amino acid methionine, but it is also key at the molecular level for getting the process started.
There are also three specific codons that serve as stop codons for the process, and these are TAA, TGA, and TAG.
The precise conversion from codons to amino acids can be given either as a complete table or as a wheel-like diagram, which is more convenient for tracing a codon one letter at a time.
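In Python, the complete table fits naturally in a dictionary that maps each codon to a one-letter amino acid code. The sketch below builds the standard genetic code from a compact 64-character string (bases ordered T, C, A, G, with `*` marking a stop codon). The names `CODON_TABLE` and `translate` are made up for this example, and the function assumes the input contains only the letters A, C, G, and T.

```python
from itertools import product

BASES = "TCAG"
# One amino acid letter per codon, in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every three-letter codon to its amino acid ('*' means stop).
CODON_TABLE = {
    "".join(triple): aa
    for triple, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate codon by codon from position 0 until a stop codon is reached."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(CODON_TABLE["ATG"])         # M
print(translate("ATGCATGCATAA"))  # MHA
```

The table also makes the redundancy visible: six different codons map to leucine (`L`), and exactly three map to `*`.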
Because it matters where you start grouping three nucleotides into a codon, there are actually six different reading frames: three in the forward direction, and three in the reverse direction, since coding regions may also lie on the reverse-complementary strand.
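The six frames can be listed explicitly with a few lines of Python. This is a minimal sketch: the function names and the frame labels (`+1` to `+3` and `-1` to `-3`) are conventions chosen for this example, and the input is assumed to contain only uppercase A, C, G, and T.

```python
# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return a dict mapping a frame label to the list of codons in that frame."""
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames

for name, codons in six_frames("ATGCCGTAA").items():
    print(name, codons)
```

Leftover nucleotides at the end of a frame that do not fill a complete codon are dropped.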
How many ORFs are you able to find for the following strand (including possible ORFs in the implicit complementary strand)?
TTACCTATGCATGCATAACTGA
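After working the exercise by hand, one way to check the answer is with a short script. The sketch below is not how ORF Finder works internally. It assumes a deliberately simple definition: an ORF runs from an ATG to the first in-frame stop codon (inclusive), with no minimum length, and an ATG nested inside an already open ORF is not reported separately. Tools such as ORF Finder typically apply additional settings (for example a minimum length), so their counts may differ. The function names are made up for this example.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

def find_orfs(dna):
    """Return (strand, frame, orf) tuples for each ATG...stop stretch found."""
    orfs = []
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            start = None  # index of the ATG that opened the current ORF
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orfs.append((label, offset + 1, strand[start:i + 3]))
                    start = None
    return orfs

for strand, frame, orf in find_orfs("TTACCTATGCATGCATAACTGA"):
    print(strand, frame, orf)
```

Each printed line gives the strand (`+` or `-`), the frame number (1 to 3), and the ORF itself, so the output can be compared directly against a hand count.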