Saint Louis University | Computer Science 1020 | Computer Science Department
Resources: pp. 28-32 of Ch. 02 of text.
As we get closer to writing our own implementations of bioinformatics algorithms in Python, we do another activity using existing online tools, in this case the Sequence Manipulation Suite (SMS), available at www.bioinformatics.org/sms2/. The motivation is to examine the coding portions of the human gene CFTR, which is associated with cystic fibrosis (CF).
We wish to retrieve the reference (i.e., normal) sequence for the spliced CFTR mRNA. Search the NCBI Nucleotide database to find the entry for the human CFTR gene.
The gene entry should include a link to the NCBI Reference Sequences, and from there you can find the primary sequence entry for the mRNA. (You should find that it has accession number NM_000492.3.) Once you are viewing the entry, we want to focus specifically on the coding sequence (CDS) within the full gene. (Some other time we will consider how the range of the coding sequence is determined.) Within the GenBank entry, look for the CDS feature and click on CDS. Try to cut and paste only the highlighted portion of the full nucleotide sequence that is identified as the CDS. (It starts at nucleotide 133 of the gene with sequence 'atg...'.)
Copy and paste that portion of the coding sequence into a text file for later use. It is okay if it includes the numbering, perhaps starting as
atgcagag gtcgcctctg gaaaaggcca gcgttgtctc caaacttttt
181 ttcagctgga ccagaccaat tttgaggaaa ggatacagac agcgcctgga attgtcagac
241 atataccaaa tcccttctgt tgattctgct gacaatctat ctgaaaaatt ggaaagagaa
...
Save this in a file named NM_000492.3_CDS.txt.
While still at the NCBI nucleotide site for the CDS feature, notice that the feature description includes a "translation" string beginning as
translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
GRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLI
...
This is the amino acid sequence that is produced by the coding sequence. Though we are going to use other tools to produce that sequence from the raw nucleotide sequence, go ahead and copy and paste that translation sequence and save it in a file so that we can later compare our results.
We now return to the task of processing the raw nucleotide sequence (previously saved as NM_000492.3_CDS.txt). Go to the SMS tools and use the "Filter DNA" tool to convert the raw file you saved in the previous step to FASTA format. Though not necessary for these tools, change the default settings so that the remaining letters are converted to uppercase. This should produce a result in FASTA format that begins as
>filtered DNA sequence consisting of 4443 bases.
ATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGCTGGACC
AGACCAATTTTGAGGAAAGGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATC
CCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAATTGGAAAGAGAATGGGATAGAGAG
...
Copy and paste that result and save it in a new file named NM_000492.3_CDS.fasta.
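For those curious how this filtering step might look in Python, here is a minimal sketch. It assumes that only the letters a, c, g, and t should be kept (the SMS tool offers more options than this), and the function names filter_dna and to_fasta are illustrative, not part of SMS.

```python
def filter_dna(raw):
    """Keep only the letters A, C, G, T (either case) and return them in uppercase.

    Assumption: digits, spaces, and any other characters are simply dropped.
    """
    return "".join(ch for ch in raw.upper() if ch in "ACGT")


def to_fasta(title, sequence, width=60):
    """Format a sequence as FASTA text with lines of at most `width` characters."""
    lines = [">" + title]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# A fragment of the numbered text copied from the GenBank entry.
raw = "atgcagag gtcgcctctg gaaaaggcca gcgttgtctc caaacttttt 181 ttcagctgga"
clean = filter_dna(raw)
print(to_fasta("filtered DNA sequence consisting of %d bases." % len(clean), clean))
```

Applied to the full contents of NM_000492.3_CDS.txt, the same two functions should give output resembling the SMS result, though the exact title line is whatever you pass in.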
For the sake of experience, go to the SMS
"
Returning to the original DNA strand, we wish to determine the amino acid sequence which it produces. For this we will use the SMS "Translate" tool. Paste the contents of your NM_000492.3_CDS.fasta file and submit. This should produce an amino acid sequence starting as
>rf 1 filtered DNA sequence consisting of 4443 bases.
MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRE
LASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIA
IYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQL
...
In fact, if you go back to the "translation" string that you should have set aside for safekeeping in step 2, you should find that this is the same sequence (though shown here in FASTA format). Save this new result in a file named NM_000492.3_amino.fasta.
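As a preview of doing this translation ourselves, the following is a minimal Python sketch of reading-frame-1 translation using the standard genetic code. It is not how SMS is implemented; the names CODON_TABLE and translate are our own, stop codons are shown as '*', and any codon containing an unexpected letter is shown as 'X' (both are assumptions made for this sketch).

```python
# Standard genetic code, listed in the conventional T, C, A, G order
# for the first, second, and third base of each codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {}
for i, b1 in enumerate(BASES):
    for j, b2 in enumerate(BASES):
        for k, b3 in enumerate(BASES):
            CODON_TABLE[b1 + b2 + b3] = AMINO_ACIDS[16 * i + 4 * j + k]


def translate(dna):
    """Translate DNA in reading frame 1, ignoring any incomplete final codon."""
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        protein.append(CODON_TABLE.get(codon, "X"))  # 'X' marks an unknown codon
    return "".join(protein)


# The first 27 bases of the CDS shown above.
print(translate("ATGCAGAGGTCGCCTCTGGAAAAGGCC"))  # MQRSPLEKA
```

The printed result matches the first nine letters of the "translation" string from the NCBI entry.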
As a brief aside, we wish to demonstrate another SMS tool. In the above process, we got the correct amino acid sequence because we had already preselected the coding sequence (CDS), beginning precisely with the start codon. Therefore the translation worked for the default reading frame 1. More generally, it might not be clear which reading frame is the appropriate one. SMS includes a "Translation Map" tool.
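To make the idea of a reading frame concrete, here is a small Python sketch that splits the same DNA into codons for each of the three forward reading frames. The function name codons_in_frame is hypothetical, and the sketch ignores the reverse strand.

```python
def codons_in_frame(dna, frame):
    """Split dna into complete codons for reading frame 1, 2, or 3."""
    offset = frame - 1  # frame 1 starts at the first base, frame 2 at the second, ...
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


dna = "ATGCAGAGGTCG"  # the first 12 bases of the CDS
for frame in (1, 2, 3):
    print(frame, codons_in_frame(dna, frame))
```

Only frame 1 begins with the start codon ATG here; shifting by one or two bases yields entirely different codons, and therefore a different amino acid sequence.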
For the second part of this project, the book provides sequence data for potential parents Mary and Tom. The file CFScreening.txt is a FASTA file that has four DNA nucleotide sequences: Mary and Tom each have two versions of the CFTR gene (one allele inherited from a mother and one allele inherited from a father).
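Since a FASTA file may hold several sequences, it can help to see how such a file might be split into records in Python. The following is a minimal sketch; the function name parse_fasta and the headers in the sample text are hypothetical and are not taken from CFScreening.txt.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up sample with two short records (not real allele data).
sample = ">allele_1\nATGCAG\nAGG\n>allele_2\nATGCAGTGG\n"
for name, seq in parse_fasta(sample):
    print(name, len(seq), seq)
```

To use it on the actual file, one could read the file contents with open("CFScreening.txt").read() and pass the resulting string to parse_fasta.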
The goal of this part of the project is to compare each of Mary's and Tom's alleles to the reference gene in search of mutations. Conveniently, all the alleles have the same length, so we will not consider any insertions or deletions, just single nucleotide polymorphisms (SNPs). For each of the four new alleles, there are two ways to compare to the reference:
Compare the DNA nucleotides of the allele to those of the reference.
Compare the amino acid sequence generated by the allele to the reference amino acid sequence. (Presumably, you can use a pipeline of operations similar to that of Part 1 to determine the amino acid sequence for each of the new alleles.)
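Because the sequences have equal length, either comparison amounts to checking the two strings position by position. A minimal Python sketch of that idea follows; the function name find_differences is our own, and the two short strings are made-up examples rather than data from the project files.

```python
def find_differences(reference, allele):
    """List (position, reference_char, allele_char) wherever the two sequences differ.

    Positions are 1-based. Works for DNA or amino acid strings of equal length.
    """
    if len(reference) != len(allele):
        raise ValueError("sequences must have the same length")
    return [
        (pos, ref_char, alt_char)
        for pos, (ref_char, alt_char) in enumerate(zip(reference, allele), start=1)
        if ref_char != alt_char
    ]


# Toy example: a single substitution at position 7.
print(find_differences("ATGCAGAGG", "ATGCAGTGG"))  # [(7, 'A', 'T')]
```

Note that a nucleotide difference does not always change the amino acid sequence, which is one reason to perform both comparisons.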
We introduce two additional SMS tools to perform the pairwise sequence comparisons: the "Pairwise Align Codons" tool and the "Color Align Conservation" visualization tool. To compare two sequences (whether DNA or protein), first place the two sequences in FASTA format in the two boxes of the "Pairwise Align Codons" tool. That produces another FASTA file with some modifications. Copy and paste that result into the input of the "Color Align Conservation" tool and submit it to see a visualization that makes differences between the two sequences more apparent.
To complete this project, you must submit answers to the following questions from Chapter 2.