Computer Science 1020
Introduction to Computer Science: Bioinformatics

CH02 Hands-on Project

Resources: pp. 28-32 of Ch. 02 of text.

As we get closer to writing our own implementations of bioinformatics algorithms in Python, we do one more activity using existing online tools, in this case the Sequence Manipulation Suite (SMS), available at www.bioinformatics.org/sms2/. The motivation is to examine the coding portion of the human gene CFTR, which is associated with cystic fibrosis (CF).

Part 1: Processing the reference genome

We wish to retrieve the reference (i.e., normal) sequence for the spliced CFTR mRNA. Search the NCBI Nucleotide database to find the entry for the human CFTR gene.

The gene entry should include a link to the NCBI Reference Sequences, and from there you can find the primary sequence entry for the mRNA. (You should find that it has accession number NM_000492.3.) Once viewing the entry, we want to focus specifically on the coding sequence (CDS) within the full gene. (Some other time we will consider how the range of the coding sequence is determined.) Within the GenBank entry, look for the CDS feature and click on CDS. Try to cut and paste only the highlighted portion of the full nucleotide sequence that is identified as the CDS. (It starts at nucleotide 133 of the gene with sequence 'atg...'.)

Copy and paste that portion of the coding sequence into a text file for later use. It is okay if it includes the numbering, perhaps starting as
```
                  atgcagag gtcgcctctg gaaaaggcca gcgttgtctc caaacttttt
 181 ttcagctgga ccagaccaat tttgaggaaa ggatacagac agcgcctgga attgtcagac
 241 atataccaaa tcccttctgt tgattctgct gacaatctat ctgaaaaatt ggaaagagaa
 ...
      
```
Save this in a file named NM_000492.3_CDS.txt.
While still at the NCBI nucleotide site for the CDS feature, notice that the feature description includes a "translation" string beginning as
```
      translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
      SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
      GRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLI
      ...
      
```
This is the amino acid sequence that is produced by the coding sequence. Although we are going to use other tools to produce that sequence from the raw nucleotide sequence, go ahead and copy and paste the translation sequence and save it in a file so that we can later compare our results.
We now return to the task of processing the raw nucleotide sequence (previously saved as NM_000492.3_CDS.txt). Go to the SMS tools and use the "Filter DNA" tool to convert the raw file you saved in the previous step to FASTA format. Though not necessary for these tools, change the default settings so that it converts the remaining letters to uppercase. This should produce a result in FASTA format that will begin as
```
      >filtered DNA sequence consisting of 4443 bases.
      ATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGCTGGACC
      AGACCAATTTTGAGGAAAGGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATC
      CCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAATTGGAAAGAGAATGGGATAGAGAG
      ...
      
```
Copy and paste that result and save it in a new file named NM_000492.3_CDS.fasta.
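For intuition about what the filtering step does, here is a minimal Python sketch. It is not the SMS implementation: the function name `filter_dna` is hypothetical, the header line merely imitates the output shown above, and for simplicity it assumes that only the letters a, c, g, and t should be kept.

```python
import re

def filter_dna(raw: str, width: int = 60) -> str:
    """Build FASTA text from a GenBank-style excerpt (hypothetical helper)."""
    # Drop everything except the four nucleotide letters (digits, spaces, newlines).
    bases = re.sub(r"[^acgtACGT]", "", raw).upper()
    # Header imitates the SMS output shown above; the exact wording is an assumption.
    header = f">filtered DNA sequence consisting of {len(bases)} bases."
    # Break the sequence into fixed-width lines.
    lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    return "\n".join([header] + lines)

excerpt = """
                  atgcagag gtcgcctctg gaaaaggcca gcgttgtctc caaacttttt
 181 ttcagctgga ccagaccaat tttgaggaaa ggatacagac agcgcctgga attgtcagac
"""
print(filter_dna(excerpt))
```

Run on the two excerpt lines above, the first sequence line printed should match the first line of the SMS result.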
For the sake of experience, go to the SMS "Reverse Complement" tool and paste the contents of your FASTA file. This should produce a new FASTA result with the reverse complement strand. The original sequence ended with the six nucleotides CTTTAG, and the reverse complement therefore begins with the six nucleotides CTAAAG.
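The reverse complement is simple enough to sketch in a few lines of Python. The function name `reverse_complement` is our own choice for illustration, and the sketch assumes the input contains only the letters A, C, G, and T.

```python
# Map each base to its complement (both cases handled).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order of the string."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("CTTTAG"))  # CTAAAG
```

Applying the function twice returns the original sequence, which is a handy sanity check.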
Returning to the original DNA strand, we wish to determine the amino acid sequence which it produces. For this we will use the SMS "Translate" tool. Paste the contents of your NM_000492.3_CDS.fasta file and submit. This should produce an amino acid sequence starting as
```
      >rf 1 filtered DNA sequence consisting of 4443 bases.
      MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRE
      LASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIA
      IYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQL
      ...
      
```
In fact, if you go back to the "translation" string that you set aside earlier, you should find that this is the same sequence (though shown in FASTA format). Save this new result in a file named NM_000492.3_amino.fasta.
As a brief aside, we wish to demonstrate another SMS tool. In the above process, we got the correct amino acid sequence because we had already selected the coding sequence (CDS), beginning precisely with the start codon. Therefore the translation worked for the default reading frame 1. More generally, it might not be clear which is the appropriate reading frame. SMS includes a "Translation Map" tool.
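To see what translation involves, here is a minimal Python sketch using the standard genetic code. It is an illustration rather than the SMS implementation; the names `CODON_TABLE` and `translate` are our own, stop codons are written as `*`, and any codon containing a letter other than A, C, G, or T is shown as `X` (an assumption made here for simplicity).

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate DNA in reading frame 1, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

# First 27 nucleotides of the CDS shown earlier.
print(translate("ATGCAGAGGTCGCCTCTGGAAAAGGCC"))  # MQRSPLEKA
```

The printed result agrees with the first nine letters of the "translation" string from the NCBI entry.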
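The effect of the reading frame can be illustrated with a short Python sketch that splits the same DNA into codons for each of the three forward frames. The function name `codons_in_frame` is hypothetical, and this is only meant to show why the choice of frame matters, not how the SMS tool works internally.

```python
def codons_in_frame(dna: str, frame: int) -> list[str]:
    """Split dna into complete codons, starting at reading frame 1, 2, or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    start = frame - 1
    # Stop before any incomplete codon at the end.
    end = len(dna) - (len(dna) - start) % 3
    return [dna[i:i + 3] for i in range(start, end, 3)]

for frame in (1, 2, 3):
    print(frame, codons_in_frame("ATGCAGAGG", frame))
```

Only frame 1 begins with the start codon ATG; shifting by one or two nucleotides yields entirely different codons, and hence a different amino acid sequence.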

Part 2: Detecting Mutations

For the second part of this project, the book provides sequence data for potential parents Mary and Tom. The file CFScreening.txt is a FASTA file that has four DNA nucleotide sequences: Mary and Tom each have two versions of the CFTR gene (one allele inherited from a mother and one allele inherited from a father).
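A FASTA file with several records can be read with a few lines of Python. The sketch below is a minimal parser under simple assumptions: each record begins with a `>` header line, and the sequence may span multiple lines. The record names `allele_A` and `allele_B` are made up for illustration; the actual headers in CFScreening.txt may differ.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return a dictionary mapping each FASTA header to its full sequence."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts).upper() for n, parts in records.items()}

# Hypothetical two-record example.
sample = """>allele_A
atgcag
aggtcg
>allele_B
atgcaa
"""
print(parse_fasta(sample))
```

To use it on the real data, read the file contents into a string first, for example with `open("CFScreening.txt").read()`.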

The goal of this part of the project is to compare each of Mary's and Tom's alleles to the reference gene in search of mutations. Conveniently, all the alleles have the same length, so we will not consider any insertions or deletions, just single nucleotide polymorphisms (SNPs). For each of the four new alleles, there are two ways to compare to the reference:

Compare the DNA nucleotides of the allele to those of the reference.
Compare the amino acid sequence generated by the allele to the reference amino acid sequence. (Presumably, you can use a pipeline of operations similar to that in Part 1 to determine the amino acid sequence for each of the new alleles.)
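Because the sequences have equal length, either comparison amounts to checking position by position. The following Python sketch shows the idea; the function name `find_substitutions` and the short example strings are hypothetical, and the same function works for DNA or amino acid strings.

```python
def find_substitutions(reference: str, allele: str) -> list[tuple[int, str, str]]:
    """List (position, reference letter, allele letter) wherever the two differ.

    Positions are 1-based, matching the numbering used in sequence entries.
    """
    if len(reference) != len(allele):
        raise ValueError("sequences must have the same length")
    pairs = zip(reference.upper(), allele.upper())
    return [(i, r, a) for i, (r, a) in enumerate(pairs, start=1) if r != a]

# Made-up example: a single substitution at position 6.
print(find_substitutions("ATGCAGAGG", "ATGCAAAGG"))  # [(6, 'G', 'A')]
```

An empty list means the two sequences are identical.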

We introduce two additional SMS tools to perform the pairwise sequence comparisons: the "Pairwise Align Codons" tool and the visualizing "Color Align Conservation" tool. To compare two sequences (whether DNA or protein), first place the two sequences in FASTA format in the two boxes of the "Pairwise Align Codons" tool. That produces another FASTA file with some modifications. Copy and paste that result into the input of the "Color Align Conservation" tool and submit it to see a visualization that makes differences between the two sequences more apparent.

Questions to Complete

To complete this project, you must submit answers to the following questions from Chapter 2.

Compare each of Mary's alleles with the wild-type coding sequence. Can you identify a mutation in either or both?
Now, translate each of Mary's alleles and compare them with the wild-type amino-acid sequence. What differences can you detect?
We know Mary is a carrier of the CF allele but does not have the disease. Summarize your findings for Mary's CFTR alleles: describe the mutation(s) that have occurred, discuss how (if at all) they affect the CFTR protein, and explain how your genomic data fit with what Mary already knows, including which allele she must have inherited from each of her parents.
Repeat your analysis for each of Tom's two alleles. Is he a carrier of CF? Summarize your findings as in question 3 and determine the probability that Tom and Mary will have a child with CF.

Michael Goldwasser

Saint Louis University

Spring 2019

Computer Science Department
