Course Home | Assignments | Data Sets/Tools | Python | Schedule | Git Submission | Tutoring

Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2019

Computer Science Department

Homework Assignment 05

Phylogeny



Collaboration Policy

For this assignment, you must work alone. Please make sure you adhere to the policies on academic integrity in this regard.


Overview

Topic: Phylogeny
Related Reading: Chapter 6 of the text
Due: 11:59pm, Wednesday, March 27, 2019

We will perform the Web Exploration project from Chapter 6 of the text. A full description is given there, but in short we will use the β-casein gene (CSN2), which is conserved in placental mammals, as the evolutionary clock in analyzing the likely phylogenetic relationships between whales, dolphins and a variety of other mammals.


Your Task

  1. Collect the CSN2 gene data for the following animals and store the sequences in a single FASTA file named animals.fasta (which should also be submitted). Make sure that the header line for each sequence has the form

          >commonName accessionNumber
    For example, the whale header could be
          >Whale JF701647
          

    We want you to include all of the following animals. Although we are giving you GenBank accession numbers, please spend some time doing the search yourself to see whether you end up with the same samples; in the end we ask you to include two additional mammals of your choice, so you will need to be familiar with the search process. Make sure you are using the β-casein gene (CSN2) for each species. (Note that for some species only exon 7 is included, and for some the sample is mRNA.) Note well that all of these should be on the order of 400-1400 bp, so if you are getting something significantly longer, it is not the coding portion of this gene.

  2. Go to the website phylogeny.fr

    1. Select the "One-click" option for Phylogeny analysis
    2. Upload your animals.fasta file and submit. It will take some time to complete the analysis pipeline, but when done, it will render the resulting tree. The relationship for dog/whale/porpoise/camel/rat should hopefully be somewhat like Figure 6.4 on page 114 of the text, but the analysis will be different because of the additional species.
    3. There are some options for the display of your tree. Please select the "Branch lengths" option.
    4. Download (and submit) a PDF of the rendered tree.
    5. (ANSWER QUESTIONS 1-3 BELOW)

    6. Click on the Alignment tab of the computed pipeline. This shows the computed multiple alignment of your species. You will note that there were some species that had the entire gene and some species that had only exon 7 and thus are missing parts.
    7. (ANSWER QUESTION 4 BELOW)

    8. Click on the Curation tab of the computed pipeline. The curation focuses attention only on those nucleotides of the multiple alignment that are represented in ALL species. That is, any location that was assigned a gap for any species will be ignored. Graphically, the "blue bar" regions of the curation are those that have a nucleotide for each species, and just those common portions of the multiple sequences can be downloaded in the "Outputs" section below as "Cured alignment in FASTA Format". Download that cured alignment, save it as cured.fasta (and submit it).
  3. The authors provide an Evolutionary Distance Calculator that can be used to compute the three distance metrics described in the chapter for any pair of aligned sequences. Using sequences pasted from the cured.fasta file saved in the Curation step (step 2, item 8), perform a series of tests to answer the remaining questions below. Note well that when pasting two sequences into this calculator, you must NOT include the descriptive header line; just paste the two nucleotide sequences for comparison. However, if you want to analyze lots of pairs, you will need to make a lot of separate submissions with this tool.

    For convenience, I've written my own script that will analyze all pairs with all three distance metrics in a single run. The script can be downloaded as distance.py, and it is written under the assumption that you have already saved a file specifically named cured.fasta as described in the Curation step (step 2, item 8).

    Once you have been able to compute all the distance measures among your pairs of species, go on to answer the final questions 5 and 6 below.
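
To make the all-pairs computation in step 3 concrete, here is a minimal Python sketch. It is not the contents of distance.py. It assumes that the three metrics are the Jukes-Cantor, Kimura two-parameter, and Tamura distances (this page names only Tamura), and it estimates GC-content from the two aligned sequences themselves, which is a simplification. All function names are hypothetical.

```python
import math
from itertools import combinations

BASES = set("ACGT")
PURINES = set("AG")


def parse_fasta(lines):
    """Map the first word of each header (e.g. 'Whale') to its sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed"
            records[name] = ""
        elif line and name is not None:
            records[name] += line.upper()
    return records


def site_fractions(a, b):
    """Return (P, Q, theta) over sites where both sequences have A/C/G/T.

    P is the fraction of transitions, Q the fraction of transversions, and
    theta the GC fraction of the two aligned sequences combined (assumption:
    the aligned gene stands in for whatever GC-content the metric intends).
    """
    sites = transitions = transversions = gc = 0
    for x, y in zip(a, b):
        if x not in BASES or y not in BASES:
            continue  # skip gaps and ambiguity codes
        sites += 1
        gc += (x in "GC") + (y in "GC")
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites, gc / (2 * sites)


# Each formula raises ValueError when the sequences are too divergent
# for the logarithm to be defined.
def jukes_cantor(p, q):
    return -0.75 * math.log(1 - 4 * (p + q) / 3)


def kimura(p, q):
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def tamura(p, q, theta):
    h = 2 * theta * (1 - theta)  # h is 0 (division error) if theta is 0 or 1
    return -h * math.log(1 - p / h - q) - 0.5 * (1 - h) * math.log(1 - 2 * q)


def all_pairs(records):
    """Return {(name_a, name_b): (jukes_cantor, kimura, tamura)}."""
    table = {}
    for a, b in combinations(records, 2):
        p, q, theta = site_fractions(records[a], records[b])
        table[(a, b)] = (jukes_cantor(p, q), kimura(p, q), tamura(p, q, theta))
    return table
```

One possible use is to open cured.fasta, pass the file object to parse_fasta, and print each entry of all_pairs. When theta is exactly 0.5, the Tamura formula reduces to the Kimura one, which is a handy sanity check.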


Questions

  1. Which node of your tree represents the most recent common ancestor of whales and any terrestrial mammal?

  2. Notice that the whale and dolphin branches are not equally long (in phylogram view). What does this difference in length represent, biologically?

  3. An outgroup is the single most distant species, used to root the tree; it is considered to lie outside the group formed by the remaining species. What was the outgroup species in your tree rendering?

  4. Looking at the original (uncurated) multiple alignment results, you should see that some of the species sequences were for the complete gene while others included only exon 7. Which were the species with more complete data and which had more limited data? Were there any species that seemed to have significant portions of either extra or missing nucleotides relative to the rest of the group? Explain.

  5. Provide a qualitative analysis of the three distance metrics, and whether there are any significant reorderings of which species are nearer to each other, or any significant differences in the relative scale of the differences.

    For the sake of disclosure, I believe that neither of the tools provided computes the Tamura distance as intended, because that metric is defined in terms of the overall GC-content of each species' entire genome (not just the aligned gene), but our tools are given only the aligned gene, not the full genome.

  6. Choose one of the distance metrics and sketch a phylogenetic tree based on the distances calculated for that metric. Remember to make your branch lengths proportional to distance. How does your tree compare with the one generated by phylogeny.fr?
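
As a way to check a hand sketch for question 6, here is a minimal Python sketch of UPGMA, one common distance-based clustering method. Choosing UPGMA is an assumption: this page does not say which method to use, and the result should not be assumed to match the method behind the phylogeny.fr tree. The function name and the input layout are hypothetical.

```python
def upgma(names, dist):
    """Build a tree by UPGMA (average-linkage clustering).

    names: list of species labels.
    dist:  dict mapping (names[i], names[j]) with i < j to a distance.
    Returns nested tuples (left, right, height); leaves are plain labels.
    """
    # Each active cluster id maps to (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[(names[i], names[j])]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        merged = {}
        for k in clusters:
            dik = d[tuple(sorted((i, k)))]
            djk = d[tuple(sorted((j, k)))]
            merged[k] = (size_i * dik + size_j * djk) / (size_i + size_j)
        # Drop distances that involve the two clusters just merged.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        for k, v in merged.items():
            d[(k, next_id)] = v  # new ids are always larger than old ones
        # The new node sits at half the distance between the merged clusters.
        clusters[next_id] = ((tree_i, tree_j, dij / 2), size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

The dist argument could be filled with the pairwise values of whichever one metric you chose. Each internal node carries its height, so the length of a branch is the parent's height minus the child's height (a leaf has height 0). UPGMA places all leaves at the same distance from the root, so unequal sister branches like those in the phylogeny.fr rendering will not appear in its output.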


Submitting Your Assignment Electronically

You should submit all of the following electronic materials into the hw05 folder of your git repository.

Please note the late policy for homeworks.


Grading Standards

This assignment is worth 40 points. Each of the six questions will be worth 5 points, and the final 10 points will be apportioned to the other electronic artifacts that are to be submitted.


Michael Goldwasser
CSCI 1020, Spring 2019
Last modified: Wednesday, 20 March 2019