Saint Louis University
Computer Science 1020
Computer Science Department
For this assignment, you must work alone. Please make sure you adhere to the policies on academic integrity in this regard.
Topic: Phylogeny
Related Reading: Chapter 6 of the text
Due:
11:59pm, Wednesday, March 27, 2019
We will perform the Web Exploration project from Chapter 6 of the text. A full description is given there, but in short we will use the β-casein gene (CSN2), which is conserved in placental mammals, as the evolutionary clock in analyzing the likely phylogenetic relationships between whales, dolphins and a variety of other mammals.
Collect the CSN2 gene data for the following animals and store those as a single FASTA file named animals.fasta (that should also be submitted). Make sure that the preliminary comment line for each sequence has the form
>commonName accessionNumber
For example, the whale preface could be
>Whale JF701647
We want you to include all of the following animals. While we are giving you GenBank accession numbers, please spend some time trying to do the search for yourself and see if you end up with those samples; in the end we ask you to include two additional mammals of your choice, so you will need to be familiar with the search process. Make sure you are using the β-casein gene (CSN2) for each species. (Note that for some species only exon 7 is included, and for some it is the mRNA sample.) Note well that all of these should be on the order of 400-1400 bp, so if you are getting something significantly longer, it's not the coding portion of this gene.
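Once animals.fasta is assembled, a quick sanity check can catch the two problems mentioned above: a preface line that does not have the requested form, and a sequence far outside the 400-1400 bp range. The following is only a minimal sketch. The function names are hypothetical, and it assumes that commonName and accessionNumber are each a single token separated by one space.

```python
import re

# Assumed preface shape: ">commonName accessionNumber", e.g. ">Whale JF701647".
# Treating each part as one whitespace-free token is an assumption.
HEADER = re.compile(r"^>(\S+) (\S+)$")

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def check_records(handle, min_len=400, max_len=1400):
    """Return a list of problem descriptions; an empty list means no problems."""
    problems = []
    for header, seq in read_fasta(handle):
        if not HEADER.match(header):
            problems.append(f"bad preface line: {header}")
        if not min_len <= len(seq) <= max_len:
            problems.append(
                f"{header}: length {len(seq)} is outside {min_len}-{max_len} bp"
            )
    return problems
```

To use it, open the file and print whatever comes back, for example `with open("animals.fasta") as f: print(check_records(f))`. The length bounds are taken from the rough range quoted above and are not hard limits.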
Go to the website phylogeny.fr
(ANSWER QUESTIONS 1-3 BELOW)
(ANSWER QUESTION 4 BELOW)
The authors provide an Evolutionary Distance Calculator that can be used to compute the three distance metrics described in the chapter for any pairwise aligned sequences. Using results pasted from the cured.fasta file from step (2f), perform a series of tests to answer the remaining questions below. Note well that when pasting two sequences into this calculator, you must NOT include the descriptive preface line; just paste the two nucleotide sequences for comparison. However, if you want to analyze lots of pairs, you will need to make many separate submissions with this tool.
For convenience, I've written my own script that will do analysis of all pairs and all three distance metrics on a single run. The script can be downloaded as distance.py, and it is written under the assumption that you will have already saved a file specifically named cured.fasta as described in step (2f).
Once you have been able to compute all the distance measures among your pairs of species, go on to answer the final questions 5 and 6 below.
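To make the distance calculations concrete, here is a minimal sketch of pairwise distances on two already-aligned sequences. It is not the distance.py script or the authors' calculator. It assumes the three metrics are Jukes-Cantor, Kimura two-parameter, and Tamura (only Tamura is named in this handout, so check Chapter 6). It also assumes that columns containing a gap or ambiguous base are skipped, and that GC-content is estimated from the aligned sequences themselves.

```python
import math

BASES = set("ACGT")
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def site_fractions(s1, s2):
    """Return (P, Q): fractions of transitions and transversions over the
    aligned columns where both sequences have an unambiguous base."""
    ts = tv = n = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a not in BASES or b not in BASES:
            continue  # skip gaps and ambiguity codes (an assumption)
        n += 1
        if a != b:
            if (a, b) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
    if n == 0:
        raise ValueError("no comparable columns")
    return ts / n, tv / n

def gc_content(seq):
    """GC fraction of one sequence, ignoring gaps and ambiguity codes."""
    bases = [c for c in seq.upper() if c in BASES]
    return sum(c in "GC" for c in bases) / len(bases)

def jukes_cantor(s1, s2):
    p, q = site_fractions(s1, s2)
    return -0.75 * math.log(1 - 4 * (p + q) / 3)

def kimura_2p(s1, s2):
    p, q = site_fractions(s1, s2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def tamura(s1, s2):
    p, q = site_fractions(s1, s2)
    t1, t2 = gc_content(s1), gc_content(s2)
    h = t1 + t2 - 2 * t1 * t2  # equals 2*t*(1-t) when both GC fractions are t
    return -h * math.log(1 - p / h - q) - 0.5 * (1 - h) * math.log(1 - 2 * q)
```

Each function takes two aligned strings of equal length, for example `kimura_2p(seq_whale, seq_dolphin)`, where the two variable names are placeholders for sequences read from cured.fasta. When the GC fraction is exactly 50%, the Tamura formula reduces to the Kimura two-parameter formula, which is a convenient sanity check. All three formulas fail with a math domain error for pairs so divergent that the logarithm's argument is not positive.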
Which node of your tree represents the most recent common ancestor of whales and any terrestrial mammal?
Notice that the whale and dolphin branches are not equally long (in phylogram view). What does this difference in length represent, biologically?
An outgroup is the single most distant species, used to root a tree because it is considered to lie outside the group formed by the remaining species. What was the outgroup species in your tree rendering?
Looking at the original (uncurated) multiple alignment results, you should see that some of the species sequences were for the complete gene while others included only exon 7. Which were the species with more complete data and which had more limited data? Were there any species that seemed to have significant portions of either extra or missing nucleotides relative to the rest of the group? Explain.
Provide a qualitative analysis of the three distance metrics, noting whether there are any significant reorderings of which species are nearer to each other, or any significant differences in the relative scale of the distances.
For the sake of disclosure, I believe that neither of the tools provided computes the Tamura distance as intended, because that metric is defined in terms of the overall GC-content of the entire species (not just the aligned gene), but our tools are given only the aligned gene, not the full genome.
Choose one of the distance metrics and sketch a phylogenetic tree based on the distances calculated for that metric. Remember to make your branch lengths proportional to distance. How does your tree compare with the one generated by phylogeny.fr?
You should submit all of the following electronic materials into the hw05 folder of your git repository.
File with your answers to the questions.
animals.fasta
This is the file that includes your collected nucleotide
sequences for ALL species in your study.
phylo_tree.pdf
A download of the rendered tree.
cured.fasta
The download of the curated multiple alignment (from step 2e).
Please note the late policy for homeworks.
This assignment is worth 40 points. Each of the six questions will be worth 5 points, and the final 10 points will be apportioned to the other electronic artifacts that are to be submitted.