Saint Louis University |
Computer Science 1020
|
Computer Science Department |
Topic: | Global Sequence Alignments |
Collaboration Policy: | The lab should be completed working in pairs |
Submission Deadline: | 11:59pm Saturday, 3 March 2018 |
In this lab, we explore a variety of models for doing global sequence alignment for nucleotide sequences. Specifically, we look at three computed measures for comparing a pair of nucleotide sequences.
LCS
A straightforward computation of the length of the Longest
Common Subsequence of the two sequences. The longer the common
subsequence, the more closely related we presume the species are.
LCS + gap/mismatch penalty
Computing an alignment in which there is a +1 reward for every
exact match of nucleotides that are aligned, but also with a -1
penalty for any pair of non-matching nucleotides and a -1
penalty for any nucleotide of one sequence that ends up aligned
with a gap in the other.
Substitution matrix and adaptive gap penalty
Rather than a +1 for a match and a -1 for any form of a
mismatch, we use a more general substitution matrix to score how
well single nucleotides match with other nucleotides. In fact,
we note that some of the FASTA files include a standardized
character set defined by the IUPAC to allow for a variety of
ambiguities that might result from sequencing reads. The
specific substitution matrix that we will use is known as
NUC.4.4 (available locally or from NCBI).
For the standard ACGT nucleotides, the matrix provides a +5
reward for a match and a -4 penalty for any mismatch. But there
are more gradations of values for the various matches one might
find with ambiguities.
The second adjustment to our scoring involves the penalty for gaps. In the previous rendition there was simply a fixed cost for each character that was left aligned with a gap. Given that mutations might create large stretches of inserted or deleted nucleotides in one species, we wish to steer the alignment toward closely matching portions of the sequences, even when those portions are separated further from each other in one species than in the other. So rather than penalizing linearly in the size of a gap, we charge one larger penalty for starting a gap in either sequence and a smaller penalty for extending it. Specifically, we charge a penalty of -6 for the first nucleotide in a gap, and -1 for each additional nucleotide. For example, a gap spanning four nucleotides costs -6 + 3(-1) = -9.
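The three measures described above can be sketched as short dynamic programs. This is a minimal illustration, not the code used to produce the linked tables: the function names are hypothetical, the default `sub` function only covers the +5/-4 scores for plain ACGT (the full NUC.4.4 matrix with IUPAC ambiguity codes would have to be loaded separately), and only scores are computed, not the alignments themselves.

```python
NEG_INF = float("-inf")

def linear_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score with a fixed cost per gap character."""
    prev = [j * gap for j in range(len(b) + 1)]      # row 0: b aligned to gaps only
    for i in range(1, len(a) + 1):
        cur = [i * gap]                              # column 0: a aligned to gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def lcs_length(a, b):
    """LCS length: the same recurrence with free mismatches and free gaps."""
    return linear_score(a, b, match=1, mismatch=0, gap=0)

def affine_score(a, b, sub=lambda x, y: 5 if x == y else -4,
                 gap_open=-6, gap_extend=-1):
    """Global alignment score where a gap of length k costs gap_open + (k-1)*gap_extend.

    Three states per cell: M ends in an aligned pair, X ends with a character
    of `a` over a gap, Y ends with a character of `b` over a gap.
    """
    m = len(b)
    M = [0] + [NEG_INF] * m
    X = [NEG_INF] * (m + 1)
    Y = [NEG_INF] + [gap_open + (j - 1) * gap_extend for j in range(1, m + 1)]
    for i in range(1, len(a) + 1):
        new_m = [NEG_INF] * (m + 1)
        new_x = [gap_open + (i - 1) * gap_extend] + [NEG_INF] * m
        new_y = [NEG_INF] * (m + 1)
        for j in range(1, m + 1):
            new_m[j] = sub(a[i - 1], b[j - 1]) + max(M[j - 1], X[j - 1], Y[j - 1])
            new_x[j] = max(M[j] + gap_open, Y[j] + gap_open, X[j] + gap_extend)
            new_y[j] = max(new_m[j - 1] + gap_open, new_x[j - 1] + gap_open,
                           new_y[j - 1] + gap_extend)
        M, X, Y = new_m, new_x, new_y
    return max(M[-1], X[-1], Y[-1])
```

On toy inputs, `linear_score("ACGT", "AGT")` is 2 (three matches and one gap character), while `affine_score("ACGTACGT", "ACGT")` is 11 (four matches at +5 each, plus one gap of length four at -9). Each function keeps only one row of the table at a time, which keeps memory linear in the sequence length even though the running time remains quadratic.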
In order to experiment on some real data, we went back to the mitochondrial DNA samples provided by NCBI, and in particular a subset of species that were identified by students of this class as part of the initial course questionnaire. That full data set, and the raw FASTA files that contain the nucleotide sequences, are linked from the data sets portion of the class page. But the analyses of our global sequence alignments are on this page. Given the quadratic nature of our current sequence alignment algorithms, we have restricted the experiments to only those sequences that have 30K bp or less (which effectively gets us most of the animals, and excludes most plants and fungi). In the end, this left us with 80 species indicated in the table below.
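To make the length restriction concrete, here is a minimal sketch of reading FASTA text, keeping only the sequences at or below a length cutoff, and counting how many dynamic-programming cells an all-pairs comparison would need. The helper names and the toy records are hypothetical; the real data are the FASTA files linked from the class page.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):                 # a header line starts a new record
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())          # sequence may span several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records

def keep_short(records, max_bp=30_000):
    """Keep only the records whose sequence has at most max_bp characters."""
    return {name: seq for name, seq in records.items() if len(seq) <= max_bp}

def dp_cells(records):
    """Total table cells over all unordered pairs; a pair costs len(a) * len(b)."""
    lengths = [len(seq) for seq in records.values()]
    return sum(lengths[i] * lengths[j]
               for i in range(len(lengths))
               for j in range(i + 1, len(lengths)))

# Toy input, not real data.
demo = ">toy_a\nACGT\nAC\n>toy_b\nGGGTTTAAAC\n"
records = read_fasta(demo)
print(keep_short(records, max_bp=8))
print(dp_cells(records))
```

For a sense of scale, aligning two sequences of about 16,500 bp each fills a table of roughly 2.7 × 10^8 cells, which is why much longer genomes quickly become impractical for an all-pairs experiment.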
We then scored a sequence alignment between all pairs of species, optimizing for each of the three quantitative measures defined above. Then, for each species, we have prepared a table which shows how that species compares to all other species, sorted from largest to smallest measures. Those tables are linked in the table below.
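The all-pairs experiment and the per-species tables can be sketched as follows. This is an illustration under assumptions, not the scripts behind the linked tables: the function names are hypothetical, `toy_score` is a stand-in for any of the three real measures, and the scoring function is assumed to be symmetric so that each pair is scored only once.

```python
from itertools import combinations

def all_pairs_scores(seqs, score):
    """Score every unordered pair once and store the result in both directions."""
    table = {name: {} for name in seqs}
    for x, y in combinations(seqs, 2):
        s = score(seqs[x], seqs[y])
        table[x][y] = s
        table[y][x] = s          # relies on score(a, b) == score(b, a)
    return table

def ranking(table, reference):
    """All other species for one reference, sorted from largest to smallest score."""
    return sorted(table[reference].items(), key=lambda kv: kv[1], reverse=True)

def toy_score(a, b):
    """Stand-in measure: count positions where the two strings agree."""
    return sum(x == y for x, y in zip(a, b))

# Toy sequences, not real data.
seqs = {"toy_a": "ACGTAC", "toy_b": "ACGTTT", "toy_c": "TTTTTT"}
table = all_pairs_scores(seqs, toy_score)
print(ranking(table, "toy_a"))
```

With 80 species there are 80 × 79 / 2 = 3160 unordered pairs, and each must be scored separately for each of the three measures.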
Note: As of class, we are completely finished with the computations of the original LCS and the LCS+gap measures, but the advanced one is still running; at the moment of writing, it seems we have the all-pairs matrix for about 66 of the 80 species. So some of the linked tables will be missing for that final handful of species, and even in the tables given for the other species, this last batch of species is not included in the analyses. So if you don't see the Santa Barbara tree frog showing up as a close match for some other species, it could simply be that we haven't yet done that measurement. I will update these tables to be the complete tables once the calculations are complete.
Note well: the absolute range of these scales is somewhat arbitrary, so it is the relative differences that are important. In particular, within a single category of analyses, there's no particular meaning to where scores become positive vs. negative, as that's an artifact of the scoring system. More importantly, the scores we get for a particular comparison under the three different measures should not be compared to each other.
What should (hopefully) be significant is the ordering, for a specific measure and a specific reference species, of how the other 79 species compare to it. Our hope is certainly that those with the highest scores are the best matches in terms of the underlying biological sequences, and also that large gaps between some scores and others give a sense of whether there are clear strata of nearby and distant species.
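One simple way to look for such strata is to find the largest drop between consecutive scores in a sorted table. The sketch below is only one possible heuristic, the function name is hypothetical, and the numbers are made up for illustration rather than taken from the actual results.

```python
def largest_drop(ranked):
    """Given (name, score) pairs sorted from largest to smallest score,
    return (size of the top group, size of the drop just below it)."""
    if len(ranked) < 2:
        return None
    # Index of the first position where the drop to the next score is largest.
    i = max(range(len(ranked) - 1),
            key=lambda k: ranked[k][1] - ranked[k + 1][1])
    return i + 1, ranked[i][1] - ranked[i + 1][1]

# Made-up scores, not actual experimental results.
ranked = [("toy_b", 950), ("toy_c", 940), ("toy_d", 610), ("toy_e", 600)]
print(largest_drop(ranked))
```

Here the two top entries form a tier separated from the rest by a drop of 330. Because only relative differences within one measure are meaningful, a drop found under one measure should not be compared numerically with a drop found under another.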
So as your task, we are going to use a crowdsourced approach to do some basic sanity checking of the experimental results. Specifically, your pair is responsible for writing up a report (to be submitted electronically by the deadline of 11:59pm Saturday, 3 March 2018) that does the following.
You are to pick five reference species (ideally ones that you think are probably not closely related to each other in the biological sense). For each of these five species, you are to do a close examination of the three different analyses and to write a reflection on your discoveries. Specifically, we hope you will be able to comment on the following:
Do the results match your "intuition" for closely related species? That is, do you think the ones that show up near the top of the list are likely to be closely related species, and are the ones near the bottom of the list ones that you'd expect to be very distant? I'm presuming we'd like to find out things such as perhaps that birds are more like other birds, canines are more like other canines, and so forth, but I don't actually know the ground truth and it might be that there really are closely related sequences between species that we might not think are as closely related as others (e.g., human and mouse).
What differences do you see between the three different computational measures for your species? That is, do all three (lcs, gap, advanced) measures produce roughly the same predictions in terms of closely related species, or do you see significant differences? Certainly we are interested in knowing whether the more complex measures are producing more meaningful results.
If we presume that the most advanced of the three analyses is the most reliable, are there any outliers in the analysis of your species? That is, does some species show up near the top of the matches that you would not have expected to be as closely related, or does some species that you would have expected to be closely related turn out to be relatively low on the list?
Was there a noticeable gap between a group of most closely related species and the next tier, or was there a more smooth range of scores with no obvious tiers?
Please write up your analyses clearly in an electronic document, with the names of both members of the pair indicated at the top, and have one member of the pair submit the document through the course submissions website.
Note that species in this table are simply ordered by their accession number, which seems to have no relevance to biological taxonomy (but perhaps is related to how long ago they were provided to NCBI).
accession# | bp | common name | analyses | ||
---|---|---|---|---|---|
NC_000884.1 | 16801 | guinea pig | lcs | gap | adv |
NC_000891.1 | 17019 | platypus | lcs | gap | adv |
NC_000894.1 | 20992 | leishmania | lcs | gap | adv |
NC_001499.1 | 5894 | abelson (virus) | lcs | gap | adv |
NC_001601.1 | 16402 | blue whale | lcs | gap | adv |
NC_001722.1 | 10359 | hiv-2 (virus) | lcs | gap | adv |
NC_001788.1 | 16670 | wild_ass | lcs | gap | adv |
NC_002083.1 | 16499 | orangutan | lcs | gap | adv |
NC_002783.2 | 16749 | rhea | lcs | gap | adv |
NC_003190.1 | 16715 | john dory | lcs | gap | adv |
NC_005212.1 | 17047 | cheetah | lcs | gap | adv |
NC_005797.1 | 16369 | axolotl | lcs | gap | adv |
NC_005958.1 | 16016 | alligator lizard | lcs | gap | adv |
NC_006887.1 | 16375 | tiger salamander | lcs | gap | adv |
NC_006928.1 | 16408 | brydes whale | lcs | gap | adv |
NC_008092.1 | 16729 | gray wolf 1 | lcs | gap | adv |
NC_008161.1 | 14853 | stony coral | lcs | gap | adv |
NC_008410.1 | 17277 | asiatic toad | lcs | gap | adv |
NC_009064.1 | 16703 | indo-pacific sergeant | lcs | gap | adv |
NC_009686.1 | 16757 | gray wolf 2 | lcs | gap | adv |
NC_009830.1 | 16434 | powderblue surgeonfish | lcs | gap | adv |
NC_010570.1 | 16433 | pirarucu | lcs | gap | adv |
NC_011180.1 | 16825 | flat needlefish | lcs | gap | adv |
NC_011196.1 | 16738 | greylag goose | lcs | gap | adv |
NC_011943.1 | 16502 | starry triggerfish | lcs | gap | adv |
NC_011947.1 | 16441 | spiny tailed leatherjacket | lcs | gap | adv |
NC_012920.1 | 16569 | human | lcs | gap | adv |
NC_014887.1 | 15599 | chinese grasshopper | lcs | gap | adv |
NC_015119.1 | 16803 | snow scorpionfly | lcs | gap | adv |
NC_016197.1 | 13724 | filarial nematode | lcs | gap | adv |
NC_016198.1 | 14281 | giant roundworm | lcs | gap | adv |
NC_016428.1 | 16263 | striped field mouse | lcs | gap | adv |
NC_016577.1 | 2633 | AbMV (virus) | lcs | gap | adv |
NC_018801.1 | 16775 | red-winged blackbird | lcs | gap | adv |
NC_018804.1 | 16773 | saffron-cowled blackbird | lcs | gap | adv |
NC_019571.1 | 13913 | cat lungworm | lcs | gap | adv |
NC_020099.1 | 1670 | copepod | lcs | gap | adv |
NC_020346.1 | 17098 | greenspot goby | lcs | gap | adv |
NC_020591.1 | 16673 | hazel grouse | lcs | gap | adv |
NC_020648.1 | 16538 | striped skunk | lcs | gap | adv |
NC_021933.1 | 15282 | millipede | lcs | gap | adv |
NC_022415.1 | 16744 | white shark | lcs | gap | adv |
NC_022827.1 | 18479 | staghorn coral | lcs | gap | adv |
NC_023248.1 | 29999 | anamorphic fungus | lcs | gap | adv |
NC_023889.1 | 16386 | killer whale | lcs | gap | adv |
NC_024052.1 | 16965 | diana tarsier | lcs | gap | adv |
NC_024268.1 | 17937 | silver-throated bushtit | lcs | gap | adv |
NC_024626.1 | 52528 | freshwater green alga | lcs | gap | adv |
NC_025222.1 | 16560 | tonkean macaque | lcs | gap | adv |
NC_026082.1 | 17952 | besra | lcs | gap | adv |
NC_026104.1 | 15804 | stonefly | lcs | gap | adv |
NC_026914.1 | 14948 | water flea | lcs | gap | adv |
NC_027241.1 | 16893 | goulds sunbird | lcs | gap | adv |
NC_027847.1 | 17962 | grey parrot | lcs | gap | adv |
NC_027857.1 | 16551 | bandit angelfish | lcs | gap | adv |
NC_027956.1 | 16721 | african golden wolf | lcs | gap | adv |
NC_028290.1 | 16565 | atlantic sturgeon | lcs | gap | adv |
NC_028510.1 | 17271 | austrolebias | lcs | gap | adv |
NC_029146.1 | 17821 | silvereye | lcs | gap | adv |
NC_029168.1 | 15326 | acasta sulcata | lcs | gap | adv |
NC_029510.1 | 15258 | dancing acraea | lcs | gap | adv |
NC_029846.1 | 17370 | lesser kestrel | lcs | gap | adv |
NC_030247.1 | 16603 | labeonin | lcs | gap | adv |
NC_031807.1 | 16581 | common carp | lcs | gap | adv |
NC_031858.1 | 16310 | pacific star shell | lcs | gap | adv |
NC_032058.1 | 17827 | abyssinian white-eye | lcs | gap | adv |
NC_032084.1 | 17165 | grey burrowing snake | lcs | gap | adv |
NC_033906.1 | 15872 | Sinopodisma wulingshanensis | lcs | gap | adv |
NC_033973.1 | 16817 | terek sandpiper | lcs | gap | adv |
NC_034122.1 | 14913 | congo termite | lcs | gap | adv |
NC_035130.1 | 6363 | australian mosquito | lcs | gap | adv |
NC_035150.1 | 18974 | cat gecko | lcs | gap | adv |
NC_035677.1 | 16130 | bean weevil | lcs | gap | adv |
NC_035817.1 | 16490 | finlaysons squirrel | lcs | gap | adv |
NC_036493.1 | 17325 | Santa Barbara tree frog | lcs | gap | adv |