Course Home | Assignments | Computing Resources | Data Sets | Lab Hours/Tutoring | Python | Schedule | Submit

Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2018

Computer Science Department

Lab 04

Topic: Global Sequence Alignments
Collaboration Policy: The lab should be completed working in pairs
Submission Deadline: 11:59pm Saturday, 3 March 2018

Overview

In this lab, we explore a variety of models for doing global sequence alignment for nucleotide sequences. Specifically, we look at three computed measures for comparing a pair of nucleotide sequences.

  1. LCS
    A straightforward computation of the length of the Longest Common Subsequence of the two sequences. The longer the common subsequence, the more closely related we presume the species are.

  2. LCS + gap/mismatch penalty
    Computing an alignment in which there is a +1 reward for every exact match of nucleotides that are aligned, but also with a -1 penalty for any pair of non-matching nucleotides and a -1 penalty for any nucleotide of one sequence that ends up aligned with a gap in the other.

  3. Substitution matrix and adaptive gap penalty
    Rather than a +1 for a match and a -1 for any form of mismatch, we use a more general substitution matrix to score how well single nucleotides match with other nucleotides. In fact, some of the FASTA files use a standardized character set defined by the IUPAC that allows for a variety of ambiguities that might result from sequencing reads. The specific substitution matrix that we will use is known as NUC.4.4 (available locally or from NCBI). For the standard ACGT nucleotides, the matrix provides a +5 reward for a match and a -4 penalty for any mismatch, but there are more gradations of values for the various matches involving ambiguity codes.

    The second adjustment to our scoring involves the penalty for gaps. In the previous rendition there was simply a fixed cost for each character that was left aligned with a gap. Given that mutations might create large stretches of inserted or deleted nucleotides in one species, we wish to steer the alignment toward closely matching portions of a sequence even when those portions are further separated from each other in one species than in another. So rather than penalizing linearly by the size of a gap, we have one larger penalty for starting a gap in either sequence, but a smaller penalty for each nucleotide that extends it. Specifically, we charge a penalty of -6 for the first nucleotide in a gap, and -1 for each additional nucleotide.
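To make the three measures concrete, here is a minimal Python sketch that computes only the optimal score for each one (not the alignment itself). It is an illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, and simple_sub is a stand-in that covers only unambiguous A, C, G, T with the +5/-4 values quoted above, rather than the full NUC.4.4 matrix with its IUPAC ambiguity codes.

```python
NEG = float("-inf")

def linear_gap_score(s, t, match, mismatch, gap):
    """Best global alignment score with a fixed cost per gap character."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + pair,   # align s[i-1] with t[j-1]
                           prev[j] + gap,        # s[i-1] against a gap
                           cur[j - 1] + gap))    # t[j-1] against a gap
        prev = cur
    return prev[-1]

def lcs_length(s, t):
    # Measure 1: LCS is the special case where only matches count.
    return linear_gap_score(s, t, match=1, mismatch=0, gap=0)

def gap_score(s, t):
    # Measure 2: +1 match, -1 mismatch, -1 per gap character.
    return linear_gap_score(s, t, match=1, mismatch=-1, gap=-1)

def simple_sub(a, b):
    # Assumption: stand-in for NUC.4.4, valid for unambiguous ACGT only.
    return 5 if a == b else -4

def affine_gap_score(s, t, sub=simple_sub, gap_open=-6, gap_extend=-1):
    """Measure 3: gap_open for the first character of a gap,
    gap_extend for each additional character in the same gap."""
    m = len(t)
    # M: last column pairs two characters; X: s against a gap; Y: t against a gap.
    M = [0] + [NEG] * m
    X = [NEG] * (m + 1)
    Y = [NEG] + [gap_open + (j - 1) * gap_extend for j in range(1, m + 1)]
    for i in range(1, len(s) + 1):
        new_m = [NEG] * (m + 1)
        new_x = [gap_open + (i - 1) * gap_extend] + [NEG] * m
        new_y = [NEG] * (m + 1)
        for j in range(1, m + 1):
            new_m[j] = sub(s[i - 1], t[j - 1]) + max(M[j - 1], X[j - 1], Y[j - 1])
            new_x[j] = max(M[j] + gap_open, Y[j] + gap_open, X[j] + gap_extend)
            new_y[j] = max(new_m[j - 1] + gap_open, new_x[j - 1] + gap_open,
                           new_y[j - 1] + gap_extend)
        M, X, Y = new_m, new_x, new_y
    return max(M[m], X[m], Y[m])

print(lcs_length("ACGT", "AGT"))        # 3
print(gap_score("ACGT", "AGT"))         # 2  (three matches, one gap)
print(affine_gap_score("ACGT", "AGT"))  # 9  (3 * 5 - 6)
print(affine_gap_score("AAAA", "A"))    # -3 (5 - 6 - 1 - 1)
```

The last call shows the effect of the adaptive penalty: a single gap of three characters costs -8 rather than the -18 that a flat -6 per character would charge. Each function fills a table with one cell per pair of positions, which is the quadratic cost mentioned in the Experiments section.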


Experiments

In order to experiment on some real data, we went back to the mitochondrial DNA samples provided by NCBI, and in particular a subset of species that were identified by students of this class as part of the initial course questionnaire. That full data set, and the raw FASTA files that contain the nucleotide sequences, are linked from the data sets portion of the class page. But the analyses of our global sequence alignments are on this page. Given the quadratic nature of our current sequence alignment algorithms, we have restricted the experiments to only those sequences that have 30K bp or less (which effectively gets us most of the animals, and excludes most plants and fungi). In the end, this left us with 80 species indicated in the table below.
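To get a feel for why the quadratic running time matters, the following back-of-the-envelope sketch counts the table cells needed for an all-pairs comparison. The species count of 80 comes from the text; the typical length of 16,500 bp is an assumption chosen to resemble the animal mitochondrial genomes in the table, and the function name is hypothetical.

```python
from math import comb

def alignment_workload(num_species, typical_length):
    """Return (number of pairs, cells per pair, total cells) for one measure."""
    pairs = comb(num_species, 2)          # every unordered pair of species
    cells_per_pair = typical_length ** 2  # one table cell per pair of positions
    return pairs, cells_per_pair, pairs * cells_per_pair

pairs, cells, total = alignment_workload(80, 16_500)
print(pairs)   # 3160
print(cells)   # 272250000
print(total)   # 860310000000
```

Under these assumptions, each of the three measures requires on the order of 10^12 cell updates across all pairs, and a sequence twice as long costs four times as much per comparison.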

We then scored a sequence alignment between all pairs of species, optimizing for each of the three quantitative measures defined above. Then, for each species, we have prepared a table which shows how that species compares to all other species, sorted from the largest score to the smallest. Those tables are linked in the table below.
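As an illustration of how such a per-species table could be derived from all-pairs results, here is a small sketch. The data layout (a dictionary keyed by unordered pairs of names) and the function name are assumptions made for this example, and the scores are invented placeholders rather than real results.

```python
def ranked_table(reference, scores):
    """List (other species, score) for one reference species, best score first.

    scores maps frozenset({a, b}) to the score for that unordered pair.
    """
    rows = []
    for pair, score in scores.items():
        if reference in pair:
            (other,) = pair - {reference}
            rows.append((other, score))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented numbers, purely to show the shape of the data.
toy_scores = {
    frozenset({"human", "orangutan"}): 120,
    frozenset({"human", "platypus"}): 45,
    frozenset({"orangutan", "platypus"}): 50,
}
print(ranked_table("human", toy_scores))
# [('orangutan', 120), ('platypus', 45)]
```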

Note: As of class, we are completely finished with the computations of the original LCS and the LCS+gap measures, but the advanced one is still running; at the moment of writing, it seems we have the all-pairs matrix for about 66 of the 80 species. So some of the linked tables will be missing for that final handful of species, and even in the tables given for the other species, this last batch of species is not included in the analyses. So if you don't see the Santa Barbara tree frog showing up as a close match for some other species, it could simply be that we haven't yet done that measurement. I will update these tables to be the complete tables once the calculations are complete.

Note well: the absolute range of these scales is somewhat arbitrary, so it is the relative differences that are important. In particular, within a single category of analyses, there's no particular meaning to where scores become positive vs negative, as that's an artifact of the scoring system. More importantly, the scores we get for a particular comparison when using the three different measures should not be compared to each other.

What should (hopefully) be significant is the ordering, for a specific measure and a specific reference species, of how the other 79 species compare to that reference. Our hope is certainly that those with the highest scores are the best matches in terms of the underlying biological sequences, and also that large gaps between some scores and others give a sense of whether there are clear strata of nearby and distant species.
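One simple way to look for such strata, sketched here as a suggestion rather than a prescribed method, is to find the largest drops between consecutive scores in a ranked list. The function name and the sample numbers are invented for illustration.

```python
def largest_drops(ranked, how_many=3):
    """Given (name, score) rows sorted from high to low, return the biggest
    drops between neighbors as (drop, name_above, name_below), largest first."""
    drops = []
    for (name_a, score_a), (name_b, score_b) in zip(ranked, ranked[1:]):
        drops.append((score_a - score_b, name_a, name_b))
    drops.sort(reverse=True)
    return drops[:how_many]

# Invented scores for four unnamed species.
ranked = [("a", 100), ("b", 95), ("c", 40), ("d", 38)]
print(largest_drops(ranked, how_many=1))
# [(55, 'b', 'c')]
```

In this toy list the drop of 55 between b and c would suggest two strata: a and b near the reference, c and d far from it.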


Your Task -- Analysis

For your task, we are going to use a crowdsourced approach to do some basic sanity checking of the experimental results. Specifically, your pair is responsible for writing up a report (to be submitted electronically by the due date of 11:59pm Saturday, 3 March 2018) that does the following.

You are to pick five reference species (ideally ones that you think are probably not closely related to each other in the biological sense). For each of these five species, you are to do a close examination of the three different analyses and to write a reflection on your discoveries. Specifically, we hope you will be able to comment on the following:

Please write up your analyses clearly and in electronic form, with the names of both members of the pair indicated at the top of the document, and have one member of the pair submit the document through the course submissions website.


Data Sets

Note that species in this table are simply ordered by their accession number, which seems to have no relevance to biological taxonomy (but perhaps is related to how long ago they were provided to NCBI).

accession#      bp  common name                  analyses
NC_000884.1  16801  guinea pig                   lcs gap adv
NC_000891.1  17019  platypus                     lcs gap adv
NC_000894.1  20992  leishmania                   lcs gap adv
NC_001499.1   5894  abelson (virus)              lcs gap adv
NC_001601.1  16402  blue whale                   lcs gap adv
NC_001722.1  10359  hiv-2 (virus)                lcs gap adv
NC_001788.1  16670  wild_ass                     lcs gap adv
NC_002083.1  16499  orangutan                    lcs gap adv
NC_002783.2  16749  rhea                         lcs gap adv
NC_003190.1  16715  john dory                    lcs gap adv
NC_005212.1  17047  cheetah                      lcs gap adv
NC_005797.1  16369  axolotl                      lcs gap adv
NC_005958.1  16016  alligator lizard             lcs gap adv
NC_006887.1  16375  tiger salamander             lcs gap adv
NC_006928.1  16408  brydes whale                 lcs gap adv
NC_008092.1  16729  gray wolf 1                  lcs gap adv
NC_008161.1  14853  stony coral                  lcs gap adv
NC_008410.1  17277  asiatic toad                 lcs gap adv
NC_009064.1  16703  indo-pacific sergeant        lcs gap adv
NC_009686.1  16757  gray wolf 2                  lcs gap adv
NC_009830.1  16434  powderblue surgeonfish       lcs gap adv
NC_010570.1  16433  pirarucu                     lcs gap adv
NC_011180.1  16825  flat needlefish              lcs gap adv
NC_011196.1  16738  greylag goose                lcs gap adv
NC_011943.1  16502  starry triggerfish           lcs gap adv
NC_011947.1  16441  spiny tailed leatherjacket   lcs gap adv
NC_012920.1  16569  human                        lcs gap adv
NC_014887.1  15599  chinese grasshopper          lcs gap adv
NC_015119.1  16803  snow scorpionfly             lcs gap adv
NC_016197.1  13724  filarial nematode            lcs gap adv
NC_016198.1  14281  giant roundworm              lcs gap adv
NC_016428.1  16263  striped field mouse          lcs gap adv
NC_016577.1   2633  AbMV (virus)                 lcs gap adv
NC_018801.1  16775  red-winged blackbird         lcs gap adv
NC_018804.1  16773  saffron-cowled blackbird     lcs gap adv
NC_019571.1  13913  cat lungworm                 lcs gap adv
NC_020099.1   1670  copepod                      lcs gap adv
NC_020346.1  17098  greenspot goby               lcs gap adv
NC_020591.1  16673  hazel grouse                 lcs gap adv
NC_020648.1  16538  striped skunk                lcs gap adv
NC_021933.1  15282  millipede                    lcs gap adv
NC_022415.1  16744  white shark                  lcs gap adv
NC_022827.1  18479  staghorn coral               lcs gap adv
NC_023248.1  29999  anamorphic fungus            lcs gap adv
NC_023889.1  16386  killer whale                 lcs gap adv
NC_024052.1  16965  diana tarsier                lcs gap adv
NC_024268.1  17937  silver-throated bushtit      lcs gap adv
NC_024626.1  52528  freshwater green alga        lcs gap adv
NC_025222.1  16560  tonkean macaque              lcs gap adv
NC_026082.1  17952  besra                        lcs gap adv
NC_026104.1  15804  stonefly                     lcs gap adv
NC_026914.1  14948  water flea                   lcs gap adv
NC_027241.1  16893  goulds sunbird               lcs gap adv
NC_027847.1  17962  grey parrot                  lcs gap adv
NC_027857.1  16551  bandit angelfish             lcs gap adv
NC_027956.1  16721  african golden wolf          lcs gap adv
NC_028290.1  16565  atlantic sturgeon            lcs gap adv
NC_028510.1  17271  austrolebias                 lcs gap adv
NC_029146.1  17821  silvereye                    lcs gap adv
NC_029168.1  15326  acasta sulcata               lcs gap adv
NC_029510.1  15258  dancing acraea               lcs gap adv
NC_029846.1  17370  lesser kestrel               lcs gap adv
NC_030247.1  16603  labeonin                     lcs gap adv
NC_031807.1  16581  common carp                  lcs gap adv
NC_031858.1  16310  pacific star shell           lcs gap adv
NC_032058.1  17827  abyssinian white-eye         lcs gap adv
NC_032084.1  17165  grey burrowing snake         lcs gap adv
NC_033906.1  15872  Sinopodisma wulingshanensis  lcs gap adv
NC_033973.1  16817  terek sandpiper              lcs gap adv
NC_034122.1  14913  congo termite                lcs gap adv
NC_035130.1   6363  australian mosquito          lcs gap adv
NC_035150.1  18974  cat gecko                    lcs gap adv
NC_035677.1  16130  bean weevil                  lcs gap adv
NC_035817.1  16490  finlaysons squirrel          lcs gap adv
NC_036493.1  17325  Santa Barbara tree frog      lcs gap adv

Michael Goldwasser
CSCI 1020, Spring 2018
Last modified: Wednesday, 28 February 2018