
Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2019

Computer Science Department

Lab 05

Topic: Global Sequence Alignments
Collaboration Policy: The lab should be completed working in pairs
Submission Deadline:    11:59pm Friday, 15 February 2019

Overview

In this lab, we explore a variety of models for doing global sequence alignment for nucleotide sequences. Specifically, we look at three computed measures for comparing a pair of nucleotide sequences.

  1. LCS
    A straightforward computation of the length of the Longest Common Subsequence of the two sequences. The longer the common subsequence, the more closely related we presume the species are.

  2. LCS + gap/mismatch penalty
    Computing an alignment that earns a +1 reward for every aligned pair of matching nucleotides, a -1 penalty for every aligned pair of non-matching nucleotides, and a -1 penalty for every nucleotide of one sequence that ends up aligned with a gap in the other.

  3. Substitution matrix and adaptive gap penalty
    Rather than a +1 for a match and a -1 for any form of mismatch, we use a more general substitution matrix to score how well single nucleotides match with other nucleotides. This matters because some of the FASTA files use a standardized character set defined by the IUPAC to allow for a variety of ambiguities that might result from sequencing reads. The specific substitution matrix that we will use is known as NUC.4.4 (available locally or from NCBI). For the standard ACGT nucleotides, the matrix provides a +5 reward for a match and a -4 penalty for any mismatch, but there are more gradations of values for the various matches one might find with ambiguities.

    The second adjustment to our scoring involves the penalty for gaps. In the previous rendition there was simply a fixed cost for each character that was left aligned with a gap. Given that mutations might insert or delete large stretches of nucleotides in one species, we wish to steer the alignment toward closely matching portions of the sequences, even when those portions are separated by a longer stretch in one species than in the other. So rather than penalizing linearly by the size of a gap, we charge one larger penalty for starting a gap in either sequence, and a smaller penalty for extending it. Specifically, we charge a penalty of -6 for the first nucleotide in a gap, and -1 for each additional nucleotide.
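To make the three scoring rules concrete, here is a minimal sketch that scores an alignment that has already been written out with '-' characters for gaps. It is only an illustration, not the code used to produce the experimental results. The function name score_alignment and the model labels "lcs", "gap", and "adv" are our own choices for this example, and the "adv" branch assumes plain A, C, G, T characters only (+5 match, -4 mismatch); it does not reproduce the NUC.4.4 values for IUPAC ambiguity codes.

```python
GAP = "-"

def score_alignment(top, bottom, model):
    """Score two already-aligned, equal-length strings under one model.

    model is "lcs" (count matches), "gap" (+1/-1/-1), or
    "adv" (+5/-4 with gap open -6 and gap extend -1; ACGT only).
    """
    if model not in ("lcs", "gap", "adv"):
        raise ValueError("unknown model: " + model)
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")

    score = 0
    prev_gap = None  # which sequence held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == GAP and b == GAP:
            raise ValueError("a column cannot pair a gap with a gap")
        gap_in = "top" if a == GAP else "bottom" if b == GAP else None

        if gap_in is None:
            # Two nucleotides are aligned with each other.
            if model == "lcs":
                score += 1 if a == b else 0
            elif model == "gap":
                score += 1 if a == b else -1
            else:
                score += 5 if a == b else -4
        elif model == "gap":
            score -= 1  # flat cost per gapped nucleotide
        elif model == "adv":
            # -6 to open a gap, -1 for each further column of the same gap
            score -= 1 if gap_in == prev_gap else 6
        # under "lcs", gaps are free

        prev_gap = gap_in
    return score

top, bottom = "ACGT--A", "AC-TTTA"
print(score_alignment(top, bottom, "lcs"))  # 4 matching columns
print(score_alignment(top, bottom, "gap"))  # 4 - 3 = 1
print(score_alignment(top, bottom, "adv"))  # 20 - 6 - (6 + 1) = 7
```

Notice that the same alignment earns 4, 1, and 7 under the three models. The two-column gap costs 2 under the flat penalty but 7 under the adaptive penalty, while the four matches are worth 4 versus 20, which is one reason scores from different models are not on a common scale.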


Experiments

In order to experiment on some real data, we went back to the mitochondrial DNA samples provided by NCBI, and in particular a subset of species that were identified by students of this class (and past students) as part of the initial course questionnaire. That full data set, and the raw FASTA files that contain the nucleotide sequences, are linked from the data sets portion of the class page. The analyses of our global sequence alignments, however, are on this page. Because the running time of our current sequence alignment algorithms grows quadratically with sequence length, we have restricted the experiments to sequences of 30K bp or less (which effectively gets us most of the animals, and excludes most plants and fungi). In the end, this left us with the 117 species listed in the table below.
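To see where the quadratic cost comes from, consider the following sketch of the standard dynamic-programming approach to global alignment scoring with a fixed per-character gap cost. This is an illustration under our own naming (global_score is a hypothetical helper), not necessarily the implementation behind the experiments, and it covers only the first two measures; the adaptive gap penalty of the third measure needs additional bookkeeping that is not shown here.

```python
def global_score(a, b, match, mismatch, gap):
    """Best global alignment score of strings a and b.

    Fills a (len(a)+1) x (len(b)+1) table one row at a time,
    so the work is proportional to len(a) * len(b).
    """
    # Row 0: aligning the empty prefix of a against each prefix of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # align a[i-1] with a gap
                curr[j - 1] + gap,   # align b[j-1] with a gap
            ))
        prev = curr
    return prev[-1]

# Measure 1: with free gaps and free mismatches, the best score
# equals the length of the longest common subsequence.
print(global_score("ACGTA", "ACTTTA", match=1, mismatch=0, gap=0))    # 4

# Measure 2: +1 match, -1 mismatch, -1 per gapped character.
print(global_score("ACGTA", "ACTTTA", match=1, mismatch=-1, gap=-1))  # 2
```

The nested loops visit one table cell per pair of positions, so two sequences of about 16,500 bp each require roughly 272 million cell updates. Keeping only two rows saves memory, but it does not reduce the running time.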

We then scored a sequence alignment between all pairs of species, optimizing for each of the three quantitative measures defined above. Then, for each species, we prepared a table that shows how that species compares to all other species, sorted from largest to smallest score. Those tables are linked in the table below.
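As a rough picture of how such per-species tables can be built from all-pairs results, here is a small sketch. The species labels and the scores are made up purely for illustration (they are not experimental results), and the helper rank_against is a hypothetical name of our own.

```python
from itertools import combinations

def rank_against(reference, scores):
    """Return (other species, score) pairs for reference, highest first.

    scores maps a frozenset of two species names to their alignment score,
    so each unordered pair is stored only once.
    """
    rows = []
    for pair, value in scores.items():
        if reference in pair:
            (other,) = pair - {reference}
            rows.append((other, value))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Placeholder species and made-up scores, for illustration only.
species = ["sp_a", "sp_b", "sp_c", "sp_d"]
pairs = list(combinations(species, 2))
made_up = [12, -3, 40, 7, -15, 22]
scores = {frozenset(p): s for p, s in zip(pairs, made_up)}

print(rank_against("sp_a", scores))
# [('sp_d', 40), ('sp_b', 12), ('sp_c', -3)]

# With 117 species, the number of unordered pairs to score:
print(len(list(combinations(range(117), 2))))  # 6786
```

Each of those 6786 pairs is aligned once per measure, and every species then appears in 116 of the pairs, which is why each per-species table has 116 rows.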

Note well: the absolute range of each of these scales is somewhat arbitrary, so it is the relative differences that are important. In particular, within a single category of analysis, there's no particular meaning to where scores become positive vs negative, as that's an artifact of the scoring system. More importantly, the scores we get for a particular comparison under the three different measures should not be compared to each other.

What should (hopefully) be significant is the ordering, for a specific measure and a specific reference species, of how the other 116 species compare to that reference. Our hope is certainly that those with the highest scores are the best matches in terms of the underlying biological sequences, and also that large gaps between consecutive scores in that ordering give a sense of whether there are some clear strata of nearby and distant species.
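One simple way to look for such strata is to compute the drop between each pair of consecutive scores in a ranked table and see where the biggest drops occur. The sketch below assumes a list that is already sorted from highest to lowest score; the names and numbers are invented for illustration and do not come from the experiments, and largest_drops is a hypothetical helper.

```python
def largest_drops(ranked, top=3):
    """Return the biggest score drops between consecutive rows.

    ranked is a list of (name, score) pairs sorted from highest to
    lowest score. Each result is (drop, name above, name below).
    """
    drops = [
        (ranked[i][1] - ranked[i + 1][1], ranked[i][0], ranked[i + 1][0])
        for i in range(len(ranked) - 1)
    ]
    return sorted(drops, reverse=True)[:top]

# Invented ranking for one reference species, for illustration only.
ranked = [("sp_b", 950), ("sp_c", 930), ("sp_d", 410),
          ("sp_e", 395), ("sp_f", -120)]

print(largest_drops(ranked, top=2))
# [(520, 'sp_c', 'sp_d'), (515, 'sp_e', 'sp_f')]
```

In this invented example the two large drops would suggest three groups: sp_b and sp_c close to the reference, sp_d and sp_e further away, and sp_f far removed. Whether such groups make biological sense is exactly the kind of question the analysis should address.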


Your Task -- Analysis

For your task, we are going to use a crowdsourced approach to do some basic sanity checking of the experimental results. Specifically, your pair is responsible for writing up a report (to be submitted electronically by the due date of 11:59pm Friday, 15 February 2019) that does the following.

You are to pick five reference species (ideally ones that you think are not likely to be closely related to each other in the biological sense). For each of these five species, you are to do a close examination of the three different analyses and to write a reflection on your discoveries. Specifically, we hope you will be able to comment on the following:

Please write up your analyses clearly and in electronic form, with the names of both members of the pair indicated at the top of the document, and have one member of the pair submit the document through the course submissions website.


Data Sets

Note that species in this table are simply ordered by their accession number, which seems to have no relevance to biological taxonomy (but perhaps is related to how long ago they were provided to NCBI).

accession#     bp  common name                    analyses
NC_000845   16613  wild boar                      lcs gap adv
NC_000884   16801  guinea pig                     lcs gap adv
NC_000891   17019  platypus                       lcs gap adv
NC_000894   20992  leishmania                     lcs gap adv
NC_001321   16398  fin whale                      lcs gap adv
NC_001499    5894  abelson (virus)                lcs gap adv
NC_001601   16402  blue whale                     lcs gap adv
NC_001602   16797  grey seal                      lcs gap adv
NC_001610   17084  virginia opossum               lcs gap adv
NC_001644   16563  bonobo                         lcs gap adv
NC_001645   16364  gorilla                        lcs gap adv
NC_001700   17009  cat                            lcs gap adv
NC_001722   10359  hiv-2 (virus)                  lcs gap adv
NC_001788   16670  wild ass                       lcs gap adv
NC_002008   16727  dog                            lcs gap adv
NC_002078   16816  aardvark                       lcs gap adv
NC_002083   16499  orangutan                      lcs gap adv
NC_002369   16507  red squirrel                   lcs gap adv
NC_002783   16749  rhea                           lcs gap adv
NC_003190   16715  john dory                      lcs gap adv
NC_003322   16996  common wombat                  lcs gap adv
NC_004380   16479  goosefish                      lcs gap adv
NC_004390   16508  tapetail                       lcs gap adv
NC_005212   17047  cheetah                        lcs gap adv
NC_005797   16369  axolotl                        lcs gap adv
NC_005958   16016  alligator lizard               lcs gap adv
NC_006887   16375  tiger salamander               lcs gap adv
NC_006928   16408  brydes whale                   lcs gap adv
NC_007233    5990  monkey malaria parasite        lcs gap adv
NC_008092   16729  gray wolf 1                    lcs gap adv
NC_008161   14853  stony coral                    lcs gap adv
NC_008221   15774  asian longhorned beetle        lcs gap adv
NC_008410   17277  asiatic toad                   lcs gap adv
NC_008668   16778  ray-finned fish                lcs gap adv
NC_009064   16703  indo-pacific sergeant          lcs gap adv
NC_009686   16757  gray wolf 2                    lcs gap adv
NC_009692   16431  sea otter                      lcs gap adv
NC_009830   16434  powderblue surgeonfish         lcs gap adv
NC_010570   16433  pirarucu                       lcs gap adv
NC_010638   16773  snow leopard                   lcs gap adv
NC_011137   16565  neanderthal                    lcs gap adv
NC_011180   16825  flat needlefish                lcs gap adv
NC_011196   16738  greylag goose                  lcs gap adv
NC_011943   16502  starry triggerfish             lcs gap adv
NC_011947   16441  spiny tailed leatherjacket     lcs gap adv
NC_012920   16569  human                          lcs gap adv
NC_013272   16518  brown marmorated stink bug     lcs gap adv
NC_014175   16544  peninsular horned tree lizard  lcs gap adv
NC_014295   15895  eastern honey bee              lcs gap adv
NC_014887   15599  chinese grasshopper            lcs gap adv
NC_015119   16803  snow scorpionfly               lcs gap adv
NC_015200   16939  eurasian magpie                lcs gap adv
NC_015342   15647  kudzu beetle                   lcs gap adv
NC_016197   13724  filarial nematode              lcs gap adv
NC_016198   14281  giant roundworm                lcs gap adv
NC_016419   15140  east palearctic butterfly      lcs gap adv
NC_016428   16263  striped field mouse            lcs gap adv
NC_016577    2633  AbMV (virus)                   lcs gap adv
NC_018033   16683  blood pheasant                 lcs gap adv
NC_018801   16775  red-winged blackbird           lcs gap adv
NC_018804   16773  saffron-cowled blackbird       lcs gap adv
NC_019571   13913  cat lungworm                   lcs gap adv
NC_020099    1670  copepod                        lcs gap adv
NC_020336   16525  tiger tail seahorse            lcs gap adv
NC_020346   17098  greenspot goby                 lcs gap adv
NC_020591   16673  hazel grouse                   lcs gap adv
NC_020648   16538  striped skunk                  lcs gap adv
NC_020669   17112  striped hyena                  lcs gap adv
NC_021386   16580  long-tailed chinchilla         lcs gap adv
NC_021933   15282  millipede                      lcs gap adv
NC_022415   16744  white shark                    lcs gap adv
NC_022429   16637  spectral bat                   lcs gap adv
NC_022827   18479  staghorn coral                 lcs gap adv
NC_023248   29999  anamorphic fungus              lcs gap adv
NC_023520   16773  sand tiger shark               lcs gap adv
NC_023889   16386  killer whale                   lcs gap adv
NC_023955   15782  freshwater pearl mussel        lcs gap adv
NC_024052   16965  diana tarsier                  lcs gap adv
NC_024268   17937  silver-throated bushtit        lcs gap adv
NC_024820   16433  giraffe                        lcs gap adv
NC_024853   16509  moonlighter fish               lcs gap adv
NC_025222   16560  tonkean macaque                lcs gap adv
NC_025594   16802  long-tailed rosefinch          lcs gap adv
NC_025923   20350  Eurasian bittern               lcs gap adv
NC_026082   17952  besra                          lcs gap adv
NC_026104   15804  stonefly                       lcs gap adv
NC_026308   17298  baja california brush lizard   lcs gap adv
NC_026914   14948  water flea                     lcs gap adv
NC_027241   16893  goulds sunbird                 lcs gap adv
NC_027847   17962  grey parrot                    lcs gap adv
NC_027857   16551  bandit angelfish               lcs gap adv
NC_027932   16232  harvest mouse                  lcs gap adv
NC_027943   16270  peruvian scallop               lcs gap adv
NC_027956   16721  african golden wolf            lcs gap adv
NC_028018   16653  white char                     lcs gap adv
NC_028290   16565  atlantic sturgeon              lcs gap adv
NC_028510   17271  austrolebias                   lcs gap adv
NC_029146   17821  silvereye                      lcs gap adv
NC_029168   15326  acasta sulcata                 lcs gap adv
NC_029498   15281  yellow-banded acraea           lcs gap adv
NC_029510   15258  dancing acraea                 lcs gap adv
NC_029846   17370  lesser kestrel                 lcs gap adv
NC_030247   16603  labeonin                       lcs gap adv
NC_030265   15534  european mantis                lcs gap adv
NC_031807   16581  common carp                    lcs gap adv
NC_031858   16310  pacific star shell             lcs gap adv
NC_032058   17827  abyssinian white-eye           lcs gap adv
NC_032084   17165  grey burrowing snake           lcs gap adv
NC_033906   15872  Sinopodisma wulingshanensis    lcs gap adv
NC_033973   16817  terek sandpiper                lcs gap adv
NC_034122   14913  congo termite                  lcs gap adv
NC_035130    6363  australian mosquito            lcs gap adv
NC_035150   18974  cat gecko                      lcs gap adv
NC_035677   16130  bean weevil                    lcs gap adv
NC_035817   16490  finlaysons squirrel            lcs gap adv
NC_036391   16513  yellow bullhead                lcs gap adv
NC_036493   17325  Santa Barbara tree frog        lcs gap adv

Michael Goldwasser
CSCI 1020, Spring 2019
Last modified: Thursday, 14 February 2019