Saint Louis University | Computer Science 1020 | Computer Science Department
For this assignment, you must work alone. Please make sure you adhere to the policies on academic integrity in this regard.
Topic: Smith-Waterman Algorithm
Related Reading: pp. 50-51 and 57 of text and Wikipedia
Due: 11:59pm, Monday, 25 February 2019
The Needleman-Wunsch algorithm produces an optimal global pairwise alignment for a given metric, in that the entirety of the two sequences must be aligned. A downside of global alignment is that two sequences might have some strong commonalities (e.g., conserved genes) while one or both also have mutations that caused large portions of the sequence to be inserted, duplicated, or otherwise changed. In that case, the global alignment score might not reflect the local similarities as strongly, because of the negative contributions from the rest of the sequence.
The Smith-Waterman algorithm is a variant that instead computes the optimal local pairwise alignment. The goal is to find the portions of the two original sequences that align most strongly with each other. While this might seem to require a different approach, as there are many possible portions to consider, it turns out that the optimal local alignment can be computed by an algorithm that is very similar in design to the Needleman-Wunsch algorithm.
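To make the contrast between global and local scores concrete, here is a small score-only sketch. It is not the course's align.py: the function name alignment_score and the local flag are hypothetical, and the gap score is assumed to be a negative number added once per gap.

```python
def alignment_score(first, second, match, mismatch, gap, local=False):
    """Return the optimal global score, or the optimal local score if local=True.

    Illustrative sketch only; gap is assumed to be a negative per-gap score.
    """
    # Global alignment: border cells accumulate gap penalties.
    # Local alignment: border cells are all zero.
    prev = [0 if local else j * gap for j in range(len(second) + 1)]
    best = 0
    for i in range(1, len(first) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(second) + 1):
            step = match if first[i - 1] == second[j - 1] else mismatch
            value = max(prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap)
            if local:
                value = max(0, value)      # local entries never go negative
                best = max(best, value)    # the best score may be anywhere
            cur.append(value)
        prev = cur
    return best if local else prev[-1]


# A short sequence that occurs exactly inside a longer one.
print(alignment_score("ACGT", "TTTTACGTTTTT", 1, -1, -1))              # -4
print(alignment_score("ACGT", "TTTTACGTTTTT", 1, -1, -1, local=True))  # 4
```

In this example the global score is dragged down to -4 by the eight gaps needed to cover the flanking T's, while the local score of 4 reflects only the shared ACGT.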
The algorithmic changes are as follows:
When completing the table, we do not allow for any entries to become negative. That is, while there still may be negative contributions from gap penalties or mismatch scores, if the overall entry were to become negative, it should be set to zero.
In line with the above rule, the top row and leftmost column are set to all zeros (rather than to the negative gap penalties used in the Needleman-Wunsch algorithm).
Rather than considering only the bottom-right entry of the completed table (which for Needleman-Wunsch represents the optimal global alignment score for the full sequences), we are interested in the largest entry that occurs anywhere within the table. That entry defines the optimal local alignment score.
To reconstruct the actual alignment achieving the optimal score, we begin at the cell of the table in which that score is found. However, rather than necessarily tracing from that cell back to the top-left corner, we perform the traceback step only until the first time that we reach a cell that has a value of zero.
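The four changes above can be sketched in one self-contained function. This is only an illustration under stated assumptions, not the structure of align.py: the name smith_waterman is hypothetical, the gap score is assumed to be a negative number added per gap, and ties during the traceback are broken by preferring the diagonal move, then the move up, then the move left.

```python
def smith_waterman(first, second, match, mismatch, gap):
    """Return (score, aligned_first, aligned_second) for a best local alignment.

    Illustrative sketch only; gap is assumed to be a negative per-gap score.
    """
    rows, cols = len(first) + 1, len(second) + 1
    # The top row and leftmost column stay at zero.
    table = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            step = match if first[i - 1] == second[j - 1] else mismatch
            table[i][j] = max(0,                          # never negative
                              table[i - 1][j - 1] + step,  # align both characters
                              table[i - 1][j] + gap,       # gap in second
                              table[i][j - 1] + gap)       # gap in first
            if table[i][j] > best:                         # track the largest entry
                best, best_pos = table[i][j], (i, j)

    # Trace back from the largest entry until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while table[i][j] > 0:
        step = match if first[i - 1] == second[j - 1] else mismatch
        if table[i][j] == table[i - 1][j - 1] + step:
            top.append(first[i - 1]); bottom.append(second[j - 1])
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j] + gap:
            top.append(first[i - 1]); bottom.append('-')
            i -= 1
        else:
            top.append('-'); bottom.append(second[j - 1])
            j -= 1
    return best, ''.join(reversed(top)), ''.join(reversed(bottom))


# Scoring model of the Wikipedia example: match 3, mismatch -3, gap -2.
print(smith_waterman("GGTTGACTA", "TGTTACGG", 3, -3, -2))
# (13, 'GTTGAC', 'GTT-AC')
```

Because the border cells are zero and every positive cell was produced by one of the three moves, the traceback loop always stops before stepping outside the table.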
Further discussion of the algorithm can be found in the Wikipedia article, which includes a detailed example for illustration.
Your task is to adapt our original implementation of the Needleman-Wunsch algorithm to produce a working implementation of the Smith-Waterman algorithm. The hope is that if you understand how the original Python code implements the Needleman-Wunsch algorithm, then you will be able to focus on the relatively few places where that code must be adjusted to produce the Smith-Waterman algorithm.
If you have any questions about how the original code works, feel free to ask!
We are providing two files (available individually or as a combined zip file):
align.py
This is our original implementation of the Needleman-Wunsch
algorithm. The functions that you will need to consider changing
are at the beginning of the file, namely buildTable
which fills in the table of values, optScore which
determines the maximum alignment score (which is the bottom-right
cell for Needleman-Wunsch, but could be anywhere for
Smith-Waterman), and reconstructSolution which produces
a string representation of the alignment, based on tracing back
a path in the table.
After those three functions, there is a note in the source code saying not to change anything below. You do not need to understand the rest of the code. (It is there to help manage the overall testing of the algorithm.)
tests.txt
The script runs the algorithm on the tests listed in a file
named tests.txt. Each line of the file must have precisely
five fields ordered as follows:
[match score] [mismatch score] [gap score] [first sequence] [second sequence]
We are providing an initial test file with five lines. The first four revisit examples that we manually examined during past lectures. The final line
3 -3 -2 GGTTGACTA TGTTACGG
describes the model for the Wikipedia example.
You are welcome to analyze the correctness of your adapted code on those examples, or to add additional such tests to the tests.txt file.
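If you add your own lines to tests.txt, a quick check of the five-field format can help. The following is a minimal sketch, not the parsing code in align.py: the function name parse_test_line is hypothetical, and it assumes that the three scores are integers and that fields are separated by whitespace.

```python
def parse_test_line(line):
    """Split one test line into (match, mismatch, gap, first, second).

    Illustrative sketch only; assumes integer scores and whitespace separators.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}: {line!r}")
    match, mismatch, gap = (int(field) for field in fields[:3])
    return match, mismatch, gap, fields[3], fields[4]


print(parse_test_line("3 -3 -2 GGTTGACTA TGTTACGG"))
# (3, -3, -2, 'GGTTGACTA', 'TGTTACGG')
```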
You should submit two files to the appropriate folder in our git repository. If working with a partner, only one of you needs to submit these files.
align.py
Submit your revised source code.
readme.txt
Submit this file to give a brief overview of your work,
discussing any successes or challenges you faced. Also, if
working with a partner, make sure that the contributions of both
partners are addressed.
This assignment is worth 40 points, apportioned as follows: