
Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2018

Computer Science Department

Homework Assignment 03

Approximate Matching

Due: 11:59pm, Monday, 26 February 2018




Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

While it would be great if biological sequences always matched precisely what we expected, there are many contexts in which we need to be willing to consider an approximate match for a motif pattern. For this assignment, we will have you develop an alternative version of the typical "find" behavior for Python strings; instead of insisting on a perfect match, we will allow a match that has up to k replacements (known in genetics as Single Nucleotide Polymorphisms, or SNPs).

As an example, consider searching for the motif CAT in the strand GTACGTACATT. If looking for the leftmost occurrence of the exact pattern, we would locate it beginning at index 7.

01234567890
GTACGTACATT
       CAT
However, if we were willing to allow up to 1 SNP, we would find the first approximate match starting at index 3.
01234567890
GTACGTACATT
   CAT
If we were to allow up to 2 SNPs, the first approximate match would start at index 1.
01234567890
GTACGTACATT
 CAT
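
The alignments above can be checked by counting mismatches at every possible offset. The following is only an illustrative sketch; the helper name count_mismatches is hypothetical and not part of the assignment.

```python
def count_mismatches(a, b):
    """Count the positions at which equal-length strings a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


strand = "GTACGTACATT"
pattern = "CAT"

# Slide the pattern across the strand and report the mismatches at each offset.
for j in range(len(strand) - len(pattern) + 1):
    window = strand[j:j + len(pattern)]
    print(j, window, count_mismatches(window, pattern))
```

In the printed listing, offset 1 is the first with at most 2 mismatches, offset 3 is the first with at most 1, and offset 7 is the first with none.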


Your Task

Your task is to implement a function findApprox(strand, pattern, k), which returns the index at which the leftmost approximate match begins, using at most k replacements. If no such approximate match is found, the function should return -1.

However, when writing code for more complex logic, it is often helpful to have a modular design in which you define additional functions that encapsulate certain subtasks. For this program, we wish to have you define two different functions.

The first function, isApproxMatch(a,b,k), should take two equal-length strings and an integer k as parameters, and return True or False according to whether those two strings qualify as approximate matches for each other when allowed up to k SNPs. For example, we would like to recognize that CTAG and GTAA qualify as approximate matches if allowing 2 SNPs, but not if allowing only 1 SNP.
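
One possible shape for this function is sketched below. It is a minimal sketch, not the required solution, and it assumes the caller always passes strings of equal length, as the description states.

```python
def isApproxMatch(a, b, k):
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for i in range(len(a)):
        if a[i] != b[i]:
            mismatches += 1
    return mismatches <= k


print(isApproxMatch("CTAG", "GTAA", 2))  # True: the strings differ at 2 positions
print(isApproxMatch("CTAG", "GTAA", 1))  # False: 2 mismatches exceed k = 1
```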

With that function completed, you should then have an easier time implementing the desired findApprox(strand, pattern, k). You may use a for loop to consider possible indices j at which the approximate pattern might be found, and then use the first function to test whether the slice of the strand that starts at index j and has the same length as the pattern qualifies as an approximate match.
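
The loop described above might be organized as follows. This is a sketch under the assumption that only full-length slices are tested, so the last index tried is len(strand) - len(pattern); a compact isApproxMatch is included only so that the example runs on its own.

```python
def isApproxMatch(a, b, k):
    """Return True if equal-length strings a and b differ in at most k positions."""
    return sum(1 for x, y in zip(a, b) if x != y) <= k


def findApprox(strand, pattern, k):
    """Return the leftmost index of an approximate match, or -1 if there is none."""
    # Stop early enough that every slice has the same length as the pattern.
    for j in range(len(strand) - len(pattern) + 1):
        if isApproxMatch(strand[j:j + len(pattern)], pattern, k):
            return j
    return -1


print(findApprox("GTACGTACATT", "CAT", 1))   # 3
print(findApprox("TACGTAAATT", "CTAG", 1))   # -1
```

Returning as soon as a match is found is what guarantees the leftmost index. If the pattern is longer than the strand, the range is empty and the sketch returns -1.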


Testing

It will be important to test your code on a variety of scenarios that could arise. For example, here are a handful of cases to consider:

strand        pattern   k   result   comment
GTACGTACATT   CAT       0    7
GTACGTACATT   CAT       1    3
GTACGTACATT   CAT       2    1
GTCATTACAGT   CAT       2    2       okay to have fewer than k SNPs
TACGTAAATT    CTAG      2    3
TACGTAAATT    CTAG      1   -1       no sufficient match

As with the previous homework, we are providing a mechanism for you to automate a series of tests by specifying those tests in a simple text file, with lines of the form

strand pattern k
For example, a test file that encodes the six tests given above can be found as tests.txt.
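
To illustrate the line format, here is a hedged sketch of how such lines could be read into (strand, pattern, k) tuples. The function name parse_tests is hypothetical, this is not the testing mechanism provided with the assignment, and skipping lines that do not have exactly three fields is an assumption.

```python
def parse_tests(lines):
    """Convert lines of the form 'strand pattern k' into (strand, pattern, k) tuples."""
    cases = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # assumption: ignore blank or malformed lines
        strand, pattern, k = fields
        cases.append((strand, pattern, int(k)))
    return cases


sample = ["GTACGTACATT CAT 0", "GTACGTACATT CAT 1", "", "TACGTAAATT CTAG 1"]
for case in parse_tests(sample):
    print(case)
```

Because an open file can be iterated line by line, the same function could be applied to a file such as tests.txt by passing the file object in place of the list.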

As part of this homework assignment, you must submit not only your Python implementation, but also your own set of up to ten tests. We will then test each submitted program on each submitted test case and give credit in the grading standards both for how well your implementation does when faced with other students' tests, and for how well your tests do in exposing flaws in other students' implementations.


Files You Need

To get you going, we are providing three files, which you may either download individually, or combined as this zip file to be unpacked. The three files are

If you are already working on our department's computer system and prefer to copy these files using command-line techniques, you may execute the following command from whatever working directory you'd like them placed:

cp -Rp /public/goldwasser/1020/homeworks/findApprox  .


Submitting Your Assignment Electronically

All of the following files should be submitted electronically:

To submit, please follow the instructions on our submit system, using the website password that you indicated when completing the course questionnaire. Please also note the late policy for homeworks.


Grading Standards

The assignment is worth 40 points, which will be assessed as follows:


Michael Goldwasser
Last modified: Thursday, 15 February 2018