
Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2019

Computer Science Department

Homework Assignment 06

Clustering


Collaboration Policy

For this assignment, you must work alone. Please make sure you adhere to the policies on academic integrity in this regard.


Overview

Topic: Tree Building Algorithms
Related Reading: Chapter 7 of the text and our notes
Due: 11:59pm, Wednesday, April 3, 2019

In this project, you will help implement hierarchical clustering algorithms for building phylogenetic trees, with support for three different linkage metrics: single linkage, complete linkage, and average linkage (UPGMA).


Files We Provide

You may download the entire collection of files as clustering.zip, or you may examine them individually through the links below.

Python source code:

Sample input files:


Data Structure Overview

We will provide you with a codebase that does much of the extraneous work such as managing user interaction and parsing of input files. Our code also implements the logic of the main loop of the clustering algorithm, and maintains a variety of important data structures.

The hierarchical clustering algorithm forms a series of clusters. Initially, each species is its own cluster; we then merge two clusters to make a new one, and repeat until a single cluster remains. The tree structure reflects this: each leaf is a trivial subtree, and subtrees are combined into larger subtrees until one complete tree remains.
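The merging process described above can be sketched as follows. This is only a minimal illustration, not the provided codebase: the species names, the `DIST` table, and the functions `leaves`, `linkage`, and `build_tree` are all hypothetical, and trees are represented here as nested tuples purely for brevity. The cluster distance is computed from first principles over every cross-cluster pair, with `min` giving single linkage and `max` giving complete linkage.

```python
from itertools import combinations

# Hypothetical pairwise distances between four made-up species.
DIST = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "C")): 5.0,
    frozenset(("B", "D")): 9.0,
    frozenset(("C", "D")): 4.0,
}

def leaves(tree):
    """Return the species names at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def linkage(t1, t2, combine=min):
    """Cluster distance from first principles: apply combine (min or max)
    to the distances of every cross-cluster species pair."""
    return combine(DIST[frozenset((a, b))]
                   for a in leaves(t1) for b in leaves(t2))

def build_tree(species, combine=min):
    """Repeatedly merge the two closest clusters until one tree remains."""
    clusters = list(species)          # each species starts as a trivial subtree
    while len(clusters) > 1:
        t1, t2 = min(combinations(clusters, 2),
                     key=lambda pair: linkage(pair[0], pair[1], combine))
        clusters.remove(t1)
        clusters.remove(t2)
        clusters.append((t1, t2))     # the merged pair becomes a larger subtree
    return clusters[0]

print(build_tree(["A", "B", "C", "D"]))   # (('A', 'B'), ('C', 'D'))
```

With these made-up distances, A and B merge first (distance 2), then C and D (distance 4), and the two resulting subtrees are joined last.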

In order to organize the various data structures, we assign an integer ID to every cluster that we ever encounter while the algorithm runs, numbered starting with the original species as 0, 1, 2, and so on. We rely heavily on those integer IDs as indices into various lists and dictionaries, and as parameters to functions when identifying a species, a cluster, or its corresponding tree. We then define the following structures:
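As a rough illustration of the numbering scheme itself, the sketch below assigns IDs to leaves and to merged clusters. The names `species`, `members`, `children`, and `merge` are hypothetical and are not necessarily the ones used in cluster.py. Since each merge creates exactly one new cluster, n original species lead to 2n - 1 IDs in total.

```python
species = ["human", "chimp", "mouse"]   # hypothetical input; IDs 0, 1, 2

# members[i] lists the original species IDs contained in cluster i.
members = [[i] for i in range(len(species))]
# children[i] records which two clusters were merged to form cluster i.
children = {}

def merge(a, b):
    """Form a new cluster from clusters a and b and return its new ID."""
    new_id = len(members)               # next unused integer ID
    members.append(members[a] + members[b])
    children[new_id] = (a, b)
    return new_id

ab = merge(0, 1)       # cluster 3 contains human and chimp
root = merge(ab, 2)    # cluster 4 contains all three species
print(ab, root, members[root], children)
# 3 4 [0, 1, 2] {3: (0, 1), 4: (3, 2)}
```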


Your Task

You are responsible for implementing the following functions within the provided cluster.py code:

The three cost functions above assume that you go back to first principles and reexamine the original distance of every A/B pair. It is possible to compute the same result more efficiently from distances that we have already computed for subclusters, and the following three functions should take that more efficient approach. When computing the distance between clusters A and B, if you are told that the newly formed cluster B is composed of previous clusters B1 and B2, you should be able to determine the desired result by examining only the known values of dist(A,B1) and dist(A,B2) (together with the sizes of B1 and B2, in the case of UPGMA). We described this optimization in the lecture discussion of UPGMA, but the same improvement can be made for the single and complete cost functions.
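One way to see why the shortcut works is the sketch below. The function names and signatures are hypothetical and will likely differ from those required in cluster.py. It relies on three facts: the minimum over all pairs is the smaller of the two sub-minima, the maximum is the larger of the two sub-maxima, and the average over all pairs is the size-weighted average of the two sub-averages.

```python
def single_update(d_a_b1, d_a_b2):
    """Single linkage: the closest pair lies either in B1 or in B2."""
    return min(d_a_b1, d_a_b2)

def complete_update(d_a_b1, d_a_b2):
    """Complete linkage: the farthest pair lies either in B1 or in B2."""
    return max(d_a_b1, d_a_b2)

def upgma_update(d_a_b1, d_a_b2, size_b1, size_b2):
    """Average linkage: weight each sub-average by its cluster size."""
    return (size_b1 * d_a_b1 + size_b2 * d_a_b2) / (size_b1 + size_b2)

# Made-up example: A = {a}, B1 = {x, y}, B2 = {z}.
pair_dist = {"x": 4.0, "y": 6.0, "z": 8.0}   # distance from a to each species

# Average over every pair, computed from first principles.
first_principles_avg = sum(pair_dist.values()) / len(pair_dist)
# Previously known average distance between A and B1.
d_a_b1_avg = (pair_dist["x"] + pair_dist["y"]) / 2

print(single_update(4.0, 8.0))                                     # 4.0
print(complete_update(6.0, 8.0))                                   # 8.0
print(upgma_update(d_a_b1_avg, 8.0, 2, 1), first_principles_avg)   # 6.0 6.0
```

In this small example, each update agrees with what a full scan over the individual pairs would produce, while looking at only two previously computed distances.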


Submitting Your Assignment Electronically

You should submit your modified cluster.py file into the hw06 folder of your git repository.

Please note the late policy for homeworks.


Grading Standards

This assignment is worth 40 points: each of the eight functions for which you are responsible is worth 5 points.


Michael Goldwasser
CSCI 1020, Spring 2019
Last modified: Monday, 13 May 2019