Computer Science 1020
Introduction to Computer Science: Bioinformatics

Homework Assignment 06

Clustering

Collaboration Policy

For this assignment, you must work alone. Please make sure you adhere to the policies on academic integrity in this regard.

Overview

Topic: Tree Building Algorithms
Related Reading: Chapter 7 of the text and our notes
Due: 11:59pm, Wednesday, April 3, 2019

In this project, you will help implement the hierarchical clustering algorithms for building phylogeny trees, providing support for three different linkage metrics:

Single linkage:
The distance between two clusters is defined to be the distance between the two elements of the respective clusters that are nearest to each other.
Complete linkage:
The distance between two clusters is defined to be the distance between the two elements of the respective clusters that are furthest from each other.
Average linkage via UPGMA:
The distance between the clusters is taken as the average distance across all pairs from the respective clusters.
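
As a small illustration of how the three metrics differ, the sketch below evaluates each one on a made-up pair of clusters. The distance values, the lookup table d, and the helper name pair_distances are assumptions for illustration only and are not part of the provided code.

```python
# Hypothetical pairwise distances between four samples (IDs 0..3).
d = {
    (0, 2): 4.0, (0, 3): 9.0,
    (1, 2): 6.0, (1, 3): 5.0,
}

def pair_distances(cluster_a, cluster_b):
    """Return the distance for every pair with one element from each cluster."""
    return [d[(x, y)] for x in cluster_a for y in cluster_b]

A = [0, 1]   # first cluster
B = [2, 3]   # second cluster
pairs = pair_distances(A, B)       # [4.0, 9.0, 6.0, 5.0]

single = min(pairs)                # distance of the nearest pair
complete = max(pairs)              # distance of the furthest pair
upgma = sum(pairs) / len(pairs)    # average over all pairs

print(single, complete, upgma)     # 4.0 9.0 6.0
```

On the same two clusters the three metrics give three different answers, which is why the choice of linkage can change which clusters get merged next.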

Files We Provide

You may download the entire collection of files as clustering.zip, or you may examine them individually through the links below.

Python source code:

cluster.py
This is the main program to be executed, and the only file that you should be modifying. All of the functions you should be implementing are grouped together near the beginning of the file.
phylip.py
This file manages parsing of input files using the PHYLIP distance matrix format. You need not read or edit this file, but you will need it in the same folder as the others.
chronogram.py
This is a modified version of our earlier code for drawing chronograms of our tree structures. It is slightly different from the past, so please use this version for this project if you wish to visualize.

Sample input files:

p127.txt
Distance matrix given on p. 127 of our text. The book works through single linkage clustering.
p139.txt
Distance matrix given on p. 139 of our text. Note that they use this as an example for the (extra credit) Neighbor-Joining method, though you can still use this data for the other algorithms.
wiki.txt
Distances from the Wikipedia coverage of UPGMA (though you are welcome to use this as input for other metrics as well).
edwards.txt
Distances for the detailed example of UPGMA from Richard Edwards's lab at Univ. New South Wales (though you are welcome to use this as input for other metrics as well).
carr.txt
Distances for the detailed example of WPGMA from Steven Carr's lab at Memorial University. Note well that they walk through the WEIGHTED Pair Group Method (WPGMA), not the unweighted version (though they give a nice explanation of how the two differ).
english.txt
Distance matrix we used for our earlier activity on building a tree of relationships between an English saying. If you're interested, see how the human-built trees compare to the result of UPGMA on this data set.

Data Structure Overview

We will provide you with a codebase that does much of the extraneous work such as managing user interaction and parsing of input files. Our code also implements the logic of the main loop of the clustering algorithm, and maintains a variety of important data structures.

During the hierarchical clustering algorithm, a series of clusters is formed. Initially, each original species is its own cluster; we then merge two clusters to make a new one, and continue in this fashion. This is also reflected in the tree structure: the leaves are each trivial subtrees, and those subtrees get combined into larger subtrees until we finally have one big tree.

In order to organize the various data structures, we assign an integer ID to every cluster that we ever encounter while the algorithm runs, numbered starting with the original species as 0, 1, 2, ... . We then rely heavily on those integer IDs as indices into various lists and dictionaries, and as parameters to functions when trying to identify a species, cluster, or its corresponding tree. We define the following structures:

names is a list of the original samples, such that names[0] is the name in the first row of the distance matrix, names[1] the next, and so forth.
clades is a list with an entry for every cluster ever encountered while the algorithm is running, with clades[k] itself being a list of ID numbers for the original samples that exist in that cluster.
trees is a list with an entry for every cluster ever encountered while the algorithm is running, such that trees[k] is the tuple representation of the corresponding subtree representing the clade.
active is a set of integer IDs that denote all clades that are currently active in the algorithm. That is, when merging clusters A and B to form C, A and B become inactive and are removed from this set, while C gets added.
dist is a list of dictionaries that store all distances that have been computed between clusters. In particular, if a and b are integer IDs representing clusters, then the syntax dist[a][b] denotes the computed distance between those clusters.
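
To make these structures concrete, here is a minimal sketch of their state for a hypothetical three-sample input, before and after one merge. The sample names, the distances, and the nested-tuple form used for trees are assumptions for illustration; the provided cluster.py and the tuple representation from earlier projects may differ in detail. The distance to the merged cluster uses single linkage purely as an example.

```python
names = ["A", "B", "C"]        # hypothetical sample names
clades = [[0], [1], [2]]       # each sample starts as its own cluster
trees = ["A", "B", "C"]        # assumed form: a leaf is just its name
active = {0, 1, 2}
dist = [                       # dist[a][b] for the original samples
    {1: 2.0, 2: 6.0},
    {0: 2.0, 2: 8.0},
    {0: 6.0, 1: 8.0},
]

# Merge clusters 0 and 1; the new cluster receives the next unused ID.
a, b = 0, 1
c = len(clades)                            # new ID is 3
clades.append(clades[a] + clades[b])       # samples contained in the new cluster
trees.append((trees[a], trees[b]))         # assumed tuple form for a subtree
active -= {a, b}                           # a and b become inactive
active.add(c)

# Record distances between the new cluster and every other active cluster.
dist.append({})
for k in active - {c}:
    dist[c][k] = dist[k][c] = min(dist[a][k], dist[b][k])

print(clades[c], sorted(active), dist[c])  # [0, 1] [2, 3] {2: 6.0}
```

Note that the entries for clusters 0 and 1 stay in clades, trees, and dist after the merge; only the active set changes to reflect which clusters are still candidates.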

Your Task

You are responsible for implementing the following functions within the provided cluster.py code:

find_nearest_active()
This function is responsible for examining the set of currently active clusters and the precomputed dictionary of distances, and locating and returning a pair of cluster IDs that have minimum distance among the active pairs. The function should technically return a pair of ID numbers for those clusters, using a syntax such as
```
return a,b
```
presuming variables a and b are the appropriate IDs.
combine_trees(a,b)
Given integers a and b that identify existing clusters, this function is responsible for returning a tuple that represents the tree that results after merging those clusters. The tuple structure should be the familiar one we defined when drawing phylogeny trees in earlier projects.
single_linkage_cost(a,b)
Given integers a and b that identify existing clusters, this function is responsible for computing (from scratch) and returning the single linkage distance between those clusters, that is, the distance of the closest pair that exists between an element of cluster a and an element of cluster b.
complete_linkage_cost(a,b)
Given integers a and b that identify existing clusters, this function is responsible for computing (from scratch) and returning the complete linkage distance between those clusters, that is, the distance of the farthest pair that exists between an element of cluster a and an element of cluster b.
upgma_linkage_cost(a,b)
Given integers a and b that identify existing clusters, this function is responsible for computing (from scratch) and returning the UPGMA linkage distance between those clusters, that is, the average distance taken over all pairs involving one element of cluster a and one element of cluster b.
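
The search described for find_nearest_active() can be pictured as a scan over every unordered pair of active IDs. The sketch below is a standalone version that takes active and dist as explicit parameters; the function name nearest_active_pair and the sample data are hypothetical, and the real function in cluster.py takes no arguments and reads the shared structures instead.

```python
from itertools import combinations

def nearest_active_pair(active, dist):
    """Return the pair (a, b) of active IDs with the smallest dist[a][b]."""
    best_pair = None
    best_dist = float("inf")
    for a, b in combinations(sorted(active), 2):
        if dist[a][b] < best_dist:
            best_dist = dist[a][b]
            best_pair = (a, b)
    return best_pair

# Hypothetical state: cluster 1 is no longer active, so its small
# distance to cluster 0 must be ignored.
dist = [
    {1: 1.0, 2: 7.0, 3: 5.0},
    {0: 1.0, 2: 3.0, 3: 4.0},
    {0: 7.0, 1: 3.0, 3: 6.0},
    {0: 5.0, 1: 4.0, 2: 6.0},
]
active = {0, 2, 3}

a, b = nearest_active_pair(active, dist)
print(a, b)   # 0 3
```

In this sketch, ties are broken in favor of the first pair encountered; the assignment does not specify a tie-breaking rule, so any pair of minimum distance should be acceptable.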

While the above three cost functions assume that you go back to first principles and reexamine the original distance of every A/B pair that exists, it is possible to implement this more efficiently based on distances that we have already computed for other subclusters. The following three functions should take the more efficient approach. When computing the distance between clusters A and B, if you are told that the newly formed cluster B is actually composed of previous clusters B1 and B2, you should be able to determine the desired result by examining only the known values for dist(A,B1) and dist(A,B2) (together with the sizes of B1 and B2 in the case of UPGMA). We described this efficiency in the lecture discussion of UPGMA, but the same improvement can be made for the single and complete cost functions.

single_linkage_cost_efficient(a,b1,b2)
complete_linkage_cost_efficient(a,b1,b2)
upgma_linkage_cost_efficient(a,b1,b2)
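
One way to read these signatures: the distance from A to the merged cluster B follows directly from the two stored distances. For single and complete linkage the minimum and maximum suffice; for UPGMA the two averages must be weighted by the sizes of B1 and B2. The sketch below uses hypothetical standalone helpers that take the distances (and sizes) as explicit arguments rather than cluster IDs, so their names and signatures differ from those in cluster.py.

```python
def single_from_parts(d_a_b1, d_a_b2):
    """Closest pair between A and B1+B2 is the closer of the two known minima."""
    return min(d_a_b1, d_a_b2)

def complete_from_parts(d_a_b1, d_a_b2):
    """Farthest pair between A and B1+B2 is the farther of the two known maxima."""
    return max(d_a_b1, d_a_b2)

def upgma_from_parts(d_a_b1, d_a_b2, size_b1, size_b2):
    """Average over all A/B pairs, weighting each known average by cluster size."""
    return (size_b1 * d_a_b1 + size_b2 * d_a_b2) / (size_b1 + size_b2)

# Hypothetical values: B1 holds 1 sample, B2 holds 2 samples.
d_a_b1, d_a_b2 = 4.0, 7.0
print(single_from_parts(d_a_b1, d_a_b2))          # 4.0
print(complete_from_parts(d_a_b1, d_a_b2))        # 7.0
print(upgma_from_parts(d_a_b1, d_a_b2, 1, 2))     # 6.0
```

Within the assignment's data structures, the stored distances would presumably come from dist and the sizes from the lengths of the corresponding clades entries. Taking a plain unweighted average of the two distances instead would give WPGMA, the weighted variant walked through in the carr.txt example.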

Submitting Your Assignment Electronically

You should submit your modified cluster.py file into the hw06 folder of your git repository.

Please note the late policy for homeworks.

Grading Standards

This assignment is worth 40 points; each of the eight functions for which you are responsible is worth 5 points.

Michael Goldwasser

CSCI 1020, Spring 2019
Last modified: Monday, 13 May 2019

Saint Louis University
