Saint Louis University | Computer Science 1020 | Computer Science Department
For this assignment, you must work alone. Please make sure you adhere to the policies on academic integrity in this regard.
Topic: Tree Building Algorithms
Related Reading: Chapter 7 of the text and
our notes
Due:
11:59pm, Wednesday, April 3, 2019
In this project, you will help implement the hierarchical clustering algorithms for building phylogeny trees, providing support for three different linkage metrics:
Single linkage:
The distance between two clusters is defined to be the
distance between the two elements of the
respective clusters that are nearest to each other.
Complete linkage:
The distance between the clusters is defined to be the
distance between the two elements of the
respective clusters that are furthest from each other.
Average linkage via UPGMA:
The distance between the clusters is taken as the
average distance across all pairs from the respective
clusters.
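The three definitions above can be written compactly as follows. The notation is introduced here only for illustration: A and B are the two clusters, d(x,y) is the distance between original samples x and y taken from the input matrix, and |A| is the number of samples in A.

```latex
\begin{align*}
d_{\text{single}}(A,B)   &= \min_{x \in A,\; y \in B} d(x,y) \\
d_{\text{complete}}(A,B) &= \max_{x \in A,\; y \in B} d(x,y) \\
d_{\text{UPGMA}}(A,B)    &= \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x,y)
\end{align*}
```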
You may download the entire collection of files as clustering.zip, or you may examine them individually through the links below.
Python source code:
cluster.py
This is the main program to be executed, and the only file that you should be
modifying.
All of the functions you should be implementing are
grouped together near the beginning of the file.
phylip.py
This file manages parsing of input files using the PHYLIP
distance matrix format. You need not read or edit this file, but
it must be in the same folder as the other files.
chronogram.py
This is a modified version of our earlier code for drawing
chronograms of our tree structures. It is slightly different
from the past, so please use this version for this project if
you wish to visualize.
Sample input files:
p127.txt
Distance matrix given on p. 127 of our text. The book works
through single linkage clustering.
p139.txt
Distance matrix given on p. 139 of our text. Note that they
use this as an example for the (extra credit) Neighbor-Joining
method, though you can still use this data for the other algorithms.
wiki.txt
Distances from the Wikipedia coverage of UPGMA
(though you are welcome to use this as input for other metrics as well).
edwards.txt
Distances for the detailed
example of UPGMA from Richard Edwards' lab at the Univ. of New South Wales
(though you are welcome to use this as input for other metrics as well).
carr.txt
Distances for the detailed
example of WPGMA from Steven Carr's lab at Memorial
University. Note well that they walk through the WEIGHTED Pair
Group Method (WPGMA), not the unweighted version (though they
give a nice explanation of how the two differ).
english.txt
Distance matrix we used for our earlier activity on building a
tree of relationships between an English saying. If you're
interested, see how the human-built trees compare to the
result of UPGMA on this data set.
We will provide you with a codebase that does much of the extraneous work such as managing user interaction and parsing of input files. Our code also implements the logic of the main loop of the clustering algorithm, and maintains a variety of important data structures.
During the hierarchical clustering algorithm, a series of clusters is formed. Initially, each original species is its own cluster; we then merge two clusters to make a new one, and continue in this fashion. This is also reflected in the tree structure: the leaves are each trivial subtrees, and those subtrees get combined into larger subtrees until one big tree remains.
In order to organize the various data structures, we assign an integer
ID to every cluster that we ever encounter while the algorithm runs,
numbered starting with the original species. The codebase maintains the following data structures:
names is a list of the original samples, such that names[0] is the name in the first row of the distance matrix, names[1] the next, and so forth.
clades is a list with an entry for every cluster ever encountered while the algorithm is running, with clades[k] itself being a list of ID numbers for the original samples that exist in that cluster.
trees is a list with an entry for every cluster ever encountered while the algorithm is running, such that trees[k] is the tuple representation of the corresponding subtree representing the clade.
active is a set of integer IDs that denote all clades that are currently active in the algorithm. That is, when merging clusters A and B to form C, A and B become inactive and are removed from this set, while C gets added.
dist is a list of dictionaries that store all distances that have been computed between clusters. In particular, if a and b are integer IDs representing clusters, then the syntax dist[a][b] denotes the computed distance between those clusters.
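To make the bookkeeping concrete, here is a minimal sketch of how these structures might look for three species, before and after one merge. The sample names and distances are made up, the symmetric storage of dist (both dist[a][b] and dist[b][a]) is an assumption, and the single linkage rule is used only to fill in the new distances. The trees list is left out because its tuple format is the one defined in the earlier projects.

```python
# Hypothetical starting state for three species (values are illustrative only).
names = ["A", "B", "C"]
clades = [[0], [1], [2]]            # clades[k] lists the original sample IDs in cluster k
active = {0, 1, 2}                  # every original species starts as an active cluster
dist = [
    {1: 2.0, 2: 6.0},               # distances from cluster 0
    {0: 2.0, 2: 5.0},               # distances from cluster 1
    {0: 6.0, 1: 5.0},               # distances from cluster 2
]

# Merge clusters 0 and 1 into a new cluster, which receives the next unused ID.
a, b = 0, 1
new_id = len(clades)                # 3 in this example
clades.append(clades[a] + clades[b])
active -= {a, b}                    # the merged clusters become inactive

# Record distances between the new cluster and every remaining active cluster.
# Single linkage is used here purely as an example of a linkage rule.
dist.append({})
for k in active:
    d = min(dist[a][k], dist[b][k])
    dist[new_id][k] = d
    dist[k][new_id] = d             # assumed: distances are stored in both directions

active.add(new_id)
```

After the merge, clades[3] is [0, 1], the active set is {2, 3}, and dist[3][2] holds the distance between the new cluster and species C.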
You are responsible for implementing the following functions within the provided cluster.py code:
find_nearest_active()
This function is responsible for examining the set of currently
active clusters and the precomputed dictionary of distances, and
locating and returning a pair of cluster IDs that have minimum
distance among the active pairs. The function should
technically return a pair of ID numbers for those clusters,
using a syntax such as
return a,b
presuming variables a and b are the appropriate IDs.
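As a rough illustration of the search described above, the sketch below scans all pairs of active IDs and keeps the pair with the smallest stored distance. The function name nearest_active_pair and its explicit parameters are hypothetical; the real find_nearest_active() takes no arguments and presumably reads the shared data structures of cluster.py directly.

```python
from itertools import combinations

def nearest_active_pair(active, dist):
    """Return a pair (a, b) of distinct active cluster IDs with minimum dist[a][b].

    Hypothetical helper: 'active' is a set of IDs and 'dist' is a list of
    dictionaries, as described in this handout. Assumes at least two active
    clusters; min() raises ValueError otherwise.
    """
    # Sorting makes the scan order (and therefore tie-breaking) deterministic.
    pairs = combinations(sorted(active), 2)
    return min(pairs, key=lambda p: dist[p[0]][p[1]])
```

Only pairs drawn from the active set are considered, so distances that involve an inactive cluster are ignored even if they are still stored in dist.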
combine_trees(a,b)
Given integers a and b that identify existing
clusters, this function is responsible for returning a tuple that
represents the tree that results after merging those clusters.
The
tuple structure should be the familiar one we defined when
drawing phylogeny trees in earlier projects.
single_linkage_cost(a,b)
Given integers a and b that identify existing
clusters, this function is responsible for computing (from
scratch) and returning the single linkage distance between those
clusters, that is the distance of the closest pair that exists
between an element of cluster a and an element of cluster b.
complete_linkage_cost(a,b)
Given integers a and b that identify existing
clusters, this function is responsible for computing (from
scratch) and returning the complete linkage distance between those
clusters, that is the distance of the farthest pair that exists
between an element of cluster a and an element of cluster b.
upgma_linkage_cost(a,b)
Given integers a and b that identify existing
clusters, this function is responsible for computing (from
scratch) and returning the UPGMA linkage distance between those
clusters, that is the average distance taken over all pairs involving
one element of cluster a and one element of cluster b.
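The three from-scratch computations share one pattern: collect the distance for every pair of original samples drawn from the two clusters, then take the minimum, maximum, or mean. The sketch below shows that pattern under two assumptions: the function names and explicit clades/dist parameters are hypothetical (the real functions take only a and b), and dist[i][j] is assumed to hold the input-matrix distance whenever i and j are IDs of original samples.

```python
def pairwise_distances(a, b, clades, dist):
    """List the distance for every (sample in cluster a, sample in cluster b) pair."""
    return [dist[i][j] for i in clades[a] for j in clades[b]]

def single_linkage(a, b, clades, dist):
    """Distance of the closest pair between the two clusters."""
    return min(pairwise_distances(a, b, clades, dist))

def complete_linkage(a, b, clades, dist):
    """Distance of the farthest pair between the two clusters."""
    return max(pairwise_distances(a, b, clades, dist))

def upgma_linkage(a, b, clades, dist):
    """Average distance over all pairs between the two clusters."""
    pairs = pairwise_distances(a, b, clades, dist)
    return sum(pairs) / len(pairs)

# Illustrative data: three species, with cluster 3 formed by merging 0 and 1.
example_clades = [[0], [1], [2], [0, 1]]
example_dist = [{1: 2.0, 2: 6.0}, {0: 2.0, 2: 5.0}, {0: 6.0, 1: 5.0}]
print(single_linkage(3, 2, example_clades, example_dist))    # min of 6.0 and 5.0
print(complete_linkage(3, 2, example_clades, example_dist))  # max of 6.0 and 5.0
print(upgma_linkage(3, 2, example_clades, example_dist))     # mean of 6.0 and 5.0
```

Each call examines every pair of samples across the two clusters, so the cost grows with the product of the cluster sizes.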
single_linkage_cost_efficient(a,b1,b2)
complete_linkage_cost_efficient(a,b1,b2)
upgma_linkage_cost_efficient(a,b1,b2)
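This handout does not describe the three efficient variants, so the following is only one plausible reading of their signatures: a is an existing cluster, b1 and b2 are the two clusters being merged, and dist[a][b1] and dist[a][b2] are already stored. Under that assumption, the distance from a to the merged cluster can be derived from those two stored values without revisiting individual samples. The function names and explicit parameters below are hypothetical.

```python
def single_update(a, b1, b2, dist):
    """Closest pair to the union of b1 and b2 is the closer of the two stored values."""
    return min(dist[a][b1], dist[a][b2])

def complete_update(a, b1, b2, dist):
    """Farthest pair to the union of b1 and b2 is the farther of the two stored values."""
    return max(dist[a][b1], dist[a][b2])

def upgma_update(a, b1, b2, clades, dist):
    """Average over all pairs, recovered by weighting each stored average
    by the number of samples in b1 and b2 respectively."""
    n1, n2 = len(clades[b1]), len(clades[b2])
    return (n1 * dist[a][b1] + n2 * dist[a][b2]) / (n1 + n2)
```

The weighting in upgma_update matters when b1 and b2 differ in size. A plain average of the two stored values would instead give the WPGMA result.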
You should submit your modified cluster.py file into the hw06 folder of your git repository.
Please note the late policy for homeworks.
This assignment is worth 40 points, with each of the eight functions for which you are responsible worth 5 points.