In the previous lecture, we introduced a general clustering algorithm for building phylogenetic trees. One important implementation detail is the linkage measure used to determine the "distance" between one cluster and another. One commonly used approach, based on averaging distances, is the Unweighted Pair Group Method with Arithmetic means (UPGMA).
When considering clusters $S$ and $T$, the UPGMA measure is the average of the distances taken over all pairs of individual elements $s \in S$ and $t \in T$. Formally, we can express that distance as $$d(S,T) = \frac{\sum_{s \in S, t \in T} d(s,t)}{|S| \cdot |T|}$$ For example, if we have a cluster of two items and a cluster of three items, we would be averaging over the six pairs. As a tangible example, consider the following distance matrix from the book:

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A | | | | | | |
| B | 1 | | | | | |
| C | 3 | 2 | | | | |
| D | 7 | 6 | 4 | | | |
| E | 17 | 16 | 14 | 10 | | |
| F | 19 | 18 | 16 | 12 | 2 | |

Consider a hypothetical cluster ABC and a cluster DE. Those clusters
would have distance 10.667 under UPGMA, computed as the average of
d(A,D) = 7
d(A,E) = 17
d(B,D) = 6
d(B,E) = 16
d(C,D) = 4
d(C,E) = 14
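
The pairwise definition can be checked directly in code. The following is a minimal Python sketch, assuming the distance matrix above is stored in a dictionary; the names `DIST`, `d`, and `upgma_distance` are illustrative and do not come from the lecture or the book.

```python
from itertools import product

# Pairwise distances from the matrix above, keyed by (earlier, later) label.
DIST = {
    ("A", "B"): 1,
    ("A", "C"): 3, ("B", "C"): 2,
    ("A", "D"): 7, ("B", "D"): 6, ("C", "D"): 4,
    ("A", "E"): 17, ("B", "E"): 16, ("C", "E"): 14, ("D", "E"): 10,
    ("A", "F"): 19, ("B", "F"): 18, ("C", "F"): 16, ("D", "F"): 12,
    ("E", "F"): 2,
}


def d(s, t):
    """Distance between two individual elements (symmetric lookup)."""
    return DIST[(s, t)] if (s, t) in DIST else DIST[(t, s)]


def upgma_distance(S, T):
    """Average of d(s, t) over all |S| * |T| cross-cluster pairs."""
    total = sum(d(s, t) for s, t in product(S, T))
    return total / (len(S) * len(T))


# Clusters are written as strings of single-letter labels.
print(upgma_distance("AB", "C"))               # 2.5
print(round(upgma_distance("ABC", "EF"), 3))   # 16.667
```

This version examines every cross-cluster pair, so its cost grows with $|S| \cdot |T|$.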
Although the definition of this average seems to involve a sum over quadratically many pairs, the computation can be performed much more efficiently when merging two clusters in the UPGMA algorithm, by reusing previously calculated values. In particular, assume that we have just merged clusters $T1$ and $T2$ to form cluster $T$, and we want to know the distance $d(S,T)$ for some other cluster $S$. We claim that $$d(S,T) = \frac{|T1| \cdot d(S,T1) + |T2| \cdot d(S,T2)}{|T|}$$ This can be shown algebraically by examining the underlying summations. Because $T1$ and $T2$ are disjoint, every pair counted in $d(S,T)$ involves an element of exactly one of them, and $|T| = |T1| + |T2|$. There are $|S|\cdot|T1|$ pairs associated with elements of $T1$. We know those pairs have an average of $d(S,T1)$ and thus a total sum of $|S| \cdot |T1| \cdot d(S,T1)$. Similarly, the $|S|\cdot|T2|$ pairs associated with $T2$ have sum $|S|\cdot |T2| \cdot d(S,T2)$. Thus the average of all $|S|\cdot|T|$ pairs is $$d(S,T) = \frac{|S|\cdot|T1| \cdot d(S,T1) + |S|\cdot|T2| \cdot d(S,T2)}{|S|\cdot|T|}$$ and we can cancel the factor of $|S|$ from the numerator and denominator.
The take-home lesson here is that when merging two clusters, the distance from the newly formed cluster to any other cluster can be computed in a constant amount of time, rather than by examining all the underlying pairs that contribute to that average.
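
The update rule can be written as a one-line function. This is a small Python sketch; the function name `merged_distance` and its parameter names are hypothetical, chosen only for illustration.

```python
def merged_distance(size_t1, d_s_t1, size_t2, d_s_t2):
    """d(S, T) for T = T1 merged with T2, from the two known distances.

    Uses |T| = |T1| + |T2|, which holds because the clusters are disjoint.
    """
    return (size_t1 * d_s_t1 + size_t2 * d_s_t2) / (size_t1 + size_t2)


# Example: T1 = AB (2 elements), T2 = C (1 element), S = D.
# From the matrix: d(D, AB) = (7 + 6) / 2 = 6.5 and d(D, C) = 4.
print(round(merged_distance(2, 6.5, 1, 4), 3))  # 5.667
```

The function never looks at the individual elements of $S$, $T1$, or $T2$; it only needs the two cluster sizes and the two stored distances.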
Returning to the distance matrix above, we start by connecting A and B to form cluster AB, since they have the smallest distance, 1. Assuming a constant mutation rate, we place their common ancestor 0.5 units in the past.

     0.5
      /------ A
      |
    --+
      |
      \------ B

For the UPGMA algorithm, we get the following matrix.

| | AB | C | D | E | F |
|---|---|---|---|---|---|
| AB | | | | | |
| C | 2.5 | | | | |
| D | 6.5 | 4 | | | |
| E | 16.5 | 14 | 10 | | |
| F | 18.5 | 16 | 12 | 2 | |

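
As a check on the AB row, applying the merge rule with $|A| = |B| = 1$ gives $$d(AB,C) = \frac{1 \cdot d(A,C) + 1 \cdot d(B,C)}{2} = \frac{3 + 2}{2} = 2.5$$ and likewise $d(AB,D) = \frac{7+6}{2} = 6.5$, $d(AB,E) = \frac{17+16}{2} = 16.5$, and $d(AB,F) = \frac{19+18}{2} = 18.5$. Distances among C, D, E, and F are unchanged.
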
We next merge E with F, having shortest distance 2, resulting in partial tree

     1.0   0.5    0.0
            /------ A
            |
    --------+
            |
            \------ B

      /------------ E
      |
    --+
      |
      \------------ F

and resulting distances

| | AB | C | D | EF |
|---|---|---|---|---|
| AB | | | | |
| C | 2.5 | | | |
| D | 6.5 | 4 | | |
| EF | 17.5 | 15 | 11 | |

We next merge AB with C, having shortest distance 2.5, resulting in partial tree

        1.0   0.5   0.0
               /------ A
               |
     /---------+
     |         |
    -+         \------ B
     |
     \---------------- C

         /------------ E
         |
       --+
         |
         \------------ F

and resulting distances

| | ABC | D | EF |
|---|---|---|---|
| ABC | | | |
| D | 5.67 | | |
| EF | 16.667 | 11 | |

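
The ABC row follows from the merge rule with $|AB| = 2$ and $|C| = 1$, using the previous matrix: $$d(ABC,D) = \frac{2 \cdot 6.5 + 1 \cdot 4}{3} = \frac{17}{3} \approx 5.667, \qquad d(ABC,EF) = \frac{2 \cdot 17.5 + 1 \cdot 15}{3} = \frac{50}{3} \approx 16.667$$ Note that AB is weighted twice as heavily as C, because it contributes twice as many underlying pairs.
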
We next merge ABC with D, having distance 5.67, resulting in partial tree

          2.0         1.0   0.5   0.0
                             /------ A
                             |
                   /---------+
                   |         |
       /-----------+         \------ B
       |           |
     --+           \---------------- C
       |
       \---------------------------- D

                       /------------ E
                       |
                     --+
                       |
                       \------------ F

and resulting distances

| | ABCD | EF |
|---|---|---|
| ABCD | | |
| EF | 15.25 | |

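
The whole walkthrough can be reproduced with a short loop. The following is a minimal Python sketch rather than a reference implementation: the function name `upgma`, the dictionary layout, and the choice to break ties by taking the first minimum are all assumptions made for illustration. It scans all pairs to find the closest clusters, so only the distance update is constant time per cluster.

```python
from itertools import combinations


def upgma(labels, dist):
    """Run UPGMA and return the merges as (new_cluster, distance) pairs.

    `dist` maps frozenset({x, y}) to the distance between labels x and y.
    Ties are broken by taking the first minimum found (an assumption).
    """
    size = {x: 1 for x in labels}   # cluster name -> number of leaves
    d = dict(dist)                  # working copy of cluster distances
    merges = []
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        dab = d[frozenset((a, b))]
        merged = "".join(sorted(a + b))
        # Constant-time update of the distance to every other cluster.
        for s in size:
            if s not in (a, b):
                d[frozenset((s, merged))] = (
                    size[a] * d[frozenset((s, a))]
                    + size[b] * d[frozenset((s, b))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
        merges.append((merged, dab))
    return merges


# Lower-triangular rows of the original matrix.
labels = "ABCDEF"
rows = {
    "B": [1],
    "C": [3, 2],
    "D": [7, 6, 4],
    "E": [17, 16, 14, 10],
    "F": [19, 18, 16, 12, 2],
}
dist = {
    frozenset((r, c)): v
    for r, vals in rows.items()
    for c, v in zip(labels, vals)
}

merges = upgma(labels, dist)
for name, dab in merges:
    # Under a constant mutation rate, the ancestor sits at half the distance.
    print(f"{name}: distance {dab:.3f}, ancestor height {dab / 2:.3f}")
```

On this matrix the sketch merges AB, EF, ABC, and ABCD in that order, matching the steps above, and then joins ABCD with EF at the remaining distance of 15.25.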
Wikipedia (with example)
UPGMA example (this one showing that final tree might not match original model). However, note well that their second example has a calculation error (following WPGMA rather than UPGMA).
A WPGMA example and discussion of the difference between UPGMA and WPGMA calculations