
Unweighted Pair Group Method with Arithmetic means (UPGMA) algorithm


Overview

In the previous lecture, we introduced a general clustering algorithm for building phylogenetic trees. One important implementation detail is the linkage measure used to determine the "distance" between one cluster and another. One commonly used approach, based on averaging the pairwise distances, is the Unweighted Pair Group Method with Arithmetic means (UPGMA).


Formal Definition

When considering clusters $S$ and $T$, the UPGMA measure is equal to the average of the distances taken over all pairs of individual elements $s \in S$ and $t \in T$. Formally, we denote that distance as $$d(S,T) = \frac{\sum_{s \in S, t \in T} d(s,t)}{|S| \cdot |T|}$$ For example, if we have a cluster of two items and a cluster of three items, we would be averaging over the six pairs formed by taking one item from each cluster. As a tangible example, consider the following distance matrix from the book:

     A    B    C    D    E    F
A
B    1
C    3    2
D    7    6    4
E   17   16   14   10
F   19   18   16   12    2

Consider a hypothetical cluster ABC and a cluster DE. Those clusters would have distance 10.667 under UPGMA, computed as the average of
d(A,D) = 7
d(A,E) = 17
d(B,D) = 6
d(B,E) = 16
d(C,D) = 4
d(C,E) = 14
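
The definition can be checked directly with a few lines of Python. This is only a minimal sketch: it assumes the distance matrix is stored as a dictionary keyed by unordered pairs of leaf names, and the names `LABELS`, `ROWS`, `DIST` and `upgma_distance` are illustrative choices rather than anything from the book.

```python
from itertools import product

# Lower-triangular distance matrix from the book, one row per leaf.
LABELS = "ABCDEF"
ROWS = [[], [1], [3, 2], [7, 6, 4], [17, 16, 14, 10], [19, 18, 16, 12, 2]]

# Assumed storage: map each unordered pair of leaves to its distance.
DIST = {frozenset((LABELS[i], LABELS[j])): value
        for i, row in enumerate(ROWS)
        for j, value in enumerate(row)}

def upgma_distance(S, T, dist=DIST):
    """Average of d(s, t) over all pairs with s in S and t in T."""
    total = sum(dist[frozenset((s, t))] for s, t in product(S, T))
    return total / (len(S) * len(T))

# Distance between the hypothetical clusters ABC and DE.
print(round(upgma_distance("ABC", "DE"), 3))
```

Because this version sums over every pair, its running time grows with $|S| \cdot |T|$.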


Efficient Computation

While the underlying definition of the average appears to involve a sum over quadratically many pairs, this computation can be performed much more efficiently in the context of merging two clusters in the UPGMA algorithm, by reusing previously calculated values. In particular, assume that we have just merged clusters $T1$ and $T2$ to form cluster $T$, and we want to know the distance $d(S,T)$ for some other cluster $S$. We claim that $$d(S,T) = \frac{|T1| \cdot d(S,T1) + |T2| \cdot d(S,T2)}{|T|}$$ This can be shown algebraically by examining the underlying summations. Notice that there are $|S|\cdot|T1|$ pairs associated with elements of $T1$. We know those pairs have an average of $d(S,T1)$ and thus a total sum of $|S| \cdot |T1| \cdot d(S,T1)$. Similarly, the $|S|\cdot|T2|$ pairs associated with $T2$ have sum $|S|\cdot |T2| \cdot d(S,T2)$. Since $|T| = |T1| + |T2|$, the average of all $|S|\cdot|T|$ pairs is $$d(S,T) = \frac{|S|\cdot|T1| \cdot d(S,T1) + |S|\cdot|T2| \cdot d(S,T2)}{|S|\cdot|T|}$$ and canceling the factor of $|S|$ from the numerator and denominator gives the claimed formula.

The take-home lesson here is that when merging two clusters, the distance from the newly formed cluster to any other cluster can be computed in a constant amount of time, rather than by examining all the underlying pairs that contribute to that average.
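
As a rough illustration of the constant-time update, the sketch below applies the weighted-average formula to two previously computed distances. The function name `merged_distance` and its parameters are hypothetical, and it assumes the two merged clusters are disjoint, so that the size of the merged cluster is the sum of the two sizes.

```python
def merged_distance(size_t1, d_s_t1, size_t2, d_s_t2):
    """Distance from S to the merge of T1 and T2, given d(S,T1) and d(S,T2).

    Assumes T1 and T2 are disjoint, so |T| = |T1| + |T2|.
    """
    return (size_t1 * d_s_t1 + size_t2 * d_s_t2) / (size_t1 + size_t2)

# Example with S = {D}, T1 = {A, B}, T2 = {C}, using the book's matrix:
d_D_AB = (7 + 6) / 2      # average of d(A,D) and d(B,D)
d_D_C = 4                 # d(C,D)
shortcut = merged_distance(2, d_D_AB, 1, d_D_C)

# Same quantity straight from the definition: average of all three pairs.
brute_force = (7 + 6 + 4) / 3

print(abs(shortcut - brute_force) < 1e-12)
```

The update uses only the two cluster sizes and the two stored distances, no matter how many leaves the clusters contain.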


Textbook Example

We start by connecting A and B to form cluster AB, given that they have the smallest distance, 1. Assuming a constant mutation rate, we define their common ancestor to be 0.5 units in the past.

 0.5
  /------ A
  |
--+
  |
  \------ B
For the UPGMA algorithm, we get the following matrix.
      AB    C    D    E    F
AB
C    2.5
D    6.5    4
E   16.5   14   10
F   18.5   16   12    2
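
For reference, the entries in the AB row and column follow from the merge formula with $|A| = |B| = 1$; the other entries carry over unchanged from the original matrix. Written out as a worked check:

$$d(AB,C) = \frac{1 \cdot d(A,C) + 1 \cdot d(B,C)}{2} = \frac{3 + 2}{2} = 2.5$$
$$d(AB,D) = \frac{1 \cdot d(A,D) + 1 \cdot d(B,D)}{2} = \frac{7 + 6}{2} = 6.5$$
$$d(AB,E) = \frac{1 \cdot d(A,E) + 1 \cdot d(B,E)}{2} = \frac{17 + 16}{2} = 16.5$$
$$d(AB,F) = \frac{1 \cdot d(A,F) + 1 \cdot d(B,F)}{2} = \frac{19 + 18}{2} = 18.5$$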

We next merge E with F, having shortest distance 2, resulting in partial tree

 1.0   0.5    0.0
        /------ A
        |
--------+
        |
        \------ B

  /------------ E
  |
--+
  |
  \------------ F
and resulting distances
      AB    C    D   EF
AB
C    2.5
D    6.5    4
EF  17.5   15   11

We next merge AB with C, having shortest distance 2.5, resulting in partial tree

    1.0   0.5   0.0
           /------ A
           |
 /---------+
 |         |
-+         \------ B
 |
 \---------------- C

     /------------ E
     |
   --+
     |
     \------------ F
and resulting distances
         ABC    D   EF
ABC
D       5.67
EF    16.667   11
Note that the distance from D to ABC is 5.67, which was calculated as $$\frac{2 \cdot 6.5 + 1 \cdot 4}{3}$$ or, going back to first principles, as the average of the original $d(A,D)=7$, $d(B,D)=6$ and $d(C,D)=4$. Similarly, the new measure $d(ABC,EF)$ was computed as $$\frac{2 \cdot 17.5 + 1 \cdot 15}{3}$$ or, going back to first principles, as the unweighted average of the six distances 17, 19, 16, 18, 14, 16.

We next merge ABC with D, having distance 5.67, resulting in partial tree

     2.0          1.0   0.5   0.0
                         /------ A
                         |
               /---------+
               |         |
   /-----------+         \------ B
   |           |
-- +           \---------------- C
   |  
   \---------------------------- D
     
                   /------------ E
                   |
                 --+
                   |
                   \------------ F
and resulting distances
        ABCD   EF
ABCD
EF     15.25
and we would conclude by doing the final merge of ABCD and EF with distance 15.25, which was calculated as $\frac{3 \cdot 16.667 + 1 \cdot 11}{4}$.
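
The whole walkthrough can be reproduced with a short Python sketch. It is one possible arrangement under stated assumptions, not a reference implementation: the dictionary-of-pairs storage, the function name `upgma`, and the convention of naming a merged cluster by concatenating its sorted leaf names are all illustrative, and the naming convention only works for single-character leaf labels. It searches all cluster pairs at each step for simplicity and uses the constant-time formula only for updating distances.

```python
from itertools import combinations

LABELS = "ABCDEF"
ROWS = [[], [1], [3, 2], [7, 6, 4], [17, 16, 14, 10], [19, 18, 16, 12, 2]]
DIST = {frozenset((LABELS[i], LABELS[j])): value
        for i, row in enumerate(ROWS)
        for j, value in enumerate(row)}

def upgma(labels, dist):
    """Return the merges performed, as (cluster_a, cluster_b, distance)."""
    size = {name: 1 for name in labels}   # current clusters -> leaf count
    d = dict(dist)                        # working copy, keyed by cluster pairs
    merges = []
    while len(size) > 1:
        # Find the closest pair among the current clusters.
        a, b = sorted(min(combinations(size, 2),
                          key=lambda pair: d[frozenset(pair)]))
        d_ab = d[frozenset((a, b))]
        merged = "".join(sorted(a + b))
        # Weighted-average update of the distance to each remaining cluster.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
        merges.append((a, b, d_ab))
    return merges

for a, b, distance in upgma(LABELS, DIST):
    # Under the constant-rate assumption, the ancestor sits at half the distance.
    print(f"merge {a} + {b}: distance {distance:.3f}, height {distance / 2:.3f}")
```

On the book's matrix this performs the merges in the same order as the walkthrough: A with B, E with F, AB with C, ABC with D, and finally ABCD with EF.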



Michael Goldwasser
Last modified: Tuesday, 26 March 2019