Course Home | Assignments | Data Sets/Tools | Python | Schedule | Git Submission | Tutoring

Phylogeny Tree Building Algorithms


Reading: Chapter 7

Overview

While there are some computational problems that we know how to solve optimally and efficiently (e.g., various pairwise sequence alignments), phylogeny tree building is NOT one of those! One of the biggest challenges is the combinatorial explosion in the number of tree topologies.


Combinatorics

If considering only two modern species, there is only one tree model to describe their relationship, which is that they must share a common ancestor.

  /--- A
  |
--+
  |
  \--- B
(Note that there is no significance regarding which is the top branch versus the bottom branch.)

With three leaves, the shape of the tree must be

      /---  *
      |
  /---+
  |   |
--+   \---  *
  |
  \-------  *
and thus there are three distinct models depending on which species serves as the outgroup of the other two.

We then pose the question: how many distinct such trees exist for modeling the relationship among four species?

What we find in general is that the number of tree models grows super-exponentially:

   n   #trees
   2   1
   3   3
   4   [student exercise]
   5   [bonus exercise]
   6   945
   7   10,395
   8   135,135
   9   2,027,025
  10   34,459,425
 ...   ...
  20   8,200,794,532,637,891,559,375

Note well that the n=20 result is more than 8 trillion billion (roughly 8.2 x 10^21).

Spoiler: python code that computes the above combinatorics.
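
As a minimal sketch of such a computation (not necessarily the course's own code), the function below assumes the standard count for rooted binary trees with n labeled leaves, the product 1 x 3 x 5 x ... x (2n-3). That formula is not stated in these notes, but it agrees with the table entries above. The name count_rooted_trees is made up for illustration.

```python
def count_rooted_trees(n):
    """Number of distinct rooted binary trees on n labeled leaves.

    Assumes the standard formula 1 * 3 * 5 * ... * (2n - 3).
    """
    if n < 2:
        raise ValueError("need at least two leaves")
    count = 1
    # A rooted binary tree on k-1 leaves has 2k-4 edges; the k-th leaf can
    # attach along any of them or above the root, giving 2k-3 choices.
    for k in range(3, n + 1):
        count *= 2 * k - 3
    return count


for n in (6, 7, 8, 9, 10, 20):
    print(n, f"{count_rooted_trees(n):,}")
```

Python integers have arbitrary precision, so the n=20 value is computed exactly.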

It is also worth noting that even if we fix a set of distances for a chronogram, there may still be several distinctly different evolutionary relationships that are consistent with them. In particular, Figure 7.2 of the St. Clair/Visick book shows multiple different trees that are consistent with the following distances:

        A   B   C   D
  A     -   4   2   4
  B     4   -   4   2
  C     2   4   -   4
  D     4   2   4   -


Tree-building Algorithms

Character-based algorithms
These directly use the aligned sequences and typically rely on probabilistic models that evaluate the likelihood of each possible tree. This can be very effective for trying to find the best model for a relatively small group of related species, but becomes intractable as the number of trees grows.

Distance-based algorithms
Multiple sequence alignment is used to produce a set of pairwise distances between samples of the group, and the subsequent tree building is then based only on those distances (not on the further information in the underlying sequences). These approaches can be tractable on much larger groups, but they are typically heuristic and do not guarantee that they produce the most likely result.


Clustering Algorithms

The most common distance-based tree-building algorithms are based on greedy clustering. That is, starting with each individual species as its own cluster, and given computed distances, a tree is built up hierarchically as follows.

while there exist two or more clusters:

  * Let A and B be the two "closest" clusters

  * Merge A and B in the tree, with new edges having distances that
    match the computed distance for those clusters

  * Recompute distances from all other clusters to the new AB cluster
    (the so-called "linkage method")
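
The loop above can be sketched in Python as follows. This is one possible reading of the pseudocode, not a reference implementation: the function name greedy_cluster, the dictionary-of-frozensets input format, and the four-species toy distances are all assumptions made for illustration. The linkage rule is passed in as a function of the two old distances.

```python
from itertools import combinations


def greedy_cluster(names, dist, linkage=min):
    """Greedy hierarchical clustering (hypothetical helper).

    names:   list of leaf labels
    dist:    dict mapping frozenset({x, y}) -> distance between leaves
    linkage: combines the two old distances into the new one
             (min = single linkage, max = complete linkage)
    Returns the merges, in order, as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in names]
    d = {frozenset({(x,), (y,)}): dist[frozenset({x, y})]
         for x, y in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair; ties go to the first pair encountered.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset({a, b})]))
        merged = a + b
        others = [c for c in clusters if c not in (a, b)]
        # Recompute distances from every other cluster to the merged one.
        for c in others:
            d[frozenset({merged, c})] = linkage(d[frozenset({a, c})],
                                                d[frozenset({b, c})])
        clusters = others + [merged]
    return merges


# Toy input, invented for this sketch (not from the text).
names = ["w", "x", "y", "z"]
dist = {frozenset(pair): value for pair, value in [
    (("w", "x"), 2), (("w", "y"), 6), (("w", "z"), 10),
    (("x", "y"), 5), (("x", "z"), 9), (("y", "z"), 8)]}

for a, b, distance in greedy_cluster(names, dist, linkage=min):
    print("merge", "".join(a), "+", "".join(b), "at distance", distance)
```

Passing linkage=max switches the same loop to complete linkage; how ties are broken is a design choice that can change the resulting tree.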

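The big question is how to define the distance from one cluster of species to another. Three varying definitions are as follows.

  * Single linkage: the smallest distance between a member of one
    cluster and a member of the other

  * Complete linkage: the largest such member-to-member distance

  * Centroid linkage: the average of the member-to-member distances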


Example from Text

       A   B   C   D   E   F
  A
  B     1
  C     3   2
  D     7   6   4
  E    17  16  14  10
  F    19  18  16  12   2
We start by connecting A and B to form cluster AB, given that they have the smallest distance, 1. Assuming a constant mutation rate, we define their common ancestor to be 0.5 units in the past.

 0.5
  /------ A
  |
--+
  |
  \------ B
The question then becomes what is the distance from AB to other elements. For example, C has distance 3 from A and 2 from B. If using single linkage, we say that C has distance 2 from AB (because it is only 2 away from B). If using complete linkage, we'd say C has distance 3 from AB (because it is 3 away from A). For the centroid linkage, we'd say that C has distance 2.5 from AB, as that is the average of the C-A and C-B distances.
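
The three values quoted above can be checked with a few lines of Python. This is only an illustrative sketch; the function name distance_to_cluster and the method strings are assumptions, and "centroid" follows this page's usage (the plain average).

```python
def distance_to_cluster(member_distances, method):
    """Distance from an outside species to a cluster, given its
    distances to each member of that cluster (hypothetical helper)."""
    if method == "single":
        return min(member_distances)
    if method == "complete":
        return max(member_distances)
    if method == "centroid":  # this page's name for the average
        return sum(member_distances) / len(member_distances)
    raise ValueError(f"unknown linkage method: {method}")


c_to_ab = [3, 2]  # C-A and C-B distances from the example
for method in ("single", "complete", "centroid"):
    print(method, distance_to_cluster(c_to_ab, method))
```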

Using single linkage as the approach, we'd get the following remaining matrix.
        AB   C   D   E   F
  AB
  C      2
  D      6   4
  E     16  14  10
  F     18  16  12   2
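
The single-linkage update that produces this reduced matrix can be sketched as below. The helper merge_single and the tuple-keyed dictionary are assumptions made for illustration; the distances are those of the original six-species table.

```python
# Pairwise distances from the example, keyed by sorted name pairs.
dist = {
    ("A", "B"): 1,
    ("A", "C"): 3, ("B", "C"): 2,
    ("A", "D"): 7, ("B", "D"): 6, ("C", "D"): 4,
    ("A", "E"): 17, ("B", "E"): 16, ("C", "E"): 14, ("D", "E"): 10,
    ("A", "F"): 19, ("B", "F"): 18, ("C", "F"): 16, ("D", "F"): 12,
    ("E", "F"): 2,
}


def merge_single(dist, a, b):
    """Merge clusters a and b, keeping the smaller of the two old
    distances to every other cluster (single linkage)."""
    merged = a + b
    reduced = {}
    for (x, y), d in dist.items():
        if {x, y} == {a, b}:
            continue  # the merged pair itself disappears
        x2 = merged if x in (a, b) else x
        y2 = merged if y in (a, b) else y
        key = tuple(sorted((x2, y2)))
        reduced[key] = min(d, reduced.get(key, d))
    return reduced


reduced = merge_single(dist, "A", "B")
for pair, d in sorted(reduced.items()):
    print(pair, d)
```

Swapping min for max in merge_single would give the complete-linkage version of the same step.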

Next there is a tie with shortest distance 2. We could therefore choose to either merge E with F or to merge AB with C. (In this case, it doesn't matter how we proceed as we'll do the other merge as the next step, but in some examples we might get different results depending on how we break ties.) Let's assume we merge EF, resulting in partial tree

 1.0   0.5    0.0
        /------ A
        |
--------+
        |
        \------ B

  /------------ E
  |
--+
  |
  \------------ F
and resulting distances
        AB   C   D  EF
  AB
  C      2
  D      6   4
  EF    16  14  10

Next we merge AB and C to get

 1.0   0.5    0.0
        /------ A
        |
  /-----+
  |     |
--+     \------ B
  |
  \------------ C


  /------------ E
  |
--+
  |
  \------------ F
and resulting distances
       ABC   D  EF
  ABC
  D      4
  EF    14  10

Next we merge ABC and D to get

  2.0        1.0   0.5    0.0
                    /------ A
                    |
              /-----+
              |     |
  /-----------+     \------ B
  |           |
--+           \------------ C
  |
  \------------------------ D

              /------------ E
              |
  ------------+
              |
              \------------ F
and resulting distances
          ABCD    EF
  ABCD
  EF        10

Finally the remaining two clusters are combined with a common ancestor assumed to be 5 units in the past.

 5.0                                     2.0        1.0   0.5    0.0
                                                           /------ A
                                                           |
                                                     /-----+
                                                     |     |
                                         /-----------+     \------ B
                                         |           |
  /--------------------------------------+           \------------ C
  |                                      |
  |                                      \------------------------ D
--+
  |                                                  /------------ E
  |                                                  |
  \--------------------------------------------------+
                                                     |
                                                     \------------ F
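
To make the final tree machine-readable, the merge sequence from this example can be replayed into a Newick string with branch lengths. Newick output is not part of the text; it is used here only as a convenient notation, and the function name to_newick is an assumption. Each ancestor is placed at half the merge distance, following the constant-rate assumption above, and each branch length is the parent's age minus the child's age.

```python
# Merges from the worked example: (left cluster, right cluster, distance).
merges = [
    ("A", "B", 1),
    ("E", "F", 2),
    ("AB", "C", 2),
    ("ABC", "D", 4),
    ("ABCD", "EF", 10),
]


def to_newick(merges):
    """Replay a merge list into a Newick string (hypothetical helper)."""
    age = {}      # cluster name -> age of its root (leaves have age 0)
    subtree = {}  # cluster name -> Newick text for that cluster
    name = None
    for left, right, distance in merges:
        parent_age = distance / 2  # constant-rate assumption
        parts = []
        for child in (left, right):
            child_age = age.get(child, 0.0)
            text = subtree.get(child, child)
            parts.append(f"{text}:{parent_age - child_age}")
        name = left + right
        age[name] = parent_age
        subtree[name] = "(" + ",".join(parts) + ")"
    return subtree[name] + ";"


print(to_newick(merges))
```

The root of the printed tree sits 5 units in the past, with ABCD and EF hanging off branches of length 3.0 and 4.0 respectively.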


Michael Goldwasser
Last modified: Saturday, 23 March 2019