Phylogeny Tree Building Algorithms

Reading: Chapter 7

Overview

While there are some computational problems that we know how to solve optimally and efficiently (e.g., various pairwise sequence alignments), phylogeny tree building is NOT one of those! One of the biggest challenges is the combinatorial explosion in the number of tree topologies.

Combinatorics

If considering only two modern species, there is only one tree model to describe their relationship, which is that they must share a common ancestor.

  /--- A
  |
--+
  |
  \--- B

(Note that there is no significance regarding which is the top branch versus the bottom branch.)

With three leaves, the shape of the tree must be

      /---  *
      |
  /---+
  |   |
--+   \---  *
  |
  \-------  *

and thus there are three distinct models depending on which species serves as the outgroup of the other two.

We then pose the question: how many distinct such trees exist for modeling the relationship between four species?

What we find in general is that the number of tree models grows super-exponentially with the number of species n:

n	#trees
2	1
3	3
4	[student exercise]
5	[bonus exercise]
6	945
7	10,395
8	135,135
9	2,027,025
10	34,459,425
...	...
20	8,200,794,532,637,891,559,375

Note well that the n=20 result is more than 8 trillion billion (roughly 8.2 × 10^21).

Spoiler: Python code that computes the above combinatorics.
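
As a minimal sketch of such a computation (not the course's own code), the following assumes the standard counting argument for rooted binary trees with labeled leaves: the k-th species can be attached to a tree on k-1 species in 2k-3 ways, so the count for n species is the product 1 * 3 * 5 * ... * (2n-3). The function name num_rooted_trees is a hypothetical choice for illustration.

```python
def num_rooted_trees(n):
    """Count rooted binary trees on n labeled leaves (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two leaves")
    count = 1
    # Attaching the k-th leaf to a tree on k-1 leaves can be done in
    # 2k-3 ways: on any of its 2k-4 edges, or above its root.
    for k in range(3, n + 1):
        count *= 2 * k - 3
    return count


if __name__ == "__main__":
    # Print the rows of the table that are not left as exercises.
    for n in (2, 3, 6, 7, 8, 9, 10, 20):
        print(n, f"{num_rooted_trees(n):,}")
```

Because Python integers have arbitrary precision, the n=20 value is computed exactly rather than as a floating-point approximation.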

It is also worth noting that even if we fix a set of distances for a chronogram, there might still be several distinctly different evolutionary relationships that are consistent with them. In particular, Figure 7.2 of the St. Clair/Visick book shows multiple different trees that are consistent with the following distances:

	A	B	C	D
A		4	2	4
B	4		4	2
C	2	4		4
D	4	2	4

Tree-building Algorithms

Character-based algorithms
These directly use the aligned sequences and typically rely on probabilistic models that evaluate the likelihood of each possible tree. This can be very effective for finding the best model for a relatively small group of related species, but becomes intractable as the number of possible trees grows.

Distance-based algorithms
Multiple sequence alignment is used to produce a set of pairwise distances between samples of the group, and the subsequent tree building is then based only on those distances (not on the further information in the underlying sequences). These approaches can be tractable on much larger groups, but they are typically heuristic and do not guarantee that they produce the most likely result.

Clustering Algorithms

The most common distance-based tree-building algorithms are based on greedy clustering. That is, starting with each individual species as its own cluster, and given computed distances, a tree is built up hierarchically as follows.

while there exist two or more clusters:

  * Let A and B be the two "closest" clusters

  * Merge A and B in the tree, with new edges having distances that
    match the computed distance for those clusters

  * Recompute distances from all other clusters to the new AB cluster
    (using the so-called "linkage method")

The big question is how to define the distance from one cluster of species to another. Three varying definitions are as follows.

Single linkage:
The distance between the clusters is defined to be the distance between the two nearest elements of the respective clusters.
Complete linkage:
The distance between the clusters is defined to be the distance between the two furthest elements of the respective clusters.
Centroid linkage:
The distance between the clusters is based upon the distance of the "centers" of the two clusters, which is typically defined as some form of averaging of the pairwise distances of elements of the clusters.

Example from Text

	A	B	C	D	E
A
B	1
C	3	2
D	7	6	4
E	17	16	14	10
F	19	18	16	12	2

We start by connecting A and B to form cluster AB, given that they have the smallest distance, 1. Assuming a constant mutation rate, we define their common ancestor to be 0.5 units in the past.

 0.5
  /------ A
  |
--+
  |
  \------ B

The question then becomes how to measure the distance from AB to the other elements. For example, C has distance 3 from A and 2 from B. If using single linkage, we say that C has distance 2 from AB (because it is only 2 away from B). If using complete linkage, we'd say C has distance 3 from AB (because it is 3 away from A). For centroid linkage, we'd say that C has distance 2.5 from AB, as that is the average of the C-A and C-B distances.
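
The three rules can be compared side by side with a small sketch. The names DIST, pair_distance, and cluster_distance are hypothetical, and the "average" option is an assumption that treats centroid linkage as the plain mean of the pairwise distances, which is how the example above computes 2.5.

```python
from itertools import product

# Pairwise distances from the example matrix (only the entries needed here).
DIST = {("A", "C"): 3, ("B", "C"): 2}


def pair_distance(x, y):
    """Look up a symmetric pairwise distance."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def cluster_distance(c1, c2, linkage):
    """Distance between two clusters, each a tuple of species names."""
    dists = [pair_distance(x, y) for x, y in product(c1, c2)]
    if linkage == "single":      # nearest pair of elements
        return min(dists)
    if linkage == "complete":    # furthest pair of elements
        return max(dists)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


if __name__ == "__main__":
    for linkage in ("single", "complete", "average"):
        print(linkage, cluster_distance(("A", "B"), ("C",), linkage))
```

For the clusters AB and C this gives 2, 3, and 2.5 respectively, matching the values worked out by hand.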

Using single linkage as the approach, we'd get the following remaining matrix.

	AB	C	D	E
AB
C	2
D	6	4
E	16	14	10
F	18	16	12	2

Next there is a tie for the shortest distance, 2. We could therefore choose either to merge E with F or to merge AB with C. (In this case, it doesn't matter how we proceed, as we'll do the other merge as the next step, but in some examples we might get different results depending on how we break ties.) Let's assume we merge E and F into EF, resulting in the partial tree

 1.0   0.5    0.0
        /------ A
        |
--------+
        |
        \------ B

  /------------ E
  |
--+
  |
  \------------ F

and resulting distances

	AB	C	D
AB
C	2
D	6	4
EF	16	14	10

Next we merge AB and C to get

 1.0   0.5    0.0
        /------ A
        |
  /-----+
  |     |
--+     \------ B
  |
  \------------ C


  /------------ E
  |
--+
  |
  \------------ F

and resulting distances

	ABC	D
ABC
D	4
EF	14	10

Next we merge ABC and D to get

  2.0        1.0   0.5    0.0
                    /------ A
                    |
              /-----+
              |     |
  /-----------+     \------ B
  |           |
--+           \------------ C
  |
  \------------------------ D

              /------------ E
              |
  ------------+
              |
              \------------ F

and resulting distances

	ABCD	EF
ABCD
EF	10

Finally the remaining two clusters are combined with a common ancestor assumed to be 5 units in the past.

 5.0                                     2.0        1.0   0.5    0.0
                                                           /------ A
                                                           |
                                                     /-----+
                                                     |     |
                                         /-----------+     \------ B
                                         |           |
  /--------------------------------------+           \------------ C
  |                                      |
  |                                      \------------------------ D
--+
  |                                                  /------------ E
  |                                                  |
  \--------------------------------------------------+
                                                     |
                                                     \------------ F
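
The whole walkthrough can be reproduced with a short sketch of greedy single-linkage clustering. This is one possible implementation under stated assumptions, not a definitive one: the names SPECIES, LOWER, and single_linkage are hypothetical, species names are assumed to be single characters (so that cluster names such as "ABC" are unambiguous), and each ancestor is placed at half the merge distance, following the constant mutation rate assumption used above.

```python
from itertools import combinations

# Lower-triangular example matrix from the text, one row per species.
SPECIES = ["A", "B", "C", "D", "E", "F"]
LOWER = [[], [1], [3, 2], [7, 6, 4], [17, 16, 14, 10], [19, 18, 16, 12, 2]]


def single_linkage(names, lower):
    """Greedy single-linkage clustering.

    Returns the merges in order, each as (cluster1, cluster2, ancestor_height).
    """
    # Distances between every pair of singleton clusters.
    dist = {}
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            dist[frozenset((names[i], names[j]))] = d
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters; ties go to the first pair in list order.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        d = dist[frozenset((a, b))]
        merged = "".join(sorted(a + b))
        # Single linkage: the merged cluster is as close to c as its closer half.
        for c in clusters:
            if c not in (a, b):
                dist[frozenset((merged, c))] = min(
                    dist[frozenset((a, c))], dist[frozenset((b, c))]
                )
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((a, b, d / 2))
    return merges


if __name__ == "__main__":
    for a, b, height in single_linkage(SPECIES, LOWER):
        print(f"merge {a} + {b} at height {height}")
```

With this tie-breaking rule the sketch merges AB with C before merging E with F, the reverse of the order chosen above. As noted earlier, that choice does not affect the result here: the ancestor heights come out as 0.5, 1.0, 1.0, 2.0, and 5.0.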

Michael Goldwasser

Last modified: Saturday, 23 March 2019