While there are some computational problems that we know how to solve optimally and efficiently (e.g., various pairwise sequence alignments), phylogenetic tree building is NOT one of those! One of the biggest challenges is the combinatorial explosion in the number of tree topologies.
If considering only two modern species, there is only one tree model to describe their relationship, which is that they must share a common ancestor.
      /--- A
      |
    --+
      |
      \--- B

(Note that there is no significance regarding which is the top branch and which is the bottom branch.)
With three leaves, the shape of the tree must be
          /--- *
          |
      /---+
      |   |
    --+   \--- *
      |
      \------- *

and thus there are three distinct models, depending on which species serves as the outgroup of the other two.
We then pose the question: how many distinct such trees exist for modeling the relationship between four species?
What we find in general is that the number of tree models grows super-exponentially:
n | #trees |
---|---|
2 | 1 |
3 | 3 |
4 | [student exercise] |
5 | [bonus exercise] |
6 | 945 |
7 | 10,395 |
8 | 135,135 |
9 | 2,027,025 |
10 | 34,459,425 |
... | ... |
20 | 8,200,794,532,637,891,559,375 |
Spoiler: python code that computes the above combinatorics.
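One way such a computation could look is sketched below. It assumes the table counts rooted, fully bifurcating trees with labeled leaves, in which case adding the k-th leaf offers 2k-3 attachment points and the total is the product 1 · 3 · 5 · ... · (2n-3). The function name `count_rooted_trees` is a made-up label for this sketch. The values for n = 4 and n = 5 are deliberately not printed, so the exercises stay intact.

```python
def count_rooted_trees(n: int) -> int:
    """Count rooted, fully bifurcating trees on n labeled leaves.

    Assumption of this sketch: a tree with k-1 leaves has 2k-4 edges,
    plus one position above the root, so the k-th leaf can be attached
    in 2k-3 ways.
    """
    if n < 2:
        raise ValueError("need at least two species")
    count = 1
    for k in range(3, n + 1):
        count *= 2 * k - 3
    return count


if __name__ == "__main__":
    # Skip n = 4 and n = 5 to avoid spoiling the exercises.
    for n in (2, 3, 6, 7, 8, 9, 10, 20):
        print(f"{n:>2} | {count_rooted_trees(n):,}")
```

Because Python integers have arbitrary precision, the same loop works unchanged for much larger n.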
It is also worth noting that even if we fix a set of distances for a chronogram, there may still be several distinct evolutionary relationships that are consistent with them. In particular, Figure 7.2 of the St. Clair/Visick book shows multiple different trees that are consistent with the following distances:
| | A | B | C | D |
|---|---|---|---|---|
| A | | 4 | 2 | 4 |
| B | 4 | | 4 | 2 |
| C | 2 | 4 | | 4 |
| D | 4 | 2 | 4 | |
Character-based algorithms
These directly use the aligned sequences and typically rely on
probabilistic models that evaluate the likelihood of each possible
tree. This can be very effective for finding the best model for a
relatively small group of related species, but becomes intractable as
the number of possible trees grows.
Distance-based algorithms
Multiple sequence alignment is used to produce a set of pairwise
distances between samples of the group, and the subsequent tree
building is then based only on the distances (not on the further
information in the underlying sequences). These approaches can be
tractable on much larger groups, but they are typically heuristic and
do not guarantee that they produce the most likely result.
The most common distance-based tree-building algorithms are based on greedy clustering. That is, starting with each individual species as its own cluster, and given computed distances, a tree is built up hierarchically as follows.
while there exist two or more clusters:

* Let A and B be the two "closest" clusters
* Merge A and B in the tree, with new edges having distances that match the computed distance for those clusters
* Recompute distances from all other clusters to the new AB cluster (the so-called "linkage method")
The big question is how to define the distance from one cluster of species to another. Three common definitions are as follows.
Single linkage:
The distance between the clusters is defined to be the
distance between the two nearest elements of the
respective clusters.
Complete linkage:
The distance between the clusters is defined to be the
distance between the two furthest elements of the
respective clusters.
Centroid linkage:
The distance between the clusters is based upon the
distance of the "centers" of the two clusters, which is
typically defined as some form of averaging of the pairwise
distances of elements of the clusters.
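As a worked example, consider the following pairwise distances among six species, A through F.

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A | | | | | | |
| B | 1 | | | | | |
| C | 3 | 2 | | | | |
| D | 7 | 6 | 4 | | | |
| E | 17 | 16 | 14 | 10 | | |
| F | 19 | 18 | 16 | 12 | 2 | |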
We start by connecting A and B to form cluster AB, since they have the smallest distance, 1. Assuming a constant mutation rate, we place their common ancestor 0.5 units in the past.
     0.5
      /------ A
      |
    --+
      |
      \------ B

The question then becomes what is the distance from AB to other elements. For example, C has distance 3 from A and 2 from B. If using single linkage, we say that C has distance 2 from AB (because it is only 2 away from B). If using complete linkage, we'd say C has distance 3 from AB (because it is 3 away from A). For the centroid linkage, we'd say that C has distance 2.5 from AB, as that is the average of the C-A and C-B distances.
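The three choices can be compared side by side in a few lines of Python. This is only a sketch. The names `DIST`, `d`, `single_linkage`, `complete_linkage`, and `average_linkage` are made up for illustration, and the averaging variant is implemented as the plain mean of all pairwise distances, which is one reading of the "some form of averaging" described for centroid linkage above.

```python
from itertools import product
from statistics import mean

# Only the two matrix entries needed for this illustration.
DIST = {("A", "C"): 3, ("B", "C"): 2}


def d(x, y):
    """Symmetric lookup of a species-to-species distance."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def single_linkage(c1, c2):
    """Distance between the two nearest members of the clusters."""
    return min(d(x, y) for x, y in product(c1, c2))


def complete_linkage(c1, c2):
    """Distance between the two furthest members of the clusters."""
    return max(d(x, y) for x, y in product(c1, c2))


def average_linkage(c1, c2):
    """Mean of all pairwise distances (assumed averaging rule)."""
    return mean(d(x, y) for x, y in product(c1, c2))


ab, c = ("A", "B"), ("C",)
print(single_linkage(ab, c))    # 2
print(complete_linkage(ab, c))  # 3
print(average_linkage(ab, c))   # 2.5
```

Representing clusters as tuples of species names keeps each linkage rule a one-liner over the original pairwise distances.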
Using single linkage as the approach, we'd get the following remaining matrix.
| | AB | C | D | E | F |
|---|---|---|---|---|---|
| AB | | | | | |
| C | 2 | | | | |
| D | 6 | 4 | | | |
| E | 16 | 14 | 10 | | |
| F | 18 | 16 | 12 | 2 | |
Next, there is a tie for the shortest distance, 2. We could therefore choose either to merge E with F or to merge AB with C. (In this case, it doesn't matter how we proceed, as we'll do the other merge as the next step, but in some examples we might get different results depending on how we break ties.) Let's assume we merge E and F into EF, resulting in the partial tree
     1.0   0.5   0.0
            /------ A
            |
    --------+
            |
            \------ B

      /------------ E
      |
    --+
      |
      \------------ F

and resulting distances
| | AB | C | D | EF |
|---|---|---|---|---|
| AB | | | | |
| C | 2 | | | |
| D | 6 | 4 | | |
| EF | 16 | 14 | 10 | |
Next we merge AB and C to get
     1.0   0.5   0.0
            /------ A
            |
      /-----+
      |     |
    --+     \------ B
      |
      \------------ C

      /------------ E
      |
    --+
      |
      \------------ F

and resulting distances
| | ABC | D | EF |
|---|---|---|---|
| ABC | | | |
| D | 4 | | |
| EF | 14 | 10 | |
Next we merge ABC and D to get
     2.0         1.0   0.5   0.0
                        /------ A
                        |
                  /-----+
                  |     |
      /-----------+     \------ B
      |           |
    --+           \------------ C
      |
      \------------------------ D

                  /------------ E
                  |
    --------------+
                  |
                  \------------ F

and resulting distances
| | ABCD | EF |
|---|---|---|
| ABCD | | |
| EF | 10 | |
Finally the remaining two clusters are combined with a common ancestor assumed to be 5 units in the past.
     5.0                                 2.0         1.0   0.5   0.0
                                                            /------ A
                                                            |
                                                      /-----+
                                                      |     |
                                          /-----------+     \------ B
                                          |           |
      /-----------------------------------+           \------------ C
      |                                   |
      |                                   \------------------------ D
    --+
      |                                               /------------ E
      |                                               |
      \-----------------------------------------------+
                                                      |
                                                      \------------ F
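The whole walkthrough can be replayed with a short script. This is a minimal sketch of greedy clustering under single linkage, not a reference implementation: the names `PAIRS`, `species_distance`, `single_linkage`, and `greedy_cluster` are invented here, the ancestor's age is taken to be half the merge distance (the constant-rate assumption used above), and ties are broken by whichever pair is found first. With that tie rule the script merges AB with C before E with F, the opposite order from the walkthrough, but it arrives at the same tree.

```python
from itertools import combinations

# The six-species example matrix, stored as unordered pairs.
PAIRS = {
    ("A", "B"): 1,
    ("A", "C"): 3, ("B", "C"): 2,
    ("A", "D"): 7, ("B", "D"): 6, ("C", "D"): 4,
    ("A", "E"): 17, ("B", "E"): 16, ("C", "E"): 14, ("D", "E"): 10,
    ("A", "F"): 19, ("B", "F"): 18, ("C", "F"): 16, ("D", "F"): 12,
    ("E", "F"): 2,
}


def species_distance(x, y):
    """Symmetric lookup of a species-to-species distance."""
    return PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)]


def single_linkage(c1, c2):
    """Distance between the two nearest members of the clusters."""
    return min(species_distance(x, y) for x in c1 for y in c2)


def greedy_cluster(species, linkage=single_linkage):
    """Repeatedly merge the two closest clusters.

    Returns a list of (left, right, ancestor_age) tuples in merge order,
    where ancestor_age is half the linkage distance at the merge.
    """
    clusters = [(s,) for s in species]
    merges = []
    while len(clusters) > 1:
        # min() keeps the first minimal pair, which is the tie-breaking rule.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        age = linkage(a, b) / 2
        # Put the merged cluster where `a` was, so cluster order stays stable.
        clusters[clusters.index(a)] = a + b
        clusters.remove(b)
        merges.append(("".join(a), "".join(b), age))
    return merges


if __name__ == "__main__":
    for left, right, age in greedy_cluster("ABCDEF"):
        print(f"merge {left} + {right}, common ancestor {age} units in the past")
```

Passing a different function as `linkage` (for example one based on `max` instead of `min`) switches the linkage method without touching the loop.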