Distance Measurement in Molecular Phylogenetics

Reading: Chapter 6

Overview: Our goal is to use genetic analyses to predict evolutionary relationships between a collection of species and produce phylogenetic trees that portray those likely relationships. However, the starting point for any such analysis is typically to make accurate predictions as to the pairwise distance between any two species.

Distance Measurements

Goal: to determine a metric that not only makes clear which species are more closely related, but that can also be used to estimate the evolutionary time since two species diverged.

Consideration: How to measure distance will depend greatly on whether you are trying to do phylogenetic analysis of a group of closely related species (e.g., different types of whales), or a widely divergent group of species (e.g., the full tree of life).

Candidate measures:

Global Pairwise Alignment score (as calculated with Needleman-Wunsch algorithm)
The concern is that a full-genome comparison produces scores that are influenced more by the (large) portions of the genomes that don't align well than by subtle differences in the portions of the genomes that are well conserved.
Local Pairwise Alignment score (as calculated with Smith-Waterman algorithm)
This is likely better than global alignment, as it will focus on the most closely conserved regions. Note, however, that when this is performed across all pairs in a collection, the local alignment scores for some pairs will be based on different parts of the genome than for other pairs, so those scores might not be readily comparable to each other in an absolute sense.
Full alignment for a known conserved gene.
By focusing on a single gene that is common to all species in a study, scores from one pair should be more readily comparable to scores from another pair. More importantly, we are likely to see high levels of alignment (especially in the functional coding regions), but still some variation due to substitutions (especially within introns). Therefore a single gene can better serve as a so-called "molecular clock" allowing us to better infer evolutionary time based upon the changes that occur in the conserved gene across species.

Therefore, the most common phylogenetic analyses are based on a single conserved gene. Of course, there is still a choice of which gene, and that choice is affected by how divergent the group of species being studied is. For example, the book notes that
- The gene encoding 16S rRNA is found in every living creature, so it can be used for a tree of life.
- Genes for hemoglobin subunits are well-conserved among vertebrate animals.
- Genes for casein (the protein in milk) are conserved in all mammals.

Interpreting a conserved gene as a molecular clock

So let's assume that we have a specific gene that is conserved across all species in a study, and that we can do a full pairwise sequence alignment of that gene for each pair of species. There is still a question of how best to use that as a "molecular clock" to accurately predict how much time is likely to have passed since two species diverged from each other. A simple approach is to count the number of substitutions in the reference gene between the two species, and then to presume that those substitutions happen at a consistent rate over time. That would allow us to estimate time as directly proportional to the number of substitutions observed in the present-day sequences.
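The simple counting approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sequences are already aligned to the same length, gap positions (written here as "-") are skipped, and the function name observed_difference_fraction is a hypothetical helper, not something from the book.

```python
def observed_difference_fraction(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue  # assumption of this sketch: ignore gapped columns
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return mismatches / compared


if __name__ == "__main__":
    # One mismatch out of eight aligned sites.
    print(observed_difference_fraction("ACGTACGT", "ACGTACGA"))
```

Under the naive clock assumption, the estimated time since divergence would simply be proportional to this fraction.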

However, that is an oversimplified model, for the following reasons:

At a single nucleotide, a change observed between the present-day sequences might not represent only a single substitution. It may be that over time a nucleotide switched from A ⟶ C ⟶ T ⟶ G, and thus what we observe as A ⟶ G was actually a series of three single-nucleotide substitutions over time.
For similar reasons, even a nucleotide that matches in the two sequences could still have had hidden substitutions in its history, such as A ⟶ C ⟶ T ⟶ A.
Even if substitutions did occur in nature at a consistent rate and at random locations in a sequence, not all such substitutions would survive and be preserved. Many could be disadvantageous to the organism, especially if they fall in a functional region. Substitutions in noncoding regions are far more likely to be carried on, as are "silent mutations" that change the nucleotide but not the amino acid that is produced.
Due to biochemistry, there is also a difference in the rate of mutations between
- Transition mutations, which are G ⟷ A or T ⟷ C substitutions (purine to purine, or pyrimidine to pyrimidine), and which tend to occur more frequently.
- Transversion mutations, which are all others (a purine exchanged with a pyrimidine), and which sometimes get corrected by DNA repair enzymes before replication.
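To make the transition/transversion distinction concrete, here is a minimal Python sketch that classifies each aligned site and reports the observed fractions of each kind. The names classify_site and transition_transversion_fractions are hypothetical, and the sketch assumes pre-aligned DNA sequences, skipping any column that contains a character other than A, C, G, or T.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_site(x: str, y: str) -> str:
    """Return 'match', 'transition', or 'transversion' for two bases."""
    x, y = x.upper(), y.upper()
    if x == y:
        return "match"
    pair = {x, y}
    # A transition stays within the purines or within the pyrimidines.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def transition_transversion_fractions(seq_a: str, seq_b: str) -> tuple:
    """Observed fractions of sites showing transitions and transversions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = PURINES | PYRIMIDINES
    compared = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in valid or y not in valid:
            continue  # assumption: skip gaps and ambiguous characters
        compared += 1
        kind = classify_site(x, y)
        if kind == "transition":
            transitions += 1
        elif kind == "transversion":
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared
```

For example, comparing "AACC" with "GCTA" gives two transitions (A/G and C/T) and two transversions (A/C and C/A) over four sites.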

Our goal is to determine:

$K$: The actual number of substitutions that have occurred (expressed per site in the models below).
$T$: The actual amount of time that has passed since divergence.

From those, we can calculate the substitution rate, $r = \frac{K}{2T}$, noting the constant 2 in the denominator because both derived species have presumably been mutating for $T$ units of time relative to the presumed common ancestor. We consider three increasingly complex models for estimating $K$ from the observed sequences.
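As a small illustration of the rate formula, the following Python sketch computes $r$ from $K$ and $T$, and inverts the relationship to estimate $T$ when a rate is assumed. The function names and the numbers in the example are made up for illustration only; they are not taken from the book or from real data.

```python
def substitution_rate(k: float, t: float) -> float:
    """r = K / (2T): both lineages accumulate changes for T units of time."""
    if t <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * t)


def divergence_time(k: float, r: float) -> float:
    """Invert r = K / (2T) to estimate T from K and an assumed rate r."""
    if r <= 0:
        raise ValueError("substitution rate must be positive")
    return k / (2.0 * r)


if __name__ == "__main__":
    # Hypothetical values: K = 0.1 substitutions per site,
    # T = 50 million years since divergence.
    r = substitution_rate(0.1, 50e6)
    print(r)                        # rate per site per year
    print(divergence_time(0.1, r))  # recovers T
```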

Jukes-Cantor Model
This 1969 model attempts to account for "hidden" substitutions relative to the number of observed substitutions. For sequences $a$ and $b$, it estimates

$K_{ab} = -\frac{3}{4} \ln \left(1 - \frac{4}{3}D_{ab}\right)$

where $D_{ab}$ is the observed fraction of aligned sites at which the two sequences differ.
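The Jukes-Cantor correction above translates directly into code. The following Python sketch assumes the observed fraction $D_{ab}$ has already been computed; the function name jukes_cantor_distance is hypothetical. Note that the logarithm is only defined for $D_{ab} < 3/4$, so the sketch rejects values at or beyond that point.

```python
import math


def jukes_cantor_distance(d: float) -> float:
    """K = -(3/4) * ln(1 - (4/3) * D) for an observed difference fraction D."""
    if not 0.0 <= d < 0.75:
        raise ValueError("D must satisfy 0 <= D < 0.75 for the formula to apply")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d)


if __name__ == "__main__":
    # The estimate always meets or exceeds the observed fraction,
    # reflecting the hidden substitutions the model accounts for.
    for d in (0.0, 0.1, 0.3, 0.6):
        print(d, jukes_cantor_distance(d))
```

For small $D_{ab}$ the estimate is close to the observed fraction, and it grows increasingly larger than $D_{ab}$ as the sequences diverge.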
Kimura's Two-parameter Substitution Model
This 1980 model considers the different likelihoods of transition substitutions and transversion substitutions.

$K_{ab} = \frac{1}{2}\ln\left(\frac{1}{1-2S-V}\right) + \frac{1}{4}\ln\left(\frac{1}{1-2V}\right)$

where $S$ is the observed fraction of sites with transition substitutions (S=tranSition) and
$V$ is the observed fraction of sites with transversion substitutions (V=transVersion).
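A minimal Python sketch of Kimura's formula follows, assuming $S$ and $V$ have already been measured from an alignment. The function name kimura_2p_distance is hypothetical. Both logarithm arguments must be positive, so the sketch checks that $1 - 2S - V > 0$ and $1 - 2V > 0$.

```python
import math


def kimura_2p_distance(s: float, v: float) -> float:
    """K = (1/2) ln(1 / (1 - 2S - V)) + (1/4) ln(1 / (1 - 2V))."""
    a = 1.0 - 2.0 * s - v
    b = 1.0 - 2.0 * v
    if s < 0 or v < 0 or a <= 0 or b <= 0:
        raise ValueError("S and V are outside the range where the formula applies")
    return 0.5 * math.log(1.0 / a) + 0.25 * math.log(1.0 / b)


if __name__ == "__main__":
    print(kimura_2p_distance(0.2, 0.1))
    # When transitions make up one third of the differences (S = V / 2),
    # the formula reduces algebraically to the Jukes-Cantor estimate
    # with D = S + V.
    print(kimura_2p_distance(0.1, 0.2))
```

The second call is a useful sanity check: with $S = 0.1$ and $V = 0.2$, both logarithm arguments equal $1 - \frac{4}{3}(0.3)$, so the result matches the Jukes-Cantor estimate for $D_{ab} = 0.3$.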
Tamura's Three-Parameter Model
This 1992 model considers that substitution rates are affected by the underlying GC content of a sequence. The suggested formula is

$K_{ab} = -C \ln \left(1 - \frac{S}{C} - V\right) - \frac{1}{2}\left(1-C\right)\ln\left(1-2V\right)$

with the same definitions of $S$ and $V$ as in Kimura's model, and with constant
$C = GC_{s1} + GC_{s2} - 2 \cdot GC_{s1} \cdot GC_{s2}$
with $GC_{s1}$ being the fraction of GC content in the sequence containing gene $a$ and $GC_{s2}$ for the sequence with gene $b$.
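Tamura's formula can be sketched in Python in the same way. The helper names gc_fraction and tamura_3p_distance are hypothetical, and the sketch assumes $S$, $V$, and the two GC fractions are supplied by the caller. It guards against $C = 0$ (which would divide by zero) and against non-positive logarithm arguments.

```python
import math


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A, C, G, T characters of a sequence."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T characters")
    return (seq.count("G") + seq.count("C")) / total


def tamura_3p_distance(s: float, v: float, gc_a: float, gc_b: float) -> float:
    """K = -C ln(1 - S/C - V) - (1/2)(1 - C) ln(1 - 2V),
    with C = GC_a + GC_b - 2 * GC_a * GC_b."""
    c = gc_a + gc_b - 2.0 * gc_a * gc_b
    if c <= 0:
        raise ValueError("C must be positive")
    a = 1.0 - s / c - v
    b = 1.0 - 2.0 * v
    if a <= 0 or b <= 0:
        raise ValueError("S and V are outside the range where the formula applies")
    return -c * math.log(a) - 0.5 * (1.0 - c) * math.log(b)


if __name__ == "__main__":
    # With 50% GC content in both sequences, C = 1/2 and the formula
    # reduces to Kimura's two-parameter estimate.
    print(tamura_3p_distance(0.2, 0.1, 0.5, 0.5))
    print(tamura_3p_distance(0.2, 0.1, 0.6, 0.4))
```

When both sequences have a GC fraction of one half, $C = \frac{1}{2}$ and the expression collapses to Kimura's formula, which provides a convenient consistency check between the two models.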

Michael Goldwasser

Last modified: Wednesday, 20 March 2019