Overview: Our goal is to use genetic analyses to predict evolutionary relationships among a collection of species and to produce phylogenetic trees that portray those likely relationships. The starting point for any such analysis is typically an accurate estimate of the pairwise distance between any two species.
Goal: to determine a metric that not only makes clear which species are more closely related, but that can also estimate the evolutionary time since two species diverged.
Consideration: How to measure distance will depend greatly on whether you are trying to do phylogenetic analysis of a group of closely related species (e.g., different types of whales), or a widely divergent group of species (e.g., the full tree of life).
Candidate measures:
Global Pairwise Alignment score (as calculated with the
Needleman-Wunsch algorithm)
The concern is that full-genome comparison results in scores that
are more affected by the (large) portions of genomes that don't
align well than by subtle differences in portions of
genomes that are well conserved.
Local Pairwise Alignment score (as calculated with the
Smith-Waterman algorithm)
While likely better than the global alignment, as it will focus on
the most closely conserved regions, note that when performing
this across pairs in a collection, the local alignment scores
for some pairs will be based on different parts of the genome
than for other pairs, so those scores might not be
readily comparable to each other in an absolute sense.
Full alignment for a known conserved gene.
By focusing on a single gene that is common to all species in a
study, scores from one pair
should be more readily comparable to scores from another pair.
More importantly, we are likely to see high levels of alignment
(especially in the functional coding regions), but still some
variation due to substitutions (especially within
introns). Therefore a single gene can better serve as a
so-called "molecular clock" allowing us to better infer
evolutionary time based upon the changes that occur in the
conserved gene across species.
Therefore, the most common phylogenetic analyses are based on a single conserved gene. Of course, there is still a choice of which gene, and that choice is affected by how divergent the group of species being studied is. For example, the book notes that
So let's assume that we have a specific gene that is conserved across all species in a study, and that we can do a full pairwise sequence alignment of that gene for each pair of species. There is still a question of how best to use it as a "molecular clock" to accurately predict how much time is likely to have passed since two species diverged from each other. A simple approach is to count the substitutions in the reference gene between the two species, and then to presume that those substitutions happen at a consistent rate over time. That might allow us to estimate time directly as proportional to the number of observed substitutions in the present-day sequences.
However, that is an oversimplified model for the following reasons:
At a single nucleotide, a change observed in the present-day
sequences might not represent only a single substitution. It may be
that over time a nucleotide switched from
For similar reasons, even a nucleotide that matches in the
sequences could still have had hidden substitutions in its history,
such as
Even if substitutions did occur in nature at a consistent rate and at random locations in a sequence, not all such substitutions would survive and be preserved. Many could be disadvantageous to the organism, especially if in a functional region. Substitutions in noncoding regions are far more likely to be carried on, as are "silent mutations" that change the nucleotide yet not the amino acid that is produced.
Due to biochemistry, there is also a difference in rate of mutations between
Transition mutations which are
Transversion mutations which are all others, and which sometimes get corrected by DNA repair enzymes before replication.
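The two classes can be told apart mechanically. The sketch below uses the standard definition: a transition swaps one purine for another (A and G) or one pyrimidine for another (C and T), while a transversion swaps a purine for a pyrimidine or vice versa. The function name `transition_transversion_fractions` and the choice to skip gapped or ambiguous columns are assumptions made for illustration, and the inputs are assumed to be two already-aligned sequences of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def transition_transversion_fractions(seq_a, seq_b):
    """Return (S, V): fractions of compared sites showing a transition / a transversion."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    compared = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: columns with gaps or ambiguity codes are ignored.
        if x not in NUCLEOTIDES or y not in NUCLEOTIDES:
            continue
        compared += 1
        if x == y:
            continue
        # Same chemical class (both purines or both pyrimidines) means a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared
```

The two returned fractions correspond to the observed transition and transversion fractions that a substitution model can take as input.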
Our goal is to determine:
Jukes-Cantor Model
This 1969 model attempts to account for "hidden" substitutions relative to the number of
observed substitutions. For sequences $a$ and $b$, it estimates
$K_{ab} = -\frac{3}{4} \ln \left(1 - \frac{4}{3}D_{ab}\right)$
where $D_{ab}$ is the observed fraction of substitutions.
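As a minimal sketch of the Jukes-Cantor correction, the code below computes $D_{ab}$ from two aligned sequences and then applies the formula above. The function names `observed_fraction` and `jukes_cantor` are illustrative, and the sketch assumes gap-free sequences of equal length. The logarithm is undefined once $D_{ab}$ reaches $3/4$, so the sketch rejects such inputs rather than returning a value.

```python
import math


def observed_fraction(seq_a, seq_b):
    """D_ab: fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)


def jukes_cantor(d_ab):
    """K_ab = -(3/4) ln(1 - (4/3) D_ab)."""
    if not 0.0 <= d_ab < 0.75:
        raise ValueError("the estimate is undefined for D_ab >= 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d_ab)


d = observed_fraction("ACGTACGTAC", "ACGTACGTAT")  # one mismatch in ten sites
k = jukes_cantor(d)
```

Here $D_{ab} = 0.1$ and the corrected estimate is roughly $0.107$, slightly larger than the observed fraction because it allows for substitutions that are no longer visible.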
Kimura's Two-parameter Substitution Model
This 1980 model considers the different likelihoods of transition substitutions and
transversion substitutions.
$K_{ab} = \frac{1}{2}\ln\left(\frac{1}{1-2S-V}\right) + \frac{1}{4}\ln\left(\frac{1}{1-2V}\right)$
where $S$ is the observed fraction of transition
substitutions (S=tranSition)
$V$ is the observed fraction of transversion substitutions
(V=transVersion)
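The Kimura formula can be sketched directly from the definitions above. The function name `kimura_two_parameter` is illustrative, and the inputs are assumed to be the observed fractions $S$ and $V$ already computed from an alignment. Both logarithm arguments must be positive, so the sketch rejects inputs outside that range.

```python
import math


def kimura_two_parameter(s, v):
    """K_ab = (1/2) ln(1 / (1 - 2S - V)) + (1/4) ln(1 / (1 - 2V))."""
    transition_term = 1.0 - 2.0 * s - v
    transversion_term = 1.0 - 2.0 * v
    if s < 0 or v < 0 or transition_term <= 0 or transversion_term <= 0:
        raise ValueError("S and V are outside the range where the formula is defined")
    return (0.5 * math.log(1.0 / transition_term)
            + 0.25 * math.log(1.0 / transversion_term))


k = kimura_two_parameter(0.10, 0.05)
```

With $S = 0.10$ and $V = 0.05$ the estimate is roughly $0.170$. As a sanity check, when one third of the observed substitutions are transitions ($S = D/3$, $V = 2D/3$), both logarithm arguments become $1 - \frac{4}{3}D$ and the formula reduces to the Jukes-Cantor estimate.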
Tamura's Three-Parameter Model
This 1992 model considers that substitution rates are affected by the
underlying GC content of a sequence. It suggests the formula
$K_{ab} = -C \ln \left(1 - \frac{S}{C} - V\right) - \frac{1}{2}\left(1-C\right)\ln\left(1-2V\right)$
with the same $S$ and $V$ definitions as in Kimura's model, and with
the constant
$C = GC_{s1} + GC_{s2} - 2 \cdot GC_{s1} \cdot GC_{s2}$
with $GC_{s1}$ being the fraction of GC content in the sequence
containing gene $a$ and $GC_{s2}$ for the sequence with gene $b$.
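A minimal sketch of the Tamura formula follows, under the same assumptions as before: $S$ and $V$ are observed fractions taken from an alignment, and the GC fractions are computed per sequence. The names `gc_fraction` and `tamura_three_parameter` are illustrative, not taken from any particular library.

```python
import math


def gc_fraction(seq):
    """Fraction of G and C nucleotides in a sequence."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def tamura_three_parameter(s, v, gc_a, gc_b):
    """K_ab = -C ln(1 - S/C - V) - (1/2)(1 - C) ln(1 - 2V)."""
    c = gc_a + gc_b - 2.0 * gc_a * gc_b
    if c <= 0:
        raise ValueError("C must be positive for the formula to be defined")
    first_arg = 1.0 - s / c - v
    second_arg = 1.0 - 2.0 * v
    if s < 0 or v < 0 or first_arg <= 0 or second_arg <= 0:
        raise ValueError("S and V are outside the range where the formula is defined")
    return -c * math.log(first_arg) - 0.5 * (1.0 - c) * math.log(second_arg)


k = tamura_three_parameter(0.10, 0.05, gc_fraction("GGCCAATT"), 0.5)
```

When both sequences have a GC fraction of $0.5$, the constant is $C = 0.5$ and the expression reduces to Kimura's two-parameter formula, so this example gives the same value of roughly $0.170$.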