Today we introduce the principles of the Needleman-Wunsch algorithm for computing the optimal global pairwise alignment between two sequences.
We consider the global pairwise alignment of two given sequences, which
we denote $X$ and $Y$, perhaps having differing lengths $m$ and $n$
respectively. These could be nucelotide sequences, amino acid
sequences, or any other such omics data. For convenience, we will use
a 1-indexed notation (sorry Python) in which individual
characters are denoted as
$X = x_1 x_2 x_3 \ldots x_m$
$Y = y_1 y_2 y_3 \ldots y_n$
The goal is to produce an alignment of the two sequences, by inserting zero or more gaps in the sequences to bring other elements into alignment. For example, given sequences ACTTGC and CGATGAACT we might choose the following alignment:
AC_TTG__C_ _CGATGAACT
In most general terms, we assume a score function that is composed of the following two components:
For example, a simple $score$ function might give +1 for a perfect match and -1 for any mismatch. A more nuanced $score$ function might depend upon how similar mismatched elements are to each other.
The above gap penalty can be further refined with care, for example having a different penalty for those gaps that are grouped at the beginning or end of the alignment, or by having a lesser penalty for consecutive gaps.
The goal is to produce an alignment that maximizes the acheived score.
The key insight for this algorithm is a divide-and-conquer approach based on using optimal alignment of substrings of the full input in order to compute the optimal alignment of larger substrings. In particular, we will let expression $OPT[j][k]$ denote the optimal score that can be achieved if aligning the first $j$ characters of the first sequence with the first $k$ charaters of the second sequence.
We can compute the $OPT$ values based on a recursive formula.
We can use a table to store all of the $OPT$ values, parameterized by $j$ and $k$, and computed from smaller to larger parameters.
We can subsequently use the completed table to reconstruct an alignment achieving the optimal score.
The key to the recursion is that we can consider the rightmost portion of an optimal alignment. While we don't yet know what that will be, it will take one of three forms:
$OPT[j][k] = \left\{ \begin{array}{ll} k \cdot gap & \mbox{if $j=0$}\\ j \cdot gap & \mbox{if $k=0$}\\ \max \left\{ \begin{array}{l} score(x_j,y_k) + OPT[j-1][k-1]\\ gap + OPT[j-1][k]\\ gap + OPT[j][k-1]\\ \end{array} \right. & \mbox{otherwise}\\ \end{array} \right.$
The first two cases are known as base cases of the recursive definition. In essence, the first case states that if you are trying to align 0 characters of $X$ with $k$ characters of $Y$ there is no choice but to insert $k$ gaps into $X$. Conversely, if there are $j$ characters of $X$ to align with $0$ characters of $Y$ then $j$ gaps must be inserted into $Y$. When both substrings are nonempty, we revert to the general case in which there are three possible ways to end the alignment and we take the option that produces the maximum of the three resulting scores.
The Needleman-Wunsch algorithm is implemented by computing the full table of $OPT[j][k]$ values for all $0 \leq j \leq m$ and $0 \leq k \leq n$, computing this table typically row-by-row (or column-by-column) and thus from smaller $j$ and $k$ values to larger $j$ and $k$ values until finally computing $OPT[m][n]$ that represents the optimal score for the two complete sequences.
The running time of the algorithm is proportional to the product of the lengths of the sequences. For example if both sequences have length 1000, then there will be approximately 1 million intermediate values to compute. Likewise if both sequences have length 10000, there are 100 million intermediate values. So this algorithm is still practical for length in tens of thousands, but it becomes impractical for computing the pairwise alignment of much larger sequences.
With the complete table of $OPT[j][k]$ values computed, the actual alignment can be found by starting at the $OPT[m][n]$ value and reverse engineering which of the three cases it must have come from in terms of the end of the alignment (i.e. with the final characters of each aligned with each other, or with a gap introduced at the end of one or the other of the sequences. This process can then be repeated to reverse engineer how the remaining substrings are to be aligned.
For illustration we consider a pairwise alignment of strings ACTTCG and ATGAAT.
We begin with a very simple score function in which there is +1 for a match, 0 for a mismatch, and with a 0 gap penalty. This definition is a formulation of the Longest Common Subsequence (LCS), as the goal is simply to optimize the number of perfect matches in the alignment.
The completed table appears as follows, leading to the conclusion that the maximum score is 3 when considering the full sequences.
A | T | G | A | A | T | |||
$j$\$k$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
A | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
C | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
T | 3 | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
T | 4 | 0 | 1 | 2 | 2 | 2 | 2 | 3 |
C | 5 | 0 | 1 | 2 | 2 | 2 | 2 | 3 |
G | 6 | 0 | 1 | 2 | 3 | 3 | 3 | 3 |
To reconstruct the alignment, we begin by examining the lower-right cell of the table and we reverse engineer how the 3 representing $OPT[6][6]$ was computed. In this case, there is some ambiguity in that it might have been set to either $OPT[5][6]$ or $OPT[6][5]$. For sake of example, we trace through a reconstruction assuming it was $OPT[5][6]$ (that is, coming from the subcase above it in the table). Therefore, we will be aligning the G in $X$ with a gap in $Y$, thus:
G _Similarly $OPT[5][6] = OPT[4][6]$ and thus we extending the solution as
CG __However, it must be that $OPT[4][6] = 1 + OPT[3][5]$ due to the matching T at $X_4$ and $Y_6$, extending our solution as
TCG T__
Continuing in this way, we can rebuild an alignment by tracing the path from the lower-right to the upper-left, as indicated in blue in this diagram:
A | T | G | A | A | T | |||
$j$\$k$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
A | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
C | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
T | 3 | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
T | 4 | 0 | 1 | 2 | 2 | 2 | 2 | 3 |
C | 5 | 0 | 1 | 2 | 2 | 2 | 2 | 3 |
G | 6 | 0 | 1 | 2 | 3 | 3 | 3 | 3 |
The resulting global alignment is
ACT___TCG A_TGAAT__
Note well that had we initially deduced that the lower-right 3 had come from its left neighbor, this would have lead us to a different alignment achieving the same score.
ACTTCG___ A__T_GAATas evidenced by the indicated path through the table.
A | T | G | A | A | T | |||
$j$\$k$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
A | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
C | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
T | 3 | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
T | 4 | 0 | 1 | 2 | 2 | 2 | 2 | 3 |
C | 5 | 0 | 1 | 2 | 2 | 2 | 2 | 3 |
G | 6 | 0 | 1 | 2 | 3 | 3 | 3 | 3 |
We repeate our analysis of the above pair of sequences, this time with a more complex score function that includes:
ACT-TCG A_TGAATbased on the illustrated path in the following computed table.
A | T | G | A | A | T | |||
$j$\$k$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
0 | 0 | -1 | -2 | -3 | -4 | -5 | -6 | |
A | 1 | -1 | 1 | 0 | -1 | -2 | -3 | -4 |
C | 2 | -2 | 0 | 0 | -1 | -2 | -3 | -4 |
T | 3 | -3 | -1 | 1 | 0 | -1 | -2 | -2 |
T | 4 | -4 | -2 | 0 | 0 | -1 | -2 | -1 |
C | 5 | -5 | -3 | -1 | -1 | -1 | -2 | -2 |
G | 6 | -6 | -4 | -2 | 0 | -1 | -2 | -3 |
See attached handout for a walkthrough of a larger example.