Global Pairwise Alignment

Today we introduce the principles of the Needleman-Wunsch algorithm for computing the optimal global pairwise alignment between two sequences.

Problem definition

We consider the global pairwise alignment of two given sequences, which we denote $X$ and $Y$, perhaps having differing lengths $m$ and $n$ respectively. These could be nucelotide sequences, amino acid sequences, or any other such omics data. For convenience, we will use a 1-indexed notation (sorry Python) in which individual characters are denoted as
$X = x_1 x_2 x_3 \ldots x_m$
$Y = y_1 y_2 y_3 \ldots y_n$

The goal is to produce an alignment of the two sequences, by inserting zero or more gaps in the sequences to bring other elements into alignment. For example, given sequences ACTTGC and CGATGAACT we might choose the following alignment:

AC_TTG__C_
_CGATGAACT

Alignment scoring

In most general terms, we assume a score function that is composed of the following two components:

A general function $score(u,v)$ for a reward/penalty assigned if character $u$ in one sequence is aligned with character $v$ in the other sequence.
An arbitrary $gap$ penalty for each entry of one sequence that is aligned with a gap in the opposite sequence.

For example, a simple $score$ function might give +1 for a perfect match and -1 for any mismatch. A more nuanced $score$ function might depend upon how similar mismatched elements are to each other.

The above gap penalty can be further refined with care, for example having a different penalty for those gaps that are grouped at the beginning or end of the alignment, or by having a lesser penalty for consecutive gaps.

The goal is to produce an alignment that maximizes the acheived score.

Needleman-Wunsch algorithm

The key insight for this algorithm is a divide-and-conquer approach based on using optimal alignment of substrings of the full input in order to compute the optimal alignment of larger substrings. In particular, we will let expression $OPT[j][k]$ denote the optimal score that can be achieved if aligning the first $j$ characters of the first sequence with the first $k$ charaters of the second sequence.

We can compute the $OPT$ values based on a recursive formula.
We can use a table to store all of the $OPT$ values, parameterized by $j$ and $k$, and computed from smaller to larger parameters.
We can subsequently use the completed table to reconstruct an alignment achieving the optimal score.

Recursive formula

The key to the recursion is that we can consider the rightmost portion of an optimal alignment. While we don't yet know what that will be, it will take one of three forms:

It will align the last item of X and last item of Y
It will align the last item of X with a gap in Y
It will align the last item of Y with a gap in X

This intution leads to the following recursive formula:

$OPT[j][k] = \left\{ \begin{array}{ll} k \cdot gap & \mbox{if $j=0$}\\ j \cdot gap & \mbox{if $k=0$}\\ \max \left\{ \begin{array}{l} score(x_j,y_k) + OPT[j-1][k-1]\\ gap + OPT[j-1][k]\\ gap + OPT[j][k-1]\\ \end{array} \right. & \mbox{otherwise}\\ \end{array} \right.$

The first two cases are known as base cases of the recursive definition. In essence, the first case states that if you are trying to align 0 characters of $X$ with $k$ characters of $Y$ there is no choice but to insert $k$ gaps into $X$. Conversely, if there are $j$ characters of $X$ to align with $0$ characters of $Y$ then $j$ gaps must be inserted into $Y$. When both substrings are nonempty, we revert to the general case in which there are three possible ways to end the alignment and we take the option that produces the maximum of the three resulting scores.

Computing the table

The Needleman-Wunsch algorithm is implemented by computing the full table of $OPT[j][k]$ values for all $0 \leq j \leq m$ and $0 \leq k \leq n$, computing this table typically row-by-row (or column-by-column) and thus from smaller $j$ and $k$ values to larger $j$ and $k$ values until finally computing $OPT[m][n]$ that represents the optimal score for the two complete sequences.

The running time of the algorithm is proportional to the product of the lengths of the sequences. For example if both sequences have length 1000, then there will be approximately 1 million intermediate values to compute. Likewise if both sequences have length 10000, there are 100 million intermediate values. So this algorithm is still practical for length in tens of thousands, but it becomes impractical for computing the pairwise alignment of much larger sequences.

Reconstructing an optimal alignment

With the complete table of $OPT[j][k]$ values computed, the actual alignment can be found by starting at the $OPT[m][n]$ value and reverse engineering which of the three cases it must have come from in terms of the end of the alignment (i.e. with the final characters of each aligned with each other, or with a gap introduced at the end of one or the other of the sequences. This process can then be repeated to reverse engineer how the remaining substrings are to be aligned.

A small example

For illustration we consider a pairwise alignment of strings ACTTCG and ATGAAT.

We begin with a very simple score function in which there is +1 for a match, 0 for a mismatch, and with a 0 gap penalty. This definition is a formulation of the Longest Common Subsequence (LCS), as the goal is simply to optimize the number of perfect matches in the alignment.

The completed table appears as follows, leading to the conclusion that the maximum score is 3 when considering the full sequences.

A T G A A T

$j$\$k$ 0 1 2 3 4 5 6

0 0 0 0 0 0 0 0

A 1 0 1 1 1 1 1 1

C 2 0 1 1 1 1 1 1

T 3 0 1 2 2 2 2 2

T 4 0 1 2 2 2 2 3

C 5 0 1 2 2 2 2 3

G 6 0 1 2 3 3 3 3

To reconstruct the alignment, we begin by examining the lower-right cell of the table and we reverse engineer how the 3 representing $OPT[6][6]$ was computed. In this case, there is some ambiguity in that it might have been set to either $OPT[5][6]$ or $OPT[6][5]$. For sake of example, we trace through a reconstruction assuming it was $OPT[5][6]$ (that is, coming from the subcase above it in the table). Therefore, we will be aligning the G in $X$ with a gap in $Y$, thus:

        G
        _

Similarly $OPT[5][6] = OPT[4][6]$ and thus we extending the solution as

       CG
       __

However, it must be that $OPT[4][6] = 1 + OPT[3][5]$ due to the matching T at $X_4$ and $Y_6$, extending our solution as

      TCG
      T__

Continuing in this way, we can rebuild an alignment by tracing the path from the lower-right to the upper-left, as indicated in blue in this diagram:

A T G A A T

$j$\$k$ 0 1 2 3 4 5 6

0 0 0 0 0 0 0 0

A 1 0 1 1 1 1 1 1

C 2 0 1 1 1 1 1 1

T 3 0 1 2 2 2 2 2

T 4 0 1 2 2 2 2 3

C 5 0 1 2 2 2 2 3

G 6 0 1 2 3 3 3 3

The resulting global alignment is

ACT___TCG
A_TGAAT__

Note well that had we initially deduced that the lower-right 3 had come from its left neighbor, this would have lead us to a different alignment achieving the same score.

ACTTCG___
A__T_GAAT

as evidenced by the indicated path through the table.

			`A`	`T`	`G`	`A`	`A`	`T`
	$j$\$k$	0	1	2	3	4	5	6
	0	0	0	0	0	0	0	0
`A`	1	0	1	1	1	1	1	1
`C`	2	0	1	1	1	1	1	1
`T`	3	0	1	2	2	2	2	2
`T`	4	0	1	2	2	2	2	3
`C`	5	0	1	2	2	2	2	3
`G`	6	0	1	2	3	3	3	3

More advanced score function

We repeate our analysis of the above pair of sequences, this time with a more complex score function that includes:

+1 for a perfect match
-1 for a mismatched pair
-1 for a gap penalty

With this metric, we can produce an optimal alignment of

ACT-TCG
A_TGAAT

based on the illustrated path in the following computed table.

A T G A A T

$j$\$k$ 0 1 2 3 4 5 6

0 0 -1 -2 -3 -4 -5 -6

A 1 -1 1 0 -1 -2 -3 -4

C 2 -2 0 0 -1 -2 -3 -4

T 3 -3 -1 1 0 -1 -2 -2

T 4 -4 -2 0 0 -1 -2 -1

C 5 -5 -3 -1 -1 -1 -2 -2

G 6 -6 -4 -2 0 -1 -2 -3

Larger Example

See attached handout for a walkthrough of a larger example.

Michael Goldwasser

Last modified: Monday, 11 February 2019