A critical aspect of computing is the ability to repeat a sequence of instructions using a control structure known as a loop. In Python, there are several forms of loops and we will begin with a form known as a for loop.
At its core, a for loop is used to repeat a block of code once for each element of a sequence. For example, this could be to loop over each character of a string, each element of a list, or each line of a file. The basic syntax of a for loop appears as follows.

    for variableName in sequence:
        one or more commands that are to be repeated

As a biological example, we consider computing the GC-content of a dna string, which we will view as a floating-point number between 0.0 and 1.0 that is the ratio of the number of bases that are either G or C to the total number of bases. We can perform such a computation with the following code.

    match = 0
    for base in dna:
        if base == 'G' or base == 'C':
            match = match + 1
    gcContent = match/len(dna)

The following demonstrates the execution of this code on a small example.
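As a sketch of such a run, the version below prints the running count after each base is examined. The sample string "ACGGT" is a hypothetical value chosen only for illustration, and the code assumes Python 3, where the / operator on two integers yields a floating-point result.

```python
# Hypothetical sample string, chosen only for illustration.
dna = "ACGGT"

match = 0
for base in dna:
    if base == 'G' or base == 'C':
        match = match + 1
    # Show the current base and the running count of G/C bases.
    print(base, match)

# Assumes Python 3, where / performs floating-point division.
gcContent = match / len(dna)
print(gcContent)
```

For this sample, three of the five bases are G or C, so the final ratio is 3/5.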
As an aside, we will note that the count method of the string class would allow us to compute the same value as

    gcContent = (dna.count('G') + dna.count('C')) / len(dna)

although that implementation would in fact execute two implicit loops, one for each call to count.
While the direct application of a for loop over elements of a sequence is the preferred syntax due to its simplicity, there are some situations in which it does not suffice. In some situations, it is important during each repetition that you know not only the element of the sequence but also the index at which it occurs within the sequence. Knowing the index allows you to more easily examine elements near the current one, or to examine the corresponding element at the same position of a different sequence.
In such situations, the technical approach is to use an index-based version of a for loop in which we formally iterate over a range of integer indices (rather than iterating directly over the original sequence).
As a motivating example, consider the goal of counting the number of mismatched base pairs between a reference sequence and an individual's allele. Assuming variables reference and allele, we could compute this as follows.

    errors = 0
    for k in range(len(reference)):
        if allele[k] != reference[k]:
            errors = errors + 1

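A run of this index-based loop on made-up data might look as follows. The two strings are hypothetical and are assumed to have equal length; the extra list named positions is an addition for illustration that records where each mismatch occurs.

```python
# Hypothetical sequences of equal length, for illustration only.
reference = "GATTACA"
allele = "GACTATA"

errors = 0
positions = []  # illustrative addition: indices at which the strings differ
for k in range(len(reference)):
    if allele[k] != reference[k]:
        errors = errors + 1
        positions.append(k)

print(errors)
print(positions)
```

Because the loop variable k is an index rather than a character, the same k can be used to look into both strings at once, which a direct loop over one string would not allow.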
The key to this approach is the use of another built-in function named range. The range function produces a sequence of integers. There are three forms of range.
The first version uses a single parameter. The syntax range(k) produces the sequence of numbers 0, 1, 2, ..., k-1.
We will use range(k) often because the integers from 0 to k-1 are precisely the indices of a list of k items.
The second version uses two parameters, which give a starting value and stopping value for the range. Specifically, a syntax such as range(j, k) produces the sequence of integers j, j+1, ..., k-1. Note that the stopping value k itself is not included.
The third version uses three parameters, with the third being the step size for the sequence. For example, we could get some even numbers with range(0, 10, 2), which produces 0, 2, 4, 6, 8.
A negative step size can be used to get a decreasing sequence, such as range(10, 5, -1), which produces the sequence 10, 9, 8, 7, 6.
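The three forms can be checked interactively. The sketch below assumes Python 3, where a range must be passed to list to display its contents; the variable names are hypothetical.

```python
# Wrapping each range in list() makes its contents visible (Python 3).
oneParam = list(range(5))            # start defaults to 0
twoParams = list(range(2, 6))        # start at 2, stop before 6
evens = list(range(0, 10, 2))        # step of 2
decreasing = list(range(10, 5, -1))  # negative step counts downward

print(oneParam)
print(twoParams)
print(evens)
print(decreasing)
```

In every form the stopping value is excluded from the result.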
You should notice a great similarity between the use of parameters for a range and the use of parameters when describing slices of a string, although the syntax is different (with commas separating range parameters, and colons separating those arguments for a slice).
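To make that similarity concrete, the following sketch builds the same substring twice: once with a slice, and once by looping over a range that uses the same three values. The sample string and variable names are hypothetical.

```python
dna = "ACGTACGTAC"  # hypothetical sample of length 10

# Slice notation: colons separate start, stop, and step.
sliced = dna[0:10:2]

# Range notation: commas separate the same three values.
indexed = ''
for k in range(0, 10, 2):
    indexed = indexed + dna[k]

print(sliced, indexed)
```

Both expressions visit indices 0, 2, 4, 6, and 8, so the two strings are equal.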
Returning to the use of the range function for an index-based for loop, consider a dna string with length 5. Notice that len(dna) is 5 and thus range(len(dna)) produces the sequence 0, 1, 2, 3, 4.
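As a small illustration, the loop below visits every index of a length-5 string and looks up the base stored there. The sample string and the list named pairs are hypothetical, used only to show what the loop sees on each pass.

```python
dna = "ACGGT"  # hypothetical sample of length 5

pairs = []
for k in range(len(dna)):
    # k is the index; dna[k] is the base stored at that index.
    pairs.append((k, dna[k]))
    print(k, dna[k])
```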
However, we could alter the range for different purposes. For example,
we might loop over
As another example, we consider counting the number of times a dna base is immediately followed by the same base. We can implement that count as follows:

    count = 0
    for k in range(len(dna)-1):
        if dna[k] == dna[k+1]:
            count += 1

Note that in this example, we chose to loop over range(len(dna)-1) rather than range(len(dna)). As a sanity check, assume that dna had length 5. Then we only need a loop that executes 4 times to compare the four pairs of neighbors. Our loop would be executing only over the sequence 0, 1, 2, 3.
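The sanity check can be run directly. In the sketch below, the sample string and the list named visited are hypothetical additions that record which indices the loop uses; the final lines show, as an assumption-free fact about Python strings, that asking for index len(dna) raises an IndexError, which is why the loop must stop one position early.

```python
dna = "AACGG"  # hypothetical sample of length 5

count = 0
visited = []  # illustrative addition: indices used by the loop
for k in range(len(dna) - 1):
    visited.append(k)
    if dna[k] == dna[k+1]:
        count += 1

print(visited)
print(count)

# Looping over range(len(dna)) would eventually evaluate dna[k+1] with
# k+1 equal to len(dna), which is not a valid index.
try:
    dna[len(dna)]
    outOfRange = False
except IndexError:
    outOfRange = True
print(outOfRange)
```

Here the neighboring pairs AA and GG match, so the count is 2 after four comparisons.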