It is typically much easier to "read" a new language than to "write" in a new language, so for that reason, we will introduce a variety of Python control structures in this lecture, as motivated by some biological examples. I don't expect you to master all these techniques quite yet; later we'll come back to visit these techniques with the mindset of an author.
Chapter 1 of the text is where they introduce Python techniques for functions, for loops, and if/elif/else conditionals. They also use the idea of GC-content for motivation, but they demonstrate a lot of (intentionally) bad ways to do that before getting to the end of the chapter and leaving the good way as an exercise. I prefer to just jump straight to a good way and examine the code.
While we briefly introduce dictionaries for convenience in our examples below, those will be introduced at the beginning of Chapter 7 of the book.
We can define our own functions for a variety of tasks. As a simple example just to get used to the syntax, a Python implementation of a mathematical function such as $f(x) = 3x^2 - 2x + 5$ could be written as
def f(x):
    """Return the quantity 3x^2 - 2x + 5"""
    return 3*x*x - 2*x + 5
Formally, the first of those three lines states that we are defining a new function, that we want the function to be named f, and that the function takes a single parameter that we will name x. It is also important that the first line ends with a colon, which is a symbol that will be used in a variety of control structures in Python.
The second line (starting and ending with the triple-quotes) is entirely optional, but it is a Pythonic way to provide documentation for what this function does. From within the Python interpreter, someone can issue the command, help(f), to see this documentation.
The final line serves as the body of the function, which is the code that should be executed when the function is called. In this case, the body is a single statement but more generally the body can have many statements. Python uses indentation to define the scope of the function body. The special return command is used to indicate the value that should be sent back to the caller of the function at the conclusion.
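As a quick check of the definition above, here is a small sketch that calls f on a few sample inputs (the inputs are our own choice, purely for illustration) and then reads back the documentation string through the function's __doc__ attribute, which holds the same text that help(f) displays.

```python
def f(x):
    """Return the quantity 3x^2 - 2x + 5"""
    return 3*x*x - 2*x + 5

# Calling the function sends an argument in and gets the returned value back.
print(f(0))     # 3*0*0 - 2*0 + 5
print(f(2))     # 3*2*2 - 2*2 + 5
print(f(-1))    # 3*(-1)*(-1) - 2*(-1) + 5

# The docstring is stored with the function itself.
print(f.__doc__)
```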
In terms of the mathematical computation, notice that the * operator is used to indicate multiplication. Also note that Python follows algebraic convention in terms of order of operations, and thus the multiplications, such as 2*x will be done before the addition and subtraction. The spacing in our computation was purely for visual appeal. The same result would be computed even if we had written the final line as
def f(x):
    return 3*x*x-2*x+5
For those who want an advanced lesson, Python uses the ** operator for exponentiation (which has precedence even over multiplication), so this could have been written as
def f(x):
    return 3*x**2 - 2*x + 5
This is probably not that important for $x^2$ but can be helpful for higher powers.
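To convince ourselves of the precedence claim, here is a minimal sketch with sample values of our own choosing. The names f_mult and f_pow are hypothetical, used only so that both versions of the function can sit side by side.

```python
def f_mult(x):      # hypothetical name for the version using x*x
    return 3*x*x - 2*x + 5

def f_pow(x):       # hypothetical name for the version using **
    return 3*x**2 - 2*x + 5

# ** is applied before *, so 3*x**2 means 3*(x**2), not (3*x)**2.
x = 4
print(3*x**2)       # 3 times 16
print((3*x)**2)     # 12 squared

# Both definitions compute the same polynomial on these sample inputs.
for x in [-2, 0, 1, 7]:
    assert f_mult(x) == f_pow(x)
```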
While the first example had a single numeric parameter, a function can have parameters of any type (and multiple parameters, if desired).
As a biological example, we consider computing the GC-content of a dna string, which we will view as a floating-point number between 0.0 and 1.0 that is the ratio between the number of bases that are either G or C relative to the total number of bases. That is, we want the fraction
$\frac{\#\mbox{G} \ +\ \#\mbox{C}}{\#\mbox{bases}}$
Here is a function that computes the GC-content of a given dna string (returned as a floating-point number between 0.0 and 1.0).
def gc_content(dna):
    """Return the quantity representing the fraction of bases that are G or C"""
    return (dna.count('C') + dna.count('G')) / float(len(dna))
With this example, we wish to highlight two more lessons about numeric computations in Python.
The first is that we were required to explicitly parenthesize the sum that formed the numerator in our fraction. For simplicity of notation, let G, C, and B represent the respective quantities. If we were to type the Python expression
G+C/B
the division has precedence over addition and so it is interpreted as
G+(C/B)
So we need to explicitly parenthesize the quantity (G+C) as the numerator.
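A small numeric sketch makes the difference concrete. The counts below are made up for illustration (they happen to match the strand CCGAT).

```python
G, C, B = 1, 2, 5     # sample counts, chosen only for illustration

without_parens = G + C / float(B)     # division first: 1 + (2/5)
with_parens = (G + C) / float(B)      # sum first: 3/5

print(without_parens)
print(with_parens)
```

Only the parenthesized form stays between 0.0 and 1.0, as a GC-content must.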
The second lesson is that we explicitly made the denominator of the computed division a floating-point representation of the length; that is, using shorthand notation, we typed
(G+C)/float(B)
This is because in Python 2, if we divide an integer by an integer, it truncates and provides an integer result. (In Python 3, that additional step is superfluous because the integer / integer expression would already compute the floating-point division.)
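The sketch below illustrates the two kinds of division with the same made-up counts as before. It uses the // operator, which discards the fractional part in both Python 2 and Python 3; for nonnegative counts like these, that matches what Python 2 computes for integer / integer.

```python
G, C, B = 1, 2, 5     # sample counts, chosen only for illustration

# Floor division discards the fractional part of 3/5.
print((G + C) // B)

# Converting the denominator to a float forces floating-point division
# in either version of Python.
print((G + C) / float(B))
```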
The above function for computing the GC-content of a strand of DNA relies on the fact that strings in Python support their own count function, which does the step of looping through the entire string and keeping track of the number of matches. Here we provide a more homespun approach that demonstrates both the use of a construct known as a for loop and an if statement.
def gc_content(dna):
    match = 0    # number of GC matches we find
    for base in dna:
        if base == 'G' or base == 'C':
            match += 1    # shorthand for match = match + 1
    return float(match) / len(dna)
There are a lot of new techniques here, so let's unpack each part of this code.
First we note that there are five lines within the function body (not just one). Python recognizes this because they are all indented within the definition of the function.
The first line of the body,
match = 0 # number of GC matches we find
initializes a variable named match to zero; we will use it as a counter. The character # in Python designates the rest of a line as a comment, which is ignored by the interpreter but helpful to a human who is reading the source code. In the first line of the body, we use this to give the reader an understanding about what this number will represent.
The line
for base in dna:
is our first example of a for loop. What this does is to provide a piece of code that should be executed for each character of a string. The variable name, base, that is used in this syntax was our choice of what to name the individual character for each pass of the loop.
The extent of the body of the loop is again indicated using a further level of indentation. In this case, the next two lines are part of the body of the loop, but the subsequent return statement is not part of the loop (that will only be executed once the entire loop is complete).
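To see the loop mechanics in isolation, here is a minimal sketch using a made-up strand. The counter named passes is hypothetical, introduced only to show how many times the body runs.

```python
dna = 'CCGAT'      # sample strand, chosen only for illustration
passes = 0         # hypothetical counter of how many times the body runs

for base in dna:
    # Indented lines form the loop body; they run once per character,
    # with base naming the current character.
    print(base)
    passes = passes + 1

# This line is not indented, so it runs once, after the loop is complete.
print(passes)
```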
Within the body of the for loop, we are using a new construct known as an "if statement" (or as a conditional). The purpose is to provide one or more commands that should only be done if some particular condition is met. The general form of the structure is
if condition:
    body
where condition should typically be a Boolean expression (that is, one that results in a True or False value). In this example, we used a compound condition
if base == 'G' or base == 'C'
as our condition, which will be true if the given base is either of those characters.
As another convenient shorthand, this condition could have been stated succinctly as
if base in 'GC'
This relies on the fact that the in operator can be used to test whether one string is found as a substring of another.
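Here is a small sketch comparing the two ways of writing the condition; the helper names is_gc_compound and is_gc_in are hypothetical. The two agree whenever base is a single character, which is the case inside our loop. Because in tests for substrings, a longer string such as 'GC' (or even the empty string) would also pass the shorthand test.

```python
def is_gc_compound(base):     # hypothetical helper: compound condition
    return base == 'G' or base == 'C'

def is_gc_in(base):           # hypothetical helper: substring shorthand
    return base in 'GC'

# For single characters, the two tests give the same answer.
for base in 'GCAT':
    print(base, is_gc_compound(base), is_gc_in(base))

# The in operator tests substrings, so these are also considered matches.
print(is_gc_in('GC'))
print(is_gc_in(''))
```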
Finally, I will (reluctantly) give a warning about a common
pitfall which is the following Python syntax that does
not express the desired logic.
if base == 'G' or 'C'
While this might seem reasonable in English, this does something
else in Python (which I dare not explain). Just wanted to give
the warning...
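For the curious, here is a brief sketch of the symptom, with hypothetical helper names of our own. Python reads the faulty condition as (base == 'G') or ('C'), and a nonempty string such as 'C' counts as true, so the test succeeds for every base.

```python
def faulty_is_gc(base):       # hypothetical helper showing the pitfall
    # Parsed as (base == 'G') or ('C'); the nonempty string 'C' counts as true.
    return bool(base == 'G' or 'C')

def correct_is_gc(base):      # hypothetical helper with the intended logic
    return base == 'G' or base == 'C'

for base in 'GCAT':
    print(base, faulty_is_gc(base), correct_is_gc(base))
```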
Within the body of the conditional, our goal is to increment the integer counter named match, which we are using to keep track of how many bases were one of G or C. We effectively want to reassign that value to be one more than it used to be, and can do so as
match = match + 1
but because it is so common to want to use an assignment to update a value by some constant, there is a shorthand operator, +=, that allows the more succinct syntax
match += 1 # i.e., add one to match
The function again terminates with a return statement that computes the floating-point ratio of the number of GC bases over the number of total bases.
The above example relied on a compound boolean condition to test whether the base was C or G. We could have stated that more distinctly as two separate tests using an extension of the if-statement syntax using an "elif" clause, which is short for "else if".
def gc_content(dna):
    match = 0    # number of GC matches we find
    for base in dna:
        if base == 'C':
            match += 1
        elif base == 'G':
            match += 1
    return float(match) / len(dna)
In the body of the for loop, this logic is akin to the following process. If the base character is C, then the match count is increased. Otherwise it performs a second test to see if the base character is G, in which case it also increases the match count.
In this particular example, notice that the body of the if block and the elif block are the same. In general, those could be different actions in those two blocks (as in our next example). In fact, if they are the same actions in both, the original version with a single compound condition is preferred, because it makes it clearer that there is only a single action that might be taken, but one that could be triggered by two possible conditions.
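As a sanity check that the two formulations behave the same, the sketch below places them side by side under hypothetical names (gc_content_compound and gc_content_elif) and runs both on a few made-up strands.

```python
def gc_content_compound(dna):     # hypothetical name: single compound condition
    match = 0
    for base in dna:
        if base == 'G' or base == 'C':
            match += 1
    return float(match) / len(dna)

def gc_content_elif(dna):         # hypothetical name: if/elif version
    match = 0
    for base in dna:
        if base == 'C':
            match += 1
        elif base == 'G':
            match += 1
    return float(match) / len(dna)

for dna in ['CCGAT', 'AATT', 'GGCC']:     # sample strands
    print(dna, gc_content_compound(dna), gc_content_elif(dna))
```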
In our next example, we demonstrate that the result of a function need not be numeric. We design a function, reverse_complement(dna), that computes and returns the reverse complement sequence for a single strand of dna. As an example, a call to reverse_complement('CCGAT') should produce the string 'ATCGG' with the A of the result being the complement of the final T of the original, the T of the result being the complement of the A at the second-to-last location of the original, and so forth.
def reverse_complement(dna):
    """Return a string representing the reverse complement of the single dna strand."""
    other = ''    # start with an empty string
    for base in dna:
        if base == 'G':
            other += 'C'
        elif base == 'C':
            other += 'G'
        elif base == 'A':
            other += 'T'
        else:    # presumably, only other possibility is T
            other += 'A'
    return other[::-1]    # reverse the result
What is new in this example is a more general form of conditional where we can have a variety of possible cases. The elif keyword is shorthand for the phrase "else if". So the logic within the for loop could be phrased in English as
"if the base is G, add a C; else if it is a C, add a G; else if it is an A, add a T; else add an A."
For the final case, we could have again used an elif to explicitly check for a base of T, but if we presume the original dna was legitimate, then we needn't bother checking because by process of elimination, if it was not G, C, or A, it must be T.
Also, notice that our overall process was to construct a new string, other, piecewise while converting each base of the original strand. Finally, since the goal was to produce the reverse complement, the return statement uses the slicing notation with a skip of negative one to produce the reversed string.
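The slicing step can be tried on its own. In this short sketch, the variable complemented is a hypothetical name holding the base-by-base complement of our running example CCGAT, before it has been reversed.

```python
dna = 'CCGAT'
print(dna[::-1])              # same characters, in reverse order

complemented = 'GGCTA'        # base-by-base complement of CCGAT
print(complemented[::-1])     # reversing it gives the reverse complement
```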
# While we could define the following dictionary within the body of the
# function, we might also construct it once, outside the function, since
# it is always the same.
complement = { 'G':'C', 'C':'G', 'A':'T', 'T':'A' }

def reverse_complement(dna):
    """Return a string representing the reverse complement of the single dna strand."""
    other = ''    # start with an empty string
    for base in dna:
        other += complement[base]
    return other[::-1]    # reverse the result
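One behavioral difference between the two versions is worth noticing. The sketch below looks up a single base in the dictionary, reproduces our CCGAT example, and then tries a strand containing the character N (an input of our own invention, not from the text). The dictionary version raises a KeyError for the unrecognized character, whereas the if/elif/else version would fall through to its else branch and quietly add an A.

```python
complement = {'G': 'C', 'C': 'G', 'A': 'T', 'T': 'A'}

def reverse_complement(dna):
    """Return a string representing the reverse complement of the single dna strand."""
    other = ''                        # start with an empty string
    for base in dna:
        other += complement[base]     # look up the complement of this base
    return other[::-1]                # reverse the result

print(complement['G'])
print(reverse_complement('CCGAT'))

# A character that is not a key of the dictionary triggers a KeyError.
try:
    reverse_complement('CCNAT')
except KeyError as err:
    print('unrecognized base:', err)
```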
Exercise: Use techniques as above to define a function dna2rna(seq) that transcribes a string of DNA, such as 'CCGAT', to the corresponding RNA sequence to which it binds ('GGCUA' in this example).