As we begin our exploration of the Python programming language, our first goal is to gain a high-level overview of programming concepts and some comfort in reading Python code and understanding its instructions (as opposed to writing new Python code). This is similar to learning a natural language, in which it is easier to read or hear existing samples of the language before learning to express yourself by writing or speaking it. This document therefore provides an initial overview of the language and programming concepts. We will add further links to more specific documentation on aspects of the language at a later point.
It is important to understand that there are various types of data that programmers will want to store and manipulate, and that each of these has a different internal representation. In Python, a concept known as a class is used to define a type of data. The primary classes that we will use include:
int -- represents an integer value (e.g. 53, -10000)
float -- a more general numeric representation that can handle integral or non-integral values, albeit with inherent limits on precision (e.g. 3.14159, 53.0, 6.022e23)
str -- a string of characters. Strings in Python most often represent language/text such as 'hello'.
bool -- named for George Boole, a "bool" or "boolean" value is a logical value that is either True or False. We will see boolean values a great deal implicitly when testing conditions, such as in an if statement.
list -- a list is used to maintain a general sequence of values. That is, while a string is specifically a sequence of characters, a list can be a sequence of characters, a sequence of numbers, a sequence of strings, or even a sequence of other lists.
dict -- a "dictionary" is a more advanced data structure that is optimized to efficiently map from a set of unique "keys" to their associated "values". For example, we could use a dictionary to represent the mapping from codons to amino acids.
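As a rough illustration of the classes listed above, the following sketch creates one value of each type and reports its class. The variable names are assumptions made for this example, and the two dictionary entries are simply a pair of codons from the standard genetic code, not a full table.

```python
# One illustrative value per class; all names here are hypothetical.
count = 53                                   # int
ratio = 3.14159                              # float
greeting = 'hello'                           # str
is_ready = True                              # bool
bases = ['A', 'C', 'G', 'T']                 # list (here, a list of strings)
codon_table = {'AUG': 'Met', 'UGG': 'Trp'}   # dict mapping keys to values

# The built-in type function reports the class of any value.
for value in (count, ratio, greeting, is_ready, bases, codon_table):
    print(type(value).__name__, value)

# Looking up a key in a dictionary yields its associated value.
print(codon_table['AUG'])        # Met

# Floats have limited precision, so exact comparisons can surprise.
print(0.1 + 0.2 == 0.3)          # False
```

Reading the printed class names alongside the values is a reasonable way to get used to how Python distinguishes, say, 53 from 53.0.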
When programming in Python we can give internal names to any values that we compute as a way to store and later identify such a value. For example, we might use the command
samples = 357
to assign the name samples to the associated value 357, or the command
dna = 'ACCTAAGA'
to assign the name dna to the associated value. The = symbol is used to designate such an assignment of a value to a name.
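A short sketch of how such names behave once assigned: a name can be used wherever its value is needed, and it can later be reassigned to a new value. The particular follow-up statements are assumptions chosen for illustration.

```python
samples = 357
dna = 'ACCTAAGA'

# A name can be used anywhere its value would be.
print(samples)           # 357
print(len(dna))          # 8

# Assigning again makes the name refer to a new value.
samples = samples + 1
print(samples)           # 358
```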
The order in which commands are executed by a computer is called the "flow of control." The default flow is that commands are executed in a Python script in the order in which they are expressed.
firstCommand
secondCommand
thirdCommand
...
However, there are many control structures that allow you to vary the flow of control. Most notably:
conditionals -- An "if statement" is the simplest form of a conditional, as it allows you to specify a block of code that may or may not be executed depending on the evaluation of a boolean condition, such as
if raining:
    openTheUmbrella
The indented block of code will be executed if the defining condition is true, but otherwise that block of code is skipped.
Conditionals may also use an "if/else" form in which there is a second block that is executed when the condition is false. It is also possible to chain conditions to form multiway branching statements.
loops -- One of the most powerful aspects of computing is the ability to repeat blocks of code. Loops are the most common control structure used to identify a block of code that is to be repeated. In fact, we will see three different styles of loops. A for loop will be used to repeat a block of code once for each element of a sequence (such as each base in a DNA strand). A range-based for loop will be used to repeat a block of code over a numeric range. Finally, a while loop will give us the most flexible form, allowing a block of code to be repeated so long as some boolean condition is satisfied.
functions -- Functions provide a means for abstraction in that they allow a more complex behavior to be defined, named, and subsequently used in other portions of code. For example, we can think of higher-level tasks such as transcribing DNA and translating RNA to an amino acid sequence. In Python, we might implement these behaviors and define a function named translate, allowing us to write code such as
aminoSeq = translate(dna)
There are some functions that are already automatically supported by Python, such as max which computes the maximum value among a sequence of values. Many more functions have been implemented and included in various libraries known as modules in Python, which can be imported into a script. Finally, we can define new custom functions that suit our own purposes.
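The control structures described above can be sketched together in one small script. Everything specific here is an assumption made for illustration: the sample strand, the length thresholds, and the custom function name count_base are hypothetical, while max and the math module are standard parts of Python.

```python
import math                      # a standard-library module

# Hypothetical example data.
dna = 'ACCTAAGA'

# Conditional with chained branches (if / elif / else).
if len(dna) > 10:
    label = 'long'
elif len(dna) > 5:
    label = 'medium'
else:
    label = 'short'
print(label)                     # medium

# for loop: the block runs once per character of the string.
count_a = 0
for base in dna:
    if base == 'A':
        count_a = count_a + 1
print(count_a)                   # 4

# Range-based for loop: k takes the values 0, 1, ..., len(dna)-1.
positions = []
for k in range(len(dna)):
    if dna[k] == 'A':
        positions.append(k)
print(positions)                 # [0, 4, 5, 7]

# while loop: advance until the first 'T' (or the end of the string).
k = 0
while k < len(dna) and dna[k] != 'T':
    k = k + 1
print(k)                         # 3


# A custom function (hypothetical name) wrapping the counting loop above.
def count_base(sequence, target):
    total = 0
    for base in sequence:
        if base == target:
            total = total + 1
    return total


print(count_base(dna, 'C'))      # 2
print(max([3, 8, 5]))            # 8, using the built-in max
print(math.sqrt(16))             # 4.0, using a function from a module
```

Note that the for loop and the range-based loop visit the same characters; the range-based form is mainly useful when the position k itself is needed.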
The primary task of the SMS Filter tool is to produce a result that is equivalent to the original piece of text, except that all non-alphabetic characters are omitted. We offer our own function, named clean, which accomplishes this task. Our implementation appears as follows.
def clean(original):
    result = ''
    for c in original:
        if c.isalpha():
            result = result + c
    return result
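To see the function at work, the sketch below repeats the definition so that it runs on its own and then calls it on a made-up input containing digits, spaces, and punctuation. The second function, clean_join, is a hypothetical alternative added only for comparison; it is not part of the original tool.

```python
def clean(original):
    result = ''
    for c in original:
        if c.isalpha():              # keep only alphabetic characters
            result = result + c
    return result


# Hypothetical alternative: the same filtering expressed with str.join.
def clean_join(original):
    return ''.join(c for c in original if c.isalpha())


raw = '1 ACC-TAA GA 12'              # made-up input for illustration
print(clean(raw))                    # ACCTAAGA
print(clean_join(raw))               # ACCTAAGA
```

Both versions walk the text one character at a time; the loop form makes each step explicit, which is why it is the easier one to read first.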
A primary task of the SMS Reverse Complement tool is to compute the reverse complement of an original strand of DNA. We offer our own function, named reverse_complement, which accomplishes this task. Our implementation appears as follows.
# convert a given DNA sequence to its reverse complementary strand
dna2dna = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}

def reverse_complement(dna):
    other = ''
    for base in dna:
        other = other + dna2dna[base]
    return other[::-1]    # python trick to reverse a string
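As a quick check of this function, the following sketch repeats the definition so that it is self-contained and applies it to a sample strand chosen for illustration. Taking the reverse complement twice should give back the original strand, which makes a convenient sanity test.

```python
# Mapping from each base to its complementary base.
dna2dna = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}


def reverse_complement(dna):
    other = ''
    for base in dna:
        other = other + dna2dna[base]    # build the complement, left to right
    return other[::-1]                   # slice with step -1 reverses the string


strand = 'ACCTAAGA'                      # made-up example strand
print(reverse_complement(strand))        # TCTTAGGT

# Applying the function twice recovers the original strand.
print(reverse_complement(reverse_complement(strand)) == strand)   # True
```

This version assumes the input contains only the uppercase letters A, C, G, and T; any other character would raise a KeyError at the dictionary lookup.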
The SMS Color Align Conservation tool, though more advanced than the following, graphically highlights locations within two equal-length sequences that differ from each other. We compute a (non-graphical) summary of the pairwise differences of two sequences as follows.
# compute pairwise differences between two equal-length strings
# list of differences returned using notation such as 'T35G'
# to reflect that at location 35 a T in first sequence is G in second
def difference(first, second):
    diff = []
    for k in range(len(first)):
        if first[k] != second[k]:
            diff.append(first[k] + str(1+k) + second[k])
    return diff
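A small usage sketch for this function follows, with the definition repeated so that it runs on its own. The two sequences are made up for illustration and differ at the third and eighth positions.

```python
def difference(first, second):
    diff = []
    for k in range(len(first)):
        if first[k] != second[k]:
            # Report positions starting from 1, hence 1+k.
            diff.append(first[k] + str(1 + k) + second[k])
    return diff


first = 'ACCTAAGA'       # made-up example sequences of equal length
second = 'ACGTAAGT'
print(difference(first, second))     # ['C3G', 'A8T']
print(difference(first, first))      # []
```

The range-based loop is used here because the position k is needed both to compare the two sequences at the same index and to label each reported difference.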