Preface: These notes are not intended to be a comprehensive lesson. Instead, these are my own reminders of topics I wish to discuss, knowing that I can provide more details live in class. Still, I decided to post this since it provides the outline of what I thought was important. Additionally, these notes discuss a variety of very useful behaviors of Python's str class that are not introduced in the course textbook.
Strings
A string is a finite sequence of characters from some
alphabet. In Python, that sequence is represented as an instance
of the str class. In Python 2, the default
alphabet for strings is that of ASCII. In Python 3, the
default alphabet is that of Unicode.
Of course, although Python strings have quite general use, we are free to write applications that happen to use a more restricted alphabet (e.g., A, C, G, T).
String Literals
A string literal in Python is enclosed within a pair of single
quote marks (e.g., 'ACCTG') or a pair of double-quote
marks (e.g., "ACCTG"). Those two are the same string as
far as Python is concerned. (The choice of delimiter is for
convenience, especially if you wish to use the actual character
' or " as a character within the string,
rather than as a closing delimiter.)
The empty string, '', is simply a string with zero characters.
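As a small illustration of the two quoting styles, the following sketch compares the same string written both ways and shows each delimiter being used to hold the other quote character. The variable names and sample phrases are made up for this example.

```python
# The same five-character string written with each kind of quote mark.
single = 'ACCTG'
double = "ACCTG"
print(single == double)      # True

# Choosing the other delimiter lets a quote mark appear inside the string.
phrase = "the 5' end"        # contains an apostrophe
quoted = 'a "primer" site'   # contains double-quote marks
print(phrase)
print(quoted)

# The empty string can be written with either delimiter as well.
empty = ''
print(empty == "")           # True
```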
Identifiers
In Python, an identifier is a name that we give to any
piece of data so that we can refer to it in later
commands. While those identifiers are also a sequence of
characters, the identifier is not enclosed in
quotations.
For example, we might issue the command
seq = 'ACCTG'
which associates the identifier seq with the piece of data that is the string 'ACCTG'.
String Length
If the identifier seq refers to a string, we can query its
length with the syntax
len(seq)
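A short sketch of the difference between an identifier and a string literal when asking for a length. The values are chosen only for illustration.

```python
seq = 'ACCTG'        # the identifier seq now refers to a five-character string

print(len(seq))      # 5, the length of the string that seq refers to
print(len('seq'))    # 3, the length of the literal string 'seq' itself
print(len(''))       # 0, the empty string
```

Note that the quotation marks decide whether Python sees the name seq or the three-character string 'seq'.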
Indexing
In order to refer to individual characters of a string, Python
relies on an indexing convention which refers to an
integer offset from the beginning of the
string. Please note that Python (like many programming languages)
is zero-indexed, meaning that the first character of a
string is numbered with index 0 (as it is zero steps away
from the beginning). While this is a common convention for computer
scientists, it is different from the one-indexed
convention that is typically used by non-computer-scientists
(such as biologists) when describing positions within a
sequence.
So in review, the first character of a string has index 0, the second has index 1, and so on, until the final character which has index len(seq)-1. The syntax used in Python to refer to the single character at index j of a string seq is
seq[j]
While standard indices are an offset measured from the beginning of a string, Python also makes use of negative index values to allow for measurement from the end of a string. By that convention, index -1 refers to the last character of a string, -2 the second-to-last, and so on (with -len(seq) as an alternative index for the first character).
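The indexing rules above can be tried directly. The sketch below also includes a helper, base_at_position, which is a hypothetical function (not part of Python or of these notes) showing one way to translate a biologist's one-indexed position into Python's zero-indexed convention.

```python
seq = 'ACCTG'

print(seq[0])               # A, the first character
print(seq[1])               # C, the second character
print(seq[len(seq) - 1])    # G, the final character
print(seq[-1])              # G, the final character again
print(seq[-2])              # T, the second-to-last character
print(seq[-len(seq)])       # A, an alternative index for the first character


def base_at_position(sequence, position):
    """Hypothetical helper: return the character at a one-indexed position."""
    if not 1 <= position <= len(sequence):
        raise IndexError('position must be between 1 and len(sequence)')
    return sequence[position - 1]   # shift to a zero-indexed offset


print(base_at_position(seq, 1))     # A
print(base_at_position(seq, 5))     # G
```

The explicit range check is a design choice: without it, position 0 would silently be read as index -1 and return the last character.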
Slicing
Python also provides an extremely useful way to reference
various subsequences of a string using what is known as a
slicing notation.
As a basic example, the syntax
seq[j:k]
represents the slice of seq starting at index j and going up to but not including index k. This convention will take some getting used to, but the nature of its half-open interval means that the slice seq[j:k] contains exactly k-j characters whenever 0 <= j <= k <= len(seq).
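As a concrete illustration of the half-open convention, the sketch below takes one slice and checks two consequences of excluding index k: the slice length is k-j, and two slices that share an endpoint fit together with no overlap. The sample string and index values are arbitrary.

```python
seq = 'ACCTGATTACA'
j, k = 2, 6

piece = seq[j:k]             # the characters at indices 2, 3, 4, and 5
print(piece)                 # CTGA
print(len(piece) == k - j)   # True

# Slices that share the endpoint k rebuild the original string exactly.
print(seq[0:k] + seq[k:len(seq)] == seq)   # True
```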
While the default slice uses a step size of 1, it is possible to express a different step size, or even a negative step size, as a third optional indicator. For example
seq[j:k:2]
is a slice starting at seq[j] and subsequently taking every second character, going up to (but not including or surpassing) the character seq[k].
Two additional conveniences with slices are that if you omit the argument before the first colon, Python presumes that you want to start at the beginning of the string, and if you omit the argument after the first colon, it presumes you want to go through the end of the string. By these conventions
seq[ :n]
is the slice that includes the first n characters of the string (unless the string length is less than n, in which case it will stop at the end). The syntax
seq[j: ]
will be a slice starting at seq[j] and going all the way to the end. The spacing in these examples is optional, as it is possible to express these two syntaxes as seq[:n] and seq[j:] respectively.
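The step size and the omitted-argument conventions can be checked with a few quick experiments. This is only a sketch on an arbitrary sample string.

```python
seq = 'ACCTGATTACA'

# A step of 2 takes the characters at indices 1, 3, 5, and 7.
print(seq[1:8:2])     # CTAT

# Omitting the start means "from the beginning".
print(seq[:3])        # ACC

# Omitting the stop means "through the end".
print(seq[8:])        # ACA

# Asking for more characters than exist simply stops at the end.
print(seq[:100])      # ACCTGATTACA

# Both bounds may be omitted while still giving a step.
print(seq[::3])       # ATTC, the characters at indices 0, 3, 6, and 9
```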
Reversed String
One rather unfortunate thing is that Python doesn't include an
intuitive syntax for producing the reversal of a
string. Instead, experienced Python programmers use the slicing
notation
seq[ : :-1]
which indicates to go from one end of the string to the other, but with a negative increment (thus from the end to the beginning).
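A brief sketch of the reversal idiom follows. For comparison it also builds the same result with the built-in reversed function and str.join, which is an alternative not discussed in these notes.

```python
seq = 'ACCTG'

backward = seq[::-1]     # a new string, read from the end to the beginning
print(backward)          # GTCCA

# An equivalent, more verbose spelling using built-in tools.
print(''.join(reversed(seq)) == backward)   # True

# The original string is left unchanged.
print(seq)               # ACCTG
```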
Concatenation and other Operators
Two or more strings can be concatenated to form a new string by
using the + operator. For example
'one' + 'two'
will be the string 'onetwo'. (Notice that no spaces are introduced by the concatenation.)
Just as the + operator was given an intuitive meaning for strings, many other operators have been given special definitions for strings.
A syntax such as
s * 5
produces a string that is equivalent to five concatenations of the original string. For example, 'AC' * 3 is the string 'ACACAC'.
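The two operators can be combined freely. The sketch below uses made-up variable names to show that concatenation adds no spaces of its own and that repetition matches repeated use of +.

```python
first = 'one'
second = 'two'

print(first + second)           # onetwo
print(first + ' ' + second)     # one two, the space must be supplied explicitly

s = 'AC'
repeated = s * 5
print(repeated)                       # ACACACACAC
print(repeated == s + s + s + s + s)  # True
```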
The equivalence of two strings can be tested by using a syntax such as
s == t
This will be True if the two strings have precisely the same sequence of characters, and False otherwise. Note well that this equivalence is case sensitive, as 'hello' and 'Hello' are not equivalent.
The operator for testing the non-equivalence of two strings is !=. For example the test
s != t
which should be thought of as "s is not equal to t," will be True if the two strings are not equivalent to each other, and False if they are equivalent.
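The equivalence tests can be tried on the 'hello' and 'Hello' pair mentioned above. The last line uses the str method lower, which is not otherwise covered in these notes, as one possible way to compare while ignoring case.

```python
s = 'hello'
t = 'Hello'

print(s == t)                # False, the comparison is case sensitive
print(s != t)                # True
print(s == 'hel' + 'lo')     # True, only the sequence of characters matters

# One possible way to ignore case: compare lowercase copies.
print(s.lower() == t.lower())   # True
```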
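Strings can also be compared with the following operators, which order strings lexicographically (as in a dictionary) according to the numeric codes of their characters.
Operator | Meaning |
< | strictly less than |
<= | less than or equal to |
> | strictly greater than |
>= | greater than or equal to |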
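When applied to strings, the comparison operators listed above use lexicographic order based on character codes. The sketch below shows a few consequences that sometimes surprise newcomers. The sample strings are arbitrary.

```python
print('apple' < 'banana')    # True, 'a' comes before 'b'
print('abc' < 'abd')         # True, decided at the first differing character
print('ab' < 'abc')          # True, a proper prefix is smaller
print('ACCTG' <= 'ACCTG')    # True, equal strings satisfy <= and >=
print('T' > 'G')             # True

# Uppercase letters have smaller codes than lowercase letters.
print('Zebra' < 'apple')     # True

# Strings of digits are compared character by character, not as numbers.
print('10' >= '9')           # False, because '1' is less than '9'

# Sorting a list of strings relies on the same ordering.
print(sorted(['T', 'a', 'G', 'c']))   # ['G', 'T', 'a', 'c']
```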
Additional Behaviors
There are many other built-in behaviors supported for strings in
Python. Most of these take a different form, known as
a method call, using a syntax known as dot
notation.
s.startswith(pattern)
s.endswith(pattern)
s.count(pattern)
s.find(pattern)
s.find(pattern, j)
s.index(pattern)
s.index(pattern, j)
s.rfind(pattern)
s.rfind(pattern, j)
s.rindex(pattern)
s.rindex(pattern, j)
s.strip(chars)
s.lstrip(chars)
s.rstrip(chars)
s.replace(old, new)
s.replace(old, new, count)
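The list above names the methods without describing them, so the following sketch exercises each one on a small sample string. The comments summarize the standard behavior of Python's str class. The sample values are arbitrary, and the second argument j is taken to be the index at which the search begins.

```python
s = 'ACCTGACCA'

# Prefix and suffix tests return True or False.
print(s.startswith('ACC'))    # True
print(s.endswith('ACC'))      # False, the string ends with 'CCA'

# count reports the number of non-overlapping occurrences.
print(s.count('ACC'))         # 2

# find returns the lowest index at which the pattern begins, or -1 if absent.
print(s.find('ACC'))          # 0
print(s.find('ACC', 1))       # 5, the search starts at index 1
print(s.find('GGG'))          # -1

# index behaves like find but raises ValueError when the pattern is absent.
print(s.index('ACC', 1))      # 5
try:
    s.index('GGG')
except ValueError:
    print('GGG does not occur in s')

# rfind and rindex report the highest index at which the pattern begins.
print(s.rfind('ACC'))         # 5
print(s.rfind('ACC', 6))      # -1, no occurrence begins at index 6 or later
print(s.rindex('ACC'))        # 5

# strip, lstrip, and rstrip remove leading and/or trailing characters drawn
# from the given set of characters; each returns a new string.
padded = '--ACCTG--'
print(padded.strip('-'))      # ACCTG
print(padded.lstrip('-'))     # ACCTG--
print(padded.rstrip('-'))     # --ACCTG

# replace returns a new string; an optional count limits the replacements.
print(s.replace('ACC', 'N'))      # NTGNA
print(s.replace('ACC', 'N', 1))   # NTGACCA
```

None of these calls changes s itself, since every method shown returns either a new string or a value describing the original.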