Strings in Python

Preface: These notes are not intended to be a comprehensive lesson. Instead, these are my own reminders of topics I wish to discuss, knowing that I can provide more details live in class. Still, I decided to post this since it provides the outline of what I thought was important. Additionally, these notes discuss a variety of very useful behaviors of Python's str class that are not introduced in the course textbook.

Major Themes

Use of Python's str class for sequence of characters
Use of integer indices to refer to individual characters within a string
Use of slicing notation to refer to subsequences
Variety of other useful behaviors of Python's strings

Details

Strings
A string is a finite sequence of characters from some alphabet. In Python, that sequence is represented as an instance of the str class. In Python 2, the default alphabet for strings is that of ASCII. In Python 3, the default alphabet is that of Unicode.

Although Python strings have quite general use, we are free to write applications that happen to use a more restricted alphabet (e.g., A, C, G, T).
String Literals
A string literal in Python is enclosed within a pair of single-quote marks (e.g., 'ACCTG') or a pair of double-quote marks (e.g., "ACCTG"). Those two are the same string as far as Python is concerned. (The choice of delimiter is a matter of convenience, especially if you wish to use the actual character ' or " within the string rather than as a closing delimiter.)

The empty string, '', is simply a string with zero characters.
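
As a small illustration of the two quoting styles, the sketch below checks that both delimiters yield the same string and shows one quote character used inside a string delimited by the other. The variable names and sample text are made up for this example.

```python
# Single and double quotes produce the same string.
a = 'ACCTG'
b = "ACCTG"
same = (a == b)

# Delimiting with one kind of quote lets the other kind appear inside.
contraction = "don't"
quoted = 'She said "hi"'

# The empty string has zero characters.
empty = ''

print(same)          # True
print(contraction)   # don't
print(quoted)        # She said "hi"
print(len(empty))    # 0
```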
Identifiers
In Python, an identifier is a name that we give to a piece of data so that we can refer to it in later commands. While an identifier is itself a sequence of characters, it is not enclosed in quotation marks.

For example, we might issue the command
```
    seq = 'ACCTG'
```
which associates the identifier seq with the piece of data which is the string 'ACCTG'.
String Length
If the identifier seq refers to a string, we can query its length with the syntax
```
    len(seq)
```
Indexing
In order to refer to individual characters of a string, Python relies on an indexing convention in which an index is an integer offset from the beginning of the string. Please note that Python (like many programming languages) is zero-indexed: the first character of a string has index 0 (as it is zero steps away from the beginning). While this is a common convention for computer scientists, it is different from the one-indexed convention that is typically used by non-computer-scientists (such as biologists) when describing positions within a sequence.

So in review, the first character of a string has index 0, the second has index 1, and so on, until the final character which has index len(seq)-1. The syntax used in Python to refer to the single character at index j of a string seq is
```
    seq[j]
```
While standard indices are an offset measured from the beginning of a string, Python also makes use of negative index values to allow for measurement from the end of a string. By that convention, index -1 refers to the last character of a string, -2 the second-to-last, and so on (with -len(seq) as an alternative index for the first character).
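
To make the indexing conventions concrete, here is a brief sketch using the sample string 'ACCTG' from earlier. The variable names other than seq are chosen only for this illustration.

```python
seq = 'ACCTG'
n = len(seq)                  # 5 characters, so valid indices are 0..4

# Indices measured from the beginning of the string.
first = seq[0]                # 'A'
second = seq[1]               # 'C'
last = seq[len(seq) - 1]      # 'G'

# Negative indices measured from the end of the string.
last_again = seq[-1]          # 'G'
second_to_last = seq[-2]      # 'T'
first_again = seq[-len(seq)]  # 'A'

print(n, first, second, last, last_again, second_to_last, first_again)
```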
Slicing
Python also provides an extremely useful way to reference various subsequences of a string using what is known as a slicing notation. As a basic example, the syntax
```
    seq[j:k]
```
represents the slice of seq starting at index j and going up to but not including index k. This convention will take some getting used to, but the nature of its half-open interval means that the concatenation seq[j:k] + seq[k:m] precisely equals the slice seq[j:m].
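
A quick way to get used to the half-open convention is to try it on a short string. The sketch below uses an arbitrary sample string and arbitrary values of j, k, and m (with j <= k <= m) to check that adjacent slices glue together without overlap or gap.

```python
seq = 'ACCTGGA'
j, k, m = 1, 3, 6

left = seq[j:k]     # characters at indices 1 and 2
right = seq[k:m]    # characters at indices 3, 4, and 5
whole = seq[j:m]    # characters at indices 1 through 5

print(left)                    # CC
print(right)                   # TGG
print(whole)                   # CCTGG
print(left + right == whole)   # True
print(len(left) == k - j)      # True: a slice seq[j:k] has k - j characters here
```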

While the default slice uses a step size of 1, it is possible to express a different step size, or even a negative step size, as a third optional indicator. For example
```
    seq[j:k:2]
```
is a slice starting at seq[j] and subsequently taking every second character going up to (but not including or surpassing) the character seq[k].
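
The following sketch shows the optional step on an arbitrary sample string, including one slice with a negative step. With a negative step the slice walks backward, so the start index should be the larger of the two.

```python
seq = 'ACCTGGA'

every_second = seq[1:6:2]   # indices 1, 3, 5
every_third = seq[0:7:3]    # indices 0, 3, 6
backward = seq[5:1:-1]      # indices 5, 4, 3, 2 (index 1 is excluded)

print(every_second)   # CTG
print(every_third)    # ATA
print(backward)       # GGTC
```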

Two additional conveniences with slices are that if you omit the argument before the first colon, Python presumes that you want to start at the beginning of the string, and if you omit the argument after the first colon, it presumes you want to go through the end of the string. By these conventions
```
    seq[ :n]
```
is the slice that includes the first n characters of the string (unless the string length is less than n in which case it will stop at the end). The syntax
```
    seq[j: ]
```
will be a slice starting at seq[j] and going all the way to the end. The spacing in these examples is optional, as it is possible to express these two syntaxes as seq[:n] and seq[j:] respectively.
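
The omitted-argument forms can be checked with a short sketch such as the one below. The sample string and the cut position 2 are arbitrary choices for this illustration.

```python
seq = 'ACCTG'

prefix = seq[:2]       # the first 2 characters
suffix = seq[2:]       # everything from index 2 to the end
too_long = seq[:10]    # asking for more than exists just stops at the end

print(prefix)                   # AC
print(suffix)                   # CTG
print(too_long)                 # ACCTG
print(prefix + suffix == seq)   # True
```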
Reversed String
One rather unfortunate thing is that Python doesn't include an intuitive syntax for producing the reversal of a string. Instead, experienced Python programmers use the slicing notation
```
    seq[ : :-1]
```
which indicates to go from one end of the string to the other, but with a negative increment (thus from the end to the beginning).
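
Here is the reversal idiom applied to the sample string used earlier. The original string is left unchanged, since slicing produces a new string.

```python
seq = 'ACCTG'
rev = seq[::-1]

print(rev)                 # GTCCA
print(seq)                 # ACCTG (unchanged)
print(rev[::-1] == seq)    # True: reversing twice gives back the original
```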
Concatenation and other Operators
Two or more strings can be concatenated to form a new string by using the + operator. For example
```
    'one' + 'two'
```
will be the string 'onetwo'. (Notice that no spaces are introduced by the concatenation.)

Just as the + operator was given an intuitive meaning for strings, many other operators have been given special definitions for strings.
- A syntax such as
```
    s * 5
```
  produces a string equivalent to five copies of the original string concatenated together. For example, 3 * 'Ho' produces the result 'HoHoHo'.
- The equivalence of two strings can be tested by using a syntax such as
```
    s == t
```
  This will be True if the two strings have precisely the same sequence of characters, and False otherwise. Note well that this equivalence is case sensitive, as 'hello' and 'Hello' are not equivalent.
- The operator for testing the non-equivalence of two strings is !=. For example the test
```
    s != t
```
  which should be thought of as "s is not equal to t" will be True if the two strings are not equivalent to each other, and False if they are equivalent.
- You can also compare two strings with inequality operators. The comparison is akin to alphabetical order when the strings consist of letters of the same case, but more generally characters are ordered according to their numeric codes (ASCII values in Python 2, Unicode code points in Python 3). The inequality operators are
  - `<` strictly less than
  - `<=` less than or equal to
  - `>` strictly greater than
  - `>=` greater than or equal to
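
The operators above can be tried out with a sketch like the following. The sample strings are arbitrary; the one behavior worth noticing is that comparison goes by character codes, so in ASCII an uppercase letter such as 'Z' sorts before a lowercase letter such as 'a'.

```python
s = 'Ho'
t = 'ho'

joined = 'one' + 'two'        # concatenation
repeated = 3 * s              # repetition
equal = (s == t)              # case sensitive, so these differ
not_equal = (s != t)

alphabetical = 'apple' < 'banana'    # ordinary alphabetical order
upper_first = 'Zebra' < 'apple'      # ord('Z') is 90, ord('a') is 97
prefix_smaller = 'app' <= 'apple'    # a proper prefix compares as smaller

print(joined)           # onetwo
print(repeated)         # HoHoHo
print(equal)            # False
print(not_equal)        # True
print(alphabetical)     # True
print(upper_first)      # True
print(prefix_smaller)   # True
```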
Additional Behaviors
There are many other built-in behaviors supported for strings in Python. Most of these take a different form, known as a method call, which uses a syntax known as dot notation.
- s.startswith(pattern)
- s.endswith(pattern)
- s.count(pattern)
- s.find(pattern)
- s.find(pattern, j)
- s.index(pattern)
- s.index(pattern, j)
- s.rfind(pattern)
- s.rfind(pattern, j)
- s.rindex(pattern)
- s.rindex(pattern, j)
- s.strip(chars)
- s.lstrip(chars)
- s.rstrip(chars)
- s.replace(old, new)
- s.replace(old, new, count)
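
Since the list above gives only the method names, here is a rough sketch of what each call returns on a small example. The sample strings are made up for illustration; the behaviors shown are those of Python's standard str class. Note in particular that find and rfind return -1 when the pattern is absent, whereas index and rindex raise a ValueError in that case.

```python
s = 'ACCTGACCA'          # indices: A0 C1 C2 T3 G4 A5 C6 C7 A8

starts = s.startswith('ACC')      # True
ends = s.endswith('CCA')          # True
how_many = s.count('CC')          # 2 non-overlapping occurrences

leftmost = s.find('CC')           # 1, the leftmost occurrence
next_one = s.find('CC', 2)        # 6, searching from index 2 onward
missing = s.find('GG')            # -1 signals "not found"

rightmost = s.rfind('CC')         # 6, the rightmost occurrence
none_left = s.rfind('CC', 7)      # -1, no occurrence at index 7 or later
last_a = s.rindex('A')            # 8

# index behaves like find, except that a missing pattern is an error.
try:
    s.index('GG')
    raised = False
except ValueError:
    raised = True

padded = 'xxACCTGxx'
both = padded.strip('x')          # 'ACCTG'
left_only = padded.lstrip('x')    # 'ACCTGxx'
right_only = padded.rstrip('x')   # 'xxACCTG'

all_replaced = s.replace('CC', 'GG')       # 'AGGTGAGGA'
first_replaced = s.replace('CC', 'GG', 1)  # 'AGGTGACCA'

print(starts, ends, how_many, leftmost, next_one, missing)
print(rightmost, none_left, last_a, raised)
print(both, left_only, right_only)
print(all_replaced, first_replaced)
```

None of these calls modifies s itself; methods such as strip and replace return a new string.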

Michael Goldwasser

Last modified: Wednesday, 30 January 2019
