Course Home | Assignments | Data Sets/Tools | Python | Schedule | Git Submission | Tutoring

Strings in Python

Preface: These notes are not intended to be a comprehensive lesson. Instead, these are my own reminders of topics I wish to discuss, knowing that I can provide more details live in class. Still, I decided to post this since it provides the outline of what I thought was important. Additionally, these notes discuss a variety of very useful behaviors of Python's str class that are not introduced in the course textbook.


Major Themes


Details

  1. Strings
    A string is a finite sequence of characters from some alphabet. In Python, that sequence is represented as an instance of the str class. In Python 2, the default alphabet for strings is that of ASCII. In Python 3, the default alphabet is that of Unicode.

    Although Python strings are quite general, we are free to write applications that happen to use a more restricted alphabet (e.g., A, C, G, T).

  2. String Literals
    A string literal in Python is enclosed within a pair of single-quote marks (e.g., 'ACCTG') or a pair of double-quote marks (e.g., "ACCTG"). Those two are the same string as far as Python is concerned. (The choice of delimiter is for convenience, especially if you wish to use the actual character ' or " as a character within the string, rather than as a closing delimiter.)

    The empty string, '', is simply a string with zero characters.
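
As a quick illustration (a minimal sketch; the identifier names and sample text are chosen only for this example), the two quoting styles produce equal strings, and the alternate delimiter lets a quote mark appear inside a string:

```python
single = 'ACCTG'
double = "ACCTG"
same = (single == double)           # both literals denote the same string

possessive = "the cell's nucleus"   # an apostrophe inside double quotes
quoted = 'she said "hello"'         # double quotes inside single quotes

empty = ''                          # the empty string
also_empty = ""                     # same value, other delimiter
```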

  3. Identifiers
    In Python, an identifier is a name that we give to a piece of data so that we can refer to it in later commands. While an identifier is also a sequence of characters, it is not enclosed in quotation marks.

    For example, we might issue the command

        seq = 'ACCTG'
    which associates the identifier seq with the piece of data which is the string 'ACCTG'.
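
The difference between an identifier and a string literal can be seen in a short sketch (the print calls are added here only for illustration):

```python
seq = 'ACCTG'

print(seq)     # no quotes: the identifier seq, so this prints ACCTG
print('seq')   # quotes: a three-character string literal, so this prints seq
```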

  4. String Length
    If identifier seq refers to a string, we can query its length with the syntax

        len(seq)
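
For instance, with the example string used earlier (the names n and empty_len are chosen just for this sketch):

```python
seq = 'ACCTG'
n = len(seq)           # 5, the number of characters in the string
empty_len = len('')    # 0, the empty string has no characters
```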

  5. Indexing
    In order to refer to individual characters of a string, Python relies on an indexing convention which refers to an integer offset from the beginning of the string. Please note that Python (like many programming languages) is zero-indexed: the first character of a string is numbered with index 0 (as it is zero steps away from the beginning). While this is a common convention for computer scientists, it is different from the one-indexed convention that is typically used by non-computer-scientists (such as biologists) when describing positions within a sequence.

    So in review, the first character of a string has index 0, the second has index 1, and so on, until the final character which has index len(seq)-1. The syntax used in Python to refer to the single character at index j of a string seq is

        seq[j]

    While standard indices are an offset measured from the beginning of a string, Python also makes use of negative index values to allow for measurement from the end of a string. By that convention, index -1 refers to the last character of a string, -2 the second-to-last, and so on (with -len(seq) as an alternative index for the first character).
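
The indexing conventions above can be sketched as follows (variable names are illustrative only). The last few lines also show that an index outside the valid range raises an IndexError:

```python
seq = 'ACCTG'

first = seq[0]                  # 'A'
second = seq[1]                 # 'C'
last = seq[len(seq) - 1]        # 'G'

also_last = seq[-1]             # 'G'
second_to_last = seq[-2]        # 'T'
also_first = seq[-len(seq)]     # 'A'

# An index beyond the end of the string is an error.
out_of_range = False
try:
    seq[len(seq)]
except IndexError:
    out_of_range = True
```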

  6. Slicing
    Python also provides an extremely useful way to reference various subsequences of a string using what is known as a slicing notation. As a basic example, the syntax

        seq[j:k]
    represents the slice of seq starting at index j and going up to but not including index k. This convention will take some getting used to, but the nature of its half-open interval means that the concatenation seq[j:k] + seq[k:m] precisely equals the slice seq[j:m].
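
A small sketch of the half-open convention, using an arbitrary example string and arbitrary values for j, k, and m:

```python
seq = 'ACCTGATC'
j, k, m = 1, 4, 7

left = seq[j:k]      # 'CCT'    (indices 1, 2, 3)
right = seq[k:m]     # 'GAT'    (indices 4, 5, 6)
whole = seq[j:m]     # 'CCTGAT' (indices 1 through 6)

# Adjacent slices fit together with no overlap and no gap.
pieces_match = (left + right == whole)

# For valid indices with j <= k, the slice has exactly k - j characters.
slice_length = len(seq[j:k])
```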

    While the default slice uses a step size of 1, it is possible to express a different step size, or even a negative step size, as a third optional indicator. For example

        seq[j:k:2]
    is a slice starting at seq[j] and subsequently taking every second character going up to (but not including or surpassing) the character seq[k].
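
As an illustration of step sizes (the string and index values are arbitrary choices for this sketch):

```python
seq = 'ACCTGATC'

every_second = seq[1:6:2]    # indices 1, 3, 5 -> 'CTA'
every_third = seq[0:8:3]     # indices 0, 3, 6 -> 'ATT'

# With a negative step, the slice runs from the higher index down toward
# the lower one, still excluding the stopping index.
backward = seq[6:1:-2]       # indices 6, 4, 2 -> 'TGC'
```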

    Two additional conveniences with slices are that if you omit the argument before the first colon, Python presumes that you want to start at the beginning of the string, and if you omit the argument after the first colon, it presumes that you want to go through the end of the string. By these conventions

        seq[ :n]
    is the slice that includes the first n characters of the string (unless the string length is less than n, in which case the slice stops at the end of the string). The syntax

        seq[j: ]
    will be a slice starting at seq[j] and going all the way to the end. The spacing in these examples is optional; the two forms can equally be written as seq[:n] and seq[j:], respectively.
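
A brief sketch of slices with omitted arguments (the value 3 and the variable names are arbitrary):

```python
seq = 'ACCTG'

prefix = seq[:3]       # 'ACC', the first three characters
suffix = seq[3:]       # 'TG', everything from index 3 onward
too_long = seq[:10]    # 'ACCTG', the slice simply stops at the end

# A prefix and the matching suffix always reassemble the original string.
reassembled = (seq[:3] + seq[3:] == seq)
```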

  7. Reversed String
    One rather unfortunate thing is that Python doesn't include an intuitive syntax for producing the reversal of a string. Instead, experienced Python programmers use the slicing notation

        seq[ : :-1]
    which indicates to go from one end of the string to the other, but with a negative increment (thus from the end to the beginning).
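
For example (the second string is a made-up palindrome, used here only to show a typical application of the reversal idiom):

```python
seq = 'ACCTG'
backward = seq[::-1]               # 'GTCCA'

# A string that equals its own reversal reads the same in both directions.
candidate = 'ACGCA'
is_palindrome = (candidate == candidate[::-1])
```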

  8. Concatenation and other Operators
    Two or more strings can be concatenated to form a new string by using the + operator. For example

        'one' + 'two'
    will be the string 'onetwo'. (Notice that no spaces are introduced by the concatenation.)

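    Just as the + operator was given an intuitive meaning for strings, several other operators have string-specific definitions, such as * for repetition, in for substring containment, and the comparison operators (==, <, and so on) for equality and lexicographic ordering.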
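
A few of the operators that Python defines for strings are sketched below; this is a small sample rather than a complete list, and the example values are arbitrary:

```python
repeated = 'AC' * 3                 # 'ACACAC', repetition
contains = 'CCT' in 'ACCTG'         # True, 'CCT' occurs as a substring
missing = 'GG' in 'ACCTG'           # False
equal = ('ACC' == "ACC")            # True, compared character by character
before = 'ACC' < 'ACG'              # True, lexicographic order ('C' < 'G')
```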

  9. Additional Behaviors
    There are many other built-in behaviors supported for strings in Python. Most of these take a different form, known as a method call, which uses a syntax known as dot notation.
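
As a hedged sample of what such method calls look like, here are a few methods of the str class applied to the earlier example string (the choice of methods is just for illustration):

```python
seq = 'ACCTG'

c_count = seq.count('C')            # 2, number of occurrences of 'C'
lowered = seq.lower()               # 'acctg', a new lowercase string
position = seq.find('CT')           # 2, index where 'CT' first occurs
rna = seq.replace('T', 'U')         # 'ACCUG', a new string with T replaced by U
starts = seq.startswith('AC')       # True
```

Note that none of these calls changes seq itself; methods such as lower and replace return a new string.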


Michael Goldwasser
Last modified: Wednesday, 30 January 2019