Lab Solution

Python String Manipulation

  1. How long is this sequence?

    16801 basepairs, as reported by

    len(dna)

  2. What is the first basepair of the sequence?

    G, as reported by

    dna[0]

  3. What is the 2000th basepair of the sequence (that is, the one with index 1999)?

    T, as reported by

    dna[1999]

  4. What is the last basepair of the sequence?

    G, as reported either by

    dna[16800]

    or, better yet, as

    dna[-1]

  5. What are the first 10 characters of the sequence?

    GTTGATGTAG, as reported by

    dna[:10]

  6. What are the last 10 characters of the sequence?

    CCGCACCCCG, as reported by

    dna[-10:]
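
    The indexing and slicing rules used in questions 2 through 6 can be checked on a small string. This is only an illustrative sketch: toy is a made-up 12-character sequence, not the lab data, and the print call assumes Python 3.

```python
toy = 'GTTGATGTAGCC'          # hypothetical 12-character sequence, not the lab data

first = toy[0]                # index 0 holds the first character
fourth = toy[3]               # the k-th character lives at index k - 1
last = toy[-1]                # same character as toy[len(toy) - 1]
head = toy[:10]               # first 10 characters (indices 0 through 9)
tail = toy[-10:]              # last 10 characters (indices 2 through 11 here)

print(first, fourth, last, head, tail)
```

    The negative forms are usually preferable because they do not depend on knowing the length of the sequence in advance.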

  7. How many times does the character C appear in the sequence?

    4155 times, as reported by

    dna.count('C')

  8. The GC-content of a sequence is the percentage of basepairs that are either G or C. What is the GC-content of this sequence?

    Note: In Python 2, dividing one integer by another yields only the integer quotient by default. One way to get floating-point division is to ensure that at least one of the numerator or denominator is a floating-point number, such as float(1) / 3 or even 1.0/3.

    Approximately 39.3%, which can be computed piecewise by first counting the occurrences of C and G and then dividing their sum by the length of the sequence, or as a single Python expression (it yields the fraction, roughly 0.393; multiply by 100 for the percentage):

    float(dna.count('C') + dna.count('G')) / len(dna)
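
    The same computation can be wrapped in a small function that returns a percentage directly. This is a minimal sketch: the name gc_content and the sample strings are assumptions made for illustration, and the function is shown on the first and last ten characters reported above rather than on the full sequence.

```python
def gc_content(seq):
    """Return the GC-content of seq as a percentage between 0 and 100."""
    gc = seq.count('G') + seq.count('C')
    # 100.0 is a float, so the division below is floating-point
    # in Python 2 as well as in Python 3.
    return 100.0 * gc / len(seq)


# First and last ten characters of the lab sequence, as reported earlier.
head = 'GTTGATGTAG'
tail = 'CCGCACCCCG'

print(gc_content(head))   # 4 of 10 characters are G or C
print(gc_content(tail))   # 9 of 10 characters are G or C
```

    Note that the function divides by len(seq), so it is only meaningful for a non-empty sequence.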

  9. The pattern CCAAT is a particular motif known as a "CAT box". How many times does this motif appear in the sequence?

    30 times, as reported by

    dna.count('CCAAT')

  10. What is the index at which the first occurrence of the pattern CCAAT begins?

    index 757, as reported by

    dna.index('CCAAT')

  11. What is the index at which the second occurrence of the pattern CCAAT begins?

    The answer is index 1403.

    Knowing that the first occurrence spans indices 757 through 761, we can find the second as

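    dna.index('CCAAT', 762)

    or, if we prefer not to rely on that knowledge, by starting the search one position past the first occurrence (starting at the first occurrence itself would simply report 757 again):

    dna.index('CCAAT', dna.index('CCAAT') + 1)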
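
    The idea of restarting the search just past the previous match generalizes to the n-th occurrence. The sketch below is one possible way to do it, assuming a helper named nth_index (a made-up name) and a short toy string in place of the lab sequence.

```python
def nth_index(seq, pattern, n):
    """Return the index at which the n-th occurrence (n = 1, 2, ...) of pattern begins.

    Raises ValueError, as str.index does, when seq has fewer than n occurrences.
    """
    if n < 1:
        raise ValueError('n must be at least 1')
    start = 0
    for _ in range(n):
        pos = seq.index(pattern, start)
        start = pos + 1       # resume one position past the start of this match
    return pos


toy = 'AACCAATGGCCAATTTCCAAT'   # hypothetical sequence with CCAAT at 2, 9 and 16

print(nth_index(toy, 'CCAAT', 1))
print(nth_index(toy, 'CCAAT', 2))
print(nth_index(toy, 'CCAAT', 3))
```

    Because the search resumes at pos + 1 rather than past the end of the match, this version would also count overlapping occurrences; for CCAAT that makes no difference, since the pattern cannot overlap itself.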

  12. What is the index at which the last occurrence of the pattern CCAAT begins?

    index 15151, as reported by

    dna.rindex('CCAAT')

  13. Consider initial prefixes of the sequence, such as the first three characters GTT. That particular prefix occurs 164 times in the sequence. What is the shortest prefix that does not occur anywhere else? (Given the techniques we've learned so far, you will likely need to resort to some trial and error.)

    The first seven characters (GTTGATG) occur only once as a pattern in the sequence. This is the shortest such prefix, which can be confirmed by noting that

    dna.count(dna[:6])      # returns 2
    dna.count(dna[:7])      # returns 1
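
    With a loop, which goes beyond the techniques the question assumes, the trial and error can be automated. The following is a sketch under that assumption; shortest_unique_prefix is a made-up name and toy is an illustrative string, not the lab sequence. It uses find with a starting index of 1 rather than count, because count only tallies non-overlapping matches.

```python
def shortest_unique_prefix(seq):
    """Return the shortest prefix of seq that occurs nowhere else in seq."""
    for k in range(1, len(seq) + 1):
        prefix = seq[:k]
        # Searching from index 1 skips the prefix itself at index 0,
        # so -1 means there is no other occurrence (overlapping or not).
        if seq.find(prefix, 1) == -1:
            return prefix
    return seq   # only reached when seq is empty


toy = 'GTTGATGTTGC'   # hypothetical sequence; GTTG recurs at index 6, GTTGA does not

print(shortest_unique_prefix(toy))
```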

  14. What is the largest number of consecutive occurrences of A that can be found in the sequence? (Again, lacking more advanced programming techniques, some trial and error can be used.)

    There are 9 consecutive A's in the sequence (starting at index 5159), but there are never 10 consecutive occurrences. This can be demonstrated by examining the results of

    dna.find(9 * 'A')         # reports 5159
    dna.find(10 * 'A')        # reports -1
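
    The same test can be repeated in a loop instead of by hand. This is a hedged sketch rather than the intended lab technique: longest_run is a made-up name, and toy is a short illustrative string.

```python
def longest_run(seq, base):
    """Return the length of the longest run of consecutive copies of base in seq."""
    if not base:
        raise ValueError('base must be a non-empty string')
    n = 0
    # Keep lengthening the run while a run of n + 1 copies still appears in seq.
    while base * (n + 1) in seq:
        n += 1
    return n


toy = 'GAATAAAACAAG'   # hypothetical sequence whose longest run of A has length 4

best = longest_run(toy, 'A')
print(best)                      # length of the longest run
print(toy.find(best * 'A'))      # index at which the first such run begins
```

    The loop stops as soon as a longer run is not found, which mirrors the manual check of 9 * 'A' against 10 * 'A'.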


Last modified: Thursday, 07 February 2019