How long is this sequence?
16801 basepairs, as reported by
len(dna)
What is the first basepair of the sequence?
G, as reported by
dna[0]
What is the 2000th basepair of the sequence (that is, the one with index 1999)?
T, as reported by
dna[1999]
What is the last basepair of the sequence?
G, as reported either by
dna[16800]
or better yet as
dna[-1]
What are the first 10 characters of the sequence?
GTTGATGTAG, as reported by
dna[:10]
What are the last 10 characters of the sequence?
CCGCACCCCG, as reported by
dna[-10:]
How many times does the character C appear in the sequence?
4155 times, as reported by
dna.count('C')
The GC-content of a sequence is the percentage of basepairs that are either G or C. What is the GC-content of this sequence?
Approximately 39.3%, which can be computed piecewise by first counting the occurrences of C and G and then dividing their sum by the length of the sequence, or as a single Python expression (shown after the note below). The expression yields the fraction, roughly 0.393; multiply by 100 to express it as a percentage.
Note: If you compute a ratio of integers in Python 2, by default
it will only provide the integer quotient. One way to get a
floating-point division is to ensure that at least one of the
numerator or denominator is a floating-point number, such as
by wrapping it in float():
float(dna.count('C') + dna.count('G')) / len(dna)
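The same calculation can be wrapped in a small helper. This is only a sketch: the function name gc_content is made up for illustration, and the sample string is just the first ten basepairs reported above rather than the full sequence.

```python
def gc_content(seq):
    """Return the percentage of characters in seq that are G or C."""
    gc = seq.count('G') + seq.count('C')
    # 100.0 is a float, so the division is not truncated, even in Python 2
    return 100.0 * gc / len(seq)

sample = 'GTTGATGTAG'        # first ten basepairs of the sequence
print(gc_content(sample))    # 4 of the 10 characters are G, none are C: 40.0
```

Multiplying by 100.0 both converts the fraction to a percentage and forces floating-point division, so no separate float() call is needed.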
The pattern CCAAT is a particular motif known as a "CAT box". How many times does this motif appear in the sequence?
30 times, as reported by
dna.count('CCAAT')
What is the index at which the first occurrence of the pattern CCAAT begins?
Index 757, as reported by
dna.index('CCAAT')
What is the index at which the second occurrence of the pattern CCAAT begins?
The answer is index 1403.
Knowing that the first occurrence spans indices 757 through 761, we can find the second as
dna.index('CCAAT', 762)
or, if we prefer not to use our existing knowledge, by starting the search one position past the first occurrence, as
dna.index('CCAAT', dna.index('CCAAT') + 1)
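The idea of restarting the search just past the previous match generalizes to finding every occurrence. The following is a hedged sketch that uses a loop, which goes beyond the techniques assumed so far; find_all is a hypothetical helper name and toy is an invented sequence, not the real data.

```python
def find_all(seq, pattern):
    """Return the starting indices of every occurrence of pattern in seq."""
    starts = []
    i = seq.find(pattern)              # find returns -1 instead of raising an error
    while i != -1:
        starts.append(i)
        i = seq.find(pattern, i + 1)   # resume one past the previous match
    return starts

toy = 'AACCAATGGCCAATTCCAAT'           # made-up sequence for illustration
print(find_all(toy, 'CCAAT'))          # [2, 9, 15]
```

The second entry of the returned list corresponds to the second occurrence, and the last entry agrees with what rindex reports.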
What is the index at which the last occurrence of the pattern CCAAT begins?
Index 15151, as reported by
dna.rindex('CCAAT')
Consider initial prefixes of the sequence, such as the first three characters GTT. That particular prefix occurs 164 times in the sequence. What is the shortest prefix that does not occur anywhere else? (Given the techniques we've learned so far, you will likely need to resort to some trial and error.)
The first seven characters (GTTGATG) occur only once as a pattern in the sequence. This is the shortest such prefix, which can be confirmed by noticing
dna.count(dna[:6]) # returns 2
dna.count(dna[:7]) # returns 1
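With a loop, the trial and error can be automated. This is a sketch rather than the intended solution: shortest_unique_prefix is a hypothetical name and toy is an invented sequence. It tests for a second occurrence with find starting at index 1 instead of using count, because count tallies only non-overlapping matches and could miss a repeat that overlaps the prefix itself.

```python
def shortest_unique_prefix(seq):
    """Return the shortest prefix of seq that occurs nowhere else in seq."""
    for k in range(1, len(seq) + 1):
        prefix = seq[:k]
        # searching from index 1 skips the prefix's own position at index 0
        if seq.find(prefix, 1) == -1:
            return prefix
    return seq   # only reached for an empty string

toy = 'GTTGTAGTTC'                     # made-up sequence for illustration
print(shortest_unique_prefix(toy))     # GTTG
```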
What is the largest number of consecutive occurrences of A that can be found in the sequence? (Again, lacking more advanced programming techniques, some trial and error can be used.)
There are 9 consecutive A's in the sequence (starting at index 5159), but there are never ten consecutive occurrences. This can be demonstrated by examining the results of
dna.find(9 * 'A') # reports 5159
dna.find(10 * 'A') # reports -1
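The same probing with find can be repeated automatically until it fails. The sketch below assumes a loop is available; longest_run is a hypothetical helper name and toy is an invented sequence.

```python
def longest_run(seq, char):
    """Return the length of the longest run of consecutive char in seq."""
    n = 0
    # keep extending the run while a run one character longer can still be found
    while seq.find((n + 1) * char) != -1:
        n += 1
    return n

toy = 'GAATAAAAGAAC'               # made-up sequence for illustration
print(longest_run(toy, 'A'))       # 4
```

Each pass repeats the manual check above with a run one character longer, stopping at the first length for which find reports -1.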