Course Home | Assignments | Data Sets/Tools | Python | Schedule | Git Submission | Tutoring

Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2019

Computer Science Department

Lab 12

Topic: Shine-Dalgarno Sequences
Collaboration Policy: The lab should be completed working in pairs
Submission Deadline:    11:50am Monday, 15 April 2019

Overview

We have considered ORFs as candidate genes because of their potential to serve as coding regions. However, we can increase our confidence in such candidate genes by recognizing certain upstream motifs that the underlying biochemistry requires. Most notably:

If we locate such motifs upstream of an ORF, that would increase our confidence that the ORF is serving as a gene.

Unfortunately, detecting these motifs is not as simple as we might hope. First, we note that the nucleotide sequences required for such functionality, and the relative positions of those motifs, depend greatly on the type of organism. Furthermore, even in closely related species, or among organisms of the same species, there is typically not one precise motif sequence. Instead there is a consensus sequence that represents the most common form of the motif, with the expectation that individual genomes may vary from it.
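To make the idea of a consensus sequence concrete, here is a minimal Python sketch that takes the most common base at each position of several equal-length motif instances. The function name `consensus` and the four example instances are hypothetical and made up for illustration; they are not real data and are not part of the lab software.

```python
from collections import Counter


def consensus(instances):
    """Return the most common base at each position of equal-length motifs."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("need at least one instance, all of the same length")
    result = []
    for column in zip(*instances):
        # most_common(1) gives [(base, count)]; a tie goes to the base seen first
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Hypothetical instances, invented for illustration only
observed = ["AGGAGG", "GGGAGG", "AGGAGA", "AGGTGG"]
print(consensus(observed))
```

Note that three of the four made-up instances differ from the consensus they produce, which is exactly the kind of variation we should expect to see.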


Bacteria

In this lab, we focus on prokaryotes, which have somewhat more regularity in their upstream motifs. In particular, we will focus on bacteria.

Thus an idealized version of a bacterial gene might have an upstream region as shown in the following figure:



Note well that the actual sequence might not perfectly match the consensus, as with the pattern GGGAGG for the SD sequence (rather than consensus AGGAGG) or the pattern TTGCTA for the -35 promoter (rather than consensus TTGACA).
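One simple way to quantify how far a pattern is from the consensus is to count the positions at which the two differ. The sketch below does this for the two examples just mentioned; the helper name `mismatches` is our own choice and is not part of the lab software.

```python
def mismatches(candidate, consensus):
    """Count the positions at which two equal-length sequences differ."""
    if len(candidate) != len(consensus):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(candidate, consensus))


print(mismatches("GGGAGG", "AGGAGG"))  # SD example: 1 mismatch
print(mismatches("TTGCTA", "TTGACA"))  # -35 example: 2 mismatches
```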

However, another complicating factor for prokaryotes is that sometimes several consecutive genes act as a single operon, in which case only the first of those genes is preceded by a promoter. In this case, the operon might have a pattern akin to


Your Task

In this lab, we ask you to manually examine the upstream regions for ORFs that are found in three reference genomes:

Our primary focus will be on looking for a Shine-Dalgarno sequence preceding each ORF (as there may not be promoter motifs for genes that are part of operons). We will provide you with software that locates ORFs and shows the upstream region just before the start codon of each ORF.
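As a rough illustration of what "a reasonable match" in an upstream region might mean, the following sketch slides a window across an upstream string and reports the window closest to the consensus AGGAGG. It is only a way to check your own reading, not a replacement for the manual examination. The function `best_sd_match` and the example upstream string are hypothetical; they are not taken from orf.py or from any of the reference genomes.

```python
SD_CONSENSUS = "AGGAGG"


def best_sd_match(upstream, consensus=SD_CONSENSUS):
    """Find the window of `upstream` closest to the consensus.

    Returns (mismatches, gap, window), where gap is the number of bases
    between the end of the window and the start codon, or None if the
    upstream region is shorter than the consensus.
    """
    k = len(consensus)
    best = None
    for i in range(len(upstream) - k + 1):
        window = upstream[i:i + k]
        mm = sum(a != b for a, b in zip(window, consensus))
        gap = len(upstream) - (i + k)
        # strict '<' means that ties keep the most upstream window
        if best is None or mm < best[0]:
            best = (mm, gap, window)
    return best


# Hypothetical upstream region; the start codon would follow the final base
print(best_sd_match("TTACGGGAGGTTAACAT"))  # (1, 7, 'GGGAGG')
```

Deciding how many mismatches, and what distance from the start codon, still count as "reasonable" is a judgment call that you should make and record yourself.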

We ask that you do the following for at least two of the three reference genomes:

  1. Run the software using a minimum ORF length of 200bp. The software reports the results sorted from longest to shortest.

  2. Manually examine the 10 longest and 10 shortest ORFs that were reported, and determine whether you believe there is a reasonable match for the consensus SD sequence AGGAGG. Record your observations.

  3. Compare your "predicted genes" to the genes that are identified within GenBank. To ease the task of cross-referencing, we will provide additional files that allow the software to tag such identified genes. (But do NOT turn on those tags until you have tried to do the analysis manually.)

  4. If time remains, see whether you can find any likely promoter pairs that match or nearly match the consensus TTGACA....TATAAT sequences upstream of the transcriptional unit (remembering that the promoter may lie tens or hundreds of base pairs before the ORF).
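For the optional promoter search in step 4, a sketch along the following lines could help you spot candidate pairs in a longer upstream string. It looks for a near match to TTGACA followed, after a spacer, by a near match to TATAAT. The function name, the spacer range of 15 to 19 bases, and the mismatch budget are all assumptions chosen for illustration; adjust them to whatever spacing and tolerance you consider plausible.

```python
def mismatches(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_pairs(seq, box35="TTGACA", box10="TATAAT",
                        min_gap=15, max_gap=19, max_mismatches=2):
    """Return (start35, start10, gap, total_mismatches) for candidate pairs.

    The gap range and mismatch budget are assumed defaults, not values
    prescribed by the lab.
    """
    hits = []
    k35, k10 = len(box35), len(box10)
    for i in range(len(seq) - k35 + 1):
        mm35 = mismatches(seq[i:i + k35], box35)
        if mm35 > max_mismatches:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + k35 + gap
            if j + k10 > len(seq):
                break  # larger gaps would run off the end as well
            total = mm35 + mismatches(seq[j:j + k10], box10)
            if total <= max_mismatches:
                hits.append((i, j, gap, total))
    return hits


# Hypothetical sequence with a perfect pair separated by 17 bases
example = "CC" + "TTGACA" + "G" * 17 + "TATAAT" + "CC"
print(find_promoter_pairs(example, max_mismatches=0))  # [(2, 25, 17, 0)]
```

Short motifs such as these will match by chance in a long genome, so treat any hit as a candidate to examine rather than as evidence of a promoter.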

(blank report sheet)


Software and Data Files

You may download the single Python script, orf.py, or the bigger zip file that has that script as well as the supporting data files.


Michael Goldwasser
CSCI 1020, Spring 2019
Last modified: Monday, 15 April 2019