Lab 12

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Topic:	Shine-Dalgarno Sequences
Collaboration Policy:	The lab should be completed working in pairs
Submission Deadline:	11:50am Monday, 15 April 2019

Overview

We have considered ORFs as candidate genes because of their potential to be coding regions. However, we can increase our confidence in such candidate genes by recognizing certain upstream motifs that are biochemically required. Most notably:

- For transcription to take place, the RNA polymerase enzyme, which makes RNA, must attach to the DNA in the upstream neighborhood of the beginning of the transcriptional unit of the gene. Such an upstream section of DNA is known as a promoter.
- For the transcribed portion of the DNA, in order for protein synthesis to begin at the start codon, there must be a ribosomal binding site (RBS) in close proximity upstream of the start codon.

If we locate such motifs upstream of an ORF, that would increase our confidence that the ORF is serving as a gene.

Unfortunately, detecting these motifs is not as simple as we might hope. First, we note that the nucleotide sequences required for such functionality, and the relative positions of those motifs, depend greatly on the type of organism. Furthermore, even in closely related species or among organisms of the same species, there is typically not one precise motif sequence. Instead, there is a consensus sequence that represents the most common form of the motif, with the expectation that individual genomes may vary from it.
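
To make the idea of a consensus sequence concrete, here is a minimal sketch that takes several equal-length instances of a motif and reports the most common nucleotide at each position. The function name consensus and the four sample sequences are hypothetical, made up only for illustration; they are not taken from any of the lab's genomes.

```python
from collections import Counter


def consensus(sequences):
    """Return the most common base in each column of equal-length sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    result = []
    for column in zip(*sequences):
        # most_common(1) returns [(base, count)]; on a tie, the base
        # encountered first in the column is the one reported
        base, _ = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Hypothetical motif instances (illustration only, not real data)
observed = ["AGGAGG", "GGGAGG", "AGGAGA", "AGGTGG"]
print(consensus(observed))  # prints AGGAGG
```

Note that three of the four sample instances differ from the consensus at one position, which is exactly the kind of variation we should expect to see in real genomes.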

Bacteria

In this lab, we focus on prokaryotes, which have somewhat more regularity in their upstream motifs. In particular, we will focus on bacteria.

For most bacteria (and some viruses that infect bacteria), the Shine-Dalgarno (SD) sequence commonly serves as the ribosomal binding site just upstream of the start codon. The consensus SD sequence in DNA is AGGAGG, though even this consensus varies across species. (For example, it is AGGAGGT for E. coli.) The SD sequence is generally located about 8bp upstream of the start codon (so-called position -8), though that distance can vary over a range such as 4 to 10bp ahead of the start codon.
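
One way to think about a "reasonable match" for the SD sequence is to slide a window over the region just before the start codon and count mismatches against the consensus. The following is a minimal sketch of that idea, not the method used by the provided software. The names mismatches and best_sd_match are hypothetical, and we assume here that the distance is measured as the number of bases between the end of the motif and the start codon (the text above does not pin down which end of the motif the position refers to).

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_sd_match(upstream, consensus="AGGAGG", min_gap=4, max_gap=10):
    """Find the window closest to the consensus near the start codon.

    upstream is the DNA immediately before the start codon, so its last
    character sits at position -1. Returns (mismatch_count, gap, window)
    for the best window, or None if upstream is too short.
    Assumption: gap = number of bases between the motif and the start codon.
    """
    upstream = upstream.upper()
    k = len(consensus)
    best = None
    for gap in range(min_gap, max_gap + 1):
        end = len(upstream) - gap
        start = end - k
        if start < 0:
            break  # larger gaps would only run further off the left edge
        window = upstream[start:end]
        candidate = (mismatches(window, consensus), gap, window)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


# Hypothetical upstream region containing GGGAGG, 8 bases before the start codon
print(best_sd_match("TTAAGGGAGGACTTCAAT"))  # prints (1, 8, 'GGGAGG')
```

How many mismatches still count as a believable SD sequence is a judgment call; the sketch only reports the best candidate and leaves that decision to you.
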
Promoters for genes vary greatly as well, but a common motif for bacteria includes the following combination:
- The consensus pattern TATAAT in the neighborhood of -10bp ahead of the transcriptional unit.
- The consensus pattern TTGACA in the neighborhood of -35bp ahead of the transcriptional unit.
  (More specifically it should be 15-19bp upstream of the TATAAT promoter.)
Note well that the transcriptional unit includes not only the ORF but also whatever upstream portion (including the SD sequence) is transcribed to RNA, so when discussing the -10 position it is not immediately evident where that is.
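
The combination described above can be sketched as a search for a near-match to TTGACA followed, after a spacer of 15 to 19bp, by a near-match to TATAAT. This is only an illustration under stated assumptions: the function name find_promoter_pairs and the max_mismatch threshold are hypothetical, and we assume the 15-19bp spacing is measured from the end of the -35 pattern to the start of the -10 pattern.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_pairs(region, box35="TTGACA", box10="TATAAT",
                        min_spacer=15, max_spacer=19, max_mismatch=2):
    """Return a list of (pos35, pos10, mm35, mm10) candidate promoter pairs.

    pos35 and pos10 are 0-based indices into region where each pattern
    begins; mm35 and mm10 are the mismatch counts against each consensus.
    Assumption: the spacer is the number of bases between the two patterns.
    """
    region = region.upper()
    k35, k10 = len(box35), len(box10)
    hits = []
    for i in range(len(region) - k35 + 1):
        mm35 = mismatches(region[i:i + k35], box35)
        if mm35 > max_mismatch:
            continue
        for spacer in range(min_spacer, max_spacer + 1):
            j = i + k35 + spacer
            if j + k10 > len(region):
                break  # the -10 window would run past the end of the region
            mm10 = mismatches(region[j:j + k10], box10)
            if mm10 <= max_mismatch:
                hits.append((i, j, mm35, mm10))
    return hits


# Hypothetical region: exact consensus patterns separated by a 17bp spacer
example = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_promoter_pairs(example, max_mismatch=0))  # prints [(2, 25, 0, 0)]
```

Because such short patterns occur by chance quite often in a long genome, allowing more mismatches quickly produces many false candidates; any hits should be treated as suggestions rather than confirmed promoters.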

Thus an idealized version of a bacterial gene might have an upstream region as shown in the following figure:

Note well that the actual sequence might not perfectly match the consensus, as with the pattern GGGAGG for the SD sequence (rather than consensus AGGAGG) or the pattern TTGCTA for the -35 promoter (rather than consensus TTGACA).

However, another complicating factor for prokaryotes is that sometimes several consecutive genes act as a single operon, in which case only the first of those genes is preceded by a promoter. In this case, the operon might have a pattern akin to

Your Task

In this lab, we ask you to manually examine the upstream regions for ORFs that are found in three reference genomes:

MG833025.1
This is an enterobacteria phage, a virus that infects E. coli. As a virus, it has a relatively short genome of length 38553bp.
NZ_BIRV01000016.1
This is a small portion of the genome for Clostridioides difficile, a bacterium that causes diarrhea and colitis. This portion has length 70827bp.
AE016877.1
This is Bacillus cereus, a bacterium commonly found in soil and food. Its full genome has length 5411809bp.

Our primary focus will be on looking for a Shine-Dalgarno sequence preceding each ORF (as there may not be promoter motifs for genes that are part of operons). We will provide you with software that locates ORFs and shows the upstream region just before the start codon of each ORF.

We ask that you do the following for at least two of the three reference genomes:

Run the software using a minimum ORF length of 200bp. The software reports the results sorted from longest to shortest.
Manually examine the 10 longest and 10 shortest ORFs that were reported, and determine whether you believe there is a reasonable match for the consensus SD sequence AGGAGG. Record your observations.
Then we wish for you to compare your "predicted genes" to those genes that are identified within GenBank. To ease the task of cross-referencing, we will provide additional files that will allow the software to tag such identified genes. (But do NOT turn on those tags until you have tried to do the analysis manually.)
If time remains, see whether you can find any likely promoter pairs that match or nearly match the consensus TTGACA....TATAAT sequences upstream of the transcriptional unit (remembering that the promoter may be tens or even hundreds of base pairs earlier than the start codon).
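
To give a rough sense of what the first step involves, here is a minimal sketch of locating ORFs and showing the region just before each start codon. This is NOT the provided orf.py; the function name find_orfs and its parameters are hypothetical. As simplifying assumptions, it scans only the three forward reading frames, treats an ORF as running from the first ATG in a frame to the next in-frame stop codon, and counts the stop codon in the ORF length.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_length=200, upstream_len=20):
    """Return (length, start, upstream) tuples, longest ORF first.

    start is the 0-based index of the A in the ATG start codon, and
    upstream is up to upstream_len bases immediately before it.
    Assumption: forward strand only, length includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember the first start codon in this stretch
            elif start is not None and codon in STOPS:
                length = i + 3 - start
                if length >= min_length:
                    upstream = dna[max(0, start - upstream_len):start]
                    orfs.append((length, start, upstream))
                start = None  # resume looking for the next start codon
    orfs.sort(key=lambda orf: orf[0], reverse=True)
    return orfs


# Hypothetical toy sequence; a real run would use min_length=200 on a genome
toy = "CCAGGAGGCCCCCCCC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
for length, start, upstream in find_orfs(toy, min_length=12, upstream_len=16):
    print(length, start, upstream)
```

A real analysis would also need to consider the reverse complement strand, which this sketch omits for brevity.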

(blank report sheet)

Software and Data Files

You may download the single Python script, orf.py, or the bigger zip file that has that script as well as the supporting data files.

Michael Goldwasser

CSCI 1020, Spring 2019
Last modified: Monday, 15 April 2019

Saint Louis University

Computer Science Department
