Course Home | Assignments | Data Sets/Tools | Python | Schedule | Git Submission | Tutoring

Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2019

Computer Science Department

Lab: Sequencing Technologies

Overview

Topic: Sequencing Technologies
Related Reading: Ch. 8 of text.
Collaboration Policy: The lab should be completed working in pairs
Due: 11:59pm Wednesday, 3 April 2019

We base this lab on two of the three parts of the authors' Web Explorations for Chapter 08. However, some of the online tools have changed since the book was published, so the authors have graciously offered us an updated version of those activities. I have password-protected them with the same password used to access solution sets (ask me if you need the credentials).

Updated Chapter 08 Web Explorations

I am also making some changes to the set of questions you must answer, so please follow my outline of the lab, but see the reading for far more discussion of the biology.

Note: This lab relies on visual recognition of colors. If this presents a challenge for you, please let me know and we'll accommodate.


Part I

  1. Go to the NCBI Trace Archive database.

  2. Select the "Trace Archive" tab.

  3. Select the "Obtaining Data" tab within.

  4. Select the "Registered Species" tab within.

  5. Navigate to find "Human Gut Metagenome" among the registered species (it's alphabetized under "H").

  6. Clicking on the "Human Gut Metagenome" entry (not the FTP link) will populate the appropriate query in the search bar above, and you can then click Submit to see the results.

  7. The default view of results will be a FASTA file of read sequences. Click the "in color" checkbox and then click "Show" to rerender the view.

  8. You should now have a view of the first 5 of 635,147 read sequences for this data set. The colors correspond to ranges of a "quality score" that indicates the confidence in the specific nucleotide called for that position. If you mouse over a particular nucleotide, it will show you the number of the nucleotide (e.g. the first is #1) and its quality score.

  9. STOP: Answer questions 1 through 3
  10. Change from FASTA to Trace in the "Retrieve" dialog, and then click the "Show" button to refresh the view. You should now see a graph of the underlying trace data. We will begin by focusing only on the first read sequence. Click the "Quality Score" box under the first sequence to see an additional bar representation of the quality scores for each called nucleotide.

  11. What you should see is a graph of wavelengths and intensities for four different light frequencies measured from fluorescent nucleotides. (The graph colors them as A=green, C=blue, G=black, T=red.) Take a moment to examine how the peaks in the wavelengths are being read as nucleotides in the DNA sequence.

  12. STOP: Answer questions 4 through 6
  13. The next task is to see how well you could manually call the nucleotide sequence when viewing the trace. To do this, we ask you to intentionally obscure the portion of the view that shows the nucleotides and quality scores. We will then have you use the interface to enter another base#, which will recenter the view with the indicated base# at the center of the screen. Try this out using base# 100 as a test, and then complete the following questions.

  14. STOP: Answer questions 7 through 9
  15. Note that by default this is displaying 5 traces per page and we have only examined the first 5 of 40000 traces in this set. Each is identified with a TI#, for example TI# 2178906716 for the first sequence we have examined.

    We wish to examine TI# 2178908254. This should be found as the fourth trace displayed on page 308. Go to that trace, turn on the Quality Score bars, then center the view on base# 200.

  16. STOP: Answer question 10
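The quality scores shown in steps 8 and 10 can also be read as error probabilities. The following is a minimal sketch, assuming the Trace Archive scores follow the standard Phred convention Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The function names and the sample scores are made up for illustration and are not taken from the lab data.

```python
def phred_to_error_prob(q):
    """Convert a Phred-style quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def high_quality_ranges(scores, threshold=40):
    """Return (start, end) pairs of 1-based positions whose scores stay >= threshold."""
    ranges = []
    start = None
    for pos, q in enumerate(scores, start=1):
        if q >= threshold and start is None:
            start = pos                      # a high-quality run begins here
        elif q < threshold and start is not None:
            ranges.append((start, pos - 1))  # the run ended at the previous base
            start = None
    if start is not None:
        ranges.append((start, len(scores)))  # run continues to the end of the read
    return ranges


# Hypothetical scores for a very short read (not real data).
example_scores = [8, 12, 25, 41, 45, 50, 38, 44, 47, 20, 9]
print(phred_to_error_prob(40))              # about 1 error in 10,000 calls
print(high_quality_ranges(example_scores))  # [(4, 6), (8, 9)]
```

Under this convention a score of 40 corresponds to roughly one expected error in 10,000 calls, while a score of 10 corresponds to one in 10. Because real scores fluctuate, a run-based summary like this one will usually report several short ranges rather than one clean interval.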

Questions

  1. The first read sequence displayed has read length 1393nt. What are the lengths of the other four sequences on the first page of results?

  2. Each read sequence tends to have a quality pattern in which there is less clarity at the beginning, then a period with relatively good quality, and then a tail of the sequence with lower quality ratings. For the first three sequences shown, provide an estimate of the range of nucleotide locations that have quality ratings of 40 or higher. (We say "estimate" because the quality rating does fluctuate within subsequences.)

  3. Nucleotide #9 of the first sequence is N. What does this indicate?

  4. Carefully examine the trace for nucleotides 28 through 42. In your own words, explain how the trace corresponds to the nucleotide sequence that was called.

  5. Examine the trace that led to a call of N at nucleotide 9. What conditions do you believe led to the call of N?

  6. What pattern do you observe with the quality scores over the first 90 nucleotides of the sequence?

  7. As indicated in the instructions, jump to base# 300 and manually call the next 20 nucleotides (while keeping the answers obscured). Report your results. Now uncover the answers and report on how your predictions compare to the actual results.

  8. Repeat the above exercise starting at Base# 500.

  9. Repeat the above exercise starting at Base# 700.

  10. Give your general observations of the trace and called nucleotides in the neighborhood of nucleotides 160-240. Note well the large number of N calls. Can you offer an explanation for those N's?
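For questions 7 through 9, one way to report how your manual calls compare with the called nucleotides is to list the positions where they disagree. This is only a sketch: the function name `compare_calls` and the two 20-base strings are invented for illustration and do not come from the trace data.

```python
def compare_calls(manual, actual):
    """Return the 1-based offsets where two equal-length base calls differ."""
    if len(manual) != len(actual):
        raise ValueError("sequences must have the same length")
    manual, actual = manual.upper(), actual.upper()
    return [i for i, (m, a) in enumerate(zip(manual, actual), start=1) if m != a]


# Made-up 20-base example, not from the lab data.
manual = "ACGTTGCAACGTAGCTAGCT"
actual = "ACGTTGCTACGTAGNTAGCT"

mismatches = compare_calls(manual, actual)
agreement = 1 - len(mismatches) / len(actual)
print(mismatches)           # [8, 15]
print(f"{agreement:.0%}")   # 90%
```

The offsets are relative to the start of the 20-base window, so add the starting base# to locate a mismatch in the full read.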


Part II

We will skip this part, but you are welcome to complete it on your own. It introduces a variety of analysis tools available at the Galaxy web site.


Part III

This part of the experiment involves sequence assembly from a set of reads. We will use a set of 2,500 simulated 454 sequencing reads in FASTA format, provided by the authors. These reads are between 100 and 500 bases each and contain between 1 and 10 random substitutions or deletions to simulate the errors inherent in sequencing data.
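To get a feel for what such a data set looks like, here is a rough sketch of how reads of this kind could be simulated. The authors do not say how they generated their reads, so everything here is an assumption: the random reference genome, the 50/50 split between substitutions and deletions, and the function name `simulate_read` are all hypothetical.

```python
import random


def simulate_read(reference, rng, min_len=100, max_len=500, min_err=1, max_err=10):
    """Cut a random fragment from reference and add random substitutions or deletions."""
    length = rng.randint(min_len, min(max_len, len(reference)))
    start = rng.randint(0, len(reference) - length)
    bases = list(reference[start:start + length])
    for _ in range(rng.randint(min_err, max_err)):
        pos = rng.randrange(len(bases))
        if rng.random() < 0.5:
            # Substitution: replace the base with a different one.
            bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
        else:
            # Deletion: drop the base entirely.
            del bases[pos]
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
reference = "".join(rng.choice("ACGT") for _ in range(5000))  # made-up genome
reads = [simulate_read(reference, rng) for _ in range(2500)]
print(len(reads), min(map(len, reads)), max(map(len, reads)))
```

In this sketch deletions are applied after the fragment is cut, so a read can end up a few bases shorter than `min_len`.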

  1. Download the sequence reads FASTA file, either from the publisher's website or our local copy, and save it to your computer as reads.txt.

  2. Go to the EGassembler homepage.

  3. Paste or upload the read sequence data.

  4. There are many parts of the default pipeline. Leave checked the sequence cleaning process and the final sequence assembly process, but uncheck/disable the other steps of the process.

  5. Click "submit" to start the process. It will take you to a page with some summary information and a link to results, but the link to results will not be active until it is done processing (hopefully in a few minutes).

  6. When done, there are three files of interest: one containing the resulting contigs, one containing any singletons that were not assembled, and one showing the alignment of the individual reads in the consensus contig sequence. Download all three (individually, or as the single zip file that is offered with all three).
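Before and after assembly, it can help to check how many sequences a FASTA file holds and how long they are. The following is a minimal sketch of a FASTA reader written from scratch. The names `read_fasta` and `summarize` are our own, and the tiny example records are made up.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # a new record starts here
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


def summarize(records):
    """Return (number of sequences, shortest length, longest length)."""
    lengths = [len(seq) for seq in records.values()]
    return len(lengths), min(lengths), max(lengths)


# Tiny made-up example; a real file would be passed as an open file object.
example = [">read1", "ACGTAC", "GT", ">read2", "TTGACA"]
print(summarize(read_fasta(example)))  # (2, 6, 8)
```

To apply this to the lab data, open the file and pass it in, e.g. `with open("reads.txt") as f: records = read_fasta(f)`. Note that this sketch assumes every header line is unique; records with a repeated header would overwrite one another.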

Questions

  1. How many sequence reads were rejected in the sequence cleaning process? Can you determine why they were rejected?

  2. Use BLAST to compare your contig sequence with known sequences in GenBank. The assembled sequence should match one known sequence with a high degree of similarity. What have we sequenced? How long is its genome?

  3. Looking at the contig alignment file in the EGassembler results, you should be able to see hundreds if not thousands of small sequencing errors among the sequence reads. How was the assembler able to generate a correct contig sequence (as compared with the known sequence in the database) despite these errors? Explain how the sequence errors were accurately corrected. Were all errors caught, or did some remain in the final contig sequence?


Michael Goldwasser
Last modified: Monday, 01 April 2019