
Saint Louis University

Computer Science 1020
Introduction to Computer Science: Bioinformatics

Michael Goldwasser

Spring 2019

Computer Science Department

Lab: BLAST

Overview

Topic: Use of BLAST
Related Reading: Ch. 4 of text.
Due: 11:59pm Wednesday, 20 February 2019

BLAST (Basic Local Alignment Search Tool) is a suite of tools for efficiently searching large sequence collections for near matches to a query sequence. We will focus on NCBI's BLAST suite, and primarily on its nucleotide search.

In this lab, we explore use of the BLAST algorithm for efficiently finding (approximate) sequence matches within large datasets. Our goals are twofold:


Keeping Track of Time

For many of these experiments, we wish to know how long queries take on NCBI's site. Unfortunately, the reported results do not appear to include the actual running time once a query has completed. You will instead need to watch a clock and keep track of the elapsed time yourself. Also note that you will only get an approximate range of times, because the results web page refreshes automatically at intervals that grow longer as time progresses (though they shorten again when the site knows the query is near completion). Thus you might know that a query was not yet done after 12 seconds but was complete by 30 seconds.
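If you would rather not juggle a stopwatch by hand, the bracketing idea described above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the lab materials: the class name `QueryTimer` and its methods are hypothetical, and it does not talk to NCBI at all. You start it when you submit the query and tell it what you see at each page refresh.

```python
import time


class QueryTimer:
    """Track lower and upper bounds on how long a query took.

    Hypothetical helper: it only records what you report seeing.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()      # moment the query was submitted
        self.last_pending = 0.0    # latest time the query was seen still running
        self.first_done = None     # earliest time the query was seen complete

    def elapsed(self):
        """Seconds since the timer was created."""
        return self._clock() - self._start

    def saw_pending(self):
        """Call when a refresh shows the query is still running."""
        self.last_pending = self.elapsed()

    def saw_done(self):
        """Call when a refresh first shows the results."""
        if self.first_done is None:
            self.first_done = self.elapsed()

    def bounds(self):
        """Return (lower, upper) bounds on the query time in seconds."""
        return (self.last_pending, self.first_done)
```

Create a `QueryTimer()` as you click the search button, call `saw_pending()` each time the page refreshes without results, and call `saw_done()` when the results appear. The pair returned by `bounds()` is the range to record, for example "between 12 and 30 seconds".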

Note as well that NCBI limits the overall computation time that they will devote to a single query, and so if a query is taking too long you might encounter the following error:


Sample Query Sequences

filename               length  description
strep.fasta            738nt   Streptococcus agalactiae ermB coding sequence from Ch. 4 of the book
strep_perturb10.fasta  715nt   Same as above, but perturbed with 10% chance of a nucleotide being replaced/deleted
strep_perturb20.fasta  705nt   Same as above, but perturbed with 20% chance of a nucleotide being replaced/deleted
strep_perturb30.fasta  689nt   Same as above, but perturbed with 30% chance of a nucleotide being replaced/deleted
strep_perturb40.fasta  672nt   Same as above, but perturbed with 40% chance of a nucleotide being replaced/deleted

tom.fasta              4440nt  The CFTR allele for "Tom" from an earlier lab
tom_drop10.fasta       3996nt  Same as above, but intentionally omitting every 10th nucleotide
tom_drop8.fasta        3885nt  Same as above, but intentionally omitting every 8th nucleotide
tom_drop7.fasta        3806nt  Same as above, but intentionally omitting every 7th nucleotide
tom_drop5.fasta        3552nt  Same as above, but intentionally omitting every 5th nucleotide

random150.fasta        150nt   150 randomly generated nucleotides
random300.fasta        300nt   300 randomly generated nucleotides
random600.fasta        600nt   600 randomly generated nucleotides
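
To make the "replaced/deleted" perturbation concrete, here is a minimal Python sketch of how such a file might be generated. It is not the script that produced the strep_perturb files. The function name `perturb` is hypothetical, and the `delete_fraction` parameter (how often a perturbed nucleotide is deleted rather than replaced) is purely an assumption, since the handout does not say how the two cases were split.

```python
import random


def perturb(seq, rate, delete_fraction=0.25, rng=None):
    """Return a copy of seq in which each nucleotide is perturbed with
    probability `rate`.

    A perturbed nucleotide is deleted with probability `delete_fraction`
    (an assumed value) and otherwise replaced by a different nucleotide.
    """
    if rng is None:
        rng = random.Random()
    result = []
    for base in seq:
        if rng.random() < rate:
            if rng.random() < delete_fraction:
                continue  # deletion: skip this nucleotide entirely
            # replacement: pick one of the other nucleotides
            base = rng.choice([b for b in "ACGT" if b != base])
        result.append(base)
    return "".join(result)
```

For example, `perturb(original, 0.10)` would give a sequence comparable in spirit to strep_perturb10.fasta. Because deletions shorten the sequence while replacements do not, the perturbed files in the table are shorter than the original by less than the perturbation rate.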

Exploration: Database Size

The size and scope of the search can be controlled both by choosing which database to use and, optionally, by excluding or limiting entries within that database. After a query is complete, it is possible to determine the number of sequences and total number of nucleotides within the selected data set by expanding the "Search Summary" and examining the "Database" details.

Perform a search on tom.fasta, using the blastn algorithm with default parameters, and each of the following database settings. For each trial, record the approximate query time and the number of sequences and nucleotides in the database.

Can you draw any conclusions regarding the relationship between the database size and the query time? What database factors other than size might affect query time?


Exploration: Match Quality

We are interested in how the algorithms behave when we introduce intentional mutations in a sequence. Specifically, we examine the various degradations of the Strep files, under the following conditions:

Can you draw any conclusions regarding the relationship between the degradation and both the query time and quality of found matches?


Exploration: Algorithmic Choice

Repeat the above tests on the various degradations of strep, but this time using the megablast algorithm (and default parameters).


Exploration: Further degradation

As another form of degradation, we took the Tom allele and created versions of it by removing every k-th nucleotide, for various choices of k. Perform searches with both blastn and megablast, and report on how well each degraded version matches CFTR at the various levels of degradation.
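
The removal of every k-th nucleotide can be sketched in Python as follows. This is a minimal illustration rather than the script used to build the files. The function name `drop_every_kth` is hypothetical, and counting positions from 1 is an assumption, though it is consistent with the lengths listed for the tom_drop files.

```python
def drop_every_kth(seq, k):
    """Return seq without its k-th, 2k-th, 3k-th, ... characters,
    counting positions from 1."""
    return "".join(base for i, base in enumerate(seq, start=1) if i % k != 0)
```

For a 4440nt sequence this removes 444 nucleotides when k is 10, leaving 3996, which agrees with the length given for tom_drop10.fasta. The same check works for k of 8, 7, and 5.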


Exploration: Conservation Across Organisms

Let's return to the tom.fasta file, which we know is an allele of human CFTR. Repeat a search, using blastn, and the nucleotide collection (nr/nt) database excluding Models (XM/XP) and Uncultured. In addition, for Organism select humans (taxid:9605) and check the exclude box.


Michael Goldwasser
Last modified: Friday, 22 February 2019