Saint Louis University
Computer Science Department
Computer Science 1020

Topic: Use of BLAST
Related Reading: Ch. 4 of text.
Due: 11:59pm Wednesday, 20 February 2019
BLAST (Basic Local Alignment Search Tool) is a suite of tools designed to allow efficient searches for near-matching sequences across large data collections. We will focus on NCBI's BLAST suite, and primarily on their nucleotide search.
In this lab, we explore use of the BLAST algorithm for efficiently finding (approximate) sequence matches within large datasets. Our goals are twofold:
Gain practical experience in the mechanics of submitting BLAST queries and examining results.
Explore some algorithmic performance tradeoffs for BLAST queries, such as how the time or quality depends on size of database, properties of the query string, and choice of algorithm and parameters.
For many of these experiments, we wish to know how long queries take on NCBI's site. Unfortunately, the reported results do not appear to include the actual running time once a query has completed. You will instead need to keep track of the elapsed time yourself. Note also that you will only get an approximate range of times, because the results web page refreshes automatically at intervals that grow longer as time progresses (and become shorter again when it senses completion is near). Thus you might know that a query was not yet done after 12 seconds but was complete by 30 seconds.
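The bracketing logic described above can be sketched as a small polling helper. This is an illustrative sketch, not anything NCBI provides: `timed_poll` and its arguments are hypothetical names, and `is_done` stands in for whatever check (here, a page refresh) reveals whether the query has finished.

```python
import time

def timed_poll(is_done, intervals):
    """Poll is_done() after each pause in `intervals` (seconds).

    Returns (last_seen_running, first_seen_done): a bracket on the
    true completion time, mirroring how the refreshing results page
    only lets you bound the query time between two observations.
    The second value is None if still running after the final poll.
    """
    start = time.monotonic()
    last_seen_running = 0.0
    for pause in intervals:
        time.sleep(pause)
        elapsed = time.monotonic() - start
        if is_done():
            return last_seen_running, elapsed
        last_seen_running = elapsed
    return last_seen_running, None
```

With intervals that grow like the page's refresh spans, a query seen running at one poll and done at the next is known only to have finished somewhere in that window, exactly the 12-to-30-second situation described above.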
Note as well that NCBI limits the overall computation time devoted to a single query, so if a query is taking too long you might encounter the following error:
filename | length | description |
---|---|---|
strep.fasta | 738nt | Streptococcus agalactiae ermB coding sequence from Ch. 4 of the book |
strep_perturb10.fasta | 715nt | Same as above, but perturbed with 10% chance of a nucleotide being replaced/deleted |
strep_perturb20.fasta | 705nt | Same as above, but perturbed with 20% chance of a nucleotide being replaced/deleted |
strep_perturb30.fasta | 689nt | Same as above, but perturbed with 30% chance of a nucleotide being replaced/deleted |
strep_perturb40.fasta | 672nt | Same as above, but perturbed with 40% chance of a nucleotide being replaced/deleted |
tom.fasta | 4440nt | The CFTR allele for "Tom" from an earlier lab |
tom_drop10.fasta | 3996nt | Same as above, but intentionally omitting every 10th nucleotide |
tom_drop8.fasta | 3885nt | Same as above, but intentionally omitting every 8th nucleotide |
tom_drop7.fasta | 3806nt | Same as above, but intentionally omitting every 7th nucleotide |
tom_drop5.fasta | 3552nt | Same as above, but intentionally omitting every 5th nucleotide |
random150.fasta | 150nt | 150 randomly generated nucleotides |
random300.fasta | 300nt | 300 randomly generated nucleotides |
random600.fasta | 600nt | 600 randomly generated nucleotides |
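To make the three kinds of degradation concrete, here is a sketch of how such test files could be generated. These helpers are illustrative, not the scripts actually used to build the files; in particular, the handout does not say how each perturbation splits between replacement and deletion, so the 50/50 split below is an assumption.

```python
import random

BASES = "ACGT"

def perturb(seq, p, seed=0):
    """With independent probability p, disturb each nucleotide.
    Whether a disturbance is a replacement or a deletion is assumed
    to be a 50/50 split (the handout does not specify)."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < p:
            if rng.random() < 0.5:
                continue                    # deletion
            out.append(rng.choice(BASES))   # replacement (may repeat base)
        else:
            out.append(base)
    return "".join(out)

def drop_every_kth(seq, k):
    """Omit every k-th nucleotide (positions k, 2k, 3k, ...)."""
    return "".join(b for i, b in enumerate(seq, start=1) if i % k)

def random_seq(n, seed=0):
    """n uniformly random nucleotides."""
    rng = random.Random(seed)
    return "".join(rng.choice(BASES) for _ in range(n))
```

Note how the lengths work out: dropping every 10th nucleotide of a 4440nt sequence leaves 3996nt, matching tom_drop10.fasta above.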
The size and scope of the database can be affected both by picking which database to use and, optionally, by excluding or limiting entries within that database. After a query is complete, you can determine the number of sequences and the total number of nucleotides in the selected data set by expanding the "Search Summary" and examining the "Database" details.
Perform a search on tom.fasta, using the blastn algorithm with default parameters, and each of the following database settings. For each trial, record the approximate query time and the number of sequences and nucleotides in the database.
Can you draw any conclusions regarding the relationship between the database size and the query time? What database factors other than size might affect query time?
We are interested in how the algorithms behave when we introduce intentional mutations in a sequence. Specifically, we examine the various degradations of the Strep files, under the following conditions:
Can you draw any conclusions regarding the relationship between the degradation and both the query time and quality of found matches?
Repeat the above tests on the various degradations of strep, but this time using the megablast algorithm (and default parameters).
As another form of degradation, we took the Tom allele and created versions of it by uniformly removing every k-th nucleotide for various choices of k. Perform searches with both blastn and megablast, and report on how well each version matches to CFTR for the various levels of degradation.
Let's return to the tom.fasta file, which we know is an allele of human CFTR. Repeat a search, using blastn, and the nucleotide collection (nr/nt) database excluding Models (XM/XP) and Uncultured. In addition, for Organism select humans (taxid:9605) and check the exclude box.
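The same restriction can be expressed when scripting a search. As a sketch, the organism exclusion maps to a standard Entrez filter term, which Biopython's `Bio.Blast.NCBIWWW.qblast` accepts via its `entrez_query` parameter; the helper name below is hypothetical, and the live network call is shown commented out.

```python
def exclude_organism(taxid):
    """Entrez filter string excluding a taxon, mirroring the web
    form's Organism field with the exclude box checked
    (e.g. taxid 9605 as selected above)."""
    return "all [filter] NOT txid{}[ORGN]".format(taxid)

# With Biopython installed, a comparable blastn search could be issued as:
# from Bio.Blast import NCBIWWW
# handle = NCBIWWW.qblast("blastn", "nt", open("tom.fasta").read(),
#                         entrez_query=exclude_organism(9605))
```

(The web form's Models (XM/XP) and Uncultured exclusions have Entrez equivalents as well, but their exact filter terms are not covered here.)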