Computer Science 1020
Introduction to Computer Science: Bioinformatics

Lab: BLAST

Overview

Topic: Use of BLAST
Related Reading: Ch. 4 of text.
Due: 11:59pm Wednesday, 20 February 2019

BLAST (Basic Local Alignment Search Tool) is a suite of tools for efficiently searching large data collections for sequences that nearly match a query. We will focus on NCBI's BLAST suite, and primarily on its nucleotide search.

In this lab, we explore use of the BLAST algorithm for efficiently finding (approximate) sequence matches within large datasets. Our goals are twofold:

Gain practical experience in the mechanics of submitting BLAST queries and examining results.
Explore some algorithmic performance tradeoffs for BLAST queries, such as how the running time or match quality depends on the size of the database, the properties of the query string, and the choice of algorithm and parameters.

Keeping Track of Time

For many of these experiments, we wish to know how long queries take on NCBI's site. Unfortunately, the reported results do not seem to include the actual running time once a query has completed. You will instead need to watch the clock and keep track of the elapsed time yourself. Also note that you will only get an approximate range of times, because the results web page refreshes automatically at intervals that grow longer as time progresses (though they shorten again when the site knows the query is near completion). Thus you might know only that a query was not yet done after 12 seconds but was complete by 30 seconds.
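If you jot down what you saw at each page refresh, the range of possible running times can be read off mechanically. The following is a minimal sketch of that bookkeeping; the function name `completion_window` and the list-of-pairs format are our own choices for illustration and are not part of NCBI's site.

```python
def completion_window(checks):
    """Return (lower, upper) bounds, in seconds, on when a query finished.

    `checks` is a chronological list of (elapsed_seconds, done) pairs,
    one per page refresh that you observed. The upper bound is None if
    the query was never seen to finish.
    """
    last_not_done = 0
    for elapsed, done in checks:
        if done:
            return (last_not_done, elapsed)
        last_not_done = elapsed
    return (last_not_done, None)


if __name__ == "__main__":
    # The situation described above: not done at 12 s, done by 30 s.
    print(completion_window([(12, False), (30, True)]))
```

Recording the window rather than a single number makes it harder to over-interpret small differences between trials.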

Note as well that NCBI limits the overall computation time that they will devote to a single query, so if a query is taking too long you might encounter an error rather than results.

Sample Query Sequences

filename	length	description
`strep.fasta`	738nt	Streptococcus agalactiae ermB coding sequence from Ch. 4 of the book
`strep_perturb10.fasta`	715nt	Same as above, but perturbed with 10% chance of a nucleotide being replaced/deleted
`strep_perturb20.fasta`	705nt	Same as above, but perturbed with 20% chance of a nucleotide being replaced/deleted
`strep_perturb30.fasta`	689nt	Same as above, but perturbed with 30% chance of a nucleotide being replaced/deleted
`strep_perturb40.fasta`	672nt	Same as above, but perturbed with 40% chance of a nucleotide being replaced/deleted

`tom.fasta`	4440nt	The CFTR allele for "Tom" from an earlier lab
`tom_drop10.fasta`	3996nt	Same as above, but intentionally omitting every 10th nucleotide
`tom_drop8.fasta`	3885nt	Same as above, but intentionally omitting every 8th nucleotide
`tom_drop7.fasta`	3806nt	Same as above, but intentionally omitting every 7th nucleotide
`tom_drop5.fasta`	3552nt	Same as above, but intentionally omitting every 5th nucleotide

`random150.fasta`	150nt	150 randomly generated nucleotides
`random300.fasta`	300nt	300 randomly generated nucleotides
`random600.fasta`	600nt	600 randomly generated nucleotides
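To make the "perturbed" descriptions in the table concrete, here is a rough sketch of one way such files could be produced. It is not the script that generated the lab files: the function name `perturb`, the `delete_fraction` parameter, and its default value are all assumptions made for illustration, since the table says only that a perturbed nucleotide is "replaced/deleted".

```python
import random

NUCLEOTIDES = "ACGT"


def perturb(seq, rate, delete_fraction=0.3, seed=None):
    """Return a perturbed copy of `seq`.

    Each nucleotide is perturbed independently with probability `rate`.
    A perturbed nucleotide is deleted with probability `delete_fraction`
    (an assumed value, not taken from the lab) and is otherwise replaced
    by a different nucleotide chosen uniformly at random.
    """
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() >= rate:
            out.append(base)      # left unchanged
        elif rng.random() < delete_fraction:
            continue              # deleted
        else:
            choices = [n for n in NUCLEOTIDES if n != base]
            out.append(rng.choice(choices))   # replaced
    return "".join(out)


if __name__ == "__main__":
    original = "ACGT" * 20
    mutated = perturb(original, 0.10, seed=1020)
    print(len(original), len(mutated))
```

Because only deletions change the length, the perturbed files in the table are somewhat shorter than the 738nt original, and more so at higher perturbation rates.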

Exploration: Database Size

The size and scope of the search can be controlled both by choosing which database to use and, optionally, by excluding or limiting entries within that database. After a query is complete, you can determine the number of sequences and the total number of nucleotides in the selected data set by expanding the "Search Summary" and examining the "Database" details.

Perform a search on tom.fasta, using the blastn algorithm with default parameters, and each of the following database settings. For each trial, record the approximate query time and the number of sequences and nucleotides in the database.

Human genomic + transcript (Human G+T)
Mouse genomic + transcript (Mouse G+T)
Nucleotide collection (nr/nt)
Nucleotide collection (nr/nt) but excluding Models (XM/XP)

Can you draw any conclusions regarding the relationship between the database size and the query time? What database factors other than size might affect query time?

Exploration: Match Quality

We are interested in how the algorithms behave when we introduce intentional mutations in a sequence. Specifically, we examine the various degradations of the Strep files, under the following conditions:

Use blastn with default parameters
Use nucleotide collection (nr/nt) database excluding Models (XM/XP)

Can you draw any conclusions regarding the relationship between the degradation and both the query time and quality of found matches?
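One convenient way to put a number on match quality is percent identity. The sketch below assumes the usual definition, namely the number of aligned columns in which both sequences have the same nucleotide, divided by the total alignment length, with gap columns (written as `-`) counted as non-matches. The function name is our own, and you should compare its output with the identity figures shown on the results page rather than treat it as NCBI's exact computation.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns in which the two rows agree.

    Both arguments are rows of a pairwise alignment, so they must have
    equal length; '-' marks a gap and never counts as a match.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


if __name__ == "__main__":
    # One substitution in ten columns.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
    # One deletion in the query, shown as a gap.
    print(percent_identity("AC-T", "ACGT"))
```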

Exploration: Algorithmic Choice

Repeat the above tests on the various degradations of strep, but this time using the megablast algorithm (and default parameters).
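One parameter that differs between the two algorithms is the word size used to seed a search with short exact matches; to our knowledge the NCBI defaults are 11 for blastn and 28 for megablast, but check the "Algorithm parameters" section of the form to confirm. The sketch below is not BLAST itself. It only counts how many distinct words of a given length two sequences share, which gives a feel for why a larger word size is faster but less tolerant of mutations. The function names are ours.

```python
def words(seq, w):
    """Return the set of all length-w substrings (words) of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def shared_word_count(query, subject, w):
    """Count the distinct length-w words occurring in both sequences."""
    return len(words(query, w) & words(subject, w))


if __name__ == "__main__":
    original = "ACGTTGCAAGGCTTAACCGGTTACGATCGATCGGA"
    # A made-up variant with a single substitution near the middle.
    mutated = original[:17] + "T" + original[18:]
    for w in (11, 28):
        print(w, shared_word_count(original, mutated, w))
```

Try it on an original sequence and its degraded versions to see how quickly the shared words disappear as the word size grows.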

Exploration: Further Degradation

As another form of degradation, we took the Tom allele and created versions of it by uniformly removing every k-th nucleotide for various choices of k. Perform searches with both blastn and megablast, and report how well each degraded version matches CFTR at the various levels of degradation.
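The "drop" files are simple to reproduce. The sketch below shows one way to remove every k-th nucleotide (counting positions from 1); the function name `drop_every_kth` is ours, and the real files may have been produced differently. Since a sequence of length n loses n // k nucleotides, the lengths listed in the table can be checked by arithmetic alone.

```python
def drop_every_kth(seq, k):
    """Return seq with positions k, 2k, 3k, ... (1-indexed) removed."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return "".join(
        base for position, base in enumerate(seq, start=1)
        if position % k != 0
    )


if __name__ == "__main__":
    # Length check only: a placeholder string of 4440 characters stands
    # in for tom.fasta, since only the length matters here.
    placeholder = "A" * 4440
    for k in (10, 8, 7, 5):
        print(k, len(drop_every_kth(placeholder, k)))
```

The printed lengths should agree with the table of sample query sequences (3996, 3885, 3806, and 3552).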

Exploration: Conservation Across Organisms

Let's return to the tom.fasta file, which we know is an allele of human CFTR. Repeat the search using blastn and the nucleotide collection (nr/nt) database, excluding Models (XM/XP) and Uncultured. In addition, for Organism select humans (taxid:9605) and check the exclude box.

Michael Goldwasser

Saint Louis University

Spring 2019

Computer Science Department
