Assignment
Overview
Topic: Indices and Query Evaluation
Related Reading: Chs. 8-11
Due: 12:45pm Tuesday, 11 November 2003
This assignment will explore the effect of various indices when
querying a large database.
For this assignment, you may either work alone or with one other person.
If you do work as a pair, you should make sure that both names are at the top
of your submission.
TIGER database
The U.S. Census Bureau has a publically available database providing
all of the underlying data needed for a map of the United States. The
database provides coordinates and names for all roads, rivers,
railroads and other features and landmarks, together with street
addresses and zip codes. This data is updated every few years, and
is made available to all. It is the underlying database which in
turn is used by mapping applications such as MapQuest, Yahoo, etc, as
well as traditional print maps such as Rand McNally.
The datatbase is known as TIGER, an acronym for
"Topologically Integrated Geographic
Encoding and Referencing system. I have taken a
snapshot of a portion of this data from 2002, and entered it into our
local Oracle system. I pulled the entire database for the state of
Missouri, which already involves roughly 1GB of raw data files (though
I did not enter all types of information into Oracle).
Though the public data is not presented as a Relational Database, I
have interpetted it as such and developed the following ER diagram.
This, in turn, leads to a table schema with seven tables, one for each
of the above entity classes (all of the relationships were expressed
implicitly via foreign key constraints). In brief, the seven entities
are:
Note: For the last four entity sets, we have only entered data
for the state of Missouri.
The precise creation of our tables was based on the SQL script createtiger.sql. Sample data from
each table is presented here.
Your Assignment
For this assignment, you may either work alone or with one other person.
If you do work as a pair, you should make sure that both names are at the top
of your submission.
Your assignment is to explore the efficiency of some simple queries,
based on the use of indices. We have actually created three separate
copies of the underlying database:
Your task is to submit the following:
- Create a script, queries.sql, which contains at least 10
interesting queries on this database. An "interesting" query is one
which will display a markedly different execution time when run on
the three versions of the database. At the same time, if your query
is overly complicated, it will be more difficult to infer which of
many indices were used in evaluation.
- Generate a table of experimental running times for the queries
you have developed. Start your query script with the statement
"set timing on", so that Oracle will report the
elapsed computation time when evaluating each individual query. Run
your complete query set three different times on each of the three
different databases, reporting all execution times. It may even be
worthwhile to gather data on two or more different days.
(The reason we are having you repeat each query on the same database
is to measure the variance in execution times due to other factors,
such as the machine load at the time.)
-
Give a prose evaluation of your experiments. This may be several
pages long. To begin, highlight the experiments that expose the
(known) difference between the original tiger database and the tiger2
database.
Following this, develop your own hypotheses as to the existence of
indices in tiger3, the mystery database. Justify your hypotheses
based specifically on your experimental results.
In order to use the same script on each of the three databases, we
recommend the following. Write your original query.sql script based
on the standard tablenames (e.g., Chain, CFCC, Feature). Then, in
order to specify which copy of the database to use, setup appropriate
synonyms by running one of the following scripts immediately before
your queries: