Assignment

Overview

Topic: Indices and Query Evaluation
Related Reading: Chs. 8-11
Due: 12:45pm Tuesday, 11 November 2003
This assignment will explore the effect of various indices when querying a large database.

For this assignment, you may either work alone or with one other person. If you do work as a pair, you should make sure that both names are at the top of your submission.

  • We will first discuss the database we will be using for this assignment.
  • Then we will discuss the requirements of this assignment.

  • TIGER database

    The U.S. Census Bureau has a publically available database providing all of the underlying data needed for a map of the United States. The database provides coordinates and names for all roads, rivers, railroads and other features and landmarks, together with street addresses and zip codes. This data is updated every few years, and is made available to all. It is the underlying database which in turn is used by mapping applications such as MapQuest, Yahoo, etc, as well as traditional print maps such as Rand McNally.

    The datatbase is known as TIGER, an acronym for "Topologically Integrated Geographic Encoding and Referencing system. I have taken a snapshot of a portion of this data from 2002, and entered it into our local Oracle system. I pulled the entire database for the state of Missouri, which already involves roughly 1GB of raw data files (though I did not enter all types of information into Oracle).

    Though the public data is not presented as a Relational Database, I have interpetted it as such and developed the following ER diagram.

    This, in turn, leads to a table schema with seven tables, one for each of the above entity classes (all of the relationships were expressed implicitly via foreign key constraints). In brief, the seven entities are:

  • State (57 entities)
    Each state (as well as other U.S. Teritories) is given a unique two-digit code by the goverment. Though we only have detailed information loaded into the database for Missouri, we have the full list of states for completeness (some Missouri features border other states).
  • County (3,233 entities)
    Within each state, every county is given a unique three-digit code. I have portrayed this as a weak entity, in the sense that the three-digit codes are unique only within a state. To uniqely identify a particular county in the full database, you would need to use the pair of state and county codes.
  • CFCC (232 entities)
    Every road, waterway, railway landmark, or other entry in the database iis coded with a CFCC classification, identifying the class of the item. For example a road could be classified as 'A41' which is described as 'Local, neighborhood and rural road, city street, unseparated.'
  • Chain (1,288,119 entities)
    An atomic "piece" of every road, river, railroad, etc is called a chain, and assigned a unique identifying number, known as its TLID. On a road, for example, a single chain would generally be the part of the road between two intersections (such as the chain of N. Grand Blvd which runs between Lindell Blvd and W. Pine Blvd, having TLID 107607932).
  • Point (915,141 entities)
    Every chain is described by its two endpoints. Every point includes both longitude and lattitude coordinates, and is assigned a unique indentifying number, known as its TZID. For example, the intersection of Lindell Blvd and N. Grand Blvd is a single point in the database, with TZID of 11233807.
  • Feature (162,703 entities)
    A feature is essentially a name or label which we might associate with one or more chains. That is, the chain which is visible from Ritter Hall is commonly known as North Grand Blvd. Other named features include the "Mississippi River" as well as the "Missouri Pacific Railroad" line. Each feature is assigned a FEAT identifier, which is unique within the given County.

    Note that there may be a many-to-many relationship between features and chains. For example, many chains make up what is commonly known as Lindell Blvd. There are also some chains which have more than one common name, such as "United States Highway 40" a.k.a. "I-64".

  • ChainSide (2563954 entities)
    Each side of the chain is separately described as a ChainSide. Each side contains information such as what state, county, or zip code is on that side of the chain (some chains lie on borders, and thus the left and right sides may differ). Furthermore, information is given as to the postal addresses which lie on that side of the chain.
  • Note: For the last four entity sets, we have only entered data for the state of Missouri.

    The precise creation of our tables was based on the SQL script createtiger.sql. Sample data from each table is presented here.


    Your Assignment

    For this assignment, you may either work alone or with one other person. If you do work as a pair, you should make sure that both names are at the top of your submission.

    Your assignment is to explore the efficiency of some simple queries, based on the use of indices. We have actually created three separate copies of the underlying database:

  • tiger
    This is the one which was originally described, as created by SQL script createtiger.sql. On declaring primary and foreign keys, Oracle automatically makes indices for each delcared key. (if a composite primary key is declared, then Oracle makes one composite index for that key).
  • tiger2
    This table was based upon the same underlying data, however no primary or foreign keys were delcared, and thus no indices were created.
  • tiger3
    This is a mystery database. It is using the same underlying data, however we are not disclosing the choie of indices. We want you to hypothesize based on experiments.

    As a hint, there are a total of five separate indices built on this database. We have included some, but not all, primary and foreign key indices, as well as some indices involving non-key fields. To get you going, one of the indices is a composite index for table Feature, based upon the four fields (NamePrefix, Name, NameType, NameSuffix), however not necessarily in that respective order. One of your goals should be to determine the precise order we used on this composite index.

    Finally, do not create indices on any of the smaller tables. For example, table State only contains 57 entries, which presumably fit on a single page. Whether or not such a small table has an index is virtually undetectable.

  • Your task is to submit the following:
    1. Create a script, queries.sql, which contains at least 10 interesting queries on this database. An "interesting" query is one which will display a markedly different execution time when run on the three versions of the database. At the same time, if your query is overly complicated, it will be more difficult to infer which of many indices were used in evaluation.
    2. Generate a table of experimental running times for the queries you have developed. Start your query script with the statement "set timing on", so that Oracle will report the elapsed computation time when evaluating each individual query. Run your complete query set three different times on each of the three different databases, reporting all execution times. It may even be worthwhile to gather data on two or more different days.

      (The reason we are having you repeat each query on the same database is to measure the variance in execution times due to other factors, such as the machine load at the time.)

    3. Give a prose evaluation of your experiments. This may be several pages long. To begin, highlight the experiments that expose the (known) difference between the original tiger database and the tiger2 database. Following this, develop your own hypotheses as to the existence of indices in tiger3, the mystery database. Justify your hypotheses based specifically on your experimental results.
    In order to use the same script on each of the three databases, we recommend the following. Write your original query.sql script based on the standard tablenames (e.g., Chain, CFCC, Feature). Then, in order to specify which copy of the database to use, setup appropriate synonyms by running one of the following scripts immediately before your queries:
  • usetiger.sql
  • usetiger2.sql
  • usetiger3.sql