Assignments | Course Home | PostgreSQL | Schedule & Lecture Notes | Submit

Saint Louis University

Computer Science 371
Databases

Michael Goldwasser

Fall 2005

Dept. of Math & Computer Science


TIGER database

Contents:


Overview

To better understand the significance of query evaluation, optimization and the effect of indices, we need a larger dataset. The U.S. Census Bureau has a publicly available dataset providing all of the underlying data needed for a map of the United States. The data provides coordinates and names for all roads, rivers, railroads and other features and landmarks, together with street addresses and zip codes. This data is updated every few years, and is made available to all. It is the underlying database which in turn is used by mapping applications such as MapQuest, Yahoo, etc, as well as traditional print maps such as Rand McNally (truthfully, there are commercial data sets available which contain even more details that are excluded from TIGER, such as the direction of one-way streets).

The dataset is known as TIGER, an acronym for "Topologically Integrated Geographic Encoding and Referencing system. I have downloaded a portion of the raw data from their 2004 edition for us to play with. I loaded a bit of nationwide data, such as the list of all states and counties, and then I pulled more complete road data for the entire state of Missouri. The data set contains every single street and intersection, including names, addresses, zip codes, and a classification into the road type (e.g. highway, single lane). There are also geographic features such as all rivers.


ER model

Though the public data is not presented as a Relational Database, I have interpreted it as such and developed the following ER diagram (click for fullsize version)


Tables

In implementing the relational database, we have defined nine tables. Seven of those tables directly model entities of the above ER model, and two of the tables represent many-to-many relationships in that model. A brief summary of each of the nine tables follows:


Indices

Since we wish to better understand the effect of indices on query evaluation and performance, I have actually created three separate instances of this data set, described as follows. All three databases have the identical underlying dataset, however the set of indices differ between the three.


Usage on Turing

All of the data described here has been loaded onto the csci371 database on turing. However, because we wished to clearly separate out the distinct instances of this database (tiger1, tiger2, tiger3), we have made it so that none of the tables are visible by default. Before querying any of the tables, you must explicitly set your search path in psql using a command such as:

set search_path = tiger3;
From that point on, you will be using the specified instance of the database for all subsequent queries you make (unless you again set the search_path to something else).

The advantage of this is that you may then execute the same query script on any of the database instances simply by changing the search path appropriately.

Timing your queries

We will want to start paying particular attention to the computation time of the database when answering queries. Though you could use clock time, the database supports a mode in which it reports the time spent on a command. In psql, you may toggle the timing setting using the command
\timing

Note: We have configured postgres so that it gives up on any query after 5 minutes of computation time. Unfortunately, the message displayed on such a timeout is misdirected, stating that the query was canceled by the user's request (when really because of the time limit).


Michael Goldwasser
CSCI 371, Fall 2005
Last modified: Wednesday, 16 November 2005
Assignments | Course Home | PostgreSQL | Schedule & Lecture Notes | Submit