Assignments | Course Home | PostgreSQL | Schedule & Lecture Notes | Submit

Computer Science 371
Databases

TIGER database

To better understand the significance of query evaluation, optimization and the effect of indices, we need a larger dataset. The U.S. Census Bureau has a publicly available dataset providing all of the underlying data needed for a map of the United States. The data provides coordinates and names for all roads, rivers, railroads and other features and landmarks, together with street addresses and zip codes. This data is updated every few years, and is made available to all. It is the underlying database which in turn is used by mapping applications such as MapQuest, Yahoo, etc, as well as traditional print maps such as Rand McNally (truthfully, there are commercial data sets available which contain even more details that are excluded from TIGER, such as the direction of one-way streets).

The dataset is known as TIGER, an acronym for "Topologically Integrated Geographic Encoding and Referencing system. I have downloaded a portion of the raw data from their 2004 edition for us to play with. I loaded a bit of nationwide data, such as the list of all states and counties, and then I pulled more complete road data for the entire state of Missouri. The data set contains every single street and intersection, including names, addresses, zip codes, and a classification into the road type (e.g. highway, single lane). There are also geographic features such as all rivers.

ER model

Though the public data is not presented as a Relational Database, I have interpreted it as such and developed the following ER diagram (click for fullsize version)

Tables

In implementing the relational database, we have defined nine tables. Seven of those tables directly model entities of the above ER model, and two of the tables represent many-to-many relationships in that model. A brief summary of each of the nine tables follows:

State (56 tuples)
Each state is given a unique two-digitcode by the government (e.g., 29 for Missouri), as well as the more traditional abbreviations (e.g., MO for Missouri). There are actually 56 tuples because it includes DC for the District of Columbia, as well as five other U.S. Territories.

stateID abbrev

... ...

29 MO

... ...
County (3,232 tuples)
Within each state, every county is given a unique three-digit code. I have portrayed this as a weak entity, in the sense that the three-digit codes are unique only within a state. To uniquely identify a particular county in the full database, you would need to use the pair of state and county codes.

As an example, our school is in St. Louis City (which is actually modeled as a county in this schema), and has a countyid of 510 within state 29. This is distinct from the true St. Louis County, which has countyid of 189, or from St. Charles county, which has countid of 183.

stateID countyID name

... ... ...

29 510 St. Louis

... ... ...
CFCC (274 tuples)
Every road, waterway, railway landmark, or other entry in the database is coded with a CFCC classification, identifying the class of the item. For example a road could be classified as 'A41' which is described as 'Local, neighborhood and rural road, city street, unseparated.'

CFCC description

... ...

A41 Local, neighborhood, and rural road, city street, unseparated

... ...
Point (1,206,260 tuples)
Every chain runs between two endpoints. Every point includes both longitude and latitude coordinates, and is assigned a unique identifying number, known as its TZID. For example, the intersection of Lindell Blvd and N. Grand Blvd is a single point in the database, with TZID of 11233807, with longitude of -90.232812 West and latitude of 38.637451 North.

Postgres has special support for geometric data, therefore we use their built-in point datatype to represent the (longitude/latitude) pairs. If you wish to access one of the two coordinates of a point, you may do so by indexing it as if it were an array (e.g., in this table coord[0] is longitude and coord[1] is lattitude)

TZID coord

... ...

11233807 (-90232812,38637451)

... ...
Chain (1,676,070 tuples)
An atomic "piece" of every road, river, railroad, etc is called a chain, and assigned a unique identifying number, known as its TLID. For example, a single chain of a road would generally be the part of the road between two intersections (such as the chain of N. Grand Blvd which runs between Lindell Blvd and W. Pine Blvd, having TLID 107607932).

TLID CFCC TZIDS TZIDE

... ... ... ...

107607932 A41 11233807 11233819

... ... ... ...
ChainThroughCounty (1,699,020 tuples)
This is a simple table representing the relationship showing which chains lie in which counties. Most often, a chain will lie in a single county, however if a chain is part of the county border, it will be associated with both counties.

TLID stateID countyID

... ... ...

107607932 29 510

... ... ...

stateID	abbrev
...	...
29	MO
...	...

stateID	countyID	name
...	...	...
29	510	St. Louis
...	...	...

CFCC	description
...	...
A41	Local, neighborhood, and rural road, city street, unseparated
...	...

TZID	coord
...	...
11233807	(-90232812,38637451)
...	...

TLID	stateID	countyID
...	...	...
107607932	29	510
...	...	...

Feature (172,905 tuples)
A feature is essentially a name or label which we might associate with one or more chains. That is, the chain which is visible from Ritter Hall is commonly known as N Grand Blvd. Each feature is assigned a FEAT identifier, which is unique within the given county. The description of the feature has four distinct parts, the Name (e.g. Grand), the type (e.g. Blvd), the prefix (e.g. N), and possibly a suffix (not in the case of this chain, but for others, often something like N, S, NE, etc). Any of those descriptive fields might be NULL.

stateID	countyID	featID	Prefix	Name	Type	Suffix
...	...	...	...	...	...	...
29	510	889	N	Grand	Blvd
29	510	964	S	Grand	Blvd
29	510	1232	S	Grand	Ave
...	...	...	...	...	...	...

ChainName (1,107945 tuples)
This table represents the many-to-many relationship between the chains and the features.

Note that there may be a many-to-many relationship between features and chains. For example, many chains make up what is commonly known as N Grand Blvd. There are also some chains which have more than one common name. For example the chain with TLID=107606466 has five different names in the database (United States Highway 40, I-70, I-55, I-64, Poplar St Brg).

stateID	countyID	featID	TLID
...	...	...	...
29	510	889	107603556
29	510	889	107607932
29	510	889	107607933
...	...	...	...

AddressRange (81,153 tuples)
For chains which represent roads, the range of postal address along each side of the chain is represented as an AddressRange. A chain may have multiple AddressRanges associated with it, generally because the two sides of the chain have separate addresses or in some cases because a single chainside may actually have two distinct ranges of addresses.

TLID	seq	orientation	fraddr	toaddr	zip	zipplus
...	...	...	...	...	...	...
107607932	0	L	298	140	63106
107607932	0	R	299	141	63103	2006
107607933	0	L	1	139	63106
107607933	0	R	2	138	63103
...	...	...	...	...	...	...

Indices

Since we wish to better understand the effect of indices on query evaluation and performance, I have actually created three separate instances of this data set, described as follows. All three databases have the identical underlying dataset, however the set of indices differ between the three.

tiger1

All of the data, but no indices whatsoever. (we do not even formally declare primary/foreign keys because enforcement of those constraints requires an index.)

If you are interested in the precise definition, see the SQL code.
tiger2

All of the data, and a BTREE index for each primary key. Note well that some of the primary keys involve multiple attributes, in a specified order.

If you are interested in the precise definition, see the SQL code.
tiger3

Everything in tiger2 plus the addition of five secondary indices, namely:
- BTREE index on Feature(Name,Prefix,Type)
- BTREE index on AddressRange(zip,zipplus)
- BTREE index on Chain(CFCC)
- HASH index on Chain(TZIDS)
- RTREE index on Point (using bounding box for coordinates)
If you are interested in the precise definition, see the SQL code.

Usage on Turing

All of the data described here has been loaded onto the csci371 database on turing. However, because we wished to clearly separate out the distinct instances of this database (tiger1, tiger2, tiger3), we have made it so that none of the tables are visible by default. Before querying any of the tables, you must explicitly set your search path in psql using a command such as:

set search_path = tiger3;

From that point on, you will be using the specified instance of the database for all subsequent queries you make (unless you again set the search_path to something else).

The advantage of this is that you may then execute the same query script on any of the database instances simply by changing the search path appropriately.

Timing your queries

We will want to start paying particular attention to the computation time of the database when answering queries. Though you could use clock time, the database supports a mode in which it reports the time spent on a command. In psql, you may toggle the timing setting using the command

\timing

Note: We have configured postgres so that it gives up on any query after 5 minutes of computation time. Unfortunately, the message displayed on such a timeout is misdirected, stating that the query was canceled by the user's request (when really because of the time limit).

Michael Goldwasser

CSCI 371, Fall 2005
Last modified: Wednesday, 16 November 2005

Assignments | Course Home | PostgreSQL | Schedule & Lecture Notes | Submit

Saint Louis University

Computer Science 371
Databases

Michael Goldwasser

Fall 2005

Dept. of Math & Computer Science

TIGER database

Contents:

Overview

ER model

Tables

Indices

Usage on Turing

Timing your queries

Computer Science 371 Databases

Fall 2005

TIGER database

Contents:

Timing your queries

Computer Science 371
Databases