Saint Louis University |
Computer Science 371
|
Dept. of Math & Computer Science |
To better understand the significance of query evaluation, optimization and the effect of indices, we need a larger dataset. The U.S. Census Bureau has a publicly available dataset providing all of the underlying data needed for a map of the United States. The data provides coordinates and names for all roads, rivers, railroads and other features and landmarks, together with street addresses and zip codes. This data is updated every few years, and is made available to all. It is the underlying database which in turn is used by mapping applications such as MapQuest, Yahoo, etc, as well as traditional print maps such as Rand McNally (truthfully, there are commercial data sets available which contain even more details that are excluded from TIGER, such as the direction of one-way streets).
The dataset is known as TIGER, an acronym for "Topologically Integrated Geographic Encoding and Referencing system. I have downloaded a portion of the raw data from their 2004 edition for us to play with. I loaded a bit of nationwide data, such as the list of all states and counties, and then I pulled more complete road data for the entire state of Missouri. The data set contains every single street and intersection, including names, addresses, zip codes, and a classification into the road type (e.g. highway, single lane). There are also geographic features such as all rivers.
Though the public data is not presented as a Relational Database, I have interpreted it as such and developed the following ER diagram (click for fullsize version)
In implementing the relational database, we have defined nine tables. Seven of those tables directly model entities of the above ER model, and two of the tables represent many-to-many relationships in that model. A brief summary of each of the nine tables follows:
State (56 tuples)
Each state is given a unique two-digitcode by the government
(e.g., 29 for Missouri), as well as the more traditional
abbreviations (e.g., MO for Missouri). There are actually 56
tuples because it includes DC for the District of Columbia, as
well as five other U.S. Territories.
stateID | abbrev |
---|---|
... | ... |
29 | MO |
... | ... |
County (3,232 tuples)
Within each state, every county is given a unique
three-digit code. I have portrayed this as a weak entity, in the
sense that the three-digit codes are unique only within a state. To
uniquely identify a particular county in the full database, you would
need to use the pair of state and county codes.
As an example, our school is in St. Louis City (which is actually modeled as a county in this schema), and has a countyid of 510 within state 29. This is distinct from the true St. Louis County, which has countyid of 189, or from St. Charles county, which has countid of 183.
stateID | countyID | name |
---|---|---|
... | ... | ... |
29 | 510 | St. Louis |
... | ... | ... |
CFCC (274 tuples)
Every road, waterway, railway landmark, or other entry in the database
is coded with a CFCC classification, identifying the class of the item.
For example a road could be classified as 'A41' which is described as
'Local, neighborhood and rural road, city street, unseparated.'
CFCC | description |
---|---|
... | ... |
A41 | Local, neighborhood, and rural road, city street, unseparated |
... | ... |
Point (1,206,260 tuples)
Every chain runs between two endpoints. Every point
includes both longitude and latitude coordinates, and is assigned a
unique identifying number, known as its TZID.
For example, the intersection of Lindell Blvd and N. Grand Blvd is a
single point in the database, with TZID of 11233807, with
longitude of -90.232812 West and latitude of 38.637451 North.
Postgres has special support for geometric data, therefore we use their built-in point datatype to represent the (longitude/latitude) pairs. If you wish to access one of the two coordinates of a point, you may do so by indexing it as if it were an array (e.g., in this table coord[0] is longitude and coord[1] is lattitude)
TZID | coord |
---|---|
... | ... |
11233807 | (-90232812,38637451) |
... | ... |
Chain (1,676,070 tuples)
An atomic "piece" of every road, river, railroad, etc is called a
chain, and assigned a unique identifying number, known as its TLID.
For example, a single chain of a road would generally be the part of
the road between two intersections (such as the chain of N. Grand Blvd
which runs between Lindell Blvd and W. Pine Blvd, having TLID 107607932).
TLID | CFCC | TZIDS | TZIDE |
---|---|---|---|
... | ... | ... | ... |
107607932 | A41 | 11233807 | 11233819 |
... | ... | ... | ... |
ChainThroughCounty (1,699,020 tuples)
This is a simple table representing the relationship showing
which chains lie in which counties. Most often, a chain will
lie in a single county, however if a chain is part of the county
border, it will be associated with both counties.
TLID | stateID | countyID |
---|---|---|
... | ... | ... |
107607932 | 29 | 510 |
... | ... | ... |
Feature (172,905 tuples)
A feature is essentially a name or label which we might associate with
one or more chains. That is, the chain which is visible from Ritter
Hall is commonly known as N Grand Blvd. Each feature is
assigned a FEAT identifier, which is unique within the given
county. The description of the feature has four distinct parts,
the Name (e.g. Grand), the type (e.g. Blvd), the prefix
(e.g. N), and possibly a suffix (not in the case of this chain,
but for others, often something like N, S, NE, etc). Any of
those descriptive fields might be NULL.
stateID | countyID | featID | Prefix | Name | Type | Suffix |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
29 | 510 | 889 | N | Grand | Blvd | |
29 | 510 | 964 | S | Grand | Blvd | |
29 | 510 | 1232 | S | Grand | Ave | |
... | ... | ... | ... | ... | ... | ... |
ChainName (1,107945 tuples)
This table represents the many-to-many relationship between the
chains and the features.
Note that there may be a many-to-many relationship between features and chains. For example, many chains make up what is commonly known as N Grand Blvd. There are also some chains which have more than one common name. For example the chain with TLID=107606466 has five different names in the database (United States Highway 40, I-70, I-55, I-64, Poplar St Brg).
stateID | countyID | featID | TLID |
---|---|---|---|
... | ... | ... | ... |
29 | 510 | 889 | 107603556 |
29 | 510 | 889 | 107607932 |
29 | 510 | 889 | 107607933 |
... | ... | ... | ... |
AddressRange (81,153 tuples)
For chains which represent roads, the range of postal address
along each side of the chain is represented as an AddressRange.
A chain may have multiple AddressRanges associated with it,
generally because the two sides of the chain have separate
addresses or in some cases because a single chainside may
actually have two distinct ranges of addresses.
TLID | seq | orientation | fraddr | toaddr | zip | zipplus |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
107607932 | 0 | L | 298 | 140 | 63106 | |
107607932 | 0 | R | 299 | 141 | 63103 | 2006 |
107607933 | 0 | L | 1 | 139 | 63106 | |
107607933 | 0 | R | 2 | 138 | 63103 | |
... | ... | ... | ... | ... | ... | ... |
Since we wish to better understand the effect of indices on query evaluation and performance, I have actually created three separate instances of this data set, described as follows. All three databases have the identical underlying dataset, however the set of indices differ between the three.
All of the data, but no indices whatsoever. (we do not even formally declare primary/foreign keys because enforcement of those constraints requires an index.)
If you are interested in the precise definition, see the SQL code.
All of the data, and a BTREE index for each primary key. Note well that some of the primary keys involve multiple attributes, in a specified order.
If you are interested in the precise definition, see the SQL code.
Everything in tiger2 plus the addition of five secondary indices, namely:
If you are interested in the precise definition, see the SQL code.
All of the data described here has been loaded onto the csci371 database on turing. However, because we wished to clearly separate out the distinct instances of this database (tiger1, tiger2, tiger3), we have made it so that none of the tables are visible by default. Before querying any of the tables, you must explicitly set your search path in psql using a command such as:
set search_path = tiger3;From that point on, you will be using the specified instance of the database for all subsequent queries you make (unless you again set the search_path to something else).
The advantage of this is that you may then execute the same query script on any of the database instances simply by changing the search path appropriately.
\timing
Note: We have configured postgres so that it gives up on any query after 5 minutes of computation time. Unfortunately, the message displayed on such a timeout is misdirected, stating that the query was canceled by the user's request (when really because of the time limit).