Lecture #09 (3 November 2001)

Databases


Overall Reading
Brookshear: Ch. 9.1, 9.3

Outline:

  • General Issues (Ch. 9.1 [Br])
  • (skipping Ch. 9.2 [Br] on Implementation)
  • Relational Database Model (Ch. 9.3 [Br])
  • Relational Design
  • Relational Operations
  • SQL (Structured Query Language)
  • "A Gentle Introduction to SQL"

  • General Issues (Ch. 9.1 [Br])

    In an earlier lecture, we gave an example of a "simple database" which stored a list of names such that we could:
  • search for the presence of an entry
  • print the list in alphabetical order
  • insert a new entry
  • We saw some data structures which might help do this efficiently. However, as we described it then, this was one-dimensional.
    That is, we were only allowed to search for information based on one particular field of information (the name).

    More generally, we want databases to handle information about many different items, each of which may have many different fields of data.

    Employee Record:

  • First Name
  • Last Name
  • Social Security Number
  • Birthday (Date/Month/Year)
  • Home Address
  • Start Date
  • End Date
  • Job Title
  • Department
  • Manager
  • Office
  • Phone
  • Salary
  • Already, this is a more complicated situation.
  • Searching on a single field:
    If the only operation I needed was to be able to search for employees based on the last name, we could keep all of the records in a flat-file, with all records pre-sorted based on the name. (We could be efficient, and use binary search!)

  • Searching on other fields?
    But what if I wanted to search for people with a birthday today?!
    Sequential search is slow. But cannot use binary search unless all employee records were sorted based on birthdate.

    If efficiency important, perhaps I should keep two copies of all of the records, with one version sorted by name and another version sorted by birthdate.

  • Searching based on multiple fileds?
    What if I want to find all people who have been with the company 10 years or more, but are making less than $50000?

  • Duplication is bad
    Consider the suggestion that we keep two copies of all records, so that we can sort one by name and one by birthday. This wastes memory. Also updating employee information becomes perilous, because we need to make sure of consistency. (And the situation just gets worse if we have three copies or four copies, etc.)

  • Privacy:
    Different people should have different views and access to the information (schema vs. sub-schema)

    For instance, payroll should have access to my salary and home address, but general employees should probably only see my name, office, and phone number. My manager might know my salary, but not my social security number.

  • Integrated data system:
    We want to think about all of these needs ahead of time, and to design an integrated database which can meet all of these needs, as efficiently as possible in terms of time and memory usage.

  • (skipping Ch. 9.2 [Br] on Implementation)

    General theme is that there are many layers of abstraction separating the user's view form the details of the actual data in memory.

    Relational Database Model (Ch. 9.3 [Br])

  • Overview:

    The most popular databases in use today are based on a relational database model (e.g., Oracle, Informix, Microsoft Access)

    The database portrays the data as if it were stored in rectangular tables, called relations.

    Example (from Fig. 9.3):

    EmplIDNameAddressSSNum
    25X15 Joe E. Baker 33 Nowhere St. 111223333
    34Y70 Cheryl H. Clark 563 Downtown Ave. 999009999
    23Y34 G. Jerry Smith 1555 Circle Dr. 111005555
    . . . .
    . . . .
    . . . .

    We call one row of this table a tuple.


  • Relational Design

  • An example:

    Name Birthday Zodiac Birthstone
    Art Alexakis Apr. 12 Aries diamond
    Hank Azaria Apr. 25 Tauras diamond
    Antonio Banderas Aug. 10 Leo peridot
    Lucas Black Nov. 29 Sagittarius citrine
    Matthew Broderick Mar. 21 Pisces aquamarine
    Sandra Bullock July 26 Leo ruby
    Steve Buscemi Dec. 13 Sagittarius turquoise
    Nicolas Cage Jan. 7 Capricorn garnet
    Jim Carrey Jan. 17 Capricorn garnet
    George Clooney May 6 Tauras emerald
    Courteney Cox June 15 Gemini pearl
    John Cusack June 28 Cancer pearl
    Joan Cusack Oct. 11 Libra opal
    Matt Damon Oct. 8 Libra opal
    Robert De Niro Aug. 17 Leo peridot
    Leonardo DiCaprio Nov. 11 Scorpio citrine
    Cameron Diaz Aug. 30 Virgo peridot
    Stephen Dorff July 29 Leo ruby
    Minnie Driver Jan. 31 Aquarius garnet
    Calista Flockhart Nov. 11 Scorpio citrine
    Harrison Ford July 13 Cancer ruby
    Jodie Foster Nov. 19 Scorpio citrine
    Mel Gibson Jan. 3 Capricorn garnet
    Joseph Gordon-Levitt Feb. 17 Aquarius amethyst
    Tom Hanks July 9 Cancer ruby
    Dustin Hoffman Aug. 8 Leo peridot
    Timothy Hutton Aug. 16 Leo peridot
    Jeremy Irons Sep. 19 Virgo sapphire
    Joshua Jackson June 11 Gemini pearl
    Greg Kinnear June 17 Gemini pearl
    Kevin Kline Oct. 24 Scorpio opal
    Jeremy London Nov. 7 Scorpio citrine
    Courtney Love July 9 Cancer ruby
    Peter MacNicol Apr. 10 Aries diamond
    William H. Macy Mar. 13 Pisces aquamarine
    John Malkovich Dec. 9 Sagittarius turquoise
    Julianna Margulies June 8 Gemini pearl
    Matthew McConaughey Nov. 4 Scorpio citrine
    Dylan McDermott Oct. 26 Scorpio opal
    Frances McDormand June 23 Cancer pearl
    Jack Nicholson Apr. 22 Aries diamond
    Edward Norton Aug. 18 Leo peridot
    Gwyneth Paltrow Sep. 28 Libra sapphire
    Sean Penn Aug. 17 Leo peridot
    Matthew Perry Aug. 19 Leo peridot
    Michelle Pfeiffer Apr. 29 Tauras diamond
    Brad Pitt Dec. 18 Sagittarius turquoise
    Natalie Portman June 9 Gemini pearl
    Aidan Quinn Mar. 8 Pisces aquamarine
    Giovanni Ribisi Mar. 31 Aries aquamarine
    Julia Roberts Oct. 28 Scorpio opal
    Keri Russell Mar. 23 Pisces aquamarine
    Adam Sandler Sep. 9 Virgo sapphire
    Susan Sarandon Oct. 4 Libra opal
    Rick Schroder Apr. 13 Aries diamond
    Gary Sinise Mar. 17 Pisces aquamarine
    Kevin Spacey July 26 Leo ruby
    Rider Strong Dec. 11 Sagittarius turquoise
    Billy Bob Thornton Aug. 4 Leo peridot
    Uma Thurman Apr. 29 Tauras diamond
    Skeet Ulrich Jan. 20 Capricorn garnet
    Eddie Vedder Dec. 23 Capricorn turquoise
    Robin Williams July 21 Cancer ruby
    Kate Winslet Oct. 5 Libra opal
    Robin Wright Penn Apr. 8 Aries diamond

    Though this may contain the information that we want, there are several problems with such a structure.

  • Duplication: For example, Jack Nicholson, Uma Thurman and Michelle Pfeiffer were all born in April and all have a diamond as a birthstone. We seem to be storing this information in several places.

  • Making modifications: What if it is decided that April's birthstone will be changed to quartz? It appears you will need to go and change many entries of your table.

  • Deletions: In making deletions, you might accidentally lose some useful information. For instance, Joseph Gordon-Levitt is the only person in our database born in February. We might have a reason to remove that person from our database. But now, we seem to have lost the correspondance between February and amethyst. That is, if we later add a new person to the database born in February, we do not seem to know the proper birthstone.
  • A better way: We can instead represent our database using a combination of three relations, as follows:

    Name Birthday
    Art Alexakis Apr. 12
    Hank Azaria Apr. 25
    Antonio Banderas Aug. 10
    Lucas Black Nov. 29
    Matthew Broderick Mar. 21
    Sandra Bullock July 26
    Steve Buscemi Dec. 13
    Nicolas Cage Jan. 7
    Jim Carrey Jan. 17
    George Clooney May 6
    Courteney Cox June 15
    John Cusack June 28
    Joan Cusack Oct. 11
    Matt Damon Oct. 8
    Robert De Niro Aug. 17
    Leonardo DiCaprio Nov. 11
    Cameron Diaz Aug. 30
    Stephen Dorff July 29
    Minnie Driver Jan. 31
    Calista Flockhart Nov. 11
    Harrison Ford July 13
    Jodie Foster Nov. 19
    Mel Gibson Jan. 3
    Joseph Gordon-Levitt Feb. 17
    Tom Hanks July 9
    Dustin Hoffman Aug. 8
    Timothy Hutton Aug. 16
    Jeremy Irons Sep. 19
    Joshua Jackson June 11
    Greg Kinnear June 17
    Kevin Kline Oct. 24
    Jeremy London Nov. 7
    Courtney Love July 9
    Peter MacNicol Apr. 10
    William H. Macy Mar. 13
    John Malkovich Dec. 9
    Julianna Margulies June 8
    Matthew McConaughey Nov. 4
    Dylan McDermott Oct. 26
    Frances McDormand June 23
    Jack Nicholson Apr. 22
    Edward Norton Aug. 18
    Gwyneth Paltrow Sep. 28
    Sean Penn Aug. 17
    Matthew Perry Aug. 19
    Michelle Pfeiffer Apr. 29
    Brad Pitt Dec. 18
    Natalie Portman June 9
    Aidan Quinn Mar. 8
    Giovanni Ribisi Mar. 31
    Julia Roberts Oct. 28
    Keri Russell Mar. 23
    Adam Sandler Sep. 9
    Susan Sarandon Oct. 4
    Rick Schroder Apr. 13
    Gary Sinise Mar. 17
    Kevin Spacey July 26
    Rider Strong Dec. 11
    Billy Bob Thornton Aug. 4
    Uma Thurman Apr. 29
    Skeet Ulrich Jan. 20
    Eddie Vedder Dec. 23
    Robin Williams July 21
    Kate Winslet Oct. 5
    Robin Wright Penn Apr. 8

    StartDate EndDate Birthstone
    January 1 January 31 garnet
    February 1 February 29 amethyst
    March 1 March 31 aquamarine
    April 1 April 30 diamond
    May 1 May 31 emerald
    June 1 June 30 pearl
    July 1 July 31 ruby
    August 1 August 31 peridot
    September 1 September 30 sapphire
    October 1 October 31 opal
    November 1 November 30 citrine
    December 1 December 31 turquoise

    StartDate EndDate Zodiac
    Feb. 19 Mar. 20 Pisces
    Mar. 21 Apr. 19 Aries
    Apr. 20 May 20. Tauras
    May 21 June 20 Gemini
    June 21 July 22 Cancer
    July 23 Aug. 22 Leo
    Aug. 23 Sep. 22 Virgo
    Sep. 23 Oct. 22 Libra
    Oct. 23 Nov. 21 Scorpio
    Nov. 22 Dec. 21 Sagittarius
    Dec. 22 Jan. 19 Capricorn
    Jan. 20 Feb. 18 Aquarius

    Query: Find all Capricorns

  • More complicated example (from text):
    Revisiting the exployee database, imagine that we want more detailed information such as the person job history within the company. Details should include the job title, a job identification code, the skill code associated with each job, the department, and the period of time duirng which the employee held that job.

    First approach (Fig. 9.4 of text):

    EmplIDNameAddressSSNum JobIDJobTitleSkillCodeDept StartDateTermDate
    25X15 Joe E. Baker 33 Nowhere St. 111223333 F5 Floor manager FM3 Sales 9-1-1998 9-30-1999
    25X15 Joe E. Baker 33 Nowhere St. 111223333 D7 Dept. head D2 Sales 10-1-1999 present
    34Y70 Cheryl H. Clark 563 Downtown Ave. 999009999 F5 Floor manager FM3 Sales 10-1-1998 present
    23Y34 G. Jerry Smith 1555 Circle Dr. 111005555 S25X Secretary T5 Personnel 3-1-1996 4-30-1998
    23Y34 G. Jerry Smith 1555 Circle Dr. 111005555 S25X Secretary T6 Accounting 5-1-1998 present
    . . . . . . . . . .
    . . . . . . . . . .
    . . . . . . . . . .

    Again, this suffers from the same pitfalls

  • Redundancy (such as Joe's address)

  • Deletions may cause loss of information. (if Joe quits, we would lose informaiton such as the Skill Code for job D7)
  • An alternative solution: Use three separate relations. (Fig. 9.5)

  • EMPLOYEE relation
    EmplIDNameAddressSS Num
    25X15 Joe E. Baker 33 Nowhere St. 111223333
    34Y70 Cheryl H. Clark 563 Downtown Ave. 999009999
    23Y34 G. Jerry Smith 1555 Circle Dr. 111005555
    . . . .
    . . . .
    . . . .

  • JOB relation
    JobIDJobTitleSkillCodeDept
    S25X Secretary T5 Personnel
    S26Z Secretary T6 Accounting
    F5 Floor manager FM3 Sales
    . . . .
    . . . .
    . . . .

  • ASSIGNMENT relation
    EmplIDJobIDStartDateTermDate
    23Y34 S25X 3-1-1996 4-30-1998
    34Y70 F5 10-1-1998 present
    23Y34 S25Z 5-1-1998 present
    . . . .
    . . . .
    . . . .
  • Query: Find all people in Accounting

  • A Flawed Design

    Imagine that you had the following relation:

    EmplIDJobTitleDept
    25X15 Floor manager Sales
    28Z25 Secretary Sales
    23Y34 Secretary Accounting

    and that you tried to replace it with a combination of the following two relations:

    EmplIDJobTitle
    25X15 Floor manager
    28Z25 Secretary
    23Y34 Secretary

    JobTitleDept
    Floor manager Sales
    Secretary Sales
    Secretary Accounting

    Is this new formation equivalent? No!

    Example: How do you determine the employees in the Sales department?


  • Relational Operations

    We would like to see how we can get information from the database as a user/programmer.
  • SELECT
    Sometimes we want to select only some of the tuples.

    EMPLOYEE relation
    EmplIDNameAddressSS Num
    25X15 Joe E. Baker 33 Nowhere St. 111223333
    34Y70 Cheryl H. Clark 563 Downtown Ave. 999009999
    23Y34 G. Jerry Smith 1555 Circle Dr. 111005555
    . . . .
    . . . .
    . . . .

    SELECT from EMPLOYEE where EmplID="34Y70"

    EmplIDNameAddressSS Num
    34Y70 Cheryl H. Clark 563 Downtown Ave. 999009999

  • PROJECT
    Sometimes you only want a view with some columns.

    EMPLOYEE relation
    EmplIDNameAddressSS Num
    25X15 Joe E. Baker 33 Nowhere St. 111223333
    34Y70 Cheryl H. Clark 563 Downtown Ave. 999009999
    23Y34 G. Jerry Smith 1555 Circle Dr. 111005555
    . . . .
    . . . .
    . . . .

    PROJECT Name,Address from EMPLOYEE

    NameAddress
    Joe E. Baker 33 Nowhere St.
    Cheryl H. Clark 563 Downtown Ave.
    G. Jerry Smith 1555 Circle Dr.
    . .
    . .
    . .

  • JOIN
    This operation combines the attributes of both relations into a single relation, as follows. It takes each tuple from the first table and combines it with each tuple from the second table. As a simple example:

    Relation A
    VW
    r2
    t4
    p6

    Relation B

    XYZ
    5gp
    4de
    2mq
    4tf

    JOIN A and B

    A.VA.WB.XB.YB.Z
    r25gp
    r24de
    r22mq
    r24tf
    t45gp
    t44de
    t42mq
    t44tf
    p65gp
    p64de
    p62mq
    p64tf

    The attribute names from each of the original relations do not necessarily need to have the same names, and if they do, it may be a coincidence. However, we often do not want to join all possible pairs of tuples, but only those in which some particular attributes match.

    Relation A
    VW
    r2
    t4
    p6

    Relation B

    XYZ
    5gp
    4de
    2mq
    4tf

    JOIN A and B where A.W = B.X

    A.VA.WB.XB.YB.Z
    r22mq
    t44de
    t44tf

  • Combining operators:
    In general, you can combining these operators together to do more non-trivial queries, essentially creating new relations and then operating on those.

    Example,
    NEW1 <- JOIN ASSIGNMENT and JOB where ASSIGNMENT.JobID=JOB.JobID
    NEW2 <- SELECT form NEW1 where ASSIGNMENT.TermDate="present"
    LIST <- PROJECT ASSIGNMENT.EmplID,JOB.Dept from NEW2

    Also, it should be noted that the user interface for many databases will present these same concepts, however using different syntax or graphical interaction.


  • SQL (Structured Query Language)

    This is a language for specifying queries of a relational database. It was originally created by IBM but has since become standardized for use in the industry.

    Though its terminology differs somewhat from what we have introduced above, a single SQL statement may be used to express a combination of SELECT, PROJECT and JOIN operations.

    Format:

      select <list of attributes>
      from   <list of tables>
      where  <list of conditions>
    
    Let's revisit some sample queries:
  • PROJECT Name, Address from EMPLOYEE

    In SQL:

      select Name, Address
      from EMPLOYEE
    
    [Note well: the terminology 'select' in SQL is not the same as the operation SELECT we have discussed]

  • SELECT from EMPLOYEE where EmplID="34Y70"

      select EmplID, Name, Address, SSN
      from EMPLOYEE
      where EmplID = '34Y70'
    

  • JOIN A and B where A.W = B.X

      select *
      from A, B
      where A.W = B.X
    

  • NEW1 <- JOIN ASSIGNMENT and JOB where ASSIGNMENT.JobID=JOB.JobID
    NEW2 <- SELECT form NEW1 where ASSIGNMENT.TermDate="present"
    LIST <- PROJECT ASSIGNMENT.EmplID,JOB.Dept from NEW2

      select EmplID, Dept
      from ASSIGNMENT, JOB
      where ASSIGNMENT.JobID = JOB.JobID
        and ASSIGNMENT.TermDate = "present"
    
  • SQL statements can use additional features not mentioned in the text. Just a few such examples are:

  • attribute list following select keyword
  • the character "*" can be used as a wildcard to display all attributes.
  • for numeric attributes, you can create new columns which combine information of several attributes. For example:
    select Weight/Density
  • condition list following where keyword
  • condition can be arbitrarily complex logical expression, based on operators (AND, OR, NOT).
  • for numeric attributes, you can say things such as:
    where Value >= 25
    Or using operators such as (>, >=, =, <, <=)
  • for text attributes, the inequality operators will be evaluated based on alphabetical order.
  • There are even way to do partial matches for text, such as
    FirstName LIKE 'Mich*'

  • "A Gentle Introduction to SQL"

    A very nice website titled, A Gentle Introduction to SQL, is provided by Andrew Cumming of the School of Computing of Napier University in the UK. It provides several nice tutorials allowing you to form your own SQL queries on existing databases (and it goes into far more depth than we will do in this course).

    Databases we can play with:

  • Information on all countries, according to the "CIA World Factbook" from 1995. This information is in a single table titled cia as follows:

    table 'cia'
    nameregionareapopulationgdp
    Afghanistan Asia 652000 25838797 21000000000
    Albania Europe 28748 3490435 5600000000
    Algeria Africa 2381740 31193917 147600000000
    . . . . .
    . . . . .

  • Here are some practice problems.
  • Here is a self-test tutorial you can try.
  • Here are the answers to the tutorial.
  • Information on movies and their stars, according to the Internet Movie Database from 1997. To avoid duplicaiton, this database design uses three separate tables

    table 'movie'
    idtitleyrscorevotes
    1 Star Wars 1977 8.8 53567
    2 Shawshank Redemption, The 1994 9 44974
    3 Pulp Fiction 1994 8.6 43993
    4 Titanic 1997 7.2 43371
    . . . . .
    . . . . .

    table 'actor'
    idname
    1 Woody Allen
    2 Clint Eastwood
    3 Robert De Niro
    4 Sean Connery
    . .
    . .

    table 'casting'
    movieidactoridord
    972 588 1
    849 588 2
    1575 588 3
    47 590 4
    . . .
    . . .

  • A more complete explanation of these fields can be found here.
  • Here is a self-test tutorial you can try. (Please note, we have only covered enough to get you through queries 1a-1b, 2a-2c, 3a-3e. Queries 4a-4e involve operations we have not covered.)
  • Here are the answers to the tutorial.

  • comp150 Class Page
    mhg@cs.luc.edu
    Last modified: 3 November 2001