A Gentle Introduction to SQL

We will be playing with a website titled "A Gentle Introduction to SQL" originally developed by Andrew Cummings of the School of Computing of Napier University, Edinburgh, UK. It is accessible at sqlzoo.net.

Approximate lesson plan:

Review of Basic SQL query syntax
Managing the resulting tuples
(DISTINCT, ORDER BY)
Aggregates
Grouping
Nested Queries
- Uncorrelated Nested Queries
- Correlated Nested Queries
- Nested Query in the FROM clause
Set Operations
Joins
NULL values

Review of Basic SQL query syntax
```
SELECT  attribute-list
FROM    table-list
WHERE   qualifying-condition      
```
To begin with, we will consider the case where the table-list is a single table (thus no join).

Tutorials:
- warmup for basic SQL queries (cia)
- basic SQL queries (cia)
Of note:
- SELECT attribute-list
  - ```
  SELECT *
```
  will display all original attributes, in defined order
- Can control the choice of and order of the resulting attributes based on the attribute-list, e.g.,
```
  SELECT name,region
```
  vs.
```
  SELECT region,name
```
- Can rename columns to choose alternate labels, e.g.,
```
  SELECT name AS country, area AS "area (in sq. km)"
```
- Can create new attributes on the fly, if computable from existing attributes, e.g., with numeric data
```
  SELECT gdp/population AS percapita
```
  or with concatenating string data,
```
  SELECT first || ' ' || last AS name
```
- FROM table-list
  Though the FROM clause may simply be a list of one or more existing tables, e.g., FROM cia, it is also possible to assign what is known as a table alias or tuple variable using a syntax such as FROM cia C. Remembering that the where clause of the query is evaluated on a tuple by tuple basis, we can now use variable C in the WHERE clause to designate the particular tuple being considered.
  
  Though use of such an alias is unnecessary on straightforward queries, especially with only one table, it will be come helpful, and in some cases necessary, as we look at more complex queries involving nested queries and joins.
- WHERE qualifying-condition
  - Arbitrary boolean expression can be formed using operators AND, OR, NOT and parenthesizing if not wanting default operator precedence.
  - Can use syntax
```
attr IN set
```
    to check for containment in a set, e.g.,
```
WHERE name IN ('France','Germany','Italy')
```
  - Can use LIKE keyword to do partial matches of strings, using '%' as wildcard for zero or more characters, e.g.,
```
WHERE name LIKE 'A%a'
```
    or '_' as a wildcard for exactly one character.
  - Can use BETWEEN keyword to check ranges, e.g.,
```
WHERE area BETWEEN 207600 AND 244820
```
Managing the resulting tuples

From a pure relational algebra point of view, a table instance is a set of tuples, where no two tuples are identical and where the order of the tuples is arbitrary. Yet for SQL, we often may want to process the results of a query to ensure that only distinct tuples appear in the results and we may want to specify the order in which the tuples are displayed.

As for the first desire, of removing duplicated, this is accomplished by adding the optional keyword DISTINCT after the keyword SELECT yet before the list of attributes.

The order can be controlled by adding an ORDER BY clause to the end of a query. Together the syntax for a query appears as:
```
SELECT [DISTINCT]  attribute-list
FROM     table-list
WHERE    qualifying-condition
[ORDER BY attribute-list]
```
Note that you can specify DISTINCT without ORDER BY and likewise ORDER BY without DISTINCT.

Tutorials:
- overview of DISTINCT and ORDER BY
Of note:
- If more than one attribute is given in the ORDER BY list, the tuples will primarily be sorted by the first attribute, but in case of a tie sorted by the secondary attribute, and in case of a further tie the third, and so on.
- By default, the sorting based upon an attribute is so that the values are in ascending order. You may change that to descending order by optionally placing the keyword DESC after an attribute on the ORDER BY list.
Aggregates

There are several built-in functions which can take a set of many tuples and compute aggregate results for the entire group. The most commonly used such functions are MAX, MIN, SUM, AVG, COUNT. The result of a query using an aggregate function is a single tuple with the aggregate value(s).

Of note for aggregates:
- Once you include an aggregate function in the SELECT list, you may only use aggregate functions, not regular attributes. For example, if all countries have been aggregated to compute the average area, it no longer makes sense to refer to the name attribute, as the names are not all the same.
- You may combine DISTINCT with a particular aggregate expression by placing the keyword within the parentheses, e.g., AVG(DISTINCT area) vs. AVG(area).
- The expression COUNT is generally used in the context of COUNT(*) which counts the overall number of tuples. However COUNT can be used with other attributes, in which case it counts the number of tuples with non-null values for such an attribute, e.g. COUNT(area), or the number of distinct non-null values as in COUNT(DISTINCT area).
Grouping

As discussed above, when using aggregates the default behavior is that all tuples are grouped together. However it is possible to break the overall list of tuples into several groups rather than one. This is done by specifying a list of attributes for grouping, where the underlying tuples are then grouped based on those with common values for those attributes.

Furthermore, once many such groups are formed, we have one additional chance to filter out which of those groups interest us. This is done through a HAVING clause which gives a boolean condition. The HAVING clause for groups is analogous to the WHERE clause for selecting the original tuples.

At this point, we have seen the general syntax
```
SELECT   [DISTINCT] select-list
FROM     from-list
WHERE    qualification
GROUP BY grouping-list
HAVING   group-qualification
ORDER BY order-list
```
It is important to understand the (conceptual) order of evaluation for such a statement:
1. All tables in the from line are "joined" (more below).
2. For each row of such a table, the where qualification is evaluated to see whether or not to include the row.
3. All of the rows chosen by WHERE are then grouped, according to the grouping-list, so that each group is aggregated into a single row.
4. For each remaining group, the HAVING qualification is used to decide whether to include that group's row in the final results.
5. All remaining rows are sorted according to the order-list.
6. If DISTINCT is included in the SELECT line, then duplicates are removed (turning a multi-set of tuples into a true set of tuples)
7. Finally, only the columns specified in the select-list are given in the final result.
Tutorials:
- warmup for GROUP BY and HAVING (cia)
- aggregate functions (cia)
Of note for grouping:
- If using a GROUP BY clause, the only expressions which can then appear in the SELECT clause are either aggregates or original attributes which are among those used for the grouping. Notice that since all tuples in a particular group have a common value for one of those attributes, that attribute value is well-defined for the group.
Nested Queries
- Uncorrelated Nested Queries
  The most typical use of nested queries is when the inner selection is self-sufficient and can be independently evaluated. An example is
```
SELECT   C.name
FROM     cia C
WHERE    C.area > (SELECT area FROM cia WHERE name='United States')
```
  Which finds all countries with area at least as large as the United States.
  
  Queries generally behave as if the WHERE clause is evaluated separately for each such C in the range of the FROM clause. Of course, since the nested select is independent of C, there is really no reason to reevaluate that statement for each C. A query optimizer will realize this and thus evaluate the nested query once and only once.
- Correlated Nested Queries
  In some cases, we wish to use a nested query which is not independent of the contest. Consider the following goal: For each region, report the country with the largest area in that region. One way to structure this query is as follows:
```
SELECT   C.region,C.name
FROM     cia C
WHERE    C.area >=  ALL (SELECT D.area FROM cia D WHERE C.region = D.region)
```
  Note carefully the occurrence of range variable C in the inner select! This means that we need to consider reevaluating the inner select for each choice of C.
- Nested Queries in the FROM clause
  Though typically, the table(s) in the from clause are actual tables, it is possible to put a nested select statement in the from clause, with that resulting table used as the source for answering the larger query. Here is one simple (though unnecessary) example of such a syntax, for finding all European countries which start with the letter A.
```
SELECT   C.name
FROM     (SELECT name FROM cia WHERE region='Europe') AS European
WHERE    European.name LIKE 'A%'
```
  Note that when creating such a from table on-the-fly, we will want to give it a name with an associated range variable (European, in this example). Note that this example could have been written more directly without need for such a technique, by expanding the outer WHERE clause to do the filtering for region='Europe'. If there is a more clear approach, one should avoid using nested queries as part of the from clause.
Set Operations

There are several operators which allow you to work with one or more sets. For all of the following, we let (...) represent a general set, typical that of a nested select statement.
- EXISTS (...)
  Evaluates as a boolean expression, designating whether the given set is non-empty.
- value IN (...)
  Evaluates as a boolean expression, checking whether value is in the given set.
- value >= ALL (...)
  Evaluates as a boolean expression, checking whether the given value is greater than or equal to each entry of the given set.
- value >= ANY (...)
  Evaluates as a boolean expression, checking whether the given value is greater than or equal to any one of the entries of the given set.
- value > ALL (...)
  value > ANY (...)
  value <= ALL (...)
  value <= ANY (...)
  value < ALL (...)
  value < ANY (...)
  Similar to above.
- (...) CONTAINS (...)
  Evaluates to boolean which is true if all tuples in the first list are also in the second list.
- (...) UNION (...)
  Evaluates to the set of tuples which are in either of the lists (or both).
- (...) INTERSECT (...)
  Evaluates to the set of tuples which are in both lists.
- (...) EXCEPT (...)
  Evaluates to the set of tuples which are in the first list but not in the second list.
For the final four operators above which take two sets as operands, you must ensure that the two queries return a set of results of equivalent form (i.e. have tuples with the same types of attributes in the same order).
Joins

When more than one table name is given as part of the FROM-list, the tables are to be "joined". The result is the cross product of the sets of tuples from the joined tables. For example, if two tables are joined, the result will be a table with one column for each of the columns of the original two tables. Then tuples are created by taking every possible row of the first table combined with every possible row of the second table. The concept extends to three or more tables as well.

Tutorials:
- warmup for basic joins (movies)
- basic joins (movies)
Special cases of joins:
- Equi-joins
  The most common type of a join is when the WHERE clause consists entirely of a conjunction of equality tests (e.g., A.x=B.x AND A.y=B.y). The reason this appears so often is when joining upon a set of attributes which comprise a foreign key of one table, which references another.
- Natural Joins
  In fact, if the names of the attributes in an Equi-join are all the same, we call this a natural join.
- Self Joins
  Example: "comparable" countries
```
SELECT C.name, D.name
FROM   cia C, cia D
WHERE  C.region = D.region
   AND C.name != D.name
   AND C.area > 10000
   AND C.area between D.area and D.area+1000
```
- Outer Joins
  When joining table A to table B, the standard join includes a tuple from A paired with a tuple of B, only if the specified WHERE condition is satisfied. If a tuple from A does not 'match' any tuple from B, it does not participate in the result. Similarly, a tuple from B which does not 'match' any tuple from A is not a part of the result.
  
  However, there may be times where we still wish to have such unmatched tuples appear in the result. In a "left outer join", the result includes all tuples from the regular join as well as one tuple for each entry of the original left table which was never matched. That tuple will be padded with NULL values for any expected attributes that are not otherwise known.
  In similar fashion, a right outer join will include resulting tuples from the regular join as well as one tuple for each tuple of the original right table which went unmatched. Finally, an "outer join" is equivalent to the union of the left and right outer joins.
  
  A standard join of two tables in SQL uses a FROM syntax of
  FROM A,B
  The corresponding syntax for outer joins in SQL is
  FROM A LEFT OUTER JOIN B
  FROM A RIGHT OUTER JOIN B
  FROM A OUTER JOIN B
NULL values

Though we are used to writing boolean conditions, which are either true or false, it is important to note that NULL values, when they exist in the database, are treated in a special way. What we may think of as a boolean expression is actually viewed as an expression with three possible results true, false, and unknown. Based on this, we can expand the standard true tables for boolean operators like AND, OR, NOT in the sensible way.

Yet it is still important to know what happens to a tuple in a query if the WHERE clause evaluates to unknown. In this case, it is not included in the result. Only those tuples for which the WHERE clause was unquestionably true will appear.

Yet for constraints, or CHECK expressions, the opposite convention is used. If enforcing a check, the system will only stop operations which unquestionably lead to a violation of the check. If the boolean expression evaluated to unknown, the benefit of the doubt is given and the constraint is presumed to be satisfied.

There are also special operators to check if an attribute is a NULL value or not, namely IS NULL and IS NOT NULL.

Michael Goldwasser