Evaluating Relational Operators

Preface: These notes provide a very sparse outline of the material in Ch. 14 of the text. Please refer to the text for additional details.

Table of Contents:

  • Selection
  • Simple Selection Conditions
  • General Selection Conditions
  • Projection
  • Join

  • Selection

  • A selection involves finding those tuples which match the WHERE clause of a query. In general, our goal is to find the "most selective access path", that is, the one requiring the shortest sequence of page requests. There are several options available.

  • Simple Selection

    Consider a query with a simple "WHERE" condition, as follows:
    SELECT *
    FROM cia
    WHERE area > 10000
    Evaluation strategies:
    1. Scan the entire relation, checking condition for each record.
    2. If the relation is stored as a sorted file (though it usually is not):
      Perform a binary search to locate the first match, followed by a partial scan if it is a range search.
    3. If Clustered (BTree) Index exists:
      Use it. It will almost certainly be the best strategy.
    4. If Unclustered (BTree) Index exists:
      It is not so clear. The index lets us quickly identify the set of rid's, but those rid's may be scattered over different pages on disk. Two approaches are (see the sketch after this list):
      1. Sort rid's according to pageID
        Then gather all needed pages
      2. Ignore index and just do scan of entire file
    5. If a Hash Index exists:
      This helps only for an equality selection. The issues of how to use the index are similar to before, depending on whether it is clustered or unclustered.
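
    To illustrate option 4, here is a minimal Python sketch (not from the text) of the first approach: given rid's produced by an unclustered index, we sort them by pageID so that each qualifying page is read only once. The rid layout and the read_page helper are hypothetical.

    import collections

    def fetch_by_rids(rids, read_page):
        """Fetch records for rid's of the form (pageID, slot), reading each page once."""
        by_page = collections.defaultdict(list)
        for page_id, slot in rids:          # group rid's by page
            by_page[page_id].append(slot)
        results = []
        for page_id in sorted(by_page):     # one I/O per distinct page, in page order
            page = read_page(page_id)       # read_page performs the actual page I/O
            results.extend(page[slot] for slot in by_page[page_id])
        return results

    # Toy usage: two pages of the cia relation, records are (Name, area) pairs.
    pages = {0: [("Togo", 56785), ("Chad", 1284000)], 1: [("Tonga", 747), ("Turkey", 780580)]}
    rids = [(1, 1), (0, 1), (1, 0)]         # rid's in arbitrary (index) order
    print(fetch_by_rids(rids, lambda pid: pages[pid]))
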
  • General Selection Conditions

    Consider a more complex "WHERE" condition involving the conjunction of several simpler conditions, such as:
    SELECT *
    FROM cia
    WHERE area > 10000
       AND Region = 'Africa'
       AND Name LIKE 'T%'
    There are several possible strategies:
    1. Do a File Scan, evaluating each record on the fly.
    2. If an index exists on the attribute involved in one of the conditions, we can use it to identify the subset of records in the original relation which matches that partial condition. Then we can scan those records, testing the remaining conditions on-the-fly.

      Whether such an initial use of the index is helpful depends on conditions similar to the previous discussion on clustered vs. unclustered indices, and selectivity of the condition.

    3. If several relevant indexes are available, we can consider using them as follows (see the sketch below):
      1. For each available index of relevance, compute the set of rid's (but do not yet retrieve the corresponding records).
      2. Compute the intersection of those rid lists (sort the lists and compare).
      3. Retrieve those records which lie in the intersection. If all original conditions have been considered, then this is the final result. If other conditions remain, they can be tested on-the-fly.
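
    A minimal Python sketch of that rid-intersection strategy (the rid lists shown are hypothetical): the lists produced by two indexes are sorted and merged, and only the records in the intersection are retrieved.

    def intersect_rid_lists(list_a, list_b):
        """Intersect two rid lists (rid = (pageID, slot)) by sorting and merging."""
        a, b = sorted(list_a), sorted(list_b)
        i = j = 0
        common = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                common.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return common

    rids_area   = [(0, 1), (2, 0), (3, 4)]   # rid's matching area > 10000 (from one index)
    rids_region = [(2, 0), (3, 4), (5, 1)]   # rid's matching Region = 'Africa' (from another)
    print(intersect_rid_lists(rids_area, rids_region))   # only these records are fetched
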
  • Conjunctive Normal Form

    The previous example is still not of the most general form. A WHERE clause could be a more complex boolean expression such as:
    SELECT *
    FROM cia
    WHERE (area > 10000 AND Region = 'Africa')
       OR ((Region = 'Africa' AND Name LIKE 'T%') OR Gdp < 25000)
    In this case, we rely on the fact that any arbitrary boolean expression can be converted to what is known as Conjunctive Normal Form (CNF). An expression is in conjunctive normal form if it is a conjunction of expressions, where each such conjunct is a disjunction of base terms. For example, the earlier expression is equivalent to the following CNF formula:
    (area > 10000 OR Name LIKE 'T%' OR Gdp < 25000)
    AND
    (Region = 'Africa' OR Gdp < 25000)

    To demonstrate the terminology, we note that (area > 10000 OR Name LIKE 'T%' OR Gdp < 25000) is a conjunct, comprised of three terms, the first of which is area > 10000.

    We can thus generalize the above evaluation strategies as follows. For each conjunct, we can consider possible evaluation strategies to find a subset of records which match that conjunct. Then we could take the intersection of all such subsets, or we could use one conjunct to pare down the original set, and evaluate other conjuncts on-the-fly.

    For evaluating a specific conjunct, the existence of a disjunction has the following effects:

  • If even a single term exists without a relevant index, then we would be required to do a full scan of records to correctly evaluate that conjunct. Once we recognize that we are forced to do a full scan for this conjunct, there is no real gain in trying to optimize based on other indices.
  • If indexes do exist for each term of a conjunct, then we can use those indices to build a set of relevant rid's for each term, and then take the union of those sets to create a complete list of rid's for which we would then retrieve full records (as sketched below).
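
    A minimal Python sketch of that second case (hypothetical rid lists): for a conjunct that is a disjunction of terms, each term's rid list comes from its own index, and the lists are unioned before any records are retrieved.

    def union_rid_lists(*rid_lists):
        """Union several rid lists; a record qualifies if any term of the disjunction matches."""
        return sorted(set().union(*rid_lists))

    rids_area = [(0, 1), (2, 0)]            # rid's matching area > 10000
    rids_name = [(2, 0), (4, 3)]            # rid's matching Name LIKE 'T%'
    rids_gdp  = [(5, 2)]                    # rid's matching Gdp < 25000
    print(union_rid_lists(rids_area, rids_name, rids_gdp))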

  • Projection

  • We can think of the projection operation as a two-step process:
    1. Selecting the chosen attributes from the complete records
    2. Removing duplicates from the result (if DISTINCT is designated)

    There is not much challenge to the first step. The only thing to note is that an index-only evaluation may be possible if all needed attributes are included in the index.

    As to the second step, this is more time-consuming. The only apparent way to remove duplicates is to make sure that all identical records get grouped together in some way. There are two ways we can approach this task: one based on sorting, one based on hashing.

  • Projection based on Sorting
  • Do scan of complete records to filter down to chosen attributes
  • Sort those intermediate results
  • Remove duplicates from the sorted list
  • A few slight improvements may be possible (see the sketch below). First, the sorting algorithm could be adjusted so that duplicates are thrown away along the way, rather than as post-processing (this is quite easy for mergesort). Second, the initial pass for creating runs for mergesort can be combined with the initial pass for filtering attributes of the records. This would presumably save one complete pass over the data.
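
    A minimal in-memory Python sketch of the sort-based approach (an ordinary sort stands in for an external mergesort): filter each record down to the chosen attributes, sort, and drop adjacent duplicates.

    def project_sort(records, attrs):
        """Project records onto attrs and remove duplicates via sorting."""
        filtered = [tuple(rec[a] for a in attrs) for rec in records]   # step 1: keep chosen attributes
        filtered.sort()                                                # step 2: group identical rows together
        result = []
        for row in filtered:
            if not result or row != result[-1]:                        # drop adjacent duplicates
                result.append(row)
        return result

    cia = [{"Name": "Togo", "Region": "Africa"}, {"Name": "Chad", "Region": "Africa"}]
    print(project_sort(cia, ["Region"]))    # [('Africa',)]
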
  • Projection based on Hashing

    Our overall approach goes as follows:
  • Partition all original records into one of B-1 partitions, based on a hash function.
  • For each resulting partition, remove duplicates within that partition (can recurse with different hash function, if worthwhile!).
  • This works because we will be sure that all identical records get hashed into the same bucket. The potential need for recursion arises because there may be many non-identical records which still get hashed to the same bucket.

    The reason that we have chosen to use only B-1 buckets for our hashtable is based on effectively using the buffer pool's B pages. We can use one page of the buffer pool for each of the B-1 buckets, in order to accumulate results as items get inserted. The final page of the buffer pool can be used for scanning the input.
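
    A minimal in-memory Python sketch of the hash-based approach, with B standing for the number of buffer pool pages: records are split into B-1 partitions by one hash function, and duplicates are then removed within each partition separately.

    def project_hash(records, attrs, B=5):
        """Project onto attrs, then remove duplicates partition by partition."""
        filtered = [tuple(rec[a] for a in attrs) for rec in records]
        partitions = [[] for _ in range(B - 1)]       # B-1 output buffers; 1 page left for input
        for row in filtered:
            partitions[hash(row) % (B - 1)].append(row)
        result = []
        for part in partitions:                       # duplicates can only occur within one partition
            result.extend(set(part))                  # in practice: a second hash function, or recursion
        return result

    cia = [{"Region": "Africa"}, {"Region": "Africa"}, {"Region": "Asia"}]
    print(sorted(project_hash(cia, ["Region"])))      # [('Africa',), ('Asia',)]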


  • Join

    For our discussion of joins, we consider the following sample query from the text:
    SELECT *
    FROM R, S
    WHERE R.i = S.j
    We adopt the following notation, saying that relation R has M pages of data, with each page holding pR records. Similarly, we assume S has N pages of data, with each page holding pS records.

    For a typical example, we consider relation R with M=1000 and pR=100 (thus 100,000 records overall), and relation S with N=500 and pS=80 (thus 40,000 records overall).

  • Nested Loops Join

    foreach tuple r in R do
         foreach tuple s in S do
             if r.i = s.j then add < r,s > to result
    Outer loop requires M I/Os to scan R. There are (pR*M) individual records in R, and for each of these, we use N I/Os to scan S.

    Total cost is thus (M + pR*M*N) I/Os.

    On our hypothetical example, we get (1000 + 5*10^7) I/Os.

    Refinement: given that we have a page of R in the buffer pool at a time, it is clearly inefficient to repeat the inner loop for each individual record of R. If we instead do this join page-at-a-time, we would execute the inner loop only M times (rather than (pR*M) times), thereby getting a total cost of (M + M*N) I/O's.

    Slight Refinement: by reversing the roles of R and S, we could minimize this cost when N < M, by getting (N + M*N) I/Os. However since the M*N term is the dominant one, we gain very little by this optimization.
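
    A small Python sketch of the cost arithmetic above, plugging in the hypothetical figures for R and S:

    M, pR = 1000, 100       # R: 1000 pages, 100 records per page
    N, pS = 500, 80         # S: 500 pages, 80 records per page

    tuple_at_a_time = M + pR * M * N        # scan R once, scan S once per R tuple
    page_at_a_time  = M + M * N             # scan S once per R page instead
    reversed_roles  = N + N * M             # page-at-a-time with S as the outer relation

    print(tuple_at_a_time)   # 50,001,000 I/Os
    print(page_at_a_time)    # 501,000 I/Os
    print(reversed_roles)    # 500,500 I/Os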

  • Block Nested Loops Join

    The previous technique did not take advantage of the size of the buffer pool. It effectively used only 3 pages of memory: one for the scan of R, one for the scan of S, and one to accumulate output.

    We could do much better by refining the previous technique to read in a larger block of R into main memory at a time. If we read in B-2 pages of R, we would still have one additional page remaining to devote to scanning S, and one page to devote to accumulating output. We could then rewrite the previous algorithm as:

    foreach block of B-2 pages of R do
         foreach page of S do
             for all matching in-memory tuples,
             add < r,s > to result
    The outer loop still requires M I/Os, but the inner loop will only be executed ceil(M/(B-2)) times, thus the overall cost is M + N*ceil(M/(B-2)) I/Os.

    Notice that the efficiency depends greatly on the size of the buffer pool. For example, if M < B-2 (thus it fits in main memory), then the overall cost is just M+N.

    On our hypothetical example, we might consider what we could do with a buffer pool of size 100 pages. Reading the outer relation in blocks of roughly 100 pages, we get (1000 + (10*500)) = 6000 I/Os. If we reverse the roles of R and S, we get (500 + (5*1000)) = 5500 I/Os.
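
    A minimal in-memory Python sketch of the block nested loops idea, with lists of pages standing in for relations on disk and B for the buffer pool size:

    def block_nested_loops_join(R_pages, S_pages, B):
        """Join on r['i'] == s['j'], holding B-2 pages of R in memory at a time."""
        result = []
        block_size = B - 2                                   # 1 page reserved for S, 1 for output
        for start in range(0, len(R_pages), block_size):
            block = [r for page in R_pages[start:start + block_size] for r in page]
            for s_page in S_pages:                           # one full scan of S per block of R
                for s in s_page:
                    result.extend((r, s) for r in block if r["i"] == s["j"])
        return result

    R_pages = [[{"i": 1}, {"i": 2}], [{"i": 3}]]
    S_pages = [[{"j": 2}, {"j": 3}]]
    print(block_nested_loops_join(R_pages, S_pages, B=4))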

    Advanced Refinement: If the effect of block reads from disk is taken into consideration, this technique may be further improved by using smaller blocks of R, in exchange for reading larger blocks of S (rather than single pages of S).

  • Index Nested Loops Join

    If an index exists on one of the relations, we can consider whether an alternate approach should be used. Let's assume that S has an index on S.j. Then given a particular tuple r from R, we could use the index to find all entries of S with s.j = r.i:
    foreach tuple r in R do
         foreach tuple s in S where s.j = r.i (found via the index) do
             add < r,s > to result
    To analyze the cost, we must make some assumptions about the efficiency of the index searches as well as the amount of clustering between the index leaves and the records. We know that the outer loop requires M I/Os overall for the eventual scan of R. The book makes the simplifying assumption that a B+ tree index typically requires 2-4 I/Os to find the appropriate leaf. Once it has found the rid's in the index, there is an additional cost to pull those records from disk. This cost depends on the number of matches found as well as the amount of clustering among those records' locations on disk. If unclustered, there will presumably be up to one additional I/O per record retrieved.
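
    A minimal Python sketch of index nested loops, with a dict standing in for an existing hash or B+ tree index on S.j (the stand-in index returns matching S tuples directly rather than rid's):

    from collections import defaultdict

    def index_nested_loops_join(R, S):
        """Join on r['i'] == s['j'] by probing an index on S.j for each tuple of R."""
        index = defaultdict(list)           # stand-in for the assumed existing index on S.j
        for s in S:
            index[s["j"]].append(s)
        result = []
        for r in R:                         # one scan of R; one index probe per R tuple
            for s in index.get(r["i"], []):
                result.append((r, s))
        return result

    R = [{"i": 1}, {"i": 2}]
    S = [{"j": 2}, {"j": 2}, {"j": 5}]
    print(index_nested_loops_join(R, S))
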
  • Sort-Merge Join

    We can start by sorting each relation individually, on the relevant attribute. This has the effect of partitioning each relation into groups based on this value.

    We can then perform the join efficiently by only computing the cross-product of partitions of R and S which have matching values. These partitions can be matched using an algorithm similar to a merge, as we simultaneously scan a current R partition and a current S partition.

    The only other challenge we may face is a partition that is so large that it does not fit in main memory. If such a large partition of R matched a large partition of S, outputting the cross-product of those two partitions would require additional I/Os.
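
    A minimal in-memory Python sketch of the merge phase (assuming both sorted relations fit in memory, so the large-partition complication does not arise): matching groups of R and S values are paired up with a merge-like simultaneous scan.

    def sort_merge_join(R, S):
        """Join on r['i'] == s['j'] by sorting both inputs and merging matching groups."""
        R = sorted(R, key=lambda r: r["i"])
        S = sorted(S, key=lambda s: s["j"])
        result, i, j = [], 0, 0
        while i < len(R) and j < len(S):
            if R[i]["i"] < S[j]["j"]:
                i += 1
            elif R[i]["i"] > S[j]["j"]:
                j += 1
            else:                                    # matching value: collect both groups
                v = R[i]["i"]
                r_group = []
                while i < len(R) and R[i]["i"] == v:
                    r_group.append(R[i]); i += 1
                s_group = []
                while j < len(S) and S[j]["j"] == v:
                    s_group.append(S[j]); j += 1
                result.extend((r, s) for r in r_group for s in s_group)   # cross-product of the two groups
        return result

    R = [{"i": 3}, {"i": 1}, {"i": 3}]
    S = [{"j": 3}, {"j": 2}]
    print(sort_merge_join(R, S))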

    The analysis depends on several factors. For starters, we must sort each relation, if not already stored in sorted order. The external merge sort algorithm takes O(M log M) I/Os, where the base of the logarithm depends on B, the size of the buffer pool.

    For our hypothetical example with B=100, we might expect each sort to complete in only two passes, though with separate reads and writes for each pass. Thus R is sorted using 4000 I/Os, and S with 2000 I/Os. The final phase of the sort-merge algorithm likely takes a single scan of each relation, thus using an additional 1500 I/Os, for a total of 7500. This assumes that we do not run into the large partition complication (of course, if the join attribute is unique in at least one of the relations, this problem never occurs).

    Advanced Refinement: We can potentially find some savings by combining the merging phase of the sort with the merging phase of the join. However, such a refinement generally only works if the buffer pool is large enough, B > sqrt(L), where L is the size (in pages) of the larger of the two relations.

  • Hash Join

    The sort-merge algorithm was based on partitioning each of the individual relations into equivalence classes, and then comparing matching partitions from the two relations. We could use the same approach, with a hash-based partitioning scheme.

    The idea is to use a single hash function on the appropriate attribute of each relation, thereby inserting all records of each relation into corresponding partitions. The overall join would have two phases (similar to the hash-based partitioning used for projection).

    1. Create partitions by inserting all records of both relations into the hash table
    2. For each individual partition, find "true" < r,s > matches (as opposed to mere hashing collisions). This can be done with a "probing" technique where we build a secondary hash table for partition Ri of R, and then for each record s of the corresponding partition Si of S, we probe for matches.
    As we did for hash-based partitioning, we want to maximize the use of the buffer pool. For the partitioning phase, we could devote B-1 pages as buckets, leaving 1 page for scanning the input.

    For the probing phase, we might be in trouble if a single partition is so large that it does not fit entirely in memory. If partitioning were uniform, we would get M/(B-1) pages in each partition of R. Thus we would need a buffer pool with B > M/(B-1), that is, roughly B > sqrt(M). Of course, since hashing is not perfectly uniform, we might need to consider a fudge factor f > 1, and expect that the largest partition has size f*M/(B-1) pages. In this case, we need approximately B > sqrt(f*M) pages in the buffer pool for this technique to perform well.

    If the relations are too big relative to the buffer size, we could instead proceed recursively, for each original partition.

    The original partitioning phase requires (M+N) I/Os to scan in the records, and a corresponding (M+N) I/Os to write the buckets out to disk as they fill. The cost of the second phase depends on whether we have any partitions that are too large to fit in main memory. If not, then the second phase requires a single scan of the partitions, and thus we get an overall cost of 3(M+N) I/Os.
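
    A minimal in-memory Python sketch of the two phases (partition, then probe), with B standing for the number of buffer pool pages:

    def hash_join(R, S, B=5):
        """Join on r['i'] == s['j'] via hash partitioning followed by per-partition probing."""
        # Phase 1: partition both relations with the same hash function h1.
        h1 = lambda v: hash(v) % (B - 1)
        R_parts = [[] for _ in range(B - 1)]
        S_parts = [[] for _ in range(B - 1)]
        for r in R:
            R_parts[h1(r["i"])].append(r)
        for s in S:
            S_parts[h1(s["j"])].append(s)
        # Phase 2: for each partition pair, build an in-memory table on Ri and probe with Si.
        result = []
        for R_i, S_i in zip(R_parts, S_parts):
            table = {}
            for r in R_i:                            # secondary hash table (here a dict) for Ri
                table.setdefault(r["i"], []).append(r)
            for s in S_i:                            # probe: only "true" matches survive
                result.extend((r, s) for r in table.get(s["j"], []))
        return result

    R = [{"i": 1}, {"i": 2}]
    S = [{"j": 2}, {"j": 3}]
    print(hash_join(R, S))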

  • Hybrid Hash Join

    If there are significantly more pages in the buffer pool than the required sqrt(f*M), we can reduce I/Os by not writing some of the buckets out to disk at all, instead using the additionally available pages of memory to keep that data resident, in preparation for the second phase of the hash join algorithm.

  • Michael Goldwasser