Evaluation strategies:

SELECT *
FROM cia
WHERE area > 10000
A query may also involve several conditions:

SELECT *
FROM cia
WHERE area > 10000
AND Region = 'Africa'
AND Name LIKE 'T%'

There are several possible strategies: we could scan the entire relation and check all three conditions for each record, or we could use an index on one of the conditions (if one exists) to retrieve candidate records, checking the remaining conditions on the fly for each record retrieved.
Whether such an initial use of the index is helpful depends on conditions similar to the previous discussion on clustered vs. unclustered indices, and selectivity of the condition.
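As a rough illustration of the index-plus-filtering strategy, here is a small Python sketch. The handful of sample rows and the dictionary standing in for an index on Region are assumptions made purely for illustration; a real system would of course work with pages and rids rather than an in-memory list.

from collections import defaultdict

# Hypothetical sample of the cia relation: (Name, Region, area, Gdp)
cia = [
    ("Tunisia",  "Africa",   163610,   77000),
    ("Togo",     "Africa",    56785,    8600),
    ("Tonga",    "Oceania",     748,     500),
    ("Tanzania", "Africa",   945087,   23300),
    ("Kenya",    "Africa",   580367,   41000),
    ("Monaco",   "Europe",        2, 7672000),
]

# A dictionary standing in for an (unclustered) index on Region.
region_index = defaultdict(list)
for rid, rec in enumerate(cia):
    region_index[rec[1]].append(rid)

# Use the index for Region = 'Africa', then check the remaining
# conditions (area > 10000, Name LIKE 'T%') on the fly.
result = []
for rid in region_index["Africa"]:
    name, region, area, gdp = cia[rid]
    if area > 10000 and name.startswith("T"):
        result.append(cia[rid])

print(result)   # Tunisia, Togo, and Tanzania qualify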
Next, consider a query whose condition involves disjunctions:

SELECT *
FROM cia
WHERE (area > 10000 AND Region = 'Africa')
OR ((Region = 'Africa' AND Name LIKE 'T%') OR Gdp < 25000)

In this case, we rely on the fact that any arbitrary boolean expression can be converted to what is known as Conjunctive Normal Form (CNF). An expression is in conjunctive normal form if it is a conjunction of expressions, where each such conjunct is a disjunction of base terms. For example, the above condition is equivalent to the following CNF formula:

(area > 10000 OR Name LIKE 'T%' OR Gdp < 25000)
AND
(Region = 'Africa' OR Gdp < 25000)
To demonstrate the terminology, we note that (area > 10000 OR Name LIKE 'T%' OR Gdp < 25000) is a conjunct, composed of three terms, the first of which is area > 10000.
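Since the equivalence may not be obvious, a brute-force truth-table check is easy to script. In the Python sketch below, A, B, C, and D stand for the four base terms (area > 10000, Region = 'Africa', Name LIKE 'T%', and Gdp < 25000 respectively):

from itertools import product

for A, B, C, D in product([False, True], repeat=4):
    original = (A and B) or ((B and C) or D)
    cnf = (A or C or D) and (B or D)
    assert original == cnf      # holds for all 16 assignments
print("equivalent")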
We can thus generalize the above evaluation strategies as follows. For each conjunct, we can consider possible evaluation strategies to find a subset of records which match that conjunct. Then we could take the intersection of all such subsets, or we could use one conjunct to pare down the original set, and evaluate the other conjuncts on the fly.
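Continuing the earlier Python sketch (and reusing its hypothetical cia list), the intersection strategy might look like the following, with one set of matching rids computed per conjunct:

# One boolean function per conjunct of the CNF formula, applied to a
# record laid out as (Name, Region, area, Gdp).
conjuncts = [
    lambda rec: rec[2] > 10000 or rec[0].startswith("T") or rec[3] < 25000,
    lambda rec: rec[1] == "Africa" or rec[3] < 25000,
]

# Evaluate each conjunct independently, then intersect the rid sets.
matching = [{rid for rid, rec in enumerate(cia) if c(rec)} for c in conjuncts]
result_rids = set.intersection(*matching)
print([cia[rid][0] for rid in sorted(result_rids)])   # every country except Monaco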
For evaluating a specific conjunct, the existence of a disjunction has the following effects:
Evaluating a projection involves two steps: removing the unwanted attributes from each retrieved record, and, if duplicates must be eliminated, removing duplicate records from the result. There is not much challenge to the first step. The only thing to note is that an index-only evaluation may be possible if all needed attributes are included in the index.
As to the second step, this is more time-consuming. The only apparent way to remove duplicates is to make sure that all identical records get grouped together in some way. There are two ways we can approach this task: one based on sorting, one based on hashing.
The reason that we have chosen to use only B-1 buckets for our hash table is based on effectively using the buffer pool's B pages. We can use one page of the buffer pool for each of the B-1 buckets, in order to accumulate results as items get inserted. The final page of the buffer pool can be used for scanning the input.
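A minimal in-memory sketch of the hash-based approach follows (Python). The buffer pool size B is a made-up number, and the buckets live entirely in memory here; a real implementation would write a bucket's buffer page to disk whenever it fills, and eliminate duplicates bucket by bucket afterward.

B = 10                  # hypothetical buffer pool size, in pages
NUM_BUCKETS = B - 1     # the remaining page is reserved for scanning the input

def distinct(records):
    # Phase 1: hash each (already projected) record into one of B-1 buckets.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for rec in records:
        buckets[hash(rec) % NUM_BUCKETS].append(rec)
    # Phase 2: duplicates always land in the same bucket, so each bucket
    # can be de-duplicated independently.
    result = []
    for bucket in buckets:
        result.extend(set(bucket))
    return result

print(distinct([("Africa",), ("Asia",), ("Africa",), ("Oceania",)]))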
We next consider the evaluation of a join, such as:

SELECT *
FROM R, S
WHERE R.i = S.j

We adopt the following notation, saying that relation R has M pages of data, with each page holding pR records. Similarly, we assume S has N pages of data, with each page holding pS records.

For a typical example, we consider relation R with M=1000 and pR=100 (thus 100,000 records overall), and relation S with N=500 and pS=80 (thus 40,000 records overall).
foreach tuple r in R do
  foreach tuple s in S do
    if r.i = s.j then add <r,s> to result

The outer loop requires M I/Os to scan R. There are (pR*M) individual records in R, and for each of these, we use N I/Os to scan S.
Total cost is thus (M + pR*M*N) I/Os.
On our hypothetical example, we get (1000 + 5*10^7) I/Os.
Refinement: given that we have a page of R in the buffer pool at a time, it is clearly inefficient to repeat the inner loop for each individual record of R. If we instead do this join page-at-a-time, we would execute the inner loop only M times (rather than (pR*M) times), thereby getting a total cost of (M + M*N) I/Os.
Slight Refinement: by reversing the roles of R and S, we could minimize this cost when N < M, getting (N + M*N) I/Os. However, since the M*N term is the dominant one, we gain very little by this optimization.
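As a quick sanity check of these formulas on the hypothetical numbers, in Python:

M, pR = 1000, 100    # R: 1000 pages, 100 records per page
N, pS = 500, 80      # S: 500 pages, 80 records per page

print(M + pR * M * N)    # tuple-at-a-time:  50,001,000 I/Os
print(M + M * N)         # page-at-a-time:      501,000 I/Os
print(N + M * N)         # roles reversed:      500,500 I/Os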
We could do much better by refining the previous technique to read a larger block of R into main memory at a time. If we read in B-2 pages of R, we would still have one additional page remaining to devote to scanning S, and one page to devote to accumulating output. We could then rewrite the previous algorithm as:
foreach block of B-2 pages of R do
  foreach page of S do
    for all matching in-memory tuples,
      add <r,s> to result

The outer loop still requires M I/Os, but the inner loop will only be executed ceil(M/(B-2)) times, thus the overall cost is M + N*ceil(M/(B-2)) I/Os.
Notice that the efficiency depends greatly on the size of the buffer pool. For example, if M <= B-2 (so that all of R fits in main memory at once), then the overall cost is just M+N.
On our hypothetical example, we might consider what we could do with a buffer pool of size 100 pages. Approximating B-2 by 100, we get (1000 + 10*500) = 6000 I/Os. If we reverse the roles of R and S, we get (500 + 5*1000) = 5500 I/Os.
Advanced Refinement: If the effect of block reads from disk is taken into consideration, this technique may be further improved by using smaller blocks of R, in exchange for reading larger blocks of S (rather than single pages of S).
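For concreteness, here is a small in-memory Python sketch of block nested loops. Relations are modeled as lists of pages (each page a list of tuples); building a temporary lookup table over the in-memory block of R is not required by the algorithm, but is a common way to speed the matching step.

from collections import defaultdict

def block_nested_loops_join(R_pages, S_pages, B, i=0, j=0):
    # Join R and S on R[i] = S[j], holding B-2 pages of R in memory at a time.
    result = []
    block_size = B - 2                       # one page for S, one for output
    for start in range(0, len(R_pages), block_size):
        block = [r for page in R_pages[start:start + block_size] for r in page]
        lookup = defaultdict(list)           # in-memory structure over the block
        for r in block:
            lookup[r[i]].append(r)
        for page in S_pages:                 # one full scan of S per block of R
            for s in page:
                for r in lookup.get(s[j], []):
                    result.append(r + s)
    return result

R = [[(1, "a"), (2, "b")], [(3, "c")]]       # two pages of R
S = [[(2, "x"), (3, "y")]]                   # one page of S
print(block_nested_loops_join(R, S, B=4))    # [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]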
foreach tuple r in R do
  foreach tuple s in S where r.i = s.j do
    add <r,s> to result

To analyze the cost, we must make some assumptions about the efficiency of the index searches, as well as the amount of clustering between index leaves and records. We know that the outer loop requires M I/Os overall for the eventual scan of R. The book makes the simplifying assumption that a B+ tree index typically requires 2-4 I/Os to find the appropriate leaf. Once it has found the rids in the index, there is an additional cost to pull those records from disk. This cost depends on the number of matches found, as well as the amount of clustering of those records' locations on disk. If unclustered, there will presumably be up to one additional I/O per record retrieved.
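To get a feel for the resulting cost on the running example, suppose each index probe costs about 3 I/Os to reach a leaf, each R record matches a single S record, and the index is unclustered so that each match costs one further I/O. All of these are assumptions for illustration only.

M, pR = 1000, 100        # R: 1000 pages, 100 records per page
probe_cost = 3           # assumed I/Os to descend the B+ tree to a leaf
matches_per_record = 1   # assumed matching S records per R record
fetch_cost = 1           # extra I/O per match for an unclustered index

total = M + (pR * M) * (probe_cost + matches_per_record * fetch_cost)
print(total)             # 401,000 I/Os under these assumptions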
We can then perform the join efficiently by only computing the cross-product of partitions of R and S which have matching values. These partitions can be matched using an algorithm similar to a merge, as we simultaneously scan a current R partition and a current S partition.
The only other challenge we may face is a partition that is so large that it does not fit in main memory. If such a large partition of R matched a large partition of S, outputting the cross-product of those two partitions would require additional I/Os.
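A minimal Python sketch of the merge phase follows. Both inputs are assumed to be already sorted on the join attribute, and each group of equal values is assumed small enough to hold in memory (i.e., we ignore the large-partition complication just described).

def merge_join(R, S, i=0, j=0):
    # R and S are lists of tuples, sorted on attributes i and j respectively.
    result, r, s = [], 0, 0
    while r < len(R) and s < len(S):
        if R[r][i] < S[s][j]:
            r += 1
        elif R[r][i] > S[s][j]:
            s += 1
        else:
            # Matching partitions found: gather all tuples with this value
            # from each input, and emit their cross-product.
            value = R[r][i]
            r_end = r
            while r_end < len(R) and R[r_end][i] == value:
                r_end += 1
            s_end = s
            while s_end < len(S) and S[s_end][j] == value:
                s_end += 1
            for rt in R[r:r_end]:
                for st in S[s:s_end]:
                    result.append(rt + st)
            r, s = r_end, s_end
    return result

R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (3, "y")]
print(merge_join(R, S))    # [(2, 'b', 2, 'x'), (2, 'c', 2, 'x')]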
The analysis depends on several factors. For starters, we must sort each relation, if not already stored in sorted order. The external merge sort algorithm requires O(M log M) I/Os, where the base of the logarithm depends on B, the size of the buffer pool.
For our hypothetical example with B=100, we might expect the sort to complete in only two passes, though with separate reads and writes for each pass. Thus R is sorted using 4000 I/Os, and S with 2000 I/Os. The final phase of the sort-merge algorithm likely takes a single scan of each relation, thus using an additional 1500 I/Os, for a total of 7500. This assumes that we do not run into the large partition complication (of course, if the join attribute is a key of one of the relations, this problem never occurs).
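The arithmetic behind these numbers can be checked with the standard external merge sort cost formula (each pass reads and writes every page); the sketch below assumes B pages are used to form the initial runs and B-1 runs are merged per subsequent pass.

import math

def external_sort_cost(pages, B):
    runs = math.ceil(pages / B)              # initial sorted runs
    if runs > 1:
        passes = 1 + math.ceil(math.log(runs, B - 1))
    else:
        passes = 1
    return passes, 2 * pages * passes        # one read and one write per pass

B = 100
print(external_sort_cost(1000, B))   # (2, 4000): sorting R
print(external_sort_cost(500, B))    # (2, 2000): sorting S
print(4000 + 2000 + 1500)            # 7500 I/Os overall, including the merge scan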
Advanced Refinement: We can potentially find some savings by combining the merging phase of the sort with the merging phase of the join. However, such a refinement generally works only if the buffer pool is large enough, B > sqrt(L), where L is the number of pages in the larger of the two relations.
The idea is to use a single hash function on the appropriate attribute of each relation, thereby inserting the records of both relations into a common set of hash buckets. The overall join would have two phases (similar to the hash-based partitioning used for duplicate elimination).
For the probing phase, we might be in trouble if a single partition is so large as to not fit entirely in memory. If partitioning were uniform, we would get M/(B-1) pages in each partition of R. Thus we would need a buffer pool with B > M/(B-1). Of course, since hashing is not perfectly uniform, we might need to consider a fudge factor f > 1, and expect that the largest partition has size f*M/(B-1). In this case, we need approximately B > sqrt(f*M) pages in the buffer pool for this technique to perform well.
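For the running example, and picking an arbitrary fudge factor of f = 1.1 just to see the magnitude, the requirement is quite modest:

import math

M, f = 1000, 1.1                         # f is a hypothetical fudge factor
print(math.ceil(math.sqrt(f * M)))       # roughly B > 34 pages suffices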
If the relations are too big relative to the buffer size, we could instead proceed recursively, for each original partition.
The initial partitioning phase requires (M+N) I/Os to scan in the records, and a corresponding (M+N) I/Os to write buckets out to disk as they fill. The cost of the second phase depends on whether we have any partitions that are too large to fit in main memory. If not, then the second phase requires a single additional scan of each relation, and thus we get an overall cost of 3(M+N) I/Os.
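A minimal in-memory Python sketch of the two phases is given below. The partitions are kept in memory rather than written out and re-read, so it illustrates the structure of the algorithm but not its 3(M+N) I/O behavior (which, on the running example, would be 3*(1000+500) = 4500 I/Os).

from collections import defaultdict

def hash_join(R, S, i=0, j=0, num_partitions=4):
    # Phase 1: partition both relations with the same hash function.
    R_parts = defaultdict(list)
    S_parts = defaultdict(list)
    for r in R:
        R_parts[hash(r[i]) % num_partitions].append(r)
    for s in S:
        S_parts[hash(s[j]) % num_partitions].append(s)

    # Phase 2: for each partition, build an in-memory table on the R side
    # and probe it with the matching S partition.
    result = []
    for p in range(num_partitions):
        table = defaultdict(list)
        for r in R_parts[p]:
            table[r[i]].append(r)
        for s in S_parts[p]:
            for r in table.get(s[j], []):
                result.append(r + s)
    return result

R = [(1, "a"), (2, "b")]
S = [(2, "x"), (2, "y"), (3, "z")]
print(hash_join(R, S))    # [(2, 'b', 2, 'x'), (2, 'b', 2, 'y')]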