Tasks on OLAP Optimization

1. Context

This document contains thoughts on self-study tasks for a text on OLAP optimization.

2. Back-of-the-envelope calculations

The purpose of these tasks is to get a feeling for the time required for query processing. Roughly, a few “small” queries are fast, while “large” ones need optimizations to allow interactive processing, as in OLAP. Also, many “small” queries are slower than a single large query of the same total size.

2.1. Transfer random block

Rotate the disk to the proper location, which introduces a latency of \(l\) seconds. (There is no rotation with SSDs; however, as mentioned in the text, a latency still arises from the overhead of individual read requests.) Then transfer a block of size \(x\) KiB at a speed of \(y\) MB/s:

\(l + \frac{x \cdot 1024\,\mathrm{B}}{y \cdot 10^6\,\mathrm{B/s}}\)
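
A small Python helper makes the formula concrete (the values in the example call, 5 ms latency, 8 KiB blocks, 200 MB/s, are assumed for illustration):

    def random_block_time(latency_s: float, block_kib: float,
                          throughput_mb_s: float) -> float:
        """Time to fetch one random block: latency plus transfer.

        latency_s:        l, latency in seconds (rotation/seek, or
                          per-request overhead on SSDs)
        block_kib:        x, block size in KiB (1 KiB = 1024 B)
        throughput_mb_s:  y, transfer speed in MB/s (1 MB = 10^6 B)
        """
        return latency_s + (block_kib * 1024) / (throughput_mb_s * 1e6)

    # Assumed values: 5 ms latency, 8 KiB block, 200 MB/s.
    print(f"{random_block_time(0.005, 8, 200) * 1000:.3f} ms")  # ~5.041 ms

Note that the latency dominates: the actual transfer of an 8 KiB block contributes only about 0.04 ms here.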

2.2. Produce 1% of tuples for unsorted data

As the data is not sorted, all tuples need to be checked for a match, i.e., the entire data set is read sequentially. Thus, the percentage of matching tuples does not matter for the calculation.
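
One plausible cost model, assuming the data is laid out contiguously so that the latency is paid only once for the first block, followed by streaming at sequential speed (the table size is an assumed example value):

    def full_scan_time(latency_s: float, total_mb: float,
                       throughput_mb_s: float) -> float:
        """Sequential scan of the whole data set: pay the latency once,
        then stream everything at sequential speed."""
        return latency_s + total_mb / throughput_mb_s

    # Assumed values: 5 ms latency, 10 GB data set, 200 MB/s.
    print(f"{full_scan_time(0.005, 10_000, 200):.2f} s")  # about 50 s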

2.3. Produce exactly one tuple for unsorted data

As the data is not sorted, the single matching tuple may appear at any position. Assuming a uniformly distributed position, it is found on average after reading half the data, at which point the scan can stop.
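
The expected cost is therefore half that of a full scan; a sketch under the same assumptions and example values as above:

    def single_match_scan_time(latency_s: float, total_mb: float,
                               throughput_mb_s: float) -> float:
        """Expected time to find the one matching tuple: with a uniformly
        distributed position, on average half the data is read."""
        return latency_s + (total_mb / 2) / throughput_mb_s

    # Assumed values: 5 ms latency, 10 GB data set, 200 MB/s.
    print(f"{single_match_scan_time(0.005, 10_000, 200):.2f} s")  # about 25 s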

2.4. Produce tuples for sorted data

Use binary search to locate the start of the relevant data. Note that we transfer blocks, not individual tuples; thus, calculate the number \(n\) of blocks first. For simplicity, suppose \(n = \text{TotalSize}/\text{BlockSize}\) (in practice, blocks may not be filled entirely; e.g., see fillfactor for PostgreSQL). In the worst case, we transfer \(\lceil \log_2 n \rceil\) blocks until we are sure about the starting tuple. Note that these block transfers are random ones; thus, use the time for a random block transfer calculated above.

Then read only the relevant data sequentially, beginning with the starting tuple.
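
Putting both phases together, a sketch of the calculation (same assumed device parameters as before; a result size of 100 MB, i.e., 1% of a 10 GB data set, is also an assumed example value):

    import math

    def sorted_lookup_time(latency_s: float, block_kib: float,
                           throughput_mb_s: float, total_mb: float,
                           result_mb: float) -> float:
        """Binary search over n blocks (ceil(log2 n) random block
        transfers in the worst case), then a sequential read of the
        relevant range."""
        n_blocks = math.ceil(total_mb * 1e6 / (block_kib * 1024))
        random_block = latency_s + (block_kib * 1024) / (throughput_mb_s * 1e6)
        search = math.ceil(math.log2(n_blocks)) * random_block
        read = latency_s + result_mb / throughput_mb_s
        return search + read

    # Assumed values: 5 ms latency, 8 KiB blocks, 200 MB/s, 10 GB data,
    # 100 MB (1%) result.
    print(f"{sorted_lookup_time(0.005, 8, 200, 10_000, 100):.2f} s")  # ~0.61 s

Compared with the roughly 50 s full scan of the unsorted case, sorting pays off dramatically: about 21 random block transfers for the search, then a short sequential read.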

3. Query Optimization

The optimizer produces some equivalent plans (the search space is too large for exhaustive search), estimates their costs, and executes the plan with the lowest estimated cost. Consequently, the best plan might not be among those considered by the optimizer.
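
A toy illustration of the selection step (the plan names and cost values here are invented; a real optimizer derives candidate plans via transformation rules and estimates costs from data statistics):

    # Hypothetical candidate plans with estimated costs.
    candidate_plans = {
        "seq_scan + hash_join": 1450.0,
        "index_scan + nested_loop": 890.0,
        "seq_scan + merge_join": 1210.0,
    }

    # Pick the plan with the lowest estimated cost; the true best plan
    # may not be in this (incomplete) candidate set.
    best = min(candidate_plans, key=candidate_plans.get)
    print(f"execute: {best} (estimated cost {candidate_plans[best]})")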