Tasks on OLAP and ETL

1. Context

This document contains thoughts on self-study tasks for a text on OLAP and ETL.

Please, put some effort into those tasks, before you read on.

2. Prerequisites

2.1. SQL

SELECT c_name, c_address, c_phone, n_name, sum(o_totalprice) as order_volume
FROM orders, customer, nation, region
WHERE o_custkey = c_custkey and c_nationkey = n_nationkey and n_regionkey = r_regionkey
      and r_name = 'EUROPE'
GROUP BY c_name, c_address, c_phone, n_name
HAVING sum(o_totalprice) > 1000000;

Importantly, we “filter” ordinary attributes (selections in the relational algebra) with WHERE, but aggregate values with HAVING.

Joins can be expressed in a number of syntactic variants in SQL, and you should be able to formulate one such variant (which may have a different form than the one above).

Aggregation requires grouping by the non-aggregated attributes.

2.2. Aggregation and FDs

Both queries produce the same number of rows with the same aggregate values.

3. Multidimensional Schemata

3.1. FDs are many-to-one relationships

Suppose AccountID → CustomerID. Then each account is owned by exactly one customer but each customer can own many accounts.

3.2. Level Day

Just by looking at the name of an attribute we do not know what it means. We have an FD Day → Month, though. Thus, Day cannot have values 1, …, 31 (as, e.g., 1 occurs in every month). Hence, Day must include information about its month (and also about year, due to another FD). Thus, Day qualifies as full Date, e.g., 1923-08-19.

4. DW as integrated database for OLAP

4.1. Prof. Plattner on ETL

Please read those 5 pages for yourself.

ETL is complex, introduces storage redundancy and delays for OLAP results. Materialized views as pre-aggregation/optimization technique (discussed in later session) reduce user flexibility. Different schemata introduce complexity and require ETL processing.

Real-world OLTP and OLAP workloads are not that different.

Thus, use of same, single database for both possible, while ETL is not necessary. On modern hardware, on-demand aggregation might be possible without the need for pre-aggregation (or other optimizations performed by database administrators).

4.2. Necessity of ETL/Integration

A data warehouse integrates several data sources. Thus, there is no single database on top of which OLAP could be performed.

The mentioned paper [YRX+20] describes the Google ecosystem, where several data sources do exist, and migration to some new single system was no option. The authors want “transactional systems to focus on transaction processing and query engines to focus on query processing”. A solution called Lightning provides “HTAP-as-a-service”, i.e., analytical query capabilities over different data sources, which keep their flexibility (without the need to change or migrate those preexisting transactional systems).

4.3. `GROUP BY ROLLUP`

The query shows summed sales values per supplier, rolled up to supplier nations, rolled up to supplier regions, and rolled up to the grand total.

If we reverse the order, we obtain aggregates per supplier three times (see above task on FDs and aggregation) and the grand total. Quite likely, this is not what the user wants.

If we group by supplier, part, customer, we create a 3-dimensional result. With ROLLUP, that result is also rolled up to 2, 1, and 0 dimensions, in one specific order. If we reverse the order, we obtain a different, but probably also meaningful result. In contrast, the previous queries performed roll-up operations within the hierarchy of a single dimension (namely, supplier), where one can only roll-up from more detailed to less detailed levels.