INFO-H-419: Data Warehouses

Lecturer

Volume

Study Programme

Schedule

The course is given during the first semester.

Grading

Course Summary

Relational and object-oriented databases are mainly suited to operational settings in which many small transactions query and write to the database. Consistency of the database (in the presence of potentially conflicting transactions) is of utmost importance. The situation is very different in analytical processing, where historical data is analyzed and aggregated in many different ways. Such queries differ significantly from the typical transactional queries in the relational model:

For these reasons, data to be analyzed is typically collected into a data warehouse with Online Analytical Processing (OLAP) support. Online here refers to the fact that the answers to the queries should not take too long to compute. Collecting the data is often referred to as Extract-Transform-Load (ETL). The data in the data warehouse needs to be organized in a way that enables the analytical queries to be executed efficiently. For the relational model, star and snowflake schemas are popular designs. Besides OLAP on top of a relational database (ROLAP), native OLAP solutions based on multidimensional structures (MOLAP) also exist. To further improve query answering efficiency, some query results can be materialized in the database in advance, and new indexing techniques have been developed.

In the course, the main concepts of multidimensional databases will be covered and illustrated using the SQL Server tools.
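As a rough illustration of the star schema and the materialized query results mentioned in the summary, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not the course's SQL Server tooling; the table names (fact_sales, dim_product, dim_date, agg_sales_category_year) and the sample rows are hypothetical and chosen only for the example.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: one fact table and two dimension tables.

    All table names and rows are hypothetical sample data.
    """
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            amount REAL);
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "laptop", "hardware"), (2, "mouse", "hardware"),
         (3, "editor", "software")])
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(1, 2022, 12), (2, 2023, 1)])
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(1, 1, 1000.0), (2, 1, 20.0), (3, 2, 50.0), (1, 2, 1200.0)])


# Analytical query: join the fact table to its dimensions and aggregate.
AGGREGATE_SQL = """
    SELECT p.category AS category, d.year AS year,
           SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
"""


def sales_by_category_and_year(conn):
    """Run the aggregation directly against the star schema."""
    return conn.execute(
        AGGREGATE_SQL + " ORDER BY category, year").fetchall()


def materialize_summary(conn):
    """Store the aggregate in a summary table so later queries can read it
    without repeating the joins (SQLite has no native materialized views,
    so a plain table stands in for one here)."""
    conn.execute("CREATE TABLE agg_sales_category_year AS " + AGGREGATE_SQL)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    for row in sales_by_category_and_year(connection):
        print(row)
    materialize_summary(connection)
```

The summary table trades storage and refresh cost for faster reads, which is the trade-off behind materializing query results in a warehouse.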

Books

Extra books

The following materials have been used to construct the course material, but are not required reading for the course:

Prerequisites

Course Slides

Software

All software used in the course is available in the computer labs. Students who want a personal copy of the software on their own computers can obtain it for free. Succinct instructions for acquiring the software are included below; if additional help is required, you can contact the sysadmin of the department: Robin Choquet (Robin.Choquet@ulb.be).

Exercises

Group Project

TPC is a non-profit corporation that defines transaction processing and database benchmarks and disseminates objective, verifiable TPC performance data to the industry. Regarding data warehouses, two TPC benchmarks are relevant:

The course project consists of two parts:

You are free to choose the tools with which the two benchmarks will be implemented. For example, the TPC-DS benchmark could be implemented on SQL Server Analysis Services, Pentaho Analysis Services (aka Mondrian), etc. Similarly, the TPC-DI benchmark could be implemented with SQL Server Integration Services, Pentaho Data Integration, Talend Data Studio, SQL scripts, etc., which then load the data warehouse into a DBMS such as SQL Server, Oracle, PostgreSQL, etc.

Furthermore, both benchmarks must be implemented with several scale factors, which determine the size of the resulting data warehouse. You DO NOT need to use the scale factors mentioned in the TPC requirements. The pedagogical objective is that you learn how to properly perform a benchmark. Therefore, you need to estimate the biggest scale factor that fits on your own computer: this will be your reference scale factor, say 1.0. You will then need three smaller scale factors, e.g., 0.1, 0.2, and 0.5 of the full size, in order to see how performance evolves with size.
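The scale-factor procedure described above can be sketched as a small timing harness in Python. This is only an illustration under stated assumptions, not part of the TPC specifications or of any required tooling: the functions benchmark and relative_to_reference, the repetition count, and the demo workload are all hypothetical, and in a real project the workload would load or query the warehouse generated at the given scale factor.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark(workload: Callable[[float], None],
              scale_factors: Sequence[float],
              repetitions: int = 3) -> Dict[float, float]:
    """Time `workload` at each scale factor and keep the median run.

    `workload` is a hypothetical callable that runs the benchmark
    (e.g. a load or a query set) at the given relative scale factor.
    """
    results = {}
    for sf in scale_factors:
        timings = []
        for _ in range(repetitions):
            start = time.perf_counter()
            workload(sf)
            timings.append(time.perf_counter() - start)
        # The median is less sensitive to one unusually slow run.
        results[sf] = statistics.median(timings)
    return results


def relative_to_reference(results: Dict[float, float],
                          reference: float = 1.0) -> Dict[float, float]:
    """Express each timing as a fraction of the reference scale factor."""
    ref_time = results[reference]
    return {sf: t / ref_time for sf, t in results.items()}


if __name__ == "__main__":
    # Stand-in workload (assumption): CPU work proportional to the scale.
    def demo_workload(sf: float) -> None:
        sum(range(int(sf * 1_000_000)))

    timings = benchmark(demo_workload, [0.1, 0.2, 0.5, 1.0])
    for sf, ratio in relative_to_reference(timings).items():
        print(f"scale {sf}: {timings[sf]:.4f} s ({ratio:.2f} of reference)")
```

Comparing the time ratios with the size ratios (0.1, 0.2, 0.5, 1.0) gives a first indication of whether performance grows linearly with the size of the data warehouse.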

The project is carried out in groups of 3-4 persons, which will be the same for the two parts. Before you can submit part I of the project, you will have to register in a group. For this, please send an email to the lecturer with the information about your group by 1/10/2023 at the latest. The submission deadlines for parts I and II are strict.

The deliverables expected for each part of the project are the following:

The project evaluation will count for 30% of your total grade. This may seem undervalued; however, putting effort into the project will definitely help you achieve a better understanding of the course material, which will result in a better score in the written exam, which accounts for 70% of the grade.

Tools of the previous year

SQL Server, PostgreSQL, MySQL, Oracle, SQLite, MariaDB, Spark SQL, DB2/Airflow, Microsoft Azure SQL, Citus, AWS Aurora, Google BigQuery, Impala

Groups of the current year

Examinations from Previous Years