MA Computer Science Projects (PROJ-H-402)

Course objective

The course PROJ-H-402 is managed by Dr. Mauro Birattari. Please refer to the course description page for the rules concerning the project. What follows is a list of project proposals supervised by academic members of CoDE.

Project proposals

Engineering of a Rule-Based Information Extraction Engine

Information extraction, the activity of extracting structured information from unstructured text, is a core data preparation step. Systems for information extraction fall into two main categories. The first category contains machine-learning based systems, where a significant amount of training data is required to train good models for specific extraction tasks. The second category consists of rule-based systems, in which the data to be extracted from the text is specified by (human-written) rules in some (often declarative) extraction language. Despite advances in machine learning, rule-based systems are widely used in practice.
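To make the notion of a rule concrete, here is a minimal sketch in Python. It assumes a rule can be approximated by a regular expression with named capture variables, and that the extracted data is the set of (start, end) spans bound to those variables. The rule, the variable names, and the sample document are all hypothetical. Python's `re` module only reports non-overlapping leftmost matches, so this illustrates the input/output shape of rule-based extraction and is not the algorithm of Florenzano et al.

```python
import re

# Hypothetical extraction rule: a two-word capitalized name followed by an
# e-mail address in angle brackets. The variables "name" and "email" are
# illustrative, not taken from any real extraction language.
RULE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) <(?P<email>[^<>\s]+@[^<>\s]+)>"
)

def extract(document):
    """Return one record per match, mapping each variable to its (start, end) span."""
    return [
        {var: match.span(var) for var in ("name", "email")}
        for match in RULE.finditer(document)
    ]

doc = "Contact Ada Lovelace <ada@example.org> or Alan Turing <alan@example.org>."
spans = extract(doc)

# Spans are positions in the document; slicing recovers the extracted text.
for record in spans:
    print({var: doc[start:end] for var, (start, end) in record.items()})
```

Returning spans rather than strings keeps the link between each extracted value and its position in the source text, which is what makes the output structured.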

In recent years, novel theoretical algorithms have been proposed to execute rule-based information extraction workloads more efficiently. The objective of this project is to implement one such algorithm, by Florenzano et al. (2018), experimentally analyze its performance, and propose extensions of the algorithm to overcome performance bottlenecks.


- Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for Regular Document Spanners. PODS 2018: 165-177

Interested? Contact Stijn Vansummeren.

Status: available

Query processing for mixed database-machine learning based workloads

Because of the growing importance and wide deployment of large-scale Machine Learning (ML), there is wide interest in the design and implementation of processing engines that can efficiently evaluate ML workloads. One class of systems, embodied by TensorFlow and SystemML, takes linear algebra as the key primitive for expressing ML workflows and obtains efficient processing engines by porting known database-style optimization techniques to the linear algebra setting. Another class of systems, embodied by FAQ queries, takes relational algebra as the key primitive but modifies it to allow the expression of certain ML workloads. To some extent, the classical optimization techniques, as well as recent results for exploiting modern hardware, transfer to this extended relational algebra. As an added bonus, traditional database workloads (OLTP/OLAP style) can be trivially supported.

The focus of this project is on the latter style of systems. The overall goal is to experimentally identify classes of FAQ queries for which it would be beneficial to exploit techniques developed in the former class of systems. Concretely, this can be approached by experimentally studying queries in the FAQ framework (featuring joins) for which known results in evaluating linear algebra operations (in particular, matrix multiplication algorithms that run in less than O(n^3) time) can be exploited.
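The connection between joins and matrix multiplication can be illustrated with a small sketch. It assumes the simplest join-aggregate query, Q(a, c) = sum over b of R(a, b) * S(b, c), with relations stored as hypothetical Python dictionaries from tuples to numeric weights. Evaluated with a hash join, this query produces exactly the entries of the matrix product of R and S, so any faster-than-cubic multiplication routine (for example Strassen's algorithm) could in principle replace the naive product below. All function names are illustrative and not part of any FAQ system.

```python
from collections import defaultdict

def join_aggregate(R, S):
    """Evaluate Q(a, c) = sum_b R(a, b) * S(b, c) with a hash join on b."""
    s_by_b = defaultdict(list)
    for (b, c), weight in S.items():
        s_by_b[b].append((c, weight))
    out = defaultdict(int)
    for (a, b), value in R.items():
        for c, weight in s_by_b.get(b, []):
            out[(a, c)] += value * weight
    return dict(out)

def to_matrix(relation, n):
    """View a weighted binary relation over {0..n-1} as a dense n x n matrix."""
    matrix = [[0] * n for _ in range(n)]
    for (i, j), value in relation.items():
        matrix[i][j] = value
    return matrix

def matmul(A, B):
    """Naive cubic-time matrix product, used here only as a reference."""
    n = len(A)
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical sample relations: keys are tuples, values are weights.
R = {(0, 1): 2, (1, 2): 3, (0, 2): 1}
S = {(1, 0): 4, (2, 0): 5, (2, 1): 7}

via_join = to_matrix(join_aggregate(R, S), 3)
via_matmul = matmul(to_matrix(R, 3), to_matrix(S, 3))
print(via_join == via_matmul)
```

The hash join only touches stored tuples, so it tends to do well on sparse inputs, while a dense matrix product ignores sparsity. Where the crossover lies is the kind of question the experiments in this project could address.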

Contact: Stijn Vansummeren

Status: available

teaching/projh402.txt · Last modified: 2018/10/02 08:13 by svsummer