This shows you the differences between two versions of the page.
teaching:projh402 [2018/10/02 08:13] svsummer |
teaching:projh402 [2020/09/14 20:08] svsummer |
||
---|---|---|---|
Line 6: | Line 6: | ||
===== Project proposals ===== | ===== Project proposals ===== | ||
- | |||
- | === Engineering of a Rule-Based Information Extraction Engine === | ||
- | |||
- | Information extraction, the activity of extracting structured | ||
- | information from unstructured text, is a core data preparation | ||
- | step. Systems for information extraction fall into two main | ||
- | categories. The first category contains machine-learning based | ||
- | systems, where a significant amount of training data is required to train | ||
- | good models for specific extraction tasks. The second category | ||
- | consists of rule-based systems in which the data to be extracted from | ||
- | the text is specified by (human-written) rules in some (often | ||
- | declarative) extraction language. Despite advances in machine | ||
- | learning, rule-based systems are widely used in practice. | ||
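To make the rule-based category concrete, here is a minimal sketch in which a single rule is a regular expression with named variables, and each result maps a variable to a span (start and end offsets) of the document. The rule, the variable names `name` and `email`, and the helper `extract_spans` are hypothetical and chosen only for illustration. Python's `re` module reports non-overlapping matches only, so this is not an implementation of the algorithm referenced in this proposal.

```python
import re

# Hypothetical extraction rule: a two-word capitalised name followed by
# an e-mail address in parentheses. Each named group is one variable.
RULE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) \((?P<email>[\w.]+@[\w.]+)\)"
)


def extract_spans(rule, text):
    """Return one dict per match, mapping each variable to a (start, end) span."""
    return [
        {var: match.span(var) for var in rule.groupindex}
        for match in rule.finditer(text)
    ]


doc = "Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) for details."
spans = extract_spans(RULE, doc)

# Spans are offsets into the document; slicing recovers the extracted text.
for row in spans:
    print({var: doc[start:end] for var, (start, end) in row.items()})
```

Returning offsets rather than substrings keeps the output structured and lets later rules refer to positions in the original text.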
- | |||
- | In recent years, novel theoretical algorithms have been proposed to | ||
- | more efficiently execute rule-based information extraction | ||
- | workloads. The objective in this project is to implement one such | ||
- | algorithm, by Florenzano et al. (2018), experimentally analyze its | ||
- | performance, and propose extensions of the algorithm to overcome | ||
- | performance bottlenecks. | ||
- | |||
- | |||
- | References: | ||
- | |||
- | - Fernando Florenzano, Cristian Riveros, Martín Ugarte, | ||
- | Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for | ||
- | Regular Document Spanners. PODS 2018: 165-177 | ||
- | |||
- | |||
- | **Interested?** Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) | ||
- | |||
- | **Status**: available | ||
- | |||
- | |||
- | === Query processing for mixed database-machine learning based workloads === | ||
- | |||
- | Because of the growing importance and wide deployment of large-scale | ||
- | Machine Learning (ML), there is wide interest in the design and | ||
- | implementation of processing engines that can efficiently evaluate ML | ||
- | workloads. One class of systems, embodied by systems such as TensorFlow | ||
- | and SystemML, takes linear algebra as the key primitive for expressing | ||
- | ML workflows, and obtains efficient processing engines by porting known | ||
- | database-style optimization techniques to the linear algebra | ||
- | setting. Another class of systems, embodied by FAQ queries, takes | ||
- | relational algebra as the key primitive, but modify it to allow | ||
- | expression of certain ML workloads. To some extent, the classical | ||
- | optimization techniques as well as recent results for exploiting | ||
- | modern hardware transfer to this extended relational algebra. As an | ||
- | added bonus, traditional database workloads (OLTP/OLAP style) can be | ||
- | trivially supported. | ||
- | |||
- | The focus in this project is on the latter style of systems. The | ||
- | overall goal is to experimentally identify classes of FAQ queries for | ||
- | which it would be beneficial to exploit techniques developed in the | ||
- | former class of systems. Concretely, this can be approached by | ||
- | experimentally studying queries in the FAQ framework (featuring joins) | ||
- | for which known results on evaluating linear algebra operations | ||
- | (specifically, matrix multiplication algorithms that run in less than | ||
- | O(n^3) time) can be exploited. | ||
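The link between join queries and matrix multiplication can be sketched as follows. The sketch assumes a simple encoding that is not taken from this page: each matrix is a relation stored as a dictionary from index pairs to values, and the query joins the two relations on the shared index `k`, then sums the products per output pair `(i, j)`. The function name `join_aggregate` is hypothetical.

```python
from collections import defaultdict


def join_aggregate(A, B):
    """Join A(i, k) with B(k, j) on k and sum the products for each (i, j).

    A and B are dicts mapping index pairs to values (an assumed encoding).
    """
    # Build a hash index on B's join attribute k.
    b_by_k = defaultdict(list)
    for (k, j), b_val in B.items():
        b_by_k[k].append((j, b_val))

    # Probe the index with each tuple of A and aggregate with SUM.
    out = defaultdict(int)
    for (i, k), a_val in A.items():
        for j, b_val in b_by_k.get(k, ()):
            out[(i, j)] += a_val * b_val
    return dict(out)


A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
product = join_aggregate(A, B)
```

On dense n-by-n inputs this hash join performs n^3 multiplications, which is the baseline that faster matrix multiplication algorithms would improve on.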
- | |||
- | **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) | ||
- | |||
- | **Status**: available | ||