This shows you the differences between two versions of the page.
teaching:projh402 [2013/09/23 11:47] svsummer |
teaching:projh402 [2018/10/02 08:13] svsummer |
||
---|---|---|---|
Line 7: | Line 7: | ||
===== Project proposals ===== | ===== Project proposals ===== | ||
- | ==== Development of a Personal Scientific Digital Library Management System ==== | + | === Engineering of a Rule-Based Information Extraction Engine === |
- | In this project, the student is asked to construct a software system to help manage large collections of scientific papers in digital form. Specifically, the system must be able to: | + | Information extraction, the activity of extracting structured |
- | - Scan a given filesystem location for given filetypes (PDFs, EPUB, ...) containing scientific articles. | + | information from unstructured text, is a core data preparation |
- | - Extract the metadata from each identified file. Here, the metadata includes the title of the article, its authors, the publishing venue, the publisher, the year of publication, the article's abstract ... The development of an intelligent way to retrieve this metadata is required. This could be done, for example, by a combination of parsing the file, contacting the internet repositories of known publishers (ACM, Springer, Elsevier), etc. to retrieve the data. | + | step. Systems for information extraction fall into two main |
- | - Offer search capabilities, in order to allow a user to find all indexed articles matching certain criteria (title, author, ...) | + | categories. The first category contains machine-learning based |
- | - Offer archiving capabilities | + | systems, where a significant amount of training data is required to train |
+ | good models for specific extraction tasks. The second category | ||
+ | consists of rule-based systems in which the data to be extracted from | ||
+ | the text is specified by (human-written) rules in some (often | ||
+ | declarative) extraction language. Despite advances in machine | ||
+ | learning, rule-based systems are widely used in practice. | ||
- | Use of semantic web technologies (RDF, SPARQL, ...) to store and search the metadata is encouraged. | + | In recent years, novel theoretical algorithms have been proposed to |
+ | more efficiently execute rule-based information extraction | ||
+ | workloads. The objective in this project is to implement one such | ||
+ | algorithm, by Florenzano et al. (2018), experimentally analyze its | ||
+ | performance, and propose extensions of the algorithm to overcome | ||
+ | performance bottlenecks. | ||
- | **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) | + | |
+ | References: | ||
+ | |||
+ | - Fernando Florenzano, Cristian Riveros, Martín Ugarte, | ||
+ | Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for | ||
+ | Regular Document Spanners. PODS 2018: 165-177 | ||
+ | |||
+ | |||
+ | **Interested?** Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) | ||
**Status**: available | **Status**: available | ||
- | ==== Curriculum Revision Assistant ==== | ||
- | In this project, the student is asked to construct a software system that can assist in the revision of teaching curricula (also known as teaching programs). The system should have the following functionalities: | + | === Query processing for mixed database-machine learning based workloads === |
- | - It should be able to load existing curricula from the ULB central administration. This could be done, for example, by parsing the webpages available at banner (the Civil Engineering in CS program is available at http://banssbfr.ulb.ac.be/PROD_frFR/bzscrse.p_disp_prog_detail?term_in=201314&prog_in=MA-IRIF&lang=FRENCH, for example). | + | |
- | - It should allow users to create different versions of the teaching programs, much in the same way as version control systems like Git and Subversion offer the possibility to make different "development branches" of a program's source code. | + | Because of the growing importance and wide deployment of large-scale |
- | - It should allow users to analyze the modifications proposed in the teaching programs and summarize the impact that these changes could have on other programs. (For example, if a course is removed from the computer science curriculum, it should also be removed from all curricula that included the course.) | + | Machine Learning (ML), there is wide interest in the design and |
+ | implementation of processing engines that can efficiently evaluate ML | ||
+ | workloads. One class of systems, embodied by engines such as TensorFlow | ||
+ | and SystemML, takes linear algebra as the key primitive for expressing | ||
+ | ML workflows, and obtains efficient processing engines by porting known | ||
+ | database-style optimization techniques to the linear algebra | ||
+ | setting. Another class of systems, embodied by FAQ queries, takes | ||
+ | relational algebra as the key primitive, but modifies it to allow | ||
+ | expression of certain ML workloads. To some extent, the classical | ||
+ | optimization techniques as well as recent results for exploiting | ||
+ | modern hardware transfer to this extended relational algebra. As an | ||
+ | added bonus, traditional database workloads (OLTP/OLAP style) can be | ||
+ | trivially supported. | ||
+ | |||
+ | The focus in this project is on the latter style of systems. The | ||
+ | overall goal is to experimentally identify classes of FAQ queries for | ||
+ | which it would be beneficial to exploit techniques developed in the | ||
+ | former class of systems. Concretely, this can be approached by | ||
+ | experimentally studying queries in the FAQ framework (featuring joins) | ||
+ | for which known results on evaluating linear algebra operations | ||
+ | (specifically, matrix multiplication algorithms that run in less than | ||
+ | O(n^3) time) can be exploited. | ||
**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) | **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) | ||
**Status**: available | **Status**: available | ||
+ | |||
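
The link between join queries and matrix multiplication mentioned in the second proposal can be illustrated with a small sketch. It is not taken from the proposal or from any FAQ system: the function names (`join_aggregate`, `strassen`, `matmul_query`), the dictionary encoding of weighted relations, and the choice of Strassen's algorithm as the sub-cubic routine are all assumptions made for illustration. The query considered is Q(a, c) = sum over b of R(a, b) * S(b, c), which is a join followed by an aggregation and, at the same time, a matrix product.

```python
from collections import defaultdict

def join_aggregate(R, S):
    """Evaluate Q(a, c) = sum_b R(a, b) * S(b, c) with a hash join."""
    s_by_b = defaultdict(list)
    for (b, c), w in S.items():
        s_by_b[b].append((c, w))
    out = defaultdict(int)
    for (a, b), v in R.items():
        for c, w in s_by_b[b]:
            out[(a, c)] += v * w
    return dict(out)

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(M):
    """Split a square matrix of even size into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B):
    """Strassen multiplication; assumes n x n inputs with n a power of two."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom

def to_matrix(rel, n):
    """Encode a weighted binary relation over domain 0..n-1 as a matrix."""
    M = [[0] * n for _ in range(n)]
    for (i, j), v in rel.items():
        M[i][j] = v
    return M

def matmul_query(R, S, n):
    """Same query as join_aggregate, evaluated as a matrix product.

    Dropping zero entries matches the join output only when no group of
    join results sums to zero (true here, since all weights are positive).
    """
    C = strassen(to_matrix(R, n), to_matrix(S, n))
    return {(i, j): C[i][j] for i in range(n) for j in range(n) if C[i][j] != 0}

# Hypothetical toy relations over the domain {0, 1, 2, 3}.
R = {(0, 1): 2, (1, 2): 3, (3, 0): 1}
S = {(1, 0): 4, (2, 3): 5, (0, 0): 7}

print(join_aggregate(R, S) == matmul_query(R, S, 4))
```

On sparse relations the hash join only touches matching tuples, whereas the matrix route always pays for the full dense domain. Identifying the query classes and data densities for which the second route wins is the kind of experimental question the proposal describes.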