Differences

This shows you the differences between two versions of the page.

--- teaching:projh402 [2014/10/24 08:54]
svsummer [Project proposals]
+++ teaching:projh402 [2018/10/02 08:13]
svsummer
@@ Line 7: / Line 7: @@
 ===== Project proposals =====
-==== Principles of Database Management Architectures in Managed Virtual Environments ====
+=== Engineering of a Rule-Based Information Extraction Engine ===
-With the gaining popularity of Big Data, many data processing engines
+Information extraction, the activity of extracting structured
-are implemented in a managed virtual environment such as the Java
+information from unstructured text, is a core data preparation
-Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill,
+step. Systems for information extraction fall into two main
-...). While this improves the portability of the engine, the tradeoffs
+categories. The first category contains machine-learning based
-and implementation principles w.r.t. traditional C++ implementations
+systems, where a significant amount of training is required to obtain
-are sometimes less understood.
+good models for specific extraction tasks. The second category
+consists of rule-based systems in which the data to be extracted from
+the text is specified by (human-written) rules in some (often
+declarative) extraction language. Despite advances in machine
+learning, rule-based systems are widely used in practice.
-The objective in this project is to develop some basic functionalities
+In recent years, novel theoretical algorithms have been proposed to
-of a database storage engine (Linked files, BTree, Extensible Hash
+more efficiently execute rule-based information extraction
-table, basic external-memory sorting ) in a managed virtual machine
+workloads. The objective in this project is to implement one such
-(i.e., the Java Virtual Machine or the .NET Common Language
+algorithm, by Florenzano et al. (2018), experimentally analyze its
-Runtime), and compare this with a C++-based implementation both on (1)
+performance, and propose extensions of the algorithm to overcome
-ease of implementation and (2) execution efficiency. In order to
+performance bottlenecks.
-develop the managed virtual machine implementation, the interested
-student will need to research the best practices that are used in the
-above-mentioned projects to gain maximum execution speed (e.g., use of
-the sun.misc.Unsafe class, memory-mapped files, ...).
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-**Status**: available
+References:
+- Fernando Florenzano, Cristian Riveros, Martín Ugarte,
+Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for
+Regular Document Spanners. PODS 2018: 165-177
-==== Development of a compiler and runtime engine for AQL ====
-In 2005, researchers at the IBM Almaden Research Center developed a
+**Interested?** Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-new system specifically geared for practical information extraction in
-the enterprise. This effort led to [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEYQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&ei=gyhIUe-XPIexPJ-fgLAG&usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&sig2=LVQkus_67uSVlwK34BXZ8w&bvm=bv.43828540,d.ZWU|SystemT]], a rule-based IE system with an SQL-like declarative language named [[http://pic.dhe.ibm.com/infocenter/bigins/v2r0/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/aql_overview.html|AQL (Annotation Query Language)]].
-The declarative nature of AQL enables new kinds of tools for extractor
-development, and a cost-based optimizer for
-performance.
-The goal of this project is to develop an open-source compiler and
+**Status**: available
-runtime environment of (a simplified version of) AQL.
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-**Status**: available
+=== Query processing for mixed database-machine learning based workloads ===
-==== Development of a distributed simulation algorithm ====
+Because of the growing importance and wide deployment of large-scale
+Machine Learning (ML), there is wide interest in the design and
+implementation of processing engines that can efficiently evaluate ML
+workloads. One class of systems, embodied by TensorFlow and SystemML,
+takes linear algebra as the key primitive for expressing ML workflows,
+and obtains efficient processing engines by porting known
+database-style optimization techniques to the linear algebra
+setting. Another class of systems, embodied by FAQ queries, takes
+relational algebra as the key primitive, but modify it to allow
+expression of certain ML workloads. To some extent, the classical
+optimization techniques as well as recent results for exploiting
+modern hardware transfer to this extended relational algebra. As an
+added bonus, traditional database workloads (OLTP/OLAP style) can be
+trivially supported.
-Simulation and Bisimulation are fundamental notions in computer
+The focus in this project is on the latter style of systems. The
-science. They underlie many formal verification algorithms, and have
+overall goal is to experimentally identify classes of FAQ queries for
-recently been applied to the construction of so-called structural
+which it would be beneficial to exploit techniques developed in the
-indexes, which are novel index data structures for relational databases
+former class of systems. Concretely, this can be approached by
-and the Semantic Web.  Essentially, a (bi)simulation is a relation on
+experimentally studying queries in the FAQ framework (featuring joins)
-the nodes of a graph. Unfortunately, however, while efficient
+for which known results in evaluating linear algebra operations
-main-memory algorithms for computing whether two nodes are similar
+(specifically, matrix multiplication algorithms that run in less than
-exist, these algorithms fail when the input graphs are too large to
+O(n^3) time) can be exploited.
-fit in main memory.
-The objective of this project is to implement a recently proposed
-algorithm for computing simulation in a distributed setting, and
-provide a preliminary performance evaluation of this implementation.
 **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
@@ Line 68: / Line 70: @@
 **Status**: available
-==== Development of a Personal Scientific Digital Library Management System ====
-In this project, the student is asked to construct a software system to help manage large collections of scientific papers in digital form. Specifically, the system must be able to:
-  - Scan a given filesystem location for given filetypes (PDFs, EPUB, ...) containing scientific articles.
-  - Extract the metadata from each identified file. Here, the metadata includes the title of the article, its authors, the publishing venue, the publisher, the year of publication, the article's abstract ... The development of an intelligent way to retrieve this metadata is required. This could be done, for example, by a combination of parsing the file, contacting the internet repositories of known publishers (ACM, Springer, Elsevier), etc. to retrieve the data.
-  - Offer search capabilities, in order to allow a user to find all indexed articles matching certain criteria (title, author, ...)
-  - Offer archiving capabilities
-Use of semantic web technologies (RDF, SPARQL, ...) to store and search the metadata is encouraged.
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-**Status**: taken