===== Project proposals =====
=== Engineering of a Rule-Based Information Extraction Engine ===

Information extraction, the activity of extracting structured
information from unstructured text, is a core data preparation
step. Systems for information extraction fall into two main
categories. The first category contains machine-learning based
systems, where a significant amount of training data is required to
learn good models for specific extraction tasks. The second category
consists of rule-based systems, in which the data to be extracted from
the text is specified by (human-written) rules in some (often
declarative) extraction language. Despite advances in machine
learning, rule-based systems are widely used in practice.

In recent years, novel theoretical algorithms have been proposed to
execute rule-based information extraction workloads more
efficiently. The objective of this project is to implement one such
algorithm, by Florenzano et al. (2018), experimentally analyze its
performance, and propose extensions of the algorithm to overcome
performance bottlenecks.
References:
Fernando Florenzano, Cristian Riveros, Martín Ugarte,
Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for
Regular Document Spanners. PODS 2018: 165-177.
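As a tiny illustration of the rule-based approach (a minimal sketch only; this is not the algorithm of Florenzano et al., and the rule and sentence below are made up for the example), a single declarative extraction rule can be written as a regular expression with named capture groups:

```python
import re

# Hypothetical extraction rule: a regular expression with named
# capture groups, standing in for a rule in a declarative
# extraction language.
RULE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was born in (?P<city>[A-Z][a-z]+)"
)

def extract(text):
    """Return the structured (person, city) tuples that RULE matches in text."""
    return [(m.group("person"), m.group("city")) for m in RULE.finditer(text)]

doc = "Ada Lovelace was born in London. Alan Turing was born in London."
print(extract(doc))
# [('Ada Lovelace', 'London'), ('Alan Turing', 'London')]
```

A naive evaluator like this re-scans the document per rule; the algorithmic work in this project concerns evaluating such rules with stronger guarantees (constant delay between output tuples after preprocessing).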
**Interested?** Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
**Status**: available
=== Query processing for mixed database-machine learning based workloads ===
Because of the growing importance and wide deployment of large-scale
Machine Learning (ML), there is wide interest in the design and
implementation of processing engines that can efficiently evaluate ML
workloads. One class of systems, embodied by TensorFlow and SystemML,
takes linear algebra as the key primitive for expressing ML workflows,
and obtains efficient processing engines by porting known
database-style optimization techniques to the linear algebra
setting. Another class of systems, embodied by FAQ queries, takes
relational algebra as the key primitive, but modifies it to allow the
expression of certain ML workloads. To some extent, the classical
optimization techniques, as well as recent results on exploiting
modern hardware, transfer to this extended relational algebra. As an
added bonus, traditional database workloads (OLTP/OLAP style) can be
trivially supported.
The focus of this project is on the latter style of systems. The
overall goal is to experimentally identify classes of FAQ queries for
which it would be beneficial to exploit techniques developed in the
former class of systems. Concretely, this can be approached by
experimentally studying queries in the FAQ framework (featuring joins)
for which known results on evaluating linear algebra operations
(concretely: matrix multiplication algorithms that run in less than
O(n^3) time) can be exploited.
**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
**Status**: available
teaching/projh402.txt · Last modified: 2018/10/02 08:13 by svsummer