This is an old revision of the document!

MA Computer Science Projects (PROJ-H-402)

Course objective

The course PROJ-H-402 is managed by Dr. Mauro Birattari. Please refer to the course description page http://iridia.ulb.ac.be/proj-h-402/index.php/Main_Page for the rules concerning the project. What follows is a list of project proposals supervised by academic members of CoDE.

Project proposals

Fast loading of semantic web datasets into native RDF stores

The next generation of the Web, the so-called Semantic Web, stores extremely large knowledge bases in the RDF data model. In this data model, all knowledge is represented by means of triples of the form (subject, property, object), where subject, property and object can be URLs, among other things.

In order to effeciently query such knowledge bases, the RDF data is typically loaded into a so-called native RDF store. To ensure that the knowledge is encoded for fast retrieval, the RDF store will first encode all variable-length URLs in the dataset by fixed-width integers, among other things. Each RDF triple will then be encoded by by their corresponding integer triples (integer_of_subject, integer_of_property, integer_of_object).

The purpose of this project is to implement and experementally compare a number of algorithms that can perform this encoding:

The trivial alogrithm that simply maintains a hashmap that maps URLs to their integer codes. When processing triple (s,p,o), it looks up s, p, and o in the hashmap to see if they have alraedy been assigned an integer ID. If so, this id is used for the encoding; otherwise they are inserted into the hashmap with new, unique ids. The downside of this approach is that, while simple, it requires that one can store all URLS in working memory.

The slightly smarter algorithm that works in multiple stages: the ID is computed by a pre-fixed hash function. For each URL, the URL and its ID are written to an output file. This file is later sorted on ID to check for possible hash collisions between distinct URLS.

Algorithms that use the best known state-of-the art data structures for compactly storing respresenting sets of strings, such as the HAT-TRIE (“Engineering scalable, cache and space efficient tries for strings” Nikolas Askitis, Ranjan Sinha, The VLDB Journal, October 2010, Volume 19, Issue 5, pp 633-660 and “HAT-Trie: A Cache-Conscious Trie-Based Data Structure For Strings”, The 30th International Australasian Computer Science Conference (ACSC), Volume 62, pages 97 - 105, 2007.).

Variations of the above algorithms, fine-tuned for semantic web datasets.

Contact : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)

Status: available

Graph Indexing for Fast Subgraph Isomorphism Testing

There is an increasing amount of scientific data, mostly from the bio-medical sciences, that can be represented as collections of graphs (chemical molecules, gene interaction networks, …). A crucial operation when searching in this data is that of subgraph isomorphism testing: given a pattern P that one is interested in (also a graph) in and a collection D of graphs (e.g., chemical molecules), find all graphs in G that have P as a subgraph. Unfortunately, the subgraph isomorphism problem is computationally intractable. In ongoing research, to enable tractable processing of this problem, we aim to reduce the number of candidate graphs in D to which a subgraph isomorphism test needs to be executed. Specifically, we index the graphs in the collection D by means of decomposing them into graphs for which subgraph isomorphism *is* tractable. An associated algorithm that filters graphs that certainly cannot match P can then formulated based on ideas from information retrieval.

In this project, the student will emperically validate on real-world datasets the extent to which graphs can be decomposed into graphs for which subgraph isomorphism is tractable, and run experiments to validate the effectiveness of the proposed method in terms of filtering power.

Interested? Contact : Stijn Vansummeren

Status: available

Principles of Database Management Architectures in Managed Virtual Environments

With the gaining popularity of Big Data, many data processing engines are implemented in a managed virtual environment such as the Java Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill, …). While this improves the portability of the engine, the tradeoffs and implementation principles w.r.t. traditional C++ implementations are sometimes less understood.

The objective in this project is to develop some basic functionalities of a database storage engine (Linked files, BTree, Extensible Hash table, basic external-memory sorting ) in a managed virtual machine (i.e., the Java Virtual Machine or and the .NET Common Language Runtime), and compare this with a C++-based implementation both on (1) ease of implementation and (2) execution efficiency. In order to develop the managed virtual machine implementation, the interested student will need to research the best practices that are used in the above-mentioned projects to gain maximum execution speed (e.g., use of the java.lang.unsafe feature, memory-mapped files, …).

Contact : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)

Status: available

Development of a distributed simulation algorithm

Simulation and Bisimulation are fundamental notions in computer science. They underly many formal verification algorithms, and have recently been applied to the construction of so-called structural indexes,which are novel index data structures for relational databases and the Semantic Web. Essentially, a (bi)simulation is a relation on the nodes of a graph. Unfortunately, however, while efficient main-memory algorithms for computing whether two nodes are similar exist, these algorithms fail when no the input graphs are too large to fit in main memory.

The objective of this project is to implement a recently proposed algorithm for computing simulation in a distributed setting, and provide a preliminary performance evaluation of this implementation.

Contact : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)

Status: available

Table of Contents

MA Computer Science Projects (PROJ-H-402)

Course objective

Project proposals

Fast loading of semantic web datasets into native RDF stores

Graph Indexing for Fast Subgraph Isomorphism Testing

Principles of Database Management Architectures in Managed Virtual Environments

Development of a distributed simulation algorithm