Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
teaching:projh402 [2015/06/26 16:09]
svsummer
teaching:projh402 [2018/10/02 08:13]
svsummer
Line 7: Line 7:
 ===== Project proposals ===== ===== Project proposals =====
  
-===== Graph Indexing for Fast Subgraph Isomorphism Testing =====+=== Engineering of a Rule-Based Information Extraction Engine ​===
  
-There is an increasing amount of scientific datamostly from the bio-medical sciences, that can be represented as collections ​of graphs (chemical moleculesgene interaction networks, ...). A crucial operation when searching in this data is that of subgraph ​   isomorphism testing: given pattern P that one is interested in (also a graph) in and a collection D of graphs (e.g., chemical molecules), find all graphs in G that have P as   subgraph. Unfortunately,​ the subgraph isomorphism problem ​is computationally intractableIn ongoing research, to enable tractable processing ​of this problem, we aim to reduce the number of candidate graphs ​in D to which a subgraph isomorphism test needs   to be executed. Specifically,​ we index the graphs ​in the collection D by means of decomposing them into graphs for which subgraph ​  ​isomorphism *is* tractableAn associated algorithm that filters graphs that certainly cannot match P can then formulated ​based on ideas from information retrieval.+Information extraction, the activity ​of extracting structured 
 +information from unstructured text, is a core data preparation 
 +stepSystems for information extraction fall into two main 
 +categoriesThe first category contains machine-learning based 
 +systemswhere significant amount of training ​is required to train 
 +good models for specific extraction tasksThe second category 
 +consists ​of rule-based systems ​in which the data to be extracted from 
 +the text is specified by (human-written) rules in some (often 
 +declarative) extraction languageDespite advances in machine 
 +learning, rule-based systems are widely used in practice.
  
-In this project, the student will emperically validate on real-world datasets the extent ​to which graphs can be decomposed into graphs for which subgraph isomorphism is tractable, and run experiments to validate the effectiveness ​of the proposed method in terms of filtering power.+In recent years, novel theoretical algorithms have been proposed to 
 +more efficiently execute rule-based information extraction 
 +workloads. The objective in this project ​is to implement one such 
 +Algorithm, by Florenzano et al (2018), experimentally analyze its 
 +performance, and propose extensions ​of the algorithm to overcome 
 +performance bottlenecks
  
-**Interested?​** Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]] 
  
-**Status**available+References
  
-==== Principles of Database Management Architectures in Managed Virtual Environments ====+- Fernando Florenzano, Cristian Riveros, Martín Ugarte, 
 +Stijn Vansummeren,​ Domagoj Vrgoc: Constant Delay Algorithms for 
 +Regular Document Spanners. PODS 2018: 165-177
  
-With the gaining popularity of Big Data, many data processing engines 
-are implemented in a managed virtual environment such as the Java 
-Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill, 
-...). While this improves the portability of the engine, the tradeoffs 
-and implementation principles w.r.t. traditional C++ implementations 
-are sometimes less understood. 
  
-The objective in this project is to develop some basic functionalities +**Interested?** Contact ​Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-of a database storage engine (Linked files, BTree, Extensible Hash +
-table, basic external-memory sorting ) in a managed virtual machine +
-(i.e., the Java Virtual Machine or and the .NET Common Language +
-Runtime), and compare this with a C++-based implementation both on (1) +
-ease of implementation and (2) execution efficiency. In order to +
-develop the managed virtual machine implementation,​ the interested +
-student will need to research the best practices that are used in the +
-above-mentioned projects to gain maximum execution speed (e.g., use of +
-the java.lang.unsafe feature, memory-mapped files, ...). +
- +
-**Contact** Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)+
  
 **Status**: available **Status**: available
  
  
-==== Development of a compiler and runtime engine ​for AQL ====+=== Query processing ​for mixed database-machine learning based workloads ​===
  
-In 2005researchers at the IBM Almaden Research Center developped a +Because of the growing importance and wide deployment of large-scale 
-new system specifically geared ​for practical information extraction in +Machine Learning (ML)there is wide interest in the design and 
-the enterpriseThis effort lead to [[https://​www.google.be/​url?​sa=t&​rct=j&​q=&​esrc=s&​source=web&​cd=2&​cad=rja&​ved=0CEYQFjAB&​url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&​ei=gyhIUe-XPIexPJ-fgLAG&​usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&​sig2=LVQkus_67uSVlwK34BXZ8w&​bvm=bv.43828540,d.ZWU|SystemT]] ​a rule-based IE system with an SQL-like declarative language named [[http://​pic.dhe.ibm.com/​infocenter/​bigins/​v2r0/​topic/​com.ibm.swg.im.infosphere.biginsights.analyze.doc/​doc/​aql_overview.html|AQL (Annotation Query Language)]]. +implementation of processing engines that can efficiently evaluate ML 
-The declarative nature of AQL enables new kinds of tools for extractor +workloads. One class of sytems, embodied by systems such as Tensorflow 
-development,​ and a cost-based optimizer for +and SystemML takes linear algebra as the key primitive ​for expressing 
-performance.  ​+ML workflows, and obtain efficient processing engines by porting known 
 +database-style optimization techniques to the linear algebra 
 +settingAnother class of systems, embodied by FAQ queries take 
 +relational algebra as the key primitive, but modify it to allow 
 +expression of certain ML workloadsTo some extentthe classical 
 +optimization techniques as well as recent results for exploiting 
 +modern hardware transfer to this extended relational algebraAs an 
 +added bonustraditional database workloads (OLTP/OLAP stylecan be 
 +trivially supported
  
-The goal of this project is to develop an open-source compiler and +The focus in this project ​is in the latter style of systems. The 
-runtime environment ​of (a simplified version ofAQL.+overall goal is to experimentally identify classes of FAQ queries for 
 +which it would be beneficial to exploit techniques developped in the 
 +former class of systems. Concretely, this can be approached by 
 +experimentally studying queries in the FAQ framework ​(featuring joins) 
 +for which known results in evaluating linear algebra operations (in 
 +concretum: matrix multiplication algorithms that run in less than 
 +O(n^3) timecan be exploited.
  
 **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
  
 **Status**: available **Status**: available
- 
-==== Development of a distributed simulation algorithm ==== 
- 
-Simulation and Bisimulation are fundamental notions in computer 
-science. They underly many formal verification algorithms, and have 
-recently been applied to the construction of so-called structural 
-indexes,​which are novel index data structures for relational databases 
-and the Semantic Web.  Essentially,​ a (bi)simulation is a relation on 
-the nodes of a graph. Unfortunately,​ however, while efficient 
-main-memory algorithms for computing whether two nodes are similar 
-exist, these algorithms fail when no the input graphs are too large to 
-fit in main memory. ​ 
- 
-The objective of this project is to implement a recently proposed 
-algorithm for  computing simulation in a distributed setting, and 
-provide a preliminary performance evaluation of this implementation. 
- 
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) 
- 
-**Status**: available 
- 
- 
  
  
 
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi