Differences

teaching:projh402 [2014/10/24 08:54]
svsummer [Project proposals]
teaching:projh402 [2018/10/02 08:13]
svsummer
Line 7: Line 7:
 ===== Project proposals =====
  
-==== Principles of Database Management Architectures in Managed Virtual Environments ====
+=== Engineering of a Rule-Based Information Extraction Engine ===
  
-With the gaining popularity of Big Data, many data processing engines
-are implemented in a managed virtual environment such as the Java
-Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill,
-...). While this improves the portability of the engine, the tradeoffs
-and implementation principles w.r.t. traditional C++ implementations
-are sometimes less understood.
+Information extraction, the activity of extracting structured
+information from unstructured text, is a core data preparation
+step. Systems for information extraction fall into two main
+categories. The first category contains machine-learning based
+systems, where a significant amount of training is required to train
+good models for specific extraction tasks. The second category
+consists of rule-based systems in which the data to be extracted from
+the text is specified by (human-written) rules in some (often
+declarative) extraction language. Despite advances in machine
+learning, rule-based systems are widely used in practice.
  
-The objective in this project is to develop some basic functionalities
-of a database storage engine (Linked files, BTree, Extensible Hash
-table, basic external-memory sorting) in a managed virtual machine
-(i.e., the Java Virtual Machine or the .NET Common Language
-Runtime), and compare this with a C++-based implementation both on (1)
-ease of implementation and (2) execution efficiency. In order to
-develop the managed virtual machine implementation, the interested
-student will need to research the best practices that are used in the
-above-mentioned projects to gain maximum execution speed (e.g., use of
-the java.lang.unsafe feature, memory-mapped files, ...).
+In recent years, novel theoretical algorithms have been proposed to
+more efficiently execute rule-based information extraction
+workloads. The objective in this project is to implement one such
+algorithm, by Florenzano et al. (2018), experimentally analyze its
+performance, and propose extensions of the algorithm to overcome
+performance bottlenecks.
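To make the notion of a rule-based extractor from the proposal above concrete, here is a minimal sketch in Python. It assumes that a rule is written as a regular expression with named capture variables and that each result maps a variable to a character span; the names `extract`, `RULE` and `DOC` are hypothetical. It uses the standard `re` module, which only reports non-overlapping leftmost matches, so it is not the constant-delay algorithm of Florenzano et al. and does not enumerate all mappings of a document spanner.

```python
import re
from typing import Dict, Iterator, Tuple

Span = Tuple[int, int]  # half-open character interval [start, end)


def extract(rule: str, document: str) -> Iterator[Dict[str, Span]]:
    """Yield one mapping (variable name -> span) per match of the rule."""
    pattern = re.compile(rule)
    for match in pattern.finditer(document):
        # groupindex lists the named capture variables of the rule.
        yield {var: match.span(var) for var in pattern.groupindex}


# Hypothetical rule: a capitalized name followed by an address in angle brackets.
RULE = r"(?P<name>[A-Z][a-z]+) <(?P<email>[^<>\s]+@[^<>\s]+)>"
DOC = "Contacts: Alice <alice@example.org>, Bob <bob@example.org>."

mappings = list(extract(RULE, DOC))
for m in mappings:
    # Spans index into the document, so the extracted text is a slice.
    print({var: DOC[start:end] for var, (start, end) in m.items()})
```

Returning spans rather than strings keeps the position of every extracted value in the document, which is what lets later rules combine or filter results by position.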
  
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) 
  
-**Status**: available
+References
 
+- Fernando Florenzano, Cristian Riveros, Martín Ugarte,
+Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for
+Regular Document Spanners. PODS 2018: 165-177
  
-==== Development of a compiler and runtime engine for AQL ==== 
  
-In 2005, researchers at the IBM Almaden Research Center developed a
-new system specifically geared for practical information extraction in
-the enterprise. This effort led to [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEYQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&ei=gyhIUe-XPIexPJ-fgLAG&usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&sig2=LVQkus_67uSVlwK34BXZ8w&bvm=bv.43828540,d.ZWU|SystemT]], a rule-based IE system with an SQL-like declarative language named [[http://pic.dhe.ibm.com/infocenter/bigins/v2r0/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/aql_overview.html|AQL (Annotation Query Language)]].
-The declarative nature of AQL enables new kinds of tools for extractor
-development, and a cost-based optimizer for
-performance.
+**Interested?** Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
  
-The goal of this project is to develop an open-source compiler and
-runtime environment of (a simplified version of) AQL.
+**Status**: available
  
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) 
  
-**Status**: available
+=== Query processing for mixed database-machine learning based workloads ===
  
-==== Development of a distributed simulation algorithm ====
+Because of the growing importance and wide deployment of large-scale
+Machine Learning (ML), there is wide interest in the design and
+implementation of processing engines that can efficiently evaluate ML
+workloads. One class of systems, exemplified by TensorFlow
+and SystemML, takes linear algebra as the key primitive for expressing
+ML workflows, and obtains efficient processing engines by porting known
+database-style optimization techniques to the linear algebra
+setting. Another class of systems, embodied by FAQ queries, takes
+relational algebra as the key primitive, but modifies it to allow
+expression of certain ML workloads. To some extent, the classical
+optimization techniques as well as recent results for exploiting
+modern hardware transfer to this extended relational algebra. As an
+added bonus, traditional database workloads (OLTP/OLAP style) can be
+trivially supported.
  
-Simulation and Bisimulation are fundamental notions in computer
-science. They underlie many formal verification algorithms, and have
-recently been applied to the construction of so-called structural
-indexes, which are novel index data structures for relational databases
-and the Semantic Web. Essentially, a (bi)simulation is a relation on
-the nodes of a graph. Unfortunately, however, while efficient
-main-memory algorithms for computing whether two nodes are similar
-exist, these algorithms fail when the input graphs are too large to
-fit in main memory.
-
-The objective of this project is to implement a recently proposed
-algorithm for computing simulation in a distributed setting, and
-provide a preliminary performance evaluation of this implementation.
+The focus in this project is on the latter style of systems. The
+overall goal is to experimentally identify classes of FAQ queries for
+which it would be beneficial to exploit techniques developed in the
+former class of systems. Concretely, this can be approached by
+experimentally studying queries in the FAQ framework (featuring joins)
+for which known results in evaluating linear algebra operations (in
+particular: matrix multiplication algorithms that run in less than
+O(n^3) time) can be exploited.
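The link between join queries and matrix multiplication mentioned in the proposal can be illustrated with a small sketch. It assumes that a FAQ-style query can take the sum-product form Q(a, c) = SUM_b R(a, b) * S(b, c), which is an assumption made for illustration and is not spelled out in the proposal. The names `join_aggregate`, `to_matrix` and `matmul` are hypothetical, and `matmul` is the plain cubic algorithm, marking the spot where a sub-cubic one could be substituted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A relation with an integer annotation per tuple: (left key, right key) -> value.
Relation = Dict[Tuple[int, int], int]


def join_aggregate(r: Relation, s: Relation) -> Relation:
    """Compute Q(a, c) = SUM_b R(a, b) * S(b, c) with a hash join on b."""
    s_by_b = defaultdict(list)
    for (b, c), w in s.items():
        s_by_b[b].append((c, w))
    out = defaultdict(int)
    for (a, b), v in r.items():
        for c, w in s_by_b.get(b, []):
            out[(a, c)] += v * w
    return dict(out)


def to_matrix(rel: Relation, n: int) -> List[List[int]]:
    """View a relation over the domain {0, ..., n-1} as an n x n matrix."""
    m = [[0] * n for _ in range(n)]
    for (i, j), v in rel.items():
        m[i][j] = v
    return m


def matmul(x: List[List[int]], y: List[List[int]]) -> List[List[int]]:
    """Textbook O(n^3) product; a faster algorithm could replace it."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical example data: two small edge relations over the nodes 0..2.
R = {(0, 1): 1, (0, 2): 1, (1, 2): 1}
S = {(1, 0): 1, (2, 0): 1, (2, 1): 1}
N = 3

via_join = to_matrix(join_aggregate(R, S), N)
via_matrix = matmul(to_matrix(R, N), to_matrix(S, N))
print(via_join == via_matrix)
```

Both routes compute the same table here, so for dense inputs the choice between a join-based plan and a matrix-based plan becomes a cost question, which is the kind of trade-off the project sets out to study experimentally.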
  
 **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
Line 68: Line 70:
 **Status**: available
  
- 
- 
-==== Development of a Personal Scientific Digital Library Management System ==== 
- 
-In this project, the student is asked to construct a software system to help manage large collections of scientific papers in digital form. Specifically, the system must be able to:
-  - Scan a given filesystem location for given filetypes (PDFs, EPUB, ...) containing scientific articles.
-  - Extract the metadata from each identified file. Here, the metadata includes the title of the article, its authors, the publishing venue, the publisher, the year of publication, the article's abstract ... The development of an intelligent way to retrieve this metadata is required. This could be done, for example, by a combination of parsing the file, contacting the internet repositories of known publishers (ACM, Springer, Elsevier), etc. to retrieve the data.
-  - Offer search capabilities,​ in order to allow a user to find all indexed articles matching certain criteria (title, author, ...) 
-  - Offer archiving capabilities 
- 
-Use of semantic web technologies (RDF, SPARQL, ...) to store and search the metadata is encouraged. 
- 
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) 
- 
-**Status**: taken 
  
 
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi