Differences

This shows you the differences between two versions of the page.

teaching:projh402 [2013/09/23 11:38]
svsummer created
teaching:projh402 [2018/10/02 08:13]
svsummer
Line 7:
 ===== Project proposals =====
  
-==== Development of a Personal Scientific Digital Library Management System ====
+=== Engineering of a Rule-Based Information Extraction Engine ===
  
-In this project, the student is asked to construct a software system to help manage large collections of scientific papers in digital form. Specifically, the system must be able to:
-  - Scan a given filesystem location for given filetypes (PDFs, EPUB, ...) containing scientific articles.
-  - Extract the metadata from each identified file. Here, the metadata includes the title of the article, its authors, the publishing venue, the publisher, the year of publication, the article's abstract, ... The development of an intelligent way to retrieve this metadata is required. This could be done, for example, by a combination of parsing the file and contacting the internet repositories of known publishers (ACM, Springer, Elsevier, etc.) to retrieve the data.
-  - Offer search capabilities, in order to allow a user to find all indexed articles matching certain criteria (title, author, ...).
-  - Offer archiving capabilities.
+Information extraction, the activity of extracting structured
+information from unstructured text, is a core data preparation
+step. Systems for information extraction fall into two main
+categories. The first category contains machine-learning-based
+systems, where a significant amount of training is required to obtain
+good models for specific extraction tasks. The second category
+consists of rule-based systems, in which the data to be extracted from
+the text is specified by (human-written) rules in some (often
+declarative) extraction language. Despite advances in machine
+learning, rule-based systems are widely used in practice.
  
-Use of semantic web technologies (RDF, SPARQL, ...) to store and search the metadata is encouraged.
+In recent years, novel theoretical algorithms have been proposed to
+execute rule-based information extraction workloads more
+efficiently. The objective in this project is to implement one such
+algorithm, by Florenzano et al. (2018), experimentally analyze its
+performance, and propose extensions of the algorithm to overcome
+performance bottlenecks.
  
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
+
+References:
+
+- Fernando Florenzano, Cristian Riveros, Martín Ugarte,
+Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for
+Regular Document Spanners. PODS 2018: 165-177
+
+
+**Interested?** Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
  
 **Status**: available
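
As a rough illustration of the rule-based style described in the information extraction proposal, the sketch below treats a regular expression with named capture variables as a single extraction rule and returns the spans (start and end offsets) it binds in a text. The rule, the `Span` type, and the `extract` function are hypothetical names chosen for this example. It relies on Python's backtracking `re` engine, which only reports non-overlapping matches, so it is not the constant-delay algorithm of Florenzano et al.; it only shows what "specifying the data to extract by a rule" can look like.

```python
import re
from typing import Dict, Iterator, NamedTuple


class Span(NamedTuple):
    """Half-open character interval [start, end) in the input text."""
    start: int
    end: int


# Hypothetical rule: a capitalised multi-word name followed by an
# e-mail address in parentheses. Each named group is a capture variable.
RULE = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+) \((?P<email>[\w.]+@[\w.]+)\)"
)


def extract(text: str) -> Iterator[Dict[str, Span]]:
    """Yield one mapping from variable name to span per match of RULE."""
    for match in RULE.finditer(text):
        yield {var: Span(*match.span(var)) for var in RULE.groupindex}


if __name__ == "__main__":
    doc = "Please contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)."
    for mapping in extract(doc):
        for var, span in mapping.items():
            print(var, span, doc[span.start:span.end])
```

Returning spans rather than substrings keeps the output tied to positions in the document, which is how document spanners are usually described.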
  
-==== Curriculum Revision Assistant ==== 
  
-In this project, the student is asked to construct a software system that can assist in the revision of teaching curricula (also known as teaching programs). The system should have the following functionalities:
-  - It should be able to load existing curricula from the ULB central administration. This could be done, for example, by parsing the webpages available at banner (the Civil Engineering in CS program is available at http://banssbfr.ulb.ac.be/PROD_frFR/bzscrse.p_disp_prog_detail?term_in=201314&prog_in=MA-IRIF&lang=FRENCH, for example).
-  - It should allow making different versions of the teaching programs, much in the same way as version control systems like Git and Subversion offer the possibility to make different "development branches" of a program's source code.
-  - It should allow analyzing the modifications proposed in the teaching programs, and summarizing the impact that these changes could have on other programs. (For example, if a course is removed from the computer science curriculum, it should also be removed from all curricula that included the course.)
+=== Query processing for mixed database-machine learning based workloads ===
+
+
  
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
+Because of the growing importance and wide deployment of large-scale
+Machine Learning (ML), there is wide interest in the design and
+implementation of processing engines that can efficiently evaluate ML
+workloads. One class of systems, exemplified by TensorFlow
+and SystemML, takes linear algebra as the key primitive for expressing
+ML workflows, and obtains efficient processing engines by porting known
+database-style optimization techniques to the linear algebra
+setting. Another class of systems, embodied by FAQ queries, takes
+relational algebra as the key primitive, but modifies it to allow the
+expression of certain ML workloads. To some extent, the classical
+optimization techniques as well as recent results for exploiting
+modern hardware transfer to this extended relational algebra. As an
+added bonus, traditional database workloads (OLTP/OLAP style) can be
+trivially supported.
  
 +The focus in this project is on the latter style of systems. The
 +overall goal is to experimentally identify classes of FAQ queries for
 +which it would be beneficial to exploit techniques developed in the
 +former class of systems. Concretely, this can be approached by
 +experimentally studying queries in the FAQ framework (featuring joins)
 +for which known results in evaluating linear algebra operations
 +(specifically, matrix multiplication algorithms that run in less than
 +O(n^3) time) can be exploited.
  
 +**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
 +
 +**Status**: available
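
The link between join queries and linear algebra mentioned in the query processing proposal can be made concrete with a small sketch: a matrix product is a join of two relations on their shared index, followed by a sum aggregate grouped by the remaining indices. The function name `matmul_as_join` and the dictionary-based relation encoding are assumptions made for this example, not part of any FAQ system. The evaluation shown is a plain hash join, which takes cubic time on dense inputs; a sub-cubic matrix multiplication algorithm would be a drop-in alternative for exactly this kind of query.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A relation encoding of a sparse matrix: (row, col) -> value.
Relation = Dict[Tuple[int, int], float]


def matmul_as_join(a: Relation, b: Relation) -> Relation:
    """Compute C(i, k) = SUM_j A(i, j) * B(j, k) as a join on j
    followed by a SUM aggregate grouped by (i, k)."""
    # Build a hash index on the join attribute j of B.
    b_by_row: Dict[int, List[Tuple[int, float]]] = defaultdict(list)
    for (j, k), value in b.items():
        b_by_row[j].append((k, value))

    # Probe the index with each tuple of A and aggregate the products.
    out: Dict[Tuple[int, int], float] = defaultdict(float)
    for (i, j), a_value in a.items():
        for k, b_value in b_by_row.get(j, ()):
            out[(i, k)] += a_value * b_value
    return dict(out)


if __name__ == "__main__":
    A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
    B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
    print(matmul_as_join(A, B))
```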
  
  
 
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi