Differences

This shows you the differences between two versions of the page.

teaching:projh402 [2013/09/23 11:47]
svsummer
teaching:projh402 [2018/10/02 08:13]
svsummer
Line 7:
 ===== Project proposals =====
  
-==== Development of a Personal Scientific Digital Library Management System ====
+=== Engineering of a Rule-Based Information Extraction Engine ===
  
-In this project, the student is asked to construct a software system to help manage large collections of scientific papers in digital form. Specifically, the system must be able to:
-  - Scan a given filesystem location for given filetypes (PDFs, EPUB, ...) containing scientific articles.
-  - Extract the metadata from each identified file. Here, the metadata includes the title of the article, its authors, the publishing venue, the publisher, the year of publication, the article's abstract ... The development of an intelligent way to retrieve this metadata is required. This could be done, for example, by a combination of parsing the file and contacting the internet repositories of known publishers (ACM, Springer, Elsevier, etc.) to retrieve the data.
-  - Offer search capabilities, in order to allow a user to find all indexed articles matching certain criteria (title, author, ...).
-  - Offer archiving capabilities.
+Information extraction, the activity of extracting structured
+information from unstructured text, is a core data preparation
+step. Systems for information extraction fall into two main
+categories. The first category contains machine-learning-based
+systems, where a significant amount of training data is required to
+train good models for specific extraction tasks. The second category
+consists of rule-based systems in which the data to be extracted from
+the text is specified by (human-written) rules in some (often
+declarative) extraction language. Despite advances in machine
+learning, rule-based systems are widely used in practice.
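To make the rule-based category concrete, here is a minimal sketch, assuming that a "rule" is simply a regular expression with named capture variables and that the output is one mapping from variables to character spans per match. The rule `RULE`, the helper `extract`, and the sample sentence are hypothetical and do not come from the project description. Note also that Python's `re` module reports only leftmost, non-overlapping matches, which is a simplification of what a declarative extraction language would typically return.

```python
import re
from typing import Dict, Iterator, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets

# Hypothetical rule: a capitalized name followed by an e-mail address in parentheses.
RULE = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*) \((?P<email>[\w.]+@[\w.]+)\)"
)


def extract(rule: re.Pattern, text: str) -> Iterator[Dict[str, Span]]:
    """Yield, for every match of the rule, a mapping from variable name to span."""
    for match in rule.finditer(text):
        yield {var: match.span(var) for var in rule.groupindex}


text = "for questions, contact Ada Lovelace (ada@example.org)."
for mapping in extract(RULE, text):
    # Print the extracted substrings rather than the raw offsets.
    print({var: text[start:end] for var, (start, end) in mapping.items()})
```

Returning spans instead of substrings keeps the position of each extracted value in the document, which is what allows several rules to be combined or joined afterwards.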
  
-Use of semantic web technologies (RDF, SPARQL, ...) to store and search the metadata is encouraged.
+In recent years, novel theoretical algorithms have been proposed to
+execute rule-based information extraction workloads more
+efficiently. The objective of this project is to implement one such
+algorithm, by Florenzano et al. (2018), experimentally analyze its
+performance, and propose extensions of the algorithm to overcome
+performance bottlenecks.
  
-**Contact** Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
+
+References:
+
+- Fernando Florenzano, Cristian Riveros, Martín Ugarte,
+Stijn Vansummeren, Domagoj Vrgoc: Constant Delay Algorithms for
+Regular Document Spanners. PODS 2018: 165-177
+
+
+**Interested?** Contact Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
  
 **Status**: available
  
-==== Curriculum Revision Assistant ==== 
  
-In this project, the student is asked to construct a software system that can assist in the revision of teaching curricula (also known as teaching programs). The system should have the following functionalities:
-  - It should be able to load existing curricula from the ULB central administration. This could be done, for example, by parsing the webpages available at banner (the Civil Engineering in CS program is available at http://banssbfr.ulb.ac.be/PROD_frFR/bzscrse.p_disp_prog_detail?term_in=201314&prog_in=MA-IRIF&lang=FRENCH).
-  - It should allow making different versions of the teaching programs, much in the same way as version control systems like Git and Subversion offer the possibility to make different "development branches" of a program's source code.
-  - It should allow analyzing the modifications proposed in the teaching programs, and summarizing the impact that these changes could have on other programs. (For example, if a course is removed from the computer science curriculum, it should also be removed from all curricula that included the course.)
+=== Query processing for mixed database-machine learning based workloads ===
+
+Because of the growing importance and wide deployment of large-scale
+Machine Learning (ML), there is wide interest in the design and
+implementation of processing engines that can efficiently evaluate ML
+workloads. One class of systems, embodied by systems such as TensorFlow
+and SystemML, takes linear algebra as the key primitive for expressing
+ML workflows, and obtains efficient processing engines by porting known
+database-style optimization techniques to the linear algebra
+setting. Another class of systems, embodied by FAQ queries, takes
+relational algebra as the key primitive, but modifies it to allow
+expression of certain ML workloads. To some extent, the classical
+optimization techniques as well as recent results for exploiting
+modern hardware transfer to this extended relational algebra. As an
+added bonus, traditional database workloads (OLTP/OLAP style) can be
+trivially supported.
 + 
+The focus of this project is on the latter style of systems. The
+overall goal is to experimentally identify classes of FAQ queries for
+which it would be beneficial to exploit techniques developed in the
+former class of systems. Concretely, this can be approached by
+experimentally studying queries in the FAQ framework (featuring joins)
+for which known results in evaluating linear algebra operations
+(specifically, matrix multiplication algorithms that run in less than
+O(n^3) time) can be exploited.
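As an illustration of why joins and matrix multiplication are related, the sketch below evaluates one sum-product query over two annotated binary relations, Q(a, c) = sum over b of R(a, b) * S(b, c), in two ways: with a hash join followed by aggregation, and as a dense matrix product. The encoding of relations as dictionaries and all function names are assumptions made for this example; whether this is the kind of query the project targets is likewise an assumption. The `matmul` shown is the schoolbook cubic algorithm and only marks the place where a sub-cubic algorithm could be substituted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical encoding: a binary relation maps (x, y) to a numeric annotation.
Relation = Dict[Tuple[int, int], float]
Matrix = List[List[float]]


def join_aggregate(r: Relation, s: Relation) -> Relation:
    """Evaluate Q(a, c) = sum_b R(a, b) * S(b, c) with a hash join on b."""
    s_by_b = defaultdict(list)
    for (b, c), weight in s.items():
        s_by_b[b].append((c, weight))
    out = defaultdict(float)
    for (a, b), weight in r.items():
        for c, other in s_by_b.get(b, []):
            out[(a, c)] += weight * other
    return dict(out)


def to_matrix(rel: Relation, n: int) -> Matrix:
    """Dense n x n matrix whose (i, j) entry is the annotation of (i, j), or 0."""
    m = [[0.0] * n for _ in range(n)]
    for (i, j), weight in rel.items():
        m[i][j] = weight
    return m


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Schoolbook O(n^3) product; a sub-cubic algorithm could be swapped in here."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def from_matrix(m: Matrix) -> Relation:
    """Keep only the non-zero entries, giving back a relation."""
    return {(i, j): v for i, row in enumerate(m) for j, v in enumerate(row) if v != 0}


r = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
s = {(0, 1): 4.0, (1, 0): 5.0, (1, 1): 6.0}
via_join = join_aggregate(r, s)
via_matrix = from_matrix(matmul(to_matrix(r, 2), to_matrix(s, 2)))
print(via_join == via_matrix)
```

The dense route pays for every cell of the matrices, including the zeros, so it is only attractive when the relations are dense enough; finding where that trade-off tips is the sort of question an experimental study would have to answer.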
  
 **Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
 
 **Status**: available
 +
  
 
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi