teaching:projh402 [2013/09/23 11:38]
svsummer created
teaching:projh402 [2015/08/18 11:22]
svsummer [Project proposals]
===== Project proposals =====
  
==== Fast loading of semantic web datasets into native RDF stores ====
  
The next generation of the Web, the so-called Semantic Web, stores
extremely large knowledge bases in the RDF data model. In this data
model, all knowledge is represented by means of triples of the form
(subject, property, object), where subject, property, and object can be
URLs, among other things.

In order to efficiently query such knowledge bases, the RDF data is
typically loaded into a so-called native RDF store. To ensure that the
knowledge is encoded for fast retrieval, the RDF store will first
encode all variable-length URLs in the dataset by fixed-width
integers, among other things. Each RDF triple is then encoded by
its corresponding integer triple (integer_of_subject,
integer_of_property, integer_of_object).

The purpose of this project is to implement and experimentally compare
a number of algorithms that can perform this encoding:

  * The trivial algorithm, which simply maintains a hashmap that maps URLs to their integer codes. When processing a triple (s, p, o), it looks up s, p, and o in the hashmap to see whether they have already been assigned an integer ID. If so, this ID is used for the encoding; otherwise they are inserted into the hashmap with new, unique IDs. The downside of this approach is that, while simple, it requires that all URLs fit in working memory.

  * A slightly smarter algorithm that works in multiple stages: the ID is computed by a pre-fixed hash function. For each URL, the URL and its ID are written to an output file. This file is later sorted on ID to check for possible hash collisions between distinct URLs.

  * Algorithms that use the best known state-of-the-art data structures for compactly representing sets of strings, such as the HAT-trie ("Engineering scalable, cache and space efficient tries for strings", Nikolas Askitis and Ranjan Sinha, The VLDB Journal, October 2010, Volume 19, Issue 5, pp. 633-660, and "HAT-Trie: A Cache-Conscious Trie-Based Data Structure for Strings", The 30th International Australasian Computer Science Conference (ACSC), Volume 62, pages 97-105, 2007).

  * Variations of the above algorithms, fine-tuned for semantic web datasets.
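The trivial in-memory algorithm can be sketched in a few lines; this is an illustrative Python sketch (the triple data and function names are invented for the example), not a prescribed implementation:

```python
# Minimal sketch of the trivial encoding algorithm: a dictionary maps
# each URL to an integer code, and each triple is rewritten as an
# integer triple. The whole dictionary must fit in working memory.

def encode_triples(triples):
    """Encode (subject, property, object) URL triples as integer triples."""
    ids = {}  # URL -> integer code

    def code_of(url):
        # Assign the next unused integer the first time a URL is seen.
        if url not in ids:
            ids[url] = len(ids)
        return ids[url]

    encoded = [(code_of(s), code_of(p), code_of(o)) for (s, p, o) in triples]
    return encoded, ids

# Hypothetical example dataset.
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/knows", "http://example.org/alice"),
]
encoded, dictionary = encode_triples(triples)
# encoded == [(0, 1, 2), (2, 1, 0)] with alice=0, knows=1, bob=2
```

The multi-stage and trie-based variants replace the plain dictionary here with a fixed hash function plus external sort, or with a compact string data structure, respectively.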
  
  
**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
  
**Status**: available
==== Graph Indexing for Fast Subgraph Isomorphism Testing ====
  
There is an increasing amount of scientific data, mostly from the bio-medical sciences, that can be represented as collections of graphs (chemical molecules, gene interaction networks, ...). A crucial operation when searching this data is subgraph isomorphism testing: given a pattern P (also a graph) that one is interested in, and a collection D of graphs (e.g., chemical molecules), find all graphs in D that have P as a subgraph. Unfortunately, the subgraph isomorphism problem is computationally intractable. In ongoing research, to enable tractable processing of this problem, we aim to reduce the number of candidate graphs in D on which a subgraph isomorphism test needs to be executed. Specifically, we index the graphs in the collection D by decomposing them into graphs for which subgraph isomorphism *is* tractable. An associated algorithm that filters out graphs that certainly cannot match P can then be formulated based on ideas from information retrieval.
  
In this project, the student will empirically validate on real-world datasets the extent to which graphs can be decomposed into graphs for which subgraph isomorphism is tractable, and run experiments to validate the effectiveness of the proposed method in terms of filtering power.
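To make the filtering idea concrete, here is a deliberately simple sketch of a necessary-condition filter in the information-retrieval spirit described above. It is not the proposed decomposition-based index: it merely summarizes each graph by its multiset of edge labels, and prunes graphs that cannot possibly contain the pattern:

```python
# Illustrative filter (an assumption for this sketch, not the actual
# research method): a graph can only contain pattern P as a subgraph if
# it has at least as many edges of each label as P does. Graphs failing
# this cheap test are discarded before any expensive isomorphism check.
from collections import Counter

def label_signature(edges):
    """edges: list of (u, label, v) tuples. Returns label -> count."""
    return Counter(label for (_, label, _) in edges)

def candidates(pattern_edges, collection):
    """Keep only graphs whose label signature dominates the pattern's."""
    psig = label_signature(pattern_edges)
    return [g for g in collection
            if all(label_signature(g)[lbl] >= cnt for lbl, cnt in psig.items())]

# Toy data: one "bond" edge pattern against two tiny molecule graphs.
pattern = [("x", "bond", "y")]
D = [[("a", "bond", "b")], [("a", "ionic", "b")]]
# candidates(pattern, D) keeps only the first graph.
```

The real index replaces the label multiset with tractable subgraphs obtained by decomposition, but the overall filter-then-verify pipeline is the same.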
**Interested?** Contact [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
**Status**: available

==== Principles of Database Management Architectures in Managed Virtual Environments ====

With the growing popularity of Big Data, many data processing engines
are implemented in a managed virtual environment such as the Java
Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill,
...). While this improves the portability of the engine, the tradeoffs
and implementation principles w.r.t. traditional C++ implementations
are sometimes less understood.

The objective of this project is to develop some basic functionalities
of a database storage engine (linked files, B-tree, extensible hash
table, basic external-memory sorting) in a managed virtual machine
(i.e., the Java Virtual Machine or the .NET Common Language
Runtime), and compare this with a C++-based implementation on both (1)
ease of implementation and (2) execution efficiency. In order to
develop the managed virtual machine implementation, the interested
student will need to research the best practices that are used in the
above-mentioned projects to gain maximum execution speed (e.g., use of
the sun.misc.Unsafe API, memory-mapped files, ...).
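As a flavor of the storage-engine primitives involved, here is a minimal sketch of a heap file of fixed-width records accessed through a memory-mapped file, a technique engines use to avoid per-read system calls. The record layout (8-byte key plus 56-byte payload) and file name are assumptions for illustration only:

```python
# Sketch of a fixed-width record heap file with mmap-based random access.
# Layout assumption: 64-byte records = 8-byte little-endian key + 56-byte
# zero-padded payload.
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q56s")  # 8-byte int key, 56-byte payload

def write_records(path, records):
    """Append (key, payload-bytes) pairs as fixed-width records."""
    with open(path, "wb") as f:
        for key, payload in records:
            f.write(RECORD.pack(key, payload.ljust(56, b"\0")))

def read_record(path, slot):
    """Random access to record number `slot` via a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            off = slot * RECORD.size
            key, payload = RECORD.unpack(m[off:off + RECORD.size])
            return key, payload.rstrip(b"\0")

path = os.path.join(tempfile.gettempdir(), "heap_demo.dat")
write_records(path, [(1, b"alice"), (2, b"bob")])
key, val = read_record(path, 1)
# key == 2, val == b"bob"
```

In the JVM the analogous tools would be `java.nio.MappedByteBuffer` (or `sun.misc.Unsafe`); part of the project is precisely to measure how such managed-runtime mechanisms compare to direct C++ I/O.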
  
**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
**Status**: available


==== Development of a compiler and runtime engine for AQL ====

In 2005, researchers at the IBM Almaden Research Center developed a
new system specifically geared for practical information extraction in
the enterprise. This effort led to [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEYQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&ei=gyhIUe-XPIexPJ-fgLAG&usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&sig2=LVQkus_67uSVlwK34BXZ8w&bvm=bv.43828540,d.ZWU|SystemT]], a rule-based IE system with an SQL-like declarative language named [[http://pic.dhe.ibm.com/infocenter/bigins/v2r0/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/aql_overview.html|AQL (Annotation Query Language)]].
The declarative nature of AQL enables new kinds of tools for extractor
development, and a cost-based optimizer for performance.

The goal of this project is to develop an open-source compiler and
runtime environment for (a simplified version of) AQL.
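One of AQL's core extraction primitives is regular-expression extraction, which produces a view of character spans matching a pattern over document text. The sketch below illustrates the kind of operation a minimal runtime would evaluate; the function name and span representation are simplifications invented for this example, not real AQL syntax or semantics:

```python
# Toy evaluation of an AQL-style "extract regex" statement: return all
# spans (begin offset, end offset, covered text) matching a pattern.
import re

def extract_regex(pattern, doc):
    """Return (begin, end, text) spans for all matches of pattern in doc."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]

doc = "Contact us at +32 2 650 21 11 or +32 2 650 30 57."
phones = extract_regex(r"\+\d{2}(?: \d+)+", doc)
# Two phone-number spans are extracted from the document.
```

A full compiler would parse AQL statements into a plan of such span operators (extract, select, consolidate, ...) and let a cost-based optimizer reorder them.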

**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)

**Status**: available

==== Development of a distributed simulation algorithm ====

Simulation and bisimulation are fundamental notions in computer
science. They underlie many formal verification algorithms, and have
recently been applied to the construction of so-called structural
indexes, which are novel index data structures for relational databases
and the Semantic Web. Essentially, a (bi)simulation is a relation on
the nodes of a graph. Unfortunately, however, while efficient
main-memory algorithms exist for computing whether two nodes are
similar, these algorithms fail when the input graphs are too large to
fit in main memory.

The objective of this project is to implement a recently proposed
algorithm for computing simulation in a distributed setting, and
provide a preliminary performance evaluation of this implementation.
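To make the notion concrete, here is a minimal main-memory sketch of the naive fixpoint computation of the maximal simulation relation on a labeled graph (the distributed algorithm targeted by the project refines this idea; the graph data here is an invented example). Node u is simulated by v if they carry the same label and every successor of u is simulated by some successor of v:

```python
# Naive fixpoint computation of the maximal simulation relation.
# Start from all label-compatible pairs, then repeatedly discard pairs
# (u, v) for which some successor of u has no matching successor of v.

def simulation(nodes, label, succ):
    """Return the maximal simulation relation as a set of pairs (u, v)."""
    rel = {(u, v) for u in nodes for v in nodes if label[u] == label[v]}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(rel):
            # (u, v) survives only if every u-successor is matched by
            # some v-successor still in the relation.
            if not all(any((u2, v2) in rel for v2 in succ[v]) for u2 in succ[u]):
                rel.discard((u, v))
                changed = True
    return rel

# Toy graph: a and b share a label, but only a has an outgoing edge.
nodes = ["a", "b", "c"]
label = {"a": "X", "b": "X", "c": "Y"}
succ = {"a": ["c"], "b": [], "c": []}
rel = simulation(nodes, label, succ)
# b is simulated by a, but a is not simulated by b
# (b has no successor to match a's edge to c).
```

This quadratic-space fixpoint is exactly what stops working once the graph, or the relation itself, no longer fits in main memory, which motivates the distributed formulation.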

**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)

**Status**: available
  
  
  
  
 
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi