===== Project proposals =====
  
==== Fast loading of semantic web datasets into native RDF stores ====
  
The next generation of the Web, the so-called Semantic Web, stores
extremely large knowledge bases in the RDF data model. In this data
model, all knowledge is represented by means of triples of the form
(subject, property, object), where subject, property and object can
be URLs, among other things.

In order to efficiently query such knowledge bases, the RDF data is
typically loaded into a so-called native RDF store. To ensure that
the knowledge is encoded for fast retrieval, the RDF store will first
encode all variable-length URLs in the dataset by fixed-width
integers, among other things. Each RDF triple will then be encoded by
its corresponding integer triple (integer_of_subject,
integer_of_property, integer_of_object).

The purpose of this project is to implement and experimentally
compare a number of algorithms that can perform this encoding:

  * The trivial algorithm that simply maintains a hashmap mapping URLs to their integer codes. When processing a triple (s,p,o), it looks up s, p, and o in the hashmap to see whether they have already been assigned an integer ID. If so, this ID is used for the encoding; otherwise they are inserted into the hashmap with new, unique IDs. The downside of this approach is that, while simple, it requires that all URLs can be stored in working memory.

  * A slightly smarter algorithm that works in multiple stages: the ID of each URL is computed by a pre-fixed hash function. For each URL, the URL and its ID are written to an output file. This file is later sorted on ID to check for possible hash collisions between distinct URLs.

  * Algorithms that use the best known state-of-the-art data structures for compactly representing sets of strings, such as the HAT-trie ("Engineering scalable, cache and space efficient tries for strings", Nikolas Askitis and Ranjan Sinha, The VLDB Journal, October 2010, Volume 19, Issue 5, pp. 633-660, and "HAT-Trie: A Cache-Conscious Trie-Based Data Structure for Strings", The 30th Australasian Computer Science Conference (ACSC), Volume 62, pages 97-105, 2007).

  * Variations of the above algorithms, fine-tuned for semantic web datasets.
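As an illustration, the trivial hashmap-based approach can be sketched in a few lines of Python (a minimal illustration only; the function name and example URLs are made up for this sketch):

```python
# Sketch of the trivial dictionary-encoding algorithm: keep a hashmap
# from URLs to integer IDs, assigning a fresh ID on first occurrence.

def encode_triples(triples):
    """Encode (subject, property, object) URL triples as integer triples."""
    ids = {}       # hashmap: URL -> fixed-width integer ID
    encoded = []
    for s, p, o in triples:
        row = []
        for term in (s, p, o):
            if term not in ids:        # first time we see this URL:
                ids[term] = len(ids)   # assign the next unused ID
            row.append(ids[term])
        encoded.append(tuple(row))
    return encoded, ids

triples = [
    ("http://ex.org/alice", "http://ex.org/knows", "http://ex.org/bob"),
    ("http://ex.org/bob",   "http://ex.org/knows", "http://ex.org/alice"),
]
print(encode_triples(triples)[0])   # [(0, 1, 2), (2, 1, 0)]
```

Note that the hashmap `ids` holds every distinct URL, which is exactly the working-memory limitation mentioned above.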
  
  
**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
  
**Status**: available
==== Graph Indexing for Fast Subgraph Isomorphism Testing ====
  
There is an increasing amount of scientific data, mostly from the bio-medical sciences, that can be represented as collections of graphs (chemical molecules, gene interaction networks, ...). A crucial operation when searching in this data is that of subgraph isomorphism testing: given a pattern P that one is interested in (also a graph) and a collection D of graphs (e.g., chemical molecules), find all graphs in D that have P as a subgraph. Unfortunately, the subgraph isomorphism problem is computationally intractable. In ongoing research, to enable tractable processing of this problem, we aim to reduce the number of candidate graphs in D on which a subgraph isomorphism test needs to be executed. Specifically, we index the graphs in the collection D by decomposing them into graphs for which subgraph isomorphism *is* tractable. An associated algorithm that filters out graphs that certainly cannot match P can then be formulated based on ideas from information retrieval.
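The filter-then-verify principle behind such an index can be illustrated with a deliberately simplified sketch: here each graph is indexed by its set of edge labels, standing in for the tractable decompositions used in the actual research (all names and the feature choice are illustrative, not the project's method):

```python
# Toy filtering index: a graph can only contain the pattern P if it
# contains every feature of P; graphs missing a feature are pruned
# before any (expensive) subgraph isomorphism test is run.

def features(graph):
    """Feature set of a graph given as a list of (src, label, dst) edges."""
    return {label for _, label, _ in graph}

def candidates(pattern, collection):
    """Graphs that might contain `pattern`; all others are filtered out."""
    needed = features(pattern)
    return [g for g in collection if needed <= features(g)]

P  = [("a", "bond1", "b")]
G1 = [("x", "bond1", "y"), ("y", "bond2", "z")]   # has bond1 -> candidate
G2 = [("x", "bond2", "y")]                        # no bond1  -> pruned
print(len(candidates(P, [G1, G2])))               # 1
```

Only the surviving candidates would then undergo the full subgraph isomorphism test.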
  
In this project, the student will empirically validate on real-world datasets the extent to which graphs can be decomposed into graphs for which subgraph isomorphism is tractable, and run experiments to validate the effectiveness of the proposed method in terms of filtering power.

**Interested?** Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]

**Status**: available

==== Principles of Database Management Architectures in Managed Virtual Environments ====

With the gaining popularity of Big Data, many data processing engines
are implemented in a managed virtual environment such as the Java
Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill,
...). While this improves the portability of the engine, the
tradeoffs and implementation principles w.r.t. traditional C++
implementations are sometimes less understood.

The objective of this project is to develop some basic
functionalities of a database storage engine (linked files, B-trees,
extensible hash tables, basic external-memory sorting) in a managed
virtual machine (i.e., the Java Virtual Machine or the .NET Common
Language Runtime), and to compare this with a C++-based
implementation on both (1) ease of implementation and (2) execution
efficiency. In order to develop the managed virtual machine
implementation, the interested student will need to research the best
practices that are used in the above-mentioned projects to gain
maximum execution speed (e.g., use of sun.misc.Unsafe,
memory-mapped files, ...).
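To give a flavour of one listed component, here is a minimal sketch of basic external-memory sorting (in Python for brevity, whereas the project itself targets the JVM/.NET and C++): sort fixed-size chunks in memory, spill each sorted run to a temporary file, then k-way merge the runs:

```python
# External merge sort sketch: in-memory sort of small chunks, runs
# spilled to temporary files on disk, followed by a k-way merge that
# reads the runs back one line at a time.
import heapq
import tempfile

def external_sort(values, chunk_size=4):
    runs = []
    for start in range(0, len(values), chunk_size):
        run = sorted(values[start:start + chunk_size])  # in-memory sort
        f = tempfile.TemporaryFile(mode="w+")           # spill run to disk
        f.writelines(f"{v}\n" for v in run)
        f.seek(0)
        runs.append(f)
    # k-way merge of the sorted runs (each stream is read lazily)
    streams = [(int(line) for line in f) for f in runs]
    return list(heapq.merge(*streams))

print(external_sort([9, 1, 7, 3, 8, 2, 6]))   # [1, 2, 3, 6, 7, 8, 9]
```

A real storage engine would of course use binary pages and buffered I/O rather than one text line per value; the chunk-sort-and-merge structure is the point here.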

**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)

**Status**: taken.

==== Development of a distributed simulation algorithm ====

Simulation and bisimulation are fundamental notions in computer
science. They underlie many formal verification algorithms, and have
recently been applied to the construction of so-called structural
indexes, which are novel index data structures for relational
databases and the Semantic Web. Essentially, a (bi)simulation is a
relation on the nodes of a graph. Unfortunately, however, while
efficient main-memory algorithms for computing whether two nodes are
similar exist, these algorithms fail when the input graphs are too
large to fit in main memory.

The objective of this project is to implement a recently proposed
algorithm for computing simulation in a distributed setting, and to
provide a preliminary performance evaluation of this implementation.
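For reference, the classical main-memory computation (not the distributed algorithm this project asks for) can be sketched as a naive fixpoint that starts from all label-compatible node pairs and discards pairs violating the simulation condition:

```python
# Naive computation of the largest simulation on a node-labeled graph.
# A pair (u, v) in the result means: node v simulates node u.

def simulation(labels, succ):
    nodes = list(labels)
    # start from all pairs of equally labeled nodes
    sim = {(u, v) for u in nodes for v in nodes if labels[u] == labels[v]}
    changed = True
    while changed:                      # refine until a fixpoint is reached
        changed = False
        for (u, v) in list(sim):
            # every successor of u must be simulated by some successor of v
            if not all(any((u2, v2) in sim for v2 in succ.get(v, []))
                       for u2 in succ.get(u, [])):
                sim.discard((u, v))
                changed = True
    return sim

labels = {1: "a", 2: "a", 3: "b"}
succ   = {1: [3], 2: []}        # node 1 has a b-successor, node 2 does not
rel = simulation(labels, succ)
print((2, 1) in rel, (1, 2) in rel)   # True False
```

This refinement touches the whole relation on every pass, which is exactly what breaks down once the graph no longer fits in main memory and motivates the distributed formulation.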
  
**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
  
**Status**: available
  
 
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi