Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
teaching:projh402 [2015/06/26 16:09]
svsummer
teaching:projh402 [2016/09/22 14:35]
svsummer [Graph Indexing for Fast Subgraph Isomorphism Testing]
Line 7: Line 7:
 ===== Project proposals ===== ===== Project proposals =====
  
 +=== Fast loading of semantic web datasets into native RDF stores ====
 +
 +The next generation of the Web, the so-called Semantic Web, stores
 +extremely large knowledge bases in the RDF data model. In this data
 +model, all knowledge is represented by means of triples of the form
 +(subject, property, object), where subject, property and object can be
 +URLs, among other things.
 +
 +In order to effeciently query such knowledge bases, the RDF data is
 +typically loaded into a so-called native RDF store. To ensure that the
 +knowledge is encoded for fast retrieval, the RDF store will first
 +encode all variable-length URLs in the dataset by fixed-width
 +integers, among other things. Each RDF triple will then be encoded by
 +by their corresponding integer triples (integer_of_subject,​
 +integer_of_property,​ integer_of_object).
 +
 +The purpose of this project is to implement and experementally compare
 +a number of algorithms that can perform this encoding:
 +
 +  * The trivial alogrithm that simply maintains a hashmap that maps URLs to their integer codes. When processing triple (s,p,o), it looks up s, p, and o in the hashmap to see if they have alraedy been assigned an integer ID. If so, this id is used for the   ​encoding;​ otherwise they are inserted into the hashmap with new, unique ids. The downside of this approach is that, while simple, it requires that one can store all URLS in working memory.
 +
 +  * The slightly smarter algorithm that works in multiple stages: the ID is computed by a pre-fixed hash function. For each URL, the URL and its ID are written to an output file. This file is later sorted on ID to check for possible ​ hash collisions between distinct URLS. 
 +
 +  * Algorithms that use the best known state-of-the art data structures for compactly storing respresenting sets of strings, such as the HAT-TRIE ("​Engineering scalable, cache and space efficient tries for strings"​ Nikolas Askitis, Ranjan Sinha, The VLDB Journal, October 2010, Volume 19, Issue 5, pp 633-660 and "​HAT-Trie:​ A Cache-Conscious Trie-Based Data Structure For Strings",​ The 30th International Australasian Computer Science Conference (ACSC), Volume 62, pages 97 - 105, 2007.).
 +
 +  * Variations of the above algorithms, fine-tuned for semantic web datasets.
 +
 +
 +**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
 +
 +**Status**: available
 ===== Graph Indexing for Fast Subgraph Isomorphism Testing ===== ===== Graph Indexing for Fast Subgraph Isomorphism Testing =====
  
Line 41: Line 72:
 **Status**: available **Status**: available
  
- 
-==== Development of a compiler and runtime engine for AQL ==== 
- 
-In 2005, researchers at the IBM Almaden Research Center developped a 
-new system specifically geared for practical information extraction in 
-the enterprise. This effort lead to [[https://​www.google.be/​url?​sa=t&​rct=j&​q=&​esrc=s&​source=web&​cd=2&​cad=rja&​ved=0CEYQFjAB&​url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&​ei=gyhIUe-XPIexPJ-fgLAG&​usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&​sig2=LVQkus_67uSVlwK34BXZ8w&​bvm=bv.43828540,​d.ZWU|SystemT]] , a rule-based IE system with an SQL-like declarative language named [[http://​pic.dhe.ibm.com/​infocenter/​bigins/​v2r0/​topic/​com.ibm.swg.im.infosphere.biginsights.analyze.doc/​doc/​aql_overview.html|AQL (Annotation Query Language)]]. 
-The declarative nature of AQL enables new kinds of tools for extractor 
-development,​ and a cost-based optimizer for 
-performance.  ​ 
- 
-The goal of this project is to develop an open-source compiler and 
-runtime environment of (a simplified version of) AQL. 
- 
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) 
- 
-**Status**: available 
  
 ==== Development of a distributed simulation algorithm ==== ==== Development of a distributed simulation algorithm ====
 
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi