Differences

This shows you the differences between two versions of the page.

teaching:mfe:is [2014/02/19 15:18]
ezimanyi [Extending SPARQL for Spatio-temporal Data Support]
teaching:mfe:is [2015/04/13 13:47]
svsummer
Line 1: Line 1:
-====== MFE 2014-2015 : Web and Information Systems ======
 +====== MFE 2015-2016 : Web and Information Systems ======
  
 ===== Introduction =====
Line 15: Line 15:
 
 <note>Please note that this list of subjects is **not exhaustive. Interested students are invited to propose original subjects.**</note>
 +
 +===== Master Thesis in Collaboration with Euranova =====
 +
 +Our laboratory performs collaborative research with Euranova R&D (http://euranova.eu/). The list of subjects proposed for this year by Euranova can be found
 +{{:teaching:mfe:mt2014_euranova.pdf|here}}.
 +
 +These subjects include topics on distributed graph processing, processing big data using Map/Reduce, cloud computing, and social networks.
 +
 +  * Contact : [[ezimanyi@ulb.ac.be|Esteban Zimanyi]]
  
 ===== Automatic detection of name variations =====
Line 101: Line 110:
 Interested? Contact [[toon.calders@ulb.ac.be|Toon Calders]]
  
-===== Master Thesis in Collaboration with Euranova ===== 
  
-Our laboratory performs collaborative research with Euranova R&D (http://euranova.eu/). The list of subjects proposed for this year by Euranova can be found
-{{:teaching:mfe:euranova_master_thesis_2013_2014.pdf|here}}.
 +===== Design and Implementation of a Curriculum Revision Tool =====
  
-These subjects include topics on distributed graph processing, processing big data using Map/Reduce, cloud computing, and social networks.
 +Stijn Vansummeren (WIT), Frédéric Robert (BEAMS)
  
-  * Contact : [[ezimanyi@ulb.ac.be|Esteban Zimanyi]]
 +This MFE concerns the analysis, design, and implementation of a
 +software system that can assist in the revision of teaching curricula
 +(also known as teaching programs).
  
-===== Efficient computation of simulation for structural indexing =====
-
-Simulation and bisimulation are fundamental notions in computer science. They underlie many formal verification algorithms, and have recently been applied to the construction of indexing data structures for relational databases and the semantic web.
-
-Essentially, a simulation or bisimulation is a relation on the nodes
-of a graph. Unfortunately, however, while efficient main-memory
-algorithms for computing whether two nodes are simulating or bisimulating exist, these algorithms fail when the input graphs are too large to fit in main memory.
-
-The goal of this thesis is to study, compare, and implement various
-approaches to computing simulation in an external memory setting, for
-the explicit purpose of using the implementation to efficiently construct
-simulation-based indexes for large relational databases and the
-semantic web.
-
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-
 +The primary targeted functionalities of the software system are as
 +follows:
 +  * It should allow different versions of the teaching programs to be created, much in the same way as version control systems like Git and Subversion offer the possibility to make different "development branches" of a program's source code.
 +  * It should offer an extensible means to check the modified program for inconsistencies. (For example, if course X has course Y as a prerequisite, then course Y should not be scheduled in the 2nd semester and X in the 1st semester. Moreover, the total number of ECTS of all courses should be at most 60.)
 +  * It should allow the modifications proposed in the teaching programs to be analyzed, and summarize the impact that these changes could have on other programs. (For example, if a course is removed from the computer science curriculum, it should be flagged that it should also be removed from all curricula that include the course.)
 +  * It should load data from (and preferably save data to) the ULB central administration database.
 +  * It should give suggestions concerning the impact of the modifications on the course schedules.
-===== Aspects of Text Analytics and Information Extraction =====
-
-Automatically extracting structured information from text is a task that has been pursued for decades. As a discipline, //Information Extraction// (IE) had its start with the [[http://acl.ldc.upenn.edu/C/C96/C96-1079.pdf|DARPA Message Understanding Conference in 1987]]. While early work in the area focused largely on military applications, recent changes have made information extraction increasingly important to an increasingly broad audience. Trends such as the rise of social media have produced huge amounts of text data, while analytics platforms like Hadoop have at the same time made the analysis of this data more accessible to a broad range of users. Since most analytics over text involves information extraction as a first step, IE is a very important part of
-data analysis in the enterprise today.
-
-Broadly speaking, there are two main schools of thought on the realization of IE: the //statistical// (machine-learning) methodology and the //rule-based// approach. The first started with simple models, then progressed to approaches based on probabilistic graph models. Within the rule-based approach, most of the solutions build upon [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEEQFjAB&url=http%3A%2F%2Fwww.dfki.de%2F~neumann%2Fesslli04%2Freader%2Foverview%2FIJCAI99.pdf&ei=1yZIUdSZPMWHPa2rgagP&usg=AFQjCNFA6QYIt4yNR0oZRL4yjd--kev37A&sig2=nEILF_cNDk4JWiVDS5BXvg&bvm=bv.43828540,d.ZWU|cascaded finite-state transducers]]. Most systems in both categories were built for academic settings, where most users are highly-trained computational linguists, where workloads cover only a small number of very well-defined tasks and data sets, and where extraction throughput is far less important than the accuracy of results.
-
-In practice, these existing tools suffer from a number of practical problems. For example, users need to have an intuitive understanding of machine learning or the ability to build and understand complex and highly interdependent rules. Determining why an extractor produced a given incorrect result
-is hence often deemed extremely difficult, which makes reuse of extractors across different data sets and applications impractical. And extremely
-high CPU and memory requirements make extractors cost-prohibitive to deploy over large-scale data sets.
-
-In 2005, researchers at the IBM Almaden Research Center started work on a new system specifically geared for practical information extraction in the enterprise. This effort led to [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEYQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&ei=gyhIUe-XPIexPJ-fgLAG&usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&sig2=LVQkus_67uSVlwK34BXZ8w&bvm=bv.43828540,d.ZWU|SystemT]], a rule-based IE system with an SQL-like declarative language named [[http://pic.dhe.ibm.com/infocenter/bigins/v2r0/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/aql_overview.html|AQL (Annotation Query Language)]].
-The declarative nature of AQL enables new kinds of tools for extractor
-development, and a cost-based optimizer for
-performance.
-
-The goal of this thesis is to study and compare both the
-traditional methods towards information extraction and the new
-AQL-based method proposed by SystemT, based on experimental
-evaluation of information extraction problems on the
-Web. Additional possible topics of study include (1) the
-implementation and optimization aspects of AQL, (2) the extension
-of AQL with probabilistic methods, or (3) the inference of AQL
-rules from examples.
-
-
-Interested? Contact [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-
-===== Models for programming Data Management in the Cloud =====
-
-Many say that "The Cloud" is the next computing platform on the
-Web. Unfortunately, "the cloud" has become a marketing buzzword with
-many different services offered, from the rental of virtual machines,
-to the rental of storage space, to specific compute platforms
-(e.g. MapReduce) that offer transparent parallelization.
-
-In this thesis, we are interested in the cloud from the point of view
-of data management. There is a recent trend in data management
-research to use logic programming rule-based languages to specify
-distributed applications, most notably on the web, as well as
-inference in the semantic web (see below for a list of
-references). The goal of this thesis is to study, compare, and where
-possible extend the current (logic-programming based) proposals for
-managing data in the cloud.
-
-  * References:
-       * http://boom.cs.berkeley.edu/
-       * http://p2.cs.berkeley.edu/index.php
-       * http://www.comlab.ox.ac.uk/files/3608/RR-10-21.pdf
-
-\\
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-
-===== Distributed Structural Indexes for RDF Data =====
-
-In an effort to enable people to share information in a
-structured form on the Web as easily as they can share unstructured
-HTML documents today, the World Wide Web Consortium (W3C for short) is
-calling for the creation of a Web of Linked Data. In the same way as
-one uses HTML and hyperlinks to publish and connect information on the
-Web of Documents, one uses the RDF data model and RDF links to publish
-and connect structured information on the Web of Linked Data. The
-advantage of RDF over HTML lies in its simplicity: all information is
-represented uniformly as triples of the form (subject, predicate,
-object). This allows one to represent both facts about entities (e.g.,
-(Tim Berners-Lee, age, 54)) and links between entities (e.g. (Tim
-Berners-Lee, author of, http://...)) in an easily
-machine-interpretable manner. This is much more difficult with HTML
-where there is little or no constraint on the way information is
-represented.
-
-Linked Data has the potential to turn the Web into one huge database
-with structured querying capabilities that vastly exceed the limited
-keyword search queries so common on the Web of Documents today.
-
-As a key component of efficient query answering in Linked Data Management systems, much research is focused on devising high-performance native RDF indexing data structures. One class of such indexes, called structural indexes, seems very promising in this respect. Currently however, structural indexes for RDF are difficult to distribute across the web. Given the importance of distribution in web-scale data, the goal of this thesis is to investigate how structural RDF indexes can be used in a distributed query answering platform.
-
-
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-
-
-
-=====Foundations of Data Description Languages=====
-
-Recently, several small "domain specific languages" have been proposed
-to facilitate programming with ad hoc data (including PADS,
-DATASCRIPT, PACKETTYPES, Microsoft M Grammar). Ad hoc data is data
-other than data in well-behaved relational or XML formats.
-
-The above languages take as input a description of the data format to
-be dealt with, and automatically generate a large number of software
-tools (parsers, serializers, data transformers, error recognition,
-...) to process the ad hoc data.
-
-The goal of this thesis is to study the programming language-theory
-foundations behind these languages, their commonalities and their
-differences. If possible, suggestions for further extensions to the
-languages should be formulated.
-
-  * References :
-      * http://datascript.sourceforge.net/
-      * http://www.padsproj.org/index.html
-
-\\
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
  
-=====Capturing Semantic Web Data from Web Pages=====
 +A proof-of-concept implementation of a revision tool that supports the first two requirements above is currently being developed in the context of a PROJH402 project. The MFE student who selects this topic is expected to:
  
 +  * Develop this prototype to a production-ready implementation.
 +  * Implement the communication with the central ULB database.
 +  * Implement the impact analysis concerning the course schedules.
 +  * Interact with the administration of the Ecole Polytechnique to fine-tune the above requirements; test the implementation; and integrate remarks after testing.
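
As an illustration of the consistency-checking requirement for the curriculum revision tool, the two example rules quoted in the topic description (prerequisite ordering and the 60 ECTS cap) could be written as pluggable checks. This is only a minimal Python sketch and not the PROJH402 prototype: the `Course` fields, the rule functions, and `check_program` are hypothetical names introduced here, and a program is assumed to be a flat list of courses for one year.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Course:
    # All fields are assumptions made for this sketch.
    code: str
    ects: int
    semester: int  # 1 or 2
    prerequisites: frozenset = field(default_factory=frozenset)


# A rule takes the courses of a program and returns a list of messages.
Rule = Callable[[List[Course]], List[str]]


def prerequisite_order(courses: List[Course]) -> List[str]:
    """Flag a prerequisite that is scheduled in a later semester."""
    by_code = {c.code: c for c in courses}
    problems = []
    for course in courses:
        for pre_code in sorted(course.prerequisites):
            pre = by_code.get(pre_code)
            # Prerequisites outside this program are ignored in this sketch.
            if pre is not None and pre.semester > course.semester:
                problems.append(
                    f"{pre_code} is a prerequisite of {course.code} "
                    "but is scheduled in a later semester"
                )
    return problems


def ects_cap(courses: List[Course], cap: int = 60) -> List[str]:
    """Flag a program whose total credits exceed the cap."""
    total = sum(c.ects for c in courses)
    if total > cap:
        return [f"total of {total} ECTS exceeds the cap of {cap}"]
    return []


DEFAULT_RULES = (prerequisite_order, ects_cap)


def check_program(courses: Iterable[Course], rules=DEFAULT_RULES) -> List[str]:
    """Run every rule and collect the reported inconsistencies."""
    courses = list(courses)
    problems: List[str] = []
    for rule in rules:
        problems.extend(rule(courses))
    return problems
```

Passing the rules as a sequence of functions is one way to keep the check extensible: a new constraint is added by writing another function and appending it to `rules`, without touching `check_program`.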
  
-The [[http://linkeddata.org/|Linked Open Data]] (LOD) initiative is aimed at extending the Web by means of publishing various open datasets as RDF and setting RDF links between data items from different data sources. In spite of the interest of organizations in publishing their data, many of them are not willing to pay the price of devoting working hours of their employees to the hard work that preparing and updating these data requires. Therefore, a very interesting and practical problem that arises is how to produce LOD automatically from Web sites. This problem can be tackled if selected and well-defined domains are chosen.
 +Contact : Stijn Vansummeren <stijn.vansummeren@ulb.ac.be>, Frédéric Robert <frrobert@ulb.ac.be>
  
-  
-In this thesis we propose to select a site of a broadcasting company and, through intelligent crawling techniques, capture data of interest and publish it as RDF data. In a second step, we propose to use these data to pose queries that involve different nodes of the Web of linked data.
-  
  
-* Contacts :  
-    * [[ezimanyi@ulb.ac.be|Esteban Zimányi]] ​ 
-  
 =====Publishing and Using Spatio-temporal Data on the Semantic Web=====
 
Line 252: Line 145:
 by application providers, that can build attractive and useful applications, in particular, for devices like mobile phones, tablets, etc.
  
-The goals of this thesis are: (i) study the existing proposals for mapping spatio-temporal data into LOD; (ii) apply this mapping to a real-world case study (as was the case for the [[http://www.oscb.be/|Open Semantic Cloud for Brussels]] project); (iii) based on the produced mapping, and using existing applications like the [[http://linkedgeodata.org/|Linked Geo Data project]], build applications that make use of LOD, for example to find out which cultural events are taking place at a given time at a given location.
 +The goals of this thesis are: (1) study the existing proposals for mapping spatio-temporal data into LOD; (2) apply this mapping to a real-world case study (as was the case for the [[http://www.oscb.be/|Open Semantic Cloud for Brussels]] project); (3) based on the produced mapping, and using existing applications like the [[http://linkedgeodata.org/|Linked Geo Data project]], build applications that make use of LOD, for example to find out which cultural events are taking place at a given time at a given location.
    
  
 
teaching/mfe/is.txt · Last modified: 2020/09/29 17:03 by mahmsakr