Differences

This shows you the differences between two versions of the page.

--- teaching:mfe:is [2014/03/25 13:08]
svsummer [Comparision of Big Data Analysis tools]
+++ teaching:mfe:is [2015/04/13 13:47]
svsummer
@@ Line 1: / Line 1: @@
-====== MFE 2014-2015 : Web and Information Systems ======
+====== MFE 2015-2016 : Web and Information Systems ======
 ===== Introduction =====
@@ Line 15: / Line 15: @@
 <note>Please note that this list of subjects is **not exhaustive. Interested students are invited to propose original subjects.**</note>
+===== Master Thesis in Collaboration with Euranova =====
+Our laboratory performs collaborative research with Euranova R&D (http://euranova.eu/). The list of subjects proposed for this year by Euranova can be found
+{{:teaching:mfe:mt2014_euranova.pdf|here}}
+These subjects include topics on distributed graph processing, processing big data using Map/Reduce, cloud computing, and social networks.
+  * Contact : [[ezimanyi@ulb.ac.be|Esteban Zimanyi]]
 ===== Automatic detection of name variations =====
@@ Line 100: / Line 109: @@
 Interested? Contact [[toon.calders@ulb.ac.be|Toon Calders]]
-===== Master Thesis in Collaboration with Euranova =====
-Our laboratory performs collaborative research with Euranova R&D (http://euranova.eu/). The list of subjects proposed for this year by Euranova can be found
-{{:teaching:mfe:mt2014_euranova.pdf|here}}
-These subjects include topics on distributed graph processing, processing big data using Map/Reduce, cloud computing, and social networks.
-  * Contact : [[ezimanyi@ulb.ac.be|Esteban Zimanyi]]
-===== Structural compression of relational and semantic web databases =====
-Recent research in database management systems at ULB has shown how to
-theoretically construct succinct (compressed) representations for
-relational databases and semantic web databases. The advantage of
-these succinct representations is that they allow querying directly
-*on the succinct representation*, without needing to consult the
-underlying database.
-The goal of this thesis is to study scalable algorithms for
-constructing the actual succinct representations. Some in-memory
-algorithms are already known, but given the large size of typical
-databases, distributed and out-of-memory alternatives need to be found.
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
 ===== Design and Implementation of a Curriculum Revision Tool ======
-Stijn Vansummeren (WIT)
+Stijn Vansummeren (WIT), Frédéric Robert (BEAMS)
 This MFE concerns the analysis, design, and implementation of a
@@ Line 143: / Line 121: @@
 The primary targeted functionalities of the software system are as
 follows:
+  * It should make it possible to create different versions of the teaching programs, much as version control systems like Git and Subversion let developers create different "development branches" of a program's source code.
+  * It should provide an extensible means of checking the modified program for inconsistencies. (For example, if course X has course Y as a prerequisite, then Y should not be scheduled in the 2nd semester while X is in the 1st semester. Moreover, the total number of ECTS of all courses should be at most 60.)
+  * It should allow analyzing the modifications proposed in the teaching programs and summarizing the impact that these changes could have on other programs. (For example, if a course is removed from the computer science curriculum, it should be flagged that it should also be removed from all curricula that included the course.)
+  * It should load data from (and preferably, save data to) the ULB central administration database.
+  * It should give suggestions concerning the impact of the modifications on the course schedules.
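The consistency checks given as examples in the list above (prerequisite ordering across semesters and the 60 ECTS cap) could be sketched roughly as follows. This is only an illustrative sketch, not the design of the actual tool: the `Course` class, the function names, and the `CHECKS` list are hypothetical, and a program is assumed to be a flat list of courses, each assigned to semester 1 or 2.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ECTS = 60  # upper bound taken from the example in the requirements


@dataclass
class Course:
    """Hypothetical, minimal model of a course in a teaching program."""
    name: str
    ects: int
    semester: int  # 1 or 2
    prerequisites: List[str] = field(default_factory=list)


def check_prerequisite_order(program: List[Course]) -> List[str]:
    """Flag courses scheduled in an earlier semester than one of their prerequisites."""
    by_name = {course.name: course for course in program}
    problems = []
    for course in program:
        for pre_name in course.prerequisites:
            pre = by_name.get(pre_name)
            # Prerequisites outside this program are ignored in this sketch.
            if pre is not None and pre.semester > course.semester:
                problems.append(
                    f"{course.name} (semester {course.semester}) requires "
                    f"{pre.name} (semester {pre.semester})"
                )
    return problems


def check_total_ects(program: List[Course]) -> List[str]:
    """Flag programs whose total number of ECTS exceeds the cap."""
    total = sum(course.ects for course in program)
    if total > MAX_ECTS:
        return [f"total of {total} ECTS exceeds {MAX_ECTS}"]
    return []


# Extensibility is modelled here as a plain list of check functions;
# a new rule is added by appending another function with the same signature.
CHECKS: List[Callable[[List[Course]], List[str]]] = [
    check_prerequisite_order,
    check_total_ects,
]


def find_inconsistencies(program: List[Course], checks=CHECKS) -> List[str]:
    """Run every registered check and collect the reported problems."""
    problems = []
    for check in checks:
        problems.extend(check(program))
    return problems
```

Calling `find_inconsistencies([Course("X", 5, 1, ["Y"]), Course("Y", 5, 2)])` reports one problem, since X is scheduled before its prerequisite Y. Keeping each rule as an independent function is one simple way to meet the "extensible" requirement; a real tool might instead load rules from configuration.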
-* It should allow to make different versions of the teaching programs, much in the same way as version control systems like GIT and subversion offer the possibility to make different "development branches" of a program's source code.
+A proof-of-concept implementation of a revision tool that supports the first two requirements above is currently being developed in the context of a PROJH402 project. The MFE student who selects this topic is expected to:
-* It should  allow an extensible means to check the modified program for inconsistencies. (For example, if course X has course Y
-===== Structural compression of relational and semantic web databases =====
-Recent research in database management systems at ULB has shown how to
-theoretically construct succinct (compressed) representations for
-relational databases and semantic web databases. The advantage of
-these succinct representations is that they allow querying directly
-*on the succinct representation*, without needing to consult the
-underlying database.
-The goal of this thesis is to study scalable algorithms for
-constructing the actual succinct representations. Some in-memory
-algorithms are already known, but given the large size of typical
-databases, distributed and out-of-memory alternatives need to be found.
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-===== Aspects of Text Analytics and Information Extraction =====
-Automatically extracting structured information from text is a task that has been pursued for decades. As a discipline, //Information Extraction// (IE) had its start with the [[http://acl.ldc.upenn.edu/C/C96/C96-1079.pdf|DARPA Message Understanding Conference in 1987]].  While early work in the area focused largely on military applications, recent changes have made information extraction increasingly important to an increasingly broad audience.  Trends such as the rise of social media have produced huge amounts of text data, while analytics platforms like Hadoop have at the same time made the analysis of this data more accessible to a broad range of users.  Since most analytics over text involves information extraction as a first step, IE is a very important part of
-data analysis in the enterprise today.
-Broadly speaking, there are two main schools of thought on the realization of IE: the //statistical// (machine-learning) methodology and the //rule-based// approach.  The first started with simple models, then progressed to approaches based on probabilistic graph models. Within the rule-based approach, most of the solutions build upon [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEEQFjAB&url=http%3A%2F%2Fwww.dfki.de%2F~neumann%2Fesslli04%2Freader%2Foverview%2FIJCAI99.pdf&ei=1yZIUdSZPMWHPa2rgagP&usg=AFQjCNFA6QYIt4yNR0oZRL4yjd--kev37A&sig2=nEILF_cNDk4JWiVDS5BXvg&bvm=bv.43828540,d.ZWU|cascaded finite-state transducers]].  Most systems in both categories were built for academic settings, where most users are highly trained computational linguists, where workloads cover only a small number of very well-defined tasks and data sets, and where extraction throughput is far less important than the accuracy of results.
-In practice, these existing tools suffer from a number of practical problems. For example, users need to have an intuitive understanding of machine learning or the ability to build and understand complex and highly interdependent rules. Determining why an extractor produced a given incorrect result
-is hence often deemed extremely difficult, which makes reuse of extractors across different data sets and applications impractical.  And extremely
-high CPU and memory requirements made extractors cost-prohibitive to deploy over large-scale data sets.
-In 2005, researchers at the IBM Almaden Research Center started work on a new system specifically geared for practical information extraction in the enterprise.  This effort led to [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEYQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&ei=gyhIUe-XPIexPJ-fgLAG&usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&sig2=LVQkus_67uSVlwK34BXZ8w&bvm=bv.43828540,d.ZWU|SystemT]], a rule-based IE system with an SQL-like declarative language named [[http://pic.dhe.ibm.com/infocenter/bigins/v2r0/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/aql_overview.html|AQL (Annotation Query Language)]].
-The declarative nature of AQL enables new kinds of tools for extractor
-development, and a cost-based optimizer for
-performance.
-The goal of this thesis is to study and compare both the
-traditional methods towards information extraction and the new
-AQL-based method proposed by SystemT, based on experimental
-evaluation of information extraction problems on the
-Web. Additional possible topics of study include (1) the
-implementation and optimization aspects of AQL, (2) the extension
-of AQL with probabilistic methods, or (3) the inference of AQL
-rules from examples.
-Interested? Contact [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-===== Models for programming Data Management in the Cloud =====
-Many say that "The Cloud" is the next computing platform on the
-Web. Unfortunately, "the cloud" has become a marketing buzzword with
-many different services offered, from the rental of virtual machines,
-to the rental of storage space, to specific compute platforms
-(e.g. MapReduce) that offer transparent parallelization.
-In this thesis, we are interested in the cloud from the point of view
-of data management. There is a recent trend in data management
-research to use logic programming rule-based languages to specify
-distributed applications, most notably on the web, as well as
-inference in the semantic web (see below for a list of
-references). The goal of this thesis is to study, compare, and where
-possible extend the current (logic-programming based) proposals for
-managing data in the cloud.
-  * References:
-       * http://boom.cs.berkeley.edu/
-       * http://p2.cs.berkeley.edu/index.php
-       * http://www.comlab.ox.ac.uk/files/3608/RR-10-21.pdf
-\\
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-  * Status: **already taken**
-===== Distributed Structural Indexes for RDF Data =====
-In an effort to enable people to share information in a
-structured form on the Web as easily as they can share unstructured
-HTML documents today, the World Wide Web Consortium (W3C for short) is
-calling for the creation of a Web of Linked Data. In the same way as
-one uses HTML and hyperlinks to publish and connect information on the
-Web of Documents, one uses the RDF data model and RDF links to publish
-and connect structured information on the Web of Linked Data. The
-advantage of RDF over HTML lies in its simplicity: all information is
-represented uniformly as triples of the form (subject, predicate,
-object). This allows one to represent both facts about entities (e.g.,
-(Tim Berners-Lee, age, 54)) and links between entities (e.g. (Tim
-Berners-Lee, author of, http://...))  in an easily
-machine-interpretable manner. This is much more difficult with HTML
-where there is little or no constraint on the way information is
-represented.
-Linked Data has the potential to turn the Web into one huge database
-with structured querying capabilities that vastly exceed the limited
-keyword search queries so common on the Web of Documents today.
-As a key component of efficient query answering in Linked Data Management systems, much research is focused on devising high-performance native RDF indexing data structures. One class of such indexes, called structural indexes, seems very promising in this respect. Currently, however, structural indexes for RDF are difficult to distribute across the web. Given the importance of distribution in web-scale data, the goal of this thesis is to investigate how structural RDF indexes can be used in a distributed query answering platform.
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
+  * Develop this prototype into a production-ready implementation.
+  * Implement the communication with the central ULB database.
+  * Implement the impact analysis concerning the course schedules.
+  * Interact with the administration of the Ecole Polytechnique to fine-tune the above requirements, test the implementation, and integrate remarks after testing.
+Contact : Stijn Vansummeren <stijn.vansummeren@ulb.ac.be>, Frédéric Robert <frrobert@ulb.ac.be>