Differences

This shows you the differences between two versions of the page.

--- teaching:mfe:is [2014/06/03 11:22]
svsummer [Master Thesis in Collaboration with DPI 24/7 Media Publishing]
+++ teaching:mfe:is [2015/04/13 13:47]
svsummer
@@ Line 1: / Line 1: @@
-====== MFE 2014-2015 : Web and Information Systems ======
+====== MFE 2015-2016 : Web and Information Systems ======
 ===== Introduction =====
@@ Line 24: / Line 24: @@
   * Contact : [[ezimanyi@ulb.ac.be|Esteban Zimanyi]]
-===== Master Thesis in Collaboration with DPI 24/7 Media Publishing =====
-The goal of the thesis is to set up a SaaS / PaaS solution for the deployment of the DPI 24/7 media publishing distribution in a Heroku-like style.
-During this master thesis you will not only carry out a theoretical and technological analysis of the problem of such a deployment but also implement a concrete solution for the DPI 24/7 distribution.
-From a technical point of view you will :
-  * Develop a service using Docker and Dokku for the on-demand deployment of instances of the DPI 24/7 distribution (full stack architecture)
-  * Run performance tests of the developed service
-  * Study the different options of the PaaS mode (full stack or elastic deployment)
-Second, you will analyze the different existing solutions for the orchestration of an elastic virtualization architecture.
-Technology used by the DPI 24/7 distribution : Linux, Varnish, Nginx, PHP-FPM, MySQL (in background Tomcat, Solr).
-Virtualization technology : Container virtualization and deployment with Dokku
-Interested? DPI 24/7 contact : [[ddu@audaxis.com|Dimitri Dujardin]]. Academic supervisor : [[svsummer@ulb.ac.be|Stijn Vansummeren]]
 ===== Automatic detection of name variations =====
@@ Line 154: / Line 135: @@
 Contact : Stijn Vansummeren <stijn.vansummeren@ulb.ac.be>, Frédéric Robert <frrobert@ulb.ac.be>
-===== Design and Development of a Comprehensive DICOM validation application =====
-Using the new XML machine-readable format of the DICOM standard (in the form of docbook documents), the architecture of software tools and services for the automatic extraction and utilization of the full content of the DICOM standard will be defined and the corresponding software solutions will be developed. A comprehensive DICOM validation application will also be developed as a pilot project using the previously created DICOM standard digital services.
-References: <http://dicom.nema.org/; http://www.oasis-open.org/docbook/>
-Requirements: XML, XSL, database, Java or Python or C++.
-Contacts : Arnaud Schenkel <arnaud.schenkel@ulb.ac.be>, David Wikler <david.wikler@ulb.ac.be>, Stijn Vansummeren <stijn.vansummeren@ulb.ac.be>
-===== Structural compression of relational and semantic web databases =====
-Stijn Vansummeren (WIT)
-Recent research in database management systems at ULB has shown how to
-theoretically construct succinct (compressed) representations for
-relational databases and semantic web databases. The advantage of
-these succinct representations is that they allow querying directly
-*on the succinct representation*, without needing to consult the
-underlying database.
-The goal of this thesis is to study scalable algorithms for
-constructing the actual succinct representations. Some in-memory
-algorithms are already known, but given the large size of typical
-databases, distributed and out-of-memory alternatives need to be found.
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-===== A contribution to Apache DRILL =====
-Google's research lab has produced a remarkable number of software
-systems for the analytics of Big Data:
-  * [[|Map/Reduce]] for offline, batch-oriented data analysis over arbitrary datasets
-  * [[http://googleresearch.blogspot.be/2009/06/large-scale-graph-computing-at-google.html|Pregel]] for offline analysis over graph-structured datasets
-  * [[http://research.google.com/pubs/pub36632.html|Dremel]] for on-line analysis over structured datasets
-For Map/Reduce and Pregel, the Apache Software Foundation has
-previously constructed open source implementations ([[http://hadoop.apache.org/|Hadoop]],
-[[https://giraph.apache.org/|Giraph]]). For Dremel, a project is
-currently underway to provide an Open Source implementation (known as
-[[http://incubator.apache.org/drill/index.html|Apache Drill]]).
-The goal of this thesis is to (1) study the current architecture of Apache
-Drill, (2) compare this with the state of the art in query processing
-for structured datasets; (3) contribute to the development of the
-Drill implementation.
-Students interested in this MFE are highly advised to follow the
-course [[http://cs.ulb.ac.be/public/teaching/infoh417|INFOH417
-Database Systems Architecture]] for a background on query processing
-in traditional database management systems.
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-===== Aspects of Text Analytics and Information Extraction =====
-Automatically extracting structured information from text is a task that has been pursued for decades. As a discipline, //Information Extraction// (IE) had its start with the [[http://acl.ldc.upenn.edu/C/C96/C96-1079.pdf|DARPA Message Understanding Conference in 1987]].  While early work in the area focused largely on military applications, recent changes have made information extraction increasingly important to an increasingly broad audience.  Trends such as the rise of social media have produced huge amounts of text data, while analytics platforms like Hadoop have at the same time made the analysis of this data more accessible to a broad range of users.  Since most analytics over text involves information extraction as a first step, IE is a very important part of
-data analysis in the enterprise today.
-Broadly speaking, there are two main schools of thought on the realization of IE: the //statistical// (machine-learning) methodology and the //rule-based// approach. The first started with simple models, then progressed to approaches based on probabilistic graphical models. Within the rule-based approach, most of the solutions build upon [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEEQFjAB&url=http%3A%2F%2Fwww.dfki.de%2F~neumann%2Fesslli04%2Freader%2Foverview%2FIJCAI99.pdf&ei=1yZIUdSZPMWHPa2rgagP&usg=AFQjCNFA6QYIt4yNR0oZRL4yjd--kev37A&sig2=nEILF_cNDk4JWiVDS5BXvg&bvm=bv.43828540,d.ZWU|cascaded finite-state transducers]]. Most systems in both categories were built for academic settings, where most users are highly trained computational linguists, where workloads cover only a small number of very well-defined tasks and data sets, and where extraction throughput is far less important than the accuracy of results.
-In practice, these existing tools suffer from a number of practical problems. For example, users need to have an intuitive understanding of machine learning or the ability to build and understand complex and highly interdependent rules. Determining why an extractor produced a given incorrect result
-is hence often deemed extremely difficult, which makes reuse of extractors across different data sets and applications impractical.  And extremely
-high CPU and memory requirements make extractors cost-prohibitive to deploy over large-scale data sets.
-In 2005, researchers at the IBM Almaden Research Center started work on a new system specifically geared for practical information extraction in the enterprise. This effort led to [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEYQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.179.356%26rep%3Drep1%26type%3Dpdf&ei=gyhIUe-XPIexPJ-fgLAG&usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&sig2=LVQkus_67uSVlwK34BXZ8w&bvm=bv.43828540,d.ZWU|SystemT]], a rule-based IE system with an SQL-like declarative language named [[http://pic.dhe.ibm.com/infocenter/bigins/v2r0/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/aql_overview.html|AQL (Annotation Query Language)]].
-The declarative nature of AQL enables new kinds of tools for extractor
-development, and a cost-based optimizer for
-performance.
-The goal of this thesis is to study and compare both the
-traditional methods towards information extraction and the new
-AQL-based method proposed by SystemT, based on experimental
-evaluation of information extraction problems on the
-Web. Additional possible topics of study include (1) the
-implementation and optimization aspects of AQL, (2) the extension
-of AQL with probabilistic methods, or (3) the inference of AQL
-rules from examples.
-Interested? Contact [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-===== Models for programming Data Management in the Cloud =====
-Many say that "The Cloud" is the next computing platform on the
-Web. Unfortunately, "the cloud" has become a marketing buzzword with
-many different services offered, from the rental of virtual machines,
-to the rental of storage space, to specific compute platforms
-(e.g. MapReduce) that offer transparent parallelization.
-In this thesis, we are interested in the cloud from the point of view
-of data management. There is a recent trend in data management
-research to use logic programming rule-based languages to specify
-distributed applications, most notably on the web, as well as
-inference in the semantic web (see below for a list of
-references). The goal of this thesis is to study, compare, and where
-possible extend the current (logic-programming based) proposals for
-managing data in the cloud.
-  * References:
-       * http://boom.cs.berkeley.edu/
-       * http://p2.cs.berkeley.edu/index.php
-       * http://www.comlab.ox.ac.uk/files/3608/RR-10-21.pdf
-\\
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
-  * Status: **already taken**
-===== Distributed Structural Indexes for RDF Data =====
-In an effort to enable people to share information in a
-structured form on the Web as easily as they can share unstructured
-HTML documents today, the World Wide Web Consortium (W3C for short) is
-calling for the creation of a Web of Linked Data. In the same way as
-one uses HTML and hyperlinks to publish and connect information on the
-Web of Documents, one uses the RDF data model and RDF links to publish
-and connect structured information on the Web of Linked Data. The
-advantage of RDF over HTML lies in its simplicity: all information is
-represented uniformly as triples of the form (subject, predicate,
-object). This allows one to represent both facts about entities (e.g.,
-(Tim Berners-Lee, age, 54)) and links between entities (e.g. (Tim
-Berners-Lee, author of, http://...))  in an easily
-machine-interpretable manner. This is much more difficult with HTML
-where there is little or no constraint on the way information is
-represented.
-Linked Data has the potential to turn the Web into one huge database
-with structured querying capabilities that vastly exceed the limited
-keyword search queries so common on the Web of Documents today.
-As a key component of efficient query answering in Linked Data Management systems, much research is focused on devising high-performance native RDF indexing data structures. One class of such indexes, called structural indexes, seems very promising in this respect. Currently, however, structural indexes for RDF are difficult to distribute across the web. Given the importance of distribution in web-scale data, the goal of this thesis is to investigate how structural RDF indexes can be used in a distributed query answering platform.
-  * Contact : [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]
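
As an illustration of the triple model described in the removed RDF section above, here is a minimal Python sketch of how facts and links share the uniform (subject, predicate, object) shape. The `Triple` type and the `match` helper are hypothetical names introduced only for this example; they are not part of any RDF library, and the object URL is left elided exactly as in the page text.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in triples:
        if subject is not None and t.subject != subject:
            continue
        if predicate is not None and t.predicate != predicate:
            continue
        if obj is not None and t.obj != obj:
            continue
        yield t


# Example data taken from the page text; the URL is elided in the source.
TRIPLES = [
    Triple("Tim Berners-Lee", "age", "54"),            # a fact about an entity
    Triple("Tim Berners-Lee", "author of", "http://..."),  # a link between entities
]

if __name__ == "__main__":
    for t in match(TRIPLES, subject="Tim Berners-Lee"):
        print(t)
```

The linear scan in `match` is only meant to show the data model; an index such as the structural indexes mentioned in the proposal would replace it in any realistic setting.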