Differences

teaching:mfe:is [2015/09/14 13:41]
svsummer [Engineering a runtime system and compiler for AQL]
teaching:mfe:is [2016/03/16 10:27]
tcalders
Line 59: Line 59:
 The objective of this master thesis is to apply the same methodology to engineer a compiler that translates (fragments of) SPARQL (the standard query language for querying RDF data on the semantic web) into machine code. The overall approach should follow the methodology used by HyPer and Legobase:
   * Use of a high-level language to construct the compiler (Scala, http://scala-lang.org/)
-  * Use of Latent Modular Staging (LMS for short) for generating low-level portable assembly code at runtime (http://scala-lms.github.io/)
+  * Use of Lightweight Modular Staging (LMS for short) for generating low-level portable assembly code at runtime (http://scala-lms.github.io/)
   * Use of LLVM (http://llvm.org/) as a portable assembly code and corresponding translator to machine code.
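
To make the intended compilation pipeline a bit more concrete, the toy Scala sketch below contrasts interpreting a single triple pattern with staging it, i.e. generating a specialized scan with the pattern's constants inlined. It does not use the actual scala-lms API; all types and names are illustrative, and the emitted text merely stands in for the LLVM IR a real implementation would produce at runtime.

<code scala>
// Toy sketch only: plain Scala, no scala-lms, illustrative names throughout.
object StagingSketch {

  // A SPARQL-like triple pattern: None = variable, Some(v) = constant.
  final case class TriplePattern(s: Option[String], p: Option[String], o: Option[String])
  final case class Triple(s: String, p: String, o: String)

  // Interpreted evaluation: the pattern is re-inspected for every triple.
  def interpret(pat: TriplePattern, data: Seq[Triple]): Seq[Triple] =
    data.filter { t =>
      pat.s.forall(_ == t.s) && pat.p.forall(_ == t.p) && pat.o.forall(_ == t.o)
    }

  // Staged evaluation: generate a specialized scan once, with the constants
  // baked in. Here we emit C-like text; LMS + LLVM would instead build an
  // executable kernel at runtime.
  def stage(pat: TriplePattern): String = {
    val checks = Seq(
      pat.s.map(c => s"""strcmp(t->s, "$c") == 0"""),
      pat.p.map(c => s"""strcmp(t->p, "$c") == 0"""),
      pat.o.map(c => s"""strcmp(t->o, "$c") == 0""")
    ).flatten
    val cond = if (checks.isEmpty) "1" else checks.mkString(" && ")
    s"""for (triple *t = begin; t != end; ++t) {
       |  if ($cond) emit(t);
       |}""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(stage(TriplePattern(None, Some("rdf:type"), Some("foaf:Person"))))
}
</code>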
  
Line 68: Line 68:
 **Deliverables** of the master thesis project:
   - An overview of the state of the art in query-to-machine-code compilation.
-  - A description of latent modular staging and how it can be used to construct machine-code compilers.
+  - A description of lightweight modular staging and how it can be used to construct machine-code compilers.
   - The SPARQL compiler (software artifact)
   - A benchmark set of SPARQL queries and associated data sets for the experimental validation
Line 114: Line 114:
 The objective of this master thesis is to design and engineer a runtime system and compiler for (a fragment of) AQL based on finite state automata. Ideally, to obtain the best performance, these automata should be compiled into machine code when executed. For this compilation, the following technologies should be used:
   * A high-level language to construct the compiler (Scala, http://scala-lang.org/)
-  * Use of Latent Modular Staging (LMS for short) for generating low-level portable assembly from the automata at runtime (http://scala-lms.github.io/)
+  * Use of Lightweight Modular Staging (LMS for short) for generating low-level portable assembly from the automata at runtime (http://scala-lms.github.io/)
   * Use of LLVM (http://llvm.org/) as a portable assembly code and corresponding translator to machine code.
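
As a rough illustration of why compiling the automata pays off, the Scala sketch below specializes a small DFA into a dense transition table before matching, so that the inner loop performs only array indexing. This is merely a stand-in for the real goal of the thesis (emitting LLVM IR via LMS at runtime); the DFA encoding and all names are made up for the example.

<code scala>
// Toy sketch only: specialization by table construction, not real code generation.
object AutomatonSketch {

  final case class Dfa(numStates: Int,
                       start: Int,
                       accepting: Set[Int],
                       delta: Map[(Int, Char), Int])

  // "Compile" the DFA once into a dense (state x byte) transition table;
  // -1 means "no transition".  The returned matcher then does no hash-map
  // lookups in its inner loop.
  def compile(dfa: Dfa): String => Boolean = {
    val table = Array.fill(dfa.numStates, 256)(-1)
    for (((from, c), to) <- dfa.delta) table(from)(c.toInt & 0xff) = to

    (input: String) => {
      var state = dfa.start
      var i = 0
      while (state >= 0 && i < input.length) {
        state = table(state)(input.charAt(i).toInt & 0xff)
        i += 1
      }
      state >= 0 && dfa.accepting.contains(state)
    }
  }

  def main(args: Array[String]): Unit = {
    // DFA for the regex ab*: 0 --a--> 1, 1 --b--> 1, accepting state 1.
    val matcher = compile(Dfa(numStates = 2, start = 0, accepting = Set(1),
                              delta = Map((0, 'a') -> 1, (1, 'b') -> 1)))
    println(matcher("abbb")) // true
    println(matcher("ba"))   // false
  }
}
</code>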
  
Line 211: Line 211:
  
 **Status?** Already taken
- 
- 
-===== Semi-Supervised Entity Resolution ===== 
-Toon Calders (WIT) 
- 
-In the big data era, large collections of data have become available for analysis. These data, however, often come from different data sources and may contain errors. Consider for instance a company that wants to combine data from marketing and sales in order to see to what extent the targeted marketing campaign has been successful in attracting new customers. A key operation in this analysis is the identification of which records from marketing and sales refer to the same person. In this way it can be determined which targeted potential customers were already clients, and which of the contacted non-clients reacted to the marketing campaign. Furthermore, the records of marketing are most likely far less reliable, and formatted differently, than those of sales; for instance, the marketing records usually won't contain a client number. The process of linking these sources together and identifying which records refer to the same person is known as entity resolution. Most existing approaches for entity resolution either use a fixed set of pre-determined rules, which may be sub-optimal for the problem at hand, or are based on learning classifiers, which requires large amounts of labelled data.
- 
-In this thesis you will study the possibility of entity resolution in the absence of large collections of labelled data, by exploiting redundancies in the features with which records can be compared, in combination with an active learning approach in which volunteers can be asked to label some examples on the fly.
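
As a very rough illustration of these two ingredients, the Scala sketch below compares record pairs on several (redundant) similarity features and selects the pair a hypothetical current model is least certain about, i.e. the pair that would be shown to a volunteer first. The record fields, the similarity measure and the scoring are all invented; this is not a complete entity-resolution system.

<code scala>
// Toy sketch only: made-up records, crude similarity, trivial "model".
object ActiveErSketch {

  final case class Record(name: String, street: String, city: String)

  // Crude character-overlap similarity (stand-in for e.g. Jaro-Winkler).
  def sim(a: String, b: String): Double = {
    val (x, y) = (a.toLowerCase.toSet, b.toLowerCase.toSet)
    if (x.isEmpty && y.isEmpty) 1.0
    else x.intersect(y).size.toDouble / x.union(y).size
  }

  // One feature per field; the redundancy across features is what the
  // thesis would exploit to get by with few labelled pairs.
  def features(a: Record, b: Record): Vector[Double] =
    Vector(sim(a.name, b.name), sim(a.street, b.street), sim(a.city, b.city))

  // Stand-in for the current model's match probability: the mean feature.
  def matchScore(f: Vector[Double]): Double = f.sum / f.size

  // Active-learning step (uncertainty sampling): the pair whose score is
  // closest to 0.5 is presented to a volunteer for labelling first.
  def mostUncertain(pairs: Seq[(Record, Record)]): (Record, Record) =
    pairs.minBy { case (a, b) => math.abs(matchScore(features(a, b)) - 0.5) }

  def main(args: Array[String]): Unit = {
    val m  = Record("J. Smith", "Av. Roosevelt 50", "Brussels")        // marketing
    val s1 = Record("John Smith", "Avenue Roosevelt 50", "Bruxelles")  // sales
    val s2 = Record("Ann Jones", "Rue Haute 12", "Brussels")
    println(mostUncertain(Seq((m, s1), (m, s2))))
  }
}
</code>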
-\\ 
-**Interested?** Contact [[toon.calders@ulb.ac.be|Toon Calders]]
- 
- 
-===== Using Non-Redundant Sequential Pattern Mining for Process Discovery ===== 
-Toon Calders (WIT) 
- 
-Process mining is the act of deriving a process model, such as for instance a Petri net or a BPMN model, from an event log. An example of such a log could be all events that an insurance company undertakes for pricing a car insurance based on a request from a client. Events could include looking up whether the client has been blacklisted, checking his or her history w.r.t. car accidents, estimating the risk based on car type, age and gender of the requester, making a proposal, soliciting the agreement of the client and, in case of disagreement, contacting a manager to approve a special offer. Several such traces for different clients may allow the automatic reconstruction of a process model. There exist several approaches for process mining, including footprint-based algorithms such as Alpha and Alpha+, heuristic algorithms such as the heuristics miner, genetic algorithms, region-based methods, etc. The goal of this thesis is to explore the possibility of using current state-of-the-art data mining algorithms for sequence and episode mining as the basis for a new and improved version of the alpha algorithm.
- 
-Van der Aalst, W. M. (2011). Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer. 
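
As a small illustration of the starting point of footprint-based discovery, the Scala sketch below derives the directly-follows relation and the footprint (causality, parallelism, unrelatedness) from a toy event log, roughly as in the first step of the Alpha algorithm; the activity names and the log are invented for the example.

<code scala>
// Toy sketch only: tiny hand-written log, no real process-mining library.
object FootprintSketch {

  type Trace = Seq[String]

  // a > b : a is directly followed by b in at least one trace of the log.
  def directlyFollows(log: Seq[Trace]): Set[(String, String)] =
    log.flatMap(t => t.zip(t.drop(1))).toSet

  // Footprint relations: "->" causality, "<-" reverse, "||" parallel, "#" unrelated.
  def footprint(log: Seq[Trace]): Map[(String, String), String] = {
    val df = directlyFollows(log)
    val acts = log.flatten.distinct
    (for (a <- acts; b <- acts) yield {
      val rel =
        if (df((a, b)) && !df((b, a))) "->"
        else if (!df((a, b)) && df((b, a))) "<-"
        else if (df((a, b)) && df((b, a))) "||"
        else "#"
      (a, b) -> rel
    }).toMap
  }

  def main(args: Array[String]): Unit = {
    val log: Seq[Trace] = Seq(
      Seq("register", "check", "decide"),
      Seq("register", "decide", "check"))
    footprint(log).toSeq.sortBy(_._1.toString).foreach(println)
  }
}
</code>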
- 
- 
-**Interested?** Contact [[toon.calders@ulb.ac.be|Toon Calders]]
- 
-===== Mining patterns for compression ===== 
-Toon Calders (WIT) 
- 
-Data mining is the research discipline that studies the extraction of information from large amounts of data. One of the typical data mining tasks is pattern mining, where we try to find regularities that occur frequently in a dataset. The prototypical example is that of a supermarket storing, for every customer visiting the supermarket, the transaction, that is, the set of items that were bought by that customer. The frequent itemset mining problem then is to detect which combinations of products were sold together more often than a given threshold. One of the major problems of pattern mining algorithms, however, is the enormous number of redundant patterns they generate; for instance, very popular items, such as toilet paper, tend to appear in many frequent combinations purely due to chance. In order to deal with this problem, techniques based upon compression and the minimum description length (MDL) principle were proposed to reduce the number of patterns. The rationale behind the MDL principle is that a set of patterns that describes well what is happening in the dataset should allow for good compression. For a collection of patterns, the quality is measured as the description length of the patterns plus the size of the data compressed with these patterns. For instance, if the pattern {bread, milk, butter} has a high frequency, we could opt to replace every occurrence of this pattern by a special code, effectively reducing the encoding length of the data. Surprisingly, however, the MDL principle has so far only been used to rule out redundant patterns, and it has not yet been researched how well the discovered patterns actually compress the data as compared to compression algorithms such as Lempel–Ziv–Welch.
-Hence, in this highly research-oriented graduation project, two research questions are central: (1) how well do non-redundant pattern sets based on MDL compress the data, and (2) can we extract useful patterns from existing compression algorithms?
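
The back-of-the-envelope Scala sketch below illustrates the MDL trade-off described above with deliberately simplistic code lengths (one unit per item or code word): adding the pattern {bread, milk, butter} to the code table only pays off if the savings in the covered transactions exceed the cost of storing the pattern itself. A real MDL-based miner such as Krimp uses proper prefix codes; everything here is illustrative.

<code scala>
// Toy sketch only: unit code lengths, tiny hand-written database.
object MdlSketch {

  type Itemset = Set[String]

  // Naive encoding: one code word per item in every transaction.
  def baselineLength(db: Seq[Itemset]): Int = db.map(_.size).sum

  // Encoding where one pattern gets its own code word: every covered
  // transaction replaces |pattern| items by a single code word, and the
  // pattern itself is written down once in the code table.
  def lengthWithPattern(db: Seq[Itemset], pattern: Itemset): Int = {
    val covered = db.count(t => pattern.subsetOf(t))
    baselineLength(db) - covered * (pattern.size - 1) + pattern.size
  }

  def main(args: Array[String]): Unit = {
    val db: Seq[Itemset] = Seq(
      Set("bread", "milk", "butter", "beer"),
      Set("bread", "milk", "butter"),
      Set("bread", "milk"),
      Set("toilet paper"))
    val pattern: Itemset = Set("bread", "milk", "butter")
    println(baselineLength(db))             // 10 code words without the pattern
    println(lengthWithPattern(db, pattern)) // 9: save 2*2 in the data, pay 3 in the code table
  }
}
</code>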
- 
-**Interested?** Contact [[toon.calders@ulb.ac.be|Toon Calders]]
- 
  
  
 