===== Course objective =====
The course PROJ-H-402 is managed by Dr. Mauro Birattari. Please refer to the course description page http://iridia.ulb.ac.be/proj-h-402/index.php/Main_Page for the rules concerning the project. What follows is a list of project proposals supervised by academic members of the WIT laboratory.
===== Projects in Mobility Databases =====
-==== Principles of Database Management Architectures in Managed Virtual Environments ====
-With the gaining popularity of Big Data, many data processing engines
-are implemented in a managed virtual environment such as the Java
-Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill,
-...). While this improves the portability of the engine, the tradeoffs
-and implementation principles w.r.t. traditional C++ implementations
-are sometimes less understood.
 +Mobility databases (MOD) are database systems that can store and manage moving object geospatial trajectory data. A moving object is an object that changes its location over time (e.g., a car driving on the road network). Using a variety of sensors, the location tracks of moving objects can be recorded in digital formats. A MOD, then, helps to store and query such data. A couple of prototype systems have been proposed by research groups. Yet a mainstream system is still missing. By mainstream we mean that the development builds on widely accepted tools that are actively being maintained and developed. A mainstream system would exploit the functionality of these tools, and would maximize the reuse of their ecosystems. As a result, it becomes closer to end users, and more easily adopted in the industry.
 +Towards filling this gap, our group is building the [[https://github.com/MobilityDB/MobilityDB|MobilityDB]] system. It builds on [[https://postgis.net/|PostGIS]], which is a spatial database extension of [[https://www.postgresql.org/|PostgreSQL]]. MobilityDB extends the type system of PostgreSQL and PostGIS with abstract data types (ADTs) for representing moving object data. It defines, for instance, the tgeompoint type for representing a time-dependent geometry point. MobilityDB types are well integrated into the platform, to achieve maximal reusability, hence a mainstream development. For instance, the tgeompoint type builds on the PostGIS geometry(point) type. Similarly, MobilityDB builds on the existing operations, indexing, and optimization framework.
-The objective in this project is to develop some basic functionalities
-of a database storage engine (Linked files, BTree, Extensible Hash
-table, basic external-memory sorting) in a managed virtual machine
-(i.e., the Java Virtual Machine or the .NET Common Language
-Runtime), and compare this with a C++-based implementation both on (1)
-ease of implementation and (2) execution efficiency. In order to
-develop the managed virtual machine implementation, the interested
-student will need to research the best practices that are used in the
-above-mentioned projects to gain maximum execution speed (e.g., use of
-the java.lang.unsafe feature, memory-mapped files, ...).
 +MobilityDB supports SQL as a query interface. Currently it is quite rich in terms of types and functions. It is incubated as a community project in [[https://www.osgeo.org/projects/mobilitydb/|OSGeo]], which certifies high technical quality.
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-**Status**: available
 +The following project ideas contribute to different parts of MobilityDB. They all constitute innovative development, mixing both research and development. They will hence help develop the student's skills in:
 +  * Understanding the theory and the implementation of moving object databases.
 +  * Understanding the architecture of extensible databases, in this case PostgreSQL.
 +  * Writing open source software.
-==== Development of a compiler and runtime engine for AQL ====
-In 2005, researchers at the IBM Almaden Research Center developed a
-new system specifically geared for practical information extraction in
-the enterprise. This effort led to [[https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CEYQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.ei=gyhIUe-XPIexPJ-fgLAG&usg=AFQjCNHgkbcREbd6bCA26BVf0FuIZ9n7Sg&sig2=LVQkus_67uSVlwK34BXZ8w&bvm=bv.43828540,d.ZWU|SystemT]], a rule-based IE system with an SQL-like declarative language named [[http://pic.dhe.ibm.com/infocenter/bigins/v2r0/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/aql_overview.html|AQL (Annotation Query Language)]].
-The declarative nature of AQL enables new kinds of tools for extractor
-development, and a cost-based optimizer for
-performance.
-The goal of this project is to develop an open-source compiler and
-runtime environment for (a simplified version of) AQL.
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-**Status**: available
 +===== Visualization of Moving Objects on the Web =====
 +There are several open source platforms for publishing spatial data and interactive mapping applications to the web. Two popular ones are [[https://mapserver.org/|MapServer]] and [[http://geoserver.org/|GeoServer]], which are written, respectively, in C and in Java.
 +Newer platforms exist, such as [[https://kepler.gl/|kepler.gl]], which were designed for handling large-scale data sets.
 +However, these platforms are used for static spatial data and are unable to cope with moving objects. The goal of the project is to extend one of these platforms with spatio-temporal data types in order to be able to display animated maps.
 +{{:teaching:trips2.gif?direct|}}
 +Animated visualization of car trajectories
-==== Development of a distributed simulation algorithm ====
 +**Status**: taken
 +===== Implementing TSBS on MobilityDB =====
-Simulation and Bisimulation are fundamental notions in computer
-science. They underlie many formal verification algorithms, and have
-recently been applied to the construction of so-called structural
-indexes, which are novel index data structures for relational databases
-and the Semantic Web. Essentially, a (bi)simulation is a relation on
-the nodes of a graph. Unfortunately, however, while efficient
-main-memory algorithms for computing whether two nodes are similar
-exist, these algorithms fail when the input graphs are too large to
-fit in main memory.
 +The Time Series Benchmark Suite ([[https://github.com/timescale/tsbs|TSBS]]) is a collection of Go programs that are used to generate datasets and then benchmark read and write performance of various time series databases. This benchmark has been developed by [[https://www.timescale.com/|TimescaleDB]], which is a time series extension of PostgreSQL.
-The objective of this project is to implement a recently proposed
-algorithm for computing simulation in a distributed setting, and
-provide a preliminary performance evaluation of this implementation.
 +A significant addition of TimescaleDB to PostgreSQL is the [[https://blog.timescale.com/blog/simplified-time-series-analytics-using-the-time_bucket-function/|time_bucket]] function. This function allows partitioning the time line into user-defined interval units that are used for aggregating data.
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)
-**Status**: available
 +The project consists in implementing a multidimensional generalization of the time_bucket function that allows the user to partition the spatial and/or temporal domain of a table in units (or tiles) that can be used for aggregating data. Then, the project consists of performing a benchmark comparison of TimescaleDB and MobilityDB.
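The bucketing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TimescaleDB or MobilityDB implementation: the bucket origin ''EPOCH'', the helper ''space_time_tile'' and its parameters ''cell'' and ''width'' are hypothetical names, and coordinates are assumed to be planar.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed bucket origin; a real system may align buckets differently.
EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def time_bucket(width, ts, origin=EPOCH):
    """Return the start of the fixed-width bucket that contains ts."""
    n = (ts - origin) // width  # floor division, also valid before the origin
    return origin + n * width


def space_time_tile(x, y, ts, cell, width, origin=EPOCH):
    """Hypothetical multidimensional variant: (column, row, bucket start)."""
    return (int(x // cell), int(y // cell), time_bucket(width, ts, origin))


def count_per_tile(points, cell, width):
    """Aggregate (x, y, ts) observations by counting them per tile."""
    return Counter(space_time_tile(x, y, ts, cell, width) for x, y, ts in points)
```

Any aggregate can replace the count; the point is that every observation maps to exactly one tile, so grouping by tile partitions the data.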
 +===== Distributed Moving Object Database on Amazon AWS =====
 +A distributed database is an architecture in which multiple database instances on different machines are integrated in order to form a single database server. Both the data and the queries are then distributed over these database instances. This architecture is effective in deploying big databases on a cloud platform.
 +MobilityDB is engineered as an extension of PostgreSQL. AWS supports PostgreSQL databases in [[https://aws.amazon.com/rds/postgresql/|Amazon RDS]] for PostgreSQL and in [[https://aws.amazon.com/rds/aurora/postgresql-features/|Amazon Aurora]]. The goal of this project is to integrate MobilityDB with these products. The key outcomes are a comprehensive assessment of which parts of the MOD API can and cannot be distributed, and an assessment of the performance gain. These outcomes should serve as a base for a thesis project to achieve effective integration.
-==== Development of a Personal Scientific Digital Library Management System ====
-In this project, the student is asked to construct a software system to help manage large collections of scientific papers in digital form. Specifically, the system must be able to:
-  - Scan a given filesystem location for given filetypes (PDFs, EPUB, ...) containing scientific articles.
-  - Extract the metadata from each identified file. Here, the metadata includes the title of the article, its authors, the publishing venue, the publisher, the year of publication, the article's abstract ... The development of an intelligent way to retrieve this metadata is required. This could be done, for example, by a combination of parsing the file, contacting the internet repositories of known publishers (ACM, Springer, Elsevier), etc. to retrieve the data.
-  - Offer search capabilities, in order to allow a user to find all indexed articles matching certain criteria (title, author, ...)
-  - Offer archiving capabilities
-Use of semantic web technologies (RDF, SPARQL, ...) to store and search the metadata is encouraged.
 +===== Distributed Moving Object Database on MS Azure =====
 +A distributed database is an architecture in which multiple database instances on different machines are integrated in order to form a single database server. Both the data and the queries are then distributed over these database instances. This architecture is effective in deploying big databases on a cloud platform.
 +MobilityDB is engineered as an extension of PostgreSQL. MS Azure supports distributed PostgreSQL databases using [[https://www.citusdata.com/|Citus]]. We have made successful tests for integrating MobilityDB and Citus on a local cluster. The goal of this project is to repeat this work on MS Azure, that is, to integrate MobilityDB with these products. The key outcomes are a comprehensive assessment of which parts of the MOD API can and cannot be distributed, and an assessment of the performance gain. These outcomes should serve as a base for a thesis project to achieve effective integration.
 +===== Map-matching as a Service ===== 
 +GPS location tracks typically contain errors, as the GPS points will normally be some meters away from the true position. If we know that the movement happened on a street network, e.g., a bus or a car, then we can correct this by putting the points back on the street. Fortunately, there are algorithms for this, called map-matching. There are also a handful of open source systems that do map-matching. It remains, however, difficult for end users to use them, because they involve non-trivial installation and configuration effort. Preparing the base map that will be used in the matching is also an issue for users.
 +Original trajectory 
 +Map-matched trajectory 
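To make the correction step concrete, here is a minimal geometric sketch in Python that snaps each GPS point to the closest street segment. It is a deliberately naive baseline and not what the systems listed in this section implement: practical map-matchers also use the sequence of points and the network topology. The function names are hypothetical and coordinates are assumed to be planar.

```python
import math


def project_on_segment(p, a, b):
    """Closest point to p on the segment from a to b (planar coordinates)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    if length2 == 0.0:  # degenerate segment: a and b coincide
        return (float(ax), float(ay))
    # Parameter of the orthogonal projection, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length2
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_to_network(p, segments):
    """Return the point on any street segment that is closest to p."""
    candidates = [project_on_segment(p, a, b) for a, b in segments]
    return min(candidates, key=lambda q: math.dist(p, q))


def snap_track(track, segments):
    """Snap every point of a GPS track independently."""
    return [snap_to_network(p, segments) for p in track]
```

Snapping points independently can make a track jump between parallel streets, which is one reason why sequence-aware algorithms are preferred in practice.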
 +The goal of this project is to build an architecture for a map-matching service. The challenges are that the GPS data arrives in different formats, and that map-matching is a time-consuming algorithm. This architecture should thus allow different input formats, and should be able to automatically scale according to the request rate. Another key outcome of this project is to compare the existing map-matching implementations, and to discuss their suitability for real-world problems.
 +  * [[https://github.com/bmwcarit/barefoot|Barefoot]]
 +  * [[https://valhalla.readthedocs.io/en/latest/api/map-matching/api-reference/|Valhalla Map Matching API]]
 +  * [[https://github.com/graphhopper/map-matching|GraphHopper]]
 +  * [[https://github.com/cyang-kth/fmm|Fast Map Matching]]
 +===== Geospatial Trajectory Data Cleaning ===== 
 +Data cleaning is an essential preprocessing step for analysing the data and extracting meaningful insights. Real data will typically include outliers, inconsistencies, missing data, repeated transactions possibly with different keys, and other kinds of acquisition errors. In geospatial trajectory data, there are even more sources of error, such as GPS inaccuracies.
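As an illustration of one simple cleaning rule, the Python sketch below drops points that would require an implausible speed, as well as points with duplicate or out-of-order timestamps. It is one possible heuristic among many, not a method prescribed by this project. The function name and the ''max_speed'' threshold are assumptions, and the input is assumed to be in planar metres and seconds.

```python
import math


def drop_speed_outliers(points, max_speed):
    """Keep the points reachable from the last kept point at <= max_speed.

    points: list of (t_seconds, x, y) in acquisition order.
    """
    cleaned = []
    for t, x, y in points:
        if cleaned:
            t0, x0, y0 = cleaned[-1]
            dt = t - t0
            if dt <= 0:
                # Duplicate or out-of-order timestamp: skip the point.
                continue
            if math.hypot(x - x0, y - y0) / dt > max_speed:
                # Implausible jump, most likely a GPS error.
                continue
        cleaned.append((t, x, y))
    return cleaned
```

The rule trusts the first point of the track; if that point is itself an outlier, every later point is compared against a wrong reference, which is the kind of limitation a survey of cleaning methods would have to discuss.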
 +The goal of this project is to survey the state of the art in geospatial trajectory data cleaning, both model-based and machine learning. The work also includes prototyping and empirically evaluating a selection of these methods in the MobilityDB system, and on different real datasets. These outcomes should serve as a base for a thesis project to enhance geospatial trajectory data cleaning. 
 +===== Geospatial Trajectory Similarity Measure ===== 
 +One of the main functions for a wide range of application domains is to measure the similarity between two moving objects' trajectories. This is desirable for similarity-based retrieval, classification, clustering and other querying and mining tasks over moving objects' data. The existing movement similarity measures can be classified into two classes: (1) spatial similarity that focuses on finding trajectories with similar geometric shapes, ignoring the temporal dimension; and (2) spatio-temporal similarity that takes into account both the spatial and the temporal dimensions of movement data.
 +The goal of this project is to survey and to prototype in MobilityDB the state-of-the-art methods in trajectory similarity. Since it is a complex problem, these outcomes should serve as a base for a thesis project to propose effective and efficient trajectory similarity measures.
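The two classes can be contrasted with a small Python sketch. The discrete Fréchet distance is used here as an example of a purely spatial measure, and the mean distance at shared timestamps as a very simple spatio-temporal one. Both are illustrative choices, not the measures this project commits to, and ''synchronized_distance'' is a hypothetical helper name.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences (shape only)."""
    n, m = len(p), len(q)
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                d[i][j] = cost
            elif i == 0:
                d[i][j] = max(d[0][j - 1], cost)
            elif j == 0:
                d[i][j] = max(d[i - 1][0], cost)
            else:
                best = min(d[i - 1][j], d[i - 1][j - 1], d[i][j - 1])
                d[i][j] = max(best, cost)
    return d[-1][-1]


def synchronized_distance(a, b):
    """Mean distance over the timestamps present in both trajectories.

    a, b: dicts mapping a timestamp to an (x, y) position.
    """
    common = sorted(a.keys() & b.keys())
    if not common:
        raise ValueError("the trajectories share no timestamp")
    return sum(math.dist(a[t], b[t]) for t in common) / len(common)
```

Two vehicles that drive the same route at different times have a small Fréchet distance but can have a large synchronized distance, which is exactly the difference between the two classes.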
 +===== Spatiotemporal k-Nearest Neighbour (kNN) Queries ===== 
 +An example of continuous kNN is when the GPS device of the vehicle initiates a query to find the three closest gas stations to the vehicle at any time instant during its trip from source to destination. According to the location of the vehicle, the set of three nearest gas stations can change. The result is thus a set of intervals, where every interval is associated with a set of three gas stations. The challenge in this type of query is to find an efficient incremental way of evaluation.
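The shape of the result can be seen in the following Python sketch, which evaluates the query naively at every sampled position of the trip and merges consecutive samples that have the same answer. It is a baseline for illustration only, not the efficient incremental evaluation the project is about, and all names in it are hypothetical.

```python
import math


def continuous_knn(trip, stations, k):
    """Naive continuous kNN over a sampled trip.

    trip: list of (t, (x, y)) sorted by t.
    stations: dict mapping a station name to its (x, y) position.
    Returns a list of (t_start, t_end, frozenset of the k nearest names).
    """
    result = []
    for t, pos in trip:
        ranked = sorted(stations, key=lambda s: math.dist(pos, stations[s]))
        nearest = frozenset(ranked[:k])
        if result and result[-1][2] == nearest:
            # Same answer as at the previous sample: extend the interval.
            result[-1] = (result[-1][0], t, nearest)
        else:
            result.append((t, t, nearest))
    return result
```

The sketch recomputes all distances at every sample and only detects changes at sampled instants, whereas the true change point lies somewhere between two samples.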
 +The goal of the project is to survey the state of the art in continuous kNN queries, and to prototype selected methods in MobilityDB. Since it is a complex problem, these outcomes should serve as a base for a more elaborate thesis project.
 +===== K-D-Tree Indexes for MobilityDB ===== 
 +Indexes are essential in databases for quickly locating data without having to search every row in a table every time a database table is accessed. Thus, an index is an auxiliary data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index. PostgreSQL provides [[https://habr.com/ru/company/postgrespro/blog/441962/|multiple types of indexes]] for various data types.
 +In MobilityDB, two types of indexes have been implemented, namely, [[https://habr.com/en/company/postgrespro/blog/444742/|GiST]] and [[https://habr.com/ru/company/postgrespro/blog/446624/|SP-GiST]]. More precisely, in PostgreSQL, these types of indexes are frameworks for developing multiple types of indexes. Concerning SP-GiST indexes, in MobilityDB we have developed 4-dimensional quad-trees where the dimensions are X, Y, and possibly Z for the spatial dimension and T for the time dimension. An alternative approach would be to use [[https://en.wikipedia.org/wiki/K-d_tree|K-D trees]]. K-D trees can be implemented in PostgreSQL using the SP-GiST framework and an example [[https://github.com/postgres/postgres/blob/master/src/backend/access/spgist/spgkdtreeproc.c|implementation]] for simple [[https://www.postgresql.org/docs/current/datatype-geometric.html|geometric types]] exists. The goal of the project is to implement K-D tree indexes for MobilityDB and perform a benchmark comparison between K-D trees and the existing 4-dimensional quad-trees.
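For readers unfamiliar with the data structure, here is a small in-memory K-D tree in Python over points such as (x, y, t). It only illustrates how the splitting dimension cycles with the depth and how a range query prunes subtrees. It is unrelated to the SP-GiST API, and the names ''Node'', ''build'' and ''range_search'' are chosen for this sketch.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class Node(NamedTuple):
    point: Point             # splitting point stored at this node
    axis: int                # dimension used to split at this node
    left: Optional["Node"]   # points with coordinate <= point[axis]
    right: Optional["Node"]  # points with coordinate >= point[axis]


def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Build a K-D tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def range_search(node, lo, hi, out=None):
    """Collect all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    split = node.point[node.axis]
    if lo[node.axis] <= split:  # the left subtree may contain matches
        range_search(node.left, lo, hi, out)
    if split <= hi[node.axis]:  # the right subtree may contain matches
        range_search(node.right, lo, hi, out)
    return out
```

Unlike a quad-tree node, which splits all dimensions at once, each K-D tree node splits a single dimension, which is the main structural difference a benchmark between the two would exercise.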
 +===== VODKA Indexes for MobilityDB ===== 
 +MobilityDB provides [[https://habr.com/en/company/postgrespro/blog/444742/|GiST]] and [[https://habr.com/ru/company/postgrespro/blog/446624/|SP-GiST]] indexes for temporal types. These indexes are based on bounding boxes, that is, the nodes of the index tree store a bounding box that keeps the minimum and maximum values of each of the dimensions, where X, Y, Z (if available) are for the spatial dimension and T for the temporal dimension. The reason for this is that a temporal type (for example, a moving point representing the movement of a vehicle) can have thousands of timestamped points, and keeping all these points indexed for each vehicle in a table is very inefficient. By keeping only the bounding box, it is possible to quickly filter the rows in a table, and then a more detailed analysis can be made for those rows selected by the index.
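The filter step can be illustrated with a short Python sketch that compares one bounding box for a whole trajectory with one bounding box per segment. It is a toy model of the idea only, not MobilityDB's index code, and the helper names are hypothetical.

```python
def bbox(points):
    """Bounding box of (x, y, t) points as (minimums, maximums)."""
    return tuple(map(min, zip(*points))), tuple(map(max, zip(*points)))


def overlaps(a, b):
    """True if two boxes intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))


def segment_boxes(points):
    """One bounding box per pair of consecutive points."""
    return [bbox(points[i:i + 2]) for i in range(len(points) - 1)]


# An L-shaped trip: east along y = 0, then north along x = 10.
trip = [(0, 0, 0), (10, 0, 10), (10, 10, 20)]
# A query box lying in the empty corner of the L.
query = ((1, 5, 0), (4, 9, 20))

whole_hit = overlaps(bbox(trip), query)
segment_hit = any(overlaps(box, query) for box in segment_boxes(trip))
```

Here ''whole_hit'' is true although the trip never enters the query box, so the row would be fetched and rejected by the detailed analysis, whereas ''segment_hit'' is false and the row is filtered out directly.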
 +However, keeping a single bounding box for the whole trajectory has the drawback that the index is not very selective, as shown in the following figure (extracted from a presentation by Oleg Bartunov from PostgresPro).
 +The goal of the project is to define [[https://www.pgcon.org/2014/schedule/events/696.en.html|VODKA indexes]] for MobilityDB, which enable us to store in the index multiple bounding boxes (one per segment) associated with each row in the table, as shown in the following figure.
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be) 
-**Status**: available 
