
This shows you the differences between two versions of the page.

teaching:projh402 [2016/09/22 14:35]
svsummer [Graph Indexing for Fast Subgraph Isomorphism Testing]
teaching:projh402 [2021/09/19 10:43]
ezimanyi [Dynamic Time Warping for Trajectories]
Line 3: Line 3:
 ===== Course objective ===== ===== Course objective =====
-The course PROJ-H-402 is managed by Dr. Mauro Birattari. Please refer to the course description page http://iridia.ulb.ac.be/proj-h-402/index.php/Main_Page for the rules concerning the project. What follows is a list of project proposals supervised by academic members of CoDE. +The course PROJ-H-402 is managed by Dr. Mauro Birattari. Please refer to the [[http://iridia.ulb.ac.be/wiki/PROJ-H-402_-_Computing_Project:_Rules|course description page]] for the rules concerning the project. What follows is a list of project proposals supervised by academic members of the WIT laboratory.
-===== Project proposals ​=====+===== Projects in Mobility Databases ​=====
-=== Fast loading of semantic web datasets into native RDF stores ==== +Mobility databases (MOD) are database systems that can store and manage moving object geospatial trajectory data. A moving object is an object that changes its location over time (e.g., a car driving on the road network). Using a variety of sensors, the location tracks of moving objects can be recorded in digital formats. A MOD then helps store and query such data. A couple of prototype systems have been proposed by research groups, yet a mainstream system is still missing. By mainstream we mean that the development builds on widely accepted tools that are actively maintained and developed. A mainstream system would exploit the functionality of these tools and maximize the reuse of their ecosystems. As a result, it would be closer to end users and more easily adopted in industry.
-The next generation of the Web, the so-called Semantic Web, stores +Towards filling this gap, our group is building the [[https://github.com/MobilityDB/MobilityDB|MobilityDB]] system. It builds on [[https://postgis.net/|PostGIS]], which is a spatial database extension of [[https://www.postgresql.org/|PostgreSQL]]. MobilityDB extends the type system of PostgreSQL and PostGIS with abstract data types (ADTs) for representing moving object data. It defines, for instance, the tgeompoint type for representing a time-dependent geometry point. MobilityDB types are well integrated into the platform to achieve maximal reusability, and hence a mainstream development. For instance, the tgeompoint type builds on the PostGIS geometry(point) type. Similarly, MobilityDB builds on the existing operations, indexing, and optimization framework.
-extremely large knowledge bases in the RDF data model. In this data +
-model, all knowledge is represented by means of triples of the form +
-(subject, property, object), where subject, property and object can be +
-URLs, among other things.+
-In order to efficiently query such knowledge bases, the RDF data is +MobilityDB supports SQL as its query interface. It is currently quite rich in terms of types and functions. It is incubated as a community project in [[https://www.osgeo.org/projects/mobilitydb/|OSGeo]], which certifies high technical quality.
-typically loaded into a so-called native RDF store. To ensure that the +
-knowledge is encoded for fast retrieval, the RDF store will first +
-encode all variable-length URLs in the dataset by fixed-width +
-integers, among other things. Each RDF triple will then be encoded by +
-its corresponding integer triple (integer_of_subject, +
-The purpose of this project is to implement and experimentally compare +The following project ideas contribute to different parts of MobilityDB. They all constitute innovative development, mixing both research and development. They will hence help develop the student's skills in:
-a number of algorithms that can perform this encoding:+
-  * The trivial algorithm that simply maintains a hashmap that maps URLs to their integer codes. When processing a triple (s,p,o), it looks up s, p, and o in the hashmap to see if they have already been assigned an integer ID. If so, this ID is used for the encoding; otherwise they are inserted into the hashmap with new, unique IDs. The downside of this approach is that, while simple, it requires that one can store all URLs in working memory. +  * Understanding the theory and the implementation of moving object databases. 
 +  * Understanding ​the architecture ​of extensible databases, in this case PostgreSQL. 
 +  * Writing open source software.
-  * The slightly smarter algorithm that works in multiple stages: the ID is computed by a pre-fixed hash function. For each URL, the URL and its ID are written to an output file. This file is later sorted on ID to check for possible hash collisions between distinct URLs. 
-  * Algorithms that use the best known state-of-the-art data structures for compactly representing sets of strings, such as the HAT-TRIE ("Engineering scalable, cache and space efficient tries for strings", Nikolas Askitis, Ranjan Sinha, The VLDB Journal, October 2010, Volume 19, Issue 5, pp 633-660, and "HAT-Trie: A Cache-Conscious Trie-Based Data Structure For Strings", The 30th International Australasian Computer Science Conference (ACSC), Volume 62, pages 97-105, 2007). +===== Visualization of Moving Objects on the Web =====
-  * Variations of the above algorithms, fine-tuned ​for semantic ​web datasets.+There are several open source platforms ​for publishing spatial data and interactive mapping applications to the web. Two popular ones are [[https://​mapserver.org/​|MapServer]] and [[http://​geoserver.org/​|GeoServer]],​ which are written, respectively,​ in C and in Java.
 +However, these platforms are used for static spatial data and are unable to cope with moving objects. The goal of the project is to extend one of these platforms with spatio-temporal data types in order to be able to display animated maps.
-**Contact** ​Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)+{{:teaching:​trips2.gif?​direct|}}
-**Status**: available +Animated visualization of car trajectories
-===== Graph Indexing for Fast Subgraph Isomorphism Testing =====+
-There is an increasing amount of scientific data, mostly from the bio-medical sciences, that can be represented as collections of graphs (chemical molecules, gene interaction networks, ...). A crucial operation when searching in this data is that of subgraph isomorphism testing: given a pattern P that one is interested in (also a graph) and a collection D of graphs (e.g., chemical molecules), find all graphs in D that have P as a subgraph. Unfortunately, the subgraph isomorphism problem is computationally intractable. In ongoing research, to enable tractable processing of this problem, we aim to reduce the number of candidate graphs in D on which a subgraph isomorphism test needs to be executed. Specifically, we index the graphs in the collection D by decomposing them into graphs for which subgraph isomorphism *is* tractable. An associated algorithm that filters out graphs that certainly cannot match P can then be formulated based on ideas from information retrieval. +===== MobilityDB on Google Cloud Platform =====
-In this project, the student will empirically validate on real-world datasets the extent to which graphs can be decomposed into graphs for which subgraph isomorphism is tractable, and run experiments to validate the effectiveness of the proposed method in terms of filtering power. +Deploying MobilityDB on the cloud enables the processing of the large amounts of mobility data that are continuously being generated nowadays. MobilityDB has already been deployed on Azure and on AWS. This project continues this effort on the Google Cloud Platform. The objective is to build on the similarities and differences of the three cloud platforms to define a foundation for mobility data management on the cloud.
-**Interested?​** Contact ​: [[stijn.vansummeren@ulb.ac.be|Stijn Vansummeren]]+Links: 
 +  * [[https://​github.com/​MobilityDB/​MobilityDB-Azure|MobilityDB-Azure]] 
 +  * [[https://​github.com/​MobilityDB/​MobilityDB-AWS|MobilityDB-AWS]]
-**Status**: ​available+**Status**: ​taken 
 +===== Implementing TSBS on MobilityDB =====
-==== Principles of Database Management Architectures in Managed Virtual Environments ==== +The Time Series Benchmark Suite ([[https://github.com/timescale/tsbs|TSBS]]) is a collection of Go programs that are used to generate datasets and then benchmark the read and write performance of various time series databases. This benchmark has been developed by [[https://www.timescale.com/|TimescaleDB]], which is a time series extension of PostgreSQL.
-With the gaining popularity of Big Data, many data processing engines +A significant addition of TimescaleDB to PostgreSQL is the [[https://blog.timescale.com/blog/simplified-time-series-analytics-using-the-time_bucket-function/|time_bucket]] function. This function partitions the timeline into user-defined interval units that are used for aggregating data.
-are implemented in a managed virtual environment such as the Java +
-Virtual Machine (e.g., Apache Hadoop, Apache Giraph, Drill, +
-...). While this improves ​the portability of the engine, the tradeoffs +
-and implementation principles w.r.t. traditional C++ implementations +
-are sometimes less understood.+
-The objective in this project is to develop some basic functionalities +The project consists in implementing a multidimensional generalization of the time_bucket function that allows the user to partition the spatial and/or temporal domain of a table into units (or tiles) that can be used for aggregating data. Then, the project consists of performing a benchmark comparison of TimescaleDB and MobilityDB.
-of a database storage engine (Linked files, BTree, Extensible Hash +
-table, basic external-memory sorting ) in a managed virtual machine +
-(i.e., the Java Virtual Machine ​or and the .NET Common Language +
-Runtime), and compare this with a C++-based implementation both on (1) +
-ease of implementation and (2) execution efficiency. In order to +
-develop the managed virtual machine implementation,​ the interested +
-student will need to research the best practices ​that are used in the +
-above-mentioned projects to gain maximum execution speed (e.g., use of +
-the java.lang.unsafe feature, memory-mapped files, ...).+
-**Contact** : Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)+**Status**: taken
-**Status**: available 
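The tiling idea behind a multidimensional time_bucket can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `time_bucket`, `space_time_tile` and `count_per_tile`, the fixed grid origin `EPOCH` and the square cell size are hypothetical choices made here, not the TimescaleDB or MobilityDB API.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import math

EPOCH = datetime(2000, 1, 1)  # assumed origin of the time grid (illustrative choice)


def time_bucket(width: timedelta, ts: datetime, origin: datetime = EPOCH) -> datetime:
    """Return the start of the bucket of size `width` that contains `ts`."""
    n = (ts - origin) // width  # timedelta // timedelta floors to an int
    return origin + n * width


def space_time_tile(x, y, ts, cell, width, origin=EPOCH):
    """Return (column, row, bucket start) of the tile containing (x, y, ts)."""
    return (math.floor(x / cell), math.floor(y / cell), time_bucket(width, ts, origin))


def count_per_tile(points, cell, width):
    """Aggregate (x, y, ts) observations by counting them per space-time tile."""
    counts = defaultdict(int)
    for x, y, ts in points:
        counts[space_time_tile(x, y, ts, cell, width)] += 1
    return dict(counts)
```

Counting is used as the aggregate only because it is the simplest one; any other aggregate could be computed per tile in the same way.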
-==== Development of a distributed simulation algorithm ==== 
-Simulation and Bisimulation are fundamental notions in computer 
-science. They underlie many formal verification algorithms, and have 
-recently been applied to the construction of so-called structural 
-indexes, which are novel index data structures for relational databases 
-and the Semantic Web. Essentially, a (bi)simulation is a relation on 
-the nodes of a graph. Unfortunately, however, while efficient 
-main-memory algorithms for computing whether two nodes are similar 
-exist, these algorithms fail when the input graphs are too large to 
-fit in main memory. ​ 
-The objective of this project is to implement a recently proposed +===== Map-matching as a Service ===== 
-algorithm for computing simulation in a distributed setting, and +GPS location tracks typically contain errors, as the GPS points will normally be some meters away from the true position. If we know that the movement happened on a street network, e.g., by a bus or a car, then we can correct this by snapping the points back onto the street. Fortunately, there are algorithms for this, called map-matching algorithms. There are also a handful of open source systems that do map matching. However, it remains difficult for end users to use them, because they involve non-trivial installation and configuration effort. Preparing the base map that will be used in the matching is also an issue for users.
-provide a preliminary performance evaluation of this implementation.
-**Contact** ​Stijn Vansummeren (stijn.vansummeren@ulb.ac.be)+{{:teaching:​original.png?​direct&​400|}}
-**Status**: available+Original trajectory 
 +Map-matched trajectory 
 +The goal of this project is to build an architecture for a Map-Matching service. The challenges are that the GPS data arrives in different formats, and that Map-Matching is a time-consuming algorithm. This architecture should thus allow different input formats, and should be able to scale automatically according to the request rate. Another key outcome of this project is to compare the existing Map-Matching implementations, and to discuss their suitability for real-world problems. 
 +  * [[https://github.com/bmwcarit/barefoot|Barefoot]] 
 +  * [[https://valhalla.readthedocs.io/en/latest/api/map-matching/api-reference/|Valhalla Map Matching API]] 
 +  * [[https://github.com/graphhopper/map-matching|GraphHopper]] 
 +  * [[https://github.com/cyang-kth/fmm|Fast Map Matching]] 
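To make the correction step concrete, here is a deliberately naive sketch that snaps every GPS point to the closest street segment. It assumes planar coordinates and treats every point independently. The systems listed above use more elaborate methods that also take the road topology and the order of the points into account, so this is a baseline for intuition, not their algorithm. The function names are hypothetical.

```python
import math


def project_on_segment(p, a, b):
    """Return the point of segment a-b that is closest to p (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return a  # degenerate segment: both end points coincide
    # Relative position of the projection along the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_to_network(points, segments):
    """Snap every point to its nearest position on any of the street segments."""
    matched = []
    for p in points:
        candidates = (project_on_segment(p, a, b) for a, b in segments)
        matched.append(min(candidates, key=lambda q: math.dist(p, q)))
    return matched
```

Because each point is handled in isolation, a noisy point near an intersection can jump to the wrong street, which is exactly the weakness that sequence-aware map matching addresses.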
 +===== Symbolic trajectories ===== 
 +Symbolic trajectories make it possible to attach semantic information to geometric trajectories. Essentially, symbolic trajectories are just time-dependent labels representing, for example, the names of roads traversed obtained by map matching, transportation modes, speed profile, cells of a cellular network, behaviors of animals, cinemas within 2 km distance, and so forth. Symbolic trajectories can be combined with geometric trajectories to obtain annotated trajectories. 
 +The goal of this project is to explore how to implement symbolic trajectories in MobilityDB. This implementation will be based on the ttext (temporal text) data type implemented in MobilityDB and will explore how to extend it with regular expressions. This extension can be inspired by the [[https://www.postgresql.org/docs/13/functions-json.html|jsonb]] data type implemented in PostgreSQL. 
 +  * R.H. Güting, F. Valdés, M.L. Damiani, {{:teaching:symbolic_trajectories.pdf|Symbolic Trajectories}}, ACM Transactions on Spatial Algorithms and Systems, 1(2), Article 7, 2015. 
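As a rough illustration of time-dependent labels combined with regular expressions, the sketch below models a symbolic trajectory as a list of (label, start, end) units and matches a pattern against the label sequence. The representation, the sample trip and the function names are assumptions made for this example. They do not reflect the ttext type or the pattern language of the cited article.

```python
import re

# A symbolic trajectory as a list of (label, start, end) units.
# Times are plain numbers (e.g., minutes) to keep the example short.
trip = [
    ("home", 0, 5),
    ("walk", 5, 12),
    ("bus", 12, 30),
    ("walk", 30, 34),
    ("office", 34, 90),
]


def matches(traj, pattern):
    """Check whether the space-separated label sequence matches a regular expression."""
    sequence = " ".join(label for label, _, _ in traj)
    return re.fullmatch(pattern, sequence) is not None


def time_with_label(traj, label):
    """Total duration during which the trajectory carries the given label."""
    return sum(end - start for lab, start, end in traj if lab == label)
```

For instance, the pattern `home( \w+)* bus( \w+)* office` asks whether the trip goes from home to the office and uses a bus somewhere in between.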
 +===== Trajectory Data Warehouses ===== 
 +Mobility data warehouses are data warehouses that keep location data for a set of moving objects. You can refer to the article below for more information about the subject. The project consists in building a mobility data warehouse for ship trajectories on MobilityDB. 
 +The input data comes from the Danish Maritime Authority (follow the link "Get historical AIS data"). To download the data you must use an FTP client (such as FileZilla). Follow the instructions in Chapter 1 of the MobilityDB Workshop to load the data into MobilityDB. 
 +You must implement a comprehensive data warehouse application. For this, you will perform in particular the following steps. 
 +  * Define a conceptual multidimensional schema for the application. 
 +  * Translate the conceptual model into a relational data warehouse.  
 +  * Implement the relational data warehouse in MobilityDB.  
 +  * Implement analytical queries based on the queries proposed in [1]. 
 +  * A. Vaisman and E. Zimányi. [[https://​www.mdpi.com/​2220-9964/​8/​4/​170|Mobility data warehouses]]. ISPRS International Journal of GeoInformation,​ 8(4), 2019.  
 +  * [[https://www.dma.dk/SikkerhedTilSoes/Sejladsinformation/AIS/Sider/default.aspx|Danish Maritime Authority]] 
 +  * [[https://​github.com/​MobilityDB/​MobilityDB-workshop|MobilityDB Workshop]] 
 +===== Geospatial Trajectory Data Cleaning ===== 
 +Data cleaning is essential preprocessing for analysing the data and extracting meaningful insights. Real data will typically include outliers, inconsistencies,​ missing data, repeated transactions possibly with different keys, and other kinds of acquisition errors. In geospatial trajectory data, there are even more sources of error, such as GPS inaccuracies.  
 +The goal of this project is to survey the state of the art in geospatial trajectory data cleaning, both model-based and machine learning. The work also includes prototyping and empirically evaluating a selection of these methods in the MobilityDB system, and on different real datasets. These outcomes should serve as a base for a thesis project to enhance geospatial trajectory data cleaning. 
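One of the simplest model-based cleaning rules is a speed filter: a GPS point is dropped when reaching it from the previously kept point would require an implausible speed. The sketch below assumes planar coordinates in metres, timestamps in seconds, and a first point that is correct. The function name and the threshold are illustrative only.

```python
import math


def remove_speed_outliers(track, max_speed):
    """Drop points that imply a speed above `max_speed` from the last kept point.

    `track` is a list of (t, x, y) tuples sorted by t.
    """
    if not track:
        return []
    cleaned = [track[0]]  # assumption: the first observation is trustworthy
    for t, x, y in track[1:]:
        t0, x0, y0 = cleaned[-1]
        dt = t - t0
        if dt <= 0:
            continue  # repeated or out-of-order timestamp
        if math.hypot(x - x0, y - y0) / dt <= max_speed:
            cleaned.append((t, x, y))
    return cleaned
```

A filter like this removes isolated spikes and repeated observations, but it cannot repair systematic drift, which is where the more advanced methods to be surveyed come in.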
 +===== Dynamic Time Warping for Trajectories ===== 
 +The dynamic time warping (DTW) algorithm is able to find the optimal alignment between two time series. It is often used to determine time series similarity, classification,​ and to find corresponding regions between two time series. Several dynamic time warping implementations are available. However, DTW has a quadratic time and space complexity that limits its use to small time series data sets. Therefore, a fast approximation of DTW that has linear time and space complexity has been proposed. 
 +The goal of this project is to survey the state-of-the-art methods in dynamic time warping and to prototype them in MobilityDB. 
 +  * T. Giorgino, [[https://www.jstatsoft.org/article/view/v031i07|Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package]], Journal of Statistical Software, 31(7), 2009. 
 +  * S. Salvador, P. Chan, [[https://​cs.fit.edu/​~pkc/​papers/​tdm04.pdf|FastDTW:​ Toward Accurate Dynamic Time Warping in Linear Time and Space]], Intelligent Data Analysis, 11(5):​561-580,​ 2007 
 +  * D.F. Silva, G.E.A.P.A. Batista, [[http://​sites.labic.icmc.usp.br/​dfs/​pdf/​SDM_PrunedDTW.pdf|Speeding Up All-Pairwise Dynamic Time Warping Matrix Calculation]],​ Proceedings of the 2016 SIAM International Conference on Data Mining, 837-845, 2016. 
 +  * G. Al-Naymat, S. Chawla, J. Taheri, [[https://arxiv.org/abs/1201.2969|SparseDTW: A Novel Approach to Speed up Dynamic Time Warping]], CoRR abs/1201.2969, 2012. 
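For reference, the classic quadratic dynamic program behind DTW fits in a few lines. The sketch below is the textbook formulation, not any of the faster variants cited above. The function name and the default point distance are choices made for this example.

```python
import math


def dtw_distance(a, b, dist=lambda u, v: abs(u - v)):
    """Cost of the optimal DTW alignment between sequences a and b.

    Uses O(len(a) * len(b)) time and space, which is the limitation
    that the approximate and pruned variants try to overcome.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

For trajectories, the same function can be called with a point distance such as `math.dist` on (x, y) tuples instead of the absolute difference.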
 +===== Geospatial Trajectory Similarity Measure ===== 
 +One of the main functions for a wide range of application domains is to measure the similarity between the trajectories of two moving objects. This is desirable for similarity-based retrieval, classification, clustering and other querying and mining tasks over moving object data. The existing movement similarity measures can be classified into two classes: (1) spatial similarity, which focuses on finding trajectories with similar geometric shapes, ignoring the temporal dimension; and (2) spatio-temporal similarity, which takes into account both the spatial and the temporal dimensions of movement data. 
 +The goal of this project is to survey the state-of-the-art methods in trajectory similarity and to prototype them in MobilityDB. Since it is a complex problem, these outcomes should serve as a base for a thesis project to propose effective and efficient trajectory similarity measures. 
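The difference between the two classes can be seen on a small example. The sketch below uses the Hausdorff distance as one possible spatial measure and the mean distance at common timestamps as one possible spatio-temporal measure. Both are picked here purely for illustration, with hypothetical function names and linear interpolation between observations as an assumption.

```python
import math


def hausdorff(p, q):
    """Symmetric Hausdorff distance between two sequences of (x, y) points; time is ignored."""
    def directed(a, b):
        return max(min(math.dist(u, v) for v in b) for u in a)
    return max(directed(p, q), directed(q, p))


def position_at(traj, t):
    """Linearly interpolated position at time t; traj is a list of (t, x, y) sorted by t."""
    for (t0, x0, y0), (t1, x1, y1) in zip(traj, traj[1:]):
        if t0 <= t <= t1:
            r = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
            return (x0 + r * (x1 - x0), y0 + r * (y1 - y0))
    raise ValueError("t lies outside the time span of the trajectory")


def mean_synchronous_distance(p, q, times):
    """Average distance between the two objects at the given timestamps."""
    return sum(math.dist(position_at(p, t), position_at(q, t)) for t in times) / len(times)
```

Two vehicles driving the same street in opposite directions have a spatial distance of zero under the first measure, while the second measure reports that they are mostly far apart.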
 +===== Spatiotemporal k-Nearest Neighbour (kNN) Queries ===== 
 +An example of a continuous kNN query is when the GPS device of a vehicle initiates a query to find the three closest gas stations to the vehicle at any time instant during its trip from source to destination. Depending on the location of the vehicle, the set of three nearest gas stations can change. The result is thus a set of intervals, where every interval is associated with a set of three gas stations. The challenge in this type of query is to find an efficient incremental way of evaluating it. 
 +The goal of the project is to survey the state of the art in continuous kNN queries, and to prototype selected methods in MobilityDB. Since it is a complex problem, these outcomes should serve as a base for a more elaborate thesis project. 
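The shape of the result, a list of intervals each carrying a set of neighbours, can be illustrated with a naive baseline that recomputes the kNN set at every sampled position and merges consecutive samples with the same answer. This is explicitly not the efficient incremental evaluation the project is after. The names and the sampling assumption are hypothetical.

```python
import math


def knn(position, sites, k):
    """Names of the k sites closest to `position`, as a frozenset."""
    ranked = sorted(sites, key=lambda name: math.dist(position, sites[name]))
    return frozenset(ranked[:k])


def continuous_knn(samples, sites, k):
    """Return (t_start, t_end, neighbour set) intervals for a sampled trip.

    `samples` is a list of (t, (x, y)) pairs sorted by t; `sites` maps names to (x, y).
    """
    result = []
    for t, pos in samples:
        current = knn(pos, sites, k)
        if result and result[-1][2] == current:
            start, _, same = result[-1]
            result[-1] = (start, t, same)  # extend the running interval
        else:
            result.append((t, t, current))  # the neighbour set changed
    return result
```

The cost is one full ranking per sample, and the exact instant at which the answer changes is only known up to the sampling step. Both issues motivate the incremental methods to be surveyed.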
 +===== K-D-Tree Indexes for MobilityDB ===== 
 +Indexes are essential in databases for quickly locating data without having to search every row in a table every time a database table is accessed. Thus, an index is an auxiliary data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index. PostgreSQL provides [[https://​habr.com/​ru/​company/​postgrespro/​blog/​441962/​|multiple types of indexes]] for various data types. 
 +In MobilityDB two types of indexes have been implemented, namely, [[https://habr.com/en/company/postgrespro/blog/444742/|GiST]] and [[https://habr.com/ru/company/postgrespro/blog/446624/|SP-GiST]]. More precisely, in PostgreSQL, these types of indexes are frameworks for developing multiple types of indexes. Concerning SP-GiST indexes, in MobilityDB we have developed 4-dimensional quad-trees where the dimensions are X, Y, and possibly Z for the spatial dimension and T for the time dimension. An alternative approach would be to use [[https://en.wikipedia.org/wiki/K-d_tree|K-D Trees]]. K-D trees can be implemented in PostgreSQL using the SP-GiST framework, and an example [[https://github.com/postgres/postgres/blob/master/src/backend/access/spgist/spgkdtreeproc.c|implementation]] for simple [[https://www.postgresql.org/docs/current/datatype-geometric.html|geometric types]] exists. The goal of the project is to implement K-D tree indexes for MobilityDB and perform a benchmark comparison between K-D trees and the existing 4-dimensional quad-trees. 
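To recall how a K-D tree partitions space, here is a small in-memory sketch over (x, y, t) points with an orthogonal range search. It is a textbook version written for illustration. It is unrelated to the SP-GiST support functions that an actual PostgreSQL implementation has to provide, and the function names are hypothetical.

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree as nested dicts; points are equal-length tuples such as (x, y, t)."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions level by level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def range_search(node, lo, hi, found=None):
    """Collect the points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if found is None:
        found = []
    if node is None:
        return found
    p, axis = node["point"], node["axis"]
    if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
        found.append(p)
    # Descend only into the half-spaces that can intersect the query box.
    if lo[axis] <= p[axis]:
        range_search(node["left"], lo, hi, found)
    if p[axis] <= hi[axis]:
        range_search(node["right"], lo, hi, found)
    return found
```

Unlike a quad-tree, which splits all dimensions at once at every node, this structure splits a single dimension per level, which is one of the differences a benchmark comparison would exercise.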
 +===== VODKA Indexes for MobilityDB ===== 
 +MobilityDB provides [[https://habr.com/en/company/postgrespro/blog/444742/|GiST]] and [[https://habr.com/ru/company/postgrespro/blog/446624/|SP-GiST]] indexes for temporal types. These indexes are based on bounding boxes, that is, the nodes of the index tree store a bounding box that keeps the minimum and maximum values of each of the dimensions, where X, Y, Z (if available) are for the spatial dimension and T for the temporal dimension. The reason for this is that a temporal type (for example, a moving point representing the movement of a vehicle) can have thousands of timestamped points, and keeping all these points for each vehicle indexed in a table is very inefficient. By keeping only the bounding box, it is possible to quickly filter the rows in a table, and then a more detailed analysis can be made for those rows selected by the index. 
 +However, keeping a single bounding box for the whole trajectory has the drawback that the index is not very selective, as shown in the following figure (extracted from a presentation by Oleg Bartunov from PostgresPro). 
 +The goal of the project is to define [[https://www.pgcon.org/2014/schedule/events/696.en.html|VODKA indexes]] for MobilityDB, which enable us to store in the index multiple bounding boxes (one per segment) associated with each row in the table, as shown in the following figure. 
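The selectivity problem can be reproduced with a toy example in Python. The sketch compares the filter step with one bounding box per trajectory against one box per segment. The box layout (xmin, ymin, tmin, xmax, ymax, tmax) and the function names are assumptions of this example and say nothing about how VODKA stores its entries.

```python
def bbox(points):
    """Bounding box (xmin, ymin, tmin, xmax, ymax, tmax) of a sequence of (x, y, t) points."""
    xs, ys, ts = zip(*points)
    return (min(xs), min(ys), min(ts), max(xs), max(ys), max(ts))


def overlaps(a, b):
    """True if two boxes in the layout above intersect in every dimension."""
    return all(a[i] <= b[i + 3] and b[i] <= a[i + 3] for i in range(3))


def segment_boxes(points):
    """One bounding box per pair of consecutive points."""
    return [bbox(pair) for pair in zip(points, points[1:])]
```

For an L-shaped trajectory, a query box placed in the empty corner overlaps the single bounding box, so the row passes the index filter and must be examined in detail, whereas none of the per-segment boxes overlaps it.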
teaching/projh402.txt · Last modified: 2022/09/06 10:39 by ezimanyi