Database Systems Architecture - Project
Assignment
You are asked to implement an external-memory merge-sort
algorithm (such as the one described in Section 15.4.1 of TCB), and to examine its
performance under different parameters. As a warm-up exercise you will need to
explore several different ways to read data from, and write data to, secondary
memory. The overall goal of the assignment is to gain real-world experience with
the performance of external-memory algorithms. Be sure to read the
full assignment for all details and modalities.
Practicalities & Evaluation
The assignment should be solved in groups of 3 (if the total number of students is not divisible by 3, at most two groups of 2 students will be allowed).
This project contributes 6/20 to the overall grade, and the
written exam contributes the remaining 14/20 points. Your solution should be pushed to the Git repository (see full assignment) no later than 5 January 2020. You get a penalty of -1/6 points for each day that your solution is delayed.
Overview of groups
You are asked to register, per group, the names of the group members via this online poll by November 4 at the latest. If you cannot find a partner, please indicate this by sending an email to Prof. Vansummeren, who will pair you with a partner.
The list of approved groups is published.
Questions & Answers
In the third I/O variant, do we need to implement our own buffer manually, or do we need to configure the buffer size parameter available through Java BufferedInputStream class and test it for different values?
The objective of the 3rd IO variant is to be able to experiment with different values for B. Hence, it suffices to configure the buffer size parameter available through the Java BufferedInputStream class.
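As a concrete illustration, the sketch below reads a stream through a BufferedInputStream whose buffer size is set explicitly, for several values of B. The class and method names are placeholders, and the in-memory stream stands in for the actual test file; in the real experiment you would wrap a FileInputStream over your data file and time each run.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadDemo {
    // Read an entire stream through a BufferedInputStream with buffer size b
    // (in bytes), and return the total number of bytes read.
    static long readAll(InputStream in, int b) throws IOException {
        long total = 0;
        try (BufferedInputStream bis = new BufferedInputStream(in, b)) {
            byte[] chunk = new byte[1024];
            int n;
            while ((n = bis.read(chunk)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000]; // stand-in for a real test file
        // Vary B and time each full pass over the data.
        for (int b : new int[]{1024, 4096, 65536}) {
            long start = System.nanoTime();
            long bytes = readAll(new ByteArrayInputStream(data), b);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("B=" + b + ": read " + bytes + " bytes in " + micros + " us");
        }
    }
}
```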
For the tests, how can we proceed systematically and make use of existing testing environments?
As described in the project assignment, there are several frameworks available on the web that allow conducting experiments in a systematic way. These are called micro-benchmark frameworks. You are free to use such a framework if you find that convenient. Note that, typically, these frameworks do not ship code to generate test files automatically (since that is rather application-specific).
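One such framework for Java is JMH (the Java Microbenchmark Harness). If you prefer not to add a dependency, even a minimal hand-rolled harness along the following lines (class and method names are placeholders) gives more stable numbers than a single timed run, since it repeats the measurement and reports the median:

```java
import java.util.Arrays;

public class MiniBench {
    // Time a task over several repetitions and report the median in
    // microseconds; the median is more robust to outliers (JIT warm-up,
    // background activity) than a single measurement.
    static long medianMicros(Runnable task, int reps) {
        long[] times = new long[reps];
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = (System.nanoTime() - start) / 1_000;
        }
        Arrays.sort(times);
        return times[reps / 2];
    }

    public static void main(String[] args) {
        long median = medianMicros(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i; // stand-in for one I/O experiment
        }, 11);
        System.out.println("median: " + median + " us");
    }
}
```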
I am now experimenting with the Random reading. As I understand, we need to perform j random jumps using the seek operation and each seek operation should move the file pointer p positions from the start of the file.
I am using Java and I am facing a problem, as I perform the seek with the FileReader.skip(p) method. As I see, the class FileReader provides the operations mark() and reset() (the only way to move the file pointer back to the start of the file), but they fail during execution (with the error: mark not supported).
I tried to apply the same functions with the BufferedReader class and they work. My question is how are we supposed to move the fp back to the start of the stream in order to correctly apply the next seek?
You are right that the FileReader class itself does not allow seeking in the underlying file. (Note: the reset() method of BufferedReader is also not guaranteed to work.) The solution is to start from a RandomAccessFile, which has a seek() method that does exactly what you want. This RandomAccessFile can be wrapped inside a FileReader: you can get the FileDescriptor of the RandomAccessFile and use it to create the FileReader, which will start reading at the current position of the RandomAccessFile. Every time you do a seek(), however, you need to re-create the FileReader to make sure that the position is synchronized. (Note: if you use a BufferedReader, you also need to re-create it, since it depends on the FileReader.) See, e.g.,
https://stackoverflow.com/questions/5614206/buffered-randomaccessfile-java
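The pattern described above can be sketched as follows (class and helper names are placeholders): seek on the RandomAccessFile, then wrap its FileDescriptor in a fresh FileReader and BufferedReader, both of which must be re-created after every seek() because they buffer independently of the file pointer.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.RandomAccessFile;

public class RandomJumpDemo {
    // Seek to position p, then read one character starting there by wrapping
    // the RandomAccessFile's descriptor in fresh Reader objects. The readers
    // are deliberately not closed here, since closing them would also close
    // the shared file descriptor.
    static char readCharAt(RandomAccessFile raf, long p) throws IOException {
        raf.seek(p);
        BufferedReader r = new BufferedReader(new FileReader(raf.getFD()));
        return (char) r.read();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("jumpdemo", ".txt");
        f.deleteOnExit();
        try (PrintWriter w = new PrintWriter(f)) {
            w.print("abcdefghij");
        }
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            System.out.println(readCharAt(raf, 4)); // prints 'e'
            System.out.println(readCharAt(raf, 0)); // prints 'a'
        }
    }
}
```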
As far as we could observe for the moment, it seems that once we have
opened a file during a first test run, there is no I/O to the disk for the
next ones. We believe that the file was actually buffered in cache, and
we fear it might impact the accuracy of our tests. For example, the IMDB
data set is only 5GB, so a computer equipped with more RAM (like 8GB)
could load all the files into RAM, if it is not running any other programs
at the same time. Could you please tell us if there is a way to stop
this behavior and, if so, do we have to consider it?
Yes: if you are running on Linux, for example, the
OS will buffer the pages it has read in case you need them again. As such, after the first run your files will likely be in RAM and no "real" I/O takes place. You can circumvent this behavior by clearing the
OS page cache before re-running the experiment (see, e.g.,
https://unix.stackexchange.com/questions/87908/how-do-you-empty-the-buffers-and-cache-on-a-linux-system ). It is recommended to do so to obtain 100% reliable results.