The Elf Approach
Our paper "Accelerating multi-column selection predicates in main-memory – the Elf approach" has been accepted for publication and presentation at ICDE 2017 in San Diego, California.
Evaluating selection predicates is a data-intensive task that reduces intermediate results, which will be the input for further operations such as joins. With analytical queries getting more and more complex, the number of evaluated selection predicates per query rises as well. This leads to numerous multi-column selection predicates. Recent approaches to increase the performance of main-memory databases for selection-predicate evaluation aim at optimally exploiting the speed of the CPU by using accelerated scans. However, scanning each column one by one leaves tuning opportunities open that arise if all predicates are considered together. To this end, we introduce Elf, a storage structure that is able to exploit the relation between several selection predicates. Our Elf features cache sensitivity, an optimized storage layout, fixed search paths, and slight data compression. In our evaluation, we compare its query performance to two state-of-the-art approaches and a sequential scan using the concept of single instruction multiple data (SIMD). Our results indicate a clear superiority of our approach. For TPC-H queries with multi-column selection predicates, we achieve a speed-up between factor five and two orders of magnitude, mainly depending on the selectivity of the predicates.
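To make the column-by-column baseline from the abstract concrete, here is a minimal sketch of a scan that evaluates one predicate per column and refines a position list as it goes. It is not taken from the paper or its repository: the function name, the inclusive (low, high) predicate format, and the toy data are assumptions for illustration, and no SIMD acceleration is modelled.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed predicate format: inclusive (low, high) range, or None for "no predicate".
Range = Optional[Tuple[int, int]]


def scan_column_at_a_time(
    columns: Sequence[Sequence[int]], predicates: Sequence[Range]
) -> Tuple[List[int], List[int]]:
    """Evaluate the predicates one column after another.

    Returns the qualifying tuple positions and the size of the
    intermediate position list after each evaluated predicate.
    """
    positions = list(range(len(columns[0])))
    intermediate_sizes = []
    for column, predicate in zip(columns, predicates):
        if predicate is None:
            continue  # this column is not restricted
        low, high = predicate
        # Only positions that survived earlier predicates are checked.
        positions = [p for p in positions if low <= column[p] <= high]
        intermediate_sizes.append(len(positions))
    return positions, intermediate_sizes


# Toy table with three columns and five tuples (hypothetical data).
columns = [
    [1, 1, 2, 2, 3],
    [5, 6, 5, 7, 5],
    [9, 9, 8, 9, 9],
]
result, sizes = scan_column_at_a_time(columns, [(1, 2), (5, 5), (9, 9)])
print(result)  # [0]
print(sizes)   # [4, 2, 1]
```

In this toy run the first predicate alone keeps four of five tuples, although only one tuple satisfies all three predicates. That gap between intermediate and final result sizes is the kind of overhead the abstract attributes to treating each column in isolation.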
The source code for the ICDE 2017 paper is available at: http://git.iti.cs.ovgu.de/dbronesk/ICDE-elf
Required Libraries: Boost, CMake
Test data can be found in the source code repository above in the folder OP (TPC-H with scale factor 1).
Our paper "Accelerating Multi-Column Selection Predicates in Main-Memory - The Elf Approach" has been awarded the FIN Forschungspreis (research award) of the University of Magdeburg (PDF)
David Broneske, Veit Köppen, Gunter Saake, and Martin Schäler. Accelerating multi-column selection predicates in main-memory – the Elf approach. In IEEE International Conference on Data Engineering (ICDE), pages 647–658, 2017. (PDF)
Efficient evaluation of selection predicates (e.g., range predicates) defined on multiple columns of the same table is a difficult but nevertheless important task. Especially for subsequent join processing or aggregation, we need to reduce the number of tuples to be processed. As data volumes have increased enormously over the last decade, this kind of selection predicate has become more important. Especially in OLAP scenarios or scientific data management tasks, we often face multi-dimensional data sets that need to be filtered based on several dimensions. So far, the state-of-the-art solution strategy is to apply highly optimized sequential scans. However, the intermediate results are often large, while the final query result often contains only a small fraction of the data set. This is due to the combined selectivity of all predicates. In this report, we propose Elf, a new tree-based approach to efficiently support such queries. In contrast to other tree-based approaches, we do not suffer from the curse of dimensionality. The main reason is that we do not apply space or data partitioning methods, like bounding boxes, but incrementally index sub-spaces. In addition, our Elf is cache sensitive, contains an optimized storage layout, fixed search paths, and even supports slight data compression. Our experimental results indicate a clear superiority of our approach compared to a highly optimized SIMD sequential scan as competitor. For TPC-H queries with multi-column selection predicates, we achieve a speed-up between factor five and two orders of magnitude, mainly depending on the selectivity of the predicates.
Veit Köppen, David Broneske, Gunter Saake, and Martin Schäler. Elf: A Main-Memory Structure for Efficient Multi-Dimensional Range and Partial Match Queries. Technical Report 002-2015, Otto-von-Guericke-University Magdeburg, Magdeburg, 2015.
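One way to picture "incrementally indexing sub-spaces" from the technical report abstract above is a prefix tree in which each level fixes the value of one more column. The sketch below follows that reading as an assumption for illustration. It is not Elf's actual layout: the cache-conscious storage, fixed search paths, and compression mentioned in the abstract are omitted, and all names and the toy data are hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Assumed predicate format: inclusive (low, high) range, or None for a wildcard column.
Range = Optional[Tuple[int, int]]


def build_index(rows: Sequence[Sequence[int]]) -> Dict[int, Any]:
    """Build nested dicts: level i is keyed by the values of column i.

    The last level maps a value to the list of tuple IDs that share
    the whole path. Rows are assumed to be non-empty and of equal length.
    """
    root: Dict[int, Any] = {}
    for tid, row in enumerate(rows):
        node = root
        for value in row[:-1]:
            node = node.setdefault(value, {})
        node.setdefault(row[-1], []).append(tid)
    return root


def search(node: Dict[int, Any], predicates: Sequence[Range], level: int = 0) -> List[int]:
    """Return tuple IDs matching all predicates, pruning whole subtrees early."""
    predicate = predicates[level]
    result: List[int] = []
    for value in sorted(node):
        if predicate is not None and not (predicate[0] <= value <= predicate[1]):
            continue  # every tuple below this entry fails the predicate
        child = node[value]
        if level == len(predicates) - 1:
            result.extend(child)  # last level: child is a list of tuple IDs
        else:
            result.extend(search(child, predicates, level + 1))
    return result


# Toy table with five tuples and three columns (hypothetical data).
rows = [(1, 5, 9), (1, 6, 9), (2, 5, 8), (2, 7, 9), (3, 5, 9)]
index = build_index(rows)
print(search(index, [(1, 2), (5, 5), (9, 9)]))  # [0]
print(search(index, [None, (5, 5), None]))      # [0, 2, 4]
```

Because a failing value at one level discards its whole subtree, all predicates are evaluated along a single descent rather than in separate per-column passes. The second call shows a partial-match query, where unrestricted columns are passed as None.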
Efficient evaluation of multi-dimensional range queries in a main-memory database is an important, but difficult task. State-of-the-art techniques rely on optimised sequential scans or tree-based structures. For range queries with small result sets, sequential scans exhibit poor asymptotic performance. Also, as the dimensionality of the data set increases, the performance of tree-based structures degenerates due to the curse of dimensionality. Recent literature proposed the Elf, a main-memory structure that is optimised for the case of such multi-dimensional low-selectivity queries. The Elf outperforms other state-of-the-art methods in manually tuned scenarios. However, choosing an optimal parameter configuration for the Elf is vital, since for poor configurations, the search performance degrades rapidly. Consequently, further knowledge about the behaviour of the Elf in different configurations is required to achieve robust performance. In this thesis, we therefore propose a numerical cost model for the Elf. As with all main-memory index structures, the response time of the Elf is not dominated by disk accesses, which rules out a straightforward analysis. Our model predicts the size and shape of the Elf region that is examined during search. We propose that the response time of a search is linear in the size of this region. Furthermore, we study the impact of skewed data distributions and correlations on the shape of the Elf. We find that they lead to behaviour that is accurately describable through simple reductions in attribute cardinality. Our experimental results indicate that for data sets of up to 15 dimensions, our cost model predicts the size of the examined Elf region with relative errors below 5%. Furthermore, we find that the size of the Elf region examined during search predicts the response time with an accuracy of 80%.
Jonas Schneider. Analytic Performance Model of a Main-Memory Index Structure. In CoRR, arXiv.org, 2016.
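The thesis abstract above proposes that search response time is linear in the size of the examined Elf region. As a rough illustration of what checking such a relationship could look like, the sketch below fits a line to (region size, response time) pairs with ordinary least squares. The function names are hypothetical, the fitting method is a generic choice rather than the thesis's procedure, and no measurements from the thesis are reproduced.

```python
from typing import Sequence, Tuple


def fit_linear(sizes: Sequence[float], times: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of times ~ slope * sizes + intercept."""
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (size, time) pairs of equal length")
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    var_s = sum((s - mean_s) ** 2 for s in sizes)
    if var_s == 0:
        raise ValueError("region sizes must not all be identical")
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    slope = cov / var_s
    intercept = mean_t - slope * mean_s
    return slope, intercept


def predict_time(slope: float, intercept: float, region_size: float) -> float:
    """Predicted response time for a given examined-region size."""
    return slope * region_size + intercept
```

In use, one would record the examined region size and the measured response time for a set of queries, call fit_linear on the two lists, and compare predict_time against held-out measurements.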