SIMD Vectorized Hashing for Grouped Aggregation
Bala Gurumurthy, David Broneske, Marcus Pinnecke, Gabriel Campero Durand, and Gunter Saake.
In Advances in Databases and Information Systems
pages 113 – 126, September 2018.
Grouped aggregation is a commonly used analytical function. The common implementation of the function using hashing techniques suffers lower throughput rate due to the collision of the insert keys in the hashing techniques. During collision, the underlying technique searches for an alternative location to insert keys. Searching an alternative location increases the processing time for an individual key thereby degrading the overall throughput. In this work, we use Single Instruction Multiple Data (SIMD) vectorization to search multiple slots at an instant followed by direct aggregation of results. We provide our experimental results of our vectorized grouped aggregation with various open-addressing hashing techniques using several dataset distributions and our inferences on them. Among our findings, we observe different impacts of vectorization on these techniques. Namely, linear probing and two-choice hashing improve their performance with vectorization, whereas cuckoo and hopscotch hashing show a negative impact. Overall, we provide in this work a basic structure of a dedicated SIMD accelerated grouped aggregation framework that can be adapted with different hashing techniques.
Read in ADBIS
GridFormation: Towards Self-Driven Online Data Partitioning using Reinforcement Learning
Gabriel Campero Durand, Marcus Pinnecke, Rufat Piriyev, Mahmoud Mohsen, David Broneske, Gunter Saake, Maya Sekeran, Fabian Rodriguez, and Laxmi Balami.
In First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM)
In this paper we define a research agenda to develop a general framework supporting online autonomous tuning of data partitioning and layouts with a reinforcement learning formulation. We establish the core elements of our approach: agent, environment, action space and supporting components. Externally predicted workloads and the current physical design serve as input to our agent. The environment guides the search process by generating immediate rewards based on fresh cost estimates, for either the entirety or a sample of queries from the workload, and by deciding the possible actions given a state. This set of actions is configurable, enabling the representation of different partitioning problems. For use in an online setting the agent learns a fixed-length sequence of n actions that maximize the temporal reward for the predicted workload. Through an initial implementation we assert the feasibility of our approach. To conclude, we list open challenges for this work.
Read in aiDM
Piecing Together Large Puzzles, Efficiently:
Towards Scalable Loading into Graph Database Systems
Gabriel Campero Durand, Jingyi Ma, Marcus Pinnecke, and Gunter Saake.
In Grundlagen von Datenbanken, GvDB
Many applications rely on network analysis to extract business intelligence from large datasets, requiring specialized graph tools such as processing frameworks (e.g. Apache Giraph, Gradoop), database systems (e.g. Neo4j, JanusGraph) or applications/libraries (e.g. NetworkX, nvGraph). A recent survey reports scalability, particularly for loading, as the foremost practical challenge faced by users. In this paper we consider the design space of tools for efficient and scalable graph bulk loading. For this we implement a prototypical loader for a property graph DBMS, using a distributed mes- sage bus. With our implementation we evaluate the impact and limits of basic optimizations. Our results confirm the expectation that bulk loading can be best supported as a server-side process. We also find, for our specific case, gains from batching writes (up to 64x speedups in our evaluation), uniform behavior across partitioning strategies, and the need for careful tuning to find the optimal configuration of batching and partitioning. In future work we aim to study loading into alternative physical storages with GeckoDB, an HTAP database system developed in our group.
Read in GvDB
Exploring Large Scholarly Networks with Hermes
Gabriel Campero Durand, Anusha Janardhana, Marcus Pinnecke, Yusra Shakeel, Jacob Krüger, Thomas Leich, and Gunter Saake
In International Conference on Extending Database Technology, EDBT
pages 650–653. OpenProceedings, March 2018
Every year, the number of scientific publications increases, adding complexity to the networks of collaborations, citations, and topics, in which papers are embedded. Analyzing these networks with efficient tools is important to help researchers identify relevant works and understand scientific impact. However, available tools face several limitations, indicating that there is still room for improvement. We present Hermes, a prototype for exploring large and heterogeneous scholarly networks. Hermes allows users to seamlessly navigate diverse types of networks within a single graph, spanning hundreds of millions of nodes and relationships. Our prototype achieves reasonable responsiveness on commodity hardware through: a) comprehensive indexing, b) a careful coupling of a graph database and a search engine, and c) incremental processing of temporal queries. In this demonstration, we explain the techniques we adopt and illustrate how to use Hermes for exploring the Microsoft Academic Graph.
Read in EDBT
Backlogs and Interval Timestamps: Building Blocks for Supporting Temporal Queries in Graph Databases
Gabriel Campero Durand, Marcus Pinnecke, David Broneske, and Gunter Saake.
In Proceedings of the Workshops of the EDBT/ICDT 2017 Joint Conference (EDBT/ICDT 2017)
Venice, Italy, March 21-24, 2017., volume 1810. CEUR-WS, 2017
The analysis of networks, either at a single point in time or through their evolution, is an increasingly important task in modern data management. Graph databases are uniquely suited to improve static network analysis. However, there’s still no consensus on how to best model data evolution with these databases. In our work we propose an elementary concept to support temporal analysis with property graph databases, using a single-graph model limited to structural changes. We manage the temporal aspects of items with interval timestamps and backlogs. To include backlogs in the model we examine two alternatives: (1) global indexes, and (2) using the graph as an index by resorting to timestamp denormalization. We evaluate density calculation and time slice retrieval over successive days from a SNAP dataset, on an Apache Titan prototype of our model, observing from 2x to 100x response time gains by comparing differential vs. snapshot methods; and no conclusive difference between the backlog alternatives.
Read in EDBT