Backend/Database Co-Design for Rapid Webservice Prototyping:
The FIN Doktorandentag Talk
Self-published extended abstract
Protobase: It's About Time for Backend/Database Co-Design
A Demo on Rapid Microservice Prototyping for Third-Party Dataset Analytics
Marcus Pinnecke, Gabriel Campero Durand, Roman Zoun, David Broneske, and Gunter Saake.
In Datenbanksysteme für Business, Technologie und Web (BTW 2019), Demo, Accepted
In this interactive demonstration, we show the current state of Protobase, our main-memory analytic document store that is designed from scratch to enable rapid prototyping of efficient microservices that perform analytics and explorations on (third-party) JSON-like documents stored in a novel columnar binary-encoded format, called the Cabin file format. In contrast to other solutions, our database system exposes neither a particular query language nor a fixed REST API to its clients. Instead, the entire user-defined backend logic, written in Python, is placed inside a sandbox that runs in the system's process. Protobase in turn exposes a user-defined REST API that the (frontend) application interacts with. Thus, our system acts as a backend server while avoiding full exposure of its database to the clients. Consequently, a Protobase instance (database + user code + REST API) serves as the entire microservice, potentially minimizing the number of systems running in a typical analytic software stack. In terms of execution performance, Protobase takes the inter-process communication overhead between backend and database system out of the picture and heavily utilizes columnar binary document storage to scale up for analytic queries. Both features lead to a notable performance gain for non-trivial services, potentially minimizing the number of required nodes in a cloud setting, too. In our demo, we give an overview of Protobase's internals, highlight major design decisions, and show how to prototype a scholarly search engine managing the Microsoft Academic Graph, a real-world scientific paper graph of roughly 154 million documents.
Read in BTW 2018
Are Databases Fit for Hybrid Workloads on GPUs?
A Storage Engine’s Perspective.
Marcus Pinnecke, David Broneske, Gabriel Campero Durand, and Gunter Saake.
In Proceedings of the International Workshop on Big Data Management on Emerging Hardware,
San Diego, USA, April 22, 2017, pages 1599–1606, 2017
Employing special-purpose processors (e.g., GPUs) in database systems has been studied throughout the last decade. Research on heterogeneous database systems that use both general- and special-purpose processors has addressed either transaction or analytic processing, but not the combination of the two. Support for hybrid transaction and analytic processing (HTAP) has been studied exclusively for CPU-only systems. In this paper, we ask whether current systems are ready for HTAP workload management with cooperating general- and special-purpose processors. For this, we take the perspective of the backbone of database systems: the storage engine. We propose a unified terminology and a comprehensive taxonomy to compare state-of-the-art engines from both domains. We show similarities and differences, and determine a necessary set of features for engines supporting HTAP workloads on CPUs and GPUs. Answering our research question, our findings yield a resolute: not yet.
Read in ICDE 2017
Efficient Single Step Traversals in Main-Memory Graph-Shaped Data
Master's thesis, University of Magdeburg,
Management of graph-shaped data has gained momentum in both industry and research. Traversal queries through a graph-shaped dataset are easy to express and can be efficiently executed using graph databases. High-performance traversals through graph-shaped data are claimed to be enabled by native graph storage (i.e., encoding data using graph data structures) and native graph processing (i.e., operating on data with graph-domain-specific operations). A common belief is that native graph storage databases are inherently superior to non-native graph storage databases (e.g., relational databases) in terms of traversal efficiency. This claim is especially supported by graph database vendors, but has not yet been proven or disproven objectively.
In this work, we study in the context of main-memory systems how the primitives of arbitrary traversal algorithms (i.e., single step traversal queries) are affected by native and non-native graph storage in terms of execution performance. We focus on single step traversal queries that address navigation in graph-shaped data. We compare classic graph encoding and a state-of-the-art graph database micro-index as representatives of native graph storage, and table scanning and indexing by several binary search trees as representatives of non-native graph storage.
We evaluate the representatives for native and non-native graph storage on both artificial datasets and real-world graph datasets. To control for confounding variables, we implement a unified main-memory-only experimental query engine to avoid bias from the internal behavior of black-box systems (e.g., main-memory systems vs. disk-based systems).
Our experimental results show that highly efficient traversal algorithms in main-memory systems require indexing adjacent records and incident relationships, rather than the property of being a native or a non-native graph storage.
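To make the contrast in the abstract above concrete, the following minimal Python sketch compares a single step traversal (finding the direct successors of one vertex) done by scanning an edge table with the same query answered from an adjacency index. The edge-list layout and the function names `neighbors_by_scan`, `build_adjacency_index`, and `neighbors_by_index` are assumptions made for this illustration; they are not the data structures or the query engine used in the thesis.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source vertex id, target vertex id), hypothetical layout


def neighbors_by_scan(edges: List[Edge], vertex: int) -> List[int]:
    """Single step traversal by scanning every row of the edge table."""
    return [dst for src, dst in edges if src == vertex]


def build_adjacency_index(edges: List[Edge]) -> Dict[int, List[int]]:
    """Group target vertices by source vertex once, ahead of any query."""
    index: Dict[int, List[int]] = defaultdict(list)
    for src, dst in edges:
        index[src].append(dst)
    return dict(index)


def neighbors_by_index(index: Dict[int, List[int]], vertex: int) -> List[int]:
    """Single step traversal as one lookup in the prebuilt index."""
    return index.get(vertex, [])


edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
index = build_adjacency_index(edges)
scan_result = neighbors_by_scan(edges, 1)
index_result = neighbors_by_index(index, 1)
```

Both functions return the same successors; the scan touches every edge per query, while the index pays a one-time build cost and then answers each query with a single lookup.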
Toward GPU Accelerated Data Stream Processing
Marcus Pinnecke, David Broneske, and Gunter Saake
In Proceedings of the 27th GI-Workshop Grundlagen von Datenbanken, Gommern, Germany,
May 26-29, 2015, pages 78–83, 2015
In recent years, the need for continuous processing and analysis of data streams has increased rapidly. To achieve high throughput rates, stream applications make use of operator parallelization, batching strategies, and distribution. Another possibility is to utilize co-processor capabilities per operator. Furthermore, the database community has noted that a column-oriented architecture is essential for efficient co-processing, since the data transfer overhead is smaller compared to transferring whole tables.
However, current systems still rely on a row-wise architecture for stream processing, because it requires data structures for high velocity. In contrast, stream portions are at rest while being bound to a window. This allows us to alter the per-window event representation from row to column orientation, which will enable us to exploit GPU acceleration.
Providing general-purpose GPU capabilities for stream processing is challenging because window sizes vary. Since very large windows cannot be passed directly to the GPU, we propose to split the variable-length windows into fixed-size window portions. Further, each such portion has a column-oriented event representation. In this paper, we present a time- and space-efficient, data-corruption-free concept for this task. Finally, we identify open research challenges related to co-processing in the context of stream processing.
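As a rough illustration of the splitting idea described in the abstract above, the following Python sketch cuts a window of row-wise events into portions of a fixed maximum size and stores each portion column-wise. The function name `split_window`, the dict-based event layout, and the chosen portion size are assumptions made for this example; they are not taken from the paper, and no GPU transfer is performed here.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]          # one row-wise event (hypothetical layout)
Portion = Dict[str, List[Any]]  # one column-wise portion: field name -> values


def split_window(window: List[Event], portion_size: int) -> List[Portion]:
    """Split a variable-length window into column-oriented portions.

    Every portion holds at most portion_size events; only the last
    portion may hold fewer.
    """
    if portion_size <= 0:
        raise ValueError("portion_size must be positive")
    portions: List[Portion] = []
    for start in range(0, len(window), portion_size):
        rows = window[start:start + portion_size]
        # Collect field names in first-seen order so the columns are stable.
        fields: List[str] = []
        for row in rows:
            for name in row:
                if name not in fields:
                    fields.append(name)
        # Transpose rows into columns; a missing field becomes None.
        portions.append({name: [row.get(name) for row in rows] for name in fields})
    return portions


window = [{"ts": t, "value": t * 10} for t in range(5)]
portions = split_window(window, portion_size=2)
```

In this sketch a window of five events with a portion size of two yields three portions, the last one holding a single event.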
Read in GvDB
Query Optimization in Heterogenous Event Processing Federations
Marcus Pinnecke and Bastian Hoßbach
Datenbank-Spektrum, 15(3):193–202, 2015
Continuous processing of event streams has evolved into an important class of data management over the last years and will become even more important due to novel applications such as the Internet of Things. Because systems for data stream and event processing have been developed independently of each other, often in competition and without any standards, the Stream Processing System (SPS) landscape is extremely heterogeneous today. To overcome the problems caused by this heterogeneity, a novel event processing middleware, the Java Event Processing Connectivity (JEPC), has been presented recently. However, although SPSs can be accessed uniformly using JEPC, their different performance profiles caused by different algorithms and implementations remain. This creates an opportunity for query optimization, because individual system strengths can be exploited. In this paper, we present a novel query optimizer that exploits the technical heterogeneity in a federation of different unified SPSs. Taking into account the different performance profiles of SPSs, we address query plan partitioning, candidate selection, and the reduction of inter-system communication in order to improve the overall query performance. We suggest a heuristic that finds a good initial mapping of sub-plans to a set of heterogeneous SPSs. An experimental evaluation clearly shows that heterogeneous federations outperform homogeneous federations in general, and that our heuristic performs well in practice.
Open in Datenbankspektrum
Konzept und prototypische Implementierung eines föderativen Complex Event Processing Systems mit Operatorverteilung
In Datenbanksysteme für Business, Technologie und Web (BTW 2015),
Workshopband, pages 233–242. GI, 2015
Complex Event Processing (CEP) is an established technology for processing event streams in near real time. Nevertheless, existing CEP systems differ considerably in their respective feature sets and performance profiles. If different CEP systems are combined in a federated system, the resulting system can achieve a higher data throughput rate and a broader feature set than the individual systems. However, missing standardization and incompatible interfaces of the CEP systems hinder this approach. This shortcoming is remedied by the middleware Java Event Processing Connectivity (JEPC). The subject of this work is to extend the JEPC-based federated CEP system so that query operators are distributed automatically across the participating systems. To this end, this work presents the underlying concept, a query optimization, and a cost model for selecting a concrete system. An evaluation shows that using a federation increases the data throughput rate significantly, especially when the federation is heterogeneous with differing performance profiles.
Read in BTW