2024

Counting Butterflies in Fully Dynamic Bipartite Graph Streams (to appear)

Serafeim Papadias, Zoi Kaoudi, Varun Pandey, Jorge Arnulfo Quiane Ruiz, Volker Markl
Conference Paper ICDE '24: IEEE International Conf. on Data Engineering (ICDE), 2024

Abstract

A bipartite graph extensively models relationships between real-world entities of two different types, such as user-product data in e-commerce. Such graph data are inherently becoming more and more streaming, entailing continuous insertions and deletions of edges. A butterfly (i.e., 2x2 bi-clique) is the smallest non-trivial cohesive structure that plays a crucial role. Counting such butterfly patterns in streaming bipartite graphs is a core problem in applications such as dense subgraph discovery and anomaly detection. Yet, existing approximate solutions consider insert-only streams and, thus, achieve very low accuracy in fully dynamic bipartite graph streams that involve both insertions and deletions of edges. Adapting them to consider deletions is not trivial either, because different sampling schemes and new accuracy analyses are required. In this paper, we propose Abacus, a novel approximate algorithm that counts butterflies in the presence of both insertions and deletions by utilizing sampling. We prove that Abacus always delivers unbiased estimates. Furthermore, we extend Abacus and devise a parallel mini-batch variant, namely, Parabacus, which counts butterflies in parallel. Parabacus counts butterflies in a load-balanced manner using versioned samples, which results in significant speedup and is thus ideal for critical applications in the streaming environment. We evaluate Abacus/Parabacus using a diverse set of real bipartite graphs and assess its performance in terms of accuracy, throughput, and speedup. The results indicate that our proposal is the first capable of efficiently providing accurate butterfly counts in the most generic setting, i.e., a fully dynamic graph streaming environment that entails both insertions and deletions. It does so without sacrificing throughput and even improving it with the parallel version.

2023

Space-Efficient Random Walks on Streaming Graphs

Serafeim Papadias, Zoi Kaoudi, Jorge Arnulfo Quiane Ruiz, Volker Markl
Conference Paper VLDB '23: Proc. of the Intl. Conf. on Very Large Databases, 2023

Abstract

Graphs in many applications, such as social networks and IoT, are inherently streaming, involving continuous additions and deletions of vertices and edges at high rates. Constructing random walks in a graph, i.e., sequences of vertices selected with a specific probability distribution, is a prominent task in many of these graph applications as well as machine learning (ML) on graph-structured data. In a streaming scenario, random walks need to constantly keep up with the graph updates to avoid stale walks and thus, performance degradation in the downstream tasks. We present Wharf, a system that efficiently stores and updates random walks on streaming graphs. It avoids a potential size explosion by maintaining a compressed, high-throughput, and low-latency data structure. It achieves (i) the succinct representation by coupling compressed purely functional binary trees and pairing functions for storing the walks, and (ii) efficient walk updates by effectively pruning the walk search space. We evaluate Wharf, with real and synthetic graphs, in terms of throughput and latency when updating random walks. The results show the high superiority of Wharf over inverted index- and tree-based baselines.

2022

In-Place Updates in Tree-Encoded Bitmaps

Marcellus Prama Saputra, Eleni Tzirita Zacharatou, Serafeim (Makis) Papadias, and Volker Markl
Conference Paper SSDBM '22: Proc. Intl. Conf. on Scientific and Statistical Database Management, 2022

Abstract

The Tree-Encoded Bitmap (TEB) is a tree-based bitmap compression scheme that maps runs in a bitmap to leaf nodes in a binary tree. Currently, TEBs perform updates using an auxiliary differential data structure. However, consulting this additional data structure at every read introduces both memory and read overheads. To mitigate the shortcomings of differential updates, we propose algorithms to update TEBs in place. To that end, we classify the updates that can occur in a TEB into two types: run-forming and run-breaking. Run-forming updates correspond to leaf nodes at the lowest level of the binary tree. All other updates are run-breaking. Each type of update requires different handling. Through experimentation with synthetic data, we determined that in-place run-forming updates are 2-3× faster than differential updates, while run-breaking updates cannot be efficiently performed in place. Therefore, we propose a hybrid solution that performs run-forming updates in place and stores run-breaking updates in a differential data structure. Our experiments with synthetic data show that our hybrid solution performs updates faster than the differential approach. For example, for a workload where 20% of the updates are run forming, our hybrid solution is 69% faster on average.

2020

Tunable Streaming Graph Embeddings at Scale

Serafeim Papadias
Conference Paper PhD@VLDB '20: Proc. Intl. Conf. on Very Large Databases, 2020

Abstract

An increasing number of real-world applications require machine learning tasks over large-scale streaming graphs, where nodes and edges are continuously being added or deleted. Graph embeddings have been widely used for solving such tasks by capturing the graph structure and features into a low-dimensional latent space. However, current approaches have one or more of the following disadvantages: (i) they are designed for either static or dynamic graphs and thus, need retraining after each graph change or periodically updating the embeddings after each snapshot arrival, (ii) they fail to scale to today’s size of graphs composed of billions of nodes, or (iii) yet the ones devised for streaming graphs perform redundant retraining computations by mandating continuous embedding updates even if the accuracy is not improved. The goal of this thesis is to overcome the abovementioned problems by devising tunable streaming methods that can scale to massive graphs. We envision an end-to-end ML streaming system that achieves that goal and provides users with abstractions to easily define their own streaming embedding algorithms.

2019

Streaming HyperCube: A Massively Parallel Stream Join Algorithm

Yuan Qiu, Serafeim Papadias, Ke Yi
Conference Paper EDBT '19: Proc. Intl. Conf. on Extending Database Technology, 2019

Abstract

Streaming join is an essential operation in many real-time applications. A lot of research has been devoted to the study of distributed streaming join algorithms. However, existing solutions rely on heuristics and cannot handle data skew optimally. On non-streaming, static data, the HyperCube algorithm ensures a balanced load across all processors in an optimal way. In this paper, we extend this algorithm to the streaming setting, which can adapt the HyperCube configuration depending on the current data distribution. Some preliminary experimental results are provided to demonstrate the efficiency our algorithm.

Trajectory-aware Load Adaption for Continuous Traffic Analytics

Kostas Patroumpas, Serafeim Papadias
Conference Paper SSTD '19: Proc. Intl. Symposium on Spatial and Temporal Databases, 2019

Abstract

We introduce a framework for online monitoring of moving objects, which takes into account their evolving trajectories and copes smoothly with fluctuating demands of multiple continuous queries for limited system resources. This centralized scheme accepts streaming positional updates from numerous objects, but it only examines recent trajectory segments with expectedly higher utility in query evaluation, shedding the rest as immaterial. We focus on adaptive processing under extreme load conditions, opting to retain salient trajectory segments and possibly sacrifice smaller, frequently observed paths in favor of longer, distinctive routes. We propose heuristics for incremental, yet approximate, query evaluation in order to provide up-to-date traffic analytics using windows that abstract particular regions and time intervals of interest. Finally, we conduct a comprehensive experimental study to validate our approach, demonstrating its benefits in result accuracy and efficiency for almost real-time response to trajectory-based aggregates.

Full list of publications at DBLP.

Notice: The documents contained in this website are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.