Butterfly Counting in Fully Dynamic Bipartite Graph Streams [ICDE'24]

We designed Abacus, a novel approximate algorithm that counts butterflies in the presence of both insertions and deletions by utilizing sampling. Abacus always delivers unbiased estimates of low variance. Furthermore, we extend Abacus and devise a parallel mini-batch variant, namely, Parabacus, which counts butterflies in parallel. Parabacus counts butterflies in a load-balanced manner using versioned samples, which results in significant speedup and is thus ideal for critical applications in the streaming environment.


Graph Stream Processing [PhD@VLDB'20, VLDB'23]

We designed Wharf, a system that efficiently stores and updates random walks on streaming graphs. Wharf avoids a potential size explosion by maintaining a compressed, high-throughput, and low-latency data structure. It achieves (i) the succinct representation by coupling compressed purely functional binary trees and pairing functions for storing the walks, and (ii) efficient walk updates by effectively pruning the walk search space. I am particularly interested in the impact of the graph stream data processing, as graph can model data of pretty much any real world scenario.

Here, you may find my presentation from the online VLDB 2020 (such an unfortunate event that we missed Tokyo!)


In-Place Updates in Tree-Encoded Bitmaps [SSDBM'22]

The Tree-Encoded Bitmap (TEB) is a tree-based bitmap compression scheme that maps runs in a bitmap to leaf nodes in a binary tree. Currently, TEBs perform updates using an auxiliary differential data structure. However, consulting this additional data structure at every read introduces both memory and read overheads. To mitigate the shortcomings of differential updates, we propose algorithms to update TEBs in place. To that end, we classify the updates that can occur in a TEB into two types: run-forming and run-breaking. Run-forming updates correspond to leaf nodes at the lowest level of the binary tree. All other updates are run-breaking. Each type of update requires different handling. Through experimentation with synthetic data, we determined that in-place run-forming updates are 2-3× faster than differential updates, while run-breaking updates cannot be efficiently performed in place. Therefore, we propose Thuja, a hybrid solution that performs run-forming updates in place and stores run-breaking updates in a differential data structure.


Worst-Case Optimal Data Stream Joins [EDBT'19]

Streaming join is an essential operation in many real-time applications as, nowadays, more and more data are becoming inherently streaming in nature. A lot of research has been devoted to the study of distributed streaming join algorithms. However, existing solutions rely on heuristics and cannot handle data skew optimally. On non-streaming, static data, the HyperCube algorithm ensures a balanced load across all processors in an optimal way. In this work, we extended this algorithm to the streaming setting, which is worst-case optimal streaming join algorithm and can adapt the HyperCube configuration depending on the current data distribution. We believe this is a fruitful direction and more can be done to ensure provable load-balanced and worst-case optimal execution.


Streaming Spatiotemporal Data Management [SSTD'19]

Together with Kostas, we introduced a load shedding framework for online monitoring of moving objects, which takes into account their evolving trajectories and copes smoothly with fluctuating demands of multiple continuous queries for limited system resources. This centralized scheme accepts streaming positional updates from numerous objects, but it only examines recent trajectory segments with expectedly higher utility in query evaluation, shedding the rest as immaterial. We focus on adaptive processing under extreme load conditions, opting to retain salient trajectory segments and possibly sacrifice smaller, frequently observed paths in favor of longer, distinctive routes. We propose heuristics for incremental query evaluation in order to provide up-to-date traffic analytics using windows that abstract particular regions and time intervals of interest.