A research article co-authored by a Two Sigma scientist was recently accepted as a full paper—and won Best Student Paper Award (Research Track)—at the 22nd ACM International Conference on Knowledge Discovery and Data Mining (ACM KDD2016). The conference is the main venue for novel research in data mining and data science.
TRIÈST, the suite of algorithms introduced in the paper, is used for approximate triangle counting in large graphs from fully-dynamic edge streams.
The value of high-quality approximations
Graphs are a natural representation for many human and artificial phenomena. Since available graph datasets can be extremely large, a fast and exact computation of characteristic quantities of such datasets is often impractical or infeasible. Additionally, these datasets are by nature very dynamic, with edges and nodes constantly being added and removed. Such datasets can then be seen as a stream of edge additions and deletions.
Characteristic quantities in these graphs are intrinsically volatile, so there is limited added value in computing them exactly. As a practical matter, the goal is rather to keep track, at all times, of a high-quality approximation of these quantities. For efficiency, the algorithms performing this task should aim to exploit the available memory space as much as possible, and they should require only one pass over the stream.
Particularly useful characteristic quantities are the graph-global and vertex-local numbers of triangles in the graph. These quantities are fundamental primitives with many applications, including community detection, topic mining, span and anomaly detection, and protein interaction network analysis.
A brief introduction to TRIÈST
Together with co-authors Lorenzo De Stefani and Eli Upfal from Brown University, and Alessandro Epasto from Google, Two Sigma Labs research scientist Matteo Riondato has developed TRIÈST, a suite of sampling-based, one-pass algorithms for approximate triangle counting from fully-dynamic edge streams. TRIÈST works by storing in main memory a small, fixed-size subset of the edges seen on the stream (i.e., a sample), and it uses this sample to update the triangle estimates continuously.
The estimations computed by TRIÈST are extremely accurate and fast to compute: TRIÈST can process up to 100,000 edges per second, and the estimation error is within 20% of the true value (a nine-fold reduction compared to the previous state of the art).