MiSoSouP: Extracting the Most “Flavor” from a Dataset

The MiSoSouP algorithm provides a novel way to identify interesting subgroups in a dataset accurately and efficiently.

Just as miso soup is tasty and made out of simple ingredients, the MiSoSouP algorithm uses a relatively simple approach to extract the most “flavor” out of a dataset—at least as far as the interesting subgroups are concerned.

MiSoSouP, which stands for “Mining Interesting Subgroups with Sampling and Pseudodimension” is an algorithm for identifying interesting subsets (the subgroups) of a dataset for which the distribution of a specific feature (the target) within the subgroup largely differs from the distribution of that feature in the entire dataset.

As an example, consider a table of patients, with all but one of the columns describing features of the patients and the target column describing the presence or absence of a different feature (e.g., a super power). The goal of subgroup mining is to identify expressions (not just combinations or conjunctions, but conjunctions of disjunctions) of features that are strongly correlated or strongly anti-correlated with the presence of the super power. These expressions are known as subgroups. Algorithms for finding subgroups have been developed by the data mining research community, but they do not scale well as the size of the dataset (e.g., the number of patients) grows.

Creating a more scalable algorithm with statistical learning theory

The MiSoSouP project had the specific goal of creating an algorithm that scaled well as the number of rows in the dataset increases. The main idea behind it is to avoid analyzing the whole dataset, and instead to compute a high-quality approximation of the collection of interesting subgroups by only analyzing a small random subset (sample) of the rows in the dataset.

The key question is how the subgroups from the sample compare to the subgroups that would have been extracted from the whole dataset. MiSoSouP addresses this question using concepts and theorems from the field of statistical learning theory, which gives mathematical and probabilistical foundations to machine learning. As it turns out, it is possible to compute an upper bound to the pseudodimension of the task of subgroup mining. The pseudodimension is a combinatorial quantity that quantifies the complexity of approximating a set of quantities (in this case, the interestingness of the subgroups) from a sample.

Previously the use of this quantity had been limited to the field of studying the generalization ability of learning algorithms, but MiSoSouP illustrates how the quantity can be useful far beyond this area, thereby creating a new bridge between statistical learning and data mining. MiSoSouP uses the upper bound to the pseudodimension to compute how large the sample should be in order to achieve the desired accuracy in the approximation of the collection of subgroups. Experiments show that MiSoSouP is much faster and more scalable than existing algorithms, while only requiring a fraction of the data to obtain high-quality approximations.

See the full paper, MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension, for details on computing the sample size and the results of the experimental evaluation on real datasets, showing the performances of MiSoSouP in terms of runtime and accuracy as functions of the desired quality guarantees.

MiSoSouP was developed by Matteo Riondato, a research scientist on Two Sigma’s Labs team, in collaboration with Prof. Fabio Vandin from the University of Padua (Italy). The paper was accepted for publication in the proceedings of the ACM KDD’18 conference, the main research conference on data science. Deemed one of the best papers of this year’s KDD, an extended version has been invited for submission to a special issue of ACM Transactions on Knowledge Discovery from Data.

Topics

MiSoSouP: Extracting the Most “Flavor” from a Dataset

Creating a more scalable algorithm with statistical learning theory

Related Articles

ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages

Graph Summarization with Quality Guarantees

TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size