JSM, formally known as the Joint Statistical Meetings, has grown into the world's largest annual statistical event since its inception in 1840. The meeting serves as a unique venue for faculty, students, and industry practitioners to exchange ideas, results, and insights from the development and practice of statistics. Topics range from statistical theory and methodology to applications in areas such as biology, medicine, and the social sciences. Recent JSM conferences have attracted more than 5,000 attendees from about 50 countries.
Two Sigma sponsored JSM 2017, which was held in Baltimore, MD. This year's meeting boasted more than 500 sessions covering a wide range of topics. Below, a group of Two Sigma statisticians provides an overview of some of the most interesting sessions and lectures, both on recent advances in statistics and on the challenges statisticians face moving forward. Highlights discussed include:
- Special lectures, such as "What's Happening in Selective Inference?" by Emmanuel Candes and "Information-Theoretic Methods in Statistics" by Martin Wainwright, which cover novel topics that have recently begun to attract substantial interest in the statistics community.
- Special, late-breaking, and additional sessions, including one on "Computer Age Statistical Inference" by Bradley Efron and Trevor Hastie, which attracted large audiences and stimulated lively floor discussions.
Wald Lectures I–III: What's Happening in Selective Inference? — Emmanuel Candes, Stanford University
The lecture was a call to action for statisticians to meet the challenges that rapid technological advances have brought to modern statistics. In particular, modern statistical inference for scientific discovery tends to be (1) post-selection, i.e., hypotheses are formed, revised, or drastically changed after data are collected and analyzed (hence the name selective); (2) large-scale, as it is not uncommon for thousands of hypotheses to be mined together; and (3) computationally intensive and analytically intractable, as complex models become more widely adopted. How statisticians should embrace these changes has naturally become a hot topic of discussion within the statistics community. For example, with machine learning models coming to center stage, even the most talented statisticians find deriving the null distribution behind a p-value (a simple task that used to be done on paper) increasingly hard to tackle.
Emmanuel Candes' work provides an interesting approach to this challenge in the context of high-dimensional variable selection (Barber and Candes, 2015): instead of calculating p-values for all the test statistics, one simply manufactures knockoff variables, a "fake copy" of the original features that mimics their correlation structure but is conditionally independent of the response given the original features. The same test statistics are then computed from the knockoffs alongside the variables of interest, and a data-dependent threshold that controls the FDR is calculated; it is shown, both theoretically and numerically, to work well. The merit of the method lies in its simplicity and universal applicability, regardless of the complexity of the underlying model or the choice of test statistic: it works the same way for testing regression coefficients in a simple linear model or the lasso as it does for variable selection in a complicated machine learning model such as a neural network.
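The data-dependent threshold at the heart of the procedure is simple enough to sketch in a few lines. Below is a minimal illustration of the "knockoff+" threshold from Barber and Candes (2015), assuming per-feature statistics W (real-versus-knockoff importance differences) have already been computed; the toy W values here are synthetic, not from a real dataset.

```python
import numpy as np

def knockoff_threshold(W, q=0.1):
    """Knockoff+ threshold: the smallest t at which the estimated false
    discovery proportion (1 + #{W_j <= -t}) / #{W_j >= t} drops below q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold achieves the target FDR; select nothing

# synthetic W statistics: 10 true signals (large positive) and 90 nulls
# (roughly symmetric about zero, as knockoff theory guarantees for nulls)
rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(3.0, 0.5, 10), rng.normal(0.0, 1.0, 90)])
tau = knockoff_threshold(W, q=0.2)
selected = np.where(W >= tau)[0]
```

Features whose statistic clears the threshold are selected; the ratio being thresholded is an estimate of the false discovery proportion, which is what yields the finite-sample FDR guarantee.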
Candes' research has important implications for the future of statistical inference, as new theories and principles like these are greatly needed to respond to the sweeping advances in science and technology we are experiencing today.
Blackwell Lecture: Information-Theoretic Methods in Statistics: From Privacy to Optimization — Martin Wainwright, U.C. Berkeley
The arrival of massive datasets has created new issues in statistical inference. Some, like the curse of dimensionality, are well understood, while others, like the importance of privacy and related computational constraints, have only recently begun to attract statisticians' attention. To this end, Martin Wainwright covered two related topics in his talk. First, he provided a general characterization of the trade-off between privacy, formalized as a local differential privacy constraint at level alpha, and statistical utility, where utility is measured in terms of minimax risk. The second half of the talk covered randomized sketching methods for approximately solving least-squares problems under convex constraints. Wainwright first presented a general lower bound that applies to any randomized method, then discussed a new method, the iterative Hessian sketch (Pilanci and Wainwright, 2016), which is shown to perform well as measured by the distance between the approximate minimizer and the true minimizer.
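For the unconstrained case, the iterative Hessian sketch admits a short sketch of its own: each step solves a Newton-like system whose Hessian is approximated from a fresh random sketch, while the gradient uses the full data. This is a simplified, unconstrained reading of Pilanci and Wainwright (2016); the Gaussian sketch, sketch size, and problem dimensions are illustrative assumptions.

```python
import numpy as np

def iterative_hessian_sketch(A, y, m, iters=15, rng=None):
    """Approximate argmin_x ||Ax - y||^2 via the iterative Hessian sketch.

    Each iteration draws a fresh m x n Gaussian sketch S with E[S.T @ S] = I,
    so only the Hessian A.T @ A is approximated; the gradient is exact.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        S = rng.normal(size=(m, n)) / np.sqrt(m)
        SA = S @ A
        H = SA.T @ SA                  # sketched Hessian
        g = A.T @ (y - A @ x)          # exact (full-data) gradient direction
        x = x + np.linalg.solve(H, g)  # Newton-like step
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(2000, 20))
x_true = rng.normal(size=20)
y = A @ x_true + 0.1 * rng.normal(size=2000)
x_ihs = iterative_hessian_sketch(A, y, m=400, rng=rng)
x_ls = np.linalg.lstsq(A, y, rcond=None)[0]   # exact solution for comparison
```

Because the gradient is exact and only the Hessian is sketched, the iterates contract geometrically toward the exact least-squares solution rather than toward a sketched approximation of it.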
Medallion Lecture II: State-Space Modeling of Dynamic Processes in Neuroscience — Emery Brown, Massachusetts Institute of Technology
Emery Brown gave an interesting talk on applying state-space point-process modeling in neuroscience research. After a brief overview of state-space point processes, Brown discussed a state-space model used to control medically induced comas (a 2-D linear state-space model with binary observations and a Kalman filter) and showed that such control is feasible. The system combines an optimal feedback controller, a filter, thresholding, and a recursive Bayesian estimator. For example, given a target burst-suppression level, the closed-loop control system he described adjusts the administration of a drug (e.g., the propofol level) to maintain the desired state of general anesthesia.
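Brown's actual model is a 2-D state-space model with binary observations; as a rough flavor of the closed-loop idea, here is a deliberately simplified 1-D linear-Gaussian analogue in which a Kalman filter tracks a hidden level and a proportional controller steers it toward a target. All dynamics, noise levels, and gains below are invented for illustration and are not the lecture's model.

```python
import numpy as np

# Toy closed-loop control: hidden scalar state (think of it as an unobserved
# burst-suppression level), noisy observations, Kalman filtering, and a
# proportional controller acting on the filtered estimate.
rng = np.random.default_rng(0)
a, b = 0.95, 0.5           # state decay and drug-effect gain (assumed)
q_var, r_var = 0.01, 0.1   # process and observation noise variances (assumed)
target, K = 1.0, 0.8       # desired level and controller gain (assumed)

x, x_hat, P = 0.0, 0.0, 1.0
levels = []
for t in range(200):
    u = K * (target - x_hat)                            # control from estimate
    x = a * x + b * u + rng.normal(0, np.sqrt(q_var))   # true hidden state
    z = x + rng.normal(0, np.sqrt(r_var))               # noisy observation
    # Kalman filter: predict, then correct
    x_pred = a * x_hat + b * u
    P_pred = a * a * P + q_var
    gain = P_pred / (P_pred + r_var)
    x_hat = x_pred + gain * (z - x_pred)
    P = (1 - gain) * P_pred
    levels.append(x)
```

After a short transient, the hidden level settles near the target even though the controller never sees the true state, only the filtered estimate; that separation of estimation and control is the essence of the closed-loop design described in the talk.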
Computer Age Statistical Inference
The two talks in this special lecture series were based on a recent book by Bradley Efron and Trevor Hastie: Computer Age Statistical Inference: Algorithms, Evidence, and Data Science (Efron and Hastie, 2016).
Bayes, Oracle Bayes, Empirical Bayes — Bradley Efron, Stanford University
Empirical Bayes, an idea pioneered by Robbins (Robbins, 1956) and Stein (Stein, 1956), has since enjoyed widespread success through its applications in many scientific fields. It philosophically bridges the frequentist and Bayesian ways of thinking, while practically achieving performance close to that of Oracle Bayes without requiring knowledge of the underlying hyper-parameters (hence, in some sense, being viewed as optimal). However, most ongoing research focuses on particular loss functions, like mean squared error, and it remains unclear how the Empirical Bayes method behaves under general loss functions. This is a known deficiency of f-modeling, which focuses on modeling the marginal distribution of the data. As an alternative, Efron called for putting more effort into g-modeling, i.e., modeling the prior in the Empirical Bayes framework. Using several real-world datasets, he showed that this alternative is a much-preferred approach (Efron, 2015) and may reshape the future of research in Empirical Bayes.
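The f-modeling side of this discussion is easy to demonstrate with the classic example from the Efron-Hastie book: Robbins' formula for Poisson counts, where the posterior mean E[theta | x] = (x + 1) f(x + 1) / f(x) depends only on the marginal frequencies f, never on an explicit prior. The simulated Gamma prior below is an assumption chosen so the answer can be checked in closed form.

```python
import numpy as np

# Robbins' formula: estimate E[theta | x] from marginal counts alone.
rng = np.random.default_rng(0)
theta = rng.gamma(shape=2.0, scale=1.0, size=50_000)  # unknown "true" rates
x = rng.poisson(theta)                                 # observed counts

counts = np.bincount(x, minlength=x.max() + 2).astype(float)
f = counts / counts.sum()                              # empirical marginal

def robbins(k):
    """Empirical Bayes estimate of E[theta | x = k], pure f-modeling."""
    return (k + 1) * f[k + 1] / f[k]

# With a Gamma(2, 1) prior the true posterior mean is (x + 2) / 2, so the
# estimates below should track 1.0, 1.5, 2.0, ... for well-populated counts.
est = [robbins(k) for k in range(5)]
```

The estimator never touches the prior, which is exactly the strength and the weakness Efron highlighted: it is effortless for squared-error point estimation, but gives no direct handle on the prior needed for other loss functions, which is where g-modeling comes in.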
Variable Selection at Scale — Trevor Hastie, Stanford University
The lecture offered a brief overview of variable selection methods, such as best subset, forward stepwise, ridge, lasso, elastic net, and relaxed lasso, and it provided a comprehensive comparison of those methods using a wide range of numerical experiments. The speaker first noted that there has been an interesting breakthrough in best-subset algorithms using mixed integer programming (Miyashiro and Takano, 2015), making it feasible to include best subset in comparisons with the other variable selection methods across many problem sizes. The numerical comparisons on real and synthetic data (Hastie et al., 2017) indicate that best subset behaves much like forward stepwise and works better in high-SNR cases, while lasso-type methods tend to perform better in low-SNR scenarios, with relaxed lasso emerging as the overall winner.
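As a concrete reference point for one of the baselines being compared, here is a plain-numpy sketch of forward stepwise selection: greedily add the variable whose inclusion most reduces the residual sum of squares. The data-generating setup is invented for illustration.

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy forward stepwise selection: at each step, add the variable
    whose inclusion most reduces the residual sum of squares."""
    active = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in active:
                continue
            cols = X[:, active + [j]]
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)
    return active

# toy high-SNR example: only variables 4 and 9 enter the true model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 3 * X[:, 4] - 2 * X[:, 9] + rng.normal(size=200)
chosen = forward_stepwise(X, y, k=2)
```

In a high-SNR setting like this one, the greedy path recovers the true support, consistent with the talk's finding that stepwise (and best subset) shine when signals are strong.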
Three Principles for Data Science: Predictability, Stability, and Computability — Bin Yu, U. C. Berkeley
Prediction has become an increasingly important topic in the emerging field of data science, and recent years have seen machine learning methods applied successfully across many areas. Bin Yu's talk focused on the connections among "three principles of data science": predictability, stability, and computability. The three principles are connected in many ways. On one hand, stability, with respect to either the data or the model, is deeply related to the predictability of the model. On the other hand, every algorithm must be computable to be useful in practice.
As a concrete example, Bin discussed an ongoing project in which a convolutional neural network (CNN) was used to model the measured activation of human brain neurons in response to different pictures. The first few convolutional layers were found to play a critical role; once they are constructed, different algorithms can be used to build the last couple of layers. Stochastic gradient descent (SGD) was used to search for the input pictures that maximize the output strength, an approach similar to that of Google's Deep Dream project.
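The gradient-based search described above can be illustrated at toy scale: fix a tiny random network and run gradient ascent on the input to maximize one unit's output, constraining the input to the unit ball. The architecture, step size, and constraint are all assumptions for the sketch, not details of the actual project.

```python
import numpy as np

# Toy activation maximization: ascend the *input*, not the weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8)) / np.sqrt(8)   # fixed random first layer
W2 = rng.normal(size=(16,)) / 4.0            # fixed random readout

def unit_output(x):
    return W2 @ np.tanh(W1 @ x)

def grad(x):
    h = np.tanh(W1 @ x)
    return W1.T @ (W2 * (1 - h ** 2))        # chain rule through tanh

x = 0.1 * rng.normal(size=8)
for _ in range(300):
    x = x + 0.05 * grad(x)
    x = x / max(1.0, np.linalg.norm(x))      # project back to the unit ball

maximized = unit_output(x)
```

The same loop, with SGD on image pixels instead of an 8-vector and a trained CNN instead of random weights, is essentially the Deep-Dream-style search mentioned in the talk.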
There was also some interesting discussion of the relationship between stability, computability, and generalization performance. Bin mentioned that her ongoing research shows that the convergence speed of an algorithm is deeply related to its generalization error, and that there are cases in which slower-converging algorithms have better generalization performance.
Late-Breaking Session II: Hindsight is 20/20 and for 2020
From Euler to Clinton: an Unexpected Statistical Journey — Xiao-Li Meng, Harvard University
This lecture was a call for caution in the application of statistics to so-called "big data." The speaker discussed the use of a simple but subtle identity to measure the effective sample size when the selection variable (whether one is selected to participate in a poll) and the effect variable (whom the participant will vote for) are correlated. Its application can lead to surprising results in practice, demonstrating that even a very weak correlation between the two can cause a drastic reduction in effective sample size. For example, a poll of 1.6 million participants can have an effective sample size of only around 400 under a seemingly negligible correlation of about half a percent. One important takeaway is that the sheer size of the data does not automatically guarantee statistical efficiency or validity. Worse still, when "big data" is analyzed incorrectly, the speaker said, "the bigger the data, the surer we miss our target."
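The identity can be sketched numerically. Meng decomposes the error of a self-selected sample mean into a "data defect correlation" rho, a term in the sampling fraction f = n/N, and the outcome's standard deviation; matching the resulting MSE against simple random sampling yields an effective sample size. The inputs below (a poll covering 1% of the population, rho = 0.005) are illustrative assumptions rather than the talk's exact numbers.

```python
# Meng's identity: (sample mean - population mean) = rho * sqrt((1-f)/f) * sigma,
# where f = n/N. Equating this MSE with that of simple random sampling of size
# n_eff (whose variance is roughly sigma^2 / n_eff) gives the formula below.

def effective_sample_size(f, rho):
    """n_eff of a biased sample covering fraction f with defect correlation rho."""
    return f / (rho ** 2 * (1.0 - f))

# a poll reaching 1% of the population, with a tiny 0.5% defect correlation
n_eff = effective_sample_size(f=0.01, rho=0.005)
```

Millions of responses collapse to an effective sample size in the hundreds, which is the quantitative core of the lecture's warning.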
Additional Invited Sessions
Inference for Big Data — Larry Wasserman, Carnegie Mellon University
This talk was based on a recent paper on distribution-free predictive inference for regression (Lei et al., 2017), in which the authors propose a method for constructing a prediction band for the variable of interest using any form of regression function (from simple linear regression to black-box deep learning models). The cornerstone of the proposed framework is conformal prediction. The method is distribution-free and requires only minimal assumptions, such as i.i.d. input data. The speaker also illustrated that the proposed method works in both the continuous case, where the resulting prediction band is usually an interval, and the discrete case, where the resulting prediction band is a set. He used numerical examples to show that the empirical coverage rate of the constructed prediction band matches the claimed one well.
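The split-sample variant of conformal prediction is short enough to sketch: hold out a calibration set, compute absolute residuals on it, and widen every prediction by the appropriate empirical quantile. The OLS base learner and simulation settings below are assumptions; the point is that any fit function could be dropped in unchanged.

```python
import numpy as np

def split_conformal(X, y, fit, alpha=0.1, rng=None):
    """Split conformal band around an arbitrary regression procedure.

    `fit` maps (X_train, y_train) to a prediction function; the returned
    band x -> (lo, hi) is model-agnostic, which is the framework's point.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)
    idx = rng.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2:]
    predict = fit(X[train], y[train])
    scores = np.abs(y[calib] - predict(X[calib]))   # conformity scores
    k = int(np.ceil((len(calib) + 1) * (1 - alpha)))
    q = np.sort(scores)[min(k, len(calib)) - 1]     # conformal quantile
    return lambda x: (predict(x) - q, predict(x) + q)

def ols_fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x: x @ beta

rng = np.random.default_rng(1)
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ beta_true + rng.normal(size=1000)
band = split_conformal(X, y, ols_fit, alpha=0.1, rng=rng)

X_new = rng.normal(size=(1000, 3))
y_new = X_new @ beta_true + rng.normal(size=1000)
lo, hi = band(X_new)
coverage = np.mean((y_new >= lo) & (y_new <= hi))   # should be near 0.90
```

The finite-sample coverage guarantee needs only exchangeability of the data, which is why the construction survives the switch from linear regression to a black-box model.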
Overlapping Clustering with LOVE — Florentina Bunea, Cornell University
The speaker proposed a novel Latent model-based OVErlapping clustering method (LOVE) to recover overlapping sub-groups through a model formulated via matrix decomposition (Bunea et al., 2017). Assuming there are some pure variables associated with only one latent factor, while the remaining majority can have multiple allocations, the model is shown to be identifiable up to label switching. The algorithm estimates the clusters by first identifying the set of pure variables and then determining the allocation matrix and the corresponding overlapping clusters. Numerical studies, including an application to an RNA-seq dataset, compare LOVE with existing methods and show encouraging results.
Rate-Optimal Perturbation Bounds for Singular Subspaces with Applications to High-Dimensional Data Analysis — Tony Cai, University of Pennsylvania
The study of perturbation bounds has seen wide application in many fields, including statistics, computer science, and quantum mechanics. In this work, the authors showed that, in singular value decomposition (SVD), separate perturbation bounds can be established for the left and right singular subspaces, and they proved the rate-optimality of these individual bounds. The new perturbation bounds can be applied to problems such as low-rank matrix denoising and singular space estimation, high-dimensional clustering, and canonical correlation analysis.
Robust Covariate-Adjusted Multiple Testing — Jianqing Fan, Princeton University
Large-scale multiple testing, where thousands of hypotheses are tested together, has become the rule rather than the exception in scientific discovery. This talk addressed two challenges commonly seen in this area of research: strong dependence among the test statistics, and heavy-tailed data. To deal with the former, the authors use a multi-factor model to capture the dependence structure of the test statistics. For the latter, a Huber loss is employed when constructing the individual test statistics. The proposed method was shown to control the false discovery proportion, both theoretically and through numerical studies with simulated and real datasets. The authors also emphasized that the new method significantly outperforms multiple t-tests under strong dependence and applies to cases where the normality assumption is violated.
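The role of the Huber loss can be seen in miniature by robustly estimating a single location parameter on heavy-tailed data via iteratively reweighted averaging. This is a generic sketch of a Huber M-estimator, not the authors' test statistic; the threshold delta and the simulated t-distributed data are assumptions.

```python
import numpy as np

def huber_mean(x, delta=1.0, iters=50):
    """Huber-loss location estimate via iteratively reweighted averaging.
    Observations within `delta` of the current estimate get full weight;
    farther ones are down-weighted, blunting the effect of heavy tails."""
    mu = np.median(x)
    for _ in range(iters):
        r = x - mu
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))
        mu = np.sum(w * x) / np.sum(w)
    return mu

# heavy-tailed sample (t with 2 degrees of freedom) centered at 3
rng = np.random.default_rng(0)
x = rng.standard_t(df=2, size=5000) + 3.0
robust = huber_mean(x)
naive = np.mean(x)   # consistent here, but with infinite variance
```

Plugging a statistic like this into each hypothesis, instead of a sample mean, is what lets the testing procedure tolerate violations of normality.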
Random Matrices and Applications
Free Component Analysis — Raj Rao Nadakuditi, University of Michigan
A standard problem in image processing is to unmix and reconstruct original pictures from a mixture of them. Leveraging the framework of random matrices, the speaker discussed a method called Free Component Analysis (FCA), which is analogous to the independent component analysis (ICA) used to unmix independent random variables from their additive mixture. In contrast to ICA, which finds directions that maximize classical kurtosis, FCA searches for directions that maximize the absolute value of free kurtosis (Wu and Nadakuditi, 2017). Ongoing work has FCA maximize free entropy instead of free kurtosis, which has shown significant improvement over the kurtosis-maximizing version. The speaker noted that these methods also differ from other matrix decomposition algorithms, such as Robust PCA (Candès et al., 2011), which decompose a single matrix into a low-rank part and a sparse part.
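The classical-kurtosis half of this analogy can be sketched directly: mix two whitened independent sources with a rotation, then recover the unmixing angle by maximizing the absolute excess kurtosis of the projection. The grid search and the particular source distributions are simplifications for illustration; this shows the ICA principle, not FCA itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
s1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # sub-Gaussian, unit variance
s2 = rng.laplace(0.0, 1 / np.sqrt(2), n)       # super-Gaussian, unit variance
S = np.vstack([s1, s2])

theta_mix = 0.7                                 # unknown orthogonal mixing
R = np.array([[np.cos(theta_mix), -np.sin(theta_mix)],
              [np.sin(theta_mix),  np.cos(theta_mix)]])
X = R @ S

def excess_kurtosis(v):
    v = (v - v.mean()) / v.std()
    return np.mean(v ** 4) - 3.0

# projections in the direction of a pure source maximize |excess kurtosis|
angles = np.linspace(0, np.pi / 2, 181)
scores = [abs(excess_kurtosis(np.cos(a) * X[0] + np.sin(a) * X[1]))
          for a in angles]
theta_hat = angles[int(np.argmax(scores))]
```

Any rotated mixture of independent non-Gaussian sources is closer to Gaussian than the sources themselves, so the |kurtosis|-maximizing direction points back at a source; FCA replaces classical kurtosis with free kurtosis to play the same game with large matrices rather than scalar signals.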
High-Dimensional Cointegration Analysis — Alexei Onatski, University of Cambridge
Most existing cointegration tests, while working well in the standard paradigm of large N (sample size) and fixed p (number of parameters), are known to severely over-reject the null hypothesis of no cointegration in the high-dimensional setting (Gonzalo and Pitarakis, 1999), where both N and p are large. The author studied the asymptotic behavior of the empirical distribution of the squared sample canonical correlations and found that when the number of observations and the dimensionality go to infinity simultaneously, the distribution converges to a Wachter distribution. Under sequential asymptotics (letting one go to infinity before the other), the limiting distribution is instead a Marchenko-Pastur distribution with parameter 2. Onatski proposed a Bartlett-type correction for the observed phenomenon, based on theory developed in a recent paper (Onatski and Wang, 2017).
Computationally Intensive Methods for Estimation and Inference
A Novel Exact Method for Significance of Higher Criticism via Steck’s Determinant — Jeffrey Miecznikowski, University at Buffalo
In this work, the author proposed a new approach that applies a result generally known as Steck's determinant (Steck, 1971) to compute the significance of higher criticism statistics directly. The advantage of this method is that it allows one to assess higher criticism significance exactly, without simulation or asymptotic approximations. The approach is particularly useful when higher criticism is applied in relatively low dimensions. Such a scenario can be found in GWAS, where higher criticism is commonly used for signal detection within a gene that contains relatively few SNPs.
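For context, the higher criticism statistic itself is the largest standardized gap between the sorted p-values and the uniform distribution, taken over the smallest fraction of p-values (in Donoho and Jin's formulation). A minimal sketch with synthetic p-values:

```python
import numpy as np

def higher_criticism(pvals, alpha0=0.5):
    """Higher criticism: max standardized deviation of the empirical p-value
    distribution from uniform, over the smallest alpha0-fraction of p-values."""
    p = np.sort(pvals)
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    return np.max(hc[i <= int(alpha0 * n)])

# under the global null, p-values are uniform; a few small p-values from
# weak signals inflate the statistic
rng = np.random.default_rng(0)
null_p = rng.uniform(size=100)
mixed_p = np.concatenate([rng.uniform(size=95),
                          rng.uniform(0, 1e-3, size=5)])
hc_null = higher_criticism(null_p)
hc_mixed = higher_criticism(mixed_p)
```

Computing the statistic is the easy part; the paper's contribution is evaluating the exact null distribution of this maximum, via Steck's determinant, for the significance calculation.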
Hypothesis Testing when Signals are Rare and Weak — Ke Zheng, University of Chicago
In a linear regression model, ranking variables is a hard problem. This work (Ke and Yang, 2017) provides a novel approach in which the basic idea is first to use the design matrix to construct a sparse graph and then to use the graph to guide the ranking. The algorithm is easy to use and almost tuning-free (the tuning parameter appears only in the construction of the graph). The method is effective and provides much better ROC curves than other ranking methods.