Ben Wellington Talks AI, Data Science, and Improv on The Gradient Podcast

Ben Wellington joins The Gradient podcast to discuss his career at the intersection of finance and AI, along with key lessons he's learned on the journey.

Ben Wellington, Two Sigma’s Deputy Head of Feature Forecasting, recently appeared on The Gradient podcast with Daniel Bashir, host and editor of The Gradient, which shares stories and lessons from the AI community across disciplines.

Ben and Bashir discussed Ben’s career trajectory, along with some of the most interesting lessons he’s learned in his many years working as a data scientist and machine learning expert in investment management.

From academia to industry

Ben begins by recounting his transition from academia to industry. While completing his graduate studies in natural language processing (NLP) at New York University, he found himself uninspired by what many of the leading labs in the NLP space were doing. Then, in 2007, he interviewed at Two Sigma.

“It was a small company back then, maybe 150 people-ish,” he says. “And I had never thought about finance even a little bit. But when I met this group of people, I was just really excited about the diverse problems, the datasets…” Seventeen years later, Ben is still working at Two Sigma, his first job out of grad school.

NLP and featurization in finance

Examining the intersection of NLP, other machine learning techniques and finance, Ben highlights the critical role of “features”—relevant, novel, and useful pieces of information in a dataset—in building predictive models.

Using the hypothetical example of corporate earnings conference calls, Ben explains how even subtle elements, such as the frequency of “ums” (or other filler words) by a CEO, the tone of questions asked, or the number of people listening to a call, might have predictive value, and therefore serve as valuable features. Having “featurized” this information, he explains, “you get this really, really rich data set of conference calls, where it’s not just a bunch of calls, but rather a bunch of characteristics, ideas, hypothesis-driven approaches to these calls. And from there, you can start to feed those to algorithms.”
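To make the idea of featurization concrete, here is a minimal Python sketch that turns a single call transcript into a handful of numeric features. It is an illustration only, not Two Sigma's method: the function name `featurize_call`, the filler-word list, and the sample transcript are all hypothetical, and counting question marks is a crude stand-in for analyzing the questions asked.

```python
import re

# Hypothetical filler-word list; the episode only mentions "ums" and similar words.
FILLER_WORDS = {"um", "uh", "er"}


def featurize_call(transcript: str, listener_count: int) -> dict:
    """Turn one call transcript into a small dictionary of numeric features."""
    # Lowercase the text and extract word tokens (letters and apostrophes).
    tokens = re.findall(r"[a-z']+", transcript.lower())
    filler_count = sum(1 for token in tokens if token in FILLER_WORDS)
    return {
        "word_count": len(tokens),
        "filler_count": filler_count,
        # Fillers per word, guarding against an empty transcript.
        "filler_rate": filler_count / len(tokens) if tokens else 0.0,
        # Crude proxy for how many questions were asked on the call.
        "question_count": transcript.count("?"),
        "listener_count": listener_count,
    }


# Made-up transcript and listener count, for illustration only.
features = featurize_call(
    "Um, we expect, uh, growth next quarter. Any questions?", 120
)
```

Each call becomes one row of such features, and a table of these rows is the kind of input that could then be fed to a predictive algorithm.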

Challenges in applying machine learning

Ben goes on to outline some of the key challenges of using machine learning in finance. He highlights, for example, the issue of data saturation: as more people adopt the same techniques and data, older alpha forecast models gradually become less effective.

He also touches on the potential trade-offs between the performance of ML models and their interpretability (i.e., understanding how a model arrives at the conclusions it does)—a very active area of research.

Comparing the challenges of interpretability in certain ML contexts to similar issues in medicine, he notes that scientists and doctors began using general anesthesia long before they understood precisely how it works.

“And yet here we are as a society, comfortable using it, because it’s important…we’ve measured it enough times…we’re going to be willing to move ahead with it.” With enough testing, supervision, and risk controls, he argues, scientists can be comfortable using machine learning models, too, even if they aren’t perfectly interpretable.

The relevance of time in financial data analysis

Later in the episode, Ben and Bashir turn their attention to the importance of time in the context of data science. Specifically, Ben points out, data scientists trying to forecast the future need to know exactly when each piece of information in a dataset would have been known. They need to be confident that the timestamps for each data point are accurate. As it turns out, however, timestamps have often been retroactively added to (or altered in) many of the datasets available commercially today.

Knowing when this is the case is critical, says Ben, “because if the data wasn’t actually in the world [when you think it was], then the market couldn’t have responded.” In other words, a model could seem more performant than it really is, simply because of subtly misleading data. The best solution, he adds, is to try to record as much data as one can in real time.
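The timestamp problem can be sketched in a few lines of Python. The example below assumes each record carries two timestamps: the `event_time` a vendor reports and the `recorded_time` at which the data was actually captured. Both field names, the helper `known_as_of`, and the sample records are hypothetical, chosen only to show how a backfilled record can leak into a backtest.

```python
from datetime import datetime

# Hypothetical records. The second one was backfilled: the vendor stamps it
# January 2, but it did not actually arrive until January 10.
records = [
    {"value": 1.0, "event_time": datetime(2020, 1, 1),
     "recorded_time": datetime(2020, 1, 1)},
    {"value": 2.0, "event_time": datetime(2020, 1, 2),
     "recorded_time": datetime(2020, 1, 10)},
]


def known_as_of(rows, as_of):
    """Return only the rows that had actually been recorded by `as_of`."""
    return [row for row in rows if row["recorded_time"] <= as_of]


as_of = datetime(2020, 1, 5)

# Trusting the vendor's timestamp lets the backfilled record leak in.
naive = [row for row in records if row["event_time"] <= as_of]

# Filtering on the recorded time keeps only what was really available.
point_in_time = known_as_of(records, as_of)
```

Here the naive filter returns both records, while the point-in-time filter returns only the first. A model tested on the naive view would appear to know something on January 5 that the market could not have seen until January 10.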

The importance of storytelling with data

As the podcast draws to a close, Ben discusses his data science blog, I Quant NY, where he tells stories with data about New York City. He also shares lessons he’s learned through one of his favorite hobbies, improv.

For example, he highlights the value of focusing on specific details or peculiarities in data to bring larger narratives more vividly to life. He also emphasizes the importance of building on others’ ideas when collaborating. “I like to think of it through the lens of ‘yes-AND-ing’ each other,” he says, alluding to one of the cardinal rules of improv. “Because it fundamentally is taking someone’s idea, and then building it up… And it’s a really core skill I’ve taken from improv and brought to my professional career and work.”

Tune in to the podcast

To catch Ben’s entire conversation on The Gradient podcast, listen to the episode here.
