Introducing Smooshr: A Quick and Friendly Way to Clean Messy Datasets

Over the last six years, we have worked with organizations spanning multiple domains, geographies, and levels of technical sophistication, collaborating one-on-one with our partners on bespoke solutions. Many elements of our approach recur across projects, however, so we started producing open source tools that could scale our methodology to the broader public (you might have already heard about two of our other tools, scout and NewerHoods).

In our latest blog post, we discussed our project with Vera Institute of Justice, which involved consolidating open 911 data across five US cities: Charleston, Dallas, Detroit, New Orleans, and Seattle. Before we dove into any intra- or cross-city exploratory analyses, we spent a significant amount of time wrangling the data into shape. That included breaking down 2,683 Call for Action (CFA) codes, which document the reason for the call, into 24 graspable concepts.

We learned a lot during that project, including the fact that we didn’t want to consolidate 2,683 CFA codes ever again. Well, at least not that way. We knew we weren’t the only ones who had tackled the terror of taxonomies, and we knew there had to be a better way. Perhaps a more pleasant method involving less manual maneuvering. And while data scientists love to hate data cleaning, maybe they’d hate it a little less?

While some solutions exist for processing numerical data, cleaning text data is a lengthy and painstaking process even for those who live and breathe regular expressions. It also requires substantial resources and bandwidth that mission-driven organizations might simply not have.

That was the inspiration for our newest open source tool, smooshr: a user-friendly web app that data scientists and the broader public alike can easily use.

Just upload or point to the data you want to clean and start consolidating column entries through a point-and-click interface. Smooshr runs locally on your machine, so we won’t have access to the data you are working with. No coding necessary — although it does spit out reproducible code for your ETL pipeline for those who want it!

Consolidating entities for Call for Action codes with smooshr
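For readers who want a feel for what entity consolidation looks like once it lands in an ETL pipeline, here is a minimal sketch in plain Python. It is not the code smooshr generates. The codes, category names, and the `normalize` and `consolidate` helpers are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical mapping from raw call-type codes to consolidated categories.
# These entries are illustrative and are not taken from the 911 project data.
CONSOLIDATION_MAP = {
    "TRAFFIC ACC": "Traffic",
    "TRAFFIC ACCIDENT": "Traffic",
    "MVA": "Traffic",
    "NOISE COMPLAINT": "Noise",
    "LOUD MUSIC": "Noise",
}


def normalize(value):
    """Trim the value, collapse repeated whitespace, and uppercase it."""
    return " ".join(value.split()).upper()


def consolidate(values, mapping, default="Other"):
    """Map each raw value to its category, falling back to `default`."""
    return [mapping.get(normalize(v), default) for v in values]


raw = ["traffic acc", " MVA ", "Loud  Music", "noise complaint", "UNKNOWN CODE"]
clean = consolidate(raw, CONSOLIDATION_MAP)
counts = Counter(clean)  # how many raw entries fell into each category
```

Keeping the mapping as plain data, separate from the function that applies it, is one way to make a cleaning recipe easy to share and rerun on new data.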

Check out smooshr to see how this open source tool makes it easy to create and share recipes for cleaning unruly data, with a specific focus on entity consolidation and standardization within and across datasets. We also invite you to check out its repo on GitHub, provide feedback, and help with future development efforts!


This article is not an endorsement by Two Sigma of the papers discussed, their viewpoints or the companies discussed. The views expressed above reflect those of the authors and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Two Sigma makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever.