Over the last six years, we have worked with organizations spanning multiple domains, geographies, and levels of technical sophistication, collaborating one-on-one with our partners on bespoke solutions. We’ve learned a lot along the way. In particular, many elements of our approach crop up across projects, so we started producing open-source tools that could scale our methodology to the broader public (you might have already heard of two of our other tools: scout and NewerHoods).
In our latest blog post, we discussed our project with Vera Institute of Justice, which consisted of consolidating open 911 data across five US cities: Charleston, Dallas, Detroit, New Orleans, and Seattle. Before we dove into any within-city or cross-city exploratory analyses, we spent a significant amount of time wrangling the data into shape. That included breaking down 2,683 Call for Action (CFA) codes, which document the reason for each call, into 24 graspable concepts.
We learned a lot during that project, including the fact that we didn’t want to consolidate 2,683 CFA codes ever again. Well, at least not that way. We knew we weren’t the only ones who had tackled the terror of taxonomies, and we knew there had to be a better way: perhaps a more pleasant method involving less manual maneuvering. And while data scientists love to hate data cleaning, maybe they’d hate it a little less?
While some solutions exist for processing numerical data, cleaning text data is a lengthy and painstaking process even for those who live and breathe regular expressions. On top of that, the effort requires substantial resources and bandwidth that mission-driven organizations might simply not have.
That was the inspiration for our newest open-source tool, smooshr: a user-friendly web app that data scientists and the broader public alike can easily use.
Just upload or point to the data you want to clean and start consolidating column entries through a point-and-click interface. Smooshr runs locally on your machine, so we won’t have access to the data you are working with. No coding necessary — although it does spit out reproducible code for your ETL pipeline for those who want it!
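To give a feel for what a consolidation recipe might look like once it lands in an ETL pipeline, here is a minimal Python sketch. It is not smooshr’s actual generated code: the `RECIPE` structure, the `build_lookup` and `consolidate` helpers, and the example entries and categories are all hypothetical, invented purely for illustration.

```python
# Hypothetical recipe: each consolidated category lists the raw entries
# folded into it. Entries and categories are invented for illustration;
# they are not real CFA codes and not smooshr's output format.
RECIPE = {
    "Traffic": ["TRAFFIC STOP", "traffic accident", "Traf Acc"],
    "Noise": ["NOISE COMPLAINT", "loud music"],
}


def build_lookup(recipe):
    """Invert the recipe into a raw-entry -> category lookup.

    Keys are normalized (trimmed, lowercased) so matching ignores
    case and surrounding whitespace.
    """
    lookup = {}
    for category, entries in recipe.items():
        for entry in entries:
            lookup[entry.strip().lower()] = category
    return lookup


def consolidate(values, recipe, default="Other"):
    """Map each raw value to its category, or to `default` if unmapped."""
    lookup = build_lookup(recipe)
    return [lookup.get(value.strip().lower(), default) for value in values]


raw_column = ["Traffic Stop", " loud music ", "TRAF ACC", "welfare check"]
cleaned = consolidate(raw_column, RECIPE)
print(cleaned)  # ['Traffic', 'Noise', 'Traffic', 'Other']
```

Keeping the mapping as plain data, separate from the code that applies it, is one way to make a cleaning step easy to share and rerun when new data arrives.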
Check out smooshr to experience how this open-source tool can make it easy to create and share recipes for cleaning unruly data, with a specific focus on entity consolidation and standardization within and across datasets. We also invite you to check out its repo on GitHub, provide feedback, and help with future development efforts!