In June 2015, the Environmental Defense Fund joined with the Two Sigma Data Clinic to use oil and gas well inspection data to create a preliminary predictive model for violations.
The U.S. oil and gas industry has seen massive growth since the early 2000s, but regulatory enforcement has struggled to keep up with drilling. State agencies—whose job is to inspect well sites for environmental and safety violations—have limited resources to do their work at scale.
“It is almost like firefighters trying to put out a five-alarm fire with a 20-foot garden hose,” said Pennsylvania Auditor General Eugene DePasquale in 2014. DePasquale’s audit of the state’s Department of Environmental Protection (DEP) called the agency “unprepared … understaffed and underfunded” to take on its increased oversight responsibilities in the wake of Pennsylvania's shale gas boom.
The Environmental Defense Fund (EDF), an advocacy group that has studied the impact of oil and gas activities nationwide, wanted to explore the potential of data to help with the “too many wells, too few inspectors” problem. Specifically, information about past inspections could be used to identify well sites at risk of a future violation, helping regulators prioritize the order of their inspections.
In June 2015, EDF joined forces with the Two Sigma Data Clinic to look into the data and to create a preliminary predictive model for violations.
The first hurdle was the limited amount of available information. Predicting future violations requires data on all well inspections, not just the ones that have incurred violations. Out of the more than thirty states that have active oil and gas wells, only two, Colorado and Pennsylvania, post publicly available data on inspections, according to a 2015 report by the Natural Resources Defense Council and FracTracker Alliance. A third, West Virginia, had data only on violations.
The Data Clinic focused on Pennsylvania, since the state’s electronic database provided relatively better-structured information, listing alleged violations under set categories rather than just in free-form text. The project team used data on nearly 190,000 oil and gas well inspections in Pennsylvania from 2008 through 2014.
Although Pennsylvania’s DEP has increased the number of well inspections nearly every year since 2008, the number of reported violations decreased between 2011 and 2014. The inspector “hit rate”—the proportion of inspections resulting in at least one violation—was also lower in 2014 than in 2011.
Smaller well operators tended to have a higher “violation rate” (defined as the number of violations per inspection), the analysis found. Additionally, public complaints typically resulted in the discovery of more violations. When compared to the average of all inspections, complaint-driven inspections led to nearly three times the number of violations, and nearly five times the penalty amount.
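The two metrics defined above can be made concrete with a short sketch. The `Inspection` record and the sample values are hypothetical and serve only to illustrate the definitions; they are not drawn from the Pennsylvania dataset.

```python
from dataclasses import dataclass


@dataclass
class Inspection:
    """Hypothetical record: one inspection and the violations it found."""
    year: int
    violations: int


def hit_rate(inspections):
    """Proportion of inspections that resulted in at least one violation."""
    if not inspections:
        return 0.0
    hits = sum(1 for i in inspections if i.violations > 0)
    return hits / len(inspections)


def violation_rate(inspections):
    """Number of violations per inspection."""
    if not inspections:
        return 0.0
    return sum(i.violations for i in inspections) / len(inspections)


# Made-up sample: four inspections, two of which found violations.
sample = [
    Inspection(2011, 0),
    Inspection(2011, 3),
    Inspection(2011, 1),
    Inspection(2011, 0),
]

print(hit_rate(sample))        # 0.5 (2 of 4 inspections had a violation)
print(violation_rate(sample))  # 1.0 (4 violations across 4 inspections)
```

The two numbers can diverge: a single inspection with many violations raises the violation rate but counts only once toward the hit rate.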
Preliminary correlation analyses at the county level showed that past violations were (unsurprisingly) suggestive of future violations; past violation rates were highly positively correlated with inspector hit rates. In the other direction, counties with wealthier and more educated residents were less likely to have well sites with high inspector hit rates.
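A county-level comparison of this kind might look like the sketch below. The article does not say which correlation measure was used, so the Pearson coefficient here is an assumption, and the county figures are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical per-county values (not real data).
past_violation_rate = [0.10, 0.20, 0.30, 0.40]
current_hit_rate = [0.05, 0.08, 0.12, 0.20]

r = pearson(past_violation_rate, current_hit_rate)
print(round(r, 3))  # positive: higher past rates go with higher hit rates
```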
The predictive model included only routine, weekday inspections, rather than complaint-driven inspections or other non-random inspections, like those in response to emergencies. The goal was to maximize the inspector hit rate while minimizing model complexity.
The model that did the best job of balancing predictive power with simplicity had just two variables: operator inspection counts for the past year (negatively correlated with hit rates) and the number of environmental violations the site incurred during its last inspection (positively correlated with hit rates).
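To show how two such variables could combine into a single risk score, here is a minimal sketch. The article does not name the model family, so the logistic form is an assumption, and the coefficient values are hypothetical; only their signs follow the directions reported above.

```python
import math

# Hypothetical coefficients. Only the signs reflect the article:
# more operator inspections -> lower risk, more past violations -> higher risk.
INTERCEPT = -1.0
COEF_OPERATOR_INSPECTIONS = -0.01
COEF_LAST_ENV_VIOLATIONS = 0.8


def violation_probability(operator_inspections_past_year,
                          env_violations_last_inspection):
    """Assumed logistic score: estimated chance an inspection finds a violation."""
    z = (INTERCEPT
         + COEF_OPERATOR_INSPECTIONS * operator_inspections_past_year
         + COEF_LAST_ENV_VIOLATIONS * env_violations_last_inspection)
    return 1.0 / (1.0 + math.exp(-z))


# A rarely inspected operator whose site had violations last time ...
high_risk = violation_probability(5, 3)
# ... versus a frequently inspected operator with a clean last inspection.
low_risk = violation_probability(200, 0)
print(high_risk > low_risk)  # True
```

Ranking sites by a score like this is what would let an agency decide which inspections to schedule first.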
To evaluate the model’s performance, the Data Clinic tested it on out-of-sample inspection data (data that the model had never seen) by examining the well sites that had the highest probability of incurring a violation, according to the model. In this test, 80% of inspections with violations were found among just the top 37% of inspections, ranked by predicted probability of a violation.
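The evaluation described above amounts to ranking held-out inspections by predicted risk and asking how far down the list one must go to find a given share of the violations. The sketch below illustrates that calculation; the function name, scores, and outcomes are hypothetical and are not the Data Clinic's code or data.

```python
def fraction_needed_to_capture(scores, had_violation, target=0.8):
    """Smallest top-ranked fraction of inspections that contains at least
    `target` of all inspections with a violation."""
    total_hits = sum(had_violation)
    if total_hits == 0:
        raise ValueError("no violations in the evaluation set")
    # Rank inspections from highest to lowest predicted probability.
    ranked = sorted(zip(scores, had_violation),
                    key=lambda pair: pair[0], reverse=True)
    captured = 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        captured += hit
        if captured >= target * total_hits:
            return rank / len(ranked)
    return 1.0


# Made-up held-out set: ten inspections, five with a violation (1 = violation).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
had_violation = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

print(fraction_needed_to_capture(scores, had_violation))  # 0.6
```

In this toy set, 80% of the violations sit in the top 60% of ranked inspections. The article's reported result corresponds to this function returning about 0.37 on the real held-out data.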
Prioritizing sites that had many violations when last inspected and whose operators have had few inspections over the past year may seem obvious, but it could potentially go a long way in improving inspection efficiency.
Of course, a predictive model is only as good as its input data. At the time of the Data Clinic project, information on Pennsylvania’s oil and gas well inspections was accessible only by repeatedly querying an online form, which appeared to be inconsistently updated. Inspections were sometimes added years after they had reportedly occurred, and some alleged violations had been deleted from the dataset in subsequent releases, the Data Clinic found. Part of the issue could be that the DEP’s primary data management system, created in the 1990s, is “no longer supported by the company that created it,” according to a 2016 report by NPR.
There have been some changes since the conclusion of the Data Clinic project. Pennsylvania’s DEP staff, who had previously recorded their inspections on paper forms, have been conducting them electronically with an iPad app as of March 2017. The DEP now has a mapping site that visualizes well locations and inspection reports. The state’s new open data portal, launched in late 2016, contains a dataset on well inspections from July 2015 onwards—violations, however, have yet to be uploaded.
Although there is still a long way to go before getting to a final predictive model, the Data Clinic’s work provided a proof-of-concept to EDF by illustrating that data can help environmental advocates better understand how to improve the effectiveness of oil and gas well inspections. In 2017, the organization announced an initiative dedicated to “democratizing data in the oil and gas industry,” in hopes of bringing about greater industry accountability through increased data transparency in Pennsylvania and nationwide.