The incidence of opioid overdose has increased sharply in the United States over the past few years, and unfortunately this trend is expected to continue. As a result, governments at all levels are marshalling resources to combat the epidemic, and we set out to build a tool to assist in their efforts.
The particular use case we address is the difficulty public officials face in coordinating disparate resources and activities. Currently, a public official may be looking at the density of overdoses and planning interventions accordingly, while other agencies, organizations, and individuals attempt interventions of their own. If high-risk areas could be distinguished from low-risk areas with a high level of specificity, the public official could coordinate available resources accordingly. Armed with this information, she could empower community stakeholders to intervene at the appropriate level (via a companion web/mobile application, for example) while simultaneously measuring the impact of her own interventions over time, all in one place.
A web application or dashboard would give a public official the ability to crystallize massive amounts of information into manageable and actionable chunks, greatly increasing her chances of producing meaningful and sustainable outcomes.
Before discussing the specific data itself, it is important to briefly mention the logic we used to identify data that might be useful. First, we thought about broad categorical data that might have an impact on our dependent variable (the number of overdoses). These types of categorical data might include distance-related variables (the distance between overdoses and other phenomena), density variables (frequency of potentially related phenomena), neighborhood-related variables (block-level information from census surveys), and environmental variables (whether in the built environment or otherwise).
From this point, we set out to identify specific variables within these categories from openly available sources. We used 311 open data provided by the City of Cincinnati to identify variables ranging from crime and public health calls to reported instances of tall grass, graffiti, and abandoned vehicles. We similarly located highways and bus stops. We used Google Earth Engine to identify features like gas stations, methadone clinics, and package (liquor) store locations. We used Social Explorer (www.socialexplorer.com) to find ACS 2015 census survey data for Hamilton County (where Cincinnati is located). Lastly, we used shapefiles of Hamilton County and Cincinnati that are also readily available from open sources.
For the 311 data, we filtered according to the desired date range and description and exported the results to .csv files. We performed similar exports for our Google Earth Engine query results. We loaded each of these files into Esri's ArcGIS Desktop and performed distance and density calculations. For our census data, we cleaned the data in Excel (calculating percentages and removing extraneous absolute numbers) and imported it into ArcGIS as well.
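We did this filtering through the open data portal and Excel, but the same step could be sketched in R for readers who prefer to stay in one environment. The file name and the column names (REQUESTED_DATETIME, SERVICE_NAME) are assumptions about the 311 export:

library(dplyr)
library(lubridate)

# Hypothetical file and column names for the exported 311 data
calls_311 <- read.csv("cincinnati_311.csv", stringsAsFactors = FALSE)

tall_grass <- calls_311 %>%
  mutate(request_date = as.Date(ymd_hms(REQUESTED_DATETIME))) %>%
  filter(request_date >= ymd("2015-07-01"),
         request_date <= ymd("2017-11-30"),
         grepl("tall grass", SERVICE_NAME, ignore.case = TRUE))

write.csv(tall_grass, "tall_grass_requests.csv", row.names = FALSE)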
Whether using R or ArcGIS, however, it is important to ensure that all files are properly projected (i.e., in the same projection). Similarly, because we would be working with raster files, we made sure to set the processing extent at the appropriate level (we used our Cincinnati shapefile). The importance of these steps can't be overstated! Next, we performed a number of operations to generate new variables: Euclidean distance calculations, average distance to the X nearest neighbors, kernel densities, and so on.
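ArcGIS handled the projection work for us, but for anyone reproducing this in R, a minimal sketch of the projection check and one of the distance operations, using the sf package, might look like the following (the shapefile names are placeholders):

library(sf)

# Placeholder file names; the key point is that every layer ends up in one projected CRS
cincinnati <- st_read("cincinnati_boundary.shp")
overdoses  <- st_read("overdose_points.shp")
bus_stops  <- st_read("bus_stops.shp")

target_crs <- st_crs(cincinnati)              # treat the boundary's projection as the reference
overdoses  <- st_transform(overdoses, target_crs)
bus_stops  <- st_transform(bus_stops, target_crs)

# One example of a distance variable: distance from each point to its nearest bus stop
dist_matrix <- st_distance(overdoses, bus_stops)
overdoses$dist_to_bus <- apply(dist_matrix, 1, min)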
We created a grid (a.k.a. fishnet), which we describe later, so that we could identify the values of the variables we are exploring at this localized, granular level. We used the Zonal Statistics as Table tool in ArcGIS to aggregate our density and other raster layers to the grid.
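In R, a comparable zonal aggregation could be done with the exactextractr package rather than ArcGIS; the raster and fishnet file names below are placeholders:

library(sf)
library(raster)
library(exactextractr)

# Placeholder names: a kernel density raster and the fishnet polygons
density_raster <- raster("kernel_density_layer.tif")
fishnet        <- st_read("fishnet.shp")

# Mean raster value per grid cell, analogous to ArcGIS's Zonal Statistics as Table
fishnet$density_mean <- exact_extract(density_raster, fishnet, "mean")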
To wrangle the final dataset, we simply joined all of our variables to the grid (fishnet), and exported the attribute table as a .txt file that we could import into R.
Overdose is a rare event, meaning that we would expect to see many areas with no cases at all. Numerically, if we aggregated the observed incidences of overdose for some unit of area, we would expect to see a very large number of areas with zero values, fewer with one, and even fewer with two, and so forth (in a non-linear fashion).
We created grid cells using the Create Fishnet tool in ArcGIS, specifying a cell size of 500 ft by 500 ft, and we spatially joined the grid to our point data of overdoses.
Here, you’ll see an image of the frequency distribution that should resemble a Poisson Distribution.
(Note: we chose a 500 ft by 500 ft cell for our particular use case: allocating resources such that a "rapid response" would be reasonable. At just over 1.5 square football fields, we thought that someone could seek and retrieve help in a timely fashion within this area, and, more importantly, that a public official could deploy resources at this scale if coordinating with other stakeholders. A different use case may call for a different grid size.)
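We built the fishnet and spatial join in ArcGIS, but a rough equivalent in R with the sf package might look like the sketch below (file names are placeholders, and both layers are assumed to share a projected CRS measured in feet):

library(sf)

cincinnati <- st_read("cincinnati_boundary.shp")
overdoses  <- st_read("overdose_points.shp")

# 500 ft x 500 ft grid, keeping only cells that intersect the city boundary
fishnet <- st_sf(geometry = st_make_grid(cincinnati, cellsize = c(500, 500)))
fishnet <- fishnet[cincinnati, ]

# Count overdose points in each cell and inspect the frequency distribution
fishnet$od_count <- lengths(st_intersects(fishnet, overdoses))
hist(fishnet$od_count)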
The code below will allow you to generate a sample Poisson distribution. The resulting histogram can be used to compare against the frequency distribution you observed when examining the statistics option of the combined fishnet and overdose incidences above. The shapes of these distributions should look similar.
set.seed(612312)
n <- 1000
x <- runif(n)              # simulate a predictor on [0, 1]
lambda <- exp(x)           # Poisson mean via the log link
y <- rpois(n, lambda)      # simulated counts
hist(y)                    # histogram to compare with the observed distribution
As seen in the image above, our combined fishnet and overdose incidences have a distribution that approximates a Poisson distribution.
Next, we proceeded to explore our independent variables (predictors). One attribute of particular interest was the spatial distribution of our predictors. For example, having a number of predictors all concentrated within the same one-mile radius would not be ideal.
Below, we included a few of these predictors.
Distance to Suicide: Exploratory Analysis
Distance to Parks: Exploratory Analysis
It is important to note that we are not looking for the “perfect” variable, per se. Rather, we are looking for plausible associations whose spatial distributions resemble those of large concentrations of overdoses. This way, when we begin predicting incidences of overdose, our model will be generalizable to unseen data. Also, it is not essential to create density maps in the exploratory stage (looking at points would provide similar clues).
Satisfied that our predictors are not all located in the same place and varied enough to preclude obvious redundancy, we export our fishnet (with all of our variables joined) to a .csv file, which we import into R for our regression analysis.
We estimated a series of Poisson regressions, removing insignificant variables and trying different combinations, and chose the model whose fitted values tracked the observed values most closely. A summary of the results is included below.
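A minimal sketch of this modelling step is below. The exported file name and the name of the count column are assumptions (the summary that follows labels the dependent variable Train_Full), while the predictor names match the results table:

library(stargazer)

fishnet_data <- read.csv("fishnet_with_variables.csv")   # exported attribute table

od_model <- glm(od_count ~ Den_AbVeh + Den_Ass + Dis_PkgStr + Dis_PerDwn +
                  Dis_BusStp + Dis_PolSta + Dis_School + Dis_Grocer +
                  Dis_FireSt + Dis_CBD + Dis_VacBus + Dis_PubLib +
                  OD_Hless_Bu + Metho_Buff + PckS_Buff + EU_GasSt,
                family = poisson, data = fishnet_data)

stargazer(od_model, type = "text")   # produces a text summary like the one below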
##
## =============================================
## Dependent variable:
## ---------------------------
## Train_Full
## Regression Results
## ---------------------------------------------
## Den_AbVeh 0.0018***
## (0.0002)
##
## Den_Ass 0.0013***
## (0.0001)
##
## Dis_PkgStr 0.00002***
## (0.00001)
##
## Dis_PerDwn -0.0008***
## (0.00003)
##
## Dis_BusStp -0.0007***
## (0.00005)
##
## Dis_PolSta -0.00002*
## (0.00001)
##
## Dis_School 0.00002***
## (0.000005)
##
## Dis_Grocer 0.0001***
## (0.00001)
##
## Dis_FireSt -0.0001***
## (0.00001)
##
## Dis_CBD -0.00001***
## (0.000003)
##
## Dis_VacBus -0.0003***
## (0.00003)
##
## Dis_PubLib 0.00005***
## (0.00001)
##
## OD_Hless_Bu 0.0393***
## (0.0053)
##
## Metho_Buff 0.0945***
## (0.0056)
##
## PckS_Buff -0.0616***
## (0.0052)
##
## EU_GasSt -0.00003***
## (0.000004)
##
## Constant 0.8429***
## (0.0646)
##
## ---------------------------------------------
## Observations 8,892
## Log Likelihood -6,390.2040
## Akaike Inf. Crit. 12,814.4100
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
To get a sense of how important each predictor in the multiple regression is, we standardized the coefficients. A standardized coefficient tells us how many standard deviations the dependent variable is expected to change for a one-standard-deviation increase in a given predictor. Standardizing our beta coefficients lets us gauge which predictors have the greatest effect on the dependent variable, which is especially useful when the variables are measured in different units (as is the case in our analysis).
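One common convention is to multiply each raw coefficient by the standard deviation of its predictor; a sketch of that calculation, reusing the model object from the earlier sketch, is below:

# Multiply each raw coefficient (excluding the intercept) by its predictor's standard deviation
coefs <- coef(od_model)[-1]
sds   <- sapply(model.frame(od_model)[names(coefs)], sd)

standardized <- data.frame(std_coefficient = coefs * sds,
                           variable        = names(coefs))
standardized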
## std_coefficient variable
## Den_AbVeh 0.07970605 Den_AbVeh
## Den_Ass 0.07575260 Den_Ass
## Dis_PkgStr 0.08081238 Dis_PkgStr
## Dis_PerDwn -0.83344079 Dis_PerDwn
## Dis_BusStp -0.66794119 Dis_BusStp
## Dis_PolSta -0.07439635 Dis_PolSta
## Dis_School 0.09501846 Dis_School
## Dis_Grocer 0.27419003 Dis_Grocer
## Dis_FireSt -0.27310345 Dis_FireSt
## Dis_CBD -0.06574281 Dis_CBD
## Dis_VacBus -0.30480371 Dis_VacBus
## Dis_PubLib 0.24840593 Dis_PubLib
## OD_Hless_Bu 0.02929227 OD_Hless_Bu
## Metho_Buff 0.07812705 Metho_Buff
## PckS_Buff -0.03810419 PckS_Buff
## EU_GasSt -0.16087831 EU_GasSt
We then take the absolute value of the standardized coefficients to observe the overall, non-directional impact of each predictor. To give a visual sense of the predictors' impact relative to one another, the image below illustrates what is numerically described in the table above.
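One way to produce such a chart, reusing the standardized table from the sketch above, is with ggplot2:

library(ggplot2)

standardized$abs_coefficient <- abs(standardized$std_coefficient)

ggplot(standardized, aes(x = reorder(variable, abs_coefficient), y = abs_coefficient)) +
  geom_col() +
  coord_flip() +
  labs(x = "Predictor", y = "Absolute standardized coefficient")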
From the outset, we thought that we would use a backward-stepwise, “kitchen-sink” approach: running a regression using all of the variables we compiled and removing insignificant variables step by step. We would then partition the data into training and test sets for various time periods to assess the robustness (generalizability) of our model.
Given the amount of data available (just under two years' worth), we split the data into several training and test sets. Our first training set covered July 2015 to July 2017, with a test set from August 2017 to November 2017. We then trained our model on data from January 2016 to November 2016 and tested it on the same period the following year (January 2017 to November 2017).
We found that our model was, indeed, generalizable, missing by less than 0.25 overdoses on average per 500 ft by 500 ft grid cell in Cincinnati (Mean Absolute Error = 0.25).
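A sketch of the evaluation step, assuming the test-period overdose counts have been aggregated to the same fishnet (the file, data frame, and column names are placeholders):

# Placeholder file holding the test-period counts joined to the same grid
test_fishnet <- read.csv("fishnet_test_period.csv")

# Predicted counts for the held-out period, then the Mean Absolute Error
test_fishnet$predicted <- predict(od_model, newdata = test_fishnet, type = "response")

mae <- mean(abs(test_fishnet$od_count - test_fishnet$predicted))
mae   # the text above reports roughly 0.25 overdoses per grid cell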
Kernel Density: The Current Method
As might be surmised from the image above, we would expect kernel density to do a fairly reasonable job of predicting overdoses. Our goal, however, is to do a better job. To evaluate the performance of our model, we validated our predictions as follows:
The same process must be done to measure the kernel density of overdoses.
Finally, we join the kernel.dbf table to the fittedvalues.dbf table.
These steps must be performed each time new training and test data sets are selected. The table below is the final comparison we generated, which we then visualized in R.
## GRIDCODE kernelCnt fittedCnt Category kernelPct fittedPct
## 1 1 56 125 90% - 100% 0.0672 0.1504
## 2 2 112 98 70% - 89% 0.1345 0.1179
## 3 3 88 110 50% - 69% 0.1056 0.1324
## 4 4 131 144 30% - 49% 0.1573 0.1733
## 5 5 446 354 1% - 29% 0.5354 0.4260
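For the visualization, a sketch along these lines would work, assuming the comparison table above is stored in a data frame named comparison:

library(ggplot2)
library(tidyr)

# Reshape so kernel-density and model shares can be plotted side by side per risk category
comparison_long <- pivot_longer(comparison,
                                cols = c(kernelPct, fittedPct),
                                names_to = "method", values_to = "share")

ggplot(comparison_long, aes(x = Category, y = share, fill = method)) +
  geom_col(position = "dodge") +
  labs(x = "Predicted risk category", y = "Share of observed overdoses")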
Given our use case, we are primarily concerned with the number of high-risk areas that we are able to correctly predict. In the table above, you can see that, in the areas where someone is most likely to overdose, we do a better job of predicting the number of overdoses than might be expected from a traditional kernel density approach.
As stated above, we think that a public health official could use our model to more efficiently allocate scarce resources to areas with the greatest need. Mapping the risk areas according to our predictions would provide a quick and detailed view of the areas most likely to require resources. This is exactly what we set out to do!
Our companion application would allow the model to be updated at regular intervals. In this way, the public official can also keep track of how risk in different areas changes over time (perhaps in relation to various interventions).
The first thing we can do to improve our model is to secure more data. While we think our results are compelling, more data would allow us to refine the model's predictive capacity. In particular, data on pharmacy locations and the volume of prescriptions dispensed are clear opportunities to add pertinent information that our model currently lacks. Second, adding non-public data about interventions may also improve the model.
Additionally, we might select a shorter time window for our training and test data when estimating the final model. We might also incorporate time of day into the model and attempt to predict along that dimension as well, since our public official might need to know when, not just where, resources should be deployed.
Lastly, our model could be improved by consulting with public officials to assess whether our grid-cell (fishnet) size is actionable given the resources at hand. Though the area made sense to us in the abstract, it may not be practical in reality. A practitioner could provide clarity on this point.
Description of Variables:
Our companion app concept can be found here: https://www.youtube.com/watch?v=8lnI1aOmD7c
Thanks for reading!