Why Take On This Challenge?

The incidences of opioid overdoses have increased sharply in the United States over the past few years. Unfortunately, this trend is expected to continue. As a result, all levels of government across the country are marshalling resources to combat the epidemic, and we set out to provide a tool to assist in their efforts.

The particular use case we address pertains to the difficulty public officials face in attempting to coordinate disparate resources and activities. Currently, public officials may be looking at the density of overdoses and planning their interventions accordingly. At the same time, other agencies, organizations and individuals may be attempting interventions of their own. If there were a way to distinguish high risk areas from low risk areas with high levels of specificity, the public official could coordinate available resources accordingly. Armed with this information, the public official can both empower community stakeholders to intervene at appropriate levels (via a companion web/mobile application, for example) while simultaneously measuring the impact of her own various interventions over time–all in one location.

A web application or dashboard would give a public official the ability to crystallize massive amounts of information into manageable and actionable chunks, greatly increasing her chances of producing meaningful and sustainable outcomes.

Amassing The Data

Before discussing the specific data itself, it is important to briefly mention the logic we used to identify data that might be useful. First, we thought about broad categorical data that might have an impact on our dependent variable (the number of overdoses). These types of categorical data might include distance-related variables (the distance between overdoses and other phenomena), density variables (frequency of potentially related phenomena), neighborhood-related variables (block-level information from census surveys), and environmental variables (whether in the built environment or otherwise).

From this point, we set out to identify specific variables within these categories from openly-available sources. We used 311 open data provided by the City of Cincinnati to identify variables that range from crime and public health calls to reported instances of tall grass, graffiti and abandoned vehicles. We similarly located highways and bus stops. We used Google Earth Engine to identify variables like gas stations, methodone clinics, and package (liquor) store locations. We used Social Explorer (www.socialexplorer.com) to find ACS 2015 census survey data for Hamilton County (where Cincinnati is located). Lastly, we used shapefiles of Hamilton County and Cincinnati that are also readily available from open sources.

For 311 data, we filtered according to the desired date range and description, exporting our results to .csv files. We performed similar export functions for Google Search Engine query results. We downloaded each of these files into Esri’s ArcGis Desktop tool. and performed distance and density calculations. For our census data, we cleaned the data (performed calculations to get percentages and removed extraneous absolute numbers) using Excel and imported these data in ArcGis as well.

Whether using R or ArcGis, however, it is important to ensure that all files are properly projected (i.e. in the same projection). Similarly, we were sure to set the processing extent at the appropriate level (we used our Cincinatti shapefile), since we will be using raster files. The importance of these steps can’t be overstated! Next, we performed a number of operations to generate new variables: (Euclidean Distance Calculations, Average Distance to X nearest neighbors, Kernel Densities, etc.).

We created a grid (a.ka. fishnet), which we describe later, so that we could identify the values of the variables we are exploring at this localized, granular level. We used the zonal statistics as a table tool in ArcGis to aggregate our density and other raster layers.

To wrangle the final dataset, we simply joined all of our variables to the grid (fishnet), and exported the attribute table as a .txt file that we could import into R.

Setup and Exploratory Mapping

The distribution of Overdoses is a rare event, meaning that we would expect to see a lot of areas with no cases of overdose. Numerically, if we aggregated the observed incidences of an overdose for some unit of area, we would expect to see very a large number of areas with zero values, and then fewer with one, and even fewer with two and so forth (in a non-linear fashion).

We created grid cells using the create fishnet tool in ArcGis, specifying an area of 500ft by 500ft, and we spatially joined it to our point data of overdoses.

  1. Open the attribute file after performing the join above.
  2. Right click the column corresponding to your joined overdose points.
  3. Select the ‘Statistics’ option.

Here, you’ll see an image of the frequency distribution that should resemble a Poisson Distribution.

(Note: we chose a 500ft X 500ft area for our particular use case–allocating resources such that a “rapid-response” would be reasonable. At just over 1.5 square football fields, we thought that someone could seek and retrieve help in a timely fashion within this area, and more importantly, we thought that a public official could deploy resources at this scale if coordinating with other stakeholers. A different use case may call for a different grid size.)

Poisson Distribution: A Very Brief Refresher

The code below will allow you to generate a sample Poisson distribution. The resulting histogram can be used to compare against the frequency distribution you observed when examining the statistics option of the combined fishnet and overdose incidences above. The shapes of these distributions should look similar.

set.seed(612312)
n <- 1000
x <- runif(n)
mean <- exp(x)
y <- rpois(n,mean)

Our Exploratory Approach

As seen in the image above, our combined fishnet and overdose incidences have a distribution that approximates that of a POisson distribution.

Next, we proceeded to explore our independent variables (predictors). One attribute of particular interest was the spatial distributions of our predictors. For example, having a number of predictors all located in the same one-mile radius would not be ideal.

Below, we included a few of these predictors.

  1. First, we take a look at the density of suicide. While a correlation matrix might tell us the mathematic correlation (if any) with our variable of interest (overdoses), we plot the density of suicides to see how its spatial components might line up against our dependent variable. As you can see, it does a good job in some areas (like the northern-most and eastern part of the city) and not so well in others.
Distance to Suicide: Exploratory Analysis

Distance to Suicide: Exploratory Analysis

  1. Next, we look at distance to parks for associatons with overdoses. As we can see in the plot below, this variable seems to do a good job spatially capturing some phenomena that might impact our ability to predict overdoses.
Distance to Parks: Exploratory Analysis

Distance to Parks: Exploratory Analysis

It is important to note that we are not looking for the “perfect” variable, per se. Rather, we are looking for plausible associations whose spatial components are similarly distributed spatially to large concentrations of overdoses. This way, when we begin predicting incidences of overdoses, our model will be generalizable to unseen data. Also, it is not essential that you create density maps in the exploratory stage (looking at points would provide similar clues).

Satisfied that our predictors are not all located in the same place and varied enough to preclude obvious redundancy, we export our fishnet (with all of our variables joined) to a .csv file, which we import into R for our regression analysis.

We estimated a Poisson regressions, removing insignificant variables and trying different combinations. We chose a model with the lowest range of fitted values to observed values. We included a summary of the results below.

## 
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                           Train_Full         
##                       Regression Results     
## ---------------------------------------------
## Den_AbVeh                  0.0018***         
##                            (0.0002)          
##                                              
## Den_Ass                    0.0013***         
##                            (0.0001)          
##                                              
## Dis_PkgStr                0.00002***         
##                            (0.00001)         
##                                              
## Dis_PerDwn                -0.0008***         
##                            (0.00003)         
##                                              
## Dis_BusStp                -0.0007***         
##                            (0.00005)         
##                                              
## Dis_PolSta                 -0.00002*         
##                            (0.00001)         
##                                              
## Dis_School                0.00002***         
##                           (0.000005)         
##                                              
## Dis_Grocer                 0.0001***         
##                            (0.00001)         
##                                              
## Dis_FireSt                -0.0001***         
##                            (0.00001)         
##                                              
## Dis_CBD                   -0.00001***        
##                           (0.000003)         
##                                              
## Dis_VacBus                -0.0003***         
##                            (0.00003)         
##                                              
## Dis_PubLib                0.00005***         
##                            (0.00001)         
##                                              
## OD_Hless_Bu                0.0393***         
##                            (0.0053)          
##                                              
## Metho_Buff                 0.0945***         
##                            (0.0056)          
##                                              
## PckS_Buff                 -0.0616***         
##                            (0.0052)          
##                                              
## EU_GasSt                  -0.00003***        
##                           (0.000004)         
##                                              
## Constant                   0.8429***         
##                            (0.0646)          
##                                              
## ---------------------------------------------
## Observations                 8,892           
## Log Likelihood            -6,390.2040        
## Akaike Inf. Crit.         12,814.4100        
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

Standarized Coefficients (Weights)

To get a sense of how important each of the predictors in the multiple regression is, we standardize them. This operation tells us how many standard deviations a dependent variable may be expected to change for each unit increase (one standard deviation) in each of the predictors.

We standardize our beta coefficients to measure which of our predictors has a greater effect on the dependent variable. This is especially useful when the variables are measured in different units of measurement (as is the case in our analysis).

##             std_coefficient    variable
## Den_AbVeh        0.07970605   Den_AbVeh
## Den_Ass          0.07575260     Den_Ass
## Dis_PkgStr       0.08081238  Dis_PkgStr
## Dis_PerDwn      -0.83344079  Dis_PerDwn
## Dis_BusStp      -0.66794119  Dis_BusStp
## Dis_PolSta      -0.07439635  Dis_PolSta
## Dis_School       0.09501846  Dis_School
## Dis_Grocer       0.27419003  Dis_Grocer
## Dis_FireSt      -0.27310345  Dis_FireSt
## Dis_CBD         -0.06574281     Dis_CBD
## Dis_VacBus      -0.30480371  Dis_VacBus
## Dis_PubLib       0.24840593  Dis_PubLib
## OD_Hless_Bu      0.02929227 OD_Hless_Bu
## Metho_Buff       0.07812705  Metho_Buff
## PckS_Buff       -0.03810419   PckS_Buff
## EU_GasSt        -0.16087831    EU_GasSt

We then take the absolute value of the standardized coefficients, to observe the overall, non-directional, impact of each of the predictors. To get a visual sense of the predictor’s relative impact to one another, the image below demonstrates what is numerically described by our table.

The Modeling Approach

From the outset, we thought that we would use a reverse-stepwise, “kitchen-sink” approach–running a regression using all of the variables we compiled, removing insignificant variables “step-by-step.” We would partition the data into test and training sets for various time periods to assess the robustness (generalizability) of our model.

Given the amount of data available (just under two years worth), we separated our model into a several training and test sets. Our first training set was for the months of July 2015 to July 2017, and our test set was from August 2017 to November 2017. Then, we trained our model with the data from January 2016 to November 2016, and used test data for the same period the following year(January 2017 to November 2017).

We found that our model was, indeed, generalizable, predicting less than .25 overdoses incorrectly per 500 square foot region in Cincinnati (Mean Absolute Error = .25).

Validating The Predictions

Kernel Density: The Current Method

Kernel Density: The Current Method

As might be surmised from the image above, we would expect that kernel density should do a fairly reasonable job in predicting overdoses. Our goal, however, is to do a better job. To evaluate the performance of our model, we did the following:

The process for validating our predictions is as follows:

  1. Add the fitted values of our regression output using the cbind() function in R.
  2. Export the new output (with fitted values) from R. Load it into ArcGis.
  3. Join this output to the fishnet shapefile.
  4. Use the polygon to raster tool to create a raster called ‘fitted’. Make sure that the cell size value field is set to the fishnet size, in our case we choose 500ft.
  5. Reclassify the fitted raster values to extend from 1-100 using reclassfication tool. Call this raster fitted_rcl.
  6. Next, reclassify fitted_rcl into 5 classes of: 90 to 100 = Class 1 70 to 89 = Class 2 50 to 69 = Class 3 30 to 49 = Class 4 1 to 29 = Class 5. Call this raster fitted_rcl2. This requires reversing the order of the values after the thresholds are set.
  7. Run raster to polygon tool to convert fitted_rcl2 into polygons. Call this fittedGroups.shp.
  8. Spatial join the Test Overdose points to the fittedGroups.shp and Call the fittedgroups_joinPoints.shp.
  9. Opening the attribute table of fittedgroups_joinPoints.shp, summarize by the gridcode column to get count of overdoses by group. Call this new .dbf fittedvalues.dbf.
  10. In the fittedvalue.dbf table, create a new field called fittedCnt, and set it equal to the count of the summarized overdoses.

The same process must be done to measure the kernel density of overdoses.

  1. Using the Kernel Density tool in ArcGis, create a kernel density raster for the training overdose samples. You may choose to use the incremental spatial autocorrelation tool to specify the radius (we did not).
  2. Using the reclassify tool, reclassify the kernel raster such that the values extend from 1-100. Call this raster kernel_rcl.
  3. Reclassify the kernel_rcl file, into the same 5 classes chosen above (for the fitted values). Remember to reverse the order, and call this kernel_rcl2.
  4. Use raster to polygon tool to convert kernel_rcl2 into polygons. Call this kernelGroups.shp.
  5. Spatial join the test overdose points to the kernelGroups.shp file. Call this kernelGroups_joinPoints.shp.
  6. Opening the attribute table of kernelgroups_joinPoints.shp, summarize by the gridcode column to get count of overdoses by group. Call the new .dbf Kernel.dbf.
  7. In the kernel.dbf table, create a new field called kernelCnt, and set it equal to the count of the summarized overdoses.

Finally, we join the kernel.dbf table to the fittedvalues.dbf table.

  1. Join the kernel.dbf to the fittedvalues.dbf of the risk terrain model using GRIDCODE as the tabular matching key.
  2. Export the joined table as countComparisons.txt
  3. Import countComparisons.txt into R.
  4. Deselect the category field as well as the two fields that describe the count of overdoses by both risk terrain model and kernel density (code provided below).
  5. Add a field that changes the category field, GRIDCODE into a sting-based field denoting the percentage range of predictions.
  6. Create new fields that represent the count of overdoses by category as a percent of all the crimes.

These steps will be performed each time new training and test data sets are selected. The code below describes the steps for generating our final table for comparisons, which we visualized in R.

##   GRIDCODE kernelCnt fittedCnt   Category kernelPct fittedPct
## 1        1        56       125 90% - 100%    0.0672    0.1504
## 2        2       112        98  70% - 89%    0.1345    0.1179
## 3        3        88       110  50% - 69%    0.1056    0.1324
## 4        4       131       144  30% - 49%    0.1573    0.1733
## 5        5       446       354   1% - 29%    0.5354    0.4260

Goodness of Fit

Given our use case, we are primarily concerned with the number of high risk areas that we are able to correctly predict. In the table below, you can see that, in the areas where someone is most likely to overdose, we do a better job of predicting the number of overdoses than might be expected from a traditional kernel density approach.

Conclusion

As stated above, we think that a public health official could use our model to more efficiently allocate scarce resources to areas with the greatest need. Mapping the risk areas according to our predictions would provide a quick and detailed view of the areas most likely to require resources. This is exactly what we set out to do!

Our companion application, would allow the model to be updated at regular intervals. In this way, the public offical can also keep track of how risks in different areas are changing over time (perhaps in relation to various intervetions).

Improving Our Model

The first thing we can do to improve our model is secure more data. While we think our results are compelling, more data would allow us to refine the model’s predictive capacity In particular, more data related to pharmacy locations and volume of prescriptions dispersed are clear opportunities to include pertinent data that our model currently lacks. Second, adding non-public data about interventions may also improve the model.

Additionally, we might select a shorter duration of time for our training and test data to estimate the final model. Moreover, we might also consider time of day in our model and attempt to predict on this plane as well. Our public official might require additional information about when resources may be deployed.

Lastly, our model could be improved by consulting with public officials to assess whether our grid size (fishnet) area is actionable given the resources at hand. Though the area made sense to us in the abstract, it may not be practical in reality. A practioner could provide clarity on this point.

Appendix

Description of Variables:

  1. Den_AbVeh– average distance to 5 nearest abandoned vehicles
  2. Den_Ass– average distance to 5 nearest assaults
  3. Den_Dr_Arr– average distance to 5 nearest drug arrests
  4. Den_Gr–average distance to 3 nearest complaints of tall grass
  5. Dis_PerDwn–average distance to 5 nearest person down calls
  6. Dis_BusStp–average distance to 5 nearest bus stops
  7. Dis_Graffi–average distance to 5 nearest graffiti complaints
  8. Dis_Parks–average distance to nearest park
  9. Dis_PolSta–distance to nearest police stations
  10. Dis_School–average distance to 3 nearest schools
  11. Dis_Grocer–average distance to 5 nearest grocery stores
  12. Dis_FireSt–average distance to 2 nearest fire stations
  13. Dis_CBD–distance to central business district
  14. Dis_VacBus–average distance to 5 nearest vacant buildings
  15. Dis_PubLib–average distance x nearest neighbors of public libraries
  16. Eu_Dis_Pes–Euclidean distance to pest complaints
  17. Percent_18–percent above 18 years of age
  18. Percent_Bl–percent black or african american
  19. Percent_Po–percent poor/public assistance
  20. Pov_Rate–poverty rate
  21. OD_Hi_Bu–proximity to highways with speed limit >= 35mph buffer
  22. OD_Hless_Bu–proximity to homeless shelter buffer
  23. Metho_Buff–proximity to methodone buffer
  24. PckS_Buff–proximity to package store buffer
  25. DisOrd_Den–Density of disorderly complaints
  26. EU_GasSt–Euclidean Distance to nearest Gas Station

Our companion app concept can be found here: https://www.youtube.com/watch?v=8lnI1aOmD7c

Thanks for reading!