Introduction
Within the past several years, opioid overdoses have skyrocketed across the United States. According to The New York Times, for the first time in generations, life expectancies are decreasing and opioids have surpassed HIV/AIDS and car crashes as a leading cause of death. In 2017, President Trump declared the epidemic a health emergency. Ohio has been particularly hard hit, with overdoses growing by 33 percent in just two years, according to the Ohio Department of Health. Cincinnati in particular has seen substantial increases in opiate overdoses. Using fire incident data on overdoses from Cincinnati’s open data portal we created a model to determine where overdoses are most likely to occur in the city, in order to create a distribution app for Naloxone, an opioid antagonist which can prevent deaths in the case of an overdose.
Business as Usual
Dayton, Ohio was recently in the news (The New York Times) for their work to curb overdoses. They attribute their success to several concerted efforts, including making Naloxone widely available and training people on Naloxone administration. We will be referring to Narcan specifically. Narcan is the nasal form of Naloxone.
Ultimately, this means that despite laudable and concerted efforts, distribution remains ad hoc. The overarching strategy is not to determine where overdoses are most likely to occur, but rather to make Narcan as widely available as possible, with the hopes that if an overdose does occur, Narcan will be within reach, rather than bringing Narcan to locations where overdoses are more likely to happen.
Proposed Use Case
The benefit to creating a predictive model versus using kernel density or just an ad-hoc strategy is that it allows models to take a variety of environmental factors into account, and also uses the prevalence of past overdoses in order to determine patterns and use predictions to maximize allocation of limited resources.
Our model would be available in the form of a Narcan Distribution App. This means that the people using it would be policy makers and social services providers. The App would allow the user to select the kind of city facility where they would like to put Narcan and the app will map the facilities and identify key strategic facilities and will show the predicted overdoses within a half-mile (or other predetermined range) of that facility. This means that users can easily select where to put Narcan and understand the impact of providing overdose antidotes at a diversity of facilities. It will also make it easier for users (policy makers and social service providers) to monitor the Narcan use and ensure there is always supply where it is needed, without technical or statistical expertise.
Data Collection
The first step in the analysis is to call packages that are needed to conduct this analysis from the R Library.
library(tidyverse)
library(sf)
library(ggplot2)
library(tigris)
library(viridis)
library(FNN)
library(dplyr)
The majority of data collected for this model came from the Cincinnati Open Data portal and the United States Census. The first step in data collection in wrangling was to create a base map, which used the U.S. Census Tiger Cincinnati Shapefile.
Using the Census API, we read a Cincinnati land and water file into R. Using the SF package and ggplot, the shapefiles were projected to WGS 84.
Next, the overdose dataset was read into R. This dataset came from an EMS dataset and was downloaded from Cincinnati’s Open Data portal. In Excel, the EMS data was cleaned to only include those EMS calls that involved a heroin overdose. We chose heroin because opioid antagonists are effective for heroin overdoses. We also used excel to create a training set of overdoses occuring in 2015 and 2016 and a test set of overdoses occuring in 2017 and 2018.
Once read into R, this dataset was transformed into an SF and projected. The OD point data was then mapped.
Overdoses in Cincinatti
We chose these variables because in our mind, it would make sense intuitively that drug use would increase in areas that are less monitored and less conspicuous. As such we selected variables that could serve as a proxy for a lack of eyes on the street. Given this, we wanted to see if we could tie variables like 311 calls about litter and building overcrowding to a higher prevalence of overdoses in an area.
Exploratory Analysis and Data Wrangling
Exploration of Overdoses
We overlayed our Cincinnati basemap with a 500 foot by 500 foot fishnet. This fishnet allows us to determine the density of point data. We chose 500 foot squares for a couple of reasons:
1. If we are looking to provide Narcan according to these cells then we want a small geographic area so that the antagonist can be administered as quickly as possible.
2. Once you start getting smaller than 500 feet you are picking through buildings and parcels internally, meaning that the minutiae can actually throw the prediction off.
We then transformed the overdose count data to an SF and joined it to the fishnet. This allowed us to map the concentrations of overdoses in each fishnet cell.
Fishnetted Observed Overdoses in Cincinnati
As can be seen in the plot, overdoses are relatively rare occurrences and are concentrated in certain parts of the city, specifically the downtown area. This was confirmed by plotting a histogram of overdoses in each cell.
Overdose Distribution
Exploration of Independent Variables
The independent variables, listed in the above section were downloaded from a variety of sources and then wrangled. To create the variables from the Open Data Cincinnati portal (311 and 911 calls and parks), datasets were downloaded and cleaned in Excel. These datasets could also be read into R using the API, however, we were limited to 1,000 rows using this method, forcing us to clean the data in Excel. In Excel, we separated out the relevant calls into individual CSVs, which would make it easier to wrangle the data using Dplyr functions in R. Once the CSVs were imported to R we removed NAs (any row without a geospatial element), pulled out just the coordinates in order to turn it into an SF, transformed and clipped it to the fishnet shapefile, and give it a legend label.
calls311<-read.csv('Cincy311.csv') %>%
dplyr::select(LATITUDE,LONGITUDE, Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>% #remove those without coordinates in order to create SF
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_sf() %>%
st_transform(st_crs(fishnet)) %>%
st_intersection(.,st_union(fishnet))%>%
mutate(Legend = "311 Calls")
We then r-bind(ed) them together into one dataframe of coordinate sets called allvars. We then took this coordinate set data frame and bound it to the SF we created in order to create a faceted fishnet density plot of all our variables.
Independent Variables
Census data was downloaded from American FactFinder, cleaned in Excel and then imported into R, where it could be transformed into an SF and joined to the fishnet. Additional binary variables were engineered using the EMS overdose data.
To calculate the distance variables, we used K nearest neighbor, which allows us to calculate the distance to the nearest specified variable, in our case parks, libraries, 911 related drug calls, and overdoses. With K nearest neighbor, we are able to specify the number of near neighbors that we want to calculate distance to and then average that number to get an average distance. For our analysis, we used 2, 5, and 10 nearest neighbors in order to determine the strongest determinant of overdoses in the city, seen below.
#output a matrix which we need for the nearest neighbor function.
ODneighbor.xy <-
ODsclean.sf %>%
cbind(.,st_coordinates(st_centroid(ODsclean.sf))) %>%
st_set_geometry(NULL) %>%
dplyr::select(X,Y) %>%
as.matrix()
#use the function to measure average nearest neighbor distance with k=2/5/10 and map the relationship
# KNN FUNCTION
nn_function <- function(measureFrom,measureTo,k) {
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint)
return(output)
}
distODneighborho<-as.data.frame(nn_function(vars_net.xy, ODneighbor.xy, 2))
distODneighborho5<-as.data.frame(nn_function(vars_net.xy, ODneighbor.xy, 5))
distODneighborho10<-as.data.frame(nn_function(vars_net.xy, ODneighbor.xy, 10))
#ODs
vars_net<-
cbind(as.data.frame(vars_net),distODneighborho)%>%
st_sf()%>%
rename(ODdist=pointDistance)
vars_net<-
cbind(as.data.frame(vars_net),distODneighborho5)%>%
st_sf()%>%
rename(ODdist=pointDistance)
vars_net<-
cbind(as.data.frame(vars_net),distODneighborho10)%>%
st_sf()%>%
rename(ODdist=pointDistance)
Distance to 2 Nearest Libraries (5)
Distance to 2 Nearest Drug Related 911 Calls (5)
Distance to 2 Nearest Overdoses (5)
All this data wrangling allowed us to visualize a litany of variables across space (and to a limited extent, time). It also showed us that many of these events are in fact related; we can see hot spots for overdoses in many of the same places we see hot spots for vermin or litter. This will all come into play and be important as we create our model.
Model Building
Because overdoses are a relatively rare occurrence (as shown in the histogram in the preceding section), and are concentrated in specific parts of the city, we chose to use a poisson regression. The dependent variable in our model, overdose count per cell, is not normally distributed and logging the variable does not make it normally distributed. As such, a linear model would not predict as well as other regression types.
This model was trained on a dataset containing overdoses from 2015 and 2016 and tested on a dataset containing overdoses from 2017 and 2018. A training and test set are used so that the accuracy of the model can be analyzed. A model that predicts well on a test set, which includes only data that it has not be trained on, is a robust model, while a model that cannot predict well for a test set requires additional tweaking. By separating the training and test sets by years, we are able to determine how generalizable the model is over time. That is, can the model predict as well for one time period as it can for another?
We used the kitchen sink method of model building, where we began with all of the variables that we wrangled and ran the model, changing out and removing variables, until all were statistically significant, indicating that they were useful predictors of overdoses. Reg 2, below, represents the final model.
reg2 <- glm(countODs ~ ., family = "poisson",
data= final_net_test %>%
as.data.frame %>%
dplyr::select(countODs,
X311.Calls,
Police.Calls,
Building.Overcrowding,
Litter,
Police.Calls,
policeDist.1,
policeDist.2, ODdist, ODdist.1,
ODdist.2, Avg_PctUnm,))
summary(reg2)
Model Results
Results and Validation
Results
The final model included nine independent variables, (Distance to nearest 2 ODs, distance to 5 nearest ODs, count of 311 calls, Count of 911 drug-related calls, litter count, overcrowded building count, distance to nearest 2 drug-related police calls, and distance to 5 nearest drug-related police calls) all of which were statistically significant. To gain an understanding of which variables were the most powerful predictors, the model coefficients for can be standardized. Taking the absolute value of each standardized coefficient then allows us to see where variables were the strongest predictors. As shown in the code and graph below.
#create standardized regression coefficients, which puts them all the same scale.
standardized <- as.data.frame(lm.beta(reg2))
standardized$variable <- row.names(standardized)
colnames(standardized)[1] <- "std_coefficient"
standardized
#plot absolute value
ggplot(standardized, aes(x=variable, y=abs(std_coefficient), fill=variable)) +
geom_bar(stat="identity")
#reorder by coefficient importance
standardized$absCoef <- abs(standardized$std_coefficient)
ggplot(standardized, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) +
geom_bar(stat="identity") +
labs(title="Standardized Coefficients") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Average KNN Home Prices (5)
As the graph shows, distance to the two nearest overdoses and 311 calls were the most powerful predictors in the model. The least powerful predictors in the model were counts of overcrowded buildings and count of drug related police calls. Notably, distance to the two nearest ODs was overwhelmingly the most powerful predictor of where overdoses are most likely to occur in Cincinnati.
The predicted overdoses per fishnet cell are shown in the map below. Based on the map of predicted values, the model appears to predict overdoses well, especially for those areas with the highest concentration of overdoses.
Predicted Overdoses in Cincinnati
Validation
To understand goodness of fit and the predictive power of the model, several validation tests were undertaken. This included calculating the mean absolute error (MAE) of the predicted values for the test set and comparing the results of the model to a kernel density map of overdoses in the city. Both of these measures serve to illustrate how well the model predicts where overdoses in order to ensure the most efficient and effective allocation of Narcan.
MAE is a measure of goodness of fit that calculates the average error of predictions in the model. That is, the calculation takes the absolute value of errors for the predictions for each fishnet cell and provides a value for average error for the model, as illustrated in the below code block. In this case, we are using MAE rather than MAPE (mean absolute percentage error) because ODs are such a rare occurrence that being off by one or two overdoses could translate to a 100 percent MAPE, meaning that MAE is more reflective of OD patterns.
cbind(final_net,reg2$fitted.values) %>%
mutate(error = abs(countODs - reg2.fitted.values)) %>%
st_set_geometry(NULL) %>%
summarize(MAE = mean(error))
The final model had a MAE of 2.5 for the test set. This means that on average, the model predicted overdoses per cell with an absolute error of 2.5. A value of 2.5 indicates that the model does a relatively good job of predicting accurately where overdoses are most likely to occur.
The low MAE for the test set also indicates that the model is generalizable over time. Because the model was trained on a dataset for years 2015 and 2016 and tested on a dataset of overdoses from 2017 and 2018, we can determine if the model predicts well across time periods. Given the low MAE and thus accuracy of predictions across space when predicting for a time period that the model did not train on, we know that the model is generalizable over time.
To further validate the model, the model was compared to a kernel density model. This allowed us to compare predictions between risk levels. While an ideal model would predict well at both high and low risk levels, a model that predicts more accurately in those areas where overdoses are most likely to occur is particularly important. Comparing our model to the kernel density model requires work in both R and ArcGIS.
Kernel Density Results
As the above chart shows, the model we developed and the kernel density model are good at predicting different risk levels. The kernel density model is good at predicting locations where overdoses are less likely to happen (bottom 30% of grids), whereas our model does better across all other thresholds. For the purposes of our app and use case, it is important to predict the highest risk levels more accurately, as those are the places where the most resources need to be allocated. As such, our model is particularly useful and is able to provide the greatest benefit. The outcome of this comparison confirms what was assumed when looking at the maps of observed and predicted overdoses.
Discussion
Use Case
We imagine that our primary app users will be local policy makers and social service providers. Given this, our app must be easy to understand and intuitive first and foremost. Our users are not interested in the distribution of a poisson model, but need to know how many predicted overdoses there are close to any given facility and be able to ensure that they have sufficient Narcan supplied there.
Our app offers a new approach in that we seek to allocate and distribute Narcan based on the count of predicted overdoses occuring nearby. We argue that this will be a more efficient use of resources for a couple of reasons: 1. The current model is reliant on people coming to seek help, rather than bringing help to those who need it. 2. While overdoses are an epidemic, they remain fairly uncommon, meaning, there are large swaths of Cincinnati where overdoses virtually do not happen. Given this, Narcan distributed to these areas (above a certain threshold), is likely to go unused, whereas, it might be needed elsewhere.
Interface
Our app gives users the option of using either a map view of a list view. Both views allow users to filter out the kinds of facilities they wish to go to. For the purposes of this demonstration we have limited the facilities to Libraries, Police Stations, and Fire Departments. However, Cincinnati also has walk-in clinics and pharmacies that would also be promising and eligible facilities to be considered as Narcan distribution sites.
The map view shows users a map of their city, and allows them to click on facilities where a popup appears, telling them the facility name, the predicted overdoses, and the Narcan supply.
The list view contains the same information in a different format that allows users to filter data based off of:It is important to note that the list view also allows users to update the Narcan supply in real time so that other users can look at accurate, to-date numbers. On the whole, the app allows for an easy, intuitive way to distribute Narcan in a more strategic way, that allows users to see numbers and understand needs, without necessarily worrying about the technicalities or statistically modeling going on behind the scenes.
Conclusion and Next Steps
The model we developed does a fairly good job of predicting where overdoses are most likely to occur and will thus help policymakers allocate Narcan efficiently, it can still be improved. It is a clear improvement from a kernel density model in high risk areas, which should be the highest priority for policy makers and social service workers. That said, the model does not do as well predicting for low- and medium-risk areas, which should be addressed through further feature engineering and data wrangling, such as building out the temporal prediction capacity in order to determine not only where, but when overdoses are most likely to occur. In addition, further feature engineering could help reduce the MAE, which at 2.5 is good, it could be improved.