This report examines narcotics arrests in Chicago in 2018. The assignment challenged us to find a measure of criminality that is highly subjective. In contrast to burglaries, assaults, and violent crimes, narcotics arrests rely almost entirely on police officer discretion; they are rarely reported via 911 calls from civilians. I chose to model narcotics arrests because they suffer from systematic bias and the overpolicing of low-income and minority communities. The ultimate goal of this report is to build a machine-learning model that predicts areas of latent risk for future narcotics arrests in Chicago. I start with several risk predictors, including alley lights out, potholes, vacant buildings, and public housing complexes, before refining my prediction model.
knitr::opts_chunk$set(error=FALSE)
#Load Libraries
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(bookdown)
library(pander)
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
# clean stale bookdown artifacts that can cause knitting errors
clean_book(clean = getOption("bookdown.clean_book", FALSE))
# reset the locale to its current value (knitting workaround)
my_locale <- Sys.getlocale("LC_ALL")
Sys.setlocale("LC_ALL", my_locale)
# average distance from each point in measureFrom to its k nearest
# neighbors in measureTo
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}
## Read in Data from Chicago
#This uses the RSocrata package for some data sets
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = beat_num)
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
chicagoBoundary <-
st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
#In 2017, there were 11,671 narcotics arrests in Chicago. Unlike assault, battery,
#or theft, a narcotics offense is often victimless, relying on the discretion of a
#police officer to make an arrest rather than a 911 call from a victim.
#Narcotics arrests are thus heavily reliant on human judgment and bias.
narcotics <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/d62x-nvdr") %>%
filter(Primary.Type == "NARCOTICS") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
The two maps below show the spatial distribution of narcotics arrests in Chicago in 2018. The main hot spot occurs in the West Side neighborhood of Austin.
## visualizing point data
#Plotting point data and density
# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,nrow=2,heights=(c(5,1)),widths=(c(1,1)),
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = narcotics, colour="red", size=0.1, show.legend = "point") +
labs(title= "Narcotics Arrests, Chicago - 2018") +
mapTheme(title_size = 14),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(narcotics)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 30, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.70), guide = FALSE) +
labs(title = "Density of Narcotics Arrests") +
mapTheme(title_size = 14) + theme(legend.position = "none"))
The next code chunk creates and maps a fishnet of grid cells across Chicago, counting the number of narcotics arrests per grid cell.
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
### Aggregate points to the fishnet
narcotics_net <-
  dplyr::select(narcotics) %>%
  mutate(countNarcotics = 1) %>%
  aggregate(., fishnet, sum) %>%
  mutate(countNarcotics = replace_na(countNarcotics, 0),
         uniqueID = rownames(.),
         # assign each grid cell to one of ~100 random folds for k-fold CV
         cvID = sample(round(nrow(fishnet) / 24),
                       size = nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = narcotics_net, aes(fill = countNarcotics), color = NA) +
scale_fill_viridis() +
labs(title = "Count of Narcotics Arrests for the Fishnet") +
mapTheme()
In this section I read in risk factors from Chicago’s open data portal. I chose alley lights out, potholes, vacant buildings, and public housing as features. I also included pooled American Community Survey (ACS) data on tract-level income and racial composition. I then joined these risk factors to my fishnet. Among the features, I am most interested in those that can account for the hotspots of my outcome, narcotics arrests. Judging from the grid cell maps below, vacant buildings most closely track the narcotics arrest hotspots.
alleyLightsOut <-
read.socrata("https://data.cityofchicago.org/resource/t28b-ys7j.json") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Alley_Lights_Out")
Potholes <-
read.socrata("https://data.cityofchicago.org/resource/_311-potholes.json") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Potholes")
#keeping this in here because it will be very important for my model
VacantBuildings <-
read.socrata("https://data.cityofchicago.org/resource/7nii-7srd.json") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Vacant_Buildings")
PublicHousing <-
read.socrata("https://data.cityofchicago.org/resource/s6ha-ppgi.json") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Public_Housing")
## Neighborhoods to use in LOOCV in a bit
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
#ACS data
#using year = 2017 because the 2018 request returned an error
tracts18 <-
get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E",
"B19013_001E","B25058_001E",
"B06012_002E"),
year=2017, state= "IL", county= "Cook", geometry=T, output="wide") %>%
st_transform('ESRI:102271') %>%
rename(TotalPop = B25026_001E,
Whites = B02001_002E,
MedHHInc = B19013_001E,
MedRent = B25058_001E,
TotalPoverty = B06012_002E) %>%
dplyr::select(-NAME, -starts_with("B")) %>%
mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
raceContext = ifelse(pctWhite > .5, "Majority_White", "Majority_Non_White")) %>%
dplyr::select(-Whites, -TotalPoverty)
#### How we aggregate a feature to our fishnet
vars_net <-
rbind(alleyLightsOut, Potholes,VacantBuildings,
PublicHousing) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
#Mapping Risk Features
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =2, top = "Risk Factors by Fishnet"))
In this section I use the nn_function to measure the distance from each fishnet grid cell to my risk factors. I also add a feature measuring distance to “The Loop,” Chicago’s downtown central business district. Finally, I join all of these features to the fishnet containing the outcome variable, narcotics arrest counts.
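Concretely, for each grid cell centroid $c_i$, nn_function returns the average distance to the $k$ nearest risk factor points $p_{(1)}, \dots, p_{(k)}$:

$$d_i = \frac{1}{k} \sum_{j=1}^{k} \operatorname{dist}\left(c_i, p_{(j)}\right)$$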
# shorthand aliases for sf helpers
st_c <- st_coordinates
st_coid <- st_centroid
stc_potholes <- st_c(Potholes) %>%
na.omit()
#toggled k values based on observation of each feature's point density
vars_net <-
vars_net %>%
mutate(
Alley_Lights_out.nn =
nn_function(st_c(st_coid(vars_net)), st_c(alleyLightsOut),3),
Potholes.nn =
nn_function(st_c(st_coid(vars_net)), stc_potholes,5),
Public_Housing.nn =
nn_function(st_c(st_coid(vars_net)), st_c(PublicHousing),1),
Vacant_Buildings.nn =
nn_function(st_c(st_coid(vars_net)), st_c(VacantBuildings),2))
#Mapping Risk Factor Distance
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol = 2, top = "Nearest Neighbor Risk Factors by Fishnet"))
## Adding Loop Distance Feature
loopPoint <-
filter(neighborhoods, name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
## Join the risk feature fishnet to the narcotics fishnet
final_net <-
left_join(narcotics_net, st_drop_geometry(vars_net), by="uniqueID")
### Join in areal data - Polygon to Polygon
#Using spatial joins to join *centroids* of fishnets to polygon for neighborhoods and districts.
final_net <-
  st_centroid(final_net) %>%
  st_join(dplyr::select(neighborhoods, name)) %>%
  st_join(dplyr::select(policeDistricts, District)) %>%
  st_join(dplyr::select(tracts18, pctPoverty, pctWhite, MedRent)) %>%
  st_drop_geometry() %>%
  left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
  st_sf() %>%
  na.omit()
The Local Moran’s I maps below show evidence of spatial autocorrelation, or clustering, at the local level in the main hotspot. This implies that the model must better account for spatial features; the p-value and hot spot maps offer further confirmation.
The second map at the bottom takes the Moran’s I grid cells that are statistically significant and measures each fishnet grid cell’s distance to the nearest significant cell.
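For reference, Local Moran’s I for grid cell $i$, with count $x_i$, mean $\bar{x}$, and spatial weights $w_{ij}$, takes the form

$$I_i = \frac{x_i - \bar{x}}{S^2} \sum_{j} w_{ij}\,(x_j - \bar{x})$$

where $S^2$ is a variance-like scaling term; positive values indicate that a cell’s count clusters with similarly high (or low) neighboring counts.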
## {spdep} builds a queen-contiguity neighbor list from the fishnet polygons...
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and converts the neighbor list to a list of spatial weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countNarcotics, final_net.weights)),
as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(Narcotics_Count = countNarcotics,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme() + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Narcotics"))
final_net <-
  final_net %>%
  mutate(narcotics.isSig =
           # column 5 of the localmoran() output holds the p-value
           ifelse(localmoran(final_net$countNarcotics,
                             final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(narcotics.isSig.dist =
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, narcotics.isSig == 1))), 1))
ggplot() +
geom_sf(data = final_net, aes(fill=narcotics.isSig.dist), colour=NA) +
scale_fill_viridis(name="NN Distance") +
labs(title="Distance to highly signficant narcotics arrests") +
mapTheme()
Vacant buildings appear to be highly correlated (r = 0.42) with narcotics arrests. Percent white (r = -0.29) and percent poverty (r = 0.28) were the next most correlated risk features. Potholes appear to have very little relation to the outcome. This provides a good overview of which features could be included in the prediction model.
#Correlation Charts
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District, -narcotics.isSig, -narcotics.isSig.dist) %>%
gather(Variable, Value, -countNarcotics)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countNarcotics, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countNarcotics)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Narcotics Arrests as a function of risk factors") +
plotTheme()
The histogram below shows that narcotics arrests are right-skewed: most fishnet grid cells have no arrests at all, while a small number of grid cells account for the majority.
ggplot(final_net, aes(x=countNarcotics))+
geom_histogram(binwidth =3) +xlim(c(-10,50)) +ylim(c(0,1500)) +
labs(title="Narcotics Histogram")+
theme_bw()
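As a quick check on that claim, the share of zero-count cells and a distribution summary can be computed directly (a minimal sketch, not part of the original analysis):
# proportion of fishnet grid cells with zero narcotics arrests
mean(final_net$countNarcotics == 0)
# the mean should sit well above the median for a right-skewed count
summary(final_net$countNarcotics)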
There are two models for comparison. The first cross-validates on each grid cell, given a random identification number (keep in mind there are over 2,000 grid cells). The second model uses leave-one-group-out cross-validation (LOGO-CV) on Chicago neighborhoods, adding a spatial component that may help with spatial autocorrelation.
Both models benefited from including both counts and nearest-neighbor distances (via nn_function) of the risk features, resulting in lower mean absolute errors (MAE) and standard deviations (SD). Including alley lights out, potholes, public housing, vacant buildings, and Loop distance helps both models. Excluding all census variables actually improved the second, spatial “LOGO” model, so I manually added each census feature back individually: percent white improved the models, while percent poverty and median rent only hurt them.
The second model had higher MAE and SD than the first model (which relied on random k-fold cross-validation by fishnet grid cell). This implies that the spatial features do not actually improve the risk prediction. Including neighborhood name hurts the model by increasing mean error; it is possible that the open-data risk features eliminate the need for a neighborhood feature.
The final map at the bottom is not an apples-to-apples comparison, since grid cells are more granular than neighborhoods. It displays model error, with Model 1 on the left and Model 2 on the right. MAEs are higher at the neighborhood level on the right, as evidenced by the green and yellow neighborhoods.
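Both regressions are Poisson count models (note the glm(..., family = "poisson") call below), so the fitted relationship takes the form

$$\log\left(E[\text{countNarcotics}_i]\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$

where the $x_{ij}$ are the risk features listed in reg.ss.vars.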
# Leave One Group Out CV on spatial features
reg.ss.vars <- c("Alley_Lights_Out","Alley_Lights_out.nn","Potholes", "Potholes.nn", "Public_Housing","Public_Housing.nn","Vacant_Buildings.nn",
"Vacant_Buildings", "loopDistance", "pctWhite", "narcotics.isSig","narcotics.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  for (i in cvID_list) {
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
      dplyr::select(all_of(id), geometry, all_of(indVariables), all_of(dependentVariable))
    fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
      dplyr::select(all_of(id), geometry, all_of(indVariables), all_of(dependentVariable))
    # Poisson regression; build the formula from dependentVariable rather than
    # hard-coding countNarcotics
    regression <-
      glm(as.formula(paste(dependentVariable, "~ .")), family = "poisson",
          data = fold.train %>%
            dplyr::select(-geometry, -all_of(id)))
    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    allPredictions <-
      rbind(allPredictions, thisPrediction)
  }
  return(st_sf(allPredictions))
}
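The census-feature ablation described above was done by hand. A minimal sketch of automating it with the crossValidate helper might look like the following; the census_vars list, the reg.ss.vars_base vector (the feature set without any census variables), and the loop itself are hypothetical, not part of the original workflow.
# hypothetical: candidate census features to add back one at a time
census_vars <- c("pctWhite", "pctPoverty", "MedRent")
ablation <- purrr::map_dfr(census_vars, function(v) {
  cv <- crossValidate(
    dataset = final_net,
    id = "name",
    dependentVariable = "countNarcotics",
    indVariables = c(reg.ss.vars_base, v))  # base features plus one census feature
  # overall MAE across all grid cells for this candidate feature
  tibble(feature = v,
         MAE = mean(abs(cv$Prediction - cv$countNarcotics), na.rm = TRUE))
})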
## RUN REGRESSIONS
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countNarcotics",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countNarcotics, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countNarcotics",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countNarcotics, Prediction, geometry)
#View MAE by random k-fold and spatial LOGO-CV
reg.summary <-
rbind(
mutate(reg.ss.cv, Error = Prediction - countNarcotics,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.ss.spatialCV, Error = Prediction - countNarcotics,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
error_by_reg_and_fold <-
  reg.summary %>%
  group_by(Regression, cvID) %>%
  # per-fold error; the SD of MAE across folds is computed in the table below
  summarize(Mean_Error = mean(Prediction - countNarcotics, na.rm = T),
            MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()
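Note that, as computed above, each fold’s MAE is the absolute value of that fold’s mean error rather than the mean of absolute cell-level errors:

$$\text{MAE}_f = \left| \frac{1}{n_f} \sum_{i \in f} \left( \hat{y}_i - y_i \right) \right|$$

where $n_f$ is the number of grid cells in fold $f$, $\hat{y}_i$ is the predicted count, and $y_i$ the observed count.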
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) +
labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
x="Mean Absolute Error", y="Count")
#Table of Mean MAE and SD MAE
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable(caption = "MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#FDE725FF")
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Spatial Process | 0.82 | 0.94 |
| Spatial LOGO-CV: Spatial Process | 1.13 | 1.24 |
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Narcotics Errors by Model") +
mapTheme() + theme(legend.position="bottom")
Both models include the percentage of white residents in each fishnet grid cell; therefore, I expected both to be fairly generalizable across race. However, both models under-predict narcotics arrests in majority-non-white neighborhoods and over-predict in majority-white neighborhoods. The spatial process model (Model 2) performs slightly better by mean error.
reg.summary %>%
st_centroid() %>%
st_join(tracts18) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
| Regression | Majority_Non_White | Majority_White |
|---|---|---|
| Random k-fold CV: Spatial Process | -0.0703703 | 0.0825447 |
| Spatial LOGO-CV: Spatial Process | -0.0306336 | 0.0840195 |
Lastly, I fit a kernel density estimate to the 2018 narcotics arrests in order to predict 2019 arrests. The first map shows the kernel density surface for the 2018 narcotics arrests alone.
Next, I overlaid the actual 2019 narcotics arrest points. The kernel density predictions were then compared against the cross-validated risk model (I chose the random k-fold model, Model 1, over the spatial process model, Model 2, due to its lower MAE). The second map compares the kernel density and Model 1, with Model 1 looking slightly more predictive of 2019 narcotics arrests.
#2018 kernel density (bandwidth of 1000 map units)
narc_ppp <- as.ppp(st_c(narcotics), W = st_bbox(final_net))
narc_KD <- spatstat::density.ppp(narc_ppp, 1000)
as.data.frame(narc_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(narcotics, 1500), size = .5) +
scale_fill_viridis(name = "Density") +
labs(title = "Kernel density of 2018 narcotics arrests") +
mapTheme()
#2019 narcotics arrests, used as the validation set
narcotics19 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2019/3i3m-jwuy") %>%
filter(Primary.Type == "NARCOTICS") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102271') %>%
distinct() %>%
.[fishnet,]
#the kernel density is still estimated from the 2018 arrests;
#the 2019 points are used only for validation
narc_ppp <- as.ppp(st_coordinates(narcotics), W = st_bbox(final_net))
narc_KD <- spatstat::density.ppp(narc_ppp, 1000)
#aggregate the kernel density to the fishnet and bin into risk categories
narc_KDE_sf <- as.data.frame(narc_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(narcotics19) %>% mutate(narcoticsCount = 1), ., sum) %>%
mutate(narcoticsCount = replace_na(narcoticsCount, 0))) %>%
dplyr::select(label, Risk_Category, narcoticsCount)
#bin Model 1 risk predictions into the same risk categories
narc_risk_sf <-
filter(reg.summary, Regression == "Random k-fold CV: Spatial Process") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(narcotics19) %>% mutate(narcoticsCount = 1), ., sum) %>%
mutate(narcoticsCount = replace_na(narcoticsCount, 0))) %>%
dplyr::select(label,Risk_Category, narcoticsCount)
#map
rbind(narc_KDE_sf, narc_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(narcotics19, 3000), size = .5, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2018 narcotics risk predictions; 2019 narcotics arrests") +
mapTheme()
The bar chart shows that the risk prediction model, Model 1, does a better job predicting the highest-risk hot spots, while the kernel density predicted better in the 70-89% risk category. The two were fairly even in the lower-risk categories.
#bar chart comparing capture rates by risk category
rbind(narc_KDE_sf, narc_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countNarcotics = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countNarcotics / sum(countNarcotics)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Risk prediction vs. Kernel density, 2019 Narcotics Arrests") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
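To put a number on the hot-spot claim, a small sketch (not part of the original analysis) can pull out the share of 2019 arrests captured by the top risk category for each approach:
#share of 2019 arrests falling in the "90% to 100%" risk category,
#for the kernel density and the risk prediction model respectively
rbind(narc_KDE_sf, narc_risk_sf) %>%
  st_set_geometry(NULL) %>%
  na.omit() %>%
  group_by(label) %>%
  summarize(top_category_capture =
              sum(narcoticsCount[Risk_Category == "90% to 100%"]) /
              sum(narcoticsCount))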
Statistically, I think my risk prediction model did a good job of predicting future narcotics arrests. It captured a good percentage of hot spot arrests, better than a kernel density estimate, and for each fishnet grid cell it predicted accurately to within one arrest (0.9). The risk predictors of alley lights out, Loop distance, potholes, nearby public housing, and nearby vacant buildings were all helpful in predicting narcotics arrests. Additionally, the racial composition of a grid cell, via the ACS data, reduced model errors.
Operationalizing this model is much more difficult, and ultimately I would not recommend it. Narcotics arrests are heavily subject to police discretion, and by operationalizing this model the police department would only reinforce systematic bias in narcotics arrests. The 2018 narcotics arrests that drive the 2019 predictions are themselves heavily biased, so this model, although accurate, exemplifies the dreaded machine-learning adage of “garbage in, garbage out.” Steering patrols toward neighborhoods dense with public housing and vacant buildings (i.e., low-income neighborhoods that suffer from disinvestment) would only exacerbate the systematic inequalities there. Finally, the model does not generalize well across majority-white and majority-non-white neighborhoods, which is another reason not to recommend it. The feature engineering process was also limited: we were instructed to ignore more popular datasets (abandoned cars, street lights out, graffiti, and liquor stores), and these may have helped lower the mean absolute errors.