This analysis uses a machine learning model to determine the risk of arrest for unlawful firearm possession in the city of Chicago. I chose this outcome because unlike other criminal violations like burglary and arson, mere possession of a firearm does not assume that the weapon has been fired, and must instead be discovered through a police search. As a result, the circumstances leading up to such an arrest are more open to influence by selective bias. Although CPD and the ACLU reached an agreement in 2015 to reform its “stop and frisk” practice, the intense clustering of such arrests in South and West Chicago suggests arrests for unlawful firearm possession may still suffer from selection bias on the basis of race or poverty. Finally, this model’s risk prediction accuracy and generalizability is evaluated and a recommendation of whether the proposed model has real-world value is discussed.

1 Setup

knitr::opts_chunk$set(
  comment = NA,
    message = FALSE,
    warning = FALSE)
# --- Setup: Libraries ----
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)

options(scipen = 999)

# --- Setup: Custom Functions ----
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

st_c <- st_coordinates
st_coid <- st_centroid

2 Data Wrangling

2.1 Outcome Variable Map: Arrests for Unlawful Firearm Possession

# --- Part 1: Chicago Data ----
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

# --- Part 2: Fishnet ----
chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>% 
  st_sf() %>%
  mutate(uniqueID = rownames(.))

# --- Part 3: Crime Data  ----
gunposs17 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
  filter(Primary.Type == "WEAPONS VIOLATION" & 
           Description == "UNLAWFUL POSS OF HANDGUN" | Description == "UNLAWFUL POSS OF OTHER FIREARM") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform('ESRI:102271') %>% 
  distinct()

crime_net <- 
  dplyr::select(gunposs17) %>% 
  mutate(countGuns = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countGuns = ifelse(is.na(countGuns), 0, countGuns),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))

Below are two maps of the 2017 arrests of unlawful possession of a handgun or other firearm (Figure 2.1). The map on the left shows the location of each such arrest, and the map on the right visualizes the density of these arrests.

# Density Maps
grid.arrange(ncol=2,nrow=2,heights=(c(5,1)),widths=(c(1,1)),
             ggplot() + 
               geom_sf(data = chicagoBoundary) +
               geom_sf(data = gunposs17, colour="black", size=0.1, show.legend = "point") +
               labs(title= "Gun Arrests - Chicago, 2017",
                    caption = "Figure 2.1") +
               mapTheme(title_size = 14),
             
             ggplot() + 
               geom_sf(data = chicagoBoundary, fill = "grey40") +
               stat_density2d(data = data.frame(st_coordinates(gunposs17)), 
                              aes(X, Y, fill = ..level.., alpha = ..level..),
                              size = 0.01, bins = 30, geom = 'polygon') +
               scale_fill_viridis() +
               scale_alpha(range = c(0.00, 0.70), guide = FALSE) +
               labs(title = "Density of Gun Arrests") +
               mapTheme(title_size = 14) + theme(legend.position = "none"))

Figure 2.2 shows the count of unlawful firearm possession within each 500ft x 500ft square of a grid cell fishnet lattice. In most of Chicago, there are almost no such arrests, but a high concentration in the West, Southwest, and South Sides of Chicago.

ggplot() +
  geom_sf(data = crime_net, aes(fill = countGuns)) +
  scale_fill_viridis(option = "B",
                     begin = .2) +
  labs(title = "Count of Gun Arrests, Fishnet",
       caption = "Figure 2.2") +
  mapTheme()

2.2 Risk Factors

The risk factor features are created using the Chicago Open Data database. In addition to previously considered features for abandoned buildings, street lights out, and sanitation requests, new features for public housing, and grocery stores are also included. These predictors are joined to the fishnet and mapped in Figure 2.3.

Features for abandoned cars, graffiti, L stations, liquor retailers, and potholes were also considered, but ultimately eliminated due to low correlation with the outcome.

# --- Part 4: Risk Factor Features ----
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
  mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
  mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
  mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Street_Lights_Out")

# Additional Predictors: 
grocery <- 
  read.socrata("https://data.cityofchicago.org/resource/ce29-twzt.json") %>%  
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Grocery_Stores")

publicHousing <- 
  read.socrata("https://data.cityofchicago.org/resource/s6ha-ppgi.json") %>%  
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Public_Housing")

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

vars_net <- 
  rbind(abandonBuildings, sanitation, streetLightsOut,
        publicHousing, grocery) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit() %>%
  ungroup()

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(mapList, ncol =3, top = "Risk Factors by Fishnet", bottom = "Figure 2.3"))

2.3 Feature Engineering: Nearest Neighbor

Nearest neighbor features are also calculated (Figure 2.4), along with Euclidean distances to The Loop and from the Stevenson, Dan Ryan, Eisenhower, and Chicago Skyway expressways (due to the close association in the U.S. between highway construction and racial segregation).

# --- Part 5: Feature Engineering ----
## Nearest Neighbors
vars_net <-
  vars_net %>%
  mutate(
    Abandoned_Buildings.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),5),
    Street_Lights_Out.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),5),
    Sanitation.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(sanitation),5),
    Public_Housing.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(publicHousing),5),
    Grocery_Stores.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(grocery),5))

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Nearest Neighbor Risk Factors by Fishnet",
                       bottom = "Figure 2.4"))

2.4 Feature Engineering: Distance to One Point

Expy <- 
  st_read('https://data.cityofchicago.org/resource/pr57-gg9e.geojson') %>% 
  filter(street_typ == 'EXPY' & street_nam == 'STEVENSON' | street_nam == 'DAN RYAN' | street_nam == 'DAN RYAN EXPRESS' | street_nam == 'EISENHOWER' | street_nam == 'CHICAGO SKYWAY') %>%
  st_transform('ESRI:102271') %>%
  st_intersection(chicagoBoundary)
expyPoints <- st_cast(Expy,"POINT") 

loopPoint <-
  filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$expyDistance = 
  nn_function(st_c(st_coid(vars_net)),st_c(expyPoints),1) %>%
  as.numeric() 

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

a <- ggplot() +
   geom_sf(data=vars_net, aes(fill=expyDistance)) +
   scale_fill_viridis(option="B") +
   labs(title="Expressways",
        caption = "Figure 2.5") +
   mapTheme(title_size = 14)

b <- ggplot() +
    geom_sf(data=vars_net, aes(fill=loopDistance)) +
    scale_fill_viridis(option="B") +
    labs(title="The Loop",
         caption = " ") +
    mapTheme(title_size = 14)

grid.arrange(a, b, ncol=2, top = "Euclidean Distances from Features")

2.5 Join Features with Neighborhoods & Police Districts

final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")

final_net <-
  st_centroid(final_net) %>%
  st_join(dplyr::select(neighborhoods, name)) %>%
  st_join(dplyr::select(policeDistricts, District)) %>%
  st_drop_geometry() %>%
  left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
  st_sf() %>%
  na.omit()

# Plot by aggregate areas
dplyr::select(final_net, Neighborhood = name, PoliceDistrict = District) %>%
  gather(Variable, value, -geometry) %>%
  ggplot() +
  geom_sf(aes(fill = value)) +
  facet_wrap(~Variable) +
  scale_fill_viridis(discrete = TRUE, name="") +
  labs(title = "Aggregate Areas, Gun Possession Arrests",
       caption = "Figure 2.6") +
  mapTheme() + theme(legend.position = "none")

3 Data Exploration

3.1 Local Moran’s I

Local Moran’s I is calculated to determine the spatial autocorrelation of gun possession arrests, which is then used to map distance to highly significant risk regions, or hotspots.

# --- Part 6: Spatial Process ----
# Create neighbor list & local spatial weights
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# Local Moran's I --> determine hotspots
final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countGuns, final_net.weights)),
    as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(gunposs_Count = countGuns, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z > 0)`) %>%
  mutate(Sig_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
  gather(Variable, Value, -geometry)

# Local Moran's I Maps
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
    geom_sf(data = chicagoBoundary, fill='black') +
    geom_sf(data = filter(final_net.localMorans, Variable == i), 
            aes(fill = Value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme() + theme(legend.position = 'bottom')}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Moran's I Statistics, Gun Possession Arrests",
                       bottom = "Figure 3.1"))

final_net <-
  final_net %>% 
  mutate(gunposs17.isSig = ifelse(localmoran(final_net$countGuns, 
                                            final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(gunposs17.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                           st_coordinates(st_centroid(
                                             filter(final_net, gunposs17.isSig == 1))), 1 ))

ggplot() + 
  geom_sf(data = final_net, aes(fill = gunposs17.isSig.dist)) +
  scale_fill_viridis(option = "B") +
  labs(title = "Distance to Gun Possession Arrest Hotspots",
       caption = "Figure 3.2") +
  mapTheme()

3.2 Correlation Scatterplot

After evaluating correlation to outcome for each risk factor, either the nearest neighbor or count feature is selected, depending on which has a higher correlation. Additionally, distance to the Loop does not appear to be as highly correlated as, say, distance to expressway, so only the latter is used.

# Significant cluster of unlawful gun arrests (hotspots)
final_net <-
  final_net %>% 
  mutate(gunposs17.isSig = 
           ifelse(localmoran(final_net$countGuns, 
                             final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(gunposs17.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, gunposs17.isSig == 1))), 1))

# Correlation Tests (use count OR nn, not both)
correlation.long <-
  st_drop_geometry(final_net) %>%
  dplyr::select(-uniqueID, -cvID, -name, 
                -District, -gunposs17.isSig, -gunposs17.isSig.dist) %>%
  gather(Variable, Value, -countGuns)

correlation.cor <-
  correlation.long %>%
  group_by(Variable) %>%
  summarize(correlation = cor(Value, countGuns, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countGuns)) +
  geom_point(size = 0.5) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "#ee8138") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Domestic Battery Count as A Function of Risk Factors",
       caption = "Figure 3.3")

4 Regression & Validation

4.1 Poisson Regression

As shown in the histogram below, the vast majority of fishnet areas in Chicago did not see any gun possession arrests in 2017, but a handful saw upwards of 10 or even 20, given the extreme right skew of the distribution. Because the distribution of gun arrests in grid cells is not normal and more closely resembles a Poisson distribution (Figure 4.1), that is the regression model used.

ggplot(final_net, aes(countGuns)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of Gun Arrests by Grid Cell",
       caption = "Figure 4.1")

4.2 Poisson Regression

Two sets of regression features are analyzed. the first set only includes the 6 selected features: abandoned buildings, street lights out, sanitation requests, expressway distance, public housing, and grocery stores. The second set includes spatial process features as well.

# --- Part 7: Poisson Regression ----
reg.vars <- c("Abandoned_Buildings", "Street_Lights_Out", "Sanitation",
              "expyDistance", "Public_Housing.nn", "Grocery_Stores.nn")

reg.ss.vars <- c("Abandoned_Buildings", "Street_Lights_Out", "Sanitation",  
                 "expyDistance", "Public_Housing.nn", "Grocery_Stores.nn", 
                 "gunposs17.isSig", "gunposs17.isSig.dist")

# Cross-validation
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countGuns ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

# Estimate 4 regressions
reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countGuns",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countGuns, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countGuns",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countGuns, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countGuns",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name, countGuns, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countGuns",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name, countGuns, Prediction, geometry)

5 Evaluation: Accuracy

As shown in Figure 4.2, including spatial process features does lead to observably more specific predictions in hotspot areas.

# --- Part 8: Accuracy & Generalizability ----
reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countGuns,
           Regression = "Random k-fold CV: Just Risk Factors"),
    
    mutate(reg.ss.cv,        Error = Prediction - countGuns,
           Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countGuns,
           Regression = "Spatial LOGO-CV: Just Risk Factors"),
    
    mutate(reg.ss.spatialCV, Error = Prediction - countGuns,
           Regression = "Spatial LOGO-CV: Spatial Process")) %>%
  st_sf() 

obsmap <- ggplot()+
  geom_sf(data = reg.summary, aes(fill=countGuns), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title="Observed Arrests") +
    mapTheme() + 
    theme(legend.position="bottom")

predmap <- unique(reg.summary$Regression)
predList <- list(obsmap)

for(i in predmap){
  predList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(reg.summary, Regression == i), aes(fill=Prediction), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme() + 
    theme(legend.position="bottom")}

do.call(grid.arrange,c(predList, ncol =5, top = "Observations vs Predictions (by CV Method)", bottom = "Figure 4.2"))

5.1 Errors

Figures 4.3 and 4.4 show that including spatial process features does reduce error somewhat. The histograms of MAE by CV method also support the finding that models that incorporate spatial process have less error, there remains at least some in the majority of cases.

filter(reg.summary, 
       Regression == "Random k-fold CV: Just Risk Factors" |
       Regression == "Random k-fold CV: Spatial Process" |
       Regression == "Spatial LOGO-CV: Just Risk Factors" | 
       Regression == "Spatial LOGO-CV: Spatial Process") %>%
  ggplot() +
    geom_sf(aes(fill = Error), line=0) +
    facet_wrap(~Regression, ncol=4) +
    scale_fill_viridis() +
    labs(title = "Gun Possession Arrest Errors by CV Method",
         caption="Figure 4.3") +
    mapTheme()

# MAE
error_by_reg_and_fold <- 
  reg.summary %>%
  group_by(Regression, cvID) %>% 
  summarize(Mean_Error = mean(Prediction - countGuns, na.rm = T),
            MAE = mean(abs(Mean_Error), na.rm = T),
            SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
  geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  facet_wrap(~Regression) +  
  geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
  labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
       x="Mean Absolute Error", y="Count",
       caption="Figure 4.4") +
  plotTheme()

The table below also shows that mean MAE is reduced slightly with spatial process included. Although all errors are less than 1, it should be noted that the overall mean average of countGuns is 1.52, given that a large majority of grid cells do not see any arrests. Therefore, these errors are still quite large in this context, and appear to primarily occur due to underprediction in the high-decile areas (Figures 4.5 & 4.6).

# Mean MAE, SD
st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
  summarize(Mean_MAE = round(mean(MAE), 2),
            SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = "Table 4.1") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#FDE725FF") %>%
  row_spec(4, color = "black", background = "#FDE725FF")

Table 4.1
Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	0.40	0.34
Random k-fold CV: Spatial Process	0.32	0.23
Spatial LOGO-CV: Just Risk Factors	0.85	0.89
Spatial LOGO-CV: Spatial Process	0.47	0.54

# Map Errors
error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
  geom_sf(aes(fill = MAE)) +
  facet_wrap(~Regression) +
  scale_fill_viridis() +
  labs(title = "Handgun errors by LOGO-CV Regression",
       caption = "Figure 4.5") +
  mapTheme() + theme(legend.position="bottom")

# Check that accounting for local spatial process eliminated spatial autocorrelation
neighborhood.weights <-
  filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  group_by(cvID) %>%
  poly2nb(as_Spatial(.), queen=TRUE) %>%
  nb2listw(., style="W", zero.policy=TRUE)

filter(error_by_reg_and_fold, str_detect(Regression, "LOGO"))  %>% 
  st_drop_geometry() %>%
  group_by(Regression) %>%
  summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                nsim = 999, zero.policy = TRUE, 
                                na.action=na.omit)[[1]],
            p_value = moran.mc(abs(Mean_Error), neighborhood.weights, 
                               nsim = 999, zero.policy = TRUE, 
                               na.action=na.omit)[[3]]) %>%
  kable(caption = "Table 4.2") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#FDE725FF")

Table 4.2
Regression	Morans_I	p_value
Spatial LOGO-CV: Just Risk Factors	0.3068481	0.001
Spatial LOGO-CV: Spatial Process	0.1039040	0.035

# Predicted vs. Obs Plot
st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
  mutate(gunposs17_Decile = ntile(countGuns, 10)) %>%
  group_by(Regression, gunposs17_Decile) %>%
  summarize(meanObserved = mean(countGuns, na.rm=T),
            meanPrediction = mean(Prediction, na.rm=T)) %>%
  gather(Variable, Value, -Regression, -gunposs17_Decile) %>%          
  ggplot(aes(gunposs17_Decile, Value, shape = Variable)) +
  geom_point(size = 2) + geom_path(aes(group = gunposs17_Decile), colour = "black") +
  scale_shape_manual(values = c(2, 17)) +
  facet_wrap(~Regression) + xlim(0,10) +
  labs(title = "Predicted and observed handguns by observed gun arrest decile",
       caption = "Figure 4.6")  +
  plotTheme()

6 Evaluation: Generalizability

6.1 Neighborhood Context

As shown by the side-by-side comparison maps in Figure 5.1, there is a strong relationship between the racial demographics present within a neighborhood and the number of individuals arrested by police for unlawful gun possession. Table 5.1 shows that the spatial Poisson regression models estimated earlier underestimates gun arrests, which as previously discussed, occurs in hotspot neighborhoods. This finding supports that majority non-White neighborhoods see more gun possession arrests.

# --- Part 9: Generalizability by Neighborhood ----
tracts17 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E", "B17001_002"), 
          year = 2017, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001, 
         TotalPoverty = B17001_002) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White"),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0)) %>%
  .[neighborhoods,]

# Compare by neighborhood race (majority White/non-White)
reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
  st_centroid() %>%
  st_join(tracts17) %>%
  na.omit() %>%
  st_drop_geometry() %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Table 5.1 Mean Error by Neighborhood Racial Context") %>%
  kable_styling("striped", full_width = F)  %>%
  row_spec(2, color = "black", background = "#FDE725FF") 

race <- ggplot() + 
  geom_sf(data = tracts17, aes(fill = raceContext)) +
  scale_fill_viridis(discrete = TRUE) +
  labs(title = "Race Context, 2017") +
  mapTheme() + 
  theme(legend.position = "bottom") +
  theme(legend.title = element_blank())

raceList = list(race, obsmap)

do.call(grid.arrange,c(raceList, ncol =2, top = "Majority Non-White Neighborhoods & Gun Arrests", bottom = "Figure 5.1"))

6.2 Poverty Context

There also appears to be a strong relationship between neighborhood poverty and number of individuals arrested by police for unlawful gun possession. The poverty rate in Chicago is 12.4%, so the povertyLvl variable is defined as: * Below Avg. Poverty: pctPoverty = 0 - 12.4% * Above Avg. Poverty: pctPoverty = 12.4 - 33.3% * Above Avg. Poverty: pctPoverty > 33.3%
Table 5.2 shows that the spatial Poisson regression models estimated earlier also underestimated gun arrests in high poverty areas.

# Compare by neighborhood race (majority White/non-White)

tracts17 <-
  tracts17 %>%
  mutate(
    povertyLvl = cut(pctPoverty, breaks=c(0, 0.124, 0.333, Inf), 
                     labels=c("Below Avg.","Above Avg.","High")))
  
reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
  st_centroid() %>%
  st_join(tracts17) %>%
  na.omit() %>%
  st_drop_geometry() %>%
  group_by(Regression, povertyLvl) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(povertyLvl, mean.Error) %>%
  kable(caption = "Table 5.2 Mean Error by Poverty Context") %>%
  kable_styling("striped", full_width = F)  %>%
  row_spec(2, color = "black", background = "#FDE725FF")

Table 5.2 Mean Error by Poverty Context
Regression	Below Avg.	Above Avg.	High
Spatial LOGO-CV: Just Risk Factors	0.5102482	0.1169854	-0.8256960
Spatial LOGO-CV: Spatial Process	0.2346239	0.0007012	-0.2301151

pov <- ggplot() + 
  geom_sf(data = tracts17, aes(fill = povertyLvl)) +
  scale_fill_viridis(discrete = TRUE) +
  labs(title = "Poverty Rate, 2017") +
  mapTheme() + 
  theme(legend.position = "bottom",
        legend.text = element_text(size = 7)) +
  theme(legend.title = element_blank())

povList = list(pov, obsmap)

do.call(grid.arrange,c(povList, ncol =2, top = "Poverty Rate & Gun Arrests", bottom = "Figure 5.2"))

7 Comparison to Kernel Density Method

Compared to the risk prediction model, the kernel density model produces smoother risk categories, although it is difficult to determine visually which model performs better (Figure 6.1).

Figure 6.2 shows that the kernel density model is roughly equal in terms of prediction accuracy. The kernel density model predicts marginally better in the middel risk categories (30-90%), but the risk prediction model shows marginally higher accuracy at the extremes.

# --- Part 10: Kernel Density Method ----
# Compare 2018 risk model to 2017 kernal density model
gunposs18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "WEAPONS VIOLATION" & 
           Description == "UNLAWFUL POSS OF HANDGUN"|Description == "UNLAWFUL POSS OF OTHER FIREARM") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

guns_ppp <- as.ppp(st_coordinates(gunposs17), W = st_bbox(final_net))
guns_KD <- spatstat::density.ppp(guns_ppp, 1000)

# Compute kernal density on 2017 data
guns_KDE_sf <- as.data.frame(guns_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(gunposs18) %>% mutate(gunsCount = 1), ., sum) %>%
      mutate(gunsCount = replace_na(gunsCount, 0))) %>%
  dplyr::select(label, Risk_Category, gunsCount)

# Risk Predictions
guns_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(gunposs18) %>% mutate(gunsCount = 1), ., sum) %>%
      mutate(gunsCount = replace_na(gunsCount, 0))) %>%
  dplyr::select(label,Risk_Category, gunsCount)

# Comparison
rbind(guns_KDE_sf, guns_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
  geom_sf(aes(fill = Risk_Category), colour = NA) +
  geom_sf(data = sample_n(gunposs18, 3000), size = .1, colour = "black", alpha = .7) +
  facet_wrap(~label, ) +
  scale_fill_viridis(discrete = TRUE, option="C") +
  labs(title="Comparison of Kernel Density and Risk Predictions",
       subtitle="2017 unlawful gun possession risk predictions; 2018 arrests",
       caption="Figure 6.1") +
  mapTheme()

# Plot -- Risk Pred better with high-risk areas, kernal density better on low-risk areas
rbind(guns_KDE_sf, guns_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countGuns = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countGuns / sum(countGuns)) %>%
  ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
  geom_bar(aes(fill=label), position="dodge", stat="identity") +
  scale_fill_viridis(discrete = TRUE) +
  labs(title = "Risk prediction vs. Kernel density, 2018 gun possessions",
       caption = "Figure 6.2") +
  plotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

8 Conclusion

A perfect risk prediction model of unlawful gun possession would allow law enforcement to seize weapons in a city with a high rate of shootings and homicides. However, there is no way to determine if this model is truly reflective of actual unregistered gun ownership, as arrests are only made where police patrol, and there is demonstrably higher policing in poorer and majority non-White communities. Instead, the usefulness of this machine learning model predicting risk of unlawful gun possession arrests lies in comparing it against shooting incidents and gun ownership records. The Chicago Police Department currently has a contract for the ShotSpotter gunshot detection system. This risk model can be compared to ShotSpotter data to determine if police are surveilling in the right areas.

Additionally, this model can be compared to gun ownership records. If low gun registration rates are due to the high cost of registration, unlawful possession might be better addressed by diverting these cases to confiscation until the owner can undergo a low-cost safety course. This may also reduce the likelihood of accidental shootings. Instead gun arrests are actually underpredicted by this risk model, in part because arrests in high poverty, majority non-White communities is so considerably higher than the rest of Chicago for these violations. Charging poor Chicagoans for not passing criteria that they cannot afford to pass is punitive and conflates poor gun ownership with illegal gun ownership. Ultimately, however, this prediction model is not accurate enough to be of much use for anything at the moment. More features are needed to improve accurracy in capturing the spatial correlation in middle- and high-risk areas.

Risk Prediction Modeling: Firearm Posession, Chicago

Juliana Zhou

December 18, 2020