1.1 Predictive Policing & Geospatial Risk Models

In 2017, the Chicago Police Department recorded 268,387 crime incidents. On average, 735 incidents were reported daily, let alone those that went unrecorded. These incidents not only harm individual safety but also create a sense of chaos for other residents and visitors. The effects extend beyond public safety into land values and overall community economic development.

To protect public safety, property values, and community well-being, the Chicago Police Department has expressed interest in, and provided funding for, interpreting crime patterns geospatially. Since policing resources are quite limited, the department turns to comprehensive data research and creative algorithms for insights on better resource allocation. Luckily, the databases maintained in recent years enable such an approach.

To assist this effort, I demonstrate a crime risk model focused on one of the most severe crime types: Armed Robbery with Gun. This predictive model returns places at risk of Armed Robbery with Gun in the future by using actual crime information as well as socio-economic factors that may be independent of subjective policing interventions. The socio-economic factors are environmental risk and protective factors, such as housing blight, vacancy, and the condition of community assets. Meanwhile, the model considers selection bias by studying neighborhood disparity effects, such as the over-policing of minority neighborhoods observed in practice.

Beyond this measure to study selection bias, the Chicago Police Department should treat this model with system-wide caution. Bias comes with the dataset as well as with the way the data was collected in the first place. Whether it stems from inconsistent collection methods across departmental groups or from specific policing interventions, such bias may not be easily removed. Therefore, it is recommended to validate the model multiple times with multiple participants in order to create a predictive policing model that is useful in practice.

1.2 Accuracy/Generalizability Tradeoff

A successful risk prediction model recognizes the tradeoff between accuracy and generalizability. A highly generalizable model predicts crime risk across different contexts regardless of its previous experience with crime reports. A highly accurate model generates predicted crime counts close to the observed crime counts. While a generalizable model is good at predicting latent risk, that is, the risk of crime in places where no crime was reported, it may be less accurate in its results. Conversely, an accurate model may match the actual crime counts but fail to capture latent risk. Since the purpose of this model is to help police allocate limited resources to protect citizens, I aim for a useful accuracy that improves resource allocation over business as usual, and a useful generalizability that can speak to different contexts.

In the following sections, I will first pull crime data from the Chicago Open Data website using R. Then, I will do some data wrangling to help us observe these data. In section 3, I will focus on feature engineering to prepare for the prediction model. In section 4, I will develop new spatial structure features. In section 5, I will run the model, then analyze and map the results. Lastly, in section 6, I will continue the analysis and validate the predictions for accuracy and generalizability.

But first, I prepared the graphic style and format.

library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(size = 13, color = "black", family = "sans"),
    plot.title = element_text(size = 14, colour = "black", family = "sans"),
    plot.subtitle = element_text(size = 12, face = "italic", family = "serif"),
    plot.caption = element_text(size = 10, hjust = 0, family = "serif"),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "grey", fill=NA, size=2)
  )
}

2.1 Data Wrangling: Defining Spatial Units for Data Aggregation

The choice of spatial scale is important for precision intelligence. To characterize the existing crime context, it is crucial to find the right scale at which to store information.

Currently, the police department uses Police Beats and Police Districts (figure below) as the administrative units for distributing resources. However, both units serve too many people at a time, which makes it hard to characterize each unit. In addition, using either of these units would assume that risk changes abruptly at the boundaries of adjacent areas.

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform(crs=102271) %>%
  dplyr::select(District = dist_num)

policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform(crs=102271) %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))
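
Here is a small-multiple map of these two administrative units (a minimal sketch using the bothPoliceUnits object and the mapTheme defined above):

ggplot() +
  geom_sf(data = bothPoliceUnits, fill = NA, colour = "black") +
  facet_wrap(~Legend) +
  labs(title = "Police Administrative Units, Chicago") +
  mapTheme()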

To minimize the bias of using an existing administrative boundary as the analysis boundary, I designed equally sized and shaped grid cells to aggregate the data, known as the “fishnet” approach. On top of the City of Chicago boundary (excluding the O’Hare airport area to the northwest), I applied a 500ft x 500ft fishnet to capture incident data. This grid cell size is appropriate for a few reasons. 500ft is roughly the distance between the nth and (n+1)th Street in Chicago. At this block level, the distribution of crime data is more interpretable, providing insights for the distribution of policing resources. Moreover, a 500ft x 500ft fishnet produces around 20k grid cells, which most computers can process.

chicagoBoundary <- 
  st_read("C:/Users/Joyce Yuqi Liu/Desktop/GISR_data/Week10_hw/Chicago_boundary/chicagoBoundary.shp") %>%
  st_transform(crs=102271) 
fishnet <- 
  st_make_grid(chicagoBoundary, cellsize = 500) %>%
  st_sf()
ggplot() + 
  geom_sf(data=chicagoBoundary, fill=NA, colour="black") +
  geom_sf(data=fishnet, fill=NA, colour="black") +
  labs(title = "Chicago and the fishnet") +
  mapTheme()

fishnet <- 
  fishnet[chicagoBoundary,] %>%
  mutate(uniqueID = rownames(.)) %>%
  dplyr::select(uniqueID)

ggplot() +
  geom_sf(data=fishnet) +
  labs(title = "Fishnet in Chicago") +
  mapTheme()

With the grid cells serving as my new spatial unit, I pulled the Armed Robbery with Gun crime dataset from the Chicago Open Data API. I chose this crime type not only because it is one of the most severe, but also because it is abundant in the data, as shown by the Chicago Police Department ClearMap (http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html). Some additional data wrangling included removing extraneous characters from the “Location” field, which was separated into “X” and “Y” coordinate fields. Those fields were then made numeric, converted to simple features, and projected. I also removed duplicate geometries.

Armed_Robbery<- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
  filter(Primary.Type == "ROBBERY" & 
           Description == "ARMED: HANDGUN") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  distinct()

Armed_Robbery
## Simple feature collection with 4284 features and 0 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 334758.5 ymin: 553009.8 xmax: 367321 ymax: 594816.3
## epsg (SRID):    102271
## proj4string:    +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## First 10 features:
##                     geometry
## 1  POINT (353861.6 580374.2)
## 2  POINT (355523.4 579779.8)
## 3  POINT (353083.8 592043.8)
## 4  POINT (352910.3 591928.9)
## 5  POINT (351350.4 590280.8)
## 6  POINT (352293.7 580697.3)
## 7    POINT (357287 566203.1)
## 8  POINT (350607.9 575776.6)
## 9  POINT (349204.3 583978.5)
## 10 POINT (358937.5 581242.3)

Here is how the Armed Robbery dataset looks within the Chicago boundary (a sketch of the map follows below). Selection bias in this crime type may come from the possibility that Armed Robbery cases are reported as committed by people of a certain race, or by people with access to guns (age and socio-economic status). It may also depend on whether the gun is observed during the interaction with police officers; if not, the case might be categorized differently and left out of our intended violent crime selection, despite being just as dangerous as an armed robbery with a gun. Moreover, this dataset reports only a point location rather than an area of activity, which may bias how incidents are recorded, especially when the activity spans multiple administrative or grid cell (fishnet) areas. These are all case-specific selection biases that we should keep in mind.
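
A minimal sketch of that point map, assuming the chicagoBoundary and Armed_Robbery objects created above:

ggplot() +
  geom_sf(data = chicagoBoundary, fill = "grey90", colour = "black") +
  geom_sf(data = Armed_Robbery, colour = "darkred", size = 0.3) +
  labs(title = "Armed Robbery with Gun, Chicago 2017") +
  mapTheme()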

2.2 Data Wrangling: Joining Crime Data to Spatial Units

Now that I have both the spatial unit and the dataset, it’s time to join the Armed Robbery with Gun dataset (Armed_Robbery) to the fishnet. First, I summed the Armed Robbery with Gun count for each grid cell and assigned 0 to grid cells with no such robbery. Then, I generated a random group “cvID” for cross-validation later, where roughly 100 groups allow 100-fold cross-validation.

crime_net <-
  Armed_Robbery%>% 
  dplyr::select() %>% 
  mutate(countRobbery = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRobbery = ifelse(is.na(countRobbery), 0, countRobbery),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))

Here is how crime_net looks on the fishnet over the Chicago boundary.

ggplot() +
  geom_sf(data = crime_net, aes(fill = countRobbery)) +
  scale_fill_viridis() +
  labs(title = "Count of Armed Robberyfor the fishnet") +
  mapTheme()

2.3 Data Wrangling: Add Environmental Risk Factors

To understand the Armed Robbery with Gun pattern geospatially, I needed to collect risk factors that may be associated with it. Based on the datasets available on the Chicago Open Data Portal and recommendations from collaborators, I started with an initial model built on risk factors related to the built environment. I selected six risk factors: street lights out, alley lights out, abandoned buildings, liquor retail, restaurants, and non-residential garages. Neighborhood boundaries were also read in for the distance-based feature engineering later.

StreetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Street_Lights_Out")

AlleyLightsout <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Alley-Lights-Out-Historical/t28b-ys7j") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Alley_Lights_Out")

AbandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
  mutate(year = substr(date_service_request_was_received,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings")

LiquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")

Restaurant <-
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q") %>%
  filter((BUSINESS.ACTIVITY == "Preparation of Food and Dining on Premise With Seating")|(BUSINESS.ACTIVITY == "Sale of Food Prepared Onsite With Dining Area")) %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Restaurant")

Garage_nonResidential <-
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q") %>%
  filter((BUSINESS.ACTIVITY == "Provide Parking Spaces for a Free - Available and Advertised to the Public ")|(LICENSE.DESCRIPTION == "Commercial Garage")) %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Garage_nonResidential")

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 

Here are the maps for these chosen risk factors.

ggplot() +
  geom_sf(data=chicagoBoundary) +
  geom_sf(data = rbind(StreetLightsOut,AlleyLightsout,AbandonBuildings,
                       LiquorRetail, Restaurant, Garage_nonResidential),
          size = .1) +
  facet_wrap(~Legend, ncol = 2) +
  labs(title = "Risk Factors") +  
  mapTheme()

3.1 Feature engineering – Initial Assembling of Features

I mainly applied two feature engineering techniques in this model. The first was to sum the number of each risk factor in every grid cell by binding the individual risk factor layers together. The result was a point data layer, which was then converted to a data frame. To analyze the data frame with regression, I spread it so that each risk factor becomes a column and each unique grid cell a row. Then, I joined this new data frame to the fishnet as “vars_net”.

vars_net <- 
  rbind(StreetLightsOut,AlleyLightsout,AbandonBuildings,
        LiquorRetail, Restaurant, Garage_nonResidential) %>%
  st_join(., fishnet, join=st_within) %>%
  st_set_geometry(NULL) %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit()

vars_net
## Simple feature collection with 2458 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 341164.1 ymin: 552875.4 xmax: 367664.1 ymax: 594875.4
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +units=m +no_defs
## # A tibble: 2,458 x 8
## # Groups:   uniqueID [2,458]
##    uniqueID                  geometry Abandoned_Build~ Alley_Lights_Out
##    <chr>                <POLYGON [m]>            <dbl>            <dbl>
##  1 1        ((359164.1 552875.4, 359~                0                0
##  2 10       ((363664.1 552875.4, 364~                0                0
##  3 100      ((355664.1 555375.4, 356~                0                3
##  4 1000     ((360164.1 568375.4, 360~                1                0
##  5 1001     ((360664.1 568375.4, 361~                0                1
##  6 1002     ((361164.1 568375.4, 361~                0                2
##  7 1003     ((361664.1 568375.4, 362~                0                0
##  8 1004     ((362164.1 568375.4, 362~                0                0
##  9 1005     ((362664.1 568375.4, 363~                0                0
## 10 1006     ((363164.1 568375.4, 363~                0                0
## # ... with 2,448 more rows, and 4 more variables: Garage_nonResidential <dbl>,
## #   Liquor_Retail <dbl>, Restaurant <dbl>, Street_Lights_Out <dbl>

While the wide form of a dataset like vars_net comes in handy for analysis, I used the original long form (before spreading) to visualize the risk factors and their counts on a small multiple map. Here is how the risk factors look on the fishnet.

vars_net.long <- 
  vars_net %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange, c(mapList, ncol = 2, top = "The Six Risk Factors by Fishnet (counts)"))

You may have noticed that these factors suggest different spatial patterns across the city. While abandoned buildings are concentrated in the city’s south, non-residential garages, restaurants, and liquor stores have a few hotspots near the Loop, downtown Chicago. Alley lights out and street lights out appear more dispersed than the other factors. These observations made me wonder whether such distinctive patterns in the environmental factors are related to the location of the Loop. Here is a visualization.

loopPoint <-
  neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

ggplot() +
  geom_sf(data=vars_net, aes(fill=loopDistance)) +
  scale_fill_viridis() +
  labs(title="Euclidean distance to The Loop (ft)") +
  mapTheme() 

3.2 Feature engineering – Nearest Neighbor Features

Engineering the risk factors creates a new perspective on the raw data. Here, I introduce an average nearest neighbor distance approach. It essentially extends discrete points into a continuous surface, as every grid cell is assigned a spatial relationship to each factor. Using grid cell centroids, I measured the average distance to the k nearest data points (k = 3 in this case) for each of the six risk factors. For restaurants, for example, this meant finding, for every grid cell, the average distance to its 3 nearest restaurants. As a result, the map suggests a gradient of distance values, where grid cells with smaller values are closer to a cluster of restaurant locations relative to other grid cells.

nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  # distances from each origin point to its k nearest destination points
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    # average distance to the k nearest neighbors
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

vars_net$StreetLightsOut.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(StreetLightsOut), 3)

vars_net$AlleyLightsout.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(AlleyLightsout), 3)

vars_net$AbandonBuildings.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(AbandonBuildings), 3)

vars_net$LiquorRetail.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(LiquorRetail), 3)

vars_net$Restaurant.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(Restaurant), 3)

vars_net$Garage_nonResidential.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(Garage_nonResidential), 3)

Here is the small multiple map for Nearest Neighbor Risk Factors by fishnet.

vars_net.long.nn <- 
  vars_net %>%
  dplyr::select(ends_with(".nn")) %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange, c(mapList, ncol = 2, top = "Nearest Neighbor Risk Factors by Fishnet"))

3.3 Feature engineering – Integrate All Data in One Dataframe

Having prepared both the dependent variable, Armed Robbery incidents (crime_net), and the independent variables on the fishnet (vars_net), I combined them into one data frame.

final_net <-
  left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID") 

In addition, neighborhood boundaries and police districts are captured by grid cell in the same data frame. I noticed that some grid cells aren’t assigned a neighborhood, and some neighborhoods are assigned to only a few grid cells. Both are normal: some grid cells fall beyond the neighborhood boundaries, while some neighborhoods are simply too small to be captured by more grid cells.

final_net <-
  st_centroid(final_net) %>%
  st_join(., dplyr::select(neighborhoods, name)) %>%
  st_join(., dplyr::select(policeDistricts, District)) %>%
  st_set_geometry(NULL) %>%
  left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
  st_sf() %>%
  na.omit()

dplyr::select(final_net, name, District) %>%
  gather(Variable, Value, -geometry) %>%
  ggplot() +
  geom_sf(aes(fill = Value)) +
  facet_wrap(~Variable) +
  scale_fill_viridis(discrete = TRUE) +
  labs(title = "POlice District and Neighborhood Name") +
  mapTheme() + theme(legend.position = "none")

4.1 Spatial Structure Exploration

To identify spatial patterns, i.e., spatial autocorrelation, a common statistical method is Moran’s I. The null hypothesis I am testing against is that the Armed Robbery count at a given location is randomly distributed relative to its immediate neighbors. The first step is to build a queen contiguity matrix, which treats the surrounding eight grid cells of any one cell as its neighbors, rather than a rook matrix, which considers only the four adjacent cells.

final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
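
As a quick sanity check on that null hypothesis, a global Moran's I test can be run on the weights object before mapping the local statistics (a minimal sketch; the analysis below relies on the local version):

moran.test(final_net$countRobbery, final_net.weights, zero.policy = TRUE)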

Applying the queen weights to the Armed Robbery with Gun counts, I calculated local Moran’s I and evaluated the statistic, its p-value, and the grid cells that are significant (p <= 0.05). Here are the small multiple maps to help visualize these analyses.

final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countRobbery, final_net.weights)),
    as.data.frame(final_net, NULL)) %>% 
  st_sf() %>%
  dplyr::select(Armed_Robbery_Count = countRobbery, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z > 0)`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
  gather(Variable, Value, -geometry)

vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange, c(varList, ncol = 4, top = "Local Moran's I statistics, Armed Robbery"))

To further distinguish the highly significant grid cells from the less significant ones, I created a dummy variable flagging highly significant clusters (p < 0.0000001). Then I measured the distance from each grid cell to its nearest highly significant cluster, or what we usually call a hotspot. Visualizing these distance values shows which areas are relatively near to and far from the significant clusters.

final_net <-
  final_net %>% 
  mutate(Armed_Robbery.isSig = ifelse(localmoran(final_net$countRobbery, 
                                            final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(Armed_Robbery.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                           st_coordinates(st_centroid(
                                             filter(final_net, Armed_Robbery.isSig == 1))), 1 ))
ggplot() + 
  geom_sf(data = final_net, aes(fill = Armed_Robbery.isSig.dist)) +
  scale_fill_viridis() +
  labs(title = "Distance to Highly Significant Armed Robbery with Gun hotspots") +
  mapTheme()

4.2 Correlation Tests

In this section, I explore the correlation between the Armed Robbery with Gun count and the six risk factors. Since I engineered features with the nearest neighbor approach, the correlation plots include both the count and nearest neighbor versions of each factor. They suggest similar messages; for example, where alley lights are out, Armed Robbery with Gun incidents tend to increase. They also helped me select which form of each feature to use in the regression model.

correlation.long <-
  st_set_geometry(final_net, NULL) %>%
  dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District) %>%
  gather(Variable, Value, -countRobbery)

correlation.cor <-
  correlation.long %>%
  group_by(Variable) %>%
  summarize(correlation = cor(Value, countRobbery, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countRobbery)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Armed Robberycount as a function of risk factors")

5.1 Regression Model: Poisson Regression is the choice

To choose the appropriate type of regression, I drew a histogram to examine the distribution of Armed Robbery with Gun counts. Since such crime is relatively rare, it is reasonable to see a heavily skewed distribution. However, this violates OLS regression’s normality assumption, making OLS inappropriate in this case. Poisson regression is the better choice: it treats the count in each grid cell as a non-negative integer drawn from a Poisson distribution and models the log of the expected count as a linear function of the risk factors. In other words, I model the expected number of Armed Robbery with Gun incidents across space, and the regression assumption changes accordingly: the counts are assumed to follow a Poisson distribution rather than a normal one.

ggplot(final_net, aes(countRobbery)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of Armed Robbery with Gun by Grid Cell")

5.2 Poisson Regression: Cross validation Method and Data for Regression Analysis

Since the geospatial risk model targets factors that are strictly spatial, I can perform spatial cross-validation to evaluate the generalizability of the model. A generalizable model in this case can predict Armed Robbery risk at both the city and the neighborhood level. However, prediction at the neighborhood scale, or at small scales in general, is hard, as Armed Robbery with Gun can be location-specific. To address this, I used Leave-One-Group-Out cross-validation (LOGO-CV), where a specific location is held out and predicted for by a model trained on the remaining areas.

I applied two types of cross-validation at the scales provided by three existing fields in the data. For random k-fold cross-validation, I used the cvID generated randomly for each grid cell earlier. For spatial cross-validation (LOGO-CV in this case), I used neighborhood name and police district.

In total, I was interested in two regression specifications: (A) “just risk factors” and (B) “risk factors plus the spatial structure of the dependent variable” (Spatial Structure), each combined with the two types of cross-validation described above. In LOGO-CV, each neighborhood takes a turn as the hold-out and is predicted for. The outcome is simply a “Prediction” for each grid cell.

reg.vars <- c("StreetLightsOut.nn", "AlleyLightsout.nn", "AbandonBuildings.nn", 
              "LiquorRetail.nn", "Restaurant.nn", "Garage_nonResidential.nn", "loopDistance"
        )

reg.ss.vars <- c("StreetLightsOut.nn", "AlleyLightsout.nn", "AbandonBuildings.nn", 
                 "LiquorRetail.nn", "Restaurant.nn", "Garage_nonResidential.nn", "loopDistance", 
                 "Armed_Robbery.isSig", "Armed_Robbery.isSig.dist")

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countRobbery ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

After creating the cross-validation schemes, assembling the variable lists, and defining the spatial cross-validation procedure, I am ready to compare the four regression models:

1. Random k-fold CV: Just Risk Factors
2. Random k-fold CV: Spatial Structure
3. Spatial LOGO-CV: Just Risk Factors
4. Spatial LOGO-CV: Spatial Structure

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name, countRobbery, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name, countRobbery, Prediction, geometry)
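
Police districts were also mentioned above as a possible spatial cross-validation unit; a minimal sketch of that variant (not included in the comparisons that follow) reuses the same helper with District as the fold id:

reg.ss.districtCV <- crossValidate(
  dataset = final_net,
  id = "District",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = District, countRobbery, Prediction, geometry)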

5.3 Poisson Regression Models: Accuracy & Generalizability

There are obvious tradeoffs between generalizability and accuracy: if the model summarizes patterns too generally, it sacrifices prediction accuracy. In this case, however, I argue that generalizability is slightly more important than accuracy, given the different spatial scales included in my model. To demonstrate this, I created a data frame called reg.summary to collect the four regression models run so far. Then, I compared the models by mapping out their predictions. As the map shows, the predictions from the regressions that include the spatial structure features appear to pick up the localized hotspots better.

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = countRobbery- Prediction,
           Regression = "Random k-fold CV: Just Risk Factors"),
    
    mutate(reg.ss.cv,        Error = countRobbery- Prediction,
           Regression = "Random k-fold CV: Spatial Structure"),
    
    mutate(reg.spatialCV,    Error = countRobbery- Prediction,
           Regression = "Spatial LOGO-CV: Just Risk Factors"),
    
    mutate(reg.ss.spatialCV, Error = countRobbery- Prediction,
           Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
  st_sf() 

grid.arrange(
  reg.summary %>%
    ggplot() +
    geom_sf(aes(fill = Prediction)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Predicted Armed Robberyby Regression") +
    mapTheme() + theme(legend.position="bottom"),
  
  filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
    ggplot() +
    geom_sf(aes(fill = countRobbery)) +
    scale_fill_viridis() +
    labs(title = "Observed Armed Robbery\n") +
    mapTheme() + theme(legend.position="bottom"), ncol = 2)

Taking a closer look at the LOGO-CV results, I observed that the regression without spatial structure tends to under-predict at the hotspots, leaving larger raw errors there.

filter(reg.summary, Regression == "Spatial LOGO-CV: Just Risk Factors" | 
         Regression == "Spatial LOGO-CV: Spatial Structure") %>%
  ggplot() +
  geom_sf(aes(fill = Error)) +
  facet_wrap(~Regression) +
  scale_fill_viridis() +
  labs(title = "Armed Robbery with Gun errors by Regression") +
  mapTheme()

Zooming out from LOGO-CV to look at all four regression models, I calculated the MAE and the standard deviation of the absolute error. On an absolute basis, the models are off by roughly 1.2 to 1.4 Armed Robbery with Gun counts on average. The mean raw error, for comparison, was 2.4035. Because overprediction in minority neighborhoods is a concern, I also looked at the mean raw (signed) error rather than MAE alone (see the sketch after the table below). In general, the regressions with spatial structure have a smaller error than those without; spatial structure offers a better result at predicting Armed Robbery with Gun crime clusters.

st_set_geometry(reg.summary, NULL) %>%
  group_by(Regression) %>% 
  summarize(MAE = round(mean(abs(Prediction - countRobbery), na.rm = T),2),
            SD_MAE = round(sd(abs(Prediction - countRobbery), na.rm = T),2)) %>% 
  kable(caption = "MAE by regression") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#FDE725FF") %>%
  row_spec(4, color = "black", background = "#FDE725FF")
MAE by regression
Regression MAE SD_MAE
Random k-fold CV: Just Risk Factors 1.42 1.62
Random k-fold CV: Spatial Structure 1.23 1.30
Spatial LOGO-CV: Just Risk Factors 1.44 1.64
Spatial LOGO-CV: Spatial Structure 1.25 1.31
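
For reference, the signed (raw) error mentioned above can be tabulated the same way (a minimal sketch using the Error field already defined in reg.summary):

st_set_geometry(reg.summary, NULL) %>%
  group_by(Regression) %>% 
  summarize(Mean_Raw_Error = round(mean(Error, na.rm = T), 4)) %>%
  kable(caption = "Mean raw error by regression") %>%
  kable_styling("striped", full_width = F)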

5.4 Data Equity: Neighborhood Context

Racial makeup is one of the most common demographic features used to describe neighborhood context. A model may not predict values perfectly, but it would fail miserably if it used data biased by overpolicing, which in turn could reinforce systematic overpolicing of certain communities. In this section, I investigate this fundamental issue of data equity.

I pulled census tract data on racial makeup, specifically the racial majority of each tract (“Majority_White” and “Majority_Non_White”). The map suggests a very segregated pattern based on these census categories.

census_api_key("0088c80cb8bce039e28c59229e46a56da29cf05b", install = TRUE)

tracts17 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2017, state=17, county=031, geometry=T) %>%
  st_transform(102271)  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]
ggplot() + 
  geom_sf(data = tracts17, aes(fill = raceContext)) +
  scale_fill_viridis(discrete = TRUE) +
  labs(title = "Race Context", name="Race Context") +
  mapTheme() 

Instead of running the generalizability test on all four regressions, I focused on the two LOGO-CV regressions, comparing their mean error by regression and race context. This error value shows whether the models over- or under-predict in a given neighborhood context. As the table shows, the model tends to overpredict (+) Armed Robbery in Majority Non-White neighborhoods while underpredicting (-) in Majority White neighborhoods. This suggests that using the model directly for policing resource allocation could raise racial equity concerns. The spatial structure features, however, again prove worth including in the final model, as they yield a much lower absolute error.

final_reg <- 
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure" |
           Regression == "Spatial LOGO-CV: Just Risk Factors") %>%
  mutate(uniqueID = rownames(.))

final_reg.tracts <- 
  st_join(st_centroid(final_reg), tracts17) %>%
  st_set_geometry(NULL) %>%
  left_join(dplyr::select(final_reg, uniqueID)) %>%
  st_sf() %>%
  na.omit()

st_set_geometry(final_reg.tracts, NULL) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Mean Error by neighborhood racial context") %>%
  kable_styling("striped", full_width = F)
Mean Error by neighborhood racial context
Regression Majority_Non_White Majority_White
Spatial LOGO-CV: Just Risk Factors 0.4199251 -0.4344162
Spatial LOGO-CV: Spatial Structure 0.2456442 -0.2519098

5.5 Compare with the Tradition: Does My Model Predict Better than the Traditional Crime Hotspots?

Traditional crime hotspot mapping relies on kernel density, which places a smooth kernel over each point, with the highest value at the point itself and the lowest at the edge of the defined search radius. The density at a place can be understood as the sum of all overlapping kernels, so higher values occur where points cluster. By choosing different search radii, I can make different hotspot maps for Armed Robbery, as demonstrated below.

library(raster)
Rob_ppp <- as.ppp(st_coordinates(Armed_Robbery), W = st_bbox(final_net))
Rob_KD.1000 <- spatstat::density.ppp(Rob_ppp, 1000)
Rob_KD.1500 <- spatstat::density.ppp(Rob_ppp, 1500)
Rob_KD.2000 <- spatstat::density.ppp(Rob_ppp, 2000)
Rob_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(Rob_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(Rob_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(Rob_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

Rob_KD.df$Legend <- factor(Rob_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
ggplot(data=Rob_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis() +
  mapTheme()

As the search radius increases, the hotspots become smoother and less pronounced. For the comparison that follows, I used the 1000ft radius, overlaid a sample of Armed Robbery with Gun points, and aggregated the density to the fishnet for a consistent spatial scale.

Rob_ppp <- as.ppp(st_coordinates(Armed_Robbery), W = st_bbox(final_net))
Rob_KD <- spatstat::density.ppp(Rob_ppp, 1000)
as.data.frame(Rob_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  ggplot() +
  geom_sf(aes(fill=value)) +
  geom_sf(data = sample_n(Armed_Robbery, 1500), size = .5) +
  scale_fill_viridis() +
  mapTheme()

To compare the goodness of fit between these two modeling approaches (Poisson regression vs. kernel density), I created an indicator, “Risk_Category”, which reclassifies the predicted values into five risk categories. I followed the steps suggested by Ken Steif in his markdown, listed below.

  1. Compute the kernel density; and aggregate to the fishnet.
  2. Scale the kernel density to run from 1-100 and then reclassify those values into 5 risk categories.
  3. Join to that the count of armed robberies for each grid cell.
  4. Repeat for the risk predictions.
  5. Count the rate of test set points by model type and risk category. Map and plot accordingly.
#build risk_category for kernel density and regression model
# Compute kernel density
Rob_ppp <- as.ppp(st_coordinates(Armed_Robbery), W = st_bbox(final_net))
Rob_KD <- spatstat::density.ppp(Rob_ppp, 1000)
# Convert kernel density to grid cells taking the mean
Rob_KDE_sf <- as.data.frame(Rob_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  #Mutate the Risk_Category field as defined below.
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  #Bind to a layer where test set crime counts are spatially joined to the fishnet.
  bind_cols(
    aggregate(
      dplyr::select(Armed_Robbery) %>% mutate(RobCount = 1), ., length) %>%
      mutate(RobCount = replace_na(RobCount, 0))) %>%
  #Select the fields I need
  dplyr::select(label, Risk_Category, RobCount)

head(Rob_KDE_sf)
## Simple feature collection with 6 features and 3 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 359664.1 ymin: 552875.4 xmax: 362664.1 ymax: 553375.4
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +units=m +no_defs
##            label Risk_Category RobCount                       geometry
## 1 Kernel Density    30% to 49%        1 POLYGON ((359664.1 552875.4...
## 2 Kernel Density    30% to 49%        0 POLYGON ((360164.1 552875.4...
## 3 Kernel Density    30% to 49%        0 POLYGON ((360664.1 552875.4...
## 4 Kernel Density     1% to 29%        0 POLYGON ((361164.1 552875.4...
## 5 Kernel Density     1% to 29%        0 POLYGON ((361664.1 552875.4...
## 6 Kernel Density     1% to 29%        0 POLYGON ((362164.1 552875.4...
#build risk_category for LOGO-CV
Rob_risk_sf <-
  filter(final_reg, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  bind_cols(
    aggregate(
      dplyr::select(Armed_Robbery) %>% mutate(RobCount = 1), ., length) %>%
      mutate(RobCount = replace_na(RobCount, 0))) %>%
  dplyr::select(label,Risk_Category, RobCount)

I calculated Risk_Category for both the kernel density and the LOGO-CV regression and compared them against the observed data. It turns out that my risk prediction regression with LOGO-CV is a stronger fit: its high risk categories are more nuanced and more closely match the places with a high density of Armed Robbery with Gun crime.

rbind(Rob_KDE_sf, Rob_risk_sf) %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
  geom_sf(aes(fill = Risk_Category), colour = NA) +
  geom_sf(data = sample_n(Armed_Robbery, 1500), size = .1, colour = "black") +
  facet_wrap(~label) +
  scale_fill_viridis(discrete = TRUE) +
  labs(title="Comparison of Kernel Density and My Risk Predictions",
       subtitle="Relative to test set points (in black)") +
  mapTheme()

The same argument is echoed by the bar chart of risk category by model type. As you can see, the risk predictions capture a greater share of observed Armed Robbery events in the highest risk category, and a smaller share in the mid and mid-high risk categories.

rbind(Rob_KDE_sf, Rob_risk_sf) %>%
  st_set_geometry(NULL) %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countRobbery= sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countRobbery/ sum(countRobbery)) %>%
  ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
  geom_bar(aes(fill=label), position="dodge", stat="identity") +
  scale_fill_viridis(discrete = TRUE)

Conclusion

In conclusion, my geospatial risk prediction model used six risk factors, all related to the built environment, plus a summarized spatial structure to predict Armed Robbery with Gun. It treated generalizability as slightly more important than accuracy because of the large spatial scale of the model. While the cross-validation process suggested that spatial structure helps make the model more generalizable, the neighborhood context test showed that it still tends to overpredict or underpredict neighborhoods with different racial contexts. In addition, the accuracy tests based on mean error suggested a fair amount of accuracy in the models, especially the one with spatial structure. Finally, compared with the traditional kernel density model, my model with spatial structure still performs better, although prediction accuracy is weaker in the mid and mid-high crime risk categories. Nevertheless, given its fair generalizability, I would recommend this model as a solid prototype for the Chicago Police Department.

I also have a few suggestions for model improvements.

First of all, in practice, data analysts in the police department could examine more risk factors, and more combinations of them, based on local knowledge and the data available. If other factors that speak to certain crime patterns are not currently captured, it would be worthwhile to have officers record such data going forward. Recently, the Chicago Police Department has indeed set a precedent by requiring all officers to report whenever they point a gun during an intervention. In theory, this could help researchers learn more about selection bias and thus better understand which risk factors to include.

Secondly, I think the modeling approach should always be coupled with the current human decision-making process; the model should only be claimed more accurate when it outperforms current decision-making while staying equally effective across the city.

Thirdly, it might be a good idea to train data researchers in the police department to model risk at a smaller, city-region scale rather than trying to generalize across the entire city (see the sketch below). This could not only improve model accuracy but also make generalizability easier to achieve for that specific region. Policing resource allocation could then be based on one or a few regions of the city. Most importantly, applying this model in production should be a multi-stakeholder decision and a dynamic process, with constant updates and critical adaptation.
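
A hypothetical sketch of that region-scale idea: restrict the fishnet to a single police district (the district number below is only a placeholder) and re-run the neighborhood LOGO-CV within it.

district_net <- filter(final_net, District == "7")

reg.ss.district <- crossValidate(
  dataset = district_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name, countRobbery, Prediction, geometry)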

Reference in brief

Ken Steif, Geospatial risk modeling - the case of crime (markdown), 10/27/2019.
Chicago Open Data Portal: https://data.cityofchicago.org/Public-Safety/Crimes-Map/dfnk-7re6
Chicago Police Department Resources: https://home.chicagopolice.org/