Predict Risk

Police Administrative Areas in Chicago

This code makes a map of the police districts and the police beats in Chicago. The different zones on these maps are not all equal in size or in population.

ggplot() +
  geom_sf(data = bothPoliceUnits) +
  facet_wrap(~Legend) +
  labs(title = "Police Administrative Areas, Chicago") +
  mapTheme()

Fishnet Lattice of Chicago

In order to examine the distribution of variables without the bias of the lines from police districts or police beats, we can make a fishnet. This code makes a fishnet lattice of Chicago (excluding O’Hare airport), where each grid cell is a 500x500ft area. This will help us to identify the distribution of crime in the city.

ggplot() +
  geom_sf(data = fishnet) +
  labs(title = "Fishnet in Chicago") +
  mapTheme()

Simple Battery Crimes

The outcome of interest I am looking at is simple battery crimes. Using a dataset from Chicago Open Data API, we can plot the different locations where simple battery occurred in Chicago.

Now, we can spatially join the coordinates of the battery incidents to the fishnet lattice of Chicago. I added a variable countBattery so there could be a count of battery incidents in each grid cell of the fishnet. I then mapped the counts of battery for each cell.

Risk Factors

There are many risk factors that may contribute to battery incidents, including street lights being out, sanitation code complaints, liquor retail, and alley lights being out. Here are the maps for each of these risk factors.

ggplot() +
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = risk_factors, size = 0.1) +
  labs(title = "Risk factors, points") +
  facet_wrap(~Legend) +
  mapTheme()

Count of Risk Factors by grid cell

Next, I made a count for each risk factor to count how many incidents occur in each grid cell. When we attach the risk factor points to the lattice grid of Chicago, we can see which cells have higher incidents of each risk factor.

Nearest Neighbor Features

The nearest neighbor feature measures the average distance from each grid cell to the k nearest data points. In this case, k=3. For example, for liquor retail, this measures the average distance of the three nearest liquor stores to each cell. Thus, grid cells that have a smaller value in this situation would be closer to more liquor stores.

st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
  mutate(
    Liquor_Retail.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
    Street_Lights_Out.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
    Alley_Lights_Out.nn = 
      nn_function(st_c(st_coid(vars_net)), st_c(alleyLightsOut), 3),
    Sanitation.nn =
      nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3))

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, top = "Nearest Neighbor risk Factors by Fishnet"))

Distance to the Loop

From the maps above, we can see that different risk factors cluster in different areas. For instance, liquor retail has a major hotspot on the northeast side of the city. In order to look deeper in to the reason for this, we can make a map showing the distance of each grid cell to the loop, a business district in Chicago. When comparing these maps, we see that many of the liquor retail locations are in the loop.

loopPoint <-
  filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric()

ggplot() +
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = vars_net, aes(fill = loopDistance)) +
  scale_fill_viridis() +
  labs(title = "Euclidean Distance to The Loop") +
  mapTheme()

Police Districts and Neighborhoods

Since we have data for battery incidents and the four risk factors, we should combine these into one dataframe, final_net. Using this dataframe, we can make a map of police districts and neighborhoods in Chicago.

final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

ggplot() +
  geom_sf(data = final_net, aes(fill = District), show.legend = F) +
  labs(title = "Police Districts") +
  mapTheme()

ggplot() +
  geom_sf(data = final_net, aes(fill = name), show.legend = F) +
  labs(title = "Neigborhoods") +
  mapTheme()

Local Moran’s I

A common statistical method to assess spatial autocorrelation is Moran’s I. The null hypothesis being tested in this instance is battery count at a given location is randomly distributed relative to its immediate neighbors. For this test, we use a queen matrix, where data is included for each of the 8 cells surrounding the target cell. Using the queen matrix, we can calculate Moran’s I for each cell as well as the p-values. The map below shows significant p-values (p<=0.05) for each cell.

final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countBattery, final_net.weights, zero.policy = T)),
    as.data.frame(final_net)) %>% 
    st_sf() %>%
      dplyr::select(Battery_Count = countBattery, 
                     Local_Morans_I = Ii, 
                      P_Value = `Pr(z != E(Ii))`) %>%
      mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
      gather(Variable, Value, -geometry)

vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
            aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Battery"))

The following map shows the distance from each cell to its nearest significant hotspot based on the p-value.

final_net <-
  final_net %>% 
  mutate(battery.isSig = 
           ifelse(localmoran(final_net$countBattery, 
                             final_net.weights, zero.policy = T)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(battery.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, battery.isSig == 1))), 1))

ggplot() +
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = final_net, aes(fill = battery.isSig.dist)) +
  scale_fill_viridis() +
  labs(title = "Distance to Highly Significant Battery Hotspots") +
  mapTheme()

Correlation Tests

The correlation tests below investigate the correlation between battery counts and the four risk factors discussed above.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District) %>%
    gather(Variable, Value, -countBattery)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countBattery, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countBattery)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Battery count as a function of risk factors") +
  plotTheme()

As you can see from the plot, there is generally a correlation between these factors and battery. For instance, as more alley lights and street lights are out, there are more instances of battery. The same is true for instances of sanitation code complaints. Interestingly, the strongest linear correlation between battery and a risk factor is with liquor retail, suggesting that areas with more liquor retail have higher instances of battery.

Poison Regression

Here, we use a process called LOGO-CV (leave out one group cross validation), where we hold out one area, train the models, predict for the held out area, and record the goodness of fit. Each neighborhood takes turns being the hold-out. Then, a regression analysis is done using each of these four models:
1. Random k-fold CV: just risk factors
2. Random k-fold CV: spatial structure
3. Spatial LOGO-CV: just risk factors
4. Spatial LOGO-CV: spatial structure

Accuracy and Generalizability

In order to apply a model to a different dataset at a different place or time, the model needs to be able to be generalized. The map below shows the four different regression models done above.

reg.summary <- 
  rbind(
    mutate(reg.cv, Error = Prediction - countBattery,
           Regression = "Random k-fold CV: Just Risk Factors"),
    
    mutate(reg.ss.cv, Error = Prediction - countBattery,
           Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV, Error = Prediction - countBattery,
           Regression = "Spatial LOGO-CV: Just Risk Factors"),
    
    mutate(reg.ss.spatialCV, Error = Prediction - countBattery,
           Regression = "Spatial LOGO-CV: Spatial Process")) %>%
  st_sf() 

grid.arrange(
  reg.summary %>%
    ggplot() +
    geom_sf(aes(fill = Prediction)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Predicted Battery by Regression") +
    mapTheme() + theme(legend.position="bottom"),
  
  filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
    ggplot() +
    geom_sf(aes(fill = countBattery)) +
    scale_fill_viridis() +
    labs(title = "Observed Battery\n") +
    mapTheme() + theme(legend.position="bottom"), ncol = 2)

The plots below show the mean absolute error for each fold across each regression.

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countBattery, na.rm = T),
            MAE = mean(abs(Mean_Error), na.rm = T),
            SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
       x="Mean Absolute Error", y="Count") +
    plotTheme()

The table below calculates the mean and standard deviation of error for each regression. From this table, we can see that the spatial process feature improves the model because the standard deviation and average error is lower.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
            SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF")
Regression Mean_MAE SD_MAE
Random k-fold CV: Just Risk Factors 1.50 1.50
Random k-fold CV: Spatial Process 1.37 1.30
Spatial LOGO-CV: Just Risk Factors 5.01 7.33
Spatial LOGO-CV: Spatial Process 3.91 4.08

The figure below maps the LOGO-CV errors. The highest errors tend to be in the hotspot areas.

error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Battery errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

Racial Context

In order to make sure that the crime data gathered is not due to overpolicing in certain areas due to racial biases, we should investigate the racial makeup of Chicago.

ggplot() +
  geom_sf(data = tracts18, aes(fill = raceContext)) +
  scale_fill_manual(values = c("blue", "grey")) +
  labs(title = "Race Context") +
  mapTheme()

From the table below, we can see the the model on average underpredicts in majority nonwhite neighborhoods and overpredicts in majority white neighborhoods.

reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Mean Error by neighborhood racial context") %>%
        kable_styling("striped", full_width = F) 
Mean Error by neighborhood racial context
Regression Majority_Non_White Majority_White
Spatial LOGO-CV: Just Risk Factors -1.375197 1.362367
Spatial LOGO-CV: Spatial Process -1.532351 1.448241

Kernel Density

The kernel density features places a smooth kernel or curve at each crime point so that the curve is at its highest directly over the point. The density at a particular place is the sum of all the kernels at that point. Thus, areas that have many nearby incidents of battery will have a higher density.

bat_ppp <- as.ppp(st_coordinates(battery), W = st_bbox(final_net))
bat_KD <- density.ppp(bat_ppp, 1000)

as.data.frame(bat_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
    ggplot() +
      geom_sf(aes(fill=value)) +
      geom_sf(data = sample_n(battery, 1500), size = .5) +
      scale_fill_viridis(name = "Density") +
      labs(title = "Kernel density of 2017 batteries") +
      mapTheme()

From the map, we see that the hotspots do generally correlate with incidents of battery.

Does 2017 kernel density predict 2018 incidents of battery?

In order to determine if 2017 kernel density can predict incidents of battery in 2018, there are multiple steps. We first compute the kernel density and aggregate to the fishnet. We then reclassify the values into 5 risk categories. We then join to that the count of battery for each cell.

battery18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "BATTERY" & 
           Description == "SIMPLE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

bat_ppp <- as.ppp(st_coordinates(battery), W = st_bbox(final_net))
bat_KD <- density.ppp(bat_ppp, 1000)

bat_KDE_sf <- as.data.frame(bat_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(battery18) %>% mutate(batCount = 1), ., sum) %>%
      mutate(batCount = replace_na(batCount, 0))) %>%
  dplyr::select(label, Risk_Category, batCount)

bat_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(battery18) %>% mutate(batCount = 1), ., sum) %>%
      mutate(batCount = replace_na(batCount, 0))) %>%
  dplyr::select(label,Risk_Category, batCount)

The following map shows kernel density and risk predictions compared to the real datasets for 2018 batteries. The risk prediction regression with LOGO-CV appears to be a stronger fit because it is more nuanced, and the high risk categories more closely match areas with a high density of batteries.

rbind(bat_KDE_sf, bat_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(battery18, 3000), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
       subtitle="2017 battery risk predictions; 2018 batteries") +
    mapTheme()

A good model should show that the risk predictions capture a greater share of 2018 battery incidents in the highest risk category relative to the Kernel density. From the bar chart below, we see that this is not the case in the highest risk category.

rbind(bat_KDE_sf, bat_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countBattery = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countBattery / sum(countBattery)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE) +
      labs(title = "Risk prediction vs. Kernel density, 2018 batteries") +
      plotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))