2. Data wrangling

In this section, we download spatial data sets from the Toronto open data website on collisions, neighborhoods, property, schools and more. We then learn how to measure geographic exposure by compiling these data into the fishnet, a lattice grid of cells that is the unit of analysis for the algorithm.

Begin by installing and loading the below libraries.

options(scipen=9999)


library(tidyverse)
library(sf)
library(RSocrata)
library(spatstat)
library(viridis)
library(FNN)
library(spdep)
library(gridExtra)
library(ggredist)
library(ggplot2)
library(knitr)
library(sp)
library(fastmap)
library(skimr)
library(tmap)
library(tmaptools)
library(spdep)
library('rgeoda')

Download & wrangle collision data

We begin by downloading base map layers and creating the fishnet using 500m by 500m grid cells. Note that all data is projected as a WGS 84/UTM (meter). The resulting fishnet is created with st_make_grid and visualized below. A unique Id is created for joining below.

setwd("E:\\R-programming\\TorontoCollisions\\TorontoCollisions")

nhoods <-
  st_read ("Neighbourhoods.geojson")%>%
  st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% 
  
  ### Other EPSG:3347 - NAD83 / Statistics Canada Lambert
  
  st_buffer(10) #to correct some of the digitization errors

bound <- st_read ("CityBoundary.geojson") %>% 
st_transform(st_crs(nhoods))
 
fishnet <- st_make_grid(bound, cellsize = 500) %>%
  st_sf() %>%
  mutate(uniqueID = rownames(.)) %>%
  .[bound,]
  
ggplot() +
  geom_sf(data=fishnet) +
  labs(title="Toronto & Fishnet")+geom_sf(data=bound, colour="blue", size=1, fill="NA")

  theme_bw()

Next, collisions data are downloaded directly and then save as *.geojson file format.

We filter for collisions after 2005 (most recently available data) , convert the data to the sf format (st_as_sf) and then project into meter. It is clear that collisions occur across the city with a higher density in some specific areas which are known as hot spots.

collisoions <- st_read("ChildPedestrian.geojson")%>% 
   mutate(year = substr(DATE,1,4)) %>%
    filter(year >= 2006) %>%
    st_as_sf(wkt="point") %>%
    st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>%
    dplyr::select(year) %>%
    .[bound,]

ggplot() +
  geom_sf(data=nhoods, fill= "lightgray") +
  geom_sf(data=collisoions, size=1, col="red")+geom_sf(data=bound, colour="blue", size=1, fill="NA") +
  labs(title="Collisions, Toronto, 2006-2022") + theme_bw()

Our predictive model is going to be ‘trained’ on collisisons from 2006 to 2021 (15 years), and below, 2022 collisions are set aside for testing purposes. If the goal is operational planning, then we should ensure our model generalizes to the the near future - hence the 2022 collisions are set aside. Besides, it is unimpressive to predict for the data the model was trained on.

Collisions2006_2021 <- filter(collisoions, year < 2022)
Collisions2022 <- filter(collisoions, year == 2022)

How do we know if our predictions are useful? In this case, we will compare them to a Kernel Density map of collisions- which we can think of as predictions generated purely from spatial autocorrelation (the idea that collisions cluster). This is different from a geospatial risk prediction model which will include some spatial autocorrelation predictors, but relies on other measures of geographic exposure - like intersections.

The first two lines in the code block below use the spatstat package to create the Kernel Density layer^[Depending on your spatstat version, you may have to remove spatstat:: from line two in this code block]. The third line converts it to a data frame, then a Simple Features layer, before joining it to the fishnet. The aggregate function spatial joins the Density layer and the fishnet to return the mean density per grid cell.

collision_ppp <- as.ppp(st_coordinates(Collisions2006_2021), W = st_bbox(bound))
collision_KD <- density.ppp(collision_ppp, 500)

as.data.frame(collision_KD)%>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(nhoods)) %>%
  aggregate(., fishnet, mean) %>%
  ggplot() +
  geom_sf(aes(fill=value)) +
  geom_sf(data=bound, colour="blue", size=1, fill="NA") +
  scale_fill_viridis(name = "Density",option = "magma") +
  labs(title = "Kernel density risk of collision")+ theme_bw()

The code block below spatially joins the 2006-2021 collisons points to the fishnet. aggregate is again used to perform a spatial join. Any grid cell without a collision returns NA which is replaced with 0. The result is mapped. Now we have our dependent variable - the count of collisions per grid cell.

collision_net <-
  dplyr::select(Collisions2006_2021) %>%
  mutate(countCollisions = 1) %>%
  aggregate(., fishnet, sum) %>%
  mutate(countCollisions = replace_na(countCollisions, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))


ggplot() +
  geom_sf(data = collision_net, aes(fill = countCollisions)) +
  scale_fill_viridis(option = "magma",
                    name="Collision Count") + geom_sf(data=bound, colour="blue", size=1, fill="NA")+labs(title = "Count of collisions for the fishnet")+ theme_bw()

Download & wrangle risk factors

In this section, we download the independent variables. These are the predictive ‘features’ that measure geographic exposure to collisions. Take note of the different strategies used to download the data. Some use st_read from the sf package, while others use read.socrata. In this Markdown, we downloaded our variables data and saved them in *.geojson file format. We also use other formats such as *.shp file format.

Wrangle all variable data: High-rise residential buildings

High density residential buildings, data is downloaded from Toronto Open Data portal and saved in our working directory and then projected. We might hypothesize that exposure to dense and high density residential buildings is a good predictor of child and youth pedestrian motor vehicle collisions . High-density are pulled out from property layer file as point feature data. The code block to follow then uses a spatial join to get the count of buildings per grid cell.

Note that we store all the features in the build_net layer, which is an updated version of the fishnet layer.

### High rise residential buildings

HighRiseB <- st_read("HighRiseResidentialBs.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% 
  st_centroid() %>% st_sf()

  Build_net <-cbind(st_drop_geometry(fishnet),
    dplyr::select(HighRiseB) %>%
    mutate(countBuilds = 1) %>%
    aggregate(., fishnet, sum) %>%
    mutate(countBuilds = replace_na(countBuilds, 0))) %>%
  st_sf()

The code block below then maps the high density buildings in Toronto. What do you make of this measure of ‘exposure’?

ggplot() +
  geom_sf(data = Build_net, aes(fill = countBuilds)) +
  scale_fill_viridis(option = "magma",
                    name="High Density Residential Buildings Count") +labs(title = "Count of high density residential buildings for the fishnet")+ theme_bw()+ geom_sf(data=bound, colour="blue", size=1, fill="NA")+ theme_bw()

The figure above shows that the count of high density buildings may only describe exposure for grid cells that actually have high density residential buildings. Most grid cells return 0 vacant parcels. A better way to describe exposure is to measure the average distance from each fishnet grid cell centroid to its k nearest vacant neighbors. The code block below does this using the nn_function - a custom function created for this purpose. Here we set k = 3. Note, that the choice of k affects the ‘amount’ of exposure. We can now see that Build_net includes the count of vacant parcels as well as the nearest neighbor distance (Build.nn) from every grid cell to its 3 nearest neighbors. st_c and st_coid is just shorthand for the two respective sf functions.

The code block below use the nn_function for the aim of our study:

nn_function <- function(measureFrom, measureTo, k) {
  nn <- get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    as.data.frame()
  names(output) <- c("dist.nn")
  return(output)  
}

dataXY_function <- function(data) {
  dataXY <- data %>%
    st_coordinates() %>%
    as.data.frame() %>%
    dplyr :: select(X,Y) %>%
    as.matrix()
  return(dataXY)
}

st_c <- st_coordinates
st_coid <- st_centroid

Build_net <-
  Build_net %>%
    mutate(nn_function(st_c(st_coid(Build_net)), st_c(HighRiseB),3))

When the nearest neighbor distance is mapped, a measure of exposure can be seen for every location, Citywide. Why is this a better measure of exposure, compared to the count per grid cell?

ggplot() +
  geom_sf(data = Build_net, aes(fill = dist.nn))+
  scale_fill_viridis(option = "magma", direction = -1,
                    name="Builds Distance/m") +labs(title = "Nearest neighbor distance to HDBs in meter")+geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()

Next, the schools count is calculated for each grid cell by spatially joining the `Schools` parcel centroids to the `fishnet` using `aggregate`. Note the `mean` function. It is hypotheses that collisions are more likely to happen where there are more schools and children and teenagers are roaming around the schools.

### Schools

Schools <- st_read("Schools.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% st_centroid() %>% st_sf()


### Aggregate by fishnet
Schools_net <-cbind(st_drop_geometry(fishnet),
    dplyr::select(Schools) %>%
    mutate(countSchools = 1) %>%
    aggregate(., fishnet, sum) %>%
    mutate(countSchools = replace_na(countSchools, 0))) %>%
  st_sf()

### Plot
ggplot() +
  geom_sf(data = Schools_net, aes(fill = countSchools)) +
  scale_fill_viridis(option = "magma",
                    name="Schools Count") +labs(title = "Count of schools for the fishnet")+ theme_bw()+ geom_sf(data=bound, colour="blue", size=1, fill="NA")

### Measuring ANN index for schools
Schools_net <-
  Schools_net %>%
    mutate(nn_function(st_c(st_coid(Schools_net)), st_c(Schools),3))

### Plot ANN weighted fishnet
ggplot() +
  geom_sf(data = Schools_net, aes(fill = dist.nn))+
  scale_fill_viridis(option = "magma", direction = -1,
                    name="Schools Distance/m") +
  labs(title = "Nearest neighbor distance to schools in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()

Next, the municipality parks count is calculated for each grid cell by spatially joining the `Parks` parcel centroids to the `fishnet` using `aggregate`. It is hypotheses that collisions are more likely to happen where there are more parks and children and teenagers are using and roaming around theme.

### Municipality Parks

Parks <- st_read("Parks.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% st_centroid() %>% st_sf()


### Aggregate by fishnet
Parks_net <-cbind(st_drop_geometry(fishnet),
    dplyr::select(Parks) %>%
    mutate(countParks = 1) %>%
    aggregate(., fishnet, sum) %>%
    mutate(countParks = replace_na(countParks, 0))) %>%
  st_sf()

### Plot
ggplot() +
  geom_sf(data = Parks_net, aes(fill = countParks)) +
  scale_fill_viridis(option = "magma",
                    name="Parks Count") +labs(title = "Count of parks for the fishnet")+ geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()

### Measuring ANN index for parks
Parks_net <-
  Parks_net %>%
    mutate(nn_function(st_c(st_coid(Parks_net)), st_c(Parks),3))

### Plot ANN weighted fishnet
ggplot() +
  geom_sf(data = Parks_net, aes(fill = dist.nn))+
  scale_fill_viridis(option = "magma", direction = -1,
                    name="Parks Distance/m") +
  labs(title = "Nearest neighbor distance to parks in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()

### Road intersection 
Intersecions <- st_read("Intersections.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% st_centroid() %>% st_sf()


### Aggregate by fishnet
Intersecions_net <-cbind(st_drop_geometry(fishnet),
    dplyr::select(Intersecions) %>%
    mutate(countIntersecions = 1) %>%
    aggregate(., fishnet, sum) %>%
    mutate(countIntersecions = replace_na(countIntersecions, 0))) %>%
  st_sf()

 
### Plot
ggplot() +
  geom_sf(data = Intersecions_net, aes(fill = countIntersecions)) +
  scale_fill_viridis(option = "magma",
                    name="Intersecions Count") +labs(title = "Count of Intersecions for the fishnet")+ theme_bw()+ geom_sf(data=bound, colour="blue", size=1, fill="NA")

### Measuring ANN index for Intersecions
Intersecions_net <-
  Intersecions_net %>%
    mutate(nn_function(st_c(st_coid(Intersecions_net)), st_c(Intersecions),3))
 


### Plot ANN weighted fishnet
ggplot() +
  geom_sf(data = Intersecions_net, aes(fill = dist.nn))+
  scale_fill_viridis(option = "magma", direction = -1,
                    name="Intersecions Distance/m") +
  labs(title = "Nearest neighbor distance to Intersecions in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()

Create the `final_net`

In this section we spatially joined our study four exposure variables with the counts of collisions by fishnet. We also removed probably duplicated data from our final database.

### Join 1

ParkSchool<- st_join(Parks_net, Schools_net, by = "uniqueID")
view(ParkSchool)

## We checked and removed duplicated geometries 
ParkSchool<- ParkSchool[!duplicated(ParkSchool$geometry),]
view(ParkSchool)

### Join 2
InterBUild<- st_join(Intersecions_net,Build_net, by = "uniqueID")
view(InterBUild)

InterBUild<- InterBUild[!duplicated(InterBUild$geometry),]
view(InterBUild)
 
### Join 3 
FourEposure <- st_join(x = ParkSchool, y = InterBUild, by = "uniqueID")
FourEposure<- FourEposure[!duplicated(FourEposure$geometry),]
view(FourEposure)

### Join 4
FinalFishnet <- st_join(FourEposure, collision_net, by = "uniqueID")

FinalFishnet<- FinalFishnet[!duplicated(FinalFishnet$geometry),]

view(FinalFishnet)


FinalFishnet<- st_read ("FinalFishnet2.geojson")%>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N")%>% st_sf()
names(FinalFishnet)
    
    
### Plot ANN weighted fishnet
ggplot() +
  geom_sf(data = FinalFishnet, aes(fill =ParksNND))+
  scale_fill_viridis(option = "magma", direction = -1,
                    name="Parks NND/m") +
  labs(title = "Nearest neighbor distance to Parks in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()

In the next step, we joined our fishnet (spatial object includes all data) to the neighborhoods layer giving a categorical neighborhood for each grid cell. This is required for the spatial cross validation performed below.

FishDataNeigh<- st_join(FinalFishnet, nhoods) %>% st_sf() 
  
FishDataNeigh<-FishDataNeigh[!duplicated(FishDataNeigh$geometry),]



show(FishDataNeigh)

names(FishDataNeigh)

ggplot() +
  geom_sf(data = FishDataNeigh, aes(fill = AREA_NAME), show.legend=FALSE) + scale_fill_viridis(option = "magma", discrete = TRUE,direction = -1,name="Distance to parks NND/m")+labs(title = "Neighborhoods joined to the fishnet") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()

The goal of this section is to engineer features for our model that help predict the collision hotspots and the collision coldspots. It is notoriously difficult for linear (regression) models to predict these areas, which are effectively outliers. It is important that we do not say, under-predict the hotspots - otherwise, we under-predict risk.

To do so, a statistic called Local Moran’s I posits a null hypothesis that collisoin count at a given location is randomly distributed relative to its immediate neighbors. Where this hypothesis can be overturned, we may observe local clustering - a hotspot.

The below analysis is also useful for exploratory purposes and will be familiar to many spatial analysts. What may be new, is using these metrics as features of a statistical model. The first step is to create a spatial weights matrix that relates each grid cell to its adjacent neighbors.

Fish.nb <- poly2nb(as_Spatial(FishDataNeigh), queen=TRUE)
listw <- nb2listw(Fish.nb, style="W", zero.policy=TRUE)

There is a lot in the code block below and more context can be found in this section of the book: Public Policy Analytics: Code & Context for Data Science in Government by Ken Steif, Ph.D Updated: February, 2021. The Local Moran’s I statistic is calculated and a new feature is engineered called Significant_Hotpots - which are those statistically significant local clusters (p <= 0.05).

The local spatial statistic Moran’s I is calculated for each zone based on the spatial weights object used. The values returned include a Z-value, and may be used as a diagnostic tool. The statistic is: but those from Sokal et al. (1998) are implemented here.

\[ I_i=\frac{(x_i-\bar{x})}{{\sum_{k=1}^{n}(x_k-\bar{x})^2}/(n-1)}{\sum_{j=1}^{n}w_{ij}(x_j-\bar{x})} \]

The second part of the code block uses a loop to map four indicators each with their own unique legends. Not only does this analysis reveal the hotspots, but because Significant_Hotpots is observed for each grid cell, it can be used as a predictive feature. Let’s see how below.

LMI <- localmoran(FishDataNeigh$CollisCou, listw, zero.policy=NULL, na.action=na.fail, conditional=TRUE,
  alternative = "two.sided", mlvar=TRUE,
        spChk=NULL, adjust.x=FALSE)
localmoran_perm(FishDataNeigh$CollisCou, listw, nsim=499, zero.policy=NULL, na.action=na.fail, 
  alternative = "two.sided", mlvar=TRUE,
        spChk=NULL, adjust.x=FALSE, sample_Ei=TRUE, iseed=NULL,
        no_repeat_in_row=FALSE)
LMIdf <- data.frame(LMI)
names(LMIdf)

FishDataNeigh$LMI_I<- LMIdf$Ii
FishDataNeigh$LMI_Pv<- LMIdf$Pr.z....E.Ii..
FishDataNeigh$CV<- collision_net$cvID

head(FishDataNeigh)

view(FishDataNeigh)

ggplot1<- ggplot() +geom_sf(data = FishDataNeigh, aes(fill = LMI_I), colour= NA) + scale_fill_viridis_c(name="", option = "plasma",direction = -1) + theme_bw() + theme(legend.position="bottom")+ labs(title= "Moran's I")+geom_sf(data=bound, colour="blue", size=1, fill="NA")

ggplot2<- ggplot() +geom_sf(data = FishDataNeigh, aes(fill = LMI_Pv), colour= NA) + scale_fill_viridis(name="", option = "magma", direction = -1) + theme_bw() + theme(legend.position="bottom")+ labs(title= "P-values")+geom_sf(data=bound, colour="blue", size=1, fill="NA")


### To show tow variable in smae/single plot
grid.arrange(ggplot1,ggplot2, ncol=2)

Measuring distance or exposure to these hotspots provides information that the predictive model can use to account for very local hotspots. Below, an average nearest neighbor feature is added to `final_net` describing distance to very `Significant_Hotpots` (note the p-value used).

FishDataNeigh1 <-
  FishDataNeigh %>%
  mutate(colis.Sig =
           ifelse(localmoran(FishDataNeigh$CollisCou,
                             listw)[,5] <= 0.01, 1, 0)) %>%
  mutate(colis.Sig.dist =
           nn_function(st_coordinates(st_centroid(FishDataNeigh)),
                       st_coordinates(st_centroid(
                         filter(FishDataNeigh, colis.Sig == 1))), 1))
 
 
st_as_sf(x = FishDataNeigh1)

names(FishDataNeigh1)
write_sf(FishDataNeigh1, "FishDataNeigh1.geojson")
#### above we can change the value of significant (P-value) range from 0.05 to 0.0000001 ... to find signifucant cells. 
         
ggplot() +
  geom_sf(data = FishDataNeigh1, aes(fill =colis.Sig), show.legend=FALSE) + geom_sf(data = filter(FishDataNeigh1, colis.Sig == 1) %>% st_centroid()) + scale_fill_continuous(type = "gradient") +labs(title = "Distance to very significant collision hotpots",subtitle = "Black points represent significant hotspots")+ theme_bw()+geom_sf(data=bound, colour="blue", size=1, fill="NA")

str(FishDataNeigh1)

4. Modeling & Validation

In this section we introduce the concept of spatial cross-validation - a special flavor of cross-validation. The risk prediction model is then compared to the Kernel Density to test for its planning utility.

When the dependent variable is count data, a class of statistical model called Generalized Linear Model or glm is often used. There are many flavors of glm models covered throughout the Ken Steif’s book, and very little context is given here.

While our model will be judged in a planning context, many readers will appreciate the regression summary below. Again, we are regressing collision count as a function of the features engineered above. The output shows that many of features are statistically significant predictors of collision counts.

GLM <- summary(glm(FishDataNeigh1$CollisCou ~ ., family = "poisson",data = FishDataNeigh1 %>% st_drop_geometry() %>% dplyr::select(SchoolsNND, BuildsNND, ParksNND,IntersNND,colis.Sig)))

GLM

Spatial cross-validation

We are estimating a regression model predicting the latent risk of collisions. One of our goals is that our model be generalizable. This means that it performs with equity across all neighborhoods regardless of other factors.

reg.vars <- c("SchoolsNND", "BuildsNND", "ParksNND", "IntersNND")

reg.ss.vars <- c("SchoolsNND", "BuildsNND", "ParksNND", "IntersNND","colis.Sig")

The code block below performs the spatial cross validation. It iteratively pulls out one neighborhood or id, trains the model on the remaining groups, and then predicts for the hold out. It outputs predictions for each grid cell.

crossValidate.colis <- function(dataset, id, dependentVariable, indVariables) {

  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])

  for (i in cvID_list) {

    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")

    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
      dplyr::select(id, geometry, indVariables, dependentVariable)

    regression <-
      glm(CollisCou ~ ., family = "poisson",
          data = fold.train %>%
            dplyr::select(-geometry, -id))

    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))

    allPredictions <-
      rbind(allPredictions, thisPrediction)

  }
  return(st_sf(allPredictions))
}

The code block below runs the cross-validation. The first chunk omits the spatial process variables, and the second includes it. In both cases the holdout group id is nhoods, a field in FishDataNeigh1.

reg.spatialCV <- crossValidate.colis(
  dataset = FishDataNeigh1,
  id = "AREA_SHORT_CODE",
  dependentVariable = "CollisCou",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = AREA_SHORT_CODE, CollisCou, Prediction, geometry)

reg.ss.spatialCV <- crossValidate.colis(
  dataset = FishDataNeigh1,
  id = "AREA_SHORT_CODE",
  dependentVariable = "CollisCou",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = AREA_SHORT_CODE, CollisCou, Prediction, geometry)

Analyze model errors

Let us examine model accuracy by calculating model errors, subtracting grid cell Predictions from the observed CollisCou. reg.summary includes CollisCou, predictions, errors and a Regression label for both regressions estimated above.

reg.summary <-
  rbind(
    mutate(reg.spatialCV,    Error = Prediction - CollisCou,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
    mutate(reg.ss.spatialCV, Error = Prediction - CollisCou,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf()

as.tibble(st_drop_geometry(head(reg.summary)))

The code block below calculates several goodness of fit metrics for both regressions and by AREA_SHORT_CODE (the cvID). Mean_Error allows us to understand if we are over or under predicting in a given place. MAE takes the absolute value of errors.

error_by_reg_and_fold <-
  reg.summary %>%
    group_by(Regression, cvID) %>%
    summarize(Mean_Error = mean(Prediction - CollisCou, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

Neighborhood errors are now mapped for both regressions. What do you notice about their spatial distribution? In general, errors are much lower when the local spatial process is accounted for - ie. the hot and cold spots.

error_by_reg_and_fold %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis(option = "magma", direction = -1) +
    labs(title = "Burglary errors by LOGO-CV Regression") + theme(legend.position="bottom")+ theme_bw()+geom_sf(data=bound, colour="blue", size=1, fill="NA")

Although this is a very simple model, assuming it is satisfactory, we can move on to the model predictions. The visualization below maps risk predictions, the Kernel Density and the observed CollisCou.

The risk prediction model forecasts much higher risk in and around the inner city, so much so that it masks colors associated with marginal risk in other areas of the City. The hotspots from the Kernel Density seem much more apparent.

So which predicts better - the Kernel Density or the Risk Prediction Model? Let’s find out.

grid.arrange(ncol=1,    
  reg.summary %>%
    group_by(Regression, cvID) %>%
    filter(str_detect(Regression, "Spatial Process")) %>%
    ggplot() + geom_sf(aes(fill=Prediction)) +
    scale_fill_viridis(option = "magma", direction = -1) +
    ggtitle("Risk Predictions") + theme_map(),
  as.data.frame(collision_KD) %>%
    st_as_sf(coords = c("x", "y"), crs = st_crs(nhoods)) %>%
    aggregate(., FishDataNeigh1, mean) %>%
      ggplot() +
        geom_sf(aes(fill=value)) +
        scale_fill_viridis(name = "Density", option = "magma", direction = -1) +
        ggtitle("Kernel Density") + theme_map(),
  reg.summary %>%
    group_by(Regression, cvID) %>%
    filter(str_detect(Regression, "Spatial Process")) %>%
    ggplot() + geom_sf(aes(fill=CollisCou)) +
    scale_fill_viridis(option = "magma", direction = -1) +
    ggtitle("Observed collision count") + theme_map())

library(ggredist)

5. Risk Prediction or Kernel Density?

As we have discussed, the most ‘accurate’ model may not the best. In fact, the most genearlizable model - which is less bias, might be important, but also may not be the best contender. What we are really after is the most ‘useful’ model - and useful can only be judged in the context of the planning use case. If we were on the ground in Toronto, we might study how the Police Department currently makes decisions around collision inspection and comparable use cases.

In this case, we assume a Accident Analyst uses Kernel Density to understand areas at most risk for collisions. If density is the ‘business-as-usual’ approach for targeting resources, then our forecast must be more useful - meaning it must identify more locations as high risk that actually did experience collisions.

To make it more challenging, we test our 2006-2021 collison model on the 2022 collisions withheld above. This way we can see if our model generalizes to the next year - which is important for the Police Department’s planning process.

Consider the Kernel Density and the risk predictions to be “predictive surfaces” - layers to describe risk Citywide. Below, each surface is divided into 5 risk categories, with the 90% to 100% risk category as the highest. ntile calculates percentiles. You can see, aggregate is used to spatial join the 2022 collisions to the risk category grid cells. This is done first for the Kernel Density, below.

collis_KDE_sf <- as.data.frame(collision_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(FishDataNeigh1)) %>%
  aggregate(., FishDataNeigh1, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(Collisions2022) %>% mutate(countCollisions = 1), ., sum) %>%
    mutate(countCollisions = replace_na(countCollisions, 0))) %>%
  dplyr::select(label, Risk_Category, countCollisions)

Now for the risk predictions, below.

collis_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(Collisions2022) %>% mutate(countCollisions = 1), ., sum) %>%
      mutate(countCollisions = replace_na(countCollisions, 0))) %>%
  dplyr::select(label,Risk_Category, countCollisions)

Next, the 5 risk categories are mapped along with the 2022 collisions overlaid in red. Which predictive surface looks like the better targeting tool?

rbind(collis_KDE_sf, collis_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = Collisions2022, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(option = "plasma", discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2006-2021 fire risk predictions; 2022 collisons overlayed") +
    theme_bw()

The plot below takes the evaluation one step further, analyzing the rate of actual 2022 collisions that fall into each risk category across both predictive surfaces. What can we conclude from the 5th and highest risk category and how might this speak to the utility of the algorithm?

rbind(collis_KDE_sf, collis_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countCollisions = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_collisons = countCollisions / sum(countCollisions)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_collisons)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      #scale_fill_viridis(option="magma", discrete = TRUE, direction=-1) +
      scale_fill_manual(values = c("#6b4596ff", "#f7cb44ff")) +
      labs(title = "Risk prediction vs. Kernel density, 2022 Collisions") +
      theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Indeed, the risk predictive model captures more 2022 collisions than the Kernel Density, suggesting it is a more useful tool for planning. Recall, the Kernel Density predicts based purely on spatial autocorrelation. To reiterate from the introduction - the difference is that Kernel Density tells us to revisit past accidents, but risk predictive models, if properly developed, reveal risk based on a more thorough predictive understanding of the spatial system and why things occur where.

The risk prediction models are much more effective when we have a complete sample of the outcome - like collisions. The analyst should be cautious of use cases where the dependent variable suffers from selection bias - like drug crimes.

I hope this markdown was useful and I hope you replicate these methods in your own work. Please do check out Ken Steif’s book , Public Policy Analytics, to learn so much more about spatial modeling approache to solving public policy challenges.

Thanks!

Alireza

Geospatial Machine Learning: Predicting collision risk in Toronto

Alireza Mohammadi,(Alirezamohammadi20142014@gmail.com)

2023-10-20

1. Introduction

2. Data wrangling

Download & wrangle collision data

Download & wrangle risk factors

Wrangle all variable data: High-rise residential buildings

Create the `final_net`

4. Modeling & Validation

Spatial cross-validation

Analyze model errors

5. Risk Prediction or Kernel Density?

Geospatial Machine Learning: Predicting collision risk in Toronto

Alireza Mohammadi,(Alirezamohammadi20142014@gmail.com)

2023-10-20

1. Introduction

2. Data wrangling

Download & wrangle collision data

Download & wrangle risk factors

Wrangle all variable data: High-rise residential buildings

Create the final_net

4. Modeling & Validation

Spatial cross-validation

Analyze model errors

5. Risk Prediction or Kernel Density?

Create the `final_net`