In memory of Ken Steif, a professor at the University of Pennsylvania. This markdown is the repetition of risk modelling done by him. Geospatial Machine Learning: Predicting fire risk in San Francisco.
In this markdown we will learn how to develop a geospatial risk prediction model - a type of machine learning model, specifically to predict child and youth pedestrian motor vehicle collision (CYC) risk in Toronto, Ontario, Canada. I chosen this study area because of available qualified data. Here we posit that collision risk across space is a function of exposure to nearby spatial risk or protective factors, like streets intersections. These algorithms are explicitly spatial in nature because of their reliance on measuring geographic exposure.
These models provide intelligence beyond points on a map - that simply tell us to revisit past locations of adverse events. They provide a more thorough predictive understanding of spatial systems and why things occur where.
Collision is a relatively rare event, and the ‘latent risk’ for collision is much greater than the number of collisions actually observed. Today, we will estimate this spatial latent risk to help the Toronto Police Servicest (TPS) do operational planning, like accident inspection and and improving the urban street network.
This tutorial assumes the reader is familiar with spatial analysis
and data wrangling in R - namely the sf and
tidyverse packages. It is also assumed the reader has some
familiarity with machine learning.
In this section, we download spatial data sets from the Toronto open data website on collisions, neighborhoods, property, schools and more. We then learn how to measure geographic exposure by compiling these data into the fishnet, a lattice grid of cells that is the unit of analysis for the algorithm.
Begin by installing and loading the below libraries.
options(scipen=9999)
library(tidyverse)
library(sf)
library(RSocrata)
library(spatstat)
library(viridis)
library(FNN)
library(spdep)
library(gridExtra)
library(ggredist)
library(ggplot2)
library(knitr)
library(sp)
library(fastmap)
library(skimr)
library(tmap)
library(tmaptools)
library(spdep)
library('rgeoda')
We begin by downloading base map layers and creating the fishnet using 500m by 500m grid cells. Note that all data is projected as a WGS 84/UTM (meter). The resulting fishnet is created with st_make_grid and visualized below. A unique Id is created for joining below.
setwd("E:\\R-programming\\TorontoCollisions\\TorontoCollisions")
nhoods <-
st_read ("Neighbourhoods.geojson")%>%
st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>%
### Other EPSG:3347 - NAD83 / Statistics Canada Lambert
st_buffer(10) #to correct some of the digitization errors
bound <- st_read ("CityBoundary.geojson") %>%
st_transform(st_crs(nhoods))
fishnet <- st_make_grid(bound, cellsize = 500) %>%
st_sf() %>%
mutate(uniqueID = rownames(.)) %>%
.[bound,]
ggplot() +
geom_sf(data=fishnet) +
labs(title="Toronto & Fishnet")+geom_sf(data=bound, colour="blue", size=1, fill="NA")
theme_bw()
Next, collisions data are downloaded directly and then save as *.geojson file format.
We filter for collisions after 2005 (most recently
available data) , convert the data to the sf format
(st_as_sf) and then project into meter. It is clear that
collisions occur across the city with a higher density in some specific
areas which are known as hot spots.
collisoions <- st_read("ChildPedestrian.geojson")%>%
mutate(year = substr(DATE,1,4)) %>%
filter(year >= 2006) %>%
st_as_sf(wkt="point") %>%
st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>%
dplyr::select(year) %>%
.[bound,]
ggplot() +
geom_sf(data=nhoods, fill= "lightgray") +
geom_sf(data=collisoions, size=1, col="red")+geom_sf(data=bound, colour="blue", size=1, fill="NA") +
labs(title="Collisions, Toronto, 2006-2022") + theme_bw()
Our predictive model is going to be ‘trained’ on collisisons from 2006 to 2021 (15 years), and below, 2022 collisions are set aside for testing purposes. If the goal is operational planning, then we should ensure our model generalizes to the the near future - hence the 2022 collisions are set aside. Besides, it is unimpressive to predict for the data the model was trained on.
Collisions2006_2021 <- filter(collisoions, year < 2022)
Collisions2022 <- filter(collisoions, year == 2022)
How do we know if our predictions are useful? In this case, we will compare them to a Kernel Density map of collisions- which we can think of as predictions generated purely from spatial autocorrelation (the idea that collisions cluster). This is different from a geospatial risk prediction model which will include some spatial autocorrelation predictors, but relies on other measures of geographic exposure - like intersections.
The first two lines in the code block below use the spatstat package to create the Kernel Density layer^[Depending on your spatstat version, you may have to remove spatstat:: from line two in this code block]. The third line converts it to a data frame, then a Simple Features layer, before joining it to the fishnet. The aggregate function spatial joins the Density layer and the fishnet to return the mean density per grid cell.
collision_ppp <- as.ppp(st_coordinates(Collisions2006_2021), W = st_bbox(bound))
collision_KD <- density.ppp(collision_ppp, 500)
as.data.frame(collision_KD)%>%
st_as_sf(coords = c("x", "y"), crs = st_crs(nhoods)) %>%
aggregate(., fishnet, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data=bound, colour="blue", size=1, fill="NA") +
scale_fill_viridis(name = "Density",option = "magma") +
labs(title = "Kernel density risk of collision")+ theme_bw()
The code block below spatially joins the 2006-2021 collisons points to the fishnet. aggregate is again used to perform a spatial join. Any grid cell without a collision returns NA which is replaced with 0. The result is mapped. Now we have our dependent variable - the count of collisions per grid cell.
collision_net <-
dplyr::select(Collisions2006_2021) %>%
mutate(countCollisions = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countCollisions = replace_na(countCollisions, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = collision_net, aes(fill = countCollisions)) +
scale_fill_viridis(option = "magma",
name="Collision Count") + geom_sf(data=bound, colour="blue", size=1, fill="NA")+labs(title = "Count of collisions for the fishnet")+ theme_bw()
In this section, we download the independent variables. These are the
predictive ‘features’ that measure geographic exposure to collisions.
Take note of the different strategies used to download the data. Some
use st_read from the sf package, while others
use read.socrata. In this Markdown, we downloaded our
variables data and saved them in *.geojson file format.
We also use other formats such as *.shp file
format.
High density residential buildings, data is downloaded
from Toronto Open Data portal and saved in our working directory and
then projected. We might hypothesize that exposure to dense and high
density residential buildings is a good predictor of child and youth
pedestrian motor vehicle collisions . High-density are pulled out from
property layer file as point feature data. The code block
to follow then uses a spatial join to get the count of buildings per
grid cell.
Note that we store all the features in the build_net
layer, which is an updated version of the fishnet
layer.
### High rise residential buildings
HighRiseB <- st_read("HighRiseResidentialBs.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>%
st_centroid() %>% st_sf()
Build_net <-cbind(st_drop_geometry(fishnet),
dplyr::select(HighRiseB) %>%
mutate(countBuilds = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countBuilds = replace_na(countBuilds, 0))) %>%
st_sf()
The code block below then maps the high density buildings in Toronto. What do you make of this measure of ‘exposure’?
ggplot() +
geom_sf(data = Build_net, aes(fill = countBuilds)) +
scale_fill_viridis(option = "magma",
name="High Density Residential Buildings Count") +labs(title = "Count of high density residential buildings for the fishnet")+ theme_bw()+ geom_sf(data=bound, colour="blue", size=1, fill="NA")+ theme_bw()
The figure above shows that the count of high density buildings may
only describe exposure for grid cells that actually have high density
residential buildings. Most grid cells return 0 vacant
parcels. A better way to describe exposure is to measure the average
distance from each fishnet grid cell centroid to
its k nearest vacant neighbors. The code block below does this
using the nn_function - a custom function created for this
purpose. Here we set k = 3. Note, that the choice of
k affects the ‘amount’ of exposure. We can now see that
Build_net includes the count of vacant parcels as well as
the nearest neighbor distance (Build.nn) from every grid
cell to its 3 nearest neighbors. st_c and
st_coid is just shorthand for the two respective
sf functions.
The code block below use the nn_function for the aim of our study:
nn_function <- function(measureFrom, measureTo, k) {
nn <- get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
as.data.frame()
names(output) <- c("dist.nn")
return(output)
}
dataXY_function <- function(data) {
dataXY <- data %>%
st_coordinates() %>%
as.data.frame() %>%
dplyr :: select(X,Y) %>%
as.matrix()
return(dataXY)
}
st_c <- st_coordinates
st_coid <- st_centroid
Build_net <-
Build_net %>%
mutate(nn_function(st_c(st_coid(Build_net)), st_c(HighRiseB),3))
When the nearest neighbor distance is mapped, a measure of exposure can be seen for every location, Citywide. Why is this a better measure of exposure, compared to the count per grid cell?
ggplot() +
geom_sf(data = Build_net, aes(fill = dist.nn))+
scale_fill_viridis(option = "magma", direction = -1,
name="Builds Distance/m") +labs(title = "Nearest neighbor distance to HDBs in meter")+geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()
Next, the schools count is calculated for each grid cell by spatially joining the `Schools` parcel centroids to the `fishnet` using `aggregate`. Note the `mean` function. It is hypotheses that collisions are more likely to happen where there are more schools and children and teenagers are roaming around the schools.
### Schools
Schools <- st_read("Schools.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% st_centroid() %>% st_sf()
### Aggregate by fishnet
Schools_net <-cbind(st_drop_geometry(fishnet),
dplyr::select(Schools) %>%
mutate(countSchools = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countSchools = replace_na(countSchools, 0))) %>%
st_sf()
### Plot
ggplot() +
geom_sf(data = Schools_net, aes(fill = countSchools)) +
scale_fill_viridis(option = "magma",
name="Schools Count") +labs(title = "Count of schools for the fishnet")+ theme_bw()+ geom_sf(data=bound, colour="blue", size=1, fill="NA")
### Measuring ANN index for schools
Schools_net <-
Schools_net %>%
mutate(nn_function(st_c(st_coid(Schools_net)), st_c(Schools),3))
### Plot ANN weighted fishnet
ggplot() +
geom_sf(data = Schools_net, aes(fill = dist.nn))+
scale_fill_viridis(option = "magma", direction = -1,
name="Schools Distance/m") +
labs(title = "Nearest neighbor distance to schools in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()
Next, the municipality parks count is calculated for each grid cell by spatially joining the `Parks` parcel centroids to the `fishnet` using `aggregate`. It is hypotheses that collisions are more likely to happen where there are more parks and children and teenagers are using and roaming around theme.
### Municipality Parks
Parks <- st_read("Parks.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% st_centroid() %>% st_sf()
### Aggregate by fishnet
Parks_net <-cbind(st_drop_geometry(fishnet),
dplyr::select(Parks) %>%
mutate(countParks = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countParks = replace_na(countParks, 0))) %>%
st_sf()
### Plot
ggplot() +
geom_sf(data = Parks_net, aes(fill = countParks)) +
scale_fill_viridis(option = "magma",
name="Parks Count") +labs(title = "Count of parks for the fishnet")+ geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()
### Measuring ANN index for parks
Parks_net <-
Parks_net %>%
mutate(nn_function(st_c(st_coid(Parks_net)), st_c(Parks),3))
### Plot ANN weighted fishnet
ggplot() +
geom_sf(data = Parks_net, aes(fill = dist.nn))+
scale_fill_viridis(option = "magma", direction = -1,
name="Parks Distance/m") +
labs(title = "Nearest neighbor distance to parks in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()
### Road intersection
Intersecions <- st_read("Intersections.geojson") %>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N") %>% st_centroid() %>% st_sf()
### Aggregate by fishnet
Intersecions_net <-cbind(st_drop_geometry(fishnet),
dplyr::select(Intersecions) %>%
mutate(countIntersecions = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countIntersecions = replace_na(countIntersecions, 0))) %>%
st_sf()
### Plot
ggplot() +
geom_sf(data = Intersecions_net, aes(fill = countIntersecions)) +
scale_fill_viridis(option = "magma",
name="Intersecions Count") +labs(title = "Count of Intersecions for the fishnet")+ theme_bw()+ geom_sf(data=bound, colour="blue", size=1, fill="NA")
### Measuring ANN index for Intersecions
Intersecions_net <-
Intersecions_net %>%
mutate(nn_function(st_c(st_coid(Intersecions_net)), st_c(Intersecions),3))
### Plot ANN weighted fishnet
ggplot() +
geom_sf(data = Intersecions_net, aes(fill = dist.nn))+
scale_fill_viridis(option = "magma", direction = -1,
name="Intersecions Distance/m") +
labs(title = "Nearest neighbor distance to Intersecions in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()
final_netIn this section we spatially joined our study four exposure variables with the counts of collisions by fishnet. We also removed probably duplicated data from our final database.
### Join 1
ParkSchool<- st_join(Parks_net, Schools_net, by = "uniqueID")
view(ParkSchool)
## We checked and removed duplicated geometries
ParkSchool<- ParkSchool[!duplicated(ParkSchool$geometry),]
view(ParkSchool)
### Join 2
InterBUild<- st_join(Intersecions_net,Build_net, by = "uniqueID")
view(InterBUild)
InterBUild<- InterBUild[!duplicated(InterBUild$geometry),]
view(InterBUild)
### Join 3
FourEposure <- st_join(x = ParkSchool, y = InterBUild, by = "uniqueID")
FourEposure<- FourEposure[!duplicated(FourEposure$geometry),]
view(FourEposure)
### Join 4
FinalFishnet <- st_join(FourEposure, collision_net, by = "uniqueID")
FinalFishnet<- FinalFishnet[!duplicated(FinalFishnet$geometry),]
view(FinalFishnet)
FinalFishnet<- st_read ("FinalFishnet2.geojson")%>% st_transform("EPSG:32617 - WGS 84 / UTM zone 17N")%>% st_sf()
names(FinalFishnet)
### Plot ANN weighted fishnet
ggplot() +
geom_sf(data = FinalFishnet, aes(fill =ParksNND))+
scale_fill_viridis(option = "magma", direction = -1,
name="Parks NND/m") +
labs(title = "Nearest neighbor distance to Parks in meter") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()
In the next step, we joined our fishnet (spatial object includes all data) to the neighborhoods layer giving a categorical neighborhood for each grid cell. This is required for the spatial cross validation performed below.
FishDataNeigh<- st_join(FinalFishnet, nhoods) %>% st_sf()
FishDataNeigh<-FishDataNeigh[!duplicated(FishDataNeigh$geometry),]
show(FishDataNeigh)
names(FishDataNeigh)
ggplot() +
geom_sf(data = FishDataNeigh, aes(fill = AREA_NAME), show.legend=FALSE) + scale_fill_viridis(option = "magma", discrete = TRUE,direction = -1,name="Distance to parks NND/m")+labs(title = "Neighborhoods joined to the fishnet") +geom_sf(data=bound, colour="blue", size=1, fill="NA")+theme_bw()
The goal of this section is to engineer features for our model that help predict the collision hotspots and the collision coldspots. It is notoriously difficult for linear (regression) models to predict these areas, which are effectively outliers. It is important that we do not say, under-predict the hotspots - otherwise, we under-predict risk.
To do so, a statistic called Local Moran’s I posits a null hypothesis that collisoin count at a given location is randomly distributed relative to its immediate neighbors. Where this hypothesis can be overturned, we may observe local clustering - a hotspot.
The below analysis is also useful for exploratory purposes and will be familiar to many spatial analysts. What may be new, is using these metrics as features of a statistical model. The first step is to create a spatial weights matrix that relates each grid cell to its adjacent neighbors.
Fish.nb <- poly2nb(as_Spatial(FishDataNeigh), queen=TRUE)
listw <- nb2listw(Fish.nb, style="W", zero.policy=TRUE)
There is a lot in the code block below and more context can be found
in this
section of the book: Public Policy Analytics: Code & Context for
Data Science in Government by Ken Steif, Ph.D Updated: February,
2021. The Local Moran’s I statistic is
calculated and a new feature is engineered called
Significant_Hotpots - which are those statistically
significant local clusters (p <= 0.05).
The local spatial statistic Moran’s I is calculated for each zone based on the spatial weights object used. The values returned include a Z-value, and may be used as a diagnostic tool. The statistic is: but those from Sokal et al. (1998) are implemented here.
\[ I_i=\frac{(x_i-\bar{x})}{{\sum_{k=1}^{n}(x_k-\bar{x})^2}/(n-1)}{\sum_{j=1}^{n}w_{ij}(x_j-\bar{x})} \]
The second part of the code block uses a loop to map four indicators
each with their own unique legends. Not only does this analysis reveal
the hotspots, but because Significant_Hotpots is observed
for each grid cell, it can be used as a predictive feature. Let’s see
how below.
LMI <- localmoran(FishDataNeigh$CollisCou, listw, zero.policy=NULL, na.action=na.fail, conditional=TRUE,
alternative = "two.sided", mlvar=TRUE,
spChk=NULL, adjust.x=FALSE)
localmoran_perm(FishDataNeigh$CollisCou, listw, nsim=499, zero.policy=NULL, na.action=na.fail,
alternative = "two.sided", mlvar=TRUE,
spChk=NULL, adjust.x=FALSE, sample_Ei=TRUE, iseed=NULL,
no_repeat_in_row=FALSE)
LMIdf <- data.frame(LMI)
names(LMIdf)
FishDataNeigh$LMI_I<- LMIdf$Ii
FishDataNeigh$LMI_Pv<- LMIdf$Pr.z....E.Ii..
FishDataNeigh$CV<- collision_net$cvID
head(FishDataNeigh)
view(FishDataNeigh)
ggplot1<- ggplot() +geom_sf(data = FishDataNeigh, aes(fill = LMI_I), colour= NA) + scale_fill_viridis_c(name="", option = "plasma",direction = -1) + theme_bw() + theme(legend.position="bottom")+ labs(title= "Moran's I")+geom_sf(data=bound, colour="blue", size=1, fill="NA")
ggplot2<- ggplot() +geom_sf(data = FishDataNeigh, aes(fill = LMI_Pv), colour= NA) + scale_fill_viridis(name="", option = "magma", direction = -1) + theme_bw() + theme(legend.position="bottom")+ labs(title= "P-values")+geom_sf(data=bound, colour="blue", size=1, fill="NA")
### To show tow variable in smae/single plot
grid.arrange(ggplot1,ggplot2, ncol=2)
Measuring distance or exposure to these hotspots provides information that the predictive model can use to account for very local hotspots. Below, an average nearest neighbor feature is added to `final_net` describing distance to very `Significant_Hotpots` (note the p-value used).
FishDataNeigh1 <-
FishDataNeigh %>%
mutate(colis.Sig =
ifelse(localmoran(FishDataNeigh$CollisCou,
listw)[,5] <= 0.01, 1, 0)) %>%
mutate(colis.Sig.dist =
nn_function(st_coordinates(st_centroid(FishDataNeigh)),
st_coordinates(st_centroid(
filter(FishDataNeigh, colis.Sig == 1))), 1))
st_as_sf(x = FishDataNeigh1)
names(FishDataNeigh1)
write_sf(FishDataNeigh1, "FishDataNeigh1.geojson")
#### above we can change the value of significant (P-value) range from 0.05 to 0.0000001 ... to find signifucant cells.
ggplot() +
geom_sf(data = FishDataNeigh1, aes(fill =colis.Sig), show.legend=FALSE) + geom_sf(data = filter(FishDataNeigh1, colis.Sig == 1) %>% st_centroid()) + scale_fill_continuous(type = "gradient") +labs(title = "Distance to very significant collision hotpots",subtitle = "Black points represent significant hotspots")+ theme_bw()+geom_sf(data=bound, colour="blue", size=1, fill="NA")
str(FishDataNeigh1)
In this section we introduce the concept of spatial cross-validation - a special flavor of cross-validation. The risk prediction model is then compared to the Kernel Density to test for its planning utility.
When the dependent variable is count data, a class of statistical
model called Generalized Linear Model or glm is often used.
There are many flavors of glm models covered throughout the
Ken Steif’s book, and very little context is given here.
While our model will be judged in a planning context, many readers will appreciate the regression summary below. Again, we are regressing collision count as a function of the features engineered above. The output shows that many of features are statistically significant predictors of collision counts.
GLM <- summary(glm(FishDataNeigh1$CollisCou ~ ., family = "poisson",data = FishDataNeigh1 %>% st_drop_geometry() %>% dplyr::select(SchoolsNND, BuildsNND, ParksNND,IntersNND,colis.Sig)))
GLM
We are estimating a regression model predicting the latent risk of collisions. One of our goals is that our model be generalizable. This means that it performs with equity across all neighborhoods regardless of other factors.
reg.vars <- c("SchoolsNND", "BuildsNND", "ParksNND", "IntersNND")
reg.ss.vars <- c("SchoolsNND", "BuildsNND", "ParksNND", "IntersNND","colis.Sig")
The code block below performs the spatial cross validation. It
iteratively pulls out one neighborhood or id, trains the
model on the remaining groups, and then predicts for the hold out. It
outputs predictions for each grid cell.
crossValidate.colis <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(CollisCou ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
The code block below runs the cross-validation. The first chunk omits
the spatial process variables, and the second includes it. In both cases
the holdout group id is nhoods, a field in
FishDataNeigh1.
reg.spatialCV <- crossValidate.colis(
dataset = FishDataNeigh1,
id = "AREA_SHORT_CODE",
dependentVariable = "CollisCou",
indVariables = reg.vars) %>%
dplyr::select(cvID = AREA_SHORT_CODE, CollisCou, Prediction, geometry)
reg.ss.spatialCV <- crossValidate.colis(
dataset = FishDataNeigh1,
id = "AREA_SHORT_CODE",
dependentVariable = "CollisCou",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = AREA_SHORT_CODE, CollisCou, Prediction, geometry)
Let us examine model accuracy by calculating model errors,
subtracting grid cell Predictions from the observed
CollisCou. reg.summary includes
CollisCou, predictions, errors and a
Regression label for both regressions estimated above.
reg.summary <-
rbind(
mutate(reg.spatialCV, Error = Prediction - CollisCou,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - CollisCou,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
as.tibble(st_drop_geometry(head(reg.summary)))
The code block below calculates several goodness of fit metrics for
both regressions and by AREA_SHORT_CODE (the
cvID). Mean_Error allows us to understand if
we are over or under predicting in a given place. MAE takes
the absolute value of errors.
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - CollisCou, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
Neighborhood errors are now mapped for both regressions. What do you notice about their spatial distribution? In general, errors are much lower when the local spatial process is accounted for - ie. the hot and cold spots.
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis(option = "magma", direction = -1) +
labs(title = "Burglary errors by LOGO-CV Regression") + theme(legend.position="bottom")+ theme_bw()+geom_sf(data=bound, colour="blue", size=1, fill="NA")
Although this is a very simple model, assuming it is satisfactory, we
can move on to the model predictions. The visualization below maps risk
predictions, the Kernel Density and the observed
CollisCou.
The risk prediction model forecasts much higher risk in and around the inner city, so much so that it masks colors associated with marginal risk in other areas of the City. The hotspots from the Kernel Density seem much more apparent.
So which predicts better - the Kernel Density or the Risk Prediction Model? Let’s find out.
grid.arrange(ncol=1,
reg.summary %>%
group_by(Regression, cvID) %>%
filter(str_detect(Regression, "Spatial Process")) %>%
ggplot() + geom_sf(aes(fill=Prediction)) +
scale_fill_viridis(option = "magma", direction = -1) +
ggtitle("Risk Predictions") + theme_map(),
as.data.frame(collision_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(nhoods)) %>%
aggregate(., FishDataNeigh1, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
scale_fill_viridis(name = "Density", option = "magma", direction = -1) +
ggtitle("Kernel Density") + theme_map(),
reg.summary %>%
group_by(Regression, cvID) %>%
filter(str_detect(Regression, "Spatial Process")) %>%
ggplot() + geom_sf(aes(fill=CollisCou)) +
scale_fill_viridis(option = "magma", direction = -1) +
ggtitle("Observed collision count") + theme_map())
library(ggredist)
As we have discussed, the most ‘accurate’ model may not the best. In fact, the most genearlizable model - which is less bias, might be important, but also may not be the best contender. What we are really after is the most ‘useful’ model - and useful can only be judged in the context of the planning use case. If we were on the ground in Toronto, we might study how the Police Department currently makes decisions around collision inspection and comparable use cases.
In this case, we assume a Accident Analyst uses Kernel Density to understand areas at most risk for collisions. If density is the ‘business-as-usual’ approach for targeting resources, then our forecast must be more useful - meaning it must identify more locations as high risk that actually did experience collisions.
To make it more challenging, we test our 2006-2021 collison model on the 2022 collisions withheld above. This way we can see if our model generalizes to the next year - which is important for the Police Department’s planning process.
Consider the Kernel Density and the risk predictions to be
“predictive surfaces” - layers to describe risk Citywide. Below, each
surface is divided into 5 risk categories, with the 90% to 100% risk
category as the highest. ntile calculates percentiles. You
can see, aggregate is used to spatial join the 2022
collisions to the risk category grid cells. This is done first for the
Kernel Density, below.
collis_KDE_sf <- as.data.frame(collision_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(FishDataNeigh1)) %>%
aggregate(., FishDataNeigh1, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(Collisions2022) %>% mutate(countCollisions = 1), ., sum) %>%
mutate(countCollisions = replace_na(countCollisions, 0))) %>%
dplyr::select(label, Risk_Category, countCollisions)
Now for the risk predictions, below.
collis_risk_sf <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(Collisions2022) %>% mutate(countCollisions = 1), ., sum) %>%
mutate(countCollisions = replace_na(countCollisions, 0))) %>%
dplyr::select(label,Risk_Category, countCollisions)
Next, the 5 risk categories are mapped along with the 2022 collisions overlaid in red. Which predictive surface looks like the better targeting tool?
rbind(collis_KDE_sf, collis_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = Collisions2022, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis(option = "plasma", discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2006-2021 fire risk predictions; 2022 collisons overlayed") +
theme_bw()
The plot below takes the evaluation one step further, analyzing the rate of actual 2022 collisions that fall into each risk category across both predictive surfaces. What can we conclude from the 5th and highest risk category and how might this speak to the utility of the algorithm?
rbind(collis_KDE_sf, collis_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countCollisions = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_collisons = countCollisions / sum(countCollisions)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_collisons)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
#scale_fill_viridis(option="magma", discrete = TRUE, direction=-1) +
scale_fill_manual(values = c("#6b4596ff", "#f7cb44ff")) +
labs(title = "Risk prediction vs. Kernel density, 2022 Collisions") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
Indeed, the risk predictive model captures more 2022 collisions than the Kernel Density, suggesting it is a more useful tool for planning. Recall, the Kernel Density predicts based purely on spatial autocorrelation. To reiterate from the introduction - the difference is that Kernel Density tells us to revisit past accidents, but risk predictive models, if properly developed, reveal risk based on a more thorough predictive understanding of the spatial system and why things occur where.
The risk prediction models are much more effective when we have a complete sample of the outcome - like collisions. The analyst should be cautious of use cases where the dependent variable suffers from selection bias - like drug crimes.
I hope this markdown was useful and I hope you replicate these methods in your own work. Please do check out Ken Steif’s book , Public Policy Analytics, to learn so much more about spatial modeling approache to solving public policy challenges.
Thanks!
Alireza