1 Overview

This non-exhaustive walk-through is provided for those wishing to apply spatial regression analysis in their work. These techniques can help in understanding geographical patterns of crime or public safety problems, and may be of most interest to analysts working in or closely with Community Safety Partnerships, Violence Reduction Units, or in strategic roles where there are opportunities to inform changes in policies and crime prevention plans.

For a comprehensive, accessible guide to approaches and theory I would highly recommend Chainey, S. (2020). Understanding Crime: Analyzing the Geography of Crime. Redlands, California: Esri Press, Chapter 8: Applying spatial regression analysis to crime data. Further resources are included at the end of this guide.

For this walk-through I created an aggregated data set using Census Output Area (COA) areal units covering the Teesside Urban Area. The significant majority of COAs contain between 110 and 139 households. The variables included are described in the data dictionary below.

When creating a data set it's a good idea to think about sourcing variables (or proxy variables) which are theoretically relevant and hypothesis-driven - we could find many correlates within a large data set, but without theory we might find it difficult to plausibly explain how a variable affects the outcome. Also consider whether or not the variable is something that could be modified in practice - the presence of heavy rain might explain lower burglary, but it is not something that could be written into a policy or crime prevention plan.

2 Libraries and data

The libraries used are listed below.

library(tidyverse)    # data wrangling and ggplot2 graphics
library(sf)           # simple features spatial data
library(sp)           # legacy spatial classes
library(tmap)         # thematic mapping
library(RColorBrewer) # colour palettes
library(GGally)       # ggpairs() correlation matrix plots
library(spdep)        # spatial weights and spatial dependence tests
library(spatialreg)   # spatial regression models
library(car)          # vif() multi-collinearity checks

Data import. This data set has been cleaned and checked for errors.

# import spatial data, geojson file
ts <- st_read("ts_burg_geojson.geojson")
# retrospectively added non-White data column
ts$bame <- ts$ea_mixed + ts$ea_black + ts$ea_asian

Data dictionary.
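
The variable names and types can be checked against the data dictionary by inspecting the data directly; a quick sketch (glimpse() is from dplyr, loaded with the tidyverse):

# inspect variable names and types
glimpse(ts)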

3 Exploratory Data Analysis (EDA)

Let's begin by building an understanding of the data by exploring the trends and relationships among the variables. I used resources from the book Geographical Data Science and Spatial Data Analysis by Lex Comber and Chris Brunsdon to assist in this EDA.

3.1 Correlation

The cor.test() function is a simple way to test the relationship between two variables, providing Pearson's product-moment correlation by default. A few examples are shown below, observing the relationship between the burglary rate in 2019 and a) the proportion of Asian households, b) the proportion of student households and c) the employment deprivation score.

The outputs show the significance of the correlations (p-value), the confidence intervals and the correlation coefficient estimates. These variables were all significantly correlated with burglary rates in 2019, with the effect size being greatest for employment deprivation.

# Burglary rate and % Asian Households
cor.test(~bdrate19 + ea_asian, data = ts)
## 
##  Pearson's product-moment correlation
## 
## data:  bdrate19 and ea_asian
## t = 9.9455, df = 1322, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2129952 0.3132621
## sample estimates:
##       cor 
## 0.2638412
# Burglary rate and % Student Households
cor.test(~bdrate19 + hh_student, data = ts)
## 
##  Pearson's product-moment correlation
## 
## data:  bdrate19 and hh_student
## t = 10.319, df = 1322, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2224276 0.3221644
## sample estimates:
##       cor 
## 0.2730295
# Burglary rate and Employment Deprivation
cor.test(~bdrate19 + imd_emp, data = ts)
## 
##  Pearson's product-moment correlation
## 
## data:  bdrate19 and imd_emp
## t = 14.592, df = 1322, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3250993 0.4179373
## sample estimates:
##       cor 
## 0.3724497

Scatter plots offer a visual alternative for viewing linear relationships. The example output below shows that there is a positive correlation, or linear relationship, between burglary rates and employment deprivation.

# call to ggplot
ggplot(data = ts, aes(x = bdrate19, y = imd_emp)) +
# add scatter plot, set opacity aesthetics  
  geom_point(alpha = 0.1) +
# add trend line method and aesthetics  
  geom_smooth(method = "lm", colour = "#DE2D26") +
# add labels
  labs(title = "Burglary Rate and Employment Deprivation: Teesside Output Areas 2019", x = "Burglary Rate", y = "Employment Deprivation") +
# choose theme  
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

Boxplots are useful when you want to view relationships between continuous and categorical data. Using classification data we can view the distribution of burglary rates at COA level by consumer vulnerability group and output area classification.

ts %>%
# call ggplot, order output area classifications by median burglary rate  
  ggplot(aes(x = reorder(oac_gp, bdrate, FUN = median), y = bdrate)) +
# add data to boxplot; fill is set as a constant outside aes()  
  geom_boxplot(aes(group = oac_gp), fill = "dodgerblue", outlier.alpha = 0.4, outlier.colour = "grey25") +
# move x axis to vertical   
  coord_flip() + xlab("") + ylab("Burglary Rate") +
  labs(title = "Burglary Rate by Output Area Classification") +
  theme_minimal()

# repeat the above, changing the categorical variable to consumer vulnerability class
ts %>%
  ggplot(aes(x = reorder(gd_cv, bdrate, FUN = median), y = bdrate)) +
  geom_boxplot(aes(group = gd_cv), fill = "dodgerblue", outlier.alpha = 0.4, outlier.colour = "grey25") +
  coord_flip() + xlab("") + ylab("Burglary Rate") +
  labs(title = "Burglary Rate by Consumer Vulnerability Classification") +
  theme_minimal()

Seeing that the vulnerable communities group has higher rates of burglary, it might be useful to explore socio-economic and demographic variables to help identify potential hypotheses as to why.

The radar plot below provides an overview of a selection of Census 2011 and IMD 2019 variables. We can see that the vulnerable communities group has much lower levels of owner-occupied housing and educational attainment. There are also higher levels of income deprivation and employment deprivation. We might infer from this that communities with lower incomes and higher unemployment tend, on average, to exhibit higher rates of burglary. This might point towards targeted interventions to secure households (assisted target hardening of properties) and to reduce the skills/unemployment gap.

# create tibble from the spatial data ts
ts_tbl <- as_tibble(ts)

# create dataset for radar plot, select numeric variables to include
radar <- ts_tbl %>% select(19:24, 27, 34:37, 44, 49) %>%
# standardise the variables using scale  
  mutate_if(is.numeric, scale) %>%
# aggregate COA data to consumer vulnerability group  
  aggregate(by = list(ts$gd_cv), FUN = mean) %>%
# pivot   
  pivot_longer(-Group.1) %>%
# arrange by variable name for plotting  
  arrange(name) %>%
  filter(name != "gd_cv")

ggplot(data = radar, aes(x=factor(name), y=value, group = Group.1, colour = Group.1, fill = Group.1)) +
  geom_point(size = 2) +
  geom_polygon(size = 1, alpha = 0.2) +
  scale_color_manual(values = brewer.pal(6, "Set2")) +
  scale_fill_manual(values = brewer.pal(8, "Set2")) +
  facet_wrap(~Group.1, ncol = 3) +
  coord_polar() +
  theme_light() +
  theme(legend.position = "none", axis.title = element_blank(), axis.text = element_text(size = 6))

3.2 Pairs

When working with problems that have many variables supported by theoretical frameworks, viewing them in one plot can provide a helpful overview.

In the case of residential burglary there are many variables to consider including:

  • Physical features and immediate surroundings of a property
  • Socio-economic characteristics and household composition
  • Household routine activities
  • Crime specific considerations (proximity to acquisitive crime perpetrators, accessible markets for stolen items)
  • Interaction of all the above

Unfortunately, only limited information is available publicly for this walk-through.

The output below plots a selection of variables and their relationships to burglary rates. The results are split into two groups of COAs - higher or lower than the median rate. We can view the correlation coefficients, density plots, scatter plots, histograms and boxplots.

# select variables for pairing in correlation matrix
ts_tbl %>% select(c(12, 19:21, 23:24, 35, 37, 41, 49)) %>%
# add vector to indicate burglary rates around the median as Higher or Lower  
  mutate(bdrate_high = ifelse(bdrate19 > median(bdrate19), "H", "L")) %>%
# call ggpairs and set aesthetics for correlation coefficients and plots  
  ggpairs(aes(colour = bdrate_high, alpha = 0.4), 
          upper = list(continuous = wrap('cor', size = 2.5)),
          lower = list(continuous = wrap('points', alpha = 0.7, size = 0.1))) +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

3.3 Spatial EDA

Viewing the variables on maps is a useful heuristic for visualizing, and forming hypotheses about, spatial relationships.

The plot below shows burglary rates across Teesside COAs; this can be viewed as a static plot or, by switching to tmap_mode("view"), as an interactive map. Although somewhat limited, we can see that most geographical areas have lower rates of burglary and that areas with similar ranged values often cluster together geographically.

# set to plot mode
tmap_mode("plot") 
# add polygon/shape file with data
tm_shape(ts) + 
# set variable attribute, title and style  
  tm_polygons("bdrate19", title = "Burglary Dwelling Per 1,000 2019", palette = "YlOrRd", style = "jenks", legend.hist = T) +
# customise layout  
  tm_layout(title = "Teesside Burglary by COA",
            frame = F, legend.outside = T,
            legend.hist.width = 1,
            legend.format = list(digits = 1),
            legend.outside.position = c("left", "top"),
            legend.text.size = 0.6,
            legend.title.size = 1) +
  tm_scale_bar(position = c("right", "bottom")) +
  tm_borders(col = "grey", lwd= 1)

Viewing and arranging spatial plots side-by-side is more useful still. From this we can begin to see that areas with high rates of student households and Asian households appear to occupy the same COAs as high burglary rates (around the River Tees in Stockton and central Middlesbrough), but this would be unlikely to explain high rates in Redcar and pockets in the north of Stockton or south of Middlesbrough, where there are few student or Asian households.

tmap_mode("plot") 

map1 <- tm_shape(ts) +
  tm_fill("bdrate19", palette = "Reds", style = "jenks", n = 5, title = "BD Rate '19") +
  tm_layout(legend.position = c("right", "bottom"), frame = F)

map2 <- tm_shape(ts) +
  tm_fill("hh_student", palette = "YlGn", style = "jenks", n = 5, title = "% Student HH") +
  tm_layout(legend.position = c("right", "bottom"), frame = F)

map3 <- tm_shape(ts) +
  tm_fill("ea_asian", palette = "YlOrRd", style = "jenks", n = 5, title = "% Asian HH") +
  tm_layout(legend.position = c("right", "bottom"), frame = F)

tmap_arrange(map2, map3, map1, nrow = 1)

Other pockets of high burglary rates would appear to co-locate with areas of high % social housing and high % of residents with no qualifications - in some instances but not all (e.g., Hemlington, in the south centre-left part of Teesside).

tmap_mode("plot") 

map1 <- tm_shape(ts) +
  tm_fill("bdrate19", palette = "Reds", style = "jenks", n = 5, title = "BD Rate '19") +
  tm_layout(legend.position = c("right", "bottom"), frame = F)

map4 <- tm_shape(ts) +
  tm_fill("ten_soc", palette = "YlGn", style = "jenks", n = 5, title = "% Social HH") +
  tm_layout(legend.position = c("right", "bottom"), frame = F)

map5 <- tm_shape(ts) +
  tm_fill("edu_noqual", palette = "YlOrRd", style = "jenks", n = 5, title = "% No Quals") +
  tm_layout(legend.position = c("right", "bottom"), frame = F)

tmap_arrange(map4, map5, map1, nrow = 1)

4 Regression

Let's now attempt to build a regression model to identify the relationship between multiple explanatory variables and a dependent variable (the burglary rate).

4.1 Ordinary least squares (OLS)

Three OLS models and one Poisson GLM have been produced.

  • Model 1 uses Employment Deprivation, % Student Households, % Non-White, % No Qualifications, % Private Rented, % Owner Occupied and the Living Environment score.

  • Model 2 excludes % Student Households (non-significant in Model 1) and the Living Environment score.

  • Model 3 uses the Model 1 variables, minus the Living Environment score, but log-transforms the burglary rate using log1p().

  • Model 4 uses the Model 1 variables but changes the dependent variable to the burglary count (burg19) and uses a Poisson regression.

The vif() function computes the Variance Inflation Factor, used to check for multi-collinearity (where two or more explanatory variables are correlated). If an explanatory variable has a VIF > 7.5, this suggests the variable is redundant in the model and should be removed.

# run model 1
model1 <- lm(bdrate19 ~ imd_emp + hh_student + bame + edu_noqual + ten_priv + ten_own + imd_livenv, data = ts)
# summary of model outputs
summary(model1)
## 
## Call:
## lm(formula = bdrate19 ~ imd_emp + hh_student + bame + edu_noqual + 
##     ten_priv + ten_own + imd_livenv, data = ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.993  -4.324  -1.463   2.785  63.486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.50666    2.00720   2.245 0.024917 *  
## imd_emp      9.99762    4.12453   2.424 0.015487 *  
## hh_student   0.09956    0.09066   1.098 0.272343    
## bame         0.13315    0.03449   3.861 0.000118 ***
## edu_noqual   0.05264    0.03169   1.661 0.096974 .  
## ten_priv     0.05717    0.03100   1.844 0.065366 .  
## ten_own     -0.05631    0.01670  -3.372 0.000767 ***
## imd_livenv   0.17317    0.03573   4.847  1.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.237 on 1316 degrees of freedom
## Multiple R-squared:  0.2675, Adjusted R-squared:  0.2636 
## F-statistic: 68.66 on 7 and 1316 DF,  p-value: < 2.2e-16
# variation inflation factor, to check for multi-collinearity
vif(model1)
##    imd_emp hh_student       bame edu_noqual   ten_priv    ten_own imd_livenv 
##   3.351667   1.794739   1.787851   3.176990   3.273476   3.938303   2.190363
# run model 2
model2 <- lm(bdrate19 ~ imd_emp + bame + edu_noqual + ten_priv + ten_own, data = ts)
summary(model2)
## 
## Call:
## lm(formula = bdrate19 ~ imd_emp + bame + edu_noqual + ten_priv + 
##     ten_own, data = ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.207  -4.582  -1.468   2.843  63.372 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.57289    1.95802   1.314  0.18907    
## imd_emp     12.69982    4.05137   3.135  0.00176 ** 
## bame         0.17206    0.03275   5.253 1.74e-07 ***
## edu_noqual   0.07411    0.03155   2.349  0.01897 *  
## ten_priv     0.14937    0.02499   5.976 2.94e-09 ***
## ten_own     -0.04189    0.01615  -2.594  0.00959 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.31 on 1318 degrees of freedom
## Multiple R-squared:  0.2534, Adjusted R-squared:  0.2506 
## F-statistic: 89.47 on 5 and 1318 DF,  p-value: < 2.2e-16
vif(model2)
##    imd_emp       bame edu_noqual   ten_priv    ten_own 
##   3.177531   1.584625   3.093580   2.090806   3.619233
# run model 3
model3 <- lm(log1p(bdrate19) ~ imd_emp + hh_student + bame + edu_noqual + ten_priv + ten_own, data = ts)
summary(model3)
## 
## Call:
## lm(formula = log1p(bdrate19) ~ imd_emp + hh_student + bame + 
##     edu_noqual + ten_priv + ten_own, data = ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2645 -0.9013  0.1585  0.7998  2.5098 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.638376   0.244437   2.612 0.009114 ** 
## imd_emp      1.742004   0.508046   3.429 0.000625 ***
## hh_student   0.001362   0.011327   0.120 0.904294    
## bame         0.013961   0.004268   3.271 0.001099 ** 
## edu_noqual   0.013683   0.003918   3.492 0.000495 ***
## ten_priv     0.018048   0.003269   5.521 4.05e-08 ***
## ten_own     -0.002433   0.002035  -1.195 0.232172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.031 on 1317 degrees of freedom
## Multiple R-squared:  0.2175, Adjusted R-squared:  0.2139 
## F-statistic:    61 on 6 and 1317 DF,  p-value: < 2.2e-16
vif(model3)
##    imd_emp hh_student       bame edu_noqual   ten_priv    ten_own 
##   3.247942   1.789473   1.748927   3.101661   2.324356   3.736125
# run model 4 as Poisson 
model4 <- glm(burg19 ~ imd_emp + hh_student + bame + edu_noqual + ten_priv + ten_own + imd_livenv, family = "poisson", data = ts)
summary(model4)
## 
## Call:
## glm(formula = burg19 ~ imd_emp + hh_student + bame + edu_noqual + 
##     ten_priv + ten_own + imd_livenv, family = "poisson", data = ts)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4654  -1.4321  -0.3452   0.6468   5.9627  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.578046   0.159372   3.627 0.000287 ***
## imd_emp      1.163623   0.308077   3.777 0.000159 ***
## hh_student   0.006258   0.004473   1.399 0.161809    
## bame         0.017215   0.001832   9.398  < 2e-16 ***
## edu_noqual   0.003530   0.002561   1.378 0.168128    
## ten_priv     0.002715   0.002169   1.252 0.210585    
## ten_own     -0.009059   0.001323  -6.846 7.58e-12 ***
## imd_livenv   0.017323   0.002306   7.513 5.77e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4075.8  on 1323  degrees of freedom
## Residual deviance: 2847.2  on 1316  degrees of freedom
## AIC: 5365
## 
## Number of Fisher Scoring iterations: 5
vif(model4)
##    imd_emp hh_student       bame edu_noqual   ten_priv    ten_own imd_livenv 
##   2.856732   2.107486   1.940106   3.032330   5.084854   3.629715   2.760446

The results from the three linear models are weak. The output below shows the results for the three models side-by-side. Model 1 (27%) explains more of the variation in burglary rates than models 2 and 3. I'm ignoring model 4 for this walk-through; it is included to show the call for producing a Poisson regression, which may be used when the data contain many units with zero counts, creating a highly skewed distribution.

texreg::screenreg(list(model1, model2, model3, model4))
## 
## ===================================================================
##                 Model 1      Model 2      Model 3      Model 4     
## -------------------------------------------------------------------
## (Intercept)        4.51 *       2.57         0.64 **       0.58 ***
##                   (2.01)       (1.96)       (0.24)        (0.16)   
## imd_emp           10.00 *      12.70 **      1.74 ***      1.16 ***
##                   (4.12)       (4.05)       (0.51)        (0.31)   
## hh_student         0.10                      0.00          0.01    
##                   (0.09)                    (0.01)        (0.00)   
## bame               0.13 ***     0.17 ***     0.01 **       0.02 ***
##                   (0.03)       (0.03)       (0.00)        (0.00)   
## edu_noqual         0.05         0.07 *       0.01 ***      0.00    
##                   (0.03)       (0.03)       (0.00)        (0.00)   
## ten_priv           0.06         0.15 ***     0.02 ***      0.00    
##                   (0.03)       (0.02)       (0.00)        (0.00)   
## ten_own           -0.06 ***    -0.04 **     -0.00         -0.01 ***
##                   (0.02)       (0.02)       (0.00)        (0.00)   
## imd_livenv         0.17 ***                                0.02 ***
##                   (0.04)                                  (0.00)   
## -------------------------------------------------------------------
## R^2                0.27         0.25         0.22                  
## Adj. R^2           0.26         0.25         0.21                  
## Num. obs.       1324         1324         1324          1324       
## AIC                                                     5364.99    
## BIC                                                     5406.50    
## Log Likelihood                                         -2674.50    
## Deviance                                                2847.22    
## ===================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
# other statistics to consider
# AIC measure, the lower the AIC the better the model
AIC(model1)
## [1] 9351.008
AIC(model2)
## [1] 9372.269
# log likelihood measures goodness of fit; the higher the value (closer to zero), the better the model
logLik(model1)
## 'log Lik.' -4666.504 (df=9)
logLik(model2)
## 'log Lik.' -4679.135 (df=7)
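
As an aside on model 4: a quick check for overdispersion in a Poisson model is to compare the residual deviance with the residual degrees of freedom; a ratio well above 1 suggests a quasipoisson or negative binomial model may be worth considering. A minimal sketch using base R functions:

# ratio of residual deviance to residual degrees of freedom for model 4;
# from the summary above this is roughly 2847.2 / 1316, about 2.2,
# which points to overdispersion
deviance(model4) / df.residual(model4)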

Mapping the residuals can help identify areas, and potentially other variables, that could improve the model where it is under-performing. These are shown in the plot below.

# add residuals to ts data
ts$resids <- model1$residuals
# add standardised residuals
ts$s.resids <- as.vector(scale(ts$resids))

# add scaled Poisson residuals (working residuals from the glm fit)
ts$poisresids <- as.vector(scale(model4$residuals))

# map resids
tmap_mode("view")
tm_shape(ts) +
  tm_fill("s.resids", palette = "seq", breaks = c(-Inf, -2, 0, 2, Inf), title = "Std. Resids", alpha = 0.6) +
  tm_layout(aes.palette = list(seq = "-RdBu")) +
  tm_shape(ts) +
  tm_borders(col = "black",lwd = 1)

A problem here is that we are using an approximated residential burglary data set. It is possible that the COAs with high/low standardised residuals are those where all burglary points falling within Retail Centre Areas were discarded.

The age of the Census data also presents problems. Some COAs have undergone extensive changes in the number of households (urban clearance, urban development), which could have altered their socio-economic and population structures since 2011. Examples are shown below for two areas of Teesside.

Urban Clearance of older properties in South Bank

New Urban Development in Ingleby Barwick

4.2 Creating spatial weights

The data set is a spatial polygon layer, which we can use to create spatial weights and neighbour relationships between COAs. A full guide, Spatial regression analysis in R by Carlos Mendez, is available here.

# create queen contiguity
queen.nb <- poly2nb(ts, row.names = ts$oa11cd)

# create rook contiguity
rook.nb <- poly2nb(ts, row.names = ts$oa11cd, queen = FALSE)

# summarise relationships
summary(queen.nb)
## Neighbour list object:
## Number of regions: 1324 
## Number of nonzero links: 7736 
## Percentage nonzero weights: 0.4413067 
## Average number of links: 5.8429 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  22  31 
##   1  23  93 215 299 276 206  98  60  22  13   7   3   3   2   1   1   1 
## 1 least connected region:
## 1236 with 1 link
## 1 most connected region:
## 430 with 31 links
# test symmetry
is.symmetric.nb(queen.nb)
## [1] TRUE
# create spatial weights list (row-standardised by default)
queen.listw <- nb2listw(queen.nb)

Check for spatial dependence using the Moran's I test. The Moran's I statistic tells us whether we have clustered values (closer to 1), dispersed values (closer to -1), or no pattern (0). When the p-value is significant and the standard deviate is positive, we can conclude that the spatial distribution is more spatially clustered than would be expected if the underlying spatial processes were random.

There is spatial clustering in burglary rates by Teesside COAs.

moranTest <- moran.test(ts$bdrate19, nb2listw(queen.nb))
moranTest
## 
##  Moran I test under randomisation
## 
## data:  ts$bdrate19  
## weights: nb2listw(queen.nb)    
## 
## Moran I statistic standard deviate = 19.71, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.3236350605     -0.0007558579      0.0002708625
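
To see where this clustering occurs, local Moran's I (LISA) statistics could be mapped. Below is a minimal sketch using the same queen weights; localm and local_i are illustrative names, not part of the original analysis.

# local Moran's I for each COA; column "Ii" holds the local statistic
localm <- localmoran(ts$bdrate19, queen.listw)
ts$local_i <- localm[, "Ii"]
# map the local statistic with a diverging palette
tm_shape(ts) +
  tm_fill("local_i", style = "jenks", palette = "-RdBu", title = "Local Moran's I") +
  tm_borders(col = "grey")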

4.3 Model selection

To identify which type of model to select we can examine model 1 using a Moran test on its residuals and Lagrange Multiplier diagnostics.

The Moran test for model 1 indicates some clustering in the residuals, with a positive standard deviate and a significant p-value.

The Lagrange Multiplier results suggest proceeding with a Spatial Lag model: the robust LM lag statistic is significant (p < 0.001) while the robust LM error statistic is not (p = 0.058). A flow-chart for interpreting the results and selecting a model is provided here, see Figure 2.

# test the spatial dependency of the model
lmMoranTest <- lm.morantest(model1, queen.listw)
lmMoranTest
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = bdrate19 ~ imd_emp + hh_student + bame + edu_noqual
## + ten_priv + ten_own + imd_livenv, data = ts)
## weights: queen.listw
## 
## Moran I statistic standard deviate = 6.2654, p-value = 1.86e-10
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.0996969173    -0.0032438816     0.0002699493
# calculate lagrange multiplier diagnostics
lmLMtests <- lm.LMtests(model1, queen.listw, test = c("LMerr", "LMlag", "RLMerr", "RLMlag", "SARMA"))
lmLMtests
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = bdrate19 ~ imd_emp + hh_student + bame + edu_noqual
## + ten_priv + ten_own + imd_livenv, data = ts)
## weights: queen.listw
## 
## LMerr = 36.259, df = 1, p-value = 1.727e-09
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = bdrate19 ~ imd_emp + hh_student + bame + edu_noqual
## + ten_priv + ten_own + imd_livenv, data = ts)
## weights: queen.listw
## 
## LMlag = 47.03, df = 1, p-value = 6.991e-12
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = bdrate19 ~ imd_emp + hh_student + bame + edu_noqual
## + ten_priv + ten_own + imd_livenv, data = ts)
## weights: queen.listw
## 
## RLMerr = 3.599, df = 1, p-value = 0.05781
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = bdrate19 ~ imd_emp + hh_student + bame + edu_noqual
## + ten_priv + ten_own + imd_livenv, data = ts)
## weights: queen.listw
## 
## RLMlag = 14.37, df = 1, p-value = 0.0001502
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = bdrate19 ~ imd_emp + hh_student + bame + edu_noqual
## + ten_priv + ten_own + imd_livenv, data = ts)
## weights: queen.listw
## 
## SARMA = 50.629, df = 2, p-value = 1.014e-11

4.4 Spatial Lag Model

The call below fits the model with lmSLX(), which adds spatially lagged versions of the explanatory variables (an SLX specification) alongside the originals; a summary follows. The R-squared is only marginally better than the OLS model 1 (0.2827 versus 0.2675). It is likely that, if we were to include information on past burglary rates, this would improve significantly, as crime hot spots tend not to change quickly over time.

# spatial lag model include spatial weights
model5 <- lmSLX(bdrate19 ~ imd_emp + hh_student + bame + edu_noqual + ten_priv + ten_own + imd_livenv, data = ts, queen.listw) 
summary(model5)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.007  -4.341  -1.497   2.740  62.206 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     4.0685065  3.6923719   1.102  0.27072   
## imd_emp        -2.8426275  5.7596099  -0.494  0.62171   
## hh_student     -0.1643035  0.1129870  -1.454  0.14614   
## bame            0.0263758  0.0541157   0.487  0.62606   
## edu_noqual      0.0177939  0.0370836   0.480  0.63143   
## ten_priv        0.0670293  0.0347207   1.931  0.05376 . 
## ten_own        -0.0488920  0.0183500  -2.664  0.00781 **
## imd_livenv      0.1269417  0.0702918   1.806  0.07116 . 
## lag.imd_emp    25.0742173  9.2445558   2.712  0.00677 **
## lag.hh_student  0.5358595  0.1729742   3.098  0.00199 **
## lag.bame        0.1078121  0.0733639   1.470  0.14192   
## lag.edu_noqual -0.0173323  0.0617539  -0.281  0.77901   
## lag.ten_priv   -0.0643768  0.0646851  -0.995  0.31981   
## lag.ten_own    -0.0003472  0.0345250  -0.010  0.99198   
## lag.imd_livenv  0.0421586  0.0910299   0.463  0.64335   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.173 on 1309 degrees of freedom
## Multiple R-squared:  0.2827, Adjusted R-squared:  0.275 
## F-statistic: 36.84 on 14 and 1309 DF,  p-value: < 2.2e-16
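
Note that lmSLX() lags the explanatory variables rather than the dependent variable. The spatial lag model pointed to by the robust LM lag test lags the dependent variable itself; a minimal sketch of that model is below, assuming the same formula and queen weights (model6 and its use are illustrative, not fitted in the original analysis).

# spatial lag (SAR) model sketch: burglary rates modelled as depending on
# rates in neighbouring COAs via the spatial autoregressive term rho
model6 <- lagsarlm(bdrate19 ~ imd_emp + hh_student + bame + edu_noqual +
                     ten_priv + ten_own + imd_livenv,
                   data = ts, listw = queen.listw)
summary(model6)
# a significant rho confirms spatial dependence in the outcome; impacts()
# decomposes coefficients into direct and indirect (spillover) effects
impacts(model6, listw = queen.listw)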

5 Summary

I appreciate this walk-through is a whistle-stop tour! The intended audience is those working as crime, community safety or violence reduction unit analysts who are interested in developing skills in spatial analysis, statistics and working with R.

Other more advanced techniques that you may wish to consider when working with spatial data are Geographically Weighted Regression (GWR), machine learning (e.g., random forests, gradient boosting machines, k-nearest neighbours) and Risk Terrain Modeling; a sketch of a GWR call follows below.
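
As a starting point for GWR, here is a minimal sketch assuming the spgwr package; the reduced set of predictors is illustrative only and was not run in this guide.

# GWR sketch (assumes the spgwr package)
library(spgwr)
# spgwr works with sp objects, so convert the sf data first
ts_sp <- as(ts, "Spatial")
# choose a bandwidth by cross-validation, then fit; coefficients are
# estimated locally, so the effect of each variable can vary across COAs
bw <- gwr.sel(bdrate19 ~ imd_emp + bame + ten_own, data = ts_sp)
gwr_model <- gwr(bdrate19 ~ imd_emp + bame + ten_own, data = ts_sp, bandwidth = bw)
gwr_model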

6 Learn more

Some amazing resources are listed below.