Spatial Regression Residuals

In this example, I’ll use the simplest spatial regression model: a spatial error model. We use this model when we believe that our error term is spatially correlated.

#load in data and packages first
library(sf)

## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(spData)

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

library(ggplot2)
library(ggthemes)
library(spdep)

## Warning: package 'spdep' was built under R version 4.2.2

## Loading required package: sp

library(spatialreg)

## Warning: package 'spatialreg' was built under R version 4.2.2

## Loading required package: Matrix

## 
## Attaching package: 'spatialreg'

## The following objects are masked from 'package:spdep':
## 
##     get.ClusterOption, get.coresOption, get.mcOption,
##     get.VerboseOption, get.ZeroPolicyOption, set.ClusterOption,
##     set.coresOption, set.mcOption, set.VerboseOption,
##     set.ZeroPolicyOption

library(GWmodel)

## Warning: package 'GWmodel' was built under R version 4.2.2

## Loading required package: maptools

## Checking rgeos availability: FALSE
## Please note that 'maptools' will be retired by the end of 2023,
## plan transition at your earliest convenience;
## some functionality will be moved to 'sp'.
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()

## Loading required package: robustbase

## Warning: package 'robustbase' was built under R version 4.2.2

## Loading required package: Rcpp

## Welcome to GWmodel version 2.2-9.

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack

#Let's load in some packages and data
setwd("~/Binghamton/geog380/nyc_neighborhoods")
nyc <- st_read("NYC_Nhood ACS2008_12.shp")

## Reading layer `NYC_Nhood ACS2008_12' from data source 
##   `C:\Users\mhaller\Documents\Binghamton\geog380\nyc_neighborhoods\NYC_Nhood ACS2008_12.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 195 features and 98 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS:  WGS 84

#As always, let's plot the data first

ggplot()+
  geom_sf(data = nyc, aes(fill=as.numeric(poor/poptot)))+
  scale_fill_steps(
    name = "Poverty Rate",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 5,
    show.limits = T)+
  theme_void()

Let’s start by putting together a simple regression model

#create our variables for the model:
model <- nyc %>% 
  mutate(unemp = popunemplo/poptot,
         educ = highschool/poptot,
         fem = female/poptot,
         nonwhite = (poptot-european)/poptot,
         poverty_frac = (poor/poptot)) %>% 
  select(c(poverty_frac, unemp, educ, fem, nonwhite, medianinco, 
           HHsize, medianage, cartodb_id, geometry)) %>% 
  drop_na() %>% 
  filter(!HHsize == "NA")
  


#Run a basic model

summary(lm1 <- lm(poverty_frac ~ unemp + educ + fem + nonwhite, data = model))

## 
## Call:
## lm(formula = poverty_frac ~ unemp + educ + fem + nonwhite, data = model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16908 -0.06067 -0.01158  0.04802  0.48144 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1625     0.1550   1.048 0.296019    
## unemp         1.5920     0.6074   2.621 0.009565 ** 
## educ         -0.1299     0.1359  -0.956 0.340474    
## fem          -0.2064     0.2966  -0.696 0.487530    
## nonwhite      0.1405     0.0354   3.968 0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09213 on 169 degrees of freedom
## Multiple R-squared:  0.2861, Adjusted R-squared:  0.2692 
## F-statistic: 16.93 on 4 and 169 DF,  p-value: 1.077e-11

#Let's test for spatial autocorrelation in the residuals!
#Create the weights first
nyc_list <- model %>% 
  poly2nb(c('cartodb_id')) %>% 
  nb2listw(zero.policy = TRUE) 

#run the Moran's I test for regression residuals
lm.morantest(lm1, nyc_list)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = poverty_frac ~ unemp + educ + fem + nonwhite, data
## = model)
## weights: nyc_list
## 
## Moran I statistic standard deviate = 11.519, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.608220471     -0.017314809      0.002948891

We can also plot the residuals:

nyc1 <- nyc %>% filter(cartodb_id %in% model$cartodb_id)

nyc1 <- nyc1 %>% 
    mutate(olsresid = resid(lm1))

ggplot()+
  geom_sf(data = nyc1, aes(fill=olsresid))+
  scale_fill_steps(
    name = "OLS Residuals",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 5,
    show.limits = T)+
  theme_void()

Clearly there is some spatial dependence in the residuals!

I can check the residuals to see if other assumptions are violated using the plot() function on my regression. Note that this does not work for a spatial regression.

plot(lm1)

#Plots show up in the plotting window below

Let’s run a spatial error model to control for spatial dependence in the residuals.

lm_error <- errorsarlm(poverty_frac ~ unemp + educ + fem + nonwhite,
              data = model,
              listw = nyc_list,
              zero.policy = TRUE, 
              na.action = na.omit)

summary(lm_error)

## 
## Call:errorsarlm(formula = poverty_frac ~ unemp + educ + fem + nonwhite, 
##     data = model, listw = nyc_list, na.action = na.omit, zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.108433 -0.034228 -0.007583  0.029398  0.384573 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  0.294585   0.121982  2.4150   0.01574
## unemp        0.437989   0.388484  1.1274   0.25956
## educ        -0.069613   0.154146 -0.4516   0.65155
## fem         -0.398446   0.220523 -1.8068   0.07079
## nonwhite     0.167325   0.029635  5.6462 1.641e-08
## 
## Lambda: 0.77529, LR test value: 124.99, p-value: < 2.22e-16
## Asymptotic standard error: 0.043101
##     z-value: 17.988, p-value: < 2.22e-16
## Wald statistic: 323.55, p-value: < 2.22e-16
## 
## Log likelihood: 233.0546 for error model
## ML residual variance (sigma squared): 0.0032235, (sigma: 0.056776)
## Number of observations: 174 
## Number of parameters estimated: 7 
## AIC: -452.11, (AIC for lm: -329.11)

From this output, we can see that the lamda value is 0.78, which is statistically significant. This suggests that there was significant spatial autocorrelation in the residuals, and using a spatial regression model should improve the fit of the model.

Here, we can see an AIC value of -452.11, compared to an AIC for the linear model of –329.11. Because the error model’s AIC is lower, this suggests an improvement in the model.

Let’s test for spatial autocorrelation in the residuals of this model:

#This code just allows me to manually extract the residual values
#Normally, R wants an LM object for this test - 
#Running a spatial error model does not output an lm object, which is why we need to do this
#Create a blank vector the length of the variable
seResiduals <- rep(0, length(model$poverty_frac))
#create an index based on the order of the residuals in the lm_error object
resIndex <- lm_error$residuals %>% names() %>% as.integer();
#use the index to fill in the seResiduals object with corresponding residual values
seResiduals[resIndex] <- lm_error$residuals

#perform the Moran Test
nyc_list %>%
  moran.test(seResiduals, ., zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  seResiduals  
## weights: .    
## 
## Moran I statistic standard deviate = 0.25622, p-value = 0.3989
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.007860260      -0.005780347       0.002834167

The insignificant p-value (p > 0.05) suggests that we have sufficiently controlled for autocorrelation in the residuals.

Let’s plot the new residuals to verify this:

nyc1 <- nyc1 %>% 
    mutate(errorresid = resid(lm_error))

ggplot()+
  geom_sf(data = nyc1, aes(fill=errorresid))+
  scale_fill_steps(
    name = "Spatial Error Residuals",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 5,
    show.limits = T)+
  theme_void()

Our residuals now look much more random than the previous OLS residuals - this suggests that we have controlled for much of the spatial autocorrelation in the residuals.

Resources

Brazil, N (2023). Lab 8: Spatial Regression. Retrieved from: https://crd230.github.io/lab8.html