Last week, we started talking about spatial autocorrelation and how to measure it. This week, we’ll talk more specifically about what to do if you want to run a regression model that has spatial dependence. First, we’ll start with the simplest model: a spatial error model. We use this model when we believe that our error term is spatially correlated.
setwd("~/Binghamton/DIDA 370/nyc_neighborhoods")
Warning: The working directory was changed to C:/Users/melha/OneDrive/Documents/Binghamton/DIDA 370/nyc_neighborhoods inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
nyc <- st_read("NYC_Nhood ACS2008_12.shp")
Reading layer `NYC_Nhood ACS2008_12' from data source
`C:\Users\melha\OneDrive\Documents\Binghamton\DIDA 370\nyc_neighborhoods\NYC_Nhood ACS2008_12.shp'
using driver `ESRI Shapefile'
Simple feature collection with 195 features and 98 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
Geodetic CRS: WGS 84
Let’s start by putting together a simple regression model
#create our variables for the model:
model <- nyc %>%
mutate(unemp = popunemplo/poptot,
educ = highschool/poptot,
fem = female/poptot,
nonwhite = (poptot-european)/poptot) %>%
select(c(gini, unemp, educ, fem, nonwhite, medianinco,
HHsize, medianage, cartodb_id, geometry)) %>%
filter(!gini %in% "NA") %>%
mutate(gini = as.numeric(gini))
#Run a basic model
summary(lm1 <- lm(gini ~ unemp + educ + fem + nonwhite, data = model))
Call:
lm(formula = gini ~ unemp + educ + fem + nonwhite, data = model)
Residuals:
Min 1Q Median 3Q Max
-0.111503 -0.022791 -0.001208 0.026237 0.090886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.42158 0.06444 6.542 7.12e-10 ***
unemp 0.15442 0.25352 0.609 0.5433
educ -0.49805 0.05697 -8.743 2.34e-15 ***
fem 0.21109 0.12333 1.712 0.0888 .
nonwhite -0.01861 0.01476 -1.261 0.2092
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03829 on 167 degrees of freedom
Multiple R-squared: 0.348, Adjusted R-squared: 0.3324
F-statistic: 22.29 on 4 and 167 DF, p-value: 9.253e-15
nyc_list <- nyc_list %>% nb2listw(zero.policy = TRUE)
#run the Moran's I test for regression residuals
lm.morantest(lm1, nyc_list)
Global Moran I for regression residuals
data:
model: lm(formula = gini ~ unemp + educ + fem + nonwhite, data = model)
weights: nyc_list
Moran I statistic standard deviate = 9.8844, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran I Expectation Variance
0.521431192 -0.017403985 0.002971725
Clearly there is some spatial dependence in the residuals! Let’s run a spatial error model to control for it.
lm_error <- errorsarlm(gini ~ unemp + educ + fem + nonwhite,
data = model,
listw = nyc_list,
zero.policy = TRUE,
na.action = na.omit)
summary(lm_error)
Call:errorsarlm(formula = gini ~ unemp + educ + fem + nonwhite, data = model,
listw = nyc_list, na.action = na.omit, zero.policy = TRUE)
Residuals:
Min 1Q Median 3Q Max
-0.071808 -0.018530 0.002033 0.016991 0.067793
Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.476432 0.054592 8.7271 < 2.2e-16
unemp -0.330687 0.175027 -1.8893 0.05885
educ -0.265466 0.067783 -3.9164 8.988e-05
fem 0.039918 0.099192 0.4024 0.68737
nonwhite 0.024514 0.013233 1.8526 0.06394
Lambda: 0.74625, LR test value: 98.308, p-value: < 2.22e-16
Asymptotic standard error: 0.047107
z-value: 15.842, p-value: < 2.22e-16
Wald statistic: 250.96, p-value: < 2.22e-16
Log likelihood: 368.8172 for error model
ML residual variance (sigma squared): 0.00065865, (sigma: 0.025664)
Number of observations: 172
Number of parameters estimated: 7
AIC: -723.63, (AIC for lm: -627.33)
From this output, we can see that the lamda value is 0.75, which is statistically significant. This suggests that there is significant spatial autocorrelation in the residuals.
Here, we can see an AIC value of -723.63, compared to an AIC for the linear model of -627.33. Because the error model’s AIC is lower, this suggests an improvement in the model.
Let’s test for spatial autocorrelation in the residuals of this model:
#This code just allows me to manually extract the residual values
#Normally, R wants an LM object for this test -
#Running a spatial error model does not output an lm object, which is why we need to do this
#Create a blank vector the length of the variable
seResiduals <- rep(0, length(model$gini))
#create an index based on the order of the residuals in the lm_error object
resIndex <- lm_error$residuals %>% names() %>% as.integer();
#use the index to fill in the seResiduals object with corresponding residual values
seResiduals[resIndex] <- lm_error$residuals
#perform the Moran Test
nyc_list %>%
moran.test(seResiduals, ., zero.policy = TRUE)
Moran I test under randomisation
data: seResiduals
weights: .
Moran I statistic standard deviate = -0.26757, p-value = 0.6055
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
-0.020662788 -0.005847953 0.003065648
Now, we can see that our Moran test is insignificant - this tells us that we effectively controlled for the spatial variation in the error term.
What if we wanted to perform a spatial lag model instead?
lm_lag <- lagsarlm(gini ~ unemp + educ + fem + nonwhite,
data = model,
listw = nyc_list,
zero.policy = TRUE,
na.action = na.omit)
summary(lm_lag)
Call:lagsarlm(formula = gini ~ unemp + educ + fem + nonwhite, data = model,
listw = nyc_list, na.action = na.omit, zero.policy = TRUE)
Residuals:
Min 1Q Median 3Q Max
-0.0663453 -0.0180466 0.0015673 0.0183086 0.0610481
Type: lag
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.1444097 0.0506729 2.8498 0.004374
unemp -0.1887369 0.1714991 -1.1005 0.271109
educ -0.2003493 0.0441773 -4.5351 5.757e-06
fem 0.0641006 0.0834367 0.7683 0.442336
nonwhite 0.0078877 0.0099912 0.7895 0.429841
Rho: 0.69011, LR test value: 101.67, p-value: < 2.22e-16
Asymptotic standard error: 0.051164
z-value: 13.488, p-value: < 2.22e-16
Wald statistic: 181.94, p-value: < 2.22e-16
Log likelihood: 370.4993 for lag model
ML residual variance (sigma squared): 0.00067072, (sigma: 0.025898)
Number of observations: 172
Number of parameters estimated: 7
AIC: -727, (AIC for lm: -627.33)
LM test for residual autocorrelation
test value: 0.0026113, p-value: 0.95924
Check for spatial autocorrelation in these residuals:
seResiduals <- rep(0, length(model$gini))
resIndex <- lm_lag$residuals %>% names() %>% as.integer();
seResiduals[resIndex] <- lm_lag$residuals
#perform the Moran Test
nyc_list %>%
moran.test(seResiduals, ., zero.policy = TRUE)
Moran I test under randomisation
data: seResiduals
weights: .
Moran I statistic standard deviate = 0.079751, p-value = 0.4682
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
-0.001427726 -0.005847953 0.003071956
Again, this model also controls for autocorrelation in the residuals, so we’re in good shape so far.
Based on AICs alone, this model performs a bit better than the error model (-727 < -723). But this difference is so small it seems almost arbitrary - let’s test more formally whether one or the other is a good fit with the Lagrange Multiplier test.
LM <- lm.LMtests(lm1, nyc_list, test = "all")
Please update scripts to use lm.RStests in place of lm.LMtests
LM
Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
dependence
data:
model: lm(formula = gini ~ unemp + educ + fem + nonwhite, data = model)
test weights: listw
RSerr = 86.624, df = 1, p-value < 2.2e-16
Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
dependence
data:
model: lm(formula = gini ~ unemp + educ + fem + nonwhite, data = model)
test weights: listw
RSlag = 102.4, df = 1, p-value < 2.2e-16
Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
dependence
data:
model: lm(formula = gini ~ unemp + educ + fem + nonwhite, data = model)
test weights: listw
adjRSerr = 0.049827, df = 1, p-value = 0.8234
Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
dependence
data:
model: lm(formula = gini ~ unemp + educ + fem + nonwhite, data = model)
test weights: listw
adjRSlag = 15.824, df = 1, p-value = 6.953e-05
Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
dependence
data:
model: lm(formula = gini ~ unemp + educ + fem + nonwhite, data = model)
test weights: listw
SARMA = 102.45, df = 2, p-value < 2.2e-16
We’re most interested in the LMerr and LMlag results - unsurprisingly, both turn out to be significant. So this test can’t tell us which one to pick - let’s look at a more robust test. We only need to look at these tests if our initial tests are not definitive. It turns out, RLMerr and RLMlag are the robust test results in this output. As you can see, RLMlag is significant, while RLMerr is not, suggesting that we should stick with the lag model.
#Let's pick the optimum bandwidth for our regression
bwVal <- GWmodel::bw.gwr(gini ~ unemp + educ + fem + nonwhite,
data = model %>% sf::as_Spatial(),
approach = 'AICc', kernel = 'bisquare',
adaptive = TRUE)
Adaptive bandwidth (number of nearest neighbours): 113 AICc value: -684.8114
Adaptive bandwidth (number of nearest neighbours): 78 AICc value: -692.4784
Adaptive bandwidth (number of nearest neighbours): 54 AICc value: -695.3735
Adaptive bandwidth (number of nearest neighbours): 42 AICc value: -691.9389
Adaptive bandwidth (number of nearest neighbours): 64 AICc value: -693.9018
Adaptive bandwidth (number of nearest neighbours): 50 AICc value: -694.8839
Adaptive bandwidth (number of nearest neighbours): 58 AICc value: -694.2964
Adaptive bandwidth (number of nearest neighbours): 52 AICc value: -695.2733
Adaptive bandwidth (number of nearest neighbours): 55 AICc value: -695.3017
Adaptive bandwidth (number of nearest neighbours): 53 AICc value: -695.1687
Adaptive bandwidth (number of nearest neighbours): 54 AICc value: -695.3735
bwVal
[1] 54
#Now we can run our model based on this!
lm.gwr <- gwr.basic(gini ~ unemp + educ + fem + nonwhite,
data = model %>% sf::as_Spatial(),
bw = bwVal, kernel = "bisquare", adaptive = TRUE)
#Let's check the output of the model!
print(lm.gwr)
***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2024-03-19 15:27:46.940592
Call:
gwr.basic(formula = gini ~ unemp + educ + fem + nonwhite, data = model %>%
sf::as_Spatial(), bw = bwVal, kernel = "bisquare", adaptive = TRUE)
Dependent (y) variable: gini
Independent variables: unemp educ fem nonwhite
Number of data points: 172
***********************************************************************
* Results of Global Regression *
***********************************************************************
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.111503 -0.022791 -0.001208 0.026237 0.090886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.42158 0.06444 6.542 7.12e-10 ***
unemp 0.15442 0.25352 0.609 0.5433
educ -0.49805 0.05697 -8.743 2.34e-15 ***
fem 0.21109 0.12333 1.712 0.0888 .
nonwhite -0.01861 0.01476 -1.261 0.2092
---Significance stars
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03829 on 167 degrees of freedom
Multiple R-squared: 0.348
Adjusted R-squared: 0.3324
F-statistic: 22.29 on 4 and 167 DF, p-value: 9.253e-15
***Extra Diagnostic information
Residual sum of squares: 0.24478
Sigma(hat): 0.03794578
AIC: -627.3262
AICc: -626.8171
BIC: -749.5563
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: bisquare
Adaptive bandwidth: 54 (number of nearest neighbours)
Regression points: the same locations as observations are used.
Distance metric: Euclidean distance metric is used.
****************Summary of GWR coefficient estimates:******************
Min. 1st Qu. Median 3rd Qu. Max.
Intercept 0.167030 0.356849 0.443673 0.676007 0.9797
unemp -1.471005 -0.648716 -0.095804 0.116276 1.5008
educ -0.918110 -0.550286 -0.337617 -0.129685 0.0748
fem -0.983254 -0.178452 0.103148 0.255668 0.6599
nonwhite -0.128888 -0.010758 0.042518 0.062026 0.1272
************************Diagnostic information*************************
Number of data points: 172
Effective number of parameters (2trace(S) - trace(S'S)): 41.15948
Effective degrees of freedom (n-2trace(S) + trace(S'S)): 130.8405
AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -695.3735
AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -746.2521
BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): -784.3095
Residual sum of squares: 0.1089552
R-square value: 0.7097961
Adjusted R-square value: 0.6178013
***********************************************************************
Program stops at: 2024-03-19 15:27:47.039451
Woah, there’s a lot going on here. What might be more helpful is if we just map it - let’s look at how the relationship between inequality and education varies across space:
#Add the coefficient for education to the model data
model$gwr_exp_coeff <- lm.gwr$SDF$educ
#There's a built in function to plot this, but we need to transform our sf object into an sp object instead
model1 <- as(model, Class = "Spatial")
#plot it!
p1 <- spplot(model1, "gwr_exp_coeff")
p1
Ok, here we can see where the relationship varies across space. Let’s look at another variable - are they the same?
model$gwr_exp_coeff <- lm.gwr$SDF$nonwhite
model1 <- as(model, Class = "Spatial")
#plot it!
p2 <- spplot(model1, "gwr_exp_coeff")
p2
#Let's compare them side-by-side
library(cowplot)
Attaching package: ‘cowplot’
The following object is masked from ‘package:ggthemes’:
theme_map
The following object is masked from ‘package:lubridate’:
stamp
plot_grid(p1, p2,
labels = c('A) Education', 'B) Percent Nonwhite'),
hjust = -0.3, vjust = 4, label_size = 10)
What do you think - what patterns do you notice here?
Resources:
Burchfield, E. (n. d.) Spatial Regression. Retrieved from: https://www.emilyburchfield.org/courses/gsa/spatial_regression_lab.
Murugesan, M. (2017). NYC Neighborhoods. [Data Set]. Retrieved from: https://geodacenter.github.io/data-and-lab/NYC-Nhood-ACS-2008-12/.
Sun, S. (2022) 4.2 Spatial Error and Lag Models. Retrieved from: http://www.geo.hunter.cuny.edu/~ssun/R-Spatial/spregression.html#spatial-error-and-lag-models.