This session covers the following topics on linear regression modelling with spatial data:
In particular, this session concerns the consequences of incorporating or not incorporating spatially autocorrelation effects when regression modelling. Ignoring clear evidence of residual spatial autocorrelation violates a core assumption of the classic OLS regression analysis: that of independent and identically distributed errors that following a Normal distribution with zero mean and constant variance.
All regressions in this session assume fixed relationships across the study area, yielding (constant) fixed coefficient estimates. For (non-constant) varying coefficient models, see Session 7, where coefficients can be mapped and where these spatial heterogeneity effects are considered through a Geographically Weighted Regression (GWR) framework.
The exercises in this session use the Liudaogou watershed soils data again, specifically both the point and the polygon format layers that were created in Session 2. To recap, the data set includes: response variables; soil total nitrogen percentage (TNPC), soil total phosphorus percentage (TPPC), and predictor variables; soil organic carbon (SOCgkg), nitrate nitrogen (NO3Ngkg), ammonium (NH4Ngkg), percentage clay (ClayPC), silt (SiltPC), sand (SandPC) content, vegetation coverage (CoveragePC), Slope, Aspect, Altitude_m, SoilType, LandUse and Position.
Again, a number of different packages will be used, most of which should already be installed (if not install, them):
library(tmap)
library(sf)
library(spdep)
library(dplyr)
library(car)
library(mosaic)
library(gstat)
library(nlme)
Once again, remember to:
Having done that you should clear your workspace, create and load the 4 soils data files from Sessions 1 and 2. Note you will probably need to copy these over to the directory you are working from for this session:
# clear the workspace
rm(list = ls())
# load the data
load("soils.RData")
load("df_sf.RData")
load("df_zone.RData")
load("boundary.RData")
Again, it is useful to be able to work in both sp and sf formats:
# convert to sp format
df_sp = as(df_sf, "Spatial")
zone_sp = as(df_zone, "Spatial")
It is also useful to add the coordinates (from our R object called data) to our spatial data sets, to use as extra predictor variables, if required:
df_sf$Longitude <- data$Longitude
df_sp@data$Longitude <- data$Longitude
df_sf$Latitude <- data$Latitude
df_sp@data$Latitude <- data$Latitude
Before moving to the regression analyses, it is worth noting that a preliminary EDA (not conducted here) should include an assessment of: (a) each variable’s distribution characteristics and transform if necessary (as in part, addressed below), (b) the correlation matrix for the continuous variables and (c) conditional boxplots for the response’s relationship to any categorical or ordinal variable. All 3 forms of EDA should also include the detection of outliers that may compromise subsequent regression fits. Robust (outlier-resistant) regressions are not considered here.
Also, as the data are spatial, then inspecting the maps of all variables is strongly recommended.
In session 3, when illustrating the MAUP - TNPC, SOCgkg, NO3Ngkg and NH4Ngkg were all transformed using natural logs and ClayPC was square root transformed. In this session, we will similarly transform, except for ClayPC. Given Clay / Silt / Sand are compositional in form, ClayPC is simply dropped from all regression analyses to avoid instabilities.
Thus, conducting transforms on the soils data, similar to that done in Wang et al (2009) and Comber et al (2018):
df_sf$TNPC <- log(df_sf$TNPC+0.0001)
df_sf$SOCgkg <- log(df_sf$SOCgkg)
df_sf$NO3Ngkg <- log(abs(df_sf$NO3Ngkg))
df_sf$NH4Ngkg <- log(df_sf$NH4Ngkg)
df_sp@data$TNPC <- log(df_sp@data$TNPC+0.0001)
df_sp@data$SOCgkg <- log(df_sp@data$SOCgkg)
df_sp@data$NO3Ngkg <- log(abs(df_sp@data$NO3Ngkg))
df_sp@data$NH4Ngkg <- log(df_sp@data$NH4Ngkg)
In this first sub-section, we will fit an OLS estimated linear regression and interrogate its fit in terms of: (i) R-squared, (ii) predictor variable significance, and (iii) evidence of collinearity amongst its predictors through variance inflation factors (VIFs).
For illustration, we will fit the following OLS regression, where soil Total Nitrogen (TNPC) is a function of 5 predictor variables (soil organic carbon SOCgkg, nitrate nitrogen NO3Ngkg, ammonium NH4Ngkg, silt SiltPC and sand SandPC):
mod.OLS.1 = lm(TNPC ~SOCgkg+SandPC+SiltPC+NO3Ngkg+NH4Ngkg, data = df_sf)
The OLS regression can be inspected in the usual way:
summary(mod.OLS.1)
##
## Call:
## lm(formula = TNPC ~ SOCgkg + SandPC + SiltPC + NO3Ngkg + NH4Ngkg,
## data = df_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2244 -0.1444 0.0305 0.1931 3.1065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.014103 0.759322 -3.969 7.97e-05 ***
## SOCgkg 0.691289 0.039673 17.425 < 2e-16 ***
## SandPC 0.007762 0.007096 1.094 0.27442
## SiltPC 0.023308 0.008186 2.847 0.00454 **
## NO3Ngkg 0.124629 0.029328 4.249 2.44e-05 ***
## NH4Ngkg -0.141503 0.073634 -1.922 0.05506 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5433 on 683 degrees of freedom
## Multiple R-squared: 0.6098, Adjusted R-squared: 0.607
## F-statistic: 213.5 on 5 and 683 DF, p-value: < 2.2e-16
Thus the R-squared = 0.61, which indicates a reasonable fit. All coefficients except that for SandPC are significantly different from zero at least at the 90% level. Observe that the significance tests assume independent and identically distributed errors - thus are potentially compromised from the outset.
To assess for predictor variable collinearity (i.e. when predictor variables are highly correlated and thus each relay very similar information to the response), we can use VIFs:
vif(mod.OLS.1)
## SOCgkg SandPC SiltPC NO3Ngkg NH4Ngkg
## 1.786283 40.431248 40.301342 1.542160 1.253378
Following guidelines given in Belsley et al. (1980) and O’Brien, (2007) predictors with a VIF > 10 should be removed and the regression re-fitted. This avoids inaccurate coefficient estimation and associated uncertainties due to such collinearities. An alternative to improve statistical inferences in the presence of collinearity can be found in a penalised regression (e.g. Zou and Hastie 2005), where all predictors can potentially be retained, but this approach was not considered here.
Thus re-fitting the OLS regression, where soil Total Nitrogen TNPC is now a function of 4 predictor variables (SOCgkg, NO3Ngkg, NH4Ngkg and SiltPC) - i.e. SandPC is removed (as it was one of two predictors with VIFs > 40):
mod.OLS.2 = lm(TNPC ~SOCgkg+SiltPC+NO3Ngkg+NH4Ngkg, data = df_sf)
And its summary:
summary(mod.OLS.2)
##
## Call:
## lm(formula = TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg, data = df_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2115 -0.1594 0.0381 0.1987 3.1023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.220137 0.222938 -9.959 < 2e-16 ***
## SOCgkg 0.689018 0.039624 17.389 < 2e-16 ***
## SiltPC 0.014503 0.001487 9.753 < 2e-16 ***
## NO3Ngkg 0.126049 0.029304 4.301 1.94e-05 ***
## NH4Ngkg -0.146518 0.073502 -1.993 0.0466 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5434 on 684 degrees of freedom
## Multiple R-squared: 0.6091, Adjusted R-squared: 0.6069
## F-statistic: 266.5 on 4 and 684 DF, p-value: < 2.2e-16
And the VIFs:
vif(mod.OLS.2)
## SOCgkg SiltPC NO3Ngkg NH4Ngkg
## 1.781388 1.329538 1.539140 1.248517
All VIFs are now < 2, which is good and the R-squared = 0.61, which is no different to that found before. All coefficients are now significantly different from zero at least at the 95% level. Again, observe that these significance tests assume independent and identically distributed errors - this has not changed.
Many studies may stop their statistical analyses at this juncture, but this overlooks possibly important spatial effects. So the first step to address this, is to examine the residuals from the OLS fit.
Therefore, in this sub-section, we will interrogate the residuals from the chosen OLS fit through: (i) mapping them, (ii) assessing their spatial autocorrelation with a Moran’s I analysis, and (iii) assessing their spatial autocorrelation with a semivariogram analysis.
The residuals from the chosen OLS can be summarised:
summary(mod.OLS.2$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.21153 -0.15940 0.03812 0.00000 0.19869 3.10231
var(mod.OLS.2$residuals) # the variance
## [1] 0.2935738
Again, we can use the following code from Session 3 for mapping the residuals:
quick.map <- function(spdf,var,legend.title,main.title, dot.size = 0.2) {
if (class(spdf) == "SpatialPointsDataFrame"){
p = tm_shape(spdf)+
tm_dots(var, title = legend.title, size = dot.size)+
tm_layout(title = main.title, legend.title.size =0.9)
}
if (class(spdf) == "SpatialPolygonsDataFrame"){
p = tm_shape(spdf)+
tm_fill(var, title = legend.title)+
tm_layout(title = main.title, legend.title.size =0.9)
}
return(p)
}
So adding the raw residuals to our spatial data sets:
df_sf$residuals <- mod.OLS.2$residuals
df_sp@data$residuals <- mod.OLS.2$residuals
And mapping them in Figure 1:
quick.map(df_sp,var="residuals","Raw","OLS residuals")
Residuals from OLS regression fit.
There exists some unusually high negative and some unusually high positive residuals, both of which suggest outlying observations which should be investigated further (but not here!). There does not appear to be any obvious spatial pattern to the residuals.
Therefore, lets us assess for residual spatial autocorrelation and assess the significance of it using a Moran’s I test based on the zonal data form as done in session 3. However this time, we must use the lm.morantest as looking at residuals from a linear regression. We cannot use the morantest directly on the residuals as this would result in a known bias. This is due to an inherent identification problem when separating first- from second-order effects when regression modelling with spatial data (e.g. Armstrong 1984).
# create the neighbour list
# note the need to convert the polygons to sp format for poly2nb
gnb = poly2nb(as(df_zone, "Spatial")) # see Session 3
glw = nb2listw(gnb) # see Session 3
moran.result <- lm.morantest(mod.OLS.2, listw=glw)
moran.result
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg, data =
## df_sf)
## weights: glw
##
## Moran I statistic standard deviate = 6.4688, p-value = 4.94e-11
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.1437592952 -0.0031677076 0.0005158947
The outputs indicate there is significant residual autocorrelation as the p-value <<<< 0.
To provide detail and clarity to this significance in autocorrelation, a residual semivariogram can be found. However, given the same identification problem above, this semivariogram will also be biased, but it does serve for exploratory purposes, just as it did in Session 4 when kriging with a trend model. The residual semivariogram (Figure 2) is found from:
Res.vgm = variogram(TNPC ~SOCgkg+SiltPC+NO3Ngkg+NH4Ngkg, df_sf, cutoff=1200)
Res.vgm.fit = fit.variogram(Res.vgm, model = vgm(0.3, "Exp", 100, 0.005),
fit.sills = c(F,T), fit.method=7)
Res.vgm.fit
## model psill range
## 1 Nug 0.0050000 0.00000
## 2 Exp 0.2952393 40.67338
plot(Res.vgm,Res.vgm.fit,main='Residual semivariograms')
Empirical residual semivariogram with Exponential model fit
The (biased) residual semivariogram, fitted via a weighted least squares (WLS) method, yields: an effective correlation range of 3 x 40.7 = 122.1 m, which is relatively weak considering the study has a maximum distance of 3742 m; a nugget variance of 0.005 (user-specified and not estimated, in this case); and a structural variance of 0.30. Furthermore, the nugget effect is relatively strong at 0.005/(0.005 + 0.30) = 0.016.
Overall, experience suggests that spatial autocorrelation in the residuals is not particularly strong even though the Moran’s I test is significant.
The maximum distance between data points / zones in the study area can be found from:
max(dist(st_coordinates(df_sf)))
## [1] 3742.06
Given the results of the residual analyses, above, we will now fit a REML estimated linear regression with spatial effects and compare its outputs to a regression without spatial effects in terms of: (i) model fit, and (ii) predictor variable significance.
To fit REML estimated linear regressions, we can use the lme function in the nlme (linear mixed model) package as follows:
First, fit a regression without spatial effects (which should be the same as the OLS regression, above):
df_sf$dummy <- rep(1, nrow(df_sf)) # as we don't have grouping data (e.g a 'treatment')
mod.REML.2.NS <- lme(fixed = TNPC ~SOCgkg+SiltPC+NO3Ngkg+NH4Ngkg,
data = df_sf, random = ~ 1 | dummy, method = "REML")
The REML regression outputs can be inspected and compared with OLS estimation (look at the estimated coefficients etc.):
summary(mod.REML.2.NS)
## Linear mixed-effects model fit by REML
## Data: df_sf
## AIC BIC logLik
## 1154.838 1186.534 -570.4192
##
## Random effects:
## Formula: ~1 | dummy
## (Intercept) Residual
## StdDev: 0.0552057 0.5434065
##
## Fixed effects: TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg
## Value Std.Error DF t-value p-value
## (Intercept) -2.2201369 0.22967119 684 -9.666589 0.0000
## SOCgkg 0.6890176 0.03962435 684 17.388743 0.0000
## SiltPC 0.0145032 0.00148703 684 9.753140 0.0000
## NO3Ngkg 0.1260489 0.02930385 684 4.301446 0.0000
## NH4Ngkg -0.1465184 0.07350177 684 -1.993400 0.0466
## Correlation:
## (Intr) SOCgkg SiltPC NO3Ngk
## SOCgkg 0.193
## SiltPC -0.093 -0.408
## NO3Ngkg 0.120 -0.434 0.009
## NH4Ngkg -0.938 -0.141 -0.076 -0.263
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -13.27097205 -0.29332592 0.07015841 0.36564577 5.70900803
##
## Number of Observations: 689
## Number of Groups: 1
summary(mod.OLS.2)
##
## Call:
## lm(formula = TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg, data = df_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2115 -0.1594 0.0381 0.1987 3.1023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.220137 0.222938 -9.959 < 2e-16 ***
## SOCgkg 0.689018 0.039624 17.389 < 2e-16 ***
## SiltPC 0.014503 0.001487 9.753 < 2e-16 ***
## NO3Ngkg 0.126049 0.029304 4.301 1.94e-05 ***
## NH4Ngkg -0.146518 0.073502 -1.993 0.0466 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5434 on 684 degrees of freedom
## Multiple R-squared: 0.6091, Adjusted R-squared: 0.6069
## F-statistic: 266.5 on 4 and 684 DF, p-value: < 2.2e-16
Good news! Estimation with the lme function matches that of the lm function - both essentially OLS fits.
Second, fit a regression with spatial effects - this will take a few minutes (note using the update function):
mod.REML.2.S <- update(mod.REML.2.NS, correlation = corExp(c(122.1,0.016),
form = ~ Longitude+Latitude, nugget = T), method = "REML")
Observe that the semivariogram parameters corExp(c(122.1,0.016) are again only initialising parameters for the REML fit. These are taken from our exploratory (but biased) residual semivariogram (modelled with WLS), above. Note also, that we now specify the effective correlation range for the exponential variogram model with the nlme package.
The REML regression outputs can be inspected using:
summary(mod.REML.2.S)
## Linear mixed-effects model fit by REML
## Data: df_sf
## AIC BIC logLik
## 1112.334 1153.086 -547.1672
##
## Random effects:
## Formula: ~1 | dummy
## (Intercept) Residual
## StdDev: 0.05589367 0.5501557
##
## Correlation Structure: Exponential spatial correlation
## Formula: ~Longitude + Latitude | dummy
## Parameter estimate(s):
## range nugget
## 264.5930675 0.7493071
## Fixed effects: TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg
## Value Std.Error DF t-value p-value
## (Intercept) -1.8828924 0.26177976 684 -7.192658 0.0000
## SOCgkg 0.6546110 0.04106173 684 15.942118 0.0000
## SiltPC 0.0108872 0.00156948 684 6.936837 0.0000
## NO3Ngkg 0.1161982 0.03064289 684 3.792012 0.0002
## NH4Ngkg -0.2007521 0.08109098 684 -2.475641 0.0135
## Correlation:
## (Intr) SOCgkg SiltPC NO3Ngk
## SOCgkg 0.047
## SiltPC -0.114 -0.370
## NO3Ngkg 0.183 -0.457 0.034
## NH4Ngkg -0.923 -0.004 -0.053 -0.324
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -13.1502849 -0.2595032 0.1001055 0.3897324 5.7792639
##
## Number of Observations: 689
## Number of Groups: 1
If you inspect the summaries for mod.REML.2.NS (non-spatial) and mod.REML.2.S (spatial), there is little gained in incorporating spatial effects with respect to the estimated coefficients and the associated relationship inferences. This is not surprising given the (moderate) outcomes of the residual variography, above.
Note also the semivariogram parameter estimates via the REML fit - i.e. a effective correlation range of 264.6 m and a nugget effect of 0.75 - are both quite different to the biased WLS variogram model fit.
It is also useful to compare models by their fit statistics (AIC - Akaike Information Criterion and BIC - Bayesian Information Criterion) with values of: (1155 and 1186) and (1112 and 1153) for (mod.REML.2.NS and mod.REML.2.S), respectively. Thus in terms of model fit, the spatial model can be preferred as both AIC and BIC values are ‘significantly’ reduced.
Observe that the lm function does not report the AIC value or that the lme function does not report the R-squared value. An R-squared can be calculated as follows, for the spatial model:
preds <- mod.REML.2.S$fitted[,1]
actual <- df_sf$TNPC
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
rsq
## [1] 0.5996779
The spatial model’s R-squared = 0.60. As usual, this is little different to the non-spatial model’s R-squared (of 0.61).
In summary, accounting for spatial effects in the soils process is important in terms of model fit (e.g. AIC), but not so important in terms of inference with respect to the significance of the soils data relationships. In the next sub-sections, we shall see situations when this is not the case.
Task: create an R function for calculating R-squared and confirm that all R-squared values are correctly found for the following regressions: mod.OLS.2, mod.REML.2.NS and mod.REML.2.S. For an answer, see the last section.
In this sub-section, we will conduct a brief model specification exercise in terms of the effects of removing predictor variables on the OLS and REML model fits. Here, spatial autocorrelation tends to become more prominent as predictor information is reduced.
Thus re-fitting all regressions, where soil Total Nitrogen (TNPC) is now a function of NH4Ngkg only:
The OLS fit:
mod.OLS.3.MS = lm(TNPC ~NH4Ngkg, data = df_sf)
Conduct Moran’s I test:
lm.morantest(mod.OLS.3.MS, listw=glw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = TNPC ~ NH4Ngkg, data = df_sf)
## weights: glw
##
## Moran I statistic standard deviate = 13.652, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.3089947961 -0.0020740256 0.0005191651
Find the residual semivariograms (Figure 3):
Res.vgm.MS = variogram(TNPC ~NH4Ngkg, df_sf, cutoff=1200)
Res.vgm.fit.MS = fit.variogram(Res.vgm.MS, model = vgm(0.4, "Exp", 200, 0.3),
fit.sills = c(T,T), fit.method=7)
Res.vgm.fit.MS
## model psill range
## 1 Nug 0.4116035 0.0000
## 2 Exp 0.2794458 277.3878
plot(Res.vgm.MS,Res.vgm.fit.MS,main='Residual semivariograms (w.r.t. model specification)')
Empirical residual semivariogram with Exponential model fit (w.r.t. model specification)
Fit the (non-spatial) REML:
mod.REML.3.NS.MS <- lme(fixed = TNPC ~NH4Ngkg,
data = df_sf, random = ~ 1 | dummy, method = "REML")
Fit the (spatial) REML:
mod.REML.3.S.MS <- update(mod.REML.3.NS.MS, correlation = corExp(c(831,0.28),
form = ~ Longitude+Latitude, nugget = T), method = "REML")
Investigate the summary of the non-spatial model:
summary(mod.REML.3.NS.MS)
## Linear mixed-effects model fit by REML
## Data: df_sf
## AIC BIC logLik
## 1721.831 1739.961 -856.9157
##
## Random effects:
## Formula: ~1 | dummy
## (Intercept) Residual
## StdDev: 0.08490436 0.8357394
##
## Fixed effects: TNPC ~ NH4Ngkg
## Value Std.Error DF t-value p-value
## (Intercept) -3.829833 0.3370614 687 -11.362418 0
## NH4Ngkg 0.735511 0.1011688 687 7.270141 0
## Correlation:
## (Intr)
## NH4Ngkg -0.963
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -9.6039171 -0.4176306 0.1994362 0.6070194 4.4114531
##
## Number of Observations: 689
## Number of Groups: 1
Investigate the summary of the spatial model:
summary(mod.REML.3.S.MS)
## Linear mixed-effects model fit by REML
## Data: df_sf
## AIC BIC logLik
## 1540.169 1567.363 -764.0847
##
## Random effects:
## Formula: ~1 | dummy
## (Intercept) Residual
## StdDev: 0.08635397 0.8499901
##
## Correlation Structure: Exponential spatial correlation
## Formula: ~Longitude + Latitude | dummy
## Parameter estimate(s):
## range nugget
## 502.1114005 0.5705617
## Fixed effects: TNPC ~ NH4Ngkg
## Value Std.Error DF t-value p-value
## (Intercept) -2.3729103 0.3972594 687 -5.973201 0.0000
## NH4Ngkg 0.2689516 0.1061042 687 2.534787 0.0115
## Correlation:
## (Intr)
## NH4Ngkg -0.859
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -9.1823809 -0.3852807 0.2640150 0.6551056 4.3454131
##
## Number of Observations: 689
## Number of Groups: 1
Inspecting the coefficient summaries for mod.REML.3.NS.MS (non-spatial) and mod.REML.3.S.MS (spatial), there are clear differences in the coefficient estimates and t-values. Intercept and NH4Ngkg coefficient significance appears much stronger if spatial autocorrelation is not considered.
It is again useful to compare models by their AIC / BIC with values of: 1722 / 1740 and 1540 / 1567 for mod.REML.3.NS.MS and mod.REML.3.S.MS, respectively. Thus in terms of model fit, the spatial model can be preferred as both AIC and BIC values are ‘significantly’ reduced.
Observe also the difference in the structure for the residual variograms in Figure 2 and Figure 3, where the latter (the one for this sub-section) has a much longer correlation range.
Thus in summary, the objective here was to demonstrate that as the number of predictor variables decrease, it is common for the value of incorporating spatial effects to increase (and vice-versa until the model is considered fully-specified).
In this sub-section, the aim is to demonstrate how inherently ‘spatial’ predictor variables (e.g. in Longitude and Latitude) can act as a surrogate spatial effect in an OLS regression, while their significance is reduced when spatial effects are introduced through a REML.
Therefore, re-fitting all regressions, where soil Total Nitrogen (TNPC) is now a function of NH4Ngkg, Longitude and Latitude:
The OLS fit:
mod.OLS.4.SP = lm(TNPC ~NH4Ngkg + Longitude + Latitude, data = df_sf)
Conduct Moran’s I test:
lm.morantest(mod.OLS.4.SP, listw=glw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = TNPC ~ NH4Ngkg + Longitude + Latitude, data =
## df_sf)
## weights: glw
##
## Moran I statistic standard deviate = 11.982, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.2668138378 -0.0048327674 0.0005139644
Fit and plot the residual semivariograms (Figure 4):
Res.vgm.SP = variogram(TNPC ~NH4Ngkg + Longitude + Latitude, df_sf, cutoff=1200)
Res.vgm.fit.SP = fit.variogram(Res.vgm.SP, model = vgm(0.4, "Exp", 200, 0.3),
fit.sills = c(T,T), fit.method=7)
Res.vgm.fit.SP
## model psill range
## 1 Nug 0.4023107 0.0000
## 2 Exp 0.2584434 244.2777
plot(Res.vgm.SP,Res.vgm.fit.SP,main='Residual semivariograms (w.r.t. spatial predictors)')
Empirical residual semivariogram with Exponential model fit (w.r.t. spatial predictors)
Fit the (non-spatial) REML:
mod.REML.4.NS.SP <- lme(fixed = TNPC ~NH4Ngkg + Longitude + Latitude,
data = df_sf, random = ~ 1 | dummy, method = "REML")
Fit the (spatial) REML:
mod.REML.4.S.SP <- update(mod.REML.4.NS.SP, correlation = corExp(c(733,0.27),
form = ~ Longitude+Latitude, nugget = T), method = "REML")
Investigate the summary of the non-spatial model:
summary(mod.REML.4.NS.SP)
## Linear mixed-effects model fit by REML
## Data: df_sf
## AIC BIC logLik
## 1707.006 1734.183 -847.5031
##
## Random effects:
## Formula: ~1 | dummy
## (Intercept) Residual
## StdDev: 0.08166565 0.8038598
##
## Fixed effects: TNPC ~ NH4Ngkg + Longitude + Latitude
## Value Std.Error DF t-value p-value
## (Intercept) 421.1444 165.24805 685 2.548559 0.0110
## NH4Ngkg 0.5383 0.10520 685 5.116361 0.0000
## Longitude 0.0003 0.00005 685 7.307166 0.0000
## Latitude -0.0002 0.00004 685 -3.887538 0.0001
## Correlation:
## (Intr) NH4Ngk Longtd
## NH4Ngkg 0.186
## Longitude 0.073 -0.316
## Latitude -0.982 -0.121 -0.262
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -10.1958607 -0.3854290 0.1440258 0.5935284 4.4839039
##
## Number of Observations: 689
## Number of Groups: 1
Investigate the summary of the spatial model:
summary(mod.REML.4.S.SP)
## Linear mixed-effects model fit by REML
## Data: df_sf
## AIC BIC logLik
## 1573.781 1610.017 -778.8906
##
## Random effects:
## Formula: ~1 | dummy
## (Intercept) Residual
## StdDev: 0.08602963 0.8467971
##
## Correlation Structure: Exponential spatial correlation
## Formula: ~Longitude + Latitude | dummy
## Parameter estimate(s):
## range nugget
## 484.7152744 0.5738722
## Fixed effects: TNPC ~ NH4Ngkg + Longitude + Latitude
## Value Std.Error DF t-value p-value
## (Intercept) 333.5508 683.4631 685 0.4880304 0.6257
## NH4Ngkg 0.2655 0.1068 685 2.4853867 0.0132
## Longitude 0.0002 0.0002 685 1.1057415 0.2692
## Latitude -0.0001 0.0002 685 -0.6838277 0.4943
## Correlation:
## (Intr) NH4Ngk Longtd
## NH4Ngkg 0.083
## Longitude 0.028 -0.073
## Latitude -0.983 -0.068 -0.211
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -9.3758292 -0.3674393 0.2072935 0.6134756 4.2909287
##
## Number of Observations: 689
## Number of Groups: 1
Inspecting the coefficient summaries for mod.REML.4.NS.SP (non-spatial) and mod.REML.4.S.SP (spatial), there are clear differences in the p-values and thus the significance of each predictor variable. Both Longitude and Latitude are significant predictors in the OLS regression but are both insignificant in the REML regression. Furthermore, the intercept is only significant in the OLS regression.
It is again useful to compare models by their fit statistics (AIC and BIC) with values of: (1707 and 1734) and (1574 and 1610) for (mod.REML.4.NS.SP and mod.REML.4.S.SP), respectively. Thus in terms of model fit, the spatial model can be preferred as both AIC and BIC values are ‘significantly’ reduced.
This analysis clearly demonstrates how ‘spatial-type’ predictor variables (e.g. in Longitude and Latitude) can act as a surrogate spatial effect in an OLS regression, while their significance is reduced when spatial effects are introduced through a REML.
This observation also demonstrates the importance of accounting for spatial effects for processes that are inherently spatial, else incorrect scientific inferences can result.
Task: does Altitude_m have a similar effect on the interpretation of the OLS and REML fits?
In this session, we have taken you through a series of OLS and REML regressions for modelling with or with spatial autocorrelation effects. Clearly these effects can sometimes, but not always, be important - commonly depending on the nature and amount of the available predictor variables.
You may also have observed the similarities between REML regression and kriging with an external drift (KED) from Session 4. This is not surprising as they are one and the same model - the former estimates the model parameters / coefficients, the latter uses this model to predict at an un-sampled locations (see Cressie 1992).
Try repeating the above exercises with TPPC instead of TNPC as the response variable. Suggest trying this with TPPC square-root transformed.
create an R function for calculating R-squared:
RSQR <- function(x,y){
rss <- sum((x - y) ^ 2) ## residual sum of squares
tss <- sum((y - mean(y)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
rsq}
RSQR(mod.REML.2.S$fitted[,1],df_sf$TNPC)
## [1] 0.5996779
Armstrong, M., 1984. Problems with universal kriging. Mathematical Geology, 16(1), pp.101-108.
David, B., Kuh, E. and Welsch, R., 1980. Regression diagnostics: identifying influential data and sources of collinearity. John Wiley & Sons.
Comber, A., Wang, Y., Lu, Y., Zhang, X. and Harris, P., 2018. Hyper-local geographically weighted regression: extending GWR through local model selection and local bandwidth optimization. Journal of Spatial Information Science, 2018(17), pp.63-84.
Cressie, N., 1992. Statistics for spatial data. Wiley, New Jersey.
O’Brien, R.M., 2007. A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41(5), pp.673-690.
Wang, Y., Zhang, X. and Huang, C., 2009. Spatial variability of soil total nitrogen and soil total phosphorus under different land uses in a small watershed on the Loess Plateau, China. Geoderma, 150(1-2), pp.141-149.
Zou, H. and Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: B (statistical methodology), 67(2), pp.301-320.