Overview

This session covers the following topics on linear regression modelling with spatial data:

In particular, this session concerns the consequences of incorporating or not incorporating spatially autocorrelation effects when regression modelling. Ignoring clear evidence of residual spatial autocorrelation violates a core assumption of the classic OLS regression analysis: that of independent and identically distributed errors that following a Normal distribution with zero mean and constant variance.

All regressions in this session assume fixed relationships across the study area, yielding (constant) fixed coefficient estimates. For (non-constant) varying coefficient models, see Session 7, where coefficients can be mapped and where these spatial heterogeneity effects are considered through a Geographically Weighted Regression (GWR) framework.

The exercises in this session use the Liudaogou watershed soils data again, specifically both the point and the polygon format layers that were created in Session 2. To recap, the data set includes: response variables; soil total nitrogen percentage (TNPC), soil total phosphorus percentage (TPPC), and predictor variables; soil organic carbon (SOCgkg), nitrate nitrogen (NO3Ngkg), ammonium (NH4Ngkg), percentage clay (ClayPC), silt (SiltPC), sand (SandPC) content, vegetation coverage (CoveragePC), Slope, Aspect, Altitude_m, SoilType, LandUse and Position.

Again, a number of different packages will be used, most of which should already be installed (if not install, them):

library(tmap)
library(sf)
library(spdep)
library(dplyr)
library(car)
library(mosaic)
library(gstat)
library(nlme)

Once again, remember to:

  1. Create a seperate folder for this Session
  2. Always write your code into an R script… always!
  3. Save the script to the folder
  4. Set the working directory to the folder location

Having done that you should clear your workspace, create and load the 4 soils data files from Sessions 1 and 2. Note you will probably need to copy these over to the directory you are working from for this session:

# clear the workspace
rm(list = ls())
# load the data 
load("soils.RData")
load("df_sf.RData")
load("df_zone.RData")
load("boundary.RData")

Again, it is useful to be able to work in both sp and sf formats:

# convert to sp format
df_sp = as(df_sf, "Spatial")
zone_sp = as(df_zone, "Spatial")

It is also useful to add the coordinates (from our R object called data) to our spatial data sets, to use as extra predictor variables, if required:

df_sf$Longitude <- data$Longitude
df_sp@data$Longitude <- data$Longitude

df_sf$Latitude <- data$Latitude
df_sp@data$Latitude <- data$Latitude

Preliminary EDA

Before moving to the regression analyses, it is worth noting that a preliminary EDA (not conducted here) should include an assessment of: (a) each variable’s distribution characteristics and transform if necessary (as in part, addressed below), (b) the correlation matrix for the continuous variables and (c) conditional boxplots for the response’s relationship to any categorical or ordinal variable. All 3 forms of EDA should also include the detection of outliers that may compromise subsequent regression fits. Robust (outlier-resistant) regressions are not considered here.

Also, as the data are spatial, then inspecting the maps of all variables is strongly recommended.

In session 3, when illustrating the MAUP - TNPC, SOCgkg, NO3Ngkg and NH4Ngkg were all transformed using natural logs and ClayPC was square root transformed. In this session, we will similarly transform, except for ClayPC. Given Clay / Silt / Sand are compositional in form, ClayPC is simply dropped from all regression analyses to avoid instabilities.

Thus, conducting transforms on the soils data, similar to that done in Wang et al (2009) and Comber et al (2018):

df_sf$TNPC <- log(df_sf$TNPC+0.0001)
df_sf$SOCgkg <- log(df_sf$SOCgkg)
df_sf$NO3Ngkg <- log(abs(df_sf$NO3Ngkg))
df_sf$NH4Ngkg <- log(df_sf$NH4Ngkg)

df_sp@data$TNPC <- log(df_sp@data$TNPC+0.0001)
df_sp@data$SOCgkg <- log(df_sp@data$SOCgkg)
df_sp@data$NO3Ngkg <- log(abs(df_sp@data$NO3Ngkg))
df_sp@data$NH4Ngkg <- log(df_sp@data$NH4Ngkg)

1. Standard OLS regression analysis

In this first sub-section, we will fit an OLS estimated linear regression and interrogate its fit in terms of: (i) R-squared, (ii) predictor variable significance, and (iii) evidence of collinearity amongst its predictors through variance inflation factors (VIFs).

For illustration, we will fit the following OLS regression, where soil Total Nitrogen (TNPC) is a function of 5 predictor variables (soil organic carbon SOCgkg, nitrate nitrogen NO3Ngkg, ammonium NH4Ngkg, silt SiltPC and sand SandPC):

mod.OLS.1 = lm(TNPC ~SOCgkg+SandPC+SiltPC+NO3Ngkg+NH4Ngkg, data = df_sf)

The OLS regression can be inspected in the usual way:

summary(mod.OLS.1)
## 
## Call:
## lm(formula = TNPC ~ SOCgkg + SandPC + SiltPC + NO3Ngkg + NH4Ngkg, 
##     data = df_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2244 -0.1444  0.0305  0.1931  3.1065 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.014103   0.759322  -3.969 7.97e-05 ***
## SOCgkg       0.691289   0.039673  17.425  < 2e-16 ***
## SandPC       0.007762   0.007096   1.094  0.27442    
## SiltPC       0.023308   0.008186   2.847  0.00454 ** 
## NO3Ngkg      0.124629   0.029328   4.249 2.44e-05 ***
## NH4Ngkg     -0.141503   0.073634  -1.922  0.05506 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5433 on 683 degrees of freedom
## Multiple R-squared:  0.6098, Adjusted R-squared:  0.607 
## F-statistic: 213.5 on 5 and 683 DF,  p-value: < 2.2e-16

Thus the R-squared = 0.61, which indicates a reasonable fit. All coefficients except that for SandPC are significantly different from zero at least at the 90% level. Observe that the significance tests assume independent and identically distributed errors - thus are potentially compromised from the outset.

To assess for predictor variable collinearity (i.e. when predictor variables are highly correlated and thus each relay very similar information to the response), we can use VIFs:

vif(mod.OLS.1)
##    SOCgkg    SandPC    SiltPC   NO3Ngkg   NH4Ngkg 
##  1.786283 40.431248 40.301342  1.542160  1.253378

Following guidelines given in Belsley et al. (1980) and O’Brien, (2007) predictors with a VIF > 10 should be removed and the regression re-fitted. This avoids inaccurate coefficient estimation and associated uncertainties due to such collinearities. An alternative to improve statistical inferences in the presence of collinearity can be found in a penalised regression (e.g. Zou and Hastie 2005), where all predictors can potentially be retained, but this approach was not considered here.

Thus re-fitting the OLS regression, where soil Total Nitrogen TNPC is now a function of 4 predictor variables (SOCgkg, NO3Ngkg, NH4Ngkg and SiltPC) - i.e. SandPC is removed (as it was one of two predictors with VIFs > 40):

mod.OLS.2 = lm(TNPC ~SOCgkg+SiltPC+NO3Ngkg+NH4Ngkg, data = df_sf)

And its summary:

summary(mod.OLS.2)
## 
## Call:
## lm(formula = TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg, data = df_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2115 -0.1594  0.0381  0.1987  3.1023 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.220137   0.222938  -9.959  < 2e-16 ***
## SOCgkg       0.689018   0.039624  17.389  < 2e-16 ***
## SiltPC       0.014503   0.001487   9.753  < 2e-16 ***
## NO3Ngkg      0.126049   0.029304   4.301 1.94e-05 ***
## NH4Ngkg     -0.146518   0.073502  -1.993   0.0466 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5434 on 684 degrees of freedom
## Multiple R-squared:  0.6091, Adjusted R-squared:  0.6069 
## F-statistic: 266.5 on 4 and 684 DF,  p-value: < 2.2e-16

And the VIFs:

vif(mod.OLS.2)
##   SOCgkg   SiltPC  NO3Ngkg  NH4Ngkg 
## 1.781388 1.329538 1.539140 1.248517

All VIFs are now < 2, which is good and the R-squared = 0.61, which is no different to that found before. All coefficients are now significantly different from zero at least at the 95% level. Again, observe that these significance tests assume independent and identically distributed errors - this has not changed.

Many studies may stop their statistical analyses at this juncture, but this overlooks possibly important spatial effects. So the first step to address this, is to examine the residuals from the OLS fit.

2. Investigation of the residuals

Therefore, in this sub-section, we will interrogate the residuals from the chosen OLS fit through: (i) mapping them, (ii) assessing their spatial autocorrelation with a Moran’s I analysis, and (iii) assessing their spatial autocorrelation with a semivariogram analysis.

2.1 Mapping the residuals

The residuals from the chosen OLS can be summarised:

summary(mod.OLS.2$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -7.21153 -0.15940  0.03812  0.00000  0.19869  3.10231
var(mod.OLS.2$residuals) # the variance
## [1] 0.2935738

Again, we can use the following code from Session 3 for mapping the residuals:

quick.map <- function(spdf,var,legend.title,main.title, dot.size = 0.2) {
  if (class(spdf) == "SpatialPointsDataFrame"){
  p = tm_shape(spdf)+
      tm_dots(var, title = legend.title, size = dot.size)+
      tm_layout(title = main.title, legend.title.size =0.9)
  }
  if (class(spdf) == "SpatialPolygonsDataFrame"){
  p = tm_shape(spdf)+
      tm_fill(var, title = legend.title)+
      tm_layout(title = main.title, legend.title.size =0.9)
  }
  return(p)
}

So adding the raw residuals to our spatial data sets:

df_sf$residuals <- mod.OLS.2$residuals
df_sp@data$residuals <- mod.OLS.2$residuals

And mapping them in Figure 1:

quick.map(df_sp,var="residuals","Raw","OLS residuals")
Residuals from OLS regression fit.

Residuals from OLS regression fit.

There exists some unusually high negative and some unusually high positive residuals, both of which suggest outlying observations which should be investigated further (but not here!). There does not appear to be any obvious spatial pattern to the residuals.

2.2 Moran’s I test

Therefore, lets us assess for residual spatial autocorrelation and assess the significance of it using a Moran’s I test based on the zonal data form as done in session 3. However this time, we must use the lm.morantest as looking at residuals from a linear regression. We cannot use the morantest directly on the residuals as this would result in a known bias. This is due to an inherent identification problem when separating first- from second-order effects when regression modelling with spatial data (e.g. Armstrong 1984).

# create the neighbour list
# note the need to convert the polygons to sp format for poly2nb
gnb = poly2nb(as(df_zone, "Spatial")) # see Session 3
glw = nb2listw(gnb) # see Session 3
moran.result <- lm.morantest(mod.OLS.2, listw=glw)
moran.result
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg, data =
## df_sf)
## weights: glw
## 
## Moran I statistic standard deviate = 6.4688, p-value = 4.94e-11
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.1437592952    -0.0031677076     0.0005158947

The outputs indicate there is significant residual autocorrelation as the p-value <<<< 0.

2.3 Residual Semivariograms

To provide detail and clarity to this significance in autocorrelation, a residual semivariogram can be found. However, given the same identification problem above, this semivariogram will also be biased, but it does serve for exploratory purposes, just as it did in Session 4 when kriging with a trend model. The residual semivariogram (Figure 2) is found from:

Res.vgm = variogram(TNPC ~SOCgkg+SiltPC+NO3Ngkg+NH4Ngkg, df_sf, cutoff=1200)
Res.vgm.fit = fit.variogram(Res.vgm, model = vgm(0.3, "Exp", 100, 0.005),
                            fit.sills = c(F,T), fit.method=7)
Res.vgm.fit
##   model     psill    range
## 1   Nug 0.0050000  0.00000
## 2   Exp 0.2952393 40.67338
plot(Res.vgm,Res.vgm.fit,main='Residual semivariograms')
Empirical residual semivariogram with Exponential model fit

Empirical residual semivariogram with Exponential model fit

The (biased) residual semivariogram, fitted via a weighted least squares (WLS) method, yields: an effective correlation range of 3 x 40.7 = 122.1 m, which is relatively weak considering the study has a maximum distance of 3742 m; a nugget variance of 0.005 (user-specified and not estimated, in this case); and a structural variance of 0.30. Furthermore, the nugget effect is relatively strong at 0.005/(0.005 + 0.30) = 0.016.

Overall, experience suggests that spatial autocorrelation in the residuals is not particularly strong even though the Moran’s I test is significant.

The maximum distance between data points / zones in the study area can be found from:

max(dist(st_coordinates(df_sf)))
## [1] 3742.06

3. REML regression analysis

Given the results of the residual analyses, above, we will now fit a REML estimated linear regression with spatial effects and compare its outputs to a regression without spatial effects in terms of: (i) model fit, and (ii) predictor variable significance.

To fit REML estimated linear regressions, we can use the lme function in the nlme (linear mixed model) package as follows:

First, fit a regression without spatial effects (which should be the same as the OLS regression, above):

df_sf$dummy <- rep(1, nrow(df_sf)) # as we don't have grouping data (e.g a 'treatment')
mod.REML.2.NS <- lme(fixed = TNPC ~SOCgkg+SiltPC+NO3Ngkg+NH4Ngkg,
                     data = df_sf, random = ~ 1 | dummy, method = "REML")

The REML regression outputs can be inspected and compared with OLS estimation (look at the estimated coefficients etc.):

summary(mod.REML.2.NS)
## Linear mixed-effects model fit by REML
##  Data: df_sf 
##        AIC      BIC    logLik
##   1154.838 1186.534 -570.4192
## 
## Random effects:
##  Formula: ~1 | dummy
##         (Intercept)  Residual
## StdDev:   0.0552057 0.5434065
## 
## Fixed effects: TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg 
##                  Value  Std.Error  DF   t-value p-value
## (Intercept) -2.2201369 0.22967119 684 -9.666589  0.0000
## SOCgkg       0.6890176 0.03962435 684 17.388743  0.0000
## SiltPC       0.0145032 0.00148703 684  9.753140  0.0000
## NO3Ngkg      0.1260489 0.02930385 684  4.301446  0.0000
## NH4Ngkg     -0.1465184 0.07350177 684 -1.993400  0.0466
##  Correlation: 
##         (Intr) SOCgkg SiltPC NO3Ngk
## SOCgkg   0.193                     
## SiltPC  -0.093 -0.408              
## NO3Ngkg  0.120 -0.434  0.009       
## NH4Ngkg -0.938 -0.141 -0.076 -0.263
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -13.27097205  -0.29332592   0.07015841   0.36564577   5.70900803 
## 
## Number of Observations: 689
## Number of Groups: 1
summary(mod.OLS.2)
## 
## Call:
## lm(formula = TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg, data = df_sf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2115 -0.1594  0.0381  0.1987  3.1023 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.220137   0.222938  -9.959  < 2e-16 ***
## SOCgkg       0.689018   0.039624  17.389  < 2e-16 ***
## SiltPC       0.014503   0.001487   9.753  < 2e-16 ***
## NO3Ngkg      0.126049   0.029304   4.301 1.94e-05 ***
## NH4Ngkg     -0.146518   0.073502  -1.993   0.0466 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5434 on 684 degrees of freedom
## Multiple R-squared:  0.6091, Adjusted R-squared:  0.6069 
## F-statistic: 266.5 on 4 and 684 DF,  p-value: < 2.2e-16

Good news! Estimation with the lme function matches that of the lm function - both essentially OLS fits.

Second, fit a regression with spatial effects - this will take a few minutes (note using the update function):

mod.REML.2.S <- update(mod.REML.2.NS, correlation = corExp(c(122.1,0.016),
                      form = ~ Longitude+Latitude, nugget = T), method = "REML")

Observe that the semivariogram parameters corExp(c(122.1,0.016) are again only initialising parameters for the REML fit. These are taken from our exploratory (but biased) residual semivariogram (modelled with WLS), above. Note also, that we now specify the effective correlation range for the exponential variogram model with the nlme package.

The REML regression outputs can be inspected using:

summary(mod.REML.2.S)
## Linear mixed-effects model fit by REML
##  Data: df_sf 
##        AIC      BIC    logLik
##   1112.334 1153.086 -547.1672
## 
## Random effects:
##  Formula: ~1 | dummy
##         (Intercept)  Residual
## StdDev:  0.05589367 0.5501557
## 
## Correlation Structure: Exponential spatial correlation
##  Formula: ~Longitude + Latitude | dummy 
##  Parameter estimate(s):
##       range      nugget 
## 264.5930675   0.7493071 
## Fixed effects: TNPC ~ SOCgkg + SiltPC + NO3Ngkg + NH4Ngkg 
##                  Value  Std.Error  DF   t-value p-value
## (Intercept) -1.8828924 0.26177976 684 -7.192658  0.0000
## SOCgkg       0.6546110 0.04106173 684 15.942118  0.0000
## SiltPC       0.0108872 0.00156948 684  6.936837  0.0000
## NO3Ngkg      0.1161982 0.03064289 684  3.792012  0.0002
## NH4Ngkg     -0.2007521 0.08109098 684 -2.475641  0.0135
##  Correlation: 
##         (Intr) SOCgkg SiltPC NO3Ngk
## SOCgkg   0.047                     
## SiltPC  -0.114 -0.370              
## NO3Ngkg  0.183 -0.457  0.034       
## NH4Ngkg -0.923 -0.004 -0.053 -0.324
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -13.1502849  -0.2595032   0.1001055   0.3897324   5.7792639 
## 
## Number of Observations: 689
## Number of Groups: 1

If you inspect the summaries for mod.REML.2.NS (non-spatial) and mod.REML.2.S (spatial), there is little gained in incorporating spatial effects with respect to the estimated coefficients and the associated relationship inferences. This is not surprising given the (moderate) outcomes of the residual variography, above.

Note also the semivariogram parameter estimates via the REML fit - i.e. a effective correlation range of 264.6 m and a nugget effect of 0.75 - are both quite different to the biased WLS variogram model fit.

It is also useful to compare models by their fit statistics (AIC - Akaike Information Criterion and BIC - Bayesian Information Criterion) with values of: (1155 and 1186) and (1112 and 1153) for (mod.REML.2.NS and mod.REML.2.S), respectively. Thus in terms of model fit, the spatial model can be preferred as both AIC and BIC values are ‘significantly’ reduced.

Observe that the lm function does not report the AIC value or that the lme function does not report the R-squared value. An R-squared can be calculated as follows, for the spatial model:

preds <- mod.REML.2.S$fitted[,1]
actual <- df_sf$TNPC
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
rsq
## [1] 0.5996779

The spatial model’s R-squared = 0.60. As usual, this is little different to the non-spatial model’s R-squared (of 0.61).

In summary, accounting for spatial effects in the soils process is important in terms of model fit (e.g. AIC), but not so important in terms of inference with respect to the significance of the soils data relationships. In the next sub-sections, we shall see situations when this is not the case.

Task: create an R function for calculating R-squared and confirm that all R-squared values are correctly found for the following regressions: mod.OLS.2, mod.REML.2.NS and mod.REML.2.S. For an answer, see the last section.

4. Effects of predictor variable choice

In this sub-section, we will conduct a brief model specification exercise in terms of the effects of removing predictor variables on the OLS and REML model fits. Here, spatial autocorrelation tends to become more prominent as predictor information is reduced.

4.1 Regression fits with associated diagnostics

Thus re-fitting all regressions, where soil Total Nitrogen (TNPC) is now a function of NH4Ngkg only:

The OLS fit:

mod.OLS.3.MS = lm(TNPC ~NH4Ngkg, data = df_sf)

Conduct Moran’s I test:

lm.morantest(mod.OLS.3.MS, listw=glw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = TNPC ~ NH4Ngkg, data = df_sf)
## weights: glw
## 
## Moran I statistic standard deviate = 13.652, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.3089947961    -0.0020740256     0.0005191651

Find the residual semivariograms (Figure 3):

Res.vgm.MS = variogram(TNPC ~NH4Ngkg, df_sf, cutoff=1200)
Res.vgm.fit.MS = fit.variogram(Res.vgm.MS, model = vgm(0.4, "Exp", 200, 0.3),
                               fit.sills = c(T,T), fit.method=7)
Res.vgm.fit.MS
##   model     psill    range
## 1   Nug 0.4116035   0.0000
## 2   Exp 0.2794458 277.3878
plot(Res.vgm.MS,Res.vgm.fit.MS,main='Residual semivariograms (w.r.t. model specification)')
Empirical residual semivariogram with Exponential model fit (w.r.t. model specification)

Empirical residual semivariogram with Exponential model fit (w.r.t. model specification)

Fit the (non-spatial) REML:

mod.REML.3.NS.MS <- lme(fixed = TNPC ~NH4Ngkg,
                        data = df_sf, random = ~ 1 | dummy, method = "REML")

Fit the (spatial) REML:

mod.REML.3.S.MS <- update(mod.REML.3.NS.MS, correlation = corExp(c(831,0.28),
                          form = ~ Longitude+Latitude, nugget = T), method = "REML")

Investigate the summary of the non-spatial model:

summary(mod.REML.3.NS.MS)
## Linear mixed-effects model fit by REML
##  Data: df_sf 
##        AIC      BIC    logLik
##   1721.831 1739.961 -856.9157
## 
## Random effects:
##  Formula: ~1 | dummy
##         (Intercept)  Residual
## StdDev:  0.08490436 0.8357394
## 
## Fixed effects: TNPC ~ NH4Ngkg 
##                 Value Std.Error  DF    t-value p-value
## (Intercept) -3.829833 0.3370614 687 -11.362418       0
## NH4Ngkg      0.735511 0.1011688 687   7.270141       0
##  Correlation: 
##         (Intr)
## NH4Ngkg -0.963
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -9.6039171 -0.4176306  0.1994362  0.6070194  4.4114531 
## 
## Number of Observations: 689
## Number of Groups: 1

Investigate the summary of the spatial model:

summary(mod.REML.3.S.MS)
## Linear mixed-effects model fit by REML
##  Data: df_sf 
##        AIC      BIC    logLik
##   1540.169 1567.363 -764.0847
## 
## Random effects:
##  Formula: ~1 | dummy
##         (Intercept)  Residual
## StdDev:  0.08635397 0.8499901
## 
## Correlation Structure: Exponential spatial correlation
##  Formula: ~Longitude + Latitude | dummy 
##  Parameter estimate(s):
##       range      nugget 
## 502.1114005   0.5705617 
## Fixed effects: TNPC ~ NH4Ngkg 
##                  Value Std.Error  DF   t-value p-value
## (Intercept) -2.3729103 0.3972594 687 -5.973201  0.0000
## NH4Ngkg      0.2689516 0.1061042 687  2.534787  0.0115
##  Correlation: 
##         (Intr)
## NH4Ngkg -0.859
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -9.1823809 -0.3852807  0.2640150  0.6551056  4.3454131 
## 
## Number of Observations: 689
## Number of Groups: 1

4.2 Interpretation

Inspecting the coefficient summaries for mod.REML.3.NS.MS (non-spatial) and mod.REML.3.S.MS (spatial), there are clear differences in the coefficient estimates and t-values. Intercept and NH4Ngkg coefficient significance appears much stronger if spatial autocorrelation is not considered.

It is again useful to compare models by their AIC / BIC with values of: 1722 / 1740 and 1540 / 1567 for mod.REML.3.NS.MS and mod.REML.3.S.MS, respectively. Thus in terms of model fit, the spatial model can be preferred as both AIC and BIC values are ‘significantly’ reduced.

Observe also the difference in the structure for the residual variograms in Figure 2 and Figure 3, where the latter (the one for this sub-section) has a much longer correlation range.

Thus in summary, the objective here was to demonstrate that as the number of predictor variables decrease, it is common for the value of incorporating spatial effects to increase (and vice-versa until the model is considered fully-specified).

5. Effects of the ‘spatial-type’ predictors

In this sub-section, the aim is to demonstrate how inherently ‘spatial’ predictor variables (e.g. in Longitude and Latitude) can act as a surrogate spatial effect in an OLS regression, while their significance is reduced when spatial effects are introduced through a REML.

5.1 Regression fits with associated diagnostics

Therefore, re-fitting all regressions, where soil Total Nitrogen (TNPC) is now a function of NH4Ngkg, Longitude and Latitude:

The OLS fit:

mod.OLS.4.SP = lm(TNPC ~NH4Ngkg + Longitude + Latitude, data = df_sf)

Conduct Moran’s I test:

lm.morantest(mod.OLS.4.SP, listw=glw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = TNPC ~ NH4Ngkg + Longitude + Latitude, data =
## df_sf)
## weights: glw
## 
## Moran I statistic standard deviate = 11.982, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.2668138378    -0.0048327674     0.0005139644

Fit and plot the residual semivariograms (Figure 4):

Res.vgm.SP = variogram(TNPC ~NH4Ngkg + Longitude + Latitude, df_sf, cutoff=1200)
Res.vgm.fit.SP = fit.variogram(Res.vgm.SP, model = vgm(0.4, "Exp", 200, 0.3),
                               fit.sills = c(T,T), fit.method=7)
Res.vgm.fit.SP
##   model     psill    range
## 1   Nug 0.4023107   0.0000
## 2   Exp 0.2584434 244.2777
plot(Res.vgm.SP,Res.vgm.fit.SP,main='Residual semivariograms (w.r.t. spatial predictors)')
Empirical residual semivariogram with Exponential model fit (w.r.t. spatial predictors)

Empirical residual semivariogram with Exponential model fit (w.r.t. spatial predictors)

Fit the (non-spatial) REML:

mod.REML.4.NS.SP <- lme(fixed = TNPC ~NH4Ngkg + Longitude + Latitude,
                        data = df_sf, random = ~ 1 | dummy, method = "REML")

Fit the (spatial) REML:

mod.REML.4.S.SP <- update(mod.REML.4.NS.SP, correlation = corExp(c(733,0.27),
                          form = ~ Longitude+Latitude, nugget = T), method = "REML")

Investigate the summary of the non-spatial model:

summary(mod.REML.4.NS.SP)
## Linear mixed-effects model fit by REML
##  Data: df_sf 
##        AIC      BIC    logLik
##   1707.006 1734.183 -847.5031
## 
## Random effects:
##  Formula: ~1 | dummy
##         (Intercept)  Residual
## StdDev:  0.08166565 0.8038598
## 
## Fixed effects: TNPC ~ NH4Ngkg + Longitude + Latitude 
##                Value Std.Error  DF   t-value p-value
## (Intercept) 421.1444 165.24805 685  2.548559  0.0110
## NH4Ngkg       0.5383   0.10520 685  5.116361  0.0000
## Longitude     0.0003   0.00005 685  7.307166  0.0000
## Latitude     -0.0002   0.00004 685 -3.887538  0.0001
##  Correlation: 
##           (Intr) NH4Ngk Longtd
## NH4Ngkg    0.186              
## Longitude  0.073 -0.316       
## Latitude  -0.982 -0.121 -0.262
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -10.1958607  -0.3854290   0.1440258   0.5935284   4.4839039 
## 
## Number of Observations: 689
## Number of Groups: 1

Investigate the summary of the spatial model:

summary(mod.REML.4.S.SP)
## Linear mixed-effects model fit by REML
##  Data: df_sf 
##        AIC      BIC    logLik
##   1573.781 1610.017 -778.8906
## 
## Random effects:
##  Formula: ~1 | dummy
##         (Intercept)  Residual
## StdDev:  0.08602963 0.8467971
## 
## Correlation Structure: Exponential spatial correlation
##  Formula: ~Longitude + Latitude | dummy 
##  Parameter estimate(s):
##       range      nugget 
## 484.7152744   0.5738722 
## Fixed effects: TNPC ~ NH4Ngkg + Longitude + Latitude 
##                Value Std.Error  DF    t-value p-value
## (Intercept) 333.5508  683.4631 685  0.4880304  0.6257
## NH4Ngkg       0.2655    0.1068 685  2.4853867  0.0132
## Longitude     0.0002    0.0002 685  1.1057415  0.2692
## Latitude     -0.0001    0.0002 685 -0.6838277  0.4943
##  Correlation: 
##           (Intr) NH4Ngk Longtd
## NH4Ngkg    0.083              
## Longitude  0.028 -0.073       
## Latitude  -0.983 -0.068 -0.211
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -9.3758292 -0.3674393  0.2072935  0.6134756  4.2909287 
## 
## Number of Observations: 689
## Number of Groups: 1

5.2 Interpretation

Inspecting the coefficient summaries for mod.REML.4.NS.SP (non-spatial) and mod.REML.4.S.SP (spatial), there are clear differences in the p-values and thus the significance of each predictor variable. Both Longitude and Latitude are significant predictors in the OLS regression but are both insignificant in the REML regression. Furthermore, the intercept is only significant in the OLS regression.

It is again useful to compare models by their fit statistics (AIC and BIC) with values of: (1707 and 1734) and (1574 and 1610) for (mod.REML.4.NS.SP and mod.REML.4.S.SP), respectively. Thus in terms of model fit, the spatial model can be preferred as both AIC and BIC values are ‘significantly’ reduced.

This analysis clearly demonstrates how ‘spatial-type’ predictor variables (e.g. in Longitude and Latitude) can act as a surrogate spatial effect in an OLS regression, while their significance is reduced when spatial effects are introduced through a REML.

This observation also demonstrates the importance of accounting for spatial effects for processes that are inherently spatial, else incorrect scientific inferences can result.

Task: does Altitude_m have a similar effect on the interpretation of the OLS and REML fits?

Summary

In this session, we have taken you through a series of OLS and REML regressions for modelling with or with spatial autocorrelation effects. Clearly these effects can sometimes, but not always, be important - commonly depending on the nature and amount of the available predictor variables.

You may also have observed the similarities between REML regression and kriging with an external drift (KED) from Session 4. This is not surprising as they are one and the same model - the former estimates the model parameters / coefficients, the latter uses this model to predict at an un-sampled locations (see Cressie 1992).

Tasks

Try repeating the above exercises with TPPC instead of TNPC as the response variable. Suggest trying this with TPPC square-root transformed.

Answer to the R-squared task

create an R function for calculating R-squared:

RSQR <- function(x,y){
rss <- sum((x - y) ^ 2)  ## residual sum of squares
tss <- sum((y - mean(y)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
rsq}
RSQR(mod.REML.2.S$fitted[,1],df_sf$TNPC)
## [1] 0.5996779

References

Armstrong, M., 1984. Problems with universal kriging. Mathematical Geology, 16(1), pp.101-108.

David, B., Kuh, E. and Welsch, R., 1980. Regression diagnostics: identifying influential data and sources of collinearity. John Wiley & Sons.

Comber, A., Wang, Y., Lu, Y., Zhang, X. and Harris, P., 2018. Hyper-local geographically weighted regression: extending GWR through local model selection and local bandwidth optimization. Journal of Spatial Information Science, 2018(17), pp.63-84.

Cressie, N., 1992. Statistics for spatial data. Wiley, New Jersey.

O’Brien, R.M., 2007. A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41(5), pp.673-690.

Wang, Y., Zhang, X. and Huang, C., 2009. Spatial variability of soil total nitrogen and soil total phosphorus under different land uses in a small watershed on the Loess Plateau, China. Geoderma, 150(1-2), pp.141-149.

Zou, H. and Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: B (statistical methodology), 67(2), pp.301-320.