Up until now, we have been concerned with describing the structure of spatial data through correlational methods and exploratory spatial data analysis (ESDA).
Through ESDA, we examined data for patterns and, using the Moran's I and Local Moran's I statistics, we examined clustering of variables.
Now we consider regression models for continuous outcomes. We begin with a review of the Ordinary Least Squares model for a continuous outcome.
\(y_i = \beta_0 + \beta_1 * x_i + e_i\)
\(e_i\) is the error in the model for \(y_i\) that is unaccounted for by the value of \(x_i\) and the grand mean \(\beta_0\).
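The coefficient table below is from a simple illustrative regression; a minimal sketch of the kind of code chunk that produces output like this (the simulated data, seed, and parameter values here are hypothetical):

# a minimal sketch, assuming simulated data; seed and true parameter values are hypothetical
set.seed(1115)
x <- rnorm(100, mean = 10, sd = 2)        # a single continuous predictor
y <- 1.5 + 1.5 * x + rnorm(100, 0, 3)     # outcome generated from known intercept and slope
fit0 <- lm(y ~ x)                         # ordinary least squares fit
summary(fit0)$coefficients                # estimates, standard errors, t values, p values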
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.446620 1.0879494 1.329676 1.867119e-01
## x 1.473915 0.1037759 14.202863 1.585002e-25
The fitted regression line shows \(E[y|x] = \beta_0 + \beta_1 * x\).
We assume that the errors, \(e_i \sim N(0, \sigma^2)\), are independent, Normally distributed and homoskedastic, with common variance \(\sigma^2\).
This is the simple model with one predictor. We can easily add more predictors to the equation and rewrite it: \(y_i = \beta_0 + \sum_{j=1}^{k} \beta_j * x_{ij} + e_i\)
So, now the mean of y is modeled with multiple x variables. We can write this relationship more compactly using matrix notation:
\(Y = X ' \beta + e\)
Where Y is now a \(n*1\) vector of observations of our dependent variable, X is a \(n*(k+1)\) matrix of independent variables with the first column being all 1's, \(\beta\) is the \((k+1)*1\) vector of regression coefficients, and e is the \(n*1\) vector of errors for each observation.
\[\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}\]
\[X=\begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1, k}\\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2, k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \dots & x_{n, k} \end{bmatrix}\]
\[ e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}\]
The residuals are assumed to be uncorrelated, with covariance matrix \(\Sigma\):
\[ \Sigma = \sigma^2 I = \sigma^2 * \begin{bmatrix} 1 & 0 & 0 & \dots & 0\\ 0 & 1 & 0 & \dots & 0 \\ 0 & \vdots & \vdots & \dots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & 0 & \dots & 0\\ 0 & \sigma^2 & 0 & \dots & 0 \\ 0 & \vdots & \vdots & \dots & \vdots \\ 0 & 0 & 0 & \dots & \sigma^2 \end{bmatrix}\]
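Under these assumptions, the OLS estimates of the coefficients and their sampling variance have the familiar closed forms:

\[\hat{\beta} = (X'X)^{-1}X'Y, \qquad Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\]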
Do we (meaning our data) meet the statistical assumptions of our analytical models?
Always ask this of any analysis you do; if your model is wrong, your inference will also be wrong.
Since spatial data often display correlations amongst closely located observations (autocorrelation), we should probably test for autocorrelation in the model residuals, as that would violate the assumptions of the OLS model.
One method for doing this is to calculate the value of Moran’s I for the OLS residuals.
Here's a simple OLS model of the form:

Poverty rate = % Black + % Hispanic + % LEP

In R: lm(ppov ~ pblack + phisp + plep)

Below, I map the outcome and predictors, fit the model, then extract the residuals and map them:
# Map the poverty rate and each predictor by quantile, overlaying the metro boundary
sa_acs2 %>%
  mutate(povquant = cut(ppov, breaks = quantile(sa_acs2$ppov, p = seq(0, 1, length.out = 8)), include.lowest = T)) %>%
  ggplot(aes(color = povquant, fill = povquant)) + geom_sf() +
  scale_fill_brewer(palette = "RdBu") +
  scale_color_brewer(palette = "RdBu") +
  ggtitle(label = "Poverty Rate in Census Tracts - AACOG 2015") + geom_sf(data = metro, fill = NA, color = "black")

sa_acs2 %>%
  mutate(blquant = cut(pblack, breaks = quantile(sa_acs2$pblack, p = seq(0, 1, length.out = 8)), include.lowest = T)) %>%
  ggplot(aes(color = blquant, fill = blquant)) + geom_sf() +
  scale_fill_brewer(palette = "RdBu") +
  scale_color_brewer(palette = "RdBu") +
  ggtitle(label = "% Black in Census Tracts - AACOG 2015") + geom_sf(data = metro, fill = NA, color = "black")

sa_acs2 %>%
  mutate(hquant = cut(phisp, breaks = quantile(sa_acs2$phisp, p = seq(0, 1, length.out = 8)), include.lowest = T)) %>%
  ggplot(aes(color = hquant, fill = hquant)) + geom_sf() +
  scale_fill_brewer(palette = "RdBu") +
  scale_color_brewer(palette = "RdBu") +
  ggtitle(label = "% Hispanic in Census Tracts - AACOG 2015") + geom_sf(data = metro, fill = NA, color = "black")

sa_acs2 %>%
  mutate(lquant = cut(plep, breaks = quantile(sa_acs2$plep, p = seq(0, 1, length.out = 8)), include.lowest = T)) %>%
  ggplot(aes(color = lquant, fill = lquant)) + geom_sf() +
  scale_fill_brewer(palette = "RdBu") +
  scale_color_brewer(palette = "RdBu") +
  ggtitle(label = "% Low English Proficiency in Census Tracts - AACOG 2015") + geom_sf(data = metro, fill = NA, color = "black")
# Build a queen-contiguity neighbor list and row-standardized spatial weights, then fit the OLS model
library(spdep)

sa_acs2 <- as(sa_acs2, "Spatial")
nbs <- poly2nb(sa_acs2, queen = T)
wts <- nb2listw(nbs, style = "W")

fit <- lm(ppov ~ plep + phisp + pblack, data = sa_acs2)
summary(fit)
##
## Call:
## lm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.434 -4.520 -0.842 3.189 44.057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.82055 0.93552 -4.084 5.20e-05 ***
## plep 0.55065 0.07212 7.635 1.27e-13 ***
## phisp 0.17435 0.02324 7.503 3.14e-13 ***
## pblack 0.32229 0.04581 7.035 7.05e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.545 on 471 degrees of freedom
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5473
## F-statistic: 192 on 3 and 471 DF, p-value: < 2.2e-16
# Extract the studentized residuals from the OLS fit for mapping
sa_acs2$olsresid <- rstudent(fit)
library(ggplot2)
library(sf)
library(dplyr)
# Convert back to sf for mapping with geom_sf
sa_acs2 <- st_as_sf(sa_acs2)
# Map the studentized OLS residuals by quantile
sa_acs2 %>%
  mutate(rquant = cut(olsresid, breaks = quantile(sa_acs2$olsresid, p = seq(0, 1, length.out = 8)), include.lowest = T)) %>%
  ggplot(aes(color = rquant, fill = rquant)) + geom_sf() +
  scale_fill_brewer(palette = "RdBu") +
  scale_color_brewer(palette = "RdBu") + geom_sf(data = metro, fill = NA, color = "black")
#Moran's I on residuals from model
lm.morantest(fit, listw = wts)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2)
## weights: wts
##
## Moran I statistic standard deviate = 7.5941, p-value = 1.55e-14
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.1926032955 -0.0058025514 0.0006825815
In this case, there appears to be significant clustering in the residuals: the observed value of Moran's I is 0.19, with a standard deviate (z value) of 7.59, p < .0001.
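As a check on the normal-approximation p-value, spdep also provides an exact test for Moran's I on regression residuals; a minimal sketch, assuming the same fit and wts objects from above:

# a sketch: exact version of the Moran test on the OLS residuals (same fit and weights as above)
lm.morantest.exact(fit, listw = wts)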
Now assume we measure our Y and X's at specific spatial locations (s), so we have Y(s) and X(s).
In most analyses, the spatial location (i.e. the county or census tract) only serves to link X and Y so we can collect our data on them, and in the subsequent analysis this spatial information is ignored: nothing in the model explicitly considers the spatial relationships between the variables or the locations.
In fact, even though we measure Y(s) and X(s), what we end up analyzing is X and Y, and we apply ordinary regression methods to these data to understand the effects of X on Y.
Moreover, we could move them around in space (as long as we keep each \(y_i\) paired with its \(x_i\)) and still get the same results.
Such analyses have been called a-spatial. This is the kind of regression model you are used to fitting, where we ignore any information on the locations of the observations themselves.
However, we can extend the simple regression case to include the information on (s) and incorporate it into our models explicitly, so they are no longer a-spatial.
There are several methods by which to incorporate the (s) locations into our models; two common alternatives are:

- Spatially autoregressive (SAR) models
- Geographically weighted regression

We will first deal with the case of the spatially autoregressive model, or SAR model, as its structure is just a modification of the OLS model from above.
We saw in the normal OLS model that some of the basic assumptions of the model are that: 1) the model residuals are distributed as i.i.d. Normal random variates, and 2) they have common (and constant, meaning homoskedastic) variance \(\sigma^2\).
Spatial data, however, present a series of problems to the standard OLS regression model. These problems are typically seen as various representations of spatial structure or dependence within the data. The spatial structure of the data can introduce spatial dependence into the outcome, the predictors, and the model residuals.
This can be observed as neighboring observations both having high (or low) values (positive autocorrelation) for the dependent variable, the model predictors, or the model residuals. We can also observe situations where areas with high values are surrounded by areas with low values (negative autocorrelation).
One way of incorporating this spatial structure into the model is to allow correlation between the residuals, or correlation in the dependent variable itself.
There are 2 basic forms of the spatial autoregressive model: the spatial lag and the spatial error models.
The ordinary OLS model is:

\(Y = X ' \beta + e\)

The spatial lag model adds a spatially lagged copy of the dependent variable, \(W Y\), to the right-hand side:

\(Y= \rho W Y + X '\beta +e\)
where \(\rho\) is the autoregressive coefficient, which tells us how strong the resemblance is, on average, between \(Y_i\) and its neighbors. The matrix W is the spatial weight matrix, describing the spatial network structure of the observations, as we described in the ESDA lecture.
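To make the \(W Y\) term concrete: with row-standardized ("W" style) weights, each element of \(W Y\) is just the average of that observation's neighbors' values. A small sketch, assuming the wts object built earlier (the lag_ppov name is only for illustration):

# a sketch: the spatial lag W %*% y is the mean of each tract's neighbors under row-standardized weights
sa_acs2$lag_ppov <- lag.listw(wts, sa_acs2$ppov)
head(cbind(ppov = sa_acs2$ppov, lag_ppov = sa_acs2$lag_ppov))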
The spatial error model says that the autocorrelation is not in the outcome itself, but instead, any autocorrelation is attributable to there being missing spatial covariates in the data.
If these spatially patterned covariates could be measured, then the autocorrelation would be 0. This model is written:
\(Y= X' \beta +e\)
\(e=\lambda W e + v\)

where \(\lambda\) is the autoregressive coefficient for the error terms and v is a vector of independent, homoskedastic disturbances.
Another form of a spatial lag model is the Spatial Durbin Model (SDM). This model is an extension of the ordinary lag or error model that includes spatially lagged independent variables.
If you remember, one issue that commonly occurs with the lag model is that we often have residual autocorrelation in the model. This autocorrelation could be attributable to a missing spatial covariate.
We can get a kind of spatial covariate by lagging the predictor variables in the model using W.
This model can be written:
\(Y= \rho W Y + X '\beta + W X \theta + e\)
where the \(\theta\) parameter vector contains the regression coefficients for the lagged predictor variables. We can also include the lagged predictors in an error model, which gives us the Durbin Error Model (DEM):
\(Y= X '\beta + W X \theta + e\)
\(e=\lambda W e + v\)
Generally, the spatial Durbin model is preferred to the ordinary error model, because we can include the unspecified spatial covariates from the error model into the Durbin model via the lagged predictor variables.
To some degree, both of the SAR specifications allow us to model spatial dependence in the data. The primary difference between them is where we model said dependence.
The lag model says that the dependence affects the dependent variable only; we can liken this to a diffusion scenario, where your neighbors have a diffusive effect on you.
The error model says that dependence affects the residuals only. We can liken this to the missing spatially dependent covariate situation, where, if only we could measure another really important spatially associated predictor, we could account for the spatial dependence. But alas, we cannot, and we instead model dependence in our errors.
These are inherently two completely different ways to think about specifying a model, and we should really make our decision based upon how we think our process of interest operates.
That being said, this way of thinking isn’t necessarily popular among practitioners. Most practitioners want the best fitting model, ’nuff said. So methods have been developed that test for alternate model specifications, to see which kind of model best summarizes the observed variation in the dependent variable and the spatial dependence.
These are a set of so-called Lagrange Multiplier (econometrician's jargon for a score test) tests. These tests compare the model fits from the OLS, spatial error, and spatial lag models using the method of the score test.
For those who don't remember, the score test is based on the slope (first derivative) of the likelihood function, evaluated at the restricted parameter estimates (here, the OLS fit, where \(\rho\) or \(\lambda\) is fixed at 0).
The particular thing here that is affecting the value of this derivative is the autoregressive parameter, \(\rho\) or \(\lambda\).
In the OLS model \(\rho\) or \(\lambda\) = 0 (so both the lag and error models simplify to OLS), but as this parameter changes, so does the likelihood for the model, hence why the derivative of the likelihood function is used.
This is all related to how the estimation routines estimate the value of \(\rho\) or \(\lambda\).
In general, you fit the OLS model to your dependent variable, then submit the OLS model fit to the LM testing procedure.
Then you look to see which model (spatial error or spatial lag) has the higher value for the test.
Enter the uncertainty…
So how much bigger, you might say?
Well, drastically bigger. If the LMT for the error model is 2500 and the LMT for the lag model is 2480, this is NOT A BIG DIFFERENCE, only about 1%.
If you see a LMT for the error model of 2500 and a LMT for the lag model of 250, THIS IS A BIG DIFFERENCE.
So what if you don’t see a BIG DIFFERENCE, HOW DO YOU DECIDE WHICH MODEL TO USE???
Well, you could think more, but who has time for that.
The econometricians have thought up a better LMT test, the so-called robust LMT (each version is robust to the presence of the other form of dependence), and it is said that it can settle such problems of a not-so-big difference between the lag and error model specifications.
So what do you do? In general, think about your problem before you run your analysis, should this fail you, proceed with using the LMT, if this is inconclusive, look at the robust LMT, and choose the model which has the larger value for this test.
lm.LMtests(model = fit, listw=wts, test = "all")
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2)
## weights: wts
##
## LMerr = 52.661, df = 1, p-value = 3.965e-13
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2)
## weights: wts
##
## LMlag = 55.063, df = 1, p-value = 1.168e-13
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2)
## weights: wts
##
## RLMerr = 4.8322, df = 1, p-value = 0.02793
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2)
## weights: wts
##
## RLMlag = 7.2344, df = 1, p-value = 0.007152
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2)
## weights: wts
##
## SARMA = 59.895, df = 2, p-value = 9.859e-14
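To put the statistics side by side, one can pull them out of the test object; a sketch, assuming the fit and wts objects from above:

# a sketch: collect the LM statistics for a side-by-side comparison of the lag and error specifications
lmt <- lm.LMtests(model = fit, listw = wts, test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
sapply(lmt, function(x) round(x$statistic, 2))

Here the ordinary tests are close (LMerr = 52.7 vs. LMlag = 55.1, about a 5% difference), but the robust tests favor the lag specification (RLMlag = 7.2 vs. RLMerr = 4.8), so a lag-type model is a reasonable choice.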
Now we fit the spatial regression models:
# Spatial Error model
fit.err <- errorsarlm(ppov ~ plep + phisp + pblack,
                      data = sa_acs2, listw = wts)
summary(fit.err, Nagelkerke = T)
##
## Call:errorsarlm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2,
## listw = wts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.85935 -3.95945 -0.81684 2.50323 42.55534
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.734125 1.328316 -2.8112 0.004936
## plep 0.418125 0.073264 5.7071 1.149e-08
## phisp 0.189494 0.026702 7.0967 1.278e-12
## pblack 0.406506 0.060064 6.7679 1.307e-11
##
## Lambda: 0.44647, LR test value: 44.337, p-value: 2.764e-11
## Asymptotic standard error: 0.060668
## z-value: 7.3592, p-value: 1.8496e-13
## Wald statistic: 54.158, p-value: 1.8507e-13
##
## Log likelihood: -1609.739 for error model
## ML residual variance (sigma squared): 49.553, (sigma: 7.0394)
## Nagelkerke pseudo-R-squared: 0.59021
## Number of observations: 475
## Number of parameters estimated: 6
## AIC: 3231.5, (AIC for lm: 3273.8)
# Spatial Lag Model
fit.lag <- lagsarlm(ppov ~ plep + phisp + pblack,
                    data = sa_acs2, listw = wts, type = "lag")
summary(fit.lag, Nagelkerke = T)
##
## Call:lagsarlm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2,
## listw = wts, type = "lag")
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.97510 -4.03343 -0.84677 3.01898 42.54880
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.576735 0.885479 -4.0393 5.361e-05
## plep 0.427829 0.068689 6.2285 4.710e-10
## phisp 0.109421 0.024722 4.4260 9.601e-06
## pblack 0.254920 0.045657 5.5834 2.359e-08
##
## Rho: 0.36363, LR test value: 45.209, p-value: 1.7712e-11
## Asymptotic standard error: 0.053853
## z-value: 6.7523, p-value: 1.4553e-11
## Wald statistic: 45.593, p-value: 1.4553e-11
##
## Log likelihood: -1609.303 for lag model
## ML residual variance (sigma squared): 50.125, (sigma: 7.0799)
## Nagelkerke pseudo-R-squared: 0.59096
## Number of observations: 475
## Number of parameters estimated: 6
## AIC: 3230.6, (AIC for lm: 3273.8)
## LM test for residual autocorrelation
## test value: 0.18386, p-value: 0.66807
# The LM test above (0.18, p = 0.67) suggests the lag term has removed the residual autocorrelation.

# Spatial Durbin Lag Model: type = "mixed" adds the spatially lagged predictors
fit.lag2 <- lagsarlm(ppov ~ plep + phisp + pblack,
                     data = sa_acs2, listw = wts, type = "mixed")
summary(fit.lag2, Nagelkerke = T)
##
## Call:lagsarlm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2,
## listw = wts, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.59221 -3.94023 -0.78062 2.84811 40.58928
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.786308 1.017801 -1.7551 0.07925
## plep 0.369867 0.073809 5.0112 5.410e-07
## phisp 0.160475 0.033431 4.8001 1.586e-06
## pblack 0.566667 0.085724 6.6104 3.833e-11
## lag.plep 0.211005 0.134749 1.5659 0.11737
## lag.phisp -0.104311 0.045954 -2.2699 0.02321
## lag.pblack -0.446957 0.108637 -4.1142 3.885e-05
##
## Rho: 0.37343, LR test value: 33.114, p-value: 8.6917e-09
## Asymptotic standard error: 0.064063
## z-value: 5.8291, p-value: 5.5743e-09
## Wald statistic: 33.978, p-value: 5.5743e-09
##
## Log likelihood: -1598.994 for mixed model
## ML residual variance (sigma squared): 47.929, (sigma: 6.9231)
## Nagelkerke pseudo-R-squared: 0.60834
## Number of observations: 475
## Number of parameters estimated: 9
## AIC: 3216, (AIC for lm: 3247.1)
## LM test for residual autocorrelation
## test value: 12.904, p-value: 0.0003278
# Spatial Durbin Error Model: etype = "emixed" adds the spatially lagged predictors
fit.errdurb <- errorsarlm(ppov ~ plep + phisp + pblack,
                          data = sa_acs2, listw = wts, etype = "emixed")
summary(fit.errdurb, Nagelkerke = T)
##
## Call:errorsarlm(formula = ppov ~ plep + phisp + pblack, data = sa_acs2,
## listw = wts, etype = "emixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.39650 -3.92649 -0.82731 2.77587 40.82837
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.346863 1.514869 -2.2093 0.027151
## plep 0.402120 0.072283 5.5631 2.650e-08
## phisp 0.156135 0.032152 4.8561 1.197e-06
## pblack 0.536194 0.082761 6.4788 9.246e-11
## lag.plep 0.404398 0.149994 2.6961 0.007016
## lag.phisp -0.036084 0.050734 -0.7112 0.476938
## lag.pblack -0.306134 0.115498 -2.6506 0.008036
##
## Lambda: 0.36879, LR test value: 29.863, p-value: 4.6371e-08
## Asymptotic standard error: 0.064931
## z-value: 5.6797, p-value: 1.3494e-08
## Wald statistic: 32.259, p-value: 1.3494e-08
##
## Log likelihood: -1600.62 for error model
## ML residual variance (sigma squared): 48.29, (sigma: 6.9491)
## Nagelkerke pseudo-R-squared: 0.60565
## Number of observations: 475
## Number of parameters estimated: 9
## AIC: 3219.2, (AIC for lm: 3247.1)
# The LM test for residual autocorrelation in the Durbin lag model (12.9, p = 0.0003) says we still have
# significant autocorrelation in the residuals, even after accounting for autocorrelation in the outcome.

# Breusch-Pagan tests for heteroskedasticity in the residuals of the lag and Durbin lag models
bptest.sarlm(fit.lag)
##
## studentized Breusch-Pagan test
##
## data:
## BP = 20.281, df = 3, p-value = 0.0001484
bptest.sarlm(fit.lag2)
##
## studentized Breusch-Pagan test
##
## data:
## BP = 34.085, df = 6, p-value = 6.477e-06
Both Breusch-Pagan tests are significant, suggesting heteroskedasticity remains in the residuals of these models. Finally, we compare the fits of the spatial models using AIC (lower is better):

AIC(fit.err)
## [1] 3231.478
AIC(fit.lag)
## [1] 3230.606
AIC(fit.lag2)
## [1] 3215.989
AIC(fit.errdurb)
## [1] 3219.24
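To see all of the fits in one place, one can collect the AICs into a single table; a sketch, assuming the model objects fit above:

# a sketch: gather the AIC values for all fitted models into one sorted table (lower is better)
aics <- AIC(fit, fit.err, fit.lag, fit.lag2, fit.errdurb)
aics[order(aics$AIC), ]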