Lab 10 Exercises

Eric Goodwin || November 18, 2021

Exercise 1

The file boston.shp in the compressed file boston.tr.zip contains information on house pricing and various social and economic variables for tracts in Boston (the full set of variables is given below. We have explored the local spatial autocorrelation in this data set in a previous lab. Here, you have been asked to produce a statistical model of house prices in Boston, and to explore the various factors that may influence price. You should use the corrected median house values per tract, and log-transform these to remove the skew in the distribution. Unlike the previous questions, there is no single answer to this, instead, you will need to decide on your own modeling strategy among the spatial models we have looked at (including spatial lag and error models, spatial filtering and GWR). At a minimum your answer should include the following:

A map of house prices

A basic linear regression of house prices against several of the other variables that you consider to be important influences on house price

A test of spatial autocorrelation in the residuals of this model

A spatial regression model if the autocorrelation is significant, together with the reason that you chose that modeling approach

Descriptions of goodness-of-fit of this model

A statement describing which variables influence house prices, and the direction, strength and significance of that relationship

A brief statement as to whether or not you believe this model is adequate, and what, if anything, might be missing

library(sf)
library(spdep)
library(spatialreg)
library(tmap)
bos = st_read("../datafiles/boston.tr/boston.tr/boston.shp")
## Reading layer `boston' from data source 
##   `C:\Users\erics\OneDrive\Documents\geog6000\datafiles\boston.tr\boston.tr\boston.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 506 features and 21 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -71.52311 ymin: 42.00305 xmax: -70.63823 ymax: 42.67307
## Geodetic CRS:  NAD27
bos.geom = st_geometry(bos)
bos.coords = st_centroid(bos.geom)

Now that the data has been read in, we’ll visualize median house prices with a simple map

tm_shape(bos) +
  tm_borders() +
  tm_fill("CMEDV")

Now we’ll log transform the median house values

bos$lCMEDV = log(bos$CMEDV)

I decided to model median house prices in Boston as a function of nitric oxides concentration, per capita crime, and proportions of owner-occupied units built prior to 1940.

bos.lm1 = lm(lCMEDV ~ CRIM + NOX + AGE, data = bos)
summary(bos.lm1)
## 
## Call:
## lm(formula = lCMEDV ~ CRIM + NOX + AGE, data = bos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04219 -0.17917 -0.05692  0.15669  1.10788 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.7357017  0.0766299  48.750  < 2e-16 ***
## CRIM        -0.0177576  0.0018263  -9.723  < 2e-16 ***
## NOX         -0.9065077  0.1860311  -4.873 1.48e-06 ***
## AGE         -0.0019561  0.0007424  -2.635  0.00867 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3194 on 502 degrees of freedom
## Multiple R-squared:  0.3918, Adjusted R-squared:  0.3881 
## F-statistic: 107.8 on 3 and 502 DF,  p-value: < 2.2e-16

All variables indicate significance. Now we will test for spatial autocorrelation in the residuals (I did a Monte-Carlo simulatio for Moran I). First we’ll create the neighborhood structure for the dataset (I am using Queen’s case), and the weighted matrix (I am using row standardization).

bos.nbq = poly2nb(bos, queen = TRUE)
bos.listw = nb2listw(bos.nbq)

moran.mc(residuals(bos.lm1), bos.listw, nsim = 999, alternative = 'greater')
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(bos.lm1) 
## weights: bos.listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.58028, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

The autocorrelation in the residuals is significant as indicated by the low p-value of .001.

I decided to run the Lagrange Multiplier test on the linear model to see if we find where the autocorrelation is (either a spatial lag effect, which would indicate autocorrelation in the dependent variable, or a spatial error model)

bos.lmt = lm.LMtests(bos.lm1, bos.listw, test = c("LMerr","LMlag"))
summary(bos.lmt)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = lCMEDV ~ CRIM + NOX + AGE, data = bos)
## weights: bos.listw
##  
##       statistic parameter   p.value    
## LMerr    457.96         1 < 2.2e-16 ***
## LMlag    451.39         1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values are very, very close.I ran a robust test to see if that would help my decision making.

bos.lmt_robust = lm.LMtests(bos.lm1, bos.listw, test = c("RLMerr","RLMlag"))
summary(bos.lmt_robust)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = lCMEDV ~ CRIM + NOX + AGE, data = bos)
## weights: bos.listw
##  
##        statistic parameter   p.value    
## RLMerr    23.879         1 1.026e-06 ***
## RLMlag    17.304         1 3.186e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given the fact both values are still very similar, I decided to run both lag and error models, as well as do a spatial filter, to see which has the lowest AIC (goodness-of-fit). For the spatial filter I chose an alpha value of .1, because I was worried that setting a value too high would lead to overfitting (I may be mistaken in this regard), and .1 is still beyond a .05 threshold for significance.

bos.lag1 = lagsarlm(lCMEDV ~ CRIM + NOX + AGE, data= bos, bos.listw)
summary(bos.lag1)
## 
## Call:
## lagsarlm(formula = lCMEDV ~ CRIM + NOX + AGE, data = bos, listw = bos.listw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.658086 -0.125051 -0.027810  0.098778  0.973654 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)  0.87148655  0.13091238  6.6570 2.794e-11
## CRIM        -0.00707410  0.00124945 -5.6618 1.498e-08
## NOX         -0.17554449  0.12730970 -1.3789    0.1679
## AGE         -0.00074403  0.00048973 -1.5193    0.1287
## 
## Rho: 0.76785, LR test value: 353.15, p-value: < 2.22e-16
## Asymptotic standard error: 0.031018
##     z-value: 24.755, p-value: < 2.22e-16
## Wald statistic: 612.82, p-value: < 2.22e-16
## 
## Log likelihood: 38.15841 for lag model
## ML residual variance (sigma squared): 0.043404, (sigma: 0.20834)
## Number of observations: 506 
## Number of parameters estimated: 6 
## AIC: -64.317, (AIC for lm: 286.83)
## LM test for residual autocorrelation
## test value: 0.11507, p-value: 0.73444
bos.err1 = errorsarlm(lCMEDV ~ CRIM + NOX + AGE, data= bos, bos.listw)
summary(bos.err1)
## 
## Call:
## errorsarlm(formula = lCMEDV ~ CRIM + NOX + AGE, data = bos, listw = bos.listw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68881 -0.12934 -0.01476  0.09534  0.84955 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)  4.09073148  0.14368484 28.4702 < 2.2e-16
## CRIM        -0.00667508  0.00146178 -4.5664 4.961e-06
## NOX         -1.58301595  0.24184204 -6.5457 5.923e-11
## AGE         -0.00284776  0.00068406 -4.1630 3.141e-05
## 
## Lambda: 0.83774, LR test value: 386.39, p-value: < 2.22e-16
## Asymptotic standard error: 0.02687
##     z-value: 31.177, p-value: < 2.22e-16
## Wald statistic: 972.03, p-value: < 2.22e-16
## 
## Log likelihood: 54.77833 for error model
## ML residual variance (sigma squared): 0.038972, (sigma: 0.19741)
## Number of observations: 506 
## Number of parameters estimated: 6 
## AIC: -97.557, (AIC for lm: 286.83)
bos.sferr = SpatialFiltering(bos.lm1,
                             nb = bos.nbq,
                             style = "C",
                             alpha = 0.1,
                             ExactEV = FALSE,
                             data = bos)
fit.bos.sferr = fitted(bos.sferr)
bos.lm2 = lm(lCMEDV ~ CRIM + NOX + AGE + fit.bos.sferr,
             data = bos)
AIC(bos.lm2)
## [1] -235.5752

The AIC scores for the spatial lag model, spatial error model, and spatial filter models, respectively, are -64.32, -97.56, and -235.58. Thus I decided to go with the linear model with spatial filter applied.

I will now test the residuals of the spatial filter model to see if we accounted for the autocorrelation in the model.

moran.test(residuals(bos.lm2), bos.listw)
## 
##  Moran I test under randomisation
## 
## data:  residuals(bos.lm2)  
## weights: bos.listw    
## 
## Moran I statistic standard deviate = -1.6849, p-value = 0.954
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##     -0.0473903596     -0.0019801980      0.0007263577

The p-value of 0.954 tells me we’ve accounted for the spatial autocorrelation in the residuals very well as the likelihood of spatial autocorrelation is very low.

summary(bos.lm2)$coefficients[1:4,]
##                 Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)  3.735701718 0.0434340492  86.008599 8.239615e-280
## CRIM        -0.017757600 0.0010351349 -17.154866  5.023592e-51
## NOX         -0.906507670 0.1054429080  -8.597142  1.394782e-16
## AGE         -0.001956122 0.0004207764  -4.648840  4.403022e-06

Looking back to the spatial filter model, the coefficients indicate that median house value decreases with increases in crime, nitric oxides concentration, and housing age. All values are significant, but crime and nitric oxides exert the most strength on bringing housing values down.

I am not entirely sure whether this model is adequate, as the p-values for all my models I ran indicated such high significance. It is possible I could be missing a variable as my spatial filter had so much autocorrelation to account for. Also, the spatial filter model included a whopping 60 moran eigenvectors. That seems like a bit much.