Spatial Regression II

library(RColorBrewer) #color palettes
library(sf) #simple features for spatial data
library(spdep) #spatial regression
library(spatialreg) #spatial regression
library(spgwr) #geographically weighted regression
library(tmap) #mapping package
library(kableExtra) #table modification
library(dbscan) #density based clustering
library(ggplot2) #plotting results
library(ggpubr) ##Arranging plots
library(Hmisc) ##describe function
library(tidyverse) ##data formatting package

## Set working directory
setwd("~/Desktop/University of Utah PhD /Course Work/Fall 2022 Semester/GEOG6000_Data Analysis/lab10")

Columbus Crime Dataset

Reading the Dataset

##Read in the data

col = st_read("../datafiles/columbus/columbus.shp",
              quiet = TRUE)

##Log transform the covariates

col$lINC = log(col$INC)

col$lHOVAL = log(col$HOVAL)

Building the Spatial Weight Matrix

##Neighborhood Structure
col.nbq = poly2nb(col, queen = TRUE)

col.nbq
## Neighbour list object:
## Number of regions: 49 
## Number of nonzero links: 236 
## Percentage nonzero weights: 9.829238 
## Average number of links: 4.816327
##Weight Matrix using the neighborhood structure in the previous step
col.listw = nb2listw(col.nbq)
##Extract the Columbus Geometry
col.geom = st_geometry(col)

##Build the centroid geometry
coords = st_coordinates(st_centroid(col))

##Visualize 

plot(col.geom,
     col = "lightgray",
     border = "white",
     reset = FALSE)
plot(col.nbq, coords,
     add = TRUE,
     col = 3,
     lwd = 2)

Spatial Filtering

From Simon’s Lab:

“Spatial Filtering is carried out as a series of steps:

  • Start by building the simple OLS regression model between the dependent and independent variables. Test for spatial autocorrelation.

  • Use the SpatialFiltering() function to run the stepwise procedure to select which Moran eigenvectors should be used to filter the data. The selected eigenvector at each step is the one that reduces the Moran’s I values for the model the most (i.e. filters the most spatial autocorrelation)

  • Rebuild the model including the selected set of Moran eigenvectors

Step 1: OLS Build

##Step 1: Build OLS model

lm.ols = lm(CRIME ~ lHOVAL + lINC,
            data = col)

summary(lm.ols)
## 
## Call:
## lm(formula = CRIME ~ lHOVAL + lINC, data = col)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.361  -5.997  -1.521   7.432  28.013 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  138.074     14.542   9.495 2.06e-12 ***
## lHOVAL       -14.924      4.568  -3.267 0.002059 ** 
## lINC         -19.272      5.024  -3.836 0.000379 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.6 on 46 degrees of freedom
## Multiple R-squared:  0.5392, Adjusted R-squared:  0.5191 
## F-statistic: 26.91 on 2 and 46 DF,  p-value: 1.827e-08
##Run the Morans I to test for spatial autocorrelation

moran.mc(residuals(lm.ols),
          listw = col.listw,
          nsim = 999,
          alternative = "greater")
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(lm.ols) 
## weights: col.listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.21308, observed rank = 987, p-value = 0.013
## alternative hypothesis: greater
## Results indicate there is spatial autocorrelation

Step 2: Filter Build

## Step 2: Build the spatial filter to step through various moran's eigenvectors to figure out which ones should be used to filter the data. 

## Use the following parameters:

##The neighborhood structure (not the spatial weight matrix)

##The weights to be used (style).We set this here to C which is a global standardization. To see other options, look at the help for nb2listw().


##The stopping rule (alpha). Moran eigenvectors are selected until the p-value of the stepwise Moran’s I test exceed this value


##ExactEV = TRUE - calculates exact values for Moran’s I expected value and variance. If this is set to FALSE, the values will be calculated using a fast approximation. This can greatly reduce the amount of time required, especially with large datasets.

sf.err = SpatialFiltering(lm.ols,
                          nb = col.nbq,
                          style = "C",
                          alpha = 0.25,
                          ExactEV = TRUE,
                          data = col)

sf.err
##   Step SelEvec      Eval      MinMi   ZMinMi      Pr(ZI)        R2     gamma
## 0    0       0 0.0000000 0.21262754 2.965794 0.003019023 0.5391541   0.00000
## 1    1       4 0.6711982 0.12538574 2.132225 0.032988341 0.6128150 -31.46222
## 2    2       3 0.7999718 0.01474576 1.053091 0.292299441 0.6673701  27.07627

The output table reads as follows:

  • Step: the step number

  • SelEvec/Eval: the eigenvalue of the selected eigenvector

  • MinMi: Moran’s I for residuals with that eigenvector (and previous eigenvectors) included as a filter

  • ZMinMi: Moran’s I for residuals transformed to z-score

  • Pr(ZI): the p-value for the z-score

  • R2: the R2 value of the model with eigenvectors included

  • gamma: regression coefficient of selected eigenvector in fit

## Extract eigenvector 4 to plot it

col$ME4 = sf.err$dataset[ , "vec4"]

## Plot it

tm_shape(col) +
  tm_fill(col = "ME4", palette = "BuPu") +
  tm_borders() +
  tm_layout(main.title = "Moran Eigenvector 4",
            main.title.size = 1.2)
## Warning: Currect projection of shape col unknown. Long-lat (WGS84) is assumed.

Step 3: Re-build the Model

## Extract the chosen eigenvectors from the filter (in this case vec 3 and 4) to use for our new model 
E.sel = fitted(sf.err)

## Build the new model
lm.sf = lm(CRIME ~ lHOVAL + lINC + E.sel,
            data = col)

summary(lm.sf)
## 
## Call:
## lm(formula = CRIME ~ lHOVAL + lINC + E.sel, data = col)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.222  -5.912  -0.805   6.981  17.897 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  138.074     12.632  10.930 4.00e-14 ***
## lHOVAL       -14.924      3.968  -3.761 0.000496 ***
## lINC         -19.272      4.364  -4.416 6.45e-05 ***
## E.selvec4    -31.462     10.079  -3.122 0.003175 ** 
## E.selvec3     27.076     10.079   2.686 0.010153 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.08 on 44 degrees of freedom
## Multiple R-squared:  0.6674, Adjusted R-squared:  0.6371 
## F-statistic: 22.07 on 4 and 44 DF,  p-value: 4.77e-10
##Test the filtered models residuals for autocorrelation

moran.test(residuals(lm.sf),
      listw = col.listw)
## 
##  Moran I test under randomisation
## 
## data:  residuals(lm.sf)  
## weights: col.listw    
## 
## Moran I statistic standard deviate = 0.3903, p-value = 0.3482
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.014810731      -0.020833333       0.008340303
##This model accounts for the spatial autocorrelation well
##Compare the models using an ANOVA

anova(lm.ols, lm.sf)
## Analysis of Variance Table
## 
## Model 1: CRIME ~ lHOVAL + lINC
## Model 2: CRIME ~ lHOVAL + lINC + E.sel
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     46 6192.9                                  
## 2     44 4470.0  2      1723 8.4802 0.0007672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##We reject the null hypothesis that the simple linear model is better

Geographically Weighted Regression

##Read in the data

south.sf = st_read("../datafiles/south/south00.shp", 
                   quiet = TRUE)

##The geometry is valid. Correct it with st_make_valid
south.sf = st_make_valid(south.sf)

##Mapping poverty

tm_shape(south.sf) +
  tm_fill("PPOV", palette = "BrBG") +
  tm_layout(main.title = "Southeastern US Poverty Rates")

            #legend.outside = TRUE,
            #legend.outside.position = "left")
##Queens Structure
south_nb = poly2nb(south.sf, queen = TRUE)

## "W" is for row standardized weights
south_listw = nb2listw(south_nb, style = "W")

## Moran's I for poverty

moran.test(south.sf$PPOV,
         listw = south_listw)
## 
##  Moran I test under randomisation
## 
## data:  south.sf$PPOV  
## weights: south_listw    
## 
## Moran I statistic standard deviate = 36.544, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5892974827     -0.0007215007      0.0002606743
## Autocorrelation present!
## Why are we using the square root? To normalize the data?
## The square root of Poverty as a function of the following:

## PFHH: Percent female head of household
## PUNEM: Percent unemployed
## PBLK: Percent Black in population
## P65UP: Percent over 65 years old
## METRO: Metropolitan area (binary)
## PHSPLUS: Percent with education beyond high school

fit1 = lm(SQRTPPOV ~ PFHH + PUNEM + PBLK + P65UP + METRO + PHSPLUS, 
           data = south.sf)

summary(fit1)
## 
## Call:
## lm(formula = SQRTPPOV ~ PFHH + PUNEM + PBLK + P65UP + METRO + 
##     PHSPLUS, data = south.sf)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.234660 -0.032609 -0.000461  0.029598  0.249979 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.34195    0.01074  31.829  < 2e-16 ***
## PFHH         0.64126    0.04349  14.747  < 2e-16 ***
## PUNEM        1.58472    0.06938  22.841  < 2e-16 ***
## PBLK        -0.10640    0.01647  -6.459 1.46e-10 ***
## P65UP        0.17417    0.04106   4.242 2.37e-05 ***
## METRO       -0.02633    0.00338  -7.790 1.31e-14 ***
## PHSPLUS     -0.30564    0.01495 -20.439  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05042 on 1380 degrees of freedom
## Multiple R-squared:  0.7286, Adjusted R-squared:  0.7275 
## F-statistic: 617.6 on 6 and 1380 DF,  p-value: < 2.2e-16
moran.test(residuals(fit1), 
           listw = south_listw)
## 
##  Moran I test under randomisation
## 
## data:  residuals(fit1)  
## weights: south_listw    
## 
## Moran I statistic standard deviate = 22.706, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.3657562577     -0.0007215007      0.0002604946
## Spatial Autocorrelation Present!
## Extract the residuals as a new column in the south.sf simple feature
south.sf$lm.res = residuals(fit1)


## Visualize them

tm_shape(south.sf) +
  tm_fill("lm.res", palette = "GnBu") +
  tm_layout(main.title = "OLS Model Residuals (fit1)")

            #legend.outside = TRUE,
            #legend.outside.position = "left")
## Two (2) steps using the spgwr pacakge

## Step 1 - Assess the best window size. You can use two approaches:

## Fixed size: each window will have the same bandwidth or size, so models in data sparse areas will have fewer observations

## Adaptive: rather than setting a single window size, the window for each model is chosen to capture the same number of observations. Adaptive is typically prefered

## Step 1.1 - Extract centroids to use in the distance calculations

sf.coords = st_coordinates(st_centroid(south.sf))

## Step 1.2 - Find the value of "q": the proportion of points in each window
## Adative approach (adapt = TRUE)
## Verbosee - reports progress as the code runs
## See ?gwr.sel for other elements

south.bw = gwr.sel(SQRTPPOV ~ PFHH + PUNEM + PBLK + P65UP + METRO + PHSPLUS,
                    data = south.sf, 
                    coords = sf.coords, 
                    adapt = TRUE, 
                    gweight = gwr.Gauss,
                    method = "cv", 
                    verbose = TRUE)
## Adaptive q: 0.381966 CV score: 3.066879 
## Adaptive q: 0.618034 CV score: 3.221821 
## Adaptive q: 0.236068 CV score: 2.927806 
## Adaptive q: 0.145898 CV score: 2.798224 
## Adaptive q: 0.09016994 CV score: 2.67092 
## Adaptive q: 0.05572809 CV score: 2.550302 
## Adaptive q: 0.03444185 CV score: 2.447888 
## Adaptive q: 0.02128624 CV score: 2.389188 
## Adaptive q: 0.01315562 CV score: 2.383642 
## Adaptive q: 0.0153006 CV score: 2.375256 
## Adaptive q: 0.01677638 CV score: 2.373892 
## Adaptive q: 0.01659908 CV score: 2.374078 
## Adaptive q: 0.01849899 CV score: 2.376264 
## Adaptive q: 0.01743436 CV score: 2.374001 
## Adaptive q: 0.01704878 CV score: 2.373764 
## Adaptive q: 0.01708947 CV score: 2.373761 
## Adaptive q: 0.01713016 CV score: 2.373761 
## Adaptive q: 0.01708947 CV score: 2.373761
south.bw
## [1] 0.01708947
dim(south.sf)[1] * south.bw
## [1] 23.7031
## Step 2 - Build the model using the output of q from the previous step using the adapt = south.bw syntax

## Note - this is building 1387 local models!!

south.gwr = gwr(SQRTPPOV ~ PFHH + PUNEM + PBLK + P65UP + METRO + PHSPLUS,
                 data = south.sf, 
                 coords = sf.coords, 
                 adapt = south.bw, 
                 gweight = gwr.Gauss, 
                 hatmatrix = TRUE)

south.gwr
## Call:
## gwr(formula = SQRTPPOV ~ PFHH + PUNEM + PBLK + P65UP + METRO + 
##     PHSPLUS, data = south.sf, coords = sf.coords, gweight = gwr.Gauss, 
##     adapt = south.bw, hatmatrix = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.01708947 (about 23 of 1387 data points)
## Summary of GWR coefficient estimates at data points:
##                    Min.    1st Qu.     Median    3rd Qu.       Max.  Global
## X.Intercept.  0.2005339  0.2799681  0.3398143  0.4037562  0.5059632  0.3420
## PFHH         -0.0405331  0.5569098  0.7030943  0.8323044  1.2702785  0.6413
## PUNEM         0.1308578  0.7678957  1.0349647  1.4880431  2.2959962  1.5847
## PBLK         -0.5843126 -0.1491605 -0.0610091  0.0328856  0.5007957 -0.1064
## P65UP        -1.0307935  0.0651806  0.2045079  0.3702337  0.9033343  0.1742
## METRO        -0.0527382 -0.0229680 -0.0144405 -0.0084449  0.0096279 -0.0263
## PHSPLUS      -0.5564329 -0.3728045 -0.3012876 -0.2458864 -0.1088166 -0.3056
## Number of data points: 1387 
## Effective number of parameters (residual: 2traceS - traceS'S): 244.3255 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 1142.675 
## Sigma (residual: 2traceS - traceS'S): 0.03911912 
## Effective number of parameters (model: traceS): 173.8689 
## Effective degrees of freedom (model: traceS): 1213.131 
## Sigma (model: traceS): 0.03796614 
## Sigma (ML): 0.03550685 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): -4923.036 
## AIC (GWR p. 96, eq. 4.22): -5149.69 
## Residual sum of squares: 1.748641 
## Quasi-global R2: 0.8647189

Note:

The output of the model gives summary statistics about the range of coefficients across all models built (n=1387, the number of counties). In the model diagnostics, you will find an AIC value, which may be used in model comparison, and an R2 value, which should be treated with some caution.

The output from the gwr() function includes a large list object with a lot of information in it. One object in this list is a SpatialPointsDataFrame. This is the older form of spatial object in R, and contains various information about the local models. We’ll now use this to plot out some of the model results. It is possible to convert these to an sf object, but it is easier to simply assign new columns in the existing south.sf object with the results we want to visualize.

##Explore the gwr object

class(south.gwr$SDF)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
##str(south.gwr) was useful as well

##Extract the local r^2 values

south.sf$localr2 = south.gwr$SDF$localR2

##Visualize it

tm_shape(south.sf) +
  tm_fill("localr2", palette = "RdPu") +
  tm_layout(main.title = "GWR Local r2 Output")

  • Areas with low R2 values may indicate model misspecification - missing independent variables (e.g. West Texas).
## Extract the coefficient
south.sf$beta_P65UP = south.gwr$SDF$P65UP

##Visualization
tm_shape(south.sf) + 
  tm_fill("beta_P65UP", palette = "PRGn", n = 9) +
  tm_layout(main.title = "Coefficient for P65UP")

##How do we do this for p values???

Spatial Hierarchical Models

OPTIONAL SECTION

Exercise

1.1 Map of House Prices

## Read in the data

bos = st_read("../datafiles/boston.tr/boston.shp",
              quiet = TRUE)


## Log transform the CMEDV

bos$logCMEDV = log(bos$CMEDV)



cmedv.hist = ggplot(data = bos, aes(x = CMEDV)) +
  geom_histogram(binwidth = 5) +
  labs(title = "CMEDV Histogram") +
  theme_bw()


logcmedv.hist = ggplot(data = bos, aes(x = logCMEDV)) +
  geom_histogram(binwidth = 0.2) +
  labs(title = "Log CMEDV Histogram") +
  theme_bw()


ggarrange(cmedv.hist, logcmedv.hist, ncol = 2, nrow = 1)

## Check the geometry

# st_is_valid(bos)

## Valid

tm_shape(bos) +
  tm_fill(col = "CMEDV", palette = "GnBu", title = "House Value (USD 1000)") +
  tm_borders(col = "gray90", lwd = 0.5) +
   tm_style("cobalt") +
  tm_layout(main.title = "Greater Boston Area Median House Value",
            main.title.size = 1) +
  tm_compass(type = "arrow",
             position = c("left", "bottom")) +
  tm_scale_bar(position = c("center", "bottom"),
               color.dark = "gray50")

1.2 Basic Liniear Regression

## Variable Choices

## CRIM - per capita crime - need to transform
## RM - average numbers of rooms per dwelling - good
## AGE - proportions of owner-occupied units built prior to 1940 - maybe?
## DIS - weighted distances to five Boston employment centers - transform
## TAX - property tax - transforming doesn't fix the distribution
## LSTAT - Percent lower socioeconomic status -transform

## log transform skewed variables
bos$logCRIM = log(bos$CRIM)
#bos$logAGE = log(bos$AGE)
bos$logDIS = log(bos$DIS)
bos$logLSTAT = log(bos$LSTAT)

## Question on this - Does transforming independent variables minimize the behavior/weight of outliers? 

## bos.lm0 = lm(logCMEDV ~ CRIM + RM + AGE + DIS + TAX + LSTAT,
            #data = bos)

bos.lm = lm(logCMEDV ~ logCRIM + RM + AGE + logDIS + TAX + logLSTAT,
            data = bos)

summary(bos.lm)
## 
## Call:
## lm(formula = logCMEDV ~ logCRIM + RM + AGE + logDIS + TAX + logLSTAT, 
##     data = bos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91890 -0.11081 -0.00785  0.12021  0.88008 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.8527975  0.1724392  22.343  < 2e-16 ***
## logCRIM     -0.0193056  0.0094096  -2.052 0.040722 *  
## RM           0.0874990  0.0185950   4.706 3.28e-06 ***
## AGE          0.0004103  0.0005950   0.690 0.490781    
## logDIS      -0.1216113  0.0319548  -3.806 0.000159 ***
## TAX         -0.0005013  0.0001005  -4.989 8.40e-07 ***
## logLSTAT    -0.4480165  0.0276209 -16.220  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2103 on 499 degrees of freedom
## Multiple R-squared:  0.7378, Adjusted R-squared:  0.7347 
## F-statistic: 234.1 on 6 and 499 DF,  p-value: < 2.2e-16

1.3 Spatial Autocorrelation Test (Residuals)

## Build the neighborhood structure

bos.nbq = poly2nb(bos, queen = TRUE)

## Build the weight matrix

bos.listw = nb2listw(bos.nbq)


## Extract the centroids and geometry

bos.coords = st_coordinates(st_centroid(bos))
bos.geom = st_geometry(bos)

# Visualize the struture

plot(bos.geom,
     col = "lightgray",
     border = "white",
     reset = FALSE)
plot(bos.nbq, bos.coords,
     add = TRUE,
     col = 3,
     lwd = 2)

bos.mor.mc = moran.mc(residuals(bos.lm),
          listw = bos.listw,
          nsim = 999,
          alternative = "greater")

bos.mor.mc
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(bos.lm) 
## weights: bos.listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.5044, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

1.4 Spatial Regression Model

## Non-robust test

bos.lmt = lm.LMtests(bos.lm, listw = bos.listw, test = c("LMerr", "LMlag"))

bos.lmt.sum = summary(bos.lmt)

bos.lmt.sum
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = logCMEDV ~ logCRIM + RM + AGE + logDIS + TAX +
## logLSTAT, data = bos)
## weights: bos.listw
##  
##       statistic parameter   p.value    
## LMerr    346.02         1 < 2.2e-16 ***
## LMlag    328.26         1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Robust Test
## RLM syntax for the test

bos.lmt.robust = lm.LMtests(bos.lm, listw = bos.listw, test = c("RLMerr", "RLMlag"))

bos.lmt.robust.sum = summary(bos.lmt.robust)

bos.lmt.robust.sum
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = logCMEDV ~ logCRIM + RM + AGE + logDIS + TAX +
## logLSTAT, data = bos)
## weights: bos.listw
##  
##        statistic parameter   p.value    
## RLMerr    76.680         1 < 2.2e-16 ***
## RLMlag    58.914         1 1.643e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Choice and Explanation

  • The results of the Moran’s I test - a large statistic (0.5044011), observed rank (1000), and statistically significant p-value (0.001) causes us to conclude that there is significant spatial autocorrelation among our residuals.

  • The results of the Lagrange Multiplier robust test indicate that that the spatial error is the more important cause of spatial autocorrelation (Statistic - 76.679671 and p-value - 0), although the spatial lag is also quite significant as well (Statistic - 58.913662 and p-value - 1.6431301^{-14}). As a result, I chose to start with a spatial error model to account for the spatial autocorrelation.

##Spatial Error Model

bos.error.lm = errorsarlm(logCMEDV ~ logCRIM + RM + AGE + logDIS + TAX + logLSTAT,
            data = bos,
            listw = bos.listw)

bos.error.lm.sum = summary(bos.error.lm)

bos.error.lm.sum 
## 
## Call:errorsarlm(formula = logCMEDV ~ logCRIM + RM + AGE + logDIS + 
##     TAX + logLSTAT, data = bos, listw = bos.listw)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.5865394 -0.0745839 -0.0035055  0.0815622  0.8391321 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)  3.09839059  0.15982075  19.3867 < 2.2e-16
## logCRIM     -0.02122957  0.00937961  -2.2634 0.0236126
## RM           0.13081701  0.01536215   8.5155 < 2.2e-16
## AGE         -0.00085280  0.00054118  -1.5758 0.1150672
## logDIS      -0.07030100  0.05536721  -1.2697 0.2041834
## TAX         -0.00035684  0.00010372  -3.4403 0.0005811
## logLSTAT    -0.26768263  0.02386178 -11.2180 < 2.2e-16
## 
## Lambda: 0.79718, LR test value: 298.04, p-value: < 2.22e-16
## Asymptotic standard error: 0.030895
##     z-value: 25.803, p-value: < 2.22e-16
## Wald statistic: 665.8, p-value: < 2.22e-16
## 
## Log likelihood: 223.5317 for error model
## ML residual variance (sigma squared): 0.020523, (sigma: 0.14326)
## Number of observations: 506 
## Number of parameters estimated: 9 
## AIC: -429.06, (AIC for lm: -133.03)
bos$err.resid = residuals(bos.error.lm)

tm_shape(bos) +
  tm_fill("err.resid", palette = "YlOrRd" ) +
  tm_style(style = "cobalt") +
  tm_borders(col = "lightgray") +
  tm_layout(main.title = "Error Model Residuals")

plot(residuals(bos.error.lm),
     main = "Error Model Residuals Plot")
abline(v = 90, col = "red")
abline(v = 0, col = "red")

## Slightly heteroskedastic in this region.
bos.error.lm.moran = moran.mc(residuals(bos.error.lm),
                              listw = bos.listw,
                              nsim = 999,
                              alternative = "two.sided")

bos.error.lm.moran
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(bos.error.lm) 
## weights: bos.listw  
## number of simulations + 1: 1000 
## 
## statistic = -0.038557, observed rank = 86, p-value = 0.172
## alternative hypothesis: two.sided

1.5 Model Goodness-of-Fit

bos.error.lm.sum 
## 
## Call:errorsarlm(formula = logCMEDV ~ logCRIM + RM + AGE + logDIS + 
##     TAX + logLSTAT, data = bos, listw = bos.listw)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.5865394 -0.0745839 -0.0035055  0.0815622  0.8391321 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)  3.09839059  0.15982075  19.3867 < 2.2e-16
## logCRIM     -0.02122957  0.00937961  -2.2634 0.0236126
## RM           0.13081701  0.01536215   8.5155 < 2.2e-16
## AGE         -0.00085280  0.00054118  -1.5758 0.1150672
## logDIS      -0.07030100  0.05536721  -1.2697 0.2041834
## TAX         -0.00035684  0.00010372  -3.4403 0.0005811
## logLSTAT    -0.26768263  0.02386178 -11.2180 < 2.2e-16
## 
## Lambda: 0.79718, LR test value: 298.04, p-value: < 2.22e-16
## Asymptotic standard error: 0.030895
##     z-value: 25.803, p-value: < 2.22e-16
## Wald statistic: 665.8, p-value: < 2.22e-16
## 
## Log likelihood: 223.5317 for error model
## ML residual variance (sigma squared): 0.020523, (sigma: 0.14326)
## Number of observations: 506 
## Number of parameters estimated: 9 
## AIC: -429.06, (AIC for lm: -133.03)

Goodness-Of-Fit Assessment

  • Lambda for this model is 0.7971773 which is the autoregressive coefficient for this model. In this case, it represents a strong autocorrelation in the residuals of the linear model.

  • The p- value for this model is well below our critical threshold indicating this model is statistically significant.

  • The AIC for this model is approximately -430 compared to an AIC of -133.0272299. This model is a significant improvement over the basic linear model.

  • When mapping and plotting the residuals, it appears that this model does well in dealing with the spatial autocorrelation. The residuals map reveals generally a random pattern for this model’s residuals. The residuals plot is slightly heteroskedastic between the 0 and 100 index indicating that this model did not completely address the autocorrelation issue. This could be for a variety of reasons - the model may be misspecified and is missing a key variable(s) for particular local areas. The general model may be missing a variable as well. (See figures above)

  • Overall, the model is a huge improvement over the basic ordinary least squares linear model.

1.6 Variables that Influence House Prices

Variable Assessment

  • The statistically significant variables that influence house value are:

  • Per Capita Crime (logCRIM): The logCRIM coefficient is -0.0212296. This is a negative relationship with house value as the log crime rate increases the log corrected median house value (logCMEDV) decreases. The p-value, is statistically significant, although it is just slightly below our critical threshold. The relatively small z-score, coupled with the p-value indicates this relationship, while statisitically significant, is not as strong as other variables.

  • Average Number of Rooms (RM): The RM coefficient is 0.130817. This is a positive relationship with house value as the room size increases, it increases the logCMEDV. The z-score and p-value are both extremely significant which indicates this relationship strongly affects home value.

  • Property Tax (TAX): The TAX coefficient is -3.5683593^{-4}. This is a negative relationship as the tax rate increases the logCMEDV decreases. The z-score and p-value are both statistically significant. Although it is significant, the coefficient and slope is extremely small which means we don’t expect to see huge changes in house value as a result from property tax.

  • Percentage of Lower Status Population (logLSTAT): The logLSTAT coefficient is -0.2676826. This is a negative relationship as the logLSTAT increases, the logCMEDV decreases. The z-score and p-value are both extremely significant indicating a strong relationship between socioecononmic status and house value.

1.7 Model Assessment

Overall Assessment

  • The spatial error model has some shortcomings despite its statistically significant results. The model does well to account for spatial autocorrelation, however, there is some heteroskedasticity remaining in the residuals. There is model is likely misspecified and is missing other important covariates. It may be worth conducting a geographically weighted regression to explore the importance and spatial distribution of the independent variables.

  • The Lagrange Multiplier robust test demonstrated that there is a significant spatial lag effect present within the data set as well. By choosing the spatial error model for our approach, we ignored spatially lagged terms which contributes to overestimating our coefficients and inaccurate p-values. Although our Moran’s Monte Carlo test for the error model led us to accept the null hypothesis that there is no spatial autocorrelation, it may be worth building a spatial lag model to compare against the spatial error model.

  • In conclusion, I hesitate to use the spatial error model to explain house value in Boston as we are likely missing covariates and are not accounting for spatial lag effects. If this were an enduring project, I would continue to explore the data using a geographically weighted regression, a spatial lag model, and potentially a SAC/SAR model.