Load Libraries

Read in Data

bos = st_read("datafiles/boston.tr/boston.shp", quiet = TRUE)

Log-transform corrected median house values per tract

bos$CMEDV.log = log(bos$CMEDV)

Create a map of house prices

plot(st_geometry(bos$geometry))

my.pal <- brewer.pal(9, "YlOrRd") 
plot(bos["CMEDV.log"], 
     main = "Log Corrected Median Housing Prices",
     col = my.pal)
## Warning in plot.sf(bos["CMEDV.log"], main = "Log Corrected Median Housing
## Prices", : col is not of length 1 or nrow(x): colors will be recycled; use pal
## to specify a color palette

A basic linear regression of house prices against several of the other variables that you consider to be important influences on house price

bos.lm <- lm(CMEDV.log ~ CRIM + DIS + RAD + RM, data = bos)
summary(bos.lm)
## 
## Call:
## lm(formula = CMEDV.log ~ CRIM + DIS + RAD + RM, data = bos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93756 -0.12371 -0.00179  0.12929  1.43313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.222033   0.113774  10.741  < 2e-16 ***
## CRIM        -0.013782   0.001772  -7.779 4.19e-14 ***
## DIS          0.007229   0.006500   1.112    0.267    
## RAD         -0.008289   0.001855  -4.468 9.78e-06 ***
## RM           0.304559   0.017338  17.566  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2643 on 501 degrees of freedom
## Multiple R-squared:  0.5842, Adjusted R-squared:  0.5809 
## F-statistic:   176 on 4 and 501 DF,  p-value: < 2.2e-16

Test of spatial autocorrelation - build weight matrix

bos.nb <- poly2nb(bos, queen = TRUE)
coords <- st_centroid(st_geometry(bos))
plot(st_geometry(bos), reset = FALSE)
plot(bos.nb, coords, add = TRUE, col = "gray")

bos.listw = nb2listw(bos.nb)

Test for spatial autocorrelation of the residuals of model

moran.test(residuals(bos.lm), bos.listw)
## 
##  Moran I test under randomisation
## 
## data:  residuals(bos.lm)  
## weights: bos.listw    
## 
## Moran I statistic standard deviate = 21.476, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5739831446     -0.0019801980      0.0007192333

There is significant autocorrelation in the residuals due to the Moran I statistic of 0.5946675971 and the p value of 2.2e-16.

A spatial regression model if the autocorrelation is significant, together with the reason that you chose that modeling approach

lmt.robust <- lm.LMtests(bos.lm, bos.listw, test = c("RLMerr","RLMlag"))
summary(lmt.robust)
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = CMEDV.log ~ CRIM + DIS + RAD + RM, data = bos)
## weights: bos.listw
##  
##        statistic parameter   p.value    
## RLMerr    58.209         1 2.354e-14 ***
## RLMlag    54.473         1 1.577e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The RMLag is more significant than the RLMerr, so I chose to create a spatial lag model

Spatial Lag Model

bos.fit2 = lagsarlm(CMEDV.log ~ CRIM + DIS + RAD + RM, data = bos, bos.listw)
summary(bos.fit2)
## 
## Call:lagsarlm(formula = CMEDV.log ~ CRIM + DIS + RAD + RM, data = bos, 
##     listw = bos.listw)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.6107300 -0.0919042 -0.0040206  0.0905304  1.0154026 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.2630091  0.0862793 -3.0483  0.002301
## CRIM        -0.0065771  0.0011760 -5.5929 2.234e-08
## DIS         -0.0078900  0.0042086 -1.8747  0.060828
## RAD         -0.0015651  0.0012261 -1.2765  0.201788
## RM           0.1963733  0.0128891 15.2357 < 2.2e-16
## 
## Rho: 0.70051, LR test value: 378.53, p-value: < 2.22e-16
## Asymptotic standard error: 0.027458
##     z-value: 25.512, p-value: < 2.22e-16
## Wald statistic: 650.86, p-value: < 2.22e-16
## 
## Log likelihood: 147.1156 for lag model
## ML residual variance (sigma squared): 0.029131, (sigma: 0.17068)
## Number of observations: 506 
## Number of parameters estimated: 7 
## AIC: -280.23, (AIC for lm: 96.302)
## LM test for residual autocorrelation
## test value: 7.8452, p-value: 0.0050955

The AIC for this model is -60.437 and the log-likelihood is 37.21845. Based on the AIC and log-likelihood, I would say the model has a high goodness of fit based on this and when compared to the original model.

Test the residuals for any remaining autocorrelation:

moran.mc(residuals(bos.fit2), 
         listw = bos.listw, 
         nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(bos.fit2) 
## weights: bos.listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.057866, observed rank = 989, p-value = 0.011
## alternative hypothesis: greater

The variables that influence housing prices are CRIM (Per capita crime ), DIS (Weighted distances to five Boston employment centers ), RAD (Index of accessibility to radial highways per town ), and RM (Average numbers of rooms per dwelling ). So, the less crime in an area, the smaller distance to an employment center, higher a ccessibility to highways per town, and the more rooms per dwelling all influence housing prices.

I think this model could be missing some variables that influence housing prices based on the Moran I statistic of -0.012252 and the high p value of 0.671 in the test for autocorrelation in the residuals.