In the following analysis, I would like to see the distribution of rent (DP04_0134E) among the census tracts in the Alamo Area Council of Governments (AACOG), Texas, and how is rent affected by race/ethnicity and median household income? The data are gathered using the get_acs function of the tidycensus package, which essentially provides access to the American Community Survey data. For the sake of this analysis, the data have been restricted to the State of Texas, and only to a few census tracts which are combinedly known as AACOG.
Here’s a multiple regression model for the rent in of the form:
\(\text{Rent} = \beta_0 + \beta_2*\text{% White} + \beta_2*\text{% Hispanic} + \beta_1*\text{% Black} + \beta_3*\text{Median Household Income}\)
It says that we expect the proportion of the population that is White, the proportion of the population that is Hispanic, the proportion of the population that is Black, and the proportion of the population with low Median Household Income to affect the rental distribution across the census tracts.
lm(rent ~ pwhite + phisp + pblack + medhhinc)
sa_acs2%>%
mutate(rentquant=cut(prent, breaks = quantile(sa_acs2$prent, p=seq(0,1,length.out = 8)), include.lowest = T))%>%
ggplot(aes(color=rentquant, fill=rentquant))+geom_sf()+
scale_fill_viridis_d()+
scale_color_viridis_d()+
ggtitle(label = "Distribution of Rent in Census Tracts -AACOG 2015")+geom_sf(data=metro, fill=NA, color="black")
fit <- lm(prent ~pwhite + phisp + pblack + medhhinc, data=sa_acs2)
summary(fit)
##
## Call:
## lm(formula = prent ~ pwhite + phisp + pblack + medhhinc, data = sa_acs2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -736.04 -102.72 -4.44 96.24 1236.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.513e+03 2.518e+02 6.010 3.76e-09 ***
## pwhite -1.151e+01 2.544e+00 -4.525 7.68e-06 ***
## phisp -1.110e+01 2.491e+00 -4.457 1.04e-05 ***
## pblack -7.275e+00 3.123e+00 -2.330 0.0202 *
## medhhinc 9.029e-03 4.783e-04 18.877 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 194.5 on 462 degrees of freedom
## Multiple R-squared: 0.6192, Adjusted R-squared: 0.6159
## F-statistic: 187.8 on 4 and 462 DF, p-value: < 2.2e-16
All the predictors, percetage white, percentage hispanic, percentage black, and median household income show siginificant association with rent.
library(ggplot2)
library(sf)
library(dplyr)
sa_acs2<-st_as_sf(sa_acs2)
sa_acs2%>%
mutate(rquant=cut(olsresid, breaks = quantile(sa_acs2$olsresid, p=seq(0,1,length.out = 8)), include.lowest = T))%>%
ggplot(aes(color=rquant, fill=rquant))+geom_sf()+
scale_fill_viridis_d()+
scale_color_viridis_d()+geom_sf(data=metro, fill=NA, color="black")+ggtitle("OLS Model Residuals")
If we visually interpret the map we can see significant clustering and correlation in the residuals.
lm.morantest(fit, listw = wts1) #Queen Adjacency Matrix
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2)
## weights: wts1
##
## Moran I statistic standard deviate = 3.0034, p-value = 0.001335
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.0722049972 -0.0071326636 0.0006977835
lm.morantest(fit, listw = wts2) #Rook Adjacency Matrix
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2)
## weights: wts2
##
## Moran I statistic standard deviate = 2.8102, p-value = 0.002475
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.0733171402 -0.0072399790 0.0008217097
We can see that, Moran’s I is very close for both neighborhood specification, with the Rook Adjacency Matrix providing a slightly larger degree of dependence. The observed value for Moran’s I for the residual is .07, with a z-test of 2.81 and a small p-value allow us to conclude that there is significant autocorrelation in the model residuals. So it violates the assumptions of the model.
moran.plot(sa_acs2$prent, listw=wts1, main="Moran Scatterplot (Queen)", xlab = "Distribution of Rent", ylab="Spatially lagged Rent")
moran.plot(sa_acs2$prent, listw=wts2, main="Moran Scatterplot (Rook)", xlab = "Distribution of Rent", ylab="Spatially lagged Rent")
spplot(sa_acs2, "localir", main="Local Moran's I (Rook)",
at=quantile(sa_acs2$localir),
col.regions=brewer.pal(n=4, "RdBu"))
spplot(sa_acs2, "clr", main="Local Moran's Clusters (Rook)",
col.regions=c(2,0))
#lm.LMtests(model = fit, listw=wts1, test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
lm.LMtests(model = fit, listw=wts2, test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2)
## weights: wts2
##
## LMerr = 6.3352, df = 1, p-value = 0.01184
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2)
## weights: wts2
##
## LMlag = 10.504, df = 1, p-value = 0.001191
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2)
## weights: wts2
##
## RLMerr = 0.026686, df = 1, p-value = 0.8702
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2)
## weights: wts2
##
## RLMlag = 4.1955, df = 1, p-value = 0.04053
We can see that the lag model has more support, with both the regular and the robust LMT showing larger values for that model, compared to the error model specification.
fit.err<-errorsarlm(prent ~pwhite + phisp + pblack + medhhinc, data=sa_acs2, listw=wts2)
## Warning in errorsarlm(prent ~ pwhite + phisp + pblack + medhhinc, data = sa_acs2, : inversion of asymptotic covariance matrix failed for tol.solve = 1e-10
## reciprocal condition number = 5.23578e-15 - using numerical Hessian.
summary(fit.err)
##
## Call:errorsarlm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2, listw = wts2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -740.496 -100.336 -10.743 92.721 1226.234
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4471e+03 2.6042e+02 5.5566 2.752e-08
## pwhite -1.0521e+01 2.6578e+00 -3.9586 7.539e-05
## phisp -1.0406e+01 2.5840e+00 -4.0271 5.647e-05
## pblack -6.7035e+00 3.2673e+00 -2.0517 0.0402
## medhhinc 8.8480e-03 5.0889e-04 17.3870 < 2.2e-16
##
## Lambda: 0.17531, LR test value: 6.0764, p-value: 0.0137
## Approximate (numerical Hessian) standard error: 0.07029
## z-value: 2.494, p-value: 0.01263
## Wald statistic: 6.2203, p-value: 0.01263
##
## Log likelihood: -3118.373 for error model
## ML residual variance (sigma squared): 36717, (sigma: 191.62)
## Number of observations: 467
## Number of parameters estimated: 7
## AIC: 6250.7, (AIC for lm: 6254.8)
We can see that all the effects are significant in the model, similar to what we saw in the OLS model above. The direction of the coefficients are similar to the ones we can see in the OLS models, but the magnitudes of the regression effects are different. We can also see the estimate of the spatial auto-regression coefficient for the lag model, \(\lambda\) is 0.17, with a standard error of 0.07, and a z-test that shows the coefficient is different from 0. The output also contains the estimates of the residual mean square (7.03), which is much smaller than it was in the OLS model (191.62), indicating the error model is explaining more of the variation in the data than the OLS model. Finally, we see the Akaike Information Criteria, or AIC, which is a measure of model fit. For the error model it is 6250.7 and for the OLS model it was 6254.8, a slight difference. In this case, the error model is preferred based on the AIC value.
fit.lag<-lagsarlm(prent ~pwhite + phisp + pblack + medhhinc, data=sa_acs2, listw=wts2, type="lag")
## Warning in lagsarlm(prent ~ pwhite + phisp + pblack + medhhinc, data = sa_acs2, : inversion of asymptotic covariance matrix failed for tol.solve = 1e-10
## reciprocal condition number = 3.63626e-15 - using numerical Hessian.
summary(fit.lag)
##
## Call:lagsarlm(formula = prent ~ pwhite + phisp + pblack + medhhinc,
## data = sa_acs2, listw = wts2, type = "lag")
##
## Residuals:
## Min 1Q Median 3Q Max
## -721.102 -98.297 -5.039 87.832 1224.698
##
## Type: lag
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2125e+03 2.6570e+02 4.5634 5.032e-06
## pwhite -9.4873e+00 2.5843e+00 -3.6712 0.0002414
## phisp -8.9828e+00 2.5416e+00 -3.5343 0.0004088
## pblack -5.7148e+00 3.1079e+00 -1.8388 0.0659458
## medhhinc 8.2069e-03 5.3875e-04 15.2334 < 2.2e-16
##
## Rho: 0.15694, LR test value: 9.4481, p-value: 0.0021137
## Approximate (numerical Hessian) standard error: 0.050387
## z-value: 3.1148, p-value: 0.0018409
## Wald statistic: 9.7018, p-value: 0.0018409
##
## Log likelihood: -3116.687 for lag model
## ML residual variance (sigma squared): 36497, (sigma: 191.04)
## Number of observations: 467
## Number of parameters estimated: 7
## AIC: 6247.4, (AIC for lm: 6254.8)
The regression effects show that the coefficients changed compared to the error or OLS models. Specifically, the lag model shows a notable degree of change in the effect of the %white and the %Hispanic variables. The auto-regressive coefficient, \(\rho\) for the lag model is also significant and positive, indicating that rent is significantly associated with rents in neighboring areas. The lag model fit (AIC value of 6247.4) is better than the OLS model (6254.8).
Comparing the AIC value among these three models, the lag model is a better measure of model fit.