library(tidyverse)
library(ggExtra)
library(spdplyr)
library(spdep)
Today we’ll work with a dataset I made that merges the EPA’s air quality data with Census estimates of population. We’ll be building toy regressions with some simple datasets, so be sure to take whatever results you may generate with a grain of salt.
AQI_POP.RDS
and county.RDS
).Year == 2003
for the state of California
. To make sure you filtered correctly, be sure that mean(aqi@data$Median.AQI) == 50.62963
.STATEFP
to do this.merge()
. Make sure the result of your merge()
is a new shapefile for counties in California that now includes attributes stored in the AQI dataset.spplot()
to create a map of the Median.AQI
in each county. Median AQI is the median value of the air quality index for the year 2003. High values of AQI are bad, low values are good.aqi <- readRDS("C:/Users/Maggie/Downloads/AQI_POP.RDS")
str(aqi)
## 'data.frame': 34723 obs. of 24 variables:
## $ GEOID : chr "01001" "01001" "01001" "01001" ...
## $ State : Factor w/ 57 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County : Factor w/ 2062 levels "Abbeville","Adams",..: 1061 1061 1061 1061 1061 32 32 32 32 32 ...
## $ Year : int 1981 1990 1989 1980 1982 2015 2004 2003 2006 2008 ...
## $ Days.with.AQI : int 357 266 63 179 245 264 271 278 280 265 ...
## $ Good.Days : int 289 183 54 122 203 230 182 189 174 210 ...
## $ Moderate.Days : int 49 64 9 35 36 33 74 77 81 50 ...
## $ Unhealthy.for.Sensitive.Groups.Days: int 15 18 0 18 5 1 15 11 24 5 ...
## $ Unhealthy.Days : int 4 1 0 4 0 0 0 1 1 0 ...
## $ Very.Unhealthy.Days : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Hazardous.Days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Max.AQI : int 195 151 100 177 206 129 143 174 156 133 ...
## $ X90th.Percentile.AQI : int 77 93 64 108 67 53 83 77 97 68 ...
## $ Median.AQI : int 32 43 34 40 31 38 43 41 46 37 ...
## $ Days.CO : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Days.NO2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Days.Ozone : int 241 266 63 122 166 189 196 189 217 206 ...
## $ Days.SO2 : int 116 0 0 57 79 0 0 0 0 0 ...
## $ Days.PM2.5 : int 0 0 0 0 0 75 75 89 63 59 ...
## $ Days.PM10 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Abbr : Factor w/ 57 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ STATEFP : chr "01" "01" "01" "01" ...
## $ COUNTYFP : chr "001" "001" "001" "001" ...
## $ POP2010 : int 54571 54571 54571 54571 54571 182265 182265 182265 182265 182265 ...
# filter
aqi <- filter(aqi, Year == 2003 & State == 'California')
## Warning: package 'bindrcpp' was built under R version 3.4.3
# double check
mean(aqi$Median.AQI)
## [1] 50.62963
cty <- readRDS("C:/Users/Maggie/Downloads/county.RDS")
# filter
head(aqi$STATEFP)
## [1] "06" "06" "06" "06" "06" "06"
cty <- filter(cty, STATEFP=='06')
# merge
aqi2 <- merge(cty,aqi)
class(aqi2)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
# spplot
spplot(aqi2,"Median.AQI",main='Air Quality by County')
Create a new shapefile in which you remove all rows in the merged AQI-county shapefile in which Median.AQI
is NA
. Use this clean shapefile in all subsequent analyses. Use the ggExtra
package function ggMarginal()
to create a visualization of your data with histograms on each axis and a scatter plot in the center (see tutorial for hints).
# filter
aqi3 <- filter(aqi2,Median.AQI != 'NA')
# plot
p2 <- ggplot2::ggplot() + ggplot2::geom_point(aes(x=aqi3$Days.Ozone, y=aqi3$Median.AQI)) + ggplot2::xlab("Ozone Days") + ggplot2::ylab("Median AQI")
ggMarginal(p2, type="histogram")
Let’s estimate a simple OLS regression with lm()
. We’ll use the formula Median.AQI ~ log(POP2010)
. Population data is not normally distributed, so we’ll use a log transformation of this dataset to address this. To read about interpreting coefficients that have been log-transformed, see this.
Print the summary()
of your regression results. What is the value of the coefficient estimate for log(POP2010)
?
# run OLS, print summary
ols <- lm(data=aqi3, Median.AQI ~ log(POP2010))
summary(ols)
##
## Call:
## lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.211 -16.309 -1.865 7.307 57.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.576 19.470 -1.725 0.0906 .
## log(POP2010) 6.881 1.577 4.362 6.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.7 on 52 degrees of freedom
## Multiple R-squared: 0.2679, Adjusted R-squared: 0.2538
## F-statistic: 19.03 on 1 and 52 DF, p-value: 6.119e-05
Plot the residuals from this regression using spplot()
. Based on your knowledge of California, what might explain spatial patterns in the residuals?
aqi3$resid <- residuals(ols) # add regression residuals back to data
spplot(aqi3,"resid",main="Residuals")
nb2lsitw()
with the default row standardized weights (style=W
).lm.morantest()
to test for spatial autocorrelation in the residuals of the OLS regression you ran in Q3.# create queen nb
queen <- poly2nb(aqi3)
# create weights
qn_w <- nb2listw(queen)
summary(sapply(qn_w$weights, sum)) # check for row standardization, default
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
# visualize
aqi3$wts <- qn_w$weights
#spplot(aqi3,"wts",main="Weights")
# lm.morantest()
lm.morantest(ols, qn_w)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
##
## Moran I statistic standard deviate = 5.7926, p-value = 3.466e-09
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.487380556 -0.027552977 0.007902444
formula = Median.AQI ~ log(POP2010)
.log(POP2010)
?# run spatial error model
sem <- errorsarlm(formula=Median.AQI~log(POP2010), data=aqi3, listw=qn_w, quiet= T)
summary(sem)
##
## Call:errorsarlm(formula = Median.AQI ~ log(POP2010), data = aqi3,
## listw = qn_w, quiet = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0447 -9.3886 -3.8347 6.5264 51.0266
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.9249 18.6329 -1.8744 0.06088
## log(POP2010) 6.8259 1.5008 4.5482 5.41e-06
##
## Lambda: 0.62626, LR test value: 20.568, p-value: 5.7558e-06
## Asymptotic standard error: 0.11821
## z-value: 5.298, p-value: 1.1711e-07
## Wald statistic: 28.068, p-value: 1.1711e-07
##
## Log likelihood: -223.4471 for error model
## ML residual variance (sigma squared): 205.37, (sigma: 14.331)
## Number of observations: 54
## Number of parameters estimated: 4
## AIC: 454.89, (AIC for lm: 473.46)
spplot()
to visualize the county-level fitted values from the regression you ran in Q6.aqi3$fitted_sem <- sem$fitted.values
#spplot(aqi3,"fitted_sem") #won't run
spplot()
to visualize the county-level residuals from the regression you ran in Q6. Visually compare this map to the map of Median.AQI
.aqi3$resid_sem <- sem$residuals
#spplot(aqi3, "resid_sem", main="Residuals") # won't run
formula = Median.AQI ~ log(POP2010)
. Set zero.policy = T
.log(POP2010)
?lag <- lagsarlm(formula=Median.AQI~log(POP2010), data=aqi3, qn_w, zero.policy=T)
summary(lag)
##
## Call:
## lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi3, listw = qn_w,
## zero.policy = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8057 -10.3559 -3.6790 8.3136 52.6841
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.9223 15.9265 -2.3811 0.0172619
## log(POP2010) 5.0543 1.4046 3.5985 0.0003201
##
## Rho: 0.50248, LR test value: 16.071, p-value: 6.0999e-05
## Asymptotic standard error: 0.12658
## z-value: 3.9697, p-value: 7.1964e-05
## Wald statistic: 15.758, p-value: 7.1964e-05
##
## Log likelihood: -225.6953 for lag model
## ML residual variance (sigma squared): 233.63, (sigma: 15.285)
## Number of observations: 54
## Number of parameters estimated: 4
## AIC: 459.39, (AIC for lm: 473.46)
## LM test for residual autocorrelation
## test value: 1.6173, p-value: 0.20347
lm.LMtests
to determine which spatial model best fits the data, formula, and weights object we used in Q6 and Q9.lm.LMtests(ols, qn_w, test="all")
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
##
## LMerr = 26.542, df = 1, p-value = 2.578e-07
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
##
## LMlag = 23.033, df = 1, p-value = 1.592e-06
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
##
## RLMerr = 3.6284, df = 1, p-value = 0.0568
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
##
## RLMlag = 0.11921, df = 1, p-value = 0.7299
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
##
## SARMA = 26.661, df = 2, p-value = 1.624e-06