Set up

library(tidyverse)
library(ggExtra)
library(spdplyr)
library(spdep)

Today we’ll work with a dataset I made that merges the EPA’s air quality data with Census estimates of population. We’ll be building toy regressions with some simple datasets, so be sure to take whatever results you may generate with a grain of salt.

Q1

aqi <- readRDS("C:/Users/Maggie/Downloads/AQI_POP.RDS")
str(aqi)
## 'data.frame':    34723 obs. of  24 variables:
##  $ GEOID                              : chr  "01001" "01001" "01001" "01001" ...
##  $ State                              : Factor w/ 57 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ County                             : Factor w/ 2062 levels "Abbeville","Adams",..: 1061 1061 1061 1061 1061 32 32 32 32 32 ...
##  $ Year                               : int  1981 1990 1989 1980 1982 2015 2004 2003 2006 2008 ...
##  $ Days.with.AQI                      : int  357 266 63 179 245 264 271 278 280 265 ...
##  $ Good.Days                          : int  289 183 54 122 203 230 182 189 174 210 ...
##  $ Moderate.Days                      : int  49 64 9 35 36 33 74 77 81 50 ...
##  $ Unhealthy.for.Sensitive.Groups.Days: int  15 18 0 18 5 1 15 11 24 5 ...
##  $ Unhealthy.Days                     : int  4 1 0 4 0 0 0 1 1 0 ...
##  $ Very.Unhealthy.Days                : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Hazardous.Days                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Max.AQI                            : int  195 151 100 177 206 129 143 174 156 133 ...
##  $ X90th.Percentile.AQI               : int  77 93 64 108 67 53 83 77 97 68 ...
##  $ Median.AQI                         : int  32 43 34 40 31 38 43 41 46 37 ...
##  $ Days.CO                            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Days.NO2                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Days.Ozone                         : int  241 266 63 122 166 189 196 189 217 206 ...
##  $ Days.SO2                           : int  116 0 0 57 79 0 0 0 0 0 ...
##  $ Days.PM2.5                         : int  0 0 0 0 0 75 75 89 63 59 ...
##  $ Days.PM10                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Abbr                               : Factor w/ 57 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ STATEFP                            : chr  "01" "01" "01" "01" ...
##  $ COUNTYFP                           : chr  "001" "001" "001" "001" ...
##  $ POP2010                            : int  54571 54571 54571 54571 54571 182265 182265 182265 182265 182265 ...
# filter
aqi <- filter(aqi, Year == 2003 & State == 'California')
## Warning: package 'bindrcpp' was built under R version 3.4.3
# double check 
mean(aqi$Median.AQI)
## [1] 50.62963
cty <- readRDS("C:/Users/Maggie/Downloads/county.RDS")
# filter
head(aqi$STATEFP)
## [1] "06" "06" "06" "06" "06" "06"
cty <- filter(cty, STATEFP=='06')

# merge
aqi2 <- merge(cty,aqi)
class(aqi2)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
# spplot
spplot(aqi2,"Median.AQI",main='Air Quality by County')

Q2

Create a new shapefile in which you remove all rows in the merged AQI-county shapefile in which Median.AQI is NA. Use this clean shapefile in all subsequent analyses. Use the ggExtra package function ggMarginal() to create a visualization of your data with histograms on each axis and a scatter plot in the center (see tutorial for hints).

# filter
aqi3 <- filter(aqi2,Median.AQI != 'NA')

# plot
p2 <- ggplot2::ggplot() + ggplot2::geom_point(aes(x=aqi3$Days.Ozone, y=aqi3$Median.AQI)) + ggplot2::xlab("Ozone Days") + ggplot2::ylab("Median AQI")
ggMarginal(p2, type="histogram")

Q3

Let’s estimate a simple OLS regression with lm(). We’ll use the formula Median.AQI ~ log(POP2010). Population data is not normally distributed, so we’ll use a log transformation of this dataset to address this. To read about interpreting coefficients that have been log-transformed, see this.

Print the summary() of your regression results. What is the value of the coefficient estimate for log(POP2010)?

# run OLS, print summary
ols <- lm(data=aqi3, Median.AQI ~ log(POP2010))
summary(ols)
## 
## Call:
## lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.211 -16.309  -1.865   7.307  57.716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.576     19.470  -1.725   0.0906 .  
## log(POP2010)    6.881      1.577   4.362 6.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.7 on 52 degrees of freedom
## Multiple R-squared:  0.2679, Adjusted R-squared:  0.2538 
## F-statistic: 19.03 on 1 and 52 DF,  p-value: 6.119e-05

Q4

Plot the residuals from this regression using spplot(). Based on your knowledge of California, what might explain spatial patterns in the residuals?

aqi3$resid <- residuals(ols) # add regression residuals back to data
spplot(aqi3,"resid",main="Residuals")

Q5

# create queen nb
queen <- poly2nb(aqi3)
# create weights
qn_w <- nb2listw(queen)
summary(sapply(qn_w$weights, sum)) # check for row standardization, default
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1
# visualize
aqi3$wts <- qn_w$weights
#spplot(aqi3,"wts",main="Weights")

# lm.morantest() 
lm.morantest(ols, qn_w) 
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## Moran I statistic standard deviate = 5.7926, p-value = 3.466e-09
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.487380556     -0.027552977      0.007902444

Q6

# run spatial error model
sem <- errorsarlm(formula=Median.AQI~log(POP2010), data=aqi3, listw=qn_w, quiet= T) 
summary(sem)
## 
## Call:errorsarlm(formula = Median.AQI ~ log(POP2010), data = aqi3, 
##     listw = qn_w, quiet = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0447  -9.3886  -3.8347   6.5264  51.0266 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -34.9249    18.6329 -1.8744  0.06088
## log(POP2010)   6.8259     1.5008  4.5482 5.41e-06
## 
## Lambda: 0.62626, LR test value: 20.568, p-value: 5.7558e-06
## Asymptotic standard error: 0.11821
##     z-value: 5.298, p-value: 1.1711e-07
## Wald statistic: 28.068, p-value: 1.1711e-07
## 
## Log likelihood: -223.4471 for error model
## ML residual variance (sigma squared): 205.37, (sigma: 14.331)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 454.89, (AIC for lm: 473.46)

Q7

aqi3$fitted_sem <- sem$fitted.values
#spplot(aqi3,"fitted_sem") #won't run

Q8

aqi3$resid_sem <- sem$residuals
#spplot(aqi3, "resid_sem", main="Residuals") # won't run

Q9

lag <- lagsarlm(formula=Median.AQI~log(POP2010), data=aqi3, qn_w, zero.policy=T)
summary(lag)
## 
## Call:
## lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi3, listw = qn_w, 
##     zero.policy = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8057 -10.3559  -3.6790   8.3136  52.6841 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  -37.9223    15.9265 -2.3811 0.0172619
## log(POP2010)   5.0543     1.4046  3.5985 0.0003201
## 
## Rho: 0.50248, LR test value: 16.071, p-value: 6.0999e-05
## Asymptotic standard error: 0.12658
##     z-value: 3.9697, p-value: 7.1964e-05
## Wald statistic: 15.758, p-value: 7.1964e-05
## 
## Log likelihood: -225.6953 for lag model
## ML residual variance (sigma squared): 233.63, (sigma: 15.285)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 459.39, (AIC for lm: 473.46)
## LM test for residual autocorrelation
## test value: 1.6173, p-value: 0.20347

Q10

lm.LMtests(ols, qn_w, test="all")
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## LMerr = 26.542, df = 1, p-value = 2.578e-07
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## LMlag = 23.033, df = 1, p-value = 1.592e-06
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## RLMerr = 3.6284, df = 1, p-value = 0.0568
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## RLMlag = 0.11921, df = 1, p-value = 0.7299
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## SARMA = 26.661, df = 2, p-value = 1.624e-06