Set up

library(tidyverse)
library(ggExtra)
library(spdplyr)
library(spdep)

Today we’ll work with a dataset I made that merges the EPA’s air quality data with Census estimates of population. We’ll be building toy regressions with some simple datasets, so be sure to take whatever results you may generate with a grain of salt.

Q1

First, load the data (AQI_POP.RDS and county.RDS).
Filter rows so our new datasets only includes air quality and population data from Year == 2003 for the state of California. To make sure you filtered correctly, be sure that mean(aqi@data$Median.AQI) == 50.62963.
Load the county shapefile and filter counties in the state of California. You may have to look up California’s STATEFP to do this.
Append the AQI data to the California county shapefile using the merge(). Make sure the result of your merge() is a new shapefile for counties in California that now includes attributes stored in the AQI dataset.
Use spplot() to create a map of the Median.AQI in each county. Median AQI is the median value of the air quality index for the year 2003. High values of AQI are bad, low values are good.

aqi <- readRDS("C:/Users/Maggie/Downloads/AQI_POP.RDS")
str(aqi)

## 'data.frame':    34723 obs. of  24 variables:
##  $ GEOID                              : chr  "01001" "01001" "01001" "01001" ...
##  $ State                              : Factor w/ 57 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ County                             : Factor w/ 2062 levels "Abbeville","Adams",..: 1061 1061 1061 1061 1061 32 32 32 32 32 ...
##  $ Year                               : int  1981 1990 1989 1980 1982 2015 2004 2003 2006 2008 ...
##  $ Days.with.AQI                      : int  357 266 63 179 245 264 271 278 280 265 ...
##  $ Good.Days                          : int  289 183 54 122 203 230 182 189 174 210 ...
##  $ Moderate.Days                      : int  49 64 9 35 36 33 74 77 81 50 ...
##  $ Unhealthy.for.Sensitive.Groups.Days: int  15 18 0 18 5 1 15 11 24 5 ...
##  $ Unhealthy.Days                     : int  4 1 0 4 0 0 0 1 1 0 ...
##  $ Very.Unhealthy.Days                : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Hazardous.Days                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Max.AQI                            : int  195 151 100 177 206 129 143 174 156 133 ...
##  $ X90th.Percentile.AQI               : int  77 93 64 108 67 53 83 77 97 68 ...
##  $ Median.AQI                         : int  32 43 34 40 31 38 43 41 46 37 ...
##  $ Days.CO                            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Days.NO2                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Days.Ozone                         : int  241 266 63 122 166 189 196 189 217 206 ...
##  $ Days.SO2                           : int  116 0 0 57 79 0 0 0 0 0 ...
##  $ Days.PM2.5                         : int  0 0 0 0 0 75 75 89 63 59 ...
##  $ Days.PM10                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Abbr                               : Factor w/ 57 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ STATEFP                            : chr  "01" "01" "01" "01" ...
##  $ COUNTYFP                           : chr  "001" "001" "001" "001" ...
##  $ POP2010                            : int  54571 54571 54571 54571 54571 182265 182265 182265 182265 182265 ...

# filter
aqi <- filter(aqi, Year == 2003 & State == 'California')

## Warning: package 'bindrcpp' was built under R version 3.4.3

# double check 
mean(aqi$Median.AQI)

## [1] 50.62963

cty <- readRDS("C:/Users/Maggie/Downloads/county.RDS")
# filter
head(aqi$STATEFP)

## [1] "06" "06" "06" "06" "06" "06"

cty <- filter(cty, STATEFP=='06')

# merge
aqi2 <- merge(cty,aqi)
class(aqi2)

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

# spplot
spplot(aqi2,"Median.AQI",main='Air Quality by County')

Q2

Create a new shapefile in which you remove all rows in the merged AQI-county shapefile in which Median.AQI is NA. Use this clean shapefile in all subsequent analyses. Use the ggExtra package function ggMarginal() to create a visualization of your data with histograms on each axis and a scatter plot in the center (see tutorial for hints).

# filter
aqi3 <- filter(aqi2,Median.AQI != 'NA')

# plot
p2 <- ggplot2::ggplot() + ggplot2::geom_point(aes(x=aqi3$Days.Ozone, y=aqi3$Median.AQI)) + ggplot2::xlab("Ozone Days") + ggplot2::ylab("Median AQI")
ggMarginal(p2, type="histogram")

Q3

Let’s estimate a simple OLS regression with lm(). We’ll use the formula Median.AQI ~ log(POP2010). Population data is not normally distributed, so we’ll use a log transformation of this dataset to address this. To read about interpreting coefficients that have been log-transformed, see this.

Print the summary() of your regression results. What is the value of the coefficient estimate for log(POP2010)?

# run OLS, print summary
ols <- lm(data=aqi3, Median.AQI ~ log(POP2010))
summary(ols)

## 
## Call:
## lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.211 -16.309  -1.865   7.307  57.716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.576     19.470  -1.725   0.0906 .  
## log(POP2010)    6.881      1.577   4.362 6.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.7 on 52 degrees of freedom
## Multiple R-squared:  0.2679, Adjusted R-squared:  0.2538 
## F-statistic: 19.03 on 1 and 52 DF,  p-value: 6.119e-05

Q4

Plot the residuals from this regression using spplot(). Based on your knowledge of California, what might explain spatial patterns in the residuals?

aqi3$resid <- residuals(ols) # add regression residuals back to data
spplot(aqi3,"resid",main="Residuals")

Q5

Create a queen neighborhood with the filtered AQI-population data you constructed in Q2.
Create a weights matrix using nb2lsitw() with the default row standardized weights (style=W).
Visualize this weight matrix.
Use lm.morantest() to test for spatial autocorrelation in the residuals of the OLS regression you ran in Q3.
Is there spatial autocorrelation in the residuals?

# create queen nb
queen <- poly2nb(aqi3)
# create weights
qn_w <- nb2listw(queen)
summary(sapply(qn_w$weights, sum)) # check for row standardization, default

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

# visualize
aqi3$wts <- qn_w$weights
#spplot(aqi3,"wts",main="Weights")

# lm.morantest() 
lm.morantest(ols, qn_w)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## Moran I statistic standard deviate = 5.7926, p-value = 3.466e-09
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.487380556     -0.027552977      0.007902444

Q6

Run a spatial error model using the clean data you created in Q2, the weights object created in Q5, and following formula formula = Median.AQI ~ log(POP2010).
What is the coefficient estimate for log(POP2010)?

# run spatial error model
sem <- errorsarlm(formula=Median.AQI~log(POP2010), data=aqi3, listw=qn_w, quiet= T) 
summary(sem)

## 
## Call:errorsarlm(formula = Median.AQI ~ log(POP2010), data = aqi3, 
##     listw = qn_w, quiet = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0447  -9.3886  -3.8347   6.5264  51.0266 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -34.9249    18.6329 -1.8744  0.06088
## log(POP2010)   6.8259     1.5008  4.5482 5.41e-06
## 
## Lambda: 0.62626, LR test value: 20.568, p-value: 5.7558e-06
## Asymptotic standard error: 0.11821
##     z-value: 5.298, p-value: 1.1711e-07
## Wald statistic: 28.068, p-value: 1.1711e-07
## 
## Log likelihood: -223.4471 for error model
## ML residual variance (sigma squared): 205.37, (sigma: 14.331)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 454.89, (AIC for lm: 473.46)

Q7

Use spplot() to visualize the county-level fitted values from the regression you ran in Q6.

aqi3$fitted_sem <- sem$fitted.values
#spplot(aqi3,"fitted_sem") #won't run

Q8

Use spplot() to visualize the county-level residuals from the regression you ran in Q6. Visually compare this map to the map of Median.AQI.

aqi3$resid_sem <- sem$residuals
#spplot(aqi3, "resid_sem", main="Residuals") # won't run

Q9

Run a spatial lag model using the clean data you created in Q2, the weights object created in Q5, and following formula formula = Median.AQI ~ log(POP2010). Set zero.policy = T.
What is the coefficient estimate for log(POP2010)?

lag <- lagsarlm(formula=Median.AQI~log(POP2010), data=aqi3, qn_w, zero.policy=T)
summary(lag)

## 
## Call:
## lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi3, listw = qn_w, 
##     zero.policy = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8057 -10.3559  -3.6790   8.3136  52.6841 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  -37.9223    15.9265 -2.3811 0.0172619
## log(POP2010)   5.0543     1.4046  3.5985 0.0003201
## 
## Rho: 0.50248, LR test value: 16.071, p-value: 6.0999e-05
## Asymptotic standard error: 0.12658
##     z-value: 3.9697, p-value: 7.1964e-05
## Wald statistic: 15.758, p-value: 7.1964e-05
## 
## Log likelihood: -225.6953 for lag model
## ML residual variance (sigma squared): 233.63, (sigma: 15.285)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 459.39, (AIC for lm: 473.46)
## LM test for residual autocorrelation
## test value: 1.6173, p-value: 0.20347

Q10

Use lm.LMtests to determine which spatial model best fits the data, formula, and weights object we used in Q6 and Q9.

lm.LMtests(ols, qn_w, test="all")

## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## LMerr = 26.542, df = 1, p-value = 2.578e-07
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## LMlag = 23.033, df = 1, p-value = 1.592e-06
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## RLMerr = 3.6284, df = 1, p-value = 0.0568
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## RLMlag = 0.11921, df = 1, p-value = 0.7299
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi3)
## weights: qn_w
## 
## SARMA = 26.661, df = 2, p-value = 1.624e-06

Spatial regression questions

Maggie Hallerud

Set up

Q1

Q2

Q3

Q4

Q5

Q6

Q7

Q8

Q9

Q10