Set up

AQI_POP <- readRDS("~/USU/2023/GEOG 4870/Module 10_Spatial_Regression/Spatial Regression/AQI_POP.RDS")
county <- readRDS("~/USU/2023/GEOG 4870/Module 10_Spatial_Regression/Spatial Regression/county.RDS")
library(tidyverse)
library(ggExtra)
library(spdplyr)
library(spdep)
library(terra)
library(spdep)
library(spatialreg)

Today we will work with a dataset that merges the EPA’s air quality data with Census estimates of population. We will be building toy regressions with some simple datasets, so be sure to take whatever results you may generate with a grain of salt. sdasd asd

Q1

aqi <- filter(AQI_POP, State == 'California' & Year == '2003')
mean(aqi$Median.AQI)
## [1] 50.62963
cty <- county[county$STATEFP == "06",]
###plot(cty)

# merge
####cal_aqi_m <- cty@data = data.frame(cty@data, aqi[match(sdata@, aqi$Median.AQI)])

cal_aqi <- merge(cty, aqi)
names(cal_aqi)
##  [1] "STATEFP"                             "COUNTYFP"                           
##  [3] "GEOID"                               "COUNTYNS"                           
##  [5] "AFFGEOID"                            "NAME"                               
##  [7] "LSAD"                                "ALAND"                              
##  [9] "AWATER"                              "State"                              
## [11] "County"                              "Year"                               
## [13] "Days.with.AQI"                       "Good.Days"                          
## [15] "Moderate.Days"                       "Unhealthy.for.Sensitive.Groups.Days"
## [17] "Unhealthy.Days"                      "Very.Unhealthy.Days"                
## [19] "Hazardous.Days"                      "Max.AQI"                            
## [21] "X90th.Percentile.AQI"                "Median.AQI"                         
## [23] "Days.CO"                             "Days.NO2"                           
## [25] "Days.Ozone"                          "Days.SO2"                           
## [27] "Days.PM2.5"                          "Days.PM10"                          
## [29] "Abbr"                                "POP2010"
# spplot

spplot(cal_aqi, "Median.AQI", main = "Air Quality Index:")

Q2

Create a new shapefile in which you remove all rows in the merged AQI-county shapefile in which Median.AQI is NA. Use this clean shapefile in all subsequent analyses. Use the ggExtra package function ggMarginal() to create a visualization of your Median.AQI and POP2010 with histograms on each axis and a scatter plot in the center (see tutorial for hints).

# filter
aqi_cal <- cal_aqi[!is.na(cal_aqi@data$Median.AQI),]
# plot


pop_aqi <- ggplot(data = aqi_cal@data, aes(x = aqi_cal$Median.AQI, y = aqi_cal$POP2010, color = aqi_cal$County)) +
  geom_point() +
  theme(legend.position = "none") +
  xlab("MedianAQI") +
  ylab("2010 Population")

ggMarginal(pop_aqi, type = "histogram")

Q3

Let’s estimate a simple OLS regression with lm(). We will use the formula Median.AQI ~ log(POP2010). Population data is not normally distributed, so we will use a log transformation of this dataset to address this. To read about interpreting coefficients that have been log-transformed, see this.

Print the summary() of your regression results. What is the value of the coefficient estimate for log(POP2010)?

# run OLS, print summary
cal.ols <- lm(Median.AQI ~ log(POP2010), data = aqi_cal)
summary(cal.ols)
## 
## Call:
## lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.211 -16.309  -1.865   7.307  57.716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.576     19.470  -1.725   0.0906 .  
## log(POP2010)    6.881      1.577   4.362 6.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.7 on 52 degrees of freedom
## Multiple R-squared:  0.2679, Adjusted R-squared:  0.2538 
## F-statistic: 19.03 on 1 and 52 DF,  p-value: 6.119e-05

Coefficient Estimate of Log(POP2010) is 6.881.

Q4

Plot the residuals from this regression using spplot(). Based on your knowledge of California, what might explain spatial patterns in the residuals?

aqi_cal$resid <- residuals(cal.ols)
spplot(aqi_cal, "resid")

Given that most population centers with higher expected levels of polution occur on the coast and the wind direction runs west to east from the Pacific, higher positive residual values may occur in the less populated eastern part of the state where lower AQIs are expected, but are influenced by western winds pushing pollution from coastal cities eastward into less populated areas.

Q5

# create queen nb
queen <- poly2nb(aqi_cal)
# create weights
aqi_cal_qw <- nb2listw(queen, style= "W", zero.policy = T)

# visualize
#id <- row.names(as(aqi_cal, "data.frame"))
#coords <- coordinates(aqi_cal)
#spplot(queen)

# lm.morantest() 
lm.morantest(cal.ols, aqi_cal_qw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## Moran I statistic standard deviate = 5.7926, p-value = 3.466e-09
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.487380556     -0.027552977      0.007902444

Yes, since the p-value is significant we can assume there is spatial autocorrelation in the risiduals.

Q6

install.packages(“spatialreg”) library * Run a spatial error model using the clean data you created in Q2, the weights object created in Q5, and following formula formula = Median.AQI ~ log(POP2010). * What is the coefficient estimate for log(POP2010)?

# run spatial error model
aqi_cal.sem <- errorsarlm(Median.AQI ~ log(POP2010), data = aqi_cal, listw = aqi_cal_qw, quiet = T)
summary(aqi_cal.sem)
## 
## Call:errorsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, 
##     listw = aqi_cal_qw, quiet = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0447  -9.3886  -3.8347   6.5264  51.0266 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -34.9249    18.6329 -1.8744  0.06088
## log(POP2010)   6.8259     1.5008  4.5482 5.41e-06
## 
## Lambda: 0.62626, LR test value: 20.568, p-value: 5.7558e-06
## Asymptotic standard error: 0.11821
##     z-value: 5.298, p-value: 1.1711e-07
## Wald statistic: 28.068, p-value: 1.1711e-07
## 
## Log likelihood: -223.4471 for error model
## ML residual variance (sigma squared): 205.37, (sigma: 14.331)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 454.89, (AIC for lm: 473.46)

The coefficient estimate for log(pop2010) is 6.8259.

Q7

aqi_cal$fitted_sem <- aqi_cal.sem$fitted.values
spplot(aqi_cal, "fitted_sem", main = "Trend")

Q8

aqi_cal$resid_sem <- aqi_cal.sem$residuals
spplot(aqi_cal, "resid_sem", main = "Residuals")

spplot(cal_aqi, "Median.AQI", main = "Air Quality Index:")

Q9

aqi_cal.lag <- lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, aqi_cal_qw, zero.policy = T)
summary(aqi_cal.lag)
## 
## Call:lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, 
##     listw = aqi_cal_qw, zero.policy = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8057 -10.3559  -3.6790   8.3136  52.6841 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  -37.9223    15.9265 -2.3811 0.0172619
## log(POP2010)   5.0543     1.4046  3.5985 0.0003201
## 
## Rho: 0.50248, LR test value: 16.071, p-value: 6.0999e-05
## Asymptotic standard error: 0.12658
##     z-value: 3.9697, p-value: 7.1964e-05
## Wald statistic: 15.758, p-value: 7.1964e-05
## 
## Log likelihood: -225.6953 for lag model
## ML residual variance (sigma squared): 233.63, (sigma: 15.285)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 459.39, (AIC for lm: 473.46)
## LM test for residual autocorrelation
## test value: 1.6173, p-value: 0.20347

The coefficient estimate for log(pop2010) is 5.0543.

Q10

lm <- lm.LMtests(cal.ols, aqi_cal_qw, test = "all")
lm
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## LMerr = 26.542, df = 1, p-value = 2.578e-07
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## LMlag = 23.033, df = 1, p-value = 1.592e-06
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## RLMerr = 3.6284, df = 1, p-value = 0.0568
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## RLMlag = 0.11921, df = 1, p-value = 0.7299
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## SARMA = 26.661, df = 2, p-value = 1.624e-06
t <- sacsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, aqi_cal_qw, zero.policy = T)
t
## 
## Call:
## sacsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, 
##     listw = aqi_cal_qw, zero.policy = T)
## Type: sac 
## 
## Coefficients:
##          rho       lambda  (Intercept) log(POP2010) 
##   -0.2962962    0.7615587  -15.7882247    6.4292711 
## 
## Log likelihood: -223.0551

The Spatial Error Model best fits the data showing a better log liklihood.