Set up

AQI_POP <- readRDS("~/USU/2023/GEOG 4870/Module 10_Spatial_Regression/Spatial Regression/AQI_POP.RDS")
county <- readRDS("~/USU/2023/GEOG 4870/Module 10_Spatial_Regression/Spatial Regression/county.RDS")
library(tidyverse)
library(ggExtra)
library(spdplyr)
library(spdep)
library(terra)
library(spdep)
library(spatialreg)

Today we will work with a dataset that merges the EPA’s air quality data with Census estimates of population. We will be building toy regressions with some simple datasets, so be sure to take whatever results you may generate with a grain of salt. sdasd asd

Q1

First, load the data (AQI_POP.RDS and county.RDS).
Filter rows so our new datasets only includes air quality and population data from Year == 2003 for the state of California. To make sure you filtered correctly, be sure that mean(aqi@data$Median.AQI) == 50.62963.
Load the county shapefile and filter counties in the state of California. You may have to look up California’s STATEFP to do this.
Append the AQI data to the California county shapefile using the merge(). Make sure the result of your merge() is a new shapefile for counties in California that now includes attributes stored in the AQI dataset.
Use spplot() to create a map of the Median.AQI in each county. Median AQI is the median value of the air quality index for the year 2003. High values of AQI are bad, low values are good.

aqi <- filter(AQI_POP, State == 'California' & Year == '2003')
mean(aqi$Median.AQI)

## [1] 50.62963

cty <- county[county$STATEFP == "06",]
###plot(cty)

# merge
####cal_aqi_m <- cty@data = data.frame(cty@data, aqi[match(sdata@, aqi$Median.AQI)])

cal_aqi <- merge(cty, aqi)
names(cal_aqi)

##  [1] "STATEFP"                             "COUNTYFP"                           
##  [3] "GEOID"                               "COUNTYNS"                           
##  [5] "AFFGEOID"                            "NAME"                               
##  [7] "LSAD"                                "ALAND"                              
##  [9] "AWATER"                              "State"                              
## [11] "County"                              "Year"                               
## [13] "Days.with.AQI"                       "Good.Days"                          
## [15] "Moderate.Days"                       "Unhealthy.for.Sensitive.Groups.Days"
## [17] "Unhealthy.Days"                      "Very.Unhealthy.Days"                
## [19] "Hazardous.Days"                      "Max.AQI"                            
## [21] "X90th.Percentile.AQI"                "Median.AQI"                         
## [23] "Days.CO"                             "Days.NO2"                           
## [25] "Days.Ozone"                          "Days.SO2"                           
## [27] "Days.PM2.5"                          "Days.PM10"                          
## [29] "Abbr"                                "POP2010"

# spplot

spplot(cal_aqi, "Median.AQI", main = "Air Quality Index:")

Q2

Create a new shapefile in which you remove all rows in the merged AQI-county shapefile in which Median.AQI is NA. Use this clean shapefile in all subsequent analyses. Use the ggExtra package function ggMarginal() to create a visualization of your Median.AQI and POP2010 with histograms on each axis and a scatter plot in the center (see tutorial for hints).

# filter
aqi_cal <- cal_aqi[!is.na(cal_aqi@data$Median.AQI),]
# plot


pop_aqi <- ggplot(data = aqi_cal@data, aes(x = aqi_cal$Median.AQI, y = aqi_cal$POP2010, color = aqi_cal$County)) +
  geom_point() +
  theme(legend.position = "none") +
  xlab("MedianAQI") +
  ylab("2010 Population")

ggMarginal(pop_aqi, type = "histogram")

Q3

Let’s estimate a simple OLS regression with lm(). We will use the formula Median.AQI ~ log(POP2010). Population data is not normally distributed, so we will use a log transformation of this dataset to address this. To read about interpreting coefficients that have been log-transformed, see this.

Print the summary() of your regression results. What is the value of the coefficient estimate for log(POP2010)?

# run OLS, print summary
cal.ols <- lm(Median.AQI ~ log(POP2010), data = aqi_cal)
summary(cal.ols)

## 
## Call:
## lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.211 -16.309  -1.865   7.307  57.716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.576     19.470  -1.725   0.0906 .  
## log(POP2010)    6.881      1.577   4.362 6.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.7 on 52 degrees of freedom
## Multiple R-squared:  0.2679, Adjusted R-squared:  0.2538 
## F-statistic: 19.03 on 1 and 52 DF,  p-value: 6.119e-05

Coefficient Estimate of Log(POP2010) is 6.881.

Q4

Plot the residuals from this regression using spplot(). Based on your knowledge of California, what might explain spatial patterns in the residuals?

aqi_cal$resid <- residuals(cal.ols)
spplot(aqi_cal, "resid")

Given that most population centers with higher expected levels of polution occur on the coast and the wind direction runs west to east from the Pacific, higher positive residual values may occur in the less populated eastern part of the state where lower AQIs are expected, but are influenced by western winds pushing pollution from coastal cities eastward into less populated areas.

Q5

Create a queen neighborhood with the filtered AQI-population data you constructed in Q2.
Create a weights matrix using nb2lsitw() with the default row standardized weights (style=W).
Visualize this weight matrix.
Use lm.morantest() to test for spatial autocorrelation in the residuals of the OLS regression you ran in Q3.
Is there spatial autocorrelation in the residuals?

# create queen nb
queen <- poly2nb(aqi_cal)
# create weights
aqi_cal_qw <- nb2listw(queen, style= "W", zero.policy = T)

# visualize
#id <- row.names(as(aqi_cal, "data.frame"))
#coords <- coordinates(aqi_cal)
#spplot(queen)

# lm.morantest() 
lm.morantest(cal.ols, aqi_cal_qw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## Moran I statistic standard deviate = 5.7926, p-value = 3.466e-09
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.487380556     -0.027552977      0.007902444

Yes, since the p-value is significant we can assume there is spatial autocorrelation in the risiduals.

Q6

install.packages(“spatialreg”) library * Run a spatial error model using the clean data you created in Q2, the weights object created in Q5, and following formula formula = Median.AQI ~ log(POP2010). * What is the coefficient estimate for log(POP2010)?

# run spatial error model
aqi_cal.sem <- errorsarlm(Median.AQI ~ log(POP2010), data = aqi_cal, listw = aqi_cal_qw, quiet = T)
summary(aqi_cal.sem)

## 
## Call:errorsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, 
##     listw = aqi_cal_qw, quiet = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0447  -9.3886  -3.8347   6.5264  51.0266 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -34.9249    18.6329 -1.8744  0.06088
## log(POP2010)   6.8259     1.5008  4.5482 5.41e-06
## 
## Lambda: 0.62626, LR test value: 20.568, p-value: 5.7558e-06
## Asymptotic standard error: 0.11821
##     z-value: 5.298, p-value: 1.1711e-07
## Wald statistic: 28.068, p-value: 1.1711e-07
## 
## Log likelihood: -223.4471 for error model
## ML residual variance (sigma squared): 205.37, (sigma: 14.331)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 454.89, (AIC for lm: 473.46)

The coefficient estimate for log(pop2010) is 6.8259.

Q7

Use spplot() to visualize the county-level fitted values from the regression you ran in Q6.

aqi_cal$fitted_sem <- aqi_cal.sem$fitted.values
spplot(aqi_cal, "fitted_sem", main = "Trend")

Q8

Use spplot() to visualize the county-level residuals from the regression you ran in Q6. Visually compare this map to the map of Median.AQI.

aqi_cal$resid_sem <- aqi_cal.sem$residuals
spplot(aqi_cal, "resid_sem", main = "Residuals")

spplot(cal_aqi, "Median.AQI", main = "Air Quality Index:")

Q9

Run a spatial lag model using the clean data you created in Q2, the weights object created in Q5, and following formula formula = Median.AQI ~ log(POP2010). Set zero.policy = T.
What is the coefficient estimate for log(POP2010)?

aqi_cal.lag <- lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, aqi_cal_qw, zero.policy = T)
summary(aqi_cal.lag)

## 
## Call:lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, 
##     listw = aqi_cal_qw, zero.policy = T)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8057 -10.3559  -3.6790   8.3136  52.6841 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  -37.9223    15.9265 -2.3811 0.0172619
## log(POP2010)   5.0543     1.4046  3.5985 0.0003201
## 
## Rho: 0.50248, LR test value: 16.071, p-value: 6.0999e-05
## Asymptotic standard error: 0.12658
##     z-value: 3.9697, p-value: 7.1964e-05
## Wald statistic: 15.758, p-value: 7.1964e-05
## 
## Log likelihood: -225.6953 for lag model
## ML residual variance (sigma squared): 233.63, (sigma: 15.285)
## Number of observations: 54 
## Number of parameters estimated: 4 
## AIC: 459.39, (AIC for lm: 473.46)
## LM test for residual autocorrelation
## test value: 1.6173, p-value: 0.20347

The coefficient estimate for log(pop2010) is 5.0543.

Q10

Use lm.LMtests to determine which spatial model best fits the data, formula, and weights object we used in Q6 and Q9.

lm <- lm.LMtests(cal.ols, aqi_cal_qw, test = "all")
lm

## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## LMerr = 26.542, df = 1, p-value = 2.578e-07
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## LMlag = 23.033, df = 1, p-value = 1.592e-06
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## RLMerr = 3.6284, df = 1, p-value = 0.0568
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## RLMlag = 0.11921, df = 1, p-value = 0.7299
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
## 
## SARMA = 26.661, df = 2, p-value = 1.624e-06

t <- sacsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, aqi_cal_qw, zero.policy = T)
t

## 
## Call:
## sacsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, 
##     listw = aqi_cal_qw, zero.policy = T)
## Type: sac 
## 
## Coefficients:
##          rho       lambda  (Intercept) log(POP2010) 
##   -0.2962962    0.7615587  -15.7882247    6.4292711 
## 
## Log likelihood: -223.0551

Spatial regression questions

Josh Carr

Set up

Q1

Q2

Q3

Coefficient Estimate of Log(POP2010) is 6.881.

Q4

Q5

Yes, since the p-value is significant we can assume there is spatial autocorrelation in the risiduals.

Q6

The coefficient estimate for log(pop2010) is 6.8259.

Q7

Q8

Q9

The coefficient estimate for log(pop2010) is 5.0543.

Q10

The Spatial Error Model best fits the data showing a better log liklihood.