AQI_POP <- readRDS("~/USU/2023/GEOG 4870/Module 10_Spatial_Regression/Spatial Regression/AQI_POP.RDS")
county <- readRDS("~/USU/2023/GEOG 4870/Module 10_Spatial_Regression/Spatial Regression/county.RDS")
library(tidyverse)
library(ggExtra)
library(spdplyr)
library(spdep)
library(terra)
library(spdep)
library(spatialreg)
Today we will work with a dataset that merges the EPA’s air quality data with Census estimates of population. We will be building toy regressions with some simple datasets, so be sure to take whatever results you may generate with a grain of salt. sdasd asd
AQI_POP.RDS and
county.RDS).Year == 2003 for the state of
California. To make sure you filtered correctly, be sure
that mean(aqi@data$Median.AQI) == 50.62963.STATEFP to
do this.merge(). Make sure the result of your
merge() is a new shapefile for counties in California that
now includes attributes stored in the AQI dataset.spplot() to create a map of the
Median.AQI in each county. Median AQI is the median value
of the air quality index for the year 2003. High values of AQI are bad,
low values are good.aqi <- filter(AQI_POP, State == 'California' & Year == '2003')
mean(aqi$Median.AQI)
## [1] 50.62963
cty <- county[county$STATEFP == "06",]
###plot(cty)
# merge
####cal_aqi_m <- cty@data = data.frame(cty@data, aqi[match(sdata@, aqi$Median.AQI)])
cal_aqi <- merge(cty, aqi)
names(cal_aqi)
## [1] "STATEFP" "COUNTYFP"
## [3] "GEOID" "COUNTYNS"
## [5] "AFFGEOID" "NAME"
## [7] "LSAD" "ALAND"
## [9] "AWATER" "State"
## [11] "County" "Year"
## [13] "Days.with.AQI" "Good.Days"
## [15] "Moderate.Days" "Unhealthy.for.Sensitive.Groups.Days"
## [17] "Unhealthy.Days" "Very.Unhealthy.Days"
## [19] "Hazardous.Days" "Max.AQI"
## [21] "X90th.Percentile.AQI" "Median.AQI"
## [23] "Days.CO" "Days.NO2"
## [25] "Days.Ozone" "Days.SO2"
## [27] "Days.PM2.5" "Days.PM10"
## [29] "Abbr" "POP2010"
# spplot
spplot(cal_aqi, "Median.AQI", main = "Air Quality Index:")
Create a new shapefile in which you remove all rows in the merged
AQI-county shapefile in which Median.AQI is
NA. Use this clean shapefile in all subsequent
analyses. Use the ggExtra package function
ggMarginal() to create a visualization of your
Median.AQI and POP2010 with histograms on each
axis and a scatter plot in the center (see tutorial for hints).
# filter
aqi_cal <- cal_aqi[!is.na(cal_aqi@data$Median.AQI),]
# plot
pop_aqi <- ggplot(data = aqi_cal@data, aes(x = aqi_cal$Median.AQI, y = aqi_cal$POP2010, color = aqi_cal$County)) +
geom_point() +
theme(legend.position = "none") +
xlab("MedianAQI") +
ylab("2010 Population")
ggMarginal(pop_aqi, type = "histogram")
Let’s estimate a simple OLS regression with lm(). We
will use the formula Median.AQI ~ log(POP2010). Population
data is not normally distributed, so we will use a log transformation of
this dataset to address this. To read about interpreting coefficients
that have been log-transformed, see this.
Print the summary() of your regression results. What is
the value of the coefficient estimate for log(POP2010)?
# run OLS, print summary
cal.ols <- lm(Median.AQI ~ log(POP2010), data = aqi_cal)
summary(cal.ols)
##
## Call:
## lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.211 -16.309 -1.865 7.307 57.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.576 19.470 -1.725 0.0906 .
## log(POP2010) 6.881 1.577 4.362 6.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.7 on 52 degrees of freedom
## Multiple R-squared: 0.2679, Adjusted R-squared: 0.2538
## F-statistic: 19.03 on 1 and 52 DF, p-value: 6.119e-05
Plot the residuals from this regression using spplot().
Based on your knowledge of California, what might explain spatial
patterns in the residuals?
aqi_cal$resid <- residuals(cal.ols)
spplot(aqi_cal, "resid")
nb2lsitw() with the
default row standardized weights (style=W).lm.morantest() to test for spatial autocorrelation
in the residuals of the OLS regression you ran in Q3.# create queen nb
queen <- poly2nb(aqi_cal)
# create weights
aqi_cal_qw <- nb2listw(queen, style= "W", zero.policy = T)
# visualize
#id <- row.names(as(aqi_cal, "data.frame"))
#coords <- coordinates(aqi_cal)
#spplot(queen)
# lm.morantest()
lm.morantest(cal.ols, aqi_cal_qw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
##
## Moran I statistic standard deviate = 5.7926, p-value = 3.466e-09
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.487380556 -0.027552977 0.007902444
install.packages(“spatialreg”) library * Run a spatial error
model using the clean data you created in Q2, the weights object created
in Q5, and following formula
formula = Median.AQI ~ log(POP2010). * What is the
coefficient estimate for log(POP2010)?
# run spatial error model
aqi_cal.sem <- errorsarlm(Median.AQI ~ log(POP2010), data = aqi_cal, listw = aqi_cal_qw, quiet = T)
summary(aqi_cal.sem)
##
## Call:errorsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal,
## listw = aqi_cal_qw, quiet = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0447 -9.3886 -3.8347 6.5264 51.0266
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.9249 18.6329 -1.8744 0.06088
## log(POP2010) 6.8259 1.5008 4.5482 5.41e-06
##
## Lambda: 0.62626, LR test value: 20.568, p-value: 5.7558e-06
## Asymptotic standard error: 0.11821
## z-value: 5.298, p-value: 1.1711e-07
## Wald statistic: 28.068, p-value: 1.1711e-07
##
## Log likelihood: -223.4471 for error model
## ML residual variance (sigma squared): 205.37, (sigma: 14.331)
## Number of observations: 54
## Number of parameters estimated: 4
## AIC: 454.89, (AIC for lm: 473.46)
spplot() to visualize the county-level fitted
values from the regression you ran in Q6.aqi_cal$fitted_sem <- aqi_cal.sem$fitted.values
spplot(aqi_cal, "fitted_sem", main = "Trend")
spplot() to visualize the county-level residuals
from the regression you ran in Q6. Visually compare this map to the map
of Median.AQI.aqi_cal$resid_sem <- aqi_cal.sem$residuals
spplot(aqi_cal, "resid_sem", main = "Residuals")
spplot(cal_aqi, "Median.AQI", main = "Air Quality Index:")
formula = Median.AQI ~ log(POP2010). Set
zero.policy = T.log(POP2010)?aqi_cal.lag <- lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, aqi_cal_qw, zero.policy = T)
summary(aqi_cal.lag)
##
## Call:lagsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal,
## listw = aqi_cal_qw, zero.policy = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8057 -10.3559 -3.6790 8.3136 52.6841
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.9223 15.9265 -2.3811 0.0172619
## log(POP2010) 5.0543 1.4046 3.5985 0.0003201
##
## Rho: 0.50248, LR test value: 16.071, p-value: 6.0999e-05
## Asymptotic standard error: 0.12658
## z-value: 3.9697, p-value: 7.1964e-05
## Wald statistic: 15.758, p-value: 7.1964e-05
##
## Log likelihood: -225.6953 for lag model
## ML residual variance (sigma squared): 233.63, (sigma: 15.285)
## Number of observations: 54
## Number of parameters estimated: 4
## AIC: 459.39, (AIC for lm: 473.46)
## LM test for residual autocorrelation
## test value: 1.6173, p-value: 0.20347
lm.LMtests to determine which spatial model best
fits the data, formula, and weights object we used in Q6 and Q9.lm <- lm.LMtests(cal.ols, aqi_cal_qw, test = "all")
lm
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
##
## LMerr = 26.542, df = 1, p-value = 2.578e-07
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
##
## LMlag = 23.033, df = 1, p-value = 1.592e-06
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
##
## RLMerr = 3.6284, df = 1, p-value = 0.0568
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
##
## RLMlag = 0.11921, df = 1, p-value = 0.7299
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Median.AQI ~ log(POP2010), data = aqi_cal)
## weights: aqi_cal_qw
##
## SARMA = 26.661, df = 2, p-value = 1.624e-06
t <- sacsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal, aqi_cal_qw, zero.policy = T)
t
##
## Call:
## sacsarlm(formula = Median.AQI ~ log(POP2010), data = aqi_cal,
## listw = aqi_cal_qw, zero.policy = T)
## Type: sac
##
## Coefficients:
## rho lambda (Intercept) log(POP2010)
## -0.2962962 0.7615587 -15.7882247 6.4292711
##
## Log likelihood: -223.0551