Various studies have found a significant relationship between urban air pollution and concentrations of vulnerable populations, such as racial minorities and people of low socioeconomic status (Jbaily et al., 2020). Long term exposure to high concentrations of urban air pollution can lead to serious health effects. Although urban air pollution is a major source of disease and mortality, agricultural air pollution also has significant health effects; it is responsible for approximately 18% of air pollution related-deaths in the US in 2018 (Cook, 2021). The Area Deprivation Index (ADI) is an adapted version of a measurement created by the Health Resources and Services Administration (HRSA) that seeks to quantitatively describe these social factors that contribute to health outcomes. This script will examine air pollution exposure inequity in urban and rural regions in California by comparing pollution with ADI at different urban-rural classifications. These classifications, created by the National Center for Health Satistics (NCHS), group counties based on the presence and size of metropolitan or micropolitan areas (as defined by the Office of Management and Budget) within the counties. Data analysis was completed for 5 randomly selected dates in 2013 for simplicity and because urban-rural classification data was only available for 2013.
library(tidyverse)
library(dplyr)
library(tidycensus)
library(sociome)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)
These csv files were obtained from the CDC Website
Note: PM 2.5 and Ozone data were filtered for statefips=06 (California), and the 5 randomly selected dates in 2013 so that the data could be downloaded in a reasonable amount of time
#download CDC PM 2.5 data
pm2.5 <- read_csv("/Users/lorigerstenfeld/Downloads/Daily_Census_Tract-Level_PM2.5_Concentrations__2011-2015-FILTERED.csv")
#dowload CDC ozone data
ozone <- read_csv("/Users/lorigerstenfeld/Downloads/Daily_Census_Tract-Level_Ozone_Concentrations__2011-2015 FILTERED.csv")
# download National Center for Health Statistics (NCHS) urban-rural classification data
urbanruralclass <- read_csv("/Users/lorigerstenfeld/Downloads/NCHSURCodes2013.csv")
# must install api key before using sociome package with tidycensus::census_api_key
ca_adi <- sociome::get_adi(geography="tract", year=2013, state="CA")
#pm2.5 and ozone datasets were pre-selected for California
# urban rural classification data
urbanruralclass_ca <- urbanruralclass %>%
dplyr::filter(`State Abr.`== "CA")
pollution_combined <- pm2.5 %>%
dplyr::left_join(ozone, by=c("year", "date", "statefips", "countyfips", "ctfips")) %>% #joining datasets
dplyr::select(c("date", "countyfips", "ctfips", "DS_PM_pred", "DS_O3_pred")) #selecting only date, county, census tract, and predicted measurements of each pollutant
pollution_adi <- pollution_combined %>%
dplyr::left_join(ca_adi, by=c("ctfips"="GEOID")) %>% #joining datasets
dplyr::select(c("date", "countyfips", "ctfips", "DS_PM_pred", "DS_O3_pred", "ADI")) #selecting for relevant columns
#separate FIPS code in urban-rural data set into state and county FIPS codes
urbanrural_ca_sep <- urbanruralclass_ca %>%
tidyr::separate(`FIPS code`, into=c("statefips", "countyfips"), sep=1)
#complete join with pollution dataset, utlizing only the county FIPS code
pollution_adi_urbanrural <- pollution_adi %>%
dplyr::left_join(urbanrural_ca_sep, by="countyfips") %>%
dplyr::select(c("date", "countyfips", "ctfips", "pm2.5"="DS_PM_pred", "ozone"="DS_O3_pred", "ADI", `County name`, "urbanrural"=`2013 code`)) #select relevant columns
pollution_adi_urbanrural_cleaned <- na.omit(pollution_adi_urbanrural)
#create matrix that matches the name of each urban-rural classification to its number
urbanrural_names <- c("Large Central Metro", "Large Fringe Metro", "Medium Metro", "Small Metro", "Micropolitan", "Non-core")
names(urbanrural_names) <- c("1", "2", "3", "4", "5", "6")
#create plot for ADI vs. PM 2.5
ggplot(pollution_adi_urbanrural_cleaned, aes(pm2.5, ADI))+
geom_point(alpha=0.1, shape=1)+
geom_smooth(method="lm", size=2, aes(color=factor(urbanrural)), se=F, show.legend=F)+
facet_grid(cols=vars(urbanrural), labeller=labeller(urbanrural=urbanrural_names))+
theme_classic()+
labs(title="Figure 1: ADI vs. PM 2.5 Concentrations by 2013 Urban-rural Classifications in California",
subtitle="Data from CDC and sociome package",
caption=str_wrap("Figure 1: PM 2.5 data from the CDC regressed against ADI data from the sociome package for all California census tracts. Colored lines represent linear regressions for the data in each facet. All data presented is from five randomly selected dates in 2013. See Table 1 for slopes, R^2, and p-values for all regressions. Facets are split by the urban-rural classifications of counties, as defined by the NCHS. By these classifications, large central metro areas are metro counties as defined by the Office of Management and Budget (OMB) to have a population over 1 million, large fringe metro counties are areas surrounding large metro counties that do not qualify as large central metro counties themselves. Medium metro counties have a population between 250,000 and 999,999 and small metro counties are metro areas with a population under 250,000. Non-metro classifications are micropolitan areas with a population between 10,000 and 50,000 people. Non-core counties are not defined as either metropolitan or micropolitan by the OMB.", 130),
x="PM 2.5 (µg/m³)",
y="Area Deprivation Index (ADI)")+
theme(plot.title = element_text(size=17, hjust=0.5, vjust=1),
plot.subtitle=element_text(size=14, hjust=0.5, vjust=1),
plot.caption=element_text(size=11, hjust=0))
#create plot for ADI vs. ozone
ggplot(pollution_adi_urbanrural_cleaned, aes(ozone, ADI))+
geom_point(alpha=0.1, shape=1)+
geom_smooth(method="lm", size=2, aes(color=factor(urbanrural)), se=F, show.legend=F)+
facet_grid(cols=vars(urbanrural), labeller=labeller(urbanrural=urbanrural_names))+
theme_classic()+
labs(title="Figure 2: ADI vs. Ozone Concentration by 2013 Urban-rural Classifications in California",
subtitle="Data from CDC and sociome package",
caption=str_wrap("Figure 2: Ground level ozone data from the CDC regressed against ADI from the sociome package for all California census tracts. Colored lines represent linear regressions for the data in each facet. All data presented is from five randomly selected dates in 2013. See Table 2 for slopes, R^2, and p-values for all regressions. Facets are split by the urban-rural classifications of counties, as defined by the NCHS. By these classifications, large central metro areas are metro counties as defined by the Office of Management and Budget (OMB) to have a population over 1 million, large fringe metro counties are areas surrounding large metro counties that do not qualify as large central metro counties themselves. Medium metro counties have a population between 250,000 and 999,999 and small metro counties are metro areas with a population under 250,000. Non-metro classifications are micropolitan areas with a population between 10,000 and 50,000 people. Non-core counties are not defined as either metropolitan or micropolitan by the OMB.", 130),
x="Ozone (ppb)",
y="Area Deprivation Index (ADI)")+
theme(plot.title = element_text(size=17, hjust=0.5, vjust=1),
plot.subtitle=element_text(size=14, hjust=0.5, vjust=1),
plot.caption=element_text(size=11, hjust=0))
This step is necessary to regress each urban-rural classification separately.
urbanrural1 <- pollution_adi_urbanrural_cleaned %>%
dplyr::filter(urbanrural==1)
urbanrural2 <- pollution_adi_urbanrural_cleaned %>%
dplyr::filter(urbanrural==2)
urbanrural3 <- pollution_adi_urbanrural_cleaned %>%
dplyr::filter(urbanrural==3)
urbanrural4 <- pollution_adi_urbanrural_cleaned %>%
dplyr::filter(urbanrural==4)
urbanrural5 <- pollution_adi_urbanrural_cleaned %>%
dplyr::filter(urbanrural==5)
urbanrural6 <- pollution_adi_urbanrural_cleaned %>%
dplyr::filter(urbanrural==6)
linmod_pm_summary_1 <- lm(ADI~pm2.5, urbanrural1) %>%
summary()
stats_summary1 <- tibble("Urban Rural Classification"= 1,
"slope" = linmod_pm_summary_1$coefficients[2,1], #slope from summary
"p value" = scales::pvalue(linmod_pm_summary_1$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), #p-value from summary formatted for minimum accuracy of interest
"R^2^" = linmod_pm_summary_1$r.squared)
#repeating above for urban-rural classification 2
linmod_pm_summary_2 <- lm(ADI~pm2.5, urbanrural2) %>%
summary()
stats_summary2 <- tibble("Urban Rural Classification"= 2,
"slope" = linmod_pm_summary_2$coefficients[2,1],
"p value" = scales::pvalue(linmod_pm_summary_2$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_pm_summary_2$r.squared)
#urban-rural classification 3
linmod_pm_summary_3 <- lm(ADI~pm2.5, urbanrural3) %>%
summary()
stats_summary3 <- tibble("Urban Rural Classification"= 3,
"slope" = linmod_pm_summary_3$coefficients[2,1],
"p value" = scales::pvalue(linmod_pm_summary_3$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_pm_summary_3$r.squared)
#urban-rural classification 4
linmod_pm_summary_4 <- lm(ADI~pm2.5, urbanrural4) %>%
summary()
stats_summary4 <- tibble("Urban Rural Classification"= 4,
"slope" = linmod_pm_summary_4$coefficients[2,1],
"p value" = scales::pvalue(linmod_pm_summary_4$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_pm_summary_4$r.squared)
#urban-rural classification 5
linmod_pm_summary_5 <- lm(ADI~pm2.5, urbanrural5) %>%
summary()
stats_summary5 <- tibble("Urban Rural Classification"= 5,
"slope" = linmod_pm_summary_5$coefficients[2,1],
"p value" = scales::pvalue(linmod_pm_summary_5$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_pm_summary_5$r.squared)
#urban-rural classification 6
linmod_pm_summary_6 <- lm(ADI~pm2.5, urbanrural6) %>%
summary()
stats_summary6 <- tibble("Urban Rural Classification"= 6,
"slope" = linmod_pm_summary_6$coefficients[2,1],
"p value" = scales::pvalue(linmod_pm_summary_6$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_pm_summary_6$r.squared)
#combining all stat summaries for urban-rural classification 1-6 for PM 2.5 data vs. ADI
pmstats <- rbind(stats_summary1,stats_summary2,stats_summary3,stats_summary4,stats_summary5,stats_summary6)
#repeat process above, just with ozone data
#urban-rural classification 1
linmod_o_summary_1 <- lm(ADI~ozone, urbanrural1) %>%
summary()
ostats_summary1 <- tibble("Urban Rural Classification"= 1,
"slope" = linmod_o_summary_1$coefficients[2,1], #slope from summary
"p value" = scales::pvalue(linmod_o_summary_1$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),#p-value from summary formatted for desired accuracy
"R^2^" = linmod_o_summary_1$r.squared)
#urban-rural classification 2
linmod_o_summary_2 <- lm(ADI~ozone, urbanrural2) %>%
summary()
ostats_summary2 <- tibble("Urban Rural Classification"= 2,
"slope" = linmod_o_summary_2$coefficients[2,1],
"p value" = scales::pvalue(linmod_o_summary_2$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_o_summary_2$r.squared)
#urban-rural classification 3
linmod_o_summary_3 <- lm(ADI~ozone, urbanrural3) %>%
summary()
ostats_summary3 <- tibble("Urban Rural Classification"= 3,
"slope" = linmod_o_summary_3$coefficients[2,1],
"p value" = scales::pvalue(linmod_o_summary_3$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_o_summary_3$r.squared)
# urban-rural classification 4
linmod_o_summary_4 <- lm(ADI~ozone, urbanrural4) %>%
summary()
ostats_summary4 <- tibble("Urban Rural Classification"= 4,
"slope" = linmod_o_summary_4$coefficients[2,1],
"p value" = scales::pvalue(linmod_o_summary_4$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_o_summary_4$r.squared)
#urban-rural classification 5
linmod_o_summary_5 <- lm(ADI~ozone, urbanrural5) %>%
summary()
ostats_summary5 <- tibble("Urban Rural Classification"= 5,
"slope" = linmod_o_summary_5$coefficients[2,1],
"p value" = scales::pvalue(linmod_o_summary_5$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_o_summary_5$r.squared)
#urban-rural classification 6
linmod_o_summary_6 <- lm(ADI~ozone, urbanrural6) %>%
summary()
ostats_summary6 <- tibble("Urban Rural Classification"= 6,
"slope" = linmod_o_summary_6$coefficients[2,1],
"p value" = scales::pvalue(linmod_o_summary_6$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
"R^2^" = linmod_o_summary_6$r.squared)
#combine stat summaries into one tibble
ozonestats <- rbind(ostats_summary1,ostats_summary2,ostats_summary3,ostats_summary4,ostats_summary5,ostats_summary6)
#setting number of significant digits in tables to 4
options(digits=4)
#changing names of urban-rural classification from numbers to descriptions
pmstats$`Urban Rural Classification` <- urbanrural_names
#creating table for statistics of ADI vs. PM 2.5 regression
pmstats %>%
kableExtra::kbl(digits = getOption("digits"), booktabs=T, caption="Table 1: Regression of ADI on PM 2.5 Pollution in California Census Tracts by 2013 Urban-rural Classification",
align=c("ccc")) %>%
kable_styling()
| Urban Rural Classification | slope | p value | R2 |
|---|---|---|---|
| Large Central Metro | 0.7070 | <0.01 | 0.0147 |
| Large Fringe Metro | -0.3263 | <0.01 | 0.0059 |
| Medium Metro | 0.6538 | <0.01 | 0.0502 |
| Small Metro | 0.1881 | <0.01 | 0.0064 |
| Micropolitan | -0.0961 | 0.32 | 0.0016 |
| Non-core | 0.0268 | 0.74 | 0.0003 |
#changing names of urban-rural classification from numbers to descriptions
ozonestats$`Urban Rural Classification` <- urbanrural_names
#creating table for statistics of ozone vs. ADI regression
ozonestats %>%
kableExtra::kbl(digits = getOption("digits"), booktabs=T, caption="Table 2: Regression of ADI on Ozone Pollution in California Census Tracts by 2013 Urban-rural Classification",
align=c("ccc")) %>%
kable_styling()
| Urban Rural Classification | slope | p value | R2 |
|---|---|---|---|
| Large Central Metro | 0.0848 | <0.01 | 0.0020 |
| Large Fringe Metro | 0.4111 | <0.01 | 0.1071 |
| Medium Metro | 0.1046 | <0.01 | 0.0048 |
| Small Metro | 0.2129 | <0.01 | 0.0259 |
| Micropolitan | -0.2749 | <0.01 | 0.0554 |
| Non-core | -0.1851 | <0.01 | 0.0440 |
The generally low R2 values and low p-values demonstrate that most of the ADI vs. pollution regressions were weak but significant. The regressions for ADI vs. PM 2.5 in urban settings showed a higher correlation between the two variables than the suburban or rural settings. However, the ADI vs. ozone data in contrast had a higher correlation for the suburban and rural settings, but the slopes were negative for the rural settings. The generally low R2 values indicate that ADI may not be a reliable indicator of air pollution inequities experienced by populations. Additionally, the inconsistencies associated with combining census tracts from different parts of California in counties with the same urban-rural classification may have contributed to the weak regressions found between pollution and ADI. This also demonstrates that local research may be better poised to answer specific questions about air pollution inequity than the large scale evaluation completed in this analysis.