1 Introduction

Various studies have found a significant relationship between urban air pollution and concentrations of vulnerable populations, such as racial minorities and people of low socioeconomic status (Jbaily et al., 2020). Long term exposure to high concentrations of urban air pollution can lead to serious health effects. Although urban air pollution is a major source of disease and mortality, agricultural air pollution also has significant health effects; it is responsible for approximately 18% of air pollution related-deaths in the US in 2018 (Cook, 2021). The Area Deprivation Index (ADI) is an adapted version of a measurement created by the Health Resources and Services Administration (HRSA) that seeks to quantitatively describe these social factors that contribute to health outcomes. This script will examine air pollution exposure inequity in urban and rural regions in California by comparing pollution with ADI at different urban-rural classifications. These classifications, created by the National Center for Health Satistics (NCHS), group counties based on the presence and size of metropolitan or micropolitan areas (as defined by the Office of Management and Budget) within the counties. Data analysis was completed for 5 randomly selected dates in 2013 for simplicity and because urban-rural classification data was only available for 2013.

2 Initial data analysis steps

2.1 Load Packages

library(tidyverse)
library(dplyr)
library(tidycensus)
library(sociome)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)

2.2 Download data for analysis

2.2.1 Download data from csv files

These csv files were obtained from the CDC Website

Note: PM 2.5 and Ozone data were filtered for statefips=06 (California), and the 5 randomly selected dates in 2013 so that the data could be downloaded in a reasonable amount of time

#download CDC PM 2.5 data
pm2.5 <- read_csv("/Users/lorigerstenfeld/Downloads/Daily_Census_Tract-Level_PM2.5_Concentrations__2011-2015-FILTERED.csv")

#dowload CDC ozone data
ozone <- read_csv("/Users/lorigerstenfeld/Downloads/Daily_Census_Tract-Level_Ozone_Concentrations__2011-2015 FILTERED.csv")

# download National Center for Health Statistics (NCHS) urban-rural classification data
urbanruralclass <- read_csv("/Users/lorigerstenfeld/Downloads/NCHSURCodes2013.csv")

2.2.2 Create ADI data for California in 2013 by census tract using sociome package

# must install api key before using sociome package with tidycensus::census_api_key 
ca_adi <- sociome::get_adi(geography="tract", year=2013, state="CA")

3 Data wrangling

3.0.1 Filter data for California from all datasets

#pm2.5 and ozone datasets were pre-selected for California

# urban rural classification data
urbanruralclass_ca <- urbanruralclass %>%
  dplyr::filter(`State Abr.`== "CA")

3.1 Join datasets

3.1.1 Join PM 2.5 and ozone datasets

pollution_combined <- pm2.5 %>%
  dplyr::left_join(ozone, by=c("year", "date", "statefips", "countyfips", "ctfips")) %>% #joining datasets
  dplyr::select(c("date", "countyfips", "ctfips", "DS_PM_pred", "DS_O3_pred")) #selecting only date, county, census tract, and predicted measurements of each pollutant

3.1.2 Join pollution dataset with adi dataset

pollution_adi <- pollution_combined %>%
  dplyr::left_join(ca_adi, by=c("ctfips"="GEOID")) %>% #joining datasets
  dplyr::select(c("date", "countyfips", "ctfips", "DS_PM_pred", "DS_O3_pred", "ADI")) #selecting for relevant columns

3.1.3 Join urban-rural classification with pollution_adi dataset

#separate FIPS code in urban-rural data set into state and county FIPS codes
urbanrural_ca_sep <- urbanruralclass_ca %>%
  tidyr::separate(`FIPS code`, into=c("statefips", "countyfips"), sep=1)

#complete join with pollution dataset, utlizing only the county FIPS code
pollution_adi_urbanrural <- pollution_adi %>%
  dplyr::left_join(urbanrural_ca_sep, by="countyfips") %>%
  dplyr::select(c("date", "countyfips", "ctfips", "pm2.5"="DS_PM_pred", "ozone"="DS_O3_pred", "ADI", `County name`, "urbanrural"=`2013 code`)) #select relevant columns

3.2 Remove NAs from final data set

pollution_adi_urbanrural_cleaned <- na.omit(pollution_adi_urbanrural)

4 Analyze data

4.1 Create plots

#create matrix that matches the name of each urban-rural classification to its number
urbanrural_names <- c("Large Central Metro", "Large Fringe Metro", "Medium Metro", "Small Metro", "Micropolitan", "Non-core")
names(urbanrural_names) <- c("1", "2", "3", "4", "5", "6")

#create plot for ADI vs. PM 2.5
ggplot(pollution_adi_urbanrural_cleaned, aes(pm2.5, ADI))+
 geom_point(alpha=0.1, shape=1)+
  geom_smooth(method="lm", size=2, aes(color=factor(urbanrural)), se=F, show.legend=F)+
  facet_grid(cols=vars(urbanrural), labeller=labeller(urbanrural=urbanrural_names))+ 
  theme_classic()+
  labs(title="Figure 1: ADI vs. PM 2.5 Concentrations by 2013 Urban-rural Classifications in California",
       subtitle="Data from CDC and sociome package",
       caption=str_wrap("Figure 1: PM 2.5 data from the CDC regressed against ADI data from the sociome package for all California census tracts. Colored lines represent linear regressions for the data in each facet. All data presented is from five randomly selected dates in 2013. See Table 1 for slopes, R^2, and p-values for all regressions. Facets are split by the urban-rural classifications of counties, as defined by the NCHS. By these classifications, large central metro areas are metro counties as defined by the Office of Management and Budget (OMB) to have a population over 1 million, large fringe metro counties are areas surrounding large metro counties that do not qualify as large central metro counties themselves. Medium metro counties have a population between 250,000 and 999,999 and small metro counties are metro areas with a population under 250,000. Non-metro classifications are micropolitan areas with a population between 10,000 and 50,000 people. Non-core counties are not defined as either metropolitan or micropolitan by the OMB.", 130), 
       x="PM 2.5 (µg/m³)",
       y="Area Deprivation Index (ADI)")+
    theme(plot.title = element_text(size=17, hjust=0.5, vjust=1),
          plot.subtitle=element_text(size=14, hjust=0.5, vjust=1),
          plot.caption=element_text(size=11, hjust=0))

#create plot for ADI vs. ozone
ggplot(pollution_adi_urbanrural_cleaned, aes(ozone, ADI))+
  geom_point(alpha=0.1, shape=1)+
  geom_smooth(method="lm", size=2, aes(color=factor(urbanrural)), se=F, show.legend=F)+
  facet_grid(cols=vars(urbanrural), labeller=labeller(urbanrural=urbanrural_names))+
  theme_classic()+
   labs(title="Figure 2: ADI vs. Ozone Concentration by 2013 Urban-rural Classifications in California",
       subtitle="Data from CDC and sociome package",
       caption=str_wrap("Figure 2: Ground level ozone data from the CDC regressed against ADI from the sociome package for all California census tracts. Colored lines represent linear regressions for the data in each facet. All data presented is from five randomly selected dates in 2013. See Table 2 for slopes, R^2, and p-values for all regressions. Facets are split by the urban-rural classifications of counties, as defined by the NCHS. By these classifications, large central metro areas are metro counties as defined by the Office of Management and Budget (OMB) to have a population over 1 million, large fringe metro counties are areas surrounding large metro counties that do not qualify as large central metro counties themselves. Medium metro counties have a population between 250,000 and 999,999 and small metro counties are metro areas with a population under 250,000. Non-metro classifications are micropolitan areas with a population between 10,000 and 50,000 people. Non-core counties are not defined as either metropolitan or micropolitan by the OMB.", 130), 
       x="Ozone (ppb)",
       y="Area Deprivation Index (ADI)")+
    theme(plot.title = element_text(size=17, hjust=0.5, vjust=1),
          plot.subtitle=element_text(size=14, hjust=0.5, vjust=1),
          plot.caption=element_text(size=11, hjust=0))

4.2 Linear regression

4.2.1 Create linear regressions

4.2.1.1 Create seperate data sets for each urban-rural classification

This step is necessary to regress each urban-rural classification separately.

urbanrural1 <- pollution_adi_urbanrural_cleaned %>%
  dplyr::filter(urbanrural==1)

urbanrural2 <- pollution_adi_urbanrural_cleaned %>%
  dplyr::filter(urbanrural==2)

urbanrural3 <- pollution_adi_urbanrural_cleaned %>%
  dplyr::filter(urbanrural==3)

urbanrural4 <- pollution_adi_urbanrural_cleaned %>%
  dplyr::filter(urbanrural==4)

urbanrural5 <- pollution_adi_urbanrural_cleaned %>%
  dplyr::filter(urbanrural==5)

urbanrural6 <- pollution_adi_urbanrural_cleaned %>%
  dplyr::filter(urbanrural==6)

4.2.2 Make tibble of linear regressions of PM 2.5 vs. ADI

linmod_pm_summary_1 <- lm(ADI~pm2.5, urbanrural1) %>%
  summary()
stats_summary1 <- tibble("Urban Rural Classification"= 1,
                    "slope"  = linmod_pm_summary_1$coefficients[2,1], #slope from summary
                    "p value"   = scales::pvalue(linmod_pm_summary_1$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), #p-value from summary formatted for minimum accuracy of interest
                    "R^2^"  = linmod_pm_summary_1$r.squared)

#repeating above for urban-rural classification 2
linmod_pm_summary_2 <- lm(ADI~pm2.5, urbanrural2) %>%
  summary()
stats_summary2 <- tibble("Urban Rural Classification"= 2,
                    "slope"  = linmod_pm_summary_2$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_pm_summary_2$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_pm_summary_2$r.squared)

#urban-rural classification 3
linmod_pm_summary_3 <- lm(ADI~pm2.5, urbanrural3) %>%
  summary()
stats_summary3 <- tibble("Urban Rural Classification"= 3,
                    "slope"  = linmod_pm_summary_3$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_pm_summary_3$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_pm_summary_3$r.squared)

#urban-rural classification 4
linmod_pm_summary_4 <- lm(ADI~pm2.5, urbanrural4) %>%
  summary()
stats_summary4 <- tibble("Urban Rural Classification"= 4,
                    "slope"  = linmod_pm_summary_4$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_pm_summary_4$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_pm_summary_4$r.squared)

#urban-rural classification 5
linmod_pm_summary_5 <- lm(ADI~pm2.5, urbanrural5) %>%
  summary()
stats_summary5 <- tibble("Urban Rural Classification"= 5,
                    "slope"  = linmod_pm_summary_5$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_pm_summary_5$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_pm_summary_5$r.squared)

#urban-rural classification 6
linmod_pm_summary_6 <- lm(ADI~pm2.5, urbanrural6) %>%
  summary()
stats_summary6 <- tibble("Urban Rural Classification"= 6,
                    "slope"  = linmod_pm_summary_6$coefficients[2,1],
                    "p value"   = scales::pvalue(linmod_pm_summary_6$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
                    "R^2^"  = linmod_pm_summary_6$r.squared)

#combining all stat summaries for urban-rural classification 1-6 for PM 2.5 data vs. ADI
pmstats <- rbind(stats_summary1,stats_summary2,stats_summary3,stats_summary4,stats_summary5,stats_summary6)

4.2.3 Make tibble of linear regressions of ozone vs. ADI

#repeat process above, just with ozone data

#urban-rural classification 1
linmod_o_summary_1 <- lm(ADI~ozone, urbanrural1) %>%
  summary()
ostats_summary1 <- tibble("Urban Rural Classification"= 1,
                    "slope"  = linmod_o_summary_1$coefficients[2,1], #slope from summary
                    "p value"   = scales::pvalue(linmod_o_summary_1$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),#p-value from summary formatted for desired accuracy
                    "R^2^"  = linmod_o_summary_1$r.squared)

#urban-rural classification 2
linmod_o_summary_2 <- lm(ADI~ozone, urbanrural2) %>%
  summary()
ostats_summary2 <- tibble("Urban Rural Classification"= 2,
                    "slope"  = linmod_o_summary_2$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_o_summary_2$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_o_summary_2$r.squared)

#urban-rural classification 3
linmod_o_summary_3 <- lm(ADI~ozone, urbanrural3) %>%
  summary()
ostats_summary3 <- tibble("Urban Rural Classification"= 3,
                    "slope"  = linmod_o_summary_3$coefficients[2,1],
                    "p value"   = scales::pvalue(linmod_o_summary_3$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE),
                    "R^2^"  = linmod_o_summary_3$r.squared)

# urban-rural classification 4
linmod_o_summary_4 <- lm(ADI~ozone, urbanrural4) %>%
  summary()
ostats_summary4 <- tibble("Urban Rural Classification"= 4,
                    "slope"  = linmod_o_summary_4$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_o_summary_4$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_o_summary_4$r.squared)

#urban-rural classification 5
linmod_o_summary_5 <- lm(ADI~ozone, urbanrural5) %>%
  summary()
ostats_summary5 <- tibble("Urban Rural Classification"= 5,
                    "slope"  = linmod_o_summary_5$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_o_summary_5$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_o_summary_5$r.squared)

#urban-rural classification 6
linmod_o_summary_6 <- lm(ADI~ozone, urbanrural6) %>%
  summary()
ostats_summary6 <- tibble("Urban Rural Classification"= 6,
                    "slope"  = linmod_o_summary_6$coefficients[2,1], 
                    "p value"   = scales::pvalue(linmod_o_summary_6$coefficients[2,4], accuracy=0.01, decimal.mark=".", prefix=NULL, add_p=FALSE), 
                    "R^2^"  = linmod_o_summary_6$r.squared)

#combine stat summaries into one tibble
ozonestats <- rbind(ostats_summary1,ostats_summary2,ostats_summary3,ostats_summary4,ostats_summary5,ostats_summary6)

4.2.4 Make tables from tibbles

#setting number of significant digits in tables to 4
options(digits=4)

#changing names of urban-rural classification from numbers to descriptions
pmstats$`Urban Rural Classification` <- urbanrural_names

#creating table for statistics of ADI vs. PM 2.5 regression
pmstats %>%
  kableExtra::kbl(digits = getOption("digits"), booktabs=T, caption="Table 1: Regression of ADI on PM 2.5 Pollution in California Census Tracts by 2013 Urban-rural Classification",
      align=c("ccc")) %>%
  kable_styling()

Table 1: Regression of ADI on PM 2.5 Pollution in California Census Tracts by 2013 Urban-rural Classification
Urban Rural Classification	slope	p value	R²
Large Central Metro	0.7070	<0.01	0.0147
Large Fringe Metro	-0.3263	<0.01	0.0059
Medium Metro	0.6538	<0.01	0.0502
Small Metro	0.1881	<0.01	0.0064
Micropolitan	-0.0961	0.32	0.0016
Non-core	0.0268	0.74	0.0003

#changing names of urban-rural classification from numbers to descriptions
ozonestats$`Urban Rural Classification` <- urbanrural_names

#creating table for statistics of ozone vs. ADI regression
ozonestats %>%
  kableExtra::kbl(digits = getOption("digits"), booktabs=T, caption="Table 2: Regression of ADI on Ozone Pollution in California Census Tracts by 2013 Urban-rural Classification",
      align=c("ccc")) %>%
  kable_styling()

Table 2: Regression of ADI on Ozone Pollution in California Census Tracts by 2013 Urban-rural Classification
Urban Rural Classification	slope	p value	R²
Large Central Metro	0.0848	<0.01	0.0020
Large Fringe Metro	0.4111	<0.01	0.1071
Medium Metro	0.1046	<0.01	0.0048
Small Metro	0.2129	<0.01	0.0259
Micropolitan	-0.2749	<0.01	0.0554
Non-core	-0.1851	<0.01	0.0440

5 Conclusions

The generally low R² values and low p-values demonstrate that most of the ADI vs. pollution regressions were weak but significant. The regressions for ADI vs. PM 2.5 in urban settings showed a higher correlation between the two variables than the suburban or rural settings. However, the ADI vs. ozone data in contrast had a higher correlation for the suburban and rural settings, but the slopes were negative for the rural settings. The generally low R² values indicate that ADI may not be a reliable indicator of air pollution inequities experienced by populations. Additionally, the inconsistencies associated with combining census tracts from different parts of California in counties with the same urban-rural classification may have contributed to the weak regressions found between pollution and ADI. This also demonstrates that local research may be better poised to answer specific questions about air pollution inequity than the large scale evaluation completed in this analysis.

References

Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Krieger, Nik, Jarrod Dalton, Cindy Wang, and Adam Perzynski. 2021. Sociome: Operationalizing Social Determinants of Health Data for Researchers. https://CRAN.R-project.org/package=sociome.

Müller, Kirill, and Hadley Wickham. 2021. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Walker, Kyle, and Matt Herman. 2021. Tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames. https://walker-data.com/tidycensus/.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

———. 2021a. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.

———. 2021b. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

———. 2021c. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2021. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.

Wickham, Hadley, and Dana Seidel. 2020. Scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

———. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.

Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

Air Pollution Inequity as Measured by ADI in Urban vs. Rural Settings in California

Lori Gerstenfeld

3/3/2022