##Introduction

As industrialization and the use of motorized vehicles grow in the United States, air pollution intensifies. Long-term exposure to air pollution can harm health, with effects that include cancer and decreased cognitive function. In this exploration, we examine the relationship between hazardous air pollutants in the United States and cancer risk. This question matters from both a public health perspective and an environmental justice perspective, given the recent increase in deregulation of environmental legislation, specifically the Clean Air Act. We aim to identify which locations are most vulnerable to air pollution and therefore face a higher cancer risk. We also investigate which pollutants occur most frequently and how that ties to cancer risk. To do so, we compare urban and rural communities across the United States and examine pollution levels and the associated cancer risk.

Using the conclusions from this analysis, we could recommend that local governments reduce air pollution by limiting emissions or investing in public transportation. The analysis could also serve as evidence for reinstating federal policies that have recently been struck down. The data could support further research comparing the types of cancer linked to air pollution, as well as non-cancer health problems.

The data for our research question come from the EPA, which gathers data from air quality monitoring stations across the US in order to analyze national cancer risk. The EPA used a 5-year average (2013-2017) of different types of pollutants, together with population levels and demographics within 1 mile of each monitoring station. We focus on the types of pollutants each station monitored and the associated cancer risk around these sites.

Our key variables are CancerRisk, Setting, Chem_class, City, and State. CancerRisk is the risk of developing cancer that an individual faces, calculated by the EPA with an algorithm that estimates the probability of developing cancer over time. City is the US city where the air pollution was measured, and State is the state in which that city is located. Setting denotes whether the city is in an urban or rural environment. Chem_class indicates the chemical class being measured: VOC, PAH, Carbonyl, or PM. VOCs (volatile organic compounds) are mostly human-made and can be produced by paint and pharmaceutical manufacturing, or by household objects like cleaning products (1). PAHs (polycyclic aromatic hydrocarbons) are produced when coal, wood, oil, gas, garbage, and tobacco are burned (2). Carbonyls are a type of VOC that contributes significantly to ozone formation and comes from sources like vehicle exhaust and industrial emissions (3). PM is particulate matter, which can come from sources like construction sites and fires, or be created by reactions involving emissions from power plants or automobiles (4).

##Data Tidying for Figure 1

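##Packages used below: tidyverse for wrangling and plotting, ggthemes for theme_clean().
##coord_map() and borders() in Figure 3 also need the mapproj and maps packages installed.

library(tidyverse)
library(ggthemes)

BKMdata <- Project_Data_1_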

##We renamed the columns we needed because the original data set was imported with placeholder names (...2, ...3, and so on).

BKMdata <- BKMdata |>
  rename(Latitude = ...2,
         Longitude = ...3,
         URE = ...17,
         CancerRisk = ...19,
         CRinAmil = ...20,
         City = ...12,
         Setting = ...16,
         Year = ...7, 
         State = ...5, 
         Chem_class = ...18)

##We created a new data set with only the variables we are doing analysis on to use for figures 2 and 3.

BKMdata_tidy <- BKMdata |>
  distinct(Latitude, Longitude, Chem_class, URE, CancerRisk, CRinAmil, City, Setting, Year, State) |>
  slice(-1)  # drop the first row, which we assume holds the original header text

##For Figure 1, we needed the mean cancer risk for each pollutant class to better compare Urban and Rural settings, so we computed it here.

BKM_summary <- BKMdata_tidy |>
  mutate(CancerRisk = as.numeric(CancerRisk)) |>   # convert before grouping
  group_by(Setting, Chem_class) |>
  summarize(mean_CancerRisk = mean(CancerRisk, na.rm = TRUE),
            .groups = "drop")                       # return an ungrouped table
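
The grouping step can be tried on a small stand-in table before running it on the full data. The sketch below is only an illustration: the `toy` rows and their values are made up, not taken from the EPA file, and the column names simply mirror the ones used in this report. It also shows what happens when a group has no usable values.

```r
library(dplyr)

# Hypothetical rows standing in for the EPA data; every value here is made up.
toy <- tibble(
  Setting    = c("Urban", "Urban", "Urban", "Rural", "Rural"),
  Chem_class = c("VOC", "VOC", "PM", "VOC", "PM"),
  CancerRisk = c("2e-05", "4e-05", "1e-05", "1e-05", NA)
)

toy_summary <- toy |>
  mutate(CancerRisk = as.numeric(CancerRisk)) |>    # text to numeric
  group_by(Setting, Chem_class) |>                  # one group per setting/class pair
  summarize(mean_CancerRisk = mean(CancerRisk, na.rm = TRUE),
            .groups = "drop")

print(toy_summary)
```

With `na.rm = TRUE`, a group whose only value is missing (Rural PM in this toy table) returns `NaN` rather than an error, so it is worth checking the summary for `NaN` before plotting.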

##Code for Figure 1.

ggplot(BKM_summary, aes(x = Setting, y = mean_CancerRisk, fill = Chem_class)) +
  geom_col(position = "dodge") +    
  scale_fill_viridis_d() +
  theme_clean() +
  labs(
    title = "Figure 1: Average Chemical Exposure in Urban vs. Rural Settings",
    x = "Setting (Urban vs Rural)",
    y = "Average Cancer Risk",
    fill = "Chemical Class",
    caption = "Data Source: https://catalog.data.gov/dataset/an-examination-of-national-cancer-risk-based-on-monitored-hazardous-ambient-air-pollutants"
  )

Bar chart depicting Rural vs. Urban Settings by Chem_class frequency.

Alt-text for Figure 1: This is a bar chart depicting how rural and urban cities are affected by different air pollutants. The x-axis is Setting, a binary categorical variable that takes the value “Rural” or “Urban,” and the y-axis is Average Cancer Risk, a numerical variable describing an individual’s average cancer risk, which ranges from 0 to 2.0e-05. The color depicts chemical class, a categorical variable that takes the values carbonyl, PAH, PM, and VOC. The graph shows that average cancer risk from air pollutants is considerably lower in rural cities than in urban cities, and that carbonyls are the pollutant class contributing most to that risk.

Brief Interpretation: This graph indicates a clear difference between the air pollutants observed in urban and rural environments. The total amount of observed pollutants in urban areas is more than triple the amount observed in rural areas. Every class of pollutant shows higher observed levels in urban environments, which in turn increases cancer risk.
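
One way to check an urban-to-rural comparison like this numerically is to total the per-class means within each setting and divide. This is a minimal sketch under stated assumptions: the numbers in `toy_summary` are invented for illustration and only mimic the shape of BKM_summary, so the ratio it produces says nothing about the real data.

```r
library(dplyr)

# Hypothetical per-class means shaped like BKM_summary; the values are made up.
toy_summary <- tibble(
  Setting         = rep(c("Rural", "Urban"), each = 2),
  Chem_class      = rep(c("Carbonyl", "VOC"), times = 2),
  mean_CancerRisk = c(4e-06, 1e-06, 1.2e-05, 6e-06)
)

# Total the class means within each setting.
totals <- toy_summary |>
  group_by(Setting) |>
  summarize(total_risk = sum(mean_CancerRisk, na.rm = TRUE))

# Ratio of the urban total to the rural total.
urban_to_rural <- totals$total_risk[totals$Setting == "Urban"] /
  totals$total_risk[totals$Setting == "Rural"]

print(urban_to_rural)
```

Running the same steps on BKM_summary would give the ratio behind the comparison in the interpretation.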

##Code for Figure 2.

BKMdata_tidy |>
  filter(Year == 2017) |>
  ggplot(aes(x = fct_infreq(City), fill = Chem_class)) +
  geom_bar() +    # no y aesthetic: bar length is the number of records per city
  scale_fill_viridis_d() +
  facet_grid(~Setting) +
  coord_flip() +
  theme_clean() +
  labs(title = "Figure 2: Amount of Observed Air Pollution by City and Setting",
       y = "Number of Pollutant Observations",
       x = "City",
       caption = "Data Source: https://catalog.data.gov/dataset/an-examination-of-national-cancer-risk-based-on-monitored-hazardous-ambient-air-pollutants")

Bar chart depicting Rural and Urban US Cities by Chem_class frequency.

Alt-text for Figure 2: This is a stacked bar chart depicting air pollutant observation counts by city, shown separately for rural and urban cities. The color denotes the class of air pollutant (carbonyl, PAH, PM, or VOC). The x-axis is the number of air pollutant observations recorded, ranging from 0 to 40. The y-axis is the US city where the observations were recorded. From the graph, we can see that there are more urban cities than rural cities, and that on average urban cities appear to have more air pollutant observations than rural cities.

Brief Interpretation: Similar to Figure 1, this graph indicates a difference in exposure to air pollutants across cities. All of the cities under the urban classification have relatively similar exposure levels, except for Atlanta, Portland, and Houston, which have generally lower levels. Further analysis would be needed to determine whether grouping by state provides any deeper understanding. A further analysis could also examine the connection between location (city or state) and cancer risk to identify which areas of the United States pose the highest risk and most need pollution reduction.
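
The state-level follow-up suggested above could start from a simple ranking of states by mean cancer risk. The sketch below is illustrative only: the `toy` rows and their values are made up, and with the real data BKMdata_tidy (with CancerRisk converted to numeric) would take their place.

```r
library(dplyr)

# Hypothetical records; the state labels and risk values are made up.
toy <- tibble(
  State      = c("TX", "TX", "GA", "OR", "OR"),
  CancerRisk = c(3e-05, 5e-05, 2e-05, 1e-05, 2e-05)
)

state_risk <- toy |>
  group_by(State) |>
  summarize(
    n_records       = n(),                            # records behind each mean
    mean_CancerRisk = mean(CancerRisk, na.rm = TRUE)
  ) |>
  arrange(desc(mean_CancerRisk))                      # highest-risk states first

print(state_risk)
```

Keeping `n_records` next to the mean helps avoid over-reading a state that is represented by only one or two records.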

##For Figure 3 we did analysis on CancerRisk, so we turned it into a numeric variable.

BKMdata_tidy <- BKMdata_tidy |>
  mutate(CancerRisk = parse_number(CancerRisk))

##Graph code for Figure 3.

BKMdata_tidy |>
  filter(Year == 2017) |>
  mutate(across(c(Longitude, Latitude, CancerRisk), as.numeric)) |>
  ggplot(aes(x = Longitude, y = Latitude)) +
  geom_point(aes(size = CancerRisk, color = CancerRisk, shape = Setting)) +
  coord_map(xlim = c(-125, -65), ylim = c(25, 50)) +
  scale_color_viridis_c(option = "turbo") +
  borders("usa") +
  theme_bw() +
  labs(title = "Figure 3: Cancer Risk by City",
       caption= "Data Source: https://catalog.data.gov/dataset/an-examination-of-national-cancer-risk-based-on-monitored-hazardous-ambient-air-pollutants")

Map depicting Cancer Risk In US Cities

Alt-text for Figure 3: This is a map depicting cancer risk by city across the United States. The color and size of each point show the degree of cancer risk faced by an individual. The shape of each point represents whether the city is rural or urban, with urban shown as a triangle and rural as a circle. The y-axis is latitude, ranging from 25 to 50, and the x-axis is longitude, ranging from -125 to -65. The graph indicates a clear difference in cancer risk between settings: the highest cancer risk occurs in urban settings.

Brief Interpretation: Here we can see the cancer risk in each city, as well as whether the city is rural or urban. Most cities have similar cancer risks, between 2e-05 and 4e-05. A couple of cities have a notably higher cancer risk, mostly on the coasts. Further analysis could be conducted to discover whether there is a concrete explanation for this deviation from the norm or whether it is simply variation.
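
As a first pass at separating notably high cities from ordinary variation, one option is Tukey's 1.5 x IQR rule, which flags values above the third quartile plus 1.5 times the interquartile range. This is not a method used in the analysis above, only one possible screening step, and the `toy` cities and values below are made up for illustration.

```r
library(dplyr)

# Hypothetical city-level risks; the labels and values are made up.
toy <- tibble(
  City       = c("A", "B", "C", "D", "E", "F"),
  CancerRisk = c(2e-05, 3e-05, 3e-05, 4e-05, 3e-05, 2e-04)
)

flagged <- toy |>
  mutate(
    # Upper fence: third quartile plus 1.5 times the interquartile range.
    upper_fence = unname(quantile(CancerRisk, 0.75)) + 1.5 * IQR(CancerRisk),
    is_high     = CancerRisk > upper_fence
  )

print(filter(flagged, is_high))
```

A flag from this rule only marks a city as unusual relative to the others. Explaining why would still require outside information, such as nearby emission sources.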

##References:

  1. What are volatile organic compounds (VOCs)? | US EPA. (2019, February 19). US EPA. https://www.epa.gov/indoor-air-quality-iaq/what-are-volatile-organic-compounds-vocs

  2. PFAS Fact Sheets and Infographics | Per- and Polyfluoroalkyl Substances (PFAS) | US EPA. (2018, March 19). US EPA. https://19january2021snapshot.epa.gov/pfas/pfas-fact-sheets-and-infographics_.html

  3. Liu, Q., Gao, Y., Huang, W., Ling, Z., Wang, Z., & Wang, X. (2022). Carbonyl compounds in the atmosphere: A review of abundance, source and their contributions to O3 and SOA formation. Atmospheric Research, 106184. https://doi.org/10.1016/j.atmosres.2022.106184

  4. Particulate Matter (PM) Basics | US EPA. (2016, April 19). US EPA. https://www.epa.gov/pm-pollution/particulate-matter-pm-basics

  5. EPA. (n.d.). An examination of national cancer risk based on monitored hazardous ambient air pollutants - Catalog. Dataset - Catalog; Publisher. U.S. EPA Office of Research and Development (ORD). Retrieved December 10, 2025, from https://catalog.data.gov/dataset/an-examination-of-national-cancer-risk-based-on-monitored-hazardous-ambient-air-pollutants