Background & Objective

Pollution is an important factor in where people choose to live. Man-made air pollution contributes to climate change and threatens public health. People with health conditions such as asthma need to be particularly careful in choosing places to live. Thanks to significant efforts to promote clean energy and reduce air pollution, air quality in the US has improved in some respects. For example, average SO2 levels (a contributor to acid rain) have decreased over the years. However, PM2.5 pollution in the US varies across regions, with the Midwest and California being the most polluted areas. More alarmingly, air quality in the US has recently worsened despite years of improvement. These observations call for an in-depth understanding of the distribution and severity of air pollution in the US. In this project we aim to use US pollution data to examine the current status and trends of air pollution in US cities, in order to recommend good and bad places to live in terms of air quality.

Data & Methods

To examine the status and trends in US air pollution, we will use the “U.S. Pollution Data” dataset from Kaggle, which has been downloaded and explored by hundreds of other researchers (https://www.kaggle.com/sogun3/uspollution). The data in this set are taken from the EPA (https://aqs.epa.gov/aqsweb/airdata/download_files.html). We will use packages within the tidyverse, including dplyr, ggplot2, and tidyr, to organize and visualize the data. We will use tidyr and dplyr to aggregate the data by city, county, and state, sort the data, and organize it in a way that allows for easy visualization, and then use ggplot2 to graph the data. We will also use the usmap package to spatially visualize air pollution by plotting pollution levels onto a map of the US (like in Fig. 2) and showing the change in the map over time via a panel of maps.

Note that the dataset does not include data for Mississippi, Montana, Nebraska, Vermont, and West Virginia.

Load data and packages

library(tidyverse)
library(knitr)
library(kableExtra)
library(usmap)

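# Read the Kaggle CSV and split the local date into Year, Month, and Day columns
# (all three are stored as character strings)
rawdata = read_csv("pollution_us_2000_2016.csv") %>% 
  separate(`Date Local`, c("Year", "Month", "Day"), sep = "-")
# Replace spaces in column names with underscores so they are easier to reference
names(rawdata) <- gsub(" ", "_", names(rawdata))
head(rawdata)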

Results

  1. We will produce a list of the 10 most polluted and 10 least polluted states in 2015. To determine the most/least polluted states, we will use each pollutant’s AQI (air quality index).
pollution_by_state = rawdata %>% 
  filter(Year == 2015) %>% 
  group_by(State) %>% 
  summarise(NO2 = mean(NO2_AQI, na.rm = TRUE), O3 = mean(O3_AQI, na.rm = TRUE), 
            SO2 = mean(SO2_AQI, na.rm = TRUE), CO = mean(CO_AQI, na.rm = TRUE))
# This will give us a dataframe containing the average AQI for each pollutant by state
head(pollution_by_state) 
## # A tibble: 6 x 5
##   State        NO2    O3   SO2    CO
##   <chr>      <dbl> <dbl> <dbl> <dbl>
## 1 Alabama     20.0  37.3  6.95  3.93
## 2 Alaska      18.6  19.2 14.8   6.27
## 3 Arizona     26.2  44.9  1.64  6   
## 4 Arkansas    19.4  33.5  2.37  4.25
## 5 California  18.9  41.1  1.25  5.37
## 6 Colorado    35.5  38.6  4.93  6.54
AQI = 
  data.frame(State = pollution_by_state[,1], meanAQI = rowMeans(pollution_by_state[,-1])) %>% 
  arrange(desc(meanAQI))
# head(AQI)

top10 = AQI[1:10,]
bottom10 = arrange(AQI, meanAQI)[1:10,]

kable(top10) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
    add_header_above(c("Top 10 Most Polluted States 2015", " "))
Top 10 Most Polluted States 2015
State                  meanAQI
Colorado              21.41057
Utah                  19.85771
Arizona               19.68895
Nevada                18.93991
Ohio                  18.35968
Indiana               18.01813
Kansas                17.62841
New York              17.60394
District Of Columbia  17.26486
New Jersey            17.23095
kable(bottom10) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  add_header_above(c("Top 10 Least Polluted States 2015", " "))
Top 10 Least Polluted States 2015
State           meanAQI
North Dakota   10.65881
South Dakota   10.68196
Hawaii         10.85691
Tennessee      11.35148
New Hampshire  11.42109
Maine          11.94928
Oregon         12.65205
Florida        12.65812
Iowa           12.92361
Minnesota      13.26009
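The ranking step above can also be sketched with dplyr's slice_max() and slice_min(), which avoid sorting and indexing by hand. This is only an illustrative sketch: the toy_by_state table and its values are made up and stand in for pollution_by_state, and the equal-weight average of the four pollutant AQIs mirrors the approach used in this report rather than any official index.

```r
library(dplyr)

# Hypothetical stand-in for `pollution_by_state`; the values are invented
toy_by_state <- tibble(
  State = c("A", "B", "C", "D"),
  NO2 = c(20, 10, 30, 15),
  O3  = c(40, 20, 35, 25),
  SO2 = c(5, 2, 8, 3),
  CO  = c(5, 4, 7, 5)
)

# Equal-weight mean of the four pollutant AQIs for each state
toy_aqi <- toy_by_state %>%
  mutate(meanAQI = rowMeans(cbind(NO2, O3, SO2, CO))) %>%
  select(State, meanAQI)

# Two highest and two lowest states (use n = 10 on the real data)
most_polluted  <- slice_max(toy_aqi, meanAQI, n = 2)
least_polluted <- slice_min(toy_aqi, meanAQI, n = 2)
```

slice_max() returns rows in descending order and slice_min() in ascending order, so the results can be passed straight to kable().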
  2. We will produce graphs of the national averages for different pollutants and their changes by year.
pollutants = rawdata %>% 
  select(Year, NO2_Mean, O3_Mean, SO2_Mean, CO_Mean) %>% 
  group_by(Year) %>% 
  summarise(NO2_Mean = mean(NO2_Mean), O3_Mean = mean(O3_Mean), 
            SO2_Mean = mean(SO2_Mean), CO_Mean = mean(CO_Mean))

plot_pollutants = function(pollutant, num, units) {
  ggplot(pollutants, aes(x = Year, y = unlist(pollutants[,num]))) +
    geom_point(col = "Blue") +
    labs(title = paste0("US Average ", pollutant, " 2000-2016"), y = paste0("Mean ", pollutant," ", units))
}

plot_pollutants("NO2", 2, "(parts per billion)")

plot_pollutants("O3", 3, "(parts per million)")

plot_pollutants("SO2", 4, "(parts per billion)")

plot_pollutants("CO", 5, "(parts per million)")
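As an alternative to calling the plotting function once per pollutant, the yearly averages could be reshaped to long form and drawn as a single faceted figure. The sketch below assumes a table shaped like pollutants; the toy_pollutants values are invented for illustration. Because the pollutants are reported in different units (parts per billion and parts per million), each panel gets its own y-axis.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for `pollutants`; the values are invented
toy_pollutants <- tibble(
  Year = c("2000", "2001", "2002"),
  NO2_Mean = c(18, 17, 16),
  O3_Mean  = c(0.025, 0.026, 0.027),
  SO2_Mean = c(3.0, 2.8, 2.5),
  CO_Mean  = c(0.50, 0.48, 0.45)
)

# One row per Year/Pollutant pair, with a numeric year for a continuous x-axis
pollutants_long <- toy_pollutants %>%
  pivot_longer(-Year, names_to = "Pollutant", values_to = "Mean") %>%
  mutate(Pollutant = sub("_Mean$", "", Pollutant),
         Year = as.integer(Year))

# One panel per pollutant; free y-scales because the units differ
p <- ggplot(pollutants_long, aes(x = Year, y = Mean)) +
  geom_point(col = "blue") +
  facet_wrap(~Pollutant, scales = "free_y") +
  labs(title = "US Average Pollutant Concentrations", y = "Mean concentration")
```

Printing p draws the figure.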

  3. We will produce graphs of the trends in air pollution for the 5 states with the greatest decrease, and 5 states with the greatest increase in air pollution, from 2011 to 2015 (the range of years that most states have available data for).
trends = rawdata %>% 
  filter(State != "Country Of Mexico" & State != "District Of Columbia") %>% 
  select(State, Year, NO2_AQI, O3_AQI, SO2_AQI, CO_AQI) %>% 
  group_by(State, Year) %>% 
  summarize(NO2_AQI = mean(NO2_AQI, na.rm = TRUE), O3_AQI = mean(O3_AQI, na.rm = TRUE), 
            SO2_AQI = mean(SO2_AQI, na.rm = TRUE), CO_AQI = mean(CO_AQI, na.rm = TRUE))
trends = data.frame(State = trends[,1], Year = trends[,2], meanAQI = rowMeans(trends[,3:6]))
trends = trends %>% 
  filter(Year %in% c("2011", "2012", "2013", "2014", "2015"))

# States with incomplete 2011-2015 data were identified by inspecting the exported
# csv and are removed manually below.
write_csv(trends, "trends.csv")
trends = trends %>% 
  filter(!(State %in% c("Alabama", "Alaska", "Idaho", "Kentucky", 
                        "Missouri", "New Hampshire", "Tennessee", "Washington")))
trends_spread = trends %>%
  # filter(Year %in% c("2011", "2015")) %>% 
  group_by(State) %>%
  spread(Year, meanAQI)
head(trends_spread)
greatest_change = trends_spread %>% 
  summarize(changeAQI = (`2015` - `2011`)) %>% 
  arrange(changeAQI)
head(greatest_change)
pollution_changes = greatest_change %>% 
  inner_join(trends_spread, by = c("State" = "State")) %>% 
  arrange(changeAQI)
head(pollution_changes)

five_greatest_decrease = greatest_change[1:5,]
five_greatest_increase = arrange(greatest_change, desc(changeAQI))[1:5,]
head(five_greatest_increase)
head(five_greatest_decrease)

data = pollution_changes[1:5,] %>% 
  gather("Year", "meanAQI", 3:7)
ggplot(data, aes(x = Year, y = meanAQI)) + 
  geom_point(col = "Blue") + 
  facet_wrap(~State)

data2 = pollution_changes[30:34,] %>% 
  gather("Year", "meanAQI", 3:7)
ggplot(data2, aes(x = Year, y = meanAQI)) + 
  geom_point(col = "Blue") + 
  facet_wrap(~State)
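The manual removal of states with incomplete data in this step could likely be done programmatically. The sketch below assumes that "complete" means a non-missing mean AQI for every year from 2011 to 2015, which is our reading of the step rather than a rule stated in the dataset. The toy_trends table is invented and stands in for trends.

```r
library(dplyr)

# Hypothetical stand-in for `trends`: state B lacks two years, state C has a missing value
toy_trends <- tibble(
  State = c(rep("A", 5), rep("B", 3), rep("C", 5)),
  Year = c(as.character(2011:2015), "2011", "2013", "2015", as.character(2011:2015)),
  meanAQI = c(15, 14, 13, 12, 11, 10, 11, 12, 9, NA, 9, 9, 9)
)

required_years <- as.character(2011:2015)

# Keep only states that have a usable value in every required year
complete_states <- toy_trends %>%
  filter(Year %in% required_years, !is.na(meanAQI)) %>%
  group_by(State) %>%
  filter(n_distinct(Year) == length(required_years)) %>%
  ungroup()
```

On the toy data only state A survives the filter. The same pipeline applied to the real table would replace the hard-coded list of excluded states, although the resulting set should be checked against the manual list before relying on it.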

  4. We will produce a panel of maps, one per 2 years, showing the overall pollution level across the US.
newdata = rawdata %>%
  filter(State != "Country Of Mexico" & State != "District Of Columbia") %>% 
  group_by(State, Year) %>%
  summarise(NO2 = mean(NO2_AQI, na.rm = TRUE), O3 = mean(O3_AQI, na.rm = TRUE),
            SO2 = mean(SO2_AQI, na.rm = TRUE), CO = mean(CO_AQI, na.rm = TRUE))
head(newdata)
dataTrends =
  data.frame(State = newdata[,1], Year = newdata[,2], meanAQI = rowMeans(newdata[,3:6])) %>%
  rename(fips = State)
head(dataTrends)
dataTrends$fips = fips(dataTrends$fips)

plot_usmap(data = dataTrends, values = "meanAQI", color = "red") + 
  facet_wrap(~Year) + 
  scale_fill_continuous(low = "white", high = "blue", 
                        name = "Pollution Level", label = scales::comma) + 
  theme(legend.position = "right")
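The call above facets on every year in the data. To show one map per 2 years, as described in this step, the data could be restricted to every second year before plotting. The following is a minimal sketch on an invented table (toy_map_data) with the same columns as dataTrends; keeping even-numbered years is an assumption about which years to show.

```r
library(dplyr)

# Hypothetical stand-in for `dataTrends`; "06" and "36" are the FIPS codes
# for California and New York, and the AQI values are invented
toy_map_data <- tibble(
  fips = rep(c("06", "36"), each = 4),
  Year = rep(c("2000", "2001", "2002", "2003"), times = 2),
  meanAQI = c(20, 19, 18, 17, 16, 15, 14, 13)
)

# Year is stored as character, so convert before testing for even years
every_other_year <- toy_map_data %>%
  filter(as.integer(Year) %% 2 == 0)

# With usmap loaded, the subset can be passed to the same plotting call, e.g.:
# plot_usmap(data = every_other_year, values = "meanAQI") + facet_wrap(~Year)
```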

Conclusions

These graphics combined show which places are currently the most polluted and which are likely to become more polluted as time passes, giving us a better understanding of how to take pollution into account when choosing where to live.
From the first step, the tables of the most and least polluted states in 2015, we see that the four most polluted states (Colorado, Utah, Arizona, and Nevada) are in the Southwest, so we may recommend weighing that region less favorably when deciding where to live. In contrast, North and South Dakota have very low air pollution levels, so they may be worth considering as places to live in the future, along with Hawaii.
From the plots of national pollutant averages over time, we see that NO2, SO2, and CO show a steady decrease over the years, and we have reason to believe that this trend will continue. However, we observe a somewhat scattered but consistent increase in O3, which is likely caused by increased use of cars, power plants, and chemical plants. O3 pollution will likely continue to increase unless more action is taken.
From step 3 we observe that Georgia, Maryland, Virginia, New Jersey, and North Carolina are the states with the greatest decrease in air pollution in recent years (2011-2015). Thus, we can suggest these as places that would be safer to live in with respect to air quality. In contrast, Nevada, Delaware, New York, Utah, and Oregon had the greatest increase in air pollution during this period. However, no state's mean AQI rose by more than 2 points, so these increases are not severe, though they are still worth keeping in mind.
From the overall trend of pollution across the entire US over time, we can see that pollution has generally decreased everywhere, which is a good sign for the future, since it suggests that any region will hopefully be safe to live in.
It should be noted that this analysis is limited by the incompleteness of the dataset used. Without complete data from 2011 to 2016 for all states and all pollutants, these conclusions cannot be considered definitive, but the general trends observed are still worth considering for the future.