Pollution is an important factor that can affect where people choose to live. Man-made air pollution contributes to climate change and threatens public health, and people with health conditions such as asthma need to be particularly careful when choosing a place to live. Thanks to significant efforts to promote clean energy and reduce emissions, air quality in the US has improved somewhat. For example, average SO2 (a contributor to acid rain) has decreased over the years. However, PM2.5 pollution in the US varies across regions, with the Midwest and California being the most polluted areas. More alarmingly, air quality in the US has recently worsened despite years of improvement. These observations call for an in-depth understanding of the distribution and severity of air pollution in the US. In this project we aim to use US pollution data to examine the current status and trend of air pollution in US cities, in order to recommend good and bad places to live in terms of air quality.
To examine the status and trends in US air pollution, we will use the “U.S. Pollution Data” dataset from Kaggle (https://www.kaggle.com/sogun3/uspollution), which has been downloaded and explored by hundreds of other researchers. The data in this set come from the EPA (https://aqs.epa.gov/aqsweb/airdata/download_files.html). We will use packages from the tidyverse, including dplyr, ggplot2, and tidyr, to organize and visualize the data. We will use tidyr and dplyr to aggregate the data by city, county, and state, sort it, and reshape it into a form that is easy to visualize, and then use ggplot2 to graph it. We will also use the usmap package to visualize air pollution spatially by plotting pollution levels onto a map of the US (like in Fig. 2) and showing the change over time via a panel of maps.
Note that the dataset does not include data for Mississippi, Montana, Nebraska, Vermont, and West Virginia.
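This gap can be checked directly once the data are loaded. The following is a minimal sketch, assuming the `State` column holds full state names; `observed_states` is a small hypothetical stand-in for `unique(rawdata$State)`, not the real contents of the file.

```r
# Hypothetical stand-in for unique(rawdata$State); the real vector comes from the CSV
observed_states <- c("Alabama", "Alaska", "Arizona",
                     "District Of Columbia", "Country Of Mexico")

# state.name is a built-in base R vector holding the 50 state names
missing_states <- setdiff(state.name, observed_states)

# Entries in the data that are not among the 50 states
extra_entries <- setdiff(observed_states, state.name)

length(missing_states)
extra_entries
```

With the real data, `missing_states` would be expected to list the five states named above, and `extra_entries` would flag non-state rows that need separate handling.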
Load data and packages
library(tidyverse)
library(knitr)
library(kableExtra)
library(usmap)
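# Read the CSV and split the date column into Year, Month, and Day (all character)
rawdata = read_csv("pollution_us_2000_2016.csv") %>%
  separate(`Date Local`, c("Year", "Month", "Day"), sep = "-")
# Replace spaces in column names with underscores so they are easier to reference
names(rawdata) <- gsub(" ", "_", names(rawdata))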
head(rawdata)
# Average AQI for each pollutant by state in 2015
pollution_by_state = rawdata %>%
  filter(Year == 2015) %>%
  group_by(State) %>%
  summarise(NO2 = mean(NO2_AQI, na.rm = TRUE), O3 = mean(O3_AQI, na.rm = TRUE),
            SO2 = mean(SO2_AQI, na.rm = TRUE), CO = mean(CO_AQI, na.rm = TRUE))
head(pollution_by_state)
## # A tibble: 6 x 5
## State NO2 O3 SO2 CO
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 20.0 37.3 6.95 3.93
## 2 Alaska 18.6 19.2 14.8 6.27
## 3 Arizona 26.2 44.9 1.64 6
## 4 Arkansas 19.4 33.5 2.37 4.25
## 5 California 18.9 41.1 1.25 5.37
## 6 Colorado 35.5 38.6 4.93 6.54
# Overall score per state: unweighted mean of the four pollutant AQI averages
AQI =
  data.frame(State = pollution_by_state[,1], meanAQI = rowMeans(pollution_by_state[,-1])) %>%
  arrange(desc(meanAQI))
# head(AQI)
top10 = AQI[1:10,]
bottom10 = arrange(AQI, meanAQI)[1:10,]
kable(top10) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  add_header_above(c("Top 10 Most Polluted States 2015", " "))
| State | meanAQI |
|---|---|
| Colorado | 21.41057 |
| Utah | 19.85771 |
| Arizona | 19.68895 |
| Nevada | 18.93991 |
| Ohio | 18.35968 |
| Indiana | 18.01813 |
| Kansas | 17.62841 |
| New York | 17.60394 |
| District Of Columbia | 17.26486 |
| New Jersey | 17.23095 |
kable(bottom10) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  add_header_above(c("10 Least Polluted States 2015", " "))
| State | meanAQI |
|---|---|
| North Dakota | 10.65881 |
| South Dakota | 10.68196 |
| Hawaii | 10.85691 |
| Tennessee | 11.35148 |
| New Hampshire | 11.42109 |
| Maine | 11.94928 |
| Oregon | 12.65205 |
| Florida | 12.65812 |
| Iowa | 12.92361 |
| Minnesota | 13.26009 |
# National yearly average concentration of each pollutant
pollutants = rawdata %>%
  select(Year, NO2_Mean, O3_Mean, SO2_Mean, CO_Mean) %>%
  group_by(Year) %>%
  summarise(NO2_Mean = mean(NO2_Mean), O3_Mean = mean(O3_Mean),
            SO2_Mean = mean(SO2_Mean), CO_Mean = mean(CO_Mean))
# Scatter plot of one pollutant's yearly mean; `num` is its column index in `pollutants`
plot_pollutants = function(pollutant, num, units) {
  ggplot(pollutants, aes(x = Year, y = unlist(pollutants[,num]))) +
    geom_point(col = "Blue") +
    labs(title = paste0("US Average ", pollutant, " 2000-2016"), y = paste0("Mean ", pollutant," ", units))
}
plot_pollutants("NO2", 2, "(parts per billion)")
plot_pollutants("O3", 3, "(parts per million)")
plot_pollutants("SO2", 4, "(parts per billion)")
plot_pollutants("CO", 5, "(parts per million)")
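The four calls above each draw a separate figure. As an alternative sketch, the yearly means can be reshaped to long format and drawn as one faceted figure with independent y-axes, since the pollutants are measured on different scales. `pollutants_toy` and its values are made up for illustration; the real input would be `pollutants`.

```r
library(tidyr)
library(ggplot2)

# Made-up stand-in for `pollutants`; Year is character, as it is after separate()
pollutants_toy <- data.frame(
  Year     = c("2000", "2001", "2002"),
  NO2_Mean = c(15.0, 14.2, 13.9),
  O3_Mean  = c(0.025, 0.026, 0.024),
  stringsAsFactors = FALSE
)

# Long format: one row per (Year, pollutant) pair
pollutants_long <- pivot_longer(pollutants_toy, cols = -Year,
                                names_to = "pollutant", values_to = "mean_value")

# One panel per pollutant; free y-scales because ppb and ppm values differ greatly
p <- ggplot(pollutants_long, aes(x = Year, y = mean_value)) +
  geom_point(col = "blue") +
  facet_wrap(~pollutant, scales = "free_y") +
  labs(title = "US average pollutant concentrations", y = "Mean concentration")
# print(p) draws the figure
```

The trade-off is that a shared y-axis label can no longer carry per-pollutant units, so those would have to move into the panel titles.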
# Yearly mean AQI per pollutant for each state (Mexico and DC excluded)
trends = rawdata %>%
  filter(State != "Country Of Mexico" & State != "District Of Columbia") %>%
  select(State, Year, NO2_AQI, O3_AQI, SO2_AQI, CO_AQI) %>%
  group_by(State, Year) %>%
  summarize(NO2_AQI = mean(NO2_AQI, na.rm = TRUE), O3_AQI = mean(O3_AQI, na.rm = TRUE),
            SO2_AQI = mean(SO2_AQI, na.rm = TRUE), CO_AQI = mean(CO_AQI, na.rm = TRUE))
# Collapse the four pollutants into a single mean AQI per state and year
trends = data.frame(State = trends[,1], Year = trends[,2], meanAQI = rowMeans(trends[,3:6]))
trends = trends %>%
  filter(Year %in% c("2011", "2012", "2013", "2014", "2015"))
# States with incomplete data for 2011-2015 were identified by inspecting the
# exported CSV and are removed by name below.
write_csv(trends, "trends.csv")
trends = trends %>%
  filter(!(State %in% c("Alabama", "Alaska", "Idaho", "Kentucky",
                        "Missouri", "New Hampshire", "Tennessee", "Washington")))
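# Wide format: one row per state, one column per year
trends_spread = trends %>%
  # filter(Year %in% c("2011", "2015")) %>%
  group_by(State) %>%
  spread(Year, meanAQI)
head(trends_spread)
# Change in mean AQI from 2011 to 2015; negative values mean air quality improved
greatest_change = trends_spread %>%
  summarize(changeAQI = (`2015` - `2011`)) %>%
  arrange(changeAQI)
head(greatest_change)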
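`spread()` still works, but tidyr has superseded it with `pivot_wider()` (and `gather()` with `pivot_longer()`). The same change-in-AQI table could be built as sketched here on made-up numbers, with the hypothetical `trends_toy` standing in for `trends`.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for `trends`: one row per state and year
trends_toy <- data.frame(
  State   = rep(c("A", "B"), each = 2),
  Year    = rep(c("2011", "2015"), times = 2),
  meanAQI = c(12, 10, 8, 9),
  stringsAsFactors = FALSE
)

change_toy <- trends_toy %>%
  # One column per year, one row per state
  pivot_wider(names_from = Year, values_from = meanAQI) %>%
  # Negative change means the state's mean AQI dropped
  mutate(changeAQI = `2015` - `2011`) %>%
  arrange(changeAQI)

change_toy
```

Using `mutate()` instead of `summarize()` keeps the yearly columns alongside the change, which would make a later join back to the wide table unnecessary.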
pollution_changes = greatest_change %>%
  inner_join(trends_spread, by = c("State" = "State")) %>%
  arrange(changeAQI)
head(pollution_changes)
five_greatest_decrease = greatest_change[1:5,]
five_greatest_increase = arrange(greatest_change, desc(changeAQI))[1:5,]
head(five_greatest_increase)
head(five_greatest_decrease)
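# Five states with the greatest decrease, reshaped to long format for plotting
data = pollution_changes[1:5,] %>%
  gather("Year", "meanAQI", 3:7)
ggplot(data, aes(x = Year, y = meanAQI)) +
  geom_point(col = "Blue") +
  facet_wrap(~State)
# Five states with the greatest increase; sorting in descending order avoids
# relying on hard-coded row positions
data2 = arrange(pollution_changes, desc(changeAQI))[1:5,] %>%
  gather("Year", "meanAQI", 3:7)
ggplot(data2, aes(x = Year, y = meanAQI)) +
  geom_point(col = "Blue") +
  facet_wrap(~State)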
# Yearly mean AQI per pollutant for each state, all years
newdata = rawdata %>%
  filter(State != "Country Of Mexico" & State != "District Of Columbia") %>%
  group_by(State, Year) %>%
  summarise(NO2 = mean(NO2_AQI, na.rm = TRUE), O3 = mean(O3_AQI, na.rm = TRUE),
            SO2 = mean(SO2_AQI, na.rm = TRUE), CO = mean(CO_AQI, na.rm = TRUE))
head(newdata)
# Single mean AQI per state and year; the state column is renamed to `fips`,
# which plot_usmap() uses to match rows to states
dataTrends =
  data.frame(State = newdata[,1], Year = newdata[,2], meanAQI = rowMeans(newdata[,3:6])) %>%
  rename(fips = State)
head(dataTrends)
# Convert state names to FIPS codes
dataTrends$fips = fips(dataTrends$fips)
# One map per year, shaded by mean AQI
plot_usmap(data = dataTrends, values = "meanAQI", color = "red") +
  facet_wrap(~Year) +
  scale_fill_continuous(low = "white", high = "blue",
                        name = "Pollution Level", labels = scales::comma) +
  theme(legend.position = "right")
Taken together, these graphics show which places are currently the most polluted and which are likely to become more polluted over time, giving us a better understanding of how to take pollution into account when choosing where to live.
From the first step, the tables of the most and least polluted states in 2015, we see that the four most polluted states (Colorado, Utah, Arizona, and Nevada) are in the Southwest, so we may recommend weighing that region less favorably when deciding where to live. In contrast, North and South Dakota have very low air pollution levels, so they may be worth considering as places to live in the future, along with Hawaii.
From the plots of national pollutant averages over time, we see that NO2, SO2, and CO decrease steadily over the years, and we have reason to believe that this trend will continue. O3, however, shows a somewhat scattered but consistent increase, which is likely caused by increased usage of cars, power plants, and chemical plants. O3 pollution will likely continue to increase unless more action is taken.
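One way to back this visual impression with a number is to fit a straight line to the yearly means and read off the slope. The sketch below uses made-up values; `pollutants_toy` is a hypothetical stand-in for `pollutants`, and `Year` is converted to numeric because it is a character column after `separate()`.

```r
# Made-up yearly means; the real values would come from `pollutants`
pollutants_toy <- data.frame(
  Year    = as.character(2000:2005),
  O3_Mean = c(0.020, 0.022, 0.021, 0.024, 0.023, 0.026),
  stringsAsFactors = FALSE
)

# Ordinary least-squares fit of the yearly mean on the year
fit <- lm(O3_Mean ~ as.numeric(Year), data = pollutants_toy)

# Slope = average change in ppm per year; a positive value indicates an upward trend
slope <- unname(coef(fit)[2])
slope > 0

# summary(fit) also reports a p-value for the slope, which indicates whether
# the trend is distinguishable from year-to-year scatter
```

A linear fit is only one simple choice; it assumes the change is roughly constant from year to year, which a scattered series may not satisfy.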
From step 3 we observe that Georgia, Maryland, Virginia, New Jersey, and North Carolina are the states with the greatest decrease in air pollution in recent years (2011-2015). Thus, we can recommend these as safer places to live in terms of air quality. In contrast, Nevada, Delaware, New York, Utah, and Oregon had the greatest increase in air pollution during this period. However, the increase did not exceed 2 AQI points for any state, so these are not severe increases in pollution, though they are still worth keeping in mind.
From the overall trend of pollution across the entire US over time, we can see that pollution has generally decreased everywhere, which is a good sign for the future, since it suggests that any region will hopefully be safe to live in.
It should be noted that this analysis is limited by the incompleteness of the dataset used. Without the full scope of data from 2011 to 2016 for all states and all pollutants, these conclusions cannot be called definitive, but the general trends observed are still worth considering for the future.
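In the trends step, states with incomplete data were removed by hand after inspecting `trends.csv`. A grouped filter can apply the same kind of screen programmatically. This is a sketch only, assuming "complete" means a non-missing mean AQI in every year from 2011 to 2015, which may not match the criterion used for the manual list; `trends_toy` and its values are made up.

```r
library(dplyr)

# Made-up stand-in for `trends` (State, Year, meanAQI)
trends_toy <- data.frame(
  State   = c(rep("A", 5), rep("B", 4), rep("C", 5)),
  Year    = as.character(c(2011:2015, 2011:2014, 2011:2015)),
  meanAQI = c(10, 11, 12, 13, 14,  9, 9, 9, 9,  8, NA, 8, 8, 8),
  stringsAsFactors = FALSE
)

years_needed <- as.character(2011:2015)

complete_states <- trends_toy %>%
  group_by(State) %>%
  # Keep a state only if every required year is present with a non-missing value
  filter(all(years_needed %in% Year[!is.na(meanAQI)])) %>%
  ungroup()

unique(complete_states$State)
```

Here state B is dropped because it has no 2015 row and state C because its 2012 value is missing, so the filter catches both absent rows and `NA` values.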