In this report, we will utilize, modify, and visualize 2021 USA data from the United States Environmental Protection Agency. We will identify and compare variables within the data set for each state. The link to the website where this data was obtained from is listed below:
https://aqs.epa.gov/aqsweb/airdata/download_files.html
library(readr)
library(tidyr)
library(dplyr)
library(gtools)
library( stargazer )
library(readxl)
library(sf)
library(ggmap)
library(usmap)
library(tmaptools)
set.seed( 1234 )
s.type <- "text"
A copy of the data was downloaded through the link provided. The file was modified from CSV to Excel and imported to RStudio via the syntax below
aqi <- read_excel("C:/Users/sarah/OneDrive/Desktop/annual_aqi_by_county_2021.xlsx")
Now that we have the AQI data handy, we will pull the column names. Here we can see there are spaces in our column names. This will cause issues later on if left alone.
names(aqi)
## [1] "State" "County"
## [3] "Year" "Days with AQI"
## [5] "Good Days" "Moderate Days"
## [7] "Unhealthy for Sensitive Groups Days" "Unhealthy Days"
## [9] "Very Unhealthy Days" "Hazardous Days"
## [11] "Max AQI" "90th Percentile AQI"
## [13] "Median AQI" "Days CO"
## [15] "Days NO2" "Days Ozone"
## [17] "Days PM2.5" "Days PM10"
To avoid future issues, we will replace the original names with spaces with new names. The spaces will be updated to “.”.
make.names(names(aqi))
## [1] "State" "County"
## [3] "Year" "Days.with.AQI"
## [5] "Good.Days" "Moderate.Days"
## [7] "Unhealthy.for.Sensitive.Groups.Days" "Unhealthy.Days"
## [9] "Very.Unhealthy.Days" "Hazardous.Days"
## [11] "Max.AQI" "X90th.Percentile.AQI"
## [13] "Median.AQI" "Days.CO"
## [15] "Days.NO2" "Days.Ozone"
## [17] "Days.PM2.5" "Days.PM10"
names(aqi) <- make.names(names(aqi))
The show function below confirms our columns were updated successfully. Now let’s begin interpreting it.
show(aqi)
## # A tibble: 1,002 x 18
## State County Year Days.~1 Good.~2 Moder~3 Unhea~4 Unhea~5 Very.~6 Hazar~7
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Baldwin 2021 280 264 16 0 0 0 0
## 2 Alabama Clay 2021 110 101 9 0 0 0 0
## 3 Alabama DeKalb 2021 361 331 30 0 0 0 0
## 4 Alabama Elmore 2021 241 238 3 0 0 0 0
## 5 Alabama Etowah 2021 286 253 33 0 0 0 0
## 6 Alabama Jeffer~ 2021 365 174 186 3 2 0 0
## 7 Alabama Lawren~ 2021 100 98 2 0 0 0 0
## 8 Alabama Madison 2021 357 269 86 2 0 0 0
## 9 Alabama Mobile 2021 361 270 91 0 0 0 0
## 10 Alabama Montgo~ 2021 354 294 60 0 0 0 0
## # ... with 992 more rows, 8 more variables: Max.AQI <dbl>,
## # X90th.Percentile.AQI <dbl>, Median.AQI <dbl>, Days.CO <dbl>,
## # Days.NO2 <dbl>, Days.Ozone <dbl>, Days.PM2.5 <dbl>, Days.PM10 <dbl>, and
## # abbreviated variable names 1: Days.with.AQI, 2: Good.Days,
## # 3: Moderate.Days, 4: Unhealthy.for.Sensitive.Groups.Days,
## # 5: Unhealthy.Days, 6: Very.Unhealthy.Days, 7: Hazardous.Days
We will base this analysis off five columns related to air quality on specific days. These columns sum the total days in each state by county with the retrospective air quality on said day. The sixth column used, Max AQI, will be visualized in the final part of this report. This sums the Max.AQI by county.
We will utilize the Stargazer package to view summary statistics
#Descriptives
df <- data.frame( aqi$Good.Days,
aqi$Moderate.Days,
aqi$Unhealthy.Days,
aqi$Very.Unhealthy.Days,
aqi$Hazardous.Days,
aqi$Max.AQI)
stargazer( df,
type=s.type,
digits=0,
summary.stat = c("min", "p25","median","mean","p75","max"))
##
## ===============================================================
## Statistic Min Pctl(25) Median Mean Pctl(75) Max
## ---------------------------------------------------------------
## aqi.Good.Days 7 227 279 262 313 364
## aqi.Moderate.Days 0 23 49 59 83.8 258
## aqi.Unhealthy.Days 0 0 0 1 0 61
## aqi.Very.Unhealthy.Days 0 0 0 0 0 73
## aqi.Hazardous.Days 0 0 0 0 0 12
## aqi.Max.AQI 9 90 108 126 144 2,723
## ---------------------------------------------------------------
We can see Hazardous, Unhealthy and Very Unhealthy Days are not as common as Moderate and Good Days. Let’s identify by column which states were ranked the highest in each type of day.
df1 = data.frame(Good.Days = c(aqi$Good.Days),
State = c(aqi$State))
df1 <-df1[order(-aqi$Good.Days),]
head(df1)
## Good.Days State
## 211 364 Hawaii
## 212 364 Hawaii
## 837 364 Texas
## 857 363 Texas
## 213 362 Hawaii
## 1001 361 Wyoming
df2 = data.frame(Moderate.Days = c(aqi$Moderate.Days),
State = c(aqi$State))
df2 <-df2[order(-aqi$Moderate.Days),]
head(df2)
## Moderate.Days State
## 543 258 New Mexico
## 81 251 California
## 137 251 Country Of Mexico
## 33 240 Arizona
## 64 229 California
## 59 223 California
df3 = data.frame(Unhealthy.Days = c(aqi$Unhealthy.Days),
State = c(aqi$State))
df3 <-df3[order(-aqi$Unhealthy.Days),]
head(df3)
## Unhealthy.Days State
## 29 61 Arizona
## 80 61 California
## 77 44 California
## 97 39 California
## 61 29 California
## 702 29 Oregon
df4 = data.frame(Very.Unhealthy.Days = c(aqi$Very.Unhealthy.Days),
State = c(aqi$State))
df4 <-df4[order(-aqi$Very.Unhealthy.Days),]
head(df4)
## Very.Unhealthy.Days State
## 29 73 Arizona
## 76 14 California
## 80 11 California
## 930 9 Washington
## 96 7 California
## 511 6 Nevada
df5 = data.frame(Hazardous.Days = c(aqi$Hazardous.Days),
State = c(aqi$State))
df5 <-df5[order(-aqi$Hazardous.Days),]
head(df5)
## Hazardous.Days State
## 96 12 California
## 76 7 California
## 996 4 Wyoming
## 60 3 California
## 70 3 California
## 59 2 California
df6 = data.frame(Max.AQI = c(aqi$Max.AQI),
State = c(aqi$State))
df6 <-df6[order(-aqi$Max.AQI),]
head(df6)
## Max.AQI State
## 70 2723 California
## 62 666 California
## 546 665 New Mexico
## 996 664 Wyoming
## 96 622 California
## 122 576 Colorado
Let’s visualize the distribution of these columns via a histogram and identify their means
#Change in MHV 2000-2010
hist(aqi$Good.Days, xlim=c (0,400),
main="Days with Good AQI",
xlab = "AQI Scoring Periods (Consecutive Days)",
yaxt="n",
ylab = "",
border="Black",
col="Light Green"
)
mean.x <- mean( aqi$Good.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )
hist(aqi$Moderate.Days, xlim=c (0,300),
main="Days with Moderate AQI",
xlab = "AQI Scoring Periods (Consecutive Days)",
yaxt="n",
ylab = "",
border="Black",
col="Green",
)
mean.x <- mean( aqi$Moderate.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )
hist(aqi$Unhealthy.Days, xlim=c (0,70),
main="Days with Unhealthy AQI",
xlab = "AQI Scoring Periods (Consecutive Days)",
ylab = "",
yaxt="n",
border="Black",
col="Red",
)
mean.x <- mean( aqi$Unhealthy.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )
hist(aqi$Hazardous.Days, xlim=c (0,5),
main="Days with Very Unhealthy AQI",
xlab = "AQI Scoring Periods (Consecutive Days)",
ylab = "",
yaxt="n",
border="Black",
col="Dark Red",
)
mean.x <- mean( aqi$Very.Unhealthy.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )
hist(aqi$Hazardous.Days, xlim=c (0,5),
main="Days with Hazardous AQI",
xlab = "AQI Scoring Periods (Consecutive Days)",
ylab = "",
yaxt="n",
border="Black",
col="Black",
)
mean.x <- mean( aqi$Hazardous.Days, na.rm=T )
abline( v=mean.x, col="White", lwd=2, lty=2 )
We will now begin mapping the data. For this step, we will pull in library(Us_map). This will also provide us longitude and latitude criteria necessary for plotting our data set. We will also visualize the data loaded to identify any possible errors when merging the map_data and aqi data sets.
states <- map_data("state")
head(states)
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## 5 -87.57087 30.32665 1 5 alabama <NA>
## 6 -87.58806 30.32665 1 6 alabama <NA>
One issue identified above is the map_data labeled states under regions. To avoid any issues in our merge, we will use the rename function
states <- states %>%
rename("State" = "region")
head(states)
## long lat group order State subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## 5 -87.57087 30.32665 1 5 alabama <NA>
## 6 -87.58806 30.32665 1 6 alabama <NA>
A second issue identified is the map_data lists States in lowercase whereas the AQI data capitalizes the first character in the state’s name. We will update the AQI data via the tolower function to ensure all state names are lowecase. This will also prevent issues when merging the two datasets
aqi$State <- tolower(aqi$State)
head(aqi)
## # A tibble: 6 x 18
## State County Year Days.~1 Good.~2 Moder~3 Unhea~4 Unhea~5 Very.~6 Hazar~7
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 alabama Baldwin 2021 280 264 16 0 0 0 0
## 2 alabama Clay 2021 110 101 9 0 0 0 0
## 3 alabama DeKalb 2021 361 331 30 0 0 0 0
## 4 alabama Elmore 2021 241 238 3 0 0 0 0
## 5 alabama Etowah 2021 286 253 33 0 0 0 0
## 6 alabama Jeffers~ 2021 365 174 186 3 2 0 0
## # ... with 8 more variables: Max.AQI <dbl>, X90th.Percentile.AQI <dbl>,
## # Median.AQI <dbl>, Days.CO <dbl>, Days.NO2 <dbl>, Days.Ozone <dbl>,
## # Days.PM2.5 <dbl>, Days.PM10 <dbl>, and abbreviated variable names
## # 1: Days.with.AQI, 2: Good.Days, 3: Moderate.Days,
## # 4: Unhealthy.for.Sensitive.Groups.Days, 5: Unhealthy.Days,
## # 6: Very.Unhealthy.Days, 7: Hazardous.Days
We are now ready to merge our data sets. cal_aqi will tie in our columns and their relevant means. We will then use the Merge function to tie both data sets together
cal_aqi <-
aqi %>%
select(State, Good.Days, Moderate.Days, Unhealthy.Days, Very.Unhealthy.Days, Hazardous.Days) %>%
group_by(State) %>%
arrange(Good.Days, Moderate.Days, Unhealthy.Days, Very.Unhealthy.Days, Hazardous.Days) %>%
summarise(Good.Days = mean(Good.Days),
Moderate.Days = mean(Moderate.Days),
Unhealthy.Days = mean(Unhealthy.Days),
Very.Unhealthy.Days = mean(Very.Unhealthy.Days),
Hazardous.Days = mean(Hazardous.Days))
cal_aqi <- merge( states, cal_aqi, by="State" )
We will use the head function to validate two things: 1: The merge was successful and 2: There are no anomalies in our new data set
head (cal_aqi)
## State long lat group order subregion Good.Days Moderate.Days
## 1 alabama -87.46201 30.38968 1 1 <NA> 246.9333 47.26667
## 2 alabama -87.48493 30.37249 1 2 <NA> 246.9333 47.26667
## 3 alabama -87.52503 30.37249 1 3 <NA> 246.9333 47.26667
## 4 alabama -87.53076 30.33239 1 4 <NA> 246.9333 47.26667
## 5 alabama -87.57087 30.32665 1 5 <NA> 246.9333 47.26667
## 6 alabama -87.58806 30.32665 1 6 <NA> 246.9333 47.26667
## Unhealthy.Days Very.Unhealthy.Days Hazardous.Days
## 1 0.1333333 0 0
## 2 0.1333333 0 0
## 3 0.1333333 0 0
## 4 0.1333333 0 0
## 5 0.1333333 0 0
## 6 0.1333333 0 0
###3C: Mapping Our Data
Now we will map the data by the four columns we have been working with. To improve visualization, here are a few functions used along with additional explanation as to which role they played in the final visualization
direction = 1: This organized colors in the map by sorting from greatest to smallest, rather than smallest to greatest. This is why in some of the maps, the higher the count of days in a specific state, the darker those states
xis.text = element_blank() & axis.ticks.y = element_blank (): This removed any text on the x-axis and y-axis. This previous included longitude/latitude descriptions
panel.grid = element_blank(): This removed any background grid from our Map.
theme_light(): This removed background color from our Map.
ggplot() +
geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Good.Days),
color = "Black") +
scale_fill_distiller("Average Days", palette = "PRGn", direction = 1) +
xlab("") + ylab("") +
ggtitle("Good Days by State",
subtitle = "EPA 2021") +
geom_point() +
theme_light() +
theme(axis.text = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank()) +
theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
theme(legend.position = "right")
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## i Please use the `linewidth` argument instead.
ggplot() +
geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Moderate.Days),
color = "Black") +
scale_fill_distiller("Average Days", palette = "YlGnBu", direction = 1) +
xlab("") + ylab("") +
ggtitle("Moderate Days by State",
subtitle = "EPA 2021") +
geom_point() +
theme_light() +
theme(
axis.text = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank()) +
theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
theme(legend.position = "right")
ggplot() +
geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Unhealthy.Days),
color = "Black") +
scale_fill_distiller("Average Days", palette = "Oranges", direction = 1) +
xlab("") + ylab("") +
ggtitle("Unhealthy Days by State",
subtitle = "EPA 2021") +
geom_point() +
theme_light() +
theme(
axis.text = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank()) +
theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
theme(legend.position = "right")
ggplot() +
geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Very.Unhealthy.Days),
color = "Black") +
scale_fill_distiller("Average Days", palette = "YlOrBr", direction = 1) +
xlab("") + ylab("") +
ggtitle("Very Unhealthy Days by State",
subtitle = "EPA 2021") +
geom_point() +
theme_light() +
theme(
axis.text = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank()) +
theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
theme(legend.position = "right")
ggplot() +
geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Hazardous.Days),
color = "Black") +
scale_fill_distiller("Average Days", palette = "YlOrRd", direction = 1) +
xlab("") + ylab("") +
ggtitle("Hazardous Days by State",
subtitle = "EPA 2021") +
geom_point() +
theme_light() +
theme(
axis.text = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank()) +
theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
theme(legend.position = "right")
For our final Map, we will repeat the steps previously completed, but this time using Max.AQI by State.
cal_aqi2 <-
aqi %>%
select(State, Max.AQI) %>%
group_by(State) %>%
summarise(Max.AQI = mean(Max.AQI))
cal_aqi2 <- merge( states, cal_aqi2, by="State" )
ggplot() +
geom_polygon(data = cal_aqi2, aes(x = long, y = lat, group = group, fill = Max.AQI),
color = "Black") +
scale_fill_distiller("Average AQI", palette = "YlOrRd", direction = 1) +
xlab("") + ylab("") +
ggtitle("Max AQI by State",
subtitle = "EPA 2021") +
geom_point() +
theme_light() +
theme(
axis.text = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank()) +
theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
theme(legend.position = "right")