Purpose

In this report, we will utilize, modify, and visualize 2021 USA data from the United States Environmental Protection Agency. We will identify and compare variables within the data set for each state. The link to the website where this data was obtained from is listed below:

https://aqs.epa.gov/aqsweb/airdata/download_files.html

library(readr)
library(tidyr)
library(dplyr)
library(gtools)
library( stargazer )
library(readxl)
library(sf)
library(ggmap)
library(usmap)
library(tmaptools)


set.seed( 1234 )


s.type <- "text"  

Part 1: Loading, Modifying and Digging into the Data

A copy of the data was downloaded through the link provided. The file was modified from CSV to Excel and imported to RStudio via the syntax below

aqi <- read_excel("C:/Users/sarah/OneDrive/Desktop/annual_aqi_by_county_2021.xlsx")

Now that we have the AQI data handy, we will pull the column names. Here we can see there are spaces in our column names. This will cause issues later on if left alone.

names(aqi)
##  [1] "State"                               "County"                             
##  [3] "Year"                                "Days with AQI"                      
##  [5] "Good Days"                           "Moderate Days"                      
##  [7] "Unhealthy for Sensitive Groups Days" "Unhealthy Days"                     
##  [9] "Very Unhealthy Days"                 "Hazardous Days"                     
## [11] "Max AQI"                             "90th Percentile AQI"                
## [13] "Median AQI"                          "Days CO"                            
## [15] "Days NO2"                            "Days Ozone"                         
## [17] "Days PM2.5"                          "Days PM10"

To avoid future issues, we will replace the original names with spaces with new names. The spaces will be updated to “.”.

make.names(names(aqi))
##  [1] "State"                               "County"                             
##  [3] "Year"                                "Days.with.AQI"                      
##  [5] "Good.Days"                           "Moderate.Days"                      
##  [7] "Unhealthy.for.Sensitive.Groups.Days" "Unhealthy.Days"                     
##  [9] "Very.Unhealthy.Days"                 "Hazardous.Days"                     
## [11] "Max.AQI"                             "X90th.Percentile.AQI"               
## [13] "Median.AQI"                          "Days.CO"                            
## [15] "Days.NO2"                            "Days.Ozone"                         
## [17] "Days.PM2.5"                          "Days.PM10"
names(aqi) <- make.names(names(aqi))

The show function below confirms our columns were updated successfully. Now let’s begin interpreting it.

show(aqi)
## # A tibble: 1,002 x 18
##    State   County   Year Days.~1 Good.~2 Moder~3 Unhea~4 Unhea~5 Very.~6 Hazar~7
##    <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Alabama Baldwin  2021     280     264      16       0       0       0       0
##  2 Alabama Clay     2021     110     101       9       0       0       0       0
##  3 Alabama DeKalb   2021     361     331      30       0       0       0       0
##  4 Alabama Elmore   2021     241     238       3       0       0       0       0
##  5 Alabama Etowah   2021     286     253      33       0       0       0       0
##  6 Alabama Jeffer~  2021     365     174     186       3       2       0       0
##  7 Alabama Lawren~  2021     100      98       2       0       0       0       0
##  8 Alabama Madison  2021     357     269      86       2       0       0       0
##  9 Alabama Mobile   2021     361     270      91       0       0       0       0
## 10 Alabama Montgo~  2021     354     294      60       0       0       0       0
## # ... with 992 more rows, 8 more variables: Max.AQI <dbl>,
## #   X90th.Percentile.AQI <dbl>, Median.AQI <dbl>, Days.CO <dbl>,
## #   Days.NO2 <dbl>, Days.Ozone <dbl>, Days.PM2.5 <dbl>, Days.PM10 <dbl>, and
## #   abbreviated variable names 1: Days.with.AQI, 2: Good.Days,
## #   3: Moderate.Days, 4: Unhealthy.for.Sensitive.Groups.Days,
## #   5: Unhealthy.Days, 6: Very.Unhealthy.Days, 7: Hazardous.Days

We will base this analysis off five columns related to air quality on specific days. These columns sum the total days in each state by county with the retrospective air quality on said day. The sixth column used, Max AQI, will be visualized in the final part of this report. This sums the Max.AQI by county.

We will utilize the Stargazer package to view summary statistics

#Descriptives

df <- data.frame( aqi$Good.Days, 
                  aqi$Moderate.Days, 
                  aqi$Unhealthy.Days,
                  aqi$Very.Unhealthy.Days,
                  aqi$Hazardous.Days,
                  aqi$Max.AQI)
                  

stargazer( df, 
           type=s.type, 
           digits=0, 
           summary.stat = c("min", "p25","median","mean","p75","max"))
## 
## ===============================================================
## Statistic               Min Pctl(25) Median Mean Pctl(75)  Max 
## ---------------------------------------------------------------
## aqi.Good.Days            7    227     279   262    313     364 
## aqi.Moderate.Days        0     23      49    59    83.8    258 
## aqi.Unhealthy.Days       0     0       0     1      0      61  
## aqi.Very.Unhealthy.Days  0     0       0     0      0      73  
## aqi.Hazardous.Days       0     0       0     0      0      12  
## aqi.Max.AQI              9     90     108   126    144    2,723
## ---------------------------------------------------------------

We can see Hazardous, Unhealthy and Very Unhealthy Days are not as common as Moderate and Good Days. Let’s identify by column which states were ranked the highest in each type of day.

df1 = data.frame(Good.Days = c(aqi$Good.Days),
                 State = c(aqi$State))
df1 <-df1[order(-aqi$Good.Days),]
head(df1)
##      Good.Days   State
## 211        364  Hawaii
## 212        364  Hawaii
## 837        364   Texas
## 857        363   Texas
## 213        362  Hawaii
## 1001       361 Wyoming
df2 = data.frame(Moderate.Days = c(aqi$Moderate.Days),
                 State = c(aqi$State))
df2 <-df2[order(-aqi$Moderate.Days),]
head(df2)
##     Moderate.Days             State
## 543           258        New Mexico
## 81            251        California
## 137           251 Country Of Mexico
## 33            240           Arizona
## 64            229        California
## 59            223        California
df3 = data.frame(Unhealthy.Days = c(aqi$Unhealthy.Days),
                 State = c(aqi$State))
df3 <-df3[order(-aqi$Unhealthy.Days),]
head(df3)
##     Unhealthy.Days      State
## 29              61    Arizona
## 80              61 California
## 77              44 California
## 97              39 California
## 61              29 California
## 702             29     Oregon
df4 = data.frame(Very.Unhealthy.Days = c(aqi$Very.Unhealthy.Days),
                 State = c(aqi$State))
df4 <-df4[order(-aqi$Very.Unhealthy.Days),]
head(df4)
##     Very.Unhealthy.Days      State
## 29                   73    Arizona
## 76                   14 California
## 80                   11 California
## 930                   9 Washington
## 96                    7 California
## 511                   6     Nevada
df5 = data.frame(Hazardous.Days = c(aqi$Hazardous.Days),
                 State = c(aqi$State))
df5 <-df5[order(-aqi$Hazardous.Days),]
head(df5)
##     Hazardous.Days      State
## 96              12 California
## 76               7 California
## 996              4    Wyoming
## 60               3 California
## 70               3 California
## 59               2 California
df6 = data.frame(Max.AQI = c(aqi$Max.AQI),
                 State = c(aqi$State))
df6 <-df6[order(-aqi$Max.AQI),]
head(df6)
##     Max.AQI      State
## 70     2723 California
## 62      666 California
## 546     665 New Mexico
## 996     664    Wyoming
## 96      622 California
## 122     576   Colorado





Part 2: Visualization with Histograms

Let’s visualize the distribution of these columns via a histogram and identify their means

#Change in MHV 2000-2010

hist(aqi$Good.Days, xlim=c (0,400),
     main="Days with Good AQI",
     xlab = "AQI Scoring Periods (Consecutive Days)",
     yaxt="n",
     ylab = "",
     border="Black",
     col="Light Green"
     )

mean.x <- mean( aqi$Good.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )

hist(aqi$Moderate.Days, xlim=c (0,300),
     main="Days with Moderate AQI",
     xlab = "AQI Scoring Periods (Consecutive Days)",
     yaxt="n",
     ylab = "",
     border="Black",
     col="Green",
     )

mean.x <- mean( aqi$Moderate.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )

hist(aqi$Unhealthy.Days, xlim=c (0,70),
     main="Days with Unhealthy AQI",
     xlab = "AQI Scoring Periods (Consecutive Days)",
     ylab = "",
     yaxt="n",
     border="Black",
     col="Red",
     )

mean.x <- mean( aqi$Unhealthy.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )

hist(aqi$Hazardous.Days, xlim=c (0,5),
     main="Days with Very Unhealthy AQI",
     xlab = "AQI Scoring Periods (Consecutive Days)",
     ylab = "",
     yaxt="n",
     border="Black",
     col="Dark Red",
     )

mean.x <- mean( aqi$Very.Unhealthy.Days, na.rm=T )
abline( v=mean.x, col="black", lwd=2, lty=2 )

hist(aqi$Hazardous.Days, xlim=c (0,5),
     main="Days with Hazardous AQI",
     xlab = "AQI Scoring Periods (Consecutive Days)",
     ylab = "",
     yaxt="n",
     border="Black",
     col="Black",
     )

mean.x <- mean( aqi$Hazardous.Days, na.rm=T )
abline( v=mean.x, col="White", lwd=2, lty=2 )

Part 3: Visualizing with Maps

3A: Preparing the Data for Merge

We will now begin mapping the data. For this step, we will pull in library(Us_map). This will also provide us longitude and latitude criteria necessary for plotting our data set. We will also visualize the data loaded to identify any possible errors when merging the map_data and aqi data sets.

states <- map_data("state")
head(states)
##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## 5 -87.57087 30.32665     1     5 alabama      <NA>
## 6 -87.58806 30.32665     1     6 alabama      <NA>

One issue identified above is the map_data labeled states under regions. To avoid any issues in our merge, we will use the rename function

states <- states %>%
  rename("State" = "region")
head(states)
##        long      lat group order   State subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## 5 -87.57087 30.32665     1     5 alabama      <NA>
## 6 -87.58806 30.32665     1     6 alabama      <NA>

A second issue identified is the map_data lists States in lowercase whereas the AQI data capitalizes the first character in the state’s name. We will update the AQI data via the tolower function to ensure all state names are lowecase. This will also prevent issues when merging the two datasets

aqi$State <- tolower(aqi$State)
head(aqi)
## # A tibble: 6 x 18
##   State   County    Year Days.~1 Good.~2 Moder~3 Unhea~4 Unhea~5 Very.~6 Hazar~7
##   <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 alabama Baldwin   2021     280     264      16       0       0       0       0
## 2 alabama Clay      2021     110     101       9       0       0       0       0
## 3 alabama DeKalb    2021     361     331      30       0       0       0       0
## 4 alabama Elmore    2021     241     238       3       0       0       0       0
## 5 alabama Etowah    2021     286     253      33       0       0       0       0
## 6 alabama Jeffers~  2021     365     174     186       3       2       0       0
## # ... with 8 more variables: Max.AQI <dbl>, X90th.Percentile.AQI <dbl>,
## #   Median.AQI <dbl>, Days.CO <dbl>, Days.NO2 <dbl>, Days.Ozone <dbl>,
## #   Days.PM2.5 <dbl>, Days.PM10 <dbl>, and abbreviated variable names
## #   1: Days.with.AQI, 2: Good.Days, 3: Moderate.Days,
## #   4: Unhealthy.for.Sensitive.Groups.Days, 5: Unhealthy.Days,
## #   6: Very.Unhealthy.Days, 7: Hazardous.Days

3B: Merging and Validating Our Data

We are now ready to merge our data sets. cal_aqi will tie in our columns and their relevant means. We will then use the Merge function to tie both data sets together

cal_aqi <- 
  aqi %>%
  select(State, Good.Days, Moderate.Days, Unhealthy.Days, Very.Unhealthy.Days, Hazardous.Days) %>%
  group_by(State) %>%
  arrange(Good.Days, Moderate.Days, Unhealthy.Days, Very.Unhealthy.Days, Hazardous.Days) %>%
  summarise(Good.Days = mean(Good.Days),
            Moderate.Days = mean(Moderate.Days),
            Unhealthy.Days = mean(Unhealthy.Days),
            Very.Unhealthy.Days = mean(Very.Unhealthy.Days),
            Hazardous.Days = mean(Hazardous.Days))

cal_aqi <- merge( states, cal_aqi, by="State" )

We will use the head function to validate two things: 1: The merge was successful and 2: There are no anomalies in our new data set

head (cal_aqi)
##     State      long      lat group order subregion Good.Days Moderate.Days
## 1 alabama -87.46201 30.38968     1     1      <NA>  246.9333      47.26667
## 2 alabama -87.48493 30.37249     1     2      <NA>  246.9333      47.26667
## 3 alabama -87.52503 30.37249     1     3      <NA>  246.9333      47.26667
## 4 alabama -87.53076 30.33239     1     4      <NA>  246.9333      47.26667
## 5 alabama -87.57087 30.32665     1     5      <NA>  246.9333      47.26667
## 6 alabama -87.58806 30.32665     1     6      <NA>  246.9333      47.26667
##   Unhealthy.Days Very.Unhealthy.Days Hazardous.Days
## 1      0.1333333                   0              0
## 2      0.1333333                   0              0
## 3      0.1333333                   0              0
## 4      0.1333333                   0              0
## 5      0.1333333                   0              0
## 6      0.1333333                   0              0

###3C: Mapping Our Data

Now we will map the data by the four columns we have been working with. To improve visualization, here are a few functions used along with additional explanation as to which role they played in the final visualization

direction = 1: This organized colors in the map by sorting from greatest to smallest, rather than smallest to greatest. This is why in some of the maps, the higher the count of days in a specific state, the darker those states

xis.text = element_blank() & axis.ticks.y = element_blank (): This removed any text on the x-axis and y-axis. This previous included longitude/latitude descriptions

panel.grid = element_blank(): This removed any background grid from our Map.

theme_light(): This removed background color from our Map.

ggplot() +
  geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Good.Days),
               color = "Black") +
  scale_fill_distiller("Average Days", palette = "PRGn", direction = 1) +
  xlab("") + ylab("") +
  ggtitle("Good Days by State",
          subtitle = "EPA 2021") +
  geom_point() +
  theme_light() +
  theme(axis.text = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid = element_blank()) +

  theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
  theme(legend.position = "right")
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## i Please use the `linewidth` argument instead.

ggplot() +
  geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Moderate.Days),
                color = "Black") +
  scale_fill_distiller("Average Days", palette = "YlGnBu", direction = 1) +
  xlab("") + ylab("") +
  ggtitle("Moderate Days by State",
          subtitle = "EPA 2021") +
  geom_point() +
  theme_light() +
  theme(
    axis.text = element_blank(),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid = element_blank()) +
        
    theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
  theme(legend.position = "right")

ggplot() +
  geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Unhealthy.Days),
               color = "Black") +
  scale_fill_distiller("Average Days", palette = "Oranges", direction = 1) +
  xlab("") + ylab("") +
  ggtitle("Unhealthy Days by State",
          subtitle = "EPA 2021") +
  geom_point() +
  theme_light() +
  theme(
    axis.text = element_blank(),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid = element_blank()) +
    theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
  theme(legend.position = "right")

ggplot() +
  geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Very.Unhealthy.Days),
               color = "Black") +
  scale_fill_distiller("Average Days", palette = "YlOrBr", direction = 1) +
  xlab("") + ylab("") +
  ggtitle("Very Unhealthy Days by State",
          subtitle = "EPA 2021") +
  geom_point() +
  theme_light() +
  theme(
    axis.text = element_blank(),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid = element_blank()) +
    theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
  theme(legend.position = "right")

ggplot() +
  geom_polygon(data = cal_aqi, aes(x = long, y = lat, group = group, fill = Hazardous.Days),
              color = "Black") +
  scale_fill_distiller("Average Days", palette = "YlOrRd", direction = 1) +
  xlab("") + ylab("") +
  ggtitle("Hazardous Days by State",
          subtitle = "EPA 2021") +
  geom_point() +
  theme_light() +
  theme(
    axis.text = element_blank(),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid = element_blank()) +
      theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
  theme(legend.position = "right")

Final Part: Recreating Part 3

For our final Map, we will repeat the steps previously completed, but this time using Max.AQI by State.

cal_aqi2 <- 
  aqi %>%
  select(State, Max.AQI) %>%
  group_by(State) %>%
  summarise(Max.AQI = mean(Max.AQI))

cal_aqi2 <- merge( states, cal_aqi2, by="State" )
ggplot() +
  geom_polygon(data = cal_aqi2, aes(x = long, y = lat, group = group, fill = Max.AQI),
               color = "Black") +
  scale_fill_distiller("Average AQI", palette = "YlOrRd", direction = 1) +
  xlab("") + ylab("") +
  ggtitle("Max AQI by State",
          subtitle = "EPA 2021") +
  geom_point() +
  theme_light() +
  theme(
    axis.text = element_blank(),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid = element_blank()) +
      theme(legend.background = element_rect(fill="gray90", size=.5, linetype="dotted")) +
  theme(legend.position = "right")

Sources:

https://sparkbyexamples.com/r-programming/rename-column-in-r/

http://www.sthda.com/english/wiki/colors-in-r

https://statisticsglobe.com/tolower-toupper-casefold-chartr-r-function

https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html

https://www.guru99.com/r-sort-data-frame.html