DATA 110 Project 2

Author

David Burkart

‘Citizen Science Air Monitor’ or CSAM sensor pods. Designed by EPA, the pods measure ozone, fine particulate matter (PM2.5), temperature, and relative humidity.

‘Citizen Science Air Monitor’ or CSAM sensor pods. Designed by EPA, the pods measure ozone, fine particulate matter (PM2.5), temperature, and relative humidity.

The dataset used for this project has information about the air quality in Washington, DC, from 2021 to 2023. These data come from monitoring stations and include the dates measurements were taken, the locations of monitoring stations, and the air quality index (AQI) measured for several pollutants. Overall there are 30 variables. These include character and numeric variables. This dataset comes from the Environmental Protection Agency (EPA).

There are several reasons why I was interested in working with this dataset. I am interested and worried about how pollutants impact our environmental and personal health. I would like to learn more about what factors influence the levels of pollutants in our atmosphere, and when these become dangerous. Furthermore, I used to teach BIOL 105 at Montgomery College and used similar EPA data for one of the course’s assignments.

Before cleaning the data I used the summary function to learn about the variables. I noticed that all of the variable headers were in capitol letters, so my first step in cleaning was to change all of these letter to lowercase. I then checked for NAs using the is.na function and returned the names of variables with NAs by combining the names, which, and sapply functions. Of these variables I planned on only using the AQI, so I used the filter function to remove and records with NA for AQI and removed all of the other variables with NAs. The dataset had the date and time combined as one variable, so I used the lubridate function to separate these into distinct variable columns. Finally, I wanted to add an additional character variable that identified each AQI as being either healthy or unhealthy, which, according to the EPA, is less than 100 and greater than or equal to 100, respectively (source: https://www.airnow.gov/education/students/what-is-the-aqi/).

Load the necessary libraries for the initial cleaning and exploration

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Set working directory and load the dataset

setwd("C:/Users/dburkart/Desktop/DATA 110/data")
epa <- read_csv("Air_Quality_data_gov_epa.csv")
Rows: 45505 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): STATE_NAME, COUNTY_NAME, PARAMETER_NAME, DATETIME_LOCAL, DATUM, UN...
dbl (15): AQSID, SITE_NUM, STATE_CODE, PARAMETER_CODE, POC, LATITUDE, LONGIT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(epa)
# A tibble: 6 × 30
      AQSID SITE_NUM STATE_CODE STATE_NAME           COUNTY_NAME  PARAMETER_CODE
      <dbl>    <dbl>      <dbl> <chr>                <chr>                 <dbl>
1 110010041       41         11 District Of Columbia District of…          42602
2 110010041       41         11 District Of Columbia District of…          42602
3 110010041       41         11 District Of Columbia District of…          42602
4 110010041       41         11 District Of Columbia District of…          42602
5 110010041       41         11 District Of Columbia District of…          42602
6 110010041       41         11 District Of Columbia District of…          42602
# ℹ 24 more variables: PARAMETER_NAME <chr>, DATETIME_LOCAL <chr>, POC <dbl>,
#   LATITUDE <dbl>, LONGITUDE <dbl>, DATUM <chr>, UNITS_OF_MEASURE <chr>,
#   METHOD_CODE <dbl>, METHOD_NAME <chr>, DATE_OF_LAST_CHANGE <chr>,
#   OBJECTID <dbl>, SAMPLE_DURATION <chr>, POLLUTANT_STANDARD <chr>,
#   EVENT_TYPE <chr>, AQI <dbl>, ARITHMETIC_MEAN <dbl>, FIRST_MAX_VALUE <dbl>,
#   FIRST_MAX_HOUR <dbl>, OBSERVATION_COUNT <dbl>, OBSERVATION_PERCENT <dbl>,
#   LOCAL_SITE_NAME <chr>, ADDRESS <chr>, CITY_NAME <chr>, CBSA_NAME <chr>

Investigate the variables in the dataset using the summary function.

summary(epa)
     AQSID            SITE_NUM      STATE_CODE  STATE_NAME       
 Min.   :1.1e+08   Min.   :41.0   Min.   :11   Length:45505      
 1st Qu.:1.1e+08   1st Qu.:43.0   1st Qu.:11   Class :character  
 Median :1.1e+08   Median :43.0   Median :11   Mode  :character  
 Mean   :1.1e+08   Mean   :44.5   Mean   :11                     
 3rd Qu.:1.1e+08   3rd Qu.:43.0   3rd Qu.:11                     
 Max.   :1.1e+08   Max.   :53.0   Max.   :11                     
                                                                 
 COUNTY_NAME        PARAMETER_CODE  PARAMETER_NAME     DATETIME_LOCAL    
 Length:45505       Min.   :42101   Length:45505       Length:45505      
 Class :character   1st Qu.:61104   Class :character   Class :character  
 Mode  :character   Median :88101   Mode  :character   Mode  :character  
                    Mean   :72966                                        
                    3rd Qu.:88180                                        
                    Max.   :88403                                        
                                                                         
      POC           LATITUDE       LONGITUDE         DATUM          
 Min.   :1.000   Min.   :38.88   Min.   :-77.02   Length:45505      
 1st Qu.:1.000   1st Qu.:38.92   1st Qu.:-77.01   Class :character  
 Median :2.000   Median :38.92   Median :-77.01   Mode  :character  
 Mean   :3.068   Mean   :38.92   Mean   :-77.00                     
 3rd Qu.:5.000   3rd Qu.:38.92   3rd Qu.:-77.01                     
 Max.   :5.000   Max.   :38.97   Max.   :-76.95                     
                                                                    
 UNITS_OF_MEASURE    METHOD_CODE   METHOD_NAME        DATE_OF_LAST_CHANGE
 Length:45505       Min.   : 14    Length:45505       Length:45505       
 Class :character   1st Qu.:129    Class :character   Class :character   
 Mode  :character   Median :811    Mode  :character   Mode  :character   
                    Mean   :518                                          
                    3rd Qu.:812                                          
                    Max.   :898                                          
                    NA's   :2202                                         
    OBJECTID      SAMPLE_DURATION    POLLUTANT_STANDARD  EVENT_TYPE       
 Min.   :197659   Length:45505       Length:45505       Length:45505      
 1st Qu.:209165   Class :character   Class :character   Class :character  
 Median :220541   Mode  :character   Mode  :character   Mode  :character  
 Mean   :220541                                                           
 3rd Qu.:231917                                                           
 Max.   :243293                                                           
                                                                          
      AQI         ARITHMETIC_MEAN   FIRST_MAX_VALUE   FIRST_MAX_HOUR  
 Min.   :  0.00   Min.   :  -5.00   Min.   :  -4.60   Min.   : 0.000  
 1st Qu.: 11.00   1st Qu.:   0.00   1st Qu.:   0.04   1st Qu.: 0.000  
 Median : 23.00   Median :   1.00   Median :   0.80   Median : 0.000  
 Mean   : 24.56   Mean   :  59.91   Mean   :  65.85   Mean   : 5.065  
 3rd Qu.: 35.00   3rd Qu.:  10.00   3rd Qu.:  15.10   3rd Qu.: 9.000  
 Max.   :184.00   Max.   :1034.00   Max.   :1035.90   Max.   :23.000  
 NA's   :34620                                                        
 OBSERVATION_COUNT OBSERVATION_PERCENT LOCAL_SITE_NAME      ADDRESS         
 Min.   : 1.0      Min.   :  4.00      Length:45505       Length:45505      
 1st Qu.: 1.0      1st Qu.:100.00      Class :character   Class :character  
 Median : 1.0      Median :100.00      Mode  :character   Mode  :character  
 Mean   :10.3      Mean   : 99.28                                           
 3rd Qu.:24.0      3rd Qu.:100.00                                           
 Max.   :24.0      Max.   :100.00                                           
                                                                            
  CITY_NAME          CBSA_NAME        
 Length:45505       Length:45505      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

Perform initial cleaning of the data

The initial investigation of the variables revealed that all variable names were uppercase. Use “tolower” function to switch variable names from uppercase to lowercase.

names(epa) <- tolower(names(epa))
head(epa)
# A tibble: 6 × 30
      aqsid site_num state_code state_name           county_name  parameter_code
      <dbl>    <dbl>      <dbl> <chr>                <chr>                 <dbl>
1 110010041       41         11 District Of Columbia District of…          42602
2 110010041       41         11 District Of Columbia District of…          42602
3 110010041       41         11 District Of Columbia District of…          42602
4 110010041       41         11 District Of Columbia District of…          42602
5 110010041       41         11 District Of Columbia District of…          42602
6 110010041       41         11 District Of Columbia District of…          42602
# ℹ 24 more variables: parameter_name <chr>, datetime_local <chr>, poc <dbl>,
#   latitude <dbl>, longitude <dbl>, datum <chr>, units_of_measure <chr>,
#   method_code <dbl>, method_name <chr>, date_of_last_change <chr>,
#   objectid <dbl>, sample_duration <chr>, pollutant_standard <chr>,
#   event_type <chr>, aqi <dbl>, arithmetic_mean <dbl>, first_max_value <dbl>,
#   first_max_hour <dbl>, observation_count <dbl>, observation_percent <dbl>,
#   local_site_name <chr>, address <chr>, city_name <chr>, cbsa_name <chr>

Next, check for NAs by first reporting the total number of NAs in the dataset and then using the “names,” “which,” and “sapply” functions to identify which variables have NAs(Code Source: https://stackoverflow.com/questions/20364450/find-names-of-columns-which-contain-missing-values).

sum(is.na(epa))
[1] 70190
names(which(sapply(epa, anyNA)))
[1] "method_code"        "pollutant_standard" "aqi"               

Since AQI will be used in further analyses, the “filter’ function is used to remove the records with NAs for AQI. The other variables with NAs are unneeded and these, along with other unneeded variables, are removed with the”select” function.

epa_clean1 <- epa |>
  filter(aqi != "NA") |>
  select(-aqsid, -state_code, -state_name, -county_name, -method_code, -method_name, -pollutant_standard, -city_name, -cbsa_name)
head(epa_clean1)
# A tibble: 6 × 21
  site_num parameter_code parameter_name datetime_local   poc latitude longitude
     <dbl>          <dbl> <chr>          <chr>          <dbl>    <dbl>     <dbl>
1       41          42602 Nitrogen diox… 2021/08/15 04…     1     38.9     -77.0
2       41          42602 Nitrogen diox… 2021/01/22 05…     1     38.9     -77.0
3       41          42602 Nitrogen diox… 2021/01/23 05…     1     38.9     -77.0
4       41          42602 Nitrogen diox… 2021/01/24 05…     1     38.9     -77.0
5       41          42602 Nitrogen diox… 2021/01/25 05…     1     38.9     -77.0
6       41          42602 Nitrogen diox… 2021/01/26 05…     1     38.9     -77.0
# ℹ 14 more variables: datum <chr>, units_of_measure <chr>,
#   date_of_last_change <chr>, objectid <dbl>, sample_duration <chr>,
#   event_type <chr>, aqi <dbl>, arithmetic_mean <dbl>, first_max_value <dbl>,
#   first_max_hour <dbl>, observation_count <dbl>, observation_percent <dbl>,
#   local_site_name <chr>, address <chr>

Use the “sum” function again to ensure NAs are gone.

sum(is.na(epa_clean1))
[1] 0

The “lubridate” function is used to fix the combined date and time variable by separating them into their own variables.

library(lubridate)
library(hms)

Attaching package: 'hms'
The following object is masked from 'package:lubridate':

    hms
epa_clean1$date <- as.Date(format(ymd_hms(epa_clean1$datetime_local),format = '%Y-%m-%d'))
epa_clean1$time <- as_hms(ymd_hms(epa_clean1$datetime_local))
#https://stackoverflow.com/questions/50736661/extract-time-hms-from-lubridate-date-time-object
epa_clean <- select(epa_clean1, -datetime_local)
head(epa_clean1)
# A tibble: 6 × 23
  site_num parameter_code parameter_name datetime_local   poc latitude longitude
     <dbl>          <dbl> <chr>          <chr>          <dbl>    <dbl>     <dbl>
1       41          42602 Nitrogen diox… 2021/08/15 04…     1     38.9     -77.0
2       41          42602 Nitrogen diox… 2021/01/22 05…     1     38.9     -77.0
3       41          42602 Nitrogen diox… 2021/01/23 05…     1     38.9     -77.0
4       41          42602 Nitrogen diox… 2021/01/24 05…     1     38.9     -77.0
5       41          42602 Nitrogen diox… 2021/01/25 05…     1     38.9     -77.0
6       41          42602 Nitrogen diox… 2021/01/26 05…     1     38.9     -77.0
# ℹ 16 more variables: datum <chr>, units_of_measure <chr>,
#   date_of_last_change <chr>, objectid <dbl>, sample_duration <chr>,
#   event_type <chr>, aqi <dbl>, arithmetic_mean <dbl>, first_max_value <dbl>,
#   first_max_hour <dbl>, observation_count <dbl>, observation_percent <dbl>,
#   local_site_name <chr>, address <chr>, date <date>, time <time>

An additional column is added to indicate when air is heatlhy (AQI <= 100, source: https://www.airnow.gov/education/students/what-is-the-aqi/) using the “mutate” and “case_when” functions.

epa2 <- mutate(epa_clean, health = case_when(aqi <= 100 ~ "healthy", aqi > 100 ~ "unhealthy"))
head(epa2)
# A tibble: 6 × 23
  site_num parameter_code parameter_name           poc latitude longitude datum
     <dbl>          <dbl> <chr>                  <dbl>    <dbl>     <dbl> <chr>
1       41          42602 Nitrogen dioxide (NO2)     1     38.9     -77.0 WGS84
2       41          42602 Nitrogen dioxide (NO2)     1     38.9     -77.0 WGS84
3       41          42602 Nitrogen dioxide (NO2)     1     38.9     -77.0 WGS84
4       41          42602 Nitrogen dioxide (NO2)     1     38.9     -77.0 WGS84
5       41          42602 Nitrogen dioxide (NO2)     1     38.9     -77.0 WGS84
6       41          42602 Nitrogen dioxide (NO2)     1     38.9     -77.0 WGS84
# ℹ 16 more variables: units_of_measure <chr>, date_of_last_change <chr>,
#   objectid <dbl>, sample_duration <chr>, event_type <chr>, aqi <dbl>,
#   arithmetic_mean <dbl>, first_max_value <dbl>, first_max_hour <dbl>,
#   observation_count <dbl>, observation_percent <dbl>, local_site_name <chr>,
#   address <chr>, date <date>, time <time>, health <chr>

Check for potential factors that affect AQI

Check how the AQI for each pollutant changes over time using a facet wrap.

aqi_pollutant <- epa2 |>
  ggplot(aes(x=date, y=aqi, color = parameter_name)) +
  geom_point(alpha = 0.4) +
  labs(title = "AQI by Pollutant Type",
       caption = "Source: EPA") +
  facet_wrap(~parameter_name) + 
  theme_bw()
aqi_pollutant

To check how AQI changes over latitude and longitude, the dataset is first reduced to only the necessary variables using the “select” function. Then the facet wrap for latitude.

epa_latlong <- select(epa2, parameter_name, latitude, longitude, aqi)
head(epa_latlong)
# A tibble: 6 × 4
  parameter_name         latitude longitude   aqi
  <chr>                     <dbl>     <dbl> <dbl>
1 Nitrogen dioxide (NO2)     38.9     -77.0    12
2 Nitrogen dioxide (NO2)     38.9     -77.0    39
3 Nitrogen dioxide (NO2)     38.9     -77.0    19
4 Nitrogen dioxide (NO2)     38.9     -77.0    32
5 Nitrogen dioxide (NO2)     38.9     -77.0    30
6 Nitrogen dioxide (NO2)     38.9     -77.0    33
aqi_lat <- epa_latlong |>
  ggplot(aes(x=latitude, y=aqi, color = parameter_name)) +
  geom_point(alpha = 0.4) +
  labs(title = "AQI by Latitude and Pollutant Type",
       caption = "Source: EPA") +
  facet_wrap(~parameter_name) + 
  theme_bw()
aqi_lat

There doesn’t seem to be anything very interesting here.

Next, the facet wrap for longitude.

aqi_long <- epa_latlong |>
  ggplot(aes(x=longitude, y=aqi, color = parameter_name)) +
  geom_point(alpha = 0.4) +
  labs(title = "AQI by Longitude and Pollutant Type",
       caption = "Source: EPA") +
  facet_wrap(~parameter_name) + 
  theme_bw()
aqi_long

Also nothing interesting here.

Moving forward, it was chosen to focus on the relationship between ozone, PM 2.5, and date. Since the AQI for both pollutants appear to fluctuate seasonally, creating a sine wave pattern, a proper analysis would be more complicated (i.e. nonlinear regression or time series analysis) beyond the scope of this course. Instead, a linear regression between the AQIs of PM 2.5 and ozone was done to further explore the relationship between these two pollutants. Maybe they influence each other?

Perform a linear regression

First, put the dataset into the correct form for the regression by removing any unecessary columns using the “select” function and using the “pivot_wider” function to reformat the dataset into the the wider form. This will make each pollutant a variable. The new variable names are cleaned using the “tolower” and “gsub” functions which remove all uppercase letters and replaces spaces with “_”. The “rename” function is used to simplify the names of certain pollutants.

pollutants <- epa2 |>
  select(site_num, parameter_name, aqi, site_num, date, time, health) |>
  pivot_wider(names_from = parameter_name, values_from = aqi) 

names(pollutants) <- tolower(names(pollutants))
names(pollutants) <- gsub(" ", "_", names(pollutants))

head(pollutants)
# A tibble: 6 × 10
  site_num date       time   health  `nitrogen_dioxide_(no2)` ozone
     <dbl> <date>     <time> <chr>                      <dbl> <dbl>
1       41 2021-08-15 04:00  healthy                       12    43
2       41 2021-01-22 05:00  healthy                       39    31
3       41 2021-01-23 05:00  healthy                       19    31
4       41 2021-01-24 05:00  healthy                       32    31
5       41 2021-01-25 05:00  healthy                       30    19
6       41 2021-01-26 05:00  healthy                       33    10
# ℹ 4 more variables: `pm2.5_-_local_conditions` <dbl>, carbon_monoxide <dbl>,
#   `pm10_total_0-10um_stp` <dbl>, sulfur_dioxide <dbl>
pollutants2 <- pollutants |>
  rename(nitrogen_dioxide = `nitrogen_dioxide_(no2)`) |>
  rename(pm10 = `pm10_total_0-10um_stp`) |>
  rename(pm2.5 = `pm2.5_-_local_conditions`)

head(pollutants2)
# A tibble: 6 × 10
  site_num date       time   health nitrogen_dioxide ozone pm2.5 carbon_monoxide
     <dbl> <date>     <time> <chr>             <dbl> <dbl> <dbl>           <dbl>
1       41 2021-08-15 04:00  healt…               12    43    23              NA
2       41 2021-01-22 05:00  healt…               39    31    20              NA
3       41 2021-01-23 05:00  healt…               19    31    10              NA
4       41 2021-01-24 05:00  healt…               32    31    22              NA
5       41 2021-01-25 05:00  healt…               30    19    43              NA
6       41 2021-01-26 05:00  healt…               33    10    43              NA
# ℹ 2 more variables: pm10 <dbl>, sulfur_dioxide <dbl>

The linear regression begins using the “filter” function to remove and NAs for the variables of interest, PM2.5 and ozone, and then plotting their AQIs using “ggplot”.

ozone_pm <- pollutants2 |>
  filter(ozone != "NA") |>
  filter(pm2.5 != "NA") 

ppm_ozone <- ozone_pm |>
  ggplot(aes(x= pm2.5, y=ozone)) +
  geom_point() +
  geom_smooth(method = 'lm', formula=y~x, color = "#1aa33f")
  labs(title = "AQIs of PM 2.5 and Ozone",
       x = "AQI for PM 2.5",
       y = "AQI for Ozone",
       caption = "Source: EPA") +
  theme_bw()
NULL
ppm_ozone

There doesn’t seem to be a correlation between these two based on the plot.

Confirm if there is or isn’t a correlation and perform a linear regression of PM 2.5 and ozone using the “cor” and “lm” functions.

cor(ozone_pm$pm2.5, ozone_pm$ozone)
[1] 0.2560777
linear <- lm(ozone ~ pm2.5, data = ozone_pm)
summary(linear)

Call:
lm(formula = ozone ~ pm2.5, data = ozone_pm)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.022  -8.000   0.406   7.298  89.843 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.2738     0.9279  29.393   <2e-16 ***
pm2.5         0.2554     0.0270   9.459   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.2 on 1275 degrees of freedom
Multiple R-squared:  0.06558,   Adjusted R-squared:  0.06484 
F-statistic: 89.48 on 1 and 1275 DF,  p-value: < 2.2e-16

The equation of the model is y = 0.26x + 27.27. This means that for each incfrease in the AQI of PM 2.5, the AQI of ozone increases by 0.26. The correlation value of 0.26 indicates a very weak correlation between PM 2.5 and ozone. Also, the adjusted R squared value of 0.06 suggests that very little of the variability in ozone AQI is caused by PM 2.5 AQI. However, there is a highly significant p-value (less than 2e-16), so PM 2.5’s AQIs are still significant in some way. The discrepancy between the R squared and the p-value could indicate that the model does a poor job at describing the relationship between the two variables.

Make final visualization

Prepare the dataset for making the visualization by uding the “select” function to remove the unneeded variables, including the unused pollutants, and pivoting back to the longer form using the “pivot_longer” function. Since it was lost in the previous manipulation of the dataset, the additional column to indicate when AQI is healthy or unhealthy is added again.

pollutants3 <- pollutants2 |>
  select(-site_num, -time, -health, -nitrogen_dioxide, -carbon_monoxide, -pm10, -sulfur_dioxide) |>
  pivot_longer(cols = 2:3, names_to = "pollutant", values_to = "aqi") 

pollutants3 <- mutate(pollutants3, health = case_when(aqi <= 100 ~ "healthy", aqi > 100 ~ "unhealthy"))
head(pollutants3)
# A tibble: 6 × 4
  date       pollutant   aqi health 
  <date>     <chr>     <dbl> <chr>  
1 2021-08-15 ozone        43 healthy
2 2021-08-15 pm2.5        23 healthy
3 2021-01-22 ozone        31 healthy
4 2021-01-22 pm2.5        20 healthy
5 2021-01-23 ozone        31 healthy
6 2021-01-23 pm2.5        10 healthy

Make a scatter plot of the AQIs for PM 2.5 and ozone pollutants over time. Again the NAs for these are removed, using the “filter” function. Then the plotly library, which will need to be sued for the visualization, is loaded. Plotly adds interactivity via the tooltip. For the plot, the color of the point on the graph is set to correspond to the pollutant while the shape corresponds to the health of the AQI (circle for healthy, triangle for unhealthy).

plot_data <- filter(pollutants3, aqi != "NA")
head(plot_data)
# A tibble: 6 × 4
  date       pollutant   aqi health 
  <date>     <chr>     <dbl> <chr>  
1 2021-08-15 ozone        43 healthy
2 2021-08-15 pm2.5        23 healthy
3 2021-01-22 ozone        31 healthy
4 2021-01-22 pm2.5        20 healthy
5 2021-01-23 ozone        31 healthy
6 2021-01-23 pm2.5        10 healthy
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
plot <- plot_data |>
  ggplot(aes(date, aqi, color = pollutant, shape = health)) + 
  geom_jitter(alpha=0.7) + 
  labs(title = "Air Quality Index (AQI) for Ozone and Particulate Matter 2.5 (PM2.5) \n in Washington, DC from January 2021 - September 2023", 
       x = "Date",
       y = "AQI",
       caption = "Source: Environmental Protection Agency") +
   geom_hline(yintercept = 100, linetype = "dotdash", size = 0.5, color = "gray") +
  theme(axis.text.x = element_text(angle= 90)) +
  scale_color_brewer(palette = "Set2") +
  theme_classic()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
plot

ggplotly(plot)

Commentary on the visualization

The seasonal fluctuations of ozone throughout the date range is not unexpected considering the causes of ground-level ozone. Unlike the ozone layer in the stratosphere, ground-level ozone is created when nitrogen oxides and other volatile compounds, which are often released from automobiles, react. They are more likely to react due to the presence of high levels of heat and sunlight (source: https://www.epa.gov/ground-level-ozone-pollution/ground-level-ozone-basics). Therefore, it is expected that the AQI for ozone will be higher during the summer.

PM 2.5 are aerosolized particles that are smaller than 2.5 microns. These are usually dirt, dust, and smoke (source: https://www.epa.gov/pm-pollution/particulate-matter-pm-basics#PM). The higher levels of pM 2.5 during the summer months could be due to wildfires. For example, the Canadian wildfires during the summer of 2023 is the cause for the high levels of PM 2.5 during this time.

Bibliography

Pollutants:

https://www.airnow.gov/education/students/what-is-the-aqi/ https://www.epa.gov/ground-level-ozone-pollution/ground-level-ozone-basics https://www.epa.gov/pm-pollution/particulate-matter-pm-basics#PM

Code:

https://stackoverflow.com/questions/20364450/find-names-of-columns-which-contain-missing-values

https://stackoverflow.com/questions/20364450/find-names-of-columns-which-contain-missing-values