‘Citizen Science Air Monitor’ or CSAM sensor pods. Designed by EPA, the pods measure ozone, fine particulate matter (PM2.5), temperature, and relative humidity.
The dataset used for this project has information about the air quality in Washington, DC, from 2021 to 2023. These data come from monitoring stations and include the dates measurements were taken, the locations of monitoring stations, and the air quality index (AQI) measured for several pollutants. Overall there are 30 variables. These include character and numeric variables. This dataset comes from the Environmental Protection Agency (EPA).
There are several reasons why I was interested in working with this dataset. I am interested and worried about how pollutants impact our environmental and personal health. I would like to learn more about what factors influence the levels of pollutants in our atmosphere, and when these become dangerous. Furthermore, I used to teach BIOL 105 at Montgomery College and used similar EPA data for one of the course’s assignments.
Before cleaning the data I used the summary function to learn about the variables. I noticed that all of the variable headers were in capitol letters, so my first step in cleaning was to change all of these letter to lowercase. I then checked for NAs using the is.na function and returned the names of variables with NAs by combining the names, which, and sapply functions. Of these variables I planned on only using the AQI, so I used the filter function to remove and records with NA for AQI and removed all of the other variables with NAs. The dataset had the date and time combined as one variable, so I used the lubridate function to separate these into distinct variable columns. Finally, I wanted to add an additional character variable that identified each AQI as being either healthy or unhealthy, which, according to the EPA, is less than 100 and greater than or equal to 100, respectively (source: https://www.airnow.gov/education/students/what-is-the-aqi/).
Load the necessary libraries for the initial cleaning and exploration
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 45505 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): STATE_NAME, COUNTY_NAME, PARAMETER_NAME, DATETIME_LOCAL, DATUM, UN...
dbl (15): AQSID, SITE_NUM, STATE_CODE, PARAMETER_CODE, POC, LATITUDE, LONGIT...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(epa)
# A tibble: 6 × 30
AQSID SITE_NUM STATE_CODE STATE_NAME COUNTY_NAME PARAMETER_CODE
<dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 110010041 41 11 District Of Columbia District of… 42602
2 110010041 41 11 District Of Columbia District of… 42602
3 110010041 41 11 District Of Columbia District of… 42602
4 110010041 41 11 District Of Columbia District of… 42602
5 110010041 41 11 District Of Columbia District of… 42602
6 110010041 41 11 District Of Columbia District of… 42602
# ℹ 24 more variables: PARAMETER_NAME <chr>, DATETIME_LOCAL <chr>, POC <dbl>,
# LATITUDE <dbl>, LONGITUDE <dbl>, DATUM <chr>, UNITS_OF_MEASURE <chr>,
# METHOD_CODE <dbl>, METHOD_NAME <chr>, DATE_OF_LAST_CHANGE <chr>,
# OBJECTID <dbl>, SAMPLE_DURATION <chr>, POLLUTANT_STANDARD <chr>,
# EVENT_TYPE <chr>, AQI <dbl>, ARITHMETIC_MEAN <dbl>, FIRST_MAX_VALUE <dbl>,
# FIRST_MAX_HOUR <dbl>, OBSERVATION_COUNT <dbl>, OBSERVATION_PERCENT <dbl>,
# LOCAL_SITE_NAME <chr>, ADDRESS <chr>, CITY_NAME <chr>, CBSA_NAME <chr>
Investigate the variables in the dataset using the summary function.
summary(epa)
AQSID SITE_NUM STATE_CODE STATE_NAME
Min. :1.1e+08 Min. :41.0 Min. :11 Length:45505
1st Qu.:1.1e+08 1st Qu.:43.0 1st Qu.:11 Class :character
Median :1.1e+08 Median :43.0 Median :11 Mode :character
Mean :1.1e+08 Mean :44.5 Mean :11
3rd Qu.:1.1e+08 3rd Qu.:43.0 3rd Qu.:11
Max. :1.1e+08 Max. :53.0 Max. :11
COUNTY_NAME PARAMETER_CODE PARAMETER_NAME DATETIME_LOCAL
Length:45505 Min. :42101 Length:45505 Length:45505
Class :character 1st Qu.:61104 Class :character Class :character
Mode :character Median :88101 Mode :character Mode :character
Mean :72966
3rd Qu.:88180
Max. :88403
POC LATITUDE LONGITUDE DATUM
Min. :1.000 Min. :38.88 Min. :-77.02 Length:45505
1st Qu.:1.000 1st Qu.:38.92 1st Qu.:-77.01 Class :character
Median :2.000 Median :38.92 Median :-77.01 Mode :character
Mean :3.068 Mean :38.92 Mean :-77.00
3rd Qu.:5.000 3rd Qu.:38.92 3rd Qu.:-77.01
Max. :5.000 Max. :38.97 Max. :-76.95
UNITS_OF_MEASURE METHOD_CODE METHOD_NAME DATE_OF_LAST_CHANGE
Length:45505 Min. : 14 Length:45505 Length:45505
Class :character 1st Qu.:129 Class :character Class :character
Mode :character Median :811 Mode :character Mode :character
Mean :518
3rd Qu.:812
Max. :898
NA's :2202
OBJECTID SAMPLE_DURATION POLLUTANT_STANDARD EVENT_TYPE
Min. :197659 Length:45505 Length:45505 Length:45505
1st Qu.:209165 Class :character Class :character Class :character
Median :220541 Mode :character Mode :character Mode :character
Mean :220541
3rd Qu.:231917
Max. :243293
AQI ARITHMETIC_MEAN FIRST_MAX_VALUE FIRST_MAX_HOUR
Min. : 0.00 Min. : -5.00 Min. : -4.60 Min. : 0.000
1st Qu.: 11.00 1st Qu.: 0.00 1st Qu.: 0.04 1st Qu.: 0.000
Median : 23.00 Median : 1.00 Median : 0.80 Median : 0.000
Mean : 24.56 Mean : 59.91 Mean : 65.85 Mean : 5.065
3rd Qu.: 35.00 3rd Qu.: 10.00 3rd Qu.: 15.10 3rd Qu.: 9.000
Max. :184.00 Max. :1034.00 Max. :1035.90 Max. :23.000
NA's :34620
OBSERVATION_COUNT OBSERVATION_PERCENT LOCAL_SITE_NAME ADDRESS
Min. : 1.0 Min. : 4.00 Length:45505 Length:45505
1st Qu.: 1.0 1st Qu.:100.00 Class :character Class :character
Median : 1.0 Median :100.00 Mode :character Mode :character
Mean :10.3 Mean : 99.28
3rd Qu.:24.0 3rd Qu.:100.00
Max. :24.0 Max. :100.00
CITY_NAME CBSA_NAME
Length:45505 Length:45505
Class :character Class :character
Mode :character Mode :character
Perform initial cleaning of the data
The initial investigation of the variables revealed that all variable names were uppercase. Use “tolower” function to switch variable names from uppercase to lowercase.
names(epa) <-tolower(names(epa))head(epa)
# A tibble: 6 × 30
aqsid site_num state_code state_name county_name parameter_code
<dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 110010041 41 11 District Of Columbia District of… 42602
2 110010041 41 11 District Of Columbia District of… 42602
3 110010041 41 11 District Of Columbia District of… 42602
4 110010041 41 11 District Of Columbia District of… 42602
5 110010041 41 11 District Of Columbia District of… 42602
6 110010041 41 11 District Of Columbia District of… 42602
# ℹ 24 more variables: parameter_name <chr>, datetime_local <chr>, poc <dbl>,
# latitude <dbl>, longitude <dbl>, datum <chr>, units_of_measure <chr>,
# method_code <dbl>, method_name <chr>, date_of_last_change <chr>,
# objectid <dbl>, sample_duration <chr>, pollutant_standard <chr>,
# event_type <chr>, aqi <dbl>, arithmetic_mean <dbl>, first_max_value <dbl>,
# first_max_hour <dbl>, observation_count <dbl>, observation_percent <dbl>,
# local_site_name <chr>, address <chr>, city_name <chr>, cbsa_name <chr>
Since AQI will be used in further analyses, the “filter’ function is used to remove the records with NAs for AQI. The other variables with NAs are unneeded and these, along with other unneeded variables, are removed with the”select” function.
An additional column is added to indicate when air is heatlhy (AQI <= 100, source: https://www.airnow.gov/education/students/what-is-the-aqi/) using the “mutate” and “case_when” functions.
epa2 <-mutate(epa_clean, health =case_when(aqi <=100~"healthy", aqi >100~"unhealthy"))head(epa2)
Check how the AQI for each pollutant changes over time using a facet wrap.
aqi_pollutant <- epa2 |>ggplot(aes(x=date, y=aqi, color = parameter_name)) +geom_point(alpha =0.4) +labs(title ="AQI by Pollutant Type",caption ="Source: EPA") +facet_wrap(~parameter_name) +theme_bw()aqi_pollutant
The trends that interest me the most here are with ozone and PM 2.5.
To check how AQI changes over latitude and longitude, the dataset is first reduced to only the necessary variables using the “select” function. Then the facet wrap for latitude.
aqi_lat <- epa_latlong |>ggplot(aes(x=latitude, y=aqi, color = parameter_name)) +geom_point(alpha =0.4) +labs(title ="AQI by Latitude and Pollutant Type",caption ="Source: EPA") +facet_wrap(~parameter_name) +theme_bw()aqi_lat
There doesn’t seem to be anything very interesting here.
Next, the facet wrap for longitude.
aqi_long <- epa_latlong |>ggplot(aes(x=longitude, y=aqi, color = parameter_name)) +geom_point(alpha =0.4) +labs(title ="AQI by Longitude and Pollutant Type",caption ="Source: EPA") +facet_wrap(~parameter_name) +theme_bw()aqi_long
Also nothing interesting here.
Moving forward, it was chosen to focus on the relationship between ozone, PM 2.5, and date. Since the AQI for both pollutants appear to fluctuate seasonally, creating a sine wave pattern, a proper analysis would be more complicated (i.e. nonlinear regression or time series analysis) beyond the scope of this course. Instead, a linear regression between the AQIs of PM 2.5 and ozone was done to further explore the relationship between these two pollutants. Maybe they influence each other?
Perform a linear regression
First, put the dataset into the correct form for the regression by removing any unecessary columns using the “select” function and using the “pivot_wider” function to reformat the dataset into the the wider form. This will make each pollutant a variable. The new variable names are cleaned using the “tolower” and “gsub” functions which remove all uppercase letters and replaces spaces with “_”. The “rename” function is used to simplify the names of certain pollutants.
# A tibble: 6 × 10
site_num date time health nitrogen_dioxide ozone pm2.5 carbon_monoxide
<dbl> <date> <time> <chr> <dbl> <dbl> <dbl> <dbl>
1 41 2021-08-15 04:00 healt… 12 43 23 NA
2 41 2021-01-22 05:00 healt… 39 31 20 NA
3 41 2021-01-23 05:00 healt… 19 31 10 NA
4 41 2021-01-24 05:00 healt… 32 31 22 NA
5 41 2021-01-25 05:00 healt… 30 19 43 NA
6 41 2021-01-26 05:00 healt… 33 10 43 NA
# ℹ 2 more variables: pm10 <dbl>, sulfur_dioxide <dbl>
The linear regression begins using the “filter” function to remove and NAs for the variables of interest, PM2.5 and ozone, and then plotting their AQIs using “ggplot”.
ozone_pm <- pollutants2 |>filter(ozone !="NA") |>filter(pm2.5!="NA") ppm_ozone <- ozone_pm |>ggplot(aes(x= pm2.5, y=ozone)) +geom_point() +geom_smooth(method ='lm', formula=y~x, color ="#1aa33f")labs(title ="AQIs of PM 2.5 and Ozone",x ="AQI for PM 2.5",y ="AQI for Ozone",caption ="Source: EPA") +theme_bw()
NULL
ppm_ozone
There doesn’t seem to be a correlation between these two based on the plot.
Confirm if there is or isn’t a correlation and perform a linear regression of PM 2.5 and ozone using the “cor” and “lm” functions.
cor(ozone_pm$pm2.5, ozone_pm$ozone)
[1] 0.2560777
linear <-lm(ozone ~ pm2.5, data = ozone_pm)summary(linear)
Call:
lm(formula = ozone ~ pm2.5, data = ozone_pm)
Residuals:
Min 1Q Median 3Q Max
-38.022 -8.000 0.406 7.298 89.843
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.2738 0.9279 29.393 <2e-16 ***
pm2.5 0.2554 0.0270 9.459 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.2 on 1275 degrees of freedom
Multiple R-squared: 0.06558, Adjusted R-squared: 0.06484
F-statistic: 89.48 on 1 and 1275 DF, p-value: < 2.2e-16
The equation of the model is y = 0.26x + 27.27. This means that for each incfrease in the AQI of PM 2.5, the AQI of ozone increases by 0.26. The correlation value of 0.26 indicates a very weak correlation between PM 2.5 and ozone. Also, the adjusted R squared value of 0.06 suggests that very little of the variability in ozone AQI is caused by PM 2.5 AQI. However, there is a highly significant p-value (less than 2e-16), so PM 2.5’s AQIs are still significant in some way. The discrepancy between the R squared and the p-value could indicate that the model does a poor job at describing the relationship between the two variables.
Make final visualization
Prepare the dataset for making the visualization by uding the “select” function to remove the unneeded variables, including the unused pollutants, and pivoting back to the longer form using the “pivot_longer” function. Since it was lost in the previous manipulation of the dataset, the additional column to indicate when AQI is healthy or unhealthy is added again.
Make a scatter plot of the AQIs for PM 2.5 and ozone pollutants over time. Again the NAs for these are removed, using the “filter” function. Then the plotly library, which will need to be sued for the visualization, is loaded. Plotly adds interactivity via the tooltip. For the plot, the color of the point on the graph is set to correspond to the pollutant while the shape corresponds to the health of the AQI (circle for healthy, triangle for unhealthy).
Warning: package 'plotly' was built under R version 4.4.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
plot <- plot_data |>ggplot(aes(date, aqi, color = pollutant, shape = health)) +geom_jitter(alpha=0.7) +labs(title ="Air Quality Index (AQI) for Ozone and Particulate Matter 2.5 (PM2.5) \n in Washington, DC from January 2021 - September 2023", x ="Date",y ="AQI",caption ="Source: Environmental Protection Agency") +geom_hline(yintercept =100, linetype ="dotdash", size =0.5, color ="gray") +theme(axis.text.x =element_text(angle=90)) +scale_color_brewer(palette ="Set2") +theme_classic()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
plot
ggplotly(plot)
Commentary on the visualization
The seasonal fluctuations of ozone throughout the date range is not unexpected considering the causes of ground-level ozone. Unlike the ozone layer in the stratosphere, ground-level ozone is created when nitrogen oxides and other volatile compounds, which are often released from automobiles, react. They are more likely to react due to the presence of high levels of heat and sunlight (source: https://www.epa.gov/ground-level-ozone-pollution/ground-level-ozone-basics). Therefore, it is expected that the AQI for ozone will be higher during the summer.
PM 2.5 are aerosolized particles that are smaller than 2.5 microns. These are usually dirt, dust, and smoke (source: https://www.epa.gov/pm-pollution/particulate-matter-pm-basics#PM). The higher levels of pM 2.5 during the summer months could be due to wildfires. For example, the Canadian wildfires during the summer of 2023 is the cause for the high levels of PM 2.5 during this time.