For this project, I analyzed air quality data for Maryland in 2024. The dataset includes measurements from four monitoring sites: Essex, Howard County Near Road, HU-Beltsville, and Piney Run.
Unfortunately, Montgomery County, where my family and I live, is not included in this dataset. My original intention was to explore air quality across all counties in Maryland, particularly in Montgomery County, to understand the conditions where my children are growing up. However, this dataset provides valuable insights into the general trends and conditions in nearby locations, which can still be helpful for understanding regional air quality.
This limitation highlights the need for more comprehensive data to evaluate environmental conditions in Montgomery County, which I plan to pursue in future projects.
Why This Dataset?
Air quality is a critical aspect of public health, affecting both respiratory and cardiovascular health. For me, this topic is deeply personal. As a Maryland resident and a parent, I want to analyze the conditions in which my kids are growing up. Did I make the right decision to move to Maryland with two kids?
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(ggplot2)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 784 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Date, Source, Units, Local Site Name, AQS Parameter Description, C...
dbl (12): Site ID, POC, Daily Max 8-hour CO Concentration, Daily AQI Value, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Extract Month and Year for seasonal analysis
data11_clean <- data11 %>%
  mutate(
    Date = as.Date(Date, format = "%m/%d/%Y"),  # Convert to Date format
    Month = month(Date, label = TRUE),          # Extract Month
    Year = year(Date)                           # Extract Year
  ) %>%
  filter(!is.na(`Daily AQI Value`))             # Drop rows with NA in AQI

# Keep only relevant columns
data11_clean <- data11_clean %>%
  select(Date, Month, `Daily AQI Value`, `Daily Max 8-hour CO Concentration`, `Local Site Name`)
# A tibble: 1 × 5
Date Month `Daily AQI Value` Daily Max 8-hour CO Concentr…¹ `Local Site Name`
<int> <int> <int> <int> <int>
1 0 0 0 0 0
# ℹ abbreviated name: ¹`Daily Max 8-hour CO Concentration`
In this project, the data cleaning process involved several key steps to ensure the dataset was ready for analysis. First, temporal variables such as Month and Year were extracted to facilitate seasonal analysis of air quality trends. The dataset was then checked for missing values, and the check found none in the five retained columns. To ensure robust insights, a threshold of 20 observations per site and month was applied, filtering out months with insufficient data. This step reduced potential bias from sparse data points and made the subsequent analyses more reliable. The cleaned dataset consisted of 746 observations, ready for exploratory analysis, regression modeling, and visualization.
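The code for the 20-observation threshold is not shown in this write-up. The following is a minimal sketch of one way such a filter could be applied with dplyr. The toy data frame `toy`, the constant `min_obs`, and the result `toy_filtered` are hypothetical stand-ins, not the project's actual objects.

```r
library(dplyr)

# Hypothetical toy data: site "A" has 25 January readings, site "B" only 5.
toy <- tibble(
  `Local Site Name` = c(rep("A", 25), rep("B", 5)),
  Month = "Jan",
  `Daily AQI Value` = c(rep(3, 25), rep(4, 5))
)

# Keep only site-month groups with at least `min_obs` observations.
min_obs <- 20
toy_filtered <- toy %>%
  group_by(`Local Site Name`, Month) %>%
  filter(n() >= min_obs) %>%
  ungroup()

# Count the rows that survived the threshold, per site and month.
toy_filtered %>% count(`Local Site Name`, Month)
```

Filtering inside `group_by()` drops whole site-month groups at once, so a month with only a handful of readings cannot skew that site's monthly average.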
# Check column names in dataset
names(data11_clean)
[1] "Date" "Month"
[3] "Daily AQI Value" "Daily Max 8-hour CO Concentration"
[5] "Local Site Name"
# Build a regression model
model <- lm(`Daily AQI Value` ~ `Daily Max 8-hour CO Concentration`, data = data11_clean)
summary(model)
Call:
lm(formula = `Daily AQI Value` ~ `Daily Max 8-hour CO Concentration`,
data = data11_clean)
Residuals:
Min 1Q Median 3Q Max
-0.60373 -0.08412 -0.08412 0.10583 0.53598
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.29578 0.01418 -20.86 <2e-16 ***
`Daily Max 8-hour CO Concentration` 11.89951 0.05398 220.45 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2116 on 782 degrees of freedom
Multiple R-squared: 0.9842, Adjusted R-squared: 0.9841
F-statistic: 4.86e+04 on 1 and 782 DF, p-value: < 2.2e-16
# Scatter plot with regression line
ggplot(data11_clean, aes(x = `Daily Max 8-hour CO Concentration`, y = `Daily AQI Value`)) +
  geom_point(alpha = 0.6, color = "blue") +                # Scatter points
  geom_smooth(method = "lm", se = FALSE, color = "pink") + # Regression line
  labs(
    title = "Relationship Between CO Concentration and AQI",
    x = "Daily Max 8-hour CO Concentration (ppm)",
    y = "Daily AQI Value"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
The linear regression analysis reveals a strong positive relationship between the Daily Max 8-hour CO Concentration (Carbon Monoxide) and the Daily AQI Value (Air Quality Index), with CO concentration emerging as a significant predictor of air quality. For every 1 ppm (parts per million) increase in CO concentration, the AQI rises by approximately 11.90 units, as indicated by the model’s coefficient. This relationship is statistically significant, with a p-value of less than 2.2e-16, and the model explains 98.42% of the variance in AQI values, as shown by the R-squared value of 0.9842. The residual standard error of 0.2116 AQI units indicates that the fitted line tracks the observed values closely. These findings highlight the critical role of CO emissions in air quality deterioration, suggesting that reducing CO pollution could substantially improve air quality in Maryland. This insight is particularly valuable for policymakers aiming to create targeted interventions and for residents concerned about the health implications of air pollution, especially for vulnerable groups like children and the elderly.
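The figures quoted in this interpretation can be read directly from the fitted model instead of being copied by hand from the printed summary. The sketch below shows the idea on a small made-up data frame. The names `toy`, `co`, `aqi`, and `fit` are hypothetical, and the numbers are not the project's data.

```r
# Hypothetical toy data standing in for the CO and AQI columns.
toy <- data.frame(
  co  = c(0.1, 0.2, 0.3, 0.4, 0.5),
  aqi = c(1, 2, 3, 5, 6)
)

fit <- lm(aqi ~ co, data = toy)
fit_summary <- summary(fit)

slope     <- unname(coef(fit)["co"])  # AQI change per 1 ppm of CO
r_squared <- fit_summary$r.squared    # share of variance explained
rse       <- fit_summary$sigma        # residual standard error

# Report all three with the same rounding.
round(c(slope = slope, r_squared = r_squared, rse = rse), 4)
```

Applied to the `model` object fitted above, the same three accessors would return the coefficient, R-squared, and residual standard error reported in the summary output.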
Air Quality Trends by Site
The trend chart below shows how air quality changes across the year at each monitoring site.
Piney Run consistently had the lowest AQI, reflecting excellent air quality throughout the year.
Essex, on the other hand, had higher AQI levels, particularly in the early months, which could indicate seasonal pollution spikes.
Personal Reflection
Seeing these trends makes me reflect on how air quality varies based on location. While Piney Run offers a reassuring picture, Essex raises concerns about localized pollution. As a parent, it’s important to consider how environmental factors like these might affect my family’s health.
# Summarize AQI trends by month and site
monthly_trends <- data11_clean %>%
  group_by(Month, `Local Site Name`) %>%
  summarise(Average_AQI = mean(`Daily AQI Value`, na.rm = TRUE), .groups = "drop")
# Plot the trends (linewidth replaces the deprecated size argument for lines)
ggplot(monthly_trends, aes(x = Month, y = Average_AQI, color = `Local Site Name`, group = `Local Site Name`)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Monthly Air Quality Trends by Monitoring Site (2024)",
    x = "Month",
    y = "Average AQI Value",
    color = "Site Name"
  ) +
  theme_minimal()
Summary of Monthly Air Quality Trends
The visualization highlights the variation in air quality across four monitoring sites in Maryland throughout 2024. Essex consistently showed higher AQI (Air Quality Index) values, particularly in the winter months, suggesting potential pollution spikes during colder weather. In contrast, Piney Run demonstrated the cleanest air quality with the lowest AQI values across all months. Both Howard County Near Road and HU-Beltsville exhibited moderate AQI levels, with a noticeable improvement during the summer months. This seasonal trend, with higher AQI in winter and lower AQI in summer, reflects the impact of factors like heating emissions and atmospheric conditions on air quality.
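The winter-versus-summer comparison described here can also be checked numerically. The following is a minimal sketch on made-up monthly averages. The data frame `toy_trends`, the site names, and the season definitions (winter as December through February, summer as June through August) are assumptions made for illustration, not values from the project.

```r
library(dplyr)

# Hypothetical monthly averages for two sites; the values are made up.
toy_trends <- tibble(
  Month = rep(c("Jan", "Feb", "Jul", "Aug"), times = 2),
  `Local Site Name` = rep(c("Site A", "Site B"), each = 4),
  Average_AQI = c(5, 4, 2, 3, 2, 2, 1, 1)
)

# Assumption: winter = Dec-Feb, summer = Jun-Aug; other months are dropped.
season_summary <- toy_trends %>%
  mutate(Season = case_when(
    Month %in% c("Dec", "Jan", "Feb") ~ "Winter",
    Month %in% c("Jun", "Jul", "Aug") ~ "Summer"
  )) %>%
  filter(!is.na(Season)) %>%
  group_by(`Local Site Name`, Season) %>%
  summarise(Mean_AQI = mean(Average_AQI), .groups = "drop")

season_summary
```

The same pipeline should work on `monthly_trends`, since `%in%` compares the labels of the `Month` factor with the month abbreviations, assuming an English locale.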
# Faceted view: plot the monthly averages so each panel shows one line per site
ggplot(monthly_trends, aes(x = Month, y = Average_AQI, color = `Local Site Name`, group = `Local Site Name`)) +
  geom_line() +
  facet_wrap(~`Local Site Name`, scales = "free") +
  labs(
    title = "Monthly Air Quality Trends by Site (2024)",
    x = "Month",
    y = "Average AQI Value",
    caption = "Data Source: Maryland AQI Data, 2024"
  ) +
  theme_minimal()
library(plotly)
# Define the ggplot object from the monthly averages
p <- ggplot(monthly_trends, aes(x = Month, y = Average_AQI, color = `Local Site Name`, group = `Local Site Name`)) +
  geom_line() +
  labs(
    title = "Monthly Air Quality Trends by Monitoring Site (Interactive)",
    x = "Month",
    y = "Average AQI Value",
    color = "Site Name"
  ) +
  theme_minimal()

# Make the ggplot interactive
interactive_plot <- ggplotly(p)

# Print the interactive plot
interactive_plot
The analysis highlights the critical relationship between carbon monoxide (CO) concentrations and air quality (measured via AQI) in Maryland. Seasonal variations in AQI trends suggest heightened pollution levels during colder months, especially at sites like Essex, while Piney Run consistently reported excellent air quality. Regression analysis further confirmed a strong positive correlation between CO and AQI, emphasizing the need for targeted air pollution control measures.