Week11Project

Author

Viktoriia L.

AQI Categories

Introduction

For this project, I analyzed air quality data for Maryland in 2024. The dataset includes measurements from four monitoring sites: Essex, Howard County Near Road, HU-Beltsville, and Piney Run.

Unfortunately, Montgomery County, where my family and I live, is not included in this dataset. My original intention was to explore air quality across all counties in Maryland, particularly in Montgomery County, to understand the conditions where my children are growing up. However, this dataset provides valuable insights into the general trends and conditions in nearby locations, which can still be helpful for understanding regional air quality.

This limitation highlights the need for more comprehensive data to evaluate environmental conditions in Montgomery County, which I plan to pursue in future projects.

Why This Dataset?

Air quality is a critical aspect of public health, affecting both respiratory and cardiovascular health. For me, this topic is deeply personal. As a Maryland resident and a parent, I want to analyze the conditions in which my kids are growing up. Did I make the right decision to move to Maryland with two kids?

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(ggplot2)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
setwd("C:/Users/lnvik/Downloads/DATA110")
data11 <- read_csv("ad_viz_plotval_data.csv")
Rows: 784 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): Date, Source, Units, Local Site Name, AQS Parameter Description, C...
dbl (12): Site ID, POC, Daily Max 8-hour CO Concentration, Daily AQI Value, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Extract Month and Year for seasonal analysis
data11_clean <- data11 %>%
  mutate(
    Date = as.Date(Date, format = "%m/%d/%Y"), # Convert to Date format
    Month = month(Date, label = TRUE),        # Extract Month
    Year = year(Date)                         # Extract Year
  ) %>%
  filter(!is.na(`Daily AQI Value`))           # Drop rows with a missing AQI value
# Keep only relevant columns
data11_clean <- data11_clean %>%
  select(Date, Month, `Daily AQI Value`, `Daily Max 8-hour CO Concentration`, 
         `Local Site Name`)

# Check for missing values
missing_summary <- data11_clean %>%
  summarise(across(everything(), ~ sum(is.na(.))))
missing_summary
# A tibble: 1 × 5
   Date Month `Daily AQI Value` Daily Max 8-hour CO Concentr…¹ `Local Site Name`
  <int> <int>             <int>                          <int>             <int>
1     0     0                 0                              0                 0
# ℹ abbreviated name: ¹`Daily Max 8-hour CO Concentration`

In this project, the data cleaning process involved several key steps to prepare the dataset for analysis. First, temporal variables (Month and Year) were extracted from the date to support seasonal analysis of air quality trends. Rows without a Daily AQI Value were dropped, and the dataset was then checked for missing values; as the summary above shows, none remain in the five retained columns. To ensure robust insights, a threshold of 20 observations per site and month was applied, filtering out months with insufficient data. This reduced potential bias from sparse data points and improved the reliability of subsequent analyses. The cleaned dataset consisted of 746 observations, ready for exploratory analysis, regression modeling, and visualization.
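The 20-observation threshold described above is not among the code chunks shown in this report, so the following is only a minimal sketch of how such a filter could be written with dplyr. The `toy` data frame, its values, and the names `min_obs` and `filtered` are hypothetical stand-ins chosen for illustration; the column names mirror those in `data11_clean`.

```r
library(dplyr)

# Toy stand-in for data11_clean (hypothetical values, not the real data):
# one site-month group with 25 readings and one with only 5
toy <- tibble(
  `Local Site Name` = c(rep("Essex", 25), rep("Piney Run", 5)),
  Month = "Jan",
  `Daily AQI Value` = 3
)

min_obs <- 20  # threshold named in the text

# Keep only the site-month groups that have at least min_obs daily readings
filtered <- toy %>%
  group_by(`Local Site Name`, Month) %>%
  filter(n() >= min_obs) %>%
  ungroup()

nrow(filtered)  # 25 rows remain, all from the Essex group
```

Filtering inside `group_by()` drops whole site-month groups at once, so a sparsely sampled month at one site does not affect the same month at another site.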

# Check column names in dataset
names(data11_clean)
[1] "Date"                              "Month"                            
[3] "Daily AQI Value"                   "Daily Max 8-hour CO Concentration"
[5] "Local Site Name"                  
# build a regression model
model <- lm(`Daily AQI Value` ~ `Daily Max 8-hour CO Concentration`, data = data11_clean)
summary(model)

Call:
lm(formula = `Daily AQI Value` ~ `Daily Max 8-hour CO Concentration`, 
    data = data11_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.60373 -0.08412 -0.08412  0.10583  0.53598 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         -0.29578    0.01418  -20.86   <2e-16 ***
`Daily Max 8-hour CO Concentration` 11.89951    0.05398  220.45   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2116 on 782 degrees of freedom
Multiple R-squared:  0.9842,    Adjusted R-squared:  0.9841 
F-statistic: 4.86e+04 on 1 and 782 DF,  p-value: < 2.2e-16
# Scatter plot with regression line
ggplot(data11_clean, aes(x = `Daily Max 8-hour CO Concentration`, y = `Daily AQI Value`)) +
  geom_point(alpha = 0.6, color = "blue") +  # Scatter points
  geom_smooth(method = "lm", se = FALSE, color = "pink") +  # Regression line
  labs(
    title = "Relationship Between CO Concentration and AQI",
    x = "Daily Max 8-hour CO Concentration (ppm)",
    y = "Daily AQI Value"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The linear regression shows a strong positive relationship between the Daily Max 8-hour CO Concentration (carbon monoxide) and the Daily AQI Value (Air Quality Index). For every 1 ppm (parts per million) increase in CO concentration, the AQI rises by approximately 11.90 units, as indicated by the model’s coefficient. This relationship is statistically significant, with a p-value of less than 2.2e-16, and the model explains 98.42% of the variance in AQI values (R-squared = 0.9842). The residual standard error of 0.2116 means that fitted values typically fall within a fraction of one AQI unit of the observed values.

Part of this close fit is likely by construction, since the AQI value reported alongside a single pollutant is calculated from that pollutant’s concentration. With that caveat, the findings suggest that reducing CO pollution would lower CO-based AQI readings in Maryland. This insight may be useful for policymakers designing targeted interventions and for residents concerned about the health implications of air pollution, especially for vulnerable groups like children and the elderly.
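To make the coefficient interpretation concrete, here is a small worked calculation using the intercept and slope copied from the `summary(model)` output. The CO reading of 0.3 ppm is a hypothetical value chosen only for illustration, not a figure taken from the dataset.

```r
# Coefficients copied from the summary(model) output above
intercept <- -0.29578
slope     <- 11.89951

# Hypothetical CO reading in ppm, chosen only for illustration
co_ppm <- 0.3

# Fitted line: AQI = intercept + slope * CO
predicted_aqi <- intercept + slope * co_ppm
round(predicted_aqi, 2)  # 3.27
```

With the fitted `model` object in memory, calling `predict()` with a `newdata` data frame containing the same CO value would be expected to give the same figure, up to rounding of the printed coefficients.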

Personal Reflection

Seeing these trends makes me reflect on how air quality varies based on location. While Piney Run offers a reassuring picture, Essex raises concerns about localized pollution. As a parent, it’s important to consider how environmental factors like these might affect my family’s health.

# Summarize AQI trends by month and site
monthly_trends <- data11_clean %>%
  group_by(Month, `Local Site Name`) %>%
  summarise(Average_AQI = mean(`Daily AQI Value`, na.rm = TRUE), .groups = "drop")
# Plot the trends
ggplot(monthly_trends, aes(x = Month, y = Average_AQI, color = `Local Site Name`, group = `Local Site Name`)) +
  geom_line(linewidth = 1) +  # linewidth replaces the deprecated size argument for lines
  geom_point(size = 2) +
  labs(
    title = "Monthly Air Quality Trends by Monitoring Site (2024)",
    x = "Month",
    y = "Average AQI Value",
    color = "Site Name"
  ) +
  theme_minimal()