Overview

This document outlines the process of data cleaning and regression analysis using the airquality dataset available in base R. The dataset contains daily air quality measurements in New York from May to September 1973, including ozone levels, solar radiation, wind speed, and temperature.

Data

First, we inspect the structure of the raw dataset and look for missing values before any cleaning. The preview below already shows NA entries in Ozone and Solar.R.

library(dplyr)  # provides glimpse()
glimpse(airquality)
## Rows: 153
## Columns: 6
## $ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
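The preview shows only the first few values of each column, so it cannot say how many values are missing overall. A minimal base R sketch for counting them follows; the names `na_counts` and `n_complete` are illustrative and not part of the original analysis.

```r
# Count missing values in each column of the raw data (base R only).
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Count the rows that have no missing value in any column.
n_complete <- sum(complete.cases(airquality))
print(n_complete)
```

The second number is the row count to expect once all incomplete rows are dropped.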

Data Cleaning

Dropping NA Values

library(tidyr)  # provides drop_na()
clean <- drop_na(airquality)  # keep only rows with no missing values

Converting Month to Factor

Month is stored as an integer. Converting it to a factor makes the regression treat it as a categorical variable, estimating a separate effect for each month rather than a single linear trend across month numbers.

clean$Month <- as.factor(clean$Month)
str(clean)
## 'data.frame':    111 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 23 19 8 16 11 14 ...
##  $ Solar.R: int  190 118 149 313 299 99 19 256 290 274 ...
##  $ Wind   : num  7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
##  $ Temp   : int  67 72 74 62 65 59 61 69 66 68 ...
##  $ Month  : Factor w/ 5 levels "5","6","7","8",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Day    : int  1 2 3 4 7 8 9 12 13 14 ...
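As an optional variation, the factor levels could carry month names instead of numbers, which would make regression coefficients easier to read. The sketch below is not what the analysis above does: it rebuilds the cleaned data with base R `na.omit()` to stay self-contained, and `clean_labelled` is a hypothetical name.

```r
# Rebuild the cleaned data with base R so the sketch is self-contained.
clean_labelled <- na.omit(airquality)

# Map month numbers 5..9 to abbreviated names; May remains the first
# (reference) level because it is listed first in `levels`.
clean_labelled$Month <- factor(clean_labelled$Month,
                               levels = 5:9,
                               labels = month.abb[5:9])

print(levels(clean_labelled$Month))

# Number of complete observations available in each month.
print(table(clean_labelled$Month))
```

The per-month counts are worth a glance, since a month with few complete rows will have a less precisely estimated coefficient.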

Classifying Ozone Exposure Levels

We create a binary variable classifying each day’s ozone level relative to the mean ozone level of the cleaned data: days above the mean are labelled “Bad”, and all other days “Good”.

# Mean ozone over the cleaned data serves as the cut-off
ozone_mean <- mean(clean$Ozone)
clean$Exposure_levels <- as.factor(ifelse(clean$Ozone > ozone_mean, "Bad", "Good"))
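A quick check of how the days split between the two classes can confirm that the classification behaves as intended. The following is a sketch under the assumption that base R `na.omit()` yields the same rows as `drop_na()`; `clean_check`, `cutoff`, `exposure`, and `exposure_counts` are illustrative names.

```r
# Rebuild the cleaned data with base R so the sketch is self-contained.
clean_check <- na.omit(airquality)

# Same rule as above: days above the mean are "Bad", the rest "Good".
cutoff <- mean(clean_check$Ozone)
exposure <- factor(ifelse(clean_check$Ozone > cutoff, "Bad", "Good"))

# Counts and proportions of days in each class.
exposure_counts <- table(exposure)
print(exposure_counts)
print(round(prop.table(exposure_counts), 2))
```

Because ozone is right-skewed, a mean cut-off will not necessarily split the days evenly; a median cut-off would be one alternative if balanced classes were wanted.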

Analysis

We run a linear regression to examine how solar radiation, wind speed, temperature, and month predict ozone levels.

model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = clean)
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.344 -13.495  -3.165  10.399  92.689 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -74.23481   26.10184  -2.844  0.00537 ** 
## Solar.R       0.05222    0.02367   2.206  0.02957 *  
## Wind         -3.10872    0.66009  -4.710 7.78e-06 ***
## Temp          1.87511    0.34073   5.503 2.74e-07 ***
## Month6      -14.75895    9.12269  -1.618  0.10876    
## Month7       -8.74861    7.82906  -1.117  0.26640    
## Month8       -4.19654    8.14693  -0.515  0.60758    
## Month9      -15.96728    6.65561  -2.399  0.01823 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.72 on 103 degrees of freedom
## Multiple R-squared:  0.6369, Adjusted R-squared:  0.6122 
## F-statistic: 25.81 on 7 and 103 DF,  p-value: < 2.2e-16
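Point estimates and p-values do not show how precisely each coefficient is estimated. One way to supplement the summary is with 95% confidence intervals from `confint()`. This is a sketch rather than part of the original analysis: it refits the same formula on data rebuilt with `na.omit()`, and `clean_ci`, `fit`, and `ci` are illustrative names.

```r
# Rebuild the cleaned data and refit the same model (base R only).
clean_ci <- na.omit(airquality)
clean_ci$Month <- as.factor(clean_ci$Month)
fit <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = clean_ci)

# 95% confidence intervals for every coefficient in the model.
ci <- confint(fit, level = 0.95)
print(round(ci, 3))
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level in the summary table.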

Findings

The regression model explains 63.7% of the variation in ozone levels (adjusted R-squared: 0.61), and the overall F-test is highly significant (p < 0.001). Temperature has the largest t-statistic of any predictor: each additional degree Fahrenheit is associated with 1.88 ppb more ozone, holding the other variables constant (p < 0.001). This aligns with the well-established chemistry of ground-level ozone formation, which accelerates in heat. Wind has a strong negative association: each additional mile per hour of wind speed is associated with 3.11 ppb less ozone (p < 0.001), consistent with the dispersal effect of wind on atmospheric pollutants. Solar radiation has a smaller but statistically significant positive coefficient (0.05 ppb per langley, p = 0.03), consistent with sunlight contributing to ozone formation even after controlling for temperature.

Monthly variation is only partly significant. September shows about 16 ppb less ozone than the baseline month of May (p = 0.02), but June, July, and August do not differ significantly from May after controlling for solar radiation, wind, and temperature. This suggests that much of the seasonal pattern is captured by the weather variables.
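The individual month coefficients compare each month with May, so they do not answer whether Month as a whole improves the model. A nested-model F-test is one way to check this. The sketch below is not part of the original analysis; it rebuilds the data with `na.omit()`, and `clean_f`, `full`, `reduced`, and `month_test` are illustrative names.

```r
# Rebuild the cleaned data (base R only).
clean_f <- na.omit(airquality)
clean_f$Month <- as.factor(clean_f$Month)

# Full model as in the analysis, and a reduced model without Month.
full    <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = clean_f)
reduced <- lm(Ozone ~ Solar.R + Wind + Temp, data = clean_f)

# F-test of the four month coefficients jointly.
month_test <- anova(reduced, full)
print(month_test)
```

A small p-value in the second row would indicate that the month terms jointly add explanatory power beyond the three weather variables.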

Limitations

This analysis uses a small historical dataset from a single city (153 observations, 111 after dropping rows with missing values), so the findings should not be generalised beyond this context. Additionally, dropping rows with NA values rather than imputing them may introduce bias if the data are not missing completely at random. Future work could explore multiple imputation methods and a larger, more recent dataset.
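One informal way to probe whether missingness is related to an observed variable is to tabulate complete and incomplete rows by month. This is only a rough check under the assumption that month is a useful proxy, not a formal test; `complete_row` and `missing_by_month` are illustrative names.

```r
# Flag rows of the raw data that have no missing values.
complete_row <- complete.cases(airquality)

# Cross-tabulate month against completeness (columns: FALSE, TRUE).
missing_by_month <- table(Month = airquality$Month, Complete = complete_row)
print(missing_by_month)
```

If the incomplete rows cluster heavily in one month, listwise deletion changes the seasonal mix of the sample, which would strengthen the case for imputation.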