This document outlines the process of data cleaning and regression
analysis using the airquality dataset available in base R.
The dataset contains daily air quality measurements in New York from May
to September 1973, including ozone levels, solar radiation, wind speed,
and temperature.
First, we inspect the structure of the raw dataset and look for missing values before any cleaning.
library(dplyr) # provides glimpse()
glimpse(airquality)
## Rows: 153
## Columns: 6
## $ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
## $ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
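The truncated preview above shows only the first few values per column, so it cannot reveal how many values are missing. As a minimal sketch using base R only, the missing values can be counted directly; the names `na_counts` and `n_complete` are illustrative and not part of the original analysis.

```r
# Count missing values in each column of the raw data.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Count the rows with no missing value in any column.
# This is the number of rows that survive a complete-case deletion.
n_complete <- sum(complete.cases(airquality))
print(n_complete)
```

The second number should match the row count reported by `str()` after the incomplete rows are dropped.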
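The output shows NA values in Ozone and Solar.R, so we drop every row that contains a missing value. This leaves 111 of the 153 observations.
library(tidyr) # provides drop_na()
clean <- drop_na(airquality)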
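Month is stored as an integer. Left that way, the regression would treat it as a continuous variable and fit a single linear trend across months. To have it treated as a categorical variable with a separate coefficient for each month, we convert it to a factor.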
clean$Month <- as.factor(clean$Month)
str(clean)
## 'data.frame': 111 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 23 19 8 16 11 14 ...
## $ Solar.R: int 190 118 149 313 299 99 19 256 290 274 ...
## $ Wind : num 7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
## $ Temp : int 67 72 74 62 65 59 61 69 66 68 ...
## $ Month : Factor w/ 5 levels "5","6","7","8",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 7 8 9 12 13 14 ...
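To illustrate what the conversion changes, the sketch below shows the dummy coding that `lm()` would apply to the factor by default. It rebuilds the cleaned data with base R (`na.omit()` in place of `drop_na()`) so that it runs on its own; `clean_demo` and `month_design` are illustrative names.

```r
# Rebuild the cleaned data with base R so the example is self-contained.
clean_demo <- na.omit(airquality)
clean_demo$Month <- as.factor(clean_demo$Month)

# The first level ("5", i.e. May) becomes the reference category.
print(levels(clean_demo$Month))

# Default treatment contrasts: one 0/1 indicator per non-reference month.
print(contrasts(clean_demo$Month))

# Columns that a model formula generates for Month.
month_design <- model.matrix(~ Month, data = clean_demo)
print(colnames(month_design))
```

Each indicator column corresponds to one `Month` coefficient in a regression, measured relative to May.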
We create a binary variable classifying each day’s ozone level as “Good” or “Bad” relative to the dataset mean.
ozone_mean <- mean(clean$Ozone) # threshold: mean ozone across the cleaned data
clean$Exposure_levels <- as.factor(ifelse(clean$Ozone > ozone_mean, "Bad", "Good"))
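As a quick check on a classification like this one, it can help to tabulate how many days fall into each class. The sketch below is one way to do that under the same mean-based threshold; `clean_demo`, `threshold`, and `exposure` are illustrative names, and the explicit level order is a choice made here rather than something the original code does.

```r
# Rebuild the cleaned data with base R so the example is self-contained.
clean_demo <- na.omit(airquality)

# Same rule as above: days above the mean ozone level are "Bad".
threshold <- mean(clean_demo$Ozone)
exposure <- factor(
  ifelse(clean_demo$Ozone > threshold, "Bad", "Good"),
  levels = c("Good", "Bad") # explicit order; as.factor() would sort alphabetically
)

# Number and share of days in each class.
print(table(exposure))
print(prop.table(table(exposure)))
```

Setting the levels explicitly makes "Good" the reference category if the variable is later used in a model. With `as.factor()`, "Bad" comes first because levels are sorted alphabetically.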
We run a linear regression to examine how solar radiation, wind speed, temperature, and month predict ozone levels.
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = clean)
summary(model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.344 -13.495 -3.165 10.399 92.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -74.23481 26.10184 -2.844 0.00537 **
## Solar.R 0.05222 0.02367 2.206 0.02957 *
## Wind -3.10872 0.66009 -4.710 7.78e-06 ***
## Temp 1.87511 0.34073 5.503 2.74e-07 ***
## Month6 -14.75895 9.12269 -1.618 0.10876
## Month7 -8.74861 7.82906 -1.117 0.26640
## Month8 -4.19654 8.14693 -0.515 0.60758
## Month9 -15.96728 6.65561 -2.399 0.01823 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.72 on 103 degrees of freedom
## Multiple R-squared: 0.6369, Adjusted R-squared: 0.6122
## F-statistic: 25.81 on 7 and 103 DF, p-value: < 2.2e-16
The regression model explains 63.7% of the variation in ozone levels (adjusted R-squared: 0.61), which is a strong fit for environmental data of this kind.
Temperature is the most statistically significant predictor: holding the other variables constant, each one-degree Fahrenheit increase is associated with 1.88 additional ppb of ozone (p < 0.001). This aligns with the well-established chemistry of ground-level ozone formation, which accelerates in heat.
Wind has a strong negative association with ozone: each additional mile per hour of wind speed is associated with 3.11 ppb less ozone (p < 0.001). This is consistent with the dispersal effect of wind on atmospheric pollutants.
Solar radiation has a smaller but statistically significant positive association (0.05 ppb per langley, p = 0.03), which is consistent with sunlight contributing to ozone formation even after controlling for temperature.
Only one of the month effects is individually significant. September shows about 16 ppb lower ozone than the baseline month of May (p = 0.02), but June, July, and August do not differ significantly from May after controlling for solar radiation, wind, and temperature. This suggests that much of the seasonal pattern is captured by those variables.
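The point estimates above can be complemented with interval estimates and with a joint test of the month effect, because the individual `Month` coefficients only compare each month with May. The following is a sketch under the same model specification, not part of the original analysis; `clean_demo`, `full_model`, `reduced_model`, `ci`, and `month_test` are illustrative names.

```r
# Rebuild the cleaned data with base R so the example is self-contained.
clean_demo <- na.omit(airquality)
clean_demo$Month <- as.factor(clean_demo$Month)

# Same specification as the model above, plus a version without Month.
full_model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = clean_demo)
reduced_model <- lm(Ozone ~ Solar.R + Wind + Temp, data = clean_demo)

# 95% confidence intervals for every coefficient.
ci <- confint(full_model, level = 0.95)
print(ci)

# Nested-model F-test: do the four Month indicators jointly improve the fit?
month_test <- anova(reduced_model, full_model)
print(month_test)
```

The F-test treats the four month indicators as a single block, so its result does not depend on which month is chosen as the baseline.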
This analysis uses a small historical dataset from a single city (153 observations, reduced to 111 after dropping rows with missing values). The findings should not be generalised beyond this context. Additionally, dropping incomplete rows rather than imputing the missing values may introduce bias if the values are not missing completely at random. Future work could explore multiple imputation methods and a larger, more recent dataset.
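One informal way to probe the concern about non-random missingness is to compare the dropped rows with the retained ones on the variables that are fully observed. The sketch below uses base R only and is a descriptive check rather than a formal test; `dropped`, `dropped_by_month`, `mean_temp`, and `mean_wind` are illustrative names.

```r
# Flag the rows that a complete-case deletion removes.
dropped <- !complete.cases(airquality)

# How the dropped rows are spread across months.
dropped_by_month <- table(Month = airquality$Month, Dropped = dropped)
print(dropped_by_month)

# Mean temperature and wind speed for retained (FALSE) and dropped (TRUE) rows.
mean_temp <- tapply(airquality$Temp, dropped, mean)
mean_wind <- tapply(airquality$Wind, dropped, mean)
print(mean_temp)
print(mean_wind)
```

Large differences between the two groups, or dropped rows concentrated in particular months, would suggest that the retained sample is not representative of the full period.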