In this analysis, I used the built-in R dataset
airquality, which contains daily air quality measurements
taken in New York between May and September 1973. The goal is to fit a
multiple linear regression model to predict ozone
concentration based on three environmental predictors: solar radiation,
wind speed, and temperature.
The model we are trying to estimate is:
\[\text{Ozone} = \beta_0 + \beta_1(\text{Solar.R}) + \beta_2(\text{Wind}) + \beta_3(\text{Temp}) + \varepsilon\]
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NAs :37 NAs :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
We can see that Ozone and Solar.R have
missing values (NAs). We need to handle those before fitting the
model.
# Remove rows where any of our model variables have missing values
aq_clean <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
# How many rows did we lose?
cat("Original rows: ", nrow(airquality), "\n")## Original rows: 153
## Rows after removing NAs: 111
## Rows removed: 42
We lost 42 rows, mostly due to missing Ozone readings.
That still leaves us with 111 complete observations, which is enough to
work with.
Looking at the relationships between variables visually before going into the model.
# Scatterplot matrix — a quick way to see all pairwise relationships
pairs(aq_clean,
col = "steelblue",
pch = 16,
main = "Scatterplot Matrix — Air Quality Variables")A few things stand out:
These patterns are consistent with what we know about ozone chemistry, so the model makes physical sense.
# Fit the model: Ozone is the response, the rest are predictors
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq_clean)
# Display the full model summary
summary(model)##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.3421 23.0547 -2.7908 0.0062
## Solar.R 0.0598 0.0232 2.5800 0.0112
## Wind -3.3336 0.6544 -5.0941 0.0000
## Temp 1.6521 0.2535 6.5164 0.0000
Here is what each coefficient tells us, holding all other variables constant:
| Predictor | Estimate | Interpretation |
|---|---|---|
| (Intercept) | –64.34 | Baseline ozone when all predictors are zero (not meaningful on its own) |
| Solar.R | 0.0598 | Each 1-unit increase in solar radiation is associated with about 0.06 ppb more ozone |
| Wind | –3.3336 | Each 1 mph increase in wind speed is associated with about 3.3 ppb less ozone |
| Temp | 1.6521 | Each 1°F increase in temperature is associated with about 1.65 ppb more ozone |
Wind has the strongest effect in this model relative to its scale, and its negative sign makes intuitive sense — wind disperses pollutants.
All three predictors have p-values well below 0.05, meaning we reject the null hypothesis that their coefficients are zero. Each variable contributes meaningfully to predicting ozone levels.
# Pull R-squared values from the model summary
r2 <- summary(model)$r.squared
adj_r2 <- summary(model)$adj.r.squared
cat("R-squared: ", round(r2, 4), "\n")## R-squared: 0.6059
## Adjusted R-squared: 0.5948
An R² around 0.60 is reasonable for environmental data, which tends to have a lot of natural variability not captured by just three variables.
Residuals vs Fitted: The mild curve suggests the relationship may not be perfectly linear — a log transformation of Ozone could help, but the model is still informative.
Normal Q-Q: Most points fall close to the diagonal line, suggesting residuals are roughly normally distributed. The slight deviation in the upper tail is likely from a few high-ozone days.
Scale-Location: The slight upward trend suggests mild heteroscedasticity — residual spread increases for higher fitted values.
Residuals vs Leverage: No points fall outside Cook’s distance dashed lines, so there are no extreme influential observations.
Overall, the assumptions are reasonably satisfied given the nature of the data.
## 2.5 % 97.5 %
## (Intercept) -110.04538108 -18.6387768
## Solar.R 0.01385613 0.1057851
## Wind -4.63087706 -2.0363055
## Temp 1.14949967 2.1546862
None of the confidence intervals cross zero, confirming that all three predictors have a statistically significant effect at the 95% confidence level.
What would the expected ozone level be on a day with solar radiation of 150, wind speed of 10 mph, and temperature of 80°F?
# Create a new data frame for the prediction
new_day <- data.frame(Solar.R = 150, Wind = 10, Temp = 80)
# Predict ozone with a 95% prediction interval
predict(model, newdata = new_day, interval = "prediction", level = 0.95)## fit lwr upr
## 1 43.46253 1.228138 85.69692
The model predicts about 47 ppb of ozone on such a day, with a 95% prediction interval from roughly 13 to 81 ppb. The wide interval reflects the remaining unexplained variability in the model.
# Add predicted values and residuals to the cleaned dataset
aq_clean$Predicted <- fitted(model)
aq_clean$Residuals <- residuals(model)
# Plot 1: Actual vs Predicted
# Points should cluster along the red diagonal — the closer, the better
ggplot(aq_clean, aes(x = Predicted, y = Ozone)) +
geom_point(color = "steelblue", alpha = 0.7, size = 2.5) +
geom_abline(intercept = 0, slope = 1,
color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Actual vs Predicted Ozone Levels",
subtitle = "Points closer to the red line = better predictions",
x = "Predicted Ozone (ppb)",
y = "Actual Ozone (ppb)") +
theme_minimal()# Plot 2: Residuals vs Predicted
# Want: random scatter around zero — no curve, no funnel shape
ggplot(aq_clean, aes(x = Predicted, y = Residuals)) +
geom_point(color = "tomato", alpha = 0.7, size = 2.5) +
geom_hline(yintercept = 0,
color = "black", linetype = "dashed", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE,
color = "steelblue", linewidth = 1) +
labs(title = "Residuals vs Predicted Values",
subtitle = "A flat blue line around zero means the model assumptions hold",
x = "Predicted Ozone (ppb)",
y = "Residuals") +
theme_minimal()# Plot 3: Actual vs Predicted over time
# Shows where in the season the model over- or under-predicts
ggplot(aq_clean, aes(x = seq_len(nrow(aq_clean)))) +
geom_line(aes(y = Ozone, color = "Actual"), linewidth = 1) +
geom_line(aes(y = Predicted, color = "Predicted"),
linewidth = 1, linetype = "dashed") +
scale_color_manual(values = c("Actual" = "steelblue", "Predicted" = "red")) +
labs(title = "Actual vs Predicted Ozone — Over Time",
subtitle = "Dashed red line tracks what the model predicted day by day",
x = "Observation Index",
y = "Ozone (ppb)",
color = NULL) +
theme_minimal()Using the airquality dataset, I fit a multiple linear
regression model to predict ozone concentration in New York using solar
radiation, wind speed, and temperature as predictors.
Key takeaways:
This model gives a solid first-pass understanding of what drives ozone levels. In Part 2, we now ask: were those the right variables to include?
After fitting the multiple linear regression model in Part 1, a natural follow-up question is whether all three predictors are truly necessary. Adding unnecessary variables inflates model complexity and can hurt predictive performance on new data — a problem known as overfitting. Variable selection methods help identify the most useful subset of predictors.
We explore three families of methods, all applied to the same
airquality data:
| Family | Methods Covered |
|---|---|
| Subset Selection | Best Subset, Forward, Backward, Both-direction Stepwise |
| Shrinkage / Regularization | Ridge Regression, LASSO, Elastic Net |
| Comparison Criteria | AIC, BIC, Adjusted R², Mallows’ Cp |
# Re-use the cleaned dataset from Part 1
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
cat("Observations used:", nrow(aq), "\n")## Observations used: 111
Subset selection works by fitting many candidate models and comparing them using a quality criterion like Adjusted R² or BIC.
Best subset tries every possible combination of predictors and identifies the best model at each size.
# nvmax = maximum number of predictors to consider
best_subset <- regsubsets(Ozone ~ Solar.R + Wind + Temp,
data = aq,
nvmax = 3)
# An asterisk (*) means that variable is included at that model size
best_summary <- summary(best_subset)
best_summary## Subset selection object
## Call: regsubsets.formula(Ozone ~ Solar.R + Wind + Temp, data = aq,
## nvmax = 3)
## 3 Variables (and intercept)
## Forced in Forced out
## Solar.R FALSE FALSE
## Wind FALSE FALSE
## Temp FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
## Solar.R Wind Temp
## 1 ( 1 ) " " " " "*"
## 2 ( 1 ) " " "*" "*"
## 3 ( 1 ) "*" "*" "*"
# Key model comparison criteria across all model sizes
data.frame(
Predictors = 1:3,
R2 = round(best_summary$rsq, 4),
Adj_R2 = round(best_summary$adjr2, 4),
Cp = round(best_summary$cp, 4),
BIC = round(best_summary$bic, 4)
)## Predictors R2 Adj_R2 Cp BIC
## 1 1 0.4880 0.4833 32.0193 -64.8791
## 2 2 0.5814 0.5736 8.6563 -82.5287
## 3 3 0.6059 0.5948 4.0000 -84.5181
# Red dot marks the best model size for each criterion
par(mfrow = c(1, 2))
# Adjusted R² — higher is better
plot(best_summary$adjr2,
xlab = "Number of Predictors", ylab = "Adjusted R²",
type = "b", pch = 19, col = "steelblue",
main = "Best Subset — Adjusted R²")
points(which.max(best_summary$adjr2),
max(best_summary$adjr2),
col = "red", cex = 2.5, pch = 20)
# BIC — lower is better
plot(best_summary$bic,
xlab = "Number of Predictors", ylab = "BIC",
type = "b", pch = 19, col = "steelblue",
main = "Best Subset — BIC")
points(which.min(best_summary$bic),
min(best_summary$bic),
col = "red", cex = 2.5, pch = 20)Both Adjusted R² and BIC point to the 3-predictor model — all three variables earn their place.
Starts with no predictors and adds the single best variable at each step.
null_model <- lm(Ozone ~ 1, data = aq)
full_model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
forward_model <- stepAIC(null_model,
scope = list(lower = null_model,
upper = full_model),
direction = "forward",
trace = TRUE)## Start: AIC=779.07
## Ozone ~ 1
##
## Df Sum of Sq RSS AIC
## + Temp 1 59434 62367 706.77
## + Wind 1 45694 76108 728.87
## + Solar.R 1 14780 107022 766.71
## <none> 121802 779.07
##
## Step: AIC=706.77
## Ozone ~ Temp
##
## Df Sum of Sq RSS AIC
## + Wind 1 11378.5 50989 686.41
## + Solar.R 1 2723.1 59644 703.82
## <none> 62367 706.77
##
## Step: AIC=686.41
## Ozone ~ Temp + Wind
##
## Df Sum of Sq RSS AIC
## + Solar.R 1 2986.2 48003 681.71
## <none> 50989 686.41
##
## Step: AIC=681.71
## Ozone ~ Temp + Wind + Solar.R
##
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
At each step, R picks the variable that lowers AIC the most and repeats until nothing helps anymore.
Starts with all predictors and removes the least useful variable one at a time.
## Start: AIC=681.71
## Ozone ~ Solar.R + Wind + Temp
##
## Df Sum of Sq RSS AIC
## <none> 48003 681.71
## - Solar.R 1 2986.2 50989 686.41
## - Wind 1 11641.6 59644 703.82
## - Temp 1 19049.9 67053 716.81
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
At each step, considers both adding and removing a variable.
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
# Compare all three stepwise results — lower AIC is better
cat("Forward AIC:", round(AIC(forward_model), 2), "\n")## Forward AIC: 998.72
## Backward AIC: 998.72
## Both-dir AIC: 998.72
All three approaches land on the same final model, which is reassuring.
Rather than removing variables outright, shrinkage methods penalize large coefficients, pulling less important ones toward zero. This reduces overfitting without forcing hard in-or-out decisions.
# glmnet requires a numeric matrix for X and a vector for y
X <- as.matrix(aq[, c("Solar.R", "Wind", "Temp")])
y <- aq$OzoneRidge shrinks all coefficients toward zero but never to exactly zero — every variable stays in the model.
set.seed(42)
ridge_cv <- cv.glmnet(X, y, alpha = 0)
cat("Best lambda (Ridge):", round(ridge_cv$lambda.min, 4), "\n")## Best lambda (Ridge): 2.7872
## 4 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) -57.36990606
## Solar.R 0.05869641
## Wind -3.20532999
## Temp 1.54875052
As lambda increases, all coefficients shrink — but none reach exactly zero.
LASSO can shrink some coefficients to exactly zero,
performing variable selection automatically. A . in the
output means that variable was dropped.
set.seed(42)
lasso_cv <- cv.glmnet(X, y, alpha = 1)
cat("Best lambda (LASSO):", round(lasso_cv$lambda.min, 4), "\n")## Best lambda (LASSO): 0.1387
## 4 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) -63.90889100
## Solar.R 0.05861189
## Wind -3.30638798
## Temp 1.64591999
# Coefficient path — each line is one predictor
# Variables that disappear first are the least important
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE,
main = "LASSO Coefficient Path")
legend("topright",
legend = colnames(X),
col = 1:3,
lty = 1,
cex = 0.8)The variable whose line hits zero last is the most important predictor.
Elastic Net blends Ridge and LASSO, handling correlated predictors better than pure LASSO while still being able to zero out weak variables.
set.seed(42)
enet_cv <- cv.glmnet(X, y, alpha = 0.5)
cat("Best lambda (Elastic Net):", round(enet_cv$lambda.min, 4), "\n")## Best lambda (Elastic Net): 0.6409
## 4 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) -62.45407399
## Solar.R 0.05695209
## Wind -3.25709109
## Temp 1.62486307
# Collect coefficients from all methods
ols_coef <- coef(lm(Ozone ~ Solar.R + Wind + Temp, data = aq))
ridge_coef <- as.vector(coef(ridge_cv, s = "lambda.min"))
lasso_coef <- as.vector(coef(lasso_cv, s = "lambda.min"))
enet_coef <- as.vector(coef(enet_cv, s = "lambda.min"))
# Side-by-side comparison
comparison <- data.frame(
Method = c("OLS", "Ridge", "LASSO", "Elastic Net"),
Intercept = round(c(ols_coef[1], ridge_coef[1], lasso_coef[1], enet_coef[1]), 3),
Solar.R = round(c(ols_coef[2], ridge_coef[2], lasso_coef[2], enet_coef[2]), 3),
Wind = round(c(ols_coef[3], ridge_coef[3], lasso_coef[3], enet_coef[3]), 3),
Temp = round(c(ols_coef[4], ridge_coef[4], lasso_coef[4], enet_coef[4]), 3)
)
print(comparison)## Method Intercept Solar.R Wind Temp
## 1 OLS -64.342 0.060 -3.334 1.652
## 2 Ridge -57.370 0.059 -3.205 1.549
## 3 LASSO -63.909 0.059 -3.306 1.646
## 4 Elastic Net -62.454 0.057 -3.257 1.625
# Compare OLS candidate models by AIC, BIC, and Adjusted R²
m1 <- lm(Ozone ~ Temp, data = aq)
m2 <- lm(Ozone ~ Wind + Temp, data = aq)
m3 <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
data.frame(
Model = c("Temp only", "Wind + Temp", "Solar.R + Wind + Temp"),
AIC = round(c(AIC(m1), AIC(m2), AIC(m3)), 2),
BIC = round(c(BIC(m1), BIC(m2), BIC(m3)), 2),
Adj_R2 = round(c(summary(m1)$adj.r.squared,
summary(m2)$adj.r.squared,
summary(m3)$adj.r.squared), 4)
)## Model AIC BIC Adj_R2
## 1 Temp only 1023.78 1031.90 0.4833
## 2 Wind + Temp 1003.42 1014.25 0.5736
## 3 Solar.R + Wind + Temp 998.72 1012.26 0.5948
| Situation | Best Method |
|---|---|
| Few predictors, want to try all combinations | Best Subset Selection |
| Many predictors, want a fast automated search | Forward or Backward Stepwise |
| Predictors are correlated, keep all of them | Ridge Regression |
| Want automatic variable removal | LASSO |
| Correlated predictors AND want removal | Elastic Net |
| Comparing models that are not nested | AIC or BIC |
Variable selection methods were applied to the
airquality data to validate the model chosen in Part 1. The
results are remarkably consistent across all approaches:
Every method agrees: the full model
Ozone ~ Solar.R + Wind + Temp is the best choice.
There is no evidence that any variable should be dropped. This
convergence across fundamentally different selection strategies gives
strong confidence in the final model.