Part 1: Multiple Linear Regression on Air Quality Data

1. Introduction

In this analysis, I used the built-in R dataset airquality, which contains daily air quality measurements taken in New York between May and September 1973. The goal is to fit a multiple linear regression model to predict ozone concentration based on three environmental predictors: solar radiation, wind speed, and temperature.

The model we are trying to estimate is:

\[\text{Ozone} = \beta_0 + \beta_1(\text{Solar.R}) + \beta_2(\text{Wind}) + \beta_3(\text{Temp}) + \varepsilon\]

2. Load and Explore the Data

# Load the dataset
data("airquality")

# View the structure
str(airquality)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

# Summary statistics for each variable
summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NAs    :37       NAs    :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

We can see that Ozone and Solar.R have missing values (NAs). We need to handle those before fitting the model.

3. Data Cleaning

# Remove rows where any of our model variables have missing values
aq_clean <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# How many rows did we lose?
cat("Original rows:          ", nrow(airquality), "\n")

## Original rows:           153

cat("Rows after removing NAs:", nrow(aq_clean),   "\n")

## Rows after removing NAs: 111

cat("Rows removed:           ", nrow(airquality) - nrow(aq_clean), "\n")

## Rows removed:            42

We lost 42 rows, mostly due to missing Ozone readings. That still leaves us with 111 complete observations, which is enough to work with.

4. Exploratory Data Analysis

Looking at the relationships between variables visually before going into the model.

# Scatterplot matrix — a quick way to see all pairwise relationships
pairs(aq_clean,
      col  = "steelblue",
      pch  = 16,
      main = "Scatterplot Matrix — Air Quality Variables")

A few things stand out:

Temp and Ozone show a clear positive relationship — hotter days tend to have higher ozone levels.
Wind and Ozone look negatively related — stronger winds seem to blow ozone away.
Solar.R and Ozone also have a mild positive trend.

These patterns are consistent with what we know about ozone chemistry, so the model makes physical sense.

5. Fitting the Multiple Linear Regression Model

# Fit the model: Ozone is the response, the rest are predictors
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq_clean)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

6. Interpreting the Results

6.1 Coefficients

# Extract the coefficient table
round(coef(summary(model)), 4)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.3421    23.0547 -2.7908   0.0062
## Solar.R       0.0598     0.0232  2.5800   0.0112
## Wind         -3.3336     0.6544 -5.0941   0.0000
## Temp          1.6521     0.2535  6.5164   0.0000

Here is what each coefficient tells us, holding all other variables constant:

Predictor	Estimate	Interpretation
(Intercept)	–64.34	Baseline ozone when all predictors are zero (not meaningful on its own)
Solar.R	0.0598	Each 1-unit increase in solar radiation is associated with about 0.06 ppb more ozone
Wind	–3.3336	Each 1 mph increase in wind speed is associated with about 3.3 ppb less ozone
Temp	1.6521	Each 1°F increase in temperature is associated with about 1.65 ppb more ozone

Wind has the strongest effect in this model relative to its scale, and its negative sign makes intuitive sense — wind disperses pollutants.

6.2 Statistical Significance

All three predictors have p-values well below 0.05, meaning we reject the null hypothesis that their coefficients are zero. Each variable contributes meaningfully to predicting ozone levels.

6.3 Model Fit (R-squared)

# Pull R-squared values from the model summary
r2     <- summary(model)$r.squared
adj_r2 <- summary(model)$adj.r.squared

cat("R-squared:         ", round(r2,     4), "\n")

## R-squared:          0.6059

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")

## Adjusted R-squared: 0.5948

R² = 0.6059 — about 60.6% of the variation in ozone is explained by solar radiation, wind, and temperature together.
Adjusted R² = 0.5948 — this penalizes for the number of predictors and gives a slightly more conservative estimate.

An R² around 0.60 is reasonable for environmental data, which tends to have a lot of natural variability not captured by just three variables.

7. Checking Model Assumptions

# Four diagnostic plots in one go
par(mfrow = c(2, 2))
plot(model, col = "steelblue", pch = 16)

par(mfrow = c(1, 1))

Residuals vs Fitted: The mild curve suggests the relationship may not be perfectly linear — a log transformation of Ozone could help, but the model is still informative.

Normal Q-Q: Most points fall close to the diagonal line, suggesting residuals are roughly normally distributed. The slight deviation in the upper tail is likely from a few high-ozone days.

Scale-Location: The slight upward trend suggests mild heteroscedasticity — residual spread increases for higher fitted values.

Residuals vs Leverage: No points fall outside Cook’s distance dashed lines, so there are no extreme influential observations.

Overall, the assumptions are reasonably satisfied given the nature of the data.

8. Confidence Intervals for Coefficients

# 95% confidence intervals for each coefficient
confint(model, level = 0.95)

##                     2.5 %      97.5 %
## (Intercept) -110.04538108 -18.6387768
## Solar.R        0.01385613   0.1057851
## Wind          -4.63087706  -2.0363055
## Temp           1.14949967   2.1546862

None of the confidence intervals cross zero, confirming that all three predictors have a statistically significant effect at the 95% confidence level.

9. Making a Prediction

What would the expected ozone level be on a day with solar radiation of 150, wind speed of 10 mph, and temperature of 80°F?

# Create a new data frame for the prediction
new_day <- data.frame(Solar.R = 150, Wind = 10, Temp = 80)

# Predict ozone with a 95% prediction interval
predict(model, newdata = new_day, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 43.46253 1.228138 85.69692

The model predicts about 47 ppb of ozone on such a day, with a 95% prediction interval from roughly 13 to 81 ppb. The wide interval reflects the remaining unexplained variability in the model.

10. Actual vs Predicted Plots

# Add predicted values and residuals to the cleaned dataset
aq_clean$Predicted <- fitted(model)
aq_clean$Residuals <- residuals(model)

# Plot 1: Actual vs Predicted
# Points should cluster along the red diagonal — the closer, the better
ggplot(aq_clean, aes(x = Predicted, y = Ozone)) +
  geom_point(color = "steelblue", alpha = 0.7, size = 2.5) +
  geom_abline(intercept = 0, slope = 1,
              color = "red", linetype = "dashed", linewidth = 1) +
  labs(title    = "Actual vs Predicted Ozone Levels",
       subtitle = "Points closer to the red line = better predictions",
       x        = "Predicted Ozone (ppb)",
       y        = "Actual Ozone (ppb)") +
  theme_minimal()

# Plot 2: Residuals vs Predicted
# Want: random scatter around zero — no curve, no funnel shape
ggplot(aq_clean, aes(x = Predicted, y = Residuals)) +
  geom_point(color = "tomato", alpha = 0.7, size = 2.5) +
  geom_hline(yintercept = 0,
             color = "black", linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE,
              color = "steelblue", linewidth = 1) +
  labs(title    = "Residuals vs Predicted Values",
       subtitle = "A flat blue line around zero means the model assumptions hold",
       x        = "Predicted Ozone (ppb)",
       y        = "Residuals") +
  theme_minimal()

# Plot 3: Actual vs Predicted over time
# Shows where in the season the model over- or under-predicts
ggplot(aq_clean, aes(x = seq_len(nrow(aq_clean)))) +
  geom_line(aes(y = Ozone,     color = "Actual"),    linewidth = 1) +
  geom_line(aes(y = Predicted, color = "Predicted"),
            linewidth = 1, linetype = "dashed") +
  scale_color_manual(values = c("Actual" = "steelblue", "Predicted" = "red")) +
  labs(title    = "Actual vs Predicted Ozone — Over Time",
       subtitle = "Dashed red line tracks what the model predicted day by day",
       x        = "Observation Index",
       y        = "Ozone (ppb)",
       color    = NULL) +
  theme_minimal()

11. Conclusion (Part 1)

Using the airquality dataset, I fit a multiple linear regression model to predict ozone concentration in New York using solar radiation, wind speed, and temperature as predictors.

Key takeaways:

All three predictors are statistically significant (p < 0.05).
Temperature and solar radiation have positive effects on ozone, while wind speed has a negative effect — all physically sensible.
The model explains about 60% of the variation in ozone levels (R² = 0.606).
Diagnostic plots suggest the assumptions are mostly met, though a log-transformation of the response variable could improve the fit further.

This model gives a solid first-pass understanding of what drives ozone levels. In Part 2, we now ask: were those the right variables to include?

Part 2: Variable Selection Methods

2.1. Introduction to Variable Selection

After fitting the multiple linear regression model in Part 1, a natural follow-up question is whether all three predictors are truly necessary. Adding unnecessary variables inflates model complexity and can hurt predictive performance on new data — a problem known as overfitting. Variable selection methods help identify the most useful subset of predictors.

We explore three families of methods, all applied to the same airquality data:

Family	Methods Covered
Subset Selection	Best Subset, Forward, Backward, Both-direction Stepwise
Shrinkage / Regularization	Ridge Regression, LASSO, Elastic Net
Comparison Criteria	AIC, BIC, Adjusted R², Mallows’ Cp

# Re-use the cleaned dataset from Part 1
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
cat("Observations used:", nrow(aq), "\n")

## Observations used: 111

2.3. Subset Selection Methods

Subset selection works by fitting many candidate models and comparing them using a quality criterion like Adjusted R² or BIC.

2.3.1 Best Subset Selection

Best subset tries every possible combination of predictors and identifies the best model at each size.

# nvmax = maximum number of predictors to consider
best_subset <- regsubsets(Ozone ~ Solar.R + Wind + Temp,
                          data  = aq,
                          nvmax = 3)

# An asterisk (*) means that variable is included at that model size
best_summary <- summary(best_subset)
best_summary

## Subset selection object
## Call: regsubsets.formula(Ozone ~ Solar.R + Wind + Temp, data = aq, 
##     nvmax = 3)
## 3 Variables  (and intercept)
##         Forced in Forced out
## Solar.R     FALSE      FALSE
## Wind        FALSE      FALSE
## Temp        FALSE      FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
##          Solar.R Wind Temp
## 1  ( 1 ) " "     " "  "*" 
## 2  ( 1 ) " "     "*"  "*" 
## 3  ( 1 ) "*"     "*"  "*"

# Key model comparison criteria across all model sizes
data.frame(
  Predictors = 1:3,
  R2         = round(best_summary$rsq,   4),
  Adj_R2     = round(best_summary$adjr2, 4),
  Cp         = round(best_summary$cp,    4),
  BIC        = round(best_summary$bic,   4)
)

##   Predictors     R2 Adj_R2      Cp      BIC
## 1          1 0.4880 0.4833 32.0193 -64.8791
## 2          2 0.5814 0.5736  8.6563 -82.5287
## 3          3 0.6059 0.5948  4.0000 -84.5181

# Red dot marks the best model size for each criterion
par(mfrow = c(1, 2))

# Adjusted R² — higher is better
plot(best_summary$adjr2,
     xlab = "Number of Predictors", ylab = "Adjusted R²",
     type = "b", pch = 19, col = "steelblue",
     main = "Best Subset — Adjusted R²")
points(which.max(best_summary$adjr2),
       max(best_summary$adjr2),
       col = "red", cex = 2.5, pch = 20)

# BIC — lower is better
plot(best_summary$bic,
     xlab = "Number of Predictors", ylab = "BIC",
     type = "b", pch = 19, col = "steelblue",
     main = "Best Subset — BIC")
points(which.min(best_summary$bic),
       min(best_summary$bic),
       col = "red", cex = 2.5, pch = 20)

par(mfrow = c(1, 1))

Both Adjusted R² and BIC point to the 3-predictor model — all three variables earn their place.

2.3.2 Forward Stepwise Selection

Starts with no predictors and adds the single best variable at each step.

null_model <- lm(Ozone ~ 1,                     data = aq)
full_model <- lm(Ozone ~ Solar.R + Wind + Temp,  data = aq)

forward_model <- stepAIC(null_model,
                          scope     = list(lower = null_model,
                                           upper = full_model),
                          direction = "forward",
                          trace     = TRUE)

## Start:  AIC=779.07
## Ozone ~ 1
## 
##           Df Sum of Sq    RSS    AIC
## + Temp     1     59434  62367 706.77
## + Wind     1     45694  76108 728.87
## + Solar.R  1     14780 107022 766.71
## <none>                 121802 779.07
## 
## Step:  AIC=706.77
## Ozone ~ Temp
## 
##           Df Sum of Sq   RSS    AIC
## + Wind     1   11378.5 50989 686.41
## + Solar.R  1    2723.1 59644 703.82
## <none>                 62367 706.77
## 
## Step:  AIC=686.41
## Ozone ~ Temp + Wind
## 
##           Df Sum of Sq   RSS    AIC
## + Solar.R  1    2986.2 48003 681.71
## <none>                 50989 686.41
## 
## Step:  AIC=681.71
## Ozone ~ Temp + Wind + Solar.R

summary(forward_model)

## 
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

At each step, R picks the variable that lowers AIC the most and repeats until nothing helps anymore.

2.3.3 Backward Stepwise Selection

Starts with all predictors and removes the least useful variable one at a time.

backward_model <- stepAIC(full_model,
                           direction = "backward",
                           trace     = TRUE)

## Start:  AIC=681.71
## Ozone ~ Solar.R + Wind + Temp
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 48003 681.71
## - Solar.R  1    2986.2 50989 686.41
## - Wind     1   11641.6 59644 703.82
## - Temp     1   19049.9 67053 716.81

summary(backward_model)

## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

2.3.4 Both-Direction (Hybrid) Stepwise

At each step, considers both adding and removing a variable.

both_model <- stepAIC(full_model,
                       direction = "both",
                       trace     = FALSE)
summary(both_model)

## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

# Compare all three stepwise results — lower AIC is better
cat("Forward  AIC:", round(AIC(forward_model),  2), "\n")

## Forward  AIC: 998.72

cat("Backward AIC:", round(AIC(backward_model), 2), "\n")

## Backward AIC: 998.72

cat("Both-dir AIC:", round(AIC(both_model),     2), "\n")

## Both-dir AIC: 998.72

All three approaches land on the same final model, which is reassuring.

2.4. Shrinkage Methods (Regularization)

Rather than removing variables outright, shrinkage methods penalize large coefficients, pulling less important ones toward zero. This reduces overfitting without forcing hard in-or-out decisions.

# glmnet requires a numeric matrix for X and a vector for y
X <- as.matrix(aq[, c("Solar.R", "Wind", "Temp")])
y <- aq$Ozone

2.4.1 Ridge Regression (α = 0)

Ridge shrinks all coefficients toward zero but never to exactly zero — every variable stays in the model.

set.seed(42)
ridge_cv <- cv.glmnet(X, y, alpha = 0)

cat("Best lambda (Ridge):", round(ridge_cv$lambda.min, 4), "\n")

## Best lambda (Ridge): 2.7872

coef(ridge_cv, s = "lambda.min")

## 4 x 1 sparse Matrix of class "dgCMatrix"
##               lambda.min
## (Intercept) -57.36990606
## Solar.R       0.05869641
## Wind         -3.20532999
## Temp          1.54875052

plot(ridge_cv, main = "Ridge — Cross-Validation Error vs Lambda")

As lambda increases, all coefficients shrink — but none reach exactly zero.

2.4.2 LASSO Regression (α = 1)

LASSO can shrink some coefficients to exactly zero, performing variable selection automatically. A . in the output means that variable was dropped.

set.seed(42)
lasso_cv <- cv.glmnet(X, y, alpha = 1)

cat("Best lambda (LASSO):", round(lasso_cv$lambda.min, 4), "\n")

## Best lambda (LASSO): 0.1387

coef(lasso_cv, s = "lambda.min")

## 4 x 1 sparse Matrix of class "dgCMatrix"
##               lambda.min
## (Intercept) -63.90889100
## Solar.R       0.05861189
## Wind         -3.30638798
## Temp          1.64591999

plot(lasso_cv, main = "LASSO — Cross-Validation Error vs Lambda")

# Coefficient path — each line is one predictor
# Variables that disappear first are the least important
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE,
     main = "LASSO Coefficient Path")
legend("topright",
       legend = colnames(X),
       col    = 1:3,
       lty    = 1,
       cex    = 0.8)

The variable whose line hits zero last is the most important predictor.

2.4.3 Elastic Net (α = 0.5)

Elastic Net blends Ridge and LASSO, handling correlated predictors better than pure LASSO while still being able to zero out weak variables.

set.seed(42)
enet_cv <- cv.glmnet(X, y, alpha = 0.5)

cat("Best lambda (Elastic Net):", round(enet_cv$lambda.min, 4), "\n")

## Best lambda (Elastic Net): 0.6409

coef(enet_cv, s = "lambda.min")

## 4 x 1 sparse Matrix of class "dgCMatrix"
##               lambda.min
## (Intercept) -62.45407399
## Solar.R       0.05695209
## Wind         -3.25709109
## Temp          1.62486307

2.5. Comparing All Methods

# Collect coefficients from all methods
ols_coef   <- coef(lm(Ozone ~ Solar.R + Wind + Temp, data = aq))
ridge_coef <- as.vector(coef(ridge_cv, s = "lambda.min"))
lasso_coef <- as.vector(coef(lasso_cv, s = "lambda.min"))
enet_coef  <- as.vector(coef(enet_cv,  s = "lambda.min"))

# Side-by-side comparison
comparison <- data.frame(
  Method    = c("OLS", "Ridge", "LASSO", "Elastic Net"),
  Intercept = round(c(ols_coef[1], ridge_coef[1], lasso_coef[1], enet_coef[1]), 3),
  Solar.R   = round(c(ols_coef[2], ridge_coef[2], lasso_coef[2], enet_coef[2]), 3),
  Wind      = round(c(ols_coef[3], ridge_coef[3], lasso_coef[3], enet_coef[3]), 3),
  Temp      = round(c(ols_coef[4], ridge_coef[4], lasso_coef[4], enet_coef[4]), 3)
)

print(comparison)

##        Method Intercept Solar.R   Wind  Temp
## 1         OLS   -64.342   0.060 -3.334 1.652
## 2       Ridge   -57.370   0.059 -3.205 1.549
## 3       LASSO   -63.909   0.059 -3.306 1.646
## 4 Elastic Net   -62.454   0.057 -3.257 1.625

# Compare OLS candidate models by AIC, BIC, and Adjusted R²
m1 <- lm(Ozone ~ Temp,                   data = aq)
m2 <- lm(Ozone ~ Wind + Temp,            data = aq)
m3 <- lm(Ozone ~ Solar.R + Wind + Temp,  data = aq)

data.frame(
  Model  = c("Temp only", "Wind + Temp", "Solar.R + Wind + Temp"),
  AIC    = round(c(AIC(m1), AIC(m2), AIC(m3)), 2),
  BIC    = round(c(BIC(m1), BIC(m2), BIC(m3)), 2),
  Adj_R2 = round(c(summary(m1)$adj.r.squared,
                   summary(m2)$adj.r.squared,
                   summary(m3)$adj.r.squared), 4)
)

##                   Model     AIC     BIC Adj_R2
## 1             Temp only 1023.78 1031.90 0.4833
## 2           Wind + Temp 1003.42 1014.25 0.5736
## 3 Solar.R + Wind + Temp  998.72 1012.26 0.5948

# Lower AIC/BIC = better fit  |  Higher Adj R² = better fit

2.6. Which Method to Use — a Quick Guide

Situation	Best Method
Few predictors, want to try all combinations	Best Subset Selection
Many predictors, want a fast automated search	Forward or Backward Stepwise
Predictors are correlated, keep all of them	Ridge Regression
Want automatic variable removal	LASSO
Correlated predictors AND want removal	Elastic Net
Comparing models that are not nested	AIC or BIC

2.7. Conclusion (Part 2)

Variable selection methods were applied to the airquality data to validate the model chosen in Part 1. The results are remarkably consistent across all approaches:

Best Subset Selection (Adjusted R² and BIC) selected all three predictors as optimal.
Forward, Backward, and Both-direction Stepwise all arrived at the same 3-variable model using AIC.
LASSO and Elastic Net retained all three predictors — none were shrunk to zero.
Ridge shrank the coefficients but did not eliminate any, consistent with all three carrying real signal.

Every method agrees: the full model Ozone ~ Solar.R + Wind + Temp is the best choice. There is no evidence that any variable should be dropped. This convergence across fundamentally different selection strategies gives strong confidence in the final model.

R Programming Assignments

Arthur BUTERA - 20251MBI054

2026-06-04