Part 1: Multiple Linear Regression Using the mtcars Dataset
This analysis uses the built-in mtcars dataset in R to examine factors that influence vehicle fuel efficiency. The response variable is miles per gallon (mpg), while the predictor variables are horsepower (hp), weight (wt), and engine displacement (disp). These variables were selected because they are commonly associated with fuel consumption, with larger and heavier vehicles generally requiring more fuel.
A multiple linear regression model is fitted to evaluate the relationship between fuel efficiency and these vehicle characteristics. The objective is to determine how changes in horsepower, weight, and engine displacement affect miles per gallon and to assess their overall contribution to predicting fuel efficiency.
library(ggplot2)
library(leaps)
library(MASS)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 5.0
## Load the data
data(mtcars)
# A quick look at the first few rows and the overall structure
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Summary statistics for just the four variables I am working with
summary(mtcars[, c("mpg", "hp", "wt", "disp")])
## mpg hp wt disp
## Min. :10.40 Min. : 52.0 Min. :1.513 Min. : 71.1
## 1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581 1st Qu.:120.8
## Median :19.20 Median :123.0 Median :3.325 Median :196.3
## Mean :20.09 Mean :146.7 Mean :3.217 Mean :230.7
## 3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610 3rd Qu.:326.0
## Max. :33.90 Max. :335.0 Max. :5.424 Max. :472.0
# mtcars has no missing values, so I can go straight into the analysis
cat("Missing values per variable:\n")
## Missing values per variable:
print(colSums(is.na(mtcars[, c("mpg", "hp", "wt", "disp")])))
## mpg hp wt disp
## 0 0 0 0
## Explore relationships visually
# A scatterplot matrix is a quick way to see how all four variables
# relate to each other before fitting anything
pairs(mtcars[, c("mpg", "hp", "wt", "disp")],
col = "steelblue",
pch = 16,
main = "Pairwise Scatterplots: mpg, hp, wt, disp (mtcars)")
# What stands out:
# - mpg drops as wt increases (heavier cars use more fuel)
# - mpg drops as hp increases (more power = more consumption)
# - hp and disp are strongly correlated with each other, which means
# there might be some multicollinearity in the model
## Fit the multiple linear regression model
model_p1 <- lm(mpg ~ hp + wt + disp, data = mtcars)
summary(model_p1)
##
## Call:
## lm(formula = mpg ~ hp + wt + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## hp -0.031157 0.011436 -2.724 0.01097 *
## wt -3.800891 1.066191 -3.565 0.00133 **
## disp -0.000937 0.010350 -0.091 0.92851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
# The summary gives us the coefficients, standard errors, p-values,
# and the overall model fit statistics
## Interpret the coefficients
round(coef(summary(model_p1)), 4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.1055 2.1108 17.5788 0.0000
## hp -0.0312 0.0114 -2.7245 0.0110
## wt -3.8009 1.0662 -3.5649 0.0013
## disp -0.0009 0.0103 -0.0905 0.9285
# Holding everything else constant:
# hp -> each additional unit of horsepower slightly reduces mpg
# wt -> each additional 1000 lbs of weight noticeably reduces mpg
# disp -> each additional cubic inch of displacement affects mpg,
# though its significance depends on what else is in the model
#
# wt tends to be the strongest and most significant predictor here.
# hp and disp share a lot of information, so one may look less
# significant when both are included — this is multicollinearity.
## How well does the model fit? (R-squared)
r2 <- summary(model_p1)$r.squared
adj_r2 <- summary(model_p1)$adj.r.squared
cat("R-squared: ", round(r2, 4), "\n")
## R-squared: 0.8268
cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
## Adjusted R-squared: 0.8083
# R-squared around 0.83 means the three predictors explain roughly
# 83% of the variation in fuel efficiency, which is a strong result
# for a model with only three variables.
## Check the model assumptions (diagnostic plots)
par(mfrow = c(2, 2))
plot(model_p1, col = "steelblue", pch = 16)
par(mfrow = c(1, 1))
# What each plot tells us:
# Residuals vs Fitted -> checks whether the relationship is linear
# Normal Q-Q -> checks whether residuals are normally distributed
# Scale-Location -> checks whether variance is constant (homoscedasticity)
# Residuals vs Leverage-> identifies any unusually influential data points
## Confidence intervals for each coefficient
confint(model_p1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 32.78169625 41.429314293
## hp -0.05458171 -0.007731388
## wt -5.98488310 -1.616898063
## disp -0.02213750 0.020263482
# If a confidence interval does not cross zero, we are 95% confident
# that the predictor has a real effect on mpg
## Making a prediction
# What mpg would we expect for a car with hp = 150, wt = 3.0, disp = 200?
new_car <- data.frame(hp = 150, wt = 3.0, disp = 200)
predict(model_p1, newdata = new_car, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 20.84195 15.3333 26.35059
# The output gives a point estimate (fit) and a 95% prediction interval.
# The interval is fairly wide, which is normal — individual cars vary
# a lot even when their specs are similar.
## Visualizing actual vs predicted values
mtcars_aug <- mtcars
mtcars_aug$Predicted <- fitted(model_p1)
mtcars_aug$Residuals <- residuals(model_p1)
# Plot 1: Actual mpg vs model predictions
# Points falling on the red dashed line would be perfect predictions
ggplot(mtcars_aug, aes(x = Predicted, y = mpg)) +
geom_point(color = "steelblue", size = 2.5, alpha = 0.8) +
geom_abline(intercept = 0, slope = 1,
color = "red", linetype = "dashed", linewidth = 1) +
labs(
title = "Actual vs Predicted MPG",
subtitle = "Points close to the dashed line indicate accurate predictions",
x = "Predicted MPG",
y = "Actual MPG"
) +
theme_minimal()
# Plot 2: Residuals vs predicted values
# A random scatter around zero is what we want to see here
ggplot(mtcars_aug, aes(x = Predicted, y = Residuals)) +
geom_point(color = "tomato", size = 2.5, alpha = 0.8) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE,
color = "steelblue", linewidth = 1) +
labs(
title = "Residuals vs Predicted MPG",
subtitle = "The blue line should stay flat and close to zero",
x = "Predicted MPG",
y = "Residuals"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Actual and predicted values across all 32 observations
ggplot(mtcars_aug, aes(x = seq_len(nrow(mtcars_aug)))) +
geom_line(aes(y = mpg, color = "Actual"), linewidth = 1) +
geom_line(aes(y = Predicted, color = "Predicted"),
linewidth = 1, linetype = "dashed") +
scale_color_manual(
values = c("Actual" = "steelblue", "Predicted" = "red")
) +
labs(
title = "Actual vs Predicted MPG Across All Observations",
subtitle = "The dashed red line shows what the model predicted for each car",
x = "Observation",
y = "MPG",
color = NULL
) +
theme_minimal()
Part 1 Conclusion
The multiple linear regression model explained approximately 83% of the variation in fuel efficiency (R² = 0.83). Vehicle weight was the strongest predictor of miles per gallon, while horsepower and engine displacement also showed negative relationships with fuel efficiency. Overall, the model performed well and met the key regression assumptions. The next step is to apply variable selection methods to identify the most important predictors and determine whether a simpler model can provide similar results.
In this section, I use the built-in swiss dataset in R to explore the socioeconomic factors associated with fertility rates across 47 French-speaking provinces of Switzerland in 1888. The goal is to identify the most important predictors of fertility and determine which variables contribute meaningfully to the model.
The response variable is Fertility, which represents a standardized measure of fertility in each province. The potential predictor variables include:
#Agriculture: Percentage of males employed in agriculture. #Examination: Percentage of army recruits receiving the highest examination grade. #Education: Percentage of recruits with education beyond primary school. #Catholic: Percentage of the population that is Catholic. #Infant.Mortality: Number of infant deaths per 1,000 live births.
To select the most relevant predictors, variable selection techniques such as best subset selection, forward selection, and backward selection are applied. These methods help identify the combination of variables that best explains variations in fertility while avoiding unnecessary complexity in the model.
The main objective of this analysis is to answer the following question:Which socioeconomic factors have the strongest influence on fertility rates, and which variables can be removed from the model without substantially reducing its predictive performance?**
## Load and explore the data
data(swiss)
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(swiss)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
cat("Missing values:", sum(is.na(swiss)), "\n")
## Missing values: 0
# No missing values, so the data is ready to use as-is
cat("Number of observations:", nrow(swiss), "\n")
## Number of observations: 47
cat("Number of predictors: ", ncol(swiss) - 1, "\n")
## Number of predictors: 5
## Explore relationships visually
pairs(swiss,
col = "steelblue",
pch = 16,
main = "Pairwise Scatterplots: swiss Dataset")
# A few patterns that stand out:
# Education has a clear negative relationship with Fertility
# (more educated provinces have lower birth rates)
# Catholic has a positive relationship with Fertility
# (historically Catholic provinces tend to have higher fertility)
# Agriculture also appears positively related to Fertility
# (rural, farming communities tend to have more children)
## Start with the full model
full_model_p2 <- lm(Fertility ~ ., data = swiss)
null_model_p2 <- lm(Fertility ~ 1, data = swiss)
summary(full_model_p2)
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
# Looking at the full model, not every predictor is equally significant.
# Examination in particular has a high p-value, suggesting it might
# not be adding much once the other variables are already in the model.
# The variable selection methods below will help confirm this.
## Best Subset Selection
# With only 5 predictors, there are just 32 possible models, so we can
# exhaustively try every combination and compare them fairly.
best_sub <- regsubsets(Fertility ~ ., data = swiss, nvmax = 5)
best_sum <- summary(best_sub)
# Which variables get selected at each model size?
best_sum$which
## (Intercept) Agriculture Examination Education Catholic Infant.Mortality
## 1 TRUE FALSE FALSE TRUE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE FALSE
## 3 TRUE FALSE FALSE TRUE TRUE TRUE
## 4 TRUE TRUE FALSE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE
# Comparing criteria across model sizes
criteria_tbl <- data.frame(
Predictors = 1:5,
Adj_R2 = round(best_sum$adjr2, 4),
Cp = round(best_sum$cp, 2),
BIC = round(best_sum$bic, 2)
)
print(criteria_tbl)
## Predictors Adj_R2 Cp BIC
## 1 1 0.4282 35.20 -19.60
## 2 2 0.5552 18.49 -28.61
## 3 3 0.6390 8.18 -35.66
## 4 4 0.6707 5.03 -37.23
## 5 5 0.6710 6.00 -34.55
# Which number of predictors wins on each criterion?
cat("\nBest model size by Adjusted R-squared:", which.max(best_sum$adjr2), "predictors\n")
##
## Best model size by Adjusted R-squared: 5 predictors
cat("Best model size by BIC: ", which.min(best_sum$bic), "predictors\n")
## Best model size by BIC: 4 predictors
cat("Best model size by Cp: ", which.min(best_sum$cp), "predictors\n")
## Best model size by Cp: 4 predictors
# Visualizing the criteria to make the choice clear
par(mfrow = c(1, 2))
plot(best_sum$adjr2, type = "b", pch = 19, col = "steelblue",
xlab = "Number of Predictors", ylab = "Adjusted R-squared",
main = "Best Subset: Adjusted R-squared")
points(which.max(best_sum$adjr2), max(best_sum$adjr2),
col = "red", cex = 2.5, pch = 20)
plot(best_sum$bic, type = "b", pch = 19, col = "steelblue",
xlab = "Number of Predictors", ylab = "BIC",
main = "Best Subset: BIC")
points(which.min(best_sum$bic), min(best_sum$bic),
col = "red", cex = 2.5, pch = 20)
par(mfrow = c(1, 1))
# The red dot marks the model size selected by each criterion.
# If both agree, that gives us more confidence in the choice.
## Forward Step wise Selection
# Starting from scratch (no predictors) and adding one variable at a
# time — only keeping a variable if it meaningfully lowers the AIC.
forward_model_p2 <- stepAIC(
null_model_p2,
scope = list(lower = null_model_p2, upper = full_model_p2),
direction = "forward",
trace = TRUE
)
## Start: AIC=238.35
## Fertility ~ 1
##
## Df Sum of Sq RSS AIC
## + Education 1 3162.7 4015.2 213.04
## + Examination 1 2994.4 4183.6 214.97
## + Catholic 1 1543.3 5634.7 228.97
## + Infant.Mortality 1 1245.5 5932.4 231.39
## + Agriculture 1 894.8 6283.1 234.09
## <none> 7178.0 238.34
##
## Step: AIC=213.04
## Fertility ~ Education
##
## Df Sum of Sq RSS AIC
## + Catholic 1 961.07 3054.2 202.18
## + Infant.Mortality 1 891.25 3124.0 203.25
## + Examination 1 465.63 3549.6 209.25
## <none> 4015.2 213.04
## + Agriculture 1 61.97 3953.3 214.31
##
## Step: AIC=202.18
## Fertility ~ Education + Catholic
##
## Df Sum of Sq RSS AIC
## + Infant.Mortality 1 631.92 2422.2 193.29
## + Agriculture 1 486.28 2567.9 196.03
## <none> 3054.2 202.18
## + Examination 1 2.46 3051.7 204.15
##
## Step: AIC=193.29
## Fertility ~ Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## + Agriculture 1 264.176 2158.1 189.86
## <none> 2422.2 193.29
## + Examination 1 9.486 2412.8 195.10
##
## Step: AIC=189.86
## Fertility ~ Education + Catholic + Infant.Mortality + Agriculture
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## + Examination 1 53.027 2105.0 190.69
summary(forward_model_p2)
##
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality +
## Agriculture, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
# The trace output shows the AIC at each step, and which variable
# got added. The process stops when no remaining variable helps.
## Backward Stepwise Selection
# Starting with all 5 predictors and peeling off the weakest one at
# each step, as long as removing it lowers the AIC.
backward_model_p2 <- stepAIC(
full_model_p2,
direction = "backward",
trace = TRUE
)
## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Df Sum of Sq RSS AIC
## - Examination 1 53.03 2158.1 189.86
## <none> 2105.0 190.69
## - Agriculture 1 307.72 2412.8 195.10
## - Infant.Mortality 1 408.75 2513.8 197.03
## - Catholic 1 447.71 2552.8 197.75
## - Education 1 1162.56 3267.6 209.36
##
## Step: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
summary(backward_model_p2)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
## Both-Direction Stepwise Selection
# This approach combines forward and backward steps — at each iteration
# it considers both adding a new variable and removing an existing one.
# It is the most flexible of the three stepwise approaches.
both_model_p2 <- stepAIC(
full_model_p2,
direction = "both",
trace = FALSE # turned off to keep the output clean
)
summary(both_model_p2)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
# Comparing the final AIC of all three stepwise models
cat("\n--- AIC Comparison: Stepwise Methods ---\n")
##
## --- AIC Comparison: Stepwise Methods ---
cat("Forward AIC:", round(AIC(forward_model_p2), 2), "\n")
## Forward AIC: 325.24
cat("Backward AIC:", round(AIC(backward_model_p2), 2), "\n")
## Backward AIC: 325.24
cat("Both-dir AIC:", round(AIC(both_model_p2), 2), "\n")
## Both-dir AIC: 325.24
# If all three produce the same AIC, it means they all landed on
# the same final model, which is a reassuring sign of consistency.
##Shrinkage Methods (Ridge, LASSO, Elastic Net)
# Instead of hard in-or-out decisions, these methods shrink coefficient
# values toward zero. LASSO can shrink them all the way to zero,
# which effectively removes that variable from the model.
X <- as.matrix(swiss[, -which(names(swiss) == "Fertility")])
y <- swiss$Fertility
# Ridge Regression (alpha = 0)
# All predictors stay in the model, but their coefficients are pulled
# toward zero to reduce overfitting. None reach exactly zero.
set.seed(42)
ridge_cv <- cv.glmnet(X, y, alpha = 0)
cat("\nOptimal lambda chosen by Ridge cross-validation:", round(ridge_cv$lambda.min, 4), "\n")
##
## Optimal lambda chosen by Ridge cross-validation: 0.8203
coef(ridge_cv, s = "lambda.min")
## 6 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 64.45560956
## Agriculture -0.12742717
## Examination -0.31929466
## Education -0.73053040
## Catholic 0.08663457
## Infant.Mortality 1.09630539
plot(ridge_cv, main = "Ridge Regression: CV Error vs Lambda")
# LASSO Regression (alpha = 1)
# This is more aggressive — it can shrink some coefficients to exactly
# zero, which means those variables are effectively removed.
# Any predictor showing a "." below was dropped by LASSO.
set.seed(42)
lasso_cv <- cv.glmnet(X, y, alpha = 1)
cat("Optimal lambda chosen by LASSO cross-validation:", round(lasso_cv$lambda.min, 4), "\n")
## Optimal lambda chosen by LASSO cross-validation: 0.0943
coef(lasso_cv, s = "lambda.min")
## 6 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 65.8317930
## Agriculture -0.1554603
## Examination -0.2474306
## Education -0.8446493
## Catholic 0.1003220
## Infant.Mortality 1.0736754
plot(lasso_cv, main = "LASSO: CV Error vs Lambda")
# The coefficient path shows each predictor's journey as lambda grows.
# Variables that hit zero first are the least important in the model.
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE,
main = "LASSO Coefficient Path")
legend("topright",
legend = colnames(X),
col = 1:ncol(X),
lty = 1,
cex = 0.8)
set.seed(42)
enet_cv <- cv.glmnet(X, y, alpha = 0.5)
cat("Optimal lambda chosen by Elastic Net cross-validation:", round(enet_cv$lambda.min, 4), "\n")
## Optimal lambda chosen by Elastic Net cross-validation: 0.1719
coef(enet_cv, s = "lambda.min")
## 6 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 65.64880963
## Agriculture -0.15167763
## Examination -0.25819879
## Education -0.82901443
## Catholic 0.09834202
## Infant.Mortality 1.07762299
plot(enet_cv, main = "Elastic Net: CV Error vs Lambda")
##Comparing OLS vs regularization side by side
ols_coef <- as.vector(coef(full_model_p2))
ridge_coef <- as.vector(coef(ridge_cv, s = "lambda.min"))
lasso_coef <- as.vector(coef(lasso_cv, s = "lambda.min"))
enet_coef <- as.vector(coef(enet_cv, s = "lambda.min"))
var_names <- c("Intercept", colnames(X))
comparison_p2 <- data.frame(
Variable = var_names,
OLS = round(ols_coef, 3),
Ridge = round(ridge_coef, 3),
LASSO = round(lasso_coef, 3),
Elastic_Net = round(enet_coef, 3)
)
print(comparison_p2)
## Variable OLS Ridge LASSO Elastic_Net
## 1 Intercept 66.915 64.456 65.832 65.649
## 2 Agriculture -0.172 -0.127 -0.155 -0.152
## 3 Examination -0.258 -0.319 -0.247 -0.258
## 4 Education -0.871 -0.731 -0.845 -0.829
## 5 Catholic 0.104 0.087 0.100 0.098
## 6 Infant.Mortality 1.077 1.096 1.074 1.078
##Reading this table:
# OLS column -> unpenalised estimates (baseline)
# Ridge column -> all shrunken, none are zero
# LASSO column -> any zero means that variable was dropped
# Elastic Net column -> similar to LASSO but handles correlation better
#
# If a variable shows near-zero or exactly zero across Ridge, LASSO,
# and Elastic Net, that is strong evidence it can safely be removed.
## Comparing candidate OLS models (AIC, BIC, Adj R2)
# As a final check, I compare three OLS models directly using
# information criteria — this confirms what the selection methods found.
m_full <- lm(Fertility ~ ., data = swiss)
m_4var <- lm(Fertility ~ Agriculture + Education +
Catholic + Infant.Mortality, data = swiss)
m_3var <- lm(Fertility ~ Education + Catholic + Infant.Mortality,
data = swiss)
model_comparison <- data.frame(
Model = c("Full model (5 vars)",
"Without Examination (4 vars)",
"Top 3 predictors only"),
AIC = round(c(AIC(m_full), AIC(m_4var), AIC(m_3var)), 2),
BIC = round(c(BIC(m_full), BIC(m_4var), BIC(m_3var)), 2),
Adj_R2 = round(c(summary(m_full)$adj.r.squared,
summary(m_4var)$adj.r.squared,
summary(m_3var)$adj.r.squared), 4)
)
print(model_comparison)
## Model AIC BIC Adj_R2
## 1 Full model (5 vars) 326.07 339.02 0.6710
## 2 Without Examination (4 vars) 325.24 336.34 0.6707
## 3 Top 3 predictors only 328.67 337.92 0.6390