Part 1: Multiple Linear Regression Using the mtcars Dataset

This analysis uses the built-in mtcars dataset in R to examine factors that influence vehicle fuel efficiency. The response variable is miles per gallon (mpg), while the predictor variables are horsepower (hp), weight (wt), and engine displacement (disp). These variables were selected because they are commonly associated with fuel consumption, with larger and heavier vehicles generally requiring more fuel.

A multiple linear regression model is fitted to evaluate the relationship between fuel efficiency and these vehicle characteristics. The objective is to determine how changes in horsepower, weight, and engine displacement affect miles per gallon and to assess their overall contribution to predicting fuel efficiency.

library(ggplot2)
library(leaps)    
library(MASS)  
library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 5.0

## Load the data

data(mtcars)

# A quick look at the first few rows and the overall structure
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

# Summary statistics for just the four variables I am working with
summary(mtcars[, c("mpg", "hp", "wt", "disp")])

##       mpg              hp              wt             disp      
##  Min.   :10.40   Min.   : 52.0   Min.   :1.513   Min.   : 71.1  
##  1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581   1st Qu.:120.8  
##  Median :19.20   Median :123.0   Median :3.325   Median :196.3  
##  Mean   :20.09   Mean   :146.7   Mean   :3.217   Mean   :230.7  
##  3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610   3rd Qu.:326.0  
##  Max.   :33.90   Max.   :335.0   Max.   :5.424   Max.   :472.0

# mtcars has no missing values, so I can go straight into the analysis
cat("Missing values per variable:\n")

## Missing values per variable:

print(colSums(is.na(mtcars[, c("mpg", "hp", "wt", "disp")])))

##  mpg   hp   wt disp 
##    0    0    0    0

## Explore relationships visually

# A scatterplot matrix is a quick way to see how all four variables
# relate to each other before fitting anything
pairs(mtcars[, c("mpg", "hp", "wt", "disp")],
      col  = "steelblue",
      pch  = 16,
      main = "Pairwise Scatterplots: mpg, hp, wt, disp (mtcars)")

# What stands out:
# - mpg drops as wt increases (heavier cars use more fuel)
# - mpg drops as hp increases (more power = more consumption)
# - hp and disp are strongly correlated with each other, which means
#   there might be some multicollinearity in the model

## Fit the multiple linear regression model 
model_p1 <- lm(mpg ~ hp + wt + disp, data = mtcars)

summary(model_p1)

## 
## Call:
## lm(formula = mpg ~ hp + wt + disp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.891 -1.640 -0.172  1.061  5.861 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## disp        -0.000937   0.010350  -0.091  0.92851    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8083 
## F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11

# The summary gives us the coefficients, standard errors, p-values,
# and the overall model fit statistics

## Interpret the coefficients 
round(coef(summary(model_p1)), 4)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.1055     2.1108 17.5788   0.0000
## hp           -0.0312     0.0114 -2.7245   0.0110
## wt           -3.8009     1.0662 -3.5649   0.0013
## disp         -0.0009     0.0103 -0.0905   0.9285

# Holding everything else constant:
#   hp   -> each additional unit of horsepower slightly reduces mpg
#   wt   -> each additional 1000 lbs of weight noticeably reduces mpg
#   disp -> each additional cubic inch of displacement affects mpg,
#           though its significance depends on what else is in the model
#
# wt tends to be the strongest and most significant predictor here.
# hp and disp share a lot of information, so one may look less
# significant when both are included — this is multicollinearity.

## How well does the model fit? (R-squared) 
r2     <- summary(model_p1)$r.squared
adj_r2 <- summary(model_p1)$adj.r.squared

cat("R-squared:         ", round(r2,     4), "\n")

## R-squared:          0.8268

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")

## Adjusted R-squared: 0.8083

# R-squared around 0.83 means the three predictors explain roughly
# 83% of the variation in fuel efficiency, which is a strong result
# for a model with only three variables.

## Check the model assumptions (diagnostic plots)

par(mfrow = c(2, 2))
plot(model_p1, col = "steelblue", pch = 16)

par(mfrow = c(1, 1))

# What each plot tells us:
#   Residuals vs Fitted  -> checks whether the relationship is linear
#   Normal Q-Q           -> checks whether residuals are normally distributed
#   Scale-Location       -> checks whether variance is constant (homoscedasticity)
#   Residuals vs Leverage-> identifies any unusually influential data points

## Confidence intervals for each coefficient 
confint(model_p1, level = 0.95)

##                   2.5 %       97.5 %
## (Intercept) 32.78169625 41.429314293
## hp          -0.05458171 -0.007731388
## wt          -5.98488310 -1.616898063
## disp        -0.02213750  0.020263482

# If a confidence interval does not cross zero, we are 95% confident
# that the predictor has a real effect on mpg

## Making a prediction

# What mpg would we expect for a car with hp = 150, wt = 3.0, disp = 200?
new_car <- data.frame(hp = 150, wt = 3.0, disp = 200)

predict(model_p1, newdata = new_car, interval = "prediction", level = 0.95)

##        fit     lwr      upr
## 1 20.84195 15.3333 26.35059

# The output gives a point estimate (fit) and a 95% prediction interval.
# The interval is fairly wide, which is normal — individual cars vary
# a lot even when their specs are similar.

## Visualizing actual vs predicted values

mtcars_aug           <- mtcars
mtcars_aug$Predicted <- fitted(model_p1)
mtcars_aug$Residuals <- residuals(model_p1)

# Plot 1: Actual mpg vs model predictions
# Points falling on the red dashed line would be perfect predictions
ggplot(mtcars_aug, aes(x = Predicted, y = mpg)) +
  geom_point(color = "steelblue", size = 2.5, alpha = 0.8) +
  geom_abline(intercept = 0, slope = 1,
              color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title    = "Actual vs Predicted MPG",
    subtitle = "Points close to the dashed line indicate accurate predictions",
    x        = "Predicted MPG",
    y        = "Actual MPG"
  ) +
  theme_minimal()

# Plot 2: Residuals vs predicted values
# A random scatter around zero is what we want to see here
ggplot(mtcars_aug, aes(x = Predicted, y = Residuals)) +
  geom_point(color = "tomato", size = 2.5, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE,
              color = "steelblue", linewidth = 1) +
  labs(
    title    = "Residuals vs Predicted MPG",
    subtitle = "The blue line should stay flat and close to zero",
    x        = "Predicted MPG",
    y        = "Residuals"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Actual and predicted values across all 32 observations
ggplot(mtcars_aug, aes(x = seq_len(nrow(mtcars_aug)))) +
  geom_line(aes(y = mpg,       color = "Actual"),    linewidth = 1) +
  geom_line(aes(y = Predicted, color = "Predicted"),
            linewidth = 1, linetype = "dashed") +
  scale_color_manual(
    values = c("Actual" = "steelblue", "Predicted" = "red")
  ) +
  labs(
    title    = "Actual vs Predicted MPG Across All Observations",
    subtitle = "The dashed red line shows what the model predicted for each car",
    x        = "Observation",
    y        = "MPG",
    color    = NULL
  ) +
  theme_minimal()

Part 1 Conclusion

The multiple linear regression model explained approximately 83% of the variation in fuel efficiency (R² = 0.83). Vehicle weight was the strongest predictor of miles per gallon, while horsepower and engine displacement also showed negative relationships with fuel efficiency. Overall, the model performed well and met the key regression assumptions. The next step is to apply variable selection methods to identify the most important predictors and determine whether a simpler model can provide similar results.

Part 2: Variable Selection Using the Swiss Dataset

In this section, I use the built-in swiss dataset in R to explore the socioeconomic factors associated with fertility rates across 47 French-speaking provinces of Switzerland in 1888. The goal is to identify the most important predictors of fertility and determine which variables contribute meaningfully to the model.

The response variable is Fertility, which represents a standardized measure of fertility in each province. The potential predictor variables include:

#Agriculture: Percentage of males employed in agriculture. #Examination: Percentage of army recruits receiving the highest examination grade. #Education: Percentage of recruits with education beyond primary school. #Catholic: Percentage of the population that is Catholic. #Infant.Mortality: Number of infant deaths per 1,000 live births.

To select the most relevant predictors, variable selection techniques such as best subset selection, forward selection, and backward selection are applied. These methods help identify the combination of variables that best explains variations in fertility while avoiding unnecessary complexity in the model.

The main objective of this analysis is to answer the following question:Which socioeconomic factors have the strongest influence on fertility rates, and which variables can be removed from the model without substantially reducing its predictive performance?**

## Load and explore the data 
data(swiss)

head(swiss)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

str(swiss)

## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

summary(swiss)

##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

cat("Missing values:", sum(is.na(swiss)), "\n")

## Missing values: 0

# No missing values, so the data is ready to use as-is

cat("Number of observations:", nrow(swiss), "\n")

## Number of observations: 47

cat("Number of predictors:  ", ncol(swiss) - 1, "\n")

## Number of predictors:   5

## Explore relationships visually

pairs(swiss,
      col  = "steelblue",
      pch  = 16,
      main = "Pairwise Scatterplots: swiss Dataset")

# A few patterns that stand out:
# Education has a clear negative relationship with Fertility
# (more educated provinces have lower birth rates)
# Catholic has a positive relationship with Fertility
# (historically Catholic provinces tend to have higher fertility)
# Agriculture also appears positively related to Fertility
# (rural, farming communities tend to have more children)

## Start with the full model

full_model_p2 <- lm(Fertility ~ ., data = swiss)
null_model_p2 <- lm(Fertility ~ 1, data = swiss)

summary(full_model_p2)

## 
## Call:
## lm(formula = Fertility ~ ., data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10

# Looking at the full model, not every predictor is equally significant.
# Examination in particular has a high p-value, suggesting it might
# not be adding much once the other variables are already in the model.
# The variable selection methods below will help confirm this.

## Best Subset Selection
# With only 5 predictors, there are just 32 possible models, so we can
# exhaustively try every combination and compare them fairly.

best_sub <- regsubsets(Fertility ~ ., data = swiss, nvmax = 5)
best_sum <- summary(best_sub)

# Which variables get selected at each model size?
best_sum$which

##   (Intercept) Agriculture Examination Education Catholic Infant.Mortality
## 1        TRUE       FALSE       FALSE      TRUE    FALSE            FALSE
## 2        TRUE       FALSE       FALSE      TRUE     TRUE            FALSE
## 3        TRUE       FALSE       FALSE      TRUE     TRUE             TRUE
## 4        TRUE        TRUE       FALSE      TRUE     TRUE             TRUE
## 5        TRUE        TRUE        TRUE      TRUE     TRUE             TRUE

# Comparing criteria across model sizes
criteria_tbl <- data.frame(
  Predictors = 1:5,
  Adj_R2     = round(best_sum$adjr2, 4),
  Cp         = round(best_sum$cp,    2),
  BIC        = round(best_sum$bic,   2)
)
print(criteria_tbl)

##   Predictors Adj_R2    Cp    BIC
## 1          1 0.4282 35.20 -19.60
## 2          2 0.5552 18.49 -28.61
## 3          3 0.6390  8.18 -35.66
## 4          4 0.6707  5.03 -37.23
## 5          5 0.6710  6.00 -34.55

# Which number of predictors wins on each criterion?
cat("\nBest model size by Adjusted R-squared:", which.max(best_sum$adjr2), "predictors\n")

## 
## Best model size by Adjusted R-squared: 5 predictors

cat("Best model size by BIC:               ", which.min(best_sum$bic),    "predictors\n")

## Best model size by BIC:                4 predictors

cat("Best model size by Cp:                ", which.min(best_sum$cp),      "predictors\n")

## Best model size by Cp:                 4 predictors

# Visualizing the criteria to make the choice clear
par(mfrow = c(1, 2))

plot(best_sum$adjr2, type = "b", pch = 19, col = "steelblue",
     xlab = "Number of Predictors", ylab = "Adjusted R-squared",
     main = "Best Subset: Adjusted R-squared")
points(which.max(best_sum$adjr2), max(best_sum$adjr2),
       col = "red", cex = 2.5, pch = 20)

plot(best_sum$bic, type = "b", pch = 19, col = "steelblue",
     xlab = "Number of Predictors", ylab = "BIC",
     main = "Best Subset: BIC")
points(which.min(best_sum$bic), min(best_sum$bic),
       col = "red", cex = 2.5, pch = 20)

par(mfrow = c(1, 1))

# The red dot marks the model size selected by each criterion.
# If both agree, that gives us more confidence in the choice.

## Forward Step wise Selection 
# Starting from scratch (no predictors) and adding one variable at a
# time — only keeping a variable if it meaningfully lowers the AIC.

forward_model_p2 <- stepAIC(
  null_model_p2,
  scope     = list(lower = null_model_p2, upper = full_model_p2),
  direction = "forward",
  trace     = TRUE
)

## Start:  AIC=238.35
## Fertility ~ 1
## 
##                    Df Sum of Sq    RSS    AIC
## + Education         1    3162.7 4015.2 213.04
## + Examination       1    2994.4 4183.6 214.97
## + Catholic          1    1543.3 5634.7 228.97
## + Infant.Mortality  1    1245.5 5932.4 231.39
## + Agriculture       1     894.8 6283.1 234.09
## <none>                          7178.0 238.34
## 
## Step:  AIC=213.04
## Fertility ~ Education
## 
##                    Df Sum of Sq    RSS    AIC
## + Catholic          1    961.07 3054.2 202.18
## + Infant.Mortality  1    891.25 3124.0 203.25
## + Examination       1    465.63 3549.6 209.25
## <none>                          4015.2 213.04
## + Agriculture       1     61.97 3953.3 214.31
## 
## Step:  AIC=202.18
## Fertility ~ Education + Catholic
## 
##                    Df Sum of Sq    RSS    AIC
## + Infant.Mortality  1    631.92 2422.2 193.29
## + Agriculture       1    486.28 2567.9 196.03
## <none>                          3054.2 202.18
## + Examination       1      2.46 3051.7 204.15
## 
## Step:  AIC=193.29
## Fertility ~ Education + Catholic + Infant.Mortality
## 
##               Df Sum of Sq    RSS    AIC
## + Agriculture  1   264.176 2158.1 189.86
## <none>                     2422.2 193.29
## + Examination  1     9.486 2412.8 195.10
## 
## Step:  AIC=189.86
## Fertility ~ Education + Catholic + Infant.Mortality + Agriculture
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     2158.1 189.86
## + Examination  1    53.027 2105.0 190.69

summary(forward_model_p2)

## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality + 
##     Agriculture, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

# The trace output shows the AIC at each step, and which variable
# got added. The process stops when no remaining variable helps.

## Backward Stepwise Selection 
# Starting with all 5 predictors and peeling off the weakest one at
# each step, as long as removing it lowers the AIC.

backward_model_p2 <- stepAIC(
  full_model_p2,
  direction = "backward",
  trace     = TRUE
)

## Start:  AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## - Examination       1     53.03 2158.1 189.86
## <none>                          2105.0 190.69
## - Agriculture       1    307.72 2412.8 195.10
## - Infant.Mortality  1    408.75 2513.8 197.03
## - Catholic          1    447.71 2552.8 197.75
## - Education         1   1162.56 3267.6 209.36
## 
## Step:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43

summary(backward_model_p2)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

## Both-Direction Stepwise Selection 
# This approach combines forward and backward steps — at each iteration
# it considers both adding a new variable and removing an existing one.
# It is the most flexible of the three stepwise approaches.

both_model_p2 <- stepAIC(
  full_model_p2,
  direction = "both",
  trace     = FALSE  # turned off to keep the output clean
)

summary(both_model_p2)

## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

# Comparing the final AIC of all three stepwise models
cat("\n--- AIC Comparison: Stepwise Methods ---\n")

## 
## --- AIC Comparison: Stepwise Methods ---

cat("Forward  AIC:", round(AIC(forward_model_p2),  2), "\n")

## Forward  AIC: 325.24

cat("Backward AIC:", round(AIC(backward_model_p2), 2), "\n")

## Backward AIC: 325.24

cat("Both-dir AIC:", round(AIC(both_model_p2),     2), "\n")

## Both-dir AIC: 325.24

# If all three produce the same AIC, it means they all landed on
# the same final model, which is a reassuring sign of consistency.

##Shrinkage Methods (Ridge, LASSO, Elastic Net)
# Instead of hard in-or-out decisions, these methods shrink coefficient
# values toward zero. LASSO can shrink them all the way to zero,
# which effectively removes that variable from the model.

X <- as.matrix(swiss[, -which(names(swiss) == "Fertility")])
y <- swiss$Fertility


# Ridge Regression (alpha = 0)
# All predictors stay in the model, but their coefficients are pulled
# toward zero to reduce overfitting. None reach exactly zero.

set.seed(42)
ridge_cv <- cv.glmnet(X, y, alpha = 0)

cat("\nOptimal lambda chosen by Ridge cross-validation:", round(ridge_cv$lambda.min, 4), "\n")

## 
## Optimal lambda chosen by Ridge cross-validation: 0.8203

coef(ridge_cv, s = "lambda.min")

## 6 x 1 sparse Matrix of class "dgCMatrix"
##                   lambda.min
## (Intercept)      64.45560956
## Agriculture      -0.12742717
## Examination      -0.31929466
## Education        -0.73053040
## Catholic          0.08663457
## Infant.Mortality  1.09630539

plot(ridge_cv, main = "Ridge Regression: CV Error vs Lambda")

# LASSO Regression (alpha = 1)
# This is more aggressive — it can shrink some coefficients to exactly
# zero, which means those variables are effectively removed.
# Any predictor showing a "." below was dropped by LASSO.

set.seed(42)
lasso_cv <- cv.glmnet(X, y, alpha = 1)

cat("Optimal lambda chosen by LASSO cross-validation:", round(lasso_cv$lambda.min, 4), "\n")

## Optimal lambda chosen by LASSO cross-validation: 0.0943

coef(lasso_cv, s = "lambda.min")

## 6 x 1 sparse Matrix of class "dgCMatrix"
##                  lambda.min
## (Intercept)      65.8317930
## Agriculture      -0.1554603
## Examination      -0.2474306
## Education        -0.8446493
## Catholic          0.1003220
## Infant.Mortality  1.0736754

plot(lasso_cv, main = "LASSO: CV Error vs Lambda")

# The coefficient path shows each predictor's journey as lambda grows.
# Variables that hit zero first are the least important in the model.
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE,
     main = "LASSO Coefficient Path")
legend("topright",
       legend = colnames(X),
       col    = 1:ncol(X),
       lty    = 1,
       cex    = 0.8)

set.seed(42)
enet_cv <- cv.glmnet(X, y, alpha = 0.5)

cat("Optimal lambda chosen by Elastic Net cross-validation:", round(enet_cv$lambda.min, 4), "\n")

## Optimal lambda chosen by Elastic Net cross-validation: 0.1719

coef(enet_cv, s = "lambda.min")

## 6 x 1 sparse Matrix of class "dgCMatrix"
##                   lambda.min
## (Intercept)      65.64880963
## Agriculture      -0.15167763
## Examination      -0.25819879
## Education        -0.82901443
## Catholic          0.09834202
## Infant.Mortality  1.07762299

plot(enet_cv, main = "Elastic Net: CV Error vs Lambda")

##Comparing OLS vs regularization side by side

ols_coef   <- as.vector(coef(full_model_p2))
ridge_coef <- as.vector(coef(ridge_cv, s = "lambda.min"))
lasso_coef <- as.vector(coef(lasso_cv, s = "lambda.min"))
enet_coef  <- as.vector(coef(enet_cv,  s = "lambda.min"))

var_names <- c("Intercept", colnames(X))

comparison_p2 <- data.frame(
  Variable    = var_names,
  OLS         = round(ols_coef,   3),
  Ridge       = round(ridge_coef, 3),
  LASSO       = round(lasso_coef, 3),
  Elastic_Net = round(enet_coef,  3)
)

print(comparison_p2)

##           Variable    OLS  Ridge  LASSO Elastic_Net
## 1        Intercept 66.915 64.456 65.832      65.649
## 2      Agriculture -0.172 -0.127 -0.155      -0.152
## 3      Examination -0.258 -0.319 -0.247      -0.258
## 4        Education -0.871 -0.731 -0.845      -0.829
## 5         Catholic  0.104  0.087  0.100       0.098
## 6 Infant.Mortality  1.077  1.096  1.074       1.078

##Reading this table:
#   OLS column         -> unpenalised estimates (baseline)
#   Ridge column       -> all shrunken, none are zero
#   LASSO column       -> any zero means that variable was dropped
#   Elastic Net column -> similar to LASSO but handles correlation better
#
# If a variable shows near-zero or exactly zero across Ridge, LASSO,
# and Elastic Net, that is strong evidence it can safely be removed.

## Comparing candidate OLS models (AIC, BIC, Adj R2)
# As a final check, I compare three OLS models directly using
# information criteria — this confirms what the selection methods found.

m_full <- lm(Fertility ~ .,                                   data = swiss)
m_4var <- lm(Fertility ~ Agriculture + Education +
                          Catholic + Infant.Mortality,         data = swiss)
m_3var <- lm(Fertility ~ Education + Catholic + Infant.Mortality,
                                                               data = swiss)

model_comparison <- data.frame(
  Model  = c("Full model (5 vars)",
             "Without Examination (4 vars)",
             "Top 3 predictors only"),
  AIC    = round(c(AIC(m_full), AIC(m_4var), AIC(m_3var)), 2),
  BIC    = round(c(BIC(m_full), BIC(m_4var), BIC(m_3var)), 2),
  Adj_R2 = round(c(summary(m_full)$adj.r.squared,
                   summary(m_4var)$adj.r.squared,
                   summary(m_3var)$adj.r.squared), 4)
)

print(model_comparison)

##                          Model    AIC    BIC Adj_R2
## 1          Full model (5 vars) 326.07 339.02 0.6710
## 2 Without Examination (4 vars) 325.24 336.34 0.6707
## 3        Top 3 predictors only 328.67 337.92 0.6390

ASSIGNMENT OF R PROGRAMMING

Charlotte Muteteri-20251MBI041

2026-06-04

Part 2: Variable Selection Using the Swiss Dataset