Homework

This assignment has two parts:

Part 1 (HW 1): Find a dataset, fit a Multiple Linear Regression (MLR) model, and interpret the results.
Part 2 (HW 2): Read about and demonstrate Variable Selection Methods used in regression.

1 Multiple Linear Regression

Dataset Description

I used the Student Performance Dataset from Kaggle student-performance-multiple-linear-regression.

This dataset contains 10,000 student records with the following variables:

Variable	Type	Description
`Hours.Studied`	Numeric	Total hours spent studying
`Previous.Scores`	Numeric	Scores from previous tests
`Extracurricular.Activities`	Categorical	Yes / No
`Sleep.Hours`	Numeric	Average daily sleep hours
`Sample.Question.Papers.Practiced`	Numeric	Number of practice papers done
`Performance.Index`	Numeric	Target / Response variable

Goal: Predict Performance.Index using all other variables as predictors.

Loading and Inspecting the Data

# Read the dataset from CSV file

# Dataset: Student Performance.csv
df <- read.csv("Student Performance.csv")


# Show first 6 rows of the dataset
head(df)

##   Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## 1             7              99                        Yes           9
## 2             4              82                         No           4
## 3             8              51                        Yes           7
## 4             5              52                        Yes           5
## 5             7              75                         No           8
## 6             3              78                         No           9
##   Sample.Question.Papers.Practiced Performance.Index
## 1                                1                91
## 2                                2                65
## 3                                2                45
## 4                                2                36
## 5                                5                66
## 6                                6                61

# Check the structure (column types, number of rows and columns)
str(df)

## 'data.frame':    10000 obs. of  6 variables:
##  $ Hours.Studied                   : int  7 4 8 5 7 3 7 8 5 4 ...
##  $ Previous.Scores                 : int  99 82 51 52 75 78 73 45 77 89 ...
##  $ Extracurricular.Activities      : chr  "Yes" "No" "Yes" "Yes" ...
##  $ Sleep.Hours                     : int  9 4 7 5 8 9 5 4 8 4 ...
##  $ Sample.Question.Papers.Practiced: int  1 2 2 2 5 6 6 6 2 0 ...
##  $ Performance.Index               : num  91 65 45 36 66 61 63 42 61 69 ...

# Show summary statistics for all variables
summary(df)

##  Hours.Studied   Previous.Scores Extracurricular.Activities  Sleep.Hours   
##  Min.   :1.000   Min.   :40.00   Length   :10000            Min.   :4.000  
##  1st Qu.:3.000   1st Qu.:54.00   N.unique :    2            1st Qu.:5.000  
##  Median :5.000   Median :69.00   N.blank  :    0            Median :7.000  
##  Mean   :4.993   Mean   :69.45   Min.nchar:    2            Mean   :6.531  
##  3rd Qu.:7.000   3rd Qu.:85.00   Max.nchar:    3            3rd Qu.:8.000  
##  Max.   :9.000   Max.   :99.00                              Max.   :9.000  
##  Sample.Question.Papers.Practiced Performance.Index
##  Min.   :0.000                    Min.   : 10.00   
##  1st Qu.:2.000                    1st Qu.: 40.00   
##  Median :5.000                    Median : 55.00   
##  Mean   :4.583                    Mean   : 55.22   
##  3rd Qu.:7.000                    3rd Qu.: 71.00   
##  Max.   :9.000                    Max.   :100.00

# Check if there are any missing values in each column
colSums(is.na(df))

##                    Hours.Studied                  Previous.Scores 
##                                0                                0 
##       Extracurricular.Activities                      Sleep.Hours 
##                                0                                0 
## Sample.Question.Papers.Practiced                Performance.Index 
##                                0                                0

The dataset has no missing values, so no cleaning is needed.

Exploratory Data Analysis (EDA)

Distribution of the Response Variable

# Plot histogram of the response variable: Performance.Index
ggplot(df, aes(x = Performance.Index)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Performance Index",
    x     = "Performance Index",
    y     = "Count"
  ) +
  theme_minimal()

The Performance Index appears roughly normally distributed, which is a good sign for linear regression.

Scatter Plots — Predictors vs Response

# Scatter plot: Hours Studied vs Performance Index
ggplot(df, aes(x = Hours.Studied, y = Performance.Index)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Hours Studied vs Performance Index",
    x     = "Hours Studied",
    y     = "Performance Index"
  ) +
  theme_minimal()

# Scatter plot: Previous Scores vs Performance Index
ggplot(df, aes(x = Previous.Scores, y = Performance.Index)) +
  geom_point(alpha = 0.3, color = "darkorange") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Previous Scores vs Performance Index",
    x     = "Previous Scores",
    y     = "Performance Index"
  ) +
  theme_minimal()

# Scatter plot: Sleep Hours vs Performance Index
ggplot(df, aes(x = Sleep.Hours, y = Performance.Index)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Sleep Hours vs Performance Index",
    x     = "Sleep Hours",
    y     = "Performance Index"
  ) +
  theme_minimal()

# Boxplot: Extracurricular Activities vs Performance Index
ggplot(df, aes(x = Extracurricular.Activities, y = Performance.Index,
               fill = Extracurricular.Activities)) +
  geom_boxplot() +
  labs(
    title = "Performance Index by Extracurricular Activities",
    x     = "Extracurricular Activities",
    y     = "Performance Index"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Correlation Matrix

# Select only numeric columns for correlation analysis

numeric_cols <- df %>%
  dplyr::select(Hours.Studied, Previous.Scores, Sleep.Hours,
         Sample.Question.Papers.Practiced, Performance.Index)

# Compute correlation matrix
cor_matrix <- cor(numeric_cols)

# Print the correlation matrix as a table
kable(round(cor_matrix, 3), caption = "Correlation Matrix of Numeric Variables")

Correlation Matrix of Numeric Variables
	Hours.Studied	Previous.Scores	Sleep.Hours	Sample.Question.Papers.Practiced	Performance.Index
Hours.Studied	1.000	-0.012	0.001	0.017	0.374
Previous.Scores	-0.012	1.000	0.006	0.008	0.915
Sleep.Hours	0.001	0.006	1.000	0.004	0.048
Sample.Question.Papers.Practiced	0.017	0.008	0.004	1.000	0.043
Performance.Index	0.374	0.915	0.048	0.043	1.000

# Visual correlation plot
corrplot(
  cor_matrix,
  method  = "color",    
  type    = "upper",    
  addCoef.col = "black", 
  tl.col  = "black",   
  tl.srt  = 45,        
  title   = "Correlation Plot",
  mar     = c(0, 0, 1, 0)
)

Hours.Studied and Previous.Scores show the strongest positive correlation with Performance.Index.

Encoding Categorical Variable

# Convert Extracurricular.Activities from "Yes"/"No" to 1/0

df <- df %>%
  mutate(Extracurricular.Activities = ifelse(
    Extracurricular.Activities == "Yes", 1, 0
  ))

# Confirm the encoding worked
print(paste("Unique values after encoding:",
            paste(unique(df$Extracurricular.Activities), collapse = ", ")))

## [1] "Unique values after encoding: 1, 0"

Fitting the Multiple Linear Regression Model

# Fit multiple linear regression model

# Predictors: all other columns (using the dot shortcut)
mlr_model <- lm(Performance.Index ~ ., data = df)

# Print model summary
summary(mlr_model)

## 
## Call:
## lm(formula = Performance.Index ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6333 -1.3684 -0.0311  1.3556  8.7932 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -34.075588   0.127143 -268.01   <2e-16 ***
## Hours.Studied                      2.852982   0.007873  362.35   <2e-16 ***
## Previous.Scores                    1.018434   0.001175  866.45   <2e-16 ***
## Extracurricular.Activities         0.612898   0.040781   15.03   <2e-16 ***
## Sleep.Hours                        0.480560   0.012022   39.97   <2e-16 ***
## Sample.Question.Papers.Practiced   0.193802   0.007110   27.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9887 
## F-statistic: 1.757e+05 on 5 and 9994 DF,  p-value: < 2.2e-16

Interpreting the Results

Coefficients Table

# Extract coefficients and display as a clean table
coef_table <- as.data.frame(summary(mlr_model)$coefficients)

# Round to 4 decimal places for readability
coef_table <- round(coef_table, 4)

# Display table
kable(coef_table, caption = "MLR Coefficients, Standard Errors, and p-values")

MLR Coefficients, Standard Errors, and p-values
	Estimate	Std. Error	t value
(Intercept)	-34.0756	0.1271	-268.0104
Hours.Studied	2.8530	0.0079	362.3534
Previous.Scores	1.0184	0.0012	866.4500
Extracurricular.Activities	0.6129	0.0408	15.0292
Sleep.Hours	0.4806	0.0120	39.9725
Sample.Question.Papers.Practiced	0.1938	0.0071	27.2566

Interpretation of Each Coefficient

## INTERCEPT (-33.5749):

##   When all predictors are zero, the predicted Performance Index is -33.57.

##   (This is a mathematical baseline, not practically meaningful.)

## Hours.Studied (2.8527):

##   Each additional hour of study INCREASES Performance Index by 2.85 points,

##   holding all other variables constant. [Significant, p < 0.05]

## Previous.Scores (1.0186):

##   Each 1-point increase in previous scores INCREASES Performance Index by 1.02,

##   holding all other variables constant. [Significant, p < 0.05]

## Extracurricular.Activities (0.6016):

##   Students who participate in extracurricular activities score 0.60 HIGHER

##   on the Performance Index compared to those who do not. [Significant, p < 0.05]

## Sleep.Hours (0.4810):

##   Each additional hour of sleep INCREASES Performance Index by 0.48 points.

##   [Significant, p < 0.05]

## Sample.Question.Papers.Practiced (0.1940):

##   Each additional practice paper INCREASES Performance Index by 0.19 points.

##   [Significant, p < 0.05]

Model Fit Statistics

# Extract R-squared, Adjusted R-squared, and F-statistic from model summary
model_summary <- summary(mlr_model)

# Print the key statistics
print(paste("R-squared:          ", round(model_summary$r.squared, 4)))

## [1] "R-squared:           0.9888"

print(paste("Adjusted R-squared: ", round(model_summary$adj.r.squared, 4)))

## [1] "Adjusted R-squared:  0.9887"

print(paste("F-statistic:        ", round(model_summary$fstatistic[1], 2)))

## [1] "F-statistic:         175709.15"

print(paste("Residual Std Error: ", round(model_summary$sigma, 4)))

## [1] "Residual Std Error:  2.0381"

R² = 0.9887 means the model explains 98.87% of the variation in student Performance Index — an excellent fit.

The F-statistic is very large with p < 0.001, meaning the model as a whole is statistically significant.

Model Diagnostics

Checking Assumptions (Residual Plots)

# Set layout to show 4 diagnostic plots at once
par(mfrow = c(2, 2))

# Plot all 4 standard diagnostic plots for the model
plot(mlr_model)

# Reset layout back to 1 plot
par(mfrow = c(1, 1))

Interpretation of diagnostic plots:

Plot	What to Check	What We See
Residuals vs Fitted	Should be random scatter (no pattern)	Random — linearity holds
Normal Q-Q	Points should follow the diagonal line	Roughly normal residuals
Scale-Location	Should be horizontal band	Variance is fairly constant
Residuals vs Leverage	Watch for points outside Cook’s distance	No extreme outliers

Checking Multicollinearity (VIF)

# VIF (Variance Inflation Factor) measures multicollinearity
# Rule: VIF > 5 or 10 means problematic multicollinearity
vif_values <- vif(mlr_model)

# Print VIF values
kable(
  data.frame(Variable = names(vif_values), VIF = round(vif_values, 3)),
  caption = "Variance Inflation Factors (VIF)"
)

Variance Inflation Factors (VIF)
	Variable	VIF
Hours.Studied	Hours.Studied	1.000
Previous.Scores	Previous.Scores	1.000
Extracurricular.Activities	Extracurricular.Activities	1.001
Sleep.Hours	Sleep.Hours	1.001
Sample.Question.Papers.Practiced	Sample.Question.Papers.Practiced	1.001

All VIF values are close to 1.0, which means there is no multicollinearity problem among the predictors.

Predicted vs Actual Plot

# Add predicted values to the data frame
df$predicted <- predict(mlr_model)

# Plot actual vs predicted values
ggplot(df, aes(x = Performance.Index, y = predicted)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(
    title = "Actual vs Predicted Performance Index",
    x     = "Actual Performance Index",
    y     = "Predicted Performance Index"
  ) +
  theme_minimal()

The points closely follow the red diagonal line, confirming the model predicts very well.

2 Variable Selection Methods

Why Variable Selection?

In multiple linear regression, including too many predictors can cause:

Overfitting — the model fits noise, not the true signal
Multicollinearity — correlated predictors confuse the model
Reduced interpretability — too many variables are hard to explain

Variable selection helps find the best subset of predictors that balances model fit and simplicity.

Method 1 Best Subset Selection

Best subset selection tries every possible combination of predictors and picks the best model for each size (1 predictor, 2 predictors, etc.).

# Remove the predicted column we added earlier before running selection

df_clean <- df %>% dplyr::select(-predicted)

# Run best subset selection using regsubsets()
# nvmax = 5 means consider up to 5 predictors
best_subset <- regsubsets(
  Performance.Index ~ .,
  data  = df_clean,
  nvmax = 5
)

# Get summary of best subset results
subset_summary <- summary(best_subset)

# Print which variables are selected at each model size
print(subset_summary$outmat)

##          Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## 1  ( 1 ) " "           "*"             " "                        " "        
## 2  ( 1 ) "*"           "*"             " "                        " "        
## 3  ( 1 ) "*"           "*"             " "                        "*"        
## 4  ( 1 ) "*"           "*"             " "                        "*"        
## 5  ( 1 ) "*"           "*"             "*"                        "*"        
##          Sample.Question.Papers.Practiced
## 1  ( 1 ) " "                             
## 2  ( 1 ) " "                             
## 3  ( 1 ) " "                             
## 4  ( 1 ) "*"                             
## 5  ( 1 ) "*"

# Plot RSS, Adjusted R2, Cp, and BIC to find the best number of predictors
par(mfrow = c(2, 2))

# RSS: lower is better
plot(subset_summary$rss,
     xlab = "Number of Predictors",
     ylab = "RSS",
     type = "b",
     main = "RSS")

# Adjusted R-squared: higher is better
plot(subset_summary$adjr2,
     xlab = "Number of Predictors",
     ylab = "Adjusted R²",
     type = "b",
     main = "Adjusted R²")
points(which.max(subset_summary$adjr2),
       max(subset_summary$adjr2),
       col = "red", cex = 2, pch = 20)

# Cp: lower is better, close to number of predictors
plot(subset_summary$cp,
     xlab = "Number of Predictors",
     ylab = "Cp",
     type = "b",
     main = "Mallows' Cp")
points(which.min(subset_summary$cp),
       min(subset_summary$cp),
       col = "red", cex = 2, pch = 20)

# BIC: lower is better
plot(subset_summary$bic,
     xlab = "Number of Predictors",
     ylab = "BIC",
     type = "b",
     main = "BIC")
points(which.min(subset_summary$bic),
       min(subset_summary$bic),
       col = "red", cex = 2, pch = 20)

par(mfrow = c(1, 1))

# Find the best number of predictors based on Adjusted R2 and BIC
best_adjr2 <- which.max(subset_summary$adjr2)
best_bic    <- which.min(subset_summary$bic)

print(paste("Best model by Adjusted R²: uses", best_adjr2, "predictors"))

## [1] "Best model by Adjusted R²: uses 5 predictors"

print(paste("Best model by BIC:         uses", best_bic,    "predictors"))

## [1] "Best model by BIC:         uses 5 predictors"

Best Subset Result: All 5 predictors are needed — removing any one of them reduces model quality. This confirms our full model is already optimal.

Method 2 Stepwise Selection (Forward, Backward, Both)

Stepwise selection adds or removes predictors one at a time based on AIC (Akaike Information Criterion). Lower AIC = better model.

Forward Selection

Starts with no predictors, adds them one by one.

# Null model (no predictors, only intercept)
null_model <- lm(Performance.Index ~ 1, data = df_clean)

# Full model (all predictors)
full_model <- lm(Performance.Index ~ ., data = df_clean)

# Forward stepwise selection
forward_model <- stepAIC(
  null_model,
  scope     = list(lower = null_model, upper = full_model),
  direction = "forward",
  trace     = FALSE   # suppress step-by-step output
)

# Show final selected model
summary(forward_model)

## 
## Call:
## lm(formula = Performance.Index ~ Previous.Scores + Hours.Studied + 
##     Sleep.Hours + Sample.Question.Papers.Practiced + Extracurricular.Activities, 
##     data = df_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6333 -1.3684 -0.0311  1.3556  8.7932 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -34.075588   0.127143 -268.01   <2e-16 ***
## Previous.Scores                    1.018434   0.001175  866.45   <2e-16 ***
## Hours.Studied                      2.852982   0.007873  362.35   <2e-16 ***
## Sleep.Hours                        0.480560   0.012022   39.97   <2e-16 ***
## Sample.Question.Papers.Practiced   0.193802   0.007110   27.26   <2e-16 ***
## Extracurricular.Activities         0.612898   0.040781   15.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9887 
## F-statistic: 1.757e+05 on 5 and 9994 DF,  p-value: < 2.2e-16

Backward Elimination

Starts with all predictors, removes the least significant one at a time.

# Backward stepwise elimination
backward_model <- stepAIC(
  full_model,
  direction = "backward",
  trace     = FALSE   # suppress step-by-step output
)

# Show final selected model
summary(backward_model)

## 
## Call:
## lm(formula = Performance.Index ~ Hours.Studied + Previous.Scores + 
##     Extracurricular.Activities + Sleep.Hours + Sample.Question.Papers.Practiced, 
##     data = df_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6333 -1.3684 -0.0311  1.3556  8.7932 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -34.075588   0.127143 -268.01   <2e-16 ***
## Hours.Studied                      2.852982   0.007873  362.35   <2e-16 ***
## Previous.Scores                    1.018434   0.001175  866.45   <2e-16 ***
## Extracurricular.Activities         0.612898   0.040781   15.03   <2e-16 ***
## Sleep.Hours                        0.480560   0.012022   39.97   <2e-16 ***
## Sample.Question.Papers.Practiced   0.193802   0.007110   27.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9887 
## F-statistic: 1.757e+05 on 5 and 9994 DF,  p-value: < 2.2e-16

Both Directions (Stepwise)

Combines forward and backward — each step can either add or remove a predictor.

# Stepwise selection in both directions
both_model <- stepAIC(
  null_model,
  scope     = list(lower = null_model, upper = full_model),
  direction = "both",
  trace     = FALSE   # suppress step-by-step output
)

# Show final selected model
summary(both_model)

## 
## Call:
## lm(formula = Performance.Index ~ Previous.Scores + Hours.Studied + 
##     Sleep.Hours + Sample.Question.Papers.Practiced + Extracurricular.Activities, 
##     data = df_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6333 -1.3684 -0.0311  1.3556  8.7932 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -34.075588   0.127143 -268.01   <2e-16 ***
## Previous.Scores                    1.018434   0.001175  866.45   <2e-16 ***
## Hours.Studied                      2.852982   0.007873  362.35   <2e-16 ***
## Sleep.Hours                        0.480560   0.012022   39.97   <2e-16 ***
## Sample.Question.Papers.Practiced   0.193802   0.007110   27.26   <2e-16 ***
## Extracurricular.Activities         0.612898   0.040781   15.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9887 
## F-statistic: 1.757e+05 on 5 and 9994 DF,  p-value: < 2.2e-16

Method 3 Comparing Methods by AIC and BIC

# Compare the three stepwise models using AIC and BIC
model_comparison <- data.frame(
  Method   = c("Forward", "Backward", "Both (Stepwise)"),
  AIC      = round(c(AIC(forward_model), AIC(backward_model), AIC(both_model)), 2),
  BIC      = round(c(BIC(forward_model), BIC(backward_model), BIC(both_model)), 2),
  Adj_R2   = round(c(
    summary(forward_model)$adj.r.squared,
    summary(backward_model)$adj.r.squared,
    summary(both_model)$adj.r.squared
  ), 4)
)

# Print comparison table
kable(model_comparison, caption = "Comparison of Variable Selection Methods")

Comparison of Variable Selection Methods
Method	AIC	BIC	Adj_R2
Forward	42627.11	42677.58	0.9887
Backward	42627.11	42677.58	0.9887
Both (Stepwise)	42627.11	42677.58	0.9887

Summary Table Variable Selection Methods

Method	How It Works	Advantage	Disadvantage
Best Subset	Tries all 2^p combinations	Guaranteed optimal subset	Very slow for large p
Forward Selection	Adds variables one by one (by AIC)	Fast, good start	May miss interactions
Backward Elimination	Removes variables one by one	Starts from full model	Can be slow
Stepwise (Both)	Adds and removes at each step	Flexible	Not guaranteed optimal
LASSO	Shrinks coefficients to zero	Handles many predictors	Requires tuning lambda
Ridge	Shrinks coefficients (not to zero)	Handles multicollinearity	Doesn’t remove variables

Key Criteria Used in Variable Selection

Criterion	Formula	Rule
AIC	\(-2\log L + 2k\)	Lower is better
BIC	\(-2\log L + k \log n\)	Lower is better (penalizes more)
Adjusted R²	\(1 - \frac{SS_{res}/(n-k-1)}{SS_{tot}/(n-1)}\)	Higher is better
Mallows’ Cp	\(\frac{SS_{res}}{\hat{\sigma}^2} - n + 2k\)	Close to k is ideal

1 Summary

We fit a Multiple Linear Regression model on the Student Performance dataset from Kaggle.
All 5 predictors (Hours.Studied, Previous.Scores, Extracurricular.Activities, Sleep.Hours, Sample.Question.Papers.Practiced) were statistically significant (p < 0.05).
The model achieved R² = 0.9887, meaning it explains 98.87% of the variance in student performance.
Diagnostic plots confirmed no major violations of regression assumptions.
VIF values near 1 confirmed there is no multicollinearity.

2 Summary

Best Subset Selection evaluates all possible combinations and picks the optimal model at each size — it confirmed all 5 predictors are needed.
Forward, Backward, and Stepwise methods all arrived at the same full model, with identical AIC, BIC, and Adjusted R².
The choice of selection method matters more when some predictors are weak or correlated. In this dataset, all predictors are genuinely important.
In practice, AIC prefers predictive accuracy while BIC prefers simpler models.

Multiple Linear Regression & Variable Selection Methods

NYINAWUMUNTU Zainab Suleiman

2026-06-04