# 1. Load Data and Check Correlations, on myside I have used the builtin 'mtcars' dataset
data(mtcars) # Load the built-in dataset and checki the variable that we are going to work with
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
CarData <- mtcars[, c("mpg", "wt", "hp", "cyl", "disp")] # Select the specific variables for our regression model
head(CarData) #looking the top rows of how data are looking
## mpg wt hp cyl disp
## Mazda RX4 21.0 2.620 110 6 160
## Mazda RX4 Wag 21.0 2.875 110 6 160
## Datsun 710 22.8 2.320 93 4 108
## Hornet 4 Drive 21.4 3.215 110 6 258
## Hornet Sportabout 18.7 3.440 175 8 360
## Valiant 18.1 3.460 105 6 225
cat("\n\t Correlation Matrix\n")
##
## Correlation Matrix
print(round(cor(CarData), 2)) #Generate a correlation matrix to diagnose potential multicollinearity
## mpg wt hp cyl disp
## mpg 1.00 -0.87 -0.78 -0.85 -0.85
## wt -0.87 1.00 0.66 0.78 0.89
## hp -0.78 0.66 1.00 0.83 0.79
## cyl -0.85 0.78 0.83 1.00 0.90
## disp -0.85 0.89 0.79 0.90 1.00
To construct the multiple linear regression model, the mtcars
dataset was selected to predict a vehicle’s fuel efficiency (mpg) using
Weight (wt), Horsepower (hp), Cylinders (cyl), and Displacement
(disp) as independent predictors.
Prior to fitting the model,
a diagnostic correlation matrix was generated to test the fundamental
assumption of predictor independence.
The matrix shows near-perfect
positive correlations among several independent variables, most notably
between disp and cyl (\(r = 0.90\)),
and disp and wt (\(r = 0.89\)). This
early diagnostic confirms the presence of severe multicollinearity
within the feature space, which will heavily influence the stability of
the raw regression coefficients in modeling phase.
# 2. Fit the Multiple Linear Regression (MLR) Model
ModelRaw <- lm(mpg ~ wt + hp + cyl + disp, data = CarData) # Fit the model using mpg as the dependent variable and remain as independent predictors
cat("\n Multiple Linear Regression Summary \n") # show the static summary
##
## Multiple Linear Regression Summary
summary(ModelRaw)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl + disp, data = CarData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## wt -3.85390 1.01547 -3.795 0.000759 ***
## hp -0.02054 0.01215 -1.691 0.102379
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
Overall Validity: The model is highly significant (F
= 37.84, p < 0.001) and explains 82.62% of the variance in fuel
efficiency (Adjusted R-squared = 0.826).
The Key
Driver: Vehicle weight (wt) is the only statistically
significant predictor (p < 0.001). Holding all other factors
constant, every 1,000 lb increase in weight reduces fuel efficiency by
3.85 mpg.
The Collinearity: Despite the high
overall accuracy, predictors hp, cyl, and disp are statistically
insignificant (\(p > 0.05\)). This
contradiction is the indicator of multicollinearity. The variables are
redundant, proving variable selection is strictly necessary.
# 3. Visual Plots
par(mfrow = c(2, 2)) # Arrange the plotting area into a 2x2 grid so all 4 plots show at once
plot(ModelRaw) # Generate the 4 standard diagnostic plots for linear regression
par(mfrow = c(1, 1)) # Reset the plotting area back to normal
To ensure the regression results are valid, diagnostic plots
were generated to test the core assumptions of Multiple Linear
Regression:
Linearity (Residuals vs Fitted Plot):
The trend line is relatively horizontal, indicating that the
relationship between the predictors and fuel efficiency is linear.
Normality (Normal Q-Q Plot): The data points follow the
dashed 45-degree line closely, confirming that the model’s residuals
(errors) are normally distributed.
Homoscedasticity
(Scale-Location Plot): The points are randomly spread with a
relatively horizontal line, shows that the variance of the errors is
constant across all values.
Outliers (Residuals vs
Leverage): No points fall outside the dashed Cook’s Distance
lines, meaning there are no extreme outliers inappropriately dragging
the regression line.
No. Including every available variable creates the thing I have read called “Kitchen Sink” model, which strictly violates the Principle of that the simplest effective model is always best.
Including all variables cause two statistical
failures:
Overfitting: The model
memorizes random noise in the training data rather than learning the
actual underlying pattern. It will appear highly accurate on paper but
will fail completely when predicting on new, unseen data.
Multicollinearity: Throwing all variables into a model
inevitably includes highly correlated, redundant predictors. This
destabilizes the algorithm, inflates standard errors, and renders
individual coefficients statistically meaningless.
Polynomial terms should not be blindly added (e.g., squared or cubed variables) to a regression model just to artificially inflate the accuracy score. Polynomials should only be added if Exploratory Data Analysis proves a clear/ specifically scatter plots, visual curve representing a non-linear relationship. Forcing a polynomial into a real linear dataset adds complexity and destroys the model’s real-world replicability.
To find the optimal model, the following selection algorithms are used:
Backward Elimination: The model starts with all
variables included. The least statistically significant variable
(highest p-value) is removed one by one until all remaining variables
are statistically significant.
Forward Selection:
The model starts completely empty. Variables are tested one by one, and
the predictor that improves the AIC/BIC score the most is added. This
repeats until adding new variables no longer improves the model.
Regularization: An advanced machine learning technique
that applies a mathematical penalty to the regression equation,
automatically shrinking the coefficients of redundant variables down to
zero, effectively deleting them from the model.
let use the example we used in question one and practically resolve the multicollinearity identified in the initial raw model, three distinct variable selection algorithms is going to be used as using the mtcars dataset.
summary(ModelRaw) #first see the data we are going to work on
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl + disp, data = CarData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## wt -3.85390 1.01547 -3.795 0.000759 ***
## hp -0.02054 0.01215 -1.691 0.102379
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
# Using the base 'stats' package to evaluate the (AIC)
cat("\t Backward Elimination\n")
## Backward Elimination
ModelBackward <- step(ModelRaw, direction = "backward", trace = 0)
print(summary(ModelBackward)$coefficients)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7517874 1.78686403 21.687038 4.799399e-19
## wt -3.1669731 0.74057588 -4.276365 1.994765e-04
## hp -0.0180381 0.01187625 -1.518838 1.400152e-01
## cyl -0.9416168 0.55091638 -1.709183 9.848010e-02
Interpretation: Starting with the four-variable model, the algorithm iteratively evaluated the AIC penalty. It deleted disp and cyl, identifying them as mathematically redundant. The final optimized model retained only Weight (wt) and Horsepower (hp), successfully balancing simplicity and predictive power.
# We must start with a completely empty "null" model (intercept only)
cat("\nForward Selection \n")
##
## Forward Selection
ModelNull <- lm(mpg ~ 1, data = CarData)
ModelForward <- step(ModelNull, direction = "forward",
scope = formula(ModelRaw), trace = 0)
print(summary(ModelForward)$coefficients)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7517874 1.78686403 21.687038 4.799399e-19
## wt -3.1669731 0.74057588 -4.276365 1.994765e-04
## cyl -0.9416168 0.55091638 -1.709183 9.848010e-02
## hp -0.0180381 0.01187625 -1.518838 1.400152e-01
Interpretation: Starting with an empty model, the algorithm tested each variable independently. It first added Weight (wt) as it provided the maximum reduction in the AIC penalty. It subsequently added cyl and hp, but firmly rejected disp, proving that disp offers zero unique predictive value when the other variables are already present.
library(glmnet) # Requires the 'glmnet' package to apply the L1 mathematical penalty
## Loading required package: Matrix
## Loaded glmnet 5.0
cat("\n Lasso Regression Coefficients \n") # Lasso requires formatting the data as a matrix
##
## Lasso Regression Coefficients
XMatrix <- model.matrix(mpg ~ wt + hp + cyl + disp - 1, data = CarData)
YMatrix <- CarData$mpg
# Use cross-validation to find the optimal penalty threshold (lambda)
set.seed(123)
lasso_model <- cv.glmnet(XMatrix, YMatrix, alpha = 1)
print(coef(lasso_model, s = "lambda.min")) # Extract and print the final penalized coefficients
## 5 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 37.9312741
## wt -3.0280155
## hp -0.0161012
## cyl -0.9271792
## disp .
Interpretation: To test a more advanced machine learning penalty, L1 Regularization was applied using cross-validation to find the optimal penalty threshold (\(\lambda\)). Unlike stepwise methods that simply delete variables, Lasso applies a mathematical weight that shrinks redundant coefficients down to zero. The Lasso model aggressively collapsed the disp coefficient to 0.000, effectively deleting it from the equation, while heavily penalizing cyl