Q1. Find a data set of which you can fit multiple linear regression and interpret your results

Q1.A Data Selection and Exploratory Data Analysis

# 1. Load Data and Check Correlations, on myside I have used the builtin 'mtcars' dataset

data(mtcars) # Load the built-in dataset and checki the variable that we are going to work with
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
CarData <- mtcars[, c("mpg", "wt", "hp", "cyl", "disp")] # Select the specific variables for our regression model
head(CarData) #looking the top rows of how data are looking
##                    mpg    wt  hp cyl disp
## Mazda RX4         21.0 2.620 110   6  160
## Mazda RX4 Wag     21.0 2.875 110   6  160
## Datsun 710        22.8 2.320  93   4  108
## Hornet 4 Drive    21.4 3.215 110   6  258
## Hornet Sportabout 18.7 3.440 175   8  360
## Valiant           18.1 3.460 105   6  225
cat("\n\t Correlation Matrix\n")
## 
##   Correlation Matrix
print(round(cor(CarData), 2)) #Generate a correlation matrix to diagnose potential multicollinearity
##        mpg    wt    hp   cyl  disp
## mpg   1.00 -0.87 -0.78 -0.85 -0.85
## wt   -0.87  1.00  0.66  0.78  0.89
## hp   -0.78  0.66  1.00  0.83  0.79
## cyl  -0.85  0.78  0.83  1.00  0.90
## disp -0.85  0.89  0.79  0.90  1.00

Interpretations to Dataset Selection and Exploratory Data Analysis


To construct the multiple linear regression model, the mtcars dataset was selected to predict a vehicle’s fuel efficiency (mpg) using Weight (wt), Horsepower (hp), Cylinders (cyl), and Displacement (disp) as independent predictors.
Prior to fitting the model, a diagnostic correlation matrix was generated to test the fundamental assumption of predictor independence.
The matrix shows near-perfect positive correlations among several independent variables, most notably between disp and cyl (\(r = 0.90\)), and disp and wt (\(r = 0.89\)). This early diagnostic confirms the presence of severe multicollinearity within the feature space, which will heavily influence the stability of the raw regression coefficients in modeling phase.

Q1.B Fitting the Multiple Linear Regression Model

# 2. Fit the Multiple Linear Regression (MLR) Model

ModelRaw <- lm(mpg ~ wt + hp + cyl + disp, data = CarData) # Fit the model using mpg as the dependent variable and remain as independent predictors
cat("\n Multiple Linear Regression Summary \n") # show the static summary
## 
##  Multiple Linear Regression Summary
summary(ModelRaw)
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl + disp, data = CarData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0562 -1.4636 -0.4281  1.2854  5.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.82854    2.75747  14.807 1.76e-14 ***
## wt          -3.85390    1.01547  -3.795 0.000759 ***
## hp          -0.02054    0.01215  -1.691 0.102379    
## cyl         -1.29332    0.65588  -1.972 0.058947 .  
## disp         0.01160    0.01173   0.989 0.331386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared:  0.8486, Adjusted R-squared:  0.8262 
## F-statistic: 37.84 on 4 and 27 DF,  p-value: 1.061e-10

Model Fitting and Interpretation

Overall Validity: The model is highly significant (F = 37.84, p < 0.001) and explains 82.62% of the variance in fuel efficiency (Adjusted R-squared = 0.826).
The Key Driver: Vehicle weight (wt) is the only statistically significant predictor (p < 0.001). Holding all other factors constant, every 1,000 lb increase in weight reduces fuel efficiency by 3.85 mpg.
The Collinearity: Despite the high overall accuracy, predictors hp, cyl, and disp are statistically insignificant (\(p > 0.05\)). This contradiction is the indicator of multicollinearity. The variables are redundant, proving variable selection is strictly necessary.

Q1.C Model Diagnostics with Visual

# 3. Visual Plots

par(mfrow = c(2, 2)) # Arrange the plotting area into a 2x2 grid so all 4 plots show at once
plot(ModelRaw) # Generate the 4 standard diagnostic plots for linear regression

par(mfrow = c(1, 1)) # Reset the plotting area back to normal

Model Diagnostics and Assumption Checks


To ensure the regression results are valid, diagnostic plots were generated to test the core assumptions of Multiple Linear Regression:
Linearity (Residuals vs Fitted Plot): The trend line is relatively horizontal, indicating that the relationship between the predictors and fuel efficiency is linear.
Normality (Normal Q-Q Plot): The data points follow the dashed 45-degree line closely, confirming that the model’s residuals (errors) are normally distributed.
Homoscedasticity (Scale-Location Plot): The points are randomly spread with a relatively horizontal line, shows that the variance of the errors is constant across all values.
Outliers (Residuals vs Leverage): No points fall outside the dashed Cook’s Distance lines, meaning there are no extreme outliers inappropriately dragging the regression line.

Q2 Selecting the Best Regression Model and Variable Selection

Q2.1 Should we include all variables under study?

No. Including every available variable creates the thing I have read called “Kitchen Sink” model, which strictly violates the Principle of that the simplest effective model is always best.

Including all variables cause two statistical failures:
Overfitting: The model memorizes random noise in the training data rather than learning the actual underlying pattern. It will appear highly accurate on paper but will fail completely when predicting on new, unseen data.
Multicollinearity: Throwing all variables into a model inevitably includes highly correlated, redundant predictors. This destabilizes the algorithm, inflates standard errors, and renders individual coefficients statistically meaningless.

Q2.2 Should we add polynomial terms?

Polynomial terms should not be blindly added (e.g., squared or cubed variables) to a regression model just to artificially inflate the accuracy score. Polynomials should only be added if Exploratory Data Analysis proves a clear/ specifically scatter plots, visual curve representing a non-linear relationship. Forcing a polynomial into a real linear dataset adds complexity and destroys the model’s real-world replicability.

Q2.3 What are the standard variable selection methods?

To find the optimal model, the following selection algorithms are used:

Backward Elimination: The model starts with all variables included. The least statistically significant variable (highest p-value) is removed one by one until all remaining variables are statistically significant.
Forward Selection: The model starts completely empty. Variables are tested one by one, and the predictor that improves the AIC/BIC score the most is added. This repeats until adding new variables no longer improves the model.
Regularization: An advanced machine learning technique that applies a mathematical penalty to the regression equation, automatically shrinking the coefficients of redundant variables down to zero, effectively deleting them from the model.

Q2.4 Practical Demonstration of Selection Algorithms

let use the example we used in question one and practically resolve the multicollinearity identified in the initial raw model, three distinct variable selection algorithms is going to be used as using the mtcars dataset.

Q2.4.1 BACKWARD ELIMINATION

summary(ModelRaw) #first see the data we are going to work on
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl + disp, data = CarData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0562 -1.4636 -0.4281  1.2854  5.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.82854    2.75747  14.807 1.76e-14 ***
## wt          -3.85390    1.01547  -3.795 0.000759 ***
## hp          -0.02054    0.01215  -1.691 0.102379    
## cyl         -1.29332    0.65588  -1.972 0.058947 .  
## disp         0.01160    0.01173   0.989 0.331386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared:  0.8486, Adjusted R-squared:  0.8262 
## F-statistic: 37.84 on 4 and 27 DF,  p-value: 1.061e-10
# Using the base 'stats' package to evaluate the (AIC)
cat("\t Backward Elimination\n")
##   Backward Elimination
ModelBackward <- step(ModelRaw, direction = "backward", trace = 0) 
print(summary(ModelBackward)$coefficients)
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 38.7517874 1.78686403 21.687038 4.799399e-19
## wt          -3.1669731 0.74057588 -4.276365 1.994765e-04
## hp          -0.0180381 0.01187625 -1.518838 1.400152e-01
## cyl         -0.9416168 0.55091638 -1.709183 9.848010e-02

Interpretation: Starting with the four-variable model, the algorithm iteratively evaluated the AIC penalty. It deleted disp and cyl, identifying them as mathematically redundant. The final optimized model retained only Weight (wt) and Horsepower (hp), successfully balancing simplicity and predictive power.

Q2.4.2 FORWARD SELECTION

# We must start with a completely empty "null" model (intercept only)
cat("\nForward Selection \n")
## 
## Forward Selection
ModelNull <- lm(mpg ~ 1, data = CarData) 
ModelForward <- step(ModelNull, direction = "forward", 
                      scope = formula(ModelRaw), trace = 0)
print(summary(ModelForward)$coefficients)
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 38.7517874 1.78686403 21.687038 4.799399e-19
## wt          -3.1669731 0.74057588 -4.276365 1.994765e-04
## cyl         -0.9416168 0.55091638 -1.709183 9.848010e-02
## hp          -0.0180381 0.01187625 -1.518838 1.400152e-01

Interpretation: Starting with an empty model, the algorithm tested each variable independently. It first added Weight (wt) as it provided the maximum reduction in the AIC penalty. It subsequently added cyl and hp, but firmly rejected disp, proving that disp offers zero unique predictive value when the other variables are already present.

Q2.4.3 LASSO REGRESSION

library(glmnet) # Requires the 'glmnet' package to apply the L1 mathematical penalty
## Loading required package: Matrix
## Loaded glmnet 5.0
cat("\n Lasso Regression Coefficients \n") # Lasso requires formatting the data as a matrix
## 
##  Lasso Regression Coefficients
XMatrix <- model.matrix(mpg ~ wt + hp + cyl + disp - 1, data = CarData)
YMatrix <- CarData$mpg

# Use cross-validation to find the optimal penalty threshold (lambda)
set.seed(123) 
lasso_model <- cv.glmnet(XMatrix, YMatrix, alpha = 1)
print(coef(lasso_model, s = "lambda.min")) # Extract and print the final penalized coefficients
## 5 x 1 sparse Matrix of class "dgCMatrix"
##             lambda.min
## (Intercept) 37.9312741
## wt          -3.0280155
## hp          -0.0161012
## cyl         -0.9271792
## disp         .

Interpretation: To test a more advanced machine learning penalty, L1 Regularization was applied using cross-validation to find the optimal penalty threshold (\(\lambda\)). Unlike stepwise methods that simply delete variables, Lasso applies a mathematical weight that shrinks redundant coefficients down to zero. The Lasso model aggressively collapsed the disp coefficient to 0.000, effectively deleting it from the equation, while heavily penalizing cyl