Introduction

Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between one dependent variable and multiple independent variables.
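In standard notation, the model described above can be written as follows. The symbols are conventional choices, not notation taken from this report: y_i is the dependent variable for observation i, x_{i1}, ..., x_{ip} are the p independent variables, and \varepsilon_i is the error term.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n
```

The coefficients \beta_0, ..., \beta_p are usually estimated by least squares, which is what lm() does in R.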

In this study, the built-in R dataset women is used. The objective is to predict a woman’s weight based on height and other derived numerical variables.

Loading Required Packages

library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded

Loading the Dataset

data(women)

head(women)
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

Dataset Description

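The women dataset contains 15 observations on two numeric variables, height (in inches) and weight (in pounds), for American women aged 30 to 39: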

str(women)
## 'data.frame':    15 obs. of  2 variables:
##  $ height: num  58 59 60 61 62 63 64 65 66 67 ...
##  $ weight: num  115 117 120 123 126 129 132 135 139 142 ...

Creating Additional Numerical Predictors

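To demonstrate Multiple Linear Regression, two additional predictors are derived: height_sq from height, and bmi_index from weight and height. Because bmi_index is computed from the response variable, the resulting model is suited to illustrating the mechanics of MLR rather than to genuine prediction.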

women$height_sq <- women$height^2

women$bmi_index <- women$weight/(women$height^2)*1000

head(women)
##   height weight height_sq bmi_index
## 1     58    115      3364  34.18549
## 2     59    117      3481  33.61103
## 3     60    120      3600  33.33333
## 4     61    123      3721  33.05563
## 5     62    126      3844  32.77836
## 6     63    129      3969  32.50189
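Since bmi_index is built from weight, weight can be recovered exactly from bmi_index and height_sq. The following is a minimal sketch of that check; the names w, reconstructed, and leak_check are illustrative and do not appear elsewhere in this report.

```r
# Work on a local copy so the original data frame is untouched
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / w$height_sq * 1000

# Invert the bmi_index formula: weight = bmi_index * height_sq / 1000
reconstructed <- w$bmi_index * w$height_sq / 1000

# TRUE when the reconstruction matches weight up to floating-point error
leak_check <- isTRUE(all.equal(reconstructed, w$weight))
print(leak_check)
```

This is why a model containing both predictors can fit weight almost perfectly.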

Summary Statistics

summary(women)
##      height         weight        height_sq      bmi_index    
##  Min.   :58.0   Min.   :115.0   Min.   :3364   Min.   :31.43  
##  1st Qu.:61.5   1st Qu.:124.5   1st Qu.:3782   1st Qu.:31.60  
##  Median :65.0   Median :135.0   Median :4225   Median :31.95  
##  Mean   :65.0   Mean   :136.7   Mean   :4244   Mean   :32.32  
##  3rd Qu.:68.5   3rd Qu.:148.0   3rd Qu.:4692   3rd Qu.:32.92  
##  Max.   :72.0   Max.   :164.0   Max.   :5184   Max.   :34.19

Correlation Analysis

cor_matrix <- cor(women)

cor_matrix
##               height     weight  height_sq  bmi_index
## height     1.0000000  0.9954948  0.9995644 -0.9380678
## weight     0.9954948  1.0000000  0.9977763 -0.9011408
## height_sq  0.9995644  0.9977763  1.0000000 -0.9276567
## bmi_index -0.9380678 -0.9011408 -0.9276567  1.0000000
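Pairwise correlations this close to 1 or -1 suggest strong multicollinearity among the predictors. One common way to quantify it is the variance inflation factor (VIF), sketched below with base R only. The names w, predictors, and vif_manual are illustrative assumptions, and the threshold of roughly 10 often quoted for "serious" multicollinearity is a rule of thumb, not a formal test.

```r
# Rebuild the derived predictors on a local copy of the data
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / w$height_sq * 1000

predictors <- c("height", "height_sq", "bmi_index")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = w))$r.squared
  1 / (1 - r2)
})

print(vif_manual)
```

Large VIF values mean the individual coefficient estimates have inflated standard errors, which matters when interpreting them one at a time.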

Correlation Plot

corrplot(cor_matrix,
         method = "number")

Scatter Plot Matrix

pairs(women,
      main = "Scatter Plot Matrix")

Multiple Linear Regression Model

The response variable is weight.

The predictor variables are height, height_sq, and bmi_index.

model <- lm(weight ~ height + height_sq + bmi_index,
            data = women)

summary(model)
## 
## Call:
## lm(formula = weight ~ height + height_sq + bmi_index, data = women)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32331 -0.09292  0.05985  0.07660  0.20529 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.469e+02  7.755e+01  -4.473 0.000943 ***
## height       5.318e+00  1.630e+00   3.262 0.007571 ** 
## height_sq   -7.007e-03  1.163e-02  -0.603 0.559016    
## bmi_index    5.187e+00  6.551e-01   7.918  7.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.155 on 11 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 4.667e+04 on 3 and 11 DF,  p-value: < 2.2e-16

Regression Coefficients

coef(model)
##   (Intercept)        height     height_sq     bmi_index 
## -3.468675e+02  5.317861e+00 -7.007093e-03  5.187191e+00

Interpretation of Coefficients

Height

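The coefficient for height (about 5.32) is the expected change in weight, in pounds, for a one-inch increase in height with the other predictors held constant. Because height_sq is a function of height, it cannot actually stay fixed while height changes, so this coefficient should be read together with the height_sq term rather than on its own.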

Height Squared

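This variable allows for a nonlinear (quadratic) effect of height on weight. In this model its coefficient is not statistically significant (p = 0.559), and its 95% confidence interval includes zero.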
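Whether a squared term is worth including can be checked by comparing nested models with an F-test. The sketch below does this for a height-only model, which is an assumption made for illustration and is not the model fitted in this report; the names linear_fit, quadratic_fit, and comparison are likewise illustrative.

```r
w <- datasets::women

# Nested pair: the quadratic model contains the linear model as a special case
linear_fit    <- lm(weight ~ height, data = w)
quadratic_fit <- lm(weight ~ height + I(height^2), data = w)

# F-test of the null hypothesis that the squared term adds nothing
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)
```

A small p-value in the second row indicates that the squared term improves the fit when no other predictors compete with it.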

BMI Index

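The BMI index variable explains additional variation in weight (coefficient about 5.19, p = 7.2e-06). Because it is computed from weight itself, part of this explanatory power is circular.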

Model Performance

R-Squared

summary(model)$r.squared
## [1] 0.9999214

Adjusted R-Squared

summary(model)$adj.r.squared
## [1] 0.9999

Interpretation

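The R-squared value represents the proportion of variation in weight explained by the predictor variables. Here the model explains about 99.99% of the variation; the adjusted R-squared, which penalizes additional predictors, is essentially the same.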
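Both quantities can be reproduced from the residual and total sums of squares. The following is a minimal sketch that refits the same model on a local copy of the data; the names w, fit, r2_manual, and adj_r2_manual are illustrative.

```r
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / w$height_sq * 1000
fit <- lm(weight ~ height + height_sq + bmi_index, data = w)

# R^2 = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((w$weight - mean(w$weight))^2)
r2_manual <- 1 - rss / tss

# Adjusted R^2 penalizes the number of predictors p for a sample of size n
n <- nrow(w)
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(r2 = r2_manual, adj_r2 = adj_r2_manual))
```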

ANOVA Table

anova(model)
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq    F value    Pr(>F)    
## height     1 3332.7  3332.7 138748.272 < 2.2e-16 ***
## height_sq  1   28.5    28.5   1184.994 1.490e-12 ***
## bmi_index  1    1.5     1.5     62.692 7.205e-06 ***
## Residuals 11    0.3     0.0                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation

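When applied to a single model, anova() reports sequential (Type I) tests: each row tests whether adding that term, after the terms listed above it, significantly reduces the residual sum of squares. A small p-value (< 0.05) indicates that the term contributes significantly given the earlier terms. All three terms are significant here. Note that height_sq is significant in this sequential test even though its t-test in summary(model) is not, because the t-test adjusts for all other predictors, including bmi_index.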
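A test of the regression model as a whole can be obtained by comparing the fitted model with an intercept-only model. This sketch refits the model on a local copy of the data; the names w, fit, null_fit, overall, and f_stat are illustrative.

```r
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / w$height_sq * 1000
fit <- lm(weight ~ height + height_sq + bmi_index, data = w)

# Intercept-only model: predicts the mean weight for everyone
null_fit <- lm(weight ~ 1, data = w)

# F-test of the null hypothesis that all slope coefficients are zero
overall <- anova(null_fit, fit)
f_stat <- overall$F[2]
print(overall)
```

The F value in the second row matches the F-statistic printed at the bottom of summary(model).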

Confidence Intervals

confint(model)
##                     2.5 %        97.5 %
## (Intercept) -517.55859129 -176.17630880
## height         1.72983197    8.90588957
## height_sq     -0.03260193    0.01858774
## bmi_index      3.74526756    6.62911510
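These intervals follow the usual form of estimate plus or minus a t critical value times the standard error. The sketch below reproduces them by hand on a local copy of the data; the names w, fit, and ci_manual are illustrative.

```r
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / w$height_sq * 1000
fit <- lm(weight ~ height + height_sq + bmi_index, data = w)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))                  # standard errors of the coefficients
t_crit <- qt(0.975, df = df.residual(fit))    # two-sided 95% critical value

ci_manual <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)
print(ci_manual)
```

An interval that contains zero, as for height_sq, corresponds to a coefficient that is not significant at the 5% level.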

Predicted Values

women$predicted_weight <- predict(model)

head(women)
##   height weight height_sq bmi_index predicted_weight
## 1     58    115      3364  34.18549         115.3233
## 2     59    117      3481  33.61103         116.8415
## 3     60    120      3600  33.33333         119.8850
## 4     61    123      3721  33.05563         122.9145
## 5     62    126      3844  32.77836         125.9323
## 6     63    129      3969  32.50189         128.9401

Actual vs Predicted Plot

ggplot(women,
       aes(x = weight,
           y = predicted_weight)) +
  geom_point(size = 3) +
  geom_abline() +
  labs(
    title = "Actual vs Predicted Weight",
    x = "Actual Weight",
    y = "Predicted Weight"
  )

Diagnostic Plots

par(mfrow = c(2,2))
plot(model)

# CORRELATION MATRIX
# Examine relationships between height and weight

cor(women[, c("height", "weight")])
##           height    weight
## height 1.0000000 0.9954948
## weight 0.9954948 1.0000000

Question II: Main Variable Selection Methods in R

Variable selection is an important step in building a multiple linear regression model. It helps identify the most relevant predictors, improve model accuracy, and reduce overfitting. The main methods used in R include Stepwise Selection, Best Subset Selection, and LASSO Regression.

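1. Stepwise Selection

Definition:
Stepwise selection adds or removes one predictor at a time, keeping each change that improves a criterion such as AIC, until no further change helps.

Example in R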

full_model2 <- lm(mpg ~ ., data = mtcars)

step_model <- step(full_model2,
                   direction = "both",
                   trace = FALSE)

summary(step_model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The final model selected is:

mpg ~ wt + qsec + am

This indicates that weight (wt), quarter-mile time (qsec), and transmission type (am) are important predictors of fuel efficiency.
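By default step() compares candidate models by AIC, so the selected model should have an AIC no higher than the full model it started from. The sketch below checks this; the names full_fit, step_fit, and aic_table are illustrative.

```r
# Full model with every predictor, then stepwise selection from it
full_fit <- lm(mpg ~ ., data = mtcars)
step_fit <- step(full_fit, direction = "both", trace = 0)

# One row per model: degrees of freedom and AIC (lower is better)
aic_table <- AIC(full_fit, step_fit)
print(aic_table)
```

AIC rewards fit but penalizes each extra parameter, which is why the smaller model can be preferred even though the full model has a higher R-squared.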


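2. Best Subset Selection

Definition:
Best subset selection fits models for every combination of predictors and keeps the best model of each size.

Example in R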

library(leaps)
best_model <- regsubsets(mpg ~ .,
                         data = mtcars,
                         nvmax = 10)

summary(best_model)
## Subset selection object
## Call: regsubsets.formula(mpg ~ ., data = mtcars, nvmax = 10)
## 10 Variables  (and intercept)
##      Forced in Forced out
## cyl      FALSE      FALSE
## disp     FALSE      FALSE
## hp       FALSE      FALSE
## drat     FALSE      FALSE
## wt       FALSE      FALSE
## qsec     FALSE      FALSE
## vs       FALSE      FALSE
## am       FALSE      FALSE
## gear     FALSE      FALSE
## carb     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           cyl disp hp  drat wt  qsec vs  am  gear carb
## 1  ( 1 )  " " " "  " " " "  "*" " "  " " " " " "  " " 
## 2  ( 1 )  "*" " "  " " " "  "*" " "  " " " " " "  " " 
## 3  ( 1 )  " " " "  " " " "  "*" "*"  " " "*" " "  " " 
## 4  ( 1 )  " " " "  "*" " "  "*" "*"  " " "*" " "  " " 
## 5  ( 1 )  " " "*"  "*" " "  "*" "*"  " " "*" " "  " " 
## 6  ( 1 )  " " "*"  "*" "*"  "*" "*"  " " "*" " "  " " 
## 7  ( 1 )  " " "*"  "*" "*"  "*" "*"  " " "*" "*"  " " 
## 8  ( 1 )  " " "*"  "*" "*"  "*" "*"  " " "*" "*"  "*" 
## 9  ( 1 )  " " "*"  "*" "*"  "*" "*"  "*" "*" "*"  "*" 
## 10  ( 1 ) "*" "*"  "*" "*"  "*" "*"  "*" "*" "*"  "*"
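The table lists the best model of each size but does not say which size to prefer. One possible way to choose is to compare the sizes by BIC or adjusted R-squared, both of which are stored in the summary object. The sketch below assumes the leaps package is installed; the names subsets, subset_summary, best_by_bic, and best_by_adjr2 are illustrative.

```r
library(leaps)

subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
subset_summary <- summary(subsets)

# Model size preferred by each criterion (lower BIC, higher adjusted R^2)
best_by_bic   <- which.min(subset_summary$bic)
best_by_adjr2 <- which.max(subset_summary$adjr2)
print(c(bic = best_by_bic, adjr2 = best_by_adjr2))

# Coefficients of the model preferred by BIC
print(coef(subsets, id = best_by_bic))
```

The two criteria need not agree, because BIC penalizes additional predictors more heavily than adjusted R-squared does.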

3. LASSO Regression (Least Absolute Shrinkage and Selection Operator)

Definition:
LASSO regression adds a penalty to regression coefficients, shrinking some coefficients exactly to zero and automatically selecting variables.

Role:
Performs variable selection and regularization simultaneously.

When to Use:

  - Many predictors
  - Highly correlated predictors
  - Cases where p > n

Example in R

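# Predictor matrix without the intercept column, and the response vector
x <- model.matrix(mpg ~ ., mtcars)[, -1]
y <- mtcars$mpg

# alpha = 1 selects the LASSO penalty; cross-validation folds are assigned
# at random, so lambda.min differs between runs unless a seed is set first
lasso_model <- glmnet::cv.glmnet(x, y, alpha = 1)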

lasso_model$lambda.min
## [1] 0.6647582
coef(lasso_model, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##              lambda.min
## (Intercept) 36.44500429
## cyl         -0.89288058
## disp         .         
## hp          -0.01281976
## drat         .         
## wt          -2.78332595
## qsec         .         
## vs           .         
## am           0.01347182
## gear         .         
## carb         .
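Predictors shown with a dot have been shrunk exactly to zero and are therefore dropped. The sketch below makes the run reproducible and lists the retained predictors at lambda.1se, a more heavily penalized alternative to lambda.min that cv.glmnet also reports. It assumes the glmnet package is installed; the seed value and the names cv_fit, coef_1se, and selected are arbitrary choices for illustration.

```r
set.seed(123)  # fixes the random fold assignment; the value itself is arbitrary

x <- model.matrix(mpg ~ ., mtcars)[, -1]
y <- mtcars$mpg
cv_fit <- glmnet::cv.glmnet(x, y, alpha = 1)

# Coefficients at the largest lambda within one standard error of the minimum
coef_1se <- as.matrix(coef(cv_fit, s = "lambda.1se"))

# Names of predictors whose coefficients were not shrunk to zero
selected <- setdiff(rownames(coef_1se)[coef_1se[, 1] != 0], "(Intercept)")
print(selected)
```

Using lambda.1se generally yields a sparser model than lambda.min at a small cost in cross-validated error.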

Conclusion

The three main variable selection methods are:

  1. Stepwise Selection – Fast and easy to interpret.
  2. Best Subset Selection – Evaluates every combination of predictors and reports the best model of each size; exhaustive, so costly when there are many predictors.
  3. LASSO Regression – Best for high-dimensional data and automatic variable selection.

These methods help improve regression model performance by selecting the most relevant predictors.


END OF QUESTION 2