Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between one dependent variable and multiple independent variables.
In this study, the built-in R dataset women is used. The objective is to predict a woman’s weight based on height and other derived numerical variables.
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
data(women)
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
The women dataset contains information on two variables, height (in inches) and weight (in pounds), recorded for 15 observations:
str(women)
## 'data.frame': 15 obs. of 2 variables:
## $ height: num 58 59 60 61 62 63 64 65 66 67 ...
## $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
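To demonstrate Multiple Linear Regression, two additional predictors are created: height_sq, derived from height alone, and bmi_index, derived from both weight and height. Because bmi_index contains the response variable, the near-perfect fit reported later should not be read as genuine predictive accuracy.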
women$height_sq <- women$height^2
women$bmi_index <- women$weight/(women$height^2)*1000
head(women)
## height weight height_sq bmi_index
## 1 58 115 3364 34.18549
## 2 59 117 3481 33.61103
## 3 60 120 3600 33.33333
## 4 61 123 3721 33.05563
## 5 62 126 3844 32.77836
## 6 63 129 3969 32.50189
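How the derived columns relate to the original ones can be checked directly. The following is a minimal sketch on a fresh copy of the data (the names `w` and `recovered_weight` are illustrative, not part of the original analysis); it inverts the bmi_index formula to see whether weight can be reconstructed from height_sq and bmi_index alone.

```r
# Work on a fresh copy so the women data frame used in the text is untouched.
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / (w$height^2) * 1000

# Invert the bmi_index formula: weight = bmi_index * height^2 / 1000.
recovered_weight <- w$bmi_index * w$height_sq / 1000

# Compare the reconstructed values with the observed weights.
isTRUE(all.equal(recovered_weight, w$weight))
```

If the comparison returns TRUE, the two derived predictors jointly determine weight, which is worth keeping in mind when reading the fit statistics of any model that includes both.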
summary(women)
## height weight height_sq bmi_index
## Min. :58.0 Min. :115.0 Min. :3364 Min. :31.43
## 1st Qu.:61.5 1st Qu.:124.5 1st Qu.:3782 1st Qu.:31.60
## Median :65.0 Median :135.0 Median :4225 Median :31.95
## Mean :65.0 Mean :136.7 Mean :4244 Mean :32.32
## 3rd Qu.:68.5 3rd Qu.:148.0 3rd Qu.:4692 3rd Qu.:32.92
## Max. :72.0 Max. :164.0 Max. :5184 Max. :34.19
cor_matrix <- cor(women)
cor_matrix
## height weight height_sq bmi_index
## height 1.0000000 0.9954948 0.9995644 -0.9380678
## weight 0.9954948 1.0000000 0.9977763 -0.9011408
## height_sq 0.9995644 0.9977763 1.0000000 -0.9276567
## bmi_index -0.9380678 -0.9011408 -0.9276567 1.0000000
corrplot(cor_matrix,
method = "number")
pairs(women,
main = "Scatter Plot Matrix")
The response variable is weight.
The predictor variables are height, height_sq, and bmi_index.
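Written out as an equation, the model fitted in this section has the form below. The symbols \beta_0, ..., \beta_3 (regression coefficients) and \varepsilon_i (error term) are notation introduced here for illustration; they correspond to the terms in the lm() formula weight ~ height + height_sq + bmi_index.

```latex
\[
\text{weight}_i = \beta_0
  + \beta_1\,\text{height}_i
  + \beta_2\,\text{height\_sq}_i
  + \beta_3\,\text{bmi\_index}_i
  + \varepsilon_i,
\qquad i = 1, \dots, 15
\]
```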
model <- lm(weight ~ height + height_sq + bmi_index,
data = women)
summary(model)
##
## Call:
## lm(formula = weight ~ height + height_sq + bmi_index, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32331 -0.09292 0.05985 0.07660 0.20529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.469e+02 7.755e+01 -4.473 0.000943 ***
## height 5.318e+00 1.630e+00 3.262 0.007571 **
## height_sq -7.007e-03 1.163e-02 -0.603 0.559016
## bmi_index 5.187e+00 6.551e-01 7.918 7.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.155 on 11 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 4.667e+04 on 3 and 11 DF, p-value: < 2.2e-16
coef(model)
## (Intercept) height height_sq bmi_index
## -3.468675e+02 5.317861e+00 -7.007093e-03 5.187191e+00
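The coefficient for height (about 5.32) represents the expected change in weight for a one-unit increase in height while keeping the other predictors constant. Because height_sq is a function of height, it cannot actually be held fixed while height changes, so this reading is formal rather than practical.
The height_sq term is intended to capture nonlinear effects of height on weight. In this model its coefficient (about -0.007) is not statistically significant (p = 0.559).
The bmi_index coefficient (about 5.19) is highly significant (p = 7.2e-06) and accounts for additional variation in weight beyond the two height terms.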
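To make the role of the coefficients concrete, the fitted values can be rebuilt by hand as the intercept plus each coefficient times its predictor. This is a minimal sketch that refits the same formula on a fresh copy of the data; the names `w`, `fit`, `X`, `b`, and `manual_pred` are illustrative.

```r
# Rebuild the data and refit the model on a fresh copy.
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / (w$height^2) * 1000
fit <- lm(weight ~ height + height_sq + bmi_index, data = w)

# Design matrix: a column of ones plus the three predictors.
X <- model.matrix(fit)
b <- coef(fit)

# Fitted value for each row = intercept + sum(coefficient * predictor).
manual_pred <- as.vector(X %*% b)

# Show the hand-computed values next to those from predict().
head(cbind(manual = manual_pred, predict = unname(predict(fit))))
```

The two columns should agree, since predict() on the training data applies this same linear combination.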
summary(model)$r.squared
## [1] 0.9999214
summary(model)$adj.r.squared
## [1] 0.9999
The R-squared value represents the proportion of variation in weight explained by the predictor variables.
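The definition above can be checked by computing R-squared from the residual and total sums of squares. This is a sketch under the same model formula, refitted on a fresh copy of the data; `rss`, `tss`, `r2`, and `adj_r2` are illustrative names.

```r
# Rebuild the data and refit the model on a fresh copy.
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / (w$height^2) * 1000
fit <- lm(weight ~ height + height_sq + bmi_index, data = w)

# Residual sum of squares: variation left unexplained by the model.
rss <- sum(residuals(fit)^2)
# Total sum of squares: variation of weight around its mean.
tss <- sum((w$weight - mean(w$weight))^2)

# R-squared is the explained share of the total variation.
r2 <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p.
n <- nrow(w)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r_squared = r2, adj_r_squared = adj_r2)
```

Both values should match summary(fit)$r.squared and summary(fit)$adj.r.squared.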
anova(model)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## height 1 3332.7 3332.7 138748.272 < 2.2e-16 ***
## height_sq 1 28.5 28.5 1184.994 1.490e-12 ***
## bmi_index 1 1.5 1.5 62.692 7.205e-06 ***
## Residuals 11 0.3 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
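The anova() table tests the terms sequentially: a small p-value (< 0.05) for a term indicates that adding it significantly reduces the residual sum of squares, given the terms listed above it. The overall significance of the regression is given by the F-statistic in summary(model).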
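A single test of the whole model can also be obtained by comparing it against an intercept-only model. The sketch below assumes the same formula refitted on a fresh copy of the data; `null_fit` and `overall` are illustrative names.

```r
# Rebuild the data and refit the model on a fresh copy.
w <- datasets::women
w$height_sq <- w$height^2
w$bmi_index <- w$weight / (w$height^2) * 1000
fit <- lm(weight ~ height + height_sq + bmi_index, data = w)

# Intercept-only model: predicts the mean weight for every observation.
null_fit <- lm(weight ~ 1, data = w)

# F-test comparing the two nested models (all three predictors at once).
overall <- anova(null_fit, fit)
overall
```

The F value in the second row of this table should equal the F-statistic reported at the bottom of summary(fit).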
confint(model)
## 2.5 % 97.5 %
## (Intercept) -517.55859129 -176.17630880
## height 1.72983197 8.90588957
## height_sq -0.03260193 0.01858774
## bmi_index 3.74526756 6.62911510
women$predicted_weight <- predict(model)
head(women)
## height weight height_sq bmi_index predicted_weight
## 1 58 115 3364 34.18549 115.3233
## 2 59 117 3481 33.61103 116.8415
## 3 60 120 3600 33.33333 119.8850
## 4 61 123 3721 33.05563 122.9145
## 5 62 126 3844 32.77836 125.9323
## 6 63 129 3969 32.50189 128.9401
ggplot(women,
aes(x = weight,
y = predicted_weight)) +
geom_point(size = 3) +
geom_abline() +
labs(
title = "Actual vs Predicted Weight",
x = "Actual Weight",
y = "Predicted Weight"
)
par(mfrow = c(2,2))
plot(model)
# CORRELATION MATRIX
# Examine relationships between height and weight
cor(women[, c("height", "weight")])
## height weight
## height 1.0000000 0.9954948
## weight 0.9954948 1.0000000
Variable selection is an important step in building a multiple linear regression model. It helps identify the most relevant predictors, improve model accuracy, and reduce overfitting. The main methods used in R include Stepwise Selection, Best Subset Selection, and LASSO Regression.
Definition:
Stepwise selection is an automated procedure that adds or removes
predictors one at a time based on statistical criteria such as AIC
(Akaike Information Criterion) or BIC (Bayesian Information
Criterion).
Role:
It helps reduce a large set of predictors into a simpler and more
interpretable model.
When to Use:
- Moderate number of predictors (less than 30)
- Fast model-building process
- Easy interpretation of results
full_model2 <- lm(mpg ~ ., data = mtcars)
step_model <- step(full_model2,
direction = "both",
trace = FALSE)
summary(step_model)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The final model selected by the stepwise procedure is:
mpg ~ wt + qsec + am
This indicates that weight (wt), quarter-mile time (qsec), and transmission type (am) are important predictors of fuel efficiency.
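The definition of stepwise selection mentions both AIC and BIC. By default step() uses the AIC penalty (k = 2); passing k = log(n) switches the penalty to BIC. The following is a minimal sketch on the same mtcars model; `full_fit`, `step_aic`, and `step_bic` are illustrative names.

```r
# Full model with all predictors, as in the stepwise example.
full_fit <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

# Default penalty k = 2 corresponds to AIC.
step_aic <- step(full_fit, direction = "both", trace = FALSE)

# Penalty k = log(n) corresponds to BIC, which penalises extra terms more.
step_bic <- step(full_fit, direction = "both", trace = FALSE, k = log(n))

# Compare the formulas chosen under each criterion.
formula(step_aic)
formula(step_bic)
```

Because the BIC penalty per parameter is larger than 2 once n exceeds about 7, the BIC search tends to keep the same number of predictors or fewer, although the two criteria can also agree.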
Definition:
Best subset selection evaluates all possible combinations of predictor
variables and selects the best model according to a criterion such as
Adjusted R², AIC, or BIC.
Role:
It identifies the optimal combination of predictors.
When to Use:
- Small number of predictors (fewer than 20)
- When computational cost is not a concern
library(leaps)
best_model <- regsubsets(mpg ~ .,
data = mtcars,
nvmax = 10)
summary(best_model)
## Subset selection object
## Call: regsubsets.formula(mpg ~ ., data = mtcars, nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## cyl FALSE FALSE
## disp FALSE FALSE
## hp FALSE FALSE
## drat FALSE FALSE
## wt FALSE FALSE
## qsec FALSE FALSE
## vs FALSE FALSE
## am FALSE FALSE
## gear FALSE FALSE
## carb FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## cyl disp hp drat wt qsec vs am gear carb
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " "
## 4 ( 1 ) " " " " "*" " " "*" "*" " " "*" " " " "
## 5 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " "
## 6 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" " " " "
## 7 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" " "
## 8 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 9 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
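The table above lists the best model of each size but does not say which size to prefer. One way to choose, sketched below, is to read the criteria stored in the regsubsets summary and pick the size that minimises BIC or maximises adjusted R-squared. The names `subsets`, `s`, `best_bic`, and `best_adjr2` are illustrative.

```r
library(leaps)

# Exhaustive search over all subsets, as in the example above.
subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
s <- summary(subsets)

# Model size (number of predictors) preferred by each criterion.
best_bic   <- which.min(s$bic)
best_adjr2 <- which.max(s$adjr2)
c(bic = best_bic, adjr2 = best_adjr2)

# Coefficients of the model preferred by BIC.
coef(subsets, id = best_bic)
```

The two criteria need not agree; BIC penalises model size more heavily and usually favours the smaller model.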
Definition:
LASSO regression adds a penalty to regression coefficients, shrinking
some coefficients exactly to zero and automatically selecting
variables.
Role:
Performs variable selection and regularization simultaneously.
When to Use:
- Many predictors
- Highly correlated predictors
- Cases where p > n
x <- model.matrix(mpg ~ ., mtcars)[, -1]
y <- mtcars$mpg
lasso_model <- glmnet::cv.glmnet(x, y, alpha = 1)
lasso_model$lambda.min
## [1] 0.6647582
coef(lasso_model, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 36.44500429
## cyl -0.89288058
## disp .
## hp -0.01281976
## drat .
## wt -2.78332595
## qsec .
## vs .
## am 0.01347182
## gear .
## carb .
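cv.glmnet() assigns observations to cross-validation folds at random, so lambda.min and the set of selected variables can change from run to run. The sketch below fixes a seed (the value 123 is an arbitrary assumption) and compares the predictors retained at lambda.min with those retained at the more conservative lambda.1se. The helper `selected_vars` is hypothetical and written only for this illustration.

```r
library(glmnet)

# Predictor matrix without the intercept column, and the response.
x <- model.matrix(mpg ~ ., mtcars)[, -1]
y <- mtcars$mpg

# Fix the random fold assignment so the result is reproducible.
set.seed(123)  # arbitrary seed chosen for this example
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Hypothetical helper: names of predictors with non-zero coefficients.
selected_vars <- function(fit, s) {
  b <- as.matrix(coef(fit, s = s))[, 1]
  names(b)[b != 0 & names(b) != "(Intercept)"]
}

# lambda.min minimises the CV error; lambda.1se is the largest lambda
# whose CV error is within one standard error of that minimum.
selected_min <- selected_vars(cv_fit, "lambda.min")
selected_1se <- selected_vars(cv_fit, "lambda.1se")
selected_min
selected_1se
```

Using lambda.1se applies a stronger penalty than lambda.min, trading a small increase in cross-validated error for heavier shrinkage.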
The three main variable selection methods are Stepwise Selection, Best Subset Selection, and LASSO Regression.
These methods help improve regression model performance by selecting the most relevant predictors.