This exercise shows how to apply Multiple Linear Regression in R.
A school administration wants to identify the major factors affecting students’ academic performance using data collected from students. The objective is to predict students’ final exam scores from several independent variables: number of hours studied, attendance percentage, and assignment scores.
We use the ‘Student Performance Dataset’ to predict ‘Final Exam Score’ from three independent variables (Hours Studied, Attendance Rate, and Assignment Score). Multiple Linear Regression is used to determine how these variables affect the final exam results.
# Creating the student dataset
# ----------------------------
student_ds <- data.frame(
Hr_Studied = c(5,8,2,7,6,9,4,10),
Attendance = c(80,90,60,85,75,95,70,98),
Assignment = c(70,85,55,80,78,92,60,96),
Final_Exam = c(65,88,50,82,76,94,62,98)
)
# Display the student dataset
# ---------------------------
print(student_ds)
## Hr_Studied Attendance Assignment Final_Exam
## 1 5 80 70 65
## 2 8 90 85 88
## 3 2 60 55 50
## 4 7 85 80 82
## 5 6 75 78 76
## 6 9 95 92 94
## 7 4 70 60 62
## 8 10 98 96 98
str(student_ds)
## 'data.frame': 8 obs. of 4 variables:
## $ Hr_Studied: num 5 8 2 7 6 9 4 10
## $ Attendance: num 80 90 60 85 75 95 70 98
## $ Assignment: num 70 85 55 80 78 92 60 96
## $ Final_Exam: num 65 88 50 82 76 94 62 98
summary(student_ds)
## Hr_Studied Attendance Assignment Final_Exam
## Min. : 2.000 Min. :60.00 Min. :55.00 Min. :50.00
## 1st Qu.: 4.750 1st Qu.:73.75 1st Qu.:67.50 1st Qu.:64.25
## Median : 6.500 Median :82.50 Median :79.00 Median :79.00
## Mean : 6.375 Mean :81.62 Mean :77.00 Mean :76.88
## 3rd Qu.: 8.250 3rd Qu.:91.25 3rd Qu.:86.75 3rd Qu.:89.50
## Max. :10.000 Max. :98.00 Max. :96.00 Max. :98.00
# Building the regression model
#------------------------------
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)
# Display model results
# ---------------------
summary(model)
##
## Call:
## lm(formula = Final_Exam ~ Hr_Studied + Attendance + Assignment,
## data = student_ds)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -1.2306 1.1975 0.2227 1.3683 -0.4524 1.0717 -0.2318 -1.9454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.04408 19.98987 2.954 0.0418 *
## Hr_Studied 8.34302 2.22418 3.751 0.0199 *
## Attendance -0.41186 0.23404 -1.760 0.1533
## Assignment -0.02256 0.29796 -0.076 0.9433
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.586 on 4 degrees of freedom
## Multiple R-squared: 0.9949, Adjusted R-squared: 0.9911
## F-statistic: 260.4 on 3 and 4 DF, p-value: 4.859e-05
The model follows this equation:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 \]
Where:
- \(Y\) = Final Exam Score
- \(X_1\) = Hours Studied
- \(X_2\) = Attendance Rate
- \(X_3\) = Assignment Score
- \(\beta_0\) = intercept, and \(\beta_1, \beta_2, \beta_3\) = regression coefficients
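To connect the equation to the fitted model, the sketch below rebuilds the dataset from this exercise, refits the model, and computes the prediction for the first student by hand from the estimated coefficients. The names `b`, `x1`, and `manual_pred` are illustrative only and are not part of the original exercise.

```r
# Recreate the dataset and model from this exercise
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Estimated coefficients: beta_0 (intercept), beta_1, beta_2, beta_3
b <- coef(model)

# Predictor values of the first student, with a leading 1 for the intercept
x1 <- c(1, student_ds$Hr_Studied[1], student_ds$Attendance[1], student_ds$Assignment[1])

# Y = beta_0 + beta_1*X_1 + beta_2*X_2 + beta_3*X_3
manual_pred <- sum(b * x1)
print(manual_pred)

# The built-in fitted value for the same student should agree
print(unname(fitted(model)[1]))
```

Both printed values should match, which confirms that `predict()` and `fitted()` simply evaluate the equation above with the estimated coefficients.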
# Predict final exam scores
# -------------------------
predicted_scores <- predict(model)
# Create comparison dataset
# -------------------------
results <- data.frame(
Actual = student_ds$Final_Exam,
Predicted = predicted_scores
)
print(results)
## Actual Predicted
## 1 65 66.23058
## 2 88 86.80253
## 3 50 49.77727
## 4 82 80.63165
## 5 76 76.45240
## 6 94 92.92828
## 7 62 62.23184
## 8 98 99.94545
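The fitted model can also score a student who is not in the dataset. The following is a minimal sketch, assuming a hypothetical new student whose values were chosen for illustration and lie inside the observed ranges; `new_student` and `pred` are names introduced here.

```r
# Recreate the dataset and model from this exercise
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Hypothetical new student (illustrative values, not from the original data)
new_student <- data.frame(Hr_Studied = 6, Attendance = 80, Assignment = 75)

# Point prediction ("fit") with a 95% prediction interval ("lwr", "upr")
pred <- predict(model, newdata = new_student, interval = "prediction", level = 0.95)
print(pred)
```

With only 8 observations the interval is likely to be wide, so the point prediction should be read together with its bounds.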
# Plot actual vs predicted scores
# -------------------------------
plot(results$Actual, results$Predicted, main = "Actual vs Predicted Final Exam Scores",
xlab = "Actual Scores", ylab = "Predicted Scores", pch = 19)
# Add the 45-degree reference line (predicted = actual)
# ------------------------------------------------------
abline(0,1,col="yellow",lwd=2)
## Visualization Explanation
## -------------------------
## - Points close to the yellow line indicate good predictions.
## - Large distances from the line indicate prediction errors.
# Calculate R-squared
# -------------------
R2 <- cor(results$Actual, results$Predicted)^2
print(R2)
## [1] 0.994905
# Interpretation
# ----------------
# The R-squared value shows how much of the variation in final exam scores is explained by the independent variables.
# Here R² = 0.9949, so about 99.5% of the variance in final exam scores is explained by the model.
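As a cross-check on the squared-correlation shortcut, R-squared can be computed directly from the residual and total sums of squares. The sketch below assumes the same dataset and model as in this exercise; `rss`, `tss`, `r2_manual`, and `adj_r2_manual` are illustrative names.

```r
# Recreate the dataset and model from this exercise
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Residual sum of squares and total sum of squares
rss <- sum(residuals(model)^2)
tss <- sum((student_ds$Final_Exam - mean(student_ds$Final_Exam))^2)

# R-squared = 1 - RSS / TSS
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p
n <- nrow(student_ds)
p <- 3
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(R2 = r2_manual, Adjusted_R2 = adj_r2_manual))
```

These values should agree with the "Multiple R-squared" and "Adjusted R-squared" lines reported by `summary(model)`.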
This section evaluates whether the multiple linear regression model is appropriate for the dataset. Regression diagnostics help us verify important assumptions such as linearity, normality, homoscedasticity, and multicollinearity.
# Generate regression diagnostic plots
# ------------------------------------
par(mfrow = c(2,2))
plot(model)
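The four plots produced are:
- Residuals vs Fitted: checks the linearity assumption.
- Normal Q-Q: checks whether the residuals are approximately normally distributed.
- Scale-Location: checks for constant variance (homoscedasticity).
- Residuals vs Leverage: highlights influential observations.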
These plots help evaluate whether the regression assumptions are satisfied.
# Extract residuals
# -----------------
residuals_model <- residuals(model)
# Plot histogram of residuals
# -----------------------------
hist(residuals_model, main = "Histogram of Residuals", xlab = "Residuals", col = "lightblue", border = "black")
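A histogram of eight residuals is hard to judge by eye, so a formal normality test can complement it. The sketch below uses the Shapiro-Wilk test from base R (`shapiro.test()`); it is offered as one possible check rather than part of the original exercise, and with so few observations the test has little power, so a large p-value is weak evidence at best.

```r
# Recreate the dataset and model from this exercise
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(model))
print(sw)
```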
# Correlation
# ----------------
cor(student_ds)
## Hr_Studied Attendance Assignment Final_Exam
## Hr_Studied 1.0000000 0.9783321 0.9894338 0.9953252
## Attendance 0.9783321 1.0000000 0.9589653 0.9603002
## Assignment 0.9894338 0.9589653 1.0000000 0.9872695
## Final_Exam 0.9953252 0.9603002 0.9872695 1.0000000
# Scatterplot
# ----------------
pairs(student_ds)
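This exercise demonstrated the application of Multiple Linear Regression using R.
The analysis showed that:
- Students who study more tend to perform better: Hr_Studied is the only predictor that is statistically significant in the fitted model (p = 0.0199).
- Attendance is strongly positively correlated with the final exam score on its own (r = 0.96), but its coefficient in the joint model is not significant (p = 0.1533).
- Assignment scores are likewise strongly correlated with the final exam score (r = 0.99), but are not significant once the other predictors are included (p = 0.9433).
- The predictors are highly correlated with one another (r between 0.96 and 0.99) and the sample has only 8 observations, so the individual coefficients should be interpreted with caution.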
Variable selection, also known as feature selection, is the process of identifying and choosing the most relevant predictors to include in a regression model.
It leads to simpler, faster, and more interpretable models, and helps prevent overfitting.
Overfitting occurs when a model is too complex and captures noise in the
data rather than the underlying pattern. By keeping only the most relevant
variables, variable selection improves the model’s ability to generalize
to new, unseen data.
Variable selection is an important process in statistical modeling
and machine learning.
It involves choosing the most relevant independent variables
(predictors) for building a regression model.
Its main purpose is to improve model accuracy, reduce complexity,
avoid overfitting, and simplify interpretation.
In R, several variable selection methods are commonly used in
regression analysis.
These methods help identify which variables contribute significantly to
predicting the dependent variable.
Variable selection is important because a model with too many irrelevant variables may become complex and less accurate.
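The effect of irrelevant variables can be illustrated with a small experiment. The sketch below, which is an illustration and not part of the original exercise, adds three columns of pure random noise to the built-in `mtcars` data; the column names `noise1` to `noise3`, the seed, and the choice of `wt` and `hp` as the base predictors are all assumptions made for this example.

```r
set.seed(42)  # arbitrary seed so the noise columns are reproducible

# Copy mtcars and append three predictors that are unrelated to mpg
cars_noise <- mtcars
cars_noise$noise1 <- rnorm(nrow(mtcars))
cars_noise$noise2 <- rnorm(nrow(mtcars))
cars_noise$noise3 <- rnorm(nrow(mtcars))

# Model with meaningful predictors only, and the same model plus noise
base_model  <- lm(mpg ~ wt + hp, data = cars_noise)
noisy_model <- lm(mpg ~ wt + hp + noise1 + noise2 + noise3, data = cars_noise)

# R-squared can never decrease when predictors are added,
# whereas adjusted R-squared penalises the extra terms
comparison <- data.frame(
  model         = c("base", "with noise"),
  r_squared     = c(summary(base_model)$r.squared,
                    summary(noisy_model)$r.squared),
  adj_r_squared = c(summary(base_model)$adj.r.squared,
                    summary(noisy_model)$adj.r.squared)
)
print(comparison)
```

Because plain R-squared rewards any added variable, even a meaningless one, it is a poor guide for choosing variables; criteria that penalise model size are more suitable.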
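Forward selection starts with no variables in the model.
Variables are then added one at a time based on a selection criterion
such as statistical significance or AIC.
Process:
- Start with the intercept-only model.
- Add the candidate variable that improves the criterion the most (in the step() output, the one with the lowest AIC).
- Repeat until no remaining variable improves the model.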
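Advantages:
- Computationally inexpensive, since it starts from a small model.
- Can be used even when there are many candidate predictors.
Disadvantages:
- A variable that has been added is never removed, even if it becomes redundant later.
- The greedy search can miss the best combination of variables.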
Example in R:
# Create sample dataset
# ---------------------
data <- mtcars
# Full regression model
# ---------------------
full_model <- lm(mpg ~ wt + hp + disp + cyl, data = data)
# Forward selection
# ------------------
model_forward <- step(lm(mpg ~ 1, data = data), scope = formula(full_model), direction = "forward")
## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.73 278.32 73.217
## + cyl 1 817.71 308.33 76.494
## + disp 1 808.89 317.16 77.397
## + hp 1 678.37 447.67 88.427
## <none> 1126.05 115.943
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + cyl 1 87.150 191.17 63.198
## + hp 1 83.274 195.05 63.840
## + disp 1 31.639 246.68 71.356
## <none> 278.32 73.217
##
## Step: AIC=63.2
## mpg ~ wt + cyl
##
## Df Sum of Sq RSS AIC
## + hp 1 14.5514 176.62 62.665
## <none> 191.17 63.198
## + disp 1 2.6796 188.49 64.746
##
## Step: AIC=62.66
## mpg ~ wt + cyl + hp
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## + disp 1 6.1762 170.44 63.526
summary(model_forward)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## hp -0.01804 0.01188 -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
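A single step of forward selection can be reproduced with `add1()`, which evaluates every candidate addition to the current model. This is a minimal sketch of the first step only, using `mtcars` directly; `null_model`, `first_step`, `candidates`, and `best_var` are illustrative names.

```r
# Intercept-only starting model
null_model <- lm(mpg ~ 1, data = mtcars)

# Evaluate adding each candidate predictor on its own
first_step <- add1(null_model, scope = ~ wt + hp + disp + cyl)
print(first_step)

# Pick the candidate with the lowest AIC (ignoring the "<none>" row)
candidates <- first_step[rownames(first_step) != "<none>", ]
best_var <- rownames(candidates)[which.min(candidates$AIC)]
print(best_var)
```

The selected variable should match the first variable added in the `step()` trace above.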
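Backward elimination starts with all variables included in the
model.
The least useful variables are then removed one by one.
Process:
- Fit the full model with all candidate variables.
- Remove the variable whose removal improves the criterion the most (in the step() output, the one giving the lowest AIC).
- Repeat until removing any remaining variable would make the model worse.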
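Advantages:
- Every variable is initially assessed in the presence of all the others.
Disadvantages:
- Requires fitting the full model, which needs more observations than predictors.
- A variable that has been removed is never reconsidered.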
Example in R:
# Backward elimination
# ---------------------
model_backward <- step(full_model, direction = "backward")
## Start: AIC=63.53
## mpg ~ wt + hp + disp + cyl
##
## Df Sum of Sq RSS AIC
## - disp 1 6.176 176.62 62.665
## <none> 170.44 63.526
## - hp 1 18.048 188.49 64.746
## - cyl 1 24.546 194.99 65.831
## - wt 1 90.925 261.37 75.206
##
## Step: AIC=62.66
## mpg ~ wt + hp + cyl
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## - hp 1 14.551 191.17 63.198
## - cyl 1 18.427 195.05 63.840
## - wt 1 115.354 291.98 76.750
summary(model_backward)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
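The first step of backward elimination can likewise be reproduced with `drop1()`, which evaluates the removal of each term from the current model. The sketch below refits the full model on `mtcars` so that it runs on its own; `first_drop`, `candidates`, and `drop_var` are illustrative names.

```r
# Full model with all four candidate predictors
full_model <- lm(mpg ~ wt + hp + disp + cyl, data = mtcars)

# Evaluate removing each predictor in turn
first_drop <- drop1(full_model)
print(first_drop)

# The variable whose removal gives the lowest AIC is dropped first
candidates <- first_drop[rownames(first_drop) != "<none>", ]
drop_var <- rownames(candidates)[which.min(candidates$AIC)]
print(drop_var)
```

The result should match the first variable removed in the `step()` trace above.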
Stepwise selection combines forward selection and backward
elimination.
Variables can be added or removed during the process.
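Advantages:
- Earlier decisions can be revised: a removed variable can re-enter and an added one can be dropped.
Disadvantages:
- More computation than forward selection or backward elimination alone.
- Like the other automatic methods, the selected model can overfit, and p-values reported after selection tend to be optimistic.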
Example in R:
# Stepwise selection
# ---------------------
model_stepwise <- step(full_model, direction = "both")
## Start: AIC=63.53
## mpg ~ wt + hp + disp + cyl
##
## Df Sum of Sq RSS AIC
## - disp 1 6.176 176.62 62.665
## <none> 170.44 63.526
## - hp 1 18.048 188.49 64.746
## - cyl 1 24.546 194.99 65.831
## - wt 1 90.925 261.37 75.206
##
## Step: AIC=62.66
## mpg ~ wt + hp + cyl
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## - hp 1 14.551 191.17 63.198
## + disp 1 6.176 170.44 63.526
## - cyl 1 18.427 195.05 63.840
## - wt 1 115.354 291.98 76.750
summary(model_stepwise)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
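Stepwise selection does not have to start from the full model. As a variation on the example above, the sketch below starts from the intercept-only model and gives `step()` an explicit `scope` with a lower and an upper formula, so that terms can be both added and removed; `trace = 0` suppresses the step-by-step printout. The names `null_model`, `model_both`, and `selected` are illustrative.

```r
# Start from the intercept-only model
null_model <- lm(mpg ~ 1, data = mtcars)

# Search between the empty model (lower) and the full model (upper)
model_both <- step(
  null_model,
  scope     = list(lower = ~ 1, upper = ~ wt + hp + disp + cyl),
  direction = "both",
  trace     = 0
)

# Names of the predictors kept in the final model
selected <- attr(terms(model_both), "term.labels")
print(selected)
```

On this dataset the search should arrive at the same three predictors as the other methods, although in general different starting points can lead to different final models.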
Different statistical criteria are used to evaluate models during variable selection, such as AIC (the criterion used by step()), BIC, and adjusted R-squared.
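Several of these criteria can be computed side by side with base R functions. The sketch below compares the full model with the three-predictor model chosen earlier; `reduced_model` and `criteria` are illustrative names. Note that `AIC()` and the AIC printed by `step()` differ by an additive constant for a given dataset, so the values will not match the trace, but the ranking of the models is the same.

```r
# Full model and the reduced model selected by step()
full_model    <- lm(mpg ~ wt + hp + disp + cyl, data = mtcars)
reduced_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Lower AIC/BIC is better; higher adjusted R-squared is better
criteria <- data.frame(
  model         = c("full", "reduced"),
  AIC           = c(AIC(full_model), AIC(reduced_model)),
  BIC           = c(BIC(full_model), BIC(reduced_model)),
  adj_r_squared = c(summary(full_model)$adj.r.squared,
                    summary(reduced_model)$adj.r.squared)
)
print(criteria)
```

BIC penalises additional parameters more heavily than AIC, so it tends to favour smaller models.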
Variable selection also helps reduce multicollinearity, which occurs
when independent variables are highly correlated.
Variance Inflation Factor (VIF) is commonly used to detect
multicollinearity.
High multicollinearity means some independent variables are strongly correlated with each other, which can affect model stability.
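The VIF of a predictor is defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. Packages such as `car` provide a `vif()` function, but the sketch below computes it with base R only to avoid extra dependencies; `predictors` and `vif_manual` are illustrative names, and the commonly quoted thresholds of 5 or 10 are rules of thumb rather than strict cut-offs.

```r
# Candidate predictors from the mtcars example
predictors <- c("wt", "hp", "disp", "cyl")

# For each predictor, regress it on the others and apply VIF = 1 / (1 - R^2)
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate that its coefficient estimate is inflated by correlation with the remaining predictors.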
Variable selection methods are widely used and essential in regression
modeling and machine learning.
They help identify the most important variables while improving model
simplicity and prediction accuracy.
In R, methods such as forward selection, backward elimination, and
stepwise selection are commonly used to build efficient regression models.
Proper variable selection leads to better interpretation, reduced
multicollinearity, and improved model performance.