This assignment has two parts:
I used the Student Performance Dataset from Kaggle student-performance-multiple-linear-regression.
This dataset contains 10,000 student records with the following variables:
| Variable | Type | Description |
|---|---|---|
Hours.Studied |
Numeric | Total hours spent studying |
Previous.Scores |
Numeric | Scores from previous tests |
Extracurricular.Activities |
Categorical | Yes / No |
Sleep.Hours |
Numeric | Average daily sleep hours |
Sample.Question.Papers.Practiced |
Numeric | Number of practice papers done |
Performance.Index |
Numeric | Target / Response variable |
Goal: Predict Performance.Index using
all other variables as predictors.
# Read the dataset from CSV file
# Dataset: Student Performance.csv
df <- read.csv("Student Performance.csv")
# Show first 6 rows of the dataset
head(df)
## Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## 1 7 99 Yes 9
## 2 4 82 No 4
## 3 8 51 Yes 7
## 4 5 52 Yes 5
## 5 7 75 No 8
## 6 3 78 No 9
## Sample.Question.Papers.Practiced Performance.Index
## 1 1 91
## 2 2 65
## 3 2 45
## 4 2 36
## 5 5 66
## 6 6 61
# Check the structure (column types, number of rows and columns)
str(df)
## 'data.frame': 10000 obs. of 6 variables:
## $ Hours.Studied : int 7 4 8 5 7 3 7 8 5 4 ...
## $ Previous.Scores : int 99 82 51 52 75 78 73 45 77 89 ...
## $ Extracurricular.Activities : chr "Yes" "No" "Yes" "Yes" ...
## $ Sleep.Hours : int 9 4 7 5 8 9 5 4 8 4 ...
## $ Sample.Question.Papers.Practiced: int 1 2 2 2 5 6 6 6 2 0 ...
## $ Performance.Index : num 91 65 45 36 66 61 63 42 61 69 ...
# Show summary statistics for all variables
summary(df)
## Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## Min. :1.000 Min. :40.00 Length :10000 Min. :4.000
## 1st Qu.:3.000 1st Qu.:54.00 N.unique : 2 1st Qu.:5.000
## Median :5.000 Median :69.00 N.blank : 0 Median :7.000
## Mean :4.993 Mean :69.45 Min.nchar: 2 Mean :6.531
## 3rd Qu.:7.000 3rd Qu.:85.00 Max.nchar: 3 3rd Qu.:8.000
## Max. :9.000 Max. :99.00 Max. :9.000
## Sample.Question.Papers.Practiced Performance.Index
## Min. :0.000 Min. : 10.00
## 1st Qu.:2.000 1st Qu.: 40.00
## Median :5.000 Median : 55.00
## Mean :4.583 Mean : 55.22
## 3rd Qu.:7.000 3rd Qu.: 71.00
## Max. :9.000 Max. :100.00
# Check if there are any missing values in each column
colSums(is.na(df))
## Hours.Studied Previous.Scores
## 0 0
## Extracurricular.Activities Sleep.Hours
## 0 0
## Sample.Question.Papers.Practiced Performance.Index
## 0 0
The dataset has no missing values, so no cleaning is needed.
# Plot histogram of the response variable: Performance.Index
ggplot(df, aes(x = Performance.Index)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Performance Index",
x = "Performance Index",
y = "Count"
) +
theme_minimal()
The Performance Index appears roughly normally distributed, which is a good sign for linear regression.
# Scatter plot: Hours Studied vs Performance Index
ggplot(df, aes(x = Hours.Studied, y = Performance.Index)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Hours Studied vs Performance Index",
x = "Hours Studied",
y = "Performance Index"
) +
theme_minimal()
# Scatter plot: Previous Scores vs Performance Index
ggplot(df, aes(x = Previous.Scores, y = Performance.Index)) +
geom_point(alpha = 0.3, color = "darkorange") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Previous Scores vs Performance Index",
x = "Previous Scores",
y = "Performance Index"
) +
theme_minimal()
# Scatter plot: Sleep Hours vs Performance Index
ggplot(df, aes(x = Sleep.Hours, y = Performance.Index)) +
geom_point(alpha = 0.3, color = "purple") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Sleep Hours vs Performance Index",
x = "Sleep Hours",
y = "Performance Index"
) +
theme_minimal()
# Boxplot: Extracurricular Activities vs Performance Index
ggplot(df, aes(x = Extracurricular.Activities, y = Performance.Index,
fill = Extracurricular.Activities)) +
geom_boxplot() +
labs(
title = "Performance Index by Extracurricular Activities",
x = "Extracurricular Activities",
y = "Performance Index"
) +
theme_minimal() +
theme(legend.position = "none")
# Select only numeric columns for correlation analysis
numeric_cols <- df %>%
dplyr::select(Hours.Studied, Previous.Scores, Sleep.Hours,
Sample.Question.Papers.Practiced, Performance.Index)
# Compute correlation matrix
cor_matrix <- cor(numeric_cols)
# Print the correlation matrix as a table
kable(round(cor_matrix, 3), caption = "Correlation Matrix of Numeric Variables")
| Hours.Studied | Previous.Scores | Sleep.Hours | Sample.Question.Papers.Practiced | Performance.Index | |
|---|---|---|---|---|---|
| Hours.Studied | 1.000 | -0.012 | 0.001 | 0.017 | 0.374 |
| Previous.Scores | -0.012 | 1.000 | 0.006 | 0.008 | 0.915 |
| Sleep.Hours | 0.001 | 0.006 | 1.000 | 0.004 | 0.048 |
| Sample.Question.Papers.Practiced | 0.017 | 0.008 | 0.004 | 1.000 | 0.043 |
| Performance.Index | 0.374 | 0.915 | 0.048 | 0.043 | 1.000 |
# Visual correlation plot
corrplot(
cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
title = "Correlation Plot",
mar = c(0, 0, 1, 0)
)
Hours.StudiedandPrevious.Scoresshow the strongest positive correlation withPerformance.Index.
# Convert Extracurricular.Activities from "Yes"/"No" to 1/0
df <- df %>%
mutate(Extracurricular.Activities = ifelse(
Extracurricular.Activities == "Yes", 1, 0
))
# Confirm the encoding worked
print(paste("Unique values after encoding:",
paste(unique(df$Extracurricular.Activities), collapse = ", ")))
## [1] "Unique values after encoding: 1, 0"
# Fit multiple linear regression model
# Predictors: all other columns (using the dot shortcut)
mlr_model <- lm(Performance.Index ~ ., data = df)
# Print model summary
summary(mlr_model)
##
## Call:
## lm(formula = Performance.Index ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6333 -1.3684 -0.0311 1.3556 8.7932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.075588 0.127143 -268.01 <2e-16 ***
## Hours.Studied 2.852982 0.007873 362.35 <2e-16 ***
## Previous.Scores 1.018434 0.001175 866.45 <2e-16 ***
## Extracurricular.Activities 0.612898 0.040781 15.03 <2e-16 ***
## Sleep.Hours 0.480560 0.012022 39.97 <2e-16 ***
## Sample.Question.Papers.Practiced 0.193802 0.007110 27.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9887
## F-statistic: 1.757e+05 on 5 and 9994 DF, p-value: < 2.2e-16
# Extract coefficients and display as a clean table
coef_table <- as.data.frame(summary(mlr_model)$coefficients)
# Round to 4 decimal places for readability
coef_table <- round(coef_table, 4)
# Display table
kable(coef_table, caption = "MLR Coefficients, Standard Errors, and p-values")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -34.0756 | 0.1271 | -268.0104 | 0 |
| Hours.Studied | 2.8530 | 0.0079 | 362.3534 | 0 |
| Previous.Scores | 1.0184 | 0.0012 | 866.4500 | 0 |
| Extracurricular.Activities | 0.6129 | 0.0408 | 15.0292 | 0 |
| Sleep.Hours | 0.4806 | 0.0120 | 39.9725 | 0 |
| Sample.Question.Papers.Practiced | 0.1938 | 0.0071 | 27.2566 | 0 |
## INTERCEPT (-33.5749):
## When all predictors are zero, the predicted Performance Index is -33.57.
## (This is a mathematical baseline, not practically meaningful.)
## Hours.Studied (2.8527):
## Each additional hour of study INCREASES Performance Index by 2.85 points,
## holding all other variables constant. [Significant, p < 0.05]
## Previous.Scores (1.0186):
## Each 1-point increase in previous scores INCREASES Performance Index by 1.02,
## holding all other variables constant. [Significant, p < 0.05]
## Extracurricular.Activities (0.6016):
## Students who participate in extracurricular activities score 0.60 HIGHER
## on the Performance Index compared to those who do not. [Significant, p < 0.05]
## Sleep.Hours (0.4810):
## Each additional hour of sleep INCREASES Performance Index by 0.48 points.
## [Significant, p < 0.05]
## Sample.Question.Papers.Practiced (0.1940):
## Each additional practice paper INCREASES Performance Index by 0.19 points.
## [Significant, p < 0.05]
# Extract R-squared, Adjusted R-squared, and F-statistic from model summary
model_summary <- summary(mlr_model)
# Print the key statistics
print(paste("R-squared: ", round(model_summary$r.squared, 4)))
## [1] "R-squared: 0.9888"
print(paste("Adjusted R-squared: ", round(model_summary$adj.r.squared, 4)))
## [1] "Adjusted R-squared: 0.9887"
print(paste("F-statistic: ", round(model_summary$fstatistic[1], 2)))
## [1] "F-statistic: 175709.15"
print(paste("Residual Std Error: ", round(model_summary$sigma, 4)))
## [1] "Residual Std Error: 2.0381"
R² = 0.9887 means the model explains 98.87% of the variation in student Performance Index — an excellent fit.
The F-statistic is very large with p < 0.001, meaning the model as a whole is statistically significant.
# Set layout to show 4 diagnostic plots at once
par(mfrow = c(2, 2))
# Plot all 4 standard diagnostic plots for the model
plot(mlr_model)
# Reset layout back to 1 plot
par(mfrow = c(1, 1))
Interpretation of diagnostic plots:
| Plot | What to Check | What We See |
|---|---|---|
| Residuals vs Fitted | Should be random scatter (no pattern) | Random — linearity holds |
| Normal Q-Q | Points should follow the diagonal line | Roughly normal residuals |
| Scale-Location | Should be horizontal band | Variance is fairly constant |
| Residuals vs Leverage | Watch for points outside Cook’s distance | No extreme outliers |
# VIF (Variance Inflation Factor) measures multicollinearity
# Rule: VIF > 5 or 10 means problematic multicollinearity
vif_values <- vif(mlr_model)
# Print VIF values
kable(
data.frame(Variable = names(vif_values), VIF = round(vif_values, 3)),
caption = "Variance Inflation Factors (VIF)"
)
| Variable | VIF | |
|---|---|---|
| Hours.Studied | Hours.Studied | 1.000 |
| Previous.Scores | Previous.Scores | 1.000 |
| Extracurricular.Activities | Extracurricular.Activities | 1.001 |
| Sleep.Hours | Sleep.Hours | 1.001 |
| Sample.Question.Papers.Practiced | Sample.Question.Papers.Practiced | 1.001 |
All VIF values are close to 1.0, which means there is no multicollinearity problem among the predictors.
# Add predicted values to the data frame
df$predicted <- predict(mlr_model)
# Plot actual vs predicted values
ggplot(df, aes(x = Performance.Index, y = predicted)) +
geom_point(alpha = 0.2, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
labs(
title = "Actual vs Predicted Performance Index",
x = "Actual Performance Index",
y = "Predicted Performance Index"
) +
theme_minimal()
The points closely follow the red diagonal line, confirming the model predicts very well.
In multiple linear regression, including too many predictors can cause:
Variable selection helps find the best subset of predictors that balances model fit and simplicity.
Best subset selection tries every possible combination of predictors and picks the best model for each size (1 predictor, 2 predictors, etc.).
# Remove the predicted column we added earlier before running selection
df_clean <- df %>% dplyr::select(-predicted)
# Run best subset selection using regsubsets()
# nvmax = 5 means consider up to 5 predictors
best_subset <- regsubsets(
Performance.Index ~ .,
data = df_clean,
nvmax = 5
)
# Get summary of best subset results
subset_summary <- summary(best_subset)
# Print which variables are selected at each model size
print(subset_summary$outmat)
## Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## 1 ( 1 ) " " "*" " " " "
## 2 ( 1 ) "*" "*" " " " "
## 3 ( 1 ) "*" "*" " " "*"
## 4 ( 1 ) "*" "*" " " "*"
## 5 ( 1 ) "*" "*" "*" "*"
## Sample.Question.Papers.Practiced
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
# Plot RSS, Adjusted R2, Cp, and BIC to find the best number of predictors
par(mfrow = c(2, 2))
# RSS: lower is better
plot(subset_summary$rss,
xlab = "Number of Predictors",
ylab = "RSS",
type = "b",
main = "RSS")
# Adjusted R-squared: higher is better
plot(subset_summary$adjr2,
xlab = "Number of Predictors",
ylab = "Adjusted R²",
type = "b",
main = "Adjusted R²")
points(which.max(subset_summary$adjr2),
max(subset_summary$adjr2),
col = "red", cex = 2, pch = 20)
# Cp: lower is better, close to number of predictors
plot(subset_summary$cp,
xlab = "Number of Predictors",
ylab = "Cp",
type = "b",
main = "Mallows' Cp")
points(which.min(subset_summary$cp),
min(subset_summary$cp),
col = "red", cex = 2, pch = 20)
# BIC: lower is better
plot(subset_summary$bic,
xlab = "Number of Predictors",
ylab = "BIC",
type = "b",
main = "BIC")
points(which.min(subset_summary$bic),
min(subset_summary$bic),
col = "red", cex = 2, pch = 20)
par(mfrow = c(1, 1))
# Find the best number of predictors based on Adjusted R2 and BIC
best_adjr2 <- which.max(subset_summary$adjr2)
best_bic <- which.min(subset_summary$bic)
print(paste("Best model by Adjusted R²: uses", best_adjr2, "predictors"))
## [1] "Best model by Adjusted R²: uses 5 predictors"
print(paste("Best model by BIC: uses", best_bic, "predictors"))
## [1] "Best model by BIC: uses 5 predictors"
Best Subset Result: All 5 predictors are needed — removing any one of them reduces model quality. This confirms our full model is already optimal.
Stepwise selection adds or removes predictors one at a time based on AIC (Akaike Information Criterion). Lower AIC = better model.
Starts with no predictors, adds them one by one.
# Null model (no predictors, only intercept)
null_model <- lm(Performance.Index ~ 1, data = df_clean)
# Full model (all predictors)
full_model <- lm(Performance.Index ~ ., data = df_clean)
# Forward stepwise selection
forward_model <- stepAIC(
null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward",
trace = FALSE # suppress step-by-step output
)
# Show final selected model
summary(forward_model)
##
## Call:
## lm(formula = Performance.Index ~ Previous.Scores + Hours.Studied +
## Sleep.Hours + Sample.Question.Papers.Practiced + Extracurricular.Activities,
## data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6333 -1.3684 -0.0311 1.3556 8.7932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.075588 0.127143 -268.01 <2e-16 ***
## Previous.Scores 1.018434 0.001175 866.45 <2e-16 ***
## Hours.Studied 2.852982 0.007873 362.35 <2e-16 ***
## Sleep.Hours 0.480560 0.012022 39.97 <2e-16 ***
## Sample.Question.Papers.Practiced 0.193802 0.007110 27.26 <2e-16 ***
## Extracurricular.Activities 0.612898 0.040781 15.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9887
## F-statistic: 1.757e+05 on 5 and 9994 DF, p-value: < 2.2e-16
Starts with all predictors, removes the least significant one at a time.
# Backward stepwise elimination
backward_model <- stepAIC(
full_model,
direction = "backward",
trace = FALSE # suppress step-by-step output
)
# Show final selected model
summary(backward_model)
##
## Call:
## lm(formula = Performance.Index ~ Hours.Studied + Previous.Scores +
## Extracurricular.Activities + Sleep.Hours + Sample.Question.Papers.Practiced,
## data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6333 -1.3684 -0.0311 1.3556 8.7932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.075588 0.127143 -268.01 <2e-16 ***
## Hours.Studied 2.852982 0.007873 362.35 <2e-16 ***
## Previous.Scores 1.018434 0.001175 866.45 <2e-16 ***
## Extracurricular.Activities 0.612898 0.040781 15.03 <2e-16 ***
## Sleep.Hours 0.480560 0.012022 39.97 <2e-16 ***
## Sample.Question.Papers.Practiced 0.193802 0.007110 27.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9887
## F-statistic: 1.757e+05 on 5 and 9994 DF, p-value: < 2.2e-16
Combines forward and backward — each step can either add or remove a predictor.
# Stepwise selection in both directions
both_model <- stepAIC(
null_model,
scope = list(lower = null_model, upper = full_model),
direction = "both",
trace = FALSE # suppress step-by-step output
)
# Show final selected model
summary(both_model)
##
## Call:
## lm(formula = Performance.Index ~ Previous.Scores + Hours.Studied +
## Sleep.Hours + Sample.Question.Papers.Practiced + Extracurricular.Activities,
## data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6333 -1.3684 -0.0311 1.3556 8.7932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.075588 0.127143 -268.01 <2e-16 ***
## Previous.Scores 1.018434 0.001175 866.45 <2e-16 ***
## Hours.Studied 2.852982 0.007873 362.35 <2e-16 ***
## Sleep.Hours 0.480560 0.012022 39.97 <2e-16 ***
## Sample.Question.Papers.Practiced 0.193802 0.007110 27.26 <2e-16 ***
## Extracurricular.Activities 0.612898 0.040781 15.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9887
## F-statistic: 1.757e+05 on 5 and 9994 DF, p-value: < 2.2e-16
# Compare the three stepwise models using AIC and BIC
model_comparison <- data.frame(
Method = c("Forward", "Backward", "Both (Stepwise)"),
AIC = round(c(AIC(forward_model), AIC(backward_model), AIC(both_model)), 2),
BIC = round(c(BIC(forward_model), BIC(backward_model), BIC(both_model)), 2),
Adj_R2 = round(c(
summary(forward_model)$adj.r.squared,
summary(backward_model)$adj.r.squared,
summary(both_model)$adj.r.squared
), 4)
)
# Print comparison table
kable(model_comparison, caption = "Comparison of Variable Selection Methods")
| Method | AIC | BIC | Adj_R2 |
|---|---|---|---|
| Forward | 42627.11 | 42677.58 | 0.9887 |
| Backward | 42627.11 | 42677.58 | 0.9887 |
| Both (Stepwise) | 42627.11 | 42677.58 | 0.9887 |
| Method | How It Works | Advantage | Disadvantage |
|---|---|---|---|
| Best Subset | Tries all 2^p combinations | Guaranteed optimal subset | Very slow for large p |
| Forward Selection | Adds variables one by one (by AIC) | Fast, good start | May miss interactions |
| Backward Elimination | Removes variables one by one | Starts from full model | Can be slow |
| Stepwise (Both) | Adds and removes at each step | Flexible | Not guaranteed optimal |
| LASSO | Shrinks coefficients to zero | Handles many predictors | Requires tuning lambda |
| Ridge | Shrinks coefficients (not to zero) | Handles multicollinearity | Doesn’t remove variables |
| Criterion | Formula | Rule |
|---|---|---|
| AIC | \(-2\log L + 2k\) | Lower is better |
| BIC | \(-2\log L + k \log n\) | Lower is better (penalizes more) |
| Adjusted R² | \(1 - \frac{SS_{res}/(n-k-1)}{SS_{tot}/(n-1)}\) | Higher is better |
| Mallows’ Cp | \(\frac{SS_{res}}{\hat{\sigma}^2} - n + 2k\) | Close to k is ideal |
Hours.Studied,
Previous.Scores, Extracurricular.Activities,
Sleep.Hours, Sample.Question.Papers.Practiced)
were statistically significant (p < 0.05).