Utilizing Supervised Learning in Learning Analytics

Case Study 4

Author

Dr. Mighty Itauma Itauma

Business Scenario: Predicting Student Performance

In this case study, you are an analyst at an online education platform. The management is interested in predicting student performance based on various factors to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model to predict students’ final grades using simulated data.

Objective:

Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.3     ✔ tibble       3.2.1
✔ ggplot2      3.4.3     ✔ tidyr        1.3.0
✔ infer        1.0.5     ✔ tune         1.1.2
✔ modeldata    1.2.0     ✔ workflows    1.1.3
✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.2.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ lubridate 1.9.2     ✔ stringr   1.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(ggplot2)

Data Generation:

# Set a fixed random seed for reproducibility
set.seed(10923)

# Number of students
#TODO: set num_students to 500
# Enter code below:
num_students <- 500

# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)

# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)

# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)

# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)

# Simulate final grades (ranging from 0 to 100)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts + 0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25

# Create a data frame
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores, ForumPosts = forum_posts, PreviousGrades = previous_grades, FinalGrades = final_grades)

# View the first few rows of the generated data
head(student_data)

  StudyHours QuizScores ForumPosts PreviousGrades FinalGrades
1         20         91         22             78    80.80895
2         12         26         27              1    46.45853
3         13          5          8             60    40.22946
4          4         96         13             78    70.64216
5          5         74         45             31    62.35254
6         18          1         47             50    48.42835

Explore the data

# Todo:
# Summary statistics for each variable
summary(student_data)

   StudyHours      QuizScores       ForumPosts    PreviousGrades  
 Min.   : 1.00   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00  
 1st Qu.: 6.00   1st Qu.: 24.00   1st Qu.:12.00   1st Qu.: 23.00  
 Median :11.00   Median : 48.00   Median :24.00   Median : 51.00  
 Mean   :10.67   Mean   : 48.54   Mean   :24.26   Mean   : 50.05  
 3rd Qu.:16.00   3rd Qu.: 73.00   3rd Qu.:37.00   3rd Qu.: 75.00  
 Max.   :20.00   Max.   :100.00   Max.   :50.00   Max.   :100.00  
  FinalGrades   
 Min.   :24.19  
 1st Qu.:47.15  
 Median :57.18  
 Mean   :57.35  
 3rd Qu.:67.01  
 Max.   :95.36

# Mean and standard deviation of FinalGrades
mean(student_data$FinalGrades)

[1] 57.34912

sd(student_data$FinalGrades)

[1] 13.59735

# Median and quantiles for StudyHours
median(student_data$StudyHours)

[1] 11

quantile(student_data$StudyHours)

  0%  25%  50%  75% 100% 
   1    6   11   16   20

# Calculate the correlation matrix
cor_matrix <- cor(student_data[, c("StudyHours", "QuizScores", "ForumPosts", "PreviousGrades", "FinalGrades")])
cor_matrix

                StudyHours   QuizScores   ForumPosts PreviousGrades FinalGrades
StudyHours      1.00000000  0.025275340 -0.018693242     0.01980435   0.1521135
QuizScores      0.02527534  1.000000000 -0.004709808     0.07537841   0.8732645
ForumPosts     -0.01869324 -0.004709808  1.000000000     0.04874554   0.2194791
PreviousGrades  0.01980435  0.075378406  0.048745536     1.00000000   0.2759020
FinalGrades     0.15211349  0.873264537  0.219479119     0.27590199   1.0000000

library(ggplot2)

ggplot(student_data, aes(x = StudyHours)) +
  geom_histogram() +
  labs(title = "Distribution of Study Hours", x = "Study Hours")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(student_data, aes(x = QuizScores)) +
  geom_histogram() +
  labs(title = "Distribution of Quiz Scores", x = "Quiz Scores")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Repeat the above for ForumPosts, PreviousGrades, and FinalGrades

ggplot(student_data, aes(x = StudyHours, y = FinalGrades)) +
  geom_point() +
  labs(title = "Study Hours vs. Final Grades", x = "Study Hours", y = "Final Grades")

ggplot(student_data, aes(x = as.factor(StudyHours), y = FinalGrades)) +
  geom_boxplot() +
  labs(title = "Box Plot of Final Grades by Study Hours", x = "Study Hours", y = "Final Grades")

Modeling

Use 80% of the data for training and 20% for testing to predict final grades. Compute the Mean Squared Error and model accuracy based on prediction interval.

# Todo:
# Splitting the data into training and testing sets (80% training, 20% testing)
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Building a Linear Regression model using the train data and assign it to an object # called model.
# Todo: Target variable is FinalGrades and the Features are StudyHours, QuizScores, # ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = train_data)


# Making predictions on the test set. use the model object to make prediction.
# Enter code below:
predictions <- predict(model, newdata = test_data)
# Evaluation metrics
# Compute the mean squared error and R-squared
# Enter code below
mse <- mean((predictions - test_data$FinalGrades)^2)

# Calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mse)

# Print evaluation metrics
#Enter code below
cat("MSE:", mse, "\n")

MSE: 22.34656

cat("RMSE:", rmse, "\n")

RMSE: 4.727215

summary(model)


Call:
lm(formula = FinalGrades ~ StudyHours + QuizScores + ForumPosts + 
    PreviousGrades, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5265  -3.4421   0.3997   3.1947  15.6419 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24.953643   0.863889  28.885  < 2e-16 ***
StudyHours      0.331338   0.041453   7.993 1.46e-14 ***
QuizScores      0.402828   0.008646  46.593  < 2e-16 ***
ForumPosts      0.194558   0.017110  11.371  < 2e-16 ***
PreviousGrades  0.090502   0.008312  10.888  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.988 on 395 degrees of freedom
Multiple R-squared:  0.8648,    Adjusted R-squared:  0.8635 
F-statistic: 631.8 on 4 and 395 DF,  p-value: < 2.2e-16

Model Accuracy based on Prediction Interval

# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")

Model Accuracy using Prediction Interval: 0.96

The accuracy is calculated as the proportion of correct predictions.

Have fun!

#Model 2
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Building a Linear Regression model using the train data and assign it to an object # called model.
# Todo: Target variable is FinalGrades and the Features are StudyHours, QuizScores, # ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~ StudyHours + QuizScores + PreviousGrades, data = train_data)


# Making predictions on the test set. use the model object to make prediction.
# Enter code below:
predictions <- predict(model, newdata = test_data)
# Evaluation metrics
# Compute the mean squared error and R-squared
# Enter code below
mse <- mean((predictions - test_data$FinalGrades)^2)

# Calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mse)

# Print evaluation metrics
#Enter code below
cat("MSE:", mse, "\n")

MSE: 33.1469

cat("RMSE:", rmse, "\n")

RMSE: 5.757334

# Prediction 2
# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")

Model Accuracy using Prediction Interval: 0.96

#Model 3
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Building a Linear Regression model using the train data and assign it to an object # called model.
# Todo: Target variable is FinalGrades and the Features are StudyHours, QuizScores, # ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~ QuizScores + PreviousGrades, data = train_data)


# Making predictions on the test set. use the model object to make prediction.
# Enter code below:
predictions <- predict(model, newdata = test_data)
# Evaluation metrics
# Compute the mean squared error and R-squared
# Enter code below
mse <- mean((predictions - test_data$FinalGrades)^2)

# Calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mse)

# Print evaluation metrics
#Enter code below
cat("MSE:", mse, "\n")

MSE: 32.89714

cat("RMSE:", rmse, "\n")

RMSE: 5.735603

# Prediction 3
# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")

Model Accuracy using Prediction Interval: 0.96

#Model 4
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Building a Linear Regression model using the train data and assign it to an object # called model.
# Todo: Target variable is FinalGrades and the Features are StudyHours, QuizScores, # ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~ QuizScores, data = train_data)


# Making predictions on the test set. use the model object to make prediction.
# Enter code below:
predictions <- predict(model, newdata = test_data)
# Evaluation metrics
# Compute the mean squared error and R-squared
# Enter code below
mse <- mean((predictions - test_data$FinalGrades)^2)

# Calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mse)

# Print evaluation metrics
#Enter code below
cat("MSE:", mse, "\n")

MSE: 39.08271

cat("RMSE:", rmse, "\n")

RMSE: 6.251616

# Prediction 4
# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")

Model Accuracy using Prediction Interval: 0.95

#Model 5
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Building a Linear Regression model using the train data and assign it to an object # called model.
# Todo: Target variable is FinalGrades and the Features are StudyHours, QuizScores, # ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~ StudyHours + QuizScores, data = train_data)


# Making predictions on the test set. use the model object to make prediction.
# Enter code below:
predictions <- predict(model, newdata = test_data)
# Evaluation metrics
# Compute the mean squared error and R-squared
# Enter code below
mse <- mean((predictions - test_data$FinalGrades)^2)

# Calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mse)

# Print evaluation metrics
#Enter code below
cat("MSE:", mse, "\n")

MSE: 40.43793

cat("RMSE:", rmse, "\n")

RMSE: 6.359083

# Prediction 5
# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")

Model Accuracy using Prediction Interval: 0.96

MSE Accuracy

Model 1

22.34656 
0.96

Model 2

MSE: 33.1469 
0.96

Model 3

MSE: 32.89714 
0.96

Model 4

MSE: 39.08271 
0.95

Model 5

MSE: 40.43793 
0.96

Model 1 has best fit as it has low MSE value

Summary:

The model will have 0.96 accuracy without forum posts and previous grades features which means they are not affected the model’s performance accuracy in predicting the final grades. But model 1 has low MSE which means it best fit the data which can be used as a final model.