Utilizing Supervised Learning in Learning Analytics

Case Study 4

Author

Amandeep Singh

Business Scenario: Predicting Student Performance

In this case study, you are an analyst at an online education platform. The management is interested in predicting student performance based on various factors to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model to predict students’ final grades using simulated data.

Objective:

Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.

Data Generation:

# Set a fixed random seed for reproducibility
set.seed(10923)

# Number of students
#TODO: set num_students to 500
# Enter code below:
num_students <- 500

# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)

# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)

# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)

# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)

# Simulate final grades as a weighted sum of the features plus Gaussian noise
# (values are not explicitly clipped to the 0-100 range)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
  0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25

# Create a data frame
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores, ForumPosts = forum_posts, PreviousGrades = previous_grades, FinalGrades = final_grades)

# View the first few rows of the generated data
head(student_data)

  StudyHours QuizScores ForumPosts PreviousGrades FinalGrades
1         20         91         22             78    80.80895
2         12         26         27              1    46.45853
3         13          5          8             60    40.22946
4          4         96         13             78    70.64216
5          5         74         45             31    62.35254
6         18          1         47             50    48.42835

Explore the data

# Todo:
# View the structure of the data frame
str(student_data)

'data.frame':   500 obs. of  5 variables:
 $ StudyHours    : int  20 12 13 4 5 18 17 16 3 14 ...
 $ QuizScores    : int  91 26 5 96 74 1 48 91 28 4 ...
 $ ForumPosts    : int  22 27 8 13 45 47 6 46 14 5 ...
 $ PreviousGrades: int  78 1 60 78 31 50 92 39 75 33 ...
 $ FinalGrades   : num  80.8 46.5 40.2 70.6 62.4 ...

# Summary statistics of the data
summary(student_data)

   StudyHours      QuizScores       ForumPosts    PreviousGrades  
 Min.   : 1.00   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00  
 1st Qu.: 6.00   1st Qu.: 24.00   1st Qu.:12.00   1st Qu.: 23.00  
 Median :11.00   Median : 48.00   Median :24.00   Median : 51.00  
 Mean   :10.67   Mean   : 48.54   Mean   :24.26   Mean   : 50.05  
 3rd Qu.:16.00   3rd Qu.: 73.00   3rd Qu.:37.00   3rd Qu.: 75.00  
 Max.   :20.00   Max.   :100.00   Max.   :50.00   Max.   :100.00  
  FinalGrades   
 Min.   :24.19  
 1st Qu.:47.15  
 Median :57.18  
 Mean   :57.35  
 3rd Qu.:67.01  
 Max.   :95.36
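
As a quick sanity check on the summary above, the expected mean of FinalGrades can be worked out from the simulation formula, since each feature is sampled uniformly from its range. The sketch below assumes only the weights and ranges used in the data generation step; `expected_final` and the `mean_*` names are illustrative.

```r
# Expected value of each uniformly sampled feature
mean_study <- mean(1:20)    # 10.5
mean_quiz  <- mean(0:100)   # 50
mean_forum <- mean(0:50)    # 25
mean_prev  <- mean(0:100)   # 50

# Weighted sum plus the constant offset of 25; the noise term has mean 0
expected_final <- 0.3 * mean_study + 0.4 * mean_quiz +
  0.2 * mean_forum + 0.1 * mean_prev + 25
print(expected_final)  # 58.15
```

The observed mean of 57.35 is close to this value, which is what one would expect for 500 simulated students.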

# Load required libraries
library(ggplot2)

library(corrplot)

corrplot 0.92 loaded

# Visualize the distribution of each variable
# Histogram for Study Hours (one bin per hour)
ggplot(student_data, aes(x = StudyHours)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Study Hours")

# Histogram for Quiz Scores (bins of 5 points)
ggplot(student_data, aes(x = QuizScores)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Quiz Scores")

# Histogram for Forum Posts (bins of 5 posts)
ggplot(student_data, aes(x = ForumPosts)) +
  geom_histogram(binwidth = 5, fill = "lightyellow", color = "black") +
  labs(title = "Distribution of Forum Posts")

# Histogram for Previous Grades (bins of 5 points)
ggplot(student_data, aes(x = PreviousGrades)) +
  geom_histogram(binwidth = 5, fill = "pink", color = "black") +
  labs(title = "Distribution of Previous Grades")

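# Scatterplot of study hours vs. quiz scores, colored by final grade
# (a single panel, not a full scatterplot matrix)
scatterplotMatrix <- ggplot(student_data, aes(x = StudyHours, y = QuizScores)) +
  geom_point(aes(color = FinalGrades)) +
  labs(title = "Study Hours vs. Quiz Scores, Colored by Final Grade")

# Display the plot (assigning it to an object does not draw it)
print(scatterplotMatrix)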

# Correlation matrix of the predictors only (column 5, FinalGrades, is excluded)
correlation_matrix <- cor(student_data[, -5])
corrplot(correlation_matrix, method = "color")
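
The plot above shows only two of the features, and the correlation matrix leaves out FinalGrades. A minimal sketch of how one might view all pairwise relationships and the correlation of each predictor with the target follows. It uses base R's `pairs()`; the helper `target_correlations()` and the stand-in data frame `demo` are hypothetical and exist only so the block runs on its own.

```r
# Hypothetical helper: correlation of every predictor with the target column
target_correlations <- function(data, target = "FinalGrades") {
  predictors <- setdiff(names(data), target)
  sapply(predictors, function(p) cor(data[[p]], data[[target]]))
}

# Small stand-in data frame (an assumption, not the case-study data)
set.seed(1)
demo <- data.frame(QuizScores = sample(0:100, 50, replace = TRUE),
                   ForumPosts = sample(0:50, 50, replace = TRUE))
demo$FinalGrades <- 0.4 * demo$QuizScores + 0.2 * demo$ForumPosts +
  rnorm(50, mean = 0, sd = 5) + 25

# Base-R scatterplot matrix: one panel per pair of columns
pairs(demo, main = "Scatterplot Matrix")

# Correlation of each predictor with the target
print(target_correlations(demo))
```

With the case-study data, `pairs(student_data)` and `target_correlations(student_data)` would apply the same idea to all five columns.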

Modeling

Use 80% of the data for training and 20% for testing to predict final grades. Compute the mean squared error (MSE) on the test set, then assess accuracy as the share of test observations that fall inside the model's prediction interval.

# Todo:
# Splitting the data into training and testing sets (80% training, 20% testing)
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Build a linear regression model on the training data and assign it to an
# object called model.
# Todo: The target variable is FinalGrades and the features are StudyHours,
# QuizScores, ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = train_data)


# Make predictions on the test set using the model object.
# Enter code below:
predictions <- predict(model, newdata = test_data)

# Evaluation metrics
# Compute the mean squared error on the test set; R-squared is taken from
# summary(model), so it describes the fit on the training data
# Enter code below
mean_squared_error <- mean((test_data$FinalGrades - predictions)^2)
rsquared <- summary(model)$r.squared

# Print evaluation metrics
#Enter code below
print(paste("Mean Squared Error:", mean_squared_error))

[1] "Mean Squared Error: 22.3465631070855"

print(paste("R-squared:", rsquared))

[1] "R-squared: 0.864833778721095"
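
Because the data are simulated, the fitted coefficients can be compared with the weights used to generate FinalGrades. The sketch below re-creates data with the same formula so that it runs on its own, and for brevity fits the model on all rows rather than on the 80% training split. The names `sim`, `fit`, `true_weights`, and `comparison` are illustrative.

```r
# Re-create simulated data with the same formula as the data generation step
set.seed(10923)
n <- 500
sim <- data.frame(
  StudyHours     = sample(1:20,  n, replace = TRUE),
  QuizScores     = sample(0:100, n, replace = TRUE),
  ForumPosts     = sample(0:50,  n, replace = TRUE),
  PreviousGrades = sample(0:100, n, replace = TRUE)
)
sim$FinalGrades <- 0.3 * sim$StudyHours + 0.4 * sim$QuizScores +
  0.2 * sim$ForumPosts + 0.1 * sim$PreviousGrades +
  rnorm(n, mean = 0, sd = 5) + 25

# Fit the same linear model (here on all rows, an assumption for brevity)
fit <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
          data = sim)

# Weights used in the simulation, keyed by coefficient name
true_weights <- c("(Intercept)" = 25, StudyHours = 0.3, QuizScores = 0.4,
                  ForumPosts = 0.2, PreviousGrades = 0.1)

# Side-by-side comparison of true and estimated coefficients
comparison <- cbind(true = true_weights,
                    estimated = coef(fit)[names(true_weights)])
print(round(comparison, 3))
```

Estimates close to the true weights suggest the model has recovered the structure of the simulation; `coef(model)` gives the corresponding values for the model fitted above.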

Interpretation

Mean Squared Error (MSE) is the average of the squared differences between the actual and predicted values of “FinalGrades”. The test-set MSE of 22.35 corresponds to a root mean squared error of about 4.7 grade points, on a scale where the simulated final grades range from roughly 24 to 95, so the model's predictions are reasonably accurate.
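
MSE is expressed in squared grade points, so it is easier to judge after taking the square root and comparing it with the noise built into the simulation. A minimal sketch of that arithmetic, using the MSE printed above and the `sd = 5` noise term from the data generation step (variable names are illustrative):

```r
mse <- 22.3465631070855   # test-set MSE reported above
rmse <- sqrt(mse)         # typical prediction error, in grade points

# Variance of the rnorm(sd = 5) noise term: no model can be expected
# to do systematically better than this on new data
noise_variance <- 5^2

cat("RMSE:", round(rmse, 2), "\n")
cat("MSE / noise variance:", round(mse / noise_variance, 2), "\n")
```

A ratio near 1 means the model is already close to the noise floor of the simulation. A value slightly below 1 reflects sampling variability in a test set of 100 rows.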

R-squared is the proportion of the variance in the dependent variable that the independent variables explain. Here “FinalGrades” is the dependent variable and “StudyHours”, “QuizScores”, “ForumPosts”, and “PreviousGrades” are the independent variables. An R-squared of 0.8648 means that approximately 86.48% of the variance in “FinalGrades” is explained by the model. Because this value comes from summary(model), it describes the fit on the training data rather than on the test set. It indicates a good fit, which is expected given that the data were simulated from a linear formula.
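
To report R-squared on held-out data instead, it can be computed directly from actual and predicted values. The helper `r_squared()` below is a hypothetical sketch, not part of the original analysis, shown with a small toy example.

```r
# Hypothetical helper: R-squared of predictions against actual values
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)      # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Toy check: residuals of 2, 2, 1, 1 against a total sum of squares of 500
actual    <- c(50, 60, 70, 80)
predicted <- c(52, 58, 71, 79)
print(r_squared(actual, predicted))  # 0.98
```

In this case study, `r_squared(test_data$FinalGrades, predictions)` would give the out-of-sample value to set beside the test-set MSE.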

Model Accuracy based on Prediction Interval

# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")

Model Accuracy using Prediction Interval: 0.96

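The accuracy is the proportion of test observations whose actual final grade falls inside the prediction interval. Since predict() uses a 95% level by default, a value of 0.96 means the intervals cover about as many observations as they are designed to. It measures interval coverage rather than the closeness of the point predictions.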
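
The coverage depends on the interval level, which `predict()` accepts through its `level` argument. The sketch below illustrates this on a small stand-in regression so that it runs on its own; the helper `interval_coverage()` and the variables `x`, `y`, `train`, `test`, and `fit` are assumptions for illustration, not the case-study objects.

```r
# Hypothetical helper: share of actual values inside [lower, upper]
interval_coverage <- function(actual, lower, upper) {
  mean(actual >= lower & actual <= upper)
}

# Small stand-in regression (an assumption, not the case-study data)
set.seed(42)
x <- runif(200, min = 0, max = 100)
y <- 25 + 0.4 * x + rnorm(200, mean = 0, sd = 5)
train <- data.frame(x = x[1:150], y = y[1:150])
test  <- data.frame(x = x[151:200], y = y[151:200])
fit <- lm(y ~ x, data = train)

# Coverage at several nominal levels; wider intervals cover at least as much
levels <- c(0.80, 0.95, 0.99)
coverage <- sapply(levels, function(lvl) {
  pint <- predict(fit, newdata = test, interval = "prediction", level = lvl)
  interval_coverage(test$y, pint[, "lwr"], pint[, "upr"])
})
print(data.frame(level = levels, coverage = coverage))
```

A well-calibrated model should show coverage close to each nominal level. Applied to this case study, `interval_coverage(actual_values, lower_bound, upper_bound)` reproduces the accuracy computed above.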

Have fun!