Utilizing Supervised Learning in Learning Analytics

Case Study 4

Author

Tom Collins

Published

May 6, 2024

Business Scenario: Predicting Student Performance

In this case study, you are an analyst at an online education platform. Management wants to predict student performance from various factors in order to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model that predicts students’ final grades using simulated data.

Objective:

Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.

Data Generation:

Set a fixed random seed for reproducibility

set.seed(10923)

Number of students

Todo: set num_students to 500

Enter code below:

num_students <- 500

Simulate study hours (ranging from 1 to 20 hours)

study_hours <- sample(1:20, num_students, replace = TRUE)

Simulate quiz scores (ranging from 0 to 100)

quiz_scores <- sample(0:100, num_students, replace = TRUE)

Simulate forum participation (ranging from 0 to 50 posts)

forum_posts <- sample(0:50, num_students, replace = TRUE)

Simulate previous grades (ranging from 0 to 100)

previous_grades <- sample(0:100, num_students, replace = TRUE)

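Simulate final grades as a weighted sum of the features plus normally distributed noise (intended scale: 0 to 100; the values are not clipped to that range)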

final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts + 0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25
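As a quick check on the scale of the simulated grades, the noise-free part of the formula above can be evaluated at the extreme feature values. This is only a minimal sketch; the helper name grade_signal is hypothetical and is not part of the case study.

```r
# Noise-free part of the simulated final grade.
# `grade_signal` is a hypothetical helper introduced only for this check.
grade_signal <- function(study_hours, quiz_scores, forum_posts, previous_grades) {
  0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
    0.1 * previous_grades + 25
}

# Evaluate at the smallest and largest values each feature can take
min_signal <- grade_signal(1, 0, 0, 0)
max_signal <- grade_signal(20, 100, 50, 100)

print(c(min = min_signal, max = max_signal))
```

Before noise, the grades lie between 25.3 and 91. The rnorm term (sd = 5) can push individual values somewhat outside that band, and nothing in the formula clips them to 0 to 100.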

Create a data frame

student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores, ForumPosts = forum_posts, PreviousGrades = previous_grades, FinalGrades = final_grades)

View the first few rows of the generated data

head(student_data)

Explore the data

max(student_data$FinalGrades)

Todo:

View the structure of the data frame

str(student_data)
summary(student_data)

Modeling

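Use 80% of the data for training and 20% for testing to predict final grades. Then compute the mean squared error on the test set and a model accuracy measure based on prediction intervals, that is, the share of test observations whose actual final grade falls inside its prediction interval.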

Todo:

Splitting the data into training and testing sets (80% training, 20% testing)

set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]
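The split can be sanity-checked by confirming the two sets have the expected sizes and share no rows. The sketch below uses a stand-in data frame, toy_data, with a hypothetical id column, so that it runs on its own; with the case-study data the same checks would be applied to student_data.

```r
set.seed(10923)

# Stand-in for student_data (assumption: 500 rows, as in the case study)
toy_data <- data.frame(id = 1:500, x = rnorm(500))

# Same splitting pattern as above: sample 80% of the row indices without replacement
sample_index <- sample(1:nrow(toy_data), 0.8 * nrow(toy_data))
train_data <- toy_data[sample_index, ]
test_data <- toy_data[-sample_index, ]

n_train <- nrow(train_data)                                   # 400
n_test <- nrow(test_data)                                     # 100
n_overlap <- length(intersect(train_data$id, test_data$id))   # 0

print(c(train = n_train, test = n_test, overlap = n_overlap))
```

Because sample() draws without replacement by default, no row can land in both sets.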

Build a linear regression model using the training data and assign it to an object called model.

Todo: The target variable is FinalGrades, and the features are StudyHours, QuizScores, ForumPosts, and PreviousGrades.

Enter code below:

model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = train_data)
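Because the data are simulated, the fitted coefficients can be compared with the weights used to generate the grades. The following is a minimal, self-contained sketch: it re-creates the simulation under the names sim and fit (both chosen here for illustration) and fits on all 500 rows rather than on the training split.

```r
set.seed(10923)
n <- 500

# Re-create the simulated features so the sketch runs on its own
sim <- data.frame(
  StudyHours = sample(1:20, n, replace = TRUE),
  QuizScores = sample(0:100, n, replace = TRUE),
  ForumPosts = sample(0:50, n, replace = TRUE),
  PreviousGrades = sample(0:100, n, replace = TRUE)
)
sim$FinalGrades <- 0.3 * sim$StudyHours + 0.4 * sim$QuizScores +
  0.2 * sim$ForumPosts + 0.1 * sim$PreviousGrades +
  rnorm(n, mean = 0, sd = 5) + 25

fit <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
          data = sim)

# Weights used in the simulation, in the same order as coef(fit):
# intercept, StudyHours, QuizScores, ForumPosts, PreviousGrades
true_weights <- c(25, 0.3, 0.4, 0.2, 0.1)

comparison <- data.frame(
  term = names(coef(fit)),
  estimate = unname(coef(fit)),
  truth = true_weights
)
print(comparison)
```

The estimates should land close to the true weights, with differences attributable to the noise term. In the case study itself, coef(model) or summary(model) gives the same information for the model fitted on train_data.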

Make predictions on the test set. Use the model object to generate the predictions.

Enter code below:

predictions <- predict(model, newdata = test_data)

Evaluation metrics

Compute the mean squared error and R-squared

Enter code below:

mean_squared_error <- mean((test_data$FinalGrades - predictions)^2)
rsquared <- summary(model)$r.squared # R-squared of the fit on the training data

Print evaluation metrics

Enter code below:

print(paste("Mean Squared Error:", mean_squared_error))
print(paste("R-squared:", rsquared))
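The R-squared taken from summary(model) describes the fit on the training data, whereas the mean squared error is computed on the test set. One possible way to put both metrics on the same footing is to compute an out-of-sample R-squared and a root mean squared error from the test predictions. The sketch below is self-contained and re-creates the simulation and split; the names sim, train, test, and fit are illustrative, not part of the case study.

```r
set.seed(10923)
n <- 500

# Re-create the simulated data so the sketch runs on its own
sim <- data.frame(
  StudyHours = sample(1:20, n, replace = TRUE),
  QuizScores = sample(0:100, n, replace = TRUE),
  ForumPosts = sample(0:50, n, replace = TRUE),
  PreviousGrades = sample(0:100, n, replace = TRUE)
)
sim$FinalGrades <- 0.3 * sim$StudyHours + 0.4 * sim$QuizScores +
  0.2 * sim$ForumPosts + 0.1 * sim$PreviousGrades +
  rnorm(n, mean = 0, sd = 5) + 25

# 80/20 split, then fit on the training rows only
idx <- sample(seq_len(n), 0.8 * n)
train <- sim[idx, ]
test <- sim[-idx, ]
fit <- lm(FinalGrades ~ ., data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error, expressed in grade points
rmse <- sqrt(mean((test$FinalGrades - pred)^2))

# Out-of-sample R-squared: 1 - SSE / SST, with SST taken around the test-set mean
sse <- sum((test$FinalGrades - pred)^2)
sst <- sum((test$FinalGrades - mean(test$FinalGrades))^2)
test_rsquared <- 1 - sse / sst

print(c(RMSE = rmse, test_R2 = test_rsquared))
```

Since the noise in the simulation has a standard deviation of 5, an RMSE in the neighborhood of 5 grade points is what a well-specified model should achieve here.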

Model Accuracy based on Prediction Interval

Get the predictions and prediction intervals

pred_int <- predict(model, newdata = test_data, interval = "prediction")

Extract lower and upper bounds of the prediction interval

lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

Actual values from the test data

actual_values <- test_data$FinalGrades

Check if the actual values fall within the prediction interval

correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

Compute accuracy (the proportion of actual values that fall within their prediction interval)

accuracy <- sum(correct_predictions) / length(correct_predictions)
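The accuracy computed above is the empirical coverage of the prediction intervals. predict() uses level = 0.95 by default, so a value near 0.95 is expected when the model is well specified. The sketch below illustrates how the measure depends on the interval level; it is self-contained, re-creates the simulation, and uses a hypothetical helper named coverage.

```r
set.seed(10923)
n <- 500

# Re-create the simulated data so the sketch runs on its own
sim <- data.frame(
  StudyHours = sample(1:20, n, replace = TRUE),
  QuizScores = sample(0:100, n, replace = TRUE),
  ForumPosts = sample(0:50, n, replace = TRUE),
  PreviousGrades = sample(0:100, n, replace = TRUE)
)
sim$FinalGrades <- 0.3 * sim$StudyHours + 0.4 * sim$QuizScores +
  0.2 * sim$ForumPosts + 0.1 * sim$PreviousGrades +
  rnorm(n, mean = 0, sd = 5) + 25

idx <- sample(seq_len(n), 0.8 * n)
train <- sim[idx, ]
test <- sim[-idx, ]
fit <- lm(FinalGrades ~ ., data = train)

# Hypothetical helper: share of test observations whose actual grade
# lies inside the prediction interval at the given level
coverage <- function(level) {
  bounds <- predict(fit, newdata = test, interval = "prediction", level = level)
  mean(test$FinalGrades >= bounds[, "lwr"] & test$FinalGrades <= bounds[, "upr"])
}

accuracy_95 <- coverage(0.95) # same level as the default used above
accuracy_80 <- coverage(0.80) # narrower interval, so coverage cannot be higher

print(paste("Accuracy (95% interval):", accuracy_95))
print(paste("Accuracy (80% interval):", accuracy_80))
```

Note that this measure rewards wide intervals: raising the level always keeps or increases the accuracy, so it is best read alongside an error metric such as the mean squared error.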