Utilizing Supervised Learning in Learning Analytics
Case Study 4
Author
Dominic Valdiserri
Business Scenario: Predicting Student Performance
In this case study, you are an analyst at an online education platform. The management is interested in predicting student performance based on various factors to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model to predict students’ final grades using simulated data.
Objective:
Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.
Data Generation:
# Set a fixed random seed for reproducibility
set.seed(10923)

# Number of students
# TODO: set num_students to 500
# Enter code below:
num_students <- 500

# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)

# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)

# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)

# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)

# Simulate final grades as a weighted sum of the features, plus a baseline of 25
# and Gaussian noise (sd = 5). The result is not clipped to the 0-100 range.
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
  0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25

# Create a data frame
student_data <- data.frame(
  StudyHours = study_hours,
  QuizScores = quiz_scores,
  ForumPosts = forum_posts,
  PreviousGrades = previous_grades,
  FinalGrades = final_grades
)

# View the first few rows of the generated data
head(student_data)
# Viewing summary statistics of data
summary(student_data)
   StudyHours      QuizScores       ForumPosts    PreviousGrades
 Min.   : 1.00   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00
 1st Qu.: 6.00   1st Qu.: 24.00   1st Qu.:12.00   1st Qu.: 23.00
 Median :11.00   Median : 48.00   Median :24.00   Median : 51.00
 Mean   :10.67   Mean   : 48.54   Mean   :24.26   Mean   : 50.05
 3rd Qu.:16.00   3rd Qu.: 73.00   3rd Qu.:37.00   3rd Qu.: 75.00
 Max.   :20.00   Max.   :100.00   Max.   :50.00   Max.   :100.00
  FinalGrades
 Min.   :24.19
 1st Qu.:47.15
 Median :57.18
 Mean   :57.35
 3rd Qu.:67.01
 Max.   :95.36
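As a rough sanity check on the simulated data, the expected mean of FinalGrades and the share of its variance that the features account for can be worked out directly from the generating formula. The sketch below assumes each feature is uniform on its integer range (which is how `sample()` with replacement draws it) and that the features are independent. The helper names `unif_mean` and `unif_var` are illustrative and not part of the case study.

```r
# Mean and variance of a discrete uniform distribution on the integers a..b
unif_mean <- function(a, b) (a + b) / 2
unif_var  <- function(a, b) ((b - a + 1)^2 - 1) / 12

# Weights, ranges, baseline, and noise level taken from the simulation formula
true_weights <- c(StudyHours = 0.3, QuizScores = 0.4,
                  ForumPosts = 0.2, PreviousGrades = 0.1)
lower <- c(1, 0, 0, 0)
upper <- c(20, 100, 50, 100)
baseline <- 25
noise_sd <- 5

# Expected final grade: baseline plus the weighted feature means
expected_mean <- baseline + sum(true_weights * unif_mean(lower, upper))

# Variance from the features (independent, so the weighted variances add)
signal_var <- sum(true_weights^2 * unif_var(lower, upper))

# Share of total variance that a perfectly specified model could explain
theoretical_r2 <- signal_var / (signal_var + noise_sd^2)

cat("Expected mean of FinalGrades:", expected_mean, "\n")
cat("Theoretical R-squared:", round(theoretical_r2, 3), "\n")
```

The expected mean works out to 58.15, close to the observed mean of 57.35 in the summary above, and the theoretical R-squared is about 0.862, which gives a ceiling against which the fitted model's R-squared can be judged.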
# Plotting the density of StudyHours
library(ggplot2)
ggplot(student_data, aes(x = StudyHours)) +
  geom_density(color = "red", fill = alpha("red", 0.3)) +
  theme_minimal() +
  ggtitle("Distribution of StudyHours") +
  ylab("Density")

# Plotting the density of QuizScores
ggplot(student_data, aes(x = QuizScores)) +
  geom_density(color = "blue", fill = alpha("blue", 0.3)) +
  theme_minimal() +
  ggtitle("Distribution of QuizScores") +
  ylab("Density")

# Plotting the density of ForumPosts
ggplot(student_data, aes(x = ForumPosts)) +
  geom_density(color = "green", fill = alpha("green", 0.3)) +
  theme_minimal() +
  ggtitle("Distribution of ForumPosts") +
  ylab("Density")

# Plotting the density of PreviousGrades
ggplot(student_data, aes(x = PreviousGrades)) +
  geom_density(color = "purple", fill = alpha("purple", 0.3)) +
  theme_minimal() +
  ggtitle("Distribution of PreviousGrades") +
  ylab("Density")

# Plotting the density of FinalGrades
ggplot(student_data, aes(x = FinalGrades)) +
  geom_density(color = "orange", fill = alpha("orange", 0.3)) +
  theme_minimal() +
  ggtitle("Distribution of FinalGrades") +
  ylab("Density")

# Scatter plot of FinalGrades against StudyHours
ggplot(student_data) +
  aes(x = StudyHours, y = FinalGrades) +
  geom_point(color = "pink") +
  theme_minimal()
Use 80% of the data for training and 20% for testing to predict final grades. Compute the Mean Squared Error on the test set and the model accuracy based on the prediction interval.
# Todo:
# Splitting the data into training and testing sets (80% training, 20% testing)
set.seed(10923)  # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Building a Linear Regression model using the train data and assign it to an object
# called model.
# Todo: Target variable is FinalGrades and the Features are StudyHours, QuizScores,
# ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
            data = train_data)

# Making predictions on the test set. Use the model object to make predictions.
# Enter code below:
predictions <- predict(model, newdata = test_data)

# Evaluation metrics
# Compute the mean squared error (on the test set) and R-squared
# (taken from the model summary, so it describes the training fit)
# Enter code below
MSE <- mean((test_data$FinalGrades - predictions)^2)
R_squared <- summary(model)$r.squared

# Print evaluation metrics
# Enter code below
print(MSE)
[1] 22.34656
round(MSE, digits =2)
[1] 22.35
print(R_squared)
[1] 0.8648338
round(R_squared, digits =4)
[1] 0.8648
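The R-squared above comes from `summary(model)`, so it is measured on the training data. An out-of-sample counterpart can be computed from the test-set predictions. The following is a minimal sketch that regenerates the data and split with the same seeds as above; the name `test_r_squared` is introduced here for illustration, and the exact value depends on the R version's random number generator, so no output is shown.

```r
# Regenerate the simulated data exactly as in the data generation step
set.seed(10923)
num_students <- 500
study_hours     <- sample(1:20,  num_students, replace = TRUE)
quiz_scores     <- sample(0:100, num_students, replace = TRUE)
forum_posts     <- sample(0:50,  num_students, replace = TRUE)
previous_grades <- sample(0:100, num_students, replace = TRUE)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
  0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores,
                           ForumPosts = forum_posts,
                           PreviousGrades = previous_grades,
                           FinalGrades = final_grades)

# Same 80/20 split and model as above
set.seed(10923)
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data  <- student_data[-sample_index, ]
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
            data = train_data)
predictions <- predict(model, newdata = test_data)

# Out-of-sample R-squared: 1 - (residual sum of squares / total sum of squares),
# with the total measured around the test-set mean
sse <- sum((test_data$FinalGrades - predictions)^2)
sst <- sum((test_data$FinalGrades - mean(test_data$FinalGrades))^2)
test_r_squared <- 1 - sse / sst

cat("Test-set R-squared:", round(test_r_squared, 4), "\n")
```

If the test-set value is close to the training value, that suggests the model is not overfitting, which is what one would expect for a four-predictor linear model fitted to 400 rows.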
summary(model)
Call:
lm(formula = FinalGrades ~ StudyHours + QuizScores + ForumPosts +
    PreviousGrades, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max
-13.5265  -3.4421   0.3997   3.1947  15.6419

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    24.953643   0.863889  28.885  < 2e-16 ***
StudyHours      0.331338   0.041453   7.993 1.46e-14 ***
QuizScores      0.402828   0.008646  46.593  < 2e-16 ***
ForumPosts      0.194558   0.017110  11.371  < 2e-16 ***
PreviousGrades  0.090502   0.008312  10.888  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.988 on 395 degrees of freedom
Multiple R-squared:  0.8648, Adjusted R-squared:  0.8635
F-statistic: 631.8 on 4 and 395 DF,  p-value: < 2.2e-16
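Because the data are simulated, the estimated coefficients can be compared against the weights used to generate FinalGrades (0.3, 0.4, 0.2, 0.1, and a baseline of 25). The sketch below does this with 95% confidence intervals from `confint()`. It regenerates the data with the same seeds as above; `true_weights` and `comparison` are illustrative names, not part of the case study.

```r
# Regenerate the simulated data exactly as in the data generation step
set.seed(10923)
num_students <- 500
study_hours     <- sample(1:20,  num_students, replace = TRUE)
quiz_scores     <- sample(0:100, num_students, replace = TRUE)
forum_posts     <- sample(0:50,  num_students, replace = TRUE)
previous_grades <- sample(0:100, num_students, replace = TRUE)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
  0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores,
                           ForumPosts = forum_posts,
                           PreviousGrades = previous_grades,
                           FinalGrades = final_grades)

# Same 80/20 split and model as above
set.seed(10923)
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
            data = train_data)

# Weights used in the simulation formula
true_weights <- c("(Intercept)" = 25, StudyHours = 0.3, QuizScores = 0.4,
                  ForumPosts = 0.2, PreviousGrades = 0.1)

# 95% confidence intervals for each coefficient
ci <- confint(model, level = 0.95)

# Side-by-side comparison of true weight, estimate, and interval
comparison <- data.frame(
  true     = true_weights[rownames(ci)],
  estimate = coef(model)[rownames(ci)],
  lower    = ci[, 1],
  upper    = ci[, 2]
)
comparison$covered <- comparison$true >= comparison$lower &
  comparison$true <= comparison$upper

print(comparison)
```

The estimates reported in the summary above all lie within about 1.2 standard errors of the simulation weights, so each 95% interval would be expected to contain its true value when the run reproduces those estimates.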
Model Accuracy based on Prediction Interval
# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")
Model Accuracy using Prediction Interval: 0.96
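The accuracy is calculated as the proportion of test-set students whose actual final grade falls inside the model's prediction interval. Because predict() uses a 95% level by default, a value of 0.96 means the intervals cover about as many cases as they nominally should. This measures interval coverage rather than how close the point predictions are to the actual grades.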
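Since this accuracy is tied to the interval level, it helps to see how it moves when the level changes. The sketch below recomputes the proportion at several nominal levels using the `level` argument of `predict()`. It regenerates the data with the same seeds as above; `coverage_at` and `nominal_levels` are hypothetical names introduced for this illustration.

```r
# Regenerate the simulated data exactly as in the data generation step
set.seed(10923)
num_students <- 500
study_hours     <- sample(1:20,  num_students, replace = TRUE)
quiz_scores     <- sample(0:100, num_students, replace = TRUE)
forum_posts     <- sample(0:50,  num_students, replace = TRUE)
previous_grades <- sample(0:100, num_students, replace = TRUE)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
  0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores,
                           ForumPosts = forum_posts,
                           PreviousGrades = previous_grades,
                           FinalGrades = final_grades)

# Same 80/20 split and model as above
set.seed(10923)
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data  <- student_data[-sample_index, ]
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
            data = train_data)

# Proportion of test-set grades inside the prediction interval at a given level
coverage_at <- function(level) {
  pred_int <- predict(model, newdata = test_data,
                      interval = "prediction", level = level)
  inside <- test_data$FinalGrades >= pred_int[, "lwr"] &
    test_data$FinalGrades <= pred_int[, "upr"]
  mean(inside)
}

nominal_levels <- c(0.50, 0.80, 0.95, 0.99)
coverage <- sapply(nominal_levels, coverage_at)

print(data.frame(nominal = nominal_levels, observed = coverage))
```

A well-calibrated model should show observed coverage near each nominal level. Wider intervals always cover at least as many points, so a high accuracy at a high level says little on its own about predictive precision.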
Summary:
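The Mean Squared Error (MSE) of the model on the test set is 22.35, which corresponds to a root mean squared error of about 4.73 grade points. The final grades in this data set span roughly 24 to 95, so a typical error of under 5 points is small relative to that range, and it is close to the noise standard deviation of 5 used in the simulation. The model did a good job of predicting the final grades of the students.
The R-squared value is 0.8648. Therefore, about 86.48% of the variance in FinalGrades can be explained by the model. This value comes from summary(model), so it describes the fit on the training data. Overall, the model is a strong fit for the data.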
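To judge the MSE on the grade scale, it can be converted to a root mean squared error and compared with a naive baseline. The sketch below assumes a baseline that predicts the training-set mean grade for every test student. That baseline is an illustration added here and is not part of the original analysis, and `rmse` and `baseline_mse` are illustrative names.

```r
# Regenerate the simulated data exactly as in the data generation step
set.seed(10923)
num_students <- 500
study_hours     <- sample(1:20,  num_students, replace = TRUE)
quiz_scores     <- sample(0:100, num_students, replace = TRUE)
forum_posts     <- sample(0:50,  num_students, replace = TRUE)
previous_grades <- sample(0:100, num_students, replace = TRUE)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
  0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores,
                           ForumPosts = forum_posts,
                           PreviousGrades = previous_grades,
                           FinalGrades = final_grades)

# Same 80/20 split and model as above
set.seed(10923)
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data  <- student_data[-sample_index, ]
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
            data = train_data)
predictions <- predict(model, newdata = test_data)

# Test-set MSE and its square root (RMSE), which is in grade points
mse  <- mean((test_data$FinalGrades - predictions)^2)
rmse <- sqrt(mse)

# Naive baseline (assumption): predict the training mean for every test student
baseline_mse <- mean((test_data$FinalGrades - mean(train_data$FinalGrades))^2)

cat("Model MSE:", round(mse, 2), " RMSE:", round(rmse, 2), "\n")
cat("Baseline MSE:", round(baseline_mse, 2), "\n")
cat("Relative reduction in MSE:", round(1 - mse / baseline_mse, 4), "\n")
```

The relative reduction in MSE over the baseline is a scale-free way to express how much the four features improve on knowing only the average grade.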