In this case study, you are an analyst at an online education platform. Management wants to predict student performance from various factors in order to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model that predicts students’ final grades using simulated data.
Your goal is to build a predictive model with supervised learning techniques in R, using simulated student data whose features include study hours, quiz scores, forum participation, and previous grades.
# Set a fixed random seed for reproducibility
set.seed(10923)
# Number of students
num_students <- 500
# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)
# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)
# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)
# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)
# Simulate final grades (ranging from 0 to 100)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts + 0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25
# Create a data frame
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores, ForumPosts = forum_posts, PreviousGrades = previous_grades, FinalGrades = final_grades)
# View the first few rows of the generated data
head(student_data)
## StudyHours QuizScores ForumPosts PreviousGrades FinalGrades
## 1 20 91 22 78 80.80895
## 2 12 26 27 1 46.45853
## 3 13 5 8 60 40.22946
## 4 4 96 13 78 70.64216
## 5 5 74 45 31 62.35254
## 6 18 1 47 50 48.42835
# Calculate and print the maximum final grade
max_final_grade <- max(student_data$FinalGrades)
cat("Maximum Final Grade:", round(max_final_grade, 2), "\n")
## Maximum Final Grade: 95.36
# Calculate the correlation between 'StudyHours' and 'FinalGrades'
correlation_study_hours_final_grades <- cor(student_data$StudyHours, student_data$FinalGrades)
cat("Correlation between StudyHours and FinalGrades:", round(correlation_study_hours_final_grades, 2), "\n")
## Correlation between StudyHours and FinalGrades: 0.15
# Summary of the data
summary(student_data)
## StudyHours QuizScores ForumPosts PreviousGrades
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 24.00 1st Qu.:12.00 1st Qu.: 23.00
## Median :11.00 Median : 48.00 Median :24.00 Median : 51.00
## Mean :10.67 Mean : 48.54 Mean :24.26 Mean : 50.05
## 3rd Qu.:16.00 3rd Qu.: 73.00 3rd Qu.:37.00 3rd Qu.: 75.00
## Max. :20.00 Max. :100.00 Max. :50.00 Max. :100.00
## FinalGrades
## Min. :24.19
## 1st Qu.:47.15
## Median :57.18
## Mean :57.35
## 3rd Qu.:67.01
## Max. :95.36
# Histogram of final grades
hist(student_data$FinalGrades, main = "Distribution of Final Grades", xlab = "Final Grades", col = "blue")
# Scatter plot matrix to visualize relationships
pairs(student_data)
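Before moving on, it can help to put numbers to the relationships shown in the scatter plot matrix. A minimal sketch, assuming nothing beyond the student_data frame built above, is to compute the full correlation matrix:
# Correlation matrix for all columns of student_data (complements the pairs plot)
round(cor(student_data), 2)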
Use 80% of the data for training and 20% for testing to predict final grades. Compute the Mean Squared Error (MSE) and the model accuracy based on the prediction interval.
# Splitting the data into training and testing sets (80% training, 20% testing)
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]
# Calculate the number of observations in the test data
num_test_observations <- nrow(test_data)
cat("Number of observations in the test data:", num_test_observations, "\n")
## Number of observations in the test data: 100
# Build multiple linear regression models with different predictor sets
models <- list(
  model1 = lm(FinalGrades ~ StudyHours, data = train_data),
  model2 = lm(FinalGrades ~ QuizScores, data = train_data),
  model3 = lm(FinalGrades ~ ForumPosts, data = train_data),
  model4 = lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = train_data)
)
# Initialize a list to store evaluation metrics
model_metrics <- list()
# Evaluate each model on the held-out test set
for (model_name in names(models)) {
  model <- models[[model_name]]
  predictions <- predict(model, newdata = test_data)
  # Compute evaluation metrics: MSE on the test set, R-squared from the training fit
  mse <- mean((test_data$FinalGrades - predictions)^2)
  r_squared <- summary(model)$r.squared
  # Get the 95% prediction intervals (the default level for predict.lm)
  pred_int <- predict(model, newdata = test_data, interval = "prediction")
  lower_bound <- pred_int[, "lwr"]
  upper_bound <- pred_int[, "upr"]
  # Check whether the actual values fall within the prediction interval
  correct_predictions <- test_data$FinalGrades >= lower_bound & test_data$FinalGrades <= upper_bound
  accuracy <- sum(correct_predictions) / length(correct_predictions)
  # Store metrics
  model_metrics[[model_name]] <- list(MSE = round(mse, 2), R2 = round(r_squared, 4), Accuracy = round(accuracy, 2))
}
# Print evaluation metrics
for (model_name in names(model_metrics)) {
cat(model_name, "\n")
cat("Mean Squared Error:", model_metrics[[model_name]]$MSE, "\n")
cat("R-squared:", model_metrics[[model_name]]$R2, "\n")
cat("Model Accuracy using Prediction Interval:", model_metrics[[model_name]]$Accuracy, "\n\n")
}
## model1
## Mean Squared Error: 195.56
## R-squared: 0.0289
## Model Accuracy using Prediction Interval: 0.96
##
## model2
## Mean Squared Error: 39.08
## R-squared: 0.7521
## Model Accuracy using Prediction Interval: 0.95
##
## model3
## Mean Squared Error: 182.19
## R-squared: 0.0421
## Model Accuracy using Prediction Interval: 0.98
##
## model4
## Mean Squared Error: 22.35
## R-squared: 0.8648
## Model Accuracy using Prediction Interval: 0.96
Model 4, which includes all four features (StudyHours, QuizScores, ForumPosts, PreviousGrades), performs best on both Mean Squared Error (22.35) and R-squared (0.8648). It explains the most variance in the final grades and has the lowest prediction error.
Model 2 (QuizScores) also performs well, with a much lower MSE (39.08) than the other single-feature models and a high R-squared (0.7521), indicating that quiz scores are a strong individual predictor of final grades.
Model 3 (ForumPosts) has a high prediction interval accuracy (0.98) but a high MSE (182.19) and a low R-squared (0.0421), suggesting that forum participation alone is not a strong predictor of final grades.
Model 1 (StudyHours) has the lowest R-squared (0.0289) and the highest MSE (195.56), indicating that study hours alone predict final grades poorly.
Overall, Model 4, which incorporates all four features, is the best choice for predicting final grades because of its high explanatory power and low prediction error, and quiz scores stand out as the strongest individual predictor of student performance.
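To make this comparison easier to scan, the metrics stored in model_metrics can be collapsed into a single table. The following is a minimal sketch that only reuses objects created above; the comparison data frame and its column names are illustrative choices, not part of the original workflow.
# Collapse the stored metrics into one comparison table, ordered by test MSE
comparison <- do.call(rbind, lapply(names(model_metrics), function(model_name) {
  m <- model_metrics[[model_name]]
  data.frame(Model = model_name, MSE = m$MSE, R2 = m$R2, Accuracy = m$Accuracy)
}))
comparison <- comparison[order(comparison$MSE), ]
print(comparison, row.names = FALSE)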
# Get the predictions and prediction intervals for the best model (Model 4)
pred_int <- predict(models$model4, newdata = test_data, interval = "prediction")
# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]
# Actual values from the test data
actual_values <- test_data$FinalGrades
# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound
# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)
# Print accuracy
cat("Model Accuracy using Prediction Interval:", round(accuracy, 2), "\n")
## Model Accuracy using Prediction Interval: 0.96
The accuracy is the proportion of test observations whose actual final grade falls within the model's 95% prediction interval.
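Equivalently, because correct_predictions is a logical vector, the same proportion can be computed with mean(); a quick sketch using the objects defined above:
# mean() of a logical vector gives the proportion of TRUE values, i.e. the accuracy
accuracy_alt <- mean(correct_predictions)
cat("Model Accuracy via mean():", round(accuracy_alt, 2), "\n")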
Have fun!