Utilizing Supervised Learning in Learning Analytics
Case Study 4
Author
Banoth Maheshwari
Business Scenario: Predicting Student Performance
In this case study, you are an analyst at an online education platform. The management is interested in predicting student performance based on various factors to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model to predict students’ final grades using simulated data.
Objective:
Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.
Data Generation:
# Set a fixed random seed for reproducibility
set.seed(10923)

# Number of students
# TODO: set num_students to 500
num_students <- 500

# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)

# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)

# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)

# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)

# Simulate final grades as a weighted sum of the features, plus a baseline
# of 25 and Gaussian noise (sd = 5); values are not clipped to 0-100
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts +
  0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25

# Create a data frame
student_data <- data.frame(
  StudyHours = study_hours,
  QuizScores = quiz_scores,
  ForumPosts = forum_posts,
  PreviousGrades = previous_grades,
  FinalGrades = final_grades
)

# View the first few rows of the generated data
head(student_data)

# Summarize each column (the output shown below has the form of summary())
summary(student_data)
StudyHours QuizScores ForumPosts PreviousGrades
Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 6.00 1st Qu.: 24.00 1st Qu.:12.00 1st Qu.: 23.00
Median :11.00 Median : 48.00 Median :24.00 Median : 51.00
Mean :10.67 Mean : 48.54 Mean :24.26 Mean : 50.05
3rd Qu.:16.00 3rd Qu.: 73.00 3rd Qu.:37.00 3rd Qu.: 75.00
Max. :20.00 Max. :100.00 Max. :50.00 Max. :100.00
FinalGrades
Min. :24.19
1st Qu.:47.15
Median :57.18
Mean :57.35
3rd Qu.:67.01
Max. :95.36
Modeling
Fit the model on 80% of the data and evaluate it on the remaining 20% to predict final grades. Compute the mean squared error, R-squared, and the model accuracy based on the prediction interval.
# TODO: split the data into training and testing sets (80% training, 20% testing)
set.seed(10923)  # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# TODO: build a linear regression model on the training data and assign it
# to an object called model. The target variable is FinalGrades and the
# features are StudyHours, QuizScores, ForumPosts, and PreviousGrades.
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
            data = train_data)

# Make predictions on the test set using the model object
test_predictions <- predict(model, newdata = test_data)

# Evaluation metrics: mean squared error and R-squared on the test set
mse <- mean((test_data$FinalGrades - test_predictions)^2)
ss_total <- sum((test_data$FinalGrades - mean(test_data$FinalGrades))^2)
ss_residual <- sum((test_data$FinalGrades - test_predictions)^2)
r_squared <- 1 - (ss_residual / ss_total)

# Print evaluation metrics
cat("Mean Squared Error:", mse, "\n")
Mean Squared Error: 22.34656
cat("R-squared:", r_squared, "\n")
R-squared: 0.8853893
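The MSE above is expressed in squared grade points, which can be hard to read directly. The following is a minimal sketch, not part of the original case study, of a helper that reports the root mean squared error (RMSE) and mean absolute error (MAE) alongside MSE and R-squared. The function name `regression_metrics` and the small example vectors are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (an assumption, not from the case study): returns
# several error metrics for vectors of actual and predicted values.
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  ss_residual <- sum(residuals^2)
  ss_total <- sum((actual - mean(actual))^2)
  c(
    mse = mean(residuals^2),              # mean squared error
    rmse = sqrt(mean(residuals^2)),       # same units as the target
    mae = mean(abs(residuals)),           # mean absolute error
    r_squared = 1 - ss_residual / ss_total
  )
}

# Small worked example with made-up numbers
actual <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 6)
metrics <- regression_metrics(actual, predicted)
print(metrics)
```

With the objects from the modeling step, the call would be `regression_metrics(test_data$FinalGrades, test_predictions)`. An MSE of about 22.35 corresponds to an RMSE of roughly 4.73 grade points, which is close to the noise standard deviation of 5 used in the simulation.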
Model Accuracy based on Prediction Interval
# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")
Model Accuracy using Prediction Interval: 0.96
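The accuracy here is the proportion of test observations whose actual final grade falls inside the model's prediction interval. Because predict() uses a 95% level by default, a value of 0.96 means the intervals cover the test data at close to their nominal rate. It describes how well calibrated the intervals are, not how close the point predictions are to the actual grades.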
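To make the interval-based accuracy more concrete, here is a minimal sketch that measures the same proportion at several confidence levels. It runs on a small simulated stand-in data set rather than `student_data`, and the names `toy_data`, `toy_model`, and `interval_coverage` are hypothetical. The `level` argument of `predict()` for linear models controls the interval width and defaults to 0.95.

```r
# Simulated stand-in data (assumption: one predictor with Gaussian noise)
set.seed(42)
n <- 300
x <- runif(n, min = 0, max = 100)
y <- 25 + 0.4 * x + rnorm(n, mean = 0, sd = 5)
toy_data <- data.frame(x = x, y = y)

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
toy_train <- toy_data[train_idx, ]
toy_test <- toy_data[-train_idx, ]

toy_model <- lm(y ~ x, data = toy_train)

# Hypothetical helper: share of actual values inside the prediction interval
interval_coverage <- function(model, newdata, actual, level = 0.95) {
  pred_int <- predict(model, newdata = newdata,
                      interval = "prediction", level = level)
  mean(actual >= pred_int[, "lwr"] & actual <= pred_int[, "upr"])
}

# Coverage at several levels; wider intervals can only cover more points
levels_to_check <- c(0.50, 0.80, 0.95, 0.99)
coverage <- sapply(levels_to_check, function(lv) {
  interval_coverage(toy_model, toy_test, toy_test$y, level = lv)
})
print(data.frame(level = levels_to_check, coverage = coverage))
```

If the model assumptions roughly hold, each observed coverage should sit near its nominal level, which is why this quantity is better read as a calibration check than as a measure of predictive accuracy.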
Have fun!
# Calculate the correlation matrix
correlation_matrix <- cor(student_data)

# Print the correlation matrix
print(correlation_matrix)

# You can also visualize the correlation matrix using a heatmap
library("corrplot")
corrplot 0.92 loaded
corrplot(correlation_matrix, method ="color")
# Create synthetic data
set.seed(123)  # For reproducibility
x <- 1:100
y <- 2 * x + rnorm(100, mean = 0, sd = 10)

# Create a data frame from the variables
data <- data.frame(x, y)

# Perform linear regression
model <- lm(y ~ x, data = data)

# Print the summary of the regression model
summary(model)
Call:
lm(formula = y ~ x, data = data)
Residuals:
Min 1Q Median 3Q Max
-24.5356 -5.5236 -0.3462 6.4850 20.9487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.36404 1.84287 -0.198 0.844
x 2.02511 0.03168 63.920 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.145 on 98 degrees of freedom
Multiple R-squared: 0.9766, Adjusted R-squared: 0.9763
F-statistic: 4086 on 1 and 98 DF, p-value: < 2.2e-16
# Load necessary libraries
library(caret)
Loading required package: ggplot2
Warning: package 'ggplot2' was built under R version 4.3.1
Loading required package: lattice
Warning: package 'lattice' was built under R version 4.3.1
library(randomForest)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
# Load the Iris dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
inTrain <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training_data <- iris[inTrain, ]
testing_data <- iris[-inTrain, ]

# Train a Random Forest classifier
model <- train(Species ~ ., data = training_data, method = "rf")

# Make predictions on the testing data
predictions <- predict(model, newdata = testing_data)

# Evaluate the model
confusion_matrix <- confusionMatrix(predictions, testing_data$Species)
print(confusion_matrix)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 15 0 0
versicolor 0 14 2
virginica 0 1 13
Overall Statistics
Accuracy : 0.9333
95% CI : (0.8173, 0.986)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9333 0.8667
Specificity 1.0000 0.9333 0.9667
Pos Pred Value 1.0000 0.8750 0.9286
Neg Pred Value 1.0000 0.9655 0.9355
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3111 0.2889
Detection Prevalence 0.3333 0.3556 0.3111
Balanced Accuracy 1.0000 0.9333 0.9167
# Generate synthetic data for clustering
set.seed(123)
data <- data.frame(
  x = rnorm(100, mean = 0, sd = 1),
  y = rnorm(100, mean = 0, sd = 1)
)

# Perform k-means clustering
k <- 3  # Number of clusters
kmeans_result <- kmeans(data, centers = k)

# Print the cluster assignments
cluster_assignments <- kmeans_result$cluster
print(cluster_assignments)

# Plot the clustered data
library(ggplot2)
data$Cluster <- as.factor(cluster_assignments)
ggplot(data, aes(x, y, color = Cluster)) +
  geom_point() +
  labs(title = "K-Means Clustering")
# Build a Random Forest Regression model using the train data
library(randomForest)
model_rf <- randomForest(
  FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
  data = train_data
)

# Make predictions on the test set
test_predictions_rf <- predict(model_rf, newdata = test_data)

# Calculate Mean Squared Error
mse_rf <- mean((test_data$FinalGrades - test_predictions_rf)^2)

# Calculate R-squared (ss_total is reused from the linear regression
# evaluation, since it depends only on the test data)
ss_residual_rf <- sum((test_data$FinalGrades - test_predictions_rf)^2)
r_squared_rf <- 1 - (ss_residual_rf / ss_total)

# Print evaluation metrics for Random Forest model
cat("Random Forest Regression - Mean Squared Error:", mse_rf, "\n")
Random Forest Regression - Mean Squared Error: 34.69119
max_final_grade <- max(student_data$FinalGrades)
max_final_grade <- round(max_final_grade, 2)  # Round to two decimal places
max_final_grade
[1] 95.36
# Calculate the correlation between StudyHours and FinalGrades
correlation <- cor(student_data$StudyHours, student_data$FinalGrades)

# Print the correlation coefficient
cat("Correlation between StudyHours and FinalGrades:", correlation, "\n")
Correlation between StudyHours and FinalGrades: 0.1521135
# Fit a linear regression model
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
            data = student_data)

# Display the model summary
summary(model)
Call:
lm(formula = FinalGrades ~ StudyHours + QuizScores + ForumPosts +
PreviousGrades, data = student_data)
Residuals:
Min 1Q Median 3Q Max
-13.3924 -3.4734 0.3027 3.0976 16.7901
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.076304 0.762633 32.88 < 2e-16 ***
StudyHours 0.298397 0.037113 8.04 6.66e-15 ***
QuizScores 0.404363 0.007692 52.57 < 2e-16 ***
ForumPosts 0.202482 0.015217 13.31 < 2e-16 ***
PreviousGrades 0.090967 0.007480 12.16 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.93 on 495 degrees of freedom
Multiple R-squared: 0.8696, Adjusted R-squared: 0.8685
F-statistic: 825.1 on 4 and 495 DF, p-value: < 2.2e-16
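The estimates in the summary above are close to the weights used in the data-generation formula (0.3, 0.4, 0.2, 0.1, and an intercept of 25). Because the data are simulated, this can be checked directly. The sketch below regenerates data with the same formula, using an arbitrary seed and the hypothetical names `sim`, `fit`, and `comparison`, and tests whether each true coefficient lies inside its 95% confidence interval from `confint()`. Its numbers will not match the output above.

```r
# Regenerate data with the same formula (assumption: arbitrary seed, so the
# values differ from student_data)
set.seed(2024)
n <- 500
sim <- data.frame(
  StudyHours = sample(1:20, n, replace = TRUE),
  QuizScores = sample(0:100, n, replace = TRUE),
  ForumPosts = sample(0:50, n, replace = TRUE),
  PreviousGrades = sample(0:100, n, replace = TRUE)
)
sim$FinalGrades <- 0.3 * sim$StudyHours + 0.4 * sim$QuizScores +
  0.2 * sim$ForumPosts + 0.1 * sim$PreviousGrades +
  rnorm(n, mean = 0, sd = 5) + 25

# True coefficients from the generation formula
true_coefs <- c(
  "(Intercept)" = 25, StudyHours = 0.3, QuizScores = 0.4,
  ForumPosts = 0.2, PreviousGrades = 0.1
)

fit <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades,
          data = sim)

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)

comparison <- data.frame(
  true = true_coefs[rownames(ci)],
  lower = ci[, 1],
  upper = ci[, 2]
)
comparison$covered <- comparison$true >= comparison$lower &
  comparison$true <= comparison$upper
print(comparison)
```

A true coefficient will occasionally fall outside a 95% interval by chance, so a single `FALSE` in the `covered` column is not by itself evidence of a modeling problem.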
# Assuming your data frame is named student_data
num_observations <- nrow(student_data)

# Print the number of observations
cat("Number of observations in student_data:", num_observations, "\n")