Utilizing Supervised Learning in Learning Analytics

Case Study 4

Author

Shahin

Business Scenario: Predicting Student Performance

In this case study, you are an analyst at an online education platform. The management is interested in predicting student performance based on various factors to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model to predict students’ final grades using simulated data.

Objective:

Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.

Data Generation:

# Set a fixed random seed for reproducibility
set.seed(10923)

# Number of students
#TODO: set num_students to 500
# Enter code below:
num_students=500

# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)

# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)

# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)

# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)

# Simulate final grades (ranging from 0 to 100)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts + 0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25

# Create a data frame
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores, ForumPosts = forum_posts, PreviousGrades = previous_grades, FinalGrades = final_grades)

# View the first few rows of the generated data
head(student_data)
  StudyHours QuizScores ForumPosts PreviousGrades FinalGrades
1         20         91         22             78    80.80895
2         12         26         27              1    46.45853
3         13          5          8             60    40.22946
4          4         96         13             78    70.64216
5          5         74         45             31    62.35254
6         18          1         47             50    48.42835

Explore the data

# Todo:
summary(student_data)
   StudyHours      QuizScores       ForumPosts    PreviousGrades  
 Min.   : 1.00   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00  
 1st Qu.: 6.00   1st Qu.: 24.00   1st Qu.:12.00   1st Qu.: 23.00  
 Median :11.00   Median : 48.00   Median :24.00   Median : 51.00  
 Mean   :10.67   Mean   : 48.54   Mean   :24.26   Mean   : 50.05  
 3rd Qu.:16.00   3rd Qu.: 73.00   3rd Qu.:37.00   3rd Qu.: 75.00  
 Max.   :20.00   Max.   :100.00   Max.   :50.00   Max.   :100.00  
  FinalGrades   
 Min.   :24.19  
 1st Qu.:47.15  
 Median :57.18  
 Mean   :57.35  
 3rd Qu.:67.01  
 Max.   :95.36  
# finding the correlation between StudyHours and FinalGrades
correlation <- cor(student_data$StudyHours, student_data$FinalGrades)

# Print the correlation coefficient
cat("Correlation between StudyHours and FinalGrades: ", correlation, "\n")
Correlation between StudyHours and FinalGrades:  0.1521135 
# Fit a linear regression model to predict FinalGrades using all variables
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = student_data)

# Get the summary of the model
summary(model)

Call:
lm(formula = FinalGrades ~ StudyHours + QuizScores + ForumPosts + 
    PreviousGrades, data = student_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.3924  -3.4734   0.3027   3.0976  16.7901 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    25.076304   0.762633   32.88  < 2e-16 ***
StudyHours      0.298397   0.037113    8.04 6.66e-15 ***
QuizScores      0.404363   0.007692   52.57  < 2e-16 ***
ForumPosts      0.202482   0.015217   13.31  < 2e-16 ***
PreviousGrades  0.090967   0.007480   12.16  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.93 on 495 degrees of freedom
Multiple R-squared:  0.8696,    Adjusted R-squared:  0.8685 
F-statistic: 825.1 on 4 and 495 DF,  p-value: < 2.2e-16

Modeling

Use 80% of the data for training and 20% for testing to predict final grades. Compute the Mean Squared Error and model accuracy based on prediction interval.

# Todo:

set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

test_data
    StudyHours QuizScores ForumPosts PreviousGrades FinalGrades
12          12         89         38             83    74.26491
24          17         23         18             24    47.57965
29           2         92         32             77    81.94059
32          11         91          8             55    73.62627
37           2         78          5             69    60.88582
46           8         74         28             88    69.23533
54           2         11         29             33    45.38068
69           4         13         17             47    43.40824
71          14         75         33             22    61.75795
73           6         55         26             27    63.53495
78          18         40         24             82    60.68483
101          1         53          4             96    62.13523
102          8         43         35             63    57.34099
108         18         19         27              8    41.35951
114         18         52         50              6    56.19355
123         11         83         48             74    73.60207
131          3         69         27             41    67.09389
136          3          7         24             43    40.00713
141         12         31         26             61    50.69454
144         12          1         13             37    32.53758
147         15         16         21             68    46.69701
150         16         82         40             92    81.47498
160         12         40         25             55    55.22237
163          4         86         42             64    76.89570
172         20         24         45              1    44.99715
173         18         44         20             74    61.62909
183         17         62          4             44    61.40353
194          1         44         16              9    52.39921
201          9         55         29            100    60.82866
203         12         52         47             71    68.17834
208         15         19         14             17    43.30041
211         10          6          9             36    31.65385
212          6         22         47             75    52.38183
220          1         76         13             86    69.62288
221          4         58         24             57    55.22403
224          7          4         14             65    35.13503
228          9         72         50             21    83.28237
231         14          8         24             89    37.57380
232         15         50         40             60    61.77141
233         19         55         34             23    60.48159
235         19         67         10             37    62.89812
247          7         65         23             74    61.16096
254          9         34         19             97    54.01643
259         17         60          0             68    61.93399
264         17         96         29             70    86.36556
267         10         85          3             85    63.77017
272         11         62         39             47    60.81430
276         13         86         40             56    78.95117
284          8         60         23             14    64.93520
287          5         68         15             20    51.68832
289         10         32         15             56    52.89392
295         19         72         36             54    77.59619
296         17         73          8             56    60.93701
297          5         12         50             74    65.06646
298          8         17         41             24    50.04512
303         19         40         40             56    59.38023
309          2         49         21             83    61.69173
316          1         24         13             26    41.68404
317         10         18         18              4    37.87113
320          1         15         37             74    42.61517
324         13          0          6             14    31.52134
325          1         69         14             32    61.66530
327          7         24          4             44    35.89460
329         18         58         12              6    59.33133
331         18         36         14             57    50.02511
335         10         85         16             60    73.19256
337         14         66         44             52    82.27735
343          9         19         10             22    40.62374
347         14         85         44              8    83.69206
353          8         77         21             69    72.94197
360          8         55         10             79    61.14687
369         16         52          2             46    52.58248
376         20         98         43             92    82.62389
379         16         81          8             71    72.84332
381          7         61         41             90    63.61583
391          1          2         46             27    39.78401
397          9         41         14             10    40.03665
399         12         97         28             99    78.37851
401         19          7         10             43    39.19989
406         17         56         19             75    62.13746
411          8         48         13             37    52.12544
415          5         99         19             24    71.80581
420         13          6         20             38    42.05383
425          1         17         31              1    32.29709
426         11         75         49             67    76.12438
432         11          8         40             78    47.77518
435          8         27         34             49    50.53583
438         12         23          8             82    50.36122
440          4         53          4             25    51.82210
451         12         14         48             23    39.64732
459         15         35          5             45    51.87861
467          6         30         31             78    56.31453
471          8          6         46             10    38.96916
474         15         17         21             46    46.11949
480         13         19         29             78    43.78552
488         15          9         24              1    37.05143
492         18         63          4             14    53.53368
493         19         20          9             41    45.60791
497         10         26         32             43    52.33675
500         16         26         43              1    41.92070
# Building a Linear Regression model using the train data and assign it to an object # called model.

model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = train_data)

# Making predictions on the test set
test_predictions <- predict(model, newdata = test_data)

# Compute mean squared error
mse <- mean((test_data$FinalGrades - test_predictions)^2)

# Compute R-squared
ssr <- sum((test_predictions - mean(test_data$FinalGrades))^2)
sst <- sum((test_data$FinalGrades - mean(test_data$FinalGrades))^2)
r_squared <- ssr / sst

# Print evaluation metrics
cat("Mean Squared Error (MSE):", mse, "\n")
Mean Squared Error (MSE): 22.34656 
cat("R-squared (R^2):", r_squared, "\n")
R-squared (R^2): 0.8598451 

Model Accuracy based on Prediction Interval

# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")
Model Accuracy using Prediction Interval: 0.96 

The accuracy is calculated as the proportion of correct predictions.

Have fun!