Utilizing Supervised Learning in Learning Analytics

Case Study 4

Author

Banoth Maheshwari

Business Scenario: Predicting Student Performance

In this case study, you are an analyst at an online education platform. The management is interested in predicting student performance based on various factors to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model to predict students’ final grades using simulated data.

Objective:

Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.

Data Generation:

# Set a fixed random seed for reproducibility
set.seed(10923)

# Number of students to simulate
num_students <- 500

# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)

# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)

# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)

# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)

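# Simulate final grades as a weighted sum of the features plus a baseline of 25
# and Gaussian noise (sd = 5); the result is not clipped to the 0-100 range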
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts + 0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25

# Create a data frame
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores, ForumPosts = forum_posts, PreviousGrades = previous_grades, FinalGrades = final_grades)

# View the first few rows of the generated data
head(student_data)

  StudyHours QuizScores ForumPosts PreviousGrades FinalGrades
1         20         91         22             78    80.80895
2         12         26         27              1    46.45853
3         13          5          8             60    40.22946
4          4         96         13             78    70.64216
5          5         74         45             31    62.35254
6         18          1         47             50    48.42835

Explore the data

# Summary statistics for each variable
summary(student_data)

   StudyHours      QuizScores       ForumPosts    PreviousGrades  
 Min.   : 1.00   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00  
 1st Qu.: 6.00   1st Qu.: 24.00   1st Qu.:12.00   1st Qu.: 23.00  
 Median :11.00   Median : 48.00   Median :24.00   Median : 51.00  
 Mean   :10.67   Mean   : 48.54   Mean   :24.26   Mean   : 50.05  
 3rd Qu.:16.00   3rd Qu.: 73.00   3rd Qu.:37.00   3rd Qu.: 75.00  
 Max.   :20.00   Max.   :100.00   Max.   :50.00   Max.   :100.00  
  FinalGrades   
 Min.   :24.19  
 1st Qu.:47.15  
 Median :57.18  
 Mean   :57.35  
 3rd Qu.:67.01  
 Max.   :95.36
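
As a rough sanity check on the summary above, the noise-free part of the generating formula shows where most final grades should fall. The sketch below plugs the smallest and largest possible feature values into that formula; the helper name `noise_free_grade` is hypothetical and used only for illustration.

```r
# Deterministic part of the final-grade formula from the data generation step
noise_free_grade <- function(study_hours, quiz_scores, forum_posts, previous_grades) {
  0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts + 0.1 * previous_grades + 25
}

# Smallest and largest feature values that sample() can draw
lowest  <- noise_free_grade(1, 0, 0, 0)        # 25.3
highest <- noise_free_grade(20, 100, 50, 100)  # 91

cat("Noise-free range:", lowest, "to", highest, "\n")
```

The observed minimum (24.19) and maximum (95.36) fall slightly outside this range, which is expected because rnorm() noise with a standard deviation of 5 is added on top.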

Modeling

Use 80% of the data for training and 20% for testing to predict final grades. Compute the mean squared error (MSE) and R-squared on the test set, then the model accuracy based on the prediction interval.

# Split the data into training and testing sets (80% training, 20% testing)
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]

# Build a linear regression model on the training data and assign it to an
# object called model. The target variable is FinalGrades; the features are
# StudyHours, QuizScores, ForumPosts, and PreviousGrades.
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = train_data)

# Make predictions on the test set using the model object
test_predictions <- predict(model, newdata = test_data)

# Evaluation metrics: mean squared error and R-squared on the test set
mse <- mean((test_data$FinalGrades - test_predictions)^2)
ss_total <- sum((test_data$FinalGrades - mean(test_data$FinalGrades))^2)
ss_residual <- sum((test_data$FinalGrades - test_predictions)^2)
r_squared <- 1 - (ss_residual / ss_total)

# Print evaluation metrics
cat("Mean Squared Error:", mse, "\n")

Mean Squared Error: 22.34656

cat("R-squared:", r_squared, "\n")

R-squared: 0.8853893
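
The MSE is expressed in squared grade points, so its square root gives an error on the same scale as the grades. A minimal sketch using the MSE value printed above (the variable names are illustrative):

```r
# Test-set MSE as printed above
mse_printed <- 22.34656

# Root mean squared error, in grade points
rmse <- sqrt(mse_printed)

# Standard deviation of the noise used when simulating final grades
noise_sd <- 5

cat("RMSE:", round(rmse, 2), "\n")
cat("RMSE / noise sd:", round(rmse / noise_sd, 2), "\n")
```

An RMSE of about 4.73 is close to the noise standard deviation of 5, which is roughly the best any model can be expected to achieve on this simulated data.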

Model Accuracy based on Prediction Interval

# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")

# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]

# Actual values from the test data
actual_values <- test_data$FinalGrades

# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound

# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)

# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")

Model Accuracy using Prediction Interval: 0.96

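Here a prediction counts as correct when the actual final grade falls inside the model's prediction interval (95% by default in predict()), so the accuracy is the proportion of test observations covered by their intervals. A value of 0.96 is close to the nominal 95% level.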
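
The interval-based accuracy depends on the interval's nominal level, which predict() exposes through its `level` argument. The sketch below wraps the calculation in a hypothetical helper, `interval_coverage`, and applies it to a small simulated data set (not the student data) to show how coverage changes with the level.

```r
# Hypothetical helper: share of actual values inside the prediction interval
interval_coverage <- function(fit, newdata, actual, level = 0.95) {
  bounds <- predict(fit, newdata = newdata, interval = "prediction", level = level)
  mean(actual >= bounds[, "lwr"] & actual <= bounds[, "upr"])
}

# Small simulated example (assumed for illustration only)
set.seed(1)
toy <- data.frame(x = runif(300, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(300, sd = 1)
toy_train <- toy[1:200, ]
toy_test  <- toy[201:300, ]

toy_fit <- lm(y ~ x, data = toy_train)

# Coverage at two nominal levels; the wider interval covers at least as much
coverage_80 <- interval_coverage(toy_fit, toy_test, toy_test$y, level = 0.80)
coverage_95 <- interval_coverage(toy_fit, toy_test, toy_test$y, level = 0.95)
cat("80% interval coverage:", coverage_80, "\n")
cat("95% interval coverage:", coverage_95, "\n")
```

Because this measure rewards wide intervals, it is best read as a calibration check (is coverage close to the nominal level?) rather than as a measure of how precise the predictions are.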

Have fun!

# Calculate the correlation matrix
correlation_matrix <- cor(student_data)

# Print the correlation matrix
print(correlation_matrix)

                StudyHours   QuizScores   ForumPosts PreviousGrades FinalGrades
StudyHours      1.00000000  0.025275340 -0.018693242     0.01980435   0.1521135
QuizScores      0.02527534  1.000000000 -0.004709808     0.07537841   0.8732645
ForumPosts     -0.01869324 -0.004709808  1.000000000     0.04874554   0.2194791
PreviousGrades  0.01980435  0.075378406  0.048745536     1.00000000   0.2759020
FinalGrades     0.15211349  0.873264537  0.219479119     0.27590199   1.0000000

# You can also visualize the correlation matrix using a heatmap
library("corrplot")

corrplot 0.92 loaded

corrplot(correlation_matrix, method = "color")
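
Since the data are simulated, the correlations in the FinalGrades column can be compared with what the generating formula implies. The sketch below assumes the features are independent and uniformly distributed over their integer ranges, as in the data generation step; the variable names are illustrative.

```r
# Coefficients used in the data generation step
coefs <- c(StudyHours = 0.3, QuizScores = 0.4, ForumPosts = 0.2, PreviousGrades = 0.1)

# Variance of a discrete uniform draw over n consecutive integers: (n^2 - 1) / 12
n_values <- c(StudyHours = 20, QuizScores = 101, ForumPosts = 51, PreviousGrades = 101)
feature_var <- (n_values^2 - 1) / 12

# Variance of final grades = sum of feature contributions + noise variance (sd = 5)
noise_var <- 5^2
total_var <- sum(coefs^2 * feature_var) + noise_var

# Expected correlation of each (independent) feature with the final grade
expected_cor <- coefs * sqrt(feature_var) / sqrt(total_var)
print(round(expected_cor, 3))

# Expected R-squared of the correctly specified linear model
expected_r2 <- 1 - noise_var / total_var
cat("Expected R-squared:", round(expected_r2, 3), "\n")
```

Under these assumptions the expected correlations are roughly 0.13, 0.87, 0.22, and 0.22, broadly in line with the observed values; the gaps reflect sampling variation in 500 draws. QuizScores dominates because it combines the largest coefficient with a wide range.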

# Create synthetic data
set.seed(123)  # For reproducibility
x <- 1:100
y <- 2 * x + rnorm(100, mean = 0, sd = 10)

# Create a data frame from the variables
data <- data.frame(x, y)

# Perform linear regression
model <- lm(y ~ x, data = data)

# Print the summary of the regression model
summary(model)


Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.5356  -5.5236  -0.3462   6.4850  20.9487 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.36404    1.84287  -0.198    0.844    
x            2.02511    0.03168  63.920   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.145 on 98 degrees of freedom
Multiple R-squared:  0.9766,    Adjusted R-squared:  0.9763 
F-statistic:  4086 on 1 and 98 DF,  p-value: < 2.2e-16
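
The summary reports point estimates and standard errors; confidence intervals make it easier to check them against the known generating values (intercept 0, slope 2). A minimal sketch that recreates the same synthetic data and uses confint(); the object name `fit` is chosen here to avoid overwriting `model`.

```r
# Recreate the synthetic data and the simple regression from above
set.seed(123)
x <- 1:100
y <- 2 * x + rnorm(100, mean = 0, sd = 10)
fit <- lm(y ~ x, data = data.frame(x, y))

# 95% confidence intervals for the intercept and slope
ci <- confint(fit, level = 0.95)
print(ci)

# Check whether the true values (intercept 0, slope 2) are inside the intervals
slope_covered <- ci["x", 1] < 2 && ci["x", 2] > 2
intercept_covered <- ci["(Intercept)", 1] < 0 && ci["(Intercept)", 2] > 0
cat("True slope inside interval:", slope_covered, "\n")
cat("True intercept inside interval:", intercept_covered, "\n")
```

With the estimate and standard error reported above (2.02511 and 0.03168), the slope interval works out to roughly 1.96 to 2.09, which contains the true slope of 2.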

# Load necessary libraries
library(caret)

Loading required package: ggplot2

Warning: package 'ggplot2' was built under R version 4.3.1

Loading required package: lattice

Warning: package 'lattice' was built under R version 4.3.1

library(randomForest)

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

    margin

# Load the Iris dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
inTrain <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training_data <- iris[inTrain, ]
testing_data <- iris[-inTrain, ]

# Train a Random Forest classifier
model <- train(Species ~ ., data = training_data, method = "rf")

# Make predictions on the testing data
predictions <- predict(model, newdata = testing_data)

# Evaluate the model
confusion_matrix <- confusionMatrix(predictions, testing_data$Species)
print(confusion_matrix)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         14         2
  virginica       0          1        13

Overall Statistics
                                         
               Accuracy : 0.9333         
                 95% CI : (0.8173, 0.986)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9            
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9333           0.8667
Specificity                 1.0000            0.9333           0.9667
Pos Pred Value              1.0000            0.8750           0.9286
Neg Pred Value              1.0000            0.9655           0.9355
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3111           0.2889
Detection Prevalence        0.3333            0.3556           0.3111
Balanced Accuracy           1.0000            0.9333           0.9167
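
The headline statistics can be recomputed by hand from the confusion matrix, which helps clarify what each one measures. The sketch below enters the counts printed above into a matrix (rows are predictions, columns are reference classes); the object names are illustrative.

```r
# Confusion matrix counts as printed above (rows = prediction, columns = reference)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(
  c(15,  0,  0,
     0, 14,  2,
     0,  1, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity (recall): correct predictions per class over reference totals
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value (precision): correct over predicted totals
pos_pred_value <- diag(cm) / rowSums(cm)

cat("Accuracy:", round(accuracy, 4), "\n")
print(round(rbind(sensitivity, pos_pred_value), 4))
```

The accuracy is 42/45 = 0.9333, and the per-class values match the Sensitivity and Pos Pred Value rows reported by confusionMatrix().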

# Generate synthetic data for clustering
set.seed(123)
data <- data.frame(
  x = rnorm(100, mean = 0, sd = 1),
  y = rnorm(100, mean = 0, sd = 1)
)

# Perform k-means clustering
k <- 3  # Number of clusters
kmeans_result <- kmeans(data, centers = k)

# Print the cluster assignments
cluster_assignments <- kmeans_result$cluster
print(cluster_assignments)

  [1] 1 2 3 1 1 3 3 1 1 2 3 2 3 1 2 3 3 1 3 1 1 1 1 1 2 1 3 2 1 3 2 2 3 3 3 2 3
 [38] 2 2 1 2 1 1 3 3 1 1 2 2 1 2 2 2 3 1 3 1 3 2 3 2 1 1 2 1 2 2 1 3 3 1 1 3 2
 [75] 1 3 1 1 2 1 1 2 1 3 1 3 2 3 2 3 3 3 2 1 3 2 3 3 1 1

# Plot the clustered data
library(ggplot2)
data$Cluster <- as.factor(cluster_assignments)
ggplot(data, aes(x, y, color = Cluster)) +
  geom_point() +
  labs(title = "K-Means Clustering")
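
The value k = 3 above is chosen by hand, and the data are drawn from a single normal distribution, so there is no true cluster structure to recover. One common heuristic for choosing k is the elbow method, sketched below on data generated the same way; `points_df`, `k_values`, and `wss` are illustrative names.

```r
# Same kind of synthetic data as above: two independent standard normal variables
set.seed(123)
points_df <- data.frame(
  x = rnorm(100, mean = 0, sd = 1),
  y = rnorm(100, mean = 0, sd = 1)
)

# Total within-cluster sum of squares for k = 1 to 6
# nstart = 25 reruns k-means from several random starts and keeps the best fit
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(points_df, centers = k, nstart = 25)$tot.withinss
})

print(round(wss, 1))

# Simple elbow plot
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")
```

On unstructured data like this, the curve usually declines smoothly without a sharp bend, which is itself a hint that no particular k is strongly supported.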

# Build a Random Forest Regression model using the train data
library(randomForest)
model_rf <- randomForest(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = train_data)

# Make predictions on the test set
test_predictions_rf <- predict(model_rf, newdata = test_data)

# Calculate Mean Squared Error
mse_rf <- mean((test_data$FinalGrades - test_predictions_rf)^2)

# Calculate R-squared
ss_residual_rf <- sum((test_data$FinalGrades - test_predictions_rf)^2)
r_squared_rf <- 1 - (ss_residual_rf / ss_total)

# Print evaluation metrics for Random Forest model
cat("Random Forest Regression - Mean Squared Error:", mse_rf, "\n")

Random Forest Regression - Mean Squared Error: 34.69119

cat("Random Forest Regression - R-squared:", r_squared_rf, "\n")

Random Forest Regression - R-squared: 0.8220764
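
To compare the two models side by side, the printed test-set metrics can be collected in one table. A minimal sketch using the values reported above (the data frame and column names are illustrative):

```r
# Test-set metrics as printed above
results <- data.frame(
  Model     = c("Linear regression", "Random forest"),
  MSE       = c(22.34656, 34.69119),
  R_squared = c(0.8853893, 0.8220764)
)

# MSE of each model relative to the linear regression
results$MSE_ratio <- results$MSE / results$MSE[1]

print(results, digits = 4)
```

On this test split the random forest's MSE is about 55% higher than the linear model's. That is plausible here because the grades were simulated from a linear formula, so the linear model is correctly specified while the random forest, fit with default settings, has to approximate that line with piecewise-constant trees.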

max_final_grade <- max(student_data$FinalGrades)
max_final_grade <- round(max_final_grade, 2)  # Round to two decimal places
max_final_grade

[1] 95.36

# Calculate the correlation between StudyHours and FinalGrades
correlation <- cor(student_data$StudyHours, student_data$FinalGrades)

# Print the correlation coefficient
cat("Correlation between StudyHours and FinalGrades:", correlation, "\n")

Correlation between StudyHours and FinalGrades: 0.1521135

# Fit a linear regression model
model <- lm(FinalGrades ~ StudyHours + QuizScores + ForumPosts + PreviousGrades, data = student_data)

# Display the model summary
summary(model)


Call:
lm(formula = FinalGrades ~ StudyHours + QuizScores + ForumPosts + 
    PreviousGrades, data = student_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.3924  -3.4734   0.3027   3.0976  16.7901 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    25.076304   0.762633   32.88  < 2e-16 ***
StudyHours      0.298397   0.037113    8.04 6.66e-15 ***
QuizScores      0.404363   0.007692   52.57  < 2e-16 ***
ForumPosts      0.202482   0.015217   13.31  < 2e-16 ***
PreviousGrades  0.090967   0.007480   12.16  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.93 on 495 degrees of freedom
Multiple R-squared:  0.8696,    Adjusted R-squared:  0.8685 
F-statistic: 825.1 on 4 and 495 DF,  p-value: < 2.2e-16
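
Because the generating formula is known, the fitted coefficients can be checked against the true values. The sketch below copies the estimates and standard errors from the summary above and expresses each difference in standard errors; the object names are illustrative.

```r
# True coefficients from the data generation formula
true_coef <- c("(Intercept)" = 25, StudyHours = 0.3, QuizScores = 0.4,
               ForumPosts = 0.2, PreviousGrades = 0.1)

# Estimates and standard errors as printed in the summary above
estimate <- c("(Intercept)" = 25.076304, StudyHours = 0.298397, QuizScores = 0.404363,
              ForumPosts = 0.202482, PreviousGrades = 0.090967)
std_error <- c("(Intercept)" = 0.762633, StudyHours = 0.037113, QuizScores = 0.007692,
               ForumPosts = 0.015217, PreviousGrades = 0.007480)

# Distance between estimate and true value, in standard errors
comparison <- data.frame(
  True = true_coef,
  Estimate = estimate,
  SE_Distance = round((estimate - true_coef) / std_error, 2)
)
print(comparison)

# TRUE if every estimate is within two standard errors of its true value
all_within_two_se <- all(abs(comparison$SE_Distance) < 2)
cat("All estimates within 2 SE of the truth:", all_within_two_se, "\n")
```

Every estimate lies within two standard errors of its generating value, so the model recovers the simulation's structure. The residual standard error of 4.93 is likewise close to the noise standard deviation of 5.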

# Count the number of observations (rows) in student_data
num_observations <- nrow(student_data)

# Print the number of observations
cat("Number of observations in student_data:", num_observations, "\n")

Number of observations in student_data: 500