In this case study, you are an analyst at an online education platform. The management is interested in predicting student performance based on various factors to provide personalized support and improve the learning experience. Your task is to develop a supervised learning model to predict students’ final grades using simulated data.
Your goal is to build a predictive model using supervised learning techniques in R. You will utilize simulated student data with features such as study hours, quiz scores, forum participation, and previous grades to predict the final grades.
# Set a fixed random seed for reproducibility
set.seed(10923)
# Number of students
#TODO: set num_students to 500
# Enter code below:
num_students <- 500
# Simulate study hours (ranging from 1 to 20 hours)
study_hours <- sample(1:20, num_students, replace = TRUE)
# Simulate quiz scores (ranging from 0 to 100)
quiz_scores <- sample(0:100, num_students, replace = TRUE)
# Simulate forum participation (ranging from 0 to 50 posts)
forum_posts <- sample(0:50, num_students, replace = TRUE)
# Simulate previous grades (ranging from 0 to 100)
previous_grades <- sample(0:100, num_students, replace = TRUE)
# Simulate final grades (ranging from 0 to 100)
final_grades <- 0.3 * study_hours + 0.4 * quiz_scores + 0.2 * forum_posts + 0.1 * previous_grades + rnorm(num_students, mean = 0, sd = 5) + 25
# Create a data frame
student_data <- data.frame(StudyHours = study_hours, QuizScores = quiz_scores, ForumPosts = forum_posts, PreviousGrades = previous_grades, FinalGrades = final_grades)
# View the first few rows of the generated data
head(student_data)
## StudyHours QuizScores ForumPosts PreviousGrades FinalGrades
## 1 20 91 22 78 80.80895
## 2 12 26 27 1 46.45853
## 3 13 5 8 60 40.22946
## 4 4 96 13 78 70.64216
## 5 5 74 45 31 62.35254
## 6 18 1 47 50 48.42835
max(student_data$FinalGrades)
## [1] 95.36113
# Todo:
summary(student_data)
## StudyHours QuizScores ForumPosts PreviousGrades
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 24.00 1st Qu.:12.00 1st Qu.: 23.00
## Median :11.00 Median : 48.00 Median :24.00 Median : 51.00
## Mean :10.67 Mean : 48.54 Mean :24.26 Mean : 50.05
## 3rd Qu.:16.00 3rd Qu.: 73.00 3rd Qu.:37.00 3rd Qu.: 75.00
## Max. :20.00 Max. :100.00 Max. :50.00 Max. :100.00
## FinalGrades
## Min. :24.19
## 1st Qu.:47.15
## Median :57.18
## Mean :57.35
## 3rd Qu.:67.01
## Max. :95.36
str(student_data)
## 'data.frame': 500 obs. of 5 variables:
## $ StudyHours : int 20 12 13 4 5 18 17 16 3 14 ...
## $ QuizScores : int 91 26 5 96 74 1 48 91 28 4 ...
## $ ForumPosts : int 22 27 8 13 45 47 6 46 14 5 ...
## $ PreviousGrades: int 78 1 60 78 31 50 92 39 75 33 ...
## $ FinalGrades : num 80.8 46.5 40.2 70.6 62.4 ...
Use 80% of the data for training and 20% for testing to predict final grades. Compute the Mean Squared Error and model accuracy based on prediction interval.
# Todo:
# Splitting the data into training and testing sets (80% training, 20% testing)
set.seed(10923) # Set seed for reproducibility
sample_index <- sample(1:nrow(student_data), 0.8 * nrow(student_data))
train_data <- student_data[sample_index, ]
test_data <- student_data[-sample_index, ]
# Building a Linear Regression model using the train data and assign it to an object # called model.
# Todo: Target variable is FinalGrades and the Features are StudyHours, QuizScores, # ForumPosts, and PreviousGrades
# Enter code below:
model <- lm(FinalGrades ~., data = train_data)
# Making predictions on the test set. use the model object to make prediction.
# Enter code below:
prediction <- predict(model, data = test_data)
# Evaluation metrics
# Compute the mean squared error and R-squared
# Enter code below
mse <- mean((test_data$FinalGrades -prediction)^2)
r_sqr <- summary(model)$r.squared
# Calculate residuals
residuals <- prediction - test_data$FinalGrades
# Calculate MSE
# Print evaluation metrics
#Enter code below
mse
## [1] 358.647
r_sqr
## [1] 0.8648338
summary(model)
##
## Call:
## lm(formula = FinalGrades ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5265 -3.4421 0.3997 3.1947 15.6419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.953643 0.863889 28.885 < 2e-16 ***
## StudyHours 0.331338 0.041453 7.993 1.46e-14 ***
## QuizScores 0.402828 0.008646 46.593 < 2e-16 ***
## ForumPosts 0.194558 0.017110 11.371 < 2e-16 ***
## PreviousGrades 0.090502 0.008312 10.888 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.988 on 395 degrees of freedom
## Multiple R-squared: 0.8648, Adjusted R-squared: 0.8635
## F-statistic: 631.8 on 4 and 395 DF, p-value: < 2.2e-16
# Calculate the correlation
correlation <- cor(student_data$StudyHours, student_data$FinalGrades)
# Print the correlation with two decimal places
round(correlation, 2)
## [1] 0.15
summary(student_data)
## StudyHours QuizScores ForumPosts PreviousGrades
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 24.00 1st Qu.:12.00 1st Qu.: 23.00
## Median :11.00 Median : 48.00 Median :24.00 Median : 51.00
## Mean :10.67 Mean : 48.54 Mean :24.26 Mean : 50.05
## 3rd Qu.:16.00 3rd Qu.: 73.00 3rd Qu.:37.00 3rd Qu.: 75.00
## Max. :20.00 Max. :100.00 Max. :50.00 Max. :100.00
## FinalGrades
## Min. :24.19
## 1st Qu.:47.15
## Median :57.18
## Mean :57.35
## 3rd Qu.:67.01
## Max. :95.36
str(test_data)
## 'data.frame': 100 obs. of 5 variables:
## $ StudyHours : int 12 17 2 11 2 8 2 4 14 6 ...
## $ QuizScores : int 89 23 92 91 78 74 11 13 75 55 ...
## $ ForumPosts : int 38 18 32 8 5 28 29 17 33 26 ...
## $ PreviousGrades: int 83 24 77 55 69 88 33 47 22 27 ...
## $ FinalGrades : num 74.3 47.6 81.9 73.6 60.9 ...
# Get the predictions and prediction intervals
pred_int <- predict(model, newdata = test_data, interval = "prediction")
# Extract lower and upper bounds of the prediction interval
lower_bound <- pred_int[, "lwr"]
upper_bound <- pred_int[, "upr"]
# Actual values from the test data
actual_values <- test_data$FinalGrades
# Check if the actual values fall within the prediction interval
correct_predictions <- actual_values >= lower_bound & actual_values <= upper_bound
# Compute accuracy
accuracy <- sum(correct_predictions) / length(correct_predictions)
# Print accuracy
cat("Model Accuracy using Prediction Interval:", accuracy, "\n")
## Model Accuracy using Prediction Interval: 0.96