library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(e1071)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
library(rsample)
## Warning: package 'rsample' was built under R version 4.4.3
##
## Attaching package: 'rsample'
## The following object is masked from 'package:e1071':
##
## permutations
library(effects)
## Warning: package 'effects' was built under R version 4.4.3
## Loading required package: carData
## Use the command
## lattice::trellis.par.set(effectsTheme())
## to customize lattice options for effects plots.
## See ?effectsTheme for details.
library(visdat)
## Warning: package 'visdat' was built under R version 4.4.3
library(caret)
library(car)
## Warning: package 'car' was built under R version 4.4.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(PRROC)
## Warning: package 'PRROC' was built under R version 4.4.3
## Loading required package: rlang
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(tidyr)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
stu_df <- read.csv('student_habits_performance.csv')
str(stu_df)
## 'data.frame': 1000 obs. of 16 variables:
## $ student_id : chr "S1000" "S1001" "S1002" "S1003" ...
## $ age : int 23 20 21 23 19 24 21 21 23 18 ...
## $ gender : chr "Female" "Female" "Male" "Female" ...
## $ study_hours_per_day : num 0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
## $ social_media_hours : num 1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
## $ netflix_hours : num 1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
## $ part_time_job : chr "No" "No" "No" "No" ...
## $ attendance_percentage : num 85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
## $ sleep_hours : num 8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
## $ diet_quality : chr "Fair" "Good" "Poor" "Poor" ...
## $ exercise_frequency : int 6 6 1 4 3 1 2 0 3 5 ...
## $ parental_education_level : chr "Master" "High School" "High School" "Master" ...
## $ internet_quality : chr "Average" "Average" "Poor" "Good" ...
## $ mental_health_rating : int 8 8 1 1 1 4 4 8 1 10 ...
## $ extracurricular_participation: chr "Yes" "No" "No" "Yes" ...
## $ exam_score : num 56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...
summary(stu_df)
## student_id age gender study_hours_per_day
## Length:1000 Min. :17.00 Length:1000 Min. :0.00
## Class :character 1st Qu.:18.75 Class :character 1st Qu.:2.60
## Mode :character Median :20.00 Mode :character Median :3.50
## Mean :20.50 Mean :3.55
## 3rd Qu.:23.00 3rd Qu.:4.50
## Max. :24.00 Max. :8.30
## social_media_hours netflix_hours part_time_job attendance_percentage
## Min. :0.000 Min. :0.000 Length:1000 Min. : 56.00
## 1st Qu.:1.700 1st Qu.:1.000 Class :character 1st Qu.: 78.00
## Median :2.500 Median :1.800 Mode :character Median : 84.40
## Mean :2.506 Mean :1.820 Mean : 84.13
## 3rd Qu.:3.300 3rd Qu.:2.525 3rd Qu.: 91.03
## Max. :7.200 Max. :5.400 Max. :100.00
## sleep_hours diet_quality exercise_frequency parental_education_level
## Min. : 3.20 Length:1000 Min. :0.000 Length:1000
## 1st Qu.: 5.60 Class :character 1st Qu.:1.000 Class :character
## Median : 6.50 Mode :character Median :3.000 Mode :character
## Mean : 6.47 Mean :3.042
## 3rd Qu.: 7.30 3rd Qu.:5.000
## Max. :10.00 Max. :6.000
## internet_quality mental_health_rating extracurricular_participation
## Length:1000 Min. : 1.000 Length:1000
## Class :character 1st Qu.: 3.000 Class :character
## Mode :character Median : 5.000 Mode :character
## Mean : 5.438
## 3rd Qu.: 8.000
## Max. :10.000
## exam_score
## Min. : 18.40
## 1st Qu.: 58.48
## Median : 70.50
## Mean : 69.60
## 3rd Qu.: 81.33
## Max. :100.00
The dataset includes students aged 17 to 24, with most around 20 years old. Most students study about 3.5 hours per day, spend around 2.5 hours on social media, and 1.8 hours on Netflix. Many don’t work part-time, and most have good attendance and sleep around 6.5 hours a day. Most report a “Fair” diet and exercise about 3 times a week. Mental health ratings average around 5.5, and many participate in extracurricular activities. Exam scores range from 18 to 100, with an average of 69.6.
head(stu_df)
## student_id age gender study_hours_per_day social_media_hours netflix_hours
## 1 S1000 23 Female 0.0 1.2 1.1
## 2 S1001 20 Female 6.9 2.8 2.3
## 3 S1002 21 Male 1.4 3.1 1.3
## 4 S1003 23 Female 1.0 3.9 1.0
## 5 S1004 19 Female 5.0 4.4 0.5
## 6 S1005 24 Male 7.2 1.3 0.0
## part_time_job attendance_percentage sleep_hours diet_quality
## 1 No 85.0 8.0 Fair
## 2 No 97.3 4.6 Good
## 3 No 94.8 8.0 Poor
## 4 No 71.0 9.2 Poor
## 5 No 90.9 4.9 Fair
## 6 No 82.9 7.4 Fair
## exercise_frequency parental_education_level internet_quality
## 1 6 Master Average
## 2 6 High School Average
## 3 1 High School Poor
## 4 4 Master Good
## 5 3 Master Good
## 6 1 Master Average
## mental_health_rating extracurricular_participation exam_score
## 1 8 Yes 56.2
## 2 8 No 100.0
## 3 1 No 34.3
## 4 1 Yes 26.8
## 5 1 No 66.4
## 6 4 No 100.0
vis_dat(stu_df)
anyNA(stu_df)
## [1] FALSE
After performing a check with the vis_dat() function
and the AnyNA() function, it is clear that the dataset
does not contain any missing values (NA). The vis_dat()
function visually confirms that all columns are populated, while
AnyNA() returns a result of FALSE,
indicating that there are no missing values across any of the
variables.
stu_df$gender <- as.factor(stu_df$gender)
stu_df$part_time_job <- as.factor(stu_df$part_time_job)
stu_df$diet_quality <- as.factor(stu_df$diet_quality)
stu_df$parental_education_level <- as.factor(stu_df$parental_education_level)
stu_df$internet_quality <- as.factor(stu_df$internet_quality)
stu_df$extracurricular_participation <- as.factor(stu_df$extracurricular_participation)
str(stu_df)
## 'data.frame': 1000 obs. of 16 variables:
## $ student_id : chr "S1000" "S1001" "S1002" "S1003" ...
## $ age : int 23 20 21 23 19 24 21 21 23 18 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 1 1 2 1 1 2 1 1 1 1 ...
## $ study_hours_per_day : num 0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
## $ social_media_hours : num 1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
## $ netflix_hours : num 1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
## $ part_time_job : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
## $ attendance_percentage : num 85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
## $ sleep_hours : num 8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
## $ diet_quality : Factor w/ 3 levels "Fair","Good",..: 1 2 3 3 1 1 2 1 2 2 ...
## $ exercise_frequency : int 6 6 1 4 3 1 2 0 3 5 ...
## $ parental_education_level : Factor w/ 4 levels "Bachelor","High School",..: 3 2 2 3 3 3 3 1 1 1 ...
## $ internet_quality : Factor w/ 3 levels "Average","Good",..: 1 1 3 2 2 1 3 1 2 2 ...
## $ mental_health_rating : int 8 8 1 1 1 4 4 8 1 10 ...
## $ extracurricular_participation: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 2 ...
## $ exam_score : num 56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...
stu_df <- stu_df |>
dplyr::select(-student_id)
for (col in names(stu_df)) {
if (is.numeric(stu_df[[col]])) {
hist(stu_df[[col]], main = paste("Histogram of", col), xlab = col, col = "lightblue", border = "black", probability = TRUE)
lines(density(stu_df[[col]], na.rm = TRUE), col = "red", lwd = 2) # Add density line
}
}
The distributions of the variables show slight skewness, which can indicate the presence of outliers, but none are extreme enough to require removal, so we can keep them in the analysis.
par(mfrow = c(2, 2))
for (col in colnames(stu_df)) {
if (is.factor(stu_df[[col]])) {
print(
ggplot(stu_df, aes_string(x = col)) +
geom_bar(fill = "lightblue") +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, color = "black") +
labs(title = paste("Distribution of", col), x = col, y = "Count") +
theme_minimal()
)
}
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
cor_matrix <- cor(stu_df[, sapply(stu_df, is.numeric)])
corrplot(cor_matrix, method = "color",
addCoef.col = "lightgrey",
tl.col = "black",
number.cex = 0.7)
The study_hours_per_day variable has the strongest
correlation with pass/fail, suggesting that students who dedicate more
time to studying are more likely to pass their exams. This indicates
that study habits play a crucial role in determining whether a student
will pass or fail.
threshold <- 70
stu_df$pass_fail <- ifelse(stu_df$exam_score >= threshold, "pass", "fail")
stu_df$pass_fail <- as.factor(stu_df$pass_fail)
str(stu_df)
## 'data.frame': 1000 obs. of 16 variables:
## $ age : int 23 20 21 23 19 24 21 21 23 18 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 1 1 2 1 1 2 1 1 1 1 ...
## $ study_hours_per_day : num 0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
## $ social_media_hours : num 1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
## $ netflix_hours : num 1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
## $ part_time_job : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
## $ attendance_percentage : num 85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
## $ sleep_hours : num 8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
## $ diet_quality : Factor w/ 3 levels "Fair","Good",..: 1 2 3 3 1 1 2 1 2 2 ...
## $ exercise_frequency : int 6 6 1 4 3 1 2 0 3 5 ...
## $ parental_education_level : Factor w/ 4 levels "Bachelor","High School",..: 3 2 2 3 3 3 3 1 1 1 ...
## $ internet_quality : Factor w/ 3 levels "Average","Good",..: 1 1 3 2 2 1 3 1 2 2 ...
## $ mental_health_rating : int 8 8 1 1 1 4 4 8 1 10 ...
## $ extracurricular_participation: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 2 ...
## $ exam_score : num 56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...
## $ pass_fail : Factor w/ 2 levels "fail","pass": 1 2 1 1 1 2 2 2 2 2 ...
stu_df|>
ggplot(aes(pass_fail)) +
geom_bar()
The dataset is fairly balanced with the 70 threshold. Students below 70 are “fail” and those above 70 are “pass”, which seems like a good way to separate them for this analysis.
set.seed(28)
train_index <- createDataPartition(stu_df$pass_fail, p = 0.7, list = FALSE)
stu_train <- stu_df[train_index, ]
stu_test <- stu_df[-train_index, ]
This will look at whether students who spend more time studying are more likely to pass their exams.
ggplot(stu_df, aes(x = study_hours_per_day, fill = pass_fail)) +
geom_histogram(binwidth = 0.5, alpha = 0.7, position = "dodge") +
scale_fill_manual(values = c("fail" = "lightgreen", "pass" = "lightblue")) +
labs(title = "Distribution of Study Hours per Day by Pass/Fail", x = "Study Hours per Day", y = "Frequency") +
theme_minimal()
The histogram shows that most students who pass the exam study between 2 and 4 hours a day, with the highest number studying around 4 hours. On the other hand, students who fail tend to study less than 2 hours, and there are fewer students who study more than that. This suggests that students who study more are more likely to pass the exam. In general, more study time seems to help with exam scores.
ggplot(stu_df, aes(x = study_hours_per_day, y = exam_score, color = pass_fail)) +
geom_point(alpha = 0.7) +
labs(title = "Study Hours vs Exam Score", x = "Study Hours per Day", y = "Exam Score") +
theme_minimal() +
scale_color_manual(values = c("lightgreen", "lightblue"))
As seen in the histogram, the scatter plot shows that students who study more tend to have higher exam scores. Most of the passing students (in blue) study more than 3 hours, and their scores are mostly above 70. The failing students (in green) mostly study less than 3 hours and have lower scores. This suggests that more study time is linked to better exam performance
Looking at the mental health ratings, we can see if students with better mental health (higher scores) tend to get higher grades.
ggplot(stu_df, aes(x = mental_health_rating, fill = pass_fail)) +
geom_histogram(binwidth = 1, position = "dodge") +
scale_fill_manual(values = c("fail" = "lightgreen", "pass" = "lightblue")) +
labs(title = "Mental Health Ratings by Pass/Fail", x = "Mental Health Rating", y = "Count")
As seen in the histogram, students who tend to pass generally have better mental health ratings (closer to 7-10), but there are also some students with lower mental health ratings who still pass. On the other hand, students with lower mental health ratings (closer to 0) are more likely to fail, although we also see some students with better mental health ratings who still fail.
We want to understand how balancing a part-time job with study time and other activities affects students exam scores. This can help us determine if having a job negatively impacts their grades.
ggplot(stu_df, aes(x = exam_score, fill = pass_fail)) +
geom_histogram(position = "dodge", binwidth = 5) +
facet_wrap(~ part_time_job) +
labs(title = "Exam Scores by Pass/Fail and Part-Time Job Status",
x = "Exam Score", y = "Count") +
scale_fill_manual(values = c("fail" = "lightgreen", "pass" = "lightblue"))
The histograms show how part-time jobs affect exam scores. For students without a part-time job, most of the passing students (blue) have high scores, while failing students (green) are mostly in the lower range. For students with a part-time job, the passing students are more spread out across mid-range scores, and there are more failing students in the lower score range. This suggests that having a part-time job might hurt exam performance, but some students with jobs still do well.
For the modeling, I will be using study hours per day, attendance percentage, sleep hours, mental health rating, social media hours, netflix hours, and exercise frequency as the key factors to predict whether a student will pass or fail. These features will be incorporated into various models, including Logistic Regression, Decision Trees, Random Forest, and SVM, to assess their effectiveness in predicting academic success.
stu_glm <- glm(pass_fail ~ study_hours_per_day + attendance_percentage +
sleep_hours + mental_health_rating + social_media_hours +
netflix_hours + exercise_frequency,
data = stu_train, family = binomial)
summary(stu_glm)
##
## Call:
## glm(formula = pass_fail ~ study_hours_per_day + attendance_percentage +
## sleep_hours + mental_health_rating + social_media_hours +
## netflix_hours + exercise_frequency, family = binomial, data = stu_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.97634 2.33188 -9.424 < 2e-16 ***
## study_hours_per_day 3.27816 0.29011 11.300 < 2e-16 ***
## attendance_percentage 0.05637 0.01471 3.833 0.000127 ***
## sleep_hours 0.68576 0.12376 5.541 3.01e-08 ***
## mental_health_rating 0.70999 0.07145 9.937 < 2e-16 ***
## social_media_hours -1.00647 0.14779 -6.810 9.74e-12 ***
## netflix_hours -0.87977 0.14894 -5.907 3.49e-09 ***
## exercise_frequency 0.52407 0.08470 6.188 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 971.47 on 700 degrees of freedom
## Residual deviance: 327.61 on 693 degrees of freedom
## AIC: 343.61
##
## Number of Fisher Scoring iterations: 7
knitr::kable(vif(stu_glm))
| x | |
|---|---|
| study_hours_per_day | 3.019916 |
| attendance_percentage | 1.126974 |
| sleep_hours | 1.176234 |
| mental_health_rating | 2.177391 |
| social_media_hours | 1.442074 |
| netflix_hours | 1.252041 |
| exercise_frequency | 1.419682 |
The VIF values show that there is no significant multicollinearity.
pred_glm <- predict(stu_glm, newdata = stu_test, type = "response")
predclass_glm <- ifelse(pred_glm > 0.70, "pass", "fail")
caret::confusionMatrix(as.factor(predclass_glm), stu_test$pass_fail)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 144 19
## pass 2 134
##
## Accuracy : 0.9298
## 95% CI : (0.8946, 0.956)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8598
##
## Mcnemar's Test P-Value : 0.0004803
##
## Sensitivity : 0.9863
## Specificity : 0.8758
## Pos Pred Value : 0.8834
## Neg Pred Value : 0.9853
## Prevalence : 0.4883
## Detection Rate : 0.4816
## Detection Prevalence : 0.5452
## Balanced Accuracy : 0.9311
##
## 'Positive' Class : fail
##
glm_curve <- pr.curve(scores.class0 = pred_glm[stu_test$pass_fail == "pass"],
scores.class1 = pred_glm[stu_test$pass_fail == "fail"], curve = TRUE)
plot(glm_curve)
The GLM model is correct 92.64% of the time, with 95.89% sensitivity for predicting fails and 89.54% specificity for predicting passes. The Kappa score of 0.853 shows strong agreement, and the model has a balanced accuracy of 92.72%. Overall, it performs excellently.
The Precision-Recall curve shows that the model performs really well. With an AUC of 0.98, the model is great at predicting whether students will pass or fail.
Unprune
stu_dt <- rpart(pass_fail ~ study_hours_per_day + attendance_percentage +
sleep_hours + mental_health_rating + social_media_hours +
netflix_hours + exercise_frequency,
data = stu_train, method = "class")
rpart.plot(stu_dt, main = "Decision Tree for Pass/Fail Prediction",
type = 4, extra = 101, cex = 0.8)
The decision tree predicts pass or fail based on factors like study hours, mental health, social media hours, and sleep. It splits on study hours, with lower study time leading to a higher chance of failure. Further splits consider mental health and sleep, showing that poor mental health and low study hours are linked to failure, while better habits improve the chances of passing.
stu_unpruned <- predict(stu_dt, newdata = stu_test, type = "class")
confusionMatrix(stu_unpruned, stu_test$pass_fail)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 116 16
## pass 30 137
##
## Accuracy : 0.8462
## 95% CI : (0.8002, 0.8851)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6915
##
## Mcnemar's Test P-Value : 0.05527
##
## Sensitivity : 0.7945
## Specificity : 0.8954
## Pos Pred Value : 0.8788
## Neg Pred Value : 0.8204
## Prevalence : 0.4883
## Detection Rate : 0.3880
## Detection Prevalence : 0.4415
## Balanced Accuracy : 0.8450
##
## 'Positive' Class : fail
##
stu_unpruned_prob <- predict(stu_dt, newdata = stu_test, type = "prob")
roc_unpruned <- roc(stu_test$pass_fail, stu_unpruned_prob[, 2])
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
plot(roc_unpruned, main = "ROC Curve - Unpruned Decision Tree")
auc_unpruned <- auc(roc_unpruned)
text(0.5, 0.5, paste("AUC = ", round(auc_unpruned, 2)), col = "red", cex = 1.2)
The model is correct 84.62% of the time. It identifies 79.45% of students who fail and 89.54% of students who pass, showing strong performance in predicting both outcomes. The Kappa score of 0.6915 indicates good agreement between predicted and actual values.
The ROC curve shows the model has a good AUC of 0.9, meaning it’s good at telling apart students who pass from those who fail. The curve is close to the top-left, showing it’s doing a great job at making accurate predictions.
Let’s see if we can improve the model by pruning the decision tree. By removing unnecessary branches, we can simplify the model, reduce overfitting, and help it generalize better to new data.
Prune
stu_dt_pruned <- prune(stu_dt, cp = 0.05)
rpart.plot(stu_dt_pruned, main = "Pruned Decision Tree for Pass/Fail Prediction",
type = 4, extra = 101, cex = 0.8)
The pruned decision tree shows that study hours per day is the key factor in predicting success. Students who study less than 3.2 hours are more likely to fail. For those studying more, the tree splits at 4.4 hours, with lower study hours leading to a 34% pass rate. Further splits based on mental health rating show that better mental health improves the chances of passing.
stu_pruned <- predict(stu_dt_pruned, newdata = stu_test, type = "class")
confusionMatrix(stu_pruned, stu_test$pass_fail)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 126 29
## pass 20 124
##
## Accuracy : 0.8361
## 95% CI : (0.7892, 0.8762)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6725
##
## Mcnemar's Test P-Value : 0.2531
##
## Sensitivity : 0.8630
## Specificity : 0.8105
## Pos Pred Value : 0.8129
## Neg Pred Value : 0.8611
## Prevalence : 0.4883
## Detection Rate : 0.4214
## Detection Prevalence : 0.5184
## Balanced Accuracy : 0.8367
##
## 'Positive' Class : fail
##
stu_pruned_prob <- predict(stu_dt_pruned, newdata = stu_test, type = "prob")
roc_pruned <- roc(stu_test$pass_fail, stu_pruned_prob[, 2])
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
plot(roc_pruned, main = "ROC Curve - Pruned Decision Tree")
auc_pruned <- auc(roc_pruned)
text(0.5, 0.5, paste("AUC = ", round(auc_pruned, 2)), col = "blue", cex = 1.2)
The confusion matrix shows that the pruned decision tree is correct 83.61% of the time. It’s good at identifying students who pass (81.05%) but is better at spotting students who fail (86.30%).
The ROC curve shows an AUC of 0.88, which means the model is good at telling the difference between students who pass and fail.
In conclusion, while both the unpruned and pruned decision trees perform well, the unpruned tree offers a simpler, more generalizable model with slightly better performance in terms of accuracy and AUC.
stu_rf <- randomForest(pass_fail ~ study_hours_per_day + attendance_percentage +
sleep_hours + mental_health_rating + social_media_hours +
netflix_hours + exercise_frequency,
data = stu_train, mtry = sqrt(ncol(stu_train) - 1), importance = TRUE)
importance(stu_rf)
## fail pass MeanDecreaseAccuracy
## study_hours_per_day 107.2619106 111.751269 136.39599043
## attendance_percentage 0.4429466 -0.464751 -0.08713871
## sleep_hours 8.3467080 13.231200 14.63199751
## mental_health_rating 30.6948149 39.061910 47.06816174
## social_media_hours 13.1285950 15.724227 19.87403055
## netflix_hours 3.9165309 5.338810 6.76869211
## exercise_frequency 5.4782002 7.332812 8.84369142
## MeanDecreaseGini
## study_hours_per_day 170.35378
## attendance_percentage 25.06840
## sleep_hours 30.08309
## mental_health_rating 50.13260
## social_media_hours 31.90458
## netflix_hours 24.59631
## exercise_frequency 17.69699
varImpPlot(stu_rf)
This table shows the importance of each feature for predicting academic success. Study hours per day is the most influential factor, with the highest Mean Decrease Accuracy and Mean Decrease Gini values, indicating its significant role in predicting pass or fail. Mental health rating also has high importance. Attendance percentage, sleep hours, and social media hours contribute moderately, while netflix hours and exercise frequency have the least impact on the model’s predictions.
predict_rf <- predict(stu_rf, newdata = stu_test)
confusionMatrix(predict_rf, stu_test$pass_fail)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 133 14
## pass 13 139
##
## Accuracy : 0.9097
## 95% CI : (0.8713, 0.9396)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8193
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9110
## Specificity : 0.9085
## Pos Pred Value : 0.9048
## Neg Pred Value : 0.9145
## Prevalence : 0.4883
## Detection Rate : 0.4448
## Detection Prevalence : 0.4916
## Balanced Accuracy : 0.9097
##
## 'Positive' Class : fail
##
stu_rf_prob <- predict(stu_rf, newdata = stu_test, type = "prob")
roc_rf <- roc(stu_test$pass_fail, stu_rf_prob[, 2])
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
plot(roc_rf, main = "ROC Curve - Random Forest")
auc_rf <- auc(roc_rf)
text(0.5, 0.5, paste("AUC = ", round(auc_rf, 2)), col = "blue", cex = 1.2)
The model is correct 91.64% of the time, with 91.10% sensitivity for predicting fails and 92.16% specificity for predicting passes. The Kappa score of 0.8327 shows strong agreement between predictions and actual values. The balanced accuracy is 91.63%, indicating reliable predictions for both pass and fail outcomes.
This ROC curve shows how well the Random Forest model performs. The curve is close to the top left corner, which is great—this means the model does a good job of correctly identifying both pass and fail students. The AUC of 0.97 is very high, indicating excellent model performance.
Linear
stu_svm_linear <- svm(pass_fail ~ study_hours_per_day + attendance_percentage +
sleep_hours + mental_health_rating + social_media_hours +
netflix_hours + exercise_frequency,
data = stu_train, kernel = "linear")
predict_svm_linear <- predict(stu_svm_linear, newdata = stu_test, type = 'response')
confusionMatrix(predict_svm_linear, stu_test$pass_fail)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 135 10
## pass 11 143
##
## Accuracy : 0.9298
## 95% CI : (0.8946, 0.956)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8594
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9247
## Specificity : 0.9346
## Pos Pred Value : 0.9310
## Neg Pred Value : 0.9286
## Prevalence : 0.4883
## Detection Rate : 0.4515
## Detection Prevalence : 0.4849
## Balanced Accuracy : 0.9296
##
## 'Positive' Class : fail
##
For the SVM Linear model, the accuracy is 92.98%, with 92.47% sensitivity for identifying students who fail and 93.46% specificity for identifying students who pass. The Kappa score of 0.8594 indicates strong predictive performance, and the balanced accuracy of 92.96% reflects good overall prediction for both classes.
Polynomial
stu_svm_poly <- svm(pass_fail ~ study_hours_per_day + attendance_percentage +
sleep_hours + mental_health_rating + social_media_hours + netflix_hours + exercise_frequency,
data = stu_train, kernel = "polynomial", degree = 3)
predictions_svm_poly <- predict(stu_svm_poly, newdata = stu_test)
confusionMatrix(predictions_svm_poly, stu_test$pass_fail)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 135 11
## pass 11 142
##
## Accuracy : 0.9264
## 95% CI : (0.8907, 0.9533)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8528
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9247
## Specificity : 0.9281
## Pos Pred Value : 0.9247
## Neg Pred Value : 0.9281
## Prevalence : 0.4883
## Detection Rate : 0.4515
## Detection Prevalence : 0.4883
## Balanced Accuracy : 0.9264
##
## 'Positive' Class : fail
##
For the SVM Polynomial (degree 3) model, the accuracy is 92.64%, with 92.47% sensitivity for predicting fails and 92.81% specificity for predicting passes. The Kappa score of 0.8528 indicates strong agreement, and the balanced accuracy is 92.64%, reflecting reliable predictions for both pass and fail outcomes.
Radial
stu_svm_radial <- svm(pass_fail ~ study_hours_per_day + attendance_percentage +
sleep_hours + mental_health_rating + social_media_hours +
netflix_hours + exercise_frequency,
data = stu_train, kernel = "radial")
predictions_svm_radial <- predict(stu_svm_radial, newdata = stu_test)
confusionMatrix(predictions_svm_radial, stu_test$pass_fail)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fail pass
## fail 135 11
## pass 11 142
##
## Accuracy : 0.9264
## 95% CI : (0.8907, 0.9533)
## No Information Rate : 0.5117
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8528
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9247
## Specificity : 0.9281
## Pos Pred Value : 0.9247
## Neg Pred Value : 0.9281
## Prevalence : 0.4883
## Detection Rate : 0.4515
## Detection Prevalence : 0.4883
## Balanced Accuracy : 0.9264
##
## 'Positive' Class : fail
##
For the SVM Radial model, the accuracy is 92.64%, with 92.47% sensitivity for fails and 92.81% specificity for passes. The Kappa score of 0.8528 indicates strong performance, and the balanced accuracy is 92.64%, showing reliable predictions for both pass and fail outcomes.
results_df <- data.frame(
Model = c("Logistic Regression", "Unpruned Decision Tree", "Pruned Decision Tree", "Random Forest", "SVM Linear", "SVM Polynomial", "SVM Radial"),
Accuracy = c(
caret::confusionMatrix(as.factor(ifelse(predict(stu_glm, newdata = stu_test, type = "response") > 0.70, "pass", "fail")), stu_test$pass_fail)$overall['Accuracy'],
caret::confusionMatrix(predict(stu_dt, newdata = stu_test, type = "class"), stu_test$pass_fail)$overall['Accuracy'],
caret::confusionMatrix(predict(stu_dt_pruned, newdata = stu_test, type = "class"), stu_test$pass_fail)$overall['Accuracy'],
caret::confusionMatrix(predict(stu_rf, newdata = stu_test), stu_test$pass_fail)$overall['Accuracy'],
caret::confusionMatrix(predict(stu_svm_linear, newdata = stu_test, type = 'response'), stu_test$pass_fail)$overall['Accuracy'],
caret::confusionMatrix(predict(stu_svm_poly, newdata = stu_test), stu_test$pass_fail)$overall['Accuracy'],
caret::confusionMatrix(predict(stu_svm_radial, newdata = stu_test), stu_test$pass_fail)$overall['Accuracy']
),
Sensitivity = c(
caret::confusionMatrix(as.factor(ifelse(predict(stu_glm, newdata = stu_test, type = "response") > 0.70, "pass", "fail")), stu_test$pass_fail)$byClass['Sensitivity'],
caret::confusionMatrix(predict(stu_dt, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Sensitivity'],
caret::confusionMatrix(predict(stu_dt_pruned, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Sensitivity'],
caret::confusionMatrix(predict(stu_rf, newdata = stu_test), stu_test$pass_fail)$byClass['Sensitivity'],
caret::confusionMatrix(predict(stu_svm_linear, newdata = stu_test, type = 'response'), stu_test$pass_fail)$byClass['Sensitivity'],
caret::confusionMatrix(predict(stu_svm_poly, newdata = stu_test), stu_test$pass_fail)$byClass['Sensitivity'],
caret::confusionMatrix(predict(stu_svm_radial, newdata = stu_test), stu_test$pass_fail)$byClass['Sensitivity']
),
Specificity = c(
caret::confusionMatrix(as.factor(ifelse(predict(stu_glm, newdata = stu_test, type = "response") > 0.70, "pass", "fail")), stu_test$pass_fail)$byClass['Specificity'],
caret::confusionMatrix(predict(stu_dt, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Specificity'],
caret::confusionMatrix(predict(stu_dt_pruned, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Specificity'],
caret::confusionMatrix(predict(stu_rf, newdata = stu_test), stu_test$pass_fail)$byClass['Specificity'],
caret::confusionMatrix(predict(stu_svm_linear, newdata = stu_test, type = 'response'), stu_test$pass_fail)$byClass['Specificity'],
caret::confusionMatrix(predict(stu_svm_poly, newdata = stu_test), stu_test$pass_fail)$byClass['Specificity'],
caret::confusionMatrix(predict(stu_svm_radial, newdata = stu_test), stu_test$pass_fail)$byClass['Specificity']
),
AUC = c(
auc(roc(stu_test$pass_fail, predict(stu_glm, newdata = stu_test, type = "response"))),
auc(roc(stu_test$pass_fail, predict(stu_dt, newdata = stu_test, type = "prob")[, 2])),
auc(roc(stu_test$pass_fail, predict(stu_dt_pruned, newdata = stu_test, type = "prob")[, 2])),
auc(roc(stu_test$pass_fail, predict(stu_rf, newdata = stu_test, type = "prob")[, 2])),
NA,
NA,
NA
)
)
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
# Display the results
print(results_df)
## Model Accuracy Sensitivity Specificity AUC
## 1 Logistic Regression 0.9297659 0.9863014 0.8758170 0.9813323
## 2 Unpruned Decision Tree 0.8461538 0.7945205 0.8954248 0.9034829
## 3 Pruned Decision Tree 0.8361204 0.8630137 0.8104575 0.8841436
## 4 Random Forest 0.9096990 0.9109589 0.9084967 0.9709016
## 5 SVM Linear 0.9297659 0.9246575 0.9346405 NA
## 6 SVM Polynomial 0.9264214 0.9246575 0.9281046 NA
## 7 SVM Radial 0.9264214 0.9246575 0.9281046 NA
The results show that we can predict student success based on factors like study habits, attendance, sleep quality, mental health, and other factors like social media usage and exercise frequency. Logistic Regression performed the best with 92.98% accuracy, 98.63% sensitivity, and an AUC of 0.98, making it highly effective. The Unpruned Decision Tree had 84.62% accuracy and 79.45% sensitivity, but it struggled more with identifying students at risk. Random Forest also performed well with 90.97% accuracy, 91.10% sensitivity, and an AUC of 0.97, showing it is a strong and balanced model. The SVM models showed similar accuracy around 92.64%, with high specificity, but their AUC values were not available.
Overall, Logistic Regression was the most effective, with Random Forest and SVMs also performing strongly. This highlights that study habits and mental health are key predictors of academic success.
In conclusion, based on the analysis and model results, we can predict a student’s academic success (pass or fail) using factors such as study hours per day, attendance percentage, sleep hours, mental health rating, social media hours, Netflix hours, and exercise frequency. The models especially Logistic Regression, Random Forest, and the SVM models performed well, showing strong accuracy and sensitivity. These findings suggest that academic performance is influenced by these key factors, and improving study habits, attendance, mental health, and sleep quality can significantly boost a student’s chances of success.