Import Library

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(e1071)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
library(rsample)
## Warning: package 'rsample' was built under R version 4.4.3
## 
## Attaching package: 'rsample'
## The following object is masked from 'package:e1071':
## 
##     permutations
library(effects)
## Warning: package 'effects' was built under R version 4.4.3
## Loading required package: carData
## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.
library(visdat)
## Warning: package 'visdat' was built under R version 4.4.3
library(caret)
library(car)
## Warning: package 'car' was built under R version 4.4.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(PRROC)
## Warning: package 'PRROC' was built under R version 4.4.3
## Loading required package: rlang
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tidyr)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

Load in Data

stu_df <- read.csv('student_habits_performance.csv')
str(stu_df)
## 'data.frame':    1000 obs. of  16 variables:
##  $ student_id                   : chr  "S1000" "S1001" "S1002" "S1003" ...
##  $ age                          : int  23 20 21 23 19 24 21 21 23 18 ...
##  $ gender                       : chr  "Female" "Female" "Male" "Female" ...
##  $ study_hours_per_day          : num  0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
##  $ social_media_hours           : num  1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
##  $ netflix_hours                : num  1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
##  $ part_time_job                : chr  "No" "No" "No" "No" ...
##  $ attendance_percentage        : num  85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
##  $ sleep_hours                  : num  8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
##  $ diet_quality                 : chr  "Fair" "Good" "Poor" "Poor" ...
##  $ exercise_frequency           : int  6 6 1 4 3 1 2 0 3 5 ...
##  $ parental_education_level     : chr  "Master" "High School" "High School" "Master" ...
##  $ internet_quality             : chr  "Average" "Average" "Poor" "Good" ...
##  $ mental_health_rating         : int  8 8 1 1 1 4 4 8 1 10 ...
##  $ extracurricular_participation: chr  "Yes" "No" "No" "Yes" ...
##  $ exam_score                   : num  56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...

Data Explore

summary(stu_df)
##   student_id             age           gender          study_hours_per_day
##  Length:1000        Min.   :17.00   Length:1000        Min.   :0.00       
##  Class :character   1st Qu.:18.75   Class :character   1st Qu.:2.60       
##  Mode  :character   Median :20.00   Mode  :character   Median :3.50       
##                     Mean   :20.50                      Mean   :3.55       
##                     3rd Qu.:23.00                      3rd Qu.:4.50       
##                     Max.   :24.00                      Max.   :8.30       
##  social_media_hours netflix_hours   part_time_job      attendance_percentage
##  Min.   :0.000      Min.   :0.000   Length:1000        Min.   : 56.00       
##  1st Qu.:1.700      1st Qu.:1.000   Class :character   1st Qu.: 78.00       
##  Median :2.500      Median :1.800   Mode  :character   Median : 84.40       
##  Mean   :2.506      Mean   :1.820                      Mean   : 84.13       
##  3rd Qu.:3.300      3rd Qu.:2.525                      3rd Qu.: 91.03       
##  Max.   :7.200      Max.   :5.400                      Max.   :100.00       
##   sleep_hours    diet_quality       exercise_frequency parental_education_level
##  Min.   : 3.20   Length:1000        Min.   :0.000      Length:1000             
##  1st Qu.: 5.60   Class :character   1st Qu.:1.000      Class :character        
##  Median : 6.50   Mode  :character   Median :3.000      Mode  :character        
##  Mean   : 6.47                      Mean   :3.042                              
##  3rd Qu.: 7.30                      3rd Qu.:5.000                              
##  Max.   :10.00                      Max.   :6.000                              
##  internet_quality   mental_health_rating extracurricular_participation
##  Length:1000        Min.   : 1.000       Length:1000                  
##  Class :character   1st Qu.: 3.000       Class :character             
##  Mode  :character   Median : 5.000       Mode  :character             
##                     Mean   : 5.438                                    
##                     3rd Qu.: 8.000                                    
##                     Max.   :10.000                                    
##    exam_score    
##  Min.   : 18.40  
##  1st Qu.: 58.48  
##  Median : 70.50  
##  Mean   : 69.60  
##  3rd Qu.: 81.33  
##  Max.   :100.00

The dataset includes students aged 17 to 24, with most around 20 years old. Most students study about 3.5 hours per day, spend around 2.5 hours on social media, and 1.8 hours on Netflix. Many don’t work part-time, and most have good attendance and sleep around 6.5 hours a day. Most report a “Fair” diet and exercise about 3 times a week. Mental health ratings average around 5.5, and many participate in extracurricular activities. Exam scores range from 18 to 100, with an average of 69.6.

head(stu_df)
##   student_id age gender study_hours_per_day social_media_hours netflix_hours
## 1      S1000  23 Female                 0.0                1.2           1.1
## 2      S1001  20 Female                 6.9                2.8           2.3
## 3      S1002  21   Male                 1.4                3.1           1.3
## 4      S1003  23 Female                 1.0                3.9           1.0
## 5      S1004  19 Female                 5.0                4.4           0.5
## 6      S1005  24   Male                 7.2                1.3           0.0
##   part_time_job attendance_percentage sleep_hours diet_quality
## 1            No                  85.0         8.0         Fair
## 2            No                  97.3         4.6         Good
## 3            No                  94.8         8.0         Poor
## 4            No                  71.0         9.2         Poor
## 5            No                  90.9         4.9         Fair
## 6            No                  82.9         7.4         Fair
##   exercise_frequency parental_education_level internet_quality
## 1                  6                   Master          Average
## 2                  6              High School          Average
## 3                  1              High School             Poor
## 4                  4                   Master             Good
## 5                  3                   Master             Good
## 6                  1                   Master          Average
##   mental_health_rating extracurricular_participation exam_score
## 1                    8                           Yes       56.2
## 2                    8                            No      100.0
## 3                    1                            No       34.3
## 4                    1                           Yes       26.8
## 5                    1                            No       66.4
## 6                    4                            No      100.0
vis_dat(stu_df)

anyNA(stu_df)
## [1] FALSE

After performing a check with the vis_dat() function and the AnyNA() function, it is clear that the dataset does not contain any missing values (NA). The vis_dat() function visually confirms that all columns are populated, while AnyNA() returns a result of FALSE, indicating that there are no missing values across any of the variables.

Converting Categorical Variables to Factor

stu_df$gender <- as.factor(stu_df$gender)
stu_df$part_time_job <- as.factor(stu_df$part_time_job)
stu_df$diet_quality <- as.factor(stu_df$diet_quality)
stu_df$parental_education_level <- as.factor(stu_df$parental_education_level)
stu_df$internet_quality <- as.factor(stu_df$internet_quality)
stu_df$extracurricular_participation <- as.factor(stu_df$extracurricular_participation)
str(stu_df)
## 'data.frame':    1000 obs. of  16 variables:
##  $ student_id                   : chr  "S1000" "S1001" "S1002" "S1003" ...
##  $ age                          : int  23 20 21 23 19 24 21 21 23 18 ...
##  $ gender                       : Factor w/ 3 levels "Female","Male",..: 1 1 2 1 1 2 1 1 1 1 ...
##  $ study_hours_per_day          : num  0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
##  $ social_media_hours           : num  1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
##  $ netflix_hours                : num  1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
##  $ part_time_job                : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
##  $ attendance_percentage        : num  85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
##  $ sleep_hours                  : num  8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
##  $ diet_quality                 : Factor w/ 3 levels "Fair","Good",..: 1 2 3 3 1 1 2 1 2 2 ...
##  $ exercise_frequency           : int  6 6 1 4 3 1 2 0 3 5 ...
##  $ parental_education_level     : Factor w/ 4 levels "Bachelor","High School",..: 3 2 2 3 3 3 3 1 1 1 ...
##  $ internet_quality             : Factor w/ 3 levels "Average","Good",..: 1 1 3 2 2 1 3 1 2 2 ...
##  $ mental_health_rating         : int  8 8 1 1 1 4 4 8 1 10 ...
##  $ extracurricular_participation: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 2 ...
##  $ exam_score                   : num  56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...

Remove unnecessary variables

stu_df <- stu_df |>
  dplyr::select(-student_id)

Check for Outlier

for (col in names(stu_df)) {
  if (is.numeric(stu_df[[col]])) {
    
    hist(stu_df[[col]], main = paste("Histogram of", col), xlab = col, col = "lightblue", border = "black", probability = TRUE)
    
    lines(density(stu_df[[col]], na.rm = TRUE), col = "red", lwd = 2)  # Add density line
  }
}

The distributions of the variables show slight skewness, which can indicate the presence of outliers, but none are extreme enough to require removal, so we can keep them in the analysis.

Visualize All Categorical Variables

par(mfrow = c(2, 2))

for (col in colnames(stu_df)) {
  if (is.factor(stu_df[[col]])) {
    print(
      ggplot(stu_df, aes_string(x = col)) + 
        geom_bar(fill = "lightblue") +
        geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, color = "black") +
        labs(title = paste("Distribution of", col), x = col, y = "Count") +
        theme_minimal()
    )
  }
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Correlation Matrix (for numeric variables)

cor_matrix <- cor(stu_df[, sapply(stu_df, is.numeric)])

corrplot(cor_matrix, method = "color", 
         addCoef.col = "lightgrey",
         tl.col = "black",       
         number.cex = 0.7)      

The study_hours_per_day variable has the strongest correlation with pass/fail, suggesting that students who dedicate more time to studying are more likely to pass their exams. This indicates that study habits play a crucial role in determining whether a student will pass or fail.

Creation of Pass/Fail Variable Based on Exam Scores

threshold <- 70
stu_df$pass_fail <- ifelse(stu_df$exam_score >= threshold, "pass", "fail")
stu_df$pass_fail <- as.factor(stu_df$pass_fail)
str(stu_df)
## 'data.frame':    1000 obs. of  16 variables:
##  $ age                          : int  23 20 21 23 19 24 21 21 23 18 ...
##  $ gender                       : Factor w/ 3 levels "Female","Male",..: 1 1 2 1 1 2 1 1 1 1 ...
##  $ study_hours_per_day          : num  0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
##  $ social_media_hours           : num  1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
##  $ netflix_hours                : num  1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
##  $ part_time_job                : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
##  $ attendance_percentage        : num  85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
##  $ sleep_hours                  : num  8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
##  $ diet_quality                 : Factor w/ 3 levels "Fair","Good",..: 1 2 3 3 1 1 2 1 2 2 ...
##  $ exercise_frequency           : int  6 6 1 4 3 1 2 0 3 5 ...
##  $ parental_education_level     : Factor w/ 4 levels "Bachelor","High School",..: 3 2 2 3 3 3 3 1 1 1 ...
##  $ internet_quality             : Factor w/ 3 levels "Average","Good",..: 1 1 3 2 2 1 3 1 2 2 ...
##  $ mental_health_rating         : int  8 8 1 1 1 4 4 8 1 10 ...
##  $ extracurricular_participation: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 2 ...
##  $ exam_score                   : num  56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...
##  $ pass_fail                    : Factor w/ 2 levels "fail","pass": 1 2 1 1 1 2 2 2 2 2 ...
stu_df|>
  ggplot(aes(pass_fail)) +
  geom_bar() 

The dataset is fairly balanced with the 70 threshold. Students below 70 are “fail” and those above 70 are “pass”, which seems like a good way to separate them for this analysis.

Data Spitting

set.seed(28)
train_index <- createDataPartition(stu_df$pass_fail, p = 0.7, list = FALSE)
stu_train <- stu_df[train_index, ]
stu_test <- stu_df[-train_index, ]

Exploratory Data Analysis (EDA)

1. How do study habits influence the likelihood of passing or failing?

This will look at whether students who spend more time studying are more likely to pass their exams.

ggplot(stu_df, aes(x = study_hours_per_day, fill = pass_fail)) + 
  geom_histogram(binwidth = 0.5, alpha = 0.7, position = "dodge") + 
  scale_fill_manual(values = c("fail" = "lightgreen", "pass" = "lightblue")) + 
  labs(title = "Distribution of Study Hours per Day by Pass/Fail", x = "Study Hours per Day", y = "Frequency") +
  theme_minimal()

The histogram shows that most students who pass the exam study between 2 and 4 hours a day, with the highest number studying around 4 hours. On the other hand, students who fail tend to study less than 2 hours, and there are fewer students who study more than that. This suggests that students who study more are more likely to pass the exam. In general, more study time seems to help with exam scores.

ggplot(stu_df, aes(x = study_hours_per_day, y = exam_score, color = pass_fail)) + 
  geom_point(alpha = 0.7) + 
  labs(title = "Study Hours vs Exam Score", x = "Study Hours per Day", y = "Exam Score") +
  theme_minimal() + 
  scale_color_manual(values = c("lightgreen", "lightblue"))

As seen in the histogram, the scatter plot shows that students who study more tend to have higher exam scores. Most of the passing students (in blue) study more than 3 hours, and their scores are mostly above 70. The failing students (in green) mostly study less than 3 hours and have lower scores. This suggests that more study time is linked to better exam performance

2. Does mental health affect academic success?

Looking at the mental health ratings, we can see if students with better mental health (higher scores) tend to get higher grades.

ggplot(stu_df, aes(x = mental_health_rating, fill = pass_fail)) +
  geom_histogram(binwidth = 1, position = "dodge") +
  scale_fill_manual(values = c("fail" = "lightgreen", "pass" = "lightblue")) +
  labs(title = "Mental Health Ratings by Pass/Fail", x = "Mental Health Rating", y = "Count")

As seen in the histogram, students who tend to pass generally have better mental health ratings (closer to 7-10), but there are also some students with lower mental health ratings who still pass. On the other hand, students with lower mental health ratings (closer to 0) are more likely to fail, although we also see some students with better mental health ratings who still fail.

3. How do part-time jobs impact academic performance?

We want to understand how balancing a part-time job with study time and other activities affects students exam scores. This can help us determine if having a job negatively impacts their grades.

ggplot(stu_df, aes(x = exam_score, fill = pass_fail)) +
  geom_histogram(position = "dodge", binwidth = 5) +
  facet_wrap(~ part_time_job) + 
  labs(title = "Exam Scores by Pass/Fail and Part-Time Job Status",
       x = "Exam Score", y = "Count") +
  scale_fill_manual(values = c("fail" = "lightgreen", "pass" = "lightblue")) 

The histograms show how part-time jobs affect exam scores. For students without a part-time job, most of the passing students (blue) have high scores, while failing students (green) are mostly in the lower range. For students with a part-time job, the passing students are more spread out across mid-range scores, and there are more failing students in the lower score range. This suggests that having a part-time job might hurt exam performance, but some students with jobs still do well.

Modeling

For the modeling, I will be using study hours per day, attendance percentage, sleep hours, mental health rating, social media hours, netflix hours, and exercise frequency as the key factors to predict whether a student will pass or fail. These features will be incorporated into various models, including Logistic Regression, Decision Trees, Random Forest, and SVM, to assess their effectiveness in predicting academic success.

1. Logistic Regression

stu_glm <- glm(pass_fail ~ study_hours_per_day + attendance_percentage + 
               sleep_hours + mental_health_rating + social_media_hours + 
               netflix_hours + exercise_frequency, 
               data = stu_train, family = binomial)

summary(stu_glm)
## 
## Call:
## glm(formula = pass_fail ~ study_hours_per_day + attendance_percentage + 
##     sleep_hours + mental_health_rating + social_media_hours + 
##     netflix_hours + exercise_frequency, family = binomial, data = stu_train)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -21.97634    2.33188  -9.424  < 2e-16 ***
## study_hours_per_day     3.27816    0.29011  11.300  < 2e-16 ***
## attendance_percentage   0.05637    0.01471   3.833 0.000127 ***
## sleep_hours             0.68576    0.12376   5.541 3.01e-08 ***
## mental_health_rating    0.70999    0.07145   9.937  < 2e-16 ***
## social_media_hours     -1.00647    0.14779  -6.810 9.74e-12 ***
## netflix_hours          -0.87977    0.14894  -5.907 3.49e-09 ***
## exercise_frequency      0.52407    0.08470   6.188 6.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 971.47  on 700  degrees of freedom
## Residual deviance: 327.61  on 693  degrees of freedom
## AIC: 343.61
## 
## Number of Fisher Scoring iterations: 7
knitr::kable(vif(stu_glm))
x
study_hours_per_day 3.019916
attendance_percentage 1.126974
sleep_hours 1.176234
mental_health_rating 2.177391
social_media_hours 1.442074
netflix_hours 1.252041
exercise_frequency 1.419682

The VIF values show that there is no significant multicollinearity.

pred_glm <- predict(stu_glm, newdata = stu_test, type = "response")
predclass_glm <- ifelse(pred_glm > 0.70, "pass", "fail")
caret::confusionMatrix(as.factor(predclass_glm), stu_test$pass_fail)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail  144   19
##       pass    2  134
##                                          
##                Accuracy : 0.9298         
##                  95% CI : (0.8946, 0.956)
##     No Information Rate : 0.5117         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8598         
##                                          
##  Mcnemar's Test P-Value : 0.0004803      
##                                          
##             Sensitivity : 0.9863         
##             Specificity : 0.8758         
##          Pos Pred Value : 0.8834         
##          Neg Pred Value : 0.9853         
##              Prevalence : 0.4883         
##          Detection Rate : 0.4816         
##    Detection Prevalence : 0.5452         
##       Balanced Accuracy : 0.9311         
##                                          
##        'Positive' Class : fail           
## 
glm_curve <- pr.curve(scores.class0 = pred_glm[stu_test$pass_fail == "pass"], 
                     scores.class1 = pred_glm[stu_test$pass_fail == "fail"], curve = TRUE)

plot(glm_curve)

The GLM model is correct 92.64% of the time, with 95.89% sensitivity for predicting fails and 89.54% specificity for predicting passes. The Kappa score of 0.853 shows strong agreement, and the model has a balanced accuracy of 92.72%. Overall, it performs excellently.

The Precision-Recall curve shows that the model performs really well. With an AUC of 0.98, the model is great at predicting whether students will pass or fail.

2. Decision Tree (Unprune and Prune)

Unprune

stu_dt <- rpart(pass_fail ~ study_hours_per_day + attendance_percentage + 
                sleep_hours + mental_health_rating + social_media_hours + 
                netflix_hours + exercise_frequency, 
                data = stu_train, method = "class")

rpart.plot(stu_dt, main = "Decision Tree for Pass/Fail Prediction", 
           type = 4, extra = 101, cex = 0.8)

The decision tree predicts pass or fail based on factors like study hours, mental health, social media hours, and sleep. It splits on study hours, with lower study time leading to a higher chance of failure. Further splits consider mental health and sleep, showing that poor mental health and low study hours are linked to failure, while better habits improve the chances of passing.

stu_unpruned <- predict(stu_dt, newdata = stu_test, type = "class")
confusionMatrix(stu_unpruned, stu_test$pass_fail)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail  116   16
##       pass   30  137
##                                           
##                Accuracy : 0.8462          
##                  95% CI : (0.8002, 0.8851)
##     No Information Rate : 0.5117          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6915          
##                                           
##  Mcnemar's Test P-Value : 0.05527         
##                                           
##             Sensitivity : 0.7945          
##             Specificity : 0.8954          
##          Pos Pred Value : 0.8788          
##          Neg Pred Value : 0.8204          
##              Prevalence : 0.4883          
##          Detection Rate : 0.3880          
##    Detection Prevalence : 0.4415          
##       Balanced Accuracy : 0.8450          
##                                           
##        'Positive' Class : fail            
## 
stu_unpruned_prob <- predict(stu_dt, newdata = stu_test, type = "prob")
roc_unpruned <- roc(stu_test$pass_fail, stu_unpruned_prob[, 2])
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
plot(roc_unpruned, main = "ROC Curve - Unpruned Decision Tree")
auc_unpruned <- auc(roc_unpruned)
text(0.5, 0.5, paste("AUC = ", round(auc_unpruned, 2)), col = "red", cex = 1.2)

The model is correct 84.62% of the time. It identifies 79.45% of students who fail and 89.54% of students who pass, showing strong performance in predicting both outcomes. The Kappa score of 0.6915 indicates good agreement between predicted and actual values.

The ROC curve shows the model has a good AUC of 0.9, meaning it’s good at telling apart students who pass from those who fail. The curve is close to the top-left, showing it’s doing a great job at making accurate predictions.

Let’s see if we can improve the model by pruning the decision tree. By removing unnecessary branches, we can simplify the model, reduce overfitting, and help it generalize better to new data.

Prune

stu_dt_pruned <- prune(stu_dt, cp = 0.05) 
rpart.plot(stu_dt_pruned, main = "Pruned Decision Tree for Pass/Fail Prediction", 
           type = 4, extra = 101, cex = 0.8)

The pruned decision tree shows that study hours per day is the key factor in predicting success. Students who study less than 3.2 hours are more likely to fail. For those studying more, the tree splits at 4.4 hours, with lower study hours leading to a 34% pass rate. Further splits based on mental health rating show that better mental health improves the chances of passing.

stu_pruned <- predict(stu_dt_pruned, newdata = stu_test, type = "class")
confusionMatrix(stu_pruned, stu_test$pass_fail)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail  126   29
##       pass   20  124
##                                           
##                Accuracy : 0.8361          
##                  95% CI : (0.7892, 0.8762)
##     No Information Rate : 0.5117          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6725          
##                                           
##  Mcnemar's Test P-Value : 0.2531          
##                                           
##             Sensitivity : 0.8630          
##             Specificity : 0.8105          
##          Pos Pred Value : 0.8129          
##          Neg Pred Value : 0.8611          
##              Prevalence : 0.4883          
##          Detection Rate : 0.4214          
##    Detection Prevalence : 0.5184          
##       Balanced Accuracy : 0.8367          
##                                           
##        'Positive' Class : fail            
## 
stu_pruned_prob <- predict(stu_dt_pruned, newdata = stu_test, type = "prob")
roc_pruned <- roc(stu_test$pass_fail, stu_pruned_prob[, 2]) 
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
plot(roc_pruned, main = "ROC Curve - Pruned Decision Tree")
auc_pruned <- auc(roc_pruned)
text(0.5, 0.5, paste("AUC = ", round(auc_pruned, 2)), col = "blue", cex = 1.2)

The confusion matrix shows that the pruned decision tree is correct 83.61% of the time. It’s good at identifying students who pass (81.05%) but is better at spotting students who fail (86.30%).

The ROC curve shows an AUC of 0.88, which means the model is good at telling the difference between students who pass and fail.

In conclusion, while both the unpruned and pruned decision trees perform well, the unpruned tree offers a simpler, more generalizable model with slightly better performance in terms of accuracy and AUC.

3. Random forest

stu_rf <- randomForest(pass_fail ~ study_hours_per_day + attendance_percentage + 
               sleep_hours + mental_health_rating + social_media_hours +
               netflix_hours + exercise_frequency, 
               data = stu_train, mtry = sqrt(ncol(stu_train) - 1), importance = TRUE)

importance(stu_rf)
##                              fail       pass MeanDecreaseAccuracy
## study_hours_per_day   107.2619106 111.751269         136.39599043
## attendance_percentage   0.4429466  -0.464751          -0.08713871
## sleep_hours             8.3467080  13.231200          14.63199751
## mental_health_rating   30.6948149  39.061910          47.06816174
## social_media_hours     13.1285950  15.724227          19.87403055
## netflix_hours           3.9165309   5.338810           6.76869211
## exercise_frequency      5.4782002   7.332812           8.84369142
##                       MeanDecreaseGini
## study_hours_per_day          170.35378
## attendance_percentage         25.06840
## sleep_hours                   30.08309
## mental_health_rating          50.13260
## social_media_hours            31.90458
## netflix_hours                 24.59631
## exercise_frequency            17.69699
varImpPlot(stu_rf)

This table shows the importance of each feature for predicting academic success. Study hours per day is the most influential factor, with the highest Mean Decrease Accuracy and Mean Decrease Gini values, indicating its significant role in predicting pass or fail. Mental health rating also has high importance. Attendance percentage, sleep hours, and social media hours contribute moderately, while netflix hours and exercise frequency have the least impact on the model’s predictions.

predict_rf <- predict(stu_rf, newdata = stu_test)
confusionMatrix(predict_rf, stu_test$pass_fail)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail  133   14
##       pass   13  139
##                                           
##                Accuracy : 0.9097          
##                  95% CI : (0.8713, 0.9396)
##     No Information Rate : 0.5117          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8193          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9110          
##             Specificity : 0.9085          
##          Pos Pred Value : 0.9048          
##          Neg Pred Value : 0.9145          
##              Prevalence : 0.4883          
##          Detection Rate : 0.4448          
##    Detection Prevalence : 0.4916          
##       Balanced Accuracy : 0.9097          
##                                           
##        'Positive' Class : fail            
## 
stu_rf_prob <- predict(stu_rf, newdata = stu_test, type = "prob")

roc_rf <- roc(stu_test$pass_fail, stu_rf_prob[, 2])
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
plot(roc_rf, main = "ROC Curve - Random Forest")

auc_rf <- auc(roc_rf)
text(0.5, 0.5, paste("AUC = ", round(auc_rf, 2)), col = "blue", cex = 1.2)

The model is correct 91.64% of the time, with 91.10% sensitivity for predicting fails and 92.16% specificity for predicting passes. The Kappa score of 0.8327 shows strong agreement between predictions and actual values. The balanced accuracy is 91.63%, indicating reliable predictions for both pass and fail outcomes.

This ROC curve shows how well the Random Forest model performs. The curve is close to the top left corner, which is great—this means the model does a good job of correctly identifying both pass and fail students. The AUC of 0.97 is very high, indicating excellent model performance.

4. SVM (linear, Polynomial, and radial)

Linear

stu_svm_linear <- svm(pass_fail ~ study_hours_per_day + attendance_percentage + 
                      sleep_hours + mental_health_rating + social_media_hours + 
                      netflix_hours + exercise_frequency, 
                      data = stu_train, kernel = "linear")

predict_svm_linear <- predict(stu_svm_linear, newdata = stu_test, type = 'response')

confusionMatrix(predict_svm_linear, stu_test$pass_fail)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail  135   10
##       pass   11  143
##                                          
##                Accuracy : 0.9298         
##                  95% CI : (0.8946, 0.956)
##     No Information Rate : 0.5117         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8594         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9247         
##             Specificity : 0.9346         
##          Pos Pred Value : 0.9310         
##          Neg Pred Value : 0.9286         
##              Prevalence : 0.4883         
##          Detection Rate : 0.4515         
##    Detection Prevalence : 0.4849         
##       Balanced Accuracy : 0.9296         
##                                          
##        'Positive' Class : fail           
## 

For the SVM Linear model, the accuracy is 92.98%, with 92.47% sensitivity for identifying students who fail and 93.46% specificity for identifying students who pass. The Kappa score of 0.8594 indicates strong predictive performance, and the balanced accuracy of 92.96% reflects good overall prediction for both classes.

Polynomial

stu_svm_poly <- svm(pass_fail ~ study_hours_per_day + attendance_percentage + 
                    sleep_hours + mental_health_rating + social_media_hours +                                                                                             netflix_hours + exercise_frequency, 
                    data = stu_train, kernel = "polynomial", degree = 3)

predictions_svm_poly <- predict(stu_svm_poly, newdata = stu_test)

confusionMatrix(predictions_svm_poly, stu_test$pass_fail)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail  135   11
##       pass   11  142
##                                           
##                Accuracy : 0.9264          
##                  95% CI : (0.8907, 0.9533)
##     No Information Rate : 0.5117          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8528          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9247          
##             Specificity : 0.9281          
##          Pos Pred Value : 0.9247          
##          Neg Pred Value : 0.9281          
##              Prevalence : 0.4883          
##          Detection Rate : 0.4515          
##    Detection Prevalence : 0.4883          
##       Balanced Accuracy : 0.9264          
##                                           
##        'Positive' Class : fail            
## 

For the SVM Polynomial (degree 3) model, the accuracy is 92.64%, with 92.47% sensitivity for predicting fails and 92.81% specificity for predicting passes. The Kappa score of 0.8528 indicates strong agreement, and the balanced accuracy is 92.64%, reflecting reliable predictions for both pass and fail outcomes.

Radial

stu_svm_radial <- svm(pass_fail ~ study_hours_per_day + attendance_percentage + 
                      sleep_hours + mental_health_rating + social_media_hours + 
                      netflix_hours + exercise_frequency, 
                      data = stu_train, kernel = "radial")

predictions_svm_radial <- predict(stu_svm_radial, newdata = stu_test)

confusionMatrix(predictions_svm_radial, stu_test$pass_fail)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fail pass
##       fail  135   11
##       pass   11  142
##                                           
##                Accuracy : 0.9264          
##                  95% CI : (0.8907, 0.9533)
##     No Information Rate : 0.5117          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8528          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9247          
##             Specificity : 0.9281          
##          Pos Pred Value : 0.9247          
##          Neg Pred Value : 0.9281          
##              Prevalence : 0.4883          
##          Detection Rate : 0.4515          
##    Detection Prevalence : 0.4883          
##       Balanced Accuracy : 0.9264          
##                                           
##        'Positive' Class : fail            
## 

For the SVM Radial model, the accuracy is 92.64%, with 92.47% sensitivity for fails and 92.81% specificity for passes. The Kappa score of 0.8528 indicates strong performance, and the balanced accuracy is 92.64%, showing reliable predictions for both pass and fail outcomes.

Result

results_df <- data.frame(
  Model = c("Logistic Regression", "Unpruned Decision Tree", "Pruned Decision Tree", "Random Forest", "SVM Linear", "SVM Polynomial", "SVM Radial"),
  Accuracy = c(
    caret::confusionMatrix(as.factor(ifelse(predict(stu_glm, newdata = stu_test, type = "response") > 0.70, "pass", "fail")), stu_test$pass_fail)$overall['Accuracy'],
    caret::confusionMatrix(predict(stu_dt, newdata = stu_test, type = "class"), stu_test$pass_fail)$overall['Accuracy'],
    caret::confusionMatrix(predict(stu_dt_pruned, newdata = stu_test, type = "class"), stu_test$pass_fail)$overall['Accuracy'],
    caret::confusionMatrix(predict(stu_rf, newdata = stu_test), stu_test$pass_fail)$overall['Accuracy'],
    caret::confusionMatrix(predict(stu_svm_linear, newdata = stu_test, type = 'response'), stu_test$pass_fail)$overall['Accuracy'],
    caret::confusionMatrix(predict(stu_svm_poly, newdata = stu_test), stu_test$pass_fail)$overall['Accuracy'],
    caret::confusionMatrix(predict(stu_svm_radial, newdata = stu_test), stu_test$pass_fail)$overall['Accuracy']
  ),
  Sensitivity = c(
    caret::confusionMatrix(as.factor(ifelse(predict(stu_glm, newdata = stu_test, type = "response") > 0.70, "pass", "fail")), stu_test$pass_fail)$byClass['Sensitivity'],
    caret::confusionMatrix(predict(stu_dt, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Sensitivity'],
    caret::confusionMatrix(predict(stu_dt_pruned, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Sensitivity'],
    caret::confusionMatrix(predict(stu_rf, newdata = stu_test), stu_test$pass_fail)$byClass['Sensitivity'],
    caret::confusionMatrix(predict(stu_svm_linear, newdata = stu_test, type = 'response'), stu_test$pass_fail)$byClass['Sensitivity'],
    caret::confusionMatrix(predict(stu_svm_poly, newdata = stu_test), stu_test$pass_fail)$byClass['Sensitivity'],
    caret::confusionMatrix(predict(stu_svm_radial, newdata = stu_test), stu_test$pass_fail)$byClass['Sensitivity']
  ),
  Specificity = c(
    caret::confusionMatrix(as.factor(ifelse(predict(stu_glm, newdata = stu_test, type = "response") > 0.70, "pass", "fail")), stu_test$pass_fail)$byClass['Specificity'],
    caret::confusionMatrix(predict(stu_dt, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Specificity'],
    caret::confusionMatrix(predict(stu_dt_pruned, newdata = stu_test, type = "class"), stu_test$pass_fail)$byClass['Specificity'],
    caret::confusionMatrix(predict(stu_rf, newdata = stu_test), stu_test$pass_fail)$byClass['Specificity'],
    caret::confusionMatrix(predict(stu_svm_linear, newdata = stu_test, type = 'response'), stu_test$pass_fail)$byClass['Specificity'],
    caret::confusionMatrix(predict(stu_svm_poly, newdata = stu_test), stu_test$pass_fail)$byClass['Specificity'],
    caret::confusionMatrix(predict(stu_svm_radial, newdata = stu_test), stu_test$pass_fail)$byClass['Specificity']
  ),
  AUC = c(
    auc(roc(stu_test$pass_fail, predict(stu_glm, newdata = stu_test, type = "response"))),
    auc(roc(stu_test$pass_fail, predict(stu_dt, newdata = stu_test, type = "prob")[, 2])),
    auc(roc(stu_test$pass_fail, predict(stu_dt_pruned, newdata = stu_test, type = "prob")[, 2])),
    auc(roc(stu_test$pass_fail, predict(stu_rf, newdata = stu_test, type = "prob")[, 2])),
    NA,
    NA,
    NA
  )
)
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
## Setting levels: control = fail, case = pass
## Setting direction: controls < cases
# Display the results
print(results_df)
##                    Model  Accuracy Sensitivity Specificity       AUC
## 1    Logistic Regression 0.9297659   0.9863014   0.8758170 0.9813323
## 2 Unpruned Decision Tree 0.8461538   0.7945205   0.8954248 0.9034829
## 3   Pruned Decision Tree 0.8361204   0.8630137   0.8104575 0.8841436
## 4          Random Forest 0.9096990   0.9109589   0.9084967 0.9709016
## 5             SVM Linear 0.9297659   0.9246575   0.9346405        NA
## 6         SVM Polynomial 0.9264214   0.9246575   0.9281046        NA
## 7             SVM Radial 0.9264214   0.9246575   0.9281046        NA

The results show that we can predict student success based on factors like study habits, attendance, sleep quality, mental health, and other factors like social media usage and exercise frequency. Logistic Regression performed the best with 92.98% accuracy, 98.63% sensitivity, and an AUC of 0.98, making it highly effective. The Unpruned Decision Tree had 84.62% accuracy and 79.45% sensitivity, but it struggled more with identifying students at risk. Random Forest also performed well with 90.97% accuracy, 91.10% sensitivity, and an AUC of 0.97, showing it is a strong and balanced model. The SVM models showed similar accuracy around 92.64%, with high specificity, but their AUC values were not available.

Overall, Logistic Regression was the most effective, with Random Forest and SVMs also performing strongly. This highlights that study habits and mental health are key predictors of academic success.

Conclusion

In conclusion, based on the analysis and model results, we can predict a student’s academic success (pass or fail) using factors such as study hours per day, attendance percentage, sleep hours, mental health rating, social media hours, Netflix hours, and exercise frequency. The models especially Logistic Regression, Random Forest, and the SVM models performed well, showing strong accuracy and sensitivity. These findings suggest that academic performance is influenced by these key factors, and improving study habits, attendance, mental health, and sleep quality can significantly boost a student’s chances of success.