1 Introduction

This dataset is titled “Student Habits vs Academic Performance” and shows various explanatory variables and one target/response variable (exam_score). The purpose of collecting this dataset is to see what factors may affect exam scores. This dataset was collected from Kaggle. This dataset contains 1000 observations and 16 variables (15 feature variables and 1 target variable).

The student_id serves as the identifier. The age (numerical) shows each persons age. The gender (categorical) shows each persons gender. The study_hours_per_day shows hours studied per day. The social_media_hours shows hours used daily on social media. The netflix_hours shows hours used daily on netflix. The part_time_job (no/yes) shows if the person has a part time job. The attendance percentage shows the percentage of classes attended. The sleep_hours shows amount of sleep per night. The diet_quality (poor/fair/good) shows quality of ones diet. The exercise_frequency shows how many times one exercises per week. The parental_education_level (none/High School/bachelor/master) shows the highest level of education of ones parents. The internet_quality(poor/avg/good) shows how good ones wi-fi is. The mental_health_rating (1-10) shows how one rates their mental health. The extracurricular_participation (n/y) shows if one participates in extracurricular activities. The target/response variable is exam_score (continuous).

Based on this dataset, we can formulate a research question. The question we can form is what variables are the most important in predicting exam score. For regression, we can use exam_score (continuous) as our response variable. For classification, we can create a binary response variable with 1 (pass) and 0 (fail). We can use the condition if exam score is greater or less than 0.60.

We will use CART, bagging, and random forests to try to answer the research question.

2 Read in Dataset

Read in dataset.

habits <- read.csv("student_habits_performance.csv")

See dataset structure.

str(habits)
'data.frame':   1000 obs. of  16 variables:
 $ student_id                   : chr  "S1000" "S1001" "S1002" "S1003" ...
 $ age                          : int  23 20 21 23 19 24 21 21 23 18 ...
 $ gender                       : chr  "Female" "Female" "Male" "Female" ...
 $ study_hours_per_day          : num  0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
 $ social_media_hours           : num  1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
 $ netflix_hours                : num  1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
 $ part_time_job                : chr  "No" "No" "No" "No" ...
 $ attendance_percentage        : num  85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
 $ sleep_hours                  : num  8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
 $ diet_quality                 : chr  "Fair" "Good" "Poor" "Poor" ...
 $ exercise_frequency           : int  6 6 1 4 3 1 2 0 3 5 ...
 $ parental_education_level     : chr  "Master" "High School" "High School" "Master" ...
 $ internet_quality             : chr  "Average" "Average" "Poor" "Good" ...
 $ mental_health_rating         : int  8 8 1 1 1 4 4 8 1 10 ...
 $ extracurricular_participation: chr  "Yes" "No" "No" "Yes" ...
 $ exam_score                   : num  56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...

Check for Missing Values.

colSums(is.na(habits))
                   student_id                           age 
                            0                             0 
                       gender           study_hours_per_day 
                            0                             0 
           social_media_hours                 netflix_hours 
                            0                             0 
                part_time_job         attendance_percentage 
                            0                             0 
                  sleep_hours                  diet_quality 
                            0                             0 
           exercise_frequency      parental_education_level 
                            0                             0 
             internet_quality          mental_health_rating 
                            0                             0 
extracurricular_participation                    exam_score 
                            0                             0 

3 EDA

This section will show the distribution of each individual feature.

The below figure shows the distribution of the gender variable.

ggplot(habits, aes(x = gender)) + 
  
  geom_bar() +
  labs(title = "Gender")

The below figure shows the distribution of the part_time_job variable.

ggplot(habits, aes(x = part_time_job)) + 
  
  geom_bar() +
  labs(title = "part_time_job")

The below figure shows the distribution of the diet_quality variable.

ggplot(habits, aes(x = diet_quality)) + 
  
  geom_bar() +
  labs(title = "diet_quality")

The below figure shows the distribution of the parental_education_level variable.

ggplot(habits, aes(x = parental_education_level)) + 
  
  geom_bar() +
  labs(title = "parental_education_level")

The below figure shows the distribution of the internet_quality variable.

ggplot(habits, aes(x = internet_quality)) + 
  
  geom_bar() +
  labs(title = "internet_quality")

The below figure shows the distribution of the extracurricular_participation variable.

ggplot(habits, aes(x = extracurricular_participation)) + 
  
  geom_bar() +
  labs(title = "extracurricular_participation")

The below figure shows the distribution of the age variable.

ggplot(data = habits, aes(x = age)) + 
  geom_boxplot() + 
  
  labs(title = "age")

The below figure shows the distribution of the study_hours_per_day variable.

ggplot(data = habits, aes(x = study_hours_per_day)) + 
  geom_boxplot() + 
  
  labs(title = "study_hours_per_day")

The below figure shows the distribution of the social_media_hours variable.

ggplot(data = habits, aes(x = social_media_hours)) + 
  geom_boxplot() + 
  
  labs(title = "social_media_hours")

The below figure shows the distribution of the netflix_hours variable.

ggplot(data = habits, aes(x = netflix_hours)) + 
  geom_boxplot() + 
  
  labs(title = "netflix_hours")

The below figure shows the distribution of the attendance_percentage variable.

ggplot(data = habits, aes(x = attendance_percentage)) + 
  geom_boxplot() + 
  
  labs(title = "attendance_percentage")

The below figure shows the distribution of the sleep_hours variable.

ggplot(data = habits, aes(x = sleep_hours)) + 
  geom_boxplot() + 
  
  labs(title = "sleep_hours")

The below figure shows the distribution of the exercise_frequency variable.

ggplot(data = habits, aes(x = exercise_frequency)) + 
  geom_boxplot() + 
  
  labs(title = "exercise_frequency")

The below figure shows the distribution of the mental_health_rating variable.

ggplot(data = habits, aes(x = mental_health_rating)) + 
  geom_boxplot() + 
  
  labs(title = "mental_health_rating")

4 Imputation/Feature Engineering

The variables part_time_job, diet_quality, internet_quality, and extracurricular_participation are converted to categorical/factor.

#habits$gender <- as.factor(habits$gender)
habits$part_time_job <- as.factor(habits$part_time_job)
habits$diet_quality <- factor(habits$diet_quality,levels = c("Poor", "Fair", "Good"))
#habits$parental_education_level <- as.factor(habits$parental_education_level)
habits$internet_quality <- factor(habits$internet_quality, levels = c("Poor", "Average", "Good"))
habits$extracurricular_participation <- as.factor(habits$extracurricular_participation)

The variable gender contains the value ‘Other’. Since we only want to work with Male/female, we set ‘Other’ to missing and use MICE imputation.

habits$gender[habits$gender== "Other"] = NA
habits$gender <- as.factor(habits$gender)

init2 <- mice(habits, maxit = 0)
init2$method
                   student_id                           age 
                           ""                            "" 
                       gender           study_hours_per_day 
                     "logreg"                            "" 
           social_media_hours                 netflix_hours 
                           ""                            "" 
                part_time_job         attendance_percentage 
                           ""                            "" 
                  sleep_hours                  diet_quality 
                           ""                            "" 
           exercise_frequency      parental_education_level 
                           ""                            "" 
             internet_quality          mental_health_rating 
                           ""                            "" 
extracurricular_participation                    exam_score 
                           ""                            "" 
imp2 <- mice(habits, method = c("","", "logreg", "",  "", "", "", "", "", "", "", "", "", "", "", ""), 
                 maxit = 10,  
                 m = 5, 
                 seed=123,
                 print=F)     



complete_habits_data <- complete(imp2)

For parental_education_level, we convert both Bachelor and Master to College to reduce categories.

complete_habits_data$parental_education_level[complete_habits_data$parental_education_level == 'Bachelor'] <- 'College'
complete_habits_data$parental_education_level[complete_habits_data$parental_education_level == 'Master'] <- 'College'

complete_habits_data$parental_education_level <- factor(complete_habits_data$parental_education_level,levels = c("None", "High School", "College"))

str(complete_habits_data)
'data.frame':   1000 obs. of  16 variables:
 $ student_id                   : chr  "S1000" "S1001" "S1002" "S1003" ...
 $ age                          : int  23 20 21 23 19 24 21 21 23 18 ...
 $ gender                       : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 1 1 1 1 ...
 $ study_hours_per_day          : num  0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
 $ social_media_hours           : num  1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
 $ netflix_hours                : num  1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
 $ part_time_job                : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
 $ attendance_percentage        : num  85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
 $ sleep_hours                  : num  8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
 $ diet_quality                 : Factor w/ 3 levels "Poor","Fair",..: 2 3 1 1 2 2 3 2 3 3 ...
 $ exercise_frequency           : int  6 6 1 4 3 1 2 0 3 5 ...
 $ parental_education_level     : Factor w/ 3 levels "None","High School",..: 3 2 2 3 3 3 3 3 3 3 ...
 $ internet_quality             : Factor w/ 3 levels "Poor","Average",..: 2 2 1 3 3 2 1 2 3 3 ...
 $ mental_health_rating         : int  8 8 1 1 1 4 4 8 1 10 ...
 $ extracurricular_participation: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 2 ...
 $ exam_score                   : num  56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...

5 CART - Regression

Decision trees recursively splits the data into subsets based on the value of a feature, with the goal of creating subsets that are as homogeneous as possible with respect to the target variable.

In regression trees, we can predict continuous outcomes by choosing splits to minimize the MSE of the target variable.

As shown below, we first build the initial regression tree to get a general sense of all the predictors and their splits.

# Set seed for reproducibility
set.seed(123)

# Split data into training (70%) and test (30%) sets
train.index3 <- sample(1:nrow(complete_habits_data), size = 0.8 * nrow(complete_habits_data))
train.data3 <- complete_habits_data[train.index3, ]
test.data3 <- complete_habits_data[-train.index3, ]

# 1. Tree Induction & 2. Splitting Criteria
# Build the initial regression tree using rpart
tree.model <- rpart(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data3,
                    method = "anova",     # For regression
                    control = rpart.control(
                      minsplit = 20,    # 3. Stopping rule: min observations to split
                      minbucket = 7,    # Min observations in terminal node
                      cp = seq(0, 0.05, 20), # Complexity parameter
                      maxdepth = 5      # Maximum tree depth
                    ))

# Visualize the unpruned tree
rpart.plot(tree.model, main = "Initial Regression Tree")

Next, we would want to prune the initial tree. Below shows a table of all the CP, nsplit, rel error, xerror, and xstd values.

#summary(tree.model)
pander(tree.model$cptable)
CP nsplit rel error xerror xstd
0.4477 0 1 1.003 0.04526
0.09511 1 0.5523 0.5551 0.02378
0.07915 2 0.4572 0.4788 0.02219
0.03897 3 0.378 0.4081 0.01805
0.02907 4 0.339 0.3724 0.01721
0.01974 5 0.31 0.3448 0.01638
0.01766 6 0.2902 0.3279 0.01603
0.01426 7 0.2726 0.3168 0.01575
0.009995 8 0.2583 0.3053 0.01527
0.009898 9 0.2483 0.3007 0.01514
0.008774 10 0.2384 0.3008 0.01528
0.006461 11 0.2297 0.2942 0.01484
0.005783 12 0.2232 0.2837 0.01412
0.005088 13 0.2174 0.2837 0.0143
0.004425 14 0.2123 0.28 0.01407
0.004335 15 0.2079 0.2805 0.01415
0.004165 16 0.2036 0.2808 0.01422
0.004046 17 0.1994 0.281 0.01422
0.003626 18 0.1953 0.2794 0.01406
0.003483 19 0.1917 0.2774 0.01395
0.002993 20 0.1882 0.2764 0.01388
0.002847 21 0.1852 0.2762 0.01382
0.002724 22 0.1824 0.2756 0.01371
0.002333 23 0.1797 0.2716 0.01328
0.0009291 24 0.1773 0.2705 0.01297
0.0006958 25 0.1764 0.2723 0.01294
6.369e-05 27 0.175 0.2724 0.01295
0 28 0.175 0.2723 0.01294

Below shows a plot of how cp changes as tree size increases. This can help with selecting cp values for pruning.

plotcp(tree.model)

For pruning, we select both best CP and min CP. Best CP is the largest CP when the xerror is within 1 standard error of the minimum. The min CP is simply the smallest CP. Best CP is preferred because it can give us the simplest tree as well as retaining accuracy.

cp.table <- tree.model$cptable

## Identify the minimum `xerror` and its `cp`.
min.xerror <- min(cp.table[, "xerror"])
min.cp.row <- which.min(cp.table[, "xerror"])
min.cp <- cp.table[min.cp.row, "CP"]

## Get the standard error (`xstd`) of the minimum `xerror`
xerror.std <- cp.table[min.cp.row, "xstd"]
threshold <- min.xerror + xerror.std  # Upper bound (1 SE rule)

## Find the simplest tree (`cp`) Where `xerror less than or equal to Threshold`.
best.cp.row <- which(cp.table[, "xerror"] <= threshold)[1]  # First row meeting criteria
best.cp <- cp.table[best.cp.row, "CP"]

## Two different trees: best CP vs minimum CP
pruned.tree.best.cp <- prune(tree.model, cp = best.cp)
pruned.tree.min.cp <- prune(tree.model, cp = min.cp)

# Visualize the pruned tree: best CP
rpart.plot(pruned.tree.best.cp, main = paste("Pruned Tree (Best CP): cp = ", round(best.cp,4)))

The above shows the tree with the best CP of 0.0044.

The below shows the tree with the min CP of 0.0009.

rpart.plot(pruned.tree.min.cp, main = paste("Pruned Tree (Minimum CP): cp = ", round(min.cp,4)))

Below we make predictions on test data using the best CP tree, min CP tree, OLS regression using features from the best CP tree (study_hours_per_day, mental_health_rating, netflix_hours, exercise_frequency, sleep_hours), and OLS regression using stepwise selection.

As shown in the output below, it seems like the OLS stepwise did the best with MSE of 29.19.

# 5. Prediction
# Make predictions on test data
pred.best.cp <- predict(pruned.tree.best.cp, newdata = test.data3)
pred.min.cp <- predict(pruned.tree.min.cp, newdata = test.data3)


# Evaluate model performance: best.cp
mse.tree.best.cp <- mean((test.data3$exam_score - pred.best.cp)^2)
rmse.tree.best.cp <- sqrt(mse.tree.best.cp)
r.squared.tree.best.cp <- cor(test.data3$exam_score, pred.best.cp)^2
# min.cp
mse.tree.min.cp <- mean((test.data3$exam_score - pred.min.cp)^2)
rmse.tree.min.cp <- sqrt(mse.tree.min.cp)
r.squared.tree.min.cp <- cor(test.data3$exam_score, pred.min.cp)^2

##
# fit ordinary least square regression 
LSE01 <- lm(exam_score ~ study_hours_per_day + mental_health_rating + netflix_hours + exercise_frequency + sleep_hours, data = train.data3)
pred.lse01 <-  predict(LSE01, newdata = test.data3)
mse.lse01 <- mean((test.data3$exam_score - pred.lse01)^2)
rmse.lse01 <- sqrt(mse.lse01)
r.squared.lse01 <- cor(test.data3$exam_score, pred.lse01)^2

##
## ordinary LSE regression model with step-wise variable selection
lse02.fit <- lm(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data3)
AIC.fit <- stepAIC(lse02.fit, direction="both", trace = FALSE)
pred.lse02 <- predict(AIC.fit, test.data3)
mse.lse02 <- mean((test.data3$exam_score - pred.lse02)^2)    # mean square error
rmse.lse02 <- sqrt(mse.lse02)                       # root mean square error
r.squared.lse02 <- (cor(test.data3$exam_score, pred.lse02))^2 # r-squared

###
Errors <- cbind(MSE = c(mse.tree.best.cp, mse.tree.min.cp, mse.lse01, mse.lse02),
                RMSE = c(rmse.tree.best.cp, rmse.tree.min.cp, rmse.lse01, rmse.lse02),
                r.squared = c(r.squared.tree.best.cp, r.squared.tree.min.cp, r.squared.lse01, r.squared.lse02))
rownames(Errors) = c("tree.best.cp", "tree.min.cp", "lse01", "lse02")
pander(Errors)
  MSE RMSE r.squared
tree.best.cp 82.23 9.068 0.7488
tree.min.cp 80.59 8.977 0.7521
lse01 45.05 6.712 0.8654
lse02 29.19 5.402 0.9143

Below we show what variables are most important in predicting exam score based on the best CP tree. Based on the below chart, we see that study hours, mental health, netflix hours, exercise frequency, and sleep hours are the most important variables in predicting exam score.

# Variable importance
importance <- pruned.tree.best.cp$variable.importance
barplot(sort(importance, decreasing = TRUE), 
        main = "Variable Importance: Best CP",
        las = 2)

Below we show what variables are most important in predicting exam score based on the min CP tree. Based on the below chart, we see that study hours, mental health, netflix hours, social media hours, and exercise freq are the most important variables in predicting exam score.

# Variable importance
importance2 <- pruned.tree.min.cp$variable.importance
barplot(sort(importance2, decreasing = TRUE), 
        main = "Variable Importance: Minimum CP",
        las = 2)

6 CART - Classification

In this section, we do classification trees. We can use them to predict categorical outcomes by using Gini impurity and entropy.

First, we need to create a binary response variable since our original response is continuous. Since exam scores over 60 is generally considered passing, we create ‘pass’ where 1 is pass and 0 is fail.

Then we create the initial tree, as shown below.

complete_habits_data <- transform(complete_habits_data, pass=ifelse(exam_score >= 60, 1, 0))

set.seed(123)
train.index4 <- createDataPartition(complete_habits_data$pass, p = 0.8, list = FALSE)
train.data4 <- complete_habits_data[train.index4, ]
test.data4 <- complete_habits_data[-train.index4, ]

train.data4$pass <- factor(train.data4$pass)
test.data4$pass <- factor(test.data4$pass)

# Build the initial classification tree
tree.model2 <- rpart(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data4,
                    method = "class",   # classification tree
                    parms = list(split = "gini",  # Using Gini index
                                 # FN cost = 1, FP cost = 0.5
                                 loss = matrix(c(0, 0.5, 1, 0), nrow = 2)  
                                 ),
                    control = rpart.control(minsplit = 15,  # Min 15 obs to split
                                           minbucket = 5,   # Min 7 obs in leaf
                                           # Complexity parameter
                                           cp = 0.001, # complex parameter
                                           maxdepth = 5))   # Max tree depth


rpart.plot(tree.model2, 
           extra = 104, # check the help document for more information
           # color palette is a sequential color scheme that blends green (G) to blue (Bu)
           box.palette = "GnBu",  
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE)

Below shows the cp table to help with the pruning process.

pander(tree.model2$cptable)
CP nsplit rel error xerror xstd
0.5982 0 1 0.5 0.02835
0.04353 1 0.4018 0.6049 0.04515
0.01786 3 0.3147 0.3103 0.03132
0.01562 7 0.2433 0.3839 0.03579
0.01004 8 0.2277 0.3683 0.03592
0.002976 10 0.2076 0.3728 0.03628
0.002232 13 0.1987 0.3772 0.03608
0.001 15 0.1942 0.3884 0.03655

Below shows the cp plot.

plotcp(tree.model2)

Below, we get the best CP and the min CP values.

cp.table2 <- tree.model2$cptable
#min.cp2 <- tree.model2$cptable[which.min(tree.model2$cptable[,"xerror"]),"CP"]
## Identify the minimum `xerror` and its `cp`.
min.xerror2 <- min(cp.table2[, "xerror"])
min.cp.row2 <- which.min(cp.table2[, "xerror"])
min.cp2 <- cp.table2[min.cp.row2, "CP"]

## Get the standard error (`xstd`) of the minimum `xerror`
xerror.std2 <- cp.table2[min.cp.row2, "xstd"]
threshold2 <- min.xerror2 + xerror.std2  # Upper bound (1 SE rule)

## Find the simplest tree (`cp`) Where `xerror less than or equal to Threshold`.
best.cp.row2 <- which(cp.table2[, "xerror"] <= threshold2)[1]  # First row meeting criteria
best.cp2 <- cp.table2[best.cp.row2, "CP"]

## Two different trees: best CP vs minimum CP
pruned.tree.1SE <- prune(tree.model2, cp = best.cp2,main = paste("Pruned Tree (Best CP): cp = ", round(best.cp2,4)))
pruned.tree.min.cp2 <- prune(tree.model2, cp = min.cp2,main = paste("Pruned Tree (Minimum CP): cp = ", round(min.cp2,4)))

Below shows the pruned tree for the best CP.

rpart.plot(pruned.tree.1SE, 
           extra = 104, # check the help document for more information
           # color palette is a sequential color scheme that blends green (G) to blue (Bu)
           box.palette = "GnBu",  
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE)

Below shows the pruned tree for the min CP. The best and min CP values are the same (0.018).

# Visualize the pruned tree
rpart.plot(pruned.tree.min.cp2, 
           extra = 104, # check the help document for more information
           # color palette is a sequential color scheme that blends green (G) to blue (Bu)
           box.palette = "GnBu",  
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE)

Next we make predictions on the test set and generate the ROC curves. The below plot shows that standard logistic regression did better than both the best and min CP trees.

# Make predictions on the test set
pred.label.1SE <- predict(pruned.tree.1SE, test.data4, type = "class") # default cutoff 0.5
pred.prob.1SE <- predict(pruned.tree.1SE, test.data4, type = "prob")[,2]
##
pred.label.min <- predict(pruned.tree.min.cp2, test.data4, type = "class") # default cutoff 0.5
pred.prob.min <- predict(pruned.tree.min.cp2, test.data4, type = "prob")[,2]

# Confusion matrix
#conf.matrix <- confusionMatrix(pred.label, test.data$diabetes, positive = "pos")
#print(conf.matrix)

########################
###  logistic regression
logit.fit <- glm(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data4, family = binomial)
AIC.logit <- step(logit.fit, direction = "both", trace = 0)
pred.logit <- predict(AIC.logit, test.data4, type = "response")

# ROC curve and AUC
roc.tree.1SE <- roc(test.data4$pass, pred.prob.1SE)
roc.tree.min <- roc(test.data4$pass, pred.prob.min)
roc.logit <- roc(test.data4$pass, pred.logit)

##
### Sen-Spe
tree.1SE.sen <- roc.tree.1SE$sensitivities
tree.1SE.spe <- roc.tree.1SE$specificities
#
tree.min.sen <- roc.tree.min$sensitivities
tree.min.spe <- roc.tree.min$specificities
#
logit.sen <- roc.logit$sensitivities
logit.spe <- roc.logit$specificities
## AUC
auc.tree.1SE <- roc.tree.1SE$auc
auc.tree.min <- roc.tree.min$auc
auc.logit <- roc.logit$auc
###
plot(1-logit.spe, logit.sen,  
     xlab = "1 - specificity",
     ylab = "sensitivity",
     col = "darkred",
     type = "l",
     lty = 1,
     lwd = 1,
     main = "ROC: CART and Logistic Regressopm")
lines(1-tree.1SE.spe, tree.1SE.sen, 
      col = "blue",
      lty = 1,
      lwd = 1)
lines(1-tree.min.spe, tree.min.sen,      
      col = "orange",
      lty = 1,
      lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
legend("bottomright", c("Logistic", "Tree 1SE", "Tree Min"),
       lty = c(1,1,1), lwd = rep(1,3),
       col = c("red", "blue", "orange"),
       bty="n",cex = 0.8)
## annotation - AUC
text(0.8, 0.46, paste("Logistic AUC: ", round(auc.logit,4)), cex = 0.8)
text(0.8, 0.4, paste("Tree 1SE AUC: ", round(auc.tree.1SE,4)), cex = 0.8)
text(0.8, 0.34, paste("Tree Min AUC: ", round(auc.tree.min,4)), cex = 0.8)

Below shows that the optimal cut-off probability is 0.5 This value represents the value that minimizes the total cost of misclassification of our response variable (pass).

# preditive probabilities of tree.min model
#train.data4$pass <- as.factor(train.data4$pass)

pred.prob.min2 <- predict(pruned.tree.min.cp2, train.data4, type = "prob")[,2]
##
cost <- NULL
cutoff <-seq(0,1, length = 10)
##
for (i in 1:10){
  pred.label <- ifelse(pred.prob.min2 > cutoff[i], "1", "0")
  FN <- sum(pred.label == "0" & train.data4$pass == "1")
  FP <- sum(pred.label == "1" & train.data4$pass == "0")
  cost[i] = 5*FP + 20*FN
}
## identify optimal cut-off
min.ID <- which(cost == min(cost))   # could have multiple minimum
optim.prob <- mean(cutoff[min.ID])   # take the average of the cut-offs
##
plot(cutoff, cost, type = "b", col = "navy",
     main = "Cutoff vs Misclassification Cost")
##
text(0.2, 3500, paste("Optimal cutoff:", round(optim.prob,4)), cex = 0.8)

7 BAGGING Regression

CART produces a single tree. Bagging uses multiple trees. Bagging generates multiple bootstrap samples from the training data, trains individual regression trees on each subset, and aggregates their predictions. The goal is to produce a more robust model.

Below shows the nbagg, cp, maxdepth, oob.error table. These are the optimal hyperparameters used to train the final model.

# Split the data
set.seed(123)
train.index5 <- createDataPartition(complete_habits_data$exam_score, p = 0.8, list = FALSE)
train.data5 <- complete_habits_data[train.index5, ]
test.data5 <- complete_habits_data[-train.index5, ]

# Set up train control for cross-validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE
)

# Define parameter combinations to test
nbagg.values <- c(10, 25, 50)     # number bagged trees
cp.values <- c(0.01, 0.05, 0.1)   # candidate cp values
maxdepth.values <- c(5, 10, 20)   # maximum depth of the candidate tree

# Create an empty data frame to store results
results <- data.frame()
######## Model tuning
# Manual tuning loop
for (nbagg in nbagg.values) {
  for (cp in cp.values) {
    for (maxdepth in maxdepth.values) {
      set.seed(123)
      model <- bagging(
        exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
        data = train.data5,
        nbagg = nbagg,
        coob = TRUE,
        trControl = ctrl,
        control = rpart.control(cp = cp, 
                                maxdepth = maxdepth)
       )
      # Get OOB error from each iteration
      oob.error <- model$err
      # Store results
      results <- rbind(results, 
                       data.frame( nbagg = nbagg,
                                   cp = cp,
                                   maxdepth = maxdepth,
                                   oob.error = oob.error))
    }
  }
}
# Find the best combination that yields the minimum out-of-bag's error
best.params <- results[which.min(results$oob.error), ]
pander(best.params)
  nbagg cp maxdepth oob.error
10 25 0.01 5 8.252

Using the optimal hyperparameters from above, we train the final model and get a RMSE of 8.748 and Rsquared of 0.7364 as shown below.

# Train the final model with the best parameters
final.model <- bagging(
  exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = train.data5,
  nbagg = best.params$nbagg,
  coob = TRUE,
  trControl = ctrl,
  control = rpart.control(cp = best.params$cp, 
                          maxdepth = best.params$maxdepth),
  
  importance = TRUE
)

# Evaluate on test set
predictions <- predict(final.model, newdata = test.data5)
## Using the caret function to calculate errors across re-samples
baggedError <- postResample(pred = predictions, obs = test.data5$exam_score)
pander(baggedError)
RMSE Rsquared MAE
8.748 0.7364 7.088

Based on the bagging model, the variable importance chart shown below shows that study hours, mental health, netflix hours, social media hours, and sleep hours are the most important variables in predicting exam score.

var.imp <- varImp(final.model)
# Extract variable importance (requires a custom function)
get.bagging.importance <- function(model) {
  # Get all the trees from the bagging model
  trees <- model$mtrees
  
  # Initialize importance vector
  imp <- numeric(length(trees[[1]]$btree$variable.importance))
  names(imp) <- names(trees[[1]]$btree$variable.importance)
  
  # Sum importance across all trees
  for(tree in trees) {
    imp[names(tree$btree$variable.importance)] <- 
      imp[names(tree$btree$variable.importance)] + 
      tree$btree$variable.importance
  }
  
  # Average importance
  imp <- imp/length(trees)
  return(imp)
}
# Get importance
importance.scores <- get.bagging.importance(final.model)

# Sort and plot
importance.scores <- sort(importance.scores, decreasing = TRUE)
barplot(importance.scores, horiz = TRUE, las = 1,
        main = "Variable Importance - Bagging (ipred)",
        xlab = "Importance Score")

Below we generate the MAE/RMSE for the bagged model, pruned tree using min CP, and standard regression. Based on the values shown below, the bagged model did not do as well as the standard regression.

## ordinary LSE regression model with step-wise variable selection
lse.fit <- lm(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data5)
AIC.fit2 <- stepAIC(lse.fit, direction="both", trace = FALSE)
##
pred.lse <- predict(AIC.fit2, test.data5)
mae.lse <- mean(abs(test.data5$exam_score - pred.lse)) # mean absolutesquare error
mse.lse <- mean((test.data5$exam_score - pred.lse)^2)  # mean square error
rmse.lse <- sqrt(mse.lse)                       # root mean square error
r.squared.lse <- (cor(test.data5$exam_score, pred.lse))^2 # r-squared
##
# Base regression tree
tree.model3 <- rpart(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data5,
                    method = "anova",     # For regression
                    control = rpart.control(
                      minsplit = 20,    # 3. Stopping rule: min observations to split
                      minbucket = 7,    # Min observations in terminal node
                      cp = seq(0, 0.05, 20), # Complexity parameter
                      maxdepth = 5      # Maximum tree depth
                    ))
# cp table
cp.table3 <- tree.model3$cptable
##
## Identify the minimum `xerror` and its `cp`.
min.xerror3 <- min(cp.table3[, "xerror"])
min.cp.row3 <- which.min(cp.table3[, "xerror"])
min.cp3 <- cp.table3[min.cp.row3, "CP"]
##
pruned.tree.min.cp3 <- prune(tree.model3, cp = min.cp3)
pred.min.cp3 <- predict(pruned.tree.min.cp3, newdata = test.data5)
##
# min.cp
mae.tree.min.cp3 <- mean(abs(test.data5$exam_score - pred.min.cp3))
mse.tree.min.cp3 <- mean((test.data5$exam_score - pred.min.cp3)^2)
rmse.tree.min.cp3 <- sqrt(mse.tree.min.cp3)
r.squared.tree.min.cp3 <- cor(test.data5$exam_score, pred.min.cp3)^2
##
###
Errors2 <- cbind(MAE = c(baggedError[1], mae.tree.min.cp3, mae.lse),
          RMSE = c(baggedError[3], rmse.tree.min.cp3, rmse.lse),
          r.squared = c(baggedError[2], r.squared.tree.min.cp3, r.squared.lse))
rownames(Errors2) = c("bagged Tree", "tree.min.cp", "lse")
pander(Errors2)
  MAE RMSE r.squared
bagged Tree 8.748 7.088 0.7364
tree.min.cp 7.052 9.024 0.7205
lse 4.533 5.708 0.8875

8 BAGGING Classification

In this section, we do the bagging classification model.

Below shows the optimal hyperparameters for nbagg, minsplit, maxdepth, and cp. We use these to train the final model.

set.seed(123)

# Split data into training and testing sets
trainIndex6 <- createDataPartition(complete_habits_data$pass, p = 0.8, list = FALSE)
trainData6 <- complete_habits_data[trainIndex6, ]
testData6 <- complete_habits_data[-trainIndex6, ]

# Create a grid of hyperparameter combinations
hyper.grid <- expand.grid(
  nbagg = c(25, 50, 100),
  minsplit = c(5, 10, 20),
  maxdepth = c(5, 10, 20),
  cp = c(0.01, 0.001)
)
# Initialize a results data frame
results2 <- data.frame() # store values of tuned hyperparameters
best.accuracy <- 0      # store accuracy
best.params2 <- list()   # store best values of hyperparameter

testData6$pass <- as.factor(testData6$pass)
trainData6$pass <- as.factor(trainData6$pass)

# Loop through each hyperparameter combination
for(i in 1:nrow(hyper.grid)) {
  # Get current hyperparameters
  params <- hyper.grid[i, ]
  
  # Set rpart control parameters
  rpart.control <- rpart.control(
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp
  )
  
  # Train bagging model
  bag.model <- bagging(
    pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
    data = trainData6,
    nbagg = params$nbagg,
    coob = TRUE,
    control = rpart.control
  )
  
  # Make predictions: default cut-off 0.5
  preds <- predict(bag.model, newdata = testData6)
  
  # Calculate accuracy
  cm <- confusionMatrix(preds, testData6$pass)
  accuracy <- cm$overall["Accuracy"]
  
  # Store results
  results2 <- rbind(results2, data.frame(
    nbagg = params$nbagg,
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp,
    Accuracy = accuracy
  ))
  
  # Update best parameters if current model is better
  if(accuracy > best.accuracy) {
    best.accuracy <- accuracy
    best.params <- params
  }
  
  # Print progress
  #cat("Completed", i, "of", nrow(hyper.grid), "combinations\n")
}
pander(best.params)
  nbagg minsplit maxdepth cp
53 50 20 20 0.001

Below we train the final model using the optimal parameters and generate the confusion matrix.

# Set rpart control with best parameters
best.control <- rpart.control(
  minsplit = best.params$minsplit,
  maxdepth = best.params$maxdepth,
  cp = best.params$cp
)

# Train final model
final.bag.model <- bagging(
  pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = trainData6,
  nbagg = best.params$nbagg,
  coob = TRUE,
  control = best.control
)

# Evaluate on test set
final.preds <- predict(final.bag.model, newdata = testData6)
final.cm <- confusionMatrix(final.preds, testData6$pass)
final.cm$table
          Reference
Prediction   0   1
         0  42   5
         1  14 139

Below we generate the ROC curves for standard logistic, bagged model, and the pruned tree with min CP. Based on the below plot, we see that the standard logistic performs better than both the bagged model and the pruned tree with min CP.

########################
###  logistic regression
logit.fit2 <- glm(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = trainData6, family = binomial)
AIC.logit2 <- step(logit.fit2, direction = "both", trace = 0)
pred.logit2 <- predict(AIC.logit2, testData6, type = "response")
##
# Build the initial classification tree
tree.model4 <- rpart(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = trainData6,
                    method = "class",   # classification tree
                    parms = list(split = "gini",  # Using Gini index
                                 # FN cost = 1, FP cost = 0.5
                                 loss = matrix(c(0, 0.5, 1, 0), nrow = 2)  
                                 ),
                    control = rpart.control(minsplit = 15,  # Min 15 obs to split
                                           minbucket = 5,   # Min 7 obs in leaf
                                           # Complexity parameter
                                           cp = 0.001, # complex parameter
                                           maxdepth = 5))   # Max tree depth
###
min.cp4 <- tree.model4$cptable[which.min(tree.model4$cptable[,"xerror"]),"CP"]

# Prediction with three candidate models
pruned.tree.min <- prune(tree.model4, cp = min.cp)
pred.prob.min2 <- predict(pruned.tree.min, testData6, type = "prob")[,2]
pred.prob.bagg <- predict(final.bag.model, newdata = testData6, type = "prob")[,2]
##
# ROC object
roc.tree.min2 <- roc(testData6$pass, pred.prob.min2)
roc.logit2 <- roc(testData6$pass, pred.logit2)
roc.bagg <- roc(testData6$pass, pred.prob.bagg)
##
##
### Sen-Spe
bagg.sen <- roc.bagg$sensitivities
bagg.spe <- roc.bagg$specificities
#
tree.min.sen2 <- roc.tree.min2$sensitivities
tree.min.spe2 <- roc.tree.min2$specificities
#
logit.sen2 <- roc.logit2$sensitivities
logit.spe2 <- roc.logit2$specificities
## AUC
auc.bagg <- roc.bagg$auc
auc.tree.min2 <- roc.tree.min2$auc
auc.logit2 <- roc.logit2$auc
###
plot(1-logit.spe2, logit.sen2,  
     xlab = "1 - specificity",
     ylab = "sensitivity",
     col = "darkred",
     type = "l",
     lty = 1,
     lwd = 1,
     main = "ROC: Classification Models")
lines(1-bagg.spe, bagg.sen, 
      col = "blue",
      lty = 1,
      lwd = 1)
lines(1-tree.min.spe2, tree.min.sen2,      
      col = "orange",
      lty = 1,
      lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
legend("bottomright", c("Logistic", "bagg", "Tree Min"),
       lty = c(1,1,1), lwd = rep(1,3),
       col = c("red", "blue", "orange"),
       bty="n",cex = 0.8)
## annotation - AUC
text(0.8, 0.46, paste("Logistic AUC: ", round(auc.logit2,4)), cex = 0.8)
text(0.8, 0.4, paste("Bagg AUC: ", round(auc.bagg,4)), cex = 0.8)
text(0.8, 0.34, paste("Tree Min AUC: ", round(auc.tree.min2,4)), cex = 0.8)

9 Random Forest Regression

In this section, we do random forest. Random forest is similar to bagging, except that at each split, only a random subset of features is used whereas all features are considered in bagging.

Below we split the data into training/testing and generate the optimal hyperparameters shown in the table (ntree, mtry, nodesize, maxnodes, and best.rmse).

set.seed(123)  # For reproducibility
train.index7 <- sample(1:nrow(complete_habits_data), 0.8 * nrow(complete_habits_data))
train.data7 <- complete_habits_data[train.index7, ]
test.data7<- complete_habits_data[-train.index7, ]
# Create a grid (data frame) of hyperparameter combinations to test
hyper.grid2 <- expand.grid(
  ntree = c(100, 300, 500),    
  mtry = c(3, 5, 7), # Dependent on the total number features available in the data
  nodesize = c(1, 3, 5), 
  maxnodes = c(5, 10, 20, NULL)
)
##
# Initialize results storage
results3 <- data.frame()  # combination of hyperparameters and corresponding RMSE
best.rmse <- Inf         # place-holder of RMSE with initial value inf
best.params3 <- list()    # update the hyperparameter list according to the best rmse

# Set up k-fold cross-validation
k <- 5                   # 5-fold cross-validation
n0 <- dim(train.data7)[1] # size of the training data 
fold.size <- floor(n0/k) # fold size. Caution: floor() should be used. 
                         # round( ,0) should be used. why?

# Loop through each hyperparameter combination
for(i in 1:nrow(hyper.grid2)) {           
  current.params <- hyper.grid2[i, ]   # vector of hyperparameters for cross validation
  cv.errors <- numeric(k)             # store RMSE from cross-validation
  
  # Perform k-fold cross-validation
  for(j in 1:k) {
    # Split into training and validation folds
    valid.indices <- (1 + fold.size*(j-1)):(fold.size*j)  # CV observation ID vector
    cv.train <- train.data7[-valid.indices, ]   # training data in cross-validation
    cv.valid <- train.data7[valid.indices, ]    # testing data in cross-validation
    
    # Train model with current parameters
    rf.model <- randomForest(
      exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
      data = cv.train,
      ntree = current.params$ntree,
      mtry = current.params$mtry,
      nodesize = current.params$nodesize,
      maxnodes = current.params$maxnodes,
      importance = TRUE
    )
    
    # Make predictions on validation set
    preds2 <- predict(rf.model, newdata = cv.valid)    
    
    # Calculate RMSE
    cv.errors[j] <- sqrt(mean((preds2 - cv.valid$exam_score)^2))
  }
  
  # Average RMSE across folds
  avg.rmse <- mean(cv.errors)    
  
  # Store results: the data frame defined to store combinations of hyperparameters
  # and the resulting mean RMSEs from cross-validation
  results3 <- rbind(results3, data.frame(
    ntree = current.params$ntree,
    mtry = current.params$mtry,
    nodesize = current.params$nodesize,
    maxnodes = ifelse(is.null(current.params$maxnodes), "NULL", current.params$maxnodes),
    rmse = avg.rmse
  ))
  
  # Update best parameters if current model is better
  if(avg.rmse < best.rmse) {
    best.rmse <- avg.rmse
    best.params <- current.params
  }
  # Print progress: It is always a good idea to print something out in loops
  # cat(paste0("Completed ", i, "/", nrow(hyper.grid), " - RMSE: ", round(avg.rmse, 4), "\n"))
}  
pander(data.frame(cbind(current.params,best.rmse)))    # resulting tuned hyperparameters
  ntree mtry nodesize maxnodes best.rmse
81 500 7 5 20 7.655

Below we train the final model using the optimal hyperparameters generated in the above section and make predictions on the test set. The scatterplot below shows the actual vs predicted Values of the exam scores of the test set.

# Train final model with best parameters on full training set
final.rf <- randomForest(
  exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = train.data7,
  ntree = best.params$ntree,
  mtry = best.params$mtry,
  nodesize = best.params$nodesize,
  maxnodes = best.params$maxnodes,
  importance = TRUE
)

# View model summary
# print(final.rf)
# Make predictions on test set
test.preds <- predict(final.rf, newdata = test.data7)

# Calculate test RMAE
test.mae <- mean(abs(test.preds - test.data7$exam_score))
# Calculate test RMSE
test.rmse <- sqrt(mean((test.preds - test.data7$exam_score)^2))
# Calculate R-squared
test.r2 <- 1 - sum((test.data7$exam_score - test.preds)^2) / sum((test.data7$exam_score - mean(test.data7$exam_score))^2)
# cat("Test R-squared:", test.r2, "\n")

### Performance vector
RF.performance = c(test.mae, test.rmse, test.r2)

# Plot actual vs predicted values
plot.data <- data.frame(Actual = test.data7$exam_score, Predicted = test.preds)
ggplot(plot.data, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Actual vs Predicted Values",
       x = "Actual Median Value ($1000s)",
       y = "Predicted Median Value ($1000s)") +
  theme_minimal()

Below we compare the performances of standard OLS regression, base regression tree, bagged, and the random forest models. Based on the RMSE values below, we see that base tree, bagged, and the random forest perform similarly.

## ordinary LSE regression model with step-wise variable selection
lse.fit2 <- lm(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data7)
AIC.fit3 <- stepAIC(lse.fit2, direction="both", trace = FALSE)
pred.lse3 <- predict(AIC.fit3, test.data7)
mae.lse3 <- mean(abs(test.data7$exam_score - pred.lse3))      # mean absolute error
mse.lse3 <- mean((test.data7$exam_score - pred.lse3)^2)       # mean square error
rmse.lse3 <- sqrt(mse.lse3)                            # root mean square error
r.squared.lse3 <- (cor(test.data7$exam_score, pred.lse3))^2 # r-squared

### base regression tree
tree.model5 <- rpart(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data7,
                    method = "anova",     # For regression
                    control = rpart.control(
                      minsplit = 20,    # 3. Stopping rule: min observations to split
                      minbucket = 7,    # Min observations in terminal node
                      cp = seq(0, 0.05, 20), # Complexity parameter
                      maxdepth = 5      # Maximum tree depth
                    ))
cp.table4 <- tree.model5$cptable
min.xerror4 <- min(cp.table4[, "xerror"])
min.cp.row4 <- which.min(cp.table4[, "xerror"])
min.cp5 <- cp.table[min.cp.row4, "CP"]
## 
pruned.tree.min.cp4 <- prune(tree.model5, cp = min.cp5)
pred.min.cp4 <- predict(pruned.tree.min.cp4, newdata = test.data7)
# performance measures
mae.tree.min.cp4 <- mean(abs(test.data7$exam_score - pred.min.cp4))   # MAD
mse.tree.min.cp4 <- mean((test.data7$exam_score - pred.min.cp4)^2)
rmse.tree.min.cp4 <- sqrt(mse.tree.min.cp4)                    # MSE
r.squared.tree.min.cp4 <- cor(test.data7$exam_score, pred.min.cp4)^2  # R.sq

### bagging regression: from the previous section
BaggPerf <- baggedError 

### Performance Comparison Table
Errors4 <- cbind(MAE = c(mae.lse3, mae.tree.min.cp4, BaggPerf[3], RF.performance[1]),
                RMSE = c(rmse.lse3, rmse.tree.min.cp4, BaggPerf[1], RF.performance[2]),
                r.squared = c(r.squared.lse3, r.squared.tree.min.cp4, BaggPerf[2],RF.performance[3]))
rownames(Errors4) = c("LS Regression", "Regression Tree", "BAGGING", "Random Forest")
pander(Errors4)
  MAE RMSE r.squared
LS Regression 4.325 5.402 0.9143
Regression Tree 7.522 9.202 0.7411
BAGGING 7.088 8.748 0.7364
Random Forest 6.952 8.374 0.7838

THe variable importance plots shown below are based on the MSE and node impurity. According to the %IncMSE plot, study hours, mental health, exercise, social media, and netflix are the most important features. The IncNodePurity plot shows the same variables.

varImpPlot(final.rf, pch = 19, main = "Variable Importance")

10 Random Forest Classification

In this section, we do random forest for our binary response variable ‘pass’.

The table below shows the optimal hyperparameters.

set.seed(123)
train.idx <- createDataPartition(complete_habits_data$pass, p = 0.8, list = FALSE)
train.data8 <- complete_habits_data[train.idx, ]
test.data8 <- complete_habits_data[-train.idx, ]

train.data8$pass <- factor(train.data8$pass)
test.data8$pass <- factor(test.data8$pass)

# cross-validation setting
k = 5
train.size <- dim(train.data8)[1]  # training data size
fold.size2 <- floor(train.size/k)  # fold size

##
tune.grid <- expand.grid(
  mtry = c(2, 3, 4, 5),
  ntree = c(100, 300, 500),
  nodesize = c(1, 3, 5, 10),
  maxnodes = c(5, 10, 20, NULL)
)
### store hyperparameters and avg of cv AUC
results4 <- data.frame()
best.auc <- 0.5         # place-holder of AUC, 0.5 = random guess
best.hyp.params <- list()   # update the hyperparameter list according to the best auc

##
for (i in 1:nrow(tune.grid)){
  current.tune.params <- tune.grid[i, ]  # subset of DATA FRAME!! 
  cv.auc <- rep(0,k)
  ##
  for (j in 1:k){
      cv.id <- (1 + (j-1)*fold.size2):(j*fold.size2)
      cv.train2 <- train.data8[-cv.id, ]
      cv.valid2 <- train.data8[cv.id, ]
      ##
       rf.cv <- randomForest(
                   pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
                   data = cv.train2,
                   mtry = current.tune.params$mtry,
                   ntree = current.tune.params$ntree,
                   nodesize = current.tune.params$nodesize,
                   maxnodes = current.tune.params$maxnodes)
       ##
       prob.cv <- predict(rf.cv, cv.valid2, type = "prob")[, "1"]
       cv.auc[j] <- auc(roc(cv.valid2$pass, prob.cv))
      }
      ##
      # Average RMSE across folds
      avg.auc <- mean(cv.auc)  
      ##
      # Store results: the data frame defined to store combinations of hyperparameters
      # and the resulting mean RMSEs from cross-validation
      results4 <- rbind(results4, data.frame(
                   mtry = current.tune.params$mtry,
                   ntree = current.tune.params$ntree,
                   nodesize = current.tune.params$nodesize,
                   maxnodes = current.tune.params$maxnodes,
                   auc = avg.auc))
  
     # Update best parameters if current model is better
     if(avg.auc > best.auc) {
           best.auc <- avg.auc
           best.hyp.params <- current.tune.params }
    
}
pander(data.frame(cbind(best.hyp.params,best.auc)))
  mtry ntree nodesize maxnodes best.auc
108 5 500 1 20 0.9539

Below we train the final random forest model using the optimal hyperparameters from above and generate the Area under the curve below (0.9707).

final.rf.cls <- randomForest(
      pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
      data = train.data8,
      ntree = best.hyp.params$ntree,
      mtry = best.hyp.params$mtry,
      nodesize = best.hyp.params$nodesize,
      maxnodes = current.tune.params$maxnodes,
      importance = TRUE
     )

test.pred <- predict(final.rf.cls, test.data8)
test.prob <- predict(final.rf.cls, test.data8, type = "prob")
rf.roc <- roc(test.data8$pass, test.prob[, "1"])
test.auc <- auc(rf.roc)
#confusionMatrix(test.pred, test.data$diabetes)
test.auc
Area under the curve: 0.9707

The variable importance plots below show that study hours, mental health, social media, sleep, and Netflix are the most important features in predicting ‘pass’.

varImpPlot(final.rf.cls, pch = 19, main = "Variable Importance of RF Classification" )

Below we compare the performances of standard logistic, base tree, bagged, and random forest models. Based on the areas below, bagged and random forest does better than the base decision tree.

########################
###  logistic regression
logit.fit3 <- glm(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data8, family = binomial)
AIC.logit3 <- step(logit.fit3, direction = "both", trace = 0)
pred.logit3 <- predict(AIC.logit3, test.data8, type = "response")

# Build the initial classification tree
tree.model6 <- rpart(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data8,
                    method = "class",   # classification tree
                    parms = list(split = "gini",  # Using Gini index
                                 # FN cost = 1, FP cost = 0.5
                                 loss = matrix(c(0, 0.5, 1, 0), nrow = 2)  
                                 ),
                    control = rpart.control(minsplit = 15,  # Min 15 obs to split
                                           minbucket = 5,   # Min 7 obs in leaf
                                           # Complexity parameter
                                           cp = 0.001, # complex parameter
                                           maxdepth = 5))   # Max tree depth
# Find the optimal cp value that minimizes cross-validated error
min.cp6 <- tree.model6$cptable[which.min(tree.model6$cptable[,"xerror"]),"CP"]
pruned.tree.min5 <- prune(tree.model6, cp = min.cp6)


# BAGGING
# Create a grid of hyperparameter combinations
hyper.grid3 <- expand.grid(
  nbagg = c(25, 50, 100),
  minsplit = c(5, 10, 20),
  maxdepth = c(5, 10, 20),
  cp = c(0.01, 0.001)
)
# Initialize a results dataframe
results5 <- data.frame() # store values of tuned hyperparameters
best.accuracy2 <- 0      # store accuracy
best.params4 <- list()   # store best values of hyperparameter

# Loop through each hyperparameter combination
for(i in 1:nrow(hyper.grid3)) {
  # Get current hyperparameters
  params2 <- hyper.grid3[i, ]
  
  # Set rpart control parameters
  rpart.control2 <- rpart.control(
    minsplit = params2$minsplit,
    maxdepth = params2$maxdepth,
    cp = params2$cp
  )
  
  # Train bagging model
  bag.model2 <- bagging(
    pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
    data = train.data8,
    nbagg = params2$nbagg,
    coob = TRUE,
    control = rpart.control2
  )
  
  # Make predictions: default cut-off 0.5
  preds3 <- predict(bag.model2, newdata = test.data8)
  
  # Calculate accuracy
  cm2 <- confusionMatrix(preds3, test.data8$pass)
  accuracy2 <- cm2$overall["Accuracy"]
  
  # Store results
  results5 <- rbind(results5, data.frame(
    nbagg = params2$nbagg,
    minsplit = params2$minsplit,
    maxdepth = params2$maxdepth,
    cp = params2$cp,
    Accuracy = accuracy2
  ))
  
  # Update best parameters if current model is better
  if(accuracy2 > best.accuracy2) {
    best.accuracy2 <- accuracy2
    best.params4 <- params2
  }
}
# Set rpart control with best parameters
best.control2 <- rpart.control(
  minsplit = best.params4$minsplit,
  maxdepth = best.params4$maxdepth,
  cp = best.params4$cp
)

# Train final model
final.bag.model2 <- bagging(
  pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = train.data8,
  nbagg = best.params4$nbagg,
  coob = TRUE,
  control = best.control2
)
#pred.prob.bagg <- predict(final.bag.model, newdata = test.data, type = "prob")[,2]
###
# Prediction with three candidate models
pruned.tree.min6 <- prune(tree.model6, cp = min.cp6)
pred.prob.min3 <- predict(pruned.tree.min6, test.data8, type = "prob")[,2]
pred.prob.bagg2 <- predict(final.bag.model2, newdata = test.data8, type = "prob")[,2]
##
# ROC object
roc.tree.min3 <- roc(test.data8$pass, pred.prob.min3)
roc.logit3 <- roc(test.data8$pass, pred.logit3)
roc.bagg2 <- roc(test.data8$pass, pred.prob.bagg2)
roc.rf <- roc(test.data8$pass, test.prob[, "1"])

##
##
### Sen-Spe
bagg.sen2 <- roc.bagg2$sensitivities
bagg.spe2 <- roc.bagg2$specificities
#
tree.min.sen3 <- roc.tree.min3$sensitivities
tree.min.spe3 <- roc.tree.min3$specificities
#
logit.sen3 <- roc.logit3$sensitivities
logit.spe3 <- roc.logit3$specificities
#
rf.sen <- roc.rf$sensitivities
rf.spe <- roc.rf$specificities

## AUC
auc.bagg2 <- roc.bagg2$auc
auc.tree.min3 <- roc.tree.min3$auc
auc.logit3 <- roc.logit3$auc
auc.rf <- roc.rf$auc
###
###
plot(1-logit.spe3, logit.sen3,  
     xlab = "1 - specificity",
     ylab = "sensitivity",
     col = "darkred",
     type = "l",
     lty = 1,
     lwd = 1,
     main = "ROC: Classification Models")
lines(1-bagg.spe2, bagg.sen2, 
      col = "blue",
      lty = 1,
      lwd = 1)
lines(1-tree.min.spe3, tree.min.sen3,      
      col = "orange",
      lty = 1,
      lwd = 1)
lines(1-rf.spe, rf.sen,      
      col = "purple4",
      lty = 1,
      lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
legend("bottomright", c("Logistic", "bagg", "Tree Min", "Random Forest"),
       lty = c(1,1,1), lwd = rep(1,3),
       col = c("red", "blue", "orange", "purple4"),
       bty="n",cex = 0.8)
## annotation - AUC
text(0.8, 0.46, paste("Logistic AUC: ", round(auc.logit3,4)), cex = 0.8)
text(0.8, 0.4, paste("Bagg AUC: ", round(auc.bagg2,4)), cex = 0.8)
text(0.8, 0.34, paste("Tree Min AUC: ", round(auc.tree.min3,4)), cex = 0.8)
text(0.8, 0.28, paste("Forest AUC: ", round(auc.rf,4)), cex = 0.8)

11 Results/Conclusion

Study hours, mental health, social media, sleep, Netflix, and exercise were consistently shown to be the most important features in predicting exam score/pass, out of all the models performed.

Performing CART alone may not be enough for prediction since they only generate a single decision tree. Bagging can produce a more robust model by generating/aggregating multiple trees. Random Forest may be even better than bagging because they use a random subset of features instead of all features.

---
title: "Student Habits vs Academic Performance"
author: "Eric Zhu"
date: "2025-04-16"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: navy;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-family: system-ui;
  font-weight: bold;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 16px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 14px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>

```


```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("readxl")) { # SVM methodology
   install.packages("readxl")
library(readxl)
}
if (!require("ggplot2")) { # SVM methodology
   install.packages("ggplot2")
library(ggplot2)
}
if (!require("ISLR")) { # contains example data set "Khan"
   install.packages("ISLR")
library(ISLR)
}
if (!require("MASSExtra")) { # contains example data set "Khan"
   install.packages("MASSExtra")
library(MASSExtra)
}
if (!require("missMethods")) { # customized coloring of plots
   install.packages("missMethods")
library(missMethods)
}
if (!require("Hmisc")) { # customized coloring of plots
   install.packages("Hmisc")
library(Hmisc)
}
if (!require("glmnet")) { # customized coloring of plots
   install.packages("glmnet")
library(glmnet)
}
if (!require("GGally")) { # customized coloring of plots
   install.packages("GGally")
library(GGally)
}
if (!require("mice")) { # customized coloring of plots
   install.packages("mice")
library(mice)
}
if (!require("plotly")) { # customized coloring of plots
   install.packages("plotly")
library(plotly)
}
if (!require("caret")) { # customized coloring of plots
   install.packages("caret")
library(caret)
}
if (!require("pROC")) { # customized coloring of plots
   install.packages("pROC")
library(pROC)
}
if (!require("pander")) { # customized coloring of plots
   install.packages("pander")
library(pander)
}
if (!require("e1071")) { # customized coloring of plots
   install.packages("e1071")
library(e1071)
}
if (!require("rpart")) { # customized coloring of plots
   install.packages("rpart")
library(rpart)
}
if (!require("rpart.plot")) { # customized coloring of plots
   install.packages("rpart.plot")
library(rpart.plot)
}
if (!require("randomForest")) { # customized coloring of plots
   install.packages("randomForest")
library(randomForest)
}
if (!require("forcats")) { # customized coloring of plots
   install.packages("forcats")
library(forcats)
}
if (!require("ipred")) { # customized coloring of plots
   install.packages("ipred")
library(ipred)
}
if (!require("neuralnet")) { # customized coloring of plots
   install.packages("neuralnet")
library(neuralnet)
}
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )

```


# Introduction

This dataset is titled “Student Habits vs Academic Performance” and shows various explanatory variables and one target/response variable (exam_score). The purpose of collecting this dataset is to see what factors may affect exam scores. This dataset was collected from Kaggle. This dataset contains 1000 observations and 16 variables (15 feature variables and 1 target variable).

The student_id serves as the identifier. The age (numerical) shows each persons age. The gender (categorical) shows each persons gender. The study_hours_per_day shows hours studied per day. The social_media_hours shows hours used daily on social media. The netflix_hours shows hours used daily on netflix. The part_time_job (no/yes) shows if the person has a part time job. The attendance percentage shows the percentage of classes attended. The sleep_hours shows amount of sleep per night. The diet_quality (poor/fair/good) shows quality of ones diet. The exercise_frequency shows how many times one exercises per week. The parental_education_level (none/High School/bachelor/master) shows the highest level of education of ones parents. The internet_quality(poor/avg/good) shows how good ones wi-fi is. The mental_health_rating (1-10) shows how one rates their mental health. The extracurricular_participation (n/y) shows if one participates in extracurricular activities. The target/response variable is exam_score (continuous).

Based on this dataset, we can formulate a research question. The question we can form is what variables are the most important in predicting exam score. For regression, we can use exam_score (continuous) as our response variable. For classification, we can create a binary response variable with 1 (pass) and 0 (fail). We can use the condition if exam score is greater or less than 0.60.

We will use CART, bagging, and random forests to try to answer the research question.

# Read in Dataset

Read in dataset.

```{r}
habits <- read.csv("student_habits_performance.csv")
```

See dataset structure.

```{r}
str(habits)
```

Check for Missing Values.

```{r}
colSums(is.na(habits))
```


# EDA

This section will show the distribution of each individual feature.

The below figure shows the distribution of the gender variable.

```{r}
ggplot(habits, aes(x = gender)) + 
  
  geom_bar() +
  labs(title = "Gender")
```

The below figure shows the distribution of the part_time_job variable.

```{r}
ggplot(habits, aes(x = part_time_job)) + 
  
  geom_bar() +
  labs(title = "part_time_job")
```

The below figure shows the distribution of the diet_quality variable.

```{r}
ggplot(habits, aes(x = diet_quality)) + 
  
  geom_bar() +
  labs(title = "diet_quality")
```

The below figure shows the distribution of the parental_education_level variable.

```{r}
ggplot(habits, aes(x = parental_education_level)) + 
  
  geom_bar() +
  labs(title = "parental_education_level")
```

The below figure shows the distribution of the internet_quality variable.

```{r}
ggplot(habits, aes(x = internet_quality)) + 
  
  geom_bar() +
  labs(title = "internet_quality")
```

The below figure shows the distribution of the extracurricular_participation variable.

```{r}
ggplot(habits, aes(x = extracurricular_participation)) + 
  
  geom_bar() +
  labs(title = "extracurricular_participation")
```

The below figure shows the distribution of the age variable.

```{r}
ggplot(data = habits, aes(x = age)) + 
  geom_boxplot() + 
  
  labs(title = "age")
```

The below figure shows the distribution of the study_hours_per_day variable.

```{r}
ggplot(data = habits, aes(x = study_hours_per_day)) + 
  geom_boxplot() + 
  
  labs(title = "study_hours_per_day")
```

The below figure shows the distribution of the social_media_hours variable.

```{r}
ggplot(data = habits, aes(x = social_media_hours)) + 
  geom_boxplot() + 
  
  labs(title = "social_media_hours")
```

The below figure shows the distribution of the netflix_hours variable.

```{r}
ggplot(data = habits, aes(x = netflix_hours)) + 
  geom_boxplot() + 
  
  labs(title = "netflix_hours")
```

The below figure shows the distribution of the attendance_percentage variable.

```{r}
ggplot(data = habits, aes(x = attendance_percentage)) + 
  geom_boxplot() + 
  
  labs(title = "attendance_percentage")
```

The below figure shows the distribution of the sleep_hours variable.

```{r}
ggplot(data = habits, aes(x = sleep_hours)) + 
  geom_boxplot() + 
  
  labs(title = "sleep_hours")
```

The below figure shows the distribution of the exercise_frequency variable.

```{r}
ggplot(data = habits, aes(x = exercise_frequency)) + 
  geom_boxplot() + 
  
  labs(title = "exercise_frequency")
```

The below figure shows the distribution of the mental_health_rating variable.

```{r}
ggplot(data = habits, aes(x = mental_health_rating)) + 
  geom_boxplot() + 
  
  labs(title = "mental_health_rating")
```

# Imputation/Feature Engineering

The variables part_time_job, diet_quality, internet_quality, and extracurricular_participation are converted to categorical/factor.

```{r}
#habits$gender <- as.factor(habits$gender)
habits$part_time_job <- as.factor(habits$part_time_job)
habits$diet_quality <- factor(habits$diet_quality,levels = c("Poor", "Fair", "Good"))
#habits$parental_education_level <- as.factor(habits$parental_education_level)
habits$internet_quality <- factor(habits$internet_quality, levels = c("Poor", "Average", "Good"))
habits$extracurricular_participation <- as.factor(habits$extracurricular_participation)
```

The variable gender contains the value 'Other'. Since we only want to work with Male/female, we set 'Other' to missing and use MICE imputation.

```{r}
habits$gender[habits$gender== "Other"] = NA
habits$gender <- as.factor(habits$gender)

init2 <- mice(habits, maxit = 0)
init2$method

imp2 <- mice(habits, method = c("","", "logreg", "",  "", "", "", "", "", "", "", "", "", "", "", ""), 
                 maxit = 10,  
                 m = 5, 
                 seed=123,
                 print=F)     



complete_habits_data <- complete(imp2)
```

For parental_education_level, we convert both Bachelor and Master to College to reduce categories.

```{r}
complete_habits_data$parental_education_level[complete_habits_data$parental_education_level == 'Bachelor'] <- 'College'
complete_habits_data$parental_education_level[complete_habits_data$parental_education_level == 'Master'] <- 'College'

complete_habits_data$parental_education_level <- factor(complete_habits_data$parental_education_level,levels = c("None", "High School", "College"))

str(complete_habits_data)
```

# CART - Regression

Decision trees recursively splits the data into subsets based on the value of a feature, with the goal of creating subsets that are as homogeneous as possible with respect to the target variable.

In regression trees, we can predict continuous outcomes by choosing splits to minimize the MSE of the target variable.

As shown below, we first build the initial regression tree to get a general sense of all the predictors and their splits.

```{r}
# Set seed for reproducibility
set.seed(123)

# Split data into training (70%) and test (30%) sets
train.index3 <- sample(1:nrow(complete_habits_data), size = 0.8 * nrow(complete_habits_data))
train.data3 <- complete_habits_data[train.index3, ]
test.data3 <- complete_habits_data[-train.index3, ]

# 1. Tree Induction & 2. Splitting Criteria
# Build the initial regression tree using rpart
tree.model <- rpart(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data3,
                    method = "anova",     # For regression
                    control = rpart.control(
                      minsplit = 20,    # 3. Stopping rule: min observations to split
                      minbucket = 7,    # Min observations in terminal node
                      cp = seq(0, 0.05, 20), # Complexity parameter
                      maxdepth = 5      # Maximum tree depth
                    ))

# Visualize the unpruned tree
rpart.plot(tree.model, main = "Initial Regression Tree")
```

Next, we would want to prune the initial tree. Below shows a table of all the CP, nsplit, rel error, xerror, and xstd values. 

```{r}
#summary(tree.model)
pander(tree.model$cptable)
```

Below shows a plot of how cp changes as tree size increases. This can help with selecting cp values for pruning.

```{r}
plotcp(tree.model)
```

For pruning, we select both best CP and min CP. Best CP is the largest CP when the xerror is within 1 standard error of the minimum. The min CP is simply the smallest CP. Best CP is preferred because it can give us the simplest tree as well as retaining accuracy.

```{r}
cp.table <- tree.model$cptable

## Identify the minimum `xerror` and its `cp`.
min.xerror <- min(cp.table[, "xerror"])
min.cp.row <- which.min(cp.table[, "xerror"])
min.cp <- cp.table[min.cp.row, "CP"]

## Get the standard error (`xstd`) of the minimum `xerror`
xerror.std <- cp.table[min.cp.row, "xstd"]
threshold <- min.xerror + xerror.std  # Upper bound (1 SE rule)

## Find the simplest tree (`cp`) Where `xerror less than or equal to Threshold`.
best.cp.row <- which(cp.table[, "xerror"] <= threshold)[1]  # First row meeting criteria
best.cp <- cp.table[best.cp.row, "CP"]

## Two different trees: best CP vs minimum CP
pruned.tree.best.cp <- prune(tree.model, cp = best.cp)
pruned.tree.min.cp <- prune(tree.model, cp = min.cp)

# Visualize the pruned tree: best CP
rpart.plot(pruned.tree.best.cp, main = paste("Pruned Tree (Best CP): cp = ", round(best.cp,4)))
```

The above shows the tree with the best CP of 0.0044.

The below shows the tree with the min CP of 0.0009.

```{r}
rpart.plot(pruned.tree.min.cp, main = paste("Pruned Tree (Minimum CP): cp = ", round(min.cp,4)))
```

Below we make predictions on test data using the best CP tree, min CP tree, OLS regression using features from the best CP tree (study_hours_per_day, mental_health_rating, netflix_hours, exercise_frequency, sleep_hours), and OLS regression using stepwise selection.

As shown in the output below, it seems like the OLS stepwise did the best with MSE of 29.19. 

```{r}
# 5. Prediction
# Make predictions on test data
pred.best.cp <- predict(pruned.tree.best.cp, newdata = test.data3)
pred.min.cp <- predict(pruned.tree.min.cp, newdata = test.data3)


# Evaluate model performance: best.cp
mse.tree.best.cp <- mean((test.data3$exam_score - pred.best.cp)^2)
rmse.tree.best.cp <- sqrt(mse.tree.best.cp)
r.squared.tree.best.cp <- cor(test.data3$exam_score, pred.best.cp)^2
# min.cp
mse.tree.min.cp <- mean((test.data3$exam_score - pred.min.cp)^2)
rmse.tree.min.cp <- sqrt(mse.tree.min.cp)
r.squared.tree.min.cp <- cor(test.data3$exam_score, pred.min.cp)^2

##
# fit ordinary least square regression 
LSE01 <- lm(exam_score ~ study_hours_per_day + mental_health_rating + netflix_hours + exercise_frequency + sleep_hours, data = train.data3)
pred.lse01 <-  predict(LSE01, newdata = test.data3)
mse.lse01 <- mean((test.data3$exam_score - pred.lse01)^2)
rmse.lse01 <- sqrt(mse.lse01)
r.squared.lse01 <- cor(test.data3$exam_score, pred.lse01)^2

##
## ordinary LSE regression model with step-wise variable selection
lse02.fit <- lm(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data3)
AIC.fit <- stepAIC(lse02.fit, direction="both", trace = FALSE)
pred.lse02 <- predict(AIC.fit, test.data3)
mse.lse02 <- mean((test.data3$exam_score - pred.lse02)^2)    # mean square error
rmse.lse02 <- sqrt(mse.lse02)                       # root mean square error
r.squared.lse02 <- (cor(test.data3$exam_score, pred.lse02))^2 # r-squared

###
Errors <- cbind(MSE = c(mse.tree.best.cp, mse.tree.min.cp, mse.lse01, mse.lse02),
                RMSE = c(rmse.tree.best.cp, rmse.tree.min.cp, rmse.lse01, rmse.lse02),
                r.squared = c(r.squared.tree.best.cp, r.squared.tree.min.cp, r.squared.lse01, r.squared.lse02))
rownames(Errors) = c("tree.best.cp", "tree.min.cp", "lse01", "lse02")
pander(Errors)
```

Below we show what variables are most important in predicting exam score based on the best CP tree. Based on the below chart, we see that study hours, mental health, netflix hours, exercise frequency, and sleep hours are the most important variables in predicting exam score.

```{r}
# Variable importance
importance <- pruned.tree.best.cp$variable.importance
barplot(sort(importance, decreasing = TRUE), 
        main = "Variable Importance: Best CP",
        las = 2)
```

Below we show what variables are most important in predicting exam score based on the min CP tree. Based on the below chart, we see that study hours, mental health, netflix hours, social media hours, and exercise freq are the most important variables in predicting exam score.

```{r}
# Variable importance
importance2 <- pruned.tree.min.cp$variable.importance
barplot(sort(importance2, decreasing = TRUE), 
        main = "Variable Importance: Minimum CP",
        las = 2)
```

# CART - Classification

In this section, we do classification trees. We can use them to predict categorical outcomes by using Gini impurity and entropy.

First, we need to create a binary response variable since our original response is continuous. Since exam scores over 60 is generally considered passing, we create 'pass' where 1 is pass and 0 is fail.

Then we create the initial tree, as shown below.

```{r}
complete_habits_data <- transform(complete_habits_data, pass=ifelse(exam_score >= 60, 1, 0))

set.seed(123)
train.index4 <- createDataPartition(complete_habits_data$pass, p = 0.8, list = FALSE)
train.data4 <- complete_habits_data[train.index4, ]
test.data4 <- complete_habits_data[-train.index4, ]

train.data4$pass <- factor(train.data4$pass)
test.data4$pass <- factor(test.data4$pass)

# Build the initial classification tree
tree.model2 <- rpart(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data4,
                    method = "class",   # classification tree
                    parms = list(split = "gini",  # Using Gini index
                                 # FN cost = 1, FP cost = 0.5
                                 loss = matrix(c(0, 0.5, 1, 0), nrow = 2)  
                                 ),
                    control = rpart.control(minsplit = 15,  # Min 15 obs to split
                                           minbucket = 5,   # Min 7 obs in leaf
                                           # Complexity parameter
                                           cp = 0.001, # complex parameter
                                           maxdepth = 5))   # Max tree depth


rpart.plot(tree.model2, 
           extra = 104, # check the help document for more information
           # color palette is a sequential color scheme that blends green (G) to blue (Bu)
           box.palette = "GnBu",  
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE)
```

Below shows the cp table to help with the pruning process.

```{r}
pander(tree.model2$cptable)
```

Below shows the cp plot.

```{r}
plotcp(tree.model2)
```

Below, we get the best CP and the min CP values.

```{r}
cp.table2 <- tree.model2$cptable
#min.cp2 <- tree.model2$cptable[which.min(tree.model2$cptable[,"xerror"]),"CP"]
## Identify the minimum `xerror` and its `cp`.
min.xerror2 <- min(cp.table2[, "xerror"])
min.cp.row2 <- which.min(cp.table2[, "xerror"])
min.cp2 <- cp.table2[min.cp.row2, "CP"]

## Get the standard error (`xstd`) of the minimum `xerror`
xerror.std2 <- cp.table2[min.cp.row2, "xstd"]
threshold2 <- min.xerror2 + xerror.std2  # Upper bound (1 SE rule)

## Find the simplest tree (`cp`) Where `xerror less than or equal to Threshold`.
best.cp.row2 <- which(cp.table2[, "xerror"] <= threshold2)[1]  # First row meeting criteria
best.cp2 <- cp.table2[best.cp.row2, "CP"]

## Two different trees: best CP vs minimum CP
pruned.tree.1SE <- prune(tree.model2, cp = best.cp2,main = paste("Pruned Tree (Best CP): cp = ", round(best.cp2,4)))
pruned.tree.min.cp2 <- prune(tree.model2, cp = min.cp2,main = paste("Pruned Tree (Minimum CP): cp = ", round(min.cp2,4)))
```

Below shows the pruned tree for the best CP.

```{r}
rpart.plot(pruned.tree.1SE, 
           extra = 104, # check the help document for more information
           # color palette is a sequential color scheme that blends green (G) to blue (Bu)
           box.palette = "GnBu",  
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE)
```

Below shows the pruned tree for the min CP. The best and min CP values are the same (0.018).

```{r}
# Visualize the pruned tree
rpart.plot(pruned.tree.min.cp2, 
           extra = 104, # check the help document for more information
           # color palette is a sequential color scheme that blends green (G) to blue (Bu)
           box.palette = "GnBu",  
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE)
```

Next we make predictions on the test set and generate the ROC curves. The below plot shows that standard logistic regression did better than both the best and min CP trees.

```{r}
# Make predictions on the test set
pred.label.1SE <- predict(pruned.tree.1SE, test.data4, type = "class") # default cutoff 0.5
pred.prob.1SE <- predict(pruned.tree.1SE, test.data4, type = "prob")[,2]
##
pred.label.min <- predict(pruned.tree.min.cp2, test.data4, type = "class") # default cutoff 0.5
pred.prob.min <- predict(pruned.tree.min.cp2, test.data4, type = "prob")[,2]

# Confusion matrix
#conf.matrix <- confusionMatrix(pred.label, test.data$diabetes, positive = "pos")
#print(conf.matrix)

########################
###  logistic regression
logit.fit <- glm(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data4, family = binomial)
AIC.logit <- step(logit.fit, direction = "both", trace = 0)
pred.logit <- predict(AIC.logit, test.data4, type = "response")

# ROC curve and AUC
roc.tree.1SE <- roc(test.data4$pass, pred.prob.1SE)
roc.tree.min <- roc(test.data4$pass, pred.prob.min)
roc.logit <- roc(test.data4$pass, pred.logit)

##
### Sen-Spe
tree.1SE.sen <- roc.tree.1SE$sensitivities
tree.1SE.spe <- roc.tree.1SE$specificities
#
tree.min.sen <- roc.tree.min$sensitivities
tree.min.spe <- roc.tree.min$specificities
#
logit.sen <- roc.logit$sensitivities
logit.spe <- roc.logit$specificities
## AUC
auc.tree.1SE <- roc.tree.1SE$auc
auc.tree.min <- roc.tree.min$auc
auc.logit <- roc.logit$auc
###
plot(1-logit.spe, logit.sen,  
     xlab = "1 - specificity",
     ylab = "sensitivity",
     col = "darkred",
     type = "l",
     lty = 1,
     lwd = 1,
     main = "ROC: CART and Logistic Regressopm")
lines(1-tree.1SE.spe, tree.1SE.sen, 
      col = "blue",
      lty = 1,
      lwd = 1)
lines(1-tree.min.spe, tree.min.sen,      
      col = "orange",
      lty = 1,
      lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
legend("bottomright", c("Logistic", "Tree 1SE", "Tree Min"),
       lty = c(1,1,1), lwd = rep(1,3),
       col = c("red", "blue", "orange"),
       bty="n",cex = 0.8)
## annotation - AUC
text(0.8, 0.46, paste("Logistic AUC: ", round(auc.logit,4)), cex = 0.8)
text(0.8, 0.4, paste("Tree 1SE AUC: ", round(auc.tree.1SE,4)), cex = 0.8)
text(0.8, 0.34, paste("Tree Min AUC: ", round(auc.tree.min,4)), cex = 0.8)
```

Below shows that the optimal cut-off probability is 0.5
This value represents the value that minimizes the total cost of misclassification of our response variable (pass).

```{r}
# preditive probabilities of tree.min model
#train.data4$pass <- as.factor(train.data4$pass)

pred.prob.min2 <- predict(pruned.tree.min.cp2, train.data4, type = "prob")[,2]
##
cost <- NULL
cutoff <-seq(0,1, length = 10)
##
for (i in 1:10){
  pred.label <- ifelse(pred.prob.min2 > cutoff[i], "1", "0")
  FN <- sum(pred.label == "0" & train.data4$pass == "1")
  FP <- sum(pred.label == "1" & train.data4$pass == "0")
  cost[i] = 5*FP + 20*FN
}
## identify optimal cut-off
min.ID <- which(cost == min(cost))   # could have multiple minimum
optim.prob <- mean(cutoff[min.ID])   # take the average of the cut-offs
##
plot(cutoff, cost, type = "b", col = "navy",
     main = "Cutoff vs Misclassification Cost")
##
text(0.2, 3500, paste("Optimal cutoff:", round(optim.prob,4)), cex = 0.8)
```

# BAGGING Regression

CART produces a single tree. Bagging uses multiple trees. Bagging generates multiple bootstrap samples from the training data, trains individual regression trees on each subset, and aggregates their predictions. The goal is to produce a more robust model.

Below shows the nbagg, cp, maxdepth, oob.error table. These are the optimal hyperparameters used to train the final model.

```{r}
# Split the data
set.seed(123)
train.index5 <- createDataPartition(complete_habits_data$exam_score, p = 0.8, list = FALSE)
train.data5 <- complete_habits_data[train.index5, ]
test.data5 <- complete_habits_data[-train.index5, ]

# Set up train control for cross-validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE
)

# Define parameter combinations to test
nbagg.values <- c(10, 25, 50)     # number bagged trees
cp.values <- c(0.01, 0.05, 0.1)   # candidate cp values
maxdepth.values <- c(5, 10, 20)   # maximum depth of the candidate tree

# Create an empty data frame to store results
results <- data.frame()
######## Model tuning
# Manual tuning loop
for (nbagg in nbagg.values) {
  for (cp in cp.values) {
    for (maxdepth in maxdepth.values) {
      set.seed(123)
      model <- bagging(
        exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
        data = train.data5,
        nbagg = nbagg,
        coob = TRUE,
        trControl = ctrl,
        control = rpart.control(cp = cp, 
                                maxdepth = maxdepth)
       )
      # Get OOB error from each iteration
      oob.error <- model$err
      # Store results
      results <- rbind(results, 
                       data.frame( nbagg = nbagg,
                                   cp = cp,
                                   maxdepth = maxdepth,
                                   oob.error = oob.error))
    }
  }
}
# Find the best combination that yields the minimum out-of-bag's error
best.params <- results[which.min(results$oob.error), ]
pander(best.params)
```

Using the optimal hyperparameters from above, we train the final model and get a RMSE of 8.748 and Rsquared of 0.7364 as shown below.

```{r}
# Train the final model with the best parameters
final.model <- bagging(
  exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = train.data5,
  nbagg = best.params$nbagg,
  coob = TRUE,
  trControl = ctrl,
  control = rpart.control(cp = best.params$cp, 
                          maxdepth = best.params$maxdepth),
  
  importance = TRUE
)

# Evaluate on test set
predictions <- predict(final.model, newdata = test.data5)
## Using the caret function to calculate errors across re-samples
baggedError <- postResample(pred = predictions, obs = test.data5$exam_score)
pander(baggedError)
```

Based on the bagging model, the variable importance chart shown below shows that study hours, mental health, netflix hours, social media hours, and sleep hours are the most important variables in predicting exam score.

```{r}
var.imp <- varImp(final.model)
# Extract variable importance (requires a custom function)
get.bagging.importance <- function(model) {
  # Get all the trees from the bagging model
  trees <- model$mtrees
  
  # Initialize importance vector
  imp <- numeric(length(trees[[1]]$btree$variable.importance))
  names(imp) <- names(trees[[1]]$btree$variable.importance)
  
  # Sum importance across all trees
  for(tree in trees) {
    imp[names(tree$btree$variable.importance)] <- 
      imp[names(tree$btree$variable.importance)] + 
      tree$btree$variable.importance
  }
  
  # Average importance
  imp <- imp/length(trees)
  return(imp)
}
```

```{r}
# Get importance
importance.scores <- get.bagging.importance(final.model)

# Sort and plot
importance.scores <- sort(importance.scores, decreasing = TRUE)
barplot(importance.scores, horiz = TRUE, las = 1,
        main = "Variable Importance - Bagging (ipred)",
        xlab = "Importance Score")
```

Below we generate the MAE/RMSE for the bagged model, pruned tree using min CP, and standard regression. Based on the values shown below, the bagged model did not do as well as the standard regression.

```{r}
## ordinary LSE regression model with step-wise variable selection
lse.fit <- lm(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data5)
AIC.fit2 <- stepAIC(lse.fit, direction="both", trace = FALSE)
##
pred.lse <- predict(AIC.fit2, test.data5)
mae.lse <- mean(abs(test.data5$exam_score - pred.lse)) # mean absolutesquare error
mse.lse <- mean((test.data5$exam_score - pred.lse)^2)  # mean square error
rmse.lse <- sqrt(mse.lse)                       # root mean square error
r.squared.lse <- (cor(test.data5$exam_score, pred.lse))^2 # r-squared
##
# Base regression tree
tree.model3 <- rpart(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data5,
                    method = "anova",     # For regression
                    control = rpart.control(
                      minsplit = 20,    # 3. Stopping rule: min observations to split
                      minbucket = 7,    # Min observations in terminal node
                      cp = seq(0, 0.05, 20), # Complexity parameter
                      maxdepth = 5      # Maximum tree depth
                    ))
# cp table
cp.table3 <- tree.model3$cptable
##
## Identify the minimum `xerror` and its `cp`.
min.xerror3 <- min(cp.table3[, "xerror"])
min.cp.row3 <- which.min(cp.table3[, "xerror"])
min.cp3 <- cp.table3[min.cp.row3, "CP"]
##
pruned.tree.min.cp3 <- prune(tree.model3, cp = min.cp3)
pred.min.cp3 <- predict(pruned.tree.min.cp3, newdata = test.data5)
##
# min.cp
mae.tree.min.cp3 <- mean(abs(test.data5$exam_score - pred.min.cp3))
mse.tree.min.cp3 <- mean((test.data5$exam_score - pred.min.cp3)^2)
rmse.tree.min.cp3 <- sqrt(mse.tree.min.cp3)
r.squared.tree.min.cp3 <- cor(test.data5$exam_score, pred.min.cp3)^2
##
###
Errors2 <- cbind(MAE = c(baggedError[1], mae.tree.min.cp3, mae.lse),
          RMSE = c(baggedError[3], rmse.tree.min.cp3, rmse.lse),
          r.squared = c(baggedError[2], r.squared.tree.min.cp3, r.squared.lse))
rownames(Errors2) = c("bagged Tree", "tree.min.cp", "lse")
pander(Errors2)
```

# BAGGING Classification

In this section, we do the bagging classification model.

Below shows the optimal hyperparameters for nbagg, minsplit, maxdepth, and cp. We use these to train the final model.

```{r}
set.seed(123)

# Split data into training and testing sets
trainIndex6 <- createDataPartition(complete_habits_data$pass, p = 0.8, list = FALSE)
trainData6 <- complete_habits_data[trainIndex6, ]
testData6 <- complete_habits_data[-trainIndex6, ]

# Create a grid of hyperparameter combinations
hyper.grid <- expand.grid(
  nbagg = c(25, 50, 100),
  minsplit = c(5, 10, 20),
  maxdepth = c(5, 10, 20),
  cp = c(0.01, 0.001)
)
# Initialize a results data frame
results2 <- data.frame() # store values of tuned hyperparameters
best.accuracy <- 0      # store accuracy
best.params2 <- list()   # store best values of hyperparameter

testData6$pass <- as.factor(testData6$pass)
trainData6$pass <- as.factor(trainData6$pass)

# Loop through each hyperparameter combination
for(i in 1:nrow(hyper.grid)) {
  # Get current hyperparameters
  params <- hyper.grid[i, ]
  
  # Set rpart control parameters
  rpart.control <- rpart.control(
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp
  )
  
  # Train bagging model
  bag.model <- bagging(
    pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
    data = trainData6,
    nbagg = params$nbagg,
    coob = TRUE,
    control = rpart.control
  )
  
  # Make predictions: default cut-off 0.5
  preds <- predict(bag.model, newdata = testData6)
  
  # Calculate accuracy
  cm <- confusionMatrix(preds, testData6$pass)
  accuracy <- cm$overall["Accuracy"]
  
  # Store results
  results2 <- rbind(results2, data.frame(
    nbagg = params$nbagg,
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp,
    Accuracy = accuracy
  ))
  
  # Update best parameters if current model is better
  if(accuracy > best.accuracy) {
    best.accuracy <- accuracy
    best.params <- params
  }
  
  # Print progress
  #cat("Completed", i, "of", nrow(hyper.grid), "combinations\n")
}
pander(best.params)
```

Below we train the final model using the optimal parameters and generate the confusion matrix.

```{r}
# Set rpart control with best parameters
best.control <- rpart.control(
  minsplit = best.params$minsplit,
  maxdepth = best.params$maxdepth,
  cp = best.params$cp
)

# Train final model
final.bag.model <- bagging(
  pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = trainData6,
  nbagg = best.params$nbagg,
  coob = TRUE,
  control = best.control
)

# Evaluate on test set
final.preds <- predict(final.bag.model, newdata = testData6)
final.cm <- confusionMatrix(final.preds, testData6$pass)
final.cm$table
```

Below we generate the ROC curves for standard logistic, bagged model, and the pruned tree with min CP. Based on the below plot, we see that the standard logistic performs better than both the bagged model and the pruned tree with min CP.

```{r}
########################
###  logistic regression
logit.fit2 <- glm(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = trainData6, family = binomial)
AIC.logit2 <- step(logit.fit2, direction = "both", trace = 0)
pred.logit2 <- predict(AIC.logit2, testData6, type = "response")
##
# Build the initial classification tree
tree.model4 <- rpart(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = trainData6,
                    method = "class",   # classification tree
                    parms = list(split = "gini",  # Using Gini index
                                 # FN cost = 1, FP cost = 0.5
                                 loss = matrix(c(0, 0.5, 1, 0), nrow = 2)  
                                 ),
                    control = rpart.control(minsplit = 15,  # Min 15 obs to split
                                           minbucket = 5,   # Min 7 obs in leaf
                                           # Complexity parameter
                                           cp = 0.001, # complex parameter
                                           maxdepth = 5))   # Max tree depth
###
min.cp4 <- tree.model4$cptable[which.min(tree.model4$cptable[,"xerror"]),"CP"]

# Prediction with three candidate models
pruned.tree.min <- prune(tree.model4, cp = min.cp)
pred.prob.min2 <- predict(pruned.tree.min, testData6, type = "prob")[,2]
pred.prob.bagg <- predict(final.bag.model, newdata = testData6, type = "prob")[,2]
##
# ROC object
roc.tree.min2 <- roc(testData6$pass, pred.prob.min2)
roc.logit2 <- roc(testData6$pass, pred.logit2)
roc.bagg <- roc(testData6$pass, pred.prob.bagg)
##
##
### Sen-Spe
bagg.sen <- roc.bagg$sensitivities
bagg.spe <- roc.bagg$specificities
#
tree.min.sen2 <- roc.tree.min2$sensitivities
tree.min.spe2 <- roc.tree.min2$specificities
#
logit.sen2 <- roc.logit2$sensitivities
logit.spe2 <- roc.logit2$specificities
## AUC
auc.bagg <- roc.bagg$auc
auc.tree.min2 <- roc.tree.min2$auc
auc.logit2 <- roc.logit2$auc
###
plot(1-logit.spe2, logit.sen2,  
     xlab = "1 - specificity",
     ylab = "sensitivity",
     col = "darkred",
     type = "l",
     lty = 1,
     lwd = 1,
     main = "ROC: Classification Models")
lines(1-bagg.spe, bagg.sen, 
      col = "blue",
      lty = 1,
      lwd = 1)
lines(1-tree.min.spe2, tree.min.sen2,      
      col = "orange",
      lty = 1,
      lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
legend("bottomright", c("Logistic", "bagg", "Tree Min"),
       lty = c(1,1,1), lwd = rep(1,3),
       col = c("red", "blue", "orange"),
       bty="n",cex = 0.8)
## annotation - AUC
text(0.8, 0.46, paste("Logistic AUC: ", round(auc.logit2,4)), cex = 0.8)
text(0.8, 0.4, paste("Bagg AUC: ", round(auc.bagg,4)), cex = 0.8)
text(0.8, 0.34, paste("Tree Min AUC: ", round(auc.tree.min2,4)), cex = 0.8)
```

# Random Forest Regression

In this section, we do random forest. Random forest is similar to bagging, except that at each split, only a random subset of features is used whereas all features are considered in bagging.

Below we split the data into training/testing and generate the optimal hyperparameters shown in the table (ntree, mtry, nodesize, maxnodes, and best.rmse).

```{r}
set.seed(123)  # For reproducibility
train.index7 <- sample(1:nrow(complete_habits_data), 0.8 * nrow(complete_habits_data))
train.data7 <- complete_habits_data[train.index7, ]
test.data7<- complete_habits_data[-train.index7, ]
```

```{r}
# Create a grid (data frame) of hyperparameter combinations to test
hyper.grid2 <- expand.grid(
  ntree = c(100, 300, 500),    
  mtry = c(3, 5, 7), # Dependent on the total number features available in the data
  nodesize = c(1, 3, 5), 
  maxnodes = c(5, 10, 20, NULL)
)
##
# Initialize results storage
results3 <- data.frame()  # combination of hyperparameters and corresponding RMSE
best.rmse <- Inf         # place-holder of RMSE with initial value inf
best.params3 <- list()    # update the hyperparameter list according to the best rmse

# Set up k-fold cross-validation
k <- 5                   # 5-fold cross-validation
n0 <- dim(train.data7)[1] # size of the training data 
fold.size <- floor(n0/k) # fold size. Caution: floor() should be used. 
                         # round( ,0) should be used. why?

# Loop through each hyperparameter combination
for(i in 1:nrow(hyper.grid2)) {           
  current.params <- hyper.grid2[i, ]   # vector of hyperparameters for cross validation
  cv.errors <- numeric(k)             # store RMSE from cross-validation
  
  # Perform k-fold cross-validation
  for(j in 1:k) {
    # Split into training and validation folds
    valid.indices <- (1 + fold.size*(j-1)):(fold.size*j)  # CV observation ID vector
    cv.train <- train.data7[-valid.indices, ]   # training data in cross-validation
    cv.valid <- train.data7[valid.indices, ]    # testing data in cross-validation
    
    # Train model with current parameters
    rf.model <- randomForest(
      exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
      data = cv.train,
      ntree = current.params$ntree,
      mtry = current.params$mtry,
      nodesize = current.params$nodesize,
      maxnodes = current.params$maxnodes,
      importance = TRUE
    )
    
    # Make predictions on validation set
    preds2 <- predict(rf.model, newdata = cv.valid)    
    
    # Calculate RMSE
    cv.errors[j] <- sqrt(mean((preds2 - cv.valid$exam_score)^2))
  }
  
  # Average RMSE across folds
  avg.rmse <- mean(cv.errors)    
  
  # Store results: the data frame defined to store combinations of hyperparameters
  # and the resulting mean RMSEs from cross-validation
  results3 <- rbind(results3, data.frame(
    ntree = current.params$ntree,
    mtry = current.params$mtry,
    nodesize = current.params$nodesize,
    maxnodes = ifelse(is.null(current.params$maxnodes), "NULL", current.params$maxnodes),
    rmse = avg.rmse
  ))
  
  # Update best parameters if current model is better
  if(avg.rmse < best.rmse) {
    best.rmse <- avg.rmse
    best.params <- current.params
  }
  # Print progress: It is always a good idea to print something out in loops
  # cat(paste0("Completed ", i, "/", nrow(hyper.grid), " - RMSE: ", round(avg.rmse, 4), "\n"))
}  
pander(data.frame(cbind(current.params,best.rmse)))    # resulting tuned hyperparameters
```

Below we train the final model using the optimal hyperparameters generated in the above section and make predictions on the test set. The scatterplot below shows the actual vs predicted Values of the exam scores of the test set.

```{r}
# Train final model with best parameters on full training set
final.rf <- randomForest(
  exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = train.data7,
  ntree = best.params$ntree,
  mtry = best.params$mtry,
  nodesize = best.params$nodesize,
  maxnodes = best.params$maxnodes,
  importance = TRUE
)

# View model summary
# print(final.rf)
# Make predictions on test set
test.preds <- predict(final.rf, newdata = test.data7)

# Calculate test RMAE
test.mae <- mean(abs(test.preds - test.data7$exam_score))
# Calculate test RMSE
test.rmse <- sqrt(mean((test.preds - test.data7$exam_score)^2))
# Calculate R-squared
test.r2 <- 1 - sum((test.data7$exam_score - test.preds)^2) / sum((test.data7$exam_score - mean(test.data7$exam_score))^2)
# cat("Test R-squared:", test.r2, "\n")

### Performance vector
RF.performance = c(test.mae, test.rmse, test.r2)

# Plot actual vs predicted values
plot.data <- data.frame(Actual = test.data7$exam_score, Predicted = test.preds)
ggplot(plot.data, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Actual vs Predicted Values",
       x = "Actual Median Value ($1000s)",
       y = "Predicted Median Value ($1000s)") +
  theme_minimal()
```

Below we compare the performances of standard OLS regression, base regression tree, bagged, and the random forest models. Based on the RMSE values below, we see that base tree, bagged, and the random forest perform similarly.

```{r}
## ordinary LSE regression model with step-wise variable selection
lse.fit2 <- lm(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data7)
AIC.fit3 <- stepAIC(lse.fit2, direction="both", trace = FALSE)
pred.lse3 <- predict(AIC.fit3, test.data7)
mae.lse3 <- mean(abs(test.data7$exam_score - pred.lse3))      # mean absolute error
mse.lse3 <- mean((test.data7$exam_score - pred.lse3)^2)       # mean square error
rmse.lse3 <- sqrt(mse.lse3)                            # root mean square error
r.squared.lse3 <- (cor(test.data7$exam_score, pred.lse3))^2 # r-squared

### base regression tree
tree.model5 <- rpart(exam_score ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data7,
                    method = "anova",     # For regression
                    control = rpart.control(
                      minsplit = 20,    # 3. Stopping rule: min observations to split
                      minbucket = 7,    # Min observations in terminal node
                      cp = seq(0, 0.05, 20), # Complexity parameter
                      maxdepth = 5      # Maximum tree depth
                    ))
cp.table4 <- tree.model5$cptable
min.xerror4 <- min(cp.table4[, "xerror"])
min.cp.row4 <- which.min(cp.table4[, "xerror"])
min.cp5 <- cp.table[min.cp.row4, "CP"]
## 
pruned.tree.min.cp4 <- prune(tree.model5, cp = min.cp5)
pred.min.cp4 <- predict(pruned.tree.min.cp4, newdata = test.data7)
# performance measures
mae.tree.min.cp4 <- mean(abs(test.data7$exam_score - pred.min.cp4))   # MAD
mse.tree.min.cp4 <- mean((test.data7$exam_score - pred.min.cp4)^2)
rmse.tree.min.cp4 <- sqrt(mse.tree.min.cp4)                    # MSE
r.squared.tree.min.cp4 <- cor(test.data7$exam_score, pred.min.cp4)^2  # R.sq

### bagging regression: from the previous section
BaggPerf <- baggedError 

### Performance Comparison Table
Errors4 <- cbind(MAE = c(mae.lse3, mae.tree.min.cp4, BaggPerf[3], RF.performance[1]),
                RMSE = c(rmse.lse3, rmse.tree.min.cp4, BaggPerf[1], RF.performance[2]),
                r.squared = c(r.squared.lse3, r.squared.tree.min.cp4, BaggPerf[2],RF.performance[3]))
rownames(Errors4) = c("LS Regression", "Regression Tree", "BAGGING", "Random Forest")
pander(Errors4)
```

THe variable importance plots shown below are based on the MSE and node impurity. According to the %IncMSE plot, study hours, mental health, exercise, social media, and netflix are the most important features. The IncNodePurity plot shows the same variables. 

```{r}
varImpPlot(final.rf, pch = 19, main = "Variable Importance")
```


# Random Forest Classification

In this section, we do random forest for our binary response variable 'pass'.

The table below shows the optimal hyperparameters.

```{r}
set.seed(123)
train.idx <- createDataPartition(complete_habits_data$pass, p = 0.8, list = FALSE)
train.data8 <- complete_habits_data[train.idx, ]
test.data8 <- complete_habits_data[-train.idx, ]

train.data8$pass <- factor(train.data8$pass)
test.data8$pass <- factor(test.data8$pass)

# cross-validation setting
k = 5
train.size <- dim(train.data8)[1]  # training data size
fold.size2 <- floor(train.size/k)  # fold size

##
tune.grid <- expand.grid(
  mtry = c(2, 3, 4, 5),
  ntree = c(100, 300, 500),
  nodesize = c(1, 3, 5, 10),
  maxnodes = c(5, 10, 20, NULL)
)
### store hyperparameters and avg of cv AUC
results4 <- data.frame()
best.auc <- 0.5         # place-holder of AUC, 0.5 = random guess
best.hyp.params <- list()   # update the hyperparameter list according to the best auc

##
for (i in 1:nrow(tune.grid)){
  current.tune.params <- tune.grid[i, ]  # subset of DATA FRAME!! 
  cv.auc <- rep(0,k)
  ##
  for (j in 1:k){
      cv.id <- (1 + (j-1)*fold.size2):(j*fold.size2)
      cv.train2 <- train.data8[-cv.id, ]
      cv.valid2 <- train.data8[cv.id, ]
      ##
       rf.cv <- randomForest(
                   pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
                   data = cv.train2,
                   mtry = current.tune.params$mtry,
                   ntree = current.tune.params$ntree,
                   nodesize = current.tune.params$nodesize,
                   maxnodes = current.tune.params$maxnodes)
       ##
       prob.cv <- predict(rf.cv, cv.valid2, type = "prob")[, "1"]
       cv.auc[j] <- auc(roc(cv.valid2$pass, prob.cv))
      }
      ##
      # Average RMSE across folds
      avg.auc <- mean(cv.auc)  
      ##
      # Store results: the data frame defined to store combinations of hyperparameters
      # and the resulting mean RMSEs from cross-validation
      results4 <- rbind(results4, data.frame(
                   mtry = current.tune.params$mtry,
                   ntree = current.tune.params$ntree,
                   nodesize = current.tune.params$nodesize,
                   maxnodes = current.tune.params$maxnodes,
                   auc = avg.auc))
  
     # Update best parameters if current model is better
     if(avg.auc > best.auc) {
           best.auc <- avg.auc
           best.hyp.params <- current.tune.params }
    
}
pander(data.frame(cbind(best.hyp.params,best.auc)))
```

Below we train the final random forest model using the optimal hyperparameters from above and generate the Area under the curve below (0.9707).

```{r}
final.rf.cls <- randomForest(
      pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
      data = train.data8,
      ntree = best.hyp.params$ntree,
      mtry = best.hyp.params$mtry,
      nodesize = best.hyp.params$nodesize,
      maxnodes = current.tune.params$maxnodes,
      importance = TRUE
     )

test.pred <- predict(final.rf.cls, test.data8)
test.prob <- predict(final.rf.cls, test.data8, type = "prob")
rf.roc <- roc(test.data8$pass, test.prob[, "1"])
test.auc <- auc(rf.roc)
#confusionMatrix(test.pred, test.data$diabetes)
test.auc
```

The variable importance plots below show that study hours, mental health, social media, sleep, and Netflix are the most important features in predicting 'pass'.

```{r}
varImpPlot(final.rf.cls, pch = 19, main = "Variable Importance of RF Classification" )

```

Below we compare the performances of standard logistic, base tree, bagged, and random forest models. Based on the areas below, bagged and random forest does better than the base decision tree.

```{r}
########################
###  logistic regression
logit.fit3 <- glm(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, data = train.data8, family = binomial)
AIC.logit3 <- step(logit.fit3, direction = "both", trace = 0)
pred.logit3 <- predict(AIC.logit3, test.data8, type = "response")

# Build the initial classification tree
tree.model6 <- rpart(pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation, 
                    data = train.data8,
                    method = "class",   # classification tree
                    parms = list(split = "gini",  # Using Gini index
                                 # FN cost = 1, FP cost = 0.5
                                 loss = matrix(c(0, 0.5, 1, 0), nrow = 2)  
                                 ),
                    control = rpart.control(minsplit = 15,  # Min 15 obs to split
                                           minbucket = 5,   # Min 7 obs in leaf
                                           # Complexity parameter
                                           cp = 0.001, # complex parameter
                                           maxdepth = 5))   # Max tree depth
# Find the optimal cp value that minimizes cross-validated error
min.cp6 <- tree.model6$cptable[which.min(tree.model6$cptable[,"xerror"]),"CP"]
pruned.tree.min5 <- prune(tree.model6, cp = min.cp6)


# BAGGING
# Create a grid of hyperparameter combinations
hyper.grid3 <- expand.grid(
  nbagg = c(25, 50, 100),
  minsplit = c(5, 10, 20),
  maxdepth = c(5, 10, 20),
  cp = c(0.01, 0.001)
)
# Initialize a results dataframe
results5 <- data.frame() # store values of tuned hyperparameters
best.accuracy2 <- 0      # store accuracy
best.params4 <- list()   # store best values of hyperparameter

# Loop through each hyperparameter combination
for(i in 1:nrow(hyper.grid3)) {
  # Get current hyperparameters
  params2 <- hyper.grid3[i, ]
  
  # Set rpart control parameters
  rpart.control2 <- rpart.control(
    minsplit = params2$minsplit,
    maxdepth = params2$maxdepth,
    cp = params2$cp
  )
  
  # Train bagging model
  bag.model2 <- bagging(
    pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
    data = train.data8,
    nbagg = params2$nbagg,
    coob = TRUE,
    control = rpart.control2
  )
  
  # Make predictions: default cut-off 0.5
  preds3 <- predict(bag.model2, newdata = test.data8)
  
  # Calculate accuracy
  cm2 <- confusionMatrix(preds3, test.data8$pass)
  accuracy2 <- cm2$overall["Accuracy"]
  
  # Store results
  results5 <- rbind(results5, data.frame(
    nbagg = params2$nbagg,
    minsplit = params2$minsplit,
    maxdepth = params2$maxdepth,
    cp = params2$cp,
    Accuracy = accuracy2
  ))
  
  # Update best parameters if current model is better
  if(accuracy2 > best.accuracy2) {
    best.accuracy2 <- accuracy2
    best.params4 <- params2
  }
}
# Set rpart control with best parameters
best.control2 <- rpart.control(
  minsplit = best.params4$minsplit,
  maxdepth = best.params4$maxdepth,
  cp = best.params4$cp
)

# Train final model
final.bag.model2 <- bagging(
  pass ~ age + gender + study_hours_per_day + social_media_hours + netflix_hours + part_time_job + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participation,
  data = train.data8,
  nbagg = best.params4$nbagg,
  coob = TRUE,
  control = best.control2
)
#pred.prob.bagg <- predict(final.bag.model, newdata = test.data, type = "prob")[,2]
###
# Prediction with three candidate models
pruned.tree.min6 <- prune(tree.model6, cp = min.cp6)
pred.prob.min3 <- predict(pruned.tree.min6, test.data8, type = "prob")[,2]
pred.prob.bagg2 <- predict(final.bag.model2, newdata = test.data8, type = "prob")[,2]
##
# ROC object
roc.tree.min3 <- roc(test.data8$pass, pred.prob.min3)
roc.logit3 <- roc(test.data8$pass, pred.logit3)
roc.bagg2 <- roc(test.data8$pass, pred.prob.bagg2)
roc.rf <- roc(test.data8$pass, test.prob[, "1"])

##
##
### Sen-Spe
bagg.sen2 <- roc.bagg2$sensitivities
bagg.spe2 <- roc.bagg2$specificities
#
tree.min.sen3 <- roc.tree.min3$sensitivities
tree.min.spe3 <- roc.tree.min3$specificities
#
logit.sen3 <- roc.logit3$sensitivities
logit.spe3 <- roc.logit3$specificities
#
rf.sen <- roc.rf$sensitivities
rf.spe <- roc.rf$specificities

## AUC
auc.bagg2 <- roc.bagg2$auc
auc.tree.min3 <- roc.tree.min3$auc
auc.logit3 <- roc.logit3$auc
auc.rf <- roc.rf$auc
###
###
plot(1-logit.spe3, logit.sen3,  
     xlab = "1 - specificity",
     ylab = "sensitivity",
     col = "darkred",
     type = "l",
     lty = 1,
     lwd = 1,
     main = "ROC: Classification Models")
lines(1-bagg.spe2, bagg.sen2, 
      col = "blue",
      lty = 1,
      lwd = 1)
lines(1-tree.min.spe3, tree.min.sen3,      
      col = "orange",
      lty = 1,
      lwd = 1)
lines(1-rf.spe, rf.sen,      
      col = "purple4",
      lty = 1,
      lwd = 1)
abline(0,1, col = "skyblue3", lty = 2, lwd = 2)
legend("bottomright", c("Logistic", "bagg", "Tree Min", "Random Forest"),
       lty = c(1,1,1), lwd = rep(1,3),
       col = c("red", "blue", "orange", "purple4"),
       bty="n",cex = 0.8)
## annotation - AUC
text(0.8, 0.46, paste("Logistic AUC: ", round(auc.logit3,4)), cex = 0.8)
text(0.8, 0.4, paste("Bagg AUC: ", round(auc.bagg2,4)), cex = 0.8)
text(0.8, 0.34, paste("Tree Min AUC: ", round(auc.tree.min3,4)), cex = 0.8)
text(0.8, 0.28, paste("Forest AUC: ", round(auc.rf,4)), cex = 0.8)
```

# Results/Conclusion

Study hours, mental health, social media, sleep, Netflix, and exercise were consistently shown to be the most important features in predicting exam score/pass, out of all the models performed.

Performing CART alone may not be enough for prediction since they only generate a single decision tree. Bagging can produce a more robust model by generating/aggregating multiple trees. Random Forest may be even better than bagging because they use a random subset of features instead of all features.










