1 Introduction

This project uses the same dataset as project 3.

This dataset is titled “Student Habits vs Academic Performance” and shows various explanatory variables and one target/response variable (exam_score). The purpose of collecting this dataset is to see what factors may affect exam scores. This dataset was collected from Kaggle. This dataset contains 1000 observations and 16 variables (15 feature variables and 1 target variable).

The student_id serves as the identifier. The age (numerical) shows each persons age. The gender (categorical) shows each persons gender. The study_hours_per_day shows hours studied per day. The social_media_hours shows hours used daily on social media. The netflix_hours shows hours used daily on netflix. The part_time_job (no/yes) shows if the person has a part time job. The attendance percentage shows the percentage of classes attended. The sleep_hours shows amount of sleep per night. The diet_quality (poor/fair/good) shows quality of ones diet. The exercise_frequency shows how many times one exercises per week. The parental_education_level (none/High School/bachelor/master) shows the highest level of education of ones parents. The internet_quality(poor/avg/good) shows how good ones wi-fi is. The mental_health_rating (1-10) shows how one rates their mental health. The extracurricular_participation (n/y) shows if one participates in extracurricular activities. The target/response variable is exam_score (continuous).

Based on this dataset, we can formulate a research question. The question we can form is what variables are the most important in predicting exam score. For regression, we can use exam_score (continuous) as our response variable. For classification, we can create a binary response variable with 1 (pass) and 0 (fail). We can use the condition if exam score is greater or less than 0.60.

2 Read in Dataset

Read in dataset. See dataset structure.

habits2 <- read.csv("student_habits_performance.csv")
str(habits2)
'data.frame':   1000 obs. of  16 variables:
 $ student_id                   : chr  "S1000" "S1001" "S1002" "S1003" ...
 $ age                          : int  23 20 21 23 19 24 21 21 23 18 ...
 $ gender                       : chr  "Female" "Female" "Male" "Female" ...
 $ study_hours_per_day          : num  0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
 $ social_media_hours           : num  1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
 $ netflix_hours                : num  1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
 $ part_time_job                : chr  "No" "No" "No" "No" ...
 $ attendance_percentage        : num  85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
 $ sleep_hours                  : num  8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
 $ diet_quality                 : chr  "Fair" "Good" "Poor" "Poor" ...
 $ exercise_frequency           : int  6 6 1 4 3 1 2 0 3 5 ...
 $ parental_education_level     : chr  "Master" "High School" "High School" "Master" ...
 $ internet_quality             : chr  "Average" "Average" "Poor" "Good" ...
 $ mental_health_rating         : int  8 8 1 1 1 4 4 8 1 10 ...
 $ extracurricular_participation: chr  "Yes" "No" "No" "Yes" ...
 $ exam_score                   : num  56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...

3 EDA

The below figure shows the distribution of the gender variable.

ggplot(habits2, aes(x = gender)) + 
  
  geom_bar() +
  labs(title = "Gender")

The below figure shows the distribution of the part_time_job variable.

ggplot(habits2, aes(x = part_time_job)) + 
  
  geom_bar() +
  labs(title = "part_time_job")

The below figure shows the distribution of the diet_quality variable.

ggplot(habits2, aes(x = diet_quality)) + 
  
  geom_bar() +
  labs(title = "diet_quality")

The below figure shows the distribution of the parental_education_level variable.

ggplot(habits2, aes(x = parental_education_level)) + 
  
  geom_bar() +
  labs(title = "parental_education_level")

The below figure shows the distribution of the internet_quality variable.

ggplot(habits2, aes(x = internet_quality)) + 
  
  geom_bar() +
  labs(title = "internet_quality")

The below figure shows the distribution of the extracurricular_participation variable.

ggplot(habits2, aes(x = extracurricular_participation)) + 
  
  geom_bar() +
  labs(title = "extracurricular_participation")

The below figure shows the distribution of the age variable.

ggplot(data = habits2, aes(x = age)) + 
  geom_boxplot() + 
  
  labs(title = "age")

The below figure shows the distribution of the study_hours_per_day variable.

ggplot(data = habits2, aes(x = study_hours_per_day)) + 
  geom_boxplot() + 
  
  labs(title = "study_hours_per_day")

The below figure shows the distribution of the social_media_hours variable.

ggplot(data = habits2, aes(x = social_media_hours)) + 
  geom_boxplot() + 
  
  labs(title = "social_media_hours")

The below figure shows the distribution of the netflix_hours variable.

ggplot(data = habits2, aes(x = netflix_hours)) + 
  geom_boxplot() + 
  
  labs(title = "netflix_hours")

The below figure shows the distribution of the attendance_percentage variable.

ggplot(data = habits2, aes(x = attendance_percentage)) + 
  geom_boxplot() + 
  
  labs(title = "attendance_percentage")

The below figure shows the distribution of the sleep_hours variable.

ggplot(data = habits2, aes(x = sleep_hours)) + 
  geom_boxplot() + 
  
  labs(title = "sleep_hours")

The below figure shows the distribution of the exercise_frequency variable.

ggplot(data = habits2, aes(x = exercise_frequency)) + 
  geom_boxplot() + 
  
  labs(title = "exercise_frequency")

The below figure shows the distribution of the mental_health_rating variable.

ggplot(data = habits2, aes(x = mental_health_rating)) + 
  geom_boxplot() + 
  
  labs(title = "mental_health_rating")

4 Imputation/Feature Engineering.

The variables part_time_job, diet_quality, internet_quality, and extracurricular_participation are converted to categorical/factor.

habits2$part_time_job <- as.factor(habits2$part_time_job)
habits2$diet_quality <- factor(habits2$diet_quality,levels = c("Poor", "Fair", "Good"))

habits2$internet_quality <- factor(habits2$internet_quality, levels = c("Poor", "Average", "Good"))
habits2$extracurricular_participation <- as.factor(habits2$extracurricular_participation)

The variable gender contains the value ‘Other’. Since we only want to work with Male/female, we set ‘Other’ to missing and use MICE imputation.

habits2$gender[habits2$gender== "Other"] = NA
habits2$gender <- as.factor(habits2$gender)

init3 <- mice(habits2, maxit = 0)
init3$method
                   student_id                           age 
                           ""                            "" 
                       gender           study_hours_per_day 
                     "logreg"                            "" 
           social_media_hours                 netflix_hours 
                           ""                            "" 
                part_time_job         attendance_percentage 
                           ""                            "" 
                  sleep_hours                  diet_quality 
                           ""                            "" 
           exercise_frequency      parental_education_level 
                           ""                            "" 
             internet_quality          mental_health_rating 
                           ""                            "" 
extracurricular_participation                    exam_score 
                           ""                            "" 
imp3 <- mice(habits2, method = c("","", "logreg", "",  "", "", "", "", "", "", "", "", "", "", "", ""), 
                 maxit = 10,  
                 m = 5, 
                 seed=123,
                 print=F)     



complete_habits_data2 <- complete(imp3)

For parental_education_level, we convert both Bachelor and Master to College to reduce categories.

complete_habits_data2$parental_education_level[complete_habits_data2$parental_education_level == 'Bachelor'] <- 'College'
complete_habits_data2$parental_education_level[complete_habits_data2$parental_education_level == 'Master'] <- 'College'

complete_habits_data2$parental_education_level <- factor(complete_habits_data2$parental_education_level,levels = c("None", "High School", "College"))

str(complete_habits_data2)
'data.frame':   1000 obs. of  16 variables:
 $ student_id                   : chr  "S1000" "S1001" "S1002" "S1003" ...
 $ age                          : int  23 20 21 23 19 24 21 21 23 18 ...
 $ gender                       : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 1 1 1 1 ...
 $ study_hours_per_day          : num  0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
 $ social_media_hours           : num  1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
 $ netflix_hours                : num  1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
 $ part_time_job                : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
 $ attendance_percentage        : num  85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
 $ sleep_hours                  : num  8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
 $ diet_quality                 : Factor w/ 3 levels "Poor","Fair",..: 2 3 1 1 2 2 3 2 3 3 ...
 $ exercise_frequency           : int  6 6 1 4 3 1 2 0 3 5 ...
 $ parental_education_level     : Factor w/ 3 levels "None","High School",..: 3 2 2 3 3 3 3 3 3 3 ...
 $ internet_quality             : Factor w/ 3 levels "Poor","Average",..: 2 2 1 3 3 2 1 2 3 3 ...
 $ mental_health_rating         : int  8 8 1 1 1 4 4 8 1 10 ...
 $ extracurricular_participation: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 2 ...
 $ exam_score                   : num  56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...

Since neural networks can only work with numerical variables, we need to transform all the categorical variables into numerical format using a mix of label encoding (ordinal) and one hot encoding (nominal).

complete_habits_data2 is the complete dataset that will be used for the project.

complete_habits_data2$diet_quality <- as.numeric(factor(complete_habits_data2$diet_quality))
complete_habits_data2$parental_education_level <- as.numeric(factor(complete_habits_data2$parental_education_level))
complete_habits_data2$internet_quality <- as.numeric(factor(complete_habits_data2$internet_quality))

encoded_data <- model.matrix(~ gender - 1, data = complete_habits_data2)
encoded_data2 <- model.matrix(~ part_time_job - 1, data = complete_habits_data2)
encoded_data3 <- model.matrix(~ extracurricular_participation - 1, data = complete_habits_data2)

complete_habits_data2 <- cbind(complete_habits_data2, encoded_data, encoded_data2, encoded_data3)

5 Perceptron Regression

In this section, we perform a perceptron model for the continuous response variable exam_score.

The perceptron is the simplest form of a neural network. The perceptron model takes input features, computes their weighted sum, and passes the result through an activation function to generate the output. The linear activation function is used.

The below code chunk performs the data splitting and the normalization for all variables.

set.seed(123)

complete_habits_data2 <- complete_habits_data2[ -c(1,3,7,15) ]

# Normalize data (0-1 range)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# normalize all features including the target variable
habits.scaled <- as.data.frame(lapply(complete_habits_data2, normalize))

# Train-test split (70-30)
#set.seed(123)
##
sample.size <- dim(habits.scaled)[1]
train.indices <- sample(1:sample.size, round(0.8*sample.size))
##

train.data.norm <- habits.scaled[train.indices, ]
test.data.norm <- habits.scaled[-train.indices, ]
##
train.orig <- complete_habits_data2[train.indices, ]
test.orig <- complete_habits_data2[-train.indices, ]

unscale <- function(x, original) {
  return(x * (max(original) - min(original)) + min(original))
}

The table below shows all possible combination of hyperparameter values of learningrate, threshold, stepmax, and RMSE.

# Define the tuning grid
tune.grid.nn <- expand.grid(
   learningrate = c(0.001, 0.01, 0.1, 0.5, 1),
   threshold = c(0.01, 0.05, 0.1, 0.5),
   stepmax = c(1e5, 1e6)  # Add stepmax to prevent infinite training
)

# Custom training function for neuralnet()
neuralnet.train <- function(learningrate, threshold, stepmax) {
  model <- neuralnet(
    exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
    data = train.data.norm,
    hidden = 0,  # Perceptron (no hidden layer)
    linear.output = TRUE,  # For regression
    learningrate = learningrate,
    threshold = threshold,
    stepmax = stepmax
  )
  # Calculate RMSE on training data
  pred <- predict(model, train.data.norm[, -ncol(train.data.norm)])
  

  rmse <- sqrt(mean((pred - train.data.norm$exam_score)^2))
  return(rmse)
}

# Perform grid search:
# using apply the () function to call neuralnet.train() using the components of the
# row vector in the tune.grid.nn (data frame of combinations of hyperparameters).
#
results <- apply(tune.grid.nn, 1, function(x) {
  neuralnet.train(x["learningrate"], x["threshold"], x["stepmax"])
})

##
# Combine results with parameter combinations
tune.results <- cbind(tune.grid.nn, RMSE = results)
##
pander(tune.results)
learningrate threshold stepmax RMSE
0.001 0.01 1e+05 0.5525
0.01 0.01 1e+05 0.4814
0.1 0.01 1e+05 0.7533
0.5 0.01 1e+05 0.7962
1 0.01 1e+05 1.134
0.001 0.05 1e+05 0.8093
0.01 0.05 1e+05 0.7343
0.1 0.05 1e+05 1.157
0.5 0.05 1e+05 0.2031
1 0.05 1e+05 0.6438
0.001 0.1 1e+05 0.4728
0.01 0.1 1e+05 0.907
0.1 0.1 1e+05 0.5015
0.5 0.1 1e+05 0.1872
1 0.1 1e+05 1.185
0.001 0.5 1e+05 0.968
0.01 0.5 1e+05 0.3787
0.1 0.5 1e+05 0.6809
0.5 0.5 1e+05 0.2296
1 0.5 1e+05 0.831
0.001 0.01 1e+06 1.137
0.01 0.01 1e+06 0.3102
0.1 0.01 1e+06 0.4608
0.5 0.01 1e+06 1.646
1 0.01 1e+06 0.6958
0.001 0.05 1e+06 0.3846
0.01 0.05 1e+06 1.085
0.1 0.05 1e+06 0.4987
0.5 0.05 1e+06 0.3708
1 0.05 1e+06 1.255
0.001 0.1 1e+06 0.7309
0.01 0.1 1e+06 0.6201
0.1 0.1 1e+06 0.4782
0.5 0.1 1e+06 0.9727
1 0.1 1e+06 0.5803
0.001 0.5 1e+06 0.9684
0.01 0.5 1e+06 0.1241
0.1 0.5 1e+06 0.7226
0.5 0.5 1e+06 0.9379
1 0.5 1e+06 0.3122

The table below shows the best combination of hyperparameter values.

# Find the best combination
best.params <- tune.results[which.min(tune.results$RMSE), ]
pander(best.params)
  learningrate threshold stepmax RMSE
37 0.01 0.5 1e+06 0.1241

The final.model.nn below trains the final neural network model using the best combination of hyperparameter values from above.

final.model.nn <- neuralnet(
  exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.norm,
  hidden = 0,
  linear.output = TRUE,
  learningrate = best.params$learningrate,
  threshold = best.params$threshold,
  stepmax = best.params$stepmax,
  rep = 1  # Multiple repetitions for stability
)

Below we use the training model to make predictions on the test data. We also show how the perceptron performs compared to standard linear regression and we see that they are similar.

full.predictions <- predict(final.model.nn, test.data.norm)
##
pred.unscale <- unscale(full.predictions , complete_habits_data2$exam_score)
###
MSE.neuralnet <- mean((pred.unscale  -test.orig$exam_score)^2)
###
r.sq.neuralnet <- (cor(pred.unscale, test.orig$exam_score))^2


###############

# Convert predictions back to original scale
#predictions.original <- predictions * (max(test.orig$exam_score) - min(test.orig$exam_score)) + min(test.orig$exam_score)

actual.original <- test.orig$exam_score

# Calculate R-squared
r.squared <- (cor(pred.unscale,actual.original))^2


#cat("R-squared:", r.squared, "\n")
Perceptron.MSE <- mean((actual.original - pred.unscale)^2)
#cat("Perceptron MSE:", Perceptron.MSE, "\n")

# Compare with linear regression for benchmarking
lm.model <- lm(exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.orig)
lm.pred <- predict(lm.model, test.orig)
lm.mse <- mean((test.orig$exam_score - lm.pred)^2)
#cat("Linear Regression MSE:", lm.mse, "\n")
lm.r.sq <- summary(lm.model)$r.squared

################
Perceptron <- c(Perceptron.MSE, r.squared)
LM <- c(lm.mse, lm.r.sq)
neuralnet <- c(MSE.neuralnet, r.sq.neuralnet)
Performace.metrics.all <- data.frame(Perceptron=Perceptron, 
                                 LM = LM, 
                                 neuralnet = neuralnet )
rownames(Performace.metrics.all) <- c("MSE", "r.sq")
pander(Performace.metrics.all)
  Perceptron LM neuralnet
MSE 30.66 29.89 30.66
r.sq 0.9108 0.8985 0.9108

6 Perceptron Classification

In this section, we do perceptron for the binary response variable ‘pass’. We create the pass variable below using the assumption that 60 or above is a pass (1, 0 otherwise).

The below code generates the optimal hyperparameters for learningrate, threshold, and accuracy.

complete_habits_data2 <- transform(complete_habits_data2, pass=ifelse(exam_score >= 60, 1, 0))



habits.scaled <- cbind(habits.scaled, complete_habits_data2$pass)

set.seed(123)



colnames(habits.scaled) <- c('age','study_hours_per_day', 'social_media_hours', 
                       'netflix_hours', 
                       'attendance_percentage', 'sleep_hours' , 'diet_quality',
                       'exercise_frequency', 'parental_education_level','internet_quality',
                       'mental_health_rating','exam_score','genderFemale','genderMale','part_time_jobNo', 'part_time_jobYes','extracurricular_participationNo','extracurricular_participationYes','pass')

sample.size2 <- dim(habits.scaled)[1]
train.indices2 <- sample(1:sample.size2, round(0.8*sample.size2))


train.data.cls <- habits.scaled[train.indices2, ]
test.data.cls <- habits.scaled[-train.indices2, ]

train.orig2 <- complete_habits_data2[train.indices2, ]
test.orig2 <- complete_habits_data2[-train.indices2, ]

#############

## Grid Search Setup
# Define the hyperparameter grid
hyper.grid.cls <- expand.grid(
  learningrate = c(0.001, 0.01, 0.05, 0.1, 0.2),
  threshold = c(0.01, 0.05)  # Stopping threshold for partial derivatives
 )


k <- 5
fold.size <- floor(dim(train.data.cls)[1]/k)
# Initialize results storage
results <- data.frame(
  learningrate = numeric(),
  threshold = numeric(),
  accuracy = numeric(),
  stringsAsFactors = FALSE
)


## Perform Grid Search with Cross-Validation
for(i in 1:nrow(hyper.grid.cls)) {
  lr <- hyper.grid.cls$learningrate[i]
  th <- hyper.grid.cls$threshold[i]
  
  #cat("\nTesting combination", i, "of", nrow(hyper.grid.cls), ": learningrate =", lr, ", threshold =", th, "\n")
  
  # 
  fold.accuracies <- numeric(k)
  
  # 
  for(fold in 1:k) {
    # Split into training and validation sets
    valid.indices <- (1 + (fold-1)*fold.size):(fold*fold.size)
    train.fold <- train.data.cls[-valid.indices, ]
    valid.fold <- train.data.cls[valid.indices, ]
    
    # Train the perceptron
    set.seed(123)
    model.sigmoid <- neuralnet(
      pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.fold,
      hidden = 0,  # Perceptron has no hidden layers
      linear.output = FALSE,
      learningrate = lr,
      act.fct = "logistic",
      algorithm = "rprop+", # The resilient backpropagation with weight backtracking
      threshold = th,
      stepmax = 1e5  # Increased to ensure convergence
    )
    
    # Make predictions
    preds <- predict(model.sigmoid, valid.fold)
    pred.classes <- ifelse(preds > 0.6, 1, 0)  # default threshold 0.5
    
    # Calculate accuracy
    fold.accuracies[fold] <- mean(pred.classes == valid.fold$pass)
  }
  
  # Store average accuracy for this hyperparameter combination
  results <- rbind(results, data.frame(
    learningrate = lr,
    threshold = th,
    accuracy = mean(fold.accuracies)
  ))
}

## Analyze Results
# Find the best combination
best.combination <- results[which.max(results$accuracy), ]
#cat("\nBest hyperparameter combination:\n")
pander(best.combination)
  learningrate threshold accuracy
6 0.001 0.05 0.9012

The final.sigmoid.model below is the final trained model using the optimal hyperparameters from above.

## Train Final Model with Best Hyperparameters
final.sigmoid.model <- neuralnet(
  pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.cls,
  hidden = 0,
  linear.output = FALSE,
  learningrate = best.combination$learningrate,
  threshold = best.combination$threshold,
  act.fct = "logistic",
  algorithm = "rprop+", # The resilient backpropagation with weight backtracking
  stepmax = 1e5
)

#plot(final.sigmoid.model)

Below we make predictions on the test set and compare the perceptron to standard logistic regression and find that they are similar in performance.

## Evaluate on Test Set
pred.sigmoid <- predict(final.sigmoid.model, test.data.cls)

###  logistic regression
logit.fit <- glm(pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.data.cls, family = binomial)
AIC.logit <- step(logit.fit, direction = "both", trace = 0)
pred.logit <- predict(AIC.logit, test.data.cls, type = "response")
pred.full <- predict(logit.fit, test.data.cls, type = "response")


## roc
roc.full.logit <- roc(test.data.cls[, ncol(test.data.cls)], pred.full)
roc.AIC.logit <- roc(test.data.cls[, ncol(test.data.cls)], pred.logit)
roc.sigmoid <- roc(test.data.cls[, ncol(test.data.cls)], pred.sigmoid )
## AUC
auc.sigmoid <- roc.sigmoid$auc
auc.full.logit <- roc.full.logit$auc
auc.AIC.logit <- roc.AIC.logit$auc

## spe-sen
sigmoid.spe <- roc.sigmoid$specificities
sigmoid.sen <- roc.sigmoid$sensitivities

full.logit.spe <- roc.full.logit$specificities
full.logit.sen <- roc.full.logit$sensitivities

AIC.logit.spe <- roc.AIC.logit$specificities
AIC.logit.sen <- roc.AIC.logit$sensitivities


# ROC curve
plot(1-sigmoid.spe, sigmoid.sen, col = "blue", type = "l", lty = 1,
     xlab = "1 - specificity",
     ylab = "sensitivity",
     main = "ROC Curves of Perceptron and Logistic Models")
lines(1-full.logit.spe, full.logit.sen, lty = 1, col = "brown")
lines(1-AIC.logit.spe, AIC.logit.sen, lty = 1, col = "steelblue")
abline(0,1, lty =2, col = "red")
text(0.98, 0.3, paste("Perceptron AUC = ", round(auc.sigmoid,4)), col = "blue", cex = 0.8, pos = 2)
text(0.98, 0.25, paste("Full Logit AUC = ", round(auc.full.logit,4)), col = "brown", cex = 0.8, pos = 2)
text(0.98, 0.2, paste("AIC AUC = ", round(auc.AIC.logit,4)), col = "steelblue", cex = 0.8, pos = 2)

7 MLP Regression

In this section, we do multilayer perceptron. MLP uses hidden layers, which is helpful for nonlinear data.

7.1 One-hidden-layer Perceptron

In this section, we use one hidden layer. Below we split the data into train and test.

habits.scaled2 <- as.data.frame(lapply(complete_habits_data2, normalize))

# Set seed for reproducibility
set.seed(123)
N <- length(habits.scaled2$exam_score)
# Create train-test split (70-30)
train.reg.index <- sample(1:N,  floor(0.8*N), replace = FALSE)
train.reg.data <- habits.scaled2[train.reg.index, ]
test.reg.data <- habits.scaled2[-train.reg.index, ]

Below we get the optimal hyperparameters for layer1, learning.rate, activation, and rmse.

# Define grid of hyperparameters
hyper.grid.reg <- expand.grid(
  layer1 = c(5, 10, 15),
  learning.rate = c(0.01, 0.1),
  activation = c("logistic", "tanh")
)

# Initialize results storage
rmse = NULL
#layer1 = NULL
#learningrate = NULL
#activation = NULL

best.reg.rmse <- Inf
best.reg.model <- NULL

# Perform grid search
for(i in 1:nrow(hyper.grid.reg)) {
  # Get current configuration
  layer <- hyper.grid.reg$layer1[i]
  lr <- hyper.grid.reg$learning.rate[i]
  act <- hyper.grid.reg$activation[i]
  
  # Train model
  set.seed(123)
  model.reg <- neuralnet(
      exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.reg.data,
      hidden = layer,
      act.fct = act,
      linear.output = TRUE,  # For regression
      learningrate = lr,
      algorithm = "rprop+",
      stepmax = 1e5 )
  

    # Make predictions
    preds.reg <- predict(model.reg, test.reg.data[, -ncol(test.reg.data)])
    
    # Calculate RMSE
    rmse.reg <- sqrt(mean((preds.reg - test.reg.data$exam_score)^2))
    
    # Store results
    rmse[i] = rmse.reg
    
    # Update best model
    if(rmse.reg < best.reg.rmse) {
      best.reg.rmse <- rmse.reg
      best.reg.model <- model.reg 
      best.reg.params <- hyper.grid.reg[i, ]
    }
}

results.regNN <- hyper.grid.reg
results.regNN$rmse <- rmse


# View results sorted by RMSE
pander(results.regNN[order(results.regNN$rmse), ][1,])
layer1 learning.rate activation rmse
5 0.01 logistic 0.07245

Using the optimal hyperparameters, we train the final model below.

final.reg.model <- neuralnet(
  exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.reg.data,
  hidden = best.reg.params$layer1,
  act.fct = best.reg.params$activation,
  linear.output = TRUE,
  learningrate = best.reg.params$learning_rate,
  algorithm = "rprop+",
  stepmax = 1e5
)

plot(final.reg.model, rep="best")

Below we show a scatterplot of the actual vs predicted values. It does not perform very well due to some outliers.

# Make predictions on test set
pred.NN1 <- predict(final.reg.model, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
rmse.NN1 <- sqrt(mean((pred.NN1  - test.reg.data$exam_score)^2))
mae.NN1 <- mean(abs(pred.NN1  - test.reg.data$exam_score))
r.squared.NN1 <- cor(pred.NN1 , test.reg.data$exam_score)^2

# cat("Performance Metrics:\n")
# cat("RMSE:", rmse, "\n")
# cat("MAE:", mae, "\n")
# cat("R-squared:", r_squared, "\n")

# Plot predictions vs actual
plot.NN1.data <- data.frame(
  Actual = test.reg.data$exam_score,
  Predicted = pred.NN1 
)

ggplot(plot.NN1.data, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "darkred") +
  annotate("text", x=0.85, y=-2, 
           label=paste("R.sq =", round(r.squared.NN1,4)), color="blue") +
  annotate("text", x=0.85, y=-1, 
           label=paste("RMSE =", round(rmse.NN1,4)), color="blue") +
  annotate("text", x=0.85, y=.06, 
           label=paste("  MAE =", round(mae.NN1,4)), color="blue") +
  ggtitle("Actual vs Predicted Values") +
  theme_minimal()

Below shows that the neural network does not perform as well as standard linear regression so 2 layers may help.

# Train linear regression model
lm.model2 <- lm(exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = test.reg.data)

# Make predictions
lm.predictions <- predict(lm.model2, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
lm.rmse <- sqrt(mean((lm.predictions - test.reg.data$exam_score)^2))
lm.mae <- mean(abs(lm.predictions - test.reg.data$exam_score))
lm.r.squared <- cor(lm.predictions, test.reg.data$exam_score)^2


## improvements
RMSE.imp <- round((lm.rmse - rmse.NN1)/lm.rmse * 100,2)
MAE.imp <- round((lm.mae - mae.NN1)/lm.mae * 100, 2)
Rsq.imp <- round((r.squared.NN1 - lm.r.squared)/lm.r.squared * 100,2)

##
Performance.table <- data.frame(
  LM = c(lm.rmse, lm.mae, lm.r.squared),
  NN.1 = c(rmse.NN1, mae.NN1, r.squared.NN1),
  Improvement.percentage = c(RMSE.imp, MAE.imp, Rsq.imp)
)
rownames(Performance.table) <- c("RMSE", "MAE", "R.square")
pander(Performance.table)
  LM NN.1 Improvement.percentage
RMSE 0.06178 0.3757 -508.1
MAE 0.04935 0.09442 -91.32
R.square 0.9216 0.1956 -78.78

Below shows the performance of both standard linear regression and NN with one layer.

# Plot both predictions
comparison.data <- data.frame(
  Actual = test.reg.data$exam_score,
  MLP = pred.NN1,
  Linear = lm.predictions
)

ggplot(comparison.data) +
  geom_point(aes(x = Actual, y = MLP, color = "MLP")) +
  geom_point(aes(x = Actual, y = Linear, color = "Linear Regression")) +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  scale_color_manual(values = c("MLP" = "blue", "Linear Regression" = "red")) +
  labs(title = "Model Comparison: Actual vs Predicted",
       x = "Actual Values",
       y = "Predicted Values",
       color = "Model Type") +
  theme(
    plot.margin = ggplot2::margin(40, 20, 20, 20, unit = "pt"),
    plot.title = element_text(hjust = 0.5, 
                              lineheight = 1.1,
                              vjust = 10)
    )

7.2 Two-hidden-layer Perceptron

Below we see possible hyperparameters.

# Define grid of hyperparameters
hyper.grid.NN2 <- expand.grid(
  layer1 = c(5, 10, 15),
  layer2 = c(0, 3, 5),  # 0 means no second layer
  learning.rate = c(0.01, 0.1),
  activation = c("logistic", "tanh")
)

# Initialize results in storage
results <- data.frame()
best.rmse <- Inf
best.model <- NULL

# Perform grid search
for(i in 1:nrow(hyper.grid.NN2)) {
  # Get current configuration
  layer1 <- hyper.grid.NN2$layer1[i]
  layer2 <- hyper.grid.NN2$layer2[i]
  lr <- hyper.grid.NN2$learning.rate[i]
  act <- hyper.grid.NN2$activation[i]
  
  # Create hidden layers vector
  if(layer2 == 0) {
    hidden.layers <- c(layer1)
  } else {
    hidden.layers <- c(layer1, layer2)
  }
  
  # Train model
  set.seed(123)
  model.NN2 <- tryCatch({
    neuralnet(
      exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.reg.data,
      hidden = hidden.layers,
      act.fct = act,
      linear.output = TRUE,  # For regression
      learningrate = lr,
      algorithm = "rprop+",
      stepmax = 1e5
    )
  }, error = function(e) NULL)
  
  if(!is.null(model.NN2)) {
    # Make predictions
    preds <- predict(model.NN2, test.reg.data[, -ncol(test.reg.data)])
    
    # Calculate RMSE
    rmse <- sqrt(mean((preds - test.reg.data$exam_score)^2))
    
    # Store results
    results <- rbind(results, data.frame(
      layer1 = layer1,
      layer2 = layer2,
      learning_rate = lr,
      activation = act,
      rmse = rmse
    ))
    
    # Update best model
    if(rmse < best.rmse) {
      best.rmse <- rmse
      best.model <- model.NN2
      best.params <- hyper.grid.NN2[i, ]
    }
  }
}

# View results sorted by RMSE
results[order(results$rmse), ]
   layer1 layer2 learning_rate activation       rmse
7       5      5          0.01   logistic 0.06945465
16      5      5          0.10   logistic 0.06945465
22      5      3          0.01       tanh 0.07200686
31      5      3          0.10       tanh 0.07200686
4       5      3          0.01   logistic 0.07225654
13      5      3          0.10   logistic 0.07225654
1       5      0          0.01   logistic 0.07245266
10      5      0          0.10   logistic 0.07245266
2      10      0          0.01   logistic 0.07481921
11     10      0          0.10   logistic 0.07481921
25      5      5          0.01       tanh 0.07675567
34      5      5          0.10       tanh 0.07675567
20     10      0          0.01       tanh 0.07835699
29     10      0          0.10       tanh 0.07835699
23     10      3          0.01       tanh 0.07888405
32     10      3          0.10       tanh 0.07888405
3      15      0          0.01   logistic 0.07943591
12     15      0          0.10   logistic 0.07943591
8      10      5          0.01   logistic 0.08117725
17     10      5          0.10   logistic 0.08117725
6      15      3          0.01   logistic 0.08352691
15     15      3          0.10   logistic 0.08352691
5      10      3          0.01   logistic 0.08462088
14     10      3          0.10   logistic 0.08462088
9      15      5          0.01   logistic 0.09212715
18     15      5          0.10   logistic 0.09212715
21     15      0          0.01       tanh 0.09573949
30     15      0          0.10       tanh 0.09573949
19      5      0          0.01       tanh 0.10622119
28      5      0          0.10       tanh 0.10622119
27     15      5          0.01       tanh 0.10843835
36     15      5          0.10       tanh 0.10843835
24     15      3          0.01       tanh 0.13060978
33     15      3          0.10       tanh 0.13060978
26     10      5          0.01       tanh 0.16913666
35     10      5          0.10       tanh 0.16913666

Below we train the final model.

final.model.NN2 <- neuralnet(
  exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.reg.data,
  hidden = if(best.params$layer2 == 0) {
        c(best.params$layer1)} else {
        c(best.params$layer1, best.params$layer2)},
  act.fct = best.params$activation,
  linear.output = TRUE,
  learningrate = best.params$learning.rate,
  algorithm = "rprop+",
  stepmax = 1e5
)

plot(final.model.NN2, rep="best")

Below is the actual vs predicted scatterplot and it seems to perform better.

# Make predictions on test set
pred.NN2 <- predict(final.model.NN2, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
rmse.NN2 <- sqrt(mean((pred.NN2  - test.reg.data$exam_score)^2))
mae.NN2 <- mean(abs(pred.NN2  - test.reg.data$exam_score))
r.squared.NN2 <- cor(pred.NN2 , test.reg.data$exam_score)^2

## vector of error metric
NN2 <- c(rmse.NN2, mae.NN2, r.squared.NN2)

# Plot predictions vs actual
plot.data.NN2 <- data.frame(
  Actual = test.reg.data$exam_score,
  Predicted = pred.NN2 
)

ggplot(plot.data.NN2, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  ggtitle("Actual vs Predicted Values") +
  theme_minimal()

Below shows that the NN with two layers does significantly better than the NN with one.

# Train linear regression model
lm.model3 <- lm(exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.reg.data)

# Make predictions
lm.predictions2 <- predict(lm.model3, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
lm.rmse2 <- sqrt(mean((lm.predictions2 - test.reg.data$exam_score)^2))
lm.mae2 <- mean(abs(lm.predictions2 - test.reg.data$exam_score))
lm.r.squared2 <- cor(lm.predictions2, test.reg.data$exam_score)^2

###
rmse.NN2.imp <- (lm.rmse2 - rmse.NN2)/lm.rmse2*100
mae.NN2.imp <- (lm.mae2 - mae.NN2)/lm.mae2 * 100
Rsq.NN2.imp <- (r.squared.NN2 - lm.r.squared2)/lm.r.squared2 * 100

NN2.improve <-c(rmse.NN2.imp, mae.NN2.imp, Rsq.NN2.imp)

perf.matrix <-data.frame(
              LM = c(lm.rmse2, lm.mae2, lm.r.squared2),
              NN.1 = c(rmse.NN1, mae.NN1, r.squared.NN1),
              NN.2 = c(rmse.NN2, mae.NN2, r.squared.NN2)
         )

perf.matrix$NN1.Improve <- round(100*(perf.matrix$LM-perf.matrix$NN.1)/perf.matrix$LM,2)
perf.matrix$NN2.Improve <- round(100*(perf.matrix$LM-perf.matrix$NN.2)/perf.matrix$LM,2)

rownames(perf.matrix) <- c("RMSE", "MAE", "R.sq")
pander(perf.matrix)
  LM NN.1 NN.2 NN1.Improve NN2.Improve
RMSE 0.067 0.3757 0.0711 -460.7 -6.11
MAE 0.0537 0.09442 0.05661 -75.83 -5.42
R.sq 0.9117 0.1956 0.8977 78.55 1.54

Below shows the performance of both standard and NN with 2 layers.

# Plot both predictions
comparison.data.NN2 <- data.frame(
  Actual = test.reg.data$exam_score,
  MLP = pred.NN2 ,
  Linear = lm.predictions2
)

ggplot(comparison.data.NN2) +
  geom_point(aes(x = Actual, y = MLP, color = "MLP")) +
  geom_point(aes(x = Actual, y = Linear, color = "Linear Regression")) +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  scale_color_manual(values = c("MLP" = "blue", "Linear Regression" = "red")) +
  labs(title = "Model Comparison: Actual vs Predicted",
       x = "Actual Values",
       y = "Predicted Values",
       color = "Model Type") +
  theme_minimal()

8 MLP Classification

Below we perform the grid search for the single hidden layer.

#habits.scaled <- cbind(habits.scaled, complete_habits_data2$pass)

set.seed(123)  # For reproducibility
sample.size.cls <- floor(0.80 * nrow(habits.scaled2))
train.indices.cls <- sample(1:sample.size.cls, size = sample.size.cls, replace = FALSE)

train.data.cls2 <- habits.scaled2[train.indices.cls, ]
test.data.cls2 <- habits.scaled2[-train.indices.cls, ]
# Function to perform grid search for neuralnet
neuralnet.grid.search <- function(train.data, test.data, hidden.layers = 1) {
  # Define the grid of hyperparameters
  if (hidden.layers == 1) {
    hidden.nodes <- c(2, 4, 6, 8, 10)
    grid <- expand.grid(hidden = hidden.nodes)
  } else {
    hidden.nodes <- c(2, 4, 6)
    grid <- expand.grid(hidden1 = hidden.nodes, hidden2 = hidden.nodes)
  }
  
  # Add columns to store results
  grid$accuracy <- NA
  grid$auc <- NA
  

  
  # Perform grid search
  for (i in 1:nrow(grid)) {
    if (hidden.layers == 1) {
      hidden <- c(grid$hidden[i])
    } else {
      hidden <- c(grid$hidden1[i], grid$hidden2[i])
    }
    
    # Train the model
    nn.model <- neuralnet(
      pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.data,
      hidden = hidden,
      linear.output = FALSE,  # For classification
      act.fct = "logistic",   # Sigmoid activation
      stepmax = 1e6           # Increase max steps for convergence
    )
    
    # Make predictions
    predictions <- predict(nn.model, test.data)
    predicted.classes <- ifelse(predictions > 0.6, 1, 0)
    
    # Calculate accuracy
    accuracy <- mean(predicted.classes == test.data$pass)
    
    # Calculate AUC
    roc.obj <- roc(test.data$pass, predictions)
    auc.value <- auc(roc.obj)
    
    # Store results
    grid$accuracy[i] <- accuracy
    grid$auc[i] <- auc.value
  }
  return(grid)
}
# Perform grid search for single hidden layer
grid.results.1layer <- neuralnet.grid.search(train.data=train.data.cls2,
                                             test.data=test.data.cls2, 
                                             hidden.layers = 1)
pander(grid.results.1layer)
hidden accuracy auc
2 0.895 0.9034
4 0.88 0.9354
6 0.875 0.9437
8 0.88 0.9403
10 0.885 0.9421

Above is the results from grid search for single hidden layer. Below we do the grid search for the 2 hidden layers.

grid.results.2layer <- neuralnet.grid.search(train.data=train.data.cls2, 
                                             test.data=test.data.cls2, 
                                             hidden.layers = 2)
pander(grid.results.2layer)
hidden1 hidden2 accuracy auc
2 2 0.87 0.8445
4 2 0.885 0.9315
6 2 0.89 0.9413
2 4 0.9 0.9148
4 4 0.89 0.922
6 4 0.895 0.9488
2 6 0.905 0.9002
4 6 0.875 0.9319
6 6 0.885 0.9347

Below we train the 1-layer model.

# Train single hidden layer model (using best configuration from grid search)
best.1layer <- grid.results.1layer[which.max(grid.results.1layer$auc), ]

nn.1layer <- neuralnet(
  pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours
  + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.cls2,
  hidden = best.1layer$hidden,
  linear.output = FALSE,
  act.fct = "logistic",
  stepmax = 1e6
)
##
plot(nn.1layer, rep="best")

Below we train the 2-layer model.

# Train two hidden layers model (using best configuration from grid search)
best.2layer <- grid.results.2layer[which.max(grid.results.2layer$auc), ]

nn.2layer <- neuralnet(
  pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours
   + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.cls2,
  hidden = c(best.2layer$hidden1, best.2layer$hidden2),
  linear.output = FALSE,
  act.fct = "logistic",
  stepmax = 1e6
)
##
plot(nn.2layer, rep="best")

Below we make predictions on the test data.

# Function to evaluate model performance
evaluate.model <- function(model, test.data) {
  # Make predictions
  predictions <- predict(model, test.data)
  predicted.classes <- ifelse(predictions > 0.6, 1, 0)
  
  # Calculate metrics
  accuracy <- mean(predicted.classes == test.data$pass)
  confusion.matrix <- table(Predicted = predicted.classes, Actual = test.data$pass)
  roc.obj <- roc(test.data$pass, predictions)
  auc.value <- auc(roc.obj)
  
  return(list(
    accuracy = accuracy,
    confusion.matrix = confusion.matrix,
    roc.obj = roc.obj,
    auc = auc.value
  ))
}

# Evaluate single hidden layer model
perf.1layer <- evaluate.model(nn.1layer, test.data.cls2)
#print(perf.1layer[c("accuracy", "confusion_matrix", "auc")])

# Evaluate two hidden layers model
perf.2layer <- evaluate.model(nn.2layer, test.data.cls2)
#print(perf.2layer[c("accuracy", "confusion_matrix", "auc")])

Below we show the performance of MLP 1 layer, MLP 2 layer, and standard logistic. We see that they all perform similarly.

# Train logistic regression model (base model)
logit.model <- glm(pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.data.cls2, family = binomial)

# Evaluate logistic regression model
logit.pred <- predict(logit.model, test.data.cls2, type = "response")
logit.classes <- ifelse(logit.pred > 0.6, 1, 0)
logit.accuracy <- mean(logit.classes == test.data.cls2$pass)
logit.roc <- roc(test.data.cls2$pass, logit.pred)
logit.auc <- auc(logit.roc)

##
roc.1layer <- perf.1layer$roc.obj
roc.2layer <- perf.2layer$roc.obj
roc.logit <- logit.roc

## specificity and sensitivity
sen.1layer <- roc.1layer$sensitivities
spe.1layer <- roc.1layer$specificities
sen.2layer <- roc.2layer$sensitivities
spe.2layer <- roc.2layer$specificities
sen.logit <- roc.logit$sensitivities
spe.logit <- roc.logit$specificities

## AUC
auc.1layer <- roc.1layer$auc
auc.2layer <- roc.2layer$auc
auc.logit <- roc.logit$auc

## Plot ROC curves for comparison
par(pty = "s")   # make a square plot to avaoid distortion
plot(1-spe.1layer, sen.1layer, type = "l", lty = 1,
     col = "blue", 
     xlab = "1 - specificity",
     ylab = "sensitvity",
     main = "ROC Curve Comparison")

lines(1-spe.2layer, sen.2layer, lty = 1, col = "darkred")
lines(1-spe.logit, sen.logit, lty = 1, col = "darkgreen")
legend("bottomright", 
       legend = c(paste("1-layer MLP (AUC =", round(perf.1layer$auc, 3), ")"),
                  paste("2-layer MLP (AUC =", round(perf.2layer$auc, 3), ")"),
                  paste("Logistic Reg (AUC =", round(logit.auc, 3), ")")),
                col = c("blue", "darkred", "darkgreen"), 
                lty = 1, cex = 0.7, bty = "n")

9 Results/Conclusion

Perceptrons can help give you a general understanding of how neural networks can predict a response variable but may be too simple.

MLP can be better since they work well with complex/nonlinear data. However, 2 layer MLP may not necessarily be better than 1 layer. Two layers may be too complex, so 1 layer MLP is better since it is simpler and can potentially generate similar accuracies.

For classification, MLP may not necessarily be better than standard logistic. Even though MLP can work well with nonlinear data, it may be more prone to overfitting. Standard logistic has high interpretability, computational efficiency, and is robust to overfitting.

For regression, MLP can also help with nonlinear data but we can take simpler models and computational resources into account.

---
title: "Student Habits vs Academic Performance Neural Networks"
author: "Eric Zhu"
date: "2025-05-05"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: navy;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-family: system-ui;
  font-weight: bold;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 16px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 14px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>

```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("readxl")) { # SVM methodology
   install.packages("readxl")
library(readxl)
}
if (!require("ggplot2")) { # SVM methodology
   install.packages("ggplot2")
library(ggplot2)
}
if (!require("ISLR")) { # contains example data set "Khan"
   install.packages("ISLR")
library(ISLR)
}
if (!require("MASSExtra")) { # contains example data set "Khan"
   install.packages("MASSExtra")
library(MASSExtra)
}
if (!require("missMethods")) { # customized coloring of plots
   install.packages("missMethods")
library(missMethods)
}
if (!require("Hmisc")) { # customized coloring of plots
   install.packages("Hmisc")
library(Hmisc)
}
if (!require("glmnet")) { # customized coloring of plots
   install.packages("glmnet")
library(glmnet)
}
if (!require("GGally")) { # customized coloring of plots
   install.packages("GGally")
library(GGally)
}
if (!require("mice")) { # customized coloring of plots
   install.packages("mice")
library(mice)
}
if (!require("plotly")) { # customized coloring of plots
   install.packages("plotly")
library(plotly)
}
if (!require("caret")) { # customized coloring of plots
   install.packages("caret")
library(caret)
}
if (!require("pROC")) { # customized coloring of plots
   install.packages("pROC")
library(pROC)
}
if (!require("pander")) { # customized coloring of plots
   install.packages("pander")
library(pander)
}
if (!require("e1071")) { # customized coloring of plots
   install.packages("e1071")
library(e1071)
}
if (!require("rpart")) { # customized coloring of plots
   install.packages("rpart")
library(rpart)
}
if (!require("rpart.plot")) { # customized coloring of plots
   install.packages("rpart.plot")
library(rpart.plot)
}
if (!require("randomForest")) { # customized coloring of plots
   install.packages("randomForest")
library(randomForest)
}
if (!require("forcats")) { # customized coloring of plots
   install.packages("forcats")
library(forcats)
}
if (!require("ipred")) { # customized coloring of plots
   install.packages("ipred")
library(ipred)
}
if (!require("neuralnet")) { # customized coloring of plots
   install.packages("neuralnet")
library(neuralnet)
}
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )
```

# Introduction

This project uses the same dataset as project 3.

This dataset is titled “Student Habits vs Academic Performance” and shows various explanatory variables and one target/response variable (exam_score). The purpose of collecting this dataset is to see what factors may affect exam scores. This dataset was collected from Kaggle. This dataset contains 1000 observations and 16 variables (15 feature variables and 1 target variable).

The student_id serves as the identifier. The age (numerical) shows each persons age. The gender (categorical) shows each persons gender. The study_hours_per_day shows hours studied per day. The social_media_hours shows hours used daily on social media. The netflix_hours shows hours used daily on netflix. The part_time_job (no/yes) shows if the person has a part time job. The attendance percentage shows the percentage of classes attended. The sleep_hours shows amount of sleep per night. The diet_quality (poor/fair/good) shows quality of ones diet. The exercise_frequency shows how many times one exercises per week. The parental_education_level (none/High School/bachelor/master) shows the highest level of education of ones parents. The internet_quality(poor/avg/good) shows how good ones wi-fi is. The mental_health_rating (1-10) shows how one rates their mental health. The extracurricular_participation (n/y) shows if one participates in extracurricular activities. The target/response variable is exam_score (continuous).

Based on this dataset, we can formulate a research question. The question we can form is what variables are the most important in predicting exam score. For regression, we can use exam_score (continuous) as our response variable. For classification, we can create a binary response variable with 1 (pass) and 0 (fail). We can use the condition if exam score is greater or less than 0.60.


# Read in Dataset

Read in dataset. See dataset structure.

```{r}
habits2 <- read.csv("student_habits_performance.csv")
str(habits2)
```


# EDA

The below figure shows the distribution of the gender variable.

```{r}
ggplot(habits2, aes(x = gender)) + 
  
  geom_bar() +
  labs(title = "Gender")
```

The below figure shows the distribution of the part_time_job variable.

```{r}
ggplot(habits2, aes(x = part_time_job)) + 
  
  geom_bar() +
  labs(title = "part_time_job")
```

The below figure shows the distribution of the diet_quality variable.

```{r}
ggplot(habits2, aes(x = diet_quality)) + 
  
  geom_bar() +
  labs(title = "diet_quality")
```

The below figure shows the distribution of the parental_education_level variable.

```{r}
ggplot(habits2, aes(x = parental_education_level)) + 
  
  geom_bar() +
  labs(title = "parental_education_level")
```

The below figure shows the distribution of the internet_quality variable.

```{r}
ggplot(habits2, aes(x = internet_quality)) + 
  
  geom_bar() +
  labs(title = "internet_quality")
```

The below figure shows the distribution of the extracurricular_participation variable.

```{r}
ggplot(habits2, aes(x = extracurricular_participation)) + 
  
  geom_bar() +
  labs(title = "extracurricular_participation")
```

The below figure shows the distribution of the age variable.

```{r}
ggplot(data = habits2, aes(x = age)) + 
  geom_boxplot() + 
  
  labs(title = "age")
```

The below figure shows the distribution of the study_hours_per_day variable.

```{r}
ggplot(data = habits2, aes(x = study_hours_per_day)) + 
  geom_boxplot() + 
  
  labs(title = "study_hours_per_day")
```


The below figure shows the distribution of the social_media_hours variable.

```{r}
ggplot(data = habits2, aes(x = social_media_hours)) + 
  geom_boxplot() + 
  
  labs(title = "social_media_hours")
```


The below figure shows the distribution of the netflix_hours variable.

```{r}
ggplot(data = habits2, aes(x = netflix_hours)) + 
  geom_boxplot() + 
  
  labs(title = "netflix_hours")
```

The below figure shows the distribution of the attendance_percentage variable.

```{r}
ggplot(data = habits2, aes(x = attendance_percentage)) + 
  geom_boxplot() + 
  
  labs(title = "attendance_percentage")
```

The below figure shows the distribution of the sleep_hours variable.

```{r}
ggplot(data = habits2, aes(x = sleep_hours)) + 
  geom_boxplot() + 
  
  labs(title = "sleep_hours")
```

The below figure shows the distribution of the exercise_frequency variable.

```{r}
ggplot(data = habits2, aes(x = exercise_frequency)) + 
  geom_boxplot() + 
  
  labs(title = "exercise_frequency")
```

The below figure shows the distribution of the mental_health_rating variable.

```{r}
ggplot(data = habits2, aes(x = mental_health_rating)) + 
  geom_boxplot() + 
  
  labs(title = "mental_health_rating")
```


# Imputation/Feature Engineering.

The variables part_time_job, diet_quality, internet_quality, and extracurricular_participation are converted to categorical/factor.

```{r}

habits2$part_time_job <- as.factor(habits2$part_time_job)
habits2$diet_quality <- factor(habits2$diet_quality,levels = c("Poor", "Fair", "Good"))

habits2$internet_quality <- factor(habits2$internet_quality, levels = c("Poor", "Average", "Good"))
habits2$extracurricular_participation <- as.factor(habits2$extracurricular_participation)
```

The variable gender contains the value ‘Other’. Since we only want to work with Male/female, we set ‘Other’ to missing and use MICE imputation.

```{r}
habits2$gender[habits2$gender== "Other"] = NA
habits2$gender <- as.factor(habits2$gender)

init3 <- mice(habits2, maxit = 0)
init3$method
```


```{r}
imp3 <- mice(habits2, method = c("","", "logreg", "",  "", "", "", "", "", "", "", "", "", "", "", ""), 
                 maxit = 10,  
                 m = 5, 
                 seed=123,
                 print=F)     



complete_habits_data2 <- complete(imp3)
```

For parental_education_level, we convert both Bachelor and Master to College to reduce categories.

```{r}
complete_habits_data2$parental_education_level[complete_habits_data2$parental_education_level == 'Bachelor'] <- 'College'
complete_habits_data2$parental_education_level[complete_habits_data2$parental_education_level == 'Master'] <- 'College'

complete_habits_data2$parental_education_level <- factor(complete_habits_data2$parental_education_level,levels = c("None", "High School", "College"))

str(complete_habits_data2)
```

Since neural networks can only work with numerical variables, we need to transform all the categorical variables into numerical format using a mix of label encoding (ordinal) and one hot encoding (nominal).

complete_habits_data2 is the complete dataset that will be used for the project.

```{r}
complete_habits_data2$diet_quality <- as.numeric(factor(complete_habits_data2$diet_quality))
complete_habits_data2$parental_education_level <- as.numeric(factor(complete_habits_data2$parental_education_level))
complete_habits_data2$internet_quality <- as.numeric(factor(complete_habits_data2$internet_quality))

encoded_data <- model.matrix(~ gender - 1, data = complete_habits_data2)
encoded_data2 <- model.matrix(~ part_time_job - 1, data = complete_habits_data2)
encoded_data3 <- model.matrix(~ extracurricular_participation - 1, data = complete_habits_data2)

complete_habits_data2 <- cbind(complete_habits_data2, encoded_data, encoded_data2, encoded_data3)
```


# Perceptron Regression

In this section, we perform a perceptron model for the continuous response variable exam_score.

The perceptron is the simplest form of a neural network. The perceptron model takes input features, computes their weighted sum, and passes the result through an activation function to generate the output. The linear activation function is used.

The below code chunk performs the data splitting and the normalization for all variables.

```{r}
set.seed(123)

complete_habits_data2 <- complete_habits_data2[ -c(1,3,7,15) ]

# Normalize data (0-1 range)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# normalize all features including the target variable
habits.scaled <- as.data.frame(lapply(complete_habits_data2, normalize))

# Train-test split (70-30)
#set.seed(123)
##
sample.size <- dim(habits.scaled)[1]
train.indices <- sample(1:sample.size, round(0.8*sample.size))
##

train.data.norm <- habits.scaled[train.indices, ]
test.data.norm <- habits.scaled[-train.indices, ]
##
train.orig <- complete_habits_data2[train.indices, ]
test.orig <- complete_habits_data2[-train.indices, ]

unscale <- function(x, original) {
  return(x * (max(original) - min(original)) + min(original))
}
```

The table below shows all possible combination of hyperparameter values of learningrate, threshold, stepmax, and RMSE.

```{r}
# Define the tuning grid
tune.grid.nn <- expand.grid(
   learningrate = c(0.001, 0.01, 0.1, 0.5, 1),
   threshold = c(0.01, 0.05, 0.1, 0.5),
   stepmax = c(1e5, 1e6)  # Add stepmax to prevent infinite training
)

# Custom training function for neuralnet()
neuralnet.train <- function(learningrate, threshold, stepmax) {
  model <- neuralnet(
    exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
    data = train.data.norm,
    hidden = 0,  # Perceptron (no hidden layer)
    linear.output = TRUE,  # For regression
    learningrate = learningrate,
    threshold = threshold,
    stepmax = stepmax
  )
  # Calculate RMSE on training data
  pred <- predict(model, train.data.norm[, -ncol(train.data.norm)])
  

  rmse <- sqrt(mean((pred - train.data.norm$exam_score)^2))
  return(rmse)
}

# Perform grid search:
# using apply the () function to call neuralnet.train() using the components of the
# row vector in the tune.grid.nn (data frame of combinations of hyperparameters).
#
results <- apply(tune.grid.nn, 1, function(x) {
  neuralnet.train(x["learningrate"], x["threshold"], x["stepmax"])
})

##
# Combine results with parameter combinations
tune.results <- cbind(tune.grid.nn, RMSE = results)
##
pander(tune.results)
```

The table below shows the best combination of hyperparameter values.

```{r}
# Find the best combination
best.params <- tune.results[which.min(tune.results$RMSE), ]
pander(best.params)
```

The final.model.nn below trains the final neural network model using the best combination of hyperparameter values from above.

```{r}
final.model.nn <- neuralnet(
  exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.norm,
  hidden = 0,
  linear.output = TRUE,
  learningrate = best.params$learningrate,
  threshold = best.params$threshold,
  stepmax = best.params$stepmax,
  rep = 1  # Multiple repetitions for stability
)
```


Below we use the training model to make predictions on the test data. We also show how the perceptron performs compared to standard linear regression and we see that they are similar.


```{r}
full.predictions <- predict(final.model.nn, test.data.norm)
##
pred.unscale <- unscale(full.predictions , complete_habits_data2$exam_score)
###
MSE.neuralnet <- mean((pred.unscale  -test.orig$exam_score)^2)
###
r.sq.neuralnet <- (cor(pred.unscale, test.orig$exam_score))^2


###############

# Convert predictions back to original scale
#predictions.original <- predictions * (max(test.orig$exam_score) - min(test.orig$exam_score)) + min(test.orig$exam_score)

actual.original <- test.orig$exam_score

# Calculate R-squared
r.squared <- (cor(pred.unscale,actual.original))^2


#cat("R-squared:", r.squared, "\n")
Perceptron.MSE <- mean((actual.original - pred.unscale)^2)
#cat("Perceptron MSE:", Perceptron.MSE, "\n")

# Compare with linear regression for benchmarking
lm.model <- lm(exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.orig)
lm.pred <- predict(lm.model, test.orig)
lm.mse <- mean((test.orig$exam_score - lm.pred)^2)
#cat("Linear Regression MSE:", lm.mse, "\n")
lm.r.sq <- summary(lm.model)$r.squared

################
Perceptron <- c(Perceptron.MSE, r.squared)
LM <- c(lm.mse, lm.r.sq)
neuralnet <- c(MSE.neuralnet, r.sq.neuralnet)
Performace.metrics.all <- data.frame(Perceptron=Perceptron, 
                                 LM = LM, 
                                 neuralnet = neuralnet )
rownames(Performace.metrics.all) <- c("MSE", "r.sq")
pander(Performace.metrics.all)
```

# Perceptron Classification

In this section, we do perceptron for the binary response variable 'pass'. We create the pass variable below using the assumption that 60 or above is a pass (1, 0 otherwise).

The below code generates the optimal hyperparameters for learningrate, threshold, and accuracy.

```{r}
complete_habits_data2 <- transform(complete_habits_data2, pass=ifelse(exam_score >= 60, 1, 0))



habits.scaled <- cbind(habits.scaled, complete_habits_data2$pass)

set.seed(123)



colnames(habits.scaled) <- c('age','study_hours_per_day', 'social_media_hours', 
                       'netflix_hours', 
                       'attendance_percentage', 'sleep_hours' , 'diet_quality',
                       'exercise_frequency', 'parental_education_level','internet_quality',
                       'mental_health_rating','exam_score','genderFemale','genderMale','part_time_jobNo', 'part_time_jobYes','extracurricular_participationNo','extracurricular_participationYes','pass')

sample.size2 <- dim(habits.scaled)[1]
train.indices2 <- sample(1:sample.size2, round(0.8*sample.size2))


train.data.cls <- habits.scaled[train.indices2, ]
test.data.cls <- habits.scaled[-train.indices2, ]

train.orig2 <- complete_habits_data2[train.indices2, ]
test.orig2 <- complete_habits_data2[-train.indices2, ]

#############

## Grid Search Setup
# Define the hyperparameter grid
hyper.grid.cls <- expand.grid(
  learningrate = c(0.001, 0.01, 0.05, 0.1, 0.2),
  threshold = c(0.01, 0.05)  # Stopping threshold for partial derivatives
 )


k <- 5
fold.size <- floor(dim(train.data.cls)[1]/k)
# Initialize results storage
results <- data.frame(
  learningrate = numeric(),
  threshold = numeric(),
  accuracy = numeric(),
  stringsAsFactors = FALSE
)


## Perform Grid Search with Cross-Validation
for(i in 1:nrow(hyper.grid.cls)) {
  lr <- hyper.grid.cls$learningrate[i]
  th <- hyper.grid.cls$threshold[i]
  
  #cat("\nTesting combination", i, "of", nrow(hyper.grid.cls), ": learningrate =", lr, ", threshold =", th, "\n")
  
  # 
  fold.accuracies <- numeric(k)
  
  # 
  for(fold in 1:k) {
    # Split into training and validation sets
    valid.indices <- (1 + (fold-1)*fold.size):(fold*fold.size)
    train.fold <- train.data.cls[-valid.indices, ]
    valid.fold <- train.data.cls[valid.indices, ]
    
    # Train the perceptron
    set.seed(123)
    model.sigmoid <- neuralnet(
      pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.fold,
      hidden = 0,  # Perceptron has no hidden layers
      linear.output = FALSE,
      learningrate = lr,
      act.fct = "logistic",
      algorithm = "rprop+", # The resilient backpropagation with weight backtracking
      threshold = th,
      stepmax = 1e5  # Increased to ensure convergence
    )
    
    # Make predictions
    preds <- predict(model.sigmoid, valid.fold)
    pred.classes <- ifelse(preds > 0.6, 1, 0)  # default threshold 0.5
    
    # Calculate accuracy
    fold.accuracies[fold] <- mean(pred.classes == valid.fold$pass)
  }
  
  # Store average accuracy for this hyperparameter combination
  results <- rbind(results, data.frame(
    learningrate = lr,
    threshold = th,
    accuracy = mean(fold.accuracies)
  ))
}

## Analyze Results
# Find the best combination
best.combination <- results[which.max(results$accuracy), ]
#cat("\nBest hyperparameter combination:\n")
pander(best.combination)

```

The final.sigmoid.model below is the final trained model using the optimal hyperparameters from above.

```{r}
## Train Final Model with Best Hyperparameters
final.sigmoid.model <- neuralnet(
  pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.cls,
  hidden = 0,
  linear.output = FALSE,
  learningrate = best.combination$learningrate,
  threshold = best.combination$threshold,
  act.fct = "logistic",
  algorithm = "rprop+", # The resilient backpropagation with weight backtracking
  stepmax = 1e5
)

#plot(final.sigmoid.model)
```

Below we make predictions on the test set and compare the perceptron to standard logistic regression and find that they are similar in performance.

```{r}
## Evaluate on Test Set
pred.sigmoid <- predict(final.sigmoid.model, test.data.cls)

###  logistic regression
logit.fit <- glm(pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.data.cls, family = binomial)
AIC.logit <- step(logit.fit, direction = "both", trace = 0)
pred.logit <- predict(AIC.logit, test.data.cls, type = "response")
pred.full <- predict(logit.fit, test.data.cls, type = "response")


## roc
roc.full.logit <- roc(test.data.cls[, ncol(test.data.cls)], pred.full)
roc.AIC.logit <- roc(test.data.cls[, ncol(test.data.cls)], pred.logit)
roc.sigmoid <- roc(test.data.cls[, ncol(test.data.cls)], pred.sigmoid )
## AUC
auc.sigmoid <- roc.sigmoid$auc
auc.full.logit <- roc.full.logit$auc
auc.AIC.logit <- roc.AIC.logit$auc

## spe-sen
sigmoid.spe <- roc.sigmoid$specificities
sigmoid.sen <- roc.sigmoid$sensitivities

full.logit.spe <- roc.full.logit$specificities
full.logit.sen <- roc.full.logit$sensitivities

AIC.logit.spe <- roc.AIC.logit$specificities
AIC.logit.sen <- roc.AIC.logit$sensitivities


# ROC curve
plot(1-sigmoid.spe, sigmoid.sen, col = "blue", type = "l", lty = 1,
     xlab = "1 - specificity",
     ylab = "sensitivity",
     main = "ROC Curves of Perceptron and Logistic Models")
lines(1-full.logit.spe, full.logit.sen, lty = 1, col = "brown")
lines(1-AIC.logit.spe, AIC.logit.sen, lty = 1, col = "steelblue")
abline(0,1, lty =2, col = "red")
text(0.98, 0.3, paste("Perceptron AUC = ", round(auc.sigmoid,4)), col = "blue", cex = 0.8, pos = 2)
text(0.98, 0.25, paste("Full Logit AUC = ", round(auc.full.logit,4)), col = "brown", cex = 0.8, pos = 2)
text(0.98, 0.2, paste("AIC AUC = ", round(auc.AIC.logit,4)), col = "steelblue", cex = 0.8, pos = 2)
```


# MLP Regression

In this section, we do multilayer perceptron. MLP uses hidden layers, which is helpful for nonlinear data.

## One-hidden-layer Perceptron

In this section, we use one hidden layer. Below we split the data into train and test.

```{r}
habits.scaled2 <- as.data.frame(lapply(complete_habits_data2, normalize))

# Set seed for reproducibility
set.seed(123)
N <- length(habits.scaled2$exam_score)
# Create train-test split (70-30)
train.reg.index <- sample(1:N,  floor(0.8*N), replace = FALSE)
train.reg.data <- habits.scaled2[train.reg.index, ]
test.reg.data <- habits.scaled2[-train.reg.index, ]

```

Below we get the optimal hyperparameters for layer1, learning.rate, activation, and rmse.

```{r}
# Define grid of hyperparameters
hyper.grid.reg <- expand.grid(
  layer1 = c(5, 10, 15),
  learning.rate = c(0.01, 0.1),
  activation = c("logistic", "tanh")
)

# Initialize results storage
rmse = NULL
#layer1 = NULL
#learningrate = NULL
#activation = NULL

best.reg.rmse <- Inf
best.reg.model <- NULL

# Perform grid search
for(i in 1:nrow(hyper.grid.reg)) {
  # Get current configuration
  layer <- hyper.grid.reg$layer1[i]
  lr <- hyper.grid.reg$learning.rate[i]
  act <- hyper.grid.reg$activation[i]
  
  # Train model
  set.seed(123)
  model.reg <- neuralnet(
      exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.reg.data,
      hidden = layer,
      act.fct = act,
      linear.output = TRUE,  # For regression
      learningrate = lr,
      algorithm = "rprop+",
      stepmax = 1e5 )
  

    # Make predictions
    preds.reg <- predict(model.reg, test.reg.data[, -ncol(test.reg.data)])
    
    # Calculate RMSE
    rmse.reg <- sqrt(mean((preds.reg - test.reg.data$exam_score)^2))
    
    # Store results
    rmse[i] = rmse.reg
    
    # Update best model
    if(rmse.reg < best.reg.rmse) {
      best.reg.rmse <- rmse.reg
      best.reg.model <- model.reg 
      best.reg.params <- hyper.grid.reg[i, ]
    }
}

results.regNN <- hyper.grid.reg
results.regNN$rmse <- rmse


# View results sorted by RMSE
pander(results.regNN[order(results.regNN$rmse), ][1,])
```

Using the optimal hyperparameters, we train the final model below.


```{r}
final.reg.model <- neuralnet(
  exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.reg.data,
  hidden = best.reg.params$layer1,
  act.fct = best.reg.params$activation,
  linear.output = TRUE,
  learningrate = best.reg.params$learning_rate,
  algorithm = "rprop+",
  stepmax = 1e5
)

plot(final.reg.model, rep="best")
```

Below we show a scatterplot of the actual vs predicted values. It does not perform very well due to some outliers.

```{r}
# Make predictions on test set
pred.NN1 <- predict(final.reg.model, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
rmse.NN1 <- sqrt(mean((pred.NN1  - test.reg.data$exam_score)^2))
mae.NN1 <- mean(abs(pred.NN1  - test.reg.data$exam_score))
r.squared.NN1 <- cor(pred.NN1 , test.reg.data$exam_score)^2

# cat("Performance Metrics:\n")
# cat("RMSE:", rmse, "\n")
# cat("MAE:", mae, "\n")
# cat("R-squared:", r_squared, "\n")

# Plot predictions vs actual
plot.NN1.data <- data.frame(
  Actual = test.reg.data$exam_score,
  Predicted = pred.NN1 
)

ggplot(plot.NN1.data, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "darkred") +
  annotate("text", x=0.85, y=-2, 
           label=paste("R.sq =", round(r.squared.NN1,4)), color="blue") +
  annotate("text", x=0.85, y=-1, 
           label=paste("RMSE =", round(rmse.NN1,4)), color="blue") +
  annotate("text", x=0.85, y=.06, 
           label=paste("  MAE =", round(mae.NN1,4)), color="blue") +
  ggtitle("Actual vs Predicted Values") +
  theme_minimal()
```

Below shows that the neural network does not perform as well as standard linear regression so 2 layers may help.

```{r}
# Train linear regression model
lm.model2 <- lm(exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = test.reg.data)

# Make predictions
lm.predictions <- predict(lm.model2, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
lm.rmse <- sqrt(mean((lm.predictions - test.reg.data$exam_score)^2))
lm.mae <- mean(abs(lm.predictions - test.reg.data$exam_score))
lm.r.squared <- cor(lm.predictions, test.reg.data$exam_score)^2


## improvements
RMSE.imp <- round((lm.rmse - rmse.NN1)/lm.rmse * 100,2)
MAE.imp <- round((lm.mae - mae.NN1)/lm.mae * 100, 2)
Rsq.imp <- round((r.squared.NN1 - lm.r.squared)/lm.r.squared * 100,2)

##
Performance.table <- data.frame(
  LM = c(lm.rmse, lm.mae, lm.r.squared),
  NN.1 = c(rmse.NN1, mae.NN1, r.squared.NN1),
  Improvement.percentage = c(RMSE.imp, MAE.imp, Rsq.imp)
)
rownames(Performance.table) <- c("RMSE", "MAE", "R.square")
pander(Performance.table)
```

Below shows the performance of both standard linear regression and NN with one layer.

```{r}
# Plot both predictions
comparison.data <- data.frame(
  Actual = test.reg.data$exam_score,
  MLP = pred.NN1,
  Linear = lm.predictions
)

ggplot(comparison.data) +
  geom_point(aes(x = Actual, y = MLP, color = "MLP")) +
  geom_point(aes(x = Actual, y = Linear, color = "Linear Regression")) +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  scale_color_manual(values = c("MLP" = "blue", "Linear Regression" = "red")) +
  labs(title = "Model Comparison: Actual vs Predicted",
       x = "Actual Values",
       y = "Predicted Values",
       color = "Model Type") +
  theme(
    plot.margin = ggplot2::margin(40, 20, 20, 20, unit = "pt"),
    plot.title = element_text(hjust = 0.5, 
                              lineheight = 1.1,
                              vjust = 10)
    )
```


## Two-hidden-layer Perceptron

Below we see possible hyperparameters.

```{r}
# Define grid of hyperparameters
hyper.grid.NN2 <- expand.grid(
  layer1 = c(5, 10, 15),
  layer2 = c(0, 3, 5),  # 0 means no second layer
  learning.rate = c(0.01, 0.1),
  activation = c("logistic", "tanh")
)

# Initialize results in storage
results <- data.frame()
best.rmse <- Inf
best.model <- NULL

# Perform grid search
for(i in 1:nrow(hyper.grid.NN2)) {
  # Get current configuration
  layer1 <- hyper.grid.NN2$layer1[i]
  layer2 <- hyper.grid.NN2$layer2[i]
  lr <- hyper.grid.NN2$learning.rate[i]
  act <- hyper.grid.NN2$activation[i]
  
  # Create hidden layers vector
  if(layer2 == 0) {
    hidden.layers <- c(layer1)
  } else {
    hidden.layers <- c(layer1, layer2)
  }
  
  # Train model
  set.seed(123)
  model.NN2 <- tryCatch({
    neuralnet(
      exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.reg.data,
      hidden = hidden.layers,
      act.fct = act,
      linear.output = TRUE,  # For regression
      learningrate = lr,
      algorithm = "rprop+",
      stepmax = 1e5
    )
  }, error = function(e) NULL)
  
  if(!is.null(model.NN2)) {
    # Make predictions
    preds <- predict(model.NN2, test.reg.data[, -ncol(test.reg.data)])
    
    # Calculate RMSE
    rmse <- sqrt(mean((preds - test.reg.data$exam_score)^2))
    
    # Store results
    results <- rbind(results, data.frame(
      layer1 = layer1,
      layer2 = layer2,
      learning_rate = lr,
      activation = act,
      rmse = rmse
    ))
    
    # Update best model
    if(rmse < best.rmse) {
      best.rmse <- rmse
      best.model <- model.NN2
      best.params <- hyper.grid.NN2[i, ]
    }
  }
}

# View results sorted by RMSE
results[order(results$rmse), ]
```

Below we train the final model.

```{r}
final.model.NN2 <- neuralnet(
  exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.reg.data,
  hidden = if(best.params$layer2 == 0) {
        c(best.params$layer1)} else {
        c(best.params$layer1, best.params$layer2)},
  act.fct = best.params$activation,
  linear.output = TRUE,
  learningrate = best.params$learning.rate,
  algorithm = "rprop+",
  stepmax = 1e5
)

plot(final.model.NN2, rep="best")
```

Below is the actual vs predicted scatterplot and it seems to perform better.

```{r}
# Make predictions on test set
pred.NN2 <- predict(final.model.NN2, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
rmse.NN2 <- sqrt(mean((pred.NN2  - test.reg.data$exam_score)^2))
mae.NN2 <- mean(abs(pred.NN2  - test.reg.data$exam_score))
r.squared.NN2 <- cor(pred.NN2 , test.reg.data$exam_score)^2

## vector of error metric
NN2 <- c(rmse.NN2, mae.NN2, r.squared.NN2)

# Plot predictions vs actual
plot.data.NN2 <- data.frame(
  Actual = test.reg.data$exam_score,
  Predicted = pred.NN2 
)

ggplot(plot.data.NN2, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  ggtitle("Actual vs Predicted Values") +
  theme_minimal()
```

Below shows that the NN with two layers does significantly better than the NN with one.

```{r}
# Train linear regression model
lm.model3 <- lm(exam_score ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.reg.data)

# Make predictions
lm.predictions2 <- predict(lm.model3, test.reg.data[, -ncol(test.reg.data)])

# Calculate performance metrics
lm.rmse2 <- sqrt(mean((lm.predictions2 - test.reg.data$exam_score)^2))
lm.mae2 <- mean(abs(lm.predictions2 - test.reg.data$exam_score))
lm.r.squared2 <- cor(lm.predictions2, test.reg.data$exam_score)^2

###
rmse.NN2.imp <- (lm.rmse2 - rmse.NN2)/lm.rmse2*100
mae.NN2.imp <- (lm.mae2 - mae.NN2)/lm.mae2 * 100
Rsq.NN2.imp <- (r.squared.NN2 - lm.r.squared2)/lm.r.squared2 * 100

NN2.improve <-c(rmse.NN2.imp, mae.NN2.imp, Rsq.NN2.imp)

perf.matrix <-data.frame(
              LM = c(lm.rmse2, lm.mae2, lm.r.squared2),
              NN.1 = c(rmse.NN1, mae.NN1, r.squared.NN1),
              NN.2 = c(rmse.NN2, mae.NN2, r.squared.NN2)
         )

perf.matrix$NN1.Improve <- round(100*(perf.matrix$LM-perf.matrix$NN.1)/perf.matrix$LM,2)
perf.matrix$NN2.Improve <- round(100*(perf.matrix$LM-perf.matrix$NN.2)/perf.matrix$LM,2)

rownames(perf.matrix) <- c("RMSE", "MAE", "R.sq")
pander(perf.matrix)
```

Below shows the performance of both standard and NN with 2 layers.

```{r}
# Plot both predictions
comparison.data.NN2 <- data.frame(
  Actual = test.reg.data$exam_score,
  MLP = pred.NN2 ,
  Linear = lm.predictions2
)

ggplot(comparison.data.NN2) +
  geom_point(aes(x = Actual, y = MLP, color = "MLP")) +
  geom_point(aes(x = Actual, y = Linear, color = "Linear Regression")) +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  scale_color_manual(values = c("MLP" = "blue", "Linear Regression" = "red")) +
  labs(title = "Model Comparison: Actual vs Predicted",
       x = "Actual Values",
       y = "Predicted Values",
       color = "Model Type") +
  theme_minimal()
```


# MLP Classification

Below we perform the grid search for the single hidden layer.

```{r}
#habits.scaled <- cbind(habits.scaled, complete_habits_data2$pass)

set.seed(123)  # For reproducibility
sample.size.cls <- floor(0.80 * nrow(habits.scaled2))
train.indices.cls <- sample(1:sample.size.cls, size = sample.size.cls, replace = FALSE)

train.data.cls2 <- habits.scaled2[train.indices.cls, ]
test.data.cls2 <- habits.scaled2[-train.indices.cls, ]
```

```{r}
# Function to perform grid search for neuralnet
neuralnet.grid.search <- function(train.data, test.data, hidden.layers = 1) {
  # Define the grid of hyperparameters
  if (hidden.layers == 1) {
    hidden.nodes <- c(2, 4, 6, 8, 10)
    grid <- expand.grid(hidden = hidden.nodes)
  } else {
    hidden.nodes <- c(2, 4, 6)
    grid <- expand.grid(hidden1 = hidden.nodes, hidden2 = hidden.nodes)
  }
  
  # Add columns to store results
  grid$accuracy <- NA
  grid$auc <- NA
  

  
  # Perform grid search
  for (i in 1:nrow(grid)) {
    if (hidden.layers == 1) {
      hidden <- c(grid$hidden[i])
    } else {
      hidden <- c(grid$hidden1[i], grid$hidden2[i])
    }
    
    # Train the model
    nn.model <- neuralnet(
      pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
      data = train.data,
      hidden = hidden,
      linear.output = FALSE,  # For classification
      act.fct = "logistic",   # Sigmoid activation
      stepmax = 1e6           # Increase max steps for convergence
    )
    
    # Make predictions
    predictions <- predict(nn.model, test.data)
    predicted.classes <- ifelse(predictions > 0.6, 1, 0)
    
    # Calculate accuracy
    accuracy <- mean(predicted.classes == test.data$pass)
    
    # Calculate AUC
    roc.obj <- roc(test.data$pass, predictions)
    auc.value <- auc(roc.obj)
    
    # Store results
    grid$accuracy[i] <- accuracy
    grid$auc[i] <- auc.value
  }
  return(grid)
}
```


```{r}
# Perform grid search for single hidden layer
grid.results.1layer <- neuralnet.grid.search(train.data=train.data.cls2,
                                             test.data=test.data.cls2, 
                                             hidden.layers = 1)
pander(grid.results.1layer)
```

Above is the results from grid search for single hidden layer. Below we do the grid search for the 2 hidden layers.

```{r}
grid.results.2layer <- neuralnet.grid.search(train.data=train.data.cls2, 
                                             test.data=test.data.cls2, 
                                             hidden.layers = 2)
pander(grid.results.2layer)
```

Below we train the 1-layer model.

```{r}
# Train single hidden layer model (using best configuration from grid search)
best.1layer <- grid.results.1layer[which.max(grid.results.1layer$auc), ]

nn.1layer <- neuralnet(
  pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours
  + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.cls2,
  hidden = best.1layer$hidden,
  linear.output = FALSE,
  act.fct = "logistic",
  stepmax = 1e6
)
##
plot(nn.1layer, rep="best")
```

Below we train the 2-layer model.

```{r}
# Train two hidden layers model (using best configuration from grid search)
best.2layer <- grid.results.2layer[which.max(grid.results.2layer$auc), ]

nn.2layer <- neuralnet(
  pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours
   + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale,
  data = train.data.cls2,
  hidden = c(best.2layer$hidden1, best.2layer$hidden2),
  linear.output = FALSE,
  act.fct = "logistic",
  stepmax = 1e6
)
##
plot(nn.2layer, rep="best")
```

Below we make predictions on the test data.

```{r}
# Function to evaluate model performance
evaluate.model <- function(model, test.data) {
  # Make predictions
  predictions <- predict(model, test.data)
  predicted.classes <- ifelse(predictions > 0.6, 1, 0)
  
  # Calculate metrics
  accuracy <- mean(predicted.classes == test.data$pass)
  confusion.matrix <- table(Predicted = predicted.classes, Actual = test.data$pass)
  roc.obj <- roc(test.data$pass, predictions)
  auc.value <- auc(roc.obj)
  
  return(list(
    accuracy = accuracy,
    confusion.matrix = confusion.matrix,
    roc.obj = roc.obj,
    auc = auc.value
  ))
}

# Evaluate single hidden layer model
perf.1layer <- evaluate.model(nn.1layer, test.data.cls2)
#print(perf.1layer[c("accuracy", "confusion_matrix", "auc")])

# Evaluate two hidden layers model
perf.2layer <- evaluate.model(nn.2layer, test.data.cls2)
#print(perf.2layer[c("accuracy", "confusion_matrix", "auc")])
```

Below we show the performance of MLP 1 layer, MLP 2 layer, and standard logistic. We see that they all perform similarly.

```{r}
# Train logistic regression model (base model)
logit.model <- glm(pass ~ age + study_hours_per_day + social_media_hours + netflix_hours + attendance_percentage + sleep_hours + diet_quality + exercise_frequency + parental_education_level + internet_quality + mental_health_rating + extracurricular_participationNo + extracurricular_participationYes + part_time_jobNo + part_time_jobYes + genderFemale + genderMale, data = train.data.cls2, family = binomial)

# Evaluate logistic regression model
logit.pred <- predict(logit.model, test.data.cls2, type = "response")
logit.classes <- ifelse(logit.pred > 0.6, 1, 0)
logit.accuracy <- mean(logit.classes == test.data.cls2$pass)
logit.roc <- roc(test.data.cls2$pass, logit.pred)
logit.auc <- auc(logit.roc)

##
roc.1layer <- perf.1layer$roc.obj
roc.2layer <- perf.2layer$roc.obj
roc.logit <- logit.roc

## specificity and sensitivity
sen.1layer <- roc.1layer$sensitivities
spe.1layer <- roc.1layer$specificities
sen.2layer <- roc.2layer$sensitivities
spe.2layer <- roc.2layer$specificities
sen.logit <- roc.logit$sensitivities
spe.logit <- roc.logit$specificities

## AUC
auc.1layer <- roc.1layer$auc
auc.2layer <- roc.2layer$auc
auc.logit <- roc.logit$auc

## Plot ROC curves for comparison
par(pty = "s")   # make a square plot to avaoid distortion
plot(1-spe.1layer, sen.1layer, type = "l", lty = 1,
     col = "blue", 
     xlab = "1 - specificity",
     ylab = "sensitvity",
     main = "ROC Curve Comparison")

lines(1-spe.2layer, sen.2layer, lty = 1, col = "darkred")
lines(1-spe.logit, sen.logit, lty = 1, col = "darkgreen")
legend("bottomright", 
       legend = c(paste("1-layer MLP (AUC =", round(perf.1layer$auc, 3), ")"),
                  paste("2-layer MLP (AUC =", round(perf.2layer$auc, 3), ")"),
                  paste("Logistic Reg (AUC =", round(logit.auc, 3), ")")),
                col = c("blue", "darkred", "darkgreen"), 
                lty = 1, cex = 0.7, bty = "n")
```


# Results/Conclusion

Perceptrons can help give you a general understanding of how neural networks can predict a response variable but may be too simple.

MLP can be better since they work well with complex/nonlinear data. However, 2 layer MLP may not necessarily be better than 1 layer. Two layers may be too complex, so 1 layer MLP is better since it is simpler and can potentially generate similar accuracies.

For classification, MLP may not necessarily be better than standard logistic. Even though MLP can work well with nonlinear data, it may be more prone to overfitting. Standard logistic has high interpretability, computational efficiency, and is robust to overfitting.

For regression, MLP can also help with nonlinear data but we can take simpler models and computational resources into account.


















