Introduction

This project uses student data from Portuguese secondary schools to help identify students who may be at risk of failing their Portuguese language course. By detecting these students early, schools can provide targeted support to improve academic outcomes and reduce failure rates.

Using this dataset, we apply machine learning techniques to build predictive models that classify students as either “at-risk” or “not at-risk” based on demographic, behavioral, and academic indicators. This analysis aims to inform early intervention strategies that can be implemented by educators before final grades are issued.

Our approach includes exploratory data analysis, class imbalance handling, model training using both tree-based and neural network methods, and a thorough comparison of model performance. The ultimate goal is to identify a solution that balances predictive accuracy with practical impact in educational settings

Objective

The goal of this project is to build a machine learning model that can accurately identify students at risk of failing their Portuguese language course. Specifically, we define “at-risk” students as those whose final grade (G3) is below 10 on a 20-point scale.

To achieve this, we frame the task as a binary classification problem, where the model predicts whether a student falls into the “at-risk” or “not at-risk” category based on a combination of academic performance, study habits, and socio-demographic features.

We aim to:

  • Prioritize recall, to ensure that most at-risk students are identified early.
  • Explore and compare multiple models, including Random Forest and Neural Networks, to find the best-performing approach.
  • Evaluate models using key metrics such as accuracy, precision, recall, F1-score, and AUC.
  • Make recommendations based on both predictive performance and the real-world educational impact of the results..

Business Problem

Educators often struggle to identify students who are likely to fail before final grades are issued. Without early warning, students may miss the opportunity to receive targeted support that could help them succeed. Schools need a data-driven approach to proactively identify at-risk students and intervene before it’s too late.

Data Science Problem

We frame this challenge as a binary classification task: predicting whether a student is “at risk” (final grade < 10) using a variety of features such as study time, family background, and past performance. This requires selecting appropriate machine learning algorithms, addressing class imbalance, and optimizing performance for high recall to capture as many struggling students as possible.

Load the Dataset and necessary libraries

we begin by loading the dataset, importing the relevant libraries and explore.

# Load necessary libraries
library(knitr)
library(tidyverse)
library(corrplot)
library(dplyr)
library(caret)
library(ROSE)
library(kableExtra)
library(randomForest)
library(nnet)
library(pROC)
library(ROCR)
library(doParallel)
library(ggplot2)
library(fmsb)
library(tidytext)
# Load the Portuguese dataset
df <- read.csv("student-por.csv", sep = ";")

# Preview the first few rows of the dataset
kable(head(df, 10), caption = "Preview of the Bank Dataset")
Preview of the Bank Dataset
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 4 0 11 11
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 2 9 11 11
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 0 yes no no no yes yes yes no 4 3 2 2 3 3 6 12 13 12
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes no yes yes yes yes yes 3 2 2 1 1 5 0 14 14 14
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes no no yes yes no no 4 3 2 1 2 5 0 11 13 13
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes no yes yes yes yes no 5 4 2 1 2 5 6 12 12 13
GP M 16 U LE3 T 2 2 other other home mother 1 2 0 no no no no yes yes yes no 4 4 4 1 1 3 0 13 12 13
GP F 17 U GT3 A 4 4 other teacher home mother 2 2 0 yes yes no no yes yes no no 4 1 4 1 1 1 2 10 13 13
GP M 15 U LE3 A 3 2 services other home mother 1 2 0 no yes no no yes yes yes no 4 2 2 1 1 1 0 15 16 17
GP M 15 U GT3 T 3 4 other other home mother 1 2 0 no yes no yes yes yes yes no 5 5 1 1 1 5 0 12 12 13

The dataset contains student-level information with academic performance (G1–G3), personal and family background, and support indicators. This offers a well-rounded view of factors that may influence risk.

Overview of the Dataset

Check the structure of the dataset:

str(df)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

There are 649 observations and 33 variables, including numeric and categorical features such as grades (G1, G2, G3), family background, school support, and lifestyle factors.

Key Features in the Dataset:

  • school, sex, age, address, famsize, Pstatus: Basic demographic information
  • Medu, Fedu: Mother’s and father’s education levels (0 to 4 scale)
  • Mjob, Fjob, guardian: Parents’ occupations and legal guardian
  • traveltime, studytime: Time spent commuting to school and studying weekly
  • failures: Number of past class failures
  • schoolsup, famsup, paid, activities: Extra educational support indicators
  • nursery, higher, internet, romantic: Early education, future intent, and lifestyle
  • famrel, freetime, goout, Dalc, Walc, health: Family relationship, leisure, alcohol use, and self-reported health
  • absences: Number of school absences
  • G1, G2, G3: Grades from three terms (G3 is the final grade and our outcome)

These features span personal, academic, and behavioral factors that can be leveraged to predict which students may be at risk of failing.

Exploratory Data Analysis (EDA)

Check for duplicates:

sum(duplicated(df))
## [1] 0

Result: ‘0’ - There are no duplicate rows in the dataset.

Check for missing data:

colSums(is.na(df)) 
##     school        sex        age    address    famsize    Pstatus       Medu 
##          0          0          0          0          0          0          0 
##       Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
##          0          0          0          0          0          0          0 
##   failures  schoolsup     famsup       paid activities    nursery     higher 
##          0          0          0          0          0          0          0 
##   internet   romantic     famrel   freetime      goout       Dalc       Walc 
##          0          0          0          0          0          0          0 
##     health   absences         G1         G2         G3 
##          0          0          0          0          0

Result: All columns have 0 missing values — this dataset is complete.

Check for ‘unknown-like’ values:

# Count common 'unknown-like' values in all columns
df %>%
  summarise(across(everything(), ~sum(. %in% c("unknown", "", "?", NA))))
##   school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
## 1      0   0   0       0       0       0    0    0    0    0      0        0
##   traveltime studytime failures schoolsup famsup paid activities nursery higher
## 1          0         0        0         0      0    0          0       0      0
##   internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
## 1        0        0      0        0     0    0    0      0        0  0  0  0

Result: All values returned are 0, so there are no placeholder or unknown-like values either.

From above, we verifying the quality of the dataset. There are no duplicate rows, no missing values, and no placeholder values such as “unknown” or “?”. This indicates that the dataset is clean and ready for exploratory analysis without needing imputation or filtering.

Check summary statistics of the dataset

summary(df)
##     school              sex                 age          address         
##  Length:649         Length:649         Min.   :15.00   Length:649        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.74                     
##                                        3rd Qu.:18.00                     
##                                        Max.   :22.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:649         Length:649         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :2.000   Median :2.000  
##                                        Mean   :2.515   Mean   :2.307  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:649        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.569   Mean   :1.931   Mean   :0.2219                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:649         Length:649         Length:649         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.931  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime        goout            Dalc            Walc          health     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.00   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.18   Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1             G2              G3       
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00

Summary of Dataset The dataset includes academic, demographic, and behavioral aspects. The age range is 15–22, with most students around 16–18. Final grades (G3) range from 0 to 19, with a median around 12.

Parental education levels (Medu, Fedu) vary widely, and alcohol consumption (Dalc, Walc) skews low. Most students report no prior failures, and absences range from 0 to 32 — an unusually high tail that may affect modeling.

Overall, the dataset presents a rich variety of features relevant to academic risk, with a few variables warranting further inspection for skew and outliers.

Feature Distributions:

Distribution of Numeric Features

# Select only numeric columns
numeric_cols <- df %>% select(where(is.numeric))

# Reshape data
numeric_cols %>%
  pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%

# Plot 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 15, fill = "steelblue", color = "black") +
  facet_wrap(~ feature, scales = "free") +
  labs(title = "Distribution of Numeric Features", x = "Value", y = "Count") +
  theme_minimal(base_size = 13) +
  theme(strip.text = element_text(face = "bold"))

This plot shows the distribution of all numeric features. We see right-skewed patterns in absences and alcohol consumption, while grade variables (G1, G2, G3) are more symmetric. G3 sharply declines around 10, reinforcing the logic behind our at-risk threshold.

Box-plots for Outlier Detection:

# Boxplots for key numeric variables
df %>%
  select(absences, failures, G1, G2) %>%
  pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  labs(title = "Boxplots for Outlier Detection",
       x = "Feature",
       y = "Value") +
  theme_minimal()

Analysis: The boxplots above reveal a few outliers in features like absences, failures, G1, and G2. The most extreme values appear in absences, where some students missed over 30 days — well beyond the typical range. However, outliers in failures, G1, and G2 appear reasonable and reflect meaningful academic struggles.

We chose not to remove or transform these outliers before modeling. This is intentional: our selected models — Random Forest and Neural Network — are generally robust to outliers. Random Forest makes decisions based on threshold splits and is less sensitive to extreme values. Neural Networks, while more sensitive to scaling, can still handle outliers when data is preprocessed consistently. Therefore, we retained these values to preserve potentially important variation in student performance.

Distribution of Categorical Features:

To visualize how categorical variables are distributed across students, we first reshape the data into a tidy format. This allows us to group features thematically and explore patterns in factors like demographics, support, and parental background — all of which may influence student performance.

categorical_data <- df %>%
  select(where(is.character)) %>%  # Select categorical variables
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") 
glimpse(categorical_data)  # Ensure Feature and Value exist
## Rows: 11,033
## Columns: 2
## $ Feature <chr> "school", "sex", "address", "famsize", "Pstatus", "Mjob", "Fjo…
## $ Value   <chr> "GP", "F", "U", "GT3", "A", "at_home", "teacher", "course", "m…

To make the plots more interpretable, we organize the categorical features into three thematic groups: support and lifestyle, demographics, and parental background. This structure will help us analyze trends within and across these areas.

# Prepare base data
categorical_data <- df %>%
  select(where(is.character)) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

# Define groups
group1_vars <- c("schoolsup", "famsup", "paid", "internet", "activities", "higher", "nursery")
group2_vars <- c("school", "sex", "address", "famsize", "Pstatus", "romantic")
group3_vars <- c("guardian", "Mjob", "Fjob", "reason")

We now visualize each group separately using bar charts to explore category frequency within each feature. This breakdown helps uncover patterns that may relate to student risk and model predictors.

# Group 1: Support & Lifestyle
categorical_data %>%
  filter(Feature %in% group1_vars) %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "steelblue") +
  facet_wrap(~ Feature, scales = "free_y", ncol = 2) +
  coord_flip() +
  labs(title = "Support and Lifestyle Features",
       x = "Category", y = "Count") +
  theme_minimal(base_size = 12)

Analysis: Most students reported internet access and aspirations for higher education, while fewer received school support or enrolled in paid services. These indicators may reflect disparities in academic resources, which could relate to performance risk.

# Group 2: Demographics
categorical_data %>%
  filter(Feature %in% group2_vars) %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "darkorange") +
  facet_wrap(~ Feature, scales = "free_y", ncol = 2) +
  coord_flip() +
  labs(title = "Demographic Features",
       x = "Category", y = "Count") +
  theme_minimal(base_size = 12)

Analysis: We observe a slight gender skew toward female students and urban residency. Most students live in larger families and with both parents. These demographics offer contextual clues about their learning environments.

# Group 3: Parental Background
categorical_data %>%
  filter(Feature %in% group3_vars) %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "seagreen") +
  facet_wrap(~ Feature, scales = "free_y", ncol = 2) +
  coord_flip() +
  labs(title = "Parental Background Features",
       x = "Category", y = "Count") +
  theme_minimal(base_size = 12)

Analysis: Mothers are commonly reported as the main guardians. Many parents work in broad or service-related jobs. The reasons cited for choosing a school may hint at parental involvement and expectations.

These categorical features offer context that complements academic metrics, helping us build a more complete picture of factors that may influence student performance.

Creating the new at_risk variable: To guide our predictive modeling, we define a binary target variable, at_risk, based on the final grade G3. A student is considered “at risk” if their final grade is below 10, indicating potential failure in the course.

# Create binary target variable
df <- df %>%
  mutate(at_risk = ifelse(G3 < 10, "Yes", "No")) %>%
  mutate(at_risk = factor(at_risk))

# Check the distribution of the target
table(df$at_risk)
## 
##  No Yes 
## 549 100

Out of 649 students, 100 (approximately 15%) are classified as at risk of failing the Portuguese course based on their final grade (G3). This reveals a class imbalance in our target variable — an important consideration when selecting and evaluating machine learning models.

To better understand what contributes to risk, we now explore how academic indicators such as study time, previous failures, and early grades (G1 and G2) differ between at-risk and not-at-risk students. These insights can help identify early warning signs and inform which predictors may be most useful for modeling.

# Summary statistics of numeric variables by risk group
df %>%
  group_by(at_risk) %>%
  summarise(
    avg_absences = mean(absences, na.rm = TRUE),
    avg_studytime = mean(studytime, na.rm = TRUE),
    avg_failures = mean(failures, na.rm = TRUE),
    avg_G1 = mean(G1, na.rm = TRUE),
    avg_G2 = mean(G2, na.rm = TRUE)
  )
## # A tibble: 2 × 6
##   at_risk avg_absences avg_studytime avg_failures avg_G1 avg_G2
##   <fct>          <dbl>         <dbl>        <dbl>  <dbl>  <dbl>
## 1 No              3.49          1.99        0.126  12.1   12.3 
## 2 Yes             4.61          1.61        0.75    7.78   7.53

We do not include G3 in our exploratory analysis, as it is used to define the at_risk outcome. Instead, we examine earlier indicators such as G1, G2, study time, and absences, which may help predict final performance. Above we see that ‘At-risk’ students tend to have higher absences and past failures, and much lower first- and second-period grades (G1, G2), indicating early signs of struggle.

Study time by risk group:

ggplot(df, aes(x = factor(studytime), fill = at_risk)) +
  geom_bar(position = "fill") +
  labs(x = "Weekly Study Time (1 = <2 hrs, 4 = >10 hrs)",
       y = "Proportion",
       fill = "At Risk?",
       title = "Study Time Distribution by Risk") +
  scale_fill_manual(values = c("No" = "#d9d9d9", "Yes" = "#f4a261")) +
  theme_minimal()

Analysis: Students who study fewer than 2 hours per week (studytime = 1) have the highest proportion of those at risk, highlighting a clear link between limited study time and academic risk.

Visualize Academic Struggles – Failures & Absences:

ggplot(df, aes(x = factor(failures), fill = at_risk)) +
  geom_bar(position = "fill") +
  labs(x = "Number of Past Class Failures",
       y = "Proportion",
       fill = "At Risk?",
       title = "Past Failures by Risk Status") +
  scale_fill_manual(values = c("No" = "#d9d9d9", "Yes" = "#f4a261")) +
  theme_minimal()

Analysis: Students with more past failures are significantly more likely to be at risk. Nearly two-thirds of those with three previous failures fall into the at-risk group, underscoring the strong correlation between academic history and final outcomes.

Internet Access by Risk Status: Since internet access may reflect students’ access to resources or broader socioeconomic factors, we examine whether this variable is associated with academic risk. Comparing access by at_risk status helps reveal whether digital inequality contributes to performance gaps.

ggplot(df, aes(x = internet, fill = at_risk)) +
  geom_bar(position = "fill") +
  labs(title = "Internet Access by Risk Status",
       x = "Internet Access",
       y = "Proportion",
       fill = "At Risk?") +
 scale_fill_manual(values = c("No" = "#d9d9d9", "Yes" = "#f4a261")) +
  theme_minimal()

Students without internet access show a noticeably higher proportion of academic risk, hinting at the potential impact of digital resource availability on performance.

Check Correlation Between Features:

To better understand the linear relationships among numeric features, we compute and visualize the correlation matrix. This helps uncover strong associations (e.g., between early and final grades) and detect multicollinearity, which can guide model design and feature selection.

# Select only numeric variables
numeric_vars <- df %>%
  select(where(is.numeric))

# Compute correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")

# Plot the correlation matrix
corrplot(cor_matrix,
         method = "color",         # colored squares
         type = "upper",           # show only upper triangle
         order = "hclust",         # cluster similar variables
         addCoef.col = "black",    # show correlation coefficients
         tl.col = "black",         # label color
         number.cex = 0.6)         # coefficient size

Analysis: The correlation matrix reveals several important relationships:

Strong positive correlations:

  • G1 and G2 (r ≈ 0.86)
  • G2 and G3 (r ≈ 0.91)
  • G1 and G3 (r ≈ 0.83)

These indicate that earlier grades are strong predictors of the final grade, supporting their inclusion in modeling.

Moderate relationships:

  • Medu and Fedu (r ≈ 0.65) — suggesting that parents’ education levels are related.
  • Dalc and Walc (r ≈ 0.62) — showing that weekday and weekend alcohol consumption are aligned.

Weak or negligible correlations:

  • Most other features show low correlation with the target-related variables (G3, etc.), highlighting the importance of using nonlinear or ensemble models that can capture complex interactions beyond linear dependencies.

Algorithm Selection:

To fulfill the assignment requirement and build a predictive model for identifying at-risk students, we selected one algorithm from Weeks 1–10 and another from Weeks 11–15 based on the dataset’s characteristics and our project objectives.

Random Forest (week 1-10)

Random Forest is an ensemble method that constructs multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting. It is well-suited for this task because:

  • It handles both numeric and categorical data without much preprocessing.

  • It is robust to outliers and noise.

  • It naturally handles non-linear relationships and feature interactions.

  • It provides feature importance, which is useful for identifying key risk indicators.

Given the imbalanced nature of the dataset and the mix of predictor types, Random Forest offers a reliable, interpretable, and effective modeling approach.

Neural Network (week 11-15)

Neural Networks are powerful function approximators capable of capturing complex patterns in data. For this project, a simple feedforward neural network will be used to:

  • Model non-linear relationships among features and the target.

  • Handle multi-dimensional interactions that may be missed by simpler models.

  • Compare performance with a more interpretable tree-based model.

Although less interpretable, Neural Networks provide a performance benchmark and allow us to assess whether more complex architectures yield significant gains in predictive accuracy.

With the algorithms selected — Random Forest for interpretability and Neural Network for modeling complexity — we now prepare the data to ensure compatibility and performance across both models. This involves encoding categorical variables, scaling numeric features, and addressing class imbalance before training begins.

Feature Selection and Data Leakage

Before training our models, we remove G1 and G2 — the first and second grading period scores — from our dataset. Although these variables are strong predictors of the final grade (G3), they introduce data leakage.

What is data leakage? - Data leakage occurs when information not available at prediction time is used during model training. This leads to overly optimistic results that won’t generalize in real-world use.

In our case, at_risk is defined based on G3, and since G1 and G2 are earlier grades closely tied to G3, they directly reflect the student’s academic trajectory and unfairly reveal part of the outcome. To maintain a realistic prediction scenario — especially for early interventions — we exclude these variables to avoid leakage and ensure model integrity.

Data Preprocessing

To ensure our data is suitable for both tree-based and neural network models, we perform essential pre-processing steps. These include removing grade-related leakage variables (G1, G2, G3), converting categorical variables to numeric using one-hot encoding, and preparing the final dataset with clearly separated predictors and labels. This sets the stage for consistent and fair model comparison.

# Recreate the target column
df$at_risk <- ifelse(df$G3 < 10, "Yes", "No") |> factor()

# Drop G1, G2, G3
df_model <- df |> select(-G1, -G2, -G3)

# Separate predictors and target
target <- df_model$at_risk
df_features <- df_model |> select(-at_risk)

# One-hot encode categorical predictors
dummies <- dummyVars(~ ., data = df_features)
df_encoded <- predict(dummies, newdata = df_features) |> as.data.frame()

# Add target column back
df_encoded$at_risk <- target

# Check structure
str(df_encoded)
## 'data.frame':    649 obs. of  57 variables:
##  $ schoolGP        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ schoolMS        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ sexF            : num  1 1 1 1 1 0 0 1 0 0 ...
##  $ sexM            : num  0 0 0 0 0 1 1 0 1 1 ...
##  $ age             : num  18 17 15 15 16 16 16 17 15 15 ...
##  $ addressR        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ addressU        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ famsizeGT3      : num  1 1 0 1 1 0 0 1 0 1 ...
##  $ famsizeLE3      : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ PstatusA        : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ PstatusT        : num  0 1 1 1 1 1 1 0 0 1 ...
##  $ Medu            : num  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu            : num  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjobat_home     : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ Mjobhealth      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Mjobother       : num  0 0 0 0 1 0 1 1 0 1 ...
##  $ Mjobservices    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ Mjobteacher     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fjobat_home     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fjobhealth      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fjobother       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ Fjobservices    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Fjobteacher     : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ reasoncourse    : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ reasonhome      : num  0 0 0 1 1 0 1 1 1 1 ...
##  $ reasonother     : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ reasonreputation: num  0 0 0 0 0 1 0 0 0 0 ...
##  $ guardianfather  : num  0 1 0 0 1 0 0 0 0 0 ...
##  $ guardianmother  : num  1 0 1 1 0 1 1 1 1 1 ...
##  $ guardianother   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ traveltime      : num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime       : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsupno     : num  0 1 0 1 1 1 1 0 1 1 ...
##  $ schoolsupyes    : num  1 0 1 0 0 0 0 1 0 0 ...
##  $ famsupno        : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ famsupyes       : num  0 1 0 1 1 1 0 1 1 1 ...
##  $ paidno          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ paidyes         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ activitiesno    : num  1 1 1 0 1 0 1 1 1 0 ...
##  $ activitiesyes   : num  0 0 0 1 0 1 0 0 0 1 ...
##  $ nurseryno       : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ nurseryyes      : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ higherno        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ higheryes       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ internetno      : num  1 0 0 0 1 0 0 1 0 0 ...
##  $ internetyes     : num  0 1 1 1 0 1 1 0 1 1 ...
##  $ romanticno      : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ romanticyes     : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ famrel          : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime        : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout           : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc            : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc            : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health          : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences        : num  4 2 6 0 0 6 0 2 0 0 ...
##  $ at_risk         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

The resulting dataset contains only numeric predictors and a binary target variable (at_risk), making it ready for training both interpretable and complex models. With preprocessing complete, we can now proceed to handle class imbalance and begin model experimentation.

Stratified Sampling and Resampling to Address Class Imbalance

To ensure a fair evaluation and avoid data leakage, we first split the dataset using stratified sampling — preserving the proportion of at-risk vs. not-at-risk students in both sets. We then applied the ROSE technique only to the training set to synthetically balance the classes before model training.

# Set seed for reproducibility
set.seed(123)

# Split data into training (80%) and test (20%)
split_index <- createDataPartition(df_encoded$at_risk, p = 0.8, list = FALSE)

train_data <- df_encoded[split_index, ]
test_data  <- df_encoded[-split_index, ]

# Apply ROSE only on the training set
train_data <- ROSE(at_risk ~ ., data = train_data, seed = 123)$data

# Check new class balance
kable(as.data.frame(table(train_data $at_risk)),
      col.names = c("Risk Status", "Count"),
      caption = "Class Distribution After ROSE Resampling")
Class Distribution After ROSE Resampling
Risk Status Count
No 278
Yes 242

The ROSE resampling method successfully balanced the training data: from an original distribution of 549 “No” vs. 100 “Yes” (highly imbalanced) to a more equitable 278 “No” vs. 242 “Yes.” This helps prevent the model from being biased toward the majority class and improves its ability to detect at-risk students.

Experimentation & Model Training:

To systematically track and compare all 4 experiments, we create a results data frame that logs each experiment’s metrics.

results_df <- data.frame(
  Model = character(),
  Accuracy = numeric(),
  Precision = numeric(),
  Recall = numeric(),
  F1_Score = numeric(),
  AUC = numeric(),
  stringsAsFactors = FALSE
)

Random Forest (Default)

Let’s now begin model experimentation, starting with a default Random Forest classifier to establish a baseline.

set.seed(321)

# Train default Random Forest model
rf_model <- randomForest(at_risk ~ ., data = train_data)

# Predict on test data
rf_pred <- predict(rf_model, test_data, type = "class")
rf_prob <- predict(rf_model, test_data, type = "prob")[, "Yes"]

# Confusion matrix and evaluation
conf_rf <- confusionMatrix(rf_pred, test_data$at_risk, positive = "Yes")
roc_rf <- pROC::roc(response = test_data$at_risk, predictor = rf_prob)

# Extract metrics
rf_accuracy <- conf_rf$overall["Accuracy"]
rf_precision <- conf_rf$byClass["Precision"]
rf_recall <- conf_rf$byClass["Recall"]
rf_f1 <- conf_rf$byClass["F1"]
rf_auc <- auc(roc_rf)

# Display Results
cat(sprintf("\nRandom Forest (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
            rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc))
## 
## Random Forest (Default) - Accuracy: 0.7907, Precision: 0.4000, Recall: 0.7000, F1-score: 0.5091, AUC: 0.8454
# Add to results dataframe
results_df <- rbind(results_df, data.frame(
  Model = "Random Forest (Default)",
  Accuracy = round(rf_accuracy, 3),
  Precision = round(rf_precision, 3),
  Recall = round(rf_recall, 3),
  F1_Score = round(rf_f1, 3),
  AUC = round(rf_auc, 3)
))
rownames(results_df) <- NULL

Analysis: The default Random Forest model achieved solid performance with an AUC of 0.845 and a recall of 70%, which is our priority metric. This means the model successfully identified the majority of students at risk of failing. However, its precision was relatively low (40%), indicating a higher rate of false positives — students incorrectly flagged as at risk. This trade-off is acceptable in early intervention scenarios where catching more at-risk students is critical, even if it means including a few who aren’t.

Random Forest (Tuned)

To boost recall while minimizing false negatives, we enhance the Random Forest model using cost-sensitive learning. This approach assigns greater weight to the minority “Yes” class (at-risk students), helping the model prioritize identifying them even at the cost of some precision. Also, we tune the Random Forest model using 10-fold cross-validation. We optimize the mtry parameter — the number of predictors randomly sampled at each split — aiming to balance sensitivity with model stability.

Custom Evaluation Metric: We’ll extract Accuracy, Recall (priority), Precision, and F1-score for model comparison.

custom_summary <- function(data, lev = NULL, model = NULL) {
  cm <- caret::confusionMatrix(data$pred, data$obs, positive = lev[2])
  accuracy <- as.numeric(cm$overall["Accuracy"])
  recall <- as.numeric(cm$byClass["Recall"])
  precision <- as.numeric(cm$byClass["Precision"])
  f1 <- as.numeric(cm$byClass["F1"])
  c(Accuracy = accuracy, Recall = recall, Precision = precision, F1 = f1)
}

Model Training with Cross-validation and class weighting:

ctrl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = custom_summary,
  classProbs = TRUE,
  savePredictions = TRUE
)

# Apply class weights: penalize false negatives
weights <- ifelse(train_data$at_risk == "Yes", 2, 1)

set.seed(456)
rf_weighted <- train(
  at_risk ~ ., 
  data = train_data,
  method = "rf",
  metric = "Recall",
  trControl = ctrl,
  weights = weights,
  tuneLength = 5
)

Model Evaluation on Test Data optional threshold tuning:

# Predict probabilities and labels
rf_probs <- predict(rf_weighted, test_data, type = "prob")[, "Yes"]

# Apply threshold tuning if needed (e.g., 0.4 instead of 0.5)
threshold <- 0.4
rf_preds <- factor(ifelse(rf_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))

# Evaluate performance
conf_rf <- confusionMatrix(rf_preds, test_data$at_risk, positive = "Yes")
roc_rf <- pROC::roc(test_data$at_risk, rf_probs)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
# Extract metrics
rf_accuracy <- conf_rf$overall["Accuracy"]
rf_precision <- conf_rf$byClass["Precision"]
rf_recall <- conf_rf$byClass["Recall"]
rf_f1 <- conf_rf$byClass["F1"]
rf_auc <- auc(roc_rf)

# Log results
cat(sprintf("\nRandom Forest (Weighted, Tuned, Threshold=%.2f) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f, AUC: %.4f\n",
            threshold, rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc))
## 
## Random Forest (Weighted, Tuned, Threshold=0.40) - Accuracy: 0.6589, Precision: 0.2857, Recall: 0.8000, F1: 0.4211, AUC: 0.8085
results_df <- rbind(results_df, data.frame(
  Model = "Random Forest (Tuned, Weighted + Threshold)",
  Accuracy = round(rf_accuracy, 3),
  Precision = round(rf_precision, 3),
  Recall = round(rf_recall, 3),
  F1_Score = round(rf_f1, 3),
  AUC = round(rf_auc, 3)
))
rownames(results_df) <- NULL

Analysis: Random Forest (Weighted, Tuned + Threshold) By applying cost-sensitive learning (via class weights) and adjusting the decision threshold to 0.40, the model improved recall to 80%, correctly identifying the majority of at-risk students — a key priority for early intervention. However, this gain came at the cost of lower precision (28.57%) and overall accuracy (65.89%), signaling more false positives. The model’s AUC remains strong (0.81), suggesting it still separates the classes well despite the adjusted decision boundary.

This trade off is acceptable given the project’s goal: prioritizing recall to avoid missing students who need support.

Neural Network (Default)

o support early intervention efforts in education, we train a basic feedforward Neural Network using the nnet package. While not as interpretable as decision trees, neural networks can model complex, non-linear relationships — which is valuable when student risk depends on multiple overlapping factors like study habits, support systems, and background.

In this default setup:

  • We use 5 hidden units to allow for some pattern complexity without overfitting.

  • We limit training to 200 iterations for computational efficiency.

  • A small decay value (0.01) applies regularization to reduce the risk of overfitting on our imbalanced data.

Activation Function (Educational Implication): The network uses a sigmoid activation function, which outputs a value between 0 and 1 — interpreted as the predicted probability that a student is “at risk.” This is particularly useful in education, where decisions about support often require probabilistic thresholds rather than rigid labels. By outputting risk as a probability, educators can prioritize interventions with more flexibility.

# Set seed for reproducibility
set.seed(789)

# Train default neural network (single hidden layer)
nn_model <- nnet(at_risk ~ ., 
                 data = train_data, 
                 size = 5,        # hidden units
                 maxit = 200,     # maximum iterations
                 decay = 0.01,    # regularization
                 trace = FALSE)

# Predict probabilities
nn_prob <- predict(nn_model, test_data, type = "raw")

# Class prediction using 0.5 threshold
nn_pred <- ifelse(nn_prob > 0.5, "Yes", "No")
nn_pred <- factor(nn_pred, levels = levels(test_data$at_risk))

# Ensure test target is also factor
test_data$at_risk <- factor(test_data$at_risk, levels = c("No", "Yes"))

# Evaluation
conf_nn <- confusionMatrix(nn_pred, test_data$at_risk, positive = "Yes")
roc_nn <- pROC::roc(response = test_data$at_risk, predictor = as.numeric(nn_prob))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
# Extract metrics
nn_accuracy <- conf_nn$overall["Accuracy"]
nn_precision <- conf_nn$byClass["Precision"]
nn_recall <- conf_nn$byClass["Recall"]
nn_f1 <- conf_nn$byClass["F1"]
nn_auc <- auc(roc_nn)

# Print results
cat(sprintf("\nNeural Network (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f, AUC: %.4f\n",
            nn_accuracy, nn_precision, nn_recall, nn_f1, nn_auc))
## 
## Neural Network (Default) - Accuracy: 0.7364, Precision: 0.3600, Recall: 0.9000, F1: 0.5143, AUC: 0.8422
# Add to results
results_df <- rbind(results_df, data.frame(
  Model = "Neural Network (Default)",
  Accuracy = round(nn_accuracy, 3),
  Precision = round(nn_precision, 3),
  Recall = round(nn_recall, 3),
  F1_Score = round(nn_f1, 3),
  AUC = round(nn_auc, 3)
))
rownames(results_df) <- NULL

Analysis: The default Neural Network achieved the highest recall (90%) so far, indicating strong sensitivity to at-risk students, though precision (36%) remains modest, suggesting a higher false positive rate.

Next, we tune the neural network’s architecture and regularization to explore whether we can maintain high recall while improving overall balance across performance metrics.

Neural Network (Tuned)

To optimize performance, we tune the neural network using 10-fold cross-validation. We vary two key hyperparameters:

  • size: Number of hidden units

  • decay: Regularization parameter to prevent overfitting

We aim to preserve high recall while improving precision and overall balance.

set.seed(1010)

ctrl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = custom_summary,
  classProbs = TRUE,
  savePredictions = TRUE
)

grid <- expand.grid(
  size = c(3, 5, 7),        # try different hidden layer sizes
  decay = c(0.001, 0.01, 0.1) # different levels of regularization
)

nn_tuned <- train(
  at_risk ~ ., 
  data = train_data,
  method = "nnet",
  trControl = ctrl,
  metric = "Recall",   # focus on recall
  tuneGrid = grid,
  maxit = 300,
  trace = FALSE
)

Evaluate Tuned NN on Test Set:

# Predict on test data
nn_tuned_pred <- predict(nn_tuned, test_data)
nn_tuned_prob <- predict(nn_tuned, test_data, type = "prob")[, "Yes"]

# Ensure correct factor levels
nn_tuned_pred <- factor(nn_tuned_pred, levels = levels(test_data$at_risk))

# Confusion matrix & ROC
conf_nn_tuned <- confusionMatrix(nn_tuned_pred, test_data$at_risk, positive = "Yes")
roc_nn_tuned <- pROC::roc(response = test_data$at_risk, predictor = nn_tuned_prob)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
# Extract metrics
nn_tuned_accuracy <- conf_nn_tuned$overall["Accuracy"]
nn_tuned_precision <- conf_nn_tuned$byClass["Precision"]
nn_tuned_recall <- conf_nn_tuned$byClass["Recall"]
nn_tuned_f1 <- conf_nn_tuned$byClass["F1"]
nn_tuned_auc <- auc(roc_nn_tuned)

# Print results
cat(sprintf("\nNeural Network (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f, AUC: %.4f\n",
            nn_tuned_accuracy, nn_tuned_precision, nn_tuned_recall, nn_tuned_f1, nn_tuned_auc))
## 
## Neural Network (Tuned) - Accuracy: 0.7442, Precision: 0.3488, Recall: 0.7500, F1: 0.4762, AUC: 0.8289
# Log to results
results_df <- rbind(results_df, data.frame(
  Model = "Neural Network (Tuned)",
  Accuracy = round(nn_tuned_accuracy, 3),
  Precision = round(nn_tuned_precision, 3),
  Recall = round(nn_tuned_recall, 3),
  F1_Score = round(nn_tuned_f1, 3),
  AUC = round(nn_tuned_auc, 3)
))
rownames(results_df) <- NULL

Analysis: The tuned Neural Network achieved 75% recall, successfully identifying most at-risk students. While precision (34.9%) remains modest, the model maintains good overall accuracy (74%) and AUC (0.829). This balance makes it a solid option for early intervention, though false positives may still need to be addressed through future tuning or educator input.

Results and Visualization:

results_df %>%
  kable("html", digits = 3, caption = "Model Performance Summary") %>%
  kable_styling(full_width = F) %>%
  row_spec(which.max(results_df$Recall), bold = TRUE, background = "#DFF0D8") %>%
  row_spec(which.max(results_df$F1_Score), bold = TRUE, background = "#D9EDF7")
Model Performance Summary
Model Accuracy Precision Recall F1_Score AUC
Random Forest (Default) 0.791 0.400 0.70 0.509 0.845
Random Forest (Tuned, Weighted + Threshold) 0.659 0.286 0.80 0.421 0.808
Neural Network (Default) 0.736 0.360 0.90 0.514 0.842
Neural Network (Tuned) 0.744 0.349 0.75 0.476 0.829

All models performed reasonably well, but the default Neural Network achieved the highest recall (0.90) and F1-score (0.514), making it the most effective for identifying at-risk students.

To visualize these metrics more clearly across models, we now turn to a grouped bar chart.

results_long <- results_df %>%
  pivot_longer(cols = -Model, names_to = "Metric", values_to = "Score")

# Order metrics to keep them grouped
results_long$Metric <- factor(results_long$Metric, levels = c("Accuracy", "AUC", "F1_Score", "Precision", "Recall"))

# Plot
ggplot(results_long, aes(x = Metric, y = Score, fill = Model)) +
  geom_col(position = position_dodge()) +
  labs(
    title = "Comparison of Model Performance Metrics",
    x = "Metric", y = "Score"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right",
    legend.title = element_blank()
  )

The grouped bar chart highlights that Neural Network (Default) outperforms others in recall and F1-score, whereas Random Forest (Default) leads in accuracy and AUC.

To compare the models holistically across all metrics in a compact form, we now explore a radar chart.

# Prepare data
radar_data <- results_df %>%
  column_to_rownames("Model") %>%
  select(Accuracy, Precision, Recall, F1_Score, AUC)

# Add max and min rows for scaling
radar_data <- rbind(apply(radar_data, 2, max), 
                    apply(radar_data, 2, min), 
                    radar_data)

# Radar chart
radarchart(radar_data,
           axistype = 1,
           pcol = c("blue", "green", "red", "purple"),
           pfcol = c(rgb(1,0,0,0.3), rgb(0,1,0,0.3), rgb(0,0,1,0.3), rgb(0.5,0,0.5,0.3)),
           plty = 1,
           plwd = 2,
           cglwd = 0.8,
           cglcol = "grey",
           axislabcol = "gray",
           title = "Radar Chart of Model Metrics")

# Legend
legend("topright", legend = rownames(radar_data)[-c(1,2)], 
       col = c("blue", "green", "red", "purple"), 
       lwd = 2, lty = 1, cex = 0.57)

The radar chart offers a visual summary, showing how the default Neural Network maintains strong balance across all metrics—especially recall—while the tuned Random Forest sacrifices accuracy and precision for higher recall.

Conclusion:Among all models tested, the Neural Network (Default) emerged as the best performer for this educational risk prediction task. With the highest recall (0.90) and strongest F1-score (0.514), it proved most effective at identifying at-risk students early—aligning directly with the project’s goal of timely intervention. While this came with some trade-offs in precision (36%), the F1-score balances this by capturing both sensitivity and reliability.

Although the Random Forest (Tuned + Weighted) model also achieved high recall (0.80), its lower precision (28.6%) and accuracy (65.9%) increased the risk of overwhelming support systems with false positives. Therefore, the Neural Network (Default) strikes the best balance between recall-focused performance and practical implementation, which makes it a suitable choice for early risk detection in educational settings.