This project uses student data from Portuguese secondary schools to help identify students who may be at risk of failing their Portuguese language course. By detecting these students early, schools can provide targeted support to improve academic outcomes and reduce failure rates.
Using this dataset, we apply machine learning techniques to build predictive models that classify students as either “at-risk” or “not at-risk” based on demographic, behavioral, and academic indicators. This analysis aims to inform early intervention strategies that can be implemented by educators before final grades are issued.
Our approach includes exploratory data analysis, class imbalance handling, model training using both tree-based and neural network methods, and a thorough comparison of model performance. The ultimate goal is to identify a solution that balances predictive accuracy with practical impact in educational settings.
The goal of this project is to build a machine learning model that can accurately identify students at risk of failing their Portuguese language course. Specifically, we define “at-risk” students as those whose final grade (G3) is below 10 on a 20-point scale.
To achieve this, we frame the task as a binary classification problem, where the model predicts whether a student falls into the “at-risk” or “not at-risk” category based on a combination of academic performance, study habits, and socio-demographic features.
We aim to address a practical challenge:
Educators often struggle to identify students who are likely to fail before final grades are issued. Without early warning, students may miss the opportunity to receive targeted support that could help them succeed. Schools need a data-driven approach to proactively identify at-risk students and intervene before it’s too late.
We frame this challenge as a binary classification task: predicting whether a student is “at risk” (final grade < 10) using a variety of features such as study time, family background, and past performance. This requires selecting appropriate machine learning algorithms, addressing class imbalance, and optimizing performance for high recall to capture as many struggling students as possible.
We begin by importing the relevant libraries, loading the dataset, and exploring its contents.
# Load necessary libraries
library(knitr)
library(tidyverse)
library(corrplot)
library(dplyr)
library(caret)
library(ROSE)
library(kableExtra)
library(randomForest)
library(nnet)
library(pROC)
library(ROCR)
library(doParallel)
library(ggplot2)
library(fmsb)
library(tidytext)
# Load the Portuguese dataset
df <- read.csv("student-por.csv", sep = ";")
# Preview the first few rows of the dataset
kable(head(df, 10), caption = "Preview of the Student Dataset")
| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 0 | yes | no | no | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | no | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | no | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | no | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 6 | 12 | 12 | 13 |
| GP | M | 16 | U | LE3 | T | 2 | 2 | other | other | home | mother | 1 | 2 | 0 | no | no | no | no | yes | yes | yes | no | 4 | 4 | 4 | 1 | 1 | 3 | 0 | 13 | 12 | 13 |
| GP | F | 17 | U | GT3 | A | 4 | 4 | other | teacher | home | mother | 2 | 2 | 0 | yes | yes | no | no | yes | yes | no | no | 4 | 1 | 4 | 1 | 1 | 1 | 2 | 10 | 13 | 13 |
| GP | M | 15 | U | LE3 | A | 3 | 2 | services | other | home | mother | 1 | 2 | 0 | no | yes | no | no | yes | yes | yes | no | 4 | 2 | 2 | 1 | 1 | 1 | 0 | 15 | 16 | 17 |
| GP | M | 15 | U | GT3 | T | 3 | 4 | other | other | home | mother | 1 | 2 | 0 | no | yes | no | yes | yes | yes | yes | no | 5 | 5 | 1 | 1 | 1 | 5 | 0 | 12 | 12 | 13 |
The dataset contains student-level information with academic performance (G1–G3), personal and family background, and support indicators. This offers a well-rounded view of factors that may influence risk.
Check the structure of the dataset:
str(df)
## 'data.frame': 649 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "no" "no" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
There are 649 observations and 33 variables, including numeric and categorical features such as grades (G1, G2, G3), family background, school support, and lifestyle factors.
Key Features in the Dataset:

- school, sex, age, address, famsize, Pstatus: Basic demographic information
- Medu, Fedu: Mother’s and father’s education levels (0 to 4 scale)
- Mjob, Fjob, guardian: Parents’ occupations and legal guardian
- traveltime, studytime: Time spent commuting to school and studying weekly
- failures: Number of past class failures
- schoolsup, famsup, paid, activities: Extra educational support indicators
- nursery, higher, internet, romantic: Early education, future intent, and lifestyle
- famrel, freetime, goout, Dalc, Walc, health: Family relationship, leisure, alcohol use, and self-reported health
- absences: Number of school absences
- G1, G2, G3: Grades from three terms (G3 is the final grade and our outcome)

These features span personal, academic, and behavioral factors that can be leveraged to predict which students may be at risk of failing.
Check for duplicates:
sum(duplicated(df))
## [1] 0
Result: ‘0’ - There are no duplicate rows in the dataset.
Check for missing data:
colSums(is.na(df))
## school sex age address famsize Pstatus Medu
## 0 0 0 0 0 0 0
## Fedu Mjob Fjob reason guardian traveltime studytime
## 0 0 0 0 0 0 0
## failures schoolsup famsup paid activities nursery higher
## 0 0 0 0 0 0 0
## internet romantic famrel freetime goout Dalc Walc
## 0 0 0 0 0 0 0
## health absences G1 G2 G3
## 0 0 0 0 0
Result: All columns have 0 missing values — this dataset is complete.
Check for ‘unknown-like’ values:
# Count common 'unknown-like' values in all columns
df %>%
summarise(across(everything(), ~sum(. %in% c("unknown", "", "?", NA))))
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## traveltime studytime failures schoolsup famsup paid activities nursery higher
## 1 0 0 0 0 0 0 0 0 0
## internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
## 1 0 0 0 0 0 0 0 0 0 0 0 0
Result: All values returned are 0, so there are no placeholder or unknown-like values either.
The checks above verify the quality of the dataset: there are no duplicate rows, no missing values, and no placeholder values such as “unknown” or “?”. This indicates that the dataset is clean and ready for exploratory analysis without needing imputation or filtering.
summary(df)
## school sex age address
## Length:649 Length:649 Min. :15.00 Length:649
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.74
## 3rd Qu.:18.00
## Max. :22.00
## famsize Pstatus Medu Fedu
## Length:649 Length:649 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Mode :character Median :2.000 Median :2.000
## Mean :2.515 Mean :2.307
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:649
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.569 Mean :1.931 Mean :0.2219
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:649 Length:649 Length:649 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.931
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :3.00 Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.18 Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
Summary of Dataset: The dataset includes academic, demographic, and behavioral aspects. The age range is 15–22, with most students around 16–18. Final grades (G3) range from 0 to 19, with a median around 12. Parental education levels (Medu, Fedu) vary widely, and alcohol consumption (Dalc, Walc) skews low. Most students report no prior failures, and absences range from 0 to 32, an unusually high tail that may affect modeling.
Overall, the dataset presents a rich variety of features relevant to academic risk, with a few variables warranting further inspection for skew and outliers.
# Select only numeric columns
numeric_cols <- df %>% select(where(is.numeric))
# Reshape data
numeric_cols %>%
pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
# Plot
ggplot(aes(x = value)) +
geom_histogram(bins = 15, fill = "steelblue", color = "black") +
facet_wrap(~ feature, scales = "free") +
labs(title = "Distribution of Numeric Features", x = "Value", y = "Count") +
theme_minimal(base_size = 13) +
theme(strip.text = element_text(face = "bold"))
This plot shows the distribution of all numeric features. We see right-skewed patterns in absences and alcohol consumption, while grade variables (G1, G2, G3) are more symmetric. G3 sharply declines around 10, reinforcing the logic behind our at-risk threshold.
Box-plots for Outlier Detection:
# Boxplots for key numeric variables
df %>%
select(absences, failures, G1, G2) %>%
pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
ggplot(aes(x = feature, y = value)) +
geom_boxplot(fill = "lightblue", outlier.color = "red") +
labs(title = "Boxplots for Outlier Detection",
x = "Feature",
y = "Value") +
theme_minimal()
Analysis: The boxplots above reveal a few outliers in features like absences, failures, G1, and G2. The most extreme values appear in absences, where some students missed over 30 days — well beyond the typical range. However, outliers in failures, G1, and G2 appear reasonable and reflect meaningful academic struggles.
We chose not to remove or transform these outliers before modeling. This is intentional: our selected models — Random Forest and Neural Network — are generally robust to outliers. Random Forest makes decisions based on threshold splits and is less sensitive to extreme values. Neural Networks, while more sensitive to scaling, can still handle outliers when data is preprocessed consistently. Therefore, we retained these values to preserve potentially important variation in student performance.
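As a sketch quantifying that tail (a diagnostic only; the rows are kept, per the rationale above), we can count how many students fall beyond the conventional boxplot fence:

# Upper boxplot fence for absences: Q3 + 1.5 * IQR
upper_fence <- quantile(df$absences, 0.75) + 1.5 * IQR(df$absences)
# Number of students whose absences exceed the fence (retained in the data)
sum(df$absences > upper_fence)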
To visualize how categorical variables are distributed across students, we first reshape the data into a tidy format. This allows us to group features thematically and explore patterns in factors like demographics, support, and parental background — all of which may influence student performance.
categorical_data <- df %>%
select(where(is.character)) %>% # Select categorical variables
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
glimpse(categorical_data) # Ensure Feature and Value exist
## Rows: 11,033
## Columns: 2
## $ Feature <chr> "school", "sex", "address", "famsize", "Pstatus", "Mjob", "Fjo…
## $ Value <chr> "GP", "F", "U", "GT3", "A", "at_home", "teacher", "course", "m…
To make the plots more interpretable, we organize the categorical features into three thematic groups: support and lifestyle, demographics, and parental background. This structure will help us analyze trends within and across these areas.
# Prepare base data
categorical_data <- df %>%
select(where(is.character)) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
# Define groups
group1_vars <- c("schoolsup", "famsup", "paid", "internet", "activities", "higher", "nursery")
group2_vars <- c("school", "sex", "address", "famsize", "Pstatus", "romantic")
group3_vars <- c("guardian", "Mjob", "Fjob", "reason")
We now visualize each group separately using bar charts to explore category frequency within each feature. This breakdown helps uncover patterns that may relate to student risk and model predictors.
# Group 1: Support & Lifestyle
categorical_data %>%
filter(Feature %in% group1_vars) %>%
ggplot(aes(x = Value)) +
geom_bar(fill = "steelblue") +
facet_wrap(~ Feature, scales = "free_y", ncol = 2) +
coord_flip() +
labs(title = "Support and Lifestyle Features",
x = "Category", y = "Count") +
theme_minimal(base_size = 12)
Analysis: Most students reported internet access and aspirations for higher education, while fewer received school support or enrolled in paid services. These indicators may reflect disparities in academic resources, which could relate to performance risk.
# Group 2: Demographics
categorical_data %>%
filter(Feature %in% group2_vars) %>%
ggplot(aes(x = Value)) +
geom_bar(fill = "darkorange") +
facet_wrap(~ Feature, scales = "free_y", ncol = 2) +
coord_flip() +
labs(title = "Demographic Features",
x = "Category", y = "Count") +
theme_minimal(base_size = 12)
Analysis: We observe a slight gender skew toward female students and urban residency. Most students live in larger families and with both parents. These demographics offer contextual clues about their learning environments.
# Group 3: Parental Background
categorical_data %>%
filter(Feature %in% group3_vars) %>%
ggplot(aes(x = Value)) +
geom_bar(fill = "seagreen") +
facet_wrap(~ Feature, scales = "free_y", ncol = 2) +
coord_flip() +
labs(title = "Parental Background Features",
x = "Category", y = "Count") +
theme_minimal(base_size = 12)
Analysis: Mothers are commonly reported as the main guardians. Many parents work in broad or service-related jobs. The reasons cited for choosing a school may hint at parental involvement and expectations.
These categorical features offer context that complements academic metrics, helping us build a more complete picture of factors that may influence student performance.
Creating the new at_risk variable: To guide our predictive modeling, we define a binary target variable, at_risk, based on the final grade G3. A student is considered “at risk” if their final grade is below 10, indicating potential failure in the course.
# Create binary target variable
df <- df %>%
mutate(at_risk = ifelse(G3 < 10, "Yes", "No")) %>%
mutate(at_risk = factor(at_risk))
# Check the distribution of the target
table(df$at_risk)
##
## No Yes
## 549 100
Out of 649 students, 100 (approximately 15%) are classified as at risk of failing the Portuguese course based on their final grade (G3). This reveals a class imbalance in our target variable — an important consideration when selecting and evaluating machine learning models.
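The same imbalance can be expressed directly as proportions:

# Class shares: roughly 85% not at risk vs. 15% at risk (100 / 649)
round(prop.table(table(df$at_risk)), 3)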
To better understand what contributes to risk, we now explore how academic indicators such as study time, previous failures, and early grades (G1 and G2) differ between at-risk and not-at-risk students. These insights can help identify early warning signs and inform which predictors may be most useful for modeling.
# Summary statistics of numeric variables by risk group
df %>%
group_by(at_risk) %>%
summarise(
avg_absences = mean(absences, na.rm = TRUE),
avg_studytime = mean(studytime, na.rm = TRUE),
avg_failures = mean(failures, na.rm = TRUE),
avg_G1 = mean(G1, na.rm = TRUE),
avg_G2 = mean(G2, na.rm = TRUE)
)
## # A tibble: 2 × 6
## at_risk avg_absences avg_studytime avg_failures avg_G1 avg_G2
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 No 3.49 1.99 0.126 12.1 12.3
## 2 Yes 4.61 1.61 0.75 7.78 7.53
We do not include G3 in our exploratory analysis, as it is used to define the at_risk outcome. Instead, we examine earlier indicators such as G1, G2, study time, and absences, which may help predict final performance. Above we see that at-risk students tend to have higher absences and more past failures, and much lower first- and second-period grades (G1, G2), indicating early signs of struggle.
Study time by risk group:
ggplot(df, aes(x = factor(studytime), fill = at_risk)) +
geom_bar(position = "fill") +
labs(x = "Weekly Study Time (1 = <2 hrs, 4 = >10 hrs)",
y = "Proportion",
fill = "At Risk?",
title = "Study Time Distribution by Risk") +
scale_fill_manual(values = c("No" = "#d9d9d9", "Yes" = "#f4a261")) +
theme_minimal()
Analysis: Students who study fewer than 2 hours per week (studytime = 1) have the highest proportion of those at risk, highlighting a clear link between limited study time and academic risk.
Visualize Academic Struggles – Failures & Absences:
ggplot(df, aes(x = factor(failures), fill = at_risk)) +
geom_bar(position = "fill") +
labs(x = "Number of Past Class Failures",
y = "Proportion",
fill = "At Risk?",
title = "Past Failures by Risk Status") +
scale_fill_manual(values = c("No" = "#d9d9d9", "Yes" = "#f4a261")) +
theme_minimal()
Analysis: Students with more past failures are significantly more likely to be at risk. Nearly two-thirds of those with three previous failures fall into the at-risk group, underscoring the strong correlation between academic history and final outcomes.
Internet Access by Risk Status: Since internet access may reflect students’ access to resources or broader socioeconomic factors, we examine whether this variable is associated with academic risk. Comparing access by at_risk status helps reveal whether digital inequality contributes to performance gaps.
ggplot(df, aes(x = internet, fill = at_risk)) +
geom_bar(position = "fill") +
labs(title = "Internet Access by Risk Status",
x = "Internet Access",
y = "Proportion",
fill = "At Risk?") +
scale_fill_manual(values = c("No" = "#d9d9d9", "Yes" = "#f4a261")) +
theme_minimal()
Students without internet access show a noticeably higher proportion of academic risk, hinting at the potential impact of digital resource availability on performance.
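As a rough follow-up (a sketch beyond the original analysis), a chi-squared test of independence on the 2x2 table can indicate whether this association is statistically meaningful:

# Test whether internet access and risk status are independent
chisq.test(table(df$internet, df$at_risk))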
To better understand the linear relationships among numeric features, we compute and visualize the correlation matrix. This helps uncover strong associations (e.g., between early and final grades) and detect multicollinearity, which can guide model design and feature selection.
# Select only numeric variables
numeric_vars <- df %>%
select(where(is.numeric))
# Compute correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
# Plot the correlation matrix
corrplot(cor_matrix,
method = "color", # colored squares
type = "upper", # show only upper triangle
order = "hclust", # cluster similar variables
addCoef.col = "black", # show correlation coefficients
tl.col = "black", # label color
number.cex = 0.6) # coefficient size
Analysis: The correlation matrix reveals several important relationships. The strongest positive correlations are among the grade variables (G1, G2, G3), indicating that earlier grades are strong predictors of the final grade and supporting their inclusion in modeling. Most remaining feature pairs show only moderate or weak correlations, suggesting limited multicollinearity among the other predictors.
To fulfill the assignment requirement and build a predictive model for identifying at-risk students, we selected one algorithm from Weeks 1–10 and another from Weeks 11–15 based on the dataset’s characteristics and our project objectives.
Random Forest is an ensemble method that constructs multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting. It is well-suited for this task because:
It handles both numeric and categorical data without much preprocessing.
It is robust to outliers and noise.
It naturally handles non-linear relationships and feature interactions.
It provides feature importance, which is useful for identifying key risk indicators.
Given the imbalanced nature of the dataset and the mix of predictor types, Random Forest offers a reliable, interpretable, and effective modeling approach.
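As a sketch of that last point, feature importance can be read off a fitted forest (here assuming the rf_model object fitted in the experiments below):

# Rank predictors by mean decrease in Gini impurity
imp <- randomForest::importance(rf_model)
head(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE), 10)
# varImpPlot(rf_model) shows the same ranking graphically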
Neural Networks are powerful function approximators capable of capturing complex patterns in data. For this project, a simple feedforward neural network will be used to:
Model non-linear relationships among features and the target.
Handle multi-dimensional interactions that may be missed by simpler models.
Compare performance with a more interpretable tree-based model.
Although less interpretable, Neural Networks provide a performance benchmark and allow us to assess whether more complex architectures yield significant gains in predictive accuracy.
With the algorithms selected — Random Forest for interpretability and Neural Network for modeling complexity — we now prepare the data to ensure compatibility and performance across both models. This involves encoding categorical variables, scaling numeric features, and addressing class imbalance before training begins.
Before training our models, we remove G1 and G2 — the first and second grading period scores — from our dataset. Although these variables are strong predictors of the final grade (G3), they introduce data leakage.
What is data leakage? - Data leakage occurs when information not available at prediction time is used during model training. This leads to overly optimistic results that won’t generalize in real-world use.
In our case, at_risk is defined based on G3, and since G1 and G2 are earlier grades closely tied to G3, they directly reflect the student’s academic trajectory and unfairly reveal part of the outcome. To maintain a realistic prediction scenario — especially for early interventions — we exclude these variables to avoid leakage and ensure model integrity.
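A quick check (run before the columns are dropped) makes the leakage concrete:

# Correlation of period grades with the final grade that defines at_risk
cor(df$G1, df$G3)
cor(df$G2, df$G3)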
To ensure our data is suitable for both tree-based and neural network models, we perform essential pre-processing steps. These include removing grade-related leakage variables (G1, G2, G3), converting categorical variables to numeric using one-hot encoding, and preparing the final dataset with clearly separated predictors and labels. This sets the stage for consistent and fair model comparison.
# Recreate the target column
df$at_risk <- ifelse(df$G3 < 10, "Yes", "No") |> factor()
# Drop G1, G2, G3
df_model <- df |> select(-G1, -G2, -G3)
# Separate predictors and target
target <- df_model$at_risk
df_features <- df_model |> select(-at_risk)
# One-hot encode categorical predictors
dummies <- dummyVars(~ ., data = df_features)
df_encoded <- predict(dummies, newdata = df_features) |> as.data.frame()
# Add target column back
df_encoded$at_risk <- target
# Check structure
str(df_encoded)
## 'data.frame': 649 obs. of 57 variables:
## $ schoolGP : num 1 1 1 1 1 1 1 1 1 1 ...
## $ schoolMS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sexF : num 1 1 1 1 1 0 0 1 0 0 ...
## $ sexM : num 0 0 0 0 0 1 1 0 1 1 ...
## $ age : num 18 17 15 15 16 16 16 17 15 15 ...
## $ addressR : num 0 0 0 0 0 0 0 0 0 0 ...
## $ addressU : num 1 1 1 1 1 1 1 1 1 1 ...
## $ famsizeGT3 : num 1 1 0 1 1 0 0 1 0 1 ...
## $ famsizeLE3 : num 0 0 1 0 0 1 1 0 1 0 ...
## $ PstatusA : num 1 0 0 0 0 0 0 1 1 0 ...
## $ PstatusT : num 0 1 1 1 1 1 1 0 0 1 ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjobat_home : num 1 1 1 0 0 0 0 0 0 0 ...
## $ Mjobhealth : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Mjobother : num 0 0 0 0 1 0 1 1 0 1 ...
## $ Mjobservices : num 0 0 0 0 0 1 0 0 1 0 ...
## $ Mjobteacher : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fjobat_home : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fjobhealth : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fjobother : num 0 1 1 0 1 1 1 0 1 1 ...
## $ Fjobservices : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Fjobteacher : num 1 0 0 0 0 0 0 1 0 0 ...
## $ reasoncourse : num 1 1 0 0 0 0 0 0 0 0 ...
## $ reasonhome : num 0 0 0 1 1 0 1 1 1 1 ...
## $ reasonother : num 0 0 1 0 0 0 0 0 0 0 ...
## $ reasonreputation: num 0 0 0 0 0 1 0 0 0 0 ...
## $ guardianfather : num 0 1 0 0 1 0 0 0 0 0 ...
## $ guardianmother : num 1 0 1 1 0 1 1 1 1 1 ...
## $ guardianother : num 0 0 0 0 0 0 0 0 0 0 ...
## $ traveltime : num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : num 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsupno : num 0 1 0 1 1 1 1 0 1 1 ...
## $ schoolsupyes : num 1 0 1 0 0 0 0 1 0 0 ...
## $ famsupno : num 1 0 1 0 0 0 1 0 0 0 ...
## $ famsupyes : num 0 1 0 1 1 1 0 1 1 1 ...
## $ paidno : num 1 1 1 1 1 1 1 1 1 1 ...
## $ paidyes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ activitiesno : num 1 1 1 0 1 0 1 1 1 0 ...
## $ activitiesyes : num 0 0 0 1 0 1 0 0 0 1 ...
## $ nurseryno : num 0 1 0 0 0 0 0 0 0 0 ...
## $ nurseryyes : num 1 0 1 1 1 1 1 1 1 1 ...
## $ higherno : num 0 0 0 0 0 0 0 0 0 0 ...
## $ higheryes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ internetno : num 1 0 0 0 1 0 0 1 0 0 ...
## $ internetyes : num 0 1 1 1 0 1 1 0 1 1 ...
## $ romanticno : num 1 1 1 0 1 1 1 1 1 1 ...
## $ romanticyes : num 0 0 0 1 0 0 0 0 0 0 ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : num 4 2 6 0 0 6 0 2 0 0 ...
## $ at_risk : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
The resulting dataset contains only numeric predictors and a binary target variable (at_risk), making it ready for training both interpretable and complex models. With preprocessing complete, we can now proceed to handle class imbalance and begin model experimentation.
To ensure a fair evaluation and avoid data leakage, we first split the dataset using stratified sampling — preserving the proportion of at-risk vs. not-at-risk students in both sets. We then applied the ROSE technique only to the training set to synthetically balance the classes before model training.
# Set seed for reproducibility
set.seed(123)
# Split data into training (80%) and test (20%)
split_index <- createDataPartition(df_encoded$at_risk, p = 0.8, list = FALSE)
train_data <- df_encoded[split_index, ]
test_data <- df_encoded[-split_index, ]
# Apply ROSE only on the training set
train_data <- ROSE(at_risk ~ ., data = train_data, seed = 123)$data
# Check new class balance
kable(as.data.frame(table(train_data$at_risk)),
col.names = c("Risk Status", "Count"),
caption = "Class Distribution After ROSE Resampling")
| Risk Status | Count |
|---|---|
| No | 278 |
| Yes | 242 |
The ROSE resampling method successfully balanced the training data: from the imbalanced training-set distribution (about 440 “No” vs. 80 “Yes,” mirroring the full dataset’s 549 vs. 100) to a more equitable 278 “No” vs. 242 “Yes.” This helps prevent the model from being biased toward the majority class and improves its ability to detect at-risk students.
To systematically track and compare all four experiments, we create a results data frame that logs each experiment’s metrics.
results_df <- data.frame(
Model = character(),
Accuracy = numeric(),
Precision = numeric(),
Recall = numeric(),
F1_Score = numeric(),
AUC = numeric(),
stringsAsFactors = FALSE
)
Let’s now begin model experimentation, starting with a default Random Forest classifier to establish a baseline.
set.seed(321)
# Train default Random Forest model
rf_model <- randomForest(at_risk ~ ., data = train_data)
# Predict on test data
rf_pred <- predict(rf_model, test_data, type = "class")
rf_prob <- predict(rf_model, test_data, type = "prob")[, "Yes"]
# Confusion matrix and evaluation
conf_rf <- confusionMatrix(rf_pred, test_data$at_risk, positive = "Yes")
roc_rf <- pROC::roc(response = test_data$at_risk, predictor = rf_prob)
# Extract metrics
rf_accuracy <- conf_rf$overall["Accuracy"]
rf_precision <- conf_rf$byClass["Precision"]
rf_recall <- conf_rf$byClass["Recall"]
rf_f1 <- conf_rf$byClass["F1"]
rf_auc <- auc(roc_rf)
# Display Results
cat(sprintf("\nRandom Forest (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc))
##
## Random Forest (Default) - Accuracy: 0.7907, Precision: 0.4000, Recall: 0.7000, F1-score: 0.5091, AUC: 0.8454
# Add to results dataframe
results_df <- rbind(results_df, data.frame(
Model = "Random Forest (Default)",
Accuracy = round(rf_accuracy, 3),
Precision = round(rf_precision, 3),
Recall = round(rf_recall, 3),
F1_Score = round(rf_f1, 3),
AUC = round(rf_auc, 3)
))
rownames(results_df) <- NULL
Analysis: The default Random Forest model achieved solid performance with an AUC of 0.845 and a recall of 70%, which is our priority metric. This means the model successfully identified the majority of students at risk of failing. However, its precision was relatively low (40%), indicating a higher rate of false positives — students incorrectly flagged as at risk. This trade-off is acceptable in early intervention scenarios where catching more at-risk students is critical, even if it means including a few who aren’t.
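To make these figures concrete: with a stratified 20% split, the test set holds 129 students, 20 of them at risk, so the reported metrics imply roughly the following confusion-matrix counts (a back-of-the-envelope reconstruction, not printed by the code above):

# Implied counts: recall 0.70 of 20 positives -> 14 TP, 6 FN;
# precision 0.40 -> 14 / (14 + FP) = 0.40 -> 21 FP; the rest are TN
tp <- 14; fn <- 6; fp <- 21; tn <- 129 - tp - fn - fp
tp / (tp + fp)   # precision = 0.40
tp / (tp + fn)   # recall    = 0.70
(tp + tn) / 129  # accuracy  ~ 0.79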
To boost recall while minimizing false negatives, we enhance the Random Forest model using cost-sensitive learning. This approach assigns greater weight to the minority “Yes” class (at-risk students), helping the model prioritize identifying them even at the cost of some precision. Also, we tune the Random Forest model using 10-fold cross-validation. We optimize the mtry parameter — the number of predictors randomly sampled at each split — aiming to balance sensitivity with model stability.
Custom Evaluation Metric: We’ll extract Accuracy, Recall (priority), Precision, and F1-score for model comparison.
custom_summary <- function(data, lev = NULL, model = NULL) {
cm <- caret::confusionMatrix(data$pred, data$obs, positive = lev[2])
accuracy <- as.numeric(cm$overall["Accuracy"])
recall <- as.numeric(cm$byClass["Recall"])
precision <- as.numeric(cm$byClass["Precision"])
f1 <- as.numeric(cm$byClass["F1"])
c(Accuracy = accuracy, Recall = recall, Precision = precision, F1 = f1)
}
Model Training with Cross-validation and class weighting:
ctrl <- trainControl(
method = "cv",
number = 10,
summaryFunction = custom_summary,
classProbs = TRUE,
savePredictions = TRUE
)
# Apply class weights: penalize false negatives
weights <- ifelse(train_data$at_risk == "Yes", 2, 1)
set.seed(456)
rf_weighted <- train(
at_risk ~ .,
data = train_data,
method = "rf",
metric = "Recall",
trControl = ctrl,
weights = weights,
tuneLength = 5
)
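Before scoring the test set, we can peek at the mtry value that cross-validation selected:

# Number of predictors sampled at each split, chosen to maximize recall
rf_weighted$bestTune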
Model Evaluation on Test Data (with optional threshold tuning):
# Predict probabilities and labels
rf_probs <- predict(rf_weighted, test_data, type = "prob")[, "Yes"]
# Apply threshold tuning if needed (e.g., 0.4 instead of 0.5)
threshold <- 0.4
rf_preds <- factor(ifelse(rf_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
# Evaluate performance
conf_rf <- confusionMatrix(rf_preds, test_data$at_risk, positive = "Yes")
roc_rf <- pROC::roc(test_data$at_risk, rf_probs)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
# Extract metrics
rf_accuracy <- conf_rf$overall["Accuracy"]
rf_precision <- conf_rf$byClass["Precision"]
rf_recall <- conf_rf$byClass["Recall"]
rf_f1 <- conf_rf$byClass["F1"]
rf_auc <- auc(roc_rf)
# Log results
cat(sprintf("\nRandom Forest (Weighted, Tuned, Threshold=%.2f) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f, AUC: %.4f\n",
threshold, rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc))
##
## Random Forest (Weighted, Tuned, Threshold=0.40) - Accuracy: 0.6589, Precision: 0.2857, Recall: 0.8000, F1: 0.4211, AUC: 0.8085
results_df <- rbind(results_df, data.frame(
Model = "Random Forest (Tuned, Weighted + Threshold)",
Accuracy = round(rf_accuracy, 3),
Precision = round(rf_precision, 3),
Recall = round(rf_recall, 3),
F1_Score = round(rf_f1, 3),
AUC = round(rf_auc, 3)
))
rownames(results_df) <- NULL
Analysis (Random Forest, weighted and tuned with threshold): By applying cost-sensitive learning (via class weights) and lowering the decision threshold to 0.40, the model improved recall to 80%, correctly identifying the majority of at-risk students, a key priority for early intervention. However, this gain came at the cost of lower precision (28.57%) and overall accuracy (65.89%), signaling more false positives. The model’s AUC remains strong (0.81), suggesting it still separates the classes well despite the adjusted decision boundary.
This trade-off is acceptable given the project’s goal: prioritizing recall to avoid missing students who need support.
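As a quick sensitivity check (a sketch reusing the rf_probs and test_data objects from above), recall and precision can be computed at several candidate thresholds to see the trade-off directly:

# Recall and precision at a few candidate cutoffs
sapply(c(0.30, 0.40, 0.50), function(t) {
  preds <- factor(ifelse(rf_probs > t, "Yes", "No"), levels = c("No", "Yes"))
  cm <- caret::confusionMatrix(preds, test_data$at_risk, positive = "Yes")
  c(Threshold = t, Recall = unname(cm$byClass["Recall"]),
    Precision = unname(cm$byClass["Precision"]))
})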
To support early intervention efforts in education, we train a basic feedforward Neural Network using the nnet package. While not as interpretable as decision trees, neural networks can model complex, non-linear relationships — which is valuable when student risk depends on multiple overlapping factors like study habits, support systems, and background.
In this default setup:
We use 5 hidden units to allow for some pattern complexity without overfitting.
We limit training to 200 iterations for computational efficiency.
A small decay value (0.01) applies regularization to reduce the risk of overfitting on our imbalanced data.
Activation Function (Educational Implication): The network uses a sigmoid activation function, which outputs a value between 0 and 1 — interpreted as the predicted probability that a student is “at risk.” This is particularly useful in education, where decisions about support often require probabilistic thresholds rather than rigid labels. By outputting risk as a probability, educators can prioritize interventions with more flexibility.
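For intuition, a minimal sketch of the sigmoid and of thresholding its output into a label:

# Sigmoid maps any real-valued score to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-2, 0, 2))                   # ~0.12, 0.50, ~0.88
ifelse(sigmoid(2) > 0.5, "Yes", "No")  # probability above threshold -> "at risk"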
# Set seed for reproducibility
set.seed(789)
# Train default neural network (single hidden layer)
nn_model <- nnet(at_risk ~ .,
data = train_data,
size = 5, # hidden units
maxit = 200, # maximum iterations
decay = 0.01, # regularization
trace = FALSE)
# Predict probabilities
nn_prob <- predict(nn_model, test_data, type = "raw")
# Class prediction using 0.5 threshold
nn_pred <- ifelse(nn_prob > 0.5, "Yes", "No")
nn_pred <- factor(nn_pred, levels = levels(test_data$at_risk))
# Ensure test target is also factor
test_data$at_risk <- factor(test_data$at_risk, levels = c("No", "Yes"))
# Evaluation
conf_nn <- confusionMatrix(nn_pred, test_data$at_risk, positive = "Yes")
roc_nn <- pROC::roc(response = test_data$at_risk, predictor = as.numeric(nn_prob))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
# Extract metrics
nn_accuracy <- conf_nn$overall["Accuracy"]
nn_precision <- conf_nn$byClass["Precision"]
nn_recall <- conf_nn$byClass["Recall"]
nn_f1 <- conf_nn$byClass["F1"]
nn_auc <- auc(roc_nn)
# Print results
cat(sprintf("\nNeural Network (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f, AUC: %.4f\n",
nn_accuracy, nn_precision, nn_recall, nn_f1, nn_auc))
##
## Neural Network (Default) - Accuracy: 0.7364, Precision: 0.3600, Recall: 0.9000, F1: 0.5143, AUC: 0.8422
# Add to results
results_df <- rbind(results_df, data.frame(
Model = "Neural Network (Default)",
Accuracy = round(nn_accuracy, 3),
Precision = round(nn_precision, 3),
Recall = round(nn_recall, 3),
F1_Score = round(nn_f1, 3),
AUC = round(nn_auc, 3)
))
rownames(results_df) <- NULL
Analysis: The default Neural Network achieved the highest recall (90%) so far, indicating strong sensitivity to at-risk students, though precision (36%) remains modest, suggesting a higher false positive rate.
Next, we tune the neural network’s architecture and regularization to explore whether we can maintain high recall while improving overall balance across performance metrics.
To optimize performance, we tune the neural network using 10-fold cross-validation. We vary two key hyperparameters:
size: Number of hidden units
decay: Regularization parameter to prevent overfitting
We aim to preserve high recall while improving precision and overall balance.
set.seed(1010)
ctrl <- trainControl(
method = "cv",
number = 10,
summaryFunction = custom_summary,
classProbs = TRUE,
savePredictions = TRUE
)
grid <- expand.grid(
size = c(3, 5, 7), # try different hidden layer sizes
decay = c(0.001, 0.01, 0.1) # different levels of regularization
)
nn_tuned <- train(
at_risk ~ .,
data = train_data,
method = "nnet",
trControl = ctrl,
metric = "Recall", # focus on recall
tuneGrid = grid,
maxit = 300,
trace = FALSE
)
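Before evaluating on the test set, we can check which combination the grid search selected:

# Hidden-layer size and weight decay chosen by cross-validation
nn_tuned$bestTune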
Evaluate Tuned NN on Test Set:
# Predict on test data
nn_tuned_pred <- predict(nn_tuned, test_data)
nn_tuned_prob <- predict(nn_tuned, test_data, type = "prob")[, "Yes"]
# Ensure correct factor levels
nn_tuned_pred <- factor(nn_tuned_pred, levels = levels(test_data$at_risk))
# Confusion matrix & ROC
conf_nn_tuned <- confusionMatrix(nn_tuned_pred, test_data$at_risk, positive = "Yes")
roc_nn_tuned <- pROC::roc(response = test_data$at_risk, predictor = nn_tuned_prob)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
# Extract metrics
nn_tuned_accuracy <- conf_nn_tuned$overall["Accuracy"]
nn_tuned_precision <- conf_nn_tuned$byClass["Precision"]
nn_tuned_recall <- conf_nn_tuned$byClass["Recall"]
nn_tuned_f1 <- conf_nn_tuned$byClass["F1"]
nn_tuned_auc <- auc(roc_nn_tuned)
# Print results
cat(sprintf("\nNeural Network (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f, AUC: %.4f\n",
nn_tuned_accuracy, nn_tuned_precision, nn_tuned_recall, nn_tuned_f1, nn_tuned_auc))
##
## Neural Network (Tuned) - Accuracy: 0.7442, Precision: 0.3488, Recall: 0.7500, F1: 0.4762, AUC: 0.8289
# Log to results
results_df <- rbind(results_df, data.frame(
Model = "Neural Network (Tuned)",
Accuracy = round(nn_tuned_accuracy, 3),
Precision = round(nn_tuned_precision, 3),
Recall = round(nn_tuned_recall, 3),
F1_Score = round(nn_tuned_f1, 3),
AUC = round(nn_tuned_auc, 3)
))
rownames(results_df) <- NULL
Analysis: The tuned Neural Network achieved 75% recall, successfully identifying most at-risk students. While precision (34.9%) remains modest, the model maintains good overall accuracy (74%) and AUC (0.829). This balance makes it a solid option for early intervention, though false positives may still need to be addressed through future tuning or educator input.
results_df %>%
kable("html", digits = 3, caption = "Model Performance Summary") %>%
kable_styling(full_width = F) %>%
row_spec(which.max(results_df$Recall), bold = TRUE, background = "#DFF0D8") %>%
row_spec(which.max(results_df$F1_Score), bold = TRUE, background = "#D9EDF7")
| Model | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|
| Random Forest (Default) | 0.791 | 0.400 | 0.70 | 0.509 | 0.845 |
| Random Forest (Tuned, Weighted + Threshold) | 0.659 | 0.286 | 0.80 | 0.421 | 0.808 |
| Neural Network (Default) | 0.736 | 0.360 | 0.90 | 0.514 | 0.842 |
| Neural Network (Tuned) | 0.744 | 0.349 | 0.75 | 0.476 | 0.829 |
All models performed reasonably well, but the default Neural Network achieved the highest recall (0.90) and F1-score (0.514), making it the most effective for identifying at-risk students.
To visualize these metrics more clearly across models, we now turn to a grouped bar chart.
results_long <- results_df %>%
pivot_longer(cols = -Model, names_to = "Metric", values_to = "Score")
# Order metrics to keep them grouped
results_long$Metric <- factor(results_long$Metric, levels = c("Accuracy", "AUC", "F1_Score", "Precision", "Recall"))
# Plot
ggplot(results_long, aes(x = Metric, y = Score, fill = Model)) +
geom_col(position = position_dodge()) +
labs(
title = "Comparison of Model Performance Metrics",
x = "Metric", y = "Score"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "right",
legend.title = element_blank()
)
The grouped bar chart highlights that Neural Network (Default) outperforms others in recall and F1-score, whereas Random Forest (Default) leads in accuracy and AUC.
To compare the models holistically across all metrics in a compact form, we now explore a radar chart.
# Prepare data
radar_data <- results_df %>%
column_to_rownames("Model") %>%
select(Accuracy, Precision, Recall, F1_Score, AUC)
# Add max and min rows for scaling
radar_data <- rbind(apply(radar_data, 2, max),
apply(radar_data, 2, min),
radar_data)
# Radar chart
radarchart(radar_data,
axistype = 1,
pcol = c("blue", "green", "red", "purple"),
pfcol = c(rgb(0,0,1,0.3), rgb(0,1,0,0.3), rgb(1,0,0,0.3), rgb(0.5,0,0.5,0.3)), # fills match line colors
plty = 1,
plwd = 2,
cglwd = 0.8,
cglcol = "grey",
axislabcol = "gray",
title = "Radar Chart of Model Metrics")
# Legend
legend("topright", legend = rownames(radar_data)[-c(1,2)],
col = c("blue", "green", "red", "purple"),
lwd = 2, lty = 1, cex = 0.57)
The radar chart offers a visual summary, showing how the default Neural Network maintains strong balance across all metrics—especially recall—while the tuned Random Forest sacrifices accuracy and precision for higher recall.
Conclusion: Among all models tested, the Neural Network (Default) emerged as the best performer for this educational risk prediction task. With the highest recall (0.90) and strongest F1-score (0.514), it proved most effective at identifying at-risk students early, aligning directly with the project’s goal of timely intervention. While this came with some trade-offs in precision (36%), the F1-score balances this by capturing both sensitivity and reliability.
Although the Random Forest (Tuned + Weighted) model also achieved high recall (0.80), its lower precision (28.6%) and accuracy (65.9%) increased the risk of overwhelming support systems with false positives. Therefore, the Neural Network (Default) strikes the best balance between recall-focused performance and practical implementation, which makes it a suitable choice for early risk detection in educational settings.