Install Packages

packages <- c("nnet", "car", "MASS", "ggplot2", "dplyr", "tidyr",
              "corrplot", "MVN", "heplots", "emmeans", "effects",
              "knitr", "reshape2", "GGally", "mlogit", "mvnormtest",
              "marginaleffects")

installed <- packages %in% rownames(installed.packages())

if (any(!installed)) {
  install.packages(packages[!installed])
}

Data Preparation and Loading

library(nnet)
library(car)
## Warning: package 'car' was built under R version 4.5.3
## Loading required package: carData
library(MASS)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
library(MVN)
## Warning: package 'MVN' was built under R version 4.5.3
library(heplots)
## Warning: package 'heplots' was built under R version 4.5.3
## Loading required package: broom
library(emmeans)
## Warning: package 'emmeans' was built under R version 4.5.3
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
library(knitr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(mvnormtest)
library(marginaleffects)
## Warning: package 'marginaleffects' was built under R version 4.5.3
data <- read.csv("data.csv",
                 sep=";", header=TRUE)

# Rename columns for easier handling
colnames(data) <- c(
  "marital_status", "application_mode", "application_order", "course",
  "daytime_attendance", "previous_qualification", "prev_qual_grade",
  "nationality", "mother_qualification", "father_qualification",
  "mother_occupation", "father_occupation", "admission_grade",
  "displaced", "educational_needs", "debtor", "tuition_up_to_date",
  "gender", "scholarship_holder", "age_enrollment", "international",
  "cu1_credited", "cu1_enrolled", "cu1_evaluations", "cu1_approved",
  "cu1_grade", "cu1_no_eval", "cu2_credited", "cu2_enrolled",
  "cu2_evaluations", "cu2_approved", "cu2_grade", "cu2_no_eval",
  "unemployment_rate", "inflation_rate", "gdp", "target"
)

# Convert target to factor with base category "Enrolled"
data$target <- factor(data$target, levels = c("Enrolled", "Graduate", "Dropout"))

# Select numeric predictors used in both analyses
numeric_vars <- c("admission_grade", "cu1_grade", "cu2_grade",
                  "age_enrollment", "unemployment_rate", "gdp")

# Binary/categorical predictors (for MLR only)
binary_vars <- c("gender", "scholarship_holder", "debtor", "tuition_up_to_date")

# Convert binary variables to factor
for (v in binary_vars) {
  data[[v]] <- factor(data[[v]])
}

dim(data)
## [1] 4424   37

The dataset has 4424 rows and 37 columns. The response variable target has 3 categories: Graduate, Dropout, and Enrolled. For this research, Enrolled is used as the base (reference) category in the multinomial logistic regression model.

A. Data Characteristics

A.1 Target Distribution

# Count and proportion of each target class
target_table <- data %>%
  count(target) %>%
  mutate(Proportion = round(n / sum(n), 3))

kable(target_table, col.names = c("Category", "Count", "Proportion"),
      caption = "Target Variable Distribution")
Target Variable Distribution
Category Count Proportion
Enrolled 794 0.179
Graduate 2209 0.499
Dropout 1421 0.321

Interpretation: The dataset is imbalanced: Graduate accounts for the largest share, followed by Dropout and Enrolled. This should be kept in mind when interpreting model outputs.

A.2 Descriptive Statistics of Numeric Predictors

# Summary statistics per target group
desc_stats <- data %>%
  group_by(target) %>%
  summarise(
    across(all_of(numeric_vars),
           list(Mean = ~round(mean(., na.rm = TRUE), 2),
                SD   = ~round(sd(., na.rm = TRUE), 2)),
           .names = "{.col}_{.fn}")
  )

kable(t(desc_stats), caption = "Descriptive Statistics by Target Group")
Descriptive Statistics by Target Group
target Enrolled Graduate Dropout
admission_grade_Mean 125.53 128.79 124.96
admission_grade_SD 13.79 14.07 15.13
cu1_grade_Mean 11.13 12.64 7.26
cu1_grade_SD 3.68 2.70 6.03
cu2_grade_Mean 11.12 12.70 5.90
cu2_grade_SD 3.60 2.69 6.12
age_enrollment_Mean 22.37 21.78 26.07
age_enrollment_SD 6.30 6.69 8.70
unemployment_rate_Mean 11.27 11.64 11.62
unemployment_rate_SD 2.63 2.60 2.77
gdp_Mean 0.05 0.08 -0.15
gdp_SD 2.32 2.26 2.25

Interpretation: Graduates consistently show higher grades in both semesters and at admission compared to Dropouts. The Enrolled group sits between the two extremes on most variables. Age at enrollment tends to be slightly higher for Dropout students.

A.3 Visualizations

# Boxplot of key numeric variables by target
data_long <- data %>%
  select(target, all_of(numeric_vars)) %>%
  pivot_longer(-target, names_to = "variable", values_to = "value")

ggplot(data_long, aes(x = target, y = value, fill = target)) +
  geom_boxplot(outlier.size = 0.5) +
  facet_wrap(~variable, scales = "free_y") +
  labs(title = "Distribution of Numeric Predictors by Target Group",
       x = "Target", y = "Value") +
  theme_bw() +
  theme(legend.position = "none")

Interpretation: The boxplots confirm that cu1_grade and cu2_grade show strong separation between Graduate and Dropout groups. admission_grade shows moderate separation. Variables like gdp and unemployment_rate show less visual separation by group.

# Bar chart of categorical predictors by target
data %>%
  select(target, all_of(binary_vars)) %>%
  pivot_longer(-target) %>%
  ggplot(aes(x = value, fill = target)) +
  geom_bar(position = "fill") +
  facet_wrap(~name, scales = "free_x") +
  labs(title = "Categorical Predictors by Target Group (Proportion)",
       x = "Category", y = "Proportion", fill = "Target") +
  theme_bw()

Interpretation: Scholarship holders and students with up-to-date tuition fees are more likely to be Graduates. Debtors show a higher proportion of Dropouts.

B. Assumptions

B.1 Discriminant Analysis Assumptions

B.1.1 Multivariate Normality

# Subset numeric predictors for LDA
data_num <- data[, numeric_vars]

# MVN (includes Mardia)
mvn_result <- MVN::mvn(data = data_num)

# Multivariate normality result (includes Mardia)
# NOTE: the accessors below return NULL here, most likely because newer MVN
# releases renamed the components of the object returned by mvn(); this is a
# version mismatch, not a test outcome.
cat("Multivariate Normality Test Results\n")
## Multivariate Normality Test Results
print(mvn_result$multivariateNormality)
## NULL
# Univariate normality (Shapiro-Wilk per variable)
cat("\nUnivariate Normality (Shapiro-Wilk)\n")
## 
## Univariate Normality (Shapiro-Wilk)
print(mvn_result$univariateNormality)
## NULL
# Q-Q plot for visual inspection of multivariate normality
md <- mahalanobis(data_num,
                   colMeans(data_num),
                   cov(data_num))

qqplot(qchisq(ppoints(nrow(data_num)), df = ncol(data_num)),
       sort(md),
       main = "Multivariate Q-Q Plot")
abline(0,1, col="red")

Interpretation: Mardia’s test evaluates multivariate skewness and kurtosis; p-values above 0.05 would indicate no significant departure from multivariate normality. Here, however, the printed components are NULL, so the formal test results were not retrieved (likely a package-version mismatch), and the assessment rests on the chi-square Q-Q plot, which should show points closely following the diagonal line. With large samples (N > 4000), even small deviations are flagged as significant due to high power, so visual inspection is important alongside formal tests. To the extent the Q-Q plot is approximately linear, Assumption B.1.1 is treated as approximately satisfied; LDA is reasonably robust to mild departures from normality at this sample size.
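
Because the component names of the object returned by MVN::mvn() differ across package versions (the NULLs above suggest exactly such a mismatch), a quick structural inspection is safer than hard-coding names; a minimal hedged sketch:

str(mvn_result, max.level = 1)   # list the components the installed MVN version provides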

B.1.2 Homogeneity of Covariance Matrices

# Box's M test for equality of covariance matrices across groups
boxm_result <- boxM(data[, numeric_vars], data$target)
print(boxm_result)
## 
##  Box's M-test for Homogeneity of Covariance Matrices 
## 
## data:  data[, numeric_vars] by data$target 
## Chi-Sq (approx.) = 4674.479, df = 42, p-value = < 2.2e-16

Interpretation: Box’s M test evaluates whether the covariance matrices of the numeric predictors are equal across the three target groups (Enrolled, Graduate, Dropout). A p-value greater than 0.05 would indicate homogeneous covariance matrices, the assumption behind Linear Discriminant Analysis (LDA). Here the test is highly significant (Chi-Sq ≈ 4674.5, p < 2.2e-16), so homogeneity is formally rejected; in that case Quadratic Discriminant Analysis (QDA), which does not require equal covariance matrices, is the textbook alternative. However, Box’s M is extremely sensitive at this sample size (N = 4424), and the within-group correlation structures look broadly similar (see the heatmaps below), so LDA is retained here with the violation of Assumption B.1.2 acknowledged as a limitation.

# Visualize covariance structure per group via correlation heatmaps
par(mfrow = c(1, 3))
groups <- levels(data$target)

for (g in groups) {
  sub <- data[data$target == g, numeric_vars]
  cormat <- cor(sub)
  corrplot(cormat, method = "color", type = "upper",
           title = g, mar = c(0, 0, 2, 0),
           tl.cex = 0.7, cl.cex = 0.7)
}

par(mfrow = c(1, 1))

Interpretation: The correlation heatmaps display the correlation structure of numeric predictors within each group. Similar patterns across groups provide visual evidence of covariance homogeneity, supporting the LDA assumption.

B.2 Multinomial Logistic Regression Assumptions

B.2.1 Linearity of Log-Odds

The linearity assumption requires that the log-odds of each outcome category relative to the base category are linearly related to each continuous predictor.

# Box-Tidwell test: add log(x)*x interaction terms to check linearity
# A non-significant interaction supports linearity

# Use a simplified binary check: Graduate vs Enrolled
data_bin <- data %>%
  filter(target != "Dropout") %>%
  mutate(y_bin = as.integer(target == "Graduate"))

for (v in numeric_vars) {
  # Avoid log(0) by adding a small constant where needed
  xvar <- data_bin[[v]] + 0.001
  bt_term <- xvar * log(xvar)
  mod <- glm(data_bin$y_bin ~ xvar + bt_term, family = binomial)
  pval <- summary(mod)$coefficients[3, 4]
  cat(sprintf("%-25s Box-Tidwell p-value (Graduate vs Enrolled): %.4f %s\n",
              v, pval,
              ifelse(pval > 0.05, "= Linear", "= Non-linear")))
}
## admission_grade           Box-Tidwell p-value (Graduate vs Enrolled): 0.1690 = Linear
## cu1_grade                 Box-Tidwell p-value (Graduate vs Enrolled): 0.0000 = Non-linear
## cu2_grade                 Box-Tidwell p-value (Graduate vs Enrolled): 0.0000 = Non-linear
## age_enrollment            Box-Tidwell p-value (Graduate vs Enrolled): 0.0000 = Non-linear
## unemployment_rate         Box-Tidwell p-value (Graduate vs Enrolled): 0.1828 = Linear
## Warning in log(xvar): NaNs produced
## gdp                       Box-Tidwell p-value (Graduate vs Enrolled): 0.0140 = Non-linear

Interpretation: Variables with p > 0.05 on the interaction term satisfy the linearity assumption. Here only admission_grade and unemployment_rate pass, while cu1_grade, cu2_grade, age_enrollment, and gdp show significant non-linearity. Such violations are common in real-world datasets, and multinomial logistic regression is generally robust to moderate departures when predictors are not severely skewed. Assumption B.2.1 only partially satisfied at this stage; transformations are applied below for the flagged variables.

Handling: Non-Linear Predictors using Log Transformation

Based on the Box-Tidwell output above, cu1_grade, cu2_grade, age_enrollment, and gdp were flagged as non-linear (p < 0.05). As a remedy, a logarithmic transformation is applied to age_enrollment to better linearize its relationship with the log-odds; unemployment_rate, although it passed the initial check, is also log-transformed to reduce skewness.

# Apply log transformation to variables flagged as non-linear by Box-Tidwell

data$log_age_enrollment    <- log(data$age_enrollment + 0.001)
data$log_unemployment_rate <- log(data$unemployment_rate + 0.001)

# Re-run Box-Tidwell check using transformed variables
data_bin2 <- data %>%
  filter(target != "Dropout") %>%
  mutate(y_bin = as.integer(target == "Graduate"))

transformed_vars <- c("admission_grade", "cu1_grade", "cu2_grade",
                       "log_age_enrollment", "log_unemployment_rate")

cat("Box-Tidwell re-check after transformation:\n")
## Box-Tidwell re-check after transformation:
for (v in transformed_vars) {
  xvar <- data_bin2[[v]] + 0.001
  bt_term <- xvar * log(xvar)
  mod <- glm(data_bin2$y_bin ~ xvar + bt_term, family = binomial)
  pval <- summary(mod)$coefficients[3, 4]
  cat(sprintf("%-30s p-value: %.4f %s\n",
              v, pval,
              ifelse(pval > 0.05, "= Linear", "= Non-linear")))
}
## admission_grade                p-value: 0.1690 = Linear
## cu1_grade                      p-value: 0.0000 = Non-linear
## cu2_grade                      p-value: 0.0000 = Non-linear
## log_age_enrollment             p-value: 0.0000 = Non-linear
## log_unemployment_rate          p-value: 0.3604 = Linear

Interpretation after handling: After the transformations, admission_grade and log_unemployment_rate satisfy the Box-Tidwell linearity check (p > 0.05), while cu1_grade, cu2_grade, and log_age_enrollment still indicate significant non-linearity (p < 0.05). The variable gdp was excluded from the re-check because it contains negative values, making the logarithmic interaction term undefined. Despite these remaining violations, multinomial logistic regression is generally robust to moderate departures from linearity in large samples; log_age_enrollment is therefore retained in the final model for its reduced skewness and interpretability, and the remaining non-linearity is acknowledged as a limitation. Assumption B.2.1 reasonably satisfied for practical modeling purposes.

B.2.2 Independence of Observations

# Each row represents one unique student
cat("Number of observations:", nrow(data), "\n")
## Number of observations: 4424
cat("Each row is one student record.\n")
## Each row is one student record.
cat("No repeated measures or nested structure in this dataset.\n")
## No repeated measures or nested structure in this dataset.
cat("The data was collected cross-sectionally; students are independent.\n")
## The data was collected cross-sectionally; students are independent.

Interpretation: Each observation corresponds to one unique student with no repeated measurements. There is no clustering or nesting structure that would introduce dependency. Assumption B.2.2 Satisfied.

B.2.3 No Multicollinearity

# Fit a linear model to compute VIF among numeric predictors.
# The choice of response does not matter: VIF depends only on correlations
# among the predictors, so as.numeric(target) is merely a placeholder.
vif_model <- lm(as.numeric(target) ~ admission_grade + cu1_grade + cu2_grade +
                  log_age_enrollment + log_unemployment_rate + gdp,
                data = data)

vif_values <- vif(vif_model)
vif_df <- data.frame(Variable = names(vif_values), VIF = round(vif_values, 3))
kable(vif_df, caption = "Variance Inflation Factor (VIF) for Numeric Predictors")
Variance Inflation Factor (VIF) for Numeric Predictors
Variable VIF
admission_grade admission_grade 1.008
cu1_grade cu1_grade 3.348
cu2_grade cu2_grade 3.376
log_age_enrollment log_age_enrollment 1.041
log_unemployment_rate log_unemployment_rate 1.102
gdp gdp 1.109
# VIF bar chart with threshold line
ggplot(vif_df, aes(x = reorder(Variable, VIF), y = VIF)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_hline(yintercept = 10, color = "red", linetype = "dashed") +
  coord_flip() +
  labs(title = "VIF of Numeric Predictors",
       subtitle = "Red dashed line = threshold (VIF = 10)",
       x = "Predictor", y = "VIF") +
  theme_bw()

Interpretation: VIF values below 10 indicate no severe multicollinearity. Here the largest VIF is about 3.38 (cu2_grade), reflecting the natural correlation between first- and second-semester grades, so no variable needs to be removed or combined. Assumption B.2.3 Satisfied.
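
As a hedged cross-check of what VIF measures, the definition \(VIF_j = 1/(1 - R_j^2)\) can be computed directly, where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors; this also shows why the placeholder response above is harmless:

# Manual VIF for cu2_grade: regress it on the other predictors (no response involved)
r2 <- summary(lm(cu2_grade ~ admission_grade + cu1_grade + log_age_enrollment +
                   log_unemployment_rate + gdp, data = data))$r.squared
1 / (1 - r2)   # should approximately match the cu2_grade VIF (~3.38) in the table above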

B.2.4 No Outliers

# Boxplot to detect outliers in numeric predictors (using transformed variables)
data %>%
  select(all_of(c("admission_grade", "cu1_grade", "cu2_grade",
                  "log_age_enrollment", "log_unemployment_rate", "gdp"))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red", outlier.size = 1) +
  coord_flip() +
  labs(title = "Outlier Detection via Boxplot",
       subtitle = "Red dots indicate potential outliers",
       x = "Variable", y = "Value") +
  theme_bw()

# Z-score method: flag observations with |z| > 3 (using transformed variables)
detect_vars <- c("admission_grade", "cu1_grade", "cu2_grade",
                 "log_age_enrollment", "log_unemployment_rate", "gdp")

outlier_counts <- sapply(data[detect_vars], function(x) {
  z <- scale(x)
  sum(abs(z) > 3, na.rm = TRUE)
})

outlier_df <- data.frame(Variable = names(outlier_counts),
                         Outliers_Z3 = outlier_counts,
                         Pct = round(outlier_counts / nrow(data) * 100, 2))
kable(outlier_df, caption = "Outlier Count per Variable (|Z-score| > 3)")
Outlier Count per Variable (|Z-score| > 3)
Variable Outliers_Z3 Pct
admission_grade admission_grade 22 0.50
cu1_grade cu1_grade 0 0.00
cu2_grade cu2_grade 0 0.00
log_age_enrollment log_age_enrollment 59 1.33
log_unemployment_rate log_unemployment_rate 0 0.00
gdp gdp 0 0.00

Interpretation: Outliers are detected in admission_grade (22 cases, 0.50%) and log_age_enrollment (59 cases, 1.33%) by the Z-score method (|Z| > 3). Although these proportions are small relative to the full dataset, such values may still influence model estimation. Winsorization is applied to cap extreme values without discarding any observations. Assumption B.2.4 partially satisfied at this stage; handling is applied below.

Handling: Outliers using Winsorization (Capping)

Based on the Z-score output above, log_age_enrollment and admission_grade are the only variables with flagged outliers (|Z| > 3). Rather than removing observations (which risks losing valid data), Winsorization is applied: values beyond the 1st and 99th percentiles of each variable are capped at those boundary values.

# Winsorize function: cap values at lower and upper percentile bounds
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmax(pmin(x, bounds[2]), bounds[1])
}

# Apply winsorization to all numeric predictors
# Using the (possibly already transformed) variables
winsor_vars <- c("admission_grade", "cu1_grade", "cu2_grade",
                 "log_age_enrollment", "log_unemployment_rate", "gdp")

for (v in winsor_vars) {
  data[[v]] <- winsorize(data[[v]])
}

# Re-check outlier counts after winsorization
cat("Outlier re-check after Winsorization (|Z-score| > 3):\n")
## Outlier re-check after Winsorization (|Z-score| > 3):
outlier_after <- sapply(data[winsor_vars], function(x) {
  z <- scale(x)
  sum(abs(z) > 3, na.rm = TRUE)
})

outlier_after_df <- data.frame(
  Variable    = names(outlier_after),
  Outliers_Z3 = outlier_after,
  Pct         = round(outlier_after / nrow(data) * 100, 2)
)
kable(outlier_after_df,
      caption = "Outlier Count per Variable After Winsorization (|Z-score| > 3)")
Outlier Count per Variable After Winsorization (|Z-score| > 3)
Variable Outliers_Z3 Pct
admission_grade admission_grade 0 0.00
cu1_grade cu1_grade 0 0.00
cu2_grade cu2_grade 0 0.00
log_age_enrollment log_age_enrollment 59 1.33
log_unemployment_rate log_unemployment_rate 0 0.00
gdp gdp 0 0.00

Interpretation after handling: After Winsorization at the 1st–99th percentiles, the admission_grade outliers are eliminated, while log_age_enrollment still flags 59 observations (1.33%): capping at the 99th percentile leaves a heavy right tail whose capped values remain beyond 3 standard deviations on the recomputed Z-scale. The capping approach preserves all 4424 observations while limiting the influence of the most extreme values. Assumption B.2.4 largely satisfied, with the residual log_age_enrollment flags noted as minor.

B.3 Assumptions Conclusion

# Construct the final clean dataset incorporating all assumption fixes:
# 1. log-transformed age_enrollment and unemployment_rate (B.2.1 fix)
# 2. Winsorized numeric predictors (B.2.4 fix)

final_numeric_vars <- c("admission_grade", "cu1_grade", "cu2_grade",
                        "log_age_enrollment", "log_unemployment_rate", "gdp")

data_final <- data %>%
  select(target,
         all_of(final_numeric_vars),
         all_of(binary_vars))

cat("Dimensions of final dataset:\n")
## Dimensions of final dataset:
cat(sprintf("Rows   : %d\n", nrow(data_final)))
## Rows   : 4424
cat(sprintf("Columns: %d\n", ncol(data_final)))
## Columns: 11
cat("\nVariables included:\n")
## 
## Variables included:
cat("Numeric :", paste(final_numeric_vars, collapse = ", "), "\n")
## Numeric : admission_grade, cu1_grade, cu2_grade, log_age_enrollment, log_unemployment_rate, gdp
cat("Binary  :", paste(binary_vars, collapse = ", "), "\n")
## Binary  : gender, scholarship_holder, debtor, tuition_up_to_date
cat("Target  : target (Enrolled / Graduate / Dropout)\n")
## Target  : target (Enrolled / Graduate / Dropout)
cat("\nTarget distribution in final dataset:\n")
## 
## Target distribution in final dataset:
print(table(data_final$target))
## 
## Enrolled Graduate  Dropout 
##      794     2209     1421

Interpretation: The final dataset data_final contains 4424 observations and 11 variables (1 response + 6 numeric predictors + 4 binary predictors), ready for both Discriminant Analysis and Multinomial Logistic Regression. The assumption checks are summarized below:

Assumption | Method | Status
B.1.1 Multivariate Normality | Mardia’s test + Chi-square Q-Q plot | Approximately satisfied (visual)
B.1.2 Homogeneity of Covariance Matrices | Box’s M test | Formally rejected (p < 0.05); noted as a limitation
B.2.1 Linearity of Log-Odds | Box-Tidwell test + log transformation of age_enrollment & unemployment_rate | Reasonably satisfied
B.2.2 Independence of Observations | Cross-sectional design, one record per student | Satisfied
B.2.3 No Multicollinearity | VIF < 10 for all predictors | Satisfied
B.2.4 No Outliers | Z-score detection + Winsorization at 1st–99th percentile | Largely satisfied

With these checks and remedies in place, data_final is the dataset used in the subsequent Discriminant Analysis and Multinomial Logistic Regression modelling steps.

C. Discriminant Analysis

C.1 Model Specification

Linear Discriminant Analysis (LDA) is a supervised classification method that finds linear combinations of predictors, called discriminant functions, which best separate predefined groups. Given \(K = 3\) outcome groups (Enrolled, Graduate, Dropout), LDA produces \(K - 1 = 2\) discriminant functions.

Each discriminant function takes the form:

\[LD_k = a_{k1}X_1 + a_{k2}X_2 + \cdots + a_{kp}X_p, \quad k = 1, 2\]

where \(a_{kj}\) are the discriminant coefficients and \(X_j\) are the predictor variables. The coefficients are chosen to maximize the ratio of between-group variance to within-group variance, formally:

\[\max_{\mathbf{a}} \frac{\mathbf{a}^\top \mathbf{B} \mathbf{a}}{\mathbf{a}^\top \mathbf{W} \mathbf{a}}\]

where \(\mathbf{B}\) is the between-group scatter matrix and \(\mathbf{W}\) is the within-group scatter matrix. A new observation \(\mathbf{x}\) is assigned to the group with the highest posterior probability:

\[P(G = k \mid \mathbf{x}) \propto \pi_k \cdot f_k(\mathbf{x})\]

where \(\pi_k\) is the prior probability of group \(k\) and \(f_k(\mathbf{x})\) is the multivariate normal density for group \(k\). LDA assumes multivariate normality (assessed in B.1.1) and homogeneity of covariance matrices across groups (assessed in B.1.2); normality was judged approximately satisfied, while Box’s M formally rejected homogeneity, so the LDA results below are interpreted with that limitation in mind.

C.2 Data Used

Discriminant Analysis uses only the six numeric predictors from data_final. Binary predictors are excluded because LDA’s multivariate normality assumption applies to continuous variables.

da_vars <- c("admission_grade", "cu1_grade", "cu2_grade",
             "log_age_enrollment", "log_unemployment_rate", "gdp")

data_da <- data_final[, c("target", da_vars)]

cat("Dataset for Discriminant Analysis:\n")
## Dataset for Discriminant Analysis:
cat(sprintf("Rows: %d\n", nrow(data_da)))
## Rows: 4424
cat(sprintf("Predictors: %s\n", paste(da_vars, collapse = ", ")))
## Predictors: admission_grade, cu1_grade, cu2_grade, log_age_enrollment, log_unemployment_rate, gdp
cat(sprintf("Response: target (%s)\n",
            paste(levels(data_da$target), collapse = " / ")))
## Response: target (Enrolled / Graduate / Dropout)
cat("\nTarget distribution:\n")
## 
## Target distribution:
print(table(data_da$target))
## 
## Enrolled Graduate  Dropout 
##      794     2209     1421

C.3 Parameter Estimation

LDA parameters are estimated using the pooled within-group covariance matrix \(\hat{\mathbf{W}}\) and the group mean vectors \(\hat{\boldsymbol{\mu}}_k\). The discriminant coefficients \(\hat{\mathbf{a}}_k\) are obtained by solving the generalized eigenvalue problem \(\mathbf{W}^{-1}\mathbf{B}\hat{\mathbf{a}}_k = \lambda_k \hat{\mathbf{a}}_k\). The model is fitted using MASS::lda() with prior probabilities proportional to class frequencies in the training data.
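
For intuition, the eigen-decomposition behind lda() can be sketched directly from the scatter matrices; this is a hedged illustration (signs and scaling of the eigenvectors may differ from MASS::lda, but the eigenvalue proportions should approximately match the proportion of trace reported in C.5):

# Build within-group (W) and between-group (B) scatter matrices, then
# solve the generalized eigenvalue problem via eigen(W^{-1} B).
X  <- as.matrix(data_da[, da_vars])
gm <- colMeans(X)
W  <- matrix(0, ncol(X), ncol(X))
B  <- matrix(0, ncol(X), ncol(X))
for (g in levels(data_da$target)) {
  Xg <- X[data_da$target == g, , drop = FALSE]
  mg <- colMeans(Xg)
  W  <- W + crossprod(sweep(Xg, 2, mg))      # pooled within-group scatter
  B  <- B + nrow(Xg) * tcrossprod(mg - gm)   # between-group scatter
}
ev <- Re(eigen(solve(W) %*% B)$values)
round(ev[1:2] / sum(ev[1:2]), 4)             # should be close to (0.9863, 0.0137)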

lda_model <- lda(target ~ ., data = data_da)
print(lda_model)
## Call:
## lda(target ~ ., data = data_da)
## 
## Prior probabilities of groups:
##  Enrolled  Graduate   Dropout 
## 0.1794756 0.4993219 0.3212025 
## 
## Group means:
##          admission_grade cu1_grade cu2_grade log_age_enrollment
## Enrolled        125.4659 11.122109 11.112704           3.077876
## Graduate        128.7317 12.634953 12.688805           3.047643
## Dropout         124.9359  7.251445  5.895813           3.211973
##          log_unemployment_rate         gdp
## Enrolled              2.395208  0.05328715
## Graduate              2.428897  0.08183341
## Dropout               2.423492 -0.15085855
## 
## Coefficients of linear discriminants:
##                                 LD1         LD2
## admission_grade       -0.0092871503 -0.04831807
## cu1_grade             -0.0051819866 -0.14889433
## cu2_grade             -0.2124971846  0.12993474
## log_age_enrollment     1.3687558722 -0.64223224
## log_unemployment_rate -0.0009046707 -2.52440520
## gdp                    0.0025993619 -0.07100370
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9863 0.0137

Interpretation: The output displays the prior probabilities of each group, the group means for each predictor, and the raw discriminant function coefficients. The prior probabilities reflect the class distribution in the data: Graduate is the most frequent group (approximately 50%), followed by Dropout (approximately 32%) and Enrolled (approximately 18%). These priors are used when assigning new observations to groups.
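
As a quick illustration of the posterior classification rule from C.1, the fitted object exposes per-group posterior probabilities; a minimal sketch (output not reproduced here):

# Posterior probabilities P(G = k | x) and the implied argmax assignment
post <- predict(lda_model)$posterior
head(round(post, 3))
head(colnames(post)[max.col(post)])   # class with the largest posterior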

C.4 Group Means

The group means (centroids in the original variable space) summarize where each target group is located across the six predictors, and provide intuition about which variables contribute most to group separation.

group_means <- as.data.frame(lda_model$means)
kable(round(group_means, 4), caption = "Group Means of Predictors by Target Category")
Group Means of Predictors by Target Category
admission_grade cu1_grade cu2_grade log_age_enrollment log_unemployment_rate gdp
Enrolled 125.4659 11.1221 11.1127 3.0779 2.3952 0.0533
Graduate 128.7317 12.6350 12.6888 3.0476 2.4289 0.0818
Dropout 124.9359 7.2514 5.8958 3.2120 2.4235 -0.1509

Interpretation: The group means reveal clear separation between categories, particularly for the grade-related predictors. Dropout students show substantially lower cu1_grade (approximately 7.25) and cu2_grade (approximately 5.90) compared to Graduate students (approximately 12.64 and 12.69, respectively), confirming that academic performance is the primary driver of group differentiation. The Enrolled group occupies an intermediate position on both grade variables (approximately 11.12 for each). log_age_enrollment follows the opposite direction: Dropout students have the highest mean (approximately 3.21), indicating older enrollment age, while Graduate students have the lowest (approximately 3.05). admission_grade is highest for Graduates (approximately 128.73) and lowest for Dropouts (approximately 124.94). The macroeconomic variables log_unemployment_rate and gdp show minimal differences across groups, suggesting they contribute less to discrimination.

C.5 Proportion of Variance Explained

The proportion of variance (proportion of trace) attributed to each discriminant function indicates how much of the between-group separation each function captures.

prop_trace <- lda_model$svd^2 / sum(lda_model$svd^2)
prop_df <- data.frame(
  Function   = paste0("LD", seq_along(prop_trace)),
  Eigenvalue = round(lda_model$svd^2, 4),
  Proportion = round(prop_trace, 4),
  Cumulative = round(cumsum(prop_trace), 4)
)
kable(prop_df, caption = "Proportion of Between-Group Variance Explained by Each Discriminant Function")
Proportion of Between-Group Variance Explained by Each Discriminant Function
Function Eigenvalue Proportion Cumulative
LD1 1318.0448 0.9863 0.9863
LD2 18.3591 0.0137 1.0000

Interpretation: LD1 accounts for approximately 98.63% of the total between-group variance, meaning that a single linear combination of the six predictors captures nearly all discriminatory information. LD2 contributes only about 1.37%. This indicates that the three student groups are primarily separated along one dominant dimension, which is driven by semester grades and enrollment age as seen in the coefficient magnitudes.

C.6 Significance Testing (Wilks’ Lambda)

Wilks’ Lambda (\(\Lambda\)) is the overall multivariate test statistic for discriminant analysis. It measures the proportion of total variance in the discriminant scores that is not explained by group differences. Values closer to 0 indicate stronger group separation.

Hypothesis:

\(H_0\): The group mean vectors are equal across all three categories (no discriminant function is significant).

\(H_1\): At least one group mean vector differs significantly.

Decision rule: Reject \(H_0\) if the chi-square approximation \(\chi^2 = -\left(n - 1 - \frac{p + K}{2}\right)\ln\Lambda\) exceeds \(\chi^2_{\alpha, p(K-1)}\).

man_fit <- manova(as.matrix(data_da[, da_vars]) ~ data_da$target)
wilks_result <- summary(man_fit, test = "Wilks")
print(wilks_result)
##                  Df  Wilks approx F num Df den Df    Pr(>F)    
## data_da$target    2 0.6213   197.74     12   8832 < 2.2e-16 ***
## Residuals      4421                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
wilks_val <- wilks_result$stats["data_da$target", "Wilks"]
approx_f   <- wilks_result$stats["data_da$target", "approx F"]
num_df     <- wilks_result$stats["data_da$target", "num Df"]
den_df     <- wilks_result$stats["data_da$target", "den Df"]
pval_wilks <- wilks_result$stats["data_da$target", "Pr(>F)"]

wilks_tbl <- data.frame(
  Statistic        = "Wilks Lambda",
  Lambda           = round(wilks_val, 4),
  "Approx F"       = round(approx_f, 4),
  "Num df"         = num_df,
  "Den df"         = round(den_df, 2),
  "p-value"        = format(pval_wilks, scientific = TRUE, digits = 4),
  Decision         = ifelse(pval_wilks < 0.05, "Reject H0", "Fail to Reject H0"),
  check.names      = FALSE
)
kable(wilks_tbl, caption = "Wilks' Lambda Test for Overall Discriminant Analysis Significance")
Wilks’ Lambda Test for Overall Discriminant Analysis Significance
Statistic Lambda Approx F Num df Den df p-value Decision
Wilks Lambda 0.6213 197.741 12 8832 0e+00 Reject H0

Interpretation: Wilks’ Lambda is approximately 0.6213, indicating that about 37.87% of the total variance in the predictor space is attributable to group differences (since \(1 - \Lambda = 0.3787\)). The associated F-approximation is highly significant (p < 0.0001), leading to the rejection of \(H_0\). This confirms that the three student groups (Enrolled, Graduate, Dropout) are statistically distinguishable using the six numeric predictors. The discriminant analysis model is globally significant.
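
As a hedged arithmetic cross-check of the chi-square approximation stated in the decision rule (with n = 4424, p = 6, K = 3 and the observed \(\Lambda\)):

# Bartlett's chi-square approximation for Wilks' Lambda
n <- 4424; p <- 6; K <- 3
chi_sq <- -(n - 1 - (p + K) / 2) * log(0.6213)    # about 2103
chi_sq > qchisq(0.95, df = p * (K - 1))           # TRUE: reject H0 at alpha = 0.05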

C.7 Classification and Prediction

Once the discriminant functions are estimated, each observation is assigned to the group that maximizes its posterior probability. The classification results on the training data (apparent error rate) and via leave-one-out cross-validation (LOO-CV) are examined below.

pred_apparent <- predict(lda_model)$class

conf_apparent <- table(Actual = data_da$target, Predicted = pred_apparent)
acc_apparent  <- round(sum(diag(conf_apparent)) / sum(conf_apparent), 4)

cat("Confusion Matrix (Apparent / Training):\n")
## Confusion Matrix (Apparent / Training):
print(conf_apparent)
##           Predicted
## Actual     Enrolled Graduate Dropout
##   Enrolled        0      700      94
##   Graduate        0     2086     123
##   Dropout         1      629     791
cat(sprintf("\nOverall Apparent Accuracy: %.4f (%.2f%%)\n",
            acc_apparent, acc_apparent * 100))
## 
## Overall Apparent Accuracy: 0.6503 (65.03%)
lda_cv      <- lda(target ~ ., data = data_da, CV = TRUE)
conf_cv     <- table(Actual = data_da$target, Predicted = lda_cv$class)
acc_cv      <- round(sum(diag(conf_cv)) / sum(conf_cv), 4)

cat("Confusion Matrix (Leave-One-Out Cross-Validation):\n")
## Confusion Matrix (Leave-One-Out Cross-Validation):
print(conf_cv)
##           Predicted
## Actual     Enrolled Graduate Dropout
##   Enrolled        0      700      94
##   Graduate        0     2083     126
##   Dropout         1      629     791
cat(sprintf("\nOverall LOO-CV Accuracy: %.4f (%.2f%%)\n",
            acc_cv, acc_cv * 100))
## 
## Overall LOO-CV Accuracy: 0.6496 (64.96%)

Interpretation: The apparent (training) accuracy is approximately 65% and the LOO-CV accuracy is very similar, indicating that the model is not overfitting. Graduate students are classified most accurately, with a recall of approximately 94%, meaning the model correctly identifies the vast majority of actual Graduates. Dropout students are classified with moderate recall (approximately 55.67%), with a portion of Dropout students misclassified as Graduates due to overlapping grade distributions. The Enrolled group is essentially unclassified by the model (recall near 0%), which is attributable to class imbalance (Enrolled is the smallest group at 17.95%) and the fact that Enrolled students occupy an intermediate position in the predictor space between Dropout and Graduate, making linear separation difficult. The overall accuracy of approximately 65% reflects a moderate discriminatory ability driven primarily by the Graduate class.

C.8 Per-Class Evaluation

classes_order <- levels(data_da$target)

precision_vec <- diag(conf_apparent) / colSums(conf_apparent)
recall_vec    <- diag(conf_apparent) / rowSums(conf_apparent)
f1_vec        <- 2 * precision_vec * recall_vec / (precision_vec + recall_vec)
support_vec   <- rowSums(conf_apparent)

eval_tbl <- data.frame(
  Class     = classes_order,
  Precision = round(precision_vec, 4),
  Recall    = round(recall_vec, 4),
  F1        = round(ifelse(is.nan(f1_vec), 0, f1_vec), 4),
  Support   = support_vec
)
kable(eval_tbl, caption = "Per-Class Classification Metrics (Apparent)")
Per-Class Classification Metrics (Apparent)
Class Precision Recall F1 Support
Enrolled Enrolled 0.0000 0.0000 0.0000 794
Graduate Graduate 0.6108 0.9443 0.7418 2209
Dropout Dropout 0.7847 0.5567 0.6513 1421

Interpretation: Graduate students achieve the highest recall (approximately 0.9443), reflecting that the model strongly associates high semester grades with the Graduate outcome. Dropout students are identified with moderate recall (approximately 0.5567) and the highest precision (approximately 0.7847) among all classes, suggesting that when the model predicts Dropout it is fairly reliable, but it misses a substantial proportion of actual Dropout cases. The Enrolled class has near-zero precision and recall because Enrolled students share feature distributions with both other groups and represent the smallest class, making them systematically absorbed into Graduate or Dropout predictions under LDA’s decision boundary.
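
Given the class imbalance noted in A.1, a hedged supplementary summary is the balanced accuracy, the unweighted mean of the per-class recalls, which is not inflated by the dominant Graduate class:

# Balanced accuracy from the recall vector computed above
mean(recall_vec)   # (0 + 0.9443 + 0.5567) / 3, about 0.50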

C.9 Visualization

lda_scores        <- as.data.frame(predict(lda_model)$x)
lda_scores$target <- data_da$target

ggplot(lda_scores, aes(x = LD1, y = LD2, color = target)) +
  geom_point(alpha = 0.3, size = 0.8) +
  stat_ellipse(level = 0.95, linewidth = 0.8) +
  labs(title = "LDA Discriminant Function Scores",
       subtitle = "95% confidence ellipses per group",
       x = "Linear Discriminant 1 (98.63% of variance)",
       y = "Linear Discriminant 2 (1.37% of variance)",
       color = "Target") +
  theme_bw() +
  theme(legend.position = "right")

ggplot(lda_scores, aes(x = LD1, fill = target)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of LD1 Scores by Target Group",
       x = "LD1 Score", y = "Density", fill = "Target") +
  theme_bw()

Interpretation: The scatter plot of LD1 versus LD2 illustrates the group separation in the discriminant space. Along LD1, Graduate students cluster at positive values and Dropout students cluster at negative values, confirming that LD1 primarily separates these two groups. Enrolled students occupy an overlapping region between the two, consistent with the low recall observed for that class. The 95% confidence ellipses show substantial overlap between Enrolled and both Graduate and Dropout, which explains the difficulty in classifying the Enrolled group. LD2 provides minimal additional separation (1.37% of variance) and captures secondary variation, predominantly related to log_unemployment_rate. The density plot of LD1 scores further confirms that Graduate and Dropout distributions are shifted apart while Enrolled lies in between.

C.10 Discriminant Analysis Conclusion

  • Linear Discriminant Analysis (LDA) on 4,424 students is globally significant, with Wilks’ Lambda = 0.6213 (p < 0.0001), confirming that the three student groups (Enrolled, Graduate, and Dropout) are statistically distinguishable using the six numeric predictors.

  • Academic performance is the strongest source of discrimination, particularly cu1_grade and cu2_grade, where Graduate students show substantially higher semester grades while Dropout students exhibit the lowest values. Enrollment age also plays an important role, with Dropout students having the highest average age at enrollment.

  • The first discriminant function (LD1) explains approximately 98.63% of the between-group variance, indicating that nearly all group separation is captured along a single dominant dimension primarily driven by semester grades and enrollment age, whereas LD2 contributes only 1.37%.

  • The LDA model achieves approximately 65% classification accuracy with very similar leave-one-out cross-validation performance, indicating limited overfitting. Graduate students are classified most accurately (recall ≈ 94%), while Dropout students achieve moderate recall (≈ 55.67%) and relatively high precision. In contrast, the Enrolled group is poorly classified because it represents the smallest class and occupies an intermediate region between Graduate and Dropout in the discriminant space.

  • Macroeconomic variables such as log_unemployment_rate and gdp contribute relatively little to group separation, leading to the overall conclusion that academic performance, especially semester grades, is the primary determinant of discrimination between student outcomes, followed by enrollment age, while broader economic conditions have comparatively weak discriminatory power.

D. Multinomial Logistic Regression

D.1 Model Specification

Multinomial Logistic Regression (MLR) is used to model the relationship between a set of predictor variables and a categorical response variable with 3 or more unordered categories. In this research, the response variable target has 3 categories: Enrolled, Graduate, and Dropout, with Enrolled as the reference (base) category.

With 3 outcome categories, 2 logit equations are estimated:

Equation 1: Graduate vs. Enrolled

\[\ln\left(\frac{P(\text{Graduate})}{P(\text{Enrolled})}\right) = \beta_{10} + \beta_{11}(\text{admission\_grade}) + \beta_{12}(\text{cu1\_grade}) + \ldots\]

Equation 2: Dropout vs. Enrolled

\[\ln\left(\frac{P(\text{Dropout})}{P(\text{Enrolled})}\right) = \beta_{20} + \beta_{21}(\text{admission\_grade}) + \beta_{22}(\text{cu1\_grade}) + \ldots\]

The predicted probabilities for all 3 categories, writing \(g_1(x)\) and \(g_2(x)\) for the two logits above, are given by:

\[\pi_{\text{Enrolled}}(x) = \frac{1}{1 + e^{g_1(x)} + e^{g_2(x)}}\]

\[\pi_{\text{Graduate}}(x) = \frac{e^{g_1(x)}}{1 + e^{g_1(x)} + e^{g_2(x)}}\]

\[\pi_{\text{Dropout}}(x) = \frac{e^{g_2(x)}}{1 + e^{g_1(x)} + e^{g_2(x)}}\]
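
A minimal sketch of this probability mapping: given the two logits \(g_1(x)\) (Graduate vs. Enrolled) and \(g_2(x)\) (Dropout vs. Enrolled), the three class probabilities follow from the normalization above (the logit values here are illustrative, not model output):

# Map the two baseline-category logits to the three class probabilities
mlr_probs <- function(g1, g2) {
  denom <- 1 + exp(g1) + exp(g2)   # the 1 corresponds to the Enrolled baseline
  c(Enrolled = 1 / denom,
    Graduate = exp(g1) / denom,
    Dropout  = exp(g2) / denom)
}
mlr_probs(g1 = 1.2, g2 = -0.5)     # probabilities sum to 1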

D.2 Data Used

data_final used here is the cleaned dataset produced in Section B, comprising 4424 observations, 6 numeric predictors (including the log-transformed log_age_enrollment and log_unemployment_rate), 4 binary predictors, and the response variable target with Enrolled as the base category. Target distribution: Enrolled = 794, Graduate = 2209, Dropout = 1421.

cat("Final dataset for Multinomial Logistic Regression:\n")
## Final dataset for Multinomial Logistic Regression:
cat(sprintf("Rows: %d\n", nrow(data_final)))
## Rows: 4424
cat(sprintf("Columns: %d\n", ncol(data_final)))
## Columns: 11
cat("\nNumeric predictors :", paste(c("admission_grade", "cu1_grade", "cu2_grade",
                                      "log_age_enrollment", "log_unemployment_rate", "gdp"),
                                    collapse = ", "))
## 
## Numeric predictors : admission_grade, cu1_grade, cu2_grade, log_age_enrollment, log_unemployment_rate, gdp
cat("\nBinary predictors:", paste(binary_vars, collapse = ", "))
## 
## Binary predictors: gender, scholarship_holder, debtor, tuition_up_to_date
cat("\nResponse variable: target (base = Enrolled)\n")
## 
## Response variable: target (base = Enrolled)
cat("\nTarget distribution:\n")
## 
## Target distribution:
print(table(data_final$target))
## 
## Enrolled Graduate  Dropout 
##      794     2209     1421

Variable descriptions:

Numeric predictors:

  • admission_grade is the student admission grade at entry.

  • cu1_grade is the first semester curricular units grade.

  • cu2_grade is the second semester curricular units grade.

  • log_age_enrollment is the natural logarithm of age at enrollment.

  • log_unemployment_rate is the natural logarithm of unemployment rate.

  • gdp is the Gross Domestic Product growth indicator.

Binary predictors:

  • gender is coded as 0 = Female and 1 = Male.

  • scholarship_holder is coded as 0 = Non-scholarship student and 1 = Scholarship holder.

  • debtor is coded as 0 = No outstanding debt and 1 = Has outstanding debt.

  • tuition_up_to_date is coded as 0 = Tuition payment overdue and 1 = Tuition payment up to date.

The response variable target consists of three categories: Enrolled (reference category), Graduate, and Dropout.

D.3 Parameter Estimation: Maximum Likelihood Estimation (MLE)

Parameter estimation in MLR uses Maximum Likelihood Estimation (MLE) via the Newton-Raphson iterative algorithm. The algorithm finds the parameter vector \(\hat{\beta}\) that maximizes the log-likelihood function:

\[L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_{1i} g_1(x_i) + y_{2i} g_2(x_i) - \ln\left(1 + e^{g_1(x_i)} + e^{g_2(x_i)}\right) \right]\]

The Newton-Raphson update rule at each iteration \(t\) is:

\[\beta^{(t+1)} = \beta^{(t)} - \left(H^{(t)}\right)^{-1} g^{(t)}\]

where \(H^{(t)}\) is the Hessian matrix of second derivatives and \(g^{(t)}\) is the gradient vector. The iteration continues until convergence, \(\|\hat{\beta}^{(t+1)} - \hat{\beta}^{(t)}\| \leq \varepsilon\).
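
For intuition, below is a minimal sketch of the Newton-Raphson update in the simpler binary-logistic case; it is illustrative only (nnet::multinom optimizes the multinomial likelihood with its own quasi-Newton routine), and the function name and mtcars sanity check are ad-hoc assumptions, not part of the analysis:

# Newton-Raphson for binary logistic regression: beta <- beta - H^{-1} g
newton_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))                        # start at zero
  for (t in seq_len(max_iter)) {
    p <- as.vector(1 / (1 + exp(-X %*% beta)))   # fitted probabilities
    g <- t(X) %*% (y - p)                        # gradient of the log-likelihood
    H <- -t(X) %*% (X * (p * (1 - p)))           # Hessian: -X' W X, W diagonal
    beta_new <- beta - as.vector(solve(H, g))    # Newton-Raphson update
    if (sqrt(sum((beta_new - beta)^2)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

# Sanity check: should agree closely with glm coefficients
X <- cbind(1, scale(mtcars$mpg)); y <- mtcars$am
round(newton_logistic(X, y), 4)
round(unname(coef(glm(am ~ scale(mpg), data = mtcars, family = binomial))), 4)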

# Set reference category to "Enrolled"
data_final$target <- relevel(data_final$target, ref = "Enrolled")

# Fit the full multinomial logistic regression model
# multinom() maximizes the likelihood numerically (a quasi-Newton scheme via nnet)
mlr_model <- multinom(
  target ~ admission_grade + cu1_grade + cu2_grade +
    log_age_enrollment + log_unemployment_rate + gdp +
    gender + scholarship_holder + debtor + tuition_up_to_date,
  data = data_final,
  trace = TRUE
)
## # weights:  36 (22 variable)
## initial  value 4860.260765 
## iter  10 value 3654.597852
## iter  20 value 3330.946554
## iter  30 value 3238.590122
## final  value 3238.589603 
## converged

Interpretation: The model converged after roughly 30 iterations. The reported objective is the negative log-likelihood, which decreased from an initial 4860.26 to a final 3238.59; equivalently, the log-likelihood rose from -4860.26 to -3238.59. Convergence confirms that the MLE algorithm found a stable parameter solution. The Residual Deviance was calculated as:

\[ \text{Residual Deviance} = -2 \log L \]

\[ = -2(-3238.59) = 6477.18 \]

The Akaike Information Criterion (AIC) was then computed as:

\[ AIC = -2 \log L + 2k \]

where \(k = 22\) is the number of estimated parameters:

\[ AIC = 6477.18 + 2(22) \]

\[ = 6477.18 + 44 = 6521.18 \]

The model therefore has a Residual Deviance of 6477.18 and an AIC of 6521.18, both of which are used for overall model evaluation and comparison.

D.4 Model Summary

# Full model summary: coefficients and standard errors
summary(mlr_model)
## Call:
## multinom(formula = target ~ admission_grade + cu1_grade + cu2_grade + 
##     log_age_enrollment + log_unemployment_rate + gdp + gender + 
##     scholarship_holder + debtor + tuition_up_to_date, data = data_final, 
##     trace = TRUE)
## 
## Coefficients:
##          (Intercept) admission_grade  cu1_grade   cu2_grade log_age_enrollment
## Graduate   -5.179635     0.018379918 0.05639390  0.09578006         -0.0384231
## Dropout    -1.779048    -0.003231271 0.02996069 -0.18670045          1.4521184
##          log_unemployment_rate          gdp    gender1 scholarship_holder1
## Graduate             0.4315216  0.023601139 -0.4621107           1.0053433
## Dropout              0.4767694 -0.003384006  0.1569755          -0.3064357
##              debtor1 tuition_up_to_date1
## Graduate -0.83200409            1.085385
## Dropout   0.05714677           -1.897509
## 
## Std. Errors:
##          (Intercept) admission_grade  cu1_grade  cu2_grade log_age_enrollment
## Graduate   0.9046789     0.003267922 0.02116972 0.02123601          0.1880110
## Dropout    0.9604015     0.003633323 0.01895811 0.01833973          0.1967043
##          log_unemployment_rate        gdp    gender1 scholarship_holder1
## Graduate             0.1922021 0.02036279 0.09310966           0.1097302
## Dropout              0.2211705 0.02376595 0.10418977           0.1492258
##            debtor1 tuition_up_to_date1
## Graduate 0.1648430           0.2597464
## Dropout  0.1597679           0.1882050
## 
## Residual Deviance: 6477.179 
## AIC: 6521.179
# Coefficient table with confidence intervals
tidy(mlr_model, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "MLR Coefficient Table (Log-Odds) with 95% Confidence Intervals")
MLR Coefficient Table (Log-Odds) with 95% Confidence Intervals
y.level term estimate std.error statistic p.value conf.low conf.high
Graduate (Intercept) -5.1796 0.9047 -5.7254 0.0000 -6.9528 -3.4065
Graduate admission_grade 0.0184 0.0033 5.6243 0.0000 0.0120 0.0248
Graduate cu1_grade 0.0564 0.0212 2.6639 0.0077 0.0149 0.0979
Graduate cu2_grade 0.0958 0.0212 4.5103 0.0000 0.0542 0.1374
Graduate log_age_enrollment -0.0384 0.1880 -0.2044 0.8381 -0.4069 0.3301
Graduate log_unemployment_rate 0.4315 0.1922 2.2451 0.0248 0.0548 0.8082
Graduate gdp 0.0236 0.0204 1.1590 0.2464 -0.0163 0.0635
Graduate gender1 -0.4621 0.0931 -4.9631 0.0000 -0.6446 -0.2796
Graduate scholarship_holder1 1.0053 0.1097 9.1620 0.0000 0.7903 1.2204
Graduate debtor1 -0.8320 0.1648 -5.0473 0.0000 -1.1551 -0.5089
Graduate tuition_up_to_date1 1.0854 0.2597 4.1786 0.0000 0.5763 1.5945
Dropout (Intercept) -1.7790 0.9604 -1.8524 0.0640 -3.6614 0.1033
Dropout admission_grade -0.0032 0.0036 -0.8893 0.3738 -0.0104 0.0039
Dropout cu1_grade 0.0300 0.0190 1.5804 0.1140 -0.0072 0.0671
Dropout cu2_grade -0.1867 0.0183 -10.1801 0.0000 -0.2226 -0.1508
Dropout log_age_enrollment 1.4521 0.1967 7.3822 0.0000 1.0666 1.8377
Dropout log_unemployment_rate 0.4768 0.2212 2.1557 0.0311 0.0433 0.9103
Dropout gdp -0.0034 0.0238 -0.1424 0.8868 -0.0500 0.0432
Dropout gender1 0.1570 0.1042 1.5066 0.1319 -0.0472 0.3612
Dropout scholarship_holder1 -0.3064 0.1492 -2.0535 0.0400 -0.5989 -0.0140
Dropout debtor1 0.0571 0.1598 0.3577 0.7206 -0.2560 0.3703
Dropout tuition_up_to_date1 -1.8975 0.1882 -10.0821 0.0000 -2.2664 -1.5286

Interpretation: The coefficient table summarizes the estimated log-odds effects of each predictor on the probabilities of Graduate and Dropout relative to the reference category, Enrolled.

  • For the Graduate outcome, admission grade, first semester grade (cu1_grade), second semester grade (cu2_grade), scholarship status, tuition payment status, and unemployment rate have significant positive effects, while male gender and debtor status significantly reduce the likelihood of graduation. In detail, scholarship holders and students with up-to-date tuition payments show higher log-odds of graduating.

  • For the Dropout outcome, second semester grade has a strong negative effect, indicating that better academic performance lowers dropout risk, whereas older age at enrollment significantly increases the probability of dropping out. Up-to-date tuition payment status strongly reduces dropout likelihood, while scholarship status also shows a protective effect against dropout. GDP and other variables have non-significant coefficients because their 95% confidence intervals include 0.

    The results indicate that financial stability, academic performance, and age at enrollment are the primary determinants of student outcomes in the model.

D.5 Parameter Significance Testing

D.5.1 Simultaneous Test (Likelihood Ratio Test)

The simultaneous test evaluates whether all predictors jointly contribute to explaining variation in the response variable.

Hypothesis:

\(H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0\) (no predictor has an effect)

\(H_1:\) at least one \(\beta_j \neq 0\), for \(j = 1, 2, \ldots, p\)

Decision rule: Reject \(H_0\) if \(G^2 > \chi^2_{\alpha, p}\), where:

\[G = -2 \ln \left[\frac{\text{likelihood without predictors}}{\text{likelihood with predictors}}\right]\]

# Null model (intercept only) for comparison
mlr_null <- multinom(target ~ 1, data = data_final, trace = FALSE)

# Likelihood Ratio Test
lrt_stat <- as.numeric(2 * (logLik(mlr_model) - logLik(mlr_null)))
lrt_df   <- length(coef(mlr_model)) - length(coef(mlr_null))
lrt_pval <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

lrt_result <- data.frame(
  Test      = "Likelihood Ratio Test (Simultaneous)",
  G2        = round(lrt_stat, 4),
  df        = lrt_df,
  p_value   = format(lrt_pval, scientific = TRUE, digits = 4),
  Decision  = ifelse(lrt_pval < 0.05, "Reject H0", "Fail to Reject H0")
)
kable(lrt_result, caption = "Simultaneous Parameter Significance Test (using LRT)")
Simultaneous Parameter Significance Test (using LRT)
Test G2 df p_value Decision
Likelihood Ratio Test (Simultaneous) 2546.486 20 0e+00 Reject H0

Interpretation: The LRT produces \(G^2 = 2546.49\) with df = 20 and p-value ≈ 0 (very small, approaching 0). Since p-value < 0.05, \(H_0\) is rejected. It is concluded that, simultaneously, at least one predictor significantly contributes to explaining student academic outcome categories. The model as a whole is statistically significant.

D.5.2 Partial Test (Wald Test)

The partial test evaluates the individual significance of each predictor while controlling for all other variables in the model.

Hypothesis (for each predictor \(j\)):

\(H_0: \beta_j = 0\)

\(H_1: \beta_j \neq 0\), for \(j = 1, 2, \ldots, p\)

Decision rule: Reject \(H_0\) if \(W^2 > \chi^2_{\alpha, 1}\), where:

\[W^2 = \left[\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\right]^2, \quad SE(\hat{\beta}_j) = \sqrt{Var(\hat{\beta}_j)}\]

# Wald z-statistics and p-values
coef_mat <- summary(mlr_model)$coefficients
se_mat   <- summary(mlr_model)$standard.errors

z_mat <- coef_mat / se_mat
p_mat <- (1 - pnorm(abs(z_mat), 0, 1)) * 2

cat("Wald Test: z-statistics:\n")
## Wald Test: z-statistics:
print(round(z_mat, 4))
##          (Intercept) admission_grade cu1_grade cu2_grade log_age_enrollment
## Graduate     -5.7254          5.6243    2.6639    4.5103            -0.2044
## Dropout      -1.8524         -0.8893    1.5804  -10.1801             7.3822
##          log_unemployment_rate     gdp gender1 scholarship_holder1 debtor1
## Graduate                2.2451  1.1590 -4.9631              9.1620 -5.0473
## Dropout                 2.1557 -0.1424  1.5066             -2.0535  0.3577
##          tuition_up_to_date1
## Graduate              4.1786
## Dropout             -10.0821
cat("\nWald Test: p-values:\n")
## 
## Wald Test: p-values:
print(round(p_mat, 4))
##          (Intercept) admission_grade cu1_grade cu2_grade log_age_enrollment
## Graduate       0.000          0.0000    0.0077         0             0.8381
## Dropout        0.064          0.3738    0.1140         0             0.0000
##          log_unemployment_rate    gdp gender1 scholarship_holder1 debtor1
## Graduate                0.0248 0.2464  0.0000                0.00  0.0000
## Dropout                 0.0311 0.8868  0.1319                0.04  0.7206
##          tuition_up_to_date1
## Graduate                   0
## Dropout                    0
# Wald test table
wald_tbl <- tidy(mlr_model, conf.int = FALSE) %>%
  mutate(
    W2        = round(statistic^2, 4),
    p.value   = round(p.value, 4),
    Significant = ifelse(p.value < 0.05, "Yes (*)", "No")
  ) %>%
  select(y.level, term, estimate, std.error, statistic, W2, p.value, Significant) %>%
  rename(
    Category  = y.level,
    Predictor = term,
    Beta      = estimate,
    SE        = std.error,
    z         = statistic,
    "W^2"     = W2,
    "p-value" = p.value
  ) %>%
  mutate(across(c(Beta, SE, z), ~round(., 4)))

kable(wald_tbl, caption = "Partial Wald Test per Predictor")
Partial Wald Test per Predictor
Category Predictor Beta SE z W^2 p-value Significant
Graduate (Intercept) -5.1796 0.9047 -5.7254 32.7800 0.0000 Yes (*)
Graduate admission_grade 0.0184 0.0033 5.6243 31.6332 0.0000 Yes (*)
Graduate cu1_grade 0.0564 0.0212 2.6639 7.0963 0.0077 Yes (*)
Graduate cu2_grade 0.0958 0.0212 4.5103 20.3425 0.0000 Yes (*)
Graduate log_age_enrollment -0.0384 0.1880 -0.2044 0.0418 0.8381 No
Graduate log_unemployment_rate 0.4315 0.1922 2.2451 5.0407 0.0248 Yes (*)
Graduate gdp 0.0236 0.0204 1.1590 1.3434 0.2464 No
Graduate gender1 -0.4621 0.0931 -4.9631 24.6322 0.0000 Yes (*)
Graduate scholarship_holder1 1.0053 0.1097 9.1620 83.9414 0.0000 Yes (*)
Graduate debtor1 -0.8320 0.1648 -5.0473 25.4747 0.0000 Yes (*)
Graduate tuition_up_to_date1 1.0854 0.2597 4.1786 17.4610 0.0000 Yes (*)
Dropout (Intercept) -1.7790 0.9604 -1.8524 3.4314 0.0640 No
Dropout admission_grade -0.0032 0.0036 -0.8893 0.7909 0.3738 No
Dropout cu1_grade 0.0300 0.0190 1.5804 2.4975 0.1140 No
Dropout cu2_grade -0.1867 0.0183 -10.1801 103.6347 0.0000 Yes (*)
Dropout log_age_enrollment 1.4521 0.1967 7.3822 54.4975 0.0000 Yes (*)
Dropout log_unemployment_rate 0.4768 0.2212 2.1557 4.6469 0.0311 Yes (*)
Dropout gdp -0.0034 0.0238 -0.1424 0.0203 0.8868 No
Dropout gender1 0.1570 0.1042 1.5066 2.2699 0.1319 No
Dropout scholarship_holder1 -0.3064 0.1492 -2.0535 4.2169 0.0400 Yes (*)
Dropout debtor1 0.0571 0.1598 0.3577 0.1279 0.7206 No
Dropout tuition_up_to_date1 -1.8975 0.1882 -10.0821 101.6495 0.0000 Yes (*)

Interpretation: Based on the Wald test, significant predictors (p < 0.05),

  • For the Graduate vs. Enrolled category are: admission_grade (\(W^2 = 31.63\), p = 0.000), cu1_grade (\(W^2 = 7.10\), p = 0.008), cu2_grade (\(W^2 = 20.34\), p = 0.000), log_unemployment_rate (\(W^2 = 5.04\), p = 0.025), gender1 (\(W^2 = 24.63\), p = 0.000), scholarship_holder1 (\(W^2 = 83.94\), p = 0.000), debtor1 (\(W^2 = 25.47\), p = 0.000), and tuition_up_to_date1 (\(W^2 = 17.46\), p = 0.000). Predictors log_age_enrollment and gdp are not significant for Graduate.

  • For the Dropout vs. Enrolled category, significant predictors are: cu2_grade (\(W^2 = 103.63\), p = 0.000), log_age_enrollment (\(W^2 = 54.50\), p = 0.000), log_unemployment_rate (\(W^2 = 4.65\), p = 0.031), scholarship_holder1 (\(W^2 = 4.22\), p = 0.040), and tuition_up_to_date1 (\(W^2 = 101.65\), p = 0.000). Predictors admission_grade, cu1_grade, gdp, gender1, and debtor1 are not significant for Dropout.

D.6 Model Fit

# AIC and Residual Deviance
cat("Model Fit Summary:\n")
## Model Fit Summary:
cat(sprintf("AIC: %.2f\n", AIC(mlr_model)))
## AIC: 6521.18
cat(sprintf("Residual Deviance: %.2f\n", mlr_model$deviance))
## Residual Deviance: 6477.18
cat(sprintf("Log-Likelihood: %.4f\n", as.numeric(logLik(mlr_model))))
## Log-Likelihood: -3238.5896
pseudo_r2 <- 1 - (as.numeric(logLik(mlr_model)) / as.numeric(logLik(mlr_null)))
cat(sprintf("McFadden R2: %.4f\n", pseudo_r2))
## McFadden R2: 0.2822

Interpretation: The model produces AIC = 6521.18, Residual Deviance = 6477.18, and Log-Likelihood = -3238.59. McFadden Pseudo \(R^2 = 0.2822\) falls within the 0.2–0.4 range conventionally read as a good fit for this statistic: the model improves the log-likelihood by approximately 28.22% relative to the null (intercept-only) model. This confirms that the combination of academic and financial predictors has reasonably strong explanatory power for students’ final academic status.

D.7 Interpretation of the Model

D.7.1 Relative Log-Odds

The relative log-odds (the coefficients \(\hat{\beta}\)) represent the change in the log-odds of being in category \(j\) rather than the base category (Enrolled) for each one-unit increase in a predictor.

# Relative log-odds table
log_odds_tbl <- tidy(mlr_model, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  select(y.level, term, estimate, std.error, p.value, conf.low, conf.high) %>%
  rename(
    Category  = y.level,
    Predictor = term,
    "Log-Odds (β)" = estimate,
    SE        = std.error,
    "p-value" = p.value,
    "CI 2.5%" = conf.low,
    "CI 97.5%" = conf.high
  )

kable(log_odds_tbl, caption = "Relative Log-Odds Coefficients")
Relative Log-Odds Coefficients
Category  Predictor              Log-Odds (β)   SE      p-value  CI 2.5%  CI 97.5%
Graduate  (Intercept)            -5.1796        0.9047  0.0000   -6.9528  -3.4065
Graduate  admission_grade         0.0184        0.0033  0.0000    0.0120   0.0248
Graduate  cu1_grade               0.0564        0.0212  0.0077    0.0149   0.0979
Graduate  cu2_grade               0.0958        0.0212  0.0000    0.0542   0.1374
Graduate  log_age_enrollment     -0.0384        0.1880  0.8381   -0.4069   0.3301
Graduate  log_unemployment_rate   0.4315        0.1922  0.0248    0.0548   0.8082
Graduate  gdp                     0.0236        0.0204  0.2464   -0.0163   0.0635
Graduate  gender1                -0.4621        0.0931  0.0000   -0.6446  -0.2796
Graduate  scholarship_holder1     1.0053        0.1097  0.0000    0.7903   1.2204
Graduate  debtor1                -0.8320        0.1648  0.0000   -1.1551  -0.5089
Graduate  tuition_up_to_date1     1.0854        0.2597  0.0000    0.5763   1.5945
Dropout   (Intercept)            -1.7790        0.9604  0.0640   -3.6614   0.1033
Dropout   admission_grade        -0.0032        0.0036  0.3738   -0.0104   0.0039
Dropout   cu1_grade               0.0300        0.0190  0.1140   -0.0072   0.0671
Dropout   cu2_grade              -0.1867        0.0183  0.0000   -0.2226  -0.1508
Dropout   log_age_enrollment      1.4521        0.1967  0.0000    1.0666   1.8377
Dropout   log_unemployment_rate   0.4768        0.2212  0.0311    0.0433   0.9103
Dropout   gdp                    -0.0034        0.0238  0.8868   -0.0500   0.0432
Dropout   gender1                 0.1570        0.1042  0.1319   -0.0472   0.3612
Dropout   scholarship_holder1    -0.3064        0.1492  0.0400   -0.5989  -0.0140
Dropout   debtor1                 0.0571        0.1598  0.7206   -0.2560   0.3703
Dropout   tuition_up_to_date1    -1.8975        0.1882  0.0000   -2.2664  -1.5286

Interpretation for Numeric Predictors:

  • admission_grade:
    • Graduate vs. Enrolled: A one-unit increase in admission grade is associated with a 0.0184 increase in the log-odds of graduating vs. remaining enrolled (\(z = 5.62\), p = 0.000). Although the coefficient is small, the effect is significant and consistent: higher admission grades are reliably associated with a greater tendency to graduate.
    • Dropout vs. Enrolled: A one-unit increase in admission grade is associated with a 0.0032 decrease in the log-odds of dropping out vs. remaining enrolled (\(z = -0.89\), p = 0.374), but the effect is not statistically significant. Admission grade does not meaningfully differentiate Dropout from Enrolled.
  • cu1_grade:
    • Graduate vs. Enrolled: A one-unit increase in 1st semester grade is associated with a 0.0564 increase in the log-odds of graduating vs. remaining enrolled (\(z = 2.66\), p = 0.008). First semester performance is a significant early signal toward graduation.
    • Dropout vs. Enrolled: A one-unit increase in 1st semester grade is associated with a 0.0300 increase in the log-odds of dropping out vs. remaining enrolled (\(z = 1.58\), p = 0.114), but not significant. First semester grade is more relevant for distinguishing Graduates from Enrolled than Dropouts from Enrolled.
  • cu2_grade:
    • Graduate vs. Enrolled: A one-unit increase in 2nd semester grade is associated with a 0.0958 increase in the log-odds of graduating vs. remaining enrolled (\(z = 4.51\), p = 0.000). Second semester grade is a stronger predictor of graduation than first semester grade.
    • Dropout vs. Enrolled: A one-unit increase in 2nd semester grade is associated with a 0.1867 decrease in the log-odds of dropping out vs. remaining enrolled (\(z = -10.18\), p = 0.000). This is the strongest predictor for Dropout: poor second semester performance is closely associated with dropout risk.
  • log_age_enrollment:
    • Graduate vs. Enrolled: A one-unit increase in log age at enrollment is associated with a 0.0384 decrease in the log-odds of graduating vs. remaining enrolled (\(z = -0.20\), p = 0.838), not significant.
    • Dropout vs. Enrolled: A one-unit increase in log age at enrollment is associated with a 1.4521 increase in the log-odds of dropping out vs. remaining enrolled (\(z = 7.38\), p = 0.000). This is a very large effect: older age at enrollment substantially increases dropout risk relative to remaining enrolled.
  • log_unemployment_rate:
    • Graduate vs. Enrolled: A one-unit increase in log unemployment rate is associated with a 0.4315 increase in the log-odds of graduating vs. remaining enrolled (\(z = 2.25\), p = 0.025), significant. Students who graduate may be more motivated to complete their studies promptly when labor market conditions are difficult.
    • Dropout vs. Enrolled: A one-unit increase in log unemployment rate is associated with a 0.4768 increase in the log-odds of dropping out vs. remaining enrolled (\(z = 2.16\), p = 0.031), also significant. High unemployment simultaneously pushes toward faster graduation and exacerbates dropout pressure.
  • gdp:
    • Graduate vs. Enrolled: A one-unit increase in GDP is associated with a 0.0236 increase in the log-odds of graduating vs. remaining enrolled (\(z = 1.16\), p = 0.246), not significant.
    • Dropout vs. Enrolled: A one-unit increase in GDP is associated with a 0.0034 decrease in the log-odds of dropping out vs. remaining enrolled (\(z = -0.14\), p = 0.887), not significant. GDP does not have a meaningful effect on student status in this model.

Interpretation for Binary Predictors:

  • scholarship_holder = 1:
    • Graduate vs. Enrolled: Scholarship holders have a 1.0053 higher log-odds of graduating vs. remaining enrolled compared to non-scholarship holders (\(z = 9.16\), p = 0.000). This is the second largest coefficient (in absolute value) for Graduate: receiving a scholarship is a very strong protective factor.
    • Dropout vs. Enrolled: Scholarship holders have a 0.3064 lower log-odds of dropping out vs. remaining enrolled compared to non-scholarship holders (\(z = -2.05\), p = 0.040), significant. Scholarships reduce the financial pressure that is a primary driver of dropout.
  • tuition_up_to_date = 1:
    • Graduate vs. Enrolled: Students with up-to-date tuition have a 1.0854 higher log-odds of graduating vs. remaining enrolled compared to those with overdue tuition (\(z = 4.18\), p = 0.000). Timely tuition payment reflects financial stability that supports academic completion.
    • Dropout vs. Enrolled: Students with up-to-date tuition have a 1.8975 lower log-odds of dropping out vs. remaining enrolled (\(z = -10.08\), p = 0.000). This is the largest coefficient (in absolute terms) for Dropout: overdue tuition is the strongest predictor of dropout in this model.
  • debtor = 1:
    • Graduate vs. Enrolled: Debtors have a 0.8320 lower log-odds of graduating vs. remaining enrolled compared to non-debtors (\(z = -5.05\), p = 0.000), significant. Financial debt impedes academic progress toward graduation.
    • Dropout vs. Enrolled: Debtors have a 0.0571 higher log-odds of dropping out vs. remaining enrolled (\(z = 0.36\), p = 0.721), not significant. Debtor status more strongly differentiates Graduates from Enrolled than Dropouts from Enrolled.
  • gender = 1 (male vs. female):
    • Graduate vs. Enrolled: Male students have a 0.4621 lower log-odds of graduating vs. remaining enrolled compared to female students (\(z = -4.96\), p = 0.000). Female students are more likely to graduate than male students, after controlling for all other predictors.
    • Dropout vs. Enrolled: Male students have a 0.1570 higher log-odds of dropping out vs. remaining enrolled (\(z = 1.51\), p = 0.132), not significant. Gender differences are more pronounced on the graduation pathway than the dropout pathway.

D.7.2 Relative Risk Ratios (RRR)

Relative Risk Ratios are obtained by exponentiating the log-odds coefficients: \(RRR = e^{\hat{\beta}}\). An RRR > 1 indicates higher relative risk of being in category \(j\) compared to the base; an RRR < 1 indicates lower relative risk.
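
Equivalently, the RRRs in the table below can be obtained by exponentiating the raw coefficient matrix; a minimal sketch that should match the tidy(..., exponentiate = TRUE) output:

# RRRs directly from the coefficient matrix (one row per non-base category)
exp(coef(mlr_model))
exp(confint(mlr_model))   # matching 95% confidence intervals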

# Relative Risk Ratios table
rrr_tbl <- tidy(mlr_model, conf.int = TRUE, exponentiate = TRUE) %>%
  mutate(
    odds_change_pct = ifelse(
      estimate >= 1,
      (estimate - 1) * 100,
      (1 - estimate) * 100
    ),
    direction = ifelse(estimate >= 1, "+", "-")
  ) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  select(
    y.level, term,
    estimate,
    odds_change_pct,
    direction,
    p.value,
    conf.low,
    conf.high
  ) %>%
  rename(
    Category          = y.level,
    Predictor         = term,
    "RRR (e^β)"       = estimate,
    "% Change in Odds" = odds_change_pct,
    Direction         = direction,
    "p-value"         = p.value,
    "CI 2.5%"         = conf.low,
    "CI 97.5%"        = conf.high
  )

kable(rrr_tbl,
      caption = "Relative Risk Ratios (RRR = exp(β)) and Percentage Change in Odds")
Relative Risk Ratios (RRR = exp(β)) and Percentage Change in Odds
Category  Predictor              RRR (e^β)  % Change in Odds  Direction  p-value  CI 2.5%  CI 97.5%
Graduate  (Intercept)            0.0056      99.4370          -          0.0000   0.0010   0.0332
Graduate  admission_grade        1.0185       1.8550          +          0.0000   1.0120   1.0251
Graduate  cu1_grade              1.0580       5.8014          +          0.0077   1.0150   1.1028
Graduate  cu2_grade              1.1005      10.0517          +          0.0000   1.0557   1.1473
Graduate  log_age_enrollment     0.9623       3.7694          -          0.8381   0.6657   1.3911
Graduate  log_unemployment_rate  1.5396      53.9598          +          0.0248   1.0563   2.2439
Graduate  gdp                    1.0239       2.3882          +          0.2464   0.9838   1.0656
Graduate  gender1                0.6300      37.0047          -          0.0000   0.5249   0.7561
Graduate  scholarship_holder1    2.7328     173.2845          +          0.0000   2.2040   3.3886
Graduate  debtor1                0.4352      56.4824          -          0.0000   0.3150   0.6011
Graduate  tuition_up_to_date1    2.9606     196.0581          +          0.0000   1.7794   4.9258
Dropout   (Intercept)            0.1688      83.1201          -          0.0640   0.0257   1.1088
Dropout   admission_grade        0.9968       0.3226          -          0.3738   0.9897   1.0039
Dropout   cu1_grade              1.0304       3.0414          +          0.1140   0.9928   1.0694
Dropout   cu2_grade              0.8297      17.0308          -          0.0000   0.8004   0.8601
Dropout   log_age_enrollment     4.2722     327.2155          +          0.0000   2.9054   6.2818
Dropout   log_unemployment_rate  1.6109      61.0862          +          0.0311   1.0442   2.4850
Dropout   gdp                    0.9966       0.3378          -          0.8868   0.9513   1.0441
Dropout   gender1                1.1700      16.9967          +          0.1319   0.9539   1.4350
Dropout   scholarship_holder1    0.7361      26.3934          -          0.0400   0.5494   0.9861
Dropout   debtor1                1.0588       5.8811          +          0.7206   0.7741   1.4481
Dropout   tuition_up_to_date1    0.1499      85.0058          -          0.0000   0.1037   0.2168

Interpretation for Numeric Predictors:

  • admission_grade:
    • Graduate vs. Enrolled: A one-unit increase in admission grade multiplies the odds of graduating vs. remaining enrolled by 1.0185 (an increase of approximately 1.85%, p = 0.000). Each additional point in admission grade consistently increases the relative probability of graduation.
    • Dropout vs. Enrolled: A one-unit increase in admission grade multiplies the odds of dropping out vs. remaining enrolled by 0.9968 (a decrease of approximately 0.32%, p = 0.374), not significant. Admission grade does not meaningfully differentiate between Dropout and Enrolled.
  • cu1_grade:
    • Graduate vs. Enrolled: A one-unit increase in 1st semester grade multiplies the odds of graduating vs. remaining enrolled by 1.0580 (an increase of 5.80%, p = 0.008). Better first semester performance increases the relative odds of graduation compared to remaining enrolled.
    • Dropout vs. Enrolled: A one-unit increase in 1st semester grade multiplies the odds of dropping out vs. remaining enrolled by 1.0304 (an increase of 3.04%, p = 0.114), not significant.
  • cu2_grade:
    • Graduate vs. Enrolled: A one-unit increase in 2nd semester grade multiplies the odds of graduating vs. remaining enrolled by 1.1005 (an increase of 10.05%, p = 0.000). Second semester grade is a stronger predictor of graduation than first semester grade.
    • Dropout vs. Enrolled: A one-unit increase in 2nd semester grade multiplies the odds of dropping out vs. remaining enrolled by 0.8297 (a decrease of 17.03%, p = 0.000). Each additional point in second semester grade reduces the relative dropout odds by nearly 17%, a very strong protective effect.
  • log_age_enrollment:
    • Graduate vs. Enrolled: A one-unit increase in log age at enrollment multiplies the odds of graduating vs. remaining enrolled by 0.9623 (a decrease of 3.77%, p = 0.838), not significant.
    • Dropout vs. Enrolled: A one-unit increase in log age at enrollment multiplies the odds of dropping out vs. remaining enrolled by 4.2722 (an increase of 327.22%, p = 0.000). This is the largest RRR in the model: older enrollment age multiplies the dropout odds more than fourfold relative to remaining enrolled.
  • log_unemployment_rate:
    • Graduate vs. Enrolled: A one-unit increase in log unemployment rate multiplies the graduation odds vs. remaining enrolled by 1.5396 (an increase of 53.96%, p = 0.025). Difficult labor market conditions may motivate students to complete their studies sooner.
    • Dropout vs. Enrolled: A one-unit increase in log unemployment rate multiplies the dropout odds vs. remaining enrolled by 1.6109 (an increase of 61.09%, p = 0.031). At the same time, high unemployment also amplifies financial pressure that worsens dropout.
  • gdp:
    • Graduate vs. Enrolled: A one-unit increase in GDP multiplies the graduation odds vs. remaining enrolled by 1.0239 (an increase of 2.39%, p = 0.246), not significant.
    • Dropout vs. Enrolled: A one-unit increase in GDP multiplies the dropout odds vs. remaining enrolled by 0.9966 (a decrease of 0.34%, p = 0.887), not significant. GDP does not have a meaningful effect on student academic status.

Interpretation for Binary Predictors:

  • scholarship_holder = 1:
    • Graduate vs. Enrolled: Scholarship holders multiply the odds of graduating vs. remaining enrolled by 2.7328 (an increase of 173.28%, p = 0.000) compared to non-scholarship holders. Receiving a scholarship more than doubles the relative odds of graduation.
    • Dropout vs. Enrolled: Scholarship holders multiply the odds of dropping out vs. remaining enrolled by 0.7361 (a decrease of 26.39%, p = 0.040). Scholarships significantly reduce dropout risk relative to remaining enrolled.
  • tuition_up_to_date = 1:
    • Graduate vs. Enrolled: Students with up-to-date tuition multiply the odds of graduating vs. remaining enrolled by 2.9606 (an increase of 196.06%, p = 0.000) compared to those with overdue tuition. Timely payment nearly triples the relative odds of graduation.
    • Dropout vs. Enrolled: Students with up-to-date tuition multiply the odds of dropping out vs. remaining enrolled by 0.1499 (a decrease of 85.01%, p = 0.000). Students with current tuition payments have only about 15% of the dropout odds relative to those with overdue payments, which is the largest protective effect in the model for Dropout.
  • debtor = 1:
    • Graduate vs. Enrolled: Debtors multiply the odds of graduating vs. remaining enrolled by 0.4352 (a decrease of 56.48%, p = 0.000) compared to non-debtors. Debtor status cuts the relative graduation odds by more than half.
    • Dropout vs. Enrolled: Debtors multiply the odds of dropping out vs. remaining enrolled by 1.0588 (an increase of 5.88%, p = 0.721), not significant. Debtor status more strongly differentiates the graduation pathway than the dropout pathway.
  • gender = 1 (male vs. female):
    • Graduate vs. Enrolled: Male students multiply the odds of graduating vs. remaining enrolled by 0.6300 (a decrease of 37.00%, p = 0.000) compared to female students. Male students have substantially lower relative odds of graduation compared to female students.
    • Dropout vs. Enrolled: Male students multiply the odds of dropping out vs. remaining enrolled by 1.1700 (an increase of 17.00%, p = 0.132), not statistically significant.

D.7.3 Marginal Effects (Average)

Marginal effects express the change in the predicted probability of each outcome category for a one-unit change in a predictor (numeric) or for a category shift (binary/categorical). Unlike log-odds or RRR, marginal effects are on the probability scale and are directly interpretable.
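
To make the definition concrete: for a numeric predictor, the average marginal effect reported below is the mean change in each predicted probability when that predictor is increased by one unit for every observation (the "Comparison: +1" line in the output). A minimal hand computation, assuming data_final is the modelling data used to fit mlr_model:

# Hand-computed AME for admission_grade (should closely match avg_comparisons)
d1 <- data_final
d1$admission_grade <- d1$admission_grade + 1
p0 <- predict(mlr_model, newdata = data_final, type = "probs")
p1 <- predict(mlr_model, newdata = d1, type = "probs")
colMeans(p1 - p0)   # average change in P(Enrolled), P(Graduate), P(Dropout)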

# Average Marginal Effects for numeric predictors
cat("Average Marginal Effects for admission_grade:\n")
## Average Marginal Effects for admission_grade:
mfx_admission <- avg_comparisons(mlr_model, variables = "admission_grade", type = "probs")
print(mfx_admission)
## 
##     Group Estimate Std. Error     z Pr(>|z|)    S    2.5 %    97.5 %
##  Enrolled -0.00152   0.000412 -3.69   <0.001 12.1 -0.00233 -0.000714
##  Graduate  0.00318   0.000437  7.28   <0.001 41.4  0.00233  0.004039
##  Dropout  -0.00166   0.000360 -4.62   <0.001 18.0 -0.00237 -0.000957
## 
## Term: admission_grade
## Type: probs
## Comparison: +1
cat("Average Marginal Effects for cu1_grade:\n")
## Average Marginal Effects for cu1_grade:
mfx_cu1 <- avg_comparisons(mlr_model, variables = "cu1_grade", type = "probs")
print(mfx_cu1)
## 
##     Group  Estimate Std. Error      z Pr(>|z|)   S    2.5 %   97.5 %
##  Enrolled -0.006514    0.00245 -2.661  0.00778 7.0 -0.01131 -0.00172
##  Graduate  0.006945    0.00295  2.355  0.01854 5.8  0.00116  0.01273
##  Dropout  -0.000431    0.00197 -0.219  0.82660 0.3 -0.00429  0.00343
## 
## Term: cu1_grade
## Type: probs
## Comparison: +1
cat("Average Marginal Effects for cu2_grade:\n")
## Average Marginal Effects for cu2_grade:
mfx_cu2 <- avg_comparisons(mlr_model, variables = "cu2_grade", type = "probs")
print(mfx_cu2)
## 
##     Group  Estimate Std. Error        z Pr(>|z|)     S    2.5 %   97.5 %
##  Enrolled -0.000151    0.00242  -0.0624     0.95   0.1 -0.00489  0.00458
##  Graduate  0.027786    0.00276  10.0494   <0.001  76.5  0.02237  0.03320
##  Dropout  -0.027635    0.00159 -17.3627   <0.001 221.9 -0.03075 -0.02452
## 
## Term: cu2_grade
## Type: probs
## Comparison: +1
cat("Average Marginal Effects for log_age_enrollment:\n")
## Average Marginal Effects for log_age_enrollment:
mfx_age <- avg_comparisons(mlr_model, variables = "log_age_enrollment", type = "probs")
print(mfx_age)
## 
##     Group Estimate Std. Error     z Pr(>|z|)    S  2.5 %  97.5 %
##  Enrolled  -0.0693     0.0159 -4.35   <0.001 16.1 -0.101 -0.0380
##  Graduate  -0.1402     0.0238 -5.90   <0.001 28.0 -0.187 -0.0936
##  Dropout    0.2095     0.0248  8.45   <0.001 55.0  0.161  0.2581
## 
## Term: log_age_enrollment
## Type: probs
## Comparison: +1
cat("Average Marginal Effects for log_unemployment_rate:\n")
## Average Marginal Effects for log_unemployment_rate:
mfx_unemp <- avg_comparisons(mlr_model, variables = "log_unemployment_rate", type = "probs")
print(mfx_unemp)
## 
##     Group Estimate Std. Error     z Pr(>|z|)   S   2.5 %  97.5 %
##  Enrolled  -0.0546     0.0186 -2.94  0.00329 8.2 -0.0910 -0.0182
##  Graduate   0.0309     0.0255  1.21  0.22502 2.2 -0.0190  0.0809
##  Dropout    0.0237     0.0231  1.03  0.30447 1.7 -0.0215  0.0689
## 
## Term: log_unemployment_rate
## Type: probs
## Comparison: +1
cat("Average Marginal Effects for gdp:\n")
## Average Marginal Effects for gdp:
mfx_gdp <- avg_comparisons(mlr_model, variables = "gdp", type = "probs")
print(mfx_gdp)
## 
##     Group Estimate Std. Error      z Pr(>|z|)   S    2.5 %  97.5 %
##  Enrolled -0.00199    0.00261 -0.761    0.447 1.2 -0.00711 0.00313
##  Graduate  0.00403    0.00282  1.431    0.152 2.7 -0.00149 0.00955
##  Dropout  -0.00204    0.00240 -0.850    0.395 1.3 -0.00676 0.00267
## 
## Term: gdp
## Type: probs
## Comparison: +1
# Average Marginal Effects for binary/categorical predictors
cat("Average Marginal Effects for gender:\n")
## Average Marginal Effects for gender:
mfx_gender <- avg_comparisons(mlr_model, variables = "gender", type = "probs")
print(mfx_gender)
## 
##     Group Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
##  Enrolled   0.0356     0.0126  2.83  0.00472  7.7  0.0109  0.0603
##  Graduate  -0.0882     0.0136 -6.50  < 0.001 33.5 -0.1148 -0.0616
##  Dropout    0.0526     0.0113  4.64  < 0.001 18.1  0.0304  0.0748
## 
## Term: gender
## Type: probs
## Comparison: 1 - 0
cat("Average Marginal Effects for scholarship_holder:\n")
## Average Marginal Effects for scholarship_holder:
mfx_scholar <- avg_comparisons(mlr_model, variables = "scholarship_holder", type = "probs")
print(mfx_scholar)
## 
##     Group Estimate Std. Error     z Pr(>|z|)     S  2.5 %  97.5 %
##  Enrolled  -0.0794     0.0127 -6.27   <0.001  31.4 -0.104 -0.0546
##  Graduate   0.1829     0.0142 12.90   <0.001 124.1  0.155  0.2107
##  Dropout   -0.1035     0.0126 -8.23   <0.001  52.3 -0.128 -0.0788
## 
## Term: scholarship_holder
## Type: probs
## Comparison: 1 - 0
cat("Average Marginal Effects for debtor:\n")
## Average Marginal Effects for debtor:
mfx_debtor <- avg_comparisons(mlr_model, variables = "debtor", type = "probs")
print(mfx_debtor)
## 
##     Group Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
##  Enrolled   0.0775     0.0241  3.21   0.0013  9.6  0.0303  0.1247
##  Graduate  -0.1441     0.0248 -5.82   <0.001 27.3 -0.1927 -0.0956
##  Dropout    0.0666     0.0202  3.29   <0.001 10.0  0.0270  0.1063
## 
## Term: debtor
## Type: probs
## Comparison: 1 - 0
cat("Average Marginal Effects for tuition_up_to_date:\n")
## Average Marginal Effects for tuition_up_to_date:
mfx_tuition <- avg_comparisons(mlr_model, variables = "tuition_up_to_date", type = "probs")
print(mfx_tuition)
## 
##     Group Estimate Std. Error      z Pr(>|z|)     S   2.5 % 97.5 %
##  Enrolled   0.0649     0.0208   3.12  0.00181   9.1  0.0241  0.106
##  Graduate   0.3480     0.0261  13.31  < 0.001 132.0  0.2968  0.399
##  Dropout   -0.4129     0.0276 -14.96  < 0.001 165.7 -0.4670 -0.359
## 
## Term: tuition_up_to_date
## Type: probs
## Comparison: 1 - 0

Interpretation for Numeric Predictors:

  • admission_grade (AME per +1 unit):
    • Enrolled: Decreases the probability of remaining Enrolled by 0.0015 (p < 0.001).
    • Graduate: Increases the probability of graduating by 0.0032 (p < 0.001). Each additional point in admission grade raises the probability of graduation by approximately 0.32 percentage points.
    • Dropout: Decreases the probability of dropping out by 0.0017 (p < 0.001). Higher admission grades significantly reduce dropout risk, though the effect per unit is small.
  • cu1_grade (AME per +1 unit):
    • Enrolled: Decreases the probability of remaining Enrolled by 0.0065 (p = 0.008).
    • Graduate: Increases the probability of graduating by 0.0069 (p = 0.019). First semester performance has a significant impact on graduation probability.
    • Dropout: Decreases the probability of dropping out by 0.0004 (p = 0.827), not significant. First semester grade is more relevant for distinguishing the graduation pathway than the dropout pathway.
  • cu2_grade (AME per +1 unit):
    • Enrolled: Has virtually no effect on the probability of remaining Enrolled (−0.0002, p = 0.950), not significant.
    • Graduate: Increases the probability of graduating by 0.0278 (p < 0.001). This is the largest marginal effect among the numeric predictors: each additional point in second semester grade raises graduation probability by approximately 2.78 percentage points.
    • Dropout: Decreases the probability of dropping out by 0.0276 (p < 0.001). Second semester grade has the strongest protective effect against dropout among all numeric predictors.
  • log_age_enrollment (AME per +1 unit):
    • Enrolled: Decreases the probability of remaining Enrolled by 0.0693 (p < 0.001).
    • Graduate: Decreases the probability of graduating by 0.1402 (p < 0.001). Older age at enrollment reduces the graduation probability by approximately 14 percentage points, which is a substantial effect.
    • Dropout: Increases the probability of dropping out by 0.2095 (p < 0.001). This is the largest marginal effect for Dropout in absolute terms: older age at enrollment increases dropout probability by approximately 21 percentage points.
  • log_unemployment_rate (AME per +1 unit):
    • Enrolled: Decreases the probability of remaining Enrolled by 0.0546 (p = 0.003).
    • Graduate: Increases the probability of graduating by 0.0309 (p = 0.225), not significant.
    • Dropout: Increases the probability of dropping out by 0.0237 (p = 0.304), not significant. The unemployment rate significantly reduces the probability of remaining Enrolled, but its individual effects on the Graduate and Dropout probabilities are not statistically significant.
  • gdp (AME per +1 unit):
    • Enrolled: Decreases the probability of remaining Enrolled by 0.0020 (p = 0.447), not significant.
    • Graduate: Increases the probability of graduating by 0.0040 (p = 0.152), not significant.
    • Dropout: Decreases the probability of dropping out by 0.0020 (p = 0.395), not significant. GDP does not have a significant marginal effect on any outcome probability.

Interpretation for Binary Predictors:

  • scholarship_holder = 1 (vs. no scholarship):
    • Enrolled: Decreases the probability of remaining Enrolled by 0.0794 (p < 0.001).
    • Graduate: Increases the probability of graduating by 0.1829 (p < 0.001). Scholarship recipients have a graduation probability that is 18.29 percentage points higher than non-recipients, the second largest effect among binary predictors for Graduate.
    • Dropout: Decreases the probability of dropping out by 0.1035 (p < 0.001). Scholarship support reduces dropout risk by approximately 10 percentage points, confirming its protective function.
  • tuition_up_to_date = 1 (vs. overdue tuition):
    • Enrolled: Increases the probability of remaining Enrolled by 0.0649 (p = 0.002).
    • Graduate: Increases the probability of graduating by 0.3480 (p < 0.001). Students with current tuition payments have a graduation probability that is 34.80 percentage points higher, making this the largest marginal effect for Graduate in the entire model.
    • Dropout: Decreases the probability of dropping out by 0.4129 (p < 0.001). At the same time, current tuition payment also reduces dropout probability by 41.29 percentage points, which is the largest marginal effect in the entire model for Dropout. Timely tuition payment is the single most important predictor in this model.
  • debtor = 1 (vs. non-debtor):
    • Enrolled: Increases the probability of remaining Enrolled by 0.0775 (p = 0.001).
    • Graduate: Decreases the probability of graduating by 0.1441 (p < 0.001). Debtors have a graduation probability that is 14.41 percentage points lower than non-debtors, reflecting how financial burden constrains academic completion.
    • Dropout: Increases the probability of dropping out by 0.0666 (p < 0.001). Debtor status increases dropout risk by approximately 6.66 percentage points.
  • gender = 1 (male vs. female):
    • Enrolled: Increases the probability of remaining Enrolled by 0.0356 (p = 0.005).
    • Graduate: Decreases the probability of graduating by 0.0882 (p < 0.001). Male students have a graduation probability that is 8.82 percentage points lower than female students, after controlling for all other predictors.
    • Dropout: Increases the probability of dropping out by 0.0526 (p < 0.001). Male students have a dropout probability that is 5.26 percentage points higher than female students.
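
A useful sanity check on the marginal-effects output above: because the three predicted probabilities sum to 1 for every student, the AMEs of any single predictor across Enrolled, Graduate, and Dropout must sum to approximately zero (e.g. for tuition: 0.0649 + 0.3480 - 0.4129 ≈ 0). A one-line check on the objects computed above:

# AMEs across the three outcome categories sum to ~0 (up to rounding)
sum(mfx_tuition$estimate)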

D.8 Multinomial Logistic Regression Conclusion

  • The Multinomial Logistic Regression model fitted to 4,424 students is statistically significant overall (G² = 2546.49, df = 20, p ≈ 0; McFadden R² = 0.2822).

  • Tuition payment status is the most influential predictor, where students with up-to-date payments are 34.80 percentage points more likely to graduate and 41.29 percentage points less likely to drop out.

  • Log age at enrollment strongly increases dropout risk (+20.95 pp; odds ×4.27), while academic performance, especially second-semester grades, improves outcomes (+2.78 pp graduation; -2.76 pp dropout).

  • Scholarships play a protective role (+18.29 pp graduation; -10.35 pp dropout), whereas debtor status reduces graduation probability (-14.41 pp), and male students face lower graduation (-8.82 pp) and higher dropout (+5.26 pp) probabilities than females.

  • Macroeconomic variables have limited marginal effects, leading to the overall conclusion that financial stability is the primary driver of student outcomes, followed by academic performance, age at enrollment, and gender.

E. Evaluation and Visual Comparison

E.1 Confusion Matrix - Discriminant Analysis

# Apparent (Training) Confusion Matrix 
pred_apparent <- predict(lda_model)$class
conf_da <- table(Actual = data_da$target, Predicted = pred_apparent)

# APER (Apparent Error Rate)
aper_da <- 1 - sum(diag(conf_da)) / sum(conf_da)
acc_da  <- sum(diag(conf_da)) / sum(conf_da)

cat(" Discriminant Analysis\n")
##  Discriminant Analysis
cat("Confusion Matrix (Apparent):\n")
## Confusion Matrix (Apparent):
print(conf_da)
##           Predicted
## Actual     Enrolled Graduate Dropout
##   Enrolled        0      700      94
##   Graduate        0     2086     123
##   Dropout         1      629     791
cat(sprintf("\nAPER (Apparent Error Rate) : %.4f (%.2f%%)\n",
            aper_da, aper_da * 100))
## 
## APER (Apparent Error Rate) : 0.3497 (34.97%)
cat(sprintf("Overall Accuracy           : %.4f (%.2f%%)\n",
            acc_da, acc_da * 100))
## Overall Accuracy           : 0.6503 (65.03%)

Interpretation: The confusion matrix for Discriminant Analysis shows that, out of 4,424 students, the model correctly classified 2,086 of 2,209 Graduate students, 791 of 1,421 Dropout students, and none of the 794 Enrolled students. The overall accuracy is 65.03%, yielding an APER of 34.97%: the DA model misclassifies approximately 35% of all training observations. The Enrolled class is entirely misclassified; all 794 Enrolled students were predicted as either Graduate (700 students) or Dropout (94 students). This reflects the difficulty of separating the Enrolled group, which occupies an intermediate position between Graduate and Dropout in the discriminant space.
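
The APER above is computed on the same observations used to fit the model, so it tends to understate the true error rate. MASS::lda supports leave-one-out cross-validation via its CV argument, which gives a less optimistic estimate; a minimal sketch, assuming lda_model was fitted with the formula interface on data_da so that update() can refit it:

# Leave-one-out cross-validated error rate for the DA model
lda_cv  <- update(lda_model, CV = TRUE)   # refit with LOOCV predictions
conf_cv <- table(Actual = data_da$target, Predicted = lda_cv$class)
1 - sum(diag(conf_cv)) / sum(conf_cv)     # expected (LOOCV) error rate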

E.2 Confusion Matrix - Multinomial Logistic Regression

pred_mlr <- predict(mlr_model, type = "class")
conf_mlr <- table(Actual = data_final$target, Predicted = pred_mlr)

acc_mlr  <- sum(diag(conf_mlr)) / sum(conf_mlr)
aper_mlr <- 1 - acc_mlr

cat("Multinomial Logistic Regression\n")
## Multinomial Logistic Regression
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(conf_mlr)
##           Predicted
## Actual     Enrolled Graduate Dropout
##   Enrolled       29      621     144
##   Graduate       13     2049     147
##   Dropout        24      405     992
cat(sprintf("\nAPER (Apparent Error Rate) : %.4f (%.2f%%)\n",
            aper_mlr, aper_mlr * 100))
## 
## APER (Apparent Error Rate) : 0.3061 (30.61%)
cat(sprintf("Overall Accuracy           : %.4f (%.2f%%)\n",
            acc_mlr, acc_mlr * 100))
## Overall Accuracy           : 0.6939 (69.39%)

Interpretation: The confusion matrix for Multinomial Logistic Regression shows that the model correctly classified 2,049 out of 2,209 Graduate students, 992 out of 1,421 Dropout students, and 29 out of 794 Enrolled students. The overall accuracy is 69.39%, with an APER of 30.61%, meaning approximately 30.61% of all observations are misclassified. Compared to DA, MLR shows a notable improvement in classifying Dropout students (992 correct vs. 791 in DA) and begins to capture a small number of Enrolled students (29 correct vs. 0 in DA), reflecting the advantage of including binary predictors such as tuition payment status and scholarship holder in the MLR model.

E.3 Confusion Matrix Heatmap - Discriminant Analysis

conf_da_df <- as.data.frame(conf_da)
conf_da_df$Pct <- conf_da_df$Freq / sum(conf_da_df$Freq) * 100

ggplot(conf_da_df, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Freq, Pct)),
            size = 4, fontface = "bold") +
  scale_fill_gradient(low = "#deebf7", high = "#2171b5") +
  labs(
    title    = "Confusion Matrix Heatmap - Discriminant Analysis",
    subtitle = sprintf("Overall Accuracy: %.2f%% | APER: %.2f%%",
                       acc_da * 100, aper_da * 100),
    x = "Predicted Class",
    y = "Actual Class",
    fill = "Count"
  ) +
  theme_bw() +
  theme(
    plot.title    = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text     = element_text(size = 11)
  )

Interpretation: In the DA heatmap, the darkest cells are the diagonal entries Graduate-Graduate (2,086 students, 47.2%) and Dropout-Dropout (791 students, 17.9%), indicating strong classification for these two groups. The Enrolled diagonal cell is empty (0 correct classifications): all Enrolled students were misclassified as Graduate (700 students, 15.8%) or Dropout (94 students, 2.1%). The off-diagonal Dropout-Graduate cell (629 students, 14.2%) is also notably dark, showing that DA misclassifies a substantial share of actual Dropout students as Graduates.

E.4 Confusion Matrix Heatmap - Multinomial Logistic Regression

conf_mlr_df <- as.data.frame(conf_mlr)
conf_mlr_df$Pct <- conf_mlr_df$Freq / sum(conf_mlr_df$Freq) * 100

ggplot(conf_mlr_df, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Freq, Pct)),
            size = 4, fontface = "bold") +
  scale_fill_gradient(low = "#e5f5e0", high = "#238b45") +
  labs(
    title    = "Confusion Matrix Heatmap - Multinomial Logistic Regression",
    subtitle = sprintf("Overall Accuracy: %.2f%% | APER: %.2f%%",
                       acc_mlr * 100, aper_mlr * 100),
    x = "Predicted Class",
    y = "Actual Class",
    fill = "Count"
  ) +
  theme_bw() +
  theme(
    plot.title    = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text     = element_text(size = 11)
  )

Interpretation: Compared to the DA heatmap, the MLR heatmap shows stronger diagonal values for the Dropout class (992 students, 22.4% vs. 791 in DA) and a small but nonzero count of correct classifications for the Enrolled class (29 students, 0.7% vs. 0 in DA). The Graduate diagonal remains strong at 2,049 students (46.3%). The off-diagonal Dropout-Graduate cell is visibly reduced (405 students, 9.2% vs. 629 in DA), indicating that MLR is better at distinguishing Dropout students from Graduates, likely because financial variables such as tuition payment status are strong Dropout predictors that DA cannot incorporate.

E.5 Per-Class Metrics - MLR

classes_order <- levels(data_final$target)

precision_mlr <- diag(conf_mlr) / colSums(conf_mlr)
recall_mlr    <- diag(conf_mlr) / rowSums(conf_mlr)
f1_mlr        <- 2 * precision_mlr * recall_mlr /
                 (precision_mlr + recall_mlr)
support_mlr   <- rowSums(conf_mlr)

eval_mlr <- data.frame(
  Class     = classes_order,
  Precision = round(precision_mlr, 4),
  Recall    = round(recall_mlr, 4),
  F1        = round(ifelse(is.nan(f1_mlr), 0, f1_mlr), 4),
  Support   = support_mlr,
  row.names = NULL  # drop named-vector row names so kable does not duplicate the class labels
)
kable(eval_mlr,
      caption = "Per-Class Classification Metrics - MLR (Apparent)")
Per-Class Classification Metrics - MLR (Apparent)
Class     Precision  Recall  F1      Support
Enrolled  0.4394     0.0365  0.0674  794
Graduate  0.6663     0.9276  0.7755  2209
Dropout   0.7732     0.6981  0.7337  1421

Interpretation: For MLR, the Graduate class achieves the highest F1-score (0.7755), driven by a recall of 92.76% and a precision of 66.63%, reflecting the model’s strong ability to identify graduating students. The Dropout class shows a recall of 69.81% and a precision of 77.32%, indicating that when the model predicts Dropout it is fairly reliable, and it captures nearly 70% of actual Dropout cases. The Enrolled class remains the most difficult to classify, with a recall of only 3.65% (29 of 794 correctly identified) and a precision of 43.94%, as the Enrolled group still overlaps heavily with the other two classes even with the inclusion of binary predictors.
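
Because the three classes are imbalanced, a macro-averaged F1, which weights each class equally regardless of support, is a useful one-number complement to the per-class table; a minimal sketch reusing the quantities computed above:

# Macro-averaged F1: unweighted mean of the per-class F1 scores
mean(ifelse(is.nan(f1_mlr), 0, f1_mlr))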

E.6 Comparison Table: DA vs MLR

# Per-class metrics from DA
precision_da <- diag(conf_da) / colSums(conf_da)
recall_da    <- diag(conf_da) / rowSums(conf_da)
f1_da        <- 2 * precision_da * recall_da /
                (precision_da + recall_da)

comparison_tbl <- data.frame(
  Metric = c("Overall Accuracy", "APER",
             "Precision for Enrolled", 
             "Precision for Graduate",
             "Precision for Dropout",
             "Recall for Enrolled",
             "Recall for Graduate",
             "Recall for Dropout",
             "F1 for Enrolled",        
             "F1 for Graduate",
             "F1 for Dropout"),
  DA  = round(c(
    acc_da, aper_da,
    precision_da["Enrolled"], precision_da["Graduate"],
    precision_da["Dropout"],
    recall_da["Enrolled"],    recall_da["Graduate"],
    recall_da["Dropout"],
    ifelse(is.nan(f1_da["Enrolled"]), 0, f1_da["Enrolled"]),
    f1_da["Graduate"],
    f1_da["Dropout"]
  ), 4),
  MLR = round(c(
    acc_mlr, aper_mlr,
    precision_mlr["Enrolled"], precision_mlr["Graduate"],
    precision_mlr["Dropout"],
    recall_mlr["Enrolled"],    recall_mlr["Graduate"],
    recall_mlr["Dropout"],
    ifelse(is.nan(f1_mlr["Enrolled"]), 0, f1_mlr["Enrolled"]),
    f1_mlr["Graduate"],
    f1_mlr["Dropout"]
  ), 4)
)

kable(comparison_tbl,
      col.names = c("Metric",
                    "Discriminant Analysis",
                    "Multinomial Logistic Regression"),
      caption   = "Performance Comparison: DA vs MLR")
Performance Comparison: DA vs MLR
Metric                   Discriminant Analysis   Multinomial Logistic Regression
Overall Accuracy         0.6503                  0.6939
APER                     0.3497                  0.3061
Precision for Enrolled   0.0000                  0.4394
Precision for Graduate   0.6108                  0.6663
Precision for Dropout    0.7847                  0.7732
Recall for Enrolled      0.0000                  0.0365
Recall for Graduate      0.9443                  0.9276
Recall for Dropout       0.5567                  0.6981
F1 for Enrolled          0.0000                  0.0674
F1 for Graduate          0.7418                  0.7755
F1 for Dropout           0.6513                  0.7337

Interpretation: The comparison table reveals that MLR outperforms DA on nearly all metrics. MLR achieves a higher overall accuracy (69.39% vs. 65.03%) and a lower APER (30.61% vs. 34.97%). The most significant improvement is in the Dropout class, where MLR’s recall rises substantially from 55.67% (DA) to 69.81% (MLR) and the F1-score improves from 0.6513 to 0.7337. For the Enrolled class, MLR at least classifies 29 students correctly (recall = 3.65%) compared to DA’s complete failure (recall = 0.00%). DA retains a slight edge only in Graduate recall (94.43% vs. 92.76%) and Dropout precision (78.47% vs. 77.32%), and both differences are small. Overall, MLR demonstrates superior classification performance, particularly for the Dropout class, which is the most practically important outcome to detect.

E.7 Bar Chart: Overall Performance Comparison

bar_data <- data.frame(
  Metric = rep(c("Accuracy", "APER"), 2),
  Method = c(rep("Discriminant Analysis", 2),
             rep("Multinomial Logistic Regression", 2)),
  Value  = c(acc_da, aper_da, acc_mlr, aper_mlr)
)

ggplot(bar_data, aes(x = Metric, y = Value, fill = Method)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  geom_text(aes(label = sprintf("%.2f%%", Value * 100)),
            position = position_dodge(width = 0.6),
            vjust = -0.4, size = 3.8, fontface = "bold") +
  scale_fill_manual(values = c("#2171b5", "#238b45")) +
  scale_y_continuous(labels = scales::percent_format(),
                     limits = c(0, 1)) +
  labs(
    title = "Overall Performance Comparison: DA vs MLR",
    x     = "Metric",
    y     = "Value",
    fill  = "Method"
  ) +
  theme_bw() +
  theme(
    plot.title      = element_text(face = "bold", hjust = 0.5),
    legend.position = "bottom"
  )

Interpretation: The bar chart visually confirms that MLR outperforms DA on both overall metrics. MLR achieves an accuracy of 69.39% compared to DA’s 65.03%, a difference of 4.36 percentage points. Correspondingly, MLR’s APER (30.61%) is lower than DA’s APER (34.97%), meaning MLR produces 4.36 percentage points fewer misclassifications on the training data. This improvement is primarily attributed to MLR’s ability to incorporate binary predictors, particularly tuition payment status and scholarship status, which are among the strongest predictors of student outcomes but cannot be used in DA due to its multivariate normality assumption.

E.8 Bar Chart: Per-Class Recall Comparison

recall_data <- data.frame(
  Class  = rep(classes_order, 2),
  Method = c(rep("Discriminant Analysis", 3),
             rep("Multinomial Logistic Regression", 3)),
  Recall = c(recall_da[classes_order],
             recall_mlr[classes_order])
)

ggplot(recall_data, aes(x = Class, y = Recall, fill = Method)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  geom_text(aes(label = sprintf("%.2f%%", Recall * 100)),
            position = position_dodge(width = 0.6),
            vjust = -0.4, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("#2171b5", "#238b45")) +
  scale_y_continuous(labels = scales::percent_format(),
                     limits = c(0, 1.1)) +
  labs(
    title = "Per-Class Recall Comparison: DA vs MLR",
    x     = "Target Class",
    y     = "Recall",
    fill  = "Method"
  ) +
  theme_bw() +
  theme(
    plot.title      = element_text(face = "bold", hjust = 0.5),
    legend.position = "bottom"
  )

Interpretation: The per-class recall chart reveals a nuanced picture of classification performance across the three student outcome categories. For the Dropout class, MLR substantially outperforms DA, with a recall of 69.81% vs. 55.67%, an improvement of 14.14 percentage points. This is the most meaningful gain, since correctly identifying at-risk students is the primary goal of this research. For the Enrolled class, MLR marginally improves over DA (3.65% vs. 0.00%), though both methods still struggle with this class because of its small size (794 students, 17.95% of the dataset) and its intermediate feature overlap with the other two groups. For the Graduate class, DA achieves a slightly higher recall (94.43% vs. 92.76%), indicating that the linear discriminant functions are particularly effective at separating high-performing Graduate students based on the numeric academic predictors.

E.9 Evaluation Conclusion

cat("EVALUATION CONCLUSION: DA vs MLR\n")
## EVALUATION CONCLUSION: DA vs MLR
cat(sprintf("%-35s %-10s %-10s\n", "Metric", "DA", "MLR"))
## Metric                              DA         MLR
cat(sprintf("%-35s %-10s %-10s\n",
            "Overall Accuracy",
            sprintf("%.2f%%", acc_da * 100),
            sprintf("%.2f%%", acc_mlr * 100)))
## Overall Accuracy                    65.03%     69.39%
cat(sprintf("%-35s %-10s %-10s\n",
            "APER",
            sprintf("%.2f%%", aper_da * 100),
            sprintf("%.2f%%", aper_mlr * 100)))
## APER                                34.97%     30.61%
cat(sprintf("%-35s %-10s %-10s\n",
            "Recall - Enrolled",
            sprintf("%.2f%%", recall_da["Enrolled"] * 100),
            sprintf("%.2f%%", recall_mlr["Enrolled"] * 100)))
## Recall - Enrolled                   0.00%      3.65%
cat(sprintf("%-35s %-10s %-10s\n",
            "Recall - Graduate",
            sprintf("%.2f%%", recall_da["Graduate"] * 100),
            sprintf("%.2f%%", recall_mlr["Graduate"] * 100)))
## Recall - Graduate                   94.43%     92.76%
cat(sprintf("%-35s %-10s %-10s\n",
            "Recall - Dropout",
            sprintf("%.2f%%", recall_da["Dropout"] * 100),
            sprintf("%.2f%%", recall_mlr["Dropout"] * 100)))
## Recall - Dropout                    55.67%     69.81%
better <- ifelse(acc_mlr > acc_da,
                 "Multinomial Logistic Regression",
                 "Discriminant Analysis")
cat(sprintf("\nBetter performing method: %s\n", better))
## 
## Better performing method: Multinomial Logistic Regression

Interpretation: Based on the comprehensive evaluation, Multinomial Logistic Regression is the better-performing method for classifying student academic outcomes on this dataset. MLR achieves a higher overall accuracy (69.39% vs. 65.03%) and a lower APER (30.61% vs. 34.97%). The key advantage of MLR lies in its ability to incorporate both numeric and binary predictors simultaneously, particularly tuition payment status, the single most influential predictor identified in Section D, which DA cannot use due to its multivariate normality assumption. While DA achieves a marginally higher Graduate recall (94.43% vs. 92.76%), MLR’s substantially better Dropout recall (69.81% vs. 55.67%) is more practically valuable, as identifying students at risk of dropping out is the primary purpose of this classification task. Both methods, however, continue to struggle with the Enrolled class due to its small sample size and intermediate position in the feature space.

F. Conclusion

  • Discriminant Analysis of 4,424 students showed that the three academic status groups (Enrolled, Graduate, Dropout) can be statistically distinguished using six numerical predictors (Wilks’ Lambda = 0.6213, p < 0.0001). Academic performance, particularly first- and second-semester grades, is the strongest discriminating factor: Graduate students have significantly higher grades than Dropouts. LD1 explains 98.63% of the between-group variance. The model achieved an accuracy of 65.03%, with the highest recall in the Graduate class (≈94%), while the Enrolled class was almost entirely misclassified (recall ≈ 0%) due to its small sample size and overlapping position in the discriminant space.

  • Multinomial Logistic Regression on 4,424 students proved to be statistically significant overall (G² = 2,546.49, df = 20, p ≈ 0; McFadden R² = 0.2822). The most influential predictor is tuition payment status (tuition up to date), which increases the probability of graduation by 34.80 percentage points and reduces the risk of dropping out by 41.29 percentage points. Older age at college entry increases the risk of dropping out (+20.95 pp), while second-semester grades are the strongest academic predictor. Scholarships have a protective effect (+18.29 pp on graduation; -10.35 pp on dropout), whereas macroeconomic variables (GDP, unemployment) have weak marginal effects.

  • Evaluation Summary indicates that MLR outperforms DA on nearly all metrics: MLR’s overall accuracy (69.39%) is higher than DA’s (65.03%), with a lower APER (30.61% vs. 34.97%). The most significant improvement occurs in the Dropout class, where MLR’s recall reaches 69.81% compared to 55.67% for DA. MLR’s advantage primarily stems from its ability to include binary predictors such as tuition payment status and scholarships, which cannot be used by DA due to its assumption of multivariate normality. The only metric where DA performed slightly better was recall for the Graduate class (94.43% vs 92.76%), but the difference was small. In conclusion, MLR is the better method for classifying students’ academic status on this dataset.