# Load required libraries
library(dplyr)
library(ggplot2)

# Load the data
f1_data <- read.csv("/home/harpo/Downloads/latam-paper/master_all_models_f1.csv")

# Group by model and calculate the mean of F1_Mean and F1_Std
model_summary <- f1_data %>%
  group_by(Model) %>%
  summarise(
    Mean_F1 = mean(F1_Mean, na.rm = TRUE),
    Std_F1 = mean(F1_Std, na.rm = TRUE)
  )
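A quick tabular sanity check before plotting (no new assumptions; this only reuses the model_summary tibble built above):

# Inspect the models ranked by average F1
model_summary %>% arrange(desc(Mean_F1))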
# Bar plot with error bars
ggplot(model_summary, aes(x = reorder(Model, -Mean_F1), y = Mean_F1)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_errorbar(aes(ymin = Mean_F1 - Std_F1, ymax = Mean_F1 + Std_F1), width = 0.2) +
  labs(
    title = "Average F1-Score by Model",
    x = "Model",
    y = "Average F1-Score"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create the boxplot for visual comparison
f1_boxplot <- ggplot(f1_data, aes(x = reorder(Model, F1_Mean, FUN = median), y = F1_Mean)) +
  geom_boxplot(aes(fill = Model), show.legend = FALSE) +
  labs(
    title = "Distribution of F1 Scores by Model",
    x = "Model",
    y = "F1 Score"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

f1_boxplot
# Save the plot to the /tmp directory
#ggsave("/tmp/model_f1_boxplot.png", plot = f1_boxplot, width = 10, height = 6)
To determine whether the observed differences in F1-scores among the models are statistically significant, a one-way Analysis of Variance (ANOVA) is performed. However, this test requires that certain assumptions about the data be met.
The ANOVA test tells us whether there is any significant difference between the mean F1-scores of the models. The null hypothesis (H₀) is that all model means are equal.
# Perform the one-way ANOVA test
anova_result <- aov(F1_Mean ~ Model, data = f1_data)
summary(anova_result)
            Df Sum Sq Mean Sq F value Pr(>F)
Model        6  0.580 0.09660   2.024 0.0738 .
Residuals   70  3.341 0.04773
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
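As a quick arithmetic check on the table above, the F value is simply the ratio of the two printed mean squares:

# F = Mean Sq (Model) / Mean Sq (Residuals), using only the numbers shown above
0.09660 / 0.04773  # ~2.024, matching the reported F value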
The p-value (Pr(>F)) of 0.0738 is greater than 0.05, which, taken at face value, would suggest that there is no statistically significant difference between the model means. However, before accepting this conclusion we must check the test's assumptions.
The ANOVA test is only valid if three key assumptions are met: independence of observations, normality of residuals, and homogeneity of variances. Below, we check the last two.
# Load the 'car' library for Levene's Test
library(car)
# 1. Normality of Residuals Test (Shapiro-Wilk Test)
# A p-value > 0.05 suggests that the residuals are normally distributed.
anova_residuals <- residuals(object = anova_result)
shapiro.test(x = anova_residuals)
Shapiro-Wilk normality test
data: anova_residuals
W = 0.89091, p-value = 7.345e-06
# 2. Q-Q Plot for visual inspection of normality
# The points should follow the diagonal line approximately.
plot(anova_result, 2)
# 3. Homogeneity of Variances Test (Levene's Test)
# A p-value > 0.05 suggests that the variances are homogeneous.
leveneTest(F1_Mean ~ Model, data = f1_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  6  0.3395 0.9137
      70
# 4. Residuals vs. Fitted Plot for visual inspection
# We look for randomly scattered points with no clear pattern.
plot(anova_result, 1)
Interpretation of the Assumptions:
- Normality of Residuals: The Shapiro-Wilk test yielded a very low p-value (7.345e-06, well below 0.05), indicating that the residuals do not follow a normal distribution. The Q-Q plot confirms this violation: the points deviate clearly from the diagonal line.
- Homogeneity of Variances: Levene's test yielded a p-value of 0.9137 (> 0.05), so the variances can be considered homogeneous. The violation of the normality assumption, however, is critical on its own.
Since the normality assumption is not met, the ANOVA results are not reliable and a non-parametric alternative must be used.
That alternative is the Kruskal-Wallis test. Rather than comparing group means, it compares the rank distributions of the groups (and thus, loosely, their medians).
# Perform the Kruskal-Wallis test
kruskal_result <- kruskal.test(F1_Mean ~ Model, data = f1_data)
print(kruskal_result)
Kruskal-Wallis rank sum test
data: F1_Mean by Model
Kruskal-Wallis chi-squared = 15.674, df = 6, p-value = 0.01562
Interpretation of the Kruskal-Wallis Test: the p-value of 0.01562 is below 0.05, indicating a statistically significant difference in the F1-score distribution of at least one model compared to the others.
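For a sense of magnitude alongside the p-value, a commonly used effect size for Kruskal-Wallis is epsilon-squared, ε² = H / (n − 1). A minimal sketch, assuming the data has the 77 observations implied by the ANOVA degrees of freedom (6 + 70 + 1):

# Epsilon-squared effect size for Kruskal-Wallis: H / (n - 1)
H <- unname(kruskal_result$statistic)  # 15.674
n <- nrow(f1_data)                     # expected to be 77
H / (n - 1)                            # ~0.21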
If the result of the Kruskal-Wallis test is significant, we perform a post-hoc test to identify which specific pairs of models are different.
Key Note on Interpretation: Dunn’s test, like Kruskal-Wallis, does not compare means. It compares the average ranks of each group. This is important: a model can have a higher mean due to a few very high scores (outliers), but a lower median and average rank if most of its scores are lower than another model’s. The boxplot is the best visual tool to understand this difference.
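A tiny synthetic illustration of this point (hypothetical scores, not taken from the dataset): model a has the higher mean because of two outliers, while model b has the higher median and the higher average rank.

# Hypothetical scores: a's mean is inflated by two outlier runs
a <- c(0.60, 0.61, 0.62, 0.63, 0.99, 0.99)
b <- c(0.70, 0.71, 0.72, 0.73, 0.74, 0.75)
mean(a); mean(b)      # 0.740 vs 0.725 -> a has the higher mean
median(a); median(b)  # 0.625 vs 0.725 -> b has the higher median
# Average ranks in the pooled sample, the quantity Dunn's test compares
r <- rank(c(a, b))
mean(r[1:6]); mean(r[7:12])  # 5.5 vs 7.5 -> b has the higher average rank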
# If the result is significant (p < 0.05), a post-hoc test is performed.
if (kruskal_result$p.value < 0.05) {
  # Install the package if it is not present
  if (!require(dunn.test)) {
    install.packages("dunn.test")
    library(dunn.test)
  }
  # Perform Dunn's test with Bonferroni adjustment to control the family-wise error rate.
  dunn_result <- dunn.test(f1_data$F1_Mean, f1_data$Model, method = "bonferroni")
  print(dunn_result)
}
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 15.6738, df = 6, p-value = 0.02
Comparison of x by group
(Bonferroni)
Col Mean-|
Row Mean | CNN DomBertU FANCI Gemma Labin Llama3B
---------+------------------------------------------------------------------
DomBertU | 0.114358
| 1.0000
|
FANCI | 1.524777 1.410419
| 1.0000 1.0000
|
Gemma | 0.028589 -0.085768 -1.496188
| 1.0000 1.0000 1.0000
|
Labin | -0.295425 -0.409784 -1.820203 -0.324015
| 1.0000 1.0000 0.7216 1.0000
|
Llama3B | 1.982211 1.867852 0.457433 1.953621 2.277637
| 0.4983 0.6487 1.0000 0.5328 0.2389
|
ModernBE | -1.419949 -1.534307 -2.944727 -1.448539 -1.124523 -3.402160
| 1.0000 1.0000 0.0339 1.0000 1.0000 0.0070*
alpha = 0.05
Reject Ho if p <= alpha/2
$chi2
[1] 15.67378
$Z
[1] 0.11435835 1.52477794 1.41041960 0.02858959 -0.08576876 -1.49618836 -0.29542573 -0.40978407 -1.82020367 -0.32401531 1.98221133 1.86785298
[13] 0.45743338 1.95362174 2.27763705 -1.41994946 -1.53430781 -2.94472740 -1.44853905 -1.12452373 -3.40216079
$P
[1] 0.4544768663 0.0636572464 0.0792079109 0.4885959587 0.4658251207 0.0673022888 0.3838343414 0.3409821755 0.0343639971 0.3729632204 0.0237277967
[12] 0.0308912831 0.3236797870 0.0253729838 0.0113741054 0.0778111975 0.0624769731 0.0016161966 0.0737331769 0.1303954538 0.0003342765
$P.adjusted
[1] 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000 0.721643940 1.000000000 0.498283731 0.648716944
[13] 1.000000000 0.532832659 0.238856214 1.000000000 1.000000000 0.033940128 1.000000000 1.000000000 0.007019806
$comparisons
[1] "CNN - DomBertUrl" "CNN - FANCI" "DomBertUrl - FANCI" "CNN - Gemma" "DomBertUrl - Gemma"
[6] "FANCI - Gemma" "CNN - Labin" "DomBertUrl - Labin" "FANCI - Labin" "Gemma - Labin"
[11] "CNN - Llama3B" "DomBertUrl - Llama3B" "FANCI - Llama3B" "Gemma - Llama3B" "Labin - Llama3B"
[16] "CNN - ModernBERT" "DomBertUrl - ModernBERT" "FANCI - ModernBERT" "Gemma - ModernBERT" "Labin - ModernBERT"
[21] "Llama3B - ModernBERT"
Final Interpretation: look for pairs of models whose Bonferroni-adjusted p-value (the $P.adjusted component, printed as the second line of each cell in the table) meets dunn.test's rejection rule of p <= alpha/2 = 0.025. Only Llama3B vs. ModernBERT (adjusted p = 0.0070, starred in the table) meets it; FANCI vs. ModernBERT (0.0339) falls just short. That is the pair with statistically different performance.
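As an optional cross-check under the same non-parametric framing, base R's pairwise.wilcox.test runs pairwise Wilcoxon rank-sum tests with the same Bonferroni correction. It is a different procedure from Dunn's test, so results should broadly agree but need not match exactly:

# Base-R alternative: pairwise Wilcoxon rank-sum tests, Bonferroni-adjusted
pairwise.wilcox.test(f1_data$F1_Mean, f1_data$Model, p.adjust.method = "bonferroni")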