Click here to uncover the dirty truth about the F1 score
Author
Mickey Campbell
Published
July 2, 2025
Introduction
For Sierra’s research on modeling fine-scale travel route selection in a wildland setting, we have been heavily focused on F1 score as the primary measure of model performance. F1 is widely regarded for its robustness as a measure of binary classification models, even in (especially in?) scenarios where there is a heavy imbalance between the two possible predicted outcomes in the data. Sierra’s data are, indeed, very imbalanced, with every decision point along one of the experimental trajectories featuring 15 directions that were not taken and 1 direction that was taken. So, right off the bat, the “positive” case only represents 6.25% of the data. No problem, though, right? F1 to the rescue!!
F1 scores of random models with 15:1 negative:positive ratios
Sierra’s models have been getting F1 scores that we’ve been interpreting as being pretty low (0.2-0.4). But, considered in isolation, it is a little difficult to understand what those numbers really mean. I thought one way to improve this understanding would be to compare it to a random model. That is, given the same distribution of positive versus negative cases, what would the F1 score be of a model that just picks positives and negatives out of a hat with no a priori knowledge of the distribution? Fortunately, R makes this easy to simulate!
Show code
# simulate observed data with ~15:1 ratio
obs <- c(0,1) |>
  sample(size = 1000, replace = T, prob = c(15/16, 1/16)) |>
  factor(levels = c("0", "1"))

# generate random predictions with no a priori probability distributions
pred <- c(0,1) |>
  sample(size = 1000, replace = T) |>
  factor(levels = c("0", "1"))

# get counts
table(obs)
obs
  0   1
943  57
Show code
table(pred)
pred
  0   1
477 523
Let’s calculate the F1 score of this “model”:
Show code
library(caret)

# compute the confusion matrix and precision/recall-based statistics
confusionMatrix(
  data = pred,
  reference = obs,
  positive = "1",
  mode = "prec_recall"
)
Warning: package 'caret' was built under R version 4.4.2
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 449  28
         1 494  29

               Accuracy : 0.478
                 95% CI : (0.4466, 0.5095)
    No Information Rate : 0.943
    P-Value [Acc > NIR] : 1

                  Kappa : -0.0031

 Mcnemar's Test P-Value : <2e-16

              Precision : 0.05545
                 Recall : 0.50877
                     F1 : 0.10000
             Prevalence : 0.05700
         Detection Rate : 0.02900
   Detection Prevalence : 0.52300
      Balanced Accuracy : 0.49246

       'Positive' Class : 1
As we can see, the F1 score is 0.1. But, that’s just one random dataset. Let’s run this 1000 times to get a distribution of F1 scores:
Show code
# create empty vector to store f1 values
f1s <- c()

# begin loop
for (i in 1:1000){
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(15/16, 1/16)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1 <- cm$byClass[[7]]
  f1s <- c(f1s, f1)
}

# plot histogram of f1s
hist(f1s, las = 1, xlab = "F1 Score", main = NA, mar = c(5,5,1,1))
abline(v = mean(f1s), lwd = 3, col = 2)
legend("topright", legend = paste0("mean = ", round(mean(f1s), 2)), text.col = 2,
       bty = "n", x.intersp = 0)
Fairly consistent numbers with a mean of 0.11. OK, so that’s somewhat reassuring… Sierra’s models yielding F1s in the range of 0.2 - 0.4 are at least clearly better than random. A low bar for success, admittedly, but that was exactly what I wanted to know.
The effect of imbalance ratio on F1 scores
However, this got me thinking… The consistency of these numbers, hovering right around 0.11 felt a little weird, and likely related to the fact that the observed data had a 15:1 negative:positive ratio. So, I thought I’d play with that ratio a bit, and see how it affected F1 scores of random models. In the code below, instead of fixing the 15:1 ratio, I varied the distributions of positive and negative cases, each time generating a random model, and extracting the F1 score:
Show code
# create empty vector to store f1 values
f1s <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1 <- cm$byClass[[7]]
  f1s <- c(f1s, f1)
  prob.1s <- c(prob.1s, prob.1)
}

# plot f1 as a function of positive probabilities
plot(f1s ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "F1 Score", pch = 16)
grid()
Uhhh… That’s interesting!! There is an exceedingly clear relationship (effect, one might say) between the negative:positive ratio and the F1 score. I’m no statistical theorist, but at the very least, this tells me that F1 should be interpreted cautiously. If we consider a random model to be the baseline of a functionally useless model (i.e., the one we should strive to far exceed), then it seems really important to know that a random model’s F1 of 0.1 on a dataset with a 15:1 negative:positive ratio could be functionally equivalent to an F1 of 0.65 on a dataset with a 1:15 ratio.
I don’t know… This doesn’t sit well with me. But maybe I’m missing something?
What F1 misses: true negatives
To understand what I am, in fact, missing, I think it’s worth taking a look at the formulation of F1. Here’s a reference confusion matrix:
Show code
# plot reference cm
par(mar = c(1,5,5,1), las = 1)
plot(x = c(0,1),
     y = c(0,1),
     type = "n",
     xlim = c(0,1),
     ylim = c(0,1),
     xlab = NA,
     ylab = NA,
     xaxt = "n",
     yaxt = "n",
     xaxs = "i",
     yaxs = "i")
rect(0, 0.5, 0.5, 1, col = "lightgray", border = NA)
rect(0.5, 0, 1, 0.5, col = "lightgray", border = NA)
abline(h = 0.5)
abline(v = 0.5)
text(0.25, 0.75, "True Negatives\n(TN)")
text(0.25, 0.25, "False Positives\n(FP)")
text(0.75, 0.75, "False Negatives\n(FN)")
text(0.75, 0.25, "True Positives\n(TP)")
axis(3, c(0.25, 0.75), labels = c("Negative", "Positive"))
axis(2, c(0.25, 0.75), labels = c("Positive", "Negative"))
box(lwd = 2)
mtext("Observations", line = 2.5, font = 2)
mtext("Predictions", side = 2, line = 2.5, font = 2, las = 0)
The F1 score is the harmonic mean of precision and recall:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
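As a quick sanity check, plugging the counts from the single random model above (TP = 29, FP = 494, FN = 28) into these formulas recovers caret’s F1 of 0.1:

# counts from the single random model's confusion matrix above
TP <- 29; FP <- 494; FN <- 28

precision <- TP / (TP + FP)  # 29/523 ~ 0.055
recall    <- TP / (TP + FN)  # 29/57  ~ 0.509
2 * precision * recall / (precision + recall)  # ~0.10, matching caret's output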
What immediately becomes clear when looking at these formulas is that true negatives are never considered. If we ignore true negatives, and the value they carry as a measure of model performance, then, assuming we’re dealing with a random classifier, as the proportion of positive observations in a dataset increases:
TP should increase
FN should also increase
FP should decrease
As a result:
Recall should stay fairly constant
Precision should increase
Let’s see if this is true:
Show code
# create empty vectors to store values
ps <- c()
rs <- c()
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  p <- cm$byClass[[5]]
  r <- cm$byClass[[6]]
  ps <- c(ps, p)
  rs <- c(rs, r)
  prob.1s <- c(prob.1s, prob.1)
}

# plot precision and recall as functions of positive probability
par(mfrow = c(1,2), mar = c(5,5,1,1), las = 1)
plot(ps ~ prob.1s, xlab = "Positive Probability", ylab = "Precision", pch = 16)
grid()
plot(rs ~ prob.1s, xlab = "Positive Probability", ylab = "Recall", pch = 16)
grid()
Given that F1 is essentially an average (a harmonic mean, to be precise) of precision and recall, F1 should therefore increase as well. Thus, by ignoring true negatives, F1 becomes highly sensitive to the negative:positive ratio of the observations.
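To put a rough number on that intuition: for a coin-flip classifier, precision lands near the positive prevalence p and recall near 0.5, so F1 should sit around 2(p)(0.5)/(p + 0.5) = p/(p + 0.5). This is just my own back-of-the-envelope approximation, but it lines up with the simulations:

# rough expected F1 of a 50/50 random classifier, assuming precision ~ p
# (the positive prevalence) and recall ~ 0.5
f1.random <- function(p) p / (p + 0.5)

f1.random(1/16)   # ~0.11, matching the 15:1 simulations above
f1.random(15/16)  # ~0.65, matching the 1:15 scenario mentioned earlier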
One potential solution: macro F1
By nature, F1 is a class-specific measure. That is, you have to define what your positive class is – presumably this is the thing you’re really trying best to capture. There is an argument to be made that, in the context of Sierra’s research, we do, in fact, care most about the taken directions. But, you could just as easily argue it is of equal value to know what directions people avoided (and why!). It’s kind of a glass half full/glass half empty question, really. Or, 1/16 full and 15/16 empty, as it were.
All of this is to say that all of the F1 scores we have discussed so far are for the positive/taken class. One could just as easily compute the F1 score for the not taken class. Let’s see how that pans out in our random models with varying positive probabilities (you can probably already guess how it will look…):
Show code
# create empty vector to store f1 values
f1s <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "0",
    mode = "prec_recall"
  )
  f1 <- cm$byClass[[7]]
  f1s <- c(f1s, f1)
  prob.1s <- c(prob.1s, prob.1)
}

# plot f1 as a function of positive probabilities
plot(f1s ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "F1 Score", pch = 16)
grid()
As one increases, the other decreases:
Show code
# create empty vectors to store f1 values
f1s.0 <- c()
f1s.1 <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm.0 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "0",
    mode = "prec_recall"
  )
  cm.1 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1.0 <- cm.0$byClass[[7]]
  f1.1 <- cm.1$byClass[[7]]
  f1s.0 <- c(f1s.0, f1.0)
  f1s.1 <- c(f1s.1, f1.1)
}

# plot f1 for positive outcomes versus f1 for negative outcomes
plot(f1s.1 ~ f1s.0, mar = c(5,5,1,1), las = 1, xlab = "F1 for Negatives",
     ylab = "F1 for Positives", pch = 16)
grid()
If we were to focus solely on the negative outcomes as the basis of model performance assessment, then we would be in the same boat as before: still not considering holistic model performance. That said, we would certainly get better-looking F1 numbers! :) But, as we’ve clearly seen, the absolute value of an F1 score is… complicated.
One way that people have proposed to get around the single-class focus of F1 is through the calculation of macro F1. It’s a fancy way of saying the average F1 across your classes:

$$\text{Macro } F1 = \frac{1}{N} \sum_{i=1}^{N} F1_i$$

which, with only two classes, is simply:

$$\text{Macro } F1 = \frac{F1_{\text{positive}} + F1_{\text{negative}}}{2}$$
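To put a number on it with results we’ve already seen, here’s a quick sketch that computes macro F1 for the single random model from earlier, reusing its confusion matrix counts (TP = 29, TN = 449, FP = 494, FN = 28, with "1" as the positive class):

# confusion matrix counts from the single random model above ("1" = positive)
TP <- 29; TN <- 449; FP <- 494; FN <- 28

# F1 for the positive ("taken") class, written in its algebraically
# equivalent 2TP / (2TP + FP + FN) form
f1.1 <- 2 * TP / (2 * TP + FP + FN)  # 0.10

# F1 for the negative ("not taken") class: FP and FN swap roles when
# "0" is treated as the positive class
f1.0 <- 2 * TN / (2 * TN + FN + FP)  # ~0.63

# macro F1 is just the unweighted average of the two
(f1.1 + f1.0) / 2  # ~0.37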
Let’s see how macro F1 holds up in our random modeling scenario where the distribution of positive and negative observations varies:
Show code
# create empty vector to store macro f1 values
macro.f1s <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm.0 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "0",
    mode = "prec_recall"
  )
  cm.1 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1.0 <- cm.0$byClass[[7]]
  f1.1 <- cm.1$byClass[[7]]
  macro.f1 <- (f1.0 + f1.1) / 2
  macro.f1s <- c(macro.f1s, macro.f1)
  prob.1s <- c(prob.1s, prob.1)
}

# plot macro f1 as a function of positive probability
plot(macro.f1s ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "Macro F1", pch = 16)
grid()
Very, very interesting. So, it does level off at what would seem to be a logical midpoint of 0.5. But it still suffers when the positive probabilities are very low (i.e., the case of Sierra’s data!) or very high. So, maybe macro F1 isn’t an ideal solution here either.
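For what it’s worth, the same back-of-the-envelope approximation from before (precision roughly equal to prevalence, recall roughly 0.5 for a coin-flip classifier) predicts exactly this shape: the expected macro F1 of a random model peaks at 0.5 when the classes are balanced and sinks toward one third at the extremes. A minimal sketch, under those same assumptions:

# approximate expected macro F1 of a 50/50 random classifier as a function of
# the positive prevalence p (back-of-the-envelope, assuming precision ~ p and
# recall ~ 0.5 for each class)
macro.f1.random <- function(p) 0.5 * (p / (p + 0.5) + (1 - p) / (1.5 - p))

macro.f1.random(0.5)    # 0.5, the plateau in the plot
macro.f1.random(1/16)   # ~0.38, dipping under heavy imbalance
macro.f1.random(0.001)  # ~0.33, the floor as p approaches 0 or 1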
Another (better?) potential solution: Matthews Correlation Coefficient (MCC)
In scouring the deep, dark corners of the web, I found that there is a contingent of statisticians who have suggested an alternative, and potentially better, metric to stand in place of F1: the Matthews Correlation Coefficient (MCC). In fact, there is a recent and highly cited paper by Chicco and Jurman (2020) that specifically recommends MCC as a more suitable metric, especially in cases of heavy imbalance, pointing out some of the same patterns I have pointed out here (albeit with real data).
Unlike most accuracy measures, which range from 0 to 1, MCC behaves like Pearson’s correlation for continuous data in that it ranges from -1 to 1:
-1 represents perfect disagreement between predictions and observations
0 represents a functionally random agreement between predictions and observations
+1 represents a perfect agreement between predictions and observations
And, most importantly, it takes true negatives into account. Here is the almost comical equation:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
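Before testing it at scale, here’s a minimal sketch that plugs the counts from the earlier single random model’s confusion matrix (TP = 29, TN = 449, FP = 494, FN = 28) into the formula by hand:

# confusion matrix counts from the single random model earlier in this post
# (positive class = "1")
TP <- 29; TN <- 449; FP <- 494; FN <- 28

# plug the counts directly into the MCC formula
(TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # ~ -0.007, essentially
                                                       # zero, as expected for
                                                       # a random model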
Well, nothing left to do now but test it out! Let’s see how MCC performs on random models applied to datasets with highly variable distributions of positive and negative cases:
Show code
library(mltools)
Warning: package 'mltools' was built under R version 4.4.2
Show code
# create empty vector to store mcc values
mccs <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  mcc <- mcc(pred, obs)
  mccs <- c(mccs, mcc)
  prob.1s <- c(prob.1s, prob.1)
}

# plot mcc as a function of positive probability
plot(mccs ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "MCC", pch = 16)
grid()
Well isn’t that refreshing?!?! A performance metric that considers both the true positives AND true negatives (as well as the errors), while being robust against heavily imbalanced samples. And I love the fact that there is a clear interpretation (0 = random, 1 = perfect agreement), allowing us to evaluate model performance against a random (meaningless) model.
Conclusion
If it were entirely up to me, I would say let’s switch our base of model performance evaluation away from F1 and towards MCC. But, I am open to other ideas.
If you have made it this far, godspeed. If you have not, may your F1 scores continue to give you false confidence (or false doubt, depending on your observations).