Click here to uncover the dirty truth about the F1 score
Author
Mickey Campbell
Published
July 2, 2025
Introduction
For Sierra’s research on modeling fine-scale travel route selection in a wildland setting, we have been heavily focused on F1 score as the primary measure of model performance. F1 is widely regarded for its robustness as a measure of binary classification models, even in (especially in?) scenarios where there is a heavy imbalance between the two possible predicted outcomes in the data. Sierra’s data are, indeed, very imbalanced, with every decision point along one of the experimental trajectories featuring 15 directions that were not taken and 1 direction that was taken. So, right off the bat, the “positive” case only represents 6.25% of the data. No problem, though, right? F1 to the rescue!!
F1 scores of random models with 15:1 negative:positive ratios
Sierra’s models have been getting F1 scores that we’ve been interpreting as being pretty low (0.2-0.4). But, considered in isolation, it is a little difficult to understand what those numbers really mean. I thought one way to improve this understanding would be to compare it to a random model. That is, given the same distribution of positive versus negative cases, what would the F1 score be of a model that just picks positives and negatives out of a hat with no a priori knowledge of the distribution? Fortunately, R makes this easy to simulate!
Show code
# simulate observed data with ~15:1 ratio
obs <- c(0,1) |>
  sample(size = 1000, replace = T, prob = c(15/16, 1/16)) |>
  factor(levels = c("0", "1"))

# generate random predictions with no a priori probability distributions
pred <- c(0,1) |>
  sample(size = 1000, replace = T) |>
  factor(levels = c("0", "1"))

# get counts
table(obs)
obs
  0   1
943  57
Show code
table(pred)
pred
  0   1
477 523
Let’s calculate the F1 score of this “model”:
Show code
library(caret)

# compute the confusion matrix and precision/recall-based statistics
confusionMatrix(
  data = pred,
  reference = obs,
  positive = "1",
  mode = "prec_recall"
)
Warning: package 'caret' was built under R version 4.4.2
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 449  28
         1 494  29

               Accuracy : 0.478
                 95% CI : (0.4466, 0.5095)
    No Information Rate : 0.943
    P-Value [Acc > NIR] : 1

                  Kappa : -0.0031

 Mcnemar's Test P-Value : <2e-16

              Precision : 0.05545
                 Recall : 0.50877
                     F1 : 0.10000
             Prevalence : 0.05700
         Detection Rate : 0.02900
   Detection Prevalence : 0.52300
      Balanced Accuracy : 0.49246

       'Positive' Class : 1
As we can see, the F1 score is 0.1. But, that’s just one random dataset. Let’s run this 1000 times to get a distribution of F1 scores:
Show code
# create empty vector to store f1 values
f1s <- c()

# begin loop
for (i in 1:1000){
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(15/16, 1/16)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1 <- cm$byClass[[7]]
  f1s <- c(f1s, f1)
}

# plot histogram of f1s
hist(f1s, las = 1, xlab = "F1 Score", main = NA, mar = c(5,5,1,1))
abline(v = mean(f1s), lwd = 3, col = 2)
legend("topright", legend = paste0("mean = ", round(mean(f1s), 2)), text.col = 2,
       bty = "n", x.intersp = 0)
Fairly consistent numbers with a mean of 0.11. OK, so that’s somewhat reassuring… Sierra’s models yielding F1s in the range of 0.2 - 0.4 are at least clearly better than random. A low bar for success, admittedly, but that was exactly what I wanted to know.
The effect of imbalance ratio on F1 scores
However, this got me thinking… The consistency of these numbers, hovering right around 0.11 felt a little weird, and likely related to the fact that the observed data had a 15:1 negative:positive ratio. So, I thought I’d play with that ratio a bit, and see how it affected F1 scores of random models. In the code below, instead of fixing the 15:1 ratio, I varied the distributions of positive and negative cases, each time generating a random model, and extracting the F1 score:
Show code
# create empty vector to store f1 values
f1s <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1 <- cm$byClass[[7]]
  f1s <- c(f1s, f1)
  prob.1s <- c(prob.1s, prob.1)
}

# plot f1 as a function of positive probabilities
plot(f1s ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "F1 Score", pch = 16)
grid()
Uhhh… That’s interesting!! There is an exceedingly clear relationship (effect, one might say) between the negative:positive ratio and the F1 score. I’m no statistical theorist, but at the very least, this tells me that F1 should be interpreted cautiously. If we consider a random model to be the baseline of a functionally useless model (i.e., the one we should strive to far exceed), then it seems really important to know that a random model’s F1 of 0.1 on a dataset with a 15:1 negative:positive ratio could be functionally equivalent to an F1 of 0.65 on a dataset with a 1:15 ratio.
I don’t know… This doesn’t sit well with me. But maybe I’m missing something?
What F1 misses: true negatives
To understand what I am, in fact, missing, I think it’s worth taking a look at the formulation of F1. Here’s a reference confusion matrix:
Show code
# plot reference cm
par(mar = c(1,5,5,1), las = 1)
plot(x = c(0,1),
     y = c(0,1),
     type = "n",
     xlim = c(0,1),
     ylim = c(0,1),
     xlab = NA,
     ylab = NA,
     xaxt = "n",
     yaxt = "n",
     xaxs = "i",
     yaxs = "i")
rect(0, 0.5, 0.5, 1, col = "lightgray", border = NA)
rect(0.5, 0, 1, 0.5, col = "lightgray", border = NA)
abline(h = 0.5)
abline(v = 0.5)
text(0.25, 0.75, "True Negatives\n(TN)")
text(0.25, 0.25, "False Positives\n(FP)")
text(0.75, 0.75, "False Negatives\n(FN)")
text(0.75, 0.25, "True Positives\n(TP)")
axis(3, c(0.25, 0.75), labels = c("Negative", "Positive"))
axis(2, c(0.25, 0.75), labels = c("Positive", "Negative"))
box(lwd = 2)
mtext("Observations", line = 2.5, font = 2)
mtext("Predictions", side = 2, line = 2.5, font = 2, las = 0)
The F1 score is the harmonic mean of precision and recall:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
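As a quick sanity check, plugging the counts from the single random model above (TP = 29, FP = 494, FN = 28) into these formulas recovers caret’s F1 of 0.1:

# counts from the single random model's confusion matrix above
TP <- 29; FP <- 494; FN <- 28

precision <- TP / (TP + FP)  # 29/523 ~ 0.055
recall    <- TP / (TP + FN)  # 29/57  ~ 0.509
2 * precision * recall / (precision + recall)  # ~0.10, matching caret's output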
What immediately becomes clear when looking at these formulas is that true negatives are never considered. If we ignore true negatives, and the value they carry as a measure of model performance, then, assuming we’re dealing with a random classifier, as the proportion of positive observations in a dataset increases:
TP should increase
FN should also increase
FP should decrease
As a result:
Recall should stay fairly constant
Precision should increase
Let’s see if this is true:
Show code
# create empty vectors to store values
ps <- c()
rs <- c()
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  p <- cm$byClass[[5]]
  r <- cm$byClass[[6]]
  ps <- c(ps, p)
  rs <- c(rs, r)
  prob.1s <- c(prob.1s, prob.1)
}

# plot precision and recall as functions of positive probability
par(mfrow = c(1,2), mar = c(5,5,1,1), las = 1)
plot(ps ~ prob.1s, xlab = "Positive Probability", ylab = "Precision", pch = 16)
grid()
plot(rs ~ prob.1s, xlab = "Positive Probability", ylab = "Recall", pch = 16)
grid()
Given that F1 is essentially an average (a harmonic mean, to be precise) of precision and recall, F1 should therefore increase as well. Thus, by ignoring true negatives, F1 becomes highly sensitive to the negative:positive ratio of the observations.
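To put a rough number on that intuition: for a coin-flip classifier, precision lands near the positive prevalence p and recall near 0.5, so F1 should sit around 2(p)(0.5)/(p + 0.5) = p/(p + 0.5). This is just my own back-of-the-envelope approximation, but it lines up with the simulations:

# rough expected F1 of a 50/50 random classifier, assuming precision ~ p
# (the positive prevalence) and recall ~ 0.5
f1.random <- function(p) p / (p + 0.5)

f1.random(1/16)   # ~0.11, matching the 15:1 simulations above
f1.random(15/16)  # ~0.65, matching the 1:15 scenario mentioned earlier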
One potential solution: macro F1
By nature, F1 is a class-specific measure. That is, you have to define what your positive class is – presumably this is the thing you’re really trying best to capture. There is an argument to be made that, in the context of Sierra’s research, we do, in fact, care most about the taken directions. But, you could just as easily argue it is of equal value to know what directions people avoided (and why!). It’s kind of a glass half full/glass half empty question, really. Or, 1/16 full and 15/16 empty, as it were.
All of this is to say that all of the F1 scores we have discussed so far are for the positive/taken class. One could just as easily compute the F1 score for the not taken class. Let’s see how that pans out in our random models with varying positive probabilities (you can probably already guess how it will look…):
Show code
# create empty vector to store f1 values
f1s <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "0",
    mode = "prec_recall"
  )
  f1 <- cm$byClass[[7]]
  f1s <- c(f1s, f1)
  prob.1s <- c(prob.1s, prob.1)
}

# plot f1 as a function of positive probabilities
plot(f1s ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "F1 Score", pch = 16)
grid()
As one increases, the other decreases:
Show code
# create empty vectors to store f1 values
f1s.0 <- c()
f1s.1 <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm.0 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "0",
    mode = "prec_recall"
  )
  cm.1 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1.0 <- cm.0$byClass[[7]]
  f1.1 <- cm.1$byClass[[7]]
  f1s.0 <- c(f1s.0, f1.0)
  f1s.1 <- c(f1s.1, f1.1)
}

# plot f1 for positive outcomes versus f1 for negative outcomes
plot(f1s.1 ~ f1s.0, mar = c(5,5,1,1), las = 1, xlab = "F1 for Negatives",
     ylab = "F1 for Positives", pch = 16)
grid()
If we were to focus solely on the negative outcomes as the basis of model performance assessment, then we would be in the same boat as before: still not considering holistic model performance. That said, we would certainly get better-looking F1 numbers! :) But, as we’ve clearly seen, the absolute value of an F1 score is… complicated.
One way that people have proposed to get around the single-class focus of F1 is through the calculation of macro F1. It’s a fancy way of saying the average F1 across your classes:

$$\text{Macro } F1 = \frac{1}{N} \sum_{i=1}^{N} F1_i$$

which, with only two classes, is simply:

$$\text{Macro } F1 = \frac{F1_{\text{positive}} + F1_{\text{negative}}}{2}$$
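To put a number on it with results we’ve already seen, here’s a quick sketch that computes macro F1 for the single random model from earlier, reusing its confusion matrix counts (TP = 29, TN = 449, FP = 494, FN = 28, with "1" as the positive class):

# confusion matrix counts from the single random model above ("1" = positive)
TP <- 29; TN <- 449; FP <- 494; FN <- 28

# F1 for the positive ("taken") class, written in its algebraically
# equivalent 2TP / (2TP + FP + FN) form
f1.1 <- 2 * TP / (2 * TP + FP + FN)  # 0.10

# F1 for the negative ("not taken") class: FP and FN swap roles when
# "0" is treated as the positive class
f1.0 <- 2 * TN / (2 * TN + FN + FP)  # ~0.63

# macro F1 is just the unweighted average of the two
(f1.1 + f1.0) / 2  # ~0.37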
Let’s see how macro F1 holds up in our random modeling scenario where the distribution of positive and negative observations varies:
Show code
# create empty vector to store macro f1 values
macro.f1s <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  cm.0 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "0",
    mode = "prec_recall"
  )
  cm.1 <- confusionMatrix(
    data = pred,
    reference = obs,
    positive = "1",
    mode = "prec_recall"
  )
  f1.0 <- cm.0$byClass[[7]]
  f1.1 <- cm.1$byClass[[7]]
  macro.f1 <- (f1.0 + f1.1) / 2
  macro.f1s <- c(macro.f1s, macro.f1)
  prob.1s <- c(prob.1s, prob.1)
}

# plot macro f1 as a function of positive probability
plot(macro.f1s ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "Macro F1", pch = 16)
grid()
Very, very interesting. So, it does level off at what would seem to be a logical midpoint of 0.5. But it still suffers when the positive probabilities are very low (i.e., the case of Sierra’s data!) or very high. So, maybe macro F1 isn’t an ideal solution here either.
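For what it’s worth, the same back-of-the-envelope approximation from before (precision roughly equal to prevalence, recall roughly 0.5 for a coin-flip classifier) predicts exactly this shape: the expected macro F1 of a random model peaks at 0.5 when the classes are balanced and sinks toward one third at the extremes. A minimal sketch, under those same assumptions:

# approximate expected macro F1 of a 50/50 random classifier as a function of
# the positive prevalence p (back-of-the-envelope, assuming precision ~ p and
# recall ~ 0.5 for each class)
macro.f1.random <- function(p) 0.5 * (p / (p + 0.5) + (1 - p) / (1.5 - p))

macro.f1.random(0.5)    # 0.5, the plateau in the plot
macro.f1.random(1/16)   # ~0.38, dipping under heavy imbalance
macro.f1.random(0.001)  # ~0.33, the floor as p approaches 0 or 1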
Another (better?) potential solution: Matthews Correlation Coefficient (MCC)
In scouring the deep, dark corners of the web, I found that there is a contingent of statisticians who have suggested an alternative, and potentially better, metric to stand in place of F1: the Matthews Correlation Coefficient (MCC). In fact, there is a recent and highly cited paper by Chicco and Jurman (2020) that specifically recommends MCC as a more suitable metric, especially in cases of heavy imbalance, pointing out some of the same patterns I have pointed out here (albeit with real data).
Unlike most accuracy measures, which range from 0 to 1, MCC behaves like Pearson’s correlation for continuous data in that it ranges from -1 to 1:
-1 represents perfect disagreement between predictions and observations
0 represents a functionally random agreement between predictions and observations
+1 represents a perfect agreement between predictions and observations
And, most importantly, it takes true negatives into account. Here is the almost comical equation:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
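Before testing it at scale, here’s a minimal sketch that plugs the counts from the earlier single random model’s confusion matrix (TP = 29, TN = 449, FP = 494, FN = 28) into the formula by hand:

# confusion matrix counts from the single random model earlier in this post
# (positive class = "1")
TP <- 29; TN <- 449; FP <- 494; FN <- 28

# plug the counts directly into the MCC formula
(TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # ~ -0.007, essentially
                                                       # zero, as expected for
                                                       # a random model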
Well, nothing left to do now but test it out! Let’s see how MCC performs on random models applied to datasets with highly variable distributions of positive and negative cases:
Show code
library(mltools)
Warning: package 'mltools' was built under R version 4.4.2
Show code
# create empty vector to store mcc values
mccs <- c()

# create another empty vector to store the positive probabilities (i.e., the
# proportion of simulated positive observations)
prob.1s <- c()

# begin loop
for (i in 1:1000){
  prob.0 <- runif(1)
  prob.1 <- 1 - prob.0
  obs <- c(0,1) |>
    sample(size = 1000, replace = T, prob = c(prob.0, prob.1)) |>
    factor(levels = c("0", "1"))
  pred <- c(0,1) |>
    sample(size = 1000, replace = T) |>
    factor(levels = c("0", "1"))
  mcc <- mcc(pred, obs)
  mccs <- c(mccs, mcc)
  prob.1s <- c(prob.1s, prob.1)
}

# plot mcc as a function of positive probability
plot(mccs ~ prob.1s, mar = c(5,5,1,1), las = 1, xlab = "Positive Probability",
     ylab = "MCC", pch = 16)
grid()
Well isn’t that refreshing?!?! A performance metric that considers both the true positives AND true negatives (as well as the errors), while being robust against heavily imbalanced samples. And I love the fact that there is a clear interpretation (0 = random, 1 = perfect agreement), allowing us to evaluate model performance against a random (meaningless) model.
Conclusion
If it were entirely up to me, I would say let’s switch our base of model performance evaluation away from F1 and towards MCC. But, I am open to other ideas.
If you have made it this far, godspeed. If you have not, may your F1 scores continue to give you false confidence (or false doubt, depending on your observations).