A weather forecaster tells you the probability of rain tomorrow in your area is 70%.
Another instead tells you the probability is much lower, 20%.
They cannot both be right. How can you tell who is the better forecaster?
June 8, 2024
A common measure of a forecaster’s accuracy is called the Brier score.
For a single prediction:
\[B_i=(s_i-p_i)^2\]
The Brier score \(B_i\) of a single prediction is the squared difference between the true outcome \(s_i\) (1 if the event occurred, 0 otherwise) and the predicted probability \(p_i\).
For a set of \(n\) predictions:
\[B = \frac{1}{n}\sum_{i=1}^n B_i = \frac{1}{n}\sum_{i=1}^n (s_i-p_i)^2\]
The overall Brier score \(B\) is the average of the squared differences.
| Prediction | Outcome | Brier score |
|---|---|---|
| 0.7 | 1 | 0.09 |
| 0.3 | 0 | 0.09 |
| 0.8 | 1 | 0.04 |
| 0.4 | 0 | 0.16 |
| 0.6 | 1 | 0.16 |
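A minimal R sketch of this computation, using the predictions and outcomes from the table above (the vector names are mine, for illustration):

prediction = c(0.7, 0.3, 0.8, 0.4, 0.6)  # forecast probabilities from the table
outcome    = c(1, 0, 1, 0, 1)            # what actually happened
(outcome - prediction)^2                 # individual Brier scores: 0.09 0.09 0.04 0.16 0.16
mean((outcome - prediction)^2)           # overall Brier score: 0.108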
Suppose three different forecasters made probabilistic predictions about 1,000 binary events (e.g. rain/no-rain), and we have a record of their past performance.
The predictions are simulated data, each set produced by a different generating mechanism, but I am not going to tell you yet how they were actually generated.
If you knew the generating mechanism, you could immediately identify the best forecaster.
The point is to use the Brier score, without that knowledge, to tell who the better forecaster is.
This is what our simulated data set looks like. Below are the first 6 rows; in total there are 1,000 observations.
##   facts         a          b          c
## 1     1 0.2377333 0.89154682 0.93769020
## 2     1 0.6698901 0.64503935 0.95134379
## 3     0 0.8445457 0.02147546 0.08718045
## 4     0 0.6228831 0.64468335 0.06123307
## 5     0 0.9699362 0.44605831 0.04052389
## 6     1 0.2477629 0.77168606 0.95558712
## [1] 1000
The forecasters are a, b, and c. Each made 1,000 forecasts, expressed as probabilities between 0 and 1, about binary events that can take the value 0 or 1.
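Output like the one shown above can be produced along these lines (a sketch; the object name `df` and the way the columns were assembled are my assumptions):

df = data.frame(facts, a, b, c)  # true outcomes plus the three forecasters' probabilities
head(df)                         # first 6 rows, as shown above
nrow(df)                         # number of observations: 1000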
Looks like forecaster A is the worst and forecaster C is the best.
This impression is confirmed by computing the overall Brier score, i.e., the average of the Brier scores for individual predictions. Recall:
\[B = \frac{1}{n}\sum_{i=1}^n (s_i-p_i)^2\]
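A sketch of how these averages might be computed in R, assuming the columns of the data frame above are available as the vectors `facts`, `a`, `b`, and `c`:

# overall Brier score: mean squared difference between outcome and forecast
brier_a = mean((facts - a)^2)
brier_b = mean((facts - b)^2)
brier_c = mean((facts - c)^2)
c(brier_a = brier_a, brier_b = brier_b, brier_c = brier_c)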
The computations give us:
##    brier_a    brier_b    brier_c
## 0.33055615 0.17338180 0.04145532
Clearly, forecaster C has the lowest mean squared error, while forecaster A has the highest. Forecaster B is in between.
a = random, b = decent, c = perfect
I will show the details of the generating mechanisms and the R code in the next slides, one for each forecaster.
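The vector of true outcomes, `facts`, is used by all three mechanisms but is not shown; a plausible way to simulate it (my assumption, not necessarily what was actually done) is:

# 1,000 binary outcomes; a fair 0/1 coin is an assumption for illustration
facts = rbinom(1000, size = 1, prob = 0.5)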
100% of the cases: the prediction is random.
# random forecaster
random = runif(1000, min=0, max=1)
# always picks values between 0 and 1 at random
60% of the cases: if the event is going to occur (=1), they predict it will occur with a probability 55% to 100%, else with a probability 0% to 45%. 40% of the cases: the prediction is random.
# decent forecaster
decent = facts
for(i in 1:1000){
  if(runif(1) > .4){                              # 60% of cases: non-random prediction
    if(facts[i] == 1){
      decent[i] = runif(1, min = 55, max = 100)/100
    } else if(facts[i] == 0){
      decent[i] = runif(1, min = 0, max = 45)/100
    }
  } else {                                        # 40% of cases: random prediction
    decent[i] = runif(1)
  }
}
90% of the cases: if the event is going to occur (=1), they predict it will occur with a probability 90% to 100%, else with a probability 0% to 10%. 10% of the cases: the prediction is random.
# perfect forecaster
perfect = facts
for(i in 1:1000){
  if(runif(1) > .1){                              # 90% of cases: non-random prediction
    if(facts[i] == 1){
      perfect[i] = runif(1, min = 90, max = 100)/100
    } else if(facts[i] == 0){
      perfect[i] = runif(1, min = 0, max = 10)/100
    }
  } else {                                        # 10% of cases: random prediction
    perfect[i] = runif(1)
  }
}
Something odd: the random forecaster’s Brier scores are concentrated at lower values, just like those of the other forecasters. Shouldn’t they be spread out more evenly?
Problem 1
Since the Brier score is the square of the difference between true values (0 or 1) and predicted values (probabilities between 0 and 1), and squaring a number smaller than 1 makes it smaller still, lower scores will be over-represented.
This is a mathematical artifact.
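A quick way to see the artifact (a sketch, not part of the original post): squaring differences that lie between \(-1\) and \(1\) pushes them toward zero.

# differences between purely random forecasts and random binary outcomes
diffs = runif(10000) - rbinom(10000, size = 1, prob = 0.5)
hist(diffs^2)   # the squared differences pile up near 0
hist(diffs)     # the raw differences are spread out roughly evenly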
Problem 2
The squared error does not distinguish between types of error, such as overestimating or underestimating.
For example, a .7 forecast when the true value is 1 (underestimating) and a .3 forecast when the true value is 0 (overestimating) have the same Brier score:
\((1-.7)^2=(0-.3)^2=.09\).
Think of error as the difference between predicted and true value.
Measured this way, the errors of the random forecaster are uniformly distributed, as we would expect.
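For instance, using the simulated vectors defined in the code above, one could compare the signed errors of the three forecasters (a sketch):

# signed error: predicted probability minus true outcome
hist(random - facts)    # roughly uniform between -1 and 1
hist(decent - facts)    # concentrated closer to 0
hist(perfect - facts)   # tightly concentrated around 0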
The Brier score is a proper scoring rule. Informally, this means that anyone who makes a probability forecast has an interest in being honest about it.
Formally, the Brier score is proper whenever, for any \(q \neq p\):
\[p(1-p)^2 + (1-p)(0-p)^2 < p(1-q)^2 + (1-p)(0-q)^2\]
The true state could be \(s=1\) or \(s=0\). To compute the expected Brier score of a reported probability, the scores for these two states are weighted by the forecaster’s probabilities of them happening, namely \(p\) and \((1-p)\).
In English, the inequality states that a forecaster whose probability is \(p\) expects a lower Brier score from reporting \(p\) itself than from reporting any other probability \(q\).
We need to prove that, for any \(q \neq p\):
\[p(1-p)^2 + (1-p)(0-p)^2 < p(1-q)^2 + (1-p)(0-q)^2\]
The left-hand side, the expected score of reporting \(p\), simplifies to \(p(1-p)^2 + (1-p)p^2 = p - p^2\); the right-hand side, the expected score of reporting \(q\), simplifies to \(p(1-q)^2 + (1-p)q^2 = p - 2pq + q^2\). The inequality therefore reads:
\[p - p^2 < p - 2pq + q^2\]
Subtracting the left-hand side from the right leaves \(p^2 - 2pq + q^2 = (p-q)^2\), which is strictly positive whenever \(q \neq p\). Equivalently, setting the derivative of the expected score \(f(q) = p - 2pq + q^2\) with respect to the reported probability \(q\) to zero,
\[\frac{df(q)}{dq} = -2p + 2q = 0,\]
shows that the expected Brier score is lowest exactly when \(q = p\). This establishes propriety.
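As a quick numerical sanity check (my own illustration, not part of the original derivation), one can compute the expected Brier score of a forecaster whose probability is, say, 0.7 for every candidate report \(q\); it is minimized at \(q = 0.7\):

# expected Brier score of reporting q when the believed probability is p
p = 0.7                      # an illustrative belief, chosen arbitrarily
q = seq(0, 1, by = 0.01)     # candidate reported probabilities
expected_brier = p*(1 - q)^2 + (1 - p)*(0 - q)^2
q[which.min(expected_brier)] # 0.7: reporting the honest probability minimizes the expected score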