A weather forecaster tells you the probability of rain tomorrow in your area is 70%.
Another instead tells you the probability is much lower, 20%.
They cannot both be right. How can you tell who is the better forecaster?
June 8, 2024
A common measure of a forecaster’s accuracy is called the Brier score.
For a single prediction:
\[B_i=(s_i-p_i)^2\]
The Brier score \(B_i\) of a single prediction is the squared difference between the true outcome \(s_i\) (1 if the event occurred, 0 otherwise) and the predicted probability \(p_i\).
For a set of \(n\) predictions:
\[B = \frac{1}{n}\sum_{i=1}^n B_i = \frac{1}{n}\sum_{i=1}^n (s_i-p_i)^2\]
The overall Brier score \(B\) is the average of the squared differences.
| Prediction | Outcome | Brier score |
|---|---|---|
| 0.7 | 1 | 0.09 |
| 0.3 | 0 | 0.09 |
| 0.8 | 1 | 0.04 |
| 0.4 | 0 | 0.16 |
| 0.6 | 1 | 0.16 |
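A minimal R sketch of this computation, using the predictions and outcomes from the table above (the vector names are mine, for illustration):

prediction = c(0.7, 0.3, 0.8, 0.4, 0.6)  # forecast probabilities from the table
outcome    = c(1, 0, 1, 0, 1)            # what actually happened
(outcome - prediction)^2                 # individual Brier scores: 0.09 0.09 0.04 0.16 0.16
mean((outcome - prediction)^2)           # overall Brier score: 0.108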
Suppose three different forecasters made probabilistic predictions about 1,000 binary events (e.g. rain/no-rain), and we have a record of their past performance.
The predictions are simulated data, each set produced by a different generating mechanism, but I am not going to tell you yet how they were actually generated.
If you knew the generating mechanism, you could immediately identify the best forecaster.
The point is to use the Brier score, without that knowledge, to tell who the better forecaster is.
This is what our simulated data set looks like. Below are the first 6 rows; in total there are 1,000 observations.
##   facts         a          b          c
## 1     1 0.2377333 0.89154682 0.93769020
## 2     1 0.6698901 0.64503935 0.95134379
## 3     0 0.8445457 0.02147546 0.08718045
## 4     0 0.6228831 0.64468335 0.06123307
## 5     0 0.9699362 0.44605831 0.04052389
## 6     1 0.2477629 0.77168606 0.95558712
## [1] 1000
The forecasters are a, b, and c. Each made 1,000 forecasts, expressed as probabilities between 0 and 1, about binary events that can take the value 0 or 1.
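Output like the one shown above can be produced along these lines (a sketch; the object name `df` and the way the columns were assembled are my assumptions):

df = data.frame(facts, a, b, c)  # true outcomes plus the three forecasters' probabilities
head(df)                         # first 6 rows, as shown above
nrow(df)                         # number of observations: 1000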
Looks like forecaster A is the worst and forecaster C is the best.
This impression is confirmed by computing the overall Brier score, i.e., the average of the Brier scores for individual predictions. Recall:
\[B = \frac{1}{n}\sum_{i=1}^n (s_i-p_i)^2\]
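A sketch of how these averages might be computed in R, assuming the columns of the data frame above are available as the vectors `facts`, `a`, `b`, and `c`:

# overall Brier score: mean squared difference between outcome and forecast
brier_a = mean((facts - a)^2)
brier_b = mean((facts - b)^2)
brier_c = mean((facts - c)^2)
c(brier_a = brier_a, brier_b = brier_b, brier_c = brier_c)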
The computations give us:
##    brier_a    brier_b    brier_c
## 0.33055615 0.17338180 0.04145532
Clearly, forecaster C has the lowest mean squared error, while forecaster A has the highest. Forecaster B is in between.
a = random, b = decent, c = perfect
I will show the details of the generating mechanisms and the R code in the next slides, one for each forecaster.
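The vector of true outcomes, `facts`, is used by all three mechanisms but is not shown; a plausible way to simulate it (my assumption, not necessarily what was actually done) is:

# 1,000 binary outcomes; a fair 0/1 coin is an assumption for illustration
facts = rbinom(1000, size = 1, prob = 0.5)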
100% of the cases: the prediction is random.
# random forecaster
random = runif(1000, min=0, max=1)
# always picks values between 0 and 1 at random
60% of the cases: if the event is going to occur (=1), they predict it will occur with a probability 55% to 100%, else with a probability 0% to 45%. 40% of the cases: the prediction is random.
# decent forecaster
decent = facts
for(i in 1:1000){
  if(runif(1) > .4){                              # 60% of cases: non-random prediction
    if(facts[i] == 1){
      decent[i] = runif(1, min = 55, max = 100)/100
    } else if(facts[i] == 0){
      decent[i] = runif(1, min = 0, max = 45)/100
    }
  } else {                                        # 40% of cases: random prediction
    decent[i] = runif(1)
  }
}
90% of the cases: if the event is going to occur (=1), they predict it will occur with a probability 90% to 100%, else with a probability 0% to 10%. 10% of the cases: the prediction is random.
# perfect forecaster
perfect = facts
for(i in 1:1000){
  if(runif(1) > .1){                              # 90% of cases: non-random prediction
    if(facts[i] == 1){
      perfect[i] = runif(1, min = 90, max = 100)/100
    } else if(facts[i] == 0){
      perfect[i] = runif(1, min = 0, max = 10)/100
    }
  } else {                                        # 10% of cases: random prediction
    perfect[i] = runif(1)
  }
}
Something odd: the random forecaster’s Brier scores are concentrated at lower values, just like those of the other forecasters. Shouldn’t they be spread out more evenly?
Problem 1
Since the Brier score is the square of the difference between true values (0 or 1) and predicted values (probabilities between 0 and 1), and squaring a number smaller than 1 makes it smaller still, lower scores will be over-represented.
This is a mathematical artifact.
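A quick way to see the artifact (a sketch, not part of the original post): squaring differences that lie between \(-1\) and \(1\) pushes them toward zero.

# differences between purely random forecasts and random binary outcomes
diffs = runif(10000) - rbinom(10000, size = 1, prob = 0.5)
hist(diffs^2)   # the squared differences pile up near 0
hist(diffs)     # the raw differences are spread out roughly evenly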
Problem 2
The squared error does not distinguish between types of error, such as overestimating or underestimating.
For example, a .7 forecast when the true value is 1 (underestimating) and a .3 forecast when the true value is 0 (overestimating) have the same Brier score:
\((1-.7)^2=(0-.3)^2=.09\).
Think of error as the difference between predicted and true value.
Measured this way, the errors of the random forecaster are uniformly distributed, as we would expect.
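For instance, using the simulated vectors defined in the code above, one could compare the signed errors of the three forecasters (a sketch):

# signed error: predicted probability minus true outcome
hist(random - facts)    # roughly uniform between -1 and 1
hist(decent - facts)    # concentrated closer to 0
hist(perfect - facts)   # tightly concentrated around 0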
The Brier score is a proper scoring rule. Informally, this means that anyone who makes a probability forecast has an interest in being honest about it.
Formally, the Brier score is proper whenever, for any \(q \neq p\):
\[p(1-p)^2 + (1-p)(0-p)^2 < p(1-q)^2 + (1-p)(0-q)^2\]
The true state could be \(s=1\) or \(s=0\). To compute the expected Brier score of a reported probability, the scores for these two states are weighted by the forecaster’s probabilities of them happening, namely \(p\) and \((1-p)\).
In English, the inequality states that a forecaster whose probability is \(p\) expects a lower Brier score from reporting \(p\) itself than from reporting any other probability \(q\).
We need to prove that, for any \(q \neq p\):
\[p(1-p)^2 + (1-p)(0-p)^2 < p(1-q)^2 + (1-p)(0-q)^2\]
The left-hand side, the expected score of reporting \(p\), simplifies to \(p(1-p)^2 + (1-p)p^2 = p - p^2\); the right-hand side, the expected score of reporting \(q\), simplifies to \(p(1-q)^2 + (1-p)q^2 = p - 2pq + q^2\). The inequality therefore reads:
\[p - p^2 < p - 2pq + q^2\]
Subtracting the left-hand side from the right leaves \(p^2 - 2pq + q^2 = (p-q)^2\), which is strictly positive whenever \(q \neq p\). Equivalently, setting the derivative of the expected score \(f(q) = p - 2pq + q^2\) with respect to the reported probability \(q\) to zero,
\[\frac{df(q)}{dq} = -2p + 2q = 0,\]
shows that the expected Brier score is lowest exactly when \(q = p\). This establishes propriety.
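As a quick numerical sanity check (my own illustration, not part of the original derivation), one can compute the expected Brier score of a forecaster whose probability is, say, 0.7 for every candidate report \(q\); it is minimized at \(q = 0.7\):

# expected Brier score of reporting q when the believed probability is p
p = 0.7                      # an illustrative belief, chosen arbitrarily
q = seq(0, 1, by = 0.01)     # candidate reported probabilities
expected_brier = p*(1 - q)^2 + (1 - p)*(0 - q)^2
q[which.min(expected_brier)] # 0.7: reporting the honest probability minimizes the expected score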