Suppose we find ourselves looking at the following ZiPS and Steamer 2019 home run projections for five Mets batters:
Player | ZiPS HR | Steamer HR | Actual HR |
---|---|---|---|
Jeff McNeil | 14 | 11 | 23 |
Robinson Cano | 16 | 22 | 13 |
Brandon Nimmo | 13 | 15 | 8 |
Yoenis Cespedes | 20 | 5 | 0 |
Todd Frazier | 21 | 14 | 21 |
We would like to determine which set of projections was the most accurate. How should we do it?
Perhaps, most simply, we could look at how many “wins” each projection system has. In this case, it turns out that ZiPS was more accurate for four players (McNeil, Cano, Nimmo and Frazier) whereas Steamer only “won” Cespedes. ZiPS goes 4-1!
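The win counting above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the `count_wins` helper are assumptions made for this example, and the numbers are copied from the table.

```python
# Projections and actual 2019 home run totals, copied from the table above.
# The dictionary layout is an assumption made for this sketch.
projections = {
    "Jeff McNeil": {"zips": 14, "steamer": 11, "actual": 23},
    "Robinson Cano": {"zips": 16, "steamer": 22, "actual": 13},
    "Brandon Nimmo": {"zips": 13, "steamer": 15, "actual": 8},
    "Yoenis Cespedes": {"zips": 20, "steamer": 5, "actual": 0},
    "Todd Frazier": {"zips": 21, "steamer": 14, "actual": 21},
}


def count_wins(data):
    """Count how many players each system projected more closely."""
    wins = {"zips": 0, "steamer": 0, "tie": 0}
    for row in data.values():
        zips_miss = abs(row["actual"] - row["zips"])
        steamer_miss = abs(row["actual"] - row["steamer"])
        if zips_miss < steamer_miss:
            wins["zips"] += 1
        elif steamer_miss < zips_miss:
            wins["steamer"] += 1
        else:
            wins["tie"] += 1
    return wins


print(count_wins(projections))  # {'zips': 4, 'steamer': 1, 'tie': 0}
```

Note that this tally records only who was closer, not by how much.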
I might be tempted to dig a little deeper, however, both because ZiPS won (and I’d very much like to conclude something different) and because my evaluation has a shortcoming: we only looked at numbers of wins and losses and not how decisive they were. Maybe ZiPS’ narrow win on Nimmo shouldn’t count as much as Steamer’s rollicking victory on Cespedes? Let’s explore a couple of oft-used metrics for quantifying the size of projection errors: mean absolute error and root mean square error.
Our first step is to calculate what are called residuals. These are the differences between the actual values (in this case, the numbers of home runs these Mets hit in 2019) and the predictions (from ZiPS and Steamer).
\[residual = actual - predicted\]
Player | ZiPS HR | Steamer HR | Actual HR | ZiPS Residual | Steamer Residual |
---|---|---|---|---|---|
Jeff McNeil | 14 | 11 | 23 | 9 | 12 |
Robinson Cano | 16 | 22 | 13 | -3 | -9 |
Brandon Nimmo | 13 | 15 | 8 | -5 | -7 |
Yoenis Cespedes | 20 | 5 | 0 | -20 | -5 |
Todd Frazier | 21 | 14 | 21 | 0 | 7 |
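The residual column can be reproduced with a short Python sketch. The variable names and the `residuals` helper are assumptions made for this example; the data come from the table.

```python
# Home run totals copied from the table above, listed in the same player order.
players = ["Jeff McNeil", "Robinson Cano", "Brandon Nimmo",
           "Yoenis Cespedes", "Todd Frazier"]
actual_hr = [23, 13, 8, 0, 21]
zips_hr = [14, 16, 13, 20, 21]
steamer_hr = [11, 22, 15, 5, 14]


def residuals(actual, predicted):
    """Return actual - predicted for each player, keeping the sign."""
    return [a - p for a, p in zip(actual, predicted)]


zips_resid = residuals(actual_hr, zips_hr)
steamer_resid = residuals(actual_hr, steamer_hr)

for name, z, s in zip(players, zips_resid, steamer_resid):
    print(f"{name}: ZiPS {z}, Steamer {s}")
```

A positive residual means the system projected too few home runs; a negative one means it projected too many.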
We could now just add up or average these residuals but that wouldn’t tell us how accurate our forecasters were since positive residuals would cancel out negative residuals. For instance, Steamer projected 12 home runs too few for McNeil and 12 home runs too many for the combination of Nimmo and Cespedes. This isn’t the same as Steamer being right on the money on all three players and we don’t want our evaluation to treat it the same way. Instead, we’ll take the absolute values of these residuals first and then average them to calculate the mean absolute error (MAE).
Player | ZiPS HR | Steamer HR | Actual HR | ZiPS Abs. Residual | Steamer Abs. Residual |
---|---|---|---|---|---|
Jeff McNeil | 14 | 11 | 23 | 9 | 12 |
Robinson Cano | 16 | 22 | 13 | 3 | 9 |
Brandon Nimmo | 13 | 15 | 8 | 5 | 7 |
Yoenis Cespedes | 20 | 5 | 0 | 20 | 5 |
Todd Frazier | 21 | 14 | 21 | 0 | 7 |
\[ ZiPS\ MAE = \frac{9 + 3 + 5 + 20 + 0}{5} = 7.4\] \[ Steamer\ MAE = \frac{12 + 9 + 7 + 5 + 7}{5} = 8\]
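As a check on the arithmetic, here is a minimal Python sketch of the same calculation. The function name `mean_absolute_error` is chosen for this example and is not taken from any particular library.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute residuals |actual - predicted|."""
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(abs_errors) / len(abs_errors)


# Home run totals copied from the table, in the same player order.
actual_hr = [23, 13, 8, 0, 21]
zips_hr = [14, 16, 13, 20, 21]
steamer_hr = [11, 22, 15, 5, 14]

print(mean_absolute_error(actual_hr, zips_hr))     # 7.4
print(mean_absolute_error(actual_hr, steamer_hr))  # 8.0
```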
The good news is that we now have a metric that judges these projections by the size of their errors. The bad news is that ZiPS came out ahead again since smaller errors are better. Let’s try root mean square error!
Root mean square error is just like mean absolute error except that we square the residuals (or the absolute values of the residuals; it doesn’t matter, since squaring removes the sign) before averaging them, and then take the square root of that average. We can remember the steps in this algorithm by reading the name backwards (1. Error, 2. Square, 3. Mean, 4. Root).
Player | ZiPS HR | Steamer HR | Actual HR | ZiPS Sq. Error | Steamer Sq. Error |
---|---|---|---|---|---|
Jeff McNeil | 14 | 11 | 23 | 81 | 144 |
Robinson Cano | 16 | 22 | 13 | 9 | 81 |
Brandon Nimmo | 13 | 15 | 8 | 25 | 49 |
Yoenis Cespedes | 20 | 5 | 0 | 400 | 25 |
Todd Frazier | 21 | 14 | 21 | 0 | 49 |
\[ ZiPS\ RMSE = \sqrt{\frac{9^2 + 3^2 + 5^2 + 20^2 + 0^2}{5}} \approx 10.1\] \[ Steamer\ RMSE = \sqrt{\frac{12^2 + 9^2 + 7^2 + 5^2 + 7^2}{5}} \approx 8.3\]
This time, Steamer comes out on top with a smaller root mean square error. You can see why this happened by looking at the squared errors in the table above. While the large Cespedes error was fairly costly in our mean absolute error calculation, in the root mean square error calculation it overwhelms the other errors: it accounts for 400 of ZiPS’ 515 total squared error. Root mean square error punishes forecasts most heavily for their largest errors.
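The four steps (Error, Square, Mean, Root) translate directly into a small Python sketch. The function name `root_mean_square_error` is an assumption made for this example rather than a library call.

```python
import math


def root_mean_square_error(actual, predicted):
    """RMSE computed by reading the name backwards."""
    errors = [a - p for a, p in zip(actual, predicted)]  # 1. Error
    squares = [e ** 2 for e in errors]                   # 2. Square
    mean = sum(squares) / len(squares)                   # 3. Mean
    return math.sqrt(mean)                               # 4. Root


# Home run totals copied from the table, in the same player order.
actual_hr = [23, 13, 8, 0, 21]
zips_hr = [14, 16, 13, 20, 21]
steamer_hr = [11, 22, 15, 5, 14]

print(round(root_mean_square_error(actual_hr, zips_hr), 1))     # 10.1
print(round(root_mean_square_error(actual_hr, steamer_hr), 1))  # 8.3
```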
[To be clear this comparison is for the sake of practice only! I am by no means suggesting that we compare projection systems based on five player seasons chosen not at all randomly.]
Calculate the mean absolute error and root mean square error for ZiPS’ triples projections for the following set of hitters:
Player | ZiPS 3B | Actual 3B |
---|---|---|
Jeff McNeil | 8 | 1 |
Robinson Cano | 1 | 0 |
Brandon Nimmo | 5 | 1 |
Yoenis Cespedes | 2 | 0 |
Todd Frazier | 0 | 2 |
Which of these two metrics should we use? It depends. To develop some intuition for what these metrics do, let’s imagine standing in a hall with three elevators at locations 0, 2 and 10.
We don’t have any information about which elevator will arrive next but, whichever comes, we’d like to position ourselves to hop on. Where should we stand? Again, it depends.
The place where we choose to stand in this case is essentially our projection for the location of the next elevator. The best choice of location depends on how we evaluate our errors. Our choice of metric ends up being quite important! It determines where we should stand.
(Example calculations: If we choose to stand at location 8, then we’re equally likely to have to walk 8 units to the elevator at location 0, 6 units to the elevator at location 2 and 2 units to the elevator at location 10. So, our average walk is \(\frac{8+6+2}{3} = 5\frac{1}{3}\) units. On the other hand, if we choose to stand at location 7, our average walk is \(\frac{7+5+3}{3} = 5\) units so standing at 7 is better than standing at 8. However, it is not the best location.)
(Example calculations: If we choose to stand at location 8 our RMSE is: \(\sqrt{\frac{8^2+6^2+2^2}{3}} \approx 5.89\). If we choose to stand at location 7, our RMSE is \(\sqrt{\frac{7^2+5^2+3^2}{3}} \approx 5.26\) so, once again, standing at 7 is better than standing at 8. Also again, it’s not the best location.)
(Hints: This problem can be solved through trial and error, but those who know calculus can use it. If you haven’t encountered calculus but do know how to find the vertex of a parabola, that could be useful too!)
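For the trial-and-error route, a small Python sketch can score any standing position under both metrics. The helper names here are assumptions made for this example. The sketch only reproduces the two example calculations above (standing at 8 and at 7) and leaves the search for the best spots to you.

```python
import math

# Elevator locations from the example in the text.
ELEVATORS = [0, 2, 10]


def mean_absolute_error(position, targets):
    """Average distance walked from `position` to each equally likely target."""
    return sum(abs(t - position) for t in targets) / len(targets)


def root_mean_square_error(position, targets):
    """Square root of the average squared distance from `position`."""
    return math.sqrt(sum((t - position) ** 2 for t in targets) / len(targets))


for spot in (8, 7):
    mae = mean_absolute_error(spot, ELEVATORS)
    rmse = root_mean_square_error(spot, ELEVATORS)
    print(f"standing at {spot}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")
# standing at 8: MAE = 5.33, RMSE = 5.89
# standing at 7: MAE = 5.00, RMSE = 5.26
```

Changing the values in the loop lets you test other candidate positions, including non-integer ones.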
Can you generalize your answers to Exercise 2? In other words, what is special about the locations you found? Can you devise shortcut methods for finding the best place to stand, one to minimize the mean absolute error and one to minimize the root mean square error, that would work for any number of elevators at any positions?
What do the solutions to these problems tell us about when to use mean absolute error and when to use root mean square error? For instance, if we built a model to project the number of days a player would spend on the IL, which metric might we want to use to evaluate it? If we worked for a player agent and, on behalf of players, created a model to project career earnings, which metric might we want to use? (Note: I am not certain that there are clear right and wrong answers to these questions so consider them fodder for conversation.)