Who made the better projections?

Suppose we find ourselves looking at the following ZiPS and Steamer 2019 home run projections for five Mets batters:

Player	ZiPS HR	Steamer HR	Actual HR
Jeff McNeil	14	11	23
Robinson Cano	16	22	13
Brandon Nimmo	13	15	8
Yoenis Cespedes	20	5	0
Todd Frazier	21	14	21

We would like to determine which of the two sets of projections was more accurate. How should we do it?

Perhaps, most simply, we could look at how many “wins” each projection system has. In this case, it turns out that ZiPS was more accurate for four players (McNeil, Cano, Nimmo and Frazier) whereas Steamer only “won” Cespedes. ZiPS goes 4-1!
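The win count can be checked with a few lines of Python. This is only a sketch: the lists are copied from the table above, and `count_wins` is a helper name made up for this illustration, not something from ZiPS or Steamer.

```python
# Projections and actual 2019 home runs, copied from the table above.
players = ["McNeil", "Cano", "Nimmo", "Cespedes", "Frazier"]
zips = [14, 16, 13, 20, 21]
steamer = [11, 22, 15, 5, 14]
actual = [23, 13, 8, 0, 21]


def count_wins(proj_a, proj_b, actual):
    """Return (wins for a, wins for b, ties), where the smaller miss wins."""
    wins_a = wins_b = ties = 0
    for a, b, y in zip(proj_a, proj_b, actual):
        miss_a, miss_b = abs(y - a), abs(y - b)
        if miss_a < miss_b:
            wins_a += 1
        elif miss_b < miss_a:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties


print(count_wins(zips, steamer, actual))  # (4, 1, 0)
```

Note that the function only compares which miss is smaller; it throws away how much smaller, which is exactly the shortcoming of counting wins.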

I might be tempted to dig a little deeper, however, both because ZiPS won (and I’d very much like to conclude something different) and because my evaluation has a shortcoming: we only looked at numbers of wins and losses and not how decisive they were. Maybe ZiPS’ narrow win on Nimmo shouldn’t count as much as Steamer’s rollicking victory on Cespedes? Let’s explore a couple of oft-used metrics for quantifying the size of projection errors: mean absolute error and root mean square error.

Calculating Mean Absolute Error

Our first step is to calculate what are called residuals. These are the differences between the actual values (in this case the numbers of home runs these Mets hit in 2019) and the predictions (from ZiPS and Steamer).

\[residual = actual - predicted\]

Player	ZiPS HR	Steamer HR	Actual HR	ZiPS Residual	Steamer Residual
Jeff McNeil	14	11	23	9	12
Robinson Cano	16	22	13	-3	-9
Brandon Nimmo	13	15	8	-5	-7
Yoenis Cespedes	20	5	0	-20	-5
Todd Frazier	21	14	21	0	7
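As a quick check on the table, here is a minimal Python sketch of the residual calculation. The lists are copied from the table, in the same player order, and `residuals` is a name chosen just for this example.

```python
# Player order: McNeil, Cano, Nimmo, Cespedes, Frazier.
zips = [14, 16, 13, 20, 21]
steamer = [11, 22, 15, 5, 14]
actual = [23, 13, 8, 0, 21]


def residuals(actual, predicted):
    """residual = actual - predicted, computed player by player."""
    return [a - p for a, p in zip(actual, predicted)]


zips_residuals = residuals(actual, zips)
steamer_residuals = residuals(actual, steamer)

print(zips_residuals)     # [9, -3, -5, -20, 0]
print(steamer_residuals)  # [12, -9, -7, -5, 7]
```

A positive residual means the projection was too low and a negative residual means it was too high.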

We could now just add up or average these residuals but that wouldn’t tell us how accurate our forecasters were since positive residuals would cancel out negative residuals. For instance, Steamer projected 12 home runs too few for McNeil and 12 home runs too many for the combination of Nimmo and Cespedes. This isn’t the same as Steamer being right on the money on all three players and we don’t want our evaluation to treat it the same way. Instead, we’ll take the absolute values of these residuals first and then average them to calculate the mean absolute error (MAE).

Player	ZiPS HR	Steamer HR	Actual HR	ZiPS Abs. Residual	Steamer Abs. Residual
Jeff McNeil	14	11	23	9	12
Robinson Cano	16	22	13	3	9
Brandon Nimmo	13	15	8	5	7
Yoenis Cespedes	20	5	0	20	5
Todd Frazier	21	14	21	0	7

\[ ZiPS\ MAE = \frac{9 + 3 + 5 + 20 + 0}{5} = 7.4\] \[ Steamer\ MAE = \frac{12 + 9 + 7 + 5 + 7}{5} = 8\]
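The same arithmetic in Python, as a sketch: the function name `mae` is our own choice, and the data are the five player seasons from the tables above.

```python
# Player order: McNeil, Cano, Nimmo, Cespedes, Frazier.
zips = [14, 16, 13, 20, 21]
steamer = [11, 22, 15, 5, 14]
actual = [23, 13, 8, 0, 21]


def mae(actual, predicted):
    """Mean absolute error: the average of |actual - predicted|."""
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(abs_errors) / len(abs_errors)


print(mae(actual, zips))     # 7.4
print(mae(actual, steamer))  # 8.0
```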

The good news is that we now have a metric that judges these projections by the size of their errors. The bad news is that ZiPS came out ahead again since smaller errors are better. Let’s try root mean square error!

Calculating Root Mean Square Error

Root mean square error is just like mean absolute error except that we square the residuals (or the absolute values of the residuals; it doesn’t matter, since squaring discards the sign) before averaging them, and then take the square root of that average. We can remember the steps in this algorithm by reading the name backwards (1. Error, 2. Square, 3. Mean, 4. Root).

Player	ZiPS HR	Steamer HR	Actual HR	ZiPS Sq. Error	Steamer Sq. Error
Jeff McNeil	14	11	23	81	144
Robinson Cano	16	22	13	9	81
Brandon Nimmo	13	15	8	25	49
Yoenis Cespedes	20	5	0	400	25
Todd Frazier	21	14	21	0	49

\[ ZiPS\ RMSE = \sqrt{\frac{9^2 + 3^2 + 5^2 + 20^2 + 0^2}{5}} \approx 10.1\] \[ Steamer\ RMSE = \sqrt{\frac{12^2 + 9^2 + 7^2 + 5^2 + 7^2}{5}} \approx 8.3\]

This time, Steamer comes out on top with a smaller root mean square error. You can see why this happened by looking at the squared errors in the table above. While in our mean absolute error calculation the large Cespedes error was fairly costly, in the root mean square error calculation it overwhelms the other errors. Root mean square error punishes forecasts most heavily for their largest errors.
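Here is a short Python sketch of the four steps (error, square, mean, root), along with a rough look at how much of ZiPS’ total error comes from Cespedes under each metric. The function name `rmse` and the "share" calculation are our own illustration, not a standard diagnostic.

```python
import math

# Player order: McNeil, Cano, Nimmo, Cespedes, Frazier.
zips = [14, 16, 13, 20, 21]
steamer = [11, 22, 15, 5, 14]
actual = [23, 13, 8, 0, 21]


def rmse(actual, predicted):
    """Root mean square error: error -> square -> mean -> root."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


print(round(rmse(actual, zips), 1))     # 10.1
print(round(rmse(actual, steamer), 1))  # 8.3

# Share of ZiPS' total error that comes from the Cespedes miss (index 3).
abs_errors = [abs(a - p) for a, p in zip(actual, zips)]
sq_errors = [e ** 2 for e in abs_errors]
abs_share = abs_errors[3] / sum(abs_errors)
sq_share = sq_errors[3] / sum(sq_errors)
print(round(abs_share, 2))  # 0.54
print(round(sq_share, 2))   # 0.78
```

The Cespedes miss is a bit over half of ZiPS’ total absolute error but more than three quarters of its total squared error, which is why the ranking flips.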

[To be clear this comparison is for the sake of practice only! I am by no means suggesting that we compare projection systems based on five player seasons chosen not at all randomly.]

Exercise #1

Calculate the mean absolute error and root mean square error for ZiPS’ triples projections for the following set of hitters:

Player	ZiPS 3B	Actual 3B
Jeff McNeil	8	1
Robinson Cano	1	0
Brandon Nimmo	5	1
Yoenis Cespedes	2	0
Todd Frazier	0	2

Which Metric Should We Use, MAE or RMSE?

It depends. To develop some intuition for what these metrics do, let’s imagine standing in a hall with three elevators at locations 0, 2 and 10.

We don’t have any information about which elevator will arrive next but, whichever comes, we’d like to position ourselves to hop on. Where should we stand? Again, it depends.

The place where we choose to stand in this case is essentially our projection for the location of the next elevator. The best choice of location depends on how we evaluate our errors. Our choice of metric ends up being quite important! It determines where we should stand.

Exercise #2

Determine the best place to wait for the elevator in order to minimize our mean absolute error. In other words, where should we stand if we want to minimize the average distance we have to walk to whichever elevator arrives?

(Example calculations: If we choose to stand at location 8, then we’re equally likely to have to walk 8 units to the elevator at location 0, 6 units to the elevator at location 2 and 2 units to the elevator at location 10. So, our average walk is \(\frac{8+6+2}{3} = 5\frac{1}{3}\) units. On the other hand, if we choose to stand at location 7, our average walk is \(\frac{7+5+3}{3} = 5\) units so standing at 7 is better than standing at 8. However, it is not the best location.)

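Determine the best place to wait for the elevator in order to minimize our root mean square error. Where should we stand if we want to minimize the average squared walking distance? (Minimizing the average squared distance also minimizes its square root, so the two goals pick out the same location.)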

(Example calculations: If we choose to stand at location 8 our RMSE is: \(\sqrt{\frac{8^2+6^2+2^2}{3}} \approx 5.89\). If we choose to stand at location 7, our RMSE is \(\sqrt{\frac{7^2+5^2+3^2}{3}} \approx 5.26\) so, once again, standing at 7 is better than standing at 8. Also again, it’s not the best location.)

(Hints: This problem can be solved through trial and error, but those who know calculus can use it. If you haven’t encountered calculus but do know how to find the vertex of a parabola, that could be useful too!)
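For those going the trial-and-error route, here is a small Python sketch that scores any candidate standing position under both metrics. The function names are made up for this exercise, and the code assumes, as the example calculations do, that each elevator is equally likely to arrive. It reproduces the example calculations for locations 7 and 8 and leaves the search for the best locations to you.

```python
import math

elevators = [0, 2, 10]


def mean_abs_walk(position, elevators):
    """Average distance walked if each elevator is equally likely."""
    return sum(abs(e - position) for e in elevators) / len(elevators)


def rms_walk(position, elevators):
    """Root mean square of the walking distances."""
    mean_sq = sum((e - position) ** 2 for e in elevators) / len(elevators)
    return math.sqrt(mean_sq)


for spot in (8, 7):
    print(spot,
          round(mean_abs_walk(spot, elevators), 2),
          round(rms_walk(spot, elevators), 2))
# 8 5.33 5.89
# 7 5.0 5.26
```

Try other values of `spot`, including non-integers, and watch how each score changes.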

Challenge Exercises:

Can you generalize your answers to Exercise 2? In other words, what is special about the locations you found? Can you devise shortcut methods of finding the best place to stand in order to minimize the mean absolute error and root mean square error that would work for any number of elevators at any positions?
What do the solutions to these problems tell us about when to use mean absolute error and when to use root mean square error? For instance, if we built a model to project the number of days a player would spend on the IL, which metric might we want to use to evaluate it? If we worked for a player agent and, on behalf of players, created a model to project career earnings, which metric might we want to use? (Note: I am not certain that there are clear right and wrong answers to these questions so consider them fodder for conversation.)

Evaluating Projections and Waiting for the Elevator: How to calculate mean absolute error and root mean square error and when to use them