This brief report assesses which prediction accuracy measure(s) may be appropriate for evaluating the forecast accuracy of the two health outcome time series using the previously developed latent process time series and spatial-temporal models.
We have several research questions that require comparing forecast accuracy across different scenarios.
Therefore, we need relative measures that are not sensitive to changes in units or scale. However, our data also contain a large number of zeros, which complicates the use of relative measures.
There are two primary types of forecast accuracy metrics:

1. Comparison between the true observed value \(y_t\) and the predicted value \(\hat{y}_t\), based on the error, e.g., \(e_t = y_t - \hat{y}_t\).
2. Comparison between the accuracy of a proposed method (\(e_t\) from above) and that of a baseline method, often a naive forecast such as the last available data point (a random walk), or one adjusting for seasonal trends only.
The second option provides more flexibility but potentially complicates comparisons between multiple proposed models; a minimal sketch of both types is given below. Given the goals of our analyses, we will likely favour the first approach.
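To make the distinction concrete, here is a minimal Python sketch (an assumed language; the data and function names are illustrative only). Type (1) compares forecasts directly with observations; type (2) is illustrated with the mean absolute scaled error (MASE) from the cited reference\(^1\), which scales the proposed model's error by that of a naive random-walk baseline.

```python
import numpy as np

def mae(y, y_hat):
    """Type (1): mean absolute error between observations and forecasts."""
    return np.mean(np.abs(np.asarray(y, float) - np.asarray(y_hat, float)))

def naive_forecast(y_train, horizon):
    """Random-walk baseline: repeat the last available observation."""
    return np.full(horizon, y_train[-1])

def mase(y_train, y_test, y_hat):
    """Type (2): MAE of the proposed forecasts, scaled by the in-sample
    MAE of one-step naive (random walk) forecasts (Hyndman & Koehler, 2006)."""
    scale = np.mean(np.abs(np.diff(np.asarray(y_train, float))))
    return mae(y_test, y_hat) / scale

# Illustrative weekly counts: train on the first 8, forecast the last 4
y = np.array([3, 5, 4, 6, 5, 7, 6, 8, 7, 9, 8, 10], dtype=float)
y_train, y_test = y[:8], y[8:]
y_hat = np.array([7.5, 8.5, 8.0, 9.5])           # forecasts from a proposed model
print(mae(y_test, y_hat))                        # type (1)
print(mae(y_test, naive_forecast(y_train, 4)))   # type (1) for the baseline
print(mase(y_train, y_test, y_hat))              # type (2)
```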
The most commonly used metrics that fit into type (1) above are:
| Measure | Formula (for Mean) |
|---|---|
| Mean (Median) absolute error, M(d)AE | \(\frac{1}{n} \sum | e_t |\) |
| Mean (Median) squared error, M(d)SE | \(\frac{1}{n} \sum e_t^2\) |
| Root Mean (Median) squared error, RM(d)SE | \(\sqrt{\frac{1}{n} \sum e_t^2}\) |
| Mean (Median) absolute percentage error, M(d)APE | \(\frac{1}{n} \sum \left| \frac{e_t}{y_t} \right|\) |
| Symmetric Mean (Median) absolute percentage error, SM(d)APE | \(\frac{1}{n} \sum \frac{|e_t|}{|y_t| + |\hat{y}_t|}\) |
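For reference, a minimal Python/NumPy sketch of these metrics (Python is assumed here; our analysis code may differ), with the mean and median versions differing only in the aggregation function `agg`:

```python
import numpy as np

def errors(y, y_hat):
    """Forecast errors e_t = y_t - y_hat_t."""
    return np.asarray(y, float) - np.asarray(y_hat, float)

def mae(y, y_hat, agg=np.mean):    # M(d)AE: pass agg=np.median for MdAE
    return agg(np.abs(errors(y, y_hat)))

def mse(y, y_hat, agg=np.mean):    # M(d)SE
    return agg(errors(y, y_hat) ** 2)

def rmse(y, y_hat, agg=np.mean):   # RM(d)SE
    return np.sqrt(mse(y, y_hat, agg))

def mape(y, y_hat, agg=np.mean):   # M(d)APE: undefined if any y_t = 0
    return agg(np.abs(errors(y, y_hat) / np.asarray(y, float)))

def smape(y, y_hat, agg=np.mean):  # SM(d)APE, as defined in the table above
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return agg(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))
```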
Measures based purely on ranking are excluded, as are those based on comparison with a naive baseline method (type (2) above).
There are benefits and limitations to each measure, and there are no standard best practices; indeed, inappropriate metrics have frequently been used even in large prediction competitions\(^1\).
| Measure | Robustness to outliers | Ability to address zeros | Comparisons across different scales |
|---|---|---|---|
| M(d)AE | Stable | Defined for all \(y_t \in \mathbb{R}\) | Scale/unit dependent |
| M(d)SE | Sensitive | Defined for all \(y_t \in \mathbb{R}\) | Scale/unit dependent |
| RM(d)SE | Sensitive | Defined for all \(y_t \in \mathbb{R}\) | Scale/unit dependent |
| M(d)APE | Stable | Undefined if any \(y_t = 0\) | Relative measure (%) |
| SM(d)APE | Stable | Undefined if \(y_t = \hat{y}_t = 0\) | Relative measure (%) |
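As a small worked illustration of the zero problem (the numbers are invented for illustration): with observed counts \(y = (0, 3, 5)\) and forecasts \(\hat{y} = (1, 3, 4)\),

\[ \mathrm{MAPE} = \frac{1}{3}\left( \frac{|0-1|}{0} + \frac{|3-3|}{3} + \frac{|5-4|}{5} \right), \]

which is undefined because of the division by \(y_1 = 0\). The scale-dependent measures remain defined here, but cannot be compared across regions with different population sizes.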
The mean or median absolute percentage error, M(d)APE, is likely the most appropriate metric of the above options given that it is scale-independent, which is why it was used in the first manuscript. However, our dataset contains too many zeros in the regions with smaller populations. I propose a modified version of the M(d)APE that standardizes the absolute error by dividing by the mean true value across the forecast horizon. Let’s call this the horizon-standardized M(d)APE:
\[ \mathrm{HS\text{-}MAPE} = \frac{1}{n} \sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{\frac{1}{n} \sum_{j=1}^n y_j} \]
This denominator will not be zero except in degenerate cases (a horizon with no observed events), is the same for every model compared on a given horizon (whereas a per-observation denominator changes with the window), and standardizes each absolute error by the mean observed value across the horizon. Note that the \(\frac{1}{n}\) factors cancel, so the HS-MAPE is simply the MAE divided by the horizon mean \(\bar{y}\). I am not sure about the theoretical properties of this metric, but most of the alternatives above do not have strong properties either.
The MAPE will be more sensitive than the MdAPE to ‘accuracy outliers’ in the forecast horizon (instability in the modelling that may lead to very large values of MAPE), so I propose using the mean and median versions together.
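A minimal sketch of the proposed metric (Python assumed; I have also assumed the median version keeps the horizon-mean denominator, since the definition above only varies the numerator aggregation):

```python
import numpy as np

def hs_mape(y, y_hat, agg=np.mean):
    """Horizon-standardized M(d)APE: absolute errors aggregated over the
    forecast horizon, divided by the mean observed value across that
    horizon. Pass agg=np.median for the HS-MdAPE version."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return agg(np.abs(y - y_hat)) / np.mean(y)

# Remains defined even when individual observations are zero,
# e.g. sparse counts from a small-population region:
y_obs  = np.array([0, 0, 3, 5])
y_pred = np.array([1, 0, 2, 4])
print(hs_mape(y_obs, y_pred))             # HS-MAPE  = 0.375
print(hs_mape(y_obs, y_pred, np.median))  # HS-MdAPE = 0.5
```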
\(^1\) Hyndman, Rob J., and Anne B. Koehler. “Another look at measures of forecast accuracy.” International Journal of Forecasting 22.4 (2006): 679-688.