Our data characteristics and research goals

This brief report assesses which prediction accuracy measure(s) are appropriate for evaluating the forecast accuracy of the two health outcome time series, using the previously developed latent process time series and spatial-temporal models.

We have several research questions that require comparing forecast accuracy across different scenarios.

  1. How does the spatial-temporal model compare to the temporal model on the same forecast horizon data? (A comparison between models for data with the same units and scale.)
  2. How does each model perform in more populated vs. sparsely populated regions of the province? (Comparisons of the same model between datasets with the same units but very different scales.)
  3. How does each model perform on each health outcome? (Comparisons of the same models on data with different units and potentially different scales.)

Therefore, we need relative measures that are not sensitive to changes in units or scale. However, our data also contain a large number of zeros, which makes many relative measures difficult to apply.

There are two primary types of forecast accuracy metrics:

  1. Comparison between the true observed value \(y_t\) and the predicted value \(\hat{y}_t\), based on the forecast error, e.g., \(e_t = y_t - \hat{y}_t\).

  2. Comparison between the accuracy of a proposed method (\(e_t\) from above) and that of a baseline method, often a naive forecast such as the last available data point (a random walk), or possibly one adjusting for seasonal trends only.

The second option provides more flexibility but also potentially complicates comparisons between multiple proposed models. Given the goals of our analyses, we will likely favour the first approach.
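
For illustration, the sketch below (in Python, with made-up numbers rather than our data) computes an error-based measure of type (1) and then re-expresses it relative to a naive random-walk baseline, as in type (2).

```python
import numpy as np

# Illustrative observed values and model forecasts over a short horizon
# (made-up numbers, not our data).
y = np.array([12.0, 15.0, 9.0, 11.0])       # observed y_t
y_hat = np.array([10.0, 14.0, 10.0, 13.0])  # model forecasts

# Type (1): accuracy based directly on the forecast errors e_t = y_t - y_hat_t
e = y - y_hat
mae_model = np.mean(np.abs(e))

# Type (2): the same errors judged against a naive random-walk baseline,
# which carries the last observed value forward over the horizon.
last_observed = 8.0
y_naive = np.full_like(y, last_observed)
mae_naive = np.mean(np.abs(y - y_naive))

# A ratio below 1 means the proposed model beats the naive baseline.
print(mae_model, mae_naive, mae_model / mae_naive)
```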

Summary of previously used metrics

The most commonly used metrics that fit into type (1) above are:

| Measure | Formula (mean version) |
| --- | --- |
| Mean (Median) absolute error, M(d)AE | \(\frac{1}{n} \sum \lvert e_t \rvert\) |
| Mean (Median) squared error, M(d)SE | \(\frac{1}{n} \sum e_t^2\) |
| Root Mean (Median) squared error, RM(d)SE | \(\sqrt{\frac{1}{n} \sum e_t^2}\) |
| Mean (Median) absolute percentage error, M(d)APE | \(\frac{1}{n} \sum \left\lvert \frac{e_t}{y_t} \right\rvert\) |
| Symmetric Mean (Median) absolute percentage error, SM(d)APE | \(\frac{1}{n} \sum \frac{\lvert e_t \rvert}{\lvert y_t \rvert + \lvert \hat{y}_t \rvert}\) |
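
The mean versions of these measures are straightforward to compute; a minimal sketch in Python is below, assuming numpy arrays of observations `y` and forecasts `y_hat`. Replacing `np.mean` with `np.median` gives the median versions.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    """Root mean squared error."""
    return np.sqrt(mse(y, y_hat))

def mape(y, y_hat):
    """Mean absolute percentage error; undefined whenever some y_t = 0."""
    return np.mean(np.abs((y - y_hat) / y))

def smape(y, y_hat):
    """Symmetric MAPE (as written above); undefined when y_t = y_hat_t = 0."""
    return np.mean(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))
```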

We exclude measures based purely on ranking, as well as those based on a naive baseline method.

There are benefits and limitations to each measure, and there are no standard best practices. In fact, inappropriate metrics have been frequently used even in large prediction competitions\(^1\).

| Measure | Robustness to outliers | Behaviour with zeros | Comparisons across different scales |
| --- | --- | --- | --- |
| M(d)AE | Stable | Defined for all \(y_t \in \mathbb{R}\) | Scale/unit dependent |
| M(d)SE | Sensitive | Defined for all \(y_t \in \mathbb{R}\) | Scale/unit dependent |
| RM(d)SE | Sensitive | Defined for all \(y_t \in \mathbb{R}\) | Scale/unit dependent |
| M(d)APE | Stable | Undefined if \(y_t = 0\) | Relative measure (%) |
| SM(d)APE | Stable | Undefined if \(y_t = \hat{y}_t = 0\) | Relative measure (%) |
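
To make the zero problem concrete: applying the M(d)APE formula to a series with zero counts, of the kind we see in sparsely populated regions, produces undefined per-time-point terms (numpy returns inf or nan), so the summary cannot be reported. The snippet below uses illustrative counts only.

```python
import numpy as np

# Illustrative weekly counts for a sparsely populated region (not real data).
y = np.array([0.0, 2.0, 0.0, 1.0])
y_hat = np.array([0.5, 1.0, 0.0, 2.0])

with np.errstate(divide="ignore", invalid="ignore"):
    ape = np.abs((y - y_hat) / y)  # per-time-point absolute percentage errors
print(ape)           # inf, 0.5, nan, 1.0 -- undefined wherever y_t = 0
print(np.mean(ape))  # nan: the MAPE cannot be summarised for this region
```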

Proposed metric for our data

The mean or median absolute percentage error, M(d)APE, is likely the most appropriate metric among the options above, given that it is scale-independent; this is why it was used in the first manuscript. However, there are too many zeros in our dataset in the regions with smaller populations. I propose a modified version of the M(d)APE that standardizes the absolute error by dividing by the mean true value across the forecast horizon. Let’s call this the horizon-standardized M(d)APE:

\[ \mathrm{HS\ MAPE} = \frac{1}{n} \sum_{i=1}^n \frac{\lvert y_i - \hat{y}_i \rvert}{\frac{1}{n} \sum_{j=1}^n y_j} \]

This denominator will not be zero except in degenerate cases, will be consistent across different models (whereas the fitting window may change), and it standardizes each absolute error by the mean observed value across the horizon. I am not sure of the theoretical properties of this metric, but most of the other options do not have strong properties either.

The mean version (HS MAPE) will be more sensitive than the median version (HS MdAPE) to ‘accuracy outliers’ in the forecast horizon (instability in the modelling that may lead to very large individual values), so I propose reporting the mean and median versions together.
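
A minimal sketch of the proposed measure is below (in Python, with illustrative counts; the function name `hs_mape` is just for this example). The same function gives the mean version by default and the median version when `np.median` is passed as the summary.

```python
import numpy as np

def hs_mape(y, y_hat, summary=np.mean):
    """Horizon-standardized M(d)APE: absolute errors divided by the mean
    observed value over the forecast horizon, then summarised with the
    mean (default) or the median."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    horizon_mean = np.mean(y)  # zero only in degenerate cases
    return summary(np.abs(y - y_hat) / horizon_mean)

# Report the mean and median versions together, as proposed.
y = [0, 3, 1, 0, 2]      # illustrative counts, including zeros
y_hat = [1, 2, 0, 1, 4]
print(hs_mape(y, y_hat))                     # HS MAPE (mean version)
print(hs_mape(y, y_hat, summary=np.median))  # HS MdAPE (median version)
```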


\(^1\) Hyndman, Rob J., and Anne B. Koehler. “Another look at measures of forecast accuracy.” International Journal of Forecasting 22.4 (2006): 679-688.