1 Introduction

The Labour Force Survey (LFS) provides a noisy signal of the level of employment. An important input we use in assessing the Stokes employment forecast is the 10-year compound annual growth rate (CAGR), and if the underlying data are noisy, the CAGR will be noisy as well. If sampling noise is the problem, smoothing may be the answer, but that raises two questions: how should we smooth the data, and how can we assess the accuracy of various smoothing methods when we never observe the truth?

The answer to these questions is to run a simulation where we, rather than Mother Nature, control the data generating process. In this simulation we draw 1000 random samples of 25 growth factors each from the known distribution of historic growth factors. Each artificial time series is the cumulative product of its 25 random growth factors, multiplied by 100 (an arbitrary base level).
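
To make the mechanics concrete, here is a minimal sketch of how one artificial series is built. The Normal growth factors here are stand-ins; the simulation below samples from the historic distribution instead.

set.seed(1)
factors <- 1 + rnorm(25, mean = 0.01, sd = 0.03) # 25 made-up growth factors
series <- 100 * cumprod(factors) # one artificial employment series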

We then apply various smoothing methods to the artificial series and compare the accuracy of the smoothed series against the known (true) CAGR: the geometric mean of all the known historic growth factors, minus one. Using historic growth factors is not strictly necessary, but it ensures that the simulated series share the characteristics of the observed historic data. Alternatively, we could draw growth factors from some other distribution (e.g. Normal). The specific distribution we draw from is not important; what matters is that the distribution is known to us.

We compare three smoothing methods:

  1. Kalman Filter
  2. STL Decomposition
  3. LOESS

Finally, we calculate the 10-year CAGR for the raw and smoothed series and assess the accuracy of the resulting growth rates using RMSE and SMAPE. Again, the "true" growth rate is the geometric mean of all the known historic growth factors, minus one; it is the random sampling of growth factors that obscures this true (known) growth rate.
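
As a concrete illustration of the "true" growth rate, with made-up factors:

factors <- c(1.03, 0.98, 1.05, 1.01) # four made-up growth factors
prod(factors)^(1 / length(factors)) - 1 # geometric mean minus one: ~0.017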

2 Load Libraries

library(tidyverse)
library(KFAS)
library(fpp3)
library(readxl)
library(here)
library(conflicted)
conflicts_prefer(dplyr::lag)
conflicts_prefer(dplyr::filter)
library(tictoc)
tic()

3 Constants

sims <- 1000 # number of simulated series
years <- 25 # growth factors (and years of data) per series

4 Define Functions

stl_decomp <- function(tbbl){
  # extract the STL trend component from a tsibble
  tbbl |>
    model(STL(value)) |>
    components() |>
    as_tibble() |>
    select(trend)
}
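
A sketch of stl_decomp() applied to a made-up annual tsibble. With non-seasonal annual data, STL decomposes the series into a trend and a remainder, and we keep only the trend:

set.seed(2)
toy <- tsibble::tsibble(year = 2000:2024,
                        value = 100 * cumprod(1 + rnorm(25, mean = 0.01, sd = 0.03)),
                        index = year)
stl_decomp(toy) # a 25-row tibble with a single trend column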

build_model <- function(y) {
  # local linear trend state-space model: SSMtrend(degree = 2) gives a
  # level and a slope state. Q holds the level and slope disturbance
  # variances; H is the observation (measurement) noise variance.
  SSModel(y ~ SSMtrend(degree = 2, Q = list(0.01, 0.001)), H = 0.05)
}
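
A minimal sketch of the model in use, on a made-up random-walk series: KFS() with smoothing = "state" returns the smoothed states in alphahat, whose first column is the level (the trend we extract below).

set.seed(3)
y <- ts(100 * cumprod(1 + rnorm(25, mean = 0.01, sd = 0.03)), start = 2000)
fit <- KFS(build_model(y), smoothing = "state")
head(fit$alphahat[, 1]) # smoothed level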

compute_cagr <- function(val) {
  # 10-year CAGR: the last value relative to the value 10 periods
  # earlier, so the ratio spans exactly 10 growth factors
  start <- val[length(val) - 10]
  end <- val[length(val)]
  (end / start)^(1 / 10) - 1
}
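
A quick sanity check with a made-up noiseless series: constant 2% growth should return a CAGR of exactly 0.02.

compute_cagr(100 * cumprod(rep(1.02, 25)))
## [1] 0.02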

do_loess <- function(tbbl) {
  # loess() defaults: span = 0.75, degree = 2 (local quadratic)
  loess_fit <- loess(value ~ year, data = tbbl)
  fitted(loess_fit)
}
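
If more or less smoothing is wanted, the span can be exposed as an argument. This variant is a sketch (the 0.5 default is arbitrary) and is not used below:

do_loess_span <- function(tbbl, span = 0.5) {
  # smaller span = wigglier fit; larger span = smoother fit
  fitted(loess(value ~ year, data = tbbl, span = span))
}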

5 Use historic growth factors for the simulation

historic <- read_excel(here("data",
                            "Employment for 64 LMO Industries 2000-2024.xlsx"),
                       sheet = "British Columbia",
                       skip = 3) |>
  filter(str_detect(`Lmo Ind Code`, "ind")) |> # keep the industry rows
  pivot_longer(starts_with("2"), names_to = "year", values_to = "value") |> # year columns to long format
  group_by(`Lmo Ind Code`) |>
  mutate(growth_factor = value / lag(value)) |> # year-over-year growth factor by industry
  na.omit() # drop each industry's first year, where the growth factor is undefined

growth_rates <- sample(historic$growth_factor, size = sims * years, replace = TRUE) - 1 # random growth rates: sampled historic growth factors, minus one
growth_rate <- prod(historic$growth_factor)^(1 / nrow(historic)) - 1 # the true growth rate: geometric mean of the historic growth factors, minus one
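
Because historic contains many growth factors, the running product in prod() can drift far from 1. An equivalent and more numerically robust formulation averages in log space; a sketch:

growth_rate_log <- exp(mean(log(historic$growth_factor))) - 1 # geometric mean via log space
all.equal(growth_rate, growth_rate_log) # should be TRUE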

6 Simulate Series

sim <- rep(1:sims, each = years)

nested <- tibble(sim = sim,
                 year = rep(2000:2024, sims),
                 growth_rates = growth_rates) |>
  group_by(sim) |>
  mutate(value = cumprod(1 + growth_rates) * 100) |> # accumulate each series from a base of 100
  select(-growth_rates) |>
  nest() |>
  mutate(ts_data = map(data, ~ ts(.x$value, start = 2000)), # ts object for KFAS
         tsibble_data = map(data, ~ tsibble::as_tsibble(.x, index = year))) # tsibble for the STL decomposition
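
Each row of nested now holds one simulated series in three forms: a plain tibble for loess(), a ts object for KFAS, and a tsibble for the STL decomposition. For example, the first simulated series can be inspected with:

nested$data[[1]]         # tibble: year, value
nested$ts_data[[1]]      # ts, annual, starting in 2000
nested$tsibble_data[[1]] # tsibble indexed by year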

7 Apply Smoothing Methods and calculate CAGRs

nested <- nested |>
  mutate(
    model = map(ts_data, build_model),
    kfs = map(model, ~ KFS(.x, smoothing = "state")),
    trend = map(kfs, ~ .x$alphahat[, 1]), # smoothed level from the Kalman smoother
    stl = map(tsibble_data, stl_decomp),
    stl = map(stl, ~ ts(.x$trend, start = 2000)), # convert the STL trend back to a ts
    loess = map(data, do_loess),
    kalman_cagr = map_dbl(trend, compute_cagr),
    raw_cagr = map_dbl(ts_data, compute_cagr),
    stl_cagr = map_dbl(stl, compute_cagr),
    loess_cagr = map_dbl(loess, compute_cagr)
  )

8 Combine and Evaluate Results

long <- nested |>
  select(sim, kalman_cagr, raw_cagr, stl_cagr, loess_cagr) |>
  pivot_longer(-sim, names_to = "type", values_to = "cagr") |>
  mutate(
    squared_deviation = (cagr - growth_rate)^2,
    # SMAPE term: |error| / (|estimate| + |truth|). This variant omits the
    # conventional factor of 2, so it is bounded between 0 and 1.
    relative_deviation = abs(cagr - growth_rate) / (abs(cagr) + abs(growth_rate))
  ) |>
  group_by(type)

9 RMSE and SMAPE Comparison

long |>
  summarise(
    rmse = sqrt(mean(squared_deviation)),
    smape = mean(relative_deviation)
  ) |>
  arrange(rmse) |>
  DT::datatable(
    options = list(pageLength = 5),
    rownames = FALSE,
    caption = "RMSE and SMAPE of CAGR Estimates by Smoothing Method") |>
  DT::formatRound(columns = c("rmse", "smape"), digits = 3)
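
For reference, conventional SMAPE includes a factor of 2 in the numerator. Since that is a constant rescaling, it leaves the ranking of the methods unchanged; a sketch:

long |>
  summarise(smape_conventional = mean(2 * relative_deviation)) # same ordering as above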

10 Density Plot of CAGR Distributions

long |>
  ggplot(aes(x = cagr)) +
  geom_vline(xintercept = growth_rate, color = "white", lwd = 2) +
  geom_density(alpha = 0.25, fill = "grey20") +
  labs(title = "CAGR Distributions by Smoothing Method",
       x = "CAGR", y = "Density") +
  facet_wrap(~fct_reorder(type, cagr, sd), nrow = 1)

11 Why does LOESS underperform?

LOESS is a local polynomial smoother that treats time as just another predictor. Several features of the method work against it here:

  1. It assumes independent observations, ignoring time order and autocorrelation.

  2. Its fixed window (the span) can cause over- or under-smoothing.

  3. It only smooths the data; it does not model the underlying trend dynamics.

  4. It is sensitive to outliers and to boundary effects, and the CAGR calculation leans on the last fitted value, exactly where LOESS is least reliable (see the sketch below).
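
A minimal sketch of the span and boundary sensitivity, using one made-up noisy 2% growth series and arbitrary candidate spans (compute_cagr() is the function defined above):

set.seed(42)
year  <- 2000:2024
value <- 100 * cumprod(1 + rnorm(25, mean = 0.02, sd = 0.03))
sapply(c(0.5, 0.75, 1), function(s) {
  # refit LOESS with each span and recompute the 10-year CAGR from the
  # fitted values; the estimate hinges on the last fitted point
  compute_cagr(fitted(loess(value ~ year, span = s)))
})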

toc()
## 38.687 sec elapsed