The Labour Force Survey (LFS) provides us a noisy signal regarding the level of employment. An important bit of information that we utilize in assessing the Stokes employment forecast is the 10 year compound annual growth rate (CAGR). Obviously if the underlying data is noisy the CAGR will also be noisy. If sampling noise is the problem, maybe smoothing is the answer: but how to smooth the data? How can we assess accuracy of various smoothing methods when we never observe the truth?
The answer to these questions is to run a simulation where we, rather than mother nature, control the data generating process. In this simulation we will draw 1000 random samples of 25 growth factors from the known distribution of historic growth factors. Each artificial time series is the cumulative product of the 25 random growth factors multiplied by 100 (an arbitrary base level).
We then apply various smoothing methods to the artificial series and compare the accuracy of the smoothed series to the known (true) CAGR: the geometric mean of all the known historic growth factors, minus one. Using historic growth factors is not necessary but it ensures that the simulated series share similar characteristics with the observed historic data. Alternatively we could draw growth factors from some other distribution (e.g. Normal): the specific distribution that we draw from is not important, what is important is that the distribution is known to us.
We compare 3 smoothing methods:
Finally we calculate the 10 year CAGR, and then assess the accuracy of the growth rates with RMSE and SMAPE. Again, the “true” growth rate is the geometric mean of all the known historic growth factors, minus one. It is the random sampling of growth factors which obfuscates the true (known) growth rate.
stl_decomp <- function(tbbl){
tbbl |>
model(STL(value))|>
components()|>
tibble()|>
select(trend)
}
build_model <- function(y) {
SSModel(y ~ SSMtrend(degree = 2, Q = list(0.01, 0.001)), H = 0.05)
}
compute_cagr <- function(val) {
start <- val[length(val)-10]
end <- val[length(val)]
(end / start)^(.1) - 1
}
do_loess <- function(tbbl) {
loess_fit <- loess(value ~ year, data = tbbl)
fitted(loess_fit)
}
historic <- read_excel(here("data",
"Employment for 64 LMO Industries 2000-2024.xlsx"),
sheet = "British Columbia",
skip = 3) |>
filter(str_detect(`Lmo Ind Code`, "ind")) |>
pivot_longer(starts_with("2"), names_to = "year", values_to = "value") |>
group_by(`Lmo Ind Code`) |>
mutate(growth_factor = value / lag(value)) |>
na.omit()
growth_rates <- sample(historic$growth_factor, size = sims * years, replace = TRUE)-1 # a random sample from historic growth factors
growth_rate <- prod(historic$growth_factor)^(1/nrow(historic))-1 #the true growth rate we are trying to find.
sim <- rep(1:sims, each = years)
nested <- tibble(sim = sim,
year = rep(2000:2024, sims),
growth_rates = growth_rates) %>%
group_by(sim) %>%
mutate(value = cumprod(1 + growth_rates) * 100) %>%
select(-growth_rates) |>
nest() |>
mutate(ts_data = map(data, ~ ts(.x$value, start = 2000)),
tsibble_data = map(data, ~ tsibble::as_tsibble(.x, index = year)))
nested <- nested %>%
mutate(
model = map(ts_data, build_model),
kfs = map(model, ~ KFS(.x, smoothing = "state")),
trend = map(kfs, ~ .x$alphahat[, 1]),
stl = map(tsibble_data, stl_decomp),
stl = map(stl, ~ ts(.x$trend, start = 2000)),
loess = map(data, do_loess),
kalman_cagr = map_dbl(trend, compute_cagr),
raw_cagr = map_dbl(ts_data, compute_cagr),
stl_cagr = map_dbl(stl, compute_cagr),
loess_cagr = map_dbl(loess, compute_cagr)
)
long |>
summarise(
rmse = sqrt(mean(squared_deviation)),
smape = mean(relative_deviation)
) |>
arrange(rmse)|>
DT::datatable(
options = list(pageLength = 5),
rownames = FALSE,
caption = "RMSE and SMAPE of CAGR Estimates by Smoothing Method")|>
DT::formatRound(columns = c("rmse", "smape"), digits = 3)
long %>%
ggplot(aes(x = cagr)) +
geom_vline(xintercept = growth_rate, color = "white", lwd = 2) +
geom_density(alpha = 0.25, fill = "grey20") +
labs(title = "CAGR Distributions by Smoothing Method",
x = "CAGR", y = "Density") +
facet_wrap(~fct_reorder(type, cagr, sd), nrow = 1)
LOESS is a local polynomial smoother, treating time as just another variable.
Assumes independence of points—ignores time order and autocorrelation.
Fixed window size can cause over- or under-smoothing.
Doesn’t model, only smooths.
Sensitive to outliers and boundary effects.
## 38.687 sec elapsed