# If the goal is to estimate, don’t throw out data, and don’t use CAGR

```r
num_sims <- 10000            # number of simulated series per length
nobs <- c(5, 10, 25, 50)     # lengths of the simulated time series
sim <- 1:num_sims
```

In my presentation of the “Martin” model, there was some discussion regarding how much data to use (all available vs. the most recent 10 years), and how to calculate growth (CAGR or a regression slope). This note demonstrates why I used all available data and regression: I am trying to estimate the true (i.e. population) growth rate, not create a descriptive statistic of the recent sample data.
We are going to run a simulation where time series have varying lengths: 5, 10, 25, 50. There will be 10,000 simulations of each length. The time series are going to be random noise centered on 10, so we know the true growth rate is 0.
```r
# Simulate a series with no trend: white noise centred on a level of 10
create_series <- function(nobs){
  tibble(time = 1:nobs, value = 10 + rnorm(nobs))
}
```

One way to calculate the growth rate is to run a linear regression of the log of the value on time, then exponentiate the slope coefficient and subtract 1.
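In symbols, the fitted model and the implied growth rate are:

$$
\log(v_t) = \alpha + \beta t + \varepsilon_t, \qquad \hat{g}_{\text{reg}} = e^{\hat{\beta}} - 1.
$$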
```r
# Growth rate implied by the slope of a log-linear regression
lm_rate <- function(tbbl) {
  coef <- coef(lm(log(value) ~ time, data = tbbl))["time"]
  (exp(coef) - 1)
}
```

An alternative growth measure is the compound annual growth rate (CAGR), which is based only on the first and last observations.
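In symbols, with \(v_1\) the first value, \(v_n\) the last value, and \(n\) the number of observations:

$$
\text{CAGR} = \left(\frac{v_n}{v_1}\right)^{\frac{1}{n-1}} - 1.
$$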
```r
# CAGR uses only the first and last observations and ignores everything in between
cagr <- function(tbbl) {
  start <- first(tbbl$value)
  end <- last(tbbl$value)
  ((end / start)^(1 / (nrow(tbbl) - 1)) - 1)
}
```
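As a quick sanity check, we can apply both functions to a single simulated series (the seed here is arbitrary); with no trend in the data, both estimates should be close to zero:

```r
set.seed(123)                       # arbitrary seed, just to make this illustration reproducible
example_series <- create_series(25) # one flat series of length 25
lm_rate(example_series)             # regression-based estimate, uses all 25 observations
cagr(example_series)                # CAGR, uses only the first and last observations
```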
We create 10,000 time series of each length (5, 10, 25, and 50), and then calculate the growth rates both ways.

```r
tbbl <- crossing(nobs, sim)|>
  mutate(data = map(nobs, create_series),
         `Linear Regression` = map_dbl(data, lm_rate),
         CAGR = map_dbl(data, cagr)
         )|>
  select(-sim, -data)|>
  pivot_longer(cols = -nobs, names_to = "Method", values_to = "Estimated Growth Rate")|>
  mutate(
    nobs = factor(paste("Length of time series:", nobs),
                  levels = paste("Length of time series:", sort(unique(nobs))))
  )
```

Next we plot the results, which demonstrate that while neither method appears to be biased, linear regression has substantially less variance.
```r
ggplot(tbbl, aes(`Estimated Growth Rate`, fill = Method))+
  geom_density(alpha = .5)+
  facet_wrap(~nobs, scales = "free")+
  labs(title = "Here we compare two methods of calculating growth: CAGR vs linear regression slope. Simulation data is random noise so true growth rate is 0.",
       subtitle = "Both methods are unbiased, but linear regression has less variance (except in the edge case where the time series only has two observations). Each panel is based on 10,000 simulations.")+
  scale_x_continuous(labels = scales::percent)
```

Finally, let's look at the proportion of estimated growth rates that fall inside/outside of the truth \(\pm .5\%\).
```r
for_dt <- tbbl|>
  group_by(nobs, Method)|>
  mutate(location = case_when(`Estimated Growth Rate` < -.005 ~ "Bottom Tail",
                              `Estimated Growth Rate` > .005 ~ "Top Tail",
                              TRUE ~ "Within .5%"
                              )
         )|>
  group_by(nobs, Method, location)|>
  summarize(location_count = n())|>
  group_by(nobs, Method)|>
  mutate(location_prop = location_count/sum(location_count))|>
  select(-location_count)|>
  pivot_wider(names_from = location, values_from = location_prop, values_fill = 0)

# Pull out a few of the proportions for use in the conclusions below
cagr_within_10 <- for_dt$`Within .5%`[for_dt$nobs == "Length of time series: 10" & for_dt$Method == "CAGR"]
cagr_within_25 <- for_dt$`Within .5%`[for_dt$nobs == "Length of time series: 25" & for_dt$Method == "CAGR"]
regression_within_10 <- for_dt$`Within .5%`[for_dt$nobs == "Length of time series: 10" & for_dt$Method == "Linear Regression"]

DT::datatable(for_dt)
```

Conclusions:
- For our favourite 10-year horizon, using CAGR yields an estimate that is within \(\pm .5\%\) of the truth 24% of the time, whereas with linear regression the estimate is within \(\pm .5\%\) of the truth 34% of the time.
- For both methods there is a substantial increase in the probability that the estimate is within \(\pm .5\%\) of the truth when going from 10 years to 25 years (the RTRA maximum): for CAGR, for example, the percentage within \(\pm .5\%\) of the truth goes from 24% for 10 years of data to 60% for 25 years.
A final word: the title of this note is redundant. Linear regression universally out-performs CAGR because regression uses all available data, whereas CAGR is based solely on the end points and discards the rest.
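A rough back-of-envelope calculation makes the same point. If we treat the log of the value as linear in time plus i.i.d. noise with variance \(\sigma^2\) (approximately the situation in this simulation), and write \(\hat{\beta}\) for the estimated per-period log growth rate of a series of length \(n\), then

$$
\operatorname{Var}\!\left(\hat{\beta}_{\text{CAGR}}\right) = \frac{2\sigma^2}{(n-1)^2},
\qquad
\operatorname{Var}\!\left(\hat{\beta}_{\text{OLS}}\right) = \frac{\sigma^2}{\sum_{t=1}^{n}(t-\bar{t})^2} = \frac{12\sigma^2}{n(n^2-1)},
$$

so the ratio of the CAGR variance to the regression variance is \(n(n+1)/\bigl(6(n-1)\bigr)\): exactly 1 when \(n=2\) (the edge case noted in the plot), and growing roughly linearly in \(n\) thereafter.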