library(tidyverse) # tibble, crossing, map, pivot_longer, ggplot2
num_sims <- 10000
nobs <- c(2,10,25,50)
sim <- 1:num_sims
If the goal is to estimate, don’t throw out data, and don’t use CAGR
In my presentation of the “Martin” model, there was some discussion about how much data to use (all available vs. the most recent 10 years) and how to calculate growth (CAGR vs. regression slope). This note demonstrates why I used all available data and regression: I am trying to estimate the true (aka population¹) growth rate, not to create a descriptive statistic of the recent sample data.
We are going to run a simulation where time series have varying lengths: 2, 10, 25, 50. There will be 10,000 simulations of each length. The time series are going to be random noise centered on 10, so we know the true growth rate is 0.
create_series <- function(nobs){
  tibble(time=1:nobs, value=10+rnorm(nobs))
}
One way to calculate the growth rate is to regress the log of the value on time, exponentiate the slope coefficient, and subtract 1.
lm_rate <- function(tbbl) {
  # Slope of log(value) on time, converted to a per-period growth rate
  slope <- coef(lm(log(value) ~ time, data = tbbl))["time"]
  (exp(slope) - 1)
}
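As a sanity check on the regression approach, here is a minimal base-R sketch applied to a noise-free series with a known growth rate. The function name `lm_growth` and the 5% series are hypothetical, chosen only for illustration; they are not part of the simulation in this note.

```r
# Base-R restatement of the regression growth rate (hypothetical name).
lm_growth <- function(df) {
  slope <- coef(lm(log(value) ~ time, data = df))[["time"]]
  exp(slope) - 1
}

# Hypothetical noise-free series growing exactly 5% per period.
steady <- data.frame(time = 1:10, value = 100 * 1.05^(0:9))

lm_growth(steady) # approximately 0.05
```

With no noise, the log of the series is exactly linear in time, so the method recovers the growth rate that was built in.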
An alternative growth measure is the compound annual growth rate (CAGR), which uses only the first and last observations.
cagr <- function(tbbl) {
  start <- first(tbbl$value)
  end <- last(tbbl$value)
  ((end / start)^(1 / (nrow(tbbl) - 1)) - 1)
}
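To make concrete how CAGR treats the data, here is a small base-R sketch. The name `cagr_growth` and the three toy series are hypothetical. It shows that changing the interior observations leaves CAGR untouched, while disturbing a single endpoint moves it.

```r
# Base-R restatement of CAGR (hypothetical name): endpoints only.
cagr_growth <- function(df) {
  start <- df$value[1]
  end <- df$value[nrow(df)]
  (end / start)^(1 / (nrow(df) - 1)) - 1
}

# Hypothetical flat series: true growth is 0.
flat <- data.frame(time = 1:10, value = rep(10, 10))

# Perturb only interior observations.
middle <- flat
middle$value[4:6] <- c(12, 8, 13)

# Perturb only the final observation.
bumped <- flat
bumped$value[10] <- 11

cagr_growth(flat)   # 0
cagr_growth(middle) # still 0: interior values are ignored
cagr_growth(bumped) # (11/10)^(1/9) - 1, roughly 1.06% per period
```

Since everything between the endpoints is discarded, any noise in the first or last observation passes straight into the estimate.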
We create 10,000 time series of each length (2, 10, 25, and 50), and then calculate the growth rate of each series both ways.
tbbl <- crossing(nobs, sim)|>
  mutate(data=map(nobs, create_series),
         `Linear Regression`=map_dbl(data, lm_rate),
         CAGR=map_dbl(data, cagr)
  )|>
  select(-sim, -data)|>
  pivot_longer(cols=-nobs, names_to = "Method", values_to = "Estimated Growth Rate")|>
  mutate(
    nobs = factor(paste("Length of time series:", nobs), levels = paste("Length of time series:", sort(unique(nobs))))
  )
Next we plot the results, which show that while neither method appears to be biased, linear regression has substantially less variance.
ggplot(tbbl, aes(`Estimated Growth Rate`, fill=Method))+
geom_density(alpha=.5)+
facet_wrap(~nobs, scales="free")+
labs(title="Here we compare two methods of calculating growth: CAGR vs linear regression slope. Simulation data is random noise so true growth rate is 0.",
subtitle="Both methods are unbiased, but linear regression has less variance (except in edge case where the time series only has two observations). Each panel is based on 10000 simulations.")+
scale_x_continuous(labels=scales::percent)
Finally, let's look at the proportion of estimated growth rates that fall inside and outside the truth \(\pm 1\%\).
for_dt <- tbbl|>
  group_by(nobs, Method)|>
  mutate(location=case_when(`Estimated Growth Rate` < -.01 ~ "Bottom Tail",
                            `Estimated Growth Rate` > .01 ~ "Top Tail",
                            TRUE ~ "Within 1%"
  )
  )|>
  group_by(nobs, Method, location)|>
  summarize(location_count=n())|>
  group_by(nobs, Method)|>
  mutate(location_prop=location_count/sum(location_count))|>
  select(-location_count)|>
  pivot_wider(names_from = location, values_from = location_prop, values_fill = 0)
cagr_within_10 <- for_dt$`Within 1%`[for_dt$nobs=="Length of time series: 10" & for_dt$Method=="CAGR"]
cagr_within_25 <- for_dt$`Within 1%`[for_dt$nobs=="Length of time series: 25" & for_dt$Method=="CAGR"]
regression_within_10 <- for_dt$`Within 1%`[for_dt$nobs=="Length of time series: 10" & for_dt$Method=="Linear Regression"]
DT::datatable(for_dt)
Conclusions:
- At one extreme (2 data points), the two methods are identical in the probability of getting an estimate within \(\pm 1\%\) of the truth (pretty unlikely).
- At the other extreme (50 data points), the two methods are nearly identical in the probability of getting an estimate within \(\pm 1\%\) of the truth (almost certain).
- For our favourite 10-year horizon, CAGR yields an estimate within \(\pm 1\%\) of the truth 48% of the time, whereas linear regression does so 64% of the time.
- For both methods, the probability that the estimate is within \(\pm 1\%\) of the truth increases substantially when going from 10 years to 25 years (the RTRA max). For example, for CAGR the percentage within \(\pm 1\%\) of the truth goes from 48% with 10 years of data to 91% with 25 years.
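The first conclusion, that the two methods coincide with only two data points, follows from algebra rather than chance. A line fitted to two points passes through both, so the slope of log(value) on time is log(v2) - log(v1), and exponentiating and subtracting 1 gives v2/v1 - 1, which is exactly CAGR over one period. A minimal base-R check, with an arbitrary seed and hypothetical variable names:

```r
set.seed(1) # arbitrary seed, only for reproducibility

# One two-point series of noise centered on 10, as in the simulation.
two <- data.frame(time = 1:2, value = 10 + rnorm(2))

# Regression-based growth rate.
slope <- coef(lm(log(value) ~ time, data = two))[["time"]]
reg_growth <- exp(slope) - 1

# CAGR over a single period reduces to end / start - 1.
cagr_growth <- two$value[2] / two$value[1] - 1

all.equal(reg_growth, cagr_growth) # TRUE, up to floating-point tolerance
```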
Footnotes
¹ In the statistical, not demographic, sense.