Forecasting Repeated Accumulating Processes with Semiparametric Regression Models and Bayesian Updates*

Harlan D. Harris, PhD
NY Open Statistical Programming Meetup, May 2017

* AKA, “Cancelling Classes for Fun and Profit”

1. Repeated Accumulating Processes

Problem 1: Forecasting Class Size

plot of chunk forecasting_class_plot

  • Cancel or Run?
  • Early Warning System
  • Imprecision, Accumulating Data

Problem 2: Forecasting Monthly Sales

plot of chunk sales_plot

  • Will We Meet Sales Targets?
  • Early Warning System
  • Imprecision, Accumulating Data

Desiderata

  • Care about sum of high-variance measurements.
  • Grouping is part of data-generating process; not arbitrary.
  • Want updates as time progresses.
  • Have covariates; not pure time-series.
  • Want prediction intervals (honest uncertainty).

Sales Data (Simulated!)

library(tidyverse)  # tibble, dplyr, ggplot2, ...
library(magrittr)   # %<>% compound-assignment pipe
library(gamlss)     # semiparametric distributional regression
library(purrrlyr)   # by_row()
library(forecast)   # auto.arima(), forecast()

years <- 3; dpm <- 28  # dpm = days per (simplified 28-day) month
mo  <- function(t) 1 + (t-1) %/% dpm
dom <- function(t) 1 + (t-1) %% dpm
dat <- tibble(t=seq.int(years*12*dpm),
              t_mo=mo(t),
              t_dom=dom(t),
              bts=t_mo %% 12 == 8, # Back to School!
              set=factor(ifelse(t > 33*dpm, "Test", "Train"),
                         levels=c("Train", "Test")),
              sales = 100 + (t/5) + 3 * 
                      arima.sim(list(order = c(1,1,0), 
                                     ar = .1), 
                                n = years*12*dpm - 1))
# add bts effect
dat %<>% mutate(sales = sales * (1 + .25 * bts))
# add dom effect
dat %<>% mutate(sales = sales * (1 + .4 * (t_dom/28)^2))
# add both additive and multiplicative noise
dat %<>% mutate(sales = sales *
                rnorm(years*12*dpm, 1, .2) + 
                rnorm(years*12*dpm, 0, 20))
# force to be *nonnegative integer*
dat %<>% mutate(sales = pmax(0, round(sales)))

plot of chunk simdata_plot

Divide and Conquer

monthly <- dat %>% 
  group_by(t_mo, set, bts) %>%
  summarise(sales=sum(sales))
glimpse(monthly)
Observations: 36
Variables: 4
$ t_mo  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
$ set   <fctr> Train, Train, Train, Train, Train, Train, Train, Train,...
$ bts   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, F...
$ sales <dbl> 3405, 2924, 3474, 3875, 3788, 4060, 4671, 6059, 5039, 53...
daily <- dat %>%
  group_by(t_mo) %>%
  mutate(monthly_sales=sum(sales),
         bod_sales=lag(sales, default = 0), # sales as of the *beginning* of the day
         prop_sales = cumsum(bod_sales) / 
           monthly_sales,                   # fraction of the month's sales seen so far
         prop_month=(t_dom-1) / dpm) %>%    # fraction of the month elapsed
  ungroup()
glimpse(daily)
Observations: 1,008
Variables: 10
$ t             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
$ t_mo          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ t_dom         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
$ bts           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ set           <fctr> Train, Train, Train, Train, Train, Train, Train...
$ sales         <dbl> 115, 91, 109, 125, 137, 121, 110, 108, 93, 93, 1...
$ monthly_sales <dbl> 3405, 3405, 3405, 3405, 3405, 3405, 3405, 3405, ...
$ bod_sales     <dbl> 0, 115, 91, 109, 125, 137, 121, 110, 108, 93, 93...
$ prop_sales    <dbl> 0.00000000, 0.03377386, 0.06049927, 0.09251101, ...
$ prop_month    <dbl> 0.00000000, 0.03571429, 0.07142857, 0.10714286, ...

Divide and Conquer

plot of chunk prior_graph

plot of chunk timecourse_plot1

1. Repeated Accumulating Processes

2. Modeling the Prior

Bayes' Rule

  • Prior = monthly sales estimate, prior to seeing any sales
  • Likelihood = P(observing sales to date | specific final sales #)
  • Posterior = current estimate; prior updated by sales to date

\[ P(\text{forecast sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{forecast sales}=y) \cdot P(\text{forecast sales}=y) \\ \equiv \\ P(\text{fraction sales to date}=\frac{x}{y}|\text{fraction of month}=z) \cdot P(\text{forecast sales} = y) \]

Modeling the Prior, Conceptually

plot of chunk prior_concep

\[ \text{sales} = f(t,x) + N(0, \sigma) \\ \text{sales} = f(t,x) + N(0, g(t,x)) \\ \text{sales} = f(t,x) + ?(0, g(t,x), h(t,x)) \]

Generalized Additive Models

“Mathematically speaking, GAM is an additive modeling technique where the impact of the predictive variables is captured through smooth functions which—depending on the underlying patterns in the data—can be nonlinear:”

\[ g(E(Y)) = \beta_0 + s_1(x_1) + s_2(x_2) + \cdots + s_p(x_p) \]

GAM: The Predictive Modeling Silver Bullet (Kim Larsen, Stitch Fix, 2015)

Prophet

  • Facebook's open-source Bayesian time-series forecasting framework.
  • Time is a covariate; complex seasonality; non-clockwork periods.
  • Non-accumulating. Fit with Stan!

\[ \log(y) = s(\text{time}) + \text{seasonality} + \text{holidays} + \text{changepoints} + \text{error} \]
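For flavor, a minimal Prophet call in R might look like the following. (A sketch on made-up data, not from the original deck; ds and y are the column names the package expects.)

library(prophet)
# a year of fake daily observations
df <- data.frame(ds = seq(as.Date("2014-01-01"), by = "day",
                          length.out = 365),
                 y  = log(100 + rpois(365, 20)))
m <- prophet(df)                                  # fit trend + seasonality
future <- make_future_dataframe(m, periods = 30)  # extend 30 days ahead
fc <- predict(m, future)   # yhat, with yhat_lower / yhat_upper intervals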

Prophet

GAMLSS

  • Generalized Additive Models for Location, Scale, & Shape
    (Stasinopoulos & Rigby, 2007)
  • LSS = Independently model all parameters of a distribution
  • Distributions need not be exponential-family: Skew \( t \), Zero-Inflated Poisson, Beta, dozens more…
  • Forecast shape of uncertainty in new cases

e.g.:

  • Sales ~ spline(time) + back to school
  • Var(Sales) ~ time
  • Tail-heaviness(Sales) ~ time

Modeling the Prior

monthly_training <- filter(monthly, set=="Train")
# t-family (TF) response for log(sales): spline in month for the mean,
# with scale (sigma) and tail-heaviness (nu) each linear in month
prior_fit <- gamlss(log(sales) ~ cs(t_mo) + bts,
                    sigma.formula = ~ t_mo, nu.formula = ~ t_mo,
                    family=TF, data=monthly_training,
                    control=gamlss.control(trace=FALSE))

Family:  c("TF", "t Family") 
Fitting method: RS() 

Call:  
gamlss(formula = log(sales) ~ cs(t_mo) + bts, sigma.formula = ~t_mo,  
    nu.formula = ~t_mo, family = TF, data = monthly_training,  
    control = gamlss.control(trace = FALSE)) 

Mu Coefficients:
(Intercept)     cs(t_mo)      btsTRUE  
    8.27662      0.01552      0.25989  
Sigma Coefficients:
(Intercept)         t_mo  
  -2.755428     0.007908  
Nu Coefficients:
(Intercept)         t_mo  
     2.1004       0.3959  

 Degrees of Freedom for the fit: 9.999 Residual Deg. of Freedom   23 
Global Deviance:     -78.6846 
            AIC:     -58.6858 
            SBC:     -43.7216 
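Sanity check: the mu coefficients are on the log-sales scale, so btsTRUE = 0.26 implies a multiplier of exp(0.26) ≈ 1.30, reassuringly close to the 25% back-to-school bump baked into the simulation.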

Modeling the Prior

plot of chunk prior_plots

Prior Computation

prior_dist <- function(t_mo, bts, sales_range=2000:15000) {
  # predict mu/sigma/nu for this month, then evaluate the t-family
  # density over a grid of plausible final sales figures
  with(predictAll(prior_fit, 
                  newdata=data.frame(t_mo=t_mo, bts=bts)),
       data_frame(sales=sales_range,
              d=dTF(log(sales_range), mu=mu, 
                    sigma=sigma, nu=nu),
              t_mo=t_mo, bts=bts) %>%
         mutate(d=d/sum(d)))  # normalize over the grid
}

samp_priors <- data_frame(t_mo=c(15,36), 
                          bts=c(FALSE, TRUE)) %>%
  by_row(~ prior_dist(.$t_mo, .$bts), 
         .collate='rows') 

plot of chunk oneprior

1. Repeated Accumulating Processes

2. Modeling the Prior

3. Updating the Prior

Modeling the Likelihood, Conceptually

\[ P(\text{sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{sales}=y) \cdot P(\text{sales}=y) \\ \text{or} \\ P(\text{fraction sales to date}=\frac{x}{y}|\text{fraction of month}=z) \cdot P(\text{sales} = y) \]

plot of chunk likelihood_concept

Modeling the Likelihood

daily_training <- filter(daily, 
                         set=="Train")
# BEINF0: Beta inflated at zero, since prop_sales lives in [0,1)
# with a point mass at exactly 0 on the first day of each month
timecourse_fit <- 
  gamlss(prop_sales ~ cs(prop_month),
         sigma.formula = ~ cs(prop_month),
         nu.formula = ~ cs(prop_month),
         data=daily_training,
         family=BEINF0)
GAMLSS-RS iteration 1: Global Deviance = -2658.72 
GAMLSS-RS iteration 2: Global Deviance = -3675.776 
GAMLSS-RS iteration 3: Global Deviance = -3864.726 
GAMLSS-RS iteration 4: Global Deviance = -3918.817 
GAMLSS-RS iteration 5: Global Deviance = -3936.177 
GAMLSS-RS iteration 6: Global Deviance = -3942.352 
GAMLSS-RS iteration 7: Global Deviance = -3944.684 
GAMLSS-RS iteration 8: Global Deviance = -3945.591 
GAMLSS-RS iteration 9: Global Deviance = -3945.944 
GAMLSS-RS iteration 10: Global Deviance = -3946.058 
GAMLSS-RS iteration 11: Global Deviance = -3946.11 
GAMLSS-RS iteration 12: Global Deviance = -3946.134 
GAMLSS-RS iteration 13: Global Deviance = -3946.145 
GAMLSS-RS iteration 14: Global Deviance = -3946.15 
GAMLSS-RS iteration 15: Global Deviance = -3946.152 
GAMLSS-RS iteration 16: Global Deviance = -3946.153 

Family:  c("BEINF0", "Beta Inflated zero") 
Fitting method: RS() 

Call:  
gamlss(formula = prop_sales ~ cs(prop_month), sigma.formula = ~cs(prop_month),  
    nu.formula = ~cs(prop_month), family = BEINF0,  
    data = daily_training) 

Mu Coefficients:
   (Intercept)  cs(prop_month)  
        -2.614           4.886  
Sigma Coefficients:
   (Intercept)  cs(prop_month)  
      -2.57953        -0.01477  
Nu Coefficients:
   (Intercept)  cs(prop_month)  
        -2.333         -71.074  

 Degrees of Freedom for the fit: 15 Residual Deg. of Freedom   909 
Global Deviance:     -3946.15 
            AIC:     -3916.15 
            SBC:     -3843.72 

Modeling the Likelihood

plot of chunk likelihood_plots

Likelihood Computation

likelihood_dist <- function(prop_month, sales_to_date,
                            sales_range=2000:15000) {
  # for each candidate final sales figure, compute the implied
  # fraction-to-date, then evaluate the zero-inflated Beta density there
  with(predictAll(timecourse_fit, 
          newdata=data.frame(prop_month=prop_month)),
     tibble(prop_sales=pmin(1, sales_to_date / sales_range),
            sales=sales_range,
            d=dBEINF0(prop_sales, mu=mu, 
                      sigma=sigma, nu=nu)) %>%
       mutate(d = ifelse(prop_sales >= 1, 0, d), # can't already have >= 100%
              d = d/sum(d)))
}

samp_likelihoods <- tibble(prop_month=c(.03, .5, .90),
                           std=c(200, 3000, 5000)) %>%  # std = sales to date
  by_row(~ likelihood_dist(.$prop_month, .$std), 
         .collate='rows') 

plot of chunk onelh

Computing the Posterior

\[ \forall y, P(\text{sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{sales}=y) \cdot \\ P(\text{sales}=y) \]

Algorithm:

  • For all plausible final sales numbers,
  • Compute the prior and likelihood distributions & multiply,
  • Then normalize so the posterior is a distribution.
  • Plot or read off quantiles as needed.
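In code, the whole update can be wrapped in one small helper. This posterior_dist is hypothetical (a sketch, not from the original deck), reusing prior_dist and likelihood_dist from above; both default to the same sales_range grid, so the densities line up element-wise:

posterior_dist <- function(t_mo, bts, prop_month, sales_to_date) {
  prior <- prior_dist(t_mo, bts)
  llh   <- likelihood_dist(prop_month, sales_to_date)
  data_frame(sales = prior$sales,
             d = prior$d * llh$d) %>%  # prior times likelihood...
    mutate(d = d / sum(d))             # ...renormalized to a distribution
}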

Scenario

It’s December 10th, 2016. We’ve seen 4000 sales to date. Will we hit our goal of 10,000 sales?

sample_prior <- prior_dist(36, FALSE)  # month 36: December, not back-to-school

sales_mode <-    # most probable final sales figure
  function(dd) dd$sales[[which.max(dd$d)]]
sales_thresh <-  # probability of reaching the threshold
  function(dd, th=10000) sum(dd$d[dd$sales >= th])
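Reading the prior alone with the helpers just defined:

sales_mode(sample_prior)    # most likely final sales under the prior
sales_thresh(sample_prior)  # prior probability of reaching 10,000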

Looks bad, boss…

plot of chunk sample_prior_plot

Likelihood

sample_llh <- 
  likelihood_dist(prop_month = 10/28,  # 10 of 28 days elapsed
                  sales_to_date = 4000)

Chickens Counted!

plot of chunk sample_llh_plot

Responsible Bayesians

sample_post <- 
  data_frame(sales=sample_prior$sales,
             d=sample_prior$d * sample_llh$d) %>%  # prior times likelihood...
  mutate(d = d/sum(d))                             # ...renormalized
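With the posterior in hand, the same helpers answer the original question:

sales_mode(sample_post)    # most likely final sales, given sales to date
sales_thresh(sample_post)  # posterior probability of reaching 10,000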

Back to work.

plot of chunk sample_post_plot

1. Repeated Accumulating Processes

2. Modeling the Prior

3. Updating the Prior

4. Step Back

Daily Updating

plot of chunk daily_updates_plot
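One way a plot like this could be generated, as a sketch reusing the hypothetical posterior_dist from above (std_to_date is derived here; it isn't a column in the original data):

month36 <- daily %>%
  filter(t_mo == 36) %>%
  mutate(std_to_date = prop_sales * monthly_sales)  # sales to date, each day
daily_posts <- month36 %>%
  by_row(~ posterior_dist(36, .$bts, .$prop_month, .$std_to_date),
         .collate='rows')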

In Practice

  1. Weekends and Holidays
  2. Backtesting
  3. Future Months
  4. Impact and UX

Pros and Cons

  • Prior & Timecourse models can use arbitrary covariates
  • Honest about uncertainty
  • Semi-parametric models are flexible and make good use of data
  • Fiddly implementation details; not available off-the-shelf
  • Weird timecourse shapes would require trickier modeling
  • Continuous case is hard (but not likely necessary)

Thanks

Build trust: calculate and communicate uncertainty.

Harlan D. Harris, PhD
Twitter, Medium, GitHub: @harlanh

This Presentation on GitHub

Auto-regressive Models

\[ X_t = c + \sum_{i=1}^p \phi_i X_{t-i} + \epsilon_t \]

monthly_ts <- ts(filter(monthly, 
                        set=="Train")$sales, 
                 start=c(2014,1), frequency=12)
# max.q = 0 restricts the search to pure AR (plus differencing) models
monthly_fc <- forecast(auto.arima(monthly_ts, 
                                  max.q=0), h=3)

plot of chunk ar_plot