Forecasting Repeated Accumulating Processes with Semiparametric Regression Models and Bayesian Updates*

Harlan D. Harris, PhD
ISBIS, June 2017

* AKA, “Cancelling Classes for Fun and Profit”

1. Repeated Accumulating Processes

Problem 1: Forecasting Class Size

plot of chunk forecasting_class_plot

  • Cancel or Run?
  • Early Warning System
  • Imprecision, Accumulating Data

Problem 2: Forecasting Monthly Sales

plot of chunk sales_plot

  • Will We Meet Sales Targets?
  • Early Warning System
  • Imprecision, Accumulating Data

Repeated Accumulating Processes

Input:

  • Count of rare/noisy events.
  • Groups are part of data-generating process.
  • Covariates, not simple time series.

Output:

  • Want updates as time progresses.
  • Want prediction intervals (honest uncertainty).

Sales Data (Simulated!)

Observations: 1,008
Variables: 6
$ t     <int> 1, 2, 3, 4, 5, 6, 7,...
$ t_mo  <dbl> 1, 1, 1, 1, 1, 1, 1,...
$ t_dom <dbl> 1, 2, 3, 4, 5, 6, 7,...
$ bts   <lgl> FALSE, FALSE, FALSE,...
$ set   <fctr> Train, Train, Train...
$ sales <dbl> 115, 91, 109, 125, 1...

plot of chunk simdata_plot

Divide and Conquer

plot of chunk prior_graph

plot of chunk timecourse_plot1

1. Repeated Accumulating Processes

2. Modeling the Prior

Empirical Bayes

  • Empirical Prior = vague estimate of sales, before seeing any data, modeled from history
  • Likelihood = \( \forall \) potential final sales \( y \), P(observing sales to date | y)
  • Posterior = prior updated by sales to date (likelihood)

\[ \forall y, P(\text{total sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{total sales}=y) \cdot {\bf P(\text{total sales}=y) } \]

Modeling the Prior, Conceptually

plot of chunk prior_concep

Location: \[ \text{sales} = f(t,x) + N(0, \sigma) \]

plus Scale: \[ \text{sales} = f(t,x) + N(0, g(t,x)) \]

plus Shape: \[ \text{sales} = f(t,x) + ?(0, g(t,x), h(t,x)) \]

Generalized Additive Models

“Mathematically speaking, GAM is an additive modeling technique where the impact of the predictive variables is captured through smooth functions which—depending on the underlying patterns in the data—can be nonlinear:”

GAM

GAM: The Predictive Modeling Silver Bullet (Kim Larsen, Stichfix, 2015)

GAMLSS

GAMLSS

  • LSS = Independently model all parameters of a distribution
  • Arbitrary exponential-family distributions, e.g., Skew \( t \), Zero-Inflated Poisson, etc.
  • Forecast shape of uncertainty in new cases

Modeling the Prior


Family:  c("TF", "t Family") 
Fitting method: RS() 

Call:  
gamlss(formula = log(sales) ~ cs(t_mo) + bts, sigma.formula = ~t_mo,  
    nu.formula = ~t_mo, family = TF, data = monthly_training,  
    control = gamlss.control(trace = FALSE)) 

Mu Coefficients:
(Intercept)     cs(t_mo)      btsTRUE  
    8.27662      0.01552      0.25989  
Sigma Coefficients:
(Intercept)         t_mo  
  -2.755428     0.007908  
Nu Coefficients:
(Intercept)         t_mo  
     2.1004       0.3959  

 Degrees of Freedom for the fit: 9.999 Residual Deg. of Freedom   23 
Global Deviance:     -78.6846 
            AIC:     -58.6858 
            SBC:     -43.7216 

Modeling the Prior

plot of chunk prior_plotsplot of chunk prior_plotsplot of chunk prior_plotsplot of chunk prior_plots

Prior Computation

plot of chunk oneprior

# Simplified:

nd <- data.frame(t_mo=15, 
                 bts=FALSE)

with(predictAll(fit, newdata=nd),
     dTF(log(2000:15000), 
         mu=mu, 
         sigma=sigma, 
         nu=nu))

1. Repeated Accumulating Processes

2. Modeling the Prior

3. Updating the Prior

Modeling the Likelihood, Conceptually

\[ \forall y, P(\text{sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{sales}=y) \cdot P(\text{sales}=y) \\ \text{or} \\ {\bf P(\text{fraction sales to date}=\frac{x}{y}|\text{fraction of month}=z)} \cdot P(\text{sales} = y) \]

plot of chunk likelihood_concept

Modeling the Likelihood


Family:  c("BEINF0", "Beta Inflated zero") 
Fitting method: RS() 

Call:  
gamlss(formula = prop_sales ~ cs(prop_month), sigma.formula = ~cs(prop_month),  
    nu.formula = ~cs(prop_month), family = BEINF0,  
    data = daily_training, control = gamlss.control(trace = FALSE)) 

Mu Coefficients:
   (Intercept)  cs(prop_month)  
        -2.614           4.886  
Sigma Coefficients:
   (Intercept)  cs(prop_month)  
      -2.57953        -0.01477  
Nu Coefficients:
   (Intercept)  cs(prop_month)  
        -2.328         -44.281  

 Degrees of Freedom for the fit: 15 Residual Deg. of Freedom   909 
Global Deviance:     -3946.15 
            AIC:     -3916.15 
            SBC:     -3843.72 

Modeling the Likelihood

plot of chunk likelihood_plotsplot of chunk likelihood_plotsplot of chunk likelihood_plots

Likelihood Computation

plot of chunk onelh

Computing the Posterior

\[ \forall y, P(\text{sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{sales}=y) \cdot \\ P(\text{sales}=y) \]

Algorithm:

  • For all plausible final sales numbers (integers!),
  • Compute the prior and likelihood distributions & multiply,
  • Then normalize so the posterior is a distribution.
  • Plot or compute quantiles as needed.

Scenario -- Prior Only

It’s December 10th, 2016. We’ve seen 4000 sales to date. Will we hit our goal of 10,000 sales?

  • Plug in month 36 to get Prior distribution.
  • Read off mode and probability of meeting/exceeding goal…

Looks bad, boss…

plot of chunk sample_prior_plot

Likelihood

  • Plug in proportion of month, sales to month.
  • Read off mode and probability of meeting/exceeding goal…

Chickens Counted!