# Forecasting Repeated Accumulating Processes with Semiparametric Regression Models and Bayesian Updates*

Harlan D. Harris, PhD
ISBIS, June 2017

* AKA, “Cancelling Classes for Fun and Profit”

# 1. Repeated Accumulating Processes

### Problem 1: Forecasting Class Size

• Cancel or Run?
• Early Warning System
• Imprecision, Accumulating Data

### Problem 2: Forecasting Monthly Sales

• Will We Meet Sales Targets?
• Early Warning System
• Imprecision, Accumulating Data

### Repeated Accumulating Processes

Input:

• Count of rare/noisy events.
• Groups are part of data-generating process.
• Covariates, not simple time series.

Output:

• Want updates as time progresses.
• Want prediction intervals (honest uncertainty).

### Sales Data (Simulated!)

Observations: 1,008
Variables: 6
$t <int> 1, 2, 3, 4, 5, 6, 7,...$ t_mo  <dbl> 1, 1, 1, 1, 1, 1, 1,...
$t_dom <dbl> 1, 2, 3, 4, 5, 6, 7,...$ bts   <lgl> FALSE, FALSE, FALSE,...
$set <fctr> Train, Train, Train...$ sales <dbl> 115, 91, 109, 125, 1...


# 2. Modeling the Prior

### Empirical Bayes

• Empirical Prior = vague estimate of sales, before seeing any data, modeled from history
• Likelihood = $$\forall$$ potential final sales $$y$$, P(observing sales to date | y)
• Posterior = prior updated by sales to date (likelihood)

$\forall y, P(\text{total sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{total sales}=y) \cdot {\bf P(\text{total sales}=y) }$

### Modeling the Prior, Conceptually

Location: $\text{sales} = f(t,x) + N(0, \sigma)$

plus Scale: $\text{sales} = f(t,x) + N(0, g(t,x))$

plus Shape: $\text{sales} = f(t,x) + ?(0, g(t,x), h(t,x))$

“Mathematically speaking, GAM is an additive modeling technique where the impact of the predictive variables is captured through smooth functions whichâ€”depending on the underlying patterns in the dataâ€”can be nonlinear:”

GAM: The Predictive Modeling Silver Bullet (Kim Larsen, Stichfix, 2015)

### GAMLSS

• LSS = Independently model all parameters of a distribution
• Arbitrary exponential-family distributions, e.g., Skew $$t$$, Zero-Inflated Poisson, etc.
• Forecast shape of uncertainty in new cases

### Modeling the Prior


Family:  c("TF", "t Family")
Fitting method: RS()

Call:
gamlss(formula = log(sales) ~ cs(t_mo) + bts, sigma.formula = ~t_mo,
nu.formula = ~t_mo, family = TF, data = monthly_training,
control = gamlss.control(trace = FALSE))

Mu Coefficients:
(Intercept)     cs(t_mo)      btsTRUE
8.27662      0.01552      0.25989
Sigma Coefficients:
(Intercept)         t_mo
-2.755428     0.007908
Nu Coefficients:
(Intercept)         t_mo
2.1004       0.3959

Degrees of Freedom for the fit: 9.999 Residual Deg. of Freedom   23
Global Deviance:     -78.6846
AIC:     -58.6858
SBC:     -43.7216


### Prior Computation

# Simplified:

nd <- data.frame(t_mo=15,
bts=FALSE)

with(predictAll(fit, newdata=nd),
dTF(log(2000:15000),
mu=mu,
sigma=sigma,
nu=nu))


# 3. Updating the Prior

### Modeling the Likelihood, Conceptually

$\forall y, P(\text{sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{sales}=y) \cdot P(\text{sales}=y) \\ \text{or} \\ {\bf P(\text{fraction sales to date}=\frac{x}{y}|\text{fraction of month}=z)} \cdot P(\text{sales} = y)$

### Modeling the Likelihood


Family:  c("BEINF0", "Beta Inflated zero")
Fitting method: RS()

Call:
gamlss(formula = prop_sales ~ cs(prop_month), sigma.formula = ~cs(prop_month),
nu.formula = ~cs(prop_month), family = BEINF0,
data = daily_training, control = gamlss.control(trace = FALSE))

Mu Coefficients:
(Intercept)  cs(prop_month)
-2.614           4.886
Sigma Coefficients:
(Intercept)  cs(prop_month)
-2.57953        -0.01477
Nu Coefficients:
(Intercept)  cs(prop_month)
-2.328         -44.281

Degrees of Freedom for the fit: 15 Residual Deg. of Freedom   909
Global Deviance:     -3946.15
AIC:     -3916.15
SBC:     -3843.72


### Computing the Posterior

$\forall y, P(\text{sales}=y \vert \text{sales to date}=x) \propto \\ P(\text{sales to date}=x | \text{frac of month}=z, \text{sales}=y) \cdot \\ P(\text{sales}=y)$

Algorithm:

• For all plausible final sales numbers (integers!),
• Compute the prior and likelihood distributions & multiply,
• Then normalize so the posterior is a distribution.
• Plot or compute quantiles as needed.

## It’s December 10th, 2016. We’ve seen 4000 sales to date. Will we hit our goal of 10,000 sales?

• Plug in month 36 to get Prior distribution.
• Read off mode and probability of meeting/exceeding goal…