Bandit Optimization with Uncertain, Delayed Feedback

Harlan D. Harris, PhD
August 20th, 2015

Typical Bandit Optimization

Batch Bandit Optimization

Real Life

Simulation

Seen purchase yet? Depends on time since intervention, A/B, and maybe seasonality!

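# Returns a negative-binomial draw for leads who purchase,
# and Inf for leads who never do.
purch_time <- function(t, intv) {
  # probability that a lead ever purchases
  prob <- .5 * 
    # seasonality
    (dnorm(t, mean=50, sd=20)*20+1) * 
    # intervention
    ifelse(intv=='A', 1.4, 1) 
  # purchasers get a random delay; everyone else gets Inf
  ifelse(runif(length(intv)) < prob,
    rnbinom(length(intv), size=1, prob=.05),
    Inf)
}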

[Plot: chunk norm, the normal density used for seasonality]

[Plot: chunk nbinom, the negative binomial used for purchase times]

Every Tick -- Simulate the World

Add \( N \) more leads, apply \( A \) and \( B \) treatments per policy
Calculate purchase time
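
One way to sketch a single tick, assuming the purch_time() function from the Simulation slide. The helper simulate_tick(), the column names start_tick and delay, and the argument p_a (the current policy's probability of assigning A) are illustrative names, not taken from the original source.

```r
# purch_time() repeated from the Simulation slide so this block runs on its own
purch_time <- function(t, intv) {
  prob <- .5 *
    (dnorm(t, mean=50, sd=20)*20+1) *   # seasonality
    ifelse(intv=='A', 1.4, 1)           # intervention
  ifelse(runif(length(intv)) < prob,
    rnbinom(length(intv), size=1, prob=.05),
    Inf)
}

# Hypothetical helper: one tick of the simulated world.
#   tick = current tick
#   n    = number of new leads this tick (N on the slide)
#   p_a  = current policy, the probability of assigning treatment A
simulate_tick <- function(tick, n, p_a) {
  # apply A and B treatments per the policy
  intv <- ifelse(runif(n) < p_a, 'A', 'B')
  data.frame(
    start_tick = tick,
    intv       = intv,
    # value drawn by purch_time(); Inf marks a lead who never purchases
    delay      = purch_time(tick, intv),
    stringsAsFactors = FALSE
  )
}

set.seed(1)
# 20 ticks of 5 leads each under the initial 50/50 policy
leads <- do.call(rbind, lapply(1:20, simulate_tick, n = 5, p_a = 0.5))
head(leads)
```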

Every 20 Ticks -- Update the Policy

Count observed purchases to date
Fit a model
Estimate \( \beta_{A>B} = \hat{\mu}_A - \hat{\mu}_B \), treatment effect
Compute \( P(A > B) \), the probability that A is better than B
Set \( P(A) = P(A > B) \) (Probability Matching)
Profit…
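
The update steps above could look roughly like this. It is a simplified sketch under stated assumptions: the lead history is toy data with made-up purchase rates, a plain logistic regression stands in for the model on the next slide, and \( P(A > B) \) comes from a normal approximation to the coefficient rather than from simulated predictions. All variable names are illustrative.

```r
set.seed(42)
now <- 40    # current tick
n   <- 400   # leads seen so far

# Toy lead history (assumed layout): arrival tick and treatment
leads <- data.frame(
  start_tick = sample(1:now, n, replace = TRUE),
  intv       = factor(sample(c('A', 'B'), n, replace = TRUE),
                      levels = c('B', 'A'))
)
# Made-up rates: 70% of A leads and 50% of B leads eventually buy
buys <- runif(n) < ifelse(leads$intv == 'A', 0.7, 0.5)
leads$delay <- ifelse(buys, rnbinom(n, size = 1, prob = .05), Inf)

# 1. Count observed purchases to date; later purchases are still hidden
leads$elapsed  <- now - leads$start_tick
leads$purchase <- as.integer(leads$delay <= leads$elapsed)

# 2. Fit a model (plain logistic regression as a placeholder)
fit <- glm(purchase ~ intv + elapsed, family = binomial, data = leads)

# 3. Treatment effect of A vs. B on the log-odds scale, with standard error
est  <- coef(summary(fit))['intvA', ]
beta <- est[['Estimate']]
se   <- est[['Std. Error']]

# 4. P(A > B) under a normal approximation to the estimate
p_a_better <- pnorm(beta / se)

# 5. Probability matching: assign A with probability P(A > B)
p_a <- p_a_better
c(beta = beta, p_a = p_a)
```

Because purchases arrive with a delay, recent leads look like non-purchasers at update time. That is why elapsed time enters the model alongside the treatment.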

Model to fit (incomplete!):

purchase ~ interventionAvsB + s(elapsed_time)

GAMs and Time

\( P(\text{doing this wrong}) \approx 1 \)
Generalized Additive Models
- Kim Larson – The Predictive Modeling Silver Bullet
logistic regression with splines
penalized spline complexity
Survival approach?
Care about \( \beta_{A>B} \)
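
A minimal sketch of such a fit with the mgcv package, on toy data. The column names intv and elapsed_time, the saturating purchase curve, and the 0.7 and 0.5 rates are assumptions for illustration only.

```r
library(mgcv)  # recommended package, included with standard R installs

set.seed(7)
n <- 500
# Toy data: treatment and ticks elapsed since the intervention
d <- data.frame(
  intv         = factor(sample(c('A', 'B'), n, replace = TRUE),
                        levels = c('B', 'A')),
  elapsed_time = runif(n, 0, 60)
)
# Assumed truth: chance of having seen a purchase saturates with
# elapsed time, and treatment A gets a lift
p <- (1 - exp(-d$elapsed_time / 20)) * ifelse(d$intv == 'A', 0.7, 0.5)
d$purchase <- rbinom(n, 1, p)

# Logistic regression with a penalized spline on elapsed time
fit <- gam(purchase ~ intv + s(elapsed_time),
           family = binomial, data = d, method = "REML")

# The parametric A-vs-B coefficient is the quantity of interest
summary(fit)$p.table['intvA', ]
```

The smooth term absorbs the delay in feedback, so the treatment coefficient is estimated with elapsed time held fixed.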

Try It!

\( N=5 \) leads/tick, repeated simulations
Initially, \( P(A) = 0.5 \)
Watch \( \beta_{A>B} \) and \( P(A > B) \) change
Inspect other coefficients

Coefficients After 40 Ticks

Intercept: -0.9

\( \beta_{A>B} \): 0.333

[Plot: chunk coef_plots2]

Uptake Over Time

Estimates of purchase probability (in 30 ticks) with each treatment

P(A>B) Over Time

Estimated probability that A is better == probability of choosing A

[Plot: chunk a_better_plot]

Thanks!

Twitter: @harlanh

Presentation: http://rpubs.com/HarlanH/uncertain-bandits

Source: https://github.com/HarlanH/uncertain-bandits

Details

Predict purchase probability 30 ticks out, for A and B
Use the standard error of the predictions to determine confidence intervals
Simulate many draws to estimate \( P(A > B) \)
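
These three steps might be sketched as follows, using mgcv on toy data. The data-generating choices and all variable names are illustrative. The draws for A and B are treated as independent normals on the logit scale, which is a simplifying assumption because predictions from one fitted model are correlated.

```r
library(mgcv)

set.seed(11)
n <- 500
# Toy data (assumed layout, made-up purchase rates)
d <- data.frame(
  intv         = factor(sample(c('A', 'B'), n, replace = TRUE),
                        levels = c('B', 'A')),
  elapsed_time = runif(n, 0, 60)
)
p <- (1 - exp(-d$elapsed_time / 20)) * ifelse(d$intv == 'A', 0.7, 0.5)
d$purchase <- rbinom(n, 1, p)
fit <- gam(purchase ~ intv + s(elapsed_time), family = binomial, data = d)

# Predict on the logit scale, 30 ticks out, for A and for B
newd <- data.frame(intv = factor(c('A', 'B'), levels = c('B', 'A')),
                   elapsed_time = 30)
pr  <- predict(fit, newdata = newd, type = 'link', se.fit = TRUE)
mu  <- as.numeric(pr$fit)
sig <- as.numeric(pr$se.fit)

# Approximate 95% confidence intervals, mapped back to probabilities
ci <- data.frame(intv  = newd$intv,
                 lower = plogis(mu - 1.96 * sig),
                 est   = plogis(mu),
                 upper = plogis(mu + 1.96 * sig))

# Simulate many draws per treatment and count how often A beats B
draws <- 10000
sim_a <- rnorm(draws, mu[1], sig[1])
sim_b <- rnorm(draws, mu[2], sig[2])
p_a_better <- mean(sim_a > sim_b)

ci
p_a_better
```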