Bandit Optimization with Uncertain, Delayed Feedback

Harlan D. Harris, PhD
August 20th, 2015

Typical Bandit Optimization

Batch Bandit Optimization

Real Life

Simulation

Seen purchase yet? Depends on time since intervention, A/B, and maybe seasonality!

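# Ticks from intervention until purchase; Inf if the lead never purchases.
purch_time <- function(t, intv) {
  # probability that the lead ever purchases
  prob <- .5 * 
    # seasonality: peaks near t = 50
    (dnorm(t, mean=50, sd=20)*20+1) * 
    # intervention: A multiplies the purchase probability by 1.4
    ifelse(intv=='A', 1.4, 1) 
  # buyers get a geometric delay (mean 19 ticks); non-buyers get Inf
  ifelse(runif(length(intv)) < prob,
    rnbinom(length(intv), size=1, prob=.05),
    Inf)
}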

[Figure: chunk norm]

[Figure: chunk nbinom]

Every Tick -- Simulate the World

  • Add \( N \) more leads, apply \( A \) and \( B \) treatments per policy
  • Calculate purchase time

Every 20 Ticks -- Update the Policy

  • Count observed purchases to date
  • Fit a model
  • Estimate \( \beta_{A>B} = \hat{\mu}_A - \hat{\mu}_B \), the treatment effect (a log odds ratio in a logistic model)
  • Compute \( P(A > B) \), the probability that A beats B
  • Set \( P(A) = P(A > B) \) (Probability Matching)
  • Profit…
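
The last two steps of the update can be sketched with a normal approximation. This is a shortcut for illustration, not necessarily how the talk computes \( P(A > B) \). The names beta_hat, se_hat, p_a_better, and next_intv are hypothetical, and the standard error is a made-up value.

```r
# Assumed inputs: a treatment-effect estimate and its standard error.
beta_hat <- 0.333  # illustrative estimate of beta_{A>B}
se_hat   <- 0.25   # hypothetical standard error, not a value from the talk

# Normal approximation: probability that the true effect is above zero.
p_a_better <- pnorm(beta_hat / se_hat)

# Probability matching: assign A with exactly that probability.
p_a <- p_a_better

set.seed(7)
next_intv <- ifelse(runif(5) < p_a, "A", "B")  # treatments for the next 5 leads
```

Because assignment stays random, B keeps receiving some traffic until the evidence for A is overwhelming, which keeps the estimate of B from going stale.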

Model to fit (incomplete!):

purchase ~ interventionAvsB + s(elapsed_time)
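
A sketch of fitting this formula with mgcv, assuming a binomial family (the GAMs slide describes it as logistic regression with splines). The toy data are fabricated purely to make the block runnable. Coding interventionAvsB as a 0/1 indicator is an assumption, as are the purchase rates used to generate the data.

```r
library(mgcv)

set.seed(42)
n <- 400
leads <- data.frame(
  interventionAvsB = rbinom(n, 1, 0.5),            # 1 = A, 0 = B (assumed coding)
  elapsed_time     = sample(0:39, n, replace = TRUE)
)

# Toy outcome: a lead counts as purchased only if it is a buyer AND
# its purchase delay has already elapsed.
buyer <- runif(n) < ifelse(leads$interventionAvsB == 1, 0.7, 0.5)
delay <- rnbinom(n, size = 1, prob = .05)
leads$purchase <- as.integer(buyer & delay <= leads$elapsed_time)

# Logistic GAM: linear treatment effect plus a penalized spline in elapsed time.
fit <- gam(purchase ~ interventionAvsB + s(elapsed_time),
           family = binomial, data = leads)

# Treatment effect on the log-odds scale, with its standard error.
coefs    <- summary(fit)$p.table
beta_hat <- coefs["interventionAvsB", "Estimate"]
se_hat   <- coefs["interventionAvsB", "Std. Error"]
```

The smooth term absorbs the fact that older leads have had more time to purchase, so the treatment coefficient is not confounded with lead age.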

GAMs and Time

  • \( P(\text{doing this wrong}) \approx 1 \)
  • Generalized Additive Models
  • logistic regression with splines
  • penalized spline complexity
  • Survival approach?
  • Care about \( \beta_{A>B} \)

Try It!

  • \( N=5 \) leads/tick, repeated simulations
  • Initially, \( P(A) = 0.5 \)
  • Watch \( \beta_{A>B} \) and \( P(A > B) \) change
  • Inspect other coefficients

Coefficients After 40 Ticks

  • Intercept: -0.9
  • \( \beta_{A>B} \): 0.333

[Figure: chunk coef_plots2]

Uptake Over Time

Estimated probability of purchase within 30 ticks of the intervention, for each treatment

[Figure: chunk uptake_plot]

P(A>B) Over Time

Estimated probability that A is better = probability of assigning A

[Figure: chunk a_better_plot]

Thanks!

Details

  • Predict purchase probability 30 ticks out, for A and B
  • Use the standard errors of the predictions to build confidence intervals
  • Simulate many draws to estimate \( P(A > B) \)
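
The three steps above might be sketched as follows. The link-scale predictions and standard errors are hypothetical numbers. With a fitted mgcv model they could come from predict(fit, newdata, type = "link", se.fit = TRUE) at an elapsed time of 30. The helper names ci_prob and p_a_beats_b are assumptions, and drawing A and B independently ignores any covariance between the two predictions.

```r
# Hypothetical link-scale (log-odds) predictions 30 ticks out.
fit_a <- -0.4; se_a <- 0.2
fit_b <- -0.8; se_b <- 0.2

# Approximate 95% confidence interval, mapped to the probability scale.
ci_prob <- function(fit, se) plogis(fit + c(-1.96, 1.96) * se)
ci_a <- ci_prob(fit_a, se_a)
ci_b <- ci_prob(fit_b, se_b)

# Monte Carlo estimate of P(A > B): draw each prediction from a normal
# distribution on the link scale and count how often A comes out ahead.
# Assumption: the two predictions are treated as independent.
p_a_beats_b <- function(fit_a, se_a, fit_b, se_b, n_draws = 10000) {
  draws_a <- plogis(rnorm(n_draws, fit_a, se_a))
  draws_b <- plogis(rnorm(n_draws, fit_b, se_b))
  mean(draws_a > draws_b)
}

set.seed(123)
p_a_better <- p_a_beats_b(fit_a, se_a, fit_b, se_b)
```

Drawing on the link scale and then applying plogis keeps every simulated probability inside (0, 1).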