Intro

Deciding which recommendations to give a customer at a decision moment can be viewed as a “Multi-Armed Bandit” problem, with the different offers being the bandits (arms). The exploration/exploitation (or learn/earn) dilemma is addressed in many papers.

Terminology

These papers are dense with formulas and terminology:

  • Bernoulli multi-armed bandit: the outcome is 1 or 0 (click or no click, etc.) - we’re focusing on these only
  • action: the choice made, e.g. which offer to present from a number of alternatives (impression)
  • reward: the result of taking a certain action, i.e. the “outcome” (e.g. a click)
  • regret: the loss from not selecting the optimal action - here we’re just looking at it in terms of propensities, but in real life value could be factored in (p*V)

The theoretical approaches usually ignore a lot of real-world effects: actual propositions are not stationary but change over time, possibly have seasonal effects, compete with each other, and there may be population drift.

Setup

The bandit success rates are fixed here.

library(data.table)
library(knitr)
library(ggplot2)
library(colorspace)
library(Matrix) # for sparse matrix
library(nnet) # for which.is.max

bandit_rates <- c(0.1, 0.5, 0.6, 0.8, 0.3, 0.25, 0.6, 0.45, 0.75, 0.65)
#bandit_rates <- round(runif(n = 10, 0, 0.2), 2)

n_bandits <- length(bandit_rates)
bandit_rates
##  [1] 0.10 0.50 0.60 0.80 0.30 0.25 0.60 0.45 0.75 0.65

Experiment framework

The experiment framework (code not shown in the notebook) runs the given selection function over a number of episodes. It does this repeatedly for multiple experiments and then returns the averages; a minimal sketch of such a framework is given below.
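
The actual runExperiments code is omitted, so the following is only a sketch of what such a framework could look like, assuming it returns a list with per-episode matrices of successrates, actions and rewards plus n_experiments (that is how its result is used further down). The name runExperiments_sketch and the exact averaging are illustrative, not the notebook’s real implementation.

# Minimal sketch of the experiment framework (illustrative; not the actual runExperiments code).
# Each episode, every experiment picks one bandit, draws a Bernoulli outcome, and the
# state averaged over all experiments is recorded for that episode.
runExperiments_sketch <- function(n_experiments, n_episodes, select_actions)
{
  actions <- matrix(0, nrow = n_experiments, ncol = n_bandits) # plays per bandit, per experiment
  rewards <- matrix(0, nrow = n_experiments, ncol = n_bandits) # total reward per bandit, per experiment
  
  avg_successrates <- matrix(NA_real_, nrow = n_episodes, ncol = n_bandits)
  avg_actions <- matrix(0, nrow = n_episodes, ncol = n_bandits)
  avg_rewards <- matrix(0, nrow = n_episodes, ncol = n_bandits)
  
  for (episode in seq(n_episodes)) {
    # the algorithm picks one bandit per experiment (a vector of length n_experiments)
    a <- select_actions(actions, rewards, episode, n_episodes)
    
    # draw Bernoulli outcomes for the chosen bandits
    outcome <- rbinom(n_experiments, 1, bandit_rates[a])
    
    # update the per-experiment counts
    idx <- cbind(seq(n_experiments), a)
    actions[idx] <- actions[idx] + 1
    rewards[idx] <- rewards[idx] + outcome
    
    # record the averages over all experiments for this episode
    avg_successrates[episode, ] <- colMeans(rewards / actions, na.rm = TRUE) # NaN until a bandit is played
    avg_actions[episode, ] <- colMeans(actions)
    avg_rewards[episode, ] <- colMeans(rewards)
  }
  
  return(list(successrates = avg_successrates, actions = avg_actions,
              rewards = avg_rewards, n_experiments = n_experiments))
}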

Approaches

The implementations all take the number of actions and the total rewards so far, per bandit, as experiments x bandits matrices. They are called once per episode, in sequence, but can process all experiments of that episode at once; some don’t and simply iterate over the experiments, collecting the results.

Sample data

Below is some sample data to test out the functions in the R console.

Sample actions for 5 experiments with 3 bandits (V1, V2 & V3) after 10 rounds (episodes).

experiment   V1   V2   V3
         1    0    4    6
         2    8    0    2
         3    9    0    1
         4    2    5    3
         5    1    1    8

Sample total rewards for those actions after 10 rounds

experiment   V1   V2   V3
         1    0    1    1
         2    6    0    2
         3    1    0    1
         4    1    2    2
         5    0    1    5

Success rates for those example numbers

experiment    V1    V2    V3
         1   NaN  0.25  0.17
         2  0.75   NaN  1.00
         3  0.11   NaN  1.00
         4  0.50  0.40  0.67
         5  0.00  1.00  0.62
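
To reproduce these tables in the console, something like the following can be used (the sample_actions / sample_rewards names are just illustrative):

# sample action counts and rewards for 5 experiments x 3 bandits (names are illustrative)
sample_actions <- matrix(c(0, 4, 6,
                           8, 0, 2,
                           9, 0, 1,
                           2, 5, 3,
                           1, 1, 8), nrow = 5, byrow = TRUE)
sample_rewards <- matrix(c(0, 1, 1,
                           6, 0, 2,
                           1, 0, 1,
                           1, 2, 2,
                           0, 1, 5), nrow = 5, byrow = TRUE)

# raw success rates (rewards / actions); bandits that were never played give NaN
round(sample_rewards / sample_actions, 2)

# these matrices can be passed to the selection functions defined below, e.g.
# greedy(sample_actions, sample_rewards, episode = 10, n_episodes = 10)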

Random Selection

Random selection as a baseline.

random <- function(actions, rewards, episode, n_episodes) 
{
  # pick one of the n_bandits (global from the setup) at random for each experiment (row)
  a <- sample(seq(n_bandits), nrow(actions), replace = T)
  return(a)
}

Greedy

The Greedy algorithm simply selects the bandit with the highest success rate so far. A trivial “Laplace smoothing” is applied (0.5 added to the rewards, 1 to the actions) so unplayed bandits start at 0.5 instead of NaN. Ties are not broken: which.max always picks the first maximum, so it biases towards the bandits listed first.

greedy <- function(actions, rewards, episode, n_episodes) 
{
  # smoothed success rates: a bandit that was never played starts at 0.5
  successrates <- (0.5+rewards)/(1+actions)
  # per experiment, pick the bandit with the highest rate (the first one in case of ties)
  a <- apply(successrates, 1, which.max)
  return(a)
}

Same as Greedy, but ties are broken at random (using which.is.max from the nnet package).

greedy_tie_breaker <- function(actions, rewards, episode, n_episodes) 
{
  successrates <- (0.5+rewards)/(1+actions)
  a <- apply(successrates, 1, which.is.max) # which.is.max randomizes between ties
  return(a)
}

Epsilon-Greedy

A fancy name for a very simple strategy: each time, with a probability of “epsilon” (e.g. 10% or less), a random action is chosen instead of the greedy one. This makes sure we keep exploring. It addresses both the cold start and the population drift problems, but the opportunity costs are fairly high.

Epsilon-greedy keeps exploring at the same rate, which is good if the bandits are non-stationary, but if they are more stable, there is an expected loss.

epsilon_greedy <- function(actions, rewards, episode, n_episodes, epsilon = 0.1) 
{
  # pass all arguments explicitly rather than relying on lazy evaluation of the unused ones
  r <- random(actions, rewards, episode, n_episodes)
  g <- greedy_tie_breaker(actions, rewards, episode, n_episodes)
  # per experiment: explore (random pick) with probability epsilon, otherwise exploit (greedy pick)
  p <- runif(nrow(actions))
  a <- ifelse(p <= epsilon, r, g)
  return(a)
}

Decayed Epsilon-Greedy reduces the epsilon factor over time.

decayed_epsilon_greedy <- function(actions, rewards, episode, n_episodes) 
{
  # epsilon decays linearly from 0.1 to 0 over the course of the episodes
  epsilon_greedy(actions, rewards, episode, n_episodes, epsilon = 0.1*(n_episodes - episode)/n_episodes)
}

Roulette Wheel

Not commonly seen in the literature, but the idea is to make the probability of choosing an action proportional to that action’s estimated success rate. An article on analyticsvidhya calls this “Softmax Exploration”, but the softmax function would only be necessary when prioritizing not just on the probability (p) but on something like p*V (to map it nicely to a 0-1 interval).

Roulette wheel keeps offering actions that are unlikely to give the best reward, but will converge to the true propensities rather quickly.

roulette_wheel <- function(actions, rewards, episode, n_episodes) 
{
  successrates <- (0.5+rewards)/(1+actions)
  
  # pick one bandit with probability proportional to its (smoothed) success rate
  rw <- function(probs) 
  {
    return(sample(1:length(probs), 1, prob = probs))
  }
  
  a <- apply(successrates, 1, rw)
  return(a)
}

Upper Confidence Bounds

UCB considers the uncertainty of the probability estimates: if you have a lot of evidence you can be pretty sure about the propensities, but when an offer is still new and hasn’t been presented much yet, the uncertainty is large. The UCB algorithm establishes an optimistic upper bound on the propensity and uses that for ranking.

In UCB1 the success rate \(\frac{r_i}{a_i}\) is augmented with an upper bound \(\sqrt{\frac{2 \ln a}{a_i}}\) that estimates the uncertainty. Here

  • \(r_i\) is the total reward obtained from action i so far
  • \(a_i\) is the number of times action i has been chosen
  • \(a\) is the total number of actions taken so far, which equals the number of episodes (in every episode exactly one action is chosen)
  • the constant \(2\) is the exploration parameter, in practice sometimes tweaked experimentally (a small numeric illustration follows this list)
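
As a quick illustration of how this bound shrinks with evidence (the numbers are made up for the example): after 100 episodes, a bandit that was only played 5 times gets a much larger optimism bonus than one that was played 50 times.

# UCB1 exploration bonus for a bandit played a_i times out of a total actions (illustrative helper)
ucb_bound <- function(a_i, a) sqrt(2*log(a)/a_i)

ucb_bound(a_i = 5, a = 100)   # ~1.36 : little evidence, big bonus
ucb_bound(a_i = 50, a = 100)  # ~0.43 : more evidence, small bonus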

See e.g. https://www.cs.bham.ac.uk/internal/courses/robotics/lectures/ucb1.pdf. In short: put an optimistic upper bound on the success rate - “optimism in the face of uncertainty”.

UCB converges only slowly to a low regret value because the confidence bounds are pretty wide initially, so all bandits get opportunities. This optimism about the success rates translates into high regret in the beginning.

ucb1 <- function(actions, rewards, episode, n_episodes) 
{
  successrates <- (0.5+rewards)/(1+actions)
  # optimistic bonus per bandit; the small 1e-10 avoids division by zero for bandits not played yet
  confbound <- sqrt(2*log(episode+1)/(actions+1e-10))
  
  a <- apply(successrates+confbound, 1, which.max)
  return(a)
}

Propensity smoothing

Often positioned as a solution to cold start, this takes a weighted average of the observed bandit propensity and some a priori propensity, weighted by the amount of evidence. So it initially starts at the prior propensity and gradually converges to the observed bandit propensity.

\[p_{smoothed} = p_{apriori}\frac{E_{apriori}}{E_{apriori} + a_i} + \frac{r_i}{a_i}\cdot\frac{a_i}{E_{apriori} + a_i}\]

This addresses the cold start problem, but has trouble dealing with changing preferences in the population: a bandit’s propensity might go down, and the algorithm may never discover that a different population segment is interested, because the offer is no longer presented to them.

The a priori rate is set at 0.5 and the a priori evidence at 20% of the total number of episodes.

propensity_smoothing <- function(actions, rewards, episode, n_episodes) 
{
  evidence_apriori <- floor(0.2*n_episodes)
  successrate_apriori <- 0.5
  
  successrates <- (0.5+rewards)/(1+actions)
  
  # weighted average of the prior and the observed rate, weighted by the evidence per bandit
  smoothed <- successrate_apriori*evidence_apriori/(evidence_apriori+actions) + successrates*actions/(evidence_apriori+actions)
  
  a <- apply(smoothed, 1, which.max)
  return(a)
}

Thompson sampling

Here the uncertainty is modeled explicitly with a probability distribution. The idea is similar to UCB, but instead of using an upper bound, we sample from the distribution of plausible propensities. When there is a lot of evidence the propensities are sampled from a really narrow range, but in the beginning the distribution is much more spread out and the sampled propensities take on a much broader range. The Beta distribution is the two-parameter distribution used for this, with its parameters directly reflecting the successes and failures: \(\text{Beta}(1+r_i, 1+a_i-r_i)\).

Thompson sampling explores quickly but then backs down exploring when there is sufficient evidence.

thompson_sampling <- function(actions, rewards, episode, n_episodes) 
{
  ts <- function(a, r)
  {
    # get the probabilities one by one via the beta distrib with arguments (pos+1, neg+1), then return the bandit with the max prob
    probs <- sapply(seq(length(a)), function(j) {return(rbeta(1, r[j]+1, a[j]-r[j]+1))})
    return(which.max(probs))
  }
  # iterate over all experiments
  a <- sapply(seq(nrow(actions)), function(i) {return(ts(actions[i,], rewards[i,]))})
  return(a)
}
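
To see how the sampling narrows down with evidence (the numbers below are made up for illustration), compare the spread of draws for a bandit with little evidence versus one with a lot, at the same observed success rate:

# Beta(1+successes, 1+failures) samples for a bandit observed at a 60% success rate
# little evidence: 3 successes out of 5 plays -> the draws are all over the place
round(rbeta(5, 1+3, 1+5-3), 2)
# lots of evidence: 300 successes out of 500 plays -> the draws stay close to 0.6
round(rbeta(5, 1+300, 1+500-300), 2)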

TODO others

  • epsilon-first
  • ABWin - what is this? MK?
  • probability matching - isn’t that UCB?

Run experiments

Now run each algorithm over multiple experiments, each with a number of episodes.

experiments <- list( "Random Baseline" = random,
                     "Greedy (w default Laplace smoothing)" = greedy,
                     "Epsilon Greedy" = epsilon_greedy,
                     # "Greedy breaking ties at random" = greedy_tie_breaker,
                     # "Decayed Espsilion Greedy" = decayed_epsilon_greedy,
                     "Propensity Smoothing" = propensity_smoothing,
                     "Roulette Wheel" = roulette_wheel,
                     "Thompson Sampling" = thompson_sampling,
                     "Upper Confidence Bounds (UCB1)" = ucb1)

for (exp in names(experiments)) 
{
  algo <- experiments[[exp]]
  experiments[[exp]] <- c("algo" = algo, runExperiments(n_experiments=1000, n_episodes=3000, algo))
  
  propensities <- melt(data.table(experiments[[exp]]$successrates)[, episode := seq(.N) ], id.vars=c("episode"), 
                       variable.name = "bandit", value.name = "propensity")
  actions <- melt(data.table(experiments[[exp]]$actions)[, episode := seq(.N) ], id.vars=c("episode"), 
                  variable.name = "bandit", value.name = "frequency") 
  rewards <- melt(data.table(experiments[[exp]]$rewards)[, episode := seq(.N) ], id.vars=c("episode"), 
                  variable.name = "bandit", value.name = "reward")
  
  plotdata <- merge(propensities, merge(actions, rewards, by=c("episode", "bandit")), by=c("episode", "bandit"))
  
  # show how the propensities of the agents evolve over time
  print(ggplot(na.omit(plotdata),
         aes(episode, propensity, colour=bandit, size=frequency)) + geom_line() +
    ggtitle("Propensities over Time", 
            subtitle = paste(exp, ", averaged over", experiments[[exp]]$n_experiments, "experiments")))
}

The algorithms reach the actual success rates at different speeds, depending on how much they explore. They also differ in how quickly and how strongly they exploit the discovered success rates. Both are shown below - note that the x-axis is on a log scale to highlight the differences.

evolution <- rbindlist(lapply(names(experiments), function(exp) { 
  return(data.table( experiment = exp,
                     episode = seq(nrow(experiments[[exp]]$successrates)),
                     `rmse (explore)` = apply(experiments[[exp]]$successrates, 1, function(x) {
                       return(sqrt(mean((x - bandit_rates)^2, na.rm=T)))}),
                     `reward (exploit)` = apply(experiments[[exp]]$rewards,1,sum) / 
                       apply(experiments[[exp]]$actions, 1, sum)))}))

# order levels by final reward
evolution$experiment <- factor(evolution$experiment,
                               levels=evolution[, .(finalregret=max(bandit_rates)-`reward (exploit)`[.N]) , by=experiment][order(-finalregret)]$experiment)

ggplot(melt(evolution, id.vars=c("episode", "experiment"), variable.name = "metric"), aes(episode, value, color=metric)) + 
  geom_line(size=1) + scale_x_log10() + facet_wrap(~ experiment, labeller = label_wrap_gen()) + ylab("") +
  scale_colour_discrete_qualitative(palette = "Dynamic") + 
  ggtitle("Exploration vs Exploitation for the various algorithms", subtitle = "Log x-axis to highlight differences")


Below, all the regret curves are shown together on a linear x-axis. Lower regret is better.

ggplot(evolution, aes(episode, max(bandit_rates)-`reward (exploit)`, colour=experiment)) + geom_line(size=1) +
  ylab("Regret") +
  scale_colour_discrete_qualitative(palette = "Dark 3") +
  ggtitle("Regret over time (loss/opportunity costs)", 
          subtitle = paste("avg over", mean(sapply(experiments, function(x) {return(x$n_experiments)})), "experiments"))