The recommendations to give to a customer at a decision moment can be viewed as a “Multi-Armed Bandit” problem, with the different offers being the bandits. The exploration/exploitation (or learn/earn) dilemma is addressed in many papers, most of which are dense with formulas and terminology.
The theoretical approaches usually ignore a lot of real-world effects: real propositions are not stationary but change over time, possibly have seasonal effects, compete with each other, and the population itself may drift.
Here the bandit success rates are kept fixed.
library(data.table)
library(knitr)
library(ggplot2)
library(colorspace)
library(Matrix) # for sparse matrix
library(nnet) # for which.is.max
bandit_rates <- c(0.1, 0.5, 0.6, 0.8, 0.3, 0.25, 0.6, 0.45, 0.75, 0.65)
#bandit_rates <- round(runif(n = 10, 0, 0.2), 2)
n_bandits <- length(bandit_rates)
bandit_rates
## [1] 0.10 0.50 0.60 0.80 0.30 0.25 0.60 0.45 0.75 0.65
The experiment framework (code not shown in the notebook) runs a given selection function for a number of episodes. It does this repeatedly for multiple experiments and then returns the averages.
The implementations all take the number of actions and the total reward so far per bandit, in the form of experiments x bandits matrices. They are called once per episode but can process all experiments at once; the ones that cannot be vectorized simply iterate over the experiments and collect the results.
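Since the framework code is not included, below is a minimal sketch of what a runExperiments with this interface could look like. The reward model (Bernoulli draws from the fixed bandit_rates), the cumulative averaging and the returned field names are assumptions, inferred from how the results are used later in the notebook.
runExperiments <- function(n_experiments, n_episodes, algo)
{
  # cumulative counts per experiment: times played and total reward, per bandit
  actions <- matrix(0, n_experiments, n_bandits)
  rewards <- matrix(0, n_experiments, n_bandits)
  avg_actions <- avg_rewards <- avg_successrates <- matrix(NA_real_, n_episodes, n_bandits)
  for (episode in seq(n_episodes)) {
    a <- algo(actions, rewards, episode, n_episodes)   # chosen bandit per experiment
    r <- rbinom(n_experiments, 1, bandit_rates[a])     # assumed Bernoulli reward
    idx <- cbind(seq(n_experiments), a)
    actions[idx] <- actions[idx] + 1
    rewards[idx] <- rewards[idx] + r
    # per-episode averages over all experiments
    avg_actions[episode, ] <- colMeans(actions)
    avg_rewards[episode, ] <- colMeans(rewards)
    avg_successrates[episode, ] <- colMeans(rewards/actions, na.rm = TRUE)
  }
  list(successrates = avg_successrates, actions = avg_actions,
       rewards = avg_rewards, n_experiments = n_experiments)
}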
Below is some sample data to try the functions out in the R console.
Sample actions for 5 experiments with 3 bandits (V1, V2 and V3) after 10 rounds (episodes).
| experiment | V1 | V2 | V3 |
|---|---|---|---|
| 1 | 0 | 4 | 6 |
| 2 | 8 | 0 | 2 |
| 3 | 9 | 0 | 1 |
| 4 | 2 | 5 | 3 |
| 5 | 1 | 1 | 8 |
Sample total rewards for those actions after 10 rounds
| experiment | V1 | V2 | V3 |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 6 | 0 | 2 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 2 | 2 |
| 5 | 0 | 1 | 5 |
Success rates for those example numbers
| experiment | V1 | V2 | V3 |
|---|---|---|---|
| 1 | NaN | 0.25 | 0.17 |
| 2 | 0.75 | NaN | 1.00 |
| 3 | 0.11 | NaN | 1.00 |
| 4 | 0.50 | 0.40 | 0.67 |
| 5 | 0.00 | 1.00 | 0.62 |
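These sample matrices can be constructed directly in the R console to try the selection functions out once they are defined below; computing the raw success rates reproduces the last table (the variable names here are just for illustration).
# sample data: 5 experiments x 3 bandits
sample_actions <- matrix(c(0, 4, 6,
                           8, 0, 2,
                           9, 0, 1,
                           2, 5, 3,
                           1, 1, 8), nrow = 5, byrow = TRUE)
sample_rewards <- matrix(c(0, 1, 1,
                           6, 0, 2,
                           1, 0, 1,
                           1, 2, 2,
                           0, 1, 5), nrow = 5, byrow = TRUE)
round(sample_rewards / sample_actions, 2)  # raw success rates, NaN where a bandit was never played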
Random selection as a baseline.
random <- function(actions, rewards, episode, n_episodes)
{
a <- sample(seq(n_bandits), nrow(actions), replace = T)
return(a)
}
The Greedy algorithm simply selects the bandit with the highest success rate so far, after a trivial “Laplace smoothing” (adding 0.5 to the rewards and 1 to the actions). Ties are not broken: if bandits have the same propensity, which.max biases towards the ones listed first.
greedy <- function(actions, rewards, episode, n_episodes)
{
successrates <- (0.5+rewards)/(1+actions)
a <- apply(successrates, 1, which.max)
return(a)
}
As Greedy, but ties are broken at random.
greedy_tie_breaker <- function(actions, rewards, episode, n_episodes)
{
successrates <- (0.5+rewards)/(1+actions)
a <- apply(successrates, 1, which.is.max) # which.is.max randomizes between ties
return(a)
}
A fancy name for a very simple strategy: each time, with a probability of “epsilon” (e.g. 10% or less), an action is chosen at random instead of by propensity. This makes sure we keep exploring, which addresses both the cold start and the population drift problems, but the opportunity costs are fairly high.
Epsilon-Greedy keeps exploring at the same rate, which is good if the bandits are non-stationary, but if they are more stable there is an expected loss.
epsilon_greedy <- function(actions, rewards, episode, n_episodes, epsilon = 0.1)
{
r <- random(actions, rewards, episode, n_episodes)
g <- greedy_tie_breaker(actions, rewards, episode, n_episodes)
p <- runif(nrow(actions))
a <- ifelse(p <= epsilon, r, g)
return(a)
}
Decayed Epsilon-Greedy reduces the epsilon factor over time, linearly down to zero at the last episode.
decayed_epsilon_greedy <- function(actions, rewards, episode, n_episodes)
{
epsilon_greedy(actions, rewards, episode, n_episodes, epsilon = 0.1*(n_episodes - episode)/n_episodes)
}
Not commonly seen in the literature, but the idea is to make the probability of choosing an action proportional to its estimated success rate. An article on analyticsvidhya mentions this as “Softmax Exploration”, although the softmax function would only be necessary when ranking on something other than a plain probability p, e.g. p*V for value-weighted prioritization (to map it nicely back to a 0-1 interval).
Roulette wheel selection keeps offering actions that are unlikely to give the best reward, but converges to the true propensities rather quickly.
roulette_wheel <- function(actions, rewards, episode, n_episodes)
{
successrates <- (0.5+rewards)/(1+actions)
rw <- function(probs)
{
return(sample(1:length(probs), 1, prob = probs))
}
a <- apply(successrates, 1, rw)
return(a)
}
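For reference, a minimal sketch of the softmax-style variant mentioned above; it is not used in the experiments below, and the temperature parameter tau is a hypothetical choice (lower tau behaves more greedily, higher tau more randomly).
softmax_exploration <- function(actions, rewards, episode, n_episodes, tau = 0.1)
{
  successrates <- (0.5+rewards)/(1+actions)
  a <- apply(successrates, 1, function(p) {
    w <- exp(p/tau)                             # softmax weights per bandit
    sample(seq(length(w)), 1, prob = w/sum(w))  # sample one bandit proportional to weight
  })
  return(a)
}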
UCB considers the uncertainty of the probabilities: if you have a lot of evidence you are pretty sure about the propensities, but when an offer is still new and hasn’t been offered a lot yet, the uncertainty is big. The UCB algorithm establishes an optimistic upper bound for the propensity and uses that for ranking.
In UCB1 the success rate \(\frac{r_i}{a_i}\) is augmented with an upper bound \(\sqrt{\frac{2 \ln(a)}{a_i}}\) that estimates the uncertainty. Here \(a\) is the total number of actions taken so far (one per episode) and \(a_i\) the number of times bandit \(i\) has been played.
See e.g. https://www.cs.bham.ac.uk/internal/courses/robotics/lectures/ucb1.pdf. This sets an upper bound on the success rate: optimism in the face of uncertainty.
UCB only slowly converges to a low regret value because the confidence bounds are pretty wide initially, so all bandits get opportunities. This optimism about the success rates translates into high regret in the beginning.
ucb1 <- function(actions, rewards, episode, n_episodes)
{
successrates <- (0.5+rewards)/(1+actions)
confbound <- t(apply(actions, 1, function(arow) { return (sqrt(2*log(episode+1)/(arow+1e-10))) }))
a <- apply(successrates+confbound, 1, which.max)
return(a)
}
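To illustrate how quickly that optimism shrinks with evidence (the numbers are just an example):
ucb_width <- function(t, a_i) sqrt(2*log(t + 1)/a_i)  # width of the UCB1 confidence term
ucb_width(10, 1)      # ~2.19: a barely-tried bandit still looks very attractive
ucb_width(3000, 600)  # ~0.16: with lots of evidence the added optimism is small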
Often positioned as a solution to cold start; what it does is take a weighted average of the observed bandit propensity and some a priori propensity, weighted by the amount of evidence. So it initially starts at the prior propensity and eventually converges to the observed bandit propensity.
\[p_{smoothed} = p_{apriori}\frac{E_{apriori}}{E_{apriori} + a_i} + \frac{r_i}{a_i}\frac{a_i}{E_{apriori} + a_i}\]
This addresses the cold start problem, but it has trouble dealing with changing preferences in the population: a propensity might go down, and the algorithm may never discover that a different population segment has become interested, because the offer is no longer presented to them.
The a priori rate is set to 0.5 and the a priori evidence to 20% of the total number of episodes.
propensity_smoothing <- function(actions, rewards, episode, n_episodes)
{
evidence_apriori <- floor(0.2*n_episodes)
successrate_apriori <- 0.5
successrates <- (0.5+rewards)/(1+actions)
smoothed <- successrate_apriori*evidence_apriori/(evidence_apriori+actions) + successrates*actions/(evidence_apriori+actions)
a <- apply(smoothed, 1, which.max)
return(a)
}
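A quick numeric illustration of that convergence, assuming a bandit with an observed success rate of 0.8 and the settings above for a 3000-episode run (prior 0.5, a priori evidence floor(0.2*3000) = 600):
smoothed_rate <- function(a_i, p_obs, p_prior = 0.5, e_prior = 600)
  p_prior*e_prior/(e_prior + a_i) + p_obs*a_i/(e_prior + a_i)
smoothed_rate(100, 0.8)   # ~0.54: still dominated by the prior
smoothed_rate(2000, 0.8)  # ~0.73: mostly the observed propensity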
With Thompson Sampling the uncertainty is modeled explicitly using a probability distribution. The idea is similar to UCB, but instead of using the upper bound, we sample from this distribution of propensities. When there is a lot of evidence, propensities are sampled from a really narrow range, but in the beginning the distribution is much more spread out and the sampled propensities take on a much broader range. The Beta distribution is the two-parameter distribution used for this, with its parameters directly reflecting the evidence and rewards: \(\mathrm{Beta}(1+r_i,\ 1+a_i-r_i)\).
Thompson sampling explores quickly but then backs off from exploring once there is sufficient evidence.
thompson_sampling <- function(actions, rewards, episode, n_episodes)
{
ts <- function(a, r)
{
# get the probabilities one by one via the beta distrib with arguments (pos+1, neg+1), then return the bandit with the max prob
probs <- sapply(seq(length(a)), function(j) {return(rbeta(1, r[j]+1, a[j]-r[j]+1))})
return(which.max(probs))
}
# iterate over all experiments
a <- sapply(seq(nrow(actions)), function(i) {return(ts(actions[i,], rewards[i,]))})
return(a)
}
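A small illustration of how the spread of the sampled propensities narrows as evidence accumulates, for a bandit with a 60% observed success rate (numbers purely as an example):
# 90% interval of sampled propensities after 10 plays vs after 1000 plays
quantile(rbeta(10000, 1 + 6,   1 + 10 - 6),     c(0.05, 0.95))  # wide: still a lot of exploration
quantile(rbeta(10000, 1 + 600, 1 + 1000 - 600), c(0.05, 0.95))  # narrow: mostly exploitation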
Now run each algorithm for a number of episodes, across multiple experiments.
experiments <- list( "Random Baseline" = random,
"Greedy (w default Laplace smoothing)" = greedy,
"Epsilon Greedy" = epsilon_greedy,
# "Greedy breaking ties at random" = greedy_tie_breaker,
# "Decayed Espsilion Greedy" = decayed_epsilon_greedy,
"Propensity Smoothing" = propensity_smoothing,
"Roulette Wheel" = roulette_wheel,
"Thompson Sampling" = thompson_sampling,
"Upper Confidence Bounds (UCB1)" = ucb1)
for (exp in names(experiments))
{
algo <- experiments[[exp]]
experiments[[exp]] <- c("algo" = algo, runExperiments(n_experiments=1000, n_episodes=3000, algo))
propensities <- melt(data.table(experiments[[exp]]$successrates)[, episode := seq(.N) ], id.vars=c("episode"),
variable.name = "bandit", value.name = "propensity")
actions <- melt(data.table(experiments[[exp]]$actions)[, episode := seq(.N) ], id.vars=c("episode"),
variable.name = "bandit", value.name = "frequency")
rewards <- melt(data.table(experiments[[exp]]$rewards)[, episode := seq(.N) ], id.vars=c("episode"),
variable.name = "bandit", value.name = "reward")
plotdata <- merge(propensities, merge(actions, rewards, by=c("episode", "bandit")), by=c("episode", "bandit"))
# show how the propensities of the agents evolve over time
print(ggplot(na.omit(plotdata),
aes(episode, propensity, colour=bandit, size=frequency)) + geom_line() +
ggtitle("Propensities over Time",
subtitle = paste(exp, ", averaged over", experiments[[exp]]$n_experiments, "experiments")))
}
Different algorithms reach the actual success rates at different speeds, depending on how much they explore. They also differ in how quickly and how much they exploit the discovered success rates. Both metrics are shown below; note the x-axis is on a log scale to highlight the differences.
evolution <- rbindlist(lapply(names(experiments), function(exp) {
return(data.table( experiment = exp,
episode = seq(nrow(experiments[[exp]]$successrates)),
`rmse (explore)` = apply(experiments[[exp]]$successrates, 1, function(x) {
return(sqrt(mean((x - bandit_rates)^2, na.rm=T)))}),
`reward (exploit)` = apply(experiments[[exp]]$rewards,1,sum) /
apply(experiments[[exp]]$actions, 1, sum)))}))
# order levels by final reward
evolution$experiment <- factor(evolution$experiment,
levels=evolution[, .(finalregret=max(bandit_rates)-`reward (exploit)`[.N]) , by=experiment][order(-finalregret)]$experiment)
ggplot(melt(evolution, id.vars=c("episode", "experiment"), variable.name = "metric"), aes(episode, value, color=metric)) +
geom_line(size=1) + scale_x_log10() + facet_wrap(~ experiment, labeller = label_wrap_gen()) + ylab("") +
scale_colour_discrete_qualitative(palette = "Dynamic") +
ggtitle("Exploration vs Exploitation for the various algorithms", subtitle = "Log x-axis to highlight differences")
# ylim(c(0,sqrt(mean((0.5-bandit_rates)^2)))
Finally, all regret curves are shown together, on a normal (linear) x-axis. Lower regret is better.
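Regret is computed as the difference between the best possible success rate and the average reward per action obtained so far:
\[\mathit{regret}_t = \max_i p_i - \frac{\sum_i r_i(t)}{\sum_i a_i(t)}\]
where \(p_i\) are the true bandit rates and \(r_i(t)\), \(a_i(t)\) the cumulative rewards and actions for bandit \(i\) up to episode \(t\).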
ggplot(evolution, aes(episode, max(bandit_rates)-`reward (exploit)`, colour=experiment)) + geom_line(size=1) +
ylab("Regret") +
scale_colour_discrete_qualitative(palette = "Dark 3") +
ggtitle("Regret over time (loss/opportunity costs)",
subtitle = paste("avg over", mean(sapply(experiments, function(x) {return(x$n_experiments)})), "experiments"))