The fundamental idea of reinforcement learning comes from Thorndike’s cat puzzle box experiment. Yes, another cat in a box. Who wouldn’t love cats?
Behaviour changes because of consequences.
There are two types of RL tasks: episodic ones, such as tic-tac-toe, which have a terminal state, and continuing ones, such as a smart thermostat that runs continuously to tune the temperature and keep people happy (or irate when it’s set wrong).
RL challenges:
The key to artificial intelligence has always been the representation. - Jeff Hawkins
The “explore-exploit” trade-off, tackled with the principle of “optimism in the face of uncertainty”.
Smarter A/B testing can benefit from the Multi-Armed Bandit (MAB) framework; drug trials are another classic application.
Algorithms:
* round robin
* greedy
* \(\epsilon\)-greedy (a quick sketch follows this list)
* Upper Confidence Bound (UCB)
* Contextual Bandit (LinUCB)
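For a concrete feel of the simplest strategies, here is a minimal R sketch of \(\epsilon\)-greedy arm selection; the success counts x, pull counts n, and the value of epsilon are made-up examples, not from any particular experiment.
# Epsilon-greedy: explore with probability epsilon, otherwise exploit.
# x: successes per arm (made-up), n: pulls per arm (made-up).
epsilon.greedy <- function(x, n, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample(seq_along(x), 1)    # explore: pick a random arm
  } else {
    which.max(x / n)           # exploit: pick the best observed success rate
  }
}
epsilon.greedy(c(3, 5, 2), c(10, 10, 10))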
Regret quantifies the loss: the “price of information”, or opportunity loss. Loosely speaking, it is another form of \(y-\hat{y}\).
Discounted return \[ G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \]
Greedy and \(\epsilon\)-greedy have linear regret \(L_{t} \ge \mathrm{Const} \cdot T\).
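As a quick numeric illustration of the discounted return above, a small R sketch with a made-up reward sequence and discount factor:
# Discounted return: G_t = sum over k of gamma^k * R_{t+k+1} (made-up rewards).
rewards <- c(1, 0, 2, 1)    # R_{t+1}, R_{t+2}, ...
gamma <- 0.9
G <- sum(gamma^(seq_along(rewards) - 1) * rewards)
G    # 1 + 0.9*0 + 0.81*2 + 0.729*1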
I’ll focus more on UCB1 (classic or Bayesian) and Thompson Sampling, as these are the two main techniques used for A/B/n testing.
UCB - Upper Confidence Bound algorithm \[ a_{t} = \operatorname{argmax}_{a} \left\{ \hat{r}_{a} + \sqrt{\frac{2 \log t}{n_{a}}} \right\} \] UCB1 achieves logarithmic regret \(L_{t} \le \mathrm{Const} \cdot \log T\).
To implement UCB1:
# x: successes of trials, for instance c(3, 5, 2)
# n: total number of pulls per arm, for instance c(10, 10, 10)
# otherwise interpret x as the cumulative reward from each arm
# t: the time period; with cumulative x and n, the total number of pulls so far
ucb1 <- function(x, n, t = sum(n)) {
  ucb <- x / n + sqrt(2 * log(t) / n)
  which.max(ucb)    # index of the arm to pull next
}
choose.arm <- ucb1(c(3, 5, 2), c(10, 10, 10))
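To see the selection rule in action, here is a small simulation with three Bernoulli arms; the true success rates below are assumptions chosen purely for illustration:
# Simulated bandit run using ucb1() above; p.true is a made-up ground truth.
set.seed(1)
p.true <- c(0.3, 0.5, 0.2)
x <- rbinom(3, 1, p.true)    # pull each arm once to initialise
n <- rep(1, 3)
for (t in 4:1000) {
  a <- ucb1(x, n, t)
  x[a] <- x[a] + rbinom(1, 1, p.true[a])
  n[a] <- n[a] + 1
}
n    # pulls should concentrate on arm 2, the arm with the highest true rate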
Posterior Sampling \[ \hat\theta_{k} \sim \mathrm{Beta}(\alpha_{k}, \beta_{k}) \]
\[ \theta = \max_{k}(\hat\theta_{k}) \] \[ k = \operatorname{argmax}_{k}(\hat\theta_{k}) \] \[ \mathrm{reward}_{t} \sim \mathrm{Bernoulli}(p = \theta) \] \[ \alpha_{k} \leftarrow \alpha_{k} + \mathrm{reward}, \quad \beta_{k} \leftarrow \beta_{k} + 1 - \mathrm{reward} \] \(k\) denotes the chosen arm.
# x: successes of trials, for instance c(3, 5, 2)
# n: total number of trials, for instance c(10, 10, 10)
# otherwise interpret as
# n being the number of pulls of each bandit arm
# x being the reward from each arm
# default is the beta-bernoulli model
# normal-normal and gamma-poisson alternatives in comments
# returns the estimated probability that each arm is the best
thompson <- function(x, n, prior.alpha = 1, prior.beta = 1) {
  k <- length(x)
  n.draw <- 10000
  ans <- matrix(nrow = n.draw, ncol = k)
  for (i in 1:k) {
    ans[, i] <- rbeta(n.draw, x[i] + prior.alpha, n[i] - x[i] + prior.beta)
    # ans[, i] <- rnorm(n.draw, mean = x[i]/n[i], sd = x[i]/sqrt(n[i]))
    # ans[, i] <- rgamma(n.draw, shape = x[i] + 1, scale = 1/n[i])
  }
  w <- table(factor(max.col(ans), levels = 1:k))
  as.vector(w / sum(w))
}
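A usage sketch with the same made-up counts as in the comments above; the returned vector can be read as the probability that each arm is best, which also serves as a traffic-allocation weight in A/B/n testing:
thompson(c(3, 5, 2), c(10, 10, 10))
# arm 2 typically receives the largest weight, so it would get the most traffic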
[image credit](https://dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits)
A value function is a function of states, or of state-action pairs, and it can be learned from experience (the Bellman equations follow below). A policy \(\pi\) is a mapping from states to the probability of selecting each available action.
\[ V(x_{0}) = \max_{\{a_{t}\}} \sum_{t=0}^{\infty}\beta^{t}F(x_{t}, a_{t}) \]
\[ V(x) = \max_{a} \{F(x, a) + \beta V(T(x, a))\} \]
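A minimal value-iteration sketch of that recursion, run on a tiny made-up deterministic problem; the reward table F(x, a), the transition table T(x, a), and \(\beta\) are all assumptions for illustration only:
# Tiny made-up deterministic MDP: 3 states, 2 actions.
reward <- matrix(c(0, 1,        # F(x, a): reward for state x, action a
                   0, 2,
                   5, 0), nrow = 3, byrow = TRUE)
next.state <- matrix(c(1, 2,    # T(x, a): next state for state x, action a
                       2, 3,
                       3, 3), nrow = 3, byrow = TRUE)
beta <- 0.9
V <- rep(0, 3)
for (iter in 1:200) {
  V <- sapply(1:3, function(x) max(reward[x, ] + beta * V[next.state[x, ]]))
}
V    # approaches the fixed point of V(x) = max_a { F(x, a) + beta * V(T(x, a)) }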
\[ E[X] \approx \frac{1}{n} \sum_{i=1}^{n} x_{i} \]
\[ v_{\pi}(s) = E_{\pi} [G_{t} \mid S_{t}=s] \] Monte Carlo estimates this value by averaging the returns observed over many episodes.
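A bare-bones sketch of the Monte Carlo idea; the episode returns here are simulated from a made-up distribution, just to show that the value estimate for a state is the average of the returns sampled from it:
# Made-up returns G_t observed from a single start state s under policy pi.
set.seed(42)
episode.returns <- rnorm(500, mean = 3, sd = 1)    # stand-in for sampled returns
v.hat <- mean(episode.returns)    # Monte Carlo estimate of v_pi(s)
v.hat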
There is a trade-off between sample-efficient learning and generality.
Policy search doesn’t care about value functions, dynamic programming, or models; it only cares about finding the \(\theta\) that maximizes reward.
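As a minimal sketch of that idea (not the author’s code), a REINFORCE-style update for a two-armed Bernoulli bandit with a softmax policy over \(\theta\); the true arm probabilities and the learning rate are assumptions:
# Two-armed bandit with a softmax policy over theta (assumed setup).
set.seed(7)
p.true <- c(0.2, 0.8)    # assumed true reward probabilities
theta <- c(0, 0)         # policy parameters, one per arm
alpha <- 0.1             # learning rate
for (t in 1:2000) {
  pi.theta <- exp(theta) / sum(exp(theta))    # softmax action probabilities
  a <- sample(1:2, 1, prob = pi.theta)
  r <- rbinom(1, 1, p.true[a])
  grad <- -pi.theta
  grad[a] <- grad[a] + 1                      # gradient of log pi(a) w.r.t. theta
  theta <- theta + alpha * r * grad           # REINFORCE update
}
exp(theta) / sum(exp(theta))    # the policy should now strongly favour arm 2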
Thorndike, E. L. (1898). Animal Intelligence: An Experimental Study of the Associative Processes in Animals. https://archive.org/details/animalintelligen00thoruoft
Silver, D. Tutorial: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/XX.pdf
Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning.
Proposes the use of neural nets for Q-learning with the Rprop optimizer, reporting promising results.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). A Brief Survey of Deep Reinforcement Learning. IEEE SPM Special Issue on Deep Learning for Visual Understanding http://arxiv.org/abs/1708.05866
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.