M1: Introduction

The fundamental idea of reinforcement learning comes from Thorndike’s cat puzzle box experiment, yes, another cat in a box. Who wouldn’t love cats.


Behaviour changes because of its consequences.

Two types of RL tasks: episodic, such as tic-tac-toe, which has a terminal state; and continuing, such as a smart thermostat that runs continuously to tune the temperature and keep peeps happy (or irate, when set wrong).

RL challenges: exploration vs. exploitation, temporal credit assignment, and representation/generalization; each is covered in its own module below.

The key to artificial intelligence has always been the representation. - Jeff Hawkins

M2: Exploration

The central tension is the “explore-exploit” trade-off; the guiding principle for resolving it is “optimism in the face of uncertainty”.

Bandit

Smarter A/B testing can benefit from the Multi-Armed Bandit (MAB) framework; adaptive drug trials are another application.

Algorithms:
* round robin
* greedy
* \(\epsilon\)-greedy (see the sketch after this list)
* Upper Confidence Bound (UCB)
* Contextual Bandit LinUCB
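
As a quick illustration of the simplest of these, here is a minimal \(\epsilon\)-greedy sketch in R; the success rates in `true.p`, the exploration rate `eps`, and the horizon are made-up values for illustration.

  # epsilon-greedy on a 3-armed Bernoulli bandit (illustrative values)
  set.seed(1)
  true.p <- c(0.3, 0.5, 0.2)    # hypothetical true success rates
  eps    <- 0.1                 # exploration rate
  x <- rep(0, 3)                # successes per arm
  n <- rep(0, 3)                # pulls per arm
  for (t in 1:1000) {
    if (runif(1) < eps || all(n == 0)) {
      arm <- sample(3, 1)                  # explore: pick a random arm
    } else {
      arm <- which.max(x / pmax(n, 1))     # exploit: best empirical mean
    }
    reward <- rbinom(1, 1, true.p[arm])
    x[arm] <- x[arm] + reward
    n[arm] <- n[arm] + 1
  }
  x / pmax(n, 1)                # empirical success rate per arm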

Regret and discounted return

Regret quantifies the loss, the “price of information”, or the opportunity cost of not always pulling the best arm: loosely, another form of \(y-\hat{y}\). Greedy and \(\epsilon\)-greedy have linear regret, \(L_{t} \ge \mathrm{Const} \cdot t\).

Discounted return: \[ G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \]
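
As a quick sanity check on the formula, a finite-horizon discounted return can be computed in R; the reward sequence and \(\gamma\) below are made up for illustration.

  # discounted return of a finite reward sequence (illustrative values)
  rewards <- c(1, 0, 0, 1, 1)   # hypothetical R_{t+1}, R_{t+2}, ...
  gamma   <- 0.9                # discount factor
  G <- sum(gamma^(seq_along(rewards) - 1) * rewards)
  G                             # 1 + 0.9^3 + 0.9^4 = 2.3851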

A/B/n experiments

I’ll focus more on UCB1 (classical or Bayesian) and Thompson Sampling, as these are the two main techniques used for A/B/n testing.

UCB - Upper Confidence Bound Algorithm \[ a_{t} = \operatorname{argmax}_{a} \left\{ \hat{r}_{a} + \sqrt{\frac{2 \log t}{n_{a}}} \right\} \] UCB1 achieves logarithmic regret \(L_{t} \le \mathrm{Const} \cdot \log t\)

To implement UCB1 in R:

  # x: successes per arm,         for instance c(3, 5, 2)
  # n: number of trials per arm,  for instance c(10, 10, 10)
  # alternatively, interpret n as the number of pulls of each bandit arm
  # and x as the cumulative reward collected from each arm
  # t: the current time step; with cumulative counts, t = sum(n)
  x <- c(3, 5, 2)
  n <- c(10, 10, 10)
  t <- sum(n)
  UCB <- x/n + sqrt(2*log(t)/n)   # empirical mean plus exploration bonus
  choose.arm <- which.max(UCB)    # pull the arm with the largest upper bound

Posterior Sampling \[ \hat\theta_{k} \sim \mathrm{Beta}(\alpha_{k}, \beta_{k}) \]

\[ \theta = \max_{k}(\hat\theta_{k}) \] \[ k = \operatorname{argmax}_{k}(\hat\theta_{k}) \] \[ reward_{t} \sim \mathrm{Bernoulli}(p = \theta) \] \[ \alpha_{k} \leftarrow \alpha_{k} + reward_{t}, \quad \beta_{k} \leftarrow \beta_{k} + 1 - reward_{t} \] where \(k\) is the chosen arm.

  # x: successes per arm,         for instance c(3, 5, 2)
  # n: number of trials per arm,  for instance c(10, 10, 10)
  # alternatively, interpret n as the number of pulls of each bandit arm
  # and x as the cumulative reward collected from each arm
  # default is the Beta-Bernoulli model;
  # Normal-Normal and Gamma-Poisson alternatives are in the comments
  # returns the share of posterior draws in which each arm is the best
  thompson <- function(x, n, prior.alpha=1, prior.beta=1) {
    k <- length(x)
    n.draw <- 10000
    ans <- matrix(nrow=n.draw, ncol=k)
    for(i in 1:k){
        ans[,i] <- rbeta(n.draw, x[i] + prior.alpha, n[i] - x[i] + prior.beta)
        # ans[,i] <- rnorm(n.draw, mean = x[i]/n[i], sd = x[i]/sqrt(n[i]))
        # ans[,i] <- rgamma(n.draw, shape = x[i] + 1, scale = 1/n[i])
    }
    w <- table(factor(max.col(ans), levels=1:k))   # how often each arm wins a draw
    as.vector(w/sum(w))
  }
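
For example, with the illustrative counts from the comments above, the returned proportions estimate the probability that each arm is the best and can be used directly as traffic-allocation weights for the next batch.

  # share of posterior draws in which each arm is the best;
  # arm 2 should win most often with these counts
  thompson(x = c(3, 5, 2), n = c(10, 10, 10))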

[image credit](https://dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits)

M3: Temporal Credit Assignment

Markov decision process

value function and policy

A value function is a function of states, or of state-action pairs, and it can be learnt from experience. A policy \(\pi\) is a mapping from each state to the probability of selecting each available action.
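
The Bellman equation ties the value of a state under \(\pi\) to the values of its successor states, writing \(p(s', r \mid s, a)\) for the environment dynamics:

\[ v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_{\pi}(s') \big] \]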

M4: Dynamic Programming

\[ V(x_{0}) = \max_{\{a_{t}\}_{t=0}^{\infty}} \sum_{t=0}^{\infty} \beta^{t} F(x_{t}, a_{t}) \]

\[ V(x) = \max_{a} \{ F(x, a) + \beta V(T(x, a)) \} \]
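
To make the recursion concrete, here is a minimal value-iteration sketch in R for a tiny made-up deterministic chain; the states, actions ("stay" or "move right"), rewards, and transition rule are all invented for illustration.

  # value iteration on a 4-state deterministic chain (illustrative MDP):
  # from each state you may "stay" (reward 0) or "move right" (reward 1);
  # state 4 is absorbing.
  beta   <- 0.9                  # discount factor
  V      <- rep(0, 4)            # initial value estimates
  stay   <- 1:4                  # T(x, stay)
  move   <- c(2, 3, 4, 4)        # T(x, right)
  r.stay <- rep(0, 4)            # F(x, stay)
  r.move <- c(1, 1, 1, 0)        # F(x, right)
  for (iter in 1:100) {
    V <- pmax(r.stay + beta * V[stay],    # value of staying
              r.move + beta * V[move])    # value of moving right
  }
  round(V, 3)                    # converges to c(2.71, 1.9, 1, 0)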

M5: Model-free learning

Monte Carlo Learning

An expectation can be approximated by a sample mean: \[ E[X] \approx \frac{1}{n} \sum_{i=1}^{n} x_{i} \]

\[ v_{\pi}(s) = E_{\pi} [G_{t} \mid S_{t}=s] \] Monte Carlo estimates this by averaging the returns observed across episodes.
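
A minimal sketch of the idea in R, assuming we already have a vector `G` of returns sampled from visits to one particular state (the values are made up).

  # Monte Carlo estimate of v_pi(s): average the returns observed after visiting s
  G <- c(2.3, 1.7, 3.1, 2.0, 2.6)   # hypothetical sampled returns
  mean(G)                           # batch estimate: 2.34

  # equivalent incremental form: V <- V + (G_n - V) / n
  V <- 0
  for (n in seq_along(G)) V <- V + (G[n] - V) / n
  V                                 # matches mean(G)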

Temporal Difference Learning

M6: Representation and Generalization - Function Approximation

trade-off between sample-efficient learning and generality

Linear function approximators

Deep Q-Learning (DQN)

Double DQN

M7: Policy Gradient

Policy gradient methods don’t care about value functions, DP, or models; they only care about finding \(\theta\) that maximizes reward.
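
Concretely, writing the expected return as \(J(\theta)\), the standard score-function (REINFORCE-style) estimator of its gradient is

\[ \nabla_{\theta} J(\theta) = E_{\pi_{\theta}} \big[ G_{t} \, \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t}) \big] \]

so the parameters can be updated by stochastic gradient ascent on sampled trajectories.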

Further Readings