Upper Confidence Bound - example of Reinforement Learning

Intuition Lecture 172 https://www.udemy.com/machinelearning/learn/lecture/6456832

Import data and implement Random Algorthim Lecture 178 https://www.udemy.com/machinelearning/learn/lecture/6022152

Part 1 Implement/write UCB algorithm Lecture 179 https://www.udemy.com/machinelearning/learn/lecture/6022156

Part 2 Implement/write UCB algorithm Lecture 180 https://www.udemy.com/machinelearning/learn/lecture/6022160

Visualize the R UCB Lecture 181 https://www.udemy.com/machinelearning/learn/lecture/6022164

Thompson Sampling wikipedia https://papers.nips.cc/paper/4909-eluder-dimension-and-the-sample-complexity-of-optimistic-exploration.pdf

check working directory getwd()

Importing the dataset

dataset = read.csv('Ads_CTR_Optimisation.csv')

UCB compared to Thompson Sampling
Lecture 183 https://www.udemy.com/machinelearning/learn/lecture/6468288

knitr::include_graphics("UCB_compare_ThompsonSampling.png")

UCB compared to Thompson Sampling

What’s this about

knitr::include_graphics("Dataset_interpretation.png")

Pic from Python but same idea

Implementing RANDOM Selection - literally random for comparison

N = 10000
d = 10
ads_selected = integer(0)
total_reward = 0
for (n in 1:N) {
  ad = sample(1:10, 1)
  ads_selected = append(ads_selected, ad)
  reward = dataset[n, ad]
  total_reward = total_reward + reward
}

Let’s have a look at the total rewards calculated at random.

total_reward

## [1] 1216

Visualising the results - RANDOM

As we expect it’s Random!

hist(ads_selected,
     col = 'blue',
     main = 'Histogram of ads selections - RANDOM',
     xlab = 'Ads',
     ylab = 'Number of times each ad was selected')

# UCB Steps

knitr::include_graphics("UCB_Algorithm_Slide.png")

$UCB steps from lecture$

UCB steps from lecture

Implementing UCB - to compare to random

Lecture 179 https://www.udemy.com/machinelearning/learn/lecture/6022156

Lecture 180 https://www.udemy.com/machinelearning/learn/lecture/6022160

This is very interesting and a bit convoluted.

N = 10000 # step 2 bit
d = 10 # step 1 & 2 bit
ads_selected = integer(0) # step 3 bit we create an initial empty vector which the algorithm will fill
numbers_of_selections = integer(d) # step 1 bit we create a vector of size d to start
sums_of_rewards = integer(d) # step 1 bit same idea we need a place to start with vector size d 
total_reward = 0 
# Best described in Lecture 180 https://www.udemy.com/machinelearning/learn/lecture/6022160
for (n in 1:N) { # step 2 bit we'll go through earch round/row stating at 1 and go to N
  ad = 0 # step 3 bit initialize variable at zero
  max_upper_bound = 0 # step 3 bit to initialize the max upper bound at zero
  for (i in 1:d) { # step 2 bit loop through the ads/thing being tested - the first 10 rounds
    if (numbers_of_selections[i] > 0) { # step 2 calculated the Average Reward
      # note this won't be true until after the first 10 rounds ;-) as no add selection will be 
      # greater than zero until after the first pass (10 rounds)
      average_reward = sums_of_rewards[i] / numbers_of_selections[i]
      delta_i = sqrt(3/2 * log(n) / numbers_of_selections[i])
      upper_bound = average_reward + delta_i # step 2 upper end of Confidence Interval
    } else { # set upper bound to a very large number to support the 1st round action
        upper_bound = 1e400 # piece used for the initiazation of the process, where we are testing 
        # with the first 10 we'll loop through, basically the first 10 rounds are to seed our data
        # and in this seeding we don't have an upper bound. 
    }
    if (upper_bound > max_upper_bound) { # step 3 bit to readjust max upper bound on each round
      max_upper_bound = upper_bound
      ad = i
    }
  }
  ads_selected = append(ads_selected, ad) # step 3 bit to continuously add to vector
  numbers_of_selections[ad] = numbers_of_selections[ad] + 1 # this will keep things in the first if = 0 the first time we run and add through the for n loop and therefore we'll always skip to the else, then onto the 2nd if, until we get to the first time that ad 1 is used again, then we'll drop into the real work of the 1st if statement. 
  reward = dataset[n, ad] # the real reward (from dataset) as we trapes through the for n 
  sums_of_rewards[ad] = sums_of_rewards[ad] + reward
  total_reward = total_reward + reward
}

What’s going on there :-) see lecture 180, in addition to some notes I made in code. # Visulize the ads selected data Let’s look at the last 1000 items to see which add is looking the most common.

tail(ads_selected, n = 1000)

##    [1]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##   [24]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##   [47]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##   [70]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##   [93]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [116]  5  5  5  5  6  5  5  5  5  5  5  5  5  5  5  5  5  7  5  5  3 10  1
##  [139]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [162]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  8
##  [185]  9  5  5  5  5  5  5  5  5  5  5  5  5  5  5  4  5  1  5  5  5  2  8
##  [208]  8  8  8  8  5  5  5  5  5  5  5  5  5  5  1  5  8  5  5  5  5  5  5
##  [231]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [254]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [277]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [300]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [323]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [346]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [369]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [392]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [415]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [438]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [461]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [484]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [507]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [530]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [553]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [576]  5  5  5  5  5  5  5  5  5  5  5  5  5  7  5  6  5  5  4  5  3 10  9
##  [599]  5  5  2  5  5  1  8  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [622]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [645]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [668]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [691]  5  5  5  5  5  5  8  5  5  5  5  5  5  5  5  5  5  5  7  7  7  7  7
##  [714]  5  7  1  5  4  4  4  4  4  4  4  4  4  4  5  2  2  2  2  2  2  2  2
##  [737]  2  2  2  2  2  2  8  5  2  2  2  2  2  9  5  5  5  5  5  5  5  5  5
##  [760]  5  5  5  5  5  5  5  5  5  5  6  5  3 10  4  4  4  4  4  5  2  1  8
##  [783]  8  8  8  5  8  5  4  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [806]  5  5  5  5  5  5  5  5  8  5  7  5  1  5  5  5  5  5  5  5  5  5  5
##  [829]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [852]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [875]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [898]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [921]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [944]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [967]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [990]  5  5  5  5  5  5  5  5  5  5  5

Let’s have a look at the total rewards calculated with UCB. It’s much better.

total_reward

## [1] 2178

Visualising the results

hist(ads_selected,
     col = 'blue',
     main = 'Histogram of ads selections',
     xlab = 'Ads',
     ylab = 'Number of times each ad was selected')

=========================
Github files; https://github.com/ghettocounselor

Useful PDF for common questions in Lectures;
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf

Upper Confidence Bound in R