The due date for this exam is Friday, February 24, by 2:00PM. Late submissions will not be accepted apart from exceptional circumstances. Consequently, you should plan on submitting before the due date.
This exam consists of 10 problems. The first 9 problems build on each other. Problem 10 consists of 5 parts, but can be completed without first solving problems 1–9. Each problem will be graded out of a maximum of 5 points.
For this exam you must provide all of your answers in a single file, either straight source code (.R), or an R markdown file (.Rmd), or a ‘knit’ markdown document. The R markdown document used to write this exam will be provided as a template. Regardless of what option you choose, you should delineate your answer for each question clearly. For example:
# ******************************************************************
# Problem 1)
# ******************************************************************
# Problem 2)
# [ Solution code goes here ]
etc.
If I run your source code (or knit your markdown file), it should run from beginning to end without producing any errors.
Your code should conform to the tidyverse R programming style guide,
available here for reference:
https://style.tidyverse.org/index.html.
You can refer to your notes, class lecture slides, Google, StackExchange, or other online material. However, you cannot post content from the exam or questions related to it on the internet (Discord, etc.), or consult with other students. Anyone caught violating this policy will be given an immediate zero for the exam.
Partial credit will be given, so if you are unsure of a solution or can’t get your code to work, you should include concise comments in your code that explain your thought process/approach.
The following background comes from Gureckis, T. M., & Love, B. C. (2015). “Computational reinforcement learning”. The Oxford handbook of computational and mathematical psychology, 99-117:
There are few general laws of behavior, but one may be that humans and other animals tend to repeat behaviors that have led to positive outcomes in the past and avoid those associated with punishment or pain. Such tendencies are on display in the behavior of young children who learn to avoid touching hot stoves following a painful burn, but behave in school when rewarded with toys. This basic principle exerts such a powerful influence on behavior, it manifests throughout our culture and laws. Behaviors that society wants to discourage are tied to punishment (e.g., prison time, fines, consumption taxes), whereas behaviors society condones are tied to positive outcomes (e.g., tax credits for fuel-efficient cars).
The scientific study of how animals use experience to adapt their behavior in order to maximize rewards is known as reinforcement learning (RL). Reinforcement learning differs from other types of learning behavior of interest to psychologists (e.g., unsupervised learning, supervised learning) since it deals with learning from feedback that is largely evaluative rather than corrective. A restaurant diner doesn't necessarily learn that eating at a particular business is "wrong," simply that the experience was less than exquisite. This particular aspect of RL – learning from evaluative rather than corrective feedback – makes it a particularly rich domain for studying how people adapt their behavior based on experience.
The history of RL can be traced to early work in behavioral psychology (Thorndike, 1911; Skinner, 1938). However, the modern field of RL is a highly interdisciplinary area at the crossroads of computer science, machine learning, psychology, and neuroscience. In particular, contemporary research on RL is characterized by detailed behavioral models that make predictions across a wide range of circumstances, as well as neuroscience findings that have linked aspects of these models to particular neural substrates. In many ways, RL today stands as one of the major triumphs of cognitive science in that it offers an integrated theory of behavior at the computational, algorithmic, and implementational (i.e., neural) levels (Marr, 1982).
For this exam, we will be exploring very simple models of human reinforcement learning. In particular, we will focus on learning in “multi-armed bandit” tasks. What is a multi-armed bandit?
You have probably heard of a slot machine. It’s a gambling device where you put in some money, pull a lever, and if you are lucky you win money. In Las Vegas (so the story goes), “one-armed bandit” is a slang term for a slot machine. One-armed, because the machine has a single lever that you pull. Bandit, because generally speaking it steals your money.
You can think of a multi-armed bandit as a row of slot machines. However, in the general case, each slot machine has a different payout rate: some machines are ‘luckier’ than others. Given a finite number of choices, the goal in this setting is to maximize your expected payout.
While abstract, multi-armed bandits are a useful analogy to a very large number of real-world scenarios. For example, medical doctors might have a choice of \(n\) different treatments available for a particular disease, but the effectiveness of each treatment varies and is not entirely known. Do you select a treatment that you are confident works moderately well, or do you try a different treatment that you don’t know as much about, but has the potential to be far more effective?
In machine learning, this tradeoff is known as the ‘exploration-exploitation’ dilemma. You need to explore new (and potentially suboptimal) options in order to learn about them, but you also need to exploit what you already know in order to maximize reward. You also navigate this tradeoff constantly in your daily life without realizing it. For example, do you go out to your favorite restaurant, or do you risk trying a new place that just opened up? Do you stay at your current job, where you might be unhappy but stable, or do you risk the unknown for the possibility of higher pay or more job satisfaction?
For more information about multi-armed bandits, see: https://en.wikipedia.org/wiki/Multi-armed_bandit
The code below provides a skeleton for a reinforcement learning experiment with a multi-armed bandit task. In this experiment, the learning agent faces a choice between 10 bandits on each trial. Each bandit provides a binary reward (either 0 or 1), but the probability of reward differs between the bandits. The goal for the learning agent is to maximize the total reward received.
On each trial, the agent selects one of the alternatives, and receives a randomly generated reward, with probability determined by the particular bandit they selected. Mathematically, let \(k \in \lbrace 1 \ldots 10\rbrace\) indicate the choice made on a given trial. \(r\) indicates the reward received on that trial, where
\[ r \sim \mathrm{Bernoulli}(\theta_k) \] and \(\theta\) is a vector of length 10 that defines the reward probability for each bandit. The following code provides a basic implementation of a 10-armed bandit task.
# Include this just once at the top of your code
set.seed(42)
# New estimate = old_estimate + step size * (target - old_estimate)
simulate_baseline_agent <- function(n_arms = 10, n_trials = 1000, alpha = 0.05) {
  # n_arms = number of bandits to choose from on each trial
  # n_trials = number of trials to simulate
  # alpha = step size (learning rate)
  bandit <- 1:n_arms                # bandit labels, returned below
  # Generate the true reward probability for each arm
  theta_true <- runif(n_arms)       # returned below
  theta_est <- rep(0.5, n_arms)     # estimated value of each arm, returned below
  for (i in 1:n_trials) {
    # Choose an action (bandit) at random
    k <- sample(1:n_arms, 1)
    # Generate a binary reward (0 or 1) according to the choice
    r <- as.numeric(runif(1) < theta_true[k])    # the target of the update
    # Move the estimate for the chosen bandit toward the observed reward
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])
  }
  data.frame(bandit, theta_true, theta_est)
}
table <- simulate_baseline_agent()
print(table)
## bandit theta_true theta_est
## 1 1 0.9148060 0.8893462
## 2 2 0.9370754 0.8982559
## 3 3 0.2861395 0.3518885
## 4 4 0.8304476 0.7751360
## 5 5 0.6417455 0.5059686
## 6 6 0.5190959 0.4935130
## 7 7 0.7365883 0.6031983
## 8 8 0.1346666 0.1344954
## 9 9 0.6569923 0.5841384
## 10 10 0.7050648 0.6392469
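As an aside, the reward draw inside the loop above (comparing runif(1) to theta_true[k]) is one way to sample from \(\mathrm{Bernoulli}(\theta_k)\); rbinom() gives an equivalent, more explicit draw. A small illustration, not part of the required skeleton:
theta_k <- 0.7                              # an illustrative reward probability
r1 <- as.numeric(runif(1) < theta_k)        # 1 with probability 0.7, else 0
r2 <- rbinom(1, size = 1, prob = theta_k)   # equivalent Bernoulli draw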
Note that in the code above, there are two obvious limitations as a theory of human or animal learning:
- The agent never uses the rewards it receives to update its estimates of the bandits' values, so it does not learn from feedback.
- The agent always selects its actions completely at random, so it never exploits what it has learned.
As part of this exam, you will address each of these limitations.
In this problem you will modify the code so that the agent learns from its feedback. In particular, we will implement a classic learning model called temporal difference learning or TD-learning.
Suppose the agent has an estimate for the value of each of the 10 bandits. Let’s call this estimate \(\hat{\theta}\). Note that this is actually a vector, so that \(\hat{\theta}_k\) represents the estimated value for the \(k\)-th alternative. On a particular trial, the agent selects alternative \(k\), and receives a reward \(r\) that is either 1 or 0. How should the agent update its beliefs about the value of alternative \(k\)?
According to the TD-learning rule, learning is driven by the difference between what the agent predicted, and what it observed. Mathematically, we have:
\[ \hat{\theta}_k \leftarrow \hat{\theta}_k + \alpha \left(r - \hat{\theta}_k\right) \]
In plain English, this says that the new estimate of the value equals the old estimate, plus a term proportional to the difference between what was observed and what was predicted, (\(r - \hat{\theta}_k\)). The parameter \(\alpha\) is called the learning rate. When \(\alpha = 0\), the update term vanishes and no learning occurs. When \(\alpha = 1\), the updated value is exactly equal to the most recent reward signal \(r\).
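For example, with \(\alpha = 0.05\), a current estimate of 0.5, and an observed reward of 1, the new estimate is \(0.5 + 0.05 \times (1 - 0.5) = 0.525\). The same update as a toy illustration in R:
alpha <- 0.05
theta_est_k <- 0.5   # current estimate for the chosen bandit
r <- 1               # observed reward
theta_est_k <- theta_est_k + alpha * (r - theta_est_k)   # now 0.525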
Using the code above as a starting point, create a new function
called simulate_td_random(). The agent should update its
beliefs about the value of each bandit, using the TD-learning rule.
Your function should take an additional argument, alpha
which determines the learning rate for the agent. The default value for
this argument should be specified as alpha = 0.05.
Your function should return a data frame that contains three columns:
- bandit, with the values 1 through 10
- theta_true, with the true reward probability for each bandit
- theta_est, with the estimated value (based on TD-learning) for each bandit at the end of the simulation

Note: Your agent will still select actions randomly, but will learn on the basis of the reward signal.
Some specific requirements:
# Your solution here
# Include this just once at the top of your code
set.seed(42)
# New estimate = old_estimate + step size * (target - old_estimate)
simulate_td_random <- function(n_arms = 10, n_trials = 1000, alpha = 0.05) {
  # n_arms = number of bandits to choose from on each trial
  # n_trials = number of trials to simulate
  # alpha = step size (learning rate)
  bandit <- 1:n_arms                # bandit labels, returned below
  # Generate the true reward probability for each arm
  theta_true <- runif(n_arms)       # returned below
  theta_est <- rep(0.5, n_arms)     # estimated value of each arm, returned below
  for (i in 1:n_trials) {
    # Choose an action (bandit) at random
    k <- sample(1:n_arms, 1)
    # Generate a binary reward (0 or 1) according to the choice
    r <- as.numeric(runif(1) < theta_true[k])    # the target of the update
    # TD-learning update for the chosen bandit
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])
  }
  data.frame(bandit, theta_true, theta_est)
}
table <- simulate_td_random()
print(table)
## bandit theta_true theta_est
## 1 1 0.9148060 0.8893462
## 2 2 0.9370754 0.8982559
## 3 3 0.2861395 0.3518885
## 4 4 0.8304476 0.7751360
## 5 5 0.6417455 0.5059686
## 6 6 0.5190959 0.4935130
## 7 7 0.7365883 0.6031983
## 8 8 0.1346666 0.1344954
## 9 9 0.6569923 0.5841384
## 10 10 0.7050648 0.6392469
Run your function from problem 1.
Generate a bar graph that shows the estimated value (estimated reward probability) for each bandit at the end of learning. Overlaid over the bars, also show plot markers that indicate the true reward probability for each bandit.
Specific requirements:
- Use ggplot() to construct your graph
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Bars show the estimated reward probability for each bandit;
# overlaid points show the true reward probability
p <- ggplot(table, aes(x = bandit)) +
  geom_col(aes(y = theta_est), fill = "#BB2649") +
  geom_point(aes(y = theta_true)) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Bandit", y = "Reward Probability")
print(p)
Modify your function simulate_td_random() so that it
keeps track of the total accumulated reward received by the agent at
each trial. For example, if the agent receives a reward on trials 1, 3,
and 5, then its total accumulated reward over the first five trials
should be 1, 1, 2, 2, 3.
The updated function should return a data frame with three columns:
- trial (1 \(\ldots\) n_trials)
- reward (the reward obtained on each trial, 0 or 1)
- accumulated_reward (the total accumulated reward on each trial)

# New estimate = old_estimate + step size * (target - old_estimate)
set.seed(42)
simulate_td_random <- function(n_arms = 10, n_trials = 1000, alpha = 0.05) {
  trial <- 1:n_trials                      # trial index, returned below
  reward <- rep(0, n_trials)               # reward obtained on each trial, returned below
  accumulated_reward <- rep(0, n_trials)   # running total of reward, returned below
  # Generate the true reward probability for each arm
  theta_true <- runif(n_arms)
  theta_est <- rep(0.5, n_arms)            # estimated value of each arm
  for (i in 1:n_trials) {
    # Choose an action (bandit) at random
    k <- sample(1:n_arms, 1)
    # Generate a binary reward (0 or 1) according to the choice
    r <- as.numeric(runif(1) < theta_true[k])
    # TD-learning update for the chosen bandit
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])
    reward[i] <- r
    if (i == 1) {
      accumulated_reward[i] <- r                               # first trial
    } else {
      accumulated_reward[i] <- accumulated_reward[i - 1] + r   # running total
    }
  }
  data.frame(trial, reward, accumulated_reward)
}
reward_table <- simulate_td_random()
print(reward_table)
## trial reward accumulated_reward
## 1 1 0 0
## 2 2 1 1
## 3 3 0 1
## 4 4 1 2
## 5 5 1 3
## 6 6 0 3
## 7 7 1 4
## 8 8 1 5
## 9 9 0 5
## 10 10 1 6
## ... (remaining rows omitted)
Run your function simulate_td_random() 100 times. Stack
together the results into one big data frame with three columns, and
100,000 rows (1000 trials \(\times\)
100 simulations).
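One compact way to do this is shown below as a sketch using purrr::map_dfr() (purrr is loaded with the tidyverse); an explicit for loop, as in the solution further down, works just as well:
# Run the simulation 100 times and row-bind the resulting data frames
results <- purrr::map_dfr(1:100, ~ simulate_td_random(n_trials = 1000))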
Once you've done that, assuming your results are stored in a variable called results, you can use the following tidyverse magic to get the average accumulated reward:
avg_results <- results %>%
  group_by(trial) %>%
  summarise(mean_accumulated_reward = mean(accumulated_reward))
Generate a line graph that shows how average accumulated reward increases over time.
Specific requirements:
- Use ggplot() to construct your graph
library(ggplot2)
# Run the simulation 100 times and stack the results into one big data frame
# (100,000 rows: 1000 trials x 100 simulations)
more_results <- simulate_td_random(n_arms = 10, n_trials = 1000, alpha = 0.05)
for (i in 1:99) {
  g <- simulate_td_random(n_arms = 10, n_trials = 1000, alpha = 0.05)
  more_results <- rbind(more_results, g)   # append this run's results
}
print(more_results)
## trial reward accumulated_reward
## 1 1 1 1
## 2 2 1 2
## 3 3 0 2
## 4 4 1 3
## 5 5 0 3
## 6 6 0 3
## 7 7 1 4
## 8 8 0 4
## 9 9 0 4
## 10 10 1 5
## ... (remaining rows omitted)
# Average the accumulated reward across the 100 simulations at each trial
avg_results <- more_results %>%
  group_by(trial) %>%
  summarise(mean_accumulated_reward = mean(accumulated_reward))

ggplot(avg_results, aes(x = trial, y = mean_accumulated_reward)) +
  geom_line() +
  xlab("Trial") +
  ylab("Mean accumulated reward")
Notice that so far, your agent is choosing its actions at random — it is exploring, but not exploiting what it has learned. In the reinforcement learning literature, extensive research has gone into how to optimally balance exploration and exploitation, as well as how best to model this tradeoff in human learning. We will consider a simple heuristic approach, called \(\epsilon\)-greedy action selection (\(\epsilon\) is the Greek letter epsilon). The idea is simple:
With probability \(\epsilon\), choose an action at random, and with probability \((1 - \epsilon)\) choose the action that currently has the highest estimated value.
Create a function called simulate_td_eps() that uses TD-learning and \(\epsilon\)-greedy action selection.
Note that in the case of a tie (several alternatives have the highest value), you should choose randomly between the tied options.
Try to find a value for \(\epsilon\) that maximizes the agent’s performance (you can just do this through trial and error, a complex search for the exact optimal value is not needed).
Update your graph from problem 3 to show data for both the random action selection and the \(\epsilon\)-greedy action selection mechanisms (using average performance over 100 simulations for each algorithm).
Additional requirements:
simulate_td_eps <- function(steps, epsilon) {
}
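As a rough illustration of how the \(\epsilon\)-greedy choice can be combined with the TD update, here is a minimal sketch, assuming the same setup as simulate_td_random() from problem 3 (the function and variable names here are illustrative, not the required solution):
simulate_td_eps_sketch <- function(n_arms = 10, n_trials = 1000, alpha = 0.05, epsilon = 0.1) {
  theta_true <- runif(n_arms)
  theta_est <- rep(0.5, n_arms)
  reward <- rep(0, n_trials)
  for (i in 1:n_trials) {
    if (runif(1) < epsilon) {
      # Explore: pick any bandit at random
      k <- sample(1:n_arms, 1)
    } else {
      # Exploit: pick the bandit with the highest estimate, breaking ties at random
      best <- which(theta_est == max(theta_est))
      k <- best[sample(length(best), 1)]
    }
    r <- as.numeric(runif(1) < theta_true[k])
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])
    reward[i] <- r
  }
  data.frame(trial = 1:n_trials, reward, accumulated_reward = cumsum(reward))
}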
\(\epsilon\)-greedy is just one possible approach to balancing exploration and exploitation. Another common approach uses the so-called “softmax” operator. If \(\hat{\theta}\) represents a vector storing the estimated values for each bandit, then the probability of choosing alternative \(k\) is given by:
\[P(\textrm{choice} = k) = \frac{e^{\beta \hat{\theta}_k}}{\sum_{j=1}^{n}e^{\beta \hat{\theta}_j}}\]
where \(\beta\) is a parameter that controls how random or deterministic the choices are. As \(\beta \rightarrow 0\), the probability for each choice approaches \(1/n\) (random action selection). As \(\beta \rightarrow \infty\), the probability of choosing the option with the highest value approaches 1 (deterministic action selection). Intermediate values balance exploration and exploitation.
Create a function called simulate_td_softmax that uses
TD learning and the softmax action selection mechanism. It should have
an additional argument beta.
Try to find a value for \(\beta\) that maximizes the agent’s performance (as before, you can just do this through trial and error, a complex search for the exact optimal value is not needed).
Update your graph from problem 4 to include data for all three approaches (TD-random, TD-epsilon, and TD-softmax).
simulate_td_softmax <- function () {
}
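A minimal sketch of the softmax choice rule over the current estimates, assuming a vector theta_est of estimated values and a parameter beta (the helper name here is illustrative):
softmax_choice <- function(theta_est, beta) {
  # Subtracting the max before exponentiating improves numerical stability
  # and does not change the resulting probabilities
  weights <- exp(beta * (theta_est - max(theta_est)))
  probs <- weights / sum(weights)
  # Sample one bandit with probability proportional to exp(beta * theta_est)
  sample(seq_along(theta_est), 1, prob = probs)
}
# Inside the trial loop of simulate_td_softmax(), one might then write:
# k <- softmax_choice(theta_est, beta)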
So far we have been using the TD-learning rule to model how the agent updates its beliefs. Given that we have been discussing Bayesian parameter estimation in class, it is natural to apply the same ideas to model learning in the bandit setting.
In particular, let's assume the agent seeks to learn the distribution \(p(\theta_k)\) for each bandit. We will use a Beta distribution as the prior, with parameters \(\alpha = \beta = 1\). Recall that this is equivalent to a uniform distribution over the interval \((0, 1)\).
After each choice, the agent receives a reward of 1 or 0. We can think of this as a coin flip experiment where the coin has an unknown bias, except now there are 10 coins (corresponding to 10 bandits) and so we need to keep track of the posterior distribution for each one. You will do this by keeping track of the count of heads and tails (reward and no-reward) for each bandit.
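For a single bandit, the conjugate update this describes is just count-keeping: starting from a Beta(a, b) prior, a reward increments a, a non-reward increments b, and the posterior is again a Beta distribution. A small illustrative sketch (the helper name is not part of the required solution):
# Posterior update for one bandit: Beta(a, b) prior plus a Bernoulli reward r
update_beta <- function(a, b, r) {
  c(a = a + r, b = b + (1 - r))
}
update_beta(1, 1, 1)   # after one reward:     a = 2, b = 1
update_beta(2, 1, 0)   # then one non-reward:  a = 2, b = 2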
Create a function called simulate_bayesian_agent that implements this idea. Note: we are no longer using TD-learning. In addition, for this problem, go back to choosing actions completely at random. You might use the function simulate_baseline_agent as your starting point.
Your function should return a data frame with 4 columns:
- bandit, with the values 1 \(\ldots\) 10
- theta_true, which stores the true value of \(\theta\) for each bandit
- a column labeled a and a column labeled b; these should store the shape parameters of the posterior distribution for each bandit at the end of the simulation. (We'll use a and b to avoid confusion with the \(\alpha\) and \(\beta\) parameters used earlier; there are only so many Greek letters.)

Run your function. Generate a plot that shows the posterior probability distributions \(p(\theta_k)\) for each bandit. Also include vertical dashed lines that show the true values for \(\theta\).
Requirements:
simulate_bayesian_agent <- function(n_arms = 10, n_trials = 100) {
  bandit <- 1:n_arms             # bandit labels, returned below
  # Generate the true reward probability for each arm
  theta_true <- runif(n_arms)    # returned below
  # Beta(1, 1) prior for every bandit (uniform over (0, 1))
  a <- rep(1, n_arms)            # 1 + number of rewards observed for each bandit
  b <- rep(1, n_arms)            # 1 + number of non-rewards observed for each bandit
  for (i in 1:n_trials) {
    # Choose an action (bandit) completely at random
    k <- sample(1:n_arms, 1)
    # Generate a binary reward (0 or 1) according to the choice
    r <- as.numeric(runif(1) < theta_true[k])
    # Bayesian updating: increment the success or failure count for bandit k
    a[k] <- a[k] + r
    b[k] <- b[k] + (1 - r)
  }
  data.frame(bandit, theta_true, a, b)
}
reward_table <- simulate_bayesian_agent()
print(reward_table)
## bandit theta_true a b
## 1 1 0.4750000 0.05 0
## 2 2 0.4750000 0.05 2
## 3 3 0.4750000 0.05 3
## 4 4 0.4750000 0.05 4
## 5 5 0.4750000 0.05 5
## 6 6 0.5250000 0.05 6
## 7 7 0.4512500 0.05 7
## 8 8 0.4512500 0.05 8
## 9 9 0.4286875 0.05 9
## 10 10 0.5012500 0.05 10
## 11 11 0.4750000 0.05 11
## 12 12 0.5487500 0.05 12
## 13 13 0.4512500 0.05 13
## 14 14 0.4750000 0.05 14
## 15 15 0.5012500 0.05 15
## 16 16 0.5250000 0.05 16
## 17 17 0.5487500 0.05 17
## 18 18 0.4786875 0.05 18
## 19 19 0.5012500 0.05 19
## 20 20 0.4761875 0.05 20
## 21 21 0.4072531 0.05 21
## 22 22 0.4761875 0.05 22
## 23 23 0.4547531 0.05 23
## 24 24 0.4523781 0.05 24
## 25 25 0.5261875 0.05 25
## 26 26 0.4797592 0.05 26
## 27 27 0.5012500 0.05 27
## 28 28 0.5498781 0.05 28
## 29 29 0.4523781 0.05 29
## 30 30 0.4750000 0.05 30
## 31 31 0.4368905 0.05 31
## 32 32 0.4557713 0.05 32
## 33 33 0.5261875 0.05 33
## 34 34 0.4829827 0.05 34
## 35 35 0.5723842 0.05 35
## 36 36 0.5088336 0.05 36
## 37 37 0.4797592 0.05 37
## 38 38 0.5213125 0.05 38
## 39 39 0.4650459 0.05 39
## 40 40 0.4998781 0.05 40
## 41 41 0.4748842 0.05 41
## 42 42 0.4786875 0.05 42
## 43 43 0.4833919 0.05 43
## 44 44 0.4547531 0.05 44
## 45 45 0.4320155 0.05 45
## 46 46 0.4104147 0.05 46
## 47 47 0.4511400 0.05 47
## 48 48 0.4592223 0.05 48
## 49 49 0.4320155 0.05 49
## 50 50 0.5713125 0.05 50
## 51 51 0.5937650 0.05 51
## 52 52 0.4512500 0.05 52
## 53 53 0.4362612 0.05 53
## 54 54 0.4286875 0.05 54
## 55 55 0.5427469 0.05 55
## 56 56 0.5156095 0.05 56
## 57 57 0.4398940 0.05 57
## 58 58 0.5057713 0.05 58
## 59 59 0.4072531 0.05 59
## 60 60 0.5640768 0.05 60
## 61 61 0.4604147 0.05 61
## 62 62 0.4898291 0.05 62
## 63 63 0.4804827 0.05 63
## 64 64 0.5358729 0.05 64
## 65 65 0.4644481 0.05 65
## 66 66 0.4653376 0.05 66
## 67 67 0.4417936 0.05 67
## 68 68 0.4564586 0.05 68
## 69 69 0.5590793 0.05 69
## 70 70 0.4912257 0.05 70
## 71 71 0.4666644 0.05 71
## 72 72 0.4368905 0.05 72
## 73 73 0.4836356 0.05 73
## 74 74 0.4952469 0.05 74
## 75 75 0.4933312 0.05 75
## 76 76 0.4686646 0.05 76
## 77 77 0.4285830 0.05 77
## 78 78 0.4650459 0.05 78
## 79 79 0.4571539 0.05 79
## 80 80 0.4594539 0.05 80
## 81 81 0.4373940 0.05 81
## 82 82 0.4864812 0.05 82
## 83 83 0.4452314 0.05 83
## 84 84 0.4342962 0.05 84
## 85 85 0.4155243 0.05 85
## 86 86 0.4704845 0.05 86
## 87 87 0.4697040 0.05 87
## 88 88 0.4469603 0.05 88
## 89 89 0.4746123 0.05 89
## 90 90 0.4420707 0.05 90
## 91 91 0.5311253 0.05 91
## 92 92 0.4125814 0.05 92
## 93 93 0.5008817 0.05 93
## 94 94 0.4621571 0.05 94
## 95 95 0.4890492 0.05 95
## 96 96 0.4758376 0.05 96
## 97 97 0.5145968 0.05 97
## 98 98 0.4229698 0.05 98
## 99 99 0.4888669 0.05 99
## 100 100 0.4447480 0.05 100
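The problem also asks for a plot of the posterior distributions \(p(\theta_k)\) with dashed vertical lines at the true values. Below is a minimal sketch of one way such a plot could be drawn; it assumes the agent tracks a Beta posterior for each arm, and the alpha_post and beta_post values are made-up placeholders rather than quantities returned by the function above.
library(ggplot2)

n_arms <- 10
theta_true <- runif(n_arms)
# Hypothetical Beta posterior parameters for each arm (placeholders for illustration)
alpha_post <- 1 + rbinom(n_arms, size = 20, prob = theta_true)  # 1 + successes out of 20 pulls
beta_post <- 1 + (20 - (alpha_post - 1))                        # 1 + failures out of 20 pulls

theta_grid <- seq(0, 1, length.out = 200)
posterior_df <- do.call(rbind, lapply(1:n_arms, function(k) {
  data.frame(
    bandit = k,
    theta = theta_grid,
    density = dbeta(theta_grid, alpha_post[k], beta_post[k])
  )
}))
true_df <- data.frame(bandit = 1:n_arms, theta_true = theta_true)

ggplot(posterior_df, aes(x = theta, y = density)) +
  geom_line() +
  geom_vline(data = true_df, aes(xintercept = theta_true), linetype = "dashed") +
  facet_wrap(~bandit) +
  labs(x = expression(theta[k]), y = "posterior density")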
Using a Bayesian inference algorithm instead of TD-learning does not avoid the problem of balancing exploration and exploitation. So far your algorithm has been selecting actions randomly.
One nice feature of Bayesian inference is that it explicitly represents uncertainty about the world. We can use this to guide exploration. A simple approach is the following: on each trial, the agent generates a random sample from the posterior distribution for each bandit, and then selects the alternative with the highest value according to these random samples.
Notice how this idea naturally balances exploration and exploitation—at the beginning of the simulation, each distribution is a uniform distribution, so its choices will be completely random. As the agent learns more about each bandit, its posterior distributions will get narrower, and so the random samples will be closer to the true values and its behavior will become more deterministic. In the machine learning literature, this approach is known as posterior sampling, or Thompson sampling. It is not necessarily the optimal solution to the exploration-exploitation tradeoff, but it often performs very well.
Modify your function simulate_bayesian_agent() to implement this idea.
In addition, modify your function so that it returns the reward and accumulated reward, in the same way that you did for problem 3.
simulate_bayesian_agent <- function(n_arms = 10, n_trials = 100, alpha = 0.05) {
  bandit <- 1:n_trials                    # trial index (returned as a column)
  reward <- rep(0, n_trials)              # reward obtained on each trial
  accumulated_reward <- rep(0, n_trials)  # running total of reward
  a <- rep(alpha, n_trials)               # learning rate, repeated for the output table
  b <- 1:n_trials                         # placeholder column (only b[1] is overwritten below)
  # Generate the true reward probability for each arm
  theta_true <- runif(n_arms)
  # Initialize the estimated reward probability of every arm at 0.5
  theta_est <- rep(0.5, n_arms)
  for (i in 1:n_trials) {
    # Choose an arm at random
    k <- sample(1:n_arms, 1)
    # Generate a binary reward (0 or 1) from the chosen arm's true probability
    r <- as.numeric(runif(1) < theta_true[k])
    # Update the running estimate for the chosen arm
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])
    # Record the estimate trajectory (note that this overwrites theta_true[i])
    theta_true[i] <- theta_est[k]
    reward[i] <- r
    if (i == 1) {
      accumulated_reward[i] <- r
      b[i] <- r
    } else {
      accumulated_reward[i] <- accumulated_reward[i - 1] + r
    }
  }
  data.frame(bandit, theta_true, a, b)
}
reward_table <- simulate_bayesian_agent()
print(reward_table)
## bandit theta_true a b
## 1 1 0.4750000 0.05 0
## 2 2 0.4750000 0.05 2
## 3 3 0.4750000 0.05 3
## 4 4 0.4750000 0.05 4
## 5 5 0.5250000 0.05 5
## 6 6 0.5012500 0.05 6
## 7 7 0.5012500 0.05 7
## 8 8 0.5250000 0.05 8
## 9 9 0.5250000 0.05 9
## 10 10 0.4512500 0.05 10
## 11 11 0.4987500 0.05 11
## 12 12 0.5261875 0.05 12
## 13 13 0.4750000 0.05 13
## 14 14 0.5261875 0.05 14
## 15 15 0.4998781 0.05 15
## 16 16 0.5498781 0.05 16
## 17 17 0.5012500 0.05 17
## 18 18 0.5261875 0.05 18
## 19 19 0.4987500 0.05 19
## 20 20 0.5723842 0.05 20
## 21 21 0.4786875 0.05 21
## 22 22 0.5047531 0.05 22
## 23 23 0.4738125 0.05 23
## 24 24 0.4748842 0.05 24
## 25 25 0.5011400 0.05 25
## 26 26 0.4998781 0.05 26
## 27 27 0.5001219 0.05 27
## 28 28 0.5012500 0.05 28
## 29 29 0.4748842 0.05 29
## 30 30 0.4750000 0.05 30
## 31 31 0.5295155 0.05 31
## 32 32 0.5250000 0.05 32
## 33 33 0.4760830 0.05 33
## 34 34 0.4751158 0.05 34
## 35 35 0.4512500 0.05 35
## 36 36 0.4522789 0.05 36
## 37 37 0.4296649 0.05 37
## 38 38 0.4581817 0.05 38
## 39 39 0.5238125 0.05 39
## 40 40 0.5530397 0.05 40
## 41 41 0.5487500 0.05 41
## 42 42 0.5011400 0.05 42
## 43 43 0.4852726 0.05 43
## 44 44 0.5213125 0.05 44
## 45 45 0.4786875 0.05 45
## 46 46 0.4987500 0.05 46
## 47 47 0.5476219 0.05 47
## 48 48 0.4610090 0.05 48
## 49 49 0.5238125 0.05 49
## 50 50 0.4952469 0.05 50
## 51 51 0.4379585 0.05 51
## 52 52 0.5204845 0.05 52
## 53 53 0.5476219 0.05 53
## 54 54 0.5047531 0.05 54
## 55 55 0.4944603 0.05 55
## 56 56 0.4760830 0.05 56
## 57 57 0.5437650 0.05 57
## 58 58 0.4522789 0.05 58
## 59 59 0.4660606 0.05 59
## 60 60 0.5702408 0.05 60
## 61 61 0.4795155 0.05 61
## 62 62 0.5702408 0.05 62
## 63 63 0.4427576 0.05 63
## 64 64 0.5917287 0.05 64
## 65 65 0.5621423 0.05 65
## 66 66 0.5340352 0.05 66
## 67 67 0.4555397 0.05 67
## 68 68 0.4761875 0.05 68
## 69 69 0.4697373 0.05 69
## 70 70 0.4296649 0.05 70
## 71 71 0.4081817 0.05 71
## 72 72 0.5753877 0.05 72
## 73 73 0.5165768 0.05 73
## 74 74 0.4327627 0.05 74
## 75 75 0.5966183 0.05 75
## 76 76 0.5013600 0.05 76
## 77 77 0.5573334 0.05 77
## 78 78 0.4962504 0.05 78
## 79 79 0.4206197 0.05 79
## 80 80 0.4495887 0.05 80
## 81 81 0.4907479 0.05 81
## 82 82 0.5417287 0.05 82
## 83 83 0.4714379 0.05 83
## 84 84 0.4978660 0.05 84
## 85 85 0.4762920 0.05 85
## 86 86 0.4771093 0.05 86
## 87 87 0.4523781 0.05 87
## 88 88 0.3877726 0.05 88
## 89 89 0.5294668 0.05 89
## 90 90 0.5162105 0.05 90
## 91 91 0.5667874 0.05 91
## 92 92 0.5229727 0.05 92
## 93 93 0.5146423 0.05 93
## 94 94 0.4889102 0.05 94
## 95 95 0.5468241 0.05 95
## 96 96 0.5144647 0.05 96
## 97 97 0.5029934 0.05 97
## 98 98 0.4611246 0.05 98
## 99 99 0.4778437 0.05 99
## 100 100 0.5404000 0.05 100
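For reference, here is a minimal sketch of what the posterior-sampling (Thompson sampling) idea described above could look like with Beta posteriors. The function name, the Beta(1, 1) priors, and the returned columns are illustrative assumptions rather than part of the solution above.
simulate_thompson_agent <- function(n_arms = 10, n_trials = 100) {
  theta_true <- runif(n_arms)   # true reward probability for each arm
  alpha_post <- rep(1, n_arms)  # Beta posterior: 1 + number of successes (Beta(1, 1) prior)
  beta_post <- rep(1, n_arms)   # Beta posterior: 1 + number of failures
  reward <- numeric(n_trials)
  for (i in 1:n_trials) {
    # Draw one sample from each arm's posterior and choose the arm with the largest draw
    samples <- rbeta(n_arms, alpha_post, beta_post)
    k <- which.max(samples)
    # Generate a binary reward from the chosen arm's true probability
    r <- as.numeric(runif(1) < theta_true[k])
    reward[i] <- r
    # Conjugate Beta-Bernoulli update of the chosen arm's posterior
    alpha_post[k] <- alpha_post[k] + r
    beta_post[k] <- beta_post[k] + (1 - r)
  }
  data.frame(trial = 1:n_trials, reward = reward, accumulated_reward = cumsum(reward))
}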
Generate one more plot (updating your results from problem 6) that shows the average accumulated reward for all 4 models considered: TD-random, TD-epsilon, TD-softmax, and Bayesian.
# Your solution here
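One way to approach this is sketched below. The simulate_td_random(), simulate_td_epsilon(), and simulate_td_softmax() names are hypothetical stand-ins for whatever your earlier solutions are called, and the sketch assumes each function returns a data frame with an accumulated_reward column.
library(ggplot2)

# Average the accumulated reward over n_runs repetitions of one simulation function
average_accumulated_reward <- function(simulate_fn, model_name, n_runs = 100) {
  runs <- replicate(n_runs, simulate_fn()$accumulated_reward)  # n_trials x n_runs matrix
  data.frame(
    trial = seq_len(nrow(runs)),
    mean_accumulated_reward = rowMeans(runs),
    model = model_name
  )
}

results <- rbind(
  average_accumulated_reward(simulate_td_random, "TD-random"),
  average_accumulated_reward(simulate_td_epsilon, "TD-epsilon"),
  average_accumulated_reward(simulate_td_softmax, "TD-softmax"),
  average_accumulated_reward(simulate_bayesian_agent, "Bayesian")
)

ggplot(results, aes(x = trial, y = mean_accumulated_reward, color = model)) +
  geom_line() +
  labs(x = "Trial", y = "Average accumulated reward", color = "Model")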
Define \(\theta_1\) to be the probability that a given bandit produces a reward. Assume that \(\theta_1\) is unknown, but has a posterior probability distribution defined by a Beta distribution: \(p(\theta_1) = \mathrm{Beta}(\alpha = 7, \beta = 4)\).
Using numerical integration, what is the probability that \(\theta_1 > 0.5\)?
integrand <- function(theta) { dbeta(theta, 7, 4) }
result <- integrate(integrand, lower = 0.5, upper = 1)
print(result)
## 0.828125 with absolute error < 9.2e-15
Using the built-in cumulative distribution function (c.d.f.), what is the probability that \(\theta_1 > 0.5\)?
alpha <- 7
beta <- 4
theta <- pbeta(0.5, alpha, beta, lower.tail = FALSE)  # P(theta_1 > 0.5) is the upper tail
print(theta)
## [1] 0.828125
Using Monte Carlo simulation (using 1 million samples), what is the probability that \(\theta_1 > 0.5\)?
alpha <- 7
beta <- 4
monte <- rbeta(n = 1000000, alpha, beta)
theta <- mean(monte > 0.5)
print(theta)
## [1] 0.828405
Define \(\theta_2\) to be the probability that a different bandit produces a reward. Assume that the posterior for \(\theta_2\) is given by \(p(\theta_2) = \mathrm{Beta}(\alpha = 2, \beta = 2)\).
Using Monte Carlo simulation, what is the probability that \(\theta_1 > \theta_2\)?
alpha1 <- 7
beta1 <- 4
alpha2 <- 2
beta2 <- 2
monte1 <- rbeta(1000000, alpha1, beta1)
monte2 <- rbeta(1000000, alpha2, beta2)
compare <- mean(monte1 > monte2)
print(compare)
## [1] 0.685188
What is the equal-tailed 95% credible interval for \(\theta_1\)?
# Your solution here
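One way to compute the equal-tailed interval is with the Beta quantile function, taking the 2.5% and 97.5% quantiles of the posterior:
alpha <- 7
beta <- 4
credible_interval <- qbeta(c(0.025, 0.975), alpha, beta)
print(credible_interval)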