Using Dynamic Programming in Bayesian Decision Theory

```

References

BST 249 Course Note

Some code for visualization from Kazuki http://rpubs.com/kaz_yos/dyn-prog-1

The Game Setting

Before entering the game, the participant starts with \(n\) dollars. Two possible values of \(\theta \in \{\theta_1, \theta_2\}\) are presented to the participant for guessing. The game mechanism randomly picks a value \(\theta\) out of \(\{\theta_1, \theta_2\}\) with equal probability \(1/2\). Then, this \(\theta\) is used as the true underlying probability of being \(Success\) in a Bernoulli random number generator which generates two potential outcomes \(\{Success, Failure\}\). Whenever the participant has money (i.e. \(\geq 1\) dollars) on his hand, he has the option to spend 1 dollar to observe the next outcome (i.e. \(Success\) of \(Failure\)) from the Bernoulli random number generator. Also, at any time throughout the game, the participant has one chance to stop further observing and guess which of \(\{\theta_1, \theta_2\}\) is the true underlying \(\theta\) that was used in the game. If the participant’s guess on \(\theta\) is correct, he will receive a fixed amount of payoff (here we consider the payoff as \(n\) dollars, the same as the initial amount of money); otherwise nothing. In both cases, the game stops.

Aims

Our objective is to find the optimal decision that maximizes the expected utility. Here, the utility is defined as \[U = -Cost + payoff,\] where

\[payoff = \begin{cases}n & \mbox{ if guesses correctly;}\\ 0 & \mbox{ otherwise.}\end{cases}\]

Let \(i\) and \(j\) be the total numbers of \(Success\) and \(Failure\) that have been observed. Then, a particular decision \(d\) is a map from any possible \((i,j)\) to whether “continue” looking or “stop” (and guess), i.e.

\[d: \left((i,j): i+ j \leq n-1\right) \longrightarrow \{continue, stop\}.\]

Note that if \(i + j = n\), then the participant has used all the money to observe with no more money on hand. Then he has to “stop” and guess the value of \(\theta\) which is a deterministic action. Therefore, the possible values of \((i,j)\) that are worth considering is \(i + j \leq n-1\).

The space of decisions is thus a class of maps, i.e. \[\mathcal{D} = \{d: \left((i,j): i+ j \leq n-1\right) \longrightarrow \{continue, stop\}\}.\]

The optimal decision is \[d^* = \text{argmax}_{d \in \mathcal{D}} U(d),\] where \(U(d)\) is the expected utility.

Decomposition of the problem

This problem essentially comes down to a binary decision tree: at each observed \((i,j)\) where \(0 \leq i + j < n\) (including \(i = j = 0\)), we have the option to decide whether to continue looking or stop. At \((i,j)\) where \(i + j = n\), we have to stop. Thus, from layer 0 to layer \(n-1\), there are \(1 + 2 +\cdots + n = \frac{n(n+1)}{2}\) possible nodes for the participant to choose whether to continue or stop, which leads to \(2^{\frac{n(n+1)}{2}}\) different patterns (decisions). Therefore, for a small to moderate \(n\), we may use brute force approach (i.e. calculate the expected utility at all possible values of \((i,j)\) and compare).

However, when \(n\) is moderate to large, the number of calculations using brute force grows exponentially. To retain efficiency, we can use Dynamic Programming (DP) which saves the amount of computations by memorizing and reusing the quantities from previous steps. Thus, there is a computation time-space trade-off for Dynamic Programming and in most cases the burden of using moderate to large space is acceptable.

The Dynamic Programming approach for this game

First, we may adopt the objective prior belief of \(\theta\) being whether \(\theta_1\) or \(\theta_2\) as \(\pi = (1/2, 1/2)\). This prior is compatible according to the game mechanism and is further used for calculating the posterior of \(\theta\) being whether \(\theta_1\) or \(\theta_2\) after observing the data. Note that although the participant observes the entile sequence of \(Success\) and \(Failure\), the total number of \(Success\) \((i)\) and \(Failure\) \((j)\) are sufficient for obtaining the posterior.

Let \(d^*_h(i,j)\) be the optimal decision after observing \((i,j)\) where \(h = i+j\), \(U^*_h(i,j)\) be the optimal expected utility, \(U_h(i,j,C)\) is the optimal expected utility if we decide to continue looking, and \(U_h(i,j,S)\) is the optimal expected utility if we decide to stop. At each observed \((i,j)\) pair, if we decide to continue looking and observe a new \(Success\), then the optimal expected utility from then on is \(U^*_{h+1}(i+1,j)\). Similarly, if we decide to continue looking and observe a new \(Failure\), then the optimal expected utility from then on is \(U^*_{h+1}(i,j+1)\). Thus, by storing the values of \(U^*_{h+1}(i+1,j)\) and \(U^*_{h+1}(i,j+1)\), we may easily compute the value of \(U_h(i,j,C)\). This is the key idea of Dynamic Programming.

The Algorithm

For \(i + j = n\), since we have to stop after observing \((i,j)\), we set \[U^*_n(i,j) = U_n(i,j,S) = -n + payoff \times \max_{a \in \{\theta_1,\theta_2\}} \Pr(\theta = a|(X_n, Y_n) = (i,j)),\] for all pairs \((i,j)\) such that \(i+j=n\). Also, let the prior \(\pi = (\pi_1, \pi_2)\) where \(\Pr(\theta = \theta_1) = \pi_1\), then the posterior probability can be calculated as \[\begin{align*} \Pr(\theta = \theta_1|(X_n, Y_n) = (i,j)) &= \frac{\Pr(\theta = \theta_1, (X_n, Y_n) = (i,j))}{\Pr((X_n, Y_n) = (i,j))}\\ &= \frac{\pi_1 \times \theta_1^i (1-\theta_1)^j}{\pi_1 \times \theta_1^i (1-\theta_1)^j + (1 - \pi_1) \times \theta_2^i (1-\theta_2)^j} \end{align*}\] For \(i + j = h\), where \(h = n-1, n-2, \dots, 0\), set \[U_h(i,j,S) = -h + payoff \times \max_{a \in \{\theta_1,\theta_2\}} \Pr(\theta = a|(X_h, Y_h) = (i,j)),\] \[\begin{align*} U_h(i,j,C) = \Pr((X_{h+1}, Y_{h+1}) = (i+1, j)|(X_h, Y_h) = (i,j)) \times U^*_{h+1}(i+1,j)\\ + \Pr((X_{h+1}, Y_{h+1}) = (i, j+1)|(X_h, Y_h) = (i,j)) \times U^*_{h+1}(i,j+1), \end{align*}\] \[U^*_h(i,j) = \max\{U_h(i,j,S), U_h(i,j,C)\},\] \[d^*_h(i,j) = \begin{cases}continue & \mbox{ if } U_h(i,j,C) \geq U_h(i,j,S), \\ stop & \mbox { if } U_h(i,j,C) < U_h(i,j,S). \end{cases}\] for all pairs \((i,j)\) such that \(i+j=h\), and \[\begin{align*} \Pr((X_{h+1}, Y_{h+1}) = (i+1, j)|(X_h, Y_h) = (i,j)) = \theta_1 \times \Pr(\theta = \theta_1|(X_n, Y_n) = (i,j)) \\ + \theta_2 \times \left[1- \Pr(\theta = \theta_1|(X_n, Y_n) = (i,j))\right]. \end{align*}\]

Load packages

require(ggplot2)

Algorithm implementation

post_prob <- function(pi_1, theta_1, theta_2, i, j){
  pi_2 <- 1 - pi_1
  num <- pi_1 * theta_1^i * (1 - theta_1)^j
  denom <- num + pi_2 * theta_2^i * (1 - theta_2)^j
  return(num/denom)
}

Utility_stop <- function(pi_1, theta_1, theta_2, i, j, payoff){
  Cost <- i+j
  post_prob_1 <- post_prob(pi_1, theta_1, theta_2, i, j)
  Utility <- -1 * Cost + payoff * max(post_prob_1, 1-post_prob_1)
  return(Utility)
}

Utility_cont <- function(U_star, pi_1, theta_1, theta_2, i, j, payoff){
  n_step <- i+j
  i_index <- i+1
  j_index <- j+1
  post_prob_1 <- post_prob(pi_1, theta_1, theta_2, i, j)
  post_prob_success <- post_prob_1 * theta_1 + (1-post_prob_1) * theta_2
  if (n_step == dim(U_star)[1]-1){
    Utility <- post_prob_success * Utility_stop(pi_1, theta_1, theta_2, i+1, j, payoff) + 
              (1-post_prob_success) * Utility_stop(pi_1, theta_1, theta_2, i, j+1, payoff)
  }else{
    Utility <- post_prob_success * U_star[i_index+1, j_index] + (1-post_prob_success) * U_star[i_index, j_index+1]
  }
  return(Utility)
}



decision <- function(U_ij_cont, U_ij_stop){
  if (U_ij_cont >= U_ij_stop){
    return(1)
  }else{
    return(0)
  }
}

AdaptiveDesign <- function(pi_1 = 0.5, theta_1, theta_2, n_init, payoff = n_init){
  U_cont <- matrix(-Inf, nrow = n_init, ncol = n_init)
  U_stop <- matrix(-Inf, nrow = n_init, ncol = n_init)
  U_star <- matrix(-Inf, nrow = n_init, ncol = n_init)
  Decision <- matrix(NA, nrow = n_init, ncol = n_init)
  
  for (sum in (n_init-1):0){
    for (i in 0:sum){
      j <- sum-i
      i_index <- i+1
      j_index <- j+1
      U_stop[i_index,j_index] <- Utility_stop(pi_1, theta_1, theta_2, i, j, payoff)
      U_cont[i_index,j_index] <- Utility_cont(U_star, pi_1, theta_1, theta_2, i, j, payoff)
      U_star[i_index,j_index] <- max(U_cont[i_index,j_index], U_stop[i_index,j_index])
      Decision[i_index,j_index] <- decision(U_cont[i_index,j_index], U_stop[i_index,j_index])
      
    }
  }
  
  data <- matrix(NA, nrow = (n_init)*(n_init+1)/2, ncol = 6)
  line <- 1
  for (i in 0:(n_init-1)){
    for (j in 0:(n_init-1-i)){
      i_index <- i+1
      j_index <- j+1
      data[line,1] <- i
      data[line,2] <- j
      data[line,3] <- Decision[i_index,j_index]
      data[line,4] <- U_star[i_index,j_index]
      data[line,5] <- U_cont[i_index,j_index]
      data[line,6] <- U_stop[i_index,j_index]
      line <- line+1
    }
  }
  
  
  df <- as.data.frame(data)
  colnames(df) <- c("Success","Failure","Decision_Binary","U_star","U_stop","U_cont")
  df$Decision <- ifelse(df$Decision_Binary == 1, "continue", "stop")
  
  return(df)
}

Result for \(n = 30, \theta_1 = 0.25, \theta_2 = 0.75\)

df1 <- AdaptiveDesign(theta_1 = 0.25, theta_2 = 0.75, n_init = 30)
head(df1)

##   Success Failure Decision_Binary   U_star   U_stop   U_cont Decision
## 1       0       0               1 23.80000 23.80000 15.00000 continue
## 2       0       1               1 23.80000 23.80000 21.50000 continue
## 3       0       2               0 25.00000 24.69000 25.00000     stop
## 4       0       3               0 25.92857 24.92857 25.92857     stop
## 5       0       4               0 25.63415 24.63415 25.63415     stop
## 6       0       5               0 24.87705 23.87705 24.87705     stop

ggplot(data = df1, mapping = aes(x = Success, y = Failure)) +
  geom_point(aes(colour = Decision)) +
  theme_bw() + theme(legend.key = element_blank()) + 
  ggtitle(expression(paste(theta[1], " = 0.25, ", theta[2], " = 0.75, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

## Plot optimal expected utilities at each (i,j)
ggplot(data = df1, mapping = aes(x = Success, y = Failure, colour = U_star)) +
  geom_point(size = 5) +
  scale_color_gradient2(low = "purple", high = "red") +
  theme_bw() + theme(legend.key = element_blank()) +
  ggtitle(expression(paste(theta[1], " = 0.25, ", theta[2], " = 0.75, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

Result for \(n = 30, \theta_1 = 0.5, \theta_2 = 0.7\)

df2 <- AdaptiveDesign(theta_1 = 0.5, theta_2 = 0.7, n_init = 30)

ggplot(data = df2, mapping = aes(x = Success, y = Failure)) +
  geom_point(aes(colour = Decision)) +
  theme_bw() + theme(legend.key = element_blank()) + 
  ggtitle(expression(paste(theta[1], " = 0.5, ", theta[2], " = 0.7, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

## Plot optimal expected utilities at each (i,j)
ggplot(data = df2, mapping = aes(x = Success, y = Failure, colour = U_star)) +
  geom_point(size = 5) +
  scale_color_gradient2(low = "purple", high = "red") +
  theme_bw() + theme(legend.key = element_blank()) +
  ggtitle(expression(paste(theta[1], " = 0.5, ", theta[2], " = 0.7, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

Result for \(n = 30, \theta_1 = 0.7, \theta_2 = 0.9\)

df3 <- AdaptiveDesign(theta_1 = 0.7, theta_2 = 0.9, n_init = 30)

ggplot(data = df3, mapping = aes(x = Success, y = Failure)) +
  geom_point(aes(colour = Decision)) +
  theme_bw() + theme(legend.key = element_blank()) + 
  ggtitle(expression(paste(theta[1], " = 0.7, ", theta[2], " = 0.9, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

## Plot optimal expected utilities at each (i,j)
ggplot(data = df3, mapping = aes(x = Success, y = Failure, colour = U_star)) +
  geom_point(size = 5) +
  scale_color_gradient2(low = "purple", high = "red") +
  theme_bw() + theme(legend.key = element_blank()) +
  ggtitle(expression(paste(theta[1], " = 0.7, ", theta[2], " = 0.9, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

Result for \(n = 30, \theta_1 = 0.4, \theta_2 = 0.6\)

df4 <- AdaptiveDesign(theta_1 = 0.4, theta_2 = 0.6, n_init = 30)

ggplot(data = df4, mapping = aes(x = Success, y = Failure)) +
  geom_point(aes(colour = Decision)) +
  theme_bw() + theme(legend.key = element_blank()) + 
  ggtitle(expression(paste(theta[1], " = 0.4, ", theta[2], " = 0.6, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

## Plot optimal expected utilities at each (i,j)
ggplot(data = df4, mapping = aes(x = Success, y = Failure, colour = U_star)) +
  geom_point(size = 5) +
  scale_color_gradient2(low = "purple", high = "red") +
  theme_bw() + theme(legend.key = element_blank()) +
  ggtitle(expression(paste(theta[1], " = 0.4, ", theta[2], " = 0.6, ", "initial = payoff = $30"))) + theme(plot.title = element_text(hjust = 0.5))

Result for \(n = 100, \theta_1 = 0.25, \theta_2 = 0.75\)

df5 <- AdaptiveDesign(theta_1 = 0.25, theta_2 = 0.75, n_init = 100)

ggplot(data = df5, mapping = aes(x = Success, y = Failure)) +
  geom_point(aes(colour = Decision)) +
  theme_bw() + theme(legend.key = element_blank()) + 
  ggtitle(expression(paste(theta[1], " = 0.25, ", theta[2], " = 0.75, ", "initial = payoff = $100"))) + theme(plot.title = element_text(hjust = 0.5))

## Plot optimal expected utilities at each (i,j)
ggplot(data = df5, mapping = aes(x = Success, y = Failure, colour = U_star)) +
  geom_point(size = 5) +
  scale_color_gradient2(low = "purple", high = "red") +
  theme_bw() + theme(legend.key = element_blank()) +
  ggtitle(expression(paste(theta[1], " = 0.25, ", theta[2], " = 0.75, ", "initial = payoff = $100"))) + theme(plot.title = element_text(hjust = 0.5))

Conclusion

This problem is an example in decision theory that uses Bayes updating on unknown parameters. The solution illustrate the efficiency of Dynamic Programming algorithm in terms of reducing computational time. For initial being 100 dollars, the computational speed is still very fast (~1 second).

Top Page: http://rpubs.com/xihaoli/
github: http://github.com/xihaoli/

Using Dynamic Programming in Bayesian Decision Theory - A payoff game

Xihao Li

2016-11-20