RL model fitting

Wataru Toyokawa

2023-08-03

Bridging a gulf between theory and experiment

On this page, I will introduce how to validate and test your RL models against a set of time-series behavioural data. To be honest, this tutorial will not cover any state-of-the-art methods of model fitting and analysis (e.g. hierarchical Bayesian methods, MCMC, reparameterization techniques, etc.), but will only describe how to calculate the log-likelihood of your model. Although I believe calculating the log-likelihood is necessarily the first step of model fitting, further reading and practice may be required if you want to employ a model-based analysis in your own study. Statistical Rethinking is a really nice textbook for a general introduction to statistical model-based analysis. For computational modelling in behavioural science, this tutorial paper is fantastic and highly recommended. We covered slightly more advanced topics in our summer school (COSMOS Konstanz), and you can freely dive into those course materials.1

Generating pseudo behavioural data by simulation

OK, let's get started.

The first thing we need is data. You have a model, but no data yet. Without data, we cannot fit anything. The first step is a parameter recovery test: let's generate pseudo behavioural data by running simulations with your candidate model. I will again use the two-armed risky bandit task in this demo. The task setting is as follows:

# Load the tidyverse packages used throughout this notebook (for %>%, dplyr, tidyr, and ggplot2)
library(tidyverse)

# Define functions used later

# This function is simply dividing two vectors
divideVector = function(numerator, denominator) {
  return( numerator/denominator )
}
# This function is simply multiplying two things
multiplying = function(A, B) {
  return( A*B )
}

# Rescorla-Wagner updating function
rescorla_wagner_updating = function (t # trial
                                     , num_options # number of arms
                                     , group_size # number of individuals in a group
                                     , horizon # number of trials
                                     , alpha # learning rate (step size)
                                     , choices # choices made
                                     , payoffs # payoffs obtained
                                     , Q # an array of Q values (dimensions -> num_options, horizon, group_size)
                                     ) {
  # updatingPositions: which option was chosen by each individual at this trial, expressed as indices into a flattened version of Q (called QQ below)
  updatingPositions = choices + num_options * ((1:group_size) - 1)
  
  # Copying the previous Q value to the current Q value
  Q[ , t+1, ] <- Q[ , t, ]
  
  # Q is a three-dimensional array, which is a bit tricky to handle in its current form.
  # So, let's reduce its dimensionality
  QQ <- aperm(Q, c(1,3,2)) # switch the ordering of the dimensions
  dim(QQ) <- c(num_options*group_size, horizon) # now QQ is a 2-D matrix 
  # Here I have defined a new matrix, `QQ`, which is a flattened version of `Q`. 
  # Although its dimensionality was reduced to 2-D, it contains the same amount of
  # information as the original `Q`.
  
  # Then, Q values at the position `QQ[updatingPositions,t+1]` will be updated 
  # via Rescorla-Wagner rule:
  QQ[updatingPositions,t+1] <- 
    QQ[updatingPositions,t] + 
    alpha * (payoffs - QQ[updatingPositions,t])
  
  # Finally, we have to translate the updated `QQ` back to `Q` value
  dim(QQ) <- c(num_options, group_size, horizon)
  return( aperm(QQ, c(1,3,2)) )
}

# parameter setting
group_size = 4 # there are 'group_size' individuals
num_options = 2 # the number of options
horizon = 100 # total trial number in one session
# mean and s.d. values for the two alternatives: sure and risky
mean_list = c(1, 1.5) # low and high mean payoff
sd_list = c(0.03, 1) # low and high risk 
Q_initial = mean(mean_list) # a neutral prior belief (what if Q_initial is larger (smaller) than this?)

Setting parameter values

Let's simulate 4 individuals. We have to define their behavioural rules and parameters. Let's assume that the behavioural rule is the same for all 4 individuals, namely Rescorla-Wagner learning with a softmax choice policy, while their parameter values differ from each other.

Q-learning (Rescorla-Wagner model)

By trying out actions and receiving rewards, the Q-learning agent learns a value function for each action \(Q(a)\) describing the expected reward. With each reward observation, the current Q-value \(Q_t(a)\) is updated with the following equation:

\[ \begin{equation} Q_{t+1,a} \leftarrow Q_{t,a} + \alpha\color{red}{[r_{t,a} - Q_{t,a} ]} \end{equation} \] The term colored red in the equation is known as the reward prediction error (RPE). It captures the difference between the expected reward \(Q_{t,a}\) and the actual reward observation \(r_{t,a}\). When the reward is lower than expected, we revise our expectation downward, and vice versa.

The size of this update is governed by the learning rate parameter \(\alpha\), which is bounded to the range \([0,1]\). This is often implemented as a free parameter, which is then estimated from data to describe a characteristic of the learner(s).
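
To make the update rule concrete, here is a tiny worked example with hypothetical numbers: a prior value of 1.25 (the neutral prior used below), a reward of 2, and \(\alpha = 0.3\).

# A minimal worked example of a single Rescorla-Wagner update (hypothetical numbers)
alpha_example <- 0.3
Q_old <- 1.25   # prior belief about the option's value
reward <- 2     # observed payoff
Q_new <- Q_old + alpha_example * (reward - Q_old) # = 1.25 + 0.3 * 0.75 = 1.475
Q_new
## [1] 1.475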

Softmax policy

Q values are then translated into choice probabilities through a softmax (or multinomial-logistic) function such that:

\[ \begin{equation} P_{t+1, a} = \frac{\exp(\beta Q_{t+1, a})}{\exp(\beta Q_{t+1, 1}) + \exp(\beta Q_{t+1, 2})}. \end{equation} \] The parameter \(\beta\) regulates how sensitive the choice probability is to the belief about the option's value (i.e. it controls the proneness to explore). As \(\beta \to 0\), the softmax choice probability approaches a random choice (i.e. highly explorative). Conversely, as \(\beta \to +\infty\), it asymptotes to a deterministic choice in favour of the option with the highest subjective value (i.e. highly exploitative).
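
To see this numerically, here is a small stand-alone sketch (not part of the simulation code below) that evaluates the softmax for two illustrative Q values under different values of \(\beta\).

# A minimal sketch showing how beta changes the softmax choice probabilities
softmax <- function(beta, Q) {
  exp(beta * Q) / sum(exp(beta * Q))
}
Q_example <- c(1.0, 1.5) # illustrative Q values for the sure and risky options
softmax(beta = 0,  Q_example) # 0.50, 0.50 -> random choice
softmax(beta = 5,  Q_example) # approx. 0.08, 0.92
softmax(beta = 50, Q_example) # nearly 0, 1 -> almost deterministic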

Here we assume true_alpha <- c(0.3, 0.6, 0.3, 0.6) and true_beta <- c(5, 5, 15, 15)

# individual parameters
true_alpha <- c(0.3, 0.6, 0.3, 0.6) # learning rate (step size)
true_beta <- c(5, 5, 15, 15) # inverse temperature of the softmax function

Running the model

Now we have defined both task and individuals. It’s time to run the model.2

# Simulation
# The initial status of the four agents
# First, I prefer to define empty lists or matrices where data will be stored later
Q <- array(dim = c(num_options, horizon, group_size))
Q[,1,] <- Q_initial # Q_initial = 1.25
netChoiceProb <- array(dim = c(num_options, horizon, group_size))
netChoiceProb[,1,] <- 1/num_options
choices <- matrix(nrow=group_size, ncol=horizon)
payoffs <- matrix(nrow=group_size, ncol=horizon)
riskyChoiceProb <- matrix(nrow=group_size, ncol=horizon)

# looping from t = 1 to t = horizon
for (t in 1:horizon) {
  # each individual chooses one option based on his/her choice probabilities stored in netChoiceProb[,t,]
  choices[,t] <- apply(netChoiceProb[,t,] # apply() runs the function given below on each column (i.e. each individual)
                       , MARGIN = 2
                       , function(netChoiceProbInstance){ 
                                sample(1:num_options
                                       , 1
                                       , prob=netChoiceProbInstance
                                       , replace=FALSE
                                ) 
                         }
                        )
  
  # each subject earns payoffs whose amount depends upon choices[,t]
  payoffs[,t] <- mapply(rnorm, 1, mean_list[choices[,t]], sd_list[choices[,t]]) # payoff drawn from `rnorm(mean, sd)`
  
  if (t < horizon) {
    # Rescorla-Wagner updating rule: Q_next <- Q + alpha * (payoff - Q)
    # The function `rescorla_wagner_updating()` is defined in the first chunk above
    Q <- rescorla_wagner_updating(t
                                 , num_options
                                 , group_size
                                 , horizon
                                 , true_alpha
                                 , choices[,t]
                                 , payoffs[,t]
                                 , Q
                                 )
  
    
    
    # Softmax choice based solely on Q values
    # Calculate "exp( beta*Q_k )" for each option k
    Q_exp = ( Q[,t+1,] * rep(true_beta, each = num_options) ) %>% exp() # %>% is called a pipe operator (you can also use |> )
    # Then, calculating the softmax probability,
    # that is, exp( beta*Q_k )/(exp( beta*Q_1 ) + exp( beta*Q_2 )) for each slot
    softmaxMatrix = Q_exp %>% 
      apply(MARGIN=1, divideVector, denominator = apply(Q_exp, MARGIN=2, sum)) %>% 
      t()
    
    # Finally, we want to store values of softmax
    for (i in 1:group_size) {
      netChoiceProb[, t+1,i] <- softmaxMatrix[,i]
    }
  }
}
# Hooooray! A simulation run is now done. 
# Next, we need to record this result.
# First, let's summarise this run: 
for(i in 1:group_size) {
   riskyChoiceProb[i,] <- netChoiceProb[2,,i]
}

Result of this simulation

The simulation has generated a series of choices which is shown as follows:

pseudo_data <- data.frame(riskyChoiceProb = riskyChoiceProb) %>% 
  cbind(data.frame(choices = choices)) %>% 
  cbind(data.frame(payoffs = payoffs)) %>% 
  tidyr::pivot_longer(everything()
                      , names_to = c(".value", "trial")
                      , names_pattern = "([A-Za-z]+).(\\d+)"
                      )
pseudo_data$indiv <- rep(1:group_size, each = horizon)
pseudo_data$trial <- pseudo_data$trial %>% as.character() %>% as.numeric() # trial is stored as a character string, so convert it to numeric

pseudo_data %>% ggplot() + 
  geom_line(aes(trial, riskyChoiceProb, group = indiv), colour='grey')+ 
  geom_point(aes(trial, choices-1, colour = as.factor(choices))) + 
  geom_hline(yintercept = 0.5, linetype = 'dashed', colour = 'blue') + 
  scale_x_continuous(name="Trial", limits=c(1, horizon))  + 
  scale_color_viridis_d(name="Choice")  + 
  facet_wrap(~indiv) + 
  labs(y = 'Probability of choosing the risky arm', title = 'Simulated behaviour') +
  theme_test() 

As you see here, individual 1, whose parameter values are alpha = 0.3 and beta = 5, seems to be risk seeking (and payoff maximising), while the other individuals are risk averse. When \(\alpha (1 + \beta)\) is larger than a certain threshold (which is 2 in this bandit setup), an RL agent becomes risk averse. This phenomenon is called the hot stove effect. To find out more about it, visit my open access paper! But for now, our aim is to estimate the agents' parameters.
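
As a quick sanity check, we can compute \(\alpha (1 + \beta)\) from the true parameter values defined earlier; only agent 1 falls below the threshold of 2.

# Which agents exceed the alpha * (1 + beta) threshold of 2?
true_alpha * (1 + true_beta)
## [1] 1.8 3.6 4.8 9.6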

The pseudo-data just generated looks as follows:

knitr::kable(head(pseudo_data, 12), "simple")
 trial   riskyChoiceProb   choices      payoffs   indiv
------  ----------------  --------  -----------  ------
     1         0.5000000         2    3.0587083       1
     2         0.9377930         2    0.8131471       1
     3         0.7762360         2    0.9441589       1
     4         0.6015538         1    0.9679653       1
     5         0.6974102         2    2.3377870       1
     6         0.9123837         2    2.3951257       1
     7         0.9702513         2    1.1940373       1
     8         0.9228923         2    2.7079620       1
     9         0.9829008         2    1.7533185       1
    10         0.9762915         2   -0.0487528       1
    11         0.6859677         2    1.1667926       1
    12         0.6339126         1    1.0276680       1

Note that the riskyChoiceProb column in pseudo_data contains true values that would not have been directly observable if this were a real behavioural experiment. Also, we know all the parameter values for each individual because we defined them at the outset. The goal of the parameter recovery test is to recover these true values.

Calculating log-likelihood under a set of parameters

The next step is fitting the learning model with this pseudo data. Parameter estimation basically consists of the following two steps.

  1. Setting a parameter value
  2. Calculating log-likelihood

That’s it. Let me walk through them one by one.

1. Setting a parameter value

Let’s start our exploration with the following values:

alpha_test <- c(0.5, 0.5, 0.5, 0.5) # learning rate values 
beta_test <- c(5, 5, 5, 5) # inverse temperature
alpha_test
## [1] 0.5 0.5 0.5 0.5
beta_test
## [1] 5 5 5 5

2. Calculating a log-likelihood

The next step is to calculate the log-likelihood. The code for the likelihood calculation looks very similar to the code used for the agent-based simulation. But unlike in the simulation, we already have the series of choices[,t] and payoffs[,t].

# picking up one individual
this_individual <- 1
data_this_individual <- pseudo_data %>% dplyr::filter(indiv == this_individual) # picking up the focal agent
this_choices <- data_this_individual$choices # this agent's choices observed
this_payoffs <- data_this_individual$payoffs # this agent's payoffs observed
this_alpha <- alpha_test[this_individual] #  this agent's alpha assumed
this_beta <- beta_test[this_individual] # this agent's beta assumed

# initial setting
Q <- matrix(rep(NA, num_options * horizon), nrow = num_options)
logChoiceProb_estimated <- matrix(rep(NA, num_options * horizon), nrow = num_options)
Q[,1] <- Q_initial # Note: this initial value can also be a subject of estimation
logChoiceProb_estimated[,1] <- log(1/num_options) # random choice at the 1st trial
logLik <- rep(0, horizon)
# calculating log-likelihood
for (t in 1:horizon) {
  # recording the log likelihood of the choice made at this trial
  logLik[t] <- logChoiceProb_estimated[this_choices[t], t]
  
  # calculating the choice probability for the following trial
  if (t < horizon) {
    # copying Q values to the next trial
    Q[,t+1] <- Q[,t] 
    
    # Rescorla-Wagner Updating for the chosen option 
    Q[this_choices[t],t+1] <- 
      Q[this_choices[t],t] + 
      this_alpha * (this_payoffs[t] - Q[this_choices[t],t])
    
    # Softmax transformation (log based)
    logChoiceProb_estimated[,t+1] <- 
      this_beta * Q[, t+1] - log( exp(this_beta * Q[, t+1]) %>% sum() )
  }
}

# the sum of the negative log-likelihood
sum(-logLik)
## [1] 37.57145

Note that the softmax function was calculated in log form:

\(\log \frac{\exp(\beta Q_{t,i})}{\sum_k \exp(\beta Q_{t,k})} = \beta Q_{t,i} - \log\left(\sum_k \exp(\beta Q_{t,k})\right)\)

This is mathematically unnecessary, but it is a big help computationally: working directly with log-probabilities avoids numerical underflow when choice probabilities become very small, and summing log-likelihoods is far more stable than multiplying many tiny probabilities.
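
As an aside (a sketch of my own, not code from the analysis above), the same log-domain idea can be pushed one step further with the standard "subtract the maximum" log-sum-exp trick, which keeps the calculation finite even when \(\beta Q\) is so large that \(\exp(\beta Q)\) would overflow.

# A minimal sketch of a numerically stable log-softmax (log-sum-exp trick)
log_softmax <- function(beta, Q) {
  v <- beta * Q
  v - (max(v) + log(sum(exp(v - max(v))))) # subtract the max before exponentiating
}
exp(log_softmax(beta = 5, Q = c(1.0, 1.5)))  # same probabilities as the ordinary softmax
log_softmax(beta = 500, Q = c(1.0, 1.5))     # still finite, although exp(500 * 1.5) would overflow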

For a comparison, the log-likelihood of a random choice model is as follows:

# calculating log-likelihood
logLik_random <- rep(0, horizon)
for (t in 1:horizon) {
  # a probability to choose an option is 1/num_options 
  logLik_random[t] <- log(1/num_options) 
}

# the sum of the negative log-likelihood
sum(-logLik_random)
## [1] 69.31472

So our RL model with alpha = 0.5 and beta = 5 gives a lower NLL (37.571) than the NLL of totally random choices (69.315), which suggests that the RL model fits better than the random choice model. However, we still do not know which parameter values give the best fit. Maybe other combinations of \(\alpha\) and \(\beta\) are better than this one.

The brute force method of parameter estimation

Theoretically, you can always search for a better combination by thoroughly exploring the entire (or a reasonably wide) range of parameter values, an approach called the brute-force method. Let's explore alpha over seq(0, 1, 0.05) (i.e. 0, 0.05, 0.1, …, 1) and beta over seq(0, 20, 1) (i.e. 0, 1, 2, …, 20).3

# data storage
first_individual_logliks <- 
  data.frame(
    alpha = rep(seq(0, 1, 0.05), each = 21)
    , beta = rep(seq(0, 20, 1), 21)
    , logLik = rep(NA, 21^2)
  )

# The brute force method of parameter estimation
for (this_alpha in seq(0, 1, 0.05)) {
  for (this_beta in seq(0, 20, 1)) {
    # initial setting
    Q <-matrix(rep(NA, num_options * horizon), nrow = num_options)
    logChoiceProb_estimated <- 
      matrix(rep(NA, num_options * horizon), nrow = num_options)
    Q[,1] <- Q_initial 
    logChoiceProb_estimated[,1] <- log(1/num_options) 
    logLik <- rep(0, horizon)
    
    # calculating log-likelihood
    for (t in 1:horizon) {
      # recording the log likelihood of the choice made at this trial
      logLik[t] <- logChoiceProb_estimated[this_choices[t], t]
      
      # calculating the choice probability for the following trial
      if (t < horizon) {
        # copying Q values to the next trial
        Q[,t+1] <- Q[,t] 
        
        # Rescorla-Wagner Updating for the chosen option 
        Q[this_choices[t],t+1] <- 
          Q[this_choices[t],t] + 
          this_alpha * (this_payoffs[t] - Q[this_choices[t],t])
        
        # Softmax transformation (log based)
        logChoiceProb_estimated[,t+1] <- 
          this_beta * Q[, t+1] - log( exp(this_beta * Q[, t+1]) %>% sum() )
      }
    }
    
    # the sum of log-likelihood
    first_individual_logliks$logLik[which(first_individual_logliks$alpha == this_alpha & first_individual_logliks$beta == this_beta)] <- sum(logLik)
  }
}
# which combination of the parameter gave the maximum likelihood?
maximum_likelihood_position <- which(first_individual_logliks$logLik == (first_individual_logliks$logLik %>% max()))

best_fit_values <- first_individual_logliks[maximum_likelihood_position, ]

first_individual_logliks %>% ggplot() +
  geom_tile(aes(alpha, beta, fill = logLik)) +
  geom_point(data=data.frame(alpha=true_alpha[1], beta=true_beta[1]), mapping=aes(alpha,beta), size = 3, colour='red')+ # true value
  geom_text(data=data.frame(alpha=true_alpha[1], beta=true_beta[1]), mapping=aes(alpha,beta), hjust=-0.2, vjust=0.3, label="True value", colour = 'red') +
  geom_point(data=best_fit_values, mapping=aes(alpha,beta), size = 3, colour='blue')+ # best fitting value
  geom_text(data=best_fit_values, mapping=aes(alpha,beta), hjust=-0.2, vjust=0.3, colour='blue', label="MLE") +
  scale_fill_viridis_c('Log likelihood')+
  labs(title = 'The brute force parameter estimation for agent 1\n(The red and blue points are the true and estimated values, respectively)')+
  theme_classic()

That's it. How hard could it be? This maximum-likelihood method has been widely used, but it is often criticised for its unreliability. Although the brute-force method above gave a rough estimate that was more or less correct, it suggests a wide range of plausible \(\beta\) values (i.e. the yellow-ish area stretches widely along the \(\beta\)-axis in the low-\(\alpha\) region).

This is why we need more sophisticated methods to reliably infer parameters from experimental data (e.g. a hierarchical Bayesian method).
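
That said, before moving to a fully Bayesian treatment, a common intermediate step is to wrap the negative log-likelihood in a function and hand it to a numerical optimiser such as R's optim(). The sketch below is my own illustration rather than part of the analysis above; it reuses this_choices, this_payoffs, Q_initial, num_options, and horizon from earlier, and the helper name negLogLik_RL is hypothetical.

# A hedged sketch: maximum-likelihood estimation via optim() instead of a grid search
negLogLik_RL <- function(params, choices, payoffs) {
  alpha <- params[1]
  beta <- params[2]
  Q <- rep(Q_initial, num_options)
  loglik <- log(1/num_options) # random choice at the first trial
  for (t in 1:(horizon - 1)) {
    # Rescorla-Wagner update of the chosen option
    Q[choices[t]] <- Q[choices[t]] + alpha * (payoffs[t] - Q[choices[t]])
    # log softmax probability of the next observed choice
    logP <- beta * Q - log(sum(exp(beta * Q)))
    loglik <- loglik + logP[choices[t + 1]]
  }
  return(-loglik)
}

fit <- optim(par = c(0.5, 5), fn = negLogLik_RL
             , choices = this_choices, payoffs = this_payoffs
             , method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 20))
fit$par # estimated alpha and beta for agent 1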


  1. I won't be able to discuss much about how to connect agent-based models to population-level dynamics models due to the limited time allowed to me here. Also, such a connection is still rare in behavioural science. I think this is a really fascinating future direction.↩︎

  2. Note that I have manually picked the parameter values for each individual rather than generating them randomly. This was just for the sake of tractability in this demonstration. In practice, we might assume a probability distribution from which individual parameter values are drawn. The hyperparameters of such a group-level probability distribution can then be estimated through a hierarchical Bayesian statistical method.↩︎

  3. beta > 20 is ignored here; we cannot search an infinitely wide range. Luckily, the NLL changes very little once beta becomes large enough, so beta = 20 looks sufficient in this particular case.↩︎