It has long been argued that human decision making relies on two distinct computational systems: one computationally efficient but inflexible (model-free reinforcement learning), the other flexible but computationally expensive (model-based reinforcement learning). However, it is unknown whether humans use both systems simultaneously.
Daw et al. (2011) designed a novel two-stage task that can disambiguate a participant’s use of model-free and model-based strategies, because the two strategies predict different effects of second-stage reinforcement on first-stage choices. Using hierarchical logistic regression on first-stage choice data, the authors found evidence of both model-free and model-based valuations, which was further confirmed by analysis of fMRI data.
The original study was conducted in the fMRI scanner with a limited sample (N = 17), so it is worth replicating in a larger, out-of-scanner sample. It is also unknown whether a web-based setting will produce similar engagement of the two systems or will shift participants’ strategy towards one of them.
Participants will be recruited from the online platform Prolific. On each trial (Figure 1A), participants are shown two Tibetan characters and use the F and J keys to choose one of them. Another pair of Tibetan characters is then shown probabilistically, according to the transition matrix shown in Figure 1B, and participants again use the F and J keys to select one of the two second-stage characters. Depending on the reward probability of each second-stage stimulus, participants either receive a reward ($1) or nothing; 1% of the reward (1 cent) is paid towards their compensation. To encourage continuous learning, the reward probabilities of the four second-stage choices vary according to a Gaussian random walk. Prior to the study, participants are informed that the reward probabilities will change, while the transition probabilities from the first- to second-stage stimuli remain fixed. Different from the original study, after completing the task participants will fill out the Temporal Experience of Pleasure Scale, an 18-item questionnaire that assesses capacity for anticipatory and consummatory pleasure. This addition is based on my previous work suggesting that capacity for anticipatory pleasure may reduce model-based learning.
Figure 1. (A) The interface of the two-step task. (B) The transition matrix. The first first-stage character has a 70% chance to lead to the pink set, and 30% chance to lead to the blue set; the second first-stage character has a 30% chance to lead to the pink set, and 70% chance to lead to the blue set.
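For concreteness, the transition structure in Figure 1B can be written as a simple probability matrix. The R sketch below is purely illustrative (the web task itself is implemented separately); rows are first-stage options and columns are second-stage states.

# Illustrative only: the Figure 1B transition probabilities as a matrix
transition_matrix <- matrix(c(0.7, 0.3,
                              0.3, 0.7),
                            nrow = 2, byrow = TRUE,
                            dimnames = list(c("option1", "option2"), # first-stage choice
                                            c("pink", "blue")))      # second-stage state
transition_matrix
##         pink blue
## option1  0.7  0.3
## option2  0.3  0.7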
The main challenge is to code this task by adapting the open materials from another similar study: https://github.com/wkool/tradeoffs/tree/master/tasks/space_daw_task. A second challenge is to reduce the trial length (the original study was paced slowly to accommodate the fMRI sampling rate) while still ensuring comparability with the original study.
Link to the experiment on AWS: http://ryanyan-daw2011-replication-project-psych251.s3-website-us-west-2.amazonaws.com
Project repository (on Github): https://github.com/psych251/daw2011
Original paper: https://github.com/psych251/daw2011/blob/main/original_paper/daw2011.pdf
Preregistration: https://osf.io/t4zmr
The authors only reported the p value from the multilevel logistic regression; the original effect size was not reported. I was thus unable to base a power calculation on the original study.
I obtained regression coefficient estimates from another paper, whose senior author was the second author of the target paper. This other paper used the same task structure but with different materials and framing (Kool, Cushman, & Gershman, 2016). Based on N = 185 online participants, the estimates were: intercept (1.03), reward (0.26), transition (0.03), and reward × transition (0.20). The effect size calculation was based on these estimates and a within-subject random-intercept variance of 0.5.
Power analysis revealed that a sample size of 110 is needed to achieve 80% power.
library(simr)  # simulation-based power analysis for mixed models (also loads lme4)

# Simulate the design: 10 participants x 200 trials each
subj <- factor(1:10)
reward <- c("Reward","noReward") # ratio ~ 1:1
transition <- c(rep("common",7), rep("rare",3)) # ratio = 7:3
trialNum <- 200
# total trial num: 200
# rewardCommon : rewardRare : noRewardCommon : noRewardRare = 7:3:7:3 * 10
subj_full <- rep(subj, trialNum)
reward_full <- rep(reward, each = length(subj)*(trialNum/2))
transition_full <- rep(rep(transition, each = length(subj)), trialNum/10)
sim_df <- data.frame(id = subj_full, reward = reward_full, transition = transition_full)
# table(sim_df)

## Intercept and slopes for reward, transition, and reward*transition
## (estimates from Kool, Cushman, & Gershman, 2016)
fixed <- c(1.03, 0.26, 0.03, 0.20)
## Random intercept variance for participants
rand <- list(0.5)

# Build the generative logistic mixed model and estimate power for the interaction
model <- makeGlmer(y ~ reward*transition + (1|id), family = "binomial", fixef = fixed, VarCorr = rand, data = sim_df)
sim_10_subj <- powerSim(model, nsim = 100, test = fixed("rewardReward:transitionrare"))
sim_10_subj

# Extend the design to 120 participants and re-estimate power
model_ext_subj <- extend(model, along = "id", n = 120)
sim_120_subj <- powerSim(model_ext_subj, nsim = 120, test = fixed("rewardReward:transitionrare"))
sim_120_subj

# Power curve across sample sizes
p_curve_treat <- powerCurve(model_ext_subj, test = fixed("rewardReward:transitionrare"), along = "id")
png(file="/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/writeup_resources/power_analysis.png",
    width=8, height=6, units="in", res=300)
plot(p_curve_treat)
dev.off()

Figure 2. Power analysis. Based on the estimates reported by Kool, Cushman, & Gershman (2016), 120 participants could attain a power of 80%.
Considering both the original sample size (N = 17) and the power calculation based on the above-mentioned paper (N = 110), the sample will be N = 40 (50% female) healthy adults recruited via Prolific. Participants must be fluent in English, but there is no nationality requirement.
The experimental materials are 3 pairs of Tibetan characters in colored boxes, as shown in Figure 1. In the practice trials, participants use a different set of such stimuli. The details of the experiment are described below in 2.4.
In addition, participants will fill out an 18-item questionnaire assessing hedonic capacity. The Temporal Experience of Pleasure Scale (TEPS) measures people’s capacity to experience anticipatory and consummatory pleasure (Gard et al., 2006).
The task consisted of 99 trials, in three blocks of 33 separated by breaks. Each trial consisted of two stages. In the first stage, participants chose between two options, represented by Tibetan characters in colored boxes. If subjects failed to enter a choice within 2 s, the trial was aborted. Which second-stage state was presented depended, probabilistically, on the first-stage choice, according to the transition scheme shown in Figure 1B (30% vs. 70% transition probability). At the second stage, subjects were presented with one of two further pairs of options (‘states’) and entered another choice. The second choice was rewarded with money (depicted by a dollar coin, though subjects were paid 1% of this amount) or not (depicted by a zero). Trials were separated by an inter-trial interval of randomized length, 1.5 seconds on average.
In order to encourage ongoing learning, these reward probabilities were diffused at each trial by adding independent Gaussian noise (mean 0, SD 0.025), with reflecting boundaries at 0.25 and 0.75. Prior to the task, participants were instructed that the reward probabilities would change, but that those controlling the transitions from the first to the second stage would remain fixed. They were also instructed about the overall structure of the transition matrix, specifically that each first-stage option was primarily associated with one or the other of the second-stage states, but not which one. Prior to the session, to familiarize themselves with the structure of the task, subjects played 10 trials (original paper: 50 trials) of a practice task using a different stimulus set. The assignment of colors to states was counterbalanced across participants, and the two options at each state were permuted between left and right from trial to trial. Each second-stage option was rewarded according to a probability associated with that option. Different from the original study, after the experiment participants will complete a measure of hedonic capacity (the Temporal Experience of Pleasure Scale). Participants receive $8 per hour for their participation, as well as the bonus they earn in the task.
Figure 3. Example of a Gaussian random walk
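For intuition, here is a minimal R sketch of such a reflecting Gaussian random walk, using the parameters described in the procedure (noise SD 0.025, boundaries at 0.25 and 0.75); the actual task generates these trajectories in its own code.

# Illustrative sketch of one reward-probability trajectory (not the task code)
set.seed(1)
n_trials <- 99
reward_prob <- numeric(n_trials)
reward_prob[1] <- runif(1, 0.25, 0.75)  # random starting probability
for (t in 2:n_trials) {
  p <- reward_prob[t - 1] + rnorm(1, mean = 0, sd = 0.025)  # Gaussian diffusion
  if (p > 0.75) p <- 2 * 0.75 - p  # reflect at the upper boundary
  if (p < 0.25) p <- 2 * 0.25 - p  # reflect at the lower boundary
  reward_prob[t] <- p
}
plot(reward_prob, type = "l", ylim = c(0, 1),
     xlab = "trial", ylab = "reward probability")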
I will run a logistic regression in which the dependent variable is the first-stage choice (coded as stay versus switch), and the explanatory variables are the reward received on the previous trial, a binary indicator of whether the previous trial’s transition was common or rare, and the interaction of the two. All coefficients will be taken as random effects across subjects, and the multilevel regression will be estimated using the lme4 mixed-effects package (Bates & Maechler, 2010) in the R statistical language (R Development Core Team, 2010).
The key result to replicate is the significant interaction (p < 0.05) between reward and transition type in predicting first-stage choice in the following logistic regression model:
glmer(stay ~ reward * transition_type + (1 | participant_ID), data = df, family = binomial, control = glmerControl(optimizer = "bobyqa"))
reward: factor, rewarded or unrewarded, ref = rewarded
transition_type: factor, common or rare, ref = common
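As a sketch of how the predictors for this model could be derived from the raw trial data (the data-frame and column names below, e.g. raw_trials, rewarded, common, are illustrative assumptions rather than the repository’s actual names):

library(dplyr)

# Illustrative sketch: build stay / reward / transition_type from raw trials
df <- raw_trials %>%
  group_by(participant_ID) %>%
  arrange(trial_index, .by_group = TRUE) %>%
  mutate(
    stay = choice1 == lag(choice1),  # did the first-stage choice repeat?
    reward = factor(lag(rewarded), levels = c(TRUE, FALSE),
                    labels = c("rewarded", "unrewarded")),  # previous-trial reward
    transition_type = factor(lag(common), levels = c(TRUE, FALSE),
                             labels = c("common", "rare"))  # previous-trial transition
  ) %>%
  ungroup() %>%
  filter(!is.na(stay))  # drop each participant's first trial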
Computational modelling was conducted according to Daw et al.’s (2011) method and supplementary information.
In the model-free reinforcement learning model, the values of all stage-1 and stage-2 choices were updated by a temporal-difference prediction error signal. In the model-based reinforcement learning model, stage-1 values were computed with the Bellman equation, which takes into account the transition matrix specified above, while stage-2 values were updated in the same way as in model-free RL.
In the present computational model, model-based and model-free RL both contribute to the value update, and the proportion of the contribution from model-based RL was represented as a parameter \(\omega\) (0 < \(\omega\) < 1). The closer \(\omega\) is to 1, the larger the relative contribution of model-based RL.
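For intuition, here is a minimal R sketch of this hybrid valuation (model-based values from a one-step Bellman lookahead over the transition matrix, mixed with model-free values by \(\omega\)); the variable names and simplified update are illustrative and do not reproduce the fitted MATLAB model.

# Illustrative sketch of the hybrid first-stage valuation (not the fitted model)
# q_mf:      length-2 vector of model-free first-stage values
# q_stage2:  2 x 2 matrix of second-stage values (row = state, column = option)
# trans:     2 x 2 transition matrix (row = first-stage option, column = state)
hybrid_first_stage_values <- function(q_mf, q_stage2, trans, omega) {
  q_mb <- trans %*% apply(q_stage2, 1, max)      # Bellman: expected best second-stage value
  omega * as.vector(q_mb) + (1 - omega) * q_mf   # weighted mixture of MB and MF values
}

# example with made-up values and the original paper's median omega
trans    <- matrix(c(0.7, 0.3, 0.3, 0.7), nrow = 2, byrow = TRUE)
q_stage2 <- matrix(c(0.4, 0.6, 0.5, 0.3), nrow = 2, byrow = TRUE)
hybrid_first_stage_values(q_mf = c(0.45, 0.40), q_stage2, trans, omega = 0.39)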
All parameters were estimated using the maximum likelihood method. MATLAB code is available in the Github repo under the folder “writeup/computational_modelling”.
All participants were recruited around 12-2pm PST.
The actual sample comprised N = 40 participants recruited from Prolific. According to Prolific, the median age was 27 years, 72.5% were male, and 87.5% were White.
#read in demographic data
df_demo <- read_csv("/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/demo.csv")%>%
filter(Status == "APPROVED")%>%
select(Age, Sex,"Ethnicity simplified")

quantile(df_demo$Age)
##    0%   25%   50%   75%  100%
## 20.00 23.75 27.00 36.50 53.00

table(df_demo$Sex)/nrow(df_demo)
##
## Female   Male
##  0.275  0.725

table(df_demo$"Ethnicity simplified")/nrow(df_demo)
##
## Black Mixed Other White
## 0.025 0.075 0.025 0.875
The sample was recruited as a standard Prolific sample, rather than the gender-balanced sample specified in the pre-registration.
data_folder <- "/Users/rh/Desktop/first_year/PSYCH251_MF/data/aws/task-data-sav/ryanyan-daw2011-replication-project-psych251/"
# read in and combine data files from all participants
data_list <- list.files(data_folder)
for (i in 1:length(data_list)){
data_name_temp <- data_list[i]
data_temp <- read_csv(paste0(data_folder,data_name_temp))%>%
filter(trial_type == "daw_two_step", practice != 1)%>%
mutate(trial_in_total = (block-1)*33+trial_index)%>%
select(-rt,-responses,-view_history,-age:-scoreindollar,-subid,-internal_chunk_id,-url)
demo_temp <- read_csv(paste0(data_folder,data_name_temp))%>%
filter(trial_type == "call-function", !is.na(age))%>%
select(participant_id,age:scoreindollar)%>%
mutate(sex = ifelse(sex==FALSE,"female",'male'))
#join files
if (i==1){
trial_data_raw = data_temp
demo_data = demo_temp
} else {
trial_data_raw = rbind(trial_data_raw,data_temp)
demo_data = rbind(demo_data,demo_temp)
}
}
#print number of participants
participant_num = nrow(demo_data)
participant_num
## [1] 40
payment_sheet = demo_data[,c(1,5)]
# show distribution of bonus payments
ggplot(payment_sheet,aes(scoreindollar))+
geom_histogram(binwidth=0.01,fill="white",color="black")+
geom_vline(xintercept = mean(payment_sheet$scoreindollar),size=2,color="blue")+
labs(x = "payment in dollars",
title = paste0("min: $",min(payment_sheet$scoreindollar),", max: $",max(payment_sheet$scoreindollar),", mean: $",mean(payment_sheet$scoreindollar)))## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
set.seed(123) #to ensure consistency in encoding
all_possible_ids <- c(trial_data_raw$participant_id) %>% unique() %>% na.omit()
# randomly sample three-digit numbers (between 100 and 1000) with no replacement
random_ids <- sample(100:1000, length(all_possible_ids), replace = F) %>% as.character()
#save dictionary locally
id_dictionary <- cbind(all_possible_ids, random_ids) %>% `colnames<-`(c("old", "new")) %>% as.data.frame()
write_csv(id_dictionary,"/Users/rh/Desktop/first_year/PSYCH251_MF/id_dictionary.csv")

The original paper did not mention any data exclusion. Trials where participants fail to respond within 2 seconds will be automatically excluded.
As a robustness check, I will examine whether the main hypothesis is still true even after (1) excluding those participants who failed more than 10% of all trials (10 trials), and (2) excluding the first 20 trials of the task as practice.
ggplot(df_invalid_trials,aes(x=n))+
geom_histogram(color = "black",fill = "white")+
labs(title = "count of invalid trials (time out after 2s)",
x = "trial count")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
kable(count(trial_data%>%
filter(!is.na(trial_index))%>%
group_by(choice1), state2_color)%>%
group_by(choice1)%>%
mutate(prob = round(n/sum(n),3))%>%
select(-n)%>%
pivot_wider(names_from = state2_color, values_from = prob))

| choice1 |  blue |  pink |
|--------:|------:|------:|
|       1 | 0.310 | 0.690 |
|       2 | 0.702 | 0.298 |
The transition matrix looks fine.
# count the left and right key press freq
df_key <- count(trial_data%>%
ungroup(),participant_id,key_press2)%>%
pivot_wider(names_from = key_press2, values_from = n)
ggplot(df_key,aes(x=f))+
geom_histogram(fill="white",color = "black")+
xlim(0,99)+
geom_vline(xintercept = 49.5,size=2,color = "blue")+
labs(title = "freq pressing left key")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
The key press pattern looks fine.
choice1_freq <- count(trial_data%>%
ungroup(),participant_id,choice1)%>%
pivot_wider(names_from = choice1, names_prefix = "option", values_from = n)
ggplot(choice1_freq,aes(x = option1))+
geom_histogram(color="black", fill="white")+
labs(title="preference to option 1 in stage 1 (vs. option 2)")+
geom_vline(xintercept = 49.5,size=2,color = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Some participants showed preference towards one of the two stage 1 options.
Hypothesis 1: significant interaction between reward and transition type, so that rewardUnrewarded:transitionRare has a significant positive estimate (p < 0.05)
# this function shifts the lower bound of the y axis around in a ggplot
shift_trans = function(d = 0) {
scales::trans_new("shift", transform = function(x) x - d, inverse = function(x) x + d)
}

# count "stay" frequency for plotting
ind_sum_data <- count(trial_data%>%
filter(!is.na(last_rewarded))%>%
group_by(last_rewarded,common,stay),participant_id)%>%
group_by(participant_id,last_rewarded,common)%>%
mutate(trial_num = sum(n),
stay_prob = n/trial_num)%>%
mutate(common = ifelse(common == 1, "common", "rare"),
last_rewarded = ifelse(last_rewarded == "true","rewarded","unrewarded"))
#compute summary stats
fig3_data <- ind_sum_data%>%
group_by(last_rewarded,common)%>%
filter(stay == TRUE)%>%
summarise(mean_stay_prob = mean(stay_prob),
sd_stay_prob =sd(stay_prob,na.rm = TRUE),
se_stay_prob = sd_stay_prob/sqrt(participant_num))
fig3_data$common <- factor(fig3_data$common, levels = c("common","rare"))
ggplot(fig3_data,aes(x = last_rewarded, y = mean_stay_prob, group = common, fill = common))+
geom_col(position = position_dodge2(preserve = "single", padding = 0.1))+
scale_fill_manual(values = c("blue","red"))+
scale_x_discrete(limits = c("rewarded","unrewarded"))+
geom_errorbar(aes(ymin=mean_stay_prob-se_stay_prob, ymax=mean_stay_prob+se_stay_prob), width=.2,position=position_dodge(.9),size=1) +
theme_bw()+
scale_y_continuous(trans = shift_trans(0.5),limits = c(NA, 1),breaks=c(0.5,0.75,1))

Figure 4. Results in the original study
Next, we will plot some participants’ individual data to see how it looks.
It seems that some participants (e.g., 701) were more model-based than others.
#make dataframe
trial_data_glmer <- trial_data%>%
select(participant_id,trial_index,stay,block,last_rewarded,common,choice1)%>%
mutate(common = ifelse(common == 1, "common", "rare"))
trial_data_glmer$common <- factor(trial_data_glmer$common, levels = c("common","rare"))
glm1 <- glmer(stay ~ last_rewarded * common + (1 | participant_id), data = trial_data_glmer, family = binomial, control = glmerControl(optimizer = "bobyqa"))
kable(summary(glm1)$coefficients)

|                              |   Estimate | Std. Error |   z value | Pr(>\|z\|) |
|:-----------------------------|-----------:|-----------:|----------:|-----------:|
| (Intercept)                  |  0.8849060 |  0.1969664 |  4.492674 |  0.0000070 |
| last_rewardedtrue            |  0.3210125 |  0.0935336 |  3.432056 |  0.0005990 |
| commonrare                   | -0.1841440 |  0.1174138 | -1.568333 |  0.1168034 |
| last_rewardedtrue:commonrare |  0.0533074 |  0.1674285 |  0.318389 |  0.7501899 |
The main effect of reward was significant, \(\beta\) = 0.321, p < .001. The main effect of transition type (common) was not significant, \(\beta\) = -0.184, p = .117. The interaction between reward and transition type was not significant, \(\beta\) = 0.053, p = .750.
(Robustness check: exclude poorly performing participants who missed more than 10% of trials or who preferred one first-stage choice, and exclude the first 20 trials as practice.)
# exclude a participant if they have >= 10 invalid trials
participant_screen <- count(trial_data_raw%>%
group_by(participant_id)%>%
mutate(timeout = (rt_1 == -1) | (rt_1 != -1 & rt_2 == -1)),timeout)%>%
filter(timeout == TRUE)%>%
filter(n >= 10)%>%
select(participant_id)
# exclude first 20 trials as practice
ind_sum_data_r <- count(trial_data%>%
filter(!participant_id %in% participant_screen$participant_id) %>%
filter(!is.na(last_rewarded),trial_in_total >= 20)%>%
group_by(last_rewarded,common,stay),participant_id)%>%
group_by(participant_id,last_rewarded,common)%>%
mutate(trial_num = sum(n),
stay_prob = n/trial_num)%>%
mutate(common = ifelse(common == 1, "common", "rare"),
last_rewarded = ifelse(last_rewarded == "true","rewarded","unrewarded"))
#summary stats
fig3_data_r <- ind_sum_data_r%>%
group_by(last_rewarded,common)%>%
filter(stay == TRUE)%>%
summarise(mean_stay_prob = mean(stay_prob),
sd_stay_prob =sd(stay_prob,na.rm = TRUE),
se_stay_prob = sd_stay_prob/sqrt(participant_num))
fig3_data_r$common <- factor(fig3_data_r$common, levels = c("common","rare"))
ggplot(fig3_data_r,aes(x = last_rewarded, y = mean_stay_prob, group = common, fill = common))+
geom_col(position = position_dodge2(preserve = "single", padding = 0.1))+
scale_fill_manual(values = c("blue","red"))+
scale_x_discrete(limits = c("rewarded","unrewarded"))+
geom_errorbar(aes(ymin=mean_stay_prob-se_stay_prob, ymax=mean_stay_prob+se_stay_prob), width=.2,position=position_dodge(.9),size=1) +
theme_bw()+
scale_y_continuous(trans = shift_trans(0.4),limits = c(NA, 1))

trial_data_glmer_r <- trial_data%>%
filter(!participant_id%in%participant_screen$participant_id)%>%
select(participant_id,trial_index,stay,block,last_rewarded,common,choice1)
glm1_r <- glmer(stay ~ last_rewarded * common + (1 | participant_id), data = trial_data_glmer_r, family = binomial, control = glmerControl(optimizer = "bobyqa"))
kable(summary(glm1_r)$coefficients)

|                          |   Estimate | Std. Error |    z value | Pr(>\|z\|) |
|:-------------------------|-----------:|-----------:|-----------:|-----------:|
| (Intercept)              |  0.7066807 |  0.2143644 |  3.2966329 |  0.0009785 |
| last_rewardedtrue        |  0.3588610 |  0.1396963 |  2.5688659 |  0.0102032 |
| common                   |  0.1736162 |  0.1182159 |  1.4686358 |  0.1419316 |
| last_rewardedtrue:common | -0.0354739 |  0.1682189 | -0.2108792 |  0.8329815 |
Robustness check revealed the same pattern of results.
The following two hypotheses are also from Daw et al. (2011), but they are secondary to the interaction hypothesis (Hypothesis 1):
Hypothesis 2: significant positive main effect of rewardRewarded (p < 0.05)
Hypothesis 3: non-significant main effect of transition_typeCommon
These two hypotheses are supported by the data.
Greater use of the model-free strategy is correlated with higher TEPS-ANT scores, i.e., greater self-reported anticipation of future pleasure.
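A sketch of how this exploratory analysis could be run once both measures are available; df_explore, expected_omega, and teps_ant are assumed names for a per-participant data frame of model estimates and TEPS anticipatory-subscale scores.

# Illustrative sketch of the exploratory correlation (assumed column names)
cor.test(df_explore$expected_omega, df_explore$teps_ant, method = "spearman")
# A negative correlation (lower omega with higher TEPS-ANT) would be consistent
# with greater anticipatory pleasure relating to less model-based control.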
Figure 4. Computational modelling (focus on the shaded area, trials 63–81). (Left) The model-free algorithm only updates one first-stage option at a time. (Right) The model-based algorithm, based on the Bellman equation, updates two options simultaneously because of its understanding of the task structure.
\(\omega\) is a parameter in the reinforcement learning model controlling the relative contribution of the model-based process to the value computation. Larger \(\omega\) indicates a larger contribution of the model-based process.
df_params_estimate <- read_csv("/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/RL_learn_params_out.csv")
## Rows: 40 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): expected_alpha1, expected_alpha2, expected_lumbda, expected_omega, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# ggplot(df_params_estimate%>%
# mutate(id = row_number())%>%
# pivot_longer(expected_alpha1:expected_omega,names_to = "param", names_prefix = "expected_", values_to = "val"),
# aes(x = param, y = val, color = param))+
# geom_violin()+
# geom_jitter(size=2)+
# stat_summary()
ggplot(df_params_estimate,aes(x=1,y=expected_omega))+
geom_violin()+
geom_jitter(size=2, shape=1)+
stat_summary()+
geom_hline(yintercept = 0.5, color = "red")+
labs(title = "weighting of model-based RL",
y= "omega" ,x="")+
ylim(0,1)
## No summary function supplied, defaulting to `mean_se()`

quantile(df_params_estimate$expected_omega, probs = c(0.25,0.5,0.75))
##       25%       50%       75%
## 0.2209307 0.2775736 0.4461924
Conclusion from computational modelling: although the interaction between reward and transition type was not replicated in the bar plots or the logistic regression, a reinforcement learning model with a hybrid value update indicates that the contribution of model-based RL is smaller than 50% but non-zero. The estimated \(\omega\) in the current study (median 0.28, 25%/75% quantiles 0.22/0.45) was smaller than in the original paper (0.39, 25%/75% quantiles 0.29/0.59).
In this study, I sought to replicate the behavioural outcome of Daw et al. (2011). Specifically, we hypothesized a significant interaction between reward and transition type in the logistic regression model (H1).
There was no significant interaction between reward and transition type; H1 was not supported.
In addition, we hypothesized a positive main effect of reward (H2) and no significant main effect of transition type (H3) in the logistic regression model.
There was a significant positive main effect of reward but no significant main effect of transition type; H2 and H3 were supported.
Finally, we fit a reinforcement learning model and extracted a ‘weight’ parameter for the model-based strategy. This parameter (median = 0.28) lies between 0 and 0.5, suggesting a smaller contribution of the model-based than the model-free strategy.
Online studies may involve lower attention, lower motivation, and lower cognitive effort than in-person studies, all of which may shift strategy towards more model-free control.
A model-based strategy may require expertise and hence more practice. The original study used 200 trials (plus 50 practice trials), whereas this replication used only 100 trials (plus 10 practice trials), which may have compromised model-based control.
The low motivation in online studies was exacerbated by the fact that the present study rewarded only $0.01 for each coin earned in the task, making the payoff difference between strategies negligible. Under low incentives, participants tend to favor the less cognitively expensive strategy, a phenomenon also reported in a previous rodent study (Song et al., 2022, PLOS Computational Biology).