It has long been argued that human decision making is supported by two distinct computational systems: one is computationally efficient but inflexible (model-free reinforcement learning), while the other is flexible but computationally expensive (model-based reinforcement learning). However, it is unknown whether humans use both systems simultaneously.
Daw et al. (2011) designed a novel two-stage task that can disambiguate a participant’s use of model-free and model-based strategies. This is because the two strategies predict different effects of second-stage reinforcement on first-stage choices. Using hierarchical logistic regression on first-stage choice data, the authors found evidence of both model-free and model-based valuations, which was further confirmed by analysis of fMRI data.
The original study used fMRI and the sample size was limited (N = 17), so it is worth replicating in a larger, out-of-scanner sample. It is also unknown whether a web-based format will lead to similar engagement of the two systems, or will shift participants’ strategies towards one system.
Participants will be recruited from the online platform Prolific. On each trial (Figure 1A), participants are shown two Tibetan characters and use the F and J keys to choose one of them. After this, another two Tibetan characters are shown probabilistically, according to the transition matrix shown in Figure 1B. Participants again use the F and J keys to select one of the two second-stage characters. Depending on the reward probability of each second-stage stimulus, participants either receive a reward ($1) or nothing; 1% of the reward (1 cent) is paid towards their compensation. To encourage continuous learning, the reward probabilities of the four second-stage choices vary according to Gaussian random walks. Prior to the study, participants are informed that the reward probabilities will change, while the transition probabilities from the first- to second-stage stimuli remain fixed. Different from the original study, after completing the task, participants will fill out the Temporal Experience of Pleasure Scale, an 18-item questionnaire that assesses capacity for anticipatory and consummatory pleasure. This addition is based on my previous work suggesting that capacity for anticipatory pleasure may reduce model-based learning.
Figure 1. (A) The interface of the two-step task. (B) The transition matrix. The first first-stage character has a 70% chance to lead to the pink set, and 30% chance to lead to the blue set; the second first-stage character has a 30% chance to lead to the pink set, and 70% chance to lead to the blue set.
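To make the transition structure concrete, here is a minimal sketch of how a second-stage state could be sampled from a first-stage choice. The state labels and the function name sample_second_stage are illustrative, not taken from the actual task code.

# Hypothetical illustration of the transition structure in Figure 1B:
# first-stage choice 1 leads to the pink set with p = .7 and to the blue set with p = .3;
# first-stage choice 2 has the reverse mapping.
sample_second_stage <- function(choice1) {
  p_pink <- ifelse(choice1 == 1, 0.7, 0.3)
  sample(c("pink", "blue"), size = 1, prob = c(p_pink, 1 - p_pink))
}
set.seed(1)
table(replicate(1000, sample_second_stage(1)))  # roughly 700 pink / 300 blue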
The main challenge is to code this task by adapting the open materials from another similar study: https://github.com/wkool/tradeoffs/tree/master/tasks/space_daw_task. A second challenge is to reduce the trial length (the original study was paced slowly to accommodate the fMRI sampling rate) while still ensuring comparability with the original study.
Link to the experiment on AWS: http://ryanyan-daw2011-replication-project-psych251.s3-website-us-west-2.amazonaws.com
Project repository (on Github): https://github.com/psych251/daw2011
Original paper: https://github.com/psych251/daw2011/blob/main/original_paper/daw2011.pdf
Preregistration: https://osf.io/t4zmr
The authors only reported the p value from the multilevel logistic regression; the original effect size was not reported. I was therefore unable to derive a power calculation from the original study.
I obtained regression coefficient estimates from another paper, whose senior author was a second author on the target paper. That paper used the same task structure but with different materials and framing (Kool, Cushman, & Gershman, 2016). Based on N = 185 online participants, the estimates were: intercept (1.03), reward (0.26), transition (0.03), and reward × transition (0.20). The effect size calculation was based on these estimates and a within-subject random intercept of 0.5.
Power analysis revealed that a sample size of 110 is needed to achieve 80% power.
Figure 2. Power analysis. Based on the estimates reported by Kool, Cushman & Gershman (2016), 120 participants could attain a power of 80%.
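The exact procedure behind Figure 2 is not reproduced here, but a simulation-based power analysis along these lines could look like the sketch below. It assumes the Kool et al. (2016) fixed effects, interprets the within-subject random intercept of 0.5 as a standard deviation, and uses a simplified trial structure (50% rewarded trials, 30% rare transitions); the function simulate_power and its arguments are illustrative.

library(lme4)

# Hypothetical simulation-based power check for the reward x transition interaction.
simulate_power <- function(n_subj, n_trials = 99, n_sims = 200) {
  betas <- c(1.03, 0.26, 0.03, 0.20)   # intercept, reward, transition, interaction
  sig <- replicate(n_sims, {
    df <- expand.grid(subj = factor(1:n_subj), trial = 1:n_trials)
    df$reward     <- rbinom(nrow(df), 1, 0.5)   # rewarded vs. not (simplified)
    df$transition <- rbinom(nrow(df), 1, 0.3)   # rare vs. common (simplified)
    u <- rnorm(n_subj, 0, 0.5)                  # random intercepts
    eta <- betas[1] + u[df$subj] + betas[2] * df$reward +
           betas[3] * df$transition + betas[4] * df$reward * df$transition
    df$stay <- rbinom(nrow(df), 1, plogis(eta))
    fit <- glmer(stay ~ reward * transition + (1 | subj),
                 data = df, family = binomial,
                 control = glmerControl(optimizer = "bobyqa"))
    summary(fit)$coefficients["reward:transition", "Pr(>|z|)"] < 0.05
  })
  mean(sig)   # proportion of simulations detecting the interaction
}
# Example (slow): simulate_power(n_subj = 110)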
Considering both the original sample size (N = 17) and the calculated power based on the above-mentioned paper (N = 110), the sample will be N = 40 (50% female) healthy adults recruited via Prolific. Participants need to be fluent in English but are not restricted to a particular nationality.
The experimental materials are 3 pairs of Tibetan characters in colored boxes, as shown in Figure 1. In the practice trials, participants use a different set of such stimuli. The details of the experiment are described below in 2.4.
In addition, participants will fill out an 18-item questionnaire assessing hedonic capacity. The Temporal Experience of Pleasure Scale (TEPS) measures people’s capacity to experience anticipatory and consummatory pleasure (Gard et al., 2006). Its 18 items are rated on a 6-point Likert scale from 1 (very false for me) to 6 (very true for me), with 10 items assessing anticipatory pleasure and 8 items assessing consummatory pleasure. The anticipatory subscale (TEPS-ANT) and consummatory subscale (TEPS-CON) can be scored separately as well as together. Higher TEPS scores reflect a greater capacity to experience pleasure, and lower scores indicate anhedonia.
The task consisted of 99 trials, in three blocks of 33 separated by breaks. Each trial consisted of two stages. In the first stage, participants chose between two options, represented by Tibetan characters in colored boxes. If participants failed to enter a choice within 2 s, the trial was aborted. Which second-stage state was presented depended probabilistically on the first-stage choice, according to the transition scheme shown in Figure 1B (30% vs. 70% transition probability). At the second stage, participants were presented with one of two further pairs of options (‘states’) and entered another choice. The second-stage choice was rewarded with money (depicted by a dollar coin, of which participants were paid 1%) or not (depicted by a zero). Trials were separated by an inter-trial interval of randomized length, about 1.5 seconds on average.
In order to encourage ongoing learning, these reward probabilities were diffused on each trial by adding independent Gaussian noise (mean 0, SD 0.025), with reflecting boundaries at 0.25 and 0.75. Prior to the task, participants were instructed that the reward probabilities would change, but that the probabilities controlling the transitions from the first to the second stage would remain fixed. They were also instructed about the overall structure of the transition matrix: specifically, that each first-stage option was primarily associated with one or the other of the second-stage states, but not which one. Prior to the session, to familiarize themselves with the structure of the task, participants played 10 practice trials (original paper: 50 trials) using a different stimulus set. The assignment of colors to states was counterbalanced across participants, and the two options at each state were permuted between left and right from trial to trial. Each second-stage option was rewarded according to a probability associated with that option. Different from the original study, after the experiment, participants will complete a measure of hedonic capacity (the Temporal Experience of Pleasure Scale). Participants are paid $8 per hour for their participation, plus the bonus they earn from the study.
Figure 3. Example of a Gaussian random walk
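A minimal sketch of how such a random walk can be generated, using the parameters described above (Gaussian noise with SD = 0.025 and reflecting boundaries at 0.25 and 0.75); the function name and plotting call are illustrative.

# Hypothetical illustration of the diffusing reward probabilities.
random_walk <- function(n_trials = 99, start = runif(1, 0.25, 0.75)) {
  p <- numeric(n_trials)
  p[1] <- start
  for (t in 2:n_trials) {
    step <- p[t - 1] + rnorm(1, mean = 0, sd = 0.025)
    # reflect off the boundaries
    if (step > 0.75) step <- 2 * 0.75 - step
    if (step < 0.25) step <- 2 * 0.25 - step
    p[t] <- step
  }
  p
}
set.seed(123)
walks <- replicate(4, random_walk())   # one walk per second-stage option
matplot(walks, type = "l", ylim = c(0, 1),
        xlab = "trial", ylab = "reward probability")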
I will run a logistic regression in which the dependent variable is the first-stage choice (coded as stay versus switch), and the explanatory variables are the reward received on the previous trial, a binary indicator of whether the previous trial’s transition was common or rare, and their interaction. I will take all coefficients as random effects across subjects and estimate this multilevel regression using the lme4 mixed-effects package (Bates & Maechler, 2010) in the R statistical language (R Development Core Team, 2010).
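A sketch of this full random-effects specification in lme4 (the column names stay, reward, transition_type, and participant_id follow the confirmatory model given later; the data frame df is a placeholder):

library(lme4)
# Random intercept plus random slopes for reward, transition type, and their interaction.
fit_full <- glmer(stay ~ reward * transition_type +
                    (1 + reward * transition_type | participant_id),
                  data = df, family = binomial,
                  control = glmerControl(optimizer = "bobyqa"))
summary(fit_full)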
Based on my previous work suggesting that capacity for anticipatory pleasure may be associated with decreased learning of reward representations, which relies on model-based learning, I will examine the association between the TEPS questionnaire and individual differences in model-based and model-free learning.
All participants were recruited around 12-2pm PST.
The actual sample included N = 40 participants recruited from Prolific. According to Prolific, their median age was 27 years, 72.5% were male, and 87.5% were White.
#read in demographic data
df_demo <- read_csv("/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/demo.csv")%>%
filter(Status == "APPROVED")%>%
select(Age, Sex,"Ethnicity simplified")quantile(df_demo$Age)## 0% 25% 50% 75% 100%
## 20.00 23.75 27.00 36.50 53.00
table(df_demo$Sex)/nrow(df_demo)##
## Female Male
## 0.275 0.725
table(df_demo$"Ethnicity simplified")/nrow(df_demo)##
## Black Mixed Other White
## 0.025 0.075 0.025 0.875
The sample was recruited as a standard sample on Prolific, rather than the gender-balanced sample specified in the pre-registration.
data_folder <- "/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/aws/task-data-sav/ryanyan-daw2011-replication-project-psych251/"
data_list <- list.files(data_folder)

set.seed(123) # to ensure consistency in encoding
all_possible_ids <- c(trial_data_raw$participant_id) %>% unique() %>% na.omit()
random_ids <- sample(100:1000, length(all_possible_ids), replace = F) %>% as.character()
id_dictionary <- cbind(all_possible_ids, random_ids) %>% `colnames<-`(c("old", "new")) %>% as.data.frame()
#save dictionary locally
write_csv(id_dictionary,"/Users/rh/Desktop/first_year/PSYCH251_MF/id_dictionary.csv")

The original paper did not mention any data exclusion. Trials where participants fail to respond within 2 seconds will be automatically excluded.
As a robustness check, I will examine whether the main hypothesis still holds after (1) excluding participants who failed more than 10% of all trials (10 trials), and (2) excluding the first 10 trials of the task as practice.
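A sketch of how these exclusions could be implemented with dplyr (assuming timed-out trials are logged with a missing second-stage key press; the exact column names depend on the logged data):

library(dplyr)
# Hypothetical robustness-check exclusions (column names are illustrative):
# (1) drop participants who timed out on more than 10% of trials (> 10 of 99),
# (2) drop each participant's first 10 trials as additional practice.
trial_data_robust <- trial_data %>%
  group_by(participant_id) %>%
  mutate(n_invalid = sum(is.na(key_press2)),
         trial_number = row_number()) %>%
  ungroup() %>%
  filter(n_invalid <= 10, trial_number > 10)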
ggplot(df_invalid_trials,aes(x=n))+
geom_histogram(color = "black",fill = "white")+
labs(title = "count of invalid trials (time out after 2s)",
x = "trial count")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
kable(count(trial_data%>%
filter(!is.na(trial_index))%>%
group_by(choice1), state2_color)%>%
group_by(choice1)%>%
mutate(prob = n/sum(n))%>%
select(-n)%>%
pivot_wider(names_from = state2_color, values_from = prob))

| choice1 | blue | pink |
|---|---|---|
| 1 | 0.3097034 | 0.6902966 |
| 2 | 0.7021277 | 0.2978723 |
The empirical transition frequencies match the intended 70%/30% transition structure.
df_key <- count(trial_data%>%
ungroup(),participant_id,key_press2)%>%
pivot_wider(names_from = key_press2, values_from = n)
ggplot(df_key,aes(x=f))+
geom_histogram(fill="white",color = "black")+
xlim(0,99)+
geom_vline(xintercept = 49.5,size=2,color = "blue")+
labs(title = "freq pressing left key")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
The key-press pattern shows no obvious bias toward either key.
choice1_freq <- count(trial_data%>%
ungroup(),participant_id,choice1)%>%
pivot_wider(names_from = choice1, names_prefix = "option", values_from = n)
ggplot(choice1_freq,aes(x = option1))+
geom_histogram(color="black", fill="white")+
labs(title="preference to option 1 in stage 1 (vs. option 2)")+
geom_vline(xintercept = 49.5,size=2,color = "blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Some participants showed preference towards one of the two stage 1 options.
The key result to replicate is the significant interaction between reward and transition type in predicting first-stage choice in the following logistic regression model:
glmer(stay ~ reward * transition_type + (1 | participant_ID), data = df, family = binomial, control = glmerControl(optimizer = "bobyqa"))
reward: factor, rewarded or unrewarded, ref = rewarded
transition_type: factor, common or rare, ref = common
Hypothesis 1: a significant interaction between reward and transition type, such that rewardUnrewarded:transitionRare has a significant positive estimate (p < 0.05)
Figure 4. Results in the original study. (C) is the result I want to replicate here.
Next, we will plot some participants’ individual data to see how it looks.
It seems that some participants (e.g., 701) were more model-based than others.
trial_data_glmer <- trial_data%>%
select(participant_id,trial_index,stay,block,last_rewarded,common,choice1)%>%
mutate(common = ifelse(common == 1, "common", "rare"))
trial_data_glmer$common <- factor(trial_data_glmer$common, levels = c("common","rare"))
glm1 <- glmer(stay ~ last_rewarded * common + (1 | participant_id), data = trial_data_glmer, family = binomial, control = glmerControl(optimizer = "bobyqa"))
kable(summary(glm1)$coefficients)

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.8849060 | 0.1969664 | 4.492674 | 0.0000070 |
| last_rewardedtrue | 0.3210125 | 0.0935336 | 3.432056 | 0.0005990 |
| commonrare | -0.1841440 | 0.1174138 | -1.568333 | 0.1168034 |
| last_rewardedtrue:commonrare | 0.0533074 | 0.1674285 | 0.318389 | 0.7501899 |
The main effect of reward was significant, \(\beta\) = 0.32, p < .001. The main effect of transition type was not significant, \(\beta\) = -0.18, p = .117. The interaction between reward and transition type was not significant, \(\beta\) = 0.05, p = .750.
(Robustness check: excluding participants who missed more than 10% of trials or who strongly preferred one first-stage option, and excluding the first 10 trials as practice.)
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.7066807 | 0.2143644 | 3.2966329 | 0.0009785 |
| last_rewardedtrue | 0.3588610 | 0.1396963 | 2.5688659 | 0.0102032 |
| common | 0.1736162 | 0.1182159 | 1.4686358 | 0.1419316 |
| last_rewardedtrue:common | -0.0354739 | 0.1682189 | -0.2108792 | 0.8329815 |
The robustness check revealed the same pattern of results.
The following two hypotheses are also from Daw et al. (2011), but are secondary to the interaction hypothesis (Hypothesis 1):
Hypothesis 2: a significant positive main effect of reward (p < 0.05)
Hypothesis 3: a non-significant main effect of transition type
These two hypotheses are supported by the data.
A metric of the use of model-based strategy can be derived from the following equation, for each individual respectively:
\(\displaystyle modelBased= \frac{P(rewardedCommon)+P(unrewardedRare)}{2} - \frac{P(rewardedRare)+P(unrewardedCommon)}{2}\)
where \(P\) is the stay probability of a trial type, averaged within-subject. Similarly, the use of model-free strategy can be derived as follows:
\(\displaystyle modelFree= \frac{P(rewardedCommon)+P(rewardedRare)}{2} - \frac{P(unrewardedCommon)+P(unrewardedRare)}{2}\)
Hypothesis 4: greater capacity for anticipatory pleasure, as measured by TEPS-ANT, is associated with lower measure of model-based strategy, and higher measure of model-free strategy, as defined above.
lm(modelBased~TEPS_ANT, data) --> the coefficient should be negative (H4 a)
lm(modelFree~TEPS_ANT, data) --> the coefficient should be positive (H4 b)
model_free_measure <- ind_sum_data%>%
filter(stay == TRUE)%>%
select(participant_id,stay_prob)%>%
pivot_wider(names_from = c(last_rewarded,common), values_from = stay_prob)%>%
rowwise()%>%
mutate(modelBased = (rewarded_common+unrewarded_rare)/2 - (unrewarded_common+rewarded_rare)/2,
modelFree = (rewarded_common+rewarded_rare)/2 - (unrewarded_common+unrewarded_rare)/2)
df_total_score <- trial_data_raw%>%
filter(block==3,trial_index==32)%>%
ungroup()%>%
select(participant_id,score)
model_free_measure <- merge(model_free_measure,df_total_score,by = "participant_id")
# ggdensity(model_free_measure$modelBased)+geom_vline(xintercept = 0, color = "red")
# t.test(model_free_measure$modelBased,mu=0)
#
# ggdensity(model_free_measure$modelFree)+geom_vline(xintercept = 0, color = "red")
# t.test(model_free_measure$modelFree,mu=0)

survey_data_folder <- "/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/qualtrics/"
survey_data_list <- list.files(survey_data_folder)
for (i in 1:length(survey_data_list)){
#read data
data_temp <- read_survey(paste0(survey_data_folder,survey_data_list[i]))
#combine data
if (i == 1){
survey_data = data_temp
} else {
survey_data = rbind(survey_data,data_temp)
}
}
survey_data$participant_id <- survey_data$prolific_id
survey_data <- survey_data%>%
rename(old = participant_id)%>%
left_join(id_dictionary, by = "old")%>%
select(-old,-prolific_id)%>%
rename(participant_id = new)
# reverse coding for the restaurant item
survey_data$TEPS_13 = 7 - survey_data$TEPS_13
# compute total score
for (p in 1:nrow(survey_data)){
survey_data$TEPS_CON[p] <- sum(survey_data[p,paste0("TEPS_",c(2,3,5,7,9,12,14,17))])
survey_data$TEPS_ANT[p] <- sum(survey_data[p,paste0("TEPS_",c(1,4,6,8,10,11,13,15,16,18))])
}
model_free_measure <- left_join(model_free_measure,survey_data%>%
select(participant_id,TEPS_CON,TEPS_ANT), by = "participant_id")

Use of the model-free strategy is negatively correlated with use of the model-based strategy. More use of the model-based strategy yields more points in this task.
lm1 <- lm(TEPS_ANT~modelBased,model_free_measure)
kable(summary(lm1)$coefficients)

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 41.886700 | 1.097026 | 38.1820390 | 0.0000000 |
| modelBased | -1.196124 | 9.885722 | -0.1209951 | 0.9043324 |
lm2 <- lm(TEPS_ANT~modelFree,model_free_measure)
kable(summary(lm2)$coefficients)

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 40.76955 | 1.185015 | 34.404249 | 0.0000000 |
| modelFree | 16.90960 | 8.519334 | 1.984851 | 0.0544149 |
More use of the model-free strategy was marginally associated with higher TEPS-ANT (p = .054), i.e., greater self-reported anticipation of future pleasure.
Computational modelling was conducted according to Daw et al.’s (2011) method and supplementary information.
In the model-free reinforcement learning model, values of all first- and second-stage choices were updated by a temporal-difference prediction error signal. In the model-based reinforcement learning model, first-stage values were computed according to the Bellman equation, which takes into account the transition matrix specified above, while second-stage values were updated in the same way as in the model-free RL.
In the present computational model, model-based and model-free RL both contribute to the value update, and the proportion of the contribution from model-based RL was represented as a parameter \(\omega\) (0 < \(\omega\) < 1). The closer \(\omega\) is to 1, the larger the relative contribution of model-based RL.
All parameters were estimated using the maximum likelihood method.
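For concreteness, here is a minimal sketch of the hybrid value computation and a simplified single-trial update (one shared learning rate, and no eligibility trace, perseveration bonus, or softmax choice rule, unlike the full Daw et al. model; all function and variable names are illustrative).

# Simplified hybrid value computation for the first-stage options.
# q1_mf: length-2 vector of model-free first-stage values;
# q2: 2 x 2 matrix of second-stage values (rows = states, columns = options);
# trans: known transition matrix (rows = first-stage choices, columns = states).
hybrid_trial_values <- function(q1_mf, q2, omega,
                                trans = matrix(c(.7, .3, .3, .7), nrow = 2)) {
  # Model-based values: expected best second-stage value under the transition matrix.
  q1_mb <- as.numeric(trans %*% apply(q2, 1, max))
  # Net values mix the two systems; omega is the model-based weight.
  omega * q1_mb + (1 - omega) * q1_mf
}

# Simplified learning step after observing choice a1, state s2, choice a2, reward r.
update_values <- function(q1_mf, q2, a1, s2, a2, r, alpha) {
  delta2 <- r - q2[s2, a2]                  # second-stage prediction error
  q2[s2, a2] <- q2[s2, a2] + alpha * delta2
  delta1 <- q2[s2, a2] - q1_mf[a1]          # first-stage TD error (model-free)
  q1_mf[a1] <- q1_mf[a1] + alpha * delta1
  list(q1_mf = q1_mf, q2 = q2)
}
# Example: with omega = 1 the agent relies entirely on the model-based values.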
(Left) The model-free algorithm only updates one first-stage option at a time.
(Right) The model-based algorithm based on Bellman’s equation updates two options simultaneously, because of its understanding of the task structure.
Figure 5. Computational modelling (focus on the shaded area, trials 63–81).
\(\omega\) is a parameter in the reinforcement learning model controlling the relative contribution of the model-based process to the value computation. A larger \(\omega\) indicates a larger contribution of the model-based process.
df_params_estimate <- read_csv("/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/RL_learn_params_out.csv")

## Rows: 40 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): expected_alpha1, expected_alpha2, expected_lumbda, expected_omega, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# ggplot(df_params_estimate%>%
# mutate(id = row_number())%>%
# pivot_longer(expected_alpha1:expected_omega,names_to = "param", names_prefix = "expected_", values_to = "val"),
# aes(x = param, y = val, color = param))+
# geom_violin()+
# geom_jitter(size=2)+
# stat_summary()
ggplot(df_params_estimate,aes(x=1,y=expected_omega))+
geom_violin()+
geom_jitter(size=2, shape=1)+
stat_summary()+
geom_hline(yintercept = 0.5, color = "red")+
labs(title = "weighting of model-based RL",
y= "omega" ,x="")+
ylim(0,1)

## No summary function supplied, defaulting to `mean_se()`
quantile(df_params_estimate$expected_omega, probs = c(0.25,0.5,0.75))

##       25%       50%       75%
## 0.2209307 0.2775736 0.4461924
Conclusion from computational modelling: although the interaction between reward and transition type was not replicated in the bar plots or the logistic regression, a reinforcement learning model with a hybrid value update indicates that the contribution of model-based RL is smaller than 50% but non-zero. The estimated \(\omega\) from the current study (median 0.28, 25%/75% quantiles 0.22/0.45) was smaller than that of the original paper (median 0.39, 25%/75% quantiles 0.29/0.59).
In this study, I sought to replicate the behavioural outcome of Daw et al. (2011). Specifically, I hypothesized a significant interaction between reward and transition type in the logistic regression model (H1).
There was no significant interaction between reward and transition type; H1 was rejected.
In addition, we hypothesized that there is a positive main effect of reward (H2), but no significant main effect of transition type (H3) in the logistic regression model.
There was a significant positive main effect of reward, but no significant main effect of transition type. H2 and H3 were accepted.
We hypothesized that self-reported capacity for anticipatory pleasure is negatively correlated with the use of the model-based strategy (H4a) and positively correlated with the use of the model-free strategy (H4b).
Self-reported anticipatory pleasure (TEPS-ANT) was not correlated with use of the model-based strategy; H4a was rejected. TEPS-ANT was positively, though only marginally (p = .054), correlated with use of the model-free strategy; H4b was tentatively accepted.
Finally, we fit a reinforcement learning model and extracted a ‘weight’ parameter (\(\omega\)) for the model-based strategy. This parameter (median = 0.28) lies between 0 and 0.5, suggesting a smaller contribution of the model-based than the model-free strategy.
Online studies may involve lower attention, lower motivation, and lower cognitive effort than in-person studies, all of which may shift participants’ strategies towards the model-free end.
A model-based strategy may require expertise and hence more practice. The original study used 200 trials (plus 50 practice trials), while this replication used only 99 trials (plus 10 practice trials). This may have compromised the model-based strategy.
The low motivation in online studies was exacerbated by the fact that the present study rewarded only $0.01 for each coin earned in the task, making the payoff difference between strategies negligible. Under low incentives, participants tend to favor the less cognitively expensive strategy. This phenomenon was also reported in a previous rodent study (Song et al., 2022, PLOS Computational Biology).