It has long been argued that human decision making is supported by two distinct computational systems: one is computationally efficient but inflexible (model-free reinforcement learning), while the other is flexible but computationally expensive (model-based reinforcement learning). However, it is unknown whether humans use both systems simultaneously.
Daw et al. (2011) designed a novel two-stage task that can disambiguate a participant’s use of model-free and model-based strategies. This is because the two strategies predict different effects of second-stage reinforcement on first-stage choices. Using hierarchical logistic regression on first-stage choice data, the authors found evidence of both model-free and model-based valuations, which was further confirmed by analysis of fMRI data.
The original study used fMRI and the sample size was limited (N = 17), so it is worth replicating in a larger, out-of-scanner sample. It is also unknown whether a web-based format will lead to similar engagement of the two systems, or will shift participants’ strategies towards one system.
Participants will be recruited from the online platform Prolific. On each trial (Figure 1A), participants are shown two Tibetan characters and use the F and J keys to choose one of them. After this, another two Tibetan characters are shown probabilistically, according to the transition matrix shown in Figure 1B. Participants again use the F and J keys to select one of the two second-stage characters. Depending on the reward probability of each second-stage stimulus, participants either receive a reward ($1) or nothing; 1% of the reward (1 cent) is paid towards their compensation. To encourage continuous learning, the reward probabilities of the four second-stage choices vary according to Gaussian random walks. Prior to the study, participants are informed that the reward probabilities will change, while the transition probabilities from the first- to second-stage stimuli remain fixed. Different from the original study, after completing the task, participants will fill out the Temporal Experience of Pleasure Scale, an 18-item questionnaire that assesses capacity for anticipatory and consummatory pleasure. This addition is based on my previous work suggesting that capacity for anticipatory pleasure may reduce model-based learning.
Figure 1. (A) The interface of the two-step task. (B) The transition matrix. The first first-stage character has a 70% chance to lead to the pink set, and 30% chance to lead to the blue set; the second first-stage character has a 30% chance to lead to the pink set, and 70% chance to lead to the blue set.
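To make the transition structure concrete, here is a minimal sketch of how a second-stage state could be sampled from a first-stage choice. The state labels and the function name sample_second_stage are illustrative, not taken from the actual task code.

# Hypothetical illustration of the transition structure in Figure 1B:
# first-stage choice 1 leads to the pink set with p = .7 and to the blue set with p = .3;
# first-stage choice 2 has the reverse mapping.
sample_second_stage <- function(choice1) {
  p_pink <- ifelse(choice1 == 1, 0.7, 0.3)
  sample(c("pink", "blue"), size = 1, prob = c(p_pink, 1 - p_pink))
}
set.seed(1)
table(replicate(1000, sample_second_stage(1)))  # roughly 700 pink / 300 blue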
The main challenge is to code this task by adapting the open materials from another similar study: https://github.com/wkool/tradeoffs/tree/master/tasks/space_daw_task. A second challenge is to reduce the trial length (the original study was paced slowly to accommodate the fMRI sampling rate) while still ensuring comparability with the original study.
Link to the experiment on AWS: http://ryanyan-daw2011-replication-project-psych251.s3-website-us-west-2.amazonaws.com
Project repository (on Github): https://github.com/psych251/daw2011
Original paper: https://github.com/psych251/daw2011/blob/main/original_paper/daw2011.pdf
Preregistration: https://osf.io/t4zmr
The authors only reported the p value from the multilevel logistic regression; the original effect size was not reported. I was therefore unable to derive a power calculation from the original study.
I obtained regression coefficient estimates from another paper, whose senior author was a second author on the target paper. That paper used the same task structure but with different materials and framing (Kool, Cushman, & Gershman, 2016). Based on N = 185 online participants, the estimates were: intercept (1.03), reward (0.26), transition (0.03), and reward × transition (0.20). The effect size calculation was based on these estimates and a within-subject random intercept of 0.5.
Power analysis revealed that a sample size of 110 is needed to achieve 80% power.
Figure 2. Power analysis. Based on the estimates reported by Kool, Cushman & Gershman (2016), 120 participants could attain a power of 80%.
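The exact procedure behind Figure 2 is not reproduced here, but a simulation-based power analysis along these lines could look like the sketch below. It assumes the Kool et al. (2016) fixed effects, interprets the within-subject random intercept of 0.5 as a standard deviation, and uses a simplified trial structure (50% rewarded trials, 30% rare transitions); the function simulate_power and its arguments are illustrative.

library(lme4)

# Hypothetical simulation-based power check for the reward x transition interaction.
simulate_power <- function(n_subj, n_trials = 99, n_sims = 200) {
  betas <- c(1.03, 0.26, 0.03, 0.20)   # intercept, reward, transition, interaction
  sig <- replicate(n_sims, {
    df <- expand.grid(subj = factor(1:n_subj), trial = 1:n_trials)
    df$reward     <- rbinom(nrow(df), 1, 0.5)   # rewarded vs. not (simplified)
    df$transition <- rbinom(nrow(df), 1, 0.3)   # rare vs. common (simplified)
    u <- rnorm(n_subj, 0, 0.5)                  # random intercepts
    eta <- betas[1] + u[df$subj] + betas[2] * df$reward +
           betas[3] * df$transition + betas[4] * df$reward * df$transition
    df$stay <- rbinom(nrow(df), 1, plogis(eta))
    fit <- glmer(stay ~ reward * transition + (1 | subj),
                 data = df, family = binomial,
                 control = glmerControl(optimizer = "bobyqa"))
    summary(fit)$coefficients["reward:transition", "Pr(>|z|)"] < 0.05
  })
  mean(sig)   # proportion of simulations detecting the interaction
}
# Example (slow): simulate_power(n_subj = 110)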
Considering both the original sample size (N = 17) and the calculated power based on the above-mentioned paper (N = 110), the sample will be N = 40 (50% female) healthy adults recruited via Prolific. Participants need to be fluent in English but are not restricted to a particular nationality.
The experimental materials are 3 pairs of Tibetan characters in colored boxes, as shown in Figure 1. In the practice trials, participants use a different set of such stimuli. The details of the experiment are described below in 2.4.
In addition, participants will fill out an 18-item questionnaire assessing hedonic capacity. The Temporal Experience of Pleasure Scale (TEPS) measures people’s capacity to experience anticipatory and consummatory pleasure (Gard et al., 2006). Its 18 items are rated on a 6-point Likert scale from 1 (very false for me) to 6 (very true for me), with 10 items assessing anticipatory pleasure and 8 items assessing consummatory pleasure. The anticipatory subscale (TEPS-ANT) and consummatory subscale (TEPS-CON) can be scored separately as well as together. Higher TEPS scores reflect a greater capacity to experience pleasure, and lower scores indicate anhedonia.
The task consisted of 99 trials, in three blocks of 33 separated by breaks. Each trial consisted of two stages. In the first stage, participants chose between two options, represented by Tibetan characters in colored boxes. If participants failed to enter a choice within 2 s, the trial was aborted. Which second-stage state was presented depended probabilistically on the first-stage choice, according to the transition scheme shown in Figure 1B (30% vs. 70% transition probability). At the second stage, participants were presented with one of two further pairs of options (‘states’) and entered another choice. The second-stage choice was rewarded with money (depicted by a dollar coin, of which participants were paid 1%) or not (depicted by a zero). Trials were separated by an inter-trial interval of randomized length, about 1.5 seconds on average.
In order to encourage ongoing learning, these reward probabilities were diffused on each trial by adding independent Gaussian noise (mean 0, SD 0.025), with reflecting boundaries at 0.25 and 0.75. Prior to the task, participants were instructed that the reward probabilities would change, but that the probabilities controlling the transitions from the first to the second stage would remain fixed. They were also instructed about the overall structure of the transition matrix: specifically, that each first-stage option was primarily associated with one or the other of the second-stage states, but not which one. Prior to the session, to familiarize themselves with the structure of the task, participants played 10 practice trials (original paper: 50 trials) using a different stimulus set. The assignment of colors to states was counterbalanced across participants, and the two options at each state were permuted between left and right from trial to trial. Each second-stage option was rewarded according to a probability associated with that option. Different from the original study, after the experiment, participants will complete a measure of hedonic capacity (the Temporal Experience of Pleasure Scale). Participants are paid $8 per hour for their participation, plus the bonus they earn from the study.
Figure 3. Example of a Gaussian random walk
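A minimal sketch of how such a random walk can be generated, using the parameters described above (Gaussian noise with SD = 0.025 and reflecting boundaries at 0.25 and 0.75); the function name and plotting call are illustrative.

# Hypothetical illustration of the diffusing reward probabilities.
random_walk <- function(n_trials = 99, start = runif(1, 0.25, 0.75)) {
  p <- numeric(n_trials)
  p[1] <- start
  for (t in 2:n_trials) {
    step <- p[t - 1] + rnorm(1, mean = 0, sd = 0.025)
    # reflect off the boundaries
    if (step > 0.75) step <- 2 * 0.75 - step
    if (step < 0.25) step <- 2 * 0.25 - step
    p[t] <- step
  }
  p
}
set.seed(123)
walks <- replicate(4, random_walk())   # one walk per second-stage option
matplot(walks, type = "l", ylim = c(0, 1),
        xlab = "trial", ylab = "reward probability")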
I will run a logistic regression in which the dependent variable is the first-stage choice (coded as stay versus switch), and the explanatory variables are the reward received on the previous trial, a binary indicator of whether the previous trial’s transition was common or rare, and their interaction. I will take all coefficients as random effects across subjects and estimate this multilevel regression using the lme4 mixed-effects package (Bates & Maechler, 2010) in the R statistical language (R Development Core Team, 2010).
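A sketch of this full random-effects specification in lme4 (the column names stay, reward, transition_type, and participant_id follow the confirmatory model given later; the data frame df is a placeholder):

library(lme4)
# Random intercept plus random slopes for reward, transition type, and their interaction.
fit_full <- glmer(stay ~ reward * transition_type +
                    (1 + reward * transition_type | participant_id),
                  data = df, family = binomial,
                  control = glmerControl(optimizer = "bobyqa"))
summary(fit_full)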
Based on my previous work suggesting that capacity for anticipatory pleasure may be associated with decreased learning of reward representations, which relies on model-based learning, I will examine the association between the TEPS questionnaire and individual differences in model-based and model-free learning.
All participants were recruited around 12-2pm PST.
The actual sample included N = 40 participants recruited from Prolific. According to Prolific, their median age was 27 years, 72.5% were male, and 87.5% were White.
#read in demographic data
df_demo <- read_csv("/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/demo.csv")%>%
filter(Status == "APPROVED")%>%
select(Age, Sex,"Ethnicity simplified")quantile(df_demo$Age)## 0% 25% 50% 75% 100%
## 20.00 23.75 27.00 36.50 53.00
table(df_demo$Sex)/nrow(df_demo)##
## Female Male
## 0.275 0.725
table(df_demo$"Ethnicity simplified")/nrow(df_demo)##
## Black Mixed Other White
## 0.025 0.075 0.025 0.875
The sample was recruited as a standard sample on Prolific, rather than the gender-balanced sample specified in the pre-registration.
data_folder <- "/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/aws/task-data-sav/ryanyan-daw2011-replication-project-psych251/"
data_list <- list.files(data_folder)

set.seed(123) # to ensure consistency in encoding
all_possible_ids <- c(trial_data_raw$participant_id) %>% unique() %>% na.omit()
random_ids <- sample(100:1000, length(all_possible_ids), replace = F) %>% as.character()
id_dictionary <- cbind(all_possible_ids, random_ids) %>% `colnames<-`(c("old", "new")) %>% as.data.frame()
#save dictionary locally
write_csv(id_dictionary,"/Users/rh/Desktop/first_year/PSYCH251_MF/id_dictionary.csv")

The original paper did not mention any data exclusion. Trials where participants fail to respond within 2 seconds will be automatically excluded.
As a robustness check, I will examine whether the main hypothesis still holds after (1) excluding participants who failed more than 10% of all trials (10 trials), and (2) excluding the first 10 trials of the task as practice.
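A sketch of how these exclusions could be implemented with dplyr (assuming timed-out trials are logged with a missing second-stage key press; the exact column names depend on the logged data):

library(dplyr)
# Hypothetical robustness-check exclusions (column names are illustrative):
# (1) drop participants who timed out on more than 10% of trials (> 10 of 99),
# (2) drop each participant's first 10 trials as additional practice.
trial_data_robust <- trial_data %>%
  group_by(participant_id) %>%
  mutate(n_invalid = sum(is.na(key_press2)),
         trial_number = row_number()) %>%
  ungroup() %>%
  filter(n_invalid <= 10, trial_number > 10)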
ggplot(df_invalid_trials,aes(x=n))+
geom_histogram(color = "black",fill = "white")+
labs(title = "count of invalid trials (time out after 2s)",
x = "trial count")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
kable(count(trial_data%>%
filter(!is.na(trial_index))%>%
group_by(choice1), state2_color)%>%
group_by(choice1)%>%
mutate(prob = n/sum(n))%>%
select(-n)%>%
pivot_wider(names_from = state2_color, values_from = prob))

| choice1 | blue | pink |
|---|---|---|
| 1 | 0.3097034 | 0.6902966 |
| 2 | 0.7021277 | 0.2978723 |
The empirical transition frequencies match the intended 70%/30% transition structure.
df_key <- count(trial_data%>%
ungroup(),participant_id,key_press2)%>%
pivot_wider(names_from = key_press2, values_from = n)
ggplot(df_key,aes(x=f))+
geom_histogram(fill="white",color = "black")+
xlim(0,99)+
geom_vline(xintercept = 49.5,size=2,color = "blue")+
labs(title = "freq pressing left key")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
The key-press pattern shows no obvious bias toward either key.
choice1_freq <- count(trial_data%>%
ungroup(),participant_id,choice1)%>%
pivot_wider(names_from = choice1, names_prefix = "option", values_from = n)
ggplot(choice1_freq,aes(x = option1))+
geom_histogram(color="black", fill="white")+
labs(title="preference to option 1 in stage 1 (vs. option 2)")+
geom_vline(xintercept = 49.5,size=2,color = "blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Some participants showed preference towards one of the two stage 1 options.
The key result to replicate is the significant interaction between reward and transition type in predicting first-stage choice in the following logistic regression model:
glmer(stay ~ reward * transition_type + (1 | participant_ID), data = df, family = binomial, control = glmerControl(optimizer = "bobyqa"))
reward: factor, rewarded or unrewarded, ref = rewarded
transition_type: factor, common or rare, ref = common
Hypothesis 1: a significant interaction between reward and transition type, such that rewardUnrewarded:transitionRare has a significant positive estimate (p < 0.05)
Figure 4. Results in the original study. (C) is the result I want to replicate here.
Next, we will plot some participants’ individual data to see how it looks.
It seems that some participants (e.g., 701) were more model-based than others.
trial_data_glmer <- trial_data%>%
select(participant_id,trial_index,stay,block,last_rewarded,common,choice1)%>%
mutate(common = ifelse(common == 1, "common", "rare"))
trial_data_glmer$common <- factor(trial_data_glmer$common, levels = c("common","rare"))
glm1 <- glmer(stay ~ last_rewarded * common + (1 | participant_id), data = trial_data_glmer, family = binomial, control = glmerControl(optimizer = "bobyqa"))
kable(summary(glm1)$coefficients)

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.8849060 | 0.1969664 | 4.492674 | 0.0000070 |
| last_rewardedtrue | 0.3210125 | 0.0935336 | 3.432056 | 0.0005990 |
| commonrare | -0.1841440 | 0.1174138 | -1.568333 | 0.1168034 |
| last_rewardedtrue:commonrare | 0.0533074 | 0.1674285 | 0.318389 | 0.7501899 |
The main effect of reward was significant, \(\beta\) = 0.32, p < .001. The main effect of transition type was not significant, \(\beta\) = -0.18, p = .117. The interaction between reward and transition type was not significant, \(\beta\) = 0.05, p = .750.
(Robustness check: excluding participants who missed more than 10% of trials or who strongly preferred one first-stage option, and excluding the first 10 trials as practice.)
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.7066807 | 0.2143644 | 3.2966329 | 0.0009785 |
| last_rewardedtrue | 0.3588610 | 0.1396963 | 2.5688659 | 0.0102032 |
| common | 0.1736162 | 0.1182159 | 1.4686358 | 0.1419316 |
| last_rewardedtrue:common | -0.0354739 | 0.1682189 | -0.2108792 | 0.8329815 |
The robustness check revealed the same pattern of results.
The following two hypotheses are also from Daw et al. (2011), but are secondary to the interaction hypothesis (Hypothesis 1):
Hypothesis 2: a significant positive main effect of reward (p < 0.05)
Hypothesis 3: a non-significant main effect of transition type
These two hypotheses are supported by the data.
A metric of the use of model-based strategy can be derived from the following equation, for each individual respectively:
\(\displaystyle modelBased= \frac{P(rewardedCommon)+P(unrewardedRare)}{2} - \frac{P(rewardedRare)+P(unrewardedCommon)}{2}\)
where \(P\) is the stay probability of a trial type, averaged within-subject. Similarly, the use of model-free strategy can be derived as follows:
\(\displaystyle modelFree= \frac{P(rewardedCommon)+P(rewardedRare)}{2} - \frac{P(unrewardedCommon)+P(unrewardedRare)}{2}\)
Hypothesis 4: greater capacity for anticipatory pleasure, as measured by TEPS-ANT, is associated with lower measure of model-based strategy, and higher measure of model-free strategy, as defined above.
lm(modelBased~TEPS_ANT, data) --> the coefficient should be negative (H4 a)
lm(modelFree~TEPS_ANT, data) --> the coefficient should be positive (H4 b)
model_free_measure <- ind_sum_data%>%
filter(stay == TRUE)%>%
select(participant_id,stay_prob)%>%
pivot_wider(names_from = c(last_rewarded,common), values_from = stay_prob)%>%
rowwise()%>%
mutate(modelBased = (rewarded_common+unrewarded_rare)/2 - (unrewarded_common+rewarded_rare)/2,
modelFree = (rewarded_common+rewarded_rare)/2 - (unrewarded_common+unrewarded_rare)/2)
df_total_score <- trial_data_raw%>%
filter(block==3,trial_index==32)%>%
ungroup()%>%
select(participant_id,score)
model_free_measure <- merge(model_free_measure,df_total_score,by = "participant_id")
# ggdensity(model_free_measure$modelBased)+geom_vline(xintercept = 0, color = "red")
# t.test(model_free_measure$modelBased,mu=0)
#
# ggdensity(model_free_measure$modelFree)+geom_vline(xintercept = 0, color = "red")
# t.test(model_free_measure$modelFree,mu=0)

survey_data_folder <- "/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/qualtrics/"
survey_data_list <- list.files(survey_data_folder)
for (i in 1:length(survey_data_list)){
#read data
data_temp <- read_survey(paste0(survey_data_folder,survey_data_list[i]))
#combine data
if (i == 1){
survey_data = data_temp
} else {
survey_data = rbind(survey_data,data_temp)
}
}
survey_data$participant_id <- survey_data$prolific_id
survey_data <- survey_data%>%
rename(old = participant_id)%>%
left_join(id_dictionary, by = "old")%>%
select(-old,-prolific_id)%>%
rename(participant_id = new)
# reverse coding for the restaurant item
survey_data$TEPS_13 = 7 - survey_data$TEPS_13
# compute total score
for (p in 1:nrow(survey_data)){
survey_data$TEPS_CON[p] <- sum(survey_data[p,paste0("TEPS_",c(2,3,5,7,9,12,14,17))])
survey_data$TEPS_ANT[p] <- sum(survey_data[p,paste0("TEPS_",c(1,4,6,8,10,11,13,15,16,18))])
}
model_free_measure <- left_join(model_free_measure,survey_data%>%
select(participant_id,TEPS_CON,TEPS_ANT), by = "participant_id")

Use of the model-free strategy is negatively correlated with use of the model-based strategy. More use of the model-based strategy yields more points in this task.
lm1 <- lm(TEPS_ANT~modelBased,model_free_measure)
kable(summary(lm1)$coefficients)

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 41.886700 | 1.097026 | 38.1820390 | 0.0000000 |
| modelBased | -1.196124 | 9.885722 | -0.1209951 | 0.9043324 |
lm2 <- lm(TEPS_ANT~modelFree,model_free_measure)
kable(summary(lm2)$coefficients)

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 40.76955 | 1.185015 | 34.404249 | 0.0000000 |
| modelFree | 16.90960 | 8.519334 | 1.984851 | 0.0544149 |
More use of the model-free strategy was marginally associated with higher TEPS-ANT (p = .054), i.e., greater self-reported anticipation of future pleasure.
Computational modelling was conducted according to Daw et al.’s (2011) method and supplementary information.
In the model-free reinforcement learning model, values of all first- and second-stage choices were updated by a temporal-difference prediction error signal. In the model-based reinforcement learning model, first-stage values were computed according to the Bellman equation, which takes into account the transition matrix specified above, while second-stage values were updated in the same way as in the model-free RL.
In the present computational model, model-based and model-free RL both contribute to the value update, and the proportion of the contribution from model-based RL was represented as a parameter \(\omega\) (0 < \(\omega\) < 1). The closer \(\omega\) is to 1, the larger the relative contribution of model-based RL.
All parameters were estimated using the maximum likelihood method.
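For concreteness, here is a minimal sketch of the hybrid value computation and a simplified single-trial update (one shared learning rate, and no eligibility trace, perseveration bonus, or softmax choice rule, unlike the full Daw et al. model; all function and variable names are illustrative).

# Simplified hybrid value computation for the first-stage options.
# q1_mf: length-2 vector of model-free first-stage values;
# q2: 2 x 2 matrix of second-stage values (rows = states, columns = options);
# trans: known transition matrix (rows = first-stage choices, columns = states).
hybrid_trial_values <- function(q1_mf, q2, omega,
                                trans = matrix(c(.7, .3, .3, .7), nrow = 2)) {
  # Model-based values: expected best second-stage value under the transition matrix.
  q1_mb <- as.numeric(trans %*% apply(q2, 1, max))
  # Net values mix the two systems; omega is the model-based weight.
  omega * q1_mb + (1 - omega) * q1_mf
}

# Simplified learning step after observing choice a1, state s2, choice a2, reward r.
update_values <- function(q1_mf, q2, a1, s2, a2, r, alpha) {
  delta2 <- r - q2[s2, a2]                  # second-stage prediction error
  q2[s2, a2] <- q2[s2, a2] + alpha * delta2
  delta1 <- q2[s2, a2] - q1_mf[a1]          # first-stage TD error (model-free)
  q1_mf[a1] <- q1_mf[a1] + alpha * delta1
  list(q1_mf = q1_mf, q2 = q2)
}
# Example: with omega = 1 the agent relies entirely on the model-based values.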
(Left) The model-free algorithm only updates one first-stage option at a time.
(Right) The model-based algorithm based on Bellman’s equation updates two options simultaneously, because of its understanding of the task structure.
Figure 5. Computational modelling (focus on the shaded area, trials 63–81).
\(\omega\) is a parameter in the reinforcement learning model controlling the relative contribution of the model-based process to the value computation. A larger \(\omega\) indicates a larger contribution of the model-based process.
df_params_estimate <- read_csv("/Users/rh/Desktop/first_year/PSYCH251_MF/daw2011/writeup/data/RL_learn_params_out.csv")

## Rows: 40 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): expected_alpha1, expected_alpha2, expected_lumbda, expected_omega, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# ggplot(df_params_estimate%>%
# mutate(id = row_number())%>%
# pivot_longer(expected_alpha1:expected_omega,names_to = "param", names_prefix = "expected_", values_to = "val"),
# aes(x = param, y = val, color = param))+
# geom_violin()+
# geom_jitter(size=2)+
# stat_summary()
ggplot(df_params_estimate,aes(x=1,y=expected_omega))+
geom_violin()+
geom_jitter(size=2, shape=1)+
stat_summary()+
geom_hline(yintercept = 0.5, color = "red")+
labs(title = "weighting of model-based RL",
y= "omega" ,x="")+
ylim(0,1)

## No summary function supplied, defaulting to `mean_se()`
quantile(df_params_estimate$expected_omega, probs = c(0.25,0.5,0.75))

##       25%       50%       75%
## 0.2209307 0.2775736 0.4461924
Conclusion from computational modelling: although the interaction between reward and transition type was not replicated in the bar plots or the logistic regression, a reinforcement learning model with a hybrid value update indicates that the contribution of model-based RL is smaller than 50% but non-zero. The estimated \(\omega\) from the current study (median 0.28, 25%/75% quantiles 0.22/0.45) was smaller than that of the original paper (median 0.39, 25%/75% quantiles 0.29/0.59).
In this study, I sought to replicate the behavioural outcome of Daw et al. (2011). Specifically, I hypothesized a significant interaction between reward and transition type in the logistic regression model (H1).
There was no significant interaction between reward and transition type; H1 was rejected.
In addition, we hypothesized that there is a positive main effect of reward (H2), but no significant main effect of transition type (H3) in the logistic regression model.
There was a significant positive main effect of reward, but no significant main effect of transition type. H2 and H3 were accepted.
We hypothesized that self-reported capacity for anticipatory pleasure is negatively correlated with the use of the model-based strategy (H4a) and positively correlated with the use of the model-free strategy (H4b).
Self-reported anticipatory pleasure (TEPS-ANT) was not correlated with use of the model-based strategy; H4a was rejected. TEPS-ANT was positively, though only marginally (p = .054), correlated with use of the model-free strategy; H4b was tentatively accepted.
Finally, we fit a reinforcement learning model and extracted a ‘weight’ parameter (\(\omega\)) for the model-based strategy. This parameter (median = 0.28) lies between 0 and 0.5, suggesting a smaller contribution of the model-based than the model-free strategy.
Online studies may involve lower attention, lower motivation, and lower cognitive effort than in-person studies, all of which may shift participants’ strategies towards the model-free end.
A model-based strategy may require expertise and hence more practice. The original study used 200 trials (plus 50 practice trials), while this replication used only 99 trials (plus 10 practice trials). This may have compromised the model-based strategy.
The low motivation in online studies was exacerbated by the fact that the present study rewarded only $0.01 for each coin earned in the task, making the payoff difference between strategies negligible. Under low incentives, participants tend to favor the less cognitively expensive strategy. This phenomenon was also reported in a previous rodent study (Song et al., 2022, PLOS Computational Biology).