Rescue Project for Negative Valence Widens Generalization of Learning by Schechtman, E., Laufer, O., & Paz, R (2010, Journal of Neuroscience)

Author

Nastasia Klevak (nklevak@stanford.edu)

Published

December 13, 2023

Introduction

choice justification: I chose this project to rescue and replicate because it is relevant to my research interests and will allow me to gain skills that I value gaining practice in. My main research interests are in decision making, cognitive control, and how understanding core functions of human cognition can help us better understand potential cognitive differences related to different mental health conditions. Through working on this paper and replicating this experiment, I can gain a better understanding of how generalizing works in the brain and how people make sub-concious decisions about reacting to stimuli and how much to generalize. Additionally, in the past I’ve worked with JSPsych, on neuroimaging, and on computational experiments, but I have never fully designed or run a behavioral experiment on prolific completely on my own. Through doing this project, I hope to expand upon my experimental design and implementation skills.

stimuli, procedures, and challenges: For this experiment, I will need to use different frequency noises as stimuli in order to test how valence affects generalization. This will be an online behavioral experiment done on Prolific, and the study includes two stages (acquisition and generalization) that will be interspersed for participants. In the paper, the study also had a second follow up experiment where participants were recalled to do a loss aversion task, in order to understand the loss aversion rate and how it impacts generalization, so I may have to recall the participants to do a second online study as well. I will code/adapt this study in JSPsych and upload it to prolific–this may be a challenge as I’ve never worked with auditory stimuli before. I will then analyze the data in a similar way to the paper, and will replicate Figures 1 and 2. Finally, I will add attention and manipulation checks to the code (if it isn’t already there), and try to identify if there are any errors in the original/replication code that need correcting.

repository link: https://github.com/psych251/schechtman2010_rescue

original paper link: https://github.com/psych251/schechtman2010_rescue/blob/main/original_paper/10460.pdf

link to paradigm: https://lprqrm1jep.cognition.run

link to my pre-processing code pipeline: https://github.com/psych251/schechtman2010_rescue/blob/main/analysis/pilotB_analysis.ipynb

link to my cleaned participant data for my pilots and experiment: https://github.com/psych251/schechtman2010_rescue/tree/main/data

link to OSF preregistration: https://osf.io/e5v3p

Summary of prior replication attempt

For the initial replication attempt by Tyler Bonnen in 2018, the interaction between generalization and negative tones was not replicated; instead, there seemed to be significantly greater generalization for positive valenced tones. In other words, the statistical test was replicated but in the opposite direction. However, in the initial replication there was also a significant difference in accuracy between positive and negative valence tones–implying participants may not have been properly learning the task. Additionally, for the replication, the learning rate during the acquisition stage was longer than in the original study. I think this may be due to an added confusion in the replication where participants weren’t told which keys corresponded to the positive or negative tones, which made learning take longer.

Methods

Power Analysis

A power analysis was not possible (without simulations) due to a lack of information in the original paper. In the original paper the number of participants for experiment 1 (which examined generalization for positive/negative tones) was 22, while in the first replication the number of participants was 20. Since both of these got significant but opposing results, I will also do an n of 22 (just like the original paper).

Planned Sample

The planned sample size is 22 participants on prolific. They will be sampled on prolific continuously until 22 participants take the study. The sampling demographics will be the demographics of Prolific, but it is unclear what exactly they will be. The only preselection rule is that participants answer “no” when asked if they have hearing difficulties, as this is an auditory sound recognition study.

Materials

All materials - can quote directly from original article - just put the text in quotations and note that this was followed precisely. Or, quote directly and just point out exceptions to what was described in the original article.

For this rescue and task, I will need to develop code on prolific for an online task version. This online task will be a behavioral task that uses sounds with various frequencies. The frequencies of the positive/negative tones (counterbalanced among participants) are 300 and 700 Hz, the neutral frequency tone is 500 Hz, and all other frequencies for the generalization stage are (-100, -60, -20, -5, 5, 20, 60, 100) Hz around the positive and negative tones. The random tones for the generalization phase are 480, 500, 520, 880, 900, 920 Hz.

Procedure

From the paper regarding the acquisition stage: “The experiment consisted of two iterated parts, an acquisition stage—designed to assign valence of gain versus loss to two different tones; and a generalization stage—designed to test differences in the generalization to the two tones. For the acquisition stage, subjects were instructed that in each trial, they would hear one of three tones (300, 500, or 700 Hz, for 200 ms); one tone would gain them money—if they pressed one of two keys after it (termed “positive” tone); one tone would lose them money—unless they pressed the other key after it (termed “negative” tone, because it is a negative reinforcer), and one tone had no outcome independent of what was pressed or not (“neutral”). The key assignment was counterbalanced across subjects. Subjects had to learn by trial-and-error which tone is the positive and which is the negative, and had 2.5 s following the tone presentation to press a key, and then received visual feedback telling them if they earned or lost in the trial. In addition, there were pavlovian trials (classical conditioning), in which the positive tone resulted in gain independent of what was pressed, and the negative tone resulted in loss independent of what was pressed. Subjects were informed by the appearance of the word “helpless” on the screen. The acquisition stage comprised of 24% instrumental negative, 24% instrumental positive, 10% pavlovian negative, 10% pavlovian positive, and 32% neutral. Negative and positive tones were 300 and 700 Hz, counterbalanced across subjects, and the neutral tone was 500 Hz. There were 100 trials overall in the acquisition stages. Each loss and gain was of 0.5 Israeli shekels (∼15 US cents).” This was followed precisely except for the following changes: 1) participants were told to press the space bar when they heard the neutral tone (although they did not receive feedback or changes in their bonus regardless of if they were correct or not), 2) the acquisition stage comprised of: 27% instrumental negative, 27% instrumental positive, 9% pavlovian negative, 9% pavlovian positive, and 27% neutral, and 3) the bonus was $0.40 USD, not $0.15.

From the paper regarding the generalization stage: “Participants were informed when the acquisition stage was over and that a new different stage begins. They were not informed in any way that this is a generalization stage or that this is the purpose of the study, and moreover, we questioned the subjects after the experiments and none had an idea that generalization was the task. In this stage, subjects were instructed that there would be many more tones. In each trial, if they hear the positive/negative tone, they should press the same key that was previously associated with it. However, if they hear a tone that is not the exact same tone as either the positive or the negative tone, they should press the “middle” key. They were informed that they would gain money for a correct press (either pressing the appropriate key for a negative or positive tone, or the middle key for a different tone), or lose the same amount for any mistake (pressing the positive/negative keys for a different tone, or the middle key for a positive/negative tone). In this stage, we forced the subject to choose a key and press by imposing a big loss (10 times the amount of regular loss) for no response. There was no feedback reported to the subjects in this generalization stage, to avoid any changes in valence of the tones that was acquired during the acquisition stage.” This was followed precisely except that: 1) the loss incurred for no press was the typical bonus amount ($0.40) and 2) if participants didn’t respond on time, feedback appeared in bright red telling them to go faster next time and showing them that they lost some money.

From the paper: “Notice that although we keep calling the original tones positive/negative tones, these two tones have completely identical requirements and outcomes in this stage and therefore should entail identical decision-making policies. We keep these names for ease and clarity of terminology because they were previously associated with positive/negative outcomes. The generalization stage comprised of 14% of the positive tone, 14% of the negative tone, and 72% of other tones (i.e., that required a press on the middle key). Of these other tones, 14% were “control” tones that are far away from the original tones (80, 100, 100, 120, 480, 500, 500, 520, 880, 900, 900, 920 Hz), and 58% were tones that were closer to the original (positive and negative) tones, [−100, −60, −20, −5, +5, +20, +60, +100] from 300 and 700 Hz. There were 84 trials overall in the generalization stage.” This was followed precisely except 1) control tones around 100 Hz were taken out due to online hearability concerns, and 2) the generalization stage comprised of 12.5% of the positive tone, 12.5% of the negative tone, 25% of the surround tones (tones closer to the original), and the rest were control tones.

From the paper: “The acquisition and generalization stages appeared three times consecutively in a loop, to keep the original tones reinforced and avoid extinction effects.” In our rescue, this was 4 times in a loop instead.

The differences in my rescue procedure from the initial paper procedure are that, in the original paper, they “re-called the same participants and performed the same experiment, with an identical acquisition stage, but this time the generalization stage had tones that are [−100, −60, −20, −5, +5, +20, +60, +100 Hz] around the 500 Hz neutral tone, without the tones surrounding the positive/negative tones” in order to “find the “neutral” generalization curve.” I will not do this part, as I am not sure if recalling participants is as feasible on prolific.

Another thing I likely will not do (the original replication also didn’t do this) but was done in the paper was, “We repeated the experiment with a new set of participants (n = 14), but this time compensating for possible loss-aversion behavior. We tested our subjects for loss-aversion before the experiment (Tom et al., 2007). Each subject was presented with a series of choices of “gaining X money in 0.5 probability and losing Y money in 0.5 probability,” and had to accept/reject the offer. Subjects were told that soon after the test is over the computer would randomly pick a single offer they had accepted and that, following a “virtual coin toss,” they would actually gain or lose the amount of money noted in that particular offer. X and Y were varied over a range of 5–20 to create a matrix of 256 binary choices. We then fitted logistic regressions with accept/reject choices as dependent variables and the size of potential gain and loss as independent variables. The loss aversion was computed as λ= −βloss/βgain, which is similar to loss-aversion in prospect theory with the simplifying common assumptions of linear value function and identical weights for 0.5 probabilities. The median in our subjects (n = 11) was found to be 1.5 and the mean was 1.7, similar to other reports in the literature (Tom et al., 2007; Sokol-Hessner et al., 2009). In the experiment, we chose to use a robust loss-aversion factor of 2 (λ = 2, a gain of twice the loss, i.e., 1 shekel of gain and 0.5 shekel of loss).” Instead I will focus on the main experiment.

Controls

Since the replication found that it was more difficult for participants to learn during acquisition in the online version, I added feedback after the neutral tone that says “neutral!,” as well as “correct” and “incorrect” feedback (instead of just the bonus amount, like what the replication did).

For attention checks and to make sure participants could truly hear the audio, I added an unmute reminder at the beginning as well as several questions asking them if they can hear specific sounds. If people selected no on more than 1 of those questions, I will exclude them.

Analysis Plan

For data exclusion: I auto-excluded participants who, on prolific, did not answer no when asked if they have hearing difficulties. I will exclude participants who answered “I cannot hear this well” on more than 1 of my 3 sound checking questions. I will also exclude participants who do not press keys on more than 50% of the questions expecting a response (i.e. the positive/negative tones during acquisition, and all of the generalization questions).

To quantify the generalization rates for individual subject, I will:

  1. In the generalization stage, measure the drop in response rate per 1 Hz of tone (which is the slope of the generalization curve) for the surround tones of both the positive and negative tones. I will then compare these positive/negative slopes using a paired t-test (within-subject). I will then look at all comparisons that are reported as significant and check if they are also significant at the same level using nonparametric Wilcoxon signed rank test. Next, I will likely use a “three-way ANOVA to check the effects and interactions of reinforcement type (negative/positive/neutral), the frequency distance (in Hz) from the original tone [0, 5, 20, 60, 100] and the side/direction of the frequency change [higher/lower than the original tone]”.

  2. In the generalization stage, I will “fit a multinomial logistic regression for each subject using all single trial (0/1—correct/incorrect) responses…In this model, the coefficient β1 describes the contribution of the reinforcement type to the response. The higher β1 is, the narrower the generalization curve is.” I will then compare the β1 coefficients of positive and negative data using t tests for each subject, and will incorporate the p-values in “Fisher’s combined p-value calculation to obtain an overall p-value for the population.”

To understand the acquisition stage, I will qualitatively look at the rate of acquisition on average to better understand how well participants understood the task and were able to learn the tone/valence correspondence.

Clarify key analysis of interest here
The key analysis of interest is the paired t-test between the postively valenced and negatively valenced generalization curves. In the original paper they got a p < 0.001 for their paired t-test (they did not provide a t value), showing that there was significantly more generalization for the negative tone than the positive one. In the 1st replication, they got a t = 4.363 in their linear model comparison, but the positive generalization was larger than the negative one (the opposite of the first study).

Differences from Original Study and 1st replication

Just like the 1st replication, I will exclude two of the original’s follow-up tasks, as described in the “procedure” section above. For the first experiment in the original study, which is what the original replication focused on, there are a few differences. In the original experiment there were 22 participants while the replication had 20 and mine will also have 22. Additionally the original was in person while the replication (and mine) are online (mine will be on Prolific while the 1st replication was on MTurk). The replication and this rescue also made the same sound-type distribution and sound frequency changes that were described above in the methods section (both got rid of the sounds around 100 Hz, slightly varied the percentages of valenced/random/surround sounds in generalization, and the percentages of instrumental/pavlovian trials in the acquisition stage). I do not expect these changes to make much of a difference in this replication; however, I do think that since mine is online (just like the 1st replication), participants may be similarly slower to learn the tone associations during acquisitions than when compared to the original study. This may be due to less necessary attention or simply sound changes when delivered via an online task. Finally, the payment structure is slightly different in both the 1st replication and in this rescue than in the original; we are paying more in general, and also, unlike the original, the loss incurred by not answering during the generalization stage is the same as the general bonus (as opposed to making the loss incurred 10x the typical bonus).

Another difference between this rescue and the 1st replication is the addition of questions asking about sound quality (as exclusion criteria). This will hopefully ensure participants perform better on the acquisition portion of the task. Additionally, this rescue added slightly more explicit feedback in the acquisition stage than the 1st replication did; in the 1st replication when participants were correct or incorrect in the positive or negative valenced tones, it would simply say + or - and then either 0.04 or 0.00. To help participants learn the tone associations more clearly, this rescue adds “correct!” or “incorrect!” before the monetary feedback. Additionally, while in the 1st replication they signified it was a neutral tone by providing no feedback, in this rescue it says “neutral! so no change in bonus” to help acquisition. Finally, while in the 1st replication they did not tell participants which key will correspond to the positive or negative tone (i.e. “p will correspond to the positive tone, which is the tone that will gain you money”), this rescue did. This is in order to ensure that the learning during acquisition is solely about learning which sound corresponds to positive or negative bonuses, as opposed to allowing participants to get distracted with learning the key associations as well. Overall, all of these changes from the 1st replication will likely allow participants to learn the associations more quickly–similar to the original study.

Methods Addendum (Post Data Collection)

Actual Sample

The actual sample ended up being 21 participants. It was 22 initially, as planned, but 1 participants had to be excluded due to prior exclusion criteria (answering that they couldn’t hear the sound on more than 1 of the 3 control sound questions).

Differences from pre-data collection methods plan

None.

Results

Data preparation

Data preparation following the analysis plan:

#### Load Relevant Libraries and Functions
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(ggpubr)
library(rstatix)

Attaching package: 'rstatix'

The following object is masked from 'package:stats':

    filter
library(nnet)
library(png) 

# preprocessing was done in python, in analysis/pilotB_analysis.ipynb
full_data = read.csv('../data/final_sample/subject_data_final_full_processed.csv')

Now lets look at the control checks and filter out everyone who answered “no” to any of the sound test questions:

### Data exclusion / filtering
### filter out people who didn't answer that they could hear sounds:

# get all control questions
control_data = full_data %>%
  filter(trial_type == 'audio-button-response')

# check which participants did not answer "I can hear this" to 2 or more questions (button_pressed = 0)
exclude = control_data %>%
  filter(button_pressed != 0) %>%
  mutate(run_id = as.integer(run_id))

run_id_counts <- exclude %>%
  count(run_id)

run_id_counts_fin <- run_id_counts %>%
  filter(n != 1)

exclusion_participants = unique(run_id_counts_fin$run_id)
exclusion_participants
[1] 26
# outputs the IDs of participants to exclude
full_data$exclude = full_data$run_id %in% exclusion_participants

# excludes them from the full_data 
data = full_data %>%
  filter(exclude == FALSE)

We excluded this participant (26) for not properly answering the attention checks about hearing sounds.

Now lets get the generalization data:

### Data exclusion / filtering
## get the generalization dataset
generalization_data = data %>% 
  filter(stage=='generalization' & valence != 'control')

generalization_data$match = as.character(generalization_data$valence) == as.character(generalization_data$key_responses)

generalization_data$direction = ifelse(generalization_data$distance==0,0,ifelse(generalization_data$distance < 0,-1,1))

hypothesis_space = generalization_data %>% 
  filter(stage=='generalization' & valence!='control')  %>%
  mutate(distance = abs(distance)) %>% 
  select(valence, distance, match, run_id, correct, key_responses, rt, positive_key,direction)

Now we will plot the generalization curves for the positive and negative tones across subjects:

## do the generalization plots
hypothesis_space %>% 
  group_by( valence, distance) %>% 
  summarise(p_association = mean(match, na.rm = TRUE ), 
            sem = sd(match, na.rm = TRUE)/sqrt(length(match)),
            y_lower = p_association-sem, 
            y_upper = p_association+sem) %>%
  ggplot(aes(x=distance, y=p_association, color=valence)) + 
    geom_line(aes(color=valence), size=1.5) + 
    geom_errorbar(aes(ymin=y_lower, ymax=y_upper), size=1) + 
    ggtitle('A "Rescue" of of Schechtman et al. 2010?')
`summarise()` has grouped output by 'valence'. You can override using the
`.groups` argument.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

And now for the main statistical tests of interest:

# generalization data analysis:
# t-test within linear model analysis:
summary(lm('match ~ valence * distance', hypothesis_space))

Call:
lm(formula = "match ~ valence * distance", data = hypothesis_space)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8697 -0.4944  0.1491  0.2735  0.5615 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               0.7416204  0.0206436  35.925  < 2e-16 ***
valencepositive           0.1280731  0.0291944   4.387 1.23e-05 ***
distance                 -0.0030315  0.0004270  -7.100 1.92e-12 ***
valencepositive:distance -0.0007217  0.0006038  -1.195    0.232    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4372 on 1508 degrees of freedom
Multiple R-squared:  0.09037,   Adjusted R-squared:  0.08856 
F-statistic: 49.94 on 3 and 1508 DF,  p-value: < 2.2e-16
# paired t-test analysis
# Split the data by distance and perform paired t-tests for each distance
results_list <- by(hypothesis_space, hypothesis_space$distance, function(subdata) {
  t_test_result <- t.test(subdata$match ~ subdata$valence, paired = TRUE)
  return(t_test_result)
})
results_df <- do.call(rbind, lapply(results_list, tidy))
table_df <- results_df[, c("estimate", "p.value")]
colnames(table_df) <- c("Mean_Difference (negative valence - positive valence)", "P_Value")
table_df$Distance <- unique(hypothesis_space$distance)
table_df <- table_df[order(table_df$Distance), ]

print(table_df)
# A tibble: 5 × 3
  `Mean_Difference (negative valence - positive valence)`   P_Value Distance
                                                    <dbl>     <dbl>    <dbl>
1                                                 -0.0873 0.0478           0
2                                                 -0.159  0.00285          5
3                                                 -0.0476 0.357           20
4                                                 -0.0794 0.0585          60
5                                                 -0.131  0.0000245      100
# two way ANOVA
# to compare the positive and negative valenced slopes:
anova_model <- aov(match ~ valence * distance, data = hypothesis_space)
summary_anova <- summary(anova_model)
print(summary_anova)
                   Df Sum Sq Mean Sq F value   Pr(>F)    
valence             1   4.23   4.233  22.145 2.76e-06 ***
distance            1  24.13  24.131 126.248  < 2e-16 ***
valence:distance    1   0.27   0.273   1.429    0.232    
Residuals        1508 288.24   0.191                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# three way ANOVA
# to compare the positive and negative valenced slopes, vs distance, vs direction of distance:
anova_model_three <- aov(match ~ valence * distance * direction, data = hypothesis_space)
summary_anova_three <- summary(anova_model_three)
print(summary_anova_three)
                             Df Sum Sq Mean Sq F value   Pr(>F)    
valence                       1   4.23   4.233  22.122 2.79e-06 ***
distance                      1  24.13  24.131 126.119  < 2e-16 ***
direction                     1   0.08   0.080   0.420    0.517    
valence:distance              1   0.27   0.273   1.427    0.232    
valence:direction             1   0.02   0.025   0.130    0.719    
distance:direction            1   0.32   0.320   1.671    0.196    
valence:distance:direction    1   0.04   0.045   0.235    0.628    
Residuals                  1504 287.77   0.191                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# "fit a multinomial logistic regression for each subject using all single trial (0/1—correct/incorrect) responses...In this model, the coefficient β1 describes the contribution of the reinforcement type to the response. The higher β1 is, the narrower the generalization curve is." I will then compare the β1 coefficients of positive and negative data using t tests for each subject, and will incorporate the p-values in "Fisher's combined p-value calculation to obtain an overall p-value for the population."

# calculate the t-test values for each subject:
subject_ids <- unique(generalization_data$run_id)

models <- list()
t_test_p_values <- numeric(length(subject_ids))

for (subject_id in subject_ids) {
  subject_data <- subset(generalization_data, run_id == subject_id)
  subject_data$valence <- factor(subject_data$valence, levels = c("negative", "positive"))

  model <- glm(match ~ valence, data = subject_data, family = binomial)
  models[[as.character(subject_id)]] <- model
  
  # Perform t-test for each subject
  t_test_result <- t.test(match ~ valence, data = subject_data)
  t_test_p_values[subject_id] <- t_test_result$p.value
}

# Perform Fisher's method on the p-values
# remove NA values and 0 values
t_test_p_values <- t_test_p_values[!is.na(t_test_p_values)]
print(t_test_p_values)
 [1] 0.000000000 0.159103698 0.000000000 0.000000000 0.244428850 0.013836197
 [7] 0.211851871 0.577099079 0.695780269 0.012127470 0.138332456 0.170109480
[13] 0.032871026 0.000000000 0.291940477 1.000000000 0.460439954 0.244428850
[19] 0.000000000 0.726148453 0.000000000 0.057006105 0.810934340 0.695780269
[25] 0.235476061 0.001394248 0.022611144

Because there are several p-values of 0 in the results, it is not possible to truly calculate Fisher’s combined t-test as I initially planned. However, to get a sense of our results, I will calculate this value by removing the 0 values.

# get rid of zero values
final_ps <- t_test_p_values[t_test_p_values>0]

chi_squared_stat <- -2 * sum(log(final_ps))

# Calculate the overall Fishers combined p-value
df <- 2 * length(final_ps)
overall_p_value <- pchisq(chi_squared_stat, df = df, lower.tail = FALSE)

cat("Fisher's Combined p-value:", overall_p_value, "\n")
Fisher's Combined p-value: 0.0002757348 

Now we can look at how fast participants learned in the first acquisition block (excluding pavlovian trials):

# data pre-processing and plotting of acquisition
acquisition_data = data %>%
  filter(stage=='acquisition' & valence!='neutral' & !is.na(data$rt) & !is.na(data$i_acquisition_trial))

acquisition_data$match = as.character(acquisition_data$valence) == as.character(acquisition_data$key_responses)

acquisition_data %>% 
  filter(i_block==1 )%>% 
  group_by(i_acquisition_trial, valence) %>% 
  summarise(mean_one = mean(match), 
            sem_one = sd(match)/sqrt(length(match))) %>% 
  mutate(combined_trial = ifelse(i_acquisition_trial%%2, i_acquisition_trial-1, i_acquisition_trial)) %>% 
  group_by(combined_trial, valence) %>% 
  summarise(mean_two=mean(mean_one), 
            sem_two=mean(sem_one)) %>% 
  ggplot(aes(x=combined_trial, y=mean_two, color=valence)) + 
    geom_line(size=1.2) + 
    geom_errorbar(aes(ymin=mean_two-sem_two, ymax=mean_two+sem_two), size=1) + 
    ggtitle('Learning curves across the first acquisition stage') 
`summarise()` has grouped output by 'i_acquisition_trial'. You can override
using the `.groups` argument.
`summarise()` has grouped output by 'combined_trial'. You can override using
the `.groups` argument.

Let’s look at the average rate of correct answers per acquisition stage, to see how people learned:

acquisition_data %>% 
  filter(i_block==1)  %>% 
  group_by(run_id) %>% 
  summarise(accuracy=mean(match)) %>% 
  summarise(mean=mean(accuracy))
# A tibble: 1 × 1
   mean
  <dbl>
1 0.906
acquisition_data %>% 
  filter(i_block==2)  %>% 
  group_by(run_id) %>% 
  summarise(accuracy=mean(match)) %>% 
  summarise(mean=mean(accuracy))
# A tibble: 1 × 1
   mean
  <dbl>
1 0.967
acquisition_data %>% 
  filter(i_block==3)  %>% 
  group_by(run_id) %>% 
  summarise(accuracy=mean(match)) %>% 
  summarise(mean=mean(accuracy))
# A tibble: 1 × 1
   mean
  <dbl>
1 0.983
# based on the output below:
avg_corr <- c('1':0.9063492,'2':0.9674603,'3':0.9834055)

We can see that the average learning rate across acquisition stages is going up across participants, from 0.9063492 in stage 1, to 0.9674603 in stage 2, to 0.9834055 in stage 3.

Results of control measures

We had three control questions which asked if participants could hear a sound. In the preregistration, we decided to exclude anyone who answers they can’t hear it on more than one question. After checking this for each participant, I found that one participant said this once, and another said it twice. The rest answered that they can hear for every question. Thus, we excluded one participant (participant 26) due to our control measures.

Confirmatory analysis

The key analysis was a t-test to compare the positively valenced and negatively valenced generalization curves. This was found using a linear model (just as Tyler Bonnen did it in his first replication). Similar to the first replication, I found that valence and distance both had significant p values (p ~ 0.000) (t = 4.387 and -7.100 respectively). However, unlike the first replication, the interaction between distance and valence was not significant. In the original paper, they found that “loss induced a much wider generalization curve than gain” with a p < 0.001. Thus, they also found that valence played a significant role. However, while the key analysis result is similar in value (all three are significant), this rescue and the first replication both found the opposite directional finding than the original paper; that positive valence had a wider generalization curve than negative valence. The graph comparisons of all three experiments are below:

# read in the images
img1<-readPNG('Final-Rescue-Report_file_imgs/schechtman_absolute.png')
img2<-readPNG('Final-Rescue-Report_file_imgs/replication_absolute.png')
img3 <- readPNG('Final-Rescue-Report_file_imgs/rescue_absolute.png')

# Set up a 3x1 layout
par(mfrow = c(1,3),mar = c(1, 1, 1, 1))
par(pin = c(2, 2))

# Plot the first image in the top panel with aspect ratio adjustment
plot(1, type = "n", main = "Original", xlim = c(0, ncol(img1)), ylim = c(0, nrow(img1)),asp = ncol(img1) / nrow(img1))
rasterImage(img1, 0, 0, ncol(img1), nrow(img1))

# Plot the second image in the middle panel with aspect ratio adjustment
plot(1, type = "n", main = "1st Replication", xlim = c(0, ncol(img2)), ylim = c(0, nrow(img2)),asp =ncol(img2) / (nrow(img2)))
rasterImage(img2, 0, 0, ncol(img2), nrow(img2))

# Plot the third image in the bottom panel with aspect ratio adjustment
plot(1, type = "n", main = "Rescue", xlim = c(0, ncol(img3)), ylim = c(0, nrow(img3)),asp = ncol(img3) / nrow(img3))
rasterImage(img3, 0, 0, ncol(img3), nrow(img3))

# Reset the layout to a single-panel layout
par(mfrow = c(1, 1))

Next, as stated in the analysis plan, I conducted a three-way ANOVA to check the how valence, distance, and change direction (whether the tone is above or below the original tone) interact. I found that valence and distance were significant predictors, but direction was not (nor were any of the interactions between these variables). This was what was found in the original paper as well (in the original paper direction was p=0.44 while in this rescue it was p= 0.517).

Next, I “fit a multinomial logistic regression for each subject using all single trial (0/1—correct/incorrect) responses…In this model, the coefficient β1 describes the contribution of the reinforcement type to the response. The higher β1 is, the narrower the generalization curve is” for the generalization stage. I then attempted to compare the β1 coefficients of positive and negative data using t tests for each subject, and to incorporate the p-values in “Fisher’s combined p-value calculation to obtain an overall p-value for the population.” For this analysis, I fit a multinomial log-reg for each subject and did a t-test for subject. However, once I did that, I realized that some of the t-test p-values were around 0 which made calculating the Fischer’s combined p-value impossible if the zeros were included. Consequently, I removed the zeros and calculated it with the remaining values, to get a value of p = 0.0002757348 (the original paper got p < 0.01). While this corresponds to the statistical finding of the paper, it is in the opposite direction (positive generalizing more), and, furthermore, it is potentially not correct since I had to leave out some participant’s p-values.

Finally, I qualitatively looked at the rate of acquisition on average to better understand how well participants understood the task and were able to learn the tone/valence correspondence. First, I plotted the graph of the first acquisition stage (removing the pavlovian trials) to qualitatively examine the learning process. Below is a comparison between learning during the first stage of this rescue, the first replication, and the original paper. In all three, participants are learning at ceiling by the end of the first acquisition stage; additionally, qualitatively, participants appear to be learning at a higher accuracy earlier on in the rescue than in the replication.

# read in the images
img1<-readPNG('Final-Rescue-Report_file_imgs/schechtman_ac.png')
img2<-readPNG('Final-Rescue-Report_file_imgs/replication_ac.png')
img3 <- readPNG('Final-Rescue-Report_file_imgs/rescue_ac.png')

# Set up a 3x1 layout
par(mfrow = c(1,3),mar = c(1, 1, 1, 1))
par(pin = c(2, 2))

# Plot the first image in the top panel with aspect ratio adjustment
plot(1, type = "n", main = "Original", xlim = c(0, ncol(img1)), ylim = c(0, nrow(img1)),asp = ncol(img1) / nrow(img1))
rasterImage(img1, 0, 0, ncol(img1), nrow(img1))

# Plot the second image in the middle panel with aspect ratio adjustment
plot(1, type = "n", main = "1st Replication", xlim = c(0, ncol(img2)), ylim = c(0, nrow(img2)),asp =ncol(img2) / (nrow(img2)))
rasterImage(img2, 0, 0, ncol(img2), nrow(img2))

# Plot the third image in the bottom panel with aspect ratio adjustment
plot(1, type = "n", main = "Rescue", xlim = c(0, ncol(img3)), ylim = c(0, nrow(img3)),asp = ncol(img3) / nrow(img3))
rasterImage(img3, 0, 0, ncol(img3), nrow(img3))

# Reset the layout to a single-panel layout
par(mfrow = c(1, 1))

Finally, to examine how well participants learned across the 3 acquisition stages, I calculated the average correct rate per stage, and found that in stages 1-3, the average correct rates were: ‘1’:0.9063492,‘2’:0.9674603,‘3’:0.9834055.

Exploratory analyses

There were no follow-up analyses conducted.

Discussion

Mini meta analysis

Combining across the original paper, 1st replication, and 2nd replication, what is the aggregate effect size?

Due to a lack of information in the original paper, I was not able to calculate the effect size of it. However, I post-hoc calculated the effect size of the first replication to be 0.1449508, and calculated the effect size of this rescue to be 0.09934809.

Summary of Replication Attempt

Overall, this rescue’s primary result was that the generalization curve of positively valenced stimuli is significantly wider than that of the negatively valenced stimuli, when using a paired t-test within a linear model. This is the same main result as the first replication, and the opposite finding of the original paper (which found that negatively valenced stimuli curves were wider significantly than negatively valenced stimuli curves). However, despite the opposite directional finding, this rescue also found that valence and distance of the sound stimuli were significant (ANOVA), while sound change direction was not (ANOVA). This was found in the original paper as well. Thus, this rescue partially replicated the original result, and fully replicated the first replication, as it found similar statistical results but in the opposite direction from the original paper. Within the scale of [0,0.25,0.50,0.75,1], I would say this replicated at a 0.50 level.

Commentary

This was a ‘close’ replication of the original paper, as it was held online instead of in person and had slight changes in stimuli. Since the directional pattern of the results did not replicate, but the significance of predictive factors did (i.e. valence and distance) it is interesting to consider why that is the case. Additionally, since both the first replication and this rescue got the same pattern, it seems to suggest that either the original paper’s sample size led to an incorrect finding, or that transferring mediums from in-person to online makes a large difference.