Introduction
Justification for replication
Valence is among the principal constructs used to explain animal behavior: we tend towards things we like and away from things we don’t. Remarkably, not much is known about how valence affects processes upstream of decision-making, such as memory and perception. Schechtman et al. 2010 developed a paradigm in which auditory cues are paired with negatively and positively valenced outcomes during an acquisition phase. During a generalization phase, they showed that subjects are more likely to confuse novel tones with the previous negative cue than with the positive one. The authors describe this result as differential perceptual generalization as a function of valence.
My hope is to build upon this core finding in order to study how these valenced processes interact with memory. By replicating this study in an online sample, I would be well positioned to iterate through paradigms, eventually converging on a study that will enable me to explore the relationship between valence and memory.
Description of procedures
There are two stages in this experiment, which repeat: an acquisition stage and a generalization stage.
Acquisition stage:
There are two main trial types in the acquisition stage: instrumental and Pavlovian. In the instrumental trials, subjects hear one of three auditory tones (300Hz, 500Hz, or 700Hz). The tone is presented for 200ms, and subjects have to learn the correct response for each tone type (e.g. either p, q, or no response needed). For the “positive” tone (either 300Hz or 700Hz, randomized across subjects), subjects are given a monetary reward when they press the correct key (e.g. p). They receive no reward otherwise. For the “negative” tone (the complementary 300Hz or 700Hz tone), subjects are given a monetary penalty unless they press the correct key (e.g. q). That is, they avoid the penalty only if they press the correct key. Subjects have 2500ms to register a response; if no key press is registered in this time, that trial is marked as ‘incorrect’. For a third, “neutral” tone, the tone is presented, but subjects are not required to take any action. For all three tones, if a key is pressed before 2500ms, the trial ends.
For the Pavlovian trials, subjects are first presented with the word “helpless” at the center of the screen. Then, either a positive or negative auditory tone is played. There is nothing subjects can do to change the outcome, and key presses will not end the trial (unlike in the instrumental trials).
After all trials, subjects receive feedback, displayed at the center of the screen for 1000ms. For the positive and negative tones this is in the form of a monetary value (e.g. +$0.04, -$0.04, -$0.00, or +$0.00). For the neutral tone, the screen goes blank. At the end of each acquisition stage, subjects are given feedback about the aggregate bonus accrued in that stage.
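The instrumental payoff rule described above can be written as a small function. This is only an illustrative sketch in R: the function name `instrumental_outcome` is hypothetical, and the $0.04 amount is taken from the feedback examples in the text rather than from the experiment code.

```r
# Hypothetical helper (not part of the original implementation):
# dollar outcome of one instrumental acquisition trial.
# `amount` is assumed from the feedback examples (+$0.04 / -$0.04).
instrumental_outcome <- function(valence, correct_key_pressed, amount = 0.04) {
  if (valence == "positive") {
    # reward only when the correct key is pressed within the response window
    if (correct_key_pressed) amount else 0
  } else if (valence == "negative") {
    # penalty unless the correct key is pressed within the response window
    if (correct_key_pressed) 0 else -amount
  } else {
    # neutral tone: no response required and no monetary outcome
    NA_real_
  }
}

instrumental_outcome("positive", TRUE)   # reward
instrumental_outcome("negative", FALSE)  # penalty
```

Under this rule a missed response counts as an incorrect key press, so it earns nothing on a positive trial and incurs the penalty on a negative trial.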
Generalization stage:
Subjects are presented with the original three tones, as well as a range of tones similar to the positive and negative tones (300 | 700 ± 5, 20, 60, and 100 Hz). There are also tones very dissimilar to those in the acquisition stage (480, 500, 520, 880, 900, or 920Hz). If the tone presented is either the original positive or negative tone from the acquisition stage, subjects are instructed to press the key that corresponded to that tone (p or q). Otherwise, they are instructed to press a third key (spacebar). Subjects are asked to respond within 2500ms and are rewarded for a correct response within this time, though no feedback is given. If subjects do not respond within the allotted time, they are shown feedback telling them that they have been heavily penalized (“-$0.40 RESPOND FASTER”, presented at the center of the screen). This amount is not actually deducted from subjects’ total bonus.
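To make the stimulus set concrete, the tones listed above can be enumerated as follows. This is a sketch, not the experiment code: the variable names are illustrative, and the mapping of 300Hz to ‘p’ and 700Hz to ‘q’ is just one of the randomized assignments.

```r
# Reference (valenced) tones and the offsets described in the text
reference_tones <- c(300, 700)
offsets <- c(5, 20, 60, 100)
signed_offsets <- c(-rev(offsets), 0, offsets)

# Every combination of reference tone and signed distance
valenced <- expand.grid(distance = signed_offsets,
                        reference_tone = reference_tones)
valenced$tone <- valenced$reference_tone + valenced$distance

# Control tones retained in this replication
control_tones <- c(480, 500, 520, 880, 900, 920)

stimuli <- data.frame(
  tone = c(valenced$tone, control_tones),
  reference_tone = c(valenced$reference_tone, rep(NA, length(control_tones))),
  distance = c(valenced$distance, rep(NA, length(control_tones)))
)

# Assumed key assignment for illustration: 300Hz -> 'p', 700Hz -> 'q'.
# Every other tone requires the third key (spacebar).
stimuli$correct_response <- ifelse(stimuli$tone == 300, "p",
                            ifelse(stimuli$tone == 700, "q", "space"))

nrow(stimuli)  # number of distinct tones in the generalization stage
```

Here `distance` is signed, so that `tone - distance` recovers the reference tone, matching the convention used in the preprocessing code.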
The acquisition and generalization stages are repeated until subjects have gone through three acquisition-generalization cycles.
Differences from Original Study
Because payment in this study is based on performance, chance-level performance would result in subjects receiving $0.00. This design incentivizes subjects to remain engaged in a way that is well suited to an online setting. Additionally, the resulting pattern of behavioral evidence will allow us to identify subjects who were not engaged (e.g. performance around chance). Any subjects who are not performing above 80% accuracy within the first block will be excluded from further analysis.
In principle, the current JavaScript implementation should not deviate in a meaningful way from the original study. Most importantly, the distribution of trial types and overall experimental length have been largely preserved, even where this is not critical for the main hypothesis; for example, the proportion of control trials is within several percentage points of the original study, not only the distribution of positive and negative instrumental and Pavlovian trials.
The biggest deviation is the exclusion of several of the original control stimuli in the generalization phase. Two different auditory frequencies (e.g. 100 and 900Hz) must be played at different amplitudes in order to sound like they are being played at the same volume; typically, lower tones sound quieter, so the amplitude has to be greater. The original authors did not use a calibration procedure to ensure that tones of different frequencies were equally audible, and this seems to have been acceptable in their laboratory setting. However, when stimuli were generated following their procedure and played on laptops, it was not possible to hear the 80Hz tones. This group of control tones around 100Hz (80, 100, 120) was therefore excluded.
Data preparation
Server side data preparation and loading with python
The data are formatted server side, prior to the analysis here, to aid preprocessing. Below is the format of an example generalization trial:
{'trial_data': {'rt': 602,
'stimulus': 'sound/705',
'key_press': 32,
'stage': 'generalization',
'correct_response': 'space',
'tone': 705,
'valence': 'negative',
'distance': 5,
'trial_type': 'audio-keyboard-response',
'trial_index': 362,
'time_elapsed': 446767,
'internal_node_id': '0.0-10.0-1.22',
'correct': True,
'i_generalization_trial': 52,
'i_block': 1},
'data_type': 'single_trial',
'iteration_name': 'pilot_3',
'context': 'acquisition',
'worker_id': 'yy',
'assignment_id': 'xx',
'hit_id': 'xxyy',
'browser': 'Chrome'}
The Python script used to extract data from the server is mongo_data_extraction.py, which outputs subject_data.csv
Preprocessing
Import data from python into R, add analysis-related columns
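# load packages used throughout: dplyr (pipes and verbs), ggplot2 (plots), cowplot (plot_grid)
library(dplyr)
library(ggplot2)
library(cowplot)
# import data from server, generated by mongo_data_extraction.py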
data = read.csv('subject_data.csv')
# extract data from generalization stages
generalization_data = data %>%
filter(stage=='generalization') %>%
mutate(# determine reference tone for each trial
reference_tone = as.factor(tone - distance),
# determine valence key press was associated with -- conditional because tone-valence-key pairings are randomized
association = ifelse(key_press==positive_key, 'positive', ifelse(key_press==negative_key, 'negative', NaN)))
The next step is critical: to determine whether the decision subjects made (e.g. pressing p or q) matches the valence of the reference tone nearest to the tone they heard. That is, if 300Hz is the positive tone in the acquisition stage, which is associated with ‘p’, and they heard a 305Hz tone, did they press ‘p’? For the tones that were in the acquisition stage, these valence-congruent decisions are correct hits. For other tones (e.g. 305Hz, 360Hz, etc.) they are false alarms.
This will be termed the match or congruence between decision and the reference valence.
# determine whether the decision valence matches the nearest valenced tone
generalization_data$match = as.character(generalization_data$valence) == as.character(generalization_data$association)
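A toy example may help make the match term concrete. This is a sketch with made-up trials: key presses are written as key names for readability (the real data store numeric key codes), and a third-key response is labelled "neither" so that it counts as a non-match.

```r
# Four hypothetical generalization trials; 300Hz is assumed positive ('p'),
# 700Hz negative ('q'). None of these rows come from the real data.
toy <- data.frame(
  tone      = c(305, 305, 700, 695),
  valence   = c("positive", "positive", "negative", "negative"),
  key_press = c("p", "space", "q", "p"),
  stringsAsFactors = FALSE
)
positive_key <- "p"
negative_key <- "q"

# Valence associated with the key that was pressed
toy$association <- ifelse(toy$key_press == positive_key, "positive",
                   ifelse(toy$key_press == negative_key, "negative", "neither"))

# TRUE when the decision is congruent with the nearest valenced tone
toy$match <- toy$valence == toy$association
toy
```

The first row is a false alarm (a 305Hz tone treated as the positive 300Hz tone), the third row is a correct hit, and the other two rows are non-matches.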
To perform our primary confirmatory test, we will restrict our analysis to include the following information:
- generalization trials
- positive or negative valenced tones (not controls)
- distance of this tone to tones in the acquisition stage (0, 5, 20, 60, or 100Hz)
hypothesis_space = generalization_data %>%
filter(stage=='generalization' & valence!='control') %>%
mutate(distance = abs(distance)) %>%
select(valence, distance, match, subject, correct, association, rt, positive_key)
Confirmatory analysis
Primary test for replication
Looking for the interaction between distance and valence, predicting the match term:
summary(lm('match ~ valence * distance', hypothesis_space))
##
## Call:
## lm(formula = "match ~ valence * distance", data = hypothesis_space)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7761 -0.4591 0.2239 0.3630 0.7523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6369821 0.0225761 28.215 < 2e-16 ***
## valencepositive 0.1390770 0.0318733 4.363 1.37e-05 ***
## distance -0.0037121 0.0004667 -7.955 3.64e-15 ***
## valencepositive:distance -0.0015715 0.0006591 -2.384 0.0172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4636 on 1420 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.1266, Adjusted R-squared: 0.1247
## F-statistic: 68.6 on 3 and 1420 DF, p-value: < 2.2e-16
Valence, distance, and the interaction between the two are significant predictors of subjects’ responses. This is, on its face, a replication of the main statistical properties of the original paper. We can visualize the relationship between valence, distance, and subjects’ responses:
hypothesis_space %>%
group_by( valence, distance) %>%
summarise(p_association = mean(match, na.rm = TRUE ),
sem = sd(match, na.rm = TRUE)/sqrt(length(match)),
y_lower = p_association-sem,
y_upper = p_association+sem) %>%
ggplot(aes(x=distance, y=p_association, color=valence)) +
geom_line(aes(color=valence), size=1.5) +
geom_errorbar(aes(ymin=y_lower, ymax=y_upper), size=1) +
ggtitle('A "Replication" of Schechtman et al. 2010?')

While the statistical test replicated the original findings, the pattern of the data is not consistent with the hypothesis of the original paper. In the original authors’ terms, subjects seem to be showing wider perceptual generalization for positively valenced tones.
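One way to read the regression table is to turn the fitted coefficients into predicted match proportions for each valence. The sketch below copies the estimates from the model summary reported above; the helper `predict_match` is hypothetical, and because the model is linear on a binary outcome the values are descriptive only.

```r
# Coefficient estimates copied from the lm() summary for match ~ valence * distance
b <- c(intercept   = 0.6369821,
       positive    = 0.1390770,
       distance    = -0.0037121,
       interaction = -0.0015715)

# Hypothetical helper: fitted proportion of valence-congruent responses
predict_match <- function(valence, distance) {
  is_pos <- as.numeric(valence == "positive")
  unname(b["intercept"] + b["positive"] * is_pos +
           (b["distance"] + b["interaction"] * is_pos) * distance)
}

distances <- c(0, 5, 20, 60, 100)
data.frame(distance = distances,
           negative = round(predict_match("negative", distances), 3),
           positive = round(predict_match("positive", distances), 3))

# Slope of each fitted line (change in match proportion per Hz)
c(negative = unname(b["distance"]),
  positive = unname(b["distance"] + b["interaction"]))
```

In this fit the positive line starts higher at distance 0 and falls more steeply, so the two fitted lines only cross at roughly 88Hz. The interaction is therefore tied to the difference in intercepts as much as to any difference in the width of generalization.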
Post-mortem analysis
Difference between identification accuracies in generalization stage
A surprising possibility, looking at the plot above, is that subjects’ overall accuracy in identifying the original negative and positive tones in the generalization stage differs between the two valences. We can visualize subjects’ accuracy and reaction times:
zero_distance = filter(hypothesis_space, distance==0)
plot.hits = zero_distance %>%
group_by(valence, subject) %>%
summarise(avg_correct_hit = mean(correct)) %>%
ggplot(aes(x=valence, y=avg_correct_hit)) +
geom_jitter(aes(color=valence), width=.01) +
ggtitle("Subject-level accuracy \ntone identification in generalization stage\n")
plot.rts = zero_distance %>%
group_by(valence, subject) %>%
summarise(avg_rt = mean(rt, na.rm = TRUE)) %>%
ggplot(aes(x=valence, y=avg_rt)) +
geom_jitter(aes(color=valence), width=.01) +
ggtitle("Average subject-level reaction times")
plot_grid(plot.hits, plot.rts)

It does not seem that there is a meaningful difference in reaction times, but there may be a difference in accuracies, which we can test more formally:
summary(lm('correct ~ valence ', zero_distance))
##
## Call:
## lm(formula = "correct ~ valence ", data = zero_distance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8208 -0.6708 0.1792 0.3292 0.3292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67083 0.02774 24.181 < 2e-16 ***
## valencepositive 0.15000 0.03923 3.823 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4298 on 478 degrees of freedom
## Multiple R-squared: 0.02967, Adjusted R-squared: 0.02764
## F-statistic: 14.62 on 1 and 478 DF, p-value: 0.0001491
This significant difference between positive and negative accuracies in the generalization stage suggests that subjects are perhaps not learning the task.
Assessing learning during the first acquisition stage
Beyond looking at the accuracy rates above, it is difficult to ask whether subjects are learning the task by looking at the generalization data alone. This is, in part, because we expect subjects to make errors in a way that is consistent with the “overgeneralization” claims made by the original authors. To ask whether subjects are learning, then, we can visualize the learning trajectories during the first acquisition stage, averaging over pairs of trials in a way that’s consistent with the original paper’s visualization:
data %>%
filter(stage=='acquisition' & valence!='neutral' & i_block==0) %>%
group_by(i_acquisition_trial, valence) %>%
summarise(mean_one = mean(correct),
sem_one = sd(correct)/sqrt(length(correct))) %>%
mutate(combined_trial = ifelse(i_acquisition_trial%%2, i_acquisition_trial-1, i_acquisition_trial)) %>%
group_by(combined_trial, valence) %>%
summarise(mean_two=mean(mean_one),
sem_two=mean(sem_one)) %>%
ggplot(aes(x=combined_trial, y=mean_two, color=valence)) +
geom_line(size=1.5) +
geom_errorbar(aes(ymin=mean_two-sem_two, ymax=mean_two+sem_two), size=1) +
ggtitle('Learning curves across the first acquisition stage')
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_errorbar).
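The pairing step in the code above maps each odd trial index onto the even index that precedes it, so that trials are averaged two at a time. A minimal illustration, assuming trial indices start at 0:

```r
# Eight consecutive trial indices, assumed to be zero-based
i_acquisition_trial <- 0:7

# Odd indices are moved back by one; even indices are left unchanged
combined_trial <- ifelse(i_acquisition_trial %% 2 == 1,
                         i_acquisition_trial - 1,
                         i_acquisition_trial)

data.frame(i_acquisition_trial, combined_trial)
```

Trials 0 and 1 share the label 0, trials 2 and 3 share the label 2, and so on, which is equivalent to `2 * (i_acquisition_trial %/% 2)`.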

We can also look at the average accuracy across all subjects:
data %>%
filter(stage=='acquisition' & valence!='control') %>%
group_by(subject) %>%
summarise(accuracy=mean(correct)) %>%
summarise(mean=mean(accuracy))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 0.668
In the original study, subjects were significantly above chance within two trials. While subjects do achieve near-ceiling performance, they appear to take longer.
Identifying high-performing subjects from acquisition data
For the current experiment, it’s critical that subjects are not “extinguishing” the tone-valence associations learned in the acquisition stage. The acquisition stages are repeated, in large part, to protect against the extinction that occurs when tones are repeatedly presented with no feedback in the generalization stage.
Here we visualize each subject’s average acquisition accuracy, across all blocks, identifying those subjects who averaged above 75%. We expect, given the logic above, that subjects with low accuracy across all blocks will not show the behavioral effects that are central to the current study. We also expect that subjects who are consistently performing well in the acquisition stage should also perform well in the generalization stage, though this is not expected to be a direct relationship, as increased learning in the acquisition stage may lead to increased “generalization”, which decreases accuracy.
# set a relatively liberal threshold
criterion = .75
# acquisition
acquisition_accuracy = data %>%
filter(stage=='acquisition' & valence!='control') %>%
group_by(subject) %>%
summarise(accuracy=mean(correct)) %>%
mutate(attending = accuracy>criterion)
generalization_accuracy = data %>%
filter(stage=='generalization', distance==0) %>%
group_by(subject) %>%
summarise(accuracy=mean(correct)) %>%
mutate(attending = accuracy>0)
learning_summary = data.frame(subject=generalization_accuracy$subject,
acquisition = acquisition_accuracy$accuracy,
generalization =generalization_accuracy$accuracy,
perform_well = as.factor(generalization_accuracy$attending * acquisition_accuracy$attending))
ggplot(learning_summary, aes(x=acquisition, y=generalization, color=perform_well)) +
geom_point(size=3) +
ggtitle('Selecting subjects who performed well in the acquisition stage\n accuracy > 75%')
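The summary table above combines the two accuracy tables by row position, which relies on both being sorted by subject in the same way. A variant that joins explicitly on the subject identifier is sketched below with made-up values; the subject labels and accuracies are purely illustrative.

```r
# Made-up per-subject accuracies, deliberately listed in different orders
acquisition_accuracy <- data.frame(subject = c("s1", "s2", "s3"),
                                   accuracy = c(0.92, 0.60, 0.81))
generalization_accuracy <- data.frame(subject = c("s3", "s1", "s2"),
                                      accuracy = c(0.70, 0.85, 0.00))
criterion <- 0.75

# Join on subject so rows are matched by identifier, not by position
learning_summary <- merge(acquisition_accuracy, generalization_accuracy,
                          by = "subject",
                          suffixes = c("_acquisition", "_generalization"))

# Same selection rule as above: acquisition accuracy above criterion
# and non-zero accuracy on the original tones in the generalization stage
learning_summary$perform_well <-
  learning_summary$accuracy_acquisition > criterion &
  learning_summary$accuracy_generalization > 0

learning_summary
```

With these values, s1 and s3 pass the check and s2 does not, regardless of the order in which the two tables list the subjects.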

We can identify those subjects who seem to be consistently engaged during the acquisition stages:
attention_check = filter(learning_summary, perform_well==1)
Repeating the main analysis only for those subjects who performed well in the acquisition stage
We can now restrict our analysis to only those subjects with relatively high accuracies during the acquisition stage:
hypothesis_space %>%
filter(subject %in% attention_check$subject) %>%
group_by( valence, distance) %>%
summarise(p_association = mean(match, na.rm = TRUE ),
sem = sd(match, na.rm = TRUE)/sqrt(length(match)),
y_lower = p_association-sem,
y_upper = p_association+sem) %>%
ggplot(aes(x=distance, y=p_association, color=valence)) +
geom_line(aes(color=valence), size=1.5) +
geom_errorbar(aes(ymin=y_lower, ymax=y_upper), size=1) +
ggtitle("Post-mortem attention check generalization curves")

attention_checked_hypothesis = filter(hypothesis_space, subject %in% attention_check$subject)
summary(lm('match ~ valence * distance', attention_checked_hypothesis))
##
## Call:
## lm(formula = "match ~ valence * distance", data = attention_checked_hypothesis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8088 -0.3546 0.1912 0.2734 0.9483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7266376 0.0354749 20.483 < 2e-16 ***
## valencepositive 0.0821947 0.0501690 1.638 0.1020
## distance -0.0051625 0.0007337 -7.036 6.55e-12 ***
## valencepositive:distance -0.0024084 0.0010377 -2.321 0.0207 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4338 on 500 degrees of freedom
## Multiple R-squared: 0.2378, Adjusted R-squared: 0.2332
## F-statistic: 52 on 3 and 500 DF, p-value: < 2.2e-16
We see that the interaction term here is still significant, even with the smaller sample. Qualitatively, the pattern of data in the plot above also looks more consistent with the original findings, if much less pronounced.
Is tone discrimination symmetric?
The assignment of tones to positive and negative valence was randomized in this experiment. Because this is an online experiment, stimulus presentation is much less controlled than in a laboratory setting, so it may be that certain tones are more or less discriminable (e.g. due to background noise or speaker quality). We can ask whether subjects were better at discriminating some tones than others, independent of valence assignment.
Analysis across all subjects
First, we extract the tone data of interest
tone_data = generalization_data %>%
filter(is.finite(distance) & stage=='generalization') %>%
mutate(reference_tone = tone - distance,
abs_distance = abs(distance),
log_ratio = (log(tone/reference_tone)),
abs_log_ratio = abs(log(tone/reference_tone)),
reference_tone = as.factor(reference_tone))
Then we plot the average percent correct at each distance, i.e., for novel tones, the degree to which subjects identified the tone as novel.
show_tone_accuracies = function(performance_group) {
if (performance_group=='all') {
which_subjects = learning_summary[['subject']]
} else if (performance_group=='high') {
which_subjects = filter(learning_summary, perform_well==1)[['subject']]
} else if (performance_group=='low') {
which_subjects = filter(learning_summary, perform_well==0)[['subject']]
}
plot.abs_distance = tone_data %>%
filter(subject %in% which_subjects) %>%
group_by(abs_distance, reference_tone) %>%
filter(reference_tone==reference_tone) %>%
summarise(avg_correct = mean(correct),
sem = sd(correct, na.rm = TRUE)/sqrt(length(correct))) %>%
ggplot(aes(x=abs_distance, y=avg_correct, color=reference_tone)) +
geom_line(size=1.5) +
geom_errorbar(aes(ymin=avg_correct-sem, ymax=avg_correct+sem), size=1) +
theme(legend.position = c(.4, .2)) +
ggtitle("correct responses \n sorted by absolute value distance")
plot.relative_distance = tone_data %>%
filter(subject %in% which_subjects) %>%
group_by(distance, reference_tone) %>%
filter(reference_tone==reference_tone) %>%
summarise(avg_correct = mean(correct),
sem = sd(correct, na.rm = TRUE)/sqrt(length(correct))) %>%
ggplot(aes(x=distance, y=avg_correct, color=reference_tone)) +
geom_line(size=1.5) +
geom_errorbar(aes(ymin=avg_correct-sem, ymax=avg_correct+sem), size=1) +
theme(legend.position="none") +
ggtitle("correct responses \n sorted by relative distance")
plot.log_ratio = tone_data %>%
filter(subject %in% which_subjects) %>%
group_by(log_ratio, reference_tone) %>%
filter(reference_tone==reference_tone) %>%
summarise(avg_correct = mean(correct),
sem = sd(correct, na.rm = TRUE)/sqrt(length(correct))) %>%
ggplot(aes(x=log_ratio, y=avg_correct, color=reference_tone)) +
geom_line(size=1.5) + theme(legend.position="none") +
geom_errorbar(aes(ymin=avg_correct-sem, ymax=avg_correct+sem), size=1) +
ggtitle("correct responses \n sorted by log(tone/reference)")
plot.abs_log_ratio = tone_data %>%
filter(subject %in% which_subjects) %>%
group_by(abs_log_ratio, reference_tone) %>%
filter(reference_tone==reference_tone) %>%
summarise(avg_correct = mean(correct),
sem = sd(correct, na.rm = TRUE)/sqrt(length(correct))) %>%
ggplot(aes(x=abs_log_ratio, y=avg_correct, color=reference_tone)) +
geom_line(size=1.5) + theme(legend.position="none") +
geom_errorbar(aes(ymin=avg_correct-sem, ymax=avg_correct+sem), size=1) +
ggtitle("correct responses \n sorted by abs(log(tone/reference))")
plot_grid(plot.abs_distance, plot.relative_distance, plot.log_ratio, plot.abs_log_ratio)
}
show_tone_accuracies('all')

It appears that tones around 700Hz are easier to discriminate than tones around 300Hz. We can test this more formally:
summary(lm('correct ~ abs(distance) + reference_tone', tone_data))
##
## Call:
## lm(formula = "correct ~ abs(distance) + reference_tone", data = tone_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6851 -0.3910 -0.2525 0.4620 0.7475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2341429 0.0290683 8.055 2.35e-15 ***
## abs(distance) 0.0036762 0.0004164 8.829 < 2e-16 ***
## reference_tone700 0.0833333 0.0307926 2.706 0.00693 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.477 on 957 degrees of freedom
## (480 observations deleted due to missingness)
## Multiple R-squared: 0.08181, Adjusted R-squared: 0.07989
## F-statistic: 42.63 on 2 and 957 DF, p-value: < 2.2e-16
The reference tone here is a significant predictor of accuracy. What seems to be driving this effect is the difference in accuracy for the tones below the reference tones–in this case, tones around 200Hz are less accurately judged than tones around 600Hz, even though they are the same distance from the reference tones.
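Although 200Hz and 600Hz are the same distance in Hz from their reference tones, they are not the same distance on a logarithmic (musical) scale. The following sketch computes both measures for the two tones; the semitone conversion is a standard formula and is not part of the original analysis.

```r
# The two tones that sit 100Hz below their reference tones
reference_tone <- c(300, 700)
tone <- reference_tone - 100

# Same quantity as the log_ratio column defined in tone_data
log_ratio <- log(tone / reference_tone)

# Equivalent interval in semitones (12 semitones per doubling of frequency)
semitones <- 12 * log2(tone / reference_tone)

data.frame(tone, reference_tone,
           distance = tone - reference_tone,
           log_ratio = round(log_ratio, 3),
           semitones = round(semitones, 2))
```

On the log scale the 200Hz tone is further from 300Hz than the 600Hz tone is from 700Hz. If discriminability tracked the frequency ratio alone, the 200Hz tone would be expected to be the easier one to reject, so the lower accuracy around 200Hz may instead reflect how audible low tones are on subjects’ own hardware.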
Conclusions
With a sample of 20 subjects, the present study successfully reproduced the main statistical finding relevant to our interest: the interaction between valence and distance in predicting subjects’ behavior. However, this occurred in a way that was inconsistent with the theoretical claims of the original paper. The main effect driving the difference between the positive and negative slopes seems to be that subjects were significantly worse at identifying negative tones than positive tones (p < .001) in the generalization stage, not an increase in errors surrounding the negative tone.
A post-mortem analysis determined that the learning trajectories in this population were slower than those in the original study, and that across the entire study, the average performance was well below ceiling (<70%). We then identified several subjects with consistent accuracies above 75% in the acquisition stages. The main hypotheses were tested again in this subset. The interaction between valence and distance was significant, even in this smaller sample (p < .03). The pattern of data, in this case, also seemed more consistent with the main hypothesis: increased “generalization” for tones around the negative tone.
Interestingly, there also seemed to be a significant difference between the accuracies of the two tones (700 and 300Hz) in the generalization stage (p < .01). This is a concern because it suggests that subjects may simply not be able to hear some tones as well as others. When only those subjects who performed well were tested, however, this effect of tone on accuracy was no longer significant (p > .4). That is, for those subjects who performed well, there was no difference in accuracy between 300 and 700Hz tones.
Together, these results are promising, but still only suggestive. Critically, subjects’ performance in the acquisition stage improved slowly and remained below ceiling. The centrality of learning in this stage to any form of perceptual generalization raises concerns about the current implementation. A fair test of the authors’ original claims, within an online sample, will ultimately demand more stringent checks on subjects’ performance.
In future studies, I plan to exclude subjects who don’t perform at the level reported in the original paper in each acquisition stage (>90%). That is, after each acquisition stage, if a subject is not performing at a level consistent with a fully attending subject in the lab, the experiment will end and they will be given the bonus they’ve earned. This will allow us to generously compensate workers who are fully engaged, and not further compensate workers who are not performing well.
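The proposed stopping rule could look something like the sketch below. It is written in R for consistency with the analysis code, although the live check would need to run inside the experiment itself; the function name `continue_after_block` is hypothetical, and missed responses are assumed to be already coded as incorrect.

```r
# Hypothetical check run after each acquisition stage:
# `correct` is a logical vector with one entry per acquisition trial in the block.
continue_after_block <- function(correct, criterion = 0.90) {
  mean(correct) > criterion
}

# A subject with 19 of 20 trials correct continues to the next stage
continue_after_block(c(rep(TRUE, 19), FALSE))

# A subject with 8 of 10 trials correct is stopped and paid the bonus earned so far
continue_after_block(c(rep(TRUE, 8), FALSE, FALSE))
```

Because the criterion is applied per block, a subject who disengages partway through the experiment would be stopped at the next acquisition stage rather than only being flagged in a post hoc analysis.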