Replication of ‘Language is not Just for Talking: Redundant Labels Facilitate Learning of Novel Categories’ by Lupyan, Rakison, & McClelland (2007, Psychological Science)

Author

Caroline Kaicher (ckaicher@stanford.edu)

Published

December 10, 2025

Introduction

Justification

I am interested in how labels help children and adults learn categories. Lupyan, Rakison, and McClelland (2007) contribute to this question by showing that adults learn object categories (in this case, categories of aliens) faster when the categories are labeled than when they have no labels or have other, nonlinguistic cues. This paper is particularly compelling because the labels are “redundant”; in other words, they provide no additional information to participants about the category distinctions. It is therefore presumed that there is something “special” about having a label to associate with category exemplars during the category learning process. Thus, while words play an important role in category learning by pointing out useful category distinctions in the environment, they may play an even bigger role in facilitating the category learning process itself; however, the exact nature and mechanism of this role remain unknown.

Stimuli and Procedures

Lupyan, Rakison, and McClelland (2007) consists of two experiments, and I will be replicating Experiment 2. To conduct this experiment, I will need to recreate their category learning task. I will use PsychoPy, as this is the experiment-building software I am most familiar with, and host the task online using Pavlovia. The task will have four conditions: No Label, Written Label, Auditory Label, and Location (a nonlinguistic cue). The stimuli I will need are recordings of the auditory labels and the alien images used in the original experiment. The images from the original study were created by Mike Tarr’s lab (the YUFO stimulus set) and are publicly available on their website.

The main challenge I anticipate for this study is finding the specific alien images the authors used for the two categories. Luckily, all the images they used are shown in Figure 1 of the paper, but the original stimulus set contains many images, so I will need to comb through them to find the exact ones. Otherwise, the description of the category learning task seems clear and includes all the details necessary to recreate it.

Methods

Power Analysis

The authors reported a partial eta-squared effect size of 0.07 for the interaction between condition and block, which translates to a partial Cohen’s f of about 0.27. The corresponding non-partial effect sizes are an eta-squared of about 0.066 and a Cohen’s f of about 0.264, indicating a medium effect size for the key analysis of interest.

Using G*Power, I conducted a power analysis for this effect size in a mixed ANOVA. The power analysis indicates that a sample of 24 participants is needed to achieve 80% power, 28 participants to achieve 90% power, and 32 participants to achieve 95% power.
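As a sanity check on the effect-size conversion, here is a minimal R sketch using the standard formula f = sqrt(eta^2 / (1 - eta^2)):

#convert the reported eta-squared values to Cohen's f
eta2_partial <- 0.07
f_partial <- sqrt(eta2_partial / (1 - eta2_partial)) #~0.27
eta2 <- 0.066
f <- sqrt(eta2 / (1 - eta2)) #~0.27, consistent with the ~0.264 reported above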

Planned Sample

I plan to stop collecting data once I have 76 participants, 19 in each condition. Even though adequate power can be achieved with fewer participants, I decided to match the original study’s sample size. The original study had 75 participants; I adjusted this to 76 to have an equal number of participants in each of the four conditions.

Participants will be required to be fluent in English so that they understand the task instructions. I will also limit the sample to participants in the United States, since the original sample consisted of college students living in the US.

Materials

“The stimuli were a subset of the YUFO stimulus set (Gauthier, James, Curby, & Tarr, 2003). Items in one category (shown on the left in Fig. 1) had flatter bases and a subtle ridge on their “heads.” Items in the other category (shown on the right in Fig. 1) had more rounded bases and smoother heads…The stimuli were presented on a black background on a 17-in. computer screen and subtended 8° of visual angle. Responses were collected using a gamepad controller. For the [written] label condition, the categories were associated with the nonsense labels “leebish” and “grecious,” which were displayed in a white, 16-point font.”

The alien images used in the replication are exactly the same as in the original, and the same labels were used for the categories. However, since the replication is run online, participants complete it on their personal computers, which means the task may be presented at any screen size and visual angle. Responses were also collected using the participants’ keyboards rather than a gamepad.

Procedure

“Subjects were told to imagine that they were explorers on another planet and were learning about alien life forms. Their task was to determine which aliens they should approach and which they should move away from. On each training trial, 1 of the 16 aliens appeared in the center of the screen. After 500 ms, an outline of a character in a space suit (the “explorer”) appeared in one of four positions—to the left of, to the right of, above, or below the alien. Subjects were instructed to respond with the appropriate direction key depending on the category of the alien. For instance, if the explorer appeared above the alien, they needed to press the “down” key to move toward the alien or the “up” key to move away; after the key press, the explorer moved toward or away from the alien, as indicated. Auditory feedback—a buzz for an incorrect response and a bell for a correct response—sounded 200 ms after the explorer stopped moving. In the [written] label condition, a printed label (“leebish” or “grecious”) appeared to the right of the alien 300 ms after the feedback. After another 1,500 ms, the alien (and label, in the [written] label condition) disappeared from the screen, and a fixation cross marked the start of the next trial. The total trial duration and exposure to the stimulus were equal for the two conditions. The pairing of the labels with the categories (move away vs. move toward) and with the perceptual stimuli (left vs. right side of Fig. 1) was counterbalanced across subjects. Subjects in the label condition were told that previous visitors to the planet had found it useful to name the two kinds of aliens, and that they should pay careful attention to the labels. All subjects received the same number of categorization trials (nine blocks of 16 trials each) and had equal exposure to the stimuli. The only difference between the two conditions was whether or not a verbal label appeared after each response.”

This procedure is described for Experiment 1 of the study, which had only two conditions: [written] label vs. no label. Experiment 2 uses the same procedure but adds two more conditions: auditory label and location. Everything described in the procedure above was followed exactly, except that I did not use the same bell and buzz sounds or the same astronaut character as the original. The authors describe the additional procedural considerations for Experiment 2 as follows:

“The materials and procedure were identical to those used in Experiment 1 with the following exceptions: In the auditory label condition, the written labels were replaced by recorded sound clips of a female saying “leebish” and “grecious.” In the location condition, subjects were told that some aliens lived on one side of the planet, and others lived on the other side. On each trial, after the subject responded (approach/escape) and auditory feedback was given, the alien moved up or down to signal where it “lived.” The motion started 300 ms after response feedback and lasted approximately 400 ms. The trial ended 1,300 ms after the alien stopped moving. Thus, the alien was visible for a longer total time in the location condition compared with the label conditions…To measure the degree to which subjects learned the association between stimuli and labels or locations, we included verification trials as part of the training procedure. Verification trials were presented after a random 10% of training trials. On each verification trial in the label conditions, one of the aliens appeared with a query asking: “Is this one leebish [grecious]? yes/no” (the label was randomly selected). On the verification trials in the location condition, the alien moved up or down, and subjects responded to the query, “Is this correct? yes/no”; subjects were allowed to repeat the motion numerous times before making their response. No feedback was provided for the verification trials.”

This was followed closely, with a few exceptions. First, I used a text-to-speech converter to generate the auditory labels “leebish” and “grecious” (in a female voice, like the original). Second, the verification trials were presented at the end of each block, rather than “after a random 10% of training trials.” This was done because of limitations of PsychoPy: given how loops are set up within each block of trials, it is difficult to insert a new trial type into a block without it repeating on every iteration of the loop. I do not expect this to affect the replication results, because the verification trials are not used in the main analysis of interest, and participants receive only one fewer verification trial than in the original (9 rather than 10). The last exception is that in the verification trials for the location condition, participants cannot repeat the motion before making their response. I do not think this detracts from participants’ ability to make their choice: after the alien moves once, it remains in the location where it stopped, and because the alien always starts in the center of the screen, it stays clear which direction the alien moved.

I have separate task versions set up for each condition, with counterbalancing of the labels and categories set up for each of them through Pavlovia.

Auditory Label: https://run.pavlovia.org/ckaicher/lupyan_replication_1

Written Label: https://run.pavlovia.org/ckaicher/lupyan_replication_2

Location: https://run.pavlovia.org/ckaicher/lupyan_replication_3

No Label: https://run.pavlovia.org/ckaicher/lupyan_replication_4

Analysis Plan

Data will be cleaned and tidied such that trials with a response time of more than 3 minutes will be excluded. Participants will be excluded if they pressed the same arrow key on more than 90% of trials.
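For concreteness, here is a minimal tidyverse sketch of these two rules; df.raw and its columns (participant, trial_started, trial_stopped, key_resp_actual) stand in for the raw trial data, and the full implementation appears under Data Preparation below:

library("tidyverse")

#drop trials that lasted more than 3 minutes (timestamps are in seconds)
df.trials <- df.raw %>%
  filter(trial_stopped - trial_started <= 180)

#flag participants who pressed a single arrow key on more than 90% of their 144 trials
df.flagged <- df.trials %>%
  count(participant, key_resp_actual) %>%
  group_by(participant) %>%
  filter(max(n) / 144 > 0.9)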

The key analysis of interest is a mixed ANOVA with Condition as a between-subjects factor and Block as a within-subjects factor. I will use this to test for a significant Block x Condition interaction and a main effect of Condition. Just as the authors did, I will also conduct planned comparisons to see whether the two label conditions differ from one another and whether the no-label and location conditions differ from one another. The authors also conducted two more ANOVAs that I will run as well: 1) a Condition x Block ANOVA with the data pooled across the label conditions and pooled across the no-label and location conditions, and 2) a Condition x Block ANOVA of just the written-label and location conditions.
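For reference, here is a sketch of the key mixed ANOVA using the afex package; df.summary stands in for a data frame with one mean-accuracy row per participant per block, matching the summary table constructed in the Results section:

library("afex")

#Condition (between-subjects) x Block (within-subjects) mixed ANOVA
aov_ez(id = "participant",
       dv = "mean_correct",
       data = df.summary,
       between = "condition",
       within = "block")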

I will also use participants’ performance on the verification trials to see if their verification accuracy correlates with training accuracy, and whether an ANOVA with Condition as a between-subjects factor is significant. If the ANOVA is significant, I will follow it up with pairwise comparisons of the conditions. These verification trial analyses will only be conducted for the auditory label, written label, and location conditions, as the no-label condition does not have verification trials.
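Sketches of these verification analyses, assuming a combined data frame df.combined with per-participant training accuracy (overall_training_acc), verification totals (verification_sum), and condition; the full versions appear in the Results section:

#correlation between training accuracy and verification accuracy
cor.test(df.combined$overall_training_acc,
         df.combined$verification_sum)

#one-way ANOVA of verification accuracy across the three conditions
anova(lm(verification_sum ~ condition, data = df.combined))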

Differences from Original Study

The original study was conducted in-person, with a sample of American undergraduate students between the ages of 18 and 24. This replication will be conducted online on Prolific, with adults of any age. These differences are not anticipated to affect the results of the study based on the claims of the original article.

Methods Addendum (Post Data Collection)

Actual Sample

For my final sample, I had 76 participants aged 22-78 (M = 43.7). Of the participants who reported their gender, 35 were female and 38 were male. No participants were excluded under the criterion of pressing the same arrow key on more than 90% of trials.

Differences from pre-data collection methods plan

None

Results

Data preparation

Data preparation following the analysis plan.

### Data Preparation

#### Load Relevant Libraries and Functions
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("emmeans")
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
library("lme4")
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack
library("lmerTest")

Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step
options(readr.show_col_types = FALSE)
options(dplyr.summarise.inform = FALSE)

theme_set(theme_classic(base_size = 18))

#### Import data
input_path <- "../data/full sample/raw_data"
output_path <- "../data/full sample/processed_data"

files <- list.files(path=input_path,
                    pattern=".csv",
                    all.files=FALSE,
                    full.names=FALSE)

#this is for making the condition names more concise during tidying
condition_names <- tibble(experimentName = c("Category_Training_LabelAuditory",
                                      "Category_Training_LabelWritten",
                                      "Category_Training_Location",
                                      "Category_Training_NoLabel"),
                          condition = c("Label_Auditory",
                                        "Label_Written",
                                        "Location",
                                        "No_Label"))

#### Data exclusion / filtering

clean_data = function(dat, index) {
  dat_clean <- dat %>% 
    mutate(participant = index) %>% 
    mutate(counterbalance_group = counterbalance_group[1]) %>% 
    select(participant,
           age,
           gender,
           counterbalance_group,
           exp_name,
           block,
           alien_stim,
           category,
           friendly,
           approach,
           key_resp_actual,
           correct,
           trial_started,
           trial_stopped) %>% 
    drop_na(alien_stim) %>% 
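    #number the 9 blocks x 16 = 144 training trials for this participant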
    mutate(trial = 1:144) %>%
    filter(trial_stopped - trial_started <= 180) %>% #remove trial if more than 3 minutes long
    mutate(condition = filter(condition_names,
                              dat$exp_name[1] == experimentName)$condition) %>% 
    mutate(gender = case_when(
      gender == "male" ~ "M",
      gender == "MALE" ~ "M",
      gender == "Male" ~ "M",
      gender == "make" ~ "M", #assuming this was a typo someone had
      gender == "Female" ~"F",
      gender == "female" ~ "F"
    ))
  
  return (dat_clean)
}

get_verification_scores = function(dat, index) {
  verification_score <- dat %>% 
    mutate(condition = filter(condition_names,
                              dat$exp_name[1] == experimentName)$condition) %>% 
    filter(!is.na(verification_correct)) %>% 
    select(condition, verification_correct) %>% 
    mutate(participant = index) %>% 
    group_by(participant, condition) %>% 
    summarize(verification_sum = sum(verification_correct))
  return (verification_score)
}

#to check whether participant pressed the same button for over 90% of trials
check_responses = function(dat) {
  if (nrow(filter(dat, key_resp_actual == "[\"up\"]"))/144 > 0.9|
      nrow(filter(dat, key_resp_actual == "[\"down\"]"))/144 > 0.9|
      nrow(filter(dat, key_resp_actual == "[\"right\"]"))/144 > 0.9|
      nrow(filter(dat, key_resp_actual == "[\"left\"]"))/144 > 0.9) {
    return (FALSE)
  } else {
    return (TRUE)
  }
}

#### Prepare data for analysis - create columns etc.

#empty tibbles to accumulate each participant's cleaned data and verification scores; columns are set by the first rbind() in the loop below
df.dat_clean_all <- tibble()

df.verification_scores <- tibble()

for (i in 1:length(files)) {
  df.dat <- read_csv(paste0(input_path, "/", files[i]))
  df.dat_clean <- clean_data(df.dat, i)
  if(df.dat_clean$condition[1] != "No_Label") {
    df.verification_scores <- rbind(get_verification_scores(df.dat, i), df.verification_scores)
  }
  if (check_responses(df.dat_clean)) {
    write.csv(df.dat_clean,
              paste0(output_path, "/", sub("\\-anon.csv$", "", files[i]), "_processed-anon.csv"),
              row.names = FALSE)
    df.dat_clean_all <- rbind(df.dat_clean, df.dat_clean_all)
  } else {
    print("participant excluded for pressing the same button for over 90% of trials")
  }
}

df.dat_clean_all$condition <- factor(df.dat_clean_all$condition,
                                     levels = c("No_Label",
                                                "Location",
                                                "Label_Written",
                                                "Label_Auditory"))

write.csv(df.dat_clean_all,
          paste0(output_path, "/all_participants.csv"),
          row.names = FALSE)

df.dat_clean_all_summary <- df.dat_clean_all %>% 
  group_by(participant,
           condition,
           block) %>% 
  summarize(mean_correct = mean(correct),
            age = age[1])

df.dat_clean_all_summary$block = factor(df.dat_clean_all_summary$block)

#look at demographics
df.demos <- df.dat_clean_all %>% 
  group_by(participant) %>% 
  summarize(gender = gender[1],
            age = age[1]) %>% 
  mutate(age = ifelse(test = age == 7, #someone put their age as 7, assuming this is a fluke
                      yes = NA,
                      no = age))

table(df.demos$gender)

 F  M 
35 38 
age_range <- range(df.demos$age, na.rm = TRUE) 

paste0("Age range: ", age_range[1], "-", age_range[2])
[1] "Age range: 22-78"
paste0("Mean age: ", round(mean(df.demos$age, na.rm = TRUE), 1))
[1] "Mean age: 43.7"

Confirmatory analysis

The analyses as specified in the analysis plan.

#key graph of interest (to be displayed next to original)
acc_byBlock <- df.dat_clean_all %>% 
  mutate(block = as.numeric(block)) %>% 
  select(participant, trial, block, correct, condition) %>%
  group_by(participant,
           condition,
           block) %>%
  summarise_at(.vars = "correct",
               .funs = mean) %>% 
  ggplot(mapping = aes(x = block,
                       y = correct,
                       color = condition)) +
  stat_summary(fun.data = mean_se,
               geom = "linerange",
               position = position_dodge(0.2)) +
  stat_summary(fun = mean, geom="line",
               position = position_dodge(0.2)) +
  stat_summary(fun = mean, geom="point",
               position = position_dodge(0.2)) +
  scale_x_continuous(name = "Block",
                     breaks = c(1:9)) +
  coord_cartesian(ylim = c(0.5, 1.0)) +
  labs(y = "Proportion Correct") +
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=24),
        legend.text = element_text(size=20)) +
  scale_colour_brewer(palette = "Set1",
                      name = "",
                      labels = c("No Label", "Location", "Written Label", "Auditory Label"))

ggsave(filename = "cat_acc_byBlock.png",
       plot = acc_byBlock,
       width = 10)
#key analysis of interest -- mixed ANOVA with Condition as a between-subjects factor and Block as a within-subjects factor, and follow-up comparisons
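#note: glm() with its default gaussian family fits an ordinary linear model to the participant-by-block mean accuracies; joint_tests() then computes the omnibus F tests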
acc.lm <- glm(mean_correct ~ condition * block,
              data = df.dat_clean_all_summary)

acc.av <- acc.lm %>% 
  joint_tests()
acc.av
 model term      df1 df2 F.ratio p.value
 condition         3 648   6.774  0.0002
 block             8 648   7.341  <.0001
 condition:block  24 648   0.235  1.0000
#planned comparisons to see whether the two label conditions differ from one another and whether the no-label and location conditions differ from one another
emm <- emmeans(acc.lm,
               specs = ~ condition)
NOTE: Results may be misleading due to involvement in interactions
contrast_results <- contrast(emm,
                             list(auditory_vs_written = c(0, 0, -1, 1),
                                  location_vs_nolabel = c(-1, 1, 0, 0)),
                             adjust = "sidak")

summary(contrast_results)
 contrast            estimate     SE  df t.ratio p.value
 auditory_vs_written   0.0852 0.0201 648   4.247  <.0001
 location_vs_nolabel  -0.0219 0.0201 648  -1.094  0.4737

Results are averaged over the levels of: block 
P value adjustment: sidak method for 2 tests 
#comparing pooled Auditory Label and Written Label data to pooled Location and No Label data in new anova
df.pooled <- df.dat_clean_all_summary %>% 
  mutate(condition = case_when(
      condition == "Label_Auditory" ~ "labels_pooled",
      condition == "Label_Written" ~ "labels_pooled",
      condition == "Location" ~ "nonLabels_pooled",
      condition == "No_Label" ~ "nonLabels_pooled",
      .default = NA))

acc.lm_pooled <- glm(mean_correct ~ condition * block,
                     data = df.pooled)

acc.av_pooled <- acc.lm_pooled %>%
  joint_tests()
acc.av_pooled
 model term      df1 df2 F.ratio p.value
 condition         1 666   1.081  0.2988
 block             8 666   7.284  <.0001
 condition:block   8 666   0.205  0.9900
#comparing just Written Label and Location conditions in new anova
df.justWrittenAndLocation <- df.dat_clean_all_summary %>% 
  filter(condition == "Label_Written" | condition == "Location")

acc.lm_WrittenAndLocation <- glm(mean_correct ~ condition * block,
                                 data = df.justWrittenAndLocation)

acc.av_WrittenAndLocation <- acc.lm_WrittenAndLocation %>%
  joint_tests()
acc.av_WrittenAndLocation
 model term      df1 df2 F.ratio p.value
 condition         1 324   0.705  0.4016
 block             8 324   3.366  0.0010
 condition:block   8 324   0.276  0.9736
####verification accuracy analyses####

#see if verification accuracy correlates with training accuracy
df.training_overall <- df.dat_clean_all_summary %>% 
  group_by(participant) %>% 
  summarize(overall_training_acc = mean(mean_correct))

df.training_verification_combined <- full_join(df.training_overall,
                                               df.verification_scores) %>% 
  drop_na()
Joining with `by = join_by(participant)`
correlation <- cor.test(df.training_verification_combined$overall_training_acc,
                        df.training_verification_combined$verification_sum)

corr_plot <- ggplot(df.training_verification_combined,
                    aes(x = overall_training_acc,
                        y = verification_sum)) +
  geom_point(size = 2,
             alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  annotate("text", x = -Inf, y = Inf, label = paste("Correlation:", round(correlation$estimate, 2)), 
           hjust = -0.1, vjust = 1.1, size = 10, color = "black") +
  labs(x = "Training Accuracy",
       y = "# Correct",
       title = "Training accuracy vs verification performance") +
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=24),
        legend.text = element_text(size=20))

# ggsave(filename = "verification_acc_corr.png",
#        plot = corr_plot,
#        width = 10)

#check for differences in verification accuracy across conditions. If the ANOVA is significant, I will follow it up with pairwise comparisons of the conditions.

verification_byCondition <- ggplot(df.training_verification_combined,
                                   aes(x = condition,
                                       y = verification_sum)) +
  stat_summary(fun = "mean",
               geom = "bar") +
  stat_summary(fun.data = mean_se,
               geom = "linerange") +
  coord_cartesian(ylim = c(0, 9)) +
  labs(x = "Condition",
       y = "# Correct",
       title = "Verification performance by condition")

# ggsave(filename = "verification_acc_byCondition.png",
#        plot = verification_byCondition,
#        width = 10)

verification_acc.lm <- glm(verification_sum ~ condition,
                           data = df.verification_scores)

verification_acc.av <- verification_acc.lm %>% 
  joint_tests()
verification_acc.av
 model term df1 df2 F.ratio p.value
 condition    2  54   1.105  0.3387
#show plot of training trials next to original and then the plot for verification trials (no original plot for that one)

Original

[Figure: accuracy-by-block plot from the original paper]

Replication

[Figure: replication accuracy by block (cat_acc_byBlock.png)]

corr_plot

[Figure: training accuracy vs. verification performance scatterplot]

verification_byCondition

[Figure: verification performance by condition]

Exploratory analyses

Any follow-up analyses desired (not required).

#pairwise contrasts of the 4 conditions to see which ones differ from each other
contrasts_pairwise <- contrast(emm,
                               method = "pairwise",
                               adjust = "tukey")

summary(contrasts_pairwise)
 contrast                       estimate     SE  df t.ratio p.value
 No_Label - Location              0.0219 0.0201 648   1.094  0.6936
 No_Label - Label_Written         0.0387 0.0201 648   1.932  0.2156
 No_Label - Label_Auditory       -0.0464 0.0201 648  -2.315  0.0957
 Location - Label_Written         0.0168 0.0201 648   0.838  0.8361
 Location - Label_Auditory       -0.0683 0.0201 648  -3.408  0.0039
 Label_Written - Label_Auditory  -0.0852 0.0201 648  -4.247  0.0001

Results are averaged over the levels of: block 
P value adjustment: tukey method for comparing a family of 4 estimates 
#look at individual participant data for each condition, showing their age as well
acc_byBlock_perCondition <- df.dat_clean_all_summary %>% 
  mutate(block = as.numeric(block)) %>% 
  select(participant, block, mean_correct, condition, age) %>%
  group_by(participant,
           condition) %>%
  ggplot(mapping = aes(x = block,
                       y = mean_correct,
                       group = participant,
                       color = age)) +
  geom_line(alpha = 0.4) +
  geom_point(alpha = 0.4, size = 0.25) +
  scale_x_continuous(name = "Block",
                     breaks = c(1:9)) +
  labs(y = "Proportion Correct",
       title = "Individual training performance by condition and age") +
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=24),
        legend.text = element_text(size=20)) +
  facet_grid(rows = vars(condition)) +
  scale_color_gradient(low = "yellow", high = "red") +
  theme(strip.text = element_text(size = 12))

ggsave(filename = "acc_byBlock_perCondition.png",
       plot = acc_byBlock_perCondition,
       width = 10,
       height = 10)

#add age as a covariate in the model/anova
acc.lm_withAge <- glm(mean_correct ~ condition * block * age,
                      data = df.dat_clean_all_summary)

acc.av_withAge <- acc.lm_withAge %>% 
  joint_tests()
acc.av_withAge
 model term          df1 df2 F.ratio p.value
 condition             3 603   7.570  0.0001
 block                 8 603   6.378  <.0001
 age                   1 603   0.024  0.8769
 condition:block      24 603   0.218  1.0000
 condition:age         3 603   9.074  <.0001
 block:age             8 603   0.171  0.9946
 condition:block:age  24 603   0.250  0.9999
emm_withAge <- emmeans(acc.lm_withAge,
                       specs = ~ condition)
NOTE: Results may be misleading due to involvement in interactions
contrasts_pairwise_withAge <- contrast(emm_withAge,
                                       method = "pairwise",
                                       adjust = "tukey")

summary(contrasts_pairwise_withAge)
 contrast                       estimate     SE  df t.ratio p.value
 No_Label - Location              0.0239 0.0215 603   1.110  0.6837
 No_Label - Label_Written         0.0514 0.0211 603   2.433  0.0721
 No_Label - Label_Auditory       -0.0445 0.0214 603  -2.084  0.1594
 Location - Label_Written         0.0275 0.0210 603   1.312  0.5556
 Location - Label_Auditory       -0.0684 0.0212 603  -3.224  0.0073
 Label_Written - Label_Auditory  -0.0959 0.0208 603  -4.607  <.0001

Results are averaged over the levels of: block 
P value adjustment: tukey method for comparing a family of 4 estimates 

Discussion

Summary of Replication Attempt

The key analysis of interest was a mixed ANOVA with Condition as a between-subjects factor and Block as a within-subjects factor, from which the original paper reported a significant Condition x Block interaction and main effect of Condition. In my analysis, there was not a significant Condition x Block interaction, F(24, 648) = 0.235, ns, but there was a main effect of Condition, F(3, 648) = 6.774, p < .001. Therefore, my data show a partial replication of the primary result of interest.

I conducted the same follow-up analyses as the original authors, and they did not replicate. First, I ran planned comparisons to see whether the two label conditions differed significantly from one another and whether the no-label and location conditions differed significantly from one another. In the original paper, neither pair was significantly different, so the authors followed up with a Condition x Block ANOVA comparing the pooled data from the label conditions with the pooled data from the no-label and location conditions, and found a significant interaction. In my analysis, the Auditory Label and Written Label conditions were significantly different from each other, t(648) = 4.247, p < .0001, and the Location and No Label conditions were not, t(648) = -1.094, ns. In my ANOVA with the pooled data, there was not a significant Condition x Block interaction, F(8, 666) = 0.205, ns. Lastly, the authors conducted a Condition x Block ANOVA of just the written-label and location conditions and found a significant interaction. In my analysis, this interaction was not significant, F(8, 324) = 0.276, ns.

Next, I conducted analyses on the verification trials from the Auditory Label, Written Label, and Location conditions. In the original paper, they found that verification accuracy correlated with performance on the training trials, which I replicated, r(55) = 0.47, p < .001. They also found that an ANOVA with Condition as a between-subjects factor was significant, which was not the case in my data, F(2, 54) = 1.105, ns.

Commentary

I conducted a couple of exploratory analyses to better understand the patterns in my data. First, I ran pairwise contrasts on the category training data to see which conditions differed from each other. These contrasts revealed that the Auditory Label condition performed significantly better than the Written Label condition, t(648) = 4.247, p < .0005, and the Location condition, t(648) = 3.408, p < .005, but not the No Label condition, t(648) = 2.315, ns. This is quite a different pattern from the one shown in the original study, where the Auditory Label and Written Label conditions performed better than the Location and No Label conditions.

I also conducted an exploratory analysis to see whether age might have affected the results, as the original study’s participants were aged 18-24 while mine ranged from 22 to 78. First, I visually inspected the relation between individual participants’ age and performance in each condition and did not see any noticeable differences in performance by age across conditions. Additionally, when age is added as a covariate to the ANOVA, the Condition x Block interaction and the Auditory Label versus No Label contrast remain nonsignificant.

Even though my data show a partial replication of the primary result of interest, the lack of an interaction undermines the main claim of the original paper: that people learn the categories more quickly when they have labels than when they do not. Additionally, the fact that the Auditory Label and No Label groups did not differ significantly, while the Written Label group performed significantly worse than the Auditory Label group but no differently from the Location group, suggests that having a label is not always beneficial in this category learning task. The benefit of having a label may not be a robust effect and may be sensitive to small task differences. The major difference between this study and the original is that the replication was done online whereas the original was run in person. It is plausible that this influenced the findings, especially given that participants’ overall performance was lower than in the original study. I do not believe there are any other differences between the original task and the replication that would moderate the results.