Replication of ‘Language is not Just for Talking: Redundant Labels Facilitate Learning of Novel Categories’ by Lupyan, Rakison, & McClelland (2007, Psychological Science)

Author

Caroline Kaicher (ckaicher@stanford.edu)

Published

December 1, 2025

Introduction

Justification

I am interested in how labels help children and adults learn categories. Lupyan, Rakison, and McClelland (2007) address this question by showing that adults learn object categories (in this case, categories of aliens) faster with labels than with no labels or with nonlinguistic cues. The paper is particularly compelling because the labels are “redundant”: they give participants no additional information about the category distinctions. This suggests that there is something “special” about having a label to associate with category exemplars during category learning. Thus, while words play an important role in category learning by pointing out useful category distinctions in the environment, they may play an even bigger role in facilitating the learning process itself. However, the exact nature and mechanism of this role are unknown.

Stimuli and Procedures

Lupyan, Rakison, and McClelland (2007) consists of two experiments, and I will be replicating Experiment 2. To conduct this experiment, I will need to recreate their category learning task. I will use PsychoPy, as this is the experiment-building software I am most familiar with, and host it online using Pavlovia. The task will have four conditions: No Label, Written Label, Auditory Label, and Location (nonlinguistic cue). The stimuli I will need are recordings of the auditory labels and the alien images used in the original experiment. The images from the original study were created by Mike Tarr’s lab (the YUFO stimulus set) and are publicly available on their website.

The main challenge I anticipate for this study is finding the specific alien images the authors used in the two categories. Luckily, all the images they used are shown in Figure 1 of the paper, but there are a lot of images in the original stimulus set, so I will need to comb through them to find the exact ones. Other than that, the description of the category learning task seems clear and includes all the necessary details to recreate it.

Repo Link

Repository: https://github.com/psych251/lupyan2007.git

Original Paper: https://github.com/psych251/lupyan2007/blob/main/original_paper/lupyan-et-al-2007-language-is-not-just-for-talking-redundant-labels-facilitate-learning-of-novel-categories-2.pdf

Methods

Power Analysis

The authors reported a partial eta-squared effect size of 0.07 for the interaction between condition and block. This translates to a partial Cohen’s f of about 0.27, using f = sqrt(η² / (1 − η²)). The corresponding non-partial effect sizes are an eta-squared of about 0.066 and a Cohen’s f of about 0.264, indicating a medium effect size for the key analysis of interest.
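The conversion from eta-squared to Cohen’s f can be checked with a few lines of R. This is a minimal sketch of the standard formula f = sqrt(η² / (1 − η²)); the helper name `eta2_to_f` is hypothetical and is not part of the analysis code in this report.

```r
# Convert a (partial) eta-squared value to Cohen's f.
# eta2_to_f is a hypothetical helper name used only for illustration.
eta2_to_f <- function(eta2) {
  stopifnot(eta2 >= 0, eta2 < 1)
  sqrt(eta2 / (1 - eta2))
}

f_partial    <- eta2_to_f(0.07)   # partial eta-squared reported by the authors
f_nonpartial <- eta2_to_f(0.066)  # approximate non-partial eta-squared

round(c(partial = f_partial, non_partial = f_nonpartial), 3)
```

Because the eta-squared inputs are themselves rounded, the resulting f values are only approximate.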

I conducted a power analysis in G*Power for this effect size in a mixed ANOVA. The power analysis indicates that a sample of 24 participants is needed to achieve 80% power, 28 participants are needed to achieve 90% power, and 32 participants are needed to achieve 95% power.
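For readers without G*Power, the sketch below approximates this kind of calculation in base R. It uses the noncentral-F approach that G*Power documents for the within-between interaction in a repeated-measures ANOVA. The function name `interaction_power` is hypothetical, and the correlation among repeated measures (`rho = 0.5`) and the nonsphericity correction (`epsilon = 1`) are assumed values, since the settings used for the analysis above are not reported here. The resulting power values may therefore differ from the sample sizes quoted above.

```r
# Approximate power for a between x within interaction (repeated-measures ANOVA).
# ASSUMPTIONS (not reported in this document): rho = 0.5, epsilon = 1.
# k = number of groups (conditions), m = number of repeated measures (blocks).
interaction_power <- function(n_total, f, k = 4, m = 9,
                              rho = 0.5, epsilon = 1, alpha = 0.05) {
  df1 <- (k - 1) * (m - 1) * epsilon
  df2 <- (n_total - k) * (m - 1) * epsilon
  # noncentrality parameter: lambda = f^2 * u * N * epsilon, with u = m / (1 - rho)
  lambda <- f^2 * (m / (1 - rho)) * n_total * epsilon
  f_crit <- qf(1 - alpha, df1, df2)
  pf(f_crit, df1, df2, ncp = lambda, lower.tail = FALSE)
}

# Power at the three total sample sizes discussed above, for f = 0.264
power_by_n <- sapply(c(24, 28, 32), interaction_power, f = 0.264)
names(power_by_n) <- c("N = 24", "N = 28", "N = 32")
power_by_n
```

Changing `rho` or `epsilon` shifts the estimates noticeably, so it is worth recording the exact G*Power settings alongside the reported sample sizes.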

Planned Sample

I plan to stop collecting data once I have 76 participants, 19 in each condition. Even though adequate power can be achieved with fewer participants, I decided to match the original study’s sample size. The original study had 75 participants; I increased this to 76 so that each condition has the same number of participants.

Participants will be required to be fluent in English so that they can understand the task instructions. I will also limit the sample to participants from the United States, since the original sample consisted of college students living in the US.

Materials

“The stimuli were a subset of the YUFO stimulus set (Gauthier, James, Curby, & Tarr, 2003). Items in one category (shown on the left in Fig. 1) had flatter bases and a subtle ridge on their “heads.” Items in the other category (shown on the right in Fig. 1) had more rounded bases and smoother heads… The stimuli were presented on a black background on a 17-in. computer screen and subtended 8° of visual angle. Responses were collected using a gamepad controller. For the [written] label condition, the categories were associated with the nonsense labels “leebish” and “grecious,” which were displayed in a white, 16-point font.”

The alien images used in the replication are exactly the same as in the original, and the same labels are used for the categories. However, since the replication is conducted online, participants complete it on their personal computers, so the task may be presented at any screen size and visual angle. Responses are collected using participants’ keyboards rather than a gamepad.

Procedure

“Subjects were told to imagine that they were explorers on another planet and were learning about alien life forms. Their task was to determine which aliens they should approach and which they should move away from. On each training trial, 1 of the 16 aliens appeared in the center of the screen. After 500 ms, an outline of a character in a space suit (the “explorer”) appeared in one of four positions—to the left of, to the right of, above, or below the alien. Subjects were instructed to respond with the appropriate direction key depending on the category of the alien. For instance, if the explorer appeared above the alien, they needed to press the “down” key to move toward the alien or the “up” key to move away; after the key press, the explorer moved toward or away from the alien, as indicated. Auditory feedback—a buzz for an incorrect response and a bell for a correct response—sounded 200 ms after the explorer stopped moving. In the [written] label condition, a printed label (“leebish” or “grecious”) appeared to the right of the alien 300 ms after the feedback. After another 1,500 ms, the alien (and label, in the [written] label condition) disappeared from the screen, and a fixation cross marked the start of the next trial. The total trial duration and exposure to the stimulus were equal for the two conditions. The pairing of the labels with the categories (move away vs. move toward) and with the perceptual stimuli (left vs. right side of Fig. 1) was counterbalanced across subjects. Subjects in the label condition were told that previous visitors to the planet had found it useful to name the two kinds of aliens, and that they should pay careful attention to the labels. All subjects received the same number of categorization trials (nine blocks of 16 trials each) and had equal exposure to the stimuli. The only difference between the two conditions was whether or not a verbal label appeared after each response.”

This procedure is described for Experiment 1 of the study, which has only two conditions: [written] label vs. no label. Experiment 2 uses the same procedure but adds two other conditions: auditory label and location. I followed the procedure described above exactly, except that I did not use the same bell and buzz sounds or the same astronaut character as the original. Here is where the authors discuss the additional procedural considerations for Experiment 2:

“The materials and procedure were identical to those used in Experiment 1 with the following exceptions: In the auditory label condition, the written labels were replaced by recorded sound clips of a female saying “leebish” and “grecious.” In the location condition, subjects were told that some aliens lived on one side of the planet, and others lived on the other side. On each trial, after the subject responded (approach/escape) and auditory feedback was given, the alien moved up or down to signal where it “lived.” The motion started 300 ms after response feedback and lasted approximately 400 ms. The trial ended 1,300 ms after the alien stopped moving. Thus, the alien was visible for a longer total time in the location condition compared with the label conditions… To measure the degree to which subjects learned the association between stimuli and labels or locations, we included verification trials as part of the training procedure. Verification trials were presented after a random 10% of training trials. On each verification trial in the label conditions, one of the aliens appeared with a query asking: “Is this one leebish [grecious]? yes/no” (the label was randomly selected). On the verification trials in the location condition, the alien moved up or down, and subjects responded to the query, “Is this correct? yes/no”; subjects were allowed to repeat the motion numerous times before making their response. No feedback was provided for the verification trials.”

This was followed closely, with a few exceptions. First, I used a text-to-speech converter to generate the auditory labels “leebish” and “grecious” (in a female voice, like the original). Second, the verification trials were presented at the end of each block, rather than “after a random 10% of training trials.” This was due to a limitation of PsychoPy: because of how loops are set up within each block of trials, it is difficult to insert a new trial type within a block without it being repeated on every iteration of the loop. I do not think this will affect the replication results, because the verification trials are not used in the main analysis of interest, and with this approach participants get only one fewer verification trial than in the original (9 rather than 10). The last exception is that, in the verification trials for the location condition, participants cannot repeat the motion before making their response. I do not think this detracts from participants’ ability to make their choice: after the alien moves once, it remains where it stopped, and because the alien always starts in the center of the screen, the direction it moved stays clear throughout the trial.

I have separate task versions set up for each condition, with counterbalancing of the labels and categories set up for each of them through Pavlovia.

Auditory Label: https://run.pavlovia.org/ckaicher/lupyan_replication_1

Written Label: https://run.pavlovia.org/ckaicher/lupyan_replication_2

Location: https://run.pavlovia.org/ckaicher/lupyan_replication_3

No Label: https://run.pavlovia.org/ckaicher/lupyan_replication_4

Analysis Plan

Data will be cleaned and tidied such that trials will be excluded if their response time is more than 3 minutes. Participants will be excluded if they pressed the same arrow key for more than 90% of trials.

The key analysis of interest is a mixed ANOVA with Condition as a between-subjects factor and Block as a within-subjects factor. I will use this to test for a significant Block x Condition interaction and a main effect of Condition. Just as the authors did, I will also conduct planned comparisons to see whether the two label conditions differ from one another and whether the no-label and location conditions differ from one another. The authors also conducted two more ANOVAs that I will run as well: 1) a Condition x Block ANOVA with the pooled data from the label conditions and the pooled data from the no-label and location conditions, and 2) a Condition x Block ANOVA of just the written-label and location conditions.
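One way to specify a mixed ANOVA of this kind in base R is `aov()` with an `Error()` term for participants. The sketch below uses simulated data, not the replication data. The data frame `sim`, the simulated learning curve, and the choice of `aov()` over other packages are illustrative assumptions, and this is not necessarily the model the original authors fit.

```r
# Simulated data for illustration only (NOT the replication data).
set.seed(1)
n_per_condition <- 5
conditions <- c("No_Label", "Location", "Label_Written", "Label_Auditory")

# One row per participant per block (9 blocks, as in the task)
sim <- expand.grid(block = 1:9,
                   participant = 1:(n_per_condition * length(conditions)))
sim$condition <- conditions[(sim$participant - 1) %/% n_per_condition + 1]

# Hypothetical accuracy: gradual learning across blocks plus noise, bounded to [0, 1]
sim$mean_correct <- pmin(1, pmax(0, 0.5 + 0.04 * sim$block +
                                    rnorm(nrow(sim), sd = 0.1)))

sim$participant <- factor(sim$participant)
sim$block <- factor(sim$block)
sim$condition <- factor(sim$condition, levels = conditions)

# Condition is between-subjects; Block is within-subjects, so participant
# enters as an error stratum and Block is nested within participant.
acc.aov <- aov(mean_correct ~ condition * block + Error(participant / block),
               data = sim)
summary(acc.aov)
```

In the summary, the main effect of Condition is tested in the `participant` stratum, while Block and the Condition x Block interaction are tested in the `participant:block` stratum.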

I will also use participants’ performance on the verification trials to see if their verification accuracy correlates with training accuracy, and whether an ANOVA with Condition as a between-subjects factor is significant. If the ANOVA is significant, I will follow it up with pairwise comparisons of the conditions. These verification trial analyses will only be conducted for the auditory label, written label, and location conditions, as the no-label condition does not have verification trials.
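The verification-trial analyses could be set up along the following lines. This is a sketch on simulated data: the data frame `ver`, its column names, and the Holm correction for the pairwise follow-up are assumptions made for illustration, not choices stated in the analysis plan.

```r
# Simulated data for illustration only (NOT the replication data).
set.seed(2)
conditions <- c("Location", "Label_Written", "Label_Auditory")
ver <- data.frame(
  participant = factor(1:30),
  condition = factor(rep(conditions, each = 10), levels = conditions)
)
ver$training_acc <- runif(30, min = 0.5, max = 1)
ver$verification_acc <- pmin(1, pmax(0, ver$training_acc + rnorm(30, sd = 0.1)))

# 1) Does verification accuracy correlate with training accuracy?
cor_result <- cor.test(ver$verification_acc, ver$training_acc)
cor_result

# 2) Does verification accuracy differ across conditions?
ver.aov <- aov(verification_acc ~ condition, data = ver)
summary(ver.aov)

# 3) Pairwise follow-up only if the omnibus test is significant.
#    The Holm adjustment is an assumed choice.
p_condition <- summary(ver.aov)[[1]][["Pr(>F)"]][1]
if (p_condition < 0.05) {
  print(pairwise.t.test(ver$verification_acc, ver$condition,
                        p.adjust.method = "holm"))
}
```

The no-label condition is left out of `conditions` because it has no verification trials.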

Differences from Original Study

The original study was conducted in-person, with a sample of American undergraduate students between the ages of 18 and 24. This replication will be conducted online on Prolific, with adults of any age. These differences are not anticipated to affect the results of the study based on the claims of the original article.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

### Data Preparation

#### Load Relevant Libraries and Functions
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("emmeans")

theme_set(theme_classic(base_size = 18))

#### Import data
input_path <- "../data/Pilot B/raw_data"
output_path <- "../data/Pilot B/processed_data"

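#"\\.csv$" matches only file names ending in .csv (a bare ".csv" is a regex where "." matches any character)
files <- list.files(path = input_path,
                    pattern = "\\.csv$",
                    all.files = FALSE,
                    full.names = FALSE)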

#this is for making the condition names more concise during tidying
condition_names <- tibble(experimentName = c("Category_Training_LabelAuditory",
                                      "Category_Training_LabelWritten",
                                      "Category_Training_Location",
                                      "Category_Training_NoLabel"),
                          condition = c("Label_Auditory",
                                        "Label_Written",
                                        "Location",
                                        "No_Label"))

#### Data exclusion / filtering

clean_data <- function(dat, index) {
  #look up the concise condition name for this participant's experiment
  condition_name <- condition_names %>% 
    filter(experimentName == dat$exp_name[1]) %>% 
    pull(condition)
  
  dat_clean <- dat %>% 
    mutate(participant = index,
           counterbalance_group = counterbalance_group[1]) %>% 
    select(participant,
           counterbalance_group,
           exp_name,
           block,
           alien_stim,
           category,
           friendly,
           approach,
           key_resp_actual,
           correct,
           trial_started,
           trial_stopped) %>% 
    drop_na(alien_stim) %>% #keep only rows that show an alien (training trials)
    mutate(trial = row_number()) %>% #number trials without assuming exactly 144 rows
    filter(trial_stopped - trial_started <= 180) %>% #remove trial if more than 3 minutes (180 s) long
    mutate(condition = condition_name)
  
  return (dat_clean)
}

#to check whether participant pressed the same arrow key for over 90% of trials
#returns TRUE if the participant should be kept, FALSE if they should be excluded
check_responses <- function(dat) {
  arrow_keys <- c("[\"up\"]", "[\"down\"]", "[\"right\"]", "[\"left\"]")
  #proportion of the participant's remaining trials on which each key was pressed
  #(nrow(dat) rather than a fixed 144, because slow trials were already removed)
  key_proportions <- sapply(arrow_keys, function(key) {
    sum(dat$key_resp_actual == key, na.rm = TRUE) / nrow(dat)
  })
  return (all(key_proportions <= 0.9))
}

#### Prepare data for analysis - create columns etc.

#start from an empty tibble; columns are taken from the first participant's cleaned data
df.dat_clean_all <- tibble()

for (i in seq_along(files)) {
  df.dat <- read_csv(paste0(input_path, "/", files[i]))
  df.dat_clean <- clean_data(df.dat, i)
  if (check_responses(df.dat_clean)) {
    write.csv(df.dat_clean,
              paste0(output_path, "/", files[i], "_processed.csv"),
              row.names = FALSE)
    df.dat_clean_all <- bind_rows(df.dat_clean, df.dat_clean_all)
  } else {
    print("participant excluded for pressing the same button for over 90% of trials")
  }
}

Rows: 147 Columns: 57
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): counterbalance_group, key_resp_8_keys, gender, date, exp_name, psy...
dbl (35): consent_started, consent_stopped, key_resp_8_rt, age, frame_rate, ...
lgl  (8): counterbalance_remaining, key_resp_8_duration, key_resp_2_duration...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df.dat_clean_all$condition <- factor(df.dat_clean_all$condition,
                                     levels = c("No_Label",
                                                "Location",
                                                "Label_Written",
                                                "Label_Auditory"))

write.csv(df.dat_clean_all,
          paste0(output_path, "/all_participants.csv"),
          row.names = FALSE)

df.dat_clean_all_summary <- df.dat_clean_all %>% 
  group_by(participant,
           condition,
           block) %>% 
  summarize(mean_correct = mean(correct))

`summarise()` has grouped output by 'participant', 'condition'. You can
override using the `.groups` argument.

df.dat_clean_all_summary$block = factor(df.dat_clean_all_summary$block)

Confirmatory analysis (Pilot B)

The analyses as specified in the analysis plan.

acc_byBlock <- df.dat_clean_all %>% 
  mutate(block = as.numeric(block)) %>% 
  select(participant, trial, block, correct, condition) %>%
  group_by(participant,
           condition,
           block) %>%
  summarise_at(.vars = "correct",
               .funs = mean) %>% 
  ggplot(mapping = aes(x = block,
                       y = correct,
                       color = condition)) +
  stat_summary(fun.data = "mean_cl_boot",
               position = position_dodge(0.2)) +
  stat_summary(fun = mean, geom="line",
               position = position_dodge(0.2)) +
  scale_x_continuous(name = "Block",
                     breaks = c(1:9)) +
  coord_cartesian(ylim = c(0.0, 1.0)) +
  labs(y = "Proportion Correct") +
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=24),
        legend.text = element_text(size=20)) +
  scale_colour_brewer(palette = "Set1",
                      name = "",
                      labels = c("No Label", "Location", "Written Label", "Auditory Label"))

ggsave(filename = "/Users/ckaicher/Desktop/PhD/Fall 2025/Replication Project/cat_acc_byBlock.png",
       plot = acc_byBlock,
       width = 10)

Saving 10 x 5 in image

Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_segment()`).

acc_byBlock

Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_segment()`).

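#key analysis of interest -- Condition (between-subjects) x Block (within-subjects) model, and follow-up comparisons
#note: glm() fits a fixed-effects linear model that treats every participant-by-block mean as independent; it has no participant error term, so it does not model the repeated measures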
acc.lm <- glm(mean_correct ~ condition * block,
              data = df.dat_clean_all_summary)

acc.av <- acc.lm %>% 
  joint_tests()
acc.av

 model term      df1 df2 F.ratio p.value
 condition         3  27   4.428  0.0118
 block             8  27   0.778  0.6253
 condition:block  24  27   0.234  0.9997

#planned comparisons to see whether the two label conditions differ from one another and whether the no-label and location conditions differ from one another
emm <- emmeans(acc.lm,
               specs = ~ condition)

NOTE: Results may be misleading due to involvement in interactions

contrast_results <- contrast(emm,
                             list(auditory_vs_written = c(0, 0, -1, 1),
                                  location_vs_nolabel = c(-1, 1, 0, 0)),
                             adjust = "tukey")

summary(contrast_results)

Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons

 contrast            estimate     SE df t.ratio p.value
 auditory_vs_written  -0.0451 0.0953 27  -0.474  0.8700
 location_vs_nolabel   0.1944 0.0778 27   2.499  0.0373

Results are averaged over the levels of: block 
P value adjustment: sidak method for 2 tests

#comparing pooled Auditory Label and Written Label data to pooled Location and No Label data in new anova
df.pooled <- df.dat_clean_all_summary %>% 
  mutate(condition = case_when(
      condition == "Label_Auditory" ~ "labels_pooled",
      condition == "Label_Written" ~ "labels_pooled",
      condition == "Location" ~ "nonLabels_pooled",
      condition == "No_Label" ~ "nonLabels_pooled",
      .default = NA))

acc.lm_pooled <- glm(mean_correct ~ condition * block,
                                 data = df.pooled)

acc.av_pooled <- acc.lm_pooled %>%
  joint_tests()
acc.av_pooled

 model term      df1 df2 F.ratio p.value
 condition         1  45   8.312  0.0060
 block             8  45   0.980  0.4636
 condition:block   8  45   0.335  0.9476

#comparing just Written Label and Location conditions in new anova
df.justWrittenAndLocation <- df.dat_clean_all_summary %>% 
  filter(condition == "Label_Written" | condition == "Location")

acc.lm_WrittenAndLocation <- glm(mean_correct ~ condition * block,
                                 data = df.justWrittenAndLocation)

acc.av_WrittenAndLocation <- acc.lm_WrittenAndLocation %>%
  joint_tests()
acc.av_WrittenAndLocation

 model term      df1 df2 F.ratio p.value
 condition         1  18   3.219  0.0896
 block             8  18   2.710  0.0376
 condition:block   8  18   0.312  0.9513

####verification accuracy analyses####
#I realized my experiment output was not correctly storing the accuracy of the verification trials, so I do not have the analyses here (they are not key analyses). However, I have fixed the issue since collecting my pilot B data, so I will be able to implement the analysis with my full set of data.

#see if verification accuracy correlates with training accuracy

#check for differences in verification accuracy across conditions. If the ANOVA is significant, I will follow it up with pairwise comparisons of the conditions.

Side-by-side graph with original graph is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.