Replication of ‘Language is not Just for Talking: Redundant Labels Facilitate Learning of Novel Categories’ by Lupyan, Rakison, & McClelland (2007, Psychological Science)
Author
Caroline Kaicher (ckaicher@stanford.edu)
Published
December 1, 2025
Introduction
Justification
I am interested in how labels help children and adults learn categories. Lupyan, Rakison, and McClelland (2007) contribute to this question by showing that adults learn object categories (in this case, categories of aliens) faster with labels than with no labels or with other nonlinguistic cues. This paper is particularly compelling because the labels are “redundant”: they do not give participants any additional information about the category distinctions. It is therefore presumed that there is something “special” about having a label to associate with category exemplars during category learning. Thus, while words play an important role in category learning by pointing out useful category distinctions in the environment, they may play an even bigger role in facilitating the learning process itself. However, the exact nature and mechanism of this role are unknown.
Stimuli and Procedures
Lupyan, Rakison, and McClelland (2007) consists of two experiments, and I will be replicating Experiment 2. To conduct this experiment, I will need to recreate their category learning task. I will use PsychoPy, as this is the experiment-building software I am most familiar with, and host it online using Pavlovia. The task will have 4 conditions: No Label, Written Label, Auditory Label, and Location (nonlinguistic cue). The stimuli I will need are recordings of the auditory labels and the alien images used in the original experiment. The images from the original study were created by Mike Tarr’s lab (the YUFO stimulus set) and are publicly available on their website.
The main challenge I anticipate for this study is finding the specific alien images the authors used in the two categories. Luckily, all the images they used are shown in Figure 1 of the paper, but there are a lot of images in the original stimulus set, so I will need to comb through them to find the exact ones. Other than that, the description of the category learning task seems clear and includes all the necessary details to recreate it.
Original Paper: https://github.com/psych251/lupyan2007/blob/main/original_paper/lupyan-et-al-2007-language-is-not-just-for-talking-redundant-labels-facilitate-learning-of-novel-categories-2.pdf
Methods
Power Analysis
The authors reported a partial eta-squared effect size of 0.07 for the interaction between condition and block. Using f = sqrt(η² / (1 − η²)), this translates to a partial Cohen’s f of roughly 0.27. The corresponding non-partial effect sizes are an eta-squared of about 0.066 and a Cohen’s f of about 0.264, indicating a medium effect size for the key analysis of interest.
I conducted a power analysis in G*Power for this effect size in a mixed ANOVA. It indicates that a sample of 24 participants is needed to achieve 80% power, 28 participants to achieve 90% power, and 32 participants to achieve 95% power.
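The conversion between eta-squared and Cohen’s f can be checked directly in R. This is a minimal sketch using the standard relation f = sqrt(η² / (1 − η²)); the helper name `eta2_to_f` is hypothetical and is not part of the original analysis code. Small differences from hand-reported values can arise from rounding of the input eta-squared.

```r
# Hypothetical helper: convert (partial) eta-squared to Cohen's f
# using the standard relation f = sqrt(eta2 / (1 - eta2)).
eta2_to_f <- function(eta2) {
  stopifnot(all(eta2 >= 0 & eta2 < 1))
  sqrt(eta2 / (1 - eta2))
}

# Effect sizes quoted in the text for the Condition x Block interaction
f_partial     <- eta2_to_f(0.07)   # from the reported partial eta-squared
f_non_partial <- eta2_to_f(0.066)  # from the approximate non-partial eta-squared

round(c(partial = f_partial, non_partial = f_non_partial), 3)
```

The resulting f can then be entered into G*Power as the effect size for the within-between interaction.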
Planned Sample
I plan to stop collecting data once I have 76 participants, 19 in each condition. Even though adequate power can be achieved with fewer participants, I decided to match the original study’s sample size. The original study had 75 participants, and I adjusted it to 76 to have an equal number of participants in each group.
Participants will be required to be fluent in English, so that they will understand the instructions of the task. I will also limit the sample to participants from the United States, since the original sample consisted of college students living in the US.
Materials
“The stimuli were a subset of the YUFO stimulus set (Gauthier, James, Curby, & Tarr, 2003). Items in one category (shown on the left in Fig. 1) had flatter bases and a subtle ridge on their ‘heads.’ Items in the other category (shown on the right in Fig. 1) had more rounded bases and smoother heads…The stimuli were presented on a black background on a 17-in. computer screen and subtended 8° of visual angle. Responses were collected using a gamepad controller. For the [written] label condition, the categories were associated with the nonsense labels ‘leebish’ and ‘grecious,’ which were displayed in a white, 16-point font.”
The alien images used in the replication are exactly the same as the original, and the same labels were used for the categories. However, since the replication is done online, the participants complete it on their personal computer. This means that the task could be presented on any screen size and visual angle. Also, the responses were collected using the participants’ keyboards.
Procedure
“Subjects were told to imagine that they were explorers on another planet and were learning about alien life forms. Their task was to determine which aliens they should approach and which they should move away from. On each training trial, 1 of the 16 aliens appeared in the center of the screen. After 500 ms, an outline of a character in a space suit (the “explorer”) appeared in one of four positions—to the left of, to the right of, above, or below the alien. Subjects were instructed to respond with the appropriate direction key depending on the category of the alien. For instance, if the explorer appeared above the alien, they needed to press the “down” key to move toward the alien or the “up” key to move away; after the key press, the explorer moved toward or away from the alien, as indicated. Auditory feedback—a buzz for an incorrect response and a bell for a correct response—sounded 200 ms after the explorer stopped moving. In the [written] label condition, a printed label (“leebish” or “grecious”) appeared to the right of the alien 300 ms after the feedback. After another 1,500 ms, the alien (and label, in the [written] label condition) disappeared from the screen, and a fixation cross marked the start of the next trial. The total trial duration and exposure to the stimulus were equal for the two conditions. The pairing of the labels with the categories (move away vs. move toward) and with the perceptual stimuli (left vs. right side of Fig. 1) was counterbalanced across subjects. Subjects in the label condition were told that previous visitors to the planet had found it useful to name the two kinds of aliens, and that they should pay careful attention to the labels. All subjects received the same number of categorization trials (nine blocks of 16 trials each) and had equal exposure to the stimuli. The only difference between the two conditions was whether or not a verbal label appeared after each response.”
This procedure is described for Experiment 1 of the study, which has only 2 conditions: [written] label vs. no label. Experiment 2 uses the same procedure but adds two other conditions: auditory label and location. Everything described in the procedure above was followed exactly, except that I did not use the same bell and buzz sounds or the same astronaut character as the original. Here is where they discuss the additional procedural considerations for Experiment 2:
“The materials and procedure were identical to those used in Experiment 1 with the following exceptions: In the auditory label condition, the written labels were replaced by recorded sound clips of a female saying “leebish” and “grecious.” In the location condition, subjects were told that some aliens lived on one side of the planet, and others lived on the other side. On each trial, after the subject responded (approach/escape) and auditory feedback was given, the alien moved up or down to signal where it “lived.” The motion started 300 ms after response feedback and lasted approximately 400 ms. The trial ended 1,300 ms after the alien stopped moving. Thus, the alien was visible for a longer total time in the location condition compared with the label conditions…To measure the degree to which subjects learned the association between stimuli and labels or locations, we included verification trials as part of the training procedure. Verification trials were presented after a random 10% of training trials. On each verification trial in the label conditions, one of the aliens appeared with a query asking: “Is this one leebish [grecious]? yes/no” (the label was randomly selected). On the verification trials in the location condition, the alien moved up or down, and subjects responded to the query, “Is this correct? yes/no”; subjects were allowed to repeat the motion numerous times before making their response. No feedback was provided for the verification trials.”
This was followed closely, with a few exceptions. First, I used a text-to-speech converter to generate the auditory labels “leebish” and “grecious” (in a female voice, like the original). Second, the verification trials were presented at the end of each block, rather than “after a random 10% of training trials.” This was due to limitations of PsychoPy, specifically the set-up of loops within each block of trials, which makes it difficult to insert a new trial type within a block without it being repeated on every iteration of the loop. I do not think that this will affect the replication results because the verification trials are not used in the main analysis of interest, and with this set-up participants get only one fewer verification trial than in the original (9 rather than 10). The last exception is that in the verification trials for the location condition, participants cannot repeat the motion before making their response. I do not think this detracts from participants’ ability to make their choice because after the alien moves once, it remains in the location where it stopped, so it is clear to participants which direction the alien moved (the alien always starts in the center of the screen).
I have separate task versions set up for each condition, with counterbalancing of the labels and categories set up for each of them through Pavlovia.
No Label: https://run.pavlovia.org/ckaicher/lupyan_replication_4
Analysis Plan
Data will be cleaned and tidied such that trials will be excluded if their response time is more than 3 minutes. Participants will be excluded if they pressed the same arrow key for more than 90% of trials.
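The two exclusion rules can be illustrated on a small toy data set. This is only a sketch under stated assumptions: the data frame, its column names (`participant`, `key`, `rt_sec`), and the choice to compute the same-key proportion over all of a participant’s trials (before trial-level exclusion) are hypothetical and are not taken from the actual PsychoPy output.

```r
# Toy data: 2 hypothetical participants with 10 trials each
trials <- data.frame(
  participant = rep(c("p1", "p2"), each = 10),
  key = c(rep("up", 10), rep(c("up", "down", "left", "right", "up"), 2)),
  rt_sec = c(rep(1.2, 9), 200, rep(0.9, 10))
)

# Rule 1: exclude trials whose response time exceeds 3 minutes (180 s)
trials_kept <- trials[trials$rt_sec <= 180, ]

# Rule 2: exclude participants who pressed the same key on more than 90% of trials.
# Assumption: the proportion is taken over all recorded trials of that participant.
max_key_prop <- tapply(trials$key, trials$participant,
                       function(k) max(table(k)) / length(k))
excluded <- names(max_key_prop)[max_key_prop > 0.9]

final <- trials_kept[!trials_kept$participant %in% excluded, ]
```

Dividing by each participant’s own trial count, rather than by a fixed number of trials, keeps the 90% threshold meaningful even if some of that participant’s trials have already been dropped.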
The key analysis of interest is a mixed ANOVA with Condition as a between-subjects factor and Block as a within-subjects factor. I will use this to see if there is a significant Block x Condition interaction and a main effect of Condition. Just as the authors did, I will also conduct planned comparisons of the conditions to see whether the two label conditions differ from one another and whether the no-label and location conditions differ from one another. The authors also conducted 2 more ANOVAs that I will run as well: 1) a Condition x Block ANOVA with the pooled data from the label conditions and pooled data from the no-label and location conditions, and 2) a Condition x Block ANOVA of just the written-label and location conditions.
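One way to specify a mixed ANOVA of this shape in base R is `aov()` with an `Error()` term for the repeated factor. The sketch below runs on simulated data; the sample size, the accuracy values, and the treatment of Block as a 9-level factor are assumptions for illustration, and this is not necessarily the model specification used for the reported results.

```r
set.seed(1)

# Simulated per-participant, per-block accuracy (hypothetical values)
n_per_cond <- 5
sim <- expand.grid(
  block = 1:9,
  id = 1:n_per_cond,
  condition = c("Label_Auditory", "Label_Written", "Location", "No_Label")
)
sim$participant <- factor(paste(sim$condition, sim$id, sep = "_"))
sim$mean_correct <- pmin(1, pmax(0, 0.5 + 0.04 * sim$block +
                                    rnorm(nrow(sim), sd = 0.1)))
sim$block <- factor(sim$block)
sim$condition <- factor(sim$condition)

# Condition is between-subjects; Block is within-subjects (nested in participant)
fit <- aov(mean_correct ~ condition * block + Error(participant / block),
           data = sim)
summary(fit)
```

In the summary, the Condition main effect is tested in the `participant` stratum, while Block and the Condition x Block interaction are tested in the `participant:block` stratum.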
I will also use participants’ performance on the verification trials to see if their verification accuracy correlates with training accuracy, and whether an ANOVA with Condition as a between-subjects factor is significant. If the ANOVA is significant, I will follow it up with pairwise comparisons of the conditions. These verification trial analyses will only be conducted for the auditory label, written label, and location conditions, as the no-label condition does not have verification trials.
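The verification-trial analyses could look roughly like the following. This is a sketch on simulated data: the data frame `ver`, its column names, and the use of `TukeyHSD()` for the pairwise follow-up are assumptions, since the plan does not name a specific pairwise method.

```r
set.seed(2)

# Simulated per-participant accuracies for the three conditions with verification trials
ver <- data.frame(
  condition = factor(rep(c("Label_Auditory", "Label_Written", "Location"),
                         each = 8)),
  training_acc = runif(24, min = 0.5, max = 1)
)
ver$verification_acc <- pmin(1, pmax(0, ver$training_acc +
                                        rnorm(24, sd = 0.1)))

# 1) Does verification accuracy correlate with training accuracy?
ct <- cor.test(ver$verification_acc, ver$training_acc)

# 2) One-way ANOVA with Condition as a between-subjects factor
fit_ver <- aov(verification_acc ~ condition, data = ver)
p_condition <- summary(fit_ver)[[1]][["Pr(>F)"]][1]

# 3) Pairwise follow-up only if the omnibus test is significant
if (p_condition < 0.05) {
  print(TukeyHSD(fit_ver))
}
```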
Differences from Original Study
The original study was conducted in-person, with a sample of American undergraduate students between the ages of 18 and 24. This replication will be conducted online on Prolific, with adults of any age. These differences are not anticipated to affect the results of the study based on the claims of the original article.
Methods Addendum (Post Data Collection)
You can comment this section out prior to final report with data collection.
Actual Sample
Sample size, demographics, data exclusions based on rules spelled out in analysis plan
Differences from pre-data collection methods plan
Any differences from what was described as the original plan, or “none”.
Results
Data preparation
Data preparation following the analysis plan.
### Data Preparation
#### Load Relevant Libraries and Functions
library("tidyverse")
library("emmeans")
theme_set(theme_classic(base_size = 18))

#### Import data
input_path <- "../data/Pilot B/raw_data"
output_path <- "../data/Pilot B/processed_data"
files <- list.files(path = input_path,
                    pattern = ".csv",
                    all.files = FALSE,
                    full.names = FALSE)

# this is for making the condition names more concise during tidying
condition_names <- tibble(
  experimentName = c("Category_Training_LabelAuditory",
                     "Category_Training_LabelWritten",
                     "Category_Training_Location",
                     "Category_Training_NoLabel"),
  condition = c("Label_Auditory", "Label_Written", "Location", "No_Label"))

#### Data exclusion / filtering
clean_data = function(dat, index) {
  dat_clean <- dat %>%
    mutate(participant = index) %>%
    mutate(counterbalance_group = counterbalance_group[1]) %>%
    select(participant, counterbalance_group, exp_name, block, alien_stim,
           category, friendly, approach, key_resp_actual, correct,
           trial_started, trial_stopped) %>%
    drop_na(alien_stim) %>%
    mutate(trial = 1:144) %>%
    filter(trial_stopped - trial_started <= 180) %>% # remove trial if more than 3 minutes long
    mutate(condition = filter(condition_names, dat$exp_name[1] == experimentName)$condition)
  return (dat_clean)
}

# to check whether participant pressed the same button for over 90% of trials
check_responses = function(dat) {
  if (nrow(filter(dat, key_resp_actual == "[\"up\"]"))/144 > 0.9 |
      nrow(filter(dat, key_resp_actual == "[\"down\"]"))/144 > 0.9 |
      nrow(filter(dat, key_resp_actual == "[\"right\"]"))/144 > 0.9 |
      nrow(filter(dat, key_resp_actual == "[\"left\"]"))/144 > 0.9) {
    return (FALSE)
  } else {
    return (TRUE)
  }
}

#### Prepare data for analysis - create columns etc.
df.dat_clean_all <- tibble(participant = c(),
                           block = c(),
                           condition = c(),
                           counterbalance.group = c(),
                           alien_stim = c(),
                           category = c(),
                           friendly = c(),
                           approach = c(),
                           key_resp.actual = c(),
                           correct = c())

for (i in 1:length(files)) {
  df.dat <- read_csv(paste0(input_path, "/", files[i]))
  df.dat_clean <- clean_data(df.dat, i)
  if (check_responses(df.dat_clean)) {
    write.csv(df.dat_clean,
              paste0(output_path, "/", files[i], "_processed.csv"),
              row.names = FALSE)
    df.dat_clean_all <- rbind(df.dat_clean, df.dat_clean_all)
  } else {
    print("participant excluded for pressing the same button for over 90% of trials")
  }
}
[Figure: acc_byBlock plot. ggplot2 warning: Removed 9 rows containing missing values or values outside the scale range (`geom_segment()`).]
# key analysis of interest -- mixed ANOVA with Condition as a between-subjects factor and Block as a within-subjects factor, and follow-up comparisons
acc.lm <- glm(mean_correct ~ condition * block,
              data = df.dat_clean_all_summary)
acc.av <- acc.lm %>%
  joint_tests()
acc.av

# planned comparisons to see whether the two label conditions differ from one another and whether the no-label and location conditions differ from one another
emm <- emmeans(acc.lm,
               specs = ~ condition)
NOTE: Results may be misleading due to involvement in interactions
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
 contrast            estimate     SE df t.ratio p.value
 auditory_vs_written  -0.0451 0.0953 27  -0.474  0.8700
 location_vs_nolabel   0.1944 0.0778 27   2.499  0.0373
Results are averaged over the levels of: block
P value adjustment: sidak method for 2 tests
# comparing pooled Auditory Label and Written Label data to pooled Location and No Label data in new anova
df.pooled <- df.dat_clean_all_summary %>%
  mutate(condition = case_when(
    condition == "Label_Auditory" ~ "labels_pooled",
    condition == "Label_Written" ~ "labels_pooled",
    condition == "Location" ~ "nonLabels_pooled",
    condition == "No_Label" ~ "nonLabels_pooled",
    .default = NA))
acc.lm_pooled <- glm(mean_correct ~ condition * block,
                     data = df.pooled)
acc.av_pooled <- acc.lm_pooled %>%
  joint_tests()
acc.av_pooled

#### verification accuracy analyses ####
# I realized my experiment output was not correctly storing the accuracy of the verification trials, so I do not have the analyses here (they are not key analyses). However, I have fixed the issue since collecting my pilot B data, so I will be able to implement the analysis with my full set of data.
# see if verification accuracy correlates with training accuracy
# check for differences in verification accuracy across conditions. If the ANOVA is significant, I will follow it up with pairwise comparisons of the conditions.
Side-by-side graph with original graph is ideal here
Exploratory analyses
Any follow-up analyses desired (not required).
Discussion
Summary of Replication Attempt
Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.
Commentary
Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.