Introduction

For the replication project, I am choosing to replicate Experiment 1 from “A familiar-size Stroop effect in the absence of basic-level recognition”, by Bria Long and Talia Konkle (Cognition, 2017). The findings suggest that real-world object size is represented in the “mid-level” perceptual features of objects, independent of semantic knowledge about whether an object is large or small.

For this project, subjects will see images of two unrecognizable “texform” stimuli and be asked to judge which image is larger or smaller on the screen. The original authors found a size Stroop effect, in which subjects were faster when the image that was larger on the screen was also larger in real life (congruent) than when the image that was larger on the screen was actually smaller in real life (incongruent). This project will require rendering two texforms side by side, one larger than the other, and collecting reaction time data. The stimuli used in the manuscript are publicly available on the authors’ website, and the details of the procedure are clearly outlined in the manuscript.

preregistration: https://osf.io/evft8

github repository: github.com/psych251/long2017

original paper: github.com/psych251/long2017/original_paper

experiment (bigger/smaller): http://web.stanford.edu/~ekubota/SizeStroop/sizeStroop_experiment_biggerSmaller.html

experiment (smaller/bigger): http://web.stanford.edu/~ekubota/SizeStroop/sizeStroop_experiment_smallerBigger.html

Methods

Power Analysis

The original paper reports a congruency effect of Cohen’s d = 0.95; this value is entered as the effect size f in wp.rmanova below. The analyses below give the sample sizes needed to detect an effect of this size with 80%, 90%, and 95% power in a repeated-measures ANOVA. These sample sizes are all feasible, and matching the original sample size of N = 16 yields over 90% power.

library(WebPower)
## Loading required package: MASS
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: lavaan
## This is lavaan 0.6-5
## lavaan is BETA software! Please report any bugs.
## Loading required package: parallel
## Loading required package: PearsonDS

80% power:

wp.rmanova(n = NULL, ng = 1, nm = 2, f = 0.95, nscor = 1,
  alpha = 0.05, power = .8, type = 1)
## Repeated-measures ANOVA analysis
## 
##            n    f ng nm nscor alpha power
##     10.77326 0.95  1  2     1  0.05   0.8
## 
## NOTE: Power analysis for within-effect test
## URL: http://psychstat.org/rmanova

90% power:

wp.rmanova(n = NULL, ng = 1, nm = 2, f = 0.95, nscor = 1,
  alpha = 0.05, power = .9, type = 1)
## Repeated-measures ANOVA analysis
## 
##           n    f ng nm nscor alpha power
##     13.7079 0.95  1  2     1  0.05   0.9
## 
## NOTE: Power analysis for within-effect test
## URL: http://psychstat.org/rmanova

95% power:

wp.rmanova(n = NULL, ng = 1, nm = 2, f = 0.95, nscor = 1,
  alpha = 0.05, power = .95, type = 1)
## Repeated-measures ANOVA analysis
## 
##           n    f ng nm nscor alpha power
##     16.4545 0.95  1  2     1  0.05  0.95
## 
## NOTE: Power analysis for within-effect test
## URL: http://psychstat.org/rmanova

Planned Sample

N = 16 (as in the original paper).

Materials

“The stimulus set consisted of 60 texform images generated from images of 30 big, inanimate objects and 30 small, inanimate objects (see Fig. 2). Big objects included things like cars and tables and were chair-sized and bigger; Small objects included things like mugs and cameras, and were table-lamp sized and smaller. The texform stimuli were synthesized using an algorithm that preserves mid-level image features from the original images, such as local combinations of orientations (see Long et al., 2016 for more detailed description of the procedure; see also Freeman & Simoncelli, 2011).”

These materials are available at https://github.com/brialorelle/TexformSizeStroop.

Procedure

The experiment will be run online via Amazon Mechanical Turk (MTurk), with stimuli rendered in the participant’s web browser.

“Two grayscale texforms appeared on either side of fixation on a white background. On half the trials, participants were asked to judge which texform was visually bigger on the screen as fast as possible. On the other half of the trials, participants were asked to judge which texform was visually smaller on the screen as fast as possible. Participants indicated which side of the screen corresponded to the visually bigger or visually smaller image by pressing either the m key or the c key. The images remained present on the display until the participant responded. High accuracy was encouraged as incorrect responses resulted in feedback and a 5s interval before the next trial began. After a correct response, there was a 900 ms interval before the next trial. Trials were blocked into 4 sets, where the task switched after each set. Half of the participants started with the “visually bigger” task, and half of the participants started with the “visually smaller” task. To orient people to the tasks, all participants first saw example trials and read instructions, and then completed 24 practice trials with both task instructions (12 in each task) in the same counterbalanced order as the experiment. The critical manipulation was that the two texforms on each display were presented at visual sizes that were either congruent or incongruent with the real-world size of the original objects. For example, in a congruent display, a shoe texform would be presented at a visually small size and a couch texform would be presented at a visually big size, as typically shoes are small and couches are big in the world. On incongruent trials, this was reversed: a shoe texform would be presented at a visually big size, and a couch texform would be presented at a visually small size. The fact the texforms were generated from objects of different real-world sizes was not mentioned at any time during the experiment. Furthermore, participants never saw a version of the Size-Stroop task with recognizable objects.”

“At an item level, each big object texform and small object texform were counterbalanced such that they appeared equally in both congruent/incongruent configurations, with the correct answer on the left/right side of the screen, and across both visual size tasks. Big and small object texforms were pseudo-randomly paired, such that the same random pairs of big and small texforms occurred together in the first half of the experiment for each participant. In the second half of the experiment, big and small object texforms were randomly paired together; this procedure was used in Konkle and Oliva (2012) to take into account pictorial issues related to recognizable objects, and for consistency we followed the exact procedure here. Overall, there were 480 trials (30 pairs of objects x 2 congruent/incongruent conditions x 2 left/right sides of screen x 2 bigger/smaller tasks x 2 different pairings of texforms; yielding 240 congruent/240 incongruent trials).”
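To make the design concrete, here is a minimal sketch (my own, not the original experiment code) that enumerates the factorial structure described above and confirms the trial counts; the column names are illustrative.

design <- expand.grid(
  pair       = 1:30,                           # 30 pseudo-random big/small texform pairs
  congruency = c("congruent", "incongruent"),
  side       = c("left", "right"),             # side of the correct answer
  task       = c("bigger", "smaller"),
  pairing    = c("first-half", "second-half")  # two different pairings of texforms
)
nrow(design)              # 480 trials in total
table(design$congruency)  # 240 congruent, 240 incongruent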

Analysis Plan

Incorrect trials and trials with reaction times (RTs) shorter than 1200 ms or longer than 2500 ms will be excluded. Subjects with fewer than 80% of trials correct will be excluded.
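A sketch of these planned exclusions (not the exact code used below), assuming a long-format data frame df_long with columns subject, accuracy (1 = correct), and rt, as constructed in the Results section:

library(dplyr)
trimmed <- df_long %>%
  group_by(subject) %>%
  filter(mean(accuracy == 1) >= .80) %>%           # drop subjects below 80% accuracy
  ungroup() %>%
  filter(accuracy == 1, rt > 1200, rt < 2500)      # drop incorrect and out-of-range RT trials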

Per the original experiment, “Trimmed reaction times were analyzed using a 2 x 2 repeated-measures ANOVA, with congruency (congruent/incongruent) and task (bigger/smaller on the screen) as factors.”

The key analysis of interest will be the main effect of congruency in the repeated measures ANOVA.
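Concretely, the planned test is the congruency term from a call along these lines (using the ez package; DataTrimmed denotes the trimmed long-format data produced by the exclusions above):

library(ez)
aov.out <- ezANOVA(data = DataTrimmed, dv = .(rt), wid = .(subject),
                   within = .(congruency, task), type = 3)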

“Overall, we found evidence for a Size-Stroop effect: on incongruent trials, participants were slower to make visual size judgments when the real-world size of original objects were incongruent with their sizes on the screen (Mdiff = 12.92 ms, SDdiff = 13.62 ms, main effect of congruency, F(1, 15) = 14.3, p = 0.002, ηp2 = 0.489, Cohen’s d = 0.95, Fig. 2B)”
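As a quick sanity check (my own arithmetic, not from the paper), the reported Cohen’s d follows from the difference statistics above, and the implied paired t recovers the reported F:

m_diff  <- 12.92          # mean congruency effect (ms)
sd_diff <- 13.62          # SD of the per-subject congruency effect (ms)
n       <- 16
d <- m_diff / sd_diff     # ~0.95, the reported Cohen's d
t <- d * sqrt(n)          # ~3.79, the implied paired t
t^2                       # ~14.4, close to the reported F(1, 15) = 14.3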

Differences from Original Study

Data for this study will be collected on MTurk, whereas the original study was conducted in the lab with Harvard affiliates. In addition, in the original study all subjects viewed the stimuli on the same monitor at the same viewing distance, so the visual angle of the stimuli was uniform; on MTurk, screen size and viewing distance will vary across participants.

Actual Sample

N = 16 (n = 8 completed the blocks in the order bigger/smaller/bigger/smaller; n = 8 in the order smaller/bigger/smaller/bigger).

Differences from pre-data collection methods plan

None.

Results

Data preparation

Data preparation following the analysis plan.

### Data Preparation
#### Load Relevant Libraries and Functions
library(tidyverse)
## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ dplyr::select() masks MASS::select()
## ✖ tidyr::unpack() masks Matrix::unpack()
library(ez)
## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4
library(stringi)
library(grid)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(knitr)

Import data

df = read.csv("data/finalResults.csv")
original_df = read.csv('../original_paper/data/E1GroupData.csv')

Set variables to hold our measures

accuracy = character()
rt = character()
congruency = character()
subject = character()
task = character()

Tidy data

for (i in 1:NROW(df)) { # for each subject, split their data
  subAccuracy = character()
  subCongruency = character()
  subTask = character()
  subRT = character()
  
  for (b in 1:4) { # for each block
    # Split out each trial, because each block is stored as one long comma-separated string
    subAccuracy = c(subAccuracy, unlist(strsplit(toString(df[paste0("Answer.accuracy",b)][i,1]),",")))
    subCongruency = c(subCongruency, unlist(strsplit(toString(df[paste0("Answer.congruency",b)][i,1]),",")))
    subRT = c(subRT, unlist(strsplit(toString(df[paste0("Answer.rt",b)][i,1]),",")))
    subTask = c(subTask, unlist(strsplit(toString(df[paste0("Answer.task",b)][i,1]),",")))
  }
  # remove empty rows
  subAccuracy <- stri_remove_empty(subAccuracy, na_empty = FALSE)
  subCongruency <- stri_remove_empty(subCongruency, na_empty = FALSE)
  subRT <- stri_remove_empty(subRT, na_empty = FALSE)
  subTask <- stri_remove_empty(subTask, na_empty = FALSE)
  
 # re-code accuracy as 1s and 0s 
  subAccuracy <-
    ifelse(subAccuracy == "correct", 1, ifelse(subAccuracy == "incorrect", 0, NA))

  # Only include this subject's data if at least 80% of the 480 trials were answered correctly.
  if (sum(subAccuracy) >= .80*(480)){
    accuracy = c(accuracy,subAccuracy)
    congruency = c(congruency,subCongruency)
    rt= c(rt,subRT)
    task= c(task, subTask)
    subject= c(subject, replicate(480,i))
  }
  rm(subAccuracy, subCongruency, subRT, subTask)
}

Convert RT to integers

rt = strtoi(rt)

Put everything in a new dataframe!

df_long <- data.frame(subject, accuracy, congruency, rt, task)
# Reassign the columns so they remain plain character/numeric vectors
# (data.frame() would otherwise coerce the character columns to factors)
df_long$accuracy <- accuracy
df_long$congruency <- congruency
df_long$rt <- rt
df_long$task <- task

Histograms of RTs. Here, I plot histograms of the original data vs. my replication data. Note that my distribution is shifted toward slower RTs, likely because this iteration was run on MTurk.

Original plot

mean(original_df$RT) 
## [1] 526.5711
sd(original_df$RT)
## [1] 309.4825
h1 <- ggplot(original_df, aes(x = RT)) +
        geom_histogram() + 
        xlim(0,3000) + 
  ggtitle("Original Study")

Replication plot

mean(df_long$rt)
## [1] 1637.43
sd(df_long$rt)
## [1] 845.5667
h2 <- ggplot(df_long, aes(x = rt)) +
        geom_histogram() +
        xlim(0, 3000) + 
  ggtitle("Replication")
grid.arrange(h1,h2,nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 20 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 199 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

Trim data (remove short and long RTs, and incorrect trials).

DataTrimmed <- df_long[df_long$rt > 1200 & df_long$rt < 2500 & df_long$accuracy == 1,]
originalDataTrimmed <- original_df[original_df$RT > 200 & original_df$RT < 1500 & original_df$isCorrect == 1,]
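As a quick descriptive check (my own addition, not in the original analysis), the proportion of trials retained after trimming in each dataset can be computed as:

nrow(DataTrimmed) / nrow(df_long)               # proportion of replication trials retained
nrow(originalDataTrimmed) / nrow(original_df)   # proportion of original trials retained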

Confirmatory analysis

First, reproduce the original results.

originalAov.out  = ezANOVA(data = originalDataTrimmed, dv=.(RT), wid=.(sid), within=.(condLabel, taskNum), type=3)
## Warning: Converting "sid" to factor for ANOVA.
## Warning: "taskNum" will be treated as numeric.
## Warning: Collapsing data to cell means. *IF* the requested effects are a
## subset of the full design, you must use the "within_full" argument, else
## results may be inaccurate.
## Warning: There is at least one numeric within variable, therefore aov()
## will be used for computation and no assumption checks will be obtained.
kable(originalAov.out)
|Effect            | DFn| DFd|          F|         p|p<.05 |       ges|
|:-----------------|---:|---:|----------:|---------:|:-----|---------:|
|condLabel         |   1|  15| 14.3398787| 0.0017909|*     | 0.0955299|
|taskNum           |   1|  15|  0.3775737| 0.5481125|      | 0.0180259|
|condLabel:taskNum |   1|  15|  1.3337896| 0.2661994|      | 0.0140492|

Next, run the same analysis on replication data

aov.out = ezANOVA(data = DataTrimmed, dv=.(rt), wid=.(subject), within=.(congruency,task), type=3)
## Warning: Converting "congruency" to factor for ANOVA.
## Warning: Converting "task" to factor for ANOVA.
## Warning: Collapsing data to cell means. *IF* the requested effects are a
## subset of the full design, you must use the "within_full" argument, else
## results may be inaccurate.
(kable(aov.out))
|   |Effect          | DFn| DFd|         F|         p|p<.05 |       ges|
|:--|:---------------|---:|---:|---------:|---------:|:-----|---------:|
|2  |congruency      |   1|  15| 13.201595| 0.0024519|*     | 0.0060300|
|3  |task            |   1|  15|  7.971761| 0.0128378|*     | 0.0188062|
|4  |congruency:task |   1|  15|  5.702538| 0.0305323|*     | 0.0005155|

Prepare original data for plot: First, we will get the means for each subject.

sidMeans <- originalDataTrimmed %>%
    group_by(sid) %>%
    summarize(meanRT = mean(RT)) 

Next, get means for each condition.

origPlot <- originalDataTrimmed %>%
    group_by(condLabel) %>%
    summarize(meanRT = mean(RT), SEM = sd(RT)/sqrt(length(unique(originalDataTrimmed$sid))))

Next, get means by subject and condition.

origTall <- originalDataTrimmed %>%
    group_by(condLabel,sid) %>%
    summarize(meanRT = mean(RT))  

Make a wide-form version of the subject/condition table.

origWide = pivot_wider(origTall,names_from= condLabel, values_from = meanRT)

Calculate the repeated-measures (within-subject) standard error, as in the original paper.

congruentErr = sd(origWide$congruent - sidMeans$meanRT)/sqrt(length(origWide$congruent))
incongruentErr = sd(origWide$incongruent - sidMeans$meanRT)/sqrt(length(origWide$incongruent))
origPlot$SEM_repeated <- c(congruentErr, incongruentErr)

Prepare replication data for plot: First, we will get the means for each subject.

rep_sidMeans <- DataTrimmed %>%
    group_by(subject) %>%
    summarize(meanRT = mean(rt)) 

Next, get means for each condition

repPlot <- DataTrimmed %>%
    group_by(congruency) %>%
    summarize(meanRT = mean(rt), SEM = sd(rt)/sqrt(length(unique(DataTrimmed$subject))))

Next, get means by subject and condition.

repTall <- DataTrimmed %>%
    group_by(congruency,subject) %>%
    summarize(meanRT = mean(rt))  

Make a wide-form version of the subject/condition table.

repWide = pivot_wider(repTall,names_from= congruency, values_from = meanRT)

Calculate the repeated-measures (within-subject) standard error, as in the original paper.

congruentErr = sd(repWide$congruent - rep_sidMeans$meanRT)/sqrt(length(repWide$congruent))
incongruentErr = sd(repWide$incongruent - rep_sidMeans$meanRT)/sqrt(length(repWide$incongruent))
repPlot$SEM_repeated <- c(congruentErr, incongruentErr)
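For comparison with the original paper’s reported effect size (Cohen’s d = 0.95), the same Mdiff/SDdiff calculation can be run on the replication data (my own addition; repWide holds the per-subject condition means):

rep_diff <- repWide$incongruent - repWide$congruent   # per-subject congruency effect (ms)
mean(rep_diff)                                        # Mdiff
sd(rep_diff)                                          # SDdiff
mean(rep_diff) / sd(rep_diff)                         # Cohen's d for the replication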

Original plot:

par(mfrow=c(1,2))
p1 <- ggplot(origPlot, aes(x = condLabel, y=meanRT, fill=condLabel)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values=c("#336699", "#660000")) + 
   geom_errorbar(aes(ymin=meanRT-SEM_repeated, ymax=meanRT+SEM_repeated), width=.2,
                 position=position_dodge(.9)) + 
  theme(axis.title.x=element_blank(),
        axis.ticks.x=element_blank()) + 
  theme(panel.grid.minor = element_blank(), 
  panel.background = element_blank(), 
    axis.line = element_line(colour = "black")) +
  theme(legend.position = "none") + 
  theme(text = element_text(size=20)) +
  ylab("RT") + 
  coord_cartesian(ylim =c(480,530)) + 
  ggtitle("Original Study")

Replication plot:

p2 <- ggplot(repPlot, aes(x = congruency, y=meanRT, fill=congruency)) +
    geom_bar(stat = "identity")  + 
    scale_fill_manual(values=c("#336699", "#660000")) + 
   geom_errorbar(aes(ymin=meanRT-SEM_repeated, ymax=meanRT+SEM_repeated), width=.2,
                 position=position_dodge(.9)) +
   theme(axis.title.x=element_blank(),
        axis.ticks.x=element_blank()) + 
  theme(panel.grid.minor = element_blank(), 
  panel.background = element_blank(), 
    axis.line = element_line(colour = "black")) +
  theme(legend.position = "none") + 
  theme(text = element_text(size=20)) +
  ylab("RT") + 
  coord_cartesian(ylim =c(1490,1540)) + 
  ggtitle("Replication")

Plot side-by-side

grid.arrange(p1,p2, nrow=1)

Exploratory analyses

After seeing a significant effect of task in my replication experiment, I was curious to plot RT as a function of task.

Prepare original data for plot: First, we will get the means for each subject.

sidMeans <- originalDataTrimmed %>%
    group_by(sid) %>%
    summarize(meanRT = mean(RT)) 

Next, get means for each task.

origPlot <- originalDataTrimmed %>%
    group_by(taskLabel) %>%
    summarize(meanRT = mean(RT), SEM = sd(RT)/sqrt(length(unique(originalDataTrimmed$sid))))

Next, get means by subject and task.

origTall <- originalDataTrimmed %>%
    group_by(taskLabel,sid) %>%
    summarize(meanRT = mean(RT))  

Make a wide-form version of the subject/task table.

origWide = pivot_wider(origTall,names_from= taskLabel, values_from = meanRT)

Calculate the repeated-measures (within-subject) standard error, as in the original paper.

biggerErr = sd(origWide$'which is bigger on the screen?' - sidMeans$meanRT)/sqrt(length(origWide$'which is bigger on the screen?'))
smallerErr = sd(origWide$'which is smaller on the screen?' - sidMeans$meanRT)/sqrt(length(origWide$'which is smaller on the screen?'))
# Match each error term to its row by task label rather than relying on row order
origPlot$SEM_repeated <- ifelse(grepl("bigger", as.character(origPlot$taskLabel)), biggerErr, smallerErr)

Prepare replication data for plot: First, we will get the means for each subject.

rep_sidMeans <- DataTrimmed %>%
    group_by(subject) %>%
    summarize(meanRT = mean(rt)) 

Next, get means for each task

repPlot <- DataTrimmed %>%
    group_by(task) %>%
    summarize(meanRT = mean(rt), SEM = sd(rt)/sqrt(length(unique(DataTrimmed$subject))))

Next, get means by subject and task.

repTall <- DataTrimmed %>%
    group_by(task,subject) %>%
    summarize(meanRT = mean(rt))  

Make a wide-form version of the subject/task table.

repWide = pivot_wider(repTall,names_from= task, values_from = meanRT)

Calculate the repeated-measures (within-subject) standard error, as in the original paper.

smallerErr = sd(repWide$smaller - rep_sidMeans$meanRT)/sqrt(length(repWide$smaller))
biggerErr = sd(repWide$bigger - rep_sidMeans$meanRT)/sqrt(length(repWide$bigger))
repPlot$SEM_repeated <- c(biggerErr, smallerErr)

Original plot:

par(mfrow=c(1,2))
xlabel=c("bigger","smaller")
p3 <- ggplot(origPlot, aes(x = taskLabel, y=meanRT, fill=taskLabel)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values=c("#336699", "#660000")) + 
   geom_errorbar(aes(ymin=meanRT-SEM_repeated, ymax=meanRT+SEM_repeated), width=.2,
                 position=position_dodge(.9)) + 
  theme(axis.title.x=element_blank(),
        axis.ticks.x=element_blank()) + 
  theme(panel.grid.minor = element_blank(), 
  panel.background = element_blank(), 
    axis.line = element_line(colour = "black")) +
  theme(legend.position = "none") + 
  theme(text = element_text(size=20)) +
  ylab("RT") + 
  coord_cartesian(ylim =c(480,530)) + 
  ggtitle("Original Study") + 
  scale_x_discrete(labels= xlabel)

Replication plot:

p4 <- ggplot(repPlot, aes(x = task, y=meanRT, fill=task)) +
    geom_bar(stat = "identity")  + 
    scale_fill_manual(values=c("#336699", "#660000")) + 
   geom_errorbar(aes(ymin=meanRT-SEM_repeated, ymax=meanRT+SEM_repeated), width=.2,
                 position=position_dodge(.9)) +
   theme(axis.title.x=element_blank(),
        axis.ticks.x=element_blank()) + 
  theme(panel.grid.minor = element_blank(), 
  panel.background = element_blank(), 
    axis.line = element_line(colour = "black")) +
  theme(legend.position = "none") + 
  theme(text = element_text(size=20)) +
  ylab("RT") + 
  coord_cartesian(ylim =c(1490,1560)) + 
  ggtitle("Replication")

Plot side-by-side

grid.arrange(p3,p4, nrow=1)

Discussion

Summary of Replication Attempt

This replication was a success. The key statistic of interest was the main effect of congruency in the 2x2 repeated-measures ANOVA with task and congruency as factors. The original study found a main effect of congruency, F(1, 15) = 14.3, p = 0.002, and our replication also found a main effect of congruency, F(1, 15) = 13.2, p = 0.002.

One surprising finding was that our replication found a main effect of task (F(1, 15) = 7.97, p = 0.01) and a task x congruency interaction (F(1, 15) = 5.70, p = 0.03), neither of which was found in the original study. One possibility is that the more variable reaction times in the MTurk iteration contributed to these effects.

Commentary

One of the biggest concerns with this replication was that it was run on MTurk, where we cannot control participants’ viewing distance or screen size. Nevertheless, the effect replicated despite this variability in screen size and viewing distance.