Replication of Experiment 1: perspective-taking tasks by Tamrapi, Heydari & Hegarty (2016, Psychological Science)

Author

Chuqi Hu (chuqihu@stanford.edu)

Published

December 2, 2023

Introduction

Experiment 1 of Tarampi’s (2016) paper explored how stereotype threat contributes to gender differences in perspective-taking task performances. The researchers found that women tended to perform better when the same task was framed as a social perspective-taking task rather than spatial. This research aligned with my research interest in social cognition. Here, the result suggests that expectations from others influences one’s performance. I wonder how it might translate into social learning and education strategies.

Two tests were used in the experiment: the spatial orientation test and the road-map test. Participants also filled out the AQ questionnaire (Baron-Cohen et al., 2001). All the materials can be found on OSF. The original experiment was done in person, individually, or in a group of 2- 8 people of the same sex. Researchers first introduced perspective-taking ability, framing it either as a test of spatial ability or as a test of empathetic ability. Then, the participants took the spatial orientation test, road-map test, and filled out the AQ questionnaire, with researchers providing instruction at the start of each of the three tasks.

A foreseeable challenge will be crafting a reasonable schedule for recruiting participants, running the task in person, and coding responses. The original study had 139 undergraduate student participants. If I were to run it with around the same number of participants, the time needed from recruiting participants to producing analyzable data would be a lot. Some possible solutions are only focusing on one type of task or running the experiment online.

The repository: https://github.com/psych251/tarampi2016_rescue

The original paper: https://github.com/psych251/tarampi2016_rescue/blob/b3d6e7d926d3fa4e1370badfc73cf76629088592/original_paper/tarampi-et-al-2016-a-tale-of-two-types-of-perspective-taking-sex-differences-in-spatial-ability.pdf

Summary of prior replication attempt

The prior replication attempt made four main changes compared to the original study.

  1. The original study was conducted in person, and participants were tested individually or in same-sex groups of 2 to 8 participants. The prior replication attempt converted the study to an online format, and participants were tested individually. A concern the replication experimenter raised is whether state induction of gender bias was effective in the online environment.

  2. Participants of the original study were 139 undergraduate students at UC Santa Barbara. The prior replication was done with 255 Mechanical Turk workers whose demographics weren’t specified in the report, so we don’t know if the demographics differ a lot in terms of age and background.

  3. The original study consisted of three parts: two timed pencil-and-paper tests of perspective-taking ability (the object-perspective/spatial orientation test (Hegarty & Waller, 2004) and the road-map test (Money et al., 1965; modified by Zacks et al., 2002)) and a questionnaire that aims to measure autistic trait in adults. The prior replication attempt only kept the road-map test in order to collect more data of interest. They also included two questions at the end of the experiment, one asking about the participant’s prior CS experience and the other asking about participants’ assessment of their math ability.

  4. No description of the exclusion criterion was mentioned in the original paper. The experimenter of the replication attempt did two sets of data analyses using a light exclusion criterion (excluding participants who labeled no corner) and a strict exclusion criterion (excluding those with less than 60% accuracy). The original results did not replicate in either one.

Methods

Power Analysis

Original effect size, power analysis for samples to achieve 80%, 90%, 95% power to detect that effect size. Considerations of feasibility for selecting planned sample size.

The original effect size for the interaction of interest is a partial eta squared equal to 0.03. A power analysis for 80% suggests a sample size of 256. A power analysis for 90% suggests a sample size of 342.A power analysis for 95% suggests a sample size of 423.

Planned Sample

Based on the fact that the sample size of 256 gives us 80% power and that some participant data might be excluded according to the exclusion criteria (which I think will be around 10%), I plan to have a sample size of 282 participants.

The age range of this study has been adjusted to encompass individuals aged 18 to 25. This is a slight expansion from the initial study’s age range of 18 to 22, still ensuring the study focuses on young adult demographics.

The study must also be done on Laptops.

The original study did not specify any exclusion criterion. I will exclude participants (post data collection) who failed the comprehension check, who accurately guessed the study’s hypothesis (explicitly mentioned stereotype threat, or anything about framing the perspective- taking task differently for different gender group), or who clearly put no effort in the task (participants who labeled no corner or participants whose accuracy was below 20%).

Materials

In the original study, “the experiment consisted of two timed pencil-and-paper tests of perspective-taking ability: the object-perspective/spatial-orientation test (Hegarty & Waller, 2004) and the standardized road-map test of direction sense (the road-map test; Money et al., 1965; modified by Zacks et al., 2002)…The road-map test consisted of a bird’s-eye diagram of a path through a city. Participants were instructed to imagine walking along the path and write either “R” or “L” at each corner to indicate whether to take a right or left turn. The social version of the task included a human figure at every corner (see Fig. 2, right panel). Participants in the social condition were instructed to imagine themselves taking the perspective of the person as he or she walked along the path. Their score was the number of corners labeled correctly.”

In this rescue project, I will follow the first replication attempt and only focus on the road-map test. In the first replication attempt, participants labeled corners using drop-down menus to indicate either turning left or right. The experimenter expressed concerns about the extra cognitive load related to interacting with the drop-down menus. Therefore, I plan to change the drop-down menus to clicking the right or the left arrow key on the keyboard. Instead of showing the complete roadmap at once, I’m showing a smaller portion of the map on each trial.

The code for the replication study was in Javascript, HTML, and CSS. Only partial codes are available, and there is no code for data collection, so I plan to recreate the study with jsPsych.

Procedure

Participants will be tested individually online. In both conditions, participants will be told that they will complete a task that will test their perspective-taking ability.

“Participants in the spatial condition were given unmodified tests and also received the following information, which emphasized that perspective-taking is a spatial ability in which men have an advantage over women:

Perspective-taking ability can be thought of as a measure of spatial ability. Spatial ability is a cognitive ability that is defined as understanding the relations between objects in space and being able to mentally manipulate them and respond correctly. Males often score higher on measures of spatial ability.

Participants in the social condition were given modified tests, which included human figures, and received the following additional information, which emphasized that perspective-taking is an empathetic ability in which women have an advantage over men:

Perspective-taking ability can be thought of as a measure of empathetic ability. Empathetic ability is a social ability that is defined as being able to identify with and understand what another person is seeing or feeling, and respond appropriately. Females often score higher on measures of empathetic ability.”

The above-mentioned instructions will appear line by line on the screen. Participants will be instructed to read carefully and click the “next” button to go to the next sentence. This is different from the previous replication attempt, and the goal is to achieve a more effective state induction.

After that, participants will be presented with a comprehension check:

fill in the blank: _____often score higher on measures of ______ ability.

The participants then complete the road-map test. In the road-map test, participants will be first given the instruction and complete an untimed practice trial where they are expected to indicate three corner turns correctly.

Once they complete the practice trial, they will be given 30 seconds to complete as many of the 32 items as possible.

At the end of the study, instead of asking for participants’ CS experience as the experiment did in the first replication attempt, I plan to first ask a question looking at participants’ thoughts on the purpose of the study. If participants’ guesses about the study closely aligned with the study’s hypothesis (meaning that they explicitly mentioned something along the lines of gender threat, framing perspective-taking tasks differently for different gender groups, etc…).

Then, there will be a question asking about their gender, a manipulation check that probes whether the state induction of gender bias has been effective.

For the manipulation check, two multiple-choice questions will appear on the screen.

Please fill in the blank based on the information you saw at the beginning of the experiment.

  1. Perspective-taking ability can be thought of as a measure of ______(Choice1: spatial ability. Choice 2: empathetic ability)

  2. _________ (Choice1: females. Choice 2: males) often score higher on measures of empathetic/spatial (depending on the condition) ability

The last question will ask whether they have encountered any tech difficulties during the study.

Controls

The fact that the participant needs to click the next button to proceed to the next sentence in the instruction functions as an attention check.

Analysis Plan

The analysis will be a 2 (sex: male, female) X 2 (condition: social, spatial) between-subject analysis of variance (ANOVA). The dependent measure of the key task is the number of corners that a participant correctly labeled. For the ANOVA test, I’m mainly interested in the interaction effect.

The original study did not exclude any data based on participants’ performance in the road-map test. However, I’m excluding participants who failed the comprehension check, who guessed the study’s hypothesis, or who clearly did not put in effort for the task (ex., did not label any corner or whose accuracy rate was below 55%).

Differences from Original Study and 1st replication

This study aims to convert the original study to an online format, which is in line with the 1st replication attempt. However, it differs from the 1st replication in three ways. 1) Instead of showing the instructions in one paragraph, I plan to show the instructions line by line to mimic real-time communication. 2) Instead of asking the participant to select “right” or “left” from a drop-down menu, I plan to let them use the arrow keys to indicate direction. 3)I also plan to show the trials one by one. 4) Unlike the 1st replication, which introduced additional questions probing participants’ CS and math background, I plan to ask questions that probe whether the state induction succeeded. This is to address one of the main concerns from the previous replication.

Unlike the original study (which had no exclusion criteria) or the 1st replication (had two sets of exclusion criteria), I plan to apply one exclusion criterion to filter out the data from participants who clearly did not put effort into their answering.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

Read in data and remove unneeded columns.

Exclude participants who did not label any corner right or whose accuracy was less than 50%.

Calculate mean scores in each group (males- spatial, males-social, females-spatial, females-social)

### Data Preparation

#### Load Relevant Libraries and Functions
library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(ez)
library(osfr)
library(tidyr)
library(stringr)
Warning: package 'stringr' was built under R version 4.3.1
#### Import data
files <-list.files(path = "~/Desktop/tarampi2016_rescue/pilotb data/", full.names = TRUE)
# read in and combine all the CSV files using bind_rows()
combined_data <- lapply(files, function(filepath) {
  df <- read.csv(filepath)
  return(df)
}) %>%
  bind_rows()

#### Data exclusion / filtering
# exclude unwanted columns and rows
organized.data <- combined_data %>%
  select("workerid",
         "condition",
         "stimulus",
         "correct",
         "response",
         "gender",
         "first",
         "second",
         "tech_difficulty",
         "participant_hypothesis",
         "bothQuestionsCorrect") %>%
  filter (response != 0) %>%
  filter (workerid != "na")


#### Prepare data for analysis - create columns that calculates the percentage of correct answers etc.
precentage_correct <- organized.data %>%
  group_by(workerid)%>%
  mutate(correct_percent = (sum (correct == "True"))/(sum (correct == "True") + sum (correct == "False")))


ready_data <- precentage_correct %>%
  group_by(workerid) %>%
  mutate(correct_count = sum(correct == "True", na.rm = TRUE)) %>%
  filter(is.na(gender) == FALSE)

graph_data = ready_data %>%
  group_by(condition, gender) %>%
  mutate(mean_correct = mean(correct_count),mean_sem = sd(correct_count) / sqrt(n())
  )


#descriptive statistics
ggplot(data = graph_data, aes(x = gender, y = mean_correct, fill = condition)) +
  # Dodge the bars based on condition
  geom_col(position = "dodge", width = 0.5) +
  # Dodge the error bars as well
  geom_errorbar(
    aes(ymin = mean_correct - mean_sem, ymax = mean_correct + mean_sem),
    width = 0,
    position = position_dodge(0.5)
  )

Results of control measures

Confirmatory analysis

#ANOVA test
anova<- aov(correct_count ~ gender * condition, data = ready_data)
summary(anova)
                  Df Sum Sq Mean Sq F value Pr(>F)    
gender             1    184  184.07  13.373 0.0003 ***
condition          1     52   52.15   3.789 0.0525 .  
gender:condition   1     83   83.29   6.051 0.0144 *  
Residuals        309   4253   13.76                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
###T-Tests
sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(sum(!is.na((x))))}
ci <- function(x) {sem(x) * 1.96} 

####Female Score By Condition
female_data <- ready_data %>%
  subset(gender=='Female')
female_summary <- female_data %>%
  group_by(condition) %>%
  summarise(mean = mean(correct_count), sd = sd(correct_count), ci_lower = mean(correct_count) - ci(correct_count), ci_upper = mean(correct_count) + ci(correct_count))
t.test(correct_count ~ condition, data = female_data, var.equal = TRUE)

    Two Sample t-test

data:  correct_count by condition
t = -6.5064, df = 127, p-value = 1.611e-09
alternative hypothesis: true difference in means between group social and group spatial is not equal to 0
95 percent confidence interval:
 -0.652067 -0.347933
sample estimates:
 mean in group social mean in group spatial 
                 14.0                  14.5 
####Male Score By Condition
male_data <- ready_data %>%
  subset(gender=='Male')
male_summary <- male_data %>%
  group_by(condition) %>%
  summarise(mean = mean(correct_count), sd = sd(correct_count), ci_lower = mean(correct_count) - ci(correct_count), ci_upper = mean(correct_count) + ci(correct_count))
t.test(correct_count ~ condition, data = male_data, var.equal = TRUE)

    Two Sample t-test

data:  correct_count by condition
t = 2.3489, df = 182, p-value = 0.0199
alternative hypothesis: true difference in means between group social and group spatial is not equal to 0
95 percent confidence interval:
 0.2677251 3.0792137
sample estimates:
 mean in group social mean in group spatial 
             16.67347              15.00000 
####Spatial Score By Sex
spatial_data = ready_data %>%
  subset(condition=='spatial')
t.test(correct_count ~ gender, data = spatial_data, var.equal = TRUE)

    Two Sample t-test

data:  correct_count by gender
t = -4.1231, df = 170, p-value = 5.835e-05
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.7393847 -0.2606153
sample estimates:
mean in group Female   mean in group Male 
                14.5                 15.0 
####Social Score By Sex
social_data = ready_data %>%
  subset(condition=='social')
t.test(correct_count ~ gender, data = social_data, var.equal = TRUE)

    Two Sample t-test

data:  correct_count by gender
t = -2.6763, df = 139, p-value = 0.00834
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -4.6485814 -0.6983574
sample estimates:
mean in group Female   mean in group Male 
            14.00000             16.67347 

Three-panel graph with original, 1st replication, and your replication is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Mini meta analysis

Combining across the original paper, 1st replication, and 2nd replication, what is the aggregate effect size?

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.