Replication of Study 1 by Tarampi, Heydari, and Hegarty (2016, Psychological Science)

Introduction

Through this project, I hope to 1) learn how to run tasks online (MTurk or Prolific) to add this skill as a tool in my repertoire for future research and 2) use this method to replicate a recent finding on spatial reasoning and gender bias, which is relevant to my own research on early childhood CS education (spatial reasoning skills correlate to CS success). Experiment 1 in this paper demonstrates that a difference in performance between genders on a spatial reasoning task can be decreased by framing the tasks as a social, rather than spatial.

To conduct this study I will develop a web version of the two spatial reasoning tasks described in the paper, which are available online as PDF downloads from the OSF site. This web version of the test will also gather gender information. I will post this short spatial reasoning test on Mechanical Turk, excluding the AQ test from the original study, as the authors did not include these scores in the analyses they present. After data collection, I will grade each respondent’s test responses, likely through the use of an auto-grader program which I will write; this auto-grader will require more time upfront, but will make grading quicker overall and reduce the possibility of errors in grading. After grading each participant’s responses, I will collate these scores with the demographic information to have a data set that should resemble that of the original study (minus the AQ data).

My primary concern in replicating this study online is in finding enough respondents with the available resources (the original experiment included 139 subjects). By removing the AQ test, which was not used in the analyses presented in the paper, I can cut the task time in half and therefore recruit twice the number of participants for the same cost.

The repository is located here: https://github.com/gdietz44/tarampi2016

The original paper can be found here: https://github.com/gdietz44/tarampi2016/blob/master/original_paper/tarampi2016.pdf

Methods

Power Analysis

The reported effect size for the interaction of interest is a partial eta squared equal to 0.03. A power analysis for 80% suggests a sample size of 256. This large number may pose some challenge in monetary resources, but with a very short task should be feasible.

Planned Sample

We will gather data from 256 Mechanical Turk workers. We will not exclude or preselect based on any criteria. The sample size was determined by a power analysis for 80 percent power based on the original reported effect.

Materials

In the original study, “the experiment consisted of two timed pencil-and-paper tests of perspective-taking ability: the object-perspective/spatial-orientation test (Hegarty & Waller, 2004) and the standardized road-map test of direction sense (the road-map test; Money et al., 1965; modified by Zacks et al., 2002)…The road-map test consisted of a bird’s-eye diagram of a path through a city (see Fig. 2, left panel). Participants were instructed to imagine walking along the path and write either “R” or “L” at each corner to indicate whether to take a right or left turn. The social version of the task included a human figure at every corner (see Fig. 2, right panel). Participants in the social condition were instructed to imagine themselves taking the perspective of the person as he or she walked along the path. Their score was the number of corners labeled correctly.”

In this replication, we have downloaded the original study materials (available at: https://osf.io/7qu6s/) and have converted them to an online format. We do not attempt to spatial orientation task. In task two, participants label corners using drop-down menus placed at each corner. Participants are asked not to rotate their head or computer screen or use external aids (to match with the request not to to turn or mark the paper). The JavaScript code randomly assigns the condition. The debug (non data-collecting) version of the task can be found here: http://web.stanford.edu/~gdietz44/psych251/study.html.

Procedure

In Tarampi et al.’s original study, “Males and females were tested individually or in same-sex groups of 2 to 8 participants. In both conditions, participants were told that they would complete two tasks that would test their perspective-taking ability. Participants in the spatial condition were given unmodified tests and also received the following information, which emphasized that perspective taking is a spatial ability in which men have an advantage over women:

Perspective-taking ability can be thought of as a measure of spatial ability. Spatial ability is a cognitive ability that is defined as understanding the relations between objects in space and being able to mentally manipulate them and respond correctly. Males often score higher on measures of spatial ability.

Participants in the social condition were given modified tests, which included human figures, and received the following additional information, which emphasized that perspective taking is an empathetic ability in which women have an advantage over men:

Perspective-taking ability can be thought of as a measure of empathetic ability. Empathetic ability is a social ability that is defined as being able to identify with and understand what another person is seeing or feeling, and respond appropriately. Females often score higher on measures of empathetic ability.

The participants then completed the two perspective taking tasks, with task order counterbalanced across participants. On the road-map test, participants were given 30 s to complete as many of the 32 items as they could. On the spatial-orientation test, they were allowed 5 min to complete 12 test items. Finally, they completed the AQ."

In the replication study, all participants were tested individually, through Amazon’s Mechanical Turk. We choose only to include the road map test, because its short duration enabled us to collect a large amount of data and because analyses of the two tests were done separately. We do not include the AQ test because the original authors do not use it in their reported analyses. Additionally, to address the original authors’ concerns regarding increased cognitive load in the digital format, we have increased the study task time by 20% (to 36 seconds) based on a comparison of scores on the paper and digital task for the same participants.

Analysis Plan

We included data from all participants who complete the task. As described in the materials section, scores were calculated based on the mean absolute angular deviation for task 1, and on the raw number of correctly labelled corners for task 2. As described in the results section of the original study, we will run a 2 (sex: male, female) by 2 (condition: social, spatial) between-subjects ANOVA.

Potential Additional Analysis

In line with the original study, we will also run t-tests comparing performance across conditions for each gender and comparing performance across gender for each condition. One potential additional analysis would require adding two short questions to the demographic information regarding computer science/programming experience and math ability that are not included in the original study. Prior research demonstrates a correlation between STEM experience and spatial reasoning skills, and MTurk workers, having chosen to make money as online workers, are likely more tech-savvy than the general population. By collecting data on programming experience, we could see how this experience might interplay with performance by gender or condition through two 3-way ANOVAs (CS by condition by sex). We are similarly collecting data in math ability because there might be very low variance in CS experience among the participants.

Differences from Original Study

The primary difference between these study is the materials (paper test vs online test) and the corresponding difference in sample (data collected from online workers as opposed to undergraduate students). The difference in materials is not expected to effect the results, but it may affect the test raw scores on the road-map test. Namely, selecting from a drop-down may introduce more cognitive load than writing a letter, so participants may not be able to answer questions (and therefore move onto the next one) as quickly. This difference could affect raw scores overall. To address this difference in raw scores, we have adjusted the task time to 36 seconds, as described above in Procedure.

The difference in subject population has a small chance of impacting the raw scores as well based on a series of assumptions: if MTurk workers have more programming/computer science experience than the original population and if this experience truly contributes to notably improved spatial reasoning skills, we might expect to see higher raw scores than in the original population as well.

Methods Addendum (Post Data Collection)

Post data collection, we have recognized the necessity of excluding data from workers who did not make an honest attempt at the task and have added additional analyses for the added demographic questions.

Actual Sample

We collected data from 255 participants (134 male: 76 spatial, 58 social; 121 female: 58 spatial, 63 social) plus one non-binary participant not included in analysis. We run all analyses as planned (excluding a single non-binary participant because we do not have sufficient data for this group), but also run these analyses on data sets that exclude workers workers who labelled no corners and a stricter data set also excluding workers with less than 60% accuracy or who labelled only one corner. When we exclude participants who labelled no corners, our data set includes 246 participants (130 male: 72 spatial, 58 social; 120 female: 57 spatial, 63 social). By also excluding participants who labelled only one corner or incorrectly labelled more than 40% of the corners they labelled (or strictest exclusion criteria, ruling out random guessers) our data set includes 212 participants (114 male: 61 spatial, 53 social; 98 female: 44 spatial, 54 social).

Differences from pre-data collection methods plan

We have also realized the non-interval ordinal data used to measure CS Experience and math ability should not be analyzed with an ANOVA. Therefore we also conduct exploratory analyses using linear models with these variables included as factors, numerics, and split into two groups (greater than 4 years of CS experience or agreement or strong agreement with math ability statement, respectively).

Results

Data preparation

Data preparation following the analysis plan.

###Data Preparation

####Load Relevant Libraries and Functions
library(tidyr)
library(ggplot2)
library(dplyr)
library(MASS)
library(lsr)

####Import data
# read csv file
raw_data = read.csv("../mturk/final-results.csv")

#### Prepare data for analysis - create columns etc.
# gets number of questions answered for each participant (used in exclusion)
num_answered = raw_data %>%
  dplyr::select(-starts_with("correct_")) %>%
  gather(corner, answer, -c("participant", "condition", "sex", "cs_experience", "math_ability")) %>%
  group_by(participant, answer) %>%
  summarise(n = n()) %>%
  spread(answer, n) %>%
  mutate(num_answered = 32 - `-`) %>%
  dplyr::select(-c(`L`, `R`, `-`))
  
# gather function (on each line: subject id, condition, demographics, num_answered, score for one trial)
tidy_data = raw_data %>%
  dplyr::select(-starts_with("answer_")) %>%
  mutate(num_answered = num_answered$num_answered) %>%
  gather(corner, correct, -c("participant", "condition", "sex", "cs_experience", "math_ability", "num_answered")) %>%
  mutate(corner = substr(corner,9,9))

# group by and summarise
d = tidy_data %>%
  group_by(participant, condition, sex, cs_experience, math_ability, num_answered) %>%
  summarise(score = sum(correct), percent = score / mean(num_answered)) %>%
  filter(sex != 'O')

d$cs_experience <- as.factor(d$cs_experience)
d$math_ability <- as.factor(d$math_ability)

#### Data exclusion / filtering
# light exclusion
d_exclusion_0 <- d %>%
  filter(num_answered != 0)

#strict exclusion
d_exclusion_strict <- d_exclusion_0  %>%
  filter(num_answered != 1 & percent >= .6)

Descriptive Statistics

condition_count = d %>%
  group_by(condition, sex) %>%
  summarise(n = n())

condition_count_no_0 = d_exclusion_0 %>%
  group_by(condition, sex) %>%
  summarise(n = n())

condition_count_strict_exclusion = d_exclusion_strict %>%
  group_by(condition, sex) %>%
  summarise(n = n())

condition_count$condition <- factor(condition_count$condition, levels = c('spatial','social'))

ggplot(condition_count, aes(x=sex, y=n, fill=condition)) +
  geom_bar(position="dodge", stat="identity") + 
  theme_bw() +
  ylab("Count") +
  ggtitle("Distribution of Participants Across Conditions Before Exclusion") +
  theme(axis.title.x = element_blank()) +
  scale_x_discrete(limits=c('M','F'), labels=c('Males','Females')) +
  scale_fill_manual(values=c("#8C8E90","#E2E3E4"),name='',limits=c('spatial','social'),labels=c('Spatial Condition','Social Condition'))

ggplot(d, aes(x=score)) +
  geom_histogram(stat="count") +
  ggtitle("Distribution of Participant Scores Before Exclusion")

Confirmatory analysis

As described in the analysis plan, we ran a 2 (sex: male, female) by 2 (condition: social, spatial) between-subjects ANOVA for each task.

anova <- aov(score ~ sex * condition, data = d)
summary(anova)

##                Df Sum Sq Mean Sq F value  Pr(>F)   
## sex             1  135.0  134.97  10.759 0.00118 **
## condition       1   30.0   29.99   2.390 0.12334   
## sex:condition   1    0.4    0.44   0.035 0.85224   
## Residuals     251 3148.7   12.54                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

etaSquared(anova)

##                     eta.sq  eta.sq.part
## sex           0.0438406777 0.0441082082
## condition     0.0090482101 0.0094336481
## sex:condition 0.0001315906 0.0001384836

While we are able to replicate the finding that male participants score higher than female participants F(1, 251) = 10.76, p = .001, \(\eta\)_p² = .04, we are not able to replicate the effect of condition F(1, 251) = 2.39, p = .12, \(\eta\)_p² = .009 or the interaction effect between sex and condition F(1, 251) = 0.035, p = .85, \(\eta\)_p² = < .001.

Replication Graph

graph_data = d %>%
  group_by(condition, sex) %>%
  summarise(mean = mean(score), ci = 1.96*(sd(score)/sqrt(n())))

graph_data$condition <- factor(graph_data$condition, levels = c('spatial','social'))

plot(ggplot(graph_data, aes(x=sex, y=mean, fill=condition)) +
  geom_bar(position="dodge", stat="identity") + 
  theme_bw() +
  ylim(0, 30) +
  ylab("Mean Number Correct") +
  ggtitle("Mean Correct by Sex and Condition") +
  theme(axis.title.x = element_blank()) +
  scale_x_discrete(limits=c('M','F'), labels=c('Males','Females')) +
  scale_fill_manual(values=c("#8C8E90","#E2E3E4"),name='',limits=c('spatial','social'),labels=c('Spatial Condition','Social Condition')) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0.2,position=position_dodge(.9)))

Original Graph

Original graph

Confirmatory Analyses with Exclusions

Under both sets of exclusion criteria we see the same results: males preform better than females on the task, but we no effect of condition, nor interaction between sex and condition.

anova_light_exclusion <- aov(score ~ sex * condition, data = d_exclusion_0)
summary(anova_light_exclusion)

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## sex             1  164.9  164.91  13.823 0.000249 ***
## condition       1   12.8   12.77   1.070 0.301907    
## sex:condition   1    0.2    0.19   0.016 0.899107    
## Residuals     246 2934.7   11.93                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

etaSquared(anova_light_exclusion)

##                     eta.sq  eta.sq.part
## sex           5.499354e-02 5.511206e-02
## condition     4.101961e-03 4.331725e-03
## sex:condition 6.174079e-05 6.547846e-05

anova_strict_exclusion <- aov(score ~ sex * condition, data = d_exclusion_strict)
summary(anova_strict_exclusion)

##                Df Sum Sq Mean Sq F value  Pr(>F)   
## sex             1   93.0   92.96   8.455 0.00403 **
## condition       1    0.0    0.04   0.003 0.95324   
## sex:condition   1    0.5    0.47   0.043 0.83684   
## Residuals     208 2286.7   10.99                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

etaSquared(anova_strict_exclusion)

##                     eta.sq  eta.sq.part
## sex           3.863165e-02 3.865620e-02
## condition     1.592152e-05 1.657198e-05
## sex:condition 1.963733e-04 2.043577e-04

Exploratory Analyses

T-Tests

In addition to the ANOVA, the original study also included t-tests. Unlike the the original study, we did not find that females were more accurate in the social condition (M = 5.81, SD = 2.99, CI = [5.07, 6.55]) than the spatial condition (M = 5.21, SD = 3.60, CI = [4.28, 6.13]), t(119) = 1.01, p = 0.32. However, our findings did match the original study in that male performance did not differ between the social condition (M = 7.41, SD = 3.51, CI = [6.51, 8.32]) and spatial condition (M = 6.64, SD = 3.92, CI = [5.76, 7.53]), t(132) = 1.18, p = 0.24. Like the original paper, we found males preform better than females in the spatial condition t(132) = 2.18, p = 0.03, however, unlike the original study, we also found this trend to hold true in the social condition t(119) = 2.71, p = .008.

###T-Tests
sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(sum(!is.na((x))))}
ci <- function(x) {sem(x) * 1.96} # reasonable approximation 

####Female Score By Condition
female_data <- d %>%
  subset(sex=='F')
female_summary <- female_data %>%
  group_by(condition) %>%
  summarise(mean = mean(score), sd = sd(score), ci_lower = mean(score) - ci(score), ci_upper = mean(score) + ci(score))
t.test(score ~ condition, data = female_data, var.equal = TRUE)

####Male Score By Condition
male_data <- d %>%
  subset(sex=='M')
male_summary <- male_data %>%
  group_by(condition) %>%
  summarise(mean = mean(score), sd = sd(score), ci_lower = mean(score) - ci(score), ci_upper = mean(score) + ci(score))
t.test(score ~ condition, data = male_data, var.equal = TRUE)

####Spatial Score By Sex
spatial_data = d %>%
  subset(condition=='spatial')
t.test(score ~ sex, data = spatial_data, var.equal = TRUE)

####Social Score By Sex
social_data = d %>%
  subset(condition=='social')
t.test(score ~ sex, data = social_data, var.equal = TRUE)

CS Experience

We find that the CS experience by sex by condition ANOVA shows a main effect only for sex F(1, 231) = 10.81, p = .001, \(\eta\)_p² = .03. A linear regression with CS experience as a factor shows a main effect of sex, t(247) = 2.73, p = .007, but not condition, t(247) = 1.73, p = .09. Even the highest level of CS experience (>10 years) does not yield statistically different scores on the perspective taking test than no experience t(247) = .20, p = .84. Looking at CS experience as a numeric value, it again has no main effect on score, t(251) = .78, p = .43. And, finally, if we split CS experience into two groups (those with greater than 4 years of experience and those with less), we again see no main effect, t(251) = -0.17, p = .87. Therefore there is not evidence in this data supporting the hypothesis that CS experience can predict spatial reasoning ability.

###CS Experience
cs_anova <- aov(score ~ sex * condition * cs_experience, data = d)
summary(cs_anova)

##                              Df Sum Sq Mean Sq F value  Pr(>F)   
## sex                           1  135.0  134.97  10.809 0.00117 **
## condition                     1   30.0   29.99   2.401 0.12260   
## cs_experience                 5   59.2   11.84   0.948 0.45080   
## sex:condition                 1    1.7    1.68   0.135 0.71383   
## sex:cs_experience             5   69.4   13.87   1.111 0.35532   
## condition:cs_experience       5   93.2   18.64   1.492 0.19311   
## sex:condition:cs_experience   5   41.2    8.23   0.659 0.65479   
## Residuals                   231 2884.6   12.49                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

etaSquared(cs_anova)

##                                   eta.sq  eta.sq.part
## sex                         0.0292758971 0.0325406177
## condition                   0.0087223559 0.0099217021
## cs_experience               0.0182333453 0.0205184920
## sex:condition               0.0006129445 0.0007037173
## sex:cs_experience           0.0191119647 0.0214859829
## condition:cs_experience     0.0281170758 0.0312928764
## sex:condition:cs_experience 0.0124189548 0.0140674407

cs_lm_factor <- lm(score ~ sex + condition + cs_experience, data = d)
summary(cs_lm_factor)

## 
## Call:
## lm(formula = score ~ sex + condition + cs_experience, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3838 -2.2972 -0.3828  1.7868 13.0518 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.7287     0.4755  12.047  < 2e-16 ***
## sexM               1.2634     0.4635   2.726  0.00687 ** 
## conditionspatial  -0.7805     0.4521  -1.726  0.08553 .  
## cs_experience1    -0.1237     0.5998  -0.206  0.83681    
## cs_experience2     1.1712     0.6608   1.772  0.07758 .  
## cs_experience3     0.9889     0.9172   1.078  0.28204    
## cs_experience4     0.2649     0.8397   0.316  0.75263    
## cs_experience5     0.1722     0.8540   0.202  0.84038    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.537 on 247 degrees of freedom
## Multiple R-squared:  0.06763,    Adjusted R-squared:  0.04121 
## F-statistic:  2.56 on 7 and 247 DF,  p-value: 0.0146

cs_lm_numeric <- lm(score ~ sex + condition + as.numeric(cs_experience), data = d)
summary(cs_lm_numeric)

## 
## Call:
## lm(formula = score ~ sex + condition + as.numeric(cs_experience), 
##     data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0157 -2.3377 -0.4752  1.5617 12.9941 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 5.5982     0.5041  11.104  < 2e-16 ***
## sexM                        1.4693     0.4497   3.268  0.00124 ** 
## conditionspatial           -0.7003     0.4456  -1.572  0.11730    
## as.numeric(cs_experience)   0.1081     0.1386   0.780  0.43608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.538 on 251 degrees of freedom
## Multiple R-squared:  0.05207,    Adjusted R-squared:  0.04074 
## F-statistic: 4.596 on 3 and 251 DF,  p-value: 0.003759

cs_lm_levels <- lm(score ~ sex + condition + I(levels(cs_experience)[cs_experience] > 3), data = d)
summary(cs_lm_levels)

## 
## Call:
## lm(formula = score ~ sex + condition + I(levels(cs_experience)[cs_experience] > 
##     3), data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6964 -2.3852 -0.3852  1.7174 12.8200 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                      5.86876    0.40022
## sexM                                             1.51644    0.44595
## conditionspatial                                -0.68876    0.44592
## I(levels(cs_experience)[cs_experience] > 3)TRUE -0.09871    0.58189
##                                                 t value Pr(>|t|)    
## (Intercept)                                      14.664  < 2e-16 ***
## sexM                                              3.400 0.000782 ***
## conditionspatial                                 -1.545 0.123708    
## I(levels(cs_experience)[cs_experience] > 3)TRUE  -0.170 0.865434    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.542 on 251 degrees of freedom
## Multiple R-squared:  0.04988,    Adjusted R-squared:  0.03853 
## F-statistic: 4.393 on 3 and 251 DF,  p-value: 0.004928

ggplot(d, aes(x=score, color=sex)) +
  geom_density() + 
  facet_grid(cs_experience ~ condition) +
  ggtitle("Density Plots of CS Experience")

cs_data = d %>%
  group_by(cs_experience) %>%
  summarise(mean = mean(score), ci = 1.96*(sd(score)/sqrt(n())))

ggplot(cs_data, aes(x=cs_experience, y=mean)) +
  geom_bar(position="dodge", stat="identity") + 
  theme_bw() +
  ylim(0, 30) +
  ylab("Mean Number Correct") +
  ggtitle("Score by CS Experience") +
  theme(axis.title.x = element_blank()) +
  scale_x_discrete(limits=c(0,1,2,3,4,5,6), labels=c('','None','<1 Year', '1-4 Years','5-6 Years', '6-10 Years', '>10 Years')) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0.2,position=position_dodge(.9))

Math Ability

However, the ANOVA including math ability an independent variable alongside sex and condition shows main effects of both sex, F(1, 231) = 11.38, p < .001, \(\eta\)_p² = .02, and math ability, F(4, 231) = 6.01, p < .001, \(\eta\)_p² = .09. A linear regression with math ability as a factor also shows a main effect of sex, t(248) = 2.26, p = 0.02, and that scores for those who strongly disagree with the statement about math ability differ significant from those who agree, t(248) = 2.57, p = .01, or strongly agree, t(248) = 3.32, p = .001. A linear regression with math ability treated as a numeric shows a main effect of math ability on score, t(251) = 4.58, p < .001, and when looking at levels, those that agree or strongly agree with the math ability statement have higher scores than those who do not, t(251) = 3.40, p < .001.

###Math Ability
math_anova <- aov(score ~ sex * condition * math_ability, data = d)
summary(math_anova)

##                             Df Sum Sq Mean Sq F value   Pr(>F)    
## sex                          1  135.0  134.97  11.387 0.000865 ***
## condition                    1   30.0   29.99   2.530 0.113052    
## math_ability                 4  284.7   71.17   6.005 0.000129 ***
## sex:condition                1    1.0    0.98   0.083 0.773528    
## sex:math_ability             4   12.2    3.06   0.258 0.904596    
## condition:math_ability       4   23.7    5.91   0.499 0.736619    
## sex:condition:math_ability   4   42.2   10.54   0.889 0.471111    
## Residuals                  235 2785.4   11.85                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

etaSquared(math_anova)

##                                  eta.sq eta.sq.part
## sex                        0.0144752233 0.016930991
## condition                  0.0109300336 0.012837581
## math_ability               0.0860686054 0.092891718
## sex:condition              0.0009712313 0.001154235
## sex:math_ability           0.0040966995 0.004850600
## condition:math_ability     0.0071362685 0.008419230
## sex:condition:math_ability 0.0127194782 0.014907993

math_lm_factor <- lm(score ~ sex + condition + math_ability, data = d)
summary(math_lm_factor)

## 
## Call:
## lm(formula = score ~ sex + condition + math_ability, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8440 -1.8457 -0.4414  1.7063 12.3249 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.5236     0.7980   5.669 3.99e-08 ***
## sexM               0.9983     0.4416   2.261  0.02463 *  
## conditionspatial  -0.7680     0.4310  -1.782  0.07602 .  
## math_ability2     -0.2291     0.9163  -0.250  0.80280    
## math_ability3      1.9195     0.8967   2.141  0.03328 *  
## math_ability4      2.0901     0.8140   2.568  0.01082 *  
## math_ability5      3.3115     0.9981   3.318  0.00104 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.399 on 248 degrees of freedom
## Multiple R-squared:  0.1357, Adjusted R-squared:  0.1148 
## F-statistic: 6.488 on 6 and 248 DF,  p-value: 2.255e-06

math_lm_numeric <- lm(score ~ sex + condition + as.numeric(math_ability), data = d)
summary(math_lm_numeric)

## 
## Call:
## lm(formula = score ~ sex + condition + as.numeric(math_ability), 
##     data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0341 -2.0341 -0.6901  1.9231 12.9493 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.1197     0.7028   4.439 1.36e-05 ***
## sexM                       1.0908     0.4384   2.488   0.0135 *  
## conditionspatial          -0.7468     0.4286  -1.742   0.0827 .  
## as.numeric(math_ability)   0.8926     0.1950   4.577 7.42e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.403 on 251 degrees of freedom
## Multiple R-squared:  0.123,  Adjusted R-squared:  0.1125 
## F-statistic: 11.73 on 3 and 251 DF,  p-value: 3.218e-07

math_lm_levels <- lm(score ~ sex + condition + I(levels(math_ability)[math_ability] > 3), data = d)
summary(math_lm_levels)

## 
## Call:
## lm(formula = score ~ sex + condition + I(levels(math_ability)[math_ability] > 
##     3), data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1750 -2.1771 -0.3817  1.8229 13.6183 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                     5.1771     0.4267  12.133
## sexM                                            1.2812     0.4415   2.902
## conditionspatial                               -0.7954     0.4371  -1.820
## I(levels(math_ability)[math_ability] > 3)TRUE   1.5121     0.4445   3.402
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## sexM                                          0.004039 ** 
## conditionspatial                              0.070002 .  
## I(levels(math_ability)[math_ability] > 3)TRUE 0.000779 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.463 on 251 degrees of freedom
## Multiple R-squared:  0.09165,    Adjusted R-squared:  0.08079 
## F-statistic: 8.442 on 3 and 251 DF,  p-value: 2.297e-05

ggplot(d, aes(x=score, color=sex)) +
  geom_density() + 
  facet_grid(math_ability ~ condition) +
  ggtitle("Density Plots of Math Ability")

sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(sum(!is.na((x))))}
ci <- function(x) {sem(x) * 1.96} # reasonable approximation 
math_graph_data <- d %>%
  group_by(math_ability) %>%
  summarise(mean = mean(score), ci = ci(score))

ggplot(math_graph_data, aes(x=math_ability, y=mean)) +
  geom_bar(position="dodge", stat="identity") + 
  theme_bw() +
  ylim(0, 30) +
  ylab("Mean Number Correct") +
  xlab("\"I learn mathematics quickly.\"") +
  ggtitle("Score by Math Ability") +
  theme(axis.title.x = element_blank()) +
  scale_x_discrete(limits=c(1,2,3,4,5), labels=c('Strongly Disagree','Disagree', 'Neither agree\nnor disagree','Agree', 'Strongly Agree')) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0.2,position=position_dodge(.9))

Discussion

Summary of Replication Attempt

The primary result of our confirmatory analysis shows that, while males do preform better overall than females, this effect is not moderated by condition. That is, the social condition does not lead to a smaller than difference in scores than the spatial condition. Consequently, this replication attempt failed to replicate the original result.

Commentary

Exploratory Analyses

The follow up exploratory analyses on math ability reflect prior research demonstrating a correlation between spatial reasoning and STEM success. Indeed, we see participants with higher self-reported ability to learn math preform better on the perspective-taking test, regardless of condition. However, we did not find the same correlation when we examine the relationship between score and computer science experience. This may, however, be a byproduct of the relative rarity of such experience among our participants. Over half of the participants have none or less than one year of CS experience.

Author Concerns

Prior to the study, the original author expressed concerns that the drop-down menus might add additional cognitive load to the task as compared to the original pencil-and-paper test. This additional difficulty could result in lower overall scores. We accounted for this difference by giving participants 20% more time than in the original study, however, we still found lower overall scores than in the original study, even under the strict exclusion criteria.

Meaning of Failed Replication

The failure to replicate could plausibly be due to differences between the original and present study. First, the study design did have additional cognitive load factors, as the original authors mentioned. Although we accounted for some of this difference through a change in time, the additional time may not fully compensate for the online design.

In this replication, we also ran the study on a different population of participants who may have been less motivated to complete the task quickly and accurately. This population could have led to a larger amount of variance in scores than the original study.

Alternatively, the original study design relies quite heavily on a state induction manipulation, which may not be very effective and is likely to cause participants to reflect on the nature and purpose of the manipulation itself. Despite the inclusion of a manipulation check (participants had to correctly fill in the blanks of the manipulation to continue), participants may not internalize this manipulation enough for it to affect performance.

Replication of Study 1 by Tarampi, Heydari, and Hegarty (2016, Psychological Science)

Griffin Dietz (gdietz44@stanford.edu)

December 12, 2018

Introduction

Methods

Power Analysis

Planned Sample

Materials

Procedure

Analysis Plan

Potential Additional Analysis

Differences from Original Study

Methods Addendum (Post Data Collection)

Actual Sample

Differences from pre-data collection methods plan

Results

Data preparation

Descriptive Statistics

Confirmatory analysis

Replication Graph

Original Graph

Confirmatory Analyses with Exclusions

Exploratory Analyses

T-Tests

CS Experience

Math Ability

Discussion

Summary of Replication Attempt

Commentary

Exploratory Analyses

Author Concerns

Meaning of Failed Replication