Study 2 of Tworek and Cimpian (2016) investigated whether inherence bias is positively related to subjects’ ought inferences when reasoning about typical behaviors. Inherence bias is defined as a preference for explanations that appeal to intrinsic qualities (e.g., an object’s color) over explanations that appeal to extrinsic qualities (e.g., historical context). To measure inherence bias, subjects read fifteen extrinsic explanations and fifteen intrinsic explanations and recorded their endorsement of each. Ought inferences were measured by showing subjects twelve statements about different behaviors. Six of these behaviors were typical, and six were atypical. Subjects then answered two questions about each behavior: one asked how “right” or “wrong” the behavior was, and one asked whether people should perform the behavior. Subjects’ education level, conservatism, and answers to a short Cognitive Reflection Test (CRT) were also recorded as control measures.
A sample of 130 participants will be recruited from Amazon Mechanical Turk. Subjects will be asked to record their endorsements of 15 pairs of explanations of different phenomena, to answer 12 pairs of questions about twelve different human behaviors, and to answer three short CRT questions. They will be asked about their education level and political leaning at the end of the survey.
To view the experiment, click here: http://web.stanford.edu/~skaltman/study2.html.
The original study (n = 112) found that the interaction between the measure of inherence bias and behavior typicality was a significant predictor of the measure of ought inferences (\(b\) = 2.44, 95% CI = [.78, 4.10]) in a linear mixed-effects model. We estimated the effect size by calculating Cohen’s \(d_z\) for the difference between the coefficient and 0 (\(d_z = 0.27\)). At the original sample size, power to detect this effect was .81. To achieve 80% power, a sample size of 110 is needed; to achieve 90% power, 147; and to achieve 95% power, 181.
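For reference, these sample sizes can be reproduced with the pwr package (not part of the original materials), treating \(d_z\) as a one-sample effect size; the call below is a sketch of that calculation.
#sketch: required sample sizes for d_z = 0.27 at 80%, 90%, and 95% power
library(pwr)
sapply(c(.80, .90, .95), function(power) {
  ceiling(pwr.t.test(d = 0.27, sig.level = .05, power = power,
                     type = "one.sample")$n)
})
#should yield approximately 110, 147, and 181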
The original study tested a total of 139 participants and excluded 27 (19% of all participants) because they were either outside the US, failed 2 or more of the attention checks, or because they indicated during the debriefing that they were not paying attention.
To achieve 80% power, we need 110 participants after exclusions. To account for the exclusion of approximately 19% of all participants, we will run 136 participants (\(136 - 136 \times .19 \approx 110\)).
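Equivalently, the target of 136 comes from inflating the post-exclusion requirement of 110 by the expected 19% exclusion rate; a quick check in R:
#inflate the required n by the ~19% expected exclusion rate
ceiling(110 / (1 - .19))
#returns 136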
All materials are here: https://osf.io/4kanr/.
The procedure from the original paper was followed. The text from the original paper is as follows:
“Procedure. Participants were tested online via Mechanical Turk using Qualtrics software. The ought measure, the measure of inherence bias, and the CRT were presented in random order. Item order was randomized for all scales. The measures of participants’ education and conservatism were administered with other demographic questions at the end of the survey.”
The analysis plan from the original paper was followed. The following is quoted from the original paper:
“Analytic strategy. Because we manipulated behavior typicality within subjects, we used a multilevel model to analyze our data. The model included cross-classified random effects (specifically, intercepts) for subjects and items. Participants’ ought inferences, calculated as the average of their responses to the two ought questions on each trial, served as the dependent variable. The model included as independent variables the typicality of each stimulus behavior (0 = atypical, 1 = typical), participants’ scores on the measure of inherence bias, and the three control measures (i.e., CRT, education, and conservatism). The model also included the two-way interactions between behavior typicality and each of the latter four variables. We hypothesized a positive relationship between participants’ inherence bias and their ought inferences for typical—but not atypical—behaviors. Thus, our main prediction was of a significant two-way interaction between the measure of inherence bias and behavior typicality. Including the other two-way interactions (with CRT, education, and conservatism) in the model enabled us to explore whether the relationships between these control variables and ought inferences also differed for typical and atypical behaviors. Adjusting for these potential relationships was a conservative analysis strategy; in alternative models that did not include these interactions, the predicted relationship was estimated to be larger in magnitude. For ease of interpretation, we present unstandardized coefficients below. Given the coding of the behavior-typicality variable, the first-order coefficients for the measure of inherence bias, CRT, education, and conservatism in this model are simply the slopes of the relationships between these variables and ought inferences for atypical behaviors. Moreover, the slopes for typical behaviors can easily be calculated by adding each first-order coefficient to the coefficient for the corresponding two-way interaction.”
We will fit a linear mixed-effects model. The key statistic of interest is the p-value of the coefficient for the interaction between inherence bias and typicality.
In addition to the exclusion criteria used in the original study, we will also exclude participants who answer “0” for both the “should” and “right/wrong” questions on more than three trials in the ought-inference section, in order to exclude participants who do not actually choose any options.
A total of 103 participants were recruited from Amazon Mechanical Turk (42 female; 61 male). Participants received $1.21 for participation. An additional 33 participants were tested but excluded from analysis. Of these 33, 18 were excluded because they failed two or more attention checks (out of four total) and/or answered “0” for both questions on more than 3 trials in the ought-inference section. The remaining 15 were excluded because either no data were collected (n = 4) or there was a data collection error (n = 11).
Fifteen participants had to be excluded because of a bug in the data collection process.
Data preparation following the analysis plan.
#load the packages used for data preparation and analysis
library(tidyverse) #dplyr, tidyr, purrr, stringr, forcats, ggplot2, readr
library(jsonlite) #for parsing the raw JSON responses

directory <- "~/GitHub/Psych254/Tworek2016/data/"
csv_files <- dir(directory, pattern = "\\.csv$")
The following loops through all of the CSV files containing the data and builds the data frame. It also removes participants for whom no data were collected or whose data were recorded incorrectly:
data <- tibble() #initialize the data frame that will hold all participants' data
for (file in csv_files) {
  jd <- read_csv(paste0(directory, file))
  jd <-
    jd %>%
    mutate(data = iconv(jd$Answer.data, "latin1", "ASCII", sub = "")) %>%
    select(-Answer.data)
  answers <- jd$data
  if (!is.na(answers)) { #filter out subjects for whom no data was collected
    answers <- fromJSON(answers) #parse the JSON string of responses
    if (length(answers$extrinsic) == 16 & length(answers$intrinsic) == 16) { #filter out subjects for whom data points were recorded twice
      data_participant <- get_data_from_json(jd$WorkerId, answers)
      data <- bind_rows(data, data_participant)
    }
  }
}
The original paper excluded participants who failed two or more of the four attention checks. We also excluded participants who put “0” for both the should and right/wrong questions on more than 3 prompts, to eliminate participants who simply clicked the “next” button without recording a response.
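The exclusion tibbles referenced in the next chunk (to_exclude_inherence_extrinsic, to_exclude_inherence_intrinsic, to_exclude_ought_should, to_exclude_ought_rightwrong, and to_exclude_zero) were built in earlier chunks not shown here. As one illustration, to_exclude_zero could be derived along the following lines; this is only a sketch, and the measure labels “should” and “rightwrong” are assumptions based on the variable names:
#sketch (assumed measure labels): ids of participants who answered 0 on both
#ought questions on more than 3 trials
to_exclude_zero <-
  data %>%
  filter(measure %in% c("should", "rightwrong")) %>%
  group_by(id, prompt) %>%
  summarise(both_zero = all(near(value, 0))) %>%
  group_by(id) %>%
  summarise(n_zero_trials = sum(both_zero)) %>%
  filter(n_zero_trials > 3) %>%
  select(id)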
#create list of id's of participants who failed 2 or more attention checks
to_exclude <-
  bind_rows(to_exclude_inherence_extrinsic,
            to_exclude_inherence_intrinsic,
            to_exclude_ought_should,
            to_exclude_ought_rightwrong) %>%
  group_by(id) %>%
  filter(n() >= 2) %>%
  bind_rows(to_exclude_zero) %>%
  .$id
#filter out participants that we want to exclude
data <-
  data %>%
  filter(!(id %in% to_exclude))
The following creates a tibble with the measure of inherence bias for each participant. The original paper defined inherence bias as the difference between a participant’s endorsement of intrinsic explanations and his/her endorsement of extrinsic explanations.
inherence_scores <-
  data %>%
  group_by(id) %>%
  filter(prompt_type == "inherence") %>%
  tidyr::spread(measure, value) %>%
  summarise(inherence_bias = mean(intrinsic) - mean(extrinsic))
The following creates a tibble of the CRT scores for each participant:
#score the nurse CRT question
grade_nurse_crt <-
  data %>%
  filter(prompt_type == "crt",
         str_detect(prompt, "nurses")) %>%
  mutate(nurse_score = ifelse(near(value, 2.00), 1, 0))

#score the salad CRT question
grade_salad_crt <-
  data %>%
  filter(prompt_type == "crt",
         str_detect(prompt, "salad")) %>%
  mutate(salad_score = ifelse(near(value, 2.25), 1, 0))

#score the sally CRT question
grade_sally_crt <-
  data %>%
  filter(prompt_type == "crt",
         str_detect(prompt, "Sally")) %>%
  mutate(sally_score = ifelse(near(value, 5.00), 1, 0))

#join all CRT scores
crt_scores <-
  grade_nurse_crt %>%
  left_join(grade_salad_crt, by = "id") %>%
  left_join(grade_sally_crt, by = "id") %>%
  group_by(id) %>%
  summarise(crt_score = sum(sally_score, nurse_score, salad_score)/3)
The following removes the attention-check prompts so that they are not included in the analysis:
data_filtered <-
  data %>%
  filter(prompt != attention_check_inherence,
         prompt != attention_check_ought)
Now, we can join the data with the inherence bias measures and CRT scores:
data_scores <-
  data_filtered %>%
  group_by(id) %>%
  left_join(inherence_scores, by = "id") %>%
  left_join(crt_scores, by = "id") %>%
  filter(measure == "average") %>%
  rename(ought_inference = value)
Finally, we need to add in the binary typicality measure for each prompt:
data_final <-
  data_scores %>%
  mutate(typicality = as.integer(ifelse(prompt %in% typicals, 1, 0)))
The data is now ready for analysis.
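As a quick sanity check before modeling, every remaining participant should contribute six typical and six atypical trials; the snippet below is one way to verify this (a sketch, not part of the original analysis code):
#sketch: each participant should contribute 6 trials at each typicality level,
#so this should return exactly two rows (typicality 0 and 1, both with n_trials = 6)
data_final %>%
  group_by(id, typicality) %>%
  summarise(n_trials = n()) %>%
  ungroup() %>%
  distinct(typicality, n_trials)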
The following is the results table from the original paper:
As detailed in the original paper, we fit a linear mixed-effects model with crossed random intercepts for subjects and items.
fit <- lme4::lmer(ought_inference ~ inherence_bias*typicality +
                    crt_score*typicality + education*typicality +
                    political*typicality + (1 | id) + (1 | prompt),
                  data = data_final)
coefs <- summary(fit)$coef
names <- c("Intercept", "Inherence Bias", "Typicality",
           "Cognitive Reflection Test", "Education", "Conservatism",
           "Inherence Bias x Typicality",
           "Typicality x Cognitive Reflection Test",
           "Typicality x Education", "Typicality x Conservatism")
df <-
  as_tibble(summary(fit)$coef) %>%
  mutate(Predictor = names,
         p = 2 * (1 - pnorm(abs(`t value`)))) %>%
  dplyr::select(Predictor, everything())
df %>%
  filter(Predictor != "Intercept") %>%
  knitr::kable()
| Predictor | Estimate | Std. Error | t value | p |
|---|---|---|---|---|
| Inherence Bias | -0.7257029 | 0.5538049 | -1.3103944 | 0.1900624 |
| Typicality | 13.0220087 | 6.9902155 | 1.8628909 | 0.0624776 |
| Cognitive Reflection Test | -0.7331833 | 3.0959566 | -0.2368196 | 0.8127967 |
| Education | 1.8198478 | 1.1519040 | 1.5798606 | 0.1141388 |
| Conservatism | -0.5809510 | 0.4609586 | -1.2603106 | 0.2075573 |
| Inherence Bias x Typicality | 3.7368447 | 0.6250524 | 5.9784508 | 0.0000000 |
| Typicality x Cognitive Reflection Test | -0.5822460 | 3.5093161 | -0.1659144 | 0.8682243 |
| Typicality x Education | -0.1874904 | 1.3058717 | -0.1435749 | 0.8858361 |
| Typicality x Conservatism | 2.2128305 | 0.5242735 | 4.2207556 | 0.0000243 |
The following isolates the key test of interest:
df %>%
  filter(Predictor == "Inherence Bias x Typicality") %>%
  knitr::kable()
| Predictor | Estimate | Std. Error | t value | p |
|---|---|---|---|---|
| Inherence Bias x Typicality | 3.736845 | 0.6250524 | 5.978451 | 0 |
We also computed the 95% confidence intervals for the coefficients, as done in the original paper:
ci <-
  confint.merMod(fit, level = .95) %>%
  as_tibble() %>%
  slice(5:n()) %>% #drop the random-effects variances and the intercept
  mutate(Predictor = c("Inherence Bias",
                       "Typicality",
                       "Cognitive Reflection Test",
                       "Education", "Conservatism",
                       "Inherence Bias x Typicality",
                       "Typicality x Cognitive Reflection Test",
                       "Typicality x Education",
                       "Typicality x Conservatism")) %>%
  dplyr::select(Predictor, everything())
ci
## # A tibble: 9 × 3
## Predictor `2.5 %` `97.5 %`
## <chr> <dbl> <dbl>
## 1 Inherence Bias -1.8007635 0.3510878
## 2 Typicality -0.5201085 26.5821956
## 3 Cognitive Reflection Test -6.7434075 5.2858909
## 4 Education -0.4243932 4.0520952
## 5 Conservatism -1.4767178 0.3143380
## 6 Inherence Bias x Typicality 2.5122562 4.9604286
## 7 Typicality x Cognitive Reflection Test -7.4604541 6.2847742
## 8 Typicality x Education -2.7416302 2.3735209
## 9 Typicality x Conservatism 1.1856862 3.2391014
Original plot:
We recreated the plot with the new data:
mean_ib <- mean(data_final$inherence_bias, na.rm = TRUE)
sd_ib <- sd(data_final$inherence_bias, na.rm = TRUE)
#classifies a participant's inherence bias as at least 1 SD above the mean, at least 1 SD below the mean, or neither
find_ib_group <- function(inherence_bias) {
  if (inherence_bias <= mean_ib - sd_ib) {
    return("below")
  } else if (inherence_bias >= mean_ib + sd_ib) {
    return("above")
  } else {
    return("NA")
  }
}
#tibble grouped by inherence bias level
by_ib_group <-
  data_final %>%
  mutate(inherence_bias_group =
           map_chr(inherence_bias, find_ib_group)) %>%
  filter(inherence_bias_group != "NA") %>%
  group_by(inherence_bias_group, typicality, id) %>%
  summarise(mean_ought_inference = mean(ought_inference, na.rm = TRUE)) %>%
  group_by(inherence_bias_group, typicality) %>%
  summarise(grand_mean_ought_inference =
              mean(mean_ought_inference, na.rm = TRUE),
            se = sd(mean_ought_inference, na.rm = TRUE)/sqrt(n()))
#recreate plot from original paper
by_ib_group %>%
  ggplot(aes(fct_rev(as_factor(inherence_bias_group)), #reverses the order of the factor levels
             grand_mean_ought_inference, linetype = as.factor(typicality))) +
  geom_point() +
  geom_errorbar(aes(ymin = grand_mean_ought_inference - se,
                    ymax = grand_mean_ought_inference + se),
                size = .25, width = .05) +
  geom_line(aes(group = typicality)) +
  labs(y = "Ought Inference (0-100)",
       x = NULL) +
  scale_x_discrete(labels = c("Weak Inherence Bias\n(1 SD Below the\nMean)",
                              "Strong Inherence Bias\n(1 SD Above the\nMean)")) +
  scale_y_continuous(breaks = seq(0, 100, 20)) +
  coord_cartesian(ylim = c(0, 100)) +
  scale_linetype_discrete(name = NULL,
                          labels = c("Atypical behaviors", "Typical behaviors")) +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank())
We successfully replicated the key finding of Tworek and Cimpian (2016). The predicted interaction between inherence bias and behavior typicality was significant in the fitted model (\(b\) = 3.737, 95% CI = [2.512, 4.960], \(p < .001\)). The interaction between behavior typicality and conservatism was also significant (\(b\) = 2.213, 95% CI = [1.186, 3.239], \(p < .001\)). Conservatism itself was a significant predictor in the original study, but it was not in this replication.
The calculated p-value for the interaction between inherence bias and behavior typicality was very low (\(1 \times 10^{-7}\)), as was the p-value for the interaction between behavior typicality and conservatism (\(6 \times 10^{-5}\)). The former is roughly four orders of magnitude smaller than the p-value obtained in the original study. Our p-values were obtained using the normal-approximation method described in Barr et al. (2013), that is, by treating the t statistics as z statistics; the original paper may have used a different approximation or method for obtaining its p-values.
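For comparison, the Satterthwaite approximation implemented in the lmerTest package is one alternative way of obtaining p-values for the fixed effects; the refit below is a sketch of that check, not the method used by the original authors:
#sketch: refit the same model with lmerTest so that summary() reports
#Satterthwaite-approximated degrees of freedom and p-values
library(lmerTest)
fit_satt <- lmer(ought_inference ~ inherence_bias*typicality +
                   crt_score*typicality + education*typicality +
                   political*typicality + (1 | id) + (1 | prompt),
                 data = data_final)
summary(fit_satt)$coefficients["inherence_bias:typicality", ]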
It is also possible that the added exclusion criterion (excluding participants who answered “0” on both ought-inference questions on more than 3 trials) biased the data in some way.