1. Introduction

Many social psychology studies are about interaction effects, and many of these have had unsuccesful replications. Interaction effects are found to be especially prone to fail replication in the replication project, and in this study.

In this document I explore whether interaction effects are less likely to replicate as a general rule, in a real, large data set. The data set is from the users’ answers to a large number of personal questions in the OKCupid data set.

We find that significant interaction effects explain very little of the variance, and that compared to significant main effects they are less significant, and are much less likely to replicate in a larger sample.

For the third variable we include either another ordinal variable, or gender. We find that interaction effects only replicate at 0.05 in ~9% of cases for ordinal variables, and ~22% for gender. Whereas main effects replicate in >95% of cases for either.

If we look at gender-specific effects (effects that are only significant for one gender), the replication mostly gives a non-gender-specific effect, and is very unlikely to find an effect only in that gender.

2. Analysis

Explanation

I generate 1,000,000 random regressions using the OKCupid data. An example of a regression would be:

y = “Do you like to listen to audiobooks?” a = “How was your childhood?”
b = gender

y = a + b + a * b

I look at main effects(a, b) and interaction effects (a * b).

I do this in a sample with the answers of 500 people. Then I look at the same regression in a larger sample of 5,000 people, and see whether the results that are significant in the n = 500 sample, are also significant in the larger sample.

For instance, one result could be that people who like audiobooks tend to have had a good childhood. This type of main effect result typically is replicated in the n = 5,000 sample.

Or another result could be that there is a significant interaction between gender and childhood quality in predicting whether the person likes audiobooks. This type of effect is typically not replicated in the n = 5,000 sample.

Imports and settings

library(pacman)
p_load(tidyverse, rms, broom, magrittr, glue, purrr, pander, ggpubr)
options(contrasts=c("contr.treatment", "contr.treatment"))
PanderOpts(digits = 3, table.split.table = Inf, missing = "-", table.alignment.default = "left")
theme_set(theme_bw(base_size=13))
set.seed(123)

Read data

answers = read_rds("data/parsed_data.rds") %>% filter(gender != "Other")
questions = read_csv2("data/question_data.csv")

Create a smaller original sample and larger replication sample

small_sample_n <- 250
large_sample_n <- 2500
Gender <- c("Male", "Female")
both_sample_n <- small_sample_n + large_sample_n
ordinal_vars <- questions %>% filter(Type == "O", N >= both_sample_n*2) %>% pull(Q)

tribble(~Gender, ~original, ~replication,
        "Male", small_sample_n, large_sample_n,
        "Female", small_sample_n, large_sample_n,
        "Combined", small_sample_n*2, large_sample_n*2) %>% 
  pander(caption="Sample sizes")

Sample sizes
Gender	original	replication
Male	250	2500
Female	250	2500
Combined	500	5000

tribble(~"Number of questions with sufficiently high sample size", ordinal_vars %>% length)

Number of questions with sufficiently high sample size
1622

Run analysis

analyze <- function(vars, effect, df){
  v1 <- vars[1]; v2 <- vars[2]; v3 <- vars[3]
  model <- if (effect == "interaction") glue("{v1} ~ {v2} * {v3}") 
           else glue("{v1} ~ {v2} + {v3}")
  m <- rms::lrm(model %>% as.formula, data = df, penalty = .01, tol = 1e-22)
  R2 <- m$stats[["R2"]]
  rowname <- if (effect == "interaction") " All Interactions" 
             else "TOTAL"
  P <- m %>% anova %>% tidy %>% filter(.rownames == rowname) %>% slice(1) %>% pull(P)
  
  list(R2 = R2, P = P)
}


add_column <- function(g, samplesize, effect, vars, third_variable, df_female, df_male){
  if (g != "Combined" & third_variable == "gender"){return}
  df <- if (g == "Combined") df 
        else df %>% filter(gender == g)
  rows <- if (samplesize == "small") seq(1,small_sample_n) 
          else seq(small_sample_n + 1, both_sample_n)
  df <- bind_rows(
    df_male %>% filter(row_number() %in% rows), 
    df_female %>% filter(row_number() %in% rows))
  tryCatch({
    res <- analyze(vars, effect, df)
    tibble(!!glue("{g}_{samplesize}_{effect}_R2") := res$R2, 
           !!glue("{g}_{samplesize}_{effect}_P") := res$P)
    }, error = function(e) NULL
  )
}

add_row <- function(third_variable, df){
  vars <- if (third_variable == "gender") c(sample(ordinal_vars, 2), "gender") 
          else sample(ordinal_vars, 3)  
  dfu <- df %>% select(vars, gender) %>% na.omit
  df_female <- dfu %>% filter(gender == "Woman")
  df_male <- dfu %>% filter(gender == "Man")
  if (df_female %>% nrow > both_sample_n & df_male %>% nrow > both_sample_n){
    crossing(g = c("Combined", "Man", "Woman"), 
             samplesize = c("large", "small"), 
             effect = c("interaction", "main")) %>% 
    pmap(add_column, vars, third_variable, df_female, df_male) %>% 
      bind_cols %>% mutate(Third_variable=third_variable)
  }
}

add_trio <- function(df) pmap(list(c("ordinal", "gender")), add_row, df) %>% bind_rows

#rerun(1000000, add_trio(answers)) %>% bind_rows %>% write_tsv(glue("trios.tsv"))

df <- read_tsv("data/trios.tsv")

3. Example questions

questions %>% filter(Q %in% sample(ordinal_vars, 5)) %>% 
  select(text, option_1, option_2, option_3, option_4)

text	option_1	option_2	option_3	option_4
Do you believe that because a person is overweight that it is because they are weak and without self control?	Yes / Usually	No	I’m Not Sure	-
How was your childhood?	Wonderful!	It was okay.	Not great, but I’m no worse for the wear.	Awful, and I have emotional issues as a result.
Do you enjoy creating mathematical problems to solve by yourself?	Yes.	No.	-	-
How many pillows do you sleep with at night?	None.	One.	Two.	Three or more.
Do you like to listen to audiobooks?	Yes.	No.	I’ve never tried them.	-

4. Results

Proportion of effects that replicate in the larger sample

prop <- df %>% group_by(Third_variable) %>% 
  summarise(
    Interactions = sum(
      Combined_small_interaction_P < 0.05 & Combined_large_interaction_P < 0.05, na.rm=T) / 
      sum(Combined_small_interaction_P < 0.05, na.rm=T),
    Main = sum(
      Combined_small_main_P < 0.05 & Combined_large_main_P < 0.05, na.rm=T) / 
      sum(Combined_small_main_P < 0.05, na.rm=T))

prop %>% gather(Effect, value, -Third_variable) %>% 
  ggplot(aes(x = Effect, y = value, fill=Third_variable)) +
  geom_bar(stat="identity", position="dodge", width=0.5, col="black") +
  ylab("Proportion replicated") + 
  scale_fill_manual(values=jcol)

Third_variable	Interactions	Main
gender	0.228	0.978
ordinal	0.0881	0.968

Replication of gender-specific effects

get_gender <- function(g1, df){
  g2 <- if (g1 == "Man") "Woman" else "Man"
  df %>% 
    filter(
      Combined_small_main_P > 0.05, 
      !!sym(glue("{g1}_small_main_P")) < 0.05, 
      !!sym(glue("{g2}_small_main_P")) > 0.05) %>% 
    summarise(
      gender = g1,
      only_gender = n(),
      repl_combined = sum(Combined_large_main_P < 0.05, na.rm=T),
      repl_females = sum(Woman_large_main_P < 0.05, na.rm=T),
      repl_males = sum(Man_large_main_P < 0.05, na.rm=T),
      repl_males_only = sum(Man_large_main_P < 0.05 & Woman_large_main_P > 0.05 & 
                            Combined_large_main_P > 0.05, na.rm=T),
      repl_females_only = sum(Woman_large_main_P < 0.05 & Man_large_main_P > 0.05 & 
                              Combined_large_main_P > 0.05, na.rm=T))
}

dfg <- pmap(list(c("Man", "Woman")), get_gender, df) %>% bind_rows

ymax_male <- dfg[1,2] %>% pull()
ymax_female <- dfg[2,2] %>% pull()

t3 <- dfg %>%   
  mutate(Combined = repl_combined / only_gender,
         Females = repl_females / only_gender,
         Males = repl_males / only_gender,
         Females_only = repl_females_only / only_gender,
         Males_only = repl_males_only / only_gender) %>%
  select(Combined, Females, Males, Females_only, Males_only, gender) %>% 
  gather(var, value, -gender) %>% 
  spread(gender, value)

p1 <- t3 %>% 
  ggplot(aes(var, Woman, fill=var)) + 
    geom_bar(stat="identity", fill="turquoise4", color="black") + 
    ylab("Proportion") + xlab("Replicated in") +
    ggtitle("Effects significant only in females") + 
    theme(axis.text.x = element_text(angle = 45, hjust=1),
          plot.title = element_text(size = 13)) + 
    ylim(0, 1)

p2 <- t3 %>% 
  ggplot(aes(var, Man)) + 
    geom_bar(stat="identity", fill="turquoise4", color="black") + 
    ylab("Proportion") + xlab("Replicated in") + 
    ggtitle("Effects significant only in males") + 
    theme(axis.text.x = element_text(angle = 45, hjust=1),
          plot.title = element_text(size = 14)) + 
    ylim(0, 1)

ggarrange(p1, p2, ncol=2)

gender	only_gender	repl_combined	repl_females	repl_males	repl_males_only	repl_females_only
Man	6200	5090	4210	4410	270	120
Woman	6880	5850	5290	4740	160	230

Description of effects

make_table <- function(df, size, effect){
  base <- glue("Combined_{size}_{effect}")
  df %>% group_by(Third_variable) %>% 
    mutate(total_n = n()) %>% 
    filter(!!sym(glue("{base}_P")) < 0.05) %>% 
    mutate(Proportion_significant = n() / total_n) %>% 
    summarise(
      Proportion_significant = mean(Proportion_significant),
      Mean_P_for_significant = mean(!!sym(glue("{base}_P"))), 
      Mean_R2_for_significant = ifelse(effect == "main", 
        mean(!!sym(glue("{base}_R2"))),
        mean(!!sym(glue("{base}_R2")) - !!sym(glue("Combined_{size}_main_R2"))))) %>% 
    pander(caption = glue("{size} sample - {effect} effects"))
}

make_table(df, "small", "main")

small sample - main effects
Third_variable	Proportion_significant	Mean_P_for_significant	Mean_R2_for_significant
gender	0.523	0.00768	0.064
ordinal	0.376	0.0108	0.0537

make_table(df, "large", "main")

large sample - main effects
Third_variable	Proportion_significant	Mean_P_for_significant	Mean_R2_for_significant
gender	0.91	0.00149	0.0333
ordinal	0.855	0.00253	0.0199

make_table(df, "small", "interaction")

small sample - interaction effects
Third_variable	Proportion_significant	Mean_P_for_significant	Mean_R2_for_significant
gender	0.0502	0.0242	0.0192
ordinal	0.0334	0.0255	0.0221

make_table(df, "large", "interaction")

large sample - interaction effects
Third_variable	Proportion_significant	Mean_P_for_significant	Mean_R2_for_significant
gender	0.138	0.018	0.00261
ordinal	0.0752	0.0223	0.0028

P-curves

plot_p <- function(df, var, title){
  ggplot(df, aes(x = !!sym(var))) + 
    geom_freqpoly(binwidth=0.005, size=1) +
    xlim(c(0.0, 0.05)) +
    ggtitle(title) + xlab("P-value") + 
    theme(plot.title = element_text(size = 13)) 
}

plot_comparison <- function(df1, df2, title){
  fig <- ggarrange(plot_p(df1, "Combined_small_interaction_P", "Interactions"), 
                   plot_p(df2, "Combined_small_main_P", "Main effects"), 
                   ncol=2, nrow=1, common.legend=T)
  annotate_figure(fig, top = text_grob(title, size = 14))
}

plot_comparison(df, df, "All effects")

plot_comparison(
  df %>% filter(Combined_large_interaction_P < 0.05), 
  df %>% filter(Combined_large_main_P < 0.05), 
  "Replicated effects")

plot_comparison(
  df %>% filter(Combined_large_interaction_P > 0.05), 
  df %>% filter(Combined_large_main_P > 0.05), 
  "Non-replicated effects")

Power

I arbitrarily define “true” effects as those with a p value below 0.001 in the large sample. What is the power to detect those in the small sample?

power <- df %>% group_by(Third_variable) %>% 
  summarise(
    Interactions = sum(
      Combined_small_interaction_P < 0.05 & Combined_large_interaction_P < 0.001, na.rm=T) / 
      sum(Combined_large_interaction_P < 0.001, na.rm=T),
    Main = sum(
      Combined_small_main_P < 0.05 & Combined_large_main_P < 0.001, na.rm=T) / 
      sum(Combined_large_main_P < 0.001, na.rm=T))

power %>% gather(Effect, value, -Third_variable) %>% 
  ggplot(aes(x = Effect, y = value, fill=Third_variable)) +
  geom_bar(stat="identity", position="dodge", width=0.5, col="black") +
  ylab("Power to identify in small sample") + 
  scale_fill_manual(values=jcol) +
  ylim(0, 1)

Pre-registered study.

What if you pick a random effect, and stick by it. This could be viewed as an assumption of no publication bias.

prop <- df %>% group_by(Third_variable) %>% 
  summarise(
    Interactions_not_replicated = sum(
      Combined_small_interaction_P < 0.05 & Combined_small_interaction_P > 0.01 & Combined_large_interaction_P > 0.05, na.rm=T) / 
      sum(Combined_small_interaction_P < 1.1, na.rm=T),
    Interactions_replicated = sum(
      Combined_small_interaction_P < 0.05 & Combined_small_interaction_P > 0.01  & Combined_large_interaction_P < 0.05, na.rm=T) / 
      sum(Combined_small_interaction_P < 1.1, na.rm=T),
    Main_not_replicated = sum(
      Combined_small_main_P < 0.05 & Combined_small_main_P > 0.01 & Combined_large_main_P > 0.05, na.rm=T) / 
      sum(Combined_small_main_P < 1.1, na.rm=T),
    Main_replicated = sum(
      Combined_small_main_P < 0.05 & Combined_small_main_P > 0.01 & Combined_large_main_P < 0.05, na.rm=T) / 
      sum(Combined_small_main_P < 1.1, na.rm=T))

prop %>% gather(Effect, value, -Third_variable) %>% 
  ggplot(aes(x = Effect, y = value, fill=Third_variable)) +
  geom_bar(stat="identity", position="dodge", width=0.5, col="black") +
  ylab("Proportion") + 
  scale_fill_manual(values=jcol)

df %>% filter(Third_variable == "ordinal", Combined_small_interaction_P < 0.05) %>% drop_na() %>% nrow()

## [1] 5960

Interaction effects do not replicate

Jonatan Pallesen, 27. December 2018