For this exercise, please try to reproduce the results from Experiment 2 of the associated paper (de la Fuente, Santiago, Roman, Dumitrache, & Casasanto, 2014). The PDF of the paper is included in the same folder as this Rmd file.

Methods summary:

Researchers tested whether temporal focus differs between Moroccan and Spanish cultures, hypothesizing that Moroccans are more past-focused, whereas Spaniards are more future-focused. Two groups of participants (\(N = 40\) Moroccan and \(N = 40\) Spanish) completed a temporal-focus questionnaire that contained past-focused (“PAST”) and future-focused (“FUTURE”) questions. Participants rated each question on a 5-point Likert scale, with lower scores indicating less agreement and higher scores indicating greater agreement. The authors then performed a mixed-design ANOVA with agreement score as the dependent variable, group (Moroccan vs. Spanish) as a between-subjects factor, and temporal focus (past vs. future) as a within-subjects factor. In addition, the authors performed unpaired two-sample t-tests to determine whether the two groups differed significantly in agreement scores for the PAST questions and for the FUTURE questions.
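As a preview of the pipeline implemented in Step 4 below, here is a minimal sketch of those two analyses, assuming a data frame of participant-level mean ratings (hypothetically named d_means) with columns participant, group, subscale, and mean_rating:

library(afex)

# Mixed ANOVA: group between subjects, temporal focus (subscale) within subjects
sketch_anova <- aov_ez(
  id = "participant",
  dv = "mean_rating",
  between = "group",
  within = "subscale",
  data = d_means
)

# Unpaired Student's t-tests (equal variances assumed) within each subscale
t.test(mean_rating ~ group, data = subset(d_means, subscale == "PAST"), var.equal = TRUE)
t.test(mean_rating ~ group, data = subset(d_means, subscale == "FUTURE"), var.equal = TRUE)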


Target outcomes:

Below is the specific result you will attempt to reproduce (quoted directly from the results section of Experiment 2):

According to a mixed analysis of variance (ANOVA) with group (Spanish vs. Moroccan) as a between-subjects factor and temporal focus (past vs. future) as a within-subjects factor, temporal focus differed significantly between Spaniards and Moroccans, as indicated by a significant interaction of temporal focus and group, F(1, 78) = 19.12, p = .001, ηp2 = .20 (Fig. 2). Moroccans showed greater agreement with past-focused statements than Spaniards did, t(78) = 4.04, p = .001, and Spaniards showed greater agreement with future-focused statements than Moroccans did, t(78) = −3.32, p = .001. (de la Fuente et al., 2014, p. 1685).


Step 1: Load packages

library(tidyverse) # for data munging
library(knitr) # for kable table formatting
library(haven) # import and export 'SPSS', 'Stata' and 'SAS' Files
library(readxl) # import excel files

# Optional packages/functions:
library(afex) # ANOVA functions
# library(ez) # alternative ANOVA functions
# library(scales) # for plotting
std.err <- function(x) sd(x)/sqrt(length(x)) # standard error

library(ggplot2) # already attached via tidyverse; loaded explicitly here

Step 2: Load data

# Just Experiment 2

# Original data
data_path <- 'data/DeLaFuenteEtAl_2014_RawData.xls'
d <- read_excel(data_path, sheet=3)

# Updated data (first update)
data_path_2 <- 'data/DeLaFuenteEtAl_2014_RawData-OSF-corrected.xls'
d2 <- read_excel(data_path_2, sheet=3)

# Updated data (second update)
data_path_3 <- 'data/DeLaFuenteEtAl_2014_RawData-OSF-corrected_EHedit.xls'
d3 <- read_excel(data_path_3, sheet=3)

Step 3: Tidy data

# Data with clean factor levels for participant and variable names

# Original data
d_clean <- d %>%
  rename(
    "agree_rating" = `Agreement (0=complete disagreement; 5=complete agreement)`
  ) %>%
  mutate(
    group = ifelse(
      group == "young Spaniard", "Spaniard", group
    )
  ) %>%
  mutate(
    group = factor(
      group,
      levels = c("Spaniard", "Moroccan")
    ),
    subscale = factor(
      subscale,
      levels = c("PAST", "FUTURE")
    ),
    participant = as.factor(str_c(group, "_", participant))
  )
# Check how many items are recorded for each participant
d_clean %>% 
  group_by(participant) %>% 
  summarize(n_items_per_participant = n(), .groups = "drop") %>%
  group_by(n_items_per_participant) %>%
  summarize(n_participants = n(), .groups = "drop")
## # A tibble: 4 × 2
##   n_items_per_participant n_participants
##                     <int>          <int>
## 1                       4              2
## 2                      17              2
## 3                      21             74
## 4                      42              2
# Updated data (first update)
d_clean2 <- d2 %>%
  rename(
    "agree_rating" = `Agreement (0=complete disagreement; 5=complete agreement)`
  ) %>%
  mutate(
    group = ifelse(
      group == "young Spaniard", "Spaniard", group
    )
  ) %>%
  mutate(
    group = factor(
      group,
      levels = c("Spaniard", "Moroccan")
    ),
    subscale = factor(
      subscale,
      levels = c("PAST", "FUTURE")
    ),
    participant = as.factor(str_c(group, "_", participant))
  )
# Check how many items are recorded for each participant
d_clean2 %>% 
  group_by(participant) %>% 
  summarize(n_items_per_participant = n(), .groups = "drop") %>%
  group_by(n_items_per_participant) %>%
  summarize(n_participants = n(), .groups = "drop")
## # A tibble: 4 × 2
##   n_items_per_participant n_participants
##                     <int>          <int>
## 1                       4              1
## 2                      17              1
## 3                      21             77
## 4                      42              1
# Updated data (second update)
d_clean3 <- d3 %>%
  rename(
    "agree_rating" = `Agreement (0=complete disagreement; 5=complete agreement)`
  ) %>%
  mutate(
    group = ifelse(
      group == "young Spaniard", "Spaniard", group
    )
  ) %>%
  mutate(
    group = factor(
      group,
      levels = c("Spaniard", "Moroccan")
    ),
    subscale = factor(
      subscale,
      levels = c("PAST", "FUTURE")
    ),
    participant = as.factor(str_c(group, "_", participant))
  )
# Check how many items are recorded for each participant
d_clean3 %>% 
  group_by(participant) %>% 
  summarize(n_items_per_participant = n(), .groups = "drop") %>%
  group_by(n_items_per_participant) %>%
  summarize(n_participants = n(), .groups = "drop")
## # A tibble: 1 × 2
##   n_items_per_participant n_participants
##                     <int>          <int>
## 1                      21             80
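As an additional sanity check (not part of the original write-up), the final data file should contain 40 participants per group; a quick count on the cleaned data:

# Count distinct participants per group (expecting 40 Spaniard and 40 Moroccan)
d_clean3 %>%
  distinct(group, participant) %>%
  count(group)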

Step 4: Run analysis

Pre-processing

# Original data

# Data frame for plotting group-level means
d_group_summary <- d_clean %>%
  group_by(group, subscale) %>%
  summarize(
    mean_rating = mean(agree_rating),
    se_rating = std.err(agree_rating),
    .groups = "drop"
  )

# Data frame with participant-level means
d_participant_summary <- d_clean %>%
  group_by(group, subscale, participant) %>%
  summarize(
    mean_rating = mean(agree_rating),
    .groups = "drop"
  )

# First update

# Data frame for plotting group-level means
d_group_summary2 <- d_clean2 %>%
  group_by(group, subscale) %>%
  summarize(
    mean_rating = mean(agree_rating),
    se_rating = std.err(agree_rating),
    .groups = "drop"
  )

# Data frame with participant-level means
d_participant_summary2 <- d_clean2 %>%
  group_by(group, subscale, participant) %>%
  summarize(
    mean_rating = mean(agree_rating),
    .groups = "drop"
  )

# Second update

# Data frame for plotting group-level means
d_group_summary3 <- d_clean3 %>%
  group_by(group, subscale) %>%
  summarize(
    mean_rating = mean(agree_rating),
    se_rating = std.err(agree_rating),
    .groups = "drop"
  )

# Data frame with participant-level means
d_participant_summary3 <- d_clean3 %>%
  group_by(group, subscale, participant) %>%
  summarize(
    mean_rating = mean(agree_rating),
    .groups = "drop"
  )
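To see at a glance how the three data files differ, one option (a sketch, not part of the original analysis) is to stack the three group-level summaries with a version label:

# Compare group-level means and SEs across the three data versions
bind_rows(
  original = d_group_summary,
  update_1 = d_group_summary2,
  update_2 = d_group_summary3,
  .id = "data_version"
) %>%
  arrange(group, subscale, data_version)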

Descriptive statistics

Try to recreate Figure 2 (fig2.png, also included in the same folder as this Rmd file):

# Plot based on the final data file (second update)
d_group_summary3 %>%
  ggplot(
    aes(x = group, y = mean_rating, fill = subscale)
  ) +
  geom_bar(
    stat = "identity",
    position = "dodge",
    color = "grey10"
  ) +
  geom_errorbar(
    aes(
      ymin = mean_rating - se_rating,
      ymax = mean_rating + se_rating
    ),
    position = position_dodge(0.9),
    color = "grey10",
    width = 0
  ) +
  scale_fill_manual(
    name = "",
    labels = c("Past-Focused Statements", "Future-Focused Statements"),
    values = c("grey40", "grey80")
  ) +
  labs(
    x = "",
    y = "Rating"
  ) +
  coord_cartesian(ylim = c(2, 4)) +
  theme_classic()
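To compare against fig2.png side by side, the plot can also be written to disk with ggsave(); the output filename below is an assumption, not something prescribed by the original materials:

# ggsave() writes the most recently displayed ggplot by default
ggsave("fig2_reproduction.png", width = 6, height = 4, dpi = 300)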

Inferential statistics

According to a mixed analysis of variance (ANOVA) with group (Spanish vs. Moroccan) as a between-subjects factor and temporal focus (past vs. future) as a within-subjects factor, temporal focus differed significantly between Spaniards and Moroccans, as indicated by a significant interaction of temporal focus and group, F(1, 78) = 19.12, p = .001, ηp2 = .20 (Fig. 2).

V1: Original data

Incorrect degrees of freedom, F-statistic, p-value, and ηp2. Missing 2 participants’ data.

# reproduce the above results here

# Using the data frame with mean responses to each subscale per participant,
# because aov_ez aggregates to participant means under the hood anyway
# if item-level ratings are provided
main_anova <- 
  d_participant_summary %>%
  # d_clean %>%
  aov_ez(
    id = "participant",
    # dv = "agree_rating",
    dv = "mean_rating",
    between = "group",
    within = "subscale",
    data = .
  )

# Problem: 2 participants (participant #25 in both the Moroccan and Spaniard groups)
# are missing all past-focused subscale ratings, reducing the overall N
# and likely changing the statistical results

nice(main_anova, es = "pes")
## Anova Table (Type 3 tests)
## 
## Response: mean_rating
##           Effect    df  MSE         F  pes p.value
## 1          group 1, 76 0.20      2.19 .028    .143
## 2       subscale 1, 76 0.50   7.98 ** .095    .006
## 3 group:subscale 1, 76 0.50 18.35 *** .194   <.001
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
print(
  str_c(
    "Interaction of temporal focus and group, F(",
    nice(main_anova, es = "pes")$df[[3]],              # DF
    ") = ", round(main_anova$anova_table$F[[3]], 2),             # F-statistic
    ", p = ", round(main_anova$anova_table$`Pr(>F)`[[3]], 3),   # p-value
    ", ηp2 = ", round(as.numeric(nice(main_anova, es = "pes")$pes[[3]]), 2) # ηp2
  )
)
## [1] "Interaction of temporal focus and group, F(1, 76) = 18.35, p = 0, ηp2 = 0.19"
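Note that round() reports very small p-values as 0 in the string above. A small hedged alternative, using base R's format.pval(), would print them in a "<.001" style instead (shown here only for illustration, not to change the analysis):

# Helper to format p-values without rounding them down to 0
report_p <- function(p) format.pval(p, digits = 3, eps = .001)
report_p(main_anova$anova_table$`Pr(>F)`[[3]])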

V2: First update

Incorrect degrees of freedom, F-statistic, and p-value; correct ηp2 (at least, within rounding error). Missing 1 participant’s data.

# reproduce the above results here

# Using the data frame with mean responses to each subscale per participant,
# because aov_ez aggregates to participant means under the hood anyway
# if item-level ratings are provided
main_anova2 <- 
  d_participant_summary2 %>%
  # d_clean2 %>%
  aov_ez(
    id = "participant",
    # dv = "agree_rating",
    dv = "mean_rating",
    between = "group",
    within = "subscale",
    data = .
  )

# Problem: 1 participant (Spaniard) still has missing data

nice(main_anova2, es = "pes")
## Anova Table (Type 3 tests)
## 
## Response: mean_rating
##           Effect    df  MSE         F  pes p.value
## 1          group 1, 77 0.21      2.45 .031    .122
## 2       subscale 1, 77 0.50   7.86 ** .093    .006
## 3 group:subscale 1, 77 0.50 18.89 *** .197   <.001
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
print(
  str_c(
    "Interaction of temporal focus and group, F(",
    nice(main_anova2, es = "pes")$df[[3]],              # DF
    ") = ", round(main_anova2$anova_table$F[[3]], 2),             # F-statistic
    ", p = ", round(main_anova2$anova_table$`Pr(>F)`[[3]], 3),   # p-value
    ", ηp2 = ", round(as.numeric(nice(main_anova2, es = "pes")$pes[[3]]), 2) # ηp2
  )
)
## [1] "Interaction of temporal focus and group, F(1, 77) = 18.89, p = 0, ηp2 = 0.2"

V3: Second update

Correct degrees of freedom / number of participants, and correct ηp2 (at least, within rounding error); incorrect F-statistic and p-value.

# reproduce the above results here

# Using the data frame with mean responses to each subscale per participant,
# because aov_ez aggregates to participant means under the hood anyway
# if item-level ratings are provided
main_anova3 <- 
  d_participant_summary3 %>%
  # d_clean3 %>%
  aov_ez(
    id = "participant",
    # dv = "agree_rating",
    dv = "mean_rating",
    between = "group",
    within = "subscale",
    data = .
  )

nice(main_anova3, es = "pes")
## Anova Table (Type 3 tests)
## 
## Response: mean_rating
##           Effect    df  MSE         F  pes p.value
## 1          group 1, 78 0.21    2.88 + .036    .094
## 2       subscale 1, 78 0.51   8.10 ** .094    .006
## 3 group:subscale 1, 78 0.51 19.14 *** .197   <.001
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
print(
  str_c(
    "Interaction of temporal focus and group, F(",
    nice(main_anova3, es = "pes")$df[[3]],              # DF
    ") = ", round(main_anova3$anova_table$F[[3]], 2),             # F-statistic
    ", p = ", round(main_anova3$anova_table$`Pr(>F)`[[3]], 3),   # p-value
    ", ηp2 = ", round(as.numeric(nice(main_anova3, es = "pes")$pes[[3]]), 2) # ηp2
  )
)
## [1] "Interaction of temporal focus and group, F(1, 78) = 19.14, p = 0, ηp2 = 0.2"

Moroccans showed greater agreement with past-focused statements than Spaniards did, t(78) = 4.04, p = .001,

V1: Original data

Incorrect degrees of freedom, t-statistic, and p-value. Missing 2 participants’ data.

# reproduce the above results here
past_t_test <- d_participant_summary %>%
  filter(subscale == "PAST") %>%
  t.test(
    mean_rating ~ group,
    var.equal = TRUE,
    data = .
  )
past_t_test
## 
##  Two Sample t-test
## 
## data:  mean_rating by group
## t = -3.8562, df = 76, p-value = 0.0002394
## alternative hypothesis: true difference in means between group Spaniard and group Moroccan is not equal to 0
## 95 percent confidence interval:
##  -0.8943343 -0.2851528
## sample estimates:
## mean in group Spaniard mean in group Moroccan 
##               2.691142               3.280886
print(
  str_c(
    "V1: Spaniard vs. Moroccan t(", past_t_test$parameter,   # DF
    ") = ", round(past_t_test$statistic, 2),                 # t-value
    ", p = ", round(past_t_test$p.value, 3)                  # p-value
  )
)
## [1] "V1: Spaniard vs. Moroccan t(76) = -3.86, p = 0"

V2: First update

Incorrect degrees of freedom, t-statistic, and p-value. Missing 1 participant’s data.

# reproduce the above results here
past_t_test2 <- d_participant_summary2 %>%
  filter(subscale == "PAST") %>%
  t.test(
    mean_rating ~ group,
    var.equal = TRUE,
    data = .
  )
past_t_test2
## 
##  Two Sample t-test
## 
## data:  mean_rating by group
## t = -3.9077, df = 77, p-value = 0.0001989
## alternative hypothesis: true difference in means between group Spaniard and group Moroccan is not equal to 0
## 95 percent confidence interval:
##  -0.9088251 -0.2952542
## sample estimates:
## mean in group Spaniard mean in group Moroccan 
##               2.691142               3.293182
print(
  str_c(
    "V2: Spaniard vs. Moroccan t(", past_t_test2$parameter,   # DF
    ") = ", round(past_t_test2$statistic, 2),                 # t-value
    ", p = ", round(past_t_test2$p.value, 3)                  # p-value
  )
)
## [1] "V2: Spaniard vs. Moroccan t(77) = -3.91, p = 0"

V3: Second update

Correct degrees of freedom / number of participants; incorrect t-statistic and p-value.

# reproduce the above results here
past_t_test3 <- d_participant_summary3 %>%
  filter(subscale == "PAST") %>%
  t.test(
    mean_rating ~ group,
    var.equal = TRUE,
    data = .
  )
past_t_test3
## 
##  Two Sample t-test
## 
## data:  mean_rating by group
## t = -4.0034, df = 78, p-value = 0.0001413
## alternative hypothesis: true difference in means between group Spaniard and group Moroccan is not equal to 0
## 95 percent confidence interval:
##  -0.9255970 -0.3107666
## sample estimates:
## mean in group Spaniard mean in group Moroccan 
##               2.675000               3.293182
print(
  str_c(
    "V3: Spaniard vs. Moroccan t(", past_t_test3$parameter,   # DF
    ") = ", round(past_t_test3$statistic, 2),                 # t-value
    ", p = ", round(past_t_test3$p.value, 3)                  # p-value
  )
)
## [1] "V3: Spaniard vs. Moroccan t(78) = -4, p = 0"

and Spaniards showed greater agreement with future-focused statements than Moroccans did, t(78) = −3.32, p = .001. (de la Fuente et al., 2014, p. 1685)

V1: Original data

Correct degrees of freedom / number of participants; incorrect t-statistic and p-value.

# reproduce the above results here
future_t_test <- d_participant_summary %>%
  filter(subscale == "FUTURE") %>%
  t.test(
    mean_rating ~ group,
    var.equal = TRUE,
    data = .
  )
future_t_test
## 
##  Two Sample t-test
## 
## data:  mean_rating by group
## t = 3.2098, df = 78, p-value = 0.001929
## alternative hypothesis: true difference in means between group Spaniard and group Moroccan is not equal to 0
## 95 percent confidence interval:
##  0.1349746 0.5758588
## sample estimates:
## mean in group Spaniard mean in group Moroccan 
##               3.493750               3.138333
print(
  str_c(
    "V1: Spaniard vs. Moroccan t(", future_t_test$parameter,   # DF
    ") = ", round(future_t_test$statistic, 2),                 # t-value
    ", p = ", round(future_t_test$p.value, 3)                  # p-value
  )
)
## [1] "V1: Spaniard vs. Moroccan t(78) = 3.21, p = 0.002"

V2: First update

Correct degrees of freedom / number of participants; incorrect t-statistic and p-value.

# reproduce the above results here
future_t_test2 <- d_participant_summary2 %>%
  filter(subscale == "FUTURE") %>%
  t.test(
    mean_rating ~ group,
    var.equal = TRUE,
    data = .
  )
future_t_test2
## 
##  Two Sample t-test
## 
## data:  mean_rating by group
## t = 3.4409, df = 78, p-value = 0.0009343
## alternative hypothesis: true difference in means between group Spaniard and group Moroccan is not equal to 0
## 95 percent confidence interval:
##  0.1575066 0.5899934
## sample estimates:
## mean in group Spaniard mean in group Moroccan 
##                3.49375                3.12000
print(
  str_c(
    "V2: Spaniard vs. Moroccan t(", future_t_test2$parameter,   # DF
    ") = ", round(future_t_test2$statistic, 2),                 # t-value
    ", p = ", round(future_t_test2$p.value, 3)                  # p-value
  )
)
## [1] "V2: Spaniard vs. Moroccan t(78) = 3.44, p = 0.001"

V3: Second update

Correct degrees of freedom / number of participants; incorrect t-statistic and p-value.

# reproduce the above results here
future_t_test3 <- d_participant_summary3 %>%
  filter(subscale == "FUTURE") %>%
  t.test(
    mean_rating ~ group,
    var.equal = TRUE,
    data = .
  )
future_t_test3
## 
##  Two Sample t-test
## 
## data:  mean_rating by group
## t = 3.3637, df = 78, p-value = 0.001195
## alternative hypothesis: true difference in means between group Spaniard and group Moroccan is not equal to 0
## 95 percent confidence interval:
##  0.1520282 0.5929718
## sample estimates:
## mean in group Spaniard mean in group Moroccan 
##                 3.4925                 3.1200
print(
  str_c(
    "V3: Spaniard vs. Moroccan t(", future_t_test3$parameter,   # DF
    ") = ", round(future_t_test3$statistic, 2),                 # t-value
    ", p = ", round(future_t_test3$p.value, 3)                  # p-value
  )
)
## [1] "V3: Spaniard vs. Moroccan t(78) = 3.36, p = 0.001"

Step 5: Reflection

Were you able to reproduce the results you attempted to reproduce? If not, what part(s) were you unable to reproduce?

I was not able to reproduce any of the reported statistics exactly. The descriptive plot looked very similar to the original figure, but I could not obtain the same F-statistic for the ANOVA, nor the same t-statistics for the subscale analyses, regardless of which of the three data files I used. Using the final data file (second update), my test statistics were quite close to the reported values: main ANOVA, original F = 19.12 vs. reproduced F = 19.14; past subscale, original t = 4.04 vs. reproduced t = −4.00; future subscale, original t = −3.32 vs. reproduced t = 3.36 (signs flipped due to factor order; see the releveling sketch below). I therefore wonder whether a small discrepancy remains in the data (e.g., a couple of changed rows) that could account for these small differences in the statistics.
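If matching the paper's sign convention matters, one option (a sketch based on the final cleaned data above) is to relevel the group factor so that Moroccan is the reference group before running the t-tests, e.g. for the past subscale:

# With Moroccan as the first factor level, t.test() reports Moroccan minus Spaniard,
# matching the positive sign of the paper's t(78) = 4.04 for past-focused statements
past_t_test3_releveled <- d_participant_summary3 %>%
  filter(subscale == "PAST") %>%
  mutate(group = fct_relevel(group, "Moroccan")) %>%
  t.test(mean_rating ~ group, var.equal = TRUE, data = .)
past_t_test3_releveled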

How difficult was it to reproduce your results?

It was quite difficult to reproduce the results (since I could not do it!).

What aspects made it difficult? What aspects made it easy?

  1. Participants were not uniquely identified. Moroccan and Spanish participants both had ID numbers ranging from 1 to 40, which initially caused an error in the mixed ANOVA until I reassigned participants unique identifiers.
  2. There was missing and incorrectly identified data. It took some digging to work through all three data files (the original Excel sheet and the two updated Excel sheets in the data folder), diagnose the issues, and compare the results obtained with each file. In the original data file, two participants (ID number 25 in both the Spanish and Moroccan groups, which seems like an odd coincidence) were missing data for one of the subscales. Moreover, although 74 participants had data recorded for 21 items, 2 participants had only 4 items each, 2 had 17 items each, and 2 had 42 items each. Similar issues remained in the second data file even after the first update. This suggests that data were duplicated, deleted, or saved under incorrect participant identifiers, making some participants appear to have responded to too many or too few items. Only in the third data file do all 80 participants have 21 items. However, given that the original results could not be reproduced even with this final file, I wonder whether some data remain mislabeled. Moving from the first to the second to the third data file, the statistics (F-statistics, t-statistics, degrees of freedom) got progressively closer to the reported results, so it seems plausible that the updated data have fewer discrepancies from the data used in the original analyses; but some discrepancies likely remain, because I could not reproduce exactly the same numbers.
  3. It was unclear exactly how the statistical tests were run. First, for the mixed ANOVA, each participant answered multiple items, but the Method section did not specify how items were combined in the analysis. I presumed the authors took the mean of all items within each subscale, which appears correct given their results, but it was not stated explicitly; alternatively, they could have run a mixed-effects model accounting for variation between both subjects and items (see the sketch after this list). Second, for the t-tests, I presumed they ran Student's t-tests (assuming equal variances across groups) because the reported degrees of freedom were whole numbers (78, i.e., total sample size minus 2), whereas Welch's t-test, the default for R's t.test() function, typically yields non-integer adjusted degrees of freedom. Again, this is an inference based on the results, which appears correct, but the authors did not explicitly state which test they ran.
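Below is a hedged sketch of the alternative mixed-effects approach mentioned in point 3, assuming the item-level data contain an item identifier column (hypothetically named item here; the actual column name in the Excel file may differ):

# Crossed random intercepts for participants and items;
# lme4 is not loaded above, so it would need to be attached first
library(lme4)
mixed_model <- lmer(
  agree_rating ~ group * subscale + (1 | participant) + (1 | item),
  data = d_clean3
)
summary(mixed_model)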