Replication of When Seating Matters: Modeling Graded Social Attitudes as Bayesian Inference

by Wang & Jara-Ettinger (2025, Proceedings of the Annual Meeting of the Cognitive Science Society)

Author

Karla Esmeralda Perez (perezke [at] stanford [dot] edu)

Published

December 14, 2025

Introduction

Justification

I am interested in how humans make rich social inferences from minimal perceptual cues. In this vein, Wang & Jara-Ettinger (2025) explored the view that high-level cognition (e.g., theory of mind, naive utility calculus) mediates how low-level perceptual cues give rise to rich social inferences. Specifically, they investigated how different seating arrangements affect the inferences people draw about relationships. Previously, I worked on a related project that investigated whether children can use agents’ gaze direction and duration to infer the nature of their relationship, so I hope to continue developing my knowledge of this line of work. Finally, the authors compared their human data to three computational models; one of my goals this year is to learn more about computational modeling, so this project is an ideal first step.

Stimuli & Procedures

The stimuli for this project consisted of 30 still images depicting two characters (Yellow and Purple); each still showed a unique seating configuration between the two characters. Participants were presented with a cover story and had to pass a comprehension test to move on to the test phase. Then, participants saw all 30 stills in randomized order; for each still, participants answered, “How does Purple feel about Yellow?” using a slider whose ends were labeled “strongly dislikes” and “strongly likes.”

Expected Challenges

I expect the main challenges to be interpreting the model outputs and working out how to compare the model predictions with the human data.

Methods

Planned Sample

I plan to collect data from 50 participants, matching the original sample size. The authors did not specify pre-selection rules in the paper; for this replication, I will recruit participants who are fluent in English.

Materials

I precisely re-created the stimuli that the authors used in their original study:

“The stimuli consisted of 30 static images of an illustration of a meeting room with a desk, a set of chairs around the table, an entrance, and two agents, a yellow one and a purple one (see Fig. 2 for examples). Yellow was always seated in one of the chairs. Purple appeared seated in one of the chairs, along with a trajectory that indicated how they reached that seat from the entrance. Each image depicted a scenario where Purple entered a meeting room and decided where to sit while Yellow was already seated. To create a rich space of trials, we used a combinatorial design. We started by selecting three initial regions where Yellow could be sitting: (1) near the entrance, (2) far from the entrance, and (3) in the middle of the room. We then selected five different seating choices for Purple to sit in: (1) closest to the entrance, (2) closest to Yellow, (3) farthest from Yellow, and (4-5) two possible intermediate distances from Yellow. For each of the three initial regions where Yellow could be seated, we selected two possible seats in this region (e.g., one along the vertical row and one along the horizontal row). This results in a total of 30 (3x5x2) possible theoretical configurations. Additionally, the stimuli were designed to ensure that there are clusters of trials (two sets with 4 trials, and two with 6 trials) that controlled for the distance between the two agents so that we could better evaluate alternative heuristics.”
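
To double-check that the re-created stimuli covered the full design, the combinatorial trial space can be enumerated directly. The sketch below is illustrative only; the region, seat, and choice labels are my own shorthand, not the authors’ naming scheme.

# minimal sketch of the combinatorial design described above:
# 3 regions for Yellow x 2 seats per region x 5 seating choices for Purple = 30 trials
yellow_regions <- c("near_entrance", "far_from_entrance", "middle_of_room")
yellow_seats   <- c("vertical_row", "horizontal_row")
purple_choices <- c("closest_to_entrance", "closest_to_yellow",
                    "farthest_from_yellow", "intermediate_1", "intermediate_2")

trial_space <- expand.grid(yellow_region = yellow_regions,
                           yellow_seat   = yellow_seats,
                           purple_choice = purple_choices,
                           stringsAsFactors = FALSE)
nrow(trial_space)  # 30 theoretical configurations (3 x 2 x 5)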

Procedure

I precisely re-created the procedure that the authors implemented in their original study:

“Participants were first familiarized with the seating scenario setup through a cover story. Participants were told that they would see events where a protagonist, Purple, arrived in a meeting room. Another agent, Yellow, had already arrived and seated. Participants were then told that Purple’s seating choice would affect the probability that Yellow would initiate a conversation before the meeting started. Thus, Purple would choose a seat based both on how far they had to walk and on how they felt toward Yellow. After reading the cover story, participants were asked six simple comprehension check questions to ensure they understood the logic of the task (the cover story and 30 stimuli trials are available on the OSF page). Participants had to correctly answer each comprehension question, and pass a reCAPTCHA test to proceed to the test trials. Participants who failed one of the comprehension checks were asked to review the cover story and were given unlimited attempts to answer the comprehension check question. They could only proceed to the next question if they correctly answered the previous one, or they could choose to exit this study. In the test phase, participants were presented with all 30 trials in a randomized order. In each trial, participants viewed a static image of the meeting room with Yellow’s seat and Purple’s choice. Participants were asked to answer “How does Purple feel about Yellow?” using a slider, with one end representing Strongly Dislikes (coded as −7) and the other end representing Strongly Likes (coded as 7). At the end of the 30 trials, participants were asked to explain what strategy they used in the task, and they were asked “How intuitive do you find the following statement”: (1) “The farther away you sit from someone, the less likely they are to talk to you.” (2) “Imagine you are in the same meeting room setting as shown in the study. If you are speaking with someone, the farther away they are sitting, the harder it is to maintain a conversation.” Participants rated the strength of their intuitions using a Likert scale from 1 to 7.”

Analysis Plan

Participant judgments were z-scored within each participant and then averaged across participants for each trial type. These trial-level human z-scores were compared to the model predictions, which were also z-scored, by computing a Pearson correlation coefficient and a 95% confidence interval for each comparison.
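
In outline, this plan corresponds to the pipeline sketched below with hypothetical objects (a long-format ratings table and a table of z-scored model predictions; neither name comes from the actual scripts). The full implementation, which operates on wide-format trial columns, appears in the Results section.

# sketch of the planned analysis; hypothetical inputs:
#   ratings: one row per participant x trial (columns: participant, trial, rating)
#   model_z: one z-scored prediction per trial (columns: trial, model_pred_z)
library(dplyr)

trial_means_z <- ratings %>%
  group_by(participant) %>%
  mutate(rating_z = as.numeric(scale(rating))) %>%   # z-score within each participant
  ungroup() %>%
  group_by(trial) %>%
  summarise(human_z = mean(rating_z), .groups = "drop")  # average per trial type

comparison <- inner_join(trial_means_z, model_z, by = "trial")
cor.test(comparison$human_z, comparison$model_pred_z)    # Pearson r with a 95% CI

Note that cor.test() reports a parametric 95% CI; for the model comparisons reported below I instead bootstrap the interval.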

Differences from Original Study

One major difference is that my -7 to 7 slider did not start at the neutral midpoint; instead, the handle started at the left-most end of the scale (-7). Another difference is that my participants could see the numeric value they were selecting on the scale, whereas participants in the original study could not. I do not expect these differences to impact the results.

Methods Addendum (Post Data Collection)

Actual Sample

The final sample size for this study was 50 English-speaking participants. (No differences from original sampling plan.)

Results

Data preparation

Data were prepared following the analysis plan described above.

### clean data

# packages used in this and the following chunks (assumed to be loaded in an
# earlier setup chunk; listed here so this section can run on its own)
library(dplyr)    # select, mutate, summarise, bind_rows, pipes
library(tidyr)    # pivot_longer
library(boot)     # boot, boot.ci
library(ggplot2)  # scatter and forest plots

# helper function: calculate and reshape means 
calculate_and_reshape_means <- function(df, target_cols, new_col_name) {
  df %>%
    select(all_of(target_cols)) %>%
    # drop the "X" index column (added when the CSV was read in), if present
    select(-any_of(c("X"))) %>% 
    summarise(across(everything(), mean)) %>%
    tidyr::pivot_longer(everything(), names_to = "Trial", values_to = new_col_name)
}

# helper function: remove the '_Q' part of a col, e.g., '(2,7)_D_Q_1' -> '(2,7)_D_1'
clean_name <- function(col_name) {
  return(gsub("_Q_", "_", col_name))
}

# column names (the 30 trials) from original human data
target_cols <- colnames(original_participants)
all_model_names <- unique(original_model$model)

# original human mean responses 
original_means_df <- calculate_and_reshape_means(original_participants, target_cols, "Original_Human_Mean")

# clean all col names for my_participant_responses
colnames(my_participant_responses) <- sapply(colnames(my_participant_responses), clean_name)

# calculate the mean across all participants for each trial in my_participant_responses
replication_means_df <- calculate_and_reshape_means(my_participant_responses, target_cols, "Replication_Human_Mean")

df_comparison_raw <- merge(original_means_df, replication_means_df, by = "Trial")
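
As a quick sanity check on the merge (a minimal sketch; it only assumes that all 30 trial columns matched by name across the two datasets):

# every one of the 30 trials should appear exactly once, with no missing means
stopifnot(nrow(df_comparison_raw) == 30,
          !anyNA(df_comparison_raw$Original_Human_Mean),
          !anyNA(df_comparison_raw$Replication_Human_Mean))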

Confirmatory analysis

First, I compared the original human data with the data I collected for my replication. Recall that I did not set up my -7 to 7 scale in exactly the same way as in the original study.

# calculate similarity (Pearson Correlation) between my participant and original data
correlation_test_human_human <- cor.test(df_comparison_raw$Original_Human_Mean, 
                             df_comparison_raw$Replication_Human_Mean,
                             method = "pearson")

# extract results
r_value_hh <- correlation_test_human_human$estimate
p_value_hh <- correlation_test_human_human$p.value

# create results table
results_hh <- data.frame(
  Metric = c("Pearson Correlation (r)", "P-value", "Degrees of Freedom"),
  Value = c(round(r_value_hh, 4), format.pval(p_value_hh, digits = 5), correlation_test_human_human$parameter)
)

print(results_hh)
                     Metric      Value
cor Pearson Correlation (r)     0.9776
                    P-value < 2.22e-16
df       Degrees of Freedom         28
plot_title_hh <- paste0("Response Similarity: Original vs. Replication (r = ", round(r_value_hh, 3), ")")

min_val <- min(df_comparison_raw$Original_Human_Mean, df_comparison_raw$Replication_Human_Mean)
max_val <- max(df_comparison_raw$Original_Human_Mean, df_comparison_raw$Replication_Human_Mean)
limits <- c(min_val - 0.5, max_val + 0.5)

ggplot(df_comparison_raw, aes(Original_Human_Mean, Replication_Human_Mean)) +
  # 1. Scatter points for each trial
  geom_point(color = "#1f78b4", size = 3, alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  geom_smooth(method = "lm", se = FALSE, color = "#e31a1c") +
  labs(title = plot_title_hh,
    x = "Original Human responses (mean)",
    y = "Replication Human responses (mean)") +
  coord_fixed(xlim = limits, ylim = limits) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid.minor = element_blank()
  )
[Figure: scatterplot of trial-level mean responses, original study (x-axis) vs. replication (y-axis), with the identity line (dashed) and a linear fit.]

Then, I z-scored the trial-level human means (original and replication) and computed bootstrapped 95% confidence intervals for each model-human correlation.

# number of bootstrap replicates
N_BOOTSTRAP_REPS <- 5000

# Z-score the human mean columns
df_comparison_z <- df_comparison_raw %>%
  mutate(
    Original_Human_Z = scale(Original_Human_Mean)[,1],
    Replication_Human_Z = scale(Replication_Human_Mean)[,1])

# bootstrap for correlations
  # d: merged Z-score dataframe
  # i: resampled row indices
cor_boot_function <- function(d, i, human_z_col, model_z_col) {
  return(cor(d[i, human_z_col], d[i, model_z_col], method = "pearson"))
}

# calculate Pearson r using Z-Scores and CIs
calculate_similarity_z <- function(human_means_df_z, model_predictions, model_name, human_col_name) {
  
  # isolate and reshape the predictions for the specific model
  target_cols_model <- target_cols[target_cols %in% colnames(model_predictions)]
  
  model_data_long <- model_predictions %>%
    select(model, all_of(target_cols_model)) %>%
    filter(model == model_name) %>%
    pivot_longer(
      cols = -model, 
      names_to = "Trial",
      values_to = "Model_Prediction")
  # model predictions are already z-scored 
    # (downloaded from OSF, Wang & Jara-Ettinger (2025))
  model_data_long <- model_data_long %>%
    rename(Model_Prediction_Z = Model_Prediction)
  # join human Z-scores and model Z-scores by trial name
  original_model_comparison_z <- merge(human_means_df_z, model_data_long, by = "Trial")
  # get column indices to use in the boot function
  human_col_index <- which(colnames(original_model_comparison_z) == human_col_name)
  model_col_index <- which(colnames(original_model_comparison_z) == "Model_Prediction_Z")
  # calculate Pearson correlation
  correlation_test <- cor.test(original_model_comparison_z[[human_col_name]], 
                               original_model_comparison_z$Model_Prediction_Z,
                               method = "pearson")
  
  # bootstrapping to get the Confidence Intervals
  
  # wrapper function: pass specific column names to the cor_boot_function
  boot_wrapper <- function(d, i) {
      cor_boot_function(d, i, human_z_col = human_col_index, model_z_col = model_col_index)
  }
  
  # suppress warnings from boot() due to small sample size
  boot_results <- suppressWarnings(boot(
    data = original_model_comparison_z, 
    statistic = boot_wrapper, 
    R = N_BOOTSTRAP_REPS))
  
  # CI calculation with fallback
  ci_method <- "BCa"
  ci_lower <- NA
  ci_upper <- NA
  
  # BCa
  ci_bca <- suppressWarnings(boot.ci(boot_results, type = "bca"))
  
  if (!is.null(ci_bca) && !is.na(ci_bca$bca[4])) {
    ci_lower <- round(ci_bca$bca[4], 4)
    ci_upper <- round(ci_bca$bca[5], 4)
    
  } else { # BCa failed, fall back to percentile method
    ci_method <- "Percentile (Fallback)"
    ci_perc <- suppressWarnings(boot.ci(boot_results, type = "perc"))
    
    if (!is.null(ci_perc) && !is.na(ci_perc$percent[4])) {
      ci_lower <- round(ci_perc$percent[4], 4)
      ci_upper <- round(ci_perc$percent[5], 4)
    } else { # both failed, boooo!
      ci_method <- "Failed"
    }
  }

  # return key results
  return(data.frame(
    Model = model_name,
    Comparison = paste0(sub("_Z", "", human_col_name), "_Z"),
    Pearson_r = round(correlation_test$estimate, 4),
    CI_Method = ci_method,
    CI_95_Lower = ci_lower,
    CI_95_Upper = ci_upper,
    P_value = format.pval(correlation_test$p.value, digits = 5),
    N_Trials = correlation_test$parameter + 2 
  ))
}

# REPRODUCTION: models vs. original human data (z-scored)
results_models_vs_original_z <- do.call(rbind, lapply(all_model_names, function(m) {
  calculate_similarity_z(df_comparison_z, original_model, m, "Original_Human_Z")
}))

# MAIN ANALYSIS: models vs. replication human data (z-scored)
results_models_vs_replication_z <- do.call(rbind, lapply(all_model_names, function(m) {
  calculate_similarity_z(df_comparison_z, original_model, m, "Replication_Human_Z")
}))

# combine all z-scored model results into one table
all_model_results_z <- bind_rows(results_models_vs_original_z, results_models_vs_replication_z)
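
Before plotting, the combined table can be inspected directly. This is a minimal sketch; it assumes knitr is available, which is standard when knitting this kind of report.

# render the combined correlation table
knitr::kable(all_model_results_z, row.names = FALSE,
             caption = "Model-human correlations (z-scored) with bootstrapped 95% CIs")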

Finally, I visualized the comparisons.

### Visualization

# models not part of main analysis
remove_models <- c("ablated_model", "distance_model")

results_small <- all_model_results_z %>% 
  filter(!Model %in% remove_models)

forest_plot_models <- ggplot(results_small, 
                             aes(x = Pearson_r, 
                                 y = Model, 
                                 color = Comparison)) +
  # error bars for the 95% CIs
  geom_errorbarh(aes(xmin = CI_95_Lower, xmax = CI_95_Upper), 
                 height = 0.2, linewidth = 1, alpha = 0.7) +
  geom_point(size = 3.5) +
  # null hypothesis line
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(
    title = "Model-Human Response Correlations (Z-Scored)",
    subtitle = paste0("Pearson's r with 95% CIs (BCa, N = ", N_BOOTSTRAP_REPS, " bootstraps)"),
    x = "Pearson Correlation Coefficient (r)",
    y = NULL) +
  scale_color_manual(values = c("Original_Human_Z" = "#ff8624", # orange
                                "Replication_Human_Z" = "#99d5fb"), # blue
                     labels = c("Original_Human_Z" = "Original Human Responses", 
                                "Replication_Human_Z" = "Replication Human Responses")) +
  # xlim(-0.1, 1) + 
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    panel.grid.minor = element_blank(),
    legend.position = "bottom"
  )

forest_plot_models

Discussion

Summary of Replication Attempt

I successfully replicated the main results from Wang & Jara-Ettinger (2025): as in the original study, the three main models correlated strongly with my participants’ responses: combined model (r = 0.83; 95% CI: 0.66-0.91), probability model (r = 0.87; 95% CI: 0.75-0.93), and cost model (r = 0.89; 95% CI: 0.76-0.94). However, as the plot above shows, my results were slightly but systematically weaker than those of the original study; my r values were slightly lower and my confidence intervals slightly wider. This may be due to the difference in how the slider was presented to participants in the main trials. Because my r values and confidence intervals follow the same pattern across models as the original values, are close to them in magnitude, and all correlations are significant at p < 0.001, I conclude that this was a successful replication.

Commentary

Throughout this project, I learned that it is very important to develop intuitively structured project directories. Specifically, I developed a better sense of how important it is to include metadata in project directories, e.g., a key for column names that may not otherwise be interpretable. It was a pleasure to replicate Wang & Jara-Ettinger (2025): their OSF project directory was extremely well organized, and lead author Zihan Wang was very kind in providing feedback and answering my questions. (If you are reading this Zihan, thank you!!)