Human-like Decision Making in Agent-Based Models: A Comparative Study of Large Language Model Agents versus Traditional Utility Maximization in the Schelling Segregation Model

Authors: Andreas Pape, Carl Lipo, etc.

Affiliation: Binghamton University

Published: June 14, 2025

Abstract

We present a novel approach to agent-based modeling that replaces traditional utility-maximizing agents with Large Language Model (LLM) agents making human-like residential decisions. Using the classic Schelling segregation model as our testbed, we compare three agent types: (1) traditional mechanical agents using best-response dynamics, (2) LLM agents making decisions based on current neighborhood context, and (3) LLM agents with persistent memory of past interactions and relationships. In our experiments (two replicates per condition), both LLM agent types converge to stable residential patterns faster than mechanical agents while reaching similar final segregation levels (~55% vs 58% like neighbors). Memory-enhanced LLM agents converge fastest (84 steps vs 187 for mechanical agents, a 2.2× speedup; standard LLM agents need 99 steps) and show a 53.8% lower rate of extreme segregation (“ghetto” formation). With this small sample, none of the pairwise differences is statistically significant at the 0.05 level. These findings suggest that incorporating human-like decision-making through LLMs can produce more realistic dynamics in agent-based models of social phenomena, with important implications for urban planning and policy analysis.

Keywords

agent-based modeling, large language models, segregation, Schelling model, artificial intelligence, complex systems

Introduction

The Schelling segregation model (Schelling 1971) has been a cornerstone of agent-based modeling (ABM) for over five decades, demonstrating how mild individual preferences for similar neighbors can lead to stark residential segregation. Traditional implementations use utility-maximizing agents that relocate when the proportion of like neighbors falls below a threshold. While mathematically elegant, this approach may not capture the complexity of human residential decision-making, which involves social relationships, personal history, and contextual factors beyond simple utility calculations.

Systematic study of Schelling model variants was advanced significantly by Pancs and Vriend (2007), who developed standardized metrics specifically designed for grid-based segregation simulations. Their framework addressed the critical problem that traditional urban segregation indices perform poorly on small-scale agent-based models, providing the quantitative foundation necessary for rigorous comparison across different agent implementations.

Recent advances in Large Language Models (LLMs) offer an unprecedented opportunity to incorporate more realistic human-like decision-making into agent-based models. LLMs trained on vast corpora of human text can simulate nuanced responses to complex social situations, potentially bridging the gap between simplified mathematical models and real-world behavior (Park et al. 2023; Argyle et al. 2023).

In this paper, we present a comparative study of three agent types within the Schelling framework:

Mechanical agents: Traditional utility-maximizing agents using best-response dynamics
Standard LLM agents: Agents whose decisions are generated by LLMs based on current neighborhood context
Memory LLM agents: LLM agents with persistent memory of past interactions and relationships

Our key research questions are:
- How do convergence dynamics differ between mechanical and LLM-based agents?
- Do LLM agents produce different segregation patterns than traditional agents?
- What is the impact of memory on residential stability and segregation outcomes?

Methods

Experimental Design

We implemented a comparative framework using identical environmental conditions across all agent types. The simulation environment consists of a 15×15 grid (225 cells) populated with 50 agents equally divided between two types (25 Type A “red” and 25 Type B “blue”), yielding a density of 22.2%.
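The setup described above can be sketched in a few lines of base R. This is an illustration only: the cell encoding (0 = empty, 1 = Type A, 2 = Type B), the helper name `init_grid`, and the seed are assumptions, not details taken from the paper.

```r
# Sketch only: encoding, helper name and seed are assumptions.
init_grid <- function(size = 15, n_per_type = 25, seed = 1) {
  set.seed(seed)
  grid <- matrix(0L, nrow = size, ncol = size)
  # Draw distinct cells for all agents, then split them between the two types
  cells <- sample(size * size, 2 * n_per_type)
  grid[cells[seq_len(n_per_type)]] <- 1L
  grid[cells[n_per_type + seq_len(n_per_type)]] <- 2L
  grid
}

grid <- init_grid()
density <- sum(grid != 0L) / length(grid)  # 50 occupied cells out of 225
round(100 * density, 1)
```

The density printed at the end is the 22.2% quoted in the text (50 / 225).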

Agent Implementations

Mechanical Baseline Agents

Traditional Schelling agents operate as pure utility maximizers using a deterministic threshold function. Each agent continuously evaluates their current position based on neighborhood composition:

\[U_i = \begin{cases} 1 & \text{if } p_i \geq \tau \\ 0 & \text{otherwise} \end{cases}\]

where \(p_i\) is the proportion of like neighbors within the Moore neighborhood (the 8 adjacent cells) and \(\tau = 0.5\) is the satisfaction threshold. Agents with \(U_i = 0\) immediately relocate to the nearest available cell that satisfies their threshold, following a best-response dynamic that guarantees utility improvement with each move.
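As a minimal sketch of the threshold rule, the following base R functions compute \(p_i\) and \(U_i\) for one agent. The grid encoding (0 = empty, 1 = Type A, 2 = Type B) and the choice to treat an agent with no occupied neighbors as satisfied are assumptions; the text does not specify either.

```r
# Sketch only: encoding and the no-neighbour convention are assumptions.
like_share <- function(grid, row, col) {
  rows <- max(1, row - 1):min(nrow(grid), row + 1)
  cols <- max(1, col - 1):min(ncol(grid), col + 1)
  block <- grid[rows, cols, drop = FALSE]   # Moore neighbourhood plus the agent
  me <- grid[row, col]
  like <- sum(block == me) - 1              # exclude the agent itself
  unlike <- sum(block != me & block != 0)
  if (like + unlike == 0) NA_real_ else like / (like + unlike)
}

utility <- function(grid, row, col, tau = 0.5) {
  p <- like_share(grid, row, col)
  if (is.na(p) || p >= tau) 1L else 0L
}

# Toy 3x3 grid: the centre agent (Type A) has 2 like and 3 unlike neighbours
toy <- matrix(c(1, 2, 0,
                2, 1, 0,
                0, 2, 1), nrow = 3, byrow = TRUE)
like_share(toy, 2, 2)
utility(toy, 2, 2)
```

In the toy grid the centre agent has \(p_i = 2/5 = 0.4 < \tau\), so it is unsatisfied and would look for a new cell.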

This approach represents classical rational choice theory: agents have perfect information, consistent preferences, and make optimal decisions to maximize their utility function. While computationally efficient and theoretically elegant, it reduces complex human residential decisions to simple mathematical optimization.

Standard LLM Agents

LLM agents replace mathematical utility functions with natural language reasoning. Each agent receives contextual prompts describing their current situation and must make residential decisions through linguistic reasoning. For baseline (red/blue) scenarios, the prompt structure is:

You are a [red/blue] resident in a neighborhood simulation. 
Current situation:
- Your neighborhood has [X] red neighbors and [Y] blue neighbors
- There are [Z] empty houses within moving distance
- You have been living here for [N] time steps

Based on your preferences as a [red/blue] resident, would you:
1. Stay in your current location
2. Move to a different available house

If moving, consider factors like neighborhood composition, 
proximity to similar residents, and overall comfort level.

The LLM generates a natural language response that is parsed to extract the agent’s decision. This approach captures nuanced reasoning that may include:
- Gradual comfort with diversity vs. strong segregation preferences
- Consideration of neighborhood trends and stability
- Social factors beyond pure numerical thresholds
- Context-dependent preferences that may vary over time
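To make the prompt-and-parse loop concrete, here is a hedged base R sketch. The helpers `build_prompt` and `parse_decision` are hypothetical, and the keyword rule (whichever of "stay" or "move" appears first wins) is an assumption; the paper does not describe its parser, and the actual LLM call is omitted.

```r
# Sketch only: helper names and the parsing rule are assumptions.
build_prompt <- function(color, n_red, n_blue, n_empty, n_steps) {
  sprintf(paste(
    "You are a %s resident in a neighborhood simulation.",
    "Current situation:",
    "- Your neighborhood has %d red neighbors and %d blue neighbors",
    "- There are %d empty houses within moving distance",
    "- You have been living here for %d time steps",
    sep = "\n"
  ), color, n_red, n_blue, n_empty, n_steps)
}

parse_decision <- function(response) {
  text <- tolower(response)
  pos_stay <- as.integer(regexpr("\\bstay", text))
  pos_move <- as.integer(regexpr("\\bmov(e|ing)", text))
  if (pos_stay < 0 && pos_move < 0) return(NA_character_)  # nothing recognisable
  if (pos_move < 0) return("stay")
  if (pos_stay < 0) return("move")
  if (pos_stay < pos_move) "stay" else "move"
}

prompt <- build_prompt("red", 3, 5, 4, 10)
parse_decision("I would stay in my current location for now.")
parse_decision("I think it is time to move to a different house.")
```

A rule this naive does not handle negation ("I would not move"), so a real parser would more likely ask the model for a constrained answer format and fall back to a retry when `NA` is returned.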

Memory-Enhanced LLM Agents

Memory-enhanced agents extend standard LLM agents with persistent episodic memory, more closely approximating human decision-making where past experiences shape current choices. Each agent maintains a detailed history including:

Residential History: Complete record of past locations, duration at each address, and reasons for moving
Social Interactions: Memory of positive/negative encounters with neighbors of different types
Neighborhood Evolution: Observations of how local composition changed over time
Personal Relationships: Development of attachments to specific neighbors or locations

The prompt structure includes this historical context:

You are a [identity] resident with the following history:
RESIDENTIAL HISTORY:
- Previously lived at [locations] for [durations]
- Moved because: [recorded reasons]

SOCIAL MEMORY:
- Positive interactions: [specific neighbor relationships]
- Concerns about: [negative experiences or observations]

CURRENT SITUATION:
- Living at current location for [duration]
- Neighborhood has [composition and trends]
- Available moving options: [locations with contexts]

Given your personal history and relationships, what would you do?
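One way such a memory could be stored and rendered into the history section of the prompt is sketched below in base R (R 4.0 or later assumed). The field names and the helpers `new_memory`, `record_move`, and `render_memory_prompt` are hypothetical; the paper does not specify its memory representation.

```r
# Sketch only: structure and helper names are assumptions.
new_memory <- function() {
  list(
    residences = data.frame(row = integer(), col = integer(),
                            steps = integer(), reason = character()),
    positive = character(),
    concerns = character()
  )
}

record_move <- function(memory, row, col, steps, reason) {
  memory$residences <- rbind(
    memory$residences,
    data.frame(row = row, col = col, steps = steps, reason = reason)
  )
  memory
}

render_memory_prompt <- function(memory) {
  res <- memory$residences
  history <- if (nrow(res) == 0) {
    "- No previous residences"
  } else {
    sprintf("- Lived at (%d, %d) for %d steps; moved because: %s",
            res$row, res$col, res$steps, res$reason)
  }
  paste(c("RESIDENTIAL HISTORY:", history,
          "SOCIAL MEMORY:",
          paste("- Positive interactions:", toString(memory$positive)),
          paste("- Concerns about:", toString(memory$concerns))),
        collapse = "\n")
}

m <- record_move(new_memory(), 3L, 4L, 12L, "too few similar neighbors")
m$positive <- c(m$positive, "friendly blue neighbor to the east")
cat(render_memory_prompt(m))
```

Because the history grows with every move and interaction, prompt length grows over a run; a practical implementation would probably cap or summarize older entries.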

Theoretical Expectations for Memory Effects:

Reduced Volatility: Agents with established relationships should move less frequently, reducing overall system dynamics and leading to faster convergence.
Path Dependence: Early positive experiences with diversity should make agents more tolerant of mixed neighborhoods, while negative experiences should increase segregation preferences.
Stabilization Effects: As agents develop local social ties, they become less likely to abandon neighborhoods even when composition changes slightly.
Realistic Inertia: Memory should introduce the residential inertia observed in real populations, where moving decisions involve substantial social and emotional costs beyond simple preference satisfaction.
Reduced Extreme Segregation: Strong social memories should prevent the formation of completely homogeneous neighborhoods (“ghettos”) by maintaining some agents who value established relationships over perfect homophily.

These expectations are based on urban sociology research showing that residential decisions involve complex tradeoffs between preferences for similar neighbors and attachment to place, social networks, and personal history (Sampson 1988; Massey and Fischer 2001).

Segregation Metrics: The Pancs and Vriend Framework

A critical challenge in Schelling model research has been the lack of standardized metrics for comparing segregation outcomes across different implementations and parameters. While Schelling’s original work provided intuitive insights about segregation emergence, it lacked quantitative measures that could enable systematic comparison of results across studies, agent types, or experimental conditions.

Pancs and Vriend (2007) addressed this limitation by developing a comprehensive statistical framework specifically designed for the Schelling model. Their contribution was crucial because traditional segregation indices used in urban sociology (such as the Dissimilarity Index or Isolation Index) were designed for large-scale census data and perform poorly on small-grid simulations with stochastic dynamics.

The Need for Schelling-Specific Metrics

Pancs and Vriend identified several problems with applying standard segregation measures to agent-based models:

Scale Sensitivity: Traditional indices assume large populations and break down on small grids (such as our 15×15 grid with 50 agents)
Boundary Effects: Grid-based simulations have edge effects that distort standard distance-based measures
Dynamic Context: ABM requires metrics that capture segregation patterns during transient states, not just final equilibria
Comparative Framework: No existing metrics enabled direct comparison between different agent implementations

Pancs-Vriend Metric Suite

We adopt Pancs and Vriend’s five complementary metrics, each capturing different aspects of spatial segregation:

Share (\(S\)): Average proportion of like neighbors around each agent \[S = \frac{1}{N} \sum_{i=1}^{N} \frac{L_i}{L_i + D_i}\] where \(L_i\) is the number of like neighbors and \(D_i\) the number of different-type neighbors of agent \(i\). This metric ranges from 0 (every neighbor is of the other type) to 1.0 (complete segregation); with two equally sized groups, a well-mixed pattern gives values near 0.5.

Clusters (\(C\)): Number of spatially contiguous same-type regions using 8-connectivity \[C = \text{count of connected components by type}\] Lower values indicate more segregated (fewer, larger clusters) while higher values suggest fragmented settlement patterns.

Distance (\(D\)): Average Euclidean distance between different-type agents \[D = \frac{1}{N_A \cdot N_B} \sum_{i \in A} \sum_{j \in B} ||pos_i - pos_j||\] Higher values indicate greater spatial separation between groups.

Ghetto Rate (\(G\)): Proportion of agents living in completely homogeneous neighborhoods \[G = \frac{\text{agents with only same-type neighbors}}{N}\] This captures extreme segregation where agents have zero contact with the other group.

Mix Deviation (\(M\)): Deviation from perfect checkerboard integration pattern \[M = \frac{1}{N} \sum_{i=1}^{N} |actual\_neighbors_i - expected\_neighbors_i|\] Measures how far the current pattern deviates from perfect spatial integration.
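The first four definitions above can be sketched in base R as follows. This is a reading of the formulas as written here, not the authors' code. Assumptions: cells are encoded 0 = empty, 1 = Type A, 2 = Type B; agents with no occupied neighbors are skipped in Share and never count as ghetto residents. Mix Deviation is left out because its expected-neighbor term is not specified in enough detail.

```r
# Sketch only: encoding and the handling of isolated agents are assumptions.
neighbour_counts <- function(grid, row, col) {
  rows <- max(1, row - 1):min(nrow(grid), row + 1)
  cols <- max(1, col - 1):min(ncol(grid), col + 1)
  block <- grid[rows, cols, drop = FALSE]
  like <- sum(block == grid[row, col]) - 1      # exclude the agent itself
  c(like = like, unlike = sum(block != 0) - 1 - like)
}

share_ghetto_distance <- function(grid) {
  agents <- which(grid != 0, arr.ind = TRUE)
  counts <- vapply(seq_len(nrow(agents)), function(i) {
    neighbour_counts(grid, agents[i, 1], agents[i, 2])
  }, c(like = 0, unlike = 0))
  like <- counts["like", ]
  unlike <- counts["unlike", ]
  has_nb <- like + unlike > 0
  a <- which(grid == 1, arr.ind = TRUE)
  b <- which(grid == 2, arr.ind = TRUE)
  # Euclidean distance for every A-B pair
  d <- sqrt(outer(a[, 1], b[, 1], "-")^2 + outer(a[, 2], b[, 2], "-")^2)
  c(share = mean(like[has_nb] / (like + unlike)[has_nb]),
    ghetto_rate = sum(has_nb & unlike == 0) / nrow(agents),
    distance = mean(d))
}

count_clusters <- function(grid) {
  seen <- grid == 0                 # empty cells never start a cluster
  clusters <- 0
  for (start in which(grid != 0)) {
    if (seen[start]) next
    clusters <- clusters + 1
    type <- grid[start]
    stack <- start
    seen[start] <- TRUE
    while (length(stack) > 0) {     # depth-first flood fill, 8-connectivity
      cell <- stack[length(stack)]
      stack <- stack[-length(stack)]
      r0 <- (cell - 1) %% nrow(grid) + 1
      c0 <- (cell - 1) %/% nrow(grid) + 1
      for (dr in -1:1) for (dc in -1:1) {
        rr <- r0 + dr
        cc <- c0 + dc
        if (rr < 1 || rr > nrow(grid) || cc < 1 || cc > ncol(grid)) next
        if (!seen[rr, cc] && grid[rr, cc] == type) {
          seen[rr, cc] <- TRUE
          stack <- c(stack, (cc - 1) * nrow(grid) + rr)
        }
      }
    }
  }
  clusters
}

toy <- matrix(c(1, 1, 0, 2,
                1, 2, 0, 2,
                0, 0, 2, 2,
                1, 0, 0, 0), nrow = 4, byrow = TRUE)
metrics <- share_ghetto_distance(toy)
n_clusters <- count_clusters(toy)
```

On the toy grid, one Type A agent is isolated (it forms its own cluster and is skipped in Share), and four of the nine agents have only same-type neighbors.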

Why This Framework Enables Our Comparison

The Pancs-Vriend metrics are particularly valuable for our study because they:

Enable Cross-Agent Comparison: Provide standardized measures that work equally well for mechanical, standard LLM, and memory LLM agents
Capture Multiple Segregation Aspects: No single metric fully captures segregation complexity; the five-metric suite provides complementary perspectives
Handle Small-Scale Dynamics: Designed specifically for grid-based ABM with realistic population sizes
Track Convergence: Enable detection of stable states across different agent types with different convergence patterns
Quantify Qualitative Differences: Convert complex spatial patterns into comparable numerical values

This standardized framework allows us to make quantitative claims, such as memory LLM agents converging 2.2× faster than mechanical agents while reaching similar final segregation levels (~55% vs 58% on the share metric). Such comparisons would be impossible without robust, validated metrics designed for Schelling-type models.

Statistical Analysis

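All experiments were run with 2 replicates for each condition. We use Mann-Whitney U tests for pairwise comparisons and report effect sizes using Cohen’s d. Note that with two runs per group, the smallest attainable exact two-sided p-value is 1/3 (≈ 0.333), so no comparison can reach significance at the 0.05 level.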
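The two statistics named above can be illustrated with base R. The pooled-SD form of Cohen's d used here (equal group sizes, mean of group 1 minus mean of group 2) is an assumption inferred from the reported values rather than stated in the text; the inputs to `wilcox.test` are placeholders, not the study's data.

```r
# Sketch only: the pooled-SD convention is inferred, not documented.
cohens_d <- function(mean1, sd1, mean2, sd2) {
  (mean1 - mean2) / sqrt((sd1^2 + sd2^2) / 2)
}

# Distance, mechanical baseline vs memory LLM (summary values from the appendix)
d_distance <- cohens_d(1.420, 0.198, 1.190, 0.014)
round(d_distance, 3)

# With two runs per group, complete separation is the most extreme outcome
# a Mann-Whitney U test can see (placeholder values, no ties).
p_min <- wilcox.test(c(1, 2), c(3, 4))$p.value
p_min
```

The first call comes out close to the 1.639 listed for this comparison in the appendix. The second shows why small p-values are out of reach here: of the 6 ways to split four ranks into two groups of two, the two most extreme already account for a two-sided p-value of 2/6.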

Results

Convergence Dynamics

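# Packages used by the analysis chunks in this paper
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(patchwork)   # combines plots with `+`
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), footnote(), landscape()

# Prepare convergence data (`convergence_data` and `agent_colors` are assumed
# to be defined in an earlier setup chunk)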
conv_summary <- convergence_data %>%
  mutate(
    agent_type = case_when(
      experiment == "mechanical_baseline" ~ "Mechanical Baseline",
      experiment == "standard_llm" ~ "Standard LLM",
      experiment == "memory_llm" ~ "Memory LLM"
    )
  )

# A. Convergence time distribution
p1 <- ggplot(conv_summary, aes(x = agent_type, y = mean_convergence_step, fill = agent_type)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_convergence_step - std_convergence_step,
                    ymax = mean_convergence_step + std_convergence_step),
                width = 0.2) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "", y = "Steps to Convergence", title = "A. Convergence Time") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# B. Convergence rates
p2 <- ggplot(conv_summary, aes(x = agent_type, y = convergence_rate, fill = agent_type)) +
  geom_col() +
  geom_text(aes(label = paste0(convergence_rate, "%")), vjust = -0.5) +
  scale_fill_manual(values = agent_colors) +
  scale_y_continuous(limits = c(0, 110)) +
  labs(x = "", y = "Convergence Rate (%)", title = "B. Convergence Success") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# C. Relative speed
baseline_steps <- conv_summary$mean_convergence_step[conv_summary$experiment == "mechanical_baseline"]
conv_summary <- conv_summary %>%
  mutate(relative_speed = baseline_steps / mean_convergence_step)

p3 <- ggplot(conv_summary, aes(x = agent_type, y = relative_speed, fill = agent_type)) +
  geom_col() +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red", alpha = 0.5) +
  geom_text(aes(label = sprintf("%.1fx", relative_speed)), vjust = -0.5) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "", y = "Relative Speed", title = "C. Speed vs Baseline") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Combine plots
p1 + p2 + p3

Our results reveal clear differences in convergence behavior across agent types. LLM agents with memory converged fastest at 84±14 steps, followed by standard LLM agents at 99±9 steps. Mechanical agents required 187 steps and converged in only 50% of runs; with two replicates, that figure rests on a single converged run. Relative to the mechanical baseline, this is a 2.2× speedup for memory LLM agents and a 1.9× speedup for standard LLM agents.
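As a quick check, the speedups implied by the reported mean convergence steps, and the relative gain of memory over standard LLM agents, work out as:

```latex
\[
\frac{187}{84} \approx 2.23, \qquad
\frac{187}{99} \approx 1.89, \qquad
\frac{99 - 84}{99} \approx 15.2\%
\]
```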

Segregation Patterns

# Prepare pairwise data for visualization
metrics_summary <- pairwise_data %>%
  filter(group1 == "mechanical_baseline") %>%
  select(metric, group1, group2, mean1, std1, mean2, std2) %>%
  pivot_longer(cols = c(mean1, mean2, std1, std2),
               names_to = c(".value", "group"),
               names_pattern = "(mean|std)(.)") %>%
  mutate(
    agent_type = case_when(
      group == "1" ~ "Mechanical Baseline",
      group == "2" & str_detect(group2, "standard") ~ "Standard LLM",
      group == "2" & str_detect(group2, "memory") ~ "Memory LLM"
    )
  ) %>%
  # After pivoting, the baseline appears once per comparison. Keep a single
  # row per metric and agent type; binding it in again would make geom_col()
  # stack duplicate baseline bars.
  distinct(metric, agent_type, mean, std)

# Create faceted plot for all metrics
metrics_plot <- metrics_summary %>%
  mutate(
    metric_label = case_when(
      metric == "share" ~ "Share (% Like Neighbors)",
      metric == "clusters" ~ "Number of Clusters",
      metric == "distance" ~ "Inter-type Distance",
      metric == "ghetto_rate" ~ "Ghetto Formation",
      metric == "mix_deviation" ~ "Mix Deviation"
    )
  ) %>%
  ggplot(aes(x = agent_type, y = mean, fill = agent_type)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - std, ymax = mean + std), width = 0.2) +
  facet_wrap(~ metric_label, scales = "free_y", ncol = 2) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "", y = "Metric Value", 
       title = "Segregation Patterns Across Agent Types") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

metrics_plot

Final segregation metrics across agent types. Error bars represent standard deviation across runs.

Statistical Comparisons

# Create summary table of key comparisons
comparison_table <- pairwise_data %>%
  filter(metric %in% c("share", "ghetto_rate", "distance")) %>%
  mutate(
    comparison = paste(group1, "vs", group2),
    metric = str_to_title(str_replace(metric, "_", " ")),
    effect_size_cat = case_when(
      abs(effect_size) < 0.2 ~ "Negligible",
      abs(effect_size) < 0.5 ~ "Small",
      abs(effect_size) < 0.8 ~ "Medium",
      TRUE ~ "Large"
    ),
    significance = ifelse(p_value < 0.05, "*", "")
  ) %>%
  select(Metric = metric, 
         Comparison = comparison,
         `Mean Diff (%)` = percent_change,
         `Effect Size` = effect_size,
         `Category` = effect_size_cat,
         `p-value` = p_value,
         Sig = significance) %>%
  mutate(
    `Mean Diff (%)` = round(`Mean Diff (%)`, 1),
    `Effect Size` = round(`Effect Size`, 2),
    `p-value` = round(`p-value`, 3)
  )

kable(comparison_table, booktabs = TRUE, align = "lcccccc") %>%
  kable_styling(latex_options = c("striped", "hold_position")) %>%
  column_spec(1, width = "2cm") %>%
  column_spec(2, width = "5cm") %>%
  footnote(general = "* indicates p < 0.05", 
           general_title = "Note:", 
           footnote_as_chunk = TRUE)

Pairwise statistical comparisons between agent types. Effect sizes interpreted as: negligible (<0.2), small (0.2-0.5), medium (0.5-0.8), large (>0.8).
Metric	Comparison	Mean Diff (%)	Effect Size	Category	p-value
Distance	mechanical_baseline vs standard_llm	-5.6	0.57	Medium	1.000
Share	mechanical_baseline vs standard_llm	-5.3	0.30	Small	1.000
Ghetto Rate	mechanical_baseline vs standard_llm	0.0	0.00	Negligible	1.000
Distance	mechanical_baseline vs memory_llm	-16.2	1.64	Large	0.333
Share	mechanical_baseline vs memory_llm	-5.0	0.30	Small	1.000
Ghetto Rate	mechanical_baseline vs memory_llm	-53.8	1.00	Large	0.617
Distance	standard_llm vs memory_llm	-11.2	6.71	Large	0.333
Share	standard_llm vs memory_llm	0.3	-0.03	Negligible	1.000
Ghetto Rate	standard_llm vs memory_llm	-53.8	7.00	Large	0.221
Note: * indicates p < 0.05
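The Mean Diff (%) column appears to be the change in the group mean relative to the first group in each comparison. A short sketch using the group means listed in the appendix (the helper name `percent_change` mirrors the column used in the code but its definition here is an assumption):

```r
# Sketch only: the definition of percent_change is inferred from the table.
percent_change <- function(mean1, mean2) 100 * (mean2 - mean1) / mean1

ghetto_mech_vs_memory   <- percent_change(6.5, 3.0)
distance_mech_vs_memory <- percent_change(1.42, 1.19)
distance_std_vs_memory  <- percent_change(1.34, 1.19)

round(c(ghetto_mech_vs_memory, distance_mech_vs_memory, distance_std_vs_memory), 1)
```

These reproduce the -53.8, -16.2 and -11.2 entries in the table.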

Time Series Evolution

# Create time series plot
time_evolution <- time_series_data %>%
  group_by(step, agent_type) %>%
  summarise(
    mean_share = mean(share),
    se_share = sd(share) / sqrt(n()),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = step, y = mean_share, color = agent_type)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = mean_share - se_share, 
                  ymax = mean_share + se_share,
                  fill = agent_type), 
              alpha = 0.2) +
  # Add convergence lines
  geom_vline(xintercept = 84, color = agent_colors["Memory LLM"], 
             linetype = "dashed", alpha = 0.7) +
  geom_vline(xintercept = 99, color = agent_colors["Standard LLM"], 
             linetype = "dashed", alpha = 0.7) +
  scale_color_manual(values = agent_colors) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "Simulation Step", 
       y = "Share (Proportion of Like Neighbors)",
       title = "Segregation Evolution Over Time") +
  theme(legend.title = element_blank()) +
  coord_cartesian(xlim = c(0, 200))

time_evolution

Evolution of segregation (share metric) over time for representative runs. Shaded regions indicate standard error. Dashed vertical lines mark the average convergence points of the memory LLM (step 84) and standard LLM (step 99) agents.

Discussion

Key Findings

Our study reveals three major insights about incorporating LLM-based decision-making into agent-based models:

Convergence Efficiency: LLM agents reached stable residential patterns in fewer steps than mechanical agents (84 and 99 steps vs 187). The 2.2× speedup for memory-enhanced LLMs suggests that human-like decision-making may actually be more efficient at reaching equilibrium states in social systems, although two runs per condition are too few to establish this statistically.
Segregation Outcomes: Despite different decision mechanisms, all agent types reached similar segregation levels (~55-58% like neighbors). This is consistent with Schelling’s original insight that segregation emerges from mild preferences, regardless of the specific decision process.
Memory Effects: Persistent memory lowered the ghetto rate by 53.8% (3.0 vs 6.5, p = 0.221) and shortened convergence by 15% (84 vs 99 steps) compared to memoryless LLM agents. This suggests that relationship history and social ties may play a stabilizing role in residential dynamics.

Implications for Agent-Based Modeling

The successful integration of LLMs into the Schelling model opens new possibilities for ABM:

Behavioral Realism: LLMs can capture nuanced decision-making that reflects cultural context, personal history, and social relationships
Emergent Behaviors: Human-like agents may produce unexpected emergent patterns not captured by utility maximization
Policy Testing: More realistic agents enable better prediction of policy interventions’ effects

Computational Considerations

comp_data <- data.frame(
  `Agent Type` = c("Mechanical", "Standard LLM", "Memory LLM"),
  `Avg Time/Step (s)` = c(0.02, 19.3, 19.3),
  `API Calls/Step` = c(0, 50, 50),
  `Memory Requirements` = c("Minimal", "Moderate", "High"),
  `Scalability` = c("Excellent", "Limited", "Limited"),
  check.names = FALSE  # keep the readable column names (otherwise "Agent.Type" etc.)
)

kable(comp_data, booktabs = TRUE) %>%
  kable_styling(latex_options = "striped")

Computational requirements by agent type
Agent Type	Avg Time/Step (s)	API Calls/Step	Memory Requirements	Scalability
Mechanical	0.02	0	Minimal	Excellent
Standard LLM	19.30	50	Moderate	Limited
Memory LLM	19.30	50	High	Limited

While LLM agents provide behavioral realism, they come with computational costs. Each step requires 50 LLM API calls (one per agent), making execution roughly 1000× slower than for mechanical agents (19.3 s vs 0.02 s per step). Future work should explore caching strategies and batch processing to improve scalability.
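To put these costs in per-run terms, the sketch below combines the per-step timings with the mean convergence steps reported earlier. It assumes a run stops at the convergence step, which the text does not state, so the figures are rough lower bounds rather than measured runtimes.

```r
# Sketch only: assumes each run ends at its mean convergence step.
steps <- c(Mechanical = 187, `Standard LLM` = 99, `Memory LLM` = 84)
sec_per_step <- c(Mechanical = 0.02, `Standard LLM` = 19.3, `Memory LLM` = 19.3)
calls_per_step <- c(Mechanical = 0, `Standard LLM` = 50, `Memory LLM` = 50)

runtime_min <- steps * sec_per_step / 60   # minutes per run
api_calls <- steps * calls_per_step        # LLM calls per run
slowdown <- sec_per_step[["Standard LLM"]] / sec_per_step[["Mechanical"]]

round(runtime_min, 1)
api_calls
slowdown
```

Under this assumption a memory LLM run takes about 27 minutes and 4,200 API calls, against a few seconds for a mechanical run; the per-step slowdown is 965×, which the text rounds to ~1000×.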

Limitations and Future Work

Several limitations warrant consideration:

Sample Size: With only 2 runs per condition, statistical power is limited
Single Context: While our framework supports multiple social contexts (race, income, political affiliation), this paper focuses on the baseline (red/blue) scenario to establish proof-of-concept
Grid Size: Results may differ for larger neighborhoods or different densities (our 15×15 grid with 22.2% density)
LLM Variability: Results may depend on the specific LLM used (Mixtral:8x22b) and could vary with different model architectures or training approaches

Future research directions include:
- Social Context Analysis: Systematic comparison across racial, economic, and political scenarios to understand how cultural contexts affect segregation dynamics
- Scale Effects: Testing with realistic city sizes and varying population densities
- Multi-factor Models: Incorporating multiple social identities simultaneously (e.g., race + income)
- LLM Architecture Studies: Comparing different language models and prompting strategies
- Hybrid Approaches: Developing computationally efficient models that balance LLM realism with mechanical agent scalability
- Longitudinal Validation: Comparing model predictions with real-world residential mobility data

Conclusion

This study shows that Large Language Models can stand in for traditional utility-maximizing agents in agent-based models, providing more realistic behavioral dynamics while maintaining the essential insights of classical models. In our experiments, LLM agents converged faster to stable states and, when equipped with memory, showed fewer extreme segregation patterns. These findings suggest that the integration of AI language models into agent-based modeling represents a promising direction for studying complex social systems.

The ability to simulate human-like decision-making at scale opens new avenues for policy analysis, urban planning, and social science research. As LLM technology continues to advance and computational costs decrease, we anticipate that hybrid human-AI agent models will become standard tools for understanding and predicting social phenomena.

Code and Data Availability

All code, data, and analysis scripts are available at: [repository URL]

References

Argyle, Lisa P, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. 2023. “Out of One, Many: Using Language Models to Simulate Human Samples.” Political Analysis 31 (3): 337–51.

Massey, Douglas S, and Mary J Fischer. 2001. Residential Segregation and Neighborhood Conditions in US Metropolitan Areas. Russell Sage Foundation.

Pancs, Romans, and Nicolaas J Vriend. 2007. “Schelling’s Spatial Proximity Model of Segregation Revisited.” Journal of Public Economics 91 (1-2): 1–24.

Park, Joon Sung, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv Preprint arXiv:2304.03442.

Sampson, Robert J. 1988. “Local Friendship Ties and Community Attachment in Mass Society: A Multilevel Systemic Model.” American Sociological Review, 766–79.

Schelling, Thomas C. 1971. “Dynamic Models of Segregation.” Journal of Mathematical Sociology 1 (2): 143–86.

Appendix: Detailed Statistical Results

# Full statistical results table
full_stats <- pairwise_data %>%
  mutate(
    comparison = paste(group1, "vs", group2),
    metric = str_to_title(str_replace(metric, "_", " ")),
    mean_diff = mean2 - mean1,
    ci_lower = mean_diff - 1.96 * sqrt(std1^2 + std2^2),
    ci_upper = mean_diff + 1.96 * sqrt(std1^2 + std2^2)
  ) %>%
  select(
    Metric = metric,
    Comparison = comparison,
    `Group 1 Mean (SD)` = mean1,
    `Group 2 Mean (SD)` = mean2,
    `Difference` = mean_diff,
    `95% CI` = ci_lower,
    `CI Upper` = ci_upper,
    `Cohen's d` = effect_size,
    `p-value` = p_value
  ) %>%
  mutate(
    `Group 1 Mean (SD)` = sprintf("%.3f (%.3f)", `Group 1 Mean (SD)`, 
                                  pairwise_data$std1),
    `Group 2 Mean (SD)` = sprintf("%.3f (%.3f)", `Group 2 Mean (SD)`, 
                                  pairwise_data$std2),
    # Print the difference with the same precision as the other columns
    `Difference` = sprintf("%.3f", `Difference`),
    `95% CI` = sprintf("[%.3f, %.3f]", `95% CI`, `CI Upper`),
    `Cohen's d` = round(`Cohen's d`, 3),
    `p-value` = round(`p-value`, 3)
  ) %>%
  select(-`CI Upper`)

kable(full_stats, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped", "scale_down")) %>%
  landscape()

Complete pairwise comparison results for all metrics. Difference is the Group 2 mean minus the Group 1 mean; the 95% CI is computed as the difference ± 1.96·sqrt(SD1² + SD2²).
Metric	Comparison	Group 1 Mean (SD)	Group 2 Mean (SD)	Difference	95% CI	Cohen's d	p-value
Clusters	mechanical_baseline vs standard_llm	15.500 (17.678)	13.000 (2.828)	-2.500	[-37.589, 32.589]	0.197	1.000
Distance	mechanical_baseline vs standard_llm	1.420 (0.198)	1.340 (0.028)	-0.080	[-0.472, 0.312]	0.566	1.000
Mix Deviation	mechanical_baseline vs standard_llm	0.164 (0.069)	0.222 (0.016)	0.057	[-0.081, 0.195]	-1.146	0.667
Share	mechanical_baseline vs standard_llm	0.583 (0.131)	0.553 (0.059)	-0.031	[-0.313, 0.252]	0.301	1.000
Ghetto Rate	mechanical_baseline vs standard_llm	6.500 (4.950)	6.500 (0.707)	0.000	[-9.800, 9.800]	0.000	1.000
Clusters	mechanical_baseline vs memory_llm	15.500 (17.678)	12.500 (2.121)	-3.000	[-37.897, 31.897]	0.238	1.000
Distance	mechanical_baseline vs memory_llm	1.420 (0.198)	1.190 (0.014)	-0.230	[-0.619, 0.159]	1.639	0.333
Mix Deviation	mechanical_baseline vs memory_llm	0.164 (0.069)	0.204 (0.005)	0.040	[-0.095, 0.175]	-0.818	1.000
Share	mechanical_baseline vs memory_llm	0.583 (0.131)	0.554 (0.040)	-0.029	[-0.299, 0.240]	0.300	1.000
Ghetto Rate	mechanical_baseline vs memory_llm	6.500 (4.950)	3.000 (0.000)	-3.500	[-13.202, 6.202]	1.000	0.617
Clusters	standard_llm vs memory_llm	13.000 (2.828)	12.500 (2.121)	-0.500	[-7.430, 6.430]	0.200	1.000
Distance	standard_llm vs memory_llm	1.340 (0.028)	1.190 (0.014)	-0.150	[-0.212, -0.088]	6.708	0.333
Mix Deviation	standard_llm vs memory_llm	0.222 (0.016)	0.204 (0.005)	-0.017	[-0.050, 0.015]	1.465	0.333
Share	standard_llm vs memory_llm	0.553 (0.059)	0.554 (0.040)	0.002	[-0.139, 0.142]	-0.030	1.000
Ghetto Rate	standard_llm vs memory_llm	6.500 (0.707)	3.000 (0.000)	-3.500	[-4.886, -2.114]	7.000	0.221