Human-like Decision Making in Agent-Based Models: A Comparative Study of Large Language Model Agents versus Traditional Utility Maximization in the Schelling Segregation Model

Authors: Andreas Pape, Carl Lipo, et al.

Affiliation: Binghamton University

Published: June 17, 2025

Abstract

We present a novel approach to agent-based modeling that replaces traditional utility-maximizing agents with Large Language Model (LLM) agents that make human-like residential decisions. Using the classic Schelling segregation model as our testbed, we compare three agent types: (1) traditional mechanical agents using best-response dynamics, (2) LLM agents making decisions based on current neighborhood context, and (3) LLM agents with persistent memory of past interactions and relationships. Both LLM agent types converged in 100% of runs, while mechanical agents converged in only 50%. Standard LLM agents converged in 99±9 steps, compared with 187 steps for mechanical agents in the runs where they converged. Memory-enhanced LLM agents converged fastest, at 84±14 steps, a 2.2× speedup over the mechanical baseline. Both LLM variants reached slightly lower final segregation than mechanical agents (~55% vs 58% like neighbors) and fewer extreme outcomes: memory LLM agents showed a 53.8% reduction in “ghetto” formation relative to the mechanical baseline (p=0.018), while the 30.8% reduction for standard LLM agents did not reach significance (p=0.068). These findings suggest that incorporating human-like decision-making through LLMs can produce more stable and realistic dynamics in agent-based models of social phenomena, with important implications for urban planning and policy analysis.

Keywords

agent-based modeling, large language models, segregation, Schelling model, artificial intelligence, complex systems

Introduction

The Schelling segregation model (Schelling 1971) has been a cornerstone of agent-based modeling (ABM) for over five decades, demonstrating how mild individual preferences for similar neighbors can lead to stark residential segregation. Traditional implementations use utility-maximizing agents that relocate when the proportion of like neighbors falls below a threshold. While mathematically elegant, this approach may not capture the complexity of human residential decision-making, which involves social relationships, personal history, and contextual factors beyond simple utility calculations.

Systematic study of Schelling model variants was advanced significantly by Pancs and Vriend (2007), who developed standardized metrics specifically designed for grid-based segregation simulations. Their framework addressed the critical problem that traditional urban segregation indices perform poorly on small-scale agent-based models, providing the quantitative foundation necessary for rigorous comparison across different agent implementations.

Recent advances in Large Language Models (LLMs) offer an unprecedented opportunity to incorporate more realistic human-like decision-making into agent-based models. LLMs trained on vast corpora of human text can simulate nuanced responses to complex social situations, potentially bridging the gap between simplified mathematical models and real-world behavior (Park et al. 2023; Argyle et al. 2023).

In this paper, we present a comparative study of three agent types within the Schelling framework:

Mechanical agents: Traditional utility-maximizing agents using best-response dynamics
Standard LLM agents: Agents whose decisions are generated by LLMs based on current neighborhood context
Memory LLM agents: LLM agents with persistent memory of past interactions and relationships

Our key research questions are:

- How do convergence dynamics differ between mechanical and LLM-based agents?
- Do LLM agents produce different segregation patterns than traditional agents?
- What is the impact of memory on residential stability and segregation outcomes?

Methods

Experimental Design

We implemented a comparative framework using identical environmental conditions across all agent types. The simulation environment consists of an 8×8 grid (64 cells) populated with 30 agents equally divided between two types (15 Type A “red” and 15 Type B “blue”), yielding a density of 46.9%. Each experiment type was run for 10 replicates with a maximum of 50 steps per run, though convergence typically occurred much earlier.

Agent Implementations

Mechanical Baseline Agents

Traditional Schelling agents operate as pure utility maximizers using a deterministic threshold function. Each agent continuously evaluates their current position based on neighborhood composition:

\[U_i = \begin{cases} 1 & \text{if } p_i \geq \tau \\ 0 & \text{otherwise} \end{cases}\]

where \(p_i\) is the proportion of like neighbors among the occupied cells of agent \(i\)'s Moore neighborhood (the 8 adjacent cells) and \(\tau = 0.5\) is the satisfaction threshold. Agents with \(U_i = 0\) immediately relocate to the nearest available cell that satisfies their threshold, following a best-response dynamic that guarantees utility improvement with each move.
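As an illustration of the threshold rule above, the following is a minimal sketch, not the implementation used in the study. The grid encoding (0 = empty, 1 = Type A, 2 = Type B), the function names, and the choice to treat an agent with no occupied neighbors as satisfied are all assumptions made for the example.

```r
# Hypothetical grid encoding (an assumption): 0 = empty, 1 = Type A, 2 = Type B

# Proportion of like neighbors p_i in the Moore neighborhood of cell (r, k)
like_share <- function(grid, r, k) {
  rows <- max(1, r - 1):min(nrow(grid), r + 1)
  cols <- max(1, k - 1):min(ncol(grid), k + 1)
  block <- grid[rows, cols, drop = FALSE]
  me <- grid[r, k]
  like <- sum(block == me) - 1            # exclude the agent itself
  other <- sum(block != me & block != 0)  # occupied cells of the other type
  if (like + other == 0) return(NA_real_) # no occupied neighbors
  like / (like + other)
}

# U_i = 1 if p_i >= tau, else 0.
# Treating an agent without neighbors as satisfied is an assumption.
utility <- function(grid, r, k, tau = 0.5) {
  p <- like_share(grid, r, k)
  if (is.na(p) || p >= tau) 1L else 0L
}

grid <- matrix(c(1, 1, 2,
                 0, 1, 2,
                 2, 2, 0), nrow = 3, byrow = TRUE)

like_share(grid, 2, 2)  # centre agent: 2 like and 4 unlike neighbors
utility(grid, 2, 2)     # below tau = 0.5, so the agent would relocate
utility(grid, 1, 1)     # corner agent sees only like neighbors
```

The relocation step (searching for the nearest empty cell that satisfies the threshold) is omitted here, since the distance measure and tie-breaking rule are not specified above.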

This approach represents classical rational choice theory: agents have perfect information, consistent preferences, and make optimal decisions to maximize their utility function. While computationally efficient and theoretically elegant, it reduces complex human residential decisions to simple mathematical optimization.

Standard LLM Agents

LLM agents replace mathematical utility functions with natural language reasoning. Each agent receives contextual prompts describing their current situation and must make residential decisions through linguistic reasoning. For baseline (red/blue) scenarios, the prompt structure is:

You are a [red/blue] resident in a neighborhood simulation. 
Current situation:
- Your neighborhood has [X] red neighbors and [Y] blue neighbors
- There are [Z] empty houses within moving distance
- You have been living here for [N] time steps

Based on your preferences as a [red/blue] resident, would you:
1. Stay in your current location
2. Move to a different available house

If moving, consider factors like neighborhood composition, 
proximity to similar residents, and overall comfort level.

The LLM generates a natural language response that is parsed to extract the agent’s decision. This approach captures nuanced reasoning that may include:

- Gradual comfort with diversity vs. strong segregation preferences
- Consideration of neighborhood trends and stability
- Social factors beyond pure numerical thresholds
- Context-dependent preferences that may vary over time
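The prompt-and-parse loop described above could be sketched as follows. The template text follows the baseline prompt, but the function names and the keyword-based parsing rule are hypothetical: the parsing procedure is not specified here, and a production parser would likely need to be more robust.

```r
# Fill the situational part of the baseline prompt for one agent
build_prompt <- function(color, n_red, n_blue, n_empty, n_steps) {
  sprintf(paste(
    "You are a %s resident in a neighborhood simulation.",
    "Current situation:",
    "- Your neighborhood has %d red neighbors and %d blue neighbors",
    "- There are %d empty houses within moving distance",
    "- You have been living here for %d time steps",
    sep = "\n"), color, n_red, n_blue, n_empty, n_steps)
}

# Hypothetical parser: take whichever option ("stay"/1 or "move"/2) is
# mentioned first in the reply; default to "stay" if neither appears.
parse_decision <- function(reply) {
  text <- tolower(reply)
  stay_at <- as.integer(regexpr("\\b(stay|1)\\b", text, perl = TRUE))
  move_at <- as.integer(regexpr("\\b(move|2)\\b", text, perl = TRUE))
  if (move_at > 0 && (stay_at < 0 || move_at < stay_at)) "move" else "stay"
}

cat(build_prompt("red", 3L, 2L, 4L, 5L))
parse_decision("2. Move to a different available house")
parse_decision("I would stay. I see no reason to move.")
```

Defaulting to "stay" on an unparseable reply is a conservative assumption; it biases the simulation toward stability, so a real implementation might instead retry the query or log the failure.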

Memory-Enhanced LLM Agents

Memory-enhanced agents extend standard LLM agents with persistent episodic memory, more closely approximating human decision-making where past experiences shape current choices. Each agent maintains a detailed history including:

Residential History: Complete record of past locations, duration at each address, and reasons for moving
Social Interactions: Memory of positive/negative encounters with neighbors of different types
Neighborhood Evolution: Observations of how local composition changed over time
Personal Relationships: Development of attachments to specific neighbors or locations

The prompt structure includes this historical context:

You are a [identity] resident with the following history:
RESIDENTIAL HISTORY:
- Previously lived at [locations] for [durations]
- Moved because: [recorded reasons]

SOCIAL MEMORY:
- Positive interactions: [specific neighbor relationships]
- Concerns about: [negative experiences or observations]

CURRENT SITUATION:
- Living at current location for [duration]
- Neighborhood has [composition and trends]
- Available moving options: [locations with contexts]

Given your personal history and relationships, what would you do?
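One way such a memory could be stored and rendered into the history sections of the prompt is sketched below. The record layout, field names, and example entries are purely illustrative assumptions, not the data structure used in the study.

```r
# Hypothetical memory record for one agent; all fields and entries are illustrative
memory <- list(
  residential = data.frame(
    location = c("(2,3)", "(5,1)"),
    duration = c(4L, 7L),
    reason   = c("too few similar neighbors", "neighbor moved away"),
    stringsAsFactors = FALSE
  ),
  positive = c("friendly with neighbor at (5,2)"),
  concerns = character(0)
)

# Render the record into the RESIDENTIAL HISTORY and SOCIAL MEMORY sections
render_memory <- function(m) {
  history <- sprintf("- Previously lived at %s for %d steps; moved because: %s",
                     m$residential$location, m$residential$duration,
                     m$residential$reason)
  fmt <- function(x) if (length(x) == 0) "none" else paste(x, collapse = "; ")
  paste(c("RESIDENTIAL HISTORY:", history,
          "",
          "SOCIAL MEMORY:",
          paste("- Positive interactions:", fmt(m$positive)),
          paste("- Concerns about:", fmt(m$concerns))),
        collapse = "\n")
}

cat(render_memory(memory))
```

Because every move appends a row to the history, the rendered text grows over a run; truncating or summarizing older entries would be one way to bound prompt length.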

Theoretical Expectations for Memory Effects:

Reduced Volatility: Agents with established relationships should move less frequently, reducing overall churn in the system and leading to faster convergence.
Path Dependence: Early positive experiences with diversity should make agents more tolerant of mixed neighborhoods, while negative experiences should increase segregation preferences.
Stabilization Effects: As agents develop local social ties, they become less likely to abandon neighborhoods even when composition changes slightly.
Realistic Inertia: Memory should introduce the residential inertia observed in real populations, where moving decisions involve substantial social and emotional costs beyond simple preference satisfaction.
Reduced Extreme Segregation: Strong social memories should prevent the formation of completely homogeneous neighborhoods (“ghettos”) by maintaining some agents who value established relationships over perfect homophily.

These expectations are based on urban sociology research showing that residential decisions involve complex tradeoffs between preferences for similar neighbors and attachment to place, social networks, and personal history (Sampson 1988; Massey and Fischer 2001).

Segregation Metrics: The Pancs and Vriend Framework

A critical challenge in Schelling model research has been the lack of standardized metrics for comparing segregation outcomes across different implementations and parameters. While Schelling’s original work provided intuitive insights about segregation emergence, it lacked quantitative measures that could enable systematic comparison of results across studies, agent types, or experimental conditions.

Pancs and Vriend (2007) addressed this limitation by developing a comprehensive statistical framework specifically designed for the Schelling model. Their contribution was crucial because traditional segregation indices used in urban sociology (such as the Dissimilarity Index or Isolation Index) were designed for large-scale census data and perform poorly on small-grid simulations with stochastic dynamics.

The Need for Schelling-Specific Metrics

Pancs and Vriend identified several problems with applying standard segregation measures to agent-based models:

Scale Sensitivity: Traditional indices assume large populations and break down with small grids (such as our 8×8 grid with 30 agents)
Boundary Effects: Grid-based simulations have edge effects that distort standard distance-based measures
Dynamic Context: ABM requires metrics that capture segregation patterns during transient states, not just final equilibria
Comparative Framework: No existing metrics enabled direct comparison between different agent implementations

Pancs-Vriend Metric Suite

We adopt Pancs and Vriend’s five complementary metrics, each capturing different aspects of spatial segregation:

Share (\(S\)): Average proportion of like neighbors around each agent \[S = \frac{1}{N} \sum_{i=1}^{N} \frac{L_i}{L_i + D_i}\] where \(L_i\) is the number of like neighbors and \(D_i\) the number of different-type neighbors of agent \(i\). This metric ranges from 0 to 1: with two equal-sized groups, values near 0.5 correspond to random mixing, and 1.0 corresponds to complete segregation.
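A minimal sketch of the Share computation on a small grid follows. The grid encoding (0 = empty, 1 = Type A, 2 = Type B) and the decision to skip agents with no occupied neighbors, for whom \(L_i + D_i = 0\), are assumptions made for the example.

```r
# Share metric S: mean over agents of L_i / (L_i + D_i)
# Assumed encoding: 0 = empty, 1 = Type A, 2 = Type B
share_metric <- function(grid) {
  occupied <- which(grid != 0, arr.ind = TRUE)
  shares <- apply(occupied, 1, function(rc) {
    rows <- max(1, rc[1] - 1):min(nrow(grid), rc[1] + 1)
    cols <- max(1, rc[2] - 1):min(ncol(grid), rc[2] + 1)
    block <- grid[rows, cols, drop = FALSE]
    me <- grid[rc[1], rc[2]]
    like <- sum(block == me) - 1            # L_i, excluding the agent itself
    other <- sum(block != me & block != 0)  # D_i
    if (like + other == 0) NA_real_ else like / (like + other)
  })
  mean(shares, na.rm = TRUE)  # agents without neighbors are skipped (assumption)
}

segregated <- matrix(c(1, 1, 0, 2, 2,
                       1, 1, 0, 2, 2), nrow = 2, byrow = TRUE)
checker <- matrix(c(1, 2,
                    2, 1), nrow = 2, byrow = TRUE)

share_metric(segregated)  # every agent sees only like neighbors
share_metric(checker)     # each agent has 1 like and 2 unlike neighbors
```

The checkerboard case illustrates that the metric can fall below 0.5 when agents are more mixed than random placement would produce.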

Clusters (\(C\)): Number of spatially contiguous same-type regions using 8-connectivity \[C = \text{count of connected components by type}\] Lower values indicate more segregated (fewer, larger clusters) while higher values suggest fragmented settlement patterns.
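The cluster count can be sketched as a flood fill over occupied cells. This version assumes the same 0/1/2 grid encoding, counts clusters of both types together, and ignores empty cells; these choices are assumptions rather than details given above.

```r
# Count connected same-type components using 8-connectivity (flood fill)
# Assumed encoding: 0 = empty, 1 = Type A, 2 = Type B
count_clusters <- function(grid) {
  seen <- matrix(FALSE, nrow(grid), ncol(grid))
  n_clusters <- 0
  for (i in seq_len(nrow(grid))) {
    for (j in seq_len(ncol(grid))) {
      if (grid[i, j] == 0 || seen[i, j]) next
      n_clusters <- n_clusters + 1      # found an unvisited agent: new cluster
      stack <- list(c(i, j))
      seen[i, j] <- TRUE
      while (length(stack) > 0) {
        cell <- stack[[length(stack)]]  # pop the last cell
        stack[[length(stack)]] <- NULL
        for (di in -1:1) {
          for (dj in -1:1) {
            ii <- cell[1] + di
            jj <- cell[2] + dj
            if (ii < 1 || ii > nrow(grid) || jj < 1 || jj > ncol(grid)) next
            if (!seen[ii, jj] && grid[ii, jj] == grid[i, j]) {
              seen[ii, jj] <- TRUE
              stack[[length(stack) + 1]] <- c(ii, jj)
            }
          }
        }
      }
    }
  }
  n_clusters
}

grid <- matrix(c(1, 0, 2,
                 0, 1, 2,
                 1, 0, 0), nrow = 3, byrow = TRUE)
count_clusters(grid)  # Type A cells touch diagonally, so they form one cluster
```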

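Distance (\(D\)): Average Euclidean distance between different-type agents \[D = \frac{1}{N_A \cdot N_B} \sum_{i \in A} \sum_{j \in B} \lVert \mathrm{pos}_i - \mathrm{pos}_j \rVert\] where \(N_A\) and \(N_B\) are the numbers of Type A and Type B agents and \(\mathrm{pos}_i\) is the grid position of agent \(i\). Higher values indicate greater spatial separation between groups.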
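The double sum above can be evaluated with `outer()`, as in the following sketch. The 0/1/2 grid encoding and the function name are assumptions; distances are measured in grid cells.

```r
# Mean Euclidean distance over all (Type A, Type B) pairs
# Assumed encoding: 0 = empty, 1 = Type A, 2 = Type B
mean_cross_distance <- function(grid) {
  a <- which(grid == 1, arr.ind = TRUE)
  b <- which(grid == 2, arr.ind = TRUE)
  # N_A x N_B matrices of row and column differences
  d_row <- outer(a[, "row"], b[, "row"], "-")
  d_col <- outer(a[, "col"], b[, "col"], "-")
  mean(sqrt(d_row^2 + d_col^2))
}

line <- matrix(c(1, 0, 0, 2), nrow = 1)
checker <- matrix(c(1, 2,
                    2, 1), nrow = 2, byrow = TRUE)

mean_cross_distance(line)     # one pair, three cells apart
mean_cross_distance(checker)  # every cross-type pair is adjacent
```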

Ghetto Rate (\(G\)): Proportion of agents living in completely homogeneous neighborhoods \[G = \frac{\text{agents with only same-type neighbors}}{N}\] This captures extreme segregation where agents have zero contact with the other group.
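A sketch of the Ghetto Rate under the same assumed 0/1/2 grid encoding follows. Requiring at least one like neighbor, so that a completely isolated agent is not counted as living in a homogeneous neighborhood, is an assumption; the definition above does not say how such agents are handled.

```r
# Ghetto rate G: fraction of agents whose occupied neighbors are all same-type
# Assumed encoding: 0 = empty, 1 = Type A, 2 = Type B
ghetto_rate <- function(grid) {
  occupied <- which(grid != 0, arr.ind = TRUE)
  homogeneous <- apply(occupied, 1, function(rc) {
    rows <- max(1, rc[1] - 1):min(nrow(grid), rc[1] + 1)
    cols <- max(1, rc[2] - 1):min(ncol(grid), rc[2] + 1)
    block <- grid[rows, cols, drop = FALSE]
    me <- grid[rc[1], rc[2]]
    like <- sum(block == me) - 1
    other <- sum(block != me & block != 0)
    like > 0 && other == 0  # needs at least one neighbor (assumption)
  })
  mean(homogeneous)
}

row_grid <- matrix(c(1, 1, 2, 2), nrow = 1)
ghetto_rate(row_grid)  # the two end agents see only their own type
```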

Mix Deviation (\(M\)): Deviation from perfect checkerboard integration pattern \[M = \frac{1}{N} \sum_{i=1}^{N} \left| \text{actual\_neighbors}_i - \text{expected\_neighbors}_i \right|\] Measures how far the current pattern deviates from perfect spatial integration.

Why This Framework Enables Our Comparison

The Pancs-Vriend metrics are particularly valuable for our study because they:

Enable Cross-Agent Comparison: Provide standardized measures that work equally well for mechanical, standard LLM, and memory LLM agents
Capture Multiple Segregation Aspects: No single metric fully captures segregation complexity; the five-metric suite provides complementary perspectives
Handle Small-Scale Dynamics: Designed specifically for grid-based ABM with realistic population sizes
Track Convergence: Enable detection of stable states across different agent types with different convergence patterns
Quantify Qualitative Differences: Convert complex spatial patterns into comparable numerical values

This standardized framework allows us to make quantitative claims, such as LLM agents converging 2.2× faster while reaching similar final segregation levels (~55% vs 58% on the share metric). Such comparisons would be difficult to make without robust metrics designed for Schelling-type models.

Statistical Analysis

All experiments were run with 10 replicates for each condition. We use Mann-Whitney U tests for pairwise comparisons and report effect sizes using Cohen’s d. Convergence was detected using the Pancs-Vriend plateau detection algorithm, requiring 10 consecutive steps with no agent movements.
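The convergence criterion (10 consecutive steps with no agent movements) can be sketched as a scan over per-step move counts. The input vector, the function name, and the choice to report the first step of the quiet plateau as the convergence step are assumptions for illustration.

```r
# moves_per_step[t] = number of agents that relocated at step t (hypothetical input)
convergence_step <- function(moves_per_step, window = 10) {
  quiet <- 0
  for (t in seq_along(moves_per_step)) {
    quiet <- if (moves_per_step[t] == 0) quiet + 1 else 0
    # Report the first step of the quiet plateau (a reporting assumption)
    if (quiet >= window) return(t - window + 1)
  }
  NA_integer_  # no plateau of the required length within the run
}

convergence_step(c(3, 2, 1, rep(0, 10)))           # plateau starts at step 4
convergence_step(c(1, rep(0, 5), 1, rep(0, 9)))    # longest plateau is 9 steps
```

A run that returns `NA` under this rule would count as non-converged when computing the convergence rate.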

Results

Convergence Dynamics

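# Packages used by the analysis chunks in this section
library(tidyverse)   # dplyr, tidyr, stringr, ggplot2
library(patchwork)   # combines ggplot objects with `+`
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), footnote(), landscape()

# `convergence_data`, `pairwise_data`, `time_series_data`, and the named color
# vector `agent_colors` are expected to exist already (e.g. from a setup chunk)

# Prepare convergence data: map experiment IDs to display labels
conv_summary <- convergence_data %>%
  mutate(
    agent_type = case_when(
      experiment == "mechanical_baseline" ~ "Mechanical Baseline",
      experiment == "standard_llm" ~ "Standard LLM",
      experiment == "memory_llm" ~ "Memory LLM"
    )
  )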

# A. Convergence time distribution
p1 <- ggplot(conv_summary, aes(x = agent_type, y = mean_convergence_step, fill = agent_type)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_convergence_step - std_convergence_step,
                    ymax = mean_convergence_step + std_convergence_step),
                width = 0.2) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "", y = "Steps to Convergence", title = "A. Convergence Time") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# B. Convergence rates
p2 <- ggplot(conv_summary, aes(x = agent_type, y = convergence_rate, fill = agent_type)) +
  geom_col() +
  geom_text(aes(label = paste0(convergence_rate, "%")), vjust = -0.5) +
  scale_fill_manual(values = agent_colors) +
  scale_y_continuous(limits = c(0, 110)) +
  labs(x = "", y = "Convergence Rate (%)", title = "B. Convergence Success") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# C. Relative speed
baseline_steps <- conv_summary$mean_convergence_step[conv_summary$experiment == "mechanical_baseline"]
conv_summary <- conv_summary %>%
  mutate(relative_speed = baseline_steps / mean_convergence_step)

p3 <- ggplot(conv_summary, aes(x = agent_type, y = relative_speed, fill = agent_type)) +
  geom_col() +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red", alpha = 0.5) +
  geom_text(aes(label = sprintf("%.1fx", relative_speed)), vjust = -0.5) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "", y = "Relative Speed", title = "C. Speed vs Baseline") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

# Combine plots
p1 + p2 + p3

Our results reveal striking differences in convergence behavior across agent types. Both LLM agent variants achieved 100% convergence across all runs, while mechanical agents only converged in 50% of runs. When mechanical agents did converge, they required an average of 187 steps. In contrast, standard LLM agents converged in 99±9 steps, while memory-enhanced LLM agents demonstrated the fastest convergence at 84±14 steps. This represents a 2.2× speed improvement for memory LLM agents over the mechanical baseline, highlighting the stabilizing effect of persistent social memory on residential dynamics.

Segregation Patterns

# Prepare pairwise data for visualization
metrics_summary <- pairwise_data %>%
  filter(group1 == "mechanical_baseline") %>%
  select(metric, group1, group2, mean1, std1, mean2, std2) %>%
  pivot_longer(cols = c(mean1, mean2, std1, std2),
               names_to = c(".value", "group"),
               names_pattern = "(mean|std)(.)") %>%
  # Keep only the LLM side of each pairing here. The baseline side appears
  # once per pairing, and geom_col() would stack those duplicate rows.
  filter(group == "2") %>%
  mutate(
    agent_type = case_when(
      str_detect(group2, "standard") ~ "Standard LLM",
      str_detect(group2, "memory") ~ "Memory LLM"
    )
  ) %>%
  bind_rows(
    # Add the mechanical baseline exactly once per metric
    pairwise_data %>%
      filter(group1 == "mechanical_baseline", group2 == "standard_llm") %>%
      select(metric, mean = mean1, std = std1) %>%
      mutate(agent_type = "Mechanical Baseline")
  )

# Create faceted plot for all metrics
metrics_plot <- metrics_summary %>%
  mutate(
    metric_label = case_when(
      metric == "share" ~ "Share (% Like Neighbors)",
      metric == "clusters" ~ "Number of Clusters",
      metric == "distance" ~ "Inter-type Distance",
      metric == "ghetto_rate" ~ "Ghetto Formation",
      metric == "mix_deviation" ~ "Mix Deviation"
    )
  ) %>%
  ggplot(aes(x = agent_type, y = mean, fill = agent_type)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean - std, ymax = mean + std), width = 0.2) +
  facet_wrap(~ metric_label, scales = "free_y", ncol = 2) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "", y = "Metric Value", 
       title = "Segregation Patterns Across Agent Types") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

metrics_plot

Final segregation metrics across agent types. Error bars represent standard deviation across runs.

Statistical Comparisons

# Create summary table of key comparisons
comparison_table <- pairwise_data %>%
  filter(metric %in% c("share", "ghetto_rate", "distance")) %>%
  mutate(
    comparison = paste(group1, "vs", group2),
    metric = str_to_title(str_replace(metric, "_", " ")),
    effect_size_cat = case_when(
      abs(effect_size) < 0.2 ~ "Negligible",
      abs(effect_size) < 0.5 ~ "Small",
      abs(effect_size) < 0.8 ~ "Medium",
      TRUE ~ "Large"
    ),
    significance = ifelse(p_value < 0.05, "*", "")
  ) %>%
  select(Metric = metric, 
         Comparison = comparison,
         `Mean Diff (%)` = percent_change,
         `Effect Size` = effect_size,
         `Category` = effect_size_cat,
         `p-value` = p_value,
         Sig = significance) %>%
  mutate(
    `Mean Diff (%)` = round(`Mean Diff (%)`, 1),
    `Effect Size` = round(`Effect Size`, 2),
    `p-value` = round(`p-value`, 3)
  )

kable(comparison_table, booktabs = TRUE, align = "lcccccc") %>%
  kable_styling(latex_options = c("striped", "hold_position")) %>%
  column_spec(1, width = "2cm") %>%
  column_spec(2, width = "5cm") %>%
  footnote(general = "* indicates p < 0.05", 
           general_title = "Note:", 
           footnote_as_chunk = TRUE)

Pairwise statistical comparisons between agent types. Effect sizes interpreted as: negligible (<0.2), small (0.2-0.5), medium (0.5-0.8), large (>0.8).
Metric	Comparison	Mean Diff (%)	Effect Size	Category	p-value	Sig
Share	mechanical_baseline vs standard_llm	-5.1	-1.82	Large	0.041	*
Share	mechanical_baseline vs memory_llm	-5.0	-1.60	Large	0.048	*
Share	standard_llm vs memory_llm	0.2	0.05	Negligible	0.912
Distance	mechanical_baseline vs standard_llm	-8.0	-1.01	Large	0.089
Distance	mechanical_baseline vs memory_llm	-8.7	-1.12	Large	0.072
Distance	standard_llm vs memory_llm	-0.7	-0.08	Negligible	0.882
Ghetto Rate	mechanical_baseline vs standard_llm	-30.8	-1.13	Large	0.068
Ghetto Rate	mechanical_baseline vs memory_llm	-53.8	-2.10	Large	0.018	*
Ghetto Rate	standard_llm vs memory_llm	-33.3	-1.09	Large	0.074
Note: * indicates p < 0.05
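The effect sizes in the table can be approximately reproduced from the group means and standard deviations reported in the appendix. The sketch below assumes the equal-group-size form of Cohen's d with a pooled standard deviation; the exact formula used in the analysis pipeline is not stated, and small discrepancies are expected because the inputs are rounded.

```r
# Cohen's d from summary statistics, assuming equal group sizes
cohens_d <- function(mean1, sd1, mean2, sd2) {
  pooled_sd <- sqrt((sd1^2 + sd2^2) / 2)
  (mean2 - mean1) / pooled_sd
}

# Thresholds follow the table caption above
effect_category <- function(d) {
  if (abs(d) < 0.2) "Negligible"
  else if (abs(d) < 0.5) "Small"
  else if (abs(d) < 0.8) "Medium"
  else "Large"
}

# Ghetto Rate, mechanical_baseline vs memory_llm
d_ghetto <- cohens_d(0.260, 0.080, 0.120, 0.050)
round(d_ghetto, 2)
effect_category(d_ghetto)

# Share, mechanical_baseline vs standard_llm
round(cohens_d(0.583, 0.015, 0.553, 0.018), 2)
```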

Time Series Evolution

# Create time series plot
time_evolution <- time_series_data %>%
  group_by(step, agent_type) %>%
  summarise(
    mean_share = mean(share),
    se_share = sd(share) / sqrt(n()),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = step, y = mean_share, color = agent_type)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = mean_share - se_share, 
                  ymax = mean_share + se_share,
                  fill = agent_type), 
              alpha = 0.2) +
  # Add convergence lines
  geom_vline(xintercept = 84, color = agent_colors["Memory LLM"], 
             linetype = "dashed", alpha = 0.7) +
  geom_vline(xintercept = 99, color = agent_colors["Standard LLM"], 
             linetype = "dashed", alpha = 0.7) +
  scale_color_manual(values = agent_colors) +
  scale_fill_manual(values = agent_colors) +
  labs(x = "Simulation Step", 
       y = "Share (Proportion of Like Neighbors)",
       title = "Segregation Evolution Over Time") +
  theme(legend.title = element_blank()) +
  coord_cartesian(xlim = c(0, 200))

time_evolution

Evolution of segregation (share metric) over time, averaged across runs for each agent type. Shaded regions indicate standard error. Dashed vertical lines mark the average convergence steps of the memory LLM (84) and standard LLM (99) agents.

Discussion

Key Findings

Our study reveals three major insights about incorporating LLM-based decision-making into agent-based models:

Convergence Reliability: The most striking finding is the complete convergence (100%) of both LLM agent types compared to only 50% convergence for mechanical agents. This suggests that human-like decision-making processes may be inherently more stable than pure utility maximization, possibly due to satisficing behaviors and social considerations that prevent endless cycling.
Segregation Outcomes: Despite different decision mechanisms, all agent types that converged reached similar segregation levels (~55-58% like neighbors), although the roughly 5% lower share for LLM agents was statistically significant (p=0.041 and p=0.048). This supports Schelling’s original insight that segregation emerges from mild preferences, regardless of the specific decision process. However, LLM agents reached these levels with less extreme clustering: the reduction in ghetto rate relative to the mechanical baseline was significant for memory LLM agents (p=0.018) but not for standard LLM agents (p=0.068).
Memory Effects: Persistent memory had a stabilizing effect, reducing extreme segregation (“ghetto” formation) by 53.8% relative to the mechanical baseline (p=0.018) and shortening convergence time by 15% compared to memoryless LLM agents (84±14 vs 99±9 steps), the fastest of all agent types. Relative to standard LLM agents, the ghetto rate fell by 33.3%, a difference that did not reach significance (p=0.074). This suggests that relationship history and social ties create “friction” that prevents the cascade effects leading to complete spatial separation.

Implications for Agent-Based Modeling

The successful integration of LLMs into the Schelling model opens new possibilities for ABM:

Behavioral Realism: LLMs can capture nuanced decision-making that reflects cultural context, personal history, and social relationships
Emergent Behaviors: Human-like agents may produce unexpected emergent patterns not captured by utility maximization
Policy Testing: More realistic agents enable better prediction of policy interventions’ effects

Computational Considerations

comp_data <- data.frame(
  `Agent Type` = c("Mechanical", "Standard LLM", "Memory LLM"),
  `Avg Time/Step (s)` = c(0.002, 2.8, 6.2),
  `API Calls/Step` = c(0, 30, 30),
  `Total Runtime (10 runs)` = c("0.03 min", "199 min", "450 min"),
  `Memory Requirements` = c("Minimal", "Moderate", "High"),
  `Scalability` = c("Excellent", "Limited", "Limited"),
  check.names = FALSE  # keep readable column names instead of Agent.Type etc.
)

kable(comp_data, booktabs = TRUE) %>%
  kable_styling(latex_options = "striped")

Computational requirements by agent type
Agent Type	Avg Time/Step (s)	API Calls/Step	Total Runtime (10 runs)	Memory Requirements	Scalability
Mechanical	0.002	0	0.03 min	Minimal	Excellent
Standard LLM	2.800	30	199 min	Moderate	Limited
Memory LLM	6.200	30	450 min	High	Limited

While LLM agents provide behavioral realism, they come with significant computational costs. Each step requires 30 LLM API calls (one per agent), resulting in ~1400× slower execution for standard LLM agents and ~3100× slower for memory LLM agents compared to mechanical agents. The memory LLM agents showed increasing response times over the course of runs (from ~2.8s to ~7.3s per step) as their context windows filled with historical information. Future work should explore caching strategies, batch processing, and context compression to improve scalability.
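The slowdown factors quoted above follow directly from the per-step timings in the table, as this short calculation shows. The vector names are illustrative; the numbers are taken from the table.

```r
# Average wall-clock time per step (seconds), from the table above
time_per_step <- c(mechanical = 0.002, standard_llm = 2.8, memory_llm = 6.2)

# Slowdown relative to the mechanical baseline
slowdown <- round(time_per_step / time_per_step["mechanical"])
slowdown

# Total runtime across all three conditions (minutes for 10 runs each), in hours
runtime_min <- c(mechanical = 0.03, standard_llm = 199, memory_llm = 450)
total_hours <- sum(runtime_min) / 60
round(total_hours, 1)
```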

Limitations and Future Work

Several limitations warrant consideration:

Limited Social Context: While our framework supports multiple social contexts (race, income, political affiliation), this study focused solely on the baseline (red/blue) scenario. The neutral framing may not capture the full complexity of real-world residential decisions influenced by cultural and historical factors.
Grid Size: Results may differ for larger neighborhoods or different densities. Our 8×8 grid with 46.9% density represents a high-density environment that may not generalize to suburban or rural contexts.
LLM Variability: Results may depend on the specific LLM used (qwen2.5-coder:32B via chat.binghamton.edu) and could vary with different model architectures, training approaches, or prompting strategies.
Computational Scalability: The current implementation’s runtime (about 10.8 hours for all 30 runs, of which 7.5 hours were the 10 memory LLM runs) limits large-scale parameter exploration

Future research directions include:

- Social Context Analysis: Systematic comparison across racial, economic, and political scenarios to understand how cultural contexts affect segregation dynamics
- Scale Effects: Testing with realistic city sizes and varying population densities
- Multi-factor Models: Incorporating multiple social identities simultaneously (e.g., race + income)
- LLM Architecture Studies: Comparing different language models and prompting strategies
- Hybrid Approaches: Developing computationally efficient models that balance LLM realism with mechanical agent scalability
- Longitudinal Validation: Comparing model predictions with real-world residential mobility data

Conclusion

This study demonstrates that Large Language Models can successfully replace traditional utility-maximizing agents in agent-based models, providing more stable and potentially more realistic behavioral dynamics. Our key findings show that:

Stability: LLM agents achieved 100% convergence compared to only 50% for mechanical agents, suggesting more robust equilibrium-finding behavior
Efficiency: Memory-enhanced LLM agents converged 2.2× faster than mechanical agents when the latter did converge
Segregation Patterns: While overall segregation levels remained similar (~55% vs 58%), memory LLM agents showed significantly reduced extreme segregation (53.8% reduction in ghetto formation, p=0.018); the 30.8% reduction for standard LLM agents did not reach significance (p=0.068)

These results suggest that incorporating human-like decision-making through LLMs can produce more stable residential patterns with less extreme clustering. The memory effect appears particularly important, as persistent social relationships and residential history create a stabilizing force that prevents the formation of completely homogeneous neighborhoods.

The ability to simulate human-like decision-making at scale opens new avenues for policy analysis, urban planning, and social science research. As LLM technology continues to advance and computational costs decrease, we anticipate that LLM-enhanced agent models will become valuable tools for understanding and predicting social phenomena, particularly in contexts where human relationships and social memory play crucial roles.

Code and Data Availability

All code, data, and analysis scripts are available at: [repository URL]

References

Argyle, Lisa P, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. 2023. “Out of One, Many: Using Language Models to Simulate Human Samples.” Political Analysis 31 (3): 337–51.

Massey, Douglas S, and Mary J Fischer. 2001. Residential Segregation and Neighborhood Conditions in US Metropolitan Areas. Russell Sage Foundation.

Pancs, Romans, and Nicolaas J Vriend. 2007. “Schelling’s Spatial Proximity Model of Segregation Revisited.” Journal of Public Economics 91 (1-2): 1–24.

Park, Joon Sung, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv Preprint arXiv:2304.03442.

Sampson, Robert J. 1988. “Local Friendship Ties and Community Attachment in Mass Society: A Multilevel Systemic Model.” American Sociological Review, 766–79.

Schelling, Thomas C. 1971. “Dynamic Models of Segregation.” Journal of Mathematical Sociology 1 (2): 143–86.

Appendix: Detailed Statistical Results

# Full statistical results table
full_stats <- pairwise_data %>%
  mutate(
    comparison = paste(group1, "vs", group2),
    metric = str_to_title(str_replace(metric, "_", " ")),
    mean_diff = mean2 - mean1,
    # Half-width uses the run-level SDs, not standard errors, so the interval
    # is wider than a conventional 95% CI for the difference in means
    half_width = 1.96 * sqrt(std1^2 + std2^2),
    ci_lower = mean_diff - half_width,
    ci_upper = mean_diff + half_width
  ) %>%
  # Format columns while std1/std2 are still in scope, instead of reaching
  # back into pairwise_data by row position
  transmute(
    Metric = metric,
    Comparison = comparison,
    `Group 1 Mean (SD)` = sprintf("%.3f (%.3f)", mean1, std1),
    `Group 2 Mean (SD)` = sprintf("%.3f (%.3f)", mean2, std2),
    Difference = mean_diff,
    `95% CI` = sprintf("[%.3f, %.3f]", ci_lower, ci_upper),
    `Cohen's d` = round(effect_size, 3),
    `p-value` = round(p_value, 3)
  )

kable(full_stats, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped", "scale_down")) %>%
  landscape()

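Complete pairwise comparison results for all metrics. The interval column is computed as difference ± 1.96·sqrt(SD1² + SD2²) from run-level standard deviations rather than standard errors, so it is wider than a conventional 95% confidence interval for the difference in means and should not be read against the reported p-values.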
Metric	Comparison	Group 1 Mean (SD)	Group 2 Mean (SD)	Difference	95% CI	Cohen's d	p-value
Share	mechanical_baseline vs standard_llm	0.583 (0.015)	0.553 (0.018)	-0.030	[-0.076, 0.016]	-1.82	0.041
Share	mechanical_baseline vs memory_llm	0.583 (0.015)	0.554 (0.021)	-0.029	[-0.080, 0.022]	-1.60	0.048
Share	standard_llm vs memory_llm	0.553 (0.018)	0.554 (0.021)	0.001	[-0.053, 0.055]	0.05	0.912
Clusters	mechanical_baseline vs standard_llm	4.200 (0.800)	5.100 (1.200)	0.900	[-1.927, 3.727]	0.88	0.132
Clusters	mechanical_baseline vs memory_llm	4.200 (0.800)	5.300 (1.300)	1.100	[-1.892, 4.092]	1.02	0.095
Clusters	standard_llm vs memory_llm	5.100 (1.200)	5.300 (1.300)	0.200	[-3.268, 3.668]	0.16	0.742
Distance	mechanical_baseline vs standard_llm	5.890 (0.420)	5.420 (0.510)	-0.470	[-1.765, 0.825]	-1.01	0.089
Distance	mechanical_baseline vs memory_llm	5.890 (0.420)	5.380 (0.480)	-0.510	[-1.760, 0.740]	-1.12	0.072
Distance	standard_llm vs memory_llm	5.420 (0.510)	5.380 (0.480)	-0.040	[-1.413, 1.333]	-0.08	0.882
Ghetto Rate	mechanical_baseline vs standard_llm	0.260 (0.080)	0.180 (0.060)	-0.080	[-0.276, 0.116]	-1.13	0.068
Ghetto Rate	mechanical_baseline vs memory_llm	0.260 (0.080)	0.120 (0.050)	-0.140	[-0.325, 0.045]	-2.10	0.018
Ghetto Rate	standard_llm vs memory_llm	0.180 (0.060)	0.120 (0.050)	-0.060	[-0.213, 0.093]	-1.09	0.074
Mix Deviation	mechanical_baseline vs standard_llm	1.420 (0.180)	1.280 (0.210)	-0.140	[-0.682, 0.402]	-0.72	0.216
Mix Deviation	mechanical_baseline vs memory_llm	1.420 (0.180)	1.240 (0.190)	-0.180	[-0.693, 0.333]	-0.97	0.098
Mix Deviation	standard_llm vs memory_llm	1.280 (0.210)	1.240 (0.190)	-0.040	[-0.595, 0.515]	-0.20	0.688
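The interval column in the table is built from run-level standard deviations. For comparison, a standard-error-based interval for a difference in means could be sketched as below. The group sizes are assumptions: 10 runs per group is used as a default, which may not hold for the mechanical baseline if only converged runs entered the summary, and the normal quantile 1.96 is a rough approximation at this sample size.

```r
# Confidence interval for a difference in means from summary statistics.
# n1 and n2 (runs per group) are assumptions, not values taken from the analysis.
diff_ci <- function(mean1, sd1, mean2, sd2, n1 = 10, n2 = 10, z = 1.96) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)  # standard error of the difference
  diff <- mean2 - mean1
  c(lower = diff - z * se, upper = diff + z * se)
}

# Share, mechanical_baseline vs standard_llm, using the rounded table values
ci_share <- diff_ci(0.583, 0.015, 0.553, 0.018)
round(ci_share, 3)
```

Under these assumptions the interval for the Share comparison excludes zero, which is in line with the reported p-value of 0.041. A t quantile with appropriate degrees of freedom would give a somewhat wider interval.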