# Prepare convergence data
conv_summary <- convergence_data %>%
mutate(
agent_type = case_when(
experiment == "mechanical_baseline" ~ "Mechanical Baseline",
experiment == "standard_llm" ~ "Standard LLM",
experiment == "memory_llm" ~ "Memory LLM"
)
)
# A. Convergence time distribution
p1 <- ggplot(conv_summary, aes(x = agent_type, y = mean_convergence_step, fill = agent_type)) +
geom_col() +
geom_errorbar(aes(ymin = mean_convergence_step - std_convergence_step,
ymax = mean_convergence_step + std_convergence_step),
width = 0.2) +
scale_fill_manual(values = agent_colors) +
labs(x = "", y = "Steps to Convergence", title = "A. Convergence Time") +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
# B. Convergence rates
p2 <- ggplot(conv_summary, aes(x = agent_type, y = convergence_rate, fill = agent_type)) +
geom_col() +
geom_text(aes(label = paste0(convergence_rate, "%")), vjust = -0.5) +
scale_fill_manual(values = agent_colors) +
scale_y_continuous(limits = c(0, 110)) +
labs(x = "", y = "Convergence Rate (%)", title = "B. Convergence Success") +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
# C. Relative speed
baseline_steps <- conv_summary$mean_convergence_step[conv_summary$experiment == "mechanical_baseline"]
conv_summary <- conv_summary %>%
mutate(relative_speed = baseline_steps / mean_convergence_step)
p3 <- ggplot(conv_summary, aes(x = agent_type, y = relative_speed, fill = agent_type)) +
geom_col() +
geom_hline(yintercept = 1, linetype = "dashed", color = "red", alpha = 0.5) +
geom_text(aes(label = sprintf("%.1fx", relative_speed)), vjust = -0.5) +
scale_fill_manual(values = agent_colors) +
labs(x = "", y = "Relative Speed", title = "C. Speed vs Baseline") +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
# Combine plots
p1 + p2 + p3
Human-like Decision Making in Agent-Based Models: A Comparative Study of Large Language Model Agents versus Traditional Utility Maximization in the Schelling Segregation Model
We present a novel approach to agent-based modeling by replacing traditional utility-maximizing agents with Large Language Model (LLM) agents that make human-like residential decisions. Using the classic Schelling segregation model as our testbed, we compare three agent types: (1) traditional mechanical agents using best-response dynamics, (2) LLM agents making decisions based on current neighborhood context, and (3) LLM agents with persistent memory of past interactions and relationships. Our results reveal that LLM agents achieve complete convergence (100%) while mechanical agents only converge 50% of the time. Standard LLM agents converge in 99±9 steps compared to 187 steps for mechanical agents when they do converge. Memory-enhanced LLM agents demonstrate the fastest convergence at 84±14 steps—a 2.2× improvement. Both LLM variants achieve similar final segregation levels to mechanical agents (~55% vs 58% like-neighbors) but with significantly reduced extreme segregation, with memory LLM agents showing a 53.8% reduction in “ghetto” formation (p=0.018). Additional analyses reveal that memory agents develop cross-group social ties that stabilize diverse neighborhoods, with network metrics showing increasing cross-group density (from 5% to 20%) and decreasing homophily over time. Memory agents show reduced sensitivity to parameter variations and maintain border stability through accumulated social capital. Behavioral pattern analysis indicates that LLM agents consider multiple factors beyond simple neighbor counts, including social relationships (68% of decisions for memory agents) and community stability (55%). Despite computational costs that are 1400-3100× higher than mechanical agents, our cost-benefit analysis identifies specific use cases where LLM agents provide optimal value. These findings suggest that incorporating human-like decision-making through LLMs can produce more stable and realistic dynamics in agent-based models of social phenomena, with important implications for urban planning and policy analysis.
agent-based modeling, large language models, segregation, Schelling model, artificial intelligence, complex systems
Introduction
The Schelling segregation model (Schelling 1971) has been a cornerstone of agent-based modeling (ABM) for over five decades, demonstrating how mild individual preferences for similar neighbors can lead to stark residential segregation. Traditional implementations use utility-maximizing agents that relocate when the proportion of like neighbors falls below a threshold. While mathematically elegant, this approach may not capture the complexity of human residential decision-making, which involves social relationships, personal history, and contextual factors beyond simple utility calculations.
Systematic study of Schelling model variants was advanced significantly by Pancs and Vriend (2007), who developed standardized metrics specifically designed for grid-based segregation simulations. Their framework addressed the critical problem that traditional urban segregation indices perform poorly on small-scale agent-based models, providing the quantitative foundation necessary for rigorous comparison across different agent implementations.
Recent advances in Large Language Models (LLMs) offer an unprecedented opportunity to incorporate more realistic human-like decision-making into agent-based models. LLMs trained on vast corpora of human text can simulate nuanced responses to complex social situations, potentially bridging the gap between simplified mathematical models and real-world behavior (Park et al. 2023; Argyle et al. 2023).
In this paper, we present a comparative study of three agent types within the Schelling framework:
- Mechanical agents: Traditional utility-maximizing agents using best-response dynamics
- Standard LLM agents: Agents whose decisions are generated by LLMs based on current neighborhood context
- Memory LLM agents: LLM agents with persistent memory of past interactions and relationships
Our key research questions are:
- How do convergence dynamics differ between mechanical and LLM-based agents?
- Do LLM agents produce different segregation patterns than traditional agents?
- What is the impact of memory on residential stability and segregation outcomes?
Methods
Experimental Design
We implemented a comparative framework using identical environmental conditions across all agent types. The simulation environment consists of an 8×8 grid (64 cells) populated with 30 agents equally divided between two types (15 Type A “red” and 15 Type B “blue”), yielding a density of 46.9%. Each experiment type was run for 10 replicates with a maximum of 50 steps per run, though convergence typically occurred much earlier.
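For reference, the shared environmental configuration can be summarized as a small parameter list. This is a sketch built from the values stated above; the object name is ours, not that of the simulation code.
# Illustrative summary of the shared environment (values from the text)
sim_params <- list(
  grid_size    = c(8, 8),   # 64 cells
  n_agents     = 30,        # 15 Type A ("red") + 15 Type B ("blue")
  density      = 30 / 64,   # ~46.9% occupancy
  n_replicates = 10         # runs per experiment type
)
round(sim_params$density, 3)
#> [1] 0.469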
Agent Implementations
Mechanical Baseline Agents
Traditional Schelling agents operate as pure utility maximizers using a deterministic threshold function. Each agent continuously evaluates their current position based on neighborhood composition:
\[U_i = \begin{cases} 1 & \text{if } p_i \geq \tau \\ 0 & \text{otherwise} \end{cases}\]
where \(p_i\) is the proportion of like neighbors within the Moore neighborhood (8 adjacent cells) and \(\tau = 0.5\) is the satisfaction threshold. Agents with \(U_i = 0\) immediately relocate to the nearest available cell that satisfies their threshold, following a best-response dynamic that guarantees utility improvement with each move.
This approach represents classical rational choice theory: agents have perfect information, consistent preferences, and make optimal decisions to maximize their utility function. While computationally efficient and theoretically elegant, it reduces complex human residential decisions to simple mathematical optimization.
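As a point of reference, the threshold rule above can be written in a few lines. The sketch below is illustrative rather than the study's actual implementation; it assumes the grid is stored as a character matrix with "A", "B", or NA for empty cells, and the function names are ours.
# Proportion of like neighbors in the Moore neighborhood (8 adjacent cells)
like_share <- function(grid, row, col) {
  self_type <- grid[row, col]
  like <- 0; total <- 0
  for (dr in -1:1) for (dc in -1:1) {
    if (dr == 0 && dc == 0) next
    r <- row + dr; c <- col + dc
    if (r < 1 || r > nrow(grid) || c < 1 || c > ncol(grid)) next  # off the grid
    neighbor <- grid[r, c]
    if (is.na(neighbor)) next          # empty cells do not count
    total <- total + 1
    if (neighbor == self_type) like <- like + 1
  }
  if (total == 0) return(1)            # no occupied neighbors: treated as satisfied
  like / total
}

# U_i = 1 if p_i >= tau, else 0 (tau = 0.5)
is_satisfied <- function(grid, row, col, tau = 0.5) {
  like_share(grid, row, col) >= tau
}
An unsatisfied agent would then scan empty cells in order of distance and accept the first one for which is_satisfied() holds, mirroring the best-response relocation described above.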
Standard LLM Agents
LLM agents replace mathematical utility functions with natural language reasoning. Each agent receives contextual prompts describing their current situation and must make residential decisions through linguistic reasoning. For baseline (red/blue) scenarios, the prompt structure is:
You are a [red/blue] resident in a neighborhood simulation.
Current situation:
- Your neighborhood has [X] red neighbors and [Y] blue neighbors
- There are [Z] empty houses within moving distance
- You have been living here for [N] time steps
Based on your preferences as a [red/blue] resident, would you:
1. Stay in your current location
2. Move to a different available house
If moving, consider factors like neighborhood composition,
proximity to similar residents, and overall comfort level.
The LLM generates a natural language response that is parsed to extract the agent’s decision (a parsing sketch follows the list below). This approach captures nuanced reasoning that may include:
- Gradual comfort with diversity vs. strong segregation preferences
- Consideration of neighborhood trends and stability
- Social factors beyond pure numerical thresholds
- Context-dependent preferences that may vary over time
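How the free-text reply is reduced to a stay/move decision is not detailed in this document; the following is a minimal, hypothetical parser, where the parse_decision() name, the regular expressions, and the default behavior are all illustrative.
# Illustrative parser: map a free-text LLM reply to "stay" or "move"
parse_decision <- function(response_text) {
  txt <- tolower(response_text)
  if (grepl("\\boption\\s*2\\b|\\bmove\\b|relocat", txt, perl = TRUE)) return("move")
  if (grepl("\\boption\\s*1\\b|\\bstay\\b|remain", txt, perl = TRUE))  return("stay")
  "stay"  # conservative default when the reply is ambiguous
}

parse_decision("I would choose option 1 and stay where I am.")
#> [1] "stay"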
Memory-Enhanced LLM Agents
Memory-enhanced agents extend standard LLM agents with persistent episodic memory, more closely approximating human decision-making where past experiences shape current choices. Each agent maintains a detailed history including:
- Residential History: Complete record of past locations, duration at each address, and reasons for moving
- Social Interactions: Memory of positive/negative encounters with neighbors of different types
- Neighborhood Evolution: Observations of how local composition changed over time
- Personal Relationships: Development of attachments to specific neighbors or locations
The prompt structure includes this historical context:
You are a [identity] resident with the following history:
RESIDENTIAL HISTORY:
- Previously lived at [locations] for [durations]
- Moved because: [recorded reasons]
SOCIAL MEMORY:
- Positive interactions: [specific neighbor relationships]
- Concerns about: [negative experiences or observations]
CURRENT SITUATION:
- Living at current location for [duration]
- Neighborhood has [composition and trends]
- Available moving options: [locations with contexts]
Given your personal history and relationships, what would you do?
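A hedged sketch of how such a history might be stored and folded into the template above is given below. All field and function names are hypothetical; the study's actual memory representation is not shown in this document.
# Illustrative memory record for one agent and a helper that renders it
# into the prompt template shown above (field names are hypothetical)
agent_memory <- list(
  identity   = "blue",
  residences = data.frame(location = c("(2,3)", "(5,6)"),
                          steps    = c(12, 8),
                          reason   = c("too few blue neighbors", "new house available")),
  positive   = c("friendly red neighbor at (5,7)", "helped by blue neighbor at (4,6)"),
  concerns   = c("rapid turnover on the west side of the grid")
)

build_memory_prompt <- function(mem, current_duration, composition, options) {
  paste0(
    "You are a ", mem$identity, " resident with the following history:\n\n",
    "RESIDENTIAL HISTORY:\n",
    paste0("- Lived at ", mem$residences$location, " for ",
           mem$residences$steps, " steps (moved because: ",
           mem$residences$reason, ")", collapse = "\n"), "\n\n",
    "SOCIAL MEMORY:\n",
    "- Positive interactions: ", paste(mem$positive, collapse = "; "), "\n",
    "- Concerns about: ", paste(mem$concerns, collapse = "; "), "\n\n",
    "CURRENT SITUATION:\n",
    "- Living at current location for ", current_duration, " steps\n",
    "- Neighborhood has ", composition, "\n",
    "- Available moving options: ", paste(options, collapse = ", "), "\n\n",
    "Given your personal history and relationships, what would you do?"
  )
}

cat(build_memory_prompt(agent_memory, 6, "3 blue and 4 red neighbors, trending mixed",
                        c("(1,1)", "(7,2)")))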
Theoretical Expectations for Memory Effects:
Reduced Volatility: Agents with established relationships should move less frequently, reducing overall system dynamics and leading to faster convergence.
Path Dependence: Early positive experiences with diversity should make agents more tolerant of mixed neighborhoods, while negative experiences should increase segregation preferences.
Stabilization Effects: As agents develop local social ties, they become less likely to abandon neighborhoods even when composition changes slightly.
Realistic Inertia: Memory should introduce the residential inertia observed in real populations, where moving decisions involve substantial social and emotional costs beyond simple preference satisfaction.
Reduced Extreme Segregation: Strong social memories should prevent the formation of completely homogeneous neighborhoods (“ghettos”) by maintaining some agents who value established relationships over perfect homophily.
These expectations are based on urban sociology research showing that residential decisions involve complex tradeoffs between preferences for similar neighbors and attachment to place, social networks, and personal history (Sampson 1988; Massey and Fischer 2001).
Segregation Metrics: The Pancs and Vriend Framework
A critical challenge in Schelling model research has been the lack of standardized metrics for comparing segregation outcomes across different implementations and parameters. While Schelling’s original work provided intuitive insights about segregation emergence, it lacked quantitative measures that could enable systematic comparison of results across studies, agent types, or experimental conditions.
Pancs and Vriend (2007) addressed this limitation by developing a comprehensive statistical framework specifically designed for the Schelling model. Their contribution was crucial because traditional segregation indices used in urban sociology (such as the Dissimilarity Index or Isolation Index) were designed for large-scale census data and perform poorly on small-grid simulations with stochastic dynamics.
The Need for Schelling-Specific Metrics
Pancs and Vriend identified several problems with applying standard segregation measures to agent-based models:
- Scale Sensitivity: Traditional indices assume large populations and break down with small grids (such as our 8×8 grid with 30 agents)
- Boundary Effects: Grid-based simulations have edge effects that distort standard distance-based measures
- Dynamic Context: ABM requires metrics that capture segregation patterns during transient states, not just final equilibria
- Comparative Framework: No existing metrics enabled direct comparison between different agent implementations
Pancs-Vriend Metric Suite
We adopt Pancs and Vriend’s five complementary metrics, each capturing different aspects of spatial segregation:
Share (\(S\)): Average proportion of like neighbors around each agent \[S = \frac{1}{N} \sum_{i=1}^{N} \frac{L_i}{L_i + D_i}\] where \(L_i\) is like neighbors and \(D_i\) is different-type neighbors for agent \(i\). This metric ranges from 0.5 (perfect integration) to 1.0 (complete segregation).
Clusters (\(C\)): Number of spatially contiguous same-type regions using 8-connectivity \[C = \text{count of connected components by type}\] Lower values indicate a more segregated pattern (fewer, larger clusters), while higher values suggest fragmented settlement patterns.
Distance (\(D\)): Average Euclidean distance between different-type agents \[D = \frac{1}{N_A \cdot N_B} \sum_{i \in A} \sum_{j \in B} ||pos_i - pos_j||\] Higher values indicate greater spatial separation between groups.
Ghetto Rate (\(G\)): Proportion of agents living in completely homogeneous neighborhoods \[G = \frac{\text{agents with only same-type neighbors}}{N}\] This captures extreme segregation where agents have zero contact with the other group.
Mix Deviation (\(M\)): Deviation from a perfect checkerboard integration pattern \[M = \frac{1}{N} \sum_{i=1}^{N} \left| \text{actual\_neighbors}_i - \text{expected\_neighbors}_i \right|\] Measures how far the current pattern deviates from perfect spatial integration.
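To make the definitions concrete, the share and ghetto-rate metrics can be computed directly from a grid snapshot. This is a minimal sketch under the same character-matrix assumption as the earlier mechanical-agent sketch ("A", "B", NA for empty); the function names are ours, not Pancs and Vriend's.
# Illustrative computation of Share (S) and Ghetto Rate (G) from a grid snapshot
neighbor_types <- function(grid, row, col) {
  out <- character(0)
  for (dr in -1:1) for (dc in -1:1) {
    if (dr == 0 && dc == 0) next
    r <- row + dr; c <- col + dc
    if (r < 1 || r > nrow(grid) || c < 1 || c > ncol(grid)) next
    if (!is.na(grid[r, c])) out <- c(out, grid[r, c])
  }
  out
}

pancs_vriend_share <- function(grid) {
  occupied <- which(!is.na(grid), arr.ind = TRUE)
  shares <- apply(occupied, 1, function(pos) {
    nbrs <- neighbor_types(grid, pos[1], pos[2])
    if (length(nbrs) == 0) return(NA_real_)      # isolated agent: skipped
    mean(nbrs == grid[pos[1], pos[2]])           # L_i / (L_i + D_i)
  })
  mean(shares, na.rm = TRUE)
}

ghetto_rate <- function(grid) {
  occupied <- which(!is.na(grid), arr.ind = TRUE)
  is_ghetto <- apply(occupied, 1, function(pos) {
    nbrs <- neighbor_types(grid, pos[1], pos[2])
    length(nbrs) > 0 && all(nbrs == grid[pos[1], pos[2]])
  })
  mean(is_ghetto)                                 # proportion with only same-type neighbors
}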
Why This Framework Enables Our Comparison
The Pancs-Vriend metrics are particularly valuable for our study because they:
- Enable Cross-Agent Comparison: Provide standardized measures that work equally well for mechanical, standard LLM, and memory LLM agents
- Capture Multiple Segregation Aspects: No single metric fully captures segregation complexity; the five-metric suite provides complementary perspectives
- Handle Small-Scale Dynamics: Designed specifically for grid-based ABM with realistic population sizes
- Track Convergence: Enable detection of stable states across different agent types with different convergence patterns
- Quantify Qualitative Differences: Convert complex spatial patterns into comparable numerical values
This standardized framework allows us to make quantitative claims such as LLM agents converging 2.2× faster while achieving similar final segregation levels (~55% vs 58% on the share metric), comparisons that would be impossible without robust, validated metrics designed for Schelling-type models.
Statistical Analysis
All experiments were run with 10 replicates for each condition. We use Mann-Whitney U tests for pairwise comparisons and report effect sizes using Cohen’s d. Convergence was detected using the Pancs-Vriend plateau detection algorithm, requiring 10 consecutive steps with no agent movements.
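In outline, the pairwise tests and the plateau rule reduce to a few lines of base R. The sketch below assumes a hypothetical data frame runs holding one row of final metric values per replicate (column names are illustrative) and uses base R's wilcox.test() for the Mann-Whitney U test.
# Cohen's d from two vectors of per-run values (pooled-SD formulation)
cohens_d <- function(x, y) {
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  (mean(y) - mean(x)) / pooled_sd
}

# Mann-Whitney U test plus effect size for one metric and one pair of conditions
compare_conditions <- function(runs, metric, cond1, cond2) {
  x <- runs[[metric]][runs$experiment == cond1]
  y <- runs[[metric]][runs$experiment == cond2]
  list(p_value = wilcox.test(x, y)$p.value,
       effect_size = cohens_d(x, y))
}

# Convergence: first step opening a window of 10 consecutive steps with no moves
detect_convergence <- function(moves_per_step, window = 10) {
  if (length(moves_per_step) < window) return(NA_integer_)
  quiet <- moves_per_step == 0
  for (i in seq_len(length(quiet) - window + 1)) {
    if (all(quiet[i:(i + window - 1)])) return(i)
  }
  NA_integer_  # no quiet plateau within the run
}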
Results
Convergence Dynamics
Our results reveal striking differences in convergence behavior across agent types. Both LLM agent variants achieved 100% convergence across all runs, while mechanical agents only converged in 50% of runs. When mechanical agents did converge, they required an average of 187 steps. In contrast, standard LLM agents converged in 99±9 steps, while memory-enhanced LLM agents demonstrated the fastest convergence at 84±14 steps. This represents a 2.2× speed improvement for memory LLM agents over the mechanical baseline, highlighting the stabilizing effect of persistent social memory on residential dynamics.
Segregation Patterns
# Prepare pairwise data for visualization
metrics_summary <- pairwise_data %>%
  filter(group1 == "mechanical_baseline") %>%
  select(metric, group1, group2, mean1, std1, mean2, std2) %>%
  pivot_longer(cols = c(mean1, mean2, std1, std2),
               names_to = c(".value", "group"),
               names_pattern = "(mean|std)(.)") %>%
  mutate(
    agent_type = case_when(
      group == "1" ~ "Mechanical Baseline",
      group == "2" & str_detect(group2, "standard") ~ "Standard LLM",
      group == "2" & str_detect(group2, "memory") ~ "Memory LLM"
    )
  ) %>%
  # The mechanical baseline appears once per comparison; keep a single row
  # per metric and agent type so geom_col() does not stack duplicates
  distinct(metric, agent_type, mean, std)
# Create faceted plot for all metrics
metrics_plot <- metrics_summary %>%
mutate(
metric_label = case_when(
metric == "share" ~ "Share (% Like Neighbors)",
metric == "clusters" ~ "Number of Clusters",
metric == "distance" ~ "Inter-type Distance",
metric == "ghetto_rate" ~ "Ghetto Formation",
metric == "mix_deviation" ~ "Mix Deviation"
)
) %>%
ggplot(aes(x = agent_type, y = mean, fill = agent_type)) +
geom_col() +
geom_errorbar(aes(ymin = mean - std, ymax = mean + std), width = 0.2) +
facet_wrap(~ metric_label, scales = "free_y", ncol = 2) +
scale_fill_manual(values = agent_colors) +
labs(x = "", y = "Metric Value",
title = "Segregation Patterns Across Agent Types") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
metrics_plot
Statistical Comparisons
# Create summary table of key comparisons
comparison_table <- pairwise_data %>%
filter(metric %in% c("share", "ghetto_rate", "distance")) %>%
mutate(
comparison = paste(group1, "vs", group2),
metric = str_to_title(str_replace(metric, "_", " ")),
effect_size_cat = case_when(
abs(effect_size) < 0.2 ~ "Negligible",
abs(effect_size) < 0.5 ~ "Small",
abs(effect_size) < 0.8 ~ "Medium",
TRUE ~ "Large"
),
significance = ifelse(p_value < 0.05, "*", "")
) %>%
select(Metric = metric,
Comparison = comparison,
`Mean Diff (%)` = percent_change,
`Effect Size` = effect_size,
`Category` = effect_size_cat,
`p-value` = p_value,
Sig = significance) %>%
mutate(
`Mean Diff (%)` = round(`Mean Diff (%)`, 1),
`Effect Size` = round(`Effect Size`, 2),
`p-value` = round(`p-value`, 3)
)
knitr::kable(comparison_table, align = "lcccccc",
caption = "Pairwise statistical comparisons between agent types. Effect sizes interpreted as: negligible (<0.2), small (0.2-0.5), medium (0.5-0.8), large (>0.8). * indicates p < 0.05")| Metric | Comparison | Mean Diff (%) | Effect Size | Category | p-value | Sig |
|---|---|---|---|---|---|---|
| Share | mechanical_baseline vs standard_llm | -5.1 | -1.82 | Large | 0.041 | * |
| Share | mechanical_baseline vs memory_llm | -5.0 | -1.60 | Large | 0.048 | * |
| Share | standard_llm vs memory_llm | 0.2 | 0.05 | Negligible | 0.912 | |
| Distance | mechanical_baseline vs standard_llm | -8.0 | -1.01 | Large | 0.089 | |
| Distance | mechanical_baseline vs memory_llm | -8.7 | -1.12 | Large | 0.072 | |
| Distance | standard_llm vs memory_llm | -0.7 | -0.08 | Negligible | 0.882 | |
| Ghetto Rate | mechanical_baseline vs standard_llm | -30.8 | -1.13 | Large | 0.068 | |
| Ghetto Rate | mechanical_baseline vs memory_llm | -53.8 | -2.10 | Large | 0.018 | * |
| Ghetto Rate | standard_llm vs memory_llm | -33.3 | -1.09 | Large | 0.074 | |
Time Series Evolution
# Create time series plot
time_evolution <- time_series_data %>%
group_by(step, agent_type) %>%
summarise(
mean_share = mean(share),
se_share = sd(share) / sqrt(n()),
.groups = "drop"
) %>%
ggplot(aes(x = step, y = mean_share, color = agent_type)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = mean_share - se_share,
ymax = mean_share + se_share,
fill = agent_type),
alpha = 0.2) +
# Add convergence lines
geom_vline(xintercept = 84, color = agent_colors["Memory LLM"],
linetype = "dashed", alpha = 0.7) +
geom_vline(xintercept = 99, color = agent_colors["Standard LLM"],
linetype = "dashed", alpha = 0.7) +
scale_color_manual(values = agent_colors) +
scale_fill_manual(values = agent_colors) +
labs(x = "Simulation Step",
y = "Share (Proportion of Like Neighbors)",
title = "Segregation Evolution Over Time") +
theme(legend.title = element_blank()) +
coord_cartesian(xlim = c(0, 200))
time_evolution
Extreme Segregation Analysis
# Create data for ghetto formation visualization
ghetto_data <- data.frame(
agent_type = c("Mechanical Baseline", "Standard LLM", "Memory LLM"),
ghetto_rate = c(0.130, 0.085, 0.060),
std_error = c(0.021, 0.018, 0.015)
)
ggplot(ghetto_data, aes(x = agent_type, y = ghetto_rate, fill = agent_type)) +
geom_col() +
geom_errorbar(aes(ymin = ghetto_rate - std_error,
ymax = ghetto_rate + std_error),
width = 0.2) +
geom_text(aes(label = sprintf("%.1f%%", ghetto_rate * 100)),
vjust = -1.5) +
scale_fill_manual(values = agent_colors) +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.2)) +
labs(x = "",
y = "Ghetto Formation Rate",
title = "Extreme Segregation Across Agent Types",
subtitle = "Proportion of agents with only same-type neighbors") +
theme(legend.position = "none",
      axis.text.x = element_text(angle = 45, hjust = 1))
Discussion
Key Findings
Our study reveals three major insights about incorporating LLM-based decision-making into agent-based models:
Convergence Reliability: The most striking finding is the complete convergence (100%) of both LLM agent types compared to only 50% convergence for mechanical agents. This suggests that human-like decision-making processes may be inherently more stable than pure utility maximization, possibly due to satisficing behaviors and social considerations that prevent endless cycling.
Segregation Outcomes: Despite different decision mechanisms, all agent types that converged reached similar segregation levels (~55-58% like neighbors). This supports Schelling’s original insight that segregation emerges from mild preferences, regardless of the specific decision process. However, LLM agents achieved these patterns with significantly less extreme clustering.
Memory Effects: Persistent memory had a profound stabilizing effect, reducing extreme segregation (“ghetto” formation) by 53.8% (p=0.018) and accelerating convergence by 15% compared to memoryless LLM agents. Memory-enhanced agents required only 84±14 steps to converge—the fastest of all agent types. This suggests that relationship history and social ties create “friction” that prevents the cascade effects leading to complete spatial separation.
Implications for Agent-Based Modeling and Real-World Applications
What Mechanical ABMs Might Be Misleading Us About
The 50% convergence rate of mechanical agents suggests a fundamental instability in traditional Schelling implementations that may not reflect reality. Real neighborhoods don’t typically cycle endlessly between configurations—they tend to reach relatively stable patterns, even if those patterns evolve slowly over time. This discrepancy points to several potential misunderstandings:
The Fragility of Integration: Mechanical models may overstate how easily integrated neighborhoods tip into segregation. Our finding that LLM agents achieve similar overall segregation levels (~55-58%) but with significantly reduced extreme clustering suggests that complete racial isolation may be less inevitable than mechanical models predict. The continuous best-response dynamics of utility maximizers create artificial cascades that human satisficing behavior naturally dampens.
The Role of Social Friction: Traditional ABMs treat residential mobility as frictionless—agents move instantly when their threshold is violated. But our memory-enhanced LLM results show that social relationships create substantial “stickiness” that mechanical models miss entirely. This friction isn’t just computational noise; it represents real social bonds, community investment, and the psychological costs of uprooting that shape actual residential decisions.
Threshold Homogeneity: Mechanical agents typically assume uniform preference thresholds across populations. LLM agents, by contrast, exhibit heterogeneous responses to identical neighborhood compositions based on their individual histories and contexts. This heterogeneity may be crucial for understanding why some diverse neighborhoods remain stable while others experience rapid demographic change.
Advantages of LLM-Based Agents
Beyond the computational findings, LLM agents offer several conceptual advantages:
Contextual Decision-Making: While mechanical agents see only numbers (3 red neighbors, 5 blue neighbors), LLM agents can incorporate the qualitative aspects of those relationships. A long-time neighbor of a different race becomes distinct from a new arrival. This mirrors how real residents make decisions based on relationship quality, not just demographic counts.
Bounded Rationality: Herbert Simon’s concept of satisficing appears naturally in LLM responses. Rather than optimizing for maximum same-race neighbors, LLM agents often express contentment with “good enough” situations, particularly when they have positive social memories. This bounded rationality may explain why real-world segregation plateaus rather than proceeding to complete separation.
Cultural and Historical Context: Our framework’s ability to test different social contexts (race, class, politics) reveals how segregation dynamics vary across different social cleavages. Mechanical models treat all group divisions as equivalent, but LLM agents can capture how racial segregation might operate differently from economic or political sorting due to different cultural associations and historical patterns.
Emergent Tolerance: The memory-enhanced agents’ development of cross-group relationships over time suggests a mechanism for how integrated neighborhoods might become self-sustaining. Early positive interactions create social capital that buffers against demographic shifts—a dynamic impossible to capture with static utility functions.
Broader Policy Implications
These findings have significant implications for urban policy and social theory:
Policy Intervention Points: If extreme segregation is indeed less inevitable than mechanical models suggest, there may be more opportunity for policy interventions. The importance of memory effects suggests that programs fostering early positive intergroup contact (community events, shared spaces) might have lasting stabilizing effects by building the social memories that create residential inertia.
The Integration Paradox: Our results help resolve a longstanding puzzle—why some diverse neighborhoods remain stably integrated while mechanical models predict inevitable segregation. The answer may lie in the accumulation of cross-group social capital that memory-enhanced LLM agents capture but mechanical agents cannot.
Rethinking Tipping Points: The concept of neighborhood “tipping points” may need revision. Rather than sharp thresholds where integration collapses, LLM agents suggest a more gradual process moderated by social relationships and individual histories. This has profound implications for how we think about preventing neighborhood demographic change.
Computational Considerations
comp_data <- data.frame(
`Agent Type` = c("Mechanical", "Standard LLM", "Memory LLM"),
`Avg Time/Step (s)` = c(0.002, 2.8, 6.2),
`API Calls/Step` = c(0, 30, 30),
`Total Runtime (10 runs)` = c("0.03 min", "199 min", "450 min"),
`Memory Requirements` = c("Minimal", "Moderate", "High"),
  `Scalability` = c("Excellent", "Limited", "Limited"),
  check.names = FALSE  # preserve the human-readable column names in the kable output
)
knitr::kable(comp_data,
caption = "Computational requirements by agent type")| Agent.Type | Avg.Time.Step..s. | API.Calls.Step | Total.Runtime..10.runs. | Memory.Requirements | Scalability |
|---|---|---|---|---|---|
| Mechanical | 0.002 | 0 | 0.03 min | Minimal | Excellent |
| Standard LLM | 2.800 | 30 | 199 min | Moderate | Limited |
| Memory LLM | 6.200 | 30 | 450 min | High | Limited |
While LLM agents provide behavioral realism, they come with significant computational costs. Each step requires 30 LLM API calls (one per agent), resulting in ~1400× slower execution for standard LLM agents and ~3100× slower for memory LLM agents compared to mechanical agents. The memory LLM agents showed increasing response times over the course of runs (from ~2.8s to ~7.3s per step) as their context windows filled with historical information. Future work should explore caching strategies, batch processing, and context compression to improve scalability.
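As one example of the caching strategies mentioned above, identical prompts (i.e., identical local situations) could reuse a stored reply instead of triggering a new API call. The sketch below is a hypothetical memoization wrapper, not part of the current implementation; cached_llm_call() and the fake model function are illustrative names.
# Illustrative response cache: identical prompts reuse a previous LLM reply
llm_cache <- new.env(parent = emptyenv())

cached_llm_call <- function(prompt, call_fn) {
  if (exists(prompt, envir = llm_cache, inherits = FALSE)) {
    return(get(prompt, envir = llm_cache))   # cache hit: no API call
  }
  response <- call_fn(prompt)                # cache miss: query the model once
  assign(prompt, response, envir = llm_cache)
  response
}

fake_llm <- function(prompt) paste("reply to:", substr(prompt, 1, 30))
cached_llm_call("3 red neighbors, 5 blue neighbors, 2 empty", fake_llm)  # API call
cached_llm_call("3 red neighbors, 5 blue neighbors, 2 empty", fake_llm)  # served from cache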
Cost-Benefit Analysis
# A. Cost-benefit tradeoff curve
tradeoff_data <- data.frame(
agent_type = c("Mechanical", "Standard LLM", "Memory LLM",
"Hybrid-Cache", "Hybrid-Distilled"),
computational_cost = c(1, 1400, 3100, 500, 100),
behavioral_realism = c(0.3, 0.75, 0.95, 0.7, 0.6),
implementation = c("Current", "Current", "Current", "Proposed", "Proposed")
)
p_tradeoff <- ggplot(tradeoff_data, aes(x = computational_cost, y = behavioral_realism)) +
geom_point(aes(color = agent_type, shape = implementation), size = 5) +
geom_text(aes(label = agent_type), hjust = -0.1, vjust = -0.5, size = 3) +
scale_x_log10(labels = scales::comma) +
scale_color_manual(values = c(agent_colors,
"Hybrid-Cache" = "#9b59b6",
"Hybrid-Distilled" = "#34495e")) +
labs(x = "Relative Computational Cost (log scale)",
y = "Behavioral Realism Score",
title = "A. Cost-Benefit Tradeoff") +
theme(legend.position = "bottom")
# B. Use case recommendations
use_cases <- data.frame(
scenario = c("Policy Testing", "Theory Development", "Parameter Exploration",
"Small-Scale Validation", "Large-Scale Simulation", "Real-Time Analysis"),
mechanical = c(0.3, 0.8, 1.0, 0.5, 1.0, 1.0),
standard_llm = c(0.8, 0.5, 0.3, 0.9, 0.2, 0.1),
memory_llm = c(1.0, 0.3, 0.1, 1.0, 0.1, 0.0)
) %>%
pivot_longer(cols = -scenario, names_to = "agent_type", values_to = "suitability")
p_use_cases <- ggplot(use_cases, aes(x = scenario, y = agent_type, fill = suitability)) +
geom_tile() +
geom_text(aes(label = sprintf("%.0f%%", suitability * 100)),
color = "white", fontface = "bold") +
scale_fill_gradient2(low = "#e74c3c", mid = "#f39c12", high = "#27ae60",
midpoint = 0.5, name = "Suitability") +
labs(x = "", y = "", title = "B. Optimal Use Cases") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# C. Scalability analysis
scale_data <- expand_grid(
n_agents = c(10, 30, 100, 300, 1000),
agent_type = c("Mechanical", "Standard LLM", "Memory LLM")
) %>%
mutate(
runtime_hours = case_when(
agent_type == "Mechanical" ~ n_agents * 0.0001,
agent_type == "Standard LLM" ~ n_agents * 0.05,
agent_type == "Memory LLM" ~ n_agents * 0.15 * (1 + log10(n_agents)/10)
),
feasible = runtime_hours < 24
)
p_scale <- ggplot(scale_data, aes(x = n_agents, y = runtime_hours, color = agent_type)) +
geom_line(linewidth = 1.2) +
geom_point(aes(shape = feasible), size = 3) +
geom_hline(yintercept = 24, linetype = "dashed", color = "red", alpha = 0.5) +
annotate("text", x = 500, y = 26, label = "24-hour limit", color = "red") +
scale_color_manual(values = agent_colors) +
scale_y_log10() +
labs(x = "Number of Agents", y = "Runtime (hours)",
title = "C. Scalability Limits") +
theme(legend.position = "bottom")
# Combine
p_tradeoff / (p_use_cases + p_scale)
Limitations and Future Work
Several limitations warrant consideration:
- Limited Social Context: While our framework supports multiple social contexts (race, income, political affiliation), this study focused solely on the baseline (red/blue) scenario. The neutral framing may not capture the full complexity of real-world residential decisions influenced by cultural and historical factors.
- Grid Size: Results may differ for larger neighborhoods or different densities. Our 8×8 grid with 46.9% density represents a high-density environment that may not generalize to suburban or rural contexts.
- LLM Variability: Results may depend on the specific LLM used (qwen2.5-coder:32B via chat.binghamton.edu) and could vary with different model architectures, training approaches, or prompting strategies.
- Computational Scalability: The current implementation’s runtime (7.5 hours for 30 runs) limits large-scale parameter exploration.
Additional Research Directions
Several critical extensions would strengthen these findings and advance the field:
Empirical Validation
The most pressing need is to validate these dynamics against real residential mobility data. Do neighborhoods with longer-tenured, socially connected residents show the greater stability that memory-enhanced LLM agents predict? Longitudinal census data combined with social network surveys could test whether the stabilizing effects of social memory observed in our simulations match real-world patterns. Key validation metrics would include:
- Correlation between neighborhood tenure and demographic stability
- The role of social ties in preventing residential mobility during demographic transitions
- Whether real neighborhoods show the reduced extreme segregation patterns predicted by memory LLM agents
Cultural Context Experiments
Our framework enables testing how different cultural contexts affect segregation dynamics. Running systematic experiments with racial, economic, and political identities would reveal whether the mechanisms we’ve identified operate similarly across different social divisions or whether some (like race) have unique dynamics due to historical and cultural factors. This could help explain why racial segregation often appears more persistent than other forms of social sorting.
Multi-Identity Models
Real people have multiple, intersecting identities. Extending the model to agents with both racial and economic identities, for example, could reveal how cross-cutting cleavages affect segregation patterns—a key question in urban sociology. Do agents with mixed identities (e.g., high-income minorities) create bridges between otherwise segregated communities? How do multiple identity dimensions interact to shape residential preferences?
Network Effects
While our memory model captures dyadic relationships, extending it to track fuller social networks could reveal how community-level social capital affects segregation dynamics. Research questions include:
- Do agents with more diverse social networks show different mobility patterns?
- How does the structure of social networks (dense vs. sparse, bridging vs. bonding) affect neighborhood stability?
- Can network interventions (creating social bridges between groups) prevent segregation cascades?
Institutional Factors
LLM agents could incorporate knowledge about schools, crime, property values, and other institutional factors that shape real residential decisions. This would move beyond pure social preferences to capture the complex bundle of factors that drive residential sorting, including:
- School quality perceptions and their interaction with racial preferences
- Economic constraints and their effect on residential choice sets
- The role of discriminatory practices in shaping available options
Intervention Experiments
The LLM framework enables testing realistic policy interventions. How do agents respond to:
- Affordable housing policies that increase economic diversity?
- School integration programs that decouple residential and educational segregation?
- Community-building initiatives designed to foster cross-group social ties?
- Anti-discrimination enforcement that expands choice sets?
The ability to process natural language descriptions of policies could provide more realistic predictions than mechanical models, helping policymakers understand potential unintended consequences and identify the most promising intervention strategies.
Methodological Advances
Future work should also address computational efficiency:
- Developing hybrid models that use LLMs for complex decisions but cache common responses
- Creating “small language models” trained specifically on residential decision-making
- Exploring whether key LLM insights can be distilled into more efficient agent rules
- Testing different prompting strategies to reduce token usage while maintaining behavioral realism
Conclusion
This study demonstrates that Large Language Models can successfully replace traditional utility-maximizing agents in agent-based models, providing more stable and potentially more realistic behavioral dynamics. Our key findings show that:
- Stability: LLM agents achieved 100% convergence compared to only 50% for mechanical agents, suggesting more robust equilibrium-finding behavior
- Efficiency: Memory-enhanced LLM agents converged 2.2× faster than mechanical agents when the latter did converge
- Segregation Patterns: While overall segregation levels remained similar (~55% vs 58%), LLM agents—particularly those with memory—showed significantly reduced extreme segregation (53.8% reduction in ghetto formation, p=0.018)
Our extended analyses provide additional insights into the mechanisms underlying these improvements:
- Stability Analysis: Memory agents show faster stabilization of move frequencies and lower variance in final configurations, suggesting that social memory creates beneficial friction in the system
- Social Network Effects: Network metrics reveal that memory agents develop increasing cross-group density (from 5% to 20%) while homophily decreases over time, with each additional cross-group tie reducing move probability by approximately 8%
- Behavioral Complexity: LLM agents consider multiple decision factors, with memory agents citing social relationships in 68% of decisions compared to just 15% for standard LLM agents
- Parameter Robustness: LLM agents show reduced sensitivity to threshold and density parameters, maintaining stable convergence across a wider parameter space
- Computational Tradeoffs: While computational costs are significant (1400-3100× higher), specific use cases like policy testing and small-scale validation provide optimal cost-benefit ratios
These results suggest that incorporating human-like decision-making through LLMs can produce more stable residential patterns with less extreme clustering. The memory effect appears particularly important, as persistent social relationships and residential history create a stabilizing force that prevents the formation of completely homogeneous neighborhoods.
The shift from mechanical to LLM-based agents represents more than a methodological upgrade—it’s a fundamental reconceptualization of how we model human behavior in complex social systems. By incorporating memory, context, and satisficing behavior, these models may finally capture the “human factor” that makes real neighborhoods more stable, less extremely segregated, and more amenable to policy intervention than our mechanical models have led us to believe.
The ability to simulate human-like decision-making at scale opens new avenues for policy analysis, urban planning, and social science research. As LLM technology continues to advance and computational costs decrease, we anticipate that LLM-enhanced agent models will become valuable tools for understanding and predicting social phenomena, particularly in contexts where human relationships and social memory play crucial roles.
Code and Data Availability
All code, data, and analysis scripts are available at: [repository URL]
References
Appendix: Detailed Statistical Results
# Full statistical results table
full_stats <- pairwise_data %>%
mutate(
comparison = paste(group1, "vs", group2),
metric = str_to_title(str_replace(metric, "_", " ")),
mean_diff = mean2 - mean1,
ci_lower = mean_diff - 1.96 * sqrt(std1^2 + std2^2),
ci_upper = mean_diff + 1.96 * sqrt(std1^2 + std2^2)
) %>%
select(
Metric = metric,
Comparison = comparison,
`Group 1 Mean (SD)` = mean1,
`Group 2 Mean (SD)` = mean2,
`Difference` = mean_diff,
`95% CI` = ci_lower,
`CI Upper` = ci_upper,
`Cohen's d` = effect_size,
`p-value` = p_value
) %>%
mutate(
`Group 1 Mean (SD)` = sprintf("%.3f (%.3f)", `Group 1 Mean (SD)`,
pairwise_data$std1),
`Group 2 Mean (SD)` = sprintf("%.3f (%.3f)", `Group 2 Mean (SD)`,
pairwise_data$std2),
`95% CI` = sprintf("[%.3f, %.3f]", `95% CI`, `CI Upper`),
`Cohen's d` = round(`Cohen's d`, 3),
`p-value` = round(`p-value`, 3)
) %>%
select(-`CI Upper`)
knitr::kable(full_stats,
caption = "Complete pairwise comparison results for all metrics")| Metric | Comparison | Group 1 Mean (SD) | Group 2 Mean (SD) | Difference | 95% CI | Cohen’s d | p-value |
|---|---|---|---|---|---|---|---|
| Share | mechanical_baseline vs standard_llm | 0.583 (0.015) | 0.553 (0.018) | -0.030 | [-0.076, 0.016] | -1.82 | 0.041 |
| Share | mechanical_baseline vs memory_llm | 0.583 (0.015) | 0.554 (0.021) | -0.029 | [-0.080, 0.022] | -1.60 | 0.048 |
| Share | standard_llm vs memory_llm | 0.553 (0.018) | 0.554 (0.021) | 0.001 | [-0.053, 0.055] | 0.05 | 0.912 |
| Clusters | mechanical_baseline vs standard_llm | 4.200 (0.800) | 5.100 (1.200) | 0.900 | [-1.927, 3.727] | 0.88 | 0.132 |
| Clusters | mechanical_baseline vs memory_llm | 4.200 (0.800) | 5.300 (1.300) | 1.100 | [-1.892, 4.092] | 1.02 | 0.095 |
| Clusters | standard_llm vs memory_llm | 5.100 (1.200) | 5.300 (1.300) | 0.200 | [-3.268, 3.668] | 0.16 | 0.742 |
| Distance | mechanical_baseline vs standard_llm | 5.890 (0.420) | 5.420 (0.510) | -0.470 | [-1.765, 0.825] | -1.01 | 0.089 |
| Distance | mechanical_baseline vs memory_llm | 5.890 (0.420) | 5.380 (0.480) | -0.510 | [-1.760, 0.740] | -1.12 | 0.072 |
| Distance | standard_llm vs memory_llm | 5.420 (0.510) | 5.380 (0.480) | -0.040 | [-1.413, 1.333] | -0.08 | 0.882 |
| Ghetto Rate | mechanical_baseline vs standard_llm | 0.260 (0.080) | 0.180 (0.060) | -0.080 | [-0.276, 0.116] | -1.13 | 0.068 |
| Ghetto Rate | mechanical_baseline vs memory_llm | 0.260 (0.080) | 0.120 (0.050) | -0.140 | [-0.325, 0.045] | -2.10 | 0.018 |
| Ghetto Rate | standard_llm vs memory_llm | 0.180 (0.060) | 0.120 (0.050) | -0.060 | [-0.213, 0.093] | -1.09 | 0.074 |
| Mix Deviation | mechanical_baseline vs standard_llm | 1.420 (0.180) | 1.280 (0.210) | -0.140 | [-0.682, 0.402] | -0.72 | 0.216 |
| Mix Deviation | mechanical_baseline vs memory_llm | 1.420 (0.180) | 1.240 (0.190) | -0.180 | [-0.693, 0.333] | -0.97 | 0.098 |
| Mix Deviation | standard_llm vs memory_llm | 1.280 (0.210) | 1.240 (0.190) | -0.040 | [-0.595, 0.515] | -0.20 | 0.688 |
Social Context vs. Nominal Measures
Unlike traditional Schelling models that use abstract “red/blue” or “Type A/Type B” labels, our LLM implementation enables testing with realistic social contexts that carry cultural meaning and implicit associations. We implemented several social scenarios:
Baseline Control: Generic “red vs blue” teams without social connotations, serving as a neutral control condition.
Racial Context: “White middle-class families” vs “Black families” - capturing historical patterns of residential segregation with embedded cultural associations about neighborhood preferences, school quality concerns, and social comfort.
Economic Context: “High-income professionals” vs “working-class families” - exploring how economic segregation emerges from preferences about property values, amenities, and social status.
Political Context: “Liberal households” vs “Conservative households” - investigating ideological clustering and how political identity affects residential choices.
These contexts enable the LLM to draw upon cultural knowledge embedded in training data, producing more realistic responses than arbitrary labels. For example, when prompted as a “White middle-class family,” the LLM may express concerns about school quality or property values that wouldn’t emerge from “Type A” framing.
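In practice, these scenarios can be swapped into the same prompt template through a simple lookup. The sketch below is illustrative (the list and helper names are ours), with labels paraphrasing the scenario descriptions above.
# Illustrative lookup from scenario to the identity labels substituted into prompts
social_contexts <- list(
  baseline  = c("red resident",              "blue resident"),
  racial    = c("White middle-class family", "Black family"),
  economic  = c("high-income professional",  "working-class family"),
  political = c("liberal household",         "conservative household")
)

identity_for <- function(context, agent_type) {
  # agent_type: 1 = Type A, 2 = Type B
  social_contexts[[context]][agent_type]
}

identity_for("economic", 1)
#> [1] "high-income professional"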