This code-through is an extension of an earlier code-through, and it further explores the utility of Monte Carlo simulations, a class of computational algorithms used to approximate problems that either lack a convenient closed-form solution or are too complex to solve exactly.
Monte Carlo methods let us understand the behavior of complex systems by repeatedly simulating outcomes under random inputs and aggregating the results. Rather than solving equations once, we “play out” many possible scenarios and summarize what typically happens.
There’s a lot of code in this, so I’ve opted to use “code_folding:hide” as a default. The knitted file will allow you to expand code chunks at your leisure.
I show how Monte Carlo simulations can be implemented in R to estimate probabilities, explore model uncertainty, and propagate measurement error into derived quantities — all using relatively straightforward code. Instead of relying only on textbook-style examples (like ideal coin flips), I embed simulation into a real metropolitan case: the Atlanta MSA.
The core ideas are illustrated through two main tools: (1) a bootstrap regression of median home value growth on tract characteristics, and (2) an ACS-based simulation of tract-level price-to-income ratios.
Together, these tools demonstrate how simulation can augment, and sometimes improve upon, traditional analytic methods when reasoning about neighborhood change and housing affordability.
Monte Carlo methods were formalized by John von Neumann and Stanislaw Ulam (circa 1946), who needed a way to approximate neutron diffusion in the core of a nuclear device. The problem lacked a convenient closed-form solution, so they turned to repeated random sampling.
As Ulam later recalled, the idea crystallized while thinking about the game of Canfield solitaire and asking: instead of deriving exact probabilities combinatorially, why not play the game many times and count how often it succeeds? Nicholas Metropolis suggested naming the method after the Monte Carlo casino in Monaco, where Ulam’s uncle used to gamble.
Monte Carlo simulations are grounded in the relative-frequency interpretation of probability. For some event \(A\) in a random experiment, its probability \(P(A)\) can be approximated by:
\[ P(A) \approx \frac{1}{n} \sum_{i=1}^{n} I(X_i \in A), \]
where \(X_1, X_2, \ldots, X_n\) are simulated outcomes and \(I(\cdot)\) is an indicator that equals 1 when the event occurs and 0 otherwise.
This approximation is justified by the Strong Law of Large Numbers (SLLN):
\[ P\left( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i = \mu \right) = 1. \]
In words: under suitable conditions, the sample mean converges almost surely (with probability 1) to the true expectation \(\mu\) as the number of observations grows.
For example, if we flip a fair coin four times, it is entirely possible to see four heads in a row — an event with probability \((1/2)^4 = 0.0625\). The SLLN does not claim that short-run outcomes must “balance out”; instead, it ensures that over a very large number of flips, the proportion of heads will almost surely converge to 0.5, even though rare streaks will still occur. Observed anomalies like the famous 1913 Monte Carlo roulette run (black appeared 26 times in a row) are improbable but not contradictory to the SLLN because the number of trials was still quite small.
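To make the relative-frequency idea concrete, here is a minimal, self-contained sketch (not part of the later analysis) that estimates P(heads) for a fair coin from increasingly many simulated flips; the estimate wobbles at small n and settles near 0.5 as n grows.
# Minimal illustration of the relative-frequency approximation:
# estimate P(heads) for a fair coin from increasingly many simulated flips.
set.seed(1)
flips <- sample(c(1, 0), 100000, replace = TRUE) # 1 = heads, 0 = tails
for (n in c(10, 100, 1000, 100000)) {
  cat("n =", n, "-> estimated P(heads) =", mean(flips[1:n]), "\n")
}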
In practice, Monte Carlo simulation usually follows three steps: (1) specify a model and probability distributions for the uncertain inputs, (2) generate many random draws and compute the outcome of interest for each one, and (3) aggregate the simulated outcomes into summaries such as means, quantiles, or probabilities.
Monte Carlo methods are valuable because they provide workable approximations in settings where exact results are difficult, opaque, or impossible to derive. They are especially helpful in risk analysis and planning, where decision-makers care about distributions of outcomes rather than just single-point forecasts. For example: “What proportion of scenarios lead to severe housing cost burden?” or “How often does a project overrun its budget, and by how much?”
In this code-through, I apply these ideas to neighborhood change and housing affordability in the Atlanta MSA. Building on earlier course labs, I treat neighborhood indicators — median home value (MHV), poverty, vacancy, and professional employment — as random quantities subject to both sampling error and model uncertainty.
I use two complementary Monte Carlo strategies:
Bootstrap regression: I start with a linear model linking MHV growth (2000–2010) to logged poverty, vacancy, and professional share. I then resample tracts with replacement and refit this model many times to obtain an empirical distribution of predicted growth for two stylized tract profiles: a “high-opportunity” tract and a “disadvantaged” tract.
ACS-based simulation of price-to-income ratios: Using ACS tract-level estimates and their MOEs for MHV and median household income, I build log-normal sampling distributions around each tract’s point estimate. I simulate many draws from these distributions to obtain a full distribution of price-to-income ratios for each tract, rather than a single deterministic value.
These two pieces come together in the final section, where I join LTDB-based growth measures with ACS-based Monte Carlo affordability estimates and map (a) historical home value growth and (b) the probability that each tract belongs to the top 20% of affordability burden in the metro area. This provides a richer, uncertainty-aware view of which neighborhoods are both rapidly appreciating and likely to be highly burdened.
Here, I use simple gambling-inspired games to develop intuition for Monte Carlo simulations.
Consider a game of chance where a wheel is subdivided into 3 parts: 50% is purple, 25% is green, 25% is orange. If you land on purple, you get 8 points, if you land on green, you lose 20 points, and if you land on orange, you gain 15 points. What is the expected value?
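Before running any code, the expected value can be written out directly as a probability-weighted sum:
\[ E = 0.5(8) + 0.25(-20) + 0.25(15) = 4 - 5 + 3.75 = 2.75. \]
The mean(c(8, 8, -20, 15)) shortcut below encodes the 50% purple probability by simply listing the 8-point outcome twice.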
Expected_Value_Manual <- mean(c(8, 8, -20, 15))
print(Expected_Value_Manual)
## [1] 2.75
For such a simple problem, the expected value is easy to calculate by hand. However, what if you want to answer a more interesting question? For instance: what is the probability that, after 20 spins, you’ll have more than 50 points?
# Theoretical expected value per spin
Expected_Value_Manual <- mean(c(8, 8, -20, 15)) # = 2.75
# Expected value for 20 spins
Expected_Value_20 <- Expected_Value_Manual * 20 # = 55
print(Expected_Value_20)
## [1] 55
# ---------------------------------
# Simulate outcomes
set.seed(420) #Seeds ensure reproducibility of outcomes
runs <- 10000
Spin_To_Win <- function() {
sum(sample(c(8, 8, -20, 15), 20, replace = TRUE))
}
results <- replicate(runs, Spin_To_Win())
# Empirical estimate for P(Sum > 50)
Monte_Carlo_Prob <- mean(results > 50)
print(Monte_Carlo_Prob)
## [1] 0.5429
# Empirical mean
mean_val <- mean(results)
summary_table <- data.frame(
Quantity = c("Theoretical mean (20 spins)", "Empirical mean (20 spins)", "P(Sum > 50)"),
Value = c(Expected_Value_20, mean_val, Monte_Carlo_Prob)
)
pander(summary_table,
caption = "Comparison of theoretical and simulated results for the wheel game")
| Quantity | Value |
|---|---|
| Theoretical mean (20 spins) | 55 |
| Empirical mean (20 spins) | 56 |
| P(Sum > 50) | 0.54 |
As can be seen, the probability of ending above 50 points is ~54%, which makes sense given that the expected total of 55 sits just above the 50-point threshold. Moreover, “mean_val” shows that, on average, the simulation returns a total of 55.672, which is almost identical to the theoretical expected value. The ability to quickly and precisely return numerical approximations for complex problems is extremely useful!
Let’s plot this on a Cumulative Distribution Function (CDF). CDFs are a great way to visualize the probability that the sum of our spins will take on a value less than or equal to a given threshold. At a glance, we’ll be able to see where our Monte-Carlo derived mean lies on the curve.
## Ensure results is numeric
results_num <- unlist(results)
# Empirical CDF and values at salient points
F_results <- ecdf(results_num)
CDF_Mean <- F_results(mean_val)
CDF_Theoretical <- F_results(Expected_Value_20)
# Small horizontal offsets so the dots are actually visible
offset_emp <- -1.75
offset_theo <- 1.75
p <- ggplot(data.frame(Sum = results_num), aes(x = Sum)) +
stat_ecdf(alpha = 0.9, linewidth = 1) +
# Threshold at 50
geom_vline(xintercept = 50,
color = "black",
linetype = "dashed",
alpha = 0.7) +
geom_vline(xintercept = mean_val,
color = "purple4",
linetype = "dotted",
linewidth = 1,
alpha = 0.8) +
geom_point(
data = data.frame(x = mean_val + offset_emp, y = CDF_Mean),
aes(x = x, y = y),
color = "purple4",
size = 3,
shape = 16,
alpha = 0.9
) +
annotate(
"text",
x = mean_val + offset_emp - 45,
y = CDF_Mean + 0.05,
label = "Empirical Mean",
color = "purple4",
size = 3,
vjust = -0.5
) +
# Theoretical mean line + point + label
geom_vline(xintercept = Expected_Value_20,
color = "chartreuse4",
linetype = "dotted",
linewidth = 1,
alpha = 0.8) +
geom_point(
data = data.frame(x = Expected_Value_20 + offset_theo, y = CDF_Theoretical),
aes(x = x, y = y),
color = "chartreuse4",
size = 3,
shape = 17,
alpha = 0.9
) +
annotate(
"text",
x = Expected_Value_20 + offset_theo + 45,
y = CDF_Theoretical - 0.05,
label = "Theoretical Mean",
color = "chartreuse4",
size = 3,
vjust = -0.5
) +
labs(
title = "CDF for 20 spins (10,000 iterations)",
subtitle = paste0(
"P(Sum > 50) = ", round(Monte_Carlo_Prob, 3),
"; Empirical Mean = ", round(mean_val, 2),
"; Theoretical Mean = ", round(Expected_Value_20, 2)
),
x = "Sum of 20 Spins",
y = "Cumulative Probability"
) +
theme_light()
print(p)
So far, I’ve used large numbers of runs or iterations (10,000 or more), and you might be wondering: Why? Am I just trying to overheat your CPU? Do I harbor a suspicious fondness for computational sadism? Or is this an elaborate ploy to soften you up before selling you a time-share? Fortunately, the truth is far less sinister.
The real reason is that Monte Carlo simulations rely on random sampling instead of exact closed-form solutions. With too few iterations, your samples might not fully capture the true range of possible outcomes — leading to misleading or unstable estimates. Additionally, the specific random number generation method can unintentionally under-represent parts of the distribution if you don’t sample enough. Therefore, a sufficiently large number of iterations (typically 10,000 or more) is essential to ensure your estimates converge and remain reliable.
This section demonstrates how using a low number of Monte Carlo iterations can yield an unstable estimate of the distribution compared to a well-converged simulation with many iterations.
# Define parameters and simulation function
Expected_Value_Manual <- mean(c(8, 8, -20, 15)) # = 2.75
Expected_Value_20 <- Expected_Value_Manual * 20 # = 55
Spin_To_Win <- function(n) {
replicate(n, sum(sample(c(8, 8, -20, 15), 20, replace = TRUE)))
}
# Simulate both highest and lowest iterations
set.seed(420)
results <- tibble(
Iteration = c(rep("High (10,000)", 10000),
rep("Low (10)", 10)),
Value = c(Spin_To_Win(10000),
Spin_To_Win(10))
)
# ----------------------------------------
# Compute means & probabilities by case
summary_df <- results %>%
group_by(Iteration) %>%
summarize(
Mean = mean(Value),
Prob_Above_50 = mean(Value > 50),
.groups = "drop"
)
print(summary_df)
## # A tibble: 2 × 3
## Iteration Mean Prob_Above_50
## <chr> <dbl> <dbl>
## 1 High (10,000) 55.7 0.543
## 2 Low (10) 70.4 0.5
# ----------------------------------------
# Build label data from summary_df
label_df <- summary_df %>%
mutate(
label = paste0(
"P(Sum > 50) = ", round(Prob_Above_50, 3),
"\nMean = ", round(Mean, 2)
),
x = Expected_Value_20 + 5, # horizontal label position
y = 0.3 # vertical label position
)
# ----------------------------------------
# Plot CDFs side by side using `facet_wrap`
ggplot(results, aes(x = Value)) +
stat_ecdf(color = "cyan4", linewidth = 1) +
geom_vline(xintercept = 50,
linetype = "dashed",
color = "black") +
geom_vline(xintercept = Expected_Value_20,
linetype = "dotted",
color = "magenta") +
geom_text(
data = label_df,
aes(x = x, y = y, label = label),
inherit.aes = FALSE,
color = "magenta",
size = 3,
hjust = 0
) +
facet_wrap(~ Iteration, scales = "free_y") +
labs(
title = "CDF Comparison: High vs Low Iterations",
subtitle = paste0(
"Dotted line: theoretical mean (", round(Expected_Value_20, 2),
"); dashed line: threshold at 50"
),
x = "Sum of 20 spins",
y = "Cumulative probability"
) +
coord_cartesian(ylim = c(0, 1.05), clip = "off") +
theme_light(base_size = 14) +
theme(
plot.margin = margin(t = 15, r = 40, b = 15, l = 15)
)
Note that the low-iteration mean (70.4) exceeds the high-iteration mean (55.7) by roughly 15 points: the low-iteration estimate overstates the mean by about 27%. Thus, it is imperative that you allocate sufficient iterations for your problem. Graphically, this is also obvious from the substantially more “step-like” behavior of the low-iteration CDF.
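One way to quantify this convergence, as a minimal sketch using the Spin_To_Win(n) helper defined above: the Monte Carlo standard error of the estimated mean shrinks roughly like the sample standard deviation divided by \(\sqrt{n}\), so 10,000 runs pin the mean down far more tightly than 10 runs do.
# Rough check of convergence: the Monte Carlo standard error of the mean
# shrinks roughly like sd/sqrt(n) as the number of runs grows.
set.seed(420)
for (n in c(10, 100, 1000, 10000)) {
  draws <- Spin_To_Win(n) # n simulated sums of 20 spins
  cat("n =", n,
      "| mean =", round(mean(draws), 2),
      "| approx. MC standard error =", round(sd(draws) / sqrt(n), 2), "\n")
}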
I kept the casino-style examples above simple, but they illustrated three core ideas I will now reuse in a real-data setting: (1) Monte Carlo as repeated sampling to approximate probabilities, (2) the distinction between an expected value and the full distribution of outcomes, and (3) the importance of using sufficient iterations to derive stable estimates. In the remainder of this document, I apply these ideas to neighborhood change and housing affordability in the Atlanta MSA. Instead of spins and dice, the “random variables” are tract-level home values and incomes; instead of asking “What’s the chance I win more than 50 points?”, I ask questions like “How uncertain is predicted MHV growth?” and “How likely is it that a given tract belongs to the top 20% affordability burden group?”
#Load LTDB data (2000 / 2010 / meta)
URL1 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2000.rds"
d1 <- readRDS(gzcon(url(URL1))) # read the compressed 2000 LTDB file
URL2 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2010.rds"
d2 <- readRDS(gzcon(url(URL2))) # read the compressed 2010 LTDB file
URLmd <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-META-DATA.rds"
md <- readRDS(gzcon(url(URLmd))) # read metadata
#Drop the (redundant) year column from both LTDB sets
d1 <- select(d1, -year)
d2 <- select(d2, -year)
#Merge LTDB 2000 & 2010 by tractid, then attach metadata
d <- merge(d1, d2, by = "tractid")
d <- merge(d, md, by = "tractid")
#Keep only urban tracts (drawing on Lab 5's exercise)
d <- d %>%
filter(urban == "urban")
d <- d %>%
select(
tractid,
mhmval00, mhmval12,
hinc00,
hu00, vac00, own00, rent00, h30old00,
empclf00, clf00, unemp00, prof00,
dpov00, npov00,
ag25up00, hs00, col00,
pop00.x, nhwht00, nhblk00, hisp00, asian00,
cbsa, cbsaname
) %>%
mutate(
# Demographic / housing composition
p.white = 100 * nhwht00 / pop00.x,
p.black = 100 * nhblk00 / pop00.x,
p.hisp = 100 * hisp00 / pop00.x,
p.asian = 100 * asian00 / pop00.x,
p.hs = 100 * (hs00 + col00) / ag25up00,
p.col = 100 * col00 / ag25up00,
p.prof = 100 * prof00 / empclf00,
p.unemp = 100 * unemp00 / clf00,
p.vacant = 100 * vac00 / hu00,
pov.rate = 100 * npov00 / dpov00
)
# Inflation-adjust 2000 MHV to 2010 dollars
mhv.00 <- d$mhmval00 * 1.28855
mhv.10 <- d$mhmval12
mhv.change <- mhv.10 - mhv.00
# Drop tracts with unrealistic 2000 values (< $1k)
mhv.00[mhv.00 < 1000] <- NA
# Create variable for % growth in MHV 2000–2010
mhv.growth <- 100 * (mhv.change / mhv.00)
# Attach back to LTDB data frame
d$mhv.00 <- mhv.00
d$mhv.10 <- mhv.10
d$mhv.change <- mhv.change
d$mhv.growth <- mhv.growth
# Drop extreme growth > 200% to avoid wild outliers
d <- d %>%
filter(mhv.growth <= 200)
# Each of these variables is skewed; I'm eschewing corrplots for brevity
# Log Transformations are simple and preferred for this kind of skew
d <- d %>%
mutate(
log_pov = log10(pov.rate + 1),
log_pvacant = log10(p.vacant + 1),
log_prof = log10(p.prof + 1)
)
summary(d[, c("log_pov", "log_pvacant", "log_prof")])
## log_pov log_pvacant log_prof
## Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.7329 1st Qu.:0.6126 1st Qu.:1.376
## Median :0.9749 Median :0.7848 Median :1.511
## Mean :0.9866 Mean :0.8079 Mean :1.498
## 3rd Qu.:1.2389 3rd Qu.:0.9760 3rd Qu.:1.643
## Max. :2.0043 Max. :2.0043 Max. :2.004
## NA's :158 NA's :146 NA's :159
Before bringing Monte Carlo methods into the analysis, I begin with a standard parametric model using LTDB data. This gives me a baseline, “classical” view of neighborhood change: how does MHV growth from 2000 to 2010 relate to logged poverty, vacancy, and professional share in 2000? This linear regression provides a single set of coefficient estimates and predicted values, which I will later stress-test using bootstrap simulation. In other words, the regression is the starting point, and the subsequent Monte Carlo work is what allows me to ask: How stable are these predictions across many plausible realizations of the data?
reg_atl <- d %>%
filter(
cbsaname == "Atlanta-Sandy Springs-Marietta, GA"
) %>%
select(
mhv.growth,
log_pov,
log_pvacant,
log_prof
) %>%
na.omit()
# Linear regression: MHV growth as a function of poverty rate (log), vacancy rate (log), and professional share (log)
# This serves as the baseline parametric model before introducing bootstrap uncertainty.
m_atl <- lm(
mhv.growth ~ log_pov + log_pvacant + log_prof,
data = reg_atl
)
stargazer(
m_atl,
type = s.type,
title = "Atlanta MSA: MHV Growth Regression",
dep.var.labels = "MHV growth 2000 to 2010 (%)",
covariate.labels = c(
"Poverty Rate (Log)",
"Vacancy Rate (Log)",
"Professional Representation (Log)"
),
digits = 3,
single.row = FALSE,
header = FALSE
)
| | Dependent variable: MHV growth 2000 to 2010 (%) |
|---|---|
| Poverty Rate (Log) | 3.906 (3.880) |
| Vacancy Rate (Log) | 26.299*** (4.807) |
| Professional Representation (Log) | 0.863 (5.709) |
| Constant | -15.746 (10.992) |
| Observations | 941 |
| R2 | 0.061 |
| Adjusted R2 | 0.058 |
| Residual Std. Error | 24.764 (df = 937) |
| F Statistic | 20.142*** (df = 3; 937) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
# Using Quartiles of Poverty and Professional Share w/ Median Vacancy
q_pov <- quantile(reg_atl$log_pov, probs = c(0.25, 0.75), na.rm = TRUE)
q_prof <- quantile(reg_atl$log_prof, probs = c(0.25, 0.75), na.rm = TRUE)
med_vac <- median(reg_atl$log_pvacant, na.rm = TRUE)
#High-opportunity tract: Low Poverty, High Professional Representation, Median Vacancy
new_hi <- data.frame(
log_pov = q_pov[1],
log_pvacant = med_vac,
log_prof = q_prof[2]
)
#Disadvantaged tract: High Poverty, Low Professional Representation, Median Vacancy
new_lo <- data.frame(
log_pov = q_pov[2],
log_pvacant = med_vac,
log_prof = q_prof[1]
)
new_hi
new_lo
pred_hi_ci <- predict(m_atl, newdata = new_hi, interval = "confidence", level = 0.95)
pred_lo_ci <- predict(m_atl, newdata = new_lo, interval = "confidence", level = 0.95)
ci_tab <- rbind(
cbind(Scenario = "High-opportunity", as.data.frame(pred_hi_ci)),
cbind(Scenario = "Disadvantaged", as.data.frame(pred_lo_ci))
) %>%
rename(
estimate = fit,
Lower_95 = lwr,
Upper_95 = upr
)
# Remove 25%/75% Row Names for Table
rownames(ci_tab) <- NULL
pander::pander(
ci_tab,
caption = "95% Confidence Intervals for Expected MHV Growth (Atlanta)"
)
| Scenario | estimate | Lower_95 | Upper_95 |
|---|---|---|---|
| High-opportunity | 8.2 | 6 | 10 |
| Disadvantaged | 9.7 | 7.6 | 12 |
What do we mean by “bootstrap” here? Bootstrapping is a resampling-based method for approximating the sampling distribution of a statistic when we only have one dataset. Instead of assuming a particular parametric form for the errors, we treat the observed sample of Atlanta tracts as if it were the “population,” repeatedly draw new samples with replacement from it, re-estimate the regression on each sample, and re-compute the predictions of interest. Across many such resamples, the spread of those predictions approximates the uncertainty we would see if we could repeatedly re-collect similar LTDB data. In that sense, the bootstrap is itself a type of Monte Carlo method: we rely on simulation rather than closed-form formulas to learn about variability.
The resulting empirical distributions of predicted growth (one distribution per tract profile) allow me to compute bootstrap means, standard deviations, and quantiles (5th, 50th, 95th). This provides a more intuitive sense of how much the predicted growth could vary under repeated sampling of similar neighborhoods.
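Before applying the bootstrap to the full regression below, here is a minimal sketch of the same idea on a simpler statistic (the sample mean of mhv.growth in reg_atl): resample tracts with replacement, recompute the statistic, and summarize the spread.
# Minimal bootstrap sketch: uncertainty in the mean MHV growth across tracts.
# The full analysis below applies the same resampling logic to the regression.
set.seed(529)
boot_means <- replicate(2000, {
  idx <- sample.int(nrow(reg_atl), replace = TRUE) # resample tracts with replacement
  mean(reg_atl$mhv.growth[idx], na.rm = TRUE)      # recompute the statistic
})
quantile(boot_means, probs = c(0.05, 0.50, 0.95))  # empirical uncertainty band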
set.seed(529) # Ensure reproducible simulation results
B_boot <- 5000 # number of bootstrap runs
pred_hi_boot <- numeric(B_boot) # Stores high-opportunity predictions
pred_lo_boot <- numeric(B_boot) # Stores disadvantaged predictions
n_atl <- nrow(reg_atl) # Number of ATL tracts in the sample
for (b in seq_len(B_boot)) {
# Sample Atlanta tracts with replacement to create the bootstrap dataset
idx_b <- sample.int(n_atl, size = n_atl, replace = TRUE)
d_boot <- reg_atl[idx_b, ]
# Refit the same (linear) model on the bootstrap sample
m_boot <- lm(
mhv.growth ~ log_pov + log_pvacant + log_prof,
data = d_boot
)
# Store predicted MHV growth for each scenario
pred_hi_boot[b] <- predict(m_boot, newdata = new_hi)
pred_lo_boot[b] <- predict(m_boot, newdata = new_lo)
}
# Summarize bootstrap sampling distributions
boot_sum <- data.frame(
Scenario = c("High-opportunity", "Disadvantaged"),
Mean = c(mean(pred_hi_boot), mean(pred_lo_boot)),
SD = c(sd(pred_hi_boot), sd(pred_lo_boot)),
P05 = c(quantile(pred_hi_boot, 0.05),
quantile(pred_lo_boot, 0.05)),
P50 = c(quantile(pred_hi_boot, 0.50),
quantile(pred_lo_boot, 0.50)),
P95 = c(quantile(pred_hi_boot, 0.95),
quantile(pred_lo_boot, 0.95))
)
stargazer(
boot_sum,
type = s.type,
summary = FALSE,
title = "bootstrap Monte Carlo Summaries for Predicted MHV Growth",
digits = 2
)
| | Scenario | Mean | SD | P05 | P50 | P95 |
|---|---|---|---|---|---|---|
| 1 | High-opportunity | 8.18 | 1.17 | 6.28 | 8.17 | 10.14 |
| 2 | Disadvantaged | 9.62 | 1.23 | 7.63 | 9.61 | 11.68 |
# Create Histogram of bootstrap predictions for both scenarios
mc_df <- tibble(
scenario = rep(c("High-opportunity tract", "Disadvantaged tract"), each = B_boot),
pred_grow = c(pred_hi_boot, pred_lo_boot)
)
ggplot(mc_df, aes(x = pred_grow)) +
geom_histogram(bins = 100, alpha = 0.7) +
facet_wrap(~ scenario, scales = "free_y") +
labs(
title = "Monte Carlo Uncertainty in Predicted MHV Growth (Atlanta MSA)",
subtitle = "bootstrap distributions for two stylized tract profiles",
x = "Predicted MHV growth (%)",
y = "Count"
) +
theme_light(base_size = 13)
#Step 1: MSA-to-county crosswalk to get Atlanta MSA counties
crosswalk_url <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv"
crosswalk <- read.csv(crosswalk_url, stringsAsFactors = FALSE, colClasses = "character")
atlanta_msa_name <- "ATLANTA, GA"
atlanta_rows <- crosswalk$msaname == atlanta_msa_name
atlanta_fips <- na.omit(crosswalk$fipscounty[atlanta_rows])
#Split FIPS into state + county codes
state_fips <- substr(atlanta_fips, 1, 2)
county_fips <- substr(atlanta_fips, 3, 5)
vars_pi <- c(
mhv = "B25077_001", # Median value of owner-occupied housing units
hhinc = "B19013_001"# Median household income
)
# We can use these to construct a tract-level price-to-income (MHV / HH income) ratio.
atl_acs_list <- list()
for (st in unique(state_fips)) {
counties_in_state <- county_fips[state_fips == st]
# grab our ACS 5-year estimates at the tract level
acs_data <- get_acs(
geography = "tract",
variables = vars_pi,
state = st,
county = counties_in_state,
geometry = FALSE
)
atl_acs_list[[st]] <- acs_data
}
## Getting data from the 2019-2023 5-year ACS
atl_acs_raw <- bind_rows(atl_acs_list) # Stack the state-level pulls
pander(
head(atl_acs_raw),
caption = "First 6 Rows of Raw ACS Data (Atlanta Tracts)"
)
| GEOID | NAME | variable | estimate | moe |
|---|---|---|---|---|
| 13013180103 | Census Tract 1801.03; Barrow County; Georgia | hhinc | 96466 | 21350 |
| 13013180103 | Census Tract 1801.03; Barrow County; Georgia | mhv | 4e+05 | 39132 |
| 13013180104 | Census Tract 1801.04; Barrow County; Georgia | hhinc | 81012 | 1770 |
| 13013180104 | Census Tract 1801.04; Barrow County; Georgia | mhv | 292400 | 32586 |
| 13013180105 | Census Tract 1801.05; Barrow County; Georgia | hhinc | 88333 | 20051 |
| 13013180105 | Census Tract 1801.05; Barrow County; Georgia | mhv | 262900 | 33030 |
atl_acs <- atl_acs_raw %>%
select(GEOID, variable, estimate, moe) %>%
tidyr::pivot_wider(
id_cols = GEOID, # Ensure one row per tract
names_from = variable, # Separate columns for each ACS Code
values_from = c(estimate, moe) # Create new columns for estimate and moe
) %>%
rename(
mhv_est = estimate_mhv,
hhinc_est = estimate_hhinc,
mhv_moe = moe_mhv,
hhinc_moe = moe_hhinc
) %>%
filter(
!is.na(mhv_est), !is.na(hhinc_est),
mhv_est > 0, hhinc_est > 0
)
atl_acs <- atl_acs %>%
mutate( # Convert MOE to SE (90% CI = 1.645)
mhv_se = mhv_moe / 1.645,
hhinc_se = hhinc_moe / 1.645
) %>%
filter(
!is.na(mhv_est), !is.na(hhinc_est),
mhv_est > 0, hhinc_est > 0,
!is.na(mhv_se), !is.na(hhinc_se),
mhv_se >= 0, hhinc_se >= 0
)
So far, the uncertainty has come from the regression model and the particular set of tracts in the LTDB sample. But the ACS estimates themselves are also subject to sampling error, which is reported as a margin of error (MOE) for each tract. To account for this, I treat each tract’s ACS estimates for MHV and median household income as random variables centered around their published point estimates.
Assuming a log-normal sampling distribution, I do the following for each tract: (1) convert each published MOE to a standard error (MOE / 1.645), (2) express that standard error relative to the point estimate to obtain the log-scale spread, (3) set the log-scale mean so that the simulated mean matches the ACS point estimate, (4) draw many log-normal values of MHV and household income, and (5) compute the price-to-income ratio for every draw.
This produces a tract-specific distribution of affordability burdens, rather than a single deterministic ratio. The log-normal assumption is convenient because it respects non-negativity and, by construction, keeps the simulated mean close to the ACS point estimate while letting the MOE-driven uncertainty propagate into the price-to-income ratios. Note: If \(X \sim \text{LogNormal}(\mu_{\log}, \sigma_{\log}^2)\), then \(E[X] = \exp(\mu_{\log} + \tfrac{1}{2}\sigma_{\log}^2)\). By choosing \(\mu_{\log} = \log(\hat{x}) - \tfrac{1}{2}\sigma_{\log}^2\), I anchor the simulated mean \(E[X]\) to the ACS point estimate \(\hat{x}\), while \(\sigma_{\log}\) (derived from the MOE) controls the spread. Without this adjustment, a log-normal with the same standard deviation would systematically overshoot the point estimate.
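As a quick numerical check of that anchoring claim, the short sketch below (using hypothetical values for the point estimate and relative SE, not tied to any tract) draws from such a log-normal and confirms that the mean of the draws stays close to the point estimate.
# Quick check of the mean-anchoring adjustment described above.
# xhat and sdlog are hypothetical illustration values, not ACS data.
set.seed(1)
xhat  <- 250000 # hypothetical point estimate
sdlog <- 0.10   # hypothetical relative SE (SE / estimate)
draws <- rlnorm(100000, meanlog = log(xhat) - 0.5 * sdlog^2, sdlog = sdlog)
mean(draws) / xhat # approximately 1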
set.seed(42069)
B_Ratio <- 1000 # Number of Monte Carlo draws per tract for the ratio
simulate_ratio_for_tract <- function(mhv_est, mhv_se, hhinc_est, hhinc_se) {
# Handle NA or non-positive SEs with a small fallback so that
# each tract still has a non-degenerate distribution in the ranking simulation.
mhv_sd_log <- ifelse(is.na(mhv_se) | mhv_se <= 0, 0.0001, mhv_se / mhv_est)
hhinc_sd_log <- ifelse(is.na(hhinc_se) | hhinc_se <= 0, 0.0001, hhinc_se / hhinc_est)
mhv_log_mean <- log(mhv_est) - 0.5 * mhv_sd_log^2
hhinc_log_mean <- log(hhinc_est) - 0.5 * hhinc_sd_log^2
# Log-normal draws of MHV and HH Income
mhv_draws <- rlnorm(B_Ratio, meanlog = mhv_log_mean, sdlog = mhv_sd_log)
hhinc_draws <- rlnorm(B_Ratio, meanlog = hhinc_log_mean, sdlog = hhinc_sd_log)
# Compute Price-to-Income ratio for all draws
ratio_draws <- mhv_draws / hhinc_draws
#Summarize Monte Carlo Distribution
tibble(
ratio_mean = mean(ratio_draws, na.rm = TRUE),
ratio_sd = sd(ratio_draws, na.rm = TRUE),
ratio_p10 = as.numeric(quantile(ratio_draws, 0.10, na.rm = TRUE, names = FALSE)),
ratio_p90 = as.numeric(quantile(ratio_draws, 0.90, na.rm = TRUE, names = FALSE))
)
}
# "as.numeric()" prevents the tables from blowing up in digit count
atl_ratio_mc <- atl_acs %>%
group_by(GEOID) %>%
group_modify(~ simulate_ratio_for_tract(
mhv_est = .x$mhv_est,
mhv_se = .x$mhv_se,
hhinc_est = .x$hhinc_est,
hhinc_se = .x$hhinc_se
)) %>%
ungroup()
pander(
head(atl_ratio_mc),
caption = "First 6 Tracts: Monte Carlo Price-to-Income Ratio Summary"
)
| GEOID | ratio_mean | ratio_sd | ratio_p10 | ratio_p90 |
|---|---|---|---|---|
| 13013180103 | 4.2 | 0.6 | 3.5 | 5 |
| 13013180104 | 3.6 | 0.26 | 3.3 | 4 |
| 13013180105 | 3 | 0.48 | 2.4 | 3.6 |
| 13013180106 | 3.5 | 0.4 | 3 | 4 |
| 13013180107 | 3 | 0.31 | 2.7 | 3.4 |
| 13013180108 | 2.5 | 0.21 | 2.3 | 2.8 |
# Merge ratio summaries with original ACS point estimates
atl_ratio_mc_full <- atl_acs %>%
select(GEOID, mhv_est, hhinc_est) %>%
left_join(atl_ratio_mc, by = "GEOID")
A common policy question is not just “what is the price-to-income ratio in this tract?” but “which tracts are truly in the top burden tier?” Because each tract’s ratio is uncertain, its rank within the metro distribution is also uncertain.
To quantify this, I simulate many “possible Atlantas.” In each simulation, every tract’s ratio is drawn from a normal distribution centered on its ratio_mean with spread ratio_sd, the 80th-percentile cutoff is recomputed across all tracts for that simulation, and each tract is flagged as above or below that cutoff. Repeating this process many times yields, for each tract, the share of simulations in which it appears in the top 20% of burden. I store this as prob_top_quintile. Instead of a hard label (“high burden” vs. “not high burden”), we now have a probability that each tract truly belongs to the high-burden group, given the combined uncertainty in its underlying ACS estimates.
# Simulate uncertainty for the top quintile burden class
n_tracts <- nrow(atl_ratio_mc_full)
# Create matrix of simulated ratio draws per tract
ratio_draws_mat <- matrix(NA_real_, nrow = n_tracts, ncol = B_Ratio)
for (i in seq_len(n_tracts)) {
mu <- atl_ratio_mc_full$ratio_mean[i]
sd <- atl_ratio_mc_full$ratio_sd[i]
# Use a small fallback SD (1% of mean) if the Monte Carlo SD is zero or missing,
# so that each tract still has a non-degenerate distribution in the ranking simulation.
sd <- ifelse(is.na(sd) | sd <= 0, mu * 0.01, sd)
ratio_draws_mat[i, ] <- rnorm(B_Ratio, mean = mu, sd = sd)
}
#80th percentile cutoff for each simulation ACROSS tracts
top_quintile_cutoffs <- apply(
ratio_draws_mat,
2,
quantile,
probs = 0.80,
na.rm = TRUE
)
#Probability of EACH tract being in top quintile
top_quintile_prob <- numeric(n_tracts)
for (i in seq_len(n_tracts)) {
top_quintile_prob[i] <- mean(
ratio_draws_mat[i, ] > top_quintile_cutoffs,
na.rm = TRUE
)
}
atl_ratio_prob <- atl_ratio_mc_full %>%
mutate(prob_top_quintile = top_quintile_prob)
example_idx <- 1 # change index to inspect different tracts
example_draws <- ratio_draws_mat[example_idx, ]
example_df <- tibble(ratio = example_draws)
ggplot(example_df, aes(x = ratio)) +
geom_histogram(bins = 100, alpha = 0.7) +
geom_vline(xintercept = mean(example_draws), linetype = "dashed") +
labs(
title = paste0(
"Monte Carlo Price-to-Income Ratio\nTract: ",
atl_ratio_prob$GEOID[example_idx]
),
subtitle = paste0(
"Mean = ", round(mean(example_draws), 2),
", SD = ", round(sd(example_draws), 2),
", P(top quintile) = ",
round(atl_ratio_prob$prob_top_quintile[example_idx], 2)
),
x = "Price-to-income ratio",
y = "Count"
) +
theme_minimal(base_size = 13)
Up to this point, Monte Carlo has played two distinct roles. First, the bootstrap simulations quantified how sensitive our regression-based predictions of MHV growth are to the particular sample of Atlanta tracts. Second, the ACS-based simulations turned each tract’s point estimates for MHV and income into a full distribution for its price-to-income ratio, driven by the published margins of error. The final step is to link these two pieces: combine historical MHV growth from the LTDB with current, uncertainty-inclusive affordability measures from the ACS so we can evaluate where rapid appreciation and high burden are most likely to coincide.
# Ensure tract keys line up: LTDB tractid is an 11-digit FIPS code
# Sometimes, issues arise with fips prefixes and hyphens, so this strips those
# and keeps Hunter sane
d <- d %>%
mutate(
GEOID = tractid |>
gsub("fips", "", x = _) |>
gsub("-", "", x = _)
)
head(d$GEOID)
## [1] "01001020100" "01001020200" "01001020300" "01001020400" "01001020500"
## [6] "01001020600"
# Atlanta-only LTDB data with desired variables
atl_ltdb <- d %>%
filter(cbsaname == "Atlanta-Sandy Springs-Marietta, GA") %>%
select(
GEOID,
mhv.00, mhv.10, mhv.change, mhv.growth,
pov.rate, p.vacant, p.prof,
log_pov, log_pvacant, log_prof
)
# Combine LTDB variables with price-to-income Monte Carlo results
atl_join <- atl_ltdb %>%
inner_join(atl_ratio_prob, by = "GEOID")
summary(select(atl_join, mhv.growth, ratio_mean, prob_top_quintile))
## mhv.growth ratio_mean prob_top_quintile
## Min. :-68.239 Min. : 1.392 Min. :0.0000
## 1st Qu.: -6.495 1st Qu.: 3.171 1st Qu.:0.0000
## Median : 6.180 Median : 3.646 Median :0.0130
## Mean : 10.201 Mean : 4.131 Mean :0.2136
## 3rd Qu.: 21.253 3rd Qu.: 4.619 3rd Qu.:0.3435
## Max. :145.764 Max. :18.251 Max. :1.0000
cor_mhv_ratio <- cor(
atl_join$mhv.growth,
atl_join$ratio_mean,
use = "complete.obs"
)
cor_mhv_ratio
## [1] 0.1770434
#Color scheme corresponds to probability of being high-burden
ggplot(atl_join, aes(
x = ratio_mean,
y = mhv.growth,
color = prob_top_quintile
)) +
geom_point(alpha = 0.7) +
scale_color_viridis_c(option = "plasma") +
labs(
title = "MHV Growth vs Price-to-Income Ratio (Atlanta Tracts)",
subtitle = "Color indicates Monte Carlo probability of being in the top burden quintile",
x = "Monte Carlo mean price-to-income ratio",
y = "MHV growth 2000–2010 (%)",
color = "P(top 20% burden)"
) +
theme_light(base_size = 13)
atl_join <- atl_join %>%
mutate(
burden_class = case_when(
prob_top_quintile >= 0.8 ~ "Robust high burden (P ≥ 0.8)",
prob_top_quintile >= 0.5 ~ "Borderline high burden (0.5 ≤ P < 0.8)",
TRUE ~ "Lower / uncertain burden (P < 0.5)"
)
)
burden_summary <- atl_join %>%
group_by(burden_class) %>%
summarize(
n_tracts = n(),
mean_mhv_g = mean(mhv.growth, na.rm = TRUE),
mean_ratio = mean(ratio_mean, na.rm = TRUE),
.groups = "drop"
)
pander(
burden_summary,
caption = "Atlanta tracts by burden class (ACS Monte Carlo) and average MHV growth"
)
| burden_class | n_tracts | mean_mhv_g | mean_ratio |
|---|---|---|---|
| Borderline high burden (0.5 ≤ P < 0.8) | 59 | 17 | 5.8 |
| Lower / uncertain burden (P < 0.5) | 362 | 7.9 | 3.5 |
| Robust high burden (P ≥ 0.8) | 42 | 21 | 7.5 |
In this final step, I link together three pieces: (1) LTDB-based MHV growth for 2000–2010, (2) the ACS-based Monte Carlo affordability estimates, including prob_top_quintile, and (3) tract geometries for the Atlanta MSA.

I restrict the maps to the LTDB-defined urban tracts so that the spatial analysis aligns with the regression sample. The first map visualizes the probability that each tract belongs to the top 20% affordability-burden tier (prob_top_quintile), while the second map shows the historical MHV growth over 2000–2010. Looked at together, these maps highlight places that both experienced rapid home value appreciation and are likely to exhibit high housing cost burden — neighborhoods that may be particularly salient for housing policy, displacement risk, and equity-oriented planning.
# Get tract geometries for all counties in the Atlanta MSA
atl_tract_list <- list()
for (st in unique(state_fips)) {
counties_in_state <- county_fips[state_fips == st]
atl_tract_list[[st]] <- tracts(
state = st,
county = counties_in_state,
year = 2021, # geo year
cb = TRUE, # cartographic boundary
class = "sf"
)
}
atl_tracts <- bind_rows(atl_tract_list) %>%
st_transform(4326) %>% # CRS for web maps
select(GEOID, geometry)
# Restrict to *urban* LTDB Atlanta tracts only
# (atl_ltdb was created earlier from d and already filtered to Atlanta + urban)
atl_tracts_urban <- atl_tracts %>%
inner_join(atl_ltdb %>% select(GEOID), by = "GEOID")
# Now join ACS + Monte Carlo + LTDB info
atl_map <- atl_tracts_urban %>%
left_join(atl_join, by = "GEOID")
dim(atl_map)
## [1] 485 20
table(is.na(atl_map$prob_top_quintile))
##
## FALSE TRUE
## 463 22
# Share of tracts with missing Monte Carlo burden
n_total <- nrow(atl_map)
n_na <- sum(is.na(atl_map$prob_top_quintile))
pct_na_tracts <- 100 * n_na / n_total
pct_na_tracts
## [1] 4.536082
ggplot(atl_map) +
geom_sf(aes(fill = prob_top_quintile), color = NA) +
scale_fill_viridis_c(
option = "plasma",
name = "P(Top 20% burden)",
na.value = "grey90"
) +
labs(
title = "Probability of Being in the Top Affordability Burden Quintile",
subtitle = "ACS-based Monte Carlo estimates for price-to-income ratio, Atlanta MSA tracts",
x = NULL,
y = NULL
) +
theme_void(base_size = 12) +
theme(
legend.position = "right",
plot.title = element_text(face = "bold")
)
ggplot(atl_map) +
geom_sf(aes(fill = mhv.growth), color = NA) +
scale_fill_viridis_c(
option = "magma",
name = "MHV growth (%)"
) +
labs(
title = "Median Home Value Growth 2000–2010",
subtitle = "LTDB-based MHV growth for Atlanta urban tracts",
x = NULL,
y = NULL
) +
theme_void(base_size = 12) +
theme(
legend.position = "right",
plot.title = element_text(face = "bold")
)
Taken together, the steps in this code-through show what Monte Carlo methods contributed beyond a conventional “run one regression and report the coefficients” workflow:
From single predictions to distributions. The baseline linear model links MHV growth to log poverty, vacancy, and professional share, but by itself it gives only one predicted value per tract profile. The bootstrap simulation turns those point predictions into full distributions for the “high-opportunity” and “disadvantaged” tracts, making the uncertainty in those predictions explicit (via empirical means, SDs, and quantiles).

From point estimates to uncertain inputs. The ACS provides tract-level estimates and MOEs for MHV and household income. Instead of treating these as fixed, error-free numbers, the tract-level Monte Carlo simulation treats them as random variables drawn from log-normal distributions anchored on the ACS point estimates. This lets the published MOEs flow through into uncertainty about price-to-income ratios.

From hard cutoffs to probabilistic classifications. A typical affordability analysis might define “high burden” as being above the 80th percentile of price-to-income ratios and then classify tracts once and for all. Here, repeated simulations of the metro-wide ratio distribution yield a probability (prob_top_quintile) that each tract belongs to the top 20% burden tier. That probability is often more informative than a binary label, especially when data are noisy.

From static maps to uncertainty-aware spatial insight. Joining LTDB growth measures, ACS-based Monte Carlo affordability estimates, and tract geometries allows us to map not only where MHV grew fastest, but also how confident we are that a tract is highly burdened. Tracts with both high MHV growth and high prob_top_quintile are flagged as places where rapid appreciation and likely affordability pressure coincide — a pattern that is easy to see on the maps but difficult to capture with point estimates alone.
In short, Monte Carlo methods here are not a replacement for regression or ACS estimates, but a layer on top of them: they convert deterministic outputs into probabilistic statements that better reflect the uncertainty inherent in small-area neighborhood data.
Learn more about probabilistic risk assessment (PRA), Monte Carlo simulations, and the Strong Law of Large Numbers with the following resources:
This code-through references and cites the following sources:
Wagaman, A. S., & Dobrow, R. P. (2021). Probability: With Applications and R. John Wiley & Sons.
Rizzo, M. L. (2019). Statistical Computing with R (2nd ed.). CRC Press.