1 Introduction

This code-through is an extension of an earlier code-through, and it further explores the utility of Monte Carlo simulations, a class of computational algorithms used to approximate problems that either:

  1. Do not have explicit analytical (closed-form) solutions, or
  2. Do have closed-form solutions but are more easily and transparently approximated by simulation.

Monte Carlo methods let us understand the behavior of complex systems by repeatedly simulating outcomes under random inputs and aggregating the results. Rather than solving equations once, we “play out” many possible scenarios and summarize what typically happens.

There’s a lot of code in this, so I’ve opted to use “code_folding: hide” as a default. The knitted file will allow you to expand code chunks at your leisure.


1.1 Content Overview

I show how Monte Carlo simulations can be implemented in R to estimate probabilities, explore model uncertainty, and propagate measurement error into derived quantities — all using relatively straightforward code. Instead of relying only on textbook-style examples (like ideal coin flips), I embed simulation into a real metropolitan case: the Atlanta MSA.

The core ideas are illustrated through two main tools:

  1. bootstrap simulation of a regression model, to quantify uncertainty in predicted median home value (MHV) growth for stylized neighborhoods.
  2. Tract-level simulation of price-to-income ratios from ACS data, using published margins of error (MOEs) to generate distributions of plausible affordability burdens for each tract.

Together, these tools demonstrate how simulation can augment, and sometimes improve upon, traditional analytic methods when reasoning about neighborhood change and housing affordability.


1.1.1 History of the Monte Carlo Method

Monte Carlo methods were formalized by John von Neumann and Stanislaw Ulam (circa 1946), who needed a way to approximate neutron diffusion in the core of a nuclear device. The problem lacked a convenient closed-form solution, so they turned to repeated random sampling.

As Ulam later recalled, the idea crystallized while thinking about the game of Canfield solitaire and asking: instead of deriving exact probabilities combinatorially, why not play the game many times and count how often it succeeds? Nicholas Metropolis suggested naming the method after the Monte Carlo casino in Monaco, where Ulam’s uncle used to gamble.


1.1.2 Mathematical Motivation

Monte Carlo simulations are grounded in the relative-frequency interpretation of probability. For some event \(A\) in a random experiment, its probability \(P(A)\) can be approximated by:

\[ P(A) \approx \frac{1}{n} \sum_{i=1}^{n} I(X_i \in A), \]

where \(X_1, X_2, \ldots, X_n\) are simulated outcomes and \(I(\cdot)\) is an indicator that equals 1 when the event occurs and 0 otherwise.

This approximation is justified by the Strong Law of Large Numbers (SLLN):

\[ P\left( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i = \mu \right) = 1. \]

In words: under suitable conditions, the sample mean converges almost surely (with probability 1) to the true expectation \(\mu\) as the number of observations grows.

For example, if we flip a fair coin four times, it is entirely possible to see four heads in a row — an event with probability \((1/2)^4 = 0.0625\). The SLLN does not claim that short-run outcomes must “balance out”; instead, it ensures that over a very large number of flips, the proportion of heads will almost surely converge to 0.5, even though rare streaks will still occur. Observed anomalies like the famous 1913 Monte Carlo roulette run (black appeared 26 times in a row) are improbable but not contradictory to the SLLN because the number of trials was still quite small.
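
To make this concrete, here is a minimal sketch (base R only, with an arbitrary seed) that tracks the running proportion of heads across 100,000 simulated flips of a fair coin; early streaks push the proportion around, but it settles near 0.5 as \(n\) grows.

set.seed(1)                                       # reproducible flips (illustrative seed)
flips   <- rbinom(100000, size = 1, prob = 0.5)   # 1 = heads, 0 = tails
running <- cumsum(flips) / seq_along(flips)       # proportion of heads after each flip

running[c(10, 100, 1000, 100000)]                 # early values wander; later values hug 0.5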

In practice, Monte Carlo simulation usually follows three steps, sketched in R below:

  1. Simulate a trial: Represent the random process on the computer and generate one realization (an “iteration”).
  2. Score the outcome: Determine whether the event of interest \(A\) occurs in that trial.
  3. Repeat and aggregate: Repeat many times. The fraction of trials in which \(A\) occurs is your simulated estimate of \(P(A)\).
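
As a minimal sketch of these three steps, consider estimating the probability of the four-heads-in-a-row event described above and comparing it to the exact value of 0.0625 (the seed and trial count here are arbitrary):

set.seed(2025)                            # illustrative seed
n_trials <- 100000

one_trial <- function() {
  flips <- sample(c("H", "T"), size = 4, replace = TRUE)  # Step 1: simulate one trial (4 flips)
  all(flips == "H")                                       # Step 2: score it (did the event occur?)
}

mean(replicate(n_trials, one_trial()))    # Step 3: repeat and aggregate; should be near 0.0625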


1.2 Why You Should Care

Monte Carlo methods are valuable because they provide workable approximations in settings where exact results are difficult, opaque, or impossible to derive. They are especially helpful in risk analysis and planning, where decision-makers care about distributions of outcomes rather than just single-point forecasts. For example: “What proportion of scenarios lead to severe housing cost burden?” or “How often does a project overrun its budget, and by how much?”


1.3 This Code-Through

In this code-through, I apply these ideas to neighborhood change and housing affordability in the Atlanta MSA. Building on earlier course labs, I treat neighborhood indicators — median home value (MHV), poverty, vacancy, and professional employment — as random quantities subject to both sampling error and model uncertainty.

I use two complementary Monte Carlo strategies:

  1. Bootstrap regression: I start with a linear model linking MHV growth (2000–2010) to logged poverty, vacancy, and professional share. I then resample tracts with replacement and refit this model many times to obtain an empirical distribution of predicted growth for two stylized tract profiles: a “high-opportunity” tract and a “disadvantaged” tract.

  2. ACS-based simulation of price-to-income ratios: Using ACS tract-level estimates and their MOEs for MHV and median household income, I build log-normal sampling distributions around each tract’s point estimate. I simulate many draws from these distributions to obtain a full distribution of price-to-income ratios for each tract, rather than a single deterministic value.

These two pieces come together in the final section, where I join LTDB-based growth measures with ACS-based Monte Carlo affordability estimates and map (a) historical home value growth and (b) the probability that each tract belongs to the top 20% of affordability burden in the metro area. This provides a richer, uncertainty-aware view of which neighborhoods are both rapidly appreciating and likely to be highly burdened.


2 Monte Carlo Warm-Up: Casino Examples

Here, I use simple gambling-inspired games to develop intuition for Monte Carlo simulations.


2.1 Probability Examples

2.1.1 Spin the Wheel

Consider a game of chance where a wheel is subdivided into three parts: 50% is purple, 25% is green, and 25% is orange. If you land on purple, you gain 8 points; if you land on green, you lose 20 points; and if you land on orange, you gain 15 points. What is the expected value of a single spin?

Expected_Value_Manual <- mean(c(8, 8, -20, 15))

print(Expected_Value_Manual)
## [1] 2.75
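
Equivalently, the expected value can be written as a probability-weighted sum over the three wheel segments; the mean() call above only works because the 50% purple segment is entered twice, making all four listed outcomes equally likely. A minimal check:

outcomes      <- c(   8,  -20,   15)   # purple, green, orange payoffs
probabilities <- c(0.50, 0.25, 0.25)   # segment shares of the wheel

sum(outcomes * probabilities)          # 0.5*8 - 0.25*20 + 0.25*15 = 2.75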

For such a simple problem, the expected value is easy to calculate by hand. However, what if you want to answer a more interesting question? For instance, what is the probability that, after 20 spins, you’ll have more than 50 points?

#  Theoretical expected value per spin
Expected_Value_Manual <- mean(c(8, 8, -20, 15))  # = 2.75

#  Expected value for 20 spins
Expected_Value_20 <- Expected_Value_Manual * 20  # = 55
print(Expected_Value_20)
## [1] 55
# ---------------------------------
#  Simulate outcomes
set.seed(420)  #Seeds ensure reproducibility of outcomes 
runs <- 10000

Spin_To_Win <- function() {
  sum(sample(c(8, 8, -20, 15), 20, replace = TRUE))
}

results <- replicate(runs, Spin_To_Win())

#  Empirical estimate for P(Sum > 50)
Monte_Carlo_Prob <- mean(results > 50)
print(Monte_Carlo_Prob)
## [1] 0.5429
#  Empirical mean
mean_val <- mean(results)


summary_table <- data.frame(
  Quantity = c("Theoretical mean (20 spins)", "Empirical mean (20 spins)", "P(Sum > 50)"),
  Value    = c(Expected_Value_20, mean_val, Monte_Carlo_Prob)
)

pander(summary_table,
       caption = "Comparison of theoretical and simulated results for the wheel game")
Comparison of theoretical and simulated results for the wheel game

  Quantity                        Value
  Theoretical mean (20 spins)     55
  Empirical mean (20 spins)       56
  P(Sum > 50)                     0.54

As can be seen, the estimated probability of finishing with more than 50 points is ~54%, which is consistent with a theoretical expected value (55) that sits just above the 50-point threshold. Moreover, mean_val shows that the empirical mean across the 10,000 runs is 55.672, almost identical to the expected value. The ability to quickly and precisely return numerical approximations for complex problems is extremely useful!
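
As a quick aside (this is not in the original analysis, just a standard check), the precision of a simulated probability can itself be estimated: the Monte Carlo standard error of a proportion is roughly \(\sqrt{p(1-p)/n}\), so with 10,000 runs the estimate of P(Sum > 50) is accurate to within about half a percentage point.

p_hat <- Monte_Carlo_Prob              # ~0.543 from the simulation above
n     <- runs                          # 10,000 iterations

mc_se <- sqrt(p_hat * (1 - p_hat) / n) # approximate Monte Carlo standard error
mc_se                                  # roughly 0.005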

2.1.2 Spin the Wheel CDF

Let’s plot this on a Cumulative Distribution Function (CDF). CDFs are a great way to visualize the probability that the sum of our spins will take on a value less than or equal to a given threshold. At a glance, we’ll be able to see where our Monte-Carlo derived mean lies on the curve.

## Ensure results is numeric 
results_num <- unlist(results)

# Empirical CDF and values at salient points
F_results       <- ecdf(results_num)
CDF_Mean        <- F_results(mean_val)
CDF_Theoretical <- F_results(Expected_Value_20)

# Small horizontal offsets so the dots are actually visible
offset_emp  <- -1.75
offset_theo <-  1.75

p <- ggplot(data.frame(Sum = results_num), aes(x = Sum)) +
  stat_ecdf(alpha = 0.9, linewidth = 1) +

  # Threshold at 50
  geom_vline(xintercept = 50,
             color = "black",
             linetype = "dashed",
             alpha = 0.7) +

  
  geom_vline(xintercept = mean_val,
             color = "purple4",
             linetype = "dotted",
             linewidth = 1,
             alpha = 0.8) +

  geom_point(
    data = data.frame(x = mean_val + offset_emp, y = CDF_Mean),
    aes(x = x, y = y),
    color = "purple4",
    size = 3,
    shape = 16,
    alpha = 0.9
  ) +
  annotate(
    "text",
    x     = mean_val + offset_emp - 45,
    y     = CDF_Mean + 0.05,
    label = "Empirical Mean",
    color = "purple4",
    size  = 3,
    vjust = -0.5
  ) +

  # Theoretical mean line + point + label
  geom_vline(xintercept = Expected_Value_20,
             color = "chartreuse4",
             linetype = "dotted",
             linewidth = 1,
             alpha = 0.8) +
  geom_point(
    data = data.frame(x = Expected_Value_20 + offset_theo, y = CDF_Theoretical),
    aes(x = x, y = y),
    color = "chartreuse4",
    size = 3,
    shape = 17,
    alpha = 0.9
  ) +
  annotate(
    "text",
    x     = Expected_Value_20 + offset_theo + 45,
    y     = CDF_Theoretical - 0.05,
    label = "Theoretical Mean",
    color = "chartreuse4",
    size  = 3,
    vjust = -0.5
  ) +

  labs(
    title = "CDF for 20 spins (10,000 iterations)",
    subtitle = paste0(
      "P(Sum > 50) = ", round(Monte_Carlo_Prob, 3),
      "; Empirical Mean = ", round(mean_val, 2),
      "; Theoretical Mean = ", round(Expected_Value_20, 2)
    ),
    x = "Sum of 20 Spins",
    y = "Cumulative Probability"
  ) +
  theme_light()

print(p)

2.1.3 Comparing High vs Low Iteration Simulations

So far, I’ve used large numbers of runs or iterations (10,000 or more), and you might be wondering: Why? Am I just trying to overheat your CPU? Do I harbor a suspicious fondness for computational sadism? Or is this an elaborate ploy to soften you up before selling you a time-share? Fortunately, the truth is far less sinister.

The real reason is that Monte Carlo simulations rely on random sampling instead of exact closed-form solutions. With too few iterations, your samples may not capture the full range of possible outcomes, leading to misleading or unstable estimates; rare but consequential outcomes may simply never appear in a small sample. A sufficiently large number of iterations (typically 10,000 or more) helps ensure your estimates converge and remain reliable.

This section demonstrates how using a low number of Monte Carlo iterations can yield an unstable estimate of the distribution compared to a well-converged simulation with many iterations.

#  Define parameters and simulation function
Expected_Value_Manual <- mean(c(8, 8, -20, 15))     # = 2.75
Expected_Value_20     <- Expected_Value_Manual * 20 # = 55

Spin_To_Win <- function(n) {
  replicate(n, sum(sample(c(8, 8, -20, 15), 20, replace = TRUE)))
}

#  Simulate both highest and lowest iterations
set.seed(420)

results <- tibble(
  Iteration = c(rep("High (10,000)", 10000),
                rep("Low (10)",        10)),
  Value = c(Spin_To_Win(10000),
            Spin_To_Win(10))
)

# ----------------------------------------
#  Compute means & probabilities by case
summary_df <- results %>%
  group_by(Iteration) %>%
  summarize(
    Mean           = mean(Value),
    Prob_Above_50  = mean(Value > 50),
    .groups        = "drop"
  )

print(summary_df)
## # A tibble: 2 × 3
##   Iteration      Mean Prob_Above_50
##   <chr>         <dbl>         <dbl>
## 1 High (10,000)  55.7         0.543
## 2 Low (10)       70.4         0.5
# ----------------------------------------
#  Build label data from summary_df 
label_df <- summary_df %>%
  mutate(
    label = paste0(
      "P(Sum > 50) = ", round(Prob_Above_50, 3),
      "\nMean = ", round(Mean, 2)
    ),
    x = Expected_Value_20 + 5,  # horizontal label position
    y = 0.3                     # vertical label position
  )

# ----------------------------------------
#  Plot CDFs side by side using `facet_wrap`
ggplot(results, aes(x = Value)) +
  stat_ecdf(color = "cyan4", linewidth = 1) +
  geom_vline(xintercept = 50,
             linetype = "dashed",
             color    = "black") +
  geom_vline(xintercept = Expected_Value_20,
             linetype = "dotted",
             color    = "magenta") +
  geom_text(
    data        = label_df,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE,
    color       = "magenta",
    size        = 3,
    hjust       = 0
  ) +
  facet_wrap(~ Iteration, scales = "free_y") +
  labs(
    title = "CDF Comparison: High vs Low Iterations",
    subtitle = paste0(
      "Dotted line: theoretical mean (", round(Expected_Value_20, 2), 
      "); dashed line: threshold at 50"
    ),
    x = "Sum of 20 spins",
    y = "Cumulative probability"
  ) +
  coord_cartesian(ylim = c(0, 1.05), clip = "off") +
  theme_light(base_size = 14) +
  theme(
    plot.margin = margin(t = 15, r = 40, b = 15, l = 15)
  )

Note that the low-iteration mean (about 70) exceeds the high-iteration mean (about 56) by roughly 15 points, overstating it by about 27%. Thus, it is imperative that you allocate sufficient iterations for your problem. Graphically, this is also obvious from the substantially more “step-like” behavior of the low-iteration CDF.
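
To see this convergence directly, a small sketch (reusing the vectorized Spin_To_Win() defined above, with an arbitrary seed) can track how the running estimate of P(Sum > 50) stabilizes as iterations accumulate:

set.seed(99)                                           # illustrative seed
draws        <- Spin_To_Win(10000)                     # 10,000 fresh 20-spin totals
running_prob <- cumsum(draws > 50) / seq_along(draws)  # running estimate of P(Sum > 50)

running_prob[c(10, 100, 1000, 10000)]                  # early estimates jump around; later ones settle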


2.2 From Games of Chance to Neighborhood Predictions

I kept the casino-style examples above simple, but they illustrated three core ideas I will now reuse in a real-data setting: (1) Monte Carlo as repeated sampling to approximate probabilities, (2) the distinction between an expected value and the full distribution of outcomes, and (3) the importance of using sufficient iterations to derive stable estimates. In the remainder of this document, I apply these ideas to neighborhood change and housing affordability in the Atlanta MSA. Instead of spins and dice, the “random variables” are tract-level home values and incomes; instead of asking “What’s the chance I win more than 50 points?”, I ask questions like “How uncertain is predicted MHV growth?” and “How likely is it that a given tract belongs to the top 20% affordability burden group?”

3 Data Setup: LTDB & Neighborhood Covariates

3.1 Loading LTDB Data

 #Load LTDB data (2000 / 2010 / meta) 

URL1 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2000.rds"
d1 <- readRDS(gzcon(url(URL1))) # read the compressed 2000 LTDB file 

URL2 <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-2010.rds"
d2 <- readRDS(gzcon(url(URL2))) # read the compressed 2010 LTDB file

URLmd <- "https://github.com/DS4PS/cpp-529-fall-2020/raw/main/LABS/data/rodeo/LTDB-META-DATA.rds"
md <- readRDS(gzcon(url(URLmd))) # read metadata

#Drop the (redundant) year column from both LTDB sets
d1 <- select(d1, -year)
d2 <- select(d2, -year)

#Merge LTDB 2000 & 2010 by tractid, then attach metadata

d <- merge(d1, d2, by = "tractid")
d <- merge(d, md, by = "tractid")

#Keep only urban tracts (drawing on Lab 5's exercise)

d <- d %>%
  filter(urban == "urban")

3.2 Creating Neighborhood Variables and MHV Growth

d <- d %>%
  select(
    tractid,
    mhmval00, mhmval12,
    hinc00,
    hu00, vac00, own00, rent00, h30old00,
    empclf00, clf00, unemp00, prof00,
    dpov00, npov00,
    ag25up00, hs00, col00,
    pop00.x, nhwht00, nhblk00, hisp00, asian00,
    cbsa, cbsaname
  ) %>%
  mutate(

    # Demographic / housing composition

    p.white  = 100 * nhwht00 / pop00.x,
    p.black  = 100 * nhblk00 / pop00.x,
    p.hisp   = 100 * hisp00  / pop00.x,
    p.asian  = 100 * asian00 / pop00.x,
    p.hs     = 100 * (hs00 + col00) / ag25up00,
    p.col    = 100 * col00   / ag25up00,
    p.prof   = 100 * prof00  / empclf00,
    p.unemp  = 100 * unemp00 / clf00,
    p.vacant = 100 * vac00   / hu00,
    pov.rate = 100 * npov00  / dpov00
  )

# Inflation-adjust 2000 MHV to 2010 dollars

mhv.00 <- d$mhmval00 * 1.28855
mhv.10 <- d$mhmval12
mhv.change <- mhv.10 - mhv.00

# Drop tracts with unrealistic 2000 values (< $1k)

mhv.00[mhv.00 < 1000] <- NA

# Create variable for % growth in MHV 2000–2010

mhv.growth <- 100 * (mhv.change / mhv.00)

# Attach back to LTDB data frame

d$mhv.00 <- mhv.00
d$mhv.10 <- mhv.10
d$mhv.change <- mhv.change
d$mhv.growth <- mhv.growth

# Drop extreme growth > 200% to avoid wild outliers

d <- d %>%
  filter(mhv.growth <= 200)

3.3 Log Transformations for (Skewed) Variables of Interest

# Each of these variables is skewed; I'm eschewing corrplots for brevity
# Log Transformations are simple and preferred for this kind of skew
d <- d %>%
  mutate(
    log_pov     = log10(pov.rate + 1),
    log_pvacant = log10(p.vacant + 1),
    log_prof    = log10(p.prof   + 1)
  )

summary(d[, c("log_pov", "log_pvacant", "log_prof")])
##     log_pov        log_pvacant        log_prof    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.7329   1st Qu.:0.6126   1st Qu.:1.376  
##  Median :0.9749   Median :0.7848   Median :1.511  
##  Mean   :0.9866   Mean   :0.8079   Mean   :1.498  
##  3rd Qu.:1.2389   3rd Qu.:0.9760   3rd Qu.:1.643  
##  Max.   :2.0043   Max.   :2.0043   Max.   :2.004  
##  NA's   :158      NA's   :146      NA's   :159


Before bringing Monte Carlo methods into the analysis, I begin with a standard parametric model using LTDB data. This gives me a baseline, “classical” view of neighborhood change: how does MHV growth from 2000 to 2010 relate to logged poverty, vacancy, and professional share in 2000? This linear regression provides a single set of coefficient estimates and predicted values, which I will later stress-test using bootstrap simulation. In other words, the regression is the starting point, and the subsequent Monte Carlo work is what allows me to ask: How stable are these predictions across many plausible realizations of the data?

4 Baseline Regression for Atlanta MSA

4.1 Filter Atlanta Urban Tracts

reg_atl <- d %>%
  filter(cbsaname == "Atlanta-Sandy Springs-Marietta, GA") %>%
  select(
    mhv.growth,
    log_pov,
    log_pvacant,
    log_prof
  ) %>%
  na.omit()

4.2 Baseline Linear Model of MHV Growth

# Linear regression: MHV growth as a function of poverty rate (log), vacancy rate (log), and professional share (log)
# This serves as the baseline parametric model before introducing bootstrap uncertainty.
m_atl <- lm(
  mhv.growth ~ log_pov + log_pvacant + log_prof,
  data = reg_atl
)

stargazer(
  m_atl,
  type            = s.type,
  title           = "Atlanta MSA: MHV Growth Regression",
  dep.var.labels  = "MHV growth 2000 to 2010 (%)",
  covariate.labels = c(
    "Poverty Rate (Log)",
    "Vacancy Rate (Log)",
    "Professional Representation (Log)"
  ),
  digits          = 3,
  single.row      = FALSE,
  header          = FALSE
)
Atlanta MSA: MHV Growth Regression

  Dependent variable: MHV growth 2000 to 2010 (%)

  Poverty Rate (Log)                     3.906
                                        (3.880)
  Vacancy Rate (Log)                    26.299***
                                        (4.807)
  Professional Representation (Log)      0.863
                                        (5.709)
  Constant                             -15.746
                                       (10.992)

  Observations                           941
  R2                                     0.061
  Adjusted R2                            0.058
  Residual Std. Error                   24.764 (df = 937)
  F Statistic                           20.142*** (df = 3; 937)

  Note: *p<0.1; **p<0.05; ***p<0.01

4.3 Tract Profiles

# Using Quartiles of Poverty and Professional Share w/ Median Vacancy
q_pov <- quantile(reg_atl$log_pov, probs = c(0.25, 0.75), na.rm = TRUE)
q_prof <- quantile(reg_atl$log_prof, probs = c(0.25, 0.75), na.rm = TRUE)
med_vac <- median(reg_atl$log_pvacant, na.rm = TRUE)

#High-opportunity tract: Low Poverty, High Professional Representation, Median Vacancy

new_hi <- data.frame(
  log_pov     = q_pov[1],
  log_pvacant = med_vac,
  log_prof    = q_prof[2]
)

#Disadvantaged tract: High Poverty, Low Professional Representation, Median Vacancy

new_lo <- data.frame(
  log_pov     = q_pov[2],
  log_pvacant = med_vac,
  log_prof    = q_prof[1]
)

new_hi
new_lo

4.4 Confidence Intervals (95%) for Predicted Growth

pred_hi_ci <- predict(m_atl, newdata = new_hi, interval = "confidence", level = 0.95)
pred_lo_ci <- predict(m_atl, newdata = new_lo, interval = "confidence", level = 0.95)

ci_tab <- rbind(
  cbind(Scenario = "High-opportunity", as.data.frame(pred_hi_ci)),
  cbind(Scenario = "Disadvantaged",    as.data.frame(pred_lo_ci))
) %>%
  rename(
    estimate = fit,
    Lower_95 = lwr,
    Upper_95 = upr
  )

# Remove 25%/75% Row Names for Table
rownames(ci_tab) <- NULL

pander::pander(
  ci_tab,
  caption = "95% Confidence Intervals for Expected MHV Growth (Atlanta)",
)
95% Confidence Intervals for Expected MHV Growth (Atlanta)

  Scenario            estimate   Lower_95   Upper_95
  High-opportunity       8.2        6          10
  Disadvantaged          9.7        7.6        12

5 Monte Carlo Bootstrap

What do we mean by “bootstrap” here? Bootstrapping is a resampling-based method for approximating the sampling distribution of a statistic when we only have one dataset. Instead of assuming a particular parametric form for the errors, we treat the observed sample of Atlanta tracts as if it were the “population,” repeatedly draw new samples with replacement from it, re-estimate the regression on each sample, and re-compute the predictions of interest. Across many such resamples, the spread of those predictions approximates the uncertainty we would see if we could repeatedly re-collect similar LTDB data. In that sense, the bootstrap is itself a type of Monte Carlo method: we rely on simulation rather than closed-form formulas to learn about variability. Concretely, the procedure is:

  1. Resample the Atlanta tracts with replacement to create a new bootstrap dataset.
  2. Refit the same linear regression on that bootstrap sample.
  3. Predict MHV growth for the “high-opportunity” and “disadvantaged” tract profiles.
  4. Repeat this thousands of times.

The resulting empirical distributions of predicted growth (one distribution per tract profile) allow me to compute bootstrap means, standard deviations, and quantiles (5th, 50th, 95th). This provides a more intuitive sense of how much the predicted growth could vary under repeated sampling of similar neighborhoods.

set.seed(529) # Ensure reproducible simulation results

B_boot <- 5000 # number of bootstrap runs

pred_hi_boot <- numeric(B_boot) # Stores high-opportunity predictions
pred_lo_boot <- numeric(B_boot) # Stores disadvantaged predictions

n_atl <- nrow(reg_atl) # Number of ATL tracts in the sample

for (b in seq_len(B_boot)) {

# Sample Atlanta tracts with replacement to create the bootstrap dataset

  idx_b  <- sample.int(n_atl, size = n_atl, replace = TRUE)
  d_boot <- reg_atl[idx_b, ]
  
  
# Refit the same (linear) model on the bootstrap sample 

  m_boot <- lm(
    mhv.growth ~ log_pov + log_pvacant + log_prof,
    data = d_boot
  )


# Store predicted MHV growth for each scenario

  pred_hi_boot[b] <- predict(m_boot, newdata = new_hi)
  pred_lo_boot[b] <- predict(m_boot, newdata = new_lo)
}

# Summarize bootstrap sampling distributions

boot_sum <- data.frame(
  Scenario  = c("High-opportunity", "Disadvantaged"),
  Mean = c(mean(pred_hi_boot), mean(pred_lo_boot)),
  SD   = c(sd(pred_hi_boot),   sd(pred_lo_boot)),
  P05  = c(quantile(pred_hi_boot, 0.05),
           quantile(pred_lo_boot, 0.05)),
  P50  = c(quantile(pred_hi_boot, 0.50),
           quantile(pred_lo_boot, 0.50)),
  P95  = c(quantile(pred_hi_boot, 0.95),
           quantile(pred_lo_boot, 0.95))
)

stargazer(
  boot_sum,
  type    = s.type,
  summary = FALSE,
  title   = "Bootstrap Monte Carlo Summaries for Predicted MHV Growth",
  digits = 2
)
Bootstrap Monte Carlo Summaries for Predicted MHV Growth

    Scenario            Mean    SD     P05    P50    P95
  1 High-opportunity    8.18   1.17   6.28   8.17   10.14
  2 Disadvantaged       9.62   1.23   7.63   9.61   11.68

6 Visualize Bootstrap Distributions

# Create histogram of bootstrap predictions for both scenarios
mc_df <- tibble(
  scenario  = rep(c("High-opportunity tract", "Disadvantaged tract"), each = B_boot),
  pred_grow = c(pred_hi_boot, pred_lo_boot)
)

ggplot(mc_df, aes(x = pred_grow)) +
  geom_histogram(bins = 100, alpha = 0.7) +
  facet_wrap(~ scenario, scales = "free_y") +
  labs(
    title    = "Monte Carlo Uncertainty in Predicted MHV Growth (Atlanta MSA)",
    subtitle = "Bootstrap distributions for two stylized tract profiles",
    x = "Predicted MHV growth (%)",
    y = "Count"
  ) +
  theme_light(base_size = 13)

7 Pull ACS data for Atlanta Tracts

#Step 1: MSA-to-county crosswalk to get Atlanta MSA counties

crosswalk_url <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv"
crosswalk <- read.csv(crosswalk_url, stringsAsFactors = FALSE, colClasses = "character")

atlanta_msa_name <- "ATLANTA, GA"
atlanta_rows <- crosswalk$msaname == atlanta_msa_name
atlanta_fips <- na.omit(crosswalk$fipscounty[atlanta_rows])

#Split FIPS into state + county codes

state_fips <- substr(atlanta_fips, 1, 2)
county_fips <- substr(atlanta_fips, 3, 5)

8 Pull MHV and Median Household Income from ACS Data

vars_pi <- c(
  mhv   = "B25077_001", # Median value of owner-occupied housing units
  hhinc = "B19013_001"  # Median household income
)

# We can use these to construct a tract-level price-to-income (MHV / HH income) ratio.


atl_acs_list <- list()

for (st in unique(state_fips)) {
  counties_in_state <- county_fips[state_fips == st]

  # grab our ACS 5-year estimates at the tract level
  acs_data <- get_acs(
    geography = "tract",
    variables = vars_pi,
    state     = st,
    county    = counties_in_state,
    geometry  = FALSE
  )

  atl_acs_list[[st]] <- acs_data
}
## Getting data from the 2019-2023 5-year ACS
atl_acs_raw <- bind_rows(atl_acs_list) # Stack the state-level pulls

pander(
  head(atl_acs_raw),
  caption = "First 6 Rows of Raw ACS Data (Atlanta Tracts)"
)
First 6 Rows of Raw ACS Data (Atlanta Tracts)

  GEOID         NAME                                            variable   estimate     moe
  13013180103   Census Tract 1801.03; Barrow County; Georgia    hhinc         96466   21350
  13013180103   Census Tract 1801.03; Barrow County; Georgia    mhv          400000   39132
  13013180104   Census Tract 1801.04; Barrow County; Georgia    hhinc         81012    1770
  13013180104   Census Tract 1801.04; Barrow County; Georgia    mhv          292400   32586
  13013180105   Census Tract 1801.05; Barrow County; Georgia    hhinc         88333   20051
  13013180105   Census Tract 1801.05; Barrow County; Georgia    mhv          262900   33030

9 Select & Filter ACS data for variables of interest

atl_acs <- atl_acs_raw %>%
  select(GEOID, variable, estimate, moe) %>%
  tidyr::pivot_wider(
    id_cols    = GEOID, # Ensure one row per tract
    names_from = variable, # Separate columns for each ACS Code
    values_from = c(estimate, moe) # Create new columns for estimate and moe
  ) %>%
  rename(
    mhv_est   = estimate_mhv,
    hhinc_est = estimate_hhinc,
    mhv_moe   = moe_mhv,
    hhinc_moe = moe_hhinc
  ) %>%
  filter(
    !is.na(mhv_est), !is.na(hhinc_est),
    mhv_est > 0, hhinc_est > 0
  )

10 Convert MOE to SE and filter tracts

atl_acs <- atl_acs %>%
  mutate(                             # Convert MOE to SE (90% CI = 1.645)
    mhv_se   = mhv_moe   / 1.645,
    hhinc_se = hhinc_moe / 1.645
  ) %>%
  filter(
    !is.na(mhv_est), !is.na(hhinc_est),
    mhv_est > 0, hhinc_est > 0,
    !is.na(mhv_se), !is.na(hhinc_se),
    mhv_se >= 0, hhinc_se >= 0
  )
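
As a quick sanity check (using the household income figures for tract 13013180103 from the raw ACS table above), the conversion simply rescales the published 90% margin of error, so estimate ± 1.645 × SE reproduces estimate ± MOE:

# Worked check with the household income figures for tract 13013180103 shown above
hhinc_est_ex <- 96466
hhinc_moe_ex <- 21350

hhinc_se_ex <- hhinc_moe_ex / 1.645              # approximate SE (about 12,979)
hhinc_est_ex + c(-1, 1) * 1.645 * hhinc_se_ex    # recovers the published 90% interval, est ± MOE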

11 Tract-Level Monte Carlo Simulation for Price-to-Income Ratio

11.1 Intuition

So far, the uncertainty has come from the regression model and the particular set of tracts in the LTDB sample. But the ACS estimates themselves are also subject to sampling error, which is reported as a margin of error (MOE) for each tract. To account for this, I treat each tract’s ACS estimates for MHV and median household income as random variables centered around their published point estimates.

Assuming a log-normal sampling distribution, I do the following:

  1. Convert each tract’s MOE to a standard error (SE), using the ACS fact that a 90% MOE corresponds to approximately \(1.645 \times \text{SE}\).
  2. Translate the SE into a log-scale standard deviation, \(\sigma_{\log}\), for a log-normal distribution.
  3. Choose a log-mean \(\mu_{\log}\) such that \(\exp(\mu_{\log} + \tfrac{1}{2}\sigma_{\log}^2)\) is approximately equal to the ACS point estimate. This is achieved by setting \[ \mu_{\log} = \log(\hat{x}) - \tfrac{1}{2}\sigma_{\log}^2. \]
  4. Draw many Monte Carlo samples of MHV and income from these log-normal distributions for each tract, and compute a price-to-income ratio for every draw.

This produces a tract-specific distribution of affordability burdens, rather than a single deterministic ratio. The log-normal assumption is convenient because it respects non-negativity and, by construction, keeps the simulated mean close to the ACS point estimate while letting the MOE-driven uncertainty propagate into the price-to-income ratios.

Note: If \(X \sim \text{LogNormal}(\mu_{\log}, \sigma_{\log}^2)\), then \(E[X] = \exp(\mu_{\log} + \tfrac{1}{2}\sigma_{\log}^2)\). By choosing \(\mu_{\log} = \log(\hat{x}) - \tfrac{1}{2}\sigma_{\log}^2\), I anchor the simulated mean \(E[X]\) to the ACS point estimate \(\hat{x}\), while \(\sigma_{\log}\) (derived from the MOE) controls the spread. Without this adjustment, a log-normal with the same standard deviation would systematically overshoot the point estimate.
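
Here is a minimal sketch of that anchoring property, using made-up numbers rather than any specific tract: with \(\mu_{\log} = \log(\hat{x}) - \tfrac{1}{2}\sigma_{\log}^2\), the mean of the simulated draws should land very close to the point estimate.

set.seed(123)                                # illustrative seed
x_hat     <- 250000                          # hypothetical ACS point estimate (e.g., an MHV)
se        <- 30000 / 1.645                   # hypothetical MOE of $30,000 converted to an SE
sigma_log <- se / x_hat                      # approximate log-scale SD (coefficient of variation)
mu_log    <- log(x_hat) - 0.5 * sigma_log^2  # anchor E[X] at the point estimate

draws <- rlnorm(100000, meanlog = mu_log, sdlog = sigma_log)
mean(draws)                                  # lands very close to 250,000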

11.2 Simulation

set.seed(42069)
B_Ratio <- 1000  # Number of Monte Carlo draws per tract for the ratio

simulate_ratio_for_tract <- function(mhv_est, mhv_se, hhinc_est, hhinc_se) {
  
  # Handle NA or non-positive SEs with a small fallback so that
  # each tract still has a non-degenerate sampling distribution.
  # The log-scale SD is approximated by the coefficient of variation (SE / estimate).
  mhv_sd_log   <- ifelse(is.na(mhv_se)   | mhv_se   <= 0, 0.0001, mhv_se   / mhv_est)
  hhinc_sd_log <- ifelse(is.na(hhinc_se) | hhinc_se <= 0, 0.0001, hhinc_se / hhinc_est)

  # Choose log-means so the simulated mean matches the ACS point estimate
  mhv_log_mean   <- log(mhv_est)   - 0.5 * mhv_sd_log^2
  hhinc_log_mean <- log(hhinc_est) - 0.5 * hhinc_sd_log^2

  # Log-normal draws of MHV and HH Income
  mhv_draws   <- rlnorm(B_Ratio, meanlog = mhv_log_mean,   sdlog = mhv_sd_log)
  hhinc_draws <- rlnorm(B_Ratio, meanlog = hhinc_log_mean, sdlog = hhinc_sd_log)

  # Compute Price-to-Income ratio for all draws
  ratio_draws <- mhv_draws / hhinc_draws

  
  #Summarize Monte Carlo Distribution
  tibble(
    ratio_mean = mean(ratio_draws, na.rm = TRUE),
    ratio_sd   = sd(ratio_draws,   na.rm = TRUE),
    ratio_p10  = as.numeric(quantile(ratio_draws, 0.10, na.rm = TRUE, names = FALSE)),
    ratio_p90  = as.numeric(quantile(ratio_draws, 0.90, na.rm = TRUE, names = FALSE))
  )
}

# "as.numeric()" prevents the tables from blowing up in digit count


atl_ratio_mc <- atl_acs %>%
  group_by(GEOID) %>%
  group_modify(~ simulate_ratio_for_tract(
    mhv_est   = .x$mhv_est,
    mhv_se    = .x$mhv_se,
    hhinc_est = .x$hhinc_est,
    hhinc_se  = .x$hhinc_se
  )) %>%
  ungroup()

pander(
  head(atl_ratio_mc),
  caption = "First 6 Tracts: Monte Carlo Price-to-Income Ratio Summary"
)
First 6 Tracts: Monte Carlo Price-to-Income Ratio Summary

  GEOID         ratio_mean   ratio_sd   ratio_p10   ratio_p90
  13013180103      4.2         0.6         3.5         5
  13013180104      3.6         0.26        3.3         4
  13013180105      3           0.48        2.4         3.6
  13013180106      3.5         0.4         3           4
  13013180107      3           0.31        2.7         3.4
  13013180108      2.5         0.21        2.3         2.8

12 Combine ACS Point Estimates w/ Monte Carlo Summaries

# Merge ratio summaries with original ACS point estimates
atl_ratio_mc_full <- atl_acs %>%
  select(GEOID, mhv_est, hhinc_est) %>%
  left_join(atl_ratio_mc, by = "GEOID")

13 Monte Carlo Top-Quintile Classification (Across Tracts)

A common policy question is not just “what is the price-to-income ratio in this tract?” but “which tracts are truly in the top burden tier?” Because each tract’s ratio is uncertain, its rank within the metro distribution is also uncertain.

To quantify this, I simulate many “possible Atlantas”:

  1. For each simulation, I draw a burden ratio for every tract from a normal distribution parameterized by its tract-specific ratio_mean and ratio_sd.
  2. Within that simulated city, I compute the 80th percentile of the ratios across all tracts.
  3. I flag which tracts fall above this simulated 80th percentile threshold.

Repeating this process many times yields, for each tract, the share of simulations in which it appears in the top 20% of burden. I store this as prob_top_quintile. Instead of a hard label (“high burden” vs. “not high burden”), we now have a probability that each tract truly belongs to the high-burden group, given the combined uncertainty in its underlying ACS estimates.

# Simulate uncertainty for the top quintile burden class
n_tracts <- nrow(atl_ratio_mc_full)

# Create matrix of simulated ratio draws per tract
ratio_draws_mat <- matrix(NA_real_, nrow = n_tracts, ncol = B_Ratio)

for (i in seq_len(n_tracts)) {
  mu <- atl_ratio_mc_full$ratio_mean[i]
  sd <- atl_ratio_mc_full$ratio_sd[i]

  # Use a small fallback SD (1% of mean) if the Monte Carlo SD is zero or missing,
  # so that each tract still has a non-degenerate distribution in the ranking simulation.
  sd <- ifelse(is.na(sd) | sd <= 0, mu * 0.01, sd)

  ratio_draws_mat[i, ] <- rnorm(B_Ratio, mean = mu, sd = sd)
}

# 80th percentile cutoff for each simulation ACROSS tracts

top_quintile_cutoffs <- apply(
  ratio_draws_mat,
  2,
  quantile,
  probs = 0.80,
  na.rm = TRUE
)

# Probability of EACH tract being in the top quintile

top_quintile_prob <- numeric(n_tracts)

for (i in seq_len(n_tracts)) {
  top_quintile_prob[i] <- mean(
    ratio_draws_mat[i, ] > top_quintile_cutoffs,
    na.rm = TRUE
  )
}

atl_ratio_prob <- atl_ratio_mc_full %>%
  mutate(prob_top_quintile = top_quintile_prob)

14 Example Tract - Histogram for simulated Price-to-Income Ratios

example_idx <- 1 # change index to inspect different tracts

example_draws <- ratio_draws_mat[example_idx, ]

example_df <- tibble(ratio = example_draws)

ggplot(example_df, aes(x = ratio)) +
  geom_histogram(bins = 100, alpha = 0.7) +
  geom_vline(xintercept = mean(example_draws), linetype = "dashed") +
  labs(
    title = paste0(
      "Monte Carlo Price-to-Income Ratio\nTract: ",
      atl_ratio_prob$GEOID[example_idx]
    ),
    subtitle = paste0(
      "Mean = ", round(mean(example_draws), 2),
      ", SD = ", round(sd(example_draws), 2),
      ", P(top quintile) = ",
      round(atl_ratio_prob$prob_top_quintile[example_idx], 2)
    ),
    x = "Price-to-income ratio",
    y = "Count"
  ) +
  theme_minimal(base_size = 13)

Up to this point, Monte Carlo has played two distinct roles. First, the bootstrap simulations quantified how sensitive our regression-based predictions of MHV growth are to the particular sample of Atlanta tracts. Second, the ACS-based simulations turned each tract’s point estimates for MHV and income into a full distribution for its price-to-income ratio, driven by the published margins of error. The final step is to link these two pieces: combine historical MHV growth from the LTDB with current, uncertainty-inclusive affordability measures from the ACS so we can evaluate where rapid appreciation and high burden are most likely to coincide.

15 Joining ACS Monte Carlo Results to LTDB Atlanta Data

# Ensure tract keys line up: LTDB tractid is an 11-digit FIPS code
# Sometimes, issues arise with fips prefixes and hyphens, so this strips those
# and keeps Hunter sane
d <- d %>%
  mutate(
    GEOID = tractid |> 
      gsub("fips", "", x = _) |>
      gsub("-",    "", x = _)
  )
head(d$GEOID)
## [1] "01001020100" "01001020200" "01001020300" "01001020400" "01001020500"
## [6] "01001020600"
# Atlanta-only LTDB data with desired variables
atl_ltdb <- d %>%
  filter(cbsaname == "Atlanta-Sandy Springs-Marietta, GA") %>%
  select(
    GEOID,
    mhv.00, mhv.10, mhv.change, mhv.growth,
    pov.rate, p.vacant, p.prof,
    log_pov, log_pvacant, log_prof
  )

# Combine LTDB variables with price-to-income Monte Carlo results
atl_join <- atl_ltdb %>%
  inner_join(atl_ratio_prob, by = "GEOID")



summary(select(atl_join, mhv.growth, ratio_mean, prob_top_quintile))
##    mhv.growth        ratio_mean     prob_top_quintile
##  Min.   :-68.239   Min.   : 1.392   Min.   :0.0000   
##  1st Qu.: -6.495   1st Qu.: 3.171   1st Qu.:0.0000   
##  Median :  6.180   Median : 3.646   Median :0.0130   
##  Mean   : 10.201   Mean   : 4.131   Mean   :0.2136   
##  3rd Qu.: 21.253   3rd Qu.: 4.619   3rd Qu.:0.3435   
##  Max.   :145.764   Max.   :18.251   Max.   :1.0000

16 Correlation between MHV growth and Price-to-Income Ratio

cor_mhv_ratio <- cor(
  atl_join$mhv.growth,
  atl_join$ratio_mean,
  use = "complete.obs"
)

cor_mhv_ratio
## [1] 0.1770434

17 Colorful Scatterplot for MHV growth vs Price-to-Income Ratio

# Color scheme corresponds to probability of being high-burden
ggplot(atl_join, aes(
  x     = ratio_mean,
  y     = mhv.growth,
  color = prob_top_quintile
)) +
  geom_point(alpha = 0.7) +
  scale_color_viridis_c(option = "plasma") +
  labs(
    title    = "MHV Growth vs Price-to-Income Ratio (Atlanta Tracts)",
    subtitle = "Color indicates Monte Carlo probability of being in the top burden quintile",
    x     = "Monte Carlo mean price-to-income ratio",
    y     = "MHV growth 2000–2010 (%)",
    color = "P(top 20% burden)"
  ) +
  theme_light(base_size = 13)

18 Classifying tracts by the strength of their (high-burden) signal

atl_join <- atl_join %>%
  mutate(
    burden_class = case_when(
      prob_top_quintile >= 0.8 ~ "Robust high burden (P ≥ 0.8)",
      prob_top_quintile >= 0.5 ~ "Borderline high burden (0.5 ≤ P < 0.8)",
      TRUE                     ~ "Lower / uncertain burden (P < 0.5)"
    )
  )

burden_summary <- atl_join %>%
  group_by(burden_class) %>%
  summarize(
    n_tracts   = n(),
    mean_mhv_g = mean(mhv.growth, na.rm = TRUE),
    mean_ratio = mean(ratio_mean, na.rm = TRUE),
    .groups    = "drop"
  )

pander(
  burden_summary,
  caption = "Atlanta tracts by burden class (ACS Monte Carlo) and average MHV growth"
)
Atlanta tracts by burden class (ACS Monte Carlo) and average MHV growth

  burden_class                               n_tracts   mean_mhv_g   mean_ratio
  Borderline high burden (0.5 ≤ P < 0.8)         59         17           5.8
  Lower / uncertain burden (P < 0.5)            362          7.9         3.5
  Robust high burden (P ≥ 0.8)                   42         21           7.5

19 Map – ONLY Urban Atlanta Tracts

In this final step, I link together:

  1. Tract geometries for the Atlanta MSA counties (pulled via tigris),
  2. LTDB-based MHV growth for the urban Atlanta tracts, and
  3. The ACS-based Monte Carlo affordability estimates (ratio_mean and prob_top_quintile).

I restrict the maps to the LTDB-defined urban tracts so that the spatial analysis aligns with the regression sample. The first map visualizes the probability that each tract belongs to the top 20% affordability-burden tier (prob_top_quintile), while the second map shows the historical MHV growth over 2000–2010. Looked at together, these maps highlight places that both experienced rapid home value appreciation and are likely to exhibit high housing cost burden — neighborhoods that may be particularly salient for housing policy, displacement risk, and equity-oriented planning.

# Get tract geometries for all counties in the Atlanta MSA


atl_tract_list <- list()

for (st in unique(state_fips)) {
  counties_in_state <- county_fips[state_fips == st]

  atl_tract_list[[st]] <- tracts(
    state  = st,
    county = counties_in_state,
    year   = 2021,   # geo year
    cb     = TRUE,   # cartographic boundary
    class  = "sf"
  )
}

atl_tracts <- bind_rows(atl_tract_list) %>%
  st_transform(4326) %>%   # CRS for web maps
  select(GEOID, geometry)

# Restrict to *urban* LTDB Atlanta tracts only
# (atl_ltdb was created earlier from d and already filtered to Atlanta + urban)
atl_tracts_urban <- atl_tracts %>%
  inner_join(atl_ltdb %>% select(GEOID), by = "GEOID")


# Now join ACS + Monte Carlo + LTDB info
atl_map <- atl_tracts_urban %>%
  left_join(atl_join, by = "GEOID")

dim(atl_map)
## [1] 485  20
table(is.na(atl_map$prob_top_quintile))
## 
## FALSE  TRUE 
##   463    22
# Share of tracts with missing Monte Carlo burden
n_total <- nrow(atl_map)
n_na    <- sum(is.na(atl_map$prob_top_quintile))

pct_na_tracts <- 100 * n_na / n_total

pct_na_tracts
## [1] 4.536082

20 Map - Probability of top affordability burden

ggplot(atl_map) +
  geom_sf(aes(fill = prob_top_quintile), color = NA) +
  scale_fill_viridis_c(
    option   = "plasma",
    name     = "P(Top 20% burden)",
    na.value = "grey90"
  ) +
  labs(
    title    = "Probability of Being in the Top Affordability Burden Quintile",
    subtitle = "ACS-based Monte Carlo estimates for price-to-income ratio, Atlanta MSA tracts",
    x = NULL,
    y = NULL
  ) +
  theme_void(base_size = 12) +
  theme(
    legend.position = "right",
    plot.title      = element_text(face = "bold")
  )

21 Map - MHV Growth

ggplot(atl_map) +
  geom_sf(aes(fill = mhv.growth), color = NA) +
  scale_fill_viridis_c(
    option = "magma",
    name   = "MHV growth (%)"
  ) +
  labs(
    title    = "Median Home Value Growth 2000–2010",
    subtitle = "LTDB-based MHV growth for Atlanta urban tracts",
    x = NULL,
    y = NULL
  ) +
  theme_void(base_size = 12) +
  theme(
    legend.position = "right",
    plot.title      = element_text(face = "bold")
  )

22 Summary: What Monte Carlo Added Beyond a Standard Regression

Taken together, the steps in this code-through show what Monte Carlo methods contributed beyond a conventional “run one regression and report the coefficients” workflow:

  1. Bootstrap resampling turned a single set of regression predictions into full sampling distributions of predicted MHV growth for the two stylized tract profiles, with means, standard deviations, and quantiles.
  2. Simulating from the published ACS margins of error propagated measurement uncertainty into each tract’s price-to-income ratio, replacing a single deterministic value with a distribution of plausible affordability burdens.
  3. Repeated cross-tract ranking converted a hard “top quintile” label into a probability (prob_top_quintile) that each tract truly belongs to the high-burden group.

In short, Monte Carlo methods here are not a replacement for regression or ACS estimates, but a layer on top of them: they convert deterministic outputs into probabilistic statements that better reflect the uncertainty inherent in small-area neighborhood data.



23 Further Resources

Learn more about probabilistic risk assessment (PRA), Monte Carlo simulations, and the Strong Law of Large Numbers with the following:




24 Works Cited

This code-through references and cites the following sources:


Wagaman, A. S., & Dobrow, R. P. (2021). Probability: With Applications and R. John Wiley & Sons.

Rizzo, M. L. (2019). Statistical Computing with R (2nd ed.). CRC Press.