Decision Trees for Social Justice

Using rpart & rpart.plot to Uncover Housing Inequality in New Jersey

Ujjwal Acharya

2026-04-30

The Question & The Data

Are people of color being denied home mortgages at higher rates in New Jersey — and can a decision tree show us why?

HMDA (Home Mortgage Disclosure Act): a U.S. law requiring banks to publicly report every mortgage application, including race, income, loan amount, and outcome.

Our toolkit:

  • rpart — build the decision tree
  • rpart.plot — visualize it
  • 83,682 real NJ mortgage applications (2022)
hmda <- read.csv(
  "hmda_NJ_clean.csv",
  stringsAsFactors = TRUE
)
dim(hmda)
[1] 83682    17

loan_denied is our outcome: 1 = denied  |  0 = approved


The Social Justice Context

Black applicants are denied at nearly double the rate of White applicants in the raw data, before any adjustment for income.
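A comparison like this comes down to a group-wise mean of the 0/1 outcome. The sketch below uses a small invented data frame, not the HMDA file, and borrows the column names `derived_race` and `loan_denied` from the slides. The numbers are illustrative only.

```r
# Toy data, invented for illustration; NOT taken from the HMDA file.
toy <- data.frame(
  derived_race = c(rep("Black", 10), rep("White", 10)),
  loan_denied  = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0,   # 4 of 10 denied
                   1, 1, 0, 0, 0, 0, 0, 0, 0, 0)   # 2 of 10 denied
)

# The mean of a 0/1 outcome within a group is that group's denial rate.
denial_rate <- tapply(toy$loan_denied, toy$derived_race, mean)
denial_rate

# Ratio of the two rates: a simple disparity measure.
rate_ratio <- unname(denial_rate["Black"] / denial_rate["White"])
rate_ratio
```

On the real data the same two lines would run against `hmda` instead of `toy`.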

What Is rpart?

rpart = Recursive Partitioning and Regression Trees

  • Written at the Mayo Clinic (Therneau & Atkinson)
  • One of the most downloaded R packages ever
  • Handles classification and regression
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

How it works — yes/no questions:

Is debt ratio > 50%?
├─ YES → Is income < $60k?
│        ├─ YES → DENIED (72%)
│        └─ NO  → review (41%)
└─ NO  → Is race = Black?
          ├─ YES → DENIED (18%)
          └─ NO  → APPROVED (9%)

Note

rpart automatically finds the best splits to separate your outcome — no manual tuning needed to get started.
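For classification trees, rpart's default splitting criterion is the Gini index: a split is good when the child nodes are purer than the parent. The sketch below scores one candidate question by hand. The labels, the `debt_ratio` values, and the helper `gini()` are all hypothetical, and this is a simplified picture rather than rpart's internal code.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Toy labels and a toy numeric predictor (hypothetical values).
y <- c("Denied", "Denied", "Denied", "Approved",
       "Approved", "Approved", "Approved", "Approved")
debt_ratio <- c(55, 62, 58, 30, 25, 40, 52, 35)

# Score the candidate question "debt_ratio > 50?".
left  <- y[debt_ratio > 50]    # answers YES
right <- y[debt_ratio <= 50]   # answers NO

# Impurity of the children, weighted by how many rows land in each.
weighted_child <- (length(left) * gini(left) +
                   length(right) * gini(right)) / length(y)

# Impurity reduction: larger means a better split.
gain <- gini(y) - weighted_child
gain
```

A tree builder repeats this scoring over many candidate questions and keeps the one with the largest reduction, then recurses into each child.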

Core Syntax & Building the Tree

rpart() key arguments:

rpart(
  formula, # outcome ~ predictors
  data,
  method,  # "class" or "anova"
  control = rpart.control(
    cp        = 0.01, # min fit gain to keep a split
    minsplit  = 20,   # min obs in a node to try a split
    minbucket = 7     # min obs in any leaf
  )
)
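To see what `cp` does in practice, the sketch below fits two trees that differ only in `cp`. It uses the `kyphosis` dataset that ships with rpart as a stand-in for the HMDA data. The helper `count_leaves()` is hypothetical, written only for this comparison.

```r
library(rpart)

# Hypothetical helper: count terminal nodes of a fitted tree.
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# A high cp demands a large improvement per split, so the tree stays small.
small <- rpart(Kyphosis ~ Age + Number + Start,
               data = kyphosis, method = "class",
               control = rpart.control(cp = 0.1))

# A low cp accepts weaker splits, so the tree can grow larger.
large <- rpart(Kyphosis ~ Age + Number + Start,
               data = kyphosis, method = "class",
               control = rpart.control(cp = 0.001))

c(small = count_leaves(small), large = count_leaves(large))
```

Growing a large tree with a low `cp` and pruning it back afterwards is the pattern the later slides follow.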

Tip

method = "class" → categories ✅
method = "anova" → numbers
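A short sketch of the two methods side by side, using datasets bundled with R (`kyphosis` from rpart and `mtcars` from base R) rather than the HMDA file:

```r
library(rpart)

# Classification: the outcome is a factor (absent / present).
fit_class <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis, method = "class")

# Regression: the outcome is numeric (miles per gallon).
fit_anova <- rpart(mpg ~ wt + hp,
                   data = mtcars, method = "anova")

# Class trees can return predicted labels; anova trees return numbers.
pred_class <- predict(fit_class, type = "class")
pred_anova <- predict(fit_anova)

head(pred_class)
head(pred_anova)
```

When `method` is omitted, rpart guesses it from the outcome's type, so a 0/1 outcome stored as a number would be fit as a regression. Converting the outcome to a factor and setting `method = "class"` avoids that.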

Prepare & build:

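library(dplyr)  # provides filter() and mutate()

hmda_model <- hmda |>
  # keep rows with a usable debt-to-income bracket
  filter(debt_to_income_ratio %in%
    c("<20%","20%-<30%","30%-<36%",
      "36%-<50%","50%-60%",">60%")) |>
  # recode 0/1 as a labeled factor for method = "class"
  mutate(loan_denied = factor(
    loan_denied,
    levels = c(0, 1),
    labels = c("Approved","Denied")))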

set.seed(42)
tree1 <- rpart(
  loan_denied ~ derived_race + income +
    loan_amount + debt_to_income_ratio +
    tract_minority_population_percent,
  data   = hmda_model,
  method = "class",
  control = rpart.control(
    cp = 0.002, minsplit = 50))

Visualizing with rpart.plot()

🔵 Blue = predicted Approved
🔴 Red = predicted Denied

Each box shows: predicted class | denial probability | % of data in node
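A plot with this layout can be produced with a call along the following lines. This is a sketch on the bundled `kyphosis` data, not the exact call behind the slide. It assumes `extra = 106` (probability of the second class plus percentage of observations, for two-class outcomes) and the two-color palette name `"BuRd"`, which should shade low probabilities blue and high probabilities red.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")

# rpart.plot is a separate package; skip the plot if it is not installed.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(
    fit,
    extra       = 106,    # prob. of 2nd class + % of observations
    box.palette = "BuRd"  # assumed palette: blue = low, red = high
  )
}
```

For the mortgage tree the first argument would be `tree1` or `tree_pruned`, where the second class is "Denied".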

Pruning & Variable Importance

Prune to prevent overfitting:

best_cp <- tree1$cptable[
  which.min(tree1$cptable[,"xerror"]),
  "CP"]

tree_pruned <- prune(tree1, cp = best_cp)

Note

printcp(tree1) shows the full complexity table. Pick the cp where xerror is lowest for the best-generalizing tree.
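Picking the minimum-xerror row is the approach used here. A common alternative, not used in these slides, is the 1-SE rule: take the simplest tree whose cross-validated error is within one standard error of the minimum. The sketch below compares both on the bundled `kyphosis` data. The helper `count_leaves()` is hypothetical.

```r
library(rpart)

set.seed(42)  # cross-validation folds are random
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 10))

cpt <- fit$cptable  # columns: CP, nsplit, rel error, xerror, xstd

# Rule 1: row with the lowest cross-validated error.
i_min <- which.min(cpt[, "xerror"])

# Rule 2 (1-SE): first, i.e. simplest, row within one SE of that minimum.
threshold <- cpt[i_min, "xerror"] + cpt[i_min, "xstd"]
i_1se <- which(cpt[, "xerror"] <= threshold)[1]

pruned_min <- prune(fit, cp = cpt[i_min, "CP"])
pruned_1se <- prune(fit, cp = cpt[i_1se, "CP"])

# Hypothetical helper: count terminal nodes.
count_leaves <- function(f) sum(f$frame$var == "<leaf>")
c(min_xerror = count_leaves(pruned_min), one_se = count_leaves(pruned_1se))
```

The 1-SE tree is never larger than the minimum-xerror tree, which can make it easier to read on a slide at a small cost in cross-validated error.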

ggplot: Denial by Race & Income Bracket

Even above $200k income, Black applicants are denied more often than White applicants — income alone does not explain the gap.
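The table behind a chart like this is a denial rate per race and income bracket. The sketch below builds one from invented toy data, not the HMDA file. It assumes `income` is recorded in thousands of dollars, and the bracket cut points are hypothetical choices.

```r
# Toy data, invented for illustration; NOT taken from the HMDA file.
toy <- data.frame(
  derived_race = rep(c("Black", "White"), each = 6),
  income       = rep(c(45, 80, 150, 250, 300, 60), times = 2),  # assumed $ thousands
  loan_denied  = c(1, 0, 0, 1, 0, 1,
                   1, 0, 0, 0, 0, 0)
)

# Bin income into brackets (hypothetical cut points).
toy$income_bracket <- cut(
  toy$income,
  breaks = c(0, 100, 200, Inf),
  labels = c("<$100k", "$100k-<$200k", "$200k+"),
  right  = FALSE
)

# Denial rate for every race x bracket combination.
rates <- aggregate(loan_denied ~ derived_race + income_bracket,
                   data = toy, FUN = mean)
rates

# Optional grouped bar chart; skipped if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(rates, aes(x = income_bracket, y = loan_denied,
                         fill = derived_race)) +
    geom_col(position = "dodge") +
    labs(x = "Income bracket", y = "Denial rate", fill = "Race")
  print(p)
}
```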

Predictions & Accuracy

preds <- predict(
  tree_pruned,
  hmda_model,
  type = "class"
)

conf <- table(
  Predicted = preds,
  Actual    = hmda_model$loan_denied
)
conf
          Actual
Predicted  Approved Denied
  Approved    28075   3861
  Denied        662   4159
acc <- sum(diag(conf)) / sum(conf)
cat("Accuracy:",
    round(acc * 100, 1), "%\n")
Accuracy: 87.7 %
sens <- conf["Denied","Denied"] /
        sum(conf[,"Denied"])
cat("Denial Detection:",
    round(sens * 100, 1), "%\n")
Denial Detection: 51.9 %
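These predictions are made on the same rows the tree was fit on, so the figures describe in-sample fit. A held-out test set gives a more honest estimate. The sketch below shows one way to do that on the bundled `kyphosis` data. The 70/30 split is an arbitrary choice for illustration.

```r
library(rpart)

set.seed(42)
n        <- nrow(kyphosis)
test_idx <- sample(n, size = round(0.3 * n))  # hold out about 30% of rows

train <- kyphosis[-test_idx, ]
test  <- kyphosis[test_idx, ]

# Fit on the training rows only.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = train, method = "class")

# Evaluate on rows the tree has never seen.
pred <- predict(fit, newdata = test, type = "class")
conf_test <- table(Predicted = pred, Actual = test$Kyphosis)
conf_test

acc_test <- sum(diag(conf_test)) / sum(conf_test)
acc_test
```

Applied to the mortgage data, the same pattern would split `hmda_model` before calling `rpart()` and report accuracy and sensitivity on the held-out part.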

Warning

High overall accuracy can mask poor detection of denials. Always check sensitivity — not just accuracy.
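The confusion matrix above supports a few more checks. The sketch below re-enters the four counts by hand and adds specificity, precision, and a baseline: the accuracy of a model that predicts "Approved" for everyone.

```r
# Confusion matrix copied from the output shown above.
conf <- matrix(
  c(28075, 662, 3861, 4159), nrow = 2,
  dimnames = list(Predicted = c("Approved", "Denied"),
                  Actual    = c("Approved", "Denied"))
)

accuracy    <- sum(diag(conf)) / sum(conf)
sensitivity <- conf["Denied", "Denied"] / sum(conf[, "Denied"])       # denials caught
specificity <- conf["Approved", "Approved"] / sum(conf[, "Approved"]) # approvals caught
precision   <- conf["Denied", "Denied"] / sum(conf["Denied", ])       # predicted denials that were real
baseline    <- sum(conf[, "Approved"]) / sum(conf)                    # always predict "Approved"

round(100 * c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              baseline = baseline), 1)
# accuracy 87.7, sensitivity 51.9, specificity 97.7, precision 86.3, baseline 78.2
```

Because about 78% of these applications were approved, a model that never predicts a denial already scores 78.2% accuracy while catching no denials at all. That is why the 87.7% figure needs the 51.9% sensitivity beside it.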

Questions?


Packages: rpart · rpart.plot
Data: HMDA 2022, New Jersey (83,682 applications)
Source: Consumer Financial Protection Bureau


Resources:

  • rpart docs: cran.r-project.org/package=rpart
  • rpart.plot guide: milbo.org/rpart-plot/prp.pdf
  • HMDA data browser: ffiec.cfpb.gov/data-browser