The Question & The Data
Are people of color being denied home mortgages at higher rates in New Jersey — and can a decision tree show us why?
HMDA (Home Mortgage Disclosure Act): a U.S. law requiring covered lenders to publicly report each mortgage application, including race, income, loan amount, and outcome.
Our toolkit:
rpart — build the decision tree
rpart.plot — visualize it
83,682 real NJ mortgage applications (2022)
hmda <- read.csv(
  "hmda_NJ_clean.csv",
  stringsAsFactors = TRUE
)
dim(hmda)
loan_denied is our outcome: 1 = denied | 0 = approved
Welcome everyone. Today’s presentation is about using two R packages — rpart and rpart.plot — to build and visualize decision trees. But more importantly, it is about using those tools to ask a real, urgent question: are people of color being systematically denied home mortgages in New Jersey at higher rates than White applicants?
To answer that, we are using data from the Home Mortgage Disclosure Act, or HMDA. This is a federal law that requires covered banks and lenders to publicly report the mortgage applications they receive, including the applicant’s race, income, loan amount, and the final decision: approved or denied. This makes it one of the most powerful public datasets for studying racial discrimination in housing.
The dataset we are working with has been cleaned and filtered to 83,682 mortgage applications filed in New Jersey in 2022. That is a real, substantial dataset — not a toy example. On the right you can see us loading it with a simple read.csv call. The dim() function confirms the rows and columns.
Our outcome variable is loan_denied: a 1 means the application was denied, and a 0 means it was approved. Everything we build today will be predicting that variable.
The Social Justice Context
Black applicants are denied at nearly double the rate of White applicants — even before controlling for income.
Before we even build a single model, let us look at what the raw data is already telling us. This bar chart shows the mortgage denial rate broken down by race across all 83,682 applications in New Jersey in 2022.
Start at the top of the chart. Native Hawaiian and Pacific Islander applicants have the highest denial rate at around 33 percent — though their sample size is relatively small. American Indian or Alaska Native applicants are next. Then look at the red bar: Black or African American applicants are denied at 22.8 percent.
Now look at the bottom. White applicants are denied at 12.6 percent. That is not much more than half the rate of Black applicants. Asian applicants sit at about 13.8 percent, just slightly above White applicants.
The critical thing to understand here is that this chart has not yet controlled for anything: not income, not loan amount, not debt-to-income ratio. This is purely the raw denial rate by race. A skeptic might say: maybe Black applicants simply have lower incomes or higher debt ratios that explain the gap. That is exactly what we will investigate using our decision tree. If race still shows up as a predictor even after the model accounts for the financial variables we give it, then we have evidence that something beyond the measured financials is driving these decisions.
This chart was built using ggplot2, and it is what motivates everything we do next.
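As a minimal sketch of the calculation behind a chart like this, the raw denial rate per group is just the mean of the 0/1 outcome within each group. The column names mirror the HMDA fields used in this deck, but the six rows below are invented for illustration and are not the real data.

```r
# Hypothetical toy data: invented rows, not real HMDA records.
toy <- data.frame(
  derived_race = c("White", "White", "White",
                   "Black or African American",
                   "Black or African American",
                   "Asian"),
  loan_denied  = c(0, 0, 1, 1, 0, 0)
)

# The mean of a 0/1 outcome within each group is that group's denial rate.
raw_rates <- tapply(toy$loan_denied, toy$derived_race, mean)

# Sort from highest to lowest, the order a ranked bar chart would use.
raw_rates <- sort(raw_rates, decreasing = TRUE)
round(raw_rates * 100, 1)
```

On the real data the same two lines would run on hmda$loan_denied and hmda$derived_race, assuming the outcome is still coded 0/1 at that point.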
What Is rpart?
rpart = Recursive Partitioning and Regression Trees
Written at the Mayo Clinic (Therneau & Atkinson)
One of the most downloaded R packages ever
Handles classification and regression
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
How it works — yes/no questions:
Is debt ratio > 50%?
├─ YES → Is income < $60k?
│        ├─ YES → DENIED (72%)
│        └─ NO  → review (41%)
└─ NO  → Is race = Black?
         ├─ YES → DENIED (18%)
         └─ NO  → APPROVED (9%)
rpart automatically finds the best splits to separate your outcome — no manual tuning needed to get started.
Now let us talk about the star of today’s presentation: the rpart package.
rpart stands for Recursive Partitioning and Regression Trees. It was originally written by Terry Therneau and Beth Atkinson at the Mayo Clinic — yes, the medical research center — because decision trees are incredibly useful in clinical medicine for things like diagnosing disease based on patient characteristics. But the package has become a foundational tool across data science disciplines: social science, ecology, business, public health, and more. It is one of the most downloaded packages in R’s entire history.
rpart works by asking a series of yes or no questions about your data, one at a time, to progressively separate your outcome into purer and purer groups. The pseudocode on the right illustrates this. At the top — the root — it might ask: is the applicant’s debt-to-income ratio above 50%? If yes, it goes left and asks the next best question. If no, it goes right and asks a different question. This continues until it reaches a leaf node — a final prediction.
The algorithm automatically searches through every possible variable and every possible threshold to find the split that best separates approvals from denials at each step. You do not have to tell it which variables matter. It figures that out from the data itself.
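To make that split search concrete, here is a simplified sketch for a single numeric predictor. It scores every candidate threshold by the weighted Gini impurity of the two resulting groups, which is the default criterion rpart uses for classification. The helper names gini() and best_split() and the toy data are hypothetical; the real rpart implementation also handles categorical predictors, missing values, and more.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Try every midpoint between sorted unique values of a numeric predictor x
# and return the threshold with the lowest weighted child impurity.
best_split <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2
  score <- sapply(cuts, function(cut) {
    left  <- y[x <  cut]
    right <- y[x >= cut]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = cuts[which.min(score)], impurity = min(score))
}

# Hypothetical toy data: debt ratio (percent) and loan outcome.
debt_ratio <- c(15, 22, 31, 38, 47, 52, 58, 65)
outcome    <- c("Approved", "Approved", "Approved", "Approved",
                "Approved", "Denied", "Denied", "Denied")

res <- best_split(debt_ratio, outcome)
res  # threshold 49.5 separates the two classes perfectly (impurity 0)
```

A full tree repeats this search over every predictor at every node, then recurses into the two child groups.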
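To use it, you install rpart and its companion rpart.plot with the two install.packages calls shown here, then load them with library. rpart ships with standard R installations as a recommended package, so in practice only rpart.plot usually needs installing. That is all the setup you need.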
Core Syntax & Building the Tree
rpart() key arguments:
rpart(
  formula,           # outcome ~ predictors
  data,              # data frame holding those columns
  method,            # "class" or "anova"
  control = rpart.control(
    cp        = 0.01,  # complexity parameter (default)
    minsplit  = 20,    # min obs in a node to attempt a split
    minbucket = 7      # min obs in any leaf
  )
)
method = "class" → categories ✅
method = "anova" → numbers
Prepare & build:
library(dplyr)  # provides filter() and mutate()

# Keep rows with a clean categorical debt-to-income value and
# turn the 0/1 outcome into a labeled factor.
hmda_model <- hmda |>
  filter(debt_to_income_ratio %in%
           c("<20%", "20%-<30%", "30%-<36%",
             "36%-<50%", "50%-60%", ">60%")) |>
  mutate(loan_denied = factor(
    loan_denied,
    levels = c(0, 1),
    labels = c("Approved", "Denied")))

set.seed(42)  # rpart's cross-validation folds are random
tree1 <- rpart(
  loan_denied ~ derived_race + income +
    loan_amount + debt_to_income_ratio +
    tract_minority_population_percent,
  data    = hmda_model,
  method  = "class",
  control = rpart.control(
    cp = 0.002, minsplit = 50))
Now let us look at how rpart is actually called in R. The core function is rpart() and it follows a familiar formula-based syntax that you have likely seen in other R modeling functions like lm() or glm().
The formula argument is written as outcome tilde predictors — so loan_denied on the left, and all the predictor variables on the right separated by plus signs. The data argument is simply your data frame. The method argument is the most important one to choose deliberately: “class” tells rpart you are predicting a category — in our case, Approved versus Denied. If you were predicting a continuous number like loan amount, you would use “anova” instead.
The control argument lets you tune how the tree grows. The cp parameter, short for complexity parameter, controls how large the tree gets: a split is kept only if it improves the overall fit by at least a factor of cp. A smaller cp allows more splits and a more complex tree; a larger cp forces early stopping and a simpler tree. The minsplit argument says: only attempt a split if there are at least this many observations at that node. The minbucket argument sets the minimum number of observations allowed in any leaf.
On the right, you can see the actual code we use on our HMDA data. We first filter the debt_to_income_ratio to keep only the clean categorical values and convert loan_denied into a proper factor with meaningful labels. Then we set a seed for reproducibility and call rpart(). Notice we include tract_minority_population_percent — the percentage of minority residents in the applicant’s census tract — as one of our predictors. This captures neighborhood-level effects and helps us test whether redlining patterns persist in modern lending.
We set cp to 0.002, which is relatively small, allowing the tree to grow enough to reveal race-related patterns that a very simple tree might miss.
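The effect of cp is easy to see on a small example. The sketch below uses the kyphosis data that ships with rpart as a stand-in for the HMDA data, and the two cp values are chosen only for illustration. The helper n_splits() is hypothetical; it counts internal nodes in the fitted tree.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the HMDA data here.
set.seed(42)
shallow <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class", control = rpart.control(cp = 0.1))
deep    <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class", control = rpart.control(cp = 0.001))

# Internal nodes are the rows of $frame that are not marked "<leaf>".
n_splits <- function(fit) sum(fit$frame$var != "<leaf>")

# The smaller cp should allow at least as many splits as the larger one.
c(shallow = n_splits(shallow), deep = n_splits(deep))
```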
Visualizing with rpart.plot()
🔵 Blue = predicted Approved
🔴 Red = predicted Denied
Each box shows: predicted class | denial probability | % of data in node
This is the heart of the presentation — the decision tree itself, visualized using rpart.plot.
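Let me walk you through how to read it. Start at the very top node, the root. This represents 100% of the data the tree was trained on: the applications that survive the debt-to-income filter (36,757 of them, going by the confusion matrix later in the deck), not the full 83,682. The box shows the predicted class for that group, the probability of denial, and the percentage of the data sitting in that node.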
From the root, the tree makes its first split — its single most powerful question for separating approvals from denials. Follow the branches downward. At each internal node, if the condition is true you go left; if false you go right. The branches keep splitting until you reach the leaf nodes at the bottom, which are the final predictions.
Pay close attention to where the tree uses race or tract_minority_population_percent as a splitting variable. When race appears as a split, it means that within a subgroup already defined by the splits above it (on debt-to-income ratio, income, or loan amount, for example), race still carries additional predictive information about denial. That is a meaningful finding.
The blue nodes predict Approved; the red nodes predict Denied. The deeper the red, the higher the denial probability in that group.
rpart.plot() is what turns the rpart model object into this readable diagram. The type = 4 argument labels every node, not just the leaves. The extra = 104 argument displays the probability of each class and the share of total data in each node. The fallen.leaves = TRUE argument aligns all leaf nodes at the same level at the bottom, making the tree much easier to read. And tweak scales the font size inside the nodes.
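A call along the following lines would produce a plot with the arguments just described. This is a sketch, not the exact call behind the slide: the kyphosis data stands in for the HMDA model, and the box.palette and tweak values are assumptions chosen to match the blue and red coloring.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the HMDA model here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Only draw the tree if rpart.plot is installed.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(
    fit,
    type          = 4,       # label all nodes, not just the leaves
    extra         = 104,     # class probabilities plus percent of observations
    fallen.leaves = TRUE,    # line the leaves up along the bottom
    box.palette   = "BuRd",  # assumed palette: blue-to-red shading
    tweak         = 1.2      # assumed value: enlarge the text by 20 percent
  )
}
```

For the deck's own model, tree1 or tree_pruned would take the place of fit.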
Pruning & Variable Importance
Prune to prevent overfitting:
best_cp <- tree1$cptable[
  which.min(tree1$cptable[, "xerror"]),
  "CP"]
tree_pruned <- prune(tree1, cp = best_cp)
printcp(tree1) shows the full complexity table. Pick the cp where xerror is lowest for the best-generalizing tree.
Once we have built our initial tree, we need to address a common problem in decision trees called overfitting. An overfitted tree has memorized the training data so well that it captures noise rather than real patterns, which means it will perform poorly on new data it has never seen.
The solution is pruning — trimming back branches of the tree that do not meaningfully improve predictions. rpart does this using the complexity parameter, or cp. Every row of the cp table — accessed with tree1$cptable — shows how the tree’s cross-validated error, the xerror column, changes as the tree grows more complex. We want the cp value that corresponds to the lowest xerror. That is what the code on the left does: it finds that row automatically with which.min() and then passes that optimal cp value to the prune() function, which trims the tree accordingly.
You can always inspect the full cp table by calling printcp(tree1) directly — this shows you each split level, the relative error, and the cross-validated error side by side.
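The pruning steps above can be tried end to end on data that ships with rpart. In this sketch the kyphosis data stands in for the HMDA data, and the control values are chosen only so that the tree grows large enough to be worth pruning.

```r
library(rpart)

set.seed(42)  # cross-validation folds are random, so fix the seed
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(cp = 0.001, minsplit = 10))

printcp(fit)  # table columns: CP, nsplit, rel error, xerror, xstd

# Find the row with the lowest cross-validated error and prune to its CP.
best_row <- which.min(fit$cptable[, "xerror"])
best_cp  <- fit$cptable[best_row, "CP"]
pruned   <- prune(fit, cp = best_cp)

# Compare leaf counts; pruning can only remove leaves, never add them.
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(full = n_leaves(fit), pruned = n_leaves(pruned))
```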
On the right is the variable importance plot. This is one of rpart’s most powerful features. After building the tree, rpart adds up how much each variable improved the fit across all the splits where it was used, and also gives it partial credit wherever it served as a surrogate (backup) split. The longer the bar, the more that variable mattered to the tree’s decisions overall, even if it only appeared in a few branches.
Look at which variables score highest. If debt_to_income_ratio and income are at the top, that is expected — they are the strongest financial signals. But if derived_race or tract_minority_population_percent appear high on this chart, it tells us that race and neighborhood racial composition are doing real predictive work in this model, above and beyond what the financial variables alone would predict.
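The importance scores live in the fitted object's variable.importance component. The sketch below reads them from a small tree fitted on the kyphosis data (a stand-in for the HMDA model) and draws a base R bar chart. The rescaling to percentages is a presentation choice, not something rpart does itself.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the HMDA model here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Named numeric vector of importance scores, sorted high to low.
imp <- sort(fit$variable.importance, decreasing = TRUE)

# Rescale to percentages so the bars are easier to compare.
imp_pct <- round(100 * imp / sum(imp), 1)
imp_pct

# Horizontal bar chart; rev() puts the most important variable on top.
barplot(rev(imp_pct), horiz = TRUE, las = 1,
        xlab = "Share of total importance (%)")
```

For the deck's model, the same lines would read tree_pruned$variable.importance.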
ggplot: Denial by Race & Income Bracket
Even above $200k income, Black applicants are denied more often than White applicants: income alone does not explain the gap.
This chart is one of the most powerful visuals in the presentation, and I want to spend a moment walking through it carefully because the story it tells is striking.
On the x-axis we have five income brackets, running from below $50,000 on the left all the way to above $200,000 on the right. On the y-axis we have the denial rate — the proportion of applications that were denied. There are three colored lines: blue for White applicants, green for Asian applicants, and red for Black or African American applicants.
The first thing to notice is that all three lines slope downward from left to right. That makes intuitive sense — higher income applicants are denied less often across all racial groups. Financial circumstances do matter.
But look at what does not happen: the red line never meets the blue line. At every single income bracket — from the lowest earners below $50k all the way up to the highest earners above $200k — Black applicants face a higher denial rate than White applicants with equivalent income. The gap does not close as income rises.
At the lowest income bracket, the gap is dramatic. But even at the highest income bracket — applicants earning more than $200,000 per year — Black applicants are still denied at a meaningfully higher rate than White applicants at the same income level.
This is the core finding: income alone cannot explain the racial disparity. If the gap were purely about income, it should disappear once we compare applicants within the same income bracket. It does not. This is what the decision tree is helping us quantify and visualize, and it is what makes this analysis socially meaningful, not just a technical exercise.
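The numbers behind a chart like this come from grouping by race and income bracket and taking the mean of the 0/1 outcome in each cell. The sketch below shows that step in base R. The rows are invented, the bracket break points are assumptions, and income is assumed to be recorded in thousands of dollars.

```r
# Hypothetical toy data; income is assumed to be in thousands of dollars.
toy <- data.frame(
  derived_race = rep(c("White", "Black or African American"), each = 4),
  income       = c(40, 45, 250, 300, 40, 45, 250, 300),
  loan_denied  = c(1, 0, 0, 0, 1, 1, 1, 0)
)

# Bin income into five brackets; the break points are assumed.
toy$income_bracket <- cut(
  toy$income,
  breaks = c(0, 50, 100, 150, 200, Inf),
  labels = c("<$50k", "$50k-100k", "$100k-150k", "$150k-200k", ">$200k"),
  right  = FALSE
)

# Denial rate within each race x bracket cell (empty cells are dropped).
rates <- aggregate(loan_denied ~ derived_race + income_bracket,
                   data = toy, FUN = mean)
rates
```

A data frame shaped like rates, with one row per race and bracket, is the form ggplot2 expects for one line per racial group.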
Predictions & Accuracy
# Predicted class labels from the pruned tree
preds <- predict(
  tree_pruned,
  hmda_model,
  type = "class"
)

# Confusion matrix: rows = predicted, columns = actual
conf <- table(
  Predicted = preds,
  Actual    = hmda_model$loan_denied
)
conf
           Actual
Predicted   Approved Denied
  Approved     28075   3861
  Denied         662   4159

# Overall accuracy: correct predictions / all predictions
acc <- sum(diag(conf)) / sum(conf)
cat("Accuracy:",
    round(acc * 100, 1), "%\n")

# Sensitivity: share of actual denials the model flagged
sens <- conf["Denied", "Denied"] /
  sum(conf[, "Denied"])
cat("Denial Detection:",
    round(sens * 100, 1), "%\n")
High overall accuracy can mask poor detection of denials. Always check sensitivity — not just accuracy.
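Once we have a pruned tree, we can use it to make predictions with the predict() function. We pass it the pruned tree, the dataset, and type = “class” to get predicted class labels, Approved or Denied, rather than raw probabilities. Note that we are predicting on the same data the tree was trained on, so the numbers below are in-sample and will tend to flatter the model.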
The confusion matrix on the left is the standard tool for evaluating a classification model. Read it like a grid: the rows are what the model predicted, and the columns are what actually happened. The top-left cell shows correct approvals — cases where the model predicted Approved and the application was actually approved. The bottom-right cell shows correct denials. The off-diagonal cells are the errors: top-right is cases the model predicted Approved but were actually Denied — these are false negatives — and bottom-left is cases the model predicted Denied but were actually Approved — false positives.
Overall accuracy is the sum of correct predictions divided by the total. From the matrix above it works out to about 87.7 percent, which looks impressive. But here is the important caveat I want to highlight: overall accuracy can be misleading when your outcome classes are imbalanced, and ours are. About 78 percent of the applications in the matrix were approved, so a model that simply predicted “Approved” for everyone would still score about 78 percent accuracy, yet it would be completely useless for detecting the very thing we care about: denials.
That is why we also compute sensitivity, also called the recall or true positive rate for the denial class. This tells us: of all the applications that were actually denied, what proportion did the model correctly flag? A model that misses most real denials is failing at its most socially important task, regardless of how its overall accuracy looks.
Always report both when presenting model results to a non-technical audience.
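The metrics discussed above can be recomputed directly from the four counts printed on the slide, without access to the model or the data. The sketch below rebuilds the matrix by hand. The specificity and baseline lines are additions for comparison and are not part of the original code.

```r
# Rebuild the confusion matrix printed on the slide (rows = predicted).
conf <- matrix(
  c(28075, 662,    # column 1: actually Approved
    3861,  4159),  # column 2: actually Denied
  nrow = 2,
  dimnames = list(Predicted = c("Approved", "Denied"),
                  Actual    = c("Approved", "Denied"))
)

accuracy    <- sum(diag(conf)) / sum(conf)
sensitivity <- conf["Denied", "Denied"]     / sum(conf[, "Denied"])
specificity <- conf["Approved", "Approved"] / sum(conf[, "Approved"])

# Accuracy of a model that predicts "Approved" for every application.
baseline <- sum(conf[, "Approved"]) / sum(conf)

round(100 * c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline), 1)
# accuracy 87.7, sensitivity 51.9, specificity 97.7, baseline 78.2
```

Read together, these say the tree beats the always-Approved baseline by roughly ten points of accuracy while catching only about half of the actual denials.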
Questions?
Packages: rpart · rpart.plot
Data: HMDA 2022, New Jersey (83,682 applications)
Source: Consumer Financial Protection Bureau
rpart docs
cran.r-project.org/package=rpart
rpart.plot guide
milbo.org/rpart-plot/prp.pdf
HMDA data browser
ffiec.cfpb.gov/data-browser
Thank you for your time and attention. I want to leave you with a few closing thoughts before we open it up for questions.
We used two R packages today — rpart to build decision trees and rpart.plot to visualize them — and we applied them to a real, publicly available dataset of 83,682 mortgage applications in New Jersey. The analysis consistently showed that race plays a role in mortgage denial that cannot be fully explained by financial variables like income, loan amount, or debt-to-income ratio. The decision tree found race as a splitting variable even after accounting for those financial predictors, the variable importance chart placed neighborhood racial composition among the most influential features, and the ggplot line chart showed that the racial gap in denial rates persists at every income level — even above $200,000.
rpart and rpart.plot are powerful precisely because they are interpretable. Unlike black-box models, a decision tree can be shown to a policymaker, a journalist, a community organizer, or a judge and they can follow the logic step by step. That interpretability is not just a technical convenience — it is essential for using data science in the service of justice and accountability.
All three resources listed here are freely available. The HMDA data browser lets you download this data for any state or city in the country. The rpart and rpart.plot documentation gives you the full reference for every argument we used today and many more.
I am happy to take any questions about the methodology, the packages, the data, or the findings.