About

In this code-through, I show how to use ACS tract level data, spatial tools, and model based clustering to build a housing and transportation “stress” index for Essex County, New Jersey. I will show how to pull Census variables, create indicators that capture the following: vacancy, rent, commute time, and vehicle access. Then I will combine them into a latent construct of neighborhood stress. After, I will map this index with sf, ggplot, and mclust. My goal is to uncover distinct neighborhood types based on indicators. I aim to help the urban policymakers in the area that I grew up in, and I want to demonstrate how to generate and interpret neighborhood typologies that can potentially guide necessary change.

library(tidycensus)
library(dplyr)
library(sf)
library(ggplot2)
library(scales)
library(cluster)
library(pander)
library(mclust)
library(factoextra)
#add your census key here

data.dictionary <- load_variables(2023, "acs5", cache = TRUE)

glimpse(data.dictionary)


Step 1: Get ACS Tract Data for Essex County, NJ

Each of my ACS indicators is meant to capture a different dimension of structural stress in the local housing and transportation system set up in Essex County, NJ. The higher vacancies can indicate weak housing demand in an area. This can lead to other findings like few amenities offered and unstable neighborhood conditions. The higher median gross rent can show cost pressure to maintain that amount by devoting a larger share of income to pay rent and have shelter. The longer commute times are a burden. This can show if jobs are located near affordable housing markets. If the percentage of households with no vehicle is high for an area, it can show daily life and job access are difficult because the county may be structured around owning a car.

I will pull data for the indicators below:

acs_vars <- c(HousingUnits = "B25002_001E", VacantHousingUnits = "B25002_003E", MedGrossRent = "B25064_001E", MedTravelTimeToWork = "B08303_001E", TotHouseholdWithVehicles = "B08201_001E", HouseholdsWithoutVehicles = "B08201_002E")

EssexDF <- get_acs(
  geography = "tract",
  state = "NJ",
  county = "Essex",
  variables = acs_vars,
  year = 2023,
  output = "wide",
  geometry = TRUE
)
##   |                                                                              |                                                                      |   0%  |                                                                              |=                                                                     |   1%  |                                                                              |==                                                                    |   3%  |                                                                              |====                                                                  |   5%  |                                                                              |=====                                                                 |   8%  |                                                                              |======                                                                |   8%  |                                                                              |=======                                                               |  10%  |                                                                              |========                                                              |  12%  |                                                                              |=========                                                             |  13%  |                                                                              |==========                                                            |  15%  |                                                                              |===========                                                           |  16%  |                                                                              |============                                                          |  17%  |                                                                              |=============                                                         |  19%  |                                                                              |==============                                                        |  20%  |                                                                              |==============                                                        |  21%  |                                                                              |================                                                      |  22%  |                                                                              |================                                                      |  23%  |                                                                              |=================                                                     |  25%  |                                                                              |==================                                                    |  26%  |                                                                              |===================                                                   |  28%  |                                                                              |=====================                                                 |  30%  |                                                                              |======================                                                |  32%  |                                                                              |========================                                              |  34%  |                                                                              |=========================                                             |  36%  |                                                                              |===========================                                           |  38%  |                                                                              |============================                                          |  40%  |                                                                              |==============================                                        |  42%  |                                                                              |===============================                                       |  44%  |                                                                              |================================                                      |  46%  |                                                                              |==================================                                    |  49%  |                                                                              |===================================                                   |  51%  |                                                                              |=====================================                                 |  53%  |                                                                              |======================================                                |  55%  |                                                                              |========================================                              |  57%  |                                                                              |=========================================                             |  59%  |                                                                              |===========================================                           |  61%  |                                                                              |============================================                          |  63%  |                                                                              |==============================================                        |  65%  |                                                                              |===============================================                       |  67%  |                                                                              |=================================================                     |  69%  |                                                                              |==================================================                    |  71%  |                                                                              |===================================================                   |  73%  |                                                                              |=====================================================                 |  76%  |                                                                              |======================================================                |  78%  |                                                                              |========================================================              |  80%  |                                                                              |=========================================================             |  82%  |                                                                              |===========================================================           |  84%  |                                                                              |============================================================          |  86%  |                                                                              |==============================================================        |  88%  |                                                                              |===============================================================       |  90%  |                                                                              |=================================================================     |  92%  |                                                                              |==================================================================    |  94%  |                                                                              |===================================================================   |  96%  |                                                                              |===================================================================== |  98%  |                                                                              |======================================================================| 100%
glimpse(EssexDF)
## Rows: 211
## Columns: 15
## $ GEOID                     <chr> "34013016800", "34013017100", "34013000100",…
## $ NAME                      <chr> "Census Tract 168; Essex County; New Jersey"…
## $ HousingUnits              <dbl> 2333, 867, 2720, 1284, 772, 1162, 992, 953, …
## $ B25002_001M               <dbl> 215, 97, 253, 124, 72, 252, 350, 133, 135, 1…
## $ VacantHousingUnits        <dbl> 205, 65, 66, 125, 89, 59, 22, 23, 142, 22, 3…
## $ B25002_003M               <dbl> 160, 42, 77, 56, 50, 49, 34, 25, 76, 36, 66,…
## $ MedGrossRent              <dbl> 2137, 1780, 1159, 1407, 1028, 979, 962, 1465…
## $ B25064_001M               <dbl> 151, 140, 135, 71, 309, 63, 95, 238, 172, 37…
## $ MedTravelTimeToWork       <dbl> 1920, 744, 2989, 1636, 689, 1011, 794, 1289,…
## $ B08303_001M               <dbl> 342, 188, 540, 317, 238, 409, 505, 334, 296,…
## $ TotHouseholdWithVehicles  <dbl> 2128, 802, 2654, 1159, 683, 1103, 970, 930, …
## $ B08201_001M               <dbl> 231, 105, 273, 129, 81, 267, 347, 143, 135, …
## $ HouseholdsWithoutVehicles <dbl> 189, 147, 794, 257, 379, 405, 212, 117, 104,…
## $ B08201_002M               <dbl> 113, 87, 290, 131, 124, 113, 102, 81, 92, 11…
## $ geometry                  <MULTIPOLYGON [°]> MULTIPOLYGON (((-74.22361 4...,…


Step 2: Transform Data

I am converting the raw estimates into indicators that I will be able to use. Vacancy rate and percent without vehicles are converted into percentages so they are comparable across tracts with different population sizes.

EssexDF <- EssexDF %>%
  mutate(
    tract_id = GEOID,
    vacancy_rate = 100 * VacantHousingUnits/HousingUnits,
    med_rent = MedGrossRent,
    commute_time = MedTravelTimeToWork,
    pct_no_vehicle = 100 * HouseholdsWithoutVehicles/TotHouseholdWithVehicles
  )

essex_indicators <- EssexDF %>%
  st_drop_geometry() %>%
  select(tract_id, vacancy_rate, med_rent, commute_time, pct_no_vehicle)

summary(essex_indicators)
##    tract_id          vacancy_rate       med_rent     commute_time 
##  Length:211         Min.   : 0.000   Min.   : 346   Min.   :   0  
##  Class :character   1st Qu.: 2.344   1st Qu.:1324   1st Qu.:1070  
##  Mode  :character   Median : 4.870   Median :1536   Median :1616  
##                     Mean   : 5.599   Mean   :1633   Mean   :1615  
##                     3rd Qu.: 8.323   3rd Qu.:1819   3rd Qu.:2054  
##                     Max.   :20.437   Max.   :3501   Max.   :3637  
##                     NA's   :2        NA's   :7                    
##  pct_no_vehicle  
##  Min.   : 0.000  
##  1st Qu.: 7.953  
##  Median :20.915  
##  Mean   :22.583  
##  3rd Qu.:35.769  
##  Max.   :68.919  
##  NA's   :2

The table is telling me that vacancy_rate ranges from 0% - 20%, and pct_no_vehicle households 0% - 69%. This hints at a large difference in access to transportation reliability.

Step 3: Create latent construct (Stress).

I am analyzing the Housing & Transportation Stress of the County. I want to see a pattern, where higher equals more stress in a neighborhood. A tract with higher scores is more stressful on the neighborhoods infrastructure like housing availability and public transportation system. I will use the z-scores and average them into an index because it will allow me to place the indicators on a common scale. This will not allow a single variable to dominate the index. I will use this later in the code-through for mapping.

Stress does not measure anxious people but structural pressure put on the households from the city’s system.

essex_index <- essex_indicators %>%
  mutate(z_vacancy = as.numeric(scale(vacancy_rate)),
         z_rent = as.numeric(scale(med_rent)),
         z_commute = as.numeric(scale(commute_time)),
         z_novehicle = as.numeric(scale(pct_no_vehicle)),
         
         stress_index = rowMeans(
           cbind(z_vacancy, z_rent, z_commute, z_novehicle),
           na.rm = TRUE)
         )

summary(essex_index$stress_index)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.39111 -0.27037  0.04704 -0.02182  0.25663  0.92817
essex <- EssexDF %>%
  left_join(
    essex_index %>%
      select(tract_id, stress_index),
    by = c("GEOID" = "tract_id")
  )

summary(essex$stress_index)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.39111 -0.27037  0.04704 -0.02182  0.25663  0.92817

Looking at the summaries, I would consider tracts near 0 to be considered typical or average values for Essex County. If it falls below -0.5 or above 0.5, it should be considered low stress areas or high stress areas, respectively.


Step 4: Choropleth Map the Stress Index

I want to see the geography of the stress before I cluster. This choropleth map shows the distribution of stress in the neighborhoods. Using a fixed z-score scale, darker yellow tracts represent areas with high stress, dark blue tracts represent areas with low stress compared to the county.

Remember higher stress z-score means it is a more stressed area.

ggplot(essex) +
  geom_sf(aes(fill = stress_index), color = NA) +
  scale_fill_viridis_c(
    option = "cividis",
    name = "Stress (z-score)",
    limits = c(-2, 2),
    breaks = c(-2, -1, 0, 1, 2)) +
  labs(
    title = "Housing & Transportation Stress Index by Tract",
    subtitle = "Essex County, NJ",
    caption = "Source: ACS 5-year estimates") +
  theme_minimal()

The results of my map show most tracts are about average. There are some higher stressed areas and lower stressed also. This map confirms that housing costs, vacancies, commute time, and vehicle access stress is not evenly spread, but centralized to specific urban neighborhoods not suburban residential areas.


Step 5: Clustering Using mclust()

In this step, I want to cluster the tracts into neighborhood types. I will use my already created z-indicators (z_vacancy, z_rent, z_commute, z_novehicle).

essex_cluster_data <- essex_index %>%
  select(z_vacancy, z_rent, z_commute, z_novehicle) %>%
  na.omit()

complete_ids <- essex_index$tract_id[complete.cases(
  essex_index[, c("z_vacancy", "z_rent", "z_commute", "z_novehicle")]
)]
row.names(essex_cluster_data) <- complete_ids

head(essex_cluster_data)
##              z_vacancy     z_rent   z_commute z_novehicle
## 34013016800  0.7608676  0.8602823  0.45185635 -0.83594638
## 34013017100  0.4530597  0.2504590 -1.28946325 -0.25955039
## 34013000100 -0.7569888 -0.8103262  2.03473957  0.44742675
## 34013001400  0.9871519 -0.3866954  0.03133359 -0.02496097
## 34013002600  1.4151001 -1.0340989 -1.37090252  2.00765314
## 34013004801 -0.1243639 -1.1178002 -0.89411263  0.86235040
# Create Cluster Model
set.seed(1234)

mclust_fit <- Mclust(essex_cluster_data)

summary(mclust_fit)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EVI (diagonal, equal volume, varying shape) model with 5 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -943.5618 204 40 -2099.848 -2163.828
## 
## Clustering table:
##  1  2  3  4  5 
## 68 55 17 20 44
# Create Classification Labels and Table
cluster_labels <- mclust_fit$classification

cluster_df <- data.frame(
  tract_id = row.names(essex_cluster_data),
  cluster = factor(cluster_labels)
)

table(cluster_df$cluster)
## 
##  1  2  3  4  5 
## 68 55 17 20 44
essex <- essex %>%
  left_join(cluster_df, by=c("GEOID" = "tract_id"))

table(essex$cluster, useNA = "ifany")
## 
##    1    2    3    4    5 <NA> 
##   68   55   17   20   44    7
# Naming cluster for labeling in legend
# Add after comparing Step 7 table
cluster_names <- c(
  "1" = "Low Stress, Vehicle Access",
  "2" = "High Rent with a Long Commute",
  "3" = "High Vacancy Unit Rate",
  "4" = "Less Vehicle Traffic More Public Trans",
  "5" = "Moderate Stress levels (A mix of indicators)"
)

essex <- essex %>%
  mutate(
    cluster = factor(
      cluster,
      levels = names(cluster_names),
      labels = cluster_names
    )
  )

Mclust fits Gaussian mixture models with different covariance structures and numbers of neighborhood types. Then it uses BIC to select the best model. In this case, 5 neighborhood types were chosen for the EVI model. The numbers in the table show that two types are common, 68 and 55 tracts found, but more unusual patterns in the other 3 types.


Step 6: Mapping the mclust Neighborhood Types

I want to see how the neighborhood types line up geographically.

ggplot(essex) +
  geom_sf(aes(fill = cluster), color = NA) +
  scale_fill_brewer(
    palette = "Set2",
    na.value = "grey90") +
  labs(
    title = "Neighborhood Types",
    subtitle = "Essex County, NJ",
    fill = "Cluster",
    caption = "Clusters estimated using Gaussian models (mclust)") +
  theme_minimal()

This map shows five distinct neighborhood types in Essex County based on the housing and transportation stress indicators I chose. We can see where these types are more frequently located. The High Rent with a Long Commute, High Vacancy Rate, and Moderate Stress level are around the urban core areas in the county. If we refer back to Step 5 we can notice lighter shades of yellow in these areas.


Step 7: Profile and Interpret the Clusters

Now I want to add some meaning to the clustered groups. Here I summarize and analyze the underlying indicators for each cluster and give the groups names.

Note: Step 7 table will originally list 1-5 as the cluster names. I compared the table and then went back to Step 5 to add meaningful names to my groups.

essex_profiles <- essex %>%
  st_drop_geometry() %>%
  filter(!is.na(cluster)) %>%
  group_by(cluster) %>%
  summarize(
    tot_tracts = n(),
    mean_vacancy = mean(vacancy_rate, na.rm = TRUE),
    mean_rent = mean(med_rent, na.rm = TRUE),
    mean_commute = mean(commute_time, na.rm = TRUE),
    mean_no_vehicle = mean(pct_no_vehicle, na.rm = TRUE),
    mean_stress = mean(stress_index, na.rm = TRUE)
  ) %>%
  arrange(as.numeric(cluster))

essex_profiles %>% pander()
Table continues below
cluster tot_tracts mean_vacancy mean_rent
Low Stress, Vehicle Access 68 3.56 1789
High Rent with a Long Commute 55 5.517 1418
High Vacancy Unit Rate 17 5.934 742.9
Less Vehicle Traffic More Public Trans 20 2.918 2948
Moderate Stress levels (A mix of indicators) 44 10.26 1408
mean_commute mean_no_vehicle mean_stress
1849 9.355 -0.1701
1946 31.94 0.1687
1069 50.78 -0.1321
1832 2.893 0.1814
1064 31.74 0.1171

Overall, these profiles show that housing and transportation stress in Essex County is not random, but organized into a small set of recurring neighborhood patterns. Some tracts have a mix of tracts, which can be more of a burden on the inhabitants of that area. The other tracts sit on the other end of the spectrum. They show low to no stress and access to vehicles. Since there is no clear distinct neighborhood type across the entire county, I would suggest the neighborhoods are targeted for policy change rather than the entire county.