In this code-through, I show how to use ACS tract level data, spatial tools, and model based clustering to build a housing and transportation “stress” index for Essex County, New Jersey. I will show how to pull Census variables, create indicators that capture the following: vacancy, rent, commute time, and vehicle access. Then I will combine them into a latent construct of neighborhood stress. After, I will map this index with sf, ggplot, and mclust. My goal is to uncover distinct neighborhood types based on indicators. I aim to help the urban policymakers in the area that I grew up in, and I want to demonstrate how to generate and interpret neighborhood typologies that can potentially guide necessary change.
library(tidycensus)
library(dplyr)
library(sf)
library(ggplot2)
library(scales)
library(cluster)
library(pander)
library(mclust)
library(factoextra)
#add your census key here
data.dictionary <- load_variables(2023, "acs5", cache = TRUE)
glimpse(data.dictionary)
Each of my ACS indicators is meant to capture a different dimension of structural stress in the local housing and transportation system set up in Essex County, NJ. The higher vacancies can indicate weak housing demand in an area. This can lead to other findings like few amenities offered and unstable neighborhood conditions. The higher median gross rent can show cost pressure to maintain that amount by devoting a larger share of income to pay rent and have shelter. The longer commute times are a burden. This can show if jobs are located near affordable housing markets. If the percentage of households with no vehicle is high for an area, it can show daily life and job access are difficult because the county may be structured around owning a car.
I will pull data for the indicators below:
Total Housing Units = B25002_001E
Housing Vacancy Rate = B25002_003E
Median Gross Rent = B25064_001E
Median Commute time to Work(mins) = B08303_001E
Total Households = B08201_001E
Pct of Households without a Vehicle = B08201_002E
acs_vars <- c(HousingUnits = "B25002_001E", VacantHousingUnits = "B25002_003E", MedGrossRent = "B25064_001E", MedTravelTimeToWork = "B08303_001E", TotHouseholdWithVehicles = "B08201_001E", HouseholdsWithoutVehicles = "B08201_002E")
EssexDF <- get_acs(
geography = "tract",
state = "NJ",
county = "Essex",
variables = acs_vars,
year = 2023,
output = "wide",
geometry = TRUE
)
## | | | 0% | |= | 1% | |== | 3% | |==== | 5% | |===== | 8% | |====== | 8% | |======= | 10% | |======== | 12% | |========= | 13% | |========== | 15% | |=========== | 16% | |============ | 17% | |============= | 19% | |============== | 20% | |============== | 21% | |================ | 22% | |================ | 23% | |================= | 25% | |================== | 26% | |=================== | 28% | |===================== | 30% | |====================== | 32% | |======================== | 34% | |========================= | 36% | |=========================== | 38% | |============================ | 40% | |============================== | 42% | |=============================== | 44% | |================================ | 46% | |================================== | 49% | |=================================== | 51% | |===================================== | 53% | |====================================== | 55% | |======================================== | 57% | |========================================= | 59% | |=========================================== | 61% | |============================================ | 63% | |============================================== | 65% | |=============================================== | 67% | |================================================= | 69% | |================================================== | 71% | |=================================================== | 73% | |===================================================== | 76% | |====================================================== | 78% | |======================================================== | 80% | |========================================================= | 82% | |=========================================================== | 84% | |============================================================ | 86% | |============================================================== | 88% | |=============================================================== | 90% | |================================================================= | 92% | |================================================================== | 94% | |=================================================================== | 96% | |===================================================================== | 98% | |======================================================================| 100%
glimpse(EssexDF)
## Rows: 211
## Columns: 15
## $ GEOID <chr> "34013016800", "34013017100", "34013000100",…
## $ NAME <chr> "Census Tract 168; Essex County; New Jersey"…
## $ HousingUnits <dbl> 2333, 867, 2720, 1284, 772, 1162, 992, 953, …
## $ B25002_001M <dbl> 215, 97, 253, 124, 72, 252, 350, 133, 135, 1…
## $ VacantHousingUnits <dbl> 205, 65, 66, 125, 89, 59, 22, 23, 142, 22, 3…
## $ B25002_003M <dbl> 160, 42, 77, 56, 50, 49, 34, 25, 76, 36, 66,…
## $ MedGrossRent <dbl> 2137, 1780, 1159, 1407, 1028, 979, 962, 1465…
## $ B25064_001M <dbl> 151, 140, 135, 71, 309, 63, 95, 238, 172, 37…
## $ MedTravelTimeToWork <dbl> 1920, 744, 2989, 1636, 689, 1011, 794, 1289,…
## $ B08303_001M <dbl> 342, 188, 540, 317, 238, 409, 505, 334, 296,…
## $ TotHouseholdWithVehicles <dbl> 2128, 802, 2654, 1159, 683, 1103, 970, 930, …
## $ B08201_001M <dbl> 231, 105, 273, 129, 81, 267, 347, 143, 135, …
## $ HouseholdsWithoutVehicles <dbl> 189, 147, 794, 257, 379, 405, 212, 117, 104,…
## $ B08201_002M <dbl> 113, 87, 290, 131, 124, 113, 102, 81, 92, 11…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-74.22361 4...,…
I am converting the raw estimates into indicators that I will be able to use. Vacancy rate and percent without vehicles are converted into percentages so they are comparable across tracts with different population sizes.
EssexDF <- EssexDF %>%
mutate(
tract_id = GEOID,
vacancy_rate = 100 * VacantHousingUnits/HousingUnits,
med_rent = MedGrossRent,
commute_time = MedTravelTimeToWork,
pct_no_vehicle = 100 * HouseholdsWithoutVehicles/TotHouseholdWithVehicles
)
essex_indicators <- EssexDF %>%
st_drop_geometry() %>%
select(tract_id, vacancy_rate, med_rent, commute_time, pct_no_vehicle)
summary(essex_indicators)
## tract_id vacancy_rate med_rent commute_time
## Length:211 Min. : 0.000 Min. : 346 Min. : 0
## Class :character 1st Qu.: 2.344 1st Qu.:1324 1st Qu.:1070
## Mode :character Median : 4.870 Median :1536 Median :1616
## Mean : 5.599 Mean :1633 Mean :1615
## 3rd Qu.: 8.323 3rd Qu.:1819 3rd Qu.:2054
## Max. :20.437 Max. :3501 Max. :3637
## NA's :2 NA's :7
## pct_no_vehicle
## Min. : 0.000
## 1st Qu.: 7.953
## Median :20.915
## Mean :22.583
## 3rd Qu.:35.769
## Max. :68.919
## NA's :2
The table is telling me that vacancy_rate ranges from 0% - 20%, and
pct_no_vehicle households 0% - 69%. This hints at a large difference in
access to transportation reliability.
I am analyzing the Housing & Transportation Stress of the County. I want to see a pattern, where higher equals more stress in a neighborhood. A tract with higher scores is more stressful on the neighborhoods infrastructure like housing availability and public transportation system. I will use the z-scores and average them into an index because it will allow me to place the indicators on a common scale. This will not allow a single variable to dominate the index. I will use this later in the code-through for mapping.
Stress does not measure anxious people but structural pressure put on the households from the city’s system.
essex_index <- essex_indicators %>%
mutate(z_vacancy = as.numeric(scale(vacancy_rate)),
z_rent = as.numeric(scale(med_rent)),
z_commute = as.numeric(scale(commute_time)),
z_novehicle = as.numeric(scale(pct_no_vehicle)),
stress_index = rowMeans(
cbind(z_vacancy, z_rent, z_commute, z_novehicle),
na.rm = TRUE)
)
summary(essex_index$stress_index)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.39111 -0.27037 0.04704 -0.02182 0.25663 0.92817
essex <- EssexDF %>%
left_join(
essex_index %>%
select(tract_id, stress_index),
by = c("GEOID" = "tract_id")
)
summary(essex$stress_index)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.39111 -0.27037 0.04704 -0.02182 0.25663 0.92817
Looking at the summaries, I would consider tracts near 0 to be considered typical or average values for Essex County. If it falls below -0.5 or above 0.5, it should be considered low stress areas or high stress areas, respectively.
I want to see the geography of the stress before I cluster. This choropleth map shows the distribution of stress in the neighborhoods. Using a fixed z-score scale, darker yellow tracts represent areas with high stress, dark blue tracts represent areas with low stress compared to the county.
Remember higher stress z-score means it is a more stressed area.
ggplot(essex) +
geom_sf(aes(fill = stress_index), color = NA) +
scale_fill_viridis_c(
option = "cividis",
name = "Stress (z-score)",
limits = c(-2, 2),
breaks = c(-2, -1, 0, 1, 2)) +
labs(
title = "Housing & Transportation Stress Index by Tract",
subtitle = "Essex County, NJ",
caption = "Source: ACS 5-year estimates") +
theme_minimal()
The results of my map show most tracts are about average. There are some
higher stressed areas and lower stressed also. This map confirms that
housing costs, vacancies, commute time, and vehicle access stress is not
evenly spread, but centralized to specific urban neighborhoods not
suburban residential areas.
In this step, I want to cluster the tracts into neighborhood types. I will use my already created z-indicators (z_vacancy, z_rent, z_commute, z_novehicle).
essex_cluster_data <- essex_index %>%
select(z_vacancy, z_rent, z_commute, z_novehicle) %>%
na.omit()
complete_ids <- essex_index$tract_id[complete.cases(
essex_index[, c("z_vacancy", "z_rent", "z_commute", "z_novehicle")]
)]
row.names(essex_cluster_data) <- complete_ids
head(essex_cluster_data)
## z_vacancy z_rent z_commute z_novehicle
## 34013016800 0.7608676 0.8602823 0.45185635 -0.83594638
## 34013017100 0.4530597 0.2504590 -1.28946325 -0.25955039
## 34013000100 -0.7569888 -0.8103262 2.03473957 0.44742675
## 34013001400 0.9871519 -0.3866954 0.03133359 -0.02496097
## 34013002600 1.4151001 -1.0340989 -1.37090252 2.00765314
## 34013004801 -0.1243639 -1.1178002 -0.89411263 0.86235040
# Create Cluster Model
set.seed(1234)
mclust_fit <- Mclust(essex_cluster_data)
summary(mclust_fit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVI (diagonal, equal volume, varying shape) model with 5 components:
##
## log-likelihood n df BIC ICL
## -943.5618 204 40 -2099.848 -2163.828
##
## Clustering table:
## 1 2 3 4 5
## 68 55 17 20 44
# Create Classification Labels and Table
cluster_labels <- mclust_fit$classification
cluster_df <- data.frame(
tract_id = row.names(essex_cluster_data),
cluster = factor(cluster_labels)
)
table(cluster_df$cluster)
##
## 1 2 3 4 5
## 68 55 17 20 44
essex <- essex %>%
left_join(cluster_df, by=c("GEOID" = "tract_id"))
table(essex$cluster, useNA = "ifany")
##
## 1 2 3 4 5 <NA>
## 68 55 17 20 44 7
# Naming cluster for labeling in legend
# Add after comparing Step 7 table
cluster_names <- c(
"1" = "Low Stress, Vehicle Access",
"2" = "High Rent with a Long Commute",
"3" = "High Vacancy Unit Rate",
"4" = "Less Vehicle Traffic More Public Trans",
"5" = "Moderate Stress levels (A mix of indicators)"
)
essex <- essex %>%
mutate(
cluster = factor(
cluster,
levels = names(cluster_names),
labels = cluster_names
)
)
Mclust fits Gaussian mixture models with different covariance structures and numbers of neighborhood types. Then it uses BIC to select the best model. In this case, 5 neighborhood types were chosen for the EVI model. The numbers in the table show that two types are common, 68 and 55 tracts found, but more unusual patterns in the other 3 types.
I want to see how the neighborhood types line up geographically.
ggplot(essex) +
geom_sf(aes(fill = cluster), color = NA) +
scale_fill_brewer(
palette = "Set2",
na.value = "grey90") +
labs(
title = "Neighborhood Types",
subtitle = "Essex County, NJ",
fill = "Cluster",
caption = "Clusters estimated using Gaussian models (mclust)") +
theme_minimal()
This map shows five distinct neighborhood types in Essex County based on
the housing and transportation stress indicators I chose. We can see
where these types are more frequently located. The High Rent with a Long
Commute, High Vacancy Rate, and Moderate Stress level are around the
urban core areas in the county. If we refer back to Step 5 we can notice
lighter shades of yellow in these areas.
Now I want to add some meaning to the clustered groups. Here I summarize and analyze the underlying indicators for each cluster and give the groups names.
Note: Step 7 table will originally list 1-5 as the cluster names. I compared the table and then went back to Step 5 to add meaningful names to my groups.
essex_profiles <- essex %>%
st_drop_geometry() %>%
filter(!is.na(cluster)) %>%
group_by(cluster) %>%
summarize(
tot_tracts = n(),
mean_vacancy = mean(vacancy_rate, na.rm = TRUE),
mean_rent = mean(med_rent, na.rm = TRUE),
mean_commute = mean(commute_time, na.rm = TRUE),
mean_no_vehicle = mean(pct_no_vehicle, na.rm = TRUE),
mean_stress = mean(stress_index, na.rm = TRUE)
) %>%
arrange(as.numeric(cluster))
essex_profiles %>% pander()
| cluster | tot_tracts | mean_vacancy | mean_rent |
|---|---|---|---|
| Low Stress, Vehicle Access | 68 | 3.56 | 1789 |
| High Rent with a Long Commute | 55 | 5.517 | 1418 |
| High Vacancy Unit Rate | 17 | 5.934 | 742.9 |
| Less Vehicle Traffic More Public Trans | 20 | 2.918 | 2948 |
| Moderate Stress levels (A mix of indicators) | 44 | 10.26 | 1408 |
| mean_commute | mean_no_vehicle | mean_stress |
|---|---|---|
| 1849 | 9.355 | -0.1701 |
| 1946 | 31.94 | 0.1687 |
| 1069 | 50.78 | -0.1321 |
| 1832 | 2.893 | 0.1814 |
| 1064 | 31.74 | 0.1171 |
Overall, these profiles show that housing and transportation stress
in Essex County is not random, but organized into a small set of
recurring neighborhood patterns. Some tracts have a mix of tracts, which
can be more of a burden on the inhabitants of that area. The other
tracts sit on the other end of the spectrum. They show low to no stress
and access to vehicles. Since there is no clear distinct neighborhood
type across the entire county, I would suggest the neighborhoods are
targeted for policy change rather than the entire county.