New York City’s Pre-K For All program nearly tripled full-day public school enrollment among four-year-olds in its launch year, from approximately 19,000 full-day seats in 2013–14 to 53,000 in 2014–15. This project uses spatial and regression analysis at the Neighborhood Tabulation Area (NTA) level to assess whether that expansion reached high-need neighborhoods equitably, comparing the final pre-expansion year to the launch year to capture the initial geographic distribution of new seats. Enrollment in New York City Department of Education (DOE) Pre-K programs is measured as a share of the estimated four-year-old population derived from American Community Survey (ACS) data.
Six datasets were accessed via NYC Open Data: the NYC DOE Demographic Snapshot 2013–2018, the NYC DOE Demographic Snapshot 2017–22, the NYC DOE Pre-K For All Demographic Snapshot 2015–2018, the 2018 Pre-K School Directory, UPK School Locations, and 2020 Neighborhood Tabulation Area boundaries. School point locations were accessed via the NYC DOE School Point Locations dataset. Neighborhood-level covariates were drawn from the U.S. Census Bureau ACS 2017 5-year estimates accessed programmatically via the tidycensus R package, including tables B01001 (age by sex), B17001 (poverty status), B03002 (race and Hispanic origin), C16002 (linguistic isolation), and B25070 (gross rent as a percentage of household income).
library(tidyverse) # data import, cleaning, and manipulation
library(sf) # spatial data handling (geometries as data frame columns)
library(tidycensus) # access to Census ACS data via API
# Load DOE enrollment snapshots
snap_2013_2018 <- read_csv("2013_-_2018_Demographic_Snapshot_School_20260426.csv")
snap_2015_2018_prek <- read_csv("2015_-_2018_Demographic_Snapshot_Pre-_K_For_All_20260426.csv")
snap_2017_2022 <- read_csv("2017-18__-_2021-22_Demographic_Snapshot_20260426.csv")
prek_directory <- read_csv("2018_Pre-K_School_Directory_20260426.csv")
upk_locations <- read_csv("Universal_Pre-K_(UPK)_School_Locations_20260426.csv")
# Load NTA boundaries as CSV with WKT geometry column
nta_csv <- read_csv("2020_Neighborhood_Tabulation_Areas_(NTAs)_20260426.csv")
# Load school point shapefile
school_points <- st_read("SchoolPoints_APS_2024_08_28/SchoolPoints_APS_2024_08_28.shp")
# Convert NTA CSV to sf object using the WKT geometry column
nta <- st_as_sf(nta_csv, wkt = "the_geom", crs = 4326)
School-level enrollment data was combined across snapshot files, labeled by era, and spatially joined to NTA boundaries. ACS tract-level data was pulled via the tidycensus API and aggregated to the NTA level to produce neighborhood demographic covariates.
# Combine the two snapshot files into one, dropping the overlapping 2017-18 year
# from the older file to avoid double-counting
prek_combined <- bind_rows(
snap_2013_2018 %>%
filter(Year != "2017-18") %>%
select(DBN, `School Name`, Year,
`Grade PK (Half Day & Full Day)`, `Total Enrollment`),
snap_2017_2022 %>%
select(DBN, `School Name`, Year,
`Grade PK (Half Day & Full Day)`, `Total Enrollment`)
)
# Label each row as pre-UPK (2013-14) or post-UPK (2014-15); all other years get NA
prek_combined <- prek_combined %>%
mutate(
era = if_else(Year == "2013-14", "pre_UPK",
if_else(Year == "2014-15", "post_UPK", NA_character_))
)
# Join enrollment data to school point locations by school ID (DBN/ATS)
prek_spatial <- school_points %>%
left_join(prek_combined, by = c("ATS" = "DBN"))
# Reproject school points to match NTA CRS before spatial join
prek_spatial <- st_transform(prek_spatial, st_crs(nta))
# Spatial join: attach NTA identifier to each school point
prek_nta <- st_join(prek_spatial, nta %>% select(NTA2020, NTAName))
# Aggregate to NTA level: sum enrollment across all schools in each NTA, by era
nta_prek_summary <- prek_nta %>%
st_drop_geometry() %>%
filter(!is.na(era)) %>%
group_by(NTA2020, NTAName, era) %>%
summarize(
total_prek = sum(`Grade PK (Half Day & Full Day)`, na.rm = TRUE),
.groups = "drop"
)
# Pull tract-level ACS variables for NYC's 5 counties
# Using 2017 ACS 5-year estimates to align with post-expansion period
# Counties: Bronx (005), Kings/Brooklyn (047), New York/Manhattan (061),
# Queens (081), Richmond/Staten Island (085)
acs_raw <- get_acs(
geography = "tract",
variables = c(
# 4-year-old population denominator (B01001)
male_under5 = "B01001_003", # Male under 5 years
female_under5 = "B01001_027", # Female under 5 years
# Poverty (B17001)
pop_poverty_det = "B17001_001", # Total population for poverty determination
below_poverty = "B17001_002", # Below poverty level
# Race/ethnicity (B03002)
total_pop = "B03002_001", # Total population
white_nh = "B03002_003", # White non-Hispanic
black_nh = "B03002_004", # Black non-Hispanic
hispanic = "B03002_012", # Hispanic or Latino (any race)
asian_nh = "B03002_006", # Asian non-Hispanic
# Linguistic isolation (B16002)
total_ling = "C16002_001", # Total households
ling_isolated = "C16002_004", # Spanish-speaking, limited English
ling_isolated2 = "C16002_007", # Other Indo-European, limited English
ling_isolated3 = "C16002_010", # Asian/PI language, limited English
ling_isolated4 = "C16002_013", # Other language, limited English
# Rent burden (B25070)
total_renters = "B25070_001", # Total renter households with cash rent
rent_30_34 = "B25070_007", # 30–34.9% of income on rent
rent_35_39 = "B25070_008", # 35–39.9%
rent_40_49 = "B25070_009", # 40–49.9%
rent_50plus = "B25070_010" # 50%+
),
state = "NY",
county = c("005", "047", "061", "081", "085"),
year = 2017,
survey = "acs5",
output = "wide"
)
# Calculate derived covariate rates from raw ACS counts
# 4yo population: under-5 divided by 5 (assumes uniform single-year age distribution)
acs_covariates <- acs_raw %>%
mutate(
pop_4yo = (male_under5E + female_under5E) / 5,
poverty_rate = below_povertyE / pop_poverty_detE,
pct_white_nh = white_nhE / total_popE,
pct_black_nh = black_nhE / total_popE,
pct_hispanic = hispanicE / total_popE,
pct_asian_nh = asian_nhE / total_popE,
ling_iso_rate = (ling_isolatedE + ling_isolated2E +
ling_isolated3E + ling_isolated4E) / total_lingE,
rent_burden = (rent_30_34E + rent_35_39E +
rent_40_49E + rent_50plusE) / total_rentersE
) %>%
select(GEOID, NAME, pop_4yo, poverty_rate, pct_white_nh, pct_black_nh,
pct_hispanic, pct_asian_nh, ling_iso_rate, rent_burden)
# Get tract geometries separately so we can spatial join to NTAs
tract_geo <- get_acs(
geography = "tract",
variables = "B01001_001",
state = "NY",
county = c("005", "047", "061", "081", "085"),
year = 2017,
survey = "acs5",
geometry = TRUE
) %>%
select(GEOID)
# Join covariates to geometries
acs_sf <- tract_geo %>%
left_join(acs_covariates, by = "GEOID") %>%
st_transform(st_crs(nta))
# Spatial join: assign each tract to an NTA, then aggregate to NTA level
# pop_4yo is summed (want total count); all rates are averaged across tracts
nta_covariates <- st_join(acs_sf, nta %>% select(NTA2020, NTAName)) %>%
st_drop_geometry() %>%
group_by(NTA2020, NTAName) %>%
summarize(
pop_4yo = sum(pop_4yo, na.rm = TRUE),
poverty_rate = mean(poverty_rate, na.rm = TRUE),
pct_white_nh = mean(pct_white_nh, na.rm = TRUE),
pct_black_nh = mean(pct_black_nh, na.rm = TRUE),
pct_hispanic = mean(pct_hispanic, na.rm = TRUE),
pct_asian_nh = mean(pct_asian_nh, na.rm = TRUE),
ling_iso_rate = mean(ling_iso_rate, na.rm = TRUE),
rent_burden = mean(rent_burden, na.rm = TRUE),
.groups = "drop"
)
# Pivot enrollment summary wide: one row per NTA with pre_UPK and post_UPK columns
nta_wide <- nta_prek_summary %>%
pivot_wider(
names_from = era,
values_from = total_prek
)
# Join enrollment to covariates and NTA geometries
# Then convert raw enrollment counts to rates (share of estimated 4yo population)
nta_analysis <- nta_wide %>%
left_join(nta_covariates, by = c("NTA2020", "NTAName")) %>%
left_join(nta %>% select(NTA2020, the_geom), by = "NTA2020") %>%
st_as_sf() %>%
mutate(
pre_UPK = pre_UPK / pop_4yo, # enrollment as share of estimated 4yo pop
post_UPK = post_UPK / pop_4yo,
prek_diff = post_UPK - pre_UPK # change in enrollment rate
)
# Investigate zero-enrollment NTAs before deciding how to handle them
zero_ntas <- nta_analysis %>%
st_drop_geometry() %>%
filter(post_UPK == 0) %>%
select(NTA2020, NTAName, post_UPK, pre_UPK, prek_diff, poverty_rate, pct_hispanic, pct_black_nh)
# Non-residential / special-purpose NTAs — exclude entirely from analysis
exclude_ntas <- c("QN0151", # Rikers Island
"MN0191", # Battery-Governors Island
"SI9592", # Miller Field
"BX1161") # Hutchinson Metro Center
# Remaining zero-enrollment NTAs are recoded to NA rather than excluded:
# includes high-Hispanic NTAs likely served exclusively by NYCEECs (no DOE Pre-K),
# low-child-density Manhattan NTAs where seats were not yet placed in 2014-15,
# and two NTAs (Brooklyn Heights, Dyker Heights) that lost Pre-K during rollout
na_ntas <- zero_ntas$NTA2020[!zero_ntas$NTA2020 %in% exclude_ntas]
nta_analysis <- nta_analysis %>%
filter(!NTA2020 %in% exclude_ntas) %>%
mutate(
# Flag NTAs likely served by NYCEECs rather than DOE schools
nyceec_likely = NTA2020 %in% c("BX0491", "BX1001", "BX1003", "BX1102",
"BX1201", "BK0571", "BK0703", "BK1201",
"QN0401", "QN0901", "BX0803"),
# Flag NTAs that had Pre-K before expansion but lost it in 2014-15
pre_UPK_loss = NTA2020 %in% c("BK0201", "BK1002"),
# Recode true-zero enrollment to NA
post_UPK = ifelse(NTA2020 %in% na_ntas, NA_real_, post_UPK),
pre_UPK = ifelse(NTA2020 %in% na_ntas, NA_real_, pre_UPK),
prek_diff = ifelse(NTA2020 %in% na_ntas, NA_real_, prek_diff)
)
The map below shows DOE Pre-K enrollment rates by NTA for the year before (2013–14) and the launch year (2014–15) of Pre-K For All. Use the layer control to toggle between the two years. Hover over any NTA to see its name and enrollment rate.
library(leaflet)
# Shared domain so pre and post colors are directly comparable
shared_domain <- range(c(nta_analysis$pre_UPK, nta_analysis$post_UPK), na.rm = TRUE)
pal_pre <- colorNumeric("YlOrRd", domain = shared_domain, na.color = "transparent")
pal_post <- colorNumeric("YlOrRd", domain = shared_domain, na.color = "transparent")
leaflet(nta_analysis) %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(
group = "Pre-UPK (2013-14)",
fillColor = ~pal_pre(pre_UPK),
fillOpacity = 0.7, color = "white", weight = 0.5,
label = ~paste0(NTAName, ": ", round(pre_UPK, 3))
) %>%
addPolygons(
group = "Post-UPK (2014-15)",
fillColor = ~pal_post(post_UPK),
fillOpacity = 0.7, color = "white", weight = 0.5,
label = ~paste0(NTAName, ": ", round(post_UPK, 3))
) %>%
addLayersControl(
baseGroups = c("Pre-UPK (2013-14)", "Post-UPK (2014-15)"),
options = layersControlOptions(collapsed = FALSE)
) %>%
addLegend(
"bottomright",
pal = pal_post,
values = ~post_UPK,
title = "DOE Pre-K enrollment rate",
opacity = 0.7
)
The maps below show the spatial distribution of pre-expansion enrollment and post-expansion enrollment, the change in enrollment rate between the two, and post-Pre-K for All enrollment alongside three neighborhood characteristics (poverty rate, non-Hispanic Black share, and rent burden) that showed the strongest relationships with enrollment in the correlation analysis.
library(tmap)
tmap_mode("plot")
# Reusable function for consistent choropleth style across all maps
make_map <- function(var, title, palette = "YlOrRd", midpoint = NA) {
tm_shape(nta_analysis) +
tm_polygons(
col = var,
palette = palette,
style = "jenks",
n = 5,
midpoint = midpoint,
title = title,
border.col = "white",
border.alpha = 0.4,
colorNA = "grey85",
textNA = "No data"
) +
tm_layout(
main.title = title,
main.title.size = 0.85,
legend.outside = TRUE,
legend.outside.position = "right",
frame = FALSE
)
}
# Pre- and post-UPK enrollment rates on a shared scale
make_map("pre_UPK", "DOE Pre-K Enrollment Rate — Pre-UPK (2013–14)")
make_map("post_UPK", "DOE Pre-K Enrollment Rate — Post-UPK (2014–15)")
# Change in enrollment rate — diverging palette, zero = neutral
make_map("prek_diff",
"Change in DOE Pre-K Enrollment Rate (2014–15 minus 2013–14)",
palette = "-RdBu",
midpoint = 0)
# Post-UPK enrollment vs. selected covariates — side-by-side pairs
# showing the relationships that drive the regression analysis
map_enrollment <- tm_shape(nta_analysis) +
tm_polygons(
col = "post_UPK",
palette = "YlOrRd",
style = "jenks",
n = 5,
title = "Post-UPK Enrollment Rate",
border.col = "white",
border.alpha = 0.4,
colorNA = "grey85",
textNA = "No data"
) +
tm_layout(
main.title = "Post-UPK Rate (2014–15)",
main.title.size = 0.85,
legend.outside = TRUE,
frame = FALSE
)
# Only the analytically meaningful covariates based on correlation analysis
map_covariates <- list(
list(var = "poverty_rate", label = "Poverty Rate"),
list(var = "pct_black_nh", label = "Share Non-Hispanic Black"),
list(var = "rent_burden", label = "Rent Burden Rate")
)
for (cov in map_covariates) {
map_cov <- tm_shape(nta_analysis) +
tm_polygons(
col = cov$var,
palette = "Blues",
style = "jenks",
n = 5,
title = cov$label,
border.col = "white",
border.alpha = 0.4,
colorNA = "grey85",
textNA = "No data"
) +
tm_layout(
main.title = cov$label,
main.title.size = 0.85,
legend.outside = TRUE,
frame = FALSE
)
print(tmap_arrange(map_enrollment, map_cov, ncol = 2))
}
Bivariate correlations were computed between post-UPK enrollment rate and seven ACS covariates. Scatter plots are shown for the three covariates with the most analytically meaningful relationships.
library(ggplot2)
# Drop geometry for non-spatial analysis
nta_df <- nta_analysis %>% st_drop_geometry()
# Compute bivariate correlations between post-UPK enrollment rate and each covariate
covariates <- c("poverty_rate", "pct_white_nh", "pct_black_nh",
"pct_hispanic", "pct_asian_nh", "ling_iso_rate", "rent_burden")
cor_table <- nta_df %>%
summarise(across(all_of(covariates),
~ cor(.x, post_UPK, use = "complete.obs"),
.names = "r_{.col}")) %>%
pivot_longer(everything(), names_to = "covariate", values_to = "r") %>%
mutate(covariate = gsub("r_", "", covariate)) %>%
arrange(desc(abs(r)))
print(cor_table)
# Scatter plots for the three analytically meaningful covariates
covariate_labels <- c(
poverty_rate = "Poverty rate",
pct_black_nh = "Share non-Hispanic Black",
rent_burden = "Rent burden rate"
)
for (cov in names(covariate_labels)) {
p <- ggplot(nta_df, aes(x = .data[[cov]], y = post_UPK)) +
geom_point(alpha = 0.5, size = 1.8, color = "#2c7bb6") +
geom_smooth(method = "lm", se = TRUE, color = "#d7191c", linewidth = 0.8) +
labs(
title = paste("Post-UPK enrollment rate vs.", covariate_labels[[cov]]),
x = covariate_labels[[cov]],
y = "Post-UPK enrollment rate"
) +
theme_minimal()
print(p)
}
Moran’s I was used to test whether post-UPK enrollment rates are spatially clustered across NTAs, using a Queen contiguity weights matrix.
library(spdep)
# Build spatial weights matrix using Queen contiguity (shared edges or vertices)
# zero.policy = TRUE handles NTAs with no neighbors (e.g. islands)
nb <- poly2nb(nta_analysis)
listw <- nb2listw(nb, zero.policy = TRUE)
# Test for spatial autocorrelation in post-UPK enrollment rates
# Null hypothesis: enrollment is randomly distributed across NTAs
moran.test(nta_analysis$post_UPK, listw, zero.policy = TRUE, na.action = na.exclude)
##
## Moran I test under randomisation
##
## data: nta_analysis$post_UPK
## weights: listw
## omitted: 5, 18, 23, 29, 33, 66, 78, 83, 85, 88, 91, 95, 97, 105, 108, 114, 136, 156
##
## Moran I statistic standard deviate = 1.2392, p-value = 0.1076
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.061533661 -0.005555556 0.002930817
Because Moran’s I was non-significant, OLS regression was used to model the relationship between neighborhood characteristics and post-UPK enrollment rate.
# OLS regression: post-UPK enrollment rate as outcome
# Spatial regression not needed — Moran's I was non-significant (I = 0.062, p = 0.108)
model_ols <- lm(
post_UPK ~ poverty_rate + pct_black_nh + pct_hispanic +
pct_asian_nh + ling_iso_rate + rent_burden,
data = nta_df
)
summary(model_ols)
##
## Call:
## lm(formula = post_UPK ~ poverty_rate + pct_black_nh + pct_hispanic +
## pct_asian_nh + ling_iso_rate + rent_burden, data = nta_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.139224 -0.051437 -0.006016 0.040269 0.315693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.28791 0.04759 6.050 8.64e-09 ***
## poverty_rate 0.15668 0.09742 1.608 0.109582
## pct_black_nh 0.05733 0.03278 1.749 0.082070 .
## pct_hispanic -0.02316 0.04369 -0.530 0.596648
## pct_asian_nh 0.01350 0.06503 0.208 0.835817
## ling_iso_rate 0.12602 0.09483 1.329 0.185617
## rent_burden -0.40838 0.12175 -3.354 0.000977 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07291 on 174 degrees of freedom
## (18 observations deleted due to missingness)
## Multiple R-squared: 0.0735, Adjusted R-squared: 0.04156
## F-statistic: 2.301 on 6 and 174 DF, p-value: 0.03662
The model is statistically significant (F p = 0.037) but explains little variance (adjusted R² = 0.04), consistent with a universal program design that reduces the role of neighborhood composition in predicting uptake. Rent burden is the only significant predictor (β = −0.41, p < 0.001), suggesting housing-stability and logistical barriers matter more than demographic ones. Non-Hispanic Black share is marginally significant (β = 0.057, p = 0.08) and positive — the only demographic predictor with a positive association with enrollment. Poverty rate is positive and non-significant (β = +0.16, p = 0.11) — a substantive equity finding: the universal design appears to have worked on the economic dimension. Chinatown–Two Bridges (MN0301) is a notable positive outlier, with enrollment substantially above model predictions, potentially reflecting community-specific outreach and informal care patterns not captured by ACS covariates.
New York City Early Education Center (NYCEEC) enrollment is not captured in DOE data. Because roughly 60 percent of Pre-K for All seats citywide were located in community-based settings like NYCEECs rather than DOE-operated schools (Wallack 2023), this analysis captures only about 40 percent of total program enrollment. The undercounting is not random. NYCEECs are concentrated in high-Hispanic, high-linguistic-isolation neighborhoods, meaning enrollment in those communities is systematically understated, and negative associations between linguistic isolation or Hispanic share and DOE enrollment may partly reflect provider type rather than true barriers to uptake. The 4-year-old population denominator is an approximation based on one-fifth of the ACS under-5 population. ACS 2017 5-year estimates are not perfectly contemporaneous with 2014–15 enrollment.