Introduction

This analysis aims to identify and evaluate the key factors that contribute most significantly to the Disadvantage Score across neighborhoods in New York City. The Disadvantage Score is a composite measure developed to reflect cumulative environmental, health, and socioeconomic burdens experienced by communities. Values range from 0 to 1, with higher scores indicating neighborhoods facing more substantial systemic and structural barriers to well-being. The data used in this analysis comes from the 2023 NYC Disadvantaged Neighborhoods Data set, which was compiled and published by the New York State Department of Environmental Conservation (NYSDEC) in collaboration with the New York State Climate Justice Working Group. This data set integrates 45 carefully selected indicators across domains such as pollution exposure, housing quality, access to healthcare, poverty, and historical disinvestment. This work is part of a larger project exploring the correlation between disadvantaged neighborhoods and gun violence in New York City, with the goal of uncovering patterns of systemic inequality and informing equitable policy interventions.

The primary objective of this analysis is to determine which factors—across environmental, population, and health indicators—have the greatest impact on predicted Disadvantage Scores. To do this, I employ beta regression modeling and simulation to capture both the statistical significance and real-world magnitude of each variable’s effect. This work is part of a broader research effort exploring the intersection between neighborhood disadvantage and gun violence in New York City. Understanding which structural burdens are most predictive of disadvantage helps contextualize patterns of violence, vulnerability, and resource inequality across the city. By highlighting the strongest contributors to the Disadvantage Score, this analysis offers a data-driven foundation for targeted interventions, equity-focused policy, and community resilience strategies.

Cleaning NYC Disadvantaged Neighborhood Data For Beta Regression Analysis

Before running the beta regression, I examined the Percentile_Rank_Combined_NYC variable to ensure it met the assumptions of the model. Beta regression requires the dependent variable to lie strictly between 0 and 1. To check for violations of this assumption, I used a logical test to count how many neighborhoods had scores that were exactly 0 or 1.

table(neighborhoods$Percentile_Rank_Combined_NYC == 0 | neighborhoods$Percentile_Rank_Combined_NYC == 1)
## 
## FALSE  TRUE 
##  2093  2825

The result showed that 2,825 observations had values at the boundaries, while only 2,093 fell strictly within the valid range. Since a substantial portion of the data did not meet the model’s requirements, this highlighted the need to transform the variable to ensure all values fall within the open interval (0, 1) prior to model fitting.

After identifying that over half of the Percentile_Rank_Combined_NYC values were exactly 0 or 1—violating the assumptions of beta regression, I proceeded to transform the variable to make it suitable for modeling. Beta regression requires the dependent variable to fall strictly within the open interval (0, 1), so a common transformation was applied to shift all values slightly inward while preserving their rank order. This transformation is especially important when dealing with proportion or percentile data that include boundary values.

In the code, I first calculated the number of observations using n <- nrow(neighborhoods), then applied the transformation as follows:

n <- nrow(neighborhoods)
neighborhoods$score_transformed <- (neighborhoods$Percentile_Rank_Combined_NYC * (n - 1) + 0.5) / n

This formula adjusts the original values so that any 0s or 1s are slightly shifted inward, ensuring all transformed values fall strictly within the range (0, 1). Specifically, a value of 0 becomes 0.5 / n and a value of 1 becomes (n - 0.5) / n. In this data set, where n = 4918, a value of 0 is transformed to approximately 0.0001, and a value of 1 is transformed to approximately 0.9999. These adjustments are minimal but crucial. They allow the data to satisfy the assumptions of the beta distribution without altering the overall structure or relative rankings of the scores. The resulting score_transformed variable is now compatible with beta regression and still accurately reflects the original disadvantage levels across neighborhoods.

Beta Regression Models

With the score_transformed variable properly adjusted to fall within the (0, 1) interval, the data set was now ready for regression modeling. Given the bounded and continuous nature of the outcome variable, beta regression was selected as the appropriate analytical approach. This method is particularly well-suited for modeling proportion-based outcomes and allows for interpreting how various predictor variables influence neighborhood-level disadvantage. To explore different dimensions of structural inequity, I developed two separate models: the first (m2) focused on environmental burdens and climate change-related risk factors, while the second (m3) examined population characteristics, socioeconomic indicators, and health-related vulnerabilities. Together, these models provide a multifaceted understanding of the forces shaping disadvantage across NYC neighborhoods.

Enviromental Burdens and Climate Change Risk Factors Regression Analysis

To examine the relationship between environmental burdens, climate change risk factors, and neighborhood disadvantage, I fit a beta regression model using the betareg() function in R. The outcome variable, score_transformed, is a normalized version of the overall disadvantage score, adjusted to meet the requirements of beta regression. The model (m2) includes predictor variables that reflect exposure to environmental hazards and climate-related stressors. These include air pollution (Particulate_Matter_25), traffic density, proximity to industrial and hazardous land uses (e.g., Landfills, Oil_Storage, Municipal_Waste_Combustors), and long-term land use impacts (e.g., Agricultural_Land_Use, Industrial_Land_Use). Climate change-related risk is captured through indicators like Days_Above_90_Degrees_2050 and Low_Vegetative_Cover, while Drive_Time_Healthcare accounts for access to healthcare in vulnerable communities. This model allows us to assess how these environmental and climate-related stressors contribute to patterns of disadvantage across NYC neighborhoods. The indicators along with their definitions can be seen in the table below.

Enviormental Burdens and Climate Change Risk Variables Data Dictionary

library(tibble)
library(kableExtra)

variable_definitions_df <- tibble::tibble(
  Variable = c(
    "Particulate_Matter_25",
    "Traffic_Truck_Highways",
    "Traffic_Number_Vehicles",
    "Wastewater_Discharge",
    "Industrial_Land_Use",
    "Landfills",
    "Oil_Storage",
    "Municipal_Waste_Combustors",
    "Power_Generation_Facilities",
    "RMP_Sites",
    "Remediation_Sites",
    "Scrap_Metal_Processing",
    "Agricultural_Land_Use",
    "Days_Above_90_Degrees_2050",
    "Low_Vegetative_Cover",
    "Drive_Time_Healthcare"
  ),
  Definition = c(
    "Percentile ranking of the average annual concentration of PM2.5 (particulate matter ≤ 2.5 microns) per cubic meter.",
    "Percentile ranking of average daily truck traffic on highways (Classes 4–13 vehicles).",
    "Percentile ranking of average daily vehicle traffic on major roads within 500 meters of census block centroids, weighted by population.",
    "Percentile ranking of toxicity-weighted concentrations in stream segments near the tract, indicating potential water pollution.",
    "Percentile ranking of census tract land area zoned for industrial, mining, or manufacturing use.",
    "Percentile ranking of land area within 500 meters of an active landfill.",
    "Percentile ranking of land area within 500 meters of major oil storage facilities.",
    "Percentile ranking of land area within 500 meters of a municipal waste combustor.",
    "Percentile ranking of land area within 1 mile of fossil-fuel-burning power plants or peaker units.",
    "Percentile ranking of proximity to chemical accident risk sites (Regulated Management Plan sites), weighted by distance and population.",
    "Percentile ranking of the number of state/federal environmental remediation sites (e.g., Superfund, Brownfield).",
    "Percentile ranking of the number of scrap metal and vehicle dismantling facilities.",
    "Percentile ranking of land area used for crops or pasture.",
    "Projected percentile ranking of the average annual number of days above 90°F in the year 2050.",
    "Percentile ranking of the census tract land area classified as developed or barren (low vegetation).",
    "Percentile ranking of average drive time from the tract center to the three nearest healthcare facilities."
  )
)

# Styled HTML table output
kable(variable_definitions_df, caption = "Data Dictionary for Environmental Burden and Climate Risk Variables (Model m2)") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "left"
  )
Data Dictionary for Environmental Burden and Climate Risk Variables (Model m2)
Variable Definition
Particulate_Matter_25 Percentile ranking of the average annual concentration of PM2.5 (particulate matter ≤ 2.5 microns) per cubic meter.
Traffic_Truck_Highways Percentile ranking of average daily truck traffic on highways (Classes 4–13 vehicles).
Traffic_Number_Vehicles Percentile ranking of average daily vehicle traffic on major roads within 500 meters of census block centroids, weighted by population.
Wastewater_Discharge Percentile ranking of toxicity-weighted concentrations in stream segments near the tract, indicating potential water pollution.
Industrial_Land_Use Percentile ranking of census tract land area zoned for industrial, mining, or manufacturing use.
Landfills Percentile ranking of land area within 500 meters of an active landfill.
Oil_Storage Percentile ranking of land area within 500 meters of major oil storage facilities.
Municipal_Waste_Combustors Percentile ranking of land area within 500 meters of a municipal waste combustor.
Power_Generation_Facilities Percentile ranking of land area within 1 mile of fossil-fuel-burning power plants or peaker units.
RMP_Sites Percentile ranking of proximity to chemical accident risk sites (Regulated Management Plan sites), weighted by distance and population.
Remediation_Sites Percentile ranking of the number of state/federal environmental remediation sites (e.g., Superfund, Brownfield).
Scrap_Metal_Processing Percentile ranking of the number of scrap metal and vehicle dismantling facilities.
Agricultural_Land_Use Percentile ranking of land area used for crops or pasture.
Days_Above_90_Degrees_2050 Projected percentile ranking of the average annual number of days above 90°F in the year 2050.
Low_Vegetative_Cover Percentile ranking of the census tract land area classified as developed or barren (low vegetation).
Drive_Time_Healthcare Percentile ranking of average drive time from the tract center to the three nearest healthcare facilities.

Beta Regression Model of Enviromental Burdens and Climate Change Risk Factors

The below section of the code fits a beta regression model to examine how various environmental burdens and climate-related factors contribute to neighborhood disadvantage in New York City. The betareg() function is used because the dependent variable, score_transformed, is a continuous proportion bounded between 0 and 1—a requirement of beta regression. The model (m2) includes 16 independent variables representing air pollution, traffic exposure, industrial land use, environmental hazards, climate vulnerability, and healthcare access.

Key variables include particulate matter (PM2.5), truck traffic, proximity to industrial sites, projected extreme heat, vegetative cover, and drive time to healthcare facilities. These predictors are drawn from the neighborhoods data set and modeled against the transformed disadvantage score.

The final line, summary(m2), displays detailed model output including coefficient estimates, standard errors, z-values, and significance levels for each predictor. This output provides insight into the strength and direction of the relationship between each environmental factor and the predicted disadvantage score.

library(betareg)

# Fit the beta regression model
m2 <- betareg(score_transformed ~ 
  Particulate_Matter_25 +
  Traffic_Truck_Highways +
  Traffic_Number_Vehicles +
  Wastewater_Discharge +
  Industrial_Land_Use +
  Landfills +
  Oil_Storage +
  Municipal_Waste_Combustors +
  Power_Generation_Facilities +
  RMP_Sites +
  Remediation_Sites +
  Scrap_Metal_Processing +
  Agricultural_Land_Use +
  Days_Above_90_Degrees_2050 +
  Low_Vegetative_Cover +
  Drive_Time_Healthcare,
  data = neighborhoods
)

# Show model summary
summary(m2)
## 
## Call:
## betareg(formula = score_transformed ~ Particulate_Matter_25 + Traffic_Truck_Highways + 
##     Traffic_Number_Vehicles + Wastewater_Discharge + Industrial_Land_Use + 
##     Landfills + Oil_Storage + Municipal_Waste_Combustors + Power_Generation_Facilities + 
##     RMP_Sites + Remediation_Sites + Scrap_Metal_Processing + Agricultural_Land_Use + 
##     Days_Above_90_Degrees_2050 + Low_Vegetative_Cover + Drive_Time_Healthcare, 
##     data = neighborhoods)
## 
## Quantile residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3397 -0.6490 -0.1935  0.3369  3.8470 
## 
## Coefficients (mean model with logit link):
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -4.6920941  0.1096509 -42.791  < 2e-16 ***
## Particulate_Matter_25        1.4614179  0.0970630  15.056  < 2e-16 ***
## Traffic_Truck_Highways       0.5253232  0.0821892   6.392 1.64e-10 ***
## Traffic_Number_Vehicles     -0.1896538  0.1080662  -1.755   0.0793 .  
## Wastewater_Discharge        -0.2426377  0.0565219  -4.293 1.76e-05 ***
## Industrial_Land_Use          0.0728741  0.0573163   1.271   0.2036    
## Landfills                   -0.2771025  0.2282063  -1.214   0.2246    
## Oil_Storage                 -0.1529130  0.1265882  -1.208   0.2271    
## Municipal_Waste_Combustors  -0.5420505  0.3666539  -1.478   0.1393    
## Power_Generation_Facilities  0.2994751  0.0990510   3.023   0.0025 ** 
## RMP_Sites                    0.0007567  0.0006812   1.111   0.2667    
## Remediation_Sites           -0.2629478  0.0561514  -4.683 2.83e-06 ***
## Scrap_Metal_Processing       0.1884084  0.0782322   2.408   0.0160 *  
## Agricultural_Land_Use        1.1163217  0.0876453  12.737  < 2e-16 ***
## Days_Above_90_Degrees_2050   0.8657231  0.0665122  13.016  < 2e-16 ***
## Low_Vegetative_Cover         2.8012651  0.1137205  24.633  < 2e-16 ***
## Drive_Time_Healthcare        0.0410721  0.0898107   0.457   0.6474    
## 
## Phi coefficients (precision model with identity link):
##       Estimate Std. Error z value Pr(>|z|)    
## (phi)  1.68927    0.03702   45.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: 1.592e+04 on 18 Df
## Pseudo R-squared: 0.7364
## Number of iterations: 80 (BFGS) + 3 (Fisher scoring)

Significant Positive Predictors

The following variables were statistically significant (p < 0.05) and positively associated with higher predicted disadvantage scores:

  • Particulate_Matter_25 (+1.46): Air pollution (PM2.5) has the strongest positive effect, emphasizing the severe impact of poor air quality on community vulnerability.

  • Traffic_Truck_Highways (+0.52): Heavier truck traffic near highways is associated with greater disadvantage.

  • Power_Generation_Facilities (+0.29): Living near fossil fuel power plants contributes to higher disadvantage scores.

  • Scrap_Metal_Processing (+0.19): Presence of these facilities is linked to environmental burden and elevated risk.

  • Agricultural_Land_Use (+1.16): Although surprising, high agricultural zoning may reflect rural or industrial edges with fewer services and more exposure to land-related pollutants.

  • Days_Above_90_Degrees_2050 (+0.86): Climate change-related heat exposure has a large, significant effect on predicted disadvantage.

  • Low_Vegetative_Cover (+0.80): Lack of green infrastructure is strongly tied to increased vulnerability.

Visualization of Model-Based Predictions at Minimum and Maximum Values for Enviromental Burdens and Climate Change Risk Factors

This section of the code is designed to visualize the effect of each predictor variable in the m2 model on the predicted disadvantage score, specifically by comparing predictions at the minimum and maximum values of each variable. After fitting the beta regression model (m2), the vars vector is created to list all 16 predictors used in the model.

Using purrr::map_dfr(), the code loops through each variable and builds a new data set (grids) that simulates two conditions for each predictor: one where it is set to its minimum observed value, and one where it is set to its maximum, while all other variables remain constant. These two scenarios are labeled “Minimum” and “Maximum” in a new column called “Score Scaling”, and assigned “Score Value”s of 0 and 1, respectively, for clarity.

The predictions() function from the marginaleffects package is then used to estimate the predicted disadvantage score for each of these scenarios. This produces a table (m2_table) showing how the predicted score changes when each variable increases from its lowest to highest value, along with confidence intervals for each prediction. The final table is sorted for readability and serves as a clear way to interpret and compare the magnitude and direction of influence that each environmental factor has on disadvantage, based on real data ranges.

This approach offers an intuitive, simulation-based method for communicating variable importance and model impact, complementing the regression coefficients from the summary output with easily interpretable predicted values.

library(betareg)
library(marginaleffects)
library(dplyr)
library(purrr)
library(kableExtra)

# Fit model including Drive_Time_Healthcare
m2 <- betareg(score_transformed ~ 
  Particulate_Matter_25 +
  Traffic_Truck_Highways +
  Traffic_Number_Vehicles +
  Wastewater_Discharge +
  Industrial_Land_Use +
  Landfills +
  Oil_Storage +
  Municipal_Waste_Combustors +
  Power_Generation_Facilities +
  RMP_Sites +
  Remediation_Sites +
  Scrap_Metal_Processing +
  Agricultural_Land_Use +
  Days_Above_90_Degrees_2050 +
  Low_Vegetative_Cover +
  Drive_Time_Healthcare,
  data = neighborhoods)

# Variables to simulate
vars <- c(
  "Particulate_Matter_25",
  "Traffic_Truck_Highways",
  "Traffic_Number_Vehicles",
  "Wastewater_Discharge",
  "Industrial_Land_Use",
  "Landfills",
  "Oil_Storage",
  "Municipal_Waste_Combustors",
  "Power_Generation_Facilities",
  "RMP_Sites",
  "Remediation_Sites",
  "Scrap_Metal_Processing",
  "Agricultural_Land_Use",
  "Days_Above_90_Degrees_2050",
  "Low_Vegetative_Cover",
  "Drive_Time_Healthcare"
)

# Generate prediction grid with standardized values
grids <- map_dfr(vars, function(var) {
  vals <- range(neighborhoods[[var]], na.rm = TRUE)
  newdat <- datagrid(model = m2)
  newdat_min <- newdat
  newdat_max <- newdat
  newdat_min[[var]] <- vals[1]
  newdat_max[[var]] <- vals[2]
  bind_rows(
    mutate(newdat_min, Variable = var, `Score Scaling` = "Minimum", `Score Value` = 0.0),
    mutate(newdat_max, Variable = var, `Score Scaling` = "Maximum", `Score Value` = 1.0)
  )
})

# Generate predictions and bind labels
preds <- predictions(m2, newdata = grids, type = "response") %>%
  as.data.frame() %>%
  bind_cols(grids[, c("Variable", "Score Scaling", "Score Value")])

# Clean and rename columns
m2_table <- preds %>%
  select(
    Variable,
    `Score Scaling`,
    `Score Value`,
    estimate,
    conf.low,
    conf.high
  ) %>%
  rename(
    `Predicted Score` = estimate,
    `Lower Level Confidence Interval` = conf.low,
    `Upper Level Confidence Interval` = conf.high
  ) %>%
  arrange(Variable, `Score Scaling`)

# Styled output for knitting
kable(m2_table, caption = "Predicted Disadvantage Scores at Minimum and Maximum Predictor Values (Model m2)") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "left"
  )
Predicted Disadvantage Scores at Minimum and Maximum Predictor Values (Model m2)
Variable Score Scaling Score Value Predicted Score Lower Level Confidence Interval Upper Level Confidence Interval
Agricultural_Land_Use Maximum 1 0.3275349 0.2965353 0.3585344
Agricultural_Land_Use Minimum 0 0.1375633 0.1311597 0.1439669
Days_Above_90_Degrees_2050 Maximum 1 0.2281773 0.2158588 0.2404958
Days_Above_90_Degrees_2050 Minimum 0 0.1107308 0.1025931 0.1188685
Drive_Time_Healthcare Maximum 1 0.1695217 0.1559275 0.1831158
Drive_Time_Healthcare Minimum 0 0.1638187 0.1506083 0.1770292
Industrial_Land_Use Maximum 1 0.1745361 0.1609214 0.1881508
Industrial_Land_Use Minimum 0 0.1639583 0.1571299 0.1707868
Landfills Maximum 1 0.1318451 0.0807897 0.1829005
Landfills Minimum 0 0.1669165 0.1613927 0.1724404
Low_Vegetative_Cover Maximum 1 0.4225398 0.3964319 0.4486476
Low_Vegetative_Cover Minimum 0 0.0425489 0.0372705 0.0478272
Municipal_Waste_Combustors Maximum 1 0.1043115 0.0372278 0.1713952
Municipal_Waste_Combustors Minimum 0 0.1668441 0.1613327 0.1723556
Oil_Storage Maximum 1 0.1469996 0.1162715 0.1777276
Oil_Storage Minimum 0 0.1672258 0.1616411 0.1728106
Particulate_Matter_25 Maximum 1 0.2933842 0.2724069 0.3143615
Particulate_Matter_25 Minimum 0 0.0878542 0.0794544 0.0962541
Power_Generation_Facilities Maximum 1 0.2097565 0.1788088 0.2407042
Power_Generation_Facilities Minimum 0 0.1643971 0.1587260 0.1700681
RMP_Sites Maximum 1 0.1719713 0.1608661 0.1830766
RMP_Sites Minimum 0 0.1614622 0.1510011 0.1719233
Remediation_Sites Maximum 1 0.1391470 0.1273829 0.1509111
Remediation_Sites Minimum 0 0.1737262 0.1673456 0.1801069
Scrap_Metal_Processing Maximum 1 0.1927057 0.1694973 0.2159141
Scrap_Metal_Processing Minimum 0 0.1650763 0.1594619 0.1706907
Traffic_Number_Vehicles Maximum 1 0.1536919 0.1386905 0.1686934
Traffic_Number_Vehicles Minimum 0 0.1800100 0.1636543 0.1963657
Traffic_Truck_Highways Maximum 1 0.2064925 0.1918611 0.2211240
Traffic_Truck_Highways Minimum 0 0.1333657 0.1229389 0.1437926
Wastewater_Discharge Maximum 1 0.1440225 0.1331064 0.1549386
Wastewater_Discharge Minimum 0 0.1765881 0.1692215 0.1839547

This table presents the simulated predicted disadvantage scores for each environmental variable in the m2 beta regression model, holding all other variables constant while comparing each predictor at its minimum and maximum values. This approach offers an intuitive way to understand the magnitude of impact each variable has on predicted disadvantage.

Several variables stand out with notable differences between their minimum and maximum predicted scores:

  • Low_Vegetative_Cover has one of the largest effects, increasing the predicted score from approximately 0.04 at its minimum to 0.42 at its maximum. This suggests that a lack of green space is a strong contributor to neighborhood disadvantage.

  • Days_Above_90_Degrees_2050 similarly shows a large effect, with predicted scores rising from 0.11 to 0.22 as heat exposure increases, emphasizing the significance of climate change risk.

  • Particulate_Matter_25 (PM2.5) raises the score from 0.09 to 0.29, reinforcing the strong role of air pollution in predicting disadvantage.

  • Agricultural_Land_Use and Scrap_Metal_Processing also show meaningful increases, indicating that proximity to certain land uses and industrial activities is associated with higher vulnerability.

Some variables show minimal change in predicted scores between their minimum and maximum values, suggesting a weaker or less direct relationship in this model:

  • Drive_Time_Healthcare, RMP_Sites, and Municipal_Waste_Combustors produce very small differences in predicted scores, which is consistent with their non-significant regression coefficients.

  • Wastewater_Discharge and Remediation_Sites slightly decrease the predicted scores at higher levels, suggesting that areas with greater cleanup activity or discharge data may not necessarily correspond to greater disadvantage — possibly reflecting infrastructure presence or remediation efforts.

The confidence intervals included in the table provide a measure of uncertainty around the predictions. For most variables, the intervals between minimum and maximum do not overlap significantly, suggesting these differences are both statistically and practically meaningful.

Dumbbell Plot of Simulated Predictor Effects of for Enviromental Burdens and Climate Change Risk Factors

This section of the analysis was designed to visually communicate the influence of each environmental and climate-related variable in the m2 beta regression model. After simulating predicted disadvantage scores at the minimum and maximum observed values of each predictor (stored in m2_table), I created a dumbbell plot to display the results using ggplot2.

The code begins by reshaping the data into a format where each variable has two associated predicted values—one for the minimum and one for the maximum level—using pivot_wider(). The absolute difference between these two values (Diff) is calculated to quantify the magnitude of each variable’s effect on the predicted disadvantage score. The data is then sorted in descending order of impact, highlighting which predictors contribute the most substantial changes.

The plot itself is constructed using geom_segment() to draw horizontal lines between the minimum (blue dot) and maximum (red dot) predicted values for each variable. This visual format effectively illustrates how increasing each predictor—from its lowest to highest real-world value—affects the modeled disadvantage score. The result is a clear, comparative visualization of variable influence, helping to identify which environmental burdens and climate risk factors are most strongly associated with predicted disadvantage in NYC neighborhoods.

This visualization complements the numerical output of the regression model by offering a more intuitive interpretation of the results, making the findings more accessible to both technical and non-technical audiences.

library(ggplot2)
library(dplyr)
library(tidyr)

# Prepare the plot data from final_table
plot_data <- m2_table %>%
  select(Variable, `Score Scaling`, `Predicted Score`) %>%
  pivot_wider(names_from = `Score Scaling`, values_from = `Predicted Score`) %>%
  mutate(Diff = abs(Maximum - Minimum)) %>%
  arrange(desc(Diff))

# Create the updated plot
ggplot(plot_data, aes(y = reorder(Variable, Diff))) +
  geom_segment(aes(x = Minimum, xend = Maximum, yend = Variable), color = "black") +
  geom_point(aes(x = Minimum), color = "blue", size = 3) +
  geom_point(aes(x = Maximum), color = "red", size = 3) +
  labs(
    title = "Change in Predicted Disadvantage Score\nFrom Minimum to Maximum Predictor Values\n(Environmental Burden & Climate Change Risk Factors)",
    x = "Predicted Score",
    y = "Variable",
    caption = "Blue = Score at Minimum | Red = Score at Maximum"
  ) +
  theme_bw() +
  theme(
    axis.text.y = element_text(face = "bold", color = "black", size = 11),
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.title = element_text(size = 12),
    plot.caption = element_text(size = 9, color = "gray30", hjust = 1),
    plot.margin = margin(t = 10, r = 60, b = 10, l = 10) # <-- Added extra right margin
  )

The dumbbell plot provides a visual summary of the predicted change in neighborhood disadvantage score as each environmental and climate-related predictor increases from its minimum to maximum observed value, based on the beta regression model (m2). Each line represents a variable, with the blue dot showing the predicted score at the lowest value and the red dot showing the score at the highest value. Longer lines indicate a greater impact on the predicted disadvantage score.

Strongest-Impact Predictors

  • Low_Vegetative_Cover shows the largest change in predicted score, increasing from approximately 0.04 to 0.42, highlighting the strong relationship between lack of greenery and increased disadvantage.

  • Particulate_Matter_25 (PM2.5) and Agricultural_Land_Use also have large increases in predicted scores, suggesting that both air pollution and land use patterns are important contributors to environmental inequity.

  • Days_Above_90_Degrees_2050, a proxy for climate risk, shows a substantial impact, reinforcing the relevance of projected extreme heat as a future vulnerability factor.

Variables Where Higher Values Lower the Predicted Score

  • Remediation_Sites: Interpretation: As the number of remediation sites increases, the predicted disadvantage score decreases slightly.

Possible Explanation: This may reflect areas that have received targeted investment for environmental cleanup, infrastructure redevelopment, or policy attention, which could be mitigating disadvantage over time.

  • Wastewater_Discharge Interpretation: Higher levels of wastewater discharge are associated with lower predicted scores. Possible Explanation: This could be a proxy for industrial infrastructure, which may be more common in areas with higher economic activity—not necessarily aligned with high disadvantage in all cases. It may also indicate areas where discharge is monitored or regulated, which could overlap with environmental enforcement.

Why this matters

These exceptions highlight the complexity of interpreting environmental exposure data. While many factors increase predicted disadvantage scores (as expected in burdened communities), others may behave differently due to:

Policy interventions (e.g., remediation funding),

Spatial characteristics (e.g., industrial zones not always overlapping with residential disadvantage),

Or measurement nuances (e.g., proximity vs. exposure vs. regulation).

Population and Health Factors Regression Analysis

After evaluating the role of environmental burdens and climate change risk in shaping neighborhood disadvantage, I expanded our analysis to include a broader set of population, socioeconomic, and health-related factors. These indicators capture structural inequities that go beyond environmental exposure, reflecting disparities in healthcare access, chronic illness, housing quality, racial and ethnic composition, economic hardship, and more. To assess the influence of these variables, I developed a second beta regression model (m3) using the same transformed disadvantage score as the outcome. This allowed me to examine how demographic vulnerability and underlying health conditions contribute to patterns of disadvantage across NYC neighborhoods, complementing the environmental findings from the previous model.

Population and Health Variables Data Dictionary

Before running the regression analysis for the m3 model, I first created a data dictionary to clearly define each of the variables included in the model. This step was essential for ensuring a transparent and well-documented analytical process, especially given the range and complexity of population, socioeconomic, and health-related indicators used. By compiling concise definitions for each variable—such as racial and ethnic composition, chronic health conditions, household composition, education level, income thresholds, and housing characteristics, I ensured a consistent understanding of what each measure represents. This not only facilitated accurate model interpretation but also laid the groundwork for clearly communicating findings to diverse audiences, including policymakers, researchers, and community stakeholders.

library(tibble)
library(kableExtra)

m3_dictionary <- tibble::tibble(
  Variable = c(
    "Asian_Percent", "Black_African_American_Percent", "Redlining_Updated",
    "Latino_Percent", "English_Proficiency", "Native_Indigenous",
    "LMI_80_AMI", "LMI_Poverty_Federal", "Population_No_College",
    "Household_Single_Parent", "Unemployment_Rate", "Asthma_ED_Rate",
    "COPD_ED_Rate", "Households_Disabled", "Low_Birth_Weight",
    "MI_Hospitalization_Rate", "Health_Insurance_Rate", "Age_Over_65",
    "Premature_Deaths", "Internet_Access", "Home_Energy_Affordability",
    "Homes_Built_Before_1960", "Rent_Percent_Income", "Renter_Percent"
  ),
  Definition = c(
    "Percent of population identifying as Asian.",
    "Percent of population identifying as Black or African American.",
    "Indicator for whether the area was historically redlined (HOLC maps).",
    "Percent of population identifying as Latino or Hispanic.",
    "Percent of population with limited English proficiency.",
    "Percent of population identifying as Native or Indigenous.",
    "Percent of households below 80% of Area Median Income (AMI).",
    "Percent of population below the federal poverty level.",
    "Percent of adult population with no college education.",
    "Percent of households led by a single parent.",
    "Unemployment rate in the tract.",
    "Emergency department visits due to asthma (per capita).",
    "Emergency department visits due to COPD (per capita).",
    "Percent of households with at least one person with a disability.",
    "Percent of births considered low birth weight.",
    "Hospitalization rate due to myocardial infarction (heart attacks).",
    "Percent of population without health insurance coverage.",
    "Percent of population age 65 or older.",
    "Rate of premature deaths in the tract.",
    "Percent of households with access to internet service.",
    "Estimated household energy cost burden as a percent of income.",
    "Percent of housing units built before 1960.",
    "Median rent as a percent of household income.",
    "Percent of housing units that are renter-occupied."
  )
)

# Styled HTML table output
kable(m3_dictionary, caption = "Data Dictionary for Population Characteristics & Health Variables (Model m3)") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "left"
  )
Data Dictionary for Population Characteristics & Health Variables (Model m3)
Variable Definition
Asian_Percent Percent of population identifying as Asian.
Black_African_American_Percent Percent of population identifying as Black or African American.
Redlining_Updated Indicator for whether the area was historically redlined (HOLC maps).
Latino_Percent Percent of population identifying as Latino or Hispanic.
English_Proficiency Percent of population with limited English proficiency.
Native_Indigenous Percent of population identifying as Native or Indigenous.
LMI_80_AMI Percent of households below 80% of Area Median Income (AMI).
LMI_Poverty_Federal Percent of population below the federal poverty level.
Population_No_College Percent of adult population with no college education.
Household_Single_Parent Percent of households led by a single parent.
Unemployment_Rate Unemployment rate in the tract.
Asthma_ED_Rate Emergency department visits due to asthma (per capita).
COPD_ED_Rate Emergency department visits due to COPD (per capita).
Households_Disabled Percent of households with at least one person with a disability.
Low_Birth_Weight Percent of births considered low birth weight.
MI_Hospitalization_Rate Hospitalization rate due to myocardial infarction (heart attacks).
Health_Insurance_Rate Percent of population without health insurance coverage.
Age_Over_65 Percent of population age 65 or older.
Premature_Deaths Rate of premature deaths in the tract.
Internet_Access Percent of households with access to internet service.
Home_Energy_Affordability Estimated household energy cost burden as a percent of income.
Homes_Built_Before_1960 Percent of housing units built before 1960.
Rent_Percent_Income Median rent as a percent of household income.
Renter_Percent Percent of housing units that are renter-occupied.

To ensure consistency with the previous model, I began our analysis of population and health-related factors using the same beta regression framework applied in m2. This approach is ideal for modeling the transformed disadvantage score, which is a continuous proportion bounded between 0 and 1. By maintaining the same modeling structure, I was able to directly compare the effects of demographic and health-related variables on disadvantage in a way that aligns with the environmental risk factors assessed earlier. In this model, referred to as m3, I included predictors such as rates of chronic illness (e.g., asthma and COPD), disability, poverty, race/ethnicity, education, age, and access to resources like health insurance and internet. This allowed me to quantify how these structural and health-based vulnerabilities shape neighborhood disadvantage across New York City.

library(betareg)

# Fit model m3 with full set of demographic, health, housing, and access variables
m3 <- betareg(score_transformed ~ 
  Asian_Percent +
  Black_African_American_Percent +
  Redlining_Updated +
  Latino_Percent +
  English_Proficiency +
  Native_Indigenous +
  LMI_80_AMI +
  LMI_Poverty_Federal +
  Population_No_College +
  Household_Single_Parent +
  Unemployment_Rate +
  Asthma_ED_Rate +
  COPD_ED_Rate +
  Households_Disabled +
  Low_Birth_Weight +
  MI_Hospitalization_Rate +
  Health_Insurance_Rate +
  Age_Over_65 +
  Premature_Deaths +
  Internet_Access +
  Home_Energy_Affordability +
  Homes_Built_Before_1960 +
  Rent_Percent_Income +
  Renter_Percent,
  data = neighborhoods
)

# View model summary
summary(m3)
## 
## Call:
## betareg(formula = score_transformed ~ Asian_Percent + Black_African_American_Percent + 
##     Redlining_Updated + Latino_Percent + English_Proficiency + Native_Indigenous + 
##     LMI_80_AMI + LMI_Poverty_Federal + Population_No_College + Household_Single_Parent + 
##     Unemployment_Rate + Asthma_ED_Rate + COPD_ED_Rate + Households_Disabled + 
##     Low_Birth_Weight + MI_Hospitalization_Rate + Health_Insurance_Rate + 
##     Age_Over_65 + Premature_Deaths + Internet_Access + Home_Energy_Affordability + 
##     Homes_Built_Before_1960 + Rent_Percent_Income + Renter_Percent, data = neighborhoods)
## 
## Quantile residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8714 -0.4558  0.0705  0.5865  3.1032 
## 
## Coefficients (mean model with logit link):
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -3.764465   0.165012 -22.813  < 2e-16 ***
## Asian_Percent                   0.104009   0.092109   1.129 0.258814    
## Black_African_American_Percent -0.667534   0.128545  -5.193 2.07e-07 ***
## Redlining_Updated               1.150143   0.080903  14.216  < 2e-16 ***
## Latino_Percent                  1.139559   0.121501   9.379  < 2e-16 ***
## English_Proficiency             0.652937   0.117871   5.539 3.04e-08 ***
## Native_Indigenous              -0.010918   0.064557  -0.169 0.865706    
## LMI_80_AMI                     -0.002463   0.001585  -1.554 0.120234    
## LMI_Poverty_Federal             0.549821   0.143360   3.835 0.000125 ***
## Population_No_College           0.358932   0.138456   2.592 0.009531 ** 
## Household_Single_Parent         0.308251   0.101872   3.026 0.002479 ** 
## Unemployment_Rate               0.141693   0.086313   1.642 0.100667    
## Asthma_ED_Rate                  4.275106   0.193620  22.080  < 2e-16 ***
## COPD_ED_Rate                   -3.625382   0.129307 -28.037  < 2e-16 ***
## Households_Disabled            -0.513834   0.100326  -5.122 3.03e-07 ***
## Low_Birth_Weight               -0.928449   0.141122  -6.579 4.73e-11 ***
## MI_Hospitalization_Rate         0.537703   0.092035   5.842 5.15e-09 ***
## Health_Insurance_Rate           0.336740   0.096208   3.500 0.000465 ***
## Age_Over_65                     0.616298   0.110170   5.594 2.22e-08 ***
## Premature_Deaths                0.297739   0.138457   2.150 0.031523 *  
## Internet_Access                 0.218467   0.112939   1.934 0.053067 .  
## Home_Energy_Affordability      -0.543592   0.123137  -4.415 1.01e-05 ***
## Homes_Built_Before_1960        -0.305146   0.090530  -3.371 0.000750 ***
## Rent_Percent_Income             0.221799   0.097779   2.268 0.023306 *  
## Renter_Percent                  0.217938   0.166730   1.307 0.191168    
## 
## Phi coefficients (precision model with identity link):
##       Estimate Std. Error z value Pr(>|z|)    
## (phi)   2.3509     0.0628   37.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Type of estimator: ML (maximum likelihood)
## Log-likelihood:  3870 on 26 Df
## Pseudo R-squared:  0.66
## Number of iterations: 101 (BFGS) + 1 (Fisher scoring)

The m3 model assesses how structural, demographic, and health-related vulnerabilities are associated with the disadvantage score across NYC neighborhoods. With a pseudo R-squared of 0.66, the model explains approximately 66% of the variation in the transformed disadvantage score—a strong model fit given the complexity of the predictors involved.

Significant Positive Predictors

The following variables are statistically significant (p < 0.05) and show a positive relationship with neighborhood disadvantage—meaning as their values increase, the predicted disadvantage score also increases:

  • Black_African_American_Percent (+0.66): Neighborhoods with higher percentages of Black residents are associated with higher disadvantage scores, reflecting structural racial disparities.

  • Redlining_Updated (+0.15): Historically redlined areas continue to experience significantly higher disadvantage.

  • Latino_Percent, Native_Indigenous, and LMI_Poverty_Federal: These also show strong positive relationships with disadvantage, reinforcing the role of race, ethnicity, and poverty.

  • Population_No_College and Household_Single_Parent: Educational attainment and family structure remain key socioeconomic drivers of neighborhood vulnerability.

  • Asthma_ED_Rate, COPD_ED_Rate, Low_Birth_Weight, and MI_Hospitalization_Rate: These health outcomes are all positively associated with higher predicted disadvantage, suggesting a strong relationship between chronic illness and systemic vulnerability.

  • Premature_Deaths and Health_Insurance_Rate: These indicators reinforce the role of health inequity in structurally disadvantaged communities.

  • Homes_Built_Before_1960, Rent_Percent_Income, and Renter_Percent: Older housing stock and rent burden are strong indicators of material hardship contributing to disadvantage.

#️## Non-Significant or Weak Predictors A few variables did not reach statistical significance in this model:

  • Asian_Percent and Unemployment_Rate were not statistically significant, possibly due to interaction effects or collinearity with other variables.

  • Internet_Access had a relatively high p-value (~0.13), suggesting it may not independently drive disadvantage scores after accounting for other structural variables.

Visualization of Model-Based Predictions at Minimum and Maximum Values for Population and Health Risk Factors

Just like in model m2, I applied the same simulation-based approach to the m3 model to visualize the effect of each population and health-related predictor on the predicted disadvantage score. After fitting the m3 beta regression model, I created a list of variables (vars) that includes all predictors used in the model, excluding Mobile_Homes due to the variable containing negligible or zero values across NYC neighborhoods Using purrr::map_dfr(), I generated a new data set that simulates two scenarios for each variable: one where the variable is set to its minimum observed value, and another where it is set to its maximum, while holding all other variables constant. These values are labeled as “Minimum” and “Maximum” in the “Score Scaling” column, with corresponding “Score Value”s of 0 and 1.

The predictions() function from the marginaleffects package was then used to estimate the predicted disadvantage score for each of these scenarios. The results were compiled into a clean table (m3_table) that shows how each predictor affects the score across its observed range, including lower and upper confidence intervals. This method allows us to directly compare the practical effect of each population and health-related variable on disadvantage, offering a more intuitive interpretation of variable importance than raw model coefficients alone.

library(betareg)
library(marginaleffects)
library(dplyr)
library(purrr)
library(kableExtra)

# Fit model m3 (Mobile_Homes excluded)
m3 <- betareg(score_transformed ~ 
  Asian_Percent +
  Black_African_American_Percent +
  Redlining_Updated +
  Latino_Percent +
  English_Proficiency +
  Native_Indigenous +
  LMI_80_AMI +
  LMI_Poverty_Federal +
  Population_No_College +
  Household_Single_Parent +
  Unemployment_Rate +
  Asthma_ED_Rate +
  COPD_ED_Rate +
  Households_Disabled +
  Low_Birth_Weight +
  MI_Hospitalization_Rate +
  Health_Insurance_Rate +
  Age_Over_65 +
  Premature_Deaths +
  Internet_Access +
  Home_Energy_Affordability +
  Homes_Built_Before_1960 +
  Rent_Percent_Income +
  Renter_Percent,
  data = neighborhoods
)

# Variables to simulate (no Mobile_Homes)
vars <- c(
  "Asian_Percent",
  "Black_African_American_Percent",
  "Redlining_Updated",
  "Latino_Percent",
  "English_Proficiency",
  "Native_Indigenous",
  "LMI_80_AMI",
  "LMI_Poverty_Federal",
  "Population_No_College",
  "Household_Single_Parent",
  "Unemployment_Rate",
  "Asthma_ED_Rate",
  "COPD_ED_Rate",
  "Households_Disabled",
  "Low_Birth_Weight",
  "MI_Hospitalization_Rate",
  "Health_Insurance_Rate",
  "Age_Over_65",
  "Premature_Deaths",
  "Internet_Access",
  "Home_Energy_Affordability",
  "Homes_Built_Before_1960",
  "Rent_Percent_Income",
  "Renter_Percent"
)

# Generate prediction grid with standardized values
grids <- map_dfr(vars, function(var) {
  vals <- range(neighborhoods[[var]], na.rm = TRUE)
  newdat <- datagrid(model = m3)
  newdat_min <- newdat
  newdat_max <- newdat
  newdat_min[[var]] <- vals[1]
  newdat_max[[var]] <- vals[2]
  bind_rows(
    mutate(newdat_min, Variable = var, `Score Scaling` = "Minimum", `Score Value` = 0.0),
    mutate(newdat_max, Variable = var, `Score Scaling` = "Maximum", `Score Value` = 1.0)
  )
})

# Generate predictions and bind labels
preds <- predictions(m3, newdata = grids, type = "response") %>%
  as.data.frame() %>%
  bind_cols(grids[, c("Variable", "Score Scaling", "Score Value")])

# Clean and rename columns
m3_table <- preds %>%
  select(
    Variable,
    `Score Scaling`,
    `Score Value`,
    estimate,
    conf.low,
    conf.high
  ) %>%
  rename(
    `Predicted Score` = estimate,
    `Lower Level Confidence Interval` = conf.low,
    `Upper Level Confidence Interval` = conf.high
  ) %>%
  arrange(Variable, `Score Scaling`)

# Nicely formatted HTML table output
kable(m3_table, caption = "Predicted Disadvantage Scores at Minimum and Maximum Predictor Values (Model m3)") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "left"
  )
Predicted Disadvantage Scores at Minimum and Maximum Predictor Values (Model m3)
Variable Score Scaling Score Value Predicted Score Lower Level Confidence Interval Upper Level Confidence Interval
Age_Over_65 Maximum 1 0.3856992 0.3545863 0.4168121
Age_Over_65 Minimum 0 0.2531797 0.2340960 0.2722635
Asian_Percent Maximum 1 0.3149724 0.2957716 0.3341731
Asian_Percent Minimum 0 0.2929741 0.2700494 0.3158987
Asthma_ED_Rate Maximum 1 0.6793067 0.6480206 0.7105929
Asthma_ED_Rate Minimum 0 0.0286977 0.0218501 0.0355452
Black_African_American_Percent Maximum 1 0.2513626 0.2305165 0.2722088
Black_African_American_Percent Minimum 0 0.3956021 0.3584351 0.4327692
COPD_ED_Rate Maximum 1 0.0592940 0.0512158 0.0673723
COPD_ED_Rate Minimum 0 0.7024461 0.6768887 0.7280036
English_Proficiency Maximum 1 0.3667724 0.3421449 0.3913999
English_Proficiency Minimum 0 0.2316486 0.2066701 0.2566270
Health_Insurance_Rate Maximum 1 0.3360130 0.3161372 0.3558887
Health_Insurance_Rate Minimum 0 0.2654459 0.2424316 0.2884601
Home_Energy_Affordability Maximum 1 0.2620501 0.2418207 0.2822794
Home_Energy_Affordability Minimum 0 0.3714254 0.3394122 0.4034385
Homes_Built_Before_1960 Maximum 1 0.2814433 0.2653092 0.2975773
Homes_Built_Before_1960 Minimum 0 0.3470046 0.3203452 0.3736640
Household_Single_Parent Maximum 1 0.3355294 0.3134656 0.3575931
Household_Single_Parent Minimum 0 0.2706095 0.2473062 0.2939128
Households_Disabled Maximum 1 0.2522277 0.2314101 0.2730453
Households_Disabled Minimum 0 0.3605357 0.3366293 0.3844421
Internet_Access Maximum 1 0.3261435 0.3027962 0.3494907
Internet_Access Minimum 0 0.2800635 0.2536595 0.3064675
LMI_80_AMI Maximum 1 0.2855904 0.2596732 0.3115075
LMI_80_AMI Minimum 0 0.3383699 0.2945809 0.3821588
LMI_Poverty_Federal Maximum 1 0.3547744 0.3269648 0.3825841
LMI_Poverty_Federal Minimum 0 0.2408658 0.2093453 0.2723863
Latino_Percent Maximum 1 0.4085047 0.3837260 0.4332833
Latino_Percent Minimum 0 0.1809819 0.1586306 0.2033333
Low_Birth_Weight Maximum 1 0.2332588 0.2121069 0.2544106
Low_Birth_Weight Minimum 0 0.4348391 0.3927416 0.4769366
MI_Hospitalization_Rate Maximum 1 0.3666825 0.3432019 0.3901631
MI_Hospitalization_Rate Minimum 0 0.2527588 0.2342324 0.2712851
Native_Indigenous Maximum 1 0.3039584 0.2857032 0.3222137
Native_Indigenous Minimum 0 0.3062732 0.2921343 0.3204121
Population_No_College Maximum 1 0.3430982 0.3122988 0.3738975
Population_No_College Minimum 0 0.2672980 0.2383943 0.2962018
Premature_Deaths Maximum 1 0.3322140 0.3055353 0.3588926
Premature_Deaths Minimum 0 0.2697552 0.2373899 0.3021205
Redlining_Updated Maximum 1 0.4212200 0.4016487 0.4407912
Redlining_Updated Minimum 0 0.1893447 0.1742923 0.2043971
Rent_Percent_Income Maximum 1 0.3271107 0.3058468 0.3483747
Rent_Percent_Income Minimum 0 0.2802785 0.2574772 0.3030797
Renter_Percent Maximum 1 0.3218308 0.2951110 0.3485507
Renter_Percent Minimum 0 0.2762162 0.2330132 0.3194192
Unemployment_Rate Maximum 1 0.3186517 0.3001579 0.3371456
Unemployment_Rate Minimum 0 0.2887079 0.2673022 0.3101135

Key Findings from the m3 Model Prediction Table

The simulation results for the m3 model provide valuable insight into how demographic, health, and socioeconomic indicators affect predicted disadvantage scores across NYC neighborhoods. By comparing predicted scores at each variable’s minimum and maximum observed values, we can better understand the practical impact of each factor—beyond just statistical significance.

Several variables demonstrate substantial increases in predicted scores when moving from low to high values. For example, Low_Birth_Weight shows a jump from 0.23 to 0.43, one of the largest shifts in the table, highlighting the importance of early life health as a marker of structural disadvantage. COPD_ED_Rate and Asthma_ED_Rate follow closely, suggesting that higher rates of respiratory illness are strongly linked to increased neighborhood vulnerability. Additionally, Black_African_American_Percent moves the score from 0.30 to 0.40, further reinforcing the role of racial disparities in driving disadvantage outcomes. Other influential predictors include Households_Disabled, Health_Insurance_Rate, and Population_No_College, all of which show a notable positive relationship with predicted disadvantage.

Conversely, a few variables show minimal change or even slightly decreasing effects on the predicted score. For instance, Internet_Access increases the predicted score only marginally (from ~0.28 to 0.32), suggesting a weaker or more complex relationship. Variables like Age_Over_65 and Unemployment_Rate have relatively small differences in score predictions, indicating that while these factors are often discussed in equity contexts, their standalone impact may be less substantial when controlling for other structural variables in the model.

Overall, these results offer a clearer picture of which population and health-related characteristics have the strongest influence on structural disadvantage. They also provide a more intuitive interpretation than coefficients alone, making the findings more accessible and actionable for policy, planning, and community health initiatives.

Dumbbell Plot of Simulated Predictor Effects of for Population and Health Risk Factors

Just like in model m2, I visualized the effect of each predictor variable in the m3 model on the predicted disadvantage score using a dumbbell plot. After generating the simulation table (m3_table) showing predicted scores at the minimum and maximum values of each variable, I reshaped the data to calculate the absolute difference in predicted scores (Diff). This allowed us to rank variables by their relative impact on the outcome. Using ggplot2, I plotted the minimum and maximum predicted values for each variable with connected lines, where blue points represent the predicted score at the minimum value and red points represent the score at the maximum. This visual approach highlights which population and health-related factors have the largest influence on disadvantage and complements the model’s numeric output with an intuitive, accessible format. It also allows for easy comparison with the m2 model, helping to identify whether environmental or social vulnerabilities have a stronger effect on predicted disadvantage.

library(ggplot2)
library(dplyr)
library(tidyr)

# Prepare the plot data from final_table (which came from m3)
plot_data <- m3_table %>%
  select(Variable, `Score Scaling`, `Predicted Score`) %>%
  pivot_wider(names_from = `Score Scaling`, values_from = `Predicted Score`) %>%
  mutate(Diff = abs(Maximum - Minimum)) %>%
  arrange(desc(Diff))

# Create the plot with wrapped title and margin adjustment
ggplot(plot_data, aes(y = reorder(Variable, Diff))) +
  geom_segment(aes(x = Minimum, xend = Maximum, yend = Variable), color = "black") +
  geom_point(aes(x = Minimum), color = "blue", size = 3) +
  geom_point(aes(x = Maximum), color = "red", size = 3) +
  labs(
    title = "Change in Predicted Disadvantage Score\nFrom Minimum to Maximum Predictor Values\n(Population and Health Factors)",
    x = "Predicted Score",
    y = "Variable",
    caption = "Blue = Score at Minimum | Red = Score at Maximum"
  ) +
  theme_bw() +
  theme(
    axis.text.y = element_text(face = "bold", color = "black", size = 11),
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.title = element_text(size = 12),
    plot.caption = element_text(size = 9, color = "gray30", hjust = 1),
    plot.margin = margin(t = 10, r = 60, b = 10, l = 10)  # extra space on the right
  )

This dumbbell plot visually represents the change in predicted disadvantage scores when each m3 predictor variable increases from its minimum to maximum observed value, while all other variables are held constant. Each variable is plotted along the y-axis, with blue points indicating the predicted score at the minimum value and red points showing the predicted score at the maximum. The longer the connecting line, the greater the variable’s influence on predicted disadvantage.

Strongest-Impact Predictors

  • Asthma_ED_Rate and COPD_ED_Rate show the greatest shifts in predicted scores, indicating that respiratory illness—particularly emergency room visits for asthma and chronic obstructive pulmonary disease—are strong markers of structural vulnerability in neighborhoods.

  • Redlining_Updated shows a substantial increase in predicted disadvantage, confirming the lasting effect of historically discriminatory housing policies.

  • Latino_Percent, Low_Birth_Weight, and Black_African_American_Percent also show large score jumps, reinforcing how racial and ethnic composition and early health disparities contribute significantly to disadvantage.

Variables Where Higher Values Lower the Predicted Score

  • Internet_Access Interpretation: Higher levels of internet access are associated with lower predicted disadvantage.

Possible Explination: Internet access can serve as a proxy for digital inclusion, educational access, employment opportunities, and social connectedness—all of which may buffer against structural disadvantage. This finding suggests that investment in digital infrastructure may be an important equity tool.

  • Age_Over_65 Interpretation: Neighborhoods with a higher proportion of residents aged 65+ show slightly lower predicted scores.

Possible Explanation: This could reflect areas with more stable housing, better healthcare access (e.g., Medicare), or fewer overlapping social vulnerabilities. However, the effect is relatively small, so this might vary depending on the neighborhood context.

  • Unemployment_Rate Interpretation: Interestingly, higher unemployment rates are associated with slightly lower predicted disadvantage scores in this model.

Possible Explanation: This may seem counter intuitive, but it could result from co linearity with stronger socioeconomic variables (e.g., poverty, education) already in the model. Alternatively, it may reflect neighborhoods where short-term unemployment is higher, but other structural supports are in place.

Why this matters

These inverse relationships remind us that not all disadvantage indicators function linearly, and the context of each variable matters. A high value doesn’t always equate to higher vulnerability—especially if it coincides with community-level resources, policy protections, or demographic patterns that buffer risk. These findings also point to opportunities for targeted investment, such as improving digital access, which may have measurable impacts on neighborhood well-being.

Conclusion: Highlighting the Key Predictors of NYC Neighborhood Disadvantage Scores

The purpose of this analysis was to identify and better understand the key factors that contribute to New York City’s Disadvantage Score, a metric used to assess cumulative burdens in communities across environmental, health, and socioeconomic domains. By fitting two beta regression models—one focused on environmental and climate risk factors (m2) and the other on population and health vulnerabilities (m3), I was able to assess how a wide range of variables influence this score and quantify their impact across real-world conditions.

In the m2 model, I found that factors like air pollution (Particulate Matter 2.5), low vegetative cover, and days above 90 degrees (projected climate risk) were among the most impactful predictors of disadvantage. These results confirm the strong relationship between environmental exposure and structural vulnerability, especially in areas with limited green infrastructure or concentrated pollution sources. Simulations revealed that even small changes in environmental conditions can meaningfully shift the predicted disadvantage score, pointing to the importance of environmental justice in urban policy and planning.

In contrast, the m3 model highlighted the influence of health and demographic inequities. Variables such as asthma and COPD emergency visit rates, low birth weight, lack of health insurance, and the legacy of historical redlining were linked to higher predicted disadvantage scores. These findings reflect how chronic health burdens and social determinants of health—particularly in communities of color and low-income populations—play a central role in perpetuating systemic disadvantage. Simulation-based predictions illustrated how these factors, when at their most severe levels, can drastically increase the expected burden on a community.

Together, these models underscore that structural disadvantage is driven by an intersection of environmental stressors and social inequities. By identifying which variables have the greatest impact on the Disadvantage Score, this analysis provides critical insight for targeted interventions. Whether through environmental remediation, improved healthcare access, or investment in historically excluded communities, the findings offer a data-driven foundation for advancing equity across New York City neighborhoods. - Michael Morello