Ahm Hamza 05.13.2025
This project aims to address the critical challenge of identifying gentrification patterns in New York City’s five boroughs (Bronx, Brooklyn, Queens, Manhattan, and Staten Island) using census data from 2015 to 2022. Gentrification is a complex urban phenomenon that significantly impacts housing affordability, neighborhood demographics, and community stability across major urban centers. In New York City, rapid neighborhood change has created challenges for urban planners, policymakers, and community organizations who need better tools to anticipate and manage these transformations. By leveraging demographic, economic, and housing indicators, this project develops predictive models to assess gentrification risk and identify neighborhoods at high risk of displacement. Through advanced machine learning techniques, the project aims to provide actionable insights for prioritizing interventions that can mitigate displacement, ensure affordable housing, and promote sustainable urban development, ultimately fostering more equitable communities in NYC.
This project aims to translate urban policy interventions into a predictive classification framework to identify neighborhoods at risk of gentrification in New York City. Gentrification, a complex socio-economic phenomenon, poses significant risks of displacement for lower-income communities, altering the social fabric and pushing vulnerable populations out of traditionally affordable areas. The study aim to translate urban policy interventions into a predictive classification framework to identify neighborhoods at risk of gentrification by answering which census tracts are likely to experience high levels of gentrification in the coming years, based on key demographic, economic, and housing indicators?
More specific goal includes:
1.Predict binary gentrification risk (high vs. low) for NYC census tracts, based on a rich set of demographic, economic, and housing variables.
2.Identify the most important predictors of gentrification risk: Which variables (e.g., income, housing affordability, educational attainment, vacancy rates) are the strongest indicators of gentrification? Understanding these relationships can guide effective policy decisions.
3.Compare traditional statistical methods with modern machine learning approaches: Utilize regression models, decision trees, random forests, support vector machines, and deep learning techniques to predict gentrification risk. Evaluate and compare the performance of these models using cross-validation and metrics like accuracy, precision, recall, and F1-score.
4.Develop an actionable risk assessment framework for prioritizing interventions: Using the results from our predictive models, we will develop a framework that helps urban planners and local policymakers prioritize neighborhoods at high risk of gentrification. The goal is to inform strategies to protect vulnerable populations, increase affordable housing stock, and ensure inclusive urban development.
5.Analyze the social and economic drivers of gentrification: Beyond prediction, we will explore the socio-economic dynamics driving gentrification processes. This includes examining factors like the “creative class” (young professionals in management, business, and technology sectors), public transportation accessibility, and housing affordability.
6.Leverage temporal data for trend forecasting: Using longitudinal census data from 2015 to 2022, we will capture the evolving patterns of gentrification. By analyzing trends over time, we can improve our ability to predict future risk and adapt to changing urban dynamics.
Data and library loading
#load necessary packages
rq_packages <- c("ada", "caret", "corrplot", "dplyr", "GGally", "ggcorrplot",
"ggplot2", "gridExtra", "glmnet", "knitr", "mice", "neuralnet", "naivebayes", "naniar", "nnet", "pROC",
"randomForest", "rpart", "rpart.plot", "ROSE", "rsample",
"sf", "keras", "scales", "smotefamily", "tigris", "tidyverse",
"tidyr", "tfdatasets", "tensorflow", "pROC", "tmap", "yardstick", "randomForest", "rpart", "xgboost","VIM")
#install and load packages
for (pkg in rq_packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg)
library(pkg, character.only = TRUE)
}
}
# read the CSV file
df <- read.csv('census_data_2015_2022_gentrification.csv')
# display the first 15 rows
#head(df, 5)
# structure and summary
# str(df)
summary(df)
## median_gross_rent median_income total_edu_pop high_school_grad
## Min. :-666666666 Min. :-666666666 Min. : 0 Min. : 0.0
## 1st Qu.: 1222 1st Qu.: 42886 1st Qu.: 1600 1st Qu.: 288.0
## Median : 1450 Median : 63203 Median : 2416 Median : 475.0
## Mean : -29685548 Mean : -26136377 Mean : 2679 Mean : 538.4
## 3rd Qu.: 1734 3rd Qu.: 85461 3rd Qu.: 3399 3rd Qu.: 720.2
## Max. : 3501 Max. : 250001 Max. :22006 Max. :5763.0
## NA's :8 NA's :7
## some_college_1 some_college_2 assoc_degree_1 assoc_degree_2
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.0 1st Qu.: 40.00 1st Qu.: 146.0 1st Qu.: 84.0
## Median : 75.0 Median : 73.00 Median : 234.0 Median : 144.0
## Mean : 100.4 Mean : 92.97 Mean : 275.9 Mean : 171.8
## 3rd Qu.: 137.0 3rd Qu.: 123.00 3rd Qu.: 363.0 3rd Qu.: 227.0
## Max. :1314.0 Max. :1672.00 Max. :3589.0 Max. :2690.0
##
## bachelor masters professional doctorate
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 247.0 1st Qu.: 87.0 1st Qu.: 8.00 1st Qu.: 0.00
## Median : 429.0 Median : 181.0 Median : 31.00 Median : 14.00
## Mean : 591.5 Mean : 298.2 Mean : 85.05 Mean : 40.94
## 3rd Qu.: 751.0 3rd Qu.: 370.0 3rd Qu.: 84.00 3rd Qu.: 42.00
## Max. :6300.0 Max. :4059.0 Max. :1861.00 Max. :1250.00
##
## total_pop white black asian
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.00
## 1st Qu.: 2328 1st Qu.: 117 1st Qu.: 45.0 1st Qu.: 63.75
## Median : 3510 Median : 712 Median : 260.0 Median : 241.00
## Mean : 3820 Mean : 1223 Mean : 830.2 Mean : 531.65
## 3rd Qu.: 4909 3rd Qu.: 1795 3rd Qu.: 1340.2 3rd Qu.: 716.00
## Max. :29256 Max. :13699 Max. :17123.0 Max. :7994.00
##
## hispanic total_units owner_occupied poverty_universe
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 273 1st Qu.: 772 1st Qu.: 184.0 1st Qu.: 2291
## Median : 620 Median : 1243 Median : 360.0 Median : 3456
## Mean : 1108 Mean : 1427 Mean : 465.2 Mean : 3758
## 3rd Qu.: 1406 3rd Qu.: 1830 3rd Qu.: 594.0 3rd Qu.: 4837
## Max. :13429 Max. :12998 Max. :5905.0 Max. :28186
##
## below_poverty labor_force unemployed median_home_value
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. :-666666666
## 1st Qu.: 222.0 1st Qu.: 1164 1st Qu.: 62.0 1st Qu.: 396700
## Median : 461.0 Median : 1752 Median : 115.0 Median : 560200
## Mean : 698.6 Mean : 1971 Mean : 149.7 Mean : -81549249
## 3rd Qu.: 925.0 3rd Qu.: 2498 3rd Qu.: 197.0 3rd Qu.: 782900
## Max. :5538.0 Max. :14485 Max. :1705.0 Max. : 2000001
## NA's :75
## rent_as_pct_income gini_index housing_units_total
## Min. :-666666666 Min. :-666666666 Min. : 0
## 1st Qu.: 27 1st Qu.: 0 1st Qu.: 772
## Median : 32 Median : 0 Median : 1243
## Mean : -26380749 Mean : -22152372 Mean : 1427
## 3rd Qu.: 38 3rd Qu.: 0 3rd Qu.: 1830
## Max. : 51 Max. : 1 Max. :12998
##
## owner_cost_burdened renter_cost_burdened housing_age_total
## Min. : 0.0 Min. : 0.00 Min. : 0
## 1st Qu.: 355.0 1st Qu.: 0.00 1st Qu.: 848
## Median : 738.0 Median : 9.00 Median : 1365
## Mean : 962.2 Mean : 22.43 Mean : 1572
## 3rd Qu.:1351.0 3rd Qu.: 30.00 3rd Qu.: 1995
## Max. :8209.0 Max. :484.00 Max. :13036
##
## housing_built_2010_later housing_built_2000_2009 housing_total
## Min. : 0.000 Min. : 0.00 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 848
## Median : 0.000 Median : 0.00 Median : 1365
## Mean : 8.252 Mean : 34.37 Mean : 1572
## 3rd Qu.: 0.000 3rd Qu.: 23.00 3rd Qu.: 1995
## Max. :2619.000 Max. :5073.00 Max. :13036
##
## vacant_units pop_total males_25_34_1 males_25_34_2
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 47.0 1st Qu.: 2328 1st Qu.: 72.0 1st Qu.: 68.0
## Median : 91.0 Median : 3510 Median : 129.0 Median : 124.0
## Mean : 144.9 Mean : 3820 Mean : 167.9 Mean : 160.1
## 3rd Qu.: 162.0 3rd Qu.: 4909 3rd Qu.: 220.0 3rd Qu.: 213.0
## Max. :2572.0 Max. :29256 Max. :1534.0 Max. :1358.0
##
## females_25_34_1 females_25_34_2 employed_pop prof_male
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 76.0 1st Qu.: 74.0 1st Qu.: 1067 1st Qu.: 122.0
## Median : 140.0 Median : 133.0 Median : 1611 Median : 229.0
## Mean : 179.4 Mean : 166.1 Mean : 1821 Mean : 356.7
## 3rd Qu.: 234.0 3rd Qu.: 222.0 3rd Qu.: 2309 3rd Qu.: 437.0
## Max. :2034.0 Max. :1635.0 Max. :13477 Max. :4542.0
##
## prof_female commuters_total public_transit walk_to_work
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 174.0 1st Qu.: 1037 1st Qu.: 456.0 1st Qu.: 36.0
## Median : 293.0 Median : 1568 Median : 783.0 Median : 91.0
## Mean : 407.5 Mean : 1774 Mean : 956.7 Mean : 174.5
## 3rd Qu.: 503.0 3rd Qu.: 2248 3rd Qu.:1290.0 3rd Qu.: 189.0
## Max. :4545.0 Max. :13216 Max. :7845.0 Max. :3209.0
##
## other_commute work_from_home tenure_total recent_owners
## Min. : 0.00 Min. : 0.0 Min. : 0 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 23.0 1st Qu.: 772 1st Qu.: 0.00
## Median : 0.00 Median : 58.0 Median : 1243 Median : 0.00
## Mean : 12.11 Mean : 115.1 Mean : 1427 Mean : 11.04
## 3rd Qu.: 15.00 3rd Qu.: 134.0 3rd Qu.: 1830 3rd Qu.: 14.00
## Max. :692.00 Max. :2928.0 Max. :12998 Max. :408.00
##
## GEO_ID state county tract
## Length:17816 Min. :36 Min. : 5.00 Min. : 100
## Class :character 1st Qu.:36 1st Qu.:47.00 1st Qu.: 15400
## Mode :character Median :36 Median :47.00 Median : 30202
## Mean :36 Mean :54.83 Mean : 42489
## 3rd Qu.:36 3rd Qu.:81.00 3rd Qu.: 58100
## Max. :36 Max. :85.00 Max. :990100
##
## recent_renters year borough less_than_hs
## Min. : 0.00 Min. :2015 Length:17816 Min. : 0.0
## 1st Qu.: 5.00 1st Qu.:2017 Class :character 1st Qu.: 174.0
## Median : 31.00 Median :2019 Mode :character Median : 353.0
## Mean : 62.68 Mean :2019 Mean : 484.1
## 3rd Qu.: 80.00 3rd Qu.:2021 3rd Qu.: 643.0
## Max. :1466.00 Max. :2022 Max. :4718.0
##
## hs_grad some_college college_plus pct_less_than_hs
## Min. : 0.0 Min. : 0 Min. : 0 Min. :0.0000
## 1st Qu.: 288.0 1st Qu.: 366 1st Qu.: 377 1st Qu.:0.0942
## Median : 475.0 Median : 563 Median : 668 Median :0.1610
## Mean : 538.4 Mean : 641 Mean : 1016 Mean :0.1821
## 3rd Qu.: 720.2 3rd Qu.: 827 3rd Qu.: 1245 3rd Qu.:0.2578
## Max. :5763.0 Max. :8304 Max. :11324 Max. :1.0000
## NA's :464
## pct_hs_grad pct_some_college pct_college_plus pct_white
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1569 1st Qu.:0.1888 1st Qu.:0.2011 1st Qu.:0.0426
## Median :0.2171 Median :0.2504 Median :0.3054 Median :0.2343
## Mean :0.2120 Mean :0.2490 Mean :0.3569 Mean :0.3223
## 3rd Qu.:0.2695 3rd Qu.:0.3096 3rd Qu.:0.4605 3rd Qu.:0.5797
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :464 NA's :464 NA's :464 NA's :464
## pct_black pct_asian pct_hispanic pct_other
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0161 1st Qu.:0.0228 1st Qu.:0.0951 1st Qu.:0.0096
## Median :0.0804 Median :0.0728 Median :0.1889 Median :0.0234
## Mean :0.2307 Mean :0.1422 Mean :0.2698 Mean :0.0350
## 3rd Qu.:0.3811 3rd Qu.:0.2012 3rd Qu.:0.3982 3rd Qu.:0.0445
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :0.4494
## NA's :464 NA's :464 NA's :464 NA's :464
## homeownership_rate poverty_rate unemployment_rate young_adult_rate
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1609 1st Qu.:0.0815 1st Qu.:0.0426 1st Qu.:0.1263
## Median :0.3353 Median :0.1439 Median :0.0670 Median :0.1576
## Mean :0.3723 Mean :0.1775 Mean :0.0776 Mean :0.1732
## 3rd Qu.:0.5617 3rd Qu.:0.2444 3rd Qu.:0.1012 3rd Qu.:0.2014
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :537 NA's :492 NA's :504 NA's :464
## professional_rate alt_commute_rate new_housing_rate recent_move_rate
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2560 1st Qu.:0.5472 1st Qu.:0.0000 1st Qu.:0.0136
## Median :0.3627 Median :0.7185 Median :0.0000 Median :0.0364
## Mean :0.3930 Mean :0.6793 Mean :0.0256 Mean :0.0480
## 3rd Qu.:0.4980 3rd Qu.:0.8289 3rd Qu.:0.0228 3rd Qu.:0.0695
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :507 NA's :507 NA's :523 NA's :537
## vacancy_rate cost_burden_rate gentrification_pressure displacement_risk
## Min. :0.0000 Min. :0.0000 Min. :-0.0302 Min. :-1333333.3
## 1st Qu.:0.0449 1st Qu.:0.4464 1st Qu.: 0.3974 1st Qu.: 0.4
## Median :0.0732 Median :0.6774 Median : 0.4970 Median : 0.5
## Mean :0.0869 Mean :0.6424 Mean : 0.5261 Mean : -12963.2
## 3rd Qu.:0.1095 3rd Qu.:0.8598 3rd Qu.: 0.5943 3rd Qu.: 0.6
## Max. :1.0000 Max. :1.6370 Max. : 9.1769 Max. : 0.8
## NA's :523 NA's :537 NA's :596 NA's :537
To operationalize gentrification as a multidimensional, longitudinal process, this study standardizes and integrates a range of social, economic, and housing indicators. Core variables such as household income, professional employment, racial composition, housing costs, and mobility patterns are transformed into standardized z-scores. This enables cross-neighborhood and cross-year comparison while addressing scale disparities across variables (Zuk et al., 2018). The sign on specific variables such as vacancy rate is reversed to reflect their inverse relationship with neighborhood stability and desirability (Hackworth & Smith, 2001).
To capture temporal shifts, the methodology includes three-year change variables and relative growth rates for key indicators such as income, rent, and racial composition. These changes are contextualized against citywide averages to identify neighborhoods experiencing disproportionately rapid change (Chapple & Zuk, 2016; Urban Spatial, 2019). This longitudinal approach moves beyond static, cross-sectional analyses and instead reflects the dynamic nature of neighborhood transformation over time.
In addition to primary indicators, interaction metrics, such as, diversity indices and race-class interactions are introduced. These are particularly salient in urban contexts, where gentrification often manifests through racialized economic uplift and displacement. Metrics such as the interaction between percent White and income, or the rate of increase in the White population, provide insights into racial turnover and economic stratification (Crawford & Griffin, 2023; Zuk et al., 2018).
Two composite indices are central to this framework: the Displacement Vulnerability Index and the Gentrification Momentum Index. The Displacement Vulnerability Index synthesizes variables such as poverty rate, housing cost burden, renter percentage, and lower educational attainment, highlighting communities under economic strain ( Marcuse, 1985; Newman & Wyly, 2006). The Gentrification Momentum Index captures neighborhood transformation through socio-economic uplift indicators like rising income, rent, education levels, and influx of young professionals normalized via z-scores over time. These composite indices follow the methodological precedent established by similar efforts, including the Urban Displacement Project’s Housing Precarity Risk Model and the Green Gentrification Vulnerability Index (Chapple et al., 2021; Rigolon & Christensen, 2024).
The introduction of two novel composite indices displacement vulnerability and gentrification momentum adds interpretability and predictive power. The former measures economic pressure experienced by lower-income, renter-heavy, and racially diverse communities, while the latter captures the rate and direction of neighborhood change through key temporal z-scores. These composite indicators distill complex patterns into actionable variables for machine learning models.
Referance:
Chapple, K., & Zuk, M. (2016). Forewarned: The use of neighborhood early warning systems for gentrification and displacement. Cityscape, 18(3), 109–130.
Chapple, K., Zuk, M., Loukaitou-Sideris, A., Ong, P., & Thomas, T. (2021). Urban displacement project: Housing precarity risk model. Retrieved from https://www.urbandisplacement.org
Crawford, J., & Griffin, J. (2023). Race, class, and the temporal dynamics of gentrification. Social Problems. Advance online publication. https://doi.org/10.1093/socpro/spaf014
Hackworth, J., & Smith, N. (2001). The changing state of gentrification. TESG, 92(4), 464–477. https://doi.org/10.1111/1467-9663.00172
Marcuse, P. (1985). Gentrification, abandonment, and displacement: Connections, causes, and policy responses in New York City. Journal of Urban and Contemporary Law, 28, 195–240.
Newman, K., & Wyly, E. (2006). The right to stay put, revisited: Gentrification and resistance to displacement in New York City. Urban Studies, 43(1), 23–57. https://doi.org/10.1080/00420980500388710
Rigolon, A., & Christensen, J. (2024). Green Gentrification Vulnerability Index (GGVI): Measuring displacement risk in U.S. cities. Cities, 142, 104546. https://doi.org/10.1016/j.cities.2024.104546
Urban Spatial. (2019). Predicting gentrification using longitudinal census data. Retrieved from https://urbanspatialanalysis.com/portfolio/predicting-gentrification-using-longitudinal-census-data/
Zuk, M., Bierbaum, A. H., Chapple, K., Gorska, K., & Loukaitou-Sideris, A. (2018). Gentrification, displacement, and the role of public investment. Journal of Planning Literature, 33(1), 31–44. https://doi.org/10.1177/0885412217716439
# Convert year, borough, etc., to factors if needed
df$year <- as.factor(df$year)
df$borough <- as.factor(df$borough)
# Rename columns for consistency
df <- df %>%
rename_with(~ gsub("\\s+", "_", .x)) %>%
rename_with(tolower)
# colnames(df)
###Check for missing
#count missing or zero values per variable
df %>%
summarise(across(everything(), ~ sum(is.na(.) | . == 0))) %>%
gather(variable, missing_or_zero)
## variable missing_or_zero
## 1 median_gross_rent 8
## 2 median_income 7
## 3 total_edu_pop 464
## 4 high_school_grad 574
## 5 some_college_1 1293
## 6 some_college_2 884
## 7 assoc_degree_1 564
## 8 assoc_degree_2 678
## 9 bachelor 537
## 10 masters 702
## 11 professional 3320
## 12 doctorate 5720
## 13 total_pop 464
## 14 white 814
## 15 black 1613
## 16 asian 1847
## 17 hispanic 582
## 18 total_units 537
## 19 owner_occupied 987
## 20 poverty_universe 492
## 21 below_poverty 604
## 22 labor_force 504
## 23 unemployed 735
## 24 median_home_value 75
## 25 rent_as_pct_income 0
## 26 gini_index 0
## 27 housing_units_total 537
## 28 owner_cost_burdened 587
## 29 renter_cost_burdened 7803
## 30 housing_age_total 523
## 31 housing_built_2010_later 15106
## 32 housing_built_2000_2009 10363
## 33 housing_total 523
## 34 vacant_units 1034
## 35 pop_total 464
## 36 males_25_34_1 654
## 37 males_25_34_2 677
## 38 females_25_34_1 693
## 39 females_25_34_2 669
## 40 employed_pop 507
## 41 prof_male 619
## 42 prof_female 625
## 43 commuters_total 507
## 44 public_transit 563
## 45 walk_to_work 1276
## 46 other_commute 10138
## 47 work_from_home 1743
## 48 tenure_total 537
## 49 recent_owners 10416
## 50 geo_id 0
## 51 state 0
## 52 county 0
## 53 tract 0
## 54 recent_renters 4228
## 55 year 0
## 56 borough 0
## 57 less_than_hs 641
## 58 hs_grad 574
## 59 some_college 518
## 60 college_plus 508
## 61 pct_less_than_hs 641
## 62 pct_hs_grad 574
## 63 pct_some_college 518
## 64 pct_college_plus 508
## 65 pct_white 814
## 66 pct_black 1613
## 67 pct_asian 1847
## 68 pct_hispanic 582
## 69 pct_other 1971
## 70 homeownership_rate 987
## 71 poverty_rate 604
## 72 unemployment_rate 735
## 73 young_adult_rate 533
## 74 professional_rate 558
## 75 alt_commute_rate 528
## 76 new_housing_rate 9366
## 77 recent_move_rate 3025
## 78 vacancy_rate 1034
## 79 cost_burden_rate 587
## 80 gentrification_pressure 596
## 81 displacement_risk 537
#low missing <5% mena median imput
low_na_cols <- c("median_income", "median_gross_rent", "median_home_value")
for (col in low_na_cols) {
df[[col]][df[[col]] == 0 | is.na(df[[col]])] <- median(df[[col]][df[[col]] != 0], na.rm = TRUE)
}
#moderate missing (5%–30%) → KNN Imputation
cols_for_knn <- c("professional", "recent_move_rate", "housing_built_2010_later",
"vacancy_rate", "renter_cost_burdened", "recent_owners",
"recent_renters", "professional_rate", "alt_commute_rate")
df[cols_for_knn] <- VIM::kNN(df[cols_for_knn], k = 5, imp_var = FALSE)
z_vars <- c(
"median_income", "professional_rate", "pct_college_plus",
"pct_white", "young_adult_rate",
"median_gross_rent", "median_home_value", "cost_burden_rate", "vacancy_rate",
"recent_move_rate", "recent_renters", "recent_owners",
"new_housing_rate", "housing_built_2010_later",
"homeownership_rate"
)
# standardize in long format (better memory handling)
df_z <- df %>%
arrange(geo_id, year) %>%
group_by(geo_id) %>%
mutate(across(all_of(z_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
ungroup()
# adjust vacancy rate sign (negative z-score)
df_z <- df_z %>%
mutate(z_vacancy_rate = -z_vacancy_rate)
# create gentrification index from the newly computed z-scores
df_z <- df_z %>%
rowwise() %>%
mutate(gentrification_index = mean(c_across(starts_with("z_")), na.rm = TRUE)) %>%
ungroup()
# bin into quantiles (efficient post-processing step)
df_z <- df_z %>%
mutate(gentrification_quantile = ntile(gentrification_index, 5))
EDA begins with reviewing the dataset’s structure, checking for null values, and examining summary statistics. Several variables had significant missing values after futures engineering and indexing, particularly for older years and certain neighborhoods. Housing variables (median home value, rent) had low missingness (<5%), while temporal change metrics and specialized indicators had moderate missingness (5-30%). Univariate analysis involves visualizing distributions of key variables, while bivariate analysis explores relationships between variables. Feature engineering creates new features to improve model prediction, and geospatial analysis maps gentrification trends. Initial insights identify gentrification hotspots, displacement risk, and key predictors, setting the foundation for further analysis and modeling.
#EDA ############################################################################################
# Visualize distribution
ggplot(df_z, aes(x = gentrification_index)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Distribution of Gentrification Index", x = "Index", y = "Count")
# Boxplot by borough
ggplot(df_z, aes(x = borough, y = gentrification_index, fill = borough)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Gentrification Index by Borough", x = "Borough", y = "Index") +
theme(legend.position = "none")
# Line plot over time by borough
df_z %>%
group_by(year, borough) %>%
summarise(mean_index = mean(gentrification_index, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = mean_index, color = borough, group = borough)) +
geom_line(size = 1.2) +
labs(title = "Gentrification Trends by Borough", x = "Year", y = "Mean Index") +
theme_minimal()
Missing Value Treatment:
For variables with low missingness: Median imputation For variables with moderate missingness: K-Nearest Neighbors imputation
Feature Engineering: Created standardized z-scores for 15 key variables Developed temporal change metrics (3-year changes, percentage growth rates) Generated interaction terms between demographic and economic indicators Constructed composite indices for displacement vulnerability and gentrification momentum
Target Variable Creation: Developed a continuous gentrification index from standardized indicators Transformed into quantiles and created binary (high/low) and multi-class (low/medium/high) classification targets
# yearly change
df_temporal <- df_z %>%
arrange(geo_id, year) %>%
group_by(geo_id) %>%
mutate(
# 3-year changes for key metrics
change_3yr_income = median_income - lag(median_income, 3),
change_3yr_rent = median_gross_rent - lag(median_gross_rent, 3),
change_3yr_home_value = median_home_value - lag(median_home_value, 3),
change_3yr_college = pct_college_plus - lag(pct_college_plus, 3),
change_3yr_white = pct_white - lag(pct_white, 3),
change_3yr_vacancy = vacancy_rate - lag(vacancy_rate, 3),
# Growth rates (percentage changes) for key metrics
pct_change_income = (median_income / lag(median_income, 3) - 1) * 100,
pct_change_rent = (median_gross_rent / lag(median_gross_rent, 3) - 1) * 100,
pct_change_home_value = (median_home_value / lag(median_home_value, 3) - 1) * 100,
# Rate of change compared to borough average (relative changes)
rel_income_growth = pct_change_income / mean(pct_change_income, na.rm = TRUE),
rel_rent_growth = pct_change_rent / mean(pct_change_rent, na.rm = TRUE),
rel_home_value_growth = pct_change_home_value / mean(pct_change_home_value, na.rm = TRUE)
) %>%
ungroup()
#race metrics
df_race <- df_temporal %>%
mutate(
# Diversity metrics
diversity_index = 1 - ((pct_white^2 + (1-pct_white)^2)), # Simple diversity index
# Race-economic interactions
white_income_interaction = pct_white * median_income,
white_college_interaction = pct_white * pct_college_plus,
# Racial change momentum (acceleration of demographic shifts)
white_change_acceleration = change_3yr_white - lag(change_3yr_white, 1),
# Economic-demographic pressure metrics
displacement_risk = cost_burden_rate * (1 - pct_white) * (1 / median_income),
gentrification_pressure = pct_change_rent * change_3yr_white * young_adult_rate
)
#z-temporal and demographic shift variables
z_temporal_vars <- c(
"change_3yr_rent",
"pct_change_rent",
"pct_change_income",
"change_3yr_white",
"professional_rate",
"young_adult_rate",
"cost_burden_rate",
"median_income",
"recent_renters",
"homeownership_rate"
)
df_race <- df_race %>%
group_by(year) %>% # or geo_id depending on context, but year is better for city-wide scale
mutate(across(all_of(z_temporal_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
ungroup()
# displacement composite indicator
df_features <- df_race %>%
rowwise() %>%
mutate(
# Improved displacement risk index - combines economic vulnerability with demographic stability
displacement_vulnerability_index = mean(c(
z_cost_burden_rate,
-z_median_income,
z_recent_renters,
-z_homeownership_rate,
z_change_3yr_rent
), na.rm = TRUE),
# Improved gentrification momentum index - combines rate of change metrics
gentrification_momentum = mean(c(
z_pct_change_rent,
z_pct_change_income,
z_change_3yr_white,
z_professional_rate,
z_young_adult_rate
), na.rm = TRUE)
) %>%
ungroup()
# removing missing values in the new features
# list of newly created features
new_features <- setdiff(names(df_features), names(df_z))
# temporal features have NAs for early years
# impute them with mean by borough and year for better contextual accuracy
df_features <- df_features %>%
group_by(borough, year) %>%
mutate(across(all_of(new_features), ~ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
ungroup()
# if any NAs remain, use overall means
df_features <- df_features %>%
mutate(across(all_of(new_features), ~ifelse(is.na(.) | is.infinite(.) | is.nan(.),
mean(.[is.finite(.)], na.rm = TRUE), .)))
# gentrification classes for classification
df_features <- df_features %>%
mutate(
# Create a binary target for classification (high vs low gentrification)
gentrification_binary = ifelse(gentrification_quantile >= 4, "high", "low"),
# Create a 3-level target for more nuanced classification
gentrification_level = case_when(
gentrification_quantile <= 2 ~ "low",
gentrification_quantile == 3 ~ "medium",
gentrification_quantile >= 4 ~ "high"
),
# Create a continuous target for regression tasks
gentrification_continuous = gentrification_index
)
# convert factor variables for modeling
df_features$gentrification_binary <- as.factor(df_features$gentrification_binary)
df_features$gentrification_level <- as.factor(df_features$gentrification_level)
# filter out training if zscore has missing
modeling_data <- df_features %>%
filter(!is.na(gentrification_index), !is.na(gentrification_momentum)) %>%
select(-geo_id)
#futures selection for training and test
############################################################################################
#crucial predictors for gen
predictors <- c(
# common features related to gen
"median_income", "professional_rate", "pct_college_plus", "pct_white",
"young_adult_rate", "median_gross_rent", "median_home_value", "cost_burden_rate",
# temporal features
"change_3yr_income", "change_3yr_rent", "change_3yr_home_value",
"pct_change_income", "pct_change_rent",
# race-aware features
"diversity_index", "white_income_interaction", "white_change_acceleration",
# Advanced indices
"displacement_vulnerability_index", "gentrification_momentum"
)
set.seed(123)
train_index <- createDataPartition(modeling_data$gentrification_binary, p = 0.7, list = FALSE)
train_data <- modeling_data[train_index, ]
test_data <- modeling_data[-train_index, ]
# ensure there are no missing in the predictors for training
train_data_clean <- train_data %>%
filter(if_all(all_of(predictors), ~ !is.na(.)))
Loading pre-train model
# Set model directory path
model_path <- "/Users/ahmhamza/Downloads/model/"
# Load models from .rds files
logistic_model <- readRDS(paste0(model_path, "logistic_model.rds"))
rf_importance_model <- readRDS(paste0(model_path, "rf_importance_model.rds"))
rf_model <- readRDS(paste0(model_path, "rf_model.rds"))
xgb_model <- readRDS(paste0(model_path, "xgb_model.rds"))
nn_model <- readRDS(paste0(model_path, "nn_model.rds"))
The modeling process begins with the use of a Random Forest classifier to estimate feature importance via the Gini index. Random Forests are particularly advantageous in this context due to their capacity to handle complex, non-linear relationships and interactions among features without requiring extensive preprocessing. By enabling the extraction of the MeanDecreaseGini metric, this step provided a principled basis for feature selection, ensuring that subsequent models would be built on a reduced yet informative subset of predictors.
Based on futures selection RF process I incorporate logistic regression model, used variables that were identified as highly important. Logistic regression was chosen for its interpretability and ability to establish baseline predictive performance.
Other important experimental phase included the training of three predictive models: a Random Forest classifier, XGBoost, and a feedforward neural network (multilayer perceptron) with hyperparameter tuning via grid search. The selection of Random Forest and XGBoost was guided by their complementary strengths in capturing non-linearity and their robustness to overfitting through ensembling. XGBoost, in particular, was expected to yield strong predictive performance due to its ability to perform regularized gradient boosting, which is known to generalize well on structured tabular data.
Additionally, a neural network model was implemented, with optimization over the number of hidden units and regularization decay parameters. While neural networks are often computationally intensive and less interpretable, they offer flexibility in capturing intricate patterns that may be present in the gentrification dynamics.
Each model’s performance was assessed through 5-fold cross-validation with class probability estimation and ROC-based scoring, enabling fair and robust comparison across algorithms. Finally, predictive accuracy and discriminative capacity were validated using confusion matrices and ROC curve analysis, particularly for the neural network model.
# use Random Forest for predict importance futures
model_path <- "/Users/ahmhamza/Downloads/model/"
rf_importance <- readRDS(paste0(model_path, "rf_importance_model.rds"))
# rf_importance <- randomForest(
# x = train_data_clean[, predictors],
# y = train_data_clean$gentrification_binary,
# importance = TRUE,
# ntree = 500
# )
#create directory and save models
#dir.create("Downloads/model")
#saveRDS(rf_importance, "Downloads/model/rf_importance_model.rds")
#for markdown
#rf_importance_model <- readRDS("Downloads/model/rf_importance_model.rds")
# obs variable importance
importance_df <- as.data.frame(importance(rf_importance))
importance_df$Feature <- rownames(importance_df)
importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]
print(importance_df)
## high low MeanDecreaseAccuracy
## white_change_acceleration 71.3899472 12.432226 66.01386
## change_3yr_income 27.1696106 13.803264 30.67471
## gentrification_momentum 37.0383736 38.632312 59.24590
## pct_change_income 24.0716413 10.886349 25.37463
## pct_change_rent 25.2752871 9.451630 25.29324
## displacement_vulnerability_index 17.3408009 32.100130 44.13585
## median_gross_rent 14.3934417 38.831764 48.05685
## young_adult_rate -2.6944697 43.040552 41.95014
## change_3yr_home_value 20.0206583 4.095011 18.85803
## change_3yr_rent 29.3309079 5.097043 32.44276
## professional_rate -0.6626774 39.227661 45.02353
## median_income 10.1163600 44.057355 52.47086
## cost_burden_rate 13.2463607 33.222988 42.31783
## pct_college_plus -1.4493133 36.394440 40.18808
## median_home_value 4.5852340 37.158392 39.84097
## pct_white 3.5180205 34.331211 40.17988
## diversity_index 0.7132631 35.948922 39.69164
## white_income_interaction 1.7939335 37.129448 43.96167
## MeanDecreaseGini
## white_change_acceleration 582.2233
## change_3yr_income 456.6352
## gentrification_momentum 387.8773
## pct_change_income 376.1346
## pct_change_rent 374.2196
## displacement_vulnerability_index 303.7693
## median_gross_rent 301.4398
## young_adult_rate 300.4118
## change_3yr_home_value 299.3066
## change_3yr_rent 292.2119
## professional_rate 290.8855
## median_income 282.3179
## cost_burden_rate 275.5486
## pct_college_plus 268.2974
## median_home_value 264.7184
## pct_white 250.0030
## diversity_index 249.7921
## white_income_interaction 246.6388
## Feature
## white_change_acceleration white_change_acceleration
## change_3yr_income change_3yr_income
## gentrification_momentum gentrification_momentum
## pct_change_income pct_change_income
## pct_change_rent pct_change_rent
## displacement_vulnerability_index displacement_vulnerability_index
## median_gross_rent median_gross_rent
## young_adult_rate young_adult_rate
## change_3yr_home_value change_3yr_home_value
## change_3yr_rent change_3yr_rent
## professional_rate professional_rate
## median_income median_income
## cost_burden_rate cost_burden_rate
## pct_college_plus pct_college_plus
## median_home_value median_home_value
## pct_white pct_white
## diversity_index diversity_index
## white_income_interaction white_income_interaction
# Plot top 15 important features
ggplot(importance_df[1:15,], aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Top 15 Features for Predicting Gentrification", x = "Features", y = "Importance")
#predictors for modeling (keep all the futures)
important_predictors <- importance_df$Feature[1:13]
################################################################
############################################################################################
# Logistic Regression
model_path <- "/Users/ahmhamza/Downloads/model/"
logistic_model <- readRDS(paste0(model_path, "logistic_model.rds"))
# logistic_model <- caret::train(
# gentrification_binary ~ change_3yr_rent + change_3yr_white + gentrification_momentum +
# displacement_vulnerability_index + pct_change_income +
# white_income_interaction + diversity_index + white_change_acceleration +
# change_3yr_income + change_3yr_home_value,
# data = train_data,
# method = "glm",
# family = "binomial",
# metric = "ROC", # use ROC if you're using twoClassSummary
# trControl = caret::trainControl(
# method = "cv",
# number = 5,
# classProbs = TRUE,
# summaryFunction = caret::twoClassSummary
# )
# )
#saveRDS(logistic_model, "Downloads/model/logistic_model.rds")
#logistic_model <- readRDS("Downloads/model/logistic_model.rds.")
# summary of the model
summary(logistic_model)
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.865e-01 3.573e-02 10.816 < 2e-16 ***
## change_3yr_rent -3.489e-10 3.560e-10 -0.980 0.3270
## change_3yr_white -4.120e+00 4.926e-01 -8.364 < 2e-16 ***
## gentrification_momentum -2.189e-01 3.609e-02 -6.065 1.32e-09 ***
## displacement_vulnerability_index -5.831e-02 3.447e-02 -1.692 0.0907 .
## pct_change_income 1.319e-07 3.350e-07 0.394 0.6937
## white_income_interaction -6.232e-10 4.986e-10 -1.250 0.2114
## diversity_index -7.873e-02 1.178e-01 -0.669 0.5038
## white_change_acceleration -3.960e-01 6.389e-01 -0.620 0.5355
## change_3yr_income -7.395e-10 8.080e-10 -0.915 0.3601
## change_3yr_home_value -9.049e-10 1.642e-10 -5.509 3.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16412 on 12192 degrees of freedom
## Residual deviance: 16152 on 12182 degrees of freedom
## AIC: 16174
##
## Number of Fisher Scoring iterations: 4
#running a another RF wiht XGboost
# Random Forest Model
model_path <- "/Users/ahmhamza/Downloads/model/"
rf_model <- readRDS(paste0(model_path, "rf_model.rds"))
# rf_model <- caret::train(
# gentrification_binary ~ .,
# data = train_data_scaled,
# method = "rf",
# metric = "Accuracy",
# trControl = trainControl(
# method = "cv",
# number = 5,
# classProbs = TRUE,
# summaryFunction = twoClassSummary
# ),
# ntree = 500
# )
#saveRDS(rf_model, "Downloads/model/rf_model.rds")
#rf_model <- readRDS("Downloads/model/rf_model.rds")
summary(rf_model)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 12100 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 24200 matrix numeric
## oob.times 12100 -none- numeric
## classes 2 -none- character
## importance 13 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 12100 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 13 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
print(rf_model)
## Random Forest
##
## 12100 samples
## 13 predictor
## 2 classes: 'high', 'low'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9680, 9680, 9680, 9679, 9681
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.8119271 0.6505553 0.8372474
## 7 0.8148342 0.6639224 0.8267491
## 13 0.8138617 0.6612505 0.8221896
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
#NN & deep learning preparation and modeling
# Neural Network with optimal hyperparameters
# Perform grid search for hyperparameter tuning
nn_grid <- expand.grid(size = c(5, 10, 15), decay = c(0.001, 0.01, 0.1))
model_path <- "/Users/ahmhamza/Downloads/model/"
nn_model <- readRDS(paste0(model_path, "nn_model.rds"))
# nn_model <- caret::train(
# gentrification_binary ~ .,
# data = train_data_scaled,
# method = "nnet",
# metric = "Accuracy",
# tuneGrid = nn_grid,
# trControl = trainControl(
# method = "cv",
# number = 5,
# classProbs = TRUE,
# summaryFunction = twoClassSummary
# ),
# maxit = 1000,
# trace = FALSE
# )
#saveRDS(nn_model, "Downloads/model/nn_model.rds")
# nn_model <- readRDS("Downloads/model/nn_model.rds")
# print the best tuning parameters
print(nn_model$bestTune)
## size decay
## 7 15 0.001
# Predictions on test set
nn_pred <- predict(nn_model, newdata = test_data_scaled)
nn_prob <- predict(nn_model, newdata = test_data_scaled, type = "prob")
# Confusion Matrix for Neural Network
nn_cm <- confusionMatrix(nn_pred, test_data_scaled$gentrification_binary)
print(nn_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 1420 602
## low 659 2498
##
## Accuracy : 0.7565
## 95% CI : (0.7446, 0.7682)
## No Information Rate : 0.5986
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.491
##
## Mcnemar's Test P-Value : 0.1148
##
## Sensitivity : 0.6830
## Specificity : 0.8058
## Pos Pred Value : 0.7023
## Neg Pred Value : 0.7913
## Prevalence : 0.4014
## Detection Rate : 0.2742
## Detection Prevalence : 0.3904
## Balanced Accuracy : 0.7444
##
## 'Positive' Class : high
##
# ROC curve
nn_roc <- roc(test_data_scaled$gentrification_binary, nn_prob[,"high"])
plot(nn_roc, main = "ROC Curve - Neural Network")
auc(nn_roc)
## Area under the curve: 0.8368
The Random Forest model identifies white race % change acceleration as the most predictive indicator of gentrification, followed closely by gentrification momentum. Short-term economic changes like change_3yr_income, pct_change_income, pct_change_rent, and change_3yr_rent are also strong predictors. Existing socioeconomic vulnerabilities and housing costs, measured by displacement_vulnerability_index, median_gross_rent, and median_income, are important structural features. Conversely, features like professional_rate, young_adult_rate, and pct_college_plus show lower predictive power in this model.
The model’s predictions exhibit a strong class imbalance, heavily favoring “low” gentrification risk with 4,798 tracts compared to only 426 predicted as “high.” This is likely due to the model’s conservative nature, as most predicted probabilities for “high” gentrification are below 0.5. Even the “high” predictions often have probabilities just above this threshold, indicating uncertainty. Interestingly, the logistic regression coefficients suggest a significant negative association between increases in the white population (change_3yr_white) and the likelihood of being classified as the positive gentrification outcome. gentrification_momentum also shows a strong negative relationship, while change_3yr_home_value has a statistically significant but very small negative effect.
The Random Forest model achieved its best performance with mtry = 7, yielding a good AUC of 0.815, indicating a strong ability to differentiate between high and low gentrification risk tracts. At this setting, the model correctly identified approximately 66.4% of high-risk tracts (sensitivity) and a higher 82.7% of low-risk tracts (specificity). The final model selected mtry = 7 to maximize the ROC, achieving a balance between sensitivity and specificity, although it still performs somewhat better at identifying low-risk areas. The neural network model, with 15 hidden units and a 0.001 weight decay, exhibits strong classification performance with an accuracy of 75.7% and a very good AUC of 0.837. It shows balanced performance, correctly identifying 68.3% of high-risk tracts (sensitivity) and 80.6% of low-risk tracts (specificity), with a moderate agreement beyond chance (Kappa of 0.491). The model’s precision for identifying high-risk tracts is 70.2%. While slightly better at identifying low gentrification areas, its ability to detect high-risk areas is considerably better than random and surpasses the logistic regression model, making it a viable classifier when both sensitivity and specificity are important.
# generate predictions for all models
models_list <- list(
"Logistic Regression" = logistic_model,
"Neural Network" = nn_model,
"Random Forest" = rf_model,
"XGBoost" = xgb_model
)
# function to evaluate model performance
evaluate_model <- function(model, test_data, model_name) {
pred <- predict(model, newdata = test_data)
prob <- predict(model, newdata = test_data, type = "prob")
cm <- confusionMatrix(pred, test_data$gentrification_binary)
roc_obj <- roc(test_data$gentrification_binary, prob[, "high"])
auc_value <- auc(roc_obj)
data.frame(
Model = model_name,
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
F1_Score = cm$byClass["F1"],
AUC = as.numeric(auc_value)
)
}
# evaluate all models
model_performance <- bind_rows(
evaluate_model(logistic_model, test_data, "Logistic Regression"),
evaluate_model(rf_model, test_data_scaled, "Random Forest"),
evaluate_model(xgb_model, test_data_scaled, "XGBoost"),
evaluate_model(nn_model, test_data_scaled, "Neural Network"),
)
# print combined performance table
kable(model_performance, caption = "Model Performance on Test Data")
Model | Accuracy | Sensitivity | Specificity | F1_Score | AUC | |
---|---|---|---|---|---|---|
Accuracy…1 | Logistic Regression | 0.6154288 | 0.1211106 | 0.9448166 | 0.2011928 | 0.5920873 |
Accuracy…2 | Random Forest | 0.7727360 | 0.6705147 | 0.8412903 | 0.7031526 | 0.8234124 |
Accuracy…3 | XGBoost | 0.7748600 | 0.6734007 | 0.8429032 | 0.7060010 | 0.8298067 |
Accuracy…4 | Neural Network | 0.7565167 | 0.6830207 | 0.8058065 | 0.6925140 | 0.8368491 |
Across the four models evaluated for gentrification risk classification, XGBoost and Random Forest emerged as the top performers, demonstrating robust and balanced capabilities. Notably, XGBoost slightly surpassed Random Forest in several key metrics, achieving an accuracy of 77.5%, a sensitivity of 67.3%, a specificity of 84.3%, an F1 score of 70.6%, and an AUC of 0.83. The Neural Network model closely followed, showcasing the highest sensitivity at 68.3% and the highest AUC at 0.837, indicating its particular strength in correctly identifying census tracts at high risk of gentrification. However, the Neural Network exhibited slightly lower overall accuracy and specificity when compared to the tree-based ensemble methods. In stark contrast, the Logistic Regression model displayed substantially weaker performance, characterized by a very low sensitivity of only 12.1% and a correspondingly low F1 score of 20.1%, revealing its significant difficulty in accurately classifying high-risk tracts, despite achieving a high specificity of 94.5%. Consequently, the ensemble methods, XGBoost and Random Forest, provided the most consistently balanced and reliable classification results, while the Neural Network presents itself as a compelling alternative, particularly when prioritizing the detection of high-risk areas due to its strong recall performance.
# visualization in comparison
# bar plot of AUC
auc_plot <- ggplot(model_performance, aes(x = Model, y = AUC, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(AUC, 3)), vjust = -0.3) +
labs(title = "AUC Comparison of Models", y = "AUC") +
theme_minimal() +
theme(legend.position = "none")
# bar plot of Accuracy
accuracy_plot <- ggplot(model_performance, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Accuracy, 3)), vjust = -0.3) +
labs(title = "Accuracy Comparison of Models", y = "Accuracy") +
theme_minimal() +
theme(legend.position = "none")
# combine AUC and Accuracy plots
grid.arrange(auc_plot, accuracy_plot, ncol = 2)
if (nrow(summary(logistic_model)$coefficients) > 1) {
coef_df_lr <- as.data.frame(summary(logistic_model)$coefficients[-1, ]) %>%
mutate(Feature = rownames(.)) %>%
arrange(desc(abs(Estimate))) %>%
head(10) # Top 10 coefficients
coef_plot_lr <- ggplot(coef_df_lr, aes(x = reorder(Feature, Estimate), y = Estimate, fill = Estimate > 0)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
labs(title = "Top 10 Coefficients in Logistic Regression", x = "Feature", y = "Coefficient") +
scale_fill_manual(name = "Direction", values = c("FALSE" = "salmon", "TRUE" = "lightgreen"))
print(coef_plot_lr)
} else {
print("Logistic Regression model might not have converged properly or has no significant predictors.")
}
# Practical Model Insights
# identify rows with no missing values in the predictors used for the best model
complete_rows <- complete.cases(modeling_data[, important_predictors])
# a dataframe with only the complete cases for prediction
model_df_for_prediction <- modeling_data[complete_rows, ]
# best model based on a chosen metric
best_model_name <- model_performance %>% arrange(desc(AUC)) %>% slice_head(n = 1) %>% pull(Model)
best_model <- models_list[[best_model_name]]
cat("Best Model:", best_model_name, "\n")
## Best Model: Neural Network
# predictions only on complete cases
prediction_results <- predict(best_model, newdata = model_df_for_prediction, type = "prob")
model_df_for_prediction$gentrification_probability <- prediction_results[, "high"]
# a temporary ID in the complete cases dataframe for merging
model_df_for_prediction$temp_id <- 1:nrow(model_df_for_prediction)
# a temporary ID in the original modeling_data dataframe for merging the complete cases back
modeling_data$temp_id <- ifelse(complete_rows, 1:sum(complete_rows), NA)
# merge predictions back to the original modeling_data dataframe based on the temporary ID
model_df <- modeling_data %>%
left_join(model_df_for_prediction %>% select(temp_id, gentrification_probability), by = "temp_id")
# remove the temporary ID
model_df$temp_id <- NULL
# risk categories based on gentrification probability
model_df <- model_df %>%
mutate(risk_category = case_when(
is.na(gentrification_probability) ~ "Missing Data",
gentrification_probability >= 0.75 ~ "Very High Risk",
gentrification_probability >= 0.5 ~ "High Risk",
gentrification_probability >= 0.25 ~ "Moderate Risk",
TRUE ~ "Low Risk"
))
# # Summarize by borough and risk category
# borough_summary <- model_df %>%
# group_by(borough, risk_category) %>%
# summarise(
# n_tracts = n(),
# avg_income = mean(median_income, na.rm = TRUE),
# avg_rent = mean(median_gross_rent, na.rm = TRUE),
# avg_home_value = mean(median_home_value, na.rm = TRUE),
# avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
# avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
# .groups = "drop"
# )
#
#
# # view summary table
# kable(borough_summary, digits = 2, caption = "Census Tract Counts and Averages by Borough and Gentrification Risk")
model_df$years <- "2022"
borough_summary <- model_df %>%
mutate(years = "2022") %>% # or year = 2022
group_by(borough, risk_category, years) %>%
summarise(
n_tracts = n(),
avg_income = mean(median_income, na.rm = TRUE),
avg_rent = mean(median_gross_rent, na.rm = TRUE),
avg_home_value = mean(median_home_value, na.rm = TRUE),
avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
.groups = "drop"
)
borough_summary <- borough_summary %>%
mutate(
avg_income = round(avg_income),
avg_rent = round(avg_rent),
avg_home_value = round(avg_home_value),
avg_college_rate = round(avg_college_rate * 100, 1),
avg_poverty_rate = round(avg_poverty_rate * 100, 1)
)
borough_summary %>%
kable(digits = 2, caption = "Census Tract Counts and Averages by Borough, Risk Category, and Year") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
borough | risk_category | years | n_tracts | avg_income | avg_rent | avg_home_value | avg_college_rate | avg_poverty_rate |
---|---|---|---|---|---|---|---|---|
Bronx | High Risk | 2022 | 1 | 30221 | 953 | 506100 | 12.5 | 47.3 |
Bronx | Low Risk | 2022 | 1183 | -7842759 | -5070570 | -167607725 | 20.9 | 26.0 |
Bronx | Missing Data | 2022 | 36 | -666666666 | -666666666 | -666666666 | 22.1 | 77.7 |
Bronx | Moderate Risk | 2022 | 5 | 46638 | 1288 | -133025293 | 19.9 | 31.0 |
Bronx | Very High Risk | 2022 | 1508 | -11009049 | -7956358 | -154859320 | 20.4 | 28.1 |
Kings | Low Risk | 2022 | 2485 | -1809358 | -4559165 | -46480116 | 36.5 | 19.1 |
Kings | Missing Data | 2022 | 18 | -666666666 | -666666666 | -666666666 | 5.8 | NaN |
Kings | Moderate Risk | 2022 | 72 | -9193671 | 1490 | -54816768 | 37.1 | 18.6 |
Kings | Very High Risk | 2022 | 3525 | -4285292 | -5294033 | -39574908 | 36.0 | 19.5 |
Manhattan | High Risk | 2022 | 1 | -666666666 | 2432 | -666666666 | 58.3 | 20.6 |
Manhattan | Low Risk | 2022 | 949 | -6229072 | -6320623 | -121417281 | 61.1 | 16.9 |
Manhattan | Missing Data | 2022 | 18 | -666666666 | -666666666 | -666666666 | 8.1 | 83.1 |
Manhattan | Moderate Risk | 2022 | 76 | 118405 | 2311 | -139320411 | 71.5 | 14.3 |
Manhattan | Very High Risk | 2022 | 1278 | -3551535 | -3127930 | -112348288 | 62.7 | 15.6 |
Queens | Low Risk | 2022 | 2211 | -5656590 | -12662369 | -16921767 | 31.3 | 12.3 |
Queens | Missing Data | 2022 | 64 | -666666666 | -666666666 | -666666666 | 29.7 | 66.5 |
Queens | Moderate Risk | 2022 | 48 | 76199 | -69443071 | -13326995 | 28.3 | 12.9 |
Queens | Very High Risk | 2022 | 3033 | -8720061 | -18681791 | -22298090 | 32.2 | 12.3 |
Staten Island | Low Risk | 2022 | 371 | -12497296 | -28749801 | -24662527 | 33.7 | 12.5 |
Staten Island | Missing Data | 2022 | 2 | -666666666 | -666666666 | -666666666 | NaN | NaN |
Staten Island | Moderate Risk | 2022 | 50 | -39918673 | -66665349 | -106224321 | 37.1 | 8.5 |
Staten Island | Very High Risk | 2022 | 483 | -9581208 | -22082854 | -14686081 | 33.0 | 13.2 |
# visualize risk category distribution by borough
ggplot(borough_summary, aes(x = borough, y = n_tracts, fill = risk_category)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Distribution of Gentrification Risk Categories by Borough",
x = "Borough", y = "Number of Census Tracts",
fill = "Gentrification Risk") +
theme_minimal() +
scale_fill_brewer(palette = "RdYlGn", direction = -1)
# top 20 census tracts with highest gentrification probability
top_risk_tracts <- model_df %>%
arrange(desc(gentrification_probability)) %>%
select(tract, borough, gentrification_probability, risk_category,
median_income, median_gross_rent, median_home_value,
pct_college_plus, poverty_rate) %>%
head(20)
# view top risk tracts
kable(top_risk_tracts, digits = 3, caption = "Top 20 Census Tracts with Highest Gentrification Probability")
tract | borough | gentrification_probability | risk_category | median_income | median_gross_rent | median_home_value | pct_college_plus | poverty_rate |
---|---|---|---|---|---|---|---|---|
14800 | Kings | 1 | Very High Risk | 85625 | 1852 | 927300 | 0.395 | 0.074 |
3900 | Manhattan | 1 | Very High Risk | 157652 | 1812 | 2000001 | 0.809 | 0.048 |
20500 | Manhattan | 1 | Very High Risk | 179348 | 2965 | 1594300 | 0.934 | 0.051 |
200 | Bronx | 1 | Very High Risk | 51100 | 1559 | 419200 | 0.294 | 0.224 |
2702 | Bronx | 1 | Very High Risk | 16000 | 963 | 437100 | 0.077 | 0.504 |
2702 | Bronx | 1 | Very High Risk | 14375 | 856 | -666666666 | 0.051 | 0.519 |
3900 | Bronx | 1 | Very High Risk | 25990 | 954 | 338600 | 0.112 | 0.302 |
5001 | Bronx | 1 | Very High Risk | 27009 | 1297 | 498100 | 0.131 | 0.411 |
5600 | Bronx | 1 | Very High Risk | 18918 | 1056 | 424000 | 0.056 | 0.389 |
5600 | Bronx | 1 | Very High Risk | 21367 | 1085 | 429700 | 0.090 | 0.407 |
6700 | Bronx | 1 | Very High Risk | 21016 | 708 | 560200 | 0.052 | 0.471 |
6700 | Bronx | 1 | Very High Risk | 20945 | 696 | 349200 | 0.063 | 0.501 |
7000 | Bronx | 1 | Very High Risk | 39662 | 1364 | 469500 | 0.121 | 0.172 |
7000 | Bronx | 1 | Very High Risk | 50500 | 1404 | 472100 | 0.145 | 0.189 |
8400 | Bronx | 1 | Very High Risk | 66958 | 1471 | 442500 | 0.224 | 0.085 |
8400 | Bronx | 1 | Very High Risk | 83605 | 1620 | 457900 | 0.260 | 0.070 |
9200 | Bronx | 1 | Very High Risk | 35227 | 1304 | 453000 | 0.182 | 0.213 |
9300 | Bronx | 1 | Very High Risk | 30811 | 1157 | 372000 | 0.095 | 0.407 |
9300 | Bronx | 1 | Very High Risk | 32565 | 1191 | 444700 | 0.096 | 0.392 |
9300 | Bronx | 1 | Very High Risk | 33708 | 1187 | 386700 | 0.104 | 0.425 |
# Target Policy Analysis
model_df <- model_df %>%
mutate(
displacement_risk_score = (poverty_rate + cost_burden_rate) / 2,
displacement_risk_category = case_when(
displacement_risk_score >= 30 ~ "High",
displacement_risk_score >= 20 ~ "Medium",
TRUE ~ "Low"
)
)
# Create policy priority matrix
model_df <- model_df %>%
mutate(
policy_priority = case_when(
risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "High" ~ "Urgent Intervention",
risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Medium" ~ "High Priority",
risk_category == "Moderate Risk" & displacement_risk_category == "High" ~ "High Priority",
risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Low" ~ "Proactive Planning",
risk_category == "Moderate Risk" & displacement_risk_category == "Medium" ~ "Proactive Planning",
risk_category == "Low Risk" & displacement_risk_category == "High" ~ "Community Support",
TRUE ~ "Monitor"
)
)
# Count tracts by borough and policy priority
policy_counts <- model_df %>%
group_by(borough, policy_priority) %>%
summarise(n_tracts = n(), .groups = "drop") %>%
spread(policy_priority, n_tracts, fill = 0)
# view policy counts
kable(policy_counts, caption = "Census Tracts by Borough and Policy Priority")
borough | Monitor | Proactive Planning |
---|---|---|
Bronx | 1224 | 1509 |
Kings | 2575 | 3525 |
Manhattan | 1043 | 1279 |
Queens | 2323 | 3033 |
Staten Island | 423 | 483 |
# Visualize policy priorities
model_df %>%
count(borough, policy_priority) %>%
ggplot(aes(x = borough, y = n, fill = policy_priority)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Policy Intervention Priorities by Borough",
x = "Borough", y = "Number of Census Tracts",
fill = "Policy Priority") +
theme_minimal() +
scale_fill_manual(values = c("Urgent Intervention" = "#d73027",
"High Priority" = "#fc8d59",
"Proactive Planning" = "#fee090",
"Community Support" = "#e0f3f8",
"Monitor" = "#91bfdb"))
The provided dataset offers a rich, multifaceted view of gentrification pressures across New York City’s five boroughs, integrating indicators of socioeconomic change with neighborhood risk classifications. Through examining census tract-level data, risk categorizations, and policy recommendations, to develop a more nuanced understanding of where gentrification is intensifying, where vulnerabilities lie, and what this suggests about equity and urban change. It incorporates socioeconomic indicators including median income, rent, home value, college attainment, and poverty rate. Gentrification risk categories range from Low to Very High, with a significant proportion of tracts flagged as “Missing Data” due to placeholder values (-666666666), which have led to negative averages in borough-level summaries. Despite these data distortions, some borough-specific patterns do emerge. Manhattan, as expected, contains some of the highest median incomes and home values, particularly in tracts identified as having the highest gentrification probabilities. Tract 20500 in Manhattan stands out with a median income nearing $180,000 and a median rent approaching $3,000, combined with a 93% college-educated population and a poverty rate just above 5%. These traits are emblematic of what scholars term advanced gentrification—a phase characterized not only by rising costs but also by elite demographic replacement and socio-cultural transformation. In contrast, the Bronx exhibits a different profile. Among the top gentrifying tracts, incomes remain comparatively lower, with some tracts falling below $20,000 annually and poverty rates exceeding 50%. Yet these same tracts also show rising rent and property values. This divergence suggests early-stage gentrification, where speculative investment and rental pressure precede substantial income changes, potentially displacing legacy residents before neighborhood-wide uplift occurs. Tracts such as 2702 and 6700 exemplify this contradiction, revealing a paradoxical coexistence of entrenched poverty and housing market acceleration.
High-Risk and Policy-Relevant Zones
Across the city, “Very High Risk” tracts are concentrated in Kings (Brooklyn), Queens, and the Bronx. These tracts often overlap with areas slated for Proactive Planning, indicating a policy recognition of imminent or ongoing change. Kings County, for instance, has 3,525 tracts flagged for proactive intervention, reflecting its centrality in recent waves of urban redevelopment and population inflow. What is striking here is that some of the most socioeconomically advantaged neighborhoods (as measured by income and education) are simultaneously those with the highest gentrification probability. This suggests that gentrification risk does not solely indicate neighborhood vulnerability in the traditional sense, but also the potential for cultural or demographic shift. Particularly, the replacement of lower-income, minority populations with wealthier, often white newcomers. Thus, the term “risk” must be interpreted not only in economic but also in social and cultural terms.
Emerging Themes and Reflections
Two insights emerge from this analysis:
Gentrification Risk Is Not Monolithic: The term captures a spectrum of neighborhood change, from elite consolidation (Manhattan) to early speculative pressure (Bronx). This reinforces the importance of borough-specific strategies and cautions against citywide generalizations.
Policy Planning Must Engage More Deeply with Social Dimensions: While proactive planning and monitoring efforts are well distributed geographically, their efficacy depends on integrating not only economic indicators but also resident voices, tenure stability, and historical displacement patterns. Without such integration, policies risk addressing symptoms while missing root causes.
Spatial Inequality and Urban Transformation
The Divergent Mechanics of Urban Change The analysis of census tract data across New York City’s five boroughs reveals not a singular gentrification process, but rather a complex ecosystem of neighborhood transformations operating through various pathways. Through machine learning classification of census data from 2015-2022, the study observe that gentrification manifests through multiple trajectories, each characterized by different combinations of demographic shifts, economic pressures, and housing market dynamics. In Manhattan’s high-probability gentrification zones, the data reveals what might be consider as consolidation gentrification tracts like 20500 exhibit median incomes approaching $180,000, coupled with 93% college education rates and median rents near $3,000. These neighborhoods have moved beyond the initial stages of displacement into a phase where socioeconomic homogeneity becomes self-reinforcing. The predictive models identify white_change_acceleration as the most significant feature in these areas, suggesting that racial demographic shifts function as both cause and consequence of advanced gentrification processes.
Conversely, the Bronx presents “speculative gentrification” patterns, where housing market metrics outpace local income growth. Census tracts showing median incomes below $20,000 alongside rapidly appreciating property values indicate investment patterns driven by external capital rather than organic neighborhood economic development. This contradiction manifests most prominently in tracts 2702 and 6700, where poverty rates exceeding 50% coexist with accelerating housing costs a statistical signature of speculative investment preceding demographic transformation.
Temporal Dimensions and Predictive Indicators
The temporal features engineered from longitudinal census data provided unexpected insight into gentrification’s mechanics. While static socioeconomic indicators showed limited predictive power, the Random Forest model identified change_3yr_income, pct_change_rent, and change_3yr_rent among the strongest predictors. This temporal dimension reveals gentrification as fundamentally a rate-of-change phenomenon rather than a static state, challenging traditional neighborhood classification approaches. The neural network model, achieving 75.7% accuracy with an AUC of 0.84, performed particularly well at identifying neighborhoods in the early-to-mid stages of transformation. Its balanced sensitivity (68.3%) and specificity (80.6%) across high and low risk classifications demonstrates the utility of advanced machine learning approaches in capturing the nonlinear relationships between demographic shifts, housing pressures, and economic indicators that characterize incipient gentrification. Interestingly, the relationship between racial demographics and gentrification emerged as more complex than commonly understood. The logistic regression coefficients revealed a significant negative association between increases in white population (change_3yr_white) and gentrification classification probability, contradicting simplified narratives of white in-migration as the sole driver. Instead, the data suggests a sequential process where economic transformation precedes demographic change, with housing market pressures serving as the leading indicator.
Beyond Binary Classification: A Spatial Typology of Urban Change
The classification models reveal limitations in binary gentrification framing. When neighborhoods are mapped according to their predicted probabilities and feature importance scores, at least four distinct transformation types emerge:
1.Elite Consolidation Zones: Primarily in Manhattan, characterized by high gentrification_momentum scores but decreasing change rates, suggesting market saturation.
Acceleration Zones: Concentrated in Brooklyn (Kings County), exhibiting high displacement_vulnerability_index scores coupled with rapidly increasing change_3yr_rent values.
Speculative Investment Frontiers: Emerging in the Bronx and eastern Queens, where housing market metrics and white_change_acceleration outpace local income growth.
Stable Vulnerability Areas: Neighborhoods with high displacement risk but lower predicted gentrification probabilities, suggesting entrenched inequality without active transformation.
This typology has significant implications for policy intervention. The 3,525 tracts in Kings County flagged for proactive planning likely contain multiple transformation types requiring different intervention strategies. The neural network’s hidden layer activations reveal distinct patterns across these four zones, suggesting that different combinations of factors drive transformation across these spatial categories.
Policy Implications: Beyond Reactive Frameworks
The predictive power of change_3yr_rent and gentrification_momentum indicators suggests that interventions must engage earlier in the transformation cycle than current policies typically allow. By the time demographic shifts become visible in census data, the economic mechanisms of displacement are already well established. The classification models indicate that neighborhoods with similar static profiles can experience dramatically different trajectory probabilities based on relatively small differences in change rates. This finding challenges current intervention timeframes and suggests that policy effectiveness depends on integration of multiple data types at finer temporal resolutions. The neural network’s performance compared to logistic regression (AUC 0.837 vs. lower) demonstrates that the relationships between predictors are nonlinear and interactive exactly the type of complex systems that simplified policy frameworks struggle to address. The most provocative implication of this analysis is that gentrification risk assessment must be reconceptualized not as identifying vulnerable neighborhoods, but rather as identifying neighborhoods at specific transformation stages. The models reveal that the highest statistical predictors of change occur at different threshold points across different boroughs. Effective intervention requires borough-specific strategies calibrated to these locally specific thresholds.
Methodological Integration: Machine Learning and Social Science
This study demonstrates that the integration of machine learning methodologies with sociological frameworks produces insights unavailable to either approach independently. The Random Forest’s identification of white race percentage change acceleration as the most predictive indicator quantifies a relationship long theorized in urban sociology literature. Simultaneously, the neural network’s capacity to capture nonlinear relationships between variables provides mathematical validation for sociological concepts of neighborhood tipping points and cascading effects. The combined classification approach achieves what neither qualitative observation nor simplified statistical analysis can: the identification of mathematically precise thresholds where economic pressure transforms into demographic change.
Toward a New Understanding of Urban Change
The machine learning analysis of NYC census tract data from 2015-2022 reveals gentrification not as a uniform process but as a spatially differentiated phenomenon with borough-specific mechanics. The classification models demonstrate that the most predictive indicators of transformation are not static socioeconomic characteristics but rather rates of change across multiple dimensions. This understanding challenges both simplified narratives of gentrification and generalized policy approaches. The identification of distinct transformation types across boroughs suggests that urban change operates through multiple causal pathways requiring differentiated intervention strategies. The superior performance of neural network models indicates that these processes involve complex, nonlinear relationships that simple linear frameworks cannot adequately capture.
The emerging spatial typology of urban change identified through this analysis provides a foundation for more targeted policy development. By engaging with the temporal dimension of neighborhood transformation and the specific mechanical pathways through which economic pressure translates into demographic change, policymakers can move beyond reactive approaches toward proactive frameworks calibrated to local conditions. In the final analysis, this study demonstrates that effective engagement with gentrification requires understanding it not as a binary state but as a multi-dimensional process with distinct phases, thresholds, and spatial expressions. The integration of machine learning methodologies with sociological frameworks offers a powerful approach to identifying these patterns and informing more effective interventions.
```r
# Load necessary libraries
library(dplyr) # For data manipulation
library(tidyr) # For data reshaping
library(readr) # For reading and writing CSV files
# Define API key for accessing Census data
api_key <- 'f658e5cc25ee3cb93ac64033d15b48f3f0eec6f8'
# List of enhanced variables related to gentrification indicators
variables <- c(
# Core demographic and economic variables
'B25064_001E', # Median gross rent
'B19013_001E', # Median household income
'B15003_001E', # Total education population
'B15003_017E', # High school graduates
'B15003_018E', 'B15003_019E', 'B15003_020E', 'B15003_021E', # Some college or associate's degree
'B15003_022E', 'B15003_023E', 'B15003_024E', 'B15003_025E', # Bachelor's and higher
'B03002_001E', # Total population
'B03002_003E', # White alone
'B03002_004E', # Black alone
'B03002_006E', # Asian alone
'B03002_012E', # Hispanic/Latino
'B25003_001E', # Total housing units
'B25003_002E', # Owner-occupied units
'B17001_001E', # Total poverty universe
'B17001_002E', # Below poverty level
'B23025_003E', # Civilian labor force
'B23025_005E', # Unemployed individuals
# Additional gentrification indicators
'B25077_001E', # Median home value
'B25071_001E', # Median gross rent as a percentage of household income
'B19083_001E', # Gini Index of Income Inequality
# Housing cost burden variables
'B25106_001E', # Total housing units
'B25106_024E', # Owner costs >= 30% of income
'B25106_045E', # Renter costs >= 30% of income
# New construction and vacancy rates
'B25034_001E', # Total housing units
'B25034_002E', 'B25034_003E', # Built in 2010 or later, 2000-2009
'B25002_001E', # Total housing units
'B25002_003E', # Vacant housing units
# Young professionals (age demographics)
'B01001_001E', # Total population
'B01001_011E', 'B01001_012E', 'B01001_035E', 'B01001_036E', # Ages 25-34
# Creative class/professional occupations
'C24010_001E', # Total civilian employed population (16+)
'C24010_003E', # Management, business, science, and arts occupations (male)
'C24010_039E', # Management, business, science, and arts occupations (female)
# Transportation and commuting patterns
'B08301_001E', # Total commuters
'B08301_010E', # Public transportation usage
'B08301_019E', 'B08301_020E', # Walked, other commuting methods
'B08301_021E', # Worked from home
# Tenure duration indicating residential stability
'B25038_001E', # Total occupied housing units
'B25038_003E', # Owner-occupied, moved in 2015 or later
'B25038_010E' # Renter-occupied, moved in 2015 or later
)
# Define county FIPS codes for New York City boroughs
counties <- list(
Bronx = '005',
Kings = '047', # Brooklyn
Queens = '081',
Manhattan = '061', # Add Manhattan for comparison
Staten_Island = '085' # Add Staten Island for completeness
)
# Function to pull data from the Census API
get_acs_data <- function(year, county_name, county_fips) {
# Access the Census API to get ACS data for the specified year and county
data <- census_api_call(variables, api_key, year, county_fips) # Placeholder for actual API call
data$year <- year
data$borough <- county_name
return(data)
}
# Collect data from 2015 to 2022
dfs <- list() # Initialize an empty list to store data frames
for (year in 2015:2022) {
for (borough in names(counties)) {
county_fips <- counties[[borough]]
tryCatch({
df <- get_acs_data(year, borough, county_fips)
dfs[[length(dfs) + 1]] <- df # Append the data frame to the list
cat("Successfully downloaded data for", borough, year, "\n")
}, error = function(e) {
cat("Error downloading data for", borough, year, ":", e$message, "\n")
})
}
}
# Combine all years of data into a single data frame
acs_df <- bind_rows(dfs)
# Rename columns for clarity
column_mapping <- c(
B25064_001E = 'median_gross_rent',
B19013_001E = 'median_income',
B15003_001E = 'total_edu_pop',
B15003_017E = 'high_school_grad',
B15003_018E = 'some_college_1',
B15003_019E = 'some_college_2',
B15003_020E = 'assoc_degree_1',
B15003_021E = 'assoc_degree_2',
B15003_022E = 'bachelor',
B15003_023E = 'masters',
B15003_024E = 'professional',
B15003_025E = 'doctorate',
B03002_001E = 'total_pop',
B03002_003E = 'white',
B03002_004E = 'black',
B03002_006E = 'asian',
B03002_012E = 'hispanic',
B25003_001E = 'total_units',
B25003_002E = 'owner_occupied',
B17001_001E = 'poverty_universe',
B17001_002E = 'below_poverty',
B23025_003E = 'labor_force',
B23025_005E = 'unemployed',
# New variables for gentrification analysis
B25077_001E = 'median_home_value',
B25071_001E = 'rent_as_pct_income',
B19083_001E = 'gini_index',
B25106_001E = 'housing_units_total',
B25106_024E = 'owner_cost_burdened',
B25106_045E = 'renter_cost_burdened',
B25034_001E = 'housing_age_total',
B25034_002E = 'housing_built_2010_later',
B25034_003E = 'housing_built_2000_2009',
B25002_001E = 'housing_total',
B25002_003E = 'vacant_units',
B01001_001E = 'pop_total',
B01001_011E = 'males_25_34_1',
B01001_012E = 'males_25_34_2',
B01001_035E = 'females_25_34_1',
B01001_036E = 'females_25_34_2',
C24010_001E = 'employed_pop',
C24010_003E = 'prof_male',
C24010_039E = 'prof_female',
B08301_001E = 'commuters_total',
B08301_010E = 'public_transit',
B08301_019E = 'walk_to_work',
B08301_020E = 'other_commute',
B08301_021E = 'work_from_home',
B25038_001E = 'tenure_total',
B25038_003E = 'recent_owners',
B25038_010E = 'recent_renters'
)
# Rename the columns in the data frame
acs_df <- rename(acs_df, !!!column_mapping)
# Convert relevant columns to numeric
acs_df <- acs_df %>% mutate(across(everything(), ~ as.numeric(.), .names = "numeric_{col}"))
# Calculate derived metrics
acs_df <- acs_df %>%
mutate(
less_than_hs = total_edu_pop - (high_school_grad + some_college_1 + some_college_2 +
assoc_degree_1 + assoc_degree_2 + bachelor +
masters + professional + doctorate),
pct_less_than_hs = less_than_hs / total_edu_pop,
pct_hs_grad = high_school_grad / total_edu_pop,
pct_some_college = (some_college_1 + some_college_2 + assoc_degree_1 + assoc_degree_2) / total_edu_pop,
pct_college_plus = (bachelor + masters + professional + doctorate) / total_edu_pop,
pct_white = white / total_pop,
pct_black = black / total_pop,
pct_asian = asian / total_pop,
pct_hispanic = hispanic / total_pop,
pct_other = 1 - (pct_white + pct_black + pct_asian + pct_hispanic),
homeownership_rate = owner_occupied / total_units,
poverty_rate = below_poverty / poverty_universe,
unemployment_rate = unemployed / labor_force
)
# New gentrification metrics
acs_df <- acs_df %>%
mutate(
young_adult_rate = (males_25_34_1 + males_25_34_2 + females_25_34_1 + females_25_34_2) / pop_total,
professional_rate = (prof_male + prof_female) / employed_pop,
alt_commute_rate = (public_transit + walk_to_work + other_commute + work_from_home) / commuters_total,
new_housing_rate = (housing_built_2010_later + housing_built_2000_2009) / housing_age_total,
recent_move_rate = (recent_owners + recent_renters) / tenure_total,
vacancy_rate = vacant_units / housing_total,
cost_burden_rate = (owner_cost_burdened + renter_cost_burdened) / housing_units_total
)
# Create a composite gentrification index
acs_df <- acs_df %>%
mutate(
gentrification_pressure = (pct_college_plus +
young_adult_rate +
professional_rate +
(median_income / ave(median_income, by = tract)) +
(median_home_value / ave(median_home_value, by = tract)) +
(median_gross_rent / ave(median_gross_rent, by = tract)) -
poverty_rate) / 7 # Normalize
)
# Define displacement risk index
acs_df <- acs_df %>%
mutate(
displacement_risk = (rent_as_pct_income / 100 +
cost_burden_rate +
(1 - homeownership_rate) +
poverty_rate +
(1 - pct_college_plus)) / 5 # Normalize
)
# Save the processed DataFrame to a CSV file
write_csv(acs_df, 'census_data_2015_2022_gentrification.csv')
#load necessary packages
rq_packages <- c("ada", "caret", "corrplot", "dplyr", "GGally", "ggcorrplot",
"ggplot2", "gridExtra", "glmnet", "knitr", "mice", "neuralnet", "naivebayes", "naniar", "nnet", "pROC",
"randomForest", "rpart", "rpart.plot", "ROSE", "rsample",
"sf", "keras", "kableExtra", "scales", "smotefamily", "tigris", "tidyverse",
"tidyr", "tfdatasets", "tensorflow", "pROC", "tmap", "yardstick", "randomForest", "rpart", "xgboost","VIM")
#install and load packages
for (pkg in rq_packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg)
library(pkg, character.only = TRUE)
}
}
# read the CSV file
df <- read.csv('census_data_2015_2022_gentrification.csv')
# display the first 15 rows
head(df, 15)
# structure and summary
str(df)
summary(df)
# Convert year, borough, etc., to factors if needed
df$year <- as.factor(df$year)
df$borough <- as.factor(df$borough)
# Rename columns for consistency
df <- df %>%
rename_with(~ gsub("\\s+", "_", .x)) %>%
rename_with(tolower)
# colnames(df)
###Check for missing
#count missing or zero values per variable
df %>%
summarise(across(everything(), ~ sum(is.na(.) | . == 0))) %>%
gather(variable, missing_or_zero)
#low missing <5% mena median imput
low_na_cols <- c("median_income", "median_gross_rent", "median_home_value")
for (col in low_na_cols) {
df[[col]][df[[col]] == 0 | is.na(df[[col]])] <- median(df[[col]][df[[col]] != 0], na.rm = TRUE)
}
#moderate missing (5%–30%) → KNN Imputation
cols_for_knn <- c("professional", "recent_move_rate", "housing_built_2010_later",
"vacancy_rate", "renter_cost_burdened", "recent_owners",
"recent_renters", "professional_rate", "alt_commute_rate")
df[cols_for_knn] <- VIM::kNN(df[cols_for_knn], k = 5, imp_var = FALSE)
############################################################################################
#futures engineering & vector of variables to standardize
############################################################################################
z_vars <- c(
"median_income", "professional_rate", "pct_college_plus",
"pct_white", "young_adult_rate",
"median_gross_rent", "median_home_value", "cost_burden_rate", "vacancy_rate",
"recent_move_rate", "recent_renters", "recent_owners",
"new_housing_rate", "housing_built_2010_later",
"homeownership_rate"
)
# standardize in long format (better memory handling)
df_z <- df %>%
arrange(geo_id, year) %>%
group_by(geo_id) %>%
mutate(across(all_of(z_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
ungroup()
# adjust vacancy rate sign (negative z-score)
df_z <- df_z %>%
mutate(z_vacancy_rate = -z_vacancy_rate)
# create gentrification index from the newly computed z-scores
df_z <- df_z %>%
rowwise() %>%
mutate(gentrification_index = mean(c_across(starts_with("z_")), na.rm = TRUE)) %>%
ungroup()
# bin into quantiles (efficient post-processing step)
df_z <- df_z %>%
mutate(gentrification_quantile = ntile(gentrification_index, 5))
#EDA ############################################################################################
# Visualize distribution
ggplot(df_z, aes(x = gentrification_index)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Distribution of Gentrification Index", x = "Index", y = "Count")
# Boxplot by borough
ggplot(df_z, aes(x = borough, y = gentrification_index, fill = borough)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Gentrification Index by Borough", x = "Borough", y = "Index") +
theme(legend.position = "none")
# Line plot over time by borough
df_z %>%
group_by(year, borough) %>%
summarise(mean_index = mean(gentrification_index, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = mean_index, color = borough, group = borough)) +
geom_line(size = 1.2) +
labs(title = "Gentrification Trends by Borough", x = "Year", y = "Mean Index") +
theme_minimal()
############################################################################################
#preparing data for models
############################################################################################
# yearly change
df_temporal <- df_z %>%
arrange(geo_id, year) %>%
group_by(geo_id) %>%
mutate(
# 3-year changes for key metrics
change_3yr_income = median_income - lag(median_income, 3),
change_3yr_rent = median_gross_rent - lag(median_gross_rent, 3),
change_3yr_home_value = median_home_value - lag(median_home_value, 3),
change_3yr_college = pct_college_plus - lag(pct_college_plus, 3),
change_3yr_white = pct_white - lag(pct_white, 3),
change_3yr_vacancy = vacancy_rate - lag(vacancy_rate, 3),
# Growth rates (percentage changes) for key metrics
pct_change_income = (median_income / lag(median_income, 3) - 1) * 100,
pct_change_rent = (median_gross_rent / lag(median_gross_rent, 3) - 1) * 100,
pct_change_home_value = (median_home_value / lag(median_home_value, 3) - 1) * 100,
# Rate of change compared to borough average (relative changes)
rel_income_growth = pct_change_income / mean(pct_change_income, na.rm = TRUE),
rel_rent_growth = pct_change_rent / mean(pct_change_rent, na.rm = TRUE),
rel_home_value_growth = pct_change_home_value / mean(pct_change_home_value, na.rm = TRUE)
) %>%
ungroup()
#race metrics
df_race <- df_temporal %>%
mutate(
# Diversity metrics
diversity_index = 1 - ((pct_white^2 + (1-pct_white)^2)), # Simple diversity index
# Race-economic interactions
white_income_interaction = pct_white * median_income,
white_college_interaction = pct_white * pct_college_plus,
# Racial change momentum (acceleration of demographic shifts)
white_change_acceleration = change_3yr_white - lag(change_3yr_white, 1),
# Economic-demographic pressure metrics
displacement_risk = cost_burden_rate * (1 - pct_white) * (1 / median_income),
gentrification_pressure = pct_change_rent * change_3yr_white * young_adult_rate
)
#z-temporal and demographic shift variables
z_temporal_vars <- c(
"change_3yr_rent",
"pct_change_rent",
"pct_change_income",
"change_3yr_white",
"professional_rate",
"young_adult_rate",
"cost_burden_rate",
"median_income",
"recent_renters",
"homeownership_rate"
)
df_race <- df_race %>%
group_by(year) %>% # or geo_id depending on context, but year is better for city-wide scale
mutate(across(all_of(z_temporal_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
ungroup()
# displacement composite indicator
df_features <- df_race %>%
rowwise() %>%
mutate(
# Improved displacement risk index - combines economic vulnerability with demographic stability
displacement_vulnerability_index = mean(c(
z_cost_burden_rate,
-z_median_income,
z_recent_renters,
-z_homeownership_rate,
z_change_3yr_rent
), na.rm = TRUE),
# Improved gentrification momentum index - combines rate of change metrics
gentrification_momentum = mean(c(
z_pct_change_rent,
z_pct_change_income,
z_change_3yr_white,
z_professional_rate,
z_young_adult_rate
), na.rm = TRUE)
) %>%
ungroup()
# removing missing values in the new features
# list of newly created features
new_features <- setdiff(names(df_features), names(df_z))
# temporal features have NAs for early years
# impute them with mean by borough and year for better contextual accuracy
df_features <- df_features %>%
group_by(borough, year) %>%
mutate(across(all_of(new_features), ~ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
ungroup()
# if any NAs remain, use overall means
df_features <- df_features %>%
mutate(across(all_of(new_features), ~ifelse(is.na(.) | is.infinite(.) | is.nan(.),
mean(.[is.finite(.)], na.rm = TRUE), .)))
# gentrification classes for classification
df_features <- df_features %>%
mutate(
# Create a binary target for classification (high vs low gentrification)
gentrification_binary = ifelse(gentrification_quantile >= 4, "high", "low"),
# Create a 3-level target for more nuanced classification
gentrification_level = case_when(
gentrification_quantile <= 2 ~ "low",
gentrification_quantile == 3 ~ "medium",
gentrification_quantile >= 4 ~ "high"
),
# Create a continuous target for regression tasks
gentrification_continuous = gentrification_index
)
# convert factor variables for modeling
df_features$gentrification_binary <- as.factor(df_features$gentrification_binary)
df_features$gentrification_level <- as.factor(df_features$gentrification_level)
# filter out training if zscore has missing
modeling_data <- df_features %>%
filter(!is.na(gentrification_index), !is.na(gentrification_momentum)) %>%
select(-geo_id)
#futures selection for training and test
############################################################################################
#crucial predictors for gen
predictors <- c(
# common features related to gen
"median_income", "professional_rate", "pct_college_plus", "pct_white",
"young_adult_rate", "median_gross_rent", "median_home_value", "cost_burden_rate",
# temporal features
"change_3yr_income", "change_3yr_rent", "change_3yr_home_value",
"pct_change_income", "pct_change_rent",
# race-aware features
"diversity_index", "white_income_interaction", "white_change_acceleration",
# Advanced indices
"displacement_vulnerability_index", "gentrification_momentum"
)
set.seed(123)
train_index <- createDataPartition(modeling_data$gentrification_binary, p = 0.7, list = FALSE)
train_data <- modeling_data[train_index, ]
test_data <- modeling_data[-train_index, ]
# ensure there are no missing in the predictors for training
train_data_clean <- train_data %>%
filter(if_all(all_of(predictors), ~ !is.na(.)))
# use Random Forest for predict importance futures
model_path <- "/Users/ahmhamza/Downloads/model/"
rf_importance <- readRDS(paste0(model_path, "rf_importance_model.rds"))
rf_importance <- randomForest(
x = train_data_clean[, predictors],
y = train_data_clean$gentrification_binary,
importance = TRUE,
ntree = 500
)
#create directory and save models
#dir.create("Downloads/model")
#saveRDS(rf_importance, "Downloads/model/rf_importance_model.rds")
#for markdown
#rf_importance_model <- readRDS("Downloads/model/rf_importance_model.rds")
# obs variable importance
importance_df <- as.data.frame(importance(rf_importance))
importance_df$Feature <- rownames(importance_df)
importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]
print(importance_df)
# Plot top 15 important features
ggplot(importance_df[1:15,], aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Top 15 Features for Predicting Gentrification", x = "Features", y = "Importance")
#predictors for modeling (keep all the futures)
important_predictors <- importance_df$Feature[1:13]
################################################################
############################################################################################
# Logistic Regression
model_path <- "/Users/ahmhamza/Downloads/model/"
logistic_model <- readRDS(paste0(model_path, "logistic_model.rds"))
logistic_model <- caret::train(
gentrification_binary ~ change_3yr_rent + change_3yr_white + gentrification_momentum +
displacement_vulnerability_index + pct_change_income +
white_income_interaction + diversity_index + white_change_acceleration +
change_3yr_income + change_3yr_home_value,
data = train_data,
method = "glm",
family = "binomial",
metric = "ROC", # use ROC if you're using twoClassSummary
trControl = caret::trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = caret::twoClassSummary
)
)
# saveRDS(logistic_model, "Downloads/model/logistic_model.rds")
# logistic_model <- readRDS("Downloads/model/logistic_model.rds.")
# summary of the model
summary(logistic_model)
# Probabilities
pred_probs <- predict(logistic_model, newdata = test_data, type = "prob")
# Class labels
pred_classes <- predict(logistic_model, newdata = test_data)
summary(pred_probs)
#scaling for RF and NN
preproc <- preProcess(train_data[, important_predictors], method = c("center", "scale"))
train_data_scaled <- predict(preproc, train_data[, important_predictors])
test_data_scaled <- predict(preproc, test_data[, important_predictors])
# Add the target variable back
train_data_scaled$gentrification_binary <- train_data$gentrification_binary
test_data_scaled$gentrification_binary <- test_data$gentrification_binary
#there were some NA in train and test
colSums(is.na(train_data_scaled))
train_data_scaled <- na.omit(train_data_scaled)
colSums(is.na(train_data_scaled))
colSums(is.na(test_data_scaled))
test_data_scaled <- na.omit(test_data_scaled)
colSums(is.na(test_data_scaled))
#running a another RF wiht XGboost
# Random Forest Model
model_path <- "/Users/ahmhamza/Downloads/model/"
rf_model <- readRDS(paste0(model_path, "rf_model.rds"))
rf_model <- caret::train(
gentrification_binary ~ .,
data = train_data_scaled,
method = "rf",
metric = "Accuracy",
trControl = trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
),
ntree = 500
)
# saveRDS(rf_model, "Downloads/model/rf_model.rds")
# rf_model <- readRDS("Downloads/model/rf_model.rds")
summary(rf_model)
print(rf_model)
model_path <- "/Users/ahmhamza/Downloads/model/"
xgb_model <- readRDS(paste0(model_path, "xgb_model.rds"))
# XGBoost Model
xgb_model <- caret::train(
gentrification_binary ~ .,
data = train_data_scaled,
method = "xgbTree",
metric = "Accuracy",
trControl = trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
)
#saveRDS(xgb_model, "Downloads/model/xgb_model.rds")
#xgb_model <- readRDS("Downloads/model/xgb_model.rds")
#NN & deep learning preparation and modeling
# Neural Network with optimal hyperparameters
# Perform grid search for hyperparameter tuning
nn_grid <- expand.grid(size = c(5, 10, 15), decay = c(0.001, 0.01, 0.1))
model_path <- "/Users/ahmhamza/Downloads/model/"
nn_model <- readRDS(paste0(model_path, "nn_model.rds"))
nn_model <- caret::train(
gentrification_binary ~ .,
data = train_data_scaled,
method = "nnet",
metric = "Accuracy",
tuneGrid = nn_grid,
trControl = trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
),
maxit = 1000,
trace = FALSE
)
#saveRDS(nn_model, "Downloads/model/nn_model.rds")
#nn_model <- readRDS("Downloads/model/nn_model.rds")
# print the best tuning parameters
print(nn_model$bestTune)
# Predictions on test set
nn_pred <- predict(nn_model, newdata = test_data_scaled)
nn_prob <- predict(nn_model, newdata = test_data_scaled, type = "prob")
# Confusion Matrix for Neural Network
nn_cm <- confusionMatrix(nn_pred, test_data_scaled$gentrification_binary)
print(nn_cm)
# ROC curve
nn_roc <- roc(test_data_scaled$gentrification_binary, nn_prob[,"high"])
plot(nn_roc, main = "ROC Curve - Neural Network")
auc(nn_roc)
############################################################################################
#comparing models and evaluation
############################################################################################
# generate predictions for all models
models_list <- list(
"Logistic Regression" = logistic_model,
"Neural Network" = nn_model,
"Random Forest" = rf_model,
"XGBoost" = xgb_model
)
# function to evaluate model performance
evaluate_model <- function(model, test_data, model_name) {
pred <- predict(model, newdata = test_data)
prob <- predict(model, newdata = test_data, type = "prob")
cm <- confusionMatrix(pred, test_data$gentrification_binary)
roc_obj <- roc(test_data$gentrification_binary, prob[, "high"])
auc_value <- auc(roc_obj)
data.frame(
Model = model_name,
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
F1_Score = cm$byClass["F1"],
AUC = as.numeric(auc_value)
)
}
# evaluate all models
model_performance <- bind_rows(
evaluate_model(logistic_model, test_data, "Logistic Regression"),
evaluate_model(rf_model, test_data_scaled, "Random Forest"),
evaluate_model(xgb_model, test_data_scaled, "XGBoost"),
evaluate_model(nn_model, test_data_scaled, "Neural Network"),
)
# print combined performance table
kable(model_performance, caption = "Model Performance on Test Data")
# visualization in comparison
# bar plot of AUC
auc_plot <- ggplot(model_performance, aes(x = Model, y = AUC, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(AUC, 3)), vjust = -0.3) +
labs(title = "AUC Comparison of Models", y = "AUC") +
theme_minimal() +
theme(legend.position = "none")
# bar plot of Accuracy
accuracy_plot <- ggplot(model_performance, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Accuracy, 3)), vjust = -0.3) +
labs(title = "Accuracy Comparison of Models", y = "Accuracy") +
theme_minimal() +
theme(legend.position = "none")
# combine AUC and Accuracy plots
grid.arrange(auc_plot, accuracy_plot, ncol = 2)
# essential futures from coefficient (logistic model coefficient)
if (nrow(summary(logistic_model)$coefficients) > 1) {
coef_df_lr <- as.data.frame(summary(logistic_model)$coefficients[-1, ]) %>%
mutate(Feature = rownames(.)) %>%
arrange(desc(abs(Estimate))) %>%
head(10) # Top 10 coefficients
coef_plot_lr <- ggplot(coef_df_lr, aes(x = reorder(Feature, Estimate), y = Estimate, fill = Estimate > 0)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
labs(title = "Top 10 Coefficients in Logistic Regression", x = "Feature", y = "Coefficient") +
scale_fill_manual(name = "Direction", values = c("FALSE" = "salmon", "TRUE" = "lightgreen"))
print(coef_plot_lr)
} else {
print("Logistic Regression model might not have converged properly or has no significant predictors.")
}
# practical model insights
# identify rows with no missing values in the predictors used for the best model
complete_rows <- complete.cases(modeling_data[, important_predictors])
# a dataframe with only the complete cases for prediction
model_df_for_prediction <- modeling_data[complete_rows, ]
# best model based on a chosen metric
best_model_name <- model_performance %>% arrange(desc(AUC)) %>% slice_head(n = 1) %>% pull(Model)
best_model <- models_list[[best_model_name]]
cat("Best Model:", best_model_name, "\n")
# predictions only on complete cases
prediction_results <- predict(best_model, newdata = model_df_for_prediction, type = "prob")
model_df_for_prediction$gentrification_probability <- prediction_results[, "high"]
# a temporary ID in the complete cases dataframe for merging
model_df_for_prediction$temp_id <- 1:nrow(model_df_for_prediction)
# a temporary ID in the original modeling_data dataframe for merging the complete cases back
modeling_data$temp_id <- ifelse(complete_rows, 1:sum(complete_rows), NA)
# merge predictions back to the original modeling_data dataframe based on the temporary ID
model_df <- modeling_data %>%
left_join(model_df_for_prediction %>% select(temp_id, gentrification_probability), by = "temp_id")
# remove the temporary ID
model_df$temp_id <- NULL
# risk categories based on gentrification probability
model_df <- model_df %>%
mutate(risk_category = case_when(
is.na(gentrification_probability) ~ "Missing Data",
gentrification_probability >= 0.75 ~ "Very High Risk",
gentrification_probability >= 0.5 ~ "High Risk",
gentrification_probability >= 0.25 ~ "Moderate Risk",
TRUE ~ "Low Risk"
))
# # Summarize by borough and risk category
# borough_summary <- model_df %>%
# group_by(borough, risk_category) %>%
# summarise(
# n_tracts = n(),
# avg_income = mean(median_income, na.rm = TRUE),
# avg_rent = mean(median_gross_rent, na.rm = TRUE),
# avg_home_value = mean(median_home_value, na.rm = TRUE),
# avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
# avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
# .groups = "drop"
# )
# view summary table
# kable(borough_summary, digits = 2, caption = "Census Tract Counts and Averages by Borough and Gentrification Risk")
model_df$years <- "2022"
borough_summary <- model_df %>%
mutate(years = "2022") %>% # or year = 2022
group_by(borough, risk_category, years) %>%
summarise(
n_tracts = n(),
avg_income = mean(median_income, na.rm = TRUE),
avg_rent = mean(median_gross_rent, na.rm = TRUE),
avg_home_value = mean(median_home_value, na.rm = TRUE),
avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
.groups = "drop"
)
borough_summary <- borough_summary %>%
mutate(
avg_income = round(avg_income),
avg_rent = round(avg_rent),
avg_home_value = round(avg_home_value),
avg_college_rate = round(avg_college_rate * 100, 1),
avg_poverty_rate = round(avg_poverty_rate * 100, 1)
)
borough_summary %>%
kable(digits = 2, caption = "Census Tract Counts and Averages by Borough, Risk Category, and Year") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
# visualize risk category distribution by borough
ggplot(borough_summary, aes(x = borough, y = n_tracts, fill = risk_category)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Distribution of Gentrification Risk Categories by Borough",
x = "Borough", y = "Number of Census Tracts",
fill = "Gentrification Risk") +
theme_minimal() +
scale_fill_brewer(palette = "RdYlGn", direction = -1)
# top 20 census tracts with highest gentrification probability
top_risk_tracts <- model_df %>%
arrange(desc(gentrification_probability)) %>%
select(tract, borough, gentrification_probability, risk_category,
median_income, median_gross_rent, median_home_value,
pct_college_plus, poverty_rate) %>%
head(20)
# view top risk tracts
kable(top_risk_tracts, digits = 3, caption = "Top 20 Census Tracts with Highest Gentrification Probability")
# Target Policy Analysis
model_df <- model_df %>%
mutate(
displacement_risk_score = (poverty_rate + cost_burden_rate) / 2,
displacement_risk_category = case_when(
displacement_risk_score >= 30 ~ "High",
displacement_risk_score >= 20 ~ "Medium",
TRUE ~ "Low"
)
)
# create policy priority matrix
model_df <- model_df %>%
mutate(
policy_priority = case_when(
risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "High" ~ "Urgent Intervention",
risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Medium" ~ "High Priority",
risk_category == "Moderate Risk" & displacement_risk_category == "High" ~ "High Priority",
risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Low" ~ "Proactive Planning",
risk_category == "Moderate Risk" & displacement_risk_category == "Medium" ~ "Proactive Planning",
risk_category == "Low Risk" & displacement_risk_category == "High" ~ "Community Support",
TRUE ~ "Monitor"
)
)
# count tracts by borough and policy priority
policy_counts <- model_df %>%
group_by(borough, policy_priority) %>%
summarise(n_tracts = n(), .groups = "drop") %>%
spread(policy_priority, n_tracts, fill = 0)
# view policy counts
kable(policy_counts, caption = "Census Tracts by Borough and Policy Priority")
# visualize policy priorities
model_df %>%
count(borough, policy_priority) %>%
ggplot(aes(x = borough, y = n, fill = policy_priority)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Policy Intervention Priorities by Borough",
x = "Borough", y = "Number of Census Tracts",
fill = "Policy Priority") +
theme_minimal() +
scale_fill_manual(values = c("Urgent Intervention" = "#d73027",
"High Priority" = "#fc8d59",
"Proactive Planning" = "#fee090",
"Community Support" = "#e0f3f8",
"Monitor" = "#91bfdb"))