Ahm Hamza 05.13.2025

Overview

This project aims to address the critical challenge of identifying gentrification patterns in New York City’s five boroughs (Bronx, Brooklyn, Queens, Manhattan, and Staten Island) using census data from 2015 to 2022. Gentrification is a complex urban phenomenon that significantly impacts housing affordability, neighborhood demographics, and community stability across major urban centers. In New York City, rapid neighborhood change has created challenges for urban planners, policymakers, and community organizations who need better tools to anticipate and manage these transformations. By leveraging demographic, economic, and housing indicators, this project develops predictive models to assess gentrification risk and identify neighborhoods at high risk of displacement. Through advanced machine learning techniques, the project aims to provide actionable insights for prioritizing interventions that can mitigate displacement, ensure affordable housing, and promote sustainable urban development, ultimately fostering more equitable communities in NYC.

Data Science Issue

This project aims to translate urban policy interventions into a predictive classification framework to identify neighborhoods at risk of gentrification in New York City. Gentrification, a complex socio-economic phenomenon, poses significant risks of displacement for lower-income communities, altering the social fabric and pushing vulnerable populations out of traditionally affordable areas. The study aim to translate urban policy interventions into a predictive classification framework to identify neighborhoods at risk of gentrification by answering which census tracts are likely to experience high levels of gentrification in the coming years, based on key demographic, economic, and housing indicators?

More specific goal includes:

1.Predict binary gentrification risk (high vs. low) for NYC census tracts, based on a rich set of demographic, economic, and housing variables.

2.Identify the most important predictors of gentrification risk: Which variables (e.g., income, housing affordability, educational attainment, vacancy rates) are the strongest indicators of gentrification? Understanding these relationships can guide effective policy decisions.

3.Compare traditional statistical methods with modern machine learning approaches: Utilize regression models, decision trees, random forests, support vector machines, and deep learning techniques to predict gentrification risk. Evaluate and compare the performance of these models using cross-validation and metrics like accuracy, precision, recall, and F1-score.

4.Develop an actionable risk assessment framework for prioritizing interventions: Using the results from our predictive models, we will develop a framework that helps urban planners and local policymakers prioritize neighborhoods at high risk of gentrification. The goal is to inform strategies to protect vulnerable populations, increase affordable housing stock, and ensure inclusive urban development.

5.Analyze the social and economic drivers of gentrification: Beyond prediction, we will explore the socio-economic dynamics driving gentrification processes. This includes examining factors like the “creative class” (young professionals in management, business, and technology sectors), public transportation accessibility, and housing affordability.

6.Leverage temporal data for trend forecasting: Using longitudinal census data from 2015 to 2022, we will capture the evolving patterns of gentrification. By analyzing trends over time, we can improve our ability to predict future risk and adapt to changing urban dynamics.

Data and library loading

#load necessary packages
rq_packages <- c("ada", "caret", "corrplot", "dplyr", "GGally", "ggcorrplot", 
                 "ggplot2", "gridExtra", "glmnet", "knitr", "mice", "neuralnet", "naivebayes", "naniar", "nnet", "pROC",
                 "randomForest", "rpart", "rpart.plot", "ROSE", "rsample", 
                 "sf", "keras", "scales", "smotefamily", "tigris", "tidyverse", 
                 "tidyr", "tfdatasets", "tensorflow", "pROC", "tmap", "yardstick", "randomForest", "rpart", "xgboost","VIM")

#install and load packages
for (pkg in rq_packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}


# read the CSV file
df <- read.csv('census_data_2015_2022_gentrification.csv')
# display the first 15 rows
#head(df, 5)
# structure and summary
# str(df)
 summary(df)
##  median_gross_rent    median_income        total_edu_pop   high_school_grad
##  Min.   :-666666666   Min.   :-666666666   Min.   :    0   Min.   :   0.0  
##  1st Qu.:      1222   1st Qu.:     42886   1st Qu.: 1600   1st Qu.: 288.0  
##  Median :      1450   Median :     63203   Median : 2416   Median : 475.0  
##  Mean   : -29685548   Mean   : -26136377   Mean   : 2679   Mean   : 538.4  
##  3rd Qu.:      1734   3rd Qu.:     85461   3rd Qu.: 3399   3rd Qu.: 720.2  
##  Max.   :      3501   Max.   :    250001   Max.   :22006   Max.   :5763.0  
##  NA's   :8            NA's   :7                                            
##  some_college_1   some_college_2    assoc_degree_1   assoc_degree_2  
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:  34.0   1st Qu.:  40.00   1st Qu.: 146.0   1st Qu.:  84.0  
##  Median :  75.0   Median :  73.00   Median : 234.0   Median : 144.0  
##  Mean   : 100.4   Mean   :  92.97   Mean   : 275.9   Mean   : 171.8  
##  3rd Qu.: 137.0   3rd Qu.: 123.00   3rd Qu.: 363.0   3rd Qu.: 227.0  
##  Max.   :1314.0   Max.   :1672.00   Max.   :3589.0   Max.   :2690.0  
##                                                                      
##     bachelor         masters        professional       doctorate      
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.00  
##  1st Qu.: 247.0   1st Qu.:  87.0   1st Qu.:   8.00   1st Qu.:   0.00  
##  Median : 429.0   Median : 181.0   Median :  31.00   Median :  14.00  
##  Mean   : 591.5   Mean   : 298.2   Mean   :  85.05   Mean   :  40.94  
##  3rd Qu.: 751.0   3rd Qu.: 370.0   3rd Qu.:  84.00   3rd Qu.:  42.00  
##  Max.   :6300.0   Max.   :4059.0   Max.   :1861.00   Max.   :1250.00  
##                                                                       
##    total_pop         white           black             asian        
##  Min.   :    0   Min.   :    0   Min.   :    0.0   Min.   :   0.00  
##  1st Qu.: 2328   1st Qu.:  117   1st Qu.:   45.0   1st Qu.:  63.75  
##  Median : 3510   Median :  712   Median :  260.0   Median : 241.00  
##  Mean   : 3820   Mean   : 1223   Mean   :  830.2   Mean   : 531.65  
##  3rd Qu.: 4909   3rd Qu.: 1795   3rd Qu.: 1340.2   3rd Qu.: 716.00  
##  Max.   :29256   Max.   :13699   Max.   :17123.0   Max.   :7994.00  
##                                                                     
##     hispanic      total_units    owner_occupied   poverty_universe
##  Min.   :    0   Min.   :    0   Min.   :   0.0   Min.   :    0   
##  1st Qu.:  273   1st Qu.:  772   1st Qu.: 184.0   1st Qu.: 2291   
##  Median :  620   Median : 1243   Median : 360.0   Median : 3456   
##  Mean   : 1108   Mean   : 1427   Mean   : 465.2   Mean   : 3758   
##  3rd Qu.: 1406   3rd Qu.: 1830   3rd Qu.: 594.0   3rd Qu.: 4837   
##  Max.   :13429   Max.   :12998   Max.   :5905.0   Max.   :28186   
##                                                                   
##  below_poverty     labor_force      unemployed     median_home_value   
##  Min.   :   0.0   Min.   :    0   Min.   :   0.0   Min.   :-666666666  
##  1st Qu.: 222.0   1st Qu.: 1164   1st Qu.:  62.0   1st Qu.:    396700  
##  Median : 461.0   Median : 1752   Median : 115.0   Median :    560200  
##  Mean   : 698.6   Mean   : 1971   Mean   : 149.7   Mean   : -81549249  
##  3rd Qu.: 925.0   3rd Qu.: 2498   3rd Qu.: 197.0   3rd Qu.:    782900  
##  Max.   :5538.0   Max.   :14485   Max.   :1705.0   Max.   :   2000001  
##                                                    NA's   :75          
##  rent_as_pct_income     gini_index         housing_units_total
##  Min.   :-666666666   Min.   :-666666666   Min.   :    0      
##  1st Qu.:        27   1st Qu.:         0   1st Qu.:  772      
##  Median :        32   Median :         0   Median : 1243      
##  Mean   : -26380749   Mean   : -22152372   Mean   : 1427      
##  3rd Qu.:        38   3rd Qu.:         0   3rd Qu.: 1830      
##  Max.   :        51   Max.   :         1   Max.   :12998      
##                                                               
##  owner_cost_burdened renter_cost_burdened housing_age_total
##  Min.   :   0.0      Min.   :  0.00       Min.   :    0    
##  1st Qu.: 355.0      1st Qu.:  0.00       1st Qu.:  848    
##  Median : 738.0      Median :  9.00       Median : 1365    
##  Mean   : 962.2      Mean   : 22.43       Mean   : 1572    
##  3rd Qu.:1351.0      3rd Qu.: 30.00       3rd Qu.: 1995    
##  Max.   :8209.0      Max.   :484.00       Max.   :13036    
##                                                            
##  housing_built_2010_later housing_built_2000_2009 housing_total  
##  Min.   :   0.000         Min.   :   0.00         Min.   :    0  
##  1st Qu.:   0.000         1st Qu.:   0.00         1st Qu.:  848  
##  Median :   0.000         Median :   0.00         Median : 1365  
##  Mean   :   8.252         Mean   :  34.37         Mean   : 1572  
##  3rd Qu.:   0.000         3rd Qu.:  23.00         3rd Qu.: 1995  
##  Max.   :2619.000         Max.   :5073.00         Max.   :13036  
##                                                                  
##   vacant_units      pop_total     males_25_34_1    males_25_34_2   
##  Min.   :   0.0   Min.   :    0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:  47.0   1st Qu.: 2328   1st Qu.:  72.0   1st Qu.:  68.0  
##  Median :  91.0   Median : 3510   Median : 129.0   Median : 124.0  
##  Mean   : 144.9   Mean   : 3820   Mean   : 167.9   Mean   : 160.1  
##  3rd Qu.: 162.0   3rd Qu.: 4909   3rd Qu.: 220.0   3rd Qu.: 213.0  
##  Max.   :2572.0   Max.   :29256   Max.   :1534.0   Max.   :1358.0  
##                                                                    
##  females_25_34_1  females_25_34_2   employed_pop     prof_male     
##  Min.   :   0.0   Min.   :   0.0   Min.   :    0   Min.   :   0.0  
##  1st Qu.:  76.0   1st Qu.:  74.0   1st Qu.: 1067   1st Qu.: 122.0  
##  Median : 140.0   Median : 133.0   Median : 1611   Median : 229.0  
##  Mean   : 179.4   Mean   : 166.1   Mean   : 1821   Mean   : 356.7  
##  3rd Qu.: 234.0   3rd Qu.: 222.0   3rd Qu.: 2309   3rd Qu.: 437.0  
##  Max.   :2034.0   Max.   :1635.0   Max.   :13477   Max.   :4542.0  
##                                                                    
##   prof_female     commuters_total public_transit    walk_to_work   
##  Min.   :   0.0   Min.   :    0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 174.0   1st Qu.: 1037   1st Qu.: 456.0   1st Qu.:  36.0  
##  Median : 293.0   Median : 1568   Median : 783.0   Median :  91.0  
##  Mean   : 407.5   Mean   : 1774   Mean   : 956.7   Mean   : 174.5  
##  3rd Qu.: 503.0   3rd Qu.: 2248   3rd Qu.:1290.0   3rd Qu.: 189.0  
##  Max.   :4545.0   Max.   :13216   Max.   :7845.0   Max.   :3209.0  
##                                                                    
##  other_commute    work_from_home    tenure_total   recent_owners   
##  Min.   :  0.00   Min.   :   0.0   Min.   :    0   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  23.0   1st Qu.:  772   1st Qu.:  0.00  
##  Median :  0.00   Median :  58.0   Median : 1243   Median :  0.00  
##  Mean   : 12.11   Mean   : 115.1   Mean   : 1427   Mean   : 11.04  
##  3rd Qu.: 15.00   3rd Qu.: 134.0   3rd Qu.: 1830   3rd Qu.: 14.00  
##  Max.   :692.00   Max.   :2928.0   Max.   :12998   Max.   :408.00  
##                                                                    
##     GEO_ID              state        county          tract       
##  Length:17816       Min.   :36   Min.   : 5.00   Min.   :   100  
##  Class :character   1st Qu.:36   1st Qu.:47.00   1st Qu.: 15400  
##  Mode  :character   Median :36   Median :47.00   Median : 30202  
##                     Mean   :36   Mean   :54.83   Mean   : 42489  
##                     3rd Qu.:36   3rd Qu.:81.00   3rd Qu.: 58100  
##                     Max.   :36   Max.   :85.00   Max.   :990100  
##                                                                  
##  recent_renters         year        borough           less_than_hs   
##  Min.   :   0.00   Min.   :2015   Length:17816       Min.   :   0.0  
##  1st Qu.:   5.00   1st Qu.:2017   Class :character   1st Qu.: 174.0  
##  Median :  31.00   Median :2019   Mode  :character   Median : 353.0  
##  Mean   :  62.68   Mean   :2019                      Mean   : 484.1  
##  3rd Qu.:  80.00   3rd Qu.:2021                      3rd Qu.: 643.0  
##  Max.   :1466.00   Max.   :2022                      Max.   :4718.0  
##                                                                      
##     hs_grad        some_college   college_plus   pct_less_than_hs
##  Min.   :   0.0   Min.   :   0   Min.   :    0   Min.   :0.0000  
##  1st Qu.: 288.0   1st Qu.: 366   1st Qu.:  377   1st Qu.:0.0942  
##  Median : 475.0   Median : 563   Median :  668   Median :0.1610  
##  Mean   : 538.4   Mean   : 641   Mean   : 1016   Mean   :0.1821  
##  3rd Qu.: 720.2   3rd Qu.: 827   3rd Qu.: 1245   3rd Qu.:0.2578  
##  Max.   :5763.0   Max.   :8304   Max.   :11324   Max.   :1.0000  
##                                                  NA's   :464     
##   pct_hs_grad     pct_some_college pct_college_plus   pct_white     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1569   1st Qu.:0.1888   1st Qu.:0.2011   1st Qu.:0.0426  
##  Median :0.2171   Median :0.2504   Median :0.3054   Median :0.2343  
##  Mean   :0.2120   Mean   :0.2490   Mean   :0.3569   Mean   :0.3223  
##  3rd Qu.:0.2695   3rd Qu.:0.3096   3rd Qu.:0.4605   3rd Qu.:0.5797  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :464      NA's   :464      NA's   :464      NA's   :464     
##    pct_black        pct_asian       pct_hispanic      pct_other     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0161   1st Qu.:0.0228   1st Qu.:0.0951   1st Qu.:0.0096  
##  Median :0.0804   Median :0.0728   Median :0.1889   Median :0.0234  
##  Mean   :0.2307   Mean   :0.1422   Mean   :0.2698   Mean   :0.0350  
##  3rd Qu.:0.3811   3rd Qu.:0.2012   3rd Qu.:0.3982   3rd Qu.:0.0445  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :0.4494  
##  NA's   :464      NA's   :464      NA's   :464      NA's   :464     
##  homeownership_rate  poverty_rate    unemployment_rate young_adult_rate
##  Min.   :0.0000     Min.   :0.0000   Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.1609     1st Qu.:0.0815   1st Qu.:0.0426    1st Qu.:0.1263  
##  Median :0.3353     Median :0.1439   Median :0.0670    Median :0.1576  
##  Mean   :0.3723     Mean   :0.1775   Mean   :0.0776    Mean   :0.1732  
##  3rd Qu.:0.5617     3rd Qu.:0.2444   3rd Qu.:0.1012    3rd Qu.:0.2014  
##  Max.   :1.0000     Max.   :1.0000   Max.   :1.0000    Max.   :1.0000  
##  NA's   :537        NA's   :492      NA's   :504       NA's   :464     
##  professional_rate alt_commute_rate new_housing_rate recent_move_rate
##  Min.   :0.0000    Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2560    1st Qu.:0.5472   1st Qu.:0.0000   1st Qu.:0.0136  
##  Median :0.3627    Median :0.7185   Median :0.0000   Median :0.0364  
##  Mean   :0.3930    Mean   :0.6793   Mean   :0.0256   Mean   :0.0480  
##  3rd Qu.:0.4980    3rd Qu.:0.8289   3rd Qu.:0.0228   3rd Qu.:0.0695  
##  Max.   :1.0000    Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :507       NA's   :507      NA's   :523      NA's   :537     
##   vacancy_rate    cost_burden_rate gentrification_pressure displacement_risk   
##  Min.   :0.0000   Min.   :0.0000   Min.   :-0.0302         Min.   :-1333333.3  
##  1st Qu.:0.0449   1st Qu.:0.4464   1st Qu.: 0.3974         1st Qu.:       0.4  
##  Median :0.0732   Median :0.6774   Median : 0.4970         Median :       0.5  
##  Mean   :0.0869   Mean   :0.6424   Mean   : 0.5261         Mean   :  -12963.2  
##  3rd Qu.:0.1095   3rd Qu.:0.8598   3rd Qu.: 0.5943         3rd Qu.:       0.6  
##  Max.   :1.0000   Max.   :1.6370   Max.   : 9.1769         Max.   :       0.8  
##  NA's   :523      NA's   :537      NA's   :596             NA's   :537

Preprocessing Operationalizing Gentrification through Composite Indicators

To operationalize gentrification as a multidimensional, longitudinal process, this study standardizes and integrates a range of social, economic, and housing indicators. Core variables such as household income, professional employment, racial composition, housing costs, and mobility patterns are transformed into standardized z-scores. This enables cross-neighborhood and cross-year comparison while addressing scale disparities across variables (Zuk et al., 2018). The sign on specific variables such as vacancy rate is reversed to reflect their inverse relationship with neighborhood stability and desirability (Hackworth & Smith, 2001).

To capture temporal shifts, the methodology includes three-year change variables and relative growth rates for key indicators such as income, rent, and racial composition. These changes are contextualized against citywide averages to identify neighborhoods experiencing disproportionately rapid change (Chapple & Zuk, 2016; Urban Spatial, 2019). This longitudinal approach moves beyond static, cross-sectional analyses and instead reflects the dynamic nature of neighborhood transformation over time.

In addition to primary indicators, interaction metrics, such as, diversity indices and race-class interactions are introduced. These are particularly salient in urban contexts, where gentrification often manifests through racialized economic uplift and displacement. Metrics such as the interaction between percent White and income, or the rate of increase in the White population, provide insights into racial turnover and economic stratification (Crawford & Griffin, 2023; Zuk et al., 2018).

Two composite indices are central to this framework: the Displacement Vulnerability Index and the Gentrification Momentum Index. The Displacement Vulnerability Index synthesizes variables such as poverty rate, housing cost burden, renter percentage, and lower educational attainment, highlighting communities under economic strain ( Marcuse, 1985; Newman & Wyly, 2006). The Gentrification Momentum Index captures neighborhood transformation through socio-economic uplift indicators like rising income, rent, education levels, and influx of young professionals normalized via z-scores over time. These composite indices follow the methodological precedent established by similar efforts, including the Urban Displacement Project’s Housing Precarity Risk Model and the Green Gentrification Vulnerability Index (Chapple et al., 2021; Rigolon & Christensen, 2024).

The introduction of two novel composite indices displacement vulnerability and gentrification momentum adds interpretability and predictive power. The former measures economic pressure experienced by lower-income, renter-heavy, and racially diverse communities, while the latter captures the rate and direction of neighborhood change through key temporal z-scores. These composite indicators distill complex patterns into actionable variables for machine learning models.

Referance:

  1. Chapple, K., & Zuk, M. (2016). Forewarned: The use of neighborhood early warning systems for gentrification and displacement. Cityscape, 18(3), 109–130.

  2. Chapple, K., Zuk, M., Loukaitou-Sideris, A., Ong, P., & Thomas, T. (2021). Urban displacement project: Housing precarity risk model. Retrieved from https://www.urbandisplacement.org

  3. Crawford, J., & Griffin, J. (2023). Race, class, and the temporal dynamics of gentrification. Social Problems. Advance online publication. https://doi.org/10.1093/socpro/spaf014

  4. Hackworth, J., & Smith, N. (2001). The changing state of gentrification. TESG, 92(4), 464–477. https://doi.org/10.1111/1467-9663.00172

  5. Marcuse, P. (1985). Gentrification, abandonment, and displacement: Connections, causes, and policy responses in New York City. Journal of Urban and Contemporary Law, 28, 195–240.

  6. Newman, K., & Wyly, E. (2006). The right to stay put, revisited: Gentrification and resistance to displacement in New York City. Urban Studies, 43(1), 23–57. https://doi.org/10.1080/00420980500388710

  7. Rigolon, A., & Christensen, J. (2024). Green Gentrification Vulnerability Index (GGVI): Measuring displacement risk in U.S. cities. Cities, 142, 104546. https://doi.org/10.1016/j.cities.2024.104546

  8. Urban Spatial. (2019). Predicting gentrification using longitudinal census data. Retrieved from https://urbanspatialanalysis.com/portfolio/predicting-gentrification-using-longitudinal-census-data/

  9. Zuk, M., Bierbaum, A. H., Chapple, K., Gorska, K., & Loukaitou-Sideris, A. (2018). Gentrification, displacement, and the role of public investment. Journal of Planning Literature, 33(1), 31–44. https://doi.org/10.1177/0885412217716439

# Convert year, borough, etc., to factors if needed
df$year <- as.factor(df$year)
df$borough <- as.factor(df$borough)

# Rename columns for consistency
df <- df %>%
  rename_with(~ gsub("\\s+", "_", .x)) %>%
  rename_with(tolower)
# colnames(df)

###Check for missing 
#count missing or zero values per variable
df %>%
  summarise(across(everything(), ~ sum(is.na(.) | . == 0))) %>%
  gather(variable, missing_or_zero)
##                    variable missing_or_zero
## 1         median_gross_rent               8
## 2             median_income               7
## 3             total_edu_pop             464
## 4          high_school_grad             574
## 5            some_college_1            1293
## 6            some_college_2             884
## 7            assoc_degree_1             564
## 8            assoc_degree_2             678
## 9                  bachelor             537
## 10                  masters             702
## 11             professional            3320
## 12                doctorate            5720
## 13                total_pop             464
## 14                    white             814
## 15                    black            1613
## 16                    asian            1847
## 17                 hispanic             582
## 18              total_units             537
## 19           owner_occupied             987
## 20         poverty_universe             492
## 21            below_poverty             604
## 22              labor_force             504
## 23               unemployed             735
## 24        median_home_value              75
## 25       rent_as_pct_income               0
## 26               gini_index               0
## 27      housing_units_total             537
## 28      owner_cost_burdened             587
## 29     renter_cost_burdened            7803
## 30        housing_age_total             523
## 31 housing_built_2010_later           15106
## 32  housing_built_2000_2009           10363
## 33            housing_total             523
## 34             vacant_units            1034
## 35                pop_total             464
## 36            males_25_34_1             654
## 37            males_25_34_2             677
## 38          females_25_34_1             693
## 39          females_25_34_2             669
## 40             employed_pop             507
## 41                prof_male             619
## 42              prof_female             625
## 43          commuters_total             507
## 44           public_transit             563
## 45             walk_to_work            1276
## 46            other_commute           10138
## 47           work_from_home            1743
## 48             tenure_total             537
## 49            recent_owners           10416
## 50                   geo_id               0
## 51                    state               0
## 52                   county               0
## 53                    tract               0
## 54           recent_renters            4228
## 55                     year               0
## 56                  borough               0
## 57             less_than_hs             641
## 58                  hs_grad             574
## 59             some_college             518
## 60             college_plus             508
## 61         pct_less_than_hs             641
## 62              pct_hs_grad             574
## 63         pct_some_college             518
## 64         pct_college_plus             508
## 65                pct_white             814
## 66                pct_black            1613
## 67                pct_asian            1847
## 68             pct_hispanic             582
## 69                pct_other            1971
## 70       homeownership_rate             987
## 71             poverty_rate             604
## 72        unemployment_rate             735
## 73         young_adult_rate             533
## 74        professional_rate             558
## 75         alt_commute_rate             528
## 76         new_housing_rate            9366
## 77         recent_move_rate            3025
## 78             vacancy_rate            1034
## 79         cost_burden_rate             587
## 80  gentrification_pressure             596
## 81        displacement_risk             537
#low missing <5% mena median imput
low_na_cols <- c("median_income", "median_gross_rent", "median_home_value")

for (col in low_na_cols) {
  df[[col]][df[[col]] == 0 | is.na(df[[col]])] <- median(df[[col]][df[[col]] != 0], na.rm = TRUE)
}

#moderate missing (5%–30%) → KNN Imputation

cols_for_knn <- c("professional", "recent_move_rate", "housing_built_2010_later", 
                  "vacancy_rate", "renter_cost_burdened", "recent_owners", 
                  "recent_renters", "professional_rate", "alt_commute_rate")

df[cols_for_knn] <- VIM::kNN(df[cols_for_knn], k = 5, imp_var = FALSE)
z_vars <- c(
  "median_income", "professional_rate", "pct_college_plus",
  "pct_white", "young_adult_rate",
  "median_gross_rent", "median_home_value", "cost_burden_rate", "vacancy_rate",
  "recent_move_rate", "recent_renters", "recent_owners",
  "new_housing_rate", "housing_built_2010_later",
  "homeownership_rate"  
)
# standardize in long format (better memory handling)
df_z <- df %>%
  arrange(geo_id, year) %>%
  group_by(geo_id) %>%
  mutate(across(all_of(z_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
  ungroup()

# adjust vacancy rate sign (negative z-score)
df_z <- df_z %>%
  mutate(z_vacancy_rate = -z_vacancy_rate)


# create gentrification index from the newly computed z-scores
df_z <- df_z %>%
  rowwise() %>%
  mutate(gentrification_index = mean(c_across(starts_with("z_")), na.rm = TRUE)) %>%
  ungroup()

# bin into quantiles (efficient post-processing step)
df_z <- df_z %>%
  mutate(gentrification_quantile = ntile(gentrification_index, 5))

Exploratory Data Analysis

EDA begins with reviewing the dataset’s structure, checking for null values, and examining summary statistics. Several variables had significant missing values after futures engineering and indexing, particularly for older years and certain neighborhoods. Housing variables (median home value, rent) had low missingness (<5%), while temporal change metrics and specialized indicators had moderate missingness (5-30%). Univariate analysis involves visualizing distributions of key variables, while bivariate analysis explores relationships between variables. Feature engineering creates new features to improve model prediction, and geospatial analysis maps gentrification trends. Initial insights identify gentrification hotspots, displacement risk, and key predictors, setting the foundation for further analysis and modeling.

#EDA ############################################################################################
# Visualize distribution
ggplot(df_z, aes(x = gentrification_index)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Gentrification Index", x = "Index", y = "Count")

# Boxplot by borough
ggplot(df_z, aes(x = borough, y = gentrification_index, fill = borough)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Gentrification Index by Borough", x = "Borough", y = "Index") +
  theme(legend.position = "none")

# Line plot over time by borough
df_z %>%
  group_by(year, borough) %>%
  summarise(mean_index = mean(gentrification_index, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = mean_index, color = borough, group = borough)) +
  geom_line(size = 1.2) +
  labs(title = "Gentrification Trends by Borough", x = "Year", y = "Mean Index") +
  theme_minimal()

Data Preparation

Missing Value Treatment:

For variables with low missingness: Median imputation For variables with moderate missingness: K-Nearest Neighbors imputation

Feature Engineering: Created standardized z-scores for 15 key variables Developed temporal change metrics (3-year changes, percentage growth rates) Generated interaction terms between demographic and economic indicators Constructed composite indices for displacement vulnerability and gentrification momentum

Target Variable Creation: Developed a continuous gentrification index from standardized indicators Transformed into quantiles and created binary (high/low) and multi-class (low/medium/high) classification targets

# yearly change 
df_temporal <- df_z %>%
  arrange(geo_id, year) %>%
  group_by(geo_id) %>%
  mutate(
    # 3-year changes for key metrics
    change_3yr_income = median_income - lag(median_income, 3),
    change_3yr_rent = median_gross_rent - lag(median_gross_rent, 3),
    change_3yr_home_value = median_home_value - lag(median_home_value, 3),
    change_3yr_college = pct_college_plus - lag(pct_college_plus, 3),
    change_3yr_white = pct_white - lag(pct_white, 3),
    change_3yr_vacancy = vacancy_rate - lag(vacancy_rate, 3),
    
    # Growth rates (percentage changes) for key metrics
    pct_change_income = (median_income / lag(median_income, 3) - 1) * 100,
    pct_change_rent = (median_gross_rent / lag(median_gross_rent, 3) - 1) * 100,
    pct_change_home_value = (median_home_value / lag(median_home_value, 3) - 1) * 100,
    
    # Rate of change compared to borough average (relative changes)
    rel_income_growth = pct_change_income / mean(pct_change_income, na.rm = TRUE),
    rel_rent_growth = pct_change_rent / mean(pct_change_rent, na.rm = TRUE),
    rel_home_value_growth = pct_change_home_value / mean(pct_change_home_value, na.rm = TRUE)
  ) %>%
  ungroup()

#race metrics 

df_race <- df_temporal %>%
  mutate(
    # Diversity metrics
    diversity_index = 1 - ((pct_white^2 + (1-pct_white)^2)), # Simple diversity index
    
    # Race-economic interactions
    white_income_interaction = pct_white * median_income,
    white_college_interaction = pct_white * pct_college_plus,
    
    # Racial change momentum (acceleration of demographic shifts)
    white_change_acceleration = change_3yr_white - lag(change_3yr_white, 1),
    
    # Economic-demographic pressure metrics
    displacement_risk = cost_burden_rate * (1 - pct_white) * (1 / median_income),
    gentrification_pressure = pct_change_rent * change_3yr_white * young_adult_rate
  )


#z-temporal and demographic shift variables
z_temporal_vars <- c(
  "change_3yr_rent",
  "pct_change_rent",
  "pct_change_income",
  "change_3yr_white",
  "professional_rate",
  "young_adult_rate",
  "cost_burden_rate",
  "median_income",
  "recent_renters",
  "homeownership_rate"
)

df_race <- df_race %>%
  group_by(year) %>%  # or geo_id depending on context, but year is better for city-wide scale
  mutate(across(all_of(z_temporal_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
  ungroup()



# displacement composite indicator
df_features <- df_race %>%
  rowwise() %>%
  mutate(
    # Improved displacement risk index - combines economic vulnerability with demographic stability
    displacement_vulnerability_index = mean(c(
      z_cost_burden_rate,
      -z_median_income,
      z_recent_renters,
      -z_homeownership_rate,
      z_change_3yr_rent
    ), na.rm = TRUE),
    
    # Improved gentrification momentum index - combines rate of change metrics
    gentrification_momentum = mean(c(
      z_pct_change_rent,
      z_pct_change_income, 
      z_change_3yr_white,
      z_professional_rate,
      z_young_adult_rate
    ), na.rm = TRUE)
  ) %>%
  ungroup()

# removing missing values in the new features
#  list of newly created features
new_features <- setdiff(names(df_features), names(df_z))

# temporal features have NAs for early years
# impute them with mean by borough and year for better contextual accuracy
df_features <- df_features %>%
  group_by(borough, year) %>%
  mutate(across(all_of(new_features), ~ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
  ungroup()

# if any NAs remain, use overall means
df_features <- df_features %>%
  mutate(across(all_of(new_features), ~ifelse(is.na(.) | is.infinite(.) | is.nan(.), 
                                              mean(.[is.finite(.)], na.rm = TRUE), .)))



# gentrification classes for classification
df_features <- df_features %>%
  mutate(
    # Create a binary target for classification (high vs low gentrification)
    gentrification_binary = ifelse(gentrification_quantile >= 4, "high", "low"),
    
    # Create a 3-level target for more nuanced classification
    gentrification_level = case_when(
      gentrification_quantile <= 2 ~ "low",
      gentrification_quantile == 3 ~ "medium",
      gentrification_quantile >= 4 ~ "high"
    ),
    
    # Create a continuous target for regression tasks
    gentrification_continuous = gentrification_index
  )

# convert factor variables for modeling 
df_features$gentrification_binary <- as.factor(df_features$gentrification_binary)
df_features$gentrification_level <- as.factor(df_features$gentrification_level)


# filter out training if zscore has missing  
modeling_data <- df_features %>%
  filter(!is.na(gentrification_index), !is.na(gentrification_momentum)) %>%
  select(-geo_id)


#futures selection for training and test 
############################################################################################
#crucial predictors for gen
predictors <- c(
  # common features related to gen
  "median_income", "professional_rate", "pct_college_plus", "pct_white", 
  "young_adult_rate", "median_gross_rent", "median_home_value", "cost_burden_rate",
  
  # temporal features
  "change_3yr_income", "change_3yr_rent", "change_3yr_home_value", 
  "pct_change_income", "pct_change_rent",
  
  # race-aware features
  "diversity_index", "white_income_interaction", "white_change_acceleration",
  
  # Advanced indices
  "displacement_vulnerability_index", "gentrification_momentum"
)

set.seed(123)
train_index <- createDataPartition(modeling_data$gentrification_binary, p = 0.7, list = FALSE)
train_data <- modeling_data[train_index, ]
test_data <- modeling_data[-train_index, ]

# ensure there are no missing in the predictors for training
train_data_clean <- train_data %>%
  filter(if_all(all_of(predictors), ~ !is.na(.)))

Loading pre-train model

# Set model directory path
model_path <- "/Users/ahmhamza/Downloads/model/"

# Load models from .rds files
logistic_model <- readRDS(paste0(model_path, "logistic_model.rds"))
rf_importance_model <- readRDS(paste0(model_path, "rf_importance_model.rds"))
rf_model <- readRDS(paste0(model_path, "rf_model.rds"))
xgb_model <- readRDS(paste0(model_path, "xgb_model.rds"))
nn_model <- readRDS(paste0(model_path, "nn_model.rds"))

Methodologies Used

The modeling process begins with the use of a Random Forest classifier to estimate feature importance via the Gini index. Random Forests are particularly advantageous in this context due to their capacity to handle complex, non-linear relationships and interactions among features without requiring extensive preprocessing. By enabling the extraction of the MeanDecreaseGini metric, this step provided a principled basis for feature selection, ensuring that subsequent models would be built on a reduced yet informative subset of predictors.

Based on futures selection RF process I incorporate logistic regression model, used variables that were identified as highly important. Logistic regression was chosen for its interpretability and ability to establish baseline predictive performance.

Other important experimental phase included the training of three predictive models: a Random Forest classifier, XGBoost, and a feedforward neural network (multilayer perceptron) with hyperparameter tuning via grid search. The selection of Random Forest and XGBoost was guided by their complementary strengths in capturing non-linearity and their robustness to overfitting through ensembling. XGBoost, in particular, was expected to yield strong predictive performance due to its ability to perform regularized gradient boosting, which is known to generalize well on structured tabular data.

Additionally, a neural network model was implemented, with optimization over the number of hidden units and regularization decay parameters. While neural networks are often computationally intensive and less interpretable, they offer flexibility in capturing intricate patterns that may be present in the gentrification dynamics.

Each model’s performance was assessed through 5-fold cross-validation with class probability estimation and ROC-based scoring, enabling fair and robust comparison across algorithms. Finally, predictive accuracy and discriminative capacity were validated using confusion matrices and ROC curve analysis, particularly for the neural network model.

# use Random Forest for predict importance futures 
model_path <- "/Users/ahmhamza/Downloads/model/"
rf_importance <- readRDS(paste0(model_path, "rf_importance_model.rds"))

# rf_importance <- randomForest(
#   x = train_data_clean[, predictors],
#   y = train_data_clean$gentrification_binary,
#   importance = TRUE,
#   ntree = 500
# )

#create directory and save models 
#dir.create("Downloads/model")
#saveRDS(rf_importance, "Downloads/model/rf_importance_model.rds")
#for markdown 
#rf_importance_model <- readRDS("Downloads/model/rf_importance_model.rds")

# obs variable importance
importance_df <- as.data.frame(importance(rf_importance))
importance_df$Feature <- rownames(importance_df)
importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]

print(importance_df)
##                                        high       low MeanDecreaseAccuracy
## white_change_acceleration        71.3899472 12.432226             66.01386
## change_3yr_income                27.1696106 13.803264             30.67471
## gentrification_momentum          37.0383736 38.632312             59.24590
## pct_change_income                24.0716413 10.886349             25.37463
## pct_change_rent                  25.2752871  9.451630             25.29324
## displacement_vulnerability_index 17.3408009 32.100130             44.13585
## median_gross_rent                14.3934417 38.831764             48.05685
## young_adult_rate                 -2.6944697 43.040552             41.95014
## change_3yr_home_value            20.0206583  4.095011             18.85803
## change_3yr_rent                  29.3309079  5.097043             32.44276
## professional_rate                -0.6626774 39.227661             45.02353
## median_income                    10.1163600 44.057355             52.47086
## cost_burden_rate                 13.2463607 33.222988             42.31783
## pct_college_plus                 -1.4493133 36.394440             40.18808
## median_home_value                 4.5852340 37.158392             39.84097
## pct_white                         3.5180205 34.331211             40.17988
## diversity_index                   0.7132631 35.948922             39.69164
## white_income_interaction          1.7939335 37.129448             43.96167
##                                  MeanDecreaseGini
## white_change_acceleration                582.2233
## change_3yr_income                        456.6352
## gentrification_momentum                  387.8773
## pct_change_income                        376.1346
## pct_change_rent                          374.2196
## displacement_vulnerability_index         303.7693
## median_gross_rent                        301.4398
## young_adult_rate                         300.4118
## change_3yr_home_value                    299.3066
## change_3yr_rent                          292.2119
## professional_rate                        290.8855
## median_income                            282.3179
## cost_burden_rate                         275.5486
## pct_college_plus                         268.2974
## median_home_value                        264.7184
## pct_white                                250.0030
## diversity_index                          249.7921
## white_income_interaction                 246.6388
##                                                           Feature
## white_change_acceleration               white_change_acceleration
## change_3yr_income                               change_3yr_income
## gentrification_momentum                   gentrification_momentum
## pct_change_income                               pct_change_income
## pct_change_rent                                   pct_change_rent
## displacement_vulnerability_index displacement_vulnerability_index
## median_gross_rent                               median_gross_rent
## young_adult_rate                                 young_adult_rate
## change_3yr_home_value                       change_3yr_home_value
## change_3yr_rent                                   change_3yr_rent
## professional_rate                               professional_rate
## median_income                                       median_income
## cost_burden_rate                                 cost_burden_rate
## pct_college_plus                                 pct_college_plus
## median_home_value                               median_home_value
## pct_white                                               pct_white
## diversity_index                                   diversity_index
## white_income_interaction                 white_income_interaction
# Plot top 15 important features
ggplot(importance_df[1:15,], aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 15 Features for Predicting Gentrification", x = "Features", y = "Importance")

#predictors for modeling (keep all the futures)
important_predictors <- importance_df$Feature[1:13]
################################################################
############################################################################################
# Logistic Regression 
model_path <- "/Users/ahmhamza/Downloads/model/"
logistic_model <- readRDS(paste0(model_path, "logistic_model.rds"))

# logistic_model <- caret::train(
#   gentrification_binary ~ change_3yr_rent + change_3yr_white + gentrification_momentum + 
#     displacement_vulnerability_index + pct_change_income + 
#     white_income_interaction + diversity_index + white_change_acceleration +
#     change_3yr_income + change_3yr_home_value,
#   data = train_data,
#   method = "glm",
#   family = "binomial",
#   metric = "ROC",  # use ROC if you're using twoClassSummary
#   trControl = caret::trainControl(
#     method = "cv", 
#     number = 5,
#     classProbs = TRUE,
#     summaryFunction = caret::twoClassSummary
#   )
# )

#saveRDS(logistic_model, "Downloads/model/logistic_model.rds")
#logistic_model <- readRDS("Downloads/model/logistic_model.rds.")

# summary of the model
summary(logistic_model)
## 
## Call:
## NULL
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       3.865e-01  3.573e-02  10.816  < 2e-16 ***
## change_3yr_rent                  -3.489e-10  3.560e-10  -0.980   0.3270    
## change_3yr_white                 -4.120e+00  4.926e-01  -8.364  < 2e-16 ***
## gentrification_momentum          -2.189e-01  3.609e-02  -6.065 1.32e-09 ***
## displacement_vulnerability_index -5.831e-02  3.447e-02  -1.692   0.0907 .  
## pct_change_income                 1.319e-07  3.350e-07   0.394   0.6937    
## white_income_interaction         -6.232e-10  4.986e-10  -1.250   0.2114    
## diversity_index                  -7.873e-02  1.178e-01  -0.669   0.5038    
## white_change_acceleration        -3.960e-01  6.389e-01  -0.620   0.5355    
## change_3yr_income                -7.395e-10  8.080e-10  -0.915   0.3601    
## change_3yr_home_value            -9.049e-10  1.642e-10  -5.509 3.60e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16412  on 12192  degrees of freedom
## Residual deviance: 16152  on 12182  degrees of freedom
## AIC: 16174
## 
## Number of Fisher Scoring iterations: 4
#running a another RF wiht XGboost 
# Random Forest Model

model_path <- "/Users/ahmhamza/Downloads/model/"
rf_model <- readRDS(paste0(model_path, "rf_model.rds"))

# rf_model <- caret::train(
#   gentrification_binary ~ .,
#   data = train_data_scaled,
#   method = "rf",
#   metric = "Accuracy",
#   trControl = trainControl(
#     method = "cv",
#     number = 5,
#     classProbs = TRUE,
#     summaryFunction = twoClassSummary
#   ),
#   ntree = 500
# )
#saveRDS(rf_model, "Downloads/model/rf_model.rds")
#rf_model <- readRDS("Downloads/model/rf_model.rds")
summary(rf_model)
##                 Length Class      Mode     
## call                5  -none-     call     
## type                1  -none-     character
## predicted       12100  factor     numeric  
## err.rate         1500  -none-     numeric  
## confusion           6  -none-     numeric  
## votes           24200  matrix     numeric  
## oob.times       12100  -none-     numeric  
## classes             2  -none-     character
## importance         13  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             14  -none-     list     
## y               12100  factor     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames             13  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           2  -none-     character
## param               1  -none-     list
print(rf_model)
## Random Forest 
## 
## 12100 samples
##    13 predictor
##     2 classes: 'high', 'low' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9680, 9680, 9680, 9679, 9681 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.8119271  0.6505553  0.8372474
##    7    0.8148342  0.6639224  0.8267491
##   13    0.8138617  0.6612505  0.8221896
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
#NN & deep learning preparation and modeling 
# Neural Network with optimal hyperparameters
# Perform grid search for hyperparameter tuning
nn_grid <- expand.grid(size = c(5, 10, 15),  decay = c(0.001, 0.01, 0.1))

model_path <- "/Users/ahmhamza/Downloads/model/"
nn_model <- readRDS(paste0(model_path, "nn_model.rds"))

# nn_model <- caret::train(
#   gentrification_binary ~ .,
#   data = train_data_scaled,
#   method = "nnet",
#   metric = "Accuracy",
#   tuneGrid = nn_grid,
#   trControl = trainControl(
#     method = "cv",
#     number = 5,
#     classProbs = TRUE,
#     summaryFunction = twoClassSummary
#   ),
#   maxit = 1000,
#   trace = FALSE
# )
#saveRDS(nn_model, "Downloads/model/nn_model.rds")
# nn_model <- readRDS("Downloads/model/nn_model.rds")
# print the best tuning parameters
print(nn_model$bestTune)
##   size decay
## 7   15 0.001
# Predictions on test set
nn_pred <- predict(nn_model, newdata = test_data_scaled)
nn_prob <- predict(nn_model, newdata = test_data_scaled, type = "prob")

# Confusion Matrix for Neural Network
nn_cm <- confusionMatrix(nn_pred, test_data_scaled$gentrification_binary)
print(nn_cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high  low
##       high 1420  602
##       low   659 2498
##                                           
##                Accuracy : 0.7565          
##                  95% CI : (0.7446, 0.7682)
##     No Information Rate : 0.5986          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.491           
##                                           
##  Mcnemar's Test P-Value : 0.1148          
##                                           
##             Sensitivity : 0.6830          
##             Specificity : 0.8058          
##          Pos Pred Value : 0.7023          
##          Neg Pred Value : 0.7913          
##              Prevalence : 0.4014          
##          Detection Rate : 0.2742          
##    Detection Prevalence : 0.3904          
##       Balanced Accuracy : 0.7444          
##                                           
##        'Positive' Class : high            
## 
# ROC curve
nn_roc <- roc(test_data_scaled$gentrification_binary, nn_prob[,"high"])
plot(nn_roc, main = "ROC Curve - Neural Network")

auc(nn_roc)
## Area under the curve: 0.8368

Result & Prediction

The Random Forest model identifies white race % change acceleration as the most predictive indicator of gentrification, followed closely by gentrification momentum. Short-term economic changes like change_3yr_income, pct_change_income, pct_change_rent, and change_3yr_rent are also strong predictors. Existing socioeconomic vulnerabilities and housing costs, measured by displacement_vulnerability_index, median_gross_rent, and median_income, are important structural features. Conversely, features like professional_rate, young_adult_rate, and pct_college_plus show lower predictive power in this model.

The model’s predictions exhibit a strong class imbalance, heavily favoring “low” gentrification risk with 4,798 tracts compared to only 426 predicted as “high.” This is likely due to the model’s conservative nature, as most predicted probabilities for “high” gentrification are below 0.5. Even the “high” predictions often have probabilities just above this threshold, indicating uncertainty. Interestingly, the logistic regression coefficients suggest a significant negative association between increases in the white population (change_3yr_white) and the likelihood of being classified as the positive gentrification outcome. gentrification_momentum also shows a strong negative relationship, while change_3yr_home_value has a statistically significant but very small negative effect.

The Random Forest model achieved its best performance with mtry = 7, yielding a good AUC of 0.815, indicating a strong ability to differentiate between high and low gentrification risk tracts. At this setting, the model correctly identified approximately 66.4% of high-risk tracts (sensitivity) and a higher 82.7% of low-risk tracts (specificity). The final model selected mtry = 7 to maximize the ROC, achieving a balance between sensitivity and specificity, although it still performs somewhat better at identifying low-risk areas. The neural network model, with 15 hidden units and a 0.001 weight decay, exhibits strong classification performance with an accuracy of 75.7% and a very good AUC of 0.837. It shows balanced performance, correctly identifying 68.3% of high-risk tracts (sensitivity) and 80.6% of low-risk tracts (specificity), with a moderate agreement beyond chance (Kappa of 0.491). The model’s precision for identifying high-risk tracts is 70.2%. While slightly better at identifying low gentrification areas, its ability to detect high-risk areas is considerably better than random and surpasses the logistic regression model, making it a viable classifier when both sensitivity and specificity are important.

# generate predictions for all models
models_list <- list(
  "Logistic Regression" = logistic_model,
  "Neural Network" = nn_model,
  "Random Forest" = rf_model,
  "XGBoost" = xgb_model
)

# function to evaluate model performance
evaluate_model <- function(model, test_data, model_name) {
  pred <- predict(model, newdata = test_data)
  prob <- predict(model, newdata = test_data, type = "prob")
  cm <- confusionMatrix(pred, test_data$gentrification_binary)
  roc_obj <- roc(test_data$gentrification_binary, prob[, "high"])
  auc_value <- auc(roc_obj)
  data.frame(
    Model = model_name,
    Accuracy = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    F1_Score = cm$byClass["F1"],
    AUC = as.numeric(auc_value)
  )
}

# evaluate all models
model_performance <- bind_rows(
  evaluate_model(logistic_model, test_data, "Logistic Regression"),
  evaluate_model(rf_model, test_data_scaled, "Random Forest"),
  evaluate_model(xgb_model, test_data_scaled, "XGBoost"),
  evaluate_model(nn_model, test_data_scaled, "Neural Network"),
)

# print combined performance table
kable(model_performance, caption = "Model Performance on Test Data")
Model Performance on Test Data
Model Accuracy Sensitivity Specificity F1_Score AUC
Accuracy…1 Logistic Regression 0.6154288 0.1211106 0.9448166 0.2011928 0.5920873
Accuracy…2 Random Forest 0.7727360 0.6705147 0.8412903 0.7031526 0.8234124
Accuracy…3 XGBoost 0.7748600 0.6734007 0.8429032 0.7060010 0.8298067
Accuracy…4 Neural Network 0.7565167 0.6830207 0.8058065 0.6925140 0.8368491

Comparing Models Performances

Across the four models evaluated for gentrification risk classification, XGBoost and Random Forest emerged as the top performers, demonstrating robust and balanced capabilities. Notably, XGBoost slightly surpassed Random Forest in several key metrics, achieving an accuracy of 77.5%, a sensitivity of 67.3%, a specificity of 84.3%, an F1 score of 70.6%, and an AUC of 0.83. The Neural Network model closely followed, showcasing the highest sensitivity at 68.3% and the highest AUC at 0.837, indicating its particular strength in correctly identifying census tracts at high risk of gentrification. However, the Neural Network exhibited slightly lower overall accuracy and specificity when compared to the tree-based ensemble methods. In stark contrast, the Logistic Regression model displayed substantially weaker performance, characterized by a very low sensitivity of only 12.1% and a correspondingly low F1 score of 20.1%, revealing its significant difficulty in accurately classifying high-risk tracts, despite achieving a high specificity of 94.5%. Consequently, the ensemble methods, XGBoost and Random Forest, provided the most consistently balanced and reliable classification results, while the Neural Network presents itself as a compelling alternative, particularly when prioritizing the detection of high-risk areas due to its strong recall performance.

# visualization in comparison 

# bar plot of AUC
auc_plot <- ggplot(model_performance, aes(x = Model, y = AUC, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(AUC, 3)), vjust = -0.3) +
  labs(title = "AUC Comparison of Models", y = "AUC") +
  theme_minimal() +
  theme(legend.position = "none")


# bar plot of Accuracy
accuracy_plot <- ggplot(model_performance, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Accuracy, 3)), vjust = -0.3) +
  labs(title = "Accuracy Comparison of Models", y = "Accuracy") +
  theme_minimal() +
  theme(legend.position = "none")


# combine AUC and Accuracy plots
grid.arrange(auc_plot, accuracy_plot, ncol = 2)

if (nrow(summary(logistic_model)$coefficients) > 1) {
  coef_df_lr <- as.data.frame(summary(logistic_model)$coefficients[-1, ]) %>%
    mutate(Feature = rownames(.)) %>%
    arrange(desc(abs(Estimate))) %>%
    head(10) # Top 10 coefficients
  
  coef_plot_lr <- ggplot(coef_df_lr, aes(x = reorder(Feature, Estimate), y = Estimate, fill = Estimate > 0)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme_minimal() +
    labs(title = "Top 10 Coefficients in Logistic Regression", x = "Feature", y = "Coefficient") +
    scale_fill_manual(name = "Direction", values = c("FALSE" = "salmon", "TRUE" = "lightgreen"))
  
  print(coef_plot_lr)
} else {
  print("Logistic Regression model might not have converged properly or has no significant predictors.")
}

# Practical Model Insights 

# identify rows with no missing values in the predictors used for the best model
complete_rows <- complete.cases(modeling_data[, important_predictors])

# a dataframe with only the complete cases for prediction
model_df_for_prediction <- modeling_data[complete_rows, ]

# best model based on a chosen metric 
best_model_name <- model_performance %>% arrange(desc(AUC)) %>% slice_head(n = 1) %>% pull(Model)
best_model <- models_list[[best_model_name]]
cat("Best Model:", best_model_name, "\n")
## Best Model: Neural Network
# predictions only on complete cases
prediction_results <- predict(best_model, newdata = model_df_for_prediction, type = "prob")
model_df_for_prediction$gentrification_probability <- prediction_results[, "high"]

# a temporary ID in the complete cases dataframe for merging
model_df_for_prediction$temp_id <- 1:nrow(model_df_for_prediction)

# a temporary ID in the original modeling_data dataframe for merging the complete cases back
modeling_data$temp_id <- ifelse(complete_rows, 1:sum(complete_rows), NA)

# merge predictions back to the original modeling_data dataframe based on the temporary ID
model_df <- modeling_data %>%
  left_join(model_df_for_prediction %>% select(temp_id, gentrification_probability), by = "temp_id")

# remove the temporary ID
model_df$temp_id <- NULL

#  risk categories based on gentrification probability

model_df <- model_df %>%
  mutate(risk_category = case_when(
    is.na(gentrification_probability) ~ "Missing Data",
    gentrification_probability >= 0.75 ~ "Very High Risk",
    gentrification_probability >= 0.5 ~ "High Risk",
    gentrification_probability >= 0.25 ~ "Moderate Risk",
    TRUE ~ "Low Risk"
  ))



# # Summarize by borough and risk category
# borough_summary <- model_df %>%
#   group_by(borough, risk_category) %>%
#   summarise(
#     n_tracts = n(),
#     avg_income = mean(median_income, na.rm = TRUE),
#     avg_rent = mean(median_gross_rent, na.rm = TRUE),
#     avg_home_value = mean(median_home_value, na.rm = TRUE),
#     avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
#     avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
#     .groups = "drop"
#   )
# 
# 
# # view summary table
# kable(borough_summary, digits = 2, caption = "Census Tract Counts and Averages by Borough and Gentrification Risk")


model_df$years <- "2022"

borough_summary <- model_df %>%
  mutate(years = "2022") %>%  # or year = 2022
  group_by(borough, risk_category, years) %>%
  summarise(
    n_tracts = n(),
    avg_income = mean(median_income, na.rm = TRUE),
    avg_rent = mean(median_gross_rent, na.rm = TRUE),
    avg_home_value = mean(median_home_value, na.rm = TRUE),
    avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
    avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
    .groups = "drop"
  )

borough_summary <- borough_summary %>%
  mutate(
    avg_income = round(avg_income),
    avg_rent = round(avg_rent),
    avg_home_value = round(avg_home_value),
    avg_college_rate = round(avg_college_rate * 100, 1),
    avg_poverty_rate = round(avg_poverty_rate * 100, 1)
  )


borough_summary %>%
  kable(digits = 2, caption = "Census Tract Counts and Averages by Borough, Risk Category, and Year") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Census Tract Counts and Averages by Borough, Risk Category, and Year
borough risk_category years n_tracts avg_income avg_rent avg_home_value avg_college_rate avg_poverty_rate
Bronx High Risk 2022 1 30221 953 506100 12.5 47.3
Bronx Low Risk 2022 1183 -7842759 -5070570 -167607725 20.9 26.0
Bronx Missing Data 2022 36 -666666666 -666666666 -666666666 22.1 77.7
Bronx Moderate Risk 2022 5 46638 1288 -133025293 19.9 31.0
Bronx Very High Risk 2022 1508 -11009049 -7956358 -154859320 20.4 28.1
Kings Low Risk 2022 2485 -1809358 -4559165 -46480116 36.5 19.1
Kings Missing Data 2022 18 -666666666 -666666666 -666666666 5.8 NaN
Kings Moderate Risk 2022 72 -9193671 1490 -54816768 37.1 18.6
Kings Very High Risk 2022 3525 -4285292 -5294033 -39574908 36.0 19.5
Manhattan High Risk 2022 1 -666666666 2432 -666666666 58.3 20.6
Manhattan Low Risk 2022 949 -6229072 -6320623 -121417281 61.1 16.9
Manhattan Missing Data 2022 18 -666666666 -666666666 -666666666 8.1 83.1
Manhattan Moderate Risk 2022 76 118405 2311 -139320411 71.5 14.3
Manhattan Very High Risk 2022 1278 -3551535 -3127930 -112348288 62.7 15.6
Queens Low Risk 2022 2211 -5656590 -12662369 -16921767 31.3 12.3
Queens Missing Data 2022 64 -666666666 -666666666 -666666666 29.7 66.5
Queens Moderate Risk 2022 48 76199 -69443071 -13326995 28.3 12.9
Queens Very High Risk 2022 3033 -8720061 -18681791 -22298090 32.2 12.3
Staten Island Low Risk 2022 371 -12497296 -28749801 -24662527 33.7 12.5
Staten Island Missing Data 2022 2 -666666666 -666666666 -666666666 NaN NaN
Staten Island Moderate Risk 2022 50 -39918673 -66665349 -106224321 37.1 8.5
Staten Island Very High Risk 2022 483 -9581208 -22082854 -14686081 33.0 13.2
# visualize risk category distribution by borough
ggplot(borough_summary, aes(x = borough, y = n_tracts, fill = risk_category)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Distribution of Gentrification Risk Categories by Borough",
       x = "Borough", y = "Number of Census Tracts",
       fill = "Gentrification Risk") +
  theme_minimal() +
  scale_fill_brewer(palette = "RdYlGn", direction = -1)

# top 20 census tracts with highest gentrification probability
top_risk_tracts <- model_df %>%
  arrange(desc(gentrification_probability)) %>%
  select(tract, borough, gentrification_probability, risk_category,
         median_income, median_gross_rent, median_home_value, 
         pct_college_plus, poverty_rate) %>%
  head(20)

# view top risk tracts
kable(top_risk_tracts, digits = 3, caption = "Top 20 Census Tracts with Highest Gentrification Probability")
Top 20 Census Tracts with Highest Gentrification Probability
tract borough gentrification_probability risk_category median_income median_gross_rent median_home_value pct_college_plus poverty_rate
14800 Kings 1 Very High Risk 85625 1852 927300 0.395 0.074
3900 Manhattan 1 Very High Risk 157652 1812 2000001 0.809 0.048
20500 Manhattan 1 Very High Risk 179348 2965 1594300 0.934 0.051
200 Bronx 1 Very High Risk 51100 1559 419200 0.294 0.224
2702 Bronx 1 Very High Risk 16000 963 437100 0.077 0.504
2702 Bronx 1 Very High Risk 14375 856 -666666666 0.051 0.519
3900 Bronx 1 Very High Risk 25990 954 338600 0.112 0.302
5001 Bronx 1 Very High Risk 27009 1297 498100 0.131 0.411
5600 Bronx 1 Very High Risk 18918 1056 424000 0.056 0.389
5600 Bronx 1 Very High Risk 21367 1085 429700 0.090 0.407
6700 Bronx 1 Very High Risk 21016 708 560200 0.052 0.471
6700 Bronx 1 Very High Risk 20945 696 349200 0.063 0.501
7000 Bronx 1 Very High Risk 39662 1364 469500 0.121 0.172
7000 Bronx 1 Very High Risk 50500 1404 472100 0.145 0.189
8400 Bronx 1 Very High Risk 66958 1471 442500 0.224 0.085
8400 Bronx 1 Very High Risk 83605 1620 457900 0.260 0.070
9200 Bronx 1 Very High Risk 35227 1304 453000 0.182 0.213
9300 Bronx 1 Very High Risk 30811 1157 372000 0.095 0.407
9300 Bronx 1 Very High Risk 32565 1191 444700 0.096 0.392
9300 Bronx 1 Very High Risk 33708 1187 386700 0.104 0.425
# Target Policy Analysis
model_df <- model_df %>%
  mutate(
    displacement_risk_score = (poverty_rate + cost_burden_rate) / 2,
    displacement_risk_category = case_when(
      displacement_risk_score >= 30 ~ "High",
      displacement_risk_score >= 20 ~ "Medium",
      TRUE ~ "Low"
    )
  )

# Create policy priority matrix
model_df <- model_df %>%
  mutate(
    policy_priority = case_when(
      risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "High" ~ "Urgent Intervention",
      risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Medium" ~ "High Priority",
      risk_category == "Moderate Risk" & displacement_risk_category == "High" ~ "High Priority",
      risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Low" ~ "Proactive Planning",
      risk_category == "Moderate Risk" & displacement_risk_category == "Medium" ~ "Proactive Planning",
      risk_category == "Low Risk" & displacement_risk_category == "High" ~ "Community Support",
      TRUE ~ "Monitor"
    )
  )

# Count tracts by borough and policy priority
policy_counts <- model_df %>%
  group_by(borough, policy_priority) %>%
  summarise(n_tracts = n(), .groups = "drop") %>%
  spread(policy_priority, n_tracts, fill = 0)



# view policy counts
kable(policy_counts, caption = "Census Tracts by Borough and Policy Priority")
Census Tracts by Borough and Policy Priority
borough Monitor Proactive Planning
Bronx 1224 1509
Kings 2575 3525
Manhattan 1043 1279
Queens 2323 3033
Staten Island 423 483
# Visualize policy priorities
model_df %>%
  count(borough, policy_priority) %>%
  ggplot(aes(x = borough, y = n, fill = policy_priority)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Policy Intervention Priorities by Borough",
       x = "Borough", y = "Number of Census Tracts",
       fill = "Policy Priority") +
  theme_minimal() +
  scale_fill_manual(values = c("Urgent Intervention" = "#d73027",
                               "High Priority" = "#fc8d59",
                               "Proactive Planning" = "#fee090",
                               "Community Support" = "#e0f3f8",
                               "Monitor" = "#91bfdb"))

Gentrification Risk and Neighborhood Disparities in NYC

The provided dataset offers a rich, multifaceted view of gentrification pressures across New York City’s five boroughs, integrating indicators of socioeconomic change with neighborhood risk classifications. Through examining census tract-level data, risk categorizations, and policy recommendations, to develop a more nuanced understanding of where gentrification is intensifying, where vulnerabilities lie, and what this suggests about equity and urban change. It incorporates socioeconomic indicators including median income, rent, home value, college attainment, and poverty rate. Gentrification risk categories range from Low to Very High, with a significant proportion of tracts flagged as “Missing Data” due to placeholder values (-666666666), which have led to negative averages in borough-level summaries. Despite these data distortions, some borough-specific patterns do emerge. Manhattan, as expected, contains some of the highest median incomes and home values, particularly in tracts identified as having the highest gentrification probabilities. Tract 20500 in Manhattan stands out with a median income nearing $180,000 and a median rent approaching $3,000, combined with a 93% college-educated population and a poverty rate just above 5%. These traits are emblematic of what scholars term advanced gentrification—a phase characterized not only by rising costs but also by elite demographic replacement and socio-cultural transformation. In contrast, the Bronx exhibits a different profile. Among the top gentrifying tracts, incomes remain comparatively lower, with some tracts falling below $20,000 annually and poverty rates exceeding 50%. Yet these same tracts also show rising rent and property values. This divergence suggests early-stage gentrification, where speculative investment and rental pressure precede substantial income changes, potentially displacing legacy residents before neighborhood-wide uplift occurs. Tracts such as 2702 and 6700 exemplify this contradiction, revealing a paradoxical coexistence of entrenched poverty and housing market acceleration.

High-Risk and Policy-Relevant Zones

Across the city, “Very High Risk” tracts are concentrated in Kings (Brooklyn), Queens, and the Bronx. These tracts often overlap with areas slated for Proactive Planning, indicating a policy recognition of imminent or ongoing change. Kings County, for instance, has 3,525 tracts flagged for proactive intervention, reflecting its centrality in recent waves of urban redevelopment and population inflow. What is striking here is that some of the most socioeconomically advantaged neighborhoods (as measured by income and education) are simultaneously those with the highest gentrification probability. This suggests that gentrification risk does not solely indicate neighborhood vulnerability in the traditional sense, but also the potential for cultural or demographic shift. Particularly, the replacement of lower-income, minority populations with wealthier, often white newcomers. Thus, the term “risk” must be interpreted not only in economic but also in social and cultural terms.

Emerging Themes and Reflections

Two insights emerge from this analysis:

Gentrification Risk Is Not Monolithic: The term captures a spectrum of neighborhood change, from elite consolidation (Manhattan) to early speculative pressure (Bronx). This reinforces the importance of borough-specific strategies and cautions against citywide generalizations.

Policy Planning Must Engage More Deeply with Social Dimensions: While proactive planning and monitoring efforts are well distributed geographically, their efficacy depends on integrating not only economic indicators but also resident voices, tenure stability, and historical displacement patterns. Without such integration, policies risk addressing symptoms while missing root causes.

Essay

Spatial Inequality and Urban Transformation

The Divergent Mechanics of Urban Change The analysis of census tract data across New York City’s five boroughs reveals not a singular gentrification process, but rather a complex ecosystem of neighborhood transformations operating through various pathways. Through machine learning classification of census data from 2015-2022, the study observe that gentrification manifests through multiple trajectories, each characterized by different combinations of demographic shifts, economic pressures, and housing market dynamics. In Manhattan’s high-probability gentrification zones, the data reveals what might be consider as consolidation gentrification tracts like 20500 exhibit median incomes approaching $180,000, coupled with 93% college education rates and median rents near $3,000. These neighborhoods have moved beyond the initial stages of displacement into a phase where socioeconomic homogeneity becomes self-reinforcing. The predictive models identify white_change_acceleration as the most significant feature in these areas, suggesting that racial demographic shifts function as both cause and consequence of advanced gentrification processes.

Conversely, the Bronx presents “speculative gentrification” patterns, where housing market metrics outpace local income growth. Census tracts showing median incomes below $20,000 alongside rapidly appreciating property values indicate investment patterns driven by external capital rather than organic neighborhood economic development. This contradiction manifests most prominently in tracts 2702 and 6700, where poverty rates exceeding 50% coexist with accelerating housing costs a statistical signature of speculative investment preceding demographic transformation.

Temporal Dimensions and Predictive Indicators

The temporal features engineered from longitudinal census data provided unexpected insight into gentrification’s mechanics. While static socioeconomic indicators showed limited predictive power, the Random Forest model identified change_3yr_income, pct_change_rent, and change_3yr_rent among the strongest predictors. This temporal dimension reveals gentrification as fundamentally a rate-of-change phenomenon rather than a static state, challenging traditional neighborhood classification approaches. The neural network model, achieving 75.7% accuracy with an AUC of 0.84, performed particularly well at identifying neighborhoods in the early-to-mid stages of transformation. Its balanced sensitivity (68.3%) and specificity (80.6%) across high and low risk classifications demonstrates the utility of advanced machine learning approaches in capturing the nonlinear relationships between demographic shifts, housing pressures, and economic indicators that characterize incipient gentrification. Interestingly, the relationship between racial demographics and gentrification emerged as more complex than commonly understood. The logistic regression coefficients revealed a significant negative association between increases in white population (change_3yr_white) and gentrification classification probability, contradicting simplified narratives of white in-migration as the sole driver. Instead, the data suggests a sequential process where economic transformation precedes demographic change, with housing market pressures serving as the leading indicator.

Beyond Binary Classification: A Spatial Typology of Urban Change

The classification models reveal limitations in binary gentrification framing. When neighborhoods are mapped according to their predicted probabilities and feature importance scores, at least four distinct transformation types emerge:

1.Elite Consolidation Zones: Primarily in Manhattan, characterized by high gentrification_momentum scores but decreasing change rates, suggesting market saturation.

  1. Acceleration Zones: Concentrated in Brooklyn (Kings County), exhibiting high displacement_vulnerability_index scores coupled with rapidly increasing change_3yr_rent values.

  2. Speculative Investment Frontiers: Emerging in the Bronx and eastern Queens, where housing market metrics and white_change_acceleration outpace local income growth.

  3. Stable Vulnerability Areas: Neighborhoods with high displacement risk but lower predicted gentrification probabilities, suggesting entrenched inequality without active transformation.

This typology has significant implications for policy intervention. The 3,525 tracts in Kings County flagged for proactive planning likely contain multiple transformation types requiring different intervention strategies. The neural network’s hidden layer activations reveal distinct patterns across these four zones, suggesting that different combinations of factors drive transformation across these spatial categories.

Policy Implications: Beyond Reactive Frameworks

The predictive power of change_3yr_rent and gentrification_momentum indicators suggests that interventions must engage earlier in the transformation cycle than current policies typically allow. By the time demographic shifts become visible in census data, the economic mechanisms of displacement are already well established. The classification models indicate that neighborhoods with similar static profiles can experience dramatically different trajectory probabilities based on relatively small differences in change rates. This finding challenges current intervention timeframes and suggests that policy effectiveness depends on integration of multiple data types at finer temporal resolutions. The neural network’s performance compared to logistic regression (AUC 0.837 vs. lower) demonstrates that the relationships between predictors are nonlinear and interactive exactly the type of complex systems that simplified policy frameworks struggle to address. The most provocative implication of this analysis is that gentrification risk assessment must be reconceptualized not as identifying vulnerable neighborhoods, but rather as identifying neighborhoods at specific transformation stages. The models reveal that the highest statistical predictors of change occur at different threshold points across different boroughs. Effective intervention requires borough-specific strategies calibrated to these locally specific thresholds.

Methodological Integration: Machine Learning and Social Science

This study demonstrates that the integration of machine learning methodologies with sociological frameworks produces insights unavailable to either approach independently. The Random Forest’s identification of white race percentage change acceleration as the most predictive indicator quantifies a relationship long theorized in urban sociology literature. Simultaneously, the neural network’s capacity to capture nonlinear relationships between variables provides mathematical validation for sociological concepts of neighborhood tipping points and cascading effects. The combined classification approach achieves what neither qualitative observation nor simplified statistical analysis can: the identification of mathematically precise thresholds where economic pressure transforms into demographic change.

Toward a New Understanding of Urban Change

The machine learning analysis of NYC census tract data from 2015-2022 reveals gentrification not as a uniform process but as a spatially differentiated phenomenon with borough-specific mechanics. The classification models demonstrate that the most predictive indicators of transformation are not static socioeconomic characteristics but rather rates of change across multiple dimensions. This understanding challenges both simplified narratives of gentrification and generalized policy approaches. The identification of distinct transformation types across boroughs suggests that urban change operates through multiple causal pathways requiring differentiated intervention strategies. The superior performance of neural network models indicates that these processes involve complex, nonlinear relationships that simple linear frameworks cannot adequately capture.

The emerging spatial typology of urban change identified through this analysis provides a foundation for more targeted policy development. By engaging with the temporal dimension of neighborhood transformation and the specific mechanical pathways through which economic pressure translates into demographic change, policymakers can move beyond reactive approaches toward proactive frameworks calibrated to local conditions. In the final analysis, this study demonstrates that effective engagement with gentrification requires understanding it not as a binary state but as a multi-dimensional process with distinct phases, thresholds, and spatial expressions. The integration of machine learning methodologies with sociological frameworks offers a powerful approach to identifying these patterns and informing more effective interventions.

Codes pipeline for Census Data Loding

```r
# Load necessary libraries
library(dplyr)  # For data manipulation
library(tidyr)  # For data reshaping
library(readr)  # For reading and writing CSV files

# Define API key for accessing Census data
api_key <- 'f658e5cc25ee3cb93ac64033d15b48f3f0eec6f8'

# List of enhanced variables related to gentrification indicators
variables <- c(
  # Core demographic and economic variables
  'B25064_001E',  # Median gross rent
  'B19013_001E',  # Median household income
  'B15003_001E',  # Total education population
  'B15003_017E',  # High school graduates
  'B15003_018E', 'B15003_019E', 'B15003_020E', 'B15003_021E',  # Some college or associate's degree
  'B15003_022E', 'B15003_023E', 'B15003_024E', 'B15003_025E',  # Bachelor's and higher
  'B03002_001E',  # Total population
  'B03002_003E',  # White alone
  'B03002_004E',  # Black alone
  'B03002_006E',  # Asian alone
  'B03002_012E',  # Hispanic/Latino
  'B25003_001E',  # Total housing units
  'B25003_002E',  # Owner-occupied units
  'B17001_001E',  # Total poverty universe
  'B17001_002E',  # Below poverty level
  'B23025_003E',  # Civilian labor force
  'B23025_005E',  # Unemployed individuals

  # Additional gentrification indicators
  'B25077_001E',  # Median home value
  'B25071_001E',  # Median gross rent as a percentage of household income
  'B19083_001E',  # Gini Index of Income Inequality

  # Housing cost burden variables
  'B25106_001E',  # Total housing units
  'B25106_024E',  # Owner costs >= 30% of income
  'B25106_045E',  # Renter costs >= 30% of income

  # New construction and vacancy rates
  'B25034_001E',  # Total housing units
  'B25034_002E', 'B25034_003E',  # Built in 2010 or later, 2000-2009
  'B25002_001E',  # Total housing units
  'B25002_003E',  # Vacant housing units

  # Young professionals (age demographics)
  'B01001_001E',  # Total population
  'B01001_011E', 'B01001_012E', 'B01001_035E', 'B01001_036E',  # Ages 25-34

  # Creative class/professional occupations
  'C24010_001E',  # Total civilian employed population (16+)
  'C24010_003E',  # Management, business, science, and arts occupations (male)
  'C24010_039E',  # Management, business, science, and arts occupations (female)

  # Transportation and commuting patterns
  'B08301_001E',  # Total commuters
  'B08301_010E',  # Public transportation usage
  'B08301_019E', 'B08301_020E',  # Walked, other commuting methods
  'B08301_021E',  # Worked from home

  # Tenure duration indicating residential stability
  'B25038_001E',  # Total occupied housing units
  'B25038_003E',  # Owner-occupied, moved in 2015 or later
  'B25038_010E'   # Renter-occupied, moved in 2015 or later
)

# Define county FIPS codes for New York City boroughs
counties <- list(
  Bronx = '005',
  Kings = '047',  # Brooklyn
  Queens = '081',
  Manhattan = '061',  # Add Manhattan for comparison
  Staten_Island = '085'  # Add Staten Island for completeness
)

# Function to pull data from the Census API
get_acs_data <- function(year, county_name, county_fips) {
  # Access the Census API to get ACS data for the specified year and county
  data <- census_api_call(variables, api_key, year, county_fips)  # Placeholder for actual API call
  data$year <- year
  data$borough <- county_name
  return(data)
}

# Collect data from 2015 to 2022
dfs <- list()  # Initialize an empty list to store data frames
for (year in 2015:2022) {
  for (borough in names(counties)) {
    county_fips <- counties[[borough]]
    tryCatch({
      df <- get_acs_data(year, borough, county_fips)
      dfs[[length(dfs) + 1]] <- df  # Append the data frame to the list
      cat("Successfully downloaded data for", borough, year, "\n")
    }, error = function(e) {
      cat("Error downloading data for", borough, year, ":", e$message, "\n")
    })
  }
}

# Combine all years of data into a single data frame
acs_df <- bind_rows(dfs)

# Rename columns for clarity
column_mapping <- c(
  B25064_001E = 'median_gross_rent',
  B19013_001E = 'median_income',
  B15003_001E = 'total_edu_pop',
  B15003_017E = 'high_school_grad',
  B15003_018E = 'some_college_1',
  B15003_019E = 'some_college_2',
  B15003_020E = 'assoc_degree_1',
  B15003_021E = 'assoc_degree_2',
  B15003_022E = 'bachelor',
  B15003_023E = 'masters',
  B15003_024E = 'professional',
  B15003_025E = 'doctorate',
  B03002_001E = 'total_pop',
  B03002_003E = 'white',
  B03002_004E = 'black',
  B03002_006E = 'asian',
  B03002_012E = 'hispanic',
  B25003_001E = 'total_units',
  B25003_002E = 'owner_occupied',
  B17001_001E = 'poverty_universe',
  B17001_002E = 'below_poverty',
  B23025_003E = 'labor_force',
  B23025_005E = 'unemployed',
  
  # New variables for gentrification analysis
  B25077_001E = 'median_home_value',
  B25071_001E = 'rent_as_pct_income',
  B19083_001E = 'gini_index',
  B25106_001E = 'housing_units_total',
  B25106_024E = 'owner_cost_burdened',
  B25106_045E = 'renter_cost_burdened',
  B25034_001E = 'housing_age_total',
  B25034_002E = 'housing_built_2010_later',
  B25034_003E = 'housing_built_2000_2009',
  B25002_001E = 'housing_total',
  B25002_003E = 'vacant_units',
  B01001_001E = 'pop_total',
  B01001_011E = 'males_25_34_1',
  B01001_012E = 'males_25_34_2',
  B01001_035E = 'females_25_34_1',
  B01001_036E = 'females_25_34_2',
  C24010_001E = 'employed_pop',
  C24010_003E = 'prof_male',
  C24010_039E = 'prof_female',
  B08301_001E = 'commuters_total',
  B08301_010E = 'public_transit',
  B08301_019E = 'walk_to_work',
  B08301_020E = 'other_commute',
  B08301_021E = 'work_from_home',
  B25038_001E = 'tenure_total',
  B25038_003E = 'recent_owners',
  B25038_010E = 'recent_renters'
)

# Rename the columns in the data frame
acs_df <- rename(acs_df, !!!column_mapping)

# Convert relevant columns to numeric
acs_df <- acs_df %>% mutate(across(everything(), ~ as.numeric(.), .names = "numeric_{col}"))

# Calculate derived metrics
acs_df <- acs_df %>%
  mutate(
    less_than_hs = total_edu_pop - (high_school_grad + some_college_1 + some_college_2 + 
                                      assoc_degree_1 + assoc_degree_2 + bachelor + 
                                      masters + professional + doctorate),
    pct_less_than_hs = less_than_hs / total_edu_pop,
    pct_hs_grad = high_school_grad / total_edu_pop,
    pct_some_college = (some_college_1 + some_college_2 + assoc_degree_1 + assoc_degree_2) / total_edu_pop,
    pct_college_plus = (bachelor + masters + professional + doctorate) / total_edu_pop,
    pct_white = white / total_pop,
    pct_black = black / total_pop,
    pct_asian = asian / total_pop,
    pct_hispanic = hispanic / total_pop,
    pct_other = 1 - (pct_white + pct_black + pct_asian + pct_hispanic),
    homeownership_rate = owner_occupied / total_units,
    poverty_rate = below_poverty / poverty_universe,
    unemployment_rate = unemployed / labor_force
  )

# New gentrification metrics
acs_df <- acs_df %>%
  mutate(
    young_adult_rate = (males_25_34_1 + males_25_34_2 + females_25_34_1 + females_25_34_2) / pop_total,
    professional_rate = (prof_male + prof_female) / employed_pop,
    alt_commute_rate = (public_transit + walk_to_work + other_commute + work_from_home) / commuters_total,
    new_housing_rate = (housing_built_2010_later + housing_built_2000_2009) / housing_age_total,
    recent_move_rate = (recent_owners + recent_renters) / tenure_total,
    vacancy_rate = vacant_units / housing_total,
    cost_burden_rate = (owner_cost_burdened + renter_cost_burdened) / housing_units_total
  )

# Create a composite gentrification index
acs_df <- acs_df %>%
  mutate(
    gentrification_pressure = (pct_college_plus +
                               young_adult_rate +
                               professional_rate +
                               (median_income / ave(median_income, by = tract)) +
                               (median_home_value / ave(median_home_value, by = tract)) +
                               (median_gross_rent / ave(median_gross_rent, by = tract)) -
                               poverty_rate) / 7  # Normalize
  )

# Define displacement risk index
acs_df <- acs_df %>%
  mutate(
    displacement_risk = (rent_as_pct_income / 100 + 
                         cost_burden_rate + 
                         (1 - homeownership_rate) + 
                         poverty_rate + 
                         (1 - pct_college_plus)) / 5  # Normalize
  )

# Save the processed DataFrame to a CSV file
write_csv(acs_df, 'census_data_2015_2022_gentrification.csv')

Assignment’s codes pipeline

#load necessary packages
rq_packages <- c("ada", "caret", "corrplot", "dplyr", "GGally", "ggcorrplot", 
                 "ggplot2", "gridExtra", "glmnet", "knitr", "mice", "neuralnet", "naivebayes", "naniar", "nnet", "pROC",
                 "randomForest", "rpart", "rpart.plot", "ROSE", "rsample", 
                 "sf", "keras", "kableExtra", "scales", "smotefamily", "tigris", "tidyverse", 
                 "tidyr", "tfdatasets", "tensorflow", "pROC", "tmap", "yardstick", "randomForest", "rpart", "xgboost","VIM")

#install and load packages
for (pkg in rq_packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}


# read the CSV file
df <- read.csv('census_data_2015_2022_gentrification.csv')
# display the first 15 rows
head(df, 15)


# structure and summary
str(df)
summary(df)

# Convert year, borough, etc., to factors if needed
df$year <- as.factor(df$year)
df$borough <- as.factor(df$borough)

# Rename columns for consistency
df <- df %>%
  rename_with(~ gsub("\\s+", "_", .x)) %>%
  rename_with(tolower)
# colnames(df)


###Check for missing 
#count missing or zero values per variable
df %>%
  summarise(across(everything(), ~ sum(is.na(.) | . == 0))) %>%
  gather(variable, missing_or_zero)


#low missing <5% mena median imput
low_na_cols <- c("median_income", "median_gross_rent", "median_home_value")

for (col in low_na_cols) {
  df[[col]][df[[col]] == 0 | is.na(df[[col]])] <- median(df[[col]][df[[col]] != 0], na.rm = TRUE)
}

#moderate missing (5%–30%) → KNN Imputation

cols_for_knn <- c("professional", "recent_move_rate", "housing_built_2010_later", 
                  "vacancy_rate", "renter_cost_burdened", "recent_owners", 
                  "recent_renters", "professional_rate", "alt_commute_rate")

df[cols_for_knn] <- VIM::kNN(df[cols_for_knn], k = 5, imp_var = FALSE)




############################################################################################
#futures engineering  & vector of variables to standardize
############################################################################################

z_vars <- c(
  "median_income", "professional_rate", "pct_college_plus",
  "pct_white", "young_adult_rate",
  "median_gross_rent", "median_home_value", "cost_burden_rate", "vacancy_rate",
  "recent_move_rate", "recent_renters", "recent_owners",
  "new_housing_rate", "housing_built_2010_later",
  "homeownership_rate"  
)


# standardize in long format (better memory handling)
df_z <- df %>%
  arrange(geo_id, year) %>%
  group_by(geo_id) %>%
  mutate(across(all_of(z_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
  ungroup()

# adjust vacancy rate sign (negative z-score)
df_z <- df_z %>%
  mutate(z_vacancy_rate = -z_vacancy_rate)


# create gentrification index from the newly computed z-scores
df_z <- df_z %>%
  rowwise() %>%
  mutate(gentrification_index = mean(c_across(starts_with("z_")), na.rm = TRUE)) %>%
  ungroup()

# bin into quantiles (efficient post-processing step)
df_z <- df_z %>%
  mutate(gentrification_quantile = ntile(gentrification_index, 5))



#EDA ############################################################################################
# Visualize distribution
ggplot(df_z, aes(x = gentrification_index)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Gentrification Index", x = "Index", y = "Count")

# Boxplot by borough
ggplot(df_z, aes(x = borough, y = gentrification_index, fill = borough)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Gentrification Index by Borough", x = "Borough", y = "Index") +
  theme(legend.position = "none")


# Line plot over time by borough
df_z %>%
  group_by(year, borough) %>%
  summarise(mean_index = mean(gentrification_index, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = mean_index, color = borough, group = borough)) +
  geom_line(size = 1.2) +
  labs(title = "Gentrification Trends by Borough", x = "Year", y = "Mean Index") +
  theme_minimal()




############################################################################################
#preparing data for models 
############################################################################################

# yearly change 
df_temporal <- df_z %>%
  arrange(geo_id, year) %>%
  group_by(geo_id) %>%
  mutate(
    # 3-year changes for key metrics
    change_3yr_income = median_income - lag(median_income, 3),
    change_3yr_rent = median_gross_rent - lag(median_gross_rent, 3),
    change_3yr_home_value = median_home_value - lag(median_home_value, 3),
    change_3yr_college = pct_college_plus - lag(pct_college_plus, 3),
    change_3yr_white = pct_white - lag(pct_white, 3),
    change_3yr_vacancy = vacancy_rate - lag(vacancy_rate, 3),
    
    # Growth rates (percentage changes) for key metrics
    pct_change_income = (median_income / lag(median_income, 3) - 1) * 100,
    pct_change_rent = (median_gross_rent / lag(median_gross_rent, 3) - 1) * 100,
    pct_change_home_value = (median_home_value / lag(median_home_value, 3) - 1) * 100,
    
    # Rate of change compared to borough average (relative changes)
    rel_income_growth = pct_change_income / mean(pct_change_income, na.rm = TRUE),
    rel_rent_growth = pct_change_rent / mean(pct_change_rent, na.rm = TRUE),
    rel_home_value_growth = pct_change_home_value / mean(pct_change_home_value, na.rm = TRUE)
  ) %>%
  ungroup()

#race metrics 

df_race <- df_temporal %>%
  mutate(
    # Diversity metrics
    diversity_index = 1 - ((pct_white^2 + (1-pct_white)^2)), # Simple diversity index
    
    # Race-economic interactions
    white_income_interaction = pct_white * median_income,
    white_college_interaction = pct_white * pct_college_plus,
    
    # Racial change momentum (acceleration of demographic shifts)
    white_change_acceleration = change_3yr_white - lag(change_3yr_white, 1),
    
    # Economic-demographic pressure metrics
    displacement_risk = cost_burden_rate * (1 - pct_white) * (1 / median_income),
    gentrification_pressure = pct_change_rent * change_3yr_white * young_adult_rate
  )


#z-temporal and demographic shift variables
z_temporal_vars <- c(
  "change_3yr_rent",
  "pct_change_rent",
  "pct_change_income",
  "change_3yr_white",
  "professional_rate",
  "young_adult_rate",
  "cost_burden_rate",
  "median_income",
  "recent_renters",
  "homeownership_rate"
)

df_race <- df_race %>%
  group_by(year) %>%  # or geo_id depending on context, but year is better for city-wide scale
  mutate(across(all_of(z_temporal_vars), ~ as.numeric(scale(.)), .names = "z_{.col}")) %>%
  ungroup()



# displacement composite indicator
df_features <- df_race %>%
  rowwise() %>%
  mutate(
    # Improved displacement risk index - combines economic vulnerability with demographic stability
    displacement_vulnerability_index = mean(c(
      z_cost_burden_rate,
      -z_median_income,
      z_recent_renters,
      -z_homeownership_rate,
      z_change_3yr_rent
    ), na.rm = TRUE),
    
    # Improved gentrification momentum index - combines rate of change metrics
    gentrification_momentum = mean(c(
      z_pct_change_rent,
      z_pct_change_income, 
      z_change_3yr_white,
      z_professional_rate,
      z_young_adult_rate
    ), na.rm = TRUE)
  ) %>%
  ungroup()

# removing missing values in the new features
#  list of newly created features
new_features <- setdiff(names(df_features), names(df_z))

# temporal features have NAs for early years
# impute them with mean by borough and year for better contextual accuracy
df_features <- df_features %>%
  group_by(borough, year) %>%
  mutate(across(all_of(new_features), ~ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
  ungroup()

# if any NAs remain, use overall means
df_features <- df_features %>%
  mutate(across(all_of(new_features), ~ifelse(is.na(.) | is.infinite(.) | is.nan(.), 
                                              mean(.[is.finite(.)], na.rm = TRUE), .)))



# gentrification classes for classification
df_features <- df_features %>%
  mutate(
    # Create a binary target for classification (high vs low gentrification)
    gentrification_binary = ifelse(gentrification_quantile >= 4, "high", "low"),
    
    # Create a 3-level target for more nuanced classification
    gentrification_level = case_when(
      gentrification_quantile <= 2 ~ "low",
      gentrification_quantile == 3 ~ "medium",
      gentrification_quantile >= 4 ~ "high"
    ),
    
    # Create a continuous target for regression tasks
    gentrification_continuous = gentrification_index
  )

# convert factor variables for modeling 
df_features$gentrification_binary <- as.factor(df_features$gentrification_binary)
df_features$gentrification_level <- as.factor(df_features$gentrification_level)





# filter out training if zscore has missing  
modeling_data <- df_features %>%
  filter(!is.na(gentrification_index), !is.na(gentrification_momentum)) %>%
  select(-geo_id)


#futures selection for training and test 
############################################################################################
#crucial predictors for gen
predictors <- c(
  # common features related to gen
  "median_income", "professional_rate", "pct_college_plus", "pct_white", 
  "young_adult_rate", "median_gross_rent", "median_home_value", "cost_burden_rate",
  
  # temporal features
  "change_3yr_income", "change_3yr_rent", "change_3yr_home_value", 
  "pct_change_income", "pct_change_rent",
  
  # race-aware features
  "diversity_index", "white_income_interaction", "white_change_acceleration",
  
  # Advanced indices
  "displacement_vulnerability_index", "gentrification_momentum"
)




set.seed(123)
train_index <- createDataPartition(modeling_data$gentrification_binary, p = 0.7, list = FALSE)
train_data <- modeling_data[train_index, ]
test_data <- modeling_data[-train_index, ]

# ensure there are no missing in the predictors for training
train_data_clean <- train_data %>%
  filter(if_all(all_of(predictors), ~ !is.na(.)))


# use Random Forest for predict importance futures 
model_path <- "/Users/ahmhamza/Downloads/model/"
rf_importance <- readRDS(paste0(model_path, "rf_importance_model.rds"))

rf_importance <- randomForest(
  x = train_data_clean[, predictors],
  y = train_data_clean$gentrification_binary,
  importance = TRUE,
  ntree = 500
)

#create directory and save models 
#dir.create("Downloads/model")
#saveRDS(rf_importance, "Downloads/model/rf_importance_model.rds")
#for markdown 
#rf_importance_model <- readRDS("Downloads/model/rf_importance_model.rds")

# obs variable importance
importance_df <- as.data.frame(importance(rf_importance))
importance_df$Feature <- rownames(importance_df)
importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]

print(importance_df)

# Plot top 15 important features
ggplot(importance_df[1:15,], aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 15 Features for Predicting Gentrification", x = "Features", y = "Importance")



#predictors for modeling (keep all the futures)
important_predictors <- importance_df$Feature[1:13]
################################################################


############################################################################################
# Logistic Regression 
model_path <- "/Users/ahmhamza/Downloads/model/"
logistic_model <- readRDS(paste0(model_path, "logistic_model.rds"))

logistic_model <- caret::train(
  gentrification_binary ~ change_3yr_rent + change_3yr_white + gentrification_momentum + 
    displacement_vulnerability_index + pct_change_income + 
    white_income_interaction + diversity_index + white_change_acceleration +
    change_3yr_income + change_3yr_home_value,
  data = train_data,
  method = "glm",
  family = "binomial",
  metric = "ROC",  # use ROC if you're using twoClassSummary
  trControl = caret::trainControl(
    method = "cv", 
    number = 5,
    classProbs = TRUE,
    summaryFunction = caret::twoClassSummary
  )
)

# saveRDS(logistic_model, "Downloads/model/logistic_model.rds")
# logistic_model <- readRDS("Downloads/model/logistic_model.rds.")

# summary of the model
summary(logistic_model)

# Probabilities
pred_probs <- predict(logistic_model, newdata = test_data, type = "prob")
# Class labels
pred_classes <- predict(logistic_model, newdata = test_data)
summary(pred_probs)


#scaling for RF and NN 
preproc <- preProcess(train_data[, important_predictors], method = c("center", "scale"))
train_data_scaled <- predict(preproc, train_data[, important_predictors])
test_data_scaled <- predict(preproc, test_data[, important_predictors])

# Add the target variable back
train_data_scaled$gentrification_binary <- train_data$gentrification_binary
test_data_scaled$gentrification_binary <- test_data$gentrification_binary

#there were some NA in train and test 
colSums(is.na(train_data_scaled))
train_data_scaled <- na.omit(train_data_scaled)
colSums(is.na(train_data_scaled))

colSums(is.na(test_data_scaled))
test_data_scaled <- na.omit(test_data_scaled)
colSums(is.na(test_data_scaled))





#running a another RF wiht XGboost 
# Random Forest Model

model_path <- "/Users/ahmhamza/Downloads/model/"
rf_model <- readRDS(paste0(model_path, "rf_model.rds"))

rf_model <- caret::train(
  gentrification_binary ~ .,
  data = train_data_scaled,
  method = "rf",
  metric = "Accuracy",
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary
  ),
  ntree = 500
)
# saveRDS(rf_model, "Downloads/model/rf_model.rds")
# rf_model <- readRDS("Downloads/model/rf_model.rds")

summary(rf_model)
print(rf_model)


model_path <- "/Users/ahmhamza/Downloads/model/"
xgb_model <- readRDS(paste0(model_path, "xgb_model.rds"))

# XGBoost Model
xgb_model <- caret::train(
  gentrification_binary ~ .,
  data = train_data_scaled,
  method = "xgbTree",
  metric = "Accuracy",
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary
  )
)

#saveRDS(xgb_model, "Downloads/model/xgb_model.rds")
#xgb_model <- readRDS("Downloads/model/xgb_model.rds")


#NN & deep learning preparation and modeling 
# Neural Network with optimal hyperparameters
# Perform grid search for hyperparameter tuning
nn_grid <- expand.grid(size = c(5, 10, 15),  decay = c(0.001, 0.01, 0.1))


model_path <- "/Users/ahmhamza/Downloads/model/"
nn_model <- readRDS(paste0(model_path, "nn_model.rds"))

nn_model <- caret::train(
  gentrification_binary ~ .,
  data = train_data_scaled,
  method = "nnet",
  metric = "Accuracy",
  tuneGrid = nn_grid,
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary
  ),
  maxit = 1000,
  trace = FALSE
)
#saveRDS(nn_model, "Downloads/model/nn_model.rds")
#nn_model <- readRDS("Downloads/model/nn_model.rds")


# print the best tuning parameters
print(nn_model$bestTune)

# Predictions on test set
nn_pred <- predict(nn_model, newdata = test_data_scaled)
nn_prob <- predict(nn_model, newdata = test_data_scaled, type = "prob")

# Confusion Matrix for Neural Network
nn_cm <- confusionMatrix(nn_pred, test_data_scaled$gentrification_binary)
print(nn_cm)

# ROC curve
nn_roc <- roc(test_data_scaled$gentrification_binary, nn_prob[,"high"])
plot(nn_roc, main = "ROC Curve - Neural Network")
auc(nn_roc)


############################################################################################
#comparing models and evaluation 
############################################################################################
# generate predictions for all models
models_list <- list(
  "Logistic Regression" = logistic_model,
  "Neural Network" = nn_model,
  "Random Forest" = rf_model,
  "XGBoost" = xgb_model
)

# function to evaluate model performance
evaluate_model <- function(model, test_data, model_name) {
  pred <- predict(model, newdata = test_data)
  prob <- predict(model, newdata = test_data, type = "prob")
  cm <- confusionMatrix(pred, test_data$gentrification_binary)
  roc_obj <- roc(test_data$gentrification_binary, prob[, "high"])
  auc_value <- auc(roc_obj)
  data.frame(
    Model = model_name,
    Accuracy = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    F1_Score = cm$byClass["F1"],
    AUC = as.numeric(auc_value)
  )
}

# evaluate all models
model_performance <- bind_rows(
  evaluate_model(logistic_model, test_data, "Logistic Regression"),
  evaluate_model(rf_model, test_data_scaled, "Random Forest"),
  evaluate_model(xgb_model, test_data_scaled, "XGBoost"),
  evaluate_model(nn_model, test_data_scaled, "Neural Network"),
)

# print combined performance table
kable(model_performance, caption = "Model Performance on Test Data")

# visualization in comparison 

# bar plot of AUC
auc_plot <- ggplot(model_performance, aes(x = Model, y = AUC, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(AUC, 3)), vjust = -0.3) +
  labs(title = "AUC Comparison of Models", y = "AUC") +
  theme_minimal() +
  theme(legend.position = "none")


# bar plot of Accuracy
accuracy_plot <- ggplot(model_performance, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Accuracy, 3)), vjust = -0.3) +
  labs(title = "Accuracy Comparison of Models", y = "Accuracy") +
  theme_minimal() +
  theme(legend.position = "none")


# combine AUC and Accuracy plots
grid.arrange(auc_plot, accuracy_plot, ncol = 2)


# essential futures from coefficient (logistic model coefficient)

if (nrow(summary(logistic_model)$coefficients) > 1) {
  coef_df_lr <- as.data.frame(summary(logistic_model)$coefficients[-1, ]) %>%
    mutate(Feature = rownames(.)) %>%
    arrange(desc(abs(Estimate))) %>%
    head(10) # Top 10 coefficients
  
  coef_plot_lr <- ggplot(coef_df_lr, aes(x = reorder(Feature, Estimate), y = Estimate, fill = Estimate > 0)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme_minimal() +
    labs(title = "Top 10 Coefficients in Logistic Regression", x = "Feature", y = "Coefficient") +
    scale_fill_manual(name = "Direction", values = c("FALSE" = "salmon", "TRUE" = "lightgreen"))
  
  print(coef_plot_lr)
} else {
  print("Logistic Regression model might not have converged properly or has no significant predictors.")
}



# practical model insights 


# identify rows with no missing values in the predictors used for the best model
complete_rows <- complete.cases(modeling_data[, important_predictors])

# a dataframe with only the complete cases for prediction
model_df_for_prediction <- modeling_data[complete_rows, ]

# best model based on a chosen metric 
best_model_name <- model_performance %>% arrange(desc(AUC)) %>% slice_head(n = 1) %>% pull(Model)
best_model <- models_list[[best_model_name]]
cat("Best Model:", best_model_name, "\n")

# predictions only on complete cases
prediction_results <- predict(best_model, newdata = model_df_for_prediction, type = "prob")
model_df_for_prediction$gentrification_probability <- prediction_results[, "high"]

# a temporary ID in the complete cases dataframe for merging
model_df_for_prediction$temp_id <- 1:nrow(model_df_for_prediction)

# a temporary ID in the original modeling_data dataframe for merging the complete cases back
modeling_data$temp_id <- ifelse(complete_rows, 1:sum(complete_rows), NA)

# merge predictions back to the original modeling_data dataframe based on the temporary ID
model_df <- modeling_data %>%
  left_join(model_df_for_prediction %>% select(temp_id, gentrification_probability), by = "temp_id")

# remove the temporary ID
model_df$temp_id <- NULL

#  risk categories based on gentrification probability

model_df <- model_df %>%
  mutate(risk_category = case_when(
    is.na(gentrification_probability) ~ "Missing Data",
    gentrification_probability >= 0.75 ~ "Very High Risk",
    gentrification_probability >= 0.5 ~ "High Risk",
    gentrification_probability >= 0.25 ~ "Moderate Risk",
    TRUE ~ "Low Risk"
  ))


# # Summarize by borough and risk category
# borough_summary <- model_df %>%
#   group_by(borough, risk_category) %>%
#   summarise(
#     n_tracts = n(),
#     avg_income = mean(median_income, na.rm = TRUE),
#     avg_rent = mean(median_gross_rent, na.rm = TRUE),
#     avg_home_value = mean(median_home_value, na.rm = TRUE),
#     avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
#     avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
#     .groups = "drop"
#   )


# view summary table
# kable(borough_summary, digits = 2, caption = "Census Tract Counts and Averages by Borough and Gentrification Risk")


model_df$years <- "2022"

borough_summary <- model_df %>%
  mutate(years = "2022") %>%  # or year = 2022
  group_by(borough, risk_category, years) %>%
  summarise(
    n_tracts = n(),
    avg_income = mean(median_income, na.rm = TRUE),
    avg_rent = mean(median_gross_rent, na.rm = TRUE),
    avg_home_value = mean(median_home_value, na.rm = TRUE),
    avg_college_rate = mean(pct_college_plus, na.rm = TRUE),
    avg_poverty_rate = mean(poverty_rate, na.rm = TRUE),
    .groups = "drop"
  )

borough_summary <- borough_summary %>%
  mutate(
    avg_income = round(avg_income),
    avg_rent = round(avg_rent),
    avg_home_value = round(avg_home_value),
    avg_college_rate = round(avg_college_rate * 100, 1),
    avg_poverty_rate = round(avg_poverty_rate * 100, 1)
  )


borough_summary %>%
  kable(digits = 2, caption = "Census Tract Counts and Averages by Borough, Risk Category, and Year") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))






# visualize risk category distribution by borough
ggplot(borough_summary, aes(x = borough, y = n_tracts, fill = risk_category)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Distribution of Gentrification Risk Categories by Borough",
       x = "Borough", y = "Number of Census Tracts",
       fill = "Gentrification Risk") +
  theme_minimal() +
  scale_fill_brewer(palette = "RdYlGn", direction = -1)






# top 20 census tracts with highest gentrification probability
top_risk_tracts <- model_df %>%
  arrange(desc(gentrification_probability)) %>%
  select(tract, borough, gentrification_probability, risk_category,
         median_income, median_gross_rent, median_home_value, 
         pct_college_plus, poverty_rate) %>%
  head(20)

# view top risk tracts
kable(top_risk_tracts, digits = 3, caption = "Top 20 Census Tracts with Highest Gentrification Probability")

# Target Policy Analysis
model_df <- model_df %>%
  mutate(
    displacement_risk_score = (poverty_rate + cost_burden_rate) / 2,
    displacement_risk_category = case_when(
      displacement_risk_score >= 30 ~ "High",
      displacement_risk_score >= 20 ~ "Medium",
      TRUE ~ "Low"
    )
  )

# create policy priority matrix
model_df <- model_df %>%
  mutate(
    policy_priority = case_when(
      risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "High" ~ "Urgent Intervention",
      risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Medium" ~ "High Priority",
      risk_category == "Moderate Risk" & displacement_risk_category == "High" ~ "High Priority",
      risk_category %in% c("Very High Risk", "High Risk") & displacement_risk_category == "Low" ~ "Proactive Planning",
      risk_category == "Moderate Risk" & displacement_risk_category == "Medium" ~ "Proactive Planning",
      risk_category == "Low Risk" & displacement_risk_category == "High" ~ "Community Support",
      TRUE ~ "Monitor"
    )
  )

# count tracts by borough and policy priority
policy_counts <- model_df %>%
  group_by(borough, policy_priority) %>%
  summarise(n_tracts = n(), .groups = "drop") %>%
  spread(policy_priority, n_tracts, fill = 0)



# view policy counts
kable(policy_counts, caption = "Census Tracts by Borough and Policy Priority")

# visualize policy priorities
model_df %>%
  count(borough, policy_priority) %>%
  ggplot(aes(x = borough, y = n, fill = policy_priority)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Policy Intervention Priorities by Borough",
       x = "Borough", y = "Number of Census Tracts",
       fill = "Policy Priority") +
  theme_minimal() +
  scale_fill_manual(values = c("Urgent Intervention" = "#d73027",
                               "High Priority" = "#fc8d59",
                               "Proactive Planning" = "#fee090",
                               "Community Support" = "#e0f3f8",
                               "Monitor" = "#91bfdb"))