Load Data

Here we load and clean the data for our final model. We use the same data setup as before, except we keep xcoord and ycoord for a heat map of the residuals. Exploration suggests XGBoost can't treat coordinates the way a geospatial model would, but it can pick up on location interactions by using the coordinates as ordinary numeric features.

# Check out packages
library(tidyverse)
library(dplyr)
library(xgboost)
library(Matrix)
library(caret)
library(pdp)
library(vip)

# Load data
df1 <- readRDS("~/Documents/D698/df.rds")


Repeat Prior Data Cleanup

We'll remove the same 37 records (those missing coordinates), drop the three columns with significant missingness, and recode the NAs that remain.

df2 <- df1 %>%
  filter(!is.na(xcoord))

df3 <- df2 %>%
  dplyr::select(-CONDO_Number, -EASEMENT, -condono)

df4 <- df3 %>%
  mutate(
    firm07_flag  = replace_na(firm07_flag, 0),
    pfirm15_flag = replace_na(pfirm15_flag, 0),
    landmark     = replace_na(landmark, "None"),
    histdist     = replace_na(histdist, "None"),
    spdist1      = replace_na(spdist1, "None"),
    CORNER       = case_when(
      CORNER %in% c("CR", "NE", "SW", "NW", "SE") ~ "Yes",
      is.na(CORNER) | CORNER == "00" ~ "No"
    ),
    ltdheight    = replace_na(ltdheight, "None"),
    LOT_IRREG    = replace_na(LOT_IRREG, "Unknown"),
    BLD_EXT      = replace_na(BLD_EXT, "N"),
    landuse      = replace_na(landuse, "Unknown")
  )
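
As a quick sanity check (our addition, not part of the original pipeline), we can confirm the recoding left no NAs behind. Note that case_when() returns NA for any CORNER code not covered by its two conditions, so a nonzero count here would flag a code that fell through.

# Sanity check: the recoded columns should contain no NAs
recoded_cols <- c("firm07_flag", "pfirm15_flag", "landmark", "histdist", "spdist1",
                  "CORNER", "ltdheight", "LOT_IRREG", "BLD_EXT", "landuse")
colSums(is.na(df4[recoded_cols]))  # all entries should be 0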


Exclude Columns

We exclude the same columns as in the prior linear model, except we keep BLOCK and LOT for identification and xcoord and ycoord for the heat map. We also exclude ASP, which expands the usable records from ~2,000 to ~13,300.

# Define the columns to exclude
exclude_cols_lrmodel <- c("bbl", "PARID", "bctcb2020", "bct2020", "latitude", "longitude", "BORO", "YEAR", "COOP_APTS", "firm07_flag",  "pfirm15_flag", "LOT_DEP", "FINTAXCLASS", "ZIP_CODE", "LOT_FRT", "BLD_FRT", "BLD_DEP", "cd", "council", "policeprct", "healthcenterdistrict", "healtharea", "spdist1", "firecomp", "zonedist1", "bldgclass", "landuse", "tract2010", "RESIDENTIAL_AREA_GROSS", "residfar", "commfar", "facilfar", "ASP")

df5_xgb <- df4 %>%
  dplyr::select(-all_of(exclude_cols_lrmodel)) %>%
  mutate(across(where(is.character), as.factor))


Rotate NY State Plane

The Manhattan street grid is tilted about 29° east of true north. To improve our map we rotate the coordinate system to "Manhattan-North". Because XGBoost trees split on one feature (axis) at a time, this lets the natural borders of street-divided blocks in the city grid be captured by simple axis-aligned splits.

# Rotate (and optionally scale) x/y into a "Manhattan grid" frame
rotate_xy <- function(x, y, angle_deg, scale_x = 1, scale_y = 1, center = TRUE) {
  theta <- angle_deg * pi/180
  R <- matrix(c(cos(theta), -sin(theta),
                sin(theta),  cos(theta)), 2, 2, byrow = TRUE)
  X <- cbind((x/scale_x), (y/scale_y))
  if (center) {
    ctr <- colMeans(X, na.rm = TRUE)
    X <- sweep(X, 2, ctr, "-")
  }
  XYrot <- X %*% t(R)
  list(x = XYrot[,1], y = XYrot[,2])
}

ang <- 29
rot <- rotate_xy(df5_xgb$xcoord, df5_xgb$ycoord, angle_deg = ang)

df5_xgb$x_manh <- rot$x
df5_xgb$y_manh <- rot$y
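
Because this is a rigid rotation (after centering), it should preserve distances between properties; a minimal sanity check of our own:

# Sanity check: rotation should preserve the distance between the first two lots
d_before <- dist(cbind(df5_xgb$xcoord[1:2], df5_xgb$ycoord[1:2]))
d_after  <- dist(cbind(df5_xgb$x_manh[1:2], df5_xgb$y_manh[1:2]))
all.equal(as.numeric(d_before), as.numeric(d_after))  # should be TRUE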



Prepare for XGBoost

Here we convert our data into a DMatrix, which is XGBoost’s optimized data structure to improve training speed and memory efficiency.


Estimated Market Value per Square Foot

In earlier iterations of the XGBoost model we saw significant leakage from the variable of top importance, GROSS_SQFT, so we switched the target to estimated market value per square foot.

# Create target: Estimated Market Value per Square foot
df_clean <- df5_xgb %>%
  mutate(mv_per_sqft = FINMKTTOT / GROSS_SQFT) %>%
  # remove rows that break division/log
  filter(is.finite(FINMKTTOT),
         is.finite(GROSS_SQFT),
         GROSS_SQFT > 0,
         is.finite(mv_per_sqft),
         mv_per_sqft > 0)


Build DMatrix

Here we prepare our data to be read in an efficient way by the XGBoost model.

# Build model frame without IDs & leakage cols
leak_cols <- c("mv_per_sqft","FINMKTTOT","GROSS_SQFT","BLOCK","LOT", "xcoord", "ycoord")
model_cols <- setdiff(names(df_clean), leak_cols)
df_model <- df_clean %>%
  select(all_of(c("mv_per_sqft", model_cols))) %>%
  drop_na()

# Align id_vars to df_model (no leakage); assumes data.frame rownames survive the pipeline
id_vars <- df_clean %>%
  select(BLOCK, LOT, x_manh, y_manh) %>%
  dplyr::slice(match(rownames(df_model), rownames(df_clean)))
stopifnot(nrow(id_vars) == nrow(df_model))

# Design matrix / labels
X <- sparse.model.matrix(mv_per_sqft ~ . - 1, data = df_model)
y <- log(df_model$mv_per_sqft)

# Build DMatrix
dtrain <- xgb.DMatrix(data = X, label = y, missing = NA)



Fit XGBoost

Here we fit a basic model and observe a healthy learning curve: the Root Mean Squared Error (RMSE) decreases steadily, reaching a low of 0.269 in log space. That corresponds to an average prediction error of about 31% in real (non-log) market values, computed as exp(RMSE) - 1.
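
As a quick worked example of that conversion:

# Log-space RMSE to approximate multiplicative error
exp(0.269) - 1  # ~0.309, i.e. about a 31% typical error in dollar terms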

Variable Importance
Recent iterations of the model relied heavily on a single feature, y_manh (the north-south coordinate, tilted to align with the Manhattan street grid). We constrained the model with L1/L2 regularization and a reduced fraction of columns sampled per tree, but y_manh still dominates the variable importance. We show later that this is a strength of the model, not a weakness.

Average Sale Price Aside

This is a significant increase in multiplicative error from the 11% we saw using the smaller data set, where only ~14% of records had average sale price (ASP) information. Most likely the prior model's performance was inflated and would not have generalized: with a smaller dataset there was less complexity to model with the same number of fields plus ASP, so it was overfit. ASP did not correlate strongly with FINMKTTOT, but it may have acted like a shortcut feature, letting the model explain far more than it should.

It could also be that the DOF does consult recent sale data when calculating taxes, even if it's not part of their disclosed process, to accelerate increases for buildings with higher recent sales. A future exploration could rerun the 1,923 records that have ASP twice, comparing how the model performs with and without the ASP data.
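
A minimal sketch of that comparison, assuming ASP is still present in df4 and modeling log market value directly for simplicity (the cv_rmse helper and this subsetting are our assumptions, not the original analysis):

# Hypothetical sketch: CV error on the ~1,923 ASP records, with vs. without ASP
df_asp <- df4 %>% filter(!is.na(ASP))   # assumes ASP is still a column of df4

cv_rmse <- function(dat) {
  dat <- tidyr::drop_na(dat)
  Xs  <- sparse.model.matrix(log(FINMKTTOT) ~ . - 1, data = dat)
  cv  <- xgb.cv(params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 6),
                data = xgb.DMatrix(Xs, label = log(dat$FINMKTTOT)),
                nrounds = 100, nfold = 10, verbose = 0)
  tail(cv$evaluation_log$test_rmse_mean, 1)  # final cross-validated RMSE
}

cv_rmse(df_asp)                   # with ASP
cv_rmse(df_asp %>% select(-ASP))  # without ASP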

set.seed(1751753275)
xgb_model <- xgboost(
  data = dtrain,
  objective = "reg:squarederror",  # for regression
  nrounds = 100,
  eta = 0.1,
  max_depth = 6,
  subsample = 0.8, # like bagging, fraction of rows sampled for each tree
  colsample_bytree = 0.8, # fraction of columns sampled for each tree
  lambda = 1.0, # L2 Ridge
  alpha = 0.1, # L1 Lasso
  verbose = 0
)

preds <- predict(xgb_model, newdata = dtrain)
y_true <- getinfo(dtrain, "label")
final_rmse <- sqrt(mean((y_true - preds)^2))
final_rmse
## [1] 0.2690246


Calculate Training R²

We manually calculate a training R² of 0.8104, compared to 0.7896 in the final logged linear model.

Note this R² is not held out: it is computed on the training data and will overstate performance on new data.

preds_xgb <- predict(xgb_model, newdata = X)
r2_xgb <- 1 - sum((y - preds_xgb)^2) / sum((y - mean(y))^2)
r2_xgb
## [1] 0.8104825
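
For a fairer read, a sketch of a held-out R² using a simple 80/20 split (this split and refit are our addition, not part of the original analysis):

# Sketch: held-out R^2 from an 80/20 split of the same design matrix
set.seed(42)
idx <- sample(seq_len(nrow(X)), size = floor(0.8 * nrow(X)))
fit <- xgb.train(params = list(objective = "reg:squarederror", eta = 0.1,
                               max_depth = 6, subsample = 0.8,
                               colsample_bytree = 0.8, lambda = 1.0, alpha = 0.1),
                 data = xgb.DMatrix(X[idx, ], label = y[idx]),
                 nrounds = 100)
pred_test <- predict(fit, xgb.DMatrix(X[-idx, ], label = y[-idx]))
1 - sum((y[-idx] - pred_test)^2) / sum((y[-idx] - mean(y[-idx]))^2)  # held-out R^2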


Interpret the Model

The variable importance plot shows relative variable importance in the model. The primary driver is y_manh, the north-south coordinate tilted to align with the Manhattan street grid.

schooldist06 is the next most important variable. It carves out a very specific geography, Washington Heights and Inwood, which has both good and bad schools. My theory for why it's valuable: we later find that the west side tends to pay less in taxes than the east across Manhattan, but this region of northern Manhattan may reverse that trend by being better developed on the west side than the east, so the school-district carve-out gives the trees a useful split.

LAND_AREA also has a big impact. We saw this with StuyTown, the extreme outlier, where a large land area doesn't scale with how much tax is charged per unit. Since buildings are spaced further apart and there's an upper limit on property-tax affordability per unit, the extra open space in the tax lot isn't taxed as heavily as in an area with greater unit density.
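
A partial dependence plot for LAND_AREA, analogous to the location plots later in this section, would make that flattening visible (this block is our addition):

# Sketch: partial dependence of predicted log value per sqft on LAND_AREA
pdp_land <- partial(xgb_model, pred.var = "LAND_AREA",
                    train = as.matrix(X), grid.resolution = 30)
autoplot(pdp_land)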

Since y_manh dwarfs the other variables in importance, we need to look more closely. It may be that y_manh is used early in tree building to separate the data into big, easy-to-predict chunks, but that its importance comes from interaction with the other variables rather than from predicting estimated market value per square foot by itself.

vip::vip(xgb_model, num_features = 20)

# Get importance matrix
importance_matrix <- xgb.importance(model = xgb_model)

# View top variables
print(importance_matrix, 20)   # top 20
##                                                                Feature
##                                                                 <char>
##   1:                                                            y_manh
##   2:                                                      schooldist06
##   3:                                                         LAND_AREA
##   4:                                                           YRBUILT
##   5:                                                            x_manh
##   6:                                                          builtfar
##   7:                                                       FINTAXFLAGT
##   8:                                                      schooldist02
##   9:                                                            YRALT1
##  10:                                                             UNITS
##  11:                                                      BLDG_CLASSC7
##  12:                                                            YRALT2
##  13:                                                        ZONINGR7-2
##  14:                                                      BLDG_CLASSC1
##  15:                                                         BLD_STORY
##  16:                                                 OFFICE_AREA_GROSS
##  17:                                                         CORNERYes
##  18:                                                      histdistNone
##  19:                                                      BLDG_CLASSD3
##  20:                                                      BLDG_CLASSD1
##  ---                                                                  
## 106:                                                          BLD_EXTG
## 107:          histdistGreenwich Village Historic District Extension II
## 108:     histdistUpper West Side / Central Park West Historic District
## 109:                                                        ZONINGM16D
## 110:                                                      BLDG_CLASSD0
## 111:                                                        ZONINGC64A
## 112:                                                        ZONINGC28A
## 113:                                                    ltdheightLH-1A
## 114:                                                        ZONINGC4-5
## 115:                                                        ZONINGC6-6
## 116:                                                        ZONINGC61G
## 117: histdistHamilton Heights / Sugar Hill Northwest Historic District
## 118:                                                        ZONINGC1-7
## 119:                                                        ZONINGC45D
## 120:                                                        ZONINGR10H
## 121:                                                        ZONINGC67T
## 122:                                                        ZONINGC5-5
## 123:                                                      ZONINGM15R7D
## 124:                                                         ZONINGR10
## 125:                                                      BLDG_CLASSS9
##              Gain        Cover    Frequency
##             <num>        <num>        <num>
##   1: 7.101895e-01 1.543289e-01 0.1224347015
##   2: 4.035271e-02 5.120328e-03 0.0013992537
##   3: 3.475170e-02 9.473181e-02 0.0981809701
##   4: 2.919019e-02 7.289802e-02 0.0776585821
##   5: 2.705178e-02 8.373175e-02 0.1112406716
##   6: 2.263130e-02 9.180566e-02 0.0979477612
##   7: 1.773582e-02 3.961333e-02 0.0156250000
##   8: 1.657335e-02 5.518370e-03 0.0048973881
##   9: 1.504934e-02 5.639425e-02 0.0680970149
##  10: 1.383123e-02 6.005781e-02 0.0680970149
##  11: 6.951507e-03 1.563815e-02 0.0156250000
##  12: 6.442223e-03 3.012650e-02 0.0230876866
##  13: 6.408744e-03 5.291123e-03 0.0090951493
##  14: 4.753378e-03 1.097008e-02 0.0118936567
##  15: 3.728537e-03 7.426294e-03 0.0279850746
##  16: 3.673379e-03 2.015487e-02 0.0233208955
##  17: 3.385792e-03 1.656659e-02 0.0125932836
##  18: 2.944740e-03 7.184375e-03 0.0046641791
##  19: 2.485359e-03 9.748685e-03 0.0088619403
##  20: 2.417644e-03 1.005966e-02 0.0104944030
##  ---                                       
## 106: 2.277856e-05 6.855703e-04 0.0002332090
## 107: 2.249169e-05 2.679130e-05 0.0002332090
## 108: 2.160370e-05 2.152872e-05 0.0004664179
## 109: 2.158195e-05 2.280450e-05 0.0004664179
## 110: 2.114265e-05 9.568322e-06 0.0009328358
## 111: 1.810081e-05 3.460543e-05 0.0002332090
## 112: 1.751745e-05 1.435248e-06 0.0002332090
## 113: 1.561272e-05 1.881770e-05 0.0004664179
## 114: 1.549372e-05 1.435248e-06 0.0002332090
## 115: 1.388980e-05 9.249378e-06 0.0002332090
## 116: 1.361087e-05 1.275776e-06 0.0002332090
## 117: 1.243241e-05 3.348913e-06 0.0002332090
## 118: 7.214904e-06 1.865823e-05 0.0002332090
## 119: 6.336697e-06 4.784161e-07 0.0002332090
## 120: 3.784088e-06 4.784161e-07 0.0002332090
## 121: 3.207042e-06 1.116304e-06 0.0002332090
## 122: 2.779584e-06 1.594720e-06 0.0004664179
## 123: 2.037295e-06 2.392080e-06 0.0002332090
## 124: 1.844418e-06 3.189441e-07 0.0002332090
## 125: 2.966401e-07 6.378881e-07 0.0002332090
# Or convert to a data frame if you want to save or copy
importance_df <- as.data.frame(importance_matrix)
head(importance_df, 20)
##              Feature        Gain       Cover   Frequency
## 1             y_manh 0.710189524 0.154328900 0.122434701
## 2       schooldist06 0.040352710 0.005120328 0.001399254
## 3          LAND_AREA 0.034751702 0.094731810 0.098180970
## 4            YRBUILT 0.029190188 0.072898015 0.077658582
## 5             x_manh 0.027051778 0.083731748 0.111240672
## 6           builtfar 0.022631295 0.091805657 0.097947761
## 7        FINTAXFLAGT 0.017735824 0.039613331 0.015625000
## 8       schooldist02 0.016573352 0.005518370 0.004897388
## 9             YRALT1 0.015049339 0.056394254 0.068097015
## 10             UNITS 0.013831229 0.060057805 0.068097015
## 11      BLDG_CLASSC7 0.006951507 0.015638146 0.015625000
## 12            YRALT2 0.006442223 0.030126500 0.023087687
## 13        ZONINGR7-2 0.006408744 0.005291123 0.009095149
## 14      BLDG_CLASSC1 0.004753378 0.010970081 0.011893657
## 15         BLD_STORY 0.003728537 0.007426294 0.027985075
## 16 OFFICE_AREA_GROSS 0.003673379 0.020154873 0.023320896
## 17         CORNERYes 0.003385792 0.016566593 0.012593284
## 18      histdistNone 0.002944740 0.007184375 0.004664179
## 19      BLDG_CLASSD3 0.002485359 0.009748685 0.008861940
## 20      BLDG_CLASSD1 0.002417644 0.010059655 0.010494403



Kicking the Tires

Here we confirm that the variable of greatest importance isn't leaking data, and gain some insight into how the coordinates tilted to the Manhattan grid work with the data.


Model on y_manh Alone

Here we run a linear model using y_manh alone and arrive at an R² of 0.05601. This confirms that y_manh by itself can't explain estimated market value, and that it is not leaking data into the model (the way gross square footage alone was).

This helps confirm that the preponderance of importance on y_manh is largely due to its interaction with other meaningful variables.

Since y_manh alone explains only about 5.6% of the variance, we next develop intuition for how the model uses location, and make sure its effect is smooth and generalized rather than jagged and overfit.

# How strong is y_manh alone?
summary(lm(log(FINMKTTOT) ~ y_manh, data = df5_xgb))  # R^2 with just y_manh
## 
## Call:
## lm(formula = log(FINMKTTOT) ~ y_manh, data = df5_xgb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1146 -0.7466 -0.2635  0.5486  5.5893 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.527e+01  9.501e-03  1607.4   <2e-16 ***
## y_manh      -1.658e-05  5.900e-07   -28.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.096 on 13310 degrees of freedom
## Multiple R-squared:  0.05601,    Adjusted R-squared:  0.05594 
## F-statistic: 789.8 on 1 and 13310 DF,  p-value: < 2.2e-16
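
To probe the interaction claim directly, xgboost can return per-row SHAP contributions (this check is our addition; predcontrib = TRUE is a standard predict() argument for xgboost boosters):

# Sketch: mean absolute SHAP contribution per feature; y_manh can rank high
# here while still relying on other features for its splits to pay off
shap <- predict(xgb_model, newdata = X, predcontrib = TRUE)
head(sort(colMeans(abs(shap)), decreasing = TRUE), 10)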


Location Intuition

By graphing the partial dependence plots for y_manh and x_manh we can develop intuition for how geography affects property taxes in Manhattan. These intuitions roughly track general perceptions of prestige by geography.

North-South

For Manhattan north-south, the highest taxes fall roughly between 3rd Street and 82nd Street, followed by a precipitous drop until about 116th Street.

# Partial dependence on y_manh
library(pdp)
pdp_y <- partial(xgb_model, pred.var = "y_manh", train = as.matrix(X), grid.resolution = 30)
autoplot(pdp_y)

East-West

For Manhattan east-west the plot is blockier, which is consistent with the more discrete, long east-west blocks. The highest taxes run roughly from just before 5th Avenue to Lexington, one short block east of the prestigious Park Avenue. To the east, property taxes drop off until 2nd Avenue. To the west, taxes drop precipitously between 5th and 6th, stay flat until 8th, drop precipitously again to 9th and slightly more to 10th, with a small uptick right along the Hudson River.

# Partial dependence on x_manh
pdp_x <- partial(xgb_model, pred.var = "x_manh", train = as.matrix(X), grid.resolution = 30)
autoplot(pdp_x)


Internal Cross Validation Check

We ran xgb.cv() with 10 folds to check whether RMSE plateaus before the 100th round. As expected from our initial run, where each round's RMSE was still decreasing (albeit diminishingly), the best iteration was the 100th and final round, meaning early stopping never triggered.

# Simplified parameter set (omits the subsampling/regularization used in the fit above)
params <- list(
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 6
)

cv_model <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 100,
  nfold = 10,
  verbose = 0,
  early_stopping_rounds = 10
)

best_nrounds <- cv_model$best_iteration
best_nrounds
## [1] 100
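
Because the best iteration equals the 100-round cap, the curve likely hadn't plateaued; a quick follow-up sketch (our addition) raises the ceiling so early stopping can find the plateau:

# Sketch: allow more rounds and let early stopping locate the true plateau
cv_long <- xgb.cv(params = params, data = dtrain, nrounds = 500,
                  nfold = 10, verbose = 0, early_stopping_rounds = 20)
cv_long$best_iteration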



Residual Heat Map

Here we generate the residual heat map. Note there are no properties in Central Park, but also none in a band below Central Park, encompassing Times Square, that likely corresponds to business areas without coops. We stretched Manhattan to five times its width to see the residuals better.

Note that residuals are the log of the actual DOF market value per square foot minus the log of the predicted value. A positive residual means the property is overassessed by the DOF in the context of the model, and the building may want to file a tax challenge. Since the amount of tax to be collected from Class 2 properties is fixed and apportioned according to market value, a successful challenge would mean that building pays less while all other buildings pay slightly more once the following year's tax rate is determined, but the overall result would be fairer.

If a property has a strongly negative residual, it is underassessed and should be reassessed by the DOF. If the DOF comes back with a higher assessed market value, then, since property taxes are set to reach a specific total, this would slightly lower the taxes for all other properties, increasing fairness.

Discernible Patterns

Property taxes appear evenly assessed on the Upper East Side, but on the Upper West Side there is a cluster of larger discrepancies. Discrepancies are similarly more extreme north of Central Park and in Chinatown. This could suggest that buildings in these areas are undergoing gentrification with uneven development, or it could be an example of the DOF's billing mentality of charging what they think buildings will accept, exaggerated in these areas by the uneven development.

# Generate predictions on training data
preds <- predict(xgb_model, newdata = X)

# Calculate residuals (log-space)
residuals <- y - preds

# Join residuals with BLOCK, LOT, xcoord, ycoord
rows_used <- as.integer(rownames(X))
id_vars_filtered <- id_vars[rows_used, ]
residuals_df <- id_vars_filtered %>%
  mutate(residual = residuals)

ggplot(residuals_df, aes(x = x_manh, y = y_manh, color = residual)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  coord_fixed(ratio = .2) +
  theme_minimal() +
  labs(title = "Residual Heatmap: XGBoost Predictions vs Actual",
       color = "Residual (log-scale)")

residuals_abs <- abs(residuals)
residuals_df <- id_vars_filtered %>%
  mutate(residual_abs = residuals_abs)

ggplot(residuals_df, aes(x = x_manh, y = y_manh, color = residual_abs)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "lightyellow", high = "navy") +
  coord_fixed(ratio = 1) +
  theme_minimal() +
  labs(title = "Residual Heatmap: Absolute Value of Discrepancies",
       color = "ABS(Residual)")




Individual Residual

Here we show that the residual for our individual building of interest is 0.0971. This means the unlogged, real assessed market value is ~10.2% higher than what our model would predict.
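
As a quick check of that conversion:

# Log residual to percent discrepancy
exp(0.0971) - 1  # ~0.102, i.e. ~10.2% above the model's prediction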

It's interesting that the linear and non-linear models make conflicting predictions. Since the linear model doesn't take location into consideration, it may be that we are being overtaxed relative to our location, all else held equal.

residuals_df %>%
  filter(BLOCK == 1046, LOT == 23)
## # A tibble: 1 × 5
##   BLOCK   LOT x_manh y_manh residual_abs
##   <dbl> <dbl>  <dbl>  <dbl>        <dbl>
## 1  1046    23 -2281. -6194.       0.0971

We save these results for potential use by building boards who are curious where their building falls under this model.

write.csv(residuals_df, "residuals_df_xgb.csv", row.names = FALSE)



Conclusion

It looks like the models have two audiences: the DOF and individual buildings. The DOF could use both models to screen for anomalies, test their processes, and increase overall fairness. Individual buildings could review their discrepancies in both models to see whether they have a clear case for a property tax appeal, are already tax-advantaged, or fall somewhere unclear.

The linear model seems to better explain how the DOF might be computing values, even without considering nearby properties. The XGBoost model tells a better story of how assessed tax values are applied across Manhattan by taking geography into account.

A third audience for these results would be politicians, lobbyists, and think tanks trying to create a fairer, more inclusive property tax structure for New York City: first, by making the DOF's practices more transparent, reproducible, and comparable, and second, by making tax equity more salient.

Individual Example
In the individual example, the actual assessed market value is 10.2% above what the model predicts. Of interest, the building filed tax protests in each of the last two years, which resulted first in a slight reduction and then in a halt to the increase in assessed value. No rationale was required to file the protests, and no rationale was provided in the DOF's responses. This may speak to the "billing mentality" of the DOF, where the assessed market values that drive property taxation (before exemptions are applied) are part art.

To the extent the model reproduces the current numbers rather than tracking how the system should work, we may miss tax inequities by relying on this model alone. We would suggest using both the XGBoost and linear models to screen for discrepancies, and then pushing the DOF to clarify their calculations.