Here we load and clean the data for our final model. We'll use the same data setup, except we'll keep xcoord and ycoord for a heat map of the residuals. After exploration, it doesn't look like XGBoost can treat coordinates the way a geospatial model would, but it can pick up on location interactions by using the coordinates like any other numeric feature.
# Load packages (tidyverse already includes dplyr)
library(tidyverse)
library(xgboost)  # gradient-boosted trees
library(Matrix)   # sparse model matrices
library(caret)    # modeling utilities
library(pdp)      # partial dependence plots
library(vip)      # variable importance plots
# Load data
df1 <- readRDS("~/Documents/D698/df.rds")
We'll remove the same 37 records, drop the columns with significant missingness, and recategorize NAs.
df2 <- df1 %>%
  filter(!is.na(xcoord))

df3 <- df2 %>%
  dplyr::select(-CONDO_Number, -EASEMENT, -condono)

df4 <- df3 %>%
  mutate(
    firm07_flag  = replace_na(firm07_flag, 0),
    pfirm15_flag = replace_na(pfirm15_flag, 0),
    landmark     = replace_na(landmark, "None"),
    histdist     = replace_na(histdist, "None"),
    spdist1      = replace_na(spdist1, "None"),
    CORNER = case_when(
      CORNER %in% c("CR", "NE", "SW", "NW", "SE") ~ "Yes",
      is.na(CORNER) | CORNER == "00" ~ "No"
    ),
    ltdheight = replace_na(ltdheight, "None"),
    LOT_IRREG = replace_na(LOT_IRREG, "Unknown"),
    BLD_EXT   = replace_na(BLD_EXT, "N"),
    landuse   = replace_na(landuse, "Unknown")
  )
We exclude the same columns as in the prior linear model, except that we keep BLOCK and LOT for identification and xcoord and ycoord for the heat map. We also add ASP to the exclusions, which expands the number of records from ~2,000 to ~13,300, since ASP is populated for only a small fraction of lots and keeping it forced those rows to be dropped.
# Define the columns to exclude
exclude_cols_lrmodel <- c(
  "bbl", "PARID", "bctcb2020", "bct2020", "latitude", "longitude",
  "BORO", "YEAR", "COOP_APTS", "firm07_flag", "pfirm15_flag", "LOT_DEP",
  "FINTAXCLASS", "ZIP_CODE", "LOT_FRT", "BLD_FRT", "BLD_DEP", "cd",
  "council", "policeprct", "healthcenterdistrict", "healtharea", "spdist1",
  "firecomp", "zonedist1", "bldgclass", "landuse", "tract2010",
  "RESIDENTIAL_AREA_GROSS", "residfar", "commfar", "facilfar", "ASP"
)

df5_xgb <- df4 %>%
  dplyr::select(-all_of(exclude_cols_lrmodel)) %>%
  mutate(across(where(is.character), as.factor))
The Manhattan street grid is tilted about 29° east of true north. To improve our map we rotate our coordinate system to "Manhattan-North". This allows the natural borders of street-divided blocks in the city grid to be captured by simple axis-aligned splits in the XGBoost trees.
# Rotate (and optionally scale) x/y into a "Manhattan grid" frame
rotate_xy <- function(x, y, angle_deg, scale_x = 1, scale_y = 1, center = TRUE) {
  theta <- angle_deg * pi / 180
  # Standard 2-D rotation matrix
  R <- matrix(c(cos(theta), -sin(theta),
                sin(theta),  cos(theta)), 2, 2, byrow = TRUE)
  X <- cbind(x / scale_x, y / scale_y)
  if (center) {
    # Center on the mean coordinate so the rotation pivots around the data
    ctr <- colMeans(X, na.rm = TRUE)
    X <- sweep(X, 2, ctr, "-")
  }
  XYrot <- X %*% t(R)
  list(x = XYrot[, 1], y = XYrot[, 2])
}
ang <- 29
rot <- rotate_xy(df5_xgb$xcoord, df5_xgb$ycoord, angle_deg = ang)
df5_xgb$x_manh <- rot$x
df5_xgb$y_manh <- rot$y
Here we convert our data into a DMatrix, XGBoost's optimized data structure for faster training and better memory efficiency. In earlier iterations of the XGBoost model we saw significant leakage from the variable of top importance, GROSS_SQFT, so we changed the target to estimated market value per square foot.
# Create target: estimated market value per square foot
df_clean <- df5_xgb %>%
  mutate(mv_per_sqft = FINMKTTOT / GROSS_SQFT) %>%
  # remove rows that break division/log
  filter(is.finite(FINMKTTOT),
         is.finite(GROSS_SQFT),
         GROSS_SQFT > 0,
         is.finite(mv_per_sqft),
         mv_per_sqft > 0)
Here we prepare our data so XGBoost can read it efficiently.
# Build model frame without IDs & leakage cols
leak_cols <- c("mv_per_sqft", "FINMKTTOT", "GROSS_SQFT", "BLOCK", "LOT",
               "xcoord", "ycoord")
model_cols <- setdiff(names(df_clean), leak_cols)

df_model <- df_clean %>%
  select(all_of(c("mv_per_sqft", model_cols))) %>%
  drop_na()

# Align id_vars to df_model (no leakage). Tibbles don't carry stable
# rownames, so align on the complete-cases mask rather than on rownames.
rows_kept <- complete.cases(df_clean %>% select(all_of(c("mv_per_sqft", model_cols))))
id_vars <- df_clean %>%
  filter(rows_kept) %>%
  select(BLOCK, LOT, x_manh, y_manh)
stopifnot(nrow(id_vars) == nrow(df_model))

# Design matrix / labels
X <- sparse.model.matrix(mv_per_sqft ~ . - 1, data = df_model)
y <- log(df_model$mv_per_sqft)

# Build DMatrix
dtrain <- xgb.DMatrix(data = X, label = y, missing = NA)
Here we fit a basic model and observe a healthy learning curve: the Root Mean Squared Error (RMSE) decreases steadily, reaching a low of 0.269 in log space. This corresponds to a typical multiplicative prediction error of about 31% in real (non-log) market values, calculated as e^RMSE − 1.
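As a quick arithmetic check of that conversion, using the RMSE from the run below:

# e^RMSE - 1 converts a log-space RMSE to a multiplicative error
log_rmse <- 0.269
exp(log_rmse) - 1  # ~0.309, i.e. roughly a 31% typical error in dollar terms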
Variable Importance
Our recent iterations of the model had an overreliance on a single feature, y_manh (the north-south coordinate, tilted to align with the Manhattan street grid). We constrained the model by adding regularization and reducing the number of columns sampled per tree, yet y_manh still receives high variable importance. We show later that this is a strength of the model, not a weakness.
Average Sale Price Aside
This is a significant increase in multiplicative error from the 11% achieved on the smaller dataset, in which only ~14% of records had average sale price information. Most likely the prior model had inflated performance that wouldn't generalize: with a smaller dataset there was less complexity to model with the same number of fields plus ASP, so it was overfit. ASP did not have a high correlation with FINMKTTOT, but it may have acted like a shortcut feature, letting the model explain much more. It could also be that the DOF does look at recent sale data when calculating taxes, even if it's not part of their disclosed process, to accelerate increases for buildings experiencing higher recent sales. A future exploration could be to rerun the 1,923 records that have ASP twice, comparing how the model performs with and without the ASP data.
set.seed(1751753275)
xgb_model <- xgboost(
  data = dtrain,
  objective = "reg:squarederror", # regression loss
  nrounds = 100,
  eta = 0.1,
  max_depth = 6,
  subsample = 0.8,        # like bagging: fraction of rows sampled per tree
  colsample_bytree = 0.8, # fraction of columns sampled per tree
  lambda = 1.0,           # L2 (ridge) regularization
  alpha = 0.1,            # L1 (lasso) regularization
  verbose = 0
)
preds <- predict(xgb_model, newdata = dtrain)
y_true <- getinfo(dtrain, "label")
final_rmse <- sqrt(mean((y_true - preds)^2))
final_rmse
## [1] 0.2690246
We manually calculate a training R² of 0.8104, compared to 0.7896 in the final logged linear model. Note that this is a training R², not a held-out one: it is computed on the training data and will overstate how well the model extends to new data.
preds_xgb <- predict(xgb_model, newdata = X)
r2_xgb <- 1 - sum((y - preds_xgb)^2) / sum((y - mean(y))^2)
r2_xgb
## [1] 0.8104825
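For context, a held-out R² could be estimated with a simple train/test split; a minimal sketch, assuming an 80/20 random split (our choice) and the same hyperparameters as above:

# Sketch: held-out R^2 via an 80/20 split (assumed split ratio)
set.seed(1751753275)
idx <- sample(nrow(X), floor(0.8 * nrow(X)))
dtr <- xgb.DMatrix(X[idx, ], label = y[idx])
dte <- xgb.DMatrix(X[-idx, ], label = y[-idx])
fit_holdout <- xgboost(data = dtr, objective = "reg:squarederror",
                       nrounds = 100, eta = 0.1, max_depth = 6,
                       subsample = 0.8, colsample_bytree = 0.8,
                       lambda = 1.0, alpha = 0.1, verbose = 0)
pred_te <- predict(fit_holdout, newdata = dte)
1 - sum((y[-idx] - pred_te)^2) / sum((y[-idx] - mean(y[-idx]))^2)  # held-out R^2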
The variable importance plot shows each variable's relative importance in the model; the primary driver is y_manh, the north-south coordinate tilted to align with the Manhattan street grid. schooldist06 is our next most important variable. It carves out a very specific geography, Washington Heights and Inwood, which has both good and bad schools. My theory for why it's valuable: we later find that the west side tends to pay less in taxes than the east across Manhattan, but this region of northern Manhattan may reverse that trend by being better developed on the west side than the east, so this carve-out by school district is a useful split when building trees.
LAND_AREA also has a big impact. We've seen this in the case of StuyTown, the extreme outlier, where a large land area doesn't scale with how much tax is charged per unit. Since the buildings are spaced further apart and there's an upper limit on property-tax affordability per unit, the extra open space in the tax lot isn't taxed as heavily as in an area with greater unit density.
Since y_manh dwarfs the other variables in importance, we need to look more closely. It could be that y_manh is used early in the tree-building process to separate the data into big, easy-to-predict chunks, but that its importance comes from interactions with the other variables rather than from predicting estimated market value per square foot by itself.
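One way to probe this interaction hypothesis is a two-variable partial dependence plot. A minimal sketch with pdp; the pairing with LAND_AREA is our choice for illustration, and the call can be slow on ~13,000 rows:

# Joint partial dependence of y_manh and LAND_AREA to probe interactions
pdp_int <- partial(xgb_model, pred.var = c("y_manh", "LAND_AREA"),
                   train = as.matrix(X), grid.resolution = 15, chull = TRUE)
plotPartial(pdp_int)  # level plot; non-parallel contours suggest interaction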
vip::vip(xgb_model, num_features = 20)
# Get importance matrix
importance_matrix <- xgb.importance(model = xgb_model)
# View top variables
print(importance_matrix, 20) # top 20
## Feature
## <char>
## 1: y_manh
## 2: schooldist06
## 3: LAND_AREA
## 4: YRBUILT
## 5: x_manh
## 6: builtfar
## 7: FINTAXFLAGT
## 8: schooldist02
## 9: YRALT1
## 10: UNITS
## 11: BLDG_CLASSC7
## 12: YRALT2
## 13: ZONINGR7-2
## 14: BLDG_CLASSC1
## 15: BLD_STORY
## 16: OFFICE_AREA_GROSS
## 17: CORNERYes
## 18: histdistNone
## 19: BLDG_CLASSD3
## 20: BLDG_CLASSD1
## ---
## 106: BLD_EXTG
## 107: histdistGreenwich Village Historic District Extension II
## 108: histdistUpper West Side / Central Park West Historic District
## 109: ZONINGM16D
## 110: BLDG_CLASSD0
## 111: ZONINGC64A
## 112: ZONINGC28A
## 113: ltdheightLH-1A
## 114: ZONINGC4-5
## 115: ZONINGC6-6
## 116: ZONINGC61G
## 117: histdistHamilton Heights / Sugar Hill Northwest Historic District
## 118: ZONINGC1-7
## 119: ZONINGC45D
## 120: ZONINGR10H
## 121: ZONINGC67T
## 122: ZONINGC5-5
## 123: ZONINGM15R7D
## 124: ZONINGR10
## 125: BLDG_CLASSS9
## Gain Cover Frequency
## <num> <num> <num>
## 1: 7.101895e-01 1.543289e-01 0.1224347015
## 2: 4.035271e-02 5.120328e-03 0.0013992537
## 3: 3.475170e-02 9.473181e-02 0.0981809701
## 4: 2.919019e-02 7.289802e-02 0.0776585821
## 5: 2.705178e-02 8.373175e-02 0.1112406716
## 6: 2.263130e-02 9.180566e-02 0.0979477612
## 7: 1.773582e-02 3.961333e-02 0.0156250000
## 8: 1.657335e-02 5.518370e-03 0.0048973881
## 9: 1.504934e-02 5.639425e-02 0.0680970149
## 10: 1.383123e-02 6.005781e-02 0.0680970149
## 11: 6.951507e-03 1.563815e-02 0.0156250000
## 12: 6.442223e-03 3.012650e-02 0.0230876866
## 13: 6.408744e-03 5.291123e-03 0.0090951493
## 14: 4.753378e-03 1.097008e-02 0.0118936567
## 15: 3.728537e-03 7.426294e-03 0.0279850746
## 16: 3.673379e-03 2.015487e-02 0.0233208955
## 17: 3.385792e-03 1.656659e-02 0.0125932836
## 18: 2.944740e-03 7.184375e-03 0.0046641791
## 19: 2.485359e-03 9.748685e-03 0.0088619403
## 20: 2.417644e-03 1.005966e-02 0.0104944030
## ---
## 106: 2.277856e-05 6.855703e-04 0.0002332090
## 107: 2.249169e-05 2.679130e-05 0.0002332090
## 108: 2.160370e-05 2.152872e-05 0.0004664179
## 109: 2.158195e-05 2.280450e-05 0.0004664179
## 110: 2.114265e-05 9.568322e-06 0.0009328358
## 111: 1.810081e-05 3.460543e-05 0.0002332090
## 112: 1.751745e-05 1.435248e-06 0.0002332090
## 113: 1.561272e-05 1.881770e-05 0.0004664179
## 114: 1.549372e-05 1.435248e-06 0.0002332090
## 115: 1.388980e-05 9.249378e-06 0.0002332090
## 116: 1.361087e-05 1.275776e-06 0.0002332090
## 117: 1.243241e-05 3.348913e-06 0.0002332090
## 118: 7.214904e-06 1.865823e-05 0.0002332090
## 119: 6.336697e-06 4.784161e-07 0.0002332090
## 120: 3.784088e-06 4.784161e-07 0.0002332090
## 121: 3.207042e-06 1.116304e-06 0.0002332090
## 122: 2.779584e-06 1.594720e-06 0.0004664179
## 123: 2.037295e-06 2.392080e-06 0.0002332090
## 124: 1.844418e-06 3.189441e-07 0.0002332090
## 125: 2.966401e-07 6.378881e-07 0.0002332090
# Or convert to a data frame if you want to save or copy
importance_df <- as.data.frame(importance_matrix)
head(importance_df, 20)
## Feature Gain Cover Frequency
## 1 y_manh 0.710189524 0.154328900 0.122434701
## 2 schooldist06 0.040352710 0.005120328 0.001399254
## 3 LAND_AREA 0.034751702 0.094731810 0.098180970
## 4 YRBUILT 0.029190188 0.072898015 0.077658582
## 5 x_manh 0.027051778 0.083731748 0.111240672
## 6 builtfar 0.022631295 0.091805657 0.097947761
## 7 FINTAXFLAGT 0.017735824 0.039613331 0.015625000
## 8 schooldist02 0.016573352 0.005518370 0.004897388
## 9 YRALT1 0.015049339 0.056394254 0.068097015
## 10 UNITS 0.013831229 0.060057805 0.068097015
## 11 BLDG_CLASSC7 0.006951507 0.015638146 0.015625000
## 12 YRALT2 0.006442223 0.030126500 0.023087687
## 13 ZONINGR7-2 0.006408744 0.005291123 0.009095149
## 14 BLDG_CLASSC1 0.004753378 0.010970081 0.011893657
## 15 BLD_STORY 0.003728537 0.007426294 0.027985075
## 16 OFFICE_AREA_GROSS 0.003673379 0.020154873 0.023320896
## 17 CORNERYes 0.003385792 0.016566593 0.012593284
## 18 histdistNone 0.002944740 0.007184375 0.004664179
## 19 BLDG_CLASSD3 0.002485359 0.009748685 0.008861940
## 20 BLDG_CLASSD1 0.002417644 0.010059655 0.010494403
Here we confirm that the variable of greatest importance isn't leaking data, and we gain some insight into how the coordinates tilted to the Manhattan grid work with the data.
y_manh Alone
Here we run a linear model using y_manh alone and arrive at an R² of 0.05601. This confirms that y_manh by itself can't explain estimated market value per square foot and is not leaking data into the model (the way gross square footage alone was). It helps confirm that the preponderance of importance on y_manh is largely due to its interactions with other meaningful variables. Since y_manh alone explains only about 5.6% of the variance, we next develop an intuition for how the model uses location and make sure its effect is a smooth, generalized curve rather than a jagged, overfit one.
# How strong is y_manh alone?
summary(lm(log(FINMKTTOT) ~ y_manh, data = df5_xgb)) # R^2 with just y_manh
##
## Call:
## lm(formula = log(FINMKTTOT) ~ y_manh, data = df5_xgb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1146 -0.7466 -0.2635 0.5486 5.5893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.527e+01 9.501e-03 1607.4 <2e-16 ***
## y_manh -1.658e-05 5.900e-07 -28.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.096 on 13310 degrees of freedom
## Multiple R-squared: 0.05601, Adjusted R-squared: 0.05594
## F-statistic: 789.8 on 1 and 13310 DF, p-value: < 2.2e-16
By graphing the partial dependence plots for y_manh and x_manh we can develop intuition for how geography affects property taxes in Manhattan. These intuitions roughly track general perceptions of prestige by geography.
For Manhattan north-south, the highest taxes fall between roughly 3rd Street and 82nd Street; then there's a precipitous drop in taxes until about 116th Street.
# Partial dependence on y_manh (pdp was loaded above)
pdp_y <- partial(xgb_model, pred.var = "y_manh", train = as.matrix(X), grid.resolution = 30)
autoplot(pdp_y)
For Manhattan east-west, the curve is blockier, but this is consistent with the more discrete, long east-west blocks. The highest taxes run from roughly 5th Avenue to Lexington, noting that Lexington is one short block east of prestigious Park Avenue. To the east, property taxes drop off until 2nd Avenue. To the west, taxes drop precipitously between 5th and 6th, stay flat until 8th, drop precipitously again to 9th, dip slightly to 10th, and show a small uptick right along the Hudson River.
# Partial dependence on x_manh
pdp_x <- partial(xgb_model, pred.var = "x_manh", train = as.matrix(X), grid.resolution = 30)
autoplot(pdp_x)
We ran xgb.cv() with 10 folds to check whether RMSE plateaus before the 100th round. As expected from our initial run, which showed each round's RMSE decreasing (albeit diminishingly), it confirmed that the 100th round was the best.
params <- list(
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 6
)

cv_model <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 100,
  nfold = 10,
  verbose = 0,
  early_stopping_rounds = 10
)
best_nrounds <- cv_model$best_iteration
best_nrounds
## [1] 100
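Because the best iteration landed exactly at the 100-round cap, a follow-up worth trying is a larger round budget with the same early stopping; a sketch under the same params (the 500-round budget is an assumed value):

cv_more <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 500,  # larger budget than the original 100 (assumed value)
  nfold = 10,
  verbose = 0,
  early_stopping_rounds = 10
)
cv_more$best_iteration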
Here we generate the residual heat map. Note there are no properties in Central Park, but also none in a band below Central Park that might correspond to business areas without co-ops, encompassing Times Square. We stretched Manhattan to five times its width to see the residuals better.
Note that the residuals are the log of the actual DOF market value minus the log of the predicted market value. A positive residual means the property is overassessed by the DOF in the context of the model, and the building may want to issue a tax challenge. Since the amount of tax collected from Class 2 properties is fixed and apportioned according to market value, a successful challenge would mean that building pays less while all other buildings pay slightly more once the following year's tax rate is determined, but the overall result would be fairer.
If a property has a highly negative residual, it is underassessed and should be reassessed by the DOF. If the DOF comes back with a higher assessed market value, then, since property tax rates are set to arrive at a specific total amount, this would slightly lower the taxes for all other properties, increasing fairness.
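To make the apportionment logic concrete, here is a toy illustration with hypothetical numbers (not DOF figures): a fixed levy split in proportion to market value, before and after one building wins a reduction.

# Toy example: a fixed levy apportioned by market value (hypothetical numbers)
levy <- 100                       # total amount to be collected
mv   <- c(a = 50, b = 30, c = 20) # assessed market values
levy * mv / sum(mv)               # shares: a = 50.0, b = 30.0, c = 20.0
mv["a"] <- 40                     # building a wins a reduction on appeal
levy * mv / sum(mv)               # a = 44.4; b and c each pay slightly more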
Discernible patterns
Property taxes appear to be evenly calculated on the Upper East Side; on the Upper West Side, however, there appears to be a cluster of larger discrepancies. Similarly, Manhattan north of Central Park and Chinatown show more extreme discrepancies. This could suggest that buildings in these areas are undergoing gentrification with uneven development, or it could be an example of the DOF having a billing mentality of charging what they think buildings will accept, exaggerated in these areas by uneven development.
# Generate predictions on the training data
preds <- predict(xgb_model, newdata = X)

# Calculate residuals (log-space)
residuals <- y - preds

# Attach residuals to BLOCK, LOT, x_manh, y_manh
# (id_vars is already aligned row-for-row with X and y)
residuals_df <- id_vars %>%
  mutate(residual = residuals)
ggplot(residuals_df, aes(x = x_manh, y = y_manh, color = residual)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  coord_fixed(ratio = 0.2) +  # stretch the width to see residuals better
  theme_minimal() +
  labs(title = "Residual Heatmap: XGBoost Predictions vs Actual",
       color = "Residual (log-scale)")
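As a screening aid, the same residuals_df can be ranked by signed residual to surface candidates at both extremes; a small sketch:

# Largest positive residuals: potentially overassessed (appeal candidates)
residuals_df %>% arrange(desc(residual)) %>% head(10)
# Largest negative residuals: potentially underassessed (reassessment candidates)
residuals_df %>% arrange(residual) %>% head(10)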
residuals_abs <- abs(residuals)

residuals_df <- id_vars %>%
  mutate(residual_abs = residuals_abs)
ggplot(residuals_df, aes(x = x_manh, y = y_manh, color = residual_abs)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "lightyellow", high = "navy") +
  coord_fixed(ratio = 1) +
  theme_minimal() +
  labs(title = "Residual Heatmap: Absolute Value of Discrepancies",
       color = "ABS(Residual)")
Here we show that the residual for our individual building of interest is 0.0971. This means the actual (unlogged) assessed market value is ~10.2% higher than what our model would predict.
It's interesting that the linear and non-linear models make conflicting predictions. Since the linear model doesn't take location into consideration, it may be that we are being overtaxed relative to our location, all else held equal.
residuals_df %>%
  filter(BLOCK == 1046, LOT == 23)
## # A tibble: 1 × 5
## BLOCK LOT x_manh y_manh residual_abs
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1046 23 -2281. -6194. 0.0971
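The ~10.2% figure quoted above comes from unlogging the residual:

exp(0.0971) - 1  # ~0.102: assessed ~10.2% above the model's prediction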
We export these results for potential use by building boards who are curious where their building falls under this model.
write.csv(residuals_df, "residuals_df_xgb.csv", row.names = FALSE)
It looks like the models would have two audiences: the DOF and individual buildings. The DOF could use both models to screen for anomalies, test their processes, and increase overall fairness. Individual buildings could review their discrepancies in both models to see whether they have a clear case for a property tax appeal, whether they are already tax-advantaged, or whether the picture is unclear.
The linear model seems to better explain how the DOF might be computing values, even without taking nearby properties into consideration. The XGBoost model tells a better story of how assessed tax values are applied across Manhattan by taking geography into account.
A third audience for these results would be politicians, lobbyists and think-tanks trying to create a fairer, more inclusive, property tax structure for New York City, firstly, by making the DOF’s practices more transparent, reproducible and comparable, and secondly, by making tax equity more salient and fair.
Individual Example
In the individual example, the actual assessed market value is 10.2% above what the model would predict. Of interest, the building filed tax protests in each of the last two years, which resulted in a slight reduction and then a halt in the increase of assessed value, respectively. No rationale was required by the DOF to file the tax protests, and no rationale was provided in the DOF's responses to either. This may speak to the "billing mentality" of the DOF, where the assessed market values, which drive property taxation before exemptions are applied, are part art.
To the extent the model supports the current numbers rather than tracking how the system should be implemented, we may miss tax inequities by relying on this model alone. We would suggest using both the XGBoost and linear models to screen for discrepancies, and then pushing the DOF to clarify their calculations.