This study explores the resolution patterns of building violations issued by the New York City Department of Buildings (DOB). Using publicly available data from NYC OpenData, the analysis seeks to identify borough-specific disparities in violation resolution times and to examine how different violation types affect resolution outcomes. Two methods are used: a logit regression, which models whether a violation is resolved, and parametric survival analysis, which models how long resolution takes. The goal is to identify which boroughs and violation types are associated with slow or fast resolution, providing insight into building safety and code enforcement efficiency.
library(modelsummary)
library(lubridate)
library(tidyverse)
library(tinytable)
library(survival)
library(ggplot2)
library(margins)
selected_violation_types <- c("AEUHAZ1", "B", "C", "E", "EGNCY", "FISP", "HBLVIO", "IMEGNCY", "LANDMK", "P", "UB", "Z")
# Parse dates, label boroughs, and build the outcome variables
DOB_BI_data <- raw_data %>%
  mutate(
    ISSUE_DATE = ymd(ISSUE_DATE),
    DISPOSITION_DATE = ymd(DISPOSITION_DATE),
    BORO = factor(BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")),
    VIOLATION_TYPE_CODE = as.factor(VIOLATION_TYPE_CODE),
    # Days from issue to disposition; NA while the violation is still open
    time_to_event = as.numeric(DISPOSITION_DATE - ISSUE_DATE),
    # 1 = resolved (has a disposition date), 0 = unresolved
    event = ifelse(!is.na(DISPOSITION_DATE), 1, 0)
  ) %>%
  filter(
    !is.na(ISSUE_DATE),
    !is.na(BORO),
    between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")),
    VIOLATION_TYPE_CODE %in% selected_violation_types
  )
# Keep only the columns used in the models
DOB_BI_data <- DOB_BI_data %>%
  select(BORO, VIOLATION_TYPE_CODE, time_to_event, event)
unique(DOB_BI_data$event)
## [1] 1 0
The first modeling approach involves a Logit regression, focusing on how violation types influence the likelihood of resolution (event = 1). This model retains violations with unresolved statuses (event = 0), making it suitable for evaluating current enforcement patterns.
datasummary_skim(DOB_BI_data)
 | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|---|---|
time_to_event | 7732 | 13 | 811.8 | 1047.3 | 0.0 | 393.0 | 9119.0 |
event | 2 | 0 | 0.9 | 0.3 | 0.0 | 1.0 | 1.0 |

BORO | N | % |
---|---|---|
Manhattan | 443966 | 44.1 |
Bronx | 119296 | 11.8 |
Brooklyn | 245139 | 24.3 |
Queens | 173364 | 17.2 |
Staten Island | 25043 | 2.5 |
vc <- c('VIOLATION_TYPE_CODEZ' = 'Zoning',
'VIOLATION_TYPE_CODEUB' = 'Unsafe Building',
'VIOLATION_TYPE_CODEP' = 'Plumbing',
'VIOLATION_TYPE_CODELANDMK' = 'Landmark Building',
'VIOLATION_TYPE_CODEIMEGNCY' = 'Immediate Emergency',
'VIOLATION_TYPE_CODEHBLVIO' = 'High Pressure Boiler',
'VIOLATION_TYPE_CODEFISP' = 'Facade Safety Program',
'VIOLATION_TYPE_CODEEGNCY' = 'Emergency',
'VIOLATION_TYPE_CODEE' = 'Elevator',
'VIOLATION_TYPE_CODEC' = 'Construction',
'VIOLATION_TYPE_CODEB' = 'Boiler',
'(Intercept)' = 'Immediately Hazardous')
# Logit model: resolution (event = 1) as a function of violation type
lglm <- glm(event ~ VIOLATION_TYPE_CODE,
            family = binomial(link = "logit"),
            data = DOB_BI_data)

# Coefficient plot; terms with p < 0.05 are drawn in red
modelplot(lglm, coef_map = vc) +
  aes(color = ifelse(p.value < 0.05, "Significant", "Not Significant")) +
  scale_color_manual(
    values = c("Significant" = "red", "Not Significant" = "gray"),
    name = "p-value"
  )
The model plot visualizes the coefficient estimates and 95% confidence intervals for each violation type. The coefficients represent the change in the log-odds of a violation being resolved for each violation type, relative to the baseline category (Immediately Hazardous).
- Violation types with confidence intervals that do not cross zero are considered statistically significant at the 0.05 level.
- Coefficients to the right of zero indicate a positive association with violation resolution, while coefficients to the left of zero indicate a negative association.
- The magnitude of the coefficient indicates the strength of the association.
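For reference, the model behind the plot can be written in standard logit notation. The symbols $p_k$, $\beta_0$, and $\beta_k$ are introduced here for illustration and do not appear in the original code; $p_k$ is the probability that a violation of type $k$ is resolved, $\beta_0$ is the intercept (the log-odds for the baseline type), and $\beta_k$ is the plotted coefficient for type $k$:

$$
\log\frac{p_k}{1 - p_k} = \beta_0 + \beta_k,
\qquad
\mathrm{OR}_k = e^{\beta_k},
\qquad
p_k = \frac{e^{\beta_0 + \beta_k}}{1 + e^{\beta_0 + \beta_k}}
$$

For the baseline type, $\beta_k = 0$, so its log-odds equal $\beta_0$.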
The model summary provides a table of exponentiated coefficients (odds ratios), confidence intervals, and other model statistics.
- Odds ratios greater than 1 indicate that the violation type is associated with higher odds of resolution than the baseline category, while odds ratios less than 1 indicate lower odds.
- The first row (Immediately Hazardous) is the exponentiated intercept, that is, the baseline odds of resolution rather than an odds ratio.
- The confidence intervals provide a range of plausible values for the odds ratios.
- The AIC and BIC values can be used to compare the fit of different models.
 | (1) |
---|---|
Immediately Hazardous | 4.011 |
 | [3.962, 4.061] |
Boiler | 0.267 |
 | [0.223, 0.321] |
Construction | 0.412 |
 | [0.406, 0.419] |
Elevator | 6.042 |
 | [5.937, 6.149] |
Emergency | 1.330 |
 | [1.213, 1.462] |
Facade Safety Program | 0.277 |
 | [0.260, 0.296] |
High Pressure Boiler | 0.087 |
 | [0.084, 0.090] |
Immediate Emergency | 0.883 |
 | [0.837, 0.932] |
Landmark Building | 0.812 |
 | [0.783, 0.842] |
Plumbing | 0.486 |
 | [0.461, 0.512] |
Unsafe Building | 0.758 |
 | [0.720, 0.798] |
Zoning | 0.556 |
 | [0.517, 0.598] |
Num.Obs. | 1006808 |
AIC | 627480.9 |
BIC | 627622.7 |
Log.Lik. | -313728.427 |
F | 11649.695 |
RMSE | 0.31 |
As Zelner (2009) highlights, logit coefficients are hard to interpret directly because the model is non-linear in probabilities. Zelner instead advocates examining the differences in predicted probabilities associated with discrete changes in the independent variables, which gives a more intuitive sense of each variable's effect on the outcome probability.
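As a minimal sketch of this idea, the odds reported in the table above can be converted into predicted probabilities by hand. The helper `odds_to_prob` and the selection of four violation types are illustrative choices, not part of the original analysis. Because the model has a single categorical predictor, the results should match the fitted probabilities up to the rounding in the table.

```r
# Values copied from the odds-ratio table above.
baseline_odds <- 4.011   # Immediately Hazardous (exponentiated intercept)
odds_ratio <- c(
  "Elevator"             = 6.042,
  "Emergency"            = 1.330,
  "Construction"         = 0.412,
  "High Pressure Boiler" = 0.087
)

# Hypothetical helper: convert odds to a probability, p = odds / (1 + odds).
odds_to_prob <- function(odds) odds / (1 + odds)

# Predicted probability of resolution for the baseline type and for each
# selected type (baseline odds multiplied by that type's odds ratio).
p_baseline <- odds_to_prob(baseline_odds)
p_type     <- odds_to_prob(baseline_odds * odds_ratio)

# Discrete change in predicted probability relative to the baseline type.
prob_table <- data.frame(
  predicted_prob   = round(p_type, 3),
  diff_vs_baseline = round(p_type - p_baseline, 3)
)
print(round(p_baseline, 3))
print(prob_table)
```

Read this way, an odds ratio of 6.042 for Elevator violations corresponds to a rise in the predicted probability of resolution from roughly 0.80 to roughly 0.96, while High Pressure Boiler violations fall to roughly 0.26.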
# Same preparation as above, restricted to violations with a positive duration
DOB_data <- raw_data %>%
  mutate(
    ISSUE_DATE = ymd(ISSUE_DATE),
    DISPOSITION_DATE = ymd(DISPOSITION_DATE),
    BORO = factor(BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")),
    VIOLATION_TYPE_CODE = as.factor(VIOLATION_TYPE_CODE),
    time_to_event = as.numeric(DISPOSITION_DATE - ISSUE_DATE),
    event = ifelse(!is.na(DISPOSITION_DATE), 1, 0)
  ) %>%
  filter(
    !is.na(ISSUE_DATE),
    !is.na(BORO),
    between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")),
    VIOLATION_TYPE_CODE %in% selected_violation_types,
    # survreg() needs strictly positive times. Note that filter() also drops
    # rows where time_to_event is NA, i.e. unresolved violations (event = 0).
    time_to_event > 0
  )
surv_obj <- Surv(time = DOB_data$time_to_event, event = DOB_data$event)
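Because `time_to_event` is `NA` for violations without a disposition date, the condition `time_to_event > 0` removes them, so the survival object above contains resolved violations only. One possible alternative, sketched here on toy data, keeps open violations as right-censored observations. The extraction date `data_cutoff` and all values in `toy` are hypothetical and are not taken from the original data.

```r
library(survival)

# Toy data (hypothetical values): two resolved violations and two still open.
toy <- data.frame(
  ISSUE_DATE       = as.Date(c("2020-01-01", "2021-06-15", "2022-03-01", "2023-09-10")),
  DISPOSITION_DATE = as.Date(c("2020-03-01", NA, "2022-03-31", NA))
)

# Assumed extraction date; it would need to be replaced with the date on
# which the data were actually downloaded.
data_cutoff <- as.Date("2024-01-01")

# 1 = resolved, 0 = still open at the cutoff (right-censored).
toy$event <- as.integer(!is.na(toy$DISPOSITION_DATE))

# Resolved rows: days from issue to disposition.
# Open rows: days from issue to the cutoff, i.e. the observed follow-up time.
end_date <- toy$DISPOSITION_DATE
end_date[is.na(end_date)] <- data_cutoff
toy$time_to_event <- as.numeric(end_date - toy$ISSUE_DATE)

# Censored observations are printed with a trailing "+".
toy_surv <- Surv(time = toy$time_to_event, event = toy$event)
print(toy_surv)
```

Fitting the models on data prepared this way would change the estimates reported below, so this is a direction for a robustness check rather than a description of the analysis as run.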
To assess how long violations remain unresolved across boroughs, four parametric survival models were fitted: Exponential, Weibull, Log-Logistic, and Log-Normal. Each model estimates the effect of borough on time to resolution, with Manhattan as the reference category:
# Fit the same borough-only specification under four distributions
model_list <- list(
  "Exponential" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "exp"),
  "Weibull" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "weibull"),
  "Log-Logistic" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "loglogistic"),
  "Log-Normal" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "lognormal")
)

# Display labels; the intercept corresponds to the reference borough
BORO_rename <- c(
  `(Intercept)` = "Manhattan",
  `BOROBronx` = "Bronx",
  `BOROBrooklyn` = "Brooklyn",
  `BOROQueens` = "Queens",
  `BOROStaten Island` = "StatenIsland",
  `Log(scale)` = "Log Scale"
)

# Side-by-side table with standard errors below each estimate
modelsummary(model_list,
  statistic = "{std.error}",
  gof_omit = 'RMSE|Num.Obs.',
  coef_rename = BORO_rename,
  output = "gt",
  fmt = 2
)
 | Exponential | Weibull | Log-Logistic | Log-Normal |
---|---|---|---|---|
Manhattan | 6.60 | 6.49 | 5.90 | 5.88 |
 | 0.00 | 0.00 | 0.00 | 0.00 |
Bronx | 0.19 | 0.20 | 0.23 | 0.18 |
 | 0.00 | 0.00 | 0.00 | 0.00 |
Brooklyn | 0.20 | 0.19 | 0.11 | 0.06 |
 | 0.00 | 0.00 | 0.00 | 0.00 |
Queens | 0.17 | 0.16 | 0.10 | 0.06 |
 | 0.00 | 0.00 | 0.00 | 0.00 |
StatenIsland | 0.10 | 0.08 | -0.07 | -0.17 |
 | 0.01 | 0.01 | 0.01 | 0.01 |
Log Scale | | 0.19 | -0.23 | 0.32 |
 | | 0.00 | 0.00 | 0.00 |
AIC | 13417248.6 | 13356479.0 | 13377314.5 | 13360554.2 |
BIC | 13417307.0 | 13356549.1 | 13377384.5 | 13360624.3 |
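The coefficients from `survreg()` are on the log-time scale, so exponentiating a borough coefficient gives a multiplicative time ratio relative to Manhattan. The following sketch uses the Weibull column and the AIC row of the table above; the object names are illustrative, and the values are the rounded ones shown in the table.

```r
# Weibull coefficients copied from the table (log-time scale, rounded).
weibull_coef <- c(
  "Bronx"         = 0.20,
  "Brooklyn"      = 0.19,
  "Queens"        = 0.16,
  "Staten Island" = 0.08
)

# exp(coefficient) = factor by which resolution time is stretched (> 1)
# or shrunk (< 1) relative to Manhattan, the reference borough.
time_ratio <- exp(weibull_coef)
print(round(time_ratio, 2))

# AIC values copied from the table; a lower AIC indicates a better fit.
aic <- c(
  "Exponential"  = 13417248.6,
  "Weibull"      = 13356479.0,
  "Log-Logistic" = 13377314.5,
  "Log-Normal"   = 13360554.2
)

# Difference from the best-fitting model, sorted from best to worst.
delta_aic <- sort(aic - min(aic))
print(delta_aic)
```

Under the Weibull model, which has the lowest AIC of the four, a coefficient of 0.20 for the Bronx corresponds to resolution times about 22% longer than in Manhattan.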
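This dual-model analysis highlights borough-level disparities and violation-specific trends in NYC DOB's resolution practices. Relative to Manhattan, the Bronx and Brooklyn have positive coefficients in all four survival models, indicating longer times to resolution, and may require targeted attention. Staten Island's coefficient changes sign across distributions (positive under the Exponential and Weibull models, negative under the Log-Logistic and Log-Normal models), an inconsistency that may warrant further localized study. In the logit model, several violation types, most notably High Pressure Boiler, Boiler, and Facade Safety Program, have odds ratios well below 1, pointing to potential enforcement or compliance challenges.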