Introduction

This study explores the resolution patterns of building violations issued by the New York City Department of Buildings (DOB). Using publicly available data from NYC OpenData, the analysis seeks to identify borough-specific disparities in violation resolution times and to examine how different violation types affect resolution outcomes. Using both survival analysis and logit regression, it aims to identify which boroughs and violation types are associated with prolonged or short resolution times, and so to provide insight into building safety and code enforcement efficiency.

library(modelsummary)
library(lubridate)
library(tidyverse)
library(tinytable)
library(survival)
library(ggplot2)
library(margins)

Logit Regression using datasummary and modelplot

Data Preparation:

  • Violation Types: 12 specific violation codes were selected based on the NYC DOB Code Book.
  • Date Formatting: ISSUE_DATE and DISPOSITION_DATE were converted to proper date formats.
  • Categorical Variables: Borough codes were labeled (e.g., Manhattan, Bronx), and VIOLATION_TYPE_CODE was treated as a factor.
  • New Variables: time_to_event is the number of days between violation issuance and resolution. event marks resolved (1) and unresolved (0) cases.
  • Filtering Criteria: Only violations issued from January 1, 2000, to December 31, 2025, with a valid issue date and borough, were included.
selected_violation_types <- c("AEUHAZ1", "B", "C", "E", "EGNCY", "FISP", "HBLVIO", "IMEGNCY", "LANDMK", "P", "UB", "Z")

DOB_BI_data <- raw_data %>%
  mutate(
    ISSUE_DATE = ymd(ISSUE_DATE),
    DISPOSITION_DATE = ymd(DISPOSITION_DATE),
    BORO = factor(BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")),
    VIOLATION_TYPE_CODE = as.factor(VIOLATION_TYPE_CODE),
    time_to_event = as.numeric(DISPOSITION_DATE - ISSUE_DATE),
    event = ifelse(!is.na(DISPOSITION_DATE), 1, 0)
  ) %>%
  filter(
    !is.na(ISSUE_DATE),
    !is.na(BORO),
    between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")),
    VIOLATION_TYPE_CODE %in% selected_violation_types
  )

DOB_BI_data <- DOB_BI_data %>%
  select(c(BORO, VIOLATION_TYPE_CODE, time_to_event, event))

unique(DOB_BI_data$event)
## [1] 1 0

Methodology

The first modeling approach involves a Logit regression, focusing on how violation types influence the likelihood of resolution (event = 1). This model retains violations with unresolved statuses (event = 0), making it suitable for evaluating current enforcement patterns.
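As a sketch of the specification described above, with violation type as the only predictor, the model can be written as follows. The symbols $p_i$, $\beta_0$, $\beta_k$, and the indicator notation are introduced here for illustration and do not appear in the original analysis.

```latex
\[
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i}
  = \beta_0 + \sum_{k=2}^{12} \beta_k \,\mathbf{1}[\text{type}_i = k],
\qquad p_i = \Pr(\text{event}_i = 1)
\]
```

Here $\beta_0$ is the log-odds of resolution for the reference violation type, and each $\beta_k$ is the difference in log-odds between type $k$ and the reference. With 12 violation types there are 11 such coefficients, and $\exp(\beta_k)$ is the corresponding odds ratio.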

Key Findings: Logit Regression (datasummary and modelplot)

Data Summary Snapshot:

  • Mean Resolution Time: ~812 days (median 393 days)
  • Violation Resolution Rate: ~90%

Largest Violation Volumes:

  • Manhattan (44.1%)
  • Brooklyn (24.3%)
  • Queens (17.2%)
  • Bronx (11.8%)
  • Staten Island (2.5%)
datasummary_skim(DOB_BI_data)
                Unique   Missing Pct.    Mean       SD    Min   Median      Max
time_to_event     7732             13   811.8   1047.3    0.0    393.0   9119.0
event                2              0     0.9      0.3    0.0      1.0      1.0

BORO                   N      %
Manhattan         443966   44.1
Bronx             119296   11.8
Brooklyn          245139   24.3
Queens            173364   17.2
Staten Island      25043    2.5
vc <- c('VIOLATION_TYPE_CODEZ' = 'Zoning', 
        'VIOLATION_TYPE_CODEUB' = 'Unsafe Building',
        'VIOLATION_TYPE_CODEP' = 'Plumbing',
        'VIOLATION_TYPE_CODELANDMK' = 'Landmark Building',
        'VIOLATION_TYPE_CODEIMEGNCY' = 'Immediate Emergency',
        'VIOLATION_TYPE_CODEHBLVIO' = 'High Pressure Boiler',
        'VIOLATION_TYPE_CODEFISP' = 'Facade Safety Program',
        'VIOLATION_TYPE_CODEEGNCY' = 'Emergency',
        'VIOLATION_TYPE_CODEE' = 'Elevator',
        'VIOLATION_TYPE_CODEC' = 'Construction',
        'VIOLATION_TYPE_CODEB' = 'Boiler',
        '(Intercept)' = 'Immediately Hazardous')

lglm <- glm(event ~ VIOLATION_TYPE_CODE,
            family = binomial(link = "logit"), 
            data = DOB_BI_data)

modelplot(lglm, coef_map = vc) +
  aes(color = ifelse(p.value < 0.05, "Significant", "Not Significant")) +
  scale_color_manual(
    values = c("Significant" = "red", "Not Significant" = "gray"),
    name = "p-value"
  )

Model Plot Analysis

The model plot visualizes the coefficient estimates and 95% confidence intervals for each violation type. The coefficients represent the change in the log-odds of a violation being resolved for each violation type, relative to the baseline category (Immediately Hazardous).

  • Violation types with confidence intervals that do not cross zero are considered statistically significant at the 0.05 level.
  • Coefficients to the right of zero indicate a positive association with violation resolution, while coefficients to the left of zero indicate a negative association.
  • The magnitude of the coefficient indicates the strength of the association.

Model Summary Analysis

The model summary provides a table of exponentiated coefficients, confidence intervals, and other model statistics. The first row (Immediately Hazardous) is the exponentiated intercept, that is, the odds of resolution for the baseline category. The remaining rows are odds ratios relative to that baseline.

  • Odds ratios greater than 1 indicate that the violation type is associated with higher odds of resolution than the baseline category, while odds ratios less than 1 indicate lower odds of resolution.
  • The confidence intervals provide a range of plausible values for the odds ratios.
  • The AIC and BIC values can be used to compare the fit of different models.

                         (1)
Immediately Hazardous    4.011
                         [3.962, 4.061]
Boiler                   0.267
                         [0.223, 0.321]
Construction             0.412
                         [0.406, 0.419]
Elevator                 6.042
                         [5.937, 6.149]
Emergency                1.330
                         [1.213, 1.462]
Facade Safety Program    0.277
                         [0.260, 0.296]
High Pressure Boiler     0.087
                         [0.084, 0.090]
Immediate Emergency      0.883
                         [0.837, 0.932]
Landmark Building        0.812
                         [0.783, 0.842]
Plumbing                 0.486
                         [0.461, 0.512]
Unsafe Building          0.758
                         [0.720, 0.798]
Zoning                   0.556
                         [0.517, 0.598]
Num.Obs.                 1006808
AIC                      627480.9
BIC                      627622.7
Log.Lik.                 -313728.427
F                        11649.695
RMSE                     0.31

Interpretation of Logit Models

As Zelner (2009) highlights, interpreting coefficients directly in logit models can be challenging because the relationship between the predictors and the outcome probability is non-linear. Instead, Zelner advocates examining the differences in predicted probabilities associated with discrete changes in the independent variables. This approach offers a more intuitive understanding of the impact of each variable on the outcome probability.
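To illustrate the idea with the numbers already reported, the sketch below converts the baseline odds and two odds ratios from the model summary table into predicted probabilities and their differences. It uses only the rounded point estimates, so it is an approximation and not Zelner's simulation-based procedure; the object names (baseline_odds, odds_ratio, pred_prob, prob_diff) are illustrative.

```r
# Odds taken from the model summary table:
# baseline odds for the reference category and odds ratios for two other types.
baseline_odds <- 4.011          # Immediately Hazardous (exponentiated intercept)
odds_ratio <- c(
  "Immediately Hazardous" = 1,      # reference category, ratio of 1 by definition
  "Elevator"              = 6.042,
  "High Pressure Boiler"  = 0.087
)

# Odds for each type = baseline odds * odds ratio
odds <- baseline_odds * odds_ratio

# Convert odds to probabilities: p = odds / (1 + odds)
pred_prob <- odds / (1 + odds)

# Discrete change in predicted probability relative to the reference category
prob_diff <- pred_prob - pred_prob["Immediately Hazardous"]

print(round(pred_prob, 2))
print(round(prob_diff, 2))
```

On these rounded inputs, the predicted probability of resolution is roughly 0.80 for Immediately Hazardous violations, 0.96 for Elevator violations, and 0.26 for High Pressure Boiler violations. With the fitted object available, predict(lglm, newdata = ..., type = "response") should give the same quantities without rounding.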

Survival Analysis using modelsummary

Data Preparation:

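  • This step mirrors the previous data preparation but adds the filter time_to_event > 0. Unresolved violations have no disposition date, so their time_to_event is missing and the filter drops them, along with zero-day and negative durations. Every remaining record therefore has event = 1.
  • Survival: A survival object is constructed to model violation resolution times.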
DOB_data <- raw_data %>%
  mutate(
    ISSUE_DATE = ymd(ISSUE_DATE),
    DISPOSITION_DATE = ymd(DISPOSITION_DATE),
    BORO = factor(BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")),
    VIOLATION_TYPE_CODE = as.factor(VIOLATION_TYPE_CODE),
    time_to_event = as.numeric(DISPOSITION_DATE - ISSUE_DATE),
    event = ifelse(!is.na(DISPOSITION_DATE), 1, 0)
    ) %>%
  filter(
    !is.na(ISSUE_DATE),
    !is.na(BORO),
    between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")),
    VIOLATION_TYPE_CODE %in% selected_violation_types,
    time_to_event > 0
    )

surv_obj <- Surv(time = DOB_data$time_to_event, event = DOB_data$event)
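The effect of the time_to_event > 0 filter can be checked on a few made-up records. The sketch below uses base R and hypothetical dates that are not from the DOB data. subset() treats a missing condition as FALSE, and dplyr's filter() likewise drops rows where the condition is NA, so the toy result should mirror what happens in the pipeline above.

```r
# Hypothetical toy records (not from the DOB data): two resolved violations,
# one same-day resolution, and one unresolved violation with no disposition date.
toy <- data.frame(
  ISSUE_DATE       = as.Date(c("2020-01-01", "2020-03-15", "2021-06-01", "2022-02-10")),
  DISPOSITION_DATE = as.Date(c("2020-02-01", "2021-03-15", "2021-06-01", NA))
)

# Same derived variables as in the preparation step
toy$time_to_event <- as.numeric(toy$DISPOSITION_DATE - toy$ISSUE_DATE)
toy$event <- ifelse(!is.na(toy$DISPOSITION_DATE), 1, 0)

# subset() treats an NA condition as FALSE, so the unresolved row
# (time_to_event = NA) is dropped along with the zero-day row.
kept <- subset(toy, time_to_event > 0)

print(kept[, c("time_to_event", "event")])
print(table(kept$event))
```

Only the two resolved records with positive durations remain, and both have event = 1. The survival object built from data filtered this way therefore contains no censored observations.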

Methodology

To assess how long violations remain unresolved across boroughs, four parametric survival models were fitted: Exponential, Weibull, Log-Logistic, and Log-Normal. Each model estimates the effect of borough on time-to-resolution:

# Fit one accelerated failure time model per candidate distribution.
# "exponential" is the full distribution name used by survreg.
model_list <- list(
  "Exponential" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "exponential"),
  "Weibull" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "weibull"),
  "Log-Logistic" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "loglogistic"),
  "Log-Normal" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "lognormal")
)

# Manhattan is the reference borough, so the intercept is labeled with its name.
BORO_rename <- c(
  `(Intercept)` = "Manhattan",
  `BOROBronx` = "Bronx",
  `BOROBrooklyn` = "Brooklyn",
  `BOROQueens` = "Queens",
  `BOROStaten Island` = "Staten Island",
  `Log(scale)` = "Log Scale"
)

# Report standard errors beneath each coefficient and omit RMSE and Num.Obs.
modelsummary(model_list,
             statistic = "{std.error}",
             gof_omit = 'RMSE|Num.Obs.',
             coef_rename = BORO_rename,
             output = "gt",
             fmt = 2)

                Exponential      Weibull   Log-Logistic   Log-Normal
Manhattan              6.60         6.49           5.90         5.88
                       0.00         0.00           0.00         0.00
Bronx                  0.19         0.20           0.23         0.18
                       0.00         0.00           0.00         0.00
Brooklyn               0.20         0.19           0.11         0.06
                       0.00         0.00           0.00         0.00
Queens                 0.17         0.16           0.10         0.06
                       0.00         0.00           0.00         0.00
Staten Island          0.10         0.08          -0.07        -0.17
                       0.01         0.01           0.01         0.01
Log Scale                           0.19          -0.23         0.32
                                    0.00           0.00         0.00
AIC              13417248.6   13356479.0     13377314.5   13360554.2
BIC              13417307.0   13356549.1     13377384.5   13360624.3
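For reference, survreg fits accelerated failure time models, so the coefficients reported above are on the log-time scale. A sketch of the common form follows; the notation ($T_i$, $\beta_b$, $\sigma$, $\varepsilon_i$) is introduced here for illustration and is not named in the original analysis.

```latex
\[
\log T_i = \beta_0 + \sum_{b \neq \text{Manhattan}} \beta_b \,\mathbf{1}[\text{BORO}_i = b] + \sigma\,\varepsilon_i
\]
```

The four models differ only in the assumed distribution of $\varepsilon_i$: extreme value for the Weibull model, logistic for the Log-Logistic model, and normal for the Log-Normal model. The Exponential model is the Weibull model with $\sigma$ fixed at 1, which is why it has no Log Scale entry. Under this form, $\exp(\beta_b)$ is the ratio of resolution times in borough $b$ to those in Manhattan.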

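Key Findings

  • Model Fit: The Weibull model has the lowest AIC and BIC, followed by the Log-Normal, Log-Logistic, and Exponential models.
  • Borough Effects: Relative to Manhattan, the Bronx, Brooklyn, and Queens have positive coefficients in all four models, which indicates longer resolution times.
  • Staten Island: The coefficient is positive under the Exponential and Weibull models (0.10 and 0.08) but negative under the Log-Logistic and Log-Normal models (-0.07 and -0.17), so its direction depends on the assumed distribution.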
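Because the survival coefficients are on the log-time scale, exponentiating them gives time ratios relative to Manhattan. The sketch below copies the rounded values from the survival model table instead of extracting them from the fitted objects, so the results are approximate; the object names (coefs, aic, time_ratio, best_model) are illustrative.

```r
# Coefficients and AIC values copied from the survival model table
coefs <- rbind(
  "Bronx"         = c(0.19, 0.20, 0.23, 0.18),
  "Brooklyn"      = c(0.20, 0.19, 0.11, 0.06),
  "Queens"        = c(0.17, 0.16, 0.10, 0.06),
  "Staten Island" = c(0.10, 0.08, -0.07, -0.17)
)
colnames(coefs) <- c("Exponential", "Weibull", "Log-Logistic", "Log-Normal")

aic <- c("Exponential" = 13417248.6, "Weibull" = 13356479.0,
         "Log-Logistic" = 13377314.5, "Log-Normal" = 13360554.2)

# Time ratios relative to Manhattan: values above 1 mean longer resolution times
time_ratio <- exp(coefs)
print(round(time_ratio, 2))

# Model with the lowest AIC
best_model <- names(which.min(aic))
print(best_model)
```

Under the Weibull model, for example, the Bronx coefficient of 0.20 corresponds to a time ratio of about 1.22, that is, resolution times roughly 22% longer than in Manhattan.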

Final Conclusion

This dual-model analysis highlights clear borough-level disparities and violation-specific trends in NYC DOB’s resolution practices. The Bronx and Brooklyn require targeted attention to reduce prolonged violation durations, while Staten Island shows inconsistent effects across models that may warrant further localized study. Certain violation types are associated with significantly lower odds of resolution than the Immediately Hazardous baseline, pointing to potential enforcement or compliance challenges.

Zelner, Bennet A. 2009. “Using Simulation to Interpret Results from Logit, Probit, and Other Nonlinear Models.” Strategic Management Journal 30 (12): 1335–48.