Introduction

This study explores the resolution patterns of building violations issued by the New York City Department of Buildings (DOB). Using publicly available data from NYC Open Data, the analysis seeks to identify borough-specific disparities in violation resolution times and to examine how different violation types affect resolution outcomes. Using both survival analysis and logit regression, the goal is to understand which boroughs or violation types are associated with prolonged or efficient resolution, providing insight into building safety and code enforcement efficiency.

library(modelsummary)
library(lubridate)
library(tidyverse)
library(tinytable)
library(survival)
library(ggplot2)
library(margins)

Logit Regression using datasummary and modelplot

Data Preparation:

Violation Types: 12 specific violation codes were selected based on the NYC DOB Code Book.
Date Formatting: ISSUE_DATE and DISPOSITION_DATE were converted to proper date formats.
Categorical Variables: Borough codes were labeled (e.g., Manhattan, Bronx), and VIOLATION_TYPE_CODE was treated as a factor so that each code enters the models as a category.
New Variables: time_to_event is the number of days between violation issuance and resolution; it is missing for unresolved violations. event marks resolved (1) and unresolved (0) cases.
Filtering Criteria: Only violations issued from January 1, 2000, to December 31, 2025, with a valid issue date and borough, were included.

selected_violation_types <- c("AEUHAZ1", "B", "C", "E", "EGNCY", "FISP", "HBLVIO", "IMEGNCY", "LANDMK", "P", "UB", "Z")

DOB_BI_data <- raw_data %>%
  mutate(
    ISSUE_DATE = ymd(ISSUE_DATE),
    DISPOSITION_DATE = ymd(DISPOSITION_DATE),
    BORO = factor(BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")),
    VIOLATION_TYPE_CODE = as.factor(VIOLATION_TYPE_CODE),
    time_to_event = as.numeric(DISPOSITION_DATE - ISSUE_DATE),
    event = ifelse(!is.na(DISPOSITION_DATE), 1, 0)
  ) %>%
  filter(
    !is.na(ISSUE_DATE),
    !is.na(BORO),
    between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")),
    VIOLATION_TYPE_CODE %in% selected_violation_types
  )

DOB_BI_data <- DOB_BI_data %>%
  select(c(BORO, VIOLATION_TYPE_CODE, time_to_event, event))

unique(DOB_BI_data$event)

## [1] 1 0

Methodology

The first modeling approach involves a Logit regression, focusing on how violation types influence the likelihood of resolution (event = 1). This model retains violations with unresolved statuses (event = 0), making it suitable for evaluating current enforcement patterns.
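For reference, the logit specification fitted in this section can be written as follows. This is a sketch of the standard form implied by the formula event ~ VIOLATION_TYPE_CODE; the symbols p_i, beta_0, and beta_k are notation introduced here, not taken from the original report.

```latex
\operatorname{logit}(p_i)
  = \log\frac{p_i}{1 - p_i}
  = \beta_0 + \sum_{k \neq \text{AEUHAZ1}} \beta_k \,\mathbf{1}[\text{type}_i = k],
\qquad p_i = \Pr(\text{event}_i = 1)
```

Here beta_0 is the log-odds of resolution for the baseline code (AEUHAZ1, Immediately Hazardous), and each beta_k is the difference in log-odds between violation type k and that baseline.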

Key Findings: Logit Regression (datasummary and modelplot)

Data Summary Snapshot:

Mean Resolution Time: ~812 days (median: 393 days)
Violation Resolution Rate: ~90%

Largest Violation Volumes:

Manhattan (44.1%)
Brooklyn (24.3%)
Queens (17.2%)
Bronx (11.8%)
Staten Island (2.5%)

datasummary_skim(DOB_BI_data)

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
time_to_event	7732	13	811.8	1047.3	0.0	393.0	9119.0
event	2	0	0.9	0.3	0.0	1.0	1.0
BORO	N	%
Manhattan	443966	44.1
Bronx	119296	11.8
Brooklyn	245139	24.3
Queens	173364	17.2
Staten Island	25043	2.5

vc <- c('VIOLATION_TYPE_CODEZ' = 'Zoning', 
        'VIOLATION_TYPE_CODEUB' = 'Unsafe Building',
        'VIOLATION_TYPE_CODEP' = 'Plumbing',
        'VIOLATION_TYPE_CODELANDMK' = 'Landmark Building',
        'VIOLATION_TYPE_CODEIMEGNCY' = 'Immediate Emergency',
        'VIOLATION_TYPE_CODEHBLVIO' = 'High Pressure Boiler',
        'VIOLATION_TYPE_CODEFISP' = 'Facade Safety Program',
        'VIOLATION_TYPE_CODEEGNCY' = 'Emergency',
        'VIOLATION_TYPE_CODEE' = 'Elevator',
        'VIOLATION_TYPE_CODEC' = 'Construction',
        'VIOLATION_TYPE_CODEB' = 'Boiler',
        '(Intercept)' = 'Immediately Hazardous')

lglm <- glm(event ~ VIOLATION_TYPE_CODE,
            family = binomial(link = "logit"), 
            data = DOB_BI_data)

modelplot(lglm, coef_map = vc) +
  aes(color = ifelse(p.value < 0.05, "Significant", "Not Significant")) +
  scale_color_manual(
    values = c("Significant" = "red", "Not Significant" = "gray"),
    name = "p-value"
  )

Model Plot Analysis

The model plot visualizes the coefficient estimates and 95% confidence intervals for each violation type. The coefficients represent the change in the log-odds of a violation being resolved for each violation type, relative to the baseline category (Immediately Hazardous).
- Violation types with confidence intervals that do not cross zero are considered statistically significant at the 0.05 level.
- Coefficients to the right of zero indicate a positive association with violation resolution, while coefficients to the left of zero indicate a negative association.
- The magnitude of the coefficient indicates the strength of the association.

Model Summary Analysis

The model summary provides a table of exponentiated coefficients (odds ratios), confidence intervals, and other model statistics.
- The Immediately Hazardous row is the exponentiated intercept, that is, the odds of resolution for the baseline category rather than an odds ratio.
- For the other rows, odds ratios greater than 1 indicate that the violation type is associated with a higher likelihood of resolution than the baseline category, while odds ratios less than 1 indicate a lower likelihood of resolution.
- The confidence intervals provide a range of plausible values for the odds ratios.
- The AIC and BIC values can be used to compare the fit of different models.

	(1)
Immediately Hazardous	4.011
	[3.962, 4.061]
Boiler	0.267
	[0.223, 0.321]
Construction	0.412
	[0.406, 0.419]
Elevator	6.042
	[5.937, 6.149]
Emergency	1.330
	[1.213, 1.462]
Facade Safety Program	0.277
	[0.260, 0.296]
High Pressure Boiler	0.087
	[0.084, 0.090]
Immediate Emergency	0.883
	[0.837, 0.932]
Landmark Building	0.812
	[0.783, 0.842]
Plumbing	0.486
	[0.461, 0.512]
Unsafe Building	0.758
	[0.720, 0.798]
Zoning	0.556
	[0.517, 0.598]
Num.Obs.	1006808
AIC	627480.9
BIC	627622.7
Log.Lik.	-313728.427
F	11649.695
RMSE	0.31
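A minimal sketch of how exponentiated coefficients and Wald intervals like those in the table above can be obtained from a fitted logit. It uses the built-in mtcars data purely as a stand-in (the names toy, fit, odds_scale, and ci are illustrative), so the numbers it prints are unrelated to the DOB results, and it is not necessarily how the table in this report was produced.

```r
# Toy data only: mtcars ships with base R and is unrelated to the DOB data.
toy <- transform(mtcars, cyl = factor(cyl))

# Logit of a binary outcome on a single factor; cyl = 4 is the baseline level.
fit <- glm(am ~ cyl, family = binomial(link = "logit"), data = toy)

# exp() of the intercept is the baseline odds; exp() of each remaining
# coefficient is an odds ratio relative to the baseline level.
odds_scale <- exp(coef(fit))

# Wald 95% intervals computed on the log-odds scale, then exponentiated.
ci <- exp(confint.default(fit))

result <- cbind(estimate = odds_scale, ci)
print(round(result, 3))
```

The first row of the result plays the same role as the Immediately Hazardous row above: it is an odds, while the remaining rows are odds ratios.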

Interpretation of Logit Models

As Zelner (2009) highlights, interpreting coefficients directly in logit models can be challenging because the relationship between the coefficients and the outcome probability is non-linear. Instead, Zelner advocates examining the differences in predicted probabilities associated with discrete changes in the independent variables. This approach offers a more intuitive understanding of the impact of each variable on the outcome probability.
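As a rough illustration of that idea, the sketch below converts a few rounded estimates from the odds-ratio table in this report into predicted probabilities and differences from the baseline. The variable names are illustrative, and because the inputs are rounded to three decimals the results are approximate.

```r
# Estimates copied from the odds-ratio table in this report (rounded).
baseline_odds <- 4.011  # intercept row: odds of resolution, Immediately Hazardous
odds_ratio <- c("Elevator" = 6.042,
                "Emergency" = 1.330,
                "High Pressure Boiler" = 0.087)

# Odds for each type = baseline odds x odds ratio; probability = odds / (1 + odds).
odds <- c("Immediately Hazardous" = baseline_odds, baseline_odds * odds_ratio)
prob <- odds / (1 + odds)

# Discrete change in predicted probability relative to the baseline type.
diff_vs_baseline <- prob - prob[["Immediately Hazardous"]]

print(round(cbind(probability = prob, difference = diff_vs_baseline), 3))
```

With these inputs the predicted probability of resolution is roughly 0.80 for Immediately Hazardous, 0.96 for Elevator, 0.84 for Emergency, and 0.26 for High Pressure Boiler violations. The probability scale makes the gap for High Pressure Boiler violations much easier to read than the odds ratio of 0.087.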

Survival Analysis using modelsummary

Data Preparation:

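This step mirrors the previous data preparation but adds the filter time_to_event > 0. Because time_to_event is missing for unresolved violations (event = 0), the filter removes them along with any zero or negative durations, so only resolved violations remain.
Survival: A survival object is constructed to model violation resolution times.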

DOB_data <- raw_data %>%
  mutate(
    ISSUE_DATE = ymd(ISSUE_DATE),
    DISPOSITION_DATE = ymd(DISPOSITION_DATE),
    BORO = factor(BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")),
    VIOLATION_TYPE_CODE = as.factor(VIOLATION_TYPE_CODE),
    time_to_event = as.numeric(DISPOSITION_DATE - ISSUE_DATE),
    event = ifelse(!is.na(DISPOSITION_DATE), 1, 0)
    ) %>%
  filter(
    !is.na(ISSUE_DATE),
    !is.na(BORO),
    between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")),
    VIOLATION_TYPE_CODE %in% selected_violation_types,
    time_to_event > 0
    )

surv_obj <- Surv(time = DOB_data$time_to_event, event = DOB_data$event)
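The effect of the time_to_event > 0 filter above can be checked on a small example. The sketch below uses four hypothetical rows (the tibble toy and its values are made up for illustration) to show which cases survive the filter.

```r
library(dplyr)

# Hypothetical rows: one resolved, one unresolved (NA duration), one same-day
# resolution, and one whose disposition date precedes the issue date.
toy <- tibble(
  id = 1:4,
  time_to_event = c(120, NA, 0, -5),
  event = c(1, 0, 1, 1)
)

# filter() keeps only rows where the condition is TRUE. NA > 0 evaluates to NA,
# so the unresolved row is dropped together with the zero and negative durations.
kept <- toy %>% filter(time_to_event > 0)

print(kept)
print(unique(kept$event))
```

Only the first row remains, and every remaining row has event = 1. Under this filter the survival object therefore contains no censored observations, which is worth keeping in mind when reading the model results.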

Methodology

To assess how long violations remain unresolved across boroughs, four parametric survival models were fitted: Exponential, Weibull, Log-Logistic, and Log-Normal. Each model evaluated the effect of borough on time-to-resolution:

model_list <- list(
  # dist is spelled out in full rather than relying on partial matching of "exp"
  "Exponential" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "exponential"),
  "Weibull" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "weibull"),
  "Log-Logistic" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "loglogistic"),
  "Log-Normal" = survreg(surv_obj ~ BORO, data = DOB_data, dist = "lognormal")
)
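The four survreg fits above share the same accelerated failure time form and differ only in the assumed error distribution. The notation below (T_i, beta, sigma, epsilon_i) is introduced here for illustration and does not appear in the original report.

```latex
\log T_i = \beta_0 + \sum_{b \neq \text{Manhattan}} \beta_b \,\mathbf{1}[\text{BORO}_i = b] + \sigma\,\varepsilon_i,
\qquad
\varepsilon_i \sim
\begin{cases}
\text{standard extreme value (minimum)} & \text{Weibull; Exponential fixes } \sigma = 1 \\
\text{standard logistic} & \text{Log-Logistic} \\
\text{standard normal} & \text{Log-Normal}
\end{cases}
```

Under this form each borough coefficient is a shift in log resolution time relative to Manhattan, and the Log(scale) term reported by survreg is the logarithm of sigma, which is why the Exponential model has no such term.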

BORO_rename <- c(
  `(Intercept)` = "Manhattan",
  `BOROBronx` = "Bronx",
  `BOROBrooklyn` = "Brooklyn",
  `BOROQueens` = "Queens",
  `BOROStaten Island` = "StatenIsland",
  `Log(scale)` = "Log Scale"
  )


modelsummary(model_list,
             statistic = "{std.error}",
             gof_omit = 'RMSE|Num.Obs.',
             coef_rename = BORO_rename,
             output = "gt",
             fmt = 2)

	Exponential	Weibull	Log-Logistic	Log-Normal
Manhattan	6.60	6.49	5.90	5.88
	0.00	0.00	0.00	0.00
Bronx	0.19	0.20	0.23	0.18
	0.00	0.00	0.00	0.00
Brooklyn	0.20	0.19	0.11	0.06
	0.00	0.00	0.00	0.00
Queens	0.17	0.16	0.10	0.06
	0.00	0.00	0.00	0.00
StatenIsland	0.10	0.08	-0.07	-0.17
	0.01	0.01	0.01	0.01
Log Scale		0.19	-0.23	0.32
		0.00	0.00	0.00
AIC	13417248.6	13356479.0	13377314.5	13360554.2
BIC	13417307.0	13356549.1	13377384.5	13360624.3

Key Findings

The Bronx had the largest or near-largest coefficient in every model (0.18 to 0.23); only in the Exponential model is Brooklyn's marginally larger (0.20 vs. 0.19).
Brooklyn followed closely in the Exponential and Weibull models (0.20 and 0.19), but its coefficient drops to 0.11 and 0.06 in the Log-Logistic and Log-Normal models.
Queens had moderately longer times than Manhattan (0.06 to 0.17), slightly below Brooklyn in three models and equal to it in the Log-Normal model.
Staten Island had mixed results: faster resolutions than Manhattan in the Log-Logistic and Log-Normal models (-0.07 and -0.17) but slower in the Exponential and Weibull models (0.10 and 0.08).
Coefficients are on the log-time scale relative to Manhattan, so higher values indicate longer resolution times; the Bronx and Brooklyn thus emerge as boroughs with persistent delays. The Weibull model has the lowest AIC and BIC of the four models, so it offers the best balance between goodness of fit and model simplicity among the candidates considered here.
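To make the magnitudes easier to read, the coefficients can be exponentiated into time ratios relative to Manhattan, and the AIC values can be expressed as differences from the best model. The sketch below copies the rounded Weibull coefficients and the AIC values from the table above; the variable names are illustrative.

```r
# Weibull column of the survival table in this report (rounded to 2 decimals).
weibull_coef <- c("Bronx" = 0.20, "Brooklyn" = 0.19,
                  "Queens" = 0.16, "Staten Island" = 0.08)

# Coefficients in an accelerated failure time model are on the log-time scale,
# so exp(coef) is a time ratio relative to the reference borough (Manhattan).
time_ratio <- exp(weibull_coef)
print(round(time_ratio, 2))

# AIC values copied from the same table, shown as differences from the minimum.
aic <- c("Exponential" = 13417248.6, "Weibull" = 13356479.0,
         "Log-Logistic" = 13377314.5, "Log-Normal" = 13360554.2)
print(sort(aic - min(aic)))
```

Under the Weibull model, a coefficient of 0.20 for the Bronx corresponds to resolution times about 22% longer than in Manhattan, and 0.08 for Staten Island to about 8% longer. The AIC differences confirm the Weibull model as the best of the four, with the Log-Normal model the closest alternative.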

Final Conclusion

This dual-model analysis highlights clear borough-level disparities and violation-specific trends in NYC DOB’s resolution practices. The Bronx and Brooklyn require targeted attention to reduce prolonged violation durations, while Staten Island shows inconsistency across models that may warrant further localized study. Certain violation types, notably High Pressure Boiler, Boiler, and Facade Safety Program violations, have markedly lower odds of resolution than the baseline category, pointing to potential enforcement or compliance challenges.

Zelner, Bennet A. 2009. “Using Simulation to Interpret Results from Logit, Probit, and Other Nonlinear Models.” Strategic Management Journal 30 (12): 1335–48.

Homework 08

Yung Ki Cho

2025-04-05
