Introduction

This analysis uses Department of Buildings (DOB) violation data from NYC OpenData, specifically the “DOB Violations” dataset. The goal is to determine which borough has the longest-open building violations, that is, the most prolonged safety issues for occupants. Censored observations correspond to unresolved violations, i.e., those with no disposition date. The dataset is available at NYC OpenData.

library(survival)
library(dplyr)
library(lubridate)
library(modelsummary)
library(knitr)
library(kableExtra)
library(ggplot2)
library(scales)
library(survminer)
library(clarify)

Data Wrangling

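# raw_data is assumed to hold the "DOB Violations" export, already read into R

# Parse the date columns and label the borough and violation-type codes
raw_data$ISSUE_DATE <- ymd(raw_data$ISSUE_DATE)
raw_data$DISPOSITION_DATE <- ymd(raw_data$DISPOSITION_DATE)
raw_data$BORO <- factor(raw_data$BORO, levels = 1:5,
                        labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"))
raw_data$VIOLATION_TYPE_CODE <- as.factor(raw_data$VIOLATION_TYPE_CODE)

# Days from issuance to disposition; event = 1 if resolved, 0 if still open.
# Open violations have no DISPOSITION_DATE, so their time_to_event is NA.
raw_data$time_to_event <- as.numeric(raw_data$DISPOSITION_DATE - raw_data$ISSUE_DATE)
raw_data$event <- ifelse(!is.na(raw_data$DISPOSITION_DATE), 1, 0)

# Violation types retained for the analysis
selected_violation_types <- c("AEUHAZ1", "B", "C", "E", "EGNCY", "FISP",
                              "HBLVIO", "IMEGNCY", "LANDMK", "P", "UB", "Z")

# Keep violations with a valid issue date in 2000-2025, a selected type, and a
# positive duration. filter() also drops rows whose time_to_event is NA.
DOB_data <- raw_data %>%
  filter(!is.na(ISSUE_DATE),
         between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")),
         VIOLATION_TYPE_CODE %in% selected_violation_types,
         time_to_event > 0)

# Survival object: follow-up time in days plus the resolution indicator
surv_obj <- Surv(time = DOB_data$time_to_event, event = DOB_data$event)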
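
With the wrangling above, a violation without a DISPOSITION_DATE gets an NA duration, and the time_to_event > 0 filter then removes it. One way to keep such violations as right-censored observations is to measure their follow-up to an end-of-study date. The sketch below illustrates the idea on a hypothetical three-row data frame; toy and censor_date are assumptions for illustration, and the appropriate censoring date would be whenever the dataset was actually extracted.

```r
library(survival)

# Hypothetical toy data: two resolved violations and one still open (NA)
toy <- data.frame(
  ISSUE_DATE       = as.Date(c("2020-01-01", "2021-06-15", "2022-03-01")),
  DISPOSITION_DATE = as.Date(c("2020-03-01", NA, "2022-03-11"))
)

# Assumed end of follow-up; not taken from the original analysis
censor_date <- as.Date("2023-01-01")

# event = 1 if the violation was resolved, 0 if it is still open
toy$event <- as.integer(!is.na(toy$DISPOSITION_DATE))

# Resolved rows end at their disposition date, open rows at the censoring date
end_date <- toy$DISPOSITION_DATE
end_date[is.na(end_date)] <- censor_date
toy$time_to_event <- as.numeric(end_date - toy$ISSUE_DATE)

# Censored durations print with a trailing "+"
toy_surv <- Surv(time = toy$time_to_event, event = toy$event)
print(toy_surv)
```

With this construction every row has a positive, non-missing duration, so a later time_to_event > 0 filter would no longer discard the open violations.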

Methodology

To determine which borough is the least safe for building occupants, we compare violation survival time (the number of days between a violation's issuance and its resolution) across boroughs using survival analysis. Four parametric survival models (Exponential, Weibull, Log-Normal, and Log-Logistic) are fitted with borough as the only covariate and compared on their fit to the data.
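
The comparison described above could be set up along the following lines. This is a minimal sketch on simulated data, not the fitting code behind the reported results: the data frame sim, the list fits, and the Weibull parameters used for simulation are hypothetical, and survreg from the survival package is assumed to be the fitting function.

```r
library(survival)

# Simulated stand-in for DOB_data (hypothetical values, two boroughs only)
set.seed(1)
n <- 500
sim <- data.frame(
  BORO = factor(sample(c("Manhattan", "Bronx"), n, replace = TRUE),
                levels = c("Manhattan", "Bronx"))
)
sim$time_to_event <- rweibull(n, shape = 0.8,
                              scale = ifelse(sim$BORO == "Bronx", 800, 650))
sim$event <- 1  # every simulated violation is resolved

# The four candidate distributions, keyed by the labels used in the text
dists <- c(Exponential = "exponential", Weibull = "weibull",
           `Log-Normal` = "lognormal", `Log-Logistic` = "loglogistic")

# Fit one parametric survival model per distribution
fits <- lapply(dists, function(d) {
  survreg(Surv(time_to_event, event) ~ BORO, data = sim, dist = d)
})

# Collect AIC and log-likelihood; a lower AIC indicates a better fit
fit_table <- data.frame(
  Model  = names(fits),
  AIC    = sapply(fits, AIC),
  LogLik = sapply(fits, function(m) as.numeric(logLik(m))),
  row.names = NULL
)
print(fit_table[order(fit_table$AIC), ])
```

Keeping the fits in a named list makes it straightforward to pass them to a table function or to pull out a single model for prediction.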

Key findings:

                     Exponential   Weibull       Log-Normal    Log-Logistic
(Intercept)          6.598***      6.491***      5.880***      5.905***
                     (0.002)       (0.002)       (0.002)       (0.002)
BOROBronx            0.192***      0.200***      0.184***      0.226***
                     (0.004)       (0.004)       (0.005)       (0.005)
BOROBrooklyn         0.201***      0.191***      0.059***      0.105***
                     (0.003)       (0.003)       (0.004)       (0.004)
BOROQueens           0.169***      0.161***      0.064***      0.096***
                     (0.003)       (0.004)       (0.004)       (0.004)
BOROStaten Island    0.100***      0.083***      -0.166***     -0.068***
                     (0.007)       (0.009)       (0.010)       (0.011)
Log(scale)                         0.191***      0.323***      -0.233***
                                   (0.001)       (0.001)       (0.001)
Num.Obs.             871725        871725        871725        871725
AIC                  13417248.6    13356479.0    13360554.2    13377314.5
BIC                  13417307.0    13356549.1    13360624.3    13377384.5
RMSE                 1044.77       1048.29       1134.82       1127.37

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors in parentheses. Manhattan is the reference borough.

AIC and Log-Likelihood

  • The Weibull model has the lowest AIC (13356479) and the highest log-likelihood (-6678233) of the four models, so it provides the best fit.
    Goodness-of-Fit Table: AIC and Log-Likelihood
    Model          AIC        LogLik
    Exponential    13417249   -6708619
    Weibull        13356479   -6678233
    Log-Normal     13360554   -6680271
    Log-Logistic   13377314   -6688651
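
The two columns of the goodness-of-fit table are linked by AIC = 2k - 2 logLik, where k is the number of estimated parameters. The short check below recomputes AIC from the reported log-likelihoods. The parameter counts are an assumption inferred from the coefficient table: five regression coefficients per model, plus one scale parameter for every model except the Exponential.

```r
# Reported log-likelihoods from the goodness-of-fit table
loglik <- c(Exponential = -6708619, Weibull = -6678233,
            `Log-Normal` = -6680271, `Log-Logistic` = -6688651)

# Assumed parameter counts: 5 coefficients, plus a scale term where estimated
k <- c(Exponential = 5, Weibull = 6, `Log-Normal` = 6, `Log-Logistic` = 6)

# AIC = 2k - 2 * logLik
aic <- 2 * k - 2 * loglik
print(aic)

# Model with the smallest AIC
best_model <- names(which.min(aic))
print(best_model)  # "Weibull"
```

The recomputed values agree with the reported AICs to within about one unit, which is the rounding error in the printed log-likelihoods.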

An alternative to clarify

The predict function is a simple alternative to clarify's sim_ame when the quantity of interest is a predicted value rather than a marginal effect. sim_ame estimates average marginal effects by simulation, whereas predict computes fitted values, and optionally their standard errors, directly from the model. It also has methods for many model classes, including survival and logistic regression models.

# Create a new data frame with one row per borough
new_data <- data.frame(
  BORO = factor(c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"),
                levels = levels(DOB_data$BORO))
)

# Predict survival times for each borough; model2 is the fitted Weibull model
predicted <- predict(model2, newdata = new_data, type = "response", se.fit = TRUE)

# Combine predictions with BORO levels
results <- data.frame(
  BORO = new_data$BORO,
  Predicted = predicted$fit,
  SE = predicted$se.fit
)

# View the results
print(results)
##            BORO Predicted   SE
## 1     Manhattan       659 1.28
## 2         Bronx       805 3.07
## 3      Brooklyn       798 2.20
## 4        Queens       774 2.52
## 5 Staten Island       716 6.21
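
These predictions can be checked by hand against the Weibull column of the coefficient table: for a survreg fit, type = "response" returns the exponentiated linear predictor, so each value is exp(intercept + borough coefficient). The sketch below redoes that arithmetic with the rounded coefficients from the table; the object names are illustrative only.

```r
# Weibull coefficients as reported (Manhattan is the reference level)
intercept <- 6.491
boro_coef <- c(Manhattan = 0, Bronx = 0.200, Brooklyn = 0.191,
               Queens = 0.161, `Staten Island` = 0.083)

# Exponentiate the linear predictor to return to the scale of days
predicted_days <- round(exp(intercept + boro_coef))
print(predicted_days)
# Manhattan 659, Bronx 805, Brooklyn 798, Queens 774, Staten Island 716
```

For a Weibull model this quantity is the distribution's scale parameter, i.e., the time by which roughly 63% of violations are expected to be resolved, rather than the mean or median duration.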

Predicted Survival Times - Using the Weibull model

  • The Bronx has the highest predicted survival time (805 days), indicating the slowest resolution.
  • Manhattan has the shortest predicted survival time (659 days), indicating the fastest resolution.
  • Staten Island has the largest standard error (6.21), so its prediction is the least precise; this is consistent with it having the fewest violations.
  • A bar plot visualizes these predictions with error bars highlighting uncertainty.
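
A bar plot of this kind could be drawn roughly as follows. This is a sketch that rebuilds the results table from the printed output. Drawing the error bars at plus or minus 1.96 standard errors is an assumption, since the text does not say what the bars represent, and the colour and labels are arbitrary choices.

```r
library(ggplot2)

# Predictions and standard errors copied from the printed results
boroughs <- c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")
results <- data.frame(
  BORO      = factor(boroughs, levels = boroughs),
  Predicted = c(659, 805, 798, 774, 716),
  SE        = c(1.28, 3.07, 2.20, 2.52, 6.21)
)

# Bars show predicted days; error bars span an assumed +/- 1.96 SE
p <- ggplot(results, aes(x = BORO, y = Predicted)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = Predicted - 1.96 * SE,
                    ymax = Predicted + 1.96 * SE),
                width = 0.2) +
  labs(x = "Borough", y = "Predicted time to resolution (days)")
print(p)
```

Because the standard errors are small relative to the predictions, the error bars are barely visible at this scale, which is worth keeping in mind when reading the plot.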

Violation Counts by Borough

  • Manhattan has the highest number of violations (>400,000).
  • Brooklyn and Queens follow with moderate counts.
  • The Bronx has fewer violations than Brooklyn and Queens.
  • Staten Island records the lowest number of violations.
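
Counts like these can be tabulated directly from the borough factor. The sketch below uses a hypothetical six-row data frame in place of DOB_data; the column name n_violations is an illustrative choice.

```r
library(dplyr)

# Hypothetical stand-in for DOB_data with only the borough column
toy <- data.frame(
  BORO = factor(c("Manhattan", "Manhattan", "Manhattan",
                  "Brooklyn", "Brooklyn", "Queens"),
                levels = c("Manhattan", "Bronx", "Brooklyn",
                           "Queens", "Staten Island"))
)

# Count violations per borough, largest first
borough_counts <- toy %>%
  count(BORO, sort = TRUE, name = "n_violations")
print(borough_counts)
```

By default count() omits boroughs with no rows, so the Bronx and Staten Island do not appear in this toy result.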

Survival Curves

  • Survival curves are plotted for violations resolved within 3650 days (10 years).
  • All boroughs show a decline in survival probability over time, with most violations resolved within the first few years.
  • Manhattan and the Bronx exhibit slightly higher survival probabilities early on, indicating slower resolution rates.
  • Brooklyn, Queens, and Staten Island demonstrate faster resolution trends.
  • The highest predicted survival time is 805 days (the Bronx, under the Weibull model), but the long tail of the curves suggests that some violations take much longer to resolve.
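
Curves of this kind are typically Kaplan-Meier estimates stratified by borough. The sketch below shows one way to produce them using survfit and base graphics from the survival package on simulated data. The data frame sim and its parameters are hypothetical, and base plotting is used here for simplicity; it is not necessarily how the original figure was drawn.

```r
library(survival)

# Simulated stand-in for DOB_data (hypothetical values, three boroughs)
set.seed(42)
n <- 300
sim <- data.frame(
  BORO = factor(rep(c("Manhattan", "Bronx", "Brooklyn"), each = n / 3))
)
true_time <- rweibull(n, shape = 0.8, scale = 700)

# Durations beyond 4000 days are treated as still open (right-censored)
sim$time_to_event <- pmin(true_time, 4000)
sim$event <- as.integer(true_time <= 4000)

# Kaplan-Meier estimate of the share of violations still open, per borough
km_fit <- survfit(Surv(time_to_event, event) ~ BORO, data = sim)

# Restrict the x-axis to the first 3650 days (10 years)
plot(km_fit, xlim = c(0, 3650), col = 1:3,
     xlab = "Days since issuance",
     ylab = "Proportion of violations still open")
legend("topright", legend = levels(sim$BORO), col = 1:3, lty = 1)
```

Restricting the axis with xlim only changes what is displayed; the estimate itself still uses every observation, including those with longer durations.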

Conclusion

The analysis highlights borough-level differences in violation resolution times:

  • Violations in the Bronx take the longest to resolve (a predicted 805 days), closely followed by Brooklyn (798 days).
  • Manhattan resolves violations more quickly than the other boroughs (a predicted 659 days).
  • Staten Island's prediction carries the most uncertainty, as shown by its larger standard error.

These insights can guide resource allocation to address long-standing violations more effectively in specific boroughs, such as the Bronx, where violations stay open longest, and Manhattan, which has the most violations.