Introduction

The analysis focuses on Department of Buildings (DOB) violation data obtained from NYC OpenData, specifically the “DOB Violations” dataset. The goal is to determine which borough has the longest-open building violations, i.e., the most prolonged safety issues for occupants. Unresolved violations are treated as censored observations. The dataset is available at NYC OpenData.

library(survival)
library(dplyr)
library(lubridate)
library(modelsummary)
library(knitr)
library(kableExtra)
library(ggplot2)
library(scales)
library(survminer)
library(clarify)

Data Wrangling

Converting date variables (ISSUE_DATE and DISPOSITION_DATE) to proper formats.
Transforming borough codes into categorical names (e.g., Manhattan, Bronx).
Filtering valid violation codes and removing missing or invalid data.
Creating a time-to-event variable (days between issuance and disposition) and an event indicator (1 for resolved violations, 0 for censored).
Filtering violations within the date range of January 1, 2000, to December 31, 2025.
Constructing a survival object to model violation resolution times.

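# Parse the date columns with lubridate's ymd()
raw_data$ISSUE_DATE <- ymd(raw_data$ISSUE_DATE)
raw_data$DISPOSITION_DATE <- ymd(raw_data$DISPOSITION_DATE)

# Recode borough codes 1-5 as named factor levels (Manhattan is the reference level)
raw_data$BORO <- factor(raw_data$BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"))

raw_data$VIOLATION_TYPE_CODE <- as.factor(raw_data$VIOLATION_TYPE_CODE)

# Days from issuance to disposition; NA when there is no disposition date
raw_data$time_to_event <- as.numeric(raw_data$DISPOSITION_DATE - raw_data$ISSUE_DATE)

# Event indicator: 1 = resolved, 0 = censored (no disposition date)
raw_data$event <- ifelse(!is.na(raw_data$DISPOSITION_DATE), 1, 0)

# Drop rows whose issue date is missing or could not be parsed
raw_data <- raw_data %>%
  filter(!is.na(ISSUE_DATE))

# Keep violations issued between 2000-01-01 and 2025-12-31
DOB_data <- raw_data %>%
  filter(between(ISSUE_DATE, as.Date("2000-01-01"), as.Date("2025-12-31")))

# Keep the violation types of interest
selected_violation_types <- c("AEUHAZ1", "B", "C", "E", "EGNCY", "FISP", 
                              "HBLVIO", "IMEGNCY", "LANDMK", "P", "UB", "Z")

DOB_data <- DOB_data %>%
  filter(VIOLATION_TYPE_CODE %in% selected_violation_types)

# Keep strictly positive durations; filter() also drops rows where time_to_event is NA
DOB_data <- DOB_data %>%
  filter(time_to_event > 0)

# Survival object used by the models
surv_obj <- Surv(time = DOB_data$time_to_event, event = DOB_data$event)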
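
Because DISPOSITION_DATE - ISSUE_DATE is NA for a violation that has no disposition date, a censored observation only contributes to Surv() if it is given a follow-up time. The following is a minimal sketch of one way to do that, not the method used in this analysis. The toy rows and the study_end date are hypothetical; in practice the censoring date would usually be the date the data were extracted.

```r
library(survival)

# Hypothetical toy rows; the third violation is still open (no disposition date)
toy <- data.frame(
  ISSUE_DATE       = as.Date(c("2020-01-01", "2021-06-15", "2023-03-01")),
  DISPOSITION_DATE = as.Date(c("2020-03-01", "2022-06-15", NA))
)

# Assumption: follow-up for open violations ends on this date
study_end <- as.Date("2025-12-31")

# 1 = resolved, 0 = censored
toy$event <- as.integer(!is.na(toy$DISPOSITION_DATE))

# Resolved rows end at their disposition date, open rows at study_end
end_date <- toy$DISPOSITION_DATE
end_date[is.na(end_date)] <- study_end
toy$time_to_event <- as.numeric(end_date - toy$ISSUE_DATE)

toy_surv <- Surv(time = toy$time_to_event, event = toy$event)
print(toy_surv)
```

With this construction the open violation keeps a positive duration and an event value of 0, so a later filter on time_to_event > 0 no longer removes it.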

Methodology

To determine which borough is the least safe for building occupants, we evaluate violation survival time (the length of time between the building violation issuance and resolution) across boroughs using survival analysis techniques. Four survival models (Exponential, Weibull, Log-Normal, and Log-Logistic) are compared based on their fit to the data.
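
The fitting code is not shown in this report. As a rough sketch of how such a comparison might look, assuming the models are fitted with survival::survreg(), the block below fits the four distributions to a small simulated data set. The toy data frame, its two boroughs, and the Weibull parameters used to simulate it are hypothetical and stand in for DOB_data only for illustration.

```r
library(survival)

# Hypothetical simulated data standing in for DOB_data (all events observed)
set.seed(712)
toy <- data.frame(
  BORO = factor(rep(c("Manhattan", "Bronx"), each = 200),
                levels = c("Manhattan", "Bronx")),
  time_to_event = c(rweibull(200, shape = 0.8, scale = 650),
                    rweibull(200, shape = 0.8, scale = 800)),
  event = 1
)

# survreg() distribution names for the four models
dists <- c(Exponential = "exponential", Weibull = "weibull",
           "Log-Normal" = "lognormal", "Log-Logistic" = "loglogistic")

# Fit one model per distribution with the same formula
fits <- lapply(dists, function(d) {
  survreg(Surv(time_to_event, event) ~ BORO, data = toy, dist = d)
})

# Compare fit across distributions; a lower AIC indicates a better fit
aic <- sapply(fits, AIC)
print(round(aic, 1))
```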

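Key findings (coefficients are on the log-time scale, relative to Manhattan, the reference borough):

Bronx: The largest coefficient in the Weibull, Log-Normal, and Log-Logistic models, so violations take the longest to resolve there; in the Exponential model Brooklyn is marginally higher (0.201 vs. 0.192).
Brooklyn: Resolution times are close to the Bronx in the Exponential and Weibull models but much closer to Manhattan in the Log-Normal and Log-Logistic models (0.059 and 0.105); the difference from Manhattan is statistically significant in all four.
Queens: Longer resolution times than Manhattan in every model (0.064 to 0.169), but shorter than the Bronx throughout.
Staten Island: Faster resolution than Manhattan in the Log-Normal and Log-Logistic models (negative coefficients) but slower in the Exponential and Weibull models.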

	Exponential	Weibull	Log-Normal	Log-Logistic
(Intercept)	6.598***	6.491***	5.880***	5.905***
	(0.002)	(0.002)	(0.002)	(0.002)
BOROBronx	0.192***	0.200***	0.184***	0.226***
	(0.004)	(0.004)	(0.005)	(0.005)
BOROBrooklyn	0.201***	0.191***	0.059***	0.105***
	(0.003)	(0.003)	(0.004)	(0.004)
BOROQueens	0.169***	0.161***	0.064***	0.096***
	(0.003)	(0.004)	(0.004)	(0.004)
BOROStaten Island	0.100***	0.083***	-0.166***	-0.068***
	(0.007)	(0.009)	(0.010)	(0.011)
Log(scale)		0.191***	0.323***	-0.233***
		(0.001)	(0.001)	(0.001)
Num.Obs.	871725	871725	871725	871725
AIC	13417248.6	13356479.0	13360554.2	13377314.5
BIC	13417307.0	13356549.1	13360624.3	13377384.5
RMSE	1044.77	1048.29	1134.82	1127.37
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

AIC and Log-Likelihood

The Weibull model has the lowest AIC (13356479) and the highest log-likelihood (-6678233) of the four models, so it provides the best fit.

Goodness-of-Fit Table: AIC and Log-Likelihood
Model	AIC	LogLik
Exponential	13417249	-6708619
Weibull	13356479	-6678233
Log-Normal	13360554	-6680271
Log-Logistic	13377314	-6688651
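
The two columns of the table are linked by AIC = -2 * logLik + 2 * k, where k is the number of estimated parameters. As a quick arithmetic check, the sketch below recomputes AIC from the reported log-likelihoods. The parameter counts are an assumption read off the coefficient table (an intercept and four borough terms, plus Log(scale) for every model except the Exponential), and the log-likelihoods are rounded, so agreement is expected only to within about one unit.

```r
# Log-likelihoods as reported in the goodness-of-fit table (rounded)
loglik <- c(Exponential = -6708619, Weibull = -6678233,
            "Log-Normal" = -6680271, "Log-Logistic" = -6688651)

# Assumed parameter counts: intercept + 4 borough terms, plus Log(scale) where estimated
k <- c(Exponential = 5, Weibull = 6, "Log-Normal" = 6, "Log-Logistic" = 6)

# AIC = -2 * logLik + 2 * k
aic <- -2 * loglik + 2 * k
print(aic)

# Model with the smallest AIC
print(names(which.min(aic)))
```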

An alternative to the clarify package

The predict function is a suitable alternative to clarify's sim_ame because it computes predicted values directly and works with a wide range of models. While sim_ame is useful for calculating average marginal effects, predict is simpler and more efficient when only predictions are needed, for example in survival analysis or logistic regression.

# Create a new data frame with one row per borough
new_data <- data.frame(
  BORO = factor(c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"),
                levels = levels(DOB_data$BORO)))

# Predict survival times for each BORO
# (model2 is the fitted Weibull model; type = "response" returns predictions in days)
predicted <- predict(model2, newdata = new_data, type = "response", se.fit = TRUE)

# Combine predictions with BORO levels
results <- data.frame(
  BORO = new_data$BORO,
  Predicted = predicted$fit,
  SE = predicted$se.fit
)

# View the results
print(results)

##            BORO Predicted   SE
## 1     Manhattan       659 1.28
## 2         Bronx       805 3.07
## 3      Brooklyn       798 2.20
## 4        Queens       774 2.52
## 5 Staten Island       716 6.21
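
These predictions can be reproduced by hand from the Weibull column of the coefficient table, since each one is the exponential of the linear predictor (intercept plus borough coefficient). The sketch below uses the rounded coefficients as reported, so it should match the printed values only after rounding to whole days.

```r
# Weibull coefficients as reported in the model table (rounded to 3 decimals)
intercept <- 6.491
boro_coef <- c(Manhattan = 0, Bronx = 0.200, Brooklyn = 0.191,
               Queens = 0.161, "Staten Island" = 0.083)

# Linear predictor on the log-days scale, back-transformed to days
pred_days <- exp(intercept + boro_coef)
print(round(pred_days))

# Time ratio relative to Manhattan, e.g. exp(0.200) for the Bronx
print(round(exp(boro_coef), 3))
```

Read this way, the Bronx coefficient of 0.200 corresponds to a time ratio of about 1.22, i.e., predicted resolution times roughly 22% longer than in Manhattan.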

Predicted Survival Times - Using the Weibull model

The Bronx has the highest predicted survival time (805 days), indicating the slowest resolution.
Manhattan has the shortest predicted survival time (659 days), indicating the fastest resolution.
Staten Island has the least precise prediction, with the largest standard error (6.21), which is consistent with it having the fewest violations.
A bar plot visualizes these predictions with error bars highlighting uncertainty.
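
The plotting code is not included in this report. One way to draw such a plot with ggplot2 might look like the sketch below, which re-enters the printed predictions and standard errors. The 95% interval (Predicted plus or minus 1.96 times SE), the colour, and the axis labels are assumptions made for illustration.

```r
library(ggplot2)

# Predictions and standard errors as printed in the results table
boroughs <- c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")
results <- data.frame(
  BORO = factor(boroughs, levels = boroughs),
  Predicted = c(659, 805, 798, 774, 716),
  SE = c(1.28, 3.07, 2.20, 2.52, 6.21)
)

# Assumption: approximate 95% interval as Predicted +/- 1.96 * SE
results$lower <- results$Predicted - 1.96 * results$SE
results$upper <- results$Predicted + 1.96 * results$SE

# Bars show the predicted days; error bars show the interval
p <- ggplot(results, aes(x = BORO, y = Predicted)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  labs(x = "Borough", y = "Predicted days to resolution")
print(p)
```

Because the standard errors are small relative to the predictions, the error bars will appear very short on this scale.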

Violation Counts by Borough

Manhattan has the highest number of violations (>400,000).
Brooklyn and Queens follow with moderate counts.
Bronx has fewer violations than Brooklyn and Queens.
Staten Island records the lowest number of violations.

Survival Curves

Survival curves are plotted for violations resolved within 3650 days (10 years).
All boroughs show a decline in survival probability over time, with most violations resolved within the first few years.
Manhattan and Bronx exhibit slightly higher survival probabilities early on, indicating slower resolution rates.
Brooklyn, Queens, and Staten Island demonstrate faster resolution trends.
The highest predicted survival time is 805 days (Bronx). The long tail suggests that some violations take much longer to resolve.
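
For comparison with the plotted curves, the fitted Weibull model also implies a survival curve for each borough. The sketch below assumes the model was fitted with survreg(), whose Weibull parameterization gives S(t) = exp(-(t / exp(lp))^(1 / scale)), and it uses the rounded estimates from the coefficient table. It illustrates the model-implied curve only and is not the code behind the plots described above.

```r
# Weibull estimates as reported in the model table (rounded)
intercept <- 6.491
boro_coef <- c(Manhattan = 0, Bronx = 0.200)
log_scale <- 0.191

# survreg parameterization: S(t) = exp(-(t / exp(lp))^(1 / scale))
weibull_surv <- function(t, lp, scale) exp(-(t / exp(lp))^(1 / scale))

lp <- intercept + boro_coef
scale <- exp(log_scale)

# Model-implied share of violations still open after 1, 5 and 10 years
t_days <- c(365, 1825, 3650)
surv <- sapply(lp, function(l) weibull_surv(t_days, l, scale))
rownames(surv) <- paste0(t_days, "d")
print(round(surv, 3))

# At t = exp(lp) the survival probability equals exp(-1), about 0.368
print(weibull_surv(exp(lp), lp, scale))
```

Under this parameterization the predicted value exp(lp) (659 days for Manhattan, 805 for the Bronx) is the time by which about 63% of violations are resolved according to the model, rather than a mean or a median.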

Conclusion

The analysis highlights borough-level differences in violation resolution times:

1. Violations in the Bronx take the longest to resolve (about 805 days predicted under the Weibull model, versus 659 days in Manhattan).
2. Manhattan resolves violations more quickly than the other boroughs.
3. Staten Island's prediction carries the most uncertainty (the largest standard error).

These insights can guide resource allocation to address long-standing violations more effectively in specific boroughs such as the Bronx (the slowest resolution) and Manhattan (the largest number of violations).

DATA712 - Homework 07

Yung Ki Cho