The analysis uses Department of Buildings (DOB) violation data from NYC OpenData, specifically the “DOB Violations” dataset. The goal is to determine which borough has the longest-open building violations, i.e., the most prolonged safety issues for occupants. Violations that are still unresolved are treated as right-censored observations.
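# Recode the numeric borough codes (1-5) as a labelled factor; Manhattan is the reference level
raw_data$BORO <- factor(raw_data$BORO, levels = 1:5, labels = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"))
# Treat the violation type code as a categorical variable
raw_data$VIOLATION_TYPE_CODE <- as.factor(raw_data$VIOLATION_TYPE_CODE)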
To determine which borough is the least safe for building occupants, we evaluate violation survival time (the length of time between the building violation issuance and resolution) across boroughs using survival analysis techniques. Four survival models (Exponential, Weibull, Log-Normal, and Log-Logistic) are compared based on their fit to the data.
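The comparison described above can be sketched as follows. This is a minimal illustration on simulated data, not the original analysis code: the data frame `sim`, the columns `duration` and `resolved`, and the simulation parameters are all hypothetical, and it assumes the four models are fitted with `survreg` from the survival package.

```r
library(survival)

set.seed(1)
n <- 500
boroughs <- c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")

# Hypothetical stand-in for the cleaned DOB data
sim <- data.frame(
  BORO = factor(sample(boroughs, n, replace = TRUE), levels = boroughs)
)

# Simulated time from issuance to resolution, plus an independent censoring time
true_time   <- rweibull(n, shape = 0.8, scale = 700)
censor_time <- runif(n, min = 100, max = 3000)
sim$duration <- pmin(true_time, censor_time)          # observed time
sim$resolved <- as.integer(true_time <= censor_time)  # 1 = resolved, 0 = still open (censored)

# Fit the four parametric survival models with the same formula
dists <- c(Exponential = "exponential", Weibull = "weibull",
           `Log-Normal` = "lognormal", `Log-Logistic` = "loglogistic")
fits <- lapply(dists, function(d) {
  survreg(Surv(duration, resolved) ~ BORO, data = sim, dist = d)
})

# Compare fit: a lower AIC indicates a better trade-off between fit and complexity
comparison <- data.frame(
  Model  = names(fits),
  AIC    = sapply(fits, AIC),
  LogLik = sapply(fits, function(m) as.numeric(logLik(m))),
  row.names = NULL
)
print(comparison)
```

Fitting all four distributions with one formula keeps the comparison fair: the models differ only in the assumed distribution of the resolution time, so differences in AIC reflect that assumption alone.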
| | Exponential | Weibull | Log-Normal | Log-Logistic |
|---|---|---|---|---|
| (Intercept) | 6.598*** | 6.491*** | 5.880*** | 5.905*** |
| | (0.002) | (0.002) | (0.002) | (0.002) |
| BOROBronx | 0.192*** | 0.200*** | 0.184*** | 0.226*** |
| | (0.004) | (0.004) | (0.005) | (0.005) |
| BOROBrooklyn | 0.201*** | 0.191*** | 0.059*** | 0.105*** |
| | (0.003) | (0.003) | (0.004) | (0.004) |
| BOROQueens | 0.169*** | 0.161*** | 0.064*** | 0.096*** |
| | (0.003) | (0.004) | (0.004) | (0.004) |
| BOROStaten Island | 0.100*** | 0.083*** | -0.166*** | -0.068*** |
| | (0.007) | (0.009) | (0.010) | (0.011) |
| Log(scale) | | 0.191*** | 0.323*** | -0.233*** |
| | | (0.001) | (0.001) | (0.001) |
| Num.Obs. | 871725 | 871725 | 871725 | 871725 |
| AIC | 13417248.6 | 13356479.0 | 13360554.2 | 13377314.5 |
| BIC | 13417307.0 | 13356549.1 | 13360624.3 | 13377384.5 |
| RMSE | 1044.77 | 1048.29 | 1134.82 | 1127.37 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001. Standard errors in parentheses.


Model comparison by AIC and log-likelihood:

| Model | AIC | LogLik |
|---|---|---|
| Exponential | 13417249 | -6708619 |
| Weibull | 13356479 | -6678233 |
| Log-Normal | 13360554 | -6680271 |
| Log-Logistic | 13377314 | -6688651 |

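As a consistency check, the AIC values can be recovered from the reported log-likelihoods. The sketch below assumes the standard definition of AIC and counts the number of parameters $k$ as the five regression coefficients plus, where it is estimated, the scale parameter (so $k = 5$ for the Exponential model and $k = 6$ for the others).

$$
\begin{aligned}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\text{Exponential:}\quad & 2(5) - 2(-6708619) = 13417248 \\
\text{Log-Normal:}\quad & 2(6) - 2(-6680271) = 13360554
\end{aligned}
$$

Both agree with the reported values (13417248.6 and 13360554.2) up to rounding of the displayed log-likelihoods.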
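The predict function is a suitable alternative to sim_ame here because it computes predicted values directly and works with a wide range of model classes. While sim_ame is useful for estimating average marginal effects, predict is a simpler and faster way to obtain borough-level predictions from a survival or logistic regression model. The Weibull model has the lowest AIC and BIC of the four candidates, and the predictions below match its coefficients (for example, exp(6.491) ≈ 659 for Manhattan).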
# Create a new data frame for predictions
new_data <- data.frame(
  BORO = factor(c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"),
                levels = levels(DOB_data$BORO)))
# Predict survival times for each BORO
predicted <- predict(model2, newdata = new_data, type = "response", se.fit = TRUE)
# Combine predictions with BORO levels
results <- data.frame(
  BORO = new_data$BORO,
  Predicted = predicted$fit,
  SE = predicted$se.fit
)
# View the results
print(results)
##            BORO Predicted   SE
## 1     Manhattan       659 1.28
## 2         Bronx       805 3.07
## 3      Brooklyn       798 2.20
## 4        Queens       774 2.52
## 5 Staten Island       716 6.21
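These predictions can be reproduced by hand from the Weibull column of the coefficient table. The sketch assumes that model2 is the Weibull fit and that, for this kind of model, predict with type = "response" returns the exponentiated linear predictor; $\hat\beta_0$ denotes the intercept (Manhattan, the reference borough) and $\hat\beta_b$ the coefficient for borough $b$.

$$
\begin{aligned}
\hat{T}_b &= \exp\!\left(\hat\beta_0 + \hat\beta_b\right) \\
\hat{T}_{\text{Manhattan}} &= \exp(6.491) \approx 659 \\
\hat{T}_{\text{Bronx}} &= \exp(6.491 + 0.200) = \exp(6.691) \approx 805 \\
\hat{T}_{\text{Bronx}} / \hat{T}_{\text{Manhattan}} &= \exp(0.200) \approx 1.22
\end{aligned}
$$

Under these assumptions, each borough coefficient acts as a multiplicative time ratio: a violation in the Bronx is predicted to stay open about 22% longer than one in Manhattan.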
The analysis highlights borough-level differences in violation resolution times:

1. Violations in the Bronx have the longest predicted time to resolution (805), significantly longer than in Manhattan, the reference borough, and closely followed by Brooklyn (798).
2. Manhattan has the shortest predicted time to resolution (659).
3. The Staten Island estimate is the least precise, with the largest standard error (6.21).

These insights can guide resource allocation toward the boroughs where violations stay open longest, such as the Bronx and Brooklyn.