Research Question: What factors predict the number of days it takes to resolve a standing water violation?
To answer this question, I used the Standing Water Violations dataset from the Montgomery County Open Data Portal: https://data.montgomerycountymd.gov/Consumer-Housing/Standing-Water/mx9q-5uj7/about_data. This dataset tracks housing code violations related to standing water, including when they were filed, closed, and how they were resolved.
Each row represents a single violation case. The key variables used in this analysis are:
days_to_resolve (quantitative)
disposition (categorical)
city (categorical)
I will use multiple linear regression because days_to_resolve is a continuous outcome variable and I want to understand the influence of multiple predictors simultaneously.
Data Analysis
library(tidyverse)
The bar chart below shows the average number of days to resolve a violation for each disposition type.
# Average resolution time by disposition type.
# The Set1 palette supports at most 9 colors and disposition has more levels
# than that, so a discrete viridis scale is used instead.
clean_data |>
  group_by(disposition) |>
  summarize(mean_days = mean(days_to_resolve)) |>
  ggplot(aes(x = reorder(disposition, -mean_days), y = mean_days, fill = disposition)) +
  geom_col(color = "black") +
  scale_fill_viridis_d() +
  labs(
    title = "Average Days to Resolve by Disposition Type",
    x = "Disposition",
    y = "Average Days to Resolve",
    caption = "Source: Montgomery County Open Data Portal"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Justification of Approach
I chose multiple linear regression because the outcome variable, days_to_resolve, is numeric and I want to estimate the effect of several predictors at the same time.
I chose disposition and city as predictors because disposition describes how the violation was resolved, which plausibly affects how long a case stays open, and city may reflect differences in inspector workload or resources across locations.
The 5 assumptions I will check are: Linearity, Independence, Normality of Residuals, Homoscedasticity, No Multicollinearity
Statistical Analysis
# Fit the multiple linear regression model
model <- lm(days_to_resolve ~ disposition + city, data = clean_data)
summary(model)
Call:
lm(formula = days_to_resolve ~ disposition + city, data = clean_data)
Residuals:
Min 1Q Median 3Q Max
-1233.59 -173.21 0.00 62.13 1470.73
Coefficients: (1 not defined because of singularities)
                                                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                          -322.90     370.68  -0.871 0.384109
dispositionADU Class III Denial                                       626.45     430.22   1.456 0.145975
dispositionADU Class III Passed Inspection                            592.23     427.84   1.384 0.166885
dispositionChange of Ownership                                        675.79     325.09   2.079 0.038137 *
dispositionCitation issued                                            565.11     352.82   1.602 0.109838
dispositionReferred to a Montgomery County Agency - no jurisdiction   166.59     373.61   0.446 0.655858
dispositionReoccupied                                                1824.05     312.33   5.840 9.29e-09 ***
dispositionTP Annual Inspection Completed                             399.90     476.69   0.839 0.401912
dispositionTriennial - No Violations Found                           1250.23     319.68   3.911 0.000104 ***
dispositionTriennial Completed                                        628.10     306.22   2.051 0.040761 *
dispositionViolation Unfounded                                        841.94     337.12   2.497 0.012822 *
dispositionViolations Corrected                                       423.90     304.13   1.394 0.163970
cityBETHESDA                                                           87.45     218.60   0.400 0.689297
cityBROOKEVILLE                                                       213.32     277.87   0.768 0.443015
cityBURTONSVILLE                                                       65.43     241.38   0.271 0.786453
cityCABIN JOHN                                                        -40.00     367.08  -0.109 0.913270
cityCHEVY CHASE                                                       -15.93     223.70  -0.071 0.943246
cityDAMASCUS                                                          -46.60     250.76  -0.186 0.852648
cityDICKERSON                                                         -58.00     367.08  -0.158 0.874515
cityGAITHERSBURG                                                      182.95     229.85   0.796 0.426414
cityGERMANTOWN                                                         20.27     219.07   0.093 0.926309
cityKENSINGTON                                                        333.02     225.88   1.474 0.141020
cityMONTGOMERY VILLAGE                                                114.80     244.79   0.469 0.639289
cityOLNEY                                                             -66.00     236.95  -0.279 0.780708
cityPOTOMAC                                                           333.90     218.12   1.531 0.126434
cityROCKVILLE                                                         140.93     216.92   0.650 0.516189
citySILVER SPRING                                                     226.67     213.63   1.061 0.289174
cityTAKOMA PARK                                                           NA         NA      NA       NA
cityWASHINGTON GROVE                                                   72.06     305.19   0.236 0.813449
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 299.7 on 511 degrees of freedom
Multiple R-squared: 0.6937, Adjusted R-squared: 0.6775
F-statistic: 42.86 on 27 and 511 DF, p-value: < 2.2e-16
I checked linearity and homoscedasticity using residuals vs fitted values. The ideal is a random scatter around zero with no pattern.
plot(model$fitted.values, model$residuals,
     main = "Residuals vs Fitted",
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0)
For independence, each row is a separate standing water violation case, so independence is reasonably assumed.
Normality of residuals was checked using a Q-Q plot. Points should follow the diagonal line.
qqnorm(model$residuals)
qqline(model$residuals)
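For multicollinearity, disposition and city measure different things (the type of outcome versus the location), so I did not expect a problem. However, the model summary reports that one coefficient (cityTAKOMA PARK) was not defined because of singularities, which means that city is perfectly predicted by other terms in the model. This assumption is therefore not fully met, and that coefficient cannot be interpreted.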
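A coefficient reported as NA "because of singularities" can be traced with base R's alias() function. The sketch below uses a small made-up data frame (toy, with hypothetical values, not the Montgomery County data) in which one city always has the same disposition, so the two dummy columns are identical. It only illustrates the kind of check that could be run on the real model.

```r
# Toy data (hypothetical): every case in city "C" has disposition "z",
# and no other city has that disposition, so the two dummy columns match.
toy <- data.frame(
  days        = c(10, 12, 30, 28, 55, 60, 20, 24),
  disposition = factor(c("x", "y", "x", "y", "z", "z", "x", "y")),
  city        = factor(c("A", "A", "B", "B", "C", "C", "A", "B"))
)

fit <- lm(days ~ disposition + city, data = toy)

# Coefficients that lm() could not estimate come back as NA
na_terms <- names(coef(fit))[is.na(coef(fit))]
print(na_terms)

# alias() shows which estimated terms each dropped term depends on
print(alias(fit)$Complete)
```

Running alias(model) on the fitted model from this report would show which terms cityTAKOMA PARK is confounded with.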
# RMSE - average prediction error in days
residuals <- model$residuals
rmse <- sqrt(mean(residuals^2))
rmse
[1] 291.8282
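The RMSE is slightly smaller than the residual standard error of 299.7 in the model summary because the RMSE divides the squared residuals by n, while the residual standard error divides by the residual degrees of freedom. With the values from the summary, 299.7 × sqrt(511 / 539) ≈ 291.8, which agrees with the RMSE above. The sketch below shows the same relationship on R's built-in mtcars data, which stands in for clean_data here purely for illustration.

```r
# mtcars ships with base R; it is only a stand-in for clean_data
fit <- lm(mpg ~ wt + hp, data = mtcars)

res  <- resid(fit)
rmse <- sqrt(mean(res^2))                    # divides by n
rse  <- sqrt(sum(res^2) / df.residual(fit))  # divides by residual df

# rse is the "Residual standard error" printed by summary()
print(all.equal(rse, sigma(fit)))

# The two measures differ only by the factor sqrt(df / n)
print(all.equal(rmse, rse * sqrt(df.residual(fit) / nobs(fit))))
```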
Discussion of Results
The linear regression model looked at what factors predicted how long it takes to resolve a standing water violation in Montgomery County. The outcome variable was days_to_resolve.
The R-squared value tells us how much of the variation in resolution time is explained by the model. Here it is 0.6937, so the model explains about 69% of the variation. The adjusted R-squared (0.6775) accounts for the number of predictors and is a better measure of fit. The RMSE tells us roughly how many days off the model's predictions are; here it is about 292 days.
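As a check, the adjusted R-squared in the model summary can be reproduced from the other reported values. The summary gives p = 27 estimated predictors and 511 residual degrees of freedom, which implies n = 511 + 27 + 1 = 539 cases used in the fit:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.6937)\,\frac{538}{511}
          \approx 1 - 0.3225
          = 0.6775
```

The penalty is small here because n is large relative to the number of predictors.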
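Each disposition coefficient tells us how many more or fewer days that resolution type takes compared to the reference disposition, holding city constant. In the same way, each city coefficient tells us how much faster or slower violations are resolved there compared to the reference city. If a coefficient is positive the violation takes longer; if it is negative it is resolved faster. For example, Reoccupied cases took about 1,824 days longer than the reference disposition (p < 0.001), and Triennial - No Violations Found cases took about 1,250 days longer (p < 0.001). Change of Ownership, Triennial Completed, and Violation Unfounded were also significant at the 0.05 level, while none of the city coefficients were statistically significant.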
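The model output does not name the reference categories. By default, lm() uses the first level of each factor as the baseline, so every coefficient is a difference from that level. The sketch below illustrates this on a small made-up data frame (toy, with hypothetical values) and shows how relevel() changes the baseline without changing the fitted values.

```r
# Toy data (hypothetical values) with a three-level factor
toy <- data.frame(
  days  = c(10, 14, 30, 34, 50, 54),
  group = factor(c("a", "a", "b", "b", "c", "c"))
)

# By default the first factor level ("a") is the reference category:
# the intercept is the mean of "a", the other terms are differences from it
fit_a <- lm(days ~ group, data = toy)
print(coef(fit_a))

# relevel() picks a different reference; only the coefficients change
toy$group_b <- relevel(toy$group, ref = "b")
fit_b <- lm(days ~ group_b, data = toy)
print(coef(fit_b))
```

In this report, levels(factor(clean_data$disposition))[1] and the equivalent call for city would show which categories the coefficients are compared against.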
The practical takeaway is that how a violation gets resolved is strongly associated with resolution time, while location adds little once disposition is accounted for. If certain disposition types or cities consistently take longer, resources could be moved there to speed up resolution and improve housing conditions for residents.
Conclusion
This analysis used multiple linear regression to see what factors predict the number of days it takes to resolve a standing water violation. The predictors were disposition type and city. The model provides insight into which resolution outcomes and locations are associated with faster or slower resolution times. Future research could include additional predictors such as inspector workload or seasonal patterns to improve the model’s explanatory power.