Law Firm Analysis: Payments by Agency, State, and County

Introduction

I am a very impressive and smart data scientist working for an extremely prestigious law firm. The law firm specializes in fighting parking and camera tickets. I was hired to uncover hidden patterns in NYC violation data. My firm asked me to explore the following three questions: 1. Do certain agencies issue higher payments? 2. Do drivers from different states (NY, NJ, CT) pay more? 3. Do certain counties tend to have higher payment amounts?

Here is the dataset we use: https://data.cityofnewyork.us/City-Government/Open-Parking-and-Camera-Violations/nc67-uf89/about_data

endpoint <- "https://data.cityofnewyork.us/resource/nc67-uf89.json"

resp <- GET(endpoint, query = list(
  "$limit" = 99999,
  "$order" = "issue_date DESC"
))

camera <- fromJSON(content(resp, as = "text"), flatten = TRUE)

camera<- camera %>%
  mutate(across(
    c("fine_amount", "interest_amount", "reduction_amount", "payment_amount", "amount_due"),
    ~as.numeric(.)
  ))

camera <- camera %>%
  mutate(
    agency = as.factor(issuing_agency),
    plate_state = toupper(state),
    county_raw = county
  )

1. Do certain agencies issue higher payments?

Sum of Squares How much variance is explained is represented by 874,511/627,340,303, which is an extremely small portion.

F value and p-value (Is it statistically significant?) F value = 15.47, p < .001. The difference in mean payment amounts across agencies is statistically significant.

PRE (What proportion of variance is explained?) Agency only accounts for about 0.1% of the total variance in payment amount.

While the difference is statistically significant, it is extremely small in magnitude. The typical payment amount varies across agencies, but not by a large amount. I would not recommend that the law firm base its marketing strategies on agency, since the differences across agencies are trivial. See analysis below.

# Many agencies listed have extremely low sample sizes. Let's only analyze agencies with n > 10. 
camera_agency_filtered <- camera %>%
  filter(!is.na(agency), !is.na(payment_amount)) %>%
  group_by(agency) %>%
  filter(n() > 10) %>%
  ungroup()

ggplot(
  camera_agency_filtered %>%
    mutate(agency = fct_reorder(agency, payment_amount, .fun = mean)),
  aes(x = agency, y = payment_amount)
) +
  geom_boxplot(outlier.shape = 16, outlier.size = 1) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Payment Amounts by Issuing Agency (Ordered by Mean)",
    x = "Agency",
    y = "Payment Amount ($)"
  ) + 
  scale_y_continuous(limits = c(0, 300))

agency_desc <- favstats(payment_amount ~ agency, data = camera_agency_filtered) %>%
  arrange(desc(mean))
knitr::kable(agency_desc, caption = "Descriptive statistics of payment amounts by agency")

Descriptive statistics of payment amounts by agency
agency	min	Q1	median	Q3	max	mean	sd	n
NYS OFFICE OF MENTAL HEALTH POLICE	0	180	180	190	210.00	161.33333	65.99423	15
ROOSEVELT ISLAND SECURITY	0	135	180	190	246.68	149.16083	90.57967	24
PORT AUTHORITY	0	180	180	190	242.76	147.35792	82.58394	48
NYS PARKS POLICE	0	45	180	190	242.58	143.86176	89.24158	34
PARKS DEPARTMENT	0	90	180	190	245.28	128.47736	78.92728	144
HEALTH AND HOSPITAL CORP. POLICE	0	0	180	190	245.64	124.71373	98.60130	51
POLICE DEPARTMENT	0	0	180	190	260.00	123.93855	88.00388	214
DEPARTMENT OF TRANSPORTATION	0	50	75	125	690.04	99.52822	82.88394	87273
TRAFFIC	0	65	115	115	245.79	94.59362	44.47453	12091
DEPARTMENT OF SANITATION	0	0	65	105	115.00	56.78571	48.26239	14
CON RAIL	NA	NA	NA	NA	NA	NaN	NA	0
FIRE DEPARTMENT	NA	NA	NA	NA	NA	NaN	NA	0
HEALTH DEPARTMENT POLICE	NA	NA	NA	NA	NA	NaN	NA	0
LONG ISLAND RAILROAD	NA	NA	NA	NA	NA	NaN	NA	0
NYC OFFICE OF THE SHERIFF	NA	NA	NA	NA	NA	NaN	NA	0
OTHER/UNKNOWN AGENCIES	NA	NA	NA	NA	NA	NaN	NA	0
SEA GATE ASSOCIATION POLICE	NA	NA	NA	NA	NA	NaN	NA	0
SUNY MARITIME COLLEGE	NA	NA	NA	NA	NA	NaN	NA	0
TAXI AND LIMOUSINE COMMISSION	NA	NA	NA	NA	NA	NaN	NA	0
TRANSIT AUTHORITY	NA	NA	NA	NA	NA	NaN	NA	0

agency_aov <- aov(payment_amount ~ agency, data = camera_agency_filtered)
summary(agency_aov)

##                Df    Sum Sq Mean Sq F value Pr(>F)    
## agency          9    874511   97168   15.47 <2e-16 ***
## Residuals   99898 627340303    6280                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

supernova(agency_aov)

##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ agency
## 
##                                     SS    df        MS      F   PRE     p
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Model (error reduced) |    874511.027     9 97167.892 15.473 .0014 .0000
##  Error (from model)    | 627340303.332 99898  6279.808                   
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Total (empty model)   | 628214814.359 99907  6287.996

2. Do drivers from different states (NY, NJ, CT) pay more?

Sum of Squares How much variance is explained is represented by 602,716/594,098,896, which is an extremely small portion.

F value and p-value (Is it statistically significant?) F value = 45.48, p < .001. The difference in mean payment amounts across states is statistically significant.

PRE (What proportion of variance is explained?) State only accounts for about 0.1% of the total variance in payment amount.

Similarly, the difference is statistically significant, but small in magnitude. Payment amount does vary across states, primarily since the mean Connecticut payment is about $20 less than those in New York or New Jersey. However, this may not be very meaningful, since the median is the same across all three states. See analysis below.

camera_states <- camera %>%
  filter(state %in% c("NY", "NJ", "CT"), !is.na(payment_amount))

ggplot(camera_states, aes(x = state, y = payment_amount)) +
  geom_boxplot(outlier.shape = 16, outlier.size = 1) +
  theme_minimal() +
  labs(
    title = "Payment Amounts by Driver State (NY vs NJ vs CT)",
    x = "Driver State",
    y = "Payment Amount ($)"
  ) +
  scale_y_continuous(limits = c(0, 300))

state_desc <- favstats(payment_amount ~ state, data = camera_states) %>%
  arrange(desc(mean))
knitr::kable(state_desc, caption = "Descriptive statistics of payment amounts by state")

Descriptive statistics of payment amounts by state
state	Q1	median	Q3	max	mean	sd	n
NJ	50	75	115	682.35	101.5746	89.97170	8654
NY	50	75	125	690.04	101.0902	80.93015	79541
CT	50	75	100	276.57	80.6627	46.07849	1457

state_aov <- aov(payment_amount ~ state, data = camera_states)
summary(state_aov)

##                Df    Sum Sq Mean Sq F value Pr(>F)    
## state           2    602716  301358   45.48 <2e-16 ***
## Residuals   89649 594098897    6627                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

state_sn <- supernova(state_aov)
state_sn

##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ state
## 
##                                     SS    df         MS      F   PRE     p
##  ----- --------------- | ------------- ----- ---------- ------ ----- -----
##  Model (error reduced) |    602716.142     2 301358.071 45.475 .0010 .0000
##  Error (from model)    | 594098896.889 89649   6626.944                   
##  ----- --------------- | ------------- ----- ---------- ------ ----- -----
##  Total (empty model)   | 594701613.031 89651   6633.519

TukeyHSD(state_aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = payment_amount ~ state, data = camera_states)
## 
## $state
##             diff       lwr      upr     p adj
## NJ-CT 20.9119180 15.509134 26.31470 0.0000000
## NY-CT 20.4274502 15.383502 25.47140 0.0000000
## NY-NJ -0.4844678 -2.644085  1.67515 0.8587191

3. Do certain counties tend to have higher payment amounts?

Sum of Squares How much variance is explained is represented by 6,702,478/420,413,471, which is a small portion, but more relevant than the previous analyses.

F value and p-value (Is it statistically significant?) F value = 233.35, p < .001. The difference in mean payment amounts across counties is statistically significant.

PRE (What proportion of variance is explained?) County accounts for about 1.6% of the total variance in payment amount.

The difference is statistically significant and small in magnitude; however, relative to state and agency, county accounts for the most variance by far. A $30-$40 difference in mean payment across counties may represent meaningful differences in how violations are enforced. The law firm may want to increase their marketing presence in Richmond and Kings counties, since their higher payments may motivate people to contest their fines. See analysis below.

clean_county <- function(x) {
  x_up <- toupper(trimws(x))
  dplyr::case_when(
    x_up %in% c("K", "BK", "KING", "KINGS")             ~ "Kings County",
    x_up %in% c("Q", "QN", "QNS", "QUEEN", "QUEENS")    ~ "Queens County",
    x_up %in% c("BX", "BRONX")                          ~ "Bronx County",
    x_up %in% c("NY", "MN", "MANHATTAN", "NEW YORK")    ~ "New York County",
    x_up %in% c("R", "ST", "RICH", "RICHMOND")          ~ "Richmond County",
    TRUE ~ NA_character_
  )
}

camera_county <- camera %>%
  mutate(county_clean = clean_county(county)) %>%
  filter(!is.na(county_clean))

ggplot(
  camera_county %>%
    mutate(county_clean = fct_reorder(county_clean, payment_amount, .fun = mean)),
  aes(x = county_clean, y = payment_amount)
) +
  geom_boxplot(outlier.shape = 16, outlier.size = 1) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Payment Amounts by County (Ordered by Mean)",
    x = "County",
    y = "Payment Amount ($)"
  ) +
  scale_y_continuous(limits = c(0, 200))

county_desc <- favstats(payment_amount ~ county_clean, data = camera_county) %>%
  arrange(desc(mean))
knitr::kable(county_desc, caption = "Descriptive statistics of payment amounts by county")

Descriptive statistics of payment amounts by county
county_clean	Q1	median	Q3	max	mean	sd	n
Richmond County	50	125	180	250.00	114.53669	77.55385	1349
Kings County	50	75	115	690.04	110.88983	126.20448	16112
Bronx County	65	75	145	245.64	99.65870	67.53373	247
New York County	50	75	115	281.80	97.62502	62.55866	23479
Queens County	50	50	100	283.03	83.46501	60.08515	17366

county_aov <- aov(payment_amount ~ county_clean, data = camera_county)
summary(county_aov)

##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## county_clean     4   6702478 1675619   233.4 <2e-16 ***
## Residuals    58548 420413471    7181                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

county_sn <- supernova(county_aov)
county_sn

##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ county_clean
## 
##                                     SS    df          MS       F   PRE     p
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Model (error reduced) |   6702477.756     4 1675619.439 233.352 .0157 .0000
##  Error (from model)    | 420413471.103 58548    7180.663                    
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Total (empty model)   | 427115948.859 58552    7294.643

Conclusion:

Among the three variables analyzed, the firm should prioritize county in its marketing efforts. All three variables predicted differences in payment amount, however the differences predicted by state and agency were not large enough to warrant a change in marketing strategies. While the difference in paymount amount predicted by county were still modest, they were relatively large compared to the other variables analyzed. If the firm is going to change their marketing strategies due to one of these variables, it should be county.