Introduction

I am a very impressive and smart data scientist working for an extremely prestigious law firm. The law firm specializes in fighting parking and camera tickets. I was hired to uncover hidden patterns in NYC violation data. My firm asked me to explore the following three questions: 1. Do certain agencies issue higher payments? 2. Do drivers from different states (NY, NJ, CT) pay more? 3. Do certain counties tend to have higher payment amounts?

Here is the dataset we use: https://data.cityofnewyork.us/City-Government/Open-Parking-and-Camera-Violations/nc67-uf89/about_data

endpoint <- "https://data.cityofnewyork.us/resource/nc67-uf89.json"

resp <- GET(endpoint, query = list(
  "$limit" = 99999,
  "$order" = "issue_date DESC"
))

camera <- fromJSON(content(resp, as = "text"), flatten = TRUE)

camera<- camera %>%
  mutate(across(
    c("fine_amount", "interest_amount", "reduction_amount", "payment_amount", "amount_due"),
    ~as.numeric(.)
  ))

camera <- camera %>%
  mutate(
    agency = as.factor(issuing_agency),
    plate_state = toupper(state),
    county_raw = county
  )

1. Do certain agencies issue higher payments?

Sum of Squares How much variance is explained is represented by 874,511/627,340,303, which is an extremely small portion.

F value and p-value (Is it statistically significant?) F value = 15.47, p < .001. The difference in mean payment amounts across agencies is statistically significant.

PRE (What proportion of variance is explained?) Agency only accounts for about 0.1% of the total variance in payment amount.

While the difference is statistically significant, it is extremely small in magnitude. The typical payment amount varies across agencies, but not by a large amount. I would not recommend that the law firm base its marketing strategies on agency, since the differences across agencies are trivial. See analysis below.

# Many agencies listed have extremely low sample sizes. Let's only analyze agencies with n > 10. 
camera_agency_filtered <- camera %>%
  filter(!is.na(agency), !is.na(payment_amount)) %>%
  group_by(agency) %>%
  filter(n() > 10) %>%
  ungroup()

ggplot(
  camera_agency_filtered %>%
    mutate(agency = fct_reorder(agency, payment_amount, .fun = mean)),
  aes(x = agency, y = payment_amount)
) +
  geom_boxplot(outlier.shape = 16, outlier.size = 1) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Payment Amounts by Issuing Agency (Ordered by Mean)",
    x = "Agency",
    y = "Payment Amount ($)"
  ) + 
  scale_y_continuous(limits = c(0, 300))

agency_desc <- favstats(payment_amount ~ agency, data = camera_agency_filtered) %>%
  arrange(desc(mean))
knitr::kable(agency_desc, caption = "Descriptive statistics of payment amounts by agency")
Descriptive statistics of payment amounts by agency
agency min Q1 median Q3 max mean sd n missing
NYS OFFICE OF MENTAL HEALTH POLICE 0 180 180 190 210.00 161.33333 65.99423 15 0
ROOSEVELT ISLAND SECURITY 0 135 180 190 246.68 149.16083 90.57967 24 0
PORT AUTHORITY 0 180 180 190 242.76 147.35792 82.58394 48 0
NYS PARKS POLICE 0 45 180 190 242.58 143.86176 89.24158 34 0
PARKS DEPARTMENT 0 90 180 190 245.28 128.47736 78.92728 144 0
HEALTH AND HOSPITAL CORP. POLICE 0 0 180 190 245.64 124.71373 98.60130 51 0
POLICE DEPARTMENT 0 0 180 190 260.00 123.93855 88.00388 214 0
DEPARTMENT OF TRANSPORTATION 0 50 75 125 690.04 99.52822 82.88394 87273 0
TRAFFIC 0 65 115 115 245.79 94.59362 44.47453 12091 0
DEPARTMENT OF SANITATION 0 0 65 105 115.00 56.78571 48.26239 14 0
CON RAIL NA NA NA NA NA NaN NA 0 0
FIRE DEPARTMENT NA NA NA NA NA NaN NA 0 0
HEALTH DEPARTMENT POLICE NA NA NA NA NA NaN NA 0 0
LONG ISLAND RAILROAD NA NA NA NA NA NaN NA 0 0
NYC OFFICE OF THE SHERIFF NA NA NA NA NA NaN NA 0 0
OTHER/UNKNOWN AGENCIES NA NA NA NA NA NaN NA 0 0
SEA GATE ASSOCIATION POLICE NA NA NA NA NA NaN NA 0 0
SUNY MARITIME COLLEGE NA NA NA NA NA NaN NA 0 0
TAXI AND LIMOUSINE COMMISSION NA NA NA NA NA NaN NA 0 0
TRANSIT AUTHORITY NA NA NA NA NA NaN NA 0 0
agency_aov <- aov(payment_amount ~ agency, data = camera_agency_filtered)
summary(agency_aov)
##                Df    Sum Sq Mean Sq F value Pr(>F)    
## agency          9    874511   97168   15.47 <2e-16 ***
## Residuals   99898 627340303    6280                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(agency_aov)
##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ agency
## 
##                                     SS    df        MS      F   PRE     p
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Model (error reduced) |    874511.027     9 97167.892 15.473 .0014 .0000
##  Error (from model)    | 627340303.332 99898  6279.808                   
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Total (empty model)   | 628214814.359 99907  6287.996

2. Do drivers from different states (NY, NJ, CT) pay more?

Sum of Squares How much variance is explained is represented by 602,716/594,098,896, which is an extremely small portion.

F value and p-value (Is it statistically significant?) F value = 45.48, p < .001. The difference in mean payment amounts across states is statistically significant.

PRE (What proportion of variance is explained?) State only accounts for about 0.1% of the total variance in payment amount.

Similarly, the difference is statistically significant, but small in magnitude. Payment amount does vary across states, primarily since the mean Connecticut payment is about $20 less than those in New York or New Jersey. However, this may not be very meaningful, since the median is the same across all three states. See analysis below.

camera_states <- camera %>%
  filter(state %in% c("NY", "NJ", "CT"), !is.na(payment_amount))

ggplot(camera_states, aes(x = state, y = payment_amount)) +
  geom_boxplot(outlier.shape = 16, outlier.size = 1) +
  theme_minimal() +
  labs(
    title = "Payment Amounts by Driver State (NY vs NJ vs CT)",
    x = "Driver State",
    y = "Payment Amount ($)"
  ) +
  scale_y_continuous(limits = c(0, 300))

state_desc <- favstats(payment_amount ~ state, data = camera_states) %>%
  arrange(desc(mean))
knitr::kable(state_desc, caption = "Descriptive statistics of payment amounts by state")
Descriptive statistics of payment amounts by state
state min Q1 median Q3 max mean sd n missing
NJ 0 50 75 115 682.35 101.5746 89.97170 8654 0
NY 0 50 75 125 690.04 101.0902 80.93015 79541 0
CT 0 50 75 100 276.57 80.6627 46.07849 1457 0
state_aov <- aov(payment_amount ~ state, data = camera_states)
summary(state_aov)
##                Df    Sum Sq Mean Sq F value Pr(>F)    
## state           2    602716  301358   45.48 <2e-16 ***
## Residuals   89649 594098897    6627                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
state_sn <- supernova(state_aov)
state_sn
##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ state
## 
##                                     SS    df         MS      F   PRE     p
##  ----- --------------- | ------------- ----- ---------- ------ ----- -----
##  Model (error reduced) |    602716.142     2 301358.071 45.475 .0010 .0000
##  Error (from model)    | 594098896.889 89649   6626.944                   
##  ----- --------------- | ------------- ----- ---------- ------ ----- -----
##  Total (empty model)   | 594701613.031 89651   6633.519
TukeyHSD(state_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = payment_amount ~ state, data = camera_states)
## 
## $state
##             diff       lwr      upr     p adj
## NJ-CT 20.9119180 15.509134 26.31470 0.0000000
## NY-CT 20.4274502 15.383502 25.47140 0.0000000
## NY-NJ -0.4844678 -2.644085  1.67515 0.8587191

3. Do certain counties tend to have higher payment amounts?

Sum of Squares How much variance is explained is represented by 6,702,478/420,413,471, which is a small portion, but more relevant than the previous analyses.

F value and p-value (Is it statistically significant?) F value = 233.35, p < .001. The difference in mean payment amounts across counties is statistically significant.

PRE (What proportion of variance is explained?) County accounts for about 1.6% of the total variance in payment amount.

The difference is statistically significant and small in magnitude; however, relative to state and agency, county accounts for the most variance by far. A $30-$40 difference in mean payment across counties may represent meaningful differences in how violations are enforced. The law firm may want to increase their marketing presence in Richmond and Kings counties, since their higher payments may motivate people to contest their fines. See analysis below.

clean_county <- function(x) {
  x_up <- toupper(trimws(x))
  dplyr::case_when(
    x_up %in% c("K", "BK", "KING", "KINGS")             ~ "Kings County",
    x_up %in% c("Q", "QN", "QNS", "QUEEN", "QUEENS")    ~ "Queens County",
    x_up %in% c("BX", "BRONX")                          ~ "Bronx County",
    x_up %in% c("NY", "MN", "MANHATTAN", "NEW YORK")    ~ "New York County",
    x_up %in% c("R", "ST", "RICH", "RICHMOND")          ~ "Richmond County",
    TRUE ~ NA_character_
  )
}

camera_county <- camera %>%
  mutate(county_clean = clean_county(county)) %>%
  filter(!is.na(county_clean))

ggplot(
  camera_county %>%
    mutate(county_clean = fct_reorder(county_clean, payment_amount, .fun = mean)),
  aes(x = county_clean, y = payment_amount)
) +
  geom_boxplot(outlier.shape = 16, outlier.size = 1) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Payment Amounts by County (Ordered by Mean)",
    x = "County",
    y = "Payment Amount ($)"
  ) +
  scale_y_continuous(limits = c(0, 200))

county_desc <- favstats(payment_amount ~ county_clean, data = camera_county) %>%
  arrange(desc(mean))
knitr::kable(county_desc, caption = "Descriptive statistics of payment amounts by county")
Descriptive statistics of payment amounts by county
county_clean min Q1 median Q3 max mean sd n missing
Richmond County 0 50 125 180 250.00 114.53669 77.55385 1349 0
Kings County 0 50 75 115 690.04 110.88983 126.20448 16112 0
Bronx County 0 65 75 145 245.64 99.65870 67.53373 247 0
New York County 0 50 75 115 281.80 97.62502 62.55866 23479 0
Queens County 0 50 50 100 283.03 83.46501 60.08515 17366 0
county_aov <- aov(payment_amount ~ county_clean, data = camera_county)
summary(county_aov)
##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## county_clean     4   6702478 1675619   233.4 <2e-16 ***
## Residuals    58548 420413471    7181                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
county_sn <- supernova(county_aov)
county_sn
##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ county_clean
## 
##                                     SS    df          MS       F   PRE     p
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Model (error reduced) |   6702477.756     4 1675619.439 233.352 .0157 .0000
##  Error (from model)    | 420413471.103 58548    7180.663                    
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Total (empty model)   | 427115948.859 58552    7294.643

Conclusion:

Among the three variables analyzed, the firm should prioritize county in its marketing efforts. All three variables predicted differences in payment amount, however the differences predicted by state and agency were not large enough to warrant a change in marketing strategies. While the difference in paymount amount predicted by county were still modest, they were relatively large compared to the other variables analyzed. If the firm is going to change their marketing strategies due to one of these variables, it should be county.