As data scientists working for a law firm that specializes in fighting parking and camera tickets, our task is to uncover hidden patterns in New York City violation data. Specifically, we’ll explore how payment amounts vary across different groups.
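We’ll answer three key questions: do payment amounts differ by (1) the issuing agency, (2) the driver's state within the tri-state area (NY, NJ, CT), and (3) the county where the violation occurred?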
The dataset comes from the NYC Open Data Portal.
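# Packages used below (assumed; the original setup chunk is not shown)
library(httr)       # GET(), content(), stop_for_status()
library(jsonlite)   # fromJSON()
library(dplyr)      # %>%, mutate(), filter(), glimpse()
library(stringr)    # str_detect()
library(ggplot2)    # boxplots
library(mosaic)     # favstats()
library(supernova)  # supernova()

# Download from the NYC Open Data API, or reuse the cached local copy
if (file.exists("camera_data.RData")) {
  load("camera_data.RData")
  message("Loaded local dataset: camera_data.RData")
} else {
  message("Downloading dataset from NYC Open Data...")
  endpoint <- "https://data.cityofnewyork.us/resource/nc67-uf89.json"
  resp <- GET(endpoint, query = list("$limit" = 99999, "$order" = "issue_date DESC"))
  stop_for_status(resp)  # fail early on an HTTP error
  camera <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)
  save(camera, file = "camera_data.RData")
  message("Saved dataset locally as camera_data.RData")
}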
## Loaded local dataset: camera_data.RData
# Confirm structure
glimpse(camera)
## Rows: 99,999
## Columns: 20
## $ plate <chr> "HPK2083", "FFZ7198", "BLANKPLATE", "BLANKPL…
## $ state <chr> "NY", "NY", "99", "99", "99", "99", "99", "9…
## $ license_type <chr> "PAS", "PAS", "999", "999", "999", "999", "9…
## $ summons_number <chr> "1420103131", "1405797526", "1405210989", "1…
## $ violation_time <chr> "00:00A", "06:49A", NA, "00:00A", NA, "00:00…
## $ violation <chr> "INSP. STICKER-EXPIRED/MISSING", "OBSTRUCTIN…
## $ fine_amount <chr> "65", "95", "45", "0", "0", "0", "115", "45"…
## $ penalty_amount <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ interest_amount <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ reduction_amount <chr> "65", "95", "45", "0", "0", "0", "115", "45"…
## $ payment_amount <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ amount_due <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ precinct <chr> "025", "000", "104", "000", "000", "000", "0…
## $ issuing_agency <chr> "POLICE DEPARTMENT", "POLICE DEPARTMENT", "D…
## $ county <chr> NA, "Q", NA, "Q", NA, "Q", NA, "K", "NY", NA…
## $ violation_status <chr> NA, "HEARING HELD-NOT GUILTY", NA, "HEARING …
## $ issue_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ judgment_entry_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ summons_image.url <chr> "http://nycserv.nyc.gov/NYCServWeb/ShowImage…
## $ summons_image.description <chr> "View Summons", "View Summons", "View Summon…
# Convert the amount columns from character to numeric
camera <- camera %>%
  mutate(across(
    c("fine_amount","interest_amount","reduction_amount","payment_amount",
      "amount_due","penalty_amount"),
    ~as.numeric(.)
  ))
# Keep rows whose issue_date is an ISO timestamp, then parse it as a Date
camera <- camera %>%
  filter(str_detect(issue_date, "^\\d{4}-\\d{2}-\\d{2}T"))
camera$issue_date <- as.Date(camera$issue_date)
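The date filter above keeps only rows whose `issue_date` starts with an ISO-style timestamp, which also drops missing values. A minimal base R sketch of that behavior on a few made-up strings (the values in `raw_dates` are hypothetical, not taken from the dataset) might look like this:

```r
# Hypothetical examples of what issue_date strings could look like
raw_dates <- c("2024-03-15T00:00:00.000", NA, "03/15/2024", "")

# TRUE only for non-missing strings that start with YYYY-MM-DDT
is_iso <- !is.na(raw_dates) &
  grepl("^\\d{4}-\\d{2}-\\d{2}T", raw_dates, perl = TRUE)

# as.Date() reads the leading date and ignores the trailing time part
parsed <- as.Date(raw_dates[is_iso], format = "%Y-%m-%d")

print(is_iso)
print(parsed)
```

Comparing `sum(is_iso)` with the length of the input is a quick way to see how many rows such a filter would remove.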
ggplot(camera, aes(x = issuing_agency, y = payment_amount)) +
  geom_boxplot(fill = "steelblue", color = "gray30") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Payment Amount by Issuing Agency",
       x = "Issuing Agency", y = "Payment Amount ($)")
favstats(payment_amount ~ issuing_agency, data = camera) %>%
  arrange(desc(mean))
##    issuing_agency                        min     Q1 median       Q3    max      mean        sd     n missing
##  1 HEALTH DEPARTMENT POLICE           243.81 243.81 243.81 243.8100 243.81 243.81000        NA     1       0
##  2 SEA GATE ASSOCIATION POLICE        190.00 190.00 190.00 190.0000 190.00 190.00000   0.00000     2       0
##  3 FIRE DEPARTMENT                    180.00 180.00 180.00 180.0000 180.00 180.00000        NA     1       0
##  4 NYS OFFICE OF MENTAL HEALTH POLICE   0.00 180.00 180.00 190.0000 210.00 161.33333  65.99423    15       0
##  5 PORT AUTHORITY                       0.00 180.00 180.00 190.0000 242.76 150.49319  80.53742    47       0
##  6 ROOSEVELT ISLAND SECURITY            0.00 135.00 180.00 190.0000 246.68 149.16083  90.57967    24       0
##  7 NYS PARKS POLICE                     0.00   0.00 180.00 190.0000 242.58 142.50970  90.27092    33       0
##  8 POLICE DEPARTMENT                    0.00  65.00 180.00 190.0000 260.00 136.71574  82.82498   190       0
##  9 PARKS DEPARTMENT                     0.00  90.00 180.00 190.0000 245.28 128.47736  78.92728   144       0
## 10 TAXI AND LIMOUSINE COMMISSION      125.00 125.00 125.00 125.0000 125.00 125.00000        NA     1       0
## 11 HEALTH AND HOSPITAL CORP. POLICE     0.00   0.00 180.00 190.0000 245.64 124.71373  98.60130    51       0
## 12 CON RAIL                             0.00   0.00  95.00 228.8875 243.87 112.62000 124.87146     6       0
## 13 DEPARTMENT OF TRANSPORTATION         0.00  50.00  75.00 125.0000 690.04  99.52822  82.88394 87273       0
## 14 TRAFFIC                              0.00  65.00 115.00 115.0000 245.79  94.59362  44.47453 12091       0
## 15 TRANSIT AUTHORITY                    0.00   0.00  75.00 125.0000 190.00  78.00000  82.05181     5       0
## 16 DEPARTMENT OF SANITATION             0.00  48.75  65.00 115.0000 115.00  66.25000  45.48351    12       0
## 17 LONG ISLAND RAILROAD                 0.00   0.00   0.00   0.0000   0.00   0.00000        NA     1       0
anova_agency <- aov(payment_amount ~ issuing_agency, data = camera)
summary(anova_agency)
## Df Sum Sq Mean Sq F value Pr(>F)
## issuing_agency 16 1063435 66465 10.59 <2e-16 ***
## Residuals 99880 627060364 6278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_agency)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ issuing_agency
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Model (error reduced) | 1063434.678 16 66464.667 10.587 .0017 .0000
## Error (from model) | 627060364.280 99880 6278.137
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Total (empty model) | 628123798.957 99896 6287.777
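The ANOVA gives F(16, 99880) = 10.59 with p < 2e-16, so mean payment
amounts differ significantly across issuing agencies.
However, the PRE (Proportional Reduction in Error) shows how much of the
variation in payment amounts the model explains, and here it is only
.0017 (about 0.2%). A PRE below roughly 0.05 means minimal real-world
impact.
Two agencies, DEPARTMENT OF TRANSPORTATION (n = 87,273) and TRAFFIC
(n = 12,091), issued nearly all of the tickets, and several agencies
appear only once, so the means for the small agencies are unstable.
Any differences likely reflect agency-specific violation types rather
than behavioral differences.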
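To make the two statistics concrete, PRE and F can be recomputed by hand from the sums of squares in the supernova table. The sketch below copies those figures from the output above; the variable names are ours, and the formulas are the standard definitions (PRE = SS model / SS total, F = MS model / MS error), not anything specific to this report.

```r
# Sums of squares and degrees of freedom copied from the supernova() table
ss_model <- 1063434.678
ss_error <- 627060364.280
ss_total <- 628123798.957
df_model <- 16
df_error <- 99880

# PRE: share of the empty model's error that the agency model removes
pre <- ss_model / ss_total

# F: explained variance per model df over leftover variance per error df
f_stat <- (ss_model / df_model) / (ss_error / df_error)

print(round(c(PRE = pre, F = f_stat), 4))
```

Rounded, these should match the .0017 and 10.587 reported in the table. The comparison shows why a highly significant F can coexist with a tiny effect: with almost 100,000 error degrees of freedom, even a model that removes 0.2% of the error is easy to detect.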
ggplot(camera %>% filter(state %in% c("NY","NJ","CT")),
       aes(x = state, y = payment_amount)) +
  geom_boxplot(fill = "tan", color = "gray30") +
  theme_minimal() +
  labs(title = "Payment Amount by Driver State (Tri-State Area)",
       x = "Driver State", y = "Payment Amount ($)")
favstats(payment_amount ~ state, data = camera) %>%
  filter(state %in% c("NY","NJ","CT")) %>%
  arrange(desc(mean))
## state min Q1 median Q3 max mean sd n missing
## 1 NJ 0 50 75 115 682.35 101.5746 89.97170 8654 0
## 2 NY 0 50 75 125 690.04 101.0978 80.92861 79528 0
## 3 CT 0 50 75 100 276.57 80.6627 46.07849 1457 0
tri_state <- camera %>% filter(state %in% c("NY","NJ","CT"))
anova_state <- aov(payment_amount ~ state, data = tri_state)
summary(anova_state)
## Df Sum Sq Mean Sq F value Pr(>F)
## state 2 603061 301530 45.5 <2e-16 ***
## Residuals 89636 593994009 6627
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_state)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ state
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- ---------- ------ ----- -----
## Model (error reduced) | 603060.721 2 301530.360 45.502 .0010 .0000
## Error (from model) | 593994008.724 89636 6626.735
## ----- --------------- | ------------- ----- ---------- ------ ----- -----
## Total (empty model) | 594597069.446 89638 6633.315
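The ANOVA is significant (F(2, 89636) = 45.5, p < 2e-16), so mean payment
amounts differ among NY, NJ, and CT drivers.
The group means show where the difference lies: NJ ($101.57) and NY
($101.10) drivers pay almost the same on average, while CT drivers pay
less ($80.66). Out-of-state drivers therefore do not pay systematically
more, and the data give no sign of extra processing delays or penalties
for them.
PRE is .0010, so state explains about 0.1% of the variation in payment
amounts, and the practical difference is negligible.
The firm would have reason to focus marketing on out-of-state drivers
only if they tended to pay higher amounts, and these results do not
show that.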
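An overall ANOVA says that at least one group differs but not which one. One common follow-up, which the original analysis does not run, is Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below uses simulated data: the group sizes, means, and spreads are invented for illustration and only loosely echo the summary table, so its output says nothing about the real tickets.

```r
set.seed(42)

# Simulated payments (hypothetical values, not the NYC data)
sim <- data.frame(
  state = factor(rep(c("NY", "NJ", "CT"), times = c(300, 100, 50))),
  payment_amount = pmax(0, c(rnorm(300, mean = 101, sd = 80),
                             rnorm(100, mean = 102, sd = 90),
                             rnorm(50,  mean = 81,  sd = 46)))
)

# One-way ANOVA, then all pairwise comparisons with adjusted p-values
fit <- aov(payment_amount ~ state, data = sim)
tukey <- TukeyHSD(fit)

# One row per pair: difference in means, confidence bounds, adjusted p
print(tukey$state)
```

On the real data the equivalent call would presumably be `TukeyHSD(anova_state)`, since `anova_state` is an `aov` object.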
# Expand the single-letter and short county codes to full county names
camera <- camera %>%
  mutate(county = case_when(
    county == "K"  ~ "Kings County",
    county == "Q"  ~ "Queens County",
    county == "BX" ~ "Bronx County",
    county == "NY" ~ "New York County",
    county == "R"  ~ "Richmond County",
    TRUE ~ county
  ))
ggplot(camera %>% filter(!is.na(county) & county != ""),
       aes(x = county, y = payment_amount)) +
  geom_boxplot(fill = "lightgreen", color = "gray30") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Payment Amount by County",
       x = "County", y = "Payment Amount ($)")
favstats(payment_amount ~ county, data = camera) %>%
  arrange(desc(mean))
##    county          min  Q1 median     Q3    max      mean        sd     n missing
##  1 RICH            180 180    180 180.00 180.00 180.00000        NA     1       0
##  2 Richmond County   0  65    180 180.00 245.79 139.67920  80.35405   863       0
##  3 Bronx           115 115    115 115.00 115.00 115.00000        NA     1       0
##  4 Qns             115 115    115 115.00 115.00 115.00000        NA     1       0
##  5 BK                0  50     75 100.00 690.04 113.54971 131.50278 14560       0
##  6 Queens County     0  65    115 125.00 244.46 102.35114  52.58054   983       0
##  7 MN                0  50     50 125.06 281.80 100.54274  73.46670 14518       0
##  8 Bronx County      0  65     75 160.00 245.64 100.32037  67.45720   243       0
##  9 New York County   0  65    115 115.00 260.00  92.95323  38.30536  8950       0
## 10 Kings County      0  65     65 115.00 243.81  86.09225  49.12610  1547       0
## 11 QN                0  50     50 100.00 283.03  82.35782  60.30923 16373       0
## 12 ST                0  50     50  75.00 250.00  69.66361  45.80596   485       0
## 13 Kings             0   0      0   0.00   0.00   0.00000        NA     1       0
county_clean <- camera %>% filter(!is.na(county) & county != "")
anova_county <- aov(payment_amount ~ county, data = county_clean)
summary(anova_county)
## Df Sum Sq Mean Sq F value Pr(>F)
## county 12 9978556 831546 116.7 <2e-16 ***
## Residuals 58513 416929615 7125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_county)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ county
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- ---------- ------- ----- -----
## Model (error reduced) | 9978556.010 12 831546.334 116.701 .0234 .0000
## Error (from model) | 416929614.778 58513 7125.419
## ----- --------------- | ------------- ----- ---------- ------- ----- -----
## Total (empty model) | 426908170.788 58525 7294.458
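The ANOVA is significant (F(12, 58513) = 116.7, p < 2e-16), so payment
amounts vary by county.
This could reflect local enforcement intensity or differences in
violation types.
PRE is .0234, so county explains about 2.3% of the variation. That is
still small, but it is more than ten times the PRE for issuing agency
(.0017) or state (.0010), which makes county the most useful of the
three predictors for marketing strategy.
One caveat: the recoding above does not cover every label in the data,
so the same borough appears under several names (for example "BK",
"Kings", and "Kings County"), and the 13 groups in this model are not
13 distinct counties.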
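The county summary lists labels such as "BK", "MN", "QN", "ST", "RICH", "Qns", "Bronx", and "Kings" next to the recoded full names, so a fuller recoding would be needed before the groups can be read as counties. The base R sketch below shows one way to do it. The mapping is our assumption about what each abbreviation means (for example BK as Brooklyn, i.e. Kings County, and ST as Staten Island, i.e. Richmond County) and should be checked against the dataset's documentation; `county_lookup` and `harmonize_county()` are hypothetical names.

```r
# Assumed mapping from every label seen in the output to one county name
county_lookup <- c(
  "K"  = "Kings County",    "BK"   = "Kings County",    "Kings" = "Kings County",
  "Q"  = "Queens County",   "QN"   = "Queens County",   "Qns"   = "Queens County",
  "BX" = "Bronx County",    "Bronx" = "Bronx County",
  "NY" = "New York County", "MN"   = "New York County",
  "R"  = "Richmond County", "ST"   = "Richmond County", "RICH"  = "Richmond County"
)

# Replace known labels; leave unknown labels and NA untouched
harmonize_county <- function(x) {
  mapped <- unname(county_lookup[x])
  ifelse(is.na(mapped), x, mapped)
}

raw <- c("K", "BK", "Kings", "QN", "Qns", "MN", "NY", "ST", "RICH",
         "R", "BX", "Bronx", "Kings County", NA)
cleaned <- harmonize_county(raw)
print(table(cleaned, useNA = "ifany"))
```

Applied to the `county` column before the summary and ANOVA, a mapping like this would collapse the 13 labels into five counties, and the resulting means, F, and PRE would differ from the figures reported above.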
Across all three analyses, issuing agency, driver state, and county show
statistically significant differences in payment amounts, largely
because the dataset is very large (nearly 100,000 tickets), which makes
even tiny differences detectable.
The effect sizes are small, however: PRE is .0017 for issuing agency,
.0010 for state, and .0234 for county. Only county comes close to a
meaningful difference, plausibly one related to enforcement or
geographic patterns.
The law firm should prioritize county in its marketing strategy,
focusing advertising and outreach in areas with higher average payment
amounts.
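As a rough illustration of how that prioritization could be drawn from the county summary, the sketch below re-enters the group means and counts from the favstats output above, drops labels with too few tickets to trust, and ranks the rest by mean payment. The cutoff `min_n` is a hypothetical choice, and the labels are used exactly as they appear in the output, so several of them may refer to the same borough.

```r
# Means and counts copied from the favstats() output by county
county_stats <- data.frame(
  county = c("RICH", "Richmond County", "Bronx", "Qns", "BK", "Queens County",
             "MN", "Bronx County", "New York County", "Kings County", "QN",
             "ST", "Kings"),
  mean = c(180.00000, 139.67920, 115.00000, 115.00000, 113.54971, 102.35114,
           100.54274, 100.32037, 92.95323, 86.09225, 82.35782, 69.66361,
           0.00000),
  n = c(1, 863, 1, 1, 14560, 983, 14518, 243, 8950, 1547, 16373, 485, 1),
  stringsAsFactors = FALSE
)

min_n <- 100  # hypothetical minimum group size for a stable mean

# Keep groups with enough tickets, then sort by mean payment, highest first
reliable <- county_stats[county_stats$n >= min_n, ]
ranked <- reliable[order(-reliable$mean), ]
print(ranked, row.names = FALSE)
```

Filtering on group size first keeps single-ticket labels such as "RICH" from topping the list on the strength of one observation.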