I am a very impressive and smart data scientist working for an extremely prestigious law firm. The law firm specializes in fighting parking and camera tickets. I was hired to uncover hidden patterns in NYC violation data. My firm asked me to explore the following three questions: 1. Do certain agencies issue higher payments? 2. Do drivers from different states (NY, NJ, CT) pay more? 3. Do certain counties tend to have higher payment amounts?
Here is the dataset we use: https://data.cityofnewyork.us/City-Government/Open-Parking-and-Camera-Violations/nc67-uf89/about_data
endpoint <- "https://data.cityofnewyork.us/resource/nc67-uf89.json"
resp <- GET(endpoint, query = list(
"$limit" = 99999,
"$order" = "issue_date DESC"
))
camera <- fromJSON(content(resp, as = "text"), flatten = TRUE)
camera<- camera %>%
mutate(across(
c("fine_amount", "interest_amount", "reduction_amount", "payment_amount", "amount_due"),
~as.numeric(.)
))
camera <- camera %>%
mutate(
agency = as.factor(issuing_agency),
plate_state = toupper(state),
county_raw = county
)
Sum of Squares How much variance is explained is represented by 874,511/627,340,303, which is an extremely small portion.
F value and p-value (Is it statistically significant?) F value = 15.47, p < .001. The difference in mean payment amounts across agencies is statistically significant.
PRE (What proportion of variance is explained?) Agency only accounts for about 0.1% of the total variance in payment amount.
While the difference is statistically significant, it is extremely small in magnitude. The typical payment amount varies across agencies, but not by a large amount. I would not recommend that the law firm base its marketing strategies on agency, since the differences across agencies are trivial. See analysis below.
# Many agencies listed have extremely low sample sizes. Let's only analyze agencies with n > 10.
camera_agency_filtered <- camera %>%
filter(!is.na(agency), !is.na(payment_amount)) %>%
group_by(agency) %>%
filter(n() > 10) %>%
ungroup()
ggplot(
camera_agency_filtered %>%
mutate(agency = fct_reorder(agency, payment_amount, .fun = mean)),
aes(x = agency, y = payment_amount)
) +
geom_boxplot(outlier.shape = 16, outlier.size = 1) +
coord_flip() +
theme_minimal() +
labs(
title = "Payment Amounts by Issuing Agency (Ordered by Mean)",
x = "Agency",
y = "Payment Amount ($)"
) +
scale_y_continuous(limits = c(0, 300))
agency_desc <- favstats(payment_amount ~ agency, data = camera_agency_filtered) %>%
arrange(desc(mean))
knitr::kable(agency_desc, caption = "Descriptive statistics of payment amounts by agency")
agency | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
NYS OFFICE OF MENTAL HEALTH POLICE | 0 | 180 | 180 | 190 | 210.00 | 161.33333 | 65.99423 | 15 | 0 |
ROOSEVELT ISLAND SECURITY | 0 | 135 | 180 | 190 | 246.68 | 149.16083 | 90.57967 | 24 | 0 |
PORT AUTHORITY | 0 | 180 | 180 | 190 | 242.76 | 147.35792 | 82.58394 | 48 | 0 |
NYS PARKS POLICE | 0 | 45 | 180 | 190 | 242.58 | 143.86176 | 89.24158 | 34 | 0 |
PARKS DEPARTMENT | 0 | 90 | 180 | 190 | 245.28 | 128.47736 | 78.92728 | 144 | 0 |
HEALTH AND HOSPITAL CORP. POLICE | 0 | 0 | 180 | 190 | 245.64 | 124.71373 | 98.60130 | 51 | 0 |
POLICE DEPARTMENT | 0 | 0 | 180 | 190 | 260.00 | 123.93855 | 88.00388 | 214 | 0 |
DEPARTMENT OF TRANSPORTATION | 0 | 50 | 75 | 125 | 690.04 | 99.52822 | 82.88394 | 87273 | 0 |
TRAFFIC | 0 | 65 | 115 | 115 | 245.79 | 94.59362 | 44.47453 | 12091 | 0 |
DEPARTMENT OF SANITATION | 0 | 0 | 65 | 105 | 115.00 | 56.78571 | 48.26239 | 14 | 0 |
CON RAIL | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
FIRE DEPARTMENT | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
HEALTH DEPARTMENT POLICE | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
LONG ISLAND RAILROAD | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
NYC OFFICE OF THE SHERIFF | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
OTHER/UNKNOWN AGENCIES | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
SEA GATE ASSOCIATION POLICE | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
SUNY MARITIME COLLEGE | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
TAXI AND LIMOUSINE COMMISSION | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
TRANSIT AUTHORITY | NA | NA | NA | NA | NA | NaN | NA | 0 | 0 |
agency_aov <- aov(payment_amount ~ agency, data = camera_agency_filtered)
summary(agency_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## agency 9 874511 97168 15.47 <2e-16 ***
## Residuals 99898 627340303 6280
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(agency_aov)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ agency
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Model (error reduced) | 874511.027 9 97167.892 15.473 .0014 .0000
## Error (from model) | 627340303.332 99898 6279.808
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Total (empty model) | 628214814.359 99907 6287.996
Sum of Squares How much variance is explained is represented by 602,716/594,098,896, which is an extremely small portion.
F value and p-value (Is it statistically significant?) F value = 45.48, p < .001. The difference in mean payment amounts across states is statistically significant.
PRE (What proportion of variance is explained?) State only accounts for about 0.1% of the total variance in payment amount.
Similarly, the difference is statistically significant, but small in magnitude. Payment amount does vary across states, primarily since the mean Connecticut payment is about $20 less than those in New York or New Jersey. However, this may not be very meaningful, since the median is the same across all three states. See analysis below.
camera_states <- camera %>%
filter(state %in% c("NY", "NJ", "CT"), !is.na(payment_amount))
ggplot(camera_states, aes(x = state, y = payment_amount)) +
geom_boxplot(outlier.shape = 16, outlier.size = 1) +
theme_minimal() +
labs(
title = "Payment Amounts by Driver State (NY vs NJ vs CT)",
x = "Driver State",
y = "Payment Amount ($)"
) +
scale_y_continuous(limits = c(0, 300))
state_desc <- favstats(payment_amount ~ state, data = camera_states) %>%
arrange(desc(mean))
knitr::kable(state_desc, caption = "Descriptive statistics of payment amounts by state")
state | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
NJ | 0 | 50 | 75 | 115 | 682.35 | 101.5746 | 89.97170 | 8654 | 0 |
NY | 0 | 50 | 75 | 125 | 690.04 | 101.0902 | 80.93015 | 79541 | 0 |
CT | 0 | 50 | 75 | 100 | 276.57 | 80.6627 | 46.07849 | 1457 | 0 |
state_aov <- aov(payment_amount ~ state, data = camera_states)
summary(state_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## state 2 602716 301358 45.48 <2e-16 ***
## Residuals 89649 594098897 6627
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
state_sn <- supernova(state_aov)
state_sn
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ state
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- ---------- ------ ----- -----
## Model (error reduced) | 602716.142 2 301358.071 45.475 .0010 .0000
## Error (from model) | 594098896.889 89649 6626.944
## ----- --------------- | ------------- ----- ---------- ------ ----- -----
## Total (empty model) | 594701613.031 89651 6633.519
TukeyHSD(state_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = payment_amount ~ state, data = camera_states)
##
## $state
## diff lwr upr p adj
## NJ-CT 20.9119180 15.509134 26.31470 0.0000000
## NY-CT 20.4274502 15.383502 25.47140 0.0000000
## NY-NJ -0.4844678 -2.644085 1.67515 0.8587191
Sum of Squares How much variance is explained is represented by 6,702,478/420,413,471, which is a small portion, but more relevant than the previous analyses.
F value and p-value (Is it statistically significant?) F value = 233.35, p < .001. The difference in mean payment amounts across counties is statistically significant.
PRE (What proportion of variance is explained?) County accounts for about 1.6% of the total variance in payment amount.
The difference is statistically significant and small in magnitude; however, relative to state and agency, county accounts for the most variance by far. A $30-$40 difference in mean payment across counties may represent meaningful differences in how violations are enforced. The law firm may want to increase their marketing presence in Richmond and Kings counties, since their higher payments may motivate people to contest their fines. See analysis below.
clean_county <- function(x) {
x_up <- toupper(trimws(x))
dplyr::case_when(
x_up %in% c("K", "BK", "KING", "KINGS") ~ "Kings County",
x_up %in% c("Q", "QN", "QNS", "QUEEN", "QUEENS") ~ "Queens County",
x_up %in% c("BX", "BRONX") ~ "Bronx County",
x_up %in% c("NY", "MN", "MANHATTAN", "NEW YORK") ~ "New York County",
x_up %in% c("R", "ST", "RICH", "RICHMOND") ~ "Richmond County",
TRUE ~ NA_character_
)
}
camera_county <- camera %>%
mutate(county_clean = clean_county(county)) %>%
filter(!is.na(county_clean))
ggplot(
camera_county %>%
mutate(county_clean = fct_reorder(county_clean, payment_amount, .fun = mean)),
aes(x = county_clean, y = payment_amount)
) +
geom_boxplot(outlier.shape = 16, outlier.size = 1) +
coord_flip() +
theme_minimal() +
labs(
title = "Payment Amounts by County (Ordered by Mean)",
x = "County",
y = "Payment Amount ($)"
) +
scale_y_continuous(limits = c(0, 200))
county_desc <- favstats(payment_amount ~ county_clean, data = camera_county) %>%
arrange(desc(mean))
knitr::kable(county_desc, caption = "Descriptive statistics of payment amounts by county")
county_clean | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
Richmond County | 0 | 50 | 125 | 180 | 250.00 | 114.53669 | 77.55385 | 1349 | 0 |
Kings County | 0 | 50 | 75 | 115 | 690.04 | 110.88983 | 126.20448 | 16112 | 0 |
Bronx County | 0 | 65 | 75 | 145 | 245.64 | 99.65870 | 67.53373 | 247 | 0 |
New York County | 0 | 50 | 75 | 115 | 281.80 | 97.62502 | 62.55866 | 23479 | 0 |
Queens County | 0 | 50 | 50 | 100 | 283.03 | 83.46501 | 60.08515 | 17366 | 0 |
county_aov <- aov(payment_amount ~ county_clean, data = camera_county)
summary(county_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## county_clean 4 6702478 1675619 233.4 <2e-16 ***
## Residuals 58548 420413471 7181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
county_sn <- supernova(county_aov)
county_sn
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ county_clean
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- ----------- ------- ----- -----
## Model (error reduced) | 6702477.756 4 1675619.439 233.352 .0157 .0000
## Error (from model) | 420413471.103 58548 7180.663
## ----- --------------- | ------------- ----- ----------- ------- ----- -----
## Total (empty model) | 427115948.859 58552 7294.643
Among the three variables analyzed, the firm should prioritize county in its marketing efforts. All three variables predicted differences in payment amount, however the differences predicted by state and agency were not large enough to warrant a change in marketing strategies. While the difference in paymount amount predicted by county were still modest, they were relatively large compared to the other variables analyzed. If the firm is going to change their marketing strategies due to one of these variables, it should be county.