I am a data analyst working for a law firm that analyzes NYC parking violation data. In the following analysis, I explore whether different agencies issue meaningfully different fine amounts, whether drivers from different states generally pay meaningfully different fine amounts, and whether certain counties tend to issue meaningfully different fine amounts. You can look through the dataset I am using here.
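# Packages: httr/jsonlite for the API call, ggplot2/dplyr for plots and pipes, mosaic for favstats(), supernova for supernova()
library(httr)
library(jsonlite)
library(ggplot2)
library(dplyr)
library(mosaic)
library(supernova)
endpoint <- "https://data.cityofnewyork.us/resource/nc67-uf89.json"
resp <- GET(endpoint, query = list(
  "$limit" = 99999,
  "$order" = "issue_date DESC"
))
cameradata <- fromJSON(content(resp, as = "text"), flatten = TRUE)
cameradata$payment_amount <- as.numeric(gsub("[^0-9.]", "", cameradata$payment_amount))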
ggplot(cameradata, aes(x = issuing_agency, y = payment_amount)) +
  geom_boxplot() +
  coord_flip()
favstats(payment_amount ~ issuing_agency, data = cameradata) %>%
  arrange(desc(mean))
## issuing_agency min Q1 median Q3 max
## 1 HEALTH DEPARTMENT POLICE 243.81 243.810 243.81 243.8100 243.81
## 2 SEA GATE ASSOCIATION POLICE 190.00 190.000 190.00 190.0000 190.00
## 3 FIRE DEPARTMENT 180.00 180.000 180.00 180.0000 180.00
## 4 NYS OFFICE OF MENTAL HEALTH POLICE 0.00 180.000 180.00 190.0000 210.00
## 5 ROOSEVELT ISLAND SECURITY 0.00 135.000 180.00 190.0000 246.68
## 6 PORT AUTHORITY 0.00 180.000 180.00 190.0000 242.76
## 7 NYS PARKS POLICE 0.00 45.000 180.00 190.0000 242.58
## 8 PARKS DEPARTMENT 0.00 90.000 180.00 190.0000 245.28
## 9 TAXI AND LIMOUSINE COMMISSION 125.00 125.000 125.00 125.0000 125.00
## 10 HEALTH AND HOSPITAL CORP. POLICE 0.00 0.000 180.00 190.0000 245.64
## 11 POLICE DEPARTMENT 0.00 0.000 180.00 190.0000 260.00
## 12 CON RAIL 0.00 0.000 95.00 228.8875 243.87
## 13 DEPARTMENT OF TRANSPORTATION 0.00 50.000 75.00 125.0000 690.04
## 14 TRAFFIC 0.00 65.000 115.00 115.0000 245.79
## 15 OTHER/UNKNOWN AGENCIES 0.00 40.115 80.23 120.3450 160.46
## 16 TRANSIT AUTHORITY 0.00 0.000 75.00 125.0000 190.00
## 17 SUNY MARITIME COLLEGE 65.00 65.000 65.00 65.0000 65.00
## 18 NYC OFFICE OF THE SHERIFF 0.00 28.750 57.50 86.2500 115.00
## 19 DEPARTMENT OF SANITATION 0.00 0.000 65.00 105.0000 115.00
## 20 LONG ISLAND RAILROAD 0.00 0.000 0.00 0.0000 0.00
## mean sd n missing
## 1 243.81000 NA 1 0
## 2 190.00000 0.00000 2 0
## 3 180.00000 NA 1 0
## 4 161.33333 65.99423 15 0
## 5 149.16083 90.57967 24 0
## 6 147.35792 82.58394 48 0
## 7 143.86176 89.24158 34 0
## 8 128.47736 78.92728 144 0
## 9 125.00000 NA 1 0
## 10 124.71373 98.60130 51 0
## 11 123.93855 88.00388 214 0
## 12 112.62000 124.87146 6 0
## 13 99.52822 82.88394 87273 0
## 14 94.59362 44.47453 12091 0
## 15 80.23000 113.46235 2 0
## 16 78.00000 82.05181 5 0
## 17 65.00000 NA 1 0
## 18 57.50000 81.31728 2 0
## 19 56.78571 48.26239 14 0
## 20 0.00000 NA 1 0
anova_model <- aov(payment_amount ~ issuing_agency, data = cameradata)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## issuing_agency 19 937675 49351 7.858 <2e-16 ***
## Residuals 99910 627464684 6280
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 69 observations deleted due to missingness
supernova(anova_model)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ issuing_agency
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- --------- ----- ----- -----
## Model (error reduced) | 937675.432 19 49351.339 7.858 .0015 .0000
## Error (from model) | 627464683.951 99910 6280.299
## ----- --------------- | ------------- ----- --------- ----- ----- -----
## Total (empty model) | 628402359.383 99929 6288.488
The model sum of squares, 937,675.432, is a minuscule proportion of the total sum of squares (628,402,359.383), so issuing agency accounts for only a very small fraction of the variance in payment amount. The effect is statistically significant, F(19, 99,910) = 7.86, p < .001, but with nearly 100,000 observations even tiny differences between groups will reach significance. The PRE is .0015, meaning the model explains only about 0.15% of the variance. Consequently, it is probably not worthwhile for the law firm to take issuing agency into account when creating their marketing strategy (unless they have some underlying philosophical or ideological reason for doing so, divorced from these statistical analyses).
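As a quick check on the PRE reported above, the following sketch recomputes it by hand from the two sums of squares in the supernova() table. The variable names are illustrative only; the numbers are copied from the output above.

```r
# Sums of squares copied from the supernova() table for issuing_agency
ss_model <- 937675.432
ss_total <- 628402359.383

# PRE (proportional reduction in error) = SS model / SS total
pre <- ss_model / ss_total

round(pre, 4)        # 0.0015, matching the PRE column
round(pre * 100, 2)  # 0.15, i.e. about 0.15% of the variance
```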
ggplot(cameradata, aes(x = state, y = payment_amount)) +
  geom_boxplot() +
  coord_flip()
favstats(payment_amount ~ state, data = cameradata) %>%
  arrange(desc(mean))
## state min Q1 median Q3 max mean sd n missing
## 1 OK 0.00 50.00 200.00 250.0000 250.00 162.19719 88.522638 160 0
## 2 ON 115.00 115.00 120.00 130.0000 145.00 125.00000 14.142136 4 0
## 3 QB 115.00 115.00 115.00 125.0000 125.00 118.75000 5.175492 8 0
## 4 NB 115.00 115.00 115.00 115.0000 115.00 115.00000 NA 1 0
## 5 AR 50.00 50.00 100.00 150.0000 250.00 113.30731 72.563803 67 0
## 6 WA 0.00 50.00 50.00 125.0000 275.00 109.09091 92.114522 33 0
## 7 TX 0.00 50.00 75.04 126.4025 277.06 104.12010 69.855661 312 0
## 8 DC 50.00 75.43 115.00 117.6800 145.00 102.66700 29.610797 20 0
## 9 NJ 0.00 50.00 75.00 115.0000 682.35 101.57462 89.971702 8654 3
## 10 NY 0.00 50.00 75.00 125.0000 690.04 101.09015 80.930148 79541 10
## 11 IN 0.00 67.50 115.00 115.0000 250.00 99.16667 50.520663 42 0
## 12 MN 0.00 50.00 75.00 107.5000 250.00 91.05847 68.580471 59 0
## 13 OH 0.00 50.00 75.00 115.0000 281.80 90.77151 65.548205 299 0
## 14 MT 50.00 50.00 87.50 100.0000 225.00 90.62500 43.671513 24 0
## 15 AL 0.00 50.00 75.00 115.0000 277.06 89.53567 56.218191 97 0
## 16 NC 0.00 50.00 75.00 115.0000 275.89 88.74886 57.680647 484 1
## 17 IL 0.00 50.00 75.00 100.0000 275.00 86.22200 54.900047 265 0
## 18 PA 0.00 50.00 75.00 100.0000 283.57 85.92090 53.933428 2977 2
## 19 IA 50.00 50.00 75.00 93.7600 175.00 85.00400 44.408710 10 0
## 20 VA 0.00 50.00 50.00 115.0000 275.00 82.70679 53.216823 527 0
## 21 SC 0.00 50.00 75.02 100.0000 250.00 82.61794 41.265398 194 0
## 22 GA 0.00 50.00 50.00 100.0000 275.62 82.57126 63.360707 302 0
## 23 MD 0.00 50.00 50.00 100.0000 250.00 81.02126 46.705884 413 0
## 24 CT 0.00 50.00 75.00 100.0000 276.57 80.66270 46.078493 1457 2
## 25 DE 0.00 50.00 75.00 75.4625 275.00 79.71512 49.576008 84 1
## 26 FL 0.00 50.00 50.00 100.0000 276.10 79.26281 50.883529 1654 2
## 27 AZ 0.00 50.00 50.00 100.0000 250.00 79.14683 50.917069 556 0
## 28 MO 0.00 50.00 50.00 75.1900 250.00 78.81636 57.999183 33 0
## 29 MA 0.00 50.00 50.00 100.0000 278.02 78.02744 48.262245 735 0
## 30 VT 0.00 50.00 75.00 75.7550 200.00 77.40515 41.129903 68 0
## 31 MS 0.00 50.00 75.16 115.0000 125.87 76.78111 42.988707 9 0
## 32 AK 75.95 75.95 75.95 75.9500 75.95 75.95000 NA 1 0
## 33 NH 50.00 50.00 50.00 100.0000 178.39 75.04704 31.790066 54 0
## 34 LA 50.00 50.00 50.00 76.4375 241.31 73.36333 41.807692 24 0
## 35 CA 0.00 50.00 50.00 100.0000 275.00 73.04461 52.607199 128 0
## 36 WI 0.00 50.00 50.00 115.0000 125.00 70.62500 44.460840 24 0
## 37 ME 0.00 50.00 50.00 75.4950 250.00 69.10433 37.054284 67 0
## 38 MI 0.00 50.00 50.00 75.0300 225.06 68.87076 35.774572 118 1
## 39 RI 0.00 50.00 50.00 75.5925 241.36 68.77096 36.502474 104 0
## 40 WV 50.00 50.00 50.00 75.6900 125.72 66.91444 25.274199 9 0
## 41 NV 50.00 50.00 50.00 75.0000 125.00 66.47059 26.325172 17 0
## 42 TN 50.00 50.00 50.00 75.0000 180.00 66.27884 30.075361 95 0
## 43 NE 0.00 50.00 50.00 85.0000 180.00 66.25000 51.527795 12 0
## 44 CO 0.00 50.00 50.00 75.0000 125.00 64.51613 28.992954 31 0
## 45 KY 50.00 50.00 50.00 75.0000 125.00 63.41818 25.188157 33 0
## 46 OR 50.00 50.00 50.00 61.2500 125.00 63.01793 23.969258 58 0
## 47 NM 50.00 50.00 50.00 63.1050 76.21 58.73667 15.132351 3 0
## 48 SD 0.00 50.00 62.50 75.0000 125.00 55.36929 35.604580 14 0
## 49 KS 0.00 12.50 50.00 87.5000 115.00 52.50000 48.347699 6 0
## 50 ID 50.00 50.00 50.00 50.0000 50.00 50.00000 NA 1 0
## 51 ND 50.00 50.00 50.00 50.0000 50.00 50.00000 NA 1 0
## 52 DP 0.00 0.00 0.00 115.0000 115.00 49.28571 61.470086 7 0
## 53 UT 0.00 50.00 50.00 50.0000 50.00 38.88889 22.047928 9 0
## 54 99 0.00 0.00 0.00 0.0000 190.00 20.51724 46.605196 29 43
anova_model <- aov(payment_amount ~ state, data = cameradata)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## state 53 4867057 91831 14.71 <2e-16 ***
## Residuals 99880 623567686 6243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 65 observations deleted due to missingness
supernova(anova_model)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ state
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Model (error reduced) | 4867056.569 53 91831.256 14.709 .0077 .0000
## Error (from model) | 623567685.703 99880 6243.169
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Total (empty model) | 628434742.273 99933 6288.561
The model sum of squares, 4,867,056.57, is a small portion of the total sum of squares (628,434,742.27), so state accounts for only a small fraction of the overall variance in payment amount. The effect is clearly statistically significant, F(53, 99,880) = 14.71, p < .001. However, the PRE is .0077, meaning that the model explains less than 1% of the variance (about 0.77%). Consequently, it is probably not worthwhile for the law firm to emphasize state in their campaign.
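The F statistic quoted above is the ratio of the model mean square to the error mean square. The sketch below rebuilds it from the sums of squares and degrees of freedom in the table, then uses base R's pf() to confirm that the upper-tail p-value is below .001. The variable names are my own, not part of the original analysis.

```r
# Values copied from the supernova() table for state
ss_model <- 4867056.569
ss_error <- 623567685.703
df_model <- 53
df_error <- 99880

# Mean squares are sums of squares divided by their degrees of freedom
ms_model <- ss_model / df_model
ms_error <- ss_error / df_error

# F is the ratio of the two mean squares
f_stat <- ms_model / ms_error
round(f_stat, 2)  # 14.71

# Upper-tail probability of an F at least this large under the empty model
p_value <- pf(f_stat, df_model, df_error, lower.tail = FALSE)
p_value < 0.001   # TRUE
```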
# Collapse the different abbreviations used for each borough into one county label
cameradata <- cameradata %>%
  mutate(county = case_when(
    county %in% c("K", "BK", "Kings") ~ "Kings County",
    county %in% c("Q", "QN", "Qns") ~ "Queens County",
    county %in% c("BX", "B", "Bronx", "BRONX") ~ "Bronx County",
    county %in% c("R", "ST", "SI", "RICH") ~ "Richmond County",
    county %in% c("NY", "N", "MN") ~ "New York County",
    TRUE ~ county  # leave any other value (including NA) unchanged
  ))
ggplot(cameradata, aes(x = county, y = payment_amount)) +
  geom_boxplot() +
  coord_flip()
favstats(payment_amount ~ county, data = cameradata) %>%
  arrange(desc(mean))
## county min Q1 median Q3 max mean sd n missing
## 1 Richmond County 0 50 125 180 250.00 114.53669 77.55385 1349 0
## 2 Kings County 0 50 75 115 690.04 110.88983 126.20448 16112 0
## 3 Bronx County 0 65 75 145 245.64 99.65870 67.53373 247 0
## 4 New York County 0 50 75 115 281.80 97.62502 62.55866 23479 0
## 5 Queens County 0 50 50 100 283.03 83.46501 60.08515 17366 0
anova_model <- aov(payment_amount ~ county, data = cameradata)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## county 4 6702478 1675619 233.4 <2e-16 ***
## Residuals 58548 420413471 7181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 41446 observations deleted due to missingness
supernova(anova_model)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ county
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- ----------- ------- ----- -----
## Model (error reduced) | 6702477.756 4 1675619.439 233.352 .0157 .0000
## Error (from model) | 420413471.103 58548 7180.663
## ----- --------------- | ------------- ----- ----------- ------- ----- -----
## Total (empty model) | 427115948.859 58552 7294.643
The model sum of squares, 6,702,477.756, is a small proportion of the total sum of squares (427,115,948.859), so county accounts for only a small fraction of the variance. Note that 41,446 observations were dropped because county was missing, so this model is fit on 58,553 tickets rather than the full sample. The effect is, once again, statistically significant, F(4, 58,548) = 233.35, p < .001. The PRE is .0157, so 1.57% of the overall variance is explained by county. It is, therefore, likely not a meaningful metric for the law firm to use in their marketing campaign.
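A large F alongside a small PRE can look contradictory, so here is a minimal sketch of how the two are linked. F can be written as (PRE / df model) / ((1 - PRE) / df error), so for a fixed PRE it grows with the error degrees of freedom, that is, with sample size. The figures come from the county table above; the 500 error degrees of freedom in the second calculation is a hypothetical value chosen only for contrast.

```r
# Values copied from the supernova() table for county
ss_model <- 6702477.756
ss_total <- 427115948.859
df_model <- 4
df_error <- 58548

pre <- ss_model / ss_total

# F rebuilt from PRE and the degrees of freedom
f_from_pre <- (pre / df_model) / ((1 - pre) / df_error)
round(f_from_pre, 1)  # 233.4, matching the table

# Same PRE, but with a hypothetical 500 error degrees of freedom
f_small_sample <- (pre / df_model) / ((1 - pre) / 500)
f_small_sample        # roughly 2
```

The same 1.57% of explained variance gives an F near 233 in this sample but only about 2 in the smaller hypothetical one, which is why PRE, not the p-value, drives the recommendation.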
From a statistical standpoint, I cannot, in good conscience, recommend that the law firm use any of the three variables explored in this analysis as central components of their marketing campaign. Though all three are statistically significant, none of them accounts for more than 2% of the overall variance in payment amount. That being said, if the firm had to use one of these variables, it should choose county, since it is the only one that accounts for more than 1% of the variance. Violation type (as explored in a previous analysis) accounts for 33% of the variance, so it is the best metric for the firm to use in their campaign.
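To put the comparison side by side, the sketch below collects the PRE values reported in this analysis, plus the 33% figure quoted for violation type from the earlier analysis, and ranks them. The data frame and its column names are illustrative, and "violation type" is a descriptive label rather than the actual column name in the dataset.

```r
# PRE values as reported above; violation type comes from the earlier analysis
pre_summary <- data.frame(
  predictor = c("issuing_agency", "state", "county", "violation type"),
  pre = c(0.0015, 0.0077, 0.0157, 0.33)
)

# Express PRE as a percentage and rank predictors from most to least explained
pre_summary$percent_explained <- pre_summary$pre * 100
pre_summary <- pre_summary[order(pre_summary$pre, decreasing = TRUE), ]
pre_summary
```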