Today, like the data scientist that I am, I am going to accomplish a monumental task that was assignment to me by my Law Firm. My goal is to investigate and asnwer three important questions.
endpoint<-"https://data.cityofnewyork.us/resource/nc67-uf89.json"
resp <- GET(endpoint, query = list(
"$limit" = 99999,
"$order" = "issue_date DESC"
))
camera <- fromJSON(content(resp, as = "text"), flatten = TRUE)
camera<- camera %>%
mutate(across(
c("fine_amount","interest_amount","reduction_amount","payment_amount","amount_due","penalty_amount"),
~as.numeric(.)
))
Changing the contents that are numeric into numeric.
camera<- camera %>%
filter(str_detect(issue_date,"^\\d{4}-\\d{2}-\\d{2}T"))
camera$issue_date<- as.Date(camera$issue_date)
camera$day_of_week<- weekdays(camera$issue_date)
camera<- camera %>%
mutate(
violation_time_clean= str_replace(violation_time,"(A|P)$"," \\1M"),
violation_time_24h= format(strptime(violation_time_clean, "%I:%M %p"), "%H:%M"))
camera<- camera %>%
mutate(
violation_time_24h_numeric = as.numeric(substr(violation_time_24h, 1, 2)),
time_of_day = case_when(
violation_time_24h_numeric >= 05 & violation_time_24h_numeric < 12 ~ "Morning",
violation_time_24h_numeric >= 12 & violation_time_24h_numeric < 17 ~ "Afternoon",
violation_time_24h_numeric >=17 & violation_time_24h_numeric < 24 ~ "Night"
)
)
favstats(payment_amount ~ issuing_agency, data = camera) %>% arrange(desc(mean))
## issuing_agency min Q1 median Q3 max
## 1 HEALTH DEPARTMENT POLICE 243.81 243.81 243.81 243.8100 243.81
## 2 SEA GATE ASSOCIATION POLICE 190.00 190.00 190.00 190.0000 190.00
## 3 FIRE DEPARTMENT 180.00 180.00 180.00 180.0000 180.00
## 4 NYS OFFICE OF MENTAL HEALTH POLICE 0.00 180.00 180.00 190.0000 210.00
## 5 PORT AUTHORITY 0.00 180.00 180.00 190.0000 242.76
## 6 ROOSEVELT ISLAND SECURITY 0.00 135.00 180.00 190.0000 246.68
## 7 NYS PARKS POLICE 0.00 0.00 180.00 190.0000 242.58
## 8 POLICE DEPARTMENT 0.00 65.00 180.00 190.0000 260.00
## 9 PARKS DEPARTMENT 0.00 90.00 180.00 190.0000 245.28
## 10 TAXI AND LIMOUSINE COMMISSION 125.00 125.00 125.00 125.0000 125.00
## 11 HEALTH AND HOSPITAL CORP. POLICE 0.00 0.00 180.00 190.0000 245.64
## 12 CON RAIL 0.00 0.00 95.00 228.8875 243.87
## 13 DEPARTMENT OF TRANSPORTATION 0.00 50.00 75.00 125.0000 690.04
## 14 TRAFFIC 0.00 65.00 115.00 115.0000 245.79
## 15 TRANSIT AUTHORITY 0.00 0.00 75.00 125.0000 190.00
## 16 DEPARTMENT OF SANITATION 0.00 48.75 65.00 115.0000 115.00
## 17 LONG ISLAND RAILROAD 0.00 0.00 0.00 0.0000 0.00
## mean sd n missing
## 1 243.81000 NA 1 0
## 2 190.00000 0.00000 2 0
## 3 180.00000 NA 1 0
## 4 161.33333 65.99423 15 0
## 5 150.49319 80.53742 47 0
## 6 149.16083 90.57967 24 0
## 7 142.50970 90.27092 33 0
## 8 136.71574 82.82498 190 0
## 9 128.47736 78.92728 144 0
## 10 125.00000 NA 1 0
## 11 124.71373 98.60130 51 0
## 12 112.62000 124.87146 6 0
## 13 99.52822 82.88394 87273 0
## 14 94.59362 44.47453 12091 0
## 15 78.00000 82.05181 5 0
## 16 66.25000 45.48351 12 0
## 17 0.00000 NA 1 0
Let’s look at what the stats say…
anova_model_a <- aov(payment_amount ~ issuing_agency, data = camera)
summary(anova_model_a)
## Df Sum Sq Mean Sq F value Pr(>F)
## issuing_agency 16 1063435 66465 10.59 <2e-16 ***
## Residuals 99880 627060364 6278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_model_a)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ issuing_agency
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Model (error reduced) | 1063434.678 16 66464.667 10.587 .0017 .0000
## Error (from model) | 627060364.279 99880 6278.137
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Total (empty model) | 628123798.958 99896 6287.777
The sum of squares here accounts for a reasonable amount of variation in the data, but it does not explain the larger portion of variation in the data. The F value is 10.587 with p < 2e-16, which means that there is a statistical significant difference between certain agencies with some paying more money than others.
ggplot(camera, aes(x = issuing_agency, y = payment_amount)) + geom_boxplot() + theme_minimal() + coord_flip() + ggtitle("Boxplot of Payment Amount by Agency")
favstats(payment_amount ~ state, data = camera) %>% arrange(desc(mean))
## state min Q1 median Q3 max mean sd n missing
## 1 OK 0.00 50.00 200.00 250.0000 250.00 162.19719 88.522638 160 0
## 2 ON 115.00 115.00 120.00 130.0000 145.00 125.00000 14.142136 4 0
## 3 QB 115.00 115.00 115.00 125.0000 125.00 118.75000 5.175492 8 0
## 4 NB 115.00 115.00 115.00 115.0000 115.00 115.00000 NA 1 0
## 5 AR 50.00 50.00 100.00 150.0000 250.00 113.30731 72.563803 67 0
## 6 WA 0.00 50.00 50.00 125.0000 275.00 109.09091 92.114522 33 0
## 7 TX 0.00 50.00 75.04 126.4025 277.06 104.12010 69.855661 312 0
## 8 DC 50.00 75.43 115.00 117.6800 145.00 102.66700 29.610797 20 0
## 9 NJ 0.00 50.00 75.00 115.0000 682.35 101.57462 89.971702 8654 0
## 10 NY 0.00 50.00 75.00 125.0000 690.04 101.09776 80.928614 79528 0
## 11 IN 0.00 67.50 115.00 115.0000 250.00 99.16667 50.520663 42 0
## 12 MN 0.00 50.00 75.00 107.5000 250.00 91.05847 68.580471 59 0
## 13 OH 0.00 50.00 75.00 115.0000 281.80 90.77151 65.548205 299 0
## 14 MT 50.00 50.00 87.50 100.0000 225.00 90.62500 43.671513 24 0
## 15 AL 0.00 50.00 75.00 115.0000 277.06 89.53567 56.218191 97 0
## 16 NC 0.00 50.00 75.00 115.0000 275.89 88.74886 57.680647 484 0
## 17 IL 0.00 50.00 75.00 100.0000 275.00 86.22200 54.900047 265 0
## 18 PA 0.00 50.00 75.00 100.0000 283.57 85.88709 53.910921 2976 0
## 19 IA 50.00 50.00 75.00 93.7600 175.00 85.00400 44.408710 10 0
## 20 VA 0.00 50.00 50.00 115.0000 275.00 82.70679 53.216823 527 0
## 21 SC 0.00 50.00 75.02 100.0000 250.00 82.61794 41.265398 194 0
## 22 GA 0.00 50.00 50.00 100.0000 275.62 82.57126 63.360707 302 0
## 23 MD 0.00 50.00 50.00 100.0000 250.00 81.02126 46.705884 413 0
## 24 CT 0.00 50.00 75.00 100.0000 276.57 80.66270 46.078493 1457 0
## 25 DE 0.00 50.00 75.00 75.4625 275.00 79.71512 49.576008 84 0
## 26 FL 0.00 50.00 50.00 100.0000 276.10 79.26281 50.883529 1654 0
## 27 AZ 0.00 50.00 50.00 100.0000 250.00 79.14683 50.917069 556 0
## 28 MO 0.00 50.00 50.00 75.1900 250.00 78.81636 57.999183 33 0
## 29 MA 0.00 50.00 50.00 100.0000 278.02 78.02744 48.262245 735 0
## 30 VT 0.00 50.00 75.00 75.7550 200.00 77.40515 41.129903 68 0
## 31 MS 0.00 50.00 75.16 115.0000 125.87 76.78111 42.988707 9 0
## 32 AK 75.95 75.95 75.95 75.9500 75.95 75.95000 NA 1 0
## 33 NH 50.00 50.00 50.00 100.0000 178.39 75.04704 31.790066 54 0
## 34 LA 50.00 50.00 50.00 76.4375 241.31 73.36333 41.807692 24 0
## 35 CA 0.00 50.00 50.00 100.0000 275.00 73.04461 52.607199 128 0
## 36 WI 0.00 50.00 50.00 115.0000 125.00 70.62500 44.460840 24 0
## 37 ME 50.00 50.00 50.00 75.5275 250.00 70.15136 36.325781 66 0
## 38 MI 0.00 50.00 50.00 75.0300 225.06 68.87076 35.774572 118 0
## 39 RI 0.00 50.00 50.00 75.5925 241.36 68.77096 36.502474 104 0
## 40 WV 50.00 50.00 50.00 75.6900 125.72 66.91444 25.274199 9 0
## 41 NV 50.00 50.00 50.00 75.0000 125.00 66.47059 26.325172 17 0
## 42 TN 50.00 50.00 50.00 75.0000 180.00 66.27884 30.075361 95 0
## 43 NE 0.00 50.00 50.00 85.0000 180.00 66.25000 51.527795 12 0
## 44 CO 0.00 50.00 50.00 75.0000 125.00 64.51613 28.992954 31 0
## 45 KY 50.00 50.00 50.00 75.0000 125.00 63.41818 25.188157 33 0
## 46 OR 50.00 50.00 50.00 61.2500 125.00 63.01793 23.969258 58 0
## 47 NM 50.00 50.00 50.00 63.1050 76.21 58.73667 15.132351 3 0
## 48 DP 0.00 0.00 57.50 115.0000 115.00 57.50000 62.988094 6 0
## 49 SD 0.00 50.00 62.50 75.0000 125.00 55.36929 35.604580 14 0
## 50 KS 0.00 12.50 50.00 87.5000 115.00 52.50000 48.347699 6 0
## 51 ID 50.00 50.00 50.00 50.0000 50.00 50.00000 NA 1 0
## 52 ND 50.00 50.00 50.00 50.0000 50.00 50.00000 NA 1 0
## 53 99 0.00 0.00 0.00 77.5000 190.00 46.25000 72.148161 8 0
## 54 UT 0.00 50.00 50.00 50.0000 50.00 38.88889 22.047928 9 0
It looks like drivers from NJ and NY pay quite a similar amount of money for ticket violations while drivers from CT pay less than both.
anova_model_s <- aov(payment_amount ~ state, data = camera)
summary(anova_model_s)
## Df Sum Sq Mean Sq F value Pr(>F)
## state 53 4703192 88739 14.21 <2e-16 ***
## Residuals 99843 623420606 6244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_model_s)
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ state
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Model (error reduced) | 4703192.469 53 88739.481 14.212 .0075 .0000
## Error (from model) | 623420606.488 99843 6244.009
## ----- --------------- | ------------- ----- --------- ------ ----- -----
## Total (empty model) | 628123798.958 99896 6287.777
The sum of squares here accounts for a reasonable amount of variation in the data, but it does not explain the larger portion of variation in the data The F value is 14.212 with p < 2e-16, which means that there is a statistical significant difference between drivers from different states paying more money than others.
ggplot(camera, aes(x = state, y = payment_amount)) +
geom_boxplot() +
theme_minimal() +
coord_flip() + # Flip the coordinates for horizontal orientation
theme(
axis.text.y = element_text(angle = 0, hjust = 1)
) +
ggtitle("Boxplot of Payment Amount by State")
camera <- camera %>%
mutate(county = recode(county,
"K" = "Kings County",
"Kings" = "Kings County",
"Q" = "Queens County",
"M" = "Manhattan County",
"B" = "Bronx County",
"Bronx" = "Bronx County",
"BK" = "Kings County",
"ST" = "Staten Island County",
"RICH" = "Rich County",
"R" = "Rabun County",
"Qns" = "Queens County",
"QN" = "Queens County",
"Q" = "Queens County",
"NY" = "New York County",
"MN" = "Marion County",
"K" = "Kay County",
"BX" = "Bronx County"))
favstats(payment_amount ~ county, data = camera) %>% arrange(desc(mean))
## county min Q1 median Q3 max mean sd n
## 1 Rich County 180 180 180 180.00 180.00 180.00000 NA 1
## 2 Rabun County 0 65 180 180.00 245.79 139.67920 80.35405 863
## 3 Kings County 0 50 75 115.00 690.04 110.90567 126.20960 16108
## 4 Marion County 0 50 50 125.06 281.80 100.54274 73.46670 14518
## 5 Bronx County 0 65 75 152.50 245.64 100.38053 67.32482 244
## 6 New York County 0 65 115 115.00 260.00 92.95323 38.30536 8950
## 7 Queens County 0 50 50 100.00 283.03 83.49201 60.07357 17357
## 8 Staten Island County 0 50 50 75.00 250.00 69.66361 45.80596 485
## missing
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
anova_model_c <- aov(payment_amount ~ county, data = camera)
summary(anova_model_c)
## Df Sum Sq Mean Sq F value Pr(>F)
## county 7 8540081 1220012 170.6 <2e-16 ***
## Residuals 58518 418368090 7149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 41371 observations deleted due to missingness
supernova(anova_model_c)
## Refitting to remove 41371 cases with missing value(s)
## ℹ aov(formula = payment_amount ~ county, data = listwise_delete(camera,
## c("payment_amount", "county")))
## Analysis of Variance Table (Type III SS)
## Model: payment_amount ~ county
##
## SS df MS F PRE p
## ----- --------------- | ------------- ----- ----------- ------- ----- -----
## Model (error reduced) | 8540081.223 7 1220011.603 170.646 .0200 .0000
## Error (from model) | 418368089.565 58518 7149.391
## ----- --------------- | ------------- ----- ----------- ------- ----- -----
## Total (empty model) | 426908170.788 58525 7294.458
The sum of squares here accounts for a reasonable amount of variation in the data, but it does not explain the larger portion of variation in the data. The F value is 170.646 with p < 2e-16, which means that there is a statistical significant difference between counties when it comes to the amount to money paid for traffic tickets. The F values also shows that “county” is a good predictor for the payment amount in this data set.
ggplot(camera, aes(x = county, y = payment_amount)) + geom_boxplot() + theme_minimal() + coord_flip() + ggtitle("Boxplot of Payment Amount by Counties")
All three variables (ageny, state, and county) were statistically significant based on their effects on the amount of money paid (payment_amount) for ticket violations. Although they results could not totally explain the portion of the variance that was unaccounted for, it looks like “county” is worth further investigation for potential marketing as it showed the larger F value out of the three.