R Markdown

Today, like the data scientist that I am, I am going to accomplish a monumental task that was assignment to me by my Law Firm. My goal is to investigate and asnwer three important questions.

  1. Do certain agencies issue higher payments?
  2. Do drivers from different states (NY, NJ, CT) pay more?
  3. Do certain counties tend to have higher payment amounts?

Open Parking and Camera Violations dataset

Loading the data from NYC Open Data

endpoint<-"https://data.cityofnewyork.us/resource/nc67-uf89.json"

resp <- GET(endpoint, query = list(
  "$limit" = 99999,
  "$order" = "issue_date DESC"
))
camera <- fromJSON(content(resp, as = "text"), flatten = TRUE)

First, cleaning the data

camera<- camera %>% 
  mutate(across(
    c("fine_amount","interest_amount","reduction_amount","payment_amount","amount_due","penalty_amount"),
    ~as.numeric(.)
  ))

Changing the contents that are numeric into numeric.

Filtering time correctly

camera<- camera %>% 
  filter(str_detect(issue_date,"^\\d{4}-\\d{2}-\\d{2}T"))

Creating appropriate format format for issue_data

camera$issue_date<- as.Date(camera$issue_date)
camera$day_of_week<- weekdays(camera$issue_date)

Fixing time (Part 1)

camera<- camera %>%
  mutate(
    violation_time_clean= str_replace(violation_time,"(A|P)$"," \\1M"),
    violation_time_24h= format(strptime(violation_time_clean, "%I:%M %p"), "%H:%M"))

Fixing time (Part 2)

camera<- camera %>%
  mutate(
    violation_time_24h_numeric = as.numeric(substr(violation_time_24h, 1, 2)),
    time_of_day = case_when(
      violation_time_24h_numeric >= 05 & violation_time_24h_numeric < 12 ~ "Morning",
      violation_time_24h_numeric >= 12 & violation_time_24h_numeric < 17 ~ "Afternoon",
      violation_time_24h_numeric >=17 & violation_time_24h_numeric < 24 ~ "Night"
    )
  )

Do certain agencies issue higher payments?

favstats(payment_amount ~ issuing_agency, data = camera) %>% arrange(desc(mean))
##                        issuing_agency    min     Q1 median       Q3    max
## 1            HEALTH DEPARTMENT POLICE 243.81 243.81 243.81 243.8100 243.81
## 2         SEA GATE ASSOCIATION POLICE 190.00 190.00 190.00 190.0000 190.00
## 3                     FIRE DEPARTMENT 180.00 180.00 180.00 180.0000 180.00
## 4  NYS OFFICE OF MENTAL HEALTH POLICE   0.00 180.00 180.00 190.0000 210.00
## 5                      PORT AUTHORITY   0.00 180.00 180.00 190.0000 242.76
## 6           ROOSEVELT ISLAND SECURITY   0.00 135.00 180.00 190.0000 246.68
## 7                    NYS PARKS POLICE   0.00   0.00 180.00 190.0000 242.58
## 8                   POLICE DEPARTMENT   0.00  65.00 180.00 190.0000 260.00
## 9                    PARKS DEPARTMENT   0.00  90.00 180.00 190.0000 245.28
## 10      TAXI AND LIMOUSINE COMMISSION 125.00 125.00 125.00 125.0000 125.00
## 11   HEALTH AND HOSPITAL CORP. POLICE   0.00   0.00 180.00 190.0000 245.64
## 12                           CON RAIL   0.00   0.00  95.00 228.8875 243.87
## 13       DEPARTMENT OF TRANSPORTATION   0.00  50.00  75.00 125.0000 690.04
## 14                            TRAFFIC   0.00  65.00 115.00 115.0000 245.79
## 15                  TRANSIT AUTHORITY   0.00   0.00  75.00 125.0000 190.00
## 16           DEPARTMENT OF SANITATION   0.00  48.75  65.00 115.0000 115.00
## 17               LONG ISLAND RAILROAD   0.00   0.00   0.00   0.0000   0.00
##         mean        sd     n missing
## 1  243.81000        NA     1       0
## 2  190.00000   0.00000     2       0
## 3  180.00000        NA     1       0
## 4  161.33333  65.99423    15       0
## 5  150.49319  80.53742    47       0
## 6  149.16083  90.57967    24       0
## 7  142.50970  90.27092    33       0
## 8  136.71574  82.82498   190       0
## 9  128.47736  78.92728   144       0
## 10 125.00000        NA     1       0
## 11 124.71373  98.60130    51       0
## 12 112.62000 124.87146     6       0
## 13  99.52822  82.88394 87273       0
## 14  94.59362  44.47453 12091       0
## 15  78.00000  82.05181     5       0
## 16  66.25000  45.48351    12       0
## 17   0.00000        NA     1       0

Let’s look at what the stats say…

Let’s look at an ANOVA analysis shows

anova_model_a <- aov(payment_amount ~ issuing_agency, data = camera)
summary(anova_model_a)
##                   Df    Sum Sq Mean Sq F value Pr(>F)    
## issuing_agency    16   1063435   66465   10.59 <2e-16 ***
## Residuals      99880 627060364    6278                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_model_a)
##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ issuing_agency
## 
##                                     SS    df        MS      F   PRE     p
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Model (error reduced) |   1063434.678    16 66464.667 10.587 .0017 .0000
##  Error (from model)    | 627060364.279 99880  6278.137                   
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Total (empty model)   | 628123798.958 99896  6287.777

The sum of squares here accounts for a reasonable amount of variation in the data, but it does not explain the larger portion of variation in the data. The F value is 10.587 with p < 2e-16, which means that there is a statistical significant difference between certain agencies with some paying more money than others.

Graph (payment_amount & agency)

ggplot(camera, aes(x = issuing_agency, y = payment_amount)) + geom_boxplot() + theme_minimal() + coord_flip() + ggtitle("Boxplot of Payment Amount by Agency")

Do drivers from different states (NY, NJ, CT) pay more?

favstats(payment_amount ~ state, data = camera) %>% arrange(desc(mean))
##    state    min     Q1 median       Q3    max      mean        sd     n missing
## 1     OK   0.00  50.00 200.00 250.0000 250.00 162.19719 88.522638   160       0
## 2     ON 115.00 115.00 120.00 130.0000 145.00 125.00000 14.142136     4       0
## 3     QB 115.00 115.00 115.00 125.0000 125.00 118.75000  5.175492     8       0
## 4     NB 115.00 115.00 115.00 115.0000 115.00 115.00000        NA     1       0
## 5     AR  50.00  50.00 100.00 150.0000 250.00 113.30731 72.563803    67       0
## 6     WA   0.00  50.00  50.00 125.0000 275.00 109.09091 92.114522    33       0
## 7     TX   0.00  50.00  75.04 126.4025 277.06 104.12010 69.855661   312       0
## 8     DC  50.00  75.43 115.00 117.6800 145.00 102.66700 29.610797    20       0
## 9     NJ   0.00  50.00  75.00 115.0000 682.35 101.57462 89.971702  8654       0
## 10    NY   0.00  50.00  75.00 125.0000 690.04 101.09776 80.928614 79528       0
## 11    IN   0.00  67.50 115.00 115.0000 250.00  99.16667 50.520663    42       0
## 12    MN   0.00  50.00  75.00 107.5000 250.00  91.05847 68.580471    59       0
## 13    OH   0.00  50.00  75.00 115.0000 281.80  90.77151 65.548205   299       0
## 14    MT  50.00  50.00  87.50 100.0000 225.00  90.62500 43.671513    24       0
## 15    AL   0.00  50.00  75.00 115.0000 277.06  89.53567 56.218191    97       0
## 16    NC   0.00  50.00  75.00 115.0000 275.89  88.74886 57.680647   484       0
## 17    IL   0.00  50.00  75.00 100.0000 275.00  86.22200 54.900047   265       0
## 18    PA   0.00  50.00  75.00 100.0000 283.57  85.88709 53.910921  2976       0
## 19    IA  50.00  50.00  75.00  93.7600 175.00  85.00400 44.408710    10       0
## 20    VA   0.00  50.00  50.00 115.0000 275.00  82.70679 53.216823   527       0
## 21    SC   0.00  50.00  75.02 100.0000 250.00  82.61794 41.265398   194       0
## 22    GA   0.00  50.00  50.00 100.0000 275.62  82.57126 63.360707   302       0
## 23    MD   0.00  50.00  50.00 100.0000 250.00  81.02126 46.705884   413       0
## 24    CT   0.00  50.00  75.00 100.0000 276.57  80.66270 46.078493  1457       0
## 25    DE   0.00  50.00  75.00  75.4625 275.00  79.71512 49.576008    84       0
## 26    FL   0.00  50.00  50.00 100.0000 276.10  79.26281 50.883529  1654       0
## 27    AZ   0.00  50.00  50.00 100.0000 250.00  79.14683 50.917069   556       0
## 28    MO   0.00  50.00  50.00  75.1900 250.00  78.81636 57.999183    33       0
## 29    MA   0.00  50.00  50.00 100.0000 278.02  78.02744 48.262245   735       0
## 30    VT   0.00  50.00  75.00  75.7550 200.00  77.40515 41.129903    68       0
## 31    MS   0.00  50.00  75.16 115.0000 125.87  76.78111 42.988707     9       0
## 32    AK  75.95  75.95  75.95  75.9500  75.95  75.95000        NA     1       0
## 33    NH  50.00  50.00  50.00 100.0000 178.39  75.04704 31.790066    54       0
## 34    LA  50.00  50.00  50.00  76.4375 241.31  73.36333 41.807692    24       0
## 35    CA   0.00  50.00  50.00 100.0000 275.00  73.04461 52.607199   128       0
## 36    WI   0.00  50.00  50.00 115.0000 125.00  70.62500 44.460840    24       0
## 37    ME  50.00  50.00  50.00  75.5275 250.00  70.15136 36.325781    66       0
## 38    MI   0.00  50.00  50.00  75.0300 225.06  68.87076 35.774572   118       0
## 39    RI   0.00  50.00  50.00  75.5925 241.36  68.77096 36.502474   104       0
## 40    WV  50.00  50.00  50.00  75.6900 125.72  66.91444 25.274199     9       0
## 41    NV  50.00  50.00  50.00  75.0000 125.00  66.47059 26.325172    17       0
## 42    TN  50.00  50.00  50.00  75.0000 180.00  66.27884 30.075361    95       0
## 43    NE   0.00  50.00  50.00  85.0000 180.00  66.25000 51.527795    12       0
## 44    CO   0.00  50.00  50.00  75.0000 125.00  64.51613 28.992954    31       0
## 45    KY  50.00  50.00  50.00  75.0000 125.00  63.41818 25.188157    33       0
## 46    OR  50.00  50.00  50.00  61.2500 125.00  63.01793 23.969258    58       0
## 47    NM  50.00  50.00  50.00  63.1050  76.21  58.73667 15.132351     3       0
## 48    DP   0.00   0.00  57.50 115.0000 115.00  57.50000 62.988094     6       0
## 49    SD   0.00  50.00  62.50  75.0000 125.00  55.36929 35.604580    14       0
## 50    KS   0.00  12.50  50.00  87.5000 115.00  52.50000 48.347699     6       0
## 51    ID  50.00  50.00  50.00  50.0000  50.00  50.00000        NA     1       0
## 52    ND  50.00  50.00  50.00  50.0000  50.00  50.00000        NA     1       0
## 53    99   0.00   0.00   0.00  77.5000 190.00  46.25000 72.148161     8       0
## 54    UT   0.00  50.00  50.00  50.0000  50.00  38.88889 22.047928     9       0

It looks like drivers from NJ and NY pay quite a similar amount of money for ticket violations while drivers from CT pay less than both.

Let’s look at what an ANOVA analysis shows

anova_model_s <- aov(payment_amount ~ state, data = camera)
summary(anova_model_s)
##                Df    Sum Sq Mean Sq F value Pr(>F)    
## state          53   4703192   88739   14.21 <2e-16 ***
## Residuals   99843 623420606    6244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_model_s)
##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ state
## 
##                                     SS    df        MS      F   PRE     p
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Model (error reduced) |   4703192.469    53 88739.481 14.212 .0075 .0000
##  Error (from model)    | 623420606.488 99843  6244.009                   
##  ----- --------------- | ------------- ----- --------- ------ ----- -----
##  Total (empty model)   | 628123798.958 99896  6287.777

The sum of squares here accounts for a reasonable amount of variation in the data, but it does not explain the larger portion of variation in the data The F value is 14.212 with p < 2e-16, which means that there is a statistical significant difference between drivers from different states paying more money than others.

Graph (payment_amount & drivers)

ggplot(camera, aes(x = state, y = payment_amount)) + 
  geom_boxplot() + 
  theme_minimal() + 
  coord_flip() +  # Flip the coordinates for horizontal orientation
  theme(
    axis.text.y = element_text(angle = 0, hjust = 1)
    ) +
  ggtitle("Boxplot of Payment Amount by State")

Do certain counties tend to have higher payment amounts?

First, let’s clean county by removing the abbraviations

camera <- camera %>%
  mutate(county = recode(county,
                          "K" = "Kings County",
                          "Kings" = "Kings County",
                          "Q" = "Queens County",
                          "M" = "Manhattan County",
                          "B" = "Bronx County",
                          "Bronx" = "Bronx County",
                          "BK" = "Kings County",
                          "ST" = "Staten Island County",
                          "RICH" = "Rich County",
                          "R" = "Rabun County",
                          "Qns" = "Queens County",
                          "QN" = "Queens County",
                          "Q" = "Queens County",
                          "NY" = "New York County",
                          "MN" = "Marion County",
                          "K" = "Kay County",
                          "BX" = "Bronx County"))

Let’s get ready for data analysis

favstats(payment_amount ~ county, data = camera) %>% arrange(desc(mean))
##                 county min  Q1 median     Q3    max      mean        sd     n
## 1          Rich County 180 180    180 180.00 180.00 180.00000        NA     1
## 2         Rabun County   0  65    180 180.00 245.79 139.67920  80.35405   863
## 3         Kings County   0  50     75 115.00 690.04 110.90567 126.20960 16108
## 4        Marion County   0  50     50 125.06 281.80 100.54274  73.46670 14518
## 5         Bronx County   0  65     75 152.50 245.64 100.38053  67.32482   244
## 6      New York County   0  65    115 115.00 260.00  92.95323  38.30536  8950
## 7        Queens County   0  50     50 100.00 283.03  83.49201  60.07357 17357
## 8 Staten Island County   0  50     50  75.00 250.00  69.66361  45.80596   485
##   missing
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 7       0
## 8       0

Let’s look at what an ANOVA analysis shows

anova_model_c <- aov(payment_amount ~ county, data = camera)
summary(anova_model_c)
##                Df    Sum Sq Mean Sq F value Pr(>F)    
## county          7   8540081 1220012   170.6 <2e-16 ***
## Residuals   58518 418368090    7149                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 41371 observations deleted due to missingness
supernova(anova_model_c)
## Refitting to remove 41371 cases with missing value(s)
## ℹ aov(formula = payment_amount ~ county, data = listwise_delete(camera, 
##     c("payment_amount", "county")))
##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ county
## 
##                                     SS    df          MS       F   PRE     p
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Model (error reduced) |   8540081.223     7 1220011.603 170.646 .0200 .0000
##  Error (from model)    | 418368089.565 58518    7149.391                    
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Total (empty model)   | 426908170.788 58525    7294.458

The sum of squares here accounts for a reasonable amount of variation in the data, but it does not explain the larger portion of variation in the data. The F value is 170.646 with p < 2e-16, which means that there is a statistical significant difference between counties when it comes to the amount to money paid for traffic tickets. The F values also shows that “county” is a good predictor for the payment amount in this data set.

Graph (payment_amount & county)

ggplot(camera, aes(x = county, y = payment_amount)) + geom_boxplot() + theme_minimal() + coord_flip() +   ggtitle("Boxplot of Payment Amount by Counties")

Concluding Notes

All three variables (ageny, state, and county) were statistically significant based on their effects on the amount of money paid (payment_amount) for ticket violations. Although they results could not totally explain the portion of the variance that was unaccounted for, it looks like “county” is worth further investigation for potential marketing as it showed the larger F value out of the three.