LawFirmHW10/13Revision

Introduction

I am a data analyst working for a law firm that analyzes NYC parking violation data. In the following analysis, I explore three questions: whether different agencies issue meaningfully different fine amounts, whether drivers from different states pay meaningfully different fine amounts, and whether tickets issued in different counties carry meaningfully different fine amounts. The data come from the NYC Open Data endpoint queried below.

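library(httr)      # GET(), content()
library(jsonlite)  # fromJSON()
library(dplyr)     # %>%, filter(), mutate(), case_when(), arrange()
library(ggplot2)   # ggplot(), geom_boxplot()
library(mosaic)    # favstats()
library(supernova) # supernova()

endpoint <- "https://data.cityofnewyork.us/resource/nc67-uf89.json"
resp <- GET(endpoint, query = list(
  "$limit" = 99999,
  "$order" = "issue_date DESC"
))

# Parse the JSON response, then strip non-numeric characters so payments convert cleanly
cameradata <- fromJSON(content(resp, as = "text"), flatten = TRUE)
cameradata$payment_amount <- as.numeric(gsub("[^0-9.]", "", cameradata$payment_amount))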

Issuing Agency Analysis

ggplot(cameradata, aes(x = issuing_agency, y = payment_amount)) +
  geom_boxplot() +
  coord_flip()

favstats(payment_amount ~ issuing_agency, data = cameradata) %>%
  arrange(desc(mean))

##                        issuing_agency    min      Q1 median       Q3    max
## 1            HEALTH DEPARTMENT POLICE 243.81 243.810 243.81 243.8100 243.81
## 2         SEA GATE ASSOCIATION POLICE 190.00 190.000 190.00 190.0000 190.00
## 3                     FIRE DEPARTMENT 180.00 180.000 180.00 180.0000 180.00
## 4  NYS OFFICE OF MENTAL HEALTH POLICE   0.00 180.000 180.00 190.0000 210.00
## 5           ROOSEVELT ISLAND SECURITY   0.00 135.000 180.00 190.0000 246.68
## 6                      PORT AUTHORITY   0.00 180.000 180.00 190.0000 242.76
## 7                    NYS PARKS POLICE   0.00  45.000 180.00 190.0000 242.58
## 8                    PARKS DEPARTMENT   0.00  90.000 180.00 190.0000 245.28
## 9       TAXI AND LIMOUSINE COMMISSION 125.00 125.000 125.00 125.0000 125.00
## 10   HEALTH AND HOSPITAL CORP. POLICE   0.00   0.000 180.00 190.0000 245.64
## 11                  POLICE DEPARTMENT   0.00   0.000 180.00 190.0000 260.00
## 12                           CON RAIL   0.00   0.000  95.00 228.8875 243.87
## 13       DEPARTMENT OF TRANSPORTATION   0.00  50.000  75.00 125.0000 690.04
## 14                            TRAFFIC   0.00  65.000 115.00 115.0000 245.79
## 15             OTHER/UNKNOWN AGENCIES   0.00  40.115  80.23 120.3450 160.46
## 16                  TRANSIT AUTHORITY   0.00   0.000  75.00 125.0000 190.00
## 17              SUNY MARITIME COLLEGE  65.00  65.000  65.00  65.0000  65.00
## 18          NYC OFFICE OF THE SHERIFF   0.00  28.750  57.50  86.2500 115.00
## 19           DEPARTMENT OF SANITATION   0.00   0.000  65.00 105.0000 115.00
## 20               LONG ISLAND RAILROAD   0.00   0.000   0.00   0.0000   0.00
##         mean        sd     n missing
## 1  243.81000        NA     1       0
## 2  190.00000   0.00000     2       0
## 3  180.00000        NA     1       0
## 4  161.33333  65.99423    15       0
## 5  149.16083  90.57967    24       0
## 6  147.35792  82.58394    48       0
## 7  143.86176  89.24158    34       0
## 8  128.47736  78.92728   144       0
## 9  125.00000        NA     1       0
## 10 124.71373  98.60130    51       0
## 11 123.93855  88.00388   214       0
## 12 112.62000 124.87146     6       0
## 13  99.52822  82.88394 87273       0
## 14  94.59362  44.47453 12091       0
## 15  80.23000 113.46235     2       0
## 16  78.00000  82.05181     5       0
## 17  65.00000        NA     1       0
## 18  57.50000  81.31728     2       0
## 19  56.78571  48.26239    14       0
## 20   0.00000        NA     1       0

anova_model <- aov(payment_amount ~ issuing_agency, data = cameradata)
summary(anova_model)

##                   Df    Sum Sq Mean Sq F value Pr(>F)    
## issuing_agency    19    937675   49351   7.858 <2e-16 ***
## Residuals      99910 627464684    6280                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 69 observations deleted due to missingness

supernova(anova_model)

##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ issuing_agency
## 
##                                     SS    df        MS     F   PRE     p
##  ----- --------------- | ------------- ----- --------- ----- ----- -----
##  Model (error reduced) |    937675.432    19 49351.339 7.858 .0015 .0000
##  Error (from model)    | 627464683.951 99910  6280.299                  
##  ----- --------------- | ------------- ----- --------- ----- ----- -----
##  Total (empty model)   | 628402359.383 99929  6288.488

The model sum of squares, 937,675.432, is a minuscule proportion of the total sum of squares (628,402,359.383), so issuing agency accounts for only a very small fraction of the variance in payment amount. The effect is statistically significant, F(19, 99,910) = 7.86, p < .001, but the PRE is .0015, meaning the model explains only about 0.15% of the variance. Nearly all tickets in the sample come from two agencies (Department of Transportation and Traffic), and several agencies contribute only one or two tickets, so many of the group means above are unreliable. Consequently, it is probably not worthwhile for the law firm to take issuing agency into account when creating its marketing strategy (unless it has some underlying philosophical or ideological reason for doing so, separate from these statistical analyses).
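As a quick check on the figures quoted above, the PRE and F values can be recomputed by hand from the sums of squares in the supernova() table. This is only a sketch of the arithmetic; the variable names (ss_model, ss_error, and so on) are my own labels, not part of any package.

```r
# Sums of squares and degrees of freedom copied from the supernova() table above.
ss_model <- 937675.432
ss_error <- 627464683.951
ss_total <- 628402359.383
df_model <- 19
df_error <- 99910

# PRE: share of the empty model's error that the agency model removes.
pre <- ss_model / ss_total

# F: model mean square divided by error mean square.
f_ratio <- (ss_model / df_model) / (ss_error / df_error)

round(pre, 4)      # agrees with the PRE column (.0015)
round(f_ratio, 3)  # agrees with the F column (7.858)
```

Seen this way, PRE is simply the model's slice of the total variation, which is why a large F can coexist with a tiny PRE.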

Driver State Analysis

threestatefilter <- cameradata %>%
  filter(state %in% c("NY", "NJ", "CT"))

View(threestatefilter)  # opens the data viewer; interactive sessions only

ggplot(threestatefilter, aes(x = state, y = payment_amount)) +
  geom_boxplot() +
  coord_flip()

favstats(payment_amount ~ state, data = threestatefilter) %>%
  arrange(desc(mean))

##   state min Q1 median  Q3    max     mean       sd     n missing
## 1    NJ   0 50     75 115 682.35 101.5746 89.97170  8654       3
## 2    NY   0 50     75 125 690.04 101.0902 80.93015 79541      10
## 3    CT   0 50     75 100 276.57  80.6627 46.07849  1457       2

anova_model <- aov(payment_amount ~ state, data = threestatefilter)
summary(anova_model)

##                Df    Sum Sq Mean Sq F value Pr(>F)    
## state           2    602716  301358   45.48 <2e-16 ***
## Residuals   89649 594098897    6627                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 15 observations deleted due to missingness

supernova(anova_model)

##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ state
## 
##                                     SS    df         MS      F   PRE     p
##  ----- --------------- | ------------- ----- ---------- ------ ----- -----
##  Model (error reduced) |    602716.142     2 301358.071 45.475 .0010 .0000
##  Error (from model)    | 594098896.889 89649   6626.944                   
##  ----- --------------- | ------------- ----- ---------- ------ ----- -----
##  Total (empty model)   | 594701613.031 89651   6633.519

The model sum of squares, 602,716.142, is a small proportion of the total sum of squares (594,701,613.031), so state accounts for only a small fraction of the overall variance in payment amount. The effect is statistically significant, F(2, 89,649) = 45.48, p < .001. However, the PRE is .0010, meaning that the model explains only about 0.1% of the variance. Consequently, it is probably not worthwhile for the law firm to emphasize state in its campaign.
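The gap between the tiny p-value and the tiny PRE is largely a matter of sample size. The sketch below rewrites F in terms of PRE and then asks what the same PRE would yield with far fewer tickets. The helper f_from_pre() and the 300-ticket scenario are hypothetical illustrations of mine, not part of the analysis above.

```r
# Values copied from the supernova() table for payment_amount ~ state.
pre      <- 602716.142 / 594701613.031  # SS model / SS total
df_model <- 2
df_error <- 89649

# F expressed through PRE: (PRE / df_model) / ((1 - PRE) / df_error).
f_from_pre <- function(pre, df_model, df_error) {
  (pre / df_model) / ((1 - pre) / df_error)
}

# Observed sample size: reproduces the reported F and a very small p-value.
f_large <- f_from_pre(pre, df_model, df_error)
p_large <- pf(f_large, df_model, df_error, lower.tail = FALSE)

# Hypothetical: the same PRE with only 300 tickets (df_error = 297).
f_small <- f_from_pre(pre, df_model, 297)
p_small <- pf(f_small, df_model, 297, lower.tail = FALSE)

c(f_large = f_large, p_large = p_large, f_small = f_small, p_small = p_small)
```

With roughly 90,000 tickets, even a 0.1% share of the variance clears the significance threshold easily; with 300 tickets, the same share would not come close.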

County Analysis

cameradata <- cameradata %>%
  mutate(county = case_when(
    county %in% c("K", "BK", "Kings") ~ "Kings County",
    county %in% c("Q", "QN", "Qns") ~ "Queens County",
    county %in% c("BX", "B", "Bronx", "BRONX") ~ "Bronx County",
    county %in% c("R", "ST", "SI", "RICH") ~ "Richmond County",
    county %in% c("NY", "N", "MN") ~ "New York County",
    TRUE ~ county
  ))




ggplot(cameradata, aes(x = county, y = payment_amount)) +
  geom_boxplot() +
  coord_flip()

favstats(payment_amount ~ county, data = cameradata) %>%
  arrange(desc(mean))

##            county min Q1 median  Q3    max      mean        sd     n missing
## 1 Richmond County   0 50    125 180 250.00 114.53669  77.55385  1349       0
## 2    Kings County   0 50     75 115 690.04 110.88983 126.20448 16112       0
## 3    Bronx County   0 65     75 145 245.64  99.65870  67.53373   247       0
## 4 New York County   0 50     75 115 281.80  97.62502  62.55866 23479       0
## 5   Queens County   0 50     50 100 283.03  83.46501  60.08515 17366       0

anova_model <- aov(payment_amount ~ county, data = cameradata)
summary(anova_model)

##                Df    Sum Sq Mean Sq F value Pr(>F)    
## county          4   6702478 1675619   233.4 <2e-16 ***
## Residuals   58548 420413471    7181                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 41446 observations deleted due to missingness

supernova(anova_model)

##  Analysis of Variance Table (Type III SS)
##  Model: payment_amount ~ county
## 
##                                     SS    df          MS       F   PRE     p
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Model (error reduced) |   6702477.756     4 1675619.439 233.352 .0157 .0000
##  Error (from model)    | 420413471.103 58548    7180.663                    
##  ----- --------------- | ------------- ----- ----------- ------- ----- -----
##  Total (empty model)   | 427115948.859 58552    7294.643

The model sum of squares, 6,702,477.756, is a small proportion of the total sum of squares (427,115,948.859), so county accounts for only a small fraction of the variance in payment amount. The effect is statistically significant, F(4, 58,548) = 233.35, p < .001. Once again, this is a very small p-value. The PRE is .0157, so county explains 1.57% of the overall variance. Note that 41,446 tickets had no county recorded and were dropped from this model. County is, therefore, likely not a meaningful metric for the law firm to use in its marketing campaign.

Overall Conclusion

From a statistical standpoint, I cannot, in good conscience, recommend that the law firm use any of the three variables explored in this analysis as central components of its marketing campaign. Though all three variables are statistically significant, none of them accounts for more than 2% of the overall variance in payment amount. That being said, if the firm had to use one of these variables, it should choose county, since it is the only one that accounts for more than 1% of the variance. Violation type (as explored in a previous analysis) accounts for 33% of the variance, so it is the best metric for the firm to use in its campaign.
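To put the three results side by side, the sketch below rebuilds each PRE from the sums of squares reported in this document and sorts them. The data frame pre_table and its column names are my own. The three models were fit to different subsets of the tickets (the state model uses only NY, NJ, and CT plates, and the county model drops tickets with no county), so the comparison is approximate.

```r
# Sums of squares copied from the three supernova() tables in this report.
pre_table <- data.frame(
  predictor = c("issuing_agency", "state", "county"),
  ss_model  = c(937675.432, 602716.142, 6702477.756),
  ss_total  = c(628402359.383, 594701613.031, 427115948.859)
)

# PRE as a percentage of total variation, largest first.
pre_table$pre_percent <- round(100 * pre_table$ss_model / pre_table$ss_total, 2)
pre_table <- pre_table[order(-pre_table$pre_percent), ]
pre_table
```

County comes out on top, and it is the only predictor above 1%, in line with the recommendation above.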
