Final Project

Author

M Loukinov

Intro: My dataset is a collection of every traffic ticket/violation/citation issued by law enforcement in the district of Columbia in July of 2022. This data was originally downloaded from the District Department of Motor Vehicle’s eTIMS meter work order management system. It was then exported into DDOT’s SQL server, where the Office of the Chief Technology Officer (OCTO) geocoded citation data to the street segment level. I chose this topic because I think it will be interesting to explore, but I am also genuinely curious to find out some of the information that I will be looking for. My goal in this project is to see a number of things. First, I want to see how effective the district’s fine collection system is. How many of the fines that are given are actually paid. Second, I would like to see what states get cited most often in DC. I expect to have to filter out the DMV, but I’m curious to see what other states I will see. Finally I would like to if day of the month affects the amount of tickets.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

#Import Data
setwd("C:/Users/Thecr/OneDrive/Desktop/Data 110 Notes")
data1 <- read_csv ("ProjectFData.csv")

Rows: 152972 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): LOCATION, ISSUE_DATE, ISSUING_AGENCY_NAME, ISSUING_AGENCY_SHORT, V...
dbl (17): OBJECTID, XCOORD, YCOORD, ISSUE_TIME, ISSUING_AGENCY_CODE, DISPOSI...
lgl  (1): BODY_STYLE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataS <- read_csv ("Regions.csv")

Rows: 51 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): State, State Code, Region, Division

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Saidi Cleaning
names(data1) <- tolower(names(data1))
names(data1) <- gsub(" ","_",names(data1))
#DataS
names(dataS) <- tolower(names(dataS))
names(dataS) <- gsub(" ","_",names(dataS))

#Renaming column for joining
dataS <- dataS |> 
  rename(plate_state  = state_code)

# Cleaning date to match R Specifications
data1$issue_date <- gsub(" .*", "", data1$issue_date)
data1$issue_date <- gsub("/", "-", data1$issue_date)

#Create a seperate set without MD, VA and DC as outliers
Plate <- data1 |>
  filter(!is.na(plate_state))

Plate1 <- Plate |>
  filter(!plate_state %in% c("MD", "VA", "DC"))

#Count the number of occurrences of each state
States <- Plate |> 
  group_by(plate_state) |>         
  summarise(count = n())

#Count the number of occurrences of each state Excluding MD, VA and DC
States1 <- Plate1 |> 
  group_by(plate_state) |>         
  summarise(count = n())

#Create a top 10 list of the outsider plates
PlateTop <- States1 |>                       #Creates a top 20 list
  arrange(desc(count)) |>       # Desc makes descending order thus top 10
  mutate(column = "top") |>      #add a column for them to be consider top 20
  head(10)                               #Use 20 variables
PlateTop

# A tibble: 10 × 3
   plate_state count column
   <chr>       <int> <chr> 
 1 NC             31 top   
 2 FL             28 top   
 3 PA             23 top   
 4 NJ             18 top   
 5 NY             17 top   
 6 TX             16 top   
 7 WV             15 top   
 8 IN             14 top   
 9 AZ             13 top   
10 GA             11 top

#Joining datasets

PlateTop2 <- PlateTop |> 
  full_join(dataS, by = "plate_state") |>
  filter(!is.na(count))

#Graph the top 10 plates
graph1 <- PlateTop2 |>
  ggplot(aes(y = count, x = plate_state, fill = region)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Northeast" = "darkblue", "South" = "darkred", "West" = "darkgreen", "Midwest" = "darkorange")) +          #CHAT GPT FOR ASSISTANCE
  labs( x = "State License Plate", y = "Number of tickets" , title = "Top 10 Non DMV Tickets", fill = "Region", caption = "District Department of Motor Vehicles")+
  theme_classic()
ggplotly(graph1)

Plot Analysis: This is an interactive graph which shows the top 10 non-dmv states with citations issued to them in DC in July 2022. I’ve split the U.s. into 4 regions for this graph, and assigned each an individual color. Expectedly the majority of the top 10 is dominated by East Coast states, with the 3 major outliers being Texas, Indiana and Arizona.

#Making a sum of fines and how much were paid

Division1 <- data1 |>
  group_by(issuing_agency_short) |>
  summarize(sum_fine = sum(fine_amount), sum_paid = sum(total_paid)) |>
  arrange(desc(sum_fine))

#Making a rate for the rate at which fines came in
Division1$fine_rate <- (Division1$sum_paid/ Division1$sum_fine) * 100

Division1$fine_rate <- round(Division1$fine_rate, 2)

#Filtering out massive and small values
Division2 <- Division1 |>
  filter(sum_fine >= 10000) |>
  filter(sum_fine <= 70000)

#Plot 2
p2 <- Division2 |>
  ggplot(aes(x = issuing_agency_short))+
  geom_bar(aes(y = sum_fine), stat = "identity", fill = "skyblue", width = 0.8)+   #Used Chatgpt for formatting overlapping columns
  geom_bar(aes(y = sum_paid), stat = "identity", fill = "darkblue", width = 0.8)+
  labs( x = "Agency", y = "$ Amount", title = "Total Fines($) Given vs Paid", caption = "District Department of Motor Vehicles") +
  theme_classic()
  

p2

Plot Analysis: This plot is a bar graph showing the effectiveness and efficiency of the various law enforcement agencies in the District of Columbia. This is a very powerful and telling graph in my opinion, as it shows just how few tickets end up getting paid. The light blue represents the sum of the fines of the ticket for that department, where the dark blue represents the sum of the tickets paid for that department. This graph was shocking to me at first, and I actually spent a lot of time trouble shooting it because with more values and the large statistical outlier which has since been removed I was unable to see the dark blue portions of the graph and I thought my code wasn’t working for some reason.

Perform at least one of the following statistical analyses: linear, multiple linear, or logistic regression. Write the equation for your model, p-values, diagnostic plots, and adjusted 𝑹𝑹𝟐𝟐 values. Then ANALYZE what these values suggest about your model. This statistical analysis will be separate from your other final two visualizations.

#Filtering for Linear Regression
Division3 <- Division1 |>
  filter(sum_fine <= 20000)

#Filtering out what day of the month each ticket was on
DataDay <- data1 |>
  mutate(day = day(issue_date)) |>
  filter(fine_amount <= 500)

#Grouping the amount of fines and the amount of stops each day
LinD <- DataDay |>
  group_by(day) |>
  summarize(sum_fine = sum(fine_amount), sum_paid = sum(total_paid)) |>
  arrange((day))

#Linear visualization
Linreg <- LinD |>
  ggplot(aes( x = day, y = sum_fine)) +
  geom_point() +
  geom_smooth() +
  labs( x = "Day", y = "Total Fines Given" , title = "Fines Given vs Day", caption = "District Department of Motor Vehicles")
Linreg

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

TO DO: Summarize Data using Days to see how many fines were given on each day

#Creating model
model <- lm(sum_fine ~ day, data = LinD)

#Summary of model
summary(model)


Call:
lm(formula = sum_fine ~ day, data = LinD)

Residuals:
    Min      1Q  Median      3Q     Max 
-147169  -54932  -24777   99417  135675 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   561014      30437  18.432   <2e-16 ***
day            -1652       1660  -0.995    0.328    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 82690 on 29 degrees of freedom
Multiple R-squared:  0.03301,   Adjusted R-squared:  -0.0003347 
F-statistic:  0.99 on 1 and 29 DF,  p-value: 0.328

Linear Analysis:

Equation: Y = 561014 - 1652* day P-Value: 0.328 (Not significant) R^2 value: -0.00033 (Not Significant)

Throughout the month it does not appear that the model is effective in predicting the sum of the fines based on the day. The negative slope indicates a slight negative correlation that as the month goes on the fine amounts seem to slightly decline. However, upon closer inspection, on a weekly basis, an argument could be made that there is a positive correlation that as a week goes on the fines increase. Unfortunately I did not have enough time or experience to seperate it out on a weekly basis, however, looking at a calendar outside of the project day 10 of July was a Sunday. Given that information, you can see that weekends are disproportionately high, and that following the weekend, there’s generally a small drop from Monday to Tuesday and then a climb back up to the weekend days.

Final Essay:

There was a number of interesting conclusions that could be made from my visualizations. The most important and surprising in my opinion is the lack of efficiency within the various departments across the District. Of the agencies in my visualization not a single one managed even 20% of their fines to be paid. The majority didn’t even break 10%. That is a significant amount of money that is simply not being collected. This data should really force an investigation into the reasoning behind these numbers. Are law enforcement officers handing out tickets they know won’t stand up if challenged in court in order to meet quotas? Are they not showing up to court to back up those tickets causing automatic dismissals? It was also interesting to see that Arizona is one of the most ticketed non DMV plates given how far away it is from DC. In the linear analysis plot, it was interesting that upon closer inspection a trend could be found about the weekly correlation rather than the overall monthly one. That is also the thing that I wish I could have included, a plot seperating the month into weeks and showing the day of the week and the line of best fit for that plot. I think it would have shown some interesting data.