Intro: My dataset is a collection of every traffic ticket/violation/citation issued by law enforcement in the district of Columbia in July of 2022. This data was originally downloaded from the District Department of Motor Vehicle’s eTIMS meter work order management system. It was then exported into DDOT’s SQL server, where the Office of the Chief Technology Officer (OCTO) geocoded citation data to the street segment level. I chose this topic because I think it will be interesting to explore, but I am also genuinely curious to find out some of the information that I will be looking for. My goal in this project is to see a number of things. First, I want to see how effective the district’s fine collection system is. How many of the fines that are given are actually paid. Second, I would like to see what states get cited most often in DC. I expect to have to filter out the DMV, but I’m curious to see what other states I will see. Finally I would like to if day of the month affects the amount of tickets.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 152972 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): LOCATION, ISSUE_DATE, ISSUING_AGENCY_NAME, ISSUING_AGENCY_SHORT, V...
dbl (17): OBJECTID, XCOORD, YCOORD, ISSUE_TIME, ISSUING_AGENCY_CODE, DISPOSI...
lgl (1): BODY_STYLE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataS <-read_csv ("Regions.csv")
Rows: 51 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): State, State Code, Region, Division
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Renaming column for joiningdataS <- dataS |>rename(plate_state = state_code)
# Cleaning date to match R Specificationsdata1$issue_date <-gsub(" .*", "", data1$issue_date)data1$issue_date <-gsub("/", "-", data1$issue_date)
#Create a seperate set without MD, VA and DC as outliersPlate <- data1 |>filter(!is.na(plate_state))Plate1 <- Plate |>filter(!plate_state %in%c("MD", "VA", "DC"))
#Count the number of occurrences of each stateStates <- Plate |>group_by(plate_state) |>summarise(count =n())
#Count the number of occurrences of each state Excluding MD, VA and DCStates1 <- Plate1 |>group_by(plate_state) |>summarise(count =n())
#Create a top 10 list of the outsider platesPlateTop <- States1 |>#Creates a top 20 listarrange(desc(count)) |># Desc makes descending order thus top 10mutate(column ="top") |>#add a column for them to be consider top 20head(10) #Use 20 variablesPlateTop
# A tibble: 10 × 3
plate_state count column
<chr> <int> <chr>
1 NC 31 top
2 FL 28 top
3 PA 23 top
4 NJ 18 top
5 NY 17 top
6 TX 16 top
7 WV 15 top
8 IN 14 top
9 AZ 13 top
10 GA 11 top
#Joining datasetsPlateTop2 <- PlateTop |>full_join(dataS, by ="plate_state") |>filter(!is.na(count))
#Graph the top 10 platesgraph1 <- PlateTop2 |>ggplot(aes(y = count, x = plate_state, fill = region)) +geom_bar(stat ="identity") +scale_fill_manual(values =c("Northeast"="darkblue", "South"="darkred", "West"="darkgreen", "Midwest"="darkorange")) +#CHAT GPT FOR ASSISTANCElabs( x ="State License Plate", y ="Number of tickets" , title ="Top 10 Non DMV Tickets", fill ="Region", caption ="District Department of Motor Vehicles")+theme_classic()ggplotly(graph1)
Plot Analysis: This is an interactive graph which shows the top 10 non-dmv states with citations issued to them in DC in July 2022. I’ve split the U.s. into 4 regions for this graph, and assigned each an individual color. Expectedly the majority of the top 10 is dominated by East Coast states, with the 3 major outliers being Texas, Indiana and Arizona.
#Making a sum of fines and how much were paidDivision1 <- data1 |>group_by(issuing_agency_short) |>summarize(sum_fine =sum(fine_amount), sum_paid =sum(total_paid)) |>arrange(desc(sum_fine))
#Making a rate for the rate at which fines came inDivision1$fine_rate <- (Division1$sum_paid/ Division1$sum_fine) *100Division1$fine_rate <-round(Division1$fine_rate, 2)
#Filtering out massive and small valuesDivision2 <- Division1 |>filter(sum_fine >=10000) |>filter(sum_fine <=70000)
#Plot 2p2 <- Division2 |>ggplot(aes(x = issuing_agency_short))+geom_bar(aes(y = sum_fine), stat ="identity", fill ="skyblue", width =0.8)+#Used Chatgpt for formatting overlapping columnsgeom_bar(aes(y = sum_paid), stat ="identity", fill ="darkblue", width =0.8)+labs( x ="Agency", y ="$ Amount", title ="Total Fines($) Given vs Paid", caption ="District Department of Motor Vehicles") +theme_classic()p2
Plot Analysis: This plot is a bar graph showing the effectiveness and efficiency of the various law enforcement agencies in the District of Columbia. This is a very powerful and telling graph in my opinion, as it shows just how few tickets end up getting paid. The light blue represents the sum of the fines of the ticket for that department, where the dark blue represents the sum of the tickets paid for that department. This graph was shocking to me at first, and I actually spent a lot of time trouble shooting it because with more values and the large statistical outlier which has since been removed I was unable to see the dark blue portions of the graph and I thought my code wasn’t working for some reason.
Perform at least one of the following statistical analyses: linear, multiple linear, or logistic regression. Write the equation for your model, p-values, diagnostic plots, and adjusted 𝑹𝑹𝟐𝟐 values. Then ANALYZE what these values suggest about your model. This statistical analysis will be separate from your other final two visualizations.
#Filtering for Linear RegressionDivision3 <- Division1 |>filter(sum_fine <=20000)
#Filtering out what day of the month each ticket was onDataDay <- data1 |>mutate(day =day(issue_date)) |>filter(fine_amount <=500)
#Grouping the amount of fines and the amount of stops each dayLinD <- DataDay |>group_by(day) |>summarize(sum_fine =sum(fine_amount), sum_paid =sum(total_paid)) |>arrange((day))
#Linear visualizationLinreg <- LinD |>ggplot(aes( x = day, y = sum_fine)) +geom_point() +geom_smooth() +labs( x ="Day", y ="Total Fines Given" , title ="Fines Given vs Day", caption ="District Department of Motor Vehicles")Linreg
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
TO DO: Summarize Data using Days to see how many fines were given on each day
#Creating modelmodel <-lm(sum_fine ~ day, data = LinD)
#Summary of modelsummary(model)
Call:
lm(formula = sum_fine ~ day, data = LinD)
Residuals:
Min 1Q Median 3Q Max
-147169 -54932 -24777 99417 135675
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 561014 30437 18.432 <2e-16 ***
day -1652 1660 -0.995 0.328
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 82690 on 29 degrees of freedom
Multiple R-squared: 0.03301, Adjusted R-squared: -0.0003347
F-statistic: 0.99 on 1 and 29 DF, p-value: 0.328
Linear Analysis:
Equation: Y = 561014 - 1652* day P-Value: 0.328 (Not significant) R^2 value: -0.00033 (Not Significant)
Throughout the month it does not appear that the model is effective in predicting the sum of the fines based on the day. The negative slope indicates a slight negative correlation that as the month goes on the fine amounts seem to slightly decline. However, upon closer inspection, on a weekly basis, an argument could be made that there is a positive correlation that as a week goes on the fines increase. Unfortunately I did not have enough time or experience to seperate it out on a weekly basis, however, looking at a calendar outside of the project day 10 of July was a Sunday. Given that information, you can see that weekends are disproportionately high, and that following the weekend, there’s generally a small drop from Monday to Tuesday and then a climb back up to the weekend days.
Final Essay:
There was a number of interesting conclusions that could be made from my visualizations. The most important and surprising in my opinion is the lack of efficiency within the various departments across the District. Of the agencies in my visualization not a single one managed even 20% of their fines to be paid. The majority didn’t even break 10%. That is a significant amount of money that is simply not being collected. This data should really force an investigation into the reasoning behind these numbers. Are law enforcement officers handing out tickets they know won’t stand up if challenged in court in order to meet quotas? Are they not showing up to court to back up those tickets causing automatic dismissals? It was also interesting to see that Arizona is one of the most ticketed non DMV plates given how far away it is from DC. In the linear analysis plot, it was interesting that upon closer inspection a trend could be found about the weekly correlation rather than the overall monthly one. That is also the thing that I wish I could have included, a plot seperating the month into weeks and showing the day of the week and the line of best fit for that plot. I think it would have shown some interesting data.