I. Introduction

My final project is about traffic violations in Montgomery County, where we live. I chose this topic out of curiosity, and also because driving is essential in our lives. Most of us probably live in Montgomery County and have to drive for a variety of reasons. While driving, we may commit violations, either intentionally or accidentally, and end up being stopped by the police, caught on camera, or fined. The dataset I selected for the final project contains real traffic violations that have occurred around us. The data is available on dataMontgomery (https://data.montgomerycountymd.gov/Public-Safety/Traffic-Violations/4mse-ku6q).

Variables in the dataset

The original dataset consists of 35 variables and approximately 1.79 million observations, and it is updated daily. Most of the variables are categorical. The dataset also records the date and the location (latitude, longitude) of each observation. Each variable and its description are as follows:

Variables	Description
Date Of Stop	Date of the traffic violation.
Time Of Stop	Time of the traffic violation.
Agency	Agency issuing the traffic violation. (Example: MCP is Montgomery County Police)
SubAgency	Court code representing the district of assignment of the officer. R15 = 1st district, Rockville; B15 = 2nd district, Bethesda; SS15 = 3rd
Description	Text description of the specific charge
Location	Location of the violation, usually an address or intersection.
Latitude	Latitude location of the traffic violation.
Longitude	Longitude location of the traffic violation.
Accident	YES if traffic violation involved an accident.
Belts	YES if seat belts were in use in accident cases.
Personal Injury	Yes if traffic violation involved Personal Injury.
Property Damage	Yes if traffic violation involved Property Damage.
Fatal	Yes if traffic violation involved a fatality.
Commercial License	Yes if driver holds a Commercial Drivers License
HAZMAT	Yes if the traffic violation involved hazardous materials.
Commercial Vehicle	Yes if the vehicle committing the traffic violation is a commercial vehicle.
Alcohol	Yes if the traffic violation included an alcohol related suspension.
Work Zone	Yes if the traffic violation was in a work zone.
State	State issuing the vehicle registration.
VehicleType	Type of vehicle (Examples: Automobile, Station Wagon, Heavy Duty Truck, etc.)
Year	Year vehicle was made.
Make	Manufacturer of the vehicle (Examples: Ford, Chevy, Honda, Toyota, etc.)
Model	Model of the vehicle.
Color	Color of the vehicle.
Violation Type	Violation type. (Examples: Warning, Citation, SERO)
Charge	Numeric code for the specific charge.
Article	Article of State Law. (TA = Transportation Article, MR = Maryland Rules)
Contributed To Accident	If the traffic violation was a contributing factor in an accident.
Race	Race of the driver. (Example: Asian, Black, White, Other, etc.)
Gender	Gender of the driver (F = Female, M = Male)
Driver City	City of the driver’s home address
Driver State	State of the driver’s home address.
DL State	State issuing the Driver’s License.
Arrest Type	Type of Arrest (A = Marked, B = Unmarked, etc.)
Geolocation	Geo-coded location information.

Questions I would like to explore in this dataset

The original dataset is very large because it contains every record from January 1, 2012, to the present (2022). Therefore, I will extract and investigate only the events in 2021. With this dataset I will explore the five Ws of traffic violations: who, when, where, what, and why.

II. Data Pre-processing

Setting working directory and loading the dataset

setwd("C:/Users/ykim2/Downloads/MC/R")
df <- read.csv("traffic_violations_2021.csv")

Loading required libraries

library(tidyverse)   # ggplot2 & dplyr
library(lubridate)  # date format
library(ggthemes)   # special theme for theme_fivethirtyeight()
library(plotly)     # interactive graph for pie chart
library(wesanderson) # cool color palette for bar graphs 
library(dygraphs)    # interactive time series chart
library(xts)  # create eXtensible Time Series (xts) data
library(viridis)    # beautiful color palette for Heatmap
library(knitr)      # for a nice table 
library(kableExtra)

Data Wrangling & Preparation

Color source:
https://venngage.com/blog/fall-color-palettes/?msclkid=ae1b3a1dce9411ec9f0bbda82ee9151c
https://www.color-hex.com/color-palettes/
https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/

Before cleaning up the data, we will look at the structure of the entire data frame and make sure each variable has the correct data type.

str(df)

## 'data.frame':    63697 obs. of  40 variables:
##  $ X                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date.Of.Stop           : chr  "12/08/2021" "12/08/2021" "12/09/2021" "12/09/2021" ...
##  $ Time.Of.Stop           : chr  "23:34:00" "23:20:00" "10:46:00" "10:46:00" ...
##  $ Agency                 : chr  "MCP" "MCP" "MCP" "MCP" ...
##  $ SubAgency              : chr  "2nd District, Bethesda" "5th District, Germantown" "3rd District, Silver Spring" "3rd District, Silver Spring" ...
##  $ Description            : chr  "DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION" "FAIL TO DISPLAY REG. CARD ON DEMAND" "DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED SIGNAL" "DRIVING W/O CURRENT TAGS" ...
##  $ Location               : chr  "WATKINS MILL @ TRAVIS LANE" "MIDDLEBROOK RD AT GREAT SENECA HWY" "RIDGE ROAD AND BETHESDA CHRUCH ROAD" "RIDGE ROAD AND BETHESDA CHRUCH ROAD" ...
##  $ Latitude               : num  39.2 39.2 39.3 39.3 0 ...
##  $ Longitude              : num  -77.2 -77.3 -77.2 -77.2 0 ...
##  $ Accident               : chr  "No" "No" "No" "No" ...
##  $ Belts                  : chr  "No" "No" "No" "No" ...
##  $ Personal.Injury        : chr  "No" "No" "No" "No" ...
##  $ Property.Damage        : chr  "No" "No" "No" "No" ...
##  $ Fatal                  : chr  "No" "No" "No" "No" ...
##  $ Commercial.License     : chr  "No" "No" "No" "No" ...
##  $ HAZMAT                 : chr  "No" "No" "No" "No" ...
##  $ Commercial.Vehicle     : chr  "No" "No" "No" "No" ...
##  $ Alcohol                : chr  "No" "No" "No" "No" ...
##  $ Work.Zone              : chr  "No" "No" "No" "No" ...
##  $ State                  : chr  "MD" "MD" "TX" "TX" ...
##  $ VehicleType            : chr  "02 - Automobile" "02 - Automobile" "06 - Heavy Duty Truck" "06 - Heavy Duty Truck" ...
##  $ Year                   : int  2009 2002 2009 2009 2016 2020 2020 2020 2020 2020 ...
##  $ Make                   : chr  "NISSAN" "LEXUS" "CHEVY" "CHEVY" ...
##  $ Model                  : chr  "4S" "LS 430" "SILVERADO" "SILVERADO" ...
##  $ Color                  : chr  "BLACK" "WHITE" "WHITE" "WHITE" ...
##  $ Violation.Type         : chr  "Citation" "Citation" "Citation" "Citation" ...
##  $ Charge                 : chr  "13-401(h)" "13-409(b)" "21-202(h1)" "13-411(d)" ...
##  $ Article                : chr  "Transportation Article" "Transportation Article" "Transportation Article" "Transportation Article" ...
##  $ Contributed.To.Accident: chr  "False" "False" "False" "False" ...
##  $ Race                   : chr  "WHITE" "WHITE" "HISPANIC" "HISPANIC" ...
##  $ Gender                 : chr  "M" "M" "M" "M" ...
##  $ Driver.City            : chr  "MONTGOMERY VILLAGE" "GERMANTOWN" "GAITHERSBURG" "GAITHERSBURG" ...
##  $ Driver.State           : chr  "MD" "MD" "MD" "MD" ...
##  $ DL.State               : chr  "MD" "MD" "MD" "MD" ...
##  $ Arrest.Type            : chr  "A - Marked Patrol" "A - Marked Patrol" "A - Marked Patrol" "A - Marked Patrol" ...
##  $ Geolocation            : chr  "(39.160035, -77.2155816666667)" "(39.1718966666667, -77.26247)" "(39.2852916666667, -77.2090083333333)" "(39.2852916666667, -77.2090083333333)" ...
##  $ dates                  : chr  "2021-12-08" "2021-12-08" "2021-12-09" "2021-12-09" ...
##  $ date                   : int  8 8 9 9 8 8 8 8 8 8 ...
##  $ month                  : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ year                   : int  2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...

The last four variables (dates, date, month, year) were created earlier from the ‘Date.Of.Stop’ variable in order to extract the 2021 data. The code is:

library(lubridate)
df <- df %>%
  mutate(dates = as.Date(Date.Of.Stop, "%m/%d/%Y"),  # parse month/day/year strings
         date  = day(dates),                         # day of the month
         month = month(dates),
         year  = year(dates))

We will transform some time and date variables and create new ones for date- and time-based analysis. First, the ‘dates’ variable is still stored as character, so we’ll convert it into date format.

df$dates <- as.Date(df$dates)

For hour-based analysis, we will extract only the first two characters from ‘Time.Of.Stop’ variable and convert them into numeric.

df$hour <- substr(df$Time.Of.Stop,1,2)
df$hour <- as.numeric(df$hour)
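
The extraction step can be checked on a few values before applying it to the whole column. The following is a minimal sketch in base R. The first three strings come from the str() output above, while the fourth string and the names `times` and `hours` are only assumptions made for illustration.

```r
# Sample time strings in the "HH:MM:SS" layout used by Time.Of.Stop.
# "07:05:00" is a hypothetical value added to show a leading zero.
times <- c("23:34:00", "23:20:00", "10:46:00", "07:05:00")

# Characters 1-2 hold the hour; as.numeric() turns "07" into 7
hours <- as.numeric(substr(times, 1, 2))
hours

# Sanity check: every parsed hour should fall between 0 and 23
stopifnot(all(hours >= 0 & hours <= 23))
```

A check like the last line is a cheap way to notice malformed time strings, since a value outside 0 to 23 (or an NA) would point to a parsing problem.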

Now, using the weekdays() function, we will create a variable for the day of the week.

df$weekday <- weekdays(df$dates)
table(df$weekday)

## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##     10946      8297      6775      5321     10811     11192     10355

Using month.name[], we’ll replace the numeric months with month names.

df$month_name <- month.name[df$month]
table(df$month_name)

## 
##     April    August  December  February   January      July      June     March 
##      3413      5471      7677      4842      5237      6044      3661      5817 
##       May  November   October September 
##      3724      6223      5355      6233

The table() output lists the days and the months in alphabetical order, so I will convert both variables to factors whose levels follow the correct chronological order.

df$weekday <-factor(df$weekday, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

df$month_name <- factor(df$month_name, levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
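
To see why the factor levels matter, here is a small sketch in base R. The vector `days` is a hypothetical sample, not data from the report, and `day_levels` and `days_f` are names chosen only for this illustration.

```r
# Hypothetical sample of weekday names (not taken from the dataset)
days <- c("Friday", "Monday", "Sunday", "Tuesday", "Monday")

# A plain character vector is tabulated in alphabetical order
table(days)

# Supplying explicit levels fixes the order used by table() and by ggplot2 axes
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
days_f <- factor(days, levels = day_levels)
table(days_f)
```

Levels that do not occur in the sample (Wednesday, for example) still appear in the table with a count of zero, which keeps the axis complete in later plots.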

III. Exploratory Data Analysis (EDA)

Who?

The only demographic information in this dataset is gender and race. This information should not be used to draw biased conclusions about any gender or race.

df %>%
  group_by(Gender) %>%
  summarise(count = n()) %>%
  mutate(ratio = round((count/sum(count)*100),1),
         label = ratio) %>%
  ggplot(aes (x = Gender, y = count, fill = Gender)) +
  geom_bar(stat = "identity", color ="black" , width = 0.6) +
  geom_label(aes(label=paste(label,"%")), fill = "#FFF9F5", vjust = 0.5) +
  scale_x_discrete(labels = c('Female','Male','Unspecified')) +
  scale_fill_manual(labels= c("Female", "Male", "Unspecified"),values = c("#e3919d","#3878a4","#FFF1E0")) +
  labs( title = "Number of Traffic Violations by Gender", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov") +
  theme_fivethirtyeight()

Figure 1: We can see that almost 70% of people caught in traffic violations are men.

t <- list(
  family = "Arial Black", # assign font family for plot_ly chart
  size = 26)

#9190b8 Purple
#f7ede2 Ivory
#e3919d Pink
#84a59d Green
#f6bd60 Yellow
#f28482 Pink

df %>%
  group_by(Race) %>%
  summarise(count = n())%>%
  mutate(pct = round((count/sum(count)),3)) %>%
  plot_ly(labels = ~ Race,
          values = ~ pct,
          textposition = 'outside',
          hoverinfo = 'text',
          text = ~ paste("Number:",count),
          #hovertemplate = paste('Number: %{text}'),
          textinfo = 'label+percent',
          marker = list(colors = c("#9190b8","#f7ede2", "#e3919d" , "#f28482" , "#84a59d" , "#f6bd60"),
                        line = list(color ='black', width = 1)),
          hole = 0.5,
          type = 'pie') %>%
  layout(title = list(text = "Percent of Traffic Violations by Race", font = t),
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         autosize = TRUE,
         margin = list(b = 40, t = 80)) %>%
  layout(annotations = list(x = 0.14, y = 1.09, text = "Montgomery County, Maryland, 2021", showarrow = FALSE, xref = 'paper', yref = 'paper'))

Figure 2: This chart shows the share of each race in all traffic violation records in 2021. If you hover over a slice, you can see the number of violations.

When?

This data provides the exact date and time each violation occurred. The crux of my final project is to find out when traffic violations occur most often.

pal <- wes_palette("Royal2", 100, type = "continuous")


 df %>%
  group_by(hour) %>%
  summarize(count = n()) %>%
  ggplot( aes(x = hour, y = count, fill = count)) +
  geom_col(color = "black") +
  geom_text(aes(label = count), position = position_dodge(0.95), angle = 90, vjust = 0.5, hjust = 1.2,  size = 4, color ="grey23") +
  scale_fill_gradientn(colours = pal) +
  theme_fivethirtyeight() +
  labs( title = "Frequency of Traffic Violations by Hour", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + ylab('Frequency') + xlab('Hour') +
  theme(legend.position = "none")

Figure 3: This graph shows the number of traffic violations by hour of the day. As expected, the highest number of violations occurs during the morning rush hour, from 7 to 9 a.m.

 df %>%
  group_by(weekday) %>%
  summarize(count = n()) %>%
  mutate(label = count) %>%
  ggplot( aes(x = weekday, y = count, fill = count)) +
  geom_col(color = "black") +
  scale_fill_gradientn(colours = pal) +
  geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
  theme_fivethirtyeight() +
  labs( title = "Frequency of Traffic Violations per Day of Week", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + ylab('Frequency')+ xlab('Weekday')+
  scale_x_discrete(labels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday')) +
  theme(legend.position = "none")

Figure 4: We can see that there are more traffic violations from Tuesday to Friday than on weekends. Just as most violations occur during rush hour, when traffic is heavy, this is probably because people are most active on Tuesdays, Wednesdays, Thursdays, and Fridays.
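
The weekday comparison can also be expressed as shares of the yearly total. The sketch below re-enters the counts printed by table(df$weekday) earlier in this report. The object names `weekday_counts`, `share`, and `weekend_share` are introduced here only for illustration.

```r
# Counts copied from the table(df$weekday) output shown earlier
weekday_counts <- c(Monday = 8297, Tuesday = 11192, Wednesday = 10355,
                    Thursday = 10811, Friday = 10946,
                    Saturday = 6775, Sunday = 5321)

# Total should equal the number of rows in the 2021 data
total <- sum(weekday_counts)

# Percentage of all violations that fall on each day
share <- round(weekday_counts / total * 100, 1)
share

# Combined share of Saturday and Sunday
weekend_share <- sum(weekday_counts[c("Saturday", "Sunday")]) / total * 100
weekend_share
```

Working with shares rather than raw counts makes it easier to compare these results with another year that has a different total.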

Time Series Analysis

Now, we will create a new data frame by collecting the frequency of daily traffic violations and see the trend.

by_date <- df %>%
  group_by(dates) %>%
  summarise(count = n())

We will transform the new data frame into a type suitable for time series analysis.

library(xts) # create eXtensible Time Series (xts) data
by_date <- xts(by_date$count, order.by = by_date$dates)

str(by_date)

## An 'xts' object on 2021-01-01/2021-12-31 containing:
##   Data: int [1:365, 1] 177 184 157 142 156 233 271 274 220 143 ...
##   Indexed by objects of class: [Date] TZ: UTC
##   xts Attributes:  
##  NULL

We will create a chart with the dygraphs package, which produces interactive time series charts.

dygraph(by_date, 
        main = "<font size=5> Traffic Violations of Year 2021 </font> <br> <small>Montgomery County, MD</small>",
        ylab = "Frequency") %>%
  dyRangeSelector() %>%
  dySeries("V1", label = "Frequency", color = "#6b705c") %>%
  dyLegend(show = "follow") %>%
  dyOptions( fillGraph = TRUE)

Figure 5: We can see that there are clearly fewer traffic violations in the 2nd quarter of the year. Because this is an interactive graph, the individual value is displayed as you hover over the line. We can see that on December 10, 2021, there were 460 traffic violations, the most of the year. You can zoom in by adjusting the date range in the range selector at the bottom of the dygraph.
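
The quarterly pattern can be cross-checked against the monthly counts printed by table(df$month_name) earlier in this report. This is a minimal sketch in base R. The names `month_counts`, `quarter`, and `quarter_counts` are assumptions made for this example only.

```r
# Monthly counts copied from the table(df$month_name) output, in calendar order
month_counts <- c(January = 5237, February = 4842, March = 5817,
                  April = 3413, May = 3724, June = 3661,
                  July = 6044, August = 5471, September = 6233,
                  October = 5355, November = 6223, December = 7677)

# Assign each month to its quarter: three months per quarter
quarter <- rep(paste0("Q", 1:4), each = 3)

# Sum the monthly counts within each quarter
quarter_counts <- sapply(split(month_counts, quarter), sum)
quarter_counts

# Quarter with the fewest violations
names(which.min(quarter_counts))
```

Summing by quarter confirms what the chart suggests visually: April, May, and June together have the fewest violations of the four quarters.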

Calendar Heatmap Chart

Now we will create a heatmap that shows the frequencies of daily traffic violations in 2021.
First I will divide the year into 53 weeks and make a new week variable.

df <- df %>%
  mutate(week = strftime(dates,"%W"))

df1 <- df %>%
  count(month_name, week, weekday, date)
#df1
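
To make the week numbering concrete, here is a short sketch of how the "%W" format behaves for 2021. With "%W", weeks start on Monday and the days before the first Monday of the year fall into week "00". The dates and the names `d`, `week`, and `all_weeks` are chosen only for illustration.

```r
# A few 2021 dates: New Year's Day, the first Monday, December 10, and December 31
d <- as.Date(c("2021-01-01", "2021-01-04", "2021-12-10", "2021-12-31"))

# "%W" returns the week of the year as a two-character string, Monday as first day
week <- strftime(d, "%W")
data.frame(date = d, week = week)

# Every day of 2021 mapped to its week label
all_days  <- seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "day")
all_weeks <- unique(strftime(all_days, "%W"))
length(all_weeks)  # number of distinct week labels in 2021
```

Because the labels are character strings with leading zeros, sorting them alphabetically also sorts them chronologically, which is what the factor-level step that follows relies on.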

We will reverse the order of the week levels so that the first week appears at the top of the y-axis of the heatmap and the later weeks follow in sequence below it.

df1$week <- factor(df1$week, levels = rev(sort(unique(df$week))))

Now we’ll create a heatmap that looks like a calendar.

df1 %>%
  ggplot(aes(x=weekday, y = week)) + 
  geom_tile(aes(fill = n), color = "#616161", lwd = 0.5) +
  scale_fill_viridis(option ="magma", direction = -1) +
  #theme_classic() +
  theme_tufte(base_family="Helvetica") +
  facet_wrap(~month_name, nrow = 3, scales = "free") +
  geom_text(aes(label = date), color = "grey", size = 3) +
  theme(axis.ticks.y = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y = element_blank()) +
  scale_x_discrete(labels = c("Mo","Tu","We","Th","Fr","Sa","Su"), position = "top")+
  theme(strip.placement = "outside") +
  theme(strip.text.x = element_text(size = 10, hjust = 0))+
  ggtitle("Heatmap of Traffic Violations in Montgomery County, MD (2021)") + 
  theme(plot.title = element_text(family = "Arial", face= "bold", size = 16))

Figure 6: This is not an interactive chart, so we cannot read the exact frequency for each day, but the colors show at a glance the days with the most traffic violations. As we saw in the time series chart, December 10, which has the highest number of traffic violations, has the darkest color. I googled December 10, 2021, and noticed there was a severe thunderstorm that day. Perhaps it has something to do with the number of traffic violations.

What & Why?

There are numerous types of traffic violations. The Description variable in this data has the reasons for the traffic violations. Traffic violations can be broadly divided into general traffic violations and violations that contributed to traffic accidents.

df %>%
  group_by(Contributed.To.Accident) %>%
  summarise(count = n()) %>%
  mutate(ratio = round((count/sum(count)*100),1),
         label = ratio) %>%
  ggplot(aes (x = Contributed.To.Accident, y = count, fill = Contributed.To.Accident)) +
  geom_col(color ="black" , width = 0.6) +
  geom_text(aes(x= Contributed.To.Accident, y = 10000, label = count), size = 4, color = "#555555")   +
  geom_label(aes(label=paste(label,"%")), fill = "#FFF9F5", vjust = 0.5) +
  geom_label(aes(x = 'False', y = 15000, 
                 label = "General Traffic Violations:"), 
             hjust = 0.5, 
             vjust = 0.5, 
             lineheight = 0.8,
             colour = "#555555", 
             fill ="#e3919d", 
             label.size = NA, 
             family="Helvetica", 
             size = 3.4) +
  
    geom_label(aes(x = 'True', y = 15000, 
                 label = "Violations Contributed to Accident:"), 
             hjust = 0.5, 
             vjust = 0.5, 
             lineheight = 0.8,
             colour = "#555555", 
             fill = "transparent",
             label.size = NA, 
             family="Helvetica", 
             size = 3.5) +
  
  scale_fill_manual(values = c("#e3919d" , "#3878a4" )) +
  labs( title ="Number of traffic violations Contributed to Accident", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov") +
  theme_fivethirtyeight()

Figure 7: We can see that only about 5% of all traffic violations contributed to traffic accidents. Still, 5 out of every 100 violations is not a small percentage. We will take a look at the causes of traffic accidents below.

Reasons for Traffic Violations

Now we will look at the reasons behind the traffic violations. First, we’ll separate the violations that contributed to an accident from the general traffic violations.

why <- df %>% select(Description)
acci <- df %>% filter(Contributed.To.Accident == "True") %>% select(Description)
no_acci <- df %>% filter(Contributed.To.Accident == "False")  %>% select(Description)

descrip <- df %>%
  group_by(Description) %>%
  summarise(Count = n()) %>%
  arrange(-Count)
str(descrip)

## tibble [2,119 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Description: chr [1:2119] "DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS" "EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH" "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH" "FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER" ...
##  $ Count      : int [1:2119] 5865 3426 2883 2225 1852 1551 1426 1367 1208 1047 ...

The Description variable describes the reason for each stop. There are 2,119 distinct descriptions in total, but many of them overlap. Looking at the table below, ‘DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS’ ranks first. However, once the separate speed limit entries are added together, speeding is the most common reason for a stop.

Top 10 Reasons for Total Traffic Violations in Montgomery County, MD, 2021

Description	Count
DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS	5865
EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH	3426
EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH	2883
FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER	2225
FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND	1852
DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION	1551
DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE	1426
NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON	1367
EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH	1208
FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION	1047

Table 1: The top reason is 'DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS', which, according to Mark Bigger, covers any number of offenses, such as ignoring portable warning signs, failure to yield, stopping on the crosswalk, and many others. Next, we can see that there are many cases of speeding and registration violations.

Source: https://www.bakersfieldtraffictickets.com/blog/2020/february/trucker-ticket-failure-to-obey-a-traffic-control/#

Top 10 Reasons for Traffic Accidents in Montgomery County, MD, 2021

Description	Count
FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION	529
NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON	271
RECKLESS DRIVING VEHICLE IN WANTON AND WILLFUL DISREGARD FOR SAFETY OF PERSONS AND PROPERTY	188
DRIVING VEH. WHILE IMPAIRED BY ALCOHOL	186
DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL	185
DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS	145
DRIVER WHEN TURNING LEFT FAIL TO YIELD RIGHT OF WAY TO VEHICLE APPROACHING FROM OPPOSITE DIRECTION	117
FAILURE TO CONTROL VEH. SPEED ON HWY. TO AVOID COLLISION	107
DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL PER SE	92
DRIVER CHANGING LANES WHEN UNSAFE	87

Table 2: The table above shows the traffic violations that most often contributed to accidents. A total of 463 (= 186 + 185 + 92) cases are related to alcohol, spread across three separate descriptions. Taken together, accidents caused by drunk driving should therefore be ranked 2nd.
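
The regrouping described above can be sketched in base R using the rows of Table 2. The matching rules (the fixed strings "ALCOHOL" and "CONTROL VEH") and the combined group names are assumptions made for this illustration, not part of the original analysis.

```r
# Rows copied from Table 2 (violations that contributed to accidents)
top10 <- data.frame(
  description = c(
    "FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION",
    "NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON",
    "RECKLESS DRIVING VEHICLE IN WANTON AND WILLFUL DISREGARD FOR SAFETY OF PERSONS AND PROPERTY",
    "DRIVING VEH. WHILE IMPAIRED BY ALCOHOL",
    "DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL",
    "DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS",
    "DRIVER WHEN TURNING LEFT FAIL TO YIELD RIGHT OF WAY TO VEHICLE APPROACHING FROM OPPOSITE DIRECTION",
    "FAILURE TO CONTROL VEH. SPEED ON HWY. TO AVOID COLLISION",
    "DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL PER SE",
    "DRIVER CHANGING LANES WHEN UNSAFE"),
  count = c(529, 271, 188, 186, 185, 145, 117, 107, 92, 87),
  stringsAsFactors = FALSE
)

# Hypothetical grouping rule: merge descriptions that share a keyword,
# and leave every other description as its own group
group <- top10$description
group[grepl("ALCOHOL", group, fixed = TRUE)]     <- "Alcohol-related (combined)"
group[grepl("CONTROL VEH", group, fixed = TRUE)] <- "Speed control (combined)"

# Sum the counts per group and sort from most to least frequent
grouped <- sort(sapply(split(top10$count, group), sum), decreasing = TRUE)
grouped
```

Merging the two spellings of the speed-control description as well keeps the comparison fair, since that reason is also split across more than one row of the table.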

Top 10 Reasons for General Traffic Violations in Montgomery County, MD, 2021

Description	Count
DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS	5720
EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH	3426
EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH	2883
FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER	2196
FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND	1785
DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION	1542
DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE	1420
EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH	1208
NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON	1096
EXCEEDING THE POSTED SPEED LIMIT OF 45 MPH	1019

Table 3: We can see that speeding is the most common general traffic violation. We also learn that we must carry our registration card and driver's license with us when driving.

Text Mining

I’ve always wanted to create a word cloud, so for fun we will build one from the text of the Description variable. To create a word cloud, we must first clean the text. We will use the ‘why’, ‘acci’ and ‘no_acci’ data frames, which contain only the Description variable. In this case the text comes from a character variable in an existing data frame rather than from a text file extracted from a website such as Twitter.

Preparing Data for Word Cloud Visualization

Step 1: Load required libraries

library(wordcloud)  # word-cloud generator
library(RColorBrewer) # color palettes
library(tm) # for text mining

Step 2: Create the Text Corpus

## Calculate Corpus
why <- Corpus(VectorSource(why))
acci <- Corpus(VectorSource(acci))
no_acci <- Corpus(VectorSource(no_acci))

Step 3: Pre-processing Text

##Data Cleaning and Wrangling

why  <- tm_map(why , removeNumbers) # Remove numbers
why  <- tm_map(why , removePunctuation) # Remove punctuation
why  <- tm_map(why , content_transformer(tolower))     # Convert the text to lower case
why  <- tm_map(why , removeWords, stopwords("english")) # Remove common English stop words

acci  <- tm_map(acci , removeNumbers)
acci  <- tm_map(acci , removePunctuation)
acci  <- tm_map(acci , content_transformer(tolower))
acci  <- tm_map(acci , removeWords, stopwords("english"))

no_acci <- tm_map(no_acci, removeNumbers)
no_acci <- tm_map(no_acci, removePunctuation)
no_acci <- tm_map(no_acci, content_transformer(tolower))
no_acci <- tm_map(no_acci, removeWords, stopwords("english"))

# Remove custom stop words that are frequent but carry little meaning here
acci <- tm_map(acci, removeWords, c("driving", "vehicle", "driver", "person", "posted", "failure")) 
no_acci <- tm_map(no_acci, removeWords, c("driving", "vehicle", "driver", "person", "posted", "failure"))

Step 4: Create a term-document matrix and convert it to a regular matrix

why <- TermDocumentMatrix(why)
acci <- TermDocumentMatrix(acci)
no_acci <- TermDocumentMatrix(no_acci)

why <- as.matrix(why)
acci <- as.matrix(acci)
no_acci <- as.matrix(no_acci)

Step 5: Sort extracted words and create a new data frame with words and their frequency.

why <- sort(rowSums(why), decreasing = TRUE) 
why <- data.frame(word = names(why), freq = why)

acci <- sort(rowSums(acci), decreasing = TRUE) 
acci <- data.frame(word = names(acci), freq = acci)

no_acci <- sort(rowSums(no_acci), decreasing = TRUE) 
no_acci <- data.frame(word = names(no_acci), freq = no_acci)

Step 6: Filter only words with 4 or more letters.

why <- filter(why, nchar(word) >= 4)
acci <- filter(acci, nchar(word) >= 4)
no_acci <- filter(no_acci, nchar(word) >= 4)
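
For readers without the tm package, the idea behind Steps 2 to 6 can be sketched in base R on a handful of descriptions. This is only an approximation of the pipeline above: it replaces digits and punctuation with spaces instead of deleting them, and it applies only the custom stop-word list. The three sample descriptions are taken from Table 1, and all object names are assumptions for this example.

```r
# Three descriptions taken from Table 1
desc <- c("EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH",
          "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH",
          "FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION")

# Lower-case the text and blank out digits and punctuation
clean <- tolower(desc)
clean <- gsub("[[:digit:][:punct:]]", " ", clean)

# Split into words on runs of whitespace
words <- unlist(strsplit(clean, "[[:space:]]+"))

# Keep words with 4 or more letters, then drop the custom stop words
words <- words[nchar(words) >= 4]
custom_stop <- c("driving", "vehicle", "driver", "person", "posted", "failure")
words <- words[!words %in% custom_stop]

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = as.vector(freq))
word_freq
```

The resulting data frame has the same two columns (word, freq) that the wordcloud() calls later in this report expect.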

Word Cloud Generation

Word Cloud Tutorial : https://towardsdatascience.com/create-a-word-cloud-with-r-bde3e7422e8a
We can select a color palette for word cloud from RColorBrewer.

pal1 <- brewer.pal(8,"Dark2")
pal2 <- brewer.pal(8, "Spectral")
pal3 <- brewer.pal(8, "Accent")

We’ll create the word clouds after excluding “driving”, “vehicle”, “driver”, “person”, “failure” and “posted”, which are high-frequency words that I think are either meaningless or taken for granted in this context.

Traffic Accidents Word Cloud

wordcloud(words = acci$word,
              freq = acci$freq,
              min.freq = 1,
              max.words = 200,
              random.order= FALSE, 
              rot.per= 0.3,    # Texts rotation ratio
              colors = pal1)

Figure 8: We can guess the causes of traffic accidents from the word cloud. Many of the accidents involve failure to control speed on highways or driving under the influence (DUI).

General Traffic Violations Word Cloud

cloud2 <- wordcloud(words = no_acci$word,
              freq = no_acci$freq,
              min.freq = 1,
              max.words= 200,
              random.order=FALSE,
              rot.per=0.3, 
              colors= pal1)

Figure 9: We can see that the most frequent general violations are speed limit violations.

Where?

In 2021, there were 63,697 traffic violations. The data provides the exact location of each violation as longitude and latitude. At first, I wanted to do a GIS analysis by city or zip code, but I could not join this data frame to a map file (shapefile) because the dataset has no city or zip code for the location of each violation. We could add about 63,000 markers to a map of Montgomery County, but we won’t.

The Montgomery County Department of Police (MCPD) is divided into six districts. Source: https://www.montgomerycountymd.gov/pol/districts.html

df_sub <- df %>%
  arrange(SubAgency)

 df %>%
  group_by(SubAgency) %>%
  summarize(count = n()) %>%
  mutate(label = count) %>%
  ggplot( aes(x = SubAgency, y = count, fill = count)) +
  geom_col(color = "black") +
  scale_fill_gradientn(colours = pal) +
  geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
  theme_fivethirtyeight() +
  labs( title = "Frequency of Traffic Violations by District", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + 
  ylab('Frequency') + 
  xlab('District')  +
  scale_x_discrete(labels = c('1st District\nRockville','2nd District\nBethesda','3rd District\nSilver Spring','4th District\nWheaton','5th District\nGermantown','6th District\nGaithersburg','Headquarters')) +
  theme(legend.position = "none")  +
  coord_flip()

Figure 10: This chart shows the number of violations by the district to which the officer who recorded the violation is assigned. Headquarters has the highest number of violations, and Rockville has the lowest. However, we should keep in mind that the actual location of a violation and the district the officer is assigned to may be different. To see the exact locations of the approximately 17,000 violations recorded by officers assigned to headquarters, we would need to look each one up using the Location, Longitude, and Latitude variables. For example, although Rockville has the smallest number of violations, it is possible that many violations that occurred in Rockville were recorded by headquarters officers.

loca_df <- df %>%
  group_by(Location) %>%
  summarise(count = n()) %>%
  arrange(-count)
#loca_df

The Location variable contains the location of each traffic violation, but some locations are addresses and others are intersections. Because the location information is inconsistent, there are 15,050 distinct values, and I think some of them refer to the same place. Still, we will take a look at the top 10 traffic violation locations.

loca_df %>%
  filter(count > 130) %>%
  mutate(label = count) %>%
  ggplot(aes(x = reorder(Location, count), y = count, fill= count)) +
  geom_col(color = "black") +
  scale_fill_gradientn(colours = pal) +
  geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
  theme_fivethirtyeight() +
  labs( title = "Top 10 Locations of Traffic Violations", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + 
  ylab('Frequency') + 
  xlab('Location')  +
  theme(legend.position = "none")  +
  coord_flip()

Figure 11: The 1st ranked "BARNSVILLE & OLD HUNDRED", the 2nd ranked "21910 BEALLSVILLE RD" and the 6th ranked "BARNSVILLE & BEALLSVILLE" are all located within 0.6 miles of each other, and together they account for 526 (= 200 + 180 + 146) violations. We should avoid or be cautious when passing near Barnesville and Beallsville, since there may be many police officers working hard in that area.

IV. Conclusion

So far, we have analyzed the 2021 traffic violations in Montgomery County. We looked at who violated which traffic laws, when, where, and why. According to a student who presented this topic in a Capstone205 class and inspired me, the number of traffic violations has decreased drastically since the outbreak of COVID-19. Perhaps traffic volume itself has dropped sharply, so the number of violations has dropped as well. People’s lives have changed a lot since the outbreak of COVID-19, and since COVID-19 is not over yet, we can assume that a similar pattern will continue in the future. The number of traffic violations has decreased, but I guess the reasons for the violations will stay the same. We must carry our driver’s license and vehicle registration card when driving, obey the speed limit, never drink and drive, and drive more safely on the highway. The next time I get a chance, I’d like to analyze all traffic violations over the full 10 years. After improving my skills, I also hope to create a geographic heatmap for GIS analysis and present it to the Capstone205 class.

Thank you. The End.

Final Project: Traffic Violations in Montgomery County, MD, 2021

Yunji Kim

2022-05-07