I. Introduction

My final project examines traffic violations in Montgomery County, Maryland, where we live. I chose this topic out of curiosity, and because driving is essential to our daily lives. Most of us live in Montgomery County and drive for a variety of reasons. While driving, we may commit violations, either intentionally or accidentally, and end up being stopped by the police, caught on camera, or fined. The dataset I selected for the final project contains real traffic violations that have occurred around us. It is available on dataMontgomery (https://data.montgomerycountymd.gov/Public-Safety/Traffic-Violations/4mse-ku6q).

Variables in the dataset

The original dataset consists of 35 variables and approximately 1.79 million observations, and it is updated daily. Most of the variables are categorical. Each record also includes the date and the location (latitude and longitude) of the stop. The variables and their descriptions are as follows:

Variables Description
Date Of Stop Date of the traffic violation.
Time Of Stop Time of the traffic violation.
Agency Agency issuing the traffic violation. (Example: MCP is Montgomery County Police)
SubAgency Court code representing the district of assignment of the officer. R15 = 1st district, Rockville B15 = 2nd district, Bethesda SS15 = 3rd
Description Text description of the specific charge
Location Location of the violation, usually an address or intersection.
Latitude Latitude location of the traffic violation.
Longitude Longitude location of the traffic violation.
Accident YES if traffic violation involved an accident.
Belts YES if seat belts were in use in accident cases.
Personal Injury Yes if traffic violation involved Personal Injury.
Property Damage Yes if traffic violation involved Property Damage.
Fatal Yes if traffic violation involved a fatality.
Commercial License Yes if driver holds a Commercial Drivers License
HAZMAT Yes if the traffic violation involved hazardous materials.
Commercial Vehicle Yes if the vehicle committing the traffic violation is a commercial vehicle.
Alcohol Yes if the traffic violation included an alcohol related suspension.
Work Zone Yes if the traffic violation was in a work zone.
State State issuing the vehicle registration.
VehicleType Type of vehicle (Examples: Automobile, Station Wagon, Heavy Duty Truck, etc.)
Year Year vehicle was made.
Make Manufacturer of the vehicle (Examples: Ford, Chevy, Honda, Toyota, etc.)
Model Model of the vehicle.
Color Color of the vehicle.
Violation Type Violation type. (Examples: Warning, Citation, SERO)
Charge Numeric code for the specific charge.
Article Article of State Law. (TA = Transportation Article, MR = Maryland Rules)
Contributed To Accident If the traffic violation was a contributing factor in an accident.
Race Race of the driver. (Example: Asian, Black, White, Other, etc.)
Gender Gender of the driver (F = Female, M = Male)
Driver City City of the driver’s home address
Driver State State of the driver’s home address.
DL State State issuing the Driver’s License.
Arrest Type Type of Arrest (A = Marked, B = Unmarked, etc.)
Geolocation Geo-coded location information.

Questions I would like to explore in this dataset

The original dataset is very large because it contains every recorded violation from January 1, 2012 to the present (2022). Therefore, I will extract and investigate only the violations from 2021. With this dataset I will explore the five Ws of traffic violations: who, when, where, what, and why.

II. Data Pre-processing

Setting working directory and loading the dataset

setwd("C:/Users/ykim2/Downloads/MC/R")
df <- read.csv("traffic_violations_2021.csv")

Loading required libraries

library(tidyverse)   # ggplot2 & dplyr
library(lubridate)  # date format
library(ggthemes)   # special theme for theme_fivethirtyeight()
library(plotly)     # interactive graph for pie chart
library(wesanderson) # cool color palette for bar graphs 
library(dygraphs)    # interactive time series chart
library(xts)  # create eXtensible Time Series (xts) data
library(viridis)    # beautiful color palette for Heatmap
library(knitr)      # for a nice table 
library(kableExtra)

Data Wrangling & Preparation

str(df)
## 'data.frame':    63697 obs. of  40 variables:
##  $ X                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date.Of.Stop           : chr  "12/08/2021" "12/08/2021" "12/09/2021" "12/09/2021" ...
##  $ Time.Of.Stop           : chr  "23:34:00" "23:20:00" "10:46:00" "10:46:00" ...
##  $ Agency                 : chr  "MCP" "MCP" "MCP" "MCP" ...
##  $ SubAgency              : chr  "2nd District, Bethesda" "5th District, Germantown" "3rd District, Silver Spring" "3rd District, Silver Spring" ...
##  $ Description            : chr  "DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION" "FAIL TO DISPLAY REG. CARD ON DEMAND" "DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED SIGNAL" "DRIVING W/O CURRENT TAGS" ...
##  $ Location               : chr  "WATKINS MILL @ TRAVIS LANE" "MIDDLEBROOK RD AT GREAT SENECA HWY" "RIDGE ROAD AND BETHESDA CHRUCH ROAD" "RIDGE ROAD AND BETHESDA CHRUCH ROAD" ...
##  $ Latitude               : num  39.2 39.2 39.3 39.3 0 ...
##  $ Longitude              : num  -77.2 -77.3 -77.2 -77.2 0 ...
##  $ Accident               : chr  "No" "No" "No" "No" ...
##  $ Belts                  : chr  "No" "No" "No" "No" ...
##  $ Personal.Injury        : chr  "No" "No" "No" "No" ...
##  $ Property.Damage        : chr  "No" "No" "No" "No" ...
##  $ Fatal                  : chr  "No" "No" "No" "No" ...
##  $ Commercial.License     : chr  "No" "No" "No" "No" ...
##  $ HAZMAT                 : chr  "No" "No" "No" "No" ...
##  $ Commercial.Vehicle     : chr  "No" "No" "No" "No" ...
##  $ Alcohol                : chr  "No" "No" "No" "No" ...
##  $ Work.Zone              : chr  "No" "No" "No" "No" ...
##  $ State                  : chr  "MD" "MD" "TX" "TX" ...
##  $ VehicleType            : chr  "02 - Automobile" "02 - Automobile" "06 - Heavy Duty Truck" "06 - Heavy Duty Truck" ...
##  $ Year                   : int  2009 2002 2009 2009 2016 2020 2020 2020 2020 2020 ...
##  $ Make                   : chr  "NISSAN" "LEXUS" "CHEVY" "CHEVY" ...
##  $ Model                  : chr  "4S" "LS 430" "SILVERADO" "SILVERADO" ...
##  $ Color                  : chr  "BLACK" "WHITE" "WHITE" "WHITE" ...
##  $ Violation.Type         : chr  "Citation" "Citation" "Citation" "Citation" ...
##  $ Charge                 : chr  "13-401(h)" "13-409(b)" "21-202(h1)" "13-411(d)" ...
##  $ Article                : chr  "Transportation Article" "Transportation Article" "Transportation Article" "Transportation Article" ...
##  $ Contributed.To.Accident: chr  "False" "False" "False" "False" ...
##  $ Race                   : chr  "WHITE" "WHITE" "HISPANIC" "HISPANIC" ...
##  $ Gender                 : chr  "M" "M" "M" "M" ...
##  $ Driver.City            : chr  "MONTGOMERY VILLAGE" "GERMANTOWN" "GAITHERSBURG" "GAITHERSBURG" ...
##  $ Driver.State           : chr  "MD" "MD" "MD" "MD" ...
##  $ DL.State               : chr  "MD" "MD" "MD" "MD" ...
##  $ Arrest.Type            : chr  "A - Marked Patrol" "A - Marked Patrol" "A - Marked Patrol" "A - Marked Patrol" ...
##  $ Geolocation            : chr  "(39.160035, -77.2155816666667)" "(39.1718966666667, -77.26247)" "(39.2852916666667, -77.2090083333333)" "(39.2852916666667, -77.2090083333333)" ...
##  $ dates                  : chr  "2021-12-08" "2021-12-08" "2021-12-09" "2021-12-09" ...
##  $ date                   : int  8 8 9 9 8 8 8 8 8 8 ...
##  $ month                  : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ year                   : int  2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
  • The last four variables (dates, date, month, year) were created earlier from the ‘Date.Of.Stop’ variable in order to extract the 2021 data. The code is:
library(lubridate)
df <- df %>% mutate(dates = as.Date(Date.Of.Stop,"%m/%d/%Y"), date = day(dates), month = month(dates), year = year(dates))
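The extraction step itself can be sketched on a toy data frame. This is a minimal illustration in base R, not the exact script used for the project: `raw` and its three rows are made up, and only the `Date.Of.Stop` column name and its month/day/year format come from the dataset.

```r
# Toy stand-in for the full export (hypothetical rows)
raw <- data.frame(
  Date.Of.Stop = c("12/08/2021", "03/15/2019", "01/02/2021"),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings into Date objects
raw$dates <- as.Date(raw$Date.Of.Stop, format = "%m/%d/%Y")

# Derive the calendar parts as integers
raw$date  <- as.integer(format(raw$dates, "%d"))
raw$month <- as.integer(format(raw$dates, "%m"))
raw$year  <- as.integer(format(raw$dates, "%Y"))

# Keep only the stops that happened in 2021
df_2021 <- raw[raw$year == 2021, ]
print(df_2021)
```

The filtered data frame could then be saved with write.csv(), which would also explain the extra `X` row-number column that appears when the file is read back in.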
  • We will transform some date and time variables and create new ones for date- and time-based analysis. First, the ‘dates’ variable was read back from the CSV file as character, so we’ll convert it to Date format.
df$dates <- as.Date(df$dates)
  • For hour-based analysis, we will extract only the first two characters from ‘Time.Of.Stop’ variable and convert them into numeric.
df$hour <- substr(df$Time.Of.Stop,1,2)
df$hour <- as.numeric(df$hour)
  • Now, using the weekdays() function, we will create a variable for the day of the week.
df$weekday <- weekdays(df$dates)
table(df$weekday)
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##     10946      8297      6775      5321     10811     11192     10355
  • Using month.name[], we’ll replace the numeric months with month names.
df$month_name <- month.name[df$month]
table(df$month_name)
## 
##     April    August  December  February   January      July      June     March 
##      3413      5471      7677      4842      5237      6044      3661      5817 
##       May  November   October September 
##      3724      6223      5355      6233
  • The day and month names are sorted alphabetically rather than chronologically, so I will convert both variables to factors with the correct level order.
df$weekday <- factor(df$weekday, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

df$month_name <- factor(df$month_name, levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
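The effect of setting factor levels can be seen on a small example. This is a sketch under stated assumptions: the four dates are made up, and the day names are built from a fixed lookup with the `%u` weekday number (1 = Monday) because weekdays() returns names in the current locale.

```r
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Hypothetical stop dates: a Wednesday, two Fridays and a Monday
stops <- as.Date(c("2021-12-08", "2021-12-10", "2021-12-06", "2021-12-10"))

# %u is the weekday number, 1 = Monday ... 7 = Sunday
day_chr <- day_levels[as.integer(format(stops, "%u"))]

# As plain character, table() orders the names alphabetically
alpha_order <- names(table(day_chr))

# As a factor with explicit levels, table() follows calendar order
day_fct <- factor(day_chr, levels = day_levels)
cal_order <- names(table(day_fct))

print(alpha_order)
print(cal_order)
```

A side effect worth knowing is that the factor version also reports days with zero stops, which keeps empty days visible in a bar chart.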


III. Exploratory Data Analysis (EDA)

Who?

  • Demographic information in this dataset includes only gender and race. These counts are descriptive only and should not be used to draw biased conclusions about any gender or race.
df %>%
  group_by(Gender) %>%
  summarise(count = n()) %>%
  mutate(ratio = round((count/sum(count)*100),1),
         label = ratio) %>%
  ggplot(aes (x = Gender, y = count, fill = Gender)) +
  geom_bar(stat = "identity", color ="black" , width = 0.6) +
  geom_label(aes(label=paste(label,"%")), fill = "#FFF9F5", vjust = 0.5) +
  scale_x_discrete(labels = c('Female','Male','Unspecified')) +
  scale_fill_manual(labels= c("Female", "Male", "Unspecified"),values = c("#e3919d","#3878a4","#FFF1E0")) +
  labs( title = "Number of Traffic Violations by Gender", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov") +
  theme_fivethirtyeight()

Figure 1: We can see that almost 70% of people caught in traffic violations are men.
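The percentage labels on the bars come from dividing each group count by the total. A minimal base R sketch of that calculation follows; the `gender` vector is invented for illustration and does not reflect the real proportions.

```r
# Hypothetical gender codes for ten stops
gender <- c("M", "F", "M", "M", "U", "F", "M", "M", "F", "M")

# Count per group, then convert to a percentage with one decimal
counts <- table(gender)
ratio <- round(as.vector(counts) / sum(counts) * 100, 1)
names(ratio) <- names(counts)

# paste() joins with a space, so the label reads "60 %"
labels <- paste(ratio, "%")
print(data.frame(gender = names(ratio), ratio = ratio, label = labels))
```

Using paste0(ratio, "%") instead would drop the space before the percent sign, which is a matter of taste.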


t <- list(
  family = "Arial Black", # assign font family for plot_ly chart
  size = 26)

#9190b8 Purple
#f7ede2 Ivory
#e3919d Pink
#84a59d Green
#f6bd60 Yellow
#f28482 Pink

df %>%
  group_by(Race) %>%
  summarise(count = n())%>%
  mutate(pct = round((count/sum(count)),3)) %>%
  plot_ly(labels = ~ Race,
          values = ~ pct,
          textposition = 'outside',
          hoverinfo = 'text',
          text = ~ paste("Number:",count),
          #hovertemplate = paste('Number: %{text}'),
          textinfo = 'label+percent',
          marker = list(colors = c("#9190b8","#f7ede2", "#e3919d" , "#f28482" , "#84a59d" , "#f6bd60"),
                        line = list(color ='black', width = 1)),
          hole = 0.5,
          type = 'pie') %>%
  layout(title = list(text = "Percent of Traffic Violations by Race", font = t),
          xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline =FALSE, showticklabels = FALSE),
         autosize = T,
         margin = list( b=40, t= 80)) %>%
  layout(annotations = list(x = 0.14 , y = 1.09, text = "Montgomery County, Maryland, 2021", showarrow = F, xref='paper', yref='paper'))
Figure 2: This chart shows each race's share of all traffic violation records in 2021. Hover over a slice to see the number of violations.


When?

  • This data provides the exact date and time of each violation. The crux of my final project is to find out when traffic violations occur most often.
pal <- wes_palette("Royal2", 100, type = "continuous")


 df %>%
  group_by(hour) %>%
  summarize(count = n()) %>%
  ggplot( aes(x = hour, y = count, fill = count)) +
  geom_col(color = "black") +
  geom_text(aes(label = count), position = position_dodge(0.95), angle = 90, vjust = 0.5, hjust = 1.2,  size = 4, color ="grey23") +
  scale_fill_gradientn(colours = pal) +
  theme_fivethirtyeight() +
  labs( title = "Frequency of Traffic Violations by Hour", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + ylab('Frequency') + xlab('Hour') +
  theme(legend.position = "none") 

Figure 3: This graph shows the number of traffic violations by hour of the day. As expected, the highest number of violations occurs during the morning rush hour, from 7 to 9.
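The hourly counts behind this chart can be reproduced without ggplot2. The sketch below is illustrative only: the first four times are the ones visible in the str() output, the fifth is made up, and the explicit 0 to 23 levels are an addition so that hours with no stops still appear with a count of zero.

```r
# Four times from the str() output plus one hypothetical morning stop
time_of_stop <- c("23:34:00", "23:20:00", "10:46:00", "10:46:00", "07:05:00")

# Same extraction as in the pre-processing step
hour <- as.numeric(substr(time_of_stop, 1, 2))

# Fix the levels so every hour of the day is represented
hour_fct <- factor(hour, levels = 0:23)
by_hour <- table(hour_fct)

print(by_hour)
```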


 df %>%
  group_by(weekday) %>%
  summarize(count = n()) %>%
  mutate(label = count) %>%
  ggplot( aes(x = weekday, y = count, fill = count)) +
  geom_col(color = "black") +
  scale_fill_gradientn(colours = pal) +
  geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
  theme_fivethirtyeight() +
  labs( title = "Frequency of Traffic Violations per Day of Week", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + ylab('Frequency')+ xlab('Weekday')+
  scale_x_discrete(labels = (c('Monday','Tuesday','Wednesday', 'Thursday','Friday','Saturday','Sunday'))) +
  # rev() : put elements in reverse order
  theme(legend.position = "none") #+ coord_flip()  

Figure 4: There are more traffic violations from Tuesday to Friday than on weekends. Just as most violations occur during rush hour when traffic is heavy, this is probably because people are most active on Tuesdays, Wednesdays, Thursdays, and Fridays.


Time Series Analysis

  • Now, we will create a new data frame by collecting the frequency of daily traffic violations and see the trend.
by_date <- df %>%
  group_by(dates) %>%
  summarise(count = n()) 
  • We will transform the new data frame into a type suitable for time series analysis.
library(xts) # create eXtensible Time Series (xts) data
by_date <- xts(by_date$count, order.by = by_date$dates)  
str(by_date)
## An 'xts' object on 2021-01-01/2021-12-31 containing:
##   Data: int [1:365, 1] 177 184 157 142 156 233 271 274 220 143 ...
##   Indexed by objects of class: [Date] TZ: UTC
##   xts Attributes:  
##  NULL
  • We will create a chart using the dygraphs package, which produces interactive time series charts.
dygraph(by_date, 
        main = "<font size=5> Traffic Violations of Year 2021 </font> <br> <small>Montgomery County, MD</small>",
        ylab = "Frequency") %>%
  dyRangeSelector() %>%
  dySeries("V1", label = "Frequency", color = "#6b705c") %>%
  dyLegend(show = "follow") %>%
  dyOptions( fillGraph = TRUE)
Figure 5: There are clearly fewer traffic violations in the 2nd quarter of the year. This is an interactive graph, so the individual value is displayed as you hover over the line. On December 10, 2021, there were 460 traffic violations, the most of the year. You can zoom in by adjusting the date range in the range selector at the bottom of the dygraph.
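The busiest day can also be found programmatically instead of by hovering. This is a small base R sketch on six invented dates, not the real data; `daily` and `peak` are names introduced here for illustration.

```r
# Hypothetical stop dates (one row per violation)
dates <- as.Date(c("2021-12-08", "2021-12-08", "2021-12-09",
                   "2021-12-10", "2021-12-10", "2021-12-10"))

# Count violations per day
daily <- as.data.frame(table(dates), stringsAsFactors = FALSE)
names(daily) <- c("dates", "count")
daily$dates <- as.Date(daily$dates)

# Row with the highest daily count (first one if there are ties)
peak <- daily[which.max(daily$count), ]
print(peak)
```

On the real `by_date` data frame, the same which.max() lookup should point to the day with the highest count.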


Calendar Heatmap Chart

  • Now we will create a heatmap that shows the frequencies of daily traffic violations in 2021.
  • First I will divide the year into 53 weeks and make a new week variable.
df <- df %>%
  mutate(week = strftime(dates,"%W"))
df1 <- df %>%
  count(month_name, week, weekday, date)
#df1
  • We will do a reverse sort to arrange them sequentially from week 1 on the top of the y-axis of the heat map.
df1$week <- factor(df1$week, levels = rev(sort(unique(df$week))))
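It helps to see what the `%W` format actually returns. It numbers weeks from 00, starts each week on Monday, and puts the days before the first Monday of the year in week 00. The dates below are chosen only to illustrate the boundaries.

```r
# Jan 1, 2021 was a Friday; Jan 4 was the first Monday of the year
d <- as.Date(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-12-31"))

# Week of the year as a two-character string, Monday as first day
week <- strftime(d, "%W")

print(data.frame(date = d, week = week))
```

For 2021 the values run from "00" to "52", which gives the 53 rows used in the heatmap.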
  • Now we’ll create a heatmap that looks like a calendar.
df1 %>%
  ggplot(aes(x=weekday, y = week)) + 
  geom_tile(aes(fill = n), color = "#616161", lwd = 0.5) +
  scale_fill_viridis(option ="magma", direction = -1) +
  #theme_classic() +
  theme_tufte(base_family="Helvetica") +
  facet_wrap(~month_name, nrow = 3, scales = "free") +
  geom_text(aes(label = date), color = "grey", size = 3) +
  theme(axis.ticks.y = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.text.y = element_blank()) +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y = element_blank()) +
  scale_x_discrete(labels = c("Mo","Tu","We","Th","Fr","Sa","Su"), position = "top")+
  theme(strip.placement = "outside") +
  theme(strip.text.x = element_text(size = 10, hjust = 0))+
  ggtitle("Heatmap of Traffic Violations in Montgomery County, MD (2021)") + 
  theme(plot.title = element_text(family = "Arial", face= "bold", size = 16 )) 

Figure 6: This is not an interactive chart, so we cannot read the exact frequency for each day, but the colors show at a glance which days had the most traffic violations. As we saw in the time series chart, December 10th, which has the highest number of traffic violations, has the darkest color. I googled December 10, 2021 and found that there was a severe thunderstorm that day. Perhaps it has something to do with the number of traffic violations.



What & Why?

  • There are numerous types of traffic violations. The Description variable records the specific charge for each violation. Violations can be broadly divided into general traffic violations and violations that contributed to traffic accidents.
df %>%
  group_by(Contributed.To.Accident) %>%
  summarise(count = n()) %>%
  mutate(ratio = round((count/sum(count)*100),1),
         label = ratio) %>%
  ggplot(aes (x = Contributed.To.Accident, y = count, fill = Contributed.To.Accident)) +
  geom_col(color ="black" , width = 0.6) +
  geom_text(aes(x= Contributed.To.Accident, y = 10000, label = count), size = 4, color = "#555555")   +
  geom_label(aes(label=paste(label,"%")), fill = "#FFF9F5", vjust = 0.5) +
  geom_label(aes(x = 'False', y = 15000, 
                 label = "General Traffic Violations:"), 
             hjust = 0.5, 
             vjust = 0.5, 
             lineheight = 0.8,
             colour = "#555555", 
             fill ="#e3919d", 
             label.size = NA, 
             family="Helvetica", 
             size = 3.4) +
  
    geom_label(aes(x = 'True', y = 15000, 
                 label = "Violations Contributed to Accident:"), 
             hjust = 0.5, 
             vjust = 0.5, 
             lineheight = 0.8,
             colour = "#555555", 
             fill = "transparent",
             label.size = NA, 
             family="Helvetica", 
             size = 3.5) +
  
  scale_fill_manual(values = c("#e3919d" , "#3878a4" )) +
  labs( title ="Number of traffic violations Contributed to Accident", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov") +
  theme_fivethirtyeight()

Figure 7: Only about 5% of all traffic violations contributed to traffic accidents. Still, 5 out of every 100 violations is not a small percentage. We will take a look at the causes of traffic accidents below.


Reasons for Traffic Violations

  • Now we will look at what caused the traffic violation. First, we’ll separate traffic accident and general traffic violation data.
why <- df %>% select(Description)
acci <- df %>% filter(Contributed.To.Accident == "True") %>% select(Description)
no_acci <- df %>% filter(Contributed.To.Accident == "False")  %>% select(Description)


descrip <- df %>%
  group_by(Description) %>%
  summarise(Count = n()) %>%
  arrange(-Count)
str(descrip)
## tibble [2,119 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Description: chr [1:2119] "DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS" "EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH" "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH" "FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER" ...
##  $ Count      : int [1:2119] 5865 3426 2883 2225 1852 1551 1426 1367 1208 1047 ...
  • The Description variable describes the charge for each stop. There are 2,119 distinct descriptions, but many of them overlap. In the table below, ‘DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS’ ranks first as a single description. However, speeding is split across several descriptions, one per posted limit, and together the three speeding rows in the top 10 (3,426 + 2,883 + 1,208 = 7,517) outnumber it.

Top 10 Reasons for Total Traffic Violations in Montgomery County, MD, 2021

Description Count
DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS 5865
EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH 3426
EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH 2883
FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER 2225
FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND 1852
DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION 1551
DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE 1426
NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON 1367
EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH 1208
FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION 1047
Table 1: The top reason is 'DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS', which, according to Mark Bigger, covers any number of offenses, such as ignoring portable warning signs, failure to yield, stopping on the crosswalk, and many others. Next, there are many cases of speeding and registration violations.

Source: https://www.bakersfieldtraffictickets.com/blog/2020/february/trucker-ticket-failure-to-obey-a-traffic-control/#
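Because speeding is spread over one description per posted limit, grouping those rows gives a fairer ranking. The sketch below uses five rows and counts copied from Table 1; the `Group` column and its label are names introduced here for illustration, not part of the dataset.

```r
# Five rows copied from Table 1
top10 <- data.frame(
  Description = c(
    "DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS",
    "EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH",
    "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH",
    "FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER",
    "EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH"
  ),
  Count = c(5865, 3426, 2883, 2225, 1208),
  stringsAsFactors = FALSE
)

# Flag every "exceeding the posted speed limit" description
is_speeding <- grepl("^EXCEEDING THE POSTED SPEED LIMIT", top10$Description)

# Collapse the speeding rows into one group, keep the others as they are
top10$Group <- ifelse(is_speeding, "SPEEDING (ALL POSTED LIMITS)", top10$Description)
grouped <- aggregate(Count ~ Group, data = top10, FUN = sum)
grouped <- grouped[order(-grouped$Count), ]

print(grouped)
```

With these five rows, the combined speeding group totals 7,517 and moves ahead of the traffic control device charge at 5,865.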

Top 10 Reasons for Traffic Accidents in Montgomery County, MD, 2021

Description Count
FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION 529
NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON 271
RECKLESS DRIVING VEHICLE IN WANTON AND WILLFUL DISREGARD FOR SAFETY OF PERSONS AND PROPERTY 188
DRIVING VEH. WHILE IMPAIRED BY ALCOHOL 186
DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL 185
DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS 145
DRIVER WHEN TURNING LEFT FAIL TO YIELD RIGHT OF WAY TO VEHICLE APPROACHING FROM OPPOSITE DIRECTION 117
FAILURE TO CONTROL VEH. SPEED ON HWY. TO AVOID COLLISION 107
DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL PER SE 92
DRIVER CHANGING LANES WHEN UNSAFE 87
Table 2: The table above shows the violations most often related to accidents. A total of 463 (= 186 + 185 + 92) cases are related to alcohol. Therefore, accidents caused by drunk driving should be ranked 2nd. Rows 1 and 8 are the same charge worded differently (529 + 107 = 636), so failure to control speed remains 1st.
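The totals in this caption can be checked with a few lines of base R. The rows and counts are copied from Table 2; expanding "VEH." and "HWY." is an assumption about how the two differently worded speed control charges could be matched, not something done in the original analysis.

```r
# Five rows copied from Table 2
acc <- data.frame(
  Description = c(
    "FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION",
    "DRIVING VEH. WHILE IMPAIRED BY ALCOHOL",
    "DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL",
    "FAILURE TO CONTROL VEH. SPEED ON HWY. TO AVOID COLLISION",
    "DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL PER SE"
  ),
  Count = c(529, 186, 185, 107, 92),
  stringsAsFactors = FALSE
)

# Expand the two abbreviations so equivalent charges read the same
clean <- gsub("VEH.", "VEHICLE", acc$Description, fixed = TRUE)
clean <- gsub("HWY.", "HIGHWAY", clean, fixed = TRUE)

# Total for the speed control charge, both wordings combined
speed_total <- sum(acc$Count[
  clean == "FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION"])

# Total for every description that mentions alcohol
alcohol_total <- sum(acc$Count[grepl("ALCOHOL", clean)])

print(c(speed = speed_total, alcohol = alcohol_total))
```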


Top 10 Reasons for General Traffic Violations in Montgomery County, MD, 2021

Description Count
DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS 5720
EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH 3426
EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH 2883
FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER 2196
FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND 1785
DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION 1542
DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE 1420
EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH 1208
NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON 1096
EXCEEDING THE POSTED SPEED LIMIT OF 45 MPH 1019
Table 3: Speeding is the most common general traffic violation. The table also reminds us that we must carry our registration card and driver's license when driving.


Text Mining

  • I’ve always wanted to create a word cloud, so for fun we will create one from the text of the Description variable. To create a word cloud, we must first clean the text. We will use the ‘why’, ‘acci’, and ‘no_acci’ data frames, which contain only the Description variable. In this case the text comes from a character variable in an existing data frame, rather than from a text file extracted from a website such as Twitter.

  • Preparing Data for Word Cloud Visualization

  • Step 1: load required libraries

library(wordcloud)  # word-cloud generator
library(RColorBrewer) # color palettes
library(tm) # for text mining
  • Step 2: Create the Text Corpus
## Calculate Corpus
why <- Corpus(VectorSource(why))
acci <- Corpus(VectorSource(acci))
no_acci <- Corpus(VectorSource(no_acci))
  • Step 3: Pre-processing Text
##Data Cleaning and Wrangling

why  <- tm_map(why , removeNumbers) # Remove numbers
why  <- tm_map(why , removePunctuation) # Remove punctuations
why  <- tm_map(why , tolower)     # Convert the text to lower case
why  <- tm_map(why , removeWords, stopwords("english")) # Remove english common stopwords

acci  <- tm_map(acci , removeNumbers)
acci  <- tm_map(acci , removePunctuation)
acci  <- tm_map(acci , tolower)
acci  <- tm_map(acci , removeWords, stopwords("english"))

no_acci <- tm_map(no_acci, removeNumbers)
no_acci <- tm_map(no_acci, removePunctuation)
no_acci <- tm_map(no_acci, tolower)
no_acci <- tm_map(no_acci, removeWords, stopwords("english"))

# Remove custom stop words
acci <- tm_map(acci, removeWords, c("driving", "vehicle", "driver", "person", "posted", "failure")) 
no_acci <- tm_map(no_acci, removeWords, c("driving", "vehicle", "driver", "person", "posted", "failure")) 
  • Step 4: Create the Term-Document Matrix and convert it to a matrix
why <- TermDocumentMatrix(why)
acci <- TermDocumentMatrix(acci)
no_acci <- TermDocumentMatrix(no_acci)
why <- as.matrix(why)
acci <- as.matrix(acci)
no_acci <- as.matrix(no_acci)
  • Step 5: Sort extracted words and create a new data frame with words and their frequency.
why <- sort(rowSums(why), decreasing = TRUE) 
why <- data.frame(word = names(why), freq = why)

acci <- sort(rowSums(acci), decreasing = TRUE) 
acci <- data.frame(word = names(acci), freq = acci)

no_acci <- sort(rowSums(no_acci), decreasing = TRUE) 
no_acci <- data.frame(word = names(no_acci), freq = no_acci)
  • Step 6: Filter only words with 4 or more letters.
why <- filter(why, nchar(word) >= 4)
acci <- filter(acci, nchar(word) >= 4)
no_acci <- filter(no_acci, nchar(word) >= 4)
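To make the six steps easier to follow, here is the same idea in plain base R without the tm package. It is a simplified sketch, not a replacement for the pipeline above: the three descriptions are taken from the str() output and Table 1, and the tiny `stop_words` vector is a hypothetical stand-in for stopwords("english") plus the custom words.

```r
descriptions <- c(
  "EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH",
  "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH",
  "DRIVING W/O CURRENT TAGS"
)

# Step 3 equivalent: lower case, drop numbers and punctuation
text <- tolower(descriptions)
text <- gsub("[0-9]+", "", text)
text <- gsub("[[:punct:]]", "", text)

# Split into words on any run of whitespace
words <- unlist(strsplit(text, "[[:space:]]+"))

# Step 6 equivalent: keep words with 4 or more letters
words <- words[nchar(words) >= 4]

# Small hypothetical stop word list
stop_words <- c("the", "of", "posted")
words <- words[!words %in% stop_words]

# Steps 4 and 5 equivalent: frequency per word, highest first
freq <- sort(table(words), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = as.vector(freq),
                        stringsAsFactors = FALSE)
print(word_freq)
```

The resulting `word_freq` has the same two columns (word, freq) that wordcloud() expects.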

Word Cloud Generation

pal1 <- brewer.pal(8,"Dark2")
pal2 <- brewer.pal(8, "Spectral")
pal3 <- brewer.pal(8, "Accent")
  • We’ll create the word clouds after excluding “driving”, “vehicle”, “driver”, “person”, “failure”, and “posted”, which are high-frequency words that I think are uninformative or taken for granted.

Traffic Accidents Word Cloud

wordcloud(words = acci$word,
              freq = acci$freq,
              min.freq = 1,
              max.words = 200,
              random.order= FALSE, 
              rot.per= 0.3,    # Proportion of words rotated 90 degrees
              colors = pal1)

Figure 8: We can guess the causes of traffic accidents from the word cloud. Many crashes appear to be due to failure to control speed on highways or to DUI.


General Traffic Violations Word Cloud

cloud2 <- wordcloud(words = no_acci$word,
              freq = no_acci$freq,
              min.freq = 1,
              max.words= 200,
              random.order=FALSE,
              rot.per=0.3, 
              colors= pal1)

Figure 9: The most frequent general violations are speed limit violations.


Where?

  • In 2021, there were 63,697 traffic violations. This data provides the exact location of each violation as longitude and latitude. At first, I wanted to do a GIS analysis by city or zip code, but I could not join this data frame to a map shapefile because the dataset has no city or zip code for the location of each violation. We could add about 63,000 markers to a map of Montgomery County, but we won’t.

  • The Montgomery County Department of Police (MCPD) is divided into six districts. Source: https://www.montgomerycountymd.gov/pol/districts.html

df_sub <- df %>%
  arrange(SubAgency)
 df %>%
  group_by(SubAgency) %>%
  summarize(count = n()) %>%
  mutate(label = count) %>%
  ggplot( aes(x = SubAgency, y = count, fill = count)) +
  geom_col(color = "black") +
  scale_fill_gradientn(colours = pal) +
  geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
  theme_fivethirtyeight() +
  labs( title = "Frequency of Traffic Violations by District", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + 
  ylab('Frequency') + 
  xlab('District')  +
  scale_x_discrete(labels = (c('1st District\nRockville','2nd District\nBethesda','3rd District\nSilver Spring', '4th District\nWheaton','5th District\nGermantown','6th District\nGaithersburg','Headquarters'))) +
   theme(legend.position = "none")  +
   coord_flip()

Figure 10: This chart shows the number of violations by the district to which the citing officer is assigned. Headquarters has the highest number of violations, and Rockville has the lowest. However, the actual location of a violation may differ from the district the officer is assigned to. To see exactly where the approximately 17,000 violations recorded by headquarters officers occurred, we would need to look at the Location, Longitude, and Latitude variables. For example, although Rockville has the smallest number of violations, it is possible that many of the violations recorded by headquarters officers occurred in Rockville.
loca_df <- df %>%
  group_by(Location) %>%
  summarise(count = n()) %>%
  arrange(-count)
#loca_df
  • The Location variable contains the location of each traffic violation, but some locations are addresses and others are intersections. Because the location information is inconsistent, there are 15,050 distinct location values, and I think some of them are duplicates of the same place. Still, we will take a look at the top 10 traffic violation locations.
loca_df %>%
  filter(count > 130) %>%
  mutate(label = count) %>%
  ggplot(aes(x = reorder(Location, count), y = count, fill= count)) +
  geom_col(color = "black") +
  scale_fill_gradientn(colours = pal) +
  geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
  theme_fivethirtyeight() +
  labs( title = "Top 10 Locations of Traffic Violations", subtitle = "Montgomery County, MD, 2021",
        caption = "https://data.montgomerycountymd.gov")  +
  theme(axis.title = element_text()) + 
  ylab('Frequency') + 
  xlab('Location')  +
  theme(legend.position = "none")  +
  coord_flip()

Figure 11: The 1st ranked "BARNSVILLE & OLD HUNDRED", 2nd ranked "21910 BEALLSVILLE RD", and 6th ranked "BARNSVILLE & BEALLSVILLE" are all within 0.6 miles of each other, and together they account for 526 (= 200 + 180 + 146) violations. We should avoid or be cautious when passing near Barnsville and Beallsville, since there may be many police officers working hard in that area.
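Some of the duplicate locations probably differ only in how the intersection is written ("@", "AT", "AND", "&"). One possible clean-up, sketched below, rewrites those separators to a single form before counting. The first four strings appear in this report; the fifth is a hypothetical variant added to show two spellings collapsing into one. This is an assumption about the duplicates, not a step that was applied in the analysis.

```r
loc <- c("WATKINS MILL @ TRAVIS LANE",
         "MIDDLEBROOK RD AT GREAT SENECA HWY",
         "RIDGE ROAD AND BETHESDA CHRUCH ROAD",
         "BARNSVILLE & OLD HUNDRED",
         "BARNSVILLE AND OLD HUNDRED")  # hypothetical variant

# Upper case and trim, then rewrite standalone separators as " & "
norm <- toupper(trimws(loc))
norm <- gsub("[[:space:]]+(@|AT|AND)[[:space:]]+", " & ", norm)

# Collapse any repeated whitespace that is left
norm <- gsub("[[:space:]]+", " ", norm)

print(unique(norm))
```

A rule this simple would also rewrite a street name that happens to contain a standalone "AND" or "AT", and it does not fix misspellings, so the result would need a manual check.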


IV. Conclusion

So far, we have analyzed the 2021 traffic violations in Montgomery County. We looked at who violated which traffic laws, and when, where, and why. According to a student who presented this topic in a Capstone205 class and inspired me, the number of traffic violations has decreased drastically since the outbreak of COVID-19. Perhaps traffic volume itself has been greatly reduced, so the number of violations has also decreased. People’s lives have changed a lot since the outbreak of COVID-19, and since COVID-19 is not over yet, we can assume that a similar pattern will continue for now. The number of traffic violations has decreased, but I guess the reasons for the violations will always be the same. We must carry our driver’s license and vehicle registration card when driving, obey the speed limit, never drink and drive, and drive more safely on the highway. Next time I get a chance, I’d like to analyze all traffic violations over the full 10 years. And after improving my skills, I hope to create a geographic heatmap for GIS analysis and present it to the Capstone205 class.

Thank you. The End.