My final project examines traffic violations in Montgomery County, where we live. I chose this topic out of curiosity, and perhaps because driving is essential to our lives. Most of us probably live in Montgomery County and drive for a variety of reasons. While driving, we may commit violations, intentionally or accidentally, and be stopped by the police, caught on camera, or fined. The dataset I selected for the final project contains real traffic violations that have occurred around us. It is available on dataMontgomery (https://data.montgomerycountymd.gov/Public-Safety/Traffic-Violations/4mse-ku6q).
The original dataset consists of 35 variables and approximately 1.79 million observations and is updated daily. Most of the variables are categorical; the dataset also records the date and location (latitude and longitude) of each observation. Each variable and its description are as follows:
| Variables | Description |
|---|---|
| Date Of Stop | Date of the traffic violation. |
| Time Of Stop | Time of the traffic violation. |
| Agency | Agency issuing the traffic violation. (Example: MCP is Montgomery County Police) |
| SubAgency | Court code representing the district of assignment of the officer. R15 = 1st district, Rockville; B15 = 2nd district, Bethesda; SS15 = 3rd |
| Description | Text description of the specific charge |
| Location | Location of the violation, usually an address or intersection. |
| Latitude | Latitude location of the traffic violation. |
| Longitude | Longitude location of the traffic violation. |
| Accident | YES if traffic violation involved an accident. |
| Belts | YES if seat belts were in use in accident cases. |
| Personal Injury | Yes if traffic violation involved Personal Injury. |
| Property Damage | Yes if traffic violation involved Property Damage. |
| Fatal | Yes if traffic violation involved a fatality. |
| Commercial License | Yes if driver holds a Commercial Drivers License |
| HAZMAT | Yes if the traffic violation involved hazardous materials. |
| Commercial Vehicle | Yes if the vehicle committing the traffic violation is a commercial vehicle. |
| Alcohol | Yes if the traffic violation included an alcohol related suspension. |
| Work Zone | Yes if the traffic violation was in a work zone. |
| State | State issuing the vehicle registration. |
| VehicleType | Type of vehicle (Examples: Automobile, Station Wagon, Heavy Duty Truck, etc.) |
| Year | Year vehicle was made. |
| Make | Manufacturer of the vehicle (Examples: Ford, Chevy, Honda, Toyota, etc.) |
| Model | Model of the vehicle. |
| Color | Color of the vehicle. |
| Violation Type | Violation type. (Examples: Warning, Citation, SERO) |
| Charge | Numeric code for the specific charge. |
| Article | Article of State Law. (TA = Transportation Article, MR = Maryland Rules) |
| Contributed To Accident | If the traffic violation was a contributing factor in an accident. |
| Race | Race of the driver. (Example: Asian, Black, White, Other, etc.) |
| Gender | Gender of the driver (F = Female, M = Male) |
| Driver City | City of the driver’s home address |
| Driver State | State of the driver’s home address. |
| DL State | State issuing the Driver’s License. |
| Arrest Type | Type of Arrest (A = Marked, B = Unmarked, etc.) |
| Geolocation | Geo-coded location information. |
The original dataset is very large because it contains every record from January 1, 2012, to the present (2022). Therefore, I will extract and investigate only the events of 2021. I will explore the five Ws of traffic violations with this dataset: who, when, where, what, and why.
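A minimal sketch of how the 2021 rows could be extracted from the full export. The toy data frame `full` and the name `df_2021` are assumptions for illustration; only the `Date Of Stop` column (read by R as `Date.Of.Stop`) and its month/day/year format come from the dataset.

```r
# Toy stand-in for the full export (hypothetical rows).
full <- data.frame(
  Date.Of.Stop = c("12/31/2020", "01/01/2021", "12/08/2021", "01/01/2022"),
  Agency = "MCP",
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings into Date objects.
stop_date <- as.Date(full$Date.Of.Stop, format = "%m/%d/%Y")

# Keep only the stops whose year is 2021.
df_2021 <- full[format(stop_date, "%Y") == "2021", ]

nrow(df_2021)
```

The filtered frame could then be written out with `write.csv()` so that later steps only need to read the smaller 2021 file.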
setwd("C:/Users/ykim2/Downloads/MC/R")
df <- read.csv("traffic_violations_2021.csv")
library(tidyverse) # ggplot2 & dplyr
library(lubridate) # date format
library(ggthemes) # special theme for theme_fivethirtyeight()
library(plotly) # interactive graph for pie chart
library(wesanderson) # cool color palette for bar graphs
library(dygraphs) # interactive time series chart
library(xts) # create eXtensible Time Series (xts) data
library(viridis) # beautiful color palette for Heatmap
library(knitr) # for a nice table
library(kableExtra)
Color source:
https://venngage.com/blog/fall-color-palettes/?msclkid=ae1b3a1dce9411ec9f0bbda82ee9151c
https://www.color-hex.com/color-palettes/
https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/
Before cleaning the data, we will look at the structure of the entire data frame and check that each variable has the correct data type.
str(df)
## 'data.frame': 63697 obs. of 40 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date.Of.Stop : chr "12/08/2021" "12/08/2021" "12/09/2021" "12/09/2021" ...
## $ Time.Of.Stop : chr "23:34:00" "23:20:00" "10:46:00" "10:46:00" ...
## $ Agency : chr "MCP" "MCP" "MCP" "MCP" ...
## $ SubAgency : chr "2nd District, Bethesda" "5th District, Germantown" "3rd District, Silver Spring" "3rd District, Silver Spring" ...
## $ Description : chr "DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION" "FAIL TO DISPLAY REG. CARD ON DEMAND" "DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED SIGNAL" "DRIVING W/O CURRENT TAGS" ...
## $ Location : chr "WATKINS MILL @ TRAVIS LANE" "MIDDLEBROOK RD AT GREAT SENECA HWY" "RIDGE ROAD AND BETHESDA CHRUCH ROAD" "RIDGE ROAD AND BETHESDA CHRUCH ROAD" ...
## $ Latitude : num 39.2 39.2 39.3 39.3 0 ...
## $ Longitude : num -77.2 -77.3 -77.2 -77.2 0 ...
## $ Accident : chr "No" "No" "No" "No" ...
## $ Belts : chr "No" "No" "No" "No" ...
## $ Personal.Injury : chr "No" "No" "No" "No" ...
## $ Property.Damage : chr "No" "No" "No" "No" ...
## $ Fatal : chr "No" "No" "No" "No" ...
## $ Commercial.License : chr "No" "No" "No" "No" ...
## $ HAZMAT : chr "No" "No" "No" "No" ...
## $ Commercial.Vehicle : chr "No" "No" "No" "No" ...
## $ Alcohol : chr "No" "No" "No" "No" ...
## $ Work.Zone : chr "No" "No" "No" "No" ...
## $ State : chr "MD" "MD" "TX" "TX" ...
## $ VehicleType : chr "02 - Automobile" "02 - Automobile" "06 - Heavy Duty Truck" "06 - Heavy Duty Truck" ...
## $ Year : int 2009 2002 2009 2009 2016 2020 2020 2020 2020 2020 ...
## $ Make : chr "NISSAN" "LEXUS" "CHEVY" "CHEVY" ...
## $ Model : chr "4S" "LS 430" "SILVERADO" "SILVERADO" ...
## $ Color : chr "BLACK" "WHITE" "WHITE" "WHITE" ...
## $ Violation.Type : chr "Citation" "Citation" "Citation" "Citation" ...
## $ Charge : chr "13-401(h)" "13-409(b)" "21-202(h1)" "13-411(d)" ...
## $ Article : chr "Transportation Article" "Transportation Article" "Transportation Article" "Transportation Article" ...
## $ Contributed.To.Accident: chr "False" "False" "False" "False" ...
## $ Race : chr "WHITE" "WHITE" "HISPANIC" "HISPANIC" ...
## $ Gender : chr "M" "M" "M" "M" ...
## $ Driver.City : chr "MONTGOMERY VILLAGE" "GERMANTOWN" "GAITHERSBURG" "GAITHERSBURG" ...
## $ Driver.State : chr "MD" "MD" "MD" "MD" ...
## $ DL.State : chr "MD" "MD" "MD" "MD" ...
## $ Arrest.Type : chr "A - Marked Patrol" "A - Marked Patrol" "A - Marked Patrol" "A - Marked Patrol" ...
## $ Geolocation : chr "(39.160035, -77.2155816666667)" "(39.1718966666667, -77.26247)" "(39.2852916666667, -77.2090083333333)" "(39.2852916666667, -77.2090083333333)" ...
## $ dates : chr "2021-12-08" "2021-12-08" "2021-12-09" "2021-12-09" ...
## $ date : int 8 8 9 9 8 8 8 8 8 8 ...
## $ month : int 12 12 12 12 12 12 12 12 12 12 ...
## $ year : int 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
library(lubridate)
df <- df %>% mutate(dates = as.Date(Date.Of.Stop,"%m/%d/%Y"), date = day(dates), month = month(dates), year = year(dates))
df$dates <- as.Date(df$dates)
df$hour <- substr(df$Time.Of.Stop,1,2)
df$hour <- as.numeric(df$hour)
df$weekday <- weekdays(df$dates)
table(df$weekday)
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 10946 8297 6775 5321 10811 11192 10355
df$month_name <- month.name[df$month]
table(df$month_name)
##
## April August December February January July June March
## 3413 5471 7677 4842 5237 6044 3661 5817
## May November October September
## 3724 6223 5355 6233
df$weekday <-factor(df$weekday, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
df$month_name <- factor(df$month_name, levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
df %>%
group_by(Gender) %>%
summarise(count = n()) %>%
mutate(ratio = round((count/sum(count)*100),1),
label = ratio) %>%
ggplot(aes (x = Gender, y = count, fill = Gender)) +
geom_bar(stat = "identity", color ="black" , width = 0.6) +
geom_label(aes(label=paste(label,"%")), fill = "#FFF9F5", vjust = 0.5) +
scale_x_discrete(labels = c('Female','Male','Unspecified')) +
scale_fill_manual(labels= c("Female", "Male", "Unspecified"),values = c("#e3919d","#3878a4","#FFF1E0")) +
labs( title = "Number of Traffic Violations by Gender", subtitle = "Montgomery County, MD, 2021",
caption = "https://data.montgomerycountymd.gov") +
theme_fivethirtyeight()
Figure 1: We can see that almost 70% of people caught in traffic violations are men.
t <- list(
family = "Arial Black", # assign font family for plot_ly chart
size = 26)
#9190b8 Purple
#f7ede2 Ivory
#e3919d Pink
#84a59d Green
#f6bd60 Yellow
#f28482 Pink
df %>%
group_by(Race) %>%
summarise(count = n())%>%
mutate(pct = round((count/sum(count)),3)) %>%
plot_ly(labels = ~ Race,
values = ~ pct,
textposition = 'outside',
hoverinfo = 'text',
text = ~ paste("Number:",count),
#hovertemplate = paste('Number: %{text}'),
textinfo = 'label+percent',
marker = list(colors = c("#9190b8","#f7ede2", "#e3919d" , "#f28482" , "#84a59d" , "#f6bd60"),
line = list(color ='black', width = 1)),
hole = 0.5,
type = 'pie') %>%
layout(title = list(text = "Percent of Traffic Violations by Race", font = t),
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline =FALSE, showticklabels = FALSE),
autosize = T,
margin = list( b=40, t= 80)) %>%
layout(annotations = list(x = 0.14 , y = 1.09, text = "Montgomery County, Maryland, 2021", showarrow = F, xref='paper', yref='paper'))
Figure 2: The share of each race among all traffic violation records in 2021. Hover over a slice to see the number of violations.
pal <- wes_palette("Royal2", 100, type = "continuous")
df %>%
group_by(hour) %>%
summarize(count = n()) %>%
ggplot( aes(x = hour, y = count, fill = count)) +
geom_col(color = "black") +
geom_text(aes(label = count), position = position_dodge(0.95), angle = 90, vjust = 0.5, hjust = 1.2, size = 4, color ="grey23") +
scale_fill_gradientn(colours = pal) +
theme_fivethirtyeight() +
labs( title = "Frequency of Traffic Violations by Hour", subtitle = "Montgomery County, MD, 2021",
caption = "https://data.montgomerycountymd.gov") +
theme(axis.title = element_text()) + ylab('Frequency') + xlab('Hour') +
theme(legend.position = "none")
Figure 3: This graph shows the number of traffic violations by hour of the day. As expected, the highest number of violations occurs during the morning rush hour, from 7 to 9.
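The peak hour can also be read off numerically rather than from the bars. The sketch below uses a handful of hypothetical stop times in the same "HH:MM:SS" format as `Time.Of.Stop`; with the real data, `df$hour` would take the place of `hour`.

```r
# Hypothetical stop times in the same "HH:MM:SS" format as Time.Of.Stop.
time_of_stop <- c("23:34:00", "08:15:00", "08:40:00",
                  "10:46:00", "07:05:00", "08:59:00")

# Extract the hour as a number, as done for df$hour.
hour <- as.numeric(substr(time_of_stop, 1, 2))

# Count stops in each of the 24 hours, including hours with zero stops.
by_hour <- table(factor(hour, levels = 0:23))

# Hour with the most stops.
peak_hour <- as.integer(names(by_hour)[which.max(by_hour)])
peak_hour
```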
df %>%
group_by(weekday) %>%
summarize(count = n()) %>%
mutate(label = count) %>%
ggplot( aes(x = weekday, y = count, fill = count)) +
geom_col(color = "black") +
scale_fill_gradientn(colours = pal) +
geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
theme_fivethirtyeight() +
labs( title = "Frequency of Traffic Violations per Day of Week", subtitle = "Montgomery County, MD, 2021",
caption = "https://data.montgomerycountymd.gov") +
theme(axis.title = element_text()) + ylab('Frequency')+ xlab('Weekday')+
scale_x_discrete(labels = (c('Monday','Tuesday','Wednesday', 'Thursday','Friday','Saturday','Sunday'))) +
# rev() : put elements in reverse order
theme(legend.position = "none") #+ coord_flip()
Figure 4: There are more traffic violations from Tuesday to Friday than on weekends. Just as most violations occur during rush hour, when traffic is heavy, this is probably because people are most active on Tuesdays, Wednesdays, Thursdays, and Fridays.
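To put a number on the weekday and weekend gap, the daily totals from the `table(df$weekday)` output can be averaged by day type. This is a small sketch; the variable names are illustrative only.

```r
# Daily totals copied from the table(df$weekday) output.
by_weekday <- c(Monday = 8297, Tuesday = 11192, Wednesday = 10355,
                Thursday = 10811, Friday = 10946,
                Saturday = 6775, Sunday = 5321)

is_weekend <- names(by_weekday) %in% c("Saturday", "Sunday")

# Average total per day of the week, by day type.
avg_weekday <- mean(by_weekday[!is_weekend])
avg_weekend <- mean(by_weekday[is_weekend])

round(c(weekday = avg_weekday, weekend = avg_weekend))
```

Monday (8,297) sits well below the other weekdays, which is why the caption singles out Tuesday to Friday.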
by_date <- df %>%
group_by(dates) %>%
summarise(count = n())
library(xts) # create eXtensible Time Series (xts) data
by_date <- xts(by_date$count, order.by = by_date$dates)
str(by_date)
## An 'xts' object on 2021-01-01/2021-12-31 containing:
## Data: int [1:365, 1] 177 184 157 142 156 233 271 274 220 143 ...
## Indexed by objects of class: [Date] TZ: UTC
## xts Attributes:
## NULL
dygraph(by_date,
main = "<font size=5> Traffic Violations of Year 2021 </font> <br> <small>Montgomery County, MD</small>",
ylab = "Frequency") %>%
dyRangeSelector() %>%
dySeries("V1", label = "Frequency", color = "#6b705c") %>%
dyLegend(show = "follow") %>%
dyOptions( fillGraph = TRUE)
Figure 5: There are clearly fewer traffic violations in the second quarter of the year. This graph is interactive: as you hover over the line, the value for each day is displayed. On December 10, 2021, there were 460 traffic violations, the most of the year. You can zoom by adjusting the date range in the range selector at the bottom of the dygraph.
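The second-quarter dip can be checked against the monthly totals printed by `table(df$month_name)`. The sketch below sums them by quarter; the names `by_month` and `by_quarter` are illustrative.

```r
# Monthly totals copied from the table(df$month_name) output, in calendar order.
by_month <- c(January = 5237, February = 4842, March = 5817,
              April = 3413, May = 3724, June = 3661,
              July = 6044, August = 5471, September = 6233,
              October = 5355, November = 6223, December = 7677)

# Assign each month to its quarter and sum within quarters.
quarter <- rep(paste0("Q", 1:4), each = 3)
by_quarter <- tapply(by_month, quarter, sum)

by_quarter
```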
df <- df %>%
mutate(week = strftime(dates,"%W"))
df1 <- df %>%
count(month_name, week, weekday, date)
#df1
df1$week <- factor(df1$week, levels = rev(sort(unique(df$week))))
df1 %>%
ggplot(aes(x=weekday, y = week)) +
geom_tile(aes(fill = n), color = "#616161", lwd = 0.5) +
scale_fill_viridis(option ="magma", direction = -1) +
#theme_classic() +
theme_tufte(base_family="Helvetica") +
facet_wrap(~month_name, nrow = 3, scales = "free") +
geom_text(aes(label = date), color = "grey", size = 3) +
theme(axis.ticks.y = element_blank()) +
theme(axis.ticks.x = element_blank()) +
theme(axis.text.y = element_blank()) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y = element_blank()) +
scale_x_discrete(labels = c("Mo","Tu","We","Th","Fr","Sa","Su"), position = "top")+
theme(strip.placement = "outside") +
theme(strip.text.x = element_text(size = 10, hjust = 0)) +
ggtitle("Heatmap of Traffic Violations in Montgomery County, MD (2021)") +
theme(plot.title = element_text(family = "Arial", face = "bold", size = 16))
Figure 6: This is not an interactive chart, so we cannot read the exact frequency for each day, but the color shows at a glance the days with the most traffic violations. As in the time series chart, December 10, which has the highest number of traffic violations, has the darkest color. I googled December 10, 2021, and found that there was a severe thunderstorm that day. Perhaps it has something to do with the number of traffic violations.
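The rows of each calendar panel come from the week number computed with `strftime(dates, "%W")`. The sketch below shows how that format code behaves on a few 2021 dates; the vector `d` is only an example.

```r
# A few dates from 2021; January 1, 2021 was a Friday.
d <- as.Date(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-12-10"))

# "%W" numbers weeks starting from the first Monday of the year;
# days before that Monday fall in week "00".
week <- format(d, "%W")

# "%u" gives the weekday as a number, with Monday = 1 and Sunday = 7.
data.frame(date = d, weekday = format(d, "%u"), week = week)
```

Because `%W` starts each week on Monday, it lines up with the Monday-to-Sunday column order used in the heatmap.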
df %>%
group_by(Contributed.To.Accident) %>%
summarise(count = n()) %>%
mutate(ratio = round((count/sum(count)*100),1),
label = ratio) %>%
ggplot(aes (x = Contributed.To.Accident, y = count, fill = Contributed.To.Accident)) +
geom_col(color ="black" , width = 0.6) +
geom_text(aes(x= Contributed.To.Accident, y = 10000, label = count), size = 4, color = "#555555") +
geom_label(aes(label=paste(label,"%")), fill = "#FFF9F5", vjust = 0.5) +
geom_label(aes(x = 'False', y = 15000,
label = "General Traffic Violations:"),
hjust = 0.5,
vjust = 0.5,
lineheight = 0.8,
colour = "#555555",
fill ="#e3919d",
label.size = NA,
family="Helvetica",
size = 3.4) +
geom_label(aes(x = 'True', y = 15000,
label = "Violations Contributed to Accident:"),
hjust = 0.5,
vjust = 0.5,
lineheight = 0.8,
colour = "#555555",
fill = "transparent",
label.size = NA,
family="Helvetica",
size = 3.5) +
scale_fill_manual(values = c("#e3919d" , "#3878a4" )) +
labs( title ="Number of traffic violations Contributed to Accident", subtitle = "Montgomery County, MD, 2021",
caption = "https://data.montgomerycountymd.gov") +
theme_fivethirtyeight()
Figure 7: Only about 5% of all traffic violations contributed to a traffic accident. Still, 5 out of every 100 violations is not a small percentage. We will look at the causes of traffic accidents below.
why <- df %>% select(Description)
acci <- df %>% filter(Contributed.To.Accident == "True") %>% select(Description)
no_acci <- df %>% filter(Contributed.To.Accident == "False") %>% select(Description)
descrip <- df %>%
group_by(Description) %>%
summarise(Count = n()) %>%
arrange(-Count)
str(descrip)
## tibble [2,119 x 2] (S3: tbl_df/tbl/data.frame)
## $ Description: chr [1:2119] "DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS" "EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH" "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH" "FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER" ...
## $ Count : int [1:2119] 5865 3426 2883 2225 1852 1551 1426 1367 1208 1047 ...
| Description | Count |
|---|---|
| DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS | 5865 |
| EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH | 3426 |
| EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH | 2883 |
| FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER | 2225 |
| FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND | 1852 |
| DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION | 1551 |
| DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE | 1426 |
| NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON | 1367 |
| EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH | 1208 |
| FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION | 1047 |
Table 1: The top reason is 'DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS', which, according to Mark Bigger, covers any number of offenses, such as ignoring portable warning signs, failing to yield, stopping on the crosswalk, and many others. Next come many cases of speeding and registration violations.
| Description | Count |
|---|---|
| FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION | 529 |
| NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON | 271 |
| RECKLESS DRIVING VEHICLE IN WANTON AND WILLFUL DISREGARD FOR SAFETY OF PERSONS AND PROPERTY | 188 |
| DRIVING VEH. WHILE IMPAIRED BY ALCOHOL | 186 |
| DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL | 185 |
| DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS | 145 |
| DRIVER WHEN TURNING LEFT FAIL TO YIELD RIGHT OF WAY TO VEHICLE APPROACHING FROM OPPOSITE DIRECTION | 117 |
| FAILURE TO CONTROL VEH. SPEED ON HWY. TO AVOID COLLISION | 107 |
| DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL PER SE | 92 |
| DRIVER CHANGING LANES WHEN UNSAFE | 87 |
Table 2: The table above shows the most frequent traffic violations that contributed to accidents. A total of 463 (= 186 + 185 + 92) cases are related to alcohol. Taken together, accidents caused by drunk driving should therefore be ranked 2nd.
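The alcohol total can be reproduced by matching a keyword in the descriptions instead of adding rows by hand. The sketch below rebuilds Table 2 as a data frame; the keyword rules are an assumption for illustration, not an official grouping of charges.

```r
# Rows copied from Table 2 (violations that contributed to an accident).
acc <- data.frame(
  Description = c(
    "FAILURE TO CONTROL VEHICLE SPEED ON HIGHWAY TO AVOID COLLISION",
    "NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON",
    "RECKLESS DRIVING VEHICLE IN WANTON AND WILLFUL DISREGARD FOR SAFETY OF PERSONS AND PROPERTY",
    "DRIVING VEH. WHILE IMPAIRED BY ALCOHOL",
    "DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL",
    "DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS",
    "DRIVER WHEN TURNING LEFT FAIL TO YIELD RIGHT OF WAY TO VEHICLE APPROACHING FROM OPPOSITE DIRECTION",
    "FAILURE TO CONTROL VEH. SPEED ON HWY. TO AVOID COLLISION",
    "DRIVING VEHICLE WHILE UNDER THE INFLUENCE OF ALCOHOL PER SE",
    "DRIVER CHANGING LANES WHEN UNSAFE"
  ),
  Count = c(529, 271, 188, 186, 185, 145, 117, 107, 92, 87),
  stringsAsFactors = FALSE
)

# Hypothetical keyword rules: any description mentioning alcohol, and the
# two spellings of the "failure to control speed" charge.
is_alcohol <- grepl("ALCOHOL", acc$Description)
is_speed_control <- grepl("^FAILURE TO CONTROL VEH", acc$Description)

alcohol_total <- sum(acc$Count[is_alcohol])
speed_control_total <- sum(acc$Count[is_speed_control])

c(speed_control = speed_control_total, alcohol = alcohol_total)
```

Under these rules, the two spellings of the speed-control charge also merge into one group, which stays in first place ahead of the alcohol-related cases.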
| Description | Count |
|---|---|
| DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS | 5720 |
| EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH | 3426 |
| EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH | 2883 |
| FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER | 2196 |
| FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND | 1785 |
| DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION | 1542 |
| DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE | 1420 |
| EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH | 1208 |
| NEGLIGENT DRIVING VEHICLE IN CARELESS AND IMPRUDENT MANNER ENDANGERING PROPERTY, LIFE AND PERSON | 1096 |
| EXCEEDING THE POSTED SPEED LIMIT OF 45 MPH | 1019 |
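Table 3: Among violations that did not contribute to an accident, the single most frequent description is failure to obey a traffic control device, but the four speed-limit rows together account for 8,536 (= 3426 + 2883 + 1208 + 1019) cases, so speeding is the most common type of traffic violation. The table also reminds us that we must carry our registration card and driver's license when driving.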
I’ve always wanted to create a word cloud, so for fun we will build one from the text of the Description variable. To create a word cloud, we must first clean the text. We will use the ‘why’, ‘acci’, and ‘no_acci’ data frames, each of which contains only the Description variable. Here the text comes from a character variable in an existing data frame rather than from a text file extracted from a website such as Twitter.
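Before turning to the tm package, the idea behind the cleaning steps can be sketched in base R: lower-case the text, strip digits and punctuation, drop stop words and short words, and count what remains. The three sample descriptions and the tiny stop-word list are assumptions for illustration; the report itself relies on `stopwords("english")` from tm.

```r
# Hypothetical sample of Description values.
descriptions <- c(
  "EXCEEDING THE POSTED SPEED LIMIT OF 35 MPH",
  "EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH",
  "DRIVING VEH. WHILE IMPAIRED BY ALCOHOL"
)

# Lower-case, then replace digits and punctuation with a space.
clean <- tolower(descriptions)
clean <- gsub("[[:digit:][:punct:]]", " ", clean)

# Split on whitespace and drop empty tokens.
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[nchar(words) > 0]

# A tiny illustrative stop-word list; keep only words of 4+ characters.
stop_words <- c("the", "of", "by", "while")
words <- words[!words %in% stop_words & nchar(words) >= 4]

# Word frequencies, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
head(freq)
```

The resulting names and counts play the same role as the `word` and `freq` columns that are passed to `wordcloud()`.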
Preparing Data for Word Cloud Visualization
Step 1: load required libraries
library(wordcloud) # word-cloud generator
library(RColorBrewer) # color palettes
library(tm) # for text mining
## Calculate Corpus
why <- Corpus(VectorSource(why))
acci <- Corpus(VectorSource(acci))
no_acci <- Corpus(VectorSource(no_acci))
##Data Cleaning and Wrangling
why <- tm_map(why , removeNumbers) # Remove numbers
why <- tm_map(why , removePunctuation) # Remove punctuations
why <- tm_map(why , tolower) # Convert the text to lower case
why <- tm_map(why , removeWords, stopwords("english")) # Remove english common stopwords
acci <- tm_map(acci , removeNumbers)
acci <- tm_map(acci , removePunctuation)
acci <- tm_map(acci , tolower)
acci <- tm_map(acci , removeWords, stopwords("english"))
no_acci <- tm_map(no_acci, removeNumbers)
no_acci <- tm_map(no_acci, removePunctuation)
no_acci <- tm_map(no_acci, tolower)
no_acci <- tm_map(no_acci, removeWords, stopwords("english"))
# Remove custom stop words
acci <- tm_map(acci, removeWords, c("driving", "vehicle", "driver", "person", "posted", "failure"))
no_acci <- tm_map(no_acci, removeWords, c("driving", "vehicle", "driver", "person", "posted", "failure"))
why <- TermDocumentMatrix(why)
acci <- TermDocumentMatrix(acci)
no_acci <- TermDocumentMatrix(no_acci)
why <- as.matrix(why)
acci <- as.matrix(acci)
no_acci <- as.matrix(no_acci)
why <- sort(rowSums(why), decreasing = TRUE)
why <- data.frame(word = names(why), freq = why)
acci <- sort(rowSums(acci), decreasing = TRUE)
acci <- data.frame(word = names(acci), freq = acci)
no_acci <- sort(rowSums(no_acci), decreasing = TRUE)
no_acci <- data.frame(word = names(no_acci), freq = no_acci)
why <- filter(why, nchar(word) >= 4)
acci <- filter(acci, nchar(word) >= 4)
no_acci <- filter(no_acci, nchar(word) >= 4)
pal1 <- brewer.pal(8,"Dark2")
pal2 <- brewer.pal(8, "Spectral")
pal3 <- brewer.pal(8, "Accent")
wordcloud(words = acci$word,
freq = acci$freq,
min.freq = 1,
max.words = 200,
random.order= FALSE,
rot.per= 0.3, # Texts rotation ratio
colors = pal1)
Figure 8: We can guess the causes of traffic accidents from the word cloud. Many crashes appear to result from failure to control speed on highways or from DUI.
cloud2 <- wordcloud(words = no_acci$word,
freq = no_acci$freq,
min.freq = 1,
max.words= 200,
random.order=FALSE,
rot.per=0.3,
colors= pal1)
Figure 9: The most frequent general violations are speed limit violations.
In 2021, there were 63,697 traffic violations. The data provides the exact location of each violation as longitude and latitude. At first, I wanted to do a GIS analysis by city or zip code, but I could not join this data frame to a map shapefile because the dataset has no city or zip code for the location of each violation. We could add about 63,000 markers to a map of Montgomery County, but we won’t.
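One lightweight alternative to plotting every point is to count violations per grid cell using only `Latitude` and `Longitude`. The sketch below uses hypothetical coordinates and an arbitrary 0.01-degree cell size; the `(0, 0)` row mimics the zero coordinates visible in the `str(df)` output, which are treated here as missing locations (an assumption).

```r
# Hypothetical coordinates; the (0, 0) row mimics a record with no location.
pts <- data.frame(
  Latitude  = c(39.160035, 39.161200, 39.285292, 0, 39.285292),
  Longitude = c(-77.215582, -77.217000, -77.209008, 0, -77.209008)
)

# Drop records whose coordinates are recorded as 0.
pts <- pts[pts$Latitude != 0 & pts$Longitude != 0, ]

# Snap each point to a 0.01-degree grid cell.
pts$lat_cell <- round(pts$Latitude, 2)
pts$lon_cell <- round(pts$Longitude, 2)

# Count violations per grid cell.
pts$n <- 1
cells <- aggregate(n ~ lat_cell + lon_cell, data = pts, FUN = sum)
cells
```

A table like `cells` could be drawn with `geom_tile()` or joined to a map layer later, without needing a city or zip code column.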
The Montgomery County Department of Police (MCPD) is divided into six districts. Source: https://www.montgomerycountymd.gov/pol/districts.html
df_sub <- df %>%
arrange(SubAgency)
df %>%
group_by(SubAgency) %>%
summarize(count = n()) %>%
mutate(label = count) %>%
ggplot( aes(x = SubAgency, y = count, fill = count)) +
geom_col(color = "black") +
scale_fill_gradientn(colours = pal) +
geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
theme_fivethirtyeight() +
labs( title = "Frequency of Traffic Violations by District", subtitle = "Montgomery County, MD, 2021",
caption = "https://data.montgomerycountymd.gov") +
theme(axis.title = element_text()) +
ylab('Frequency') +
xlab('District') +
scale_x_discrete(labels = (c('1st District\nRockville','2nd District\nBethesda','3rd District\nSilver Spring', '4th District\nWheaton','5th District\nGermantown','6th District\nGaithersburg','Headquarters'))) +
theme(legend.position = "none") +
coord_flip()
Figure 10: The number of violations by the district to which the officer who recorded the violation is assigned. Headquarters has the highest number of violations, and Rockville has the lowest. However, the actual location of a violation and the officer's district may differ. To see exactly where the approximately 17,000 violations recorded by headquarters officers occurred, we would need to look at the Location, Longitude, and Latitude variables. For example, although Rockville has the smallest number of violations, many of the violations recorded by headquarters may have occurred in Rockville.
loca_df <- df %>%
group_by(Location) %>%
summarise(count = n()) %>%
arrange(-count)
#loca_df
loca_df %>%
filter(count > 130) %>%
mutate(label = count) %>%
ggplot(aes(x = reorder(Location, count), y = count, fill= count)) +
geom_col(color = "black") +
scale_fill_gradientn(colours = pal) +
geom_label(aes(label= label), fill = "#FFF9F5", vjust = 0.5) +
theme_fivethirtyeight() +
labs( title = "Top 10 Locations of Traffic Violations", subtitle = "Montgomery County, MD, 2021",
caption = "https://data.montgomerycountymd.gov") +
theme(axis.title = element_text()) +
ylab('Frequency') +
xlab('Location') +
theme(legend.position = "none") +
coord_flip()
Figure 11: The 1st-ranked "BARNSVILLE & OLD HUNDRED", 2nd-ranked "21910 BEALLSVILLE RD", and 6th-ranked "BARNSVILLE & BEALLSVILLE" lie within 0.6 miles of one another and together account for 526 (= 200 + 180 + 146) violations. We should avoid or take care when passing near Barnesville and Beallsville, since many police officers may be working hard in that area.
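A distance such as the 0.6 miles mentioned above can be checked from latitude and longitude with the haversine formula. The function below is a sketch; `haversine_miles` is a name introduced here, the Earth radius of 3958.8 miles is the usual mean value, and the example coordinates are hypothetical rather than taken from the dataset.

```r
# Great-circle distance in miles between two (lat, lon) points,
# using the haversine formula.
haversine_miles <- function(lat1, lon1, lat2, lon2, radius = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# Hypothetical coordinates for two nearby stops (not from the dataset).
haversine_miles(39.2200, -77.3800, 39.2250, -77.3850)
```

With the real data, the coordinates of the three locations would come from the `Latitude` and `Longitude` columns of the matching rows.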
So far, we have analyzed the 2021 traffic violations in Montgomery County: who violated which traffic laws, when, where, and why. According to a student who presented this topic in a Capstone205 class and inspired me, the number of traffic violations has decreased drastically since the outbreak of COVID-19. Perhaps traffic volume itself has fallen sharply, so the number of violations has fallen too. People’s lives have changed a lot since the outbreak of COVID-19. However, since COVID-19 is not over yet, we can assume that a similar pattern will continue in the future. The number of traffic violations has decreased, but I guess the reasons for the violations will stay the same. We must carry our driver’s license and vehicle registration card when driving, obey the speed limit, never drink and drive, and drive more safely on the highway. Next time I get a chance, I’d like to analyze all traffic violations over the past 10 years. After improving my skills, I hope to create a map-based heatmap for GIS analysis and present it to the Capstone205 class.
Thank you. The End.