#Introduction:

Research Question: How do speeding-related traffic violations differ among cities in Montgomery County, and which cities experience the highest number of these offenses? Additionally, how do speeding-related traffic violations vary by gender across Montgomery County cities?

The data for my project comes from the Traffic Violations Dataset provided by the Montgomery County, Maryland Open Data Portal (Link: https://data.montgomerycountymd.gov/). This dataset contains 155,502 observations of police traffic-stop and violation records from Montgomery County, MD from 2012-2025. Each row represents a single violation recorded during a traffic stop, so one stop can appear in more than one row (the same SeqID is repeated in the output below). The 43 variables include the date and time of the stop, location, description of the violation, driver information fields, and agency details.

The purpose of this project is to identify speed-related violations by searching for a keyword in the description variable and to compare the number of speeding violations across different cities. This will allow me to see which cities have the highest numbers of speeding offenses and which gender accounts for more speeding violations.

Variables Selected:

description: Description of the traffic violation(character/string)

driver city: The city of the driver’s home address (categorical)

gender: Indicates the driver’s gender, Male or Female (categorical)

#Data Analysis:

To answer my research questions, I first loaded the data set and inspected it with basic EDA functions such as str() and head(). My next step was to clean the variable names: I replaced the spaces between words with underscores using gsub() and converted uppercase letters to lowercase using tolower(). I also replaced blank cells with NA and used mutate() and na_if() to recode unknown gender values, listed as “U”, as NA.

Once the data was clean, I created a new data set named “speeding_violations”. I used select() to keep the three variables I am analyzing (description, driver_city and gender), then used filter() with str_detect() to keep only the rows whose description contains the word “EXCEEDING”, which identifies the speed-related violations. The str_detect() function comes from the stringr package and reports whether a pattern appears in each string.

To summarize the data, I created a final data set named “city_counts”, which groups the violations by driver_city and adds a new column named “speeding_count” holding the total number of speeding violations for each city. I used group_by() and summarise() for the totals and arrange() to sort them in descending order. I also used summary() to see which gender has more speeding violations. Finally, I created two visualizations: a bar plot comparing the number of speeding violations by gender, and a horizontal bar chart of the ten cities with the most speeding violations.
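As a small illustration of the keyword filter described above, the sketch below runs str_detect() on a few descriptions. The first two strings are copied from the output in this report; the lowercase string and the NA are made-up inputs, and the names `descriptions`, `is_speeding` and `is_speeding_any_case` are only for illustration.

```r
library(stringr)

descriptions <- c(
  "EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH",
  "DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION",
  "exceeding the posted speed limit of 30 mph",  # made-up lowercase variant
  NA                                             # made-up missing description
)

# TRUE where the pattern occurs; matching is case-sensitive and NA stays NA
is_speeding <- str_detect(descriptions, "EXCEEDING")

# A case-insensitive variant of the same search
is_speeding_any_case <- str_detect(descriptions, regex("exceeding", ignore_case = TRUE))
```

Because filter() drops rows where the condition is NA, violations with a missing description would not be counted by this approach.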

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)

traffic_violations <- read_csv("traffic_violations.csv")
## Rows: 155502 Columns: 43
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (38): SeqID, Date Of Stop, Agency, SubAgency, Description, Location, Ac...
## dbl   (3): Latitude, Longitude, Year
## lgl   (1): Contributed To Accident
## time  (1): Time Of Stop
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Checking the structure and head of the data set
str(traffic_violations)
## spc_tbl_ [155,502 × 43] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SeqID                  : chr [1:155502] "3970dc8a-a84f-4049-b22a-549584f0f1e7" "9a239115-0094-4865-a0d5-73ba03499aef" "3770d2a9-be9e-425d-8deb-6168185127de" "3770d2a9-be9e-425d-8deb-6168185127de" ...
##  $ Date Of Stop           : chr [1:155502] "10/08/2025" "10/08/2025" "10/08/2025" "10/08/2025" ...
##  $ Time Of Stop           : 'hms' num [1:155502] 22:42:00 22:40:00 22:28:00 22:28:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Agency                 : chr [1:155502] "MCP" "MCP" "MCP" "MCP" ...
##  $ SubAgency              : chr [1:155502] "Headquarters and Special Operations" "5th District, Germantown" "3rd District, Silver Spring" "3rd District, Silver Spring" ...
##  $ Description            : chr [1:155502] "FAILURE OF VEH. ON HWY. TO DISPLAY LIGHTED LAMPS, ILLUMINATING DEVICE IN UNFAVORABLE VISIBILITY COND" "DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION" "FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND" "OPERATING VEHICLE ON HIGHWAY WITH UNAUTHORIZED WINDOW TINTING MATERIAL" ...
##  $ Location               : chr [1:155502] "WASHINGTONIAN BLVD @ GRAND CORNER AVE" "CONNECTICUT AVE/FAIRLY" "GEORGIA AVE / HEWITT AVE" "GEORGIA AVE / HEWITT AVE" ...
##  $ Latitude               : num [1:155502] 39.1 39.1 39.1 39.1 39 ...
##  $ Longitude              : num [1:155502] -77.2 -77.1 -77.1 -77.1 -77.1 ...
##  $ Accident               : chr [1:155502] "No" "No" "No" "No" ...
##  $ Belts                  : chr [1:155502] "No" "No" "No" "No" ...
##  $ Personal Injury        : chr [1:155502] "No" "No" "No" "No" ...
##  $ Property Damage        : chr [1:155502] "No" "No" "No" "No" ...
##  $ Fatal                  : chr [1:155502] "No" "No" "No" "No" ...
##  $ Commercial License     : chr [1:155502] "No" "No" "No" "No" ...
##  $ HAZMAT                 : chr [1:155502] "No" "No" "No" "No" ...
##  $ Commercial Vehicle     : chr [1:155502] "No" "No" "No" "No" ...
##  $ Alcohol                : chr [1:155502] "No" "No" "No" "No" ...
##  $ Work Zone              : chr [1:155502] "No" "No" "No" "No" ...
##  $ Search Conducted       : chr [1:155502] "No" "No" NA NA ...
##  $ Search Disposition     : chr [1:155502] NA NA NA NA ...
##  $ Search Outcome         : chr [1:155502] "Warning" "Warning" NA NA ...
##  $ Search Reason          : chr [1:155502] NA NA NA NA ...
##  $ Search Reason For Stop : chr [1:155502] "22-201.1" "13-401(h)" NA NA ...
##  $ Search Type            : chr [1:155502] NA NA NA NA ...
##  $ Search Arrest Reason   : chr [1:155502] NA NA NA NA ...
##  $ State                  : chr [1:155502] "MD" "MD" "MD" "MD" ...
##  $ VehicleType            : chr [1:155502] "02 - Automobile" "02 - Automobile" "02 - Automobile" "02 - Automobile" ...
##  $ Year                   : num [1:155502] 2022 2018 2024 2024 2002 ...
##  $ Make                   : chr [1:155502] "TOYOTA" "CHRYS" "TOYOTA" "TOYOTA" ...
##  $ Model                  : chr [1:155502] "COROLLA CROSS" "300" "CAMRY" "CAMRY" ...
##  $ Color                  : chr [1:155502] "BLACK" "BLACK" "BLACK" "BLACK" ...
##  $ Violation Type         : chr [1:155502] "Warning" "Warning" "Warning" "Warning" ...
##  $ Charge                 : chr [1:155502] "22-201.1" "13-401(h)" "16-112(c)" "22-406(i1)" ...
##  $ Article                : chr [1:155502] "Transportation Article" "Transportation Article" "Transportation Article" "Transportation Article" ...
##  $ Contributed To Accident: logi [1:155502] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Race                   : chr [1:155502] "HISPANIC" "BLACK" "WHITE" "WHITE" ...
##  $ Gender                 : chr [1:155502] "F" "M" "M" "M" ...
##  $ Driver City            : chr [1:155502] "GAITHERSBURG" "SILVER SPRING" "HYATTSVILLE" "HYATTSVILLE" ...
##  $ Driver State           : chr [1:155502] "MD" "MD" "MD" "MD" ...
##  $ DL State               : chr [1:155502] "MD" "MD" "MD" "MD" ...
##  $ Arrest Type            : chr [1:155502] "B - Unmarked Patrol" "A - Marked Patrol" "A - Marked Patrol" "A - Marked Patrol" ...
##  $ Geolocation            : chr [1:155502] "(39.118045, -77.2023283333333)" "(39.0596898333333, -77.0734611666667)" "(39.0755051666667, -77.0686486666667)" "(39.0755051666667, -77.0686486666667)" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SeqID = col_character(),
##   ..   `Date Of Stop` = col_character(),
##   ..   `Time Of Stop` = col_time(format = ""),
##   ..   Agency = col_character(),
##   ..   SubAgency = col_character(),
##   ..   Description = col_character(),
##   ..   Location = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double(),
##   ..   Accident = col_character(),
##   ..   Belts = col_character(),
##   ..   `Personal Injury` = col_character(),
##   ..   `Property Damage` = col_character(),
##   ..   Fatal = col_character(),
##   ..   `Commercial License` = col_character(),
##   ..   HAZMAT = col_character(),
##   ..   `Commercial Vehicle` = col_character(),
##   ..   Alcohol = col_character(),
##   ..   `Work Zone` = col_character(),
##   ..   `Search Conducted` = col_character(),
##   ..   `Search Disposition` = col_character(),
##   ..   `Search Outcome` = col_character(),
##   ..   `Search Reason` = col_character(),
##   ..   `Search Reason For Stop` = col_character(),
##   ..   `Search Type` = col_character(),
##   ..   `Search Arrest Reason` = col_character(),
##   ..   State = col_character(),
##   ..   VehicleType = col_character(),
##   ..   Year = col_double(),
##   ..   Make = col_character(),
##   ..   Model = col_character(),
##   ..   Color = col_character(),
##   ..   `Violation Type` = col_character(),
##   ..   Charge = col_character(),
##   ..   Article = col_character(),
##   ..   `Contributed To Accident` = col_logical(),
##   ..   Race = col_character(),
##   ..   Gender = col_character(),
##   ..   `Driver City` = col_character(),
##   ..   `Driver State` = col_character(),
##   ..   `DL State` = col_character(),
##   ..   `Arrest Type` = col_character(),
##   ..   Geolocation = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(traffic_violations)
## # A tibble: 6 × 43
##   SeqID      `Date Of Stop` `Time Of Stop` Agency SubAgency Description Location
##   <chr>      <chr>          <time>         <chr>  <chr>     <chr>       <chr>   
## 1 3970dc8a-… 10/08/2025     22:42          MCP    Headquar… FAILURE OF… WASHING…
## 2 9a239115-… 10/08/2025     22:40          MCP    5th Dist… DRIVING VE… CONNECT…
## 3 3770d2a9-… 10/08/2025     22:28          MCP    3rd Dist… FAILURE OF… GEORGIA…
## 4 3770d2a9-… 10/08/2025     22:28          MCP    3rd Dist… OPERATING … GEORGIA…
## 5 f99e1168-… 10/08/2025     22:25          MCP    4th Dist… RECKLESS D… 10901 G…
## 6 f99e1168-… 10/08/2025     22:25          MCP    4th Dist… DRIVING VE… 10901 G…
## # ℹ 36 more variables: Latitude <dbl>, Longitude <dbl>, Accident <chr>,
## #   Belts <chr>, `Personal Injury` <chr>, `Property Damage` <chr>, Fatal <chr>,
## #   `Commercial License` <chr>, HAZMAT <chr>, `Commercial Vehicle` <chr>,
## #   Alcohol <chr>, `Work Zone` <chr>, `Search Conducted` <chr>,
## #   `Search Disposition` <chr>, `Search Outcome` <chr>, `Search Reason` <chr>,
## #   `Search Reason For Stop` <chr>, `Search Type` <chr>,
## #   `Search Arrest Reason` <chr>, State <chr>, VehicleType <chr>, Year <dbl>, …
#Cleaning variable names to replace spaces(" ") with underscores & putting them in lowercase
names(traffic_violations) <- gsub(" ", "_", names(traffic_violations)) 
names(traffic_violations) <- tolower(names(traffic_violations))   

#Replacing blank cells and "U" with NA's
traffic_violations[traffic_violations == ""] <- NA
traffic_violations <- traffic_violations |>
  mutate(gender = na_if(gender, "U"))
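
The cleaning steps above can be checked on a tiny example. This is only a sketch: the tibble `toy` and its values are made up, with column names that mimic the raw file.

```r
library(dplyr)

# Toy data with made-up values; the column names mimic the raw file
toy <- tibble(
  `Driver City` = c("ROCKVILLE", "GAITHERSBURG", "BETHESDA"),
  Gender        = c("M", "U", "F")
)

# Same renaming steps as above: spaces become underscores, then lowercase
names(toy) <- tolower(gsub(" ", "_", names(toy)))

# na_if() turns every "U" into NA and leaves the other values alone
toy <- toy |>
  mutate(gender = na_if(gender, "U"))
```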


head(traffic_violations) 
## # A tibble: 6 × 43
##   seqid date_of_stop time_of_stop agency subagency description location latitude
##   <chr> <chr>        <time>       <chr>  <chr>     <chr>       <chr>       <dbl>
## 1 3970… 10/08/2025   22:42        MCP    Headquar… FAILURE OF… WASHING…     39.1
## 2 9a23… 10/08/2025   22:40        MCP    5th Dist… DRIVING VE… CONNECT…     39.1
## 3 3770… 10/08/2025   22:28        MCP    3rd Dist… FAILURE OF… GEORGIA…     39.1
## 4 3770… 10/08/2025   22:28        MCP    3rd Dist… OPERATING … GEORGIA…     39.1
## 5 f99e… 10/08/2025   22:25        MCP    4th Dist… RECKLESS D… 10901 G…     39.0
## 6 f99e… 10/08/2025   22:25        MCP    4th Dist… DRIVING VE… 10901 G…     39.0
## # ℹ 35 more variables: longitude <dbl>, accident <chr>, belts <chr>,
## #   personal_injury <chr>, property_damage <chr>, fatal <chr>,
## #   commercial_license <chr>, hazmat <chr>, commercial_vehicle <chr>,
## #   alcohol <chr>, work_zone <chr>, search_conducted <chr>,
## #   search_disposition <chr>, search_outcome <chr>, search_reason <chr>,
## #   search_reason_for_stop <chr>, search_type <chr>,
## #   search_arrest_reason <chr>, state <chr>, vehicletype <chr>, year <dbl>, …
#Creating a new dataset with the three variables of interest and keeping only the
#rows whose description contains "EXCEEDING" (the speed-related violations)
speeding_violations <- traffic_violations |>
  select(description, driver_city, gender) |>
  filter(str_detect(description, "EXCEEDING"))

head(speeding_violations)
## # A tibble: 6 × 3
##   description                                                 driver_city gender
##   <chr>                                                       <chr>       <chr> 
## 1 RECKLESS DRIVING VEH. BY EXCEEDING POSTED SPEED LIMIT BY 3… ROCKVILLE   M     
## 2 EXCEEDING POSTED MAXIMUM SPEED LIMIT: 70 MPH IN A POSTED 3… ROCKVILLE   M     
## 3 EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH                  ROCKVILLE   M     
## 4 EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH                  GAITHERSBU… F     
## 5 EXCEEDING POSTED MAXIMUM SPEED LIMIT: 85 MPH IN A POSTED 5… POINT OF R… M     
## 6 EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH                  GAITHERSBU… M
city_counts <- speeding_violations |>
  group_by(driver_city) |>
  summarise(speeding_count = n()) |>
  arrange(desc(speeding_count))


summary(as.factor(speeding_violations$gender))
##     F     M  NA's 
## 10711 17519    13
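
To put the two gender counts on a common scale, the sketch below turns them into percentages. The counts are copied from the summary() output above; `gender_counts`, `gender_share` and `gap` are illustrative names. These are shares of recorded speeding violations, not per-driver rates, since the data does not say how many male and female drivers are on the road.

```r
# Counts copied from the summary() output above (the 13 NA rows are left out)
gender_counts <- c(F = 10711, M = 17519)

# Share of speeding violations with a recorded gender, in percent
gender_share <- round(100 * prop.table(gender_counts), 1)
gender_share

# Absolute gap between the two counts
gap <- unname(gender_counts["M"] - gender_counts["F"])
gap
```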
city_counts
## # A tibble: 888 × 2
##    driver_city        speeding_count
##    <chr>                       <int>
##  1 GAITHERSBURG                 4449
##  2 SILVER SPRING                4274
##  3 GERMANTOWN                   3353
##  4 ROCKVILLE                    2244
##  5 MONTGOMERY VILLAGE            895
##  6 POTOMAC                       773
##  7 BETHESDA                      763
##  8 NORTH POTOMAC                 726
##  9 FREDERICK                     647
## 10 CLARKSBURG                    633
## # ℹ 878 more rows
barplot(table(speeding_violations$gender),
        main = "Speeding Violations by Gender",
        col = c("lightpink", "lightblue"))

ggplot(city_counts |> slice_head(n = 10),
       aes(x = speeding_count, y = reorder(driver_city, speeding_count), fill = driver_city)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Top 10 Cities with the Highest Speeding Violations",
    x = "Number of Speeding Violations",
    y = "City"
  ) +
  theme_minimal()
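
The research question also asks how speeding violations vary by gender across cities. One possible way to get that breakdown is to count by both variables at once. The sketch below is not part of the analysis above: it uses a made-up stand-in called `toy_speeding` instead of the real speeding_violations data, and `city_gender_counts` and `share` are illustrative names.

```r
library(dplyr)

# Toy stand-in for speeding_violations; these rows are made up
toy_speeding <- tibble(
  driver_city = c("ROCKVILLE", "ROCKVILLE", "ROCKVILLE", "GAITHERSBURG", "GAITHERSBURG"),
  gender      = c("M", "M", "F", "F", NA)
)

# Count violations per city and gender, then add each gender's share within its city
city_gender_counts <- toy_speeding |>
  filter(!is.na(gender)) |>
  count(driver_city, gender, name = "speeding_count") |>
  group_by(driver_city) |>
  mutate(share = speeding_count / sum(speeding_count)) |>
  ungroup()

city_gender_counts
```

Swapping `toy_speeding` for speeding_violations would give one row per city and gender, which could then be filtered to the top ten cities.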

#Conclusion:

My analysis of the Traffic Violations data set showed clear differences in speed-related violations across cities and between genders. Among speeding violations with a recorded gender, 17,519 involved male drivers and 10,711 involved female drivers, a difference of 6,808. This suggests that men may commit speed-related offenses more often overall, although these are raw counts rather than rates. When analyzing the speed-related cases by city, I found that Gaithersburg had the most speeding violations with 4,449, followed by Silver Spring (4,274) and Germantown (3,353).

These findings answer my research questions by showing which cities have the most speeding violations and which gender is more frequently involved. The results suggest that certain cities may need more attention when it comes to traffic safety and may benefit from stronger traffic enforcement. I believe understanding these trends can help local officials and authorities decide where to increase patrols.

For future research, I would explore the time and date variables from recent years to see at what time of day speeding violations occur most often. I would also look at specific days and weeks of the year, such as those around holidays, to see whether they affect the number of speeding violations reported.
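
As a rough sketch of the time-of-day idea, the hour of each stop could be derived from time_of_stop, which the str() output shows is stored as a number of seconds. The example below was not run on the real data: `toy_stops` holds made-up times, and `hourly_counts` and `hour_of_day` are illustrative names.

```r
library(dplyr)

# Toy stand-in; times are seconds since midnight, which is how the
# time_of_stop column is stored (the values here are made up)
toy_stops <- tibble(
  time_of_stop = c(22 * 3600 + 42 * 60, 22 * 3600 + 28 * 60, 8 * 3600 + 5 * 60)
)

# Integer-divide by 3600 to get the hour, then count violations per hour
hourly_counts <- toy_stops |>
  mutate(hour_of_day = as.numeric(time_of_stop) %/% 3600) |>
  count(hour_of_day, name = "speeding_count")

hourly_counts
```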

REFERENCES: https://data.montgomerycountymd.gov/Public-Safety/Traffic-Violations/4mse-ku6q/about_data

https://cran.r-project.org/web/packages/stringr/readme/README.html

https://www.rdocumentation.org/packages/stringr/versions/1.5.1/topics/str_detect

As well as past classwork, homework and my notes on OneNote