2024 Collisions in Montgomery County MD

Author

Bryan Argueta

Introduction

My data set gives information about each collision and details of all traffic collisions occurring on county and local roadways within Montgomery County. I plan to explore the amount of collisions that occur within each month of the year and what agency reports the collision. I want to find if any month has a disproportionate number of collisions and potential reasons, such as holidays. I also want to find out if there’s any correlation between collisions and hours.

This data set was collected using the Automated Crash Reporting System of the Maryland State Police and provided to the public by Montgomery County, MD

(https://data.montgomerycountymd.gov/Public-Safety/Crash-Reporting-Incidents-Data/bhju-22kf/about_data).

Variables I will use:

Agency Name: The name of agency that reported/investigated the collision
Crash Month/Time: The date and time that the collision occurred in month, day, year, hour, minute, seconds format.

Load the libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(lubridate)

Load the data

setwd("/Users/bryana/Documents/Data110/Datasets")
crashes <- read_csv("crashIncidentsData2024.csv")

Rows: 97458 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (38): Report Number, Local Case Number, Agency Name, ACRS Report Type, C...
dbl  (6): Mile Point, Lane Number, Number of Lanes, Distance, Latitude, Long...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(crashes)

# A tibble: 6 × 44
  `Report Number` `Local Case Number` `Agency Name`           `ACRS Report Type`
  <chr>           <chr>               <chr>                   <chr>             
1 MCP2686006F     230031339           Montgomery County Poli… Property Damage C…
2 MCP30580053     230064814           Montgomery County Poli… Property Damage C…
3 MCP2760004K     230071388           Montgomery County Poli… Property Damage C…
4 MCP3230004G     230031335           Montgomery County Poli… Property Damage C…
5 MCP12600013     230031067           Montgomery County Poli… Injury Crash      
6 DD55750030      230031365           Rockville Police Depar… Property Damage C…
# ℹ 40 more variables: `Crash Date/Time` <chr>, `Hit/Run` <chr>,
#   `Route Type` <chr>, `Mile Point` <dbl>, `Mile Point Direction` <chr>,
#   `Lane Direction` <chr>, `Lane Number` <dbl>, `Lane Type` <chr>,
#   `Number of Lanes` <dbl>, Direction <chr>, Distance <dbl>,
#   `Distance Unit` <chr>, `Road Grade` <chr>, NonTraffic <chr>,
#   `Road Name` <chr>, `Cross-Street Type` <chr>, `Cross-Street Name` <chr>,
#   `Off-Road Description` <chr>, Municipality <chr>, …

Clean the data:

Lowercase and remove the spaces and slashes in the header names

names(crashes) <- tolower(names(crashes))
names(crashes) <- gsub(" ","",names(crashes))
names(crashes) <- gsub("/","",names(crashes))
head(crashes)

# A tibble: 6 × 44
  reportnumber localcasenumber agencyname    acrsreporttype crashdatetime hitrun
  <chr>        <chr>           <chr>         <chr>          <chr>         <chr> 
1 MCP2686006F  230031339       Montgomery C… Property Dama… 06/30/2023 1… No    
2 MCP30580053  230064814       Montgomery C… Property Dama… 11/06/2023 1… Yes   
3 MCP2760004K  230071388       Montgomery C… Property Dama… 12/12/2023 0… Yes   
4 MCP3230004G  230031335       Montgomery C… Property Dama… 06/30/2023 0… No    
5 MCP12600013  230031067       Montgomery C… Injury Crash   06/29/2023 1… No    
6 DD55750030   230031365       Rockville Po… Property Dama… 07/01/2023 0… Yes   
# ℹ 38 more variables: routetype <chr>, milepoint <dbl>,
#   milepointdirection <chr>, lanedirection <chr>, lanenumber <dbl>,
#   lanetype <chr>, numberoflanes <dbl>, direction <chr>, distance <dbl>,
#   distanceunit <chr>, roadgrade <chr>, nontraffic <chr>, roadname <chr>,
#   `cross-streettype` <chr>, `cross-streetname` <chr>,
#   `off-roaddescription` <chr>, municipality <chr>,
#   `relatednon-motorist` <chr>, atfault <chr>, collisiontype <chr>, …

Selecting only some headers

I decided that I’m going to focus on the date/time of each collision and what agency reported it

crashes2 <- crashes |>
  select(crashdatetime, agencyname) |>
  group_by(crashdatetime, agencyname)
head(crashes2)

# A tibble: 6 × 2
# Groups:   crashdatetime, agencyname [6]
  crashdatetime          agencyname               
  <chr>                  <chr>                    
1 06/30/2023 10:55:00 PM Montgomery County Police 
2 11/06/2023 11:20:00 PM Montgomery County Police 
3 12/12/2023 08:00:00 PM Montgomery County Police 
4 06/30/2023 08:00:00 PM Montgomery County Police 
5 06/29/2023 11:53:00 AM Montgomery County Police 
6 07/01/2023 01:48:00 AM Rockville Police Departme

Linear regression analysis:

I want to check if the number of collisions increase with the hour of the day increasing.

Converting the date/time format to only hours in separate column

crashes2$hour <- hour(as.POSIXct(crashes2$crashdatetime, format = "%m/%d/%Y %I:%M:%S %p"))

Counting the number of collisions for each hour

hourly_crashes <- crashes2 %>%
  group_by(hour) %>%
  summarise(total_collisions = n())

head(hourly_crashes)

# A tibble: 6 × 2
   hour total_collisions
  <int>            <int>
1     0             1815
2     1             1509
3     2             1353
4     3             1246
5     4             1083
6     5             1561

For some reason the hour begins at 0 and ends at 23 so I will add 1 to each hour to fix that.

hourly_crashes$hour <- hourly_crashes$hour + 1

head(hourly_crashes)

# A tibble: 6 × 2
   hour total_collisions
  <dbl>            <int>
1     1             1815
2     2             1509
3     3             1353
4     4             1246
5     5             1083
6     6             1561

Linear regression model using hours of the day to predict the number of collisions

lm_model <- lm(total_collisions ~ hour, data = hourly_crashes)

summary(lm_model)


Call:
lm(formula = total_collisions ~ hour, data = hourly_crashes)

Residuals:
    Min      1Q  Median      3Q     Max 
-3487.5 -1563.3   109.3  1434.6  2520.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  2281.65     777.00   2.936  0.00764 **
hour          142.33      54.38   2.617  0.01573 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1844 on 22 degrees of freedom
Multiple R-squared:  0.2374,    Adjusted R-squared:  0.2028 
F-statistic:  6.85 on 1 and 22 DF,  p-value: 0.01573

plot(lm_model)

The equation for my model is: Total Collisions = B0 + B1 * Hour + e.

Based on an adjusted R^2 value of 0.2028 it suggests that the relationship between hour of day and total collision only accounts for 20.28% of the variation in the data.

Based on a p-value of 0.01573 and using a significance level of 0.05, I can conclude that there is a significant linear relationship between the hour of the day and the total number of collisions.

Calculate the number of monthly crashes

Converting the date/time format to only months in a separate column

crashes2$month <- month(as.POSIXct(crashes2$crashdatetime, format = "%m/%d/%Y %I:%M:%S %p"))

Calculate the total crashes per month by each agency

monthly_crashes <- crashes2 %>%
  group_by(month, agencyname) %>%
  summarize(total_crashes = n())

`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.

Change month format from numbers to words

# Change the months from numbers to words
monthly_crashes$month[monthly_crashes$month == 1]<- "January"
monthly_crashes$month[monthly_crashes$month == 2]<- "February"
monthly_crashes$month[monthly_crashes$month == 3]<- "March"
monthly_crashes$month[monthly_crashes$month == 4]<- "April"
monthly_crashes$month[monthly_crashes$month == 5]<- "May"
monthly_crashes$month[monthly_crashes$month == 6]<- "June"
monthly_crashes$month[monthly_crashes$month == 7]<- "July"
monthly_crashes$month[monthly_crashes$month == 8]<- "August"
monthly_crashes$month[monthly_crashes$month == 9]<- "September"
monthly_crashes$month[monthly_crashes$month == 10]<- "October"
monthly_crashes$month[monthly_crashes$month == 11]<- "November"
monthly_crashes$month[monthly_crashes$month == 12]<- "December"

# Reorder the months 
monthly_crashes$month<-factor(monthly_crashes$month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

Each agency seems to have two separate names

unique_agencies <- unique(monthly_crashes$agencyname)
print(unique_agencies)

 [1] "GAITHERSBURG"              "Gaithersburg Police Depar"
 [3] "MONTGOMERY"                "Maryland-National Capital"
 [5] "Montgomery County Police"  "Rockville Police Departme"
 [7] "Takoma Park Police Depart" "TAKOMA"                   
 [9] "MCPARK"                    "ROCKVILLE"

Change the two separate names of each agency into one

monthly_crashes2 <- monthly_crashes %>%
  mutate(agencyname = case_when(
    agencyname == "MONTGOMERY" ~ "Montgomery County Police",
    agencyname == "ROCKVILLE" ~ "Rockville Police Departme",
    agencyname == "MCPARK" ~ "Maryland-National Capital",
    agencyname == "TAKOMA" ~ "Takoma Park Police Depart",
    agencyname == "GAITHERSBURG" ~ "Gaithersburg Police Depar",
    TRUE ~ agencyname
  ))

Remove any NA’s in found in month

monthly_crashes_nona <- monthly_crashes2 %>%
  filter(!is.na(month))

Bar Graph

ggplot(monthly_crashes_nona, aes(month, total_crashes, fill = agencyname)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  scale_fill_brewer(palette = "Dark2", name = "Police Agencies", labels = c("Gaithersburg Police", "MD National Capital Park Police", "Montgomery County Police", "Rockville Police", "Takoma Park Police")) + 
  labs(x = "Months", y = "Total Collisions", title = "Total Collisions per Month in 2024",
       caption = "Data Provided By: Montgomery County, MD") +
  theme_classic()

Summary Essay

A.

The data set originally came with headers that were capitalized, had spaces and also slashes. I cleaned it up by lower casing all the headers and removing the spaces. I initially was going to leave the slashes but I found that it caused some trouble in my code so I ended up removing the slashes as well.

B.

The visualization represents the total collisions per month in Montgomery County MD during the 2024 year. The graph suggests that months in the fall have the most collisions. This is what I expected because the sun sets earlier during the fall and that means people are driving in the dark more often than the summer. I wonder if the leaves on the ground during fall season have an impact on the number of collisions. I’d imagine that roads with leaves, especially wet leaves make driving riskier when compared to roads with no leaves.

C.

I noticed that the data set included location coordinates consisting of both longitude and latitude. It would have been interesting to plot all the collisions on a map, though I’m not sure how I would do that.