The dataset used in this project is about car crashes across Montgomery County, Maryland. It also contains variable about the conditions of the crash, such as the weather or if the driver was under the influence of alcohol. I intend to investigate if there is a trend of where cars usually crash, what conditions make it more likely for cars to crash, and more. I have selected this dataset because I saw many posts of the article “The MoCo Show” where there are multiple crashes on Interstate 270, so I wanted to investigate where the most car crashes occur in Montgomery County.
The dataset was derived from the offical Montgomery County Database (Montgomery County Data | Open Data Portal (montgomerycountymd.gov))
Loading Libraries + Dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)library(highcharter)
Warning: package 'highcharter' was built under R version 4.3.3
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(leaflet)
Warning: package 'leaflet' was built under R version 4.3.3
library(ggplot2)# loads the uncleaned datasetpre_crashes <-read_csv("C:/Users/omyue/OneDrive/Desktop/Montgomery College/Spring 24/Data 101/datasets/Crash_Reporting_-_Incidents_Data_20240410.csv")
Rows: 97458 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (38): Report Number, Local Case Number, Agency Name, ACRS Report Type, C...
dbl (6): Mile Point, Lane Number, Number of Lanes, Distance, Latitude, Long...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s start off by basic cleaning. We’re going to select the variables that we would want to use for the project, and remove any NA’s.
# selects the wanted variables for the datacrash <- crash |>select(light, weather, latitude, longitude, hitrun, lanenumber, date, time)# arranges the dataset by the datecrash <-arrange(crash, date)# lists the first 6 rows of the crashes data framehead(crash)
# A tibble: 6 × 8
light weather latitude longitude hitrun lanenumber date time
<chr> <chr> <dbl> <dbl> <chr> <dbl> <date> <chr>
1 DARK NO LIGHTS CLEAR 39.1 -77.1 Yes 1 2015-01-01 07:52:…
2 DARK LIGHTS ON CLEAR 39.1 -77.2 No 0 2015-01-01 05:42:…
3 DARK LIGHTS ON CLEAR 39.1 -77.1 No 2 2015-01-01 08:54:…
4 DARK NO LIGHTS CLEAR 39.2 -77.2 No 1 2015-01-01 05:33:…
5 DARK LIGHTS ON N/A 39.2 -77.3 No 0 2015-01-01 01:10:…
6 DAYLIGHT CLEAR 39.2 -77.2 No 0 2015-01-01 02:10:…
# Parse the 12-hour format time and convert it to a POSIXct objectcrash <- crash |>mutate(time_24 =parse_date_time(time, orders ="I:%M:%S %p")) |>mutate(time24 =as.numeric(format(time_24, "%H"))) crash <-arrange(crash, date)head(crash)
# A tibble: 6 × 10
light weather latitude longitude hitrun lanenumber date time
<chr> <chr> <dbl> <dbl> <chr> <dbl> <date> <chr>
1 DARK NO LIGHTS CLEAR 39.1 -77.1 Yes 1 2015-01-01 07:52:…
2 DARK LIGHTS ON CLEAR 39.1 -77.2 No 0 2015-01-01 05:42:…
3 DARK LIGHTS ON CLEAR 39.1 -77.1 No 2 2015-01-01 08:54:…
4 DARK NO LIGHTS CLEAR 39.2 -77.2 No 1 2015-01-01 05:33:…
5 DARK LIGHTS ON N/A 39.2 -77.3 No 0 2015-01-01 01:10:…
6 DAYLIGHT CLEAR 39.2 -77.2 No 0 2015-01-01 02:10:…
# ℹ 2 more variables: time_24 <dttm>, time24 <dbl>
After the basic cleaning, we’re going to further narrow down the data. For the GIS plot, I do not want too many points on the graph, so I want to select a singular date. I decided to go with New Year’s Eve because it is the last date on the dataset.
# selects the last 239 crashes in the dataset (all crashes in December)crashes_df209 <- crash |>mutate(month =month(date))|>filter(month ==12)crashes_df209
# A tibble: 8,928 × 11
light weather latitude longitude hitrun lanenumber date time
<chr> <chr> <dbl> <dbl> <chr> <dbl> <date> <chr>
1 DAYLIGHT RAINING 39.3 -77.2 No 1 2015-12-01 12:33…
2 DAYLIGHT RAINING 39.0 -77.1 No 3 2015-12-01 09:41…
3 DAYLIGHT RAINING 39.1 -77.2 No 1 2015-12-01 01:28…
4 DAYLIGHT RAINING 39.1 -77.2 No 0 2015-12-01 10:26…
5 DAYLIGHT RAINING 39.2 -77.2 No 1 2015-12-01 01:18…
6 DARK LIGHTS ON RAINING 39.1 -77.2 No 0 2015-12-01 06:02…
7 DARK LIGHTS ON CLOUDY 39.1 -77.2 No 2 2015-12-01 06:22…
8 DAYLIGHT CLEAR 39.0 -77.0 No 2 2015-12-01 03:36…
9 DAYLIGHT UNKNOWN 39.2 -77.2 No 0 2015-12-01 11:00…
10 DARK LIGHTS ON RAINING 39.1 -77.1 No 3 2015-12-01 06:50…
# ℹ 8,918 more rows
# ℹ 3 more variables: time_24 <dttm>, time24 <dbl>, month <dbl>
Linear Regression Graphs
I will explore if the number of lanes has a correlation if the crash becomes a hit and run. First, I’m going to convert the “Yes” or “No” values in the Hit.Run column into numeric values.
# A tibble: 6 × 5
time24 count hitrun light weather
<dbl> <int> <chr> <chr> <chr>
1 0 169 Yes DARK LIGHTS ON RAINING
2 0 169 No DARK LIGHTS ON CLEAR
3 0 169 Yes DARK LIGHTS ON CLEAR
4 0 169 No DARK LIGHTS ON CLEAR
5 0 169 No DARK LIGHTS ON CLEAR
6 0 169 No DARK LIGHTS ON CLEAR
lm_model <-lm(count ~ time24 +hitrun+light+weather, data = boo)summary(lm_model)
The model shows that the amount of crashes that occur at a specific hour can be explained by factors such as weather, hit & run, and the light conditions present during the crash. 30% percent of the crashes can be explained by these variables by looking at the adjusted r-squared, and the model is statistically significant because the p-value is 2.2e-16 (far below the level of significance of 0.05).
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Looking at this plot, we can note that the majority of crashes occur later in the day, and the points peak at hour 17 and 18. Hour 17/18 is 5 to 6PM, which is rush hour. Rush hour is infamous for road rage and many car crashes, which is why the number of car crashes peak at the hours of 5 to 6PM.
The GIS plot
Creating the Tooltip
popupcrash <-paste0( # creates the tooltip for the map"<b>Time: </b>", crashes_df209$time, "<br>","<b>Light Condition: </b>", crashes_df209$light, "<br>","<b>Weather Condition: </b>", crashes_df209$weather, "<br>","<b>Hit & Run: </b>", crashes_df209$hitrun, "<br>","<b>Number of Lanes: </b>", crashes_df209$lanenumber, "<br>" )
Creating the Colors and Labels for the Graph
On the graph, I want the points to be different colors based on the weather during the scene of the crash. I also want to incorporate a legend on the map in order for the viewers to understand what the point colors mean on the graph.
First, I need to set up the colors for the graph. There are six types of weather in the dataset, so I have to create a vector that contains six different colors. After creating the color palette for the graph, I need to create labels that will be associated with each color for the legend.
# sets color palette for the points on the plotcolors <-c("#ff006d", "#ff7d00", "#ffdd00", "#adff02", "#01befe","#8f00ff", "white", "black", "red")# create the labels for the legendlabels <-c("Clear", "Cloudy", "Foggy", "Severe Wind", "Raining", "Unknown", "N/A")
Plotting the Graph
Now that we have set up the tooltip, colors, and labels for the graph, let’s set up the graph itself.
First, we will call leaflet() for a GIS graph. Then we will use setView() to set the deafult view over Montgomery County. To make the color points really stand out, I used “Stadia.AlidadeSmooth” to create a monochrome, light map in the function addProviderTiles.
To add the points on the map, we will use addCircleMarkers(). The “data” argument is used to indicate which data frame for leaflet to use, in this case it is “crashes_df”. We set the point size using the argument “radius” and I made the radius size two. I used the “colors” argument to
leaflet() |># calls for leafletsetView(lng =-77.18, lat =39.13, zoom =10.2) |># sets default view for the mapaddProviderTiles("Stadia.AlidadeSmooth") |># sets the visual for the mapaddCircleMarkers( # adds the points on the mapdata = crashes_df209, # sets the dataradius =2, # sets the size of the map pointscolor ="#ff006d",popup = popupcrash # creates the tooltip ) # |>
Assuming "longitude" and "latitude" are longitude and latitude, respectively
# addLegend(position = "bottomright", colors = colors, labels = labels, title = "Weather") # creates the legend
The graph exhibits all of the car crashes during the month of December from the years 2015-2023. The color of each point represents what weather conditions were present during the crash. An interesting pattern I have observed from the visualization was that the majority of car crashes during the month of December was on Interstate I-270. I also observed that the majority of crashes on the map occur in cities with busier roads such as Gaithersburg or Rockville. If I had a bit more time on the project, I would like to fully incorporate point colors which would correspond with the weather conditions of the crash.