The dataset used in this project is about car crashes across Montgomery County, Maryland. It also contains variable about the conditions of the crash, such as the weather or if the driver was under the influence of alcohol. I intend to investigate if there is a trend of where cars usually crash, what conditions make it more likely for cars to crash, and more. I have selected this dataset because I saw many posts of the article “The MoCo Show” where there are multiple crashes on Interstate 270, so I wanted to investigate where the most car crashes occur in Montgomery County.
The dataset was derived from the offical Montgomery County Database (Montgomery County Data | Open Data Portal (montgomerycountymd.gov))
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.3
library(ggplot2)
# loads the uncleaned dataset
pre_crashes <- read_csv("C:/Users/omyue/OneDrive/Desktop/Montgomery College/Spring 24/Data 101/datasets/Crash_Reporting_-_Incidents_Data_20240410.csv")
## Rows: 97458 Columns: 44
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (38): Report Number, Local Case Number, Agency Name, ACRS Report Type, C...
## dbl (6): Mile Point, Lane Number, Number of Lanes, Distance, Latitude, Long...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(pre_crashes) <- tolower(names(pre_crashes))
names(pre_crashes) <- gsub(" ", "", names(pre_crashes))
names(pre_crashes) <- gsub("/", "", names(pre_crashes))
head(pre_crashes)
## # A tibble: 6 × 44
## reportnumber localcasenumber agencyname acrsreporttype crashdatetime hitrun
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 MCP2686006F 230031339 Montgomery C… Property Dama… 06/30/2023 1… No
## 2 MCP30580053 230064814 Montgomery C… Property Dama… 11/06/2023 1… Yes
## 3 MCP2760004K 230071388 Montgomery C… Property Dama… 12/12/2023 0… Yes
## 4 MCP3230004G 230031335 Montgomery C… Property Dama… 06/30/2023 0… No
## 5 MCP12600013 230031067 Montgomery C… Injury Crash 06/29/2023 1… No
## 6 DD55750030 230031365 Rockville Po… Property Dama… 07/01/2023 0… Yes
## # ℹ 38 more variables: routetype <chr>, milepoint <dbl>,
## # milepointdirection <chr>, lanedirection <chr>, lanenumber <dbl>,
## # lanetype <chr>, numberoflanes <dbl>, direction <chr>, distance <dbl>,
## # distanceunit <chr>, roadgrade <chr>, nontraffic <chr>, roadname <chr>,
## # `cross-streettype` <chr>, `cross-streetname` <chr>,
## # `off-roaddescription` <chr>, municipality <chr>,
## # `relatednon-motorist` <chr>, atfault <chr>, collisiontype <chr>, …
crash <- pre_crashes %>%
mutate(date = as.Date(mdy_hms(crashdatetime)),
time = format(mdy_hms(crashdatetime), "%I:%M:%S %p"))
head(crash)
## # A tibble: 6 × 46
## reportnumber localcasenumber agencyname acrsreporttype crashdatetime hitrun
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 MCP2686006F 230031339 Montgomery C… Property Dama… 06/30/2023 1… No
## 2 MCP30580053 230064814 Montgomery C… Property Dama… 11/06/2023 1… Yes
## 3 MCP2760004K 230071388 Montgomery C… Property Dama… 12/12/2023 0… Yes
## 4 MCP3230004G 230031335 Montgomery C… Property Dama… 06/30/2023 0… No
## 5 MCP12600013 230031067 Montgomery C… Injury Crash 06/29/2023 1… No
## 6 DD55750030 230031365 Rockville Po… Property Dama… 07/01/2023 0… Yes
## # ℹ 40 more variables: routetype <chr>, milepoint <dbl>,
## # milepointdirection <chr>, lanedirection <chr>, lanenumber <dbl>,
## # lanetype <chr>, numberoflanes <dbl>, direction <chr>, distance <dbl>,
## # distanceunit <chr>, roadgrade <chr>, nontraffic <chr>, roadname <chr>,
## # `cross-streettype` <chr>, `cross-streetname` <chr>,
## # `off-roaddescription` <chr>, municipality <chr>,
## # `relatednon-motorist` <chr>, atfault <chr>, collisiontype <chr>, …
Let’s start off by basic cleaning. We’re going to select the variables that we would want to use for the project, and remove any NA’s.
# selects the wanted variables for the data
crash <- crash |>
select(light, weather, latitude, longitude, hitrun, lanenumber, date, time)
# arranges the dataset by the date
crash <- arrange(crash, date)
# lists the first 6 rows of the crashes data frame
head(crash)
## # A tibble: 6 × 8
## light weather latitude longitude hitrun lanenumber date time
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <date> <chr>
## 1 DARK NO LIGHTS CLEAR 39.1 -77.1 Yes 1 2015-01-01 07:52:…
## 2 DARK LIGHTS ON CLEAR 39.1 -77.2 No 0 2015-01-01 05:42:…
## 3 DARK LIGHTS ON CLEAR 39.1 -77.1 No 2 2015-01-01 08:54:…
## 4 DARK NO LIGHTS CLEAR 39.2 -77.2 No 1 2015-01-01 05:33:…
## 5 DARK LIGHTS ON N/A 39.2 -77.3 No 0 2015-01-01 01:10:…
## 6 DAYLIGHT CLEAR 39.2 -77.2 No 0 2015-01-01 02:10:…
# Parse the 12-hour format time and convert it to a POSIXct object
crash <- crash |>
mutate(time_24 = parse_date_time(time, orders = "I:%M:%S %p")) |>
mutate(time24 = as.numeric(format(time_24, "%H")))
crash <- arrange(crash, date)
head(crash)
## # A tibble: 6 × 10
## light weather latitude longitude hitrun lanenumber date time
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <date> <chr>
## 1 DARK NO LIGHTS CLEAR 39.1 -77.1 Yes 1 2015-01-01 07:52:…
## 2 DARK LIGHTS ON CLEAR 39.1 -77.2 No 0 2015-01-01 05:42:…
## 3 DARK LIGHTS ON CLEAR 39.1 -77.1 No 2 2015-01-01 08:54:…
## 4 DARK NO LIGHTS CLEAR 39.2 -77.2 No 1 2015-01-01 05:33:…
## 5 DARK LIGHTS ON N/A 39.2 -77.3 No 0 2015-01-01 01:10:…
## 6 DAYLIGHT CLEAR 39.2 -77.2 No 0 2015-01-01 02:10:…
## # ℹ 2 more variables: time_24 <dttm>, time24 <dbl>
After the basic cleaning, we’re going to further narrow down the data. For the GIS plot, I do not want too many points on the graph, so I want to select a singular date. I decided to go with New Year’s Eve because it is the last date on the dataset.
# selects the last 239 crashes in the dataset (all crashes in December)
crashes_df209 <- crash |>
mutate(month = month(date))|>
filter(month == 12)
crashes_df209
## # A tibble: 8,928 × 11
## light weather latitude longitude hitrun lanenumber date time
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <date> <chr>
## 1 DAYLIGHT RAINING 39.3 -77.2 No 1 2015-12-01 12:33…
## 2 DAYLIGHT RAINING 39.0 -77.1 No 3 2015-12-01 09:41…
## 3 DAYLIGHT RAINING 39.1 -77.2 No 1 2015-12-01 01:28…
## 4 DAYLIGHT RAINING 39.1 -77.2 No 0 2015-12-01 10:26…
## 5 DAYLIGHT RAINING 39.2 -77.2 No 1 2015-12-01 01:18…
## 6 DARK LIGHTS ON RAINING 39.1 -77.2 No 0 2015-12-01 06:02…
## 7 DARK LIGHTS ON CLOUDY 39.1 -77.2 No 2 2015-12-01 06:22…
## 8 DAYLIGHT CLEAR 39.0 -77.0 No 2 2015-12-01 03:36…
## 9 DAYLIGHT UNKNOWN 39.2 -77.2 No 0 2015-12-01 11:00…
## 10 DARK LIGHTS ON RAINING 39.1 -77.1 No 3 2015-12-01 06:50…
## # ℹ 8,918 more rows
## # ℹ 3 more variables: time_24 <dttm>, time24 <dbl>, month <dbl>
I will explore if the number of lanes has a correlation if the crash becomes a hit and run. First, I’m going to convert the “Yes” or “No” values in the Hit.Run column into numeric values.
boo <- crashes_df209 |>
group_by(time24) |>
reframe(count = n(), hitrun, light, time24 ,weather)
head(boo)
## # A tibble: 6 × 5
## time24 count hitrun light weather
## <dbl> <int> <chr> <chr> <chr>
## 1 0 169 Yes DARK LIGHTS ON RAINING
## 2 0 169 No DARK LIGHTS ON CLEAR
## 3 0 169 Yes DARK LIGHTS ON CLEAR
## 4 0 169 No DARK LIGHTS ON CLEAR
## 5 0 169 No DARK LIGHTS ON CLEAR
## 6 0 169 No DARK LIGHTS ON CLEAR
lm_model <- lm(count ~ time24 +hitrun+light+weather, data = boo)
summary(lm_model)
##
## Call:
## lm(formula = count ~ time24 + hitrun + light + weather, data = boo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -487.76 -60.15 -15.42 79.26 350.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 479.2603 151.5832 3.162 0.001574 **
## time24 16.6653 0.3066 54.359 < 2e-16 ***
## hitrunYes -23.6294 4.2218 -5.597 2.25e-08 ***
## lightDARK LIGHTS ON 26.0387 13.8004 1.887 0.059219 .
## lightDARK NO LIGHTS 11.4289 15.1496 0.754 0.450626
## lightDAWN 75.6210 16.9540 4.460 8.28e-06 ***
## lightDAYLIGHT 140.8222 13.8011 10.204 < 2e-16 ***
## lightDUSK 181.5106 16.3963 11.070 < 2e-16 ***
## lightN/A 102.7537 21.8851 4.695 2.70e-06 ***
## lightOTHER 129.9218 36.3717 3.572 0.000356 ***
## lightUNKNOWN 79.4703 24.8653 3.196 0.001398 **
## weatherBLOWING SNOW -336.9122 165.2813 -2.038 0.041538 *
## weatherCLEAR -310.6464 150.8980 -2.059 0.039557 *
## weatherCLOUDY -314.2530 150.9766 -2.081 0.037420 *
## weatherFOGGY -365.4775 151.5465 -2.412 0.015901 *
## weatherN/A -315.9554 150.9874 -2.093 0.036414 *
## weatherOTHER -312.9522 153.7615 -2.035 0.041849 *
## weatherRAINING -304.3264 150.9356 -2.016 0.043802 *
## weatherSEVERE WINDS -397.5095 162.9666 -2.439 0.014739 *
## weatherSLEET -334.1376 156.5491 -2.134 0.032838 *
## weatherSNOW -323.9182 151.5121 -2.138 0.032552 *
## weatherUNKNOWN -297.1501 152.6612 -1.946 0.051630 .
## weatherWINTRY MIX -344.3174 152.7077 -2.255 0.024173 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 150.7 on 8905 degrees of freedom
## Multiple R-squared: 0.2984, Adjusted R-squared: 0.2967
## F-statistic: 172.2 on 22 and 8905 DF, p-value: < 2.2e-16
The model shows that the amount of crashes that occur at a specific hour can be explained by factors such as weather, hit & run, and the light conditions present during the crash. 30% percent of the crashes can be explained by these variables by looking at the adjusted r-squared, and the model is statistically significant because the p-value is 2.2e-16 (far below the level of significance of 0.05).
plot_linear <- boo|>
group_by(time24)|>
summarise(sum = sum(count))|>
ggplot(aes(x = time24, y = sum)) +
geom_point() +
geom_smooth(color = "red")
plot_linear
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Looking at this plot, we can note that the majority of crashes occur later in the day, and the points peak at hour 17 and 18. Hour 17/18 is 5 to 6PM, which is rush hour. Rush hour is infamous for road rage and many car crashes, which is why the number of car crashes peak at the hours of 5 to 6PM.
popupcrash <- paste0( # creates the tooltip for the map
"<b>Time: </b>", crashes_df209$time, "<br>",
"<b>Light Condition: </b>", crashes_df209$light, "<br>",
"<b>Weather Condition: </b>", crashes_df209$weather, "<br>",
"<b>Hit & Run: </b>", crashes_df209$hitrun, "<br>",
"<b>Number of Lanes: </b>", crashes_df209$lanenumber, "<br>"
)
On the graph, I want the points to be different colors based on the
weather during the scene of the crash. I also want to incorporate a
legend on the map in order for the viewers to understand what the point
colors mean on the graph.
First, I need to set up the colors for the graph. There are six types of
weather in the dataset, so I have to create a vector that contains six
different colors. After creating the color palette for the graph, I need
to create labels that will be associated with each color for the
legend.
# sets color palette for the points on the plot
colors <- c("#ff006d", "#ff7d00", "#ffdd00", "#adff02", "#01befe","#8f00ff", "white", "black", "red")
# create the labels for the legend
labels <- c("Clear", "Cloudy", "Foggy", "Severe Wind", "Raining", "Unknown", "N/A")
Now that we have set up the tooltip, colors, and labels for the graph, let’s set up the graph itself.
First, we will call leaflet() for a GIS graph. Then we will use setView() to set the deafult view over Montgomery County. To make the color points really stand out, I used “Stadia.AlidadeSmooth” to create a monochrome, light map in the function addProviderTiles.
To add the points on the map, we will use addCircleMarkers(). The “data” argument is used to indicate which data frame for leaflet to use, in this case it is “crashes_df”. We set the point size using the argument “radius” and I made the radius size two. I used the “colors” argument to
leaflet() |> # calls for leaflet
setView(lng = -77.18, lat = 39.13, zoom =10.2) |> # sets default view for the map
addProviderTiles("Esri.NatGeoWorldMap") |> # sets the visual for the map
addCircleMarkers( # adds the points on the map
data = crashes_df209, # sets the data
radius = 2, # sets the size of the map points
color = "#ff006d",
popup = popupcrash # creates the tooltip
)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
The graph exhibits all of the car crashes during the month of December from the years 2015-2023. The color of each point represents what weather conditions were present during the crash. An interesting pattern I have observed from the visualization was that the majority of car crashes during the month of December was on Interstate I-270. I also observed that the majority of crashes on the map occur in cities with busier roads such as Gaithersburg or Rockville. If I had a bit more time on the project, I would like to fully incorporate point colors which would correspond with the weather conditions of the crash.