Data 110 Project II

Opening

The dataset used in this project is about car crashes across Montgomery County, Maryland. It also contains variable about the conditions of the crash, such as the weather or if the driver was under the influence of alcohol. I intend to investigate if there is a trend of where cars usually crash, what conditions make it more likely for cars to crash, and more. I have selected this dataset because I saw many posts of the article “The MoCo Show” where there are multiple crashes on Interstate 270, so I wanted to investigate where the most car crashes occur in Montgomery County.

The dataset was derived from the offical Montgomery County Database (Montgomery County Data | Open Data Portal (montgomerycountymd.gov))

Loading Libraries + Dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(highcharter)

## Warning: package 'highcharter' was built under R version 4.3.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(leaflet)

## Warning: package 'leaflet' was built under R version 4.3.3

library(ggplot2)

# loads the uncleaned dataset
pre_crashes <- read_csv("C:/Users/omyue/OneDrive/Desktop/Montgomery College/Spring 24/Data 101/datasets/Crash_Reporting_-_Incidents_Data_20240410.csv")

## Rows: 97458 Columns: 44
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (38): Report Number, Local Case Number, Agency Name, ACRS Report Type, C...
## dbl  (6): Mile Point, Lane Number, Number of Lanes, Distance, Latitude, Long...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the Dataset

names(pre_crashes) <- tolower(names(pre_crashes))
names(pre_crashes) <- gsub(" ", "", names(pre_crashes))
names(pre_crashes) <- gsub("/", "", names(pre_crashes))
head(pre_crashes)

## # A tibble: 6 × 44
##   reportnumber localcasenumber agencyname    acrsreporttype crashdatetime hitrun
##   <chr>        <chr>           <chr>         <chr>          <chr>         <chr> 
## 1 MCP2686006F  230031339       Montgomery C… Property Dama… 06/30/2023 1… No    
## 2 MCP30580053  230064814       Montgomery C… Property Dama… 11/06/2023 1… Yes   
## 3 MCP2760004K  230071388       Montgomery C… Property Dama… 12/12/2023 0… Yes   
## 4 MCP3230004G  230031335       Montgomery C… Property Dama… 06/30/2023 0… No    
## 5 MCP12600013  230031067       Montgomery C… Injury Crash   06/29/2023 1… No    
## 6 DD55750030   230031365       Rockville Po… Property Dama… 07/01/2023 0… Yes   
## # ℹ 38 more variables: routetype <chr>, milepoint <dbl>,
## #   milepointdirection <chr>, lanedirection <chr>, lanenumber <dbl>,
## #   lanetype <chr>, numberoflanes <dbl>, direction <chr>, distance <dbl>,
## #   distanceunit <chr>, roadgrade <chr>, nontraffic <chr>, roadname <chr>,
## #   `cross-streettype` <chr>, `cross-streetname` <chr>,
## #   `off-roaddescription` <chr>, municipality <chr>,
## #   `relatednon-motorist` <chr>, atfault <chr>, collisiontype <chr>, …

crash <- pre_crashes %>%
  mutate(date = as.Date(mdy_hms(crashdatetime)), 
         time = format(mdy_hms(crashdatetime), "%I:%M:%S %p"))
head(crash)

## # A tibble: 6 × 46
##   reportnumber localcasenumber agencyname    acrsreporttype crashdatetime hitrun
##   <chr>        <chr>           <chr>         <chr>          <chr>         <chr> 
## 1 MCP2686006F  230031339       Montgomery C… Property Dama… 06/30/2023 1… No    
## 2 MCP30580053  230064814       Montgomery C… Property Dama… 11/06/2023 1… Yes   
## 3 MCP2760004K  230071388       Montgomery C… Property Dama… 12/12/2023 0… Yes   
## 4 MCP3230004G  230031335       Montgomery C… Property Dama… 06/30/2023 0… No    
## 5 MCP12600013  230031067       Montgomery C… Injury Crash   06/29/2023 1… No    
## 6 DD55750030   230031365       Rockville Po… Property Dama… 07/01/2023 0… Yes   
## # ℹ 40 more variables: routetype <chr>, milepoint <dbl>,
## #   milepointdirection <chr>, lanedirection <chr>, lanenumber <dbl>,
## #   lanetype <chr>, numberoflanes <dbl>, direction <chr>, distance <dbl>,
## #   distanceunit <chr>, roadgrade <chr>, nontraffic <chr>, roadname <chr>,
## #   `cross-streettype` <chr>, `cross-streetname` <chr>,
## #   `off-roaddescription` <chr>, municipality <chr>,
## #   `relatednon-motorist` <chr>, atfault <chr>, collisiontype <chr>, …

Let’s start off by basic cleaning. We’re going to select the variables that we would want to use for the project, and remove any NA’s.

# selects the wanted variables for the data

crash <- crash |>
  select(light, weather, latitude, longitude, hitrun, lanenumber, date, time)

# arranges the dataset by the date

crash <- arrange(crash, date)

# lists the first 6 rows of the crashes data frame
head(crash)

## # A tibble: 6 × 8
##   light          weather latitude longitude hitrun lanenumber date       time   
##   <chr>          <chr>      <dbl>     <dbl> <chr>       <dbl> <date>     <chr>  
## 1 DARK NO LIGHTS CLEAR       39.1     -77.1 Yes             1 2015-01-01 07:52:…
## 2 DARK LIGHTS ON CLEAR       39.1     -77.2 No              0 2015-01-01 05:42:…
## 3 DARK LIGHTS ON CLEAR       39.1     -77.1 No              2 2015-01-01 08:54:…
## 4 DARK NO LIGHTS CLEAR       39.2     -77.2 No              1 2015-01-01 05:33:…
## 5 DARK LIGHTS ON N/A         39.2     -77.3 No              0 2015-01-01 01:10:…
## 6 DAYLIGHT       CLEAR       39.2     -77.2 No              0 2015-01-01 02:10:…

# Parse the 12-hour format time and convert it to a POSIXct object
crash <- crash |>
  mutate(time_24 = parse_date_time(time, orders = "I:%M:%S %p")) |>
  mutate(time24 = as.numeric(format(time_24, "%H"))) 
crash <- arrange(crash, date)
head(crash)

## # A tibble: 6 × 10
##   light          weather latitude longitude hitrun lanenumber date       time   
##   <chr>          <chr>      <dbl>     <dbl> <chr>       <dbl> <date>     <chr>  
## 1 DARK NO LIGHTS CLEAR       39.1     -77.1 Yes             1 2015-01-01 07:52:…
## 2 DARK LIGHTS ON CLEAR       39.1     -77.2 No              0 2015-01-01 05:42:…
## 3 DARK LIGHTS ON CLEAR       39.1     -77.1 No              2 2015-01-01 08:54:…
## 4 DARK NO LIGHTS CLEAR       39.2     -77.2 No              1 2015-01-01 05:33:…
## 5 DARK LIGHTS ON N/A         39.2     -77.3 No              0 2015-01-01 01:10:…
## 6 DAYLIGHT       CLEAR       39.2     -77.2 No              0 2015-01-01 02:10:…
## # ℹ 2 more variables: time_24 <dttm>, time24 <dbl>

After the basic cleaning, we’re going to further narrow down the data. For the GIS plot, I do not want too many points on the graph, so I want to select a singular date. I decided to go with New Year’s Eve because it is the last date on the dataset.

# selects the last 239 crashes in the dataset (all crashes in December)
crashes_df209 <- crash |>
  mutate(month = month(date))|>
  filter(month == 12)
crashes_df209

## # A tibble: 8,928 × 11
##    light          weather latitude longitude hitrun lanenumber date       time  
##    <chr>          <chr>      <dbl>     <dbl> <chr>       <dbl> <date>     <chr> 
##  1 DAYLIGHT       RAINING     39.3     -77.2 No              1 2015-12-01 12:33…
##  2 DAYLIGHT       RAINING     39.0     -77.1 No              3 2015-12-01 09:41…
##  3 DAYLIGHT       RAINING     39.1     -77.2 No              1 2015-12-01 01:28…
##  4 DAYLIGHT       RAINING     39.1     -77.2 No              0 2015-12-01 10:26…
##  5 DAYLIGHT       RAINING     39.2     -77.2 No              1 2015-12-01 01:18…
##  6 DARK LIGHTS ON RAINING     39.1     -77.2 No              0 2015-12-01 06:02…
##  7 DARK LIGHTS ON CLOUDY      39.1     -77.2 No              2 2015-12-01 06:22…
##  8 DAYLIGHT       CLEAR       39.0     -77.0 No              2 2015-12-01 03:36…
##  9 DAYLIGHT       UNKNOWN     39.2     -77.2 No              0 2015-12-01 11:00…
## 10 DARK LIGHTS ON RAINING     39.1     -77.1 No              3 2015-12-01 06:50…
## # ℹ 8,918 more rows
## # ℹ 3 more variables: time_24 <dttm>, time24 <dbl>, month <dbl>

Linear Regression Graphs

I will explore if the number of lanes has a correlation if the crash becomes a hit and run. First, I’m going to convert the “Yes” or “No” values in the Hit.Run column into numeric values.

 boo <- crashes_df209 |> 
   group_by(time24) |>
  reframe(count = n(), hitrun, light, time24 ,weather)
head(boo)

## # A tibble: 6 × 5
##   time24 count hitrun light          weather
##    <dbl> <int> <chr>  <chr>          <chr>  
## 1      0   169 Yes    DARK LIGHTS ON RAINING
## 2      0   169 No     DARK LIGHTS ON CLEAR  
## 3      0   169 Yes    DARK LIGHTS ON CLEAR  
## 4      0   169 No     DARK LIGHTS ON CLEAR  
## 5      0   169 No     DARK LIGHTS ON CLEAR  
## 6      0   169 No     DARK LIGHTS ON CLEAR

lm_model <- lm(count ~ time24 +hitrun+light+weather, data = boo)
summary(lm_model)

## 
## Call:
## lm(formula = count ~ time24 + hitrun + light + weather, data = boo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -487.76  -60.15  -15.42   79.26  350.87 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          479.2603   151.5832   3.162 0.001574 ** 
## time24                16.6653     0.3066  54.359  < 2e-16 ***
## hitrunYes            -23.6294     4.2218  -5.597 2.25e-08 ***
## lightDARK LIGHTS ON   26.0387    13.8004   1.887 0.059219 .  
## lightDARK NO LIGHTS   11.4289    15.1496   0.754 0.450626    
## lightDAWN             75.6210    16.9540   4.460 8.28e-06 ***
## lightDAYLIGHT        140.8222    13.8011  10.204  < 2e-16 ***
## lightDUSK            181.5106    16.3963  11.070  < 2e-16 ***
## lightN/A             102.7537    21.8851   4.695 2.70e-06 ***
## lightOTHER           129.9218    36.3717   3.572 0.000356 ***
## lightUNKNOWN          79.4703    24.8653   3.196 0.001398 ** 
## weatherBLOWING SNOW -336.9122   165.2813  -2.038 0.041538 *  
## weatherCLEAR        -310.6464   150.8980  -2.059 0.039557 *  
## weatherCLOUDY       -314.2530   150.9766  -2.081 0.037420 *  
## weatherFOGGY        -365.4775   151.5465  -2.412 0.015901 *  
## weatherN/A          -315.9554   150.9874  -2.093 0.036414 *  
## weatherOTHER        -312.9522   153.7615  -2.035 0.041849 *  
## weatherRAINING      -304.3264   150.9356  -2.016 0.043802 *  
## weatherSEVERE WINDS -397.5095   162.9666  -2.439 0.014739 *  
## weatherSLEET        -334.1376   156.5491  -2.134 0.032838 *  
## weatherSNOW         -323.9182   151.5121  -2.138 0.032552 *  
## weatherUNKNOWN      -297.1501   152.6612  -1.946 0.051630 .  
## weatherWINTRY MIX   -344.3174   152.7077  -2.255 0.024173 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 150.7 on 8905 degrees of freedom
## Multiple R-squared:  0.2984, Adjusted R-squared:  0.2967 
## F-statistic: 172.2 on 22 and 8905 DF,  p-value: < 2.2e-16

The model shows that the amount of crashes that occur at a specific hour can be explained by factors such as weather, hit & run, and the light conditions present during the crash. 30% percent of the crashes can be explained by these variables by looking at the adjusted r-squared, and the model is statistically significant because the p-value is 2.2e-16 (far below the level of significance of 0.05).

plot_linear <- boo|>
  group_by(time24)|>
  summarise(sum = sum(count))|>
  ggplot(aes(x = time24, y = sum)) +
  geom_point() +
  geom_smooth(color = "red")

plot_linear

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Looking at this plot, we can note that the majority of crashes occur later in the day, and the points peak at hour 17 and 18. Hour 17/18 is 5 to 6PM, which is rush hour. Rush hour is infamous for road rage and many car crashes, which is why the number of car crashes peak at the hours of 5 to 6PM.

The GIS plot

Creating the Tooltip

popupcrash <- paste0( # creates the tooltip for the map
      "<b>Time: </b>", crashes_df209$time, "<br>",
      "<b>Light Condition: </b>", crashes_df209$light, "<br>",
      "<b>Weather Condition: </b>", crashes_df209$weather, "<br>",
      "<b>Hit & Run: </b>", crashes_df209$hitrun, "<br>",
      "<b>Number of Lanes: </b>", crashes_df209$lanenumber, "<br>"
    )

Creating the Colors and Labels for the Graph (Future)

On the graph, I want the points to be different colors based on the weather during the scene of the crash. I also want to incorporate a legend on the map in order for the viewers to understand what the point colors mean on the graph.
First, I need to set up the colors for the graph. There are six types of weather in the dataset, so I have to create a vector that contains six different colors. After creating the color palette for the graph, I need to create labels that will be associated with each color for the legend.

# sets color palette for the points on the plot
colors <- c("#ff006d", "#ff7d00", "#ffdd00", "#adff02", "#01befe","#8f00ff", "white", "black", "red")
# create the labels for the legend
labels <- c("Clear", "Cloudy", "Foggy", "Severe Wind", "Raining", "Unknown", "N/A")

Plotting the Graph

Now that we have set up the tooltip, colors, and labels for the graph, let’s set up the graph itself.

First, we will call leaflet() for a GIS graph. Then we will use setView() to set the deafult view over Montgomery County. To make the color points really stand out, I used “Stadia.AlidadeSmooth” to create a monochrome, light map in the function addProviderTiles.

To add the points on the map, we will use addCircleMarkers(). The “data” argument is used to indicate which data frame for leaflet to use, in this case it is “crashes_df”. We set the point size using the argument “radius” and I made the radius size two. I used the “colors” argument to

leaflet() |> # calls for leaflet
  setView(lng = -77.18, lat =  39.13, zoom =10.2) |> # sets default view for the map
  addProviderTiles("Esri.NatGeoWorldMap") |> # sets the visual for the map
  addCircleMarkers( # adds the points on the map
    data = crashes_df209, # sets the data
    radius = 2, # sets the size of the map points
    color = "#ff006d",
    popup = popupcrash # creates the tooltip
    )

## Assuming "longitude" and "latitude" are longitude and latitude, respectively

The graph exhibits all of the car crashes during the month of December from the years 2015-2023. The color of each point represents what weather conditions were present during the crash. An interesting pattern I have observed from the visualization was that the majority of car crashes during the month of December was on Interstate I-270. I also observed that the majority of crashes on the map occur in cities with busier roads such as Gaithersburg or Rockville. If I had a bit more time on the project, I would like to fully incorporate point colors which would correspond with the weather conditions of the crash.