Project 3

Author

Maisha Ann Subin

Project 3

Data Set Variables
Weather
Weather Frequency by Damage
Drivers License plate
Speed Limit
Vehicle Make
Latitude
Longitude
Year (of crash)
Damage
Damage Frequency

(Total 39 variables, I’ve included the ones I use in the project)

Introductory Essay

For my Project 3 analysis, I selected a data set from Data.gov that reports crash data in Montgomery County, Maryland. The data set, collected through the Automated Crash Reporting System (ACRS) managed by the Maryland State Police, includes detailed records of vehicle crashes categorized into three types: fatal, property damage, and injury-related crashes.

The data is compiled from reports filled out by law enforcement officers at the scene of a crash. These reports include a variety of factors such as driver and vehicle information, weather and road conditions, and contributing causes.

I was drawn to this data set while browsing online and found the topic of crashes and their associated variables both relevant and compelling. Initially, I intended to use a larger data set from ‘Kaggle’, but due to size limitations, I looked for a similar alternative and found this one on Data.gov.

Through this project, I aim to explore questions such as: “Does weather play a role in the frequency or type of crashes?” and “Is there a relationship between the make of a vehicle—like Toyota, Mercedes, or Tesla—and how often it’s involved in a crash?” These questions are not only data-driven but also reflect real-world concerns about safety and accountability on the road.

Load necessary libraries

library(tidyverse)
library(viridis)
library(ggplot2)
library(DataExplorer)
library(leaflet)
library(sf)
library(plotly)

Read in crash data set

crash <- read_csv("/Users/maishasubin/Desktop/DATA110/Crash_Reporting_Data.csv")

Necessary cleaning:

Removing unnecessary columns
Separating ‘date’ and ‘time’ columns; extracting ‘year’ from ‘date’ column
Renaming variables without space & removing NA’s from all columns individually
Creating 3 additional numeric columns according to project 3’s requirements

# Removing columns I wont be using in my project analysis
crash_col <- crash[, -c(1, 2, 3, 7, 8, 9, 10, 11, 18, 19, 25, 26, 27, 28, 30, 32, 33, 34, 36 )]

# Create separate columns for date and time
crash_time <- crash_col %>%
  separate('Crash Date/Time', into = c("year", "time"), sep = " ")

# Extract year as numeric from date column
crash_time$year <- as.numeric(str_sub(crash_time$year, start = -4L))

# Sort the years in ascending order
sorted_years <- sort(crash_time$year)

unique(sorted_years)

 [1] 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025

# Renaming variables 
crash_rename <- crash_time |> 
  rename(damage = `ACRS Report Type`
         , route_type = `Route Type`
         , collision_type = `Collision Type`
         , surface_condition = `Surface Condition`
         , weather = `Weather`
         , light = `Light`
         , traffic_control = `Traffic Control`
         , substance_abuse = `Driver Substance Abuse`
         , driver_fault = `Driver At Fault`
         , injury_severity = `Injury Severity`
         , circumstance = `Circumstance`, distraction = `Driver Distracted By`
         , license_state =`Drivers License State`
         , vehicle_movement = `Vehicle Movement`
         , speed_limit = `Speed Limit`
         , vehicle_make = `Vehicle Make`
         , latitude = `Latitude`
         , longitude = `Longitude`
         , location = `Location`)

# Removing missing values: NA's and manually string "N/A"
crash_nona <- crash_rename %>%
filter(!is.na(damage)  & damage != "N/A", !is.na(year) & year != "N/A"
         , !is.na(time)  & time != "N/A"
         , !is.na(route_type)  & route_type != "N/A"
         , !is.na(collision_type)  & collision_type != "N/A"
         , !is.na(surface_condition)  & surface_condition != "N/A"
         , !is.na(weather)  & weather != "N/A"
         , !is.na(light)  & light!= "N/A"
         , !is.na(traffic_control)  & traffic_control != "N/A"
         , !is.na(substance_abuse)  & substance_abuse != "N/A"
         , !is.na(driver_fault)  & driver_fault != "N/A"
         , !is.na(injury_severity)  & injury_severity != "N/A"
         , !is.na(circumstance) & circumstance != "N/A"
         , !is.na(license_state)  & license_state != "N/A"
         , !is.na(vehicle_movement)  & vehicle_movement != "N/A"
         , !is.na(speed_limit)  & speed_limit != "N/A"
         , !is.na(distraction)  & distraction != "N/A"
         , !is.na(vehicle_make)  & vehicle_make != "N/A"
         , !is.na(latitude)  & latitude != "N/A"
         , !is.na(longitude)  & longitude != "N/A"
         , !is.na(location)  & location != "N/A")

Additional numeric variable: weather frequency by damage

# Removing additional columns
crash_nona <- crash_nona %>%
  select(-starts_with("weather_freq_by_damage")) # ChatGPT helped with not creating additional columns every time this code runs

# Counting how often each weather type appears within each damage type
weather_freq_by_damage <- crash_nona %>%
  count(damage, weather, name = "weather_freq_by_damage") # Combining 2 variable's

# Joining those frequencies back into my original data
crash_nona <- crash_nona %>%
  left_join(weather_freq_by_damage, by = c("damage", "weather"))

Additional numeric variable: damage frequency

# Creating a frequency table for the 'damage' column
damage_freq <- table(crash_nona$damage)

# Mapping frequencies back to the original data frame
crash_nona$damage_freq <- damage_freq[crash_nona$damage]

Additional numeric variable: traffic control frequency

# Creating frequency table for 'traffic_control'
traffic_control_freq <- table(crash_nona$traffic_control)

# Mapping frequency values back to the data frame
crash_nona$traffic_control_freq <- traffic_control_freq[crash_nona$traffic_control]

Analyzing data through correlation plot

# To analyze correlation of selected variables
crash_corr <- crash_nona [, c("speed_limit", "year", "latitude", "longitude", "weather_freq_by_damage", "damage_freq", "traffic_control_freq")]

plot_correlation(crash_corr)

1 features with more than 20 categories ignored!
traffic_control_freq.V1: 29 categories

There seem to be relatively significant negative correlation between year and traffic control frequency of -0.48.
There also seem to strong correlation between damage frequency and damage freq property and injury. However that’s possible because of the common points in the variables.

Testing multiple linear regression model in relation to damage frequency

# Running linear regression models
lm_model <- lm(damage_freq ~ speed_limit + year 
               + latitude + longitude + weather_freq_by_damage
               + traffic_control_freq, data = crash_nona)
# Summary of model
summary(lm_model)


Call:
lm(formula = damage_freq ~ speed_limit + year + latitude + longitude + 
    weather_freq_by_damage + traffic_control_freq, data = crash_nona)

Residuals:
   Min     1Q Median     3Q    Max 
-12412  -2593   1080   2029   3784 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -1.327e+05  1.640e+04  -8.091 6.18e-16 ***
speed_limit            -1.631e+01  1.943e+00  -8.395  < 2e-16 ***
year                    8.295e+01  5.400e+00  15.360  < 2e-16 ***
latitude               -4.620e+02  2.565e+02  -1.801   0.0717 .  
longitude               5.605e+01  1.941e+02   0.289   0.7728    
weather_freq_by_damage  2.770e-01  6.524e-03  42.460  < 2e-16 ***
traffic_control_freq   -3.702e-03  5.221e-03  -0.709   0.4783    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2381 on 24855 degrees of freedom
Multiple R-squared:  0.07452,   Adjusted R-squared:  0.0743 
F-statistic: 333.6 on 6 and 24855 DF,  p-value: < 2.2e-16

Equation for my linear regression model:

damage_freq = − 132700 − (16.31 * speed_limit) + (82.95 * year) − (462 * latitude) + (56.05 * longitude)− 8.550 + (0.277 * weather_freq_by_damage) − (0.0037 * traffic_control_freq)
Analyzing model based on p-values:

speed_limit, year and weather_freq_by_damage type are statistically significant predictors of crash damage frequency with p values of < 2e-16, < 2e-16 and < 2e-16 respectively.

Variables like latitude, longitude, weather_freq_by_damage, and traffic_control_freq are not statistically significant based on their high p-values (around or greater than 0.05).
Adjusted R^2 values:

Adjusted R^2 = 0.0743 suggests this model explains about 7.4% of the variability in the response variable damage_freq, even after adjusting for the number of predictors. While the model is statistically significant (F-statistic p < 2.2e-16), it’s not strong, the predictive power is low.

Facet wrap for better understanding yearly damage count based on type:

# Grouping data by year and damage
crash_yearly <- crash_nona %>%
  group_by(year, damage) %>%
  summarise(damage_count = n(), .groups = "drop")

# Converting year to factor for discrete x-axis
crash_yearly$year <- as.factor(crash_yearly$year)

# Facet plot
ggplot(crash_yearly, aes(x = year, y = damage_count, color = damage, group = damage)) +
  geom_point() +
  geom_line() +
  facet_wrap(~damage) +
  labs(title = "Crash Count by Year and Damage Type",
       x = "Year", y = "Number of Crashes",
       caption = "Source: Data.gov") +
  scale_color_viridis_d("Damage Type") +
  theme_bw()

2024 seems to be an interesting year based on this data, with a sudden upward trend.
Fatalities in general are low and property damages are more.

Final Visualizations…

Visualization 1:

library(leaflet.extras) # Using this package for extra features such as heatmap over my GIS map. Got help from Nomar Picasso's RPubs and ChatGPT, link mentioned in the end.

# Filtering for 2024 and scale severity as 2024 showed a unusal upward trend in damage
 crash_heat_2024 <- crash_nona %>%
  filter(year == 2024) %>%
  mutate(intensity = case_when(
    damage == "Fatal Crash" ~ 15,
    damage == "Injury Crash" ~ 3,
    damage == "Property Damage Crash" ~ 2, # For intensity of colors on heatmap
    TRUE ~ 0
  )) %>%
  mutate(intensity = intensity * 20)  # Increasing intensity scale

# GIS plot
leaflet(crash_heat_2024) %>%
  addProviderTiles("CartoDB.DarkMatter") %>% # Selecting theme for map
  # Heatmap layer
  addHeatmap(
    lng = ~longitude, lat = ~latitude,
    intensity = ~intensity, # Controls how strong each point affects the heatmap
    blur = 15,              # Blur effect of heatmap
    max = 10,               # Adjust max intensity
    radius = 20,            # Radius of heatmap
    gradient = c( "0" = "#440154", "0.2" = "#3b528b",
    "0.6" = "#5dc962", "0.8" = "#fde725", "1" = "#ff5a00")
  ) %>%
  # Adding interactive markers with popups
  addCircleMarkers(
    lng = ~longitude, lat = ~latitude,
    radius = 5,  # Smaller radius to avoid overcrowding
    color = ~ifelse(damage == "Fatal Crash", "purple", 
                    ifelse(damage == "Injury Crash", "orangered", "yellow")),
    fillOpacity = 0.5,
    stroke = FALSE,
    popup = ~paste(
      "Year: ", year, "<br>",
      "Damage Type: ", damage, "<br>",
      "Vehicle Make: ", vehicle_make, "<br>",
      "Weather: ", weather
    ),
    group = "markers"  # Creating a separate group for the markers so heatmap and points can be viewed separately 
  ) %>%
  # Adding layer control to toggle between heatmap and markers
  addLayersControl(        # Adds a checkbox menu to turn the circle marker layer on/off.
    overlayGroups = c("markers"),
    options = layersControlOptions(collapsed = FALSE)
  )%>%
  # For legend in the bottom right
addLegend(
  position = "bottomright",
  colors = c("purple", "orangered", "yellow"),
  labels = c("Fatal Crash", "Injury Crash", "Property Damage Crash"),
  title = "Damage Type",
  opacity = 1
)

The above is a GIS visualization to show damage type such as property crash or fatality in 2024 in Montgomery County, Maryland. A heatmap can also be seen if the marker is unchecked which shows the density of points i.e. crashes.

Understanding data for 2nd visualization:

# Group and count crashes per speed limit
speed_summary <- crash_nona %>%
  group_by(speed_limit, damage) %>%
  summarise(crash_count = n()) %>%
  arrange(speed_limit, damage)

`summarise()` has grouped output by 'speed_limit'. You can override using the
`.groups` argument.

ggplot(speed_summary, aes(x = factor(speed_limit), y = crash_count, fill = damage)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Crash Frequency by Speed Limit and Damage Type",
    x = "Speed Limit (mph)",
    y = "Number of Crashes",
    caption = "Source: Data.gov"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("red", "maroon", "pink")) + # Adjust colors as needed
  theme(legend.title = element_blank())

Visualization 2:

# Another GIS plot for speed limit
# Using color palettes to represent speed
pal_speed <- colorNumeric(palette = "viridis", domain = crash_nona$speed_limit) # Defining a numeric color scale (pal_speed) using the "viridis" palette.

leaflet(crash_nona) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(
    lng = ~longitude, lat = ~latitude,
    color = ~pal_speed(speed_limit),
    fillOpacity = 0.6,
    radius = 3,
    popup = ~paste("Speed Limit: ", speed_limit, "<br>Damage: ", damage)
  ) %>%
  addLegend(
    "bottomright", pal = pal_speed, values = ~speed_limit,
    title = "Speed Limit (mph)"
  ) # Adds a legend in the bottom right showing the color gradient for speed limits.

As seen through the analysis bar chart before the GIS, there’s a trend in crashed between speed limit of 25 - 45, with highest being property damage followed by injury and rarely any fatalities. This GIS chart shows each locations in Montgomery County, where the crash took place with respect to speed limit. It helps us know if there are any patterns.Includes popup as well for interactivity.

Visualization 3:

# Time series line graph

# Defining selected vehicle makes
selected_makes <- c("TOYOTA", "FORD", "HYUNDAI", "NISSAN", "HONDA", 
                    "MERCEDES", "PORSCHE", "VOLKSWAGEN", "TESLA") # Selected based on personal curiosity

# Filtering and summarizing by year and make
summary_data <- crash_nona %>%
  filter(vehicle_make %in% selected_makes) %>%
  group_by(year, vehicle_make) %>%
  summarise(avg_damage = mean(damage_freq, na.rm = TRUE), .groups = "drop") # Averaging

# Final line plot
p <- ggplot(summary_data, aes(x = factor(year), y = avg_damage, color = vehicle_make, group = vehicle_make)) +
  geom_line(size = 1) +
 scale_color_viridis_d(option = "magma") +  
  labs(
    title = "Average Damage per Crash by Vehicle Make (2015–2025)",
    x = "Year",
    y = "Average Damage Frequency",
    caption = "Data.gov" # Caption not showing up because of interactivity 
  ) +
  theme(panel.background = element_blank(),  # Removes background color
    plot.background = element_blank(),   # Removes plot area background
    panel.grid = element_blank()        # Removes gridlines
    )

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

# Converts ggplot to interactive plotly plot
ggplotly(p)

The above time series line graph shows the average damage_freq based on vehicle_make between 2015 and 2025. Plotly adds interactivity to it so specific data can be obtained.

Final Thoughts:

Understanding the spatial distribution of motor vehicle crashes and the factors associated with them is essential for improving road safety and guiding traffic policies. The data set I chose in this project contains crash records from Montgomery County, Maryland, with variables such as speed limit, crash year, weather condition, damage type, and geographic coordinates. Using R and the leaflet package, I created interactive GIS visualizations to explore how speed and damage type are spatially distributed across the county.

One visualization used a heatmap layered over a base map to show crash concentrations. Locations with a high density of crashes appear as intense areas, mostly clustered around major intersections and highways. Another visualization used circle markers colored by speed limit, allowing for quick assessment of where high-speed zones coincide with crashes. Interactive popups on these markers provided crash-specific information like year, weather, and damage severity.

According to the National Highway Traffic Safety Administration (NHTSA), speeding contributed to 29% of all traffic fatalities in 2021, and crash likelihood increases significantly with speed, especially in adverse weather or uncontrolled intersections (NHTSA, 2022). This supports the patterns seen in the GIS plot: higher-speed zones not only had more crashes but also a higher share of injury-related incidents.

While the map offered significant insights, a few limitations remained. I could not implement a filter tool within the leaflet map to toggle different years or damage types dynamically. Including such a feature would enhance interactivity and let users explore temporal trends. Additionally, the traffic control data had too many unique categories, limiting its usefulness without further grouping or simplification. Hence I could not use it appropriately.

Sources:

For image: https://www.istockphoto.com/photo/car-crash-with-police-gm660523594-120581843
Leaflt help: RPubs - Build a HeatMap in R using the {leaflet} and {leaflet.extras} packages. (n.d.). https://rpubs.com/nomarpicasso/1158007
Additional background research: https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813582.pdf
Data set used for this project: Data.gov. (2025, May 10). Montgomery County of Maryland - Crash reporting - Drivers data. https://catalog.data.gov/dataset/crash-reporting-drivers-data