Crash Reporting - Incidents Data is a dataset of 116,795 rows (about 117k) and 37 columns, in which each row represents one collision. It provides details of all traffic collisions occurring on county and local roadways within Montgomery County. The information is collected through the Automated Crash Reporting System (ACRS) of the Maryland State Police and reported by the Montgomery County Police, Gaithersburg Police, Rockville Police, or the Maryland-National Capital Park Police. The variables I will use from this dataset are acrs_report_type and weather.
Variables (Categorical):
acrs_report_type: Identifies the crash as a property damage, injury, or fatal crash.
weather: Weather condition at the collision location.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(dplyr)
library(lubridate)
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
setwd("~/Desktop/Project 2")
df <- read_csv("Crash_Reporting_-_Incidents_Data_20251114.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 116795 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (33): Report Number, Agency Name, ACRS Report Type, Crash Date/Time, Hit...
## dbl (4): Local Case Number, Distance, Latitude, Longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This analysis focuses on the type of crash reported (acrs_report_type) and the weather condition at the time of the collision. I first check for missing values in the columns I will use, then rename the columns for uniformity and clarity. To inspect the columns, I use ‘unique’ to extract the distinct categories under ‘weather’ and ‘acrs_report_type’, which shows which weather conditions and crash types were reported. The output reveals that the same weather condition appears under differently formatted labels, for example “Cloudy” and “CLOUDY”. To clean the data, I create a new column called ‘weather_conditions’ and use ‘case_when’ to assign each variant a single consistent label. I then convert the column from character to factor, which fixes the set and order of categories used for counting and plotting. Finally, I use bar plots to visualize the two categorical variables and the relationship between them.
head(df)
## # A tibble: 6 × 37
## `Report Number` `Local Case Number` `Agency Name` `ACRS Report Type`
## <chr> <dbl> <chr> <chr>
## 1 MCP157500DR 250050714 MONTGOMERY Property Damage Crash
## 2 EJ78980079 250050708 GAITHERSBURG Property Damage Crash
## 3 MCP3208007G 250050696 MONTGOMERY Property Damage Crash
## 4 MCP3084007J 250050635 MONTGOMERY Property Damage Crash
## 5 MCP27640040 250050632 MONTGOMERY Property Damage Crash
## 6 MCP2987009Z 250050636 MONTGOMERY Property Damage Crash
## # ℹ 33 more variables: `Crash Date/Time` <chr>, `Hit/Run` <chr>,
## # `Route Type` <chr>, `Lane Direction` <chr>, `Lane Type` <chr>,
## # `Number of Lanes` <chr>, Direction <chr>, Distance <dbl>,
## # `Distance Unit` <chr>, `Road Grade` <chr>, `Road Name` <chr>,
## # `Cross-Street Name` <chr>, `Off-Road Description` <chr>,
## # Municipality <chr>, `Related Non-Motorist` <chr>, `At Fault` <chr>,
## # `Collision Type` <chr>, Weather <chr>, `Surface Condition` <chr>, …
colSums(is.na(df))
## Report Number Local Case Number
## 0 14
## Agency Name ACRS Report Type
## 0 0
## Crash Date/Time Hit/Run
## 0 3629
## Route Type Lane Direction
## 15441 14826
## Lane Type Number of Lanes
## 89288 12339
## Direction Distance
## 14783 12942
## Distance Unit Road Grade
## 12327 14770
## Road Name Cross-Street Name
## 17102 25218
## Off-Road Description Municipality
## 102023 31656
## Related Non-Motorist At Fault
## 110187 0
## Collision Type Weather
## 0 0
## Surface Condition Light
## 14881 0
## Traffic Control Driver Substance Abuse
## 2408 735
## Non-Motorist Substance Abuse First Harmful Event
## 110187 0
## Second Harmful Event Junction
## 14062 15251
## Intersection Type Road Alignment
## 26375 14769
## Road Condition Road Division
## 16714 14768
## Latitude Longitude
## 0 0
## Location
## 0
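One thing to keep in mind when reading the tally above: `is.na()` counts only true missing values. Placeholder strings such as "N/A" or "Unknown", which do occur in the weather column, are ordinary text and are not counted. A minimal sketch on a made-up toy vector (the vector and the `placeholder` name are illustrative assumptions, not part of the original analysis):

```r
# Toy vector for illustration only; not the real weather column.
weather <- c("Clear", "N/A", "UNKNOWN", "Rain", NA)

# is.na() sees only the genuine NA at the end.
n_true_na <- sum(is.na(weather))

# Treat placeholder strings as missing too (the list of placeholders is an assumption).
placeholder <- c("N/A", "Unknown", "UNKNOWN")
n_effectively_missing <- sum(is.na(weather) | weather %in% placeholder)

print(c(true_na = n_true_na, effectively_missing = n_effectively_missing))
```

Counting placeholders separately gives a more honest picture of how many rows lack a usable weather value.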
# Clean variable names
names(df) <- gsub("[(). \\-]", "_", names(df)) #Replace ".", "(", ")", spaces, and hyphens with underscores ("/" is left unchanged)
names(df) <- gsub("_$", "", names(df)) #Remove trailing underscore
names(df) <- tolower(names(df)) #Lowercase
head(df)
## # A tibble: 6 × 37
## report_number local_case_number agency_name acrs_report_type `crash_date/time`
## <chr> <dbl> <chr> <chr> <chr>
## 1 MCP157500DR 250050714 MONTGOMERY Property Damage… 11/12/2025 08:00…
## 2 EJ78980079 250050708 GAITHERSBU… Property Damage… 11/12/2025 05:50…
## 3 MCP3208007G 250050696 MONTGOMERY Property Damage… 11/12/2025 01:20…
## 4 MCP3084007J 250050635 MONTGOMERY Property Damage… 11/11/2025 02:36…
## 5 MCP27640040 250050632 MONTGOMERY Property Damage… 11/11/2025 02:30…
## 6 MCP2987009Z 250050636 MONTGOMERY Property Damage… 11/11/2025 01:40…
## # ℹ 32 more variables: `hit/run` <chr>, route_type <chr>, lane_direction <chr>,
## # lane_type <chr>, number_of_lanes <chr>, direction <chr>, distance <dbl>,
## # distance_unit <chr>, road_grade <chr>, road_name <chr>,
## # cross_street_name <chr>, off_road_description <chr>, municipality <chr>,
## # related_non_motorist <chr>, at_fault <chr>, collision_type <chr>,
## # weather <chr>, surface_condition <chr>, light <chr>, traffic_control <chr>,
## # driver_substance_abuse <chr>, non_motorist_substance_abuse <chr>, …
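The output above still shows `crash_date/time` and `hit/run`, because "/" is not in the character class used for renaming. If fully syntactic names are wanted, one possible approach is to replace every run of non-alphanumeric characters. The sketch below runs on a few column names copied from the header rather than on `df`; the names `raw_names` and `clean` are illustrative.

```r
# A few original column names, copied from the dataset header for illustration.
raw_names <- c("Crash Date/Time", "Hit/Run", "Cross-Street Name", "ACRS Report Type")

clean <- gsub("[^A-Za-z0-9]+", "_", raw_names)  # any run of non-alphanumerics becomes "_"
clean <- gsub("^_|_$", "", clean)               # drop a leading or trailing underscore
clean <- tolower(clean)                         # lowercase

print(clean)
```

With names like these, the columns can be used without backticks in later code.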
unique(df$acrs_report_type)
## [1] "Property Damage Crash" "Injury Crash" "Fatal Crash"
table(df$acrs_report_type)
##
## Fatal Crash Injury Crash Property Damage Crash
## 365 39484 76946
table(df$weather)
##
## BLOWING SAND, SOIL, DIRT Blowing Snow
## 7 40
## BLOWING SNOW Clear
## 68 15141
## CLEAR Cloudy
## 65516 1673
## CLOUDY Fog, Smog, Smoke
## 9609 38
## FOGGY Freezing Rain Or Freezing Drizzle
## 429 33
## N/A OTHER
## 7948 214
## Rain RAINING
## 2015 11674
## Severe Crosswinds SEVERE WINDS
## 17 93
## SLEET Sleet Or Hail
## 130 7
## Snow SNOW
## 175 877
## Unknown UNKNOWN
## 190 649
## WINTRY MIX
## 252
#This column has multiple labels for the same category in different forms, e.g. "Clear" and "CLEAR"
unique(df$weather)
## [1] "Clear" "Cloudy"
## [3] "Rain" "Unknown"
## [5] "Severe Crosswinds" "Fog, Smog, Smoke"
## [7] "Freezing Rain Or Freezing Drizzle" "Snow"
## [9] "Blowing Snow" "Sleet Or Hail"
## [11] "CLOUDY" "CLEAR"
## [13] "N/A" "UNKNOWN"
## [15] "RAINING" "FOGGY"
## [17] "OTHER" "SLEET"
## [19] "WINTRY MIX" "SNOW"
## [21] "SEVERE WINDS" "BLOWING SNOW"
## [23] "BLOWING SAND, SOIL, DIRT"
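Labels that differ only by letter case can also be found programmatically rather than by eye. The following is a small sketch on a subset of the labels listed above (the variable names are illustrative). It also shows why an explicit mapping is still needed: pairs such as "Rain" and "RAINING" differ by more than case and are not caught this way.

```r
# Subset of the weather labels shown above, for illustration.
labels <- c("Clear", "CLEAR", "Cloudy", "CLOUDY", "Rain", "RAINING", "Snow", "SNOW")

# Compare labels after uppercasing; duplicated keys mark case-only variants.
key <- toupper(labels)
dup_keys <- unique(key[duplicated(key)])

# Group the original spellings under each duplicated key.
case_variants <- split(labels[key %in% dup_keys], key[key %in% dup_keys])
print(case_variants)
```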
#Organize labels to avoid repetition
df <- df |>
mutate(
weather_conditions = case_when(
weather == "BLOWING SAND, SOIL, DIRT" ~ "Blowing sand, soil, dirt",
weather == "OTHER" ~ "Other",
weather == "WINTRY MIX" ~ "Wintry mix",
weather %in% c("Unknown", "UNKNOWN", "N/A") ~ "Unknown",
weather %in% c("Blowing Snow", "BLOWING SNOW") ~ "Blowing Snow",
weather %in% c("Clear", "CLEAR") ~ "Clear",
weather %in% c("Cloudy", "CLOUDY") ~ "Cloudy",
weather %in% c("Fog, Smog, Smoke", "FOGGY") ~ "Fog, Smog, Smoke",
weather %in% c("Rain", "RAINING") ~ "Rain",
weather %in% c("Severe Crosswinds", "SEVERE WINDS") ~ "Severe winds",
weather %in% c("SLEET", "Sleet Or Hail") ~ "Sleet or Hail",
weather %in% c("Snow", "SNOW") ~ "Snow",
TRUE ~ as.character(weather) #keep as-is if not matched
),
#Convert to factor instead of a character
weather_conditions = factor(weather_conditions,
levels = c("Blowing sand, soil, dirt", "Unknown", "Blowing Snow",
"Clear", "Cloudy", "Fog, Smog, Smoke", "Rain",
"Severe winds", "Sleet or Hail", "Snow",
"Freezing Rain Or Freezing Drizzle", "Wintry mix", "Other"))
)
table(df$weather_conditions)
##
## Blowing sand, soil, dirt Unknown
## 7 8787
## Blowing Snow Clear
## 108 80657
## Cloudy Fog, Smog, Smoke
## 11282 467
## Rain Severe winds
## 13689 110
## Sleet or Hail Snow
## 137 1052
## Freezing Rain Or Freezing Drizzle Wintry mix
## 33 252
## Other
## 214
head(df)
## # A tibble: 6 × 38
## report_number local_case_number agency_name acrs_report_type `crash_date/time`
## <chr> <dbl> <chr> <chr> <chr>
## 1 MCP157500DR 250050714 MONTGOMERY Property Damage… 11/12/2025 08:00…
## 2 EJ78980079 250050708 GAITHERSBU… Property Damage… 11/12/2025 05:50…
## 3 MCP3208007G 250050696 MONTGOMERY Property Damage… 11/12/2025 01:20…
## 4 MCP3084007J 250050635 MONTGOMERY Property Damage… 11/11/2025 02:36…
## 5 MCP27640040 250050632 MONTGOMERY Property Damage… 11/11/2025 02:30…
## 6 MCP2987009Z 250050636 MONTGOMERY Property Damage… 11/11/2025 01:40…
## # ℹ 33 more variables: `hit/run` <chr>, route_type <chr>, lane_direction <chr>,
## # lane_type <chr>, number_of_lanes <chr>, direction <chr>, distance <dbl>,
## # distance_unit <chr>, road_grade <chr>, road_name <chr>,
## # cross_street_name <chr>, off_road_description <chr>, municipality <chr>,
## # related_non_motorist <chr>, at_fault <chr>, collision_type <chr>,
## # weather <chr>, surface_condition <chr>, light <chr>, traffic_control <chr>,
## # driver_substance_abuse <chr>, non_motorist_substance_abuse <chr>, …
weather_values <- df |>
group_by(weather_conditions) |>
summarize(Count = n())
print(weather_values)
## # A tibble: 13 × 2
## weather_conditions Count
## <fct> <int>
## 1 Blowing sand, soil, dirt 7
## 2 Unknown 8787
## 3 Blowing Snow 108
## 4 Clear 80657
## 5 Cloudy 11282
## 6 Fog, Smog, Smoke 467
## 7 Rain 13689
## 8 Severe winds 110
## 9 Sleet or Hail 137
## 10 Snow 1052
## 11 Freezing Rain Or Freezing Drizzle 33
## 12 Wintry mix 252
## 13 Other 214
#Convert to factor, with levels ordered from least to most severe
df$acrs_report_type <- factor(df$acrs_report_type,
levels = c("Property Damage Crash", "Injury Crash", "Fatal Crash"))
Side by side bar plot of weather conditions for different ACRS report types.
library(ggplot2)
ggplot(df, aes(x =acrs_report_type, fill = weather_conditions)) +
geom_bar(position = "dodge") + #side by side bar plot
labs(title = "Weather conditions of different reports",
x = "ACRS Report Type",
y = "Count") +
theme_minimal()
Bar plot of ACRS Report Type
barplot(table(df$acrs_report_type),
main = "acrs_report_type Count",
xlab = "acrs_report_type",
ylab = "Count",
col = "SkyBlue")
Bar plot of the count of types of weather conditions reported.
#Change margin size to fit weather condition labels
par(mar = c(10, 5, 5, 4) + 0.1)
barplot(table(df$weather_conditions),
main = "Weather Condition Count",
xlab = "Weather Conditions",
ylab = "Count",
col = "SkyBlue",
cex.names = .7, #change size of names
las = 2 #rotate the text to fit the axis
)
In this statistical analysis, I apply the Chi-Square test of independence, a hypothesis test used to investigate whether two categorical variables, here weather and acrs_report_type, are related. I want to explore whether there is an association between the weather conditions and the types of crashes. The null hypothesis states that there is no association between weather conditions and ACRS report type; the alternative hypothesis states that there is an association. To run the test, I created a contingency table called ‘observed_dataset’ and then performed a Chi-Square test on it.
observed_dataset<- table(df$weather_conditions, df$acrs_report_type)
observed_dataset
##
## Property Damage Crash Injury Crash
## Blowing sand, soil, dirt 3 4
## Unknown 6198 2572
## Blowing Snow 72 36
## Clear 53149 27226
## Cloudy 7153 4098
## Fog, Smog, Smoke 316 149
## Rain 8803 4856
## Severe winds 72 36
## Sleet or Hail 95 42
## Snow 747 305
## Freezing Rain Or Freezing Drizzle 23 10
## Wintry mix 171 80
## Other 144 70
##
## Fatal Crash
## Blowing sand, soil, dirt 0
## Unknown 17
## Blowing Snow 0
## Clear 282
## Cloudy 31
## Fog, Smog, Smoke 2
## Rain 30
## Severe winds 2
## Sleet or Hail 0
## Snow 0
## Freezing Rain Or Freezing Drizzle 0
## Wintry mix 1
## Other 0
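Because the raw counts are dominated by the Clear row, it can help to compare the share of each crash type within a weather condition. The sketch below uses `prop.table` on three rows copied from the table above; the object names `counts` and `row_share` are illustrative, and applying the same call to the full `observed_dataset` should work the same way.

```r
# Three rows copied from the contingency table above, for illustration.
counts <- matrix(
  c(53149, 27226, 282,
     8803,  4856,  30,
      747,   305,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    weather = c("Clear", "Rain", "Snow"),
    report  = c("Property Damage Crash", "Injury Crash", "Fatal Crash")
  )
)

# margin = 1 divides each row by its row total, so every row sums to 1.
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 4))
```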
Hypotheses
\(H_0\) : Weather is not associated with ACRS (Automated Crash Reporting System) report type.
\(H_a\) : Weather is associated with ACRS (Automated Crash Reporting System) report type.
chi<- chisq.test(observed_dataset)
## Warning in chisq.test(observed_dataset): Chi-squared approximation may be
## incorrect
chi
##
## Pearson's Chi-squared test
##
## data: observed_dataset
## X-squared = 170.8, df = 24, p-value < 2.2e-16
#check expected counts
chi$expected
##
## Property Damage Crash Injury Crash
## Blowing sand, soil, dirt 4.611687 2.366437
## Unknown 5788.984991 2970.554459
## Blowing Snow 71.151745 36.510741
## Clear 53137.835712 27267.100372
## Cloudy 7432.722051 3814.020189
## Fog, Smog, Smoke 307.665414 157.875149
## Rain 9018.483617 4627.736427
## Severe winds 72.469369 37.186866
## Sleet or Hail 90.257306 46.314551
## Snow 693.070697 355.641663
## Freezing Rain Or Freezing Drizzle 21.740811 11.156060
## Wintry mix 166.020737 85.191729
## Other 140.985864 72.345357
##
## Fatal Crash
## Blowing sand, soil, dirt 0.02187594
## Unknown 27.46055054
## Blowing Snow 0.33751445
## Clear 252.06391541
## Cloudy 35.25775932
## Fog, Smog, Smoke 1.45943748
## Rain 42.77995633
## Severe winds 0.34376472
## Sleet or Hail 0.42814333
## Snow 3.28764074
## Freezing Rain Or Freezing Drizzle 0.10312941
## Wintry mix 0.78753371
## Other 0.66877863
#Chi-squared value
chi$statistic
## X-squared
## 170.803
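The X-squared statistic is 170.80. It is not a raw difference between the observed and expected counts: it is the sum, over all 39 cells of the table, of (observed - expected)^2 / expected, so larger values mean the observed counts depart further from what independence would predict.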
Expected value = (row total * column total)/Sample size
Degrees of Freedom: df = (13-1)*(3-1) = 24
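For reference, Pearson's statistic can be written out as follows. The symbols \(O_{ij}\), \(E_{ij}\), \(R_i\), \(C_j\), and \(N\) are notation introduced here for illustration: the observed and expected counts in row \(i\) and column \(j\), the row and column totals, and the sample size.

\[
\chi^2 = \sum_{i=1}^{13} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
\]

As a worked check for the Clear row and the Fatal Crash column, with \(R = 80657\), \(C = 365\), and \(N = 116795\):

\[
E = \frac{80657 \times 365}{116795} \approx 252.06,
\qquad
\frac{(282 - 252.06)^2}{252.06} \approx 3.56
\]

The expected count agrees with the value in chi$expected, and 3.56 is that single cell's contribution to the total of 170.80.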
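The p-value (< 2.2e-16) is less than the significance level of 0.05, so there is enough evidence to reject the null hypothesis. Therefore, we conclude that there is a significant association between weather and ACRS report type. One caveat: R warns that the Chi-squared approximation may be incorrect, because 11 of the 39 expected counts are below 5 (mostly in the Fatal Crash column for rare weather conditions), so the exact p-value should be read with some caution.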
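The warning printed by `chisq.test` can be examined more directly. The sketch below rebuilds the contingency table from the counts printed earlier (the object names `observed`, `res`, and `sim` are illustrative), counts the cells whose expected count is below 5, and reruns the test with a Monte Carlo p-value through the `simulate.p.value` argument. This is one common alternative when the approximation is in doubt; it is not part of the original analysis.

```r
# Contingency table rebuilt from the printed output above
# (columns: Property Damage Crash, Injury Crash, Fatal Crash).
observed <- matrix(
  c(    3,     4,   0,
     6198,  2572,  17,
       72,    36,   0,
    53149, 27226, 282,
     7153,  4098,  31,
      316,   149,   2,
     8803,  4856,  30,
       72,    36,   2,
       95,    42,   0,
      747,   305,   0,
       23,    10,   0,
      171,    80,   1,
      144,    70,   0),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    weather = c("Blowing sand, soil, dirt", "Unknown", "Blowing Snow", "Clear",
                "Cloudy", "Fog, Smog, Smoke", "Rain", "Severe winds",
                "Sleet or Hail", "Snow", "Freezing Rain Or Freezing Drizzle",
                "Wintry mix", "Other"),
    report = c("Property Damage Crash", "Injury Crash", "Fatal Crash")
  )
)

# Standard test; the approximation warning is expected here, so it is suppressed.
res <- suppressWarnings(chisq.test(observed))
n_small <- sum(res$expected < 5)  # cells with expected count below 5

# Monte Carlo p-value: does not rely on the large-sample approximation.
set.seed(1)
sim <- chisq.test(observed, simulate.p.value = TRUE, B = 2000)

print(n_small)
print(sim$p.value)
```

Another option, under the same caveat, would be to merge the rare weather categories before testing so that fewer cells are sparse.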
In conclusion, the Chi-Square test of independence between weather and ACRS report type gave a statistically significant result: the X-squared statistic and p-value indicate an association between weather conditions and the type of crash reported in the dataset. The test shows association, not causation, and it does not identify which weather conditions drive the result. Future analysis could include other variables, such as the time of day and the lighting conditions at the time of the collision. Analyses of this kind can also inform traffic management, infrastructure planning, and safety interventions, for instance improving road surface conditions, adding traffic lights at locations where accidents are prone to occur, and promoting safety measures for driving in certain weather conditions. Overall, there is sufficient evidence to reject the null hypothesis: weather conditions are associated with the type of crash.