Project 2

Introduction

Is there a relationship between weather conditions and ACRS Report Type?

Crash Reporting - Incidents Data is a dataset of about 117k rows (116,795) and 37 columns, where each row represents one collision. It records details of traffic collisions occurring on county and local roadways within Montgomery County, Maryland. The information is collected through the Automated Crash Reporting System (ACRS) of the Maryland State Police and reported by the Montgomery County Police, Gaithersburg Police, Rockville Police, or the Maryland-National Capital Park Police. The variables I use from this dataset are acrs_report_type and weather.

https://data.montgomerycountymd.gov/Public-Safety/Crash-Reporting-Incidents-Data/bhju-22kf/about_data

Variables (Categorical):

acrs_report_type: Identifies the crash as property damage, injury, or fatal.
weather: Weather condition at the collision location.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.95 loaded

library(dplyr)
library(lubridate)
library(zoo)

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

setwd("~/Desktop/Project 2")
df <- read_csv("Crash_Reporting_-_Incidents_Data_20251114.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 116795 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (33): Report Number, Agency Name, ACRS Report Type, Crash Date/Time, Hit...
## dbl  (4): Local Case Number, Distance, Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Analysis

This data analysis focuses on the type of crash reported (acrs_report_type) and the weather condition at the time of the collision. I first check whether the columns I use contain any missing values. Then I rename the column titles for uniformity and clarity. To inspect the columns, I use functions such as ‘unique’ to extract the distinct categories under ‘weather’ and ‘acrs_report_type’, which shows the weather conditions and crash types reported in the dataset. The extracted values reveal that the same weather condition appears under differently formatted labels, for example “Cloudy” and “CLOUDY”. To clean this up, I create a new column called ‘weather_conditions’ and use the ‘case_when’ function to assign one consistent label to each condition. I then convert the new column from character to factor, which fixes the order of the categories in the summary counts and the plots. For visualization, I use bar plots to show the relationship between the two categorical variables.

head(df)

## # A tibble: 6 × 37
##   `Report Number` `Local Case Number` `Agency Name` `ACRS Report Type`   
##   <chr>                         <dbl> <chr>         <chr>                
## 1 MCP157500DR               250050714 MONTGOMERY    Property Damage Crash
## 2 EJ78980079                250050708 GAITHERSBURG  Property Damage Crash
## 3 MCP3208007G               250050696 MONTGOMERY    Property Damage Crash
## 4 MCP3084007J               250050635 MONTGOMERY    Property Damage Crash
## 5 MCP27640040               250050632 MONTGOMERY    Property Damage Crash
## 6 MCP2987009Z               250050636 MONTGOMERY    Property Damage Crash
## # ℹ 33 more variables: `Crash Date/Time` <chr>, `Hit/Run` <chr>,
## #   `Route Type` <chr>, `Lane Direction` <chr>, `Lane Type` <chr>,
## #   `Number of Lanes` <chr>, Direction <chr>, Distance <dbl>,
## #   `Distance Unit` <chr>, `Road Grade` <chr>, `Road Name` <chr>,
## #   `Cross-Street Name` <chr>, `Off-Road Description` <chr>,
## #   Municipality <chr>, `Related Non-Motorist` <chr>, `At Fault` <chr>,
## #   `Collision Type` <chr>, Weather <chr>, `Surface Condition` <chr>, …

Check for NA’s

colSums(is.na(df))

##                Report Number            Local Case Number 
##                            0                           14 
##                  Agency Name             ACRS Report Type 
##                            0                            0 
##              Crash Date/Time                      Hit/Run 
##                            0                         3629 
##                   Route Type               Lane Direction 
##                        15441                        14826 
##                    Lane Type              Number of Lanes 
##                        89288                        12339 
##                    Direction                     Distance 
##                        14783                        12942 
##                Distance Unit                   Road Grade 
##                        12327                        14770 
##                    Road Name            Cross-Street Name 
##                        17102                        25218 
##         Off-Road Description                 Municipality 
##                       102023                        31656 
##         Related Non-Motorist                     At Fault 
##                       110187                            0 
##               Collision Type                      Weather 
##                            0                            0 
##            Surface Condition                        Light 
##                        14881                            0 
##              Traffic Control       Driver Substance Abuse 
##                         2408                          735 
## Non-Motorist Substance Abuse          First Harmful Event 
##                       110187                            0 
##         Second Harmful Event                     Junction 
##                        14062                        15251 
##            Intersection Type               Road Alignment 
##                        26375                        14769 
##               Road Condition                Road Division 
##                        16714                        14768 
##                     Latitude                    Longitude 
##                            0                            0 
##                     Location 
##                            0

Rename column titles

# Clean variable names
names(df) <- gsub("[(). \\-]", "_", names(df)) #Replace ., (, ), space, and - with underscore ("/" is not matched)
names(df) <- gsub("_$", "", names(df))  #Remove trailing underscore
names(df) <- tolower(names(df))         #Lowercase

head(df)

## # A tibble: 6 × 37
##   report_number local_case_number agency_name acrs_report_type `crash_date/time`
##   <chr>                     <dbl> <chr>       <chr>            <chr>            
## 1 MCP157500DR           250050714 MONTGOMERY  Property Damage… 11/12/2025 08:00…
## 2 EJ78980079            250050708 GAITHERSBU… Property Damage… 11/12/2025 05:50…
## 3 MCP3208007G           250050696 MONTGOMERY  Property Damage… 11/12/2025 01:20…
## 4 MCP3084007J           250050635 MONTGOMERY  Property Damage… 11/11/2025 02:36…
## 5 MCP27640040           250050632 MONTGOMERY  Property Damage… 11/11/2025 02:30…
## 6 MCP2987009Z           250050636 MONTGOMERY  Property Damage… 11/11/2025 01:40…
## # ℹ 32 more variables: `hit/run` <chr>, route_type <chr>, lane_direction <chr>,
## #   lane_type <chr>, number_of_lanes <chr>, direction <chr>, distance <dbl>,
## #   distance_unit <chr>, road_grade <chr>, road_name <chr>,
## #   cross_street_name <chr>, off_road_description <chr>, municipality <chr>,
## #   related_non_motorist <chr>, at_fault <chr>, collision_type <chr>,
## #   weather <chr>, surface_condition <chr>, light <chr>, traffic_control <chr>,
## #   driver_substance_abuse <chr>, non_motorist_substance_abuse <chr>, …
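To see what the three renaming steps do on their own, here is a minimal sketch applied to a few column names taken from the output above. The vector `example_names` is only an illustration and is not part of the analysis.

```r
# Illustrative subset of the original column names (not the full data frame)
example_names <- c("Report Number", "Crash Date/Time", "Cross-Street Name")

cleaned <- gsub("[(). \\-]", "_", example_names)  # ., (, ), space, - become _
cleaned <- gsub("_$", "", cleaned)                # drop a trailing underscore
cleaned <- tolower(cleaned)                       # lowercase everything

print(cleaned)
# "report_number" "crash_date/time" "cross_street_name"
```

The slash survives because "/" is not in the character class, which is why `crash_date/time` and `hit/run` are printed in backticks. Adding "/" to the pattern would be one way to avoid that, at the cost of changing those two column names.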

Unique values of ACRS Report Type

unique(df$acrs_report_type)

## [1] "Property Damage Crash" "Injury Crash"          "Fatal Crash"

Frequency table of ACRS Report Type

table(df$acrs_report_type)

## 
##           Fatal Crash          Injury Crash Property Damage Crash 
##                   365                 39484                 76946

Frequency table of weather

table(df$weather)

## 
##          BLOWING SAND, SOIL, DIRT                      Blowing Snow 
##                                 7                                40 
##                      BLOWING SNOW                             Clear 
##                                68                             15141 
##                             CLEAR                            Cloudy 
##                             65516                              1673 
##                            CLOUDY                  Fog, Smog, Smoke 
##                              9609                                38 
##                             FOGGY Freezing Rain Or Freezing Drizzle 
##                               429                                33 
##                               N/A                             OTHER 
##                              7948                               214 
##                              Rain                           RAINING 
##                              2015                             11674 
##                 Severe Crosswinds                      SEVERE WINDS 
##                                17                                93 
##                             SLEET                     Sleet Or Hail 
##                               130                                 7 
##                              Snow                              SNOW 
##                               175                               877 
##                           Unknown                           UNKNOWN 
##                               190                               649 
##                        WINTRY MIX 
##                               252

#This column has multiple labels for the same category in different forms, e.g. "Clear" and "CLEAR"

Weather condition labels

unique(df$weather)

##  [1] "Clear"                             "Cloudy"                           
##  [3] "Rain"                              "Unknown"                          
##  [5] "Severe Crosswinds"                 "Fog, Smog, Smoke"                 
##  [7] "Freezing Rain Or Freezing Drizzle" "Snow"                             
##  [9] "Blowing Snow"                      "Sleet Or Hail"                    
## [11] "CLOUDY"                            "CLEAR"                            
## [13] "N/A"                               "UNKNOWN"                          
## [15] "RAINING"                           "FOGGY"                            
## [17] "OTHER"                             "SLEET"                            
## [19] "WINTRY MIX"                        "SNOW"                             
## [21] "SEVERE WINDS"                      "BLOWING SNOW"                     
## [23] "BLOWING SAND, SOIL, DIRT"

Standardize weather conditions

#Organize labels to avoid repetition
df <- df |>
  mutate(
    weather_conditions = case_when(
      weather == "BLOWING SAND, SOIL, DIRT" ~ "Blowing sand, soil, dirt",
      weather == "OTHER" ~ "Other",
      weather == "WINTRY MIX" ~ "Wintry mix",
      weather %in% c("Unknown", "UNKNOWN", "N/A") ~ "Unknown",
      weather %in% c("Blowing Snow", "BLOWING SNOW") ~ "Blowing Snow",
      weather %in% c("Clear", "CLEAR") ~ "Clear",
      weather %in% c("Cloudy", "CLOUDY") ~ "Cloudy",
      weather %in% c("Fog, Smog, Smoke", "FOGGY") ~ "Fog, Smog, Smoke",
      weather %in% c("Rain", "RAINING") ~ "Rain",
      weather %in% c("Severe Crosswinds", "SEVERE WINDS") ~ "Severe winds",
      weather %in% c("SLEET", "Sleet Or Hail") ~ "Sleet or Hail",
      weather %in% c("Snow", "SNOW") ~ "Snow",
    
      TRUE ~ as.character(weather)  #keep as-is if not matched
    ),
    #Convert to factor instead of a character
    weather_conditions = factor(weather_conditions,
                            levels = c("Blowing sand, soil, dirt", "Unknown", "Blowing Snow",
                                       "Clear", "Cloudy", "Fog, Smog, Smoke", "Rain", 
                                       "Severe winds", "Sleet or Hail", "Snow",
                                       "Freezing Rain Or Freezing Drizzle", "Wintry mix", "Other"))
  )

table(df$weather_conditions)

## 
##          Blowing sand, soil, dirt                           Unknown 
##                                 7                              8787 
##                      Blowing Snow                             Clear 
##                               108                             80657 
##                            Cloudy                  Fog, Smog, Smoke 
##                             11282                               467 
##                              Rain                      Severe winds 
##                             13689                               110 
##                     Sleet or Hail                              Snow 
##                               137                              1052 
## Freezing Rain Or Freezing Drizzle                        Wintry mix 
##                                33                               252 
##                             Other 
##                               214
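The same recoding idea can also be written as a lookup table. The sketch below is an alternative illustration in base R, not the code used in this report; `raw_weather` and `weather_lookup` are hypothetical names and only a few labels are included. It also shows a check worth running after `factor()`: any label missing from `levels` silently becomes NA.

```r
# Hypothetical sample of raw labels, mixing the two spellings seen in the data
raw_weather <- c("Clear", "CLEAR", "RAINING", "Rain", "FOGGY", "N/A")

# Named vector: names are raw labels, values are the standardized labels
weather_lookup <- c(
  "Clear" = "Clear", "CLEAR" = "Clear",
  "Rain" = "Rain", "RAINING" = "Rain",
  "FOGGY" = "Fog, Smog, Smoke",
  "N/A" = "Unknown"
)

# Look up each raw label, then convert to a factor with explicit levels
standardized <- unname(weather_lookup[raw_weather])
standardized <- factor(standardized,
                       levels = c("Unknown", "Clear", "Fog, Smog, Smoke", "Rain"))

# A label missing from the lookup or from `levels` would show up here as NA
n_missing <- sum(is.na(standardized))
print(table(standardized))
print(n_missing)
```

On the real data, the equivalent check would be `sum(is.na(df$weather_conditions))`, which should be 0 if every label is covered by the `levels` argument.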

Check the new column (weather_conditions) added to the dataframe

head(df)

## # A tibble: 6 × 38
##   report_number local_case_number agency_name acrs_report_type `crash_date/time`
##   <chr>                     <dbl> <chr>       <chr>            <chr>            
## 1 MCP157500DR           250050714 MONTGOMERY  Property Damage… 11/12/2025 08:00…
## 2 EJ78980079            250050708 GAITHERSBU… Property Damage… 11/12/2025 05:50…
## 3 MCP3208007G           250050696 MONTGOMERY  Property Damage… 11/12/2025 01:20…
## 4 MCP3084007J           250050635 MONTGOMERY  Property Damage… 11/11/2025 02:36…
## 5 MCP27640040           250050632 MONTGOMERY  Property Damage… 11/11/2025 02:30…
## 6 MCP2987009Z           250050636 MONTGOMERY  Property Damage… 11/11/2025 01:40…
## # ℹ 33 more variables: `hit/run` <chr>, route_type <chr>, lane_direction <chr>,
## #   lane_type <chr>, number_of_lanes <chr>, direction <chr>, distance <dbl>,
## #   distance_unit <chr>, road_grade <chr>, road_name <chr>,
## #   cross_street_name <chr>, off_road_description <chr>, municipality <chr>,
## #   related_non_motorist <chr>, at_fault <chr>, collision_type <chr>,
## #   weather <chr>, surface_condition <chr>, light <chr>, traffic_control <chr>,
## #   driver_substance_abuse <chr>, non_motorist_substance_abuse <chr>, …

Get the count of the different types of weather conditions

weather_values <- df |>
  group_by(weather_conditions) |>
  summarize(Count = n())

print(weather_values)

## # A tibble: 13 × 2
##    weather_conditions                Count
##    <fct>                             <int>
##  1 Blowing sand, soil, dirt              7
##  2 Unknown                            8787
##  3 Blowing Snow                        108
##  4 Clear                             80657
##  5 Cloudy                            11282
##  6 Fog, Smog, Smoke                    467
##  7 Rain                              13689
##  8 Severe winds                        110
##  9 Sleet or Hail                       137
## 10 Snow                               1052
## 11 Freezing Rain Or Freezing Drizzle    33
## 12 Wintry mix                          252
## 13 Other                               214

Factor acrs_report_type

df$acrs_report_type <- factor(df$acrs_report_type,
                              levels = c("Property Damage Crash", "Injury Crash", "Fatal Crash"))

Visualization

Side by side bar plot of weather conditions for different ACRS report types.

library(ggplot2)

ggplot(df, aes(x = acrs_report_type, fill = weather_conditions)) +
      geom_bar(position = "dodge") + #side by side bar plot
      labs(title = "Weather conditions of different reports",
           x = "ACRS Report Type",
           y = "Count") +
      theme_minimal()

Bar plot of ACRS Report Type

barplot(table(df$acrs_report_type),
        main = "acrs_report_type Count",
        xlab = "acrs_report_type",
        ylab = "Count",
        col = "SkyBlue")

Bar plot of the count of types of weather conditions reported.

#Change margin size to fit weather condition labels
par(mar = c(10, 5, 5, 4) + 0.1)

barplot(table(df$weather_conditions),
        main = "Weather Condition Count",
        xlab = "Weather Conditions",
        ylab = "Count",
        col = "SkyBlue",
        cex.names = .7, #change size of names
        las = 2 #rotate the text to fit the axis
        )

Statistical Analysis

In this statistical analysis, I apply the Chi-Square test of independence, a hypothesis test used to investigate a potential relationship between two categorical variables, here weather_conditions and acrs_report_type. I want to explore whether there is an association between weather conditions and the type of crash reported. The null hypothesis states that there is no association between weather conditions and ACRS report type; the alternative hypothesis states that there is an association. To run the test, I create a contingency table called ‘observed_dataset’ and then perform a Chi-Square test on it.

observed_dataset <- table(df$weather_conditions, df$acrs_report_type)
observed_dataset

##                                    
##                                     Property Damage Crash Injury Crash
##   Blowing sand, soil, dirt                              3            4
##   Unknown                                            6198         2572
##   Blowing Snow                                         72           36
##   Clear                                             53149        27226
##   Cloudy                                             7153         4098
##   Fog, Smog, Smoke                                    316          149
##   Rain                                               8803         4856
##   Severe winds                                         72           36
##   Sleet or Hail                                        95           42
##   Snow                                                747          305
##   Freezing Rain Or Freezing Drizzle                    23           10
##   Wintry mix                                          171           80
##   Other                                               144           70
##                                    
##                                     Fatal Crash
##   Blowing sand, soil, dirt                    0
##   Unknown                                    17
##   Blowing Snow                                0
##   Clear                                     282
##   Cloudy                                     31
##   Fog, Smog, Smoke                            2
##   Rain                                       30
##   Severe winds                                2
##   Sleet or Hail                               0
##   Snow                                        0
##   Freezing Rain Or Freezing Drizzle           0
##   Wintry mix                                  1
##   Other                                       0

Hypothesis

\(H_0\) : Weather is not associated with ACRS (Automated Crash Reporting System) report type.

\(H_a\) : Weather is associated with ACRS (Automated Crash Reporting System) report type.

chi <- chisq.test(observed_dataset)

## Warning in chisq.test(observed_dataset): Chi-squared approximation may be
## incorrect

chi

## 
##  Pearson's Chi-squared test
## 
## data:  observed_dataset
## X-squared = 170.8, df = 24, p-value < 2.2e-16

#check expected counts
chi$expected

##                                    
##                                     Property Damage Crash Injury Crash
##   Blowing sand, soil, dirt                       4.611687     2.366437
##   Unknown                                     5788.984991  2970.554459
##   Blowing Snow                                  71.151745    36.510741
##   Clear                                      53137.835712 27267.100372
##   Cloudy                                      7432.722051  3814.020189
##   Fog, Smog, Smoke                             307.665414   157.875149
##   Rain                                        9018.483617  4627.736427
##   Severe winds                                  72.469369    37.186866
##   Sleet or Hail                                 90.257306    46.314551
##   Snow                                         693.070697   355.641663
##   Freezing Rain Or Freezing Drizzle             21.740811    11.156060
##   Wintry mix                                   166.020737    85.191729
##   Other                                        140.985864    72.345357
##                                    
##                                      Fatal Crash
##   Blowing sand, soil, dirt            0.02187594
##   Unknown                            27.46055054
##   Blowing Snow                        0.33751445
##   Clear                             252.06391541
##   Cloudy                             35.25775932
##   Fog, Smog, Smoke                    1.45943748
##   Rain                               42.77995633
##   Severe winds                        0.34376472
##   Sleet or Hail                       0.42814333
##   Snow                                3.28764074
##   Freezing Rain Or Freezing Drizzle   0.10312941
##   Wintry mix                          0.78753371
##   Other                               0.66877863
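The warning from `chisq.test()` is typically raised when some expected counts are small. The sketch below, using four rows re-entered by hand from the observed table (`obs_sparse` is an illustrative subset), counts the cells with an expected value below 5 and then asks `chisq.test()` for a Monte Carlo p-value through its `simulate.p.value` argument. This is one possible way to handle sparse cells, not the approach taken in this report.

```r
# Four rows copied from the observed table; two of them have no fatal crashes
obs_sparse <- matrix(
  c(53149, 27226, 282,
     8803,  4856,  30,
       72,    36,   0,
       95,    42,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    weather = c("Clear", "Rain", "Blowing Snow", "Sleet or Hail"),
    report_type = c("Property Damage Crash", "Injury Crash", "Fatal Crash")
  )
)

# Expected counts under independence; the warning is suppressed on purpose here
expected <- suppressWarnings(chisq.test(obs_sparse))$expected
n_small <- sum(expected < 5)  # number of cells with an expected count below 5
print(n_small)

# Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(1)
sim_test <- chisq.test(obs_sparse, simulate.p.value = TRUE, B = 2000)
print(sim_test$p.value)
```

Another option, under the same caveat, would be to merge the rare weather categories into a single level before building the table.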

#Chi-squared value
chi$statistic

## X-squared 
##   170.803

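The X-squared statistic is 170.80. It is the sum, over all 39 cells of the table, of (observed - expected)^2 / expected, so a large value means the observed counts depart substantially from what independence would predict; it is not a raw difference of 170.80 between the observed and expected counts.
Expected value = (row total * column total) / sample size
Degrees of freedom: df = (13-1)*(3-1) = 24
The p-value (< 2.2e-16) is less than the significance level of 0.05, so there is enough evidence to reject the null hypothesis. Therefore, we conclude that there is a statistically significant association between weather and ACRS report type. Several expected counts in the Fatal Crash column are below 5, which is why R warns that the chi-squared approximation may be incorrect, so the exact p-value should be read with some caution.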
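As a check on the formulas above, the statistic can be computed by hand and compared with `chisq.test()`. The sketch below uses three rows re-entered from the observed table (`obs` is an illustrative subset, so its statistic differs from the value reported for the full table).

```r
# Three rows (Clear, Cloudy, Rain) copied from the observed table
obs <- matrix(
  c(53149, 27226, 282,
     7153,  4098,  31,
     8803,  4856,  30),
  nrow = 3, byrow = TRUE
)

# Expected count for each cell: (row total * column total) / sample size
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
manual_stat <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
manual_df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-squared distribution
manual_p <- pchisq(manual_stat, df = manual_df, lower.tail = FALSE)

# The built-in test should agree with the manual calculation
builtin <- chisq.test(obs)
print(c(manual = manual_stat, builtin = unname(builtin$statistic)))
```

For tables larger than 2x2, `chisq.test()` applies no continuity correction, so the two values should match.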

Conclusion and Future Directions

In conclusion, the Chi-Square test of independence between weather and ACRS report type gave a significant result: the X-squared statistic (170.80, df = 24) and the p-value (< 2.2e-16) indicate a statistically significant association between weather conditions and the type of crash reported in the dataset, so the null hypothesis is rejected. Future analysis could include other variables, such as the time of day and the lighting conditions at the time of the collision, to give a fuller picture of what is associated with crash type. This kind of analysis can also inform traffic management, infrastructure planning, and safety interventions. For instance, it could help prioritize road surface improvements, identify locations where crashes are frequent and additional traffic signals may help, and promote safer driving under particular weather conditions.