Project3__Crash_

Introduction

Crash incidents are complex events with devastating consequences for society. This document partially analyzes the issue through the following question: “How do weather and road condition influence the severity of crashes?”

This analysis uses a dataset provided by Montgomery County. The dataset contains 117,046 observations across 37 variables, and it is accessible through the following link: “https://data.montgomerycountymd.gov/Public-Safety/Crash-Reporting-Incidents-Data/bhju-22kf/data_preview”

Key variables include:
ACRS Report Type: type of crash reported
Road Grade: grade (levelness) of the road
Weather: meteorological condition
Surface Condition: condition of the road surface (wet or dry)
Road Condition: whether the road is damaged or not

library(pROC)

## Warning: package 'pROC' was built under R version 4.5.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

df <- read_csv("crash report.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 117046 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (33): Report Number, Agency Name, ACRS Report Type, Crash Date/Time, Hit...
## dbl  (4): Local Case Number, Distance, Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(df)

## spc_tbl_ [117,046 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Report Number               : chr [1:117046] "MCP3090009D" "MCP3163009G" "MCP3245007K" "MCP3090009C" ...
##  $ Local Case Number           : num [1:117046] 2.5e+08 2.5e+08 2.5e+08 2.5e+08 2.5e+08 ...
##  $ Agency Name                 : chr [1:117046] "MONTGOMERY" "MONTGOMERY" "MONTGOMERY" "MONTGOMERY" ...
##  $ ACRS Report Type            : chr [1:117046] "Injury Crash" "Property Damage Crash" "Property Damage Crash" "Injury Crash" ...
##  $ Crash Date/Time             : chr [1:117046] "11/19/2025 09:14:00 PM" "11/19/2025 08:56:00 PM" "11/19/2025 07:08:00 PM" "11/19/2025 07:00:00 PM" ...
##  $ Hit/Run                     : chr [1:117046] "No" "No" NA "No" ...
##  $ Route Type                  : chr [1:117046] "Maryland (State) Route" "County Route" "County Route" NA ...
##  $ Lane Direction              : chr [1:117046] "Northbound" "Westbound" "Southbound" NA ...
##  $ Lane Type                   : chr [1:117046] "Lane 3" "Lane 1" "Lane 2" NA ...
##  $ Number of Lanes             : chr [1:117046] "3" "1" "2" "0" ...
##  $ Direction                   : chr [1:117046] "North" "North" "North" NA ...
##  $ Distance                    : num [1:117046] 39.6 30.9 0 NA 0 ...
##  $ Distance Unit               : chr [1:117046] "FEET" "FEET" "FEET" "FEET" ...
##  $ Road Grade                  : chr [1:117046] "Level" "Level" "Level" NA ...
##  $ Road Name                   : chr [1:117046] "WOODFIELD RD" NA "MUDDY BRANCH RD (SB/L)" NA ...
##  $ Cross-Street Name           : chr [1:117046] NA NA "W DIAMOND AVE" NA ...
##  $ Off-Road Description        : chr [1:117046] NA NA NA "Alley        OFF THE ROADWAY IN BETWEEN 108 AND 112 DUVALL LANE" ...
##  $ Municipality                : chr [1:117046] NA NA NA NA ...
##  $ Related Non-Motorist        : chr [1:117046] NA NA NA NA ...
##  $ At Fault                    : chr [1:117046] "DRIVER" "DRIVER" "UNKNOWN" "DRIVER" ...
##  $ Collision Type              : chr [1:117046] "Sideswipe, Same Direction" "Single Vehicle" "Front to Rear" "Single Vehicle" ...
##  $ Weather                     : chr [1:117046] "Clear" "Clear" "Clear" "Clear" ...
##  $ Surface Condition           : chr [1:117046] "Dry" "Dry" "Dry" NA ...
##  $ Light                       : chr [1:117046] "Dark - Lighted" "Dark - Lighted" "Dark - Lighted" "Dark - Lighted" ...
##  $ Traffic Control             : chr [1:117046] "No Controls" "No Controls" "No Controls" NA ...
##  $ Driver Substance Abuse      : chr [1:117046] "Not Suspect of Alcohol Use, Not Suspect of Drug Use, Not Suspect of Alcohol Use, Not Suspect of Drug Use" "Not Suspect of Alcohol Use, Not Suspect of Drug Use" "Not Suspect of Alcohol Use, Not Suspect of Drug Use, Unknown, Unknown" "Not Suspect of Alcohol Use, Not Suspect of Drug Use" ...
##  $ Non-Motorist Substance Abuse: chr [1:117046] NA NA NA NA ...
##  $ First Harmful Event         : chr [1:117046] "Motor Vehicle In Transport" "Curb" "Motor Vehicle In Transport" "Tree (standing)" ...
##  $ Second Harmful Event        : chr [1:117046] "Curb" NA NA NA ...
##  $ Junction                    : chr [1:117046] "Non-Junction" "Non-Junction" "Intersection or Related" NA ...
##  $ Intersection Type           : chr [1:117046] NA NA "Angled/Skewed" NA ...
##  $ Road Alignment              : chr [1:117046] "Straight" "Straight" "Straight" NA ...
##  $ Road Condition              : chr [1:117046] "No Defects" "No Defects" "No Defects" NA ...
##  $ Road Division               : chr [1:117046] "Divided, Raised Median (curbed)" "Not Divided" "Not Divided" NA ...
##  $ Latitude                    : num [1:117046] 39.2 39 39.1 39.1 39 ...
##  $ Longitude                   : num [1:117046] -77.2 -77.1 -77.2 -77.2 -77.1 ...
##  $ Location                    : chr [1:117046] "(39.17404567, -77.15160367)" "(39.0180463, -77.08721188)" "(39.1399363, -77.20522086)" "(39.12917033, -77.20309038)" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Report Number` = col_character(),
##   ..   `Local Case Number` = col_double(),
##   ..   `Agency Name` = col_character(),
##   ..   `ACRS Report Type` = col_character(),
##   ..   `Crash Date/Time` = col_character(),
##   ..   `Hit/Run` = col_character(),
##   ..   `Route Type` = col_character(),
##   ..   `Lane Direction` = col_character(),
##   ..   `Lane Type` = col_character(),
##   ..   `Number of Lanes` = col_character(),
##   ..   Direction = col_character(),
##   ..   Distance = col_double(),
##   ..   `Distance Unit` = col_character(),
##   ..   `Road Grade` = col_character(),
##   ..   `Road Name` = col_character(),
##   ..   `Cross-Street Name` = col_character(),
##   ..   `Off-Road Description` = col_character(),
##   ..   Municipality = col_character(),
##   ..   `Related Non-Motorist` = col_character(),
##   ..   `At Fault` = col_character(),
##   ..   `Collision Type` = col_character(),
##   ..   Weather = col_character(),
##   ..   `Surface Condition` = col_character(),
##   ..   Light = col_character(),
##   ..   `Traffic Control` = col_character(),
##   ..   `Driver Substance Abuse` = col_character(),
##   ..   `Non-Motorist Substance Abuse` = col_character(),
##   ..   `First Harmful Event` = col_character(),
##   ..   `Second Harmful Event` = col_character(),
##   ..   Junction = col_character(),
##   ..   `Intersection Type` = col_character(),
##   ..   `Road Alignment` = col_character(),
##   ..   `Road Condition` = col_character(),
##   ..   `Road Division` = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double(),
##   ..   Location = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(df)

## # A tibble: 6 × 37
##   `Report Number` `Local Case Number` `Agency Name` `ACRS Report Type`   
##   <chr>                         <dbl> <chr>         <chr>                
## 1 MCP3090009D               250051999 MONTGOMERY    Injury Crash         
## 2 MCP3163009G               250051998 MONTGOMERY    Property Damage Crash
## 3 MCP3245007K               250051985 MONTGOMERY    Property Damage Crash
## 4 MCP3090009C               250051983 MONTGOMERY    Injury Crash         
## 5 MCP3163009F               250051980 MONTGOMERY    Property Damage Crash
## 6 MCP3441001P               250051974 MONTGOMERY    Property Damage Crash
## # ℹ 33 more variables: `Crash Date/Time` <chr>, `Hit/Run` <chr>,
## #   `Route Type` <chr>, `Lane Direction` <chr>, `Lane Type` <chr>,
## #   `Number of Lanes` <chr>, Direction <chr>, Distance <dbl>,
## #   `Distance Unit` <chr>, `Road Grade` <chr>, `Road Name` <chr>,
## #   `Cross-Street Name` <chr>, `Off-Road Description` <chr>,
## #   Municipality <chr>, `Related Non-Motorist` <chr>, `At Fault` <chr>,
## #   `Collision Type` <chr>, Weather <chr>, `Surface Condition` <chr>, …

unique(df$`ACRS Report Type`)

## [1] "Injury Crash"          "Property Damage Crash" "Fatal Crash"

Most of the dataset's variables are categorical, including those needed for the analysis. Consequently, data cleaning is performed to refine the data.

#changing some values to lowercase
df$Light<-tolower(df$`Light`)

df$`Light`<-ifelse(df$`Light`=="daylight", "Day", "Night")
#adjusting columns' name
names(df)<- gsub(" ", "_", names(df))

#subsetting the data 
data<-df|>
  select(ACRS_Report_Type, Light, Road_Condition, Surface_Condition, Weather)

data$Weather[data$Weather=="N/A"]<-NA
data$Surface_Condition[data$Surface_Condition=="N/A"]<-NA
#removing NAs
data <- data[!(is.na(data$ACRS_Report_Type) | is.na(data$Light)| is.na(data$Road_Condition)| is.na(data$Weather)|is.na(data$Surface_Condition)),]

colSums(is.na(data))

##  ACRS_Report_Type             Light    Road_Condition Surface_Condition 
##                 0                 0                 0                 0 
##           Weather 
##                 0

The code above restricts the data to the relevant observations. A subset containing crash severity, light, weather, surface condition, and road condition was created to simplify processing, and rows with a missing value in any of these variables were removed. However, more cleaning is still needed to reduce redundancy among categories.
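The missing-value handling described above can be sketched in base R on a small toy data frame. The values and the names `toy` and `toy_clean` are illustrative assumptions, not part of the crash dataset.

```r
# Toy data standing in for the crash subset (illustrative values only)
toy <- data.frame(
  Weather = c("Clear", "N/A", "Rain", NA),
  Surface_Condition = c("Dry", "Wet", "N/A", "Dry"),
  stringsAsFactors = FALSE
)

# Treat the literal string "N/A" as a missing value in every column.
# %in% never returns NA, so existing NAs are left untouched.
toy[] <- lapply(toy, function(col) {
  col[col %in% "N/A"] <- NA
  col
})

# Keep only the rows that have no missing value in any column
toy_clean <- toy[complete.cases(toy), , drop = FALSE]
toy_clean
```

Only the first row survives here, because every other row has a missing value in at least one column.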

The following chunks will focus on substituting some data points in order to merge certain categories.

The first chunk divides the report types into two categories. The variable initially had three categories: injury crash, property damage crash, and fatal crash. Since only a few observations are marked as fatal, injury and fatal crashes are merged into a single category labeled "Fatal", and property damage crashes are relabeled "Non-fatal". The report type now has two levels: Fatal (injury or fatal crash) and Non-fatal (property damage only).

data$ACRS_Report_Type<- gsub("Injury Crash" , "Fatal Crash", data$ACRS_Report_Type)

data$ACRS_Report_Type<-ifelse(data$ACRS_Report_Type=="Fatal Crash", "Fatal", "Non-fatal")
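As a minimal sketch, the same two-level recoding can be written in a single step with `%in%`. The vector `report_type` below holds made-up example values, and `involves_injury` is a hypothetical helper name.

```r
# Hypothetical example values using the three original categories
report_type <- c("Injury Crash", "Property Damage Crash",
                 "Fatal Crash", "Injury Crash")

# TRUE when the crash involved an injury or a death
involves_injury <- report_type %in% c("Injury Crash", "Fatal Crash")

# Same labels as in the document: "Fatal" groups injury and fatal crashes
severity <- ifelse(involves_injury, "Fatal", "Non-fatal")
table(severity)
```

Note that `%in%` maps a missing report type to "Non-fatal", so this form assumes missing values were removed beforehand, as is done in the document.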

The second chunk merges categories of the weather variable in the same way.

data$Weather<-tolower(data$Weather)


data$Weather<- gsub("raining" , "rain", data$Weather)
data$Weather<- gsub("freezing rain or freezing drizzle" , "rain", data$Weather)
data$Weather<- gsub("severe crosswinds" , "wind", data$Weather)
data$Weather<- gsub("severe winds" , "wind", data$Weather)
data$Weather<- gsub("unknown" , "other", data$Weather)
data$Weather<- gsub("fog, smog, smoke" , "fog", data$Weather)
data$Weather<- gsub("sleet or hail" , "other", data$Weather)
data$Weather<- gsub("wintry mix" , "other", data$Weather)
data$Weather<- gsub("blowing snow" , "snow", data$Weather)
data$Weather<- gsub("blowing sand, soil, dirt" , "other", data$Weather) 
data$Weather<- gsub("sleet" , "other", data$Weather)
data$Weather<- gsub("foggy" , "fog", data$Weather)

unique(data$Weather)

## [1] "clear"  "rain"   "cloudy" "other"  "fog"    "wind"   "snow"
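An alternative to the chained gsub() calls is a lookup table that matches whole values. The sketch below assumes the lowercase category names used in the chunk above; `weather_map` and `recode_weather` are hypothetical names introduced for illustration.

```r
# Lookup table: original (lowercase) category -> merged category
weather_map <- c(
  "raining" = "rain",
  "freezing rain or freezing drizzle" = "rain",
  "severe crosswinds" = "wind",
  "severe winds" = "wind",
  "unknown" = "other",
  "fog, smog, smoke" = "fog",
  "sleet or hail" = "other",
  "wintry mix" = "other",
  "blowing snow" = "snow",
  "blowing sand, soil, dirt" = "other",
  "sleet" = "other",
  "foggy" = "fog"
)

recode_weather <- function(x) {
  x <- tolower(x)
  # Replace only values that match a key exactly; others are kept as they are
  hit <- x %in% names(weather_map)
  x[hit] <- unname(weather_map[x[hit]])
  x
}

recode_weather(c("Raining", "Sleet or Hail", "CLEAR", "Foggy", NA))
```

Because matching is on whole values rather than substrings, the result does not depend on the order of the entries, unlike a sequence of gsub() calls where an early pattern can rewrite part of a longer category name.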

The following chunk merges the categories of surface condition in the same way.

data$Surface_Condition<-tolower(data$Surface_Condition)


data$Surface_Condition<- gsub("mud, dirt, gravel", "other", data$Surface_Condition)
data$Surface_Condition<- gsub("slush", "other", data$Surface_Condition)
data$Surface_Condition<- gsub("sand", "other", data$Surface_Condition)
data$Surface_Condition<- gsub("oil", "other", data$Surface_Condition)
data$Surface_Condition<- gsub("unknown", "other", data$Surface_Condition)
data$Surface_Condition<- gsub("water\\(standing/moving\\)" , "wet", data$Surface_Condition)
data$Surface_Condition<- gsub("water \\(standing, moving\\)", "wet", data$Surface_Condition)
# match the longer patterns first so that "ice" does not rewrite part of "ice/frost"
data$Surface_Condition<- gsub("ice/frost", "snow", data$Surface_Condition)
data$Surface_Condition<- gsub("snow/frost", "snow", data$Surface_Condition)
data$Surface_Condition<- gsub("ice", "snow", data$Surface_Condition)

unique(data$Surface_Condition)

## [1] "dry"   "wet"   "other" "snow"

Below are a table and a chart of reported crashes by weather condition. The chart shows the share of fatal and non-fatal crashes within each weather condition.

data.frame(prop.table(table(data$ACRS_Report_Type, data$Weather)))

##         Var1   Var2         Freq
## 1      Fatal  clear 0.2704196542
## 2  Non-fatal  clear 0.4647858208
## 3      Fatal cloudy 0.0407685863
## 4  Non-fatal cloudy 0.0638239998
## 5      Fatal    fog 0.0015450065
## 6  Non-fatal    fog 0.0031226539
## 7      Fatal  other 0.0025133556
## 8  Non-fatal  other 0.0072354176
## 9      Fatal   rain 0.0498754203
## 10 Non-fatal   rain 0.0840722889
## 11     Fatal   snow 0.0033511408
## 12 Non-fatal   snow 0.0074639045
## 13     Fatal   wind 0.0004025721
## 14 Non-fatal   wind 0.0006201787

prop.table(table(data$Weather))

## 
##       clear      cloudy         fog       other        rain        snow 
## 0.735205475 0.104592586 0.004667660 0.009748773 0.133947709 0.010815045 
##        wind 
## 0.001022751

ggplot(data, aes(x=Weather, fill=ACRS_Report_Type))+
  geom_bar( position="fill", color="black")+
  labs(x="Weather condition", y="Proportion", title="Distribution of crashes in regard to weather condition")+
  theme_minimal()

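These results suggest that inclement weather has little influence on crash severity: the share of crashes classified as fatal is broadly similar across weather categories, ranging from about 26% to 39%.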
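The within-category shares can be recovered from the joint proportions printed above. The sketch below re-enters those printed values by hand into a matrix named `joint` (a hypothetical name) and normalizes each column with prop.table().

```r
# Joint proportions copied from the table above (rows: severity, columns: weather)
joint <- matrix(
  c(0.2704196542, 0.4647858208,   # clear
    0.0407685863, 0.0638239998,   # cloudy
    0.0015450065, 0.0031226539,   # fog
    0.0025133556, 0.0072354176,   # other
    0.0498754203, 0.0840722889,   # rain
    0.0033511408, 0.0074639045,   # snow
    0.0004025721, 0.0006201787),  # wind
  nrow = 2,
  dimnames = list(
    c("Fatal", "Non-fatal"),
    c("clear", "cloudy", "fog", "other", "rain", "snow", "wind")
  )
)

# margin = 2 rescales each column so that it sums to 1,
# giving the share of each severity level within a weather category
within_weather <- prop.table(joint, margin = 2)
fatal_share <- within_weather["Fatal", ]
round(fatal_share, 3)
```

On the full dataset, the same quantity would come from `prop.table(table(data$ACRS_Report_Type, data$Weather), margin = 2)`.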

The next visual is a bar chart of the distribution of crashes by road condition. Road conditions are grouped into two categories: Defects, for roads with a reported defect, and No defects, for roads without defects.

data$Road_Condition<-tolower(data$Road_Condition)

data$Road_Condition<- ifelse(data$Road_Condition=="no defects", "No defects", "Defects")


prop.table(table(data$ACRS_Report_Type, data$Road_Condition))

##            
##                Defects No defects
##   Fatal     0.01516718 0.35370856
##   Non-fatal 0.03011675 0.60100752

ggplot(data, aes(x=Road_Condition))+
  geom_bar(fill="red", color="black")+
  labs(x="Road condition", y="Counts", title="Distribution of crashes in regard to Road condition")+
  theme_minimal()

The graph shows that most incidents occurred on roads without defects: approximately 95% of crashes happened on roads marked as no defects.

prop.table(table(data$ACRS_Report_Type, data$Surface_Condition))

##            
##                     dry       other        snow         wet
##   Fatal     0.291592771 0.001784374 0.004602378 0.070896213
##   Non-fatal 0.493531645 0.005679531 0.010651840 0.121261248

ggplot(data, aes(x=Surface_Condition, fill=ACRS_Report_Type))+
  geom_bar( position="fill", color="black")+
  labs(x="Surface condition", y="Proportion", title="Distribution of crashes in regard to road surface")+
  theme_minimal()

data$ACRS_Report_Type<-as.factor(data$ACRS_Report_Type)
data$Surface_Condition<-as.factor(data$Surface_Condition)
data$Weather<-as.factor(data$Weather)
data$Light<-as.factor(data$Light)
data$Road_Condition<-as.factor(data$Road_Condition)

Predicting crash severity (fatal vs non-fatal) using logistic regression

#model creation

model<- glm(ACRS_Report_Type~Weather+Surface_Condition+Light, data=data, family ="binomial")
summary(model)

## 
## Call:
## glm(formula = ACRS_Report_Type ~ Weather + Surface_Condition + 
##     Light, family = "binomial", data = data)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             0.490795   0.009407  52.175  < 2e-16 ***
## Weathercloudy          -0.082041   0.023557  -3.483 0.000496 ***
## Weatherfog              0.083704   0.105613   0.793 0.428040    
## Weatherother            0.282088   0.083864   3.364 0.000769 ***
## Weatherrain            -0.075127   0.036797  -2.042 0.041186 *  
## Weathersnow             0.072638   0.084996   0.855 0.392768    
## Weatherwind            -0.177101   0.212023  -0.835 0.403555    
## Surface_Conditionother  0.502374   0.095238   5.275 1.33e-07 ***
## Surface_Conditionsnow   0.215632   0.072861   2.960 0.003081 ** 
## Surface_Conditionwet    0.044131   0.031944   1.382 0.167117    
## LightNight              0.137230   0.014716   9.325  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 121017  on 91908  degrees of freedom
## Residual deviance: 120804  on 91898  degrees of freedom
## AIC: 120826
## 
## Number of Fisher Scoring iterations: 4

#pseudo rsquared
r_square <- 1 - (model$deviance/model$null.deviance)

r_square

## [1] 0.001767582

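#p-value of the likelihood-ratio test; df = number of estimated slope coefficients
p_v<- 1 - pchisq((model$null.deviance - model$deviance), df=model$df.null - model$df.residual)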
p_v

## [1] 0
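As a rough check, the pseudo R-squared and the likelihood-ratio test can be reproduced from the rounded deviances and degrees of freedom printed in the model summary. The variable names below are hypothetical, and the degrees of freedom are taken as the difference between the null and residual degrees of freedom.

```r
# Values copied from the summary output above (deviances are rounded there)
null_dev  <- 121017
resid_dev <- 120804
df_null   <- 91908
df_resid  <- 91898

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev
lr_df   <- df_null - df_resid

# lower.tail = FALSE returns the upper-tail probability directly,
# which avoids rounding to exactly 0 as 1 - pchisq(...) does
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Deviance-based pseudo R-squared
pseudo_r2 <- 1 - resid_dev / null_dev

p_value
pseudo_r2
```

The p-value is extremely small but positive, and the pseudo R-squared is below 0.2%, which is the combination of "statistically significant" and "explains very little" discussed in the text.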

Above is a logistic regression model intended to predict the severity of future reported crashes from weather, light, and road surface condition. Several categories of these variables are statistically significant.

Because ACRS_Report_Type is a factor whose first level is "Fatal", glm() treats "Fatal" as the reference outcome and models the log odds of a "Non-fatal" crash, so the coefficients should be read in that direction. Cloudy and rainy weather are statistically significant, with p-values of about .0005 and .041 respectively. Both have negative coefficients: relative to clear weather, they decrease the log odds of a non-fatal crash.

Night time (LightNight) and snow on the road surface (Surface_Conditionsnow) are also statistically significant. Both have positive coefficients: relative to daylight and a dry surface, they increase the log odds of a non-fatal crash.

Nevertheless, although the model is statistically significant, it explains only about 0.18% of the deviance in crash severity (pseudo R-squared of 0.0018).
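The direction of the coefficients depends on the order of the factor levels of the response. The sketch below uses simulated data (not the crash data) to illustrate that glm() models the second level of a factor response, and that setting the levels explicitly makes the modeled event unambiguous. All names and the simulation settings are assumptions.

```r
set.seed(1)
n <- 500
x <- rnorm(n)

# Simulate an outcome in which larger x makes "Fatal" more likely
p_fatal <- plogis(-0.5 + 1.5 * x)
y <- ifelse(runif(n) < p_fatal, "Fatal", "Non-fatal")

# Default factor: levels are sorted, so "Fatal" comes first and is
# treated as failure; the model describes the odds of "Non-fatal"
fit_default <- glm(factor(y) ~ x, family = binomial)

# Explicit levels: "Non-fatal" first, so the model describes the odds of "Fatal"
fit_fatal <- glm(factor(y, levels = c("Non-fatal", "Fatal")) ~ x,
                 family = binomial)

coef(fit_default)
coef(fit_fatal)
```

The two fits are the same model with opposite signs, so checking `levels()` of the response before interpreting a coefficient table is a cheap safeguard.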

data$ACRS_Report_Type <- ifelse(data$ACRS_Report_Type== "Fatal", 1, 0)

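# Predicted probabilities: fitted values on the response scale, i.e. the
# probability of the factor level that glm() modeled (the second level)
predicted.probs <- model$fitted.values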

# Predicted classes

predicted.classes <- ifelse(predicted.probs > 0.6, 1, 0)

# Confusion matrix

confusion_mat <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(data$ACRS_Report_Type, levels = c(0, 1))
)

confusion_mat

##          Actual
## Predicted     0     1
##         0    23    17
##         1 57983 33886

Reading class 1 as fatal and class 0 as non-fatal, as in the code above, the table contains 23 true negatives (predicted non-fatal, actually non-fatal) and 17 false negatives (predicted non-fatal, actually fatal). It also contains 33,886 true positives (predicted fatal, actually fatal) and 57,983 false positives (predicted fatal, actually non-fatal). Almost every observation (91,869 of 91,909) is assigned to class 1, so the 0.6 threshold barely separates the two outcomes.

TN <- 23
FP <- 57983
FN <- 17
TP <- 33886

# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
# Print results
cat("Accuracy:    ", accuracy, "\n")

## Accuracy:     0.368941

cat("Sensitivity: ", sensitivity, "\n")

## Sensitivity:  0.9994986

cat("Specificity: ", specificity, "\n")

## Specificity:  0.0003965107
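Reading the four cells directly from the confusion matrix, rather than retyping them, avoids transcription slips. The sketch below rebuilds the matrix printed earlier by hand, with rows as predicted classes and columns as actual classes, and derives the same three metrics from it.

```r
# Confusion matrix as printed above: rows = Predicted, columns = Actual
confusion_mat <- matrix(
  c(23, 57983,   # Actual 0: predicted 0, predicted 1
    17, 33886),  # Actual 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Index by name so each cell is tied to its meaning
TN <- confusion_mat["0", "0"]
FP <- confusion_mat["1", "0"]
FN <- confusion_mat["0", "1"]
TP <- confusion_mat["1", "1"]

accuracy    <- (TP + TN) / sum(confusion_mat)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With a table object produced by table(), the same name-based indexing applies unchanged.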

# ROC curve & AUC on full data
roc_obj <- roc(response = data$ACRS_Report_Type,
               predictor = model$fitted.values,
               levels = c(1, 0),
               direction = "<")  

# Print AUC value
auc_val <- auc(roc_obj); auc_val

## Area under the curve: 0.522

# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

The AUC is 0.52, barely above the 0.5 expected from random guessing, which indicates that the model can hardly distinguish between fatal and non-fatal cases. Specificity is very close to zero (about 0.0004) and sensitivity is very close to one (0.9995), so at this threshold the model assigns almost every crash to the fatal class. Accuracy is also low (0.37). Overall, the model has weak predictive power.
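To make the meaning of the AUC concrete, it can be computed without pROC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function `auc_rank` below is a hypothetical helper based on the rank-sum (Mann-Whitney) identity, shown on toy scores.

```r
# AUC from ranks: label is 1 for positive cases and 0 for negative cases
auc_rank <- function(score, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(score)  # average ranks, so tied scores count as one half
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: one of the four positive-negative pairs is ordered wrongly
auc_rank(score = c(0.1, 0.4, 0.35, 0.8), label = c(0, 0, 1, 1))
```

A score that carries no information, such as a constant, gives 0.5 under this definition, which is why a value of 0.52 indicates almost no discrimination.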

Conclusion

This document focused on the analysis of crash incidents in Montgomery County. It used an official dataset provided by the county and was centered on the following question: “How do weather and road condition influence the severity of crashes?” The results showed that inclement weather does not have a considerable impact on the severity of crash incidents. As shown in the first graph, the split between fatal and non-fatal crashes is broadly similar across weather categories. Moreover, road defects do not appear to account for many crashes: far more crash incidents were reported on roads without defects than on damaged roads.

To predict the severity of future incidents, a logistic regression model was created. The model used weather, light, and surface condition as predictors. The model was statistically significant, with a p-value close to zero. The predictors only slightly improved the fit, lowering the deviance from 121,017 (null) to 120,804 (residual). However, the model failed to predict the severity of crash incidents accurately: it explained only about 0.18% of the deviance in crash severity, and its low AUC (0.522) confirms that it cannot reliably separate fatal from non-fatal crashes. Weather and surface condition alone are not sufficient. Future research should incorporate other variables, such as speed, driver age, gender, or vehicle type, that further explain drivers’ behavior. Also, since the dataset contains a large number of missing values, these could be imputed with the mode of each variable instead of being dropped.
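A minimal sketch of the mode imputation suggested above, for a single categorical vector. The function `impute_mode` is a hypothetical helper and the input values are made up.

```r
# Replace missing values in a character vector with its most frequent value
impute_mode <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  if (length(counts) == 0) return(x) # nothing observed, nothing to impute
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

impute_mode(c("dry", NA, "wet", "dry", NA))
```

Mode imputation keeps every row but pushes more observations into the largest category, so it can further flatten differences between categories.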

Project3Crashreport

Carlos Dave Sidney

2025-12-03
