Introduction

Do driver substance abuse and road conditions predict whether a driver faces a severe injury in a vehicle crash in Montgomery County, Maryland?

Motor vehicle crashes occur under many conditions, and specific factors may increase the likelihood that a driver is severely injured. In this third project, I use the Crash Reporting Data dataset provided by the Montgomery County Open Data Portal. The dataset contains detailed information on drivers involved in vehicle crashes, with variables describing injury severity, road conditions, weather, vehicle characteristics, and whether alcohol or prohibited substances were involved.

The dataset contains 206,309 observations and 39 variables, and each row represents one driver involved in a crash. Because the dataset is large and contains many categorical variables relevant to crash outcomes, this analysis focuses on three key variables: Injury Severity, Driver Substance Abuse, and Surface Condition. These three variables map directly onto my research question and make it possible to assess whether adverse road conditions or prohibited substances are associated with an increased risk of severe injury. I use logistic regression to answer the research question because it estimates the probability of a severe injury from the selected predictors and expresses how each factor influences injury risk.

Data Analysis

In this section, I cleaned and explored the dataset to prepare it for logistic regression. I started by inspecting the structure and first rows of the data with head() and str() to understand the variables and their formats. To check data quality, I counted missing values with colSums(is.na()) and used filter() to drop any rows with NA in my three key variables. The check found no NA values in those variables, so all 206,309 rows were kept; blank strings ("") are not counted as NA and remain in the data. I also renamed the three variables, replacing periods with underscores to avoid errors in later steps, and reviewed the unique categories of each one. I then created binary versions of injury severity, substance involvement, and surface condition with ifelse(), and converted them to factors with mutate() so they were ready for modeling. I used select() and summary() to produce descriptive summaries of the binary predictors and outcome. Lastly, I used xtabs() to cross-tabulate injury severity against each predictor. Together, these steps prepared the dataset for building a logistic regression model.

Opening the Dataset

library(ggplot2)
library(cowplot)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ lubridate::stamp() masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
crash_data <- read.csv("Crash_report_data.csv")
head(crash_data)
##   Report.Number Local.Case.Number Agency.Name      ACRS.Report.Type
## 1   MCP3126006X         250037402  MONTGOMERY          Injury Crash
## 2   MCP2349001B         250037516  MONTGOMERY Property Damage Crash
## 3   MCP296500BC         250033157  MONTGOMERY Property Damage Crash
## 4   MCP2159003K         250037509  MONTGOMERY Property Damage Crash
## 5   MCP312900D6         250034573  MONTGOMERY Property Damage Crash
## 6   MCP284600BN         250037004  MONTGOMERY          Injury Crash
##          Crash.Date.Time             Route.Type               Road.Name
## 1 08/21/2025 05:21:00 PM Maryland (State) Route                        
## 2 08/22/2025 10:44:00 AM     Interstate (State) EISENHOWER MEMORIAL HWY
## 3 07/25/2025 11:55:00 AM          Bicycle Route                        
## 4 08/22/2025 10:36:00 AM Maryland (State) Route                        
## 5 08/03/2025 02:10:00 PM                                               
## 6 08/19/2025 09:50:00 AM           County Route            GRAND PRE RD
##                                                   Cross.Street.Name
## 1                                                                  
## 2                                                                  
## 3 NEW HAMPSHIRE AVE (SB/L) NORBECK RD (WB/L) SPENCERVILLE RD (WB/L)
## 4                                                                  
## 5                                                                  
## 6                                                                  
##                                                                 Off.Road.Description
## 1                                                                                   
## 2                                                                                   
## 3                                                                                   
## 4                                                                                   
## 5 Parking Lot Way        PARKING LOT OF 2741 UNIVERSITY BLVD W, KENSINGTON MD, 20895
## 6                                                                                   
##   Municipality Related.Non.Motorist            Collision.Type Weather
## 1                                               Front to Rear   Clear
## 2                                              Single Vehicle   Clear
## 3                                   Sideswipe, Same Direction   Clear
## 4                                               Front to Rear   Clear
## 5                                                Rear To Side   Clear
## 6                        Pedestrian            Single Vehicle    Rain
##   Surface.Condition    Light                 Traffic.Control
## 1               Dry Daylight                     No Controls
## 2               Dry Daylight                     No Controls
## 3               Dry Daylight          Traffic Control Signal
## 4               Dry Daylight Flashing Traffic Control Signal
## 5                   Daylight                                
## 6               Wet Daylight                     No Controls
##                                Driver.Substance.Abuse
## 1 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## 2                                    Unknown, Unknown
## 3 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## 4 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## 5                                    Unknown, Unknown
## 6 Not Suspect of Alcohol Use, Not Suspect of Drug Use
##                          Non.Motorist.Substance.Abuse
## 1                                                    
## 2                                                    
## 3                                                    
## 4                                                    
## 5                                                    
## 6 Not Suspect of Alcohol Use, Not Suspect of Drug Use
##                              Person.ID Driver.At.Fault    Injury.Severity
## 1 BB3CB0F3-5A89-45FB-9516-48DDDB92B0A9             Yes No Apparent Injury
## 2 9B84E695-215A-447E-8AA6-D3958187BBCA              No                   
## 3 1D28ADF4-0DB2-4CBC-BDB0-1C1F5E7CF955              No No Apparent Injury
## 4 AE9A3389-3486-4199-B8F6-015D7D2E1139             Yes No Apparent Injury
## 5 3B4FB53F-9543-48EA-8C28-14AC093FBC36              No                   
## 6 391A0858-066B-41A3-926D-B43D84A96A07              No No Apparent Injury
##           Circumstance                             Driver.Distracted.By
## 1 Followed Too Closely      Other Action (looking away from task, etc.)
## 2                                                                      
## 3                                                        Not Distracted
## 4 Followed Too Closely Manually Operating (dialing, playing game, etc.)
## 5                                                               Unknown
## 6                                                        Not Distracted
##   Drivers.License.State                           Vehicle.ID
## 1                    MD 768C98FA-C137-47BC-BE44-EE3BA4B95F66
## 2                       BC322ECD-006B-4919-AAF8-3F64D934B789
## 3                    CO 1F4EBE18-DB94-4CA7-8D9A-88C30E90400D
## 4                    MD AAEB6B5A-30B2-47D3-BF59-7F14D0A5BCAD
## 5                       B683B035-8C9F-45F7-BDB5-F9141CCF160D
## 6                    MD 219D547A-37CA-4C31-93C8-414479EA6A4C
##   Vehicle.Damage.Extent Vehicle.First.Impact.Location
## 1           Superficial                Twelve O Clock
## 2  Vehicle Not at Scene          Vehicle Not at Scene
## 3           Superficial                 Seven O Clock
## 4             Disabling                Twelve O Clock
## 5  Vehicle Not at Scene          Vehicle Not at Scene
## 6           Superficial                Twelve O Clock
##               Vehicle.Body.Type      Vehicle.Movement Vehicle.Going.Dir
## 1                 Passenger Car Moving Constant Speed        Northbound
## 2                               Moving Constant Speed        Northbound
## 3                 Passenger Car Moving Constant Speed         Westbound
## 4 Van - Passenger (&lt;9 Seats)   Slowing or Stopping        Southbound
## 5         Sport Utility Vehicle               Backing    Not On Roadway
## 6                 Passenger Car Moving Constant Speed        Northbound
##   Speed.Limit Driverless.Vehicle Parked.Vehicle Vehicle.Year Vehicle.Make
## 1          40                 No             No         2013          KIA
## 2          55                 No             No            0             
## 3          40                 No             No         2023        LEXUS
## 4          30                 No             No         2003       TOYOTA
## 5           0                 No             No         2023       SUBARU
## 6          25                 No             No         2016        MAZDA
##   Vehicle.Model Latitude Longitude                    Location
## 1          SOUL 39.21980 -77.25742   (39.219796, -77.25741635)
## 2               39.18018 -77.25066 (39.18018079, -77.25065714)
## 3            RX 39.12122 -76.98891 (39.12121898, -76.98890509)
## 4        SIENNA 39.20793 -77.14148  (39.20793083, -77.1414795)
## 5       IMPREZA 39.03966 -77.05724 (39.03966248, -77.05723843)
## 6          CX-5 39.09273 -77.07647    (39.09273383, -77.07647)
str(crash_data)
## 'data.frame':    206309 obs. of  39 variables:
##  $ Report.Number                : chr  "MCP3126006X" "MCP2349001B" "MCP296500BC" "MCP2159003K" ...
##  $ Local.Case.Number            : chr  "250037402" "250037516" "250033157" "250037509" ...
##  $ Agency.Name                  : chr  "MONTGOMERY" "MONTGOMERY" "MONTGOMERY" "MONTGOMERY" ...
##  $ ACRS.Report.Type             : chr  "Injury Crash" "Property Damage Crash" "Property Damage Crash" "Property Damage Crash" ...
##  $ Crash.Date.Time              : chr  "08/21/2025 05:21:00 PM" "08/22/2025 10:44:00 AM" "07/25/2025 11:55:00 AM" "08/22/2025 10:36:00 AM" ...
##  $ Route.Type                   : chr  "Maryland (State) Route" "Interstate (State)" "Bicycle Route" "Maryland (State) Route" ...
##  $ Road.Name                    : chr  "" "EISENHOWER MEMORIAL HWY" "" "" ...
##  $ Cross.Street.Name            : chr  "" "" "NEW HAMPSHIRE AVE (SB/L) NORBECK RD (WB/L) SPENCERVILLE RD (WB/L)" "" ...
##  $ Off.Road.Description         : chr  "" "" "" "" ...
##  $ Municipality                 : chr  "" "" "" "" ...
##  $ Related.Non.Motorist         : chr  "" "" "" "" ...
##  $ Collision.Type               : chr  "Front to Rear" "Single Vehicle" "Sideswipe, Same Direction" "Front to Rear" ...
##  $ Weather                      : chr  "Clear" "Clear" "Clear" "Clear" ...
##  $ Surface.Condition            : chr  "Dry" "Dry" "Dry" "Dry" ...
##  $ Light                        : chr  "Daylight" "Daylight" "Daylight" "Daylight" ...
##  $ Traffic.Control              : chr  "No Controls" "No Controls" "Traffic Control Signal" "Flashing Traffic Control Signal" ...
##  $ Driver.Substance.Abuse       : chr  "Not Suspect of Alcohol Use, Not Suspect of Drug Use" "Unknown, Unknown" "Not Suspect of Alcohol Use, Not Suspect of Drug Use" "Not Suspect of Alcohol Use, Not Suspect of Drug Use" ...
##  $ Non.Motorist.Substance.Abuse : chr  "" "" "" "" ...
##  $ Person.ID                    : chr  "BB3CB0F3-5A89-45FB-9516-48DDDB92B0A9" "9B84E695-215A-447E-8AA6-D3958187BBCA" "1D28ADF4-0DB2-4CBC-BDB0-1C1F5E7CF955" "AE9A3389-3486-4199-B8F6-015D7D2E1139" ...
##  $ Driver.At.Fault              : chr  "Yes" "No" "No" "Yes" ...
##  $ Injury.Severity              : chr  "No Apparent Injury" "" "No Apparent Injury" "No Apparent Injury" ...
##  $ Circumstance                 : chr  "Followed Too Closely" "" "" "Followed Too Closely" ...
##  $ Driver.Distracted.By         : chr  "Other Action (looking away from task, etc.)" "" "Not Distracted" "Manually Operating (dialing, playing game, etc.)" ...
##  $ Drivers.License.State        : chr  "MD" "" "CO" "MD" ...
##  $ Vehicle.ID                   : chr  "768C98FA-C137-47BC-BE44-EE3BA4B95F66" "BC322ECD-006B-4919-AAF8-3F64D934B789" "1F4EBE18-DB94-4CA7-8D9A-88C30E90400D" "AAEB6B5A-30B2-47D3-BF59-7F14D0A5BCAD" ...
##  $ Vehicle.Damage.Extent        : chr  "Superficial" "Vehicle Not at Scene" "Superficial" "Disabling" ...
##  $ Vehicle.First.Impact.Location: chr  "Twelve O Clock" "Vehicle Not at Scene" "Seven O Clock" "Twelve O Clock" ...
##  $ Vehicle.Body.Type            : chr  "Passenger Car" "" "Passenger Car" "Van - Passenger (&lt;9 Seats)" ...
##  $ Vehicle.Movement             : chr  "Moving Constant Speed" "Moving Constant Speed" "Moving Constant Speed" "Slowing or Stopping" ...
##  $ Vehicle.Going.Dir            : chr  "Northbound" "Northbound" "Westbound" "Southbound" ...
##  $ Speed.Limit                  : int  40 55 40 30 0 25 0 25 10 35 ...
##  $ Driverless.Vehicle           : chr  "No" "No" "No" "No" ...
##  $ Parked.Vehicle               : chr  "No" "No" "No" "No" ...
##  $ Vehicle.Year                 : int  2013 0 2023 2003 2023 2016 2025 2021 2022 2018 ...
##  $ Vehicle.Make                 : chr  "KIA" "" "LEXUS" "TOYOTA" ...
##  $ Vehicle.Model                : chr  "SOUL" "" "RX" "SIENNA" ...
##  $ Latitude                     : num  39.2 39.2 39.1 39.2 39 ...
##  $ Longitude                    : num  -77.3 -77.3 -77 -77.1 -77.1 ...
##  $ Location                     : chr  "(39.219796, -77.25741635)" "(39.18018079, -77.25065714)" "(39.12121898, -76.98890509)" "(39.20793083, -77.1414795)" ...

Replacing Periods with Underscores in the Three Variable Names

names(crash_data)[names(crash_data) == "Injury.Severity"] <- "Injury_Severity"
names(crash_data)[names(crash_data) == "Driver.Substance.Abuse"] <- "Driver_Substance_Abuse"
names(crash_data)[names(crash_data) == "Surface.Condition"] <- "Surface_Condition"
#Looking for any missing values (NA's)
colSums(is.na(crash_data))
##                 Report.Number             Local.Case.Number 
##                             0                             0 
##                   Agency.Name              ACRS.Report.Type 
##                             0                             0 
##               Crash.Date.Time                    Route.Type 
##                             0                             0 
##                     Road.Name             Cross.Street.Name 
##                             0                             8 
##          Off.Road.Description                  Municipality 
##                             0                             0 
##          Related.Non.Motorist                Collision.Type 
##                             0                             0 
##                       Weather             Surface_Condition 
##                             0                             0 
##                         Light               Traffic.Control 
##                             0                             0 
##        Driver_Substance_Abuse  Non.Motorist.Substance.Abuse 
##                             0                             0 
##                     Person.ID               Driver.At.Fault 
##                             0                             0 
##               Injury_Severity                  Circumstance 
##                             0                             0 
##          Driver.Distracted.By         Drivers.License.State 
##                             0                             0 
##                    Vehicle.ID         Vehicle.Damage.Extent 
##                             0                             0 
## Vehicle.First.Impact.Location             Vehicle.Body.Type 
##                             0                             0 
##              Vehicle.Movement             Vehicle.Going.Dir 
##                             0                             0 
##                   Speed.Limit            Driverless.Vehicle 
##                             0                             0 
##                Parked.Vehicle                  Vehicle.Year 
##                             0                             0 
##                  Vehicle.Make                 Vehicle.Model 
##                            11                            25 
##                      Latitude                     Longitude 
##                             0                             0 
##                      Location 
##                             0
# Removing any rows with NA's in the three important variables
crash_data <- crash_data |>
  filter(!is.na(Injury_Severity), 
         !is.na(Driver_Substance_Abuse), 
         !is.na(Surface_Condition))
# Reviewing the unique values of the variables I'm using
unique(crash_data$Injury_Severity)
##  [1] "No Apparent Injury"       ""                        
##  [3] "Possible Injury"          "Suspected Minor Injury"  
##  [5] "Suspected Serious Injury" "Fatal Injury"            
##  [7] "NO APPARENT INJURY"       "SUSPECTED MINOR INJURY"  
##  [9] "POSSIBLE INJURY"          "SUSPECTED SERIOUS INJURY"
## [11] "FATAL INJURY"
unique(crash_data$Driver_Substance_Abuse)
##  [1] "Not Suspect of Alcohol Use, Not Suspect of Drug Use"
##  [2] "Unknown, Unknown"                                   
##  [3] "Suspect of Alcohol Use, Not Suspect of Drug Use"    
##  [4] "Unknown, Not Suspect of Drug Use"                   
##  [5] "Suspect of Alcohol Use, Unknown"                    
##  [6] "Suspect of Alcohol Use, Suspect of Drug Use"        
##  [7] "Not Suspect of Alcohol Use, Unknown"                
##  [8] "Not Suspect of Alcohol Use, Suspect of Drug Use"    
##  [9] "Unknown, Suspect of Drug Use"                       
## [10] "NONE DETECTED"                                      
## [11] "UNKNOWN"                                            
## [12] "N/A"                                                
## [13] "ALCOHOL CONTRIBUTED"                                
## [14] "ALCOHOL PRESENT"                                    
## [15] "COMBINATION CONTRIBUTED"                            
## [16] "COMBINED SUBSTANCE PRESENT"                         
## [17] "ILLEGAL DRUG CONTRIBUTED"                           
## [18] "ILLEGAL DRUG PRESENT"                               
## [19] "MEDICATION CONTRIBUTED"                             
## [20] "MEDICATION PRESENT"                                 
## [21] "OTHER"
unique(crash_data$Surface_Condition)
##  [1] "Dry"                      ""                        
##  [3] "Wet"                      "Other"                   
##  [5] "Water (standing, moving)" "DRY"                     
##  [7] "ICE"                      "WET"                     
##  [9] "N/A"                      "SLUSH"                   
## [11] "UNKNOWN"                  "WATER(STANDING/MOVING)"  
## [13] "SNOW"                     "OTHER"                   
## [15] "MUD, DIRT, GRAVEL"        "OIL"                     
## [17] "SAND"                     "Ice/Frost"               
## [19] "Mud, Dirt, Gravel"        "Snow"                    
## [21] "Slush"                    "Sand"                    
## [23] "Oil"
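# Changing Injury Severity into a binary variable (1 = fatal or suspected serious injury)
# Note: only the upper-case labels are matched; the mixed-case "Fatal Injury" and
# "Suspected Serious Injury" rows are coded 0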
crash_data$Injury_Severity_Binary <- ifelse(crash_data$Injury_Severity == "FATAL INJURY" | crash_data$Injury_Severity == "SUSPECTED SERIOUS INJURY", 1, 0)

table(crash_data$Injury_Severity_Binary)
## 
##      0      1 
## 204741   1568
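# Changing Driver Substance Abuse into a binary variable
# (1 = substance present, contributed, or "OTHER")
# Note: only the upper-case labels are matched; the "Suspect of Alcohol Use" and
# "Suspect of Drug Use" entries are coded 0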
crash_data$Substance_Binary <- ifelse(
  crash_data$Driver_Substance_Abuse == "ALCOHOL CONTRIBUTED" |
  crash_data$Driver_Substance_Abuse == "ALCOHOL PRESENT" |
  crash_data$Driver_Substance_Abuse == "COMBINATION CONTRIBUTED" |
  crash_data$Driver_Substance_Abuse == "COMBINED SUBSTANCE PRESENT" |
  crash_data$Driver_Substance_Abuse == "ILLEGAL DRUG CONTRIBUTED" |
  crash_data$Driver_Substance_Abuse == "ILLEGAL DRUG PRESENT" |
  crash_data$Driver_Substance_Abuse == "MEDICATION CONTRIBUTED" |
  crash_data$Driver_Substance_Abuse == "MEDICATION PRESENT" |
  crash_data$Driver_Substance_Abuse == "OTHER",
  1,
  0
)

table(crash_data$Substance_Binary)
## 
##      0      1 
## 200047   6262
# Creating a binary for the surface condition variable
# (0 = exactly "Dry", 1 = every other value, including "DRY", blank, "N/A", and "UNKNOWN")
crash_data$Surface_Binary <- ifelse(crash_data$Surface_Condition == "Dry", 0, 1)

table(crash_data$Surface_Binary)
## 
##      0      1 
##  25606 180703
crash_data |>
  select(Injury_Severity_Binary, Substance_Binary, Surface_Binary) |>
  summary()
##  Injury_Severity_Binary Substance_Binary  Surface_Binary  
##  Min.   :0.0000         Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000         1st Qu.:0.00000   1st Qu.:1.0000  
##  Median :0.0000         Median :0.00000   Median :1.0000  
##  Mean   :0.0076         Mean   :0.03035   Mean   :0.8759  
##  3rd Qu.:0.0000         3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000         Max.   :1.00000   Max.   :1.0000
crash_data <- crash_data |>
  mutate(Injury_Severity_Binary = as.factor(Injury_Severity_Binary),
         Substance_Binary       = as.factor(Substance_Binary),
         Surface_Binary      = as.factor(Surface_Binary))

str(crash_data[, c("Injury_Severity_Binary",
                   "Substance_Binary",
                   "Surface_Binary")])
## 'data.frame':    206309 obs. of  3 variables:
##  $ Injury_Severity_Binary: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Substance_Binary      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Surface_Binary        : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 1 ...
xtabs(~ Injury_Severity_Binary + Substance_Binary, data = crash_data)
##                       Substance_Binary
## Injury_Severity_Binary      0      1
##                      0 198656   6085
##                      1   1391    177
xtabs(~ Injury_Severity_Binary + Surface_Binary, data = crash_data)
##                       Surface_Binary
## Injury_Severity_Binary      0      1
##                      0  25606 179135
##                      1      0   1568
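
The unique() output above lists most categories in two spellings (for example "Dry" and "DRY", "Fatal Injury" and "FATAL INJURY"), plus blank, "N/A", and "UNKNOWN" entries, while each ifelse() call matches only one spelling. The following is a minimal sketch of a case-insensitive recoding on a small made-up data frame. The helper name normalize_label, the toy rows, and the choice to treat blank, "N/A", and "UNKNOWN" as missing are assumptions for illustration, not part of the original analysis; applying the same idea to crash_data would change every count and model result reported in this document.

```r
# Toy rows that mimic the label variants seen in unique() (hypothetical data)
toy <- data.frame(
  Injury_Severity   = c("Fatal Injury", "FATAL INJURY", "No Apparent Injury",
                        "", "Suspected Serious Injury", "POSSIBLE INJURY"),
  Surface_Condition = c("Dry", "DRY", "Wet", "N/A", "ICE", "UNKNOWN"),
  stringsAsFactors  = FALSE
)

# Hypothetical helper: upper-case, trim, and turn non-informative labels into NA
normalize_label <- function(x) {
  x <- toupper(trimws(x))
  x[x %in% c("", "N/A", "UNKNOWN")] <- NA
  x
}

toy$Injury_Severity   <- normalize_label(toy$Injury_Severity)
toy$Surface_Condition <- normalize_label(toy$Surface_Condition)

# Both spellings now fall in the same category; missing labels stay NA
toy$Injury_Severity_Binary <- ifelse(
  toy$Injury_Severity %in% c("FATAL INJURY", "SUSPECTED SERIOUS INJURY"), 1, 0
)
toy$Injury_Severity_Binary[is.na(toy$Injury_Severity)] <- NA
toy$Surface_Binary <- ifelse(toy$Surface_Condition == "DRY", 0, 1)

toy
```

The same pattern could be extended to Driver_Substance_Abuse, where the mixed-case entries combine an alcohol field and a drug field in one string and would need their own matching rule.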

Regression Analysis

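To answer my question, I used a logistic regression model because my outcome variable, Injury_Severity_Binary, is binary (0 = no serious injury, 1 = serious injury). Logistic regression models the probability of a serious injury as a function of the predictor variables. The final model uses two predictors, Substance_Binary and Surface_Binary. Substance_Binary indicates whether alcohol, medication, or illegal drugs were recorded as present or contributing (0 = No, 1 = Yes). Surface_Binary represents the road surface condition (0 = "Dry", 1 = any other recorded value). As coded above, each variable matches only one spelling of its categories: the outcome counts only the upper-case "FATAL INJURY" and "SUSPECTED SERIOUS INJURY" labels, Substance_Binary counts only the upper-case labels (so the "Suspect of Alcohol Use" and "Suspect of Drug Use" entries are coded 0), and Surface_Binary = 1 includes upper-case "DRY", blank, "N/A", and "UNKNOWN" entries along with wet, icy, and snowy surfaces.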

logistic <- glm(Injury_Severity_Binary ~ Substance_Binary + Surface_Binary, data= crash_data, family="binomial")

summary(logistic)
## 
## Call:
## glm(formula = Injury_Severity_Binary ~ Substance_Binary + Surface_Binary, 
##     family = "binomial", data = crash_data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -20.56607  110.80183  -0.186    0.853    
## Substance_Binary1   1.28613    0.08086  15.905   <2e-16 ***
## Surface_Binary1    15.74251  110.80183   0.142    0.887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18426  on 206308  degrees of freedom
## Residual deviance: 17824  on 206306  degrees of freedom
## AIC: 17830
## 
## Number of Fisher Scoring iterations: 19
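
The very large standard error on Surface_Binary1 (110.80), together with 19 Fisher scoring iterations, is the usual signature of quasi-complete separation. The sketch below uses the counts from the Injury_Severity_Binary by Surface_Binary cross-tabulation shown earlier to illustrate how an empty cell could be checked before fitting. The object names surface_tab, has_empty_cell, and severe_share are illustrative and do not appear in the original analysis.

```r
# Counts copied from the xtabs() output for Injury_Severity_Binary by Surface_Binary
surface_tab <- matrix(
  c(25606, 0, 179135, 1568),   # filled column by column
  nrow = 2,
  dimnames = list(Injury_Severity_Binary = c("0", "1"),
                  Surface_Binary         = c("0", "1"))
)

# An empty cell means one predictor level never occurs with one outcome,
# so the maximum-likelihood estimate for that level is not finite
has_empty_cell <- any(surface_tab == 0)
has_empty_cell

# Observed share of severe injuries within each Surface_Binary level
severe_share <- surface_tab["1", ] / colSums(surface_tab)
round(severe_share, 4)
```

When such a cell appears, a reasonable first step is to revisit how the categories were coded before interpreting the coefficient.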

The results indicate that Substance_Binary is a highly significant predictor of severe injury. Its positive coefficient (1.29 on the log-odds scale, p < 2e-16) means that drivers in crashes coded as involving substances have higher odds of a severe injury than drivers in crashes without recorded substance involvement. Surface_Binary was not statistically significant (p = 0.887), but its estimate is unreliable: the coefficient (15.74) and standard error (110.80) are very large because the cross-tabulation shows no severe injuries at all in the Surface_Binary = 0 group, a pattern known as quasi-complete separation. The non-significant result therefore should not be read as evidence that surface condition is unrelated to injury severity. Overall, the model suggests that substance involvement is associated with severe crash outcomes in Montgomery County, MD, while it cannot support a conclusion about surface conditions.
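
The Substance_Binary1 estimate in the summary is on the log-odds scale, so exponentiating it gives an odds ratio of roughly 3.6. The sketch below shows that conversion and an approximate 95% Wald interval, using only the estimate and standard error printed by summary(logistic). The variable names are illustrative, and the Wald interval is one common approximation rather than the only option.

```r
# Estimate and standard error for Substance_Binary1, copied from summary(logistic)
estimate  <- 1.28613
std_error <- 0.08086

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(estimate)

# Approximate 95% Wald interval: exponentiate estimate +/- z * standard error
wald_ci <- exp(estimate + c(-1, 1) * qnorm(0.975) * std_error)

round(c(odds_ratio = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 2)
```

With the fitted object available, exp(coef(logistic)) returns the same point estimates for all coefficients at once.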

Model Assumptions & Diagnostics

#Confusion Matrix
crash_data$Injury_num <- ifelse(crash_data$Injury_Severity_Binary == 1, 1, 0)

# Predicted probabilities
predicted.probs <- logistic$fitted.values

# Predicted classes: 1 if prob > 0.5, else 0
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)

# Confusion matrix
confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(crash_data$Injury_num, levels = c(0, 1))
)

confusion
##          Actual
## Predicted      0      1
##         0 204741   1568
##         1      0      0
#Extract Values from Confusion Matrix
TN <- confusion[1, 1]
FP <- confusion[2, 1]
FN <- confusion[1, 2]
TP <- confusion[2, 2]

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
precision   <- TP / (TP + FP)   # positive predictive value
f1_score    <- 2 * (precision * sensitivity) / (precision + sensitivity)

cat("Accuracy: ",    round(accuracy, 4), "\n")
## Accuracy:  0.9924
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity:  0
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity:  1
cat("Precision: ",   round(precision, 4), "\n")
## Precision:  NaN
cat("F1 Score: ",    round(f1_score, 4), "\n")
## F1 Score:  NaN
#ROC Curve
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# ROC curve & AUC on full data
roc_obj <- roc(response = crash_data$Injury_Severity_Binary,
               predictor = logistic$fitted.values,
               levels = c("0", "1"),
               direction = "<")

# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.5971
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

The confusion matrix shows that the model predicts every case as a non-severe injury. Because non-severe injuries are the large majority, this produces high accuracy (0.9924) and a specificity of 1. However, the model fails to identify any severe injury, giving a sensitivity of 0; precision and the F1 score are undefined (NaN) because no positive predictions were made. This happens because the dataset is highly imbalanced: severe injury cases represent less than 1% of all observations, so no fitted probability reaches the 0.5 cut-off. The ROC curve reflects the same weakness, with an AUC of 0.597 indicating that the model discriminates only slightly better than random guessing. Overall, although substance involvement is a significant predictor in the regression model, the model struggles to detect the rare severe injury cases.
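
Because the model has only two binary predictors, it can produce at most four distinct fitted probabilities. The sketch below computes them from the coefficients printed by summary(logistic) to illustrate why a 0.5 cut-off yields no positive predictions: the largest fitted probability is under 0.03. The names b0, b_subst, b_surf, and grid are illustrative.

```r
# Coefficients copied from summary(logistic)
b0      <- -20.56607   # (Intercept)
b_subst <-   1.28613   # Substance_Binary1
b_surf  <-  15.74251   # Surface_Binary1

# Every combination of the two binary predictors
grid <- expand.grid(Substance_Binary = c(0, 1), Surface_Binary = c(0, 1))

# The inverse logit of the linear predictor gives the fitted probability
grid$fitted_prob <- plogis(b0 +
                           b_subst * grid$Substance_Binary +
                           b_surf  * grid$Surface_Binary)

grid
max(grid$fitted_prob) > 0.5   # FALSE: no combination crosses the 0.5 cut-off
```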

Conclusion

The logistic regression model examined whether substance abuse or road surface conditions were associated with severe crash injuries in Montgomery County, MD. The results indicate that Substance_Binary was a statistically significant predictor, meaning drivers in crashes involving substances were more likely to be severely injured. Surface_Binary was not statistically significant, but its very large standard error means the model gives no reliable estimate of its effect. Although the model’s accuracy appeared high, this is because severe injuries make up less than 1% of the dataset and the model classifies every case as non-severe. Because of this imbalance, the model did not correctly predict any of the rare severe injury cases. The ROC curve confirmed this with an AUC of 0.597, indicating performance only slightly better than random guessing.

In future analysis, I will include additional predictors available in the dataset, such as collision type, driver distraction, and weather, to see whether they improve the model’s ability to identify severe injuries. Another improvement would be to try probability thresholds other than 0.5 to see whether sensitivity can be increased. Overall, the current model provides some insight into the role of substance involvement, but more work is needed to better predict the uncommon but important cases of severe injury.
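
The threshold idea can be sketched on simulated data. The example below is hypothetical: the data are generated rather than taken from the crash dataset, and the names rates_at, thresholds, and sweep_tab are invented for illustration. It shows how sensitivity and specificity trade off as the cut-off moves away from 0.5 for a rare outcome.

```r
set.seed(1)

# Simulated, imbalanced data (hypothetical; not the crash data)
n <- 5000
x <- rbinom(n, 1, 0.05)                   # rare binary risk factor
y <- rbinom(n, 1, plogis(-4 + 1.5 * x))   # rare binary outcome

fit   <- glm(y ~ x, family = binomial)
probs <- fitted(fit)

# Sensitivity and specificity at a given cut-off
rates_at <- function(threshold) {
  pred <- as.integer(probs > threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

thresholds <- c(0.01, 0.02, 0.05, 0.10, 0.50)
sweep_tab  <- as.data.frame(t(sapply(thresholds, rates_at)))
sweep_tab
```

Lowering the cut-off can only keep sensitivity the same or raise it, at the cost of specificity. With a single binary predictor the fitted probabilities take just two values, so the rates change only when the cut-off passes one of them, which is one reason additional predictors would be needed for finer control.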

References

https://catalog.data.gov/dataset/crash-reporting-drivers-data