Do driver substance abuse and road conditions predict whether a driver faces a severe injury in a vehicle crash in Montgomery County, Maryland?
Motor vehicle accidents occur under many conditions, and specific factors may increase the chance that a driver is severely injured. In this third project, I will use the Crash Reporting Data dataset provided by the Montgomery County Open Data Portal. This dataset contains detailed information on drivers who were involved in a vehicle accident, with variables describing injury severity, road conditions, weather, vehicle characteristics, and whether alcohol or prohibited substances were involved.
This dataset contains 206,309 observations and 39 variables, where each row represents a driver involved in a crash. Because the dataset is large and contains many categorical variables relevant to crash outcomes, this analysis focuses on three key variables: Injury Severity, Driver Substance Abuse, and Surface Condition. These three variables relate directly to my research question and make it possible to assess whether adverse road conditions or prohibited substances are associated with an increased risk of severe injury. I will use logistic regression to answer my research question, as it allows me to estimate the probability of a severe injury from the selected predictors and to describe how each factor influences injury risk.
In this data analysis, I cleaned and explored the dataset to prepare it for logistic regression. I started by examining the structure and first rows of the data with head() and str() to understand the variables and their formats. To check data quality, I inspected missing values with colSums(is.na()) and removed any NAs in my three key variables with filter(). I also renamed those three variables, replacing the periods with underscores to avoid errors in later stages, and reviewed the unique categories of each. I then created binary variables for injury severity, substance involvement, and surface condition with ifelse(), and converted them to factors with mutate() so they were ready for modeling. I used select() and summary() to produce descriptive summaries of the cleaned predictors and outcome. Lastly, I used xtabs() to check the relationships between injury severity and the predictor variables. Together, these steps helped ensure that the dataset was clean, properly formatted, and ready for building a logistic regression model.
Opening the Dataset
library(ggplot2)
library(cowplot)
library(tidyverse)
crash_data <- read.csv("Crash_report_data.csv")
head(crash_data)
## Report.Number Local.Case.Number Agency.Name ACRS.Report.Type
## 1 MCP3126006X 250037402 MONTGOMERY Injury Crash
## 2 MCP2349001B 250037516 MONTGOMERY Property Damage Crash
## 3 MCP296500BC 250033157 MONTGOMERY Property Damage Crash
## 4 MCP2159003K 250037509 MONTGOMERY Property Damage Crash
## 5 MCP312900D6 250034573 MONTGOMERY Property Damage Crash
## 6 MCP284600BN 250037004 MONTGOMERY Injury Crash
## Crash.Date.Time Route.Type Road.Name
## 1 08/21/2025 05:21:00 PM Maryland (State) Route
## 2 08/22/2025 10:44:00 AM Interstate (State) EISENHOWER MEMORIAL HWY
## 3 07/25/2025 11:55:00 AM Bicycle Route
## 4 08/22/2025 10:36:00 AM Maryland (State) Route
## 5 08/03/2025 02:10:00 PM
## 6 08/19/2025 09:50:00 AM County Route GRAND PRE RD
## Cross.Street.Name
## 1
## 2
## 3 NEW HAMPSHIRE AVE (SB/L) NORBECK RD (WB/L) SPENCERVILLE RD (WB/L)
## 4
## 5
## 6
## Off.Road.Description
## 1
## 2
## 3
## 4
## 5 Parking Lot Way PARKING LOT OF 2741 UNIVERSITY BLVD W, KENSINGTON MD, 20895
## 6
## Municipality Related.Non.Motorist Collision.Type Weather
## 1 Front to Rear Clear
## 2 Single Vehicle Clear
## 3 Sideswipe, Same Direction Clear
## 4 Front to Rear Clear
## 5 Rear To Side Clear
## 6 Pedestrian Single Vehicle Rain
## Surface.Condition Light Traffic.Control
## 1 Dry Daylight No Controls
## 2 Dry Daylight No Controls
## 3 Dry Daylight Traffic Control Signal
## 4 Dry Daylight Flashing Traffic Control Signal
## 5 Daylight
## 6 Wet Daylight No Controls
## Driver.Substance.Abuse
## 1 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## 2 Unknown, Unknown
## 3 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## 4 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## 5 Unknown, Unknown
## 6 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## Non.Motorist.Substance.Abuse
## 1
## 2
## 3
## 4
## 5
## 6 Not Suspect of Alcohol Use, Not Suspect of Drug Use
## Person.ID Driver.At.Fault Injury.Severity
## 1 BB3CB0F3-5A89-45FB-9516-48DDDB92B0A9 Yes No Apparent Injury
## 2 9B84E695-215A-447E-8AA6-D3958187BBCA No
## 3 1D28ADF4-0DB2-4CBC-BDB0-1C1F5E7CF955 No No Apparent Injury
## 4 AE9A3389-3486-4199-B8F6-015D7D2E1139 Yes No Apparent Injury
## 5 3B4FB53F-9543-48EA-8C28-14AC093FBC36 No
## 6 391A0858-066B-41A3-926D-B43D84A96A07 No No Apparent Injury
## Circumstance Driver.Distracted.By
## 1 Followed Too Closely Other Action (looking away from task, etc.)
## 2
## 3 Not Distracted
## 4 Followed Too Closely Manually Operating (dialing, playing game, etc.)
## 5 Unknown
## 6 Not Distracted
## Drivers.License.State Vehicle.ID
## 1 MD 768C98FA-C137-47BC-BE44-EE3BA4B95F66
## 2 BC322ECD-006B-4919-AAF8-3F64D934B789
## 3 CO 1F4EBE18-DB94-4CA7-8D9A-88C30E90400D
## 4 MD AAEB6B5A-30B2-47D3-BF59-7F14D0A5BCAD
## 5 B683B035-8C9F-45F7-BDB5-F9141CCF160D
## 6 MD 219D547A-37CA-4C31-93C8-414479EA6A4C
## Vehicle.Damage.Extent Vehicle.First.Impact.Location
## 1 Superficial Twelve O Clock
## 2 Vehicle Not at Scene Vehicle Not at Scene
## 3 Superficial Seven O Clock
## 4 Disabling Twelve O Clock
## 5 Vehicle Not at Scene Vehicle Not at Scene
## 6 Superficial Twelve O Clock
## Vehicle.Body.Type Vehicle.Movement Vehicle.Going.Dir
## 1 Passenger Car Moving Constant Speed Northbound
## 2 Moving Constant Speed Northbound
## 3 Passenger Car Moving Constant Speed Westbound
## 4 Van - Passenger (<9 Seats) Slowing or Stopping Southbound
## 5 Sport Utility Vehicle Backing Not On Roadway
## 6 Passenger Car Moving Constant Speed Northbound
## Speed.Limit Driverless.Vehicle Parked.Vehicle Vehicle.Year Vehicle.Make
## 1 40 No No 2013 KIA
## 2 55 No No 0
## 3 40 No No 2023 LEXUS
## 4 30 No No 2003 TOYOTA
## 5 0 No No 2023 SUBARU
## 6 25 No No 2016 MAZDA
## Vehicle.Model Latitude Longitude Location
## 1 SOUL 39.21980 -77.25742 (39.219796, -77.25741635)
## 2 39.18018 -77.25066 (39.18018079, -77.25065714)
## 3 RX 39.12122 -76.98891 (39.12121898, -76.98890509)
## 4 SIENNA 39.20793 -77.14148 (39.20793083, -77.1414795)
## 5 IMPREZA 39.03966 -77.05724 (39.03966248, -77.05723843)
## 6 CX-5 39.09273 -77.07647 (39.09273383, -77.07647)
str(crash_data)
## 'data.frame': 206309 obs. of 39 variables:
## $ Report.Number : chr "MCP3126006X" "MCP2349001B" "MCP296500BC" "MCP2159003K" ...
## $ Local.Case.Number : chr "250037402" "250037516" "250033157" "250037509" ...
## $ Agency.Name : chr "MONTGOMERY" "MONTGOMERY" "MONTGOMERY" "MONTGOMERY" ...
## $ ACRS.Report.Type : chr "Injury Crash" "Property Damage Crash" "Property Damage Crash" "Property Damage Crash" ...
## $ Crash.Date.Time : chr "08/21/2025 05:21:00 PM" "08/22/2025 10:44:00 AM" "07/25/2025 11:55:00 AM" "08/22/2025 10:36:00 AM" ...
## $ Route.Type : chr "Maryland (State) Route" "Interstate (State)" "Bicycle Route" "Maryland (State) Route" ...
## $ Road.Name : chr "" "EISENHOWER MEMORIAL HWY" "" "" ...
## $ Cross.Street.Name : chr "" "" "NEW HAMPSHIRE AVE (SB/L) NORBECK RD (WB/L) SPENCERVILLE RD (WB/L)" "" ...
## $ Off.Road.Description : chr "" "" "" "" ...
## $ Municipality : chr "" "" "" "" ...
## $ Related.Non.Motorist : chr "" "" "" "" ...
## $ Collision.Type : chr "Front to Rear" "Single Vehicle" "Sideswipe, Same Direction" "Front to Rear" ...
## $ Weather : chr "Clear" "Clear" "Clear" "Clear" ...
## $ Surface.Condition : chr "Dry" "Dry" "Dry" "Dry" ...
## $ Light : chr "Daylight" "Daylight" "Daylight" "Daylight" ...
## $ Traffic.Control : chr "No Controls" "No Controls" "Traffic Control Signal" "Flashing Traffic Control Signal" ...
## $ Driver.Substance.Abuse : chr "Not Suspect of Alcohol Use, Not Suspect of Drug Use" "Unknown, Unknown" "Not Suspect of Alcohol Use, Not Suspect of Drug Use" "Not Suspect of Alcohol Use, Not Suspect of Drug Use" ...
## $ Non.Motorist.Substance.Abuse : chr "" "" "" "" ...
## $ Person.ID : chr "BB3CB0F3-5A89-45FB-9516-48DDDB92B0A9" "9B84E695-215A-447E-8AA6-D3958187BBCA" "1D28ADF4-0DB2-4CBC-BDB0-1C1F5E7CF955" "AE9A3389-3486-4199-B8F6-015D7D2E1139" ...
## $ Driver.At.Fault : chr "Yes" "No" "No" "Yes" ...
## $ Injury.Severity : chr "No Apparent Injury" "" "No Apparent Injury" "No Apparent Injury" ...
## $ Circumstance : chr "Followed Too Closely" "" "" "Followed Too Closely" ...
## $ Driver.Distracted.By : chr "Other Action (looking away from task, etc.)" "" "Not Distracted" "Manually Operating (dialing, playing game, etc.)" ...
## $ Drivers.License.State : chr "MD" "" "CO" "MD" ...
## $ Vehicle.ID : chr "768C98FA-C137-47BC-BE44-EE3BA4B95F66" "BC322ECD-006B-4919-AAF8-3F64D934B789" "1F4EBE18-DB94-4CA7-8D9A-88C30E90400D" "AAEB6B5A-30B2-47D3-BF59-7F14D0A5BCAD" ...
## $ Vehicle.Damage.Extent : chr "Superficial" "Vehicle Not at Scene" "Superficial" "Disabling" ...
## $ Vehicle.First.Impact.Location: chr "Twelve O Clock" "Vehicle Not at Scene" "Seven O Clock" "Twelve O Clock" ...
## $ Vehicle.Body.Type : chr "Passenger Car" "" "Passenger Car" "Van - Passenger (<9 Seats)" ...
## $ Vehicle.Movement : chr "Moving Constant Speed" "Moving Constant Speed" "Moving Constant Speed" "Slowing or Stopping" ...
## $ Vehicle.Going.Dir : chr "Northbound" "Northbound" "Westbound" "Southbound" ...
## $ Speed.Limit : int 40 55 40 30 0 25 0 25 10 35 ...
## $ Driverless.Vehicle : chr "No" "No" "No" "No" ...
## $ Parked.Vehicle : chr "No" "No" "No" "No" ...
## $ Vehicle.Year : int 2013 0 2023 2003 2023 2016 2025 2021 2022 2018 ...
## $ Vehicle.Make : chr "KIA" "" "LEXUS" "TOYOTA" ...
## $ Vehicle.Model : chr "SOUL" "" "RX" "SIENNA" ...
## $ Latitude : num 39.2 39.2 39.1 39.2 39 ...
## $ Longitude : num -77.3 -77.3 -77 -77.1 -77.1 ...
## $ Location : chr "(39.219796, -77.25741635)" "(39.18018079, -77.25065714)" "(39.12121898, -76.98890509)" "(39.20793083, -77.1414795)" ...
names(crash_data)[names(crash_data) == "Injury.Severity"] <- "Injury_Severity"
names(crash_data)[names(crash_data) == "Driver.Substance.Abuse"] <- "Driver_Substance_Abuse"
names(crash_data)[names(crash_data) == "Surface.Condition"] <- "Surface_Condition"
#Looking for any missing values (NA's)
colSums(is.na(crash_data))
## Report.Number Local.Case.Number
## 0 0
## Agency.Name ACRS.Report.Type
## 0 0
## Crash.Date.Time Route.Type
## 0 0
## Road.Name Cross.Street.Name
## 0 8
## Off.Road.Description Municipality
## 0 0
## Related.Non.Motorist Collision.Type
## 0 0
## Weather Surface_Condition
## 0 0
## Light Traffic.Control
## 0 0
## Driver_Substance_Abuse Non.Motorist.Substance.Abuse
## 0 0
## Person.ID Driver.At.Fault
## 0 0
## Injury_Severity Circumstance
## 0 0
## Driver.Distracted.By Drivers.License.State
## 0 0
## Vehicle.ID Vehicle.Damage.Extent
## 0 0
## Vehicle.First.Impact.Location Vehicle.Body.Type
## 0 0
## Vehicle.Movement Vehicle.Going.Dir
## 0 0
## Speed.Limit Driverless.Vehicle
## 0 0
## Parked.Vehicle Vehicle.Year
## 0 0
## Vehicle.Make Vehicle.Model
## 11 25
## Latitude Longitude
## 0 0
## Location
## 0
#Removing any rows with NA's in the three important variables
crash_data <- crash_data |>
  filter(!is.na(Injury_Severity),
         !is.na(Driver_Substance_Abuse),
         !is.na(Surface_Condition))
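read.csv() keeps blank fields in character columns as empty strings, so is.na() does not flag them. The following is a minimal sketch, on a hypothetical toy data frame (`toy`), of one way to turn blank and placeholder labels into NA before filtering. Treating "N/A" and "UNKNOWN" as missing is an assumption made for illustration, not a choice taken in the analysis above.

```r
library(dplyr)

# Toy stand-in for crash_data (hypothetical values)
toy <- data.frame(
  Injury_Severity   = c("No Apparent Injury", "", "FATAL INJURY", "Possible Injury"),
  Surface_Condition = c("Dry", "Wet", "N/A", "UNKNOWN"),
  stringsAsFactors  = FALSE
)

# Labels treated as missing (an assumption for illustration)
missing_labels <- c("", "N/A", "UNKNOWN")

toy_clean <- toy |>
  # Replace blank/placeholder labels with NA in every column
  mutate(across(everything(), ~ replace(.x, .x %in% missing_labels, NA))) |>
  # Keep only rows where both variables are observed
  filter(!is.na(Injury_Severity), !is.na(Surface_Condition))

print(toy_clean)
```

In this toy example only the first row survives, because each of the other rows has a blank or placeholder value in one of the two columns.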
#Reviewing the unique values of the variables I'm using
unique(crash_data$Injury_Severity)
## [1] "No Apparent Injury" ""
## [3] "Possible Injury" "Suspected Minor Injury"
## [5] "Suspected Serious Injury" "Fatal Injury"
## [7] "NO APPARENT INJURY" "SUSPECTED MINOR INJURY"
## [9] "POSSIBLE INJURY" "SUSPECTED SERIOUS INJURY"
## [11] "FATAL INJURY"
unique(crash_data$Driver_Substance_Abuse)
## [1] "Not Suspect of Alcohol Use, Not Suspect of Drug Use"
## [2] "Unknown, Unknown"
## [3] "Suspect of Alcohol Use, Not Suspect of Drug Use"
## [4] "Unknown, Not Suspect of Drug Use"
## [5] "Suspect of Alcohol Use, Unknown"
## [6] "Suspect of Alcohol Use, Suspect of Drug Use"
## [7] "Not Suspect of Alcohol Use, Unknown"
## [8] "Not Suspect of Alcohol Use, Suspect of Drug Use"
## [9] "Unknown, Suspect of Drug Use"
## [10] "NONE DETECTED"
## [11] "UNKNOWN"
## [12] "N/A"
## [13] "ALCOHOL CONTRIBUTED"
## [14] "ALCOHOL PRESENT"
## [15] "COMBINATION CONTRIBUTED"
## [16] "COMBINED SUBSTANCE PRESENT"
## [17] "ILLEGAL DRUG CONTRIBUTED"
## [18] "ILLEGAL DRUG PRESENT"
## [19] "MEDICATION CONTRIBUTED"
## [20] "MEDICATION PRESENT"
## [21] "OTHER"
unique(crash_data$Surface_Condition)
## [1] "Dry" ""
## [3] "Wet" "Other"
## [5] "Water (standing, moving)" "DRY"
## [7] "ICE" "WET"
## [9] "N/A" "SLUSH"
## [11] "UNKNOWN" "WATER(STANDING/MOVING)"
## [13] "SNOW" "OTHER"
## [15] "MUD, DIRT, GRAVEL" "OIL"
## [17] "SAND" "Ice/Frost"
## [19] "Mud, Dirt, Gravel" "Snow"
## [21] "Slush" "Sand"
## [23] "Oil"
#Changing the Injury Severity into a binary variable
crash_data$Injury_Severity_Binary <- ifelse(crash_data$Injury_Severity == "FATAL INJURY" | crash_data$Injury_Severity == "SUSPECTED SERIOUS INJURY", 1, 0)
table(crash_data$Injury_Severity_Binary)
##
## 0 1
## 204741 1568
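The unique() output lists each severity label in both mixed case and upper case, and string comparison in R is case-sensitive. The sketch below uses a hypothetical toy vector (`severity`) to compare an exact match against a match on toupper() labels. It illustrates one possible recoding and is not the coding used to produce the counts above.

```r
# Toy labels mirroring the mixed-case and upper-case variants listed by unique()
severity <- c("No Apparent Injury", "Fatal Injury", "SUSPECTED SERIOUS INJURY",
              "Suspected Serious Injury", "POSSIBLE INJURY", "FATAL INJURY")

severe_labels <- c("FATAL INJURY", "SUSPECTED SERIOUS INJURY")

# Exact (case-sensitive) match versus a match after upper-casing the labels
strict  <- ifelse(severity %in% severe_labels, 1, 0)
relaxed <- ifelse(toupper(severity) %in% severe_labels, 1, 0)

print(rbind(strict, relaxed))
```

With these six toy labels the exact match flags two values and the upper-cased match flags four, since "Fatal Injury" and "Suspected Serious Injury" are only caught after normalizing case.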
#Changing the Substance into a binary variable
crash_data$Substance_Binary <- ifelse(
  crash_data$Driver_Substance_Abuse %in% c(
    "ALCOHOL CONTRIBUTED", "ALCOHOL PRESENT",
    "COMBINATION CONTRIBUTED", "COMBINED SUBSTANCE PRESENT",
    "ILLEGAL DRUG CONTRIBUTED", "ILLEGAL DRUG PRESENT",
    "MEDICATION CONTRIBUTED", "MEDICATION PRESENT",
    "OTHER"
  ),
  1,
  0
)
table(crash_data$Substance_Binary)
##
## 0 1
## 200047 6262
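The Driver_Substance_Abuse column mixes two label styles: the upper-case codes matched above and the paired "Suspect of ..." labels. The sketch below shows one way both styles could be flagged, using a hypothetical toy vector (`substance`). Counting a "Suspect of" label as substance involvement is an assumption made for illustration, and the regular expression is written so that "Not Suspect of ..." is not matched.

```r
# Toy labels drawn from the categories listed by unique()
substance <- c("Not Suspect of Alcohol Use, Not Suspect of Drug Use",
               "Suspect of Alcohol Use, Not Suspect of Drug Use",
               "Unknown, Suspect of Drug Use",
               "Unknown, Unknown",
               "ALCOHOL PRESENT",
               "NONE DETECTED")

# Upper-case codes treated as substance involvement in the analysis above
legacy_positive <- c("ALCOHOL CONTRIBUTED", "ALCOHOL PRESENT",
                     "COMBINATION CONTRIBUTED", "COMBINED SUBSTANCE PRESENT",
                     "ILLEGAL DRUG CONTRIBUTED", "ILLEGAL DRUG PRESENT",
                     "MEDICATION CONTRIBUTED", "MEDICATION PRESENT", "OTHER")

s <- toupper(substance)

# "SUSPECT OF" counts only at the start of the label or right after ", ",
# so "NOT SUSPECT OF ..." does not match
suspected <- grepl("(^|, )SUSPECT OF", s)

flag <- as.integer(s %in% legacy_positive | suspected)
print(data.frame(substance, flag))
```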
#Creating a binary for surface condition variable (0 = exactly "Dry", 1 = every other value, including "DRY", blanks, and "N/A")
crash_data$Surface_Binary <- ifelse(crash_data$Surface_Condition == "Dry", 0, 1)
table(crash_data$Surface_Binary)
##
## 0 1
## 25606 180703
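Because the comparison with "Dry" is case-sensitive, the upper-case "DRY" label, blanks, and "N/A" all fall into the 1 group. The sketch below, on a hypothetical toy vector (`surface`), shows one way to separate dry, non-dry, and unknown surfaces. Which labels count as unknown is an assumption made for illustration.

```r
# Toy labels drawn from the categories listed by unique()
surface <- c("Dry", "DRY", "Wet", "ICE", "", "N/A", "UNKNOWN", "Snow")

s <- toupper(surface)

# Labels treated as unknown (an assumption); everything else that is not DRY is coded 1
unknown_labels <- c("", "N/A", "UNKNOWN")

surface_binary <- ifelse(s %in% unknown_labels, NA,
                         ifelse(s == "DRY", 0, 1))

print(data.frame(surface, surface_binary))
table(surface_binary, useNA = "ifany")
```

Coding unknown surfaces as NA keeps them out of both groups, so they can be dropped explicitly before modeling.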
crash_data |>
select(Injury_Severity_Binary, Substance_Binary, Surface_Binary) |>
summary()
## Injury_Severity_Binary Substance_Binary Surface_Binary
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :0.0000 Median :0.00000 Median :1.0000
## Mean :0.0076 Mean :0.03035 Mean :0.8759
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
crash_data <- crash_data |>
mutate(Injury_Severity_Binary = as.factor(Injury_Severity_Binary),
Substance_Binary = as.factor(Substance_Binary),
Surface_Binary = as.factor(Surface_Binary))
str(crash_data[, c("Injury_Severity_Binary",
"Substance_Binary",
"Surface_Binary")])
## 'data.frame': 206309 obs. of 3 variables:
## $ Injury_Severity_Binary: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Substance_Binary : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Surface_Binary : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 1 ...
xtabs(~ Injury_Severity_Binary + Substance_Binary, data = crash_data)
## Substance_Binary
## Injury_Severity_Binary 0 1
## 0 198656 6085
## 1 1391 177
xtabs(~ Injury_Severity_Binary + Surface_Binary, data = crash_data)
## Surface_Binary
## Injury_Severity_Binary 0 1
## 0 25606 179135
## 1 0 1568
To answer my question, I used a logistic regression model because my outcome variable, Injury_Severity_Binary, is binary (0 = No Serious Injury, 1 = Serious Injury). Logistic regression allows me to model the probability of a serious injury based on the predictor variables. The final model uses two predictors, Substance_Binary and Surface_Binary. Substance_Binary indicates whether alcohol, medication, or prohibited substances were recorded as present or contributing (0 = No, 1 = Yes), while Surface_Binary represents the road surface condition (0 = "Dry", 1 = any other recorded value).
logistic <- glm(Injury_Severity_Binary ~ Substance_Binary + Surface_Binary, data= crash_data, family="binomial")
summary(logistic)
##
## Call:
## glm(formula = Injury_Severity_Binary ~ Substance_Binary + Surface_Binary,
## family = "binomial", data = crash_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.56607 110.80183 -0.186 0.853
## Substance_Binary1 1.28613 0.08086 15.905 <2e-16 ***
## Surface_Binary1 15.74251 110.80183 0.142 0.887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18426 on 206308 degrees of freedom
## Residual deviance: 17824 on 206306 degrees of freedom
## AIC: 17830
##
## Number of Fisher Scoring iterations: 19
The results indicate that Substance_Binary is a highly significant predictor of severe injury (estimate 1.286, p < 2e-16): crashes involving substances have higher odds of resulting in a severe injury than crashes without substance involvement. Surface_Binary was not statistically significant (p = 0.887). However, its very large estimate and standard error (15.74 and 110.80), together with the empty cell in the cross-tabulation (no severe injuries where Surface_Binary = 0), point to quasi-complete separation, so this result should not be read as evidence that surface conditions have no effect. The results suggest that driver impairment from substances is associated with severe crash outcomes in Montgomery County, MD, while the role of surface conditions cannot be determined from this model.
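The Substance_Binary coefficient is on the log-odds scale, so exponentiating it gives an odds ratio. The short sketch below copies the estimate and standard error from the summary(logistic) output and adds a Wald 95% interval. The variable names are illustrative.

```r
# Estimate and standard error for Substance_Binary1, copied from summary(logistic)
estimate  <- 1.28613
std_error <- 0.08086

# Odds ratio: exponentiate the log-odds coefficient
odds_ratio <- exp(estimate)

# Wald 95% interval on the log-odds scale, then exponentiated
ci <- exp(estimate + c(-1, 1) * qnorm(0.975) * std_error)

cat("Odds ratio:", round(odds_ratio, 2), "\n")
cat("95% CI:", round(ci, 2), "\n")
```

The odds ratio comes out at roughly 3.6, meaning the odds of a severe injury are estimated to be about 3.6 times higher when substance involvement is recorded, holding Surface_Binary fixed.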
#Confusion Matrix
crash_data$Injury_num <- ifelse(crash_data$Injury_Severity_Binary == 1, 1, 0)
# Predicted probabilities
predicted.probs <- logistic$fitted.values
# Predicted classes: 1 if prob > 0.5, else 0
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
# Confusion matrix
confusion <- table(
Predicted = factor(predicted.classes, levels = c(0, 1)),
Actual = factor(crash_data$Injury_num, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 204741 1568
## 1 0 0
#Extract Values from Confusion Matrix
TN <- confusion[1, 1]
FP <- confusion[2, 1]
FN <- confusion[1, 2]
TP <- confusion[2, 2]
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # true positive rate
specificity <- TN / (TN + FP) # true negative rate
precision <- TP / (TP + FP) # positive predictive value
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.9924
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 1
cat("Precision: ", round(precision, 4), "\n")
## Precision: NaN
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: NaN
#ROC Curve
library(pROC)
# ROC curve & AUC on full data
roc_obj <- roc(response = crash_data$Injury_Severity_Binary,
predictor = logistic$fitted.values,
levels = c("0", "1"),
direction = "<")
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.5971
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
The confusion matrix shows that the model predicts every case as a non-severe injury. Because of this, it correctly identifies the entire majority class, resulting in high accuracy (0.9924) and a specificity of 1. However, the model fails to predict any severe injury, which gives a sensitivity of 0 and an undefined precision and F1 score. This happens because the dataset is very imbalanced: severe injury cases represent less than 1% of all observations (1,568 of 206,309). With so few positive cases, no fitted probability exceeds the 0.5 threshold, so the model classifies everything as non-severe. The ROC curve reflects the same issue, with an AUC of 0.597, indicating that the model performs only slightly better than random guessing. Overall, while substance involvement shows predictive value in the regression model, the model struggles to detect the rare severe injury cases because of this imbalance.
The logistic regression model examined whether substance abuse or road surface conditions were associated with severe crash injuries in Montgomery County, MD. The results indicate that Substance_Binary was a statistically significant predictor, meaning crashes involving substances were more likely to result in a severe injury. Surface_Binary was not statistically significant, and its very large standard error means the model could not estimate that effect reliably. Although the model's accuracy appeared high, this was because severe injuries make up less than 1% of the dataset. Due to this imbalance, the model failed to predict the rare cases of severe injury. The ROC curve confirmed this with an AUC of 0.597, indicating that the model performs only slightly better than random guessing.
In future analysis, I will include additional predictors available in the dataset, such as collision type, driver distraction, and weather, to see whether they improve the model's ability to identify severe injuries. Another improvement would be to try different probability thresholds (not just 0.5) to see whether sensitivity can be increased. Overall, the current model provides insight into the role of substance involvement, but more work is needed to better predict the uncommon but important cases of severe injury.
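With two binary predictors, the model can produce only four distinct fitted probabilities. The sketch below recomputes them from the coefficients printed by summary(logistic), using plogis() to convert log-odds to probabilities. The object names are illustrative.

```r
# Coefficients copied from summary(logistic)
b0     <- -20.56607   # intercept
b_sub  <-   1.28613   # Substance_Binary1
b_surf <-  15.74251   # Surface_Binary1

# Every combination of the two binary predictors
grid <- expand.grid(Substance_Binary = c(0, 1), Surface_Binary = c(0, 1))

# Inverse logit turns the linear predictor into a probability
grid$prob <- plogis(b0 + b_sub * grid$Substance_Binary + b_surf * grid$Surface_Binary)

print(grid)
```

The largest of the four probabilities is about 0.03, far below the 0.5 cutoff, which is why the confusion matrix contains no predicted severe injuries.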
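The effect of the classification threshold can be sketched on toy data. The probabilities and labels below are hypothetical and do not come from the crash model, and `metrics_at()` is an illustrative helper rather than a function from any package.

```r
# Hypothetical predicted probabilities and true labels (toy data)
probs  <- c(0.002, 0.004, 0.006, 0.010, 0.030, 0.025, 0.007, 0.028)
actual <- c(0,     0,     0,     0,     1,     0,     0,     1)

# Sensitivity and specificity when cases with prob > threshold are classified as 1
metrics_at <- function(probs, actual, threshold) {
  pred <- as.integer(probs > threshold)
  TP <- sum(pred == 1 & actual == 1)
  TN <- sum(pred == 0 & actual == 0)
  FP <- sum(pred == 1 & actual == 0)
  FN <- sum(pred == 0 & actual == 1)
  c(threshold   = threshold,
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}

# One row per threshold
res <- t(sapply(c(0.5, 0.02, 0.005), metrics_at, probs = probs, actual = actual))
print(res)
```

In this toy example, lowering the threshold from 0.5 to 0.02 raises sensitivity from 0 to 1 at the cost of some specificity, which is the trade-off a threshold search on the real model would need to balance.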
References https://catalog.data.gov/dataset/crash-reporting-drivers-data