Research Question: How does the type of roadway influence the severity of a crash?
This project investigates the relationship between roadway type and crash severity using a large dataset of traffic incidents. The dataset, Crash Reporting – Drivers Data, contains more than 200,000 observations (207,084 driver records are used in the analysis below) and includes detailed information about each crash, such as roadway characteristics, vehicle factors, and driver behaviors. It consists of 39 variables describing elements like crash type, route type, speed limit, vehicle details, and driver conditions. Each row represents a single driver involved in a crash, making the dataset suitable for modeling crash severity at the driver level. The data come from the Automated Crash Reporting System (ACRS) of the Maryland State Police and are publicly available through Data.gov.
I chose this research topic because roadway design and classification play an important role in traffic safety, yet many drivers are unaware of how road type can influence crash severity. Understanding whether certain roadway environments (such as interstates, state routes, or local roads) are associated with higher odds of injury can help inform policy decisions, roadway engineering, and public safety interventions. This analysis also provides an opportunity to apply logistic regression to a real-world transportation safety issue and to draw practical insight from it.
To prepare the dataset for logistic regression, several data-wrangling steps were performed. First, only the variables needed for the analysis, crash severity (ACRS.Report.Type) and route type (Route.Type), were selected from the full dataset. Since the severity variable was originally categorical text, a new binary variable was created, coding “Injury Crash” as 1 and every other report type, treated here as “Property Damage Crash”, as 0. Route type was converted to a factor so that glm() treats it as a categorical predictor. The dataset was then filtered to remove records with missing (NA) values in either key variable. Note that this filter does not remove records whose route type is an empty string, and the code below does not drop low-frequency roadway types, so both remain in the model; sparse categories such as Unknown therefore have unstable estimates. Throughout this process, dplyr functions such as select(), mutate(), and filter() were used to organize and prepare the data for statistical modeling.
Load packages & data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read.csv("Crash_Reporting_-_Drivers_Data.csv")
df_selected <- df %>%
select(ACRS.Report.Type, Route.Type)
Create binary crash severity variable
df_selected <- df_selected %>%
mutate(
severity_binary = ifelse(ACRS.Report.Type == "Injury Crash", 1, 0),
Route.Type = as.factor(Route.Type)
)
Handle missing values
df_clean <- df_selected %>%
filter(!is.na(Route.Type), !is.na(severity_binary))
Summary
summary(df_clean)
##  ACRS.Report.Type                    Route.Type    severity_binary
##  Length:207084      Maryland (State)      :77077   Min.   :0.0000
##  Class :character   County                :55574   1st Qu.:0.0000
##  Mode  :character                         :20248   Median :0.0000
##                     County Route          :12395   Mean   :0.3567
##                     Maryland (State) Route:11902   3rd Qu.:1.0000
##                     Municipality          : 9380   Max.   :1.0000
##                     (Other)               :20508
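The summary shows 20,248 records whose route type is an empty label, which `is.na()` does not catch, and several small categories fall under "(Other)". The following is a minimal sketch, on a made-up toy data frame, of one way to treat blanks as missing and drop rare route types with dplyr. The cutoff `min_n` and the toy values are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy data standing in for the crash records (values are made up for illustration)
toy <- data.frame(
  ACRS.Report.Type = c("Injury Crash", "Property Damage Crash", "Injury Crash",
                       "Property Damage Crash", "Property Damage Crash", "Injury Crash"),
  Route.Type = c("County", "County", "", "Interstate (State)", "County", "Service Road"),
  stringsAsFactors = FALSE
)

min_n <- 2  # hypothetical cutoff; a real analysis would likely use a much larger value

toy_clean <- toy %>%
  mutate(Route.Type = na_if(Route.Type, "")) %>%        # blank strings become NA
  filter(!is.na(Route.Type), !is.na(ACRS.Report.Type)) %>%
  add_count(Route.Type, name = "n_route") %>%           # number of rows per route type
  filter(n_route >= min_n) %>%                          # drop rare route types
  select(-n_route) %>%
  mutate(
    severity_binary = ifelse(ACRS.Report.Type == "Injury Crash", 1, 0),
    Route.Type = factor(Route.Type)                     # factor created after filtering, so no empty levels
  )

toy_clean
```

Creating the factor after filtering avoids carrying unused levels into the model. Applying these steps to the real data would change the sample size and the reference category, so the output shown in this report would no longer match.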
For this research question, I selected logistic regression to examine how roadway type influences the probability that a crash results in an injury. Logistic regression is appropriate because the dependent variable, Crash Severity, is binary (Injury Crash = 1, Property Damage Crash = 0). The goal of this analysis is to determine whether certain roadway types (such as interstates, state routes, or local roads) are associated with higher or lower odds that a crash involves an injury. Logistic regression allows us to estimate these differences using log-odds and odds ratios, making it well suited for modeling binary outcomes.
The final model uses crash severity as the binary dependent variable and roadway type as the categorical predictor, implemented through the logistic regression function glm(family = binomial). After fitting the model, the summary output is used to evaluate the statistical significance of each roadway category. The coefficients are interpreted in terms of log-odds to determine how roadway type affects injury likelihood relative to a baseline category. To evaluate model performance, a confusion matrix is generated using a 0.5 probability threshold, and model accuracy, sensitivity, and specificity are calculated. A Receiver Operating Characteristic (ROC) curve is plotted, and the Area Under the Curve (AUC) is reported to assess how well the model distinguishes between injury and non-injury crashes.
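Because every coefficient is read relative to a baseline category, it helps to know which level R treats as the baseline. By default glm() uses the first level of the factor, and as.factor() on text orders levels by sorting, so an empty label sorts first. The sketch below uses a toy vector with illustrative values only; relevel() is the base R function for choosing a different reference level.

```r
# Toy route types (illustrative values only); "" mimics a record with no label
route <- as.factor(c("County", "", "Maryland (State)", "County", "Interstate (State)"))

# The first level is the reference category glm() would use by default
levels(route)[1]

# Choose an explicit, interpretable baseline instead
route2 <- relevel(route, ref = "County")
levels(route2)[1]
```

Setting the reference level explicitly before fitting makes the odds ratios easier to interpret, since each one then compares a road type against a named category.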
Fit the logistic regression model
model <- glm(severity_binary ~ Route.Type,
data = df_clean,
family = binomial)
summary(model)
##
## Call:
## glm(formula = severity_binary ~ Route.Type, family = binomial,
## data = df_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.60400 0.01882 -85.215 < 2e-16 ***
## Route.TypeBicycle Route 1.02264 0.10173 10.053 < 2e-16 ***
## Route.TypeCounty 1.04751 0.02078 50.398 < 2e-16 ***
## Route.TypeCounty Route 0.98558 0.02662 37.018 < 2e-16 ***
## Route.TypeCrossover 1.29806 0.13650 9.510 < 2e-16 ***
## Route.TypeGovernment 1.05814 0.08498 12.452 < 2e-16 ***
## Route.TypeGovernment Route 0.82917 0.17730 4.677 2.92e-06 ***
## Route.TypeInterstate (State) 1.03781 0.03950 26.274 < 2e-16 ***
## Route.TypeLocal Route 0.91966 0.08338 11.029 < 2e-16 ***
## Route.TypeMaryland (State) 1.23764 0.02020 61.275 < 2e-16 ***
## Route.TypeMaryland (State) Route 0.98728 0.02690 36.708 < 2e-16 ***
## Route.TypeMunicipality 0.78898 0.02925 26.974 < 2e-16 ***
## Route.TypeMunicipality Route 0.87112 0.04691 18.571 < 2e-16 ***
## Route.TypeOther Public Roadway 0.96065 0.05867 16.373 < 2e-16 ***
## Route.TypePrivate Route 1.11259 0.12590 8.837 < 2e-16 ***
## Route.TypeRamp 1.02053 0.06254 16.318 < 2e-16 ***
## Route.TypeService Road 0.82124 0.30237 2.716 0.00661 **
## Route.TypeSpur 0.99621 0.11998 8.303 < 2e-16 ***
## Route.TypeUnknown -8.96203 26.71387 -0.335 0.73726
## Route.TypeUS (State) 1.03129 0.02890 35.684 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 269818 on 207083 degrees of freedom
## Residual deviance: 265173 on 207064 degrees of freedom
## AIC: 265213
##
## Number of Fisher Scoring iterations: 9
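The null and residual deviances in the summary can be combined into a likelihood-ratio test of whether Route.Type as a whole improves on an intercept-only model. The sketch below uses the rounded values printed above, so it is an approximation of what the fitted object would give.

```r
# Deviances and degrees of freedom copied from the summary output above
null_dev  <- 269818; null_df  <- 207083
resid_dev <- 265173; resid_df <- 207064

lr_stat <- null_dev - resid_dev   # likelihood-ratio chi-square statistic
lr_df   <- null_df - resid_df     # number of route-type coefficients

# Upper-tail chi-square probability for the test statistic
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Share of the null deviance explained by the predictor
dev_share <- lr_stat / null_dev

c(statistic = lr_stat, df = lr_df, p_value = p_value, dev_share = dev_share)
```

With the fitted object available, anova(model, test = "Chisq") should report the same comparison. The test is highly significant, yet the deviance falls by under 2 percent, which is a reminder that significance in a sample this large does not imply strong predictive power.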
Interpret coefficients
exp(coef(model))
## (Intercept) Route.TypeBicycle Route
## 0.2010914699 2.7805246297
## Route.TypeCounty Route.TypeCounty Route
## 2.8505440797 2.6793567088
## Route.TypeCrossover Route.TypeGovernment
## 3.6621847202 2.8810028012
## Route.TypeGovernment Route Route.TypeInterstate (State)
## 2.2914165076 2.8230217310
## Route.TypeLocal Route Route.TypeMaryland (State)
## 2.5084344898 3.4474645395
## Route.TypeMaryland (State) Route Route.TypeMunicipality
## 2.6839298294 2.2011527200
## Route.TypeMunicipality Route Route.TypeOther Public Roadway
## 2.3895954513 2.6133973514
## Route.TypePrivate Route Route.TypeRamp
## 3.0422210654 2.7746665915
## Route.TypeService Road Route.TypeSpur
## 2.2733080489 2.7079938082
## Route.TypeUnknown Route.TypeUS (State)
## 0.0001281855 2.8046731066
exp(confint.default(model))
## 2.5 % 97.5 %
## (Intercept) 1.938079e-01 2.086488e-01
## Route.TypeBicycle Route 2.277901e+00 3.394053e+00
## Route.TypeCounty 2.736755e+00 2.969064e+00
## Route.TypeCounty Route 2.543125e+00 2.822886e+00
## Route.TypeCrossover 2.802530e+00 4.785532e+00
## Route.TypeGovernment 2.438998e+00 3.403109e+00
## Route.TypeGovernment Route 1.618777e+00 3.243553e+00
## Route.TypeInterstate (State) 2.612716e+00 3.050256e+00
## Route.TypeLocal Route 2.130234e+00 2.953780e+00
## Route.TypeMaryland (State) 3.313654e+00 3.586678e+00
## Route.TypeMaryland (State) Route 2.546113e+00 2.829206e+00
## Route.TypeMunicipality 2.078512e+00 2.331030e+00
## Route.TypeMunicipality Route 2.179698e+00 2.619705e+00
## Route.TypeOther Public Roadway 2.329505e+00 2.931888e+00
## Route.TypePrivate Route 2.376952e+00 3.893689e+00
## Route.TypeRamp 2.454575e+00 3.136500e+00
## Route.TypeService Road 1.256849e+00 4.111814e+00
## Route.TypeSpur 2.140517e+00 3.425915e+00
## Route.TypeUnknown 2.338562e-27 7.026333e+18
## Route.TypeUS (State) 2.650221e+00 2.968126e+00
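Odds ratios can be hard to read on their own, so it may help to convert a few coefficients back into predicted probabilities. The sketch below copies the intercept and four coefficients from the summary output and applies the inverse logit with plogis(); the selection of route types is only for illustration.

```r
# Coefficients copied from the summary output (log-odds scale)
intercept <- -1.60400
coefs <- c(Crossover = 1.29806, `Maryland (State)` = 1.23764,
           County = 1.04751, Municipality = 0.78898)

baseline_prob <- plogis(intercept)          # predicted probability for the reference level
route_prob    <- plogis(intercept + coefs)  # predicted probability for each listed route type

round(c(Baseline = baseline_prob, route_prob), 3)
```

Even the largest coefficient gives a predicted injury probability below 0.5, which matters when predictions are later classified with a 0.5 threshold.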
Confusion matrix
pred_prob <- predict(model, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
# Confusion matrix with base R
table(Predicted = pred_class, Actual = df_clean$severity_binary)
## Actual
## Predicted 0 1
## 0 133226 73858
# Compute accuracy
accuracy <- mean(pred_class == df_clean$severity_binary)
# Sensitivity (true positive rate)
sensitivity <- sum(pred_class == 1 & df_clean$severity_binary == 1) /
sum(df_clean$severity_binary == 1)
# Specificity (true negative rate)
specificity <- sum(pred_class == 0 & df_clean$severity_binary == 0) /
sum(df_clean$severity_binary == 0)
accuracy
## [1] 0.6433428
sensitivity
## [1] 0
specificity
## [1] 1
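The confusion matrix above has a single row because no predicted probability exceeds 0.5, so every crash is classified as non-injury. The sketch below, on made-up toy values, shows one way to keep the table 2 x 2 by fixing the factor levels and to compare the 0.5 threshold with a threshold at the observed injury rate. The helper `confusion_at()` is hypothetical and the alternative threshold is one possible choice, not the method used in this report.

```r
# Toy outcomes and predicted probabilities (made-up values, all below 0.5
# to mimic the situation above)
actual    <- c(0, 0, 0, 0, 1, 1, 0, 1)
pred_prob <- c(0.17, 0.31, 0.36, 0.17, 0.42, 0.41, 0.36, 0.31)

# Hypothetical helper: confusion matrix and rates at a given threshold
confusion_at <- function(prob, actual, threshold) {
  pred <- factor(as.integer(prob > threshold), levels = c(0, 1))
  act  <- factor(actual, levels = c(0, 1))
  tab  <- table(Predicted = pred, Actual = act)   # always 2 x 2, even if a class is never predicted
  list(
    table       = tab,
    accuracy    = sum(diag(tab)) / sum(tab),
    sensitivity = tab["1", "1"] / sum(tab[, "1"]),
    specificity = tab["0", "0"] / sum(tab[, "0"])
  )
}

at_half <- confusion_at(pred_prob, actual, 0.5)           # threshold used in the report
at_base <- confusion_at(pred_prob, actual, mean(actual))  # threshold at the observed injury rate

at_half$table
at_base$table
```

Lowering the threshold trades specificity for sensitivity; which balance is appropriate depends on the cost of missing an injury crash versus a false alarm.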
ROC curve + AUC
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(df_clean$severity_binary, pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, col = "blue", main = "ROC Curve for Logistic Regression")
auc(roc_obj)
## Area under the curve: 0.5686
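The AUC can be read as the probability that a randomly chosen injury crash receives a higher predicted probability than a randomly chosen non-injury crash, with ties counted as one half. The sketch below computes it from ranks in base R on made-up toy values; the helper `auc_rank()` is hypothetical and is meant to illustrate the definition, not to replace pROC.

```r
# Toy data (made-up values): 1 = injury, 0 = no injury
actual    <- c(0, 0, 0, 0, 1, 1, 0, 1)
pred_prob <- c(0.17, 0.31, 0.36, 0.17, 0.42, 0.41, 0.36, 0.31)

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(prob, actual) {
  n1 <- sum(actual == 1)   # number of injury cases
  n0 <- sum(actual == 0)   # number of non-injury cases
  r  <- rank(prob)         # average ranks, so tied scores count as half
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_toy <- auc_rank(pred_prob, actual)
auc_toy
```

With a single categorical predictor, the model produces only one predicted probability per route type, so many injury and non-injury crashes are tied, which is part of why the AUC stays close to 0.5.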
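The results of the logistic regression analysis indicate that roadway type is a statistically significant predictor of crash severity. R uses the first factor level as the baseline, which here is the group of 20,248 records with a blank route type, not Maryland State Routes. Compared with that baseline, every labeled roadway category except Unknown showed significantly higher odds of an injury crash, with odds ratios between about 2.2 and 3.7. The largest were Crossovers (3.66), Maryland (State) roads (3.45), and Private Routes (3.04); the Unknown category had an extremely wide confidence interval and was not significant (p = 0.74). Because all odds ratios are measured against unlabeled records, they mainly show that crashes with a recorded route type are more likely to be injury crashes than crashes without one, while differences among the labeled road types are more modest. These findings suggest that roadway design and traffic environment may contribute to crash severity, but predictive performance was weak. At the 0.5 threshold the model classified every crash as non-injury, so the accuracy of 0.643 simply equals the share of non-injury crashes, with sensitivity 0 and specificity 1. The AUC of 0.5686 is only slightly better than chance (0.5), indicating that roadway type alone provides limited ability to separate injury from non-injury crashes.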
These findings imply that certain roadway environments may require additional safety measures or targeted interventions to reduce injury-related crashes. Understanding which road types carry higher risk can help inform roadway planning and public safety initiatives. For future research, incorporating additional variables such as speed limit, driver distraction, vehicle movement, or environmental conditions may improve the predictive ability of the model. Further analysis could also examine interactions between roadway type and other crash factors or explore alternative modeling approaches such as decision trees or random forests. Such extensions would provide a more comprehensive understanding of the factors contributing to crash severity.
Maryland Department of Transportation. (n.d.). Crash Reporting – Drivers Data. Maryland Open Data Portal. https://data.maryland.gov/