Introduction

Research Question: How does the type of roadway influence the severity of a crash?

This project investigates the relationship between roadway type and crash severity using a large dataset of traffic incidents. The dataset, Crash Reporting – Drivers Data, contains more than 200,000 observations (207,084 driver records in the analysis file) and includes detailed information about each crash, such as roadway characteristics, vehicle factors, and driver behaviors. It consists of 39 variables describing elements such as crash type, route type, speed limit, vehicle details, and driver condition. Each row represents a single driver involved in a crash, which makes the dataset suitable for modeling crash severity at the driver level. The data come from the Automated Crash Reporting System (ACRS) of the Maryland State Police and were obtained through Data.gov, where the crash reports are publicly available.

I chose this research topic because roadway design and classification play an important role in traffic safety, yet many drivers are unaware of how road type can influence crash severity. Understanding whether certain roadway environments (such as interstates, state routes, or local roads) are associated with higher odds of injury can help inform policy decisions, roadway engineering, and public safety interventions. This analysis also provides an opportunity to apply logistic regression to a real-world transportation safety issue and to draw practical insight from the results.

Data Analysis

To prepare the dataset for logistic regression, several data-wrangling steps were performed. First, only the variables needed for the analysis, crash severity (ACRS.Report.Type) and route type (Route.Type), were selected from the full dataset. Since the severity variable was originally categorical text, a new binary variable was created, coding “Injury Crash” as 1 and every other report type, including “Property Damage Crash”, as 0. Route type was converted to a factor so it could enter the model as a categorical predictor. The dataset was then filtered to remove records with missing (NA) values in either key variable. Two limitations of this cleaning step should be noted. read.csv() reads empty text fields as empty strings rather than NA, so the 20,248 records with a blank route type remain in the data as their own category. Roadway types with very low frequencies were also retained, so estimates for sparse categories such as Service Road and Unknown are imprecise. Throughout this process, the dplyr functions select(), mutate(), and filter() were used to organize and prepare the data for statistical modeling.

Load packages & data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- read.csv("Crash_Reporting_-_Drivers_Data.csv")
df_selected <- df %>%
  select(ACRS.Report.Type, Route.Type)

Create binary crash severity variable

df_selected <- df_selected %>%
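  mutate(
    # 1 = "Injury Crash"; every other report type is coded 0
    severity_binary = ifelse(ACRS.Report.Type == "Injury Crash", 1, 0),
    # Levels sort alphabetically; glm() uses the first level as the reference
    Route.Type = as.factor(Route.Type)
  )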

Handle missing values

df_clean <- df_selected %>%
  filter(!is.na(Route.Type), !is.na(severity_binary))
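
The filter above removes only NA values, so blank route types and rare categories stay in the data. A minimal sketch of how both could be handled is shown below. It runs on a small made-up data frame, and the cutoff min_n is a hypothetical choice, not a value used in this analysis.

```r
library(dplyr)

# Toy data for illustration only (not taken from the crash dataset)
toy <- data.frame(
  ACRS.Report.Type = c("Injury Crash", "Property Damage Crash", "Injury Crash",
                       "Property Damage Crash", "Injury Crash",
                       "Property Damage Crash"),
  Route.Type = c("County", "County", "", "Maryland (State)",
                 "Maryland (State)", "Spur"),
  stringsAsFactors = FALSE
)

min_n <- 2  # hypothetical minimum number of records per route type

toy_clean <- toy %>%
  mutate(
    Route.Type = na_if(Route.Type, ""),  # treat blank strings as missing
    severity_binary = ifelse(ACRS.Report.Type == "Injury Crash", 1, 0)
  ) %>%
  filter(!is.na(Route.Type), !is.na(severity_binary)) %>%
  add_count(Route.Type, name = "n_route") %>%  # records per route type
  filter(n_route >= min_n) %>%                 # drop rare route types
  select(-n_route) %>%
  mutate(Route.Type = factor(Route.Type))      # rebuild levels after filtering

table(toy_clean$Route.Type)
```

In the toy data, the blank record and the single Spur record are dropped, leaving two route types with two records each. Applying the same steps to the real data would change the reference category and the coefficient estimates reported later.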

Summary

summary(df_clean)
##  ACRS.Report.Type                    Route.Type    severity_binary 
##  Length:207084      Maryland (State)      :77077   Min.   :0.0000  
##  Class :character   County                :55574   1st Qu.:0.0000  
##  Mode  :character                         :20248   Median :0.0000  
##                     County Route          :12395   Mean   :0.3567  
##                     Maryland (State) Route:11902   3rd Qu.:1.0000  
##                     Municipality          : 9380   Max.   :1.0000  
##                     (Other)               :20508

Statistical Analysis

For this research question, I selected logistic regression to examine how roadway type influences the probability that a crash results in an injury. Logistic regression is appropriate because the dependent variable, Crash Severity, is binary (Injury Crash = 1, Property Damage Crash = 0). The goal of this analysis is to determine whether certain roadway types (such as interstates, state routes, or local roads) are associated with higher or lower odds of a crash resulting in injury. Logistic regression allows us to estimate these differences using log-odds and odds ratios, making it well suited for modeling binary outcomes.

The final model uses crash severity as the binary dependent variable and roadway type as the categorical predictor, implemented through the logistic regression function glm(family = binomial). After fitting the model, the summary output is used to evaluate the statistical significance of each roadway category. The coefficients are interpreted in terms of log-odds to determine how roadway type affects injury likelihood relative to a baseline category. To evaluate model performance, a confusion matrix is generated using a 0.5 probability threshold, and model accuracy, sensitivity, and specificity are calculated. A Receiver Operating Characteristic (ROC) curve is plotted, and the Area Under the Curve (AUC) is reported to assess how well the model distinguishes between injury and non-injury crashes.

Fit the logistic regression model

model <- glm(severity_binary ~ Route.Type, 
             data = df_clean, 
             family = binomial)

summary(model)
## 
## Call:
## glm(formula = severity_binary ~ Route.Type, family = binomial, 
##     data = df_clean)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -1.60400    0.01882 -85.215  < 2e-16 ***
## Route.TypeBicycle Route           1.02264    0.10173  10.053  < 2e-16 ***
## Route.TypeCounty                  1.04751    0.02078  50.398  < 2e-16 ***
## Route.TypeCounty Route            0.98558    0.02662  37.018  < 2e-16 ***
## Route.TypeCrossover               1.29806    0.13650   9.510  < 2e-16 ***
## Route.TypeGovernment              1.05814    0.08498  12.452  < 2e-16 ***
## Route.TypeGovernment Route        0.82917    0.17730   4.677 2.92e-06 ***
## Route.TypeInterstate (State)      1.03781    0.03950  26.274  < 2e-16 ***
## Route.TypeLocal Route             0.91966    0.08338  11.029  < 2e-16 ***
## Route.TypeMaryland (State)        1.23764    0.02020  61.275  < 2e-16 ***
## Route.TypeMaryland (State) Route  0.98728    0.02690  36.708  < 2e-16 ***
## Route.TypeMunicipality            0.78898    0.02925  26.974  < 2e-16 ***
## Route.TypeMunicipality Route      0.87112    0.04691  18.571  < 2e-16 ***
## Route.TypeOther Public Roadway    0.96065    0.05867  16.373  < 2e-16 ***
## Route.TypePrivate Route           1.11259    0.12590   8.837  < 2e-16 ***
## Route.TypeRamp                    1.02053    0.06254  16.318  < 2e-16 ***
## Route.TypeService Road            0.82124    0.30237   2.716  0.00661 ** 
## Route.TypeSpur                    0.99621    0.11998   8.303  < 2e-16 ***
## Route.TypeUnknown                -8.96203   26.71387  -0.335  0.73726    
## Route.TypeUS (State)              1.03129    0.02890  35.684  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 269818  on 207083  degrees of freedom
## Residual deviance: 265173  on 207064  degrees of freedom
## AIC: 265213
## 
## Number of Fisher Scoring iterations: 9
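
The deviance lines in the summary can be turned into an overall test of whether roadway type improves on an intercept-only model. The sketch below uses only the four numbers printed above; the variable names are illustrative. For 0/1 outcomes like this one the deviance equals minus twice the log-likelihood, so the ratio of deviances also gives McFadden's pseudo R-squared.

```r
# Deviances and degrees of freedom copied from the summary(model) output
null_dev  <- 269818
null_df   <- 207083
resid_dev <- 265173
resid_df  <- 207064

# Likelihood-ratio test: drop in deviance from adding Route.Type
lr_stat <- null_dev - resid_dev   # 4645
lr_df   <- null_df - resid_df     # 19 route-type coefficients
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (share of null deviance explained)
mcfadden_r2 <- 1 - resid_dev / null_dev

round(c(lr_stat = lr_stat, lr_df = lr_df, mcfadden_r2 = mcfadden_r2), 4)
```

The drop in deviance is large relative to its 19 degrees of freedom, so roadway type is statistically significant as a whole, yet it explains under 2% of the null deviance. Both statements are consistent with a very large sample and a weak predictor.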

Interpret coefficients

exp(coef(model))
##                      (Intercept)          Route.TypeBicycle Route 
##                     0.2010914699                     2.7805246297 
##                 Route.TypeCounty           Route.TypeCounty Route 
##                     2.8505440797                     2.6793567088 
##              Route.TypeCrossover             Route.TypeGovernment 
##                     3.6621847202                     2.8810028012 
##       Route.TypeGovernment Route     Route.TypeInterstate (State) 
##                     2.2914165076                     2.8230217310 
##            Route.TypeLocal Route       Route.TypeMaryland (State) 
##                     2.5084344898                     3.4474645395 
## Route.TypeMaryland (State) Route           Route.TypeMunicipality 
##                     2.6839298294                     2.2011527200 
##     Route.TypeMunicipality Route   Route.TypeOther Public Roadway 
##                     2.3895954513                     2.6133973514 
##          Route.TypePrivate Route                   Route.TypeRamp 
##                     3.0422210654                     2.7746665915 
##           Route.TypeService Road                   Route.TypeSpur 
##                     2.2733080489                     2.7079938082 
##                Route.TypeUnknown             Route.TypeUS (State) 
##                     0.0001281855                     2.8046731066
exp(confint.default(model))
##                                         2.5 %       97.5 %
## (Intercept)                      1.938079e-01 2.086488e-01
## Route.TypeBicycle Route          2.277901e+00 3.394053e+00
## Route.TypeCounty                 2.736755e+00 2.969064e+00
## Route.TypeCounty Route           2.543125e+00 2.822886e+00
## Route.TypeCrossover              2.802530e+00 4.785532e+00
## Route.TypeGovernment             2.438998e+00 3.403109e+00
## Route.TypeGovernment Route       1.618777e+00 3.243553e+00
## Route.TypeInterstate (State)     2.612716e+00 3.050256e+00
## Route.TypeLocal Route            2.130234e+00 2.953780e+00
## Route.TypeMaryland (State)       3.313654e+00 3.586678e+00
## Route.TypeMaryland (State) Route 2.546113e+00 2.829206e+00
## Route.TypeMunicipality           2.078512e+00 2.331030e+00
## Route.TypeMunicipality Route     2.179698e+00 2.619705e+00
## Route.TypeOther Public Roadway   2.329505e+00 2.931888e+00
## Route.TypePrivate Route          2.376952e+00 3.893689e+00
## Route.TypeRamp                   2.454575e+00 3.136500e+00
## Route.TypeService Road           1.256849e+00 4.111814e+00
## Route.TypeSpur                   2.140517e+00 3.425915e+00
## Route.TypeUnknown                2.338562e-27 7.026333e+18
## Route.TypeUS (State)             2.650221e+00 2.968126e+00
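
Odds ratios are easier to read once they are converted back to probabilities. The sketch below does this for two coefficients copied from the summary output. The baseline is whichever Route.Type level has no row in the coefficient table, which here is the blank route type; the variable names are illustrative.

```r
# Coefficients copied from the summary(model) output
b0   <- -1.60400   # (Intercept): log-odds of injury for the baseline level
b_md <-  1.23764   # Route.TypeMaryland (State)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_base <- plogis(b0)          # about 0.17
p_md   <- plogis(b0 + b_md)   # about 0.41

# The odds ratio is the exponentiated coefficient
or_md <- exp(b_md)            # about 3.45

round(c(p_base = p_base, p_md = p_md, or_md = or_md), 3)
```

An odds ratio of about 3.45 therefore corresponds to a predicted injury probability of roughly 41% on Maryland (State) roads, compared with roughly 17% for the baseline group. Both values are below 0.5, which matters for the classification step that follows.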

Confusion matrix

pred_prob <- predict(model, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Confusion matrix with base R
table(Predicted = pred_class, Actual = df_clean$severity_binary)
##          Actual
## Predicted      0      1
##         0 133226  73858
# Compute accuracy
accuracy <- mean(pred_class == df_clean$severity_binary)

# Sensitivity (true positive rate)
sensitivity <- sum(pred_class == 1 & df_clean$severity_binary == 1) /
               sum(df_clean$severity_binary == 1)

# Specificity (true negative rate)
specificity <- sum(pred_class == 0 & df_clean$severity_binary == 0) /
               sum(df_clean$severity_binary == 0)

accuracy
## [1] 0.6433428
sensitivity
## [1] 0
specificity
## [1] 1
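
A sensitivity of 0 with a specificity of 1 means that no crash was classified as an injury crash: every predicted probability fell below the 0.5 threshold. The sketch below reproduces that behaviour on a small made-up dataset and compares it with a threshold set at the observed injury rate. The toy data, the classify() helper, and the alternative threshold are assumptions for illustration, not part of the original analysis.

```r
# Toy data: two route types with injury rates of 20% and 40%
toy <- data.frame(
  route  = factor(rep(c("A", "B"), each = 100)),
  injury = c(rep(c(1, 0), times = c(20, 80)),
             rep(c(1, 0), times = c(40, 60)))
)

fit <- glm(injury ~ route, data = toy, family = binomial)
p   <- predict(fit, type = "response")

max(p)  # below 0.5, so a 0.5 cutoff never predicts an injury

# Hypothetical helper: sensitivity and specificity at a given cutoff
classify <- function(prob, actual, cutoff) {
  pred <- as.integer(prob > cutoff)
  c(sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

at_half <- classify(p, toy$injury, 0.5)               # all predicted 0
at_prev <- classify(p, toy$injury, mean(toy$injury))  # cutoff = injury rate

rbind(at_half, at_prev)
```

With the cutoff at the injury rate, the toy model trades some specificity for non-zero sensitivity. Checking max(pred_prob) in the real analysis would confirm whether the same explanation applies there.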

ROC curve + AUC

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_obj <- roc(df_clean$severity_binary, pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, col = "blue", main = "ROC Curve for Logistic Regression")

auc(roc_obj)
## Area under the curve: 0.5686
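
A model with one categorical predictor can produce only as many distinct predicted probabilities as there are categories, which limits how well it can rank crashes. The sketch below computes the AUC by hand with the rank (Mann-Whitney) formula on made-up data; it is an illustration of what the AUC measures, not pROC's implementation, and auc_by_rank() is a hypothetical helper.

```r
# Toy data: two route types with injury rates of 20% and 40%
route  <- rep(c("A", "B"), each = 100)
injury <- c(rep(c(1, 0), times = c(20, 80)),
            rep(c(1, 0), times = c(40, 60)))

# For a one-factor logistic model, fitted probabilities equal group rates
score <- ave(injury, route)   # 0.2 for route A, 0.4 for route B

# AUC = probability that a random injury crash outscores a random
# non-injury crash, counting ties as one half
auc_by_rank <- function(score, actual) {
  r     <- rank(score)        # ties receive their average rank
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_auc <- auc_by_rank(score, injury)
round(toy_auc, 3)   # about 0.619
```

Because most injury and non-injury pairs in the toy data share the same route type, they tie and contribute only one half each, which pulls the AUC toward 0.5. The same mechanism may explain why the real model's AUC stays close to 0.5.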

Conclusions & Future Directions

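The results of the logistic regression analysis indicate that roadway type is a statistically significant predictor of crash severity. Because R uses the first factor level as the reference, the baseline in this model is the group of records with a blank route type, not Maryland State Routes. Every named roadway category except Unknown showed higher odds of an injury crash than this baseline, with odds ratios between about 2.2 and 3.7. Crossovers (3.66), Maryland (State) roads (3.45), and Private Routes (3.04) had the largest estimates, while the estimate for Unknown is unreliable (standard error 26.7, p = 0.74). The differences among the named road types are much smaller than their shared difference from the blank baseline, so these findings suggest that roadway environment is associated with crash severity, but they should be read with the choice of baseline in mind. Predictive performance is weak. The AUC is about 0.57, and at the 0.5 threshold the model classified every crash as non-injury (sensitivity 0, specificity 1), so its accuracy of 64.3% simply equals the share of non-injury crashes. Roadway type alone therefore provides limited predictive power, although it still offers some insight into patterns of crash severity.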

These findings imply that certain roadway environments may require additional safety measures or targeted interventions to reduce injury-related crashes. Understanding which road types carry higher risk can help inform roadway planning and public safety initiatives. For future research, incorporating additional variables such as speed limit, driver distraction, vehicle movement, or environmental conditions may improve the predictive ability of the model. Further analysis could also examine interactions between roadway type and other crash factors or explore alternative modeling approaches such as decision trees or random forests. Such extensions would provide a more comprehensive understanding of the factors contributing to crash severity.

References

Maryland Department of Transportation. (n.d.). Crash Reporting – Drivers Data. Maryland Open Data Portal. https://data.maryland.gov/
