Group Members

No. Name Matric No.
1 Chan See Mun 24085842
2 Khoo Wui Keat 24075066
3 Lim Hon Ting S2114212
4 Lim Zheng Yu U2102809
5 Sam Wei Hong U2102776

Introduction

Traffic accidents represent a critical concern in urban safety and transportation systems. Understanding the conditions that contribute to injuries in such events can inform policies and intervention strategies to reduce harm. This project aims to build predictive models that classify whether an accident will result in an injury and estimate the total number of injuries using various machine learning approaches.

Objectives

  • To explore patterns and trends and providing insights into accident distribution across different variables.
  • To utilize machine learning models (XGBoost, Random Forest, Generalized Linear Model) to predict whether an accident will result in an injury and estimate the total number of injuries.
  • To evaluate the performance of the proposed model using comprehensive metrics, including overall accuracy, sensitivity (recall), F1 score, as well as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

Load Libraries

knitr::opts_chunk$set(message = FALSE)

packages <- c(
  "tidyverse",    # Data manipulation and ggplot2
  "dplyr",        # Data manipulation
  "lubridate",    # Date/time manipulation
  "caret",        # Data splitting and model training
  "ranger",       # Random Forest model
  "speedglm",       # Generalized linear model
  "rpart",        # Decision tree model
  "rpart.plot",   # Plotting rpart trees
  "randomForest", # Traditional random forest implementation
  "xgboost"       # Extreme Gradient Boosting
)

for (pkg in packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  suppressPackageStartupMessages({
    library(pkg, character.only = TRUE)
  })
}

set.seed(42)     # For reproducibility

Data Understanding

The dataset titled Traffic Accidents we used is taken from Kaggle, which was originally obtained from the internet and uploaded 5 months ago by its author. It provides a detailed record of traffic accidents along with key variables such as date, time, weather conditions, lighting, and road surface type. severity indicators like the number of injuries and fatalities are also included. This dataset is often used for exploratory data analysis and predictive modeling, allowing users to identify high-risk conditions, analyze accident patterns, and visualize trends to support the development of effective safety measures.

Load data

After we downloaded the dataset from Kaggle as a CSV file named traffic_accidents.csv, we loaded it into R for analysis.

accidents <- read.csv("traffic_accidents.csv")

Dataset structure

To gain an initial understanding of the dataset, we examined the first few rows using the head() function.

head(accidents)
##               crash_date traffic_control_device weather_condition
## 1 07/29/2023 01:00:00 PM         TRAFFIC SIGNAL             CLEAR
## 2 08/13/2023 12:11:00 AM         TRAFFIC SIGNAL             CLEAR
## 3 12/09/2021 10:30:00 AM         TRAFFIC SIGNAL             CLEAR
## 4 08/09/2023 07:55:00 PM         TRAFFIC SIGNAL             CLEAR
## 5 08/19/2023 02:55:00 PM         TRAFFIC SIGNAL             CLEAR
## 6 09/06/2023 12:59:00 AM            NO CONTROLS              RAIN
##       lighting_condition first_crash_type trafficway_type          alignment
## 1               DAYLIGHT          TURNING     NOT DIVIDED STRAIGHT AND LEVEL
## 2 DARKNESS, LIGHTED ROAD          TURNING        FOUR WAY STRAIGHT AND LEVEL
## 3               DAYLIGHT         REAR END  T-INTERSECTION STRAIGHT AND LEVEL
## 4               DAYLIGHT            ANGLE        FOUR WAY STRAIGHT AND LEVEL
## 5               DAYLIGHT         REAR END  T-INTERSECTION STRAIGHT AND LEVEL
## 6 DARKNESS, LIGHTED ROAD     FIXED OBJECT     NOT DIVIDED STRAIGHT AND LEVEL
##   roadway_surface_cond road_defect                       crash_type
## 1              UNKNOWN     UNKNOWN           NO INJURY / DRIVE AWAY
## 2                  DRY  NO DEFECTS           NO INJURY / DRIVE AWAY
## 3                  DRY  NO DEFECTS           NO INJURY / DRIVE AWAY
## 4                  DRY  NO DEFECTS INJURY AND / OR TOW DUE TO CRASH
## 5              UNKNOWN     UNKNOWN           NO INJURY / DRIVE AWAY
## 6                  WET     UNKNOWN INJURY AND / OR TOW DUE TO CRASH
##   intersection_related_i        damage             prim_contributory_cause
## 1                      Y $501 - $1,500                 UNABLE TO DETERMINE
## 2                      Y   OVER $1,500          IMPROPER TURNING/NO SIGNAL
## 3                      Y $501 - $1,500               FOLLOWING TOO CLOSELY
## 4                      Y   OVER $1,500                 UNABLE TO DETERMINE
## 5                      Y $501 - $1,500 DRIVING SKILLS/KNOWLEDGE/EXPERIENCE
## 6                      N $501 - $1,500                 UNABLE TO DETERMINE
##   num_units       most_severe_injury injuries_total injuries_fatal
## 1         2  NO INDICATION OF INJURY              0              0
## 2         2  NO INDICATION OF INJURY              0              0
## 3         3  NO INDICATION OF INJURY              0              0
## 4         2 NONINCAPACITATING INJURY              5              0
## 5         2  NO INDICATION OF INJURY              0              0
## 6         1 NONINCAPACITATING INJURY              2              0
##   injuries_incapacitating injuries_non_incapacitating
## 1                       0                           0
## 2                       0                           0
## 3                       0                           0
## 4                       0                           5
## 5                       0                           0
## 6                       0                           2
##   injuries_reported_not_evident injuries_no_indication crash_hour
## 1                             0                      3         13
## 2                             0                      2          0
## 3                             0                      3         10
## 4                             0                      0         19
## 5                             0                      3         14
## 6                             0                      0          0
##   crash_day_of_week crash_month
## 1                 7           7
## 2                 1           8
## 3                 5          12
## 4                 4           8
## 5                 7           8
## 6                 4           9

We also use the glimpse() function that provides a concise summary of the dataset, showing the column names, data types, and a preview of their values. The result revealed that the dataset contains 209,306 rows and 24 columns.

glimpse(accidents)
## Rows: 209,306
## Columns: 24
## $ crash_date                    <chr> "07/29/2023 01:00:00 PM", "08/13/2023 12…
## $ traffic_control_device        <chr> "TRAFFIC SIGNAL", "TRAFFIC SIGNAL", "TRA…
## $ weather_condition             <chr> "CLEAR", "CLEAR", "CLEAR", "CLEAR", "CLE…
## $ lighting_condition            <chr> "DAYLIGHT", "DARKNESS, LIGHTED ROAD", "D…
## $ first_crash_type              <chr> "TURNING", "TURNING", "REAR END", "ANGLE…
## $ trafficway_type               <chr> "NOT DIVIDED", "FOUR WAY", "T-INTERSECTI…
## $ alignment                     <chr> "STRAIGHT AND LEVEL", "STRAIGHT AND LEVE…
## $ roadway_surface_cond          <chr> "UNKNOWN", "DRY", "DRY", "DRY", "UNKNOWN…
## $ road_defect                   <chr> "UNKNOWN", "NO DEFECTS", "NO DEFECTS", "…
## $ crash_type                    <chr> "NO INJURY / DRIVE AWAY", "NO INJURY / D…
## $ intersection_related_i        <chr> "Y", "Y", "Y", "Y", "Y", "N", "Y", "Y", …
## $ damage                        <chr> "$501 - $1,500", "OVER $1,500", "$501 - …
## $ prim_contributory_cause       <chr> "UNABLE TO DETERMINE", "IMPROPER TURNING…
## $ num_units                     <int> 2, 2, 3, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ most_severe_injury            <chr> "NO INDICATION OF INJURY", "NO INDICATIO…
## $ injuries_total                <dbl> 0, 0, 0, 5, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0…
## $ injuries_fatal                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ injuries_incapacitating       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ injuries_non_incapacitating   <dbl> 0, 0, 0, 5, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0…
## $ injuries_reported_not_evident <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ injuries_no_indication        <dbl> 3, 2, 3, 0, 3, 0, 2, 1, 3, 4, 1, 2, 2, 2…
## $ crash_hour                    <int> 13, 0, 10, 19, 14, 0, 11, 14, 18, 17, 20…
## $ crash_day_of_week             <int> 7, 1, 5, 4, 7, 4, 3, 4, 2, 5, 5, 4, 5, 4…
## $ crash_month                   <int> 7, 8, 12, 8, 8, 9, 12, 9, 6, 9, 9, 11, 7…

We then generated summary statistics to get an overview of the distribution and central tendency of each variable using the summary() function.

summary(accidents)
##   crash_date        traffic_control_device weather_condition
##  Length:209306      Length:209306          Length:209306
##  Class :character   Class :character       Class :character
##  Mode  :character   Mode  :character       Mode  :character
##
##
##
##  lighting_condition first_crash_type   trafficway_type     alignment
##  Length:209306      Length:209306      Length:209306      Length:209306
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##  roadway_surface_cond road_defect         crash_type
##  Length:209306        Length:209306      Length:209306
##  Class :character     Class :character   Class :character
##  Mode  :character     Mode  :character   Mode  :character
##
##
##
##  intersection_related_i    damage          prim_contributory_cause
##  Length:209306          Length:209306      Length:209306
##  Class :character       Class :character   Class :character
##  Mode  :character       Mode  :character   Mode  :character
##
##
##
##    num_units      most_severe_injury injuries_total    injuries_fatal
##  Min.   : 1.000   Length:209306      Min.   : 0.0000   Min.   :0.000000
##  1st Qu.: 2.000   Class :character   1st Qu.: 0.0000   1st Qu.:0.000000
##  Median : 2.000   Mode  :character   Median : 0.0000   Median :0.000000
##  Mean   : 2.063                      Mean   : 0.3827   Mean   :0.001858
##  3rd Qu.: 2.000                      3rd Qu.: 1.0000   3rd Qu.:0.000000
##  Max.   :11.000                      Max.   :21.0000   Max.   :3.000000
##  injuries_incapacitating injuries_non_incapacitating
##  Min.   :0.0000          Min.   : 0.0000
##  1st Qu.:0.0000          1st Qu.: 0.0000
##  Median :0.0000          Median : 0.0000
##  Mean   :0.0381          Mean   : 0.2212
##  3rd Qu.:0.0000          3rd Qu.: 0.0000
##  Max.   :7.0000          Max.   :21.0000
##  injuries_reported_not_evident injuries_no_indication   crash_hour
##  Min.   : 0.0000               Min.   : 0.000         Min.   : 0.00
##  1st Qu.: 0.0000               1st Qu.: 2.000         1st Qu.: 9.00
##  Median : 0.0000               Median : 2.000         Median :14.00
##  Mean   : 0.1215               Mean   : 2.244         Mean   :13.37
##  3rd Qu.: 0.0000               3rd Qu.: 3.000         3rd Qu.:17.00
##  Max.   :15.0000               Max.   :49.000         Max.   :23.00
##  crash_day_of_week  crash_month
##  Min.   :1.000     Min.   : 1.000
##  1st Qu.:2.000     1st Qu.: 4.000
##  Median :4.000     Median : 7.000
##  Mean   :4.144     Mean   : 6.772
##  3rd Qu.:6.000     3rd Qu.:10.000
##  Max.   :7.000     Max.   :12.000

To assess data completeness, we checked the number of missing values in each column. The output confirms that no missing values are present in any of the columns in this dataset. This simplifies the preprocessing phase, as we do not need to perform any imputation or deletion for missing data.

colSums(is.na(accidents))
##                    crash_date        traffic_control_device
##                             0                             0
##             weather_condition            lighting_condition
##                             0                             0
##              first_crash_type               trafficway_type
##                             0                             0
##                     alignment          roadway_surface_cond
##                             0                             0
##                   road_defect                    crash_type
##                             0                             0
##        intersection_related_i                        damage
##                             0                             0
##       prim_contributory_cause                     num_units
##                             0                             0
##            most_severe_injury                injuries_total
##                             0                             0
##                injuries_fatal       injuries_incapacitating
##                             0                             0
##   injuries_non_incapacitating injuries_reported_not_evident
##                             0                             0
##        injuries_no_indication                    crash_hour
##                             0                             0
##             crash_day_of_week                   crash_month
##                             0                             0

Feature Engineering

Through feature engineering, we are able to enhance the dataset by creating new variables that can improve model performance.

Extract Temporal Features

We began by converting the crash_date column to a proper datetime format and then derive several useful time-based features such as hour of the day, day of the week, month, and year.

accidents <- accidents %>%
  mutate(crash_date = parse_date_time(crash_date, orders = "mdy HMS p"))

accidents <- accidents %>%
  mutate(
    crash_hour_of_day = hour(crash_date),
    crash_day_of_week = wday(crash_date, label = TRUE, abbr = FALSE),
    crash_month = month(crash_date, label = TRUE, abbr = FALSE),
    crash_year = year(crash_date)
  )

Create Severity Indicator

We also created a new categorical variable severity to classify each accident based on the most serious injury reported. This consolidated multiple injury-related columns into a single interpretable severity level.

# Create severity indicator
accidents <- accidents %>%
  mutate(
    severity = case_when(
      injuries_fatal > 0 ~ "Fatal",
      injuries_incapacitating > 0 ~ "Severe",
      injuries_non_incapacitating > 0 ~ "Moderate",
      injuries_reported_not_evident > 0 ~ "Minor",
      TRUE ~ "No_Injury"
    ),
    severity = factor(severity, levels = c("No_Injury", "Minor", "Moderate", "Severe", "Fatal"))
  )

Exploratory Data Analysis (EDA)

We conducted EDA to identify patterns and trends in the dataset, providing insights into accident distribution across different variables that inform model development.

Accident Frequency by Time

Hour of Day

We visualized the distribution of traffic accidents by hour of the day using a bar chart It helped reveal when accidents were most frequent throughout a typical day.

ggplot(accidents, aes(x = crash_hour_of_day)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Accidents by Hour of Day",
       x = "Hour of Day",
       y = "Number of Accidents") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(hjust = 0.5))

The plot showed a clear increase in accident frequency starting around 6 AM, with peaks observed between 3 PM and 5 PM. These hours likely correspond to morning and evening rush periods, where road congestion and driver fatigue may contribute to higher accident rates. Accidents were least frequent during the early morning hours (1 - 5 AM) with lower traffic volumes.

Day of Week

We visualized the number of traffic accidents for each day of the week using a bar chart. This provided insight into weekly patterns in accident frequency.

ggplot(accidents, aes(x = crash_day_of_week)) +
  geom_bar(fill = "darkorange") +
  labs(title = "Accidents by Day of Week",
       x = "Day of Week",
       y = "Number of Accidents") +
  theme_bw(base_size = 12) +

  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))

The plot showed that Friday had the highest number of reported accidents, followed closely by Thursday, Tuesday, and Wednesday. Sunday recorded the fewest accidents. This pattern suggested that traffic volume related to weekday work commutes may contribute to increased accident rates on weekdays, particularly toward the end of the workweek.

Month

Furthermore, we examined the distribution of traffic accidents across different months to explore potential seasonal patterns.

ggplot(accidents, aes(x = crash_month)) +
  geom_bar(fill = "darkgreen") +
  labs(title = "Accidents by Month",
       x = "Month",
       y = "Number of Accidents") +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))

The chart showed that October recorded the highest number of accidents, followed by September and December. In contrast, February had the fewest accidents.

Accident Severity Analysis

Severity distribution

We visualized the distribution of accidents based on their severity levels through a bar chart to understand the overall impact of reported incidents.

ggplot(accidents, aes(x = severity, fill = severity)) +
  geom_bar() +
  scale_fill_brewer(palette = "Reds") +
  labs(title = "Severity Distribution", x = "Severity", y = "Count") +
  theme_bw(base_size = 14) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

The vast majority of traffic accidents result in no injuries, with moderate injuries being the most common type among incidents that do cause harm, and fatal accidents being extremely rare.

Weather

We analyzed how accident severity varied across different weather conditions using a proportional bar chart. This allowed for comparison of severity composition within each weather type.

ggplot(accidents, aes(x = weather_condition, fill = severity)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Reds") +
  labs(title = "Severity by Weather", x = "Weather", y = "Proportion") +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5))

While most accidents across all weather conditions result in no injuries, adverse conditions like freezing rain, sleet/hail, and snow show a proportionally higher likelihood of leading to more severe injuries.

Lighting

We also explored the relationship between lighting conditions and accident severity using a proportional bar chart. This helped assess whether poor visibility contributed to more severe outcomes.

ggplot(accidents, aes(x = lighting_condition, fill = severity)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Reds") +
  labs(title = "Severity by Lighting", y = "Proportion") +
  theme_bw(base_size = 14) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))

Accidents occurring in darkness (both unlit and lighted roads), dawn, and dusk show a proportionally greater incidence of injuries compared to those occurring during daylight hours, where no-injury outcomes are most dominant.

Crash Type Analysis

Most Common Crash Types

We examined the distribution of crash types to identify the most frequently occurring scenarios.

accidents %>%
  count(first_crash_type) %>%
  arrange(desc(n)) %>%
  mutate(first_crash_type = fct_reorder(first_crash_type, n)) %>%
  ggplot(aes(x = first_crash_type, y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Most Common Crash Types",
       x = "Crash Type",
       y = "Count") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(hjust = 0.5))

“Turning,” “Angle,” and “Rear End” collisions are the most frequently occurring crash types, significantly outnumbering other incident categories.

Crash Types by Severity

Then, we analyzed how accident severity varied across different crash types using a proportional bar chart. This helped identify which types of crashes were more likely to result in severe outcomes.

ggplot(accidents, aes(x = first_crash_type, fill = severity)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Reds") +
  labs(title = "Crash Types by Severity",
       x = "Crash Type",
       y = "Proportion") +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))

Crashes involving “Pedestrians” and “Pedalcyclists”, along with “Fixed Object” and “Head On” collisions, demonstrate a disproportionately high likelihood of resulting in injuries, including severe and fatal outcomes, compared to other crash types.

Contributory Causes

Top Contributory Causes

Using a bar chart, we identified the most frequently reported primary contributory causes of traffic accidents.

accidents %>%
  count(prim_contributory_cause) %>%
  arrange(desc(n)) %>%
  top_n(10) %>% #
  mutate(prim_contributory_cause = fct_reorder(prim_contributory_cause, n)) %>%
  ggplot(aes(x = prim_contributory_cause, y = n)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "Top 10 Contributory Causes",
       x = "Cause",
       y = "Count") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(hjust = 0.5))

“Unable to Determine” is the most frequently cited cause of accidents, followed by driver behaviors such as “Failing to Yield Right-of-Way” and “Following Too Closely.”

Causes by Severity Rate

Beyond frequency, we also examined which primary contributory causes were most likely to result in severe or fatal accidents. This was done by calculating the proportion of severe cases for each cause.

accidents %>%
  filter(!is.na(prim_contributory_cause)) %>%
  group_by(prim_contributory_cause) %>%
  summarise(
    count = n(),
    severe_rate = sum(severity %in% c("Severe", "Fatal")) / n()
  ) %>%
  filter(count > 2) %>%
  arrange(desc(severe_rate)) %>%
  top_n(10, severe_rate) %>%
  mutate(prim_contributory_cause = fct_reorder(prim_contributory_cause, severe_rate)) %>%
  ggplot(aes(x = prim_contributory_cause, y = severe_rate)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Top 10 Most Dangerous Causes",
       x = "Cause",
       y = "Severe Accident Rate") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_text(angle = 45, hjust = 1,size = 8))

The chart highlighted that causes such as bicycle advancing legally on red light, physical condition of driver, and exceeding the authorized speed limit had the highest rates of severe or fatal outcomes.

Traffic Control Devices Impact

Severity by Traffic Control Device

We explored how the type of traffic control device present at the scene correlated with accident severity using a proportional bar chart.

ggplot(accidents, aes(x = traffic_control_device, fill = severity)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Reds") +
  labs(title = "Accident Severity by Traffic Control Device",
       x = "Traffic Control Device",
       y = "Proportion") +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))

Significant proportion of accidents across most traffic control device types result in no injury, incidents at “Pedestrian Crossing Signs” and “Stop Sign/Flasher” locations show a notably higher proportion of injuries, including severe and fatal outcomes.

Injury Rate by Signal Presence

We compared the rate of injury-causing accidents between locations with and without traffic signals. This binary classification highlighted how the presence of traffic control may influence accident severity.

accidents %>%
  mutate(has_signal = ifelse(traffic_control_device == "TRAFFIC SIGNAL", "With Signal", "Without Signal")) %>%
  group_by(has_signal) %>%
  summarise(
    total = n(),
    injury_rate = sum(severity != "No_Injury") / n()
  ) %>%
  ggplot(aes(x = has_signal, y = injury_rate, fill = has_signal)) +
  geom_col() +
  scale_fill_manual(values = c("With Signal" = "darkgreen", "Without Signal" = "darkred")) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Injury Rate: With vs Without Traffic Signals",
       x = "",
       y = "Injury Accident Rate") +
  theme_bw(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

The chart showed that injury rates were remarkably similar between the two groups. This suggested that while traffic signals are essential for managing flow and reducing collisions, they may not significantly affect whether an accident results in injury.

Data Preparation

In data preparation phase, we transformed variables into suitable formats for a more accurate and efficient modeling.

Converting Categorical Variables to Factors

We converted all categorical variables to factor type, including road conditions, crash types, injury levels, and other descriptors.

accidents <- accidents %>%
  mutate(
    traffic_control_device = factor(traffic_control_device),
    weather_condition = factor(weather_condition),
    lighting_condition = factor(lighting_condition),
    first_crash_type = factor(first_crash_type),
    trafficway_type = factor(trafficway_type),
    alignment = factor(alignment),
    roadway_surface_cond = factor(roadway_surface_cond),
    road_defect = factor(road_defect),
    crash_type = factor(crash_type),
    intersection_related_i = factor(intersection_related_i),
    damage = factor(damage, levels = c("$500 OR LESS", "$501 - $1,500", "OVER $1,500")),
    prim_contributory_cause = factor(prim_contributory_cause),
    most_severe_injury = factor(most_severe_injury)
  )

Modeling

In this section, we describe the classification and regression modeling approaches applied to the dataset.

Classification

The problem we are trying to answer is:

How can we classify whether a traffic accident will result in an injury based on features such as road conditions and traffic control devices?

The feature selection process began with an initial set of predictors chosen from our exploratory analysis, including first_crash_type, traffic_control_device, weather_condition, and lighting_condition. To enhance the model’s predictive capabilities, we then engineered several new, more descriptive features. The numeric crash_hour_of_day was transformed into a categorical time_of_day feature (e.g., “Morning Rush”, “Night”) to better capture time-based traffic patterns. Similarly, crash_day_of_week was simplified into a binary is_weekend feature to distinguish between weekday and weekend driving behaviour.

Label encoding is used here to convert categorical (factor) variables into numerical representations. Since most of the machine learning algorithms require numerical input and cannot directly process text-based categorical data. By converting factors to numeric codes, we enable these algorithms to work with our dataset while preserving the distinct categories.

cat("\n--- Label Encoding for Categorical (Factor) Columns ---\n")
##
## --- Label Encoding for Categorical (Factor) Columns ---
categorical_columns_to_encode <- names(accidents)[sapply(accidents, is.factor) & names(accidents) != "crash_type"]

original_factor_levels <- list()

for (col_name in categorical_columns_to_encode) {
  cat(paste0("\nColumn: ", col_name, " Encoding and Decoding:\n"))

  # Store original levels
  original_factor_levels[[col_name]] <- levels(accidents[[col_name]])

  # Convert factor to numeric (this uses the internal integer codes)
  # The first level will be 1, second 2, etc.
  accidents[[col_name]] <- as.numeric(accidents[[col_name]])

  # Print the mapping: original level -> new numeric code
  for (i in seq_along(original_factor_levels[[col_name]])) {
    cat(paste0(original_factor_levels[[col_name]][i], " -> ", i, "\n"))
  }
}
##
## Column: traffic_control_device Encoding and Decoding:
## BICYCLE CROSSING SIGN -> 1
## DELINEATORS -> 2
## FLASHING CONTROL SIGNAL -> 3
## LANE USE MARKING -> 4
## NO CONTROLS -> 5
## NO PASSING -> 6
## OTHER -> 7
## OTHER RAILROAD CROSSING -> 8
## OTHER REG. SIGN -> 9
## OTHER WARNING SIGN -> 10
## PEDESTRIAN CROSSING SIGN -> 11
## POLICE/FLAGMAN -> 12
## RAILROAD CROSSING GATE -> 13
## RR CROSSING SIGN -> 14
## SCHOOL ZONE -> 15
## STOP SIGN/FLASHER -> 16
## TRAFFIC SIGNAL -> 17
## UNKNOWN -> 18
## YIELD -> 19
##
## Column: weather_condition Encoding and Decoding:
## BLOWING SAND, SOIL, DIRT -> 1
## BLOWING SNOW -> 2
## CLEAR -> 3
## CLOUDY/OVERCAST -> 4
## FOG/SMOKE/HAZE -> 5
## FREEZING RAIN/DRIZZLE -> 6
## OTHER -> 7
## RAIN -> 8
## SEVERE CROSS WIND GATE -> 9
## SLEET/HAIL -> 10
## SNOW -> 11
## UNKNOWN -> 12
##
## Column: lighting_condition Encoding and Decoding:
## DARKNESS -> 1
## DARKNESS, LIGHTED ROAD -> 2
## DAWN -> 3
## DAYLIGHT -> 4
## DUSK -> 5
## UNKNOWN -> 6
##
## Column: first_crash_type Encoding and Decoding:
## ANGLE -> 1
## ANIMAL -> 2
## FIXED OBJECT -> 3
## HEAD ON -> 4
## OTHER NONCOLLISION -> 5
## OTHER OBJECT -> 6
## OVERTURNED -> 7
## PARKED MOTOR VEHICLE -> 8
## PEDALCYCLIST -> 9
## PEDESTRIAN -> 10
## REAR END -> 11
## REAR TO FRONT -> 12
## REAR TO REAR -> 13
## REAR TO SIDE -> 14
## SIDESWIPE OPPOSITE DIRECTION -> 15
## SIDESWIPE SAME DIRECTION -> 16
## TRAIN -> 17
## TURNING -> 18
##
## Column: trafficway_type Encoding and Decoding:
## ALLEY -> 1
## CENTER TURN LANE -> 2
## DIVIDED - W/MEDIAN (NOT RAISED) -> 3
## DIVIDED - W/MEDIAN BARRIER -> 4
## DRIVEWAY -> 5
## FIVE POINT, OR MORE -> 6
## FOUR WAY -> 7
## L-INTERSECTION -> 8
## NOT DIVIDED -> 9
## NOT REPORTED -> 10
## ONE-WAY -> 11
## OTHER -> 12
## PARKING LOT -> 13
## RAMP -> 14
## ROUNDABOUT -> 15
## T-INTERSECTION -> 16
## TRAFFIC ROUTE -> 17
## UNKNOWN -> 18
## UNKNOWN INTERSECTION TYPE -> 19
## Y-INTERSECTION -> 20
##
## Column: alignment Encoding and Decoding:
## CURVE ON GRADE -> 1
## CURVE ON HILLCREST -> 2
## CURVE, LEVEL -> 3
## STRAIGHT AND LEVEL -> 4
## STRAIGHT ON GRADE -> 5
## STRAIGHT ON HILLCREST -> 6
##
## Column: roadway_surface_cond Encoding and Decoding:
## DRY -> 1
## ICE -> 2
## OTHER -> 3
## SAND, MUD, DIRT -> 4
## SNOW OR SLUSH -> 5
## UNKNOWN -> 6
## WET -> 7
##
## Column: road_defect Encoding and Decoding:
## DEBRIS ON ROADWAY -> 1
## NO DEFECTS -> 2
## OTHER -> 3
## RUT, HOLES -> 4
## SHOULDER DEFECT -> 5
## UNKNOWN -> 6
## WORN SURFACE -> 7
##
## Column: intersection_related_i Encoding and Decoding:
## N -> 1
## Y -> 2
##
## Column: damage Encoding and Decoding:
## $500 OR LESS -> 1
## $501 - $1,500 -> 2
## OVER $1,500 -> 3
##
## Column: prim_contributory_cause Encoding and Decoding:
## ANIMAL -> 1
## BICYCLE ADVANCING LEGALLY ON RED LIGHT -> 2
## CELL PHONE USE OTHER THAN TEXTING -> 3
## DISREGARDING OTHER TRAFFIC SIGNS -> 4
## DISREGARDING ROAD MARKINGS -> 5
## DISREGARDING STOP SIGN -> 6
## DISREGARDING TRAFFIC SIGNALS -> 7
## DISREGARDING YIELD SIGN -> 8
## DISTRACTION - FROM INSIDE VEHICLE -> 9
## DISTRACTION - FROM OUTSIDE VEHICLE -> 10
## DISTRACTION - OTHER ELECTRONIC DEVICE (NAVIGATION DEVICE, DVD PLAYER, ETC.) -> 11
## DRIVING ON WRONG SIDE/WRONG WAY -> 12
## DRIVING SKILLS/KNOWLEDGE/EXPERIENCE -> 13
## EQUIPMENT - VEHICLE CONDITION -> 14
## EVASIVE ACTION DUE TO ANIMAL, OBJECT, NONMOTORIST -> 15
## EXCEEDING AUTHORIZED SPEED LIMIT -> 16
## EXCEEDING SAFE SPEED FOR CONDITIONS -> 17
## FAILING TO REDUCE SPEED TO AVOID CRASH -> 18
## FAILING TO YIELD RIGHT-OF-WAY -> 19
## FOLLOWING TOO CLOSELY -> 20
## HAD BEEN DRINKING (USE WHEN ARREST IS NOT MADE) -> 21
## IMPROPER BACKING -> 22
## IMPROPER LANE USAGE -> 23
## IMPROPER OVERTAKING/PASSING -> 24
## IMPROPER TURNING/NO SIGNAL -> 25
## MOTORCYCLE ADVANCING LEGALLY ON RED LIGHT -> 26
## NOT APPLICABLE -> 27
## OBSTRUCTED CROSSWALKS -> 28
## OPERATING VEHICLE IN ERRATIC, RECKLESS, CARELESS, NEGLIGENT OR AGGRESSIVE MANNER -> 29
## PASSING STOPPED SCHOOL BUS -> 30
## PHYSICAL CONDITION OF DRIVER -> 31
## RELATED TO BUS STOP -> 32
## ROAD CONSTRUCTION/MAINTENANCE -> 33
## ROAD ENGINEERING/SURFACE/MARKING DEFECTS -> 34
## TEXTING -> 35
## TURNING RIGHT ON RED -> 36
## UNABLE TO DETERMINE -> 37
## UNDER THE INFLUENCE OF ALCOHOL/DRUGS (USE WHEN ARREST IS EFFECTED) -> 38
## VISION OBSCURED (SIGNS, TREE LIMBS, BUILDINGS, ETC.) -> 39
## WEATHER -> 40
##
## Column: most_severe_injury Encoding and Decoding:
## FATAL -> 1
## INCAPACITATING INJURY -> 2
## NO INDICATION OF INJURY -> 3
## NONINCAPACITATING INJURY -> 4
## REPORTED, NOT EVIDENT -> 5
##
## Column: crash_day_of_week Encoding and Decoding:
## Sunday -> 1
## Monday -> 2
## Tuesday -> 3
## Wednesday -> 4
## Thursday -> 5
## Friday -> 6
## Saturday -> 7
##
## Column: crash_month Encoding and Decoding:
## January -> 1
## February -> 2
## March -> 3
## April -> 4
## May -> 5
## June -> 6
## July -> 7
## August -> 8
## September -> 9
## October -> 10
## November -> 11
## December -> 12
##
## Column: severity Encoding and Decoding:
## No_Injury -> 1
## Minor -> 2
## Moderate -> 3
## Severe -> 4
## Fatal -> 5
accidents <- accidents %>% select(-'crash_date')

X <- accidents %>% select(-'crash_type')
y <- accidents[['crash_type']]

Before modeling, we examined the initial distribution of the target variable, crash_type, which captures whether an accident resulted in injury or required a tow. The dataset comprised 91,930 cases (43.9%) labeled as “INJURY AND / OR TOW DUE TO CRASH”, and 117,376 cases (56.1%) labeled as “NO INJURY / DRIVE AWAY”. This indicates a moderately imbalanced class distribution, with non-injury cases slightly more prevalent.

cat("\n--- Initial Class Distribution of 'crash_type' ---\n")
##
## --- Initial Class Distribution of 'crash_type' ---
print(table(y))
## y
## INJURY AND / OR TOW DUE TO CRASH           NO INJURY / DRIVE AWAY
##                            91930                           117376
cat("\n--- Initial Class Proportion of 'crash_type' ---\n")
##
## --- Initial Class Proportion of 'crash_type' ---
print(prop.table(table(y)))
## y
## INJURY AND / OR TOW DUE TO CRASH           NO INJURY / DRIVE AWAY
##                        0.4392134                        0.5607866

Before proceeding with model training, we inspected the target variable crash_type and observed that its factor levels contained characters such as slashes (/) and spaces, which can cause issues during model fitting or result interpretation—especially when using algorithms or libraries that require syntactically valid variable names. To address this, we renamed the factor levels by first replacing spaces with underscores and then applying make.names() to ensure all labels were syntactically valid.

current_target_levels <- levels(accidents[['crash_type']])

levels_with_underscores <- gsub(" ", "_", current_target_levels)

valid_target_levels <- make.names(levels_with_underscores)

accidents[['crash_type']] <- factor(accidents[['crash_type']],
                                       levels = current_target_levels,
                                       labels = valid_target_levels)

y <- accidents[['crash_type']]
cat("\n--- Initial Class Distribution of 'crash_type' ---\n")
##
## --- Initial Class Distribution of 'crash_type' ---
print(table(y))
## y
## INJURY_AND_._OR_TOW_DUE_TO_CRASH           NO_INJURY_._DRIVE_AWAY
##                            91930                           117376
cat("\n--- Initial Class Proportion of 'crash_type' ---\n")
##
## --- Initial Class Proportion of 'crash_type' ---
print(prop.table(table(y)))
## y
## INJURY_AND_._OR_TOW_DUE_TO_CRASH           NO_INJURY_._DRIVE_AWAY
##                        0.4392134                        0.5607866

Data Splitting

set.seed(42)

train_index <- createDataPartition(y, p = 0.7, list = FALSE, times = 1)

X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

train_data <- data.frame(X_train, crash_type = y_train)
plot_confusion_matrix_heatmap <- function(conf_matrix, model_name) {
  cm_table <- as.table(conf_matrix)
  cm_df <- as.data.frame(cm_table)
  names(cm_df) <- c("Prediction", "Reference", "Frequency")

  ggplot(data = cm_df, aes(x = Reference, y = Prediction, fill = Frequency)) +
    geom_tile(color = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(label = sprintf("%d", Frequency)), vjust = 1, color = "black") +
    labs(
      title = paste("Confusion Matrix ", model_name),
      x = "Actual",
      y = "Predicted"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
      axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
      axis.text.y = element_text(size = 10),
      axis.title = element_text(size = 12),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      legend.position = "right"
    ) +
    coord_fixed()
}

We trained two different classification models and evaluated them on a separate test set to find the most effective one. The comparison focused on three key metrics: overall accuracy, sensitivity(recall), and F1 score which is a fairer measure when dealing with imbalanced data.

Random Forest Model

train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = multiClassSummary,

)
set.seed(42)
rf_model <- train(
  crash_type ~ .,
  data = train_data,
  method = "rf",
  trControl = train_control,
  tuneLength = 3
)
cat("--- Random Forest Model Summary ---\n")
## --- Random Forest Model Summary ---
print(rf_model)
## Random Forest
##
## 146515 samples
##     25 predictor
##      2 classes: 'INJURY_AND_._OR_TOW_DUE_TO_CRASH', 'NO_INJURY_._DRIVE_AWAY'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 117213, 117212, 117212, 117212, 117211
## Resampling results across tuning parameters:
##
##   mtry  logLoss    AUC        prAUC      Accuracy   Kappa      F1
##    2    0.5471718  0.9114362  0.7470384  0.8337508  0.6491500  0.7704025
##   13    0.3400071  0.9187443  0.6079218  0.8394226  0.6696509  0.8053004
##   25    0.3466631  0.9170241  0.5971850  0.8378733  0.6667464  0.8042275
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##   0.6350951    0.9893384    0.9790172       0.7758796       0.9790172
##   0.7561189    0.9046663    0.8613374       0.8256784       0.8613374
##   0.7582167    0.9002605    0.8561954       0.8262183       0.8561954
##   Recall     Detection_Rate  Balanced_Accuracy
##   0.6350951  0.2789407       0.8122168
##   0.7561189  0.3320957       0.8303926
##   0.7582167  0.3330171       0.8292386
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 13.
y_pred <- predict(rf_model, newdata = X_test)

rf_conf_matrix <- confusionMatrix(y_pred, y_test, mode = "everything")

cat("\n--- Random Forest Model Evaluation: Classification Report ---\n")
##
## --- Random Forest Model Evaluation: Classification Report ---
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
##                                   Reference
## Prediction                         INJURY_AND_._OR_TOW_DUE_TO_CRASH
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                            20866
##   NO_INJURY_._DRIVE_AWAY                                       6713
##                                   Reference
## Prediction                         NO_INJURY_._DRIVE_AWAY
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                   3480
##   NO_INJURY_._DRIVE_AWAY                            31732
##
##                Accuracy : 0.8377
##                  95% CI : (0.8348, 0.8405)
##     No Information Rate : 0.5608
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6662
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.7566
##             Specificity : 0.9012
##          Pos Pred Value : 0.8571
##          Neg Pred Value : 0.8254
##               Precision : 0.8571
##                  Recall : 0.7566
##                      F1 : 0.8037
##              Prevalence : 0.4392
##          Detection Rate : 0.3323
##    Detection Prevalence : 0.3877
##       Balanced Accuracy : 0.8289
##
##        'Positive' Class : INJURY_AND_._OR_TOW_DUE_TO_CRASH
## 
plot_confusion_matrix_heatmap(rf_conf_matrix, "(Random Forest)")

XGBoost Model

X_train_matrix <- model.matrix(~ . -1, data = X_train)
X_test_matrix <- model.matrix(~ . -1, data = X_test)


y_train_bin <- as.numeric(y_train) - 1
y_test_bin <- as.numeric(y_test) - 1

dtrain <- xgb.DMatrix(data = X_train_matrix, label = y_train_bin)
dtest <- xgb.DMatrix(data = X_test_matrix, label = y_test_bin)
params <- list(
  objective = "binary:logistic",
  eval_metric = "logloss"
)

xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain),
  verbose = 1
)
## [1]  train-logloss:0.547584
## [2]  train-logloss:0.467789
## [3]  train-logloss:0.419169
## [4]  train-logloss:0.387830
## [5]  train-logloss:0.366302
## [6]  train-logloss:0.351629
## [7]  train-logloss:0.341321
## [8]  train-logloss:0.333911
## [9]  train-logloss:0.328446
## [10] train-logloss:0.324356
## [11] train-logloss:0.321032
## [12] train-logloss:0.318676
## [13] train-logloss:0.316785
## [14] train-logloss:0.314899
## [15] train-logloss:0.313734
## [16] train-logloss:0.312740
## [17] train-logloss:0.311817
## [18] train-logloss:0.310911
## [19] train-logloss:0.310186
## [20] train-logloss:0.309508
## [21] train-logloss:0.308978
## [22] train-logloss:0.308284
## [23] train-logloss:0.307944
## [24] train-logloss:0.307539
## [25] train-logloss:0.307181
## [26] train-logloss:0.306570
## [27] train-logloss:0.306257
## [28] train-logloss:0.305557
## [29] train-logloss:0.305248
## [30] train-logloss:0.304535
## [31] train-logloss:0.304242
## [32] train-logloss:0.303759
## [33] train-logloss:0.303426
## [34] train-logloss:0.303032
## [35] train-logloss:0.302747
## [36] train-logloss:0.302373
## [37] train-logloss:0.302179
## [38] train-logloss:0.301790
## [39] train-logloss:0.301169
## [40] train-logloss:0.300843
## [41] train-logloss:0.300361
## [42] train-logloss:0.299990
## [43] train-logloss:0.299623
## [44] train-logloss:0.299125
## [45] train-logloss:0.298876
## [46] train-logloss:0.298467
## [47] train-logloss:0.298294
## [48] train-logloss:0.298053
## [49] train-logloss:0.297867
## [50] train-logloss:0.297542
## [51] train-logloss:0.297392
## [52] train-logloss:0.296869
## [53] train-logloss:0.296544
## [54] train-logloss:0.296154
## [55] train-logloss:0.295736
## [56] train-logloss:0.295417
## [57] train-logloss:0.295328
## [58] train-logloss:0.294973
## [59] train-logloss:0.294491
## [60] train-logloss:0.294201
## [61] train-logloss:0.293883
## [62] train-logloss:0.293696
## [63] train-logloss:0.293350
## [64] train-logloss:0.293144
## [65] train-logloss:0.292838
## [66] train-logloss:0.292609
## [67] train-logloss:0.292371
## [68] train-logloss:0.292072
## [69] train-logloss:0.291838
## [70] train-logloss:0.291621
## [71] train-logloss:0.291272
## [72] train-logloss:0.291091
## [73] train-logloss:0.290759
## [74] train-logloss:0.290526
## [75] train-logloss:0.290355
## [76] train-logloss:0.290271
## [77] train-logloss:0.290132
## [78] train-logloss:0.289889
## [79] train-logloss:0.289739
## [80] train-logloss:0.289632
## [81] train-logloss:0.289268
## [82] train-logloss:0.288967
## [83] train-logloss:0.288630
## [84] train-logloss:0.288518
## [85] train-logloss:0.288248
## [86] train-logloss:0.288108
## [87] train-logloss:0.288058
## [88] train-logloss:0.287947
## [89] train-logloss:0.287733
## [90] train-logloss:0.287466
## [91] train-logloss:0.287303
## [92] train-logloss:0.287028
## [93] train-logloss:0.286882
## [94] train-logloss:0.286717
## [95] train-logloss:0.286430
## [96] train-logloss:0.286103
## [97] train-logloss:0.285827
## [98] train-logloss:0.285619
## [99] train-logloss:0.285357
## [100]    train-logloss:0.285123
xgb_probs <- predict(xgb_model, newdata = dtest)

xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)

predicted_factor <- factor(xgb_preds, levels = c(0,1), labels = levels(y_test))
actual_factor <- factor(y_test_bin, levels = c(0,1), labels = levels(y_test))
xgb_conf_matrix <- confusionMatrix(predicted_factor, actual_factor, mode = "everything")
cat("\n--- XGBoost Model Evaluation: Classification Report ---\n")
##
## --- XGBoost Model Evaluation: Classification Report ---
print(xgb_conf_matrix)
## Confusion Matrix and Statistics
##
##                                   Reference
## Prediction                         INJURY_AND_._OR_TOW_DUE_TO_CRASH
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                            20718
##   NO_INJURY_._DRIVE_AWAY                                       6861
##                                   Reference
## Prediction                         NO_INJURY_._DRIVE_AWAY
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                   2894
##   NO_INJURY_._DRIVE_AWAY                            32318
##
##                Accuracy : 0.8446
##                  95% CI : (0.8418, 0.8475)
##     No Information Rate : 0.5608
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6796
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.7512
##             Specificity : 0.9178
##          Pos Pred Value : 0.8774
##          Neg Pred Value : 0.8249
##               Precision : 0.8774
##                  Recall : 0.7512
##                      F1 : 0.8094
##              Prevalence : 0.4392
##          Detection Rate : 0.3300
##    Detection Prevalence : 0.3760
##       Balanced Accuracy : 0.8345
##
##        'Positive' Class : INJURY_AND_._OR_TOW_DUE_TO_CRASH
## 
plot_confusion_matrix_heatmap(xgb_conf_matrix, "XGBoost")

Between the two models evaluated, XGBoost slightly outperformed Random Forest in terms of overall classification performance. XGBoost achieved a higher accuracy (84.5%) compared to Random Forest’s 83.8%, and also recorded a better F1 score (0.8094 vs. 0.8037), indicating a more balanced performance between precision and recall. Although Random Forest showed marginally higher recall (75.7% vs. 75.1%), XGBoost’s superior F1 score and accuracy suggest it is the more effective model overall for predicting injury-related crashes.

Regression

The problem we are trying to answer is:

To what extent can pre-crash factors such as road conditions and traffic control devices predict the total number of injuries resulting from a traffic accident?

We will approach this problem using two regression techniques, Generalized Linear Model (GLM) and Random Forest.

Feature Selection

We selected only pre-crash features that would realistically be known or observable before the outcome of the crash, along with the target variable injuries_total.

selected_features <- c(
  "traffic_control_device", "weather_condition", "lighting_condition", "first_crash_type", "trafficway_type",
  "alignment", "roadway_surface_cond", "road_defect", "intersection_related_i",
  "prim_contributory_cause", "num_units", "crash_day_of_week", "crash_month",
  "crash_hour_of_day", "crash_year", "injuries_total"
)

Train-Test Split

We split the dataset into an 80% training set and a 20% testing set using stratified sampling based on the target variable. This allows us to evaluate how well the model generalizes to unseen data.

train_index <- createDataPartition(accidents$injuries_total, p = 0.8, list = FALSE)
train_data <- accidents[train_index, selected_features]
test_data  <- accidents[-train_index, selected_features]

Generalized Linear Model (GLM)

As a baseline comparison, we fitted a GLM with a Gaussian distribution using the speedglm package. This model assumes a linear relationship between predictors and the target, serving as a strong interpretable baseline.

speedglm_model <- speedglm(
  injuries_total ~ .,
  data = train_data,
  family = gaussian()
)

speedglm_predictions <- predict(speedglm_model, newdata = test_data, type = "response")

We then evaluated the model using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), which measure prediction accuracy in absolute and squared terms.

speedglm_rmse <- sqrt(mean((speedglm_predictions - test_data$injuries_total)^2))
speedglm_mae  <- mean(abs(speedglm_predictions - test_data$injuries_total))

cat("speedglm RMSE:", round(speedglm_rmse, 4), "\n")
## speedglm RMSE: 0.7326
cat("speedglm MAE :", round(speedglm_mae, 4), "\n")
## speedglm MAE : 0.4697

Random Forest

We trained a Random Forest model, which is a flexible, non-linear ensemble method that captures complex interactions among high-dimensional features. It is trained using the ranger package, which supports fast and multithreaded computation. This model is capable of capturing nonlinear relationships and variable interactions that may influence injury severity.

rf_model <- ranger(injuries_total ~ ., data = train_data, num.trees = 100, num.threads = parallel::detectCores())
rf_predictions <- predict(rf_model, data = test_data)$predictions

We evaluated the model using the same RMSE and MAE metrics to directly compare its performance to the GLM.

rf_rmse <- sqrt(mean((rf_predictions - test_data$injuries_total)^2))
rf_mae  <- mean(abs(rf_predictions - test_data$injuries_total))

cat("Random Forest RMSE:", round(rf_rmse, 4), "\n")
## Random Forest RMSE: 0.7377
cat("Random Forest MAE :", round(rf_mae, 4), "\n")
## Random Forest MAE : 0.4748

Evaluation

We summarize the performance of both models using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

results <- data.frame(
  Model = c("Generalized Linear Model", "Random Forest"),
  RMSE = c(speedglm_rmse, rf_rmse),
  MAE  = c(speedglm_mae,  rf_mae)
)
print(results)
##                      Model      RMSE       MAE
## 1 Generalized Linear Model 0.7325853 0.4696693
## 2            Random Forest 0.7376651 0.4748189

The results show that the Generalized Linear Model (GLM) slightly outperformed the Random Forest model in terms of both RMSE and MAE. While the differences are marginal, this suggests that the relationship between pre-crash features and the number of injuries may be largely linear, and that a simpler model can perform just as well as a more complex one.

results_long <- pivot_longer(results, cols = c("RMSE", "MAE"), names_to = "Metric", values_to = "Value")
ggplot(results_long, aes(x = Model, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Model Performance Comparison (Test Set)", y = "Metric Value") +
  theme_minimal()

Conclusion

This study successfully demonstrated the use of machine learning models to predict injury occurrence and count in traffic accidents. Classification models like XGBoost can effectively predict whether an accident will result in injury, achieving an F1 score above 0.80. On the other hand, regression models show that injury counts can be reasonably estimated using pre-crash features, with GLM slightly outperforming Random Forest. These insights are valuable for transportation authorities and policymakers to identify high-risk conditions, guide infrastructure improvements, and prioritize enforcement efforts to enhance road safety.