Introduction

In this analysis, we predict the number of injuries in traffic accidents (injuries_total) using several regression models. We perform data cleaning, feature engineering, model training, and evaluation.

Load Libraries

library(tidyverse)    # Data manipulation and ggplot2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)        # Data splitting and model training
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
##     lift
library(ranger)       # Random Forest model
library(e1071)        # SVM
library(glmnet)       # Ridge regression
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
##     expand, pack, unpack
##
## Loaded glmnet 4.1-9
library(speedglm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
##     select
##
## Loading required package: biglm
## Loading required package: DBI
set.seed(123)         # For reproducibility

Load data (update path as needed)

data <- read.csv("traffic_accidents.csv")

# View first few rows and structure
head(data)
##               crash_date traffic_control_device weather_condition
## 1 07/29/2023 01:00:00 PM         TRAFFIC SIGNAL             CLEAR
## 2 08/13/2023 12:11:00 AM         TRAFFIC SIGNAL             CLEAR
## 3 12/09/2021 10:30:00 AM         TRAFFIC SIGNAL             CLEAR
## 4 08/09/2023 07:55:00 PM         TRAFFIC SIGNAL             CLEAR
## 5 08/19/2023 02:55:00 PM         TRAFFIC SIGNAL             CLEAR
## 6 09/06/2023 12:59:00 AM            NO CONTROLS              RAIN
##       lighting_condition first_crash_type trafficway_type          alignment
## 1               DAYLIGHT          TURNING     NOT DIVIDED STRAIGHT AND LEVEL
## 2 DARKNESS, LIGHTED ROAD          TURNING        FOUR WAY STRAIGHT AND LEVEL
## 3               DAYLIGHT         REAR END  T-INTERSECTION STRAIGHT AND LEVEL
## 4               DAYLIGHT            ANGLE        FOUR WAY STRAIGHT AND LEVEL
## 5               DAYLIGHT         REAR END  T-INTERSECTION STRAIGHT AND LEVEL
## 6 DARKNESS, LIGHTED ROAD     FIXED OBJECT     NOT DIVIDED STRAIGHT AND LEVEL
##   roadway_surface_cond road_defect                       crash_type
## 1              UNKNOWN     UNKNOWN           NO INJURY / DRIVE AWAY
## 2                  DRY  NO DEFECTS           NO INJURY / DRIVE AWAY
## 3                  DRY  NO DEFECTS           NO INJURY / DRIVE AWAY
## 4                  DRY  NO DEFECTS INJURY AND / OR TOW DUE TO CRASH
## 5              UNKNOWN     UNKNOWN           NO INJURY / DRIVE AWAY
## 6                  WET     UNKNOWN INJURY AND / OR TOW DUE TO CRASH
##   intersection_related_i        damage             prim_contributory_cause
## 1                      Y $501 - $1,500                 UNABLE TO DETERMINE
## 2                      Y   OVER $1,500          IMPROPER TURNING/NO SIGNAL
## 3                      Y $501 - $1,500               FOLLOWING TOO CLOSELY
## 4                      Y   OVER $1,500                 UNABLE TO DETERMINE
## 5                      Y $501 - $1,500 DRIVING SKILLS/KNOWLEDGE/EXPERIENCE
## 6                      N $501 - $1,500                 UNABLE TO DETERMINE
##   num_units       most_severe_injury injuries_total injuries_fatal
## 1         2  NO INDICATION OF INJURY              0              0
## 2         2  NO INDICATION OF INJURY              0              0
## 3         3  NO INDICATION OF INJURY              0              0
## 4         2 NONINCAPACITATING INJURY              5              0
## 5         2  NO INDICATION OF INJURY              0              0
## 6         1 NONINCAPACITATING INJURY              2              0
##   injuries_incapacitating injuries_non_incapacitating
## 1                       0                           0
## 2                       0                           0
## 3                       0                           0
## 4                       0                           5
## 5                       0                           0
## 6                       0                           2
##   injuries_reported_not_evident injuries_no_indication crash_hour
## 1                             0                      3         13
## 2                             0                      2          0
## 3                             0                      3         10
## 4                             0                      0         19
## 5                             0                      3         14
## 6                             0                      0          0
##   crash_day_of_week crash_month
## 1                 7           7
## 2                 1           8
## 3                 5          12
## 4                 4           8
## 5                 7           8
## 6                 4           9
str(data)
## 'data.frame':    209306 obs. of  24 variables:
##  $ crash_date                   : chr  "07/29/2023 01:00:00 PM" "08/13/2023 12:11:00 AM" "12/09/2021 10:30:00 AM" "08/09/2023 07:55:00 PM" ...
##  $ traffic_control_device       : chr  "TRAFFIC SIGNAL" "TRAFFIC SIGNAL" "TRAFFIC SIGNAL" "TRAFFIC SIGNAL" ...
##  $ weather_condition            : chr  "CLEAR" "CLEAR" "CLEAR" "CLEAR" ...
##  $ lighting_condition           : chr  "DAYLIGHT" "DARKNESS, LIGHTED ROAD" "DAYLIGHT" "DAYLIGHT" ...
##  $ first_crash_type             : chr  "TURNING" "TURNING" "REAR END" "ANGLE" ...
##  $ trafficway_type              : chr  "NOT DIVIDED" "FOUR WAY" "T-INTERSECTION" "FOUR WAY" ...
##  $ alignment                    : chr  "STRAIGHT AND LEVEL" "STRAIGHT AND LEVEL" "STRAIGHT AND LEVEL" "STRAIGHT AND LEVEL" ...
##  $ roadway_surface_cond         : chr  "UNKNOWN" "DRY" "DRY" "DRY" ...
##  $ road_defect                  : chr  "UNKNOWN" "NO DEFECTS" "NO DEFECTS" "NO DEFECTS" ...
##  $ crash_type                   : chr  "NO INJURY / DRIVE AWAY" "NO INJURY / DRIVE AWAY" "NO INJURY / DRIVE AWAY" "INJURY AND / OR TOW DUE TO CRASH" ...
##  $ intersection_related_i       : chr  "Y" "Y" "Y" "Y" ...
##  $ damage                       : chr  "$501 - $1,500" "OVER $1,500" "$501 - $1,500" "OVER $1,500" ...
##  $ prim_contributory_cause      : chr  "UNABLE TO DETERMINE" "IMPROPER TURNING/NO SIGNAL" "FOLLOWING TOO CLOSELY" "UNABLE TO DETERMINE" ...
##  $ num_units                    : int  2 2 3 2 2 1 2 2 2 2 ...
##  $ most_severe_injury           : chr  "NO INDICATION OF INJURY" "NO INDICATION OF INJURY" "NO INDICATION OF INJURY" "NONINCAPACITATING INJURY" ...
##  $ injuries_total               : num  0 0 0 5 0 2 0 1 0 0 ...
##  $ injuries_fatal               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ injuries_incapacitating      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ injuries_non_incapacitating  : num  0 0 0 5 0 2 0 1 0 0 ...
##  $ injuries_reported_not_evident: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ injuries_no_indication       : num  3 2 3 0 3 0 2 1 3 4 ...
##  $ crash_hour                   : int  13 0 10 19 14 0 11 14 18 17 ...
##  $ crash_day_of_week            : int  7 1 5 4 7 4 3 4 2 5 ...
##  $ crash_month                  : int  7 8 12 8 8 9 12 9 6 9 ...
summary(data)
##   crash_date        traffic_control_device weather_condition
##  Length:209306      Length:209306          Length:209306
##  Class :character   Class :character       Class :character
##  Mode  :character   Mode  :character       Mode  :character
##
##
##
##  lighting_condition first_crash_type   trafficway_type     alignment
##  Length:209306      Length:209306      Length:209306      Length:209306
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##  roadway_surface_cond road_defect         crash_type
##  Length:209306        Length:209306      Length:209306
##  Class :character     Class :character   Class :character
##  Mode  :character     Mode  :character   Mode  :character
##
##
##
##  intersection_related_i    damage          prim_contributory_cause
##  Length:209306          Length:209306      Length:209306
##  Class :character       Class :character   Class :character
##  Mode  :character       Mode  :character   Mode  :character
##
##
##
##    num_units      most_severe_injury injuries_total    injuries_fatal
##  Min.   : 1.000   Length:209306      Min.   : 0.0000   Min.   :0.000000
##  1st Qu.: 2.000   Class :character   1st Qu.: 0.0000   1st Qu.:0.000000
##  Median : 2.000   Mode  :character   Median : 0.0000   Median :0.000000
##  Mean   : 2.063                      Mean   : 0.3827   Mean   :0.001858
##  3rd Qu.: 2.000                      3rd Qu.: 1.0000   3rd Qu.:0.000000
##  Max.   :11.000                      Max.   :21.0000   Max.   :3.000000
##  injuries_incapacitating injuries_non_incapacitating
##  Min.   :0.0000          Min.   : 0.0000
##  1st Qu.:0.0000          1st Qu.: 0.0000
##  Median :0.0000          Median : 0.0000
##  Mean   :0.0381          Mean   : 0.2212
##  3rd Qu.:0.0000          3rd Qu.: 0.0000
##  Max.   :7.0000          Max.   :21.0000
##  injuries_reported_not_evident injuries_no_indication   crash_hour
##  Min.   : 0.0000               Min.   : 0.000         Min.   : 0.00
##  1st Qu.: 0.0000               1st Qu.: 2.000         1st Qu.: 9.00
##  Median : 0.0000               Median : 2.000         Median :14.00
##  Mean   : 0.1215               Mean   : 2.244         Mean   :13.37
##  3rd Qu.: 0.0000               3rd Qu.: 3.000         3rd Qu.:17.00
##  Max.   :15.0000               Max.   :49.000         Max.   :23.00
##  crash_day_of_week  crash_month
##  Min.   :1.000     Min.   : 1.000
##  1st Qu.:2.000     1st Qu.: 4.000
##  Median :4.000     Median : 7.000
##  Mean   :4.144     Mean   : 6.772
##  3rd Qu.:6.000     3rd Qu.:10.000
##  Max.   :7.000     Max.   :12.000
colSums(is.na(data))
##                    crash_date        traffic_control_device
##                             0                             0
##             weather_condition            lighting_condition
##                             0                             0
##              first_crash_type               trafficway_type
##                             0                             0
##                     alignment          roadway_surface_cond
##                             0                             0
##                   road_defect                    crash_type
##                             0                             0
##        intersection_related_i                        damage
##                             0                             0
##       prim_contributory_cause                     num_units
##                             0                             0
##            most_severe_injury                injuries_total
##                             0                             0
##                injuries_fatal       injuries_incapacitating
##                             0                             0
##   injuries_non_incapacitating injuries_reported_not_evident
##                             0                             0
##        injuries_no_indication                    crash_hour
##                             0                             0
##             crash_day_of_week                   crash_month
##                             0                             0

Feature Engineering

In this section, we will create new features that may help improve the model’s performance.

# 1. Convert categorical variables to factors
cat_cols <- c(
  "traffic_control_device", "weather_condition", "lighting_condition",
  "first_crash_type", "trafficway_type", "alignment", "roadway_surface_cond",
  "road_defect", "crash_type", "intersection_related_i", "damage",
  "prim_contributory_cause", "most_severe_injury",
  "crash_day_of_week", "crash_month"
)
for (col in intersect(cat_cols, colnames(data))) {
  data[[col]] <- as.factor(data[[col]])
}

# 2. Engineer time features if 'crash_date' is present
if ("crash_date" %in% names(data)) {
  data$crash_datetime <- as.POSIXct(data$crash_date, format = "%m/%d/%Y %I:%M:%S %p")
  if (all(!is.na(data$crash_datetime))) {
    data$crash_hour <- as.integer(format(data$crash_datetime, "%H"))
    data$crash_month <- as.factor(format(data$crash_datetime, "%m"))
    data$crash_day_of_week <- as.factor(weekdays(data$crash_datetime))
  }
}

# 3. Remove rows with missing values
data <- na.omit(data)

# 4. Check changes
str(data)
## 'data.frame':    209306 obs. of  25 variables:
##  $ crash_date                   : chr  "07/29/2023 01:00:00 PM" "08/13/2023 12:11:00 AM" "12/09/2021 10:30:00 AM" "08/09/2023 07:55:00 PM" ...
##  $ traffic_control_device       : Factor w/ 19 levels "BICYCLE CROSSING SIGN",..: 17 17 17 17 17 5 17 5 17 16 ...
##  $ weather_condition            : Factor w/ 12 levels "BLOWING SAND, SOIL, DIRT",..: 3 3 3 3 3 8 3 3 3 3 ...
##  $ lighting_condition           : Factor w/ 6 levels "DARKNESS","DARKNESS, LIGHTED ROAD",..: 4 2 4 4 4 2 4 4 4 4 ...
##  $ first_crash_type             : Factor w/ 18 levels "ANGLE","ANIMAL",..: 18 18 11 1 11 3 12 1 11 1 ...
##  $ trafficway_type              : Factor w/ 20 levels "ALLEY","CENTER TURN LANE",..: 9 7 16 7 16 9 7 3 9 7 ...
##  $ alignment                    : Factor w/ 6 levels "CURVE ON GRADE",..: 4 4 4 4 4 4 4 3 4 4 ...
##  $ roadway_surface_cond         : Factor w/ 7 levels "DRY","ICE","OTHER",..: 6 1 1 1 6 7 1 1 1 1 ...
##  $ road_defect                  : Factor w/ 7 levels "DEBRIS ON ROADWAY",..: 6 2 2 2 6 6 2 2 2 2 ...
##  $ crash_type                   : Factor w/ 2 levels "INJURY AND / OR TOW DUE TO CRASH",..: 2 2 2 1 2 1 2 1 2 2 ...
##  $ intersection_related_i       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 2 2 2 2 ...
##  $ damage                       : Factor w/ 3 levels "$500 OR LESS",..: 2 3 2 3 2 2 2 3 3 3 ...
##  $ prim_contributory_cause      : Factor w/ 40 levels "ANIMAL","BICYCLE ADVANCING LEGALLY ON RED LIGHT",..: 37 25 20 37 13 37 22 19 20 19 ...
##  $ num_units                    : int  2 2 3 2 2 1 2 2 2 2 ...
##  $ most_severe_injury           : Factor w/ 5 levels "FATAL","INCAPACITATING INJURY",..: 3 3 3 4 3 4 3 4 3 3 ...
##  $ injuries_total               : num  0 0 0 5 0 2 0 1 0 0 ...
##  $ injuries_fatal               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ injuries_incapacitating      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ injuries_non_incapacitating  : num  0 0 0 5 0 2 0 1 0 0 ...
##  $ injuries_reported_not_evident: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ injuries_no_indication       : num  3 2 3 0 3 0 2 1 3 4 ...
##  $ crash_hour                   : int  13 0 10 19 14 0 11 14 18 17 ...
##  $ crash_day_of_week            : Factor w/ 7 levels "Friday","Monday",..: 3 4 5 7 3 7 6 7 2 5 ...
##  $ crash_month                  : Factor w/ 12 levels "01","02","03",..: 7 8 12 8 8 9 12 9 6 9 ...
##  $ crash_datetime               : POSIXct, format: "2023-07-29 13:00:00" "2023-08-13 00:11:00" ...
summary(data)
##   crash_date              traffic_control_device       weather_condition
##  Length:209306      TRAFFIC SIGNAL   :123944     CLEAR          :164700
##  Class :character   STOP SIGN/FLASHER: 49139     RAIN           : 21703
##  Mode  :character   NO CONTROLS      : 29508     CLOUDY/OVERCAST:  7533
##                     UNKNOWN          :  4455     SNOW           :  6871
##                     OTHER            :   670     UNKNOWN        :  6534
##                     YIELD            :   468     OTHER          :   627
##                     (Other)          :  1122     (Other)        :  1338
##               lighting_condition                 first_crash_type
##  DARKNESS              :  7436   TURNING                 :64157
##  DARKNESS, LIGHTED ROAD: 53378   ANGLE                   :52250
##  DAWN                  :  3724   REAR END                :42018
##  DAYLIGHT              :134109   SIDESWIPE SAME DIRECTION:20116
##  DUSK                  :  6323   PEDESTRIAN              : 8996
##  UNKNOWN               :  4336   PEDALCYCLIST            : 5337
##                                  (Other)                 :16432
##                         trafficway_type                  alignment
##  NOT DIVIDED                    :77753   CURVE ON GRADE       :   179
##  FOUR WAY                       :49057   CURVE ON HILLCREST   :    53
##  DIVIDED - W/MEDIAN (NOT RAISED):34221   CURVE, LEVEL         :  1014
##  ONE-WAY                        :12341   STRAIGHT AND LEVEL   :204590
##  DIVIDED - W/MEDIAN BARRIER     :10720   STRAIGHT ON GRADE    :  2992
##  T-INTERSECTION                 : 9233   STRAIGHT ON HILLCREST:   478
##  (Other)                        :15981
##       roadway_surface_cond            road_defect
##  DRY            :155905    DEBRIS ON ROADWAY:   139
##  ICE            :  1303    NO DEFECTS       :171730
##  OTHER          :   438    OTHER            :   912
##  SAND, MUD, DIRT:    40    RUT, HOLES       :   741
##  SNOW OR SLUSH  :  6203    SHOULDER DEFECT  :   358
##  UNKNOWN        : 12509    UNKNOWN          : 34426
##  WET            : 32908    WORN SURFACE     :  1000
##                             crash_type     intersection_related_i
##  INJURY AND / OR TOW DUE TO CRASH: 91930   N:  9982
##  NO INJURY / DRIVE AWAY          :117376   Y:199324
##
##
##
##
##
##            damage                                 prim_contributory_cause
##  $500 OR LESS : 20783   UNABLE TO DETERMINE                   :58316
##  $501 - $1,500: 41210   FAILING TO YIELD RIGHT-OF-WAY         :42914
##  OVER $1,500  :147313   FOLLOWING TOO CLOSELY                 :19084
##                         DISREGARDING TRAFFIC SIGNALS          :14591
##                         IMPROPER TURNING/NO SIGNAL            :12643
##                         FAILING TO REDUCE SPEED TO AVOID CRASH:10676
##                         (Other)                               :51082
##    num_units                     most_severe_injury injuries_total
##  Min.   : 1.000   FATAL                   :   351   Min.   : 0.0000
##  1st Qu.: 2.000   INCAPACITATING INJURY   :  6564   1st Qu.: 0.0000
##  Median : 2.000   NO INDICATION OF INJURY :154789   Median : 0.0000
##  Mean   : 2.063   NONINCAPACITATING INJURY: 31527   Mean   : 0.3827
##  3rd Qu.: 2.000   REPORTED, NOT EVIDENT   : 16075   3rd Qu.: 1.0000
##  Max.   :11.000                                     Max.   :21.0000
##
##  injuries_fatal     injuries_incapacitating injuries_non_incapacitating
##  Min.   :0.000000   Min.   :0.0000          Min.   : 0.0000
##  1st Qu.:0.000000   1st Qu.:0.0000          1st Qu.: 0.0000
##  Median :0.000000   Median :0.0000          Median : 0.0000
##  Mean   :0.001858   Mean   :0.0381          Mean   : 0.2212
##  3rd Qu.:0.000000   3rd Qu.:0.0000          3rd Qu.: 0.0000
##  Max.   :3.000000   Max.   :7.0000          Max.   :21.0000
##
##  injuries_reported_not_evident injuries_no_indication   crash_hour
##  Min.   : 0.0000               Min.   : 0.000         Min.   : 0.00
##  1st Qu.: 0.0000               1st Qu.: 2.000         1st Qu.: 9.00
##  Median : 0.0000               Median : 2.000         Median :14.00
##  Mean   : 0.1215               Mean   : 2.244         Mean   :13.37
##  3rd Qu.: 0.0000               3rd Qu.: 3.000         3rd Qu.:17.00
##  Max.   :15.0000               Max.   :49.000         Max.   :23.00
##
##  crash_day_of_week  crash_month    crash_datetime
##  Friday   :34458   10     :20089   Min.   :2013-03-03 16:48:00.00
##  Monday   :27938   09     :19018   1st Qu.:2018-12-03 10:18:45.00
##  Saturday :30710   12     :18816   Median :2020-12-19 16:39:00.00
##  Sunday   :25246   08     :18350   Mean   :2020-12-08 23:26:35.21
##  Thursday :30787   11     :18328   3rd Qu.:2023-01-08 02:33:45.00
##  Tuesday  :30074   06     :17851   Max.   :2025-01-18 00:17:00.00
##  Wednesday:30093   (Other):96854

Splitting the Data

# 1. Define columns that would cause data leakage
leakage_cols <- c(
  "injuries_fatal",
  "injuries_incapacitating",
  "injuries_non_incapacitating",
  "injuries_reported_not_evident",
  "injuries_no_indication"
)

# 2. Define your chosen predictor variables (customize this list as needed)
chosen_predictors <- c(
  "weather_condition",
  "lighting_condition",
  "roadway_surface_cond",
  "num_units",
  "traffic_control_device",
  "first_crash_type",
  "alignment",
  "crash_hour"
)

# 3. Split the data after feature engineering and NA removal
set.seed(123)
train_index <- createDataPartition(data$injuries_total, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data  <- data[-train_index, ]

# 4. Prepare modeling datasets, using only your chosen predictors (and not the leakage columns)
train_data_model <- train_data[, c("injuries_total", chosen_predictors)]
test_data_model  <- test_data[,  c("injuries_total", chosen_predictors)]

Baseline Models

We will train three baseline regression models: Linear Regression, Random Forest, and Support Vector Regression.

Random Forest

# Train a Random Forest model
rf_model <- ranger(injuries_total ~ ., data = train_data_model, num.trees = 100, num.threads = parallel::detectCores())
rf_predictions <- predict(rf_model, data = test_data_model)$predictions

# Evaluation metrics
rf_rmse <- sqrt(mean((rf_predictions - test_data_model$injuries_total)^2))
rf_mae  <- mean(abs(rf_predictions - test_data_model$injuries_total))
rf_r2   <- cor(rf_predictions, test_data_model$injuries_total)^2

cat("Random Forest RMSE:", round(rf_rmse, 4), "\n")
## Random Forest RMSE: 0.7699
cat("Random Forest MAE :", round(rf_mae,  4), "\n")
## Random Forest MAE : 0.4831
cat("Random Forest R^2 :", round(rf_r2,   4), "\n")
## Random Forest R^2 : 0.1049

Support Vector Regression

speedglm_model <- speedglm(
  injuries_total ~ .,
  data = train_data_model,
  family = gaussian()
)

# Predict on the test set
speedglm_predictions <- predict(speedglm_model, newdata = test_data_model, type = "response")

# Evaluation metrics
speedglm_rmse <- sqrt(mean((speedglm_predictions - test_data_model$injuries_total)^2))
speedglm_mae  <- mean(abs(speedglm_predictions - test_data_model$injuries_total))
speedglm_r2   <- cor(speedglm_predictions, test_data_model$injuries_total)^2

cat("speedglm RMSE:", round(speedglm_rmse, 4), "\n")
## speedglm RMSE: 0.7726
cat("speedglm MAE :", round(speedglm_mae,  4), "\n")
## speedglm MAE : 0.4833
cat("speedglm R^2 :", round(speedglm_r2,   4), "\n")
## speedglm R^2 : 0.0986

Model Evaluation

We will evaluate the models using RMSE and R-squared metrics.

results <- data.frame(
  Model = c("Random Forest", "Support Vector Regression"),
  RMSE = c(rf_rmse, speedglm_rmse),
  MAE  = c(rf_mae,  speedglm_mae),
  R2   = c(rf_r2,   speedglm_r2)
)
print(results)
##                       Model      RMSE       MAE         R2
## 1             Random Forest 0.7775026 0.4793032 0.09174616
## 2 Support Vector Regression 0.7726181 0.4833240 0.09861882
# Visualization
results_long <- pivot_longer(results, cols = c("RMSE", "MAE", "R2"), names_to = "Metric", values_to = "Value")
ggplot(results_long, aes(x = Model, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Model Performance Comparison (Test Set)", y = "Metric Value") +
  theme_minimal()

Modelling

Classification

For the classification task, our primary objective was to predict the binary outcome of whether a traffic accident resulted in an injury. The feature selection process began with an initial set of predictors chosen from our exploratory analysis, including first_crash_type, traffic_control_device, weather_condition, and lighting_condition. To enhance the model’s predictive capabilities, we then engineered several new, more descriptive features. The numeric crash_hour_of_day was transformed into a categorical time_of_day feature (e.g., “Morning Rush”, “Night”) to better capture time-based traffic patterns. Similarly, crash_day_of_week was simplified into a binary is_weekend feature to distinguish between weekday and weekend driving behaviour.

if (!require("caret")) install.packages("caret", dependencies = TRUE)
## Loading required package: caret
## Loading required package: ggplot2
## Loading required package: lattice
if (!require("rpart")) install.packages("rpart")
## Loading required package: rpart
if (!require("rpart.plot")) install.packages("rpart.plot")
## Loading required package: rpart.plot
if (!require("randomForest")) install.packages("randomForest")
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
##     margin
if (!require("xgboost")) install.packages("xgboost")
## Loading required package: xgboost
suppressPackageStartupMessages({
  library(caret)
  library(rpart)
  library(rpart.plot)
  library(randomForest)
  library(xgboost)
  library(dplyr)
  library(lubridate)
})
accidents <- read.csv("traffic_accidents.csv")
accidents <- accidents %>% filter(!is.na(crash_date))
glimpse(accidents)
## Rows: 209,306
## Columns: 24
## $ crash_date                    <chr> "07/29/2023 01:00:00 PM", "08/13/2023 12…
## $ traffic_control_device        <chr> "TRAFFIC SIGNAL", "TRAFFIC SIGNAL", "TRA…
## $ weather_condition             <chr> "CLEAR", "CLEAR", "CLEAR", "CLEAR", "CLE…
## $ lighting_condition            <chr> "DAYLIGHT", "DARKNESS, LIGHTED ROAD", "D…
## $ first_crash_type              <chr> "TURNING", "TURNING", "REAR END", "ANGLE…
## $ trafficway_type               <chr> "NOT DIVIDED", "FOUR WAY", "T-INTERSECTI…
## $ alignment                     <chr> "STRAIGHT AND LEVEL", "STRAIGHT AND LEVE…
## $ roadway_surface_cond          <chr> "UNKNOWN", "DRY", "DRY", "DRY", "UNKNOWN…
## $ road_defect                   <chr> "UNKNOWN", "NO DEFECTS", "NO DEFECTS", "…
## $ crash_type                    <chr> "NO INJURY / DRIVE AWAY", "NO INJURY / D…
## $ intersection_related_i        <chr> "Y", "Y", "Y", "Y", "Y", "N", "Y", "Y", …
## $ damage                        <chr> "$501 - $1,500", "OVER $1,500", "$501 - …
## $ prim_contributory_cause       <chr> "UNABLE TO DETERMINE", "IMPROPER TURNING…
## $ num_units                     <int> 2, 2, 3, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ most_severe_injury            <chr> "NO INDICATION OF INJURY", "NO INDICATIO…
## $ injuries_total                <dbl> 0, 0, 0, 5, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0…
## $ injuries_fatal                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ injuries_incapacitating       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ injuries_non_incapacitating   <dbl> 0, 0, 0, 5, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0…
## $ injuries_reported_not_evident <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ injuries_no_indication        <dbl> 3, 2, 3, 0, 3, 0, 2, 1, 3, 4, 1, 2, 2, 2…
## $ crash_hour                    <int> 13, 0, 10, 19, 14, 0, 11, 14, 18, 17, 20…
## $ crash_day_of_week             <int> 7, 1, 5, 4, 7, 4, 3, 4, 2, 5, 5, 4, 5, 4…
## $ crash_month                   <int> 7, 8, 12, 8, 8, 9, 12, 9, 6, 9, 9, 11, 7…
# Convert date to proper datetime format
accidents <- accidents %>%
  mutate(crash_date = parse_date_time(crash_date, orders = c("mdy HM", "mdy HMS", "ymd HM")))

# Extract additional time features
accidents <- accidents %>%
  mutate(
    crash_hour_of_day = hour(crash_date),
    crash_day_of_week = wday(crash_date, label = TRUE, abbr = FALSE),
    crash_month = month(crash_date, label = TRUE, abbr = FALSE),
    crash_year = year(crash_date)
  )

# Clean up categorical variables by converting to factors
accidents <- accidents %>%
  mutate(
    traffic_control_device = factor(traffic_control_device),
    weather_condition = factor(weather_condition),
    lighting_condition = factor(lighting_condition),
    first_crash_type = factor(first_crash_type),
    trafficway_type = factor(trafficway_type),
    alignment = factor(alignment),
    roadway_surface_cond = factor(roadway_surface_cond),
    road_defect = factor(road_defect),
    crash_type = factor(crash_type),
    intersection_related_i = factor(intersection_related_i),
    damage = factor(damage, levels = c("$500 OR LESS", "$501 - $1,500", "OVER $1,500")),
    prim_contributory_cause = factor(prim_contributory_cause),
    most_severe_injury = factor(most_severe_injury)
  )

# Check for missing values
colSums(is.na(accidents))
##                    crash_date        traffic_control_device
##                             0                             0
##             weather_condition            lighting_condition
##                             0                             0
##              first_crash_type               trafficway_type
##                             0                             0
##                     alignment          roadway_surface_cond
##                             0                             0
##                   road_defect                    crash_type
##                             0                             0
##        intersection_related_i                        damage
##                             0                             0
##       prim_contributory_cause                     num_units
##                             0                             0
##            most_severe_injury                injuries_total
##                             0                             0
##                injuries_fatal       injuries_incapacitating
##                             0                             0
##   injuries_non_incapacitating injuries_reported_not_evident
##                             0                             0
##        injuries_no_indication                    crash_hour
##                             0                             0
##             crash_day_of_week                   crash_month
##                             0                             0
##             crash_hour_of_day                    crash_year
##                             0                             0
# Handle missing values if necessary (example for one column)
# accidents$roadway_surface_cond[is.na(accidents$roadway_surface_cond)] <- "UNKNOWN"

# Create severity indicator
accidents <- accidents %>%
  mutate(
    severity = case_when(
      injuries_fatal > 0 ~ "Fatal",
      injuries_incapacitating > 0 ~ "Severe",
      injuries_non_incapacitating > 0 ~ "Moderate",
      injuries_reported_not_evident > 0 ~ "Minor",
      TRUE ~ "No_Injury"
    ),
    severity = factor(severity, levels = c("No_Injury", "Minor", "Moderate", "Severe", "Fatal"))
  )

Label encoding is used here to convert categorical (factor) variables into numerical representations. Since most of the machine learning algorithms require numerical input and cannot directly process text-based categorical data. By converting factors to numeric codes, we enable these algorithms to work with our dataset while preserving the distinct categories.

cat("\n--- Label Encoding for Categorical (Factor) Columns ---\n")
##
## --- Label Encoding for Categorical (Factor) Columns ---
categorical_columns_to_encode <- names(accidents)[sapply(accidents, is.factor) & names(accidents) != "crash_type"]

original_factor_levels <- list()

for (col_name in categorical_columns_to_encode) {
  cat(paste0("\nColumn: ", col_name, " Encoding and Decoding:\n"))

  # Store original levels
  original_factor_levels[[col_name]] <- levels(accidents[[col_name]])

  # Convert factor to numeric (this uses the internal integer codes)
  # The first level will be 1, second 2, etc.
  accidents[[col_name]] <- as.numeric(accidents[[col_name]])

  # Print the mapping: original level -> new numeric code
  for (i in seq_along(original_factor_levels[[col_name]])) {
    cat(paste0(original_factor_levels[[col_name]][i], " -> ", i, "\n"))
  }
}
##
## Column: traffic_control_device Encoding and Decoding:
## BICYCLE CROSSING SIGN -> 1
## DELINEATORS -> 2
## FLASHING CONTROL SIGNAL -> 3
## LANE USE MARKING -> 4
## NO CONTROLS -> 5
## NO PASSING -> 6
## OTHER -> 7
## OTHER RAILROAD CROSSING -> 8
## OTHER REG. SIGN -> 9
## OTHER WARNING SIGN -> 10
## PEDESTRIAN CROSSING SIGN -> 11
## POLICE/FLAGMAN -> 12
## RAILROAD CROSSING GATE -> 13
## RR CROSSING SIGN -> 14
## SCHOOL ZONE -> 15
## STOP SIGN/FLASHER -> 16
## TRAFFIC SIGNAL -> 17
## UNKNOWN -> 18
## YIELD -> 19
##
## Column: weather_condition Encoding and Decoding:
## BLOWING SAND, SOIL, DIRT -> 1
## BLOWING SNOW -> 2
## CLEAR -> 3
## CLOUDY/OVERCAST -> 4
## FOG/SMOKE/HAZE -> 5
## FREEZING RAIN/DRIZZLE -> 6
## OTHER -> 7
## RAIN -> 8
## SEVERE CROSS WIND GATE -> 9
## SLEET/HAIL -> 10
## SNOW -> 11
## UNKNOWN -> 12
##
## Column: lighting_condition Encoding and Decoding:
## DARKNESS -> 1
## DARKNESS, LIGHTED ROAD -> 2
## DAWN -> 3
## DAYLIGHT -> 4
## DUSK -> 5
## UNKNOWN -> 6
##
## Column: first_crash_type Encoding and Decoding:
## ANGLE -> 1
## ANIMAL -> 2
## FIXED OBJECT -> 3
## HEAD ON -> 4
## OTHER NONCOLLISION -> 5
## OTHER OBJECT -> 6
## OVERTURNED -> 7
## PARKED MOTOR VEHICLE -> 8
## PEDALCYCLIST -> 9
## PEDESTRIAN -> 10
## REAR END -> 11
## REAR TO FRONT -> 12
## REAR TO REAR -> 13
## REAR TO SIDE -> 14
## SIDESWIPE OPPOSITE DIRECTION -> 15
## SIDESWIPE SAME DIRECTION -> 16
## TRAIN -> 17
## TURNING -> 18
##
## Column: trafficway_type Encoding and Decoding:
## ALLEY -> 1
## CENTER TURN LANE -> 2
## DIVIDED - W/MEDIAN (NOT RAISED) -> 3
## DIVIDED - W/MEDIAN BARRIER -> 4
## DRIVEWAY -> 5
## FIVE POINT, OR MORE -> 6
## FOUR WAY -> 7
## L-INTERSECTION -> 8
## NOT DIVIDED -> 9
## NOT REPORTED -> 10
## ONE-WAY -> 11
## OTHER -> 12
## PARKING LOT -> 13
## RAMP -> 14
## ROUNDABOUT -> 15
## T-INTERSECTION -> 16
## TRAFFIC ROUTE -> 17
## UNKNOWN -> 18
## UNKNOWN INTERSECTION TYPE -> 19
## Y-INTERSECTION -> 20
##
## Column: alignment Encoding and Decoding:
## CURVE ON GRADE -> 1
## CURVE ON HILLCREST -> 2
## CURVE, LEVEL -> 3
## STRAIGHT AND LEVEL -> 4
## STRAIGHT ON GRADE -> 5
## STRAIGHT ON HILLCREST -> 6
##
## Column: roadway_surface_cond Encoding and Decoding:
## DRY -> 1
## ICE -> 2
## OTHER -> 3
## SAND, MUD, DIRT -> 4
## SNOW OR SLUSH -> 5
## UNKNOWN -> 6
## WET -> 7
##
## Column: road_defect Encoding and Decoding:
## DEBRIS ON ROADWAY -> 1
## NO DEFECTS -> 2
## OTHER -> 3
## RUT, HOLES -> 4
## SHOULDER DEFECT -> 5
## UNKNOWN -> 6
## WORN SURFACE -> 7
##
## Column: intersection_related_i Encoding and Decoding:
## N -> 1
## Y -> 2
##
## Column: damage Encoding and Decoding:
## $500 OR LESS -> 1
## $501 - $1,500 -> 2
## OVER $1,500 -> 3
##
## Column: prim_contributory_cause Encoding and Decoding:
## ANIMAL -> 1
## BICYCLE ADVANCING LEGALLY ON RED LIGHT -> 2
## CELL PHONE USE OTHER THAN TEXTING -> 3
## DISREGARDING OTHER TRAFFIC SIGNS -> 4
## DISREGARDING ROAD MARKINGS -> 5
## DISREGARDING STOP SIGN -> 6
## DISREGARDING TRAFFIC SIGNALS -> 7
## DISREGARDING YIELD SIGN -> 8
## DISTRACTION - FROM INSIDE VEHICLE -> 9
## DISTRACTION - FROM OUTSIDE VEHICLE -> 10
## DISTRACTION - OTHER ELECTRONIC DEVICE (NAVIGATION DEVICE, DVD PLAYER, ETC.) -> 11
## DRIVING ON WRONG SIDE/WRONG WAY -> 12
## DRIVING SKILLS/KNOWLEDGE/EXPERIENCE -> 13
## EQUIPMENT - VEHICLE CONDITION -> 14
## EVASIVE ACTION DUE TO ANIMAL, OBJECT, NONMOTORIST -> 15
## EXCEEDING AUTHORIZED SPEED LIMIT -> 16
## EXCEEDING SAFE SPEED FOR CONDITIONS -> 17
## FAILING TO REDUCE SPEED TO AVOID CRASH -> 18
## FAILING TO YIELD RIGHT-OF-WAY -> 19
## FOLLOWING TOO CLOSELY -> 20
## HAD BEEN DRINKING (USE WHEN ARREST IS NOT MADE) -> 21
## IMPROPER BACKING -> 22
## IMPROPER LANE USAGE -> 23
## IMPROPER OVERTAKING/PASSING -> 24
## IMPROPER TURNING/NO SIGNAL -> 25
## MOTORCYCLE ADVANCING LEGALLY ON RED LIGHT -> 26
## NOT APPLICABLE -> 27
## OBSTRUCTED CROSSWALKS -> 28
## OPERATING VEHICLE IN ERRATIC, RECKLESS, CARELESS, NEGLIGENT OR AGGRESSIVE MANNER -> 29
## PASSING STOPPED SCHOOL BUS -> 30
## PHYSICAL CONDITION OF DRIVER -> 31
## RELATED TO BUS STOP -> 32
## ROAD CONSTRUCTION/MAINTENANCE -> 33
## ROAD ENGINEERING/SURFACE/MARKING DEFECTS -> 34
## TEXTING -> 35
## TURNING RIGHT ON RED -> 36
## UNABLE TO DETERMINE -> 37
## UNDER THE INFLUENCE OF ALCOHOL/DRUGS (USE WHEN ARREST IS EFFECTED) -> 38
## VISION OBSCURED (SIGNS, TREE LIMBS, BUILDINGS, ETC.) -> 39
## WEATHER -> 40
##
## Column: most_severe_injury Encoding and Decoding:
## FATAL -> 1
## INCAPACITATING INJURY -> 2
## NO INDICATION OF INJURY -> 3
## NONINCAPACITATING INJURY -> 4
## REPORTED, NOT EVIDENT -> 5
##
## Column: crash_day_of_week Encoding and Decoding:
## Sunday -> 1
## Monday -> 2
## Tuesday -> 3
## Wednesday -> 4
## Thursday -> 5
## Friday -> 6
## Saturday -> 7
##
## Column: crash_month Encoding and Decoding:
## January -> 1
## February -> 2
## March -> 3
## April -> 4
## May -> 5
## June -> 6
## July -> 7
## August -> 8
## September -> 9
## October -> 10
## November -> 11
## December -> 12
##
## Column: severity Encoding and Decoding:
## No_Injury -> 1
## Minor -> 2
## Moderate -> 3
## Severe -> 4
## Fatal -> 5
cat("---\n")
## ---
accidents <- accidents %>% select(-'crash_date')

X <- accidents %>% select(-'crash_type')
y <- accidents[['crash_type']]

Before modeling, we examined the initial distribution of the target variable, crash_type, which captures whether an accident resulted in injury or required a tow. The dataset comprised 91,930 cases (43.9%) labeled as “INJURY AND / OR TOW DUE TO CRASH”, and 117,376 cases (56.1%) labeled as “NO INJURY / DRIVE AWAY”. This indicates a moderately imbalanced class distribution, with non-injury cases slightly more prevalent.

cat("\n--- Initial Class Distribution of 'crash_type' ---\n")
##
## --- Initial Class Distribution of 'crash_type' ---
print(table(y))
## y
## INJURY AND / OR TOW DUE TO CRASH           NO INJURY / DRIVE AWAY
##                            91930                           117376
cat("\n--- Initial Class Proportion of 'crash_type' ---\n")
##
## --- Initial Class Proportion of 'crash_type' ---
print(prop.table(table(y)))
## y
## INJURY AND / OR TOW DUE TO CRASH           NO INJURY / DRIVE AWAY
##                        0.4392134                        0.5607866

Before proceeding with model training, we inspected the target variable crash_type and observed that its factor levels contained characters such as slashes (/) and spaces, which can cause issues during model fitting or result interpretation—especially when using algorithms or libraries that require syntactically valid variable names. To address this, we renamed the factor levels by first replacing spaces with underscores and then applying make.names() to ensure all labels were syntactically valid.

current_target_levels <- levels(accidents[['crash_type']])

levels_with_underscores <- gsub(" ", "_", current_target_levels)

valid_target_levels <- make.names(levels_with_underscores)

accidents[['crash_type']] <- factor(accidents[['crash_type']],
                                       levels = current_target_levels,
                                       labels = valid_target_levels)

y <- accidents[['crash_type']]
cat("\n--- Initial Class Distribution of 'crash_type' ---\n")
##
## --- Initial Class Distribution of 'crash_type' ---
print(table(y))
## y
## INJURY_AND_._OR_TOW_DUE_TO_CRASH           NO_INJURY_._DRIVE_AWAY
##                            91930                           117376
cat("\n--- Initial Class Proportion of 'crash_type' ---\n")
##
## --- Initial Class Proportion of 'crash_type' ---
print(prop.table(table(y)))
## y
## INJURY_AND_._OR_TOW_DUE_TO_CRASH           NO_INJURY_._DRIVE_AWAY
##                        0.4392134                        0.5607866

Data Splitting

set.seed(42)

train_index <- createDataPartition(y, p = 0.7, list = FALSE, times = 1)

X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

train_data <- data.frame(X_train, crash_type = y_train)
plot_confusion_matrix_heatmap <- function(conf_matrix, model_name) {
  cm_table <- as.table(conf_matrix)
  cm_df <- as.data.frame(cm_table)
  names(cm_df) <- c("Prediction", "Reference", "Frequency")

  ggplot(data = cm_df, aes(x = Reference, y = Prediction, fill = Frequency)) +
    geom_tile(color = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(label = sprintf("%d", Frequency)), vjust = 1, color = "black") +
    labs(
      title = paste("Confusion Matrix ", model_name),
      x = "Actual",
      y = "Predicted"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
      axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
      axis.text.y = element_text(size = 10),
      axis.title = element_text(size = 12),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      legend.position = "right"
    ) +
    coord_fixed()
}

We trained two different classification models and evaluated them on a separate test set to find the most effective one. The comparison focused on three key metrics: overall accuracy, sensitivity(recall), and F1 score which is a fairer measure when dealing with imbalanced data.

Random Forest Model

train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = multiClassSummary,

)
set.seed(42)
rf_model <- train(
  crash_type ~ .,
  data = train_data,
  method = "rf",
  trControl = train_control,
  tuneLength = 3
)
cat("--- Random Forest Model Summary ---\n")
## --- Random Forest Model Summary ---
print(rf_model)
## Random Forest
##
## 146515 samples
##     25 predictor
##      2 classes: 'INJURY_AND_._OR_TOW_DUE_TO_CRASH', 'NO_INJURY_._DRIVE_AWAY'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 117213, 117212, 117212, 117212, 117211
## Resampling results across tuning parameters:
##
##   mtry  logLoss    AUC        prAUC      Accuracy   Kappa      F1
##    2    0.5471718  0.9114362  0.7470384  0.8337508  0.6491500  0.7704025
##   13    0.3400071  0.9187443  0.6079218  0.8394226  0.6696509  0.8053004
##   25    0.3466631  0.9170241  0.5971850  0.8378733  0.6667464  0.8042275
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##   0.6350951    0.9893384    0.9790172       0.7758796       0.9790172
##   0.7561189    0.9046663    0.8613374       0.8256784       0.8613374
##   0.7582167    0.9002605    0.8561954       0.8262183       0.8561954
##   Recall     Detection_Rate  Balanced_Accuracy
##   0.6350951  0.2789407       0.8122168
##   0.7561189  0.3320957       0.8303926
##   0.7582167  0.3330171       0.8292386
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 13.
y_pred <- predict(rf_model, newdata = X_test)

rf_conf_matrix <- confusionMatrix(y_pred, y_test, mode = "everything")

cat("\n--- Random Forest Model Evaluation: Classification Report ---\n")
##
## --- Random Forest Model Evaluation: Classification Report ---
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
##                                   Reference
## Prediction                         INJURY_AND_._OR_TOW_DUE_TO_CRASH
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                            20866
##   NO_INJURY_._DRIVE_AWAY                                       6713
##                                   Reference
## Prediction                         NO_INJURY_._DRIVE_AWAY
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                   3480
##   NO_INJURY_._DRIVE_AWAY                            31732
##
##                Accuracy : 0.8377
##                  95% CI : (0.8348, 0.8405)
##     No Information Rate : 0.5608
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6662
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.7566
##             Specificity : 0.9012
##          Pos Pred Value : 0.8571
##          Neg Pred Value : 0.8254
##               Precision : 0.8571
##                  Recall : 0.7566
##                      F1 : 0.8037
##              Prevalence : 0.4392
##          Detection Rate : 0.3323
##    Detection Prevalence : 0.3877
##       Balanced Accuracy : 0.8289
##
##        'Positive' Class : INJURY_AND_._OR_TOW_DUE_TO_CRASH
## 
plot_confusion_matrix_heatmap(rf_conf_matrix, "(Random Forest)")

XGBoost Model

X_train_matrix <- model.matrix(~ . -1, data = X_train)
X_test_matrix <- model.matrix(~ . -1, data = X_test)


y_train_bin <- as.numeric(y_train) - 1
y_test_bin <- as.numeric(y_test) - 1

dtrain <- xgb.DMatrix(data = X_train_matrix, label = y_train_bin)
dtest <- xgb.DMatrix(data = X_test_matrix, label = y_test_bin)
params <- list(
  objective = "binary:logistic",
  eval_metric = "logloss"
)

xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain),
  verbose = 1
)
## [1]  train-logloss:0.547584
## [2]  train-logloss:0.467789
## [3]  train-logloss:0.419169
## [4]  train-logloss:0.387830
## [5]  train-logloss:0.366302
## [6]  train-logloss:0.351629
## [7]  train-logloss:0.341321
## [8]  train-logloss:0.333911
## [9]  train-logloss:0.328446
## [10] train-logloss:0.324356
## [11] train-logloss:0.321032
## [12] train-logloss:0.318676
## [13] train-logloss:0.316785
## [14] train-logloss:0.314899
## [15] train-logloss:0.313734
## [16] train-logloss:0.312740
## [17] train-logloss:0.311817
## [18] train-logloss:0.310911
## [19] train-logloss:0.310186
## [20] train-logloss:0.309508
## [21] train-logloss:0.308978
## [22] train-logloss:0.308284
## [23] train-logloss:0.307944
## [24] train-logloss:0.307539
## [25] train-logloss:0.307181
## [26] train-logloss:0.306570
## [27] train-logloss:0.306257
## [28] train-logloss:0.305557
## [29] train-logloss:0.305248
## [30] train-logloss:0.304535
## [31] train-logloss:0.304242
## [32] train-logloss:0.303759
## [33] train-logloss:0.303426
## [34] train-logloss:0.303032
## [35] train-logloss:0.302747
## [36] train-logloss:0.302373
## [37] train-logloss:0.302179
## [38] train-logloss:0.301790
## [39] train-logloss:0.301169
## [40] train-logloss:0.300843
## [41] train-logloss:0.300361
## [42] train-logloss:0.299990
## [43] train-logloss:0.299623
## [44] train-logloss:0.299125
## [45] train-logloss:0.298876
## [46] train-logloss:0.298467
## [47] train-logloss:0.298294
## [48] train-logloss:0.298053
## [49] train-logloss:0.297867
## [50] train-logloss:0.297542
## [51] train-logloss:0.297392
## [52] train-logloss:0.296869
## [53] train-logloss:0.296544
## [54] train-logloss:0.296154
## [55] train-logloss:0.295736
## [56] train-logloss:0.295417
## [57] train-logloss:0.295328
## [58] train-logloss:0.294973
## [59] train-logloss:0.294491
## [60] train-logloss:0.294201
## [61] train-logloss:0.293883
## [62] train-logloss:0.293696
## [63] train-logloss:0.293350
## [64] train-logloss:0.293144
## [65] train-logloss:0.292838
## [66] train-logloss:0.292609
## [67] train-logloss:0.292371
## [68] train-logloss:0.292072
## [69] train-logloss:0.291838
## [70] train-logloss:0.291621
## [71] train-logloss:0.291272
## [72] train-logloss:0.291091
## [73] train-logloss:0.290759
## [74] train-logloss:0.290526
## [75] train-logloss:0.290355
## [76] train-logloss:0.290271
## [77] train-logloss:0.290132
## [78] train-logloss:0.289889
## [79] train-logloss:0.289739
## [80] train-logloss:0.289632
## [81] train-logloss:0.289268
## [82] train-logloss:0.288967
## [83] train-logloss:0.288630
## [84] train-logloss:0.288518
## [85] train-logloss:0.288248
## [86] train-logloss:0.288108
## [87] train-logloss:0.288058
## [88] train-logloss:0.287947
## [89] train-logloss:0.287733
## [90] train-logloss:0.287466
## [91] train-logloss:0.287303
## [92] train-logloss:0.287028
## [93] train-logloss:0.286882
## [94] train-logloss:0.286717
## [95] train-logloss:0.286430
## [96] train-logloss:0.286103
## [97] train-logloss:0.285827
## [98] train-logloss:0.285619
## [99] train-logloss:0.285357
## [100]    train-logloss:0.285123
xgb_probs <- predict(xgb_model, newdata = dtest)

xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)

predicted_factor <- factor(xgb_preds, levels = c(0,1), labels = levels(y_test))
actual_factor <- factor(y_test_bin, levels = c(0,1), labels = levels(y_test))
xgb_conf_matrix <- confusionMatrix(predicted_factor, actual_factor, mode = "everything")
cat("\n--- XGBoost Model Evaluation: Classification Report ---\n")
##
## --- XGBoost Model Evaluation: Classification Report ---
print(xgb_conf_matrix)
## Confusion Matrix and Statistics
##
##                                   Reference
## Prediction                         INJURY_AND_._OR_TOW_DUE_TO_CRASH
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                            20718
##   NO_INJURY_._DRIVE_AWAY                                       6861
##                                   Reference
## Prediction                         NO_INJURY_._DRIVE_AWAY
##   INJURY_AND_._OR_TOW_DUE_TO_CRASH                   2894
##   NO_INJURY_._DRIVE_AWAY                            32318
##
##                Accuracy : 0.8446
##                  95% CI : (0.8418, 0.8475)
##     No Information Rate : 0.5608
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6796
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.7512
##             Specificity : 0.9178
##          Pos Pred Value : 0.8774
##          Neg Pred Value : 0.8249
##               Precision : 0.8774
##                  Recall : 0.7512
##                      F1 : 0.8094
##              Prevalence : 0.4392
##          Detection Rate : 0.3300
##    Detection Prevalence : 0.3760
##       Balanced Accuracy : 0.8345
##
##        'Positive' Class : INJURY_AND_._OR_TOW_DUE_TO_CRASH
## 
plot_confusion_matrix_heatmap(xgb_conf_matrix, "XGBoost")

Between the two models evaluated, XGBoost slightly outperformed Random Forest in terms of overall classification performance. XGBoost achieved a higher accuracy (84.5%) compared to Random Forest’s 83.8%, and also recorded a better F1 score (0.8094 vs. 0.8037), indicating a more balanced performance between precision and recall. Although Random Forest showed marginally higher recall (75.7% vs. 75.1%), XGBoost’s superior F1 score and accuracy suggest it is the more effective model overall for predicting injury-related crashes.

Conclusion

In this analysis, we performed regression analysis on the injuries_total column from the traffic_accident.csv dataset. We engineered features, trained three baseline regression models, and evaluated their performance using RMSE and R-squared metrics. The results indicate that [insert insights based on results].

Future Work

Future work could include hyperparameter tuning, exploring additional models, and further feature engineering to improve model performance. ```

Notes:

  • Replace "Your Name" with your actual name.
  • Ensure that the traffic_accident.csv file is in the working directory or provide the correct path.
  • You may need to adjust the feature engineering section based on the actual columns in your dataset.
  • The evaluation metrics can be expanded to include MAE, MSE, etc., as needed.