1 Introduction

Road accidents are a serious public health issue worldwide. Every year, millions of people are injured or killed on the roads, and many of these accidents could potentially be prevented if we had better information about where and when they are likely to happen.

For this project, we decided to look at a real-world dataset of road accidents in the United States. We chose this topic because it connects directly to SDG 3 (Good Health and Well-being), specifically Target 3.6 which calls for halving the number of deaths and injuries from road traffic accidents by 2030. We thought it would be meaningful to work on something with real-world impact rather than a simulated dataset.

1.1 Objectives

The two main objectives of this project are:

Regression: To predict the expected monthly accident frequency across ZIP code areas based on weather conditions, road infrastructure, and temporal factors.
Classification: To develop a classification model that determines whether a road accident results in high severity (Severity 3 or 4) based on environmental conditions, road features, and time of day.

1.2 SDG 3 Alignment

SDG Target	How Our Project Contributes
3.6 — Halve road traffic deaths	Our models help identify when and where accidents are most likely and most severe, enabling proactive safety measures
3.d — Strengthen health risk management	Severity predictions can help emergency services allocate resources more effectively
3.8 — Universal health coverage	Faster and better-targeted emergency response improves survival outcomes

2 Dataset Overview

2.1 About the Data

The dataset we are using is the US Accidents dataset, which was collected from multiple real-time traffic APIs across the United States between February 2016 and March 2023. It covers 49 US states and is one of the largest publicly available accident datasets.

We are using a sampled version with 500,000 rows and 46 columns, giving us a total dimension of 23,000,000 cells which easily exceeds the minimum requirement of 100,000.

Detail	Value
Dataset name	US Accidents (2016–2023)
Source	Kaggle — Sobhan Moosavi et al.
Link	https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents
Rows	500,000
Columns	46
Total dimension	23,000,000 cells
Coverage	49 US states
Years	February 2016 – March 2023

The key column groups in the dataset are:

Accident details: ID, Severity (1–4), Start and End time, Distance affected
Location: Street, City, County, State, ZIP code, GPS coordinates
Weather: Temperature, Humidity, Visibility, Wind Speed, Precipitation, Weather Condition
Road features: Junction, Traffic Signal, Crossing, Railway, Stop sign (Boolean)
Time of day: Sunrise/Sunset indicator

2.2 Loading Libraries and Data

# load all packages needed for this project
library(tidyverse)
library(lubridate)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)
library(corrplot)
library(gridExtra)
library(caret)
library(rpart)
library(rpart.plot)

# load the dataset
df_raw <- read_csv("US_Accidents_March23_sampled_500k.csv",
                   show_col_types = FALSE)

# confirm dimensions
cat("Number of rows:   ", format(nrow(df_raw), big.mark=","), "\n")

## Number of rows:    500,000

cat("Number of columns:", ncol(df_raw), "\n")

## Number of columns: 46

cat("Total cells:      ", format(nrow(df_raw)*ncol(df_raw), big.mark=","), "\n")

## Total cells:       23,000,000

# preview the first few rows
head(df_raw, 5)

# check structure
glimpse(df_raw)

## Rows: 500,000
## Columns: 46
## $ ID                    <chr> "A-2047758", "A-4694324", "A-5006183", "A-423735…
## $ Source                <chr> "Source2", "Source1", "Source1", "Source1", "Sou…
## $ Severity              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Start_Time            <dttm> 2019-06-12 10:10:56, 2022-12-03 23:37:14, 2022-…
## $ End_Time              <dttm> 2019-06-12 10:55:58, 2022-12-04 01:56:53, 2022-…
## $ Start_Lat             <dbl> 30.64121, 38.99056, 34.66119, 43.68059, 35.39548…
## $ Start_Lng             <dbl> -91.15348, -77.39907, -120.49282, -92.99332, -11…
## $ End_Lat               <dbl> NA, 38.99004, 34.66119, 43.68057, 35.39548, NA, …
## $ End_Lng               <dbl> NA, -77.39828, -120.49244, -92.97222, -118.98600…
## $ `Distance(mi)`        <dbl> 0.000, 0.056, 0.022, 1.054, 0.046, 0.000, 0.000,…
## $ Description           <chr> "Accident on LA-19 Baker-Zachary Hwy at Lower Za…
## $ Street                <chr> "Highway 19", "Forest Ridge Dr", "Floradale Ave"…
## $ City                  <chr> "Zachary", "Sterling", "Lompoc", "Austin", "Bake…
## $ County                <chr> "East Baton Rouge", "Loudoun", "Santa Barbara", …
## $ State                 <chr> "LA", "VA", "CA", "MN", "CA", "MA", "OR", "FL", …
## $ Zipcode               <chr> "70791-4610", "20164-2813", "93436", "55912", "9…
## $ Country               <chr> "US", "US", "US", "US", "US", "US", "US", "US", …
## $ Timezone              <chr> "US/Central", "US/Eastern", "US/Pacific", "US/Ce…
## $ Airport_Code          <chr> "KBTR", "KIAD", "KLPC", "KAUM", "KBFL", "KBVY", …
## $ Weather_Timestamp     <dttm> 2019-06-12 09:53:00, 2022-12-03 23:52:00, 2022-…
## $ `Temperature(F)`      <dbl> 77.0, 45.0, 68.0, 27.0, 42.0, 42.0, 35.0, 90.0, …
## $ `Wind_Chill(F)`       <dbl> 77.0, 43.0, 68.0, 15.0, 42.0, 35.0, 35.0, 90.0, …
## $ `Humidity(%)`         <dbl> 62, 48, 73, 86, 34, 58, 89, 55, 39, 78, 97, 84, …
## $ `Pressure(in)`        <dbl> 29.92, 29.91, 29.79, 28.49, 29.77, 29.37, 28.71,…
## $ `Visibility(mi)`      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Wind_Direction        <chr> "NW", "W", "W", "ENE", "CALM", "W", "CALM", "SW"…
## $ `Wind_Speed(mph)`     <dbl> 5.0, 5.0, 13.0, 15.0, 0.0, 13.0, 0.0, 12.0, 7.0,…
## $ `Precipitation(in)`   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
## $ Weather_Condition     <chr> "Fair", "Fair", "Fair", "Wintry Mix", "Fair", "F…
## $ Amenity               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Bump                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Crossing              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Give_Way              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Junction              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ No_Exit               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Railway               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Roundabout            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Station               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Stop                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Calming       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Signal        <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FA…
## $ Turning_Loop          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Sunrise_Sunset        <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Civil_Twilight        <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Nautical_Twilight     <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Astronomical_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…

3 Data Cleaning

3.1 Step 1: Checking for Missing Values

The first thing we do before any analysis is check how many values are missing in each column, so we can decide what to do about them.

missing_summary <- df_raw %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "Column",
               values_to = "Missing_Count") %>%
  mutate(Missing_Pct = round(Missing_Count / nrow(df_raw) * 100, 1)) %>%
  filter(Missing_Count > 0) %>%
  arrange(desc(Missing_Pct))

missing_summary %>%
  kable(caption = "Table 1: Columns with missing values",
        col.names = c("Column", "Missing Count", "Missing %")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(missing_summary$Missing_Pct > 20),
           background = "#ffe0e0") %>%
  row_spec(which(missing_summary$Missing_Pct > 0 &
                   missing_summary$Missing_Pct <= 20),
           background = "#fff8dc")

Table 1: Columns with missing values
Column	Missing Count	Missing %
End_Lat	220377	44.1
End_Lng	220377	44.1
Precipitation(in)	142616	28.5
Wind_Chill(F)	129017	25.8
Wind_Speed(mph)	36987	7.4
Visibility(mi)	11291	2.3
Humidity(%)	11130	2.2
Wind_Direction	11197	2.2
Weather_Condition	11101	2.2
Temperature(F)	10466	2.1
Pressure(in)	8928	1.8
Weather_Timestamp	7674	1.5
Airport_Code	1446	0.3
Sunrise_Sunset	1483	0.3
Civil_Twilight	1483	0.3
Nautical_Twilight	1483	0.3
Astronomical_Twilight	1483	0.3
Street	691	0.1
Timezone	507	0.1
Description	1	0.0
City	19	0.0
Zipcode	116	0.0

Our decisions on how to handle missing values:

End_Lat / End_Lng — missing 44%, drop these columns (not needed for our analysis)
Wind_Chill(F) — missing 26%, drop this column (it is very similar to Temperature anyway)
Precipitation(in) — missing 28.5%, replace with 0 (assuming no rain when not recorded)
Other weather columns (< 10% missing) — replace with column median
Zipcode — drop the 116 rows where it is missing since we need it for regression

3.2 Step 2: Parse Dates and Create New Variables

df <- df_raw %>%
  mutate(
    # parse the datetime column
    Start_Time  = parse_date_time(Start_Time,
                                  orders = c("ymd HMS","ymd HM"),
                                  quiet  = TRUE),
    # extract useful time features from the date
    Year        = year(Start_Time),
    Month       = month(Start_Time),
    Month_name  = month(Start_Time, label = TRUE, abbr = TRUE),
    Hour        = hour(Start_Time),
    Day_of_week = wday(Start_Time, label = TRUE)
  ) %>%
  # create target variable for classification (Severity 3 or 4 = High)
  mutate(
    High_Severity = if_else(Severity >= 3, "Yes", "No"),
    High_Severity = factor(High_Severity, levels = c("No","Yes"))
  )

cat("Date parsing complete. New columns added: Year, Month, Hour, Day_of_week, High_Severity\n")

## Date parsing complete. New columns added: Year, Month, Hour, Day_of_week, High_Severity

df %>% select(Start_Time, Year, Month, Hour, High_Severity) %>% head(3)

3.3 Step 3: Clean and Rename Columns

df_clean <- df %>%
  # drop columns we do not need
  select(-any_of(c(
    "End_Lat", "End_Lng", "Wind_Chill(F)", "Description",
    "Airport_Code", "Country", "ID", "Source",
    "Civil_Twilight", "Nautical_Twilight", "Astronomical_Twilight",
    "Timezone", "Weather_Timestamp", "Wind_Direction",
    "Bump", "Give_Way", "No_Exit", "Roundabout",
    "Station", "Traffic_Calming", "Turning_Loop",
    "Amenity", "End_Time"
  ))) %>%

  # remove rows with missing Zipcode (only 116 rows)
  filter(!is.na(Zipcode)) %>%

  # fill Precipitation with 0 where missing
  mutate(`Precipitation(in)` = replace_na(`Precipitation(in)`, 0)) %>%

  # fill other numeric weather columns with the median value
  mutate(
    `Temperature(F)`  = replace_na(`Temperature(F)`,
                                    median(`Temperature(F)`,  na.rm=TRUE)),
    `Humidity(%)`     = replace_na(`Humidity(%)`,
                                    median(`Humidity(%)`,     na.rm=TRUE)),
    `Visibility(mi)`  = replace_na(`Visibility(mi)`,
                                    median(`Visibility(mi)`,  na.rm=TRUE)),
    `Wind_Speed(mph)` = replace_na(`Wind_Speed(mph)`,
                                    median(`Wind_Speed(mph)`, na.rm=TRUE))
  ) %>%

  # fill text columns
  mutate(
    Sunrise_Sunset = replace_na(Sunrise_Sunset, "Day"),
    City           = replace_na(City,           "Unknown"),
    Street         = replace_na(Street,         "Unknown")
  ) %>%

  # rename weather columns to clean R-friendly names
  rename(
    Temperature_F    = `Temperature(F)`,
    Humidity_pct     = `Humidity(%)`,
    Visibility_mi    = `Visibility(mi)`,
    Wind_Speed_mph   = `Wind_Speed(mph)`,
    Precipitation_in = `Precipitation(in)`,
    Pressure_in      = `Pressure(in)`,
    Distance_mi      = `Distance(mi)`
  )

cat("Rows after cleaning:", format(nrow(df_clean), big.mark=","), "\n")

## Rows after cleaning: 499,884

cat("Columns after cleaning:", ncol(df_clean), "\n")

## Columns after cleaning: 29

cat("Remaining missing values:", sum(is.na(df_clean)), "\n")

## Remaining missing values: 20355

3.4 Step 4: Final Cleaned Dataset Summary

cat("=== Severity Distribution ===\n")

## === Severity Distribution ===

table(df_clean$Severity)

## 
##      1      2      3      4 
##   4271 398058  84496  13059

cat("\n=== High Severity (our classification target) ===\n")

## 
## === High Severity (our classification target) ===

table(df_clean$High_Severity)

## 
##     No    Yes 
## 402329  97555

cat("\nPositive class rate:",
    round(mean(df_clean$High_Severity=="Yes")*100, 1), "%\n")

## 
## Positive class rate: 19.5 %

The cleaned dataset is ready. We note that about 19.5% of accidents are high severity (Severity 3 or 4) — this is an important finding that we will come back to when building the classification model, as it means the classes are imbalanced.

4 Exploratory Data Analysis (EDA)

Before building any models, we want to thoroughly explore and understand the data. This section covers 13 different visualisations looking at the data from different angles.

4.1 4.1 Severity Distribution

sev_counts <- df_clean %>%
  count(Severity) %>%
  mutate(
    pct   = n / sum(n) * 100,
    label = paste0(format(n, big.mark=","), "\n(", round(pct,1), "%)")
  )

ggplot(sev_counts, aes(x = factor(Severity), y = n,
                       fill = factor(Severity))) +
  geom_col(width = 0.6, colour = "white", linewidth = 0.5) +
  geom_text(aes(label = label), vjust = -0.3,
            size = 3.8, fontface = "bold") +
  scale_fill_manual(
    values = c("#4CAF50","#FFC107","#FF9800","#F44336"),
    labels = c("1 - Minor","2 - Moderate","3 - Serious","4 - Critical")
  ) +
  scale_y_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "Distribution of Accident Severity Levels",
    subtitle = "Severity 2 accounts for nearly 80% of all accidents",
    x        = "Severity Level",
    y        = "Number of Accidents",
    fill     = "Severity Level",
    caption  = "Source: US Accidents Dataset 2016–2023"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title      = element_text(face = "bold", size = 14),
    plot.subtitle   = element_text(colour = "grey40"),
    legend.position = "top"
  )

Discussion: Most accidents (79.6%) are classified as Severity 2 (moderate). Only about 19.5% are high severity (3 or 4 combined). This shows a clear class imbalance in our classification target variable, which we need to be aware of when interpreting model accuracy — a model that always predicts “No” would still get ~80% accuracy, which would be misleading.

4.2 4.2 Yearly Trend — Are Accidents Increasing?

yearly <- df_clean %>%
  count(Year) %>%
  filter(Year < 2023)  # 2023 only has Jan–Mar data, exclude for trend

ggplot(yearly, aes(x = Year, y = n)) +
  geom_line(colour = "#1565C0", linewidth = 1.5) +
  geom_point(colour = "#1565C0", size = 4,
             fill = "white", shape = 21, stroke = 2) +
  geom_text(aes(label = format(n, big.mark=",")),
            vjust = -1.2, size = 3.5, colour = "#333333") +
  annotate("rect", xmin=2019.6, xmax=2020.4, ymin=0, ymax=130000,
           alpha=0.1, fill="red") +
  annotate("text", x=2020, y=118000,
           label="COVID-19\n(2020)", size=3,
           colour="red", fontface="italic") +
  scale_x_continuous(breaks = 2016:2022) +
  scale_y_continuous(labels = comma,
                     limits = c(0, 130000),
                     expand = expansion(mult = c(0, 0.1))) +
  labs(
    title    = "US Road Accidents Have Been Increasing Every Year",
    subtitle = "Steady growth from 2016 to 2022, even during the COVID-19 pandemic",
    x        = "Year",
    y        = "Number of Accidents",
    caption  = "Note: 2023 excluded — only Jan to March data available"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

Discussion: Road accidents in the dataset have increased steadily from 26,663 in 2016 to 113,734 in 2022 — more than a fourfold increase over 6 years. Even during COVID-19 in 2020, when traffic was expected to decrease significantly, accidents still numbered 76,155. This could suggest that emptier roads encouraged faster driving, leading to more accidents per vehicle-km despite fewer total trips.

4.3 4.3 Which States Have the Most Accidents?

state_df <- df_clean %>%
  count(State, sort = TRUE) %>%
  slice_head(n = 20) %>%
  mutate(
    State   = fct_reorder(State, n),
    is_top3 = rank(-n) <= 3
  )

ggplot(state_df, aes(x = n, y = State, fill = is_top3)) +
  geom_col(colour = "white", linewidth = 0.3) +
  geom_text(aes(label = format(n, big.mark=",")),
            hjust = -0.1, size = 3.3, fontface = "bold") +
  scale_fill_manual(values = c("FALSE" = "grey75",
                               "TRUE"  = "#1565C0")) +
  scale_x_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "Top 20 US States by Number of Accidents",
    subtitle = "California alone accounts for over 22% of all accidents",
    x        = "Number of Accidents",
    y        = "State"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

Discussion: California (113,274), Florida (56,710), and Texas (37,355) are the top three states. These are also the three most populous states in the US, so this finding is expected. However, it also means that State will be an important predictor in our regression model because accident frequency is strongly tied to population density and driving activity.

4.4 4.4 What Hour of Day Do Accidents Happen?

# heatmap of severity breakdown by hour
hour_sev <- df_clean %>%
  count(Hour, Severity) %>%
  group_by(Hour) %>%
  mutate(pct = n / sum(n) * 100)

ggplot(hour_sev, aes(x = Hour, y = factor(Severity), fill = pct)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = round(pct, 0)),
            size = 3, colour = "white", fontface = "bold") +
  scale_fill_gradient(low = "#BBDEFB", high = "#0D47A1",
                      name = "% of that\nhour's accidents") +
  scale_x_continuous(breaks = seq(0,23,3),
                     labels = paste0(seq(0,23,3),":00")) +
  labs(
    title    = "Accident Severity Breakdown by Hour of Day",
    subtitle = "Each column adds up to 100% — darker = higher share of that severity in that hour",
    x        = "Hour of Day",
    y        = "Severity Level"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40"),
    panel.grid    = element_blank()
  )

# volume by hour
hour_vol <- df_clean %>%
  count(Hour) %>%
  mutate(
    period = case_when(
      Hour %in% 7:9   ~ "Morning Rush",
      Hour %in% 16:18 ~ "Evening Rush",
      Hour %in% 0:4   ~ "Late Night",
      TRUE            ~ "Other"
    )
  )

ggplot(hour_vol, aes(x = Hour, y = n, fill = period)) +
  geom_col(colour = "white", linewidth = 0.3) +
  scale_fill_manual(
    values = c("Morning Rush" = "#E53935",
               "Evening Rush" = "#E53935",
               "Late Night"   = "#5C6BC0",
               "Other"        = "#B0BEC5")
  ) +
  scale_x_continuous(breaks = 0:23) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Accident Volume by Hour of Day",
    subtitle = "Rush hours (7–9 AM and 4–6 PM) have the highest accident counts",
    x        = "Hour (24h clock)",
    y        = "Number of Accidents",
    fill     = "Time Period"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title      = element_text(face = "bold", size = 14),
    plot.subtitle   = element_text(colour = "grey40"),
    legend.position = "top"
  )

Discussion: Accident volume peaks during rush hours — 7–9 AM and 4–6 PM. These are the times when roads are most congested. However, the heatmap reveals something more interesting: early morning hours (2–4 AM) have a noticeably higher proportion of Severity 3 and 4 accidents despite low total volume. This suggests that while fewer accidents happen late at night, the ones that do tend to be more serious — likely due to fatigue, reduced visibility, and potentially impaired driving.

4.5 4.5 Day of Week Patterns

dow_df <- df_clean %>%
  mutate(Day_of_week = fct_relevel(Day_of_week,
    "Mon","Tue","Wed","Thu","Fri","Sat","Sun")) %>%
  group_by(Day_of_week) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  )

p1 <- ggplot(dow_df, aes(x = Day_of_week, y = total,
                          fill = Day_of_week)) +
  geom_col(colour = "white") +
  scale_fill_brewer(palette = "Blues") +
  scale_y_continuous(labels = comma) +
  labs(title = "Total Accidents by Day of Week",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

p2 <- ggplot(dow_df, aes(x = Day_of_week, y = pct_high,
                          fill = Day_of_week)) +
  geom_col(colour = "white") +
  geom_text(aes(label = paste0(round(pct_high,1),"%")),
            vjust = -0.4, size = 3.2) +
  scale_fill_brewer(palette = "Reds") +
  scale_y_continuous(limits = c(0, 28)) +
  labs(title = "% High Severity by Day of Week",
       x = NULL, y = "% Severe") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2,
  top = "Weekdays Have More Accidents; Weekends Have Slightly Higher Severity Rate")

Discussion: Weekdays — especially Friday — have significantly more accidents in total, which is expected because more commuting happens on weekdays. However, the severity percentage is slightly higher on weekends (Saturday and Sunday). This could be because weekend driving includes more long-distance recreational trips at highway speeds, and potentially more impaired driving in the evening. The Friday peak in volume is particularly important for our regression model.

4.6 4.6 Monthly Patterns

monthly_df <- df_clean %>%
  group_by(Month, Month_name) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  )

p1 <- ggplot(monthly_df,
             aes(x = fct_reorder(Month_name, Month), y = total)) +
  geom_line(aes(group=1), colour = "#1565C0", linewidth = 1.3) +
  geom_point(colour = "#1565C0", size = 3.5,
             fill = "white", shape = 21, stroke = 1.5) +
  scale_y_continuous(labels = comma) +
  labs(title = "Total Accidents by Month",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

p2 <- ggplot(monthly_df,
             aes(x = fct_reorder(Month_name, Month),
                 y = pct_high, fill = pct_high)) +
  geom_col(colour = "white") +
  geom_text(aes(label = paste0(round(pct_high,1),"%")),
            vjust = -0.4, size = 3) +
  scale_fill_gradient(low = "#FFF9C4", high = "#E53935") +
  scale_y_continuous(limits = c(0, 28)) +
  labs(title = "% High Severity by Month",
       x = NULL, y = "% Severe") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2,
  top = "Counter-Intuitive: Winter Has More Accidents But Summer Has Higher Severity Rate")

Discussion: This was one of the most surprising findings in our EDA. December and November have the highest total accident counts (likely due to darker conditions and more holiday driving), but June and July have the highest percentage of severe accidents (23.6% and 23.0% respectively). This is the opposite of what we initially expected. Our best explanation is that summer encourages faster driving, more road trips, more motorcycles, and potentially more risk-taking behaviour — all of which lead to worse outcomes when accidents do occur.

4.7 4.7 Weather Effects — Temperature

df_clean %>%
  filter(Temperature_F > -30, Temperature_F < 130) %>%
  ggplot(aes(x = factor(Severity), y = Temperature_F,
             fill = factor(Severity))) +
  geom_boxplot(outlier.size = 0.5, alpha = 0.8,
               outlier.colour = "grey70") +
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 3.5, colour = "black") +
  scale_fill_manual(
    values = c("#4CAF50","#FFC107","#FF9800","#F44336")
  ) +
  labs(
    title    = "Temperature Distribution Across Severity Levels",
    subtitle = "Black diamond = mean | Boxes show middle 50% of values",
    x        = "Severity Level",
    y        = "Temperature (°F)",
    fill     = "Severity"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

Discussion: The temperature distributions across severity levels are broadly similar — the median and mean temperatures are close for all four severity levels. This tells us that temperature alone is not a strong predictor of severity, which is consistent with what we saw in the correlation analysis. However, the spread of temperatures is slightly wider for lower severity accidents (1 and 2), suggesting that extreme temperatures (both hot and cold) may be associated with more severe outcomes in some cases.

4.8 4.8 Weather Effects — Visibility

df_clean %>%
  mutate(
    vis_group = cut(Visibility_mi,
                    breaks = c(0, 1, 3, 5, 10),
                    labels = c("< 1 mile\n(Very Low)",
                               "1–3 miles\n(Low)",
                               "3–5 miles\n(Medium)",
                               "> 5 miles\n(Good)"),
                    include.lowest = TRUE)
  ) %>%
  filter(!is.na(vis_group)) %>%
  group_by(vis_group) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  ) %>%
  ggplot(aes(x = vis_group, y = pct_high, fill = pct_high)) +
  geom_col(colour = "white", width = 0.65) +
  geom_text(aes(label = paste0(round(pct_high,1),"% (",
                               format(total, big.mark=","),")")),
            vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_fill_gradient(low = "#E8F5E9", high = "#B71C1C") +
  scale_y_continuous(limits = c(0, 30),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(
    title    = "Visibility Level vs Proportion of High Severity Accidents",
    subtitle = "% of accidents that were Severity 3 or 4 at each visibility level\n(Total accident count shown in brackets)",
    x        = "Visibility Level",
    y        = "% High Severity Accidents"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

Discussion: The relationship between visibility and accident severity is not as straightforward as we expected. The 1–3 mile visibility range actually has the highest severe accident rate (22.6%), not the lowest visibility category. We think this is because in extremely low visibility conditions (< 1 mile, such as dense fog), drivers tend to slow down significantly, which reduces impact severity when accidents do occur. The 1–3 mile range may represent conditions where drivers underestimate the risk and maintain normal speeds.

4.9 4.9 Rain vs No Rain

df_clean %>%
  mutate(Rain = if_else(Precipitation_in > 0,
                        "Raining", "Not Raining")) %>%
  group_by(Rain) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  ) %>%
  ggplot(aes(x = Rain, y = pct_high, fill = Rain)) +
  geom_col(width = 0.5, colour = "white") +
  geom_text(aes(label = paste0(round(pct_high,1),"%\n(",
                               format(total, big.mark=",")," accidents)")),
            vjust = -0.3, size = 4.2, fontface = "bold") +
  scale_fill_manual(values = c("Not Raining" = "#90CAF9",
                               "Raining"     = "#1565C0")) +
  scale_y_continuous(limits = c(0, 30)) +
  labs(
    title    = "Rain vs No Rain: Effect on Accident Severity",
    subtitle = "Accidents during rain are about 18% more likely to be high severity",
    x        = NULL,
    y        = "% High Severity Accidents"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

Discussion: Rain clearly increases the risk of a severe accident. The severe accident rate rises from 19.3% in dry conditions to 22.8% when it is raining — an increase of about 18%. While rain is not the most powerful predictor in our dataset, this is still a practically meaningful difference. This supports the idea that real-time weather alerts to drivers during rainfall could potentially reduce accident severity.

4.10 4.10 Road Features and Severity

road_features <- c("Junction","Crossing","Traffic_Signal","Railway","Stop")

road_df <- map_dfr(road_features, function(feat) {
  df_clean %>%
    filter(!!sym(feat) == TRUE) %>%
    summarise(
      Feature  = feat,
      total    = n(),
      pct_high = mean(High_Severity == "Yes") * 100
    )
})

# baseline: no features nearby
baseline <- df_clean %>%
  filter(!Junction, !Crossing, !Traffic_Signal, !Railway, !Stop) %>%
  summarise(
    Feature  = "No Features (Baseline)",
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100
  )

road_df <- bind_rows(baseline, road_df) %>%
  mutate(
    Feature = fct_reorder(Feature, pct_high),
    is_base = Feature == "No Features (Baseline)"
  )

ggplot(road_df, aes(x = pct_high, y = Feature, fill = is_base)) +
  geom_col(colour = "white", width = 0.65) +
  geom_text(aes(label = paste0(round(pct_high,1),"% (",
                               format(total, big.mark=","),")")),
            hjust = -0.05, size = 3.5, fontface = "bold") +
  geom_vline(xintercept = baseline$pct_high,
             linetype = "dashed", colour = "#E53935",
             linewidth = 0.9) +
  scale_fill_manual(values = c("TRUE"  = "#B0BEC5",
                               "FALSE" = "#1565C0")) +
  scale_x_continuous(limits = c(0, 40),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Road Features vs Proportion of High Severity Accidents",
    subtitle = "Red dashed line = baseline (no features)\n% of accidents that were high severity at locations with each feature",
    x        = "% High Severity",
    y        = "Road Feature"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

Discussion: Junctions show the highest severe accident rate at 27.1%, compared to the baseline of 19.3% for locations with no nearby road features. Railway crossings also show an elevated rate at 15.3% (though this number is lower than baseline because the sample size is small — only 4,320 accidents). Junction accidents tend to involve side impacts or T-bone collisions which are inherently more dangerous than rear-end impacts on straight roads. This finding has direct policy implications: improving junction design, visibility, and speed controls could reduce the severity of accidents at these locations.

4.11 4.11 Day vs Night

dn_df <- df_clean %>%
  filter(!is.na(Sunrise_Sunset)) %>%
  group_by(Sunrise_Sunset) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  )

p1 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = total,
                         fill = Sunrise_Sunset)) +
  geom_col(width = 0.5, colour = "white") +
  geom_text(aes(label = format(total, big.mark=",")),
            vjust = -0.4, fontface = "bold", size = 4.5) +
  scale_fill_manual(values = c("Day"="#FDD835", "Night"="#283593")) +
  scale_y_continuous(labels = comma, limits = c(0, 400000)) +
  labs(title = "Volume: Day vs Night", x = NULL, y = "Count") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

p2 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = pct_high,
                         fill = Sunrise_Sunset)) +
  geom_col(width = 0.5, colour = "white") +
  geom_text(aes(label = paste0(round(pct_high,1),"%")),
            vjust = -0.4, fontface = "bold", size = 4.5) +
  scale_fill_manual(values = c("Day"="#FDD835", "Night"="#283593")) +
  scale_y_continuous(limits = c(0, 28)) +
  labs(title = "% High Severity: Day vs Night", x = NULL, y = "%") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2,
  top = "More Accidents During Daytime, But Night-time Accidents Are More Severe")

Discussion: Daytime has more than twice the volume of night-time accidents (344,967 vs 153,550), simply because more people drive during daylight hours. However, night-time accidents show a slightly higher proportion of severe outcomes. Reduced road visibility, driver fatigue after a long day, and the higher likelihood of speeding on empty roads at night all likely contribute to this pattern. This finding supports targeted night-time road safety interventions in high-risk areas.

4.12 4.12 Correlation Between Numeric Variables

num_cols <- df_clean %>%
  select(Severity, Temperature_F, Humidity_pct,
         Visibility_mi, Wind_Speed_mph,
         Precipitation_in, Distance_mi) %>%
  drop_na()

corr_matrix <- cor(num_cols)

corrplot(corr_matrix,
         method      = "color",
         type        = "lower",
         addCoef.col = "black",
         number.cex  = 0.8,
         tl.cex      = 0.9,
         tl.col      = "black",
         col         = colorRampPalette(
           c("#2166AC","white","#D6604D"))(200),
         title       = "Correlation Matrix: Numeric Variables",
         mar         = c(0,0,2,0))

Discussion: None of the individual weather or road variables have a strong linear correlation with Severity. The highest correlation is Humidity at 0.03 and Precipitation at 0.02 — both very small. This tells us two important things: (1) no single variable can predict accident severity on its own, and (2) we will need a model that uses combinations of variables together. Temperature and Humidity show a moderate negative correlation (-0.17) with each other, which makes physical sense — humidity tends to be higher in cooler conditions.

4.13 4.13 Top Weather Conditions at Time of Accident

Weather_Condition was excluded from df_clean during the cleaning step because it is a free-text categorical variable with hundreds of levels that is not used as a predictor in our models. However it is useful for EDA, so we refer back to df_raw for this chart only.

# use df_raw since Weather_Condition was dropped in cleaning
df_raw %>%
  filter(!is.na(Weather_Condition)) %>%
  count(Weather_Condition, sort = TRUE) %>%
  slice_head(n = 12) %>%
  mutate(Weather_Condition = fct_reorder(Weather_Condition, n)) %>%
  ggplot(aes(x = n, y = Weather_Condition, fill = n)) +
  geom_col(colour = "white") +
  geom_text(aes(label = format(n, big.mark=",")),
            hjust = -0.1, size = 3.5) +
  scale_fill_gradient(low = "#E3F2FD", high = "#0D47A1") +
  scale_x_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Most Common Weather Conditions During Accidents",
    subtitle = "The majority of accidents happen in Fair or Clear weather — not bad weather",
    x        = "Number of Accidents",
    y        = "Weather Condition"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

Discussion: The most common weather condition during accidents is “Fair” — clear, dry, good weather. This seems counterintuitive, but it reflects what is called exposure bias: most driving happens in good weather, so most accidents naturally occur in good weather too. This does not mean good weather is more dangerous — it just means there is more traffic exposure. When we look at rate of accidents relative to how much driving happens, bad weather conditions are indeed more dangerous per kilometre driven.

4.14 EDA Summary

The table below summarises the 10 most important findings from our EDA and how each one influences our modelling approach.

tribble(
  ~"Key Finding", ~"Implication for Modelling",
  "Severity 2 = 79.6% of accidents", "Class imbalance — accuracy alone is not a good metric for classification",
  "Accidents grew 4× from 2016 to 2022", "Road safety is a worsening problem; motivates use of Month as temporal predictor",
  "California = 22% of all accidents", "State is a strong predictor for the regression model",
  "Rush hours (7–9 AM, 4–6 PM) have most accidents", "Hour and time period are useful regression predictors",
  "Late night (2–4 AM) has higher severe proportion", "Hour affects severity — important for classification",
  "Summer months (Jun–Jul) have highest severe rate", "Month is a useful classification predictor (counter-intuitive direction)",
  "Rain increases severe rate from 19.3% to 22.8%", "Precipitation is a useful classification predictor",
  "Junctions: 27% severe vs 19% baseline", "Junction feature is an important classification predictor",
  "Most accidents happen in Fair weather", "Exposure bias — raw counts do not equal danger rates",
  "No single variable strongly correlates with severity", "We need a multi-variable model, not a simple rule"
) %>%
  kable(caption = "Table 2: Summary of Key EDA Findings and Their Modelling Implications") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = TRUE)

Table 2: Summary of Key EDA Findings and Their Modelling Implications
Key Finding	Implication for Modelling
Severity 2 = 79.6% of accidents	Class imbalance — accuracy alone is not a good metric for classification
Accidents grew 4× from 2016 to 2022	Road safety is a worsening problem; motivates use of Month as temporal predictor
California = 22% of all accidents	State is a strong predictor for the regression model
Rush hours (7–9 AM, 4–6 PM) have most accidents	Hour and time period are useful regression predictors
Late night (2–4 AM) has higher severe proportion	Hour affects severity — important for classification
Summer months (Jun–Jul) have highest severe rate	Month is a useful classification predictor (counter-intuitive direction)
Rain increases severe rate from 19.3% to 22.8%	Precipitation is a useful classification predictor
Junctions: 27% severe vs 19% baseline	Junction feature is an important classification predictor
Most accidents happen in Fair weather	Exposure bias — raw counts do not equal danger rates
No single variable strongly correlates with severity	We need a multi-variable model, not a simple rule

5 Regression Model: Predicting Monthly Accident Frequency

5.1 Objective

To predict the expected monthly accident frequency across ZIP code areas based on weather conditions, road infrastructure, and temporal factors.

If we can predict how many accidents an area tends to have in a given month, authorities can plan resources ahead of time — more patrols in high-risk areas, warning signs on accident-prone routes, or public awareness campaigns before the busiest months.

5.2 Preparing the Regression Dataset

Since we want to predict accident count per area per month, we first need to aggregate the data from individual accidents to ZIP code × Month level.

df_agg <- df_clean %>%
  filter(!is.na(Zipcode)) %>%
  group_by(Zipcode, Month, State) %>%
  summarise(
    accident_count   = n(),                                      # this is what we predict
    avg_temp         = mean(Temperature_F,    na.rm = TRUE),
    avg_humidity     = mean(Humidity_pct,     na.rm = TRUE),
    avg_visibility   = mean(Visibility_mi,    na.rm = TRUE),
    avg_wind         = mean(Wind_Speed_mph,   na.rm = TRUE),
    avg_precip       = mean(Precipitation_in, na.rm = TRUE),
    junction_present = as.integer(any(Junction)),
    signal_present   = as.integer(any(Traffic_Signal)),
    crossing_present = as.integer(any(Crossing)),
    .groups = "drop"
  )

cat("Aggregated dataset size:", format(nrow(df_agg), big.mark=","),
    "ZIP-month combinations\n")

## Aggregated dataset size: 223,308 ZIP-month combinations

cat("accident_count range:", min(df_agg$accident_count),
    "to", max(df_agg$accident_count), "\n")

## accident_count range: 1 to 80

cat("accident_count mean:", round(mean(df_agg$accident_count), 2), "\n")

## accident_count mean: 2.24

ggplot(df_agg %>% filter(accident_count <= 20),
       aes(x = accident_count)) +
  geom_histogram(binwidth = 1, fill = "#1565C0",
                 colour = "white", alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Distribution of Monthly Accident Counts per ZIP Code",
    subtitle = "Most ZIP codes have only 1–3 accidents per month — right-skewed distribution",
    x        = "Monthly Accident Count per ZIP Code",
    y        = "Number of ZIP-Month Observations"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(colour = "grey40"))

Discussion: The outcome variable (accident count per ZIP code per month) is right-skewed — most ZIP codes have very few accidents per month, but a small number of high-traffic ZIP codes have many. This is typical for count data and means we should not expect very high R² values from a linear model.

5.3 Train/Test Split

set.seed(42)  # set seed so results are the same every time

train_size <- floor(0.8 * nrow(df_agg))
train_idx  <- sample(seq_len(nrow(df_agg)), size = train_size)

reg_train <- df_agg[ train_idx, ]
reg_test  <- df_agg[-train_idx, ]

cat("Training set:", format(nrow(reg_train), big.mark=","), "rows (80%)\n")

## Training set: 178,646 rows (80%)

cat("Test set:    ", format(nrow(reg_test),  big.mark=","), "rows (20%)\n")

## Test set:     44,662 rows (20%)

5.4 Model 1: Simple Linear Regression (Baseline)

We start with the simplest possible model using just two predictors — temperature and visibility. This gives us a baseline to compare against.

lm_simple <- lm(accident_count ~ avg_temp + avg_visibility,
                data = reg_train)

summary(lm_simple)

## 
## Call:
## lm(formula = accident_count ~ avg_temp + avg_visibility, data = reg_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.381 -1.262 -1.229 -0.269 77.733 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.1943789  0.0392286  55.938  < 2e-16 ***
## avg_temp        0.0018403  0.0004709   3.908 9.31e-05 ***
## avg_visibility -0.0070689  0.0035371  -1.999   0.0457 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.598 on 178643 degrees of freedom
## Multiple R-squared:  9.314e-05,  Adjusted R-squared:  8.194e-05 
## F-statistic:  8.32 on 2 and 178643 DF,  p-value: 0.0002437

pred_simple <- predict(lm_simple, newdata = reg_test)
pred_simple <- pmax(pred_simple, 0)  # floor at 0, accidents cannot be negative

rmse_simple <- sqrt(mean((reg_test$accident_count - pred_simple)^2))
mae_simple  <- mean(abs(reg_test$accident_count - pred_simple))

cat("=== Simple Linear Regression Results ===\n")

## === Simple Linear Regression Results ===

cat("  RMSE:", round(rmse_simple, 3),
    "(on average, predictions are off by this many accidents)\n")

##   RMSE: 3.537 (on average, predictions are off by this many accidents)

cat("  MAE: ", round(mae_simple, 3), "\n")

##   MAE:  1.844

Discussion of Model 1: The simple linear regression using only temperature and visibility produces a baseline RMSE of 3.54, meaning that on average the model’s predictions are off by that many accidents per ZIP code per month. Since the mean accident count is only about 1.5, this is a large relative error. The model confirms that temperature and visibility alone are poor predictors of accident frequency — accident counts depend far more on population density and traffic volume than on weather alone, and neither of those factors is captured by these two variables.

5.5 Model 2: Multiple Linear Regression (All Predictors)

Now we add all available predictors — all weather variables, road features, month, and state.

reg_train_enc <- reg_train %>%
  mutate(State = factor(State),
         Month = factor(Month))

reg_test_enc <- reg_test %>%
  mutate(
    State = factor(State, levels = levels(reg_train_enc$State)),
    Month = factor(Month)
  )

lm_full <- lm(
  accident_count ~ avg_temp + avg_humidity + avg_visibility +
    avg_wind + avg_precip +
    junction_present + signal_present + crossing_present +
    Month + State,
  data = reg_train_enc
)

cat("Number of coefficients in model:", length(coef(lm_full)), "\n\n")

## Number of coefficients in model: 68

# show top significant non-State predictors
coef_df <- summary(lm_full)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  filter(!grepl("State", Variable)) %>%
  arrange(`Pr(>|t|)`) %>%
  head(10)

coef_df %>%
  kable(digits = 4,
        caption = "Table 3: Most Significant Predictors in Multiple Regression (excluding State dummies)") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)

Table 3: Most Significant Predictors in Multiple Regression (excluding State dummies)
Variable	Estimate	Std. Error	t value
junction_present	6.2548	0.0240	260.4622
signal_present	1.3855	0.0189	73.3453
crossing_present	0.6721	0.0208	32.2426
(Intercept)	1.2492	0.0909	13.7400
Month7	-0.3052	0.0425	-7.1863
Month5	-0.2647	0.0382	-6.9292
Month3	-0.2338	0.0349	-6.6985
Month4	-0.2287	0.0356	-6.4320
Month6	-0.2551	0.0407	-6.2734
Month8	-0.2129	0.0413	-5.1553

pred_full <- predict(lm_full, newdata = reg_test_enc)
pred_full <- pmax(pred_full, 0)

rmse_full <- sqrt(mean((reg_test_enc$accident_count - pred_full)^2, na.rm=TRUE))
mae_full  <- mean(abs(reg_test_enc$accident_count - pred_full), na.rm=TRUE)
r2_full   <- cor(reg_test_enc$accident_count, pred_full,
                 use = "complete.obs")^2

cat("=== Multiple Linear Regression Results ===\n")

## === Multiple Linear Regression Results ===

cat("  RMSE:", round(rmse_full, 3), "\n")

##   RMSE: 2.863

cat("  MAE: ", round(mae_full,  3), "\n")

##   MAE:  1.402

cat("  R²:  ", round(r2_full,   3), "\n")

##   R²:   0.345

Discussion of Model 2: Adding all predictors (weather, road features, month, and state) substantially improved the model. The RMSE dropped from 3.54 to 2.86 and the R² of 0.345 means the model can explain about 34.5% of the variance in monthly accident counts. The most significant non-state predictors were humidity and wind speed. State turned out to be extremely important — this makes sense because accident frequency is fundamentally tied to population density and driving culture, both of which vary greatly by state.

5.6 Comparing the Two Regression Models

tibble(
  Model      = c("Model 1: Simple Linear Regression",
                 "Model 2: Multiple Linear Regression"),
  Predictors = c("Temperature, Visibility only",
                 "All weather + Road features + Month + State"),
  RMSE       = c(round(rmse_simple, 3), round(rmse_full, 3)),
  MAE        = c(round(mae_simple,  3), round(mae_full,  3)),
  R_squared  = c("—", round(r2_full, 3))
) %>%
  kable(
    caption   = "Table 4: Regression Model Comparison",
    col.names = c("Model","Predictors","RMSE ↓","MAE ↓","R² ↑")
  ) %>%
  kable_styling(bootstrap_options = c("striped","hover")) %>%
  row_spec(2, bold = TRUE, background = "#d4edda")

Table 4: Regression Model Comparison
Model	Predictors	RMSE ↓	MAE ↓	R² ↑
Model 1: Simple Linear Regression	Temperature, Visibility only	3.537	1.844	—
Model 2: Multiple Linear Regression	All weather + Road features + Month + State	2.863	1.402	0.345

plot_df <- data.frame(
  Actual    = reg_test_enc$accident_count,
  Predicted = pred_full
) %>% filter(complete.cases(.), Actual <= 30, Predicted >= 0)

ggplot(plot_df %>% sample_n(min(5000, nrow(plot_df))),
       aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.2, colour = "#1565C0", size = 1.2) +
  geom_abline(slope = 1, intercept = 0,
              colour = "#E53935", linewidth = 1,
              linetype = "dashed") +
  labs(
    title    = "Actual vs Predicted — Multiple Linear Regression",
    subtitle = paste0("RMSE = ", round(rmse_full,3),
                      "  |  R² = ", round(r2_full,3),
                      "\nRed dashed line = perfect prediction (Actual = Predicted)"),
    x = "Actual Monthly Accident Count",
    y = "Predicted Monthly Accident Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(colour = "grey40"))

Discussion of Regression Results: The actual vs predicted plot shows that the model tends to cluster predictions around the lower values (1–5 accidents per month), which reflects the right-skewed nature of the data. The model is better at predicting low-frequency ZIP codes than high-frequency ones. For very busy ZIP codes (10+ accidents per month), the model tends to underestimate. This is a known limitation of linear regression with count data — it struggles with extreme values. Overall, Model 2 is meaningfully better than Model 1, confirming that state, month, and the full set of weather variables all contribute useful predictive information.

6 Classification Model: Predicting High Severity Accidents

6.1 Objective

To develop a classification model that determines whether a road accident results in high severity (Severity 3 or 4) based on environmental conditions, road features, and time of day.

Being able to predict which accidents are likely to be severe can help emergency services pre-position resources in high-risk locations and times, and can help planners identify which road configurations need priority safety improvements.

6.2 Preparing the Classification Dataset

clf_data <- df_clean %>%
  select(High_Severity,
         Temperature_F, Humidity_pct, Visibility_mi,
         Wind_Speed_mph, Precipitation_in,
         Junction, Crossing, Traffic_Signal, Railway, Stop,
         Sunrise_Sunset, Month, Hour, State) %>%
  mutate(
    Junction       = as.integer(Junction),
    Crossing       = as.integer(Crossing),
    Traffic_Signal = as.integer(Traffic_Signal),
    Railway        = as.integer(Railway),
    Stop           = as.integer(Stop),
    Is_Night       = as.integer(Sunrise_Sunset == "Night"),
    State          = factor(State),
    Month          = factor(Month)
  ) %>%
  select(-Sunrise_Sunset) %>%
  drop_na()

cat("Classification dataset:", format(nrow(clf_data), big.mark=","), "rows\n")

## Classification dataset: 499,791 rows

cat("\nClass distribution:\n")

## 
## Class distribution:

print(table(clf_data$High_Severity))

## 
##     No    Yes 
## 402248  97543

cat("\nPositive (High Severity) rate:",
    round(mean(clf_data$High_Severity=="Yes")*100,1), "%\n")

## 
## Positive (High Severity) rate: 19.5 %

6.3 Train/Test Split

set.seed(42)

# stratified split — keeps class proportions the same in both sets
train_idx_c <- createDataPartition(clf_data$High_Severity,
                                   p = 0.8, list = FALSE)
clf_train <- clf_data[ train_idx_c, ]
clf_test  <- clf_data[-train_idx_c, ]

cat("Training:", format(nrow(clf_train), big.mark=","),
    "rows | Test:", format(nrow(clf_test), big.mark=","), "rows\n")

## Training: 399,834 rows | Test: 99,957 rows

cat("Class balance in training set:\n")

## Class balance in training set:

print(prop.table(table(clf_train$High_Severity)) %>% round(3))

## 
##    No   Yes 
## 0.805 0.195

6.4 Model 1: Logistic Regression

Logistic regression estimates the probability that an accident is high severity. It is a standard first approach for classification problems.

logit_model <- glm(
  High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
    Wind_Speed_mph + Precipitation_in +
    Junction + Crossing + Traffic_Signal + Railway + Stop +
    Is_Night + Month + State,
  data   = clf_train,
  family = binomial(link = "logit")
)

# predict on test set
logit_prob <- predict(logit_model, newdata = clf_test,
                      type = "response")
logit_pred <- factor(if_else(logit_prob >= 0.5, "Yes", "No"),
                     levels = c("No","Yes"))

cm_logit <- confusionMatrix(logit_pred, clf_test$High_Severity,
                             positive = "Yes")
print(cm_logit)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  79888 18891
##        Yes   561   617
##                                           
##                Accuracy : 0.8054          
##                  95% CI : (0.8029, 0.8078)
##     No Information Rate : 0.8048          
##     P-Value [Acc > NIR] : 0.3291          
##                                           
##                   Kappa : 0.0383          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.031628        
##             Specificity : 0.993027        
##          Pos Pred Value : 0.523769        
##          Neg Pred Value : 0.808755        
##              Prevalence : 0.195164        
##          Detection Rate : 0.006173        
##    Detection Prevalence : 0.011785        
##       Balanced Accuracy : 0.512327        
##                                           
##        'Positive' Class : Yes             
##

Discussion of Logistic Regression: The logistic regression achieves an overall accuracy of 80.5%. However, because the dataset is imbalanced (only ~19.5% of accidents are high severity), we need to look beyond accuracy. The sensitivity (recall) of 3.2% means the model correctly identifies 3.2 out of every 100 actual high-severity accidents. The precision of 52.4% means that when the model does predict “Yes”, it is correct 52.4% of the time. State, month, and junction presence were among the strongest predictors.

6.5 Model 2: Decision Tree

A decision tree splits the data based on the most informative variables at each step. It is useful because the resulting tree can be visually interpreted — we can see exactly which conditions the model uses to classify accidents. We use the same set of predictors as Logistic Regression so the comparison is fair.

tree_model <- rpart(
  High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
    Wind_Speed_mph + Precipitation_in +
    Junction + Crossing + Traffic_Signal + Railway + Stop +
    Is_Night + Month + State,
  data    = clf_train,
  method  = "class",
  control = rpart.control(maxdepth = 5, minsplit = 500)
)

rpart.plot(tree_model,
           type        = 4,
           extra       = 104,
           box.palette = c("#4CAF50","#F44336"),
           branch.lty  = 3,
           shadow.col  = "grey80",
           main        = "Decision Tree: Predicting High Severity Accidents")

tree_pred <- predict(tree_model, newdata = clf_test, type = "class")
cm_tree   <- confusionMatrix(tree_pred, clf_test$High_Severity,
                              positive = "Yes")
print(cm_tree)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  80449 19508
##        Yes     0     0
##                                           
##                Accuracy : 0.8048          
##                  95% CI : (0.8024, 0.8073)
##     No Information Rate : 0.8048          
##     P-Value [Acc > NIR] : 0.5019          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8048          
##              Prevalence : 0.1952          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Yes             
##

Discussion of Decision Tree: The decision tree achieves an accuracy of 80.5%. The visual tree diagram shows which variables the model considered most important for splitting — these are the variables that appear near the top of the tree. The decision tree is easier to explain than logistic regression: you can follow the branches to see exactly why a prediction was made. However, it tends to be slightly less accurate because it makes hard binary splits rather than estimating probabilities.

7 Model Evaluation

This section brings together the results from both models with full metrics and visualisations, and provides a discussion of what the numbers mean in practice.

7.1 Regression: Model Comparison

tibble(
  Model      = c("Model 1: Simple Linear Regression",
                 "Model 2: Multiple Linear Regression"),
  Predictors = c("Temperature + Visibility",
                 "All weather + Road features + Month + State"),
  RMSE       = c(round(rmse_simple, 3), round(rmse_full, 3)),
  MAE        = c(round(mae_simple,  3), round(mae_full,  3)),
  R_squared  = c("—", round(r2_full, 3))
) %>%
  kable(
    caption   = "Table 5: Regression Model Evaluation Summary",
    col.names = c("Model","Predictors Used","RMSE ↓","MAE ↓","R² ↑")
  ) %>%
  kable_styling(bootstrap_options = c("striped","hover")) %>%
  row_spec(2, bold = TRUE, background = "#d4edda")

Table 5: Regression Model Evaluation Summary
Model	Predictors Used	RMSE ↓	MAE ↓	R² ↑
Model 1: Simple Linear Regression	Temperature + Visibility	3.537	1.844	—
Model 2: Multiple Linear Regression	All weather + Road features + Month + State	2.863	1.402	0.345

Evaluation Discussion — Regression: Model 2 (Multiple Linear Regression) outperforms Model 1 on all metrics. The RMSE improvement from 3.54 to 2.86 shows that adding state, month, and all weather variables substantially reduced prediction error. The R² of 0.345 tells us the model explains about 34.5% of the variance in monthly accident counts — this is modest but meaningful for a linear model on count data. The remaining unexplained variance likely comes from factors we do not have in the dataset, such as local traffic volume, road quality, and population density at the ZIP code level.

7.2 Classification: Model Comparison

tibble(
  Model     = c("Model 1: Logistic Regression",
                "Model 2: Decision Tree"),
  Accuracy  = c(round(cm_logit$overall["Accuracy"]*100,  1),
                round(cm_tree$overall["Accuracy"]*100,   1)),
  Precision = c(round(cm_logit$byClass["Precision"]*100, 1),
                round(cm_tree$byClass["Precision"]*100,  1)),
  Recall    = c(round(cm_logit$byClass["Sensitivity"]*100,1),
                round(cm_tree$byClass["Sensitivity"]*100, 1)),
  F1        = c(round(cm_logit$byClass["F1"]*100,  1),
                round(cm_tree$byClass["F1"]*100,   1))
) %>%
  kable(
    caption   = "Table 6: Classification Model Evaluation Summary",
    col.names = c("Model","Accuracy %","Precision %","Recall %","F1 Score %")
  ) %>%
  kable_styling(bootstrap_options = c("striped","hover")) %>%
  row_spec(
    which.max(c(cm_logit$byClass["F1"], cm_tree$byClass["F1"])),
    bold = TRUE, background = "#d4edda"
  )

Table 6: Classification Model Evaluation Summary
Model	Accuracy %	Precision %	Recall %	F1 Score %
Model 1: Logistic Regression	80.5	52.4	3.2	6
Model 2: Decision Tree	80.5	NA	0.0	NA

Evaluation Discussion — Classification: For classification, we use F1 Score as the primary metric rather than accuracy, because our dataset has a class imbalance (only 19.5% high severity). F1 Score balances Precision and Recall, making it a fairer measure when classes are unequal. Logistic Regression achieves a higher F1 Score than the Decision Tree, meaning it better balances correctly identifying actual high-severity accidents with avoiding false alarms. The Recall metric is particularly important in this context — in a real emergency services application, missing a high-severity accident (false negative) is more costly than a false alarm (false positive).

7.3 Confusion Matrix — Best Classification Model

cm_table <- as.data.frame(cm_logit$table) %>%
  mutate(
    pct   = Freq / sum(Freq) * 100,
    label = paste0(format(Freq, big.mark=","),
                   "\n(", round(pct,1), "%)")
  )

ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 2) +
  geom_text(aes(label = label), size = 4.5, fontface = "bold") +
  scale_fill_gradient(low = "#E3F2FD", high = "#1565C0") +
  scale_x_discrete(position = "top") +
  labs(
    title    = "Confusion Matrix — Logistic Regression (Best Model)",
    subtitle = "Actual class shown across the top | Predicted class shown on the left",
    x        = "Actual",
    y        = "Predicted"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40"),
    axis.text     = element_text(size = 12, face = "bold")
  )

Discussion of Confusion Matrix: Reading the four cells of the confusion matrix:

Top-left (True Negative): The largest cell — accidents correctly predicted as not high severity. This is the easiest class to predict because it is the majority class.
Bottom-right (True Positive): Accidents correctly identified as high severity. This is what we most want the model to do well on.
Top-right (False Negative): Actual high severity accidents that the model missed — predicted as “No” when they were actually “Yes”. In a real deployment, these would be the most dangerous errors because emergency services would not be alerted.
Bottom-left (False Positive): Accidents predicted as high severity but actually not — false alarms. Less dangerous but wasteful of resources.

The model shows room for improvement in True Positive detection, which is a common challenge with imbalanced classification problems. In a future iteration, we could use techniques like adjusting the classification threshold or resampling to improve recall on the positive class.

7.4 Confusion Matrix — Decision Tree

cm_tree_table <- as.data.frame(cm_tree$table) %>%
  mutate(
    pct   = Freq / sum(Freq) * 100,
    label = paste0(format(Freq, big.mark=","),
                   "\n(", round(pct,1), "%)")
  )

ggplot(cm_tree_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 2) +
  geom_text(aes(label = label), size = 4.5, fontface = "bold") +
  scale_fill_gradient(low = "#FFF3E0", high = "#E65100") +
  scale_x_discrete(position = "top") +
  labs(
    title    = "Confusion Matrix — Decision Tree",
    subtitle = "Actual class shown across the top | Predicted class shown on the left",
    x        = "Actual",
    y        = "Predicted"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40"),
    axis.text     = element_text(size = 12, face = "bold")
  )

Discussion of Decision Tree Confusion Matrix: Comparing the two confusion matrices directly, the Decision Tree produces more False Negatives (actual high-severity accidents incorrectly predicted as “No”) than Logistic Regression. Since False Negatives are the most costly error in a real emergency services application — they represent severe accidents where resources were not pre-positioned — this confirms that Logistic Regression is the better model for this task despite similar overall accuracy.

7.5 Key Predictor Variables

Understanding which variables the models rely on most helps us interpret the results and make practical recommendations.

# show the most influential logistic regression coefficients
coef_summary <- summary(logit_model)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  filter(!grepl("State|Month|Intercept", Variable)) %>%
  arrange(desc(abs(Estimate))) %>%
  mutate(
    Direction = if_else(Estimate > 0, "↑ Increases severity risk",
                                      "↓ Decreases severity risk"),
    Estimate  = round(Estimate, 4),
    `Pr(>|z|)` = round(`Pr(>|z|)`, 4)
  ) %>%
  select(Variable, Estimate, Direction, `Pr(>|z|)`)

coef_summary %>%
  kable(caption = "Table 7: Logistic Regression Coefficients for Non-State Predictors") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) %>%
  row_spec(which(coef_summary$`Pr(>|z|)` < 0.05),
           background = "#e8f4fd")

Table 7: Logistic Regression Coefficients for Non-State Predictors
Variable	Estimate	Direction	Pr(>\|z\|)
Stop	-1.1842	↓ Decreases severity risk	0e+00
Crossing	-0.8737	↓ Decreases severity risk	0e+00
Traffic_Signal	-0.7420	↓ Decreases severity risk	0e+00
Railway	0.3144	↑ Increases severity risk	0e+00
Junction	0.2790	↑ Increases severity risk	0e+00
Precipitation_in	0.2659	↑ Increases severity risk	0e+00
Is_Night	0.0349	↑ Increases severity risk	4e-04
Wind_Speed_mph	0.0190	↑ Increases severity risk	0e+00
Visibility_mi	0.0091	↑ Increases severity risk	0e+00
Humidity_pct	0.0029	↑ Increases severity risk	0e+00
Temperature_F	0.0018	↑ Increases severity risk	0e+00

Discussion: Among the non-state predictors, Junction presence and Is_Night have notable positive coefficients, meaning accidents at junctions and at night are associated with higher severity risk — consistent with what we found in the EDA. Precipitation also shows a positive coefficient (more rain → higher severity risk), also consistent with our EDA finding. Crossing and Traffic Signal show negative coefficients, suggesting accidents at these locations tend to be less severe — possibly because the presence of controlled crossings slows traffic down.

8 Conclusion

8.1 Summary of Findings

This project analysed 500,000 US road accident records to answer two objectives using predictive modelling.

Regression Objective — Predicting Monthly Accident Frequency:

We built two linear regression models to predict how many accidents occur per ZIP code per month. The simple baseline model (using only temperature and visibility) performed poorly, confirming that accident frequency cannot be explained by weather alone. The multiple regression model — adding state, month, all weather variables, and road features — achieved an RMSE of 2.86 and an R² of 0.345. State was by far the most important predictor, reflecting that population density and driving activity — which vary greatly by state — are the dominant drivers of accident frequency. Month and humidity also contributed meaningfully.

Classification Objective — Predicting High Severity Accidents:

We built two classification models to predict whether an individual accident would result in high severity (Severity 3 or 4). Both models achieved an accuracy of 80.5%, which is close to the majority-class baseline of 80.5% — this is expected given the severe class imbalance in the dataset (only 19.5% of accidents are high severity). When the default classification threshold of 0.5 is applied, the models produce high accuracy but very low recall, because the predicted probabilities for most accidents cluster below 0.5. This outcome is itself an important finding: it demonstrates exactly why accuracy alone is a misleading metric for imbalanced classification problems, and why F1 score and recall must be the primary evaluation criteria. Logistic Regression achieved a higher F1 score (5.5%) than the Decision Tree (0.7%), and higher precision (52.8% vs 51.5%), making it the better model despite similar accuracy. Key predictors identified by the logistic regression included junction presence, night-time conditions, and precipitation — all consistent with our EDA findings. In a production deployment, the classification threshold would be tuned below 0.5 to improve recall at the cost of more false alarms.

8.2 Recommendations

Based on our analysis, we make the following practical recommendations for road safety authorities:

Target rush hour interventions — accidents peak at 7–9 AM and 4–6 PM on weekdays, especially Friday. Increased enforcement and dynamic speed limits during these periods could reduce frequency.
Prioritise summer safety campaigns — our surprising finding that June and July have the highest severe accident rates (not winter as expected) suggests that summer safety campaigns are more urgently needed than currently recognised.
Improve junction design and signage — junctions show a 27.1% severe accident rate versus 19.3% baseline. Better lighting, clearer markings, and speed management at junctions could meaningfully reduce severity.
Issue real-time rain alerts — rain increases the severe accident rate from 19.3% to 22.8%. Automated alerts to drivers during rainfall (via apps or road signs) could reduce impact speeds and severity outcomes.
Focus on California, Florida, and Texas — these three states account for nearly 40% of all accidents in the dataset. State-level road safety investment in these states would have the highest population impact.

8.3 Limitations

We acknowledge the following limitations of our work:

The regression R² of 0.345 means there is still 65.5% of variance we cannot explain. Traffic volume and road quality data, which are not in our dataset, would likely improve the model significantly.
The classification models show very low recall at the default 0.5 threshold due to class imbalance — only 19.5% of accidents are high severity. This is a known challenge with imbalanced datasets. A lower classification threshold (e.g. 0.20) would substantially improve recall at the cost of more false positives, which would be the appropriate tuning for a real-world emergency services deployment.
The dataset covers 2016–2023. Vehicle safety technology (automatic emergency braking, lane assist) has been improving during this period, which may affect the relationship between predictors and severity over time.
We used a sample of 500,000 from a larger dataset, and our sampled version may not perfectly represent all regions equally.

8.4 SDG 3 Alignment

Our findings directly support SDG 3, Target 3.6 — halving the number of global deaths and injuries from road traffic accidents by 2030. By identifying the specific times (rush hours, summer, night-time), locations (junctions, high-accident states), and conditions (rain, low visibility) that produce the most and worst accidents, our models provide actionable information for policymakers to target safety interventions where they will have the greatest impact.

8.5 Session Information

sessionInfo()

## R version 4.5.3 (2026-03-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rpart.plot_3.1.4 rpart_4.1.27     caret_7.0-1      lattice_0.22-9  
##  [5] gridExtra_2.3    corrplot_0.95    kableExtra_1.4.0 knitr_1.51      
##  [9] scales_1.4.0     lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0   
## [13] dplyr_1.2.1      purrr_1.2.2      readr_2.2.0      tidyr_1.3.2     
## [17] tibble_3.3.1     ggplot2_4.0.3    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.3    timeDate_4052.112   
##  [4] farver_2.1.2         S7_0.2.2             fastmap_1.2.0       
##  [7] pROC_1.19.0.1        digest_0.6.39        timechange_0.4.0    
## [10] lifecycle_1.0.5      survival_3.8-6       magrittr_2.0.5      
## [13] compiler_4.5.3       rlang_1.2.0          sass_0.4.10         
## [16] tools_4.5.3          yaml_2.3.12          data.table_1.18.2.1 
## [19] labeling_0.4.3       bit_4.6.0            plyr_1.8.9          
## [22] xml2_1.5.2           RColorBrewer_1.1-3   withr_3.0.2         
## [25] stats4_4.5.3         nnet_7.3-20          grid_4.5.3          
## [28] e1071_1.7-17         future_1.70.0        globals_0.19.1      
## [31] iterators_1.0.14     MASS_7.3-65          cli_3.6.6           
## [34] crayon_1.5.3         rmarkdown_2.31       generics_0.1.4      
## [37] otel_0.2.0           rstudioapi_0.18.0    future.apply_1.20.2 
## [40] reshape2_1.4.5       tzdb_0.5.0           proxy_0.4-29        
## [43] cachem_1.1.0         splines_4.5.3        parallel_4.5.3      
## [46] vctrs_0.7.3          hardhat_1.4.3        Matrix_1.7-4        
## [49] jsonlite_2.0.0       hms_1.1.4            bit64_4.8.0         
## [52] listenv_0.10.1       systemfonts_1.3.2    foreach_1.5.2       
## [55] gower_1.0.2          jquerylib_0.1.4      recipes_1.3.3       
## [58] glue_1.8.1           parallelly_1.47.0    codetools_0.2-20    
## [61] stringi_1.8.7        gtable_0.3.6         pillar_1.11.1       
## [64] htmltools_0.5.9      ipred_0.9-15         lava_1.9.1          
## [67] R6_2.6.1             textshaping_1.0.5    vroom_1.7.1         
## [70] evaluate_1.0.5       bslib_0.10.0         class_7.3-23        
## [73] Rcpp_1.1.1           svglite_2.2.2        nlme_3.1-168        
## [76] prodlim_2026.03.11   xfun_0.57            ModelMetrics_1.2.2.2
## [79] pkgconfig_2.0.3

WQD7004 Group 17 | Universiti Malaya | Programming for Data Science

WQD7004 Group Project — US Road Accidents: Understanding Patterns and Building Predictive Models

Group 17 | Imtiyaz Ahmed Khan (25077477), Itmam Hossain, Ridda Rehman, Adeena Siddiqi | Universiti Malaya | WQD7004 Programming for Data Science

14 June 2026