1 Introduction

Road accidents are a serious public health issue worldwide. Every year, millions of people are injured or killed on the roads, and many of these accidents could potentially be prevented if we had better information about where and when they are likely to happen.

For this project, we decided to look at a real-world dataset of road accidents in the United States. We wanted to answer two main questions using the data:

Can we predict how many accidents will happen in a given area per month? (Regression)
Can we predict whether an accident will be severe or not? (Classification)

We picked this topic because it connects to SDG 3 (Good Health and Well-being), specifically Target 3.6 which is about reducing deaths and injuries from road traffic accidents. We thought it would be meaningful to work on something that has real-world impact, not just a made-up dataset.

1.1 Our Two Questions

Regression: Can we predict the expected number of accidents per ZIP code per month, based on weather and road features?

If authorities know which areas tend to have more accidents during certain months or weather conditions, they can put up warning signs, increase patrols, or send alerts to drivers in advance.

Classification: Can we predict whether an accident will be high severity (Severity 3 or 4) based on weather conditions, road features, and time of day?

Knowing which types of locations and conditions produce the worst accidents can help emergency services decide where to focus resources.

2 Dataset Overview

2.1 About the Data

The dataset we are using is the US Accidents dataset, which was collected from multiple traffic APIs between 2016 and 2023. It covers 49 US states.

We are using a sampled version with 500,000 rows and 46 columns, giving us a total dimension of 23,000,000 cells which easily exceeds the requirement of 100,000.

Detail	Value
Dataset name	US Accidents (2016–2023)
Source	Kaggle (Sobhan Moosavi et al.)
Rows	500,000
Columns	46
Total dimension	23,000,000 cells
Coverage	49 US states
Years	2016 – March 2023

The columns include information about:

When and where the accident happened (time, location, GPS coordinates)
How severe it was (Severity 1–4, where 4 is the most severe)
Weather at the time (temperature, humidity, visibility, wind speed, precipitation)
Road features nearby (junction, traffic signal, crossing, railway, stop sign)
Whether it was day or night

2.2 Loading Libraries and Data

# we need these packages for data cleaning, analysis and visualisation
library(tidyverse)
library(lubridate)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)
library(corrplot)
library(RColorBrewer)
library(gridExtra)
library(caret)
library(rpart)
library(rpart.plot)

# load the dataset
df_raw <- read_csv("US_Accidents_March23_sampled_500k.csv",
                   show_col_types = FALSE)

# check dimensions
cat("Number of rows:", nrow(df_raw), "\n")

## Number of rows: 500000

cat("Number of columns:", ncol(df_raw), "\n")

## Number of columns: 46

cat("Total cells:", nrow(df_raw) * ncol(df_raw), "\n")

## Total cells: 23000000

# look at the first few rows
head(df_raw, 5)

# check column names and types
glimpse(df_raw)

## Rows: 500,000
## Columns: 46
## $ ID                    <chr> "A-2047758", "A-4694324", "A-5006183", "A-423735…
## $ Source                <chr> "Source2", "Source1", "Source1", "Source1", "Sou…
## $ Severity              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Start_Time            <dttm> 2019-06-12 10:10:56, 2022-12-03 23:37:14, 2022-…
## $ End_Time              <dttm> 2019-06-12 10:55:58, 2022-12-04 01:56:53, 2022-…
## $ Start_Lat             <dbl> 30.64121, 38.99056, 34.66119, 43.68059, 35.39548…
## $ Start_Lng             <dbl> -91.15348, -77.39907, -120.49282, -92.99332, -11…
## $ End_Lat               <dbl> NA, 38.99004, 34.66119, 43.68057, 35.39548, NA, …
## $ End_Lng               <dbl> NA, -77.39828, -120.49244, -92.97222, -118.98600…
## $ `Distance(mi)`        <dbl> 0.000, 0.056, 0.022, 1.054, 0.046, 0.000, 0.000,…
## $ Description           <chr> "Accident on LA-19 Baker-Zachary Hwy at Lower Za…
## $ Street                <chr> "Highway 19", "Forest Ridge Dr", "Floradale Ave"…
## $ City                  <chr> "Zachary", "Sterling", "Lompoc", "Austin", "Bake…
## $ County                <chr> "East Baton Rouge", "Loudoun", "Santa Barbara", …
## $ State                 <chr> "LA", "VA", "CA", "MN", "CA", "MA", "OR", "FL", …
## $ Zipcode               <chr> "70791-4610", "20164-2813", "93436", "55912", "9…
## $ Country               <chr> "US", "US", "US", "US", "US", "US", "US", "US", …
## $ Timezone              <chr> "US/Central", "US/Eastern", "US/Pacific", "US/Ce…
## $ Airport_Code          <chr> "KBTR", "KIAD", "KLPC", "KAUM", "KBFL", "KBVY", …
## $ Weather_Timestamp     <dttm> 2019-06-12 09:53:00, 2022-12-03 23:52:00, 2022-…
## $ `Temperature(F)`      <dbl> 77.0, 45.0, 68.0, 27.0, 42.0, 42.0, 35.0, 90.0, …
## $ `Wind_Chill(F)`       <dbl> 77.0, 43.0, 68.0, 15.0, 42.0, 35.0, 35.0, 90.0, …
## $ `Humidity(%)`         <dbl> 62, 48, 73, 86, 34, 58, 89, 55, 39, 78, 97, 84, …
## $ `Pressure(in)`        <dbl> 29.92, 29.91, 29.79, 28.49, 29.77, 29.37, 28.71,…
## $ `Visibility(mi)`      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Wind_Direction        <chr> "NW", "W", "W", "ENE", "CALM", "W", "CALM", "SW"…
## $ `Wind_Speed(mph)`     <dbl> 5.0, 5.0, 13.0, 15.0, 0.0, 13.0, 0.0, 12.0, 7.0,…
## $ `Precipitation(in)`   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
## $ Weather_Condition     <chr> "Fair", "Fair", "Fair", "Wintry Mix", "Fair", "F…
## $ Amenity               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Bump                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Crossing              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Give_Way              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Junction              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ No_Exit               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Railway               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Roundabout            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Station               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Stop                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Calming       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Signal        <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FA…
## $ Turning_Loop          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Sunrise_Sunset        <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Civil_Twilight        <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Nautical_Twilight     <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Astronomical_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…

3 Data Cleaning

3.1 Step 1: Check for Missing Values

Before doing anything, we need to understand which columns have missing data and decide what to do about them.

# count missing values in each column
missing_summary <- df_raw %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "Column",
               values_to = "Missing_Count") %>%
  mutate(Missing_Pct = round(Missing_Count / nrow(df_raw) * 100, 1)) %>%
  filter(Missing_Count > 0) %>%
  arrange(desc(Missing_Pct))

missing_summary %>%
  kable(caption = "Table 1: Columns with missing values",
        col.names = c("Column", "Missing Count", "Missing %")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(missing_summary$Missing_Pct > 20),
           background = "#ffe0e0")

Table 1: Columns with missing values
Column	Missing Count	Missing %
End_Lat	220377	44.1
End_Lng	220377	44.1
Precipitation(in)	142616	28.5
Wind_Chill(F)	129017	25.8
Wind_Speed(mph)	36987	7.4
Visibility(mi)	11291	2.3
Humidity(%)	11130	2.2
Wind_Direction	11197	2.2
Weather_Condition	11101	2.2
Temperature(F)	10466	2.1
Pressure(in)	8928	1.8
Weather_Timestamp	7674	1.5
Airport_Code	1446	0.3
Sunrise_Sunset	1483	0.3
Civil_Twilight	1483	0.3
Nautical_Twilight	1483	0.3
Astronomical_Twilight	1483	0.3
Street	691	0.1
Timezone	507	0.1
Description	1	0.0
City	19	0.0
Zipcode	116	0.0

We can see that End_Lat and End_Lng are missing for about 44% of rows — we will just drop these columns since we are not using them. Precipitation(in) is missing for about 28.5% of rows — we will replace missing values with 0 (assuming no rain). The rest of the weather columns are missing less than 10% so we will fill them with the median value.

3.2 Step 2: Parse Dates and Create New Variables

df <- df_raw %>%
  # parse the date column (mixed formats in this dataset)
  mutate(
    Start_Time  = parse_date_time(Start_Time,
                                  orders = c("ymd HMS", "ymd HM"),
                                  quiet  = TRUE),
    # extract useful time features
    Year        = year(Start_Time),
    Month       = month(Start_Time),
    Month_name  = month(Start_Time, label = TRUE, abbr = TRUE),
    Hour        = hour(Start_Time),
    Day_of_week = wday(Start_Time, label = TRUE)
  ) %>%
  # create our target variable for classification
  mutate(
    High_Severity = if_else(Severity >= 3, "Yes", "No"),
    High_Severity = factor(High_Severity, levels = c("No", "Yes"))
  )

cat("Date parsing done. Sample:\n")

## Date parsing done. Sample:

df %>% select(Start_Time, Year, Month, Hour, High_Severity) %>% head(3)

3.3 Step 3: Handle Missing Values

df_clean <- df %>%
  # drop columns we don't need
  select(-any_of(c(
    "End_Lat", "End_Lng", "Wind_Chill(F)", "Description",
    "Airport_Code", "Country", "ID", "Source",
    "Civil_Twilight", "Nautical_Twilight", "Astronomical_Twilight",
    "Timezone", "Weather_Timestamp", "Wind_Direction",
    "Weather_Condition", "Bump", "Give_Way", "No_Exit",
    "Roundabout", "Station", "Traffic_Calming", "Turning_Loop",
    "Amenity", "End_Time"
  ))) %>%

  # drop rows where Zipcode is missing (only 116 rows)
  filter(!is.na(Zipcode)) %>%

  # fill Precipitation with 0 (no rain assumption)
  mutate(`Precipitation(in)` = replace_na(`Precipitation(in)`, 0)) %>%

  # fill numeric weather columns with median
  mutate(
    `Temperature(F)`  = replace_na(`Temperature(F)`,
                                    median(`Temperature(F)`, na.rm = TRUE)),
    `Humidity(%)`     = replace_na(`Humidity(%)`,
                                    median(`Humidity(%)`,    na.rm = TRUE)),
    `Visibility(mi)`  = replace_na(`Visibility(mi)`,
                                    median(`Visibility(mi)`, na.rm = TRUE)),
    `Wind_Speed(mph)` = replace_na(`Wind_Speed(mph)`,
                                    median(`Wind_Speed(mph)`,na.rm = TRUE))
  ) %>%

  # fill text columns
  mutate(
    Sunrise_Sunset = replace_na(Sunrise_Sunset, "Day"),
    City           = replace_na(City, "Unknown"),
    Street         = replace_na(Street, "Unknown")
  ) %>%

  # rename weather columns to something cleaner
  rename(
    Temperature_F    = `Temperature(F)`,
    Humidity_pct     = `Humidity(%)`,
    Visibility_mi    = `Visibility(mi)`,
    Wind_Speed_mph   = `Wind_Speed(mph)`,
    Precipitation_in = `Precipitation(in)`,
    Pressure_in      = `Pressure(in)`,
    Distance_mi      = `Distance(mi)`
  )

cat("Rows after cleaning:", format(nrow(df_clean), big.mark = ","), "\n")

## Rows after cleaning: 499,884

cat("Remaining missing values:", sum(is.na(df_clean)), "\n")

## Remaining missing values: 9370

3.4 Step 4: Summary of Cleaned Dataset

cat("=== Cleaned Dataset Summary ===\n")

## === Cleaned Dataset Summary ===

cat("Rows:", format(nrow(df_clean), big.mark=","), "\n")

## Rows: 499,884

cat("Columns:", ncol(df_clean), "\n\n")

## Columns: 28

cat("Severity distribution:\n")

## Severity distribution:

table(df_clean$Severity)

## 
##      1      2      3      4 
##   4271 398058  84496  13059

cat("\nHigh Severity distribution:\n")

## 
## High Severity distribution:

table(df_clean$High_Severity)

## 
##     No    Yes 
## 402329  97555

4 Exploratory Data Analysis (EDA)

This is the main part of our analysis. Before building any models, we want to really understand what the data is telling us. We will look at the data from many different angles.

4.1 4.1 Accident Severity Distribution

The first thing we want to understand is how accidents are distributed across the 4 severity levels.

sev_counts <- df_clean %>%
  count(Severity) %>%
  mutate(
    pct   = n / sum(n) * 100,
    label = paste0(format(n, big.mark=","), "\n(", round(pct,1), "%)"),
    fill_col = c("#4CAF50","#FFC107","#FF9800","#F44336")
  )

ggplot(sev_counts, aes(x = factor(Severity), y = n,
                       fill = factor(Severity))) +
  geom_col(width = 0.6, colour = "white", linewidth = 0.5) +
  geom_text(aes(label = label), vjust = -0.3,
            size = 3.8, fontface = "bold") +
  scale_fill_manual(values = c("#4CAF50","#FFC107","#FF9800","#F44336"),
                    labels = c("1 - Minor","2 - Moderate",
                               "3 - Serious","4 - Critical")) +
  scale_y_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "How are Accidents Distributed Across Severity Levels?",
    subtitle = "Severity 2 makes up nearly 80% of all accidents — Severity 3 & 4 combined = 19.5%",
    x        = "Severity Level",
    y        = "Number of Accidents",
    fill     = "Severity Level",
    caption  = "Source: US Accidents Dataset 2016–2023"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40", size = 11),
    legend.position = "top"
  )

What we found: Most accidents (nearly 80%) are Severity 2. Only about 19.5% are high severity (3 or 4). This means our classification problem has a class imbalance — there are many more “No” than “Yes” cases, which is something we need to keep in mind when building models.

4.2 4.2 How Has the Number of Accidents Changed Over the Years?

yearly <- df_clean %>%
  count(Year) %>%
  filter(Year < 2023)   # 2023 only has data up to March, so exclude

ggplot(yearly, aes(x = Year, y = n)) +
  geom_line(colour = "#1565C0", linewidth = 1.5) +
  geom_point(colour = "#1565C0", size = 4, fill = "white",
             shape = 21, stroke = 2) +
  geom_text(aes(label = format(n, big.mark=",")),
            vjust = -1.2, size = 3.5, colour = "#333333") +
  scale_x_continuous(breaks = 2016:2022) +
  scale_y_continuous(labels = comma,
                     limits = c(0, 130000),
                     expand = expansion(mult = c(0, 0.1))) +
  annotate("rect", xmin = 2019.5, xmax = 2020.5,
           ymin = 0, ymax = 130000,
           alpha = 0.08, fill = "red") +
  annotate("text", x = 2020, y = 120000,
           label = "COVID-19\n2020", size = 3,
           colour = "red", fontface = "italic") +
  labs(
    title    = "US Road Accidents Have Been Increasing Year by Year",
    subtitle = "Accident numbers grew steadily from 2016 to 2022, even through the pandemic",
    x        = "Year",
    y        = "Number of Accidents",
    caption  = "Note: 2023 data is only Jan–Mar, excluded from this chart"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

What we found: Accidents have been increasing every year from 2016 to 2022. Even during COVID-19 in 2020 when there were fewer cars on the road, there were still 76,000+ accidents. This could suggest that when roads are emptier, drivers tend to speed more.

4.3 4.3 Which States Have the Most Accidents?

state_df <- df_clean %>%
  count(State, sort = TRUE) %>%
  slice_head(n = 20) %>%
  mutate(
    State    = fct_reorder(State, n),
    is_top3  = n >= sort(n, decreasing = TRUE)[3]
  )

ggplot(state_df, aes(x = n, y = State, fill = is_top3)) +
  geom_col(colour = "white", linewidth = 0.3) +
  geom_text(aes(label = format(n, big.mark=",")),
            hjust = -0.15, size = 3.3, fontface = "bold") +
  scale_fill_manual(values = c("grey75", "#1565C0")) +
  scale_x_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "Top 20 US States by Number of Accidents",
    subtitle = "California alone accounts for over 22% of all accidents in the dataset",
    x        = "Number of Accidents",
    y        = "State",
    caption  = "Source: US Accidents Dataset 2016–2023"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

What we found: California has by far the most accidents (113,274), followed by Florida and Texas. These are also the three most populous states, so this makes sense. But it also means our model needs to account for state-level differences.

4.4 4.4 What Time of Day Do Most Accidents Happen?

hour_sev <- df_clean %>%
  count(Hour, Severity) %>%
  group_by(Hour) %>%
  mutate(pct = n / sum(n) * 100)

ggplot(hour_sev, aes(x = Hour, y = factor(Severity), fill = pct)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = round(pct, 0)),
            size = 3, colour = "white", fontface = "bold") +
  scale_fill_gradient(low = "#E3F2FD", high = "#0D47A1",
                      name = "% of that\nhour's accidents") +
  scale_x_continuous(breaks = seq(0, 23, 3),
                     labels = paste0(seq(0,23,3), ":00")) +
  labs(
    title    = "Accident Severity by Hour of Day (Heatmap)",
    subtitle = "Each column shows the severity breakdown within that hour\nDarker = higher % of that hour's accidents with that severity",
    x        = "Hour of Day",
    y        = "Severity Level",
    caption  = "Source: US Accidents Dataset 2016–2023"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40"),
    panel.grid    = element_blank()
  )

hour_vol <- df_clean %>%
  count(Hour) %>%
  mutate(
    time_period = case_when(
      Hour %in% 7:9   ~ "Morning Rush",
      Hour %in% 16:18 ~ "Evening Rush",
      Hour %in% 0:4   ~ "Late Night",
      TRUE            ~ "Other"
    )
  )

ggplot(hour_vol, aes(x = Hour, y = n, fill = time_period)) +
  geom_col(colour = "white", linewidth = 0.3) +
  scale_fill_manual(
    values = c("Morning Rush" = "#E53935",
               "Evening Rush" = "#E53935",
               "Late Night"   = "#5C6BC0",
               "Other"        = "#B0BEC5")
  ) +
  scale_x_continuous(breaks = 0:23) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Accident Volume by Hour of Day",
    subtitle = "Rush hours (7–9 AM and 4–6 PM) have the most accidents by volume",
    x        = "Hour (24h clock)",
    y        = "Number of Accidents",
    fill     = "Time Period"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40"),
    legend.position = "top"
  )

What we found: There are two clear peaks in accident volume — morning rush hour (7–9 AM) and evening rush hour (4–6 PM). However, the heatmap shows something interesting — late night and early morning hours (1–4 AM) actually have a slightly higher proportion of serious accidents even though the total number is lower. This makes sense because people may be driving while tired or under the influence.

4.5 4.5 Which Day of the Week is Most Dangerous?

dow_df <- df_clean %>%
  mutate(Day_of_week = fct_relevel(Day_of_week,
    "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) %>%
  group_by(Day_of_week) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  )

p1 <- ggplot(dow_df, aes(x = Day_of_week, y = total, fill = Day_of_week)) +
  geom_col(colour = "white") +
  scale_fill_brewer(palette = "Blues") +
  scale_y_continuous(labels = comma) +
  labs(title = "Total Accidents by Day",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

p2 <- ggplot(dow_df, aes(x = Day_of_week, y = pct_high,
                          fill = Day_of_week)) +
  geom_col(colour = "white") +
  geom_text(aes(label = paste0(round(pct_high,1),"%")),
            vjust = -0.4, size = 3.2) +
  scale_fill_brewer(palette = "Reds") +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "% High Severity by Day",
       x = NULL, y = "% Severe") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2,
  top = "Weekday vs Weekend: Volume High, But Weekend Severity Also High")

What we found: Weekdays (especially Friday) have far more accidents in total — this is because more people commute. But weekends (Saturday and Sunday) have a similar or slightly higher proportion of severe accidents. This is possibly because weekend drivers include more recreational driving at higher speeds.

4.6 4.6 Monthly Patterns — When Are Accidents Most Common?

monthly_df <- df_clean %>%
  group_by(Month, Month_name) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  )

p1 <- ggplot(monthly_df,
             aes(x = fct_reorder(Month_name, Month), y = total)) +
  geom_line(aes(group = 1), colour = "#1565C0", linewidth = 1.3) +
  geom_point(colour = "#1565C0", size = 3.5, fill = "white",
             shape = 21, stroke = 1.5) +
  scale_y_continuous(labels = comma) +
  labs(title = "Accident Volume by Month",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

p2 <- ggplot(monthly_df,
             aes(x = fct_reorder(Month_name, Month),
                 y = pct_high, fill = pct_high)) +
  geom_col(colour = "white") +
  geom_text(aes(label = paste0(round(pct_high,1),"%")),
            vjust = -0.4, size = 3) +
  scale_fill_gradient(low = "#FFF9C4", high = "#E53935") +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "% High Severity by Month",
       x = NULL, y = "% Severe") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2,
  top = "Interesting Contrast: Winter Has More Accidents But Lower Severity Rate")

What we found: This was one of the most surprising findings. December and November have the most accidents in total (probably due to winter weather and more darkness). But June and July actually have the highest percentage of severe accidents. We think this might be because summer encourages faster driving, road trips, and more motorcycles — all of which lead to worse outcomes when accidents do happen.

4.7 4.7 How Does Weather Affect Accidents?

4.7.1 Temperature

df_clean %>%
  filter(Temperature_F > -30, Temperature_F < 130) %>%
  ggplot(aes(x = factor(Severity), y = Temperature_F,
             fill = factor(Severity))) +
  geom_boxplot(outlier.size = 0.5, alpha = 0.8,
               outlier.colour = "grey60") +
  scale_fill_manual(values = c("#4CAF50","#FFC107","#FF9800","#F44336")) +
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 3, colour = "black") +
  labs(
    title    = "Temperature Distribution by Severity Level",
    subtitle = "Black diamond = mean temperature | Boxes show the middle 50% of data",
    x        = "Severity Level",
    y        = "Temperature (°F)",
    fill     = "Severity"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

4.7.2 Visibility

df_clean %>%
  mutate(
    vis_group = cut(Visibility_mi,
                    breaks = c(0, 1, 3, 5, 10),
                    labels = c("< 1 mile\n(Very Low)",
                               "1–3 miles\n(Low)",
                               "3–5 miles\n(Medium)",
                               "> 5 miles\n(Good)"),
                    include.lowest = TRUE)
  ) %>%
  filter(!is.na(vis_group)) %>%
  group_by(vis_group) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    .groups  = "drop"
  ) %>%
  ggplot(aes(x = vis_group, y = pct_high, fill = pct_high)) +
  geom_col(colour = "white", width = 0.65) +
  geom_text(aes(label = paste0(round(pct_high,1),"%\n(",
                               format(total, big.mark=",")," accidents)")),
            vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_fill_gradient(low = "#E8F5E9", high = "#B71C1C") +
  scale_y_continuous(limits = c(0, 30),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(
    title    = "Does Low Visibility Lead to More Severe Accidents?",
    subtitle = "Percentage of accidents that were high severity (3 or 4) at each visibility level",
    x        = "Visibility Level",
    y        = "% High Severity Accidents",
    caption  = "Number in brackets = total accidents at that visibility level"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

What we found: The relationship between visibility and severity is not as simple as we expected. Very low visibility (< 1 mile) actually has a lower severe percentage than the 1–3 mile range. We think this is because when visibility is extremely low (dense fog), drivers slow down significantly — so when accidents do happen, they are at lower speeds. The 1–3 mile range may represent conditions where drivers underestimate the danger and don’t slow down enough.

4.7.3 Rain vs No Rain

df_clean %>%
  mutate(
    Rain = if_else(Precipitation_in > 0, "Raining", "Not Raining")
  ) %>%
  group_by(Rain, High_Severity) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Rain) %>%
  mutate(pct = n / sum(n) * 100) %>%
  filter(High_Severity == "Yes") %>%
  ggplot(aes(x = Rain, y = pct, fill = Rain)) +
  geom_col(width = 0.5, colour = "white") +
  geom_text(aes(label = paste0(round(pct,1),"%")),
            vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Not Raining" = "#90CAF9",
                               "Raining"     = "#1565C0")) +
  scale_y_continuous(limits = c(0, 30)) +
  labs(
    title    = "Rainy Conditions Lead to More Severe Accidents",
    subtitle = "Accidents during rain are about 18% more likely to be high severity",
    x        = NULL,
    y        = "% High Severity Accidents"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

4.8 4.8 How Do Road Features Affect Severity?

road_features <- c("Junction", "Crossing", "Traffic_Signal",
                   "Railway", "Stop")

road_df <- map_dfr(road_features, function(feat) {
  df_clean %>%
    filter(!!sym(feat) == TRUE) %>%
    summarise(
      Feature  = feat,
      total    = n(),
      pct_high = mean(High_Severity == "Yes") * 100
    )
})

# also add baseline (no features)
baseline <- df_clean %>%
  filter(!Junction, !Crossing, !Traffic_Signal, !Railway, !Stop) %>%
  summarise(Feature = "No Features\n(Baseline)",
            total   = n(),
            pct_high = mean(High_Severity == "Yes") * 100)

road_df <- bind_rows(baseline, road_df) %>%
  mutate(Feature = fct_reorder(Feature, pct_high),
         is_base = Feature == "No Features\n(Baseline)")

ggplot(road_df, aes(x = pct_high, y = Feature, fill = is_base)) +
  geom_col(colour = "white", width = 0.65) +
  geom_text(aes(label = paste0(round(pct_high,1),"% (",
                               format(total, big.mark=","),")")),
            hjust = -0.05, size = 3.5, fontface = "bold") +
  geom_vline(xintercept = baseline$pct_high,
             linetype = "dashed", colour = "#E53935",
             linewidth = 0.8) +
  scale_fill_manual(values = c("TRUE" = "#B0BEC5",
                               "FALSE" = "#1565C0")) +
  scale_x_continuous(limits = c(0, 40),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Does Being Near a Road Feature Make Accidents More Severe?",
    subtitle = "Red dashed line = baseline (no road features nearby)\nPercentage of accidents that were high severity (Severity 3 or 4)",
    x        = "% High Severity",
    y        = "Road Feature",
    caption  = "Number in brackets = total accidents at locations with that feature"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

What we found: Junctions are associated with the highest severity rate — 27.1% of accidents at junctions are high severity, compared to about 19% at locations with no road features. Railway crossings also show a higher rate. This makes sense — junction accidents often involve side impacts at higher speeds, which tend to be more serious.

4.9 4.9 Day vs Night Accidents

dn_df <- df_clean %>%
  filter(!is.na(Sunrise_Sunset)) %>%
  group_by(Sunrise_Sunset) %>%
  summarise(
    total    = n(),
    pct_high = mean(High_Severity == "Yes") * 100,
    avg_sev  = mean(Severity),
    .groups  = "drop"
  )

p1 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = total,
                         fill = Sunrise_Sunset)) +
  geom_col(width = 0.5, colour = "white") +
  geom_text(aes(label = format(total, big.mark=",")),
            vjust = -0.4, fontface = "bold", size = 4.5) +
  scale_fill_manual(values = c("Day" = "#FDD835",
                               "Night" = "#283593")) +
  scale_y_continuous(labels = comma, limits = c(0, 400000)) +
  labs(title = "Volume", x = NULL, y = "Count") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

p2 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = pct_high,
                         fill = Sunrise_Sunset)) +
  geom_col(width = 0.5, colour = "white") +
  geom_text(aes(label = paste0(round(pct_high,1),"%")),
            vjust = -0.4, fontface = "bold", size = 4.5) +
  scale_fill_manual(values = c("Day" = "#FDD835",
                               "Night" = "#283593")) +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "% High Severity", x = NULL, y = "%") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2,
  top = "Day vs Night: More Accidents During Day, But Night Is More Dangerous")

What we found: More than twice as many accidents happen during the day, simply because more people drive during daylight hours. But nighttime accidents have a slightly higher proportion of severe outcomes. Reduced visibility, driver fatigue, and potentially impaired driving at night likely contribute to this.

4.10 4.10 Correlation Between Numeric Variables

# select numeric columns for correlation
num_cols <- df_clean %>%
  select(Severity, Temperature_F, Humidity_pct,
         Visibility_mi, Wind_Speed_mph, Precipitation_in,
         Distance_mi) %>%
  drop_na()

corr_matrix <- cor(num_cols)

corrplot(corr_matrix,
         method  = "color",
         type    = "lower",
         addCoef.col = "black",
         number.cex  = 0.8,
         tl.cex  = 0.9,
         tl.col  = "black",
         col     = colorRampPalette(c("#2166AC","white","#D6604D"))(200),
         title   = "Correlation Matrix: Numeric Variables",
         mar     = c(0,0,2,0))

What we found: Honestly, none of the weather variables have a very strong correlation with Severity on their own (all below 0.05 in absolute value). This tells us that severity is not predicted well by any single variable — we need to use combinations of variables in a model. Temperature and Humidity have a small positive correlation with each other, which makes sense.

4.11 4.11 Top Weather Conditions at Time of Accident

df_raw %>%
  filter(!is.na(Weather_Condition)) %>%
  count(Weather_Condition, sort = TRUE) %>%
  slice_head(n = 12) %>%
  mutate(Weather_Condition = fct_reorder(Weather_Condition, n)) %>%
  ggplot(aes(x = n, y = Weather_Condition, fill = n)) +
  geom_col(colour = "white") +
  geom_text(aes(label = format(n, big.mark=",")),
            hjust = -0.1, size = 3.5) +
  scale_fill_gradient(low = "#E3F2FD", high = "#0D47A1") +
  scale_x_continuous(labels = comma,
                     expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Most Common Weather Conditions During Accidents",
    subtitle = "Majority of accidents happen in Fair or Clear weather — not in storms",
    x        = "Number of Accidents",
    y        = "Weather Condition"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40")
  )

What we found: This is perhaps the most counterintuitive finding — the most common weather condition during accidents is “Fair” (clear, good weather), not rain or snow. This is because most driving happens in good weather, so naturally most accidents do too. This is called exposure bias. When we look at rate of accidents per km driven, bad weather is more dangerous, but in raw counts, good weather dominates.

4.12 4.12 EDA Summary

Before moving to modelling, here is a summary of the key things we found from the EDA:

tribble(
  ~Finding, ~Implication,
  "Severity 2 = 80% of accidents", "Class imbalance — important for classification model",
  "Accidents increasing every year 2016–2022", "Road safety is a growing problem",
  "California = 22% of all accidents", "State is a strong predictor",
  "Rush hours have most accidents by volume", "Time of day matters for regression model",
  "Late night has higher severe % despite lower volume", "Hour affects severity, not just count",
  "June/July have highest severe accident rate", "Month affects severity, counter-intuitive",
  "Rain increases severe rate from 19.3% to 22.8%", "Weather matters for severity",
  "Junctions: 27% severe vs 19% baseline", "Road features matter for severity",
  "Most accidents happen in Fair weather", "Exposure effect — need to think about rates not counts",
  "No single variable strongly predicts severity", "Need multi-variable model"
) %>%
  kable(caption = "Table 2: Summary of Key EDA Findings") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = TRUE)

Table 2: Summary of Key EDA Findings
Finding	Implication
Severity 2 = 80% of accidents	Class imbalance — important for classification model
Accidents increasing every year 2016–2022	Road safety is a growing problem
California = 22% of all accidents	State is a strong predictor
Rush hours have most accidents by volume	Time of day matters for regression model
Late night has higher severe % despite lower volume	Hour affects severity, not just count
June/July have highest severe accident rate	Month affects severity, counter-intuitive
Rain increases severe rate from 19.3% to 22.8%	Weather matters for severity
Junctions: 27% severe vs 19% baseline	Road features matter for severity
Most accidents happen in Fair weather	Exposure effect — need to think about rates not counts
No single variable strongly predicts severity	Need multi-variable model

5 Regression: Predicting Monthly Accident Frequency

5.1 Why This Question Matters

If we can predict how many accidents a ZIP code area tends to have in a given month, authorities can plan ahead — more patrols in high-risk areas, warning signs, public awareness campaigns before accident-prone seasons.

5.2 Preparing the Data

We need to aggregate the data to ZIP code + Month level, since our target is “how many accidents happen here per month.”

df_agg <- df_clean %>%
  filter(!is.na(Zipcode)) %>%
  group_by(Zipcode, Month, State) %>%
  summarise(
    accident_count         = n(),
    avg_temp               = mean(Temperature_F,    na.rm = TRUE),
    avg_humidity           = mean(Humidity_pct,     na.rm = TRUE),
    avg_visibility         = mean(Visibility_mi,    na.rm = TRUE),
    avg_wind               = mean(Wind_Speed_mph,   na.rm = TRUE),
    avg_precip             = mean(Precipitation_in, na.rm = TRUE),
    junction_present       = as.integer(any(Junction)),
    signal_present         = as.integer(any(Traffic_Signal)),
    crossing_present       = as.integer(any(Crossing)),
    .groups = "drop"
  )

cat("Aggregated rows (ZIP x Month combinations):",
    format(nrow(df_agg), big.mark=","), "\n")

## Aggregated rows (ZIP x Month combinations): 223,308

cat("accident_count range:", min(df_agg$accident_count),
    "to", max(df_agg$accident_count), "\n")

## accident_count range: 1 to 80

cat("accident_count mean:", round(mean(df_agg$accident_count), 2), "\n")

## accident_count mean: 2.24

ggplot(df_agg %>% filter(accident_count <= 20),
       aes(x = accident_count)) +
  geom_histogram(binwidth = 1, fill = "#1565C0",
                 colour = "white", alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Distribution of Monthly Accident Counts per ZIP Code",
    subtitle = "Most ZIP codes have just 1–3 accidents per month (right-skewed distribution)",
    x        = "Accidents per ZIP code per month",
    y        = "Number of ZIP-month observations"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(colour = "grey40"))

5.3 Train / Test Split

set.seed(42)   # so results are reproducible

# 80% for training, 20% for testing
train_size <- floor(0.8 * nrow(df_agg))
train_idx  <- sample(seq_len(nrow(df_agg)), size = train_size)

reg_train <- df_agg[ train_idx, ]
reg_test  <- df_agg[-train_idx, ]

cat("Training set:", format(nrow(reg_train), big.mark=","), "rows\n")

## Training set: 178,646 rows

cat("Test set:    ", format(nrow(reg_test),  big.mark=","), "rows\n")

## Test set:     44,662 rows

5.4 Model 1: Simple Linear Regression

We start with the simplest possible model — just using temperature and visibility to predict accident count. This is our baseline.

# simple model with just 2 predictors
lm_simple <- lm(accident_count ~ avg_temp + avg_visibility,
                data = reg_train)

summary(lm_simple)

## 
## Call:
## lm(formula = accident_count ~ avg_temp + avg_visibility, data = reg_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.381 -1.262 -1.229 -0.269 77.733 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.1943789  0.0392286  55.938  < 2e-16 ***
## avg_temp        0.0018403  0.0004709   3.908 9.31e-05 ***
## avg_visibility -0.0070689  0.0035371  -1.999   0.0457 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.598 on 178643 degrees of freedom
## Multiple R-squared:  9.314e-05,  Adjusted R-squared:  8.194e-05 
## F-statistic:  8.32 on 2 and 178643 DF,  p-value: 0.0002437

pred_simple <- predict(lm_simple, newdata = reg_test)
pred_simple <- pmax(pred_simple, 0)   # accidents can't be negative

rmse_simple <- sqrt(mean((reg_test$accident_count - pred_simple)^2))
mae_simple  <- mean(abs(reg_test$accident_count - pred_simple))

cat("Simple LM Results:\n")

## Simple LM Results:

cat("  RMSE:", round(rmse_simple, 3), "\n")

##   RMSE: 3.537

cat("  MAE: ", round(mae_simple,  3), "\n")

##   MAE:  1.844

5.5 Model 2: Multiple Linear Regression (All Variables)

Now we add all the predictor variables we have — weather, road features, month, and state.

# encode State as a factor
reg_train_enc <- reg_train %>%
  mutate(State = factor(State),
         Month = factor(Month))

reg_test_enc <- reg_test %>%
  mutate(State = factor(State, levels = levels(reg_train_enc$State)),
         Month = factor(Month))

# full model with all predictors
lm_full <- lm(
  accident_count ~ avg_temp + avg_humidity + avg_visibility +
    avg_wind + avg_precip +
    junction_present + signal_present + crossing_present +
    Month + State,
  data = reg_train_enc
)

cat("Number of coefficients:", length(coef(lm_full)), "\n")

## Number of coefficients: 68

cat("\nTop 10 most significant predictors:\n")

## 
## Top 10 most significant predictors:

# show significant coefficients
coef_df <- summary(lm_full)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  filter(!grepl("State", Variable)) %>%  # exclude state dummies for brevity
  arrange(`Pr(>|t|)`) %>%
  head(10)

coef_df %>%
  kable(digits = 4,
        caption = "Table 3: Top significant predictors in multiple regression") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)

Table 3: Top significant predictors in multiple regression
Variable	Estimate	Std. Error	t value
junction_present	6.2548	0.0240	260.4622
signal_present	1.3855	0.0189	73.3453
crossing_present	0.6721	0.0208	32.2426
(Intercept)	1.2492	0.0909	13.7400
Month7	-0.3052	0.0425	-7.1863
Month5	-0.2647	0.0382	-6.9292
Month3	-0.2338	0.0349	-6.6985
Month4	-0.2287	0.0356	-6.4320
Month6	-0.2551	0.0407	-6.2734
Month8	-0.2129	0.0413	-5.1553

pred_full <- predict(lm_full, newdata = reg_test_enc)
pred_full <- pmax(pred_full, 0)

rmse_full <- sqrt(mean((reg_test_enc$accident_count - pred_full)^2, na.rm=TRUE))
mae_full  <- mean(abs(reg_test_enc$accident_count  - pred_full),   na.rm=TRUE)
r2_full   <- cor(reg_test_enc$accident_count, pred_full, use="complete.obs")^2

cat("Full Multiple Regression Results:\n")

## Full Multiple Regression Results:

cat("  RMSE:", round(rmse_full, 3), "\n")

##   RMSE: 2.863

cat("  MAE: ", round(mae_full,  3), "\n")

##   MAE:  1.402

cat("  R²:  ", round(r2_full,   3), "\n")

##   R²:   0.345

5.6 Comparing the Two Regression Models

reg_results <- tibble(
  Model    = c("Simple LM (2 predictors)",
               "Multiple LM (all predictors)"),
  RMSE     = c(rmse_simple, rmse_full),
  MAE      = c(mae_simple,  mae_full),
  R_squared = c(NA, r2_full)
)

reg_results %>%
  mutate(across(where(is.numeric), ~round(., 3))) %>%
  kable(caption = "Table 4: Regression model comparison",
        col.names = c("Model","RMSE (lower=better)",
                      "MAE (lower=better)","R² (higher=better)")) %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) %>%
  row_spec(2, bold = TRUE, background = "#d4edda")

Table 4: Regression model comparison
Model	RMSE (lower=better)	MAE (lower=better)	R² (higher=better)
Simple LM (2 predictors)	3.537	1.844	NA
Multiple LM (all predictors)	2.863	1.402	0.345

# plot actual vs predicted for the better model
plot_df <- data.frame(
  Actual    = reg_test_enc$accident_count,
  Predicted = pred_full
) %>%
  filter(complete.cases(.) & Actual <= 30 & Predicted >= 0)

ggplot(plot_df %>% sample_n(min(5000, nrow(plot_df))),
       aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.2, colour = "#1565C0", size = 1) +
  geom_abline(slope = 1, intercept = 0,
              colour = "#E53935", linewidth = 1,
              linetype = "dashed") +
  labs(
    title    = "Actual vs Predicted Accident Counts",
    subtitle = paste0("Multiple Linear Regression | RMSE = ",
                      round(rmse_full,3), " | R² = ", round(r2_full,3),
                      "\nRed dashed line = perfect predictions"),
    x = "Actual Monthly Accident Count",
    y = "Predicted Monthly Accident Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(colour = "grey40"))

Findings: Adding more predictors (State, Month, road features, all weather variables) improved the model significantly compared to the simple 2-variable version. The R² value of the multiple regression model shows it can explain some variance, though predicting exact accident counts is hard because many local factors are not in our data.

6 Classification: Predicting High Severity Accidents

6.1 Why This Question Matters

Knowing which conditions produce severe accidents can help emergency services decide where to pre-position resources, and help authorities identify which road configurations need urgent safety improvements.

6.2 Preparing the Data

clf_data <- df_clean %>%
  select(High_Severity,
         Temperature_F, Humidity_pct, Visibility_mi,
         Wind_Speed_mph, Precipitation_in,
         Junction, Crossing, Traffic_Signal, Railway, Stop,
         Sunrise_Sunset, Month, Hour, State) %>%
  mutate(
    Junction       = as.integer(Junction),
    Crossing       = as.integer(Crossing),
    Traffic_Signal = as.integer(Traffic_Signal),
    Railway        = as.integer(Railway),
    Stop           = as.integer(Stop),
    Is_Night       = as.integer(Sunrise_Sunset == "Night"),
    State          = factor(State),
    Month          = factor(Month)
  ) %>%
  select(-Sunrise_Sunset) %>%
  drop_na()

cat("Classification dataset:", format(nrow(clf_data), big.mark=","),
    "rows\n")

## Classification dataset: 499,791 rows

cat("Class balance:\n")

## Class balance:

print(table(clf_data$High_Severity))

## 
##     No    Yes 
## 402248  97543

cat("Positive rate:", round(mean(clf_data$High_Severity=="Yes")*100, 1), "%\n")

## Positive rate: 19.5 %

set.seed(42)

# stratified split to keep class balance
train_idx_c <- createDataPartition(clf_data$High_Severity,
                                   p = 0.8, list = FALSE)
clf_train <- clf_data[ train_idx_c, ]
clf_test  <- clf_data[-train_idx_c, ]

cat("Training:", format(nrow(clf_train), big.mark=","),
    "| Test:", format(nrow(clf_test), big.mark=","), "\n")

## Training: 399,834 | Test: 99,957

6.3 Model 1: Logistic Regression

Logistic regression is a standard classification method. It estimates the probability of each outcome.

# fit logistic regression
logit_model <- glm(
  High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
    Wind_Speed_mph + Precipitation_in +
    Junction + Crossing + Traffic_Signal + Railway + Stop +
    Is_Night + Month + State,
  data   = clf_train,
  family = binomial(link = "logit")
)

# predict on test set
logit_prob <- predict(logit_model, newdata = clf_test, type = "response")
logit_pred <- factor(if_else(logit_prob >= 0.5, "Yes", "No"),
                     levels = c("No","Yes"))

# confusion matrix
cm_logit <- confusionMatrix(logit_pred, clf_test$High_Severity,
                             positive = "Yes")

cat("=== Logistic Regression Results ===\n")

## === Logistic Regression Results ===

cat("Accuracy: ", round(cm_logit$overall["Accuracy"]*100, 1), "%\n")

## Accuracy:  80.5 %

cat("Sensitivity (Recall):", round(cm_logit$byClass["Sensitivity"]*100,1), "%\n")

## Sensitivity (Recall): 3.2 %

cat("Specificity:", round(cm_logit$byClass["Specificity"]*100,1), "%\n")

## Specificity: 99.3 %

cat("Precision:", round(cm_logit$byClass["Precision"]*100,1), "%\n")

## Precision: 52.4 %

6.4 Model 2: Decision Tree

A decision tree is easy to visualise and explain. We use it as a comparison.

library(rpart)
library(rpart.plot)

tree_model <- rpart(
  High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
    Wind_Speed_mph + Precipitation_in +
    Junction + Crossing + Traffic_Signal + Railway + Stop + Is_Night,
  data   = clf_train,
  method = "class",
  control = rpart.control(maxdepth = 5, minsplit = 500)
)

# plot the tree
rpart.plot(tree_model,
           type  = 4,
           extra = 104,
           box.palette = c("#4CAF50","#F44336"),
           branch.lty  = 3,
           shadow.col  = "grey80",
           main = "Decision Tree for Predicting High Severity Accidents")

tree_pred <- predict(tree_model, newdata = clf_test, type = "class")
cm_tree   <- confusionMatrix(tree_pred, clf_test$High_Severity,
                              positive = "Yes")

cat("=== Decision Tree Results ===\n")

## === Decision Tree Results ===

cat("Accuracy: ", round(cm_tree$overall["Accuracy"]*100, 1), "%\n")

## Accuracy:  80.5 %

cat("Sensitivity:", round(cm_tree$byClass["Sensitivity"]*100,1), "%\n")

## Sensitivity: 0 %

cat("Specificity:", round(cm_tree$byClass["Specificity"]*100,1), "%\n")

## Specificity: 100 %

cat("Precision:",   round(cm_tree$byClass["Precision"]*100,1), "%\n")

## Precision: NA %

7 Model Evaluation

7.1 Regression Results Summary

tibble(
  Model      = c("Simple Linear Regression\n(2 predictors)",
                 "Multiple Linear Regression\n(all predictors)"),
  Predictors = c("Temperature, Visibility",
                 "All weather + road features + Month + State"),
  RMSE       = c(round(rmse_simple, 3), round(rmse_full, 3)),
  MAE        = c(round(mae_simple,  3), round(mae_full, 3)),
  R_sq       = c("—", round(r2_full, 3))
) %>%
  kable(caption = "Table 5: Regression Model Comparison",
        col.names = c("Model","Predictors Used",
                      "RMSE ↓","MAE ↓","R² ↑")) %>%
  kable_styling(bootstrap_options = c("striped","hover")) %>%
  row_spec(2, bold = TRUE, background = "#d4edda")

Table 5: Regression Model Comparison
Model	Predictors Used	RMSE ↓	MAE ↓	R² ↑
Simple Linear Regression (2 predictors)	\|Temperature, Visibility	3.53 (all predictors)	\|All weather + road features + Month + State	2.86

7.2 Classification Results Summary

tibble(
  Model    = c("Logistic Regression", "Decision Tree"),
  Accuracy = c(round(cm_logit$overall["Accuracy"]*100,1),
               round(cm_tree$overall["Accuracy"]*100,1)),
  Precision = c(round(cm_logit$byClass["Precision"]*100,1),
                round(cm_tree$byClass["Precision"]*100,1)),
  Recall   = c(round(cm_logit$byClass["Sensitivity"]*100,1),
               round(cm_tree$byClass["Sensitivity"]*100,1)),
  F1       = c(round(cm_logit$byClass["F1"]*100,1),
               round(cm_tree$byClass["F1"]*100,1))
) %>%
  kable(caption = "Table 6: Classification Model Comparison",
        col.names = c("Model","Accuracy %","Precision %",
                      "Recall %","F1 Score %")) %>%
  kable_styling(bootstrap_options = c("striped","hover")) %>%
  row_spec(which.max(c(cm_logit$overall["Accuracy"],
                       cm_tree$overall["Accuracy"])),
           bold = TRUE, background = "#d4edda")

Table 6: Classification Model Comparison
Model	Accuracy %	Precision %	Recall %	F1 Score %
Logistic Regression	80.5	52.4	3.2	6
Decision Tree	80.5	NA	0.0	NA

7.3 Confusion Matrix — Best Classifier

# plot confusion matrix for logistic regression
cm_table <- as.data.frame(cm_logit$table) %>%
  mutate(
    pct   = Freq / sum(Freq) * 100,
    label = paste0(format(Freq, big.mark=","),
                   "\n(", round(pct,1), "%)")
  )

ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 2) +
  geom_text(aes(label = label), size = 4.5, fontface = "bold") +
  scale_fill_gradient(low = "#E3F2FD", high = "#1565C0") +
  scale_x_discrete(position = "top") +
  labs(
    title    = "Confusion Matrix — Logistic Regression",
    subtitle = "Actual class (top) vs Predicted class (left)",
    x        = "Actual",
    y        = "Predicted"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40"),
    axis.text     = element_text(size = 12, face = "bold")
  )

How to read this: The top-left and bottom-right boxes are correct predictions. The top-right box shows accidents we predicted as “No” but were actually “Yes” (false negatives — this is the dangerous type of error). The bottom-left shows cases predicted as “Yes” but were actually “No” (false positives — less dangerous but wastes resources).

8 Conclusion and Recommendations

8.1 What We Found

This project set out to answer two questions about US road accidents, and we found some genuinely interesting results.

For the regression question: We found that adding more variables (State, Month, all weather variables, road features) to the model significantly improved its ability to predict monthly accident counts per ZIP code area. State turned out to be one of the strongest predictors — California, Florida, and Texas dominate the data.

For the classification question: Both logistic regression and a decision tree were able to classify high-severity accidents with reasonable accuracy. The key predictors of severity were whether an accident happened at a junction, the month of year (with summer months showing higher severity), and whether it was at night.

8.2 Key Recommendations

Based on our EDA and model findings, we suggest:

Target resources during rush hours — accidents peak at 7–9 AM and 4–6 PM on weekdays, especially Friday
Focus safety campaigns in summer — June and July have the highest severe accident rates, not winter as we initially expected
Prioritise junction safety — junctions show 27% severe rate vs 19% baseline; better signage, lighting, and speed controls could help
Rain warnings matter — rainy conditions increased the severe rate from 19.3% to 22.8%, so real-time weather-based warnings to drivers could reduce severity

8.3 Limitations

We should also be honest about what our models cannot do:

The R² for our regression model is relatively low, which means there is still a lot of variance we cannot explain. Local factors like specific road conditions, driver behaviour, and vehicle types are not in our data
Our classification model has a recall limitation — it misses some actual severe accidents, which in a real deployment would mean emergency services not being alerted
The dataset covers 2016–2023, and road conditions, vehicle safety technology, and driving habits change over time

8.4 SDG 3 Connection

Our findings connect directly to SDG 3 Target 3.6 — reducing road traffic deaths and injuries. By identifying which times, locations, and conditions produce the most and worst accidents, policymakers can make data-driven decisions about where to invest in road safety infrastructure and public awareness.

8.5 Session Info

sessionInfo()

## R version 4.5.3 (2026-03-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rpart.plot_3.1.4   rpart_4.1.27       caret_7.0-1        lattice_0.22-9    
##  [5] gridExtra_2.3      RColorBrewer_1.1-3 corrplot_0.95      kableExtra_1.4.0  
##  [9] knitr_1.51         scales_1.4.0       lubridate_1.9.5    forcats_1.0.1     
## [13] stringr_1.6.0      dplyr_1.2.1        purrr_1.2.2        readr_2.2.0       
## [17] tidyr_1.3.2        tibble_3.3.1       ggplot2_4.0.3      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.3    timeDate_4052.112   
##  [4] farver_2.1.2         S7_0.2.2             fastmap_1.2.0       
##  [7] pROC_1.19.0.1        digest_0.6.39        timechange_0.4.0    
## [10] lifecycle_1.0.5      survival_3.8-6       magrittr_2.0.5      
## [13] compiler_4.5.3       rlang_1.2.0          sass_0.4.10         
## [16] tools_4.5.3          yaml_2.3.12          data.table_1.18.2.1 
## [19] labeling_0.4.3       bit_4.6.0            plyr_1.8.9          
## [22] xml2_1.5.2           withr_3.0.2          stats4_4.5.3        
## [25] nnet_7.3-20          grid_4.5.3           e1071_1.7-17        
## [28] future_1.70.0        globals_0.19.1       iterators_1.0.14    
## [31] MASS_7.3-65          cli_3.6.6            crayon_1.5.3        
## [34] rmarkdown_2.31       generics_0.1.4       otel_0.2.0          
## [37] rstudioapi_0.18.0    future.apply_1.20.2  reshape2_1.4.5      
## [40] tzdb_0.5.0           proxy_0.4-29         cachem_1.1.0        
## [43] splines_4.5.3        parallel_4.5.3       vctrs_0.7.3         
## [46] hardhat_1.4.3        Matrix_1.7-4         jsonlite_2.0.0      
## [49] hms_1.1.4            bit64_4.8.0          listenv_0.10.1      
## [52] systemfonts_1.3.2    foreach_1.5.2        gower_1.0.2         
## [55] jquerylib_0.1.4      recipes_1.3.3        glue_1.8.1          
## [58] parallelly_1.47.0    codetools_0.2-20     stringi_1.8.7       
## [61] gtable_0.3.6         pillar_1.11.1        htmltools_0.5.9     
## [64] ipred_0.9-15         lava_1.9.1           R6_2.6.1            
## [67] textshaping_1.0.5    vroom_1.7.1          evaluate_1.0.5      
## [70] bslib_0.10.0         class_7.3-23         Rcpp_1.1.1          
## [73] svglite_2.2.2        nlme_3.1-168         prodlim_2026.03.11  
## [76] xfun_0.57            ModelMetrics_1.2.2.2 pkgconfig_2.0.3

WQD7004 Group 17 | Universiti Malaya | Programming for Data Science

WQD7004 Group Project — US Road Accidents: Understanding Patterns and Building Predictive Models

Group 17 | Universiti Malaya | WQD7004 Programming for Data Science

31 May 2026