Road accidents are a serious public health issue worldwide. Every year, millions of people are injured or killed on the roads, and many of these accidents could potentially be prevented if we had better information about where and when they are likely to happen.
For this project, we decided to look at a real-world dataset of road accidents in the United States. We chose this topic because it connects directly to SDG 3 (Good Health and Well-being), specifically Target 3.6 which calls for halving the number of deaths and injuries from road traffic accidents by 2030. We thought it would be meaningful to work on something with real-world impact rather than a simulated dataset.
The two main objectives of this project are:
Regression: To predict the expected monthly accident frequency across ZIP code areas based on weather conditions, road infrastructure, and temporal factors.
Classification: To develop a classification model that determines whether a road accident results in high severity (Severity 3 or 4) based on environmental conditions, road features, and time of day.
| SDG Target | How Our Project Contributes |
|---|---|
| 3.6 — Halve road traffic deaths | Our models help identify when and where accidents are most likely and most severe, enabling proactive safety measures |
| 3.d — Strengthen health risk management | Severity predictions can help emergency services allocate resources more effectively |
| 3.8 — Universal health coverage | Faster and better-targeted emergency response improves survival outcomes |
The dataset we are using is the US Accidents dataset, which was collected from multiple real-time traffic APIs across the United States between February 2016 and March 2023. It covers 49 US states and is one of the largest publicly available accident datasets.
We are using a sampled version with 500,000 rows and 46 columns, giving us a total dimension of 23,000,000 cells which easily exceeds the minimum requirement of 100,000.
| Detail | Value |
|---|---|
| Dataset name | US Accidents (2016–2023) |
| Source | Kaggle — Sobhan Moosavi et al. |
| Link | https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents |
| Rows | 500,000 |
| Columns | 46 |
| Total dimension | 23,000,000 cells |
| Coverage | 49 US states |
| Years | February 2016 – March 2023 |
The key column groups in the dataset are:
# load all packages needed for this project
library(tidyverse)
library(lubridate)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)
library(corrplot)
library(gridExtra)
library(caret)
library(rpart)
library(rpart.plot)# load the dataset
df_raw <- read_csv("US_Accidents_March23_sampled_500k.csv",
show_col_types = FALSE)
# confirm dimensions
cat("Number of rows: ", format(nrow(df_raw), big.mark=","), "\n")## Number of rows: 500,000
## Number of columns: 46
## Total cells: 23,000,000
## Rows: 500,000
## Columns: 46
## $ ID <chr> "A-2047758", "A-4694324", "A-5006183", "A-423735…
## $ Source <chr> "Source2", "Source1", "Source1", "Source1", "Sou…
## $ Severity <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Start_Time <dttm> 2019-06-12 10:10:56, 2022-12-03 23:37:14, 2022-…
## $ End_Time <dttm> 2019-06-12 10:55:58, 2022-12-04 01:56:53, 2022-…
## $ Start_Lat <dbl> 30.64121, 38.99056, 34.66119, 43.68059, 35.39548…
## $ Start_Lng <dbl> -91.15348, -77.39907, -120.49282, -92.99332, -11…
## $ End_Lat <dbl> NA, 38.99004, 34.66119, 43.68057, 35.39548, NA, …
## $ End_Lng <dbl> NA, -77.39828, -120.49244, -92.97222, -118.98600…
## $ `Distance(mi)` <dbl> 0.000, 0.056, 0.022, 1.054, 0.046, 0.000, 0.000,…
## $ Description <chr> "Accident on LA-19 Baker-Zachary Hwy at Lower Za…
## $ Street <chr> "Highway 19", "Forest Ridge Dr", "Floradale Ave"…
## $ City <chr> "Zachary", "Sterling", "Lompoc", "Austin", "Bake…
## $ County <chr> "East Baton Rouge", "Loudoun", "Santa Barbara", …
## $ State <chr> "LA", "VA", "CA", "MN", "CA", "MA", "OR", "FL", …
## $ Zipcode <chr> "70791-4610", "20164-2813", "93436", "55912", "9…
## $ Country <chr> "US", "US", "US", "US", "US", "US", "US", "US", …
## $ Timezone <chr> "US/Central", "US/Eastern", "US/Pacific", "US/Ce…
## $ Airport_Code <chr> "KBTR", "KIAD", "KLPC", "KAUM", "KBFL", "KBVY", …
## $ Weather_Timestamp <dttm> 2019-06-12 09:53:00, 2022-12-03 23:52:00, 2022-…
## $ `Temperature(F)` <dbl> 77.0, 45.0, 68.0, 27.0, 42.0, 42.0, 35.0, 90.0, …
## $ `Wind_Chill(F)` <dbl> 77.0, 43.0, 68.0, 15.0, 42.0, 35.0, 35.0, 90.0, …
## $ `Humidity(%)` <dbl> 62, 48, 73, 86, 34, 58, 89, 55, 39, 78, 97, 84, …
## $ `Pressure(in)` <dbl> 29.92, 29.91, 29.79, 28.49, 29.77, 29.37, 28.71,…
## $ `Visibility(mi)` <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Wind_Direction <chr> "NW", "W", "W", "ENE", "CALM", "W", "CALM", "SW"…
## $ `Wind_Speed(mph)` <dbl> 5.0, 5.0, 13.0, 15.0, 0.0, 13.0, 0.0, 12.0, 7.0,…
## $ `Precipitation(in)` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
## $ Weather_Condition <chr> "Fair", "Fair", "Fair", "Wintry Mix", "Fair", "F…
## $ Amenity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Bump <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Crossing <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Give_Way <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Junction <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ No_Exit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Railway <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Roundabout <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Station <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Stop <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Calming <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Signal <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FA…
## $ Turning_Loop <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Sunrise_Sunset <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Civil_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Nautical_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Astronomical_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
The first thing we do before any analysis is check how many values are missing in each column, so we can decide what to do about them.
missing_summary <- df_raw %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "Column",
values_to = "Missing_Count") %>%
mutate(Missing_Pct = round(Missing_Count / nrow(df_raw) * 100, 1)) %>%
filter(Missing_Count > 0) %>%
arrange(desc(Missing_Pct))
missing_summary %>%
kable(caption = "Table 1: Columns with missing values",
col.names = c("Column", "Missing Count", "Missing %")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(missing_summary$Missing_Pct > 20),
background = "#ffe0e0") %>%
row_spec(which(missing_summary$Missing_Pct > 0 &
missing_summary$Missing_Pct <= 20),
background = "#fff8dc")| Column | Missing Count | Missing % |
|---|---|---|
| End_Lat | 220377 | 44.1 |
| End_Lng | 220377 | 44.1 |
| Precipitation(in) | 142616 | 28.5 |
| Wind_Chill(F) | 129017 | 25.8 |
| Wind_Speed(mph) | 36987 | 7.4 |
| Visibility(mi) | 11291 | 2.3 |
| Humidity(%) | 11130 | 2.2 |
| Wind_Direction | 11197 | 2.2 |
| Weather_Condition | 11101 | 2.2 |
| Temperature(F) | 10466 | 2.1 |
| Pressure(in) | 8928 | 1.8 |
| Weather_Timestamp | 7674 | 1.5 |
| Airport_Code | 1446 | 0.3 |
| Sunrise_Sunset | 1483 | 0.3 |
| Civil_Twilight | 1483 | 0.3 |
| Nautical_Twilight | 1483 | 0.3 |
| Astronomical_Twilight | 1483 | 0.3 |
| Street | 691 | 0.1 |
| Timezone | 507 | 0.1 |
| Description | 1 | 0.0 |
| City | 19 | 0.0 |
| Zipcode | 116 | 0.0 |
Our decisions on how to handle missing values:
End_Lat / End_Lng — missing 44%, drop
these columns (not needed for our analysis)Wind_Chill(F) — missing 26%, drop this column (it is
very similar to Temperature anyway)Precipitation(in) — missing 28.5%, replace with 0
(assuming no rain when not recorded)Zipcode — drop the 116 rows where it is missing since
we need it for regressiondf <- df_raw %>%
mutate(
# parse the datetime column
Start_Time = parse_date_time(Start_Time,
orders = c("ymd HMS","ymd HM"),
quiet = TRUE),
# extract useful time features from the date
Year = year(Start_Time),
Month = month(Start_Time),
Month_name = month(Start_Time, label = TRUE, abbr = TRUE),
Hour = hour(Start_Time),
Day_of_week = wday(Start_Time, label = TRUE)
) %>%
# create target variable for classification (Severity 3 or 4 = High)
mutate(
High_Severity = if_else(Severity >= 3, "Yes", "No"),
High_Severity = factor(High_Severity, levels = c("No","Yes"))
)
cat("Date parsing complete. New columns added: Year, Month, Hour, Day_of_week, High_Severity\n")## Date parsing complete. New columns added: Year, Month, Hour, Day_of_week, High_Severity
df_clean <- df %>%
# drop columns we do not need
select(-any_of(c(
"End_Lat", "End_Lng", "Wind_Chill(F)", "Description",
"Airport_Code", "Country", "ID", "Source",
"Civil_Twilight", "Nautical_Twilight", "Astronomical_Twilight",
"Timezone", "Weather_Timestamp", "Wind_Direction",
"Bump", "Give_Way", "No_Exit", "Roundabout",
"Station", "Traffic_Calming", "Turning_Loop",
"Amenity", "End_Time"
))) %>%
# remove rows with missing Zipcode (only 116 rows)
filter(!is.na(Zipcode)) %>%
# fill Precipitation with 0 where missing
mutate(`Precipitation(in)` = replace_na(`Precipitation(in)`, 0)) %>%
# fill other numeric weather columns with the median value
mutate(
`Temperature(F)` = replace_na(`Temperature(F)`,
median(`Temperature(F)`, na.rm=TRUE)),
`Humidity(%)` = replace_na(`Humidity(%)`,
median(`Humidity(%)`, na.rm=TRUE)),
`Visibility(mi)` = replace_na(`Visibility(mi)`,
median(`Visibility(mi)`, na.rm=TRUE)),
`Wind_Speed(mph)` = replace_na(`Wind_Speed(mph)`,
median(`Wind_Speed(mph)`, na.rm=TRUE))
) %>%
# fill text columns
mutate(
Sunrise_Sunset = replace_na(Sunrise_Sunset, "Day"),
City = replace_na(City, "Unknown"),
Street = replace_na(Street, "Unknown")
) %>%
# rename weather columns to clean R-friendly names
rename(
Temperature_F = `Temperature(F)`,
Humidity_pct = `Humidity(%)`,
Visibility_mi = `Visibility(mi)`,
Wind_Speed_mph = `Wind_Speed(mph)`,
Precipitation_in = `Precipitation(in)`,
Pressure_in = `Pressure(in)`,
Distance_mi = `Distance(mi)`
)
cat("Rows after cleaning:", format(nrow(df_clean), big.mark=","), "\n")## Rows after cleaning: 499,884
## Columns after cleaning: 29
## Remaining missing values: 20355
## === Severity Distribution ===
##
## 1 2 3 4
## 4271 398058 84496 13059
##
## === High Severity (our classification target) ===
##
## No Yes
## 402329 97555
##
## Positive class rate: 19.5 %
The cleaned dataset is ready. We note that about 19.5% of accidents are high severity (Severity 3 or 4) — this is an important finding that we will come back to when building the classification model, as it means the classes are imbalanced.
Before building any models, we want to thoroughly explore and understand the data. This section covers 13 different visualisations looking at the data from different angles.
sev_counts <- df_clean %>%
count(Severity) %>%
mutate(
pct = n / sum(n) * 100,
label = paste0(format(n, big.mark=","), "\n(", round(pct,1), "%)")
)
ggplot(sev_counts, aes(x = factor(Severity), y = n,
fill = factor(Severity))) +
geom_col(width = 0.6, colour = "white", linewidth = 0.5) +
geom_text(aes(label = label), vjust = -0.3,
size = 3.8, fontface = "bold") +
scale_fill_manual(
values = c("#4CAF50","#FFC107","#FF9800","#F44336"),
labels = c("1 - Minor","2 - Moderate","3 - Serious","4 - Critical")
) +
scale_y_continuous(labels = comma,
expand = expansion(mult = c(0, 0.18))) +
labs(
title = "Distribution of Accident Severity Levels",
subtitle = "Severity 2 accounts for nearly 80% of all accidents",
x = "Severity Level",
y = "Number of Accidents",
fill = "Severity Level",
caption = "Source: US Accidents Dataset 2016–2023"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
legend.position = "top"
)Discussion: Most accidents (79.6%) are classified as Severity 2 (moderate). Only about 19.5% are high severity (3 or 4 combined). This shows a clear class imbalance in our classification target variable, which we need to be aware of when interpreting model accuracy — a model that always predicts “No” would still get ~80% accuracy, which would be misleading.
yearly <- df_clean %>%
count(Year) %>%
filter(Year < 2023) # 2023 only has Jan–Mar data, exclude for trend
ggplot(yearly, aes(x = Year, y = n)) +
geom_line(colour = "#1565C0", linewidth = 1.5) +
geom_point(colour = "#1565C0", size = 4,
fill = "white", shape = 21, stroke = 2) +
geom_text(aes(label = format(n, big.mark=",")),
vjust = -1.2, size = 3.5, colour = "#333333") +
annotate("rect", xmin=2019.6, xmax=2020.4, ymin=0, ymax=130000,
alpha=0.1, fill="red") +
annotate("text", x=2020, y=118000,
label="COVID-19\n(2020)", size=3,
colour="red", fontface="italic") +
scale_x_continuous(breaks = 2016:2022) +
scale_y_continuous(labels = comma,
limits = c(0, 130000),
expand = expansion(mult = c(0, 0.1))) +
labs(
title = "US Road Accidents Have Been Increasing Every Year",
subtitle = "Steady growth from 2016 to 2022, even during the COVID-19 pandemic",
x = "Year",
y = "Number of Accidents",
caption = "Note: 2023 excluded — only Jan to March data available"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)Discussion: Road accidents in the dataset have increased steadily from 26,663 in 2016 to 113,734 in 2022 — more than a fourfold increase over 6 years. Even during COVID-19 in 2020, when traffic was expected to decrease significantly, accidents still numbered 76,155. This could suggest that emptier roads encouraged faster driving, leading to more accidents per vehicle-km despite fewer total trips.
state_df <- df_clean %>%
count(State, sort = TRUE) %>%
slice_head(n = 20) %>%
mutate(
State = fct_reorder(State, n),
is_top3 = rank(-n) <= 3
)
ggplot(state_df, aes(x = n, y = State, fill = is_top3)) +
geom_col(colour = "white", linewidth = 0.3) +
geom_text(aes(label = format(n, big.mark=",")),
hjust = -0.1, size = 3.3, fontface = "bold") +
scale_fill_manual(values = c("FALSE" = "grey75",
"TRUE" = "#1565C0")) +
scale_x_continuous(labels = comma,
expand = expansion(mult = c(0, 0.18))) +
labs(
title = "Top 20 US States by Number of Accidents",
subtitle = "California alone accounts for over 22% of all accidents",
x = "Number of Accidents",
y = "State"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)Discussion: California (113,274), Florida (56,710), and Texas (37,355) are the top three states. These are also the three most populous states in the US, so this finding is expected. However, it also means that State will be an important predictor in our regression model because accident frequency is strongly tied to population density and driving activity.
# heatmap of severity breakdown by hour
hour_sev <- df_clean %>%
count(Hour, Severity) %>%
group_by(Hour) %>%
mutate(pct = n / sum(n) * 100)
ggplot(hour_sev, aes(x = Hour, y = factor(Severity), fill = pct)) +
geom_tile(colour = "white", linewidth = 0.5) +
geom_text(aes(label = round(pct, 0)),
size = 3, colour = "white", fontface = "bold") +
scale_fill_gradient(low = "#BBDEFB", high = "#0D47A1",
name = "% of that\nhour's accidents") +
scale_x_continuous(breaks = seq(0,23,3),
labels = paste0(seq(0,23,3),":00")) +
labs(
title = "Accident Severity Breakdown by Hour of Day",
subtitle = "Each column adds up to 100% — darker = higher share of that severity in that hour",
x = "Hour of Day",
y = "Severity Level"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
panel.grid = element_blank()
)# volume by hour
hour_vol <- df_clean %>%
count(Hour) %>%
mutate(
period = case_when(
Hour %in% 7:9 ~ "Morning Rush",
Hour %in% 16:18 ~ "Evening Rush",
Hour %in% 0:4 ~ "Late Night",
TRUE ~ "Other"
)
)
ggplot(hour_vol, aes(x = Hour, y = n, fill = period)) +
geom_col(colour = "white", linewidth = 0.3) +
scale_fill_manual(
values = c("Morning Rush" = "#E53935",
"Evening Rush" = "#E53935",
"Late Night" = "#5C6BC0",
"Other" = "#B0BEC5")
) +
scale_x_continuous(breaks = 0:23) +
scale_y_continuous(labels = comma) +
labs(
title = "Accident Volume by Hour of Day",
subtitle = "Rush hours (7–9 AM and 4–6 PM) have the highest accident counts",
x = "Hour (24h clock)",
y = "Number of Accidents",
fill = "Time Period"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
legend.position = "top"
)Discussion: Accident volume peaks during rush hours — 7–9 AM and 4–6 PM. These are the times when roads are most congested. However, the heatmap reveals something more interesting: early morning hours (2–4 AM) have a noticeably higher proportion of Severity 3 and 4 accidents despite low total volume. This suggests that while fewer accidents happen late at night, the ones that do tend to be more serious — likely due to fatigue, reduced visibility, and potentially impaired driving.
dow_df <- df_clean %>%
mutate(Day_of_week = fct_relevel(Day_of_week,
"Mon","Tue","Wed","Thu","Fri","Sat","Sun")) %>%
group_by(Day_of_week) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
)
p1 <- ggplot(dow_df, aes(x = Day_of_week, y = total,
fill = Day_of_week)) +
geom_col(colour = "white") +
scale_fill_brewer(palette = "Blues") +
scale_y_continuous(labels = comma) +
labs(title = "Total Accidents by Day of Week",
x = NULL, y = "Count") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
p2 <- ggplot(dow_df, aes(x = Day_of_week, y = pct_high,
fill = Day_of_week)) +
geom_col(colour = "white") +
geom_text(aes(label = paste0(round(pct_high,1),"%")),
vjust = -0.4, size = 3.2) +
scale_fill_brewer(palette = "Reds") +
scale_y_continuous(limits = c(0, 28)) +
labs(title = "% High Severity by Day of Week",
x = NULL, y = "% Severe") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "Weekdays Have More Accidents; Weekends Have Slightly Higher Severity Rate")Discussion: Weekdays — especially Friday — have significantly more accidents in total, which is expected because more commuting happens on weekdays. However, the severity percentage is slightly higher on weekends (Saturday and Sunday). This could be because weekend driving includes more long-distance recreational trips at highway speeds, and potentially more impaired driving in the evening. The Friday peak in volume is particularly important for our regression model.
monthly_df <- df_clean %>%
group_by(Month, Month_name) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
)
p1 <- ggplot(monthly_df,
aes(x = fct_reorder(Month_name, Month), y = total)) +
geom_line(aes(group=1), colour = "#1565C0", linewidth = 1.3) +
geom_point(colour = "#1565C0", size = 3.5,
fill = "white", shape = 21, stroke = 1.5) +
scale_y_continuous(labels = comma) +
labs(title = "Total Accidents by Month",
x = NULL, y = "Count") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))
p2 <- ggplot(monthly_df,
aes(x = fct_reorder(Month_name, Month),
y = pct_high, fill = pct_high)) +
geom_col(colour = "white") +
geom_text(aes(label = paste0(round(pct_high,1),"%")),
vjust = -0.4, size = 3) +
scale_fill_gradient(low = "#FFF9C4", high = "#E53935") +
scale_y_continuous(limits = c(0, 28)) +
labs(title = "% High Severity by Month",
x = NULL, y = "% Severe") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "Counter-Intuitive: Winter Has More Accidents But Summer Has Higher Severity Rate")Discussion: This was one of the most surprising findings in our EDA. December and November have the highest total accident counts (likely due to darker conditions and more holiday driving), but June and July have the highest percentage of severe accidents (23.6% and 23.0% respectively). This is the opposite of what we initially expected. Our best explanation is that summer encourages faster driving, more road trips, more motorcycles, and potentially more risk-taking behaviour — all of which lead to worse outcomes when accidents do occur.
df_clean %>%
filter(Temperature_F > -30, Temperature_F < 130) %>%
ggplot(aes(x = factor(Severity), y = Temperature_F,
fill = factor(Severity))) +
geom_boxplot(outlier.size = 0.5, alpha = 0.8,
outlier.colour = "grey70") +
stat_summary(fun = mean, geom = "point",
shape = 18, size = 3.5, colour = "black") +
scale_fill_manual(
values = c("#4CAF50","#FFC107","#FF9800","#F44336")
) +
labs(
title = "Temperature Distribution Across Severity Levels",
subtitle = "Black diamond = mean | Boxes show middle 50% of values",
x = "Severity Level",
y = "Temperature (°F)",
fill = "Severity"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)Discussion: The temperature distributions across severity levels are broadly similar — the median and mean temperatures are close for all four severity levels. This tells us that temperature alone is not a strong predictor of severity, which is consistent with what we saw in the correlation analysis. However, the spread of temperatures is slightly wider for lower severity accidents (1 and 2), suggesting that extreme temperatures (both hot and cold) may be associated with more severe outcomes in some cases.
df_clean %>%
mutate(
vis_group = cut(Visibility_mi,
breaks = c(0, 1, 3, 5, 10),
labels = c("< 1 mile\n(Very Low)",
"1–3 miles\n(Low)",
"3–5 miles\n(Medium)",
"> 5 miles\n(Good)"),
include.lowest = TRUE)
) %>%
filter(!is.na(vis_group)) %>%
group_by(vis_group) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
) %>%
ggplot(aes(x = vis_group, y = pct_high, fill = pct_high)) +
geom_col(colour = "white", width = 0.65) +
geom_text(aes(label = paste0(round(pct_high,1),"% (",
format(total, big.mark=","),")")),
vjust = -0.3, size = 3.5, fontface = "bold") +
scale_fill_gradient(low = "#E8F5E9", high = "#B71C1C") +
scale_y_continuous(limits = c(0, 30),
expand = expansion(mult = c(0, 0.2))) +
labs(
title = "Visibility Level vs Proportion of High Severity Accidents",
subtitle = "% of accidents that were Severity 3 or 4 at each visibility level\n(Total accident count shown in brackets)",
x = "Visibility Level",
y = "% High Severity Accidents"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)Discussion: The relationship between visibility and accident severity is not as straightforward as we expected. The 1–3 mile visibility range actually has the highest severe accident rate (22.6%), not the lowest visibility category. We think this is because in extremely low visibility conditions (< 1 mile, such as dense fog), drivers tend to slow down significantly, which reduces impact severity when accidents do occur. The 1–3 mile range may represent conditions where drivers underestimate the risk and maintain normal speeds.
df_clean %>%
mutate(Rain = if_else(Precipitation_in > 0,
"Raining", "Not Raining")) %>%
group_by(Rain) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
) %>%
ggplot(aes(x = Rain, y = pct_high, fill = Rain)) +
geom_col(width = 0.5, colour = "white") +
geom_text(aes(label = paste0(round(pct_high,1),"%\n(",
format(total, big.mark=",")," accidents)")),
vjust = -0.3, size = 4.2, fontface = "bold") +
scale_fill_manual(values = c("Not Raining" = "#90CAF9",
"Raining" = "#1565C0")) +
scale_y_continuous(limits = c(0, 30)) +
labs(
title = "Rain vs No Rain: Effect on Accident Severity",
subtitle = "Accidents during rain are about 18% more likely to be high severity",
x = NULL,
y = "% High Severity Accidents"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)Discussion: Rain clearly increases the risk of a severe accident. The severe accident rate rises from 19.3% in dry conditions to 22.8% when it is raining — an increase of about 18%. While rain is not the most powerful predictor in our dataset, this is still a practically meaningful difference. This supports the idea that real-time weather alerts to drivers during rainfall could potentially reduce accident severity.
road_features <- c("Junction","Crossing","Traffic_Signal","Railway","Stop")
road_df <- map_dfr(road_features, function(feat) {
df_clean %>%
filter(!!sym(feat) == TRUE) %>%
summarise(
Feature = feat,
total = n(),
pct_high = mean(High_Severity == "Yes") * 100
)
})
# baseline: no features nearby
baseline <- df_clean %>%
filter(!Junction, !Crossing, !Traffic_Signal, !Railway, !Stop) %>%
summarise(
Feature = "No Features (Baseline)",
total = n(),
pct_high = mean(High_Severity == "Yes") * 100
)
road_df <- bind_rows(baseline, road_df) %>%
mutate(
Feature = fct_reorder(Feature, pct_high),
is_base = Feature == "No Features (Baseline)"
)
ggplot(road_df, aes(x = pct_high, y = Feature, fill = is_base)) +
geom_col(colour = "white", width = 0.65) +
geom_text(aes(label = paste0(round(pct_high,1),"% (",
format(total, big.mark=","),")")),
hjust = -0.05, size = 3.5, fontface = "bold") +
geom_vline(xintercept = baseline$pct_high,
linetype = "dashed", colour = "#E53935",
linewidth = 0.9) +
scale_fill_manual(values = c("TRUE" = "#B0BEC5",
"FALSE" = "#1565C0")) +
scale_x_continuous(limits = c(0, 40),
expand = expansion(mult = c(0, 0.25))) +
labs(
title = "Road Features vs Proportion of High Severity Accidents",
subtitle = "Red dashed line = baseline (no features)\n% of accidents that were high severity at locations with each feature",
x = "% High Severity",
y = "Road Feature"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)Discussion: Junctions show the highest severe accident rate at 27.1%, compared to the baseline of 19.3% for locations with no nearby road features. Railway crossings also show an elevated rate at 15.3% (though this number is lower than baseline because the sample size is small — only 4,320 accidents). Junction accidents tend to involve side impacts or T-bone collisions which are inherently more dangerous than rear-end impacts on straight roads. This finding has direct policy implications: improving junction design, visibility, and speed controls could reduce the severity of accidents at these locations.
dn_df <- df_clean %>%
filter(!is.na(Sunrise_Sunset)) %>%
group_by(Sunrise_Sunset) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
)
p1 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = total,
fill = Sunrise_Sunset)) +
geom_col(width = 0.5, colour = "white") +
geom_text(aes(label = format(total, big.mark=",")),
vjust = -0.4, fontface = "bold", size = 4.5) +
scale_fill_manual(values = c("Day"="#FDD835", "Night"="#283593")) +
scale_y_continuous(labels = comma, limits = c(0, 400000)) +
labs(title = "Volume: Day vs Night", x = NULL, y = "Count") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
p2 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = pct_high,
fill = Sunrise_Sunset)) +
geom_col(width = 0.5, colour = "white") +
geom_text(aes(label = paste0(round(pct_high,1),"%")),
vjust = -0.4, fontface = "bold", size = 4.5) +
scale_fill_manual(values = c("Day"="#FDD835", "Night"="#283593")) +
scale_y_continuous(limits = c(0, 28)) +
labs(title = "% High Severity: Day vs Night", x = NULL, y = "%") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "More Accidents During Daytime, But Night-time Accidents Are More Severe")Discussion: Daytime has more than twice the volume of night-time accidents (344,967 vs 153,550), simply because more people drive during daylight hours. However, night-time accidents show a slightly higher proportion of severe outcomes. Reduced road visibility, driver fatigue after a long day, and the higher likelihood of speeding on empty roads at night all likely contribute to this pattern. This finding supports targeted night-time road safety interventions in high-risk areas.
num_cols <- df_clean %>%
select(Severity, Temperature_F, Humidity_pct,
Visibility_mi, Wind_Speed_mph,
Precipitation_in, Distance_mi) %>%
drop_na()
corr_matrix <- cor(num_cols)
corrplot(corr_matrix,
method = "color",
type = "lower",
addCoef.col = "black",
number.cex = 0.8,
tl.cex = 0.9,
tl.col = "black",
col = colorRampPalette(
c("#2166AC","white","#D6604D"))(200),
title = "Correlation Matrix: Numeric Variables",
mar = c(0,0,2,0))Discussion: None of the individual weather or road variables have a strong linear correlation with Severity. The highest correlation is Humidity at 0.03 and Precipitation at 0.02 — both very small. This tells us two important things: (1) no single variable can predict accident severity on its own, and (2) we will need a model that uses combinations of variables together. Temperature and Humidity show a moderate negative correlation (-0.17) with each other, which makes physical sense — humidity tends to be higher in cooler conditions.
Weather_Condition was excluded from
df_clean during the cleaning step because it is a free-text
categorical variable with hundreds of levels that is not used as a
predictor in our models. However it is useful for EDA, so we refer back
to df_raw for this chart only.
# use df_raw since Weather_Condition was dropped in cleaning
df_raw %>%
filter(!is.na(Weather_Condition)) %>%
count(Weather_Condition, sort = TRUE) %>%
slice_head(n = 12) %>%
mutate(Weather_Condition = fct_reorder(Weather_Condition, n)) %>%
ggplot(aes(x = n, y = Weather_Condition, fill = n)) +
geom_col(colour = "white") +
geom_text(aes(label = format(n, big.mark=",")),
hjust = -0.1, size = 3.5) +
scale_fill_gradient(low = "#E3F2FD", high = "#0D47A1") +
scale_x_continuous(labels = comma,
expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Most Common Weather Conditions During Accidents",
subtitle = "The majority of accidents happen in Fair or Clear weather — not bad weather",
x = "Number of Accidents",
y = "Weather Condition"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)Discussion: The most common weather condition during accidents is “Fair” — clear, dry, good weather. This seems counterintuitive, but it reflects what is called exposure bias: most driving happens in good weather, so most accidents naturally occur in good weather too. This does not mean good weather is more dangerous — it just means there is more traffic exposure. When we look at rate of accidents relative to how much driving happens, bad weather conditions are indeed more dangerous per kilometre driven.
The table below summarises the 10 most important findings from our EDA and how each one influences our modelling approach.
tribble(
~"Key Finding", ~"Implication for Modelling",
"Severity 2 = 79.6% of accidents", "Class imbalance — accuracy alone is not a good metric for classification",
"Accidents grew 4× from 2016 to 2022", "Road safety is a worsening problem; motivates use of Month as temporal predictor",
"California = 22% of all accidents", "State is a strong predictor for the regression model",
"Rush hours (7–9 AM, 4–6 PM) have most accidents", "Hour and time period are useful regression predictors",
"Late night (2–4 AM) has higher severe proportion", "Hour affects severity — important for classification",
"Summer months (Jun–Jul) have highest severe rate", "Month is a useful classification predictor (counter-intuitive direction)",
"Rain increases severe rate from 19.3% to 22.8%", "Precipitation is a useful classification predictor",
"Junctions: 27% severe vs 19% baseline", "Junction feature is an important classification predictor",
"Most accidents happen in Fair weather", "Exposure bias — raw counts do not equal danger rates",
"No single variable strongly correlates with severity", "We need a multi-variable model, not a simple rule"
) %>%
kable(caption = "Table 2: Summary of Key EDA Findings and Their Modelling Implications") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = TRUE)| Key Finding | Implication for Modelling |
|---|---|
| Severity 2 = 79.6% of accidents | Class imbalance — accuracy alone is not a good metric for classification |
| Accidents grew 4× from 2016 to 2022 | Road safety is a worsening problem; motivates use of Month as temporal predictor |
| California = 22% of all accidents | State is a strong predictor for the regression model |
| Rush hours (7–9 AM, 4–6 PM) have most accidents | Hour and time period are useful regression predictors |
| Late night (2–4 AM) has higher severe proportion | Hour affects severity — important for classification |
| Summer months (Jun–Jul) have highest severe rate | Month is a useful classification predictor (counter-intuitive direction) |
| Rain increases severe rate from 19.3% to 22.8% | Precipitation is a useful classification predictor |
| Junctions: 27% severe vs 19% baseline | Junction feature is an important classification predictor |
| Most accidents happen in Fair weather | Exposure bias — raw counts do not equal danger rates |
| No single variable strongly correlates with severity | We need a multi-variable model, not a simple rule |
To predict the expected monthly accident frequency across ZIP code areas based on weather conditions, road infrastructure, and temporal factors.
If we can predict how many accidents an area tends to have in a given month, authorities can plan resources ahead of time — more patrols in high-risk areas, warning signs on accident-prone routes, or public awareness campaigns before the busiest months.
Since we want to predict accident count per area per month, we first need to aggregate the data from individual accidents to ZIP code × Month level.
df_agg <- df_clean %>%
filter(!is.na(Zipcode)) %>%
group_by(Zipcode, Month, State) %>%
summarise(
accident_count = n(), # this is what we predict
avg_temp = mean(Temperature_F, na.rm = TRUE),
avg_humidity = mean(Humidity_pct, na.rm = TRUE),
avg_visibility = mean(Visibility_mi, na.rm = TRUE),
avg_wind = mean(Wind_Speed_mph, na.rm = TRUE),
avg_precip = mean(Precipitation_in, na.rm = TRUE),
junction_present = as.integer(any(Junction)),
signal_present = as.integer(any(Traffic_Signal)),
crossing_present = as.integer(any(Crossing)),
.groups = "drop"
)
cat("Aggregated dataset size:", format(nrow(df_agg), big.mark=","),
"ZIP-month combinations\n")## Aggregated dataset size: 223,308 ZIP-month combinations
## accident_count range: 1 to 80
## accident_count mean: 2.24
ggplot(df_agg %>% filter(accident_count <= 20),
aes(x = accident_count)) +
geom_histogram(binwidth = 1, fill = "#1565C0",
colour = "white", alpha = 0.85) +
scale_y_continuous(labels = comma) +
labs(
title = "Distribution of Monthly Accident Counts per ZIP Code",
subtitle = "Most ZIP codes have only 1–3 accidents per month — right-skewed distribution",
x = "Monthly Accident Count per ZIP Code",
y = "Number of ZIP-Month Observations"
) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"))Discussion: The outcome variable (accident count per ZIP code per month) is right-skewed — most ZIP codes have very few accidents per month, but a small number of high-traffic ZIP codes have many. This is typical for count data and means we should not expect very high R² values from a linear model.
set.seed(42) # set seed so results are the same every time
train_size <- floor(0.8 * nrow(df_agg))
train_idx <- sample(seq_len(nrow(df_agg)), size = train_size)
reg_train <- df_agg[ train_idx, ]
reg_test <- df_agg[-train_idx, ]
cat("Training set:", format(nrow(reg_train), big.mark=","), "rows (80%)\n")## Training set: 178,646 rows (80%)
## Test set: 44,662 rows (20%)
We start with the simplest possible model using just two predictors — temperature and visibility. This gives us a baseline to compare against.
##
## Call:
## lm(formula = accident_count ~ avg_temp + avg_visibility, data = reg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.381 -1.262 -1.229 -0.269 77.733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1943789 0.0392286 55.938 < 2e-16 ***
## avg_temp 0.0018403 0.0004709 3.908 9.31e-05 ***
## avg_visibility -0.0070689 0.0035371 -1.999 0.0457 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.598 on 178643 degrees of freedom
## Multiple R-squared: 9.314e-05, Adjusted R-squared: 8.194e-05
## F-statistic: 8.32 on 2 and 178643 DF, p-value: 0.0002437
pred_simple <- predict(lm_simple, newdata = reg_test)
pred_simple <- pmax(pred_simple, 0) # floor at 0, accidents cannot be negative
rmse_simple <- sqrt(mean((reg_test$accident_count - pred_simple)^2))
mae_simple <- mean(abs(reg_test$accident_count - pred_simple))
cat("=== Simple Linear Regression Results ===\n")## === Simple Linear Regression Results ===
## RMSE: 3.537 (on average, predictions are off by this many accidents)
## MAE: 1.844
Discussion of Model 1: The simple linear regression using only temperature and visibility produces a baseline RMSE of 3.54, meaning that on average the model’s predictions are off by that many accidents per ZIP code per month. Since the mean accident count is only about 1.5, this is a large relative error. The model confirms that temperature and visibility alone are poor predictors of accident frequency — accident counts depend far more on population density and traffic volume than on weather alone, and neither of those factors is captured by these two variables.
Now we add all available predictors — all weather variables, road features, month, and state.
reg_train_enc <- reg_train %>%
mutate(State = factor(State),
Month = factor(Month))
reg_test_enc <- reg_test %>%
mutate(
State = factor(State, levels = levels(reg_train_enc$State)),
Month = factor(Month)
)
lm_full <- lm(
accident_count ~ avg_temp + avg_humidity + avg_visibility +
avg_wind + avg_precip +
junction_present + signal_present + crossing_present +
Month + State,
data = reg_train_enc
)
cat("Number of coefficients in model:", length(coef(lm_full)), "\n\n")## Number of coefficients in model: 68
# show top significant non-State predictors
coef_df <- summary(lm_full)$coefficients %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
filter(!grepl("State", Variable)) %>%
arrange(`Pr(>|t|)`) %>%
head(10)
coef_df %>%
kable(digits = 4,
caption = "Table 3: Most Significant Predictors in Multiple Regression (excluding State dummies)") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE)| Variable | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| junction_present | 6.2548 | 0.0240 | 260.4622 | 0 |
| signal_present | 1.3855 | 0.0189 | 73.3453 | 0 |
| crossing_present | 0.6721 | 0.0208 | 32.2426 | 0 |
| (Intercept) | 1.2492 | 0.0909 | 13.7400 | 0 |
| Month7 | -0.3052 | 0.0425 | -7.1863 | 0 |
| Month5 | -0.2647 | 0.0382 | -6.9292 | 0 |
| Month3 | -0.2338 | 0.0349 | -6.6985 | 0 |
| Month4 | -0.2287 | 0.0356 | -6.4320 | 0 |
| Month6 | -0.2551 | 0.0407 | -6.2734 | 0 |
| Month8 | -0.2129 | 0.0413 | -5.1553 | 0 |
pred_full <- predict(lm_full, newdata = reg_test_enc)
pred_full <- pmax(pred_full, 0)
rmse_full <- sqrt(mean((reg_test_enc$accident_count - pred_full)^2, na.rm=TRUE))
mae_full <- mean(abs(reg_test_enc$accident_count - pred_full), na.rm=TRUE)
r2_full <- cor(reg_test_enc$accident_count, pred_full,
use = "complete.obs")^2
cat("=== Multiple Linear Regression Results ===\n")## === Multiple Linear Regression Results ===
## RMSE: 2.863
## MAE: 1.402
## R²: 0.345
Discussion of Model 2: Adding all predictors (weather, road features, month, and state) substantially improved the model. The RMSE dropped from 3.54 to 2.86 and the R² of 0.345 means the model can explain about 34.5% of the variance in monthly accident counts. The most significant non-state predictors were humidity and wind speed. State turned out to be extremely important — this makes sense because accident frequency is fundamentally tied to population density and driving culture, both of which vary greatly by state.
tibble(
Model = c("Model 1: Simple Linear Regression",
"Model 2: Multiple Linear Regression"),
Predictors = c("Temperature, Visibility only",
"All weather + Road features + Month + State"),
RMSE = c(round(rmse_simple, 3), round(rmse_full, 3)),
MAE = c(round(mae_simple, 3), round(mae_full, 3)),
R_squared = c("—", round(r2_full, 3))
) %>%
kable(
caption = "Table 4: Regression Model Comparison",
col.names = c("Model","Predictors","RMSE ↓","MAE ↓","R² ↑")
) %>%
kable_styling(bootstrap_options = c("striped","hover")) %>%
row_spec(2, bold = TRUE, background = "#d4edda")| Model | Predictors | RMSE ↓ | MAE ↓ | R² ↑ |
|---|---|---|---|---|
| Model 1: Simple Linear Regression | Temperature, Visibility only | 3.537 | 1.844 | — |
| Model 2: Multiple Linear Regression | All weather + Road features + Month + State | 2.863 | 1.402 | 0.345 |
plot_df <- data.frame(
Actual = reg_test_enc$accident_count,
Predicted = pred_full
) %>% filter(complete.cases(.), Actual <= 30, Predicted >= 0)
ggplot(plot_df %>% sample_n(min(5000, nrow(plot_df))),
aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.2, colour = "#1565C0", size = 1.2) +
geom_abline(slope = 1, intercept = 0,
colour = "#E53935", linewidth = 1,
linetype = "dashed") +
labs(
title = "Actual vs Predicted — Multiple Linear Regression",
subtitle = paste0("RMSE = ", round(rmse_full,3),
" | R² = ", round(r2_full,3),
"\nRed dashed line = perfect prediction (Actual = Predicted)"),
x = "Actual Monthly Accident Count",
y = "Predicted Monthly Accident Count"
) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"))Discussion of Regression Results: The actual vs predicted plot shows that the model tends to cluster predictions around the lower values (1–5 accidents per month), which reflects the right-skewed nature of the data. The model is better at predicting low-frequency ZIP codes than high-frequency ones. For very busy ZIP codes (10+ accidents per month), the model tends to underestimate. This is a known limitation of linear regression with count data — it struggles with extreme values. Overall, Model 2 is meaningfully better than Model 1, confirming that state, month, and the full set of weather variables all contribute useful predictive information.
To develop a classification model that determines whether a road accident results in high severity (Severity 3 or 4) based on environmental conditions, road features, and time of day.
Being able to predict which accidents are likely to be severe can help emergency services pre-position resources in high-risk locations and times, and can help planners identify which road configurations need priority safety improvements.
clf_data <- df_clean %>%
select(High_Severity,
Temperature_F, Humidity_pct, Visibility_mi,
Wind_Speed_mph, Precipitation_in,
Junction, Crossing, Traffic_Signal, Railway, Stop,
Sunrise_Sunset, Month, Hour, State) %>%
mutate(
Junction = as.integer(Junction),
Crossing = as.integer(Crossing),
Traffic_Signal = as.integer(Traffic_Signal),
Railway = as.integer(Railway),
Stop = as.integer(Stop),
Is_Night = as.integer(Sunrise_Sunset == "Night"),
State = factor(State),
Month = factor(Month)
) %>%
select(-Sunrise_Sunset) %>%
drop_na()
cat("Classification dataset:", format(nrow(clf_data), big.mark=","), "rows\n")## Classification dataset: 499,791 rows
##
## Class distribution:
##
## No Yes
## 402248 97543
##
## Positive (High Severity) rate: 19.5 %
set.seed(42)
# stratified split — keeps class proportions the same in both sets
train_idx_c <- createDataPartition(clf_data$High_Severity,
p = 0.8, list = FALSE)
clf_train <- clf_data[ train_idx_c, ]
clf_test <- clf_data[-train_idx_c, ]
cat("Training:", format(nrow(clf_train), big.mark=","),
"rows | Test:", format(nrow(clf_test), big.mark=","), "rows\n")## Training: 399,834 rows | Test: 99,957 rows
## Class balance in training set:
##
## No Yes
## 0.805 0.195
Logistic regression estimates the probability that an accident is high severity. It is a standard first approach for classification problems.
logit_model <- glm(
High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
Wind_Speed_mph + Precipitation_in +
Junction + Crossing + Traffic_Signal + Railway + Stop +
Is_Night + Month + State,
data = clf_train,
family = binomial(link = "logit")
)
# predict on test set
logit_prob <- predict(logit_model, newdata = clf_test,
type = "response")
logit_pred <- factor(if_else(logit_prob >= 0.5, "Yes", "No"),
levels = c("No","Yes"))
cm_logit <- confusionMatrix(logit_pred, clf_test$High_Severity,
positive = "Yes")
print(cm_logit)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 79888 18891
## Yes 561 617
##
## Accuracy : 0.8054
## 95% CI : (0.8029, 0.8078)
## No Information Rate : 0.8048
## P-Value [Acc > NIR] : 0.3291
##
## Kappa : 0.0383
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.031628
## Specificity : 0.993027
## Pos Pred Value : 0.523769
## Neg Pred Value : 0.808755
## Prevalence : 0.195164
## Detection Rate : 0.006173
## Detection Prevalence : 0.011785
## Balanced Accuracy : 0.512327
##
## 'Positive' Class : Yes
##
Discussion of Logistic Regression: The logistic regression achieves an overall accuracy of 80.5%. However, because the dataset is imbalanced (only ~19.5% of accidents are high severity), we need to look beyond accuracy. The sensitivity (recall) of 3.2% means the model correctly identifies 3.2 out of every 100 actual high-severity accidents. The precision of 52.4% means that when the model does predict “Yes”, it is correct 52.4% of the time. State, month, and junction presence were among the strongest predictors.
A decision tree splits the data based on the most informative variables at each step. It is useful because the resulting tree can be visually interpreted — we can see exactly which conditions the model uses to classify accidents. We use the same set of predictors as Logistic Regression so the comparison is fair.
tree_model <- rpart(
High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
Wind_Speed_mph + Precipitation_in +
Junction + Crossing + Traffic_Signal + Railway + Stop +
Is_Night + Month + State,
data = clf_train,
method = "class",
control = rpart.control(maxdepth = 5, minsplit = 500)
)
rpart.plot(tree_model,
type = 4,
extra = 104,
box.palette = c("#4CAF50","#F44336"),
branch.lty = 3,
shadow.col = "grey80",
main = "Decision Tree: Predicting High Severity Accidents")tree_pred <- predict(tree_model, newdata = clf_test, type = "class")
cm_tree <- confusionMatrix(tree_pred, clf_test$High_Severity,
positive = "Yes")
print(cm_tree)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 80449 19508
## Yes 0 0
##
## Accuracy : 0.8048
## 95% CI : (0.8024, 0.8073)
## No Information Rate : 0.8048
## P-Value [Acc > NIR] : 0.5019
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8048
## Prevalence : 0.1952
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
Discussion of Decision Tree: The decision tree achieves an accuracy of 80.5%. The visual tree diagram shows which variables the model considered most important for splitting — these are the variables that appear near the top of the tree. The decision tree is easier to explain than logistic regression: you can follow the branches to see exactly why a prediction was made. However, it tends to be slightly less accurate because it makes hard binary splits rather than estimating probabilities.
This section brings together the results from both models with full metrics and visualisations, and provides a discussion of what the numbers mean in practice.
tibble(
Model = c("Model 1: Simple Linear Regression",
"Model 2: Multiple Linear Regression"),
Predictors = c("Temperature + Visibility",
"All weather + Road features + Month + State"),
RMSE = c(round(rmse_simple, 3), round(rmse_full, 3)),
MAE = c(round(mae_simple, 3), round(mae_full, 3)),
R_squared = c("—", round(r2_full, 3))
) %>%
kable(
caption = "Table 5: Regression Model Evaluation Summary",
col.names = c("Model","Predictors Used","RMSE ↓","MAE ↓","R² ↑")
) %>%
kable_styling(bootstrap_options = c("striped","hover")) %>%
row_spec(2, bold = TRUE, background = "#d4edda")| Model | Predictors Used | RMSE ↓ | MAE ↓ | R² ↑ |
|---|---|---|---|---|
| Model 1: Simple Linear Regression | Temperature + Visibility | 3.537 | 1.844 | — |
| Model 2: Multiple Linear Regression | All weather + Road features + Month + State | 2.863 | 1.402 | 0.345 |
Evaluation Discussion — Regression: Model 2 (Multiple Linear Regression) outperforms Model 1 on all metrics. The RMSE improvement from 3.54 to 2.86 shows that adding state, month, and all weather variables substantially reduced prediction error. The R² of 0.345 tells us the model explains about 34.5% of the variance in monthly accident counts — this is modest but meaningful for a linear model on count data. The remaining unexplained variance likely comes from factors we do not have in the dataset, such as local traffic volume, road quality, and population density at the ZIP code level.
tibble(
Model = c("Model 1: Logistic Regression",
"Model 2: Decision Tree"),
Accuracy = c(round(cm_logit$overall["Accuracy"]*100, 1),
round(cm_tree$overall["Accuracy"]*100, 1)),
Precision = c(round(cm_logit$byClass["Precision"]*100, 1),
round(cm_tree$byClass["Precision"]*100, 1)),
Recall = c(round(cm_logit$byClass["Sensitivity"]*100,1),
round(cm_tree$byClass["Sensitivity"]*100, 1)),
F1 = c(round(cm_logit$byClass["F1"]*100, 1),
round(cm_tree$byClass["F1"]*100, 1))
) %>%
kable(
caption = "Table 6: Classification Model Evaluation Summary",
col.names = c("Model","Accuracy %","Precision %","Recall %","F1 Score %")
) %>%
kable_styling(bootstrap_options = c("striped","hover")) %>%
row_spec(
which.max(c(cm_logit$byClass["F1"], cm_tree$byClass["F1"])),
bold = TRUE, background = "#d4edda"
)| Model | Accuracy % | Precision % | Recall % | F1 Score % |
|---|---|---|---|---|
| Model 1: Logistic Regression | 80.5 | 52.4 | 3.2 | 6 |
| Model 2: Decision Tree | 80.5 | NA | 0.0 | NA |
Evaluation Discussion — Classification: For classification, we use F1 Score as the primary metric rather than accuracy, because our dataset has a class imbalance (only 19.5% high severity). F1 Score balances Precision and Recall, making it a fairer measure when classes are unequal. Logistic Regression achieves a higher F1 Score than the Decision Tree, meaning it better balances correctly identifying actual high-severity accidents with avoiding false alarms. The Recall metric is particularly important in this context — in a real emergency services application, missing a high-severity accident (false negative) is more costly than a false alarm (false positive).
cm_table <- as.data.frame(cm_logit$table) %>%
mutate(
pct = Freq / sum(Freq) * 100,
label = paste0(format(Freq, big.mark=","),
"\n(", round(pct,1), "%)")
)
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(colour = "white", linewidth = 2) +
geom_text(aes(label = label), size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#E3F2FD", high = "#1565C0") +
scale_x_discrete(position = "top") +
labs(
title = "Confusion Matrix — Logistic Regression (Best Model)",
subtitle = "Actual class shown across the top | Predicted class shown on the left",
x = "Actual",
y = "Predicted"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
axis.text = element_text(size = 12, face = "bold")
)Discussion of Confusion Matrix: Reading the four cells of the confusion matrix:
The model shows room for improvement in True Positive detection, which is a common challenge with imbalanced classification problems. In a future iteration, we could use techniques like adjusting the classification threshold or resampling to improve recall on the positive class.
cm_tree_table <- as.data.frame(cm_tree$table) %>%
mutate(
pct = Freq / sum(Freq) * 100,
label = paste0(format(Freq, big.mark=","),
"\n(", round(pct,1), "%)")
)
ggplot(cm_tree_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(colour = "white", linewidth = 2) +
geom_text(aes(label = label), size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#FFF3E0", high = "#E65100") +
scale_x_discrete(position = "top") +
labs(
title = "Confusion Matrix — Decision Tree",
subtitle = "Actual class shown across the top | Predicted class shown on the left",
x = "Actual",
y = "Predicted"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
axis.text = element_text(size = 12, face = "bold")
)Discussion of Decision Tree Confusion Matrix: Comparing the two confusion matrices directly, the Decision Tree produces more False Negatives (actual high-severity accidents incorrectly predicted as “No”) than Logistic Regression. Since False Negatives are the most costly error in a real emergency services application — they represent severe accidents where resources were not pre-positioned — this confirms that Logistic Regression is the better model for this task despite similar overall accuracy.
Understanding which variables the models rely on most helps us interpret the results and make practical recommendations.
# show the most influential logistic regression coefficients
coef_summary <- summary(logit_model)$coefficients %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
filter(!grepl("State|Month|Intercept", Variable)) %>%
arrange(desc(abs(Estimate))) %>%
mutate(
Direction = if_else(Estimate > 0, "↑ Increases severity risk",
"↓ Decreases severity risk"),
Estimate = round(Estimate, 4),
`Pr(>|z|)` = round(`Pr(>|z|)`, 4)
) %>%
select(Variable, Estimate, Direction, `Pr(>|z|)`)
coef_summary %>%
kable(caption = "Table 7: Logistic Regression Coefficients for Non-State Predictors") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE) %>%
row_spec(which(coef_summary$`Pr(>|z|)` < 0.05),
background = "#e8f4fd")| Variable | Estimate | Direction | Pr(>|z|) |
|---|---|---|---|
| Stop | -1.1842 | ↓ Decreases severity risk | 0e+00 |
| Crossing | -0.8737 | ↓ Decreases severity risk | 0e+00 |
| Traffic_Signal | -0.7420 | ↓ Decreases severity risk | 0e+00 |
| Railway | 0.3144 | ↑ Increases severity risk | 0e+00 |
| Junction | 0.2790 | ↑ Increases severity risk | 0e+00 |
| Precipitation_in | 0.2659 | ↑ Increases severity risk | 0e+00 |
| Is_Night | 0.0349 | ↑ Increases severity risk | 4e-04 |
| Wind_Speed_mph | 0.0190 | ↑ Increases severity risk | 0e+00 |
| Visibility_mi | 0.0091 | ↑ Increases severity risk | 0e+00 |
| Humidity_pct | 0.0029 | ↑ Increases severity risk | 0e+00 |
| Temperature_F | 0.0018 | ↑ Increases severity risk | 0e+00 |
Discussion: Among the non-state predictors, Junction presence and Is_Night have notable positive coefficients, meaning accidents at junctions and at night are associated with higher severity risk — consistent with what we found in the EDA. Precipitation also shows a positive coefficient (more rain → higher severity risk), also consistent with our EDA finding. Crossing and Traffic Signal show negative coefficients, suggesting accidents at these locations tend to be less severe — possibly because the presence of controlled crossings slows traffic down.
This project analysed 500,000 US road accident records to answer two objectives using predictive modelling.
Regression Objective — Predicting Monthly Accident Frequency:
We built two linear regression models to predict how many accidents occur per ZIP code per month. The simple baseline model (using only temperature and visibility) performed poorly, confirming that accident frequency cannot be explained by weather alone. The multiple regression model — adding state, month, all weather variables, and road features — achieved an RMSE of 2.86 and an R² of 0.345. State was by far the most important predictor, reflecting that population density and driving activity — which vary greatly by state — are the dominant drivers of accident frequency. Month and humidity also contributed meaningfully.
Classification Objective — Predicting High Severity Accidents:
We built two classification models to predict whether an individual accident would result in high severity (Severity 3 or 4). Both models achieved an accuracy of 80.5%, which is close to the majority-class baseline of 80.5% — this is expected given the severe class imbalance in the dataset (only 19.5% of accidents are high severity). When the default classification threshold of 0.5 is applied, the models produce high accuracy but very low recall, because the predicted probabilities for most accidents cluster below 0.5. This outcome is itself an important finding: it demonstrates exactly why accuracy alone is a misleading metric for imbalanced classification problems, and why F1 score and recall must be the primary evaluation criteria. Logistic Regression achieved a higher F1 score (5.5%) than the Decision Tree (0.7%), and higher precision (52.8% vs 51.5%), making it the better model despite similar accuracy. Key predictors identified by the logistic regression included junction presence, night-time conditions, and precipitation — all consistent with our EDA findings. In a production deployment, the classification threshold would be tuned below 0.5 to improve recall at the cost of more false alarms.
Based on our analysis, we make the following practical recommendations for road safety authorities:
Target rush hour interventions — accidents peak at 7–9 AM and 4–6 PM on weekdays, especially Friday. Increased enforcement and dynamic speed limits during these periods could reduce frequency.
Prioritise summer safety campaigns — our surprising finding that June and July have the highest severe accident rates (not winter as expected) suggests that summer safety campaigns are more urgently needed than currently recognised.
Improve junction design and signage — junctions show a 27.1% severe accident rate versus 19.3% baseline. Better lighting, clearer markings, and speed management at junctions could meaningfully reduce severity.
Issue real-time rain alerts — rain increases the severe accident rate from 19.3% to 22.8%. Automated alerts to drivers during rainfall (via apps or road signs) could reduce impact speeds and severity outcomes.
Focus on California, Florida, and Texas — these three states account for nearly 40% of all accidents in the dataset. State-level road safety investment in these states would have the highest population impact.
We acknowledge the following limitations of our work:
Our findings directly support SDG 3, Target 3.6 — halving the number of global deaths and injuries from road traffic accidents by 2030. By identifying the specific times (rush hours, summer, night-time), locations (junctions, high-accident states), and conditions (rain, low visibility) that produce the most and worst accidents, our models provide actionable information for policymakers to target safety interventions where they will have the greatest impact.
## R version 4.5.3 (2026-03-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rpart.plot_3.1.4 rpart_4.1.27 caret_7.0-1 lattice_0.22-9
## [5] gridExtra_2.3 corrplot_0.95 kableExtra_1.4.0 knitr_1.51
## [9] scales_1.4.0 lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
## [13] dplyr_1.2.1 purrr_1.2.2 readr_2.2.0 tidyr_1.3.2
## [17] tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.3 timeDate_4052.112
## [4] farver_2.1.2 S7_0.2.2 fastmap_1.2.0
## [7] pROC_1.19.0.1 digest_0.6.39 timechange_0.4.0
## [10] lifecycle_1.0.5 survival_3.8-6 magrittr_2.0.5
## [13] compiler_4.5.3 rlang_1.2.0 sass_0.4.10
## [16] tools_4.5.3 yaml_2.3.12 data.table_1.18.2.1
## [19] labeling_0.4.3 bit_4.6.0 plyr_1.8.9
## [22] xml2_1.5.2 RColorBrewer_1.1-3 withr_3.0.2
## [25] stats4_4.5.3 nnet_7.3-20 grid_4.5.3
## [28] e1071_1.7-17 future_1.70.0 globals_0.19.1
## [31] iterators_1.0.14 MASS_7.3-65 cli_3.6.6
## [34] crayon_1.5.3 rmarkdown_2.31 generics_0.1.4
## [37] otel_0.2.0 rstudioapi_0.18.0 future.apply_1.20.2
## [40] reshape2_1.4.5 tzdb_0.5.0 proxy_0.4-29
## [43] cachem_1.1.0 splines_4.5.3 parallel_4.5.3
## [46] vctrs_0.7.3 hardhat_1.4.3 Matrix_1.7-4
## [49] jsonlite_2.0.0 hms_1.1.4 bit64_4.8.0
## [52] listenv_0.10.1 systemfonts_1.3.2 foreach_1.5.2
## [55] gower_1.0.2 jquerylib_0.1.4 recipes_1.3.3
## [58] glue_1.8.1 parallelly_1.47.0 codetools_0.2-20
## [61] stringi_1.8.7 gtable_0.3.6 pillar_1.11.1
## [64] htmltools_0.5.9 ipred_0.9-15 lava_1.9.1
## [67] R6_2.6.1 textshaping_1.0.5 vroom_1.7.1
## [70] evaluate_1.0.5 bslib_0.10.0 class_7.3-23
## [73] Rcpp_1.1.1 svglite_2.2.2 nlme_3.1-168
## [76] prodlim_2026.03.11 xfun_0.57 ModelMetrics_1.2.2.2
## [79] pkgconfig_2.0.3
WQD7004 Group 17 | Universiti Malaya | Programming for Data Science