Road accidents are a serious public health issue worldwide. Every year, millions of people are injured or killed on the roads, and many of these accidents could potentially be prevented if we had better information about where and when they are likely to happen.
For this project, we decided to look at a real-world dataset of road accidents in the United States. We wanted to answer two main questions using the data:
We picked this topic because it connects to SDG 3 (Good Health and Well-being), specifically Target 3.6 which is about reducing deaths and injuries from road traffic accidents. We thought it would be meaningful to work on something that has real-world impact, not just a made-up dataset.
Regression: Can we predict the expected number of accidents per ZIP code per month, based on weather and road features?
If authorities know which areas tend to have more accidents during certain months or weather conditions, they can put up warning signs, increase patrols, or send alerts to drivers in advance.
Classification: Can we predict whether an accident will be high severity (Severity 3 or 4) based on weather conditions, road features, and time of day?
Knowing which types of locations and conditions produce the worst accidents can help emergency services decide where to focus resources.
The dataset we are using is the US Accidents dataset, which was collected from multiple traffic APIs between 2016 and 2023. It covers 49 US states.
We are using a sampled version with 500,000 rows and 46 columns, giving us a total dimension of 23,000,000 cells which easily exceeds the requirement of 100,000.
| Detail | Value |
|---|---|
| Dataset name | US Accidents (2016–2023) |
| Source | Kaggle (Sobhan Moosavi et al.) |
| Rows | 500,000 |
| Columns | 46 |
| Total dimension | 23,000,000 cells |
| Coverage | 49 US states |
| Years | 2016 – March 2023 |
The columns include information about:
# we need these packages for data cleaning, analysis and visualisation
library(tidyverse)
library(lubridate)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)
library(corrplot)
library(RColorBrewer)
library(gridExtra)
library(caret)
library(rpart)
library(rpart.plot)# load the dataset
df_raw <- read_csv("US_Accidents_March23_sampled_500k.csv",
show_col_types = FALSE)
# check dimensions
cat("Number of rows:", nrow(df_raw), "\n")## Number of rows: 500000
## Number of columns: 46
## Total cells: 23000000
## Rows: 500,000
## Columns: 46
## $ ID <chr> "A-2047758", "A-4694324", "A-5006183", "A-423735…
## $ Source <chr> "Source2", "Source1", "Source1", "Source1", "Sou…
## $ Severity <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Start_Time <dttm> 2019-06-12 10:10:56, 2022-12-03 23:37:14, 2022-…
## $ End_Time <dttm> 2019-06-12 10:55:58, 2022-12-04 01:56:53, 2022-…
## $ Start_Lat <dbl> 30.64121, 38.99056, 34.66119, 43.68059, 35.39548…
## $ Start_Lng <dbl> -91.15348, -77.39907, -120.49282, -92.99332, -11…
## $ End_Lat <dbl> NA, 38.99004, 34.66119, 43.68057, 35.39548, NA, …
## $ End_Lng <dbl> NA, -77.39828, -120.49244, -92.97222, -118.98600…
## $ `Distance(mi)` <dbl> 0.000, 0.056, 0.022, 1.054, 0.046, 0.000, 0.000,…
## $ Description <chr> "Accident on LA-19 Baker-Zachary Hwy at Lower Za…
## $ Street <chr> "Highway 19", "Forest Ridge Dr", "Floradale Ave"…
## $ City <chr> "Zachary", "Sterling", "Lompoc", "Austin", "Bake…
## $ County <chr> "East Baton Rouge", "Loudoun", "Santa Barbara", …
## $ State <chr> "LA", "VA", "CA", "MN", "CA", "MA", "OR", "FL", …
## $ Zipcode <chr> "70791-4610", "20164-2813", "93436", "55912", "9…
## $ Country <chr> "US", "US", "US", "US", "US", "US", "US", "US", …
## $ Timezone <chr> "US/Central", "US/Eastern", "US/Pacific", "US/Ce…
## $ Airport_Code <chr> "KBTR", "KIAD", "KLPC", "KAUM", "KBFL", "KBVY", …
## $ Weather_Timestamp <dttm> 2019-06-12 09:53:00, 2022-12-03 23:52:00, 2022-…
## $ `Temperature(F)` <dbl> 77.0, 45.0, 68.0, 27.0, 42.0, 42.0, 35.0, 90.0, …
## $ `Wind_Chill(F)` <dbl> 77.0, 43.0, 68.0, 15.0, 42.0, 35.0, 35.0, 90.0, …
## $ `Humidity(%)` <dbl> 62, 48, 73, 86, 34, 58, 89, 55, 39, 78, 97, 84, …
## $ `Pressure(in)` <dbl> 29.92, 29.91, 29.79, 28.49, 29.77, 29.37, 28.71,…
## $ `Visibility(mi)` <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Wind_Direction <chr> "NW", "W", "W", "ENE", "CALM", "W", "CALM", "SW"…
## $ `Wind_Speed(mph)` <dbl> 5.0, 5.0, 13.0, 15.0, 0.0, 13.0, 0.0, 12.0, 7.0,…
## $ `Precipitation(in)` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
## $ Weather_Condition <chr> "Fair", "Fair", "Fair", "Wintry Mix", "Fair", "F…
## $ Amenity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Bump <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Crossing <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Give_Way <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Junction <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ No_Exit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Railway <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Roundabout <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Station <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Stop <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Calming <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Traffic_Signal <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FA…
## $ Turning_Loop <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ Sunrise_Sunset <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Civil_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Nautical_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
## $ Astronomical_Twilight <chr> "Day", "Night", "Day", "Day", "Night", "Day", "D…
Before doing anything, we need to understand which columns have missing data and decide what to do about them.
# count missing values in each column
missing_summary <- df_raw %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "Column",
values_to = "Missing_Count") %>%
mutate(Missing_Pct = round(Missing_Count / nrow(df_raw) * 100, 1)) %>%
filter(Missing_Count > 0) %>%
arrange(desc(Missing_Pct))
missing_summary %>%
kable(caption = "Table 1: Columns with missing values",
col.names = c("Column", "Missing Count", "Missing %")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(missing_summary$Missing_Pct > 20),
background = "#ffe0e0")| Column | Missing Count | Missing % |
|---|---|---|
| End_Lat | 220377 | 44.1 |
| End_Lng | 220377 | 44.1 |
| Precipitation(in) | 142616 | 28.5 |
| Wind_Chill(F) | 129017 | 25.8 |
| Wind_Speed(mph) | 36987 | 7.4 |
| Visibility(mi) | 11291 | 2.3 |
| Humidity(%) | 11130 | 2.2 |
| Wind_Direction | 11197 | 2.2 |
| Weather_Condition | 11101 | 2.2 |
| Temperature(F) | 10466 | 2.1 |
| Pressure(in) | 8928 | 1.8 |
| Weather_Timestamp | 7674 | 1.5 |
| Airport_Code | 1446 | 0.3 |
| Sunrise_Sunset | 1483 | 0.3 |
| Civil_Twilight | 1483 | 0.3 |
| Nautical_Twilight | 1483 | 0.3 |
| Astronomical_Twilight | 1483 | 0.3 |
| Street | 691 | 0.1 |
| Timezone | 507 | 0.1 |
| Description | 1 | 0.0 |
| City | 19 | 0.0 |
| Zipcode | 116 | 0.0 |
We can see that End_Lat and End_Lng are
missing for about 44% of rows — we will just drop these columns since we
are not using them. Precipitation(in) is missing for about
28.5% of rows — we will replace missing values with 0 (assuming no
rain). The rest of the weather columns are missing less than 10% so we
will fill them with the median value.
df <- df_raw %>%
# parse the date column (mixed formats in this dataset)
mutate(
Start_Time = parse_date_time(Start_Time,
orders = c("ymd HMS", "ymd HM"),
quiet = TRUE),
# extract useful time features
Year = year(Start_Time),
Month = month(Start_Time),
Month_name = month(Start_Time, label = TRUE, abbr = TRUE),
Hour = hour(Start_Time),
Day_of_week = wday(Start_Time, label = TRUE)
) %>%
# create our target variable for classification
mutate(
High_Severity = if_else(Severity >= 3, "Yes", "No"),
High_Severity = factor(High_Severity, levels = c("No", "Yes"))
)
cat("Date parsing done. Sample:\n")## Date parsing done. Sample:
df_clean <- df %>%
# drop columns we don't need
select(-any_of(c(
"End_Lat", "End_Lng", "Wind_Chill(F)", "Description",
"Airport_Code", "Country", "ID", "Source",
"Civil_Twilight", "Nautical_Twilight", "Astronomical_Twilight",
"Timezone", "Weather_Timestamp", "Wind_Direction",
"Weather_Condition", "Bump", "Give_Way", "No_Exit",
"Roundabout", "Station", "Traffic_Calming", "Turning_Loop",
"Amenity", "End_Time"
))) %>%
# drop rows where Zipcode is missing (only 116 rows)
filter(!is.na(Zipcode)) %>%
# fill Precipitation with 0 (no rain assumption)
mutate(`Precipitation(in)` = replace_na(`Precipitation(in)`, 0)) %>%
# fill numeric weather columns with median
mutate(
`Temperature(F)` = replace_na(`Temperature(F)`,
median(`Temperature(F)`, na.rm = TRUE)),
`Humidity(%)` = replace_na(`Humidity(%)`,
median(`Humidity(%)`, na.rm = TRUE)),
`Visibility(mi)` = replace_na(`Visibility(mi)`,
median(`Visibility(mi)`, na.rm = TRUE)),
`Wind_Speed(mph)` = replace_na(`Wind_Speed(mph)`,
median(`Wind_Speed(mph)`,na.rm = TRUE))
) %>%
# fill text columns
mutate(
Sunrise_Sunset = replace_na(Sunrise_Sunset, "Day"),
City = replace_na(City, "Unknown"),
Street = replace_na(Street, "Unknown")
) %>%
# rename weather columns to something cleaner
rename(
Temperature_F = `Temperature(F)`,
Humidity_pct = `Humidity(%)`,
Visibility_mi = `Visibility(mi)`,
Wind_Speed_mph = `Wind_Speed(mph)`,
Precipitation_in = `Precipitation(in)`,
Pressure_in = `Pressure(in)`,
Distance_mi = `Distance(mi)`
)
cat("Rows after cleaning:", format(nrow(df_clean), big.mark = ","), "\n")## Rows after cleaning: 499,884
## Remaining missing values: 9370
## === Cleaned Dataset Summary ===
## Rows: 499,884
## Columns: 28
## Severity distribution:
##
## 1 2 3 4
## 4271 398058 84496 13059
##
## High Severity distribution:
##
## No Yes
## 402329 97555
This is the main part of our analysis. Before building any models, we want to really understand what the data is telling us. We will look at the data from many different angles.
The first thing we want to understand is how accidents are distributed across the 4 severity levels.
sev_counts <- df_clean %>%
count(Severity) %>%
mutate(
pct = n / sum(n) * 100,
label = paste0(format(n, big.mark=","), "\n(", round(pct,1), "%)"),
fill_col = c("#4CAF50","#FFC107","#FF9800","#F44336")
)
ggplot(sev_counts, aes(x = factor(Severity), y = n,
fill = factor(Severity))) +
geom_col(width = 0.6, colour = "white", linewidth = 0.5) +
geom_text(aes(label = label), vjust = -0.3,
size = 3.8, fontface = "bold") +
scale_fill_manual(values = c("#4CAF50","#FFC107","#FF9800","#F44336"),
labels = c("1 - Minor","2 - Moderate",
"3 - Serious","4 - Critical")) +
scale_y_continuous(labels = comma,
expand = expansion(mult = c(0, 0.18))) +
labs(
title = "How are Accidents Distributed Across Severity Levels?",
subtitle = "Severity 2 makes up nearly 80% of all accidents — Severity 3 & 4 combined = 19.5%",
x = "Severity Level",
y = "Number of Accidents",
fill = "Severity Level",
caption = "Source: US Accidents Dataset 2016–2023"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40", size = 11),
legend.position = "top"
)What we found: Most accidents (nearly 80%) are Severity 2. Only about 19.5% are high severity (3 or 4). This means our classification problem has a class imbalance — there are many more “No” than “Yes” cases, which is something we need to keep in mind when building models.
yearly <- df_clean %>%
count(Year) %>%
filter(Year < 2023) # 2023 only has data up to March, so exclude
ggplot(yearly, aes(x = Year, y = n)) +
geom_line(colour = "#1565C0", linewidth = 1.5) +
geom_point(colour = "#1565C0", size = 4, fill = "white",
shape = 21, stroke = 2) +
geom_text(aes(label = format(n, big.mark=",")),
vjust = -1.2, size = 3.5, colour = "#333333") +
scale_x_continuous(breaks = 2016:2022) +
scale_y_continuous(labels = comma,
limits = c(0, 130000),
expand = expansion(mult = c(0, 0.1))) +
annotate("rect", xmin = 2019.5, xmax = 2020.5,
ymin = 0, ymax = 130000,
alpha = 0.08, fill = "red") +
annotate("text", x = 2020, y = 120000,
label = "COVID-19\n2020", size = 3,
colour = "red", fontface = "italic") +
labs(
title = "US Road Accidents Have Been Increasing Year by Year",
subtitle = "Accident numbers grew steadily from 2016 to 2022, even through the pandemic",
x = "Year",
y = "Number of Accidents",
caption = "Note: 2023 data is only Jan–Mar, excluded from this chart"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)What we found: Accidents have been increasing every year from 2016 to 2022. Even during COVID-19 in 2020 when there were fewer cars on the road, there were still 76,000+ accidents. This could suggest that when roads are emptier, drivers tend to speed more.
state_df <- df_clean %>%
count(State, sort = TRUE) %>%
slice_head(n = 20) %>%
mutate(
State = fct_reorder(State, n),
is_top3 = n >= sort(n, decreasing = TRUE)[3]
)
ggplot(state_df, aes(x = n, y = State, fill = is_top3)) +
geom_col(colour = "white", linewidth = 0.3) +
geom_text(aes(label = format(n, big.mark=",")),
hjust = -0.15, size = 3.3, fontface = "bold") +
scale_fill_manual(values = c("grey75", "#1565C0")) +
scale_x_continuous(labels = comma,
expand = expansion(mult = c(0, 0.18))) +
labs(
title = "Top 20 US States by Number of Accidents",
subtitle = "California alone accounts for over 22% of all accidents in the dataset",
x = "Number of Accidents",
y = "State",
caption = "Source: US Accidents Dataset 2016–2023"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)What we found: California has by far the most accidents (113,274), followed by Florida and Texas. These are also the three most populous states, so this makes sense. But it also means our model needs to account for state-level differences.
hour_sev <- df_clean %>%
count(Hour, Severity) %>%
group_by(Hour) %>%
mutate(pct = n / sum(n) * 100)
ggplot(hour_sev, aes(x = Hour, y = factor(Severity), fill = pct)) +
geom_tile(colour = "white", linewidth = 0.5) +
geom_text(aes(label = round(pct, 0)),
size = 3, colour = "white", fontface = "bold") +
scale_fill_gradient(low = "#E3F2FD", high = "#0D47A1",
name = "% of that\nhour's accidents") +
scale_x_continuous(breaks = seq(0, 23, 3),
labels = paste0(seq(0,23,3), ":00")) +
labs(
title = "Accident Severity by Hour of Day (Heatmap)",
subtitle = "Each column shows the severity breakdown within that hour\nDarker = higher % of that hour's accidents with that severity",
x = "Hour of Day",
y = "Severity Level",
caption = "Source: US Accidents Dataset 2016–2023"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
panel.grid = element_blank()
)hour_vol <- df_clean %>%
count(Hour) %>%
mutate(
time_period = case_when(
Hour %in% 7:9 ~ "Morning Rush",
Hour %in% 16:18 ~ "Evening Rush",
Hour %in% 0:4 ~ "Late Night",
TRUE ~ "Other"
)
)
ggplot(hour_vol, aes(x = Hour, y = n, fill = time_period)) +
geom_col(colour = "white", linewidth = 0.3) +
scale_fill_manual(
values = c("Morning Rush" = "#E53935",
"Evening Rush" = "#E53935",
"Late Night" = "#5C6BC0",
"Other" = "#B0BEC5")
) +
scale_x_continuous(breaks = 0:23) +
scale_y_continuous(labels = comma) +
labs(
title = "Accident Volume by Hour of Day",
subtitle = "Rush hours (7–9 AM and 4–6 PM) have the most accidents by volume",
x = "Hour (24h clock)",
y = "Number of Accidents",
fill = "Time Period"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
legend.position = "top"
)What we found: There are two clear peaks in accident volume — morning rush hour (7–9 AM) and evening rush hour (4–6 PM). However, the heatmap shows something interesting — late night and early morning hours (1–4 AM) actually have a slightly higher proportion of serious accidents even though the total number is lower. This makes sense because people may be driving while tired or under the influence.
dow_df <- df_clean %>%
mutate(Day_of_week = fct_relevel(Day_of_week,
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) %>%
group_by(Day_of_week) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
)
p1 <- ggplot(dow_df, aes(x = Day_of_week, y = total, fill = Day_of_week)) +
geom_col(colour = "white") +
scale_fill_brewer(palette = "Blues") +
scale_y_continuous(labels = comma) +
labs(title = "Total Accidents by Day",
x = NULL, y = "Count") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
p2 <- ggplot(dow_df, aes(x = Day_of_week, y = pct_high,
fill = Day_of_week)) +
geom_col(colour = "white") +
geom_text(aes(label = paste0(round(pct_high,1),"%")),
vjust = -0.4, size = 3.2) +
scale_fill_brewer(palette = "Reds") +
scale_y_continuous(limits = c(0, 30)) +
labs(title = "% High Severity by Day",
x = NULL, y = "% Severe") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "Weekday vs Weekend: Volume High, But Weekend Severity Also High")What we found: Weekdays (especially Friday) have far more accidents in total — this is because more people commute. But weekends (Saturday and Sunday) have a similar or slightly higher proportion of severe accidents. This is possibly because weekend drivers include more recreational driving at higher speeds.
monthly_df <- df_clean %>%
group_by(Month, Month_name) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
)
p1 <- ggplot(monthly_df,
aes(x = fct_reorder(Month_name, Month), y = total)) +
geom_line(aes(group = 1), colour = "#1565C0", linewidth = 1.3) +
geom_point(colour = "#1565C0", size = 3.5, fill = "white",
shape = 21, stroke = 1.5) +
scale_y_continuous(labels = comma) +
labs(title = "Accident Volume by Month",
x = NULL, y = "Count") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))
p2 <- ggplot(monthly_df,
aes(x = fct_reorder(Month_name, Month),
y = pct_high, fill = pct_high)) +
geom_col(colour = "white") +
geom_text(aes(label = paste0(round(pct_high,1),"%")),
vjust = -0.4, size = 3) +
scale_fill_gradient(low = "#FFF9C4", high = "#E53935") +
scale_y_continuous(limits = c(0, 30)) +
labs(title = "% High Severity by Month",
x = NULL, y = "% Severe") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "Interesting Contrast: Winter Has More Accidents But Lower Severity Rate")What we found: This was one of the most surprising findings. December and November have the most accidents in total (probably due to winter weather and more darkness). But June and July actually have the highest percentage of severe accidents. We think this might be because summer encourages faster driving, road trips, and more motorcycles — all of which lead to worse outcomes when accidents do happen.
df_clean %>%
filter(Temperature_F > -30, Temperature_F < 130) %>%
ggplot(aes(x = factor(Severity), y = Temperature_F,
fill = factor(Severity))) +
geom_boxplot(outlier.size = 0.5, alpha = 0.8,
outlier.colour = "grey60") +
scale_fill_manual(values = c("#4CAF50","#FFC107","#FF9800","#F44336")) +
stat_summary(fun = mean, geom = "point",
shape = 18, size = 3, colour = "black") +
labs(
title = "Temperature Distribution by Severity Level",
subtitle = "Black diamond = mean temperature | Boxes show the middle 50% of data",
x = "Severity Level",
y = "Temperature (°F)",
fill = "Severity"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)df_clean %>%
mutate(
vis_group = cut(Visibility_mi,
breaks = c(0, 1, 3, 5, 10),
labels = c("< 1 mile\n(Very Low)",
"1–3 miles\n(Low)",
"3–5 miles\n(Medium)",
"> 5 miles\n(Good)"),
include.lowest = TRUE)
) %>%
filter(!is.na(vis_group)) %>%
group_by(vis_group) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
.groups = "drop"
) %>%
ggplot(aes(x = vis_group, y = pct_high, fill = pct_high)) +
geom_col(colour = "white", width = 0.65) +
geom_text(aes(label = paste0(round(pct_high,1),"%\n(",
format(total, big.mark=",")," accidents)")),
vjust = -0.3, size = 3.5, fontface = "bold") +
scale_fill_gradient(low = "#E8F5E9", high = "#B71C1C") +
scale_y_continuous(limits = c(0, 30),
expand = expansion(mult = c(0, 0.2))) +
labs(
title = "Does Low Visibility Lead to More Severe Accidents?",
subtitle = "Percentage of accidents that were high severity (3 or 4) at each visibility level",
x = "Visibility Level",
y = "% High Severity Accidents",
caption = "Number in brackets = total accidents at that visibility level"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)What we found: The relationship between visibility and severity is not as simple as we expected. Very low visibility (< 1 mile) actually has a lower severe percentage than the 1–3 mile range. We think this is because when visibility is extremely low (dense fog), drivers slow down significantly — so when accidents do happen, they are at lower speeds. The 1–3 mile range may represent conditions where drivers underestimate the danger and don’t slow down enough.
df_clean %>%
mutate(
Rain = if_else(Precipitation_in > 0, "Raining", "Not Raining")
) %>%
group_by(Rain, High_Severity) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(Rain) %>%
mutate(pct = n / sum(n) * 100) %>%
filter(High_Severity == "Yes") %>%
ggplot(aes(x = Rain, y = pct, fill = Rain)) +
geom_col(width = 0.5, colour = "white") +
geom_text(aes(label = paste0(round(pct,1),"%")),
vjust = -0.5, size = 5, fontface = "bold") +
scale_fill_manual(values = c("Not Raining" = "#90CAF9",
"Raining" = "#1565C0")) +
scale_y_continuous(limits = c(0, 30)) +
labs(
title = "Rainy Conditions Lead to More Severe Accidents",
subtitle = "Accidents during rain are about 18% more likely to be high severity",
x = NULL,
y = "% High Severity Accidents"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)road_features <- c("Junction", "Crossing", "Traffic_Signal",
"Railway", "Stop")
road_df <- map_dfr(road_features, function(feat) {
df_clean %>%
filter(!!sym(feat) == TRUE) %>%
summarise(
Feature = feat,
total = n(),
pct_high = mean(High_Severity == "Yes") * 100
)
})
# also add baseline (no features)
baseline <- df_clean %>%
filter(!Junction, !Crossing, !Traffic_Signal, !Railway, !Stop) %>%
summarise(Feature = "No Features\n(Baseline)",
total = n(),
pct_high = mean(High_Severity == "Yes") * 100)
road_df <- bind_rows(baseline, road_df) %>%
mutate(Feature = fct_reorder(Feature, pct_high),
is_base = Feature == "No Features\n(Baseline)")
ggplot(road_df, aes(x = pct_high, y = Feature, fill = is_base)) +
geom_col(colour = "white", width = 0.65) +
geom_text(aes(label = paste0(round(pct_high,1),"% (",
format(total, big.mark=","),")")),
hjust = -0.05, size = 3.5, fontface = "bold") +
geom_vline(xintercept = baseline$pct_high,
linetype = "dashed", colour = "#E53935",
linewidth = 0.8) +
scale_fill_manual(values = c("TRUE" = "#B0BEC5",
"FALSE" = "#1565C0")) +
scale_x_continuous(limits = c(0, 40),
expand = expansion(mult = c(0, 0.25))) +
labs(
title = "Does Being Near a Road Feature Make Accidents More Severe?",
subtitle = "Red dashed line = baseline (no road features nearby)\nPercentage of accidents that were high severity (Severity 3 or 4)",
x = "% High Severity",
y = "Road Feature",
caption = "Number in brackets = total accidents at locations with that feature"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)What we found: Junctions are associated with the highest severity rate — 27.1% of accidents at junctions are high severity, compared to about 19% at locations with no road features. Railway crossings also show a higher rate. This makes sense — junction accidents often involve side impacts at higher speeds, which tend to be more serious.
dn_df <- df_clean %>%
filter(!is.na(Sunrise_Sunset)) %>%
group_by(Sunrise_Sunset) %>%
summarise(
total = n(),
pct_high = mean(High_Severity == "Yes") * 100,
avg_sev = mean(Severity),
.groups = "drop"
)
p1 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = total,
fill = Sunrise_Sunset)) +
geom_col(width = 0.5, colour = "white") +
geom_text(aes(label = format(total, big.mark=",")),
vjust = -0.4, fontface = "bold", size = 4.5) +
scale_fill_manual(values = c("Day" = "#FDD835",
"Night" = "#283593")) +
scale_y_continuous(labels = comma, limits = c(0, 400000)) +
labs(title = "Volume", x = NULL, y = "Count") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
p2 <- ggplot(dn_df, aes(x = Sunrise_Sunset, y = pct_high,
fill = Sunrise_Sunset)) +
geom_col(width = 0.5, colour = "white") +
geom_text(aes(label = paste0(round(pct_high,1),"%")),
vjust = -0.4, fontface = "bold", size = 4.5) +
scale_fill_manual(values = c("Day" = "#FDD835",
"Night" = "#283593")) +
scale_y_continuous(limits = c(0, 30)) +
labs(title = "% High Severity", x = NULL, y = "%") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "Day vs Night: More Accidents During Day, But Night Is More Dangerous")What we found: More than twice as many accidents happen during the day, simply because more people drive during daylight hours. But nighttime accidents have a slightly higher proportion of severe outcomes. Reduced visibility, driver fatigue, and potentially impaired driving at night likely contribute to this.
# select numeric columns for correlation
num_cols <- df_clean %>%
select(Severity, Temperature_F, Humidity_pct,
Visibility_mi, Wind_Speed_mph, Precipitation_in,
Distance_mi) %>%
drop_na()
corr_matrix <- cor(num_cols)
corrplot(corr_matrix,
method = "color",
type = "lower",
addCoef.col = "black",
number.cex = 0.8,
tl.cex = 0.9,
tl.col = "black",
col = colorRampPalette(c("#2166AC","white","#D6604D"))(200),
title = "Correlation Matrix: Numeric Variables",
mar = c(0,0,2,0))What we found: Honestly, none of the weather variables have a very strong correlation with Severity on their own (all below 0.05 in absolute value). This tells us that severity is not predicted well by any single variable — we need to use combinations of variables in a model. Temperature and Humidity have a small positive correlation with each other, which makes sense.
df_raw %>%
filter(!is.na(Weather_Condition)) %>%
count(Weather_Condition, sort = TRUE) %>%
slice_head(n = 12) %>%
mutate(Weather_Condition = fct_reorder(Weather_Condition, n)) %>%
ggplot(aes(x = n, y = Weather_Condition, fill = n)) +
geom_col(colour = "white") +
geom_text(aes(label = format(n, big.mark=",")),
hjust = -0.1, size = 3.5) +
scale_fill_gradient(low = "#E3F2FD", high = "#0D47A1") +
scale_x_continuous(labels = comma,
expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Most Common Weather Conditions During Accidents",
subtitle = "Majority of accidents happen in Fair or Clear weather — not in storms",
x = "Number of Accidents",
y = "Weather Condition"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40")
)What we found: This is perhaps the most counterintuitive finding — the most common weather condition during accidents is “Fair” (clear, good weather), not rain or snow. This is because most driving happens in good weather, so naturally most accidents do too. This is called exposure bias. When we look at rate of accidents per km driven, bad weather is more dangerous, but in raw counts, good weather dominates.
Before moving to modelling, here is a summary of the key things we found from the EDA:
tribble(
~Finding, ~Implication,
"Severity 2 = 80% of accidents", "Class imbalance — important for classification model",
"Accidents increasing every year 2016–2022", "Road safety is a growing problem",
"California = 22% of all accidents", "State is a strong predictor",
"Rush hours have most accidents by volume", "Time of day matters for regression model",
"Late night has higher severe % despite lower volume", "Hour affects severity, not just count",
"June/July have highest severe accident rate", "Month affects severity, counter-intuitive",
"Rain increases severe rate from 19.3% to 22.8%", "Weather matters for severity",
"Junctions: 27% severe vs 19% baseline", "Road features matter for severity",
"Most accidents happen in Fair weather", "Exposure effect — need to think about rates not counts",
"No single variable strongly predicts severity", "Need multi-variable model"
) %>%
kable(caption = "Table 2: Summary of Key EDA Findings") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = TRUE)| Finding | Implication |
|---|---|
| Severity 2 = 80% of accidents | Class imbalance — important for classification model |
| Accidents increasing every year 2016–2022 | Road safety is a growing problem |
| California = 22% of all accidents | State is a strong predictor |
| Rush hours have most accidents by volume | Time of day matters for regression model |
| Late night has higher severe % despite lower volume | Hour affects severity, not just count |
| June/July have highest severe accident rate | Month affects severity, counter-intuitive |
| Rain increases severe rate from 19.3% to 22.8% | Weather matters for severity |
| Junctions: 27% severe vs 19% baseline | Road features matter for severity |
| Most accidents happen in Fair weather | Exposure effect — need to think about rates not counts |
| No single variable strongly predicts severity | Need multi-variable model |
If we can predict how many accidents a ZIP code area tends to have in a given month, authorities can plan ahead — more patrols in high-risk areas, warning signs, public awareness campaigns before accident-prone seasons.
We need to aggregate the data to ZIP code + Month level, since our target is “how many accidents happen here per month.”
df_agg <- df_clean %>%
filter(!is.na(Zipcode)) %>%
group_by(Zipcode, Month, State) %>%
summarise(
accident_count = n(),
avg_temp = mean(Temperature_F, na.rm = TRUE),
avg_humidity = mean(Humidity_pct, na.rm = TRUE),
avg_visibility = mean(Visibility_mi, na.rm = TRUE),
avg_wind = mean(Wind_Speed_mph, na.rm = TRUE),
avg_precip = mean(Precipitation_in, na.rm = TRUE),
junction_present = as.integer(any(Junction)),
signal_present = as.integer(any(Traffic_Signal)),
crossing_present = as.integer(any(Crossing)),
.groups = "drop"
)
cat("Aggregated rows (ZIP x Month combinations):",
format(nrow(df_agg), big.mark=","), "\n")## Aggregated rows (ZIP x Month combinations): 223,308
## accident_count range: 1 to 80
## accident_count mean: 2.24
ggplot(df_agg %>% filter(accident_count <= 20),
aes(x = accident_count)) +
geom_histogram(binwidth = 1, fill = "#1565C0",
colour = "white", alpha = 0.85) +
scale_y_continuous(labels = comma) +
labs(
title = "Distribution of Monthly Accident Counts per ZIP Code",
subtitle = "Most ZIP codes have just 1–3 accidents per month (right-skewed distribution)",
x = "Accidents per ZIP code per month",
y = "Number of ZIP-month observations"
) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"))set.seed(42) # so results are reproducible
# 80% for training, 20% for testing
train_size <- floor(0.8 * nrow(df_agg))
train_idx <- sample(seq_len(nrow(df_agg)), size = train_size)
reg_train <- df_agg[ train_idx, ]
reg_test <- df_agg[-train_idx, ]
cat("Training set:", format(nrow(reg_train), big.mark=","), "rows\n")## Training set: 178,646 rows
## Test set: 44,662 rows
We start with the simplest possible model — just using temperature and visibility to predict accident count. This is our baseline.
# simple model with just 2 predictors
lm_simple <- lm(accident_count ~ avg_temp + avg_visibility,
data = reg_train)
summary(lm_simple)##
## Call:
## lm(formula = accident_count ~ avg_temp + avg_visibility, data = reg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.381 -1.262 -1.229 -0.269 77.733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1943789 0.0392286 55.938 < 2e-16 ***
## avg_temp 0.0018403 0.0004709 3.908 9.31e-05 ***
## avg_visibility -0.0070689 0.0035371 -1.999 0.0457 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.598 on 178643 degrees of freedom
## Multiple R-squared: 9.314e-05, Adjusted R-squared: 8.194e-05
## F-statistic: 8.32 on 2 and 178643 DF, p-value: 0.0002437
pred_simple <- predict(lm_simple, newdata = reg_test)
pred_simple <- pmax(pred_simple, 0) # accidents can't be negative
rmse_simple <- sqrt(mean((reg_test$accident_count - pred_simple)^2))
mae_simple <- mean(abs(reg_test$accident_count - pred_simple))
cat("Simple LM Results:\n")## Simple LM Results:
## RMSE: 3.537
## MAE: 1.844
Now we add all the predictor variables we have — weather, road features, month, and state.
# encode State as a factor
reg_train_enc <- reg_train %>%
mutate(State = factor(State),
Month = factor(Month))
reg_test_enc <- reg_test %>%
mutate(State = factor(State, levels = levels(reg_train_enc$State)),
Month = factor(Month))
# full model with all predictors
lm_full <- lm(
accident_count ~ avg_temp + avg_humidity + avg_visibility +
avg_wind + avg_precip +
junction_present + signal_present + crossing_present +
Month + State,
data = reg_train_enc
)
cat("Number of coefficients:", length(coef(lm_full)), "\n")## Number of coefficients: 68
##
## Top 10 most significant predictors:
# show significant coefficients
coef_df <- summary(lm_full)$coefficients %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
filter(!grepl("State", Variable)) %>% # exclude state dummies for brevity
arrange(`Pr(>|t|)`) %>%
head(10)
coef_df %>%
kable(digits = 4,
caption = "Table 3: Top significant predictors in multiple regression") %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE)| Variable | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| junction_present | 6.2548 | 0.0240 | 260.4622 | 0 |
| signal_present | 1.3855 | 0.0189 | 73.3453 | 0 |
| crossing_present | 0.6721 | 0.0208 | 32.2426 | 0 |
| (Intercept) | 1.2492 | 0.0909 | 13.7400 | 0 |
| Month7 | -0.3052 | 0.0425 | -7.1863 | 0 |
| Month5 | -0.2647 | 0.0382 | -6.9292 | 0 |
| Month3 | -0.2338 | 0.0349 | -6.6985 | 0 |
| Month4 | -0.2287 | 0.0356 | -6.4320 | 0 |
| Month6 | -0.2551 | 0.0407 | -6.2734 | 0 |
| Month8 | -0.2129 | 0.0413 | -5.1553 | 0 |
pred_full <- predict(lm_full, newdata = reg_test_enc)
pred_full <- pmax(pred_full, 0)
rmse_full <- sqrt(mean((reg_test_enc$accident_count - pred_full)^2, na.rm=TRUE))
mae_full <- mean(abs(reg_test_enc$accident_count - pred_full), na.rm=TRUE)
r2_full <- cor(reg_test_enc$accident_count, pred_full, use="complete.obs")^2
cat("Full Multiple Regression Results:\n")## Full Multiple Regression Results:
## RMSE: 2.863
## MAE: 1.402
## R²: 0.345
reg_results <- tibble(
Model = c("Simple LM (2 predictors)",
"Multiple LM (all predictors)"),
RMSE = c(rmse_simple, rmse_full),
MAE = c(mae_simple, mae_full),
R_squared = c(NA, r2_full)
)
reg_results %>%
mutate(across(where(is.numeric), ~round(., 3))) %>%
kable(caption = "Table 4: Regression model comparison",
col.names = c("Model","RMSE (lower=better)",
"MAE (lower=better)","R² (higher=better)")) %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = FALSE) %>%
row_spec(2, bold = TRUE, background = "#d4edda")| Model | RMSE (lower=better) | MAE (lower=better) | R² (higher=better) |
|---|---|---|---|
| Simple LM (2 predictors) | 3.537 | 1.844 | NA |
| Multiple LM (all predictors) | 2.863 | 1.402 | 0.345 |
# plot actual vs predicted for the better model
plot_df <- data.frame(
Actual = reg_test_enc$accident_count,
Predicted = pred_full
) %>%
filter(complete.cases(.) & Actual <= 30 & Predicted >= 0)
ggplot(plot_df %>% sample_n(min(5000, nrow(plot_df))),
aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.2, colour = "#1565C0", size = 1) +
geom_abline(slope = 1, intercept = 0,
colour = "#E53935", linewidth = 1,
linetype = "dashed") +
labs(
title = "Actual vs Predicted Accident Counts",
subtitle = paste0("Multiple Linear Regression | RMSE = ",
round(rmse_full,3), " | R² = ", round(r2_full,3),
"\nRed dashed line = perfect predictions"),
x = "Actual Monthly Accident Count",
y = "Predicted Monthly Accident Count"
) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"))Findings: Adding more predictors (State, Month, road features, all weather variables) improved the model significantly compared to the simple 2-variable version. The R² value of the multiple regression model shows it can explain some variance, though predicting exact accident counts is hard because many local factors are not in our data.
Knowing which conditions produce severe accidents can help emergency services decide where to pre-position resources, and help authorities identify which road configurations need urgent safety improvements.
clf_data <- df_clean %>%
select(High_Severity,
Temperature_F, Humidity_pct, Visibility_mi,
Wind_Speed_mph, Precipitation_in,
Junction, Crossing, Traffic_Signal, Railway, Stop,
Sunrise_Sunset, Month, Hour, State) %>%
mutate(
Junction = as.integer(Junction),
Crossing = as.integer(Crossing),
Traffic_Signal = as.integer(Traffic_Signal),
Railway = as.integer(Railway),
Stop = as.integer(Stop),
Is_Night = as.integer(Sunrise_Sunset == "Night"),
State = factor(State),
Month = factor(Month)
) %>%
select(-Sunrise_Sunset) %>%
drop_na()
cat("Classification dataset:", format(nrow(clf_data), big.mark=","),
"rows\n")## Classification dataset: 499,791 rows
## Class balance:
##
## No Yes
## 402248 97543
## Positive rate: 19.5 %
set.seed(42)
# stratified split to keep class balance
train_idx_c <- createDataPartition(clf_data$High_Severity,
p = 0.8, list = FALSE)
clf_train <- clf_data[ train_idx_c, ]
clf_test <- clf_data[-train_idx_c, ]
cat("Training:", format(nrow(clf_train), big.mark=","),
"| Test:", format(nrow(clf_test), big.mark=","), "\n")## Training: 399,834 | Test: 99,957
Logistic regression is a standard classification method. It estimates the probability of each outcome.
# fit logistic regression
logit_model <- glm(
High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
Wind_Speed_mph + Precipitation_in +
Junction + Crossing + Traffic_Signal + Railway + Stop +
Is_Night + Month + State,
data = clf_train,
family = binomial(link = "logit")
)
# predict on test set
logit_prob <- predict(logit_model, newdata = clf_test, type = "response")
logit_pred <- factor(if_else(logit_prob >= 0.5, "Yes", "No"),
levels = c("No","Yes"))
# confusion matrix
cm_logit <- confusionMatrix(logit_pred, clf_test$High_Severity,
positive = "Yes")
cat("=== Logistic Regression Results ===\n")## === Logistic Regression Results ===
## Accuracy: 80.5 %
## Sensitivity (Recall): 3.2 %
## Specificity: 99.3 %
## Precision: 52.4 %
A decision tree is easy to visualise and explain. We use it as a comparison.
library(rpart)
library(rpart.plot)
tree_model <- rpart(
High_Severity ~ Temperature_F + Humidity_pct + Visibility_mi +
Wind_Speed_mph + Precipitation_in +
Junction + Crossing + Traffic_Signal + Railway + Stop + Is_Night,
data = clf_train,
method = "class",
control = rpart.control(maxdepth = 5, minsplit = 500)
)
# plot the tree
rpart.plot(tree_model,
type = 4,
extra = 104,
box.palette = c("#4CAF50","#F44336"),
branch.lty = 3,
shadow.col = "grey80",
main = "Decision Tree for Predicting High Severity Accidents")tree_pred <- predict(tree_model, newdata = clf_test, type = "class")
cm_tree <- confusionMatrix(tree_pred, clf_test$High_Severity,
positive = "Yes")
cat("=== Decision Tree Results ===\n")## === Decision Tree Results ===
## Accuracy: 80.5 %
## Sensitivity: 0 %
## Specificity: 100 %
## Precision: NA %
tibble(
Model = c("Simple Linear Regression\n(2 predictors)",
"Multiple Linear Regression\n(all predictors)"),
Predictors = c("Temperature, Visibility",
"All weather + road features + Month + State"),
RMSE = c(round(rmse_simple, 3), round(rmse_full, 3)),
MAE = c(round(mae_simple, 3), round(mae_full, 3)),
R_sq = c("—", round(r2_full, 3))
) %>%
kable(caption = "Table 5: Regression Model Comparison",
col.names = c("Model","Predictors Used",
"RMSE ↓","MAE ↓","R² ↑")) %>%
kable_styling(bootstrap_options = c("striped","hover")) %>%
row_spec(2, bold = TRUE, background = "#d4edda")| Model | Predictors Used | RMSE ↓ | MAE ↓ | R² ↑ |
|---|---|---|---|---|
| Simple Linear Regression (2 predictors) | |Temperature, Visibility |
3.53
(all predictors)
|
|All weather + road features + Month + State |
2.86
|
tibble(
Model = c("Logistic Regression", "Decision Tree"),
Accuracy = c(round(cm_logit$overall["Accuracy"]*100,1),
round(cm_tree$overall["Accuracy"]*100,1)),
Precision = c(round(cm_logit$byClass["Precision"]*100,1),
round(cm_tree$byClass["Precision"]*100,1)),
Recall = c(round(cm_logit$byClass["Sensitivity"]*100,1),
round(cm_tree$byClass["Sensitivity"]*100,1)),
F1 = c(round(cm_logit$byClass["F1"]*100,1),
round(cm_tree$byClass["F1"]*100,1))
) %>%
kable(caption = "Table 6: Classification Model Comparison",
col.names = c("Model","Accuracy %","Precision %",
"Recall %","F1 Score %")) %>%
kable_styling(bootstrap_options = c("striped","hover")) %>%
row_spec(which.max(c(cm_logit$overall["Accuracy"],
cm_tree$overall["Accuracy"])),
bold = TRUE, background = "#d4edda")| Model | Accuracy % | Precision % | Recall % | F1 Score % |
|---|---|---|---|---|
| Logistic Regression | 80.5 | 52.4 | 3.2 | 6 |
| Decision Tree | 80.5 | NA | 0.0 | NA |
# plot confusion matrix for logistic regression
cm_table <- as.data.frame(cm_logit$table) %>%
mutate(
pct = Freq / sum(Freq) * 100,
label = paste0(format(Freq, big.mark=","),
"\n(", round(pct,1), "%)")
)
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(colour = "white", linewidth = 2) +
geom_text(aes(label = label), size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#E3F2FD", high = "#1565C0") +
scale_x_discrete(position = "top") +
labs(
title = "Confusion Matrix — Logistic Regression",
subtitle = "Actual class (top) vs Predicted class (left)",
x = "Actual",
y = "Predicted"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey40"),
axis.text = element_text(size = 12, face = "bold")
)How to read this: The top-left and bottom-right boxes are correct predictions. The top-right box shows accidents we predicted as “No” but were actually “Yes” (false negatives — this is the dangerous type of error). The bottom-left shows cases predicted as “Yes” but were actually “No” (false positives — less dangerous but wastes resources).
This project set out to answer two questions about US road accidents, and we found some genuinely interesting results.
For the regression question: We found that adding more variables (State, Month, all weather variables, road features) to the model significantly improved its ability to predict monthly accident counts per ZIP code area. State turned out to be one of the strongest predictors — California, Florida, and Texas dominate the data.
For the classification question: Both logistic regression and a decision tree were able to classify high-severity accidents with reasonable accuracy. The key predictors of severity were whether an accident happened at a junction, the month of year (with summer months showing higher severity), and whether it was at night.
Based on our EDA and model findings, we suggest:
We should also be honest about what our models cannot do:
Our findings connect directly to SDG 3 Target 3.6 — reducing road traffic deaths and injuries. By identifying which times, locations, and conditions produce the most and worst accidents, policymakers can make data-driven decisions about where to invest in road safety infrastructure and public awareness.
## R version 4.5.3 (2026-03-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rpart.plot_3.1.4 rpart_4.1.27 caret_7.0-1 lattice_0.22-9
## [5] gridExtra_2.3 RColorBrewer_1.1-3 corrplot_0.95 kableExtra_1.4.0
## [9] knitr_1.51 scales_1.4.0 lubridate_1.9.5 forcats_1.0.1
## [13] stringr_1.6.0 dplyr_1.2.1 purrr_1.2.2 readr_2.2.0
## [17] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.3 timeDate_4052.112
## [4] farver_2.1.2 S7_0.2.2 fastmap_1.2.0
## [7] pROC_1.19.0.1 digest_0.6.39 timechange_0.4.0
## [10] lifecycle_1.0.5 survival_3.8-6 magrittr_2.0.5
## [13] compiler_4.5.3 rlang_1.2.0 sass_0.4.10
## [16] tools_4.5.3 yaml_2.3.12 data.table_1.18.2.1
## [19] labeling_0.4.3 bit_4.6.0 plyr_1.8.9
## [22] xml2_1.5.2 withr_3.0.2 stats4_4.5.3
## [25] nnet_7.3-20 grid_4.5.3 e1071_1.7-17
## [28] future_1.70.0 globals_0.19.1 iterators_1.0.14
## [31] MASS_7.3-65 cli_3.6.6 crayon_1.5.3
## [34] rmarkdown_2.31 generics_0.1.4 otel_0.2.0
## [37] rstudioapi_0.18.0 future.apply_1.20.2 reshape2_1.4.5
## [40] tzdb_0.5.0 proxy_0.4-29 cachem_1.1.0
## [43] splines_4.5.3 parallel_4.5.3 vctrs_0.7.3
## [46] hardhat_1.4.3 Matrix_1.7-4 jsonlite_2.0.0
## [49] hms_1.1.4 bit64_4.8.0 listenv_0.10.1
## [52] systemfonts_1.3.2 foreach_1.5.2 gower_1.0.2
## [55] jquerylib_0.1.4 recipes_1.3.3 glue_1.8.1
## [58] parallelly_1.47.0 codetools_0.2-20 stringi_1.8.7
## [61] gtable_0.3.6 pillar_1.11.1 htmltools_0.5.9
## [64] ipred_0.9-15 lava_1.9.1 R6_2.6.1
## [67] textshaping_1.0.5 vroom_1.7.1 evaluate_1.0.5
## [70] bslib_0.10.0 class_7.3-23 Rcpp_1.1.1
## [73] svglite_2.2.2 nlme_3.1-168 prodlim_2026.03.11
## [76] xfun_0.57 ModelMetrics_1.2.2.2 pkgconfig_2.0.3
WQD7004 Group 17 | Universiti Malaya | Programming for Data Science