DATA 110 Final Project

Author

Michael Desir

2022 United States Fatal Accidents

A Comprehensive Analysis

Credit Amin Noveyrian/Unsplash

Please note

The data used in this project was downloaded from the US D.O.T. National Highway Traffic Safety Administration. The Fatality Analysis Reporting System has been in place since 1975, with the goal of identifying and reducing highway safety problems. In essence, FARS is a census of fatal motor vehicle crashes in all 50 states, D.C., and Puerto Rico. It contains data on every person, vehicle, and accident that led to a fatality within 30 days. This project specifically will use vehicle-level data and accident-level data for a comprehensive analysis. Key variables include PERSONS, FATALS (fatalities), ST_CASE (case ID), MAKENAME (car make), and ped_involved (pedestrian-involved accident).

According to an early observation (right after cleaning), 2022 saw 42,514 traffic-resultant fatalities. My analysis will address the following:

  1. What is the relationship between the number of people involved in a crash and the number of fatalities? - Linear Regression Model
  2. How did total accidents, work zone accidents, and pedestrian accidents rise and fall across the year? - Highchart
  3. Which car makers show up the most in traffic accidents? - Treemap

Fatality Analysis Reporting System Analytical User’s Manual, 1975-2022, United States Department of Transportation, National Highway Traffic Safety Administration,
Apr. 2022. https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813556. Accessed 16 Dec. 2024.

Packages/Data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(treemap)
library(RColorBrewer)
setwd("C:/Users/desir_7411ic3/Desktop/Montgomery College/DATA110/qmd/FinalProject_Data/FARS2022NationalCSV")
vehicle_data <- read_csv("vehicle.csv")
Rows: 60501 Columns: 201
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (107): STATENAME, NUMOCCSNAME, MONTHNAME, HOURNAME, MINUTENAME, HARM_EVN...
dbl  (94): STATE, ST_CASE, VEH_NO, VE_FORMS, NUMOCCS, DAY, DAYNAME, MONTH, H...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
accident_data <- read_csv("accident.csv")
Rows: 39221 Columns: 80
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (34): STATENAME, COUNTYNAME, CITYNAME, MONTHNAME, DAY_WEEKNAME, HOURNAME...
dbl (46): STATE, ST_CASE, PEDS, PERNOTMVIT, VE_TOTAL, VE_FORMS, PVH_INVL, PE...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning

basic <- vehicle_data %>%
  select(STATENAME,ST_CASE,VEH_NO,VE_FORMS,NUMOCCSNAME,DAY,MONTHNAME,HOUR,HOURNAME,MINUTENAME,HARM_EVNAME,MAN_COLLNAME,UNITTYPENAME,OWNER,MAK_MODNAME,BODY_TYPNAME,MOD_YEARNAME)
accident_count <- basic %>% ## Holds vehicles for each accident
  count(ST_CASE) %>%
  rename_at('n',~'NUM_VEHICLES') %>%
  rename_at('ST_CASE',~'CASE_NUM')

sorted_accidents <- vehicle_data %>% ## Holds accidents for each VIN
  count(VINNAME)

special_data <- accident_data %>% ## Holds special case data
  select(STATENAME,ST_CASE,PEDS,VE_TOTAL,FATALS,PERSONS,PERMVIT,MONTH,TWAY_ID,ROUTENAME,RUR_URBNAME,LATITUDE,LONGITUD,HARM_EVNAME,MAN_COLLNAME,WRK_ZONENAME)

roads <- accident_data %>% ## Holds accidents for each road
  count(TWAY_ID)
timing <- accident_data
## Create datetime stamps for each accident
timing$chrono <- paste(as.character(timing$YEAR), "-" , as.character(timing$MONTH) , "-" , as.character(timing$DAY) , " " , as.character(timing$HOUR) , ":" , as.character(timing$MINUTE) , ":00",sep="")
timing$chrono <- as_datetime(timing$chrono)
timing <- subset(timing, !is.na(chrono))

How many traffic fatalities were there in 2022?

print(sum(accident_data$FATALS))
[1] 42514

Linear Regression Model

model <- lm(data=timing, PERSONS ~ FATALS)
summary(model)

Call:
lm(formula = PERSONS ~ FATALS, data = timing)

Residuals:
    Min      1Q  Median      3Q     Max 
 -5.499  -1.081  -0.081   0.919 121.501 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.60794    0.02830   21.48   <2e-16 ***
FATALS       1.47277    0.02486   59.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.707 on 38916 degrees of freedom
Multiple R-squared:  0.08275,   Adjusted R-squared:  0.08273 
F-statistic:  3511 on 1 and 38916 DF,  p-value: < 2.2e-16

Equation: FATALS = 1.47277(PERSONS) + 0.60794

The insanely low p-value (p < 2.2 * 10^(-16)) and adjusted R-squared (0.08273) prove that there is no relationship between the number of people involved in a crash and the number of fatalities. This is interesting because one would think that the more people involved, the worse the crash was and thus the greater consequences. This model proves that hypothesis incorrect.

Plot 1: Using Highchart to Analyze Accidents Throughout the Year

Plot 1 Data

Subset the vehicle dataset to find hit-and-run incidents

plot1_tojoin <- vehicle_data %>%
  select(ST_CASE,HIT_RUNNAME)

Create Y/N columns for special circumstances

plot1_data <- special_data

## pedestrian accidents
plot1_data$ped_involved <- with(plot1_data, ifelse(PEDS>0, 'Yes', 'No'))

## fatal accidents
plot1_data$fatal <- with(plot1_data, ifelse(FATALS>0,'Yes','No'))

## hit-and-run
##plot1_data <- plot1_data %>%
  ##right_join(plot1_tojoin,by="ST_CASE") %>%
  ##group_by(MONTH)

plot1_data$work <- with(plot1_data,ifelse(WRK_ZONENAME=='None','No','Yes'))

## summarize and logarithmize subsets for plot 1
plot1_sub1 <- plot1_data %>% ## 
  group_by(MONTH) %>%
  count()
plot1_sub1$n <- log10(plot1_sub1$n)

plot1_sub2 <- subset(plot1_data, ped_involved != 'No') %>%
  group_by(MONTH) %>%
  count()
plot1_sub2$n <- log10(plot1_sub2$n)

plot1_sub3 <- subset(plot1_data, work != 'No') %>%
  group_by(MONTH) %>%
  count()
plot1_sub3$n <- log10(plot1_sub3$n)

Plot 1 Design

paints <- c("skyblue","violet","limegreen")
highchart() |>
  hc_title(text = "Special Case Traffic Incidents, 2022",
           style = list(color = "#EEE9E9"),align="left") |>
  hc_caption(text="From the National Highway Traffic Safety Administration",
              style = list(color = "#EEE9E9"),align="left") |>
  hc_yAxis_multiples(
    list(title = list(text = "Count (logarithmic scale)",style=list(color = "violet")))
  ) |>
  hc_add_series(data = plot1_sub2$n,
                name = "Pedestrian-Involved",
                type = "bar",
                yAxis = 0) |>
  hc_add_series(data = plot1_sub3$n,
                name = "Work Zone Accidents",
                type = "bar",
                yAxis = 0) |>
  hc_add_series(data = plot1_sub1$n,
                name = "Total",
                type = "bar",
                yAxis = 0) |>
  hc_xAxis(categories = plot1_sub1$MONTH,
           title=list(text = "Month",
                      style=list(color = "lightblue")),
           labels=list(style=list(color="lightblue"))) |>
  hc_colors(paints) |>
  hc_chart(style = list(fontFamily = "Courier",
                        fontWeight = "bold",
                        fontSize = "15px")) |>
  hc_plotOptions(
    bar = list(pointWidth = 6), ## changes bar width
    column = list(groupPadding = 0.5) ## changes space between bar groups
  ) |>
  hc_legend(
    align = "right",
    verticalAlign = "top",
    layout = "vertical",
    backgroundColor = "#EEE9E9"
  ) |>
  hc_add_theme(hc_theme(chart = list(backgroundColor = '#393939')))

This plot shows that while total accidents and pedestrian-involved accidents remained fairly constant, work-zone accidents rose and fell multiple times over the course of the year. I couldn’t find a reason for this, but one probably exists. I had to use a logarithmic scale for the values because the total accidents figure was so much larger than the others that it shrunk them into the corner. The downside is that with this plot, it seems like the grand majority of accidents involved pedestrians, which is not nearly as true. Hence, this plot might not be as intuitive as I wish.

Plot 2: What cars were these people driving?

Plot 2 Data

vehicle_data2 <- vehicle_data %>%
  filter(MAKENAME != "Unknown Make") ## not useful
cars <- vehicle_data2 %>%
  group_by(MAKENAME) %>%
  count() %>%
  filter(n > 1300) ## trial and error, found to be optimal for treemap
plot2_data <- cars

Plot 2 Design

plot2 <- treemap(plot2_data,
                 title = "Top 12 Car Makes in 2022 Fatal Traffic Accidents",
                 title.legend = "Total ## In Accidents",
                 index = "MAKENAME", vSize = "n", vColor = "n", 
                 type = "manual", palette = "RdBu",
                 fontsize.labels = 16, ## change label size
                 fontfamily.title = "serif", fontsize.title = 18, ## title attributes
                 border.col = "#393939", border.lwds = 3, ## border color and width
                 reverse.legend = T) ## legend now parallel in direction to treemap

Not surpisingly, America’s most popular car brands led the pack. In fact, their popularity almost negates the usefulness of this graph. On the other hand, the special brands ranking high on the list are notable. Harley-Davidson coming in at #7 on the fatalities list lends serious weight to the position that motorcycles pose a serious safety risk. Freightliner, America’s leading heavy truck brand, is a notable contributor to highway fatalities. Its presence this high on the list cannot be explained by the fact that heavy trucks only account for 1% of total U.S. registered vehicles. Obviously, they pose a risk that needs immediate mitigation, or people will just keep dying.

Abbott, Cliff. “US Sales of Class 8 Trucks End 2023 on a Downward Trend That’s Expected to Continue.” TheTrucker.Com, The Trucker Media Group, 22 Jan. 2024, www.thetrucker.com/trucking-news/equipment-tech/us-sales-of-class-8-trucks-end-2023-on-a-downward-trend-thats-expected-to-continue.