library(DBI)
library(RSQLite)
library(dplyr)
library(tidyr)
library(knitr)
library(lubridate)
library(ggplot2)
library(pscl)
library(hms)
library(lmtest)
library(modelsummary)
library(broom) 
library(kableExtra)

1 Abstract

Understanding the dynamics of road traffic accidents requires models that can address both the frequency and severity of such events. Traditional count models often fail to capture the excess zeros present in accident datasets, where many crashes result in no injuries. To address this limitation, we apply a Zero-Inflated Negative Binomial (ZINB) regression to model the number of victims in road accidents. This approach distinguishes between two processes: the likelihood of a crash being non-injurious and the count of victims in injurious crashes. Using real-world traffic accident data from California, we test multiple hypotheses concerning driver characteristics (age, gender, alcohol use, race), vehicle attributes (age, insurance status), and environmental conditions (weather). Our findings reveal nuanced relationships: for example, older vehicles and poor weather increase the likelihood of injury crashes, while younger drivers are more frequently involved in non-injurious incidents. Interestingly, male drivers are more likely to be involved in crashes without injuries, while crashes involving female drivers tend to result in a higher number of victims. These insights highlight the value of dual-structure models in traffic safety research and support more targeted and evidence-based policymaking in road safety and driver education.

2 Introduction

Despite major advances in automotive safety and infrastructure development, road accidents remain a leading cause of injury and death globally (WHO, 2023). With millions of collisions occurring every year, it is not only of scientific interest to understand the factors that contribute to accident frequency and severity, but also a public health and economic priority. Identifying the characteristics of drivers, vehicles and environments that lead to more severe outcomes could lead to better road safety policies, insurance models and educational campaigns.

Traditional studies of road accidents have often focused on binary outcomes (whether an accident occurred or not) or have relied on simple count models to assess the severity of injuries (Berhanu et al., 2023). However, traffic accident data typically contain a high proportion of zeros, because many crashes result in no injuries at all. Ignoring these excess zeros can lead to biased estimates and misleading conclusions. Moreover, some variables influence the likelihood of injury, while others affect the number of victims, conditional on an injury occurring. In such cases, standard modelling techniques struggle to capture this complexity.

To address this issue, we apply a Zero-Inflated Negative Binomial (ZINB) regression model. This allows us to analyse the probability of injury and the severity of outcomes within a single framework. The zero-inflated structure handles the large number of non-injury crashes, while the negative binomial (count model) component adjusts for overdispersion in the number of victims (Yau et al., 2003). This dual perspective enables us to form a deeper understanding of the impact of various factors, such as driver age and gender, vehicle age, alcohol use or insurance status, on road accident outcomes.
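
For reference, the two-part structure described above can be written as a probability mass function. The notation is an assumption introduced here for illustration and is not taken from the cited papers: Y_i is the number of victims in crash i, pi_i the probability of a structural zero (a non-injury crash), mu_i the mean of the negative binomial count component, theta its dispersion parameter, and z_i and x_i the covariate vectors of the two components.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\left(\frac{\theta}{\theta + \mu_i}\right)^{\theta}, \\
P(Y_i = y) &= (1 - \pi_i)\,\frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!}
  \left(\frac{\theta}{\theta + \mu_i}\right)^{\theta}
  \left(\frac{\mu_i}{\theta + \mu_i}\right)^{y}, \qquad y = 1, 2, \dots \\
\operatorname{logit}(\pi_i) &= z_i^{\top}\gamma, \qquad \log(\mu_i) = x_i^{\top}\beta
\end{aligned}
```

Zeros can therefore arise from either component, and as theta grows large the count part approaches a Poisson distribution, which gives the zero-inflated Poisson model as a limiting case.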

Our aim is to contribute to a more comprehensive understanding of road safety risks by modelling both injury probability and injury count in a unified framework. The insights derived from this analysis can support more data-driven safety measures and better-informed policy decisions.

3 Literature Review

The history of the car can be traced back to 1885, when the German engineer Karl Benz created the first “modern” automobile. This machine was self-propelled, gasoline-powered, had three wheels and could be steered and stopped by the driver (Ratiu, 2003). Since then, the car industry has revolutionised transportation all over the globe. With the increasing popularity of this new method of travel, the frequency of car accidents also started to grow.

Driver age is believed to be one of the major factors affecting the probability of a car crash. Lawrence (2002) analysed nearly 64,000 accidents in New South Wales, Australia, between 1996 and 2000 and also considered various distractions such as phone use. The results showed that injury-crash rates were highest for teenagers and young adults (16–24 years). Moreover, the pattern was “U-shaped”: crash rates were lowest for middle-aged groups and rose again for the oldest drivers. When it came to phone use, a significant difference appeared only for the 25–29 group, whose crash risk while phoning was 2.4 times higher than without a phone. These results may be outdated, as phones are now far more common across every age group.

The age of the vehicle also appears to have an impact. According to a study conducted by Australian researchers in 2003, cars built before 1984 (more than 15 years old) had a significantly greater chance of being involved in an injury-producing crash than those manufactured after 1994 (less than 5 years old). Older vehicles carried roughly triple the injury risk of newer ones, and risk rose by 5% for every additional year of a vehicle’s age (Blows et al., 2003).

Another highly important factor affecting accident risk is alcohol. Another Australian paper highlighted the severe consequences of drinking even a small amount. Drivers with a blood alcohol concentration between 0.3 and 0.5‰ had a crash risk three times higher than that of a fully sober driver (Connor et al., 2004). That amount corresponds to only one or two drinks. Consuming more than two drinks within six hours raised the crash risk eightfold. The greatest change appeared once the legal limit was exceeded: above 0.5‰, the risk was 23 times higher than for a sober driver (Connor et al., 2004). The study underlined the extreme effect of consuming alcohol before driving and its dangerous consequences, not only for the driver.

Sex differences have also been documented. In a 2021 Australian cohort study of young drivers, men had higher overall crash rates, whereas women showed greater odds of crash-related hospitalisation (Cullen et al., 2021). Because the cohort was followed for 13 years, the authors were able to show that risk decreased with each additional year of driving experience.

The influence of race, especially in the United States, is another topic often raised when comparing statistics. Braver (2001) computed passenger-vehicle deaths per 10 million trips among persons aged 25–64. Black men had a 48% higher risk of dying in a crash than white men. Similar patterns were observed for women, though exact differences were not presented. Hispanic men also faced elevated risk (+26%), but no significant difference was found between Hispanic and white women. Education modified these effects: men without a high-school diploma had 3.5 times higher risk than graduates, and women 2.8 times, with the highest risk recorded among whites without a diploma, indicating that low education may override racial patterns (Braver, 2001).

Car crashes often impose overwhelming economic costs on drivers. Insurance coverage can mitigate such costs, yet not all motorists purchase it. According to research from 2003, uninsured drivers were far more likely to be involved in a crash that hospitalised or killed someone: their risk was 4.8 times higher than that of insured drivers (Blows et al., 2003). All the preceding factors relate to individuals’ characteristics or decisions. Weather, by contrast, is outside the driver’s control, yet it also contributes to a higher risk of car accidents. Bad conditions, especially heavy rain, ice or poor visibility, significantly raise the number of crashes (Brijs et al., 2008).

3.1 Hypotheses

After reviewing the existing literature, we propose to test eight hypotheses:

  1. Young drivers (under 25) have a higher risk of a crash with victims.
  2. Vehicles older than 15 years have a higher risk of a crash with injuries.
  3. Drivers who consumed alcohol have a higher risk of a crash with victims.
  4. Men have a higher overall crash risk than women.
  5. Women have a higher risk of a crash with victims than men.
  6. Non-white drivers have a higher crash risk.
  7. Uninsured drivers have a higher risk of a crash with victims.
  8. Bad weather conditions are associated with a higher crash risk.

4 Data

We decided to use data from the California Highway Patrol which covers collisions from January 1st, 2001 until mid-December, 2020. The Statewide Integrated Traffic Records System (SWITRS) contains all crashes that were reported to CHP by local and governmental agencies.

The data have a hierarchical structure.

  • The CRASH table contains information on each crash, one line per crash.
  • The PARTY table contains information from all parties involved in the crash, one line per party. Parties are the major players in a traffic crash - drivers, pedestrians, bicyclists, and parked vehicles. The information includes personal descriptors and vehicle descriptors.
  • The VICTIM table contains information about the victims - persons associated with each party. For example, a motorcyclist and his passenger are each a victim. Injury severity is included in the VICTIM table.

4.1 Loading Data from SQLite

In this section, we extract and prepare a sample dataset of traffic collisions from a local SQLite database (switrs.sqlite), focusing on drivers who were determined to be at fault. The steps include:

  • Connecting to the SQLite database.
  • Filtering the parties table to include only drivers at fault with complete demographic and vehicle-related information.
  • Randomly sampling up to 100,000 such drivers, keeping only one driver per collision (case_id).
  • Retrieving the corresponding collisions data for the selected drivers.
  • Merging the driver and collision data into a single dataset.
  • Displaying a preview of the resulting data in a formatted, scrollable table.
  • Exporting the final merged dataset to a CSV file for further analysis.
# Connect to SQLite database
conn <- dbConnect(RSQLite::SQLite(), dbname = "switrs.sqlite")

# Query drivers who were at fault and have complete information
parties_query <- "
  SELECT *
  FROM parties
  WHERE party_type = 'driver'
    AND at_fault = 1
    AND TRIM(party_age) != ''
    AND TRIM(party_sex) != ''
    AND TRIM(party_race) != ''
    AND TRIM(cellphone_in_use) != ''
    AND TRIM(financial_responsibility) != ''
    AND TRIM(party_sobriety) != ''
    AND TRIM(vehicle_make) != ''
    AND TRIM(vehicle_year) != ''
"

parties_filtered <- dbGetQuery(conn, parties_query)

# Sample up to 100,000 drivers
set.seed(123)
parties_sample_raw <- parties_filtered %>%
  sample_n(size = min(100000, nrow(.)))

# Keep one driver per case_id
parties_sample <- parties_sample_raw %>%
  group_by(case_id) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Get matching collisions data
case_ids <- paste0("'", parties_sample$case_id, "'", collapse = ", ")
collisions_query <- paste0("
  SELECT *
  FROM collisions
  WHERE case_id IN (", case_ids, ")
")

collisions_subset <- dbGetQuery(conn, collisions_query)

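# Merge datasets
df <- parties_sample %>%
  left_join(collisions_subset, by = "case_id")

dbDisconnect(conn)

# Export the merged dataset for the cleaning step
write.csv(df, "driver_at_fault_sample.csv", row.names = FALSE)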
Sample of merged dataset
id case_id party_number party_type at_fault party_sex party_age party_sobriety party_drug_physical direction_of_travel party_safety_equipment_1 party_safety_equipment_2 financial_responsibility hazardous_materials cellphone_in_use cellphone_use_type school_bus_related oaf_violation_code oaf_violation_category oaf_violation_section oaf_violation_suffix other_associate_factor_1 other_associate_factor_2 party_number_killed party_number_injured movement_preceding_collision vehicle_year vehicle_make statewide_vehicle_type chp_vehicle_type_towing chp_vehicle_type_towed party_race jurisdiction officer_id reporting_district chp_shift population county_city_location county_location special_condition beat_type chp_beat_type city_division_lapd chp_beat_class beat_number primary_road secondary_road distance direction intersection weather_1 weather_2 state_highway_indicator caltrans_county caltrans_district state_route route_suffix postmile_prefix postmile location_type ramp_intersection side_of_highway tow_away collision_severity killed_victims injured_victims party_count primary_collision_factor pcf_violation_code pcf_violation_category pcf_violation pcf_violation_subsection hit_and_run type_of_collision motor_vehicle_involved_with pedestrian_action road_surface road_condition_1 road_condition_2 lighting control_device chp_road_type pedestrian_collision bicycle_collision motorcycle_collision truck_collision not_private_property alcohol_involved statewide_vehicle_type_at_fault chp_vehicle_type_at_fault severe_injury_count other_visible_injury_count complaint_of_pain_injury_count pedestrian_killed_count pedestrian_injured_count bicyclist_killed_count bicyclist_injured_count motorcyclist_killed_count motorcyclist_injured_count primary_ramp secondary_ramp latitude longitude collision_date collision_time process_date
90 0000048 1 driver 1 male 41 had not been drinking NA east lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 changing lanes 1996 ford pickup or panel truck pickups & panels 00 hispanic 9575 9820 NA 1400 thru 2159 50000 to 100000 1902 los angeles 0 chp state highway interstate NA chp other 065 RT 210 BALDWIN AV 20 east 0 clear NA 1 los angeles 7 210 NA R 30.85 highway NA eastbound 0 property damage only 0 0 2 vehicle code violation NA unsafe lane change 21658 A not hit and run sideswipe other motor vehicle no pedestrian involved dry normal NA daylight none 1 0 0 0 0 1 NA pickup or panel truck pickups & panels 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-01-12 15:15:00 2002-05-28
427 0000227 1 driver 1 male 45 had not been drinking NA south lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 proceeding straight 1992 chevrolet passenger car mini-vans 00 white 9530 16779 NA 0600 thru 1359 >250000 1941 los angeles 0 chp state highway interstate NA chp other 072 RT 710 DEL AMO BL 528 north 0 clear NA 1 los angeles 7 710 NA NA 10.72 highway NA northbound 0 property damage only 0 0 2 vehicle code violation NA NA NA NA not hit and run rear end other motor vehicle no pedestrian involved dry normal NA daylight none 1 0 0 0 0 1 NA passenger car mini-vans 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-02-05 08:20:00 2002-03-15
717 0000378 1 driver 1 male 18 had not been drinking NA south lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 proceeding straight 1998 dodge passenger car passenger car, station 00 hispanic 9530 12519 NA 0600 thru 1359 >250000 1941 los angeles 0 chp state highway interstate NA chp other 430 RT 405 BELLFLOWER BL 136 north 0 clear NA 1 los angeles 7 405 NA NA 2.33 highway NA southbound 0 property damage only 0 0 2 vehicle code violation NA speeding 22350 NA not hit and run rear end other motor vehicle no pedestrian involved dry normal NA daylight none 1 0 0 0 0 1 NA passenger car passenger car, station 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-02-04 07:40:00 2002-06-19
1087 0000578 1 driver 1 male 36 had not been drinking NA south lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 other 1994 ford passenger car passenger car, station 00 white 9565 15286 NA 0600 thru 1359 25000 to 50000 1918 los angeles 0 chp county roadline county road line NA chp other 021 RT 405 JEFFERSON BL 50 north 0 clear NA 1 los angeles 7 405 NA NA 25.97 highway NA southbound 1 property damage only 0 0 1 vehicle code violation NA improper turning 22107 NA not hit and run hit object fixed object no pedestrian involved dry normal NA daylight none 1 0 0 0 0 1 NA passenger car passenger car, station 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-01-11 07:25:00 2002-06-19
1442 0000781 1 driver 1 male 24 had not been drinking NA south lap/shoulder harness used NA no proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 proceeding straight 1994 nissan passenger car passenger car, station 00 hispanic 9690 13096 NA 1400 thru 2159 25000 to 50000 3045 orange 0 chp state highway interstate NA chp other 055 RT 5 EL TORO RD 50 south 0 clear NA 1 orange 12 5 NA NA 18.68 highway NA southbound 0 property damage only 0 0 2 vehicle code violation NA speeding 22350 NA not hit and run rear end other motor vehicle no pedestrian involved dry normal NA dark with street lights none 1 0 0 0 0 1 NA passenger car passenger car, station 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-02-15 18:30:00 2002-05-28
1474 0000797 1 driver 1 male 51 had not been drinking NA south lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 proceeding straight 1991 honda passenger car passenger car, station 00 white 9530 13970 NA 0600 thru 1359 >250000 1941 los angeles 0 chp state highway interstate NA chp other 072 RT 710 PACIFIC AV 200 west 0 clear NA 1 los angeles 7 405 NA NA 7.06 ramp ramp exit, last 50 feet southbound 1 property damage only 0 0 2 vehicle code violation NA speeding 22350 NA not hit and run rear end other motor vehicle no pedestrian involved dry normal NA daylight none 1 0 0 0 0 1 NA passenger car passenger car, station 0 0 0 0 0 0 0 0 0 TR NA NA NA 2002-02-05 09:10:00 2002-06-19
1710 0000923 1 driver 1 male 48 had not been drinking NA south lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 entering traffic 1992 oldsmobile passenger car NA NA black 101 44 NA not chp 50000 to 100000 0101 alameda 0 not chp not chp NA not chp NA BROADWAY SANTA CLARA AV 200 north 0 cloudy NA 0 NA NA NA NA NA NA NA NA NA 0 property damage only 0 0 2 vehicle code violation NA unsafe starting or backing 22106 NA not hit and run sideswipe other motor vehicle no pedestrian involved dry normal NA daylight none 0 0 0 0 0 1 NA passenger car NA 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-02-18 15:50:00 2002-03-16
2752 0001531 2 driver 1 male 43 had not been drinking NA east lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA none apparent NA 0 0 changing lanes 1998 volvo truck or truck tractor with trailer truck tractor semi hispanic 9575 9307 NA 0600 thru 1359 >250000 1942 los angeles 0 chp state highway state route NA chp other 043 RT 134 SAN FERNANDO RD 1000 west 0 clear NA 1 los angeles 7 134 NA R 5.72 highway NA eastbound 0 property damage only 0 0 3 vehicle code violation NA unsafe lane change 21658 A not hit and run sideswipe other motor vehicle no pedestrian involved dry normal NA daylight none 1 0 0 0 1 1 NA truck or truck tractor with trailer truck tractor 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-01-31 08:45:00 2002-05-24
2810 0001565 1 driver 1 female 55 had not been drinking NA south lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA inattention NA 0 0 proceeding straight 1995 dodge other bus NA NA white 4003 5529 02 not chp 2500 to 10000 4003 san luis obispo 0 not chp not chp NA not chp 001 EMBARCADERO FRONT ST 60 north 0 clear NA 0 NA NA NA NA NA NA NA NA NA 0 property damage only 0 0 2 other improper driving NA other improper driving NA NA not hit and run sideswipe parked motor vehicle no pedestrian involved dry normal NA daylight none 0 0 0 0 0 1 NA other bus NA 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-02-08 10:35:00 2002-03-28
2836 0001584 1 driver 1 male 45 had not been drinking NA west lap/shoulder harness used NA proof of insurance obtained NA 0 cellphone not in use NA NA NA NA NA entering/leaving ramp NA 0 0 changing lanes 2001 volvo passenger car passenger car, station 00 asian 9340 013316 NA 1400 thru 2159 100000 to 250000 4314 santa clara 0 chp county roadline county road line NA chp other 530 MONTAGUE EXPWY LAURELWOOD DR 50 west 0 clear NA 0 NA NA NA NA NA NA NA NA NA 0 property damage only 0 0 2 vehicle code violation NA unsafe lane change 21658 A not hit and run sideswipe other motor vehicle no pedestrian involved dry normal NA daylight none 0 0 0 0 0 1 NA passenger car passenger car, station 0 0 0 0 0 0 0 0 0 NA NA NA NA 2002-01-16 15:15:00 2002-03-25
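
The collisions lookup above pastes every sampled case_id into one long IN (...) list. A possible alternative, sketched here on a toy in-memory database, is to write the sampled IDs to a temporary table and let SQLite perform the join. The table name sampled_ids and the toy rows are assumptions made for illustration; the pipeline in this report uses the IN clause shown earlier.

```r
library(DBI)

# Toy in-memory database standing in for switrs.sqlite (rows are made up)
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "collisions", data.frame(
  case_id   = c("0000048", "0000227", "0000378"),
  weather_1 = c("clear", "cloudy", "raining"),
  stringsAsFactors = FALSE
))

# IDs of the sampled at-fault drivers go into a temporary table ...
sampled_ids <- data.frame(case_id = c("0000048", "0000378"),
                          stringsAsFactors = FALSE)
dbWriteTable(conn, "sampled_ids", sampled_ids, temporary = TRUE)

# ... and the database filters the collisions with a join
collisions_subset <- dbGetQuery(conn, "
  SELECT c.*
  FROM collisions AS c
  INNER JOIN sampled_ids AS s ON c.case_id = s.case_id
")

dbDisconnect(conn)
print(collisions_subset)
```

This keeps the SQL statement short regardless of the sample size, at the cost of one extra write to the database.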

4.2 Cleaning nulls

In this section, we clean the merged dataset by handling missing or incomplete values. The process involves:

  • Loading the previously saved dataset of drivers at fault.
  • Defining a helper function to compute the count and percentage of missing or empty values for each column.
  • Generating a summary table of missing data across all variables and displaying the top 50 columns with the most missingness.
  • Dropping columns with more than 40% missing or empty values to reduce noise and sparsity.
  • Saving a version of the dataset with the retained columns.
  • Recalculating and displaying the missing data statistics after column filtering.
  • Removing all rows that still contain any missing or empty fields, ensuring the dataset is complete.
  • Saving the final, fully cleaned dataset to a new CSV file.
  • Printing the number of remaining observations after cleaning.

This prepares the dataset by ensuring a consistent and complete structure.

# Load dataset
df <- read.csv("driver_at_fault_sample.csv", stringsAsFactors = FALSE)

# Helper function: count and percent of missing or empty values
null_stats <- function(x) {
  total <- length(x)
  nulls <- sum(is.na(x) | trimws(x) == "")
  percent <- round(nulls / total * 100, 2)
  c(Count = nulls, Percent = percent)
}

# Compute missing value summary for all columns
null_summary_all <- as.data.frame(t(sapply(df, null_stats)))
null_summary_all$Column <- rownames(null_summary_all)
null_summary_all <- null_summary_all %>%
  select(Column, everything()) %>%
  arrange(desc(Percent))

Missing data before column filtering

Column                           Count  Percent
pcf_violation_code               99996   100.00
oaf_violation_code               99991    99.99
hazardous_materials              99979    99.98
route_suffix                     99882    99.88
school_bus_related               99870    99.87
road_condition_2                 99540    99.54
secondary_ramp                   99128    99.13
oaf_violation_suffix             97611    97.61
primary_ramp                     97478    97.48
other_associate_factor_2         96815    96.82
weather_2                        96791    96.79
party_drug_physical              95701    95.70
city_division_lapd               95675    95.67
ramp_intersection                94034    94.03
alcohol_involved                 89462    89.46
postmile_prefix                  89244    89.24
oaf_violation_category           88599    88.60
oaf_violation_section            88548    88.55
reporting_district               73066    73.07
caltrans_county                  66926    66.93
caltrans_district                66926    66.93
state_route                      66926    66.93
postmile                         66926    66.93
location_type                    66926    66.93
side_of_highway                  66927    66.93
pcf_violation_subsection         64524    64.52
latitude                         58132    58.13
longitude                        58132    58.13
chp_vehicle_type_towed           45782    45.78
direction                        19813    19.81
chp_vehicle_type_towing           8633     8.63
chp_vehicle_type_at_fault         8633     8.63
party_safety_equipment_2          7845     7.85
statewide_vehicle_type            6128     6.13
statewide_vehicle_type_at_fault   6128     6.13
beat_number                       5320     5.32
party_safety_equipment_1          1687     1.69
other_associate_factor_1          1364     1.36
pcf_violation                     1011     1.01
pcf_violation_category             631     0.63
type_of_collision                  467     0.47
tow_away                           456     0.46
intersection                       451     0.45
road_surface                       448     0.45
road_condition_1                   361     0.36
control_device                     298     0.30
motor_vehicle_involved_with        274     0.27
direction_of_travel                261     0.26
officer_id                         263     0.26
lighting                           261     0.26

# Drop columns with more than 40% missing data
columns_to_keep <- null_summary_all %>%
  filter(Percent <= 40) %>%
  pull(Column)

df_cleaned <- df[, columns_to_keep]

# Save cleaned dataset (columns only)
write.csv(df_cleaned, "driver_at_fault_sample_cleaned.csv", row.names = FALSE)

# Recalculate missing stats after dropping columns
null_summary_cleaned <- as.data.frame(t(sapply(df_cleaned, null_stats)))
null_summary_cleaned$Column <- rownames(null_summary_cleaned)
null_summary_cleaned <- null_summary_cleaned %>%
  select(Column, everything()) %>%
  arrange(desc(Percent))

Missing data after dropping columns

Column                           Count  Percent
direction                        19813    19.81
chp_vehicle_type_towing           8633     8.63
chp_vehicle_type_at_fault         8633     8.63
party_safety_equipment_2          7845     7.85
statewide_vehicle_type            6128     6.13
statewide_vehicle_type_at_fault   6128     6.13
beat_number                       5320     5.32
party_safety_equipment_1          1687     1.69
other_associate_factor_1          1364     1.36
pcf_violation                     1011     1.01
pcf_violation_category             631     0.63
type_of_collision                  467     0.47
tow_away                           456     0.46
intersection                       451     0.45
road_surface                       448     0.45
road_condition_1                   361     0.36
control_device                     298     0.30
motor_vehicle_involved_with        274     0.27
direction_of_travel                261     0.26
officer_id                         263     0.26
lighting                           261     0.26
weather_1                          180     0.18
collision_time                     163     0.16
jurisdiction                       135     0.14
movement_preceding_collision       130     0.13
population                          16     0.02
pedestrian_action                   24     0.02
chp_beat_class                       6     0.01
state_highway_indicator             12     0.01
primary_collision_factor             8     0.01
id                                   0     0.00
case_id                              0     0.00
party_number                         0     0.00
party_type                           0     0.00
at_fault                             0     0.00
party_sex                            0     0.00
party_age                            0     0.00
party_sobriety                       0     0.00
financial_responsibility             0     0.00
cellphone_in_use                     0     0.00
cellphone_use_type                   0     0.00
party_number_killed                  0     0.00
party_number_injured                 0     0.00
vehicle_year                         0     0.00
vehicle_make                         0     0.00
party_race                           0     0.00
chp_shift                            0     0.00
county_city_location                 0     0.00
county_location                      0     0.00
special_condition                    0     0.00

# Drop rows with any missing or empty values
# (if_all() replaces the deprecated use of across() inside filter())
df_final <- df_cleaned %>%
  filter(if_all(everything(), ~ !(is.na(.x) | trimws(.x) == "")))

# Save final dataset
write.csv(df_final, "driver_at_fault_sample_cleaned_no_na.csv", row.names = FALSE)

Number of rows after full cleanup: 65552

4.3 Feature Pruning and Engineering for Modeling

In this section, we finalize our feature selection and perform additional feature engineering to prepare the dataset for modeling. Specifically:

  • We load the pre-cleaned dataset containing no missing values.
  • A broad set of columns is removed, including identifiers, redundant location codes, officer/jurisdiction data, detailed injury/fatality breakdowns, and fields that are either irrelevant, redundant, or too granular for our modeling goals.
  • The collision date is converted to a Date object and used to derive the vehicle’s age at the time of the crash.
  • Injury and fatality counts are aggregated into a single total_victims variable.
  • The season of the year in which each collision occurred is inferred based on the collision date.
  • Vehicle make is used to infer the origin region of the vehicle (e.g., USA, Japan, Korea, Europe, etc.).
  • The collision time is parsed to extract the hour and categorize it into a general time of day (e.g., Morning, Afternoon, Night).
  • The processed dataset is saved for downstream modeling steps, and the final number of features (columns) is printed.

These transformations help reduce dimensionality, engineer informative features, and ensure consistency across variables.

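
The code for this step is not shown in the report, so the following is only a minimal base-R sketch of how such features could be derived. The column names (collision_date, collision_time, vehicle_year, killed_victims, injured_victims, vehicle_make) come from the dataset preview above. The season boundaries, the time-of-day cut-offs, and the make-to-region lookup are assumptions chosen for illustration, and the toy rows are made up.

```r
# Toy rows mimicking the SWITRS columns used in this report (values are made up)
crashes <- data.frame(
  collision_date  = c("2002-01-12", "2003-07-04", "2003-10-30"),
  collision_time  = c("08:20:00", "15:15:00", "23:05:00"),
  vehicle_year    = c(1996, 2001, 1990),
  killed_victims  = c(0, 0, 1),
  injured_victims = c(0, 2, 1),
  vehicle_make    = c("ford", "toyota", "volvo"),
  stringsAsFactors = FALSE
)

crash_date  <- as.Date(crashes$collision_date)
crash_month <- as.integer(format(crash_date, "%m"))

# Vehicle age at the time of the crash and total number of victims
crashes$year          <- as.integer(format(crash_date, "%Y"))
crashes$car_age       <- crashes$year - crashes$vehicle_year
crashes$total_victims <- crashes$killed_victims + crashes$injured_victims

# Season by calendar month (assumed: Dec-Feb is Winter, Mar-May is Spring, ...)
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
crashes$season <- season_by_month[crash_month]

# Time of day by hour 0..23 (assumed cut-offs: 5, 12, 17 and 21 o'clock)
time_of_day_by_hour <- c(rep("Night", 5), rep("Morning", 7),
                         rep("Afternoon", 5), rep("Evening", 4), rep("Night", 3))
crash_hour <- as.integer(substr(crashes$collision_time, 1, 2))
crashes$time_of_day <- time_of_day_by_hour[crash_hour + 1]

# Region of origin from a hypothetical make lookup; unknown makes become "Other"
make_region <- c(ford = "USA", toyota = "Japan", volvo = "Europe")
crashes$region <- unname(make_region[crashes$vehicle_make])
crashes$region[is.na(crashes$region)] <- "Other"

print(crashes[, c("car_age", "total_victims", "season", "time_of_day", "region")])
```

Lookup vectors keep each rule visible in one place, which makes the assumed cut-offs easy to change.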
Sample of transformed dataset
chp_vehicle_type_at_fault party_safety_equipment_1 type_of_collision road_surface motor_vehicle_involved_with direction_of_travel lighting weather_1 movement_preceding_collision population party_sex party_age party_sobriety financial_responsibility cellphone_in_use party_race county_location chp_beat_type party_count hit_and_run car_age total_victims season year region time_of_day
passenger car, station lap/shoulder harness used broadside dry other motor vehicle south daylight clear proceeding straight 25000 to 50000 female 22 had not been drinking no proof of insurance obtained 0 hispanic san mateo us highway 3 not hit and run 4 0 Winter 2002 Japan Morning
passenger car, station lap/shoulder harness used broadside wet other motor vehicle south daylight raining proceeding straight unincorporated male 26 had not been drinking proof of insurance obtained 0 other santa cruz state route 2 not hit and run 8 0 Winter 2002 USA Afternoon
passenger car, station lap/shoulder harness used hit object dry fixed object west dark with street lights clear proceeding straight >250000 male 17 had not been drinking no proof of insurance obtained 0 asian los angeles state route 1 not hit and run 7 0 Spring 2002 Japan Night
passenger car, station air bag deployed sideswipe dry other motor vehicle north daylight clear stopped unincorporated female 36 had not been drinking proof of insurance obtained 0 hispanic los angeles county road line 2 not hit and run 13 0 Fall 2002 USA Afternoon
passenger car, station lap/shoulder harness used rear end dry other motor vehicle north daylight clear proceeding straight unincorporated male 21 had not been drinking proof of insurance obtained 0 white tulare county road area 2 not hit and run 1 0 Winter 2003 Japan Morning
passenger car, station lap/shoulder harness used rear end dry parked motor vehicle west dark with no street lights clear backing unincorporated male 22 had not been drinking proof of insurance obtained 0 white kern state route 2 not hit and run 1 0 Spring 2003 USA Evening
passenger car, station lap belt used sideswipe dry other motor vehicle south daylight clear changing lanes 50000 to 100000 male 46 had been drinking, under influence no proof of insurance obtained 0 white san diego interstate 2 not hit and run 37 1 Summer 2003 USA Afternoon
passenger car, station not required hit object dry fixed object north daylight clear ran off road unincorporated male 19 had been drinking, under influence proof of insurance obtained 0 white el dorado county road line 1 not hit and run 15 0 Summer 2003 USA Evening
passenger car, station air bag not deployed rear end dry other motor vehicle south daylight clear changing lanes >250000 female 31 had not been drinking proof of insurance obtained 0 black los angeles interstate 2 not hit and run 12 0 Summer 2003 Japan Afternoon
passenger car, station air bag not deployed sideswipe dry other motor vehicle south daylight clear passing other vehicle unincorporated male 38 had been drinking, not under influence proof of insurance obtained 0 asian los angeles interstate 2 not hit and run 1 0 Summer 2003 Japan Afternoon

Number of columns after pruning: 26

5 Method/Model

This section focuses on modeling the number of victims in traffic accidents using count regression techniques. It begins by preparing the dataset and examining the distribution of the target variable, total_victims, including an assessment of zero inflation. A custom score test is implemented to test for excess zeros beyond the Poisson assumption.

Four models are fitted to the data:

  • Poisson Regression
  • Negative Binomial Regression
  • Zero-Inflated Poisson (ZIP)
  • Zero-Inflated Negative Binomial (ZINB)

Likelihood ratio tests are used to compare nested models, and model fit is evaluated using AIC and log-likelihood criteria. The results guide the choice of the most appropriate model for handling both overdispersion and zero inflation in the count outcome.
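
To make the two-part structure of the zero-inflated models concrete, the following minimal sketch writes out the ZINB probability mass function in R. The function name dzinb and the parameter values (mu, theta, pi0) are illustrative assumptions, not estimates from this analysis; the negative binomial part uses the mean/dispersion parameterisation of dnbinom(size = theta, mu = mu).

```r
# Sketch of the ZINB probability mass function (hypothetical helper, illustrative values).
# pi0   = probability of a structural zero (zero-inflation part)
# mu    = mean of the negative binomial count part
# theta = NB dispersion; the variance of the count part is mu + mu^2 / theta
dzinb <- function(y, mu, theta, pi0) {
  pi0 * (y == 0) + (1 - pi0) * dnbinom(y, size = theta, mu = mu)
}

mu <- 0.8; theta <- 3; pi0 <- 0.4   # assumed values, for illustration only

p_zero_nb   <- dnbinom(0, size = theta, mu = mu)   # share of zeros under a plain NB
p_zero_zinb <- dzinb(0, mu, theta, pi0)            # share of zeros under ZINB
total_mass  <- sum(dzinb(0:1000, mu, theta, pi0))  # should be numerically 1

print(c(nb = p_zero_nb, zinb = p_zero_zinb, total = total_mass))
```

Setting pi0 to 0 recovers the ordinary negative binomial, and letting theta grow large moves the count part towards a Poisson distribution, which is why the four candidate models can be compared against each other.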

library(MASS)
# Load the dataset
df <- read.csv("driver_at_fault_sample_final.csv", stringsAsFactors = FALSE)
df <- df[1:10000, ]

# Ensure the 'total_victims' column is numeric
df$total_victims <- as.numeric(df$total_victims)

# Calculate the count and percentage of zeros in 'total_victims'
zero_count <- sum(df$total_victims == 0, na.rm = TRUE)
total_count <- sum(!is.na(df$total_victims))
zero_percent <- round(100 * zero_count / total_count, 2)

Number of zeros: 6 438 out of 10 000 observations (64.38%)

lr_test <- function(model1, model2) {
  lr_stat <- 2 * (logLik(model2)[1] - logLik(model1)[1])
  df_diff <- attr(logLik(model2), "df") - attr(logLik(model1), "df")
  p_val <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
  cat("LR test comparing models:\n")
  cat("  Statistic =", round(lr_stat, 4), "\n")
  cat("  Degrees of freedom =", df_diff, "\n")
  cat("  p-value =", p_val, "\n\n")
  return(list(statistic = lr_stat, df = df_diff, p.value = p_val))
}

# Perform zero-inflation test on 'total_victims'
zero_test_result <- zero.test(df$total_victims)
knitr::kable(zero_test_result[, 1:4], caption = "Zero Inflation Test Result", align = c("l", "r", "r", "r"))
Zero Inflation Test Result
Test Statistic DF P-value
Score test for zero inflation 577.4755 1 1.33e-127

Interpretation of the zero.test result: we reject H0, so there is evidence of zero inflation beyond what the Poisson model predicts. Zero-inflated models (ZIP or ZINB) should therefore be considered.
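
The definition of zero.test is not shown in this section. As a rough sketch of what such a test can look like, the function below implements the intercept-only score test of van den Broek (1995) for zero inflation relative to a Poisson distribution. The name score_test_zero_inflation and the toy vector y are assumptions for illustration; the implementation used for the table above may differ.

```r
# Hypothetical helper: score test for excess zeros relative to a Poisson model (no covariates).
score_test_zero_inflation <- function(x) {
  x      <- x[!is.na(x)]
  n      <- length(x)
  lambda <- mean(x)          # Poisson MLE under H0
  p0     <- exp(-lambda)     # expected share of zeros under H0
  n0     <- sum(x == 0)      # observed number of zeros
  stat   <- (n0 - n * p0)^2 / (n * p0 * (1 - p0) - n * lambda * p0^2)
  data.frame(Test      = "Score test for zero inflation",
             Statistic = stat,
             DF        = 1,
             P.value   = pchisq(stat, df = 1, lower.tail = FALSE))
}

# Toy data (assumed): 70 zeros and 30 counts equal to 3
y   <- c(rep(0, 70), rep(3, 30))
res <- score_test_zero_inflation(y)
print(res)
```

Under H0 the statistic is compared with a chi-square distribution with 1 degree of freedom, which matches the DF column reported above.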

5.1 Fitting Models

# Creating 4 general models for comparison, all sharing one full specification
full_formula <- total_victims ~ road_surface + direction_of_travel + lighting + weather_1 +
  party_sex + party_age + party_sobriety + financial_responsibility + cellphone_in_use +
  party_race + party_count + hit_and_run + car_age + season + year + region + time_of_day +
  chp_beat_type + chp_vehicle_type_at_fault + party_safety_equipment_1 + type_of_collision +
  county_location + population + motor_vehicle_involved_with + movement_preceding_collision

# Poisson
poisson_model <- glm(full_formula, family = "poisson", data = df)

# Negative Binomial
nb_model <- glm.nb(full_formula, data = df)

# ZIP (same regressors in the count part and the zero-inflation part)
zip_model <- zeroinfl(full_formula, data = df, dist = "poisson")

# ZINB (same regressors in the count part and the zero-inflation part)
zinb_model <- zeroinfl(full_formula, data = df, dist = "negbin")

5.2 Model Comparison

Likelihood Ratio Tests Between Count Models
Comparison Statistic DF p.value
Poisson vs Negative Binomial 185.761 1 2.676615e-42
Poisson vs Zero-Inflated Poisson 663.473 200 4.681003e-51
Negative Binomial vs ZINB 486.163 200 7.509073e-26
ZIP vs ZINB 8.451 1 3.647446e-03

Interpretation:

  1. Poisson vs Negative Binomial:
  • The Likelihood Ratio (LR) statistic is 185.76 with 1 degree of freedom.
  • p < 0.0001 → This is a highly significant result.
  • Interpretation: The Negative Binomial model provides a significantly better fit than the Poisson model. This suggests overdispersion is present in the data (variance > mean), which Poisson cannot handle properly.
  2. Poisson vs Zero-Inflated Poisson (ZIP):
  • LR statistic = 663.47, df = 200, p < 0.0001.
  • Interpretation: The ZIP model is significantly better than the standard Poisson. This indicates that there is an excess of zero counts in the data that ZIP accounts for effectively.
  3. Negative Binomial vs Zero-Inflated Negative Binomial (ZINB):
  • LR statistic = 486.16, df = 200, p < 0.0001.
  • Interpretation: The ZINB model outperforms the standard Negative Binomial model. So, even with overdispersion addressed, there’s still zero inflation that needs to be modeled.
  4. ZIP vs ZINB:
  • LR statistic = 8.45, df = 1, p ≈ 0.0036.
  • Interpretation: The ZINB model is also significantly better than the ZIP model. That means both overdispersion and zero inflation are present in the data, and ZINB is the most appropriate model among those tested.

Conclusion:

  • Based on all comparisons, the Zero-Inflated Negative Binomial (ZINB) model provides the best fit to the data, as it handles both excess zeros and overdispersion, which are not adequately addressed by simpler models like Poisson, NB, or ZIP.
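
One caveat applies to the two comparisons with 1 degree of freedom. The Poisson distribution is the limit of the negative binomial as the dispersion parameter theta grows without bound, so the null hypothesis lies on the boundary of the parameter space. A common adjustment (used, for instance, by pscl::odTest) treats the LR statistic as a 50:50 mixture of a point mass at zero and a chi-square with 1 df, which halves the naive p-value. The sketch below applies this to the statistics reported in the LR table above; the variable names are illustrative.

```r
# LR statistics copied from the table of likelihood ratio tests (1-df comparisons only)
lr_stat <- c("Poisson vs NB" = 185.761, "ZIP vs ZINB" = 8.451)

p_naive    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)  # standard chi-square p-value
p_boundary <- p_naive / 2                                   # 50:50 mixture adjustment

print(data.frame(statistic = lr_stat, p_naive = p_naive, p_boundary = p_boundary))
```

Both adjusted p-values remain far below conventional thresholds, so the adjustment does not change the conclusions drawn here.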

Comparison of Count Regression Models
Model AIC Log-Likelihood
ZINB 18425.64 -8811.82
ZIP 18432.09 -8816.04
NegBin 18511.80 -9054.90
Poisson 18695.56 -9147.78

Interpretation:

  1. ZINB (Zero-Inflated Negative Binomial):
  • Lowest AIC (18425.64) and highest log-likelihood (-8811.82) among all models.
  • Interpretation: This is the best-fitting model according to both AIC and log-likelihood. It captures both overdispersion and excess zeros effectively.
  2. ZIP (Zero-Inflated Poisson):
  • Second-best in terms of AIC and log-likelihood.
  • Interpretation: While it accounts for zero inflation, it does not handle overdispersion as well as ZINB. Still, it performs better than the standard Poisson and Negative Binomial models.
  3. Negative Binomial (NegBin):
  • Has a lower AIC than Poisson, indicating improvement when accounting for overdispersion, but still worse than ZIP and ZINB.
  • Interpretation: Handles overdispersion, but fails to account for excess zeros.
  4. Poisson:
  • Worst model with the highest AIC (18695.56) and lowest log-likelihood (-9147.78).
  • Interpretation: It fails to account for both overdispersion and zero inflation, making it unsuitable for this dataset.

Conclusion:

  • Based on both AIC and log-likelihood, the ZINB model is the preferred choice among the four candidates. Its advantage over ZIP is modest (AIC 18425.64 vs 18432.09), whereas the gap to the Negative Binomial and Poisson models is much larger.
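
The two summary tables are linked by simple arithmetic: AIC = 2k - 2 * logLik, where k is the number of estimated parameters, and each LR statistic is twice the difference in log-likelihoods. The sketch below recomputes both from the reported values as a consistency check; the object names are illustrative.

```r
# Values copied from the model comparison table
fit <- data.frame(
  model  = c("ZINB", "ZIP", "NegBin", "Poisson"),
  aic    = c(18425.64, 18432.09, 18511.80, 18695.56),
  loglik = c(-8811.82, -8816.04, -9054.90, -9147.78)
)

# Number of estimated parameters implied by AIC = 2k - 2 * logLik
fit$k <- round((fit$aic + 2 * fit$loglik) / 2)

# LR statistic = 2 * (logLik of the larger model - logLik of the smaller model)
ll <- setNames(fit$loglik, fit$model)
lr <- c(
  "Poisson vs NegBin" = 2 * (ll[["NegBin"]] - ll[["Poisson"]]),
  "Poisson vs ZIP"    = 2 * (ll[["ZIP"]]    - ll[["Poisson"]]),
  "NegBin vs ZINB"    = 2 * (ll[["ZINB"]]   - ll[["NegBin"]]),
  "ZIP vs ZINB"       = 2 * (ll[["ZINB"]]   - ll[["ZIP"]])
)

print(fit)
print(round(lr, 2))
```

The implied parameter counts differ by 1 and by 200 between the paired models, in line with the degrees of freedom in the LR table. Small gaps between the recomputed and reported statistics come from the rounding of the log-likelihoods.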
Data summary
Name df
Number of rows 65000
Number of columns 26
_______________________
Column type frequency:
character 20
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
chp_vehicle_type_at_fault 0 1 8 91 0 58 0
party_safety_equipment_1 0 1 5 40 0 19 0
type_of_collision 0 1 5 10 0 8 0
road_surface 0 1 3 8 0 4 0
motor_vehicle_involved_with 0 1 1 30 0 12 0
direction_of_travel 0 1 4 5 0 4 0
lighting 0 1 8 39 0 5 0
weather_1 0 1 3 7 0 7 0
movement_preceding_collision 0 1 5 26 0 18 0
population 0 1 5 16 0 8 0
party_sex 0 1 4 6 0 2 0
party_sobriety 0 1 14 38 0 6 0
financial_responsibility 0 1 14 35 0 4 0
party_race 0 1 5 8 0 5 0
county_location 0 1 4 15 0 58 0
chp_beat_type 0 1 7 23 0 8 0
hit_and_run 0 1 6 15 0 3 0
season 0 1 4 6 0 4 0
region 0 1 3 6 0 6 0
time_of_day 0 1 5 9 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
party_age 0 1 36.53 16.20 0 23 32 47 121 ▅▇▃▁▁
cellphone_in_use 0 1 0.02 0.15 0 0 0 0 1 ▇▁▁▁▁
party_count 0 1 1.98 0.77 1 2 2 2 10 ▇▂▁▁▁
car_age 0 1 9.44 6.84 0 4 9 14 89 ▇▁▁▁▁
total_victims 0 1 0.55 0.89 0 0 0 1 20 ▇▁▁▁▁
year 0 1 2012.39 4.91 2002 2008 2013 2017 2021 ▃▇▆▇▆

5.3 Sample values from columns

Sample Values from Each Column
Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
chp_vehicle_type_at_fault passenger car, station mini-vans motorcycle pickups & panels two axle truck
party_safety_equipment_1 lap/shoulder harness used air bag deployed lap belt used not required air bag not deployed
type_of_collision broadside hit object sideswipe rear end overturned
road_surface dry wet slippery snowy NA
motor_vehicle_involved_with other motor vehicle fixed object parked motor vehicle non-collision train
direction_of_travel south west north east NA
lighting daylight dark with street lights dark with no street lights dusk or dawn dark with street lights not functioning
weather_1 clear raining cloudy other fog
movement_preceding_collision proceeding straight stopped backing changing lanes ran off road
population 25000 to 50000 unincorporated >250000 50000 to 100000 100000 to 250000
party_sex female male NA NA NA
party_age 22 26 17 36 21
party_sobriety had not been drinking had been drinking, under influence had been drinking, not under influence impairment unknown had been drinking, impairment unknown
financial_responsibility no proof of insurance obtained proof of insurance obtained not applicable officer called away before obtained NA
cellphone_in_use 0 1 NA NA NA
party_race hispanic other asian white black
county_location san mateo santa cruz los angeles tulare kern
chp_beat_type us highway state route county road line county road area interstate
party_count 3 2 1 4 9
hit_and_run not hit and run misdemeanor felony NA NA
car_age 4 8 7 13 1
total_victims 0 1 4 2 3
season Winter Spring Fall Summer NA
year 2002 2003 2004 2005 2006
region Japan USA Korea Europe Other
time_of_day Morning Afternoon Night Evening NA
Unique CHP Vehicle Types at Fault
Vehicle.Types.at.Fault
all terrain vehicle
ambulance
dune buggy
farm labor transporter
fire truck
fork lift
general public paratransit
go-ped, zip electric scooter, motoboard
hwy. construction equip.
implement of husbandry
low speed vehicle
mini-vans
misc. motor vehicle (snowmobile, golf cart)
mobile equipment
motor driven
motor home
motor home > 40 feet
motorcycle
motorized bicycle
non-commercial bus
other commercial
paratransit
passenger car, station
passenger car, station wagon, jeep: hazardous material
passenger car, station wagon, jeep: hazardous waste or hazardous waste/material combination
pickup and camper: hazardous material
pickup w/camper
pickups & panels
pickups and panels: hazardous material
pickups and panels: hazardous waste or hazardous waste/material combination
police car
police motorcycle
public transit authority
school bus contractual type i
school bus contractual type ii
school bus private type i
school bus private type ii
school bus public type i
school bus public type ii
school pupil activity bus type i
school pupil activity bus type ii
sport utility vehicle
three-axle tank truck: hazardous material
three-axle tow truck
three axle tank truck
three or more axle truck
three or more axle truck: hazardous material
three or more axle truck: hazardous waste or hazardous waste/material combination
tour bus
truck tractor
truck tractor: hazardous material
two-axle tank truck: hazardous waste or hazardous waste/material combination
two-axle tow truck
two-axle truck: hazardous material
two-axle truck: hazardous waste or hazardous waste/material combination
two axle tank truck
two axle truck
youth bus
  1. proceeding straight
  2. stopped
  3. backing
  4. changing lanes
  5. ran off road
  6. passing other vehicle
  7. entering traffic
  8. other
  9. crossed into opposing lane
  10. making right turn
  11. making left turn
  12. other unsafe turning
  13. slowing/stopping
  14. merging
  15. making u-turn
  16. traveling wrong way
  17. parking maneuver
  18. parked
# aggregating movement into more general groups
df$move_general <- with(df, case_when(
  movement_preceding_collision %in% c("proceeding straight", "entering traffic", "merging") ~ "Straight Travel",
  movement_preceding_collision %in% c("making right turn", "making left turn", "making u-turn", "other unsafe turning", "other") ~ "Turning",
  movement_preceding_collision %in% c("changing lanes", "passing other vehicle", "crossed into opposing lane") ~ "Lane Change/Passing",
  movement_preceding_collision %in% c("slowing/stopping", "stopped", "backing", "parking maneuver", "parked") ~ "Stopping/Slowing",
  movement_preceding_collision %in% c("ran off road", "traveling wrong way") ~ "Loss of Control/Irregular",
  TRUE ~ "Other"
))

df$move_general <- factor(df$move_general)
  1. lap/shoulder harness used
  2. air bag deployed
  3. lap belt used
  4. not required
  5. air bag not deployed
  6. other
  7. none in vehicle
  8. passenger, motorcycle helmet used
  9. unknown
  10. driver, motorcycle helmet used
  11. passive restraint not used
  12. shoulder harness not used
  13. no child restraint in vehicle
  14. lap/shoulder harness not used
  15. shoulder harness used
  16. passive restraint used
  17. child restraint in vehicle, use unknown
  18. child restraint in vehicle, improper use
  19. lap belt not used
# aggregating safety equipment into air_bag variable
df$air_bag <- factor(
  ifelse(df$party_safety_equipment_1 == "air bag deployed", "Yes",
         ifelse(df$party_safety_equipment_1 == "air bag not deployed", "No", "Other safety")),
  levels = c("Yes", "No", "Other safety")
)
Distribution depending on Air Bag
Air Bag Count
Yes 14 301
No 44 601
Other safety 6 098
  1. clear
  2. raining
  3. cloudy
  4. other
  5. fog
  6. snowing
  7. wind
# aggregating weather into more general groups
df$weather <- factor(
  ifelse(df$weather_1 == "clear", "clear_sky",
         ifelse(df$weather_1 %in% c("raining", "snowing", "fog"), "harder_conditions", "other"))
)
Distribution depending on Weather
Weather Count
clear_sky 52 511
harder_conditions 2 713
other 9 776
  1. had not been drinking
  2. had been drinking, under influence
  3. had been drinking, not under influence
  4. impairment unknown
  5. had been drinking, impairment unknown
  6. not applicable
# aggregating sobriety into more general groups
df$sobriety <- factor(
  ifelse(df$party_sobriety == "had not been drinking", "not drinking",
         ifelse(df$party_sobriety %in% c(
           "had been drinking, under influence",
           "had been drinking, not under influence",
           "had been drinking, impairment unknown"
         ), "drinking", "other"))
)
Distribution of Sobriety Levels
Sobriety Count
drinking 7 004
not drinking 55 988
other 2 008
  1. dry
  2. wet
  3. slippery
  4. snowy
# aggregating road surface into more general groups
df$surface_dry <- ifelse(df$road_surface == "dry", "dry", "non-dry")
df$surface_dry <- factor(df$surface_dry)
# converting age into 2 intervals: 24 or younger ("under 25") and 25 or older ("over 25")
df$age_group <- cut(
  df$party_age,
  breaks = c(-Inf, 24, Inf),
  labels = c("under 25", "over 25"),
  right = TRUE
)
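
Because right = TRUE makes each interval closed on the right, the break at 24 places drivers aged 24 or younger in the first group and drivers aged 25 or older in the second, so the label "over 25" includes 25-year-olds. A small check with made-up ages:

```r
# Made-up ages around the cut point (illustrative only)
ages <- c(17, 24, 25, 26, 60)

age_group <- cut(
  ages,
  breaks = c(-Inf, 24, Inf),
  labels = c("under 25", "over 25"),
  right = TRUE   # intervals are (-Inf, 24] and (24, Inf)
)

print(data.frame(age = ages, age_group = age_group))
```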
Column Missing_Values
chp_vehicle_type_at_fault 0
party_safety_equipment_1 0
type_of_collision 0
road_surface 0
motor_vehicle_involved_with 0
direction_of_travel 0
lighting 0
weather_1 0
movement_preceding_collision 0
population 0
party_sex 0
party_age 0
party_sobriety 0
financial_responsibility 0
cellphone_in_use 0
party_race 0
county_location 0
chp_beat_type 0
party_count 0
hit_and_run 0
car_age 0
total_victims 0
season 0
year 0
region 0
time_of_day 0
vehicle_general 0
move_general 0
air_bag 0
weather 0
sobriety 0
surface_dry 0
age_group 0
# aggregating the four seasons into 2 groups: wet (Winter, Spring) and dry (Summer, Fall)
df$LA_season <- ifelse(df$season %in% c("Winter", "Spring"), "Wet", "Dry")
df$LA_season <- factor(df$LA_season, levels = c("Dry", "Wet"))
Distribution through Seasons
Season Count
Fall 17 018
Spring 15 745
Summer 16 064
Winter 16 173
  1. hispanic
  2. other
  3. asian
  4. white
  5. black
# aggregating race into white and non-white
df$race_group <- ifelse(df$party_race == "white", "white", "non-white")
df$race_group <- factor(df$race_group, levels = c("white", "non-white"))
  1. no proof of insurance obtained
  2. proof of insurance obtained
  3. not applicable
  4. officer called away before obtained
# aggregating insurance into 2 groups
df$financial_responsibility_bi <- ifelse(df$financial_responsibility == "proof of insurance obtained", 
                            "proof", "no or unknown proof")
df$financial_responsibility_bi <- factor(df$financial_responsibility_bi, levels = c("proof", "no or unknown proof"))
# aggregating car age into 2 intervals: up to 15 years ("under_15") and more than 15 years ("over_15")
df$car_age_group <- ifelse(df$car_age <= 15, "under_15", "over_15")
df$car_age_group <- factor(df$car_age_group, levels = c("under_15", "over_15"))
# aggregating lighting into 2 groups
df$lighting_simple <- ifelse(df$lighting == "daylight", "daylight", "dark")
df$lighting_simple <- factor(df$lighting_simple, levels = c("daylight", "dark"))

5.4 Interactions

# sex x race
df$sex_race <- interaction(df$party_sex, df$race_group, drop = TRUE)
Sex x Race Count
female.white 10 264
male.white 16 389
female.non-white 12 537
male.non-white 25 810
# time of the day x weather
df$timeofday_weather <- interaction(df$time_of_day, df$weather, drop = TRUE)
Time of Day x Weather Count
Afternoon.clear_sky 22 383
Evening.clear_sky 11 170
Morning.clear_sky 13 770
Night.clear_sky 5 188
Afternoon.harder_conditions 865
Evening.harder_conditions 616
Morning.harder_conditions 785
Night.harder_conditions 447
Afternoon.other 3 419
Evening.other 1 668
Morning.other 3 655
Night.other 1 034
# age group x car age group
df$age_car <- interaction(df$age_group, df$car_age_group, drop = TRUE)
Age Group x Car Age Group Count
under 25.under_15 15 809
over 25.under_15 37 359
under 25.over_15 3 840
over 25.over_15 7 992
# lighting x surface dry
df$lighting_surface <- interaction(df$lighting_simple, df$surface_dry, drop = TRUE)
Lighting x Surface Dry Count
daylight.dry 41 021
dark.dry 17 244
daylight.non-dry 3 920
dark.non-dry 2 815
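
Note that interaction() does not add a product term to a model. It builds a single factor whose levels are the combinations of the input factors, and with drop = TRUE any combination that never occurs is omitted. A minimal illustration with made-up values:

```r
# Made-up inputs (illustrative only)
sex  <- factor(c("female", "male", "male"))
race <- factor(c("white", "white", "non-white"), levels = c("white", "non-white"))

# One combined factor; the level of the first argument varies fastest
sex_race <- interaction(sex, race, drop = TRUE)

print(levels(sex_race))
print(table(sex_race))
```

When every combination is observed, entering such a factor in a regression spans the same fits as the two main effects plus their interaction, only with a different parameterisation: each coefficient is a contrast against the reference combination.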

5.5 Modelling

# Previous analysis showed that Zero-Inflated Negative Binomial is the most appropriate choice
# Starting with the most general model

general <- zeroinfl(total_victims ~ weather + party_sex + age_group + sobriety
                    + financial_responsibility_bi + cellphone_in_use + race_group
                    + car_age_group + LA_season + time_of_day 
                    + air_bag + lighting_surface, data = df, dist = "negbin")
Count Model (Negative Binomial)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.074 0.033 2.234 2.55e-02
weatherharder_conditions 0.059 0.048 1.233 2.17e-01
weatherother -0.011 0.024 -0.467 6.40e-01
party_sexmale -0.080 0.016 -5.071 < 1e-03
age_groupover 25 -0.014 0.016 -0.908 3.64e-01
sobrietynot drinking -0.033 0.024 -1.374 1.69e-01
sobrietyother -0.162 0.052 -3.095 1.97e-03
financial_responsibility_bino or unknown proof 0.176 0.022 8.128 < 1e-03
cellphone_in_use1 -0.098 0.049 -2.006 4.48e-02
race_groupnon-white 0.014 0.015 0.897 3.70e-01
car_age_groupover_15 -0.005 0.019 -0.241 8.10e-01
LA_seasonWet -0.019 0.015 -1.293 1.96e-01
time_of_dayEvening 0.018 0.026 0.698 4.85e-01
time_of_dayMorning -0.113 0.019 -6.000 < 1e-03
time_of_dayNight -0.204 0.034 -6.043 < 1e-03
air_bagNo -0.543 0.020 -26.515 < 1e-03
air_bagOther safety -0.420 0.022 -19.413 < 1e-03
lighting_surfacedark.dry -0.050 0.026 -1.939 5.25e-02
lighting_surfacedaylight.non-dry -0.156 0.040 -3.886 < 1e-03
lighting_surfacedark.non-dry -0.133 0.049 -2.730 6.34e-03
Log(theta) 1.088 0.044 24.451 < 1e-03
Zero-inflation Model (Logit)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.422 54.178 -0.266 7.90e-01
weatherharder_conditions 0.629 0.339 1.855 6.36e-02
weatherother 0.081 0.122 0.659 5.10e-01
party_sexmale 0.379 0.089 4.234 < 1e-03
age_groupover 25 -0.378 0.080 -4.734 < 1e-03
sobrietynot drinking 0.014 0.142 0.095 9.24e-01
sobrietyother 0.469 0.229 2.044 4.09e-02
financial_responsibility_bino or unknown proof 0.441 0.103 4.287 < 1e-03
cellphone_in_use1 0.029 0.261 0.113 9.10e-01
race_groupnon-white 0.413 0.089 4.621 < 1e-03
car_age_groupover_15 -0.249 0.105 -2.364 1.81e-02
LA_seasonWet 0.064 0.078 0.828 4.07e-01
time_of_dayEvening -0.005 0.128 -0.039 9.69e-01
time_of_dayMorning -0.084 0.098 -0.859 3.90e-01
time_of_dayNight -0.490 0.221 -2.220 2.64e-02
air_bagNo 12.975 54.178 0.239 8.11e-01
air_bagOther safety 1.833 62.848 0.029 9.77e-01
lighting_surfacedark.dry -0.082 0.129 -0.633 5.27e-01
lighting_surfacedaylight.non-dry -1.414 0.444 -3.184 1.45e-03
lighting_surfacedark.non-dry -0.435 0.327 -1.331 1.83e-01
# To analyze it properly we'll use the Likelihood Ratio Test. Hypotheses:
# H0: all coefficients of the removed variable are zero (the variable is not jointly significant)
# H1: at least one coefficient of the removed variable is non-zero (the variable is jointly significant)

# Starting with dropping the variable that is insignificant in both parts – LA_season
step1 <- zeroinfl(total_victims ~ weather + party_sex + age_group + sobriety
                  + financial_responsibility_bi + cellphone_in_use + race_group + car_age_group
                  + time_of_day + air_bag + lighting_surface, data = df, dist = "negbin")
Count Model (Negative Binomial)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.065 0.032 2.005 4.49e-02
weatherharder_conditions 0.058 0.048 1.210 2.26e-01
weatherother -0.014 0.024 -0.564 5.73e-01
party_sexmale -0.079 0.016 -5.059 < 1e-03
age_groupover 25 -0.014 0.016 -0.887 3.75e-01
sobrietynot drinking -0.033 0.024 -1.362 1.73e-01
sobrietyother -0.162 0.052 -3.092 1.99e-03
financial_responsibility_bino or unknown proof 0.177 0.022 8.140 < 1e-03
cellphone_in_use1 -0.098 0.049 -1.993 4.63e-02
race_groupnon-white 0.013 0.015 0.893 3.72e-01
car_age_groupover_15 -0.005 0.019 -0.282 7.78e-01
time_of_dayEvening 0.019 0.026 0.706 4.80e-01
time_of_dayMorning -0.114 0.019 -6.061 < 1e-03
time_of_dayNight -0.203 0.034 -6.035 < 1e-03
air_bagNo -0.544 0.021 -26.518 < 1e-03
air_bagOther safety -0.420 0.022 -19.412 < 1e-03
lighting_surfacedark.dry -0.050 0.026 -1.944 5.19e-02
lighting_surfacedaylight.non-dry -0.159 0.040 -3.971 < 1e-03
lighting_surfacedark.non-dry -0.136 0.049 -2.792 5.23e-03
Log(theta) 1.087 0.044 24.440 < 1e-03
Zero-inflation Model (Logit)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.660 22.781 -0.556 5.78e-01
weatherharder_conditions 0.654 0.341 1.920 5.49e-02
weatherother 0.093 0.122 0.760 4.47e-01
party_sexmale 0.382 0.090 4.268 < 1e-03
age_groupover 25 -0.376 0.080 -4.704 < 1e-03
sobrietynot drinking 0.013 0.142 0.091 9.28e-01
sobrietyother 0.469 0.229 2.044 4.09e-02
financial_responsibility_bino or unknown proof 0.438 0.103 4.266 < 1e-03
cellphone_in_use1 0.035 0.260 0.134 8.94e-01
race_groupnon-white 0.413 0.089 4.629 < 1e-03
car_age_groupover_15 -0.249 0.106 -2.360 1.83e-02
time_of_dayEvening -0.005 0.127 -0.036 9.71e-01
time_of_dayMorning -0.093 0.097 -0.959 3.38e-01
time_of_dayNight -0.488 0.220 -2.218 2.65e-02
air_bagNo 11.241 22.780 0.493 6.22e-01
air_bagOther safety 0.594 33.774 0.018 9.86e-01
lighting_surfacedark.dry -0.085 0.129 -0.663 5.07e-01
lighting_surfacedaylight.non-dry -1.433 0.450 -3.184 1.45e-03
lighting_surfacedark.non-dry -0.457 0.330 -1.384 1.66e-01
Likelihood Ratio Test: Comparison of step1 & general
#Df LogLik Df Chisq Pr(>Chisq)
step1 39 -63170.65 NA NA NA
general 41 -63168.03 2 5.257 0.072
  • p-value = 0.072 -> we fail to reject H0. No evidence that “LA_season” is jointly significant – removing it.
# In step 2 we are dropping the "weather" variable
step2 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day
                  + air_bag + lighting_surface, data = df, dist = "negbin")
Count Model (Negative Binomial)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.065 0.032 2.011 4.44e-02
party_sexmale -0.079 0.016 -5.025 < 1e-03
age_groupover 25 -0.014 0.016 -0.877 3.80e-01
sobrietynot drinking -0.034 0.024 -1.437 1.51e-01
sobrietyother -0.165 0.052 -3.149 1.64e-03
financial_responsibility_bino or unknown proof 0.175 0.022 8.053 < 1e-03
cellphone_in_use1 -0.096 0.049 -1.958 5.03e-02
race_groupnon-white 0.014 0.015 0.945 3.44e-01
car_age_groupover_15 -0.006 0.019 -0.295 7.68e-01
time_of_dayEvening 0.018 0.026 0.671 5.02e-01
time_of_dayMorning -0.115 0.019 -6.133 < 1e-03
time_of_dayNight -0.204 0.034 -6.066 < 1e-03
air_bagNo -0.542 0.020 -26.548 < 1e-03
air_bagOther safety -0.420 0.022 -19.416 < 1e-03
lighting_surfacedark.dry -0.050 0.026 -1.927 5.40e-02
lighting_surfacedaylight.non-dry -0.142 0.033 -4.370 < 1e-03
lighting_surfacedark.non-dry -0.113 0.040 -2.809 4.97e-03
Log(theta) 1.089 0.044 24.497 < 1e-03
Zero-inflation Model (Logit)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.373 12.180 -0.934 3.50e-01
party_sexmale 0.382 0.089 4.276 < 1e-03
age_groupover 25 -0.372 0.080 -4.679 < 1e-03
sobrietynot drinking -0.005 0.139 -0.038 9.70e-01
sobrietyother 0.438 0.228 1.924 5.44e-02
financial_responsibility_bino or unknown proof 0.424 0.102 4.139 < 1e-03
cellphone_in_use1 0.052 0.255 0.202 8.40e-01
race_groupnon-white 0.417 0.089 4.679 < 1e-03
car_age_groupover_15 -0.249 0.105 -2.364 1.81e-02
time_of_dayEvening -0.016 0.127 -0.125 9.01e-01
time_of_dayMorning -0.091 0.097 -0.939 3.47e-01
time_of_dayNight -0.491 0.219 -2.242 2.50e-02
air_bagNo 9.983 12.180 0.820 4.12e-01
air_bagOther safety -0.715 28.474 -0.025 9.80e-01
lighting_surfacedark.dry -0.076 0.128 -0.590 5.55e-01
lighting_surfacedaylight.non-dry -1.069 0.376 -2.843 4.47e-03
lighting_surfacedark.non-dry -0.076 0.226 -0.338 7.35e-01
Likelihood Ratio Test: Comparison of step2 & step1
#Df LogLik Df Chisq Pr(>Chisq)
step2 35 -63173.43 NA NA NA
step1 39 -63170.65 4 5.549 2.35e-01
  • p-value = 0.23 -> we fail to reject H0. No evidence that “weather” is jointly significant – removing it.
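
The degrees of freedom in these tests follow from the model structure: each dropped variable removes one coefficient per non-reference level from both the count part and the zero-inflation part. The sketch below recomputes the degrees of freedom and p-values for the first two steps from the reported chi-square statistics; the helper name lr_pvalue is an assumption for illustration.

```r
# Hypothetical helper: df and p-value for dropping one factor from both parts of a
# zero-inflated model, given the factor's number of levels and the reported LR statistic
lr_pvalue <- function(chisq, n_levels) {
  df <- 2 * (n_levels - 1)   # (levels - 1) dummies, in the count and the zero parts
  c(df = df, p.value = pchisq(chisq, df = df, lower.tail = FALSE))
}

step1_vs_general <- lr_pvalue(chisq = 5.257, n_levels = 2)  # LA_season: Dry, Wet
step2_vs_step1   <- lr_pvalue(chisq = 5.549, n_levels = 3)  # weather: 3 levels

print(round(rbind(step1_vs_general, step2_vs_step1), 3))
```

The same rule gives 6 degrees of freedom for a four-level factor such as time_of_day.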
# In step 3 we are dropping the "time_of_day" variable
step3 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + air_bag + lighting_surface, 
                  data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step3 & step2
#Df LogLik Df Chisq Pr(>Chisq)
step3 29 -63234.01 NA NA NA
step2 35 -63173.43 6 121.173 <1e-03
  • p-value = <1e-03 -> we reject H0. “time_of_day” is jointly significant.
  • We are going back to model from step 2
# In step 4 we bring back "time_of_day" and remove "sobriety"
step4 <- zeroinfl(total_victims ~ party_sex + age_group + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day 
                  + air_bag + lighting_surface, data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step4 & step2
  • p-value = 7.551e-07 -> we reject H0. “sobriety” is jointly significant.
  • We are going back to model from step 2
# In step 5 we bring back "sobriety" and remove "lighting_surface"
step5 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day + air_bag,
                  data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step5 & step2
#Df LogLik Df Chisq Pr(>Chisq)
step5 29 -63190.43 NA NA NA
step2 35 -63173.43 6 34.004 <1e-03
  • p-value = <1e-03 -> we reject H0. “lighting_surface” is jointly significant.
  • We are going back to model from step 2
# In step 6 we bring back "lighting_surface" and remove "cellphone_in_use"
step6 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + race_group + car_age_group + time_of_day + air_bag
                  + lighting_surface, data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step6 & step2
#Df LogLik Df Chisq Pr(>Chisq)
step6 33 -63176.62 NA NA NA
step2 35 -63173.43 2 6.389 4.1e-02
  • p-value = 4.1e-02 -> we reject H0. “cellphone_in_use” is jointly significant
  • We are going back to model from step 2
# In step 7 we bring back "cellphone_in_use" and remove "car_age_group"
step7 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + time_of_day + air_bag + lighting_surface, data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step7 & step2
#Df LogLik Df Chisq Pr(>Chisq)
step7 33 -63177.36 NA NA NA
step2 35 -63173.43 2 7.866 1.96e-02
  • p-value = 1.96e-02 -> we reject H0. “car_age_group” is jointly significant
  • We are going back to model from step 2
# In step 8 we bring back "car_age_group" and remove "race_group"
step8 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + car_age_group + time_of_day + air_bag 
                  + lighting_surface, data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step8 & step2
#Df LogLik Df Chisq Pr(>Chisq)
step8 33 -63189.12 NA NA NA
step2 35 -63173.43 2 31.393 <1e-03
  • p-value = <1e-03 -> we reject H0. “race_group” is jointly significant
  • We are going back to model from step 2
# In step 9 we bring back "race_group" and remove "air_bag"
step9 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day
                  + lighting_surface, data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step9 & step2
#Df LogLik Df Chisq Pr(>Chisq)
step9 31 -64944.40 NA NA NA
step2 35 -63173.43 4 3541.952 <1e-03
  • p-value = <1e-03 -> we reject H0. “air_bag” is jointly significant
  • We are going back to model from step 2
# In step 10 we bring back "air_bag" and remove "age_group"
step10 <- zeroinfl(total_victims ~ party_sex + sobriety + financial_responsibility_bi 
                   + cellphone_in_use + race_group + car_age_group + time_of_day
                   + air_bag + lighting_surface, data = df, dist = "negbin")
Likelihood Ratio Test: Comparison of step10 & step2
#Df LogLik Df Chisq Pr(>Chisq)
step10 31 -64944.40 NA NA NA
step2 35 -63173.43 4 3541.952 <1e-03
  • p-value = <1e-03 -> we reject H0. “age_group” is jointly significant
  • We are going back to model from step 2
# We checked all variables and kept all jointly significant ones
final_model <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day
                  + air_bag + lighting_surface, data = df, dist = "negbin")
Count Model (Negative Binomial)
                                                Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)                                     0.065     0.032       2.011     4.44e-02
party_sexmale                                   -0.079    0.016       -5.025    < 1e-03
age_groupover 25                                -0.014    0.016       -0.877    3.80e-01
sobrietynot drinking                            -0.034    0.024       -1.437    1.51e-01
sobrietyother                                   -0.165    0.052       -3.149    1.64e-03
financial_responsibility_bino or unknown proof  0.175     0.022       8.053     < 1e-03
cellphone_in_use1                               -0.096    0.049       -1.958    5.03e-02
race_groupnon-white                             0.014     0.015       0.945     3.44e-01
car_age_groupover_15                            -0.006    0.019       -0.295    7.68e-01
time_of_dayEvening                              0.018     0.026       0.671     5.02e-01
time_of_dayMorning                              -0.115    0.019       -6.133    < 1e-03
time_of_dayNight                                -0.204    0.034       -6.066    < 1e-03
air_bagNo                                       -0.542    0.020       -26.548   < 1e-03
air_bagOther safety                             -0.420    0.022       -19.416   < 1e-03
lighting_surfacedark.dry                        -0.050    0.026       -1.927    5.40e-02
lighting_surfacedaylight.non-dry                -0.142    0.033       -4.370    < 1e-03
lighting_surfacedark.non-dry                    -0.113    0.040       -2.809    4.97e-03
Log(theta)                                      1.089     0.044       24.497    < 1e-03
Zero-inflation Model (Logit)
                                                Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)                                     -11.373   12.180      -0.934    3.50e-01
party_sexmale                                   0.382     0.089       4.276     < 1e-03
age_groupover 25                                -0.372    0.080       -4.679    < 1e-03
sobrietynot drinking                            -0.005    0.139       -0.038    9.70e-01
sobrietyother                                   0.438     0.228       1.924     5.44e-02
financial_responsibility_bino or unknown proof  0.424     0.102       4.139     < 1e-03
cellphone_in_use1                               0.052     0.255       0.202     8.40e-01
race_groupnon-white                             0.417     0.089       4.679     < 1e-03
car_age_groupover_15                            -0.249    0.105       -2.364    1.81e-02
time_of_dayEvening                              -0.016    0.127       -0.125    9.01e-01
time_of_dayMorning                              -0.091    0.097       -0.939    3.47e-01
time_of_dayNight                                -0.491    0.219       -2.242    2.50e-02
air_bagNo                                       9.983     12.180      0.820     4.12e-01
air_bagOther safety                             -0.715    28.474      -0.025    9.80e-01
lighting_surfacedark.dry                        -0.076    0.128       -0.590    5.55e-01
lighting_surfacedaylight.non-dry                -1.069    0.376       -2.843    4.47e-03
lighting_surfacedark.non-dry                    -0.076    0.226       -0.338    7.35e-01
# Preparing baseline models with all initial variables
poisson_model <- glm(total_victims ~ weather + party_sex + age_group + sobriety
                     + financial_responsibility_bi + cellphone_in_use + race_group
                     + car_age_group + LA_season + time_of_day + air_bag + lighting_surface, 
                     family = "poisson", data = df)

# Negative Binomial model (glm.nb is provided by the MASS package)
nb_model <- MASS::glm.nb(total_victims ~ weather + party_sex + age_group + sobriety
                         + financial_responsibility_bi + cellphone_in_use + race_group
                         + car_age_group + LA_season + time_of_day + air_bag + lighting_surface, 
                         data = df)

# Zero-inflated Poisson model
zip_model <- zeroinfl(total_victims ~ weather + party_sex + age_group + sobriety
                      + financial_responsibility_bi + cellphone_in_use + race_group
                      + car_age_group + LA_season + time_of_day + air_bag + lighting_surface, 
                      data = df, dist = "poisson")
Model comparison (standard errors in parentheses)

Single-equation models
                                                Poisson             Negative Binomial
(Intercept)                                     0.090*** (0.025)    0.091** (0.029)
weatherharder_conditions                        0.006 (0.036)       0.005 (0.041)
weatherother                                    -0.021 (0.018)      -0.023 (0.020)
party_sexmale                                   -0.123*** (0.011)   -0.126*** (0.013)
age_groupover 25                                0.032** (0.012)     0.032* (0.013)
sobrietynot drinking                            -0.037* (0.018)     -0.036+ (0.021)
sobrietyother                                   -0.232*** (0.038)   -0.236*** (0.042)
financial_responsibility_bino or unknown proof  0.123*** (0.016)    0.121*** (0.019)
cellphone_in_use1                               -0.105** (0.036)    -0.105* (0.041)
race_groupnon-white                             -0.033** (0.011)    -0.035** (0.012)
car_age_groupover_15                            0.020 (0.014)       0.023 (0.016)
LA_seasonWet                                    -0.026* (0.011)     -0.027* (0.012)
time_of_dayEvening                              0.019 (0.019)       0.021 (0.022)
time_of_dayMorning                              -0.100*** (0.013)   -0.102*** (0.015)
time_of_dayNight                                -0.158*** (0.025)   -0.156*** (0.029)
air_bagNo                                       -0.797*** (0.012)   -0.796*** (0.014)
air_bagOther safety                             -0.413*** (0.019)   -0.412*** (0.022)
lighting_surfacedark.dry                        -0.040* (0.019)     -0.037+ (0.022)
lighting_surfacedaylight.non-dry                -0.039 (0.029)      -0.031 (0.033)
lighting_surfacedark.non-dry                    -0.095** (0.036)    -0.090* (0.041)

Count component
                                                      ZIP                 Full model (ZINB)   Final model (ZINB)
count_(Intercept)                                     0.125*** (0.035)    0.074* (0.033)      0.065* (0.032)
count_weatherharder_conditions                        0.045 (0.052)       0.059 (0.048)
count_weatherother                                    -0.018 (0.026)      -0.011 (0.024)
count_party_sexmale                                   -0.040* (0.017)     -0.080*** (0.016)   -0.079*** (0.016)
count_age_groupover 25                                -0.039* (0.017)     -0.014 (0.016)      -0.014 (0.016)
count_sobrietynot drinking                            -0.004 (0.025)      -0.033 (0.024)      -0.034 (0.024)
count_sobrietyother                                   -0.082 (0.057)      -0.162** (0.052)    -0.165** (0.052)
count_financial_responsibility_bino or unknown proof  0.169*** (0.023)    0.176*** (0.022)    0.175*** (0.022)
count_cellphone_in_use1                               -0.102+ (0.053)     -0.098* (0.049)     -0.096+ (0.049)
count_race_groupnon-white                             0.073*** (0.017)    0.014 (0.015)       0.014 (0.015)
count_car_age_groupover_15                            -0.002 (0.021)      -0.005 (0.019)      -0.006 (0.019)
count_LA_seasonWet                                    -0.013 (0.016)      -0.019 (0.015)
count_time_of_dayEvening                              0.024 (0.028)       0.018 (0.026)       0.018 (0.026)
count_time_of_dayMorning                              -0.115*** (0.020)   -0.113*** (0.019)   -0.115*** (0.019)
count_time_of_dayNight                                -0.256*** (0.037)   -0.204*** (0.034)   -0.204*** (0.034)
count_air_bagNo                                       -0.435*** (0.019)   -0.543*** (0.020)   -0.542*** (0.020)
count_air_bagOther safety                             -0.429*** (0.033)   -0.420*** (0.022)   -0.420*** (0.022)
count_lighting_surfacedark.dry                        -0.034 (0.027)      -0.050+ (0.026)     -0.050+ (0.026)
count_lighting_surfacedaylight.non-dry                -0.172*** (0.043)   -0.156*** (0.040)   -0.142*** (0.033)
count_lighting_surfacedark.non-dry                    -0.121* (0.052)     -0.133** (0.049)    -0.113** (0.040)

Zero-inflation component
                                                     ZIP                 Full model (ZINB)   Final model (ZINB)
zero_(Intercept)                                     -2.478*** (0.136)   -14.422 (54.178)    -11.373 (12.180)
zero_weatherharder_conditions                        0.189 (0.187)       0.629+ (0.339)
zero_weatherother                                    0.012 (0.079)       0.081 (0.122)
zero_party_sexmale                                   0.333*** (0.055)    0.379*** (0.089)    0.382*** (0.089)
zero_age_groupover 25                                -0.275*** (0.051)   -0.378*** (0.080)   -0.372*** (0.080)
zero_sobrietynot drinking                            0.138 (0.091)       0.014 (0.142)       -0.005 (0.139)
zero_sobrietyother                                   0.535*** (0.156)    0.469* (0.229)      0.438+ (0.228)
zero_financial_responsibility_bino or unknown proof  0.182** (0.069)     0.441*** (0.103)    0.424*** (0.102)
zero_cellphone_in_use1                               0.002 (0.171)       0.029 (0.261)       0.052 (0.255)
zero_race_groupnon-white                             0.433*** (0.059)    0.413*** (0.089)    0.417*** (0.089)
zero_car_age_groupover_15                            -0.104 (0.067)      -0.249* (0.105)     -0.249* (0.105)
zero_LA_seasonWet                                    0.052 (0.049)       0.064 (0.078)
zero_time_of_dayEvening                              0.020 (0.082)       -0.005 (0.128)      -0.016 (0.127)
zero_time_of_dayMorning                              -0.049 (0.062)      -0.084 (0.098)      -0.091 (0.097)
zero_time_of_dayNight                                -0.446*** (0.134)   -0.490* (0.221)     -0.491* (0.219)
zero_air_bagNo                                       1.607*** (0.094)    12.975 (54.178)     9.983 (12.180)
zero_air_bagOther safety                             -0.069 (0.246)      1.833 (62.848)      -0.715 (28.474)
zero_lighting_surfacedark.dry                        0.029 (0.082)       -0.082 (0.129)      -0.076 (0.128)
zero_lighting_surfacedaylight.non-dry                -0.612*** (0.174)   -1.414** (0.444)    -1.069** (0.376)
zero_lighting_surfacedark.non-dry                    -0.118 (0.180)      -0.435 (0.327)      -0.076 (0.226)

Fit statistics
            Poisson             Negative Binomial   ZIP                 Full model (ZINB)   Final model (ZINB)
Num.Obs.    65000               65000               65000               65000               65000
R2                                                  0.040               0.043               0.043
R2 Adj.                                             0.040               0.043               0.043
AIC         128844.3            126737.4            127156.2            126418.0            126416.9
BIC         129025.9            126928.1            127519.5            126790.4            126734.7
Log.Lik.    -64402.136          -63347.708
F           276.214             204.683
RMSE        0.86                0.86                0.86                0.86                0.86
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
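To make the comparison across the five models explicit, the AIC and BIC values from the table above can be ranked directly. This is a minimal sketch that copies the reported figures into a data frame (the object names `fit`, `best_aic` and `best_bic` are illustrative); with the fitted objects available, `AIC()` and `BIC()` on the models would give the same inputs.

```r
# AIC and BIC as reported in the model comparison table
fit <- data.frame(
  model = c("Poisson", "Negative Binomial", "ZIP",
            "Full model (ZINB)", "Final model (ZINB)"),
  AIC = c(128844.3, 126737.4, 127156.2, 126418.0, 126416.9),
  BIC = c(129025.9, 126928.1, 127519.5, 126790.4, 126734.7),
  stringsAsFactors = FALSE
)

# Differences to the best (lowest) value of each criterion
fit$delta_AIC <- fit$AIC - min(fit$AIC)
fit$delta_BIC <- fit$BIC - min(fit$BIC)

# Models ordered from best to worst by AIC
print(fit[order(fit$AIC), ])

best_aic <- fit$model[which.min(fit$AIC)]
best_bic <- fit$model[which.min(fit$BIC)]
```

Lower values indicate a better trade-off between fit and complexity, and BIC penalises the additional parameters of the full model more heavily than AIC does.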

6 Results

We tested eight hypotheses related to the risk and severity of car crashes using a two-part model: a count model for the number of victims and a zero model for the likelihood of zero-victim crashes. Below are the results and interpretations for each hypothesis.

1. Young drivers (under 25) have higher risk of car crash with victims

  • The count model shows no significant difference in the number of victims (p = 0.380), but the zero model indicates that drivers over 25 are significantly less likely to be in zero-victim crashes (β = –0.372, p < 0.001).
  • Hypothesis not supported: Younger drivers are more often involved in zero-victim crashes, while drivers over 25 are more likely to be involved in crashes with victims.

2. Vehicles older than 15 years are at higher risk of car crash with injuries

  • The count model shows no significant difference in the number of victims (p = 0.768), but the zero model shows that crashes involving older vehicles are less likely to be zero-victim (β = –0.249, p = 0.018).
  • Hypothesis supported: Older vehicles are associated with a higher likelihood of crashes with victims, although the expected number of victims does not differ.

3. Drivers who consumed alcohol have higher risk of car crash with victims (compared with the baseline of drinking drivers):

  • “Not drinking” status was not significant in either model.

  • “Other” sobriety states (e.g., drugs, fatigue) were significant in the count model only (β = –0.165, p = 0.002), associated with fewer victims.

  • Hypothesis not supported: Alcohol consumption did not significantly increase crash severity or likelihood in this model. This may be due to underreporting or other contextual factors.

4. Men have higher overall risk of car crash than women

  • Men are associated with fewer victims in the count model (β = –0.079, p < 0.001), and more likely to be in zero-victim crashes (β = +0.382, p < 0.001).
  • Hypothesis rejected: Male drivers appear to be involved in less harmful crashes, contrary to expectations.

5. Women have higher risk of car crash with victims than men

  • The zero model coefficient for male drivers is positive (β = +0.382, p < 0.001), so crashes involving female drivers are less likely to be zero-victim crashes. In the count model, male drivers are associated with fewer victims (β = –0.079, p < 0.001).
  • Hypothesis supported: Women are more likely than men to be involved in crashes with victims, and those crashes tend to involve more victims.

6. Non-white drivers have higher risk of car crash

  • The count model showed no significant difference, but the zero model shows that non-white drivers are more likely to be in zero-victim crashes (β = +0.417, p < 0.001).
  • Hypothesis rejected: Non-white drivers are involved in less harmful crashes overall.

7. Uninsured drivers have higher risk of car crash with victims

  • Uninsured/unknown drivers are associated with more victims (β = +0.175, p < 0.001) and more likely to be in zero-victim crashes (β = +0.424, p < 0.001).
  • Hypothesis supported: While results suggest a paradox (more victims and more harmless crashes), this may reflect underreporting or bimodal crash types (very minor vs. serious).

8. Bad weather conditions indicate higher risk of car crash

  • Daylight/wet and dark/wet conditions are associated with fewer victims in the count model (β = –0.142 and β = –0.113), possibly due to more cautious driving. However, crashes in daylight on wet roads are much less likely to be zero-victim crashes (β = –1.069, p < 0.01).
  • Hypothesis partially rejected: Bad conditions do not increase the number of victims, but daylight/wet conditions increase the likelihood that a crash involves victims.
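The zero-model coefficients quoted above are on the log-odds scale, so exponentiating them gives odds ratios for a crash falling into the zero-victim class. The sketch below does this for the coefficients of the final model that are cited in the hypotheses; the vector `zero_coef` is simply a hand-copied subset of the zero-inflation table, not output from the fitted object.

```r
# Zero-inflation (logit) coefficients copied from the final model
zero_coef <- c(
  "party_sexmale"                                  =  0.382,
  "age_groupover 25"                               = -0.372,
  "car_age_groupover_15"                           = -0.249,
  "race_groupnon-white"                            =  0.417,
  "financial_responsibility_bino or unknown proof" =  0.424,
  "lighting_surfacedaylight.non-dry"               = -1.069
)

# exp(beta) is the multiplicative change in the odds of the zero-victim class
# relative to the reference category of each variable
odds_ratio <- exp(zero_coef)

print(round(odds_ratio, 2))
```

Values above 1 point to higher odds of a zero-victim crash than in the reference category, and values below 1 to lower odds.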

7 Findings

In this study, we employed the Zero-Inflated Negative Binomial (ZINB) model to examine the determinants of traffic accidents and their severity. This approach enabled us to analyze each variable’s effect in two distinct dimensions:

  • Zero-inflation (logit) component: assessed the likelihood of an accident being non-injurious (zero-victim crash)
  • Count (negative binomial) component: estimated the expected number of victims for crashes that do not belong to the always-zero group.

1. Gender

Driver gender emerged as an influential factor. Male drivers were more likely to be involved in zero-victim accidents and, in the count component, were associated with approximately 7.6% fewer victims than female drivers.

2. Age of driver

Contrary to traditional expectations, younger drivers (under 25) were more likely to be involved in non-injury crashes. This suggests that older drivers may contribute more frequently to crashes with physical harm.

3. Age of vehicle

Vehicle age also played a notable role. Cars older than 15 years were less likely to be involved in zero-victim accidents, indicating a higher likelihood of causing injury-related crashes (potentially due to outdated safety features or poorer maintenance).

4. Insurance status

Lack of verified insurance was significantly associated with both crash dimensions. Uninsured drivers were more likely to be involved in zero-victim crashes, possibly reflecting minor or underreported incidents. However, when such drivers were involved in injury crashes, these were linked to ~19% more victims.

5. Driver’s race

Driver race also influenced outcomes. Non-white drivers were more likely to be involved in crashes with no injuries, implying that white drivers may be more frequently involved in injury-related accidents.

6. Time of day

The time of day significantly shaped both injury risk and crash severity. Nighttime driving increased the likelihood of injury crashes, but such incidents involved ~18.5% fewer victims (possibly due to fewer passengers). Morning accidents also resulted in ~11% fewer victims than afternoon ones.

7. Airbag and safety equipment

Surprisingly, accidents in which airbags did not deploy were associated with ~42% fewer victims. This may reflect lower-impact collisions that do not activate airbags. Similarly, crashes involving other safety features (e.g. seatbelts) involved ~34% fewer victims, aligning with expectations regarding passive safety measures.

8. Weather and road surface

Environmental conditions had a measurable effect. During daytime on wet roads, crashes were more likely to involve injuries but included ~13% fewer victims, likely due to more cautious driving. Under dark and wet conditions, victim counts were also ~11% lower (once again possibly due to fewer passengers).

9. Sobriety

Interestingly, no significant difference was observed between drinking and non-drinking drivers. However, drivers under the influence of other substances (e.g. drugs) were associated with ~15% fewer victims in injury crashes.

10. Phone use

Phone use by the driver was also included in the model. Although the variable was jointly significant, its individual coefficients were not significant at the 5% level in the final model, so it did not yield a clear comparison.
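The percentage figures in this section follow from the count-model coefficients through the transformation 100 × (exp(β) − 1). The sketch below applies it to the final-model estimates; `count_coef` is a hand-copied subset of the count table, and the last line shows how the same coefficient reads when the reference category is reversed.

```r
# Count (negative binomial) coefficients copied from the final model
count_coef <- c(
  "party_sexmale"                                  = -0.079,
  "sobrietyother"                                  = -0.165,
  "financial_responsibility_bino or unknown proof" =  0.175,
  "time_of_dayMorning"                             = -0.115,
  "time_of_dayNight"                               = -0.204,
  "air_bagNo"                                      = -0.542,
  "air_bagOther safety"                            = -0.420,
  "lighting_surfacedaylight.non-dry"               = -0.142,
  "lighting_surfacedark.non-dry"                   = -0.113
)

# Percentage change in the expected number of victims relative to the
# reference category of each variable
pct_change <- 100 * (exp(count_coef) - 1)
print(round(pct_change, 1))

# Reversing the comparison (female relative to male) flips the sign of beta,
# so the percentage is not simply the negative of the male figure
pct_female_vs_male <- 100 * (exp(-count_coef[["party_sexmale"]]) - 1)
print(round(pct_female_vs_male, 1))
```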

7.1 Next Steps and Alternative Approaches

While the ZINB model allows for an enhanced understanding of both the occurrence and the severity of crashes, it comes with some limitations. The model’s dual nature can complicate interpretation for policymakers or practitioners seeking straightforward insights.

A potential alternative would be to separate the analysis into two simpler models:

  • Logistic regression predicting whether any injuries occurred (binary outcome),
  • Truncated count model (e.g. negative binomial) estimating the number of victims in injury-only crashes.

This modular approach may improve interpretability while still addressing overdispersion and zero inflation where relevant. Future research could also explore spatial effects (e.g. region or county-level risk), driver behaviour history or usage of machine learning classifiers to capture nonlinear interactions across variables.
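One way to sketch the two-part alternative described above is a hurdle model, which combines a binary model for whether any victims occur with a zero-truncated count model for the positive counts. The example below uses `pscl::hurdle()` on simulated data rather than the crash data; `df_sim`, `night` and `draw_positive()` are hypothetical names introduced only for illustration.

```r
library(pscl)

set.seed(42)
n <- 2000
df_sim <- data.frame(night = rbinom(n, 1, 0.3))  # hypothetical binary covariate

# Part 1: does the crash involve any victims? (logit link)
any_victim <- rbinom(n, 1, plogis(-0.5 + 0.8 * df_sim$night))

# Part 2: positive counts from a zero-truncated negative binomial,
# obtained by redrawing until no zeros remain
draw_positive <- function(mu, size) {
  y <- rnbinom(length(mu), size = size, mu = mu)
  while (any(y == 0)) {
    zero <- y == 0
    y[zero] <- rnbinom(sum(zero), size = size, mu = mu[zero])
  }
  y
}
victims_pos <- draw_positive(exp(0.3 - 0.2 * df_sim$night), size = 3)
df_sim$total_victims <- any_victim * victims_pos

# Binomial hurdle for zero vs. positive, truncated negative binomial for counts
hurdle_fit <- hurdle(total_victims ~ night, data = df_sim,
                     dist = "negbin", zero.dist = "binomial")
print(summary(hurdle_fit))
```

Note that the binary part of a hurdle model describes the probability of a positive count, so its coefficients have the opposite orientation to the zero-inflation component of `zeroinfl()`, which describes the probability of an excess zero.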

8 Bibliography

  1. Berhanu Y., Alemayehu E., Schröder D. (2023). Examining Car Accident Prediction Techniques and Road Traffic Congestion: A Comparative Analysis of Road Safety and Prevention of World Challenges in Low-Income and High-Income Countries. Journal of Advanced Transportation, https://doi.org/10.1155/2023/6643412

  2. Blows S., Ivers R., Connor J., Ameratunga S., Norton R. (2003). Car insurance and the risk of car crash injury. Accident Analysis & Prevention, 35(6), 987–990. https://doi.org/10.1016/S0001-4575(02)00106-9

  3. Blows, S., Ivers, R. Q., Woodward, M., Connor, J., & Norton, R. (2003). Vehicle year and the risk of car crash injury. Injury Prevention, 9(4), 353–356. https://doi.org/10.1136/ip.9.4.353

  4. Braver, E. R. (2003). Race, Hispanic origin, and socioeconomic status in relation to motor vehicle occupant death rates and risk factors among adults. Accident Analysis & Prevention, 35(3), 295–309. https://doi.org/10.1016/S0001-4575(01)00106-3

  5. Brijs, T., Karlis, D., & Wets, G. (2008). Studying the effect of weather conditions on daily crash counts using a discrete time-series model. Accident Analysis & Prevention, 40(3), 1180–1190. https://doi.org/10.1016/j.aap.2008.01.001

  6. Connor J., Norton R., Ameratunga S., Jackson R. (2004). The contribution of alcohol to serious car crash injuries. Epidemiology, 15(3), 337–344. https://doi.org/10.1097/01.ede.0000120045.58295.86

  7. Cullen P., Möller H., Woodward M., Senserrick T., Boufous S., Rogers K., Brown J., Ivers R. (2021). Are there sex differences in crash and crash-related injury between men and women? A 13-year cohort study of young drivers in Australia. SSM - Population Health, 14, 100816. https://doi.org/10.1016/j.ssmph.2021.100816

  8. Lam L. T. (2002). Distractions and the risk of car crash injury: The effect of drivers’ age. Journal of Safety Research, 33(3), 411–419. https://doi.org/10.1016/S0022-4375(02)00034-8

  9. Ratiu S. (2003). The history of the internal combustion engine. Annals of the Faculty of Engineering Hunedoara, Tome I, 3. https://annals.fih.upt.ro/pdf-full/2003/ANNALS-2003-3-21.pdf

  10. World Health Organization. (2023). Road traffic injuries. https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries

  11. Yau, K. K. W., Wang, K., & Lee, A. H. (2003). Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometrical Journal, 45(4), 437–452. https://doi.org/10.1002/bimj.200390024