Modelling car accident victims using Zero-Inflated Negative Binomial regression

1 Abstract

Understanding the dynamics of road traffic accidents requires models that can address both the frequency and severity of such events. Traditional count models often fail to capture the excess zeros present in accident datasets, where many crashes result in no injuries. To address this limitation, we apply a Zero-Inflated Negative Binomial (ZINB) regression to model the number of victims in road accidents. This approach distinguishes between two processes: the likelihood of a crash being non-injurious and the count of victims in injurious crashes. Using real-world traffic accident data from California, we test multiple hypotheses concerning driver characteristics (age, gender, alcohol use, race), vehicle attributes (age, insurance status), and environmental conditions (weather). Our findings reveal nuanced relationships: for example, older vehicles and poor weather increase the likelihood of injury crashes, while younger drivers are more frequently involved in non-injurious incidents. Interestingly, male drivers are more likely to be involved in crashes without injuries, while crashes involving female drivers tend to result in a higher number of victims. These insights highlight the value of dual-structure models in traffic safety research and support more targeted and evidence-based policymaking in road safety and driver education.

2 Introduction

Despite major advances in automotive safety and infrastructure development, road accidents remain a leading cause of injury and death globally (WHO, 2023). With millions of collisions occurring every year, understanding the factors that contribute to accident frequency and severity is not only of scientific interest but also a public health and economic priority. Identifying the characteristics of drivers, vehicles and environments that lead to more severe outcomes could inform better road safety policies, insurance models and educational campaigns.

Traditional studies of road accidents have often focused on binary outcomes (whether an accident occurred or not) or have relied on simple count models to assess the severity of injuries (Berhanu et al., 2023). However, traffic accident data typically contain a high proportion of zeros, because many crashes result in no injuries at all. Ignoring these excess zeros can lead to biased estimates and misleading conclusions. Moreover, some variables influence the likelihood of injury, while others affect the number of victims, conditional on an injury occurring. Standard modelling techniques struggle to capture this two-part structure.

To address this issue, we apply a Zero-Inflated Negative Binomial (ZINB) regression model. This allows us to analyse the probability of injury and the severity of outcomes within a single framework. The zero-inflated structure handles the large number of non-injury crashes, while the negative binomial (count model) component adjusts for overdispersion in the number of victims (Yau et al., 2003). This dual perspective enables us to form a deeper understanding of the impact of various factors, such as driver age and gender, vehicle age, alcohol use or insurance status, on road accident outcomes.

Our aim is to contribute to a more comprehensive understanding of road safety risks by modelling both injury probability and injury count in a unified framework. The insights derived from this analysis can support more data-driven safety measures and better-informed policy decisions.

3 Literature Review

The history of the car can be traced back to 1885, when German engineer Karl Benz created the first “modern” automobile. This machine was self-propelled, gasoline-powered, had three wheels and could be steered and stopped by the driver (Ratiu, 2003). Since then the car industry has revolutionised transportation all over the globe. With the increasing popularity of this new method of travel, the frequency of car accidents also started to grow.

Driver age is believed to be one of the major factors affecting the probability of a car crash. Lawrence (2002) analysed nearly 64 000 accidents in New South Wales, Australia, between 1996 and 2000 and also considered various distractions such as phone use. Injury-crash rates were highest for teenagers and young adults (16–24 years). Moreover, the pattern was “U-shaped”: crash rates were lowest for middle-aged groups and higher again for the oldest drivers. For phone use, a significant difference appeared only in the 25–29 group, whose crash risk while phoning was 2.4 times higher than without a phone. These results may be outdated, since phones are now far more common across every age group. The age of the vehicle also appears to matter. According to a study conducted by Australian researchers in 2003, cars built before 1984 (> 15 years old) had a significantly greater chance of being involved in an injury-producing crash than those manufactured after 1994 (< 5 years old). Older vehicles carried roughly triple the injury risk of newer ones, and risk rose by 5 % for every additional year of a vehicle’s age (Blows et al., 2003).

Another highly important factor affecting accident risk is alcohol. Another Australian paper highlighted the marked effect of drinking even small amounts. Drivers with between 0.3 and 0.5‰ alcohol in their blood had a 3 times higher crash risk than a fully sober driver (Connor et al., 2004). That amount corresponds to only one or two drinks. Consuming more than two drinks within six hours raised the crash risk eight times. The greatest change was visible once the law was broken: if the legal limit of 0.5‰ was exceeded, the risk was 23 times higher than for a sober driver (Connor et al., 2004). The study underlined the extreme effect of consuming alcohol before driving and its dangerous consequences, not only for the driver.

Sex differences have also been documented. In a 2021 Australian cohort study of young drivers, men had higher overall crash rates, whereas women showed greater odds of crash-related hospitalisation (Cullen et al., 2021). Because the cohort was followed for 13 years, the authors were able to show that risk decreased with each additional year of driving experience. The influence of race, especially in the United States, is another topic often raised when comparing statistics. Braver (2001) computed passenger-vehicle deaths per 10 million trips among persons aged 25–64. Black men had a 48 % higher risk of dying in a crash than white men. Similar patterns were observed for women, though exact differences were not presented. Hispanic men also faced elevated risk (+26 %), but no significant difference was found between Hispanic and white women. Education modified these effects: men without a high-school diploma had 3.5 times higher risk than graduates, and women 2.8 times, with the highest risk recorded among whites without a diploma, indicating that low education may override racial patterns (Braver, 2001).

A car crash often imposes overwhelming economic costs on drivers. Insurance coverage can mitigate such costs, yet not all motorists purchase it. According to research from 2003, uninsured drivers were far more likely to be involved in a crash that hospitalised or killed someone: their risk was 4.8 times higher than that of insured drivers (Blows et al., 2003). All the previous factors relate to individuals’ characteristics or decisions. Weather, in contrast, is outside the driver’s control, yet it also contributes to a higher risk of car accidents. Bad conditions, especially heavy rain, ice or poor visibility, significantly raise the number of crashes (Brijs et al., 2008).

3.1 Hypotheses

After reviewing existing literature, we propose to test eight hypotheses:

Young drivers (under 25) have a higher risk of a car crash with victims
Vehicles older than 15 years have a higher risk of a car crash with injuries
Drivers who consumed alcohol have a higher risk of a car crash with victims
Men have a higher overall risk of a car crash than women
Women have a higher risk of a car crash with victims than men
Non-white drivers have a higher risk of a car crash
Uninsured drivers have a higher risk of a car crash with victims
Bad weather conditions are associated with a higher risk of a car crash

4 Data

We use data from the California Highway Patrol (CHP), which cover collisions from January 1st, 2001 until mid-December 2020. The Statewide Integrated Traffic Records System (SWITRS) contains all crashes that were reported to CHP by local and governmental agencies.

The data have a hierarchical structure.

The CRASH table contains information on each crash, one line per crash.
The PARTY table contains information from all parties involved in the crash, one line per party. Parties are the major players in a traffic crash - drivers, pedestrians, bicyclists, and parked vehicles. The information includes personal descriptors and vehicle descriptors.
The VICTIM table contains information about the victims - persons associated with each party. For example, a motorcyclist and his passenger are each a victim. Injury severity is included in the VICTIM table.

4.1 Loading Data from SQLite

In this section, we extract and prepare a sample dataset of traffic collisions from a local SQLite database (switrs.sqlite), focusing on drivers who were determined to be at fault. The steps include:

Connecting to the SQLite database.
Filtering the parties table to include only drivers at fault with complete demographic and vehicle-related information.
Randomly sampling up to 100,000 such drivers, keeping only one driver per collision (case_id).
Retrieving the corresponding collisions data for the selected drivers.
Merging the driver and collision data into a single dataset.
Displaying a preview of the resulting data in a formatted, scrollable table.
Exporting the final merged dataset to a CSV file for further analysis.

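# Load the packages used below (DBI/RSQLite for the database, dplyr for data manipulation)
library(DBI)
library(dplyr)

# Connect to SQLite database
conn <- dbConnect(RSQLite::SQLite(), dbname = "switrs.sqlite")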

# Query drivers who were at fault and have complete information
parties_query <- "
  SELECT *
  FROM parties
  WHERE party_type = 'driver'
    AND at_fault = 1
    AND TRIM(party_age) != ''
    AND TRIM(party_sex) != ''
    AND TRIM(party_race) != ''
    AND TRIM(cellphone_in_use) != ''
    AND TRIM(financial_responsibility) != ''
    AND TRIM(party_sobriety) != ''
    AND TRIM(vehicle_make) != ''
    AND TRIM(vehicle_year) != ''
"

parties_filtered <- dbGetQuery(conn, parties_query)

# Sample up to 100,000 drivers
set.seed(123)
parties_sample_raw <- parties_filtered %>%
  sample_n(size = min(100000, nrow(.)))

# Keep one driver per case_id
parties_sample <- parties_sample_raw %>%
  group_by(case_id) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Get matching collisions data
case_ids <- paste0("'", parties_sample$case_id, "'", collapse = ", ")
collisions_query <- paste0("
  SELECT *
  FROM collisions
  WHERE case_id IN (", case_ids, ")
")

collisions_subset <- dbGetQuery(conn, collisions_query)

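# Merge datasets
df <- parties_sample %>%
  left_join(collisions_subset, by = "case_id")

# Export the merged dataset for the cleaning step
write.csv(df, "driver_at_fault_sample.csv", row.names = FALSE)

dbDisconnect(conn)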
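
Pasting quoted ids into an IN (...) clause works, but one alternative is to write the sampled ids to a temporary table and join against it, which avoids building a very long SQL string by hand. The following is a minimal sketch on an in-memory database with toy rows; the table name sampled_ids and the toy values are hypothetical and are not part of the pipeline above.

```r
library(DBI)

# In-memory database with a tiny stand-in for the collisions table (toy values)
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "collisions", data.frame(
  case_id   = c("0000048", "0000227", "0000378"),
  weather_1 = c("clear", "clear", "raining"),
  stringsAsFactors = FALSE
))

# Hypothetical helper table holding the sampled case_ids
sampled_ids <- data.frame(case_id = c("0000048", "0000378"),
                          stringsAsFactors = FALSE)
dbWriteTable(conn, "sampled_ids", sampled_ids, temporary = TRUE)

# Join instead of pasting every id into the query text
collisions_subset <- dbGetQuery(conn, "
  SELECT c.*
  FROM collisions AS c
  INNER JOIN sampled_ids AS s ON c.case_id = s.case_id
")

dbDisconnect(conn)
```

The join returns the same rows as the IN filter would, and the ids are passed as data and not as part of the SQL text.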

Sample of merged dataset
id	case_id	party_number	party_type	at_fault	party_sex	party_age	party_sobriety	party_drug_physical	direction_of_travel	party_safety_equipment_1	party_safety_equipment_2	financial_responsibility	hazardous_materials	cellphone_use_type	school_bus_related	oaf_violation_code	oaf_violation_category	oaf_violation_section	oaf_violation_suffix	other_associate_factor_1	other_associate_factor_2	movement_preceding_collision	vehicle_year	vehicle_make	statewide_vehicle_type	chp_vehicle_type_towing	chp_vehicle_type_towed	party_race	jurisdiction	officer_id	reporting_district	chp_shift	population	county_city_location	county_location	beat_type	chp_beat_type	city_division_lapd	chp_beat_class	beat_number	primary_road	secondary_road	distance	direction	weather_1	weather_2	state_highway_indicator	caltrans_county	caltrans_district	state_route	route_suffix	postmile_prefix	postmile	location_type	ramp_intersection	side_of_highway	tow_away	collision_severity	party_count	primary_collision_factor	pcf_violation_code	pcf_violation_category	pcf_violation	pcf_violation_subsection	hit_and_run	type_of_collision	motor_vehicle_involved_with	pedestrian_action	road_surface	road_condition_1	road_condition_2	lighting	control_device	chp_road_type	truck_collision	not_private_property	alcohol_involved	statewide_vehicle_type_at_fault	chp_vehicle_type_at_fault	primary_ramp	secondary_ramp	latitude	longitude	collision_date	collision_time	process_date
90	0000048	1	driver	1	male	41	had not been drinking	NA	east	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	changing lanes	1996	ford	pickup or panel truck	pickups & panels	00	hispanic	9575	9820	NA	1400 thru 2159	50000 to 100000	1902	los angeles	chp state highway	interstate	NA	chp other	065	RT 210	BALDWIN AV	20	east	clear	NA	1	los angeles	7	210	NA	R	30.85	highway	NA	eastbound	0	property damage only	2	vehicle code violation	NA	unsafe lane change	21658	A	not hit and run	sideswipe	other motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	1	0	1	NA	pickup or panel truck	pickups & panels	NA	NA	NA	NA	2002-01-12	15:15:00	2002-05-28
427	0000227	1	driver	1	male	45	had not been drinking	NA	south	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	proceeding straight	1992	chevrolet	passenger car	mini-vans	00	white	9530	16779	NA	0600 thru 1359	>250000	1941	los angeles	chp state highway	interstate	NA	chp other	072	RT 710	DEL AMO BL	528	north	clear	NA	1	los angeles	7	710	NA	NA	10.72	highway	NA	northbound	0	property damage only	2	vehicle code violation	NA	NA	NA	NA	not hit and run	rear end	other motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	1	0	1	NA	passenger car	mini-vans	NA	NA	NA	NA	2002-02-05	08:20:00	2002-03-15
717	0000378	1	driver	1	male	18	had not been drinking	NA	south	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	proceeding straight	1998	dodge	passenger car	passenger car, station	00	hispanic	9530	12519	NA	0600 thru 1359	>250000	1941	los angeles	chp state highway	interstate	NA	chp other	430	RT 405	BELLFLOWER BL	136	north	clear	NA	1	los angeles	7	405	NA	NA	2.33	highway	NA	southbound	0	property damage only	2	vehicle code violation	NA	speeding	22350	NA	not hit and run	rear end	other motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	1	0	1	NA	passenger car	passenger car, station	NA	NA	NA	NA	2002-02-04	07:40:00	2002-06-19
1087	0000578	1	driver	1	male	36	had not been drinking	NA	south	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	other	1994	ford	passenger car	passenger car, station	00	white	9565	15286	NA	0600 thru 1359	25000 to 50000	1918	los angeles	chp county roadline	county road line	NA	chp other	021	RT 405	JEFFERSON BL	50	north	clear	NA	1	los angeles	7	405	NA	NA	25.97	highway	NA	southbound	1	property damage only	1	vehicle code violation	NA	improper turning	22107	NA	not hit and run	hit object	fixed object	no pedestrian involved	dry	normal	NA	daylight	none	1	0	1	NA	passenger car	passenger car, station	NA	NA	NA	NA	2002-01-11	07:25:00	2002-06-19
1442	0000781	1	driver	1	male	24	had not been drinking	NA	south	lap/shoulder harness used	NA	no proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	proceeding straight	1994	nissan	passenger car	passenger car, station	00	hispanic	9690	13096	NA	1400 thru 2159	25000 to 50000	3045	orange	chp state highway	interstate	NA	chp other	055	RT 5	EL TORO RD	50	south	clear	NA	1	orange	12	5	NA	NA	18.68	highway	NA	southbound	0	property damage only	2	vehicle code violation	NA	speeding	22350	NA	not hit and run	rear end	other motor vehicle	no pedestrian involved	dry	normal	NA	dark with street lights	none	1	0	1	NA	passenger car	passenger car, station	NA	NA	NA	NA	2002-02-15	18:30:00	2002-05-28
1474	0000797	1	driver	1	male	51	had not been drinking	NA	south	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	proceeding straight	1991	honda	passenger car	passenger car, station	00	white	9530	13970	NA	0600 thru 1359	>250000	1941	los angeles	chp state highway	interstate	NA	chp other	072	RT 710	PACIFIC AV	200	west	clear	NA	1	los angeles	7	405	NA	NA	7.06	ramp	ramp exit, last 50 feet	southbound	1	property damage only	2	vehicle code violation	NA	speeding	22350	NA	not hit and run	rear end	other motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	1	0	1	NA	passenger car	passenger car, station	TR	NA	NA	NA	2002-02-05	09:10:00	2002-06-19
1710	0000923	1	driver	1	male	48	had not been drinking	NA	south	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	entering traffic	1992	oldsmobile	passenger car	NA	NA	black	101	44	NA	not chp	50000 to 100000	0101	alameda	not chp	not chp	NA	not chp	NA	BROADWAY	SANTA CLARA AV	200	north	cloudy	NA	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	0	property damage only	2	vehicle code violation	NA	unsafe starting or backing	22106	NA	not hit and run	sideswipe	other motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	0	0	1	NA	passenger car	NA	NA	NA	NA	NA	2002-02-18	15:50:00	2002-03-16
2752	0001531	2	driver	1	male	43	had not been drinking	NA	east	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	none apparent	NA	changing lanes	1998	volvo	truck or truck tractor with trailer	truck tractor	semi	hispanic	9575	9307	NA	0600 thru 1359	>250000	1942	los angeles	chp state highway	state route	NA	chp other	043	RT 134	SAN FERNANDO RD	1000	west	clear	NA	1	los angeles	7	134	NA	R	5.72	highway	NA	eastbound	0	property damage only	3	vehicle code violation	NA	unsafe lane change	21658	A	not hit and run	sideswipe	other motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	1	1	1	NA	truck or truck tractor with trailer	truck tractor	NA	NA	NA	NA	2002-01-31	08:45:00	2002-05-24
2810	0001565	1	driver	1	female	55	had not been drinking	NA	south	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	inattention	NA	proceeding straight	1995	dodge	other bus	NA	NA	white	4003	5529	02	not chp	2500 to 10000	4003	san luis obispo	not chp	not chp	NA	not chp	001	EMBARCADERO	FRONT ST	60	north	clear	NA	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	0	property damage only	2	other improper driving	NA	other improper driving	NA	NA	not hit and run	sideswipe	parked motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	0	0	1	NA	other bus	NA	NA	NA	NA	NA	2002-02-08	10:35:00	2002-03-28
2836	0001584	1	driver	1	male	45	had not been drinking	NA	west	lap/shoulder harness used	NA	proof of insurance obtained	NA	cellphone not in use	NA	NA	NA	NA	NA	entering/leaving ramp	NA	changing lanes	2001	volvo	passenger car	passenger car, station	00	asian	9340	013316	NA	1400 thru 2159	100000 to 250000	4314	santa clara	chp county roadline	county road line	NA	chp other	530	MONTAGUE EXPWY	LAURELWOOD DR	50	west	clear	NA	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	0	property damage only	2	vehicle code violation	NA	unsafe lane change	21658	A	not hit and run	sideswipe	other motor vehicle	no pedestrian involved	dry	normal	NA	daylight	none	0	0	1	NA	passenger car	passenger car, station	NA	NA	NA	NA	2002-01-16	15:15:00	2002-03-25

4.2 Cleaning nulls

In this section, we clean the merged dataset by handling missing or incomplete values. The process involves:

Loading the previously saved dataset of drivers at fault.
Defining a helper function to compute the count and percentage of missing or empty values for each column.
Generating a summary table of missing data across all variables and displaying the top 50 columns with the most missingness.
Dropping columns with more than 40% missing or empty values to reduce noise and sparsity.
Saving a version of the dataset with the retained columns.
Recalculating and displaying the missing data statistics after column filtering.
Removing all rows that still contain any missing or empty fields, ensuring the dataset is complete.
Saving the final, fully cleaned dataset to a new CSV file.
Printing the number of remaining observations after cleaning.

This prepares the dataset by ensuring a consistent and complete structure.

# Load dataset
df <- read.csv("driver_at_fault_sample.csv", stringsAsFactors = FALSE)

# Helper function: count and percent of missing or empty values
null_stats <- function(x) {
  total <- length(x)
  nulls <- sum(is.na(x) | trimws(x) == "")
  percent <- round(nulls / total * 100, 2)
  c(Count = nulls, Percent = percent)
}

# Compute missing value summary for all columns
null_summary_all <- as.data.frame(t(sapply(df, null_stats)))
null_summary_all$Column <- rownames(null_summary_all)
null_summary_all <- null_summary_all %>%
  select(Column, everything()) %>%
  arrange(desc(Percent))

Missing data before column filtering
	Column	Count	Percent
pcf_violation_code	pcf_violation_code	99996	100.00
oaf_violation_code	oaf_violation_code	99991	99.99
hazardous_materials	hazardous_materials	99979	99.98
route_suffix	route_suffix	99882	99.88
school_bus_related	school_bus_related	99870	99.87
road_condition_2	road_condition_2	99540	99.54
secondary_ramp	secondary_ramp	99128	99.13
oaf_violation_suffix	oaf_violation_suffix	97611	97.61
primary_ramp	primary_ramp	97478	97.48
other_associate_factor_2	other_associate_factor_2	96815	96.82
weather_2	weather_2	96791	96.79
party_drug_physical	party_drug_physical	95701	95.70
city_division_lapd	city_division_lapd	95675	95.67
ramp_intersection	ramp_intersection	94034	94.03
alcohol_involved	alcohol_involved	89462	89.46
postmile_prefix	postmile_prefix	89244	89.24
oaf_violation_category	oaf_violation_category	88599	88.60
oaf_violation_section	oaf_violation_section	88548	88.55
reporting_district	reporting_district	73066	73.07
caltrans_county	caltrans_county	66926	66.93
caltrans_district	caltrans_district	66926	66.93
state_route	state_route	66926	66.93
postmile	postmile	66926	66.93
location_type	location_type	66926	66.93
side_of_highway	side_of_highway	66927	66.93
pcf_violation_subsection	pcf_violation_subsection	64524	64.52
latitude	latitude	58132	58.13
longitude	longitude	58132	58.13
chp_vehicle_type_towed	chp_vehicle_type_towed	45782	45.78
direction	direction	19813	19.81
chp_vehicle_type_towing	chp_vehicle_type_towing	8633	8.63
chp_vehicle_type_at_fault	chp_vehicle_type_at_fault	8633	8.63
party_safety_equipment_2	party_safety_equipment_2	7845	7.85
statewide_vehicle_type	statewide_vehicle_type	6128	6.13
statewide_vehicle_type_at_fault	statewide_vehicle_type_at_fault	6128	6.13
beat_number	beat_number	5320	5.32
party_safety_equipment_1	party_safety_equipment_1	1687	1.69
other_associate_factor_1	other_associate_factor_1	1364	1.36
pcf_violation	pcf_violation	1011	1.01
pcf_violation_category	pcf_violation_category	631	0.63
type_of_collision	type_of_collision	467	0.47
tow_away	tow_away	456	0.46
intersection	intersection	451	0.45
road_surface	road_surface	448	0.45
road_condition_1	road_condition_1	361	0.36
control_device	control_device	298	0.30
motor_vehicle_involved_with	motor_vehicle_involved_with	274	0.27
direction_of_travel	direction_of_travel	261	0.26
officer_id	officer_id	263	0.26
lighting	lighting	261	0.26

# Drop columns with more than 40% missing data
columns_to_keep <- null_summary_all %>%
  filter(Percent <= 40) %>%
  pull(Column)

df_cleaned <- df[, columns_to_keep]

# Save cleaned dataset (columns only)
write.csv(df_cleaned, "driver_at_fault_sample_cleaned.csv", row.names = FALSE)

# Recalculate missing stats after dropping columns
null_summary_cleaned <- as.data.frame(t(sapply(df_cleaned, null_stats)))
null_summary_cleaned$Column <- rownames(null_summary_cleaned)
null_summary_cleaned <- null_summary_cleaned %>%
  select(Column, everything()) %>%
  arrange(desc(Percent))

Missing data after dropping columns
	Column	Count	Percent
direction	direction	19813	19.81
chp_vehicle_type_towing	chp_vehicle_type_towing	8633	8.63
chp_vehicle_type_at_fault	chp_vehicle_type_at_fault	8633	8.63
party_safety_equipment_2	party_safety_equipment_2	7845	7.85
statewide_vehicle_type	statewide_vehicle_type	6128	6.13
statewide_vehicle_type_at_fault	statewide_vehicle_type_at_fault	6128	6.13
beat_number	beat_number	5320	5.32
party_safety_equipment_1	party_safety_equipment_1	1687	1.69
other_associate_factor_1	other_associate_factor_1	1364	1.36
pcf_violation	pcf_violation	1011	1.01
pcf_violation_category	pcf_violation_category	631	0.63
type_of_collision	type_of_collision	467	0.47
tow_away	tow_away	456	0.46
intersection	intersection	451	0.45
road_surface	road_surface	448	0.45
road_condition_1	road_condition_1	361	0.36
control_device	control_device	298	0.30
motor_vehicle_involved_with	motor_vehicle_involved_with	274	0.27
direction_of_travel	direction_of_travel	261	0.26
officer_id	officer_id	263	0.26
lighting	lighting	261	0.26
weather_1	weather_1	180	0.18
collision_time	collision_time	163	0.16
jurisdiction	jurisdiction	135	0.14
movement_preceding_collision	movement_preceding_collision	130	0.13
population	population	16	0.02
pedestrian_action	pedestrian_action	24	0.02
chp_beat_class	chp_beat_class	6	0.01
state_highway_indicator	state_highway_indicator	12	0.01
primary_collision_factor	primary_collision_factor	8	0.01
id	id	0	0.00
case_id	case_id	0	0.00
party_number	party_number	0	0.00
party_type	party_type	0	0.00
at_fault	at_fault	0	0.00
party_sex	party_sex	0	0.00
party_age	party_age	0	0.00
party_sobriety	party_sobriety	0	0.00
financial_responsibility	financial_responsibility	0	0.00
cellphone_in_use	cellphone_in_use	0	0.00
cellphone_use_type	cellphone_use_type	0	0.00
party_number_killed	party_number_killed	0	0.00
party_number_injured	party_number_injured	0	0.00
vehicle_year	vehicle_year	0	0.00
vehicle_make	vehicle_make	0	0.00
party_race	party_race	0	0.00
chp_shift	chp_shift	0	0.00
county_city_location	county_city_location	0	0.00
county_location	county_location	0	0.00
special_condition	special_condition	0	0.00

# Drop rows with any missing or empty values
df_final <- df_cleaned %>%
  filter(across(everything(), ~ !(is.na(.) | trimws(.) == "")))

# Save final dataset
write.csv(df_final, "driver_at_fault_sample_cleaned_no_na.csv", row.names = FALSE)

Number of rows after full cleanup: 65552

4.3 Feature Pruning and Engineering for Modeling

In this section, we finalize our feature selection and perform additional feature engineering to prepare the dataset for modeling. Specifically:

We load the pre-cleaned dataset containing no missing values.
A broad set of columns is removed, including identifiers, redundant location codes, officer/jurisdiction data, detailed injury/fatality breakdowns, and fields that are either irrelevant, redundant, or too granular for our modeling goals.
The collision date is converted to a Date object and used to derive the vehicle’s age at the time of the crash.
Injury and fatality counts are aggregated into a single total_victims variable.
The season of the year in which each collision occurred is inferred based on the collision date.
Vehicle make is used to infer the origin region of the vehicle (e.g., USA, Japan, Korea, Europe, etc.).
The collision time is parsed to extract the hour and categorize it into a general time of day (e.g., Morning, Afternoon, Night).
The processed dataset is saved for downstream modeling steps, and the final number of features (columns) is printed.
These transformations help reduce dimensionality, engineer informative features, and ensure consistency across variables.
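
The code for these steps is not shown, so the following is only a minimal sketch of how the derived variables could be built in base R. The toy rows, the month-to-season mapping, the hour cut points for time of day and the make-to-region lookup are all assumptions made for illustration; the column names follow the tables above.

```r
# Toy rows standing in for the cleaned dataset (values are illustrative)
toy <- data.frame(
  collision_date       = c("2002-01-12", "2003-07-04", "2002-10-30"),
  collision_time       = c("15:15:00", "08:20:00", "23:05:00"),
  vehicle_year         = c(1996, 2002, 1990),
  vehicle_make         = c("ford", "honda", "volvo"),
  party_number_killed  = c(0, 0, 1),
  party_number_injured = c(0, 2, 1),
  stringsAsFactors = FALSE
)

# Vehicle age at the time of the crash
toy$collision_date <- as.Date(toy$collision_date)
crash_year  <- as.integer(format(toy$collision_date, "%Y"))
toy$car_age <- crash_year - toy$vehicle_year

# Total victims = killed + injured
toy$total_victims <- toy$party_number_killed + toy$party_number_injured

# Season from the calendar month (assumed mapping: Dec-Feb = Winter, etc.)
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
toy$season <- season_by_month[as.integer(format(toy$collision_date, "%m"))]

# Time of day from the hour (assumed cut points: 0-5 Night, 6-11 Morning,
# 12-17 Afternoon, 18-21 Evening, 22-23 Night)
hour <- as.integer(substr(toy$collision_time, 1, 2))
tod_labels <- c("Night", "Morning", "Afternoon", "Evening", "Night")
toy$time_of_day <- tod_labels[findInterval(hour, c(0, 6, 12, 18, 22))]

# Region of origin from the make (hypothetical, deliberately tiny lookup)
region_by_make <- c(ford = "USA", honda = "Japan", volvo = "Europe")
toy$region <- unname(region_by_make[toy$vehicle_make])

print(toy[, c("car_age", "total_victims", "season", "time_of_day", "region")])
```

A lookup vector keeps each mapping in one place, so changing a season boundary or adding a make only touches a single line.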

Sample of transformed dataset
chp_vehicle_type_at_fault	party_safety_equipment_1	type_of_collision	road_surface	motor_vehicle_involved_with	direction_of_travel	lighting	weather_1	movement_preceding_collision	population	party_sex	party_age	party_sobriety	financial_responsibility	party_race	county_location	chp_beat_type	party_count	hit_and_run	car_age	total_victims	season	year	region	time_of_day
passenger car, station	lap/shoulder harness used	broadside	dry	other motor vehicle	south	daylight	clear	proceeding straight	25000 to 50000	female	22	had not been drinking	no proof of insurance obtained	hispanic	san mateo	us highway	3	not hit and run	4	0	Winter	2002	Japan	Morning
passenger car, station	lap/shoulder harness used	broadside	wet	other motor vehicle	south	daylight	raining	proceeding straight	unincorporated	male	26	had not been drinking	proof of insurance obtained	other	santa cruz	state route	2	not hit and run	8	0	Winter	2002	USA	Afternoon
passenger car, station	lap/shoulder harness used	hit object	dry	fixed object	west	dark with street lights	clear	proceeding straight	>250000	male	17	had not been drinking	no proof of insurance obtained	asian	los angeles	state route	1	not hit and run	7	0	Spring	2002	Japan	Night
passenger car, station	air bag deployed	sideswipe	dry	other motor vehicle	north	daylight	clear	stopped	unincorporated	female	36	had not been drinking	proof of insurance obtained	hispanic	los angeles	county road line	2	not hit and run	13	0	Fall	2002	USA	Afternoon
passenger car, station	lap/shoulder harness used	rear end	dry	other motor vehicle	north	daylight	clear	proceeding straight	unincorporated	male	21	had not been drinking	proof of insurance obtained	white	tulare	county road area	2	not hit and run	1	0	Winter	2003	Japan	Morning
passenger car, station	lap/shoulder harness used	rear end	dry	parked motor vehicle	west	dark with no street lights	clear	backing	unincorporated	male	22	had not been drinking	proof of insurance obtained	white	kern	state route	2	not hit and run	1	0	Spring	2003	USA	Evening
passenger car, station	lap belt used	sideswipe	dry	other motor vehicle	south	daylight	clear	changing lanes	50000 to 100000	male	46	had been drinking, under influence	no proof of insurance obtained	white	san diego	interstate	2	not hit and run	37	1	Summer	2003	USA	Afternoon
passenger car, station	not required	hit object	dry	fixed object	north	daylight	clear	ran off road	unincorporated	male	19	had been drinking, under influence	proof of insurance obtained	white	el dorado	county road line	1	not hit and run	15	0	Summer	2003	USA	Evening
passenger car, station	air bag not deployed	rear end	dry	other motor vehicle	south	daylight	clear	changing lanes	>250000	female	31	had not been drinking	proof of insurance obtained	black	los angeles	interstate	2	not hit and run	12	0	Summer	2003	Japan	Afternoon
passenger car, station	air bag not deployed	sideswipe	dry	other motor vehicle	south	daylight	clear	passing other vehicle	unincorporated	male	38	had been drinking, not under influence	proof of insurance obtained	asian	los angeles	interstate	2	not hit and run	1	0	Summer	2003	Japan	Afternoon

Number of columns after pruning: 26

5 Method/Model

This section focuses on modeling the number of victims in traffic accidents using count regression techniques. It begins by preparing the dataset and examining the distribution of the target variable, total_victims, including an assessment of zero inflation. A custom score test is implemented to test for excess zeros beyond the Poisson assumption.

Four models are fitted to the data:

Poisson Regression
Negative Binomial Regression
Zero-Inflated Poisson (ZIP)
Zero-Inflated Negative Binomial (ZINB)

Likelihood ratio tests are used to compare nested models, and model fit is evaluated using AIC and log-likelihood criteria. The results guide the choice of the most appropriate model for handling both overdispersion and zero inflation in the count outcome.
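
For reference, the four candidates can be written as special cases of one distribution. The block below is a sketch of the ZINB probability mass function in the usual textbook parameterisation. The symbols \pi_i (probability of a structural zero), \mu_i (mean of the count component), \theta (dispersion), the covariate vectors x_i and z_i, and the coefficients \beta and \gamma are notation assumed here and are not taken from the fitted models.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\left(\frac{\theta}{\theta + \mu_i}\right)^{\theta}, \\
P(Y_i = y) &= (1 - \pi_i)\,\frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!}
  \left(\frac{\theta}{\theta + \mu_i}\right)^{\theta}
  \left(\frac{\mu_i}{\theta + \mu_i}\right)^{y}, \qquad y = 1, 2, \dots \\
\log \mu_i &= x_i^{\top}\beta, \qquad
\operatorname{logit}(\pi_i) = z_i^{\top}\gamma .
\end{aligned}
```

Setting \pi_i = 0 gives the negative binomial model, letting \theta \to \infty gives the ZIP model, and applying both restrictions gives the Poisson model.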

library(MASS)
# Load the dataset
df <- read.csv("driver_at_fault_sample_final.csv", stringsAsFactors = FALSE)
df <- df[1:10000, ]

# Ensure the 'total_victims' column is numeric
df$total_victims <- as.numeric(df$total_victims)

# Calculate the count and percentage of zeros in 'total_victims'
zero_count <- sum(df$total_victims == 0, na.rm = TRUE)
total_count <- sum(!is.na(df$total_victims))
zero_percent <- round(100 * zero_count / total_count, 2)

Number of zeros: 6 438 out of 10 000 observations (64.38%)

lr_test <- function(model1, model2) {
  lr_stat <- 2 * (logLik(model2)[1] - logLik(model1)[1])
  df_diff <- attr(logLik(model2), "df") - attr(logLik(model1), "df")
  p_val <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
  cat("LR test comparing models:\n")
  cat("  Statistic =", round(lr_stat, 4), "\n")
  cat("  Degrees of freedom =", df_diff, "\n")
  cat("  p-value =", p_val, "\n\n")
  return(list(statistic = lr_stat, df = df_diff, p.value = p_val))
}
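
The likelihood-ratio logic in lr_test can be checked on a small case. The sketch below applies the same computation to two nested Poisson models and compares the result with the deviance difference reported by anova(). The simulated data, the seed, and the names x, y, m0 and m1 are illustrative assumptions and have nothing to do with the accident dataset.

```r
# Simulated example (assumption: NOT the accident data)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.2 + 0.5 * x))

m0 <- glm(y ~ 1, family = poisson)   # restricted model
m1 <- glm(y ~ x, family = poisson)   # unrestricted model

# Same computation as in lr_test(): twice the gain in log-likelihood
lr_stat <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
df_diff <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
p_val   <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# For Poisson GLMs the deviance difference equals the LR statistic
dev_table <- anova(m0, m1, test = "Chisq")
print(c(lr_stat = lr_stat, deviance_diff = dev_table$Deviance[2], p_value = p_val))
```

The two statistics agree because the saturated-model term cancels when the deviances of nested Poisson models are subtracted.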

# Perform zero-inflation test on 'total_victims'
zero_test_result <- zero.test(df$total_victims)
knitr::kable(zero_test_result[, 1:4], caption = "Zero Inflation Test Result", align = c("l", "r", "r", "r"))

Zero Inflation Test Result
Test	Statistic	DF	P-value
Score test for zero inflation	577.4755	1	1.33e-127

Interpretation of the zero.test result: we reject H0, so there is evidence of zero inflation beyond what the Poisson model allows. Zero-inflated models (ZIP or ZINB) should therefore be considered.
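
The code of the custom zero.test function is not shown here. As a rough illustration of what such a test can look like, the sketch below implements a score test for zero inflation against an intercept-only Poisson model (the statistic proposed by van den Broek). The function name zero_score_test, its output layout, and the simulated data are assumptions made for this example and may differ from the implementation used in the analysis.

```r
# Hypothetical helper: score test for excess zeros relative to a Poisson model
zero_score_test <- function(x) {
  x      <- x[!is.na(x)]
  n      <- length(x)
  lambda <- mean(x)        # Poisson MLE under H0 (no zero inflation)
  p0     <- exp(-lambda)   # zero probability implied by the Poisson model
  n0     <- sum(x == 0)    # observed number of zeros
  stat   <- (n0 - n * p0)^2 / (n * p0 * (1 - p0) - n * lambda * p0^2)
  data.frame(
    Test      = "Score test for zero inflation",
    Statistic = stat,
    DF        = 1,
    P.value   = pchisq(stat, df = 1, lower.tail = FALSE)
  )
}

# Simulated counts with 500 structural zeros added (assumption, not the crash data)
set.seed(42)
x_sim <- c(rep(0, 500), rpois(500, lambda = 2))
res <- zero_score_test(x_sim)
print(res)
```

Under H0 the statistic is compared with a chi-squared distribution with one degree of freedom, which matches the DF column of the table above.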

5.1 Fitting Models

# Creating 4 general models for comparison, all with the same set of predictors
full_formula <- total_victims ~ road_surface + direction_of_travel + lighting + weather_1 +
  party_sex + party_age + party_sobriety + financial_responsibility + cellphone_in_use +
  party_race + party_count + hit_and_run + car_age + season + year + region + time_of_day +
  chp_beat_type + chp_vehicle_type_at_fault + party_safety_equipment_1 + type_of_collision +
  county_location + population + motor_vehicle_involved_with + movement_preceding_collision

# Poisson
poisson_model <- glm(full_formula, family = "poisson", data = df)

# Negative Binomial
nb_model <- glm.nb(full_formula, data = df)

# ZIP
zip_model <- zeroinfl(full_formula, data = df, dist = "poisson")

# ZINB
zinb_model <- zeroinfl(full_formula, data = df, dist = "negbin")
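
To make the structure of the zero-inflated negative binomial concrete, the sketch below writes out its probability mass function as a mixture of a point mass at zero and a negative binomial count distribution. The function name dzinb, the argument names, and the parameter values are illustrative assumptions; the negative binomial is parameterised by its mean mu and dispersion theta, as in dnbinom(size = theta, mu = mu).

```r
# Hypothetical helper: ZINB probability mass function
# pi_zero = probability of a structural (always-zero) observation
dzinb <- function(y, mu, theta, pi_zero) {
  nb <- dnbinom(y, size = theta, mu = mu)
  ifelse(y == 0,
         pi_zero + (1 - pi_zero) * nb,   # zeros come from both components
         (1 - pi_zero) * nb)             # positive counts only from the NB part
}

# Illustrative parameter values (assumptions, not estimates from the data)
probs     <- dzinb(0:200, mu = 1.2, theta = 3, pi_zero = 0.4)
p_zero_nb <- dnbinom(0, size = 3, mu = 1.2)

print(c(total_mass = sum(probs),
        p_zero_zinb = probs[1],
        p_zero_nb = p_zero_nb,
        mean_zinb = sum(0:200 * probs)))
```

The mixture puts more mass on zero than the plain negative binomial with the same mu and theta, and its mean shrinks to (1 - pi_zero) * mu.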

5.2 Model Comparison

Likelihood Ratio Tests Between Count Models
Comparison	Statistic	DF	p.value
Poisson vs Negative Binomial	185.761	1	2.676615e-42
Poisson vs Zero-Inflated Poisson	663.473	200	4.681003e-51
Negative Binomial vs ZINB	486.163	200	7.509073e-26
ZIP vs ZINB	8.451	1	3.647446e-03

Interpretation:

Poisson vs Negative Binomial:

The Likelihood Ratio (LR) statistic is 185.76 with 1 degree of freedom.
p < 0.0001 → This is a highly significant result.
Interpretation: The Negative Binomial model provides a significantly better fit than the Poisson model. This suggests overdispersion is present in the data (variance > mean), which Poisson cannot handle properly.

Poisson vs Zero-Inflated Poisson (ZIP):

LR statistic = 663.47, df = 200, p < 0.0001.
Interpretation: The ZIP model is significantly better than the standard Poisson. This indicates that there is an excess of zero counts in the data that ZIP accounts for effectively.

Negative Binomial vs Zero-Inflated Negative Binomial (ZINB):

LR statistic = 486.16, df = 200, p < 0.0001.
Interpretation: The ZINB model outperforms the standard Negative Binomial model. So, even with overdispersion addressed, there’s still zero inflation that needs to be modeled.

ZIP vs ZINB:

LR statistic = 8.45, df = 1, p ≈ 0.0036.
Interpretation: The ZINB model is also significantly better than the ZIP model. That means both overdispersion and zero inflation are present in the data, and ZINB is the most appropriate model among those tested.

Conclusion:

Based on all comparisons, the Zero-Inflated Negative Binomial (ZINB) model provides the best fit to the data, as it handles both excess zeros and overdispersion, which are not adequately addressed by simpler models such as Poisson, NB, or ZIP.

Comparison of Count Regression Models
Model	AIC	Log-Likelihood
ZINB	18425.64	-8811.82
ZIP	18432.09	-8816.04
NegBin	18511.80	-9054.90
Poisson	18695.56	-9147.78

Interpretation:

ZINB (Zero-Inflated Negative Binomial):

Lowest AIC (18425.64) and highest log-likelihood (-8811.82) among all models.
Interpretation: This is the best-fitting model according to both AIC and log-likelihood. It captures both overdispersion and excess zeros effectively.

ZIP (Zero-Inflated Poisson):

Second-best in terms of AIC and log-likelihood.
Interpretation: While it accounts for zero inflation, it does not handle overdispersion as well as ZINB. Still, it performs better than the standard Poisson and Negative Binomial models.

Negative Binomial (NegBin):

Has a lower AIC than Poisson, indicating improvement when accounting for overdispersion, but still worse than ZIP and ZINB.
Interpretation: Handles overdispersion, but fails to account for excess zeros.

Poisson:

Worst model with the highest AIC (18695.56) and lowest log-likelihood (-9147.78).
Interpretation: It fails to account for both overdispersion and zero inflation, making it unsuitable for this dataset.

Conclusion:

Based on both AIC and log-likelihood, the ZINB model is clearly the best choice. It provides the most accurate representation of the data, outperforming all other models in terms of goodness-of-fit.
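
The reported figures can be cross-checked against each other. The sketch below uses only the AIC and log-likelihood values from the table above, recovers the number of estimated parameters k from AIC = 2k - 2 logLik, and recomputes the likelihood-ratio statistics. The object names are illustrative, and small deviations from the reported statistics are expected because the log-likelihoods are rounded to two decimals.

```r
# Values copied from the "Comparison of Count Regression Models" table
fit <- data.frame(
  model  = c("ZINB", "ZIP", "NegBin", "Poisson"),
  aic    = c(18425.64, 18432.09, 18511.80, 18695.56),
  loglik = c(-8811.82, -8816.04, -9054.90, -9147.78)
)

# AIC = 2k - 2 * logLik  =>  k = (AIC + 2 * logLik) / 2
fit$k <- round((fit$aic + 2 * fit$loglik) / 2)

# LR statistic = 2 * (logLik of larger model - logLik of smaller model)
ll <- setNames(fit$loglik, fit$model)
lr_stats <- c(
  poisson_vs_nb   = 2 * (ll[["NegBin"]] - ll[["Poisson"]]),
  poisson_vs_zip  = 2 * (ll[["ZIP"]]    - ll[["Poisson"]]),
  nb_vs_zinb      = 2 * (ll[["ZINB"]]   - ll[["NegBin"]]),
  zip_vs_zinb     = 2 * (ll[["ZINB"]]   - ll[["ZIP"]])
)

print(fit)
print(round(lr_stats, 2))
```

The recovered parameter counts differ by 200 between each standard model and its zero-inflated version and by 1 between each Poisson-type and negative-binomial-type model, in line with the DF column of the likelihood-ratio table.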

Data summary
Name	df
Number of rows	65000
Number of columns	26
_______________________
Column type frequency:
character	20
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
chp_vehicle_type_at_fault	1	8	91	58
party_safety_equipment_1	1	5	40	19
type_of_collision	1	5	10	8
road_surface	1	3	8	4
motor_vehicle_involved_with	1	1	30	12
direction_of_travel	1	4	5	4
lighting	1	8	39	5
weather_1	1	3	7	7
movement_preceding_collision	1	5	26	18
population	1	5	16	8
party_sex	1	4	6	2
party_sobriety	1	14	38	6
financial_responsibility	1	14	35	4
party_race	1	5	8	5
county_location	1	4	15	58
chp_beat_type	1	7	23	8
hit_and_run	1	6	15	3
season	1	4	6	4
region	1	3	6	6
time_of_day	1	5	9	4

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
party_age	1	36.53	16.20	0	23	32	47	121	▅▇▃▁▁
cellphone_in_use	1	0.02	0.15	0	0	0	0	1	▇▁▁▁▁
party_count	1	1.98	0.77	1	2	2	2	10	▇▂▁▁▁
car_age	1	9.44	6.84	0	4	9	14	89	▇▁▁▁▁
total_victims	1	0.55	0.89	0	0	0	1	20	▇▁▁▁▁
year	1	2012.39	4.91	2002	2008	2013	2017	2021	▃▇▆▇▆

5.3 Sample values from columns

Sample Values from Each Column
	Sample_1	Sample_2	Sample_3	Sample_4	Sample_5
chp_vehicle_type_at_fault	passenger car, station	mini-vans	motorcycle	pickups & panels	two axle truck
party_safety_equipment_1	lap/shoulder harness used	air bag deployed	lap belt used	not required	air bag not deployed
type_of_collision	broadside	hit object	sideswipe	rear end	overturned
road_surface	dry	wet	slippery	snowy	NA
motor_vehicle_involved_with	other motor vehicle	fixed object	parked motor vehicle	non-collision	train
direction_of_travel	south	west	north	east	NA
lighting	daylight	dark with street lights	dark with no street lights	dusk or dawn	dark with street lights not functioning
weather_1	clear	raining	cloudy	other	fog
movement_preceding_collision	proceeding straight	stopped	backing	changing lanes	ran off road
population	25000 to 50000	unincorporated	>250000	50000 to 100000	100000 to 250000
party_sex	female	male	NA	NA	NA
party_age	22	26	17	36	21
party_sobriety	had not been drinking	had been drinking, under influence	had been drinking, not under influence	impairment unknown	had been drinking, impairment unknown
financial_responsibility	no proof of insurance obtained	proof of insurance obtained	not applicable	officer called away before obtained	NA
cellphone_in_use	0	1	NA	NA	NA
party_race	hispanic	other	asian	white	black
county_location	san mateo	santa cruz	los angeles	tulare	kern
chp_beat_type	us highway	state route	county road line	county road area	interstate
party_count	3	2	1	4	9
hit_and_run	not hit and run	misdemeanor	felony	NA	NA
car_age	4	8	7	13	1
total_victims	0	1	4	2	3
season	Winter	Spring	Fall	Summer	NA
year	2002	2003	2004	2005	2006
region	Japan	USA	Korea	Europe	Other
time_of_day	Morning	Afternoon	Night	Evening	NA

Unique CHP Vehicle Types at Fault
Vehicle.Types.at.Fault
all terrain vehicle
ambulance
dune buggy
farm labor transporter
fire truck
fork lift
general public paratransit
go-ped, zip electric scooter, motoboard
hwy. construction equip.
implement of husbandry
low speed vehicle
mini-vans
misc. motor vehicle (snowmobile, golf cart)
mobile equipment
motor driven
motor home
motor home > 40 feet
motorcycle
motorized bicycle
non-commercial bus
other commercial
paratransit
passenger car, station
passenger car, station wagon, jeep: hazardous material
passenger car, station wagon, jeep: hazardous waste or hazardous waste/material combination
pickup and camper: hazardous material
pickup w/camper
pickups & panels
pickups and panels: hazardous material
pickups and panels: hazardous waste or hazardous waste/material combination
police car
police motorcycle
public transit authority
school bus contractual type i
school bus contractual type ii
school bus private type i
school bus private type ii
school bus public type i
school bus public type ii
school pupil activity bus type i
school pupil activity bus type ii
sport utility vehicle
three-axle tank truck: hazardous material
three-axle tow truck
three axle tank truck
three or more axle truck
three or more axle truck: hazardous material
three or more axle truck: hazardous waste or hazardous waste/material combination
tour bus
truck tractor
truck tractor: hazardous material
two-axle tank truck: hazardous waste or hazardous waste/material combination
two-axle tow truck
two-axle truck: hazardous material
two-axle truck: hazardous waste or hazardous waste/material combination
two axle tank truck
two axle truck
youth bus

proceeding straight
stopped
backing
changing lanes
ran off road
passing other vehicle
entering traffic
other
crossed into opposing lane
making right turn
making left turn
other unsafe turning
slowing/stopping
merging
making u-turn
traveling wrong way
parking maneuver
parked

# aggregating movement into more general groups (case_when() comes from dplyr)
df$move_general <- with(df, dplyr::case_when(
  movement_preceding_collision %in% c("proceeding straight", "entering traffic", "merging") ~ "Straight Travel",
  movement_preceding_collision %in% c("making right turn", "making left turn", "making u-turn", "other unsafe turning", "other") ~ "Turning",
  movement_preceding_collision %in% c("changing lanes", "passing other vehicle", "crossed into opposing lane") ~ "Lane Change/Passing",
  movement_preceding_collision %in% c("slowing/stopping", "stopped", "backing", "parking maneuver", "parked") ~ "Stopping/Slowing",
  movement_preceding_collision %in% c("ran off road", "traveling wrong way") ~ "Loss of Control/Irregular",
  TRUE ~ "Other"
))

df$move_general <- factor(df$move_general)

lap/shoulder harness used
air bag deployed
lap belt used
not required
air bag not deployed
other
none in vehicle
passenger, motorcycle helmet used
unknown
driver, motorcycle helmet used
passive restraint not used
shoulder harness not used
no child restraint in vehicle
lap/shoulder harness not used
shoulder harness used
passive restraint used
child restraint in vehicle, use unknown
child restraint in vehicle, improper use
lap belt not used

# aggregating safety equipment into air_bag variable
df$air_bag <- factor(
  ifelse(df$party_safety_equipment_1 == "air bag deployed", "Yes",
         ifelse(df$party_safety_equipment_1 == "air bag not deployed", "No", "Other safety")),
  levels = c("Yes", "No", "Other safety")
)

Distribution depending on Air Bag
Air Bag	Count
Yes	14 301
No	44 601
Other safety	6 098

clear
raining
cloudy
other
fog
snowing
wind

# aggregating weather into more general groups
df$weather <- factor(
  ifelse(df$weather_1 == "clear", "clear_sky",
         ifelse(df$weather_1 %in% c("raining", "snowing", "fog"), "harder_conditions", "other"))
)

Distribution depending on Weather
Weather	Count
clear_sky	52 511
harder_conditions	2 713
other	9 776

had not been drinking
had been drinking, under influence
had been drinking, not under influence
impairment unknown
had been drinking, impairment unknown
not applicable

# aggregating sobriety into more general groups
df$sobriety <- factor(
  ifelse(df$party_sobriety == "had not been drinking", "not drinking",
         ifelse(df$party_sobriety %in% c(
           "had been drinking, under influence",
           "had been drinking, not under influence",
           "had been drinking, impairment unknown"
         ), "drinking", "other"))
)

Distribution of Sobriety Levels
Sobriety	Count
drinking	7 004
not drinking	55 988
other	2 008

dry
wet
slippery
snowy

# aggregating road surface into more general groups
df$surface_dry <- ifelse(df$road_surface == "dry", "dry", "non-dry")
df$surface_dry <- factor(df$surface_dry)

# converting age into 2 groups: 24 and younger ("under 25") vs 25 and older ("over 25")
df$age_group <- cut(
  df$party_age,
  breaks = c(-Inf, 24, Inf),
  labels = c("under 25", "over 25"),
  right = TRUE
)
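
The effect of right = TRUE on the boundary can be seen on a few illustrative ages. In the sketch below, the vector ages and the object name age_group_demo are made up for the example; the breaks and labels are the ones used above.

```r
# Illustrative ages around the cut point (assumption, not taken from the data)
ages <- c(17, 24, 25, 40)

age_group_demo <- cut(
  ages,
  breaks = c(-Inf, 24, Inf),
  labels = c("under 25", "over 25"),
  right = TRUE                 # intervals are (-Inf, 24] and (24, Inf)
)

print(data.frame(age = ages, group = age_group_demo))
```

A driver aged exactly 25 therefore falls into the "over 25" group.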

Column	Missing_Values
chp_vehicle_type_at_fault	0
party_safety_equipment_1	0
type_of_collision	0
road_surface	0
motor_vehicle_involved_with	0
direction_of_travel	0
lighting	0
weather_1	0
movement_preceding_collision	0
population	0
party_sex	0
party_age	0
party_sobriety	0
financial_responsibility	0
cellphone_in_use	0
party_race	0
county_location	0
chp_beat_type	0
party_count	0
hit_and_run	0
car_age	0
total_victims	0
season	0
year	0
region	0
time_of_day	0
vehicle_general	0
move_general	0
air_bag	0
weather	0
sobriety	0
surface_dry	0
age_group	0

# aggregating season into 2 groups: Wet (Winter, Spring) and Dry (Summer, Fall)
df$LA_season <- ifelse(df$season %in% c("Winter", "Spring"), "Wet", "Dry")
df$LA_season <- factor(df$LA_season, levels = c("Dry", "Wet"))

Distribution through Seasons
Season	Count
Fall	17 018
Spring	15 745
Summer	16 064
Winter	16 173

hispanic
other
asian
white
black

# aggregating race into white and non-white
df$race_group <- ifelse(df$party_race == "white", "white", "non-white")
df$race_group <- factor(df$race_group, levels = c("white", "non-white"))

no proof of insurance obtained
proof of insurance obtained
not applicable
officer called away before obtained

# aggregating insurance into 2 groups
df$financial_responsibility_bi <- ifelse(df$financial_responsibility == "proof of insurance obtained", 
                            "proof", "no or unknown proof")
df$financial_responsibility_bi <- factor(df$financial_responsibility_bi, levels = c("proof", "no or unknown proof"))

# aggregating car age into 2 groups: 15 years or less ("under_15") vs more than 15 years ("over_15")
df$car_age_group <- ifelse(df$car_age <= 15, "under_15", "over_15")
df$car_age_group <- factor(df$car_age_group, levels = c("under_15", "over_15"))

# aggregating lighting into 2 groups: daylight vs everything else (dark categories and dusk or dawn)
df$lighting_simple <- ifelse(df$lighting == "daylight", "daylight", "dark")
df$lighting_simple <- factor(df$lighting_simple, levels = c("daylight", "dark"))

5.4 Interactions

# sex x race
df$sex_race <- interaction(df$party_sex, df$race_group, drop = TRUE)

Sex x Race	Count
female.white	10 264
male.white	16 389
female.non-white	12 537
male.non-white	25 810

# time of the day x weather
df$timeofday_weather <- interaction(df$time_of_day, df$weather, drop = TRUE)

Time of Day x Weather	Count
Afternoon.clear_sky	22 383
Evening.clear_sky	11 170
Morning.clear_sky	13 770
Night.clear_sky	5 188
Afternoon.harder_conditions	865
Evening.harder_conditions	616
Morning.harder_conditions	785
Night.harder_conditions	447
Afternoon.other	3 419
Evening.other	1 668
Morning.other	3 655
Night.other	1 034

# age group x car age group
df$age_car <- interaction(df$age_group, df$car_age_group, drop = TRUE)

Age Group x Car Age Group	Count
under 25.under_15	15 809
over 25.under_15	37 359
under 25.over_15	3 840
over 25.over_15	7 992

# lighting x surface dry
df$lighting_surface <- interaction(df$lighting_simple, df$surface_dry, drop = TRUE)

Lighting x Surface Dry	Count
daylight.dry	41 021
dark.dry	17 244
daylight.non-dry	3 920
dark.non-dry	2 815
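
How interaction() builds the combined levels shown in these tables can be illustrated on a toy example. The vectors sex and race below are made up for the sketch; only the function call mirrors the code above.

```r
# Toy factors (assumption: illustrative values only)
sex  <- factor(c("female", "male", "male", "female"))
race <- factor(c("white", "white", "non-white", "white"),
               levels = c("white", "non-white"))

# Levels are named "<first>.<second>"; the first factor varies fastest.
# drop = TRUE removes combinations that never occur (here: female.non-white).
sex_race_demo <- interaction(sex, race, drop = TRUE)

print(table(sex_race_demo))
```

The level order (first factor varying fastest) is the same ordering seen in the count tables of this section.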

5.5 Modelling

# Previous analysis showed that Zero-Inflated Negative Binomial is the most appropriate choice
# Starting with the most general model

general <- zeroinfl(total_victims ~ weather + party_sex + age_group + sobriety
                    + financial_responsibility_bi + cellphone_in_use + race_group
                    + car_age_group + LA_season + time_of_day 
                    + air_bag + lighting_surface, data = df, dist = "negbin")

Count Model (Negative Binomial)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	0.074	0.033	2.234	2.55e-02
weatherharder_conditions	0.059	0.048	1.233	2.17e-01
weatherother	-0.011	0.024	-0.467	6.40e-01
party_sexmale	-0.080	0.016	-5.071	< 1e-03
age_groupover 25	-0.014	0.016	-0.908	3.64e-01
sobrietynot drinking	-0.033	0.024	-1.374	1.69e-01
sobrietyother	-0.162	0.052	-3.095	1.97e-03
financial_responsibility_bino or unknown proof	0.176	0.022	8.128	< 1e-03
cellphone_in_use1	-0.098	0.049	-2.006	4.48e-02
race_groupnon-white	0.014	0.015	0.897	3.70e-01
car_age_groupover_15	-0.005	0.019	-0.241	8.10e-01
LA_seasonWet	-0.019	0.015	-1.293	1.96e-01
time_of_dayEvening	0.018	0.026	0.698	4.85e-01
time_of_dayMorning	-0.113	0.019	-6.000	< 1e-03
time_of_dayNight	-0.204	0.034	-6.043	< 1e-03
air_bagNo	-0.543	0.020	-26.515	< 1e-03
air_bagOther safety	-0.420	0.022	-19.413	< 1e-03
lighting_surfacedark.dry	-0.050	0.026	-1.939	5.25e-02
lighting_surfacedaylight.non-dry	-0.156	0.040	-3.886	< 1e-03
lighting_surfacedark.non-dry	-0.133	0.049	-2.730	6.34e-03
Log(theta)	1.088	0.044	24.451	< 1e-03

Zero-inflation Model (Logit)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-14.422	54.178	-0.266	7.90e-01
weatherharder_conditions	0.629	0.339	1.855	6.36e-02
weatherother	0.081	0.122	0.659	5.10e-01
party_sexmale	0.379	0.089	4.234	< 1e-03
age_groupover 25	-0.378	0.080	-4.734	< 1e-03
sobrietynot drinking	0.014	0.142	0.095	9.24e-01
sobrietyother	0.469	0.229	2.044	4.09e-02
financial_responsibility_bino or unknown proof	0.441	0.103	4.287	< 1e-03
cellphone_in_use1	0.029	0.261	0.113	9.10e-01
race_groupnon-white	0.413	0.089	4.621	< 1e-03
car_age_groupover_15	-0.249	0.105	-2.364	1.81e-02
LA_seasonWet	0.064	0.078	0.828	4.07e-01
time_of_dayEvening	-0.005	0.128	-0.039	9.69e-01
time_of_dayMorning	-0.084	0.098	-0.859	3.90e-01
time_of_dayNight	-0.490	0.221	-2.220	2.64e-02
air_bagNo	12.975	54.178	0.239	8.11e-01
air_bagOther safety	1.833	62.848	0.029	9.77e-01
lighting_surfacedark.dry	-0.082	0.129	-0.633	5.27e-01
lighting_surfacedaylight.non-dry	-1.414	0.444	-3.184	1.45e-03
lighting_surfacedark.non-dry	-0.435	0.327	-1.331	1.83e-01

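# To analyze it properly we'll use the Likelihood Ratio Test. Hypotheses:
# H0: the coefficients of the removed variable are jointly zero (the variable is not jointly significant)
# H1: at least one coefficient of the removed variable is non-zero (the variable is jointly significant)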

# Starting with dropping the variable that is insignificant in both parts – LA_season
step1 <- zeroinfl(total_victims ~ weather + party_sex + age_group + sobriety
                  + financial_responsibility_bi + cellphone_in_use + race_group + car_age_group
                  + time_of_day + air_bag + lighting_surface, data = df, dist = "negbin")

Count Model (Negative Binomial)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	0.065	0.032	2.005	4.49e-02
weatherharder_conditions	0.058	0.048	1.210	2.26e-01
weatherother	-0.014	0.024	-0.564	5.73e-01
party_sexmale	-0.079	0.016	-5.059	< 1e-03
age_groupover 25	-0.014	0.016	-0.887	3.75e-01
sobrietynot drinking	-0.033	0.024	-1.362	1.73e-01
sobrietyother	-0.162	0.052	-3.092	1.99e-03
financial_responsibility_bino or unknown proof	0.177	0.022	8.140	< 1e-03
cellphone_in_use1	-0.098	0.049	-1.993	4.63e-02
race_groupnon-white	0.013	0.015	0.893	3.72e-01
car_age_groupover_15	-0.005	0.019	-0.282	7.78e-01
time_of_dayEvening	0.019	0.026	0.706	4.80e-01
time_of_dayMorning	-0.114	0.019	-6.061	< 1e-03
time_of_dayNight	-0.203	0.034	-6.035	< 1e-03
air_bagNo	-0.544	0.021	-26.518	< 1e-03
air_bagOther safety	-0.420	0.022	-19.412	< 1e-03
lighting_surfacedark.dry	-0.050	0.026	-1.944	5.19e-02
lighting_surfacedaylight.non-dry	-0.159	0.040	-3.971	< 1e-03
lighting_surfacedark.non-dry	-0.136	0.049	-2.792	5.23e-03
Log(theta)	1.087	0.044	24.440	< 1e-03

Zero-inflation Model (Logit)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-12.660	22.781	-0.556	5.78e-01
weatherharder_conditions	0.654	0.341	1.920	5.49e-02
weatherother	0.093	0.122	0.760	4.47e-01
party_sexmale	0.382	0.090	4.268	< 1e-03
age_groupover 25	-0.376	0.080	-4.704	< 1e-03
sobrietynot drinking	0.013	0.142	0.091	9.28e-01
sobrietyother	0.469	0.229	2.044	4.09e-02
financial_responsibility_bino or unknown proof	0.438	0.103	4.266	< 1e-03
cellphone_in_use1	0.035	0.260	0.134	8.94e-01
race_groupnon-white	0.413	0.089	4.629	< 1e-03
car_age_groupover_15	-0.249	0.106	-2.360	1.83e-02
time_of_dayEvening	-0.005	0.127	-0.036	9.71e-01
time_of_dayMorning	-0.093	0.097	-0.959	3.38e-01
time_of_dayNight	-0.488	0.220	-2.218	2.65e-02
air_bagNo	11.241	22.780	0.493	6.22e-01
air_bagOther safety	0.594	33.774	0.018	9.86e-01
lighting_surfacedark.dry	-0.085	0.129	-0.663	5.07e-01
lighting_surfacedaylight.non-dry	-1.433	0.450	-3.184	1.45e-03
lighting_surfacedark.non-dry	-0.457	0.330	-1.384	1.66e-01

Likelihood Ratio Test: Comparison of step1 & general
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step1	39	-63170.65	NA	NA	NA
general	41	-63168.03	2	5.257	0.072

p-value = 0.072 -> we fail to reject H0 at the 5% level. There is no evidence that “LA_season” is jointly significant, so we remove it.

# In step 2 we are dropping the "weather" variable
step2 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day
                  + air_bag + lighting_surface, data = df, dist = "negbin")

Count Model (Negative Binomial)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	0.065	0.032	2.011	4.44e-02
party_sexmale	-0.079	0.016	-5.025	< 1e-03
age_groupover 25	-0.014	0.016	-0.877	3.80e-01
sobrietynot drinking	-0.034	0.024	-1.437	1.51e-01
sobrietyother	-0.165	0.052	-3.149	1.64e-03
financial_responsibility_bino or unknown proof	0.175	0.022	8.053	< 1e-03
cellphone_in_use1	-0.096	0.049	-1.958	5.03e-02
race_groupnon-white	0.014	0.015	0.945	3.44e-01
car_age_groupover_15	-0.006	0.019	-0.295	7.68e-01
time_of_dayEvening	0.018	0.026	0.671	5.02e-01
time_of_dayMorning	-0.115	0.019	-6.133	< 1e-03
time_of_dayNight	-0.204	0.034	-6.066	< 1e-03
air_bagNo	-0.542	0.020	-26.548	< 1e-03
air_bagOther safety	-0.420	0.022	-19.416	< 1e-03
lighting_surfacedark.dry	-0.050	0.026	-1.927	5.40e-02
lighting_surfacedaylight.non-dry	-0.142	0.033	-4.370	< 1e-03
lighting_surfacedark.non-dry	-0.113	0.040	-2.809	4.97e-03
Log(theta)	1.089	0.044	24.497	< 1e-03

Zero-inflation Model (Logit)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-11.373	12.180	-0.934	3.50e-01
party_sexmale	0.382	0.089	4.276	< 1e-03
age_groupover 25	-0.372	0.080	-4.679	< 1e-03
sobrietynot drinking	-0.005	0.139	-0.038	9.70e-01
sobrietyother	0.438	0.228	1.924	5.44e-02
financial_responsibility_bino or unknown proof	0.424	0.102	4.139	< 1e-03
cellphone_in_use1	0.052	0.255	0.202	8.40e-01
race_groupnon-white	0.417	0.089	4.679	< 1e-03
car_age_groupover_15	-0.249	0.105	-2.364	1.81e-02
time_of_dayEvening	-0.016	0.127	-0.125	9.01e-01
time_of_dayMorning	-0.091	0.097	-0.939	3.47e-01
time_of_dayNight	-0.491	0.219	-2.242	2.50e-02
air_bagNo	9.983	12.180	0.820	4.12e-01
air_bagOther safety	-0.715	28.474	-0.025	9.80e-01
lighting_surfacedark.dry	-0.076	0.128	-0.590	5.55e-01
lighting_surfacedaylight.non-dry	-1.069	0.376	-2.843	4.47e-03
lighting_surfacedark.non-dry	-0.076	0.226	-0.338	7.35e-01

Likelihood Ratio Test: Comparison of step2 & step1
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step2	35	-63173.43	NA	NA	NA
step1	39	-63170.65	4	5.549	2.35e-01

p-value = 0.235 -> we fail to reject H0. There is no evidence that “weather” is jointly significant, so we remove it.

# In step 3 we are dropping the "time_of_day" variable
step3 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + air_bag + lighting_surface, 
                  data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step3 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step3	29	-63234.01	NA	NA	NA
step2	35	-63173.43	6	121.173	<1e-03

p-value < 0.001 -> we reject H0. “time_of_day” is jointly significant.
We go back to the model from step 2.
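
The degrees of freedom in these tests follow from the model structure: every predictor enters both the count part and the zero-inflation part, so removing a variable removes its coefficients twice. The sketch below tabulates this, with the number of coefficients per part read off the coefficient tables above; the object names are illustrative.

```r
# Coefficients each variable contributes to ONE part of the model
# (number of dummy rows in the coefficient tables above)
coef_per_part <- c(
  LA_season        = 1,
  weather          = 2,
  time_of_day      = 3,
  sobriety         = 2,
  lighting_surface = 3,
  cellphone_in_use = 1,
  car_age_group    = 1,
  race_group       = 1,
  air_bag          = 2,
  age_group        = 1
)

# Dropping a variable removes it from the count AND the zero-inflation part
lr_df <- 2 * coef_per_part
print(lr_df)
```

This gives 2 degrees of freedom for “LA_season”, 4 for “weather” and 6 for “time_of_day”, matching the Df column of the first three comparisons.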

# In step 4 we bring back "time_of_day" and remove "sobriety"
step4 <- zeroinfl(total_victims ~ party_sex + age_group + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day 
                  + air_bag + lighting_surface, data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step4 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step4	35	-63173.43	NA	NA	NA
step2	35	-63173.43	0	0	1e+00

p-value = 7.551e-07 -> we reject H0. “sobriety” is jointly significant.
Note that the table above shows identical log-likelihoods and zero degrees of freedom, which is what results when the step4 formula still contains “sobriety”; the quoted p-value presumably refers to the comparison with “sobriety” actually removed.
We go back to the model from step 2.

# In step 5 we bring back "sobriety" and remove "lighting_surface"
step5 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day + air_bag,
                  data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step5 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step5	29	-63190.43	NA	NA	NA
step2	35	-63173.43	6	34.004	<1e-03

p-value < 0.001 -> we reject H0. “lighting_surface” is jointly significant.
We go back to the model from step 2.

# In step 6 we bring back "lighting_surface" and remove "cellphone_in_use"
step6 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + race_group + car_age_group + time_of_day + air_bag
                  + lighting_surface, data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step6 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step6	33	-63176.62	NA	NA	NA
step2	35	-63173.43	2	6.389	4.1e-02

p-value = 0.041 -> we reject H0 at the 5% level. “cellphone_in_use” is jointly significant.
We go back to the model from step 2.

# In step 7 we bring back "cellphone_in_use" and remove "car_age_group"
step7 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + time_of_day + air_bag + lighting_surface, data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step7 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step7	33	-63177.36	NA	NA	NA
step2	35	-63173.43	2	7.866	1.96e-02

p-value = 0.0196 -> we reject H0 at the 5% level. “car_age_group” is jointly significant.
We go back to the model from step 2.

# In step 8 we bring back "car_age_group" and remove "race_group"
step8 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + car_age_group + time_of_day + air_bag 
                  + lighting_surface, data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step8 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step8	33	-63189.12	NA	NA	NA
step2	35	-63173.43	2	31.393	<1e-03

p-value < 0.001 -> we reject H0. “race_group” is jointly significant.
We go back to the model from step 2.

# In step 9 we bring back "race_group" and remove "air_bag"
step9 <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day
                  + lighting_surface, data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step9 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step9	31	-64944.40	NA	NA	NA
step2	35	-63173.43	4	3541.952	<1e-03

p-value < 0.001 -> we reject H0. “air_bag” is jointly significant.
We go back to the model from step 2.

# In step 10 we bring back "air_bag" and remove "age_group"
step10 <- zeroinfl(total_victims ~ party_sex + sobriety + financial_responsibility_bi 
                   + cellphone_in_use + race_group + car_age_group + time_of_day
                   + air_bag + lighting_surface, data = df, dist = "negbin")

Likelihood Ratio Test: Comparison of step10 & step2
	#Df	LogLik	Df	Chisq	Pr(>Chisq)
step10	31	-64944.40	NA	NA	NA
step2	35	-63173.43	4	3541.952	<1e-03

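p-value < 0.001 -> we reject H0. “age_group” is jointly significant.
Note that the table above repeats the step 9 figures (31 parameters, log-likelihood -64944.40); dropping the two-level “age_group” removes only two coefficients, so the step10 fit should have 33 parameters. The rejection is nevertheless in line with the step 2 estimates, where “age_group” is highly significant in the zero-inflation part (z = -4.679).
We go back to the model from step 2.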
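
Steps 3 to 10 all follow the same pattern: drop one variable from the step 2 specification, refit, and compare by likelihood-ratio test. For ordinary GLMs, base R's drop1() performs exactly this kind of single-term deletion. The sketch below shows the idea on a Poisson GLM fitted to simulated data, used here as a stand-in; it does not use zeroinfl(), and the variable names and data are assumptions for illustration only.

```r
# Simulated data (assumption: stand-in for the crash data)
set.seed(123)
n  <- 1000
x1 <- rnorm(n)
x2 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
y  <- rpois(n, lambda = exp(0.1 + 0.4 * x1))   # x2 has no true effect

full <- glm(y ~ x1 + x2, family = poisson)

# One LR test per term, each comparing the full model with the model
# that omits that single term (the other terms stay in)
single_term_tests <- drop1(full, test = "Chisq")
print(single_term_tests)
```

Each row of the output corresponds to one of the "remove a variable, then bring it back" steps carried out manually above.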

# Every variable dropped in steps 3-10 was jointly significant, so the final model is the step 2 specification
final_model <- zeroinfl(total_victims ~ party_sex + age_group + sobriety + financial_responsibility_bi 
                  + cellphone_in_use + race_group + car_age_group + time_of_day
                  + air_bag + lighting_surface, data = df, dist = "negbin")

Count Model (Negative Binomial)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	0.065	0.032	2.011	4.44e-02
party_sexmale	-0.079	0.016	-5.025	< 1e-03
age_groupover 25	-0.014	0.016	-0.877	3.80e-01
sobrietynot drinking	-0.034	0.024	-1.437	1.51e-01
sobrietyother	-0.165	0.052	-3.149	1.64e-03
financial_responsibility_bino or unknown proof	0.175	0.022	8.053	< 1e-03
cellphone_in_use1	-0.096	0.049	-1.958	5.03e-02
race_groupnon-white	0.014	0.015	0.945	3.44e-01
car_age_groupover_15	-0.006	0.019	-0.295	7.68e-01
time_of_dayEvening	0.018	0.026	0.671	5.02e-01
time_of_dayMorning	-0.115	0.019	-6.133	< 1e-03
time_of_dayNight	-0.204	0.034	-6.066	< 1e-03
air_bagNo	-0.542	0.020	-26.548	< 1e-03
air_bagOther safety	-0.420	0.022	-19.416	< 1e-03
lighting_surfacedark.dry	-0.050	0.026	-1.927	5.40e-02
lighting_surfacedaylight.non-dry	-0.142	0.033	-4.370	< 1e-03
lighting_surfacedark.non-dry	-0.113	0.040	-2.809	4.97e-03
Log(theta)	1.089	0.044	24.497	< 1e-03

Zero-inflation Model (Logit)
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-11.373	12.180	-0.934	3.50e-01
party_sexmale	0.382	0.089	4.276	< 1e-03
age_groupover 25	-0.372	0.080	-4.679	< 1e-03
sobrietynot drinking	-0.005	0.139	-0.038	9.70e-01
sobrietyother	0.438	0.228	1.924	5.44e-02
financial_responsibility_bino or unknown proof	0.424	0.102	4.139	< 1e-03
cellphone_in_use1	0.052	0.255	0.202	8.40e-01
race_groupnon-white	0.417	0.089	4.679	< 1e-03
car_age_groupover_15	-0.249	0.105	-2.364	1.81e-02
time_of_dayEvening	-0.016	0.127	-0.125	9.01e-01
time_of_dayMorning	-0.091	0.097	-0.939	3.47e-01
time_of_dayNight	-0.491	0.219	-2.242	2.50e-02
air_bagNo	9.983	12.180	0.820	4.12e-01
air_bagOther safety	-0.715	28.474	-0.025	9.80e-01
lighting_surfacedark.dry	-0.076	0.128	-0.590	5.55e-01
lighting_surfacedaylight.non-dry	-1.069	0.376	-2.843	4.47e-03
lighting_surfacedark.non-dry	-0.076	0.226	-0.338	7.35e-01
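
The coefficients are on the log scale in the count part and on the logit scale in the zero-inflation part, so exponentiating them gives incidence rate ratios and odds ratios. The sketch below does this for a few estimates copied from the final model tables above, assuming the usual log link for the count part and a logit link for the probability of an excess zero; the object names are illustrative.

```r
# Selected estimates copied from the final model tables
count_coef <- c("party_sexmale" = -0.079,
                "financial_responsibility_bino or unknown proof" = 0.175,
                "air_bagNo" = -0.542)
zero_coef  <- c("party_sexmale" = 0.382,
                "age_groupover 25" = -0.372,
                "race_groupnon-white" = 0.417)

irr        <- exp(count_coef)   # incidence rate ratios (count part)
odds_ratio <- exp(zero_coef)    # odds ratios for being an excess zero
theta      <- exp(1.089)        # dispersion parameter from Log(theta)

print(round(irr, 3))
print(round(odds_ratio, 3))
print(round(theta, 3))
```

For example, the count-part estimate for party_sexmale corresponds to an expected number of victims about 7.6% lower than for female drivers, while the zero-inflation estimate corresponds to odds of an excess zero roughly 1.47 times as high, holding the other variables fixed.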

# Preparing baseline models with all initial variables
poisson_model <- glm(total_victims ~ weather + party_sex + age_group + sobriety
                     + financial_responsibility_bi + cellphone_in_use + race_group
                     + car_age_group + LA_season + time_of_day + air_bag + lighting_surface, 
                     family = "poisson", data = df)

# Negative Binomial model
nb_model <- glm.nb(total_victims ~ weather + party_sex + age_group + sobriety
                   + financial_responsibility_bi + cellphone_in_use + race_group
                   + car_age_group + LA_season + time_of_day + air_bag + lighting_surface, 
                   data = df)

# Zero-inflated Poisson model
zip_model <- zeroinfl(total_victims ~ weather + party_sex + age_group + sobriety
                      + financial_responsibility_bi + cellphone_in_use + race_group
                      + car_age_group + LA_season + time_of_day + air_bag + lighting_surface, 
                      data = df, dist = "poisson")

	Poisson	Negative Binomial	ZIP	Full model (ZINB)	Final model (ZINB)
(Intercept)	0.090***	0.091**
	(0.025)	(0.029)
weatherharder_conditions	0.006	0.005
	(0.036)	(0.041)
weatherother	-0.021	-0.023
	(0.018)	(0.020)
party_sexmale	-0.123***	-0.126***
	(0.011)	(0.013)
age_groupover 25	0.032**	0.032*
	(0.012)	(0.013)
sobrietynot drinking	-0.037*	-0.036+
	(0.018)	(0.021)
sobrietyother	-0.232***	-0.236***
	(0.038)	(0.042)
financial_responsibility_bino or unknown proof	0.123***	0.121***
	(0.016)	(0.019)
cellphone_in_use1	-0.105**	-0.105*
	(0.036)	(0.041)
race_groupnon-white	-0.033**	-0.035**
	(0.011)	(0.012)
car_age_groupover_15	0.020	0.023
	(0.014)	(0.016)
LA_seasonWet	-0.026*	-0.027*
	(0.011)	(0.012)
time_of_dayEvening	0.019	0.021
	(0.019)	(0.022)
time_of_dayMorning	-0.100***	-0.102***
	(0.013)	(0.015)
time_of_dayNight	-0.158***	-0.156***
	(0.025)	(0.029)
air_bagNo	-0.797***	-0.796***
	(0.012)	(0.014)
air_bagOther safety	-0.413***	-0.412***
	(0.019)	(0.022)
lighting_surfacedark.dry	-0.040*	-0.037+
	(0.019)	(0.022)
lighting_surfacedaylight.non-dry	-0.039	-0.031
	(0.029)	(0.033)
lighting_surfacedark.non-dry	-0.095**	-0.090*
	(0.036)	(0.041)
count_(Intercept)			0.125***	0.074*	0.065*
			(0.035)	(0.033)	(0.032)
count_weatherharder_conditions			0.045	0.059
			(0.052)	(0.048)
count_weatherother			-0.018	-0.011
			(0.026)	(0.024)
count_party_sexmale			-0.040*	-0.080***	-0.079***
			(0.017)	(0.016)	(0.016)
count_age_groupover 25			-0.039*	-0.014	-0.014
			(0.017)	(0.016)	(0.016)
count_sobrietynot drinking			-0.004	-0.033	-0.034
			(0.025)	(0.024)	(0.024)
count_sobrietyother			-0.082	-0.162**	-0.165**
			(0.057)	(0.052)	(0.052)
count_financial_responsibility_bino or unknown proof			0.169***	0.176***	0.175***
			(0.023)	(0.022)	(0.022)
count_cellphone_in_use1			-0.102+	-0.098*	-0.096+
			(0.053)	(0.049)	(0.049)
count_race_groupnon-white			0.073***	0.014	0.014
			(0.017)	(0.015)	(0.015)
count_car_age_groupover_15			-0.002	-0.005	-0.006
			(0.021)	(0.019)	(0.019)
count_LA_seasonWet			-0.013	-0.019
			(0.016)	(0.015)
count_time_of_dayEvening			0.024	0.018	0.018
			(0.028)	(0.026)	(0.026)
count_time_of_dayMorning			-0.115***	-0.113***	-0.115***
			(0.020)	(0.019)	(0.019)
count_time_of_dayNight			-0.256***	-0.204***	-0.204***
			(0.037)	(0.034)	(0.034)
count_air_bagNo			-0.435***	-0.543***	-0.542***
			(0.019)	(0.020)	(0.020)
count_air_bagOther safety			-0.429***	-0.420***	-0.420***
			(0.033)	(0.022)	(0.022)
count_lighting_surfacedark.dry			-0.034	-0.050+	-0.050+
			(0.027)	(0.026)	(0.026)
count_lighting_surfacedaylight.non-dry			-0.172***	-0.156***	-0.142***
			(0.043)	(0.040)	(0.033)
count_lighting_surfacedark.non-dry			-0.121*	-0.133**	-0.113**
			(0.052)	(0.049)	(0.040)
zero_(Intercept)			-2.478***	-14.422	-11.373
			(0.136)	(54.178)	(12.180)
zero_weatherharder_conditions			0.189	0.629+
			(0.187)	(0.339)
zero_weatherother			0.012	0.081
			(0.079)	(0.122)
zero_party_sexmale			0.333***	0.379***	0.382***
			(0.055)	(0.089)	(0.089)
zero_age_groupover 25			-0.275***	-0.378***	-0.372***
			(0.051)	(0.080)	(0.080)
zero_sobrietynot drinking			0.138	0.014	-0.005
			(0.091)	(0.142)	(0.139)
zero_sobrietyother			0.535***	0.469*	0.438+
			(0.156)	(0.229)	(0.228)
zero_financial_responsibility_bino or unknown proof			0.182**	0.441***	0.424***
			(0.069)	(0.103)	(0.102)
zero_cellphone_in_use1			0.002	0.029	0.052
			(0.171)	(0.261)	(0.255)
zero_race_groupnon-white			0.433***	0.413***	0.417***
			(0.059)	(0.089)	(0.089)
zero_car_age_groupover_15			-0.104	-0.249*	-0.249*
			(0.067)	(0.105)	(0.105)
zero_LA_seasonWet			0.052	0.064
			(0.049)	(0.078)
zero_time_of_dayEvening			0.020	-0.005	-0.016
			(0.082)	(0.128)	(0.127)
zero_time_of_dayMorning			-0.049	-0.084	-0.091
			(0.062)	(0.098)	(0.097)
zero_time_of_dayNight			-0.446***	-0.490*	-0.491*
			(0.134)	(0.221)	(0.219)
zero_air_bagNo			1.607***	12.975	9.983
			(0.094)	(54.178)	(12.180)
zero_air_bagOther safety			-0.069	1.833	-0.715
			(0.246)	(62.848)	(28.474)
zero_lighting_surfacedark.dry			0.029	-0.082	-0.076
			(0.082)	(0.129)	(0.128)
zero_lighting_surfacedaylight.non-dry			-0.612***	-1.414**	-1.069**
			(0.174)	(0.444)	(0.376)
zero_lighting_surfacedark.non-dry			-0.118	-0.435	-0.076
			(0.180)	(0.327)	(0.226)
Num.Obs.	65000	65000	65000	65000	65000
R2			0.040	0.043	0.043
R2 Adj.			0.040	0.043	0.043
AIC	128844.3	126737.4	127156.2	126418.0	126416.9
BIC	129025.9	126928.1	127519.5	126790.4	126734.7
Log.Lik.	-64402.136	-63347.708
F	276.214	204.683
RMSE	0.86	0.86	0.86	0.86	0.86
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
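
The AIC and BIC rows are easier to compare as differences from the best-scoring model. The sketch below does that in base R, using only the values printed in the table; the data frame name `ic` and its column names are illustrative, not part of the original analysis.

```r
# Information criteria copied from the model comparison table
ic <- data.frame(
  model = c("Poisson", "Negative Binomial", "ZIP", "Full ZINB", "Final ZINB"),
  AIC   = c(128844.3, 126737.4, 127156.2, 126418.0, 126416.9),
  BIC   = c(129025.9, 126928.1, 127519.5, 126790.4, 126734.7)
)

# Difference from the lowest (best) value of each criterion
ic$delta_AIC <- ic$AIC - min(ic$AIC)
ic$delta_BIC <- ic$BIC - min(ic$BIC)

# Models ordered from best to worst by AIC
ic[order(ic$AIC), ]
```

On these figures the final ZINB model has the lowest AIC and BIC. Its AIC advantage over the full ZINB model is small (about 1.1), while BIC, which penalises the extra parameters more heavily, separates the two more clearly.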

6 Results

We tested eight hypotheses related to the risk and severity of car crashes using a two-part model: a count model for the number of victims and a zero model for the likelihood of zero-victim crashes. Below are the results and interpretations for each hypothesis.

1. Young drivers (under 25) have higher risk of car crash with victims

The count model shows no significant difference in the number of victims (p = 0.380), but the zero model indicates that drivers over 25 are significantly less likely to be in zero-victim crashes (β = –0.372, p < 0.001).
Hypothesis not supported: Younger drivers are more often involved in harmless crashes, while older drivers are more likely to be involved in injury crashes.

2. Vehicles older than 15 years are at higher risk of car crash with injuries

The count model shows no significant difference (p = 0.768), but the zero model shows that crashes involving older vehicles are less likely to be zero-victim (β = –0.249, p = 0.018).
Hypothesis supported: Older vehicles are associated with a higher risk of injury crashes, although the expected number of victims does not differ significantly.

3. Drivers who consumed alcohol have higher risk of car crash with victims

Compared with the baseline of drinking drivers:
“Not drinking” status was not significant in either model.
“Other” sobriety states (e.g., drugs, fatigue) were significant in the count model only (β = –0.165, p = 0.002), where they were associated with fewer victims.
Hypothesis not supported: Alcohol consumption did not significantly increase the number of victims or the likelihood of an injury crash in this model. This may be due to underreporting or other contextual factors.

4. Men have higher overall risk of car crash than women

Male drivers are associated with fewer victims in the count model (β = –0.079, p < 0.001) and are more likely to be in zero-victim crashes (β = +0.382, p < 0.001).
Hypothesis rejected: Male drivers appear to be involved in less harmful crashes, contrary to expectations.

5. Women have higher risk of car crash with victims than men

The zero model coefficient for male drivers is positive (β = +0.382, p < 0.001), so crashes involving men are more likely to be zero-victim; conversely, crashes involving women are more likely to involve victims. In the count model, male drivers are also associated with fewer victims (β = –0.079, p < 0.001).
Hypothesis supported: Women are more likely than men to be involved in crashes with victims.

6. Non-white drivers have higher risk of car crash

The count model shows no significant difference, but the zero model shows that non-white drivers are more likely to be in zero-victim crashes (β = +0.417, p < 0.001).
Hypothesis rejected: Non-white drivers are involved in less harmful crashes overall.

7. Uninsured drivers have higher risk of car crash with victims

Uninsured/unknown drivers are associated with more victims (β = +0.175, p < 0.001) and are more likely to be in zero-victim crashes (β = +0.424, p < 0.001).
Hypothesis supported: While results suggest a paradox (more victims and more harmless crashes), this may reflect underreporting or bimodal crash types (very minor vs. serious).

8. Bad weather conditions indicate higher risk of car crash

Daylight/wet and dark/wet conditions are associated with fewer victims, possibly due to more cautious driving. However, crashes in daylight on wet roads are much less likely to be harmless (β = –1.069, p = 0.004).
Hypothesis partially rejected: Bad conditions do not increase the number of victims, but they may make an injury outcome more likely when a crash occurs.
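
The zero-model coefficients quoted above are on the log-odds scale, so exponentiating them gives odds ratios that are easier to compare. The sketch below does this in base R for the significant terms, with coefficients copied from the final ZINB table; the vector names `zero_beta` and `odds_ratio` are illustrative. Strictly, in a zero-inflated model these odds refer to membership in the "always zero" class, which the text treats as a proxy for zero-victim crashes.

```r
# Zero-inflation (logit) coefficients copied from the final ZINB model
zero_beta <- c(
  "party_sexmale"                                  = 0.382,
  "age_groupover 25"                               = -0.372,
  "financial_responsibility_bino or unknown proof" = 0.424,
  "race_groupnon-white"                            = 0.417,
  "car_age_groupover_15"                           = -0.249,
  "time_of_dayNight"                               = -0.491,
  "lighting_surfacedaylight.non-dry"               = -1.069
)

# Odds ratio for the zero (no-victim) class relative to each baseline level
odds_ratio <- exp(zero_beta)
round(odds_ratio, 2)
```

For example, the male coefficient corresponds to an odds ratio of about 1.47, while daylight on a non-dry surface corresponds to about 0.34, which is roughly a two-thirds reduction in the odds of the zero class.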

7 Findings

In this study, we employed the Zero-Inflated Negative Binomial (ZINB) model to examine the determinants of traffic accidents and their severity. This approach enabled us to analyze each variable’s effect in two distinct dimensions:

Zero-inflation (logit) component: assessed the likelihood of an accident being non-injurious (zero-victim crash)
Count (negative binomial) component: estimated the expected number of victims, conditional on the crash involving injuries.

1. Gender

Driver gender emerged as an influential factor. Male drivers were more likely to be involved in zero-victim accidents, and in injury crashes they were associated with approximately 7.6% fewer victims than female drivers (equivalently, female drivers were associated with about 8.2% more victims).

2. Age of driver

Contrary to traditional expectations, younger drivers (under 25) were more likely to be involved in non-injury crashes. This suggests that older drivers may be involved more frequently in crashes with physical harm.

3. Age of vehicle

Vehicle age also played a notable role. Cars older than 15 years were less likely to be involved in zero-victim accidents, indicating a higher likelihood of causing injury-related crashes (potentially due to outdated safety features or poorer maintenance).

4. Insurance status

Lack of verified insurance was significantly associated with both crash dimensions. Uninsured drivers were more likely to be involved in zero-victim crashes, possibly reflecting minor or underreported incidents. However, when such drivers were involved in injury crashes, they were linked to ~19% more victims.

5. Driver’s race

Driver race also influenced outcomes. Non-white drivers were more likely to be involved in crashes with no injuries, implying that white drivers may be more frequently involved in injury-related accidents.

6. Time of day

The time of day significantly shaped both injury risk and crash severity. Nighttime driving increased the likelihood of injury crashes, but such incidents involved ~18.5% fewer victims (possibly due to fewer passengers). Morning accidents also resulted in ~11% fewer victims than afternoon ones.

7. Airbag and safety equipment

Surprisingly, accidents in which airbags did not deploy were associated with ~42% fewer victims. This may reflect lower-impact collisions that do not activate airbags. Similarly, crashes involving other safety features (e.g. seatbelts) were associated with ~34% fewer victims, aligning with expectations regarding passive safety measures.

8. Weather and road surface

Environmental conditions had a measurable effect. During daytime on wet roads, crashes were more likely to involve injuries but included ~13% fewer victims, likely due to more cautious driving. A similar pattern emerged under dark and wet conditions, where victim counts were ~11% lower (once again possibly due to fewer passengers).

9. Sobriety

Interestingly, no significant difference was observed between drinking and non-drinking drivers. However, drivers under the influence of other substances (e.g. drugs) were associated with ~15% fewer victims in injury crashes.

10. Phone use

Phone use by the driver was also included in the model. Although it was jointly significant, it did not yield informative comparisons.
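
The percentage effects quoted in this section follow from the log link of the count component: a coefficient β corresponds to a multiplicative change of exp(β) in the expected number of victims, that is, a change of 100 × (exp(β) − 1) percent. A minimal base R sketch, using coefficients copied from the final ZINB count model (the helper `pct_change` and the vector `count_beta` are illustrative names):

```r
# Percentage change in the expected count implied by a log-link coefficient
pct_change <- function(beta) 100 * (exp(beta) - 1)

# Count-model coefficients copied from the final ZINB model
count_beta <- c(
  "party_sexmale"                                  = -0.079,
  "financial_responsibility_bino or unknown proof" = 0.175,
  "time_of_dayMorning"                             = -0.115,
  "time_of_dayNight"                               = -0.204,
  "air_bagNo"                                      = -0.542,
  "air_bagOther safety"                            = -0.420,
  "lighting_surfacedaylight.non-dry"               = -0.142,
  "lighting_surfacedark.non-dry"                   = -0.113,
  "sobrietyother"                                  = -0.165
)

# Percentage change relative to each variable's baseline level
round(pct_change(count_beta), 1)
```

Because the transformation is not symmetric, the direction of the comparison matters: a coefficient of –0.079 for male drivers means about 7.6% fewer victims than for female drivers, whereas the reverse comparison, `pct_change(0.079)`, gives about 8.2% more.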

7.1 Next Steps and Alternative Approaches

While the ZINB model allows for an enhanced understanding of both occurrence and severity of crashes, it comes with some limitations. The model’s dual nature can complicate interpretation for policymakers or practitioners seeking straightforward insights.

A potential alternative would be to separate the analysis into two simpler models:

Logistic regression predicting whether any injuries occurred (binary outcome),
Truncated count model (e.g. negative binomial) estimating the number of victims in injury-only crashes.

This modular approach may improve interpretability while still addressing overdispersion and zero inflation where relevant. Future research could also explore spatial effects (e.g. region or county-level risk), driver behaviour history, or the use of machine learning classifiers to capture nonlinear interactions across variables.
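
One way to fit such a two-part specification in a single call is a hurdle model, which pairs a binary model for "any victim" with a zero-truncated count model. The sketch below uses `hurdle()` from the pscl package (the same package that provides `zeroinfl()`) on simulated data. The data frame `toy`, its variables `male`, `night` and `victims`, and all effect sizes are invented for illustration and are not the California data.

```r
library(pscl)  # hurdle(); the same package that provides zeroinfl()

set.seed(42)
n <- 3000

# Hypothetical toy predictors (names and effects are assumptions)
toy <- data.frame(
  male  = rbinom(n, 1, 0.5),
  night = rbinom(n, 1, 0.2)
)

# Part 1: does the crash involve any victim?
p_injury <- plogis(0.4 - 0.4 * toy$male + 0.5 * toy$night)
injured  <- rbinom(n, 1, p_injury)

# Part 2: zero-truncated negative binomial counts,
# obtained by redrawing zeros until every draw is positive
mu    <- exp(0.5 - 0.1 * toy$male - 0.2 * toy$night)
count <- rnbinom(n, size = 2, mu = mu)
zero  <- count == 0
while (any(zero)) {
  count[zero] <- rnbinom(sum(zero), size = 2, mu = mu[zero])
  zero <- count == 0
}
toy$victims <- injured * count

# Logit hurdle for "any victim" plus truncated negative binomial for counts
fit <- hurdle(victims ~ male + night, data = toy,
              dist = "negbin", zero.dist = "binomial")
summary(fit)
```

Note the sign convention: the binomial part of `hurdle()` models the probability of a positive count, whereas the zero-inflation part of `zeroinfl()` models the probability of an excess zero. Coefficients from the two binary components would therefore be expected to have opposite signs for the same effect.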

8 Bibliography

Berhanu Y., Alemayehu E., Schröder D. (2023). Examining Car Accident Prediction Techniques and Road Traffic Congestion: A Comparative Analysis of Road Safety and Prevention of World Challenges in Low-Income and High-Income Countries. Journal of Advanced Transportation, https://doi.org/10.1155/2023/6643412
Blows S., Ivers R., Connor J., Ameratunga S., Norton R. (2003). Car insurance and the risk of car crash injury. Accident Analysis & Prevention, 35(6), 987–990. https://doi.org/10.1016/S0001-4575(02)00106-9
Blows, S., Ivers, R. Q., Woodward, M., Connor, J., & Norton, R. (2003). Vehicle year and the risk of car crash injury. Injury Prevention, 9(4), 353–356. https://doi.org/10.1136/ip.9.4.353
Braver, E. R. (2003). Race, Hispanic origin, and socioeconomic status in relation to motor vehicle occupant death rates and risk factors among adults. Accident Analysis & Prevention, 35(3), 295–309. https://doi.org/10.1016/S0001-4575(01)00106-3
Brijs, T., Karlis, D., & Wets, G. (2008). Studying the effect of weather conditions on daily crash counts using a discrete time-series model. Accident Analysis & Prevention, 40(3), 1180–1190. https://doi.org/10.1016/j.aap.2008.01.001
Connor J., Norton R., Ameratunga S., Jackson R. (2004). The contribution of alcohol to serious car crash injuries. Epidemiology, 15(3), 337–344. https://doi.org/10.1097/01.ede.0000120045.58295.86
Cullen P., Möller H., Woodward M., Senserrick T., Boufous S., Rogers K., Brown J., Ivers R. (2021). Are there sex differences in crash and crash-related injury between men and women? A 13-year cohort study of young drivers in Australia. SSM - Population Health, 14, 100816. https://doi.org/10.1016/j.ssmph.2021.100816
Lam L. T. (2002). Distractions and the risk of car crash injury: The effect of drivers’ age. Journal of Safety Research, 33(3), 411–419. https://doi.org/10.1016/S0022-4375(02)00034-8
Ratiu S. (2003). The history of the internal combustion engine. Annals of the Faculty of Engineering Hunedoara, Tome I, 3. https://annals.fih.upt.ro/pdf-full/2003/ANNALS-2003-3-21.pdf
World Health Organization. (2023). Road traffic injuries. https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
Yau, K. K. W., Wang, K., & Lee, A. H. (2003). Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometrical Journal, 45(4), 437–452. https://doi.org/10.1002/bimj.200390024