Final Project

Abstract

This study investigates factors associated with motor vehicle collisions and their relationships with collision severity. The analysis was conducted using car crash data obtained from the New York City Police Department, which contains approximately 2 million collision records. The highest number of recorded collisions occurred between 2016-2018, with approximately 225,000 collisions occurring in each of these years. While the overall number of collisions seems to decrease from 2018-2022, the number of fatality or injury-inducing collisions increased by approximately 18% from 2018 to 2022.

Brooklyn had the highest number of collisions across all years, accounting for about 31-35% of collisions annually. Driver inattention or distraction was found to be the primary contributing factor for collisions, with approximately 31% of recorded reasons attributed to a driver not paying attention or being distracted. Alcohol abuse was identified as the main contributing factor for car crashes between the hours of 2-4 AM. Moreover, there is a significant relationship between alcohol/drug involvement in collisions and collisions severity.

The median age of drivers, irrespective of gender, was found to be 40, with approximately 50% of drivers falling between the ages of 29-53. The highest number of collisions were caused by drivers in the age range of 25-40. The data set recorded more than twice the number of collisions caused by men than by women. The median age of men involved in collisions was found to be 40, whereas the median age of women involved in collisions was 38. A significant relationship was observed between gender and age with respect to collisions, with the average age of males involved in collisions being higher than the average age of females involved in collisions.

Introduction

According to a report by the National Center for Health Statistics from the CDC, accidents are the fourth leading cause of death in New York State. Among those are included fatalities due to motor vehicle collision related injuries. As someone currently working for a personal injury attorney, I see many cases come in which involve collisions. In fact, most of the cases that come our way are motor vehicle accidents in which our clients sustained numerous injuries. In 2016, Vision Zero was created by the City of New York as a citywide traffic safety initiative aimed at reducing and eliminating serious crashes and crash related fatalities by collecting and gaining insights from traffic accident data. All police officers are required to fill out an MV-104 form at the scene of an accident, and these forms are now able to be filled out electronically and automatically stored in the Police Department’s crime data warehouse.

For this project, I am curious to see what are the main factors of motor vehicle collisions. I also want to understand which factors contribute to the likelihood and severity of motor vehicle collisions in New York City. Specifically, I will be focusing on factors such as age, gender, and driver intoxication, and the relationship that these factors have with the number and severity of collisions.

Loading and Cleaning the Data

The Data

Data is collected by the NYPD and is available online through the NYC Vision Zero Open Data page. For the purpose of this project, data was downloaded from the “Motor Vehicle Collisions – Crashes”, and “Motor Vehicle Collisions – Person” links. Data was downloaded in CSV format and loaded into R through a PostgreSQL data base.

City of New York. Retrieved April 29, 2023, from:

Each observation in the crashes data set corresponds with a crash that occurred, each with a unique collision_id. This can be matched to the collision_id in the person data set.

There are about 1.96M collisions recorded in the crashes data set, with 29 variables. There are about 5M people recorded in the person data set, with 21 variables. A full list of the description of each variable can be found in the OpenData page for each data set.

Loading Data from PostgreSQL Server

con <- dbConnect(
  Postgres(), 
  host = "localhost", 
  port = 5432,
  user = "postgres",
  password = Sys.getenv("SQL_DB_PASS"), 
  dbname = "cuny-sps",
)

crashes_data <- dbGetQuery(con, "SELECT * FROM project.crashes")
person_data <- dbGetQuery(con, "SELECT * FROM project.person")

Data Cleaning

# remove unwanted columns
crashes <- crashes_data |>
  mutate(zip_code = as.numeric(zip_code)) |>
  select(-location, -on_street_name, -cross_street_name, -off_street_name, -c(vehicle_type_code_1:vehicle_type_code_5)) |>
  arrange(collision_id) |>
  select(collision_id, c(crash_date:contributing_factor_vehicle_5))

# choose desired columns
person <- person_data |>
  select("person_id" = unique_id, collision_id, crash_date, crash_time, person_type, position_in_vehicle, person_injury, person_age, person_sex)

Creating a Drivers Data Set

The person data set contains all people involved in a crash. I want to determine what factors relating to the driver of the vehicle contribute to collisions. For this, I will create a subset of the data called drivers which only includes those indicated to be the drivers of vehicles involved in accidents.

The data set dictionary for the person data set indicates that age is automatically calculated by the system based on birth date. For some reason, this data set has values for ages way beyond and below what should be expected (from -999 to 9999). For this reason, I will limit my drivers data set to only include drivers that fall within a plausible age range (for this project, 16-87). One of the limitations in filtering the data is that there may be some ages in this data set which are over or under represented due to incorrect input of the data.

drivers <- person |>
  filter(person_type == "Occupant", 
         position_in_vehicle == "Driver",
         person_age >= 16, person_age <= 87) |>
  select(-person_type, -position_in_vehicle)

Now I will join this to the crashes data sets to create a data set of all crashes relative to the drivers of the vehicles.

crashes_by_driver <- drivers |>
  select(-crash_date, -crash_time) |>
  left_join(crashes, by = "collision_id") |>
  filter(!is.na(collision_id)) |>
  select(collision_id, crash_date, crash_time, borough, person_age, person_sex, number_persons_injured, number_persons_killed) |>
  arrange(collision_id)

Data Set of Fatal/Injury Inducing Accidents

fatalities_and_injuries <- crashes |>
  filter(number_persons_injured > 0 | number_persons_killed > 0)

Preview the Data

Crashes

rmarkdown::paged_table(crashes)

Person

rmarkdown::paged_table(person)

Drivers

rmarkdown::paged_table(drivers)

Crashes by Driver

rmarkdown::paged_table(crashes_by_driver)

Injuries and Fatalities

rmarkdown::paged_table(fatalities_and_injuries)

Exploratory Data Analysis

Collisions by Year

As we are currently in the middle of 2023, the data collection for this year is incomplete and I will limit the data till 2022.

crashes_till_22 <- crashes |>
  mutate(year = year(crash_date)) |>
  filter(year < 2023)

crashes_by_year <- crashes_till_22 |>
  count(year) 

crashes_by_year |>
  ggplot(aes(x = as.factor(year), y = n)) +
  geom_bar(stat="identity", fill = "plum4") +
  labs(title = "Collisions by Year (2012-2022)", x = "Year", y = "Number of Crashes") +
  scale_y_continuous(label = scales::comma)

There is an increase in the number of collisions until 2016 and then a decrease after 2018. The most collisions occurred in 2016-2018, with about 225,000 collisions occurring in each of these years. In 2020, the number of collisions decreased by almost 50% from the previous year, possibly owing to stay-at-home orders due to Covid-19.

Fatalities and Injuries from Crashes

Percentage of Crashes Resulting in Injury/Death

injuries_and_fatalities <- crashes_till_22 |>
  mutate(year = year(crash_date),
         injury = ifelse(number_persons_injured > 0, 1, 0),
         death = ifelse(number_persons_killed > 0, 1, 0)) |>
  group_by(year) |>
  summarize(year_involve_injury = sum(injury, na.rm=T),
            year_involve_death = sum(death, na.rm=T)) |>
  mutate(total_collisions = crashes_by_year$n,
         prop_injured = year_involve_injury / total_collisions,
         prop_killed = year_involve_death/ total_collisions) |>
  ungroup() |>
  pivot_longer(cols = c(prop_injured:prop_killed),
               names_to = "severity",
               values_to = "prop") |>
  mutate(severity = toupper(str_extract(severity, "killed|injured"))) 

injuries_and_fatalities |>
  ggplot(aes(x = as.factor(year), y = prop, fill = severity)) +
  geom_bar(stat="identity") +
  labs(title = "Collisions Involving Injury/Death per Year (2012-2022)", x = "Year", y = "Percentage of Crashes Involving Injury/Death") +
  scale_y_continuous(label = scales::percent) +
  geom_text(aes(label = paste0(round(prop*100, 2),'%')), vjust = -0.5, size=3)

Number of People Injured/Killed in Collisions

crashes_till_22 |>
  mutate(year = year(crash_date),
         injury = ifelse(number_persons_injured > 0, 1, 0),
         death = ifelse(number_persons_killed > 0, 1, 0)) |>
  group_by(year) |>
  summarize(tot_injured = sum(number_persons_injured, na.rm=T),
            tot_killed = sum(number_persons_killed, na.rm=T)) |>
  ungroup() |>
  pivot_longer(cols = c(tot_injured:tot_killed),
               names_to = "severity",
               values_to = "num_persons") |>
  mutate(severity = toupper(str_extract(severity, "killed|injured"))) |>
  ggplot(aes(x = as.factor(year), y = num_persons, fill = severity)) +
  geom_bar(stat="identity") +
  labs(title = "Number of People Injured/Killed per Year (2012-2022)", x = "Year", y = "Number of People Injured/Killed in Collisions") +
  scale_y_continuous(label = scales::comma) +
  geom_text(aes(label = num_persons), vjust = -0.5, size=3)

While the number of collisions seems to have decreased from 2018-2022, the percentage of collisions involving fatalities and/or injuries has increased, with almost 40% of collisions in 2022 involving injuries and about 0.3% of crashes involving deaths.

Collisions by Borough

crashes_till_22 |>
  filter(!is.na(borough) & year >= 2017) |>
  group_by(year) |>
  count(borough) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = borough, y = prop)) +
  geom_bar(stat="identity", fill = "steelblue") +
  facet_wrap(~year) + 
  labs(title = "Collisions by Borough (2017-2022)", x = "Borough", y = "Percentage of Crashes", caption = "*Perentage out of collisions where borough is known (i.e. not NA)") +
  scale_y_continuous(label = scales::percent) +
  geom_text(aes(label = paste0(round(prop*100, 0),'%')), size = 3) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Brooklyn and Queens have consistently had the highest percentage of collisions annually, with Brooklyn making up between 31-35% of collisions and Queens making up about 27-29% of collisions per year from 2017-2022.

Which areas within these boroughs have the highest number of collisions?

Brooklyn

crashes_till_22 |>
  filter(borough == "BROOKLYN") |>
  count(zip_code) |>
  drop_na() |>
  arrange(desc(n)) |>
  head(10) |>
  ggplot(aes(y = reorder(as.factor(zip_code), n), x = n)) +
  geom_bar(stat = "identity", fill = "steelblue3") +
  labs(title = "Highest Collisions by Zipcode - Brooklyn", x = "Count", y = "Zipcode")

The zipcode with the highest number of collisions in Brooklyn is 11207, located in East New York.

Queens

For this, I joined the crashes data to a CSV which I created of Queens cities based on zipcodes. The information for the zipcodes was found here.

queens_zips <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/final_project/queens_zipcodes.csv"))

crashes_till_22 |>
  filter(borough == "QUEENS") |>
  left_join(queens_zips, by = "zip_code") |>
  count(city) |>
  drop_na() |>
  arrange(desc(n)) |>
  head(10) |>
  ggplot(aes(y = reorder(city, n), x = n)) +
  geom_bar(stat = "identity", fill = "steelblue3") +
  labs(title = "Highest Collisions by City - Queens", x = "Count", y = "Zipcode")

The highest number of collisions in Queens are in Jamaica and Flushing.

Main Contributing Factors

To determine the contributing factors of collisions, I will use the contributing_factor_vehicle_x columns from the crashes data set. The data set will have to be transformed to a long format for this.

contributing_factors_by_crash <- crashes |>
  select(collision_id, crash_date, crash_time, borough,  c(contributing_factor_vehicle_1:contributing_factor_vehicle_5))

contributing_factors <- contributing_factors_by_crash |>
  pivot_longer(col = c(contributing_factor_vehicle_1:contributing_factor_vehicle_5),
               names_to = "vehicle",
               values_to = "factor") |>
  mutate(factor = case_when(str_detect(factor, "Cell Phone") ~ "Cell Phone",
                            str_detect(factor, "Drugs") ~ "Drugs",
                            str_detect(factor, "Ill") ~ "Illness",
                            str_detect(factor, "Uninvolved Vehicle") ~ "Reaction to Uninvolved Vehicle",
                            TRUE~factor)) |>
  filter(!is.na(factor), !factor %in% c("Unspecified", "1", "80"))

contributing_factors |>
  count(factor) |>
  arrange(desc(n)) |>
  mutate(prop = n / sum(n)) |>
  head(10) |>
  ggplot(aes(x = prop, y = reorder(factor, n))) +
  geom_bar(stat="identity", fill = "steelblue") +
  scale_x_continuous(label = scales::percent) +
  labs(title = "Top 10 Contributing Factors", x = "Percentage of Crashes", y = "Factor")

The main contributing factor in motor vehicle collisions is driver inattention/distraction, which accounts for about 31% of the causes for collisions within this data set. Other contributing factors include failure to yield to right-of-way, following too closely, and other driver action related factors, as well as driver fatigue.

contributing_factors |>
  select(-vehicle) |>
  unique() |> # remove replicated crashes (i.e. multiple drivers in one collision with same causal factor)
  count(factor) |>
  arrange(desc(n)) |>
  head(1)

## # A tibble: 1 × 2
##   factor                              n
##   <chr>                           <int>
## 1 Driver Inattention/Distraction 416287

416,287 collisions in this data set were caused by one or more drivers being inattentive or distracted.

What is the main contributing factor for crashes occurring at each hour of the day?

Since driver inattention seems to be the most overwhelming factor in this data set, I will filter that out as the top factor and will check to see the second most common factor for each hour.

contributing_factors |> 
  filter(!str_detect(factor, "Inattention/")) |>
  mutate(hour = hour(crash_time)) |>
  group_by(hour) |>
  count(factor) |>
  filter(n == max(n)) |>
  knitr::kable(col.names = c("Hour", "Factor", "Number of Collisions"))

Hour	Factor	Number of Collisions
0	Following Too Closely	4003
1	Other Vehicular	1782
2	Alcohol Involvement	1590
3	Alcohol Involvement	1609
4	Alcohol Involvement	1826
5	Following Too Closely	1425
6	Following Too Closely	3006
7	Following Too Closely	4448
8	Failure to Yield Right-of-Way	8204
9	Failure to Yield Right-of-Way	7141
10	Failure to Yield Right-of-Way	6426
11	Failure to Yield Right-of-Way	6571
12	Failure to Yield Right-of-Way	7383
13	Failure to Yield Right-of-Way	7991
14	Failure to Yield Right-of-Way	9244
15	Following Too Closely	9122
16	Failure to Yield Right-of-Way	10321
17	Failure to Yield Right-of-Way	10737
18	Failure to Yield Right-of-Way	9454
19	Failure to Yield Right-of-Way	7712
20	Failure to Yield Right-of-Way	6259
21	Failure to Yield Right-of-Way	5069
22	Failure to Yield Right-of-Way	4183
23	Failure to Yield Right-of-Way	3117

Between the hours of 2-4 AM, driver inebriation from alcohol involvement is the main factor contributing to collisions. For all other hours, following too closely or failure to yield to right of way is the main contributing factor.

How many collisions are caused by alcohol or drug involvement?

Overall

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  unique() |> # remove duplicate crashes
  nrow()

## [1] 23486

There are 23,486 incidents of alcohol or drug involvement in car accidents recorded in this data set.

How many of these accidents resulted in fatalities or serious injury?

intox_fatalities_injuries <- fatalities_and_injuries |>
  pivot_longer(col = c(contributing_factor_vehicle_1:contributing_factor_vehicle_5),
               names_to = "vehicle",
               values_to = "factor") |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  unique() 

intox_fatalities_injuries |>
  nrow()

## [1] 7486

intox_fatalities_injuries |>
  mutate(injury = ifelse(number_persons_injured > 0, 1, 0),
         fatality = ifelse(number_persons_killed > 0, 1, 0)) |>
  summarize(total_injury_inducing = sum(injury),
            total_fatality_inducing = sum(fatality))

## # A tibble: 1 × 2
##   total_injury_inducing total_fatality_inducing
##                   <dbl>                   <dbl>
## 1                  7439                     117

There are 7,486 fatality/injury inducing accidents caused by alcohol/drug involvement of one or more drivers, 7,439 of which caused severe injuries and 117 of which resulting in death.

Per Year (Table)

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  count(year = year(crash_date)) |>
  knitr::kable(col.names = c("Year", "Number of Collisions Involving Intoxicated Drivers"))

Year	Number of Collisions Involving Intoxicated Drivers
2012	856
2013	1760
2014	2261
2015	2201
2016	3033
2017	2848
2018	2577
2019	2389
2020	1741
2021	1924
2022	1953
2023	568

Per Year (Graph)

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  unique() |>
  count(year = as.factor(year(crash_date))) |> 
  ggplot(aes(x = year, y = n)) +
    geom_bar(stat="identity", fill = "steelblue4") +
    labs(title = "Number of Collisions Involving Intoxicated* Drivers per Year", x = "Year", y = "Number of Accidents", caption = "*Drugs or Alcohol") +
    geom_text(aes(label = n), vjust = -0.5, size=3)

The most recorded collisions from alcohol/drug involvement was in 2016, with 2,925 accidents occurring in relation to intoxicated drivers. Since Vision Zero was only implemented in 2016, the number of collisions from 2012-2015 may not be complete so these numbers may not accurately reflect the true number of crashes caused by alcohol/drug involvement. Likewise, the data for 2023 is not complete, as the data was accessed about five months into the year and only accounts for collisions within those five months. Since the implementation of Vision Zero in 2016, it seems that collisions resulting from drug or alcohol intoxication has decreased till 2020 and then started to increase again slightly. The major dip from 2019 to 2020 may have been caused by stay-at-home policies due to Covid-19 resulting in less drivers overall on roads during this time.

Per Hour of Day

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  unique() |>
  count(hour = as.factor(hour(crash_time))) |> 
  ggplot(aes(x = hour, y = n)) +
    geom_bar(stat="identity", fill = "steelblue4") +
    labs(title = "Number of Collisions Involving Intoxicated* Drivers per Hour of Day", x = "Hour of Day", y = "Number of Accidents", caption = "*Drugs or Alcohol") +
    geom_text(aes(label = n), vjust = -0.5, size=2.5)

The most collisions occur from driver intoxication in the early and later hours of the day (from midnight to 5 AM and rising steadily from 4 to 11 PM).

Across Boroughs

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  filter(!is.na(borough)) |>
  unique() |>
  count(borough) |> 
  ggplot(aes(x = borough, y = n)) +
    geom_bar(stat="identity", fill = "steelblue4") +
    labs(title = "Number of Collisions Involving Intoxicated* Drivers by Borough", x = "Borough", y = "Number of Accidents", caption = "*Drugs or Alcohol") +
    geom_text(aes(label = n), vjust = -0.5, size=3)

The most collisions resulting from alcohol/drug usage occur in Brooklyn and Queens.

Collisions by Age

Vision Zero began in 2016, with better methods for collecting and storing data for collision reports. Therefore, I will limit the data to crashes from 2016-2022.

drivers_2016_2022 <- drivers |>
  mutate(year = year(crash_date)) |>
  filter(year >= 2016 & year <= 2022)

hist(drivers_2016_2022$person_age, 
     main="Distribution of Ages", 
     xlab = "Driver Age")

The distribution of driver’s ages in the data set is skewed to the right, with the most collisions being caused by drivers in the 25-40 range.

summary(drivers_2016_2022$person_age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   29.00   40.00   41.75   53.00   87.00

The median age of drivers in this data set is 40 years old, and the middle 50% of data is contained within the 29-53 range.

Collisions by Gender

Overall

drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F")) |>
  count(person_sex) |>
  ggplot(aes(x = n, y = person_sex, fill = person_sex)) +
  geom_bar(stat="identity") +
  labs(title = "Collisions by Gender (2016-2022)", x = "Number of Collisions", y = "Gender") +
  theme(legend.position="none") +
  scale_x_continuous(labels = scales::comma)

Overall, there seems to be a greater number of collisions caused by males than by females.

Per Year

drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F")) |>
  group_by(year) |>
  count(person_sex) |>
  ggplot(aes(x = n, y = person_sex, fill = person_sex)) +
  geom_bar(stat="identity") +
  facet_wrap(~year) +
  labs(title = "Collisions by Gender (2016-2022)", x = "Number of Collisions", y = "Gender") +
  theme(legend.position="none") +
  scale_x_continuous(labels = scales::comma)

Annually, the number of collisions caused by female drivers is less than half the amount of collisions caused by male drivers.

Collisions by Age and Gender

drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F")) |>
  ggplot(aes(x = person_age, y = person_sex, fill = person_sex)) +
  geom_boxplot() +
  labs(title = "Distribution of Ages by Sex", x = "Age", y = "Sex") +
  theme(legend.position = "none")

Both plots have a skew to the right, with more collisions being caused by younger people of both genders. There does seem to be a relationship between age, gender, and collisions. There seems to be a higher distribution of ages for males involved in accidents than females involved in accidents. However, it is hard to tell from the box plots alone whether this relationship is significant.

males_2016_2022 <- drivers_2016_2022 |>
  filter(person_sex == "M")

females_2016_2022 <- drivers_2016_2022 |>
  filter(person_sex == "F")

summary(males_2016_2022$person_age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   30.00   40.00   42.11   53.00   87.00

summary(females_2016_2022$person_age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   29.00   38.00   40.72   51.00   87.00

The median age for men involved in collisions is 40, with 50% of men involved in collisions being between the ages of 30-53.

The median age for women involved in collisions is 38, with 50% of women involved in collisions being between the ages of 29-51.

Inference

Based on the exploratory data analysis, I would like to determine the extent to which alcohol/drug involvement plays a role in the severity of car crashes, and I would also like to examine if there is a relationship between age and gender and their combined contribution to the number of collisions.

Intoxication vs. Collision Severity

First, let’s look at the relationship between driver intoxication and collision severity.

Question: Is there a significant relationship between alcohol/drug involvement and collision severity?

For this, I will define “collision severity” in two ways:

The number of collisions that result in injury/fatality.
The number of people who are injured/killed in accidents.

The variables for this analysis are categorical (whether or not there is a serious injury or death, whether or not there is alcohol involvement), so I will be using a chi-square test to compare the observed values for the number of accidents with serious injuries/deaths for accidents involving intoxicated and non-intoxicated drivers against the expected values if driver intoxication does not play a significant role.

\(H_0\): There is no significant relationship between alcohol/drug use and collision severity.

\(H_A\): There is a significant relationship between alcohol/drug use and collision severity.

\(\alpha = 0.05\)

Since Vision Zero only started in 2016 and the data for 2023 is not yet complete, I will filter the data to only collisions between 2016 and 2022. I am also going to include the use of prescription medication within the realm of drug and alcohol use.

collision_severity <- crashes |>
  filter(year(crash_date) >= 2016,
         year(crash_date) <= 2022) |>
  select(collision_id, number_persons_injured, number_persons_killed) |>
  mutate(num_injured_or_killed = number_persons_injured + number_persons_killed)

#  subset of collisions involving alcohol/drug use
alcohol_drugs <- contributing_factors |>
  filter(year(crash_date) >= 2016,
         year(crash_date) <= 2022,
         str_detect(tolower(factor), "drugs|alcohol|medication")) |>
  select(-vehicle, -factor) |>
  unique() |> # remove redundant collisions
  left_join(collision_severity, by = "collision_id") |>
  mutate(alc_drugs = "Alc/Drugs")

# subset of collisions not involving alcohol/drug use
not_alcohol_drugs <- contributing_factors |>
  filter(year(crash_date) >= 2016,
         year(crash_date) <= 2022) |>
  subset(!collision_id %in% alcohol_drugs$collision_id) |> # any collision not related to any intoxicated driver
  select(-vehicle, -factor) |>
  unique() |>
  left_join(collision_severity, by = "collision_id") |>
  mutate(alc_drugs = "No Alc/Drugs")

# all collisions
alc_drug_involvement <- rbind(alcohol_drugs, not_alcohol_drugs)

alc_drug_involvement |>
  count(alc_drugs)

## # A tibble: 2 × 2
##   alc_drugs         n
##   <chr>         <int>
## 1 Alc/Drugs     17398
## 2 No Alc/Drugs 914768

alc_drug_involvement <- alc_drug_involvement |>
  mutate(injured_or_killed = ifelse(num_injured_or_killed > 0, "Injuries/Death", "No Injuries/Death")) 

glimpse(alc_drug_involvement)

## Rows: 932,166
## Columns: 9
## $ collision_id           <dbl> 3363357, 3363421, 3363487, 3363489, 3363516, 33…
## $ crash_date             <date> 2016-01-01, 2016-01-01, 2016-01-01, 2016-01-01…
## $ crash_time             <time> 11:30:00, 04:35:00, 06:30:00, 02:54:00, 03:40:…
## $ borough                <chr> "MANHATTAN", "BRONX", "BROOKLYN", "BROOKLYN", "…
## $ number_persons_injured <dbl> 2, 0, 0, 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 1,…
## $ number_persons_killed  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ num_injured_or_killed  <dbl> 2, 0, 0, 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 1,…
## $ alc_drugs              <chr> "Alc/Drugs", "Alc/Drugs", "Alc/Drugs", "Alc/Dru…
## $ injured_or_killed      <chr> "Injuries/Death", "No Injuries/Death", "No Inju…

Chi-Square Analysis 1 - Number of Collisions

serious_injuries_intox_or_not <- table(alc_drug_involvement$injured_or_killed, alc_drug_involvement$alc_drugs)

(chi_sq1 <- chisq.test(serious_injuries_intox_or_not))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  serious_injuries_intox_or_not
## X-squared = 477.11, df = 1, p-value < 2.2e-16

The p-value here is less than 2.2e-16 which is much less than 0.05 so there is a significant relationship between driver intoxication and serious injury/death in collisions.

chi_sq1$residuals

##                    
##                      Alc/Drugs No Alc/Drugs
##   Injuries/Death     18.780622    -2.590045
##   No Injuries/Death -10.764323     1.484513

assocplot(serious_injuries_intox_or_not, col = c("green", "grey"), space = 0.5, main = "Associations Between Driver Intoxication and Collisions Severity", xlab = "Injuries or Death?", ylab = "Alcohol or Drugs Involved?")

assocstats(serious_injuries_intox_or_not)

##                     X^2 df P(> X^2)
## Likelihood Ratio 451.78  1        0
## Pearson          477.49  1        0
## 
## Phi-Coefficient   : 0.023 
## Contingency Coeff.: 0.023 
## Cramer's V        : 0.023

There seems to be a high association between collisions involving alcohol/drugs and collisions resulting in injury/death.

Cramer’s V for this test is 0.023, which indicates a weak association.

Chi-Square Analysis 2 - Number of People Injured/Killed

alc_drug_involvement |>
  filter(num_injured_or_killed > 0) |>
  count(num_injured_or_killed) |>
  ggplot(aes(x = num_injured_or_killed, y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Distribution of Number of People Injured in Collisions", y = "Count", x = "Number of People Injured")

alc_drug_involvement <- alc_drug_involvement |> 
  mutate(num_injured_or_killed = ifelse(num_injured_or_killed >= 9, "9+", as.character(num_injured_or_killed)))

num_injured_alc_involvement <- table(alc_drug_involvement$num_injured_or_killed, alc_drug_involvement$alc_drugs)

(chi_sq2 <- chisq.test(num_injured_alc_involvement))

## Warning in chisq.test(num_injured_alc_involvement): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  num_injured_alc_involvement
## X-squared = 902.7, df = 9, p-value < 2.2e-16

Once again, the p-value is less than 2.2e-16 which is much less than 0.05, so the results are that the relationship is significant.

chi_sq2$residuals

##     
##        Alc/Drugs No Alc/Drugs
##   0  -10.7643235    1.4845130
##   1    7.6360389   -1.0530898
##   2   15.5004082   -2.1376687
##   3   15.6302570   -2.1555762
##   4   10.3165534   -1.4227608
##   5    6.4394569   -0.8880686
##   6    5.3293324   -0.7349708
##   7    6.3269319   -0.8725502
##   8    3.1552064   -0.4351360
##   9+   0.9076880   -0.1251797

assocstats(num_injured_alc_involvement)

##                     X^2 df P(> X^2)
## Likelihood Ratio 742.24  9        0
## Pearson          902.70  9        0
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.031 
## Cramer's V        : 0.031

Based on the residuals, there are higher numbers of instances of more people being injured in crashes with intoxicated drivers than would be expected if there was no relationship.

Cramer’s V here is 0.031, which also indicates a weak association.

Assumptions of \(\chi^2\):

As stated above, the variables are categorical. Additionally, each observation in the alc_drug_involvement data set is independent of each other, as each records a specific collision where alcohol or drugs were either involved or not involved. The cells of the contingency table are mutually exclusive, and the expected value for 100% of the cells is greater than 5. Therefore, all assumptions of chi-square are met. (Expected values must be greater than 5 in at least 80% of the cells, and cannot go below 1).

chi_sq1$expected

##                    
##                     Alc/Drugs No Alc/Drugs
##   Injuries/Death     4302.163     226199.8
##   No Injuries/Death 13095.837     688555.2

chi_sq2$expected

##     
##         Alc/Drugs No Alc/Drugs
##   0  13095.837376  688555.1626
##   1   3295.633283  173278.3667
##   2    660.604871   34733.3951
##   3    213.575791   11229.4242
##   4     79.192701    4163.8073
##   5     30.460167    1601.5398
##   6     11.739856     617.2601
##   7      5.356659     281.6433
##   8      2.258382     118.7416
##   9+     3.340913     175.6591

Intoxication and Time of Day

Question: Is there a significant relationship between collisions caused by driver intoxication and time of day?

\(H_0\): There is no significant relationship between alcohol/drug involvement in collisions and time of day.

\(H_A\): There is a significant relationship between alcohol/drug involvement in collisions and time of day.

\(\alpha = 0.05\)

time_of_day <- alc_drug_involvement |>
  mutate(time_of_day = case_when(hour(crash_time) >= 5 & hour(crash_time) < 12 ~ "Morning",
                                 hour(crash_time) >= 12 & hour(crash_time) < 17 ~ "Afternoon",               
                                 TRUE ~ "Evening"))

intox_vs_time_of_day <- table(time_of_day$time_of_day, time_of_day$alc_drugs)

(chi_sq3 <- chisq.test(intox_vs_time_of_day))

## 
##  Pearson's Chi-squared test
## 
## data:  intox_vs_time_of_day
## X-squared = 6679.8, df = 2, p-value < 2.2e-16

The p-value is 2.2e-16 which is very small and less than 0.05, so there is a relationship between collisions caused by driver intoxication and time of day.

chi_sq3$residuals

##            
##              Alc/Drugs No Alc/Drugs
##   Afternoon -46.311833     6.386842
##   Evening    61.356442    -8.461636
##   Morning   -25.411440     3.504479

assocplot(intox_vs_time_of_day, col = c("green", "grey"), space = 0.5, main = "Associations Between Driver Intoxication and Time of Day", xlab = "Time of Day", ylab = "Alcohol or Drugs Involved?")

There are more accidents that occur in the Evening than would be expected if there was no relationship. Accidents involving alcohol/drugs are associated with occurring in the Evening.

assocstats(intox_vs_time_of_day)

##                     X^2 df P(> X^2)
## Likelihood Ratio 6780.3  2        0
## Pearson          6679.8  2        0
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.084 
## Cramer's V        : 0.085

Cramer’s V here is 0.085, which indicates a weak association.

Gender and Alcohol/Drug Involved Collisions

While I wanted to see the breakdown of male vs female intoxicated drivers, one of the limitations of the data set is that there is no indication in the person data set as to which vehicle number they were assigned on the MV-104 form (i.e. contributing_factor_vehicle_x). Therefore, we cannot match the people to their main contributing factor in each collision. (While there is a contributing factor column for the person data set, it is not as robust as the contributing factor columns from the crashes data set).

Instead, let’s look to see which drivers are mostly involved in collisions that have alcohol involved (regardless of whether they were the intoxicated driver) and let’s see who is more likely to be injured/killed in collisions involving alcohol/drugs.

Drivers Involved in Collisions Involving Alcohol or Drugs

person_alc_drugs <- alc_drug_involvement |>
  select(-crash_date, -crash_time, -borough) |>
  right_join(person, by = "collision_id") |>
  filter(!is.na(person_sex), 
         alc_drugs == "Alc/Drugs",
         person_type == "Occupant",
         person_sex %in% c("M", "F"))

person_alc_drugs |>
  filter(position_in_vehicle == "Driver") |>
  ggplot(aes(y = person_sex, fill = person_sex)) +
  geom_bar() +
  scale_x_continuous(label = scales::comma) +
  labs(title = "Drivers Involved in Alcohol/Drug Related Collisions*", x = "Count", y = "Driver Gender", caption = "*Regardless of if they were the intoxicated driver")

People Injured/Killed in Collisions Involving Alcohol or Drugs

person_alc_drugs |>
  filter(person_injury %in% c("Killed", "Injured")) |>
  ggplot(aes(y = person_sex, fill = person_sex)) +
  geom_bar() +
  scale_x_continuous(label = scales::comma) +
  labs(title = "People Injured/Killed in Alcohol/Drug Related Collisions", x = "Count", y = "Gender")

Age and Gender vs. Collisions

Now, let’s look at the relationship between age and gender and their joint contribution to the likelihood of collisions.

Question: Is there a significant difference in the ages of males and females involved in motor vehicle collisions?

\(H_0: \mu_1 = \mu_2\) (There is no significant difference in the ages of males and females involved in motor vehicle collisions)

\(H_A: \mu_1 \neq \mu_2\) (There is a significant difference in the ages of males and females involved in motor vehicle collisions)

Where \(\mu_1\) is the mean age of males involved in collisions and \(\mu_2\) is the mean age of females involved in collisions.

\(\alpha = 0.05\)

drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F")) |>
  group_by(year = year(crash_date), person_sex) |>
  summarize(median = median(person_age),
            IQR = IQR(person_age),
            num_drivers = n())

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 5
## # Groups:   year [7]
##     year person_sex median   IQR num_drivers
##    <dbl> <chr>       <dbl> <dbl>       <int>
##  1  2016 F              38    22       75237
##  2  2016 M              41    23      206770
##  3  2017 F              38    22       92827
##  4  2017 M              41    23      257642
##  5  2018 F              39    22       92607
##  6  2018 M              41    24      259393
##  7  2019 F              39    22       83058
##  8  2019 M              41    24      234352
##  9  2020 F              37    22       35820
## 10  2020 M              39    23      107255
## 11  2021 F              36    21       33068
## 12  2021 M              38    23       96215
## 13  2022 F              37    21       31188
## 14  2022 M              39    24       89300

To make the data more manageable, let’s look at just collisions for 2022.

drivers_2022 <- drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F"), year == 2022) 

drivers_2022 |>
  ggplot(aes(x = person_age, y = person_sex, fill = person_sex)) +
  geom_boxplot() +
  labs(title = "Distribution of Ages by Sex in 2022", x = "Age", y = "Sex") +
  theme(legend.position = "none")

drivers_2022 |>
  group_by(person_sex) |>
  summarize(mean_age = mean(person_age))

## # A tibble: 2 × 2
##   person_sex mean_age
##   <chr>         <dbl>
## 1 F              40.2
## 2 M              41.5

There is an observed difference. But, is this difference statistically significant?

males <- drivers_2022 |>
  filter(person_sex == "M")

females <- drivers_2022 |>
  filter(person_sex == "F")

Assumptions of t-test:

drivers_2022 |>
  count(person_sex)

##   person_sex     n
## 1          F 31188
## 2          M 89300

sd(males$person_age)

## [1] 14.74327

sd(females$person_age)

## [1] 14.25776

par(mfrow=c(1,2))

qqnorm(males$person_age, main = "Normal Q-Q Plot - Males")
qqline(males$person_age)

qqnorm(females$person_age, main = "Normal Q-Q Plot - Females")
qqline(females$person_age)

Each observation in the data set represents a single driver, independent of one another. From the QQ-plot, we can see that the distribution of ages for males and females closely follows the straight line in the middle (about ages 30-70 for males and about 30-80 for females), but there are problems towards the tails of the distributions. Since this is a very large sample size, and the data is relatively normal, we can continue with the analysis. (A note on this: I also performed the analysis using the log of person_age to normalize the data more and saw no difference in the results vs. the results of the un-transformed data. For interpretability purposes, I chose to use the un-transformed data.) The variance for both groups is about the same. Since the data was collected by the NYPD and records drivers actually involved in accidents, it is reasonable to assume that the data is random and does not have biases. Therefore, assumptions of the t-test are met.

(t_test <- t.test(males$person_age, females$person_age, alternative = "two.sided", var.equal = T))

## 
##  Two Sample t-test
## 
## data:  males$person_age and females$person_age
## t = 12.944, df = 120486, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.056131 1.433058
## sample estimates:
## mean of x mean of y 
##  41.46699  40.22239

The p-value for this t-test is very small (< 2.2e-16), so therefore we reject the null hypothesis and say that there is a significant difference in the mean ages between males and females involved in accidents. The mean age for men is about 41.4 and the mean age for women is about 40.2. Based on the 95% confidence interval, we are 95% confident that the true difference in the mean ages for men and women involved in car accidents is between 1.06 and 1.43.

We can also visualize this significance using the null distribution.

obs_diff <- drivers_2022 |>
  specify(person_age~person_sex) |>
  calculate(stat = "diff in means", order = c("M", "F"))

null_dist <- drivers_2022 |>
  specify(person_age~person_sex) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("M", "F"))

visualize(null_dist) + shade_p_value(obs_stat = obs_diff, direction = "two_sided")

The observed difference is way off from what we would expect to see if there was not a statistically significant relationship. Therefore, we can conclude that there is a significant difference in the mean ages of men and women who get involved in collisions. As such, there is a significant relationship between the ages of men and women in relation to being involved in collisions. Men who get into car accidents tend to be slightly older than women who get into car accidents.

Conclusions

Based on the analysis of car crash data obtained from the New York City Police Department, this project identified driver inattention/distraction as the primary contributing factor for collisions, with 416,287 collisions recorded as being caused by one or more drivers who were distracted or not paying attention. Aside from this, alcohol abuse was found to be the main contributing factor for car crashes between the hours of 2-4 AM. Using a chi-square test of independence, the data provided evidence that there is a significant relationship between driver intoxication and severity of collisions. There is also association between driver intoxication and time of day. Both of these have weak associations. There may be other factors that are not included in the analysis that have a an impact on the number and severity of collisions.

The analysis also revealed a significant relationship between gender and age with respect to collisions. The data showed a higher number of collisions caused by men than by women, with the median age of men involved in collisions (40) being higher than the median age of women involved in collisions (38). Using a two-sample t-test and visualizing with the null distribution, this difference was shown to be significant, with the average age of men involved in collisions being significantly higher than the average age of women involved in accidents.

According to the data, Brooklyn and Queens had the largest percentage of collisions in the data set. The zipcode with the highest number of collisions in Brooklyn was 11207, located in East New York. The highest number of collisions in Queens occurred in Jamaica and Flushing. This type of analysis, along with determining streets or cross roads with high densities of collisions occurring, can allow city planners and policymakers to determine areas where traffic safety measures need to be implemented, such as increasing the number of traffic signals or improving road signs, etc.

While there is more analysis that can be done using this data, this study provides a brief insight into some of the underlying factors which contribute to collisions and the severity of collisions. This also highlights the need for future research to be able to design evidence-based interventions and strategies to reduce the number of motor vehicle collisions, serious injuries, and deaths in New York City.

Limitations

There were a few limitations I noticed while analyzing these data sets. As noted above, even though age is described as being calculated based on birth date, there are a number of ages which fall outside the range of plausibility. When creating a data set of drivers involved in these collisions, I chose to filter the data set only for people within the ages of 16 and 87 i order to improve accuracy and to be able to analyze the relationship that age may have with the number and severity of crashes. However, there may be ages that are over or under represented in this new data set due to this error and subsequent filtering.

Another limitation that I found was that the crashes data set has separate columns for the contributing factors based on each vehicle involved in a collision (contributing_factor_vehicle_x), but, there is no indication in the person data set as to which vehicle they were in (i.e. were they vehicle 1 in the MV-104 form or vehicle 2, etc). Additionally, the contributing_factor_x columns in the person data set was not as robust as the contributing factors columns in the crashes data set. For this reason, I could not compare the main contributing factors for car accidents across different genders or age groups, as I could not match up the people to the vehicles involved in the crashes.

An additional limitation is that this data is limited only to reported accidents. Accidents which were not reported to the authorities are not included in this data set and therefore the data may not be representative of all collisions in New York City as certain types of collisions or areas with low reporting rates may not be represented.

Final Project

Shoshana Farber

May 14, 2023

Abstract

Introduction

Loading and Cleaning the Data

The Data

Loading Data from PostgreSQL Server

Data Cleaning

Creating a Drivers Data Set

Data Set of Fatal/Injury Inducing Accidents

Preview the Data

Crashes

Person

Drivers

Crashes by Driver

Injuries and Fatalities

Exploratory Data Analysis

Collisions by Year

Fatalities and Injuries from Crashes

Collisions by Borough

Brooklyn

Queens

Main Contributing Factors

Overall

Per Year (Table)

Per Year (Graph)

Per Hour of Day

Across Boroughs

Collisions by Age

Collisions by Gender

Overall

Per Year

Collisions by Age and Gender

Inference

Intoxication vs. Collision Severity

Chi-Square Analysis 1 - Number of Collisions

Chi-Square Analysis 2 - Number of People Injured/Killed

Intoxication and Time of Day

Gender and Alcohol/Drug Involved Collisions

Age and Gender vs. Collisions

Conclusions

Limitations