Based on the exploratory data analysis, I would like to determine the
extent to which alcohol/drug involvement plays a role in the severity of
car crashes, and I would also like to examine if there is a relationship
between age and gender and their combined contribution to the number of
collisions.
Intoxication vs. Collision Severity
First, let’s look at the relationship between driver intoxication and
collision severity.
Question: Is there a significant relationship between
alcohol/drug involvement and collision severity?
For this, I will define “collision severity” in two ways:
- The number of collisions that result in injury/fatality.
- The number of people who are injured/killed in accidents.
The variables for this analysis are categorical (whether or not there
is a serious injury or death, whether or not there is alcohol
involvement), so I will be using a chi-square test to compare the
observed values for the number of accidents with serious injuries/deaths
for accidents involving intoxicated and non-intoxicated drivers against
the expected values if driver intoxication does not play a significant
role.
\(H_0\): There is no significant
relationship between alcohol/drug use and collision severity.
\(H_A\): There is a significant
relationship between alcohol/drug use and collision severity.
\(\alpha = 0.05\)
Since Vision Zero only started in 2016 and the data for 2023 is not
yet complete, I will filter the data to only collisions between 2016 and
2022. I am also going to include the use of prescription medication
within the realm of drug and alcohol use.
collision_severity <- crashes |>
filter(year(crash_date) >= 2016,
year(crash_date) <= 2022) |>
select(collision_id, number_persons_injured, number_persons_killed) |>
mutate(num_injured_or_killed = number_persons_injured + number_persons_killed)
# subset of collisions involving alcohol/drug use
alcohol_drugs <- contributing_factors |>
filter(year(crash_date) >= 2016,
year(crash_date) <= 2022,
str_detect(tolower(factor), "drugs|alcohol|medication")) |>
select(-vehicle, -factor) |>
unique() |> # remove redundant collisions
left_join(collision_severity, by = "collision_id") |>
mutate(alc_drugs = "Alc/Drugs")
# subset of collisions not involving alcohol/drug use
not_alcohol_drugs <- contributing_factors |>
filter(year(crash_date) >= 2016,
year(crash_date) <= 2022) |>
subset(!collision_id %in% alcohol_drugs$collision_id) |> # any collision not related to any intoxicated driver
select(-vehicle, -factor) |>
unique() |>
left_join(collision_severity, by = "collision_id") |>
mutate(alc_drugs = "No Alc/Drugs")
# all collisions
alc_drug_involvement <- rbind(alcohol_drugs, not_alcohol_drugs)
alc_drug_involvement |>
count(alc_drugs)
## # A tibble: 2 × 2
## alc_drugs n
## <chr> <int>
## 1 Alc/Drugs 17398
## 2 No Alc/Drugs 914768
alc_drug_involvement <- alc_drug_involvement |>
mutate(injured_or_killed = ifelse(num_injured_or_killed > 0, "Injuries/Death", "No Injuries/Death"))
glimpse(alc_drug_involvement)
## Rows: 932,166
## Columns: 9
## $ collision_id <dbl> 3363357, 3363421, 3363487, 3363489, 3363516, 33…
## $ crash_date <date> 2016-01-01, 2016-01-01, 2016-01-01, 2016-01-01…
## $ crash_time <time> 11:30:00, 04:35:00, 06:30:00, 02:54:00, 03:40:…
## $ borough <chr> "MANHATTAN", "BRONX", "BROOKLYN", "BROOKLYN", "…
## $ number_persons_injured <dbl> 2, 0, 0, 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 1,…
## $ number_persons_killed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ num_injured_or_killed <dbl> 2, 0, 0, 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 1,…
## $ alc_drugs <chr> "Alc/Drugs", "Alc/Drugs", "Alc/Drugs", "Alc/Dru…
## $ injured_or_killed <chr> "Injuries/Death", "No Injuries/Death", "No Inju…
Chi-Square Analysis 1 - Number of Collisions
serious_injuries_intox_or_not <- table(alc_drug_involvement$injured_or_killed, alc_drug_involvement$alc_drugs)
(chi_sq1 <- chisq.test(serious_injuries_intox_or_not))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: serious_injuries_intox_or_not
## X-squared = 477.11, df = 1, p-value < 2.2e-16
The p-value here is less than 2.2e-16 which is much less than 0.05 so
there is a significant relationship between driver intoxication and
serious injury/death in collisions.
chi_sq1$residuals
##
## Alc/Drugs No Alc/Drugs
## Injuries/Death 18.780622 -2.590045
## No Injuries/Death -10.764323 1.484513
assocplot(serious_injuries_intox_or_not, col = c("green", "grey"), space = 0.5, main = "Associations Between Driver Intoxication and Collisions Severity", xlab = "Injuries or Death?", ylab = "Alcohol or Drugs Involved?")

assocstats(serious_injuries_intox_or_not)
## X^2 df P(> X^2)
## Likelihood Ratio 451.78 1 0
## Pearson 477.49 1 0
##
## Phi-Coefficient : 0.023
## Contingency Coeff.: 0.023
## Cramer's V : 0.023
There seems to be a high association between collisions involving
alcohol/drugs and collisions resulting in injury/death.
Cramer’s V for this test is 0.023, which indicates a weak
association.
Chi-Square Analysis 2 - Number of People Injured/Killed
alc_drug_involvement |>
filter(num_injured_or_killed > 0) |>
count(num_injured_or_killed) |>
ggplot(aes(x = num_injured_or_killed, y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Number of People Injured in Collisions", y = "Count", x = "Number of People Injured")

alc_drug_involvement <- alc_drug_involvement |>
mutate(num_injured_or_killed = ifelse(num_injured_or_killed >= 9, "9+", as.character(num_injured_or_killed)))
num_injured_alc_involvement <- table(alc_drug_involvement$num_injured_or_killed, alc_drug_involvement$alc_drugs)
(chi_sq2 <- chisq.test(num_injured_alc_involvement))
## Warning in chisq.test(num_injured_alc_involvement): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: num_injured_alc_involvement
## X-squared = 902.7, df = 9, p-value < 2.2e-16
Once again, the p-value is less than 2.2e-16 which is much less than
0.05, so the results are that the relationship is significant.
chi_sq2$residuals
##
## Alc/Drugs No Alc/Drugs
## 0 -10.7643235 1.4845130
## 1 7.6360389 -1.0530898
## 2 15.5004082 -2.1376687
## 3 15.6302570 -2.1555762
## 4 10.3165534 -1.4227608
## 5 6.4394569 -0.8880686
## 6 5.3293324 -0.7349708
## 7 6.3269319 -0.8725502
## 8 3.1552064 -0.4351360
## 9+ 0.9076880 -0.1251797
assocstats(num_injured_alc_involvement)
## X^2 df P(> X^2)
## Likelihood Ratio 742.24 9 0
## Pearson 902.70 9 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.031
## Cramer's V : 0.031
Based on the residuals, there are higher numbers of instances of more
people being injured in crashes with intoxicated drivers than would be
expected if there was no relationship.
Cramer’s V here is 0.031, which also indicates a weak
association.
Assumptions of \(\chi^2\):
As stated above, the variables are categorical. Additionally, each
observation in the alc_drug_involvement data set is
independent of each other, as each records a specific collision where
alcohol or drugs were either involved or not involved. The cells of the
contingency table are mutually exclusive, and the expected value for
100% of the cells is greater than 5. Therefore, all assumptions of
chi-square are met. (Expected values must be greater than 5 in at least
80% of the cells, and cannot go below 1).
chi_sq1$expected
##
## Alc/Drugs No Alc/Drugs
## Injuries/Death 4302.163 226199.8
## No Injuries/Death 13095.837 688555.2
chi_sq2$expected
##
## Alc/Drugs No Alc/Drugs
## 0 13095.837376 688555.1626
## 1 3295.633283 173278.3667
## 2 660.604871 34733.3951
## 3 213.575791 11229.4242
## 4 79.192701 4163.8073
## 5 30.460167 1601.5398
## 6 11.739856 617.2601
## 7 5.356659 281.6433
## 8 2.258382 118.7416
## 9+ 3.340913 175.6591
Intoxication and Time of Day
Question: Is there a significant relationship between
collisions caused by driver intoxication and time of day?
\(H_0\): There is no significant
relationship between alcohol/drug involvement in collisions and time of
day.
\(H_A\): There is a significant
relationship between alcohol/drug involvement in collisions and time of
day.
\(\alpha = 0.05\)
time_of_day <- alc_drug_involvement |>
mutate(time_of_day = case_when(hour(crash_time) >= 5 & hour(crash_time) < 12 ~ "Morning",
hour(crash_time) >= 12 & hour(crash_time) < 17 ~ "Afternoon",
TRUE ~ "Evening"))
intox_vs_time_of_day <- table(time_of_day$time_of_day, time_of_day$alc_drugs)
(chi_sq3 <- chisq.test(intox_vs_time_of_day))
##
## Pearson's Chi-squared test
##
## data: intox_vs_time_of_day
## X-squared = 6679.8, df = 2, p-value < 2.2e-16
The p-value is 2.2e-16 which is very small and less than 0.05, so
there is a relationship between collisions caused by driver intoxication
and time of day.
chi_sq3$residuals
##
## Alc/Drugs No Alc/Drugs
## Afternoon -46.311833 6.386842
## Evening 61.356442 -8.461636
## Morning -25.411440 3.504479
assocplot(intox_vs_time_of_day, col = c("green", "grey"), space = 0.5, main = "Associations Between Driver Intoxication and Time of Day", xlab = "Time of Day", ylab = "Alcohol or Drugs Involved?")

There are more accidents that occur in the Evening than would be
expected if there was no relationship. Accidents involving alcohol/drugs
are associated with occurring in the Evening.
assocstats(intox_vs_time_of_day)
## X^2 df P(> X^2)
## Likelihood Ratio 6780.3 2 0
## Pearson 6679.8 2 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.084
## Cramer's V : 0.085
Cramer’s V here is 0.085, which indicates a weak association.
Gender and Alcohol/Drug Involved Collisions
While I wanted to see the breakdown of male vs female intoxicated
drivers, one of the limitations of the data set is that there is no
indication in the person data set as to which vehicle
number they were assigned on the MV-104 form
(i.e. contributing_factor_vehicle_x). Therefore, we cannot
match the people to their main contributing factor in each collision.
(While there is a contributing factor column for the
person data set, it is not as robust as the
contributing factor columns from the crashes data
set).
Instead, let’s look to see which drivers are mostly involved in
collisions that have alcohol involved (regardless of whether they were
the intoxicated driver) and let’s see who is more likely to be
injured/killed in collisions involving alcohol/drugs.
Drivers Involved in Collisions Involving Alcohol or
Drugs
person_alc_drugs <- alc_drug_involvement |>
select(-crash_date, -crash_time, -borough) |>
right_join(person, by = "collision_id") |>
filter(!is.na(person_sex),
alc_drugs == "Alc/Drugs",
person_type == "Occupant",
person_sex %in% c("M", "F"))
person_alc_drugs |>
filter(position_in_vehicle == "Driver") |>
ggplot(aes(y = person_sex, fill = person_sex)) +
geom_bar() +
scale_x_continuous(label = scales::comma) +
labs(title = "Drivers Involved in Alcohol/Drug Related Collisions*", x = "Count", y = "Driver Gender", caption = "*Regardless of if they were the intoxicated driver")

People Injured/Killed in Collisions Involving Alcohol or
Drugs
person_alc_drugs |>
filter(person_injury %in% c("Killed", "Injured")) |>
ggplot(aes(y = person_sex, fill = person_sex)) +
geom_bar() +
scale_x_continuous(label = scales::comma) +
labs(title = "People Injured/Killed in Alcohol/Drug Related Collisions", x = "Count", y = "Gender")

Age and Gender vs. Collisions
Now, let’s look at the relationship between age and gender and their
joint contribution to the likelihood of collisions.
Question: Is there a significant difference in the ages of
males and females involved in motor vehicle collisions?
\(H_0: \mu_1 = \mu_2\) (There is no
significant difference in the ages of males and females involved in
motor vehicle collisions)
\(H_A: \mu_1 \neq \mu_2\) (There is
a significant difference in the ages of males and females involved in
motor vehicle collisions)
Where \(\mu_1\) is the mean age of
males involved in collisions and \(\mu_2\) is the mean age of females involved
in collisions.
\(\alpha = 0.05\)
drivers_2016_2022 |>
filter(person_sex %in% c("M", "F")) |>
group_by(year = year(crash_date), person_sex) |>
summarize(median = median(person_age),
IQR = IQR(person_age),
num_drivers = n())
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 5
## # Groups: year [7]
## year person_sex median IQR num_drivers
## <dbl> <chr> <dbl> <dbl> <int>
## 1 2016 F 38 22 75237
## 2 2016 M 41 23 206770
## 3 2017 F 38 22 92827
## 4 2017 M 41 23 257642
## 5 2018 F 39 22 92607
## 6 2018 M 41 24 259393
## 7 2019 F 39 22 83058
## 8 2019 M 41 24 234352
## 9 2020 F 37 22 35820
## 10 2020 M 39 23 107255
## 11 2021 F 36 21 33068
## 12 2021 M 38 23 96215
## 13 2022 F 37 21 31188
## 14 2022 M 39 24 89300
To make the data more manageable, let’s look at just collisions for
2022.
drivers_2022 <- drivers_2016_2022 |>
filter(person_sex %in% c("M", "F"), year == 2022)
drivers_2022 |>
ggplot(aes(x = person_age, y = person_sex, fill = person_sex)) +
geom_boxplot() +
labs(title = "Distribution of Ages by Sex in 2022", x = "Age", y = "Sex") +
theme(legend.position = "none")

drivers_2022 |>
group_by(person_sex) |>
summarize(mean_age = mean(person_age))
## # A tibble: 2 × 2
## person_sex mean_age
## <chr> <dbl>
## 1 F 40.2
## 2 M 41.5
There is an observed difference. But, is this difference
statistically significant?
males <- drivers_2022 |>
filter(person_sex == "M")
females <- drivers_2022 |>
filter(person_sex == "F")
Assumptions of t-test:
drivers_2022 |>
count(person_sex)
## person_sex n
## 1 F 31188
## 2 M 89300
sd(males$person_age)
## [1] 14.74327
sd(females$person_age)
## [1] 14.25776
par(mfrow=c(1,2))
qqnorm(males$person_age, main = "Normal Q-Q Plot - Males")
qqline(males$person_age)
qqnorm(females$person_age, main = "Normal Q-Q Plot - Females")
qqline(females$person_age)

Each observation in the data set represents a single driver,
independent of one another. From the QQ-plot, we can see that the
distribution of ages for males and females closely follows the straight
line in the middle (about ages 30-70 for males and about 30-80 for
females), but there are problems towards the tails of the distributions.
Since this is a very large sample size, and the data is relatively
normal, we can continue with the analysis. (A note on this: I also
performed the analysis using the log of person_age to
normalize the data more and saw no difference in the results vs. the
results of the un-transformed data. For interpretability purposes, I
chose to use the un-transformed data.) The variance for both groups is
about the same. Since the data was collected by the NYPD and records
drivers actually involved in accidents, it is reasonable to assume that
the data is random and does not have biases. Therefore, assumptions of
the t-test are met.
(t_test <- t.test(males$person_age, females$person_age, alternative = "two.sided", var.equal = T))
##
## Two Sample t-test
##
## data: males$person_age and females$person_age
## t = 12.944, df = 120486, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.056131 1.433058
## sample estimates:
## mean of x mean of y
## 41.46699 40.22239
The p-value for this t-test is very small (< 2.2e-16), so
therefore we reject the null hypothesis and say that there is a
significant difference in the mean ages between males and females
involved in accidents. The mean age for men is about 41.4 and the mean
age for women is about 40.2. Based on the 95% confidence interval, we
are 95% confident that the true difference in the mean ages for men and
women involved in car accidents is between 1.06 and 1.43.
We can also visualize this significance using the null
distribution.
obs_diff <- drivers_2022 |>
specify(person_age~person_sex) |>
calculate(stat = "diff in means", order = c("M", "F"))
null_dist <- drivers_2022 |>
specify(person_age~person_sex) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "diff in means", order = c("M", "F"))
visualize(null_dist) + shade_p_value(obs_stat = obs_diff, direction = "two_sided")

The observed difference is way off from what we would expect to see
if there was not a statistically significant relationship. Therefore, we
can conclude that there is a significant difference in the mean ages of
men and women who get involved in collisions. As such, there is a
significant relationship between the ages of men and women in relation
to being involved in collisions. Men who get into car accidents tend to
be slightly older than women who get into car accidents.