When it comes to transportation, people’s safety is and always will be our top priority. With the rise of self-driving cars, we must re-examine our current safety measures and introduce new ones that will better match the future of driving.
Many car safety features have been introduced in past decades, and the airbag is often considered one of the most important pieces of safety equipment. However, it is vital to study the situations in which airbags are most effective and what their drawbacks are. Although this technology has saved numerous lives, there have also been accidents in which it was a contributing factor to injuries and even fatalities.
This study will examine how airbags contribute to injuries and fatalities of people of different age groups. We will conduct a statistical analysis and answer the following hypothesis question:
H0: Age is not a contributing factor to the level of traffic accident injuries when airbags are deployed.
HA: Age is a contributing factor to the level of traffic accident injuries when airbags are deployed.
Between 1997 and 2002, police recorded 26,217 car crashes.
Each case had to meet two conditions:
1. There was a harmful event involving people or property.
2. At least one vehicle was towed.
The dataset is publicly available on the official National Highway Traffic Safety Administration website: https://www.nhtsa.gov
This study will focus on a subset of the total dataset. Only the cases with a deployed airbag will be considered. There were 14,419 accidents (cases) recorded in which airbags were deployed.
This is an observational study, as the data were recorded from real-life accidents by the authorities that responded to the emergency. No experiments in a controlled environment were performed for this research, which means that causation between age, airbags, and injury level should not be assumed.
According to NHTSA, the data collection used a multi-stage probabilistic sampling scheme.
# load data
library(dplyr)
df <- read.csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv',
stringsAsFactors = FALSE)
# keep only the cases where the airbag column equals 'airbag'
airbags <- subset(df, airbag == 'airbag')
# select only the columns of interest
airbags <- airbags[, c('injSeverity', 'ageOFocc', 'dead')]
# Divide the sample into two groups for the exploratory step
seniors <- subset(airbags, ageOFocc >= 65)
other <- subset(airbags, ageOFocc < 65)
# Summary statistics
summary(seniors)
## injSeverity ageOFocc dead
## Min. :0.00 Min. :65.00 Length:1536
## 1st Qu.:1.00 1st Qu.:69.00 Class :character
## Median :2.00 Median :74.00 Mode :character
## Mean :1.93 Mean :74.76
## 3rd Qu.:3.00 3rd Qu.:79.00
## Max. :6.00 Max. :97.00
## NA's :12
summary(other)
## injSeverity ageOFocc dead
## Min. :0.000 Min. :16.0 Length:12883
## 1st Qu.:0.000 1st Qu.:22.0 Class :character
## Median :1.000 Median :31.0 Mode :character
## Mean :1.587 Mean :33.6
## 3rd Qu.:3.000 3rd Qu.:43.0
## Max. :6.000 Max. :64.0
## NA's :70
There are three variables that we will be using to answer the research question:
injSeverity - ‘Injury severity’ on a scale of 0-6, from 0 (no injury) to 6 (the most severe, life-threatening or fatal)
ageOFocc - ‘Age of the occupant’, the only numerical variable
dead - indicates whether the accident had any fatalities; dead/alive
At first glance at the summary statistics, we can observe that the mean injury severity for seniors (1.93) is slightly higher than for the younger group (1.587).
# Quartiles and outliers
# Plot 1
boxplot(seniors$ageOFocc ~ seniors$injSeverity)
# Plot 2
boxplot(seniors$ageOFocc ~ seniors$dead)
# Plot 3
boxplot(other$ageOFocc ~ other$injSeverity)
# Plot 4
boxplot(other$ageOFocc ~ other$dead)
Plot 1 describes the relationship between age and injury level (IL) for seniors. IL 2 and 5 have the highest median age.
Plot 2 suggests that non-fatal accidents involve seniors who are closer to the age of 65, with a few outliers. This may be the first clue to answer the research question.
Plot 3 is consistent with the previous charts: the median age increases for higher injury levels.
Plot 4 does not show any outliers but follows the same trend of a higher age in fatal accidents.
To check the findings from the plots above, let’s take the entire set of cases with deployed airbags for all age groups and determine the mean age for each injury level as well as for each survival outcome.
by(airbags$ageOFocc, airbags$injSeverity, mean)
## airbags$injSeverity: 0
## [1] 35.75663
## --------------------------------------------------------
## airbags$injSeverity: 1
## [1] 37.90249
## --------------------------------------------------------
## airbags$injSeverity: 2
## [1] 37.2841
## --------------------------------------------------------
## airbags$injSeverity: 3
## [1] 39.70391
## --------------------------------------------------------
## airbags$injSeverity: 4
## [1] 43.80922
## --------------------------------------------------------
## airbags$injSeverity: 5
## [1] 41.41892
## --------------------------------------------------------
## airbags$injSeverity: 6
## [1] 62.5
by(airbags$ageOFocc, airbags$dead, mean)
## airbags$dead: alive
## [1] 37.74166
## --------------------------------------------------------
## airbags$dead: dead
## [1] 44.68102
All of the exploratory findings are consistent with the hypothesis that airbags tend to be more dangerous for older people.
# Data transformation to enable further exploration
# Create a new variable to determine seniority based on the age
airbags$seniority <- 'N'
airbags$seniority[airbags$ageOFocc >= 65] <- "Y"
boxplot(airbags$injSeverity ~ airbags$seniority)
Let’s confirm the conditions necessary for inference are satisfied:
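A minimal sketch of how the sample-size condition might be checked, using only the group sizes reported in the inference output of this analysis. The threshold of 30 observations per group is a common rule of thumb and an assumption here, not something stated by NHTSA or by the `inference` function; the variable names are illustrative.

```r
# Group sizes taken from the summary statistics reported in this analysis
n_alive <- 13908
n_dead  <- 511

# Rule-of-thumb sample-size check (assumed threshold of 30 per group)
min_n <- 30
size_ok <- c(alive = n_alive >= min_n, dead = n_dead >= min_n)
print(size_ok)

# The two groups should add up to the number of airbag cases in the subset
total_cases <- n_alive + n_dead
print(total_cases)

# Independence cannot be verified numerically here: it rests on the
# sampling design and on the cases being unrelated observations.
```

Both groups are far larger than the assumed threshold, so skew in the age distribution is unlikely to be a concern for a difference-of-means test. Independence remains a judgment about how the data were collected.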
# importing the inference function
source("http://stat.duke.edu/courses/Fall12/sta101.001/labs/inference.R")
## Warning: package 'BHH2' was built under R version 3.4.3
## Warning: package 'lmPerm' was built under R version 3.4.3
# Theoretical hypothesis test and confidence intervals using the inference function
inference(airbags$dead, airbags$ageOFocc, est='mean', type='ht', alternative = 'less',
method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852
## Observed difference between means (alive-dead) = -6.9394
##
## H0: mu_alive - mu_dead = 0
## HA: mu_alive - mu_dead < 0
## Standard error = 0.975
## Test statistic: Z = -7.116
## p-value = 0
inference(airbags$dead, airbags$ageOFocc, est='mean', type='ci', alternative = 'less',
method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852
## Observed difference between means (alive-dead) = -6.9394
##
## Standard error = 0.9752
## 90 % Confidence interval = ( -8.5434 , -5.3353 )
The p-value, reported as 0 (that is, smaller than the displayed precision), allows us to reject the null hypothesis and conclude that the difference in mean age between occupants who survived and those who died is statistically significant.
In this sample, the average age of those who survived is about 38, compared with about 45 for those who died. The 90% confidence interval indicates that survivors are, on average, between roughly 5.3 and 8.5 years younger.
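The theoretical figures can be reproduced by hand from the reported summary statistics. The sketch below assumes the standard two-sample Z formulas (unpooled standard error, normal critical value); whether the `inference` function uses exactly these steps internally is an assumption, although the results should agree with its output up to rounding.

```r
# Summary statistics as reported by the inference output
n_alive <- 13908; mean_alive <- 37.7417; sd_alive <- 17.5907
n_dead  <- 511;   mean_dead  <- 44.681;  sd_dead  <- 21.7852

# Standard error of the difference between two independent means
se <- sqrt(sd_alive^2 / n_alive + sd_dead^2 / n_dead)

# Observed difference (alive - dead) and Z statistic under H0: difference = 0
diff_means <- mean_alive - mean_dead
z <- diff_means / se

# One-sided p-value for HA: mu_alive - mu_dead < 0 (lower tail)
p_value <- pnorm(z)

# 90% confidence interval using the normal critical value
z_star <- qnorm(0.95)
ci <- diff_means + c(-1, 1) * z_star * se

print(round(c(se = se, z = z), 3))
print(p_value)
print(round(ci, 2))
```

The p-value is not literally zero; it is a very small positive number that rounds to 0 at the precision the function prints.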
Next, we will use the same analysis on the injury level. We will modify the data to categorize the injury level into two groups: 0, 1, 2 as non-critical and 3, 4, 5, 6 as critical.
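One way to sketch this grouping, shown on a toy vector rather than the crash data. It uses `ifelse()`, which is an alternative to the default-then-overwrite assignment used in this analysis. The practical difference is that a missing severity stays `NA` instead of keeping the default label. The values in `severity` are hypothetical.

```r
# Toy severity codes (hypothetical values, including a missing one)
severity <- c(0, 1, 2, 3, 4, 5, 6, NA)

# ifelse() propagates NA, so a missing severity stays missing
# instead of silently falling into one of the two groups
inj_group <- ifelse(severity < 3, "non-critical", "critical")

# Count each group, showing the missing value separately
print(table(inj_group, useNA = "ifany"))
```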
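# New categorical variable: 'critical' by default, 'non-critical' for levels 0-2
# (rows with a missing injSeverity keep the default 'critical' label)
airbags$inj <- 'critical'
airbags$inj[airbags$injSeverity < 3] <- "non-critical"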
# Theoretical hypothesis test and confidence intervals using the inference function
inference(airbags$inj, airbags$ageOFocc, est='mean', type='ht', alternative = 'less',
method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_critical = 4931, mean_critical = 40.0917, sd_critical = 18.7268
## n_non-critical = 9488, mean_non-critical = 36.8941, sd_non-critical = 17.201
## Observed difference between means (critical-non-critical) = 3.1976
##
## H0: mu_critical - mu_non-critical = 0
## HA: mu_critical - mu_non-critical < 0
## Standard error = 0.32
## Test statistic: Z = 9.997
## p-value = 1
inference(airbags$inj, airbags$ageOFocc, est='mean', type='ci', alternative = 'less',
method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_critical = 4931, mean_critical = 40.0917, sd_critical = 18.7268
## n_non-critical = 9488, mean_non-critical = 36.8941, sd_non-critical = 17.201
## Observed difference between means (critical-non-critical) = 3.1976
##
## Standard error = 0.3199
## 90 % Confidence interval = ( 2.6715 , 3.7237 )
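For injury levels, the confidence interval (2.67 to 3.72 years) excludes zero and the test statistic is large (Z = 9.997), which matches the results for the fatal accidents. The reported p-value of 1 reflects the direction of the one-sided alternative: the difference is computed as critical minus non-critical, so HA: mu_critical - mu_non-critical < 0 tests the opposite tail from the observed effect. Taken together, the results indicate that age is a statistically significant factor in the level of injuries.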
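The effect of the tail direction can be checked from the reported summary statistics. This is a sketch assuming the standard two-sample Z formulas; the names `p_less` and `p_greater` are illustrative and simply label the two one-sided p-values.

```r
# Summary statistics as reported by the inference output
n_crit <- 4931; mean_crit <- 40.0917; sd_crit <- 18.7268
n_non  <- 9488; mean_non  <- 36.8941; sd_non  <- 17.201

# Standard error and Z statistic for the difference (critical - non-critical)
se <- sqrt(sd_crit^2 / n_crit + sd_non^2 / n_non)
z  <- (mean_crit - mean_non) / se

# Lower-tail p-value: corresponds to alternative = 'less'
p_less <- pnorm(z)
# Upper-tail p-value: corresponds to a 'greater' alternative
p_greater <- pnorm(z, lower.tail = FALSE)

print(c(z = z, p_less = p_less, p_greater = p_greater))
```

Because the observed difference is positive, the lower-tail p-value is essentially 1 while the upper-tail p-value is essentially 0.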
The last inference tests use simulation instead of the theoretical method.
# Simulation-based hypothesis test and confidence interval using the inference function
inference(airbags$dead, airbags$ageOFocc, est='mean', type='ht', alternative = 'less',
method = "simulation", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852
## Observed difference between means (alive-dead) = -6.9394
##
## H0: mu_alive - mu_dead = 0
## HA: mu_alive - mu_dead < 0
## Randomizing, please wait...
## p-value = 0
inference(airbags$dead, airbags$ageOFocc, est='mean', type='ci', alternative = 'less',
method = "simulation", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852
## Observed difference between means (alive-dead) = -6.9394
##
## Bootstrapping, please wait...
## 90 % Bootstrap interval = ( -8.5599 , -5.3575 )
# Simulation-based hypothesis test for the injury level using the inference function
inference(airbags$inj, airbags$ageOFocc, est='mean', type='ht', alternative = 'less',
method = "simulation", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_critical = 4931, mean_critical = 40.0917, sd_critical = 18.7268
## n_non-critical = 9488, mean_non-critical = 36.8941, sd_non-critical = 17.201
## Observed difference between means (critical-non-critical) = 3.1976
##
## H0: mu_critical - mu_non-critical = 0
## HA: mu_critical - mu_non-critical < 0
## Randomizing, please wait...
## p-value = 1
The simulation-based tests give results very similar to those of the theoretical tests.
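To illustrate what the "Randomizing" and "Bootstrapping" steps refer to, here is a minimal base-R sketch of a permutation test and a percentile bootstrap interval for a difference in means. It runs on synthetic ages, not the crash data, and it is not necessarily the algorithm the `inference` function implements; the group sizes, distributions, seed, and number of replicates are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic ages for two groups (assumed normal, NOT the crash data)
alive <- rnorm(500, mean = 38, sd = 18)
dead  <- rnorm(100, mean = 45, sd = 22)
obs_diff <- mean(alive) - mean(dead)

# Permutation test: shuffle group membership to simulate H0 (no difference)
pooled  <- c(alive, dead)
n_alive <- length(alive)
perm_diffs <- replicate(2000, {
  idx <- sample(length(pooled), n_alive)
  mean(pooled[idx]) - mean(pooled[-idx])
})
# Lower-tail p-value for HA: mean(alive) - mean(dead) < 0
p_value <- mean(perm_diffs <= obs_diff)

# Percentile bootstrap: resample each group with replacement
boot_diffs <- replicate(2000, {
  mean(sample(alive, replace = TRUE)) - mean(sample(dead, replace = TRUE))
})
boot_ci <- unname(quantile(boot_diffs, probs = c(0.05, 0.95)))

print(p_value)
print(boot_ci)
```

With large samples, both approaches tend to agree closely with the theoretical Z-based results, which is the pattern seen in the outputs above.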
Because aging makes the human body more fragile, we might have expected older occupants to be more prone to severe injuries and fatal accidents. This study supports this general assumption with an additional variable: deployed airbags. Additional research in a controlled environment should be conducted to confirm the effectiveness and possible risks of airbags for senior drivers.