1. Introduction

When it comes to transportation, people’s safety is and always will be our top priority. With the rise of self-driving cars, we must re-examine our current safety measures and introduce new ones that better match the future of driving.

While many car safety features have been introduced in past decades, the airbag is often considered one of the most important pieces of safety equipment. However, it is vital to study the situations in which airbags are most effective and what their drawbacks are. Although this technology has saved numerous lives, there have also been accidents in which it was a contributing factor to injuries and even fatalities.

This study will examine how airbags contribute to injuries and fatalities among people of different age groups. We will conduct a statistical analysis and test the following hypotheses:

H0: Age is not a contributing factor to the level of traffic accident injuries when airbags are deployed.

HA: Age is a contributing factor to the level of traffic accident injuries when airbags are deployed.

2. Data

Airbag And Other Influences On Accident Fatalities

Between 1997 and 2002, police recorded 26,217 car crashes.

Each recorded case met two conditions:

  1. There was a harmful event involving people or property.

  2. At least one vehicle was towed.

The dataset is publicly available on the official National Highway Traffic Safety Administration website: https://www.nhtsa.gov

This study will focus on a subset of the total dataset. Only the cases with a deployed airbag will be considered. There were 14,419 accidents (cases) recorded in which airbags were deployed.

This is an observational study: the data were recorded from real-life accidents by the authorities that responded to the emergency, and no experiments in a controlled environment were performed. Causation between age, airbags and injury level therefore should not be assumed.

According to NHTSA, the data collection used a multi-stage probabilistic sampling scheme.

  • Data are restricted to front-seat occupants and may be restricted in other undisclosed ways.
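A multi-stage probabilistic sample usually means that records do not all represent the same number of real-world crashes. The sketch below uses made-up numbers to show how a weighted mean can differ from a plain mean. It assumes that a weight column such as the `weight` field in the nassCDS file holds sampling weights; the analysis in this report is unweighted.

```r
# Toy data: `weight` mimics a sampling weight
# (roughly, how many real crashes each record stands for). Values are made up.
toy <- data.frame(
  ageOFocc = c(20, 30, 70, 80),
  weight   = c(100, 100, 10, 10)
)

# Plain mean treats every record equally; weighted.mean() scales by the weight.
unweighted <- mean(toy$ageOFocc)
weighted   <- weighted.mean(toy$ageOFocc, w = toy$weight)
print(c(unweighted = unweighted, weighted = weighted))
```

If the weights are trusted, the same call could be applied to the real columns, which would be one way to judge how sensitive the group means are to the sampling design.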

3. Exploratory data analysis

Subsetting data of interest from the entire dataset

# load data
library(dplyr)
df <- read.csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv', 
               stringsAsFactors = FALSE)

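# keep only rows where the airbag column equals 'airbag'
# (the dataset also carries deploy and abcat columns describing deployment status)
airbags <- subset(df, airbag == 'airbag')
# select only the columns of interest
airbags <- airbags[, c('injSeverity', 'ageOFocc', 'dead')]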


# Divide the sample into two groups for the exploratory step
seniors <- subset(airbags, ageOFocc >= 65)
other <- subset(airbags, ageOFocc < 65)

Summary Statistics

#summary statistics
summary(seniors)
##   injSeverity      ageOFocc         dead          
##  Min.   :0.00   Min.   :65.00   Length:1536       
##  1st Qu.:1.00   1st Qu.:69.00   Class :character  
##  Median :2.00   Median :74.00   Mode  :character  
##  Mean   :1.93   Mean   :74.76                     
##  3rd Qu.:3.00   3rd Qu.:79.00                     
##  Max.   :6.00   Max.   :97.00                     
##  NA's   :12
summary(other)
##   injSeverity       ageOFocc        dead          
##  Min.   :0.000   Min.   :16.0   Length:12883      
##  1st Qu.:0.000   1st Qu.:22.0   Class :character  
##  Median :1.000   Median :31.0   Mode  :character  
##  Mean   :1.587   Mean   :33.6                     
##  3rd Qu.:3.000   3rd Qu.:43.0                     
##  Max.   :6.000   Max.   :64.0                     
##  NA's   :70

Three variables are used to answer the research question:

  • injSeverity - ‘Injury severity’, treated here as a scale from 0 (no injury) to 6 (the most severe: life-threatening or fatal)

  • ageOFocc - ‘Age of the occupant’, the only truly numerical variable (injSeverity is a coded scale)

  • dead - indicates whether the occupant died; dead/alive

At first glance at the summary statistics, the mean injury severity for seniors (1.93) is slightly higher than for the younger group (1.587).
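The group comparison can also be computed directly instead of being read off two summary() printouts. The following is a minimal sketch on a made-up data frame that reuses the column names from this section; `toy` and its values are illustrative only. Passing na.rm = TRUE matters because injSeverity contains missing values in both groups.

```r
# Toy stand-in for the `airbags` data frame; the values are made up for illustration.
toy <- data.frame(
  injSeverity = c(0, 1, 3, NA, 2, 4, 3, 1),
  ageOFocc    = c(22, 31, 43, 58, 66, 71, 80, 90)
)

# Label each occupant by age group, then average injury severity per group.
# na.rm = TRUE drops the missing severities instead of returning NA.
group <- ifelse(toy$ageOFocc >= 65, "senior", "other")
mean_by_group <- tapply(toy$injSeverity, group, mean, na.rm = TRUE)
print(mean_by_group)
```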

Boxplot

# Quartiles and outliers

# Plot 1
boxplot(seniors$ageOFocc ~ seniors$injSeverity)

# Plot 2
boxplot(seniors$ageOFocc ~ seniors$dead)

# Plot 3
boxplot(other$ageOFocc ~ other$injSeverity)

# Plot 4
boxplot(other$ageOFocc ~ other$dead)

  1. Plot 1 describes the relationship between age and injury level (IL). Among seniors, IL 2 and IL 5 have the highest median age.

  2. Plot 2 suggests that non-fatal accidents involve seniors who are closer to the age of 65, with a few outliers. This may be the first clue toward answering the research question.

  3. Plot 3 agrees with the previous charts: the median age increases for higher injury levels.

  4. Plot 4 does not show any outliers but follows the same trend of higher age in fatal accidents.

To check the findings from the plots above, let’s take the entire set of cases with deployed airbags for all age groups and compute the mean age at each injury level as well as for each survival outcome.

by(airbags$ageOFocc, airbags$injSeverity, mean)
## airbags$injSeverity: 0
## [1] 35.75663
## -------------------------------------------------------- 
## airbags$injSeverity: 1
## [1] 37.90249
## -------------------------------------------------------- 
## airbags$injSeverity: 2
## [1] 37.2841
## -------------------------------------------------------- 
## airbags$injSeverity: 3
## [1] 39.70391
## -------------------------------------------------------- 
## airbags$injSeverity: 4
## [1] 43.80922
## -------------------------------------------------------- 
## airbags$injSeverity: 5
## [1] 41.41892
## -------------------------------------------------------- 
## airbags$injSeverity: 6
## [1] 62.5
by(airbags$ageOFocc, airbags$dead, mean)
## airbags$dead: alive
## [1] 37.74166
## -------------------------------------------------------- 
## airbags$dead: dead
## [1] 44.68102

All of the exploratory findings are consistent with the theory that airbags tend to be more dangerous for older people: the mean age generally rises with injury level and is higher among fatalities.

Data transformation into a full set with a new variable - Seniority Y/N

# Data transformation to enable further exploration
# Create a new variable to determine seniority based on the age
airbags$seniority <- 'N'
airbags$seniority[airbags$ageOFocc >= 65] <- "Y"
boxplot(airbags$injSeverity ~ airbags$seniority)

4. Inference

Conditions

Let’s confirm the conditions necessary for inference are satisfied:

    1. Independent observations: TRUE
    2. Sample size large enough: TRUE
    3. Data not strongly skewed: TRUE (age is right-skewed by nature, but the sample is large enough to treat this condition as satisfied)
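The skewness condition can be checked numerically rather than by eye. The sketch below defines a hypothetical helper, `sample_skewness`, which is not part of the inference function, and applies it to made-up ages. A positive value indicates right skew. On the real data the same helper could be applied to airbags$ageOFocc.

```r
# Sample skewness: third central moment divided by the 1.5 power of the
# second central moment (both computed with a 1/n divisor).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up, right-skewed ages for illustration only.
ages <- c(18, 19, 21, 22, 25, 27, 30, 34, 41, 55, 78)
skew <- sample_skewness(ages)
print(skew)
```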
# importing the inference function
source("http://stat.duke.edu/courses/Fall12/sta101.001/labs/inference.R")
## Warning: package 'BHH2' was built under R version 3.4.3
## Warning: package 'lmPerm' was built under R version 3.4.3
# Theoretical hypothesis test and confidence intervals using the inference function
inference(airbags$dead, airbags$ageOFocc, est='mean', type='ht', alternative = 'less', 
          method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852
## Observed difference between means (alive-dead) = -6.9394
## 
## H0: mu_alive - mu_dead = 0 
## HA: mu_alive - mu_dead < 0 
## Standard error = 0.975 
## Test statistic: Z =  -7.116 
## p-value =  0

inference(airbags$dead, airbags$ageOFocc, est='mean', type='ci', alternative = 'less', 
          method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852

## Observed difference between means (alive-dead) = -6.9394
## 
## Standard error = 0.9752 
## 90 % Confidence interval = ( -8.5434 , -5.3353 )

Hypothesis test and Confidence intervals

The reported p-value of 0 (that is, smaller than the displayed precision) allows us to reject the null hypothesis: the mean age of occupants who died is significantly higher than that of survivors, so age is statistically associated with fatality.

In this sample, the average age of occupants who survived is about 38, versus about 45 for those who died. The confidence interval tells us that we can be 90% confident that survivors are, on average, between about 5.3 and 8.5 years younger than those who died.
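The standard error, test statistic and confidence interval can be reproduced by hand from the summary statistics printed in the output. This is a sketch of the usual two-sample calculation with an unpooled standard error and a normal reference distribution. It is not taken from the source of the inference function, but its results agree with the reported values up to rounding.

```r
# Summary statistics copied from the inference() output above.
n_alive <- 13908; mean_alive <- 37.7417; sd_alive <- 17.5907
n_dead  <- 511;   mean_dead  <- 44.681;  sd_dead  <- 21.7852

diff_means <- mean_alive - mean_dead                    # alive - dead
se <- sqrt(sd_alive^2 / n_alive + sd_dead^2 / n_dead)   # unpooled standard error
z  <- diff_means / se                                   # test statistic under H0: difference = 0
p_less <- pnorm(z)                                      # one-sided p-value for HA: difference < 0

z_star <- qnorm(0.95)                                   # critical value for a 90% interval
ci <- diff_means + c(-1, 1) * z_star * se

print(round(c(se = se, z = z), 4))
print(p_less)
print(round(ci, 4))
```

The p-value is a very small positive number and not exactly zero, which is why the output displays it as 0.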

Next, we apply the same analysis to the injury level. We recode the injury level into two groups: 0, 1, 2 as non-critical and 3, 4, 5, 6 as critical.

# New categorical variable
# (rows with a missing injSeverity keep the default label 'critical')
airbags$inj <- 'critical'
airbags$inj[airbags$injSeverity < 3] <- "non-critical"

# Theoretical hypothesis test and confidence intervals using the inference function
inference(airbags$inj, airbags$ageOFocc, est='mean', type='ht', alternative = 'less', 
          method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_critical = 4931, mean_critical = 40.0917, sd_critical = 18.7268
## n_non-critical = 9488, mean_non-critical = 36.8941, sd_non-critical = 17.201
## Observed difference between means (critical-non-critical) = 3.1976
## 
## H0: mu_critical - mu_non-critical = 0 
## HA: mu_critical - mu_non-critical < 0 
## Standard error = 0.32 
## Test statistic: Z =  9.997 
## p-value =  1

inference(airbags$inj, airbags$ageOFocc, est='mean', type='ci', alternative = 'less', 
          method = "theoretical", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_critical = 4931, mean_critical = 40.0917, sd_critical = 18.7268
## n_non-critical = 9488, mean_non-critical = 36.8941, sd_non-critical = 17.201

## Observed difference between means (critical-non-critical) = 3.1976
## 
## Standard error = 0.3199 
## 90 % Confidence interval = ( 2.6715 , 3.7237 )

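For injury level, the confidence interval (about 2.7 to 3.7 years) excludes zero and the test statistic is large (Z = 9.997), so critically injured occupants are significantly older on average. The reported p-value of 1 arises because the test was run with alternative = 'less' while the observed difference (critical minus non-critical) is positive; in the opposite direction the p-value is effectively 0. These findings match the results for fatal accidents, which means that age is a statistically significant factor in the level of injuries.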
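The direction of the one-sided alternative determines which tail of the normal distribution the p-value comes from. The sketch below takes the Z statistic reported for the injury-level test and computes both tails. It shows how a lower-tail test on a positive difference (critical minus non-critical) yields a p-value of 1, while the upper tail is essentially 0.

```r
z <- 9.997                                  # test statistic reported for critical - non-critical

p_less    <- pnorm(z)                       # HA: difference < 0 (lower tail)
p_greater <- pnorm(z, lower.tail = FALSE)   # HA: difference > 0 (upper tail)

print(c(less = p_less, greater = p_greater))
```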

Simulation

The last inference step repeats the tests using simulation-based methods (randomization and bootstrapping).

# Simulation-based hypothesis test and confidence interval using the inference function
inference(airbags$dead, airbags$ageOFocc, est='mean', type='ht', alternative = 'less', 
          method = "simulation", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852
## Observed difference between means (alive-dead) = -6.9394
## 
## H0: mu_alive - mu_dead = 0 
## HA: mu_alive - mu_dead < 0 
## Randomizing, please wait...

## p-value =  0
inference(airbags$dead, airbags$ageOFocc, est='mean', type='ci', alternative = 'less', 
          method = "simulation", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_alive = 13908, mean_alive = 37.7417, sd_alive = 17.5907
## n_dead = 511, mean_dead = 44.681, sd_dead = 21.7852
## Observed difference between means (alive-dead) = -6.9394
## 
## Bootstrapping, please wait...

## 90 % Bootstrap interval = ( -8.5599 , -5.3575 )
# Simulation-based hypothesis test using the inference function
inference(airbags$inj, airbags$ageOFocc, est='mean', type='ht', alternative = 'less', 
          method = "simulation", conflevel=0.90, success = 'alive' )
## Response variable: numerical, Explanatory variable: categorical
## Warning: Missing null value, set to 0.

## Warning: Ignoring success since data are numerical.
## Difference between two means
## Summary statistics:
## n_critical = 4931, mean_critical = 40.0917, sd_critical = 18.7268
## n_non-critical = 9488, mean_non-critical = 36.8941, sd_non-critical = 17.201
## Observed difference between means (critical-non-critical) = 3.1976
## 
## H0: mu_critical - mu_non-critical = 0 
## HA: mu_critical - mu_non-critical < 0 
## Randomizing, please wait...

## p-value =  1

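The simulation-based tests give results very close to the theoretical ones: the same p-values and a nearly identical 90% interval for the age difference between survivors and fatalities, (-8.56, -5.36) versus (-8.54, -5.34).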
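For readers who want to see what a randomization test and a bootstrap interval involve, here is a minimal base-R sketch. It runs on synthetic ages rather than the crash data, the helper `diff_means` is hypothetical, and it is not the code used inside the inference function. It only illustrates the general idea behind the "Randomizing" and "Bootstrapping" steps.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Synthetic ages for two groups; these are NOT the crash data.
alive <- rnorm(300, mean = 38, sd = 17)
dead  <- rnorm(60,  mean = 45, sd = 21)
age   <- c(alive, dead)
group <- rep(c("alive", "dead"), times = c(length(alive), length(dead)))

diff_means <- function(x, g) mean(x[g == "alive"]) - mean(x[g == "dead"])
observed <- diff_means(age, group)

# Randomization test: shuffle the group labels to mimic H0 (no age difference).
null_diffs <- replicate(2000, diff_means(age, sample(group)))
p_less <- mean(null_diffs <= observed)   # HA: mu_alive - mu_dead < 0

# Percentile bootstrap: resample within each group and recompute the difference.
boot_diffs <- replicate(2000, mean(sample(alive, replace = TRUE)) -
                              mean(sample(dead,  replace = TRUE)))
ci_90 <- quantile(boot_diffs, c(0.05, 0.95))

print(p_less)
print(ci_90)
```

With large samples, both approaches tend to agree closely with the normal-theory results, which is the pattern seen in the output above.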

5. Conclusion

Because aging makes the human body more fragile, we might have assumed that older occupants face an increased risk of severe injuries and more frequent fatal accidents. This study supports this general assumption with an additional variable: deployed airbags. Because the data are observational, additional research in a controlled environment should be conducted to confirm the effectiveness and possible risks of airbags for senior drivers.