Factors responsible for fatalities

The data set used for project is available on link “https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv”.

Data Exploration

Downloading the file:

if(!file.exists("./week3")){dir.create("./week3")}
fileUrl <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv"
download.file(fileUrl,destfile = "./week3/nassCDS.csv",method = "curl")

Reading the file:

data <- read.csv("nassCDS.CSV")
head(data)
##   X  dvcat  weight  dead airbag seatbelt frontal sex ageOFocc yearacc
## 1 1  25-39  25.069 alive   none   belted       1   f       26    1997
## 2 2 24-Oct  25.069 alive airbag   belted       1   f       72    1997
## 3 3 24-Oct  32.379 alive   none     none       1   f       69    1997
## 4 4  25-39 495.444 alive airbag   belted       1   f       53    1997
## 5 5  25-39  25.069 alive   none   belted       1   f       32    1997
## 6 6  40-54  25.069 alive   none   belted       1   f       22    1997
##   yearVeh   abcat occRole deploy injSeverity  caseid
## 1    1990 unavail  driver      0           3 2:03:01
## 2    1995  deploy  driver      1           1 2:03:02
## 3    1988 unavail  driver      0           4 2:05:01
## 4    1995  deploy  driver      1           1 2:10:01
## 5    1988 unavail  driver      0           3 2:11:01
## 6    1985 unavail  driver      0           3 2:11:02
str(data)
## 'data.frame':    26217 obs. of  16 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dvcat      : Factor w/ 6 levels " ","1-9km/h",..: 4 3 3 4 4 5 6 6 3 3 ...
##  $ weight     : num  25.1 25.1 32.4 495.4 25.1 ...
##  $ dead       : Factor w/ 2 levels "alive","dead": 1 1 1 1 1 1 1 2 1 1 ...
##  $ airbag     : Factor w/ 2 levels "airbag","none": 2 1 2 1 2 2 2 2 2 2 ...
##  $ seatbelt   : Factor w/ 2 levels "belted","none": 1 1 2 1 1 1 1 2 1 1 ...
##  $ frontal    : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ sex        : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 2 2 2 1 ...
##  $ ageOFocc   : int  26 72 69 53 32 22 22 32 40 18 ...
##  $ yearacc    : int  1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
##  $ yearVeh    : int  1990 1995 1988 1995 1988 1985 1984 1987 1984 1987 ...
##  $ abcat      : Factor w/ 3 levels "deploy","nodeploy",..: 3 1 3 1 3 3 3 3 3 3 ...
##  $ occRole    : Factor w/ 2 levels "driver","pass": 1 1 1 1 1 1 1 1 1 1 ...
##  $ deploy     : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ injSeverity: int  3 1 4 1 3 3 3 4 1 0 ...
##  $ caseid     : Factor w/ 8524 levels "0.125011574",..: 2061 2062 2065 2074 2076 2077 2080 2081 2082 2083 ...
summary(data)
##        X             dvcat           weight            dead      
##  Min.   :    1          :    1   Min.   :    0.00   alive:25037  
##  1st Qu.: 6555   1-9km/h:  686   1st Qu.:   32.47   dead : 1180  
##  Median :13109   24-Oct :12847   Median :   86.99                
##  Mean   :13109   25-39  : 8214   Mean   :  462.81                
##  3rd Qu.:19663   40-54  : 2977   3rd Qu.:  364.72                
##  Max.   :26217   55+    : 1492   Max.   :57871.60                
##                                                                  
##     airbag        seatbelt        frontal       sex          ageOFocc    
##  airbag:14419   belted:18573   Min.   :0.0000   f:12248   Min.   :16.00  
##  none  :11798   none  : 7644   1st Qu.:0.0000   m:13969   1st Qu.:22.00  
##                                Median :1.0000             Median :33.00  
##                                Mean   :0.6433             Mean   :37.21  
##                                3rd Qu.:1.0000             3rd Qu.:48.00  
##                                Max.   :1.0000             Max.   :97.00  
##                                                                          
##     yearacc        yearVeh          abcat         occRole     
##  Min.   :1997   Min.   :1953   deploy  : 8836   driver:20601  
##  1st Qu.:1998   1st Qu.:1989   nodeploy: 5583   pass  : 5616  
##  Median :2000   Median :1994   unavail :11798                 
##  Mean   :2000   Mean   :1993                                  
##  3rd Qu.:2001   3rd Qu.:1997                                  
##  Max.   :2002   Max.   :2003                                  
##                 NA's   :1                                     
##      deploy       injSeverity            caseid     
##  Min.   :0.000   Min.   :0.000   0.54931713 :   18  
##  1st Qu.:0.000   1st Qu.:1.000   0.590289352:   17  
##  Median :0.000   Median :2.000   0.606956019:   16  
##  Mean   :0.337   Mean   :1.716   0.556261574:   15  
##  3rd Qu.:1.000   3rd Qu.:3.000   0.584039352:   15  
##  Max.   :1.000   Max.   :6.000   0.587511574:   15  
##                  NA's   :153     (Other)    :26121

As we have clear vision on our dataset,let us look at what we are trying to read through this data:

  1. Fatality rate with three scenarios:
  1. with airbag
  2. with seatbelt
  3. with seatbelt+airbag
  1. What is the age-group responsible for most severe injuries?

  2. Cases responsible for max deaths?

  3. Year resulted in maximum number of deaths?

  4. Age-weight ratio of cases alive/dead?

data <- na.omit(data) #row with NA values are omitted
table(data$sex)       # figures give stats of female and male entries
## 
##     f     m 
## 12178 13885
table(data$dead)      # figures for 'dead' and 'alive'
## 
## alive  dead 
## 24883  1180

For better undersanding:

prop.table(table(data$sex,data$dead),1)
##    
##          alive       dead
##   f 0.96189851 0.03810149
##   m 0.94843356 0.05156644
prop.table(table(data$seatbelt,data$dead),1)
##         
##               alive       dead
##   belted 0.97292174 0.02707826
##   none   0.91050276 0.08949724
prop.table(table(data$airbag,data$dead),1)
##         
##               alive       dead
##   airbag 0.96435547 0.03564453
##   none   0.94295216 0.05704784

General statistics showthat maximum number of cases survived the injuries.‘Male’-cases survived more than ‘female’-cases.(we will prove this through graph as well)

Data wrangling:

As all of the columns from the original data set are not required ,we will create a subset of data as per requirement of our first prediction for fatality rate:

subdata <- subset(data, select = c("dead","airbag","seatbelt"))
subdata <- na.omit(subdata)
head(subdata)
##    dead airbag seatbelt
## 1 alive   none   belted
## 2 alive airbag   belted
## 3 alive   none     none
## 4 alive airbag   belted
## 5 alive   none   belted
## 6 alive   none   belted

CASE I (Using airbags only)

subdata1 <- subdata[!((subdata$airbag) == "none" | (subdata$seatbelt) == "belted"),]
summary(subdata1)
##     dead         airbag       seatbelt   
##  alive:2977   airbag:3253   belted:   0  
##  dead : 276   none  :   0   none  :3253
#fatality_rate <- (((no_of_deaths) *100) / (no_of_total))
fatality_rate_airbag <- (((276)* 100) / (26217))
fatality_rate_airbag
## [1] 1.052752
library(ggplot2)
ggplot(subdata1,aes(x= dead,fill = factor(airbag))) +
geom_bar(width = 0.5) + xlab("dead") +
ylab("total")+ labs(fill = "airbag")

The fatality rate when cases used airbags only is 1.052752

CASE II (Using seatbelts only)

subdata2<- subdata[!((subdata$airbag) == "airbag" | (subdata$seatbelt) == "none"),]
head(subdata2)
##     dead airbag seatbelt
## 1  alive   none   belted
## 5  alive   none   belted
## 6  alive   none   belted
## 7  alive   none   belted
## 9  alive   none   belted
## 10 alive   none   belted
summary(subdata2)
##     dead         airbag       seatbelt   
##  alive:7117   airbag:   0   belted:7382  
##  dead : 265   none  :7382   none  :   0
#fatality_rate <- (((no_of_deaths) *100) / (no_of_total))
fatality_rate_seatbelt <- (((265)* 100) / (26217))
fatality_rate_seatbelt
## [1] 1.010795
library(ggplot2)
ggplot(subdata2,aes(x= dead,fill = factor(seatbelt))) +
geom_bar(width = 0.5) + xlab("dead") +
ylab("total")+ labs(fill = "seatbelt")

The fatality rate when cases used seatbelts only is 1.010795

CASE III( Using seatbelt and airbags)

subdata3 <- subdata[!((subdata$airbag) == "none" | (subdata$seatbelt) == "none"),]
head(subdata3)
##     dead airbag seatbelt
## 2  alive airbag   belted
## 4  alive airbag   belted
## 13 alive airbag   belted
## 14  dead airbag   belted
## 19 alive airbag   belted
## 20 alive airbag   belted
summary(subdata3)
##     dead          airbag        seatbelt    
##  alive:10848   airbag:11083   belted:11083  
##  dead :  235   none  :    0   none  :    0

235 cases died with ‘airbag’ and ‘seatbelt’.

#fatality_rate <- (((no_of_deaths) *100) / (no_of_total_entries))
fatality_rate_seatbelt_airbag <- (((235)* 100) / (26217))
fatality_rate_seatbelt_airbag
## [1] 0.896365

Lowest fatality rate of all three cases is 0.89.Therefore,The cases which used both ‘seatbelts’ and ‘airbags’ survived the fatalities as compared to the cases who used any one of these two.The highest fatality rate is for “airbags”alone.It can be concluded that airbags are not capable enough to stop injuries/death.

2) What age-group is responsible for most severe injuries?

some data wrangling is need to find our second conclusion: let us plot the basic his to get general outlook of age of cases who met an accident:

hist(data$ageOFocc,col = "green",breaks = 100)
rug(data$ageOFocc)

General histogram says that maximun number of cases who met accidents are in age group of 10-20. so we need to create a new column nammed ‘agerange’ in order to have more visibility.(PART OF DATA WRANGLING)

data$agerange <- '100+'
data$agerange[data$ageOFocc < 100 & data$ageOFocc >= 90] <- '90-100'
data$agerange[data$ageOFocc < 90 & data$ageOFocc >= 80] <- '80-90'
data$agerange[data$ageOFocc < 80 & data$ageOFocc >= 70] <- '70-80'
data$agerange[data$ageOFocc < 70 & data$ageOFocc >= 60] <- '60-70'
data$agerange[data$ageOFocc < 60 & data$ageOFocc >= 50] <- '50-60'
data$agerange[data$ageOFocc < 50 & data$ageOFocc >= 40] <- '40-50'
data$agerange[data$ageOFocc < 40 & data$ageOFocc >= 30] <- '30-40'
data$agerange[data$ageOFocc < 30 & data$ageOFocc >= 20] <- '20-30'
data$agerange[data$ageOFocc < 20 & data$ageOFocc >= 10] <- '10-20'
subdata4 <- subset(data,select = c("dead","ageOFocc","injSeverity","agerange"))
head(subdata4)
##    dead ageOFocc injSeverity agerange
## 1 alive       26           3    20-30
## 2 alive       72           1    70-80
## 3 alive       69           4    60-70
## 4 alive       53           1    50-60
## 5 alive       32           3    30-40
## 6 alive       22           3    20-30

We sorted out our data set and added a new column with name “agerange”.but we are still not clear on how many levels on severities are present across the data frame.let’s check:

table(subdata4$injSeverity)
## 
##    0    1    2    3    4    5    6 
## 6478 5595 4242 8495 1118  133    2

six levels.

So we need to plot a graph which will help us visualize the picture.I used ggplot with aesthetics:

library(ggplot2)
ggplot(subdata4,aes(x= agerange,fill = factor(injSeverity))) +
geom_bar(width = 0.5) + xlab("agerange") +
ylab("total")+ labs(fill = "injSeverity")

Now that’s quiet a colorful graph! we get our next answer here.The age group more vulnerable to accidental injuries is “20-30”,with maximum number of level-3 injuries.

We need to find out which injury level resulted in death for maximum cases?

subdata5 <- subdata4[!((subdata$dead) == "alive"),]
head(subdata5)
##     dead ageOFocc injSeverity agerange
## 8   dead       32           4    30-40
## 14  dead       54           4    50-60
## 83  dead       67           4    60-70
## 84  dead       64           4    60-70
## 265 dead       23           4    20-30
## 311 dead       77           4    70-80
with(subdata5,plot(ageOFocc,injSeverity,col = "blue"))

The scatterplot states that maximum number of deaths occured due to ‘level- 4’ injuries over almost every age group.Taking into account,there is a single sphere in age range (60-80),making a very good point,that a level-1 injury leaded to death for this particular case in age of ‘70-80’.We rest our second prediction here.

3) Cases responsible for maximum number of deaths?

subdata6 <- subset(data ,select = c("weight","dead","sex","ageOFocc" ))
subdata6$sex <- gsub("m","male",subdata6$sex)
subdata6$sex <- gsub("f","female",subdata6$sex)
subdata7 <- subdata6[!((subdata6$dead) == "alive"),]
head(subdata7)
##     weight dead    sex ageOFocc
## 8   27.078 dead   male       32
## 14  89.627 dead female       54
## 83  27.078 dead   male       67
## 84  27.078 dead female       64
## 265 13.374 dead   male       23
## 311 12.383 dead female       77
boxplot (ageOFocc ~ sex,data = subdata7,col = "red")

as the graph shows ,the age for death injuries tends is higher for female.The spread and median both are higher in case of females.

4) Which year resulted into maximum number of deaths?

subdata8 <- subset(data,select = c("yearacc","dead"))
subdata9 <- subdata8[!((subdata8$dead) == "alive"),] 
ggplot(subdata9,aes(x= yearacc,fill = factor(dead))) +
geom_bar(width = 0.5) + xlab("years of accidents") +
ylab("total")+ labs(fill = "dead")

Maximum deaths occured in year 1997,least in 2001.

5) age-weight ratio for cases ‘alive’ and ‘dead’.

I tried to use different version of ggplot here.

subdata10 <- subset(data,select = c("weight","ageOFocc","dead"))
subdata11 <- subdata10[!((subdata10$dead) == "dead"),]
subdata12 <- subdata10[!((subdata10$dead) == "alive"),]
par (mfrow= c(1,2))
qplot(ageOFocc,weight,data = subdata11)

qplot(ageOFocc,weight,data = subdata12)

Though the pattern is not much clear here,comparing the two graphs for ‘dead’ and ‘alive’ cases,the conclusion can be drawn.comparing two scenarios.

CONCLUSION

Answering all five questions,the higlights of dataset are: 1) Maximum number of deaths occured when cases used airbags alone. 2) Maximum cases alive used both “airbags” and “seatbelt” to save their lives.(lowest fatality ratio - 0.89) 3) Highest number of deaths are reported in year 1997,lowest in 2001. 4) Highest number of cases died due to level-4 injuries. 5) Agegroup ‘20-30’ proved to be more vulnerable to accidents and level- 3 injuries. 6) ‘Female’ cases suffered more loss of life as compared to “male” cases. 7) Two cases with level -6 injuries survived the accident. 8) Weight-Age ratio is higher for cases who survived.