The data set used for project is available on link “https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv”.
Downloading the file:
if(!file.exists("./week3")){dir.create("./week3")}
fileUrl <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv"
download.file(fileUrl,destfile = "./week3/nassCDS.csv",method = "curl")
Reading the file:
data <- read.csv("nassCDS.CSV")
head(data)
## X dvcat weight dead airbag seatbelt frontal sex ageOFocc yearacc
## 1 1 25-39 25.069 alive none belted 1 f 26 1997
## 2 2 24-Oct 25.069 alive airbag belted 1 f 72 1997
## 3 3 24-Oct 32.379 alive none none 1 f 69 1997
## 4 4 25-39 495.444 alive airbag belted 1 f 53 1997
## 5 5 25-39 25.069 alive none belted 1 f 32 1997
## 6 6 40-54 25.069 alive none belted 1 f 22 1997
## yearVeh abcat occRole deploy injSeverity caseid
## 1 1990 unavail driver 0 3 2:03:01
## 2 1995 deploy driver 1 1 2:03:02
## 3 1988 unavail driver 0 4 2:05:01
## 4 1995 deploy driver 1 1 2:10:01
## 5 1988 unavail driver 0 3 2:11:01
## 6 1985 unavail driver 0 3 2:11:02
str(data)
## 'data.frame': 26217 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dvcat : Factor w/ 6 levels " ","1-9km/h",..: 4 3 3 4 4 5 6 6 3 3 ...
## $ weight : num 25.1 25.1 32.4 495.4 25.1 ...
## $ dead : Factor w/ 2 levels "alive","dead": 1 1 1 1 1 1 1 2 1 1 ...
## $ airbag : Factor w/ 2 levels "airbag","none": 2 1 2 1 2 2 2 2 2 2 ...
## $ seatbelt : Factor w/ 2 levels "belted","none": 1 1 2 1 1 1 1 2 1 1 ...
## $ frontal : int 1 1 1 1 1 1 1 1 0 1 ...
## $ sex : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 2 2 2 1 ...
## $ ageOFocc : int 26 72 69 53 32 22 22 32 40 18 ...
## $ yearacc : int 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
## $ yearVeh : int 1990 1995 1988 1995 1988 1985 1984 1987 1984 1987 ...
## $ abcat : Factor w/ 3 levels "deploy","nodeploy",..: 3 1 3 1 3 3 3 3 3 3 ...
## $ occRole : Factor w/ 2 levels "driver","pass": 1 1 1 1 1 1 1 1 1 1 ...
## $ deploy : int 0 1 0 1 0 0 0 0 0 0 ...
## $ injSeverity: int 3 1 4 1 3 3 3 4 1 0 ...
## $ caseid : Factor w/ 8524 levels "0.125011574",..: 2061 2062 2065 2074 2076 2077 2080 2081 2082 2083 ...
summary(data)
## X dvcat weight dead
## Min. : 1 : 1 Min. : 0.00 alive:25037
## 1st Qu.: 6555 1-9km/h: 686 1st Qu.: 32.47 dead : 1180
## Median :13109 24-Oct :12847 Median : 86.99
## Mean :13109 25-39 : 8214 Mean : 462.81
## 3rd Qu.:19663 40-54 : 2977 3rd Qu.: 364.72
## Max. :26217 55+ : 1492 Max. :57871.60
##
## airbag seatbelt frontal sex ageOFocc
## airbag:14419 belted:18573 Min. :0.0000 f:12248 Min. :16.00
## none :11798 none : 7644 1st Qu.:0.0000 m:13969 1st Qu.:22.00
## Median :1.0000 Median :33.00
## Mean :0.6433 Mean :37.21
## 3rd Qu.:1.0000 3rd Qu.:48.00
## Max. :1.0000 Max. :97.00
##
## yearacc yearVeh abcat occRole
## Min. :1997 Min. :1953 deploy : 8836 driver:20601
## 1st Qu.:1998 1st Qu.:1989 nodeploy: 5583 pass : 5616
## Median :2000 Median :1994 unavail :11798
## Mean :2000 Mean :1993
## 3rd Qu.:2001 3rd Qu.:1997
## Max. :2002 Max. :2003
## NA's :1
## deploy injSeverity caseid
## Min. :0.000 Min. :0.000 0.54931713 : 18
## 1st Qu.:0.000 1st Qu.:1.000 0.590289352: 17
## Median :0.000 Median :2.000 0.606956019: 16
## Mean :0.337 Mean :1.716 0.556261574: 15
## 3rd Qu.:1.000 3rd Qu.:3.000 0.584039352: 15
## Max. :1.000 Max. :6.000 0.587511574: 15
## NA's :153 (Other) :26121
As we have clear vision on our dataset,let us look at what we are trying to read through this data:
What is the age-group responsible for most severe injuries?
Cases responsible for max deaths?
Year resulted in maximum number of deaths?
Age-weight ratio of cases alive/dead?
data <- na.omit(data) #row with NA values are omitted
table(data$sex) # figures give stats of female and male entries
##
## f m
## 12178 13885
table(data$dead) # figures for 'dead' and 'alive'
##
## alive dead
## 24883 1180
For better undersanding:
prop.table(table(data$sex,data$dead),1)
##
## alive dead
## f 0.96189851 0.03810149
## m 0.94843356 0.05156644
prop.table(table(data$seatbelt,data$dead),1)
##
## alive dead
## belted 0.97292174 0.02707826
## none 0.91050276 0.08949724
prop.table(table(data$airbag,data$dead),1)
##
## alive dead
## airbag 0.96435547 0.03564453
## none 0.94295216 0.05704784
General statistics showthat maximum number of cases survived the injuries.‘Male’-cases survived more than ‘female’-cases.(we will prove this through graph as well)
As all of the columns from the original data set are not required ,we will create a subset of data as per requirement of our first prediction for fatality rate:
subdata <- subset(data, select = c("dead","airbag","seatbelt"))
subdata <- na.omit(subdata)
head(subdata)
## dead airbag seatbelt
## 1 alive none belted
## 2 alive airbag belted
## 3 alive none none
## 4 alive airbag belted
## 5 alive none belted
## 6 alive none belted
subdata1 <- subdata[!((subdata$airbag) == "none" | (subdata$seatbelt) == "belted"),]
summary(subdata1)
## dead airbag seatbelt
## alive:2977 airbag:3253 belted: 0
## dead : 276 none : 0 none :3253
#fatality_rate <- (((no_of_deaths) *100) / (no_of_total))
fatality_rate_airbag <- (((276)* 100) / (26217))
fatality_rate_airbag
## [1] 1.052752
library(ggplot2)
ggplot(subdata1,aes(x= dead,fill = factor(airbag))) +
geom_bar(width = 0.5) + xlab("dead") +
ylab("total")+ labs(fill = "airbag")
The fatality rate when cases used airbags only is 1.052752
subdata2<- subdata[!((subdata$airbag) == "airbag" | (subdata$seatbelt) == "none"),]
head(subdata2)
## dead airbag seatbelt
## 1 alive none belted
## 5 alive none belted
## 6 alive none belted
## 7 alive none belted
## 9 alive none belted
## 10 alive none belted
summary(subdata2)
## dead airbag seatbelt
## alive:7117 airbag: 0 belted:7382
## dead : 265 none :7382 none : 0
#fatality_rate <- (((no_of_deaths) *100) / (no_of_total))
fatality_rate_seatbelt <- (((265)* 100) / (26217))
fatality_rate_seatbelt
## [1] 1.010795
library(ggplot2)
ggplot(subdata2,aes(x= dead,fill = factor(seatbelt))) +
geom_bar(width = 0.5) + xlab("dead") +
ylab("total")+ labs(fill = "seatbelt")
The fatality rate when cases used seatbelts only is 1.010795
subdata3 <- subdata[!((subdata$airbag) == "none" | (subdata$seatbelt) == "none"),]
head(subdata3)
## dead airbag seatbelt
## 2 alive airbag belted
## 4 alive airbag belted
## 13 alive airbag belted
## 14 dead airbag belted
## 19 alive airbag belted
## 20 alive airbag belted
summary(subdata3)
## dead airbag seatbelt
## alive:10848 airbag:11083 belted:11083
## dead : 235 none : 0 none : 0
235 cases died with ‘airbag’ and ‘seatbelt’.
#fatality_rate <- (((no_of_deaths) *100) / (no_of_total_entries))
fatality_rate_seatbelt_airbag <- (((235)* 100) / (26217))
fatality_rate_seatbelt_airbag
## [1] 0.896365
Lowest fatality rate of all three cases is 0.89.Therefore,The cases which used both ‘seatbelts’ and ‘airbags’ survived the fatalities as compared to the cases who used any one of these two.The highest fatality rate is for “airbags”alone.It can be concluded that airbags are not capable enough to stop injuries/death.
some data wrangling is need to find our second conclusion: let us plot the basic his to get general outlook of age of cases who met an accident:
hist(data$ageOFocc,col = "green",breaks = 100)
rug(data$ageOFocc)
General histogram says that maximun number of cases who met accidents are in age group of 10-20. so we need to create a new column nammed ‘agerange’ in order to have more visibility.(PART OF DATA WRANGLING)
data$agerange <- '100+'
data$agerange[data$ageOFocc < 100 & data$ageOFocc >= 90] <- '90-100'
data$agerange[data$ageOFocc < 90 & data$ageOFocc >= 80] <- '80-90'
data$agerange[data$ageOFocc < 80 & data$ageOFocc >= 70] <- '70-80'
data$agerange[data$ageOFocc < 70 & data$ageOFocc >= 60] <- '60-70'
data$agerange[data$ageOFocc < 60 & data$ageOFocc >= 50] <- '50-60'
data$agerange[data$ageOFocc < 50 & data$ageOFocc >= 40] <- '40-50'
data$agerange[data$ageOFocc < 40 & data$ageOFocc >= 30] <- '30-40'
data$agerange[data$ageOFocc < 30 & data$ageOFocc >= 20] <- '20-30'
data$agerange[data$ageOFocc < 20 & data$ageOFocc >= 10] <- '10-20'
subdata4 <- subset(data,select = c("dead","ageOFocc","injSeverity","agerange"))
head(subdata4)
## dead ageOFocc injSeverity agerange
## 1 alive 26 3 20-30
## 2 alive 72 1 70-80
## 3 alive 69 4 60-70
## 4 alive 53 1 50-60
## 5 alive 32 3 30-40
## 6 alive 22 3 20-30
We sorted out our data set and added a new column with name “agerange”.but we are still not clear on how many levels on severities are present across the data frame.let’s check:
table(subdata4$injSeverity)
##
## 0 1 2 3 4 5 6
## 6478 5595 4242 8495 1118 133 2
six levels.
So we need to plot a graph which will help us visualize the picture.I used ggplot with aesthetics:
library(ggplot2)
ggplot(subdata4,aes(x= agerange,fill = factor(injSeverity))) +
geom_bar(width = 0.5) + xlab("agerange") +
ylab("total")+ labs(fill = "injSeverity")
Now that’s quiet a colorful graph! we get our next answer here.The age group more vulnerable to accidental injuries is “20-30”,with maximum number of level-3 injuries.
We need to find out which injury level resulted in death for maximum cases?
subdata5 <- subdata4[!((subdata$dead) == "alive"),]
head(subdata5)
## dead ageOFocc injSeverity agerange
## 8 dead 32 4 30-40
## 14 dead 54 4 50-60
## 83 dead 67 4 60-70
## 84 dead 64 4 60-70
## 265 dead 23 4 20-30
## 311 dead 77 4 70-80
with(subdata5,plot(ageOFocc,injSeverity,col = "blue"))
The scatterplot states that maximum number of deaths occured due to ‘level- 4’ injuries over almost every age group.Taking into account,there is a single sphere in age range (60-80),making a very good point,that a level-1 injury leaded to death for this particular case in age of ‘70-80’.We rest our second prediction here.
subdata6 <- subset(data ,select = c("weight","dead","sex","ageOFocc" ))
subdata6$sex <- gsub("m","male",subdata6$sex)
subdata6$sex <- gsub("f","female",subdata6$sex)
subdata7 <- subdata6[!((subdata6$dead) == "alive"),]
head(subdata7)
## weight dead sex ageOFocc
## 8 27.078 dead male 32
## 14 89.627 dead female 54
## 83 27.078 dead male 67
## 84 27.078 dead female 64
## 265 13.374 dead male 23
## 311 12.383 dead female 77
boxplot (ageOFocc ~ sex,data = subdata7,col = "red")
as the graph shows ,the age for death injuries tends is higher for female.The spread and median both are higher in case of females.
subdata8 <- subset(data,select = c("yearacc","dead"))
subdata9 <- subdata8[!((subdata8$dead) == "alive"),]
ggplot(subdata9,aes(x= yearacc,fill = factor(dead))) +
geom_bar(width = 0.5) + xlab("years of accidents") +
ylab("total")+ labs(fill = "dead")
Maximum deaths occured in year 1997,least in 2001.
I tried to use different version of ggplot here.
subdata10 <- subset(data,select = c("weight","ageOFocc","dead"))
subdata11 <- subdata10[!((subdata10$dead) == "dead"),]
subdata12 <- subdata10[!((subdata10$dead) == "alive"),]
par (mfrow= c(1,2))
qplot(ageOFocc,weight,data = subdata11)
qplot(ageOFocc,weight,data = subdata12)
Though the pattern is not much clear here,comparing the two graphs for ‘dead’ and ‘alive’ cases,the conclusion can be drawn.comparing two scenarios.
Answering all five questions,the higlights of dataset are: 1) Maximum number of deaths occured when cases used airbags alone. 2) Maximum cases alive used both “airbags” and “seatbelt” to save their lives.(lowest fatality ratio - 0.89) 3) Highest number of deaths are reported in year 1997,lowest in 2001. 4) Highest number of cases died due to level-4 injuries. 5) Agegroup ‘20-30’ proved to be more vulnerable to accidents and level- 3 injuries. 6) ‘Female’ cases suffered more loss of life as compared to “male” cases. 7) Two cases with level -6 injuries survived the accident. 8) Weight-Age ratio is higher for cases who survived.