The objective of this exercise is to analyze the selected data to find out the following: 1. any relationship between accidents and safety measures such as airbag or seatbelt. 2. any relatship between year of vehicles and accidents 3. any trend related to gender 4. other additional information related to injury severity Statistical tool is not be used for this exercise.
Load the dplyr packages.
library(dplyr)
library(ggplot2)
theUrl <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv"
nassCDS <- read.table(file=theUrl, header=TRUE, sep=",")
Data Exploration:
head(nassCDS)
## X dvcat weight dead airbag seatbelt frontal sex ageOFocc yearacc
## 1 1 25-39 25.069 alive none belted 1 f 26 1997
## 2 2 10-24 25.069 alive airbag belted 1 f 72 1997
## 3 3 10-24 32.379 alive none none 1 f 69 1997
## 4 4 25-39 495.444 alive airbag belted 1 f 53 1997
## 5 5 25-39 25.069 alive none belted 1 f 32 1997
## 6 6 40-54 25.069 alive none belted 1 f 22 1997
## yearVeh abcat occRole deploy injSeverity caseid
## 1 1990 unavail driver 0 3 2:3:1
## 2 1995 deploy driver 1 1 2:3:2
## 3 1988 unavail driver 0 4 2:5:1
## 4 1995 deploy driver 1 1 2:10:1
## 5 1988 unavail driver 0 3 2:11:1
## 6 1985 unavail driver 0 3 2:11:2
summary(nassCDS)
## X dvcat weight dead
## Min. : 1 1-9km/h: 686 Min. : 0.00 alive:25037
## 1st Qu.: 6555 10-24 :12848 1st Qu.: 32.47 dead : 1180
## Median :13109 25-39 : 8214 Median : 86.99
## Mean :13109 40-54 : 2977 Mean : 462.81
## 3rd Qu.:19663 55+ : 1492 3rd Qu.: 364.72
## Max. :26217 Max. :57871.59
##
## airbag seatbelt frontal sex ageOFocc
## airbag:14419 belted:18573 Min. :0.0000 f:12248 Min. :16.00
## none :11798 none : 7644 1st Qu.:0.0000 m:13969 1st Qu.:22.00
## Median :1.0000 Median :33.00
## Mean :0.6433 Mean :37.21
## 3rd Qu.:1.0000 3rd Qu.:48.00
## Max. :1.0000 Max. :97.00
##
## yearacc yearVeh abcat occRole
## Min. :1997 Min. :1953 deploy : 8836 driver:20601
## 1st Qu.:1998 1st Qu.:1989 nodeploy: 5583 pass : 5616
## Median :2000 Median :1994 unavail :11798
## Mean :2000 Mean :1993
## 3rd Qu.:2001 3rd Qu.:1997
## Max. :2002 Max. :2003
## NA's :1
## deploy injSeverity caseid
## Min. :0.000 Min. :0.000 13:157:1: 11
## 1st Qu.:0.000 1st Qu.:1.000 11:131:1: 10
## Median :0.000 Median :2.000 2:94:1 : 10
## Mean :0.337 Mean :1.716 12:50:1 : 9
## 3rd Qu.:1.000 3rd Qu.:3.000 41:49:1 : 9
## Max. :1.000 Max. :6.000 43:169:1: 9
## NA's :153 (Other) :26159
#Percentage of Dead Out of All Accidents
ratiodead <- length(which(nassCDS$dead == "dead"))/length(nassCDS$dead)*100
ratiodead
## [1] 4.500896
2.Data wrangling:
a. ageVeh is the age of vehicle b. safety is the number of safety measures (2=Both Airbag & Seatbelt, 1=Airbag or Seatbelt, 0=None")
nassCDS <- nassCDS %>%
mutate(ageVeh = yearacc-yearVeh,
safety = (as.numeric(nassCDS$airbag=="airbag") + as.numeric(nassCDS$seatbelt=="belted"))
)
nassCDS$sex <- gsub("m","Male", nassCDS$sex)
nassCDS$sex <- gsub("f","Female", nassCDS$sex)
3.Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
ggplot(data = nassCDS) +
geom_histogram(aes(x=safety), binwidth=0.5) +
labs(title="Figure 1 - No. of Accident Against Safety Measures")
#Ratio of accidents without either airbag or seatbelt
ratioMea <- length(which(nassCDS$safety == 0))/length(nassCDS$safety)*100
ratioMea
## [1] 16.7029
#Plot of Injury Severity and Safety
ggplot(data = nassCDS, aes(x=safety, y=injSeverity)) +
geom_point(aes(color=dead)) +
labs(title="Figure 2 - Injury Severity and Safety")
## Warning: Removed 153 rows containing missing values (geom_point).
#Count of Accidents Against Age of Vehicles
ggplot(data = nassCDS) +
geom_histogram(aes(x=ageVeh), binwidth = 1) +
labs(title="Figure 3 - Accidents Against Age of Vehicles")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
#Count of Accidents Against Age of Vehicles
ggplot(data = nassCDS, aes(x=sex, y=ageOFocc)) +
geom_boxplot() +
labs(title="Figure 4 - Boxplot of Occupant")
g <- ggplot(data = nassCDS, aes(x=ageOFocc, y=ageVeh))
g + geom_point(aes(color=dead))
## Warning: Removed 1 rows containing missing values (geom_point).
g + geom_point(aes(color=dead))+facet_grid(injSeverity~dvcat)
## Warning: Removed 1 rows containing missing values (geom_point).
g + geom_point(aes(color=dead))+facet_grid(injSeverity~sex)
## Warning: Removed 1 rows containing missing values (geom_point).
g + geom_point(aes(color=dead))+facet_grid(injSeverity~safety)
## Warning: Removed 1 rows containing missing values (geom_point).
4.Meaningful question for analysis:
The following insights were observed from the graphs and data: A.Approximate 17% of accidents has no airbag or seatbelt comparing to “dead” rate of all accidents at only 4.5%. B.There is no obvious trend between Injury Severity and number of Measures. C. Figure 3 shows lot of accidents happened for new cars focusing on 1st or 2nd year. It may imply either new drivers or experienced drivers are not familar with their new cars which may lead to high accidents occurred. D. The average age of male occupant is lower than female.The IQR for male is also smaller than female. There are several high age drivers or passengers. E. Numeric data of age of occupant and year of vehicle are used to dig into additional insights of the data. One interesting observation is that “dead” is happening at the Injured Severity Level of 4.