The objective of this exercise is to analyze the selected data to find out the following: 1. any relationship between accidents and safety measures such as airbag or seatbelt. 2. any relatship between year of vehicles and accidents 3. any trend related to gender 4. other additional information related to injury severity Statistical tool is not be used for this exercise.

Load the dplyr packages.

library(dplyr)
library(ggplot2)
theUrl <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/nassCDS.csv"
nassCDS <- read.table(file=theUrl, header=TRUE, sep=",")

Data Exploration:

head(nassCDS)
##   X dvcat  weight  dead airbag seatbelt frontal sex ageOFocc yearacc
## 1 1 25-39  25.069 alive   none   belted       1   f       26    1997
## 2 2 10-24  25.069 alive airbag   belted       1   f       72    1997
## 3 3 10-24  32.379 alive   none     none       1   f       69    1997
## 4 4 25-39 495.444 alive airbag   belted       1   f       53    1997
## 5 5 25-39  25.069 alive   none   belted       1   f       32    1997
## 6 6 40-54  25.069 alive   none   belted       1   f       22    1997
##   yearVeh   abcat occRole deploy injSeverity caseid
## 1    1990 unavail  driver      0           3  2:3:1
## 2    1995  deploy  driver      1           1  2:3:2
## 3    1988 unavail  driver      0           4  2:5:1
## 4    1995  deploy  driver      1           1 2:10:1
## 5    1988 unavail  driver      0           3 2:11:1
## 6    1985 unavail  driver      0           3 2:11:2
summary(nassCDS)
##        X             dvcat           weight            dead      
##  Min.   :    1   1-9km/h:  686   Min.   :    0.00   alive:25037  
##  1st Qu.: 6555   10-24  :12848   1st Qu.:   32.47   dead : 1180  
##  Median :13109   25-39  : 8214   Median :   86.99                
##  Mean   :13109   40-54  : 2977   Mean   :  462.81                
##  3rd Qu.:19663   55+    : 1492   3rd Qu.:  364.72                
##  Max.   :26217                   Max.   :57871.59                
##                                                                  
##     airbag        seatbelt        frontal       sex          ageOFocc    
##  airbag:14419   belted:18573   Min.   :0.0000   f:12248   Min.   :16.00  
##  none  :11798   none  : 7644   1st Qu.:0.0000   m:13969   1st Qu.:22.00  
##                                Median :1.0000             Median :33.00  
##                                Mean   :0.6433             Mean   :37.21  
##                                3rd Qu.:1.0000             3rd Qu.:48.00  
##                                Max.   :1.0000             Max.   :97.00  
##                                                                          
##     yearacc        yearVeh          abcat         occRole     
##  Min.   :1997   Min.   :1953   deploy  : 8836   driver:20601  
##  1st Qu.:1998   1st Qu.:1989   nodeploy: 5583   pass  : 5616  
##  Median :2000   Median :1994   unavail :11798                 
##  Mean   :2000   Mean   :1993                                  
##  3rd Qu.:2001   3rd Qu.:1997                                  
##  Max.   :2002   Max.   :2003                                  
##                 NA's   :1                                     
##      deploy       injSeverity         caseid     
##  Min.   :0.000   Min.   :0.000   13:157:1:   11  
##  1st Qu.:0.000   1st Qu.:1.000   11:131:1:   10  
##  Median :0.000   Median :2.000   2:94:1  :   10  
##  Mean   :0.337   Mean   :1.716   12:50:1 :    9  
##  3rd Qu.:1.000   3rd Qu.:3.000   41:49:1 :    9  
##  Max.   :1.000   Max.   :6.000   43:169:1:    9  
##                  NA's   :153     (Other) :26159
#Percentage of Dead Out of All Accidents
ratiodead <- length(which(nassCDS$dead == "dead"))/length(nassCDS$dead)*100
ratiodead
## [1] 4.500896

2.Data wrangling:
a. ageVeh is the age of vehicle b. safety is the number of safety measures (2=Both Airbag & Seatbelt, 1=Airbag or Seatbelt, 0=None")

nassCDS <- nassCDS %>%
  mutate(ageVeh = yearacc-yearVeh,
         safety = (as.numeric(nassCDS$airbag=="airbag") + as.numeric(nassCDS$seatbelt=="belted"))
  )
nassCDS$sex <- gsub("m","Male", nassCDS$sex)
nassCDS$sex <- gsub("f","Female", nassCDS$sex)

3.Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

ggplot(data = nassCDS) + 
  geom_histogram(aes(x=safety), binwidth=0.5) + 
  labs(title="Figure 1 - No. of Accident Against Safety Measures")

#Ratio of accidents without either airbag or seatbelt
ratioMea <- length(which(nassCDS$safety == 0))/length(nassCDS$safety)*100
ratioMea
## [1] 16.7029
#Plot of Injury Severity and Safety
ggplot(data = nassCDS, aes(x=safety, y=injSeverity)) +
  geom_point(aes(color=dead)) +
  labs(title="Figure 2 - Injury Severity and Safety")
## Warning: Removed 153 rows containing missing values (geom_point).

#Count of Accidents Against Age of Vehicles
ggplot(data = nassCDS) + 
  geom_histogram(aes(x=ageVeh), binwidth = 1) +
  labs(title="Figure 3 - Accidents Against Age of Vehicles")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

#Count of Accidents Against Age of Vehicles
ggplot(data = nassCDS, aes(x=sex, y=ageOFocc)) + 
  geom_boxplot() +
  labs(title="Figure 4 - Boxplot of Occupant")

g <- ggplot(data = nassCDS, aes(x=ageOFocc, y=ageVeh))
g + geom_point(aes(color=dead))
## Warning: Removed 1 rows containing missing values (geom_point).

g + geom_point(aes(color=dead))+facet_grid(injSeverity~dvcat)
## Warning: Removed 1 rows containing missing values (geom_point).

g + geom_point(aes(color=dead))+facet_grid(injSeverity~sex)
## Warning: Removed 1 rows containing missing values (geom_point).

g + geom_point(aes(color=dead))+facet_grid(injSeverity~safety)
## Warning: Removed 1 rows containing missing values (geom_point).

4.Meaningful question for analysis:
The following insights were observed from the graphs and data: A.Approximate 17% of accidents has no airbag or seatbelt comparing to “dead” rate of all accidents at only 4.5%. B.There is no obvious trend between Injury Severity and number of Measures. C. Figure 3 shows lot of accidents happened for new cars focusing on 1st or 2nd year. It may imply either new drivers or experienced drivers are not familar with their new cars which may lead to high accidents occurred. D. The average age of male occupant is lower than female.The IQR for male is also smaller than female. There are several high age drivers or passengers. E. Numeric data of age of occupant and year of vehicle are used to dig into additional insights of the data. One interesting observation is that “dead” is happening at the Injured Severity Level of 4.