Week 3 Final Project

Assignment:

Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)
Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end

Questions to be addressed

1.) Does the model year of the vehicle have any correlation with injury or death rates?

2.) Does the direction of the impact have any correlation with death rates?

3.) How much do seatbelts help with preventing fatal crashes?

Loading the data set:

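library(dplyr); library(magrittr); library(ggplot2)
##dplyr for filter/mutate/summarize, magrittr for %<>%, ggplot2 for the plots
FARS<-read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/gamclass/FARS.csv")
##heads up, this takes a while, it's a very large data set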

Data Set

For this assignment, I chose to work with the FARS (Fatality Analysis Reporting System, from the National Highway Traffic Safety Administration) dataset. The data consist of information on all fatal crashes from 1998-2010 that involved cars (no light trucks, motorcycles or heavier vehicles) with a passenger in the front seat. The data set contains the following columns:

1.) X: An essentially unique marker for each vehicle (less than 2% of the numbers repeat), represented as the year followed by a decimal point, followed by a number

2.) caseid: A factor that gives an ID for each crash

3.) state: A number that indicates the US state in which the crash occurred

4.) airbag: A number denoting the exact status of the passenger airbag

ex: 01 - Deployed from the front, 20 - airbag available but did not deploy, 29 - airbag available and switched off, 30 - airbag not available for this seat, 31 - airbag previously deployed and not replaced

5.) injury: A number that gives the code for the type of injury that the passenger experienced

6.) restraint: A number that gives the code for whether the passenger seatbelt was used and what type of seatbelt it was if it was used

7.) sex: A number (1 or 2) that shows if the passenger was male or female

8.) inimpact: A number that gives the direction of the initial damage to the car in the crash, with the vast majority being either the clock position of the initial impact, above, below, no impact, or unknown. Note: There are secondary codes for side impacts, but given that there are fewer than 200 of these in either direction, I will be ignoring them.

9.) modelyr: A number that gives the model year of the car

10.) airbagAvail: whether or not there was an airbag installed on the passenger side of the car. This is a yes or no binary. Note: this gives less information than the airbag column; however, to mirror the driver side data, I have used the binary columns for airbag availability, deployment and seatbelt use.

11.) airbagDeploy: whether or not the airbag activated on the passenger side, also a yes or no binary

12.) Restraint: whether or not the passenger used their seatbelt, a yes or no binary

13.) D_injury: A number that gives a code for the injury to the driver, using the same correspondence as the injury column

14.) D_airbagAvail: the driver side equivalent of airbagAvail

15.) D_airbagDeploy: the driver side equivalent of airbagDeploy

16.) D_Restraint: the driver’s usage of a seatbelt

17.) year: A number giving the year of the crash

Cleaning up the data

There are many NAs throughout the data, and 99 or 9999 is used when a particular value is unknown. To see what is going on, I applied several filters to remove these records. The cleaned data are stored in a data.frame called strdimpact (for standard impact).

##start with impact location, removing unknowns and top and bottom impacts:
strdimpact <- filter(FARS, inimpact < 13)
##remove the unknown injuries:
strdimpact%<>%filter(injury<5,D_injury<5)
##removing unknowns from airbag and seatbelt usage
strdimpact%<>%filter(airbagAvail!="NA-code",airbagDeploy!="NA-code",Restraint!="NA-code",D_airbagAvail!="NA-code",D_airbagDeploy!="NA-code",D_Restraint!="NA-code")
strdimpact%<>%filter(modelyr<9999)
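
As an alternative to filtering on thresholds, the unknown codes can be converted to NA first and then dropped in one step. The sketch below uses base R on a small made-up data frame; the rows are hypothetical, and it assumes that 99 and 9999 are the only unknown codes in these two columns.

```r
# Hypothetical toy rows; the column names follow the FARS data used above.
toy <- data.frame(
  inimpact = c(12, 3, 99, 6),
  modelyr  = c(1996, 9999, 2001, 1988)
)

# Assumed sentinel codes: 99 for inimpact, 9999 for modelyr.
toy$inimpact[toy$inimpact == 99]  <- NA
toy$modelyr[toy$modelyr == 9999]  <- NA

# Keep only the rows that have no missing values.
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```

Recoding to NA keeps the "unknown" status explicit, so functions such as summary() report how many values are missing instead of treating 9999 as a real model year.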

Initial Summary

With the standardized impact data, the data set has been reduced by approximately 20% in size. However, there is still more than enough data to give some interesting results. First, the summary of the model year:

summary(strdimpact$modelyr)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1924    1991    1996    1996    2000    2011

On a year by year basis, these are the summaries for model year:

i<-1998

while(i <= max(strdimpact$year)){
  print(i)
  print(summary(select(filter(strdimpact,year==i),modelyr)))
  i<-i+1
  }

## [1] 1998
##     modelyr    
##  Min.   :1929  
##  1st Qu.:1987  
##  Median :1990  
##  Mean   :1990  
##  3rd Qu.:1994  
##  Max.   :1999  
## [1] 1999
##     modelyr    
##  Min.   :1928  
##  1st Qu.:1988  
##  Median :1991  
##  Mean   :1991  
##  3rd Qu.:1995  
##  Max.   :2000  
## [1] 2000
##     modelyr    
##  Min.   :1927  
##  1st Qu.:1988  
##  Median :1992  
##  Mean   :1992  
##  3rd Qu.:1996  
##  Max.   :2001  
## [1] 2001
##     modelyr    
##  Min.   :1934  
##  1st Qu.:1989  
##  Median :1994  
##  Mean   :1993  
##  3rd Qu.:1997  
##  Max.   :2002  
## [1] 2002
##     modelyr    
##  Min.   :1937  
##  1st Qu.:1990  
##  Median :1994  
##  Mean   :1994  
##  3rd Qu.:1998  
##  Max.   :2003  
## [1] 2003
##     modelyr    
##  Min.   :1932  
##  1st Qu.:1991  
##  Median :1996  
##  Mean   :1995  
##  3rd Qu.:1999  
##  Max.   :2004  
## [1] 2004
##     modelyr    
##  Min.   :1948  
##  1st Qu.:1993  
##  Median :1997  
##  Mean   :1996  
##  3rd Qu.:2000  
##  Max.   :2005  
## [1] 2005
##     modelyr    
##  Min.   :1927  
##  1st Qu.:1993  
##  Median :1998  
##  Mean   :1997  
##  3rd Qu.:2001  
##  Max.   :2006  
## [1] 2006
##     modelyr    
##  Min.   :1932  
##  1st Qu.:1994  
##  Median :1999  
##  Mean   :1998  
##  3rd Qu.:2002  
##  Max.   :2007  
## [1] 2007
##     modelyr    
##  Min.   :1924  
##  1st Qu.:1995  
##  Median :2000  
##  Mean   :1999  
##  3rd Qu.:2003  
##  Max.   :2008  
## [1] 2008
##     modelyr    
##  Min.   :1928  
##  1st Qu.:1996  
##  Median :2000  
##  Mean   :2000  
##  3rd Qu.:2004  
##  Max.   :2009  
## [1] 2009
##     modelyr    
##  Min.   :1971  
##  1st Qu.:1998  
##  Median :2002  
##  Mean   :2001  
##  3rd Qu.:2005  
##  Max.   :2010  
## [1] 2010
##     modelyr    
##  Min.   :1979  
##  1st Qu.:1999  
##  Median :2002  
##  Mean   :2002  
##  3rd Qu.:2006  
##  Max.   :2011

##I cannot figure out a way to do this without a loop
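
One loop-free possibility is to split the model years by crash year and summarize each group. This is a minimal base R sketch on a made-up data frame (the rows are hypothetical; only the column names year and modelyr come from the data set).

```r
# Hypothetical toy data with the same column names as strdimpact.
toy <- data.frame(
  year    = c(1998, 1998, 1998, 1999, 1999, 1999),
  modelyr = c(1985, 1990, 1995, 1988, 1991, 2000)
)

# split() groups modelyr by crash year, sapply() applies summary() to each
# group, and t() puts one crash year on each row.
by_year <- t(sapply(split(toy$modelyr, toy$year), summary))
print(by_year)
```

On the real data the same call would be written with strdimpact$modelyr and strdimpact$year, giving one row of quartiles per crash year instead of thirteen separate printouts.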

This can be viewed as a box plot, as below:

ggplot(strdimpact,aes(y=modelyr,x=1))+geom_boxplot()+stat_boxplot(geom="errorbar")+labs(y="Model Year")+theme(axis.text.x=element_blank(),axis.title.x = element_blank())

As above, in order to consider the model year on a year by year basis, I have created box plots for all of the years:

ggplot(strdimpact,aes(y=modelyr,x=1))+geom_boxplot()+labs(y="Model Year")+theme(axis.text.x=element_blank(),axis.title.x = element_blank())+facet_wrap(~year)

Unfortunately, apart from the crash year and model year, the remaining numerical columns hold category codes for crash data, so summary statistics for them would not give any meaningful results.

Passenger and Driver Injuries by year and model

First, renaming data so that I can show what the numbers mean for injury:

strdimpact %<>% mutate(Passenger_Injury=injury+1)
strdimpact %<>% mutate(Driver_Injury=D_injury+1)
strdimpact$Passenger_Injury%<>%as.factor()
strdimpact$Passenger_Injury %<>% recode_factor(`1` = "Not Injured", `2`= "Possible Injury", `3` = "Suspected Minor Injury", `4` = "Suspected Serious Injury", `5` = "Fatal Injury")
strdimpact$Driver_Injury%<>%as.factor()
strdimpact$Driver_Injury %<>% recode_factor(`1` = "Not Injured", `2`= "Possible Injury", `3` = "Suspected Minor Injury", `4` = "Suspected Serious Injury", `5` = "Fatal Injury")
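
The same labelling can also be done in a single step with base R's factor(), which avoids shifting the codes by one. This is a sketch on a hypothetical vector of codes; the label order is taken from the recode_factor() calls above.

```r
# Hypothetical injury codes in the 0-4 range kept by the cleaning step.
injury <- c(0, 4, 2, 1, 3, 4)

# Label order copied from the recode_factor() calls above.
injury_labels <- c("Not Injured", "Possible Injury", "Suspected Minor Injury",
                   "Suspected Serious Injury", "Fatal Injury")

# factor() maps each code in `levels` to the label in the same position.
Passenger_Injury <- factor(injury, levels = 0:4, labels = injury_labels)
print(table(Passenger_Injury))
```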

Plotting the model year of each car in a fatal accident against the year the accident occurred, showing passenger injury levels:

ggplot(strdimpact,aes(x=X,y=modelyr))+geom_point(aes(color=Passenger_Injury))+labs(color="Injury Level",x="Year of Crash",y="Car Model Year",title="Passenger Injury Severity by Model Year")

Showing driver injury levels:

ggplot(strdimpact,aes(x=X,y=modelyr))+geom_point(aes(color=Driver_Injury))+labs(color="Injury Level",x="Year of Crash",y="Car Model Year",title="Driver Injury Severity by Model Year")

These plots, while giving us some information, are too crowded to really see much about what is happening. At most, I can say that even with newer cars there are large numbers of similar injuries.
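
When a scatter plot is this crowded, a small table of proportions can be easier to read. The sketch below is only an illustration on made-up rows: it groups model years into decade bands (the band boundaries are an arbitrary assumption) and computes the share of fatal injuries (code 4) in each band.

```r
# Hypothetical toy rows; only the column names come from the data set.
toy <- data.frame(
  modelyr = c(1985, 1988, 1992, 1996, 1999, 2003, 2005, 2008),
  injury  = c(4, 2, 4, 0, 4, 1, 4, 3)
)

# Decade bands; the break points are an arbitrary choice for illustration.
toy$band <- cut(toy$modelyr, breaks = c(1979, 1989, 1999, 2009),
                labels = c("1980s", "1990s", "2000s"))

# Share of occupants with a fatal injury (code 4) in each band.
toy$fatal <- toy$injury == 4
fatal_share <- tapply(toy$fatal, toy$band, mean)
print(fatal_share)
```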

Instead of looking at all injury types, I isolated the passenger and driver fatality data.

pasfaimpact <- filter(strdimpact,injury==4, inimpact >0) 
#passenger fatality impact data - limited it to crashes in which the passenger died and there was an impact
drifaimpact <- filter(strdimpact,D_injury==4, inimpact >0) 
#driver fatality impact data - limited it to crashes in which the driver died and there was an impact

I can now show which direction the fatal impact came from. For this histogram the labels 1 through 12 correspond to the numbers on a clock, with 12 being a head on collision, 6 being a rear end collision and 3 and 9 corresponding to passenger and driver side impacts, respectively:

ggplot(pasfaimpact,aes(x=inimpact,fill=as.factor(inimpact)))+geom_histogram(color="black",binwidth = 1,center=0)+theme(legend.position="none")+labs(y="Number of Crashes",x="Initial Impact Direction")+scale_x_continuous(breaks=1:12)
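
For tables and labels it can help to translate the clock positions into sides of the car. The helper below, clock_side(), is a hypothetical function written for this illustration; the grouping of 11, 12 and 1 as front, 2 to 4 as passenger side, 5 to 7 as rear and 8 to 10 as driver side is my own assumption, not part of the FARS coding.

```r
# Hypothetical helper: map a clock position (1..12) to a side of the car.
# The grouping of neighbouring positions is an assumption for illustration.
clock_side <- function(pos) {
  sides <- c("Front",                                   # 1
             rep("Passenger side", 3),                  # 2, 3, 4
             rep("Rear", 3),                            # 5, 6, 7
             rep("Driver side", 3),                     # 8, 9, 10
             "Front", "Front")                          # 11, 12
  sides[pos]
}

print(clock_side(c(12, 3, 6, 9)))
```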

This is interesting. However, it is easier to understand when viewing the direction of impact as a rose (or consultant’s) graph:

ggplot(pasfaimpact,aes(x=inimpact,fill=as.factor(inimpact)))+geom_histogram(color="black",binwidth = 1,center=0)+coord_polar(start=(pi/12))+scale_y_continuous(limits=c(0,17000),breaks=(seq(0,17000,1000)))+theme(legend.position="none",axis.title.x = element_blank(),axis.title.y = element_blank())+labs(title="Passenger Deaths In Crashes with Initial Impact Directions")+scale_x_continuous(breaks=1:12)

This shows that the majority of fatal impacts were head-on and that the second largest group was passenger side impacts. Based on this, I anticipated that the majority of fatal injuries for drivers would come from impacts on the front and driver’s side of the vehicles.

ggplot(drifaimpact,aes(x=inimpact,fill=as.factor(inimpact)))+geom_histogram(color="black",binwidth = 1,center=0)+coord_polar(start=(pi/12))+scale_y_continuous(limits=c(0,17000),breaks=(seq(0,17000,1000)))+theme(legend.position="none",axis.title.x = element_blank(),axis.title.y = element_blank())+labs(title="Driver Deaths In Crashes with Initial Impact Directions")+scale_x_continuous(breaks=1:12)

Finally, I considered seatbelt use in decreasing fatalities. How many of the fatal crashes happened even though the occupants were wearing seatbelts?

I created two more data.frames to hold only fatal crashes in which the person killed was wearing their seatbelt.

pasfaimpactsb <- filter(pasfaimpact,Restraint=="yes") ##seatbelt used
drifaimpactsb <- filter(drifaimpact,D_Restraint=="yes") ##seatbelt used
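
Before plotting, the size of the reduction can be checked numerically as the share of fatalities in which the seatbelt was used. This is a minimal sketch on made-up rows; only the Restraint column name and its "yes"/"no" values come from the data set.

```r
# Hypothetical fatal-crash rows with the binary Restraint column.
toy_fatal <- data.frame(
  inimpact  = c(12, 12, 3, 12, 9, 3),
  Restraint = c("yes", "no", "no", "no", "yes", "yes")
)

# Share of these fatalities in which the occupant wore a seatbelt.
belted_share <- mean(toy_fatal$Restraint == "yes")
print(belted_share)
```

On the real data the equivalent check would be nrow(pasfaimpactsb)/nrow(pasfaimpact) for passengers and nrow(drifaimpactsb)/nrow(drifaimpact) for drivers.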

I then created histograms in the rose graph layout to see what the reduction looks like. I kept the y-axis on the same scale as in the plots of all fatal crashes so that the difference is visible.

For Drivers

ggplot(drifaimpactsb,aes(x=inimpact,fill=as.factor(inimpact)))+geom_histogram(color="black",binwidth = 1,center=0)+coord_polar(start=(pi/12))+scale_y_continuous(limits=c(0,17000),breaks=(seq(0,17000,1000)))+theme(legend.position="none",axis.title.x = element_blank(),axis.title.y = element_blank())+labs(title="Seatbelted Driver Deaths In Crashes with Initial Impact Directions")+scale_x_continuous(breaks=1:12)

For Passengers

ggplot(pasfaimpactsb,aes(x=inimpact,fill=as.factor(inimpact)))+geom_histogram(color="black",binwidth = 1,center=0)+coord_polar(start=(pi/12))+scale_y_continuous(limits=c(0,17000),breaks=(seq(0,17000,1000)))+theme(legend.position="none",axis.title.x = element_blank(),axis.title.y = element_blank())+labs(title="Seatbelted Passenger Deaths In Crashes with Initial Impact Directions")+scale_x_continuous(breaks=1:12)

While there are fewer fatalities among seatbelt wearers than overall, the ratio of side impact to front impact fatalities appears to be higher when seatbelts are used.

all_passenger_deaths<-summarize(group_by(pasfaimpact,inimpact),n())
all_passenger_deaths[3,2]/all_passenger_deaths[12,2]

##         n()
## 1 0.5472335

all_passenger_deaths_sb<-summarize(group_by(pasfaimpactsb,inimpact),n())
all_passenger_deaths_sb[3,2]/all_passenger_deaths_sb[12,2]

##         n()
## 1 0.7189042

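all_driver_deaths<-summarize(group_by(drifaimpact,inimpact),n())
all_driver_deaths[9,2]/all_driver_deaths[12,2] #driver side impacts over driver head-on impacts

##         n()
## 1 0.4447921
##(value printed with all_passenger_deaths[12,2] as the denominator)

all_driver_deaths_sb<-summarize(group_by(drifaimpactsb,inimpact),n())
all_driver_deaths_sb[9,2]/all_driver_deaths_sb[12,2]

##        n()
## 1 0.600074
##(value printed with all_passenger_deaths_sb[12,2] as the denominator)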
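
The ratios above pick rows 3, 9 and 12 by position, which only works if every clock position from 1 to 12 is present in the grouped table. A sketch of a more defensive version uses base R's tabulate(), which always returns one count per position. The data here are made up, and side_to_front() is a hypothetical helper name.

```r
# Hypothetical initial-impact codes for a handful of fatal crashes.
inimpact <- c(12, 12, 12, 12, 3, 3, 9, 6, 1, 11)

# tabulate() returns a count for every position 1..12, including zeros,
# so counts[3] is always the 3 o'clock count.
counts <- tabulate(inimpact, nbins = 12)

# Hypothetical helper: side-impact deaths relative to 12 o'clock deaths.
side_to_front <- function(counts, side) counts[side] / counts[12]

passenger_ratio <- side_to_front(counts, 3)
driver_ratio    <- side_to_front(counts, 9)
print(c(passenger = passenger_ratio, driver = driver_ratio))
```

Passing one set of counts to the helper also guarantees that the numerator and denominator come from the same group of occupants.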

Given all of this, we can now answer the three questions from the start.

1.) There does not appear to be any correlation between model year and injury severity. In every crash year, the mean model year of the cars involved in fatal collisions is 8 years earlier than the crash year, and the median model year is 7 or 8 years earlier. From the scatterplots, no association can be seen between injuries to either drivers or passengers and model year.

2.) The direction of impact has a very large correlation with number of deaths. The vast majority of collisions that lead to death are from front end collisions. There are also a large number of deaths from side impacts into the driver’s side (for driver deaths) and passenger’s side (for passenger deaths). While this is indicative of these types of crashes being more deadly, I do not have data on total numbers of fatal and non-fatal crashes so I do not know if these are the most common types of accidents. There is a suggestion that being hit on the side is much more dangerous for the occupant on that side of the car than the other occupant, which is reasonable.

3.) In the majority of collisions, the deceased occupant was not wearing a seatbelt. Once again, I am unable to say definitively that seatbelts directly cause more accidents to be non-fatal. However, the data are suggestive. Furthermore, it seems that seatbelts are more likely to prevent death in a front-end collision than when the car is hit in the side. Once again, this seems reasonable, given that side impacts can more easily cause damage beyond simple energy transference.