Introduction

The final project requirements have been identified in the document located at the following link: https://bbhosted.cuny.edu/bbcswebdav/pid-42265237-dt-content-rid-347468186_1/courses/SPS01_DATA_607_01_1199_1/607-Final%281%29.pdf

Team Members

Mario Pena

Ajay Arora

Initial Communication / Establishing Work Activity

We enjoyed collaboration on the 3rd project and decieded to continue to work together for the final project. We established communication over #Slack and spoke over the phone to introduce ourselves properly. Furthermore, we divided the work effort in half. The work effort consisted of Data Preparation, Data Cleaning, Data Analysis, and Conclusion.

Ongoing Communication / Artifact Location

We both decided to communicate via email and/or #Slack. In addition, we establihed a Final Project group folder on GitHub.

https://github.com/AjayArora35/Data-607-Final-Project

https://data607fall2019.slack.com/

Data

The data was acquired at https://www-nrd.nhtsa.dot.gov/database/veh/veh.htm. (https://www-nrd.nhtsa.dot.gov/database/VSR/Download.aspx?tstno=&curno=&database=v&name=Vehdb-export&format=export)

Objective

Specifically, Which vehicle or set of vehicles are the safest from a head and leg injury perspective.

Injury Criteria

Head Injury Criterion, Left Femur Load, Right Femur Load

Analysis and Attributes

National Highway Traffic Safety Administration

Based on the recommendations following this section, provided by the National Highway Traffic Safety Administration, we want to perform analysis to determine which vehicles have the best and worst injury severity ratings based upon the criteria listed below and using the following attributes.

Attributes

vehicle make, Vehicle year, Vehicle body type, and occupant location (Left-front-seat, Right-front-seat).

Motiviation

The National Highway Traffic Safety Administration makes their recommendation using a star rating system. A greater number of Stars mean Safer Cars. 5-Star Safety Ratings measure the crashworthiness and rollover safety of vehicles. Five stars is the highest rating, one is the lowest. However, some of the vehicles getting a 5-star rating could be cost prohibitive for some consumers and provide a small variety of choices. In this instance, a more detailed analysis of the injury criteria and attributes are warranted.

National Highway Traffic Safety Administration Recommendations

National Highway Traffic Safety Administration Recommendations

National Highway Traffic Safety Administration Recommendations

The National Highway Traffic Safety Administration Recommendations provide a basis for vehicle safety for crashworthiness and rollover. The recommendation image provides a baseline for head, leg and chest injury. In summary, the closer the value is to absolute 0, the less likely the occupant will have a severe injury.

Preparing Environment

#Loading Libraries
library(DBI)
## Warning: package 'DBI' was built under R version 3.5.3
library("knitr")
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.0
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("stringr")
library("plotly")
## Warning: package 'plotly' was built under R version 3.5.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library("htmlTable")
## Warning: package 'htmlTable' was built under R version 3.5.3
library("stringr")
library("stats")
library("scales")
## Warning: package 'scales' was built under R version 3.5.3
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library("viridis")
## Warning: package 'viridis' was built under R version 3.5.3
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
## 
##     viridis_pal
library("wordcloud")
## Warning: package 'wordcloud' was built under R version 3.5.3
## Loading required package: RColorBrewer

Data Preparation for Summarized Analysis

Retrieve data from MySQL hosted in AWS

#data <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Software%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)
#data2C <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Computing%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)
cn <- dbConnect(drv      = RMySQL::MySQL(), 
                username = "admin", 
                password = "Data__607", 
                host     = "database-1.cxdov2mcmzlo.us-east-2.rds.amazonaws.com", 
                port     = 3306, 
                dbname   = "data607finalproject")
data <- dbGetQuery(cn, "SELECT 
RD.MAKED
,MAX(`Head Injury Criterion`) AS HIC
,MAX(`Left Femur Load`) AS LFL
,MAX(`Right Femur Load`) AS RFL
,MIN(`Head Injury Criterion`) AS min_HIC
,MIN(`Left Femur Load`) AS min_LFL
,MIN(`Right Femur Load`) AS min_RFL
FROM
(SELECT v.MAKED
        , v.YEAR
        , AVG(convert(o.HIC, SIGNED INTEGER)) AS `Head Injury Criterion`
        , AVG(convert(o.LFEM , SIGNED INTEGER)) AS `Left Femur Load`
        , AVG(convert(o.RFEM , SIGNED INTEGER)) AS `Right Femur Load`
  FROM data607finalproject.test t
  inner join data607finalproject.veh v
  on(v.TSTNO = t.TSTNO)
  inner join data607finalproject.occ o
  on(o.TSTNO = v.TSTNO)
  inner join data607finalproject.rest r
  on(r.TSTNO = v.TSTNO)
  AND (r.VEHNO = v.VEHNO)
WHERE v.MAKED != 'NHTSA' 
    AND v.MAKED NOT IN ('MCI') 
    AND  v.YEAR != '' 
    AND v.YEAR != 0 
    AND  (o.LFEM != '' AND o.LFEM != 0) 
    AND (o.RFEM != '' AND o.RFEM != 0)
    AND r.DEPLOYD = 'DEPLOYED PROPERLY'
GROUP BY
    v.MAKED
    ,v.YEAR
) AS RD
GROUP BY 
RD.MAKED 
ORDER BY
RD.MAKED")
## Warning in .local(conn, statement, ...): Decimal MySQL column 1 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 3 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 4 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 5 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 6 imported as
## numeric
head(data)
##       MAKED       HIC       LFL        RFL min_HIC   min_LFL   min_RFL
## 1     ACURA  723.1429 -400.7500  -329.5000 152.750 -6788.000 -7036.750
## 2      AUDI  627.5000 -269.1667  -199.6667  63.400 -6409.500 -5085.500
## 3       BMW 9999.0000 -999.0000  -999.0000  40.250 -6962.500 -7828.000
## 4     BUICK 1273.5000 -647.2222  -767.7000 149.000 -7157.500 -7034.750
## 5  CADILLAC 1120.0000 -565.0000  -704.5000  53.500 -5409.500 -5183.250
## 6 CHEVROLET 1282.7679 -956.6154 -1216.8462  96.375 -6328.364 -5758.714

Data Cleaning

Convert to numerics and truncate

data$HIC <- as.numeric(gsub(",","", data$HIC))
data$LFL <- as.numeric(gsub(",","", data$LFL))
data$RFL <- as.numeric(gsub(",","", data$RFL))
data$min_HIC <- as.numeric(gsub(",","", data$min_HIC))
data$min_LFL <- as.numeric(gsub(",","", data$min_LFL))
data$min_RFL <- as.numeric(gsub(",","", data$min_RFL))
data$HIC <- trunc(data$HIC)
data$LFL <- trunc(data$LFL)
data$RFL <- trunc(data$RFL)
data$min_HIC <- trunc(data$min_HIC)
data$min_LFL <- trunc(data$min_LFL)
data$minRFL <- trunc(data$min_RFL)

HIC Discussion

The Head Injury Criterion (HIC) is a measure of the likelihood of head injury arising from an impact. The summarized data for HCI was generated by taking the maximum of HCI value for each vehicle make.

The HIC can be used to assess safety related to vehicles, personal protective gear, and sport equipment.

Normally the variable is derived from the measurements of an accelerometer mounted at the center of mass of a crash test dummy’s head, when the dummy is exposed to crash forces. This means that the HIC includes the effects of head acceleration and the duration of the acceleration. Large accelerations may be tolerated for very short times.

At a HIC of 1000, there is an 18% probability of a severe head injury, a 55% probability of a serious injury and a 90% probability of a moderate head injury to the average adult.

Summarized Analysis

What is the maximum Head Injury Criterion per vehicle?

Measured in integers, 0 to 9,999, HIC is the computed value of the head injury criterion, based on the resultant acceleration pulse for the head center of gravity. (https://www.intmath.com/applications-integration/hic-part2.php) Generally, experts agree that Head Injury Criterion (HIC) values above 1000 are life threatening.

grid1 <- ggplot(data = data,aes(x=reorder(data$MAKED, data$HIC), y=data$HIC, fill = viridis(49), )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") + 
  geom_label(aes(label=data$HIC), position = position_dodge(width = 0.5), size = 2.4,   label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = "Maximum Head Injury Criterion", x = "Vehicles", y = "Severity")+
  coord_flip()
grid1

What is the maximum Left Femur Peak Load per vehicle?

Measured as an integer, LFEM indicates the maximum compression load for the left femur listed as a negative number.

grid2 <- ggplot(data = data,aes(x=reorder(data$MAKED, data$LFL), y=data$LFL, fill = viridis(49), )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") + 
  geom_label(aes(label=data$LFL), position = position_dodge(width = 0.5), size = 2.4,   label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = " Maximum Left Femur Peak Load", x = "Vehicles", y = "Severity")+
  coord_flip()
grid2

What is the maximum Right Femur Peak Load per vehicle?

Measured as an integer, RFEM indicates the maximum compression load for the right femur listed as a negative number.

grid3 <- ggplot(data = data,aes(x=reorder(data$MAKED, data$RFL), y=data$RFL, fill = viridis(49), )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") + 
  geom_label(aes(label=data$RFL), position = position_dodge(width = 0.5), size = 2.4,   label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = " Maximum Right Femur Peak Load", x = "Vehicles", y = "Severity")+
  coord_flip()
grid3

What is minimum Head Injury Criterion per vehicle?

grid4 <- ggplot(data = data,aes(x=reorder(data$MAKED, data$min_HIC), y=data$min_HIC, fill = viridis(49), )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") + 
  geom_label(aes(label=data$min_HIC), position = position_dodge(width = 0.5), size = 2.4,   label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = "Minimum Head Injury Criterion", x = "Vehicles", y = "Severity")+
  coord_flip()
grid4

What is the minimum Left Femur Peak Load per vehicle?

grid5 <- ggplot(data = data,aes(x=reorder(data$MAKED, data$min_LFL), y=data$min_LFL, fill = viridis(49), )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") + 
  geom_label(aes(label=data$min_LFL), position = position_dodge(width = 0.5), size = 2.4,   label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = " Minimum Left Femur Peak Load", x = "Vehicles", y = "Severity")+
  coord_flip()
grid5

What is the minimum Right Femur Peak Load per vehicle?

grid6 <- ggplot(data = data,aes(x=reorder(data$MAKED, data$min_RFL), y=data$min_RFL, fill = viridis(49), )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") + 
  geom_label(aes(label=data$min_RFL), position = position_dodge(width = 0.5), size = 2.4,   label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = " Minimum Right Femur Peak Load", x = "Vehicles", y = "Severity")+
  coord_flip()
grid6

The summarized view of the data for HCI, LFEM and RFEM does not reveal the details necessary for making a more informed decision about a vehicle. For example, suppose a potential vehicle buyer wanted to know these values based on the year, type of vehicle, location of the occputant, etc. The summarized views do not provide this level of detail, so it becomes necessary to retrieve data with year and type of vehicle, among other attributes.

Data Preparation with Additional Attributes

Retrieve data from MySQL hosted in AWS

#data <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Software%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)
#data2C <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Computing%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)
cn <- dbConnect(drv      = RMySQL::MySQL(), 
                username = "admin", 
                password = "Data__607", 
                host     = "database-1.cxdov2mcmzlo.us-east-2.rds.amazonaws.com", 
                port     = 3306, 
                dbname   = "data607finalproject")
data2 <- dbGetQuery(cn, "SELECT v.MAKED
        , v.YEAR
        , v.BODYD
        , o.OCCLOCD
        , o.HIC AS HIC
        , o.LFEM AS LFL
        , o.RFEM AS RFL
  FROM data607finalproject.test t
  inner join data607finalproject.veh v
  on(v.TSTNO = t.TSTNO)
  inner join data607finalproject.occ o
  on(o.TSTNO = v.TSTNO)
  inner join data607finalproject.rest r
  on(r.TSTNO = t.TSTNO)
WHERE v.MAKED != 'NHTSA' 
    AND v.MAKED NOT IN ('MCI', 'OTHER') 
    AND  (o.LFEM != '' AND o.LFEM != 0) 
    AND (o.RFEM != '' AND o.RFEM != 0)
    AND (r.DEPLOYD = N'DEPLOYED PROPERLY')
    AND (o.HIC != '' AND o.HIC != 0)
    AND (v.YEAR != 0 and v.YEAR != '')
GROUP BY
v.MAKED
        , v.YEAR
        , v.BODYD
        , o.OCCLOCD
        , o.HIC 
        , o.LFEM 
        , o.RFEM 
ORDER BY v.MAKED ")
head(data2)
##   MAKED YEAR           BODYD          OCCLOCD HIC   LFL   RFL
## 1 ACURA 1988 FOUR DOOR SEDAN  LEFT FRONT SEAT 284 -6886 -7344
## 2 ACURA 1988 FOUR DOOR SEDAN RIGHT FRONT SEAT 387 -2931 -5494
## 3 ACURA 1992 FOUR DOOR SEDAN  LEFT FRONT SEAT 601 -9230 -8229
## 4 ACURA 1992 FOUR DOOR SEDAN  LEFT FRONT SEAT 897 -1326 -5534
## 5 ACURA 1992 FOUR DOOR SEDAN  LEFT FRONT SEAT 914 -3007 -7371
## 6 ACURA 1992 FOUR DOOR SEDAN RIGHT FRONT SEAT 433 -2140 -1401

Data Cleaning

Convert to numerics, factor and truncate

data2$HIC <- as.numeric(gsub(",","", data2$HIC))
data2$LFL <- as.numeric(gsub(",","", data2$LFL))
data2$RFL <- as.numeric(gsub(",","", data2$RFL))
data2$YEAR <- as.numeric(data2$YEAR)
data2$BODYD <- as.factor(data2$BODYD)
data2$MAKED <- as.factor(data2$MAKED)
data2$OCCLOCD <- as.factor(data2$OCCLOCD)
data2$HIC <- trunc(data2$HIC)
data2$LFL <- trunc(data2$LFL)
data2$RFL <- trunc(data2$RFL)

Data Analysis with Additional Attributes

In order to simplify our analysis we have decided to separete head injury criterion, left/right femur peak load and compare them against the attributes we thought were most relevant. We will add a column that averages the HCI, LFEM and RFEM by make, year, body type of the vehicle and occupant location in the vehicle.

Crash Test Distribution

The vehicle make that has the highest frequency of crash test observations in our data is Ford, and is then followed by Toyota, Honda, Chevrolet, Dodge, and Nissan to name a few.

attrMake <- data2 %>% group_by(MAKED) %>% summarise("Average HIC" = mean(HIC), Count = n())
ggplot(attrMake, aes(x=reorder(MAKED, -Count), y=Count)) + geom_bar(stat="identity", width = 0.5, fill = "tomato2") + labs(x = "Vehicle Make", y = "Frequency", title = "Distribution of Crash Test Observations by Vehicle Make") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

Injury by Car Make

attrMake_lfl <- data2 %>% group_by(MAKED) %>% summarise("Average LFL" = mean(LFL), Count = n())
attrMake_rfl <- data2 %>% group_by(MAKED) %>% summarise("Average RFL" = mean(RFL), Count = n())

Data Sample by Vehicle Make

head(attrMake)
## # A tibble: 6 x 3
##   MAKED     `Average HIC` Count
##   <fct>             <dbl> <int>
## 1 ACURA              384.    87
## 2 AUDI               278.    45
## 3 BMW                742.    71
## 4 BUICK              348.    78
## 5 CADILLAC           395.    87
## 6 CHEVROLET          515.   611

In the graphs below we can see from the crash tests that among the safest vehicle makes from a head and leg injury perspective we find Porsche, Tesla, Mini and Volkswagen among others.

ggplot(attrMake, aes(x=reorder(MAKED, `Average HIC`), y=`Average HIC`)) + geom_bar(stat="identity", width = 0.5, fill = viridis(49)) + labs(x = "Car Make", y = "AVG HIC", title = "Average HIC by Car Make") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

ggplot(attrMake_lfl, aes(x=reorder(MAKED, -`Average LFL`), y=`Average LFL`)) + geom_bar(stat="identity", width = 0.5, fill = viridis(49)) + labs(x = "Car Make", y = "AVG LFL", title = "Average LFL by Car Make") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

ggplot(attrMake_rfl, aes(x=reorder(MAKED, -`Average RFL`), y=`Average RFL`)) + geom_bar(stat="identity", width = 0.5, fill = viridis(49)) + labs(x = "Car Make", y = "AVG RFL", title = "Average RFL by Car Make") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

Injury by Year

attrYear <- data2 %>% group_by(YEAR) %>% summarise("Average HIC" = mean(HIC), Count = n())
attrYear_lfl <- data2 %>% group_by(YEAR) %>% summarise("Average LFL" = mean(LFL), Count = n())
attrYear_rfl <- data2 %>% group_by(YEAR) %>% summarise("Average RFL" = mean(RFL), Count = n())

Data Sample by Year

head(attrYear)
## # A tibble: 6 x 3
##    YEAR `Average HIC` Count
##   <dbl>         <dbl> <int>
## 1  1981         991       2
## 2  1982         667       2
## 3  1984          90.2     4
## 4  1986         662       2
## 5  1987         444.      4
## 6  1988         531.     15

During 1999 The National Highway Traffic Safety Administration planned for upgrading the Federal Motor Vehicle Safety Standard (FMVSS). They added new crash specifications that required the use of additional dummies of various sizes as well as additional performance criteria that appropriately represent head injury thresholds.

ggplot(attrYear, aes(x=YEAR, y=`Average HIC`)) + geom_line(color = "tomato2") + labs(x = "Year", y = "AVG HIC", title = "Average HIC by Year") + geom_point(color = "tomato2")

ggplot(attrYear_lfl, aes(x=YEAR, y=`Average LFL`)) + geom_line(color = "tomato2") + labs(x = "Year", y = "AVG LFL", title = "Average LFL by Year") + geom_point(color = "tomato2")

ggplot(attrYear_rfl, aes(x=YEAR, y=`Average RFL`)) + geom_line(color = "tomato2") + labs(x = "Year", y = "AVG RFL", title = "Average RFL by Year") + geom_point(color = "tomato2")

Injury by Vehicle Body Type

attrBody <- data2 %>% group_by(BODYD) %>% filter (BODYD != "OTHER") %>% summarise("Average HIC" = mean(HIC), Count = n())
attrBody_lfl <- data2 %>% group_by(BODYD) %>% filter (BODYD != "OTHER") %>% summarise("Average LFL" = mean(LFL), Count = n())
attrBody_rfl <- data2 %>% group_by(BODYD) %>% filter (BODYD != "OTHER") %>% summarise("Average RFL" = mean(RFL), Count = n())

Data Sample by Vehicle Body Type

head(attrBody)
## # A tibble: 6 x 3
##   BODYD               `Average HIC` Count
##   <fct>                       <dbl> <int>
## 1 4 DOOR PICKUP                407.   181
## 2 CONVERTIBLE                  976.    62
## 3 EXTENDED CAB PICKUP          489.   114
## 4 FIVE DOOR HATCHBACK          403.   221
## 5 FOUR DOOR SEDAN              467.  2551
## 6 MINIVAN                      454.   134

According to the graphs below, among the safest vehicle body types from a head and leg injury perspective we find that three door coupes, five door hatchbacks, and 4 door pickups have the lowest injury averages.

Please Note: The category “OTHER” for body type has been removed as it is not clear what type of vehicles are included

ggplot(attrBody, aes(x=reorder(BODYD, `Average HIC`), y=`Average HIC`)) + geom_bar(stat="identity", width = 0.5, fill = magma(15)) + labs(x = "Body Type", y = "AVG HIC", title = "Average HIC by Car Body Type") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

ggplot(attrBody_lfl, aes(x=reorder(BODYD, -`Average LFL`), y=`Average LFL`)) + geom_bar(stat="identity", width = 0.5, fill = magma(15)) + labs(x = "Body Type", y = "AVG LFL", title = "Average LFL by Car Body Type") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

ggplot(attrBody_rfl, aes(x=reorder(BODYD, -`Average RFL`), y=`Average RFL`)) + geom_bar(stat="identity", width = 0.5, fill = magma(15)) + labs(x = "Body Type", y = "AVG RFL", title = "Average RFL by Car Body Type") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

Injury by Occupant Location

attrOcc <- data2 %>% group_by(OCCLOCD) %>% summarise("Average HIC" = mean(HIC), Count = n())
attrOcc_lfl <- data2 %>% group_by(OCCLOCD) %>% summarise("Average LFL" = mean(LFL), Count = n())
attrOcc_rfl <- data2 %>% group_by(OCCLOCD) %>% summarise("Average RFL" = mean(RFL), Count = n())

Data Sample by Occupant Location

head(attrOcc)
## # A tibble: 6 x 3
##   OCCLOCD           `Average HIC` Count
##   <fct>                     <dbl> <int>
## 1 CENTER REAR SEAT           712.    21
## 2 CENTER THIRD SEAT          871      4
## 3 LEFT FRONT SEAT            389.  2923
## 4 LEFT REAR SEAT            1389.   147
## 5 LEFT THIRD SEAT           1045.     4
## 6 RIGHT FRONT SEAT           458.  2364

According to the graphs below, one of the safest places to sit in a vehicle in the unfortunate event of a crash is the left front seat. In other words, you are the most safe when you are the driver. The second safest place to sit is in the front passenger seat.

ggplot(attrOcc, aes(x=reorder(OCCLOCD, `Average HIC`), y=`Average HIC`)) + geom_bar(stat="identity", width = 0.5, fill = plasma(8)) + labs(x = "Occupant Location", y = "AVG HIC", title = "Average HIC by Occupant Location") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

ggplot(attrOcc_lfl, aes(x=reorder(OCCLOCD, -`Average LFL`), y=`Average LFL`)) + geom_bar(stat="identity", width = 0.5, fill = plasma(8)) + labs(x = "Occupant Location", y = "AVG LFL", title = "Average LFL by Occupant Location") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

ggplot(attrOcc_rfl, aes(x=reorder(OCCLOCD, -`Average RFL`), y=`Average RFL`)) + geom_bar(stat="identity", width = 0.5, fill = plasma(8)) + labs(x = "Occupant Location", y = "AVG RFL", title = "Average RFL by Occupant Location") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8))

Additional Analysis

Below we have another perspective for head injury criterion by occupant location looking at the data through boxplots. We can see that our boxplots agree with our analysis above, the two safest places to sit in a vehicle are in the driver seat and front passenger seat.

qplot(OCCLOCD, HIC, data = data2, geom= "boxplot", fill = OCCLOCD) + labs(x = "Occupant Location", y = "HIC", title = "Head Injury Criterion by Occupant Location") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=8)) + theme(legend.position = "none")

We can also look at our data by comparing vehicle make, body type and head injury criterion in the same graph in order to have a better visualization of our aggregate data.

Please Note: We have removed an observation (Chevrolet convertible) that was very far from the rest of the data in order to make the graph more readable. Additionally, the category “OTHER” for body type has been removed as it is not clear what type of vehicles are included

attrMakeBody <- data2 %>% group_by(MAKED, BODYD) %>% filter (MAKED != "CHEVROLET" | BODYD != "CONVERTIBLE", BODYD != "OTHER") %>% summarise("Average HIC" = mean(HIC), Count = n())
ggplot(attrMakeBody, aes(MAKED, `Average HIC`)) + geom_point(aes(color = BODYD)) + labs(x = "Vehicle Make", y = "AVG HIC", title = "Average HIC by Vehicle Make and Body Type") + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=6), legend.title = element_text(size = 8), legend.text = element_text(size = 6)) + scale_color_discrete(name = "Body Type")

Transformation

The data obtained from the more granular query has many more rows with repeating fields of data. To get a better understanding of this data, we will transform into a easier visual model.

# library(treemap)
# 
# treemap(data2,
#        index=c("MAKED", "BODYD"),
#        vSize="YEAR",
#        #vColor="GNI",
#        #type="YEAR"
#        )
library(data.tree)
## Warning: package 'data.tree' was built under R version 3.5.3
#data(GNI2014)
#(GNI2014)
data2$pathString <- paste("Vehicle_Make", 
                            data2$YEAR, 
                            data2$MAKED, 
                            data2$BODYD,
                            data2$OCCLOCD,
                            sep = "/")
data2tree <- as.Node(data2)
#print(data2tree, "HIC", "LFL", "RFL", limit = 200)
#plot(population)
#library(networkD3)
#acmeNetwork <- ToDataFrameNetwork(data2tree, "name")
#simpleNetwork(acmeNetwork[-3], fontSize = 12)
#plot(as.dendrogram(data2tree), center = TRUE)
# works: print(data2tree, "HIC", "LFL", "RFL")

Reasons for transformation: It is increasingly becoming difficult to view the results because of the number of records in the resulset. Every time another variable of interest is added it exponentially increases the output of records. The difficulty becomes in “visually” consuming the data without removing any of the results. The following data that has been transformed into a tree structure and is still too large to consume “visually”. So, we decided to ask our questions using subsets of the data as depicted below.

#SetGraphStyle(data2tree, rankdir = "LR", dpi="70")
#SetEdgeStyle(data2tree, arrowhead = "vee", color = "grey35", penwidth = "20")
#SetNodeStyle(data2tree, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow", 
#            fontname = "helvetica", tooltip = GetDefaultTooltip, fontsize = "36", height="0.75")
#SetNodeStyle(data2tree$ACURA, fillcolor = "LightBlue", penwidth = "5px")
#plot(data2tree$`2019`$ACURA, output = "graph")
#plot(as.dendrogram(data2tree), center = TRUE)
print(data2tree, "HIC", "LFL", "RFL")
##                            levelName HIC   LFL   RFL
## 1   Vehicle_Make                      NA    NA    NA
## 2    ¦--1988                          NA    NA    NA
## 3    ¦   ¦--ACURA                     NA    NA    NA
## 4    ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 5    ¦   ¦       ¦--LEFT FRONT SEAT  284 -6886 -7344
## 6    ¦   ¦       °--RIGHT FRONT SEAT 387 -2931 -5494
## 7    ¦   ¦--DODGE                     NA    NA    NA
## 8    ¦   ¦   °--THREE DOOR HATCHBACK  NA    NA    NA
## 9    ¦   ¦       ¦--LEFT FRONT SEAT  194 -3011 -6397
## 10   ¦   ¦       °--RIGHT FRONT SEAT 319 -4408 -3670
## 11   ¦   ¦--FORD                      NA    NA    NA
## 12   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 13   ¦   ¦       ¦--LEFT FRONT SEAT  456 -5089 -5378
## 14   ¦   ¦       °--RIGHT FRONT SEAT 561 -2971 -2322
## 15   ¦   ¦--OLDSMOBILE                NA    NA    NA
## 16   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 17   ¦   ¦       ¦--LEFT FRONT SEAT  709 -6010 -9399
## 18   ¦   ¦       °--RIGHT FRONT SEAT 539 -2318 -2620
## 19   ¦   °--VOLVO                     NA    NA    NA
## 20   ¦       °--FOUR DOOR SEDAN       NA    NA    NA
## 21   ¦           ¦--LEFT FRONT SEAT  519 -5084 -7976
## 22   ¦           °--RIGHT FRONT SEAT 445 -3216 -2820
## 23   ¦--1992                          NA    NA    NA
## 24   ¦   ¦--ACURA                     NA    NA    NA
## 25   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 26   ¦   ¦       ¦--LEFT FRONT SEAT  914 -3007 -7371
## 27   ¦   ¦       °--RIGHT FRONT SEAT 660 -2940 -1517
## 28   ¦   ¦--BMW                       NA    NA    NA
## 29   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 30   ¦   ¦       ¦--LEFT FRONT SEAT  705 -5418 -5196
## 31   ¦   ¦       °--RIGHT FRONT SEAT 698 -3127 -2077
## 32   ¦   ¦--CADILLAC                  NA    NA    NA
## 33   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 34   ¦   ¦       ¦--LEFT FRONT SEAT  598 -4706 -3407
## 35   ¦   ¦       °--RIGHT FRONT SEAT 900 -4809 -3180
## 36   ¦   ¦--CHEVROLET                 NA    NA    NA
## 37   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 38   ¦   ¦       °--LEFT FRONT SEAT  960 -9190 -5480
## 39   ¦   ¦--DODGE                     NA    NA    NA
## 40   ¦   ¦   °--VAN                   NA    NA    NA
## 41   ¦   ¦       ¦--LEFT FRONT SEAT  407 -2046 -6944
## 42   ¦   ¦       °--RIGHT FRONT SEAT 427 -6210 -2607
## 43   ¦   ¦--FORD                      NA    NA    NA
## 44   ¦   ¦   ¦--CONVERTIBLE           NA    NA    NA
## 45   ¦   ¦   ¦   ¦--LEFT FRONT SEAT  811 -5556 -5298
## 46   ¦   ¦   ¦   °--RIGHT FRONT SEAT 129 -1134  -574
## 47   ¦   ¦   ¦--FOUR DOOR SEDAN       NA    NA    NA
## 48   ¦   ¦   ¦   ¦--LEFT FRONT SEAT  907 -6401 -3763
## 49   ¦   ¦   ¦   °--RIGHT FRONT SEAT 331 -6067 -3759
## 50   ¦   ¦   °--VAN                   NA    NA    NA
## 51   ¦   ¦       ¦--LEFT FRONT SEAT  698 -6797 -3545
## 52   ¦   ¦       °--RIGHT FRONT SEAT 723 -4212 -1521
## 53   ¦   ¦--GEO                       NA    NA    NA
## 54   ¦   ¦   °--THREE DOOR HATCHBACK  NA    NA    NA
## 55   ¦   ¦       ¦--LEFT FRONT SEAT   75 -6121 -6005
## 56   ¦   ¦       °--RIGHT FRONT SEAT 613 -1517 -1459
## 57   ¦   ¦--HONDA                     NA    NA    NA
## 58   ¦   ¦   ¦--FOUR DOOR SEDAN       NA    NA    NA
## 59   ¦   ¦   ¦   ¦--LEFT FRONT SEAT  612 -2574 -4664
## 60   ¦   ¦   ¦   °--RIGHT FRONT SEAT 712 -3630   -63
## 61   ¦   ¦   ¦--THREE DOOR HATCHBACK  NA    NA    NA
## 62   ¦   ¦   ¦   ¦--LEFT FRONT SEAT  302 -7908 -5954
## 63   ¦   ¦   ¦   °--RIGHT FRONT SEAT 119 -1214  -819
## 64   ¦   ¦   °--TWO DOOR SEDAN        NA    NA    NA
## 65   ¦   ¦       ¦--LEFT FRONT SEAT  510 -3327 -2851
## 66   ¦   ¦       °--RIGHT FRONT SEAT 555 -1570 -2509
## 67   ¦   ¦--MITSUBISHI                NA    NA    NA
## 68   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 69   ¦   ¦       ¦--LEFT FRONT SEAT  679 -6788 -3487
## 70   ¦   ¦       °--RIGHT FRONT SEAT 472 -1993 -3692
## 71   ¦   ¦--NISSAN                    NA    NA    NA
## 72   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 73   ¦   ¦       ¦--LEFT FRONT SEAT  818 -6219 -3043
## 74   ¦   ¦       °--RIGHT FRONT SEAT 907 -4284 -4559
## 75   ¦   ¦--OLDSMOBILE                NA    NA    NA
## 76   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 77   ¦   ¦       ¦--LEFT FRONT SEAT  473 -4777 -5849
## 78   ¦   ¦       °--RIGHT FRONT SEAT 829 -5035 -5058
## 79   ¦   ¦--PLYMOUTH                  NA    NA    NA
## 80   ¦   ¦   °--VAN                   NA    NA    NA
## 81   ¦   ¦       ¦--LEFT FRONT SEAT  426 -1793 -3625
## 82   ¦   ¦       °--RIGHT FRONT SEAT 175  -231 -2202
## 83   ¦   ¦--PONTIAC                   NA    NA    NA
## 84   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 85   ¦   ¦       ¦--LEFT FRONT SEAT  360 -6122 -6291
## 86   ¦   ¦       °--RIGHT FRONT SEAT 768 -6668 -3134
## 87   ¦   ¦--SAAB                      NA    NA    NA
## 88   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 89   ¦   ¦       °--LEFT FRONT SEAT  361 -4786 -5534
## 90   ¦   ¦--TOYOTA                    NA    NA    NA
## 91   ¦   ¦   °--FOUR DOOR SEDAN       NA    NA    NA
## 92   ¦   ¦       ¦--LEFT FRONT SEAT  428 -6183 -7139
## 93   ¦   ¦       °--RIGHT FRONT SEAT 649 -1361 -1210
## 94   ¦   °--VOLVO                     NA    NA    NA
## 95   ¦       °--FOUR DOOR SEDAN       NA    NA    NA
## 96   ¦           ¦--LEFT FRONT SEAT  583 -4448 -4448
## 97   ¦           ¦--LEFT REAR SEAT   630 -4448 -4448
## 98   ¦           °--RIGHT FRONT SEAT 835 -4333  -205
## 99   ¦--1993                          NA    NA    NA
## 100  ¦   °--... 22 nodes w/ 96 sub    NA    NA    NA
## 101  °--... 34 nodes w/ 4642 sub      NA    NA    NA

What is the minimum HIC?

subset(data2, (data2$HIC >= 1 & data2$HIC <= 10) & (data2$LFL >= -10 & data2$LFL <= -1) & (data2$RFL >= -10 & data2$RFL <= -1), select=c(MAKED, YEAR, BODYD, OCCLOCD, HIC, LFL, RFL))
##       MAKED YEAR           BODYD         OCCLOCD HIC LFL RFL
## 4524  SMART 2008  TWO DOOR COUPE LEFT FRONT SEAT   8  -6  -5
## 5083 TOYOTA 2007 FOUR DOOR SEDAN LEFT FRONT SEAT   1  -6  -5
subset(data2, (data2$HIC >= 11 & data2$HIC <= 50) & (data2$LFL >= -50 & data2$LFL <= -11) & (data2$RFL >= -50 & data2$RFL <= -11), select=c(MAKED, YEAR, BODYD, OCCLOCD, HIC, LFL, RFL))
##           MAKED YEAR               BODYD         OCCLOCD HIC LFL RFL
## 167         BMW 2004               OTHER LEFT FRONT SEAT  11 -42 -26
## 169         BMW 2004               OTHER LEFT FRONT SEAT  15 -28 -47
## 173         BMW 2008     FOUR DOOR SEDAN LEFT FRONT SEAT  21 -30 -41
## 182         BMW 2009      TWO DOOR COUPE LEFT FRONT SEAT  27 -23 -15
## 336    CADILLAC 2010     FOUR DOOR SEDAN LEFT FRONT SEAT  18 -40 -46
## 781   CHEVROLET 2007     FOUR DOOR SEDAN LEFT FRONT SEAT  20 -40 -46
## 824   CHEVROLET 2010      TWO DOOR COUPE LEFT FRONT SEAT  11 -27 -31
## 836   CHEVROLET 2011     FOUR DOOR SEDAN LEFT FRONT SEAT  11 -37 -39
## 900   CHEVROLET 2014     FOUR DOOR SEDAN LEFT FRONT SEAT  12 -35 -36
## 902   CHEVROLET 2014     FOUR DOOR SEDAN LEFT FRONT SEAT  22 -17 -16
## 1406      DODGE 2007        PICKUP TRUCK LEFT FRONT SEAT  15 -28 -21
## 1413      DODGE 2008 FIVE DOOR HATCHBACK LEFT FRONT SEAT  13 -40 -33
## 1431      DODGE 2009               OTHER LEFT FRONT SEAT  14 -26 -27
## 1487       FIAT 2014               OTHER LEFT FRONT SEAT  16 -17 -22
## 1972       FORD 2005               OTHER LEFT FRONT SEAT  28 -48 -45
## 2096       FORD 2008               OTHER LEFT FRONT SEAT  17 -26 -22
## 2219       FORD 2013     FOUR DOOR SEDAN LEFT FRONT SEAT  17 -45 -40
## 2333        GMC 2008               OTHER LEFT FRONT SEAT  12 -39 -34
## 2815      HONDA 2008               OTHER LEFT FRONT SEAT  20 -43 -45
## 2833      HONDA 2010     FOUR DOOR SEDAN LEFT FRONT SEAT  28 -27 -29
## 2838      HONDA 2010               OTHER LEFT FRONT SEAT  24 -37 -37
## 2879      HONDA 2013     FOUR DOOR SEDAN LEFT FRONT SEAT  27 -43 -33
## 3071    HYUNDAI 2007               OTHER LEFT FRONT SEAT  50 -45 -41
## 3213     JAGUAR 2005     FOUR DOOR SEDAN LEFT FRONT SEAT  21 -12 -17
## 3257       JEEP 2004               OTHER LEFT FRONT SEAT  28 -23 -37
## 3410        KIA 2011               OTHER LEFT FRONT SEAT  12 -17 -12
## 3430        KIA 2014               OTHER LEFT FRONT SEAT  20 -34 -15
## 3442        KIA 2015               OTHER LEFT FRONT SEAT  19 -39 -34
## 3572    LINCOLN 2010     FOUR DOOR SEDAN LEFT FRONT SEAT  15 -11 -37
## 3643      MAZDA 2007               OTHER LEFT FRONT SEAT  41 -34 -47
## 3678      MAZDA 2014     FOUR DOOR SEDAN LEFT FRONT SEAT  19 -20 -21
## 3698      MAZDA 2015     FOUR DOOR SEDAN LEFT FRONT SEAT  26 -37 -41
## 4113     NISSAN 2005       4 DOOR PICKUP LEFT FRONT SEAT  14 -36 -45
## 4162     NISSAN 2008               OTHER LEFT FRONT SEAT  30 -37 -40
## 4176     NISSAN 2009     FOUR DOOR SEDAN LEFT FRONT SEAT  27 -30 -38
## 4180     NISSAN 2009               OTHER LEFT FRONT SEAT  17 -34 -23
## 4188     NISSAN 2010               OTHER LEFT FRONT SEAT  17 -15 -42
## 4203     NISSAN 2011               OTHER LEFT FRONT SEAT  34 -36 -34
## 4232     NISSAN 2014     FOUR DOOR SEDAN LEFT FRONT SEAT  16 -21 -15
## 4233     NISSAN 2014     FOUR DOOR SEDAN LEFT FRONT SEAT  19 -36 -34
## 4235     NISSAN 2014     FOUR DOOR SEDAN LEFT FRONT SEAT  34 -16 -11
## 4236     NISSAN 2014     FOUR DOOR SEDAN LEFT FRONT SEAT  34 -45 -47
## 4413    PORSCHE 2006      TWO DOOR COUPE LEFT FRONT SEAT  15 -37 -45
## 4451       SAAB 2006     FOUR DOOR SEDAN LEFT FRONT SEAT  25 -13 -27
## 4513     SATURN 2008               OTHER LEFT FRONT SEAT  24 -26 -18
## 4704     SUZUKI 2010     FOUR DOOR SEDAN LEFT FRONT SEAT  22 -46 -40
## 4989     TOYOTA 2004     FOUR DOOR SEDAN LEFT FRONT SEAT  31 -26 -34
## 5085     TOYOTA 2007     FOUR DOOR SEDAN LEFT FRONT SEAT  20 -50 -44
## 5086     TOYOTA 2007     FOUR DOOR SEDAN LEFT FRONT SEAT  24 -37 -42
## 5105     TOYOTA 2008               OTHER LEFT FRONT SEAT  16 -40 -41
## 5133     TOYOTA 2009               OTHER LEFT FRONT SEAT  11 -30 -40
## 5135     TOYOTA 2009               OTHER LEFT FRONT SEAT  49 -50 -47
## 5160     TOYOTA 2010        PICKUP TRUCK LEFT FRONT SEAT  12 -24 -15
## 5176     TOYOTA 2011     FOUR DOOR SEDAN LEFT FRONT SEAT  13 -46 -48
## 5181     TOYOTA 2011     FOUR DOOR SEDAN LEFT FRONT SEAT  28 -35 -27
## 5182     TOYOTA 2011     FOUR DOOR SEDAN LEFT FRONT SEAT  32 -12 -13
## 5282     TOYOTA 2015     FOUR DOOR SEDAN LEFT FRONT SEAT  12 -36 -35
## 5390 VOLKSWAGEN 2007     FOUR DOOR SEDAN LEFT FRONT SEAT  20 -36 -34
## 5392 VOLKSWAGEN 2007     FOUR DOOR SEDAN LEFT FRONT SEAT  28 -19 -39
## 5506      VOLVO 2007               OTHER LEFT FRONT SEAT  47 -12 -18
## 5514      VOLVO 2011     FOUR DOOR SEDAN LEFT FRONT SEAT  46 -45 -39
subset(data2, (data2$HIC >= 51 & data2$HIC <= 100) & (data2$LFL >= -100 & data2$LFL <= -51) & (data2$RFL >= -100 & data2$RFL <= -51), select=c(MAKED, YEAR, BODYD, OCCLOCD, HIC, LFL, RFL))
##           MAKED YEAR           BODYD          OCCLOCD HIC LFL RFL
## 2019       FORD 2006 FOUR DOOR SEDAN   LEFT REAR SEAT  74 -99 -94
## 2031       FORD 2006 FOUR DOOR SEDAN RIGHT FRONT SEAT  65 -76 -52
## 2825      HONDA 2009           OTHER  LEFT FRONT SEAT  92 -57 -54
## 2853      HONDA 2011           OTHER  LEFT FRONT SEAT  54 -82 -79
## 3345        KIA 2005           OTHER  LEFT FRONT SEAT  53 -98 -79
## 4158     NISSAN 2008 FOUR DOOR SEDAN  LEFT FRONT SEAT  61 -75 -86
## 5406 VOLKSWAGEN 2009           OTHER  LEFT FRONT SEAT  79 -61 -54

Conclusion

In the first output, we select a couple of vehicles with the least amount of injury to the head and legs. In addition, two additional ranges of vehicles are provided, where minimal head and leg injuries were reported. These display results with vehicle year, vehicle body type, and occupant location in the vehicle. The occupant location, is the test dummy used to measure different crash impact forces.

Difficulty Encountered: As mentioned above, visualizing the large number of rows is difficult with any ease. This presented challenges in displaying the results so a consumer can easily find their vehicle of choice. So, we decided to present the results in subsets of the overall results.

References:

https://www.intmath.com/applications-integration/hic-part2.php

https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/vehdb-v4.pdf

https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/rev_criteria_0.pdf

https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/rev_criteria_0.pdf

https://en.wikipedia.org > wiki > Head_injury_criterion

https://www.safercar.gov/Vehicle-Shoppers