BANA7038 - Data Analysis Methods

Authors: Rosetta Jacob, Michelle Heywood, Xinyi Gu, Parul Ranjan

December 4, 2021

____________________________________________________________________________________________________________________________

Fatality Crash Prediction

Introduction

The Focus

We collected data from the U. S. Department of Transportation, National Highway Traffic Safety Administration regarding different attributes of a traffic accident. In order for us to to explore accident injuries vs. property-only accidents in general, we will try to answer questions related to crash severity, such as what factors are closely associated with a severe crash and how we can better predict serious impacts. With these answers and information, the car manufacturers and design teams would be better informed with the following aspects:

To better understand the attributes affecting a severe fatal crash, we will explore:

  • What does a severe crash look like in a traffic accident (e.g., injury rate, injury distribution, etc.)?
  • What factors are more closely related to a severe traffic crash (e.g., attributes of the drivers, seating positions, the timing of the impact, environmental characteristics, etc.)?
  • How to better predict severity based on other known factors?
  • What actions will minimize potential severe crashes?

Analytical Approach

Several diagnostics tests of the model were performed, such as parameter evaluation, overall model evaluation, and prediction accuracy. Our approach consists of:

1. Data Prep: Source data, cleaning the dataset (e.g., check for variable names, types, missing values, and duplicates; fix any problems to make the dataset tidy); include data definitions

2. Exploratory Data Analysis: Perform basic summary statistics on the dataset and visualize distributions (x-y plots, scatter plots, etc.), understand relationships among variables

3. Model Selection and Interpretation: Conduct regressional analysis to build a predicting model for fatality crashes (hypotheses, parameter estimates, interpretation)

4. Conclusions: Insights gained, implications, and limitations for the model.


Packages Required

Below are the following required packages used for this analysis:

  • tidyverse : used for installation of the packages.
  • scales : used to scale functions for visualization
  • maps : for geographical data
  • DT : to create functional tables in HTML
  • knitr : for dynamic report generation
  • rmarkdown : for RMarkdown documents into a variety of formats
  • ggthemes : to implement theme across report
  • plotly : for dynamic plotting
  • plot3D : for multi-dimensional data
  • kableExtra : for styling tables
  • tinytex :for making tables
  • ggalluvial :to visualize frequency tables of categorical variables
  • ggbump : to bump chart to plot ranking when the path between two
  • lubridate : for organizing variables with dates and times.
  • rlang : for functions for base types and core R
  • fastDummies: for creation of dummy (binary) columns
  • rpart : for recursive partitioning and regression trees
  • rpart.plot : for our regression tree model
  • equatiomatic: to transform models into ‘LaTex’ Equations
  • patchwork : to layout plots for easy viewing
  • magrittr : operators which improve code
  • glue : to format and interpolate a string
  • readxl : to read excel files
  • pROC : for visualizing comparing receiver operating characteristic (ROC curves)

____________________________________________________________________________________________________________________________

Source & Cleaning

Data Source

The data’s origin came from the U.S. Department of Transportation and National Highway Traffic Safety Administration’s open portal. The associated dashboard can be found here: https://cdan.dot.gov/query.

We realized that the data is updated monthly. However, we used pre-pandemic data. We felt 2019 was a good representation under normal conditions without the unusual influence of Covid-19 lockdowns.


Import and Clean

library(readr)
ACC_AUX <- read_csv("ACC_AUX.csv")

library(readr)
national_totals <- read_csv("national_statistics_2019.csv")

Check Duplicates

Duplicate rows can be problematic and cause issues for data analysis and interpretation. No duplicate rows were identified. The code and output below demonstrate this step.

# check the total number of duplicates in the dataset
sum(duplicated(ACC_AUX))
## [1] 0
# missing data in general
sum(is.na(ACC_AUX))
## [1] 0

Tidy Data

We removed unnecessary variables and renamed some variables to tidy the data set. To save space, we did not list cleaning code. After cleaning, our data set consists of 54,409 observations and 24 variables.

Data Dictionary

Variable Type Description
CASENUM Key Unique accident case ID
severity_ code numeric Injury severity of crash
severity character Injury severity of crash
pileup_code numeric Number of vehicles involved in crash
pileup_total character Number of vehicles involved in crash
young_driver_code numeric Young driver 15-20 years old
young_driver character Young driver 15-20 years old
plus_65_code numeric Older driver 65+ years old
plus_65_driver character Older driver 65+ years old
day_type_code numeric Day of the week the crash occurred
day_type character Day of the week the crash occurred
interstate_code numeric Crash occurred on an interstate
interstate character Crash occurred on an interstate
intersection_code numeric Crash in an intersection
intersection character Crash occurred in an intersection
rollover_code Numeric Rollover involvement in crash
rollover character Rollover involvement in crash
speeding_code Numeric Speeding involvement in crash
speeding character Speeding involvement in crash
lighting_code Numeric Lighting at time of crash
lighting character Lighting at time of crash
is_fatal character Lighting at time of crash

Our variable list is quite long. To better understand the variables, we took theACC_AUX dataset and grouped them into three category datasets plus our outcome variable is_fatal:

# Subset variables into Crash Attributes
crash_attributes <- ACC_AUX %>%
  select(is_fatal, severity, severity_code, pileup_code, pileup_total, day_type_code,
         day_type, collision_code, collision_manner, rollover_code, rollover) %>%
  glimpse()

# Subset variables into Driver Attributes
driver_attributes <- ACC_AUX %>%
  select(is_fatal, young_driver_code, young_driver, plus_65_code, plus_65_driver, speeding_code, speeding) %>%
  glimpse()

# Subset variables into Environment Attributes
environment_attributes <- ACC_AUX %>%
  select(is_fatal,interstate_code, interstate, intersection_code, intersection, lighting_code, lighting) %>%
  glimpse()
  1. Crash Attributes

CASENUM, severity_code, severity, pileup_code, pileup_total, day_type_code, day_type, collision_code, collision_manner, rollover_code, and rollover

  1. Driver Attributes

young_driver_code, young_driver, plus_65_code, plus_65_driver, speeding_code, and speeding

  1. Environmental Attributes

interstate_code, interstate, intersection_code, intersection, lighting_code, and lighting

  1. Outcome Variable

is_fatal


Citation:
National Center for Statistics and Analysis. (2020, December). Crash Report Sampling System CRSS Analytical User’s Manual 2016-2019 (Report No. DOT HS 813 023). National Highway Traffic Safety Administration.

____________________________________________________________________________________________________________________________

Data Characteristics

Summary Statistics

We performed basic summary statistics on the dataset and visualize distributions to understand relationships among variables.

National Totals

Our dataset is a sampling of all police reported crashes in the year 2019. We wanted to calculate the rates of severity per 100,000 drivers. The data frame national_totals contains totals and rates presented here:

Description Total Percent of Total Rate per 100,000 Drivers
Fatal Crashes 33,244 0.00492 16 fatalities per 100k
Injury Crashes 1,916,000 0.284 1198 Injuries per 100k
Property-Damage Only 4,806,000 0.711

Glimpse of Sample Data

We took a snapshot of our sample dataset. It contains 54,409 observations and 24 variables

#Glimpse is used to summarize the variables
glimpse(ACC_AUX)
## Rows: 54,409
## Columns: 24
## $ CASENUM           <dbl> 201901174219, 201901176655, 201901176667, 2019011766~
## $ severity_code     <dbl> 3, 2, 2, 3, 2, 2, 1, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 2~
## $ severity          <chr> "Property_Only", "Servere_Crash", "Servere_Crash", "~
## $ pileup_code       <dbl> 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1~
## $ pileup_total      <chr> "single_vehicle", "two_vehicle", "two_vehicle", "sin~
## $ young_driver_code <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
## $ young_driver      <chr> "not_young_driver", "driver15_20", "not_young_driver~
## $ plus_65_code      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0~
## $ plus_65_driver    <chr> "not_older_driver", "not_older_driver", "not_older_d~
## $ day_type_code     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day_type          <chr> "Weekday", "Weekday", "Weekday", "Weekday", "Weekday~
## $ interstate_code   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ interstate        <chr> "Non-Interstate", "Non-Interstate", "Non-Interstate"~
## $ intersection_code <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1~
## $ intersection      <chr> "Non-Intersection", "Intersection", "Non-Intersectio~
## $ collision_code    <dbl> 0, 3, 3, 0, 3, 3, 0, 3, 1, 1, 3, 0, 1, 3, 0, 3, 4, 1~
## $ collision_manner  <chr> "Other", "Angle", "Angle", "Other", "Angle", "Angle"~
## $ rollover_code     <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
## $ rollover          <chr> "rollover", "not_rollover", "not_rollover", "not_rol~
## $ speeding_code     <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ speeding          <chr> "speeding", "not_speeding", "not_speeding", "not_spe~
## $ lighting_code     <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1~
## $ lighting          <chr> "Daytime", "Nighttime", "Nighttime", "Nighttime", "D~
## $ is_fatal          <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

Grouping Data

The purpose of EDA is to have a better idea about the interelationships among predicting variables and whether a crash is fatal or not. To achieve this, data-subsetting, dataset-joinning, and composing new variables were done in this step.

To see whether certain patterns would emerge, the dataset was subsetted for fatal only crashes. Barplots were also created for the subset fatal_crashes to see if there are any observable patterns in the fatality data. The barplots (not shown here due to large amount) revealed that all variables seemed to have followed the same pattern in the overall dataset and the fatal only dataset. In other words, no specific patterns emerged at this step.

Fatal Crashes Only:

# Create a subset with fatal crashes only
fatal_crashes <- ACC_AUX %>% 
  filter(severity_code == 1)

fatal_crashes %>%
  summarise(fatal_rate = sum(is_fatal == 1)/54409 * 100)

Values by Attribute Grouping:

Crash Attributes

Attribute Type Values
Vehicle Pileup Polychotomous 0 - single vehicle
1 - two vehicles
2 - three or more vehicles
Day of the Week Dichotomous 0 - weekend
1 - weekday
Rollover Involved Dichotomous 0 - No rollover involved
1 - Yes rollover involved
Collision Manner Polychotomous 0 - Other
1 - Rear-End
2 - Head-On
3 - Angle
4 - Sideswipe

Driver Attributes

Attribute Type Values
Young Drivers Dichotomous 0 - No young drivers
1 - Yes young drivers
65+ Old Drivers Dichotomous 0 - No 65+ year old drivers
1 - Yes 65+ year old drivers
Speeding Involved Dichotomous 0 - No speeding involved
1 - Yes speeding involved

Environmental Attributes

Attribute Type Values
Interstate Crash Dichotomous 0 - No interstate crash
1 - Yes interstate crash
Intersection Crash Dichotomous 0 - No intersection involved
1 - Yes intersection involved
Lighting Conditions Dichotomous 0 - nighttime
1 - daytime

Sample Barplots

Since we have categorical variables, barplots are created to visualize the data. However, due to the large size, we are only showing a couple barplots here.

R function barplot(table(data$variable)) was used to visualize frequencies for several selected categorical variables in our main sample file, ACC_AUX. Due to the large variable numbers, the output figures are not shown.

collision_releveled <- fatal_crashes %>% 
  mutate(collision_manner = fct_relevel(collision_manner, "Angle",
                                          "Head-On",
                                          "Rear-End",
                                    "Sideswipe",
                                   "Other"))
barplot(table(collision_releveled$collision_manner), xlab = "Collision Manner")

barplot(table(fatal_crashes$plus_65_driver), xlab = "65+ Driver")

*** ____________________________________________________________________________________________________________________________

Visualization

Visualize Relationships

The following is a visualization of the relationships between variables:

Crash Attributes:

Driver Attributes:

Environmental Attributes:

***

____________________________________________________________________________________________________________________________

Testing Variables

Test Variables

Due to the fact that the outcome variable, is_fatal and most of the variables in the dataset are categorical in nature, chi-squre test was used as the primary method to test variable independency. In other words, test statistics will be used to determine whether a predicting variable is correlated with the outcome variables.

If the value of the test statistic is more extreme than the statistic calculated from the null hypothesis, then the variables has a statistically significant relationship between the predictor and outcome variables.

If the value of the test statistic is less extreme than the one calculated from the null hypothesis, then the variables have no statistically significant relationship between the predictor and outcome variables.

\(H_0:\) The two variables are independent.

\(H_1:\) The two variables relate to each other.

We will use a 95% confidence level and reject the null hypothesis if the p-value is less than our predetermined significance level of 0.05.

Crash Attributes:

## Independency test for is_fatal with other variables

## Whether a fatal crash and a pileup in vehicles relates
chisq.test(crash_attributes$is_fatal, crash_attributes$pileup_total, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  crash_attributes$is_fatal and crash_attributes$pileup_total
## X-squared = 346.83, df = 2, p-value < 2.2e-16
## Whether a crash happens on a weekday or weekend
chisq.test(crash_attributes$is_fatal, crash_attributes$day_type, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  crash_attributes$is_fatal and crash_attributes$day_type
## X-squared = 70.186, df = 1, p-value < 2.2e-16
## Whether a fatality happens with a rollover crash
chisq.test(crash_attributes$is_fatal, crash_attributes$rollover, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  crash_attributes$is_fatal and crash_attributes$rollover
## X-squared = 720.95, df = 1, p-value < 2.2e-16
## The manner in which a collision occurs and a fatality crash
chisq.test(crash_attributes$is_fatal, crash_attributes$collision_manner, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  crash_attributes$is_fatal and crash_attributes$collision_manner
## X-squared = 645.05, df = 4, p-value < 2.2e-16
## VISUALIZE ## ## The manner in which a collision occurs

Crash Attributes

  • The chi-square test statistics showed that all crash attributes examined; vehicles pileup, Day Type, Rollover, and Collision Manner, were significantly associated with the a fatal crash.

  • Triangluating with the figures, variable rollover was selected as the most statistically significant relationship between the predictor and the variable within the group because it had the largest x-squared value at 720.95.

Driver Attributes:

## Independency test for is_fatal with other variables
## Whether a fatal crash has a young driver
chisq.test(driver_attributes$is_fatal, driver_attributes$young_driver, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  driver_attributes$is_fatal and driver_attributes$young_driver
## X-squared = 15.222, df = 1, p-value = 9.56e-05
## Whether a fatal crash has a 65+ years old driver
chisq.test(driver_attributes$is_fatal, driver_attributes$plus_65_driver, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  driver_attributes$is_fatal and driver_attributes$plus_65_driver
## X-squared = 2.9382, df = 1, p-value = 0.08651
## Whether speeding was involved in the crash
chisq.test(driver_attributes$is_fatal, driver_attributes$speeding, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  driver_attributes$is_fatal and driver_attributes$speeding
## X-squared = 65.698, df = 1, p-value = 5.256e-16

Driver Attributes

  • The chi-square test statistics showed that two(2) of the three(3) driver attributes Young_Driver and Speeding were significantly associated with the a fatal crash.

  • Interestingly, drivers 65+ years old is not significantly associated with the a fatal crash with a p-value of 0.08651.

  • Comparing test results, variable speeding was selected as the most statistically significant relationship between the predictor and the variable within the group because it had the largest x-squared value at 65.698.

Environmental Attributes:

## Independency test for is_fatal with other variables
## Whether a fatal crash occurred on the interstate
chisq.test(environment_attributes$is_fatal, environment_attributes$interstate, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  environment_attributes$is_fatal and environment_attributes$interstate
## X-squared = 17.561, df = 1, p-value = 2.783e-05
## Whether a fatal crash occurred in an intersection
chisq.test(environment_attributes$is_fatal, environment_attributes$intersection, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  environment_attributes$is_fatal and environment_attributes$intersection
## X-squared = 183.7, df = 1, p-value < 2.2e-16
## Whether a fatal crash occurred in night time or daytime
chisq.test(environment_attributes$is_fatal, environment_attributes$lighting, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  environment_attributes$is_fatal and environment_attributes$lighting
## X-squared = 270.56, df = 1, p-value < 2.2e-16

Environmental Attributes

  • The chi-square test statistics showed that all environmental attributes examined were significantly associated with the a fatal crash.

  • Comparing test results, variable Lighting Conditions was selected as the most statistically significant relationship between the predictor and the variable within the group because it had the largest x-squared value at 270.56.


____________________________________________________________________________________________________________________________ ***

Prediction Models

Model Selection

The EDA results showed that a lot of the variables were correlated with fatal crashes. Therefore, in this step, a prediction model for is_fatal was tested and fine-tuned. We selected logistic regression models for the creation of this predictive model.

\[ \begin{align} \text{Null : }\ H_0 &= 0 \\ \text{Alternative : }\ H_1 &\neq 0 \end{align} \]

We want to predict if you are likely to have a fatal crash given the following variables:

The null hypothesis states all coefficients in the model are equal to zero. In other words, there is no statistically significant relationship between the predictor variables, x, which means rollover, speeding, and night day, and the response variable, y - whether in a crash will lead to a fatality.

The alternative hypothesis states that not every coefficient is simultaneously equal to zero.In other words, there is statistically significant relationship between the predictor variables, x, which means rollover, speeding, and night day, and the response variable, y - whether in a single crash will lead to fatal.

  • First Step, change variable types to factors. We are first making our 5 categorical variables into a factor as it seems reasonable that a categorical variable should be made into a factor rather than left as an integer. As for example, a day of week can only be value 1 in one factor and, either a day of week is a weekday or weekend. The value is only 0 or 1, either a day of weekday is a weekday, or it is not a weekday, it cannot be 0.5 weekday.

  • Second Step, split data frame into training and testing sets using a 70/30 split. It seems quite intuitive to split data into a training portion and a test portion, so the model can be trained on the first and then tested with the testing data, but it is a good idea to split the data in a way, so that the model can be trained on a larger portion to adapt to more possible data constellations.

  • Third Step we start building a logistic regression model with dependent variable as is_fatal and independent variables.

Model 1


Build Model 1 logistic regression model:

\[ \log\left[ \frac { \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} } \right] = -4.16 + 0.47(\operatorname{pileup\_total}_{\operatorname{single\_vehicle}}) - 0.7(\operatorname{pileup\_total}_{\operatorname{two\_vehicle}}) + 0.21(\operatorname{young\_driver}_{\operatorname{not\_}}) + 0.31(\operatorname{day\_type}_{\operatorname{Weekend}}) \]

#Build a logistic regression model
mod1 <- glm(is_fatal ~ pileup_total + young_driver + day_type, family = binomial,data=acc_train)  
summary(mod1)
## 
## Call:
## glm(formula = is_fatal ~ pileup_total + young_driver + day_type, 
##     family = binomial, data = acc_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2858  -0.2458  -0.1448  -0.1377   3.1200  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -4.1577     0.1667 -24.934  < 2e-16 ***
## pileup_totalsingle_vehicle     0.4676     0.1480   3.160 0.001578 ** 
## pileup_totaltwo_vehicle       -0.7020     0.1527  -4.598 4.26e-06 ***
## young_drivernot_young_driver   0.2058     0.1168   1.762 0.078152 .  
## day_typeWeekend                0.3064     0.0811   3.777 0.000159 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6790.4  on 38085  degrees of freedom
## Residual deviance: 6552.9  on 38081  degrees of freedom
## AIC: 6562.9
## 
## Number of Fisher Scoring iterations: 7
# Every variable in the logistic regression is statistically significant becuase p-value is less than 0.05.

#Use model 1 to predict probabilities on the testing data
testing_mod1 <- predict(mod1, acc_test, type = "response")

#According to the rule, if a the predicted probability of Y equaling 1 is greater than 0.5 for an observation, we will predict that the observation has a Y value of 1, we will predict that any fatal crash with a predicted response probability greater than 0.5 is predicted as a responder.
F_hat_mod1 <- as.numeric(testing_mod1 > 0.50)
table(acc_test$is_fatal, F_hat_mod1, dnn = c("Actual", "Predicted"))
##       Predicted
## Actual     0
##      0 16020
##      1   303
#Accuracy of model 1
mean(acc_test$is_fatal == F_hat_mod1)
## [1] 0.9814372
#Accuracy is 98%,very high accuaracy.

#The area under the curve(AUC) is the measure that represents ROC(Receiver Operating Characteristic) curve.An AUC value of greater than .70 indicates a good model.

roc <- roc(acc_train$is_fatal, mod1$fitted.values)
auc(roc)
## Area under the curve: 0.6651
# We can see AUC value here is around 66%. Model 1 is not a very good model.

Model 1 uses pileup_total, young_driver, and day_type. We find out that the model has accuracy of 98% but area under the curve is less that 70, so we reject this model 1.


Model 2


Model 2

Build Model 2 logistic regression model:

\[ \log\left[ \frac { \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} } \right] = -4.1 + 0.52(\operatorname{pileup\_total}_{\operatorname{single\_vehicle}}) - 0.63(\operatorname{pileup\_total}_{\operatorname{two\_vehicle}}) + 0.51(\operatorname{speeding}) + 0.29(\operatorname{day\_type}_{\operatorname{Weekend}}) \]

#Build a logistic regression model
mod2 <- glm(is_fatal ~ pileup_total + speeding + day_type, family = binomial,data=acc_train)  
summary(mod2)
## 
## Call:
## glm(formula = is_fatal ~ pileup_total + speeding + day_type, 
##     family = binomial, data = acc_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3479  -0.2344  -0.1530  -0.1323   3.0797  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -4.10258    0.14156 -28.981  < 2e-16 ***
## pileup_totalsingle_vehicle  0.52180    0.14749   3.538 0.000403 ***
## pileup_totaltwo_vehicle    -0.63105    0.15320  -4.119 3.80e-05 ***
## speedingspeeding            0.51474    0.10467   4.918 8.75e-07 ***
## day_typeWeekend             0.29159    0.08117   3.592 0.000328 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6790.4  on 38085  degrees of freedom
## Residual deviance: 6534.4  on 38081  degrees of freedom
## AIC: 6544.4
## 
## Number of Fisher Scoring iterations: 7
# Every variable in the logistic regression is statistically significant because p-value is less than 0.05.

#Use model 2 to predict probabilities on the testing data
testing_mod2 <- predict(mod2, acc_test, type = "response")

#According to the rule, if a the predicted probability of Y equaling 1 is greater 
#than 0.5 for an observation, we will predict that the observation has a Y value of 1, 
#we will predict that any fatal crash with a predicted response probability greater than 0.5 is predicted as a res-ponder.
F_hat_mod2 <- as.numeric(testing_mod2 > 0.50)
table(acc_test$is_fatal, F_hat_mod2, dnn = c("Actual", "Predicted"))
##       Predicted
## Actual     0
##      0 16020
##      1   303
#Accuracy of model 2
mean(acc_test$is_fatal == F_hat_mod2)
## [1] 0.9814372
#Accuracy is 98%,very high accuracy.

#The area under the curve(AUC) is the measure that represents ROC(Receiver Operating Characteristic) curve.An AUC value of greater than .6706 indicates a good model.

roc <- roc(acc_train$is_fatal, mod2$fitted.values)
auc(roc)
## Area under the curve: 0.6706

Model 2 uses pileup_total, speeding, and day_type. We find out that the model has accuracy of 98% but area under the curve is less that 70 (0.6706), so we reject this model 2.

Model 3


Model 3

Build Model 3 logistic regression model:

\[ \log\left[ \frac { \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} } \right] = -4.61 + 1.71(\operatorname{rollover}) + 0.4(\operatorname{speeding}) + 0.88(\operatorname{lighting}_{\operatorname{Nighttime}}) \]

#Build a logistic regression model
mod3 <- glm(is_fatal ~ rollover + speeding + lighting, family = binomial,data=acc_train)  
summary(mod3)
## 
## Call:
## glm(formula = is_fatal ~ rollover + speeding + lighting, family = binomial, 
##     data = acc_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6022  -0.2181  -0.1408  -0.1408   3.0395  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.60945    0.06064 -76.009  < 2e-16 ***
## rolloverrollover   1.70873    0.10111  16.900  < 2e-16 ***
## speedingspeeding   0.40253    0.10725   3.753 0.000175 ***
## lightingNighttime  0.88287    0.07887  11.194  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6790.4  on 38085  degrees of freedom
## Residual deviance: 6388.8  on 38082  degrees of freedom
## AIC: 6396.8
## 
## Number of Fisher Scoring iterations: 7

Since this p-value is less than .05, we reject the null hypothesis. In other words, there is a statistically significant relationship between the rollover, speeding, lighting and whether a single crash will lead to a fatality.

Use model 3 to predict probabilities and accuracy on the testing data:

According to the rule, if a the predicted probability of Y equaling 1 is greater than 0.5 for an observation, we will predict that the observation has a Y value of 1.

We can see that accuracy is 98%, it’s very high. The next step is tto calculate the area under the curve which is AUC to test our model again.

The area under the curve(AUC) is the measure that represents ROC(Receiver Operating Characteristic) curve. An AUC value of greater than .70 indicates a good model.

We can see AUC value here is 0.6983, around 70%. Like we mentioned before, the AUC value of our model is around 0.7, so Model 3 is a good model.

#Use model 3 to predict probabilities on the testing data
testing_mod3 <- predict(mod3, acc_test, type = "response")

F_hat_mod3 <- as.numeric(testing_mod3 > 0.50)
table(acc_test$is_fatal, F_hat_mod3, dnn = c("Actual", "Predicted"))
##       Predicted
## Actual     0
##      0 16020
##      1   303
#Accuracy of model 3
mean(acc_test$is_fatal == F_hat_mod3)
## [1] 0.9814372
roc <- roc(acc_train$is_fatal, mod3$fitted.values)
auc(roc)
## Area under the curve: 0.6983

Conclusion

  • The logistic regression model # 3 showed the most statistically significant relationship between the dependent variable (is_fatal) and independent variables because the AUC is closest to 70% at 0.6983.

The final dependent variables selected for a further predictive analysis are Rollover, Speeding, and Lighting Conditions.