The Focus
We collected data from the U. S. Department of Transportation, National Highway Traffic Safety Administration regarding different attributes of a traffic accident. In order for us to to explore accident injuries vs. property-only accidents in general, we will try to answer questions related to crash severity, such as what factors are closely associated with a severe crash and how we can better predict serious impacts. With these answers and information, the car manufacturers and design teams would be better informed with the following aspects:
To better understand the attributes affecting a severe fatal crash, we will explore:
Analytical Approach
Several diagnostics tests of the model were performed, such as parameter evaluation, overall model evaluation, and prediction accuracy. Our approach consists of:
1. Data Prep: Source data, cleaning the dataset (e.g., check for variable names, types, missing values, and duplicates; fix any problems to make the dataset tidy); include data definitions
2. Exploratory Data Analysis: Perform basic summary statistics on the dataset and visualize distributions (x-y plots, scatter plots, etc.), understand relationships among variables
3. Model Selection and Interpretation: Conduct regressional analysis to build a predicting model for fatality crashes (hypotheses, parameter estimates, interpretation)
4. Conclusions: Insights gained, implications, and limitations for the model.
Packages Required
Below are the following required packages used for this analysis:
____________________________________________________________________________________________________________________________
Data Source
The data’s origin came from the U.S. Department of Transportation and National Highway Traffic Safety Administration’s open portal. The associated dashboard can be found here: https://cdan.dot.gov/query.
We realized that the data is updated monthly. However, we used pre-pandemic data. We felt 2019 was a good representation under normal conditions without the unusual influence of Covid-19 lockdowns.
Import and Clean
library(readr)
ACC_AUX <- read_csv("ACC_AUX.csv")
library(readr)
national_totals <- read_csv("national_statistics_2019.csv")
Check Duplicates
Duplicate rows can be problematic and cause issues for data analysis and interpretation. No duplicate rows were identified. The code and output below demonstrate this step.
# check the total number of duplicates in the dataset
sum(duplicated(ACC_AUX))
## [1] 0
# missing data in general
sum(is.na(ACC_AUX))
## [1] 0
Tidy Data
We removed unnecessary variables and renamed some variables to tidy the data set. To save space, we did not list cleaning code. After cleaning, our data set consists of 54,409 observations and 24 variables.
Data Dictionary
| Variable | Type | Description |
|---|---|---|
CASENUM |
Key | Unique accident case ID |
severity_ code |
numeric | Injury severity of crash |
severity |
character | Injury severity of crash |
pileup_code |
numeric | Number of vehicles involved in crash |
pileup_total |
character | Number of vehicles involved in crash |
young_driver_code |
numeric | Young driver 15-20 years old |
young_driver |
character | Young driver 15-20 years old |
plus_65_code |
numeric | Older driver 65+ years old |
plus_65_driver |
character | Older driver 65+ years old |
day_type_code |
numeric | Day of the week the crash occurred |
day_type |
character | Day of the week the crash occurred |
interstate_code |
numeric | Crash occurred on an interstate |
interstate |
character | Crash occurred on an interstate |
intersection_code |
numeric | Crash in an intersection |
intersection |
character | Crash occurred in an intersection |
rollover_code |
Numeric | Rollover involvement in crash |
rollover |
character | Rollover involvement in crash |
speeding_code |
Numeric | Speeding involvement in crash |
speeding |
character | Speeding involvement in crash |
lighting_code |
Numeric | Lighting at time of crash |
lighting |
character | Lighting at time of crash |
is_fatal |
character | Lighting at time of crash |
Our variable list is quite long. To better understand the variables, we took theACC_AUX dataset and grouped them into three category datasets plus our outcome variable is_fatal:
# Subset variables into Crash Attributes
crash_attributes <- ACC_AUX %>%
select(is_fatal, severity, severity_code, pileup_code, pileup_total, day_type_code,
day_type, collision_code, collision_manner, rollover_code, rollover) %>%
glimpse()
# Subset variables into Driver Attributes
driver_attributes <- ACC_AUX %>%
select(is_fatal, young_driver_code, young_driver, plus_65_code, plus_65_driver, speeding_code, speeding) %>%
glimpse()
# Subset variables into Environment Attributes
environment_attributes <- ACC_AUX %>%
select(is_fatal,interstate_code, interstate, intersection_code, intersection, lighting_code, lighting) %>%
glimpse()
CASENUM, severity_code, severity, pileup_code, pileup_total, day_type_code, day_type, collision_code, collision_manner, rollover_code, and rollover
young_driver_code, young_driver, plus_65_code, plus_65_driver, speeding_code, and speeding
interstate_code, interstate, intersection_code, intersection, lighting_code, and lighting
is_fatal
Citation:
National Center for Statistics and Analysis. (2020, December). Crash Report Sampling System CRSS Analytical User’s Manual 2016-2019 (Report No. DOT HS 813 023). National Highway Traffic Safety Administration.
____________________________________________________________________________________________________________________________
Summary Statistics
We performed basic summary statistics on the dataset and visualize distributions to understand relationships among variables.
National Totals
Our dataset is a sampling of all police reported crashes in the year 2019. We wanted to calculate the rates of severity per 100,000 drivers. The data frame national_totals contains totals and rates presented here:
| Description | Total | Percent of Total | Rate per 100,000 Drivers |
|---|---|---|---|
| Fatal Crashes | 33,244 | 0.00492 | 16 fatalities per 100k |
| Injury Crashes | 1,916,000 | 0.284 | 1198 Injuries per 100k |
| Property-Damage Only | 4,806,000 | 0.711 |
Glimpse of Sample Data
We took a snapshot of our sample dataset. It contains 54,409 observations and 24 variables
#Glimpse is used to summarize the variables
glimpse(ACC_AUX)
## Rows: 54,409
## Columns: 24
## $ CASENUM <dbl> 201901174219, 201901176655, 201901176667, 2019011766~
## $ severity_code <dbl> 3, 2, 2, 3, 2, 2, 1, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 2~
## $ severity <chr> "Property_Only", "Servere_Crash", "Servere_Crash", "~
## $ pileup_code <dbl> 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1~
## $ pileup_total <chr> "single_vehicle", "two_vehicle", "two_vehicle", "sin~
## $ young_driver_code <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
## $ young_driver <chr> "not_young_driver", "driver15_20", "not_young_driver~
## $ plus_65_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0~
## $ plus_65_driver <chr> "not_older_driver", "not_older_driver", "not_older_d~
## $ day_type_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day_type <chr> "Weekday", "Weekday", "Weekday", "Weekday", "Weekday~
## $ interstate_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ interstate <chr> "Non-Interstate", "Non-Interstate", "Non-Interstate"~
## $ intersection_code <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1~
## $ intersection <chr> "Non-Intersection", "Intersection", "Non-Intersectio~
## $ collision_code <dbl> 0, 3, 3, 0, 3, 3, 0, 3, 1, 1, 3, 0, 1, 3, 0, 3, 4, 1~
## $ collision_manner <chr> "Other", "Angle", "Angle", "Other", "Angle", "Angle"~
## $ rollover_code <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
## $ rollover <chr> "rollover", "not_rollover", "not_rollover", "not_rol~
## $ speeding_code <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ speeding <chr> "speeding", "not_speeding", "not_speeding", "not_spe~
## $ lighting_code <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1~
## $ lighting <chr> "Daytime", "Nighttime", "Nighttime", "Nighttime", "D~
## $ is_fatal <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
Grouping Data
The purpose of EDA is to have a better idea about the interelationships among predicting variables and whether a crash is fatal or not. To achieve this, data-subsetting, dataset-joinning, and composing new variables were done in this step.
To see whether certain patterns would emerge, the dataset was subsetted for fatal only crashes. Barplots were also created for the subset fatal_crashes to see if there are any observable patterns in the fatality data. The barplots (not shown here due to large amount) revealed that all variables seemed to have followed the same pattern in the overall dataset and the fatal only dataset. In other words, no specific patterns emerged at this step.
Fatal Crashes Only:
# Create a subset with fatal crashes only
fatal_crashes <- ACC_AUX %>%
filter(severity_code == 1)
fatal_crashes %>%
summarise(fatal_rate = sum(is_fatal == 1)/54409 * 100)
Values by Attribute Grouping:
Crash Attributes
| Attribute | Type | Values |
|---|---|---|
| Vehicle Pileup | Polychotomous | 0 - single vehicle |
| 1 - two vehicles | ||
| 2 - three or more vehicles | ||
| Day of the Week | Dichotomous | 0 - weekend |
| 1 - weekday | ||
| Rollover Involved | Dichotomous | 0 - No rollover involved |
| 1 - Yes rollover involved | ||
| Collision Manner | Polychotomous | 0 - Other |
| 1 - Rear-End | ||
| 2 - Head-On | ||
| 3 - Angle | ||
| 4 - Sideswipe | ||
Driver Attributes
| Attribute | Type | Values |
|---|---|---|
| Young Drivers | Dichotomous | 0 - No young drivers |
| 1 - Yes young drivers | ||
| 65+ Old Drivers | Dichotomous | 0 - No 65+ year old drivers |
| 1 - Yes 65+ year old drivers | ||
| Speeding Involved | Dichotomous | 0 - No speeding involved |
| 1 - Yes speeding involved | ||
Environmental Attributes
| Attribute | Type | Values |
|---|---|---|
| Interstate Crash | Dichotomous | 0 - No interstate crash |
| 1 - Yes interstate crash | ||
| Intersection Crash | Dichotomous | 0 - No intersection involved |
| 1 - Yes intersection involved | ||
| Lighting Conditions | Dichotomous | 0 - nighttime |
| 1 - daytime | ||
Sample Barplots
Since we have categorical variables, barplots are created to visualize the data. However, due to the large size, we are only showing a couple barplots here.
R function barplot(table(data$variable)) was used to visualize frequencies for several selected categorical variables in our main sample file, ACC_AUX. Due to the large variable numbers, the output figures are not shown.
collision_releveled <- fatal_crashes %>%
mutate(collision_manner = fct_relevel(collision_manner, "Angle",
"Head-On",
"Rear-End",
"Sideswipe",
"Other"))
barplot(table(collision_releveled$collision_manner), xlab = "Collision Manner")
barplot(table(fatal_crashes$plus_65_driver), xlab = "65+ Driver")
*** ____________________________________________________________________________________________________________________________
Visualize Relationships
The following is a visualization of the relationships between variables:
Crash Attributes:
Driver Attributes:
Environmental Attributes:
***
____________________________________________________________________________________________________________________________
Test Variables
Due to the fact that the outcome variable, is_fatal and most of the variables in the dataset are categorical in nature, chi-squre test was used as the primary method to test variable independency. In other words, test statistics will be used to determine whether a predicting variable is correlated with the outcome variables.
If the value of the test statistic is more extreme than the statistic calculated from the null hypothesis, then the variables has a statistically significant relationship between the predictor and outcome variables.
If the value of the test statistic is less extreme than the one calculated from the null hypothesis, then the variables have no statistically significant relationship between the predictor and outcome variables.
\(H_0:\) The two variables are independent.
\(H_1:\) The two variables relate to each other.
We will use a 95% confidence level and reject the null hypothesis if the p-value is less than our predetermined significance level of 0.05.
Crash Attributes:
## Independency test for is_fatal with other variables
## Whether a fatal crash and a pileup in vehicles relates
chisq.test(crash_attributes$is_fatal, crash_attributes$pileup_total, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: crash_attributes$is_fatal and crash_attributes$pileup_total
## X-squared = 346.83, df = 2, p-value < 2.2e-16
## Whether a crash happens on a weekday or weekend
chisq.test(crash_attributes$is_fatal, crash_attributes$day_type, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: crash_attributes$is_fatal and crash_attributes$day_type
## X-squared = 70.186, df = 1, p-value < 2.2e-16
## Whether a fatality happens with a rollover crash
chisq.test(crash_attributes$is_fatal, crash_attributes$rollover, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: crash_attributes$is_fatal and crash_attributes$rollover
## X-squared = 720.95, df = 1, p-value < 2.2e-16
## The manner in which a collision occurs and a fatality crash
chisq.test(crash_attributes$is_fatal, crash_attributes$collision_manner, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: crash_attributes$is_fatal and crash_attributes$collision_manner
## X-squared = 645.05, df = 4, p-value < 2.2e-16
## VISUALIZE ## ## The manner in which a collision occurs
Crash Attributes
The chi-square test statistics showed that all crash attributes examined; vehicles pileup, Day Type, Rollover, and Collision Manner, were significantly associated with the a fatal crash.
Triangluating with the figures, variable rollover was selected as the most statistically significant relationship between the predictor and the variable within the group because it had the largest x-squared value at 720.95.
Driver Attributes:
## Independency test for is_fatal with other variables
## Whether a fatal crash has a young driver
chisq.test(driver_attributes$is_fatal, driver_attributes$young_driver, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: driver_attributes$is_fatal and driver_attributes$young_driver
## X-squared = 15.222, df = 1, p-value = 9.56e-05
## Whether a fatal crash has a 65+ years old driver
chisq.test(driver_attributes$is_fatal, driver_attributes$plus_65_driver, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: driver_attributes$is_fatal and driver_attributes$plus_65_driver
## X-squared = 2.9382, df = 1, p-value = 0.08651
## Whether speeding was involved in the crash
chisq.test(driver_attributes$is_fatal, driver_attributes$speeding, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: driver_attributes$is_fatal and driver_attributes$speeding
## X-squared = 65.698, df = 1, p-value = 5.256e-16
Driver Attributes
The chi-square test statistics showed that two(2) of the three(3) driver attributes Young_Driver and Speeding were significantly associated with the a fatal crash.
Interestingly, drivers 65+ years old is not significantly associated with the a fatal crash with a p-value of 0.08651.
Comparing test results, variable speeding was selected as the most statistically significant relationship between the predictor and the variable within the group because it had the largest x-squared value at 65.698.
Environmental Attributes:
## Independency test for is_fatal with other variables
## Whether a fatal crash occurred on the interstate
chisq.test(environment_attributes$is_fatal, environment_attributes$interstate, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: environment_attributes$is_fatal and environment_attributes$interstate
## X-squared = 17.561, df = 1, p-value = 2.783e-05
## Whether a fatal crash occurred in an intersection
chisq.test(environment_attributes$is_fatal, environment_attributes$intersection, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: environment_attributes$is_fatal and environment_attributes$intersection
## X-squared = 183.7, df = 1, p-value < 2.2e-16
## Whether a fatal crash occurred in night time or daytime
chisq.test(environment_attributes$is_fatal, environment_attributes$lighting, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: environment_attributes$is_fatal and environment_attributes$lighting
## X-squared = 270.56, df = 1, p-value < 2.2e-16
Environmental Attributes
The chi-square test statistics showed that all environmental attributes examined were significantly associated with the a fatal crash.
Comparing test results, variable Lighting Conditions was selected as the most statistically significant relationship between the predictor and the variable within the group because it had the largest x-squared value at 270.56.
____________________________________________________________________________________________________________________________ ***
The EDA results showed that a lot of the variables were correlated with fatal crashes. Therefore, in this step, a prediction model for is_fatal was tested and fine-tuned. We selected logistic regression models for the creation of this predictive model.
\[ \begin{align} \text{Null : }\ H_0 &= 0 \\ \text{Alternative : }\ H_1 &\neq 0 \end{align} \]
We want to predict if you are likely to have a fatal crash given the following variables:
The null hypothesis states all coefficients in the model are equal to zero. In other words, there is no statistically significant relationship between the predictor variables, x, which means rollover, speeding, and night day, and the response variable, y - whether in a crash will lead to a fatality.
The alternative hypothesis states that not every coefficient is simultaneously equal to zero.In other words, there is statistically significant relationship between the predictor variables, x, which means rollover, speeding, and night day, and the response variable, y - whether in a single crash will lead to fatal.
First Step, change variable types to factors. We are first making our 5 categorical variables into a factor as it seems reasonable that a categorical variable should be made into a factor rather than left as an integer. As for example, a day of week can only be value 1 in one factor and, either a day of week is a weekday or weekend. The value is only 0 or 1, either a day of weekday is a weekday, or it is not a weekday, it cannot be 0.5 weekday.
Second Step, split data frame into training and testing sets using a 70/30 split. It seems quite intuitive to split data into a training portion and a test portion, so the model can be trained on the first and then tested with the testing data, but it is a good idea to split the data in a way, so that the model can be trained on a larger portion to adapt to more possible data constellations.
Third Step we start building a logistic regression model with dependent variable as is_fatal and independent variables.
Build Model 1 logistic regression model:
\[ \log\left[ \frac { \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} } \right] = -4.16 + 0.47(\operatorname{pileup\_total}_{\operatorname{single\_vehicle}}) - 0.7(\operatorname{pileup\_total}_{\operatorname{two\_vehicle}}) + 0.21(\operatorname{young\_driver}_{\operatorname{not\_}}) + 0.31(\operatorname{day\_type}_{\operatorname{Weekend}}) \]
#Build a logistic regression model
mod1 <- glm(is_fatal ~ pileup_total + young_driver + day_type, family = binomial,data=acc_train)
summary(mod1)
##
## Call:
## glm(formula = is_fatal ~ pileup_total + young_driver + day_type,
## family = binomial, data = acc_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2858 -0.2458 -0.1448 -0.1377 3.1200
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1577 0.1667 -24.934 < 2e-16 ***
## pileup_totalsingle_vehicle 0.4676 0.1480 3.160 0.001578 **
## pileup_totaltwo_vehicle -0.7020 0.1527 -4.598 4.26e-06 ***
## young_drivernot_young_driver 0.2058 0.1168 1.762 0.078152 .
## day_typeWeekend 0.3064 0.0811 3.777 0.000159 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6790.4 on 38085 degrees of freedom
## Residual deviance: 6552.9 on 38081 degrees of freedom
## AIC: 6562.9
##
## Number of Fisher Scoring iterations: 7
# Every variable in the logistic regression is statistically significant becuase p-value is less than 0.05.
#Use model 1 to predict probabilities on the testing data
testing_mod1 <- predict(mod1, acc_test, type = "response")
#According to the rule, if a the predicted probability of Y equaling 1 is greater than 0.5 for an observation, we will predict that the observation has a Y value of 1, we will predict that any fatal crash with a predicted response probability greater than 0.5 is predicted as a responder.
F_hat_mod1 <- as.numeric(testing_mod1 > 0.50)
table(acc_test$is_fatal, F_hat_mod1, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 16020
## 1 303
#Accuracy of model 1
mean(acc_test$is_fatal == F_hat_mod1)
## [1] 0.9814372
#Accuracy is 98%,very high accuaracy.
#The area under the curve(AUC) is the measure that represents ROC(Receiver Operating Characteristic) curve.An AUC value of greater than .70 indicates a good model.
roc <- roc(acc_train$is_fatal, mod1$fitted.values)
auc(roc)
## Area under the curve: 0.6651
# We can see AUC value here is around 66%. Model 1 is not a very good model.
Model 1 uses pileup_total, young_driver, and day_type. We find out that the model has accuracy of 98% but area under the curve is less that 70, so we reject this model 1.
Build Model 2 logistic regression model:
\[ \log\left[ \frac { \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} } \right] = -4.1 + 0.52(\operatorname{pileup\_total}_{\operatorname{single\_vehicle}}) - 0.63(\operatorname{pileup\_total}_{\operatorname{two\_vehicle}}) + 0.51(\operatorname{speeding}) + 0.29(\operatorname{day\_type}_{\operatorname{Weekend}}) \]
#Build a logistic regression model
mod2 <- glm(is_fatal ~ pileup_total + speeding + day_type, family = binomial,data=acc_train)
summary(mod2)
##
## Call:
## glm(formula = is_fatal ~ pileup_total + speeding + day_type,
## family = binomial, data = acc_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3479 -0.2344 -0.1530 -0.1323 3.0797
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.10258 0.14156 -28.981 < 2e-16 ***
## pileup_totalsingle_vehicle 0.52180 0.14749 3.538 0.000403 ***
## pileup_totaltwo_vehicle -0.63105 0.15320 -4.119 3.80e-05 ***
## speedingspeeding 0.51474 0.10467 4.918 8.75e-07 ***
## day_typeWeekend 0.29159 0.08117 3.592 0.000328 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6790.4 on 38085 degrees of freedom
## Residual deviance: 6534.4 on 38081 degrees of freedom
## AIC: 6544.4
##
## Number of Fisher Scoring iterations: 7
# Every variable in the logistic regression is statistically significant because p-value is less than 0.05.
#Use model 2 to predict probabilities on the testing data
testing_mod2 <- predict(mod2, acc_test, type = "response")
#According to the rule, if a the predicted probability of Y equaling 1 is greater
#than 0.5 for an observation, we will predict that the observation has a Y value of 1,
#we will predict that any fatal crash with a predicted response probability greater than 0.5 is predicted as a res-ponder.
F_hat_mod2 <- as.numeric(testing_mod2 > 0.50)
table(acc_test$is_fatal, F_hat_mod2, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 16020
## 1 303
#Accuracy of model 2
mean(acc_test$is_fatal == F_hat_mod2)
## [1] 0.9814372
#Accuracy is 98%,very high accuracy.
#The area under the curve(AUC) is the measure that represents ROC(Receiver Operating Characteristic) curve.An AUC value of greater than .6706 indicates a good model.
roc <- roc(acc_train$is_fatal, mod2$fitted.values)
auc(roc)
## Area under the curve: 0.6706
Model 2 uses pileup_total, speeding, and day_type. We find out that the model has accuracy of 98% but area under the curve is less that 70 (0.6706), so we reject this model 2.
Build Model 3 logistic regression model:
\[ \log\left[ \frac { \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{is\_fatal} = \operatorname{1} )} } \right] = -4.61 + 1.71(\operatorname{rollover}) + 0.4(\operatorname{speeding}) + 0.88(\operatorname{lighting}_{\operatorname{Nighttime}}) \]
#Build a logistic regression model
mod3 <- glm(is_fatal ~ rollover + speeding + lighting, family = binomial,data=acc_train)
summary(mod3)
##
## Call:
## glm(formula = is_fatal ~ rollover + speeding + lighting, family = binomial,
## data = acc_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6022 -0.2181 -0.1408 -0.1408 3.0395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.60945 0.06064 -76.009 < 2e-16 ***
## rolloverrollover 1.70873 0.10111 16.900 < 2e-16 ***
## speedingspeeding 0.40253 0.10725 3.753 0.000175 ***
## lightingNighttime 0.88287 0.07887 11.194 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6790.4 on 38085 degrees of freedom
## Residual deviance: 6388.8 on 38082 degrees of freedom
## AIC: 6396.8
##
## Number of Fisher Scoring iterations: 7
Since this p-value is less than .05, we reject the null hypothesis. In other words, there is a statistically significant relationship between the rollover, speeding, lighting and whether a single crash will lead to a fatality.
Use model 3 to predict probabilities and accuracy on the testing data:
According to the rule, if a the predicted probability of Y equaling 1 is greater than 0.5 for an observation, we will predict that the observation has a Y value of 1.
We can see that accuracy is 98%, it’s very high. The next step is tto calculate the area under the curve which is AUC to test our model again.
The area under the curve(AUC) is the measure that represents ROC(Receiver Operating Characteristic) curve. An AUC value of greater than .70 indicates a good model.
We can see AUC value here is 0.6983, around 70%. Like we mentioned before, the AUC value of our model is around 0.7, so Model 3 is a good model.
#Use model 3 to predict probabilities on the testing data
testing_mod3 <- predict(mod3, acc_test, type = "response")
F_hat_mod3 <- as.numeric(testing_mod3 > 0.50)
table(acc_test$is_fatal, F_hat_mod3, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0
## 0 16020
## 1 303
#Accuracy of model 3
mean(acc_test$is_fatal == F_hat_mod3)
## [1] 0.9814372
roc <- roc(acc_train$is_fatal, mod3$fitted.values)
auc(roc)
## Area under the curve: 0.6983
Conclusion
The final dependent variables selected for a further predictive analysis are Rollover, Speeding, and Lighting Conditions.