Problem

The file accidentsFull.csv contains information on over 42,183 actual automobile accidents in 2001 in the United States that involved one of three levels of injury: NO INJURY, INJURY, or FATALITY. For each accident, additional information is recorded, such as day of week, weather conditions, and road type. A firm might be interested in developing a system for quickly classifying the severity of an accident based on initial reports and associated data in the system (some of which rely on GPS-assisted reporting).

Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”

Data

Load the accidentsFull.csv and install/load any required packages. Create and insert a dummy variable called “INJURY” in the data.

accidents.df <- read.csv("D:/MSBA/3-Winter 2020/560/data/accidentsFull.csv")

accidents.df$INJURY <- ifelse(accidents.df$MAX_SEV_IR>0, "yes", "no")
head(accidents.df)
##   ï..HOUR_I_R ALCHL_I ALIGN_I STRATUM_R WRK_ZONE WKDY_I_R INT_HWY LGTCON_I_R
## 1           0       2       2         1        0        1       0          3
## 2           1       2       1         0        0        1       1          3
## 3           1       2       1         0        0        1       0          3
## 4           1       2       1         1        0        0       0          3
## 5           1       1       1         0        0        1       0          3
## 6           1       2       1         1        0        1       0          3
##   MANCOL_I_R PED_ACC_R RELJCT_I_R REL_RWY_R PROFIL_I_R SPD_LIM SUR_COND
## 1          0         0          1         0          1      40        4
## 2          2         0          1         1          1      70        4
## 3          2         0          1         1          1      35        4
## 4          2         0          1         1          1      35        4
## 5          2         0          0         1          1      25        4
## 6          0         0          1         0          1      70        4
##   TRAF_CON_R TRAF_WAY VEH_INVL WEATHER_R INJURY_CRASH NO_INJ_I PRPTYDMG_CRASH
## 1          0        3        1         1            1        1              0
## 2          0        3        2         2            0        0              1
## 3          1        2        2         2            0        0              1
## 4          1        2        2         1            0        0              1
## 5          0        2        3         1            0        0              1
## 6          0        2        1         2            1        1              0
##   FATALITIES MAX_SEV_IR INJURY
## 1          0          1    yes
## 2          0          0     no
## 3          0          0     no
## 4          0          0     no
## 5          0          0     no
## 6          0          1    yes

Part A

Using the information in this dataset, if an accident has just been reported and no further information is available, what should the prediction be? (INJURY = Yes or No?) Why?

#create a table based on INJURY
inj.tbl <- table(accidents.df$INJURY)
show(inj.tbl)
## 
##    no   yes 
## 20721 21462
#caluculate probabiliyt of injury 
inj.prob =  scales::percent(inj.tbl["yes"]/(inj.tbl["yes"]+inj.tbl["no"]),0.01)
inj.prob
##      yes 
## "50.88%"

Since ~51% of the accidents in our data set resulted in an accident, we should predict that an accident will result in injury because it is slightly more likley.

Part B

Select the first 12 records in the dataset and look only at the response (INJURY) and the two predictors WEATHER_R and TRAF_CON_R.

#convert your variables to categorical type
for (i in c(1:dim(accidents.df)[2])){
  accidents.df[,i] <- as.factor(accidents.df[,i])
}

#create a new subset with only the required records
new.df <- accidents.df[1:12,c("INJURY","WEATHER_R","TRAF_CON_R")]
new.df
##    INJURY WEATHER_R TRAF_CON_R
## 1     yes         1          0
## 2      no         2          0
## 3      no         2          1
## 4      no         1          1
## 5      no         1          0
## 6     yes         2          0
## 7      no         2          0
## 8     yes         1          0
## 9      no         2          0
## 10     no         2          0
## 11     no         2          0
## 12     no         1          2

Part B.i

Create a pivot table that examines INJURY as a function of the two predictors for these 12 records. Use all three variables in the pivot table as rows/columns.

rpivotTable::rpivotTable(new.df)

B.ii

Compute the exact Bayes conditional probabilities of an injury (INJURY = Yes) given the six possible combinations of the predictors.

#To find P(Injury=yes|WEATHER_R = 1, TRAF_CON_R =0):
numerator1 <- 2/3 * 3/12
denominator1 <- 3/12
prob1 <- numerator1/denominator1

#To find P(Injury=yes|WEATHER_R = 1, TRAF_CON_R =1):
numerator2 <- 0 * 3/12
denominator2 <- 1/12
prob2 <- numerator2/denominator2

#To find P(Injury=yes| WEATHER_R = 1, TRAF_CON_R =2):
numerator3 <- 0 * 3/12
denominator3 <- 1/12
prob3 <- numerator3/denominator3

#To find P(Injury=yes| WEATHER_R = 2, TRAF_CON_R =0):
numerator4 <- 1/3 * 3/12
denominator4 <- 6/12
prob4 <- numerator4/denominator4

#To find P(Injury=yes| WEATHER_R = 2, TRAF_CON_R =1):
numerator5 <- 0 * 3/12
denominator5 <- 1/12
prob5 <- numerator5/denominator5

#To find P(Injury=yes| WEATHER_R = 2, TRAF_CON_R =2):
numerator6 <- 0 * 3/12
denominator6 <- 0
prob6 <- numerator6/denominator6

a<-c(1,2,3,4,5,6)
b<-c(prob1,prob2,prob3,prob4,prob5,prob6)
prob.df<-data.frame(a,b)
names(prob.df)<-c('Option #','Probability')
prob.df %>% mutate_if(is.numeric, round, 3)
##   Option # Probability
## 1        1       0.667
## 2        2       0.000
## 3        3       0.000
## 4        4       0.167
## 5        5       0.000
## 6        6         NaN

NOTE: In the above 12 observations there is no observation with (Injury=yes, WEATHER_R = 2, TRAF_CON_R =2). The conditional probability here is undefined, since the denominator is zero.

B.iii

Classify the 12 accidents using these probabilities and a cutoff of 0.5.

#add probability results to you subset
new.df.prob<-new.df
head(new.df.prob)
##   INJURY WEATHER_R TRAF_CON_R
## 1    yes         1          0
## 2     no         2          0
## 3     no         2          1
## 4     no         1          1
## 5     no         1          0
## 6    yes         2          0
prob.inj <- c(0.667, 0.167, 0, 0, 0.667, 0.167, 0.167, 0.667, 0.167, 0.167, 0.167, 0)
new.df.prob$PROB_INJURY<-prob.inj

#add a column for injury prediction based on a cutoff of 0.5
new.df.prob$PREDICT_PROB<-ifelse(new.df.prob$PROB_INJURY>.5,"yes","no")
new.df.prob
##    INJURY WEATHER_R TRAF_CON_R PROB_INJURY PREDICT_PROB
## 1     yes         1          0       0.667          yes
## 2      no         2          0       0.167           no
## 3      no         2          1       0.000           no
## 4      no         1          1       0.000           no
## 5      no         1          0       0.667          yes
## 6     yes         2          0       0.167           no
## 7      no         2          0       0.167           no
## 8     yes         1          0       0.667          yes
## 9      no         2          0       0.167           no
## 10     no         2          0       0.167           no
## 11     no         2          0       0.167           no
## 12     no         1          2       0.000           no

B.iv

Compute manually the naive Bayes conditional probability of an injury given WEATHER_R = 1 and TRAF_CON_R = 1.

#To find P(Injury=yes| WEATHER_R = 1, TRAF_CON_R =1):
#  Probability of injury involved in accidents
#  =   (proportion of WEATHER_R =1 when Injury = yes) 
#      *(proportion of TRAF_CON_R =1 when Injury = yes)
#      *(propotion of Injury = yes in all cases)
man.prob <- 2/3 * 0/3 * 3/12
man.prob
## [1] 0

B.v

Run a naive Bayes classifier on the 12 records and two predictors using R. Check the model output to obtain probabilities and classifications for all 12 records. Compare this to the exact Bayes classification. Are the resulting classifications equivalent? Is the ranking (= ordering) of observations equivalent?

##load packages and run the naive Bayes classifier
library(e1071)
library(klaR)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
nb<-naiveBayes(INJURY ~ ., data = new.df)
predict(nb, newdata = new.df,type = "raw")
##              no          yes
##  [1,] 0.5000000 0.5000000000
##  [2,] 0.8000000 0.2000000000
##  [3,] 0.9992506 0.0007494379
##  [4,] 0.9970090 0.0029910269
##  [5,] 0.5000000 0.5000000000
##  [6,] 0.8000000 0.2000000000
##  [7,] 0.8000000 0.2000000000
##  [8,] 0.5000000 0.5000000000
##  [9,] 0.8000000 0.2000000000
## [10,] 0.8000000 0.2000000000
## [11,] 0.8000000 0.2000000000
## [12,] 0.9940358 0.0059642147
#check your model with the 'caret' package using the train and predict functions
library(caret)
x=new.df[,-3]
y=new.df$INJURY
model <- train(x,y,'nb', trControl = trainControl(method = 'cv',number=10))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
model
## Naive Bayes 
## 
## 12 samples
##  2 predictor
##  2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10, 11, 11, 11, 11, 11, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9444444  0.6666667
##    TRUE      0.9444444  0.6666667
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE and adjust
##  = 1.
##Now that we have generated a classification model, we use it for prediction
model.pred<-predict(model$finalModel,x)
model.pred
## $class
##   1   2   3   4   5   6   7   8   9  10  11  12 
## yes  no  no  no  no yes  no yes  no  no  no  no 
## Levels: no yes
## 
## $posterior
##             no          yes
## 1  0.001497753 0.9985022466
## 2  0.999833361 0.0001666389
## 3  0.999833361 0.0001666389
## 4  0.999333777 0.0006662225
## 5  0.999333777 0.0006662225
## 6  0.005964215 0.9940357853
## 7  0.999833361 0.0001666389
## 8  0.001497753 0.9985022466
## 9  0.999833361 0.0001666389
## 10 0.999833361 0.0001666389
## 11 0.999833361 0.0001666389
## 12 0.999333777 0.0006662225
##build a confusion matrix so that we can visualize the classification errors
table(model.pred$class,y)
##      y
##       no yes
##   no   9   0
##   yes  0   3
#compare against the manually calculated results
new.df.prob$PREDICT_PROB_NB<-model.pred$class
new.df.prob
##    INJURY WEATHER_R TRAF_CON_R PROB_INJURY PREDICT_PROB PREDICT_PROB_NB
## 1     yes         1          0       0.667          yes             yes
## 2      no         2          0       0.167           no              no
## 3      no         2          1       0.000           no              no
## 4      no         1          1       0.000           no              no
## 5      no         1          0       0.667          yes              no
## 6     yes         2          0       0.167           no             yes
## 7      no         2          0       0.167           no              no
## 8     yes         1          0       0.667          yes             yes
## 9      no         2          0       0.167           no              no
## 10     no         2          0       0.167           no              no
## 11     no         2          0       0.167           no              no
## 12     no         1          2       0.000           no              no

NOTE: The errors that appear when running the naive Bayes on this sample set are nothing to really worry about, they just mean that these parameter do very poorly when classified.

As you can see the trained data performed better than our manual calculations. However, since we trained on the same data we are testing this is to be expected (and is not a best practice, you never want to use the same data you trained on for testing).

C

Let us now return to the entire dataset. Partition the data into training (60%) and validation (40%).

set.seed(22)
train.index <- sample(c(1:dim(accidents.df)[1]), dim(accidents.df)[1]*0.6)  
train.df <- accidents.df[train.index,]
valid.df <- accidents.df[-train.index,]

C.i

Assuming that no information or initial reports about the accident itself are available at the time of prediction (only location characteristics, weather conditions, etc.), which predictors can we include in the analysis? (Use the Data_Codes sheet.)

We can use the predictors that describe the calendar time or road conditions: HOUR_I_R ALIGN_I WRK_ZONE WKDY_I_R INT_HWY LGTCON_I_R PROFIL_I_R SPD_LIM SUR_CON TRAF_CON_R TRAF_WAY WEATHER_R

C.ii

Run a naive Bayes classifier on the complete training set with the relevant predictors (and INJURY as the response). Note that all predictors are categorical. Show the confusion matrix.

#define which variable you will be using
vars <- c("INJURY", "ï..HOUR_I_R",  "ALIGN_I" ,"WRK_ZONE",  "WKDY_I_R",
          "INT_HWY",  "LGTCON_I_R", "PROFIL_I_R", "SPD_LIM", "SUR_COND",
          "TRAF_CON_R",   "TRAF_WAY",   "WEATHER_R")

nbTotal <- naiveBayes(INJURY ~ ., data = train.df[,vars])

#generate the confusion matrix using the train.df, the prediction and the classes
confusionMatrix(train.df$INJURY, predict(nbTotal, train.df[, vars]), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5795 6751
##        yes 4932 7831
##                                           
##                Accuracy : 0.5384          
##                  95% CI : (0.5322, 0.5445)
##     No Information Rate : 0.5762          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0756          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5370          
##             Specificity : 0.5402          
##          Pos Pred Value : 0.6136          
##          Neg Pred Value : 0.4619          
##              Prevalence : 0.5762          
##          Detection Rate : 0.3094          
##    Detection Prevalence : 0.5043          
##       Balanced Accuracy : 0.5386          
##                                           
##        'Positive' Class : yes             
## 
ner=1-.5384
nerp=scales::percent(ner,0.01)

C.iii

What is the overall error for the validation set?

confusionMatrix(valid.df$INJURY, predict(nbTotal, valid.df[, vars]), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3717 4458
##        yes 3381 5318
##                                          
##                Accuracy : 0.5354         
##                  95% CI : (0.5279, 0.543)
##     No Information Rate : 0.5794         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0663         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.5440         
##             Specificity : 0.5237         
##          Pos Pred Value : 0.6113         
##          Neg Pred Value : 0.4547         
##              Prevalence : 0.5794         
##          Detection Rate : 0.3152         
##    Detection Prevalence : 0.5155         
##       Balanced Accuracy : 0.5338         
##                                          
##        'Positive' Class : yes            
## 
ver=1-.5354
verp=scales::percent(ver,0.01)
paste("Overall Error: ",verp)
## [1] "Overall Error:  46.46%"

C.iv

What is the percent improvement relative to the naive rule (using the validation set)?

imp=ver-ner
paste("The percent improvement is ",scales::percent(imp,0.01))
## [1] "The percent improvement is  0.30%"

C.v

Examine the conditional probabilities output. Why do we get a probability of zero for P(INJURY = No j SPD_LIM = 5)?

options(digits = 2)
nbTotal
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  no yes 
## 0.5 0.5 
## 
## Conditional probabilities:
##      ï..HOUR_I_R
## Y        0    1
##   no  0.57 0.43
##   yes 0.58 0.42
## 
##      ALIGN_I
## Y        1    2
##   no  0.87 0.13
##   yes 0.87 0.13
## 
##      WRK_ZONE
## Y         0     1
##   no  0.975 0.025
##   yes 0.979 0.021
## 
##      WKDY_I_R
## Y        0    1
##   no  0.21 0.79
##   yes 0.23 0.77
## 
##      INT_HWY
## Y           0       1       9
##   no  0.84904 0.15033 0.00064
##   yes 0.86163 0.13774 0.00063
## 
##      LGTCON_I_R
## Y        1    2    3
##   no  0.69 0.13 0.18
##   yes 0.70 0.11 0.19
## 
##      PROFIL_I_R
## Y        0    1
##   no  0.75 0.25
##   yes 0.76 0.24
## 
##      SPD_LIM
## Y           5      10      15      20      25      30      35      40      45
##   no  8.0e-05 4.8e-04 4.5e-03 7.3e-03 1.1e-01 8.8e-02 1.9e-01 9.6e-02 1.6e-01
##   yes 7.8e-05 3.9e-04 4.4e-03 4.6e-03 9.2e-02 8.4e-02 2.1e-01 1.1e-01 1.6e-01
##      SPD_LIM
## Y          50      55      60      65      70      75
##   no  4.1e-02 1.6e-01 3.6e-02 6.6e-02 4.0e-02 6.1e-03
##   yes 3.8e-02 1.5e-01 4.3e-02 6.3e-02 2.9e-02 6.5e-03
## 
##      SUR_COND
## Y          1      2      3      4      9
##   no  0.7796 0.1738 0.0171 0.0258 0.0037
##   yes 0.8147 0.1551 0.0106 0.0150 0.0047
## 
##      TRAF_CON_R
## Y        0    1    2
##   no  0.66 0.19 0.15
##   yes 0.62 0.22 0.16
## 
##      TRAF_WAY
## Y         1     2     3
##   no  0.581 0.368 0.051
##   yes 0.564 0.396 0.040
## 
##      WEATHER_R
## Y        1    2
##   no  0.84 0.16
##   yes 0.87 0.13

We do not get true probability of zero for no injury in accidents under speed limit of 5 because we can never be entirely sure something will not happen. However, considering the highly unlikely fact of sustaining an injury while in a car accident at such a low speed, it makes sense that the probability is quite close to 0.