Welcome to this R demo session! Here, I will demonstrate how to use R to conduct logistic regression.
In this R Markdown file, we will explore a dataset titled fordexp.txt, which details fatal SUV accidents. The dataset comprises four key variables:

- ford (categorical): ford = 0: not a Ford; ford = 1: a Ford.
- age (continuous): the vehicle’s age.
- pass (continuous): number of passengers in the SUV.
- tire (categorical): tire = 0: not tire related; tire = 1: tire related.

The research question we are interested in exploring here today is whether the fatal SUV accidents were related to Ford tires.
# load in the data
forddata <- read.table("fordexp.txt",header=TRUE)
# the package gmodels has a good cross-tabulation function
# some of you may know it if you have used it for chi-square test
library(gmodels)
#start with ford vs. tire
gmodels::CrossTable(forddata$ford,forddata$tire,chisq = TRUE,expected=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2321
##
##
## | forddata$tire
## forddata$ford | 0 | 1 | Row Total |
## --------------|-----------|-----------|-----------|
## 0 | 1794 | 5 | 1799 |
## | 1778.072 | 20.928 | |
## | 0.143 | 12.122 | |
## | 0.997 | 0.003 | 0.775 |
## | 0.782 | 0.185 | |
## | 0.773 | 0.002 | |
## --------------|-----------|-----------|-----------|
## 1 | 500 | 22 | 522 |
## | 515.928 | 6.072 | |
## | 0.492 | 41.778 | |
## | 0.958 | 0.042 | 0.225 |
## | 0.218 | 0.815 | |
## | 0.215 | 0.009 | |
## --------------|-----------|-----------|-----------|
## Column Total | 2294 | 27 | 2321 |
## | 0.988 | 0.012 | |
## --------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 54.53411 d.f. = 1 p = 1.527732e-13
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 51.16398 d.f. = 1 p = 8.496355e-13
##
##
From the cross-tabulation, 4.2% of accidents involving Fords were tire related versus only 0.3% for other SUVs, and the chi-square test is significant (p < .001), so ford and tire are positively associated.
# now let's look at age vs. tire and pass vs. tire
# both are continuous, so a t-test is more appropriate
t.test(forddata$age ~ forddata$tire)
##
## Welch Two Sample t-test
##
## data: forddata$age by forddata$tire
## t = -5.8905, df = 26.966, p-value = 2.839e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.5643451 -0.7560398
## sample estimates:
## mean in group 0 mean in group 1
## 2.358326 3.518519
From the results, it seems that tire-related fatal accidents are associated with older vehicles.
t.test(forddata$pass ~ forddata$tire)
##
## Welch Two Sample t-test
##
## data: forddata$pass by forddata$tire
## t = -6.0953, df = 26.438, p-value = 1.797e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -2.385017 -1.182809
## sample estimates:
## mean in group 0 mean in group 1
## 0.9197908 2.7037037
Tire-related fatal accidents are associated with more passengers.
Logistic regression also assumes a binary outcome and independent observations; both can reasonably be assumed to hold here, since tire is 0/1 and each row is a separate accident.
One of the important assumptions of logistic regression is the linearity of the logit over the continuous covariates. When this assumption is not met, a Box-Tidwell transformation might be helpful.
The Box-Tidwell test checks this assumption by adding each predictor multiplied by its own natural log as an extra term and testing whether that addition improves prediction; if it does, the logit is not a linear function of the predictor.
A significant p-value in the Box-Tidwell test means that the linearity assumption is violated.
If a variable is indeed found to be non-linear, then we may need to incorporate higher-order polynomial terms for that variable in the regression analysis to capture the non-linearity.
We can run the Box-Tidwell test directly using the boxTidwell function in the car package:
library(car)
## Loading required package: carData
# Since you cannot take the logarithm of a negative number or zero, let's transform the two continuous variables first (just for the sake of checking the assumptions)
forddata$age_t <- forddata$age + 1
forddata$pass_t <- forddata$pass + 1
# Conduct the Box-Tidwell test
boxTidwell(formula = tire ~ age_t + pass_t,
data = forddata)
## MLE of lambda Score Statistic (t) Pr(>|t|)
## age_t 1.18339 0.2538 0.7996
## pass_t 0.98455 -0.0568 0.9547
##
## iterations = 6
##
## Score test for null hypothesis that all lambdas = 1:
## F = 0.033874, df = 2 and 2316, Pr(>F) = 0.9667
Both p-values are > 0.05 (not statistically significant), which means the logit is linear in both variables and the assumption holds.
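For intuition, a roughly equivalent check can be done by hand by adding each (shifted) predictor multiplied by its own natural log to the model. A minimal sketch (the I() terms and the bt_check name are just for illustration, not part of the original analysis):

# manual version of the Box-Tidwell check:
# a significant x*log(x) term signals non-linearity in the logit
bt_check <- glm(tire ~ age_t + pass_t + I(age_t * log(age_t)) + I(pass_t * log(pass_t)),
                data = forddata, family = "binomial")
summary(bt_check)$coefficients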
cor.test(forddata$age, forddata$pass)
##
## Pearson's product-moment correlation
##
## data: forddata$age and forddata$pass
## t = -2.5459, df = 2319, p-value = 0.01096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09328015 -0.01213335
## sample estimates:
## cor
## -0.0527939
The correlation between age and pass is small (r = −.05), so multicollinearity is not a concern here.
The remaining assumption, the absence of extreme outliers, can only be checked after the model is built. We will come back to this later.
# logistic regression with ford as only predictor
output <- glm(tire ~ ford, data=forddata, family = binomial(link = "logit"))
#note: link = logit is default for family = binomial so line below does same thing
output <- glm(tire ~ ford, data=forddata,family = "binomial")
summary(output)
##
## Call:
## glm(formula = tire ~ ford, family = "binomial", data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.8828 0.4478 -13.137 < 2e-16 ***
## ford 2.7592 0.4980 5.541 3.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.20 on 2320 degrees of freedom
## Residual deviance: 251.23 on 2319 degrees of freedom
## AIC: 255.23
##
## Number of Fisher Scoring iterations: 8
First thing to check is whether tire = 0 or tire = 1 is the predicted outcome.
Best way to check this is by comparing regression coefficient to descriptives.
Cross-tabulation showed that ford and tire were positively associated
Regression coefficient is positive, so we can deduce that the logistic regression is predicting tire = 1.
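As a quick sanity check (a sketch, not part of the original output): with a single binary predictor, the fitted probabilities should reproduce the observed cell proportions from the cross-tabulation (about .003 for non-Fords and .042 for Fords).

# fitted P(tire = 1) within each level of ford
tapply(output$fitted.values, forddata$ford, mean)
# observed proportion of tire = 1 within each level of ford
tapply(forddata$tire, forddata$ford, mean)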
anova(update(output,~1),output,test="Chisq") #the first part of this line runs the null model with no predictors
## Analysis of Deviance Table
##
## Model 1: tire ~ 1
## Model 2: tire ~ ford
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2320 294.20
## 2 2319 251.23 1 42.964 5.577e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is a likelihood ratio test. The test statistic is \(\Delta\) chi-square = 42.964.
The sampling distribution is a chi-square distribution with df = # predictors = 1 for this test.
p-value = .00000000005577, which is far below .05.
So this set of predictors (only 1 here, but that is the general interpretation) significantly predicts whether a fatal accident is tire related.
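The same test can be computed by hand from the two deviances reported in the model summary; a minimal sketch:

# deviance difference = likelihood ratio chi-square
dev_diff <- output$null.deviance - output$deviance   # 294.20 - 251.23 = 42.964
# p-value from the chi-square distribution with df = # predictors = 1
pchisq(dev_diff, df = 1, lower.tail = FALSE)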
#pseudo-Rsquared
rcompanion::nagelkerke(output)
## Registered S3 method overwritten by 'DescTools':
## method from
## reorder.factor gdata
## $Models
##
## Model: "glm, tire ~ ford, binomial, forddata"
## Null: "glm, tire ~ 1, binomial, forddata"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.1460370
## Cox and Snell (ML) 0.0183405
## Nagelkerke (Cragg and Uhler) 0.1540580
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -1 -21.482 42.964 5.5768e-11
##
## $Number.of.observations
##
## Model: 2321
## Null: 2321
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
First note: the likelihood ratio test comes out here too, so you didn’t have to run the previous chunk of code.
Provides three pseudo-Rsquared estimates
My preference is Nagelkerke and not just because it’s a great name
Tends to perform better than other estimates of pseudo-Rsquared
Nagelkerke Rsquared is .154, meaning …
meaning…
what on earth does that mean?
Could interpret like a regular Rsquared
15.4% of variance in outcome accounted for by predictor(s)
but what does the variance in the dichotomous outcome mean?
My recommendation: just say the Nagelkerke pseudo-Rsquared was .154 and don’t try to interpret any more than that
But size of pseudo-Rsquared can be thought of as about the same magnitude as a regular Rsquared
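If you are curious where the Nagelkerke value comes from, it can be reproduced by hand from the model and null log-likelihoods. A sketch (the r2_cs and r2_max names are just for illustration):

ll1 <- as.numeric(logLik(output))              # log-likelihood of the model
ll0 <- as.numeric(logLik(update(output, ~1)))  # log-likelihood of the null model
n <- nobs(output)
r2_cs <- 1 - exp(2 * (ll0 - ll1) / n)          # Cox and Snell
r2_max <- 1 - exp(2 * ll0 / n)                 # its maximum possible value
r2_cs / r2_max                                 # Nagelkerke: about .154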
# null hypothesis: regression coefficient is 0
summary(output)
##
## Call:
## glm(formula = tire ~ ford, family = "binomial", data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.8828 0.4478 -13.137 < 2e-16 ***
## ford 2.7592 0.4980 5.541 3.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.20 on 2320 degrees of freedom
## Residual deviance: 251.23 on 2319 degrees of freedom
## AIC: 255.23
##
## Number of Fisher Scoring iterations: 8
This is a Wald test
Test statistic is z value = 5.541
Sampling distribution is standard normal distribution
p-value is .0000000301, which is <.05, so reject null hypothesis
Note that this statistical test asks an equivalent question to the first question above (if and only if there is exactly one predictor), but the numbers are a bit different (the p-value in particular).
That is because the test is different: likelihood ratio test vs. Wald test.
Those tests are mathematically identical if the sample size is infinite but will be somewhat different with finite sample sizes.
In linear regression, this is answered with the regression coefficient
In logistic regression, this can also be answered with the regression coefficient
b1 = 2.7592: a one unit difference in ford = ford vs. not ford is associated with a 2.7592 unit difference in what?
Not directly predicting the outcome variable
Instead, predicting the logit (see the lecture)
So ford vs. not ford is associated with a 2.7592 logit difference in the outcome, which is just incomprehensible
Let’s convert to an odds ratio
exp(output$coefficients)
## (Intercept) ford
## 0.002787068 15.787199819
Odds ratio is > 1, so interpret as: ford vs. not ford is associated with a fatal accident being 15.79 times more likely to be tire related.
Recognizing that this interpretation is not precisely correct mathematically but close enough
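To see what sits behind the odds ratio, we can convert the predicted logits to probabilities with plogis(); a quick sketch using the coefficients above:

plogis(-5.8828)            # predicted P(tire = 1) for not a ford: about .003
plogis(-5.8828 + 2.7592)   # predicted P(tire = 1) for a ford: about .042
exp(2.7592)                # the odds ratio itself: about 15.79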
Confidence intervals are valuable with odds ratios because of asymmetry
exp(confint(output))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.0009983947 0.005999684
## ford 6.4353315185 47.327506943
Confidence interval for odds ratio for ford is 95%CI=(6.44,47.33)
Confidence intervals are impossible to interpret correctly, so don’t bother
Just report the confidence interval and let the reader misinterpret it without your help
Recall that a regression coefficient is significant if its confidence interval does not contain 0
For odds ratios, the odds ratio is significant if its confidence interval does not contain 1, because an odds ratio of 1 means the odds do not change at all.
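For comparison, here is a sketch of the Wald-type interval, which is symmetric on the logit scale before exponentiating; confint() above uses the profile likelihood, which is generally preferred for logistic regression.

# Wald-type confidence intervals, exponentiated to the odds ratio scale
exp(confint.default(output))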
output2 <- glm(tire ~ ford + pass + age, data=forddata,family = "binomial")
rcompanion::nagelkerke(output2)
## $Models
##
## Model: "glm, tire ~ ford + pass + age, binomial, forddata"
## Null: "glm, tire ~ 1, binomial, forddata"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.3351220
## Cox and Snell (ML) 0.0415885
## Nagelkerke (Cragg and Uhler) 0.3493370
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -3 -49.296 98.592 3.1212e-21
##
## $Number.of.observations
##
## Model: 2321
## Null: 2321
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
Likelihood ratio test: chi-square = 98.592, df = 3, p < .001
Nagelkerke pseudo-Rsquared = .349
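Since the one-predictor model is nested within this one, we can also test directly whether adding pass and age improves on the ford-only model; a sketch:

# likelihood ratio test comparing the nested models
anova(output, output2, test = "Chisq")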
summary(output2)
##
## Call:
## glm(formula = tire ~ ford + pass + age, family = "binomial",
## data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.5227 0.8887 -10.715 < 2e-16 ***
## ford 2.8614 0.5189 5.515 3.49e-08 ***
## pass 0.6277 0.1042 6.025 1.69e-09 ***
## age 0.8499 0.1784 4.765 1.89e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.2 on 2320 degrees of freedom
## Residual deviance: 195.6 on 2317 degrees of freedom
## AIC: 203.6
##
## Number of Fisher Scoring iterations: 9
All three predictors are significant predictors of the outcome, p < .001 for all three
exp(output2$coefficients)
## (Intercept) ford pass age
## 7.317474e-05 1.748663e+01 1.873297e+00 2.339482e+00
# odds ratios are reported using scientific notation
# you can turn off scientific notation with the function:
options(scipen=999)
Now recalculate odds ratios without scientific notation
exp(output2$coefficients)
## (Intercept) ford pass age
## 0.00007317474 17.48662848401 1.87329694493 2.33948197494
Ford vs. not Ford is associated with a fatal accident being 17.49 times more likely to be tire related.
An SUV with one additional passenger is associated with a fatal accident being 1.87 times more likely to be tire related.
An SUV that is one year older is associated with a fatal accident being 2.34 times more likely to be tire related.
exp(confint(output2))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.00001080952 0.0003597715
## ford 6.83538720161 54.3944821049
## pass 1.52628402660 2.3073202681
## age 1.67400962002 3.3864224628
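Beyond odds ratios and their confidence intervals, we can also ask the model for the predicted probability of a specific profile. A sketch with a hypothetical case (newcase and its values are made up for illustration):

# a ford, two passengers, four years old
newcase <- data.frame(ford = 1, pass = 2, age = 4)
predict(output2, newdata = newcase, type = "response")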
To get standardized coefficients, use the beta.glm function from the reghelper package, which is called using just the function name beta.
library(reghelper)
##
## Attaching package: 'reghelper'
## The following object is masked from 'package:base':
##
## beta
beta(output2)
##
## Call:
## glm(formula = "tire ~ ford.z + pass.z + age.z", family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.2729 0.4705 -13.332 < 0.0000000000000002 ***
## ford.z 1.1950 0.2167 5.515 0.00000003494 ***
## pass.z 0.8130 0.1349 6.025 0.00000000169 ***
## age.z 1.0806 0.2268 4.765 0.00000188719 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.2 on 2320 degrees of freedom
## Residual deviance: 195.6 on 2317 degrees of freedom
## AIC: 203.6
##
## Number of Fisher Scoring iterations: 9
exp(beta(output2)$coefficients)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.001886836 1.600825 0.000001622122 1.000000
## ford.z 3.303425772 1.241954 248.315938025537 1.000000
## pass.z 2.254630307 1.144468 413.560307328185 1.000000
## age.z 2.946420358 1.254541 117.347997697031 1.000002
The standardized odds ratios are interpreted like you would expect.
A one-SD difference in SUV age is associated with a fatal accident being 2.95 times more likely to be tire related.
(Only the Estimate column is meaningful after exponentiating; ignore the exponentiated Std. Error, z value, and p-value columns.)
Note that all of these predictors have meaningful units (ford yes/no, # of passengers, SUV age in years), so standardization reduces interpretability here.
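The same standardized coefficients can be obtained by hand by z-scoring the predictors and refitting; a minimal sketch (forddata_z is a new helper copy of the data):

forddata_z <- forddata
forddata_z[c("ford", "pass", "age")] <- scale(forddata_z[c("ford", "pass", "age")])
# coefficients should match the .z estimates from beta() above
coef(glm(tire ~ ford + pass + age, data = forddata_z, family = "binomial"))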
output3 <- glm(tire ~ ford * age + pass, data=forddata,family = "binomial")
rcompanion::nagelkerke(output3)
## $Models
##
## Model: "glm, tire ~ ford * age + pass, binomial, forddata"
## Null: "glm, tire ~ 1, binomial, forddata"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.3358150
## Cox and Snell (ML) 0.0416727
## Nagelkerke (Cragg and Uhler) 0.3500440
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -4 -49.398 98.795 0.000000000000000000017753
##
## $Number.of.observations
##
## Model: 2321
## Null: 2321
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
summary(output3)
##
## Call:
## glm(formula = tire ~ ford * age + pass, family = "binomial",
## data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.0087 1.3804 -6.526 0.0000000000675 ***
## ford 2.2099 1.4915 1.482 0.1384
## age 0.7064 0.3586 1.970 0.0489 *
## pass 0.6250 0.1038 6.023 0.0000000017132 ***
## ford:age 0.1857 0.4079 0.455 0.6489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.2 on 2320 degrees of freedom
## Residual deviance: 195.4 on 2316 degrees of freedom
## AIC: 205.4
##
## Number of Fisher Scoring iterations: 9
The interaction is nonsignificant, but now ford is nonsignificant too. Is a model without both of those variables OK?
Run a model without ford and the ford × age interaction and compare:
output4 <- glm(tire ~ age + pass, data=forddata,family = "binomial")
anova(output4,output3,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: tire ~ age + pass
## Model 2: tire ~ ford * age + pass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2318 237.78
## 2 2316 195.40 2 42.384 0.0000000006258 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is a likelihood ratio test
Important note: this test is only valid if the model is estimated with maximum likelihood
\(\Delta\) chi-square = 42.384
df = difference in # of predictors = 2
p-value = .0000000006258 < .05
Dropping these two variables is a bad idea
The reason why neither is statistically significant on its own is multicollinearity
car::vif(output3)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## ford age pass ford:age
## 8.416367 4.215465 1.065242 11.515629
exp(output3$coefficients)
## (Intercept) ford age pass ford:age
## 0.0001223381 9.1151322621 2.0267154387 1.8682078175 1.2040951169
The odds ratio for ford is still big, even though it is not significant because of multicollinearity.
exp(confint(output3))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.000004773959 0.001220979
## ford 0.619149898529 258.699350837
## age 1.022171208346 4.364051479
## pass 1.523626099989 2.300408524
## ford:age 0.518000573628 2.656620392
Note that confidence intervals for odds ratios for ford and interaction contain 1
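Since the collinearity here comes from forming the ford:age product, one common remedy is to center age before building the interaction. A sketch (age_c and output3c are new helper names); this leaves the model fit unchanged but usually lowers the VIFs:

forddata$age_c <- forddata$age - mean(forddata$age)
output3c <- glm(tire ~ ford * age_c + pass, data = forddata, family = "binomial")
car::vif(output3c)   # VIFs for ford and the interaction should drop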
Since we have our model at hand, we can check whether there are any outliers in the fitted model.
# examine cases with large residuals
forddata[abs(output3$residuals) > 3,]
## ford age pass tire age_t pass_t
## 2295 0 3 2 1 4 3
## 2296 0 3 1 1 4 2
## 2297 0 5 1 1 6 2
## 2301 1 5 1 1 6 2
## 2302 1 2 4 1 3 5
## 2303 1 3 3 1 4 4
## 2304 1 3 3 1 4 4
## 2305 1 4 3 1 5 4
## 2306 1 4 3 1 5 4
## 2307 1 4 3 1 5 4
## 2308 1 4 3 1 5 4
## 2309 1 4 1 1 5 2
## 2310 1 4 1 1 5 2
## 2312 1 4 4 1 5 5
## 2313 0 3 0 1 4 1
## 2314 0 3 2 1 4 3
## 2315 1 3 4 1 4 5
## 2316 1 3 4 1 4 5
## 2317 1 3 4 1 4 5
## 2318 1 3 0 1 4 1
## 2319 1 3 2 1 4 3
## 2320 1 1 5 1 2 6
## 2321 1 2 2 1 3 3
Do not be alarmed by the results.
In our example, all cases with an outcome value of 1 have large residuals. However, we know that they are NOT outliers.
Back when this issue with Ford Explorers happened, statistical analyses far more sophisticated than these showed definitively that there was a problem with Ford Explorers, especially with Firestone tires
Ford recalled all the affected Ford Explorers
Statistics really can be a matter of life or death
Our ford dataset might not be ideally suited for this specific purpose, as relying solely on the variables within it makes it challenging to accurately determine if a fatal accident is tire related. However, for illustrative purposes, we’ll proceed with using our dataset to see how effectively our model can predict whether an accident is tire-related.
Let’s create a training data and a testing data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
#make this example reproducible
set.seed(123)
#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(forddata), replace=TRUE, prob=c(0.7,0.3))
dta_train <- forddata %>%
filter(sample)
dta_test <- forddata %>%
filter(!sample)
nrow(dta_train) # contains 70% of the data
## [1] 1633
Now let’s examine the model we built earlier but just apply it to the training data.
output3 <- glm(tire ~ ford * age + pass, data=dta_train,family = "binomial")
The estimated probability of success can be found as follows:
head(output3$fitted.values)
## 1 2 3 4 5 6
## 0.0002478010 0.0002478010 0.0014306459 0.0002478010 0.0002478010 0.0007976755
# Testing data
dta_test$prob_predicted <- predict(object = output3, newdata = dta_test, type = "response")
head(dta_test$prob_predicted)
## [1] 0.0002478010 0.0002478010 0.0002478010 0.0007976755 0.0002478010
## [6] 0.0004446299
Using a 0.5 cut-off probability, we obtain the following accuracy levels:
dta_test$predictions <- ifelse(dta_test$prob_predicted > 0.5, 1, 0)
# Model accuracy
mean(dta_test$predictions==dta_test$tire)
## [1] 0.9898256
table(dta_test$predictions, dta_test$tire)
##
## 0 1
## 0 681 7
Let’s examine the sensitivity and specificity for the results.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Assuming 'actual' and 'predicted' as defined above
confusionMatrix <- confusionMatrix(as.factor(dta_test$predictions), as.factor(dta_test$tire))
## Warning in confusionMatrix.default(as.factor(dta_test$predictions),
## as.factor(dta_test$tire)): Levels are not in the same order for reference and
## data. Refactoring data to match.
# Print the confusion matrix and statistics
print(confusionMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 681 7
## 1 0 0
##
## Accuracy : 0.9898
## 95% CI : (0.9791, 0.9959)
## No Information Rate : 0.9898
## P-Value [Acc > NIR] : 0.59872
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9898
## Neg Pred Value : NaN
## Prevalence : 0.9898
## Detection Rate : 0.9898
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
# Access specific measures
sensitivity <- confusionMatrix$byClass['Sensitivity']
specificity <- confusionMatrix$byClass['Specificity']
sensitivity
## Sensitivity
## 1
specificity
## Specificity
## 0
Although we achieved high accuracy, the count table reveals that the model classifies every case as non-tire related. This skew is attributable to the training data, which contains a disproportionately high number of non-tire cases. The same picture emerges from the sensitivity and specificity values: the sensitivity is a perfect 1, but the specificity is 0.
To address this imbalance and improve the model’s ability to distinguish between tire and non-tire related cases, adjusting the cutoff probability value may be necessary.
dta_test$predictions <- ifelse(dta_test$prob_predicted > 0.3, 1, 0)
# Model accuracy
mean(dta_test$predictions==dta_test$tire)
## [1] 0.9883721
table(dta_test$predictions, dta_test$tire)
##
## 0 1
## 0 678 5
## 1 3 2
Let’s examine the sensitivity and specificity again.
# Assuming 'actual' and 'predicted' as defined above
confusionMatrix <- confusionMatrix(as.factor(dta_test$predictions), as.factor(dta_test$tire))
# Print the confusion matrix and statistics
print(confusionMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 678 5
## 1 3 2
##
## Accuracy : 0.9884
## 95% CI : (0.9772, 0.995)
## No Information Rate : 0.9898
## P-Value [Acc > NIR] : 0.7298
##
## Kappa : 0.3276
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.9956
## Specificity : 0.2857
## Pos Pred Value : 0.9927
## Neg Pred Value : 0.4000
## Prevalence : 0.9898
## Detection Rate : 0.9855
## Detection Prevalence : 0.9927
## Balanced Accuracy : 0.6407
##
## 'Positive' Class : 0
##
# Access specific measures
sensitivity <- confusionMatrix$byClass['Sensitivity']
specificity <- confusionMatrix$byClass['Specificity']
sensitivity
## Sensitivity
## 0.9955947
specificity
## Specificity
## 0.2857143
The specificity has improved, though it still falls short of being satisfactory. This outcome was anticipated, as our dataset was not specifically constructed for predictive purposes.
Let’s now examine the ROC plot.
# install.packages("pROC")
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
pROC_obj <- roc(dta_test$tire, dta_test$predictions,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
plot(sens.ci, type="bars")
I like the ROC plot generated by the pROC package because it is pretty easy to get confidence intervals for the Area Under the Curve (AUC) on the plot.
From the plot we can easily see that the ROC is far from satisfactory:
The AUC (Area Under the Curve) is a measure of the overall performance of the classification model and represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUC is provided as 0.641 with a confidence interval of (0.460-0.821). This suggests that the model has moderate discriminative ability, with a wide confidence interval indicating considerable uncertainty in the estimate.
Sensitivity and Specificity Balance: The “sweet spot” or the optimal threshold is the point on the curve that balances sensitivity and specificity to a satisfactory level based on the costs of false positives and false negatives in the specific application. This is often considered to be the point closest to the top-left corner of the plot, where both sensitivity and specificity are highest.
In this particular plot, the shaded area suggests a focus on a range of thresholds, where the upper left boundary of the shaded area is closer to the “sweet spot”.
In practice, the sweet spot is determined not only by the ROC curve but also by considering the costs and consequences of false positives and false negatives in the context where the model is applied. For some applications, a higher sensitivity is preferred even at the cost of lower specificity, and vice versa.
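One refinement worth noting: the ROC above was built from the 0/1 classifications, so the curve has only a single interior point. Building it from the predicted probabilities instead lets the curve trace out every possible cutoff. A sketch (pROC_prob is a new helper name):

pROC_prob <- roc(dta_test$tire, dta_test$prob_predicted,
                 ci = TRUE, plot = TRUE, print.auc = TRUE)
# candidate cutoff that balances sensitivity and specificity (Youden's index)
coords(pROC_prob, "best", ret = c("threshold", "sensitivity", "specificity"))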