Welcome to this R demo session! Here, I will demonstrate how to use R to conduct logistic regression.
In this R Markdown file, we will explore a dataset titled fordexp.txt, which details fatal SUV accidents. The dataset comprises four key variables:

- ford (categorical): ford = 0: not a Ford; ford = 1: a Ford.
- age (continuous): the vehicle’s age.
- pass (continuous): number of passengers in the SUV.
- tire (categorical): tire = 0: not tire related; tire = 1: tire related.

The research question we are interested in exploring here today is whether the fatal SUV accidents were related to Ford tires.
# load in the data
forddata <- read.table("fordexp.txt",header=TRUE)
# the package gmodels has a good cross-tabulation function
# some of you may know it if you have used it for chi-square test
library(gmodels)
#start with ford vs. tire
gmodels::CrossTable(forddata$ford,forddata$tire,chisq = TRUE,expected=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2321
##
##
## | forddata$tire
## forddata$ford | 0 | 1 | Row Total |
## --------------|-----------|-----------|-----------|
## 0 | 1794 | 5 | 1799 |
## | 1778.072 | 20.928 | |
## | 0.143 | 12.122 | |
## | 0.997 | 0.003 | 0.775 |
## | 0.782 | 0.185 | |
## | 0.773 | 0.002 | |
## --------------|-----------|-----------|-----------|
## 1 | 500 | 22 | 522 |
## | 515.928 | 6.072 | |
## | 0.492 | 41.778 | |
## | 0.958 | 0.042 | 0.225 |
## | 0.218 | 0.815 | |
## | 0.215 | 0.009 | |
## --------------|-----------|-----------|-----------|
## Column Total | 2294 | 27 | 2321 |
## | 0.988 | 0.012 | |
## --------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 54.53411 d.f. = 1 p = 1.527732e-13
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 51.16398 d.f. = 1 p = 8.496355e-13
##
##
From the cross-tabulation, 4.2% of accidents involving Fords were tire related versus only 0.3% for other SUVs, and the chi-square test is significant (p < .001), so ford and tire are positively associated.
# now let's look at age vs. tire and pass vs. tire
# both are continuous, so a t-test is more appropriate
t.test(forddata$age ~ forddata$tire)
##
## Welch Two Sample t-test
##
## data: forddata$age by forddata$tire
## t = -5.8905, df = 26.966, p-value = 2.839e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.5643451 -0.7560398
## sample estimates:
## mean in group 0 mean in group 1
## 2.358326 3.518519
From the results, it seems that tire-related fatal accidents are associated with older vehicles.
t.test(forddata$pass ~ forddata$tire)
##
## Welch Two Sample t-test
##
## data: forddata$pass by forddata$tire
## t = -6.0953, df = 26.438, p-value = 1.797e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -2.385017 -1.182809
## sample estimates:
## mean in group 0 mean in group 1
## 0.9197908 2.7037037
Tire-related fatal accidents are associated with more passengers.
Logistic regression also assumes a binary outcome and independent observations; both can reasonably be assumed to hold here, since tire is 0/1 and each row is a separate accident.
One of the important assumptions of logistic regression is the linearity of the logit over the continuous covariates. When this assumption is not met, a Box-Tidwell transformation might be helpful.
The Box-Tidwell test checks this assumption by adding each predictor multiplied by its own natural log as an extra term and testing whether that addition improves prediction; if it does, the logit is not a linear function of the predictor.
A significant p-value in the Box-Tidwell test means that the linearity assumption is violated.
If a variable is indeed found to be non-linear, then we may need to incorporate higher-order polynomial terms for that variable in the regression analysis to capture the non-linearity.
We can run the Box-Tidwell test directly using the boxTidwell function in the car package:
library(car)
## Loading required package: carData
# Since you cannot take the logarithm of a negative number or zero, let's transform the two continuous variables first (just for the sake of checking the assumptions)
forddata$age_t <- forddata$age + 1
forddata$pass_t <- forddata$pass + 1
# Conduct the Box-Tidwell test
boxTidwell(formula = tire ~ age_t + pass_t,
data = forddata)
## MLE of lambda Score Statistic (t) Pr(>|t|)
## age_t 1.18339 0.2538 0.7996
## pass_t 0.98455 -0.0568 0.9547
##
## iterations = 6
##
## Score test for null hypothesis that all lambdas = 1:
## F = 0.033874, df = 2 and 2316, Pr(>F) = 0.9667
Both p-values are > 0.05 (not statistically significant), which means the logit is linear in both variables and the assumption holds.
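For intuition, a roughly equivalent check can be done by hand by adding each (shifted) predictor multiplied by its own natural log to the model. A minimal sketch (the I() terms and the bt_check name are just for illustration, not part of the original analysis):

# manual version of the Box-Tidwell check:
# a significant x*log(x) term signals non-linearity in the logit
bt_check <- glm(tire ~ age_t + pass_t + I(age_t * log(age_t)) + I(pass_t * log(pass_t)),
                data = forddata, family = "binomial")
summary(bt_check)$coefficients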
cor.test(forddata$age, forddata$pass)
##
## Pearson's product-moment correlation
##
## data: forddata$age and forddata$pass
## t = -2.5459, df = 2319, p-value = 0.01096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09328015 -0.01213335
## sample estimates:
## cor
## -0.0527939
The correlation between age and pass is small (r = −.05), so multicollinearity is not a concern here.
The remaining assumption, the absence of extreme outliers, can only be checked after the model is built. We will come back to this later.
# logistic regression with ford as only predictor
output <- glm(tire ~ ford, data=forddata, family = binomial(link = "logit"))
#note: link = logit is default for family = binomial so line below does same thing
output <- glm(tire ~ ford, data=forddata,family = "binomial")
summary(output)
##
## Call:
## glm(formula = tire ~ ford, family = "binomial", data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.8828 0.4478 -13.137 < 2e-16 ***
## ford 2.7592 0.4980 5.541 3.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.20 on 2320 degrees of freedom
## Residual deviance: 251.23 on 2319 degrees of freedom
## AIC: 255.23
##
## Number of Fisher Scoring iterations: 8
First thing to check is whether tire = 0 or tire = 1 is the predicted outcome.
Best way to check this is by comparing regression coefficient to descriptives.
Cross-tabulation showed that ford and tire were positively associated
Regression coefficient is positive, so we can deduce that the logistic regression is predicting tire = 1.
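As a quick sanity check (a sketch, not part of the original output): with a single binary predictor, the fitted probabilities should reproduce the observed cell proportions from the cross-tabulation (about .003 for non-Fords and .042 for Fords).

# fitted P(tire = 1) within each level of ford
tapply(output$fitted.values, forddata$ford, mean)
# observed proportion of tire = 1 within each level of ford
tapply(forddata$tire, forddata$ford, mean)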
anova(update(output,~1),output,test="Chisq") #the first part of this line runs the null model with no predictors
## Analysis of Deviance Table
##
## Model 1: tire ~ 1
## Model 2: tire ~ ford
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2320 294.20
## 2 2319 251.23 1 42.964 5.577e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is a likelihood ratio test. The test statistic is \(\Delta\) chi-square = 42.964.
The sampling distribution is a chi-square distribution with df = # predictors = 1 for this test.
p-value = .00000000005577, which is far below .05.
So this set of predictors (only 1 here, but that is the general interpretation) significantly predicts whether a fatal accident is tire related.
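The same test can be computed by hand from the two deviances reported in the model summary; a minimal sketch:

# deviance difference = likelihood ratio chi-square
dev_diff <- output$null.deviance - output$deviance   # 294.20 - 251.23 = 42.964
# p-value from the chi-square distribution with df = # predictors = 1
pchisq(dev_diff, df = 1, lower.tail = FALSE)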
#pseudo-Rsquared
rcompanion::nagelkerke(output)
## Registered S3 method overwritten by 'DescTools':
## method from
## reorder.factor gdata
## $Models
##
## Model: "glm, tire ~ ford, binomial, forddata"
## Null: "glm, tire ~ 1, binomial, forddata"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.1460370
## Cox and Snell (ML) 0.0183405
## Nagelkerke (Cragg and Uhler) 0.1540580
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -1 -21.482 42.964 5.5768e-11
##
## $Number.of.observations
##
## Model: 2321
## Null: 2321
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
First note: the likelihood ratio test comes out here too, so you didn’t have to run the previous chunk of code.
Provides three pseudo-Rsquared estimates
My preference is Nagelkerke and not just because it’s a great name
Tends to perform better than other estimates of pseudo-Rsquared
Nagelkerke Rsquared is .154, meaning …
meaning…
what on earth does that mean?
Could interpret like a regular Rsquared
15.4% of variance in outcome accounted for by predictor(s)
but what does the variance in the dichotomous outcome mean?
My recommendation: just say the Nagelkerke pseudo-Rsquared was .154 and don’t try to interpret any more than that
But size of pseudo-Rsquared can be thought of as about the same magnitude as a regular Rsquared
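If you are curious where the Nagelkerke value comes from, it can be reproduced by hand from the model and null log-likelihoods. A sketch (the r2_cs and r2_max names are just for illustration):

ll1 <- as.numeric(logLik(output))              # log-likelihood of the model
ll0 <- as.numeric(logLik(update(output, ~1)))  # log-likelihood of the null model
n <- nobs(output)
r2_cs <- 1 - exp(2 * (ll0 - ll1) / n)          # Cox and Snell
r2_max <- 1 - exp(2 * ll0 / n)                 # its maximum possible value
r2_cs / r2_max                                 # Nagelkerke: about .154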
# null hypothesis: regression coefficient is 0
summary(output)
##
## Call:
## glm(formula = tire ~ ford, family = "binomial", data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.8828 0.4478 -13.137 < 2e-16 ***
## ford 2.7592 0.4980 5.541 3.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.20 on 2320 degrees of freedom
## Residual deviance: 251.23 on 2319 degrees of freedom
## AIC: 255.23
##
## Number of Fisher Scoring iterations: 8
This is a Wald test
Test statistic is z value = 5.541
Sampling distribution is standard normal distribution
p-value is .0000000301, which is <.05, so reject null hypothesis
Note that this statistical test asks an equivalent question to the first question above (if and only if there is exactly one predictor), but the numbers are a bit different (the p-value in particular).
That is because the test is different: likelihood ratio test vs. Wald test.
Those tests are mathematically identical if the sample size is infinite but will be somewhat different with finite sample sizes.
In linear regression, this is answered with the regression coefficient
In logistic regression, this can also be answered with the regression coefficient
b1 = 2.7592: a one unit difference in ford = ford vs. not ford is associated with a 2.7592 unit difference in what?
Not directly predicting the outcome variable
Instead, predicting the logit (see the lecture)
So ford vs. not ford is associated with a 2.7592 logit difference in the outcome, which is just incomprehensible
Let’s convert to an odds ratio
exp(output$coefficients)
## (Intercept) ford
## 0.002787068 15.787199819
Odds ratio is > 1, so interpret as: ford vs. not ford is associated with a fatal accident being 15.79 times more likely to be tire related.
Recognizing that this interpretation is not precisely correct mathematically but close enough
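To see what sits behind the odds ratio, we can convert the predicted logits to probabilities with plogis(); a quick sketch using the coefficients above:

plogis(-5.8828)            # predicted P(tire = 1) for not a ford: about .003
plogis(-5.8828 + 2.7592)   # predicted P(tire = 1) for a ford: about .042
exp(2.7592)                # the odds ratio itself: about 15.79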
Confidence intervals are valuable with odds ratios because of asymmetry
exp(confint(output))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.0009983947 0.005999684
## ford 6.4353315185 47.327506943
Confidence interval for odds ratio for ford is 95%CI=(6.44,47.33)
Confidence intervals are impossible to interpret correctly, so don’t bother
Just report the confidence interval and let the reader misinterpret it without your help
Recall that a regression coefficient is significant if its confidence interval does not contain 0
For odds ratios, the odds ratio is significant if its confidence interval does not contain 1, because an odds ratio of 1 means the odds do not change at all.
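For comparison, here is a sketch of the Wald-type interval, which is symmetric on the logit scale before exponentiating; confint() above uses the profile likelihood, which is generally preferred for logistic regression.

# Wald-type confidence intervals, exponentiated to the odds ratio scale
exp(confint.default(output))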
output2 <- glm(tire ~ ford + pass + age, data=forddata,family = "binomial")
rcompanion::nagelkerke(output2)
## $Models
##
## Model: "glm, tire ~ ford + pass + age, binomial, forddata"
## Null: "glm, tire ~ 1, binomial, forddata"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.3351220
## Cox and Snell (ML) 0.0415885
## Nagelkerke (Cragg and Uhler) 0.3493370
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -3 -49.296 98.592 3.1212e-21
##
## $Number.of.observations
##
## Model: 2321
## Null: 2321
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
Likelihood ratio test: chi-square = 98.592, df = 3, p < .001
Nagelkerke pseudo-Rsquared = .349
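Since the one-predictor model is nested within this one, we can also test directly whether adding pass and age improves on the ford-only model; a sketch:

# likelihood ratio test comparing the nested models
anova(output, output2, test = "Chisq")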
summary(output2)
##
## Call:
## glm(formula = tire ~ ford + pass + age, family = "binomial",
## data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.5227 0.8887 -10.715 < 2e-16 ***
## ford 2.8614 0.5189 5.515 3.49e-08 ***
## pass 0.6277 0.1042 6.025 1.69e-09 ***
## age 0.8499 0.1784 4.765 1.89e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.2 on 2320 degrees of freedom
## Residual deviance: 195.6 on 2317 degrees of freedom
## AIC: 203.6
##
## Number of Fisher Scoring iterations: 9
All three predictors are significant predictors of the outcome, p < .001 for all three
exp(output2$coefficients)
## (Intercept) ford pass age
## 7.317474e-05 1.748663e+01 1.873297e+00 2.339482e+00
# odds ratios are reported using scientific notation
# you can turn off scientific notation with the function:
options(scipen=999)
Now recalculate odds ratios without scientific notation
exp(output2$coefficients)
## (Intercept) ford pass age
## 0.00007317474 17.48662848401 1.87329694493 2.33948197494
Ford vs. not Ford is associated with a fatal accident being 17.49 times more likely to be tire related.
An SUV with one additional passenger is associated with a fatal accident being 1.87 times more likely to be tire related.
An SUV that is one year older is associated with a fatal accident being 2.34 times more likely to be tire related.
exp(confint(output2))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.00001080952 0.0003597715
## ford 6.83538720161 54.3944821049
## pass 1.52628402660 2.3073202681
## age 1.67400962002 3.3864224628
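Beyond odds ratios and their confidence intervals, we can also ask the model for the predicted probability of a specific profile. A sketch with a hypothetical case (newcase and its values are made up for illustration):

# a ford, two passengers, four years old
newcase <- data.frame(ford = 1, pass = 2, age = 4)
predict(output2, newdata = newcase, type = "response")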
To get standardized coefficients, use the beta.glm function from the reghelper package, which is called using just the function name beta.
library(reghelper)
##
## Attaching package: 'reghelper'
## The following object is masked from 'package:base':
##
## beta
beta(output2)
##
## Call:
## glm(formula = "tire ~ ford.z + pass.z + age.z", family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.2729 0.4705 -13.332 < 0.0000000000000002 ***
## ford.z 1.1950 0.2167 5.515 0.00000003494 ***
## pass.z 0.8130 0.1349 6.025 0.00000000169 ***
## age.z 1.0806 0.2268 4.765 0.00000188719 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.2 on 2320 degrees of freedom
## Residual deviance: 195.6 on 2317 degrees of freedom
## AIC: 203.6
##
## Number of Fisher Scoring iterations: 9
exp(beta(output2)$coefficients)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.001886836 1.600825 0.000001622122 1.000000
## ford.z 3.303425772 1.241954 248.315938025537 1.000000
## pass.z 2.254630307 1.144468 413.560307328185 1.000000
## age.z 2.946420358 1.254541 117.347997697031 1.000002
The standardized odds ratios are interpreted like you would expect.
A one-SD difference in SUV age is associated with a fatal accident being 2.95 times more likely to be tire related.
(Only the Estimate column is meaningful after exponentiating; ignore the exponentiated Std. Error, z value, and p-value columns.)
Note that all of these predictors have meaningful units (ford yes/no, # of passengers, SUV age in years), so standardization reduces interpretability here.
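The same standardized coefficients can be obtained by hand by z-scoring the predictors and refitting; a minimal sketch (forddata_z is a new helper copy of the data):

forddata_z <- forddata
forddata_z[c("ford", "pass", "age")] <- scale(forddata_z[c("ford", "pass", "age")])
# coefficients should match the .z estimates from beta() above
coef(glm(tire ~ ford + pass + age, data = forddata_z, family = "binomial"))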
output3 <- glm(tire ~ ford * age + pass, data=forddata,family = "binomial")
rcompanion::nagelkerke(output3)
## $Models
##
## Model: "glm, tire ~ ford * age + pass, binomial, forddata"
## Null: "glm, tire ~ 1, binomial, forddata"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.3358150
## Cox and Snell (ML) 0.0416727
## Nagelkerke (Cragg and Uhler) 0.3500440
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -4 -49.398 98.795 0.000000000000000000017753
##
## $Number.of.observations
##
## Model: 2321
## Null: 2321
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
summary(output3)
##
## Call:
## glm(formula = tire ~ ford * age + pass, family = "binomial",
## data = forddata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.0087 1.3804 -6.526 0.0000000000675 ***
## ford 2.2099 1.4915 1.482 0.1384
## age 0.7064 0.3586 1.970 0.0489 *
## pass 0.6250 0.1038 6.023 0.0000000017132 ***
## ford:age 0.1857 0.4079 0.455 0.6489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.2 on 2320 degrees of freedom
## Residual deviance: 195.4 on 2316 degrees of freedom
## AIC: 205.4
##
## Number of Fisher Scoring iterations: 9
The interaction is nonsignificant, but now ford is nonsignificant too. Is a model without both of those variables OK?
Run a model without ford and the ford × age interaction and compare:
output4 <- glm(tire ~ age + pass, data=forddata,family = "binomial")
anova(output4,output3,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: tire ~ age + pass
## Model 2: tire ~ ford * age + pass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2318 237.78
## 2 2316 195.40 2 42.384 0.0000000006258 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is a likelihood ratio test
Important note: this test is only valid if the model is estimated with maximum likelihood
\(\Delta\) chi-square = 42.384
df = difference in # of predictors = 2
p-value = .0000000006258 < .05
Dropping these two variables is a bad idea
The reason why neither is statistically significant on its own is multicollinearity
car::vif(output3)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## ford age pass ford:age
## 8.416367 4.215465 1.065242 11.515629
exp(output3$coefficients)
## (Intercept) ford age pass ford:age
## 0.0001223381 9.1151322621 2.0267154387 1.8682078175 1.2040951169
The odds ratio for ford is still big, even though it is not significant because of multicollinearity.
exp(confint(output3))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.000004773959 0.001220979
## ford 0.619149898529 258.699350837
## age 1.022171208346 4.364051479
## pass 1.523626099989 2.300408524
## ford:age 0.518000573628 2.656620392
Note that confidence intervals for odds ratios for ford and interaction contain 1
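Since the collinearity here comes from forming the ford:age product, one common remedy is to center age before building the interaction. A sketch (age_c and output3c are new helper names); this leaves the model fit unchanged but usually lowers the VIFs:

forddata$age_c <- forddata$age - mean(forddata$age)
output3c <- glm(tire ~ ford * age_c + pass, data = forddata, family = "binomial")
car::vif(output3c)   # VIFs for ford and the interaction should drop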
Since we have our model at hand, we can check whether there are any outliers in the fitted model.
# examine cases with large residuals
forddata[abs(output3$residuals) > 3,]
## ford age pass tire age_t pass_t
## 2295 0 3 2 1 4 3
## 2296 0 3 1 1 4 2
## 2297 0 5 1 1 6 2
## 2301 1 5 1 1 6 2
## 2302 1 2 4 1 3 5
## 2303 1 3 3 1 4 4
## 2304 1 3 3 1 4 4
## 2305 1 4 3 1 5 4
## 2306 1 4 3 1 5 4
## 2307 1 4 3 1 5 4
## 2308 1 4 3 1 5 4
## 2309 1 4 1 1 5 2
## 2310 1 4 1 1 5 2
## 2312 1 4 4 1 5 5
## 2313 0 3 0 1 4 1
## 2314 0 3 2 1 4 3
## 2315 1 3 4 1 4 5
## 2316 1 3 4 1 4 5
## 2317 1 3 4 1 4 5
## 2318 1 3 0 1 4 1
## 2319 1 3 2 1 4 3
## 2320 1 1 5 1 2 6
## 2321 1 2 2 1 3 3
Do not be alarmed by the results.
In our example, all cases with an outcome value of 1 have large residuals. However, we know that they are NOT outliers.
Back when this issue with Ford Explorers happened, statistical analyses far more sophisticated than these showed definitively that there was a problem with Ford Explorers, especially with Firestone tires
Ford recalled all the affected Ford Explorers
Statistics really can be a matter of life or death
Our ford dataset might not be ideally suited for this specific purpose, as relying solely on the variables within it makes it challenging to accurately determine if a fatal accident is tire related. However, for illustrative purposes, we’ll proceed with using our dataset to see how effectively our model can predict whether an accident is tire-related.
Let’s create a training data and a testing data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
#make this example reproducible
set.seed(123)
#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(forddata), replace=TRUE, prob=c(0.7,0.3))
dta_train <- forddata %>%
filter(sample)
dta_test <- forddata %>%
filter(!sample)
nrow(dta_train) # contains 70% of the data
## [1] 1633
Now let’s examine the model we built earlier but just apply it to the training data.
output3 <- glm(tire ~ ford * age + pass, data=dta_train,family = "binomial")
The estimated probability of success can be found as follows:
head(output3$fitted.values)
## 1 2 3 4 5 6
## 0.0002478010 0.0002478010 0.0014306459 0.0002478010 0.0002478010 0.0007976755
# Testing data
dta_test$prob_predicted <- predict(object = output3, newdata = dta_test, type = "response")
head(dta_test$prob_predicted)
## [1] 0.0002478010 0.0002478010 0.0002478010 0.0007976755 0.0002478010
## [6] 0.0004446299
Using a 0.5 cut-off probability, we obtain the following accuracy levels:
dta_test$predictions <- ifelse(dta_test$prob_predicted > 0.5, 1, 0)
# Model accuracy
mean(dta_test$predictions==dta_test$tire)
## [1] 0.9898256
table(dta_test$predictions, dta_test$tire)
##
## 0 1
## 0 681 7
Let’s examine the sensitivity and specificity for the results.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Assuming 'actual' and 'predicted' as defined above
confusionMatrix <- confusionMatrix(as.factor(dta_test$predictions), as.factor(dta_test$tire))
## Warning in confusionMatrix.default(as.factor(dta_test$predictions),
## as.factor(dta_test$tire)): Levels are not in the same order for reference and
## data. Refactoring data to match.
# Print the confusion matrix and statistics
print(confusionMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 681 7
## 1 0 0
##
## Accuracy : 0.9898
## 95% CI : (0.9791, 0.9959)
## No Information Rate : 0.9898
## P-Value [Acc > NIR] : 0.59872
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9898
## Neg Pred Value : NaN
## Prevalence : 0.9898
## Detection Rate : 0.9898
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
# Access specific measures
sensitivity <- confusionMatrix$byClass['Sensitivity']
specificity <- confusionMatrix$byClass['Specificity']
sensitivity
## Sensitivity
## 1
specificity
## Specificity
## 0
Although we achieved high accuracy, the count table reveals that the model classifies every case as non-tire related. This skew is attributable to the training data, which contains a disproportionately high number of non-tire cases. The same picture emerges from the sensitivity and specificity values: the sensitivity is a perfect 1, but the specificity is 0.
To address this imbalance and improve the model’s ability to distinguish between tire and non-tire related cases, adjusting the cutoff probability value may be necessary.
dta_test$predictions <- ifelse(dta_test$prob_predicted > 0.3, 1, 0)
# Model accuracy
mean(dta_test$predictions==dta_test$tire)
## [1] 0.9883721
table(dta_test$predictions, dta_test$tire)
##
## 0 1
## 0 678 5
## 1 3 2
Let’s examine the sensitivity and specificity again.
# Assuming 'actual' and 'predicted' as defined above
confusionMatrix <- confusionMatrix(as.factor(dta_test$predictions), as.factor(dta_test$tire))
# Print the confusion matrix and statistics
print(confusionMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 678 5
## 1 3 2
##
## Accuracy : 0.9884
## 95% CI : (0.9772, 0.995)
## No Information Rate : 0.9898
## P-Value [Acc > NIR] : 0.7298
##
## Kappa : 0.3276
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.9956
## Specificity : 0.2857
## Pos Pred Value : 0.9927
## Neg Pred Value : 0.4000
## Prevalence : 0.9898
## Detection Rate : 0.9855
## Detection Prevalence : 0.9927
## Balanced Accuracy : 0.6407
##
## 'Positive' Class : 0
##
# Access specific measures
sensitivity <- confusionMatrix$byClass['Sensitivity']
specificity <- confusionMatrix$byClass['Specificity']
sensitivity
## Sensitivity
## 0.9955947
specificity
## Specificity
## 0.2857143
The specificity has improved, though it still falls short of being satisfactory. This outcome was anticipated, as our dataset was not specifically constructed for predictive purposes.
Let’s now examine the ROC plot.
# install.packages("pROC")
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
pROC_obj <- roc(dta_test$tire, dta_test$predictions,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
plot(sens.ci, type="bars")
I like the ROC plot generated by the pROC package because it is pretty easy to get confidence intervals for the Area Under the Curve (AUC) on the plot.
From the plot we can easily see that the ROC is far from satisfactory:
The AUC (Area Under the Curve) is a measure of the overall performance of the classification model and represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUC is provided as 0.641 with a confidence interval of (0.460-0.821). This suggests that the model has moderate discriminative ability, with a wide confidence interval indicating considerable uncertainty in the estimate.
Sensitivity and Specificity Balance: The “sweet spot” or the optimal threshold is the point on the curve that balances sensitivity and specificity to a satisfactory level based on the costs of false positives and false negatives in the specific application. This is often considered to be the point closest to the top-left corner of the plot, where both sensitivity and specificity are highest.
In this particular plot, the shaded area suggests a focus on a range of thresholds, where the upper left boundary of the shaded area is closer to the “sweet spot”.
In practice, the sweet spot is determined not only by the ROC curve but also by considering the costs and consequences of false positives and false negatives in the context where the model is applied. For some applications, a higher sensitivity is preferred even at the cost of lower specificity, and vice versa.
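One refinement worth noting: the ROC above was built from the 0/1 classifications, so the curve has only a single interior point. Building it from the predicted probabilities instead lets the curve trace out every possible cutoff. A sketch (pROC_prob is a new helper name):

pROC_prob <- roc(dta_test$tire, dta_test$prob_predicted,
                 ci = TRUE, plot = TRUE, print.auc = TRUE)
# candidate cutoff that balances sensitivity and specificity (Youden's index)
coords(pROC_prob, "best", ret = c("threshold", "sensitivity", "specificity"))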