Modified from the r-statistics tutorial: http://r-statistics.co/Logistic-Regression-With-R.html#Example%20Problem

Load the IST Dataset (International Stroke Trial Database)

https://datashare.is.ed.ac.uk/handle/10283/128

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.1
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/MCuser/Dropbox/Rachel/MontColl/Datasets/Datasets")
IST <- read_csv("IST.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   AGE = col_double(),
##   DDEADC = col_double(),
##   FDEADC = col_double(),
##   DIED = col_double(),
##   EXPDD = col_double(),
##   EXPD6 = col_double(),
##   EXPD14 = col_double(),
##   SET14D = col_double(),
##   ID14 = col_double(),
##   OCCODE = col_double()
## )
## i Use `spec()` for the full column specifications.
names(IST)
##  [1] "SEX"      "AGE"      "RHEP24"   "RASP3"    "RXASP"    "RXHEP"   
##  [7] "DASP14"   "DASPLT"   "DLH14"    "DMH14"    "DHH14"    "DDIAGISC"
## [13] "DDIAGHA"  "DDIAGUN"  "DNOSTRK"  "DNOSTRKX" "DRSISC"   "DRSH"    
## [19] "DRSUNK"   "DALIVE"   "DDEAD"    "DDEADC"   "FDEAD"    "FDEADC"  
## [25] "FRECOVER" "FDENNIS"  "CMPLASP"  "CMPLHEP"  "DIED"     "EXPDD"   
## [31] "EXPD6"    "EXPD14"   "SET14D"   "ID14"     "OCCODE"
attach(IST)

Create a table of survival by sex

table(SEX, DIED)
##    DIED
## SEX    0    1
##   F 6751 2277
##   M 8314 2093
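
To compare death rates between the sexes directly, the counts can be converted to row proportions. A quick extra check (not part of the original tutorial) using base R's prop.table():

prop.table(table(SEX, DIED), margin = 1)  # margin = 1: proportions within each sex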

Fit a logistic regression model

fit1 <- glm(DIED ~ AGE, family = binomial())
summary(fit1)
## 
## Call:
## glm(formula = DIED ~ AGE, family = binomial())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3894  -0.7567  -0.5723  -0.2762   2.9779  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.357477   0.148920  -42.69   <2e-16 ***
## AGE          0.069123   0.001944   35.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20717  on 19434  degrees of freedom
## Residual deviance: 19152  on 19433  degrees of freedom
## AIC: 19156
## 
## Number of Fisher Scoring iterations: 5

Result of the first model

log(p/(1-p)) = -6.357 + 0.069(AGE)

This gives the odds and probability of dying for a 55-year-old:

log(p/(1-p)) = -6.357 + 0.069(55)

p/(1-p) = exp(-6.357 + 0.069(55))

p/(1-p) = 0.077 # odds of dying for a 55-year-old in this population

p = 0.077/1.077 = 0.072 # probability of dying for a 55-year-old in this population

According to this model, a 55-year-old in this population has a predicted probability of dying of about 0.072.
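
As a sanity check, these numbers can be computed directly from the fitted model; predict() with type = "response" returns the probability in one step:

odds_55 <- exp(sum(coef(fit1) * c(1, 55)))  # odds of dying at age 55
odds_55 / (1 + odds_55)                     # convert odds to probability, ~0.072
predict(fit1, newdata = data.frame(AGE = 55), type = "response")  # same probability via predict()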

Plot the data overlaid with the logistic curve

newdat <- data.frame(AGE=seq(min(IST$AGE), max(IST$AGE),len=200))
newdat$DIED <-  predict(fit1, newdata=newdat, type="response")
plot(DIED~AGE, data=IST, col="red4")
lines(DIED ~ AGE, newdat, col="green4", lwd=2)
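
Since the tidyverse is already loaded, an alternative sketch of the same plot uses ggplot2 and lets stat_smooth() fit the logistic curve itself:

ggplot(IST, aes(AGE, DIED)) +
  geom_point(color = "red4") +
  stat_smooth(method = "glm", method.args = list(family = binomial),
              se = FALSE, color = "green4")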

Try another model, adding aspirin and heparin allocation

fit2 <- glm(DIED ~ AGE + RXASP + RXHEP, family = binomial())
summary(fit2)
## 
## Call:
## glm(formula = DIED ~ AGE + RXASP + RXHEP, family = binomial())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3944  -0.7604  -0.5752  -0.2777   2.9973  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.453792   0.223243 -28.909   <2e-16 ***
## AGE          0.069131   0.001944  35.557   <2e-16 ***
## RXASPY      -0.060584   0.035771  -1.694   0.0903 .  
## RXHEPL       0.138109   0.171826   0.804   0.4215    
## RXHEPM       0.175975   0.171931   1.024   0.3061    
## RXHEPN       0.098077   0.169998   0.577   0.5640    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20717  on 19434  degrees of freedom
## Residual deviance: 19145  on 19429  degrees of freedom
## AIC: 19157
## 
## Number of Fisher Scoring iterations: 5

Neither aspirin nor heparin allocation seems to add much to the model
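
One way to make this judgment formal (a suggestion beyond the original tutorial) is a likelihood-ratio test between the nested models:

anova(fit1, fit2, test = "Chisq")  # does adding RXASP and RXHEP significantly reduce deviance?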

Try a third model with an interaction term between age and aspirin use

fit3 <- glm(DIED ~ AGE + AGE*RXASP + RXASP + RXHEP, family = binomial())
summary(fit3)
## 
## Call:
## glm(formula = DIED ~ AGE + AGE * RXASP + RXASP + RXHEP, family = binomial())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4242  -0.7595  -0.5762  -0.2745   2.9498  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.684932   0.270561 -24.708   <2e-16 ***
## AGE          0.072139   0.002777  25.973   <2e-16 ***
## RXASPY       0.394314   0.298046   1.323    0.186    
## RXHEPL       0.139976   0.171894   0.814    0.415    
## RXHEPM       0.178854   0.172005   1.040    0.298    
## RXHEPN       0.100509   0.170070   0.591    0.555    
## AGE:RXASPY  -0.005980   0.003890  -1.537    0.124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20717  on 19434  degrees of freedom
## Residual deviance: 19143  on 19428  degrees of freedom
## AIC: 19157
## 
## Number of Fisher Scoring iterations: 5

Again, this model is not an improvement
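
A quick AIC comparison tells the same story; lower AIC is better, and here all three fits are nearly identical:

AIC(fit1, fit2, fit3)  # ~19156, 19157, 19157 per the summaries above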

Check Class Bias

Ideally, the proportions of events and non-events in the Y variable should be approximately equal, so first check the proportion of classes in the dependent variable.

table(IST$DIED)
## 
##     0     1 
## 15065  4370

There is a class bias: the proportion of events (deaths) is much smaller than the proportion of non-events. So we must sample the observations in approximately equal proportions to get a better model.
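
The imbalance is easy to quantify:

prop.table(table(IST$DIED))  # about 22% deaths vs 78% survivors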

Create training and test datasets

# Create Training Data
input_ones <-IST[which(IST$DIED == 1), ]  # all 1's
input_zeros <- IST[which(IST$DIED == 0), ]  # all 0's
set.seed(100)  # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones))  # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones))  # 0's for training. Pick as many 0's as 1's
training_ones <- input_ones[input_ones_training_rows, ]  
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros)  # row bind the 1's and 0's 

# Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros)  # row bind the 1's and 0's 
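
It is worth verifying that the training set is now balanced and that the split did not lose any rows (a quick check):

table(trainingData$DIED)                           # equal counts of 0's and 1's
nrow(trainingData) + nrow(testData) == nrow(IST)   # TRUE: every row used exactly once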

Build Logit Models and Predict

Start by building the logit model using training data

logitMod <- glm(DIED ~ AGE + RXASP + RXHEP, data=trainingData, family=binomial(link="logit"))

predicted_train <- predict(logitMod, trainingData, type = "response") #predicted training scores

Now assess the logit model based on test data

predicted_test <- predict(logitMod, testData, type="response")  # predicted test scores
summary(logitMod)
## 
## Call:
## glm(formula = DIED ~ AGE + RXASP + RXHEP, family = binomial(link = "logit"), 
##     data = trainingData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8608  -1.1021   0.1983   1.0409   2.5197  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.552675   0.329627 -16.845  < 2e-16 ***
## AGE          0.066586   0.002797  23.805  < 2e-16 ***
## RXASPY      -0.051366   0.054133  -0.949  0.34268    
## RXHEPL       0.673639   0.258886   2.602  0.00927 ** 
## RXHEPM       0.696785   0.259070   2.690  0.00715 ** 
## RXHEPN       0.607996   0.256065   2.374  0.01758 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8481.3  on 6117  degrees of freedom
## Residual deviance: 7786.5  on 6112  degrees of freedom
## AIC: 7798.5
## 
## Number of Fisher Scoring iterations: 4

VIF

As with linear regression, we should check for multicollinearity in the model.

One way to measure multicollinearity is the variance inflation factor (VIF), which assesses how much the variance of an estimated regression coefficient increases when predictors are correlated. A VIF between 5 and 10 indicates high correlation that may be problematic, and if the VIF goes above 10, you can assume the regression coefficients are poorly estimated due to multicollinearity.
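
For a single predictor j, the VIF comes from regressing that predictor on all the others:

VIF_j = 1 / (1 - R_j^2)

where R_j^2 is the R-squared of that regression; uncorrelated predictors give R_j^2 = 0 and a VIF of 1, the minimum. (For terms with more than one degree of freedom, such as the four-level RXHEP, car::vif() reports a generalized VIF, which is what appears below.)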

As seen below, all X variables in the model have VIFs well below 4.

#install.packages("car")
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.1
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(logitMod)
##           GVIF Df GVIF^(1/(2*Df))
## AGE   1.000744  1        1.000372
## RXASP 1.000198  1        1.000099
## RXHEP 1.000648  3        1.000108

Misclassification Error

Misclassification error is the percentage mismatch between predicted and actual classes, irrespective of 1's or 0's. The lower the misclassification error, the better your model.
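
Conceptually this is just the proportion of observations whose predicted class (score above or below the cutoff) disagrees with the actual class. A minimal by-hand sketch at a 0.5 cutoff, which should agree with misClassError() below up to rounding:

mean(ifelse(predicted_test > 0.5, 1, 0) != testData$DIED)  # ~0.391 on the test data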

Calculating the Optimal Cutoff Threshold

#install.packages("InformationValue")
library(InformationValue)
# Decide on optimal prediction probability cutoff for the model
optCutOff <- optimalCutoff(testData$DIED, predicted_test)  # use test-set actuals to match predicted_test
optCutOff
## [1] 0.8415185

Now look at the misclassification error

Use either a generic 0.5 threshold or the optimal cutoff calculated above. The misclassification error is the proportion of observations whose predicted category differs from the observed one.

# Misclassification Error
misClassError(trainingData$DIED, predicted_train, threshold = optCutOff)
## [1] 0.5
misClassError(trainingData$DIED, predicted_train, threshold = 0.5)
## [1] 0.3598
misClassError(testData$DIED, predicted_test, threshold = 0.5)
## [1] 0.391
misClassError(testData$DIED, predicted_test, threshold = optCutOff)
## [1] 0.0985

Using the optimal cutoff calculated on the test data gives the smallest misclassification error, under 10%.

ROC

The receiver operating characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of the actual 1's as positives and fewer of the actual 0's as 1's. So the curve should rise steeply, indicating that the true positive rate (Y-axis) increases faster than the false positive rate (X-axis) as the cutoff decreases. The greater the area under the ROC curve, the better the predictive ability of the model.

plotROC(testData$DIED, predicted_test)
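
If only the number is needed, InformationValue also provides AUROC(), which (assuming the same actuals-then-scores argument order as its other functions) returns the area under the curve directly:

AUROC(testData$DIED, predicted_test)  # area under the ROC curve, ~0.70 here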

The above model has an area under the ROC curve of 69.87%, which is not great, but not terrible.

Concordance

Ideally, the model-calculated probability scores of all actual positives (the 1's) should be greater than the scores of all actual negatives (the 0's). Such a model is said to be perfectly concordant, and highly reliable. This can be measured by concordance and discordance.

In simpler words, of all possible pairs of one actual 1 and one actual 0, concordance is the percentage of pairs in which the score of the actual positive is greater than the score of the actual negative. For a perfect model this is 100%, so the higher the concordance, the better the quality of the model.

Concordance(testData$DIED, predicted_test)
## $Concordance
## [1] 0.6846417
## 
## $Discordance
## [1] 0.3153583
## 
## $Tied
## [1] 0
## 
## $Pairs
## [1] 15739866

The above model has a concordance of 68.5%, making it a modestly good model.

Specificity and Sensitivity

Sensitivity (or true positive rate) is the percentage of actual 1's correctly predicted by the model, while specificity is the percentage of actual 0's correctly predicted. Specificity can also be calculated as 1 − false positive rate.

Sensitivity = (number of actual 1's predicted as 1's) / (number of actual 1's)

Specificity = (number of actual 0's predicted as 0's) / (number of actual 0's)

sensitivity(testData$DIED, predicted_test)
## [1] 0.6628528
specificity(testData$DIED, predicted_test)
## [1] 0.6031151

The sensitivity and specificity are low, at 66.3% and 60.3% respectively.
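
Both quantities can also be read off a confusion matrix. InformationValue provides one; at the default 0.5 cutoff it matches the sensitivity() and specificity() defaults used above:

confusionMatrix(testData$DIED, predicted_test)  # columns are actuals, rows are predicteds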

Overall

Although this tutorial goes through many of the concepts involved in logistic regression, one of the take-aways is that there are many components involved both in building a model and in testing it.