Modified from the r-statistics tutorial: http://r-statistics.co/Logistic-Regression-With-R.html#Example%20Problem
Data (International Stroke Trial): https://datashare.is.ed.ac.uk/handle/10283/128
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.1
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
setwd("C:/Users/MCuser/Dropbox/Rachel/MontColl/Datasets/Datasets")
IST <- read_csv("IST.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## AGE = col_double(),
## DDEADC = col_double(),
## FDEADC = col_double(),
## DIED = col_double(),
## EXPDD = col_double(),
## EXPD6 = col_double(),
## EXPD14 = col_double(),
## SET14D = col_double(),
## ID14 = col_double(),
## OCCODE = col_double()
## )
## i Use `spec()` for the full column specifications.
names(IST)
## [1] "SEX" "AGE" "RHEP24" "RASP3" "RXASP" "RXHEP"
## [7] "DASP14" "DASPLT" "DLH14" "DMH14" "DHH14" "DDIAGISC"
## [13] "DDIAGHA" "DDIAGUN" "DNOSTRK" "DNOSTRKX" "DRSISC" "DRSH"
## [19] "DRSUNK" "DALIVE" "DDEAD" "DDEADC" "FDEAD" "FDEADC"
## [25] "FRECOVER" "FDENNIS" "CMPLASP" "CMPLHEP" "DIED" "EXPDD"
## [31] "EXPD6" "EXPD14" "SET14D" "ID14" "OCCODE"
attach(IST)
table(SEX, DIED)
## DIED
## SEX 0 1
## F 6751 2277
## M 8314 2093
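Row proportions make the death rates by sex easier to compare (a minimal sketch; output not shown):
prop.table(table(SEX, DIED), margin = 1) # proportion dying within each sex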
fit1 <- glm(DIED ~ AGE, family = binomial())
summary(fit1)
##
## Call:
## glm(formula = DIED ~ AGE, family = binomial())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3894 -0.7567 -0.5723 -0.2762 2.9779
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.357477 0.148920 -42.69 <2e-16 ***
## AGE 0.069123 0.001944 35.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20717 on 19434 degrees of freedom
## Residual deviance: 19152 on 19433 degrees of freedom
## AIC: 19156
##
## Number of Fisher Scoring iterations: 5
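Exponentiating the coefficients puts them on the odds-ratio scale, which is often easier to interpret (a minimal sketch; output not shown):
exp(coef(fit1)) # exp(0.069) is about 1.07: each extra year of age multiplies the odds of death by roughly 1.07
exp(confint(fit1)) # profile-likelihood 95% CIs on the odds-ratio scale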
The fitted model is
log(p/(1-p)) = -6.357 + 0.069(AGE)
So the odds and probability of dying for a 55-year-old are:
log(p/(1-p)) = -6.357 + 0.069(55) = -2.562
p/(1-p) = exp(-2.562) = 0.077 # odds of dying for a 55-year-old in this population
p = 0.077/1.077 = 0.072 # probability of dying for a 55-year-old in this population
So the model predicts that a 55-year-old in this population has a probability of dying of 0.072.
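The same number can be checked directly with predict() (a minimal sketch; output not shown, but it should match 0.072 up to rounding):
predict(fit1, newdata = data.frame(AGE = 55), type = "response") # predicted probability of death at age 55
plogis(sum(coef(fit1) * c(1, 55))) # equivalently, inverse-logit of the linear predictor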
newdat <- data.frame(AGE=seq(min(IST$AGE), max(IST$AGE), len=200)) # grid of ages spanning the data
newdat$DIED <- predict(fit1, newdata=newdat, type="response") # fitted probabilities on that grid
plot(DIED ~ AGE, data=IST, col="red4") # raw 0/1 outcomes
lines(DIED ~ AGE, newdat, col="green4", lwd=2) # fitted logistic curve
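Since the tidyverse is already loaded, a ggplot2 version is also easy; jittering helps with the heavily overplotted 0/1 outcomes (a sketch, not the original figure):
ggplot(IST, aes(x = AGE, y = DIED)) +
  geom_jitter(height = 0.02, alpha = 0.1, colour = "red4") + # raw outcomes, jittered
  geom_line(data = newdat, colour = "green4", size = 1) # fitted curve from fit1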
fit2 <- glm(DIED ~ AGE + RXASP + RXHEP, family = binomial())
summary(fit2)
##
## Call:
## glm(formula = DIED ~ AGE + RXASP + RXHEP, family = binomial())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3944 -0.7604 -0.5752 -0.2777 2.9973
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.453792 0.223243 -28.909 <2e-16 ***
## AGE 0.069131 0.001944 35.557 <2e-16 ***
## RXASPY -0.060584 0.035771 -1.694 0.0903 .
## RXHEPL 0.138109 0.171826 0.804 0.4215
## RXHEPM 0.175975 0.171931 1.024 0.3061
## RXHEPN 0.098077 0.169998 0.577 0.5640
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20717 on 19434 degrees of freedom
## Residual deviance: 19145 on 19429 degrees of freedom
## AIC: 19157
##
## Number of Fisher Scoring iterations: 5
Aspirin allocation does not appear to add to the model (p = 0.09), and none of the heparin levels are significant either.
fit3 <- glm(DIED ~ AGE + AGE*RXASP + RXASP + RXHEP, family = binomial())
summary(fit3)
##
## Call:
## glm(formula = DIED ~ AGE + AGE * RXASP + RXASP + RXHEP, family = binomial())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4242 -0.7595 -0.5762 -0.2745 2.9498
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.684932 0.270561 -24.708 <2e-16 ***
## AGE 0.072139 0.002777 25.973 <2e-16 ***
## RXASPY 0.394314 0.298046 1.323 0.186
## RXHEPL 0.139976 0.171894 0.814 0.415
## RXHEPM 0.178854 0.172005 1.040 0.298
## RXHEPN 0.100509 0.170070 0.591 0.555
## AGE:RXASPY -0.005980 0.003890 -1.537 0.124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20717 on 19434 degrees of freedom
## Residual deviance: 19143 on 19428 degrees of freedom
## AIC: 19157
##
## Number of Fisher Scoring iterations: 5
Again, this model is not an improvement: the age-by-aspirin interaction is not significant (p = 0.12) and the AIC is unchanged.
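A likelihood-ratio test makes the model comparison explicit (a minimal sketch; output not shown):
anova(fit1, fit2, fit3, test = "Chisq") # AGE only vs. + treatments vs. + interaction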
Ideally, the proportions of events and non-events in the response variable should be approximately equal, so first check the class proportions in the dependent variable.
table(IST$DIED)
##
## 0 1
## 15065 4370
There is a class bias, a condition observed when the proportion of events is much smaller than the proportion of non-events. So we sample the observations in approximately equal proportions to get better models.
# Create Training Data
input_ones <- IST[which(IST$DIED == 1), ] # all 1's
input_zeros <- IST[which(IST$DIED == 0), ] # all 0's
set.seed(100) # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones)) # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones)) # 0's for training. Pick as many 0's as 1's
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros) # row bind the 1's and 0's
# Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros) # row bind the 1's and 0's
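A quick check that the training set is now balanced (output not shown; by construction each class should contribute 3059 rows):
table(trainingData$DIED) # should show 3059 zeros and 3059 ones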
Start by building the logit model using training data
logitMod <- glm(DIED ~ AGE + RXASP + RXHEP, data=trainingData, family=binomial(link="logit"))
predicted_train <- predict(logitMod, trainingData, type = "response") #predicted training scores
Now assess the logit model based on test data
predicted_test <- predict(logitMod, testData, type="response") # predicted test scores
summary(logitMod)
##
## Call:
## glm(formula = DIED ~ AGE + RXASP + RXHEP, family = binomial(link = "logit"),
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8608 -1.1021 0.1983 1.0409 2.5197
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.552675 0.329627 -16.845 < 2e-16 ***
## AGE 0.066586 0.002797 23.805 < 2e-16 ***
## RXASPY -0.051366 0.054133 -0.949 0.34268
## RXHEPL 0.673639 0.258886 2.602 0.00927 **
## RXHEPM 0.696785 0.259070 2.690 0.00715 **
## RXHEPN 0.607996 0.256065 2.374 0.01758 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8481.3 on 6117 degrees of freedom
## Residual deviance: 7786.5 on 6112 degrees of freedom
## AIC: 7798.5
##
## Number of Fisher Scoring iterations: 4
As in linear regression, we should check the model for multicollinearity.
One way to measure multicollinearity is the variance inflation factor (VIF), which assesses how much the variance of an estimated regression coefficient increases when predictors are correlated. A VIF between 5 and 10 indicates high correlation that may be problematic, and if the VIF goes above 10 you can assume that the regression coefficients are poorly estimated because of multicollinearity.
As seen below, all X variables in the model have VIF well below 4.
#install.packages("car")
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.1
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(logitMod)
## GVIF Df GVIF^(1/(2*Df))
## AGE 1.000744 1 1.000372
## RXASP 1.000198 1 1.000099
## RXHEP 1.000648 3 1.000108
Misclassification error is the percentage mismatch between predicted and actual values, irrespective of 1's or 0's. The lower the misclassification error, the better the model.
#install.packages("InformationValue")
library(InformationValue)
# Decide on optimal prediction probability cutoff for the model
optCutOff <- optimalCutoff(testData$DIED, predicted_test) # optimal cutoff based on the test set
optCutOff
## [1] 0.8415185
Use either a generic 0.5 threshold cutoff or the optimal cutoff calculated above. A misclassification occurs when the predicted category differs from the observed one.
# Misclassification Error
misClassError(trainingData$DIED, predicted_train, threshold = optCutOff)
## [1] 0.5
misClassError(trainingData$DIED, predicted_train, threshold = 0.5)
## [1] 0.3598
misClassError(testData$DIED, predicted_test, threshold = 0.5)
## [1] 0.391
misClassError(testData$DIED, predicted_test, threshold = optCutOff)
## [1] 0.0985
When we used the optimal cutoff on the test data, we saw the smallest misclassification error, under 10%.
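For a fuller breakdown of those errors, InformationValue also provides a confusion matrix (a minimal sketch; output not shown):
confusionMatrix(testData$DIED, predicted_test, threshold = optCutOff) # cross-tab of predicted vs. actual classes at the chosen cutoff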
The receiver operating characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of the actual 1's as positives and fewer of the actual 0's as 1's. So the curve should rise steeply, indicating that the true positive rate (y-axis) increases faster than the false positive rate (x-axis) as the cutoff decreases. The greater the area under the ROC curve, the better the predictive ability of the model.
plotROC(testData$DIED, predicted_test)
The above model has an area under the ROC curve of 69.87%, which is not great, but not terrible.
Ideally, the model-calculated probability scores of all actual positives (the 1's) should be greater than the probability scores of all actual negatives (the 0's). Such a model is said to be perfectly concordant, and is highly reliable. This can be measured by concordance and discordance.
In simpler terms, of all possible pairs of one actual 1 and one actual 0, concordance is the percentage of pairs in which the score of the actual positive is greater than the score of the actual negative. For a perfect model this will be 100%. So, the higher the concordance, the better the quality of the model.
Concordance(testData$DIED, predicted_test)
## $Concordance
## [1] 0.6846417
##
## $Discordance
## [1] 0.3153583
##
## $Tied
## [1] 0
##
## $Pairs
## [1] 15739866
The above model has a concordance of 68.5%, making it a modestly good model.
Sensitivity (or true positive rate) is the percentage of actual 1's correctly predicted by the model, while specificity is the percentage of actual 0's correctly predicted. Specificity can also be calculated as 1 - false positive rate.
Sensitivity = (number of actual 1's predicted as 1's) / (number of actual 1's)
Specificity = (number of actual 0's predicted as 0's) / (number of actual 0's)
sensitivity(testData$DIED, predicted_test)
## [1] 0.6628528
specificity(testData$DIED, predicted_test)
## [1] 0.6031151
The sensitivity and specificity are fairly low, at 66.3% and 60.3% respectively.
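Note that sensitivity() and specificity() default to a 0.5 cutoff; both accept a threshold argument if you want them at the optimal cutoff instead (a minimal sketch; output not shown):
sensitivity(testData$DIED, predicted_test, threshold = optCutOff) # true positive rate at the optimal cutoff
specificity(testData$DIED, predicted_test, threshold = optCutOff) # true negative rate at the optimal cutoff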
Although this tutorial touches on many of the concepts involved in logistic regression, one takeaway is that building a model and then testing it each involve many components.