To summarize:
Logistic regression, like any regression analysis, tries to predict an outcome variable based on its statistical relationships to several predictor or independent variables of interest.
The hallmark of logistic regression is that the outcome variable is always binary, or dichotomous. For example, being anemic versus not being anemic, having hypertension versus having normal blood pressure, or contracting malaria versus being malaria free. If your outcome variable is numerical, for example, glycated hemoglobin (A1C) levels , it is helpful to dichotomize it into ‘normal’ and ‘pathologic’ based on an informed threshold value to facilitate logistic regression analysis— A1C levels below 5.7 percent could be considered ‘normal’ while values above 5.7 percent can be considered ‘pathologic’, to include the prediabetic and diabetic stages.
There are three types of logistic regression— descriptive, predictive and effect-size regression models. The most commonly used model in epidemioloical studies is the effect-size model. However, for the purposes of this exercise, I will build a predictive model. Unlike linear regression where the outcome variable can assume any numerical value, in logistic regression, the outcome is a probability between 0 to 1.
The assumptions and basic principles of linear regression apply.
I will apply my knowledge of logistic regression to analyze a dataset containing characteristics of a 110,000 patients from Brazil and determine if I can predict whether they are going to miss a doctor’s appointment or not. Unlike my previous uploads, this document is not intended to be a comprehensive lesson, but is meant to serve as a record of my skill and a personal reference for future projects. That said, you may still find value in following along to the code. I would like to thank Bharatendra Rai for his instructional videos on this topic.
library("ROCR")
library("knitr")
library("caret")
The data is available here at Kaggle. The website might prompt you to log in— you can enter your Google or Facebook credentials to gain access to the dataset. Make sure you download and extract it to your working directory.
noshow <- read.csv("./noshowappointments/No_Show.csv")
Let us take a look at the data.
head(noshow)
## PatientId AppointmentID Gender ScheduledDay
## 1 2.987250e+13 5642903 F 2016-04-29T18:38:08Z
## 2 5.589978e+14 5642503 M 2016-04-29T16:08:27Z
## 3 4.262962e+12 5642549 F 2016-04-29T16:19:04Z
## 4 8.679512e+11 5642828 F 2016-04-29T17:29:31Z
## 5 8.841186e+12 5642494 F 2016-04-29T16:07:23Z
## 6 9.598513e+13 5626772 F 2016-04-27T08:36:51Z
## AppointmentDay Age Neighbourhood Scholarship Hipertension
## 1 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1
## 2 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0
## 3 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0
## 4 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0
## 5 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1
## 6 2016-04-29T00:00:00Z 76 REPÚBLICA 0 1
## Diabetes Alcoholism Handcap SMS_received No.show
## 1 0 0 0 0 No
## 2 0 0 0 0 No
## 3 0 0 0 0 No
## 4 0 0 0 0 No
## 5 1 0 0 0 No
## 6 0 0 0 0 No
What do the variables mean?
The data documentation is not very clear on the criteria of classification used for certain variables (for example, what determines if a person is considered to be positive for alcoholism?), but for the purpose of building a simple predictive logistic regression model, we do not need to delve into these issues.
Two issues that are often problematic to a logistic regression model are variable classes and missing values.
str(noshow)
## 'data.frame': 110527 obs. of 14 variables:
## $ PatientId : num 2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
## $ AppointmentID : int 5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
## $ ScheduledDay : Factor w/ 103549 levels "2015-11-10T07:13:56Z",..: 27742 27504 27539 27709 27498 20074 21386 21496 24945 20895 ...
## $ AppointmentDay: Factor w/ 27 levels "2016-04-29T00:00:00Z",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
## $ Scholarship : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hipertension : int 1 0 0 0 1 1 0 0 0 0 ...
## $ Diabetes : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Alcoholism : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Handcap : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SMS_received : int 0 0 0 0 0 0 0 0 0 0 ...
## $ No.show : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
What do you see?
Let’s fix these issues.
#changing the date/time variables to an appropriate character format.
noshow$AppointmentDay <- format(as.Date(noshow$AppointmentDay,
format = "%Y-%m-%d"), "%Y/%m/%d")
noshow$ScheduledDay <- format(as.Date(noshow$ScheduledDay,
format = "%Y-%m-%d"), "%Y/%m/%d")
#creating a new variable to capture time between the phone call and actual appointment
noshow$Time <- as.Date(noshow$AppointmentDay) - as.Date(noshow$ScheduledDay)
noshow$Time <- as.integer(noshow$Time)
#An new column with the time in days between the phone call and the
#actual date of the created appointment.
#Changing the required columns to factors.
cols_to_change <- c(8, 9, 10, 11, 12, 13)
noshow[, cols_to_change] <- lapply(noshow[, cols_to_change], factor)
Let’s take a look at the summary.
summary(noshow)
## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 F:71840 Length:110527
## 1st Qu.:4.173e+12 1st Qu.:5640286 M:38687 Class :character
## Median :3.173e+13 Median :5680573 Mode :character
## Mean :1.475e+14 Mean :5675305
## 3rd Qu.:9.439e+13 3rd Qu.:5725524
## Max. :1.000e+15 Max. :5790484
##
## AppointmentDay Age Neighbourhood Scholarship
## Length:110527 Min. : -1.00 JARDIM CAMBURI : 7717 0:99666
## Class :character 1st Qu.: 18.00 MARIA ORTIZ : 5805 1:10861
## Mode :character Median : 37.00 RESISTÊNCIA : 4431
## Mean : 37.09 JARDIM DA PENHA: 3877
## 3rd Qu.: 55.00 ITARARÉ : 3514
## Max. :115.00 CENTRO : 3334
## (Other) :81849
## Hipertension Diabetes Alcoholism Handcap SMS_received No.show
## 0:88726 0:102584 0:107167 0:108286 0:75045 No :88208
## 1:21801 1: 7943 1: 3360 1: 2042 1:35482 Yes:22319
## 2: 183
## 3: 13
## 4: 3
##
##
## Time
## Min. : -6.00
## 1st Qu.: 0.00
## Median : 4.00
## Mean : 10.18
## 3rd Qu.: 15.00
## Max. :179.00
##
Notice the Age column. There is atleast one row where the minimum age is -1. This is not acceptable. Let’s get rid of this and replace it with the mean of the remaining ages.
row_to_replace <- which(noshow$Age < 0 )
noshow[row_to_replace, 6] <- 37.09 # 6 is the Age column.
Let’s write a function to give us the percentage of NAs in each column.
perc <- function(x){sum(is.na(x))/length(x) *100 }
apply(noshow, 2, perc)
## PatientId AppointmentID Gender ScheduledDay AppointmentDay
## 0 0 0 0 0
## Age Neighbourhood Scholarship Hipertension Diabetes
## 0 0 0 0 0
## Alcoholism Handcap SMS_received No.show Time
## 0 0 0 0 0
This is great! There are no missing values. We can proceed with cross tabulations.
In this section, we will cross-tabulate our factor predictors with our outcome of interest to ensure that there are sufficient observations in each column.
kable(xtabs(~ No.show + Handcap, data = noshow ), caption = "Severity of Disability Versus No-Show")
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| No | 86374 | 1676 | 146 | 10 | 2 |
| Yes | 21912 | 366 | 37 | 3 | 1 |
| While | there ar | e no 0 | values | , I f | eel that it might be worth grouping categories 2,3 and 4 together. |
noshow$Handcap <- ifelse(noshow$Handcap == 0, "Not Disabled",
ifelse(noshow$Handcap == 1, "Disabled", "Severe Disability"))
Let’s tabulate the other factors.
kable(xtabs(~ No.show + Gender, data = noshow ), caption = "Gender Versus No-Show")
| F | M | |
|---|---|---|
| No | 57246 | 30962 |
| Yes | 14594 | 7725 |
kable(xtabs(~ No.show + Scholarship, data = noshow ), caption = "Scholarship Versus No-Show")
| 0 | 1 | |
|---|---|---|
| No | 79925 | 8283 |
| Yes | 19741 | 2578 |
kable(xtabs(~ No.show + Hipertension, data = noshow ), caption = "Hypertension Versus No-Show")
| 0 | 1 | |
|---|---|---|
| No | 70179 | 18029 |
| Yes | 18547 | 3772 |
kable(xtabs(~ No.show + Diabetes, data = noshow ), caption = "Diabetes Versus No-Show")
| 0 | 1 | |
|---|---|---|
| No | 81695 | 6513 |
| Yes | 20889 | 1430 |
kable(xtabs(~ No.show + Alcoholism, data = noshow ), caption = "Alcoholism Versus No-Show")
| 0 | 1 | |
|---|---|---|
| No | 85525 | 2683 |
| Yes | 21642 | 677 |
kable(xtabs(~ No.show + SMS_received, data = noshow ), caption = "SMS Versus No-Show")
| 0 | 1 | |
|---|---|---|
| No | 62510 | 25698 |
| Yes | 12535 | 9784 |
No zeroes or absurdly low number of observations. Let’s proceed.
What is the split between the number of people in the dataset that showed up versus those that did not show up?
table(noshow$No.show)
##
## No Yes
## 88208 22319
The number of people that missed their appointments is almost 4 times the number of people that kept their appointments. Before we build the logistic regression model, we will have to make sure that these two values are approximately equal.
'%exclude%' <- Negate('%in%') # Defining a function to do the opposite of the 'in' function.
options(scipen=999) # Avoid scientific notation printing.
#create an index
set.seed(123)
traini <- createDataPartition(noshow$No.show, p=0.8, list = F)
#Create the datasets
train <- noshow[traini, ] # Create training data
test <- noshow[-traini, ] #Create testing data
#Let's downsample
set.seed(123)
traindown <- downSample(x = train[, colnames(train) %exclude% "No.show"],
y = train$No.show, yname = "No.show") #from the 80 percent training data,
#Create a data frame called traindown where the shows and no shows are in an equal ratio
Did the downsampling work?
table(traindown$No.show)
##
## No Yes
## 17856 17856
Yes, we now have an equal number of shows and no shows in the training data (traindown).
Let’s build our initial logistic regression model.
model <- glm(No.show ~ Age + Gender + Time + Scholarship +
Hipertension + Diabetes + Alcoholism +
Handcap + SMS_received,
data = traindown, family = "binomial")
summary(model)
##
## Call:
## glm(formula = No.show ~ Age + Gender + Time + Scholarship + Hipertension +
## Diabetes + Alcoholism + Handcap + SMS_received, family = "binomial",
## data = traindown)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.178 -1.062 -0.340 1.140 1.612
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.2170856 0.0864532 -2.511
## Age -0.0070591 0.0005755 -12.265
## GenderM -0.0321483 0.0235225 -1.367
## Time 0.0285774 0.0008455 33.799
## Scholarship1 0.2175995 0.0362970 5.995
## Hipertension1 -0.0456528 0.0349079 -1.308
## Diabetes1 0.1925432 0.0485661 3.965
## Alcoholism1 0.1959179 0.0638871 3.067
## HandcapNot Disabled -0.0214637 0.0828870 -0.259
## HandcapSevere Disability 0.0820699 0.2765082 0.297
## SMS_received1 0.3579350 0.0242334 14.770
## Pr(>|z|)
## (Intercept) 0.01204 *
## Age < 0.0000000000000002 ***
## GenderM 0.17172
## Time < 0.0000000000000002 ***
## Scholarship1 0.00000000204 ***
## Hipertension1 0.19094
## Diabetes1 0.00007353049 ***
## Alcoholism1 0.00216 **
## HandcapNot Disabled 0.79567
## HandcapSevere Disability 0.76661
## SMS_received1 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 49507 on 35711 degrees of freedom
## Residual deviance: 47094 on 35701 degrees of freedom
## AIC: 47116
##
## Number of Fisher Scoring iterations: 4
This is very interesting to see. Adjusting for the effect of all other variables, age, time, scholarship, diabetes, alcoholism and SMS have a statistically significant effect on the outcome of showing up to an appointment.
Age is the only signifcant variable with a negative coefficient. The negative coefficient for a variable indicates that as that variable increases, the log odds of the noshow decreases, that is, the probability of a no show decreases. For example, for a one year increase in age from x to x+1, the odds of a no show at x+1 years is 0.99 times the odds of a no show at x years. If we wanted to compare the odds of a no show for a 75 year old, versus the odds of a no show for a 25 year old, we would take the difference in years, 50, and multiply it by the coefficient, (-0.0070591), and then exponentiate it in as e ^ (-0.1765), to obtain an odds ratio of 0.83. This means that the 75 year olds have 17 percent lesser odds of not showing up for an appointment as compared to 25 year olds. Interesting finding. Maybe older folks are simply more disciplined about their time and others’ time?
The other predictor, time, is also positively correlated with no shows. This means that the longer a person has to wait before showing up for an appointment, the more likely they are to skip the appointment altogether. The oods of missing an appointment after a 3 day wait is 8 percent higher than the odds of not showing up for an appointment on day 1. Perhaps this simply means that people do not like to wait.
Having a scholarship is correlated with lesser no shows.
Some of the other results are not so easy to explain. People with diabetes are less likely to miss appointments than people without diabetes. Alcoholics are less likely to miss appointments than non-alcoholics. People who receive SMSes are more likely to miss an appointment than those who don’t. This last one stumped me— I would have expected SMS reminders to improve appointment adherence.
Disability status, gender and hypertension status do not appear to be contributing to the follow up rates. Let’s drop them in a refined model.
model <- glm(No.show ~ Age + Time + Scholarship +
Diabetes + Alcoholism + SMS_received,
data = traindown, family = "binomial")
summary(model)
##
## Call:
## glm(formula = No.show ~ Age + Time + Scholarship + Diabetes +
## Alcoholism + SMS_received, family = "binomial", data = traindown)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1803 -1.0647 -0.3303 1.1399 1.6253
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2492192 0.0235522 -10.582 < 0.0000000000000002 ***
## Age -0.0072779 0.0005128 -14.193 < 0.0000000000000002 ***
## Time 0.0285988 0.0008452 33.837 < 0.0000000000000002 ***
## Scholarship1 0.2235433 0.0359263 6.222 0.00000000049 ***
## Diabetes1 0.1712673 0.0456693 3.750 0.000177 ***
## Alcoholism1 0.1822071 0.0633644 2.876 0.004033 **
## SMS_received1 0.3590085 0.0242113 14.828 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 49507 on 35711 degrees of freedom
## Residual deviance: 47097 on 35705 degrees of freedom
## AIC: 47111
##
## Number of Fisher Scoring iterations: 4
How do we know that this model is better than the first? For starters, all variables are significant. Second, the AIC has reduced somewhat, which is a good thing.
However, look at the null and the residual deviances. The deviance shows how much of the outcome is unexplained by the model. The degrees of freedom are the number of rows minus the number of predictors. The residual deviance for this model is slightly higher than that of the previous model, which is not necessarily good.
You win some, you lose some. Let’s forge ahead with the predictions.
pred1 <- predict(model, traindown, type = "response")
head(pred1)
## 1 2 3 4 5 6
## 0.3126451 0.3851950 0.5043319 0.5415477 0.7300379 0.4201889
These are the model’s predicted probabilities of no-shows for the first 5 respondents.
Let’s set a threshold: If the probability is <0.50, let’s consider that the patient did not show up for the appointment.
prediction1 <- ifelse(pred1 > 0.5, 1, 0) # 1 here indicates a no show.
Let’s compare the predicted to the actual values. This is called a confusion matrix.
cm <- table( Prediction = prediction1, Actual = traindown$No.show)
cm
## Actual
## Prediction No Yes
## 0 12398 7773
## 1 5458 10083
Under the ‘Actual Values’, ‘No’ indicates that the patient showed up for an appointment, and ‘Yes’ indicates that they missed their appointment. Now look at the predictions, along the off-diagonal. A prediction of 1 (meaning no show) was made 5458 times when the patient actually showed up. Similarly, a prediction of 0 (meaning present for appointment) was made 7773 times when the patient actually missed the appointment. Let us quantify the missclassifications in the confusion matrix data.
1 - sum(diag(cm))/sum(cm)
## [1] 0.3704917
37 percent of the data was misclassified. The model is only correct 63 percent of the time.
Let us try the model out on the test data:
pred2 <- predict(model, test, type = "response")
predict2 <- ifelse(pred2>0.5, 1, 0) # Again, a 1 indicates a no show.
cm2 <- table( Prediction = predict2, Actual = test$No.show)
cm2
## Actual
## Prediction No Yes
## 0 12236 2018
## 1 5405 2445
What is the misclassification error this time?
1 - sum(diag(cm2))/sum(cm2)
## [1] 0.3358216
33 percent of the data was misclassified. The model is correct 67 percent of the time.
Is this model good?
with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = F))
## [1] 0
The p-value close to zero indicates that our model is statistically better than the intercept model. Great.
Can we improve this 67 percent accuracy of this model?
table(test$No.show)
##
## No Yes
## 17641 4463
17641/(17641+4461)*100
## [1] 79.81631
This 79 percent tells us that, even if we blindly predict that all the patients will keep their appointments, we will be right 79 percent of the time. Using our model, however, we only make a correct prediction 67 percent of the time. We definitely need to improve the model.
res <- predict(model, traindown, type = "response")
ROCRpred = prediction(res, traindown$No.show)
ROCRperf <- performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1))
In models used to detect cancer or other diseases, we strive to have the true-positive rate as high as possible. In other words, we want the false negatives as low as possible. We want high sensitivity, even at the expense of specificity, because it can be lethal to tell someone that they are cancer-free when they are not.
But for this model, since we are not predicting a life or death situation, and since the mean prediction accuracy is 79 percent, let us aim to achieve a high accuracy, or a low false positive rate.
Let’s apply this to the testing data.
res_pred <- predict(model, test, type = "response")
cm3 <- (table(Prediction = res_pred > 0.6, Actual = test$No.show))
cm3
## Actual
## Prediction No Yes
## FALSE 15007 3196
## TRUE 2634 1267
What is the accuracy?
sum(diag(cm3))/sum(cm3)
## [1] 0.7362468
The accuracy is 73 percent at a threshold of 0.6 for the testing data.
What happens if we increase the threshold of probability to classify a case as a noshow from 0.6 to 0.75?
cm4 <- (table(Prediction = res_pred > 0.75, Actual = test$No.show))
cm4
## Actual
## Prediction No Yes
## FALSE 17051 4234
## TRUE 590 229
What is the accuracy?
sum(diag(cm4))/sum(cm4)
## [1] 0.781759
The accuracy is 78 percent at a threshold of 0.75 for the training data.