What is logistic regression?

To summarize:

Logistic regression, like any regression analysis, tries to predict an outcome variable based on its statistical relationships to several predictor or independent variables of interest.
The hallmark of logistic regression is that the outcome variable is always binary, or dichotomous. For example, being anemic versus not being anemic, having hypertension versus having normal blood pressure, or contracting malaria versus being malaria free. If your outcome variable is numerical, for example, glycated hemoglobin (A1C) levels , it is helpful to dichotomize it into ‘normal’ and ‘pathologic’ based on an informed threshold value to facilitate logistic regression analysis— A1C levels below 5.7 percent could be considered ‘normal’ while values above 5.7 percent can be considered ‘pathologic’, to include the prediabetic and diabetic stages.
There are three types of logistic regression— descriptive, predictive and effect-size regression models. The most commonly used model in epidemioloical studies is the effect-size model. However, for the purposes of this exercise, I will build a predictive model. Unlike linear regression where the outcome variable can assume any numerical value, in logistic regression, the outcome is a probability between 0 to 1.
The assumptions and basic principles of linear regression apply.

Purpose of this document

I will apply my knowledge of logistic regression to analyze a dataset containing characteristics of a 110,000 patients from Brazil and determine if I can predict whether they are going to miss a doctor’s appointment or not. Unlike my previous uploads, this document is not intended to be a comprehensive lesson, but is meant to serve as a record of my skill and a personal reference for future projects. That said, you may still find value in following along to the code. I would like to thank Bharatendra Rai for his instructional videos on this topic.

Libraries

library("ROCR")
library("knitr")
library("caret")

Importing the data

The data is available here at Kaggle. The website might prompt you to log in— you can enter your Google or Facebook credentials to gain access to the dataset. Make sure you download and extract it to your working directory.

noshow <- read.csv("./noshowappointments/No_Show.csv")

Let us take a look at the data.

head(noshow)

##      PatientId AppointmentID Gender         ScheduledDay
## 1 2.987250e+13       5642903      F 2016-04-29T18:38:08Z
## 2 5.589978e+14       5642503      M 2016-04-29T16:08:27Z
## 3 4.262962e+12       5642549      F 2016-04-29T16:19:04Z
## 4 8.679512e+11       5642828      F 2016-04-29T17:29:31Z
## 5 8.841186e+12       5642494      F 2016-04-29T16:07:23Z
## 6 9.598513e+13       5626772      F 2016-04-27T08:36:51Z
##         AppointmentDay Age     Neighbourhood Scholarship Hipertension
## 1 2016-04-29T00:00:00Z  62   JARDIM DA PENHA           0            1
## 2 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            0
## 3 2016-04-29T00:00:00Z  62     MATA DA PRAIA           0            0
## 4 2016-04-29T00:00:00Z   8 PONTAL DE CAMBURI           0            0
## 5 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            1
## 6 2016-04-29T00:00:00Z  76        REPÃšBLICA           0            1
##   Diabetes Alcoholism Handcap SMS_received No.show
## 1        0          0       0            0      No
## 2        0          0       0            0      No
## 3        0          0       0            0      No
## 4        0          0       0            0      No
## 5        1          0       0            0      No
## 6        0          0       0            0      No

What do the variables mean?

PatientId: A unique idenitification number for each patient.
AppointmentID: A unique identification number for each appointment.
Gender: Was the patient male or female?
ScheduledDay: The day that the patient called in to schedule the appointment.
AppointmentDay: The day the appointment was scheduled for.
Age: The age of the patient.
Neighbourhood: The area where the doctor’s appointment was scheduled to happen.
Scholarship: Whether the patient or his family is a receipient of a certain government welfare scholarship. You can read more about this program here.
Hipertension: Whether the patient is hypertensive or not.
Diabetes: Whether the patient is diabetic or not.
Alcoholism: Whether the patient has a history of alcoholism or not.
Handcap: Whether the patient is considered to have a handicap/disability, and if so, the severity. I’m guessing a higher number implies a more severe problem.
SMS_received: Whether the patient received one or more text messages prior to the appointment.
No.show: Whether the patient missed their appointment or not. This is our outcome variable.

The data documentation is not very clear on the criteria of classification used for certain variables (for example, what determines if a person is considered to be positive for alcoholism?), but for the purpose of building a simple predictive logistic regression model, we do not need to delve into these issues.

Preparing the data

Two issues that are often problematic to a logistic regression model are variable classes and missing values.

Variable classes

str(noshow)

## 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ScheduledDay  : Factor w/ 103549 levels "2015-11-10T07:13:56Z",..: 27742 27504 27539 27709 27498 20074 21386 21496 24945 20895 ...
##  $ AppointmentDay: Factor w/ 27 levels "2016-04-29T00:00:00Z",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
##  $ Scholarship   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hipertension  : int  1 0 0 0 1 1 0 0 0 0 ...
##  $ Diabetes      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Alcoholism    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Handcap       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SMS_received  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ No.show       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...

What do you see?

The ScheduledDay and AppointmentDay columns are coded as factors. We will want to have them as date data. Refer tothis for more detail.
The Scholarship, Hipertension, Diabetes, Alcoholism, Handcap and SMS_received columns are all stored as integers. These need to be changed to factor.

Let’s fix these issues.

#changing the date/time variables to an appropriate character format.

noshow$AppointmentDay <- format(as.Date(noshow$AppointmentDay, 
                                        format = "%Y-%m-%d"), "%Y/%m/%d")

noshow$ScheduledDay <- format(as.Date(noshow$ScheduledDay, 
                                        format = "%Y-%m-%d"), "%Y/%m/%d")

#creating a new variable to capture time between the phone call and actual appointment

noshow$Time <- as.Date(noshow$AppointmentDay) - as.Date(noshow$ScheduledDay)

noshow$Time <- as.integer(noshow$Time)

#An new column with the time in days between the phone call and the 
#actual date of the created appointment.

#Changing the required columns to factors.

cols_to_change <- c(8, 9, 10, 11, 12, 13)

noshow[, cols_to_change] <- lapply(noshow[, cols_to_change], factor)

Missing values and other miscellaneous issues.

Let’s take a look at the summary.

summary(noshow)

##    PatientId         AppointmentID     Gender    ScheduledDay      
##  Min.   :3.922e+04   Min.   :5030230   F:71840   Length:110527     
##  1st Qu.:4.173e+12   1st Qu.:5640286   M:38687   Class :character  
##  Median :3.173e+13   Median :5680573             Mode  :character  
##  Mean   :1.475e+14   Mean   :5675305                               
##  3rd Qu.:9.439e+13   3rd Qu.:5725524                               
##  Max.   :1.000e+15   Max.   :5790484                               
##                                                                    
##  AppointmentDay          Age                 Neighbourhood   Scholarship
##  Length:110527      Min.   : -1.00   JARDIM CAMBURI : 7717   0:99666    
##  Class :character   1st Qu.: 18.00   MARIA ORTIZ    : 5805   1:10861    
##  Mode  :character   Median : 37.00   RESISTÃŠNCIA   : 4431              
##                     Mean   : 37.09   JARDIM DA PENHA: 3877              
##                     3rd Qu.: 55.00   ITARARÃ‰       : 3514              
##                     Max.   :115.00   CENTRO         : 3334              
##                                      (Other)        :81849              
##  Hipertension Diabetes   Alcoholism Handcap    SMS_received No.show    
##  0:88726      0:102584   0:107167   0:108286   0:75045      No :88208  
##  1:21801      1:  7943   1:  3360   1:  2042   1:35482      Yes:22319  
##                                     2:   183                           
##                                     3:    13                           
##                                     4:     3                           
##                                                                        
##                                                                        
##       Time       
##  Min.   : -6.00  
##  1st Qu.:  0.00  
##  Median :  4.00  
##  Mean   : 10.18  
##  3rd Qu.: 15.00  
##  Max.   :179.00  
##

Notice the Age column. There is atleast one row where the minimum age is -1. This is not acceptable. Let’s get rid of this and replace it with the mean of the remaining ages.

row_to_replace <- which(noshow$Age < 0 )
noshow[row_to_replace, 6] <- 37.09  # 6 is the Age column.

Let’s write a function to give us the percentage of NAs in each column.

perc <- function(x){sum(is.na(x))/length(x) *100 }
apply(noshow, 2, perc)

##      PatientId  AppointmentID         Gender   ScheduledDay AppointmentDay 
##              0              0              0              0              0 
##            Age  Neighbourhood    Scholarship   Hipertension       Diabetes 
##              0              0              0              0              0 
##     Alcoholism        Handcap   SMS_received        No.show           Time 
##              0              0              0              0              0

This is great! There are no missing values. We can proceed with cross tabulations.

Cross-tabulations of factors

In this section, we will cross-tabulate our factor predictors with our outcome of interest to ensure that there are sufficient observations in each column.

Handicaps

kable(xtabs(~ No.show + Handcap, data = noshow ), caption = "Severity of Disability Versus No-Show")

Severity of Disability Versus No-Show
	0	1	2	3	4
No	86374	1676	146	10	2
Yes	21912	366	37	3	1
While	there ar	e no 0	values	, I f	eel that it might be worth grouping categories 2,3 and 4 together.

noshow$Handcap <- ifelse(noshow$Handcap == 0, "Not Disabled",
                         ifelse(noshow$Handcap == 1, "Disabled", "Severe Disability"))

other variables

Let’s tabulate the other factors.

kable(xtabs(~ No.show + Gender, data = noshow ), caption = "Gender Versus No-Show")

Gender Versus No-Show
	F	M
No	57246	30962
Yes	14594	7725

kable(xtabs(~ No.show + Scholarship, data = noshow ), caption = "Scholarship Versus No-Show")

Scholarship Versus No-Show
	0	1
No	79925	8283
Yes	19741	2578

kable(xtabs(~ No.show + Hipertension, data = noshow ), caption = "Hypertension Versus No-Show")

Hypertension Versus No-Show
	0	1
No	70179	18029
Yes	18547	3772

kable(xtabs(~ No.show + Diabetes, data = noshow ), caption = "Diabetes Versus No-Show")

Diabetes Versus No-Show
	0	1
No	81695	6513
Yes	20889	1430

kable(xtabs(~ No.show + Alcoholism, data = noshow ), caption = "Alcoholism Versus No-Show")

Alcoholism Versus No-Show
	0	1
No	85525	2683
Yes	21642	677

kable(xtabs(~ No.show + SMS_received, data = noshow ), caption = "SMS Versus No-Show")

SMS Versus No-Show
	0	1
No	62510	25698
Yes	12535	9784

No zeroes or absurdly low number of observations. Let’s proceed.

Class imbalance, downsampling, and partitioning:

What is the split between the number of people in the dataset that showed up versus those that did not show up?

table(noshow$No.show)

## 
##    No   Yes 
## 88208 22319

The number of people that missed their appointments is almost 4 times the number of people that kept their appointments. Before we build the logistic regression model, we will have to make sure that these two values are approximately equal.

'%exclude%' <- Negate('%in%')  # Defining a function to do the opposite of the 'in' function.
options(scipen=999)  # Avoid scientific notation printing.

#create an index
set.seed(123)
traini <- createDataPartition(noshow$No.show, p=0.8, list = F) 

#Create the datasets
train <- noshow[traini, ] # Create training data
test <- noshow[-traini, ] #Create testing data

#Let's downsample

set.seed(123)
traindown <- downSample(x = train[, colnames(train) %exclude% "No.show"],
                         y = train$No.show, yname = "No.show") #from the 80 percent training data,
#Create a data frame called traindown where the shows and no shows are in an equal ratio

Did the downsampling work?

table(traindown$No.show)

## 
##    No   Yes 
## 17856 17856

Yes, we now have an equal number of shows and no shows in the training data (traindown).

Building the model

Let’s build our initial logistic regression model.

Outcome: No.show (Did the patient turn up for an appointment or not?)
Predictors: Age, Gender, Time, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received

model <- glm(No.show ~ Age + Gender + Time + Scholarship + 
               Hipertension + Diabetes + Alcoholism + 
               Handcap + SMS_received, 
             data = traindown, family = "binomial")

summary(model)

## 
## Call:
## glm(formula = No.show ~ Age + Gender + Time + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + Handcap + SMS_received, family = "binomial", 
##     data = traindown)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.178  -1.062  -0.340   1.140   1.612  
## 
## Coefficients:
##                            Estimate Std. Error z value
## (Intercept)              -0.2170856  0.0864532  -2.511
## Age                      -0.0070591  0.0005755 -12.265
## GenderM                  -0.0321483  0.0235225  -1.367
## Time                      0.0285774  0.0008455  33.799
## Scholarship1              0.2175995  0.0362970   5.995
## Hipertension1            -0.0456528  0.0349079  -1.308
## Diabetes1                 0.1925432  0.0485661   3.965
## Alcoholism1               0.1959179  0.0638871   3.067
## HandcapNot Disabled      -0.0214637  0.0828870  -0.259
## HandcapSevere Disability  0.0820699  0.2765082   0.297
## SMS_received1             0.3579350  0.0242334  14.770
##                                      Pr(>|z|)    
## (Intercept)                           0.01204 *  
## Age                      < 0.0000000000000002 ***
## GenderM                               0.17172    
## Time                     < 0.0000000000000002 ***
## Scholarship1                    0.00000000204 ***
## Hipertension1                         0.19094    
## Diabetes1                       0.00007353049 ***
## Alcoholism1                           0.00216 ** 
## HandcapNot Disabled                   0.79567    
## HandcapSevere Disability              0.76661    
## SMS_received1            < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49507  on 35711  degrees of freedom
## Residual deviance: 47094  on 35701  degrees of freedom
## AIC: 47116
## 
## Number of Fisher Scoring iterations: 4

Initial results

This is very interesting to see. Adjusting for the effect of all other variables, age, time, scholarship, diabetes, alcoholism and SMS have a statistically significant effect on the outcome of showing up to an appointment.

Age is the only signifcant variable with a negative coefficient. The negative coefficient for a variable indicates that as that variable increases, the log odds of the noshow decreases, that is, the probability of a no show decreases. For example, for a one year increase in age from x to x+1, the odds of a no show at x+1 years is 0.99 times the odds of a no show at x years. If we wanted to compare the odds of a no show for a 75 year old, versus the odds of a no show for a 25 year old, we would take the difference in years, 50, and multiply it by the coefficient, (-0.0070591), and then exponentiate it in as e ^ (-0.1765), to obtain an odds ratio of 0.83. This means that the 75 year olds have 17 percent lesser odds of not showing up for an appointment as compared to 25 year olds. Interesting finding. Maybe older folks are simply more disciplined about their time and others’ time?
The other predictor, time, is also positively correlated with no shows. This means that the longer a person has to wait before showing up for an appointment, the more likely they are to skip the appointment altogether. The oods of missing an appointment after a 3 day wait is 8 percent higher than the odds of not showing up for an appointment on day 1. Perhaps this simply means that people do not like to wait.
Having a scholarship is correlated with lesser no shows.
Some of the other results are not so easy to explain. People with diabetes are less likely to miss appointments than people without diabetes. Alcoholics are less likely to miss appointments than non-alcoholics. People who receive SMSes are more likely to miss an appointment than those who don’t. This last one stumped me— I would have expected SMS reminders to improve appointment adherence.
Disability status, gender and hypertension status do not appear to be contributing to the follow up rates. Let’s drop them in a refined model.

Dropping insignificant predictors.

model <- glm(No.show ~ Age  + Time + Scholarship + 
                Diabetes + Alcoholism +  SMS_received, 
             data = traindown, family = "binomial")

summary(model)

## 
## Call:
## glm(formula = No.show ~ Age + Time + Scholarship + Diabetes + 
##     Alcoholism + SMS_received, family = "binomial", data = traindown)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1803  -1.0647  -0.3303   1.1399   1.6253  
## 
## Coefficients:
##                 Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)   -0.2492192  0.0235522 -10.582 < 0.0000000000000002 ***
## Age           -0.0072779  0.0005128 -14.193 < 0.0000000000000002 ***
## Time           0.0285988  0.0008452  33.837 < 0.0000000000000002 ***
## Scholarship1   0.2235433  0.0359263   6.222        0.00000000049 ***
## Diabetes1      0.1712673  0.0456693   3.750             0.000177 ***
## Alcoholism1    0.1822071  0.0633644   2.876             0.004033 ** 
## SMS_received1  0.3590085  0.0242113  14.828 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49507  on 35711  degrees of freedom
## Residual deviance: 47097  on 35705  degrees of freedom
## AIC: 47111
## 
## Number of Fisher Scoring iterations: 4

How do we know that this model is better than the first? For starters, all variables are significant. Second, the AIC has reduced somewhat, which is a good thing.

However, look at the null and the residual deviances. The deviance shows how much of the outcome is unexplained by the model. The degrees of freedom are the number of rows minus the number of predictors. The residual deviance for this model is slightly higher than that of the previous model, which is not necessarily good.

You win some, you lose some. Let’s forge ahead with the predictions.

Prediction

Training data

pred1 <- predict(model, traindown, type = "response")
head(pred1)

##         1         2         3         4         5         6 
## 0.3126451 0.3851950 0.5043319 0.5415477 0.7300379 0.4201889

These are the model’s predicted probabilities of no-shows for the first 5 respondents.

Let’s set a threshold: If the probability is <0.50, let’s consider that the patient did not show up for the appointment.

prediction1 <- ifelse(pred1 > 0.5, 1, 0) # 1 here indicates a no show.

Let’s compare the predicted to the actual values. This is called a confusion matrix.

cm <- table( Prediction = prediction1, Actual = traindown$No.show)
cm

##           Actual
## Prediction    No   Yes
##          0 12398  7773
##          1  5458 10083

Under the ‘Actual Values’, ‘No’ indicates that the patient showed up for an appointment, and ‘Yes’ indicates that they missed their appointment. Now look at the predictions, along the off-diagonal. A prediction of 1 (meaning no show) was made 5458 times when the patient actually showed up. Similarly, a prediction of 0 (meaning present for appointment) was made 7773 times when the patient actually missed the appointment. Let us quantify the missclassifications in the confusion matrix data.

1 - sum(diag(cm))/sum(cm)

## [1] 0.3704917

37 percent of the data was misclassified. The model is only correct 63 percent of the time.

Test data

Let us try the model out on the test data:

pred2 <- predict(model, test, type = "response")

predict2 <- ifelse(pred2>0.5, 1, 0) # Again, a 1 indicates a no show.

cm2 <- table( Prediction = predict2, Actual = test$No.show)

cm2

##           Actual
## Prediction    No   Yes
##          0 12236  2018
##          1  5405  2445

What is the misclassification error this time?

1 - sum(diag(cm2))/sum(cm2)

## [1] 0.3358216

33 percent of the data was misclassified. The model is correct 67 percent of the time.

Goodness of Fit test

Is this model good?

with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = F))

## [1] 0

The p-value close to zero indicates that our model is statistically better than the intercept model. Great.

Optimizing

Can we improve this 67 percent accuracy of this model?

table(test$No.show)

## 
##    No   Yes 
## 17641  4463

17641/(17641+4461)*100

## [1] 79.81631

This 79 percent tells us that, even if we blindly predict that all the patients will keep their appointments, we will be right 79 percent of the time. Using our model, however, we only make a correct prediction 67 percent of the time. We definitely need to improve the model.

res <- predict(model, traindown, type = "response")
ROCRpred = prediction(res, traindown$No.show)
ROCRperf <- performance(ROCRpred, "tpr", "fpr")

plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1))

In models used to detect cancer or other diseases, we strive to have the true-positive rate as high as possible. In other words, we want the false negatives as low as possible. We want high sensitivity, even at the expense of specificity, because it can be lethal to tell someone that they are cancer-free when they are not.

But for this model, since we are not predicting a life or death situation, and since the mean prediction accuracy is 79 percent, let us aim to achieve a high accuracy, or a low false positive rate.

Let’s apply this to the testing data.

res_pred <- predict(model, test, type = "response")

cm3 <- (table(Prediction = res_pred > 0.6, Actual = test$No.show))

cm3

##           Actual
## Prediction    No   Yes
##      FALSE 15007  3196
##      TRUE   2634  1267

What is the accuracy?

sum(diag(cm3))/sum(cm3)

## [1] 0.7362468

The accuracy is 73 percent at a threshold of 0.6 for the testing data.

What happens if we increase the threshold of probability to classify a case as a noshow from 0.6 to 0.75?

cm4 <- (table(Prediction = res_pred > 0.75, Actual = test$No.show))

cm4

##           Actual
## Prediction    No   Yes
##      FALSE 17051  4234
##      TRUE    590   229