What is logistic regression?

To summarize: logistic regression models the probability of a binary outcome (here, whether a patient misses an appointment) as a function of one or more predictors, using the logit link so that the predicted probabilities stay between 0 and 1.
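In symbols, with $p$ the probability of a no-show and $x_1, \dots, x_k$ the predictors, the standard formulation is

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

and the coefficients are estimated by maximum likelihood, which is what R's glm() does when we specify family = "binomial" later on.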

Purpose of this document

I will apply my knowledge of logistic regression to analyze a dataset containing characteristics of about 110,000 patients from Brazil and determine whether I can predict if they will miss a doctor’s appointment. Unlike my previous uploads, this document is not intended to be a comprehensive lesson, but is meant to serve as a record of my skill and a personal reference for future projects. That said, you may still find value in following along with the code. I would like to thank Bharatendra Rai for his instructional videos on this topic.

Libraries

library("ROCR")
library("knitr")
library("caret")

Importing the data

The data is available here at Kaggle. The website might prompt you to log in; you can enter your Google or Facebook credentials to gain access to the dataset. Make sure you download the file and extract it to your working directory.

noshow <- read.csv("./noshowappointments/No_Show.csv")

Let us take a look at the data.

head(noshow)
##      PatientId AppointmentID Gender         ScheduledDay
## 1 2.987250e+13       5642903      F 2016-04-29T18:38:08Z
## 2 5.589978e+14       5642503      M 2016-04-29T16:08:27Z
## 3 4.262962e+12       5642549      F 2016-04-29T16:19:04Z
## 4 8.679512e+11       5642828      F 2016-04-29T17:29:31Z
## 5 8.841186e+12       5642494      F 2016-04-29T16:07:23Z
## 6 9.598513e+13       5626772      F 2016-04-27T08:36:51Z
##         AppointmentDay Age     Neighbourhood Scholarship Hipertension
## 1 2016-04-29T00:00:00Z  62   JARDIM DA PENHA           0            1
## 2 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            0
## 3 2016-04-29T00:00:00Z  62     MATA DA PRAIA           0            0
## 4 2016-04-29T00:00:00Z   8 PONTAL DE CAMBURI           0            0
## 5 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            1
## 6 2016-04-29T00:00:00Z  76        REPÚBLICA           0            1
##   Diabetes Alcoholism Handcap SMS_received No.show
## 1        0          0       0            0      No
## 2        0          0       0            0      No
## 3        0          0       0            0      No
## 4        0          0       0            0      No
## 5        1          0       0            0      No
## 6        0          0       0            0      No

What do the variables mean?

The data documentation is not very clear on the criteria of classification used for certain variables (for example, what determines if a person is considered to be positive for alcoholism?), but for the purpose of building a simple predictive logistic regression model, we do not need to delve into these issues.

Preparing the data

Two issues that often cause problems for a logistic regression model are incorrect variable classes and missing values.

Variable classes

str(noshow)
## 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ScheduledDay  : Factor w/ 103549 levels "2015-11-10T07:13:56Z",..: 27742 27504 27539 27709 27498 20074 21386 21496 24945 20895 ...
##  $ AppointmentDay: Factor w/ 27 levels "2016-04-29T00:00:00Z",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
##  $ Scholarship   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hipertension  : int  1 0 0 0 1 1 0 0 0 0 ...
##  $ Diabetes      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Alcoholism    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Handcap       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SMS_received  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ No.show       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...

What do you see?

  • The ScheduledDay and AppointmentDay columns are coded as factors. We will want to have them as date data. Refer to this for more detail.
  • The Scholarship, Hipertension, Diabetes, Alcoholism, Handcap and SMS_received columns are all stored as integers. These need to be changed to factors.

Let’s fix these issues.

#changing the date/time variables to an appropriate character format.

noshow$AppointmentDay <- format(as.Date(noshow$AppointmentDay, 
                                        format = "%Y-%m-%d"), "%Y/%m/%d")

noshow$ScheduledDay <- format(as.Date(noshow$ScheduledDay, 
                                        format = "%Y-%m-%d"), "%Y/%m/%d")
#creating a new variable to capture time between the phone call and actual appointment

noshow$Time <- as.Date(noshow$AppointmentDay) - as.Date(noshow$ScheduledDay)

noshow$Time <- as.integer(noshow$Time)

#A new column now holds the time in days between the phone call and the
#actual date of the appointment.

#Changing the required columns (Scholarship through SMS_received) to factors.

cols_to_change <- c(8, 9, 10, 11, 12, 13)

noshow[, cols_to_change] <- lapply(noshow[, cols_to_change], factor)
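As a quick sanity check (not part of the original code), we can confirm that the conversion actually took:

str(noshow[, cols_to_change])  # each of these columns should now show up as a Factor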

Missing values and other miscellaneous issues.

Let’s take a look at the summary.

summary(noshow)
##    PatientId         AppointmentID     Gender    ScheduledDay      
##  Min.   :3.922e+04   Min.   :5030230   F:71840   Length:110527     
##  1st Qu.:4.173e+12   1st Qu.:5640286   M:38687   Class :character  
##  Median :3.173e+13   Median :5680573             Mode  :character  
##  Mean   :1.475e+14   Mean   :5675305                               
##  3rd Qu.:9.439e+13   3rd Qu.:5725524                               
##  Max.   :1.000e+15   Max.   :5790484                               
##                                                                    
##  AppointmentDay          Age                 Neighbourhood   Scholarship
##  Length:110527      Min.   : -1.00   JARDIM CAMBURI : 7717   0:99666    
##  Class :character   1st Qu.: 18.00   MARIA ORTIZ    : 5805   1:10861    
##  Mode  :character   Median : 37.00   RESISTÊNCIA   : 4431              
##                     Mean   : 37.09   JARDIM DA PENHA: 3877              
##                     3rd Qu.: 55.00   ITARARÉ       : 3514              
##                     Max.   :115.00   CENTRO         : 3334              
##                                      (Other)        :81849              
##  Hipertension Diabetes   Alcoholism Handcap    SMS_received No.show    
##  0:88726      0:102584   0:107167   0:108286   0:75045      No :88208  
##  1:21801      1:  7943   1:  3360   1:  2042   1:35482      Yes:22319  
##                                     2:   183                           
##                                     3:    13                           
##                                     4:     3                           
##                                                                        
##                                                                        
##       Time       
##  Min.   : -6.00  
##  1st Qu.:  0.00  
##  Median :  4.00  
##  Mean   : 10.18  
##  3rd Qu.: 15.00  
##  Max.   :179.00  
## 

Notice the Age column. There is at least one row where the age is -1, which is not an acceptable value. Let’s get rid of it and replace it with the mean of the remaining ages.

row_to_replace <- which(noshow$Age < 0 )
noshow[row_to_replace, 6] <- 37.09  # 6 is the Age column.
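Hard-coding 37.09 works here, but a slightly more general version (a sketch, not run in the original analysis) computes the mean of the valid ages directly instead of copying it from the summary:

#Replace any negative ages with the mean of the non-negative ages.
valid_mean <- mean(noshow$Age[noshow$Age >= 0])
noshow$Age[noshow$Age < 0] <- valid_mean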

Let’s write a function to give us the percentage of NAs in each column.

perc <- function(x) { sum(is.na(x)) / length(x) * 100 }
apply(noshow, 2, perc)
##      PatientId  AppointmentID         Gender   ScheduledDay AppointmentDay 
##              0              0              0              0              0 
##            Age  Neighbourhood    Scholarship   Hipertension       Diabetes 
##              0              0              0              0              0 
##     Alcoholism        Handcap   SMS_received        No.show           Time 
##              0              0              0              0              0

This is great! There are no missing values. We can proceed with cross tabulations.

Cross-tabulations of factors

In this section, we will cross-tabulate our factor predictors with the outcome of interest to ensure that there are sufficient observations in each cell.

Handicaps

kable(xtabs(~ No.show + Handcap, data = noshow ), caption = "Severity of Disability Versus No-Show")
Severity of Disability Versus No-Show

         0     1    2   3   4
No   86374  1676  146  10   2
Yes  21912   366   37   3   1

While there are no zero cells, I feel it might be worth grouping categories 2, 3 and 4 together, since they contain very few observations.
noshow$Handcap <- ifelse(noshow$Handcap == 0, "Not Disabled",
                         ifelse(noshow$Handcap == 1, "Disabled", "Severe Disability"))
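One small detail worth noting: ifelse() returns a character vector, so Handcap is no longer a factor after this step. glm() will coerce it back to a factor automatically when fitting, but it does no harm to be explicit (keeping the default alphabetical level order, so "Disabled" remains the reference category seen in the model summaries below). This line was not in the original code:

noshow$Handcap <- factor(noshow$Handcap)  # levels: Disabled, Not Disabled, Severe Disability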

Other variables

Let’s tabulate the other factors.

kable(xtabs(~ No.show + Gender, data = noshow ), caption = "Gender Versus No-Show")

Gender Versus No-Show

         F      M
No   57246  30962
Yes  14594   7725

kable(xtabs(~ No.show + Scholarship, data = noshow ), caption = "Scholarship Versus No-Show")

Scholarship Versus No-Show

         0      1
No   79925   8283
Yes  19741   2578

kable(xtabs(~ No.show + Hipertension, data = noshow ), caption = "Hypertension Versus No-Show")

Hypertension Versus No-Show

         0      1
No   70179  18029
Yes  18547   3772

kable(xtabs(~ No.show + Diabetes, data = noshow ), caption = "Diabetes Versus No-Show")

Diabetes Versus No-Show

         0      1
No   81695   6513
Yes  20889   1430

kable(xtabs(~ No.show + Alcoholism, data = noshow ), caption = "Alcoholism Versus No-Show")

Alcoholism Versus No-Show

         0      1
No   85525   2683
Yes  21642    677

kable(xtabs(~ No.show + SMS_received, data = noshow ), caption = "SMS Versus No-Show")

SMS Versus No-Show

         0      1
No   62510  25698
Yes  12535   9784

No zero cells or absurdly low counts anywhere. Let’s proceed.
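As an aside, the repeated kable(xtabs(...)) calls above could be generated in a single pass; a sketch of one way to do it (not part of the original code):

#Cross-tabulate every factor predictor against No.show in one loop.
factor_vars <- c("Gender", "Scholarship", "Hipertension", "Diabetes",
                 "Alcoholism", "Handcap", "SMS_received")
tabs <- lapply(factor_vars,
               function(v) xtabs(reformulate(c("No.show", v)), data = noshow))
names(tabs) <- factor_vars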

Class imbalance, downsampling, and partitioning:

What is the split between the number of people in the dataset that showed up versus those that did not show up?

table(noshow$No.show)
## 
##    No   Yes 
## 88208 22319

The number of people that kept their appointments is almost four times the number that missed them. Before we build the logistic regression model, we will balance the training data so that the two classes appear in approximately equal numbers.

'%exclude%' <- Negate('%in%')  # Define an operator that does the opposite of %in%.
options(scipen=999)  # Avoid scientific notation printing.

#create an index
set.seed(123)
traini <- createDataPartition(noshow$No.show, p=0.8, list = F) 

#Create the datasets
train <- noshow[traini, ] # Create training data
test <- noshow[-traini, ] #Create testing data

#Let's downsample

set.seed(123)
traindown <- downSample(x = train[, colnames(train) %exclude% "No.show"],
                         y = train$No.show, yname = "No.show") #from the 80 percent training data,
#Create a data frame called traindown where the shows and no shows are in an equal ratio

Did the downsampling work?

table(traindown$No.show)
## 
##    No   Yes 
## 17856 17856

Yes, we now have an equal number of shows and no shows in the training data (traindown).

Building the model

Let’s build our initial logistic regression model.

model <- glm(No.show ~ Age + Gender + Time + Scholarship + 
               Hipertension + Diabetes + Alcoholism + 
               Handcap + SMS_received, 
             data = traindown, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = No.show ~ Age + Gender + Time + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + Handcap + SMS_received, family = "binomial", 
##     data = traindown)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.178  -1.062  -0.340   1.140   1.612  
## 
## Coefficients:
##                            Estimate Std. Error z value
## (Intercept)              -0.2170856  0.0864532  -2.511
## Age                      -0.0070591  0.0005755 -12.265
## GenderM                  -0.0321483  0.0235225  -1.367
## Time                      0.0285774  0.0008455  33.799
## Scholarship1              0.2175995  0.0362970   5.995
## Hipertension1            -0.0456528  0.0349079  -1.308
## Diabetes1                 0.1925432  0.0485661   3.965
## Alcoholism1               0.1959179  0.0638871   3.067
## HandcapNot Disabled      -0.0214637  0.0828870  -0.259
## HandcapSevere Disability  0.0820699  0.2765082   0.297
## SMS_received1             0.3579350  0.0242334  14.770
##                                      Pr(>|z|)    
## (Intercept)                           0.01204 *  
## Age                      < 0.0000000000000002 ***
## GenderM                               0.17172    
## Time                     < 0.0000000000000002 ***
## Scholarship1                    0.00000000204 ***
## Hipertension1                         0.19094    
## Diabetes1                       0.00007353049 ***
## Alcoholism1                           0.00216 ** 
## HandcapNot Disabled                   0.79567    
## HandcapSevere Disability              0.76661    
## SMS_received1            < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49507  on 35711  degrees of freedom
## Residual deviance: 47094  on 35701  degrees of freedom
## AIC: 47116
## 
## Number of Fisher Scoring iterations: 4

Initial results

This is very interesting to see. Adjusting for the effect of all the other variables, age, the waiting time between scheduling and the appointment, scholarship status, diabetes, alcoholism and receiving an SMS reminder all have a statistically significant association with missing an appointment.
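Although not part of the original analysis, the coefficients are easier to read as odds ratios, obtained by exponentiating them:

#Odds ratios: values above 1 increase the odds of a no-show, values below 1 decrease them.
exp(coef(model))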

Dropping insignificant predictors.

model <- glm(No.show ~ Age  + Time + Scholarship + 
                Diabetes + Alcoholism +  SMS_received, 
             data = traindown, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = No.show ~ Age + Time + Scholarship + Diabetes + 
##     Alcoholism + SMS_received, family = "binomial", data = traindown)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1803  -1.0647  -0.3303   1.1399   1.6253  
## 
## Coefficients:
##                 Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)   -0.2492192  0.0235522 -10.582 < 0.0000000000000002 ***
## Age           -0.0072779  0.0005128 -14.193 < 0.0000000000000002 ***
## Time           0.0285988  0.0008452  33.837 < 0.0000000000000002 ***
## Scholarship1   0.2235433  0.0359263   6.222        0.00000000049 ***
## Diabetes1      0.1712673  0.0456693   3.750             0.000177 ***
## Alcoholism1    0.1822071  0.0633644   2.876             0.004033 ** 
## SMS_received1  0.3590085  0.0242113  14.828 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49507  on 35711  degrees of freedom
## Residual deviance: 47097  on 35705  degrees of freedom
## AIC: 47111
## 
## Number of Fisher Scoring iterations: 4

How do we know that this model is better than the first? For starters, all of the remaining predictors are significant. Second, the AIC has dropped slightly (from 47116 to 47111), which is a good thing.

However, look at the null and residual deviances. The deviance measures how much variation in the outcome is left unexplained by the model, and the degrees of freedom are the number of observations minus the number of estimated parameters (including the intercept). The residual deviance of this model is slightly higher than that of the previous one (47097 versus 47094), which is not necessarily good.
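A more formal way to weigh the two fits is a likelihood ratio test. Since the reduced model overwrote the first one under the same name, this is only a sketch; it assumes the full model had been kept under a hypothetical name such as model_full:

#Hypothetical comparison: model_full is assumed to be the first model with all predictors.
#A large p-value would suggest the dropped predictors add little explanatory power.
anova(model, model_full, test = "Chisq")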

You win some, you lose some. Let’s forge ahead with the predictions.

Prediction

Training data

pred1 <- predict(model, traindown, type = "response")
head(pred1)
##         1         2         3         4         5         6 
## 0.3126451 0.3851950 0.5043319 0.5415477 0.7300379 0.4201889

These are the model’s predicted probabilities of a no-show for the first six patients in the training data.

Let’s set a threshold: if the predicted probability is greater than 0.50, we will classify the patient as a no-show.

prediction1 <- ifelse(pred1 > 0.5, 1, 0) # 1 here indicates a no show.

Let’s compare the predicted to the actual values. This is called a confusion matrix.

cm <- table( Prediction = prediction1, Actual = traindown$No.show)
cm
##           Actual
## Prediction    No   Yes
##          0 12398  7773
##          1  5458 10083

Under ‘Actual’, ‘No’ indicates that the patient showed up for the appointment, and ‘Yes’ indicates that they missed it. Now look at the predictions along the off-diagonal. A prediction of 1 (no-show) was made 5458 times when the patient actually showed up, and a prediction of 0 (present for the appointment) was made 7773 times when the patient actually missed it. Let us quantify the misclassifications in the confusion matrix.

1 - sum(diag(cm))/sum(cm)
## [1] 0.3704917

37 percent of the data was misclassified. The model is only correct 63 percent of the time.
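Since caret is already loaded, the same comparison can be produced with confusionMatrix(), which also reports sensitivity and specificity; a sketch, not run in the original:

#Recode the 0/1 predictions with the same labels as the outcome, then compare.
pred_factor <- factor(prediction1, levels = c(0, 1), labels = c("No", "Yes"))
confusionMatrix(pred_factor, traindown$No.show, positive = "Yes")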

Test data

Let us try the model out on the test data:

pred2 <- predict(model, test, type = "response")

predict2 <- ifelse(pred2>0.5, 1, 0) # Again, a 1 indicates a no show.

cm2 <- table( Prediction = predict2, Actual = test$No.show)

cm2
##           Actual
## Prediction    No   Yes
##          0 12236  2018
##          1  5405  2445

What is the misclassification error this time?

1 - sum(diag(cm2))/sum(cm2)
## [1] 0.3358216

About 34 percent of the test data was misclassified; the model is correct roughly 66 percent of the time.

Goodness of Fit test

Is this model good?

with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = F))
## [1] 0

The p-value is essentially zero, which indicates that our model is a statistically significant improvement over the intercept-only (null) model. Great.
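Another quick summary, not in the original write-up, is McFadden’s pseudo R-squared, computed from the same two deviances:

#McFadden's pseudo R-squared: 1 - (residual deviance / null deviance).
#From the deviances printed above (47097 and 49507) this works out to roughly 0.05,
#which is consistent with the modest predictive accuracy seen so far.
1 - model$deviance / model$null.deviance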

Optimizing

Can we improve on this roughly 66 percent accuracy?

table(test$No.show)
## 
##    No   Yes 
## 17641  4463
17641/(17641+4463)*100
## [1] 79.80908

This tells us that even if we blindly predict that every patient will keep their appointment, we will be right almost 80 percent of the time. Using our model with a 0.5 cutoff, however, we only make a correct prediction about 66 percent of the time. We definitely need to improve the model.

res <- predict(model, traindown, type = "response")
ROCRpred = prediction(res, traindown$No.show)
ROCRperf <- performance(ROCRpred, "tpr", "fpr")

plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1))

In models used to detect cancer or other diseases, we strive to have the true-positive rate as high as possible. In other words, we want the false negatives as low as possible. We want high sensitivity, even at the expense of specificity, because it can be lethal to tell someone that they are cancer-free when they are not.

But for this model, since we are not predicting a life-or-death situation, and since the naive baseline accuracy is already almost 80 percent, let us aim for high overall accuracy, which here means keeping the false-positive rate low by raising the classification cutoff.
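The ROC curve can also be summarised by the area under it (AUC), using the same ROCR objects; this was not part of the original analysis:

#Area under the ROC curve: 0.5 is no better than chance, 1.0 is a perfect classifier.
auc <- performance(ROCRpred, "auc")
auc@y.values[[1]]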

Guided by the ROC curve, let’s apply a higher cutoff of 0.6 to the testing data.

res_pred <- predict(model, test, type = "response")

cm3 <- (table(Prediction = res_pred > 0.6, Actual = test$No.show))

cm3
##           Actual
## Prediction    No   Yes
##      FALSE 15007  3196
##      TRUE   2634  1267

What is the accuracy?

sum(diag(cm3))/sum(cm3)
## [1] 0.7362468

The accuracy on the testing data is roughly 74 percent at a threshold of 0.6.

What happens if we increase the probability threshold for classifying a case as a no-show from 0.6 to 0.75?

cm4 <- (table(Prediction = res_pred > 0.75, Actual = test$No.show))

cm4
##           Actual
## Prediction    No   Yes
##      FALSE 17051  4234
##      TRUE    590   229

What is the accuracy?

sum(diag(cm4))/sum(cm4)
## [1] 0.781759

The accuracy on the testing data is about 78 percent at a threshold of 0.75.
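Rather than checking cutoffs one at a time, a short helper can sweep a range of thresholds on the test set; a sketch, not part of the original analysis:

#Accuracy on the test set for a range of classification cutoffs.
cutoffs <- seq(0.40, 0.80, by = 0.05)
acc <- sapply(cutoffs, function(t) {
  pred_class <- ifelse(res_pred > t, "Yes", "No")
  mean(pred_class == test$No.show)
})
data.frame(cutoff = cutoffs, accuracy = round(acc, 3))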

Conclusion: