library(ggplot2) # for elegant graphics
library(PresenceAbsence) # for ROC plot
library(DescTools) # for model performance summary
library(pscl) # McFadden's R2
############################################################################
######################## SOME USEFUL FUNCTIONS TO HELP YOU #################
############################################################################
####### FUNCTION TO RETURN MODEL SUMMARY OF LOGISTIC REGRESSION ############
model.fun <- function(model.fit){
  model.sum <- summary(model.fit)
  # odds ratios with profile-likelihood 95% confidence intervals
  odds.ratio <- cbind(OR = exp(coef(model.fit)), exp(confint(model.fit)))
  my.list <- list(model.sum, odds.ratio)
  names(my.list) <- c("Model Summary", "OR Summary")
  return(my.list)
}
############### FUNCTION TO RETURN MEASURES OF MODEL PERFORMANCE ##############
### returns the confusion matrix at the accuracy-maximising cutoff;
### optionally plots accuracy vs. cutoff and the confusion matrix
cutoff.fun <- function(Model.fit, plots = TRUE){
  accuracy <- NULL
  threshold <- seq(0.1, 0.9, by = 0.01)
  # classification accuracy at each candidate cutoff
  for (i in seq_along(threshold)){
    accuracy[i] <- Conf(Model.fit, cutoff = threshold[i])$acc
  }
  cutoff <- threshold[which.max(accuracy)]  # cutoff that maximises accuracy
  conf.mat <- Conf(Model.fit, cutoff = cutoff)
  if (plots){
    layout(matrix(1:2, ncol = 2))
    plot(threshold, accuracy, type = "l", main = "Cutoff Based on Accuracy")
    abline(h = max(accuracy), v = cutoff, col = "red")
    fourfoldplot(conf.mat$table, main = "Confusion Matrix Plot",
                 color = c("red", "green"))
  }
  return(conf.mat)
}
################### FUNCTION TO RETURN ROC PLOT WITH AUC #########################
roc.plot <- function(observed, Model.fit){
  # PresenceAbsence::auc.roc.plot expects columns: ID, observed (0/1), predicted
  df <- data.frame(ID = 1:length(observed), observed = observed,
                   predicted = predict(Model.fit, type = "response"))
  # draw the ROC curve with AUC; opt.methods selects which optimal-threshold criterion to mark
  auc.roc.plot(df, opt.methods = 5, color = TRUE)
}
This is a short tutorial on how to do logistic regression in R. It assumes that you already understand logistic regression and simply want to know how to carry it out in R.
Logistic regression is used when the response variable is not normal but binary, modelled with the logit link. It is commonly used in supervised learning (classification) when you want to predict the class/label of an observation. Situations where we can apply logistic regression include:
- Coupon usage (Y/N);
- Heart attack risk (has had a heart attack, has not had a heart attack);
- Credit scoring (default? Y/N);
- Agreement with a survey item such as “Women should take care of their homes and leave running the country to men” (Y/N).
In this tutorial we will use the Default dataset from the ISLR package. It contains four variables on 10,000 credit card customers. The question of interest is to predict which customers will default on their credit card debt, and to examine how each predictor is related to the response (default).
data("Default", package = "ISLR")
Desc(Default, plotit = FALSE)
## -------------------------------------------------------------------------
## Describe Default (data.frame):
##
## data.frame: 10000 obs. of 4 variables
##
## Nr ColName Class NAs Levels
## 1 default factor . (2): 1-No, 2-Yes
## 2 student factor . (2): 1-No, 2-Yes
## 3 balance numeric .
## 4 income numeric .
##
##
## -------------------------------------------------------------------------
## 1 - default (factor - dichotomous)
##
## length n NAs unique
## 10'000 10'000 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## No 9'667 96.7% 96.3% 97.0%
## Yes 333 3.3% 3.0% 3.7%
##
## ' 95%-CI Wilson
##
## -------------------------------------------------------------------------
## 2 - student (factor - dichotomous)
##
## length n NAs unique
## 10'000 10'000 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## No 7'056 70.6% 69.7% 71.4%
## Yes 2'944 29.4% 28.6% 30.3%
##
## ' 95%-CI Wilson
##
## -------------------------------------------------------------------------
## 3 - balance (numeric)
##
## length n NAs unique 0s
## 10'000 10'000 0 9'502 499
## 100.0% 0.0% 5.0%
##
## .05 .10 .25 median .75
## 0.4246597 180.5752651 481.7311051 823.6369726 1'166.3083865
##
## range sd vcoef mad IQR
## 2'654.3225763 483.7149852 0.5790394 507.5173037 684.5772814
##
## mean meanCI
## 835.3748856 825.8930984
## 844.8566729
##
## .90 .95
## 1'471.6252732 1'665.9625812
##
## skew kurt
## 0.2459914 -0.3556736
##
## lowest : 0.0 (499), 0.0238163, 0.4457568, 1.6111756, 1.6740259
## highest: 2'461.5069787, 2'499.0167496, 2'502.6849312, 2'578.4690216, 2'654.3225763
##
## -------------------------------------------------------------------------
## 4 - income (numeric)
##
## length n NAs unique 0s
## 10'000 10'000 0 = n 0
## 100.0% 0.0% 0.0%
##
## .05 .10 .25 median .75
## 13'431.7704 15'797.7793 21'340.4629 34'552.6448 43'807.7293
##
## range sd vcoef mad IQR
## 72'782.2658 13'336.6396 0.3979 16'350.8571 22'467.2664
##
## mean meanCI
## 33'516.9819 33'255.5569
## 33'778.4069
##
## .90 .95
## 50'766.1479 54'677.8906
##
## skew kurt
## 0.0733 -0.9001
##
## lowest : 771.9677, 1'498.2273, 2'541.2008, 2'702.9823, 2'981.2795
## highest: 70'700.6478, 71'238.5506, 71'878.7726, 72'461.3014, 73'554.2335
There are 10,000 observations in the data with four variables: default, student, balance, and income. Both default and student are qualitative with two levels each; the remaining two variables are numeric.
For the default variable, 333 of the 10,000 customers defaulted, which corresponds to about 3.3%. The student variable is dichotomous with labels “Yes” and “No”; about 29.4% of the customers are students.
The income variable is numeric with a median income of about \(\$34,550\). The lowest income is about \(\$772\) while the highest is about \(\$73,554\).
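The plotting code is not shown here; a minimal ggplot2 sketch that reproduces the kind of chart discussed next might look like this (the exact aesthetics are an assumption):
# bar chart of default counts within each student group (sketch, not the original code)
ggplot(Default, aes(x = student, fill = default)) +
  geom_bar(position = "dodge") +
  labs(x = "Student", y = "Count", fill = "Default")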
From the above chart, we can see that only a small percentage of the students defaulted; the same is true of the non-students, more than 70% of whom did not default. There could nevertheless be a relationship between student status and default.
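A similar sketch for the box plot described next (again, not the original code):
# box plot of balance by default status (sketch)
ggplot(Default, aes(x = default, y = balance, fill = default)) +
  geom_boxplot() +
  labs(x = "Default", y = "Balance")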
The box plot shows that those who defaulted had a median balance of more than \(\$1,750\), compared with about \(\$800\) for those who did not default. It also shows that, in general, those who default carry higher balances than those who don’t, so we can assume that the higher the balance, the more likely a customer is to default. Hence there is likely an association between default and balance.
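And a sketch for the income plot referred to below:
# box plot of income by default status (sketch)
ggplot(Default, aes(x = default, y = income, fill = default)) +
  geom_boxplot() +
  labs(x = "Default", y = "Income")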
Looking at the plot above, there does not seem to be any relationship between income and default.
We first fit the model including all possible interactions. The output below shows the summary of the fitted model together with the odds ratios of the parameter estimates and their corresponding 95% confidence intervals.
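The fitting code is not shown in the original; judging from the Call in the output below, it was equivalent to the following (the object name Logis.Int is my assumption):
# full model with all two- and three-way interactions
Logis.Int <- glm(default ~ student * balance * income,
                 family = binomial(), data = Default)
model.fun(Logis.Int) # helper defined above; its confint() call prints the profiling message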
## Waiting for profiling to be done...
## $`Model Summary`
##
## Call:
## glm(formula = default ~ student * balance * income, family = binomial(),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4949 -0.1410 -0.0551 -0.0202 3.7623
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.083e+01 1.962e+00 -5.521 3.37e-08 ***
## studentYes -1.536e+00 3.397e+00 -0.452 0.651
## balance 5.750e-03 1.244e-03 4.622 3.80e-06 ***
## income -9.887e-07 4.718e-05 -0.021 0.983
## studentYes:balance 3.548e-04 2.032e-03 0.175 0.861
## studentYes:income 6.366e-05 1.531e-04 0.416 0.678
## balance:income 1.664e-09 2.986e-08 0.056 0.956
## studentYes:balance:income -2.930e-08 8.970e-08 -0.327 0.744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.0 on 9992 degrees of freedom
## AIC: 1587
##
## Number of Fisher Scoring iterations: 8
##
##
## $`OR Summary`
## OR 2.5 % 97.5 %
## (Intercept) 1.975079e-05 3.805904e-07 8.116195e-04
## studentYes 2.153406e-01 2.538793e-04 1.511830e+02
## balance 1.005767e+00 1.003383e+00 1.008276e+00
## income 9.999990e-01 9.999070e-01 1.000091e+00
## studentYes:balance 1.000355e+00 9.964219e-01 1.004382e+00
## studentYes:income 1.000064e+00 9.997641e-01 1.000362e+00
## balance:income 1.000000e+00 9.999999e-01 1.000000e+00
## studentYes:balance:income 1.000000e+00 9.999998e-01 1.000000e+00
From this output, none of the interaction terms is statistically significant at the 5% level, and, surprisingly, income has no significant effect on whether a person will default. (Student and balance do turn out to be significant once the interactions are dropped, as the reduced model below shows.)
Next, we update our model by excluding all the non-significant terms and compare the two models with a likelihood-ratio (\(\chi^2\)) test. The test rejects the reduced model (the model with fewer parameters) if the p-value is less than 0.05.
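A sketch of the corresponding code (the reduced model’s name, Logis.Int.update, is used later in this tutorial; the calls themselves are inferred from the output):
# reduced model: main effects of student and balance only
Logis.Int.update <- glm(default ~ student + balance,
                        family = binomial(), data = Default)
# likelihood-ratio test of the reduced model (Model 1) against the full model (Model 2)
anova(Logis.Int.update, Logis.Int, test = "Chisq")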
## Analysis of Deviance Table
##
## Model 1: default ~ student + balance
## Model 2: default ~ student * balance * income
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9997 1571.7
## 2 9992 1571.0 5 0.72249 0.9817
The output of the final model is below.
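As before, the summary and odds ratios below come from the model.fun() helper defined at the top:
model.fun(Logis.Int.update)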
## Waiting for profiling to be done...
## $`Model Summary`
##
## Call:
## glm(formula = default ~ student + balance, family = binomial(),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4578 -0.1422 -0.0559 -0.0203 3.7435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
## studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
## balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
##
##
## $`OR Summary`
## OR 2.5 % 97.5 %
## (Intercept) 2.145622e-05 0.0000101489 4.319257e-05
## studentYes 4.892520e-01 0.3650292262 6.511107e-01
## balance 1.005755e+00 1.0053106429 1.006225e+00
\(\hat{\beta}_1\): Being a student, compared with a non-student, changes the log-odds of default by \(-0.715\); equivalently, a student’s odds of default are about \(0.49\) times those of a non-student.
\(\hat{\beta}_2\): For each one-unit (dollar) increase in balance, the log-odds of default increase by \(0.0057\). In terms of the odds ratio, we expect the odds of default to increase by a factor of about \(1.006\).
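To make the interpretation concrete, we can predict the default probability for a hypothetical customer (the values below are purely illustrative):
# predicted probability of default for a student carrying a balance of $1,500
predict(Logis.Int.update,
        newdata = data.frame(student = "Yes", balance = 1500),
        type = "response")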
There are many ways to assess the performance of a logistic regression model. In this tutorial, we will only consider accuracy, sensitivity, the ROC curve with its AUC, and lastly McFadden’s pseudo \(R^2\).
Accuracy: the number of true positives (TP) plus true negatives (TN) divided by the total population, \(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\).
Sensitivity: the number of true positives divided by the total of true positives and false negatives (FN), \(\text{Sensitivity} = \frac{TP}{TP + FN}\).
ROC curve and AUC: the ROC curve is the plot of the true positive rate (TPR, sensitivity) against the false positive rate (FPR, \(1-\)specificity) over different cutoff points. You want the curve to approach the north-west corner \((0, 1)\): the closer it gets, the more accurate the model; the closer it sits to the diagonal line, the less accurate the model. The AUC is simply the area under the ROC curve.
McFadden’s pseudo \(R^2\) (pR2): one of the numerous pseudo-\(R^2\) measures proposed for generalized linear models, defined as \(1 - \log\hat{L}(\text{model}) / \log\hat{L}(\text{null model})\). It ranges from 0 to 1; values close to 0 indicate that the model has little predictive power.
Model performance
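The confusion matrix below was presumably produced with the cutoff.fun() helper defined at the top, applied to the final model:
cutoff.fun(Logis.Int.update)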
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 9616 213
## Yes 51 120
##
## Accuracy : 0.9736
## 95% CI : (0.9703, 0.9766)
## No Information Rate : 0.9667
## P-Value [Acc > NIR] : 3.95e-05
##
## Kappa : 0.4641
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9947
## Specificity : 0.3604
## Pos Pred Value : 0.9783
## Neg Pred Value : 0.7018
## Prevalence : 0.9667
## Detection Rate : 0.9616
## Detection Prevalence : 0.9667
## Balanced Accuracy : 0.6775
## F-val Accuracy : 0.9865
##
## 'Positive' Class : No
Our model seems to perform well, even though it is evaluated on the same data it was trained on. It has a training accuracy of about 97%, i.e. an error rate below 3%. The model is particularly good at predicting “No” when the true class is “No”, as can be seen in the confusion matrix plot above.
observed <- ifelse(Default$default=="Yes",1,0)
roc.plot(observed = observed, Model.fit = Logis.Int.update)
Finally, from the ROC curve the AUC is about 0.95 (95%), which means the model is very good for classification. McFadden’s pseudo \(R^2\) is computed below.
pR2(Logis.Int.update)
## llh llhNull G2 McFadden r2ML
## -785.8407986 -1460.3248557 1348.9681142 0.4618726 0.1261939
## r2CU
## 0.4982388
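As a quick sanity check, McFadden’s pseudo \(R^2\) can be recomputed directly from the log-likelihoods reported above:
# 1 - llh / llhNull reproduces the McFadden value
1 - (-785.8407986) / (-1460.3248557) # = 0.4618726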
This tutorial evaluates the model only on the training dataset, which is not good practice: a held-out test set would give a more honest estimate of performance. The aim of the tutorial was to show you how to use R to fit a logistic regression and check its performance using accuracy, sensitivity, and the area under the curve (AUC). In the next tutorial, we will look at model diagnostics for logistic regression. I hope you find this tutorial useful.