library(ggplot2) # for elegant graphics
library(PresenceAbsence) # for ROC plot
library(DescTools) # for model performance summary
library(pscl) # McFadden's R2
############################################################################
######################## SOME USEFUL FUNCTIONS TO HELP YOU #################
############################################################################
####### FUNCTION TO RETURN MODEL SUMMARY OF LOGISTIC REGRESSION ############
model.fun <- function(model.fit){
  model.sum <- summary(model.fit)
  # odds ratios with profile-likelihood 95% confidence intervals
  odds.ratio <- cbind(OR = exp(coef(model.fit)), exp(confint(model.fit)))
  my.list <- list(model.sum, odds.ratio)
  names(my.list) <- c("Model Summary", "OR Summary")
  return(my.list)
}
############### FUNCTION TO RETURN MEASURES OF MODEL PERFORMANCE ##############
### returns the confusion matrix at the accuracy-maximising cutoff;
### optionally plots accuracy vs. cutoff and the confusion matrix
cutoff.fun <- function(Model.fit, plots = TRUE){
  accuracy <- NULL
  threshold <- seq(0.1, 0.9, by = 0.01)
  # classification accuracy at each candidate cutoff
  for (i in seq_along(threshold)){
    accuracy[i] <- Conf(Model.fit, cutoff = threshold[i])$acc
  }
  cutoff <- threshold[which.max(accuracy)]  # cutoff that maximises accuracy
  conf.mat <- Conf(Model.fit, cutoff = cutoff)
  if (plots){
    layout(matrix(1:2, ncol = 2))
    plot(threshold, accuracy, type = "l", main = "Cutoff Based on Accuracy")
    abline(h = max(accuracy), v = cutoff, col = "red")
    fourfoldplot(conf.mat$table, main = "Confusion Matrix Plot",
                 color = c("red", "green"))
  }
  return(conf.mat)
}
################### FUNCTION TO RETURN ROC PLOT WITH AUC #########################
roc.plot <- function(observed, Model.fit){
  # PresenceAbsence::auc.roc.plot expects columns: ID, observed (0/1), predicted
  df <- data.frame(ID = 1:length(observed), observed = observed,
                   predicted = predict(Model.fit, type = "response"))
  # draw the ROC curve with AUC; opt.methods selects which optimal-threshold criterion to mark
  auc.roc.plot(df, opt.methods = 5, color = TRUE)
}
This is a short tutorial on how to do logistic regression in R. It assumes that you already understand logistic regression and simply want to know how to carry it out in R.
Logistic regression is used when the response variable is not normal but binary, modelled with the logit link. It is commonly used in supervised learning (classification) when you want to predict the class/label of an observation. Situations where we can apply logistic regression include:
- Coupon usage (Y/N);
- Heart attack risk (has had a heart attack, has not had a heart attack);
- Credit scoring (default? Y/N);
- Agreement with a survey item such as “Women should take care of their homes and leave running the country to men” (Y/N).
In this tutorial we will use the Default dataset from the ISLR package. It contains four variables on 10,000 credit card customers. The question of interest is to predict which customers will default on their credit card debt, and to examine how each predictor is related to the response (default).
data("Default", package = "ISLR")
Desc(Default, plotit = FALSE)
## -------------------------------------------------------------------------
## Describe Default (data.frame):
##
## data.frame: 10000 obs. of 4 variables
##
## Nr ColName Class NAs Levels
## 1 default factor . (2): 1-No, 2-Yes
## 2 student factor . (2): 1-No, 2-Yes
## 3 balance numeric .
## 4 income numeric .
##
##
## -------------------------------------------------------------------------
## 1 - default (factor - dichotomous)
##
## length n NAs unique
## 10'000 10'000 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## No 9'667 96.7% 96.3% 97.0%
## Yes 333 3.3% 3.0% 3.7%
##
## ' 95%-CI Wilson
##
## -------------------------------------------------------------------------
## 2 - student (factor - dichotomous)
##
## length n NAs unique
## 10'000 10'000 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## No 7'056 70.6% 69.7% 71.4%
## Yes 2'944 29.4% 28.6% 30.3%
##
## ' 95%-CI Wilson
##
## -------------------------------------------------------------------------
## 3 - balance (numeric)
##
## length n NAs unique 0s
## 10'000 10'000 0 9'502 499
## 100.0% 0.0% 5.0%
##
## .05 .10 .25 median .75
## 0.4246597 180.5752651 481.7311051 823.6369726 1'166.3083865
##
## range sd vcoef mad IQR
## 2'654.3225763 483.7149852 0.5790394 507.5173037 684.5772814
##
## mean meanCI
## 835.3748856 825.8930984
## 844.8566729
##
## .90 .95
## 1'471.6252732 1'665.9625812
##
## skew kurt
## 0.2459914 -0.3556736
##
## lowest : 0.0 (499), 0.0238163, 0.4457568, 1.6111756, 1.6740259
## highest: 2'461.5069787, 2'499.0167496, 2'502.6849312, 2'578.4690216, 2'654.3225763
##
## -------------------------------------------------------------------------
## 4 - income (numeric)
##
## length n NAs unique 0s
## 10'000 10'000 0 = n 0
## 100.0% 0.0% 0.0%
##
## .05 .10 .25 median .75
## 13'431.7704 15'797.7793 21'340.4629 34'552.6448 43'807.7293
##
## range sd vcoef mad IQR
## 72'782.2658 13'336.6396 0.3979 16'350.8571 22'467.2664
##
## mean meanCI
## 33'516.9819 33'255.5569
## 33'778.4069
##
## .90 .95
## 50'766.1479 54'677.8906
##
## skew kurt
## 0.0733 -0.9001
##
## lowest : 771.9677, 1'498.2273, 2'541.2008, 2'702.9823, 2'981.2795
## highest: 70'700.6478, 71'238.5506, 71'878.7726, 72'461.3014, 73'554.2335
There are 10,000 observations in the data with four variables: default, student, balance, and income. Both default and student are qualitative with two levels each; the remaining two variables are numeric.
For the default variable, 333 of the 10,000 customers defaulted, which corresponds to about 3.3%. The student variable is dichotomous with labels “Yes” and “No”; about 29.4% of the customers are students.
The income variable is numeric with a median income of about \(\$34,550\). The lowest income is about \(\$772\) while the highest is about \(\$73,554\).
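The plotting code is not shown here; a minimal ggplot2 sketch that reproduces the kind of chart discussed next might look like this (the exact aesthetics are an assumption):
# bar chart of default counts within each student group (sketch, not the original code)
ggplot(Default, aes(x = student, fill = default)) +
  geom_bar(position = "dodge") +
  labs(x = "Student", y = "Count", fill = "Default")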
From the above chart, we can see that only a small percentage of the students defaulted; the same is true of the non-students, more than 70% of whom did not default. There could nevertheless be a relationship between student status and default.
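A similar sketch for the box plot described next (again, not the original code):
# box plot of balance by default status (sketch)
ggplot(Default, aes(x = default, y = balance, fill = default)) +
  geom_boxplot() +
  labs(x = "Default", y = "Balance")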
The box plot shows that those who defaulted had a median balance of more than \(\$1,750\), compared with about \(\$800\) for those who did not default. It also shows that, in general, those who default carry higher balances than those who don’t, so we can assume that the higher the balance, the more likely a customer is to default. Hence there is likely an association between default and balance.
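And a sketch for the income plot referred to below:
# box plot of income by default status (sketch)
ggplot(Default, aes(x = default, y = income, fill = default)) +
  geom_boxplot() +
  labs(x = "Default", y = "Income")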
Looking at the plot above, there does not seem to be any relationship between income and default.
We first fit the model including all possible interactions. The output below shows the summary of the fitted model together with the odds ratios of the parameter estimates and their corresponding 95% confidence intervals.
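The fitting code is not shown in the original; judging from the Call in the output below, it was equivalent to the following (the object name Logis.Int is my assumption):
# full model with all two- and three-way interactions
Logis.Int <- glm(default ~ student * balance * income,
                 family = binomial(), data = Default)
model.fun(Logis.Int) # helper defined above; its confint() call prints the profiling message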
## Waiting for profiling to be done...
## $`Model Summary`
##
## Call:
## glm(formula = default ~ student * balance * income, family = binomial(),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4949 -0.1410 -0.0551 -0.0202 3.7623
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.083e+01 1.962e+00 -5.521 3.37e-08 ***
## studentYes -1.536e+00 3.397e+00 -0.452 0.651
## balance 5.750e-03 1.244e-03 4.622 3.80e-06 ***
## income -9.887e-07 4.718e-05 -0.021 0.983
## studentYes:balance 3.548e-04 2.032e-03 0.175 0.861
## studentYes:income 6.366e-05 1.531e-04 0.416 0.678
## balance:income 1.664e-09 2.986e-08 0.056 0.956
## studentYes:balance:income -2.930e-08 8.970e-08 -0.327 0.744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.0 on 9992 degrees of freedom
## AIC: 1587
##
## Number of Fisher Scoring iterations: 8
##
##
## $`OR Summary`
## OR 2.5 % 97.5 %
## (Intercept) 1.975079e-05 3.805904e-07 8.116195e-04
## studentYes 2.153406e-01 2.538793e-04 1.511830e+02
## balance 1.005767e+00 1.003383e+00 1.008276e+00
## income 9.999990e-01 9.999070e-01 1.000091e+00
## studentYes:balance 1.000355e+00 9.964219e-01 1.004382e+00
## studentYes:income 1.000064e+00 9.997641e-01 1.000362e+00
## balance:income 1.000000e+00 9.999999e-01 1.000000e+00
## studentYes:balance:income 1.000000e+00 9.999998e-01 1.000000e+00
From this output, none of the interaction terms is statistically significant at the 5% level, and, surprisingly, income has no significant effect on whether a person will default. (Student and balance do turn out to be significant once the interactions are dropped, as the reduced model below shows.)
Next, we update our model by excluding all the non-significant terms and compare the two models with a likelihood-ratio (\(\chi^2\)) test. The test rejects the reduced model (the model with fewer parameters) if the p-value is less than 0.05.
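A sketch of the corresponding code (the reduced model’s name, Logis.Int.update, is used later in this tutorial; the calls themselves are inferred from the output):
# reduced model: main effects of student and balance only
Logis.Int.update <- glm(default ~ student + balance,
                        family = binomial(), data = Default)
# likelihood-ratio test of the reduced model (Model 1) against the full model (Model 2)
anova(Logis.Int.update, Logis.Int, test = "Chisq")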
## Analysis of Deviance Table
##
## Model 1: default ~ student + balance
## Model 2: default ~ student * balance * income
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9997 1571.7
## 2 9992 1571.0 5 0.72249 0.9817
The output of the final model is below.
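As before, the summary and odds ratios below come from the model.fun() helper defined at the top:
model.fun(Logis.Int.update)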
## Waiting for profiling to be done...
## $`Model Summary`
##
## Call:
## glm(formula = default ~ student + balance, family = binomial(),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4578 -0.1422 -0.0559 -0.0203 3.7435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
## studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
## balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
##
##
## $`OR Summary`
## OR 2.5 % 97.5 %
## (Intercept) 2.145622e-05 0.0000101489 4.319257e-05
## studentYes 4.892520e-01 0.3650292262 6.511107e-01
## balance 1.005755e+00 1.0053106429 1.006225e+00
\(\hat{\beta}_1\): Being a student, compared with a non-student, changes the log-odds of default by \(-0.715\); equivalently, a student’s odds of default are about \(0.49\) times those of a non-student.
\(\hat{\beta}_2\): For each one-unit (dollar) increase in balance, the log-odds of default increase by \(0.0057\). In terms of the odds ratio, we expect the odds of default to increase by a factor of about \(1.006\).
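To make the interpretation concrete, we can predict the default probability for a hypothetical customer (the values below are purely illustrative):
# predicted probability of default for a student carrying a balance of $1,500
predict(Logis.Int.update,
        newdata = data.frame(student = "Yes", balance = 1500),
        type = "response")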
There are many ways to assess the performance of a logistic regression model. In this tutorial, we will only consider accuracy, sensitivity, the ROC curve with its AUC, and lastly McFadden’s pseudo \(R^2\).
Accuracy: the number of true positives (TP) plus true negatives (TN) divided by the total population, \(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\).
Sensitivity: the number of true positives divided by the total of true positives and false negatives (FN), \(\text{Sensitivity} = \frac{TP}{TP + FN}\).
ROC curve and AUC: the ROC curve is the plot of the true positive rate (TPR, sensitivity) against the false positive rate (FPR, \(1-\)specificity) over different cutoff points. You want the curve to approach the north-west corner \((0, 1)\): the closer it gets, the more accurate the model; the closer it sits to the diagonal line, the less accurate the model. The AUC is simply the area under the ROC curve.
McFadden’s pseudo \(R^2\) (pR2): one of the numerous pseudo-\(R^2\) measures proposed for generalized linear models, defined as \(1 - \log\hat{L}(\text{model}) / \log\hat{L}(\text{null model})\). It ranges from 0 to 1; values close to 0 indicate that the model has little predictive power.
Model performance
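The confusion matrix below was presumably produced with the cutoff.fun() helper defined at the top, applied to the final model:
cutoff.fun(Logis.Int.update)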
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 9616 213
## Yes 51 120
##
## Accuracy : 0.9736
## 95% CI : (0.9703, 0.9766)
## No Information Rate : 0.9667
## P-Value [Acc > NIR] : 3.95e-05
##
## Kappa : 0.4641
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9947
## Specificity : 0.3604
## Pos Pred Value : 0.9783
## Neg Pred Value : 0.7018
## Prevalence : 0.9667
## Detection Rate : 0.9616
## Detection Prevalence : 0.9667
## Balanced Accuracy : 0.6775
## F-val Accuracy : 0.9865
##
## 'Positive' Class : No
Our model seems to perform well, even though it is evaluated on the same data it was trained on. It has a training accuracy of about 97%, i.e. an error rate below 3%. The model is particularly good at predicting “No” when the true class is “No”, as can be seen in the confusion matrix plot above.
observed <- ifelse(Default$default=="Yes",1,0)
roc.plot(observed = observed, Model.fit = Logis.Int.update)
Finally, from the ROC curve the AUC is about 0.95 (95%), which means the model is very good for classification. McFadden’s pseudo \(R^2\) is computed below.
pR2(Logis.Int.update)
## llh llhNull G2 McFadden r2ML
## -785.8407986 -1460.3248557 1348.9681142 0.4618726 0.1261939
## r2CU
## 0.4982388
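As a quick sanity check, McFadden’s pseudo \(R^2\) can be recomputed directly from the log-likelihoods reported above:
# 1 - llh / llhNull reproduces the McFadden value
1 - (-785.8407986) / (-1460.3248557) # = 0.4618726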
This tutorial evaluates the model only on the training dataset, which is not good practice: a held-out test set would give a more honest estimate of performance. The aim of the tutorial was to show you how to use R to fit a logistic regression and check its performance using accuracy, sensitivity, and the area under the curve (AUC). In the next tutorial, we will look at model diagnostics for logistic regression. I hope you find this tutorial useful.