Logistic Regression

Logistic regression is a very common tool for solving classification problems. Given a binary outcome, we would like to predict whether an event will occur based on a set of quantitative or qualitative variables. In this blog post, we use a public dataset to classify loan status based on various sociodemographic variables.
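Under the hood, logistic regression models the log-odds of the outcome as a linear combination of the predictors and pushes the result through the logistic (sigmoid) function to obtain a probability between 0 and 1. A tiny illustration with made-up numbers (the coefficients and predictor values below are purely hypothetical):

# toy sketch: the logistic function maps a linear predictor to a probability
b0 <- -1.5; b1 <- 0.8        # made-up coefficients
x <- c(-2, 0, 2)             # made-up predictor values
log_odds <- b0 + b1 * x      # linear predictor on the log-odds scale
plogis(log_odds)             # same as 1 / (1 + exp(-log_odds))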

# load packages
if(!require(pacman)){install.packages("pacman"); require(pacman)}
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 3.6.2
packages <- c('tidyverse', 'caret', 'broom', 'Hmisc', 'kableExtra')
pacman::p_load(char = packages)
# read data
df <- read.csv(url('https://datahack-prod.s3.ap-south-1.amazonaws.com/train_file/train_u6lujuX_CVtuZ9i.csv'))
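One caveat if you run this yourself: the str() output further down shows the character columns as factors because this post was knitted under R 3.6, where read.csv() defaults to stringsAsFactors = TRUE. Under R 4.0 or later you would need to ask for that explicitly:

# for R >= 4.0, keep the factor behaviour the rest of the post relies on
# df <- read.csv(url('https://datahack-prod.s3.ap-south-1.amazonaws.com/train_file/train_u6lujuX_CVtuZ9i.csv'), stringsAsFactors = TRUE)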

# check missing
colSums(is.na(df))
##           Loan_ID            Gender           Married        Dependents 
##                 0                 0                 0                 0 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                 0                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0

# impute missing values and standardise the numeric columns
preProcValues <- preProcess(df, method = c("medianImpute", "center", "scale"))
df_processed <- predict(preProcValues, df)

# check again
colSums(is.na(df_processed))
##           Loan_ID            Gender           Married        Dependents 
##                 0                 0                 0                 0 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                 0                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                 0                 0                 0                 0 
##       Loan_Status 
##                 0
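caret's preProcess() learns the imputation and standardisation parameters from the data, and predict() then applies them. Two things are worth spelling out: medianImpute only fills numeric columns (the blank "" values in the factor columns are not NA, which is why they already showed 0 above and survive as their own level in str() below), and "center"/"scale" standardise each numeric column. Roughly, and ignoring the exact order in which caret computes its statistics, the effect on a single numeric column looks like this hypothetical sketch:

# rough manual equivalent for one numeric column (illustration only)
x <- df$LoanAmount
x[is.na(x)] <- median(x, na.rm = TRUE)   # median imputation
x <- (x - mean(x)) / sd(x)               # centre and scale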
# check distribution
broom::glance(df_processed)
## Warning: 'glance.data.frame' is deprecated.
## See help("Deprecated")
## # A tibble: 1 x 4
##    nrow  ncol complete.obs na.fraction
##   <int> <int>        <int>       <dbl>
## 1   614    13          614           0
table(df_processed$Loan_Status)
## 
##   N   Y 
## 192 422
prop.table(ftable(df_processed$Loan_Status)) %>% round(., 1)
##    N   Y
##         
##  0.3 0.7
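In other words, about 69% of the applications in this dataset (422 of 614) were approved, so a model that always predicts "Y" would already be right roughly 69% of the time; that is the baseline to beat. The unrounded split:

prop.table(table(df_processed$Loan_Status))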

# head df
head(df_processed) %>% kable(.)
| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LP001002 | Male | No | 0 | Graduate | No | 0.0729314 | -0.5540356 | -0.2151272 | 0.276411 | 0.4324768 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | -0.1343025 | -0.0387000 | -0.2151272 | 0.276411 | 0.4324768 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | -0.3934266 | -0.5540356 | -0.9395335 | 0.276411 | 0.4324768 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | -0.4616860 | 0.2517743 | -0.3085990 | 0.276411 | 0.4324768 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 0.0976488 | -0.5540356 | -0.0632356 | 0.276411 | 0.4324768 | Urban | Y |
| LP001011 | Male | Yes | 2 | Graduate | Yes | 0.0022165 | 0.8798823 | 1.4089450 | 0.276411 | 0.4324768 | Urban | Y |

# str(), dim()
str(df_processed)
## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ ApplicantIncome  : num  0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
##  $ CoapplicantIncome: num  -0.554 -0.0387 -0.554 0.2518 -0.554 ...
##  $ LoanAmount       : num  -0.2151 -0.2151 -0.9395 -0.3086 -0.0632 ...
##  $ Loan_Amount_Term : num  0.276 0.276 0.276 0.276 0.276 ...
##  $ Credit_History   : num  0.432 0.432 0.432 0.432 0.432 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
dim(df_processed)
## [1] 614  13

# relevel and relabel Loan_Status
levels(df_processed$Loan_Status)
## [1] "N" "Y"
df_processed <- df_processed %>% dplyr::mutate(Loan_Status = Loan_Status %>% 
                                                       factor(., levels = c("Y", "N"), labels = c("Yes", "No")))
levels(df_processed$Loan_Status)
## [1] "Yes" "No"
# set seed
set.seed(1234)

# split data
index <- caret::createDataPartition(df_processed$Loan_Status, p = 0.7, list = FALSE)
trainSet <- df_processed[index, ]
testSet <- df_processed[-index, ]
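createDataPartition() samples within each level of the outcome, so both sets should keep roughly the same Yes/No mix as the full data. A quick sanity check, if you want one:

# the class proportions in the two sets should be close to each other
prop.table(table(trainSet$Loan_Status))
prop.table(table(testSet$Loan_Status))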

# define the training control - let's do 10-fold cross-validation
train.control <- caret::trainControl(
        method = "cv",
        number = 10,
        savePredictions = 'final',
        classProbs = TRUE)
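As configured, caret will judge models by accuracy. If you would rather have cross-validated ROC, sensitivity and specificity reported, one option (not used in this post) is to add caret's twoClassSummary and then request metric = "ROC" in train():

# alternative control that reports ROC instead of accuracy (not run here)
# train.control <- caret::trainControl(method = "cv", number = 10,
#         savePredictions = 'final', classProbs = TRUE,
#         summaryFunction = twoClassSummary)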

# define IVs, DV
IV <- names(df_processed)[names(df_processed) %nin% c("Loan_ID", "Loan_Status")]
DV <- "Loan_Status"
# build model
model <- caret::train(trainSet[, IV], trainSet[, DV], method = 'glm', trControl = train.control)
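Before digging into the coefficients, it is worth printing the train object itself: it shows the accuracy and kappa averaged over the 10 cross-validation folds, and model$resample holds the per-fold values.

# cross-validated performance (averaged over folds, and per fold)
print(model)
model$resample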

# summary model output
summary(model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2889  -0.7438  -0.5647   0.4431   2.1745  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -13.162035 623.299739  -0.021  0.98315    
## GenderFemale            -0.210278   0.936934  -0.224  0.82242    
## GenderMale              -0.121769   0.884842  -0.138  0.89054    
## MarriedNo               13.756845 623.299635   0.022  0.98239    
## MarriedYes              13.250102 623.299664   0.021  0.98304    
## Dependents0             -0.872462   1.038756  -0.840  0.40096    
## Dependents1             -0.394333   1.061059  -0.372  0.71016    
## Dependents2             -1.116722   1.088138  -1.026  0.30477    
## Dependents3+            -1.114125   1.129840  -0.986  0.32409    
## EducationNot Graduate    0.415874   0.302300   1.376  0.16891    
## Self_EmployedNo          0.227842   0.593470   0.384  0.70104    
## Self_EmployedYes         0.358209   0.662960   0.540  0.58898    
## ApplicantIncome          0.009642   0.158041   0.061  0.95135    
## CoapplicantIncome        0.200371   0.139653   1.435  0.15135    
## LoanAmount               0.077609   0.146405   0.530  0.59605    
## Loan_Amount_Term         0.031466   0.131057   0.240  0.81025    
## Credit_History          -1.227525   0.161786  -7.587 3.27e-14 ***
## Property_AreaSemiurban  -0.836073   0.310128  -2.696  0.00702 ** 
## Property_AreaUrban      -0.291588   0.306034  -0.953  0.34069    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 535.87  on 430  degrees of freedom
## Residual deviance: 414.25  on 412  degrees of freedom
## AIC: 452.25
## 
## Number of Fisher Scoring iterations: 13
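The very large standard errors on the intercept and the Married terms are a hint that the reference level for Married is the handful of rows with a blank value, which leaves glm with too little information for those coefficients (a mild case of separation); recoding or dropping the blank levels would likely clean this up. For interpretation, it can also help to look at odds ratios rather than raw coefficients; the fitted glm object is stored in model$finalModel:

# odds ratios with confidence intervals (exponentiated coefficients)
broom::tidy(model$finalModel, exponentiate = TRUE, conf.int = TRUE)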

# predict approval probabilities on the test set using the model above
testSet$prediction <- predict(object = model, testSet[, IV], type = "prob")$Yes

# convert predicted probabilities into class labels (0.5 cut-off)
testSet <- testSet %>%
        dplyr::mutate(Loan_Status_Predicted = ifelse(prediction > .5, "Y", "N") %>%
                              factor(., levels = c("Y", "N"), labels = c("Yes", "No")))
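Since 0.5 is the cut-off used here anyway, the same labels should also come out of caret's default type = "raw", which returns the predicted class directly (the variable name below is just an illustrative alternative):

# equivalent shortcut: predicted classes straight from predict()
# predicted_class <- predict(model, testSet[, IV])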

# confusion matrix
caret::confusionMatrix(testSet$Loan_Status, testSet$Loan_Status_Predicted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Yes  No
##        Yes 126   0
##        No   27  30
##                                           
##                Accuracy : 0.8525          
##                  95% CI : (0.7926, 0.9005)
##     No Information Rate : 0.8361          
##     P-Value [Acc > NIR] : 0.3149          
##                                           
##                   Kappa : 0.6048          
##                                           
##  Mcnemar's Test P-Value : 5.624e-07       
##                                           
##             Sensitivity : 0.8235          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.5263          
##              Prevalence : 0.8361          
##          Detection Rate : 0.6885          
##    Detection Prevalence : 0.6885          
##       Balanced Accuracy : 0.9118          
##                                           
##        'Positive' Class : Yes             
## 
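A small caveat about caret::confusionMatrix(): its first argument (data) is meant to be the predicted classes and the second (reference) the observed ones. The call above passes the observed Loan_Status first, so the "Prediction" rows in the printout are actually the observed labels, and sensitivity and specificity are computed from the transposed table. The conventional order would be:

# predicted first, observed second, with "Yes" flagged as the positive class
# caret::confusionMatrix(data = testSet$Loan_Status_Predicted,
#                        reference = testSet$Loan_Status, positive = "Yes")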

# tidy result
result <- broom::tidy(caret::confusionMatrix(testSet$Loan_Status, testSet$Loan_Status_Predicted))
result %>% kable(.)
| term | class | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| accuracy | NA | 0.8524590 | 0.7926416 | 0.9004632 | 0.3149022 |
| kappa | NA | 0.6047516 | NA | NA | 0.0000006 |
| sensitivity | Yes | 0.8235294 | NA | NA | NA |
| specificity | Yes | 1.0000000 | NA | NA | NA |
| pos_pred_value | Yes | 1.0000000 | NA | NA | NA |
| neg_pred_value | Yes | 0.5263158 | NA | NA | NA |
| precision | Yes | 1.0000000 | NA | NA | NA |
| recall | Yes | 0.8235294 | NA | NA | NA |
| f1 | Yes | 0.9032258 | NA | NA | NA |
| prevalence | Yes | 0.8360656 | NA | NA | NA |
| detection_rate | Yes | 0.6885246 | NA | NA | NA |
| detection_prevalence | Yes | 0.6885246 | NA | NA | NA |
| balanced_accuracy | Yes | 0.9117647 | NA | NA | NA |

Summary

The simple example above demonstrates how easy it is to build a logistic regression model with just a few lines of code. From the model summary output, we see that credit history is the most significant predictor of loan status (not surprisingly). Without tuning any parameters or doing any feature engineering, our baseline model achieves 85% accuracy, 82% sensitivity, 100% specificity, 100% precision and an F1 score of 90% on the test set. Of course, there is plenty of room for improvement, but it is a very good start. The next steps are to drop the features we do not need and to compare this algorithm with other classification algorithms to arrive at a better solution.
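As a sketch of what that comparison could look like (hypothetical, not run in this post, and assuming the randomForest package is installed), caret makes it easy to fit a second learner on the same folds and line the resampling results up side by side:

# hypothetical sketch: refit on matching folds and compare resampling results
# set.seed(1234)
# model_glm <- caret::train(trainSet[, IV], trainSet[, DV], method = "glm", trControl = train.control)
# set.seed(1234)
# model_rf  <- caret::train(trainSet[, IV], trainSet[, DV], method = "rf",  trControl = train.control)
# summary(caret::resamples(list(logistic = model_glm, random_forest = model_rf)))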