Bank Loan Approval

In Blog 3, I have decided to work on binary logistic regression. For this purpose, I will use the Bank Loan Approval dataset, available from the open-source link below:

https://www.kaggle.com/datasets/vikramamin/bank-loan-approval-lr-dt-rf-and-auc/data

Data Extraction

# Packages used throughout the post (loaded up front)
library(tidyverse); library(lessR); library(DataExplorer); library(kableExtra)
Loan <- read.csv("https://raw.githubusercontent.com/uplotnik/Health/main/bankloan.csv")
head(Loan)
##   ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage
## 1  1  25          1     49    91107      4   1.6         1        0
## 2  2  45         19     34    90089      3   1.5         1        0
## 3  3  39         15     11    94720      1   1.0         1        0
## 4  4  35          9    100    94112      1   2.7         2        0
## 5  5  35          8     45    91330      4   1.0         2        0
## 6  6  37         13     29    92121      4   0.4         2      155
##   Personal.Loan Securities.Account CD.Account Online CreditCard
## 1             0                  1          0      0          0
## 2             0                  1          0      0          0
## 3             0                  0          0      0          0
## 4             0                  0          0      0          0
## 5             0                  0          0      0          1
## 6             0                  0          0      1          0

Dataset Description

ID : ID of the customer.

Age : Age of the customer in years.

Experience : Amount of work experience in years.

Income : Amount of annual income (in thousands).

ZIP.Code : Postal code of the city in which the client lives.

Family : Number of family members.

CCAvg : Average monthly spending with the credit card (in thousands).

Education : Education level
    1 : Bachelor's degree
    2 : Master's degree
    3 : Advanced/Professional degree.

Mortgage: Value of home mortgage, if any (in thousands).

Securities Account : Does the customer have a securities account with the bank? (1 = yes, 0 = no)

CD Account : Does the customer have a certificate of deposit (CD) account with the bank? (1 = yes, 0 = no)

Online : Does the customer use the internet banking facilities? (1 = yes, 0 = no)

CreditCard : Does the customer use a credit card issued by the bank? (1 = yes, 0 = no)

Personal Loan : Did this customer accept the personal loan offered in the last campaign? (1 = yes, 0 = no). This is our target variable.

Exploratory Analysis

str(Loan)
## 'data.frame':    5000 obs. of  14 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
##  $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...
plot_intro(Loan,
           ggtheme = theme_minimal(),
           title = "Missing Data",
           theme_config = theme(plot.title = element_text(color = "orange")),
           geom_label_args = c(hjust = "inward"))

# Extract duplicate rows (note the comma: we index rows, not columns):
Loan[duplicated(Loan), ]
##  [1] ID                 Age                Experience         Income            
##  [5] ZIP.Code           Family             CCAvg              Education         
##  [9] Mortgage           Personal.Loan      Securities.Account CD.Account        
## [13] Online             CreditCard        
## <0 rows> (or 0-length row.names)

After checking the dataset for missing or duplicate values, we observe that it is clean and complete, giving us a reliable foundation for further analysis and modeling.
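
The same conclusion can be confirmed numerically; a minimal sketch (output omitted):

sum(is.na(Loan))      # total count of missing values in the data frame
colSums(is.na(Loan))  # missing values per column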

plot_bar(Loan)

From the visualization above, we can see that the majority of the bank’s clients declined the offered personal loan, and only a small number accepted it.

We can also observe that most clients have neither a CD account nor a securities account, and that the majority prefer not to use the bank’s credit cards. However, a significant number of clients actively use the bank’s online banking services, so we can assume the bank offers a convenient and reliable online platform.

# Correlation of every column with the target (cor_tab avoids masking base::cor)
library(corrr)
cor_tab <- Loan %>%
  correlate() %>%
  focus(Personal.Loan) %>%
  arrange(desc(Personal.Loan))
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
knitr::kable(
  head(cor_tab), caption = "Correlation Table") %>%
  kable_styling("striped", full_width = F)
Correlation Table

term         Personal.Loan
Income           0.5024623
CCAvg            0.3668887
CD.Account       0.3163548
Mortgage         0.1420952
Education        0.1367216
Family           0.0613670

The correlation table above separates the stronger influences on personal loan decisions from the weaker ones.

Stronger Influences:

  • Income shows the strongest positive correlation with accepting a personal loan, followed by average credit card spending (CCAvg) and holding a certificate of deposit account; all three moderately influence opting for a personal loan.

Weaker Influences:

  • Education, mortgage, family size, securities account, online usage, and owning a credit card display weaker correlations with personal loan decisions.

Negative Correlations:

  • Age and Experience have weak negative correlations with the target (see the sketch below).
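
Note that head(cor_tab) above shows only the six strongest correlations; printing the full object (a sketch reusing cor_tab from above, output omitted) exposes the weak negative values for Age and Experience:

# Print the full table; the weakest (and negative) correlations appear last
cor_tab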

Outliers and Skewness

# View bivariate continuous distributions split by Personal.Loan
plot_boxplot(Loan, by = "Personal.Loan")

The boxplots reveal that the CCAvg, Mortgage, and Income columns exhibit a significant number of outliers.

plot_histogram(Loan)

We can observe that most of the columns are approximately normally distributed, except for Income, CCAvg, and Mortgage, which are right-skewed and contain a high number of outliers.

The Age and Experience columns track each other closely: higher age values occur with roughly the same frequency as higher experience values, and vice versa.

The right-skewed distribution in the ‘Income’ column indicates that the majority of clients within this bank tend to have relatively lower incomes.

CCAvg (Average Credit Card Spending) : The right-skewed distribution in the ‘CCAvg’ column implies that most clients exhibit lower average spending on credit cards, with zero being the most frequent value.

The right-skewed distribution of the ‘Mortgage’ column suggests that the majority of clients in this bank do not hold mortgage loans. The majority of the bank’s clients have bachelor’s degrees, and most clients seem to belong to households with only one family member.

Compute skewness:

library(moments)
## 
## Attaching package: 'moments'
## The following object is masked from 'package:lessR':
## 
##     kurtosis
skewness(Loan$Income, na.rm = TRUE)
## [1] 0.8410862
skewness(Loan$CCAvg, na.rm = TRUE)
## [1] 1.597964
skewness(Loan$Mortgage, na.rm = TRUE)
## [1] 2.103371
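
As a sanity check, moments::skewness() implements the standard sample skewness m3 / m2^(3/2); computing it by hand for Income should reproduce the value above (a sketch; x, m2, m3 are throwaway names):

x  <- Loan$Income
m2 <- mean((x - mean(x))^2)  # second central moment
m3 <- mean((x - mean(x))^3)  # third central moment
m3 / m2^(3/2)                # should match skewness(Loan$Income): 0.8410862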

Reduce Right Skewness

income       <- Loan$Income
log_income   <- log10(Loan$Income)
sqrt_income  <- sqrt(Loan$Income)
recip_income <- 1 / Loan$Income
inc_data <- data.frame(income, log_income, sqrt_income, recip_income)
plot_histogram(inc_data)

ccaverage       <- Loan$CCAvg
log_ccaverage   <- log10(Loan$CCAvg + 1)  # +1 offset: CCAvg contains zeros
sqrt_ccaverage  <- sqrt(Loan$CCAvg)
recip_ccaverage <- 1 / (Loan$CCAvg + 1)   # +1 offset avoids division by zero
ccaverage_data <- data.frame(ccaverage, log_ccaverage, sqrt_ccaverage, recip_ccaverage)
plot_histogram(ccaverage_data)

mortgage       <- Loan$Mortgage
log_mortgage   <- log10(Loan$Mortgage + 1)  # +1 offset: most mortgages are zero
sqrt_mortgage  <- sqrt(Loan$Mortgage)
recip_mortgage <- 1 / (Loan$Mortgage + 1)   # +1 offset avoids division by zero
mortgage_data <- data.frame(mortgage, log_mortgage, sqrt_mortgage, recip_mortgage)
plot_histogram(mortgage_data)

The square root transformation appears to be the most suitable method for the Income and CCAvg columns, while the reciprocal transformation yields better results for the Mortgage column.

Loan$Income <- sqrt_income
Loan$CCAvg <- sqrt_ccaverage
Loan$Mortgage <- recip_mortgage
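
Re-running moments::skewness() on the replaced columns (a sketch, output omitted) is a quick way to confirm that the transformations actually reduced the right skew:

skewness(Loan$Income)    # raw-scale value was 0.84
skewness(Loan$CCAvg)     # raw-scale value was 1.60
skewness(Loan$Mortgage)  # raw-scale value was 2.10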

Feature Scaling

We’ll apply the standardization method to scale our data, putting all features on comparable ranges to support the model’s performance and downstream processing.

library(data.table)
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:lessR':
## 
##     set
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following object is masked from 'package:purrr':
## 
##     transpose
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
# Shorten the two awkward column names, then standardize the numeric features
setnames(Loan, "Personal.Loan", "loan")
setnames(Loan, "CD.Account", "CDAccount")
df3 <- Loan %>% mutate_at(c('Age', 'Experience', 'Income', 'Family', 'CCAvg',
                            'Education', 'Mortgage'), ~(scale(.) %>% as.vector))
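
Under the hood, scale() standardizes each column as z = (x - mean(x)) / sd(x); a minimal sketch verifying the equivalence for one column (z_income is a throwaway name):

z_income <- (Loan$Income - mean(Loan$Income)) / sd(Loan$Income)
all.equal(z_income, df3$Income)  # should be TRUE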

Modeling

Based on the analysis and correlation results, the selected features for inclusion are Income, CCAvg, Mortgage, Family, Education, CreditCard, CDAccount, Age, and Experience; the remaining columns are dropped below.

# Drop ID and the weakly correlated ZIP.Code, Online, and Securities.Account
df <- subset(df3, select = -c(ID, ZIP.Code, Online, Securities.Account))

Logistic Regression

# Fix the RNG for reproducibility, then draw an 80/20 train/test split
RNGkind(sample.kind = "Rounding")
set.seed(100)

index_loan <- sample(nrow(df), size = nrow(df) * 0.8)
loan_train <- df[index_loan, ]
loan_test  <- df[-index_loan, ]
dftrain <- data.frame(loan_train)
dftest  <- data.frame(loan_test)
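
Because only a small share of clients accepted the loan, it is worth confirming that the 80/20 split preserves the class balance in both subsets; a minimal sketch (output omitted):

prop.table(table(dftrain$loan))  # proportion of 0s and 1s in the training set
prop.table(table(dftest$loan))   # proportion in the test set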

For our initial logistic regression model, we will focus on the Income variable, as it shows the strongest correlation with the target and appears to be a key factor in the loan decision.
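
As a reminder of what glm() with family = 'binomial' fits: the log-odds of acceptance are modeled as a linear function of the predictor and mapped to a probability by the logistic function. A minimal sketch using the coefficients reported below (sigmoid is a throwaway helper):

sigmoid <- function(z) 1 / (1 + exp(-z))  # p = 1 / (1 + e^(-z))
sigmoid(-3.81541 + 2.12680 * 1)  # acceptance probability at Income one SD above the mean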

# Training the model
logistic_model <- glm(loan ~ Income, family = 'binomial', data=dftrain)
# Checking the model
summary(logistic_model)
## 
## Call:
## glm(formula = loan ~ Income, family = "binomial", data = dftrain)
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -3.81541    0.13141  -29.04 <0.0000000000000002 ***
## Income       2.12680    0.09559   22.25 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2452.5  on 3999  degrees of freedom
## Residual deviance: 1552.3  on 3998  degrees of freedom
## AIC: 1556.3
## 
## Number of Fisher Scoring iterations: 7
my_prediction_tr1 <- predict(logistic_model, newdata = dftrain, type = "response")
table(dftrain$loan, my_prediction_tr1 > 0.5)
##    
##     FALSE TRUE
##   0  3530  103
##   1   257  110
accuracy = (3530 + 110) / (3530+103+257+110)
accuracy
## [1] 0.91
# Refitting the same model specification on the test data
logistic_model2 <- glm(loan ~ Income, family = 'binomial', data=dftest)
# Checking the model
summary(logistic_model2)
## 
## Call:
## glm(formula = loan ~ Income, family = "binomial", data = dftest)
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)  -3.7386     0.2592  -14.42 <0.0000000000000002 ***
## Income        2.3164     0.1918   12.08 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 705.48  on 999  degrees of freedom
## Residual deviance: 405.34  on 998  degrees of freedom
## AIC: 409.34
## 
## Number of Fisher Scoring iterations: 7
my_prediction_tr2 <- predict(logistic_model2, newdata = dftest, type = "response")
table(dftest$loan, my_prediction_tr2 > 0.5)
##    
##     FALSE TRUE
##   0   857   30
##   1    67   46
accuracy2 = (857 + 46) / (30+67+857 + 46)
accuracy2
## [1] 0.903

To evaluate the model’s accuracy, we have generated a confusion table for both the training and test data:

Train data: 91% accuracy

Test data: 90.3% accuracy
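
Note that logistic_model2 was refitted on the test data; a complementary check (a sketch, output omitted) applies the model trained on dftrain directly to the unseen dftest rows:

holdout_pred <- predict(logistic_model, newdata = dftest, type = "response")
table(dftest$loan, holdout_pred > 0.5)            # confusion table on unseen data
mean((holdout_pred > 0.5) == (dftest$loan == 1))  # held-out accuracy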

To further improve the model, we can incorporate additional variables and assess their impact on accuracy.

# Training the model
logistic_model3 <- glm(loan ~ Income+CCAvg+CDAccount, family = 'binomial', data=dftrain)
# Checking the model
summary(logistic_model3)
## 
## Call:
## glm(formula = loan ~ Income + CCAvg + CDAccount, family = "binomial", 
##     data = dftrain)
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept) -4.14251    0.14623 -28.329 < 0.0000000000000002 ***
## Income       2.00133    0.11017  18.166 < 0.0000000000000002 ***
## CCAvg        0.17433    0.06741   2.586              0.00971 ** 
## CDAccount    2.41364    0.19963  12.091 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2452.5  on 3999  degrees of freedom
## Residual deviance: 1388.5  on 3996  degrees of freedom
## AIC: 1396.5
## 
## Number of Fisher Scoring iterations: 7
my_prediction_tr3 <- predict(logistic_model3, newdata = dftrain, type = "response")
table(dftrain$loan, my_prediction_tr3 > 0.5)
##    
##     FALSE TRUE
##   0  3551   82
##   1   229  138
accuracy3 = (3551 + 138) / (3551 + 82 + 229 + 138)
accuracy3
## [1] 0.92225
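
Computing accuracy directly from the stored confusion table, rather than hard-coding the cell counts, avoids transcription slips; a minimal sketch (cm3 is a throwaway name):

cm3 <- table(dftrain$loan, my_prediction_tr3 > 0.5)
sum(diag(cm3)) / sum(cm3)  # same accuracy, no manual copying
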
# Refitting the model on the test data
logistic_model4 <- glm(loan ~ Income+CCAvg+CDAccount, family = 'binomial', data=dftest)
# Checking the model
summary(logistic_model4)
## 
## Call:
## glm(formula = loan ~ Income + CCAvg + CDAccount, family = "binomial", 
##     data = dftest)
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  -3.8392     0.2657 -14.449 < 0.0000000000000002 ***
## Income        2.1173     0.2098  10.094 < 0.0000000000000002 ***
## CCAvg         0.2037     0.1336   1.525             0.127354    
## CDAccount     1.5254     0.3971   3.842             0.000122 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 705.48  on 999  degrees of freedom
## Residual deviance: 387.08  on 996  degrees of freedom
## AIC: 395.08
## 
## Number of Fisher Scoring iterations: 7
my_prediction_tr4 <- predict(logistic_model4, newdata = dftest, type = "response")
table(dftest$loan, my_prediction_tr4 > 0.5)
##    
##     FALSE TRUE
##   0   859   28
##   1    58   55
accuracy4 = (859 + 55) / (859 + 28 + 58 + 55)
accuracy4
## [1] 0.914
Train data: 92.2%

Test data: 91.4%

We observe that incorporating additional variables into the model has led to an improvement in the accuracy of the test set. This highlights the value of including relevant predictors when building a predictive model to achieve better performance on unseen data.

Decision Tree

library(rpart)
# Grow the tree using the information (entropy) split criterion
dtree <- rpart(loan ~ Income + CCAvg + CDAccount, method = "class",
               data = dftrain, parms = list(split = "information"))
dtree$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.10490463      0 1.0000000 1.0000000 0.04974732
## 2 0.01089918      2 0.7901907 0.7956403 0.04482964
## 3 0.01000000      4 0.7683924 0.7929155 0.04475884
plotcp(dtree)
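
plotcp() visualizes the cross-validated error at each value of cp; the error-minimizing cp can also be pulled from the table programmatically (a sketch; best_cp is a throwaway name, and the post prunes with a manually chosen value below):

best_cp <- dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
best_cp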

dtree.pruned <- prune(dtree, cp=.02290076)  # snip off splits whose complexity falls below this cp
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")

dtree.pred <- predict(dtree.pruned, dftrain, type="class")
dtree.perf <- table(dftrain$loan, dtree.pred,
                    dnn=c("Actual", "Predicted"))
dtree.perf
##       Predicted
## Actual    0    1
##      0 3604   29
##      1  261  106
acc=(3604+106)/(3604+106+261+29)
acc
## [1] 0.9275

Having assessed the model’s accuracy on the training data with a confusion table, we now apply the same steps to the test data.

dtree_test <- rpart(loan ~ Income + CCAvg + CDAccount, method = "class",
                    data = dftest, parms = list(split = "information"))
dtree_test$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.07522124      0 1.0000000 1.0000000 0.08859773
## 2 0.05899705      2 0.8495575 0.9469027 0.08650459
## 3 0.01000000      5 0.6725664 0.8053097 0.08048671
plotcp(dtree_test)

dtree_test.pruned <- prune(dtree_test, cp=.01639344)  # snip off splits whose complexity falls below this cp
prp(dtree_test.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")

dtree_test.pred <- predict(dtree_test.pruned, dftest, type="class")
dtree_test.perf <- table(dftest$loan, dtree_test.pred,
                    dnn=c("Actual", "Predicted"))
dtree_test.perf
##       Predicted
## Actual   0   1
##      0 869  18
##      1  58  55
acc4=(869+55)/(869+55+18+58)
acc4
## [1] 0.924

Accuracy:

Train data: 92.75%

Test data: 92.4%
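
As with the logistic models, dtree_test was refitted on the test set; a complementary held-out check (a sketch, output omitted) applies the train-fitted pruned tree to dftest:

holdout_tree <- predict(dtree.pruned, dftest, type = "class")
mean(holdout_tree == dftest$loan)  # held-out accuracy of the train-fitted tree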

We can observe improved performance compared to the logistic regression models (test accuracy of 92.4% versus 91.4%), indicating a more accurate prediction of personal loan approval status.