what combination of parameters makes a customer more likely to accept a personal loan?
# Load dataset
Bank_data <- read.csv("/Users/SMEBTHEREAL/Desktop/UniversalBank.csv")
# Explore structure of data
str(Bank_data)
## 'data.frame': 5000 obs. of 15 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Personal.Loan : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
## $ Family : int 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Securities.Account: int 1 1 0 0 0 0 0 0 0 0 ...
## $ CD.Account : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Online : int 0 0 0 0 0 1 1 0 1 0 ...
## $ CreditCard : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Validation : int 0 1 1 0 0 0 1 1 0 0 ...
# Change to appropriate data type
Bank_data$Personal.Loan<-as.factor(Bank_data$Personal.Loan)
Bank_data$ZIP.Code<-as.factor(Bank_data$ZIP.Code)
Bank_data$Family<-as.factor(Bank_data$Family)
Bank_data$Education<-as.factor(Bank_data$Education)
Bank_data$Online<-as.factor(Bank_data$Online)
Bank_data$CD.Account<-as.factor(Bank_data$CD.Account)
Bank_data$CreditCard<-as.factor(Bank_data$CreditCard)
Bank_data$Validation<-as.factor(Bank_data$Validation)
Bank_data$Securities.Account<-as.factor(Bank_data$Securities.Account)
# Check missing value
sum(is.na(Bank_data))
## [1] 0
# Simple "smell test" and remove outliers
summary(Bank_data)
## ID Personal.Loan Age Experience
## Min. : 1 0:4520 Min. :23.00 Min. :-3.0
## 1st Qu.:1251 1: 480 1st Qu.:35.00 1st Qu.:10.0
## Median :2500 Median :45.00 Median :20.0
## Mean :2500 Mean :45.34 Mean :20.1
## 3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.0
## Max. :5000 Max. :67.00 Max. :43.0
##
## Income ZIP.Code Family CCAvg Education
## Min. : 8.00 94720 : 169 1:1472 Min. : 0.000 1:2096
## 1st Qu.: 39.00 94305 : 127 2:1296 1st Qu.: 0.700 2:1403
## Median : 64.00 95616 : 116 3:1010 Median : 1.500 3:1501
## Mean : 73.77 90095 : 71 4:1222 Mean : 1.938
## 3rd Qu.: 98.00 93106 : 57 3rd Qu.: 2.500
## Max. :224.00 92037 : 54 Max. :10.000
## (Other):4406
## Mortgage Securities.Account CD.Account Online CreditCard
## Min. : 0.0 0:4478 0:4698 0:2016 0:3530
## 1st Qu.: 0.0 1: 522 1: 302 1:2984 1:1470
## Median : 0.0
## Mean : 56.5
## 3rd Qu.:101.0
## Max. :635.0
##
## Validation
## 0:3000
## 1:2000
##
##
##
##
##
Bank_data<-subset(Bank_data,Experience>=0)
Now, we have zero missing value and we could start analysing our dataset.
# Conduct EDA
summary(Bank_data)
## ID Personal.Loan Age Experience
## Min. : 1 0:4468 Min. :24.00 Min. : 0.00
## 1st Qu.:1255 1: 480 1st Qu.:36.00 1st Qu.:10.75
## Median :2498 Median :46.00 Median :20.00
## Mean :2501 Mean :45.56 Mean :20.33
## 3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.00
## Max. :5000 Max. :67.00 Max. :43.00
##
## Income ZIP.Code Family CCAvg Education
## Min. : 8.00 94720 : 164 1:1470 Min. : 0.000 1:2080
## 1st Qu.: 39.00 94305 : 125 2:1274 1st Qu.: 0.700 2:1387
## Median : 64.00 95616 : 115 3:1001 Median : 1.500 3:1481
## Mean : 73.81 90095 : 71 4:1203 Mean : 1.936
## 3rd Qu.: 98.00 93106 : 56 3rd Qu.: 2.600
## Max. :224.00 92037 : 54 Max. :10.000
## (Other):4363
## Mortgage Securities.Account CD.Account Online CreditCard
## Min. : 0.00 0:4432 0:4646 0:1994 0:3493
## 1st Qu.: 0.00 1: 516 1: 302 1:2954 1:1455
## Median : 0.00
## Mean : 56.63
## 3rd Qu.:101.00
## Max. :635.00
##
## Validation
## 0:2967
## 1:1981
##
##
##
##
##
plot(table(Bank_data$Family,Bank_data$Personal.Loan),cex=1.5)
plot(table(Bank_data$Education,Bank_data$Personal.Loan),cex=1.5)
plot(table(Bank_data$Securities.Account,Bank_data$Personal.Loan),cex=1.5)
plot(table(Bank_data$CD.Account,Bank_data$Personal.Loan),cex=1.5)
plot(table(Bank_data$Online,Bank_data$Personal.Loan),cex=1.5)
plot(table(Bank_data$CreditCard,Bank_data$Personal.Loan),cex=1.5)
library(ggplot2)
ggplot(Bank_data,aes(Age))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Bank_data,aes(Income))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Bank_data,aes(Experience))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Bank_data,aes(CCAvg))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Bank_data,aes(Mortgage))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Prepare for modeling
train<-subset(Bank_data,Validation=="0")
test<-subset(Bank_data,Validation=="1")
train<-subset(train,select=c(-ID,-Validation,-ZIP.Code))
test<-subset(test,select=c(-ID,-Validation,-ZIP.Code))
At this stage, we created test and train set. In addition, we found that there are some unrelated and useless variables, so we decided to drop it.
# Fit a logistic regression model and show the fit details
model<-glm(formula=Personal.Loan~.,data=train,family=binomial(link='logit'))
summary(model)
##
## Call:
## glm(formula = Personal.Loan ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9463 -0.1562 -0.0489 -0.0125 4.3055
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.344e+01 2.515e+00 -5.343 9.13e-08 ***
## Age -3.261e-02 9.258e-02 -0.352 0.724663
## Experience 3.973e-02 9.177e-02 0.433 0.665083
## Income 6.992e-02 4.595e-03 15.216 < 2e-16 ***
## Family2 -1.298e-01 3.176e-01 -0.409 0.682861
## Family3 2.135e+00 3.416e-01 6.249 4.14e-10 ***
## Family4 1.524e+00 3.220e-01 4.732 2.22e-06 ***
## CCAvg 2.087e-01 6.493e-02 3.215 0.001306 **
## Education2 4.597e+00 4.022e-01 11.427 < 2e-16 ***
## Education3 4.671e+00 4.057e-01 11.512 < 2e-16 ***
## Mortgage 1.874e-03 8.241e-04 2.274 0.022949 *
## Securities.Account1 -9.319e-01 4.257e-01 -2.189 0.028591 *
## CD.Account1 3.387e+00 4.590e-01 7.379 1.59e-13 ***
## Online1 -8.022e-01 2.327e-01 -3.447 0.000567 ***
## CreditCard1 -8.532e-01 2.895e-01 -2.947 0.003205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1881.59 on 2966 degrees of freedom
## Residual deviance: 621.59 on 2952 degrees of freedom
## AIC: 651.59
##
## Number of Fisher Scoring iterations: 8
Based on the fit details, all predictors except Age and Experience are to some degree statistically significant. We can be resonably sure that those significant variables do tell something about the likelihood that one accepts the loan.
More specifically, higher possibility of accepting the loan is for one whose income and mortgage are higher, whose education level is above undergraduate, who has one or more children, who purchases by credit card more, who has a CD account and who doesn’t have a security account, online banking account or credit card.
# Calculate the prediction accuracy
model.fitted<-predict(model,test,type='response')
model.fitted<-ifelse(model.fitted>0.5,1,0)
miserror<-mean(model.fitted!=test$Personal.Loan)
print(paste('Accuracy',1-miserror))
## [1] "Accuracy 0.95557799091368"
# Create the ROC curve
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p<-predict(model,test,type="response")
pr<-prediction(p,test$Personal.Loan)
prf<-performance(pr,measure="tpr",x.measure="fpr")
plot(prf)
abline(a=0,b=1)
# Create Pseudo R-square
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model)
## llh llhNull G2 McFadden r2ML
## -310.7947360 -940.7941689 1259.9988657 0.6696464 0.3460151
## r2CU
## 0.7367893
Both the accuracy and the ROC curve show that our model performs well in predicting the acceptance of the customers to the loan. The former shows that the model captures 95.6% of the customers’ behaviors. The latter further shows that the model truly does well in distinguishing the two categories of the customers’ response to the loan.
Base on the Pseudo R-square result, we have 0.6690253 for McFadden’s R-square, with values closer to 1 indicating that the model has more predictive power.