Question:

what combination of parameters makes a customer more likely to accept a personal loan?

# Load dataset
Bank_data <- read.csv("/Users/SMEBTHEREAL/Desktop/UniversalBank.csv")
# Explore structure of data
str(Bank_data)

## 'data.frame':    5000 obs. of  15 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
##  $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Validation        : int  0 1 1 0 0 0 1 1 0 0 ...

# Change to appropriate data type
Bank_data$Personal.Loan<-as.factor(Bank_data$Personal.Loan)
Bank_data$ZIP.Code<-as.factor(Bank_data$ZIP.Code)
Bank_data$Family<-as.factor(Bank_data$Family)
Bank_data$Education<-as.factor(Bank_data$Education)
Bank_data$Online<-as.factor(Bank_data$Online)
Bank_data$CD.Account<-as.factor(Bank_data$CD.Account)
Bank_data$CreditCard<-as.factor(Bank_data$CreditCard)
Bank_data$Validation<-as.factor(Bank_data$Validation)
Bank_data$Securities.Account<-as.factor(Bank_data$Securities.Account)
# Check missing value
sum(is.na(Bank_data))

## [1] 0

# Simple "smell test" and remove outliers
summary(Bank_data)

##        ID       Personal.Loan      Age          Experience  
##  Min.   :   1   0:4520        Min.   :23.00   Min.   :-3.0  
##  1st Qu.:1251   1: 480        1st Qu.:35.00   1st Qu.:10.0  
##  Median :2500                 Median :45.00   Median :20.0  
##  Mean   :2500                 Mean   :45.34   Mean   :20.1  
##  3rd Qu.:3750                 3rd Qu.:55.00   3rd Qu.:30.0  
##  Max.   :5000                 Max.   :67.00   Max.   :43.0  
##                                                             
##      Income          ZIP.Code    Family       CCAvg        Education
##  Min.   :  8.00   94720  : 169   1:1472   Min.   : 0.000   1:2096   
##  1st Qu.: 39.00   94305  : 127   2:1296   1st Qu.: 0.700   2:1403   
##  Median : 64.00   95616  : 116   3:1010   Median : 1.500   3:1501   
##  Mean   : 73.77   90095  :  71   4:1222   Mean   : 1.938            
##  3rd Qu.: 98.00   93106  :  57            3rd Qu.: 2.500            
##  Max.   :224.00   92037  :  54            Max.   :10.000            
##                   (Other):4406                                      
##     Mortgage     Securities.Account CD.Account Online   CreditCard
##  Min.   :  0.0   0:4478             0:4698     0:2016   0:3530    
##  1st Qu.:  0.0   1: 522             1: 302     1:2984   1:1470    
##  Median :  0.0                                                    
##  Mean   : 56.5                                                    
##  3rd Qu.:101.0                                                    
##  Max.   :635.0                                                    
##                                                                   
##  Validation
##  0:3000    
##  1:2000    
##            
##            
##            
##            
##

Bank_data<-subset(Bank_data,Experience>=0)

Now, we have zero missing value and we could start analysing our dataset.

# Conduct EDA
summary(Bank_data)

##        ID       Personal.Loan      Age          Experience   
##  Min.   :   1   0:4468        Min.   :24.00   Min.   : 0.00  
##  1st Qu.:1255   1: 480        1st Qu.:36.00   1st Qu.:10.75  
##  Median :2498                 Median :46.00   Median :20.00  
##  Mean   :2501                 Mean   :45.56   Mean   :20.33  
##  3rd Qu.:3750                 3rd Qu.:55.00   3rd Qu.:30.00  
##  Max.   :5000                 Max.   :67.00   Max.   :43.00  
##                                                              
##      Income          ZIP.Code    Family       CCAvg        Education
##  Min.   :  8.00   94720  : 164   1:1470   Min.   : 0.000   1:2080   
##  1st Qu.: 39.00   94305  : 125   2:1274   1st Qu.: 0.700   2:1387   
##  Median : 64.00   95616  : 115   3:1001   Median : 1.500   3:1481   
##  Mean   : 73.81   90095  :  71   4:1203   Mean   : 1.936            
##  3rd Qu.: 98.00   93106  :  56            3rd Qu.: 2.600            
##  Max.   :224.00   92037  :  54            Max.   :10.000            
##                   (Other):4363                                      
##     Mortgage      Securities.Account CD.Account Online   CreditCard
##  Min.   :  0.00   0:4432             0:4646     0:1994   0:3493    
##  1st Qu.:  0.00   1: 516             1: 302     1:2954   1:1455    
##  Median :  0.00                                                    
##  Mean   : 56.63                                                    
##  3rd Qu.:101.00                                                    
##  Max.   :635.00                                                    
##                                                                    
##  Validation
##  0:2967    
##  1:1981    
##            
##            
##            
##            
##

plot(table(Bank_data$Family,Bank_data$Personal.Loan),cex=1.5)

plot(table(Bank_data$Education,Bank_data$Personal.Loan),cex=1.5)

plot(table(Bank_data$Securities.Account,Bank_data$Personal.Loan),cex=1.5)

plot(table(Bank_data$CD.Account,Bank_data$Personal.Loan),cex=1.5)

plot(table(Bank_data$Online,Bank_data$Personal.Loan),cex=1.5)

plot(table(Bank_data$CreditCard,Bank_data$Personal.Loan),cex=1.5)

library(ggplot2)

ggplot(Bank_data,aes(Age))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Bank_data,aes(Income))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Bank_data,aes(Experience))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Bank_data,aes(CCAvg))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Bank_data,aes(Mortgage))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Prepare for modeling
train<-subset(Bank_data,Validation=="0")
test<-subset(Bank_data,Validation=="1")
train<-subset(train,select=c(-ID,-Validation,-ZIP.Code))
test<-subset(test,select=c(-ID,-Validation,-ZIP.Code))

At this stage, we created test and train set. In addition, we found that there are some unrelated and useless variables, so we decided to drop it.

# Fit a logistic regression model and show the fit details
model<-glm(formula=Personal.Loan~.,data=train,family=binomial(link='logit'))
summary(model)

## 
## Call:
## glm(formula = Personal.Loan ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9463  -0.1562  -0.0489  -0.0125   4.3055  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.344e+01  2.515e+00  -5.343 9.13e-08 ***
## Age                 -3.261e-02  9.258e-02  -0.352 0.724663    
## Experience           3.973e-02  9.177e-02   0.433 0.665083    
## Income               6.992e-02  4.595e-03  15.216  < 2e-16 ***
## Family2             -1.298e-01  3.176e-01  -0.409 0.682861    
## Family3              2.135e+00  3.416e-01   6.249 4.14e-10 ***
## Family4              1.524e+00  3.220e-01   4.732 2.22e-06 ***
## CCAvg                2.087e-01  6.493e-02   3.215 0.001306 ** 
## Education2           4.597e+00  4.022e-01  11.427  < 2e-16 ***
## Education3           4.671e+00  4.057e-01  11.512  < 2e-16 ***
## Mortgage             1.874e-03  8.241e-04   2.274 0.022949 *  
## Securities.Account1 -9.319e-01  4.257e-01  -2.189 0.028591 *  
## CD.Account1          3.387e+00  4.590e-01   7.379 1.59e-13 ***
## Online1             -8.022e-01  2.327e-01  -3.447 0.000567 ***
## CreditCard1         -8.532e-01  2.895e-01  -2.947 0.003205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1881.59  on 2966  degrees of freedom
## Residual deviance:  621.59  on 2952  degrees of freedom
## AIC: 651.59
## 
## Number of Fisher Scoring iterations: 8

Based on the fit details, all predictors except Age and Experience are to some degree statistically significant. We can be resonably sure that those significant variables do tell something about the likelihood that one accepts the loan.

More specifically, higher possibility of accepting the loan is for one whose income and mortgage are higher, whose education level is above undergraduate, who has one or more children, who purchases by credit card more, who has a CD account and who doesn’t have a security account, online banking account or credit card.

# Calculate the prediction accuracy
model.fitted<-predict(model,test,type='response')
model.fitted<-ifelse(model.fitted>0.5,1,0)
miserror<-mean(model.fitted!=test$Personal.Loan)
print(paste('Accuracy',1-miserror))

## [1] "Accuracy 0.95557799091368"

# Create the ROC curve
library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

p<-predict(model,test,type="response")
pr<-prediction(p,test$Personal.Loan)
prf<-performance(pr,measure="tpr",x.measure="fpr")
plot(prf)
abline(a=0,b=1)

# Create Pseudo R-square
library(pscl)

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

pR2(model)

##          llh      llhNull           G2     McFadden         r2ML 
## -310.7947360 -940.7941689 1259.9988657    0.6696464    0.3460151 
##         r2CU 
##    0.7367893

Both the accuracy and the ROC curve show that our model performs well in predicting the acceptance of the customers to the loan. The former shows that the model captures 95.6% of the customers’ behaviors. The latter further shows that the model truly does well in distinguishing the two categories of the customers’ response to the loan.

Base on the Pseudo R-square result, we have 0.6690253 for McFadden’s R-square, with values closer to 1 indicating that the model has more predictive power.

Final Project

Yasaige Chen

6/26/2018

Question: