library(GGally)
library(dplyr)
library(caret)
library(readr)
library(tidyverse)
library(class)
library(partykit)Column descriptions
*ID : Customer ID
*Age : Customer’s age in completed years
*Experience : Years of professional experience
*Income Annual : income of the customer ($000)
*ZIPCode : Home Address ZIP code.
*Family : Family size of the customer
*CCAvg : Avg. spending on credit cards per month ($000)
*Education : Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
*Mortgage : Value of house mortgage if any. ($000)
*Personal Loan : Did this customer accept the personal loan offered in the last campaign?
*Securities Account : Does the customer have a securities account with the bank?
*CD : Account Does the customer have a certificate of deposit (CD) account with the bank?
*Online : Does the customer use internet banking facilities?
*CreditCard : Does the customer uses a credit card issued by Universal Bank?
loans <- loan %>%
mutate(Personal.Loan = as.factor(Personal.Loan), Securities.Account = as.factor(Securities.Account),
CD.Account = as.factor(CD.Account), Online = as.factor(Online), Family = as.factor(Family),
CreditCard = as.factor(CreditCard), Education = as.factor(Education)) %>%
select(- c(ID, ZIP.Code))
head(loans)Age and experience have strong correlation. The graphic shows very high multicollinearity (1)
Because the correlation between Age and Experience is very strong, so I use feature engineering to make a new column from those two variables. Adding column performance that comes from Experience/Age
Identifying customers who purchase the Personal Loan based on dataset
#> Age Experience Income Family CCAvg
#> Min. :26.00 Min. : 0.00 Min. : 60.0 1:107 Min. : 0.000
#> 1st Qu.:35.00 1st Qu.: 9.00 1st Qu.:122.0 2:106 1st Qu.: 2.600
#> Median :45.00 Median :20.00 Median :142.5 3:133 Median : 3.800
#> Mean :45.07 Mean :19.84 Mean :144.7 4:134 Mean : 3.905
#> 3rd Qu.:55.00 3rd Qu.:30.00 3rd Qu.:172.0 3rd Qu.: 5.348
#> Max. :65.00 Max. :41.00 Max. :203.0 Max. :10.000
#> Education Mortgage Personal.Loan Securities.Account CD.Account
#> 1: 93 Min. : 0.0 0: 0 0:420 0:340
#> 2:182 1st Qu.: 0.0 1:480 1: 60 1:140
#> 3:205 Median : 0.0
#> Mean :100.8
#> 3rd Qu.:192.5
#> Max. :617.0
#> Online CreditCard
#> 0:189 0:337
#> 1:291 1:143
#>
#>
#>
#>
Customer’s income distribution that purchase the personal loan.
Checking the proportion
#>
#> 0 1
#> 0.904 0.096
set.seed(90)
idx <- sample(nrow(loans_ed), nrow(loans_ed)*0.8)
loans_train <- loans_ed[idx,]
loans_test <- loans_ed[-idx,]
logistic.train_label <- loans_ed[idx, "Personal.Loan"]
logistic.test_label <- loans_ed[-idx, "Personal.Loan"]Checking the proportion for data train and data test
#>
#> 0 1
#> 0.90525 0.09475
#>
#> 0 1
#> 0.899 0.101
loans_none <- glm(formula = Personal.Loan ~ 1, data = loans_train, family = "binomial")
loans_all <- glm(formula = Personal.Loan ~ ., data = loans_train, family = "binomial")#> Start: AIC=2509.14
#> Personal.Loan ~ 1
#>
#> Df Deviance AIC
#> + Income 1 1591.8 1595.8
#> + CCAvg 1 2100.9 2104.9
#> + CD.Account 1 2297.4 2301.4
#> + Education 2 2423.4 2429.4
#> + Mortgage 1 2430.5 2434.5
#> + Family 3 2483.8 2491.8
#> <none> 2507.1 2509.1
#> + Securities.Account 1 2506.4 2510.4
#> + Online 1 2506.8 2510.8
#> + performance 1 2507.1 2511.1
#> + CreditCard 1 2507.1 2511.1
#>
#> Step: AIC=1595.81
#> Personal.Loan ~ Income
#>
#> Df Deviance AIC
#> + Education 2 1143.3 1151.3
#> + Family 3 1367.3 1377.3
#> + CD.Account 1 1468.3 1474.3
#> + CCAvg 1 1586.8 1592.8
#> + Mortgage 1 1587.0 1593.0
#> + Securities.Account 1 1589.3 1595.3
#> <none> 1591.8 1595.8
#> + CreditCard 1 1591.6 1597.6
#> + performance 1 1591.8 1597.8
#> + Online 1 1591.8 1597.8
#>
#> Step: AIC=1151.32
#> Personal.Loan ~ Income + Education
#>
#> Df Deviance AIC
#> + Family 3 1040.9 1054.9
#> + CD.Account 1 1059.0 1069.0
#> + CCAvg 1 1130.0 1140.0
#> + Mortgage 1 1137.8 1147.8
#> <none> 1143.3 1151.3
#> + Securities.Account 1 1141.9 1151.9
#> + performance 1 1142.6 1152.6
#> + CreditCard 1 1143.2 1153.2
#> + Online 1 1143.3 1153.3
#>
#> Step: AIC=1054.89
#> Personal.Loan ~ Income + Education + Family
#>
#> Df Deviance AIC
#> + CD.Account 1 966.2 982.2
#> + CCAvg 1 1021.8 1037.8
#> + Mortgage 1 1036.8 1052.8
#> + Securities.Account 1 1038.8 1054.8
#> <none> 1040.9 1054.9
#> + performance 1 1039.8 1055.8
#> + Online 1 1040.8 1056.8
#> + CreditCard 1 1040.8 1056.8
#>
#> Step: AIC=982.2
#> Personal.Loan ~ Income + Education + Family + CD.Account
#>
#> Df Deviance AIC
#> + CCAvg 1 951.50 969.50
#> + CreditCard 1 956.54 974.54
#> + Online 1 960.35 978.35
#> + Securities.Account 1 960.36 978.36
#> + Mortgage 1 963.48 981.48
#> <none> 966.20 982.20
#> + performance 1 965.77 983.77
#>
#> Step: AIC=969.5
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg
#>
#> Df Deviance AIC
#> + CreditCard 1 942.55 962.55
#> + Securities.Account 1 944.97 964.97
#> + Online 1 945.30 965.30
#> + Mortgage 1 947.53 967.53
#> <none> 951.50 969.50
#> + performance 1 950.46 970.46
#>
#> Step: AIC=962.55
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg +
#> CreditCard
#>
#> Df Deviance AIC
#> + Securities.Account 1 933.57 955.57
#> + Online 1 933.90 955.90
#> + Mortgage 1 938.97 960.97
#> <none> 942.55 962.55
#> + performance 1 941.48 963.48
#>
#> Step: AIC=955.57
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg +
#> CreditCard + Securities.Account
#>
#> Df Deviance AIC
#> + Online 1 923.33 947.33
#> + Mortgage 1 930.14 954.14
#> <none> 933.57 955.57
#> + performance 1 932.60 956.60
#>
#> Step: AIC=947.33
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg +
#> CreditCard + Securities.Account + Online
#>
#> Df Deviance AIC
#> + Mortgage 1 920.21 946.21
#> <none> 923.33 947.33
#> + performance 1 922.12 948.12
#>
#> Step: AIC=946.21
#> Personal.Loan ~ Income + Education + Family + CD.Account + CCAvg +
#> CreditCard + Securities.Account + Online + Mortgage
#>
#> Df Deviance AIC
#> <none> 920.21 946.21
#> + performance 1 919.00 947.00
#>
#> Call: glm(formula = Personal.Loan ~ Income + Education + Family + CD.Account +
#> CCAvg + CreditCard + Securities.Account + Online + Mortgage,
#> family = "binomial", data = loans_train)
#>
#> Coefficients:
#> (Intercept) Income Education2
#> -12.663684 0.063204 3.795500
#> Education3 Family2 Family3
#> 3.968236 -0.416268 1.816834
#> Family4 CD.Account1 CCAvg
#> 1.572921 3.463408 0.204684
#> CreditCard1 Securities.Account1 Online1
#> -0.840227 -1.051212 -0.588341
#> Mortgage
#> 0.001168
#>
#> Degrees of Freedom: 3999 Total (i.e. Null); 3987 Residual
#> Null Deviance: 2507
#> Residual Deviance: 920.2 AIC: 946.2
model_logistic <- glm(formula = Personal.Loan ~ Income + Education + CD.Account +
Family + CreditCard + Online + Securities.Account + CCAvg,
family = "binomial", data = loans_train)
model_logistic %>%
summary()#>
#> Call:
#> glm(formula = Personal.Loan ~ Income + Education + CD.Account +
#> Family + CreditCard + Online + Securities.Account + CCAvg,
#> family = "binomial", data = loans_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1358 -0.1922 -0.0652 -0.0191 4.1434
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -12.60226 0.61275 -20.567 < 2e-16 ***
#> Income 0.06382 0.00344 18.552 < 2e-16 ***
#> Education2 3.76292 0.29882 12.593 < 2e-16 ***
#> Education3 3.92711 0.29842 13.160 < 2e-16 ***
#> CD.Account1 3.50698 0.38136 9.196 < 2e-16 ***
#> Family2 -0.39377 0.25587 -1.539 0.123810
#> Family3 1.83818 0.27320 6.728 1.72e-11 ***
#> Family4 1.59724 0.26070 6.127 8.96e-10 ***
#> CreditCard1 -0.85130 0.23317 -3.651 0.000261 ***
#> Online1 -0.59567 0.18702 -3.185 0.001447 **
#> Securities.Account1 -1.05852 0.34607 -3.059 0.002223 **
#> CCAvg 0.19439 0.05042 3.855 0.000116 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 2507.14 on 3999 degrees of freedom
#> Residual deviance: 923.33 on 3988 degrees of freedom
#> AIC: 947.33
#>
#> Number of Fisher Scoring iterations: 8
In this model, I set the threshold 0.5, so if the probability is higher than 0.5, that means the customer apply for the personal loan
pred <- predict(model_logistic, loans_test, type = "response")
pred_label <- as.factor(if_else(pred < 0.5, 0, 1))#> pred_label
#> 0 1
#> 910 90
hist(predict(model_logistic, loans_test, type = "response"), breaks = 100,
xlab = "Logistic Model Probability")Scaling data train and data test
train_knn <- loans_train %>%
select_if(is.numeric) %>%
scale
test_knn <- loans_test %>%
select_if(is.numeric) %>%
scale(center = attr(train_knn, "scaled:center"),
scale = attr(train_knn, "scaled:scale"))Determine k value
#> [1] 63.24555
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 883 27
#> 1 16 74
#>
#> Accuracy : 0.957
#> 95% CI : (0.9425, 0.9687)
#> No Information Rate : 0.899
#> P-Value [Acc > NIR] : 9.374e-12
#>
#> Kappa : 0.7512
#>
#> Mcnemar's Test P-Value : 0.1273
#>
#> Sensitivity : 0.7327
#> Specificity : 0.9822
#> Pos Pred Value : 0.8222
#> Neg Pred Value : 0.9703
#> Prevalence : 0.1010
#> Detection Rate : 0.0740
#> Detection Prevalence : 0.0900
#> Balanced Accuracy : 0.8574
#>
#> 'Positive' Class : 1
#>
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 887 91
#> 1 12 10
#>
#> Accuracy : 0.897
#> 95% CI : (0.8765, 0.9151)
#> No Information Rate : 0.899
#> P-Value [Acc > NIR] : 0.6085
#>
#> Kappa : 0.1312
#>
#> Mcnemar's Test P-Value : 1.523e-14
#>
#> Sensitivity : 0.09901
#> Specificity : 0.98665
#> Pos Pred Value : 0.45455
#> Neg Pred Value : 0.90695
#> Prevalence : 0.10100
#> Detection Rate : 0.01000
#> Detection Prevalence : 0.02200
#> Balanced Accuracy : 0.54283
#>
#> 'Positive' Class : 1
#>
library(ROCR)
pred_prob <- predict(object = model_logistic, newdata = loans_test, type = "response")
pred.logistic <- prediction(pred_prob, labels = logistic.test_label)
perf <- performance(prediction.obj = pred.logistic, measure = "tpr", "fpr")
plot(perf)#> [1] 0.9755724
AUC value is good because the value is 0.97, close to 1.
From logistic regression and k-NN model, I got the accuracy, recall, and precision for both models. Because the Bank want to increase the Personal.Loan’s customer but with minimal budget, so I focus on precision. In the confusion matrix. As the result, the logistic regression predicts better than k-NN model. Logistic regression has 95.7% accuracy and 82.2% precision while k-NN has 89.7% accuracy and 45.4% precision. I choose higher precision to evaluate the prediction of Personal.Loan. It is so unfortunate if I predict someone will purchase loan but in real case they don’t. Another useful insight is when I predict someone don’t purchase the loan but in real case they purchase it. From that it can reduce marketing cost.