1. Introduction

A bank has developed a new personal pension product (PEP: Personal Equity Plan) and wants to open as many accounts as possible among its existing customers. In other words, the goal is to grow revenue and operating profit by winning new enrollments through predicting which customers will buy the product. Rather than marketing at random, recommending the product to the people most likely to buy it should make the marketing (sales) effort noticeably more efficient. This report develops a system that uses machine learning to predict customers' purchase intent and drive recommendations.

1.1. Intelligent Targeting

- What is Marketing?

“The process by which companies create value for customers and build strong customer relationships in order to capture value from customers in return.” - Kotler and Armstrong (2010)

-> In my view, marketing is a process of value exchange: shaping how the value created by the company and the value created by the customer get traded. Put more simply, it is about eliciting customer action. That short summary hides countless steps and a great deal of effort. In this exchange the company is the acting party throughout; marketing must be done from the customer's point of view, but it is the company, as the principal actor, that does the work.

- The 4P’s:

A marketing campaign centers on customer needs and overall satisfaction. Nevertheless, a number of variables determine whether a campaign succeeds, and several of them should be weighed when running one.

  1. Segment of the Population - "Which part of the population is the campaign aimed at, and why?" This aspect is critical because it tells us which segment of the population is most likely to respond to the message.

  2. Distribution channel to reach the customer's place - To get the most out of the campaign, the most effective strategy must be implemented: which segment of the population are we addressing, and which channel should deliver the company's message (e.g., phone, radio, TV, social media)?

  3. Price - What is the best price to offer prospective customers? (In the bank's case this is less of a concern, since its main interest is getting prospects to open deposit accounts that keep the bank's operating activities going.)

  4. Promotional Strategy - How the strategy is implemented and how prospects will respond. This should be the last part of the campaign analysis, since it calls for a thorough review of previous campaigns (where available) in order to learn from earlier mistakes and decide how to make the next campaign far more effective.

1.2. Aim

1) Data Analysis

- Analyze customer characteristics from the bank's customer data
    - Develop customer profiles
    - Improve the efficiency of direct-mail advertising
    - Raise response rates through targeted mailing

2) Data Analytics

- Predict whether a customer will enroll in the personal pension product (Yes / No) from customer attributes such as age, sex, region, marital status, and number of children, together with marketing history
    - Predict enrollment based on data from customer groups whose past histories resemble the customer in question

1.3. Data Analysis Process

1) Extract a sample of customers from the existing customer DB for a trial mailing (a minimal sketch follows below)
2) Send mails offering the new financial product (PEP)
3) Record each customer's response
4) Analyze the campaign results in R
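
A minimal sketch of step 1, assuming a hypothetical customer_db data frame with one row per customer (the ID format and the sample size of 500 are illustrative, not part of the dataset):

# Step 1 sketch: draw a random sample of customers for the trial mailing.
# `customer_db` and the sample size are hypothetical.
set.seed(42)
customer_db <- data.frame(id = sprintf("ID%05d", 1:10000))
mail_sample <- customer_db[sample(nrow(customer_db), 500), , drop = FALSE]
head(mail_sample)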

1.4. Data Description

- Data Files :

  • pepTrainSet.csv : training data
  • pepTestSet.csv : validation (test) data
  • pepNewCustomers.csv : new-customer data

- Data Field :

  • age : age
  • sex : sex
  • region : region of residence
  • income : income
  • married : marital status
  • children : number of children
  • car : car ownership
  • save_act : savings account held
  • current_act : current (checking) account held
  • mortgage : mortgage held
  • pep : PEP purchased (the target variable)

2. Preprocessing

2.1. Import library & Load data

- Import Libraries :

suppressPackageStartupMessages({
  # data handling
  library(data.table)
  library(dplyr)
  library(tidyr)
  library(lubridate)
  library(readr)
  library(sqldf)

  # display & plotting
  library(DT)
  library(ggplot2)
  library(corrplot)
  library(ggthemes)
  library(RColorBrewer)

  # modeling & evaluation
  library(caret)
  library(ROCR)
  library(C50)
  library(e1071)
  library(rpart)
  library(rpart.plot)
  library(rattle)
  library(randomForest)
})

# rm(list=ls())

fillColor = "#FFA07A"
fillColor2 = "#F1C40F"

- Load Data:

# Helper for files created on Windows that can come out garbled on macOS:
# guess the encoding and separator, then delegate to read.table.
# (guess_encoding comes from the readr package loaded above)

read.any <- function(text, sep = "", ...) {
  encoding <- as.character(guess_encoding(text)[1, 1]) # most likely encoding
  setting <- as.character(tools::file_ext(text))       # file extension

  # fall back to a custom separator for anything that is not a plain csv/txt
  if (sep != "" | !(setting %in% c("csv", "txt"))) setting <- "custom"

  separate <- list(csv = ",", txt = "\n", custom = sep)

  result <- read.table(text, sep = separate[[setting]], fileEncoding = encoding, ...)
  return(result)
}

train <- read.any('./input/pepTrainSet.csv', header=T, stringsAsFactors = F)
train <- subset(train, select=-c(id))
test <- read.any('./input/pepTestSet.csv', header=T, stringsAsFactors = F)
newd <- read.any('./input/pepNewCustomers.csv', header=T, stringsAsFactors = F)

train$pep <- factor(train$pep)
test$pep <- factor(test$pep)
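
pep is converted to a factor up front so that downstream functions (glm, C5.0, confusionMatrix) treat it as a two-class label rather than a character vector.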

2.2. Peek into the Data

datatable(train, style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))

2.3. Overview the Data

str(train)
## 'data.frame':    300 obs. of  11 variables:
##  $ age        : int  48 40 51 23 57 57 22 58 37 54 ...
##  $ sex        : chr  "FEMALE" "MALE" "FEMALE" "FEMALE" ...
##  $ region     : chr  "INNER_CITY" "TOWN" "INNER_CITY" "TOWN" ...
##  $ income     : num  17546 30085 16575 20375 50576 ...
##  $ married    : chr  "NO" "YES" "YES" "YES" ...
##  $ children   : int  1 3 0 3 0 2 0 0 2 2 ...
##  $ car        : chr  "NO" "YES" "YES" "NO" ...
##  $ save_act   : chr  "NO" "NO" "YES" "NO" ...
##  $ current_act: chr  "NO" "YES" "YES" "YES" ...
##  $ mortgage   : chr  "NO" "YES" "NO" "NO" ...
##  $ pep        : Factor w/ 2 levels "NO","YES": 2 1 1 1 1 2 2 1 1 1 ...
head(train)
##   age    sex     region  income married children car save_act current_act
## 1  48 FEMALE INNER_CITY 17546.0      NO        1  NO       NO          NO
## 2  40   MALE       TOWN 30085.1     YES        3 YES       NO         YES
## 3  51 FEMALE INNER_CITY 16575.4     YES        0 YES      YES         YES
## 4  23 FEMALE       TOWN 20375.4     YES        3  NO       NO         YES
## 5  57 FEMALE      RURAL 50576.3     YES        0  NO      YES          NO
## 6  57 FEMALE       TOWN 37869.6     YES        2  NO      YES         YES
##   mortgage pep
## 1       NO YES
## 2      YES  NO
## 3       NO  NO
## 4       NO  NO
## 5       NO  NO
## 6       NO YES
str(test)
## 'data.frame':    300 obs. of  12 variables:
##  $ id         : chr  "ID12401" "ID12402" "ID12403" "ID12404" ...
##  $ age        : int  19 37 45 49 67 35 63 38 48 28 ...
##  $ sex        : chr  "FEMALE" "FEMALE" "FEMALE" "MALE" ...
##  $ region     : chr  "INNER_CITY" "TOWN" "TOWN" "RURAL" ...
##  $ income     : num  8162 15350 29231 41462 57398 ...
##  $ married    : chr  "YES" "YES" "YES" "YES" ...
##  $ children   : int  1 0 0 3 3 0 2 0 2 1 ...
##  $ car        : chr  "YES" "NO" "NO" "NO" ...
##  $ save_act   : chr  "YES" "YES" "YES" "YES" ...
##  $ current_act: chr  "YES" "NO" "NO" "YES" ...
##  $ mortgage   : chr  "YES" "NO" "NO" "YES" ...
##  $ pep        : Factor w/ 2 levels "NO","YES": 1 1 1 1 2 1 2 2 1 1 ...
head(test)
##        id age    sex     region   income married children car save_act
## 1 ID12401  19 FEMALE INNER_CITY  8162.42     YES        1 YES      YES
## 2 ID12402  37 FEMALE       TOWN 15349.60     YES        0  NO      YES
## 3 ID12403  45 FEMALE       TOWN 29231.40     YES        0  NO      YES
## 4 ID12404  49   MALE      RURAL 41462.30     YES        3  NO      YES
## 5 ID12405  67 FEMALE      RURAL 57398.10      NO        3  NO      YES
## 6 ID12406  35 FEMALE      RURAL 11520.80     YES        0  NO       NO
##   current_act mortgage pep
## 1         YES      YES  NO
## 2          NO       NO  NO
## 3          NO       NO  NO
## 4         YES      YES  NO
## 5         YES       NO YES
## 6         YES       NO  NO
str(newd)
## 'data.frame':    200 obs. of  11 variables:
##  $ id         : chr  "ID12701" "ID12702" "ID12703" "ID12704" ...
##  $ age        : int  23 30 45 50 41 20 46 50 42 57 ...
##  $ sex        : chr  "MALE" "MALE" "FEMALE" "MALE" ...
##  $ region     : chr  "INNER_CITY" "RURAL" "RURAL" "TOWN" ...
##  $ income     : num  18767 9916 21882 46794 20721 ...
##  $ married    : chr  "YES" "NO" "NO" "YES" ...
##  $ children   : int  0 1 0 2 0 1 0 1 3 1 ...
##  $ car        : chr  "YES" "NO" "YES" "NO" ...
##  $ save_act   : chr  "YES" "YES" "YES" "YES" ...
##  $ current_act: chr  "NO" "NO" "YES" "NO" ...
##  $ mortgage   : chr  "YES" "YES" "NO" "YES" ...
head(newd)
##        id age    sex     region   income married children car save_act
## 1 ID12701  23   MALE INNER_CITY 18766.90     YES        0 YES      YES
## 2 ID12702  30   MALE      RURAL  9915.67      NO        1  NO      YES
## 3 ID12703  45 FEMALE      RURAL 21881.60      NO        0 YES      YES
## 4 ID12704  50   MALE       TOWN 46794.40     YES        2  NO      YES
## 5 ID12705  41 FEMALE INNER_CITY 20721.10     YES        0 YES      YES
## 6 ID12706  20   MALE INNER_CITY 16688.50      NO        1  NO      YES
##   current_act mortgage
## 1          NO      YES
## 2          NO      YES
## 3         YES       NO
## 4          NO      YES
## 5         YES       NO
## 6         YES      YES
prop.table(table(train$pep))
## 
##   NO  YES 
## 0.55 0.45
prop.table(table(test$pep))
## 
##        NO       YES 
## 0.5366667 0.4633333
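
The two classes are close to balanced (55:45 in the training set, 54:46 in the test set), so plain accuracy is a reasonable headline metric for the models that follow.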

3. Modeling

3.1. Logistic Regression

Training a model on the data

- Training a model on the data:

m_logis <- glm(pep ~ ., data=train, family=binomial(link='logit'))
summary(m_logis)
## 
## Call:
## glm(formula = pep ~ ., family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1618  -0.9717  -0.6610   1.0941   2.0354  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.568e+00  5.666e-01  -2.767  0.00565 **
## age             1.593e-02  1.299e-02   1.227  0.21998   
## sexMALE         6.887e-01  2.583e-01   2.666  0.00768 **
## regionRURAL     2.696e-01  3.702e-01   0.728  0.46641   
## regionSUBURBAN -4.920e-02  4.905e-01  -0.100  0.92011   
## regionTOWN     -1.790e-01  2.960e-01  -0.605  0.54541   
## income          3.863e-05  1.518e-05   2.545  0.01093 * 
## marriedYES     -6.382e-01  2.679e-01  -2.382  0.01720 * 
## children       -1.800e-01  1.195e-01  -1.506  0.13200   
## carYES         -2.357e-01  2.535e-01  -0.930  0.35241   
## save_actYES    -3.225e-01  2.786e-01  -1.158  0.24696   
## current_actYES  3.680e-01  3.159e-01   1.165  0.24401   
## mortgageYES    -1.942e-01  2.696e-01  -0.720  0.47128   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 370.64  on 287  degrees of freedom
## AIC: 396.64
## 
## Number of Fisher Scoring iterations: 4

- coefficients:

exp(coef(m_logis))
##    (Intercept)            age        sexMALE    regionRURAL regionSUBURBAN 
##      0.2084668      1.0160584      1.9910270      1.3095056      0.9519925 
##     regionTOWN         income     marriedYES       children         carYES 
##      0.8361374      1.0000386      0.5282468      0.8352639      0.7899789 
##    save_actYES current_actYES    mortgageYES 
##      0.7243322      1.4448518      0.8234982
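
Each value above is exp(coefficient), the multiplicative change in the odds of pep = YES for a one-unit increase in that predictor, other variables held fixed. As a quick check against the coefficient table (sexMALE has coefficient 0.6887):

# exp(beta) for sexMALE: the odds of buying the PEP are roughly doubled
# (about 1.99x) for male customers, all else being equal
exp(0.6887)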

Evaluating model performance 1

- train set prediction:

# train set prediction
pred_tr_lm <- predict(m_logis, train, type="response", na.action = na.pass)

# train set accuracy
t <- table(train$pep, pred_tr_lm > 0.5)
acc <- sum(diag(t)) / sum(t)
acc
## [1] 0.6566667
pred_tr_lm <- ifelse(pred_tr_lm > 0.5, 1, 0) # 1 = YES, 0 = NO

table(train$pep)
## 
##  NO YES 
## 165 135
table(pred_tr_lm)
## pred_tr_lm
##   0   1 
## 186 114
#confusionMatrix(table(train$pep, pred_tr_lm))

- test set prediction:

# test set prediction
pred_te <- predict(m_logis, test, type="response", na.action = na.pass)
t <- table(test$pep, pred_te > 0.5)

# test set accuracy
acc <- sum(diag(t)) / sum(t)
acc
## [1] 0.5866667
pred_te_lm <- ifelse(pred_te > 0.5, 1, 0) # 1 = YES, 0 = NO

table(test$pep)
## 
##  NO YES 
## 161 139
table(pred_te_lm)
## pred_te_lm
##   0   1 
## 179 121
#confusionMatrix(table(test$pep, pred_te_lm))

Area under the curve (ROC-AUC)

- ROC Curve:

# Area under the curve
predict_1 <- prediction(pred_te, test$pep)

# creating ROC curve
roc1 <- performance(predict_1,"tpr","fpr")
plot(roc1)
title("ROC Curve")

Cross Validation

- Cross Validation:

ctrl1 <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

mod_fit1 <- train(pep ~.,  data=train, method="glm", family="binomial",
                 trControl = ctrl1, tuneLength = 5)
summary(mod_fit1)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1618  -0.9717  -0.6610   1.0941   2.0354  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.568e+00  5.666e-01  -2.767  0.00565 **
## age             1.593e-02  1.299e-02   1.227  0.21998   
## sexMALE         6.887e-01  2.583e-01   2.666  0.00768 **
## regionRURAL     2.696e-01  3.702e-01   0.728  0.46641   
## regionSUBURBAN -4.920e-02  4.905e-01  -0.100  0.92011   
## regionTOWN     -1.790e-01  2.960e-01  -0.605  0.54541   
## income          3.863e-05  1.518e-05   2.545  0.01093 * 
## marriedYES     -6.382e-01  2.679e-01  -2.382  0.01720 * 
## children       -1.800e-01  1.195e-01  -1.506  0.13200   
## carYES         -2.357e-01  2.535e-01  -0.930  0.35241   
## save_actYES    -3.225e-01  2.786e-01  -1.158  0.24696   
## current_actYES  3.680e-01  3.159e-01   1.165  0.24401   
## mortgageYES    -1.942e-01  2.696e-01  -0.720  0.47128   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 370.64  on 287  degrees of freedom
## AIC: 396.64
## 
## Number of Fisher Scoring iterations: 4
# For a classification model trained with caret, predict() returns factor
# class labels ("NO"/"YES"). Applying a numeric cutoff to that factor, as in
# ifelse(pred_cross1 > 0.5, 1, 0), is not meaningful and produces all NA.
pred_cross1 <- predict(mod_fit1, newdata = test)
confusionMatrix(pred_cross1, test$pep)

Variable importance

- Variable Importance:

varImp(m_logis, scale=FALSE)
##                  Overall
## age            1.2265733
## sexMALE        2.6656080
## regionRURAL    0.7283283
## regionSUBURBAN 0.1002997
## regionTOWN     0.6046463
## income         2.5450827
## marriedYES     2.3824252
## children       1.5062590
## carYES         0.9299269
## save_actYES    1.1577565
## current_actYES 1.1650154
## mortgageYES    0.7204052

Stepwise logistic regression

- Stepwise Logistic Regression:

full_m <- glm(pep ~ ., data = train, family = "binomial")
null_m <- glm(pep ~ 1, data = train, family = "binomial") # intercept-only model
m_logis_step <- step(null_m, direction = "both", trace = F, scope = list(lower = null_m, upper = full_m))
m_logis_step
## 
## Call:  glm(formula = pep ~ income + sex + married + children, family = "binomial", 
##     data = train)
## 
## Coefficients:
## (Intercept)       income      sexMALE   marriedYES     children  
##  -1.221e+00    4.792e-05    5.959e-01   -6.228e-01   -1.781e-01  
## 
## Degrees of Freedom: 299 Total (i.e. Null);  295 Residual
## Null Deviance:       412.9 
## Residual Deviance: 377.4     AIC: 387.4
summary(m_logis_step)
## 
## Call:
## glm(formula = pep ~ income + sex + married + children, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1215  -1.0141  -0.7009   1.1531   1.9804  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.221e+00  3.785e-01  -3.225  0.00126 ** 
## income       4.792e-05  1.006e-05   4.762 1.91e-06 ***
## sexMALE      5.959e-01  2.499e-01   2.384  0.01711 *  
## marriedYES  -6.228e-01  2.633e-01  -2.365  0.01802 *  
## children    -1.781e-01  1.166e-01  -1.527  0.12676    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 377.44  on 295  degrees of freedom
## AIC: 387.44
## 
## Number of Fisher Scoring iterations: 4
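
Note that stepwise selection keeps only income, sex, married, and children, and lowers AIC from 396.64 (full model) to 387.44, so the reduced model is preferred on this criterion.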

Evaluating model performance 2

- training data class prediction:

pred_tr <- predict(m_logis_step, type = "response")

- training set classification performance:

t <- table(train$pep, pred_tr > 0.5) # confusion matrix at threshold 0.5
t
##      
##       FALSE TRUE
##   NO    126   39
##   YES    73   62
acc_tr <- sum(diag(t)) / sum(t) # training accuracy
acc_tr
## [1] 0.6266667

- test set classification performance:

pred_te <- predict(m_logis_step, test, type = "response") # class prediction

t_te <- table(test$pep, pred_te >0.5) # confusion matrix
t_te
##      
##       FALSE TRUE
##   NO    112   49
##   YES    69   70
acc_te <- sum(diag(t_te)) / sum(t_te) # test accuracy
acc_te
## [1] 0.6066667

Training a model on the data

- Training a model on the data:

# second candidate model for the ROC comparison: the same logistic regression
# as m_logis above, refit here so its test-set predictions can be stored
lm_model <- glm(pep ~ ., data=train, family = binomial)
summary(lm_model)
## 
## Call:
## glm(formula = pep ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1618  -0.9717  -0.6610   1.0941   2.0354  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.568e+00  5.666e-01  -2.767  0.00565 **
## age             1.593e-02  1.299e-02   1.227  0.21998   
## sexMALE         6.887e-01  2.583e-01   2.666  0.00768 **
## regionRURAL     2.696e-01  3.702e-01   0.728  0.46641   
## regionSUBURBAN -4.920e-02  4.905e-01  -0.100  0.92011   
## regionTOWN     -1.790e-01  2.960e-01  -0.605  0.54541   
## income          3.863e-05  1.518e-05   2.545  0.01093 * 
## marriedYES     -6.382e-01  2.679e-01  -2.382  0.01720 * 
## children       -1.800e-01  1.195e-01  -1.506  0.13200   
## carYES         -2.357e-01  2.535e-01  -0.930  0.35241   
## save_actYES    -3.225e-01  2.786e-01  -1.158  0.24696   
## current_actYES  3.680e-01  3.159e-01   1.165  0.24401   
## mortgageYES    -1.942e-01  2.696e-01  -0.720  0.47128   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 370.64  on 287  degrees of freedom
## AIC: 396.64
## 
## Number of Fisher Scoring iterations: 4
test$lm_pred <- ifelse(predict(lm_model, test, type="response") > 0.5, "YES","NO")
test$lm_pred = as.factor(test$lm_pred)
test$lm_pred_prob <- predict(lm_model, test, type="response")

Evaluating model performance

- Confusion Matrix : Test set

confusionMatrix(test$lm_pred, test$pep)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  NO YES
##        NO  108  71
##        YES  53  68
##                                          
##                Accuracy : 0.5867         
##                  95% CI : (0.5286, 0.643)
##     No Information Rate : 0.5367         
##     P-Value [Acc > NIR] : 0.04626        
##                                          
##                   Kappa : 0.1614         
##  Mcnemar's Test P-Value : 0.12685        
##                                          
##             Sensitivity : 0.6708         
##             Specificity : 0.4892         
##          Pos Pred Value : 0.6034         
##          Neg Pred Value : 0.5620         
##              Prevalence : 0.5367         
##          Detection Rate : 0.3600         
##    Detection Prevalence : 0.5967         
##       Balanced Accuracy : 0.5800         
##                                          
##        'Positive' Class : NO             
## 

3.2. Decision Tree : CART

Training a model on the data

  • cp (complexity parameter) : the smaller the cp value, the more complex the tree is allowed to grow
  • Parameters tunable in the rpart function (defaults in parentheses; see the sketch below):
    • minsplit : minimum number of observations required to attempt a split (20)
    • xval : number of cross-validations (10)
    • maxdepth : maximum depth of any node (30)
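
A minimal sketch of setting these parameters explicitly through rpart.control (the values shown are the rpart defaults listed above):

# pass tree-growing parameters explicitly; these values are the rpart defaults
ctrl <- rpart.control(minsplit = 20, xval = 10, maxdepth = 30, cp = 0.01)
# dt <- rpart(pep ~ ., data = train, control = ctrl)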

- Reshape Data:

# convert the categorical columns of all three data sets to factors
cat_vars <- c("sex", "region", "married", "children", "car",
              "save_act", "current_act", "mortgage")

train[cat_vars] <- lapply(train[cat_vars], as.factor)
test[cat_vars] <- lapply(test[cat_vars], as.factor)
newd[cat_vars] <- lapply(newd[cat_vars], as.factor)

- Training a model on the data:

set.seed(1004)
## Training
dt <- rpart(as.factor(pep)~., data = train, cp = 0.1^20) # use all predictors; a tiny cp grows a full tree

xerror_min_which <- which.min(dt$cptable[, "xerror"]) # row with the smallest CV error
xerror_min <- min(dt$cptable[, "xerror"])

Cross Validation: Pruning

  • Run cross-validation on the training set
    • Reason : performance is validated by measuring error on data that was not used for fitting
  • Prune the tree at the point where the validation error starts to increase

- Print the cptable:

printcp(dt) # print the cp table
## 
## Classification tree:
## rpart(formula = as.factor(pep) ~ ., data = train, cp = 0.1^20)
## 
## Variables actually used in tree construction:
## [1] children income   married  mortgage
## 
## Root node error: 135/300 = 0.45
## 
## n= 300 
## 
##           CP nsplit rel error  xerror     xstd
## 1 3.2593e-01      0   1.00000 1.00000 0.063828
## 2 8.8889e-02      1   0.67407 0.67407 0.058979
## 3 5.9259e-02      3   0.49630 0.55556 0.055556
## 4 3.3333e-02      4   0.43704 0.47407 0.052560
## 5 2.9630e-02      6   0.37037 0.48889 0.053148
## 6 1.0000e-20      8   0.31111 0.42222 0.050332

- Plot the cp table:

plotcp(dt) # cp plot

abline(v = xerror_min_which, lty = 2, col = "red")
text(xerror_min_which, xerror_min, labels = round(xerror_min_which, 2), pos = 3, col = "red")

- Pruning:

# prune back to the cp with the smallest cross-validated error (xerror)
dt_prune <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])

Evaluating model performance

- Confusion Matrix : Training set

# training accuracy
pred_tr_dt <- predict(dt_prune, type = "class") # predict class labels (categorical)

confusionMatrix(train$pep, pred_tr_dt, positive = "YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  NO YES
##        NO  153  12
##        YES  30 105
##                                           
##                Accuracy : 0.86            
##                  95% CI : (0.8155, 0.8972)
##     No Information Rate : 0.61            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7137          
##  Mcnemar's Test P-Value : 0.008712        
##                                           
##             Sensitivity : 0.8974          
##             Specificity : 0.8361          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.9273          
##              Prevalence : 0.3900          
##          Detection Rate : 0.3500          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.8668          
##                                           
##        'Positive' Class : YES             
## 

- Confusion Matrix : Test set

# test accuracy
pred_te_dt <- predict(dt_prune, test, type = "class")

confusionMatrix(test$pep, pred_te_dt, positive = "YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  NO YES
##        NO  144  17
##        YES  40  99
##                                          
##                Accuracy : 0.81           
##                  95% CI : (0.761, 0.8528)
##     No Information Rate : 0.6133         
##     P-Value [Acc > NIR] : 1.598e-13      
##                                          
##                   Kappa : 0.6136         
##  Mcnemar's Test P-Value : 0.003569       
##                                          
##             Sensitivity : 0.8534         
##             Specificity : 0.7826         
##          Pos Pred Value : 0.7122         
##          Neg Pred Value : 0.8944         
##              Prevalence : 0.3867         
##          Detection Rate : 0.3300         
##    Detection Prevalence : 0.4633         
##       Balanced Accuracy : 0.8180         
##                                          
##        'Positive' Class : YES            
## 

Summary : Visualization

- Plotting: Basic

# plotting
plot(dt_prune, margin = 0.1)
text(dt_prune, use.n = T)

- Feature Importance

dt_prune$variable.importance
##   children     income    married        age   mortgage   save_act 
## 44.4865854 23.6406604 13.8333517  9.8737562  4.2705882  0.8171769 
##     region        sex 
##  0.5794552  0.4270588
barplot(dt_prune$variable.importance, ylim = c(0, 50))

- Plotting : Fancy Tree

fancyRpartPlot(dt_prune, cex = 1, palettes = c("Greys","RdPu")) #fancy tree

3.3. Decision Tree : C5.0

Training a model on the data

- Training a model on the data:

set.seed(1004)
c5_model <- C5.0(pep ~ ., data=train)
c5_model
## 
## Call:
## C5.0.formula(formula = pep ~ ., data = train)
## 
## Classification Tree
## Number of samples: 300 
## Number of predictors: 10 
## 
## Tree size: 17 
## 
## Non-standard options: attempt to group attributes
summary(c5_model)
## 
## Call:
## C5.0.formula(formula = pep ~ ., data = train)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Nov 19 16:38:35 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 300 cases (11 attributes) from undefined.data
## 
## Decision tree:
## 
## children = 1:
## :...income <= 12640.3: NO (5)
## :   income > 12640.3:
## :   :...current_act = YES: YES (47/1)
## :       current_act = NO:
## :       :...income <= 17390.1: NO (3)
## :           income > 17390.1: YES (7)
## children in {0,2,3}:
## :...income <= 30085.1:
##     :...children in {2,3}: NO (70/4)
##     :   children = 0:
##     :   :...married = NO:
##     :       :...mortgage = NO: YES (17/2)
##     :       :   mortgage = YES:
##     :       :   :...save_act = NO: YES (3)
##     :       :       save_act = YES: NO (7)
##     :       married = YES:
##     :       :...mortgage = NO: NO (43/6)
##     :           mortgage = YES:
##     :           :...save_act = NO: YES (8/1)
##     :               save_act = YES: NO (6)
##     income > 30085.1:
##     :...children = 2: YES (28/2)
##         children in {0,3}:
##         :...married = NO:
##             :...mortgage = NO: YES (15/1)
##             :   mortgage = YES: NO (5)
##             married = YES:
##             :...children = 0: NO (27/3)
##                 children = 3:
##                 :...income <= 43530: NO (6/1)
##                     income > 43530: YES (3)
## 
## 
## Evaluation on training data (300 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      17   21( 7.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     158     7    (a): class NO
##      14   121    (b): class YES
## 
## 
##  Attribute usage:
## 
##  100.00% income
##  100.00% children
##   46.67% married
##   34.67% mortgage
##   19.00% current_act
##    8.00% save_act
## 
## 
## Time: 0.0 secs

Summary : Visualization

- Plotting :

plot(c5_model, main="Decision Tree Plot (C5.0)")

plot(c5_model, type="simple", main="Decision Tree Plot (C5.0)")

Evaluating model performance

- Confusion Matrix : Training set

## Evaluating model performance
# create a factor vector of predictions on the training data
pred_tr <- predict(c5_model, train)

# cross tabulation of predicted versus actual classes
confusionMatrix(train$pep, pred_tr, positive = "YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  NO YES
##        NO  158   7
##        YES  14 121
##                                          
##                Accuracy : 0.93           
##                  95% CI : (0.895, 0.9561)
##     No Information Rate : 0.5733         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8579         
##  Mcnemar's Test P-Value : 0.1904         
##                                          
##             Sensitivity : 0.9453         
##             Specificity : 0.9186         
##          Pos Pred Value : 0.8963         
##          Neg Pred Value : 0.9576         
##              Prevalence : 0.4267         
##          Detection Rate : 0.4033         
##    Detection Prevalence : 0.4500         
##       Balanced Accuracy : 0.9320         
##                                          
##        'Positive' Class : YES            
## 

- Confusion Matrix : Test set

pred_te <- predict(c5_model, test)

confusionMatrix(test$pep, pred_te, positive = "YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  NO YES
##        NO  145  16
##        YES  23 116
##                                           
##                Accuracy : 0.87            
##                  95% CI : (0.8266, 0.9059)
##     No Information Rate : 0.56            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7377          
##  Mcnemar's Test P-Value : 0.3367          
##                                           
##             Sensitivity : 0.8788          
##             Specificity : 0.8631          
##          Pos Pred Value : 0.8345          
##          Neg Pred Value : 0.9006          
##              Prevalence : 0.4400          
##          Detection Rate : 0.3867          
##    Detection Prevalence : 0.4633          
##       Balanced Accuracy : 0.8709          
##                                           
##        'Positive' Class : YES             
## 
test$c5_pred_prob <- predict(c5_model, test, type="prob")
#confusionMatrix(test$c5_pred, test$pep, positive = "YES")

- ROC Curve:

# 4. Model Evaluation by ROC chart

c5_pred <- prediction(test$c5_pred_prob[, "YES"], test$pep)
c5_model.perf <- performance(c5_pred, "tpr", "fpr")   # True positive rate, False positive rate

lm_pred <- prediction(test$lm_pred_prob, test$pep)
lm_model.perf <- performance(lm_pred, "tpr", "fpr")

plot(c5_model.perf, col = "red")
plot(lm_model.perf, col = "blue", add=T)
legend(0.7, 0.7, c("C5 ","LM "), cex = 0.9, col = c("red", "blue"), lty = 1)
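
To complement the plot, the numeric AUCs of the two curves can be extracted from the same ROCR objects (values not reproduced here):

# compare the two models by area under the ROC curve
performance(c5_pred, "auc")@y.values[[1]] # C5.0
performance(lm_pred, "auc")@y.values[[1]] # logistic regression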

Deployment - Applying the model to new data

  • Keep only customers whose predicted PEP purchase probability exceeds 0.8

newd$c5_pred <- predict(c5_model, newd, type="class")
newd$c5_pred_prob <- predict(c5_model, newd, type="prob")
target <- subset(newd, c5_pred=="YES" & c5_pred_prob[, "YES"] > 0.8)
write.csv(target[order(target$c5_pred_prob[, "YES"], decreasing=T), ],
            "dm_target.csv", row.names=FALSE)

3.4. Random Forest

Setting

- Setting:

train <- read.any('./input/pepTrainSet.csv', header=T, stringsAsFactors = F)
#train <- subset(train, select=-c(id))
test <- read.any('./input/pepTestSet.csv', header=T, stringsAsFactors = F)
newd <- read.any('./input/pepNewCustomers.csv', header=T, stringsAsFactors = F)

train$pep <- factor(train$pep)
test$pep <- factor(test$pep)

- Merge:

full <- rbind(train, test)
#full <- merge(x = train, y = test, by = "id", all.x = TRUE)

summary(full)
##       id                 age            sex               region         
##  Length:600         Min.   :18.00   Length:600         Length:600        
##  Class :character   1st Qu.:30.00   Class :character   Class :character  
##  Mode  :character   Median :42.00   Mode  :character   Mode  :character  
##                     Mean   :42.40                                        
##                     3rd Qu.:55.25                                        
##                     Max.   :67.00                                        
##      income        married             children         car           
##  Min.   : 5014   Length:600         Min.   :0.000   Length:600        
##  1st Qu.:17264   Class :character   1st Qu.:0.000   Class :character  
##  Median :24925   Mode  :character   Median :1.000   Mode  :character  
##  Mean   :27524                      Mean   :1.012                     
##  3rd Qu.:36173                      3rd Qu.:2.000                     
##  Max.   :63130                      Max.   :3.000                     
##    save_act         current_act          mortgage          pep     
##  Length:600         Length:600         Length:600         NO :326  
##  Class :character   Class :character   Class :character   YES:274  
##  Mode  :character   Mode  :character   Mode  :character            
##                                                                    
##                                                                    
## 
str(full)
## 'data.frame':    600 obs. of  12 variables:
##  $ id         : chr  "ID12101" "ID12102" "ID12103" "ID12104" ...
##  $ age        : int  48 40 51 23 57 57 22 58 37 54 ...
##  $ sex        : chr  "FEMALE" "MALE" "FEMALE" "FEMALE" ...
##  $ region     : chr  "INNER_CITY" "TOWN" "INNER_CITY" "TOWN" ...
##  $ income     : num  17546 30085 16575 20375 50576 ...
##  $ married    : chr  "NO" "YES" "YES" "YES" ...
##  $ children   : int  1 3 0 3 0 2 0 0 2 2 ...
##  $ car        : chr  "NO" "YES" "YES" "NO" ...
##  $ save_act   : chr  "NO" "NO" "YES" "NO" ...
##  $ current_act: chr  "NO" "YES" "YES" "YES" ...
##  $ mortgage   : chr  "NO" "YES" "NO" "NO" ...
##  $ pep        : Factor w/ 2 levels "NO","YES": 2 1 1 1 1 2 2 1 1 1 ...
head(full)
##        id age    sex     region  income married children car save_act
## 1 ID12101  48 FEMALE INNER_CITY 17546.0      NO        1  NO       NO
## 2 ID12102  40   MALE       TOWN 30085.1     YES        3 YES       NO
## 3 ID12103  51 FEMALE INNER_CITY 16575.4     YES        0 YES      YES
## 4 ID12104  23 FEMALE       TOWN 20375.4     YES        3  NO       NO
## 5 ID12105  57 FEMALE      RURAL 50576.3     YES        0  NO      YES
## 6 ID12106  57 FEMALE       TOWN 37869.6     YES        2  NO      YES
##   current_act mortgage pep
## 1          NO       NO YES
## 2         YES      YES  NO
## 3         YES       NO  NO
## 4         YES       NO  NO
## 5          NO       NO  NO
## 6         YES       NO YES

Feature Engineering

- Married:

prop.table(table(full$married))
## 
##   NO  YES 
## 0.34 0.66
full$married <- ifelse(full$married == 'NO', 0, 1)
prop.table(table(full$married))
## 
##    0    1 
## 0.34 0.66

- Mortgage:

table(full$mortgage)
## 
##  NO YES 
## 391 209
full$mortgage<- ifelse(full$mortgage == 'NO', 0, 1)
table(full$mortgage)
## 
##   0   1 
## 391 209

- Sex:

prop.table(table(full$sex))
## 
## FEMALE   MALE 
##    0.5    0.5
full$sex <- ifelse(full$sex == 'FEMALE', 0, 1)
prop.table(table(full$sex))
## 
##   0   1 
## 0.5 0.5

- Region:

prop.table(table(full$region))
## 
## INNER_CITY      RURAL   SUBURBAN       TOWN 
##  0.4483333  0.1600000  0.1033333  0.2883333
full$region <- ifelse(full$region == 'INNER_CITY', 0,
                   ifelse(full$region == 'TOWN', 1,
                          ifelse(full$region == 'RURAL', 2, 3)))

prop.table(table(full$region))
## 
##         0         1         2         3 
## 0.4483333 0.2883333 0.1600000 0.1033333

- current_act:

prop.table(table(full$current_act))
## 
##        NO       YES 
## 0.2416667 0.7583333
full$current_act <- ifelse(full$current_act == 'NO', 0, 1)
prop.table(table(full$current_act))
## 
##         0         1 
## 0.2416667 0.7583333

- Save Account:

prop.table(table(full$save_act))
## 
##   NO  YES 
## 0.31 0.69
full$save_act <- ifelse(full$save_act == 'NO', 0, 1)
prop.table(table(full$save_act))
## 
##    0    1 
## 0.31 0.69

- Car:

prop.table(table(full$car))
## 
##        NO       YES 
## 0.5066667 0.4933333
full$car <- ifelse(full$car == 'NO', 0, 1)
prop.table(table(full$car))
## 
##         0         1 
## 0.5066667 0.4933333

- Age:

full$age <- ifelse(full$age <= 29, 0,
                       ifelse(full$age > 29 & full$age <= 41, 1,
                             ifelse(full$age > 41 & full$age <= 54, 2, 3)))

prop.table(table(full$age))
## 
##         0         1         2         3 
## 0.2400000 0.2483333 0.2566667 0.2550000

- Income:

full$income <- ifelse(full$income <= 17099, 0,
                       ifelse(full$income > 17099 & full$income <= 27250, 1,
                             ifelse(full$income > 27250 & full$income <= 35350, 2, 3)))

prop.table(table(full$income))
## 
##         0         1         2         3 
## 0.2400000 0.3266667 0.1750000 0.2583333

- Children:

table(full$children)
## 
##   0   1   2   3 
## 263 135 134  68
prop.table(table(full$children))
## 
##         0         1         2         3 
## 0.4383333 0.2250000 0.2233333 0.1133333

- Split the Data:

prop.table(table(full$pep))
## 
##        NO       YES 
## 0.5433333 0.4566667
str(full)
## 'data.frame':    600 obs. of  12 variables:
##  $ id         : chr  "ID12101" "ID12102" "ID12103" "ID12104" ...
##  $ age        : num  2 1 2 0 3 3 0 3 1 2 ...
##  $ sex        : num  0 1 0 0 0 0 1 1 0 1 ...
##  $ region     : num  0 1 0 1 2 1 2 1 3 1 ...
##  $ income     : num  1 2 0 1 3 3 0 1 1 1 ...
##  $ married    : num  0 1 1 1 1 1 0 1 1 1 ...
##  $ children   : int  1 3 0 3 0 2 0 0 2 2 ...
##  $ car        : num  0 1 1 0 0 0 0 1 1 1 ...
##  $ save_act   : num  0 0 1 0 1 1 0 1 0 1 ...
##  $ current_act: num  0 1 1 1 0 1 1 1 0 1 ...
##  $ mortgage   : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ pep        : Factor w/ 2 levels "NO","YES": 2 1 1 1 1 2 2 1 1 1 ...
head(full)
##        id age sex region income married children car save_act current_act
## 1 ID12101   2   0      0      1       0        1   0        0           0
## 2 ID12102   1   1      1      2       1        3   1        0           1
## 3 ID12103   2   0      0      0       1        0   1        1           1
## 4 ID12104   0   0      1      1       1        3   0        0           1
## 5 ID12105   3   0      2      3       1        0   0        1           0
## 6 ID12106   3   0      1      3       1        2   0        1           1
##   mortgage pep
## 1        0 YES
## 2        1  NO
## 3        0  NO
## 4        0  NO
## 5        0  NO
## 6        0 YES
## 70% of the sample size
smp_size <- floor(0.7 * nrow(full))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(full)), size = smp_size)

train <- full[train_ind, ]
test <- full[-train_ind, ]

- Dummy label:

train$pep <- ifelse(train$pep == "YES", 1, 0)
test$pep <- ifelse(test$pep == "YES", 1, 0)

Note that with a numeric 0/1 target, randomForest fits a regression forest (the model summary below confirms this), so its predictions are scores in [0, 1] that are thresholded at 0.5 further down.

mtry optimization
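
tuneRF starts from mtryStart and repeatedly doubles (or halves) mtry, continuing as long as the out-of-bag error improves by more than the improve threshold (0.05 by default); columns 1 and 12 (id and pep) are excluded so that only the ten predictors are searched.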

tune.rf <- tuneRF(train[, c(-1, -12)], train$pep, mtryStart = 1)
## mtry = 1  OOB error = 0.2138995 
## Searching left ...
## Searching right ...
## mtry = 2     OOB error = 0.1629827 
## 0.2380404 0.05 
## mtry = 4     OOB error = 0.1330898 
## 0.1834115 0.05 
## mtry = 8     OOB error = 0.1275279 
## 0.04179099 0.05

tune.rf
##   mtry  OOBError
## 1    1 0.2138995
## 2    2 0.1629827
## 4    4 0.1330898
## 8    8 0.1275279
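
Although mtry = 8 shows the lowest raw OOB error, its relative improvement over mtry = 4 (about 0.042) falls short of the 0.05 threshold, so mtry = 4 is the value carried into the model below.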

Training a model on the data

- Train a model:

# Train a forest on the training data and compute variable importance.
# Note: te_rf is fit on the test set itself; predict(te_rf) below returns its
# out-of-bag predictions, which stand in for "test accuracy" here.
tr_rf <- randomForest(train[,c(-1,-12)], train[,12], ntree=5000, mtry=4, importance=TRUE)
te_rf <- randomForest(test[,c(-1,-12)], test[,12], ntree=5000, mtry=4, importance=TRUE)

imp <- importance(tr_rf, type=1)
featureImportance <- data.frame(Feature=row.names(imp), Importance=imp[,1])

tr_rf
## 
## Call:
##  randomForest(x = train[, c(-1, -12)], y = train[, 12], ntree = 5000,      mtry = 4, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 5000
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.1297397
##                     % Var explained: 47.9

Evaluating model performance

- Train Accuracy:

pred_tr_rf <- predict(tr_rf) # out-of-bag predictions (regression scores in [0, 1])

table(train$pep)
## 
##   0   1 
## 223 197
head(pred_tr_rf)
##        173        473        245        528        561         28 
## 0.11156394 0.30457881 0.50206575 0.20039772 0.62063797 0.70300399
pred_tr_rf <- ifelse(pred_tr_rf > 0.5, 1, 0) # 1 = YES, 0 = NO
table(pred_tr_rf)
## pred_tr_rf
##   0   1 
## 228 192
confusionMatrix(table(train$pep, pred_tr_rf))
## Confusion Matrix and Statistics
## 
##    pred_tr_rf
##       0   1
##   0 192  31
##   1  36 161
##                                           
##                Accuracy : 0.8405          
##                  95% CI : (0.8019, 0.8742)
##     No Information Rate : 0.5429          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6793          
##  Mcnemar's Test P-Value : 0.6251          
##                                           
##             Sensitivity : 0.8421          
##             Specificity : 0.8385          
##          Pos Pred Value : 0.8610          
##          Neg Pred Value : 0.8173          
##              Prevalence : 0.5429          
##          Detection Rate : 0.4571          
##    Detection Prevalence : 0.5310          
##       Balanced Accuracy : 0.8403          
##                                           
##        'Positive' Class : 0               
## 

- Test Accuracy:

pred_te_rf <- predict(te_rf) # out-of-bag predictions from the test-set forest

table(test$pep)
## 
##   0   1 
## 103  77
head(pred_te_rf)
##          2          4          8          9         11         12 
## 0.19798130 0.30279029 0.25659203 0.40435789 0.21952548 0.33754731
pred_te_rf <- ifelse(pred_te_rf > 0.5, 1, 0) # 1 = YES, 0 = NO
table(pred_te_rf)
## pred_te_rf
##   0   1 
## 110  70
confusionMatrix(table(test$pep, pred_te_rf))
## Confusion Matrix and Statistics
## 
##    pred_te_rf
##      0  1
##   0 91 12
##   1 19 58
##                                           
##                Accuracy : 0.8278          
##                  95% CI : (0.7645, 0.8799)
##     No Information Rate : 0.6111          
##     P-Value [Acc > NIR] : 2.458e-10       
##                                           
##                   Kappa : 0.6441          
##  Mcnemar's Test P-Value : 0.2812          
##                                           
##             Sensitivity : 0.8273          
##             Specificity : 0.8286          
##          Pos Pred Value : 0.8835          
##          Neg Pred Value : 0.7532          
##              Prevalence : 0.6111          
##          Detection Rate : 0.5056          
##    Detection Prevalence : 0.5722          
##       Balanced Accuracy : 0.8279          
##                                           
##        'Positive' Class : 0               
## 

Cross Validation

- Hyperparameter Tuning:

#ctrl <- trainControl(method = "cv", number = 10)

#grid_rf <- expand.grid(.mtry = c(2, 4, 8, 16))

#set.seed(1004)
#m_rf <- train(pep ~ ., data = train, method = "rf",
#              trControl = ctrl, tuneGrid = grid_rf)

#grid_c50 <- expand.grid(.model = "tree",
#                        .trials = c(10, 20, 30, 40),
#                        .winnow = 'FALSE')

#set.seed(1004)
#m_c50 <- train(pep ~ ., data = train, method = "C5.0",
#               trControl = ctrl, tuneGrid = grid_c50)


#m_rf
#m_c50

Visualization : Feature Importance

- Feature Importance:

featureImportance
##                 Feature Importance
## age                 age  56.317327
## sex                 sex   5.397639
## region           region  23.960885
## income           income 125.399859
## married         married 120.453842
## children       children 239.372065
## car                 car  -8.237569
## save_act       save_act  68.765097
## current_act current_act -13.958201
## mortgage       mortgage  81.616033
round(tr_rf$importance, 3)
##             %IncMSE IncNodePurity
## age           0.023         8.579
## sex           0.001         3.438
## region        0.007         8.287
## income        0.063        13.219
## married       0.048         7.649
## children      0.161        25.854
## car          -0.002         3.187
## save_act      0.022         5.448
## current_act  -0.003         2.864
## mortgage      0.029         7.318
varImpPlot(tr_rf)

p <- ggplot(featureImportance, aes(x=reorder(Feature, Importance), y=Importance)) +
     geom_bar(stat="identity", fill="#53cfff") +
     coord_flip() +
     theme_light(base_size=20) +
     xlab("") +            # feature names (vertical axis after coord_flip)
     ylab("Importance") +  # importance values (horizontal axis)
     ggtitle("Random Forest Feature Importance\n") +
     theme(plot.title=element_text(size=18))

p