A bank has developed a new Personal Equity Plan (PEP) and wants to open as many accounts as possible among its existing customers. In other words, the goal is to grow sales and operating profit by predicting which customers will buy the financial product (PEP: Personal Equity Plan) and winning new accounts accordingly. Rather than marketing at random, recommending the product to the people most likely to buy it should make the marketing (sales) effort more efficient. Here we build a system that uses machine learning to predict customers' purchase intent and recommend accordingly.
- What is Marketing?
“The process by which companies create value for customers and build strong customer relationships in order to capture value from customers in return.” - Kotler and Armstrong (2010)
-> (In my view) marketing is a process of value exchange: designing how the value created by the company and the value created by the customer get exchanged. Put even more simply, it is 'eliciting customer action.' That short summary hides countless steps and a great deal of effort. In this exchange the company is the actor throughout; marketing must be done from the customer's point of view, but it is the company that takes the lead and does the work.
- The 4P’s:
Marketing campaigns focus on customer needs and overall satisfaction. Even so, several variables determine whether a campaign succeeds, and a few of them deserve attention when planning one.
Segment of the Population - To which part of the population is the campaign directed, and why? This matters because it tells us which segment is most likely to respond to the message.
Distribution channel to reach the customer's place - To get the most out of a campaign, the most effective strategy has to be implemented. Which segment of the population are we addressing, and which tools should carry the company's message? (e.g. phone, radio, TV, social media)
Price - What is the best price we can offer potential customers? (For a bank this is less of a concern: its main interest is getting potential clients to open deposit accounts so that it can keep funding its operating activities.)
Promotional Strategy - How the strategy is implemented and how potential customers respond. This should be the last part of the analysis, because it requires (where possible) a thorough review of previous campaigns, so that we learn from past mistakes and make the next campaign far more effective.
1) Data Analysis
- Analyze customer characteristics using the bank's customer data
- Develop customer profiles
- Improve the effectiveness of direct-mail advertising
- Raise response rates through targeted mailing
2) Data Analytics
- Predict whether a customer will sign up for the PEP (Yes / No) from customer attributes such as age, sex, region, marital status, and number of children, plus marketing history
- Predict whether a given customer will sign up, based on data from customer groups with similar histories
1) Draw a sample list of customers from the existing customer DB for a test mailing
2) Send a mail offer for the new financial product (PEP)
3) Record the customers' responses
4) Analyze the campaign results with R (a minimal sketch of this pipeline follows below)
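A minimal R sketch of steps 1)-4), assuming the same input files (./input/pepTrainSet.csv, ./input/pepNewCustomers.csv) that are loaded later in this document; the C5.0 tree here is only a stand-in for whichever classifier is finally chosen.
library(C50)
# 1)-3) the recorded responses from the test mailing, plus the new-customer list
train <- read.csv("./input/pepTrainSet.csv", stringsAsFactors = TRUE)
newd  <- read.csv("./input/pepNewCustomers.csv", stringsAsFactors = TRUE)
# 4) fit on the recorded responses, then score the new customers
fit   <- C5.0(pep ~ . - id, data = train)
p_yes <- predict(fit, newd, type = "prob")[, "YES"]
head(newd[order(p_yes, decreasing = TRUE), "id"]) # mail the best prospects first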
- Data Files:
- Data Fields:
- Import Libraries :
suppressPackageStartupMessages({
library(data.table)
library(dplyr)
library(tidyr)
library(lubridate)
library(DT)
library(ggplot2)
library(corrplot)
library(ggthemes)
library(sqldf)
library(readr)
library(caret)
library(ROCR)
library(C50)
library(e1071)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(rpart)
library(randomForest)
})
# rm(list=ls())
fillColor = "#FFA07A"
fillColor2 = "#F1C40F"
- Load Data:
# In case a file sent from Windows gets garbled on macOS
# library(readr)
read.any <- function(text, sep = "", ...) {
  # guess the file encoding (readr::guess_encoding) and infer the separator
  # from the file extension; anything else falls back to the supplied sep
  encoding <- as.character(guess_encoding(text)[1,1])
  setting <- as.character(tools::file_ext(text))
  if(sep != "" | !(setting %in% c("csv", "txt")) ) setting <- "custom"
  separate <- list(csv = ",", txt = "\n", custom = sep)
  result <- read.table(text, sep = separate[[setting]], fileEncoding = encoding, ...)
  return(result)
}
train <- read.any('./input/pepTrainSet.csv', header=T, stringsAsFactors = F)
train <- subset(train, select=-c(id))
test <- read.any('./input/pepTestSet.csv', header=T, stringsAsFactors = F)
newd <- read.any('./input/pepNewCustomers.csv', header=T, stringsAsFactors = F)
train$pep <- factor(train$pep)
test$pep <- factor(test$pep)
datatable(train, style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
str(train)
## 'data.frame': 300 obs. of 11 variables:
## $ age : int 48 40 51 23 57 57 22 58 37 54 ...
## $ sex : chr "FEMALE" "MALE" "FEMALE" "FEMALE" ...
## $ region : chr "INNER_CITY" "TOWN" "INNER_CITY" "TOWN" ...
## $ income : num 17546 30085 16575 20375 50576 ...
## $ married : chr "NO" "YES" "YES" "YES" ...
## $ children : int 1 3 0 3 0 2 0 0 2 2 ...
## $ car : chr "NO" "YES" "YES" "NO" ...
## $ save_act : chr "NO" "NO" "YES" "NO" ...
## $ current_act: chr "NO" "YES" "YES" "YES" ...
## $ mortgage : chr "NO" "YES" "NO" "NO" ...
## $ pep : Factor w/ 2 levels "NO","YES": 2 1 1 1 1 2 2 1 1 1 ...
head(train)
## age sex region income married children car save_act current_act
## 1 48 FEMALE INNER_CITY 17546.0 NO 1 NO NO NO
## 2 40 MALE TOWN 30085.1 YES 3 YES NO YES
## 3 51 FEMALE INNER_CITY 16575.4 YES 0 YES YES YES
## 4 23 FEMALE TOWN 20375.4 YES 3 NO NO YES
## 5 57 FEMALE RURAL 50576.3 YES 0 NO YES NO
## 6 57 FEMALE TOWN 37869.6 YES 2 NO YES YES
## mortgage pep
## 1 NO YES
## 2 YES NO
## 3 NO NO
## 4 NO NO
## 5 NO NO
## 6 NO YES
str(test)
## 'data.frame': 300 obs. of 12 variables:
## $ id : chr "ID12401" "ID12402" "ID12403" "ID12404" ...
## $ age : int 19 37 45 49 67 35 63 38 48 28 ...
## $ sex : chr "FEMALE" "FEMALE" "FEMALE" "MALE" ...
## $ region : chr "INNER_CITY" "TOWN" "TOWN" "RURAL" ...
## $ income : num 8162 15350 29231 41462 57398 ...
## $ married : chr "YES" "YES" "YES" "YES" ...
## $ children : int 1 0 0 3 3 0 2 0 2 1 ...
## $ car : chr "YES" "NO" "NO" "NO" ...
## $ save_act : chr "YES" "YES" "YES" "YES" ...
## $ current_act: chr "YES" "NO" "NO" "YES" ...
## $ mortgage : chr "YES" "NO" "NO" "YES" ...
## $ pep : Factor w/ 2 levels "NO","YES": 1 1 1 1 2 1 2 2 1 1 ...
head(test)
## id age sex region income married children car save_act
## 1 ID12401 19 FEMALE INNER_CITY 8162.42 YES 1 YES YES
## 2 ID12402 37 FEMALE TOWN 15349.60 YES 0 NO YES
## 3 ID12403 45 FEMALE TOWN 29231.40 YES 0 NO YES
## 4 ID12404 49 MALE RURAL 41462.30 YES 3 NO YES
## 5 ID12405 67 FEMALE RURAL 57398.10 NO 3 NO YES
## 6 ID12406 35 FEMALE RURAL 11520.80 YES 0 NO NO
## current_act mortgage pep
## 1 YES YES NO
## 2 NO NO NO
## 3 NO NO NO
## 4 YES YES NO
## 5 YES NO YES
## 6 YES NO NO
str(newd)
## 'data.frame': 200 obs. of 11 variables:
## $ id : chr "ID12701" "ID12702" "ID12703" "ID12704" ...
## $ age : int 23 30 45 50 41 20 46 50 42 57 ...
## $ sex : chr "MALE" "MALE" "FEMALE" "MALE" ...
## $ region : chr "INNER_CITY" "RURAL" "RURAL" "TOWN" ...
## $ income : num 18767 9916 21882 46794 20721 ...
## $ married : chr "YES" "NO" "NO" "YES" ...
## $ children : int 0 1 0 2 0 1 0 1 3 1 ...
## $ car : chr "YES" "NO" "YES" "NO" ...
## $ save_act : chr "YES" "YES" "YES" "YES" ...
## $ current_act: chr "NO" "NO" "YES" "NO" ...
## $ mortgage : chr "YES" "YES" "NO" "YES" ...
head(newd)
## id age sex region income married children car save_act
## 1 ID12701 23 MALE INNER_CITY 18766.90 YES 0 YES YES
## 2 ID12702 30 MALE RURAL 9915.67 NO 1 NO YES
## 3 ID12703 45 FEMALE RURAL 21881.60 NO 0 YES YES
## 4 ID12704 50 MALE TOWN 46794.40 YES 2 NO YES
## 5 ID12705 41 FEMALE INNER_CITY 20721.10 YES 0 YES YES
## 6 ID12706 20 MALE INNER_CITY 16688.50 NO 1 NO YES
## current_act mortgage
## 1 NO YES
## 2 NO YES
## 3 YES NO
## 4 NO YES
## 5 YES NO
## 6 YES YES
prop.table(table(train$pep))
##
## NO YES
## 0.55 0.45
prop.table(table(test$pep))
##
## NO YES
## 0.5366667 0.4633333
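These proportions also give the naive baseline any model must beat: always predicting NO is right about 55% of the time on the training set and about 54% on the test set. A one-liner to make that explicit (not in the original):
# majority-class baseline accuracy on the test set
max(prop.table(table(test$pep)))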
- Training a model on the data:
m_logis <- glm(pep ~ ., data=train, family=binomial(link='logit'))
summary(m_logis)
##
## Call:
## glm(formula = pep ~ ., family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1618 -0.9717 -0.6610 1.0941 2.0354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.568e+00 5.666e-01 -2.767 0.00565 **
## age 1.593e-02 1.299e-02 1.227 0.21998
## sexMALE 6.887e-01 2.583e-01 2.666 0.00768 **
## regionRURAL 2.696e-01 3.702e-01 0.728 0.46641
## regionSUBURBAN -4.920e-02 4.905e-01 -0.100 0.92011
## regionTOWN -1.790e-01 2.960e-01 -0.605 0.54541
## income 3.863e-05 1.518e-05 2.545 0.01093 *
## marriedYES -6.382e-01 2.679e-01 -2.382 0.01720 *
## children -1.800e-01 1.195e-01 -1.506 0.13200
## carYES -2.357e-01 2.535e-01 -0.930 0.35241
## save_actYES -3.225e-01 2.786e-01 -1.158 0.24696
## current_actYES 3.680e-01 3.159e-01 1.165 0.24401
## mortgageYES -1.942e-01 2.696e-01 -0.720 0.47128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 412.88 on 299 degrees of freedom
## Residual deviance: 370.64 on 287 degrees of freedom
## AIC: 396.64
##
## Number of Fisher Scoring iterations: 4
- Coefficients (odds ratios): exp() of the log-odds coefficients gives odds ratios; for example, sexMALE ≈ 1.99 means men have roughly twice the odds of buying a PEP, other things being equal.
exp(coef(m_logis))
## (Intercept) age sexMALE regionRURAL regionSUBURBAN
## 0.2084668 1.0160584 1.9910270 1.3095056 0.9519925
## regionTOWN income marriedYES children carYES
## 0.8361374 1.0000386 0.5282468 0.8352639 0.7899789
## save_actYES current_actYES mortgageYES
## 0.7243322 1.4448518 0.8234982
- Train set prediction:
# train set prediction
pred_tr_lm <- predict(m_logis, train, type="response", na.action = na.pass)
# train set accuracy
t <- table(train$pep, pred_tr_lm > 0.5)
acc <- sum(diag(t)) / sum(t)
acc
## [1] 0.6566667
pred_tr_lm <- ifelse(pred_tr_lm > 0.5, 1, 0) # 1 = YES, 0 = NO
table(train$pep)
##
## NO YES
## 165 135
table(pred_tr_lm)
## pred_tr_lm
## 0 1
## 186 114
#confusionMatrix(table(train$pep, pred_tr_lm))
- Test set prediction:
# test set prediction
pred_te <- predict(m_logis, test, type="response", na.action = na.pass)
t <- table(test$pep, pred_te > 0.5)
# test set accuracy
acc <- sum(diag(t)) / sum(t)
acc
## [1] 0.5866667
pred_te_lm <- predict(m_logis, test, type="response", na.action = na.pass)
pred_te_lm <- ifelse(pred_te_lm > 0.5, 1, 0) # 1 = YES, 0 = NO
table(test$pep)
##
## NO YES
## 161 139
table(pred_te_lm)
## pred_te_lm
## 0 1
## 179 121
#confusionMatrix(table(test$pep, pred_te_lm))
- ROC Curve:
# build the ROCR prediction object from the test-set probabilities
predict_1 <- prediction(pred_te, test$pep)
# creating ROC curve
roc1 <- performance(predict_1,"tpr","fpr")
plot(roc1)
title("ROC Curve")
- Cross Validation:
ctrl1 <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)
mod_fit1 <- train(pep ~., data=train, method="glm", family="binomial",
trControl = ctrl1, tuneLength = 5)
summary(mod_fit1)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1618 -0.9717 -0.6610 1.0941 2.0354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.568e+00 5.666e-01 -2.767 0.00565 **
## age 1.593e-02 1.299e-02 1.227 0.21998
## sexMALE 6.887e-01 2.583e-01 2.666 0.00768 **
## regionRURAL 2.696e-01 3.702e-01 0.728 0.46641
## regionSUBURBAN -4.920e-02 4.905e-01 -0.100 0.92011
## regionTOWN -1.790e-01 2.960e-01 -0.605 0.54541
## income 3.863e-05 1.518e-05 2.545 0.01093 *
## marriedYES -6.382e-01 2.679e-01 -2.382 0.01720 *
## children -1.800e-01 1.195e-01 -1.506 0.13200
## carYES -2.357e-01 2.535e-01 -0.930 0.35241
## save_actYES -3.225e-01 2.786e-01 -1.158 0.24696
## current_actYES 3.680e-01 3.159e-01 1.165 0.24401
## mortgageYES -1.942e-01 2.696e-01 -0.720 0.47128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 412.88 on 299 degrees of freedom
## Residual deviance: 370.64 on 287 degrees of freedom
## AIC: 396.64
##
## Number of Fisher Scoring iterations: 4
# predict() on a caret classification model returns class labels ("NO"/"YES"),
# not probabilities, so comparing the result with 0.5 yields NA (the original
# run printed a vector of 300 NAs here). Request probabilities explicitly:
pred_cross1 <- predict(mod_fit1, newdata = test, type = "prob")[, "YES"]
pred_cross1 <- ifelse(pred_cross1 > 0.5, 1, 0)
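With probabilities in hand, the cross-validated model can be scored on the test file the same way as before (output not reproduced here):
# test accuracy of the 10-fold cross-validated model
t_cv <- table(test$pep, pred_cross1)
sum(diag(t_cv)) / sum(t_cv)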
- Variable Importance:
varImp(m_logis, scale=FALSE)
## Overall
## age 1.2265733
## sexMALE 2.6656080
## regionRURAL 0.7283283
## regionSUBURBAN 0.1002997
## regionTOWN 0.6046463
## income 2.5450827
## marriedYES 2.3824252
## children 1.5062590
## carYES 0.9299269
## save_actYES 1.1577565
## current_actYES 1.1650154
## mortgageYES 0.7204052
- Stepwise Logistic Regression:
full_m <- glm(pep~., data = train, family = "binomial")
null_m <- glm(pep ~ 1, data = train, family = "binomial")
m_logis_step <- step(null_m, direction = "both", trace = F, scope = list(lower = null_m, upper = full_m))
m_logis_step
##
## Call: glm(formula = pep ~ income + sex + married + children, family = "binomial",
## data = train)
##
## Coefficients:
## (Intercept) income sexMALE marriedYES children
## -1.221e+00 4.792e-05 5.959e-01 -6.228e-01 -1.781e-01
##
## Degrees of Freedom: 299 Total (i.e. Null); 295 Residual
## Null Deviance: 412.9
## Residual Deviance: 377.4 AIC: 387.4
summary(m_logis_step)
##
## Call:
## glm(formula = pep ~ income + sex + married + children, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1215 -1.0141 -0.7009 1.1531 1.9804
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.221e+00 3.785e-01 -3.225 0.00126 **
## income 4.792e-05 1.006e-05 4.762 1.91e-06 ***
## sexMALE 5.959e-01 2.499e-01 2.384 0.01711 *
## marriedYES -6.228e-01 2.633e-01 -2.365 0.01802 *
## children -1.781e-01 1.166e-01 -1.527 0.12676
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 412.88 on 299 degrees of freedom
## Residual deviance: 377.44 on 295 degrees of freedom
## AIC: 387.44
##
## Number of Fisher Scoring iterations: 4
- Training set class prediction:
pred_tr <- predict(m_logis_step, type = "response")
- Training set classification performance:
t <- table(train$pep, pred_tr > 0.5) # confusion matrix at threshold = 0.5
t
##
## FALSE TRUE
## NO 126 39
## YES 73 62
acc_tr <- sum(diag(t)) / sum(t) # training accuracy
acc_tr
## [1] 0.6266667
- Test set classification performance:
pred_te <- predict(m_logis_step, test, type = "response") # class prediction
t_te <- table(test$pep, pred_te >0.5) # confusion matrix
t_te
##
## FALSE TRUE
## NO 112 49
## YES 69 70
acc_te <- sum(diag(t_te)) / sum(t_te) # test accuracy
acc_te
## [1] 0.6066667
- Searching for the threshold with the highest accuracy:
acc_th <- NULL
threshold <- seq(0.1, 0.9, by = 0.1) # scan thresholds from 0.1 to 0.9 in steps of 0.1
pred_te <- predict(m_logis_step, test, type = "response") # predicted probabilities (loop-invariant)
for(i in threshold){
  t_te <- table(test$pep, pred_te > i) # confusion matrix at threshold i
  acc_te <- sum(diag(t_te)) / sum(t_te) # test accuracy
  acc_th <- c(acc_th, acc_te)
}
# plotting
plot(threshold, acc_th, type = "l", xlab = "Threshold", ylab = "Test accuracy")
abline(v = threshold[which.max(acc_th)], col = "red", lty = 2)
points(threshold[which.max(acc_th)], max(acc_th), pch = 19, col = "red")
text(threshold[which.max(acc_th)], max(acc_th), pos =4, labels = round(max(acc_th), 2), col = "red", cex = 0.8)
- Training a model on the data:
# candidate model for the later ROC comparison: logistic regression (same specification as m_logis above)
lm_model <- glm(pep ~ ., data=train, family = binomial)
summary(lm_model)
##
## Call:
## glm(formula = pep ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1618 -0.9717 -0.6610 1.0941 2.0354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.568e+00 5.666e-01 -2.767 0.00565 **
## age 1.593e-02 1.299e-02 1.227 0.21998
## sexMALE 6.887e-01 2.583e-01 2.666 0.00768 **
## regionRURAL 2.696e-01 3.702e-01 0.728 0.46641
## regionSUBURBAN -4.920e-02 4.905e-01 -0.100 0.92011
## regionTOWN -1.790e-01 2.960e-01 -0.605 0.54541
## income 3.863e-05 1.518e-05 2.545 0.01093 *
## marriedYES -6.382e-01 2.679e-01 -2.382 0.01720 *
## children -1.800e-01 1.195e-01 -1.506 0.13200
## carYES -2.357e-01 2.535e-01 -0.930 0.35241
## save_actYES -3.225e-01 2.786e-01 -1.158 0.24696
## current_actYES 3.680e-01 3.159e-01 1.165 0.24401
## mortgageYES -1.942e-01 2.696e-01 -0.720 0.47128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 412.88 on 299 degrees of freedom
## Residual deviance: 370.64 on 287 degrees of freedom
## AIC: 396.64
##
## Number of Fisher Scoring iterations: 4
test$lm_pred <- ifelse(predict(lm_model, test, type="response") > 0.5, "YES","NO")
test$lm_pred = as.factor(test$lm_pred)
test$lm_pred_prob <- predict(lm_model, test, type="response")
- Confusion Matrix : Test set
confusionMatrix(test$lm_pred, test$pep)
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 108 71
## YES 53 68
##
## Accuracy : 0.5867
## 95% CI : (0.5286, 0.643)
## No Information Rate : 0.5367
## P-Value [Acc > NIR] : 0.04626
##
## Kappa : 0.1614
## Mcnemar's Test P-Value : 0.12685
##
## Sensitivity : 0.6708
## Specificity : 0.4892
## Pos Pred Value : 0.6034
## Neg Pred Value : 0.5620
## Prevalence : 0.5367
## Detection Rate : 0.3600
## Detection Prevalence : 0.5967
## Balanced Accuracy : 0.5800
##
## 'Positive' Class : NO
##
- Reshape Data:
# convert the categorical columns to factors in all three data sets
cat_vars <- c("sex", "region", "married", "children", "car",
              "save_act", "current_act", "mortgage")
train[cat_vars] <- lapply(train[cat_vars], as.factor)
test[cat_vars] <- lapply(test[cat_vars], as.factor)
newd[cat_vars] <- lapply(newd[cat_vars], as.factor)
- Training a model on the data:
set.seed(1004)
## Training
dt <- rpart(as.factor(pep)~., data = train, cp = 0.1^20) # use all variables; a tiny cp grows a (near) full tree
xerror_min_which <- which.min(dt$cptable[, "xerror"])
xerror_min <- min(dt$cptable[, "xerror"])
- Printing the cptable:
printcp(dt) # print the complexity-parameter table
##
## Classification tree:
## rpart(formula = as.factor(pep) ~ ., data = train, cp = 0.1^20)
##
## Variables actually used in tree construction:
## [1] children income married mortgage
##
## Root node error: 135/300 = 0.45
##
## n= 300
##
## CP nsplit rel error xerror xstd
## 1 3.2593e-01 0 1.00000 1.00000 0.063828
## 2 8.8889e-02 1 0.67407 0.67407 0.058979
## 3 5.9259e-02 3 0.49630 0.55556 0.055556
## 4 3.3333e-02 4 0.43704 0.47407 0.052560
## 5 2.9630e-02 6 0.37037 0.48889 0.053148
## 6 1.0000e-20 8 0.31111 0.42222 0.050332
- Plotting the cp table:
plotcp(dt) # cp plot
abline(v = xerror_min_which, lty = 2, col = "red")
text(xerror_min_which, xerror_min, labels = round(xerror_min, 2), pos = 3, col = "red")
- Pruning:
# pruning
dt_prune <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
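The cp chosen above simply minimizes the cross-validated error. A common, slightly more conservative alternative is the 1-SE rule; a sketch (not what is used in the rest of this document):
# 1-SE rule: simplest tree whose xerror is within one std. error of the minimum
cp_tab <- dt$cptable
limit_1se <- cp_tab[which.min(cp_tab[, "xerror"]), "xerror"] +
             cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= limit_1se)[1], "CP"]
dt_prune_1se <- prune(dt, cp = cp_1se)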
- Confusion Matrix : Training set
# training accuracy
pred_tr_dt <- predict(dt_prune, type = "class") # predict class labels (categorical)
confusionMatrix(train$pep, pred_tr_dt, positive = "YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 153 12
## YES 30 105
##
## Accuracy : 0.86
## 95% CI : (0.8155, 0.8972)
## No Information Rate : 0.61
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7137
## Mcnemar's Test P-Value : 0.008712
##
## Sensitivity : 0.8974
## Specificity : 0.8361
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.9273
## Prevalence : 0.3900
## Detection Rate : 0.3500
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8668
##
## 'Positive' Class : YES
##
- Confusion Matrix : Test set
# test accuracy
pred_te_dt <- predict(dt_prune, test, type = "class")
confusionMatrix(test$pep, pred_te_dt, positive = "YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 144 17
## YES 40 99
##
## Accuracy : 0.81
## 95% CI : (0.761, 0.8528)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : 1.598e-13
##
## Kappa : 0.6136
## Mcnemar's Test P-Value : 0.003569
##
## Sensitivity : 0.8534
## Specificity : 0.7826
## Pos Pred Value : 0.7122
## Neg Pred Value : 0.8944
## Prevalence : 0.3867
## Detection Rate : 0.3300
## Detection Prevalence : 0.4633
## Balanced Accuracy : 0.8180
##
## 'Positive' Class : YES
##
- Plotting: Basic
# plotting
plot(dt_prune, margin = 0.1)
text(dt_prune, use.n = T)
- Feature Importance:
dt_prune$variable.importance
## children income married age mortgage save_act
## 44.4865854 23.6406604 13.8333517 9.8737562 4.2705882 0.8171769
## region sex
## 0.5794552 0.4270588
barplot(dt_prune$variable.importance, ylim = c(0, 50))
- Plotting : Fancy Tree
fancyRpartPlot(dt_prune, cex = 1, palettes = c("Greys","RdPu")) #fancy tree
- Training a model on the data:
set.seed(1004)
c5_model <- C5.0(pep ~ ., data=train)
c5_model
##
## Call:
## C5.0.formula(formula = pep ~ ., data = train)
##
## Classification Tree
## Number of samples: 300
## Number of predictors: 10
##
## Tree size: 17
##
## Non-standard options: attempt to group attributes
summary(c5_model)
##
## Call:
## C5.0.formula(formula = pep ~ ., data = train)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Nov 19 16:38:35 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 300 cases (11 attributes) from undefined.data
##
## Decision tree:
##
## children = 1:
## :...income <= 12640.3: NO (5)
## : income > 12640.3:
## : :...current_act = YES: YES (47/1)
## : current_act = NO:
## : :...income <= 17390.1: NO (3)
## : income > 17390.1: YES (7)
## children in {0,2,3}:
## :...income <= 30085.1:
## :...children in {2,3}: NO (70/4)
## : children = 0:
## : :...married = NO:
## : :...mortgage = NO: YES (17/2)
## : : mortgage = YES:
## : : :...save_act = NO: YES (3)
## : : save_act = YES: NO (7)
## : married = YES:
## : :...mortgage = NO: NO (43/6)
## : mortgage = YES:
## : :...save_act = NO: YES (8/1)
## : save_act = YES: NO (6)
## income > 30085.1:
## :...children = 2: YES (28/2)
## children in {0,3}:
## :...married = NO:
## :...mortgage = NO: YES (15/1)
## : mortgage = YES: NO (5)
## married = YES:
## :...children = 0: NO (27/3)
## children = 3:
## :...income <= 43530: NO (6/1)
## income > 43530: YES (3)
##
##
## Evaluation on training data (300 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 17 21( 7.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 158 7 (a): class NO
## 14 121 (b): class YES
##
##
## Attribute usage:
##
## 100.00% income
## 100.00% children
## 46.67% married
## 34.67% mortgage
## 19.00% current_act
## 8.00% save_act
##
##
## Time: 0.0 secs
- Plotting :
plot(c5_model, main="Decision Tree Plot (C5.0)")
plot(c5_model, type="simple", main="Decision Tree Plot (C5.0)")
- Confusion Matrix : Training set
## Evaluating model performance
# create a factor vector of predictions on test data
pred_tr <- predict(c5_model, train)
# cross tabulation of predicted versus actual classes
confusionMatrix(train$pep, pred_tr, positive = "YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 158 7
## YES 14 121
##
## Accuracy : 0.93
## 95% CI : (0.895, 0.9561)
## No Information Rate : 0.5733
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8579
## Mcnemar's Test P-Value : 0.1904
##
## Sensitivity : 0.9453
## Specificity : 0.9186
## Pos Pred Value : 0.8963
## Neg Pred Value : 0.9576
## Prevalence : 0.4267
## Detection Rate : 0.4033
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.9320
##
## 'Positive' Class : YES
##
- Confusion Matrix : Test set
pred_te <- predict(c5_model, test)
confusionMatrix(test$pep, pred_te, positive = "YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 145 16
## YES 23 116
##
## Accuracy : 0.87
## 95% CI : (0.8266, 0.9059)
## No Information Rate : 0.56
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7377
## Mcnemar's Test P-Value : 0.3367
##
## Sensitivity : 0.8788
## Specificity : 0.8631
## Pos Pred Value : 0.8345
## Neg Pred Value : 0.9006
## Prevalence : 0.4400
## Detection Rate : 0.3867
## Detection Prevalence : 0.4633
## Balanced Accuracy : 0.8709
##
## 'Positive' Class : YES
##
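C5.0 also supports adaptive boosting through its trials argument; a quick variant to try (an illustration only, results not shown):
# boosted C5.0 with 10 boosting iterations; trials = 10 is an arbitrary choice
set.seed(1004)
c5_boost <- C5.0(pep ~ ., data = train, trials = 10)
confusionMatrix(test$pep, predict(c5_boost, test), positive = "YES")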
test$c5_pred_prob <- predict(c5_model, test, type="prob")
#confusionMatrix(test$c5_pred, test$pep, positive = "YES")
- ROC Curve:
# Model evaluation by ROC chart
c5_pred <- prediction(test$c5_pred_prob[, "YES"], test$pep)
c5_model.perf <- performance(c5_pred, "tpr", "fpr") # True positive rate, False positive rate
lm_pred <- prediction(test$lm_pred_prob, test$pep)
lm_model.perf <- performance(lm_pred, "tpr", "fpr")
plot(c5_model.perf, col = "red")
plot(lm_model.perf, col = "blue", add=T)
legend(0.7, 0.7, c("C5 ","LM "), cex = 0.9, col = c("red", "blue"), lty = 1)
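The ROCR objects behind this plot also give each model's AUC, which makes the comparison numeric (values not reproduced here):
# AUC for the C5.0 tree and the logistic model
performance(c5_pred, "auc")@y.values[[1]]
performance(lm_pred, "auc")@y.values[[1]]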
newd$c5_pred <- predict(c5_model, newd, type="class")
newd$c5_pred_prob <- predict(c5_model, newd, type="prob")
# target list: new customers predicted YES with P(YES) > 0.8,
# sorted by probability and saved for the direct-mail campaign
target <- subset(newd, c5_pred=="YES" & c5_pred_prob[, "YES"] > 0.8)
write.csv(target[order(target$c5_pred_prob[, "YES"], decreasing=T), ],
          "dm_target.csv", row.names=FALSE)
- Setting:
train <- read.any('./input/pepTrainSet.csv', header=T, stringsAsFactors = F)
#train <- subset(train, select=-c(id))
test <- read.any('./input/pepTestSet.csv', header=T, stringsAsFactors = F)
newd <- read.any('./input/pepNewCustomers.csv', header=T, stringsAsFactors = F)
train$pep <- factor(train$pep)
test$pep <- factor(test$pep)
- Merge:
full <- rbind(train, test)
#full <- merge(x = train, y = test, by = "id", all.x = TRUE)
summary(full)
## id age sex region
## Length:600 Min. :18.00 Length:600 Length:600
## Class :character 1st Qu.:30.00 Class :character Class :character
## Mode :character Median :42.00 Mode :character Mode :character
## Mean :42.40
## 3rd Qu.:55.25
## Max. :67.00
## income married children car
## Min. : 5014 Length:600 Min. :0.000 Length:600
## 1st Qu.:17264 Class :character 1st Qu.:0.000 Class :character
## Median :24925 Mode :character Median :1.000 Mode :character
## Mean :27524 Mean :1.012
## 3rd Qu.:36173 3rd Qu.:2.000
## Max. :63130 Max. :3.000
## save_act current_act mortgage pep
## Length:600 Length:600 Length:600 NO :326
## Class :character Class :character Class :character YES:274
## Mode :character Mode :character Mode :character
##
##
##
str(full)
## 'data.frame': 600 obs. of 12 variables:
## $ id : chr "ID12101" "ID12102" "ID12103" "ID12104" ...
## $ age : int 48 40 51 23 57 57 22 58 37 54 ...
## $ sex : chr "FEMALE" "MALE" "FEMALE" "FEMALE" ...
## $ region : chr "INNER_CITY" "TOWN" "INNER_CITY" "TOWN" ...
## $ income : num 17546 30085 16575 20375 50576 ...
## $ married : chr "NO" "YES" "YES" "YES" ...
## $ children : int 1 3 0 3 0 2 0 0 2 2 ...
## $ car : chr "NO" "YES" "YES" "NO" ...
## $ save_act : chr "NO" "NO" "YES" "NO" ...
## $ current_act: chr "NO" "YES" "YES" "YES" ...
## $ mortgage : chr "NO" "YES" "NO" "NO" ...
## $ pep : Factor w/ 2 levels "NO","YES": 2 1 1 1 1 2 2 1 1 1 ...
head(full)
## id age sex region income married children car save_act
## 1 ID12101 48 FEMALE INNER_CITY 17546.0 NO 1 NO NO
## 2 ID12102 40 MALE TOWN 30085.1 YES 3 YES NO
## 3 ID12103 51 FEMALE INNER_CITY 16575.4 YES 0 YES YES
## 4 ID12104 23 FEMALE TOWN 20375.4 YES 3 NO NO
## 5 ID12105 57 FEMALE RURAL 50576.3 YES 0 NO YES
## 6 ID12106 57 FEMALE TOWN 37869.6 YES 2 NO YES
## current_act mortgage pep
## 1 NO NO YES
## 2 YES YES NO
## 3 YES NO NO
## 4 YES NO NO
## 5 NO NO NO
## 6 YES NO YES
- Married:
prop.table(table(full$married))
##
## NO YES
## 0.34 0.66
full$married <- ifelse(full$married == 'NO', 0, 1)
prop.table(table(full$married))
##
## 0 1
## 0.34 0.66
- Mortgage:
table(full$mortgage)
##
## NO YES
## 391 209
full$mortgage<- ifelse(full$mortgage == 'NO', 0, 1)
table(full$mortgage)
##
## 0 1
## 391 209
- Sex:
prop.table(table(full$sex))
##
## FEMALE MALE
## 0.5 0.5
full$sex <- ifelse(full$sex == 'FEMALE', 0, 1)
prop.table(table(full$sex))
##
## 0 1
## 0.5 0.5
- Region:
prop.table(table(full$region))
##
## INNER_CITY RURAL SUBURBAN TOWN
## 0.4483333 0.1600000 0.1033333 0.2883333
full$region <- ifelse(full$region == 'INNER_CITY', 0,
ifelse(full$region == 'TOWN', 1,
ifelse(full$region == 'RURAL', 2, 3)))
prop.table(table(full$region))
##
## 0 1 2 3
## 0.4483333 0.2883333 0.1600000 0.1033333
- current_act:
prop.table(table(full$current_act))
##
## NO YES
## 0.2416667 0.7583333
full$current_act <- ifelse(full$current_act == 'NO', 0, 1)
prop.table(table(full$current_act))
##
## 0 1
## 0.2416667 0.7583333
- Savings Account:
prop.table(table(full$save_act))
##
## NO YES
## 0.31 0.69
full$save_act <- ifelse(full$save_act == 'NO', 0, 1)
prop.table(table(full$save_act))
##
## 0 1
## 0.31 0.69
- Car:
prop.table(table(full$car))
##
## NO YES
## 0.5066667 0.4933333
full$car <- ifelse(full$car == 'NO', 0, 1)
prop.table(table(full$car))
##
## 0 1
## 0.5066667 0.4933333
- Age:
full$age <- ifelse(full$age <= 29, 0,
ifelse(full$age > 29 & full$age <= 41, 1,
ifelse(full$age > 41 & full$age <= 54, 2, 3)))
prop.table(table(full$age))
##
## 0 1 2 3
## 0.2400000 0.2483333 0.2566667 0.2550000
- Income:
full$income <- ifelse(full$income <= 17099, 0,
ifelse(full$income > 17099 & full$income <= 27250, 1,
ifelse(full$income > 27250 & full$income <= 35350, 2, 3)))
prop.table(table(full$income))
##
## 0 1 2 3
## 0.2400000 0.3266667 0.1750000 0.2583333
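If roughly equal-sized bins are the intent, cut() with quantile() breaks produces them without hand-picked thresholds; a sketch on a fresh copy of the raw data (the income column above has already been recoded, so this re-reads the file):
# quartile binning of raw income with cut(); labels 0-3 mirror the codes above
raw <- read.any('./input/pepTrainSet.csv', header = TRUE, stringsAsFactors = FALSE)
bins <- cut(raw$income,
            breaks = quantile(raw$income, probs = seq(0, 1, 0.25)),
            labels = 0:3, include.lowest = TRUE)
prop.table(table(bins))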
- Children:
table(full$children)
##
## 0 1 2 3
## 263 135 134 68
prop.table(table(full$children))
##
## 0 1 2 3
## 0.4383333 0.2250000 0.2233333 0.1133333
- Split the Data:
prop.table(table(full$pep))
##
## NO YES
## 0.5433333 0.4566667
str(full)
## 'data.frame': 600 obs. of 12 variables:
## $ id : chr "ID12101" "ID12102" "ID12103" "ID12104" ...
## $ age : num 2 1 2 0 3 3 0 3 1 2 ...
## $ sex : num 0 1 0 0 0 0 1 1 0 1 ...
## $ region : num 0 1 0 1 2 1 2 1 3 1 ...
## $ income : num 1 2 0 1 3 3 0 1 1 1 ...
## $ married : num 0 1 1 1 1 1 0 1 1 1 ...
## $ children : int 1 3 0 3 0 2 0 0 2 2 ...
## $ car : num 0 1 1 0 0 0 0 1 1 1 ...
## $ save_act : num 0 0 1 0 1 1 0 1 0 1 ...
## $ current_act: num 0 1 1 1 0 1 1 1 0 1 ...
## $ mortgage : num 0 1 0 0 0 0 0 0 0 0 ...
## $ pep : Factor w/ 2 levels "NO","YES": 2 1 1 1 1 2 2 1 1 1 ...
head(full)
## id age sex region income married children car save_act current_act
## 1 ID12101 2 0 0 1 0 1 0 0 0
## 2 ID12102 1 1 1 2 1 3 1 0 1
## 3 ID12103 2 0 0 0 1 0 1 1 1
## 4 ID12104 0 0 1 1 1 3 0 0 1
## 5 ID12105 3 0 2 3 1 0 0 1 0
## 6 ID12106 3 0 1 3 1 2 0 1 1
## mortgage pep
## 1 0 YES
## 2 1 NO
## 3 0 NO
## 4 0 NO
## 5 0 NO
## 6 0 YES
## 70% of the sample size
smp_size <- floor(0.7 * nrow(full))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(full)), size = smp_size)
train <- full[train_ind, ]
test <- full[-train_ind, ]
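The split above is a simple random sample, so the class mix can drift between the two parts; caret's createDataPartition (caret is already loaded) draws a split stratified on the label. An alternative sketch:
# stratified 70/30 split on pep, as an alternative to sample() above
set.seed(123)
idx <- createDataPartition(full$pep, p = 0.7, list = FALSE)
train_s <- full[idx, ]
test_s <- full[-idx, ]
prop.table(table(train_s$pep))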
- Dummy label:
train$pep <- ifelse(train$pep=="YES", 1, 0) # numeric 0/1 label: randomForest will run in regression mode
test$pep <-ifelse(test$pep=="YES", 1, 0)
tune.rf <- tuneRF(train[, c(-1, -12)], train$pep, mtryStart = 1) # exclude id (col 1) and pep (col 12) from the predictors
## mtry = 1 OOB error = 0.2138995
## Searching left ...
## Searching right ...
## mtry = 2 OOB error = 0.1629827
## 0.2380404 0.05
## mtry = 4 OOB error = 0.1330898
## 0.1834115 0.05
## mtry = 8 OOB error = 0.1275279
## 0.04179099 0.05
tune.rf
## mtry OOBError
## 1 1 0.2138995
## 2 2 0.1629827
## 4 4 0.1330898
## 8 8 0.1275279
mtry = 8 gives the lowest OOB error in this search, but the relative improvement over mtry = 4 (about 4.2%) falls below tuneRF's default improve = 0.05 cutoff, which is why the search stops there; mtry = 4 is used in the models below.
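The best value can also be pulled from the returned matrix rather than read off by eye (a small convenience, not in the original):
# mtry with the lowest OOB error from the tuneRF search
tune.rf[which.min(tune.rf[, "OOBError"]), "mtry"]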
- Train a model:
# Train a model across all the training data and plot the variable importance
tr_rf <- randomForest(train[,c(-1,-12)], train[,12], ntree=5000, mtry=4, importance=TRUE) # y is numeric 0/1, so this fits a regression forest
te_rf <- randomForest(test[,c(-1,-12)], test[,12], ntree=5000, mtry=4, importance=TRUE) # note: a second forest fit on the test set; its OOB predictions are reported as 'test accuracy' below
imp <- importance(tr_rf, type=1)
featureImportance <- data.frame(Feature=row.names(imp), Importance=imp[,1])
tr_rf
##
## Call:
## randomForest(x = train[, c(-1, -12)], y = train[, 12], ntree = 5000, mtry = 4, importance = TRUE)
## Type of random forest: regression
## Number of trees: 5000
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.1297397
## % Var explained: 47.9
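Because pep was recoded to 0/1, randomForest ran in regression mode above (see "Type of random forest: regression"). The more conventional choice for a yes/no label is a classification forest, which returns class labels and votes directly; a sketch under that assumption:
# classification variant: keep the label as a factor, no manual threshold needed
set.seed(1004)
rf_cls <- randomForest(x = train[, c(-1, -12)],
                       y = factor(train$pep, labels = c("NO", "YES")),
                       ntree = 500, mtry = 4, importance = TRUE)
rf_cls # prints the OOB confusion matrix in classification mode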
- Train Accuracy:
pred_tr_rf <- predict(tr_rf, type = "class")
table(train$pep)
##
## 0 1
## 223 197
head(pred_tr_rf) # OOB predicted probabilities for the first few training rows
## 173 473 245 528 561 28
## 0.11156394 0.30457881 0.50206575 0.20039772 0.62063797 0.70300399
pred_tr_rf <- ifelse(pred_tr_rf > 0.5, 1, 0) # 1 = YES, 0 = NO
table(pred_tr_rf)
## pred_tr_rf
## 0 1
## 228 192
confusionMatrix(table(train$pep, pred_tr_rf))
## Confusion Matrix and Statistics
##
## pred_tr_rf
## 0 1
## 0 192 31
## 1 36 161
##
## Accuracy : 0.8405
## 95% CI : (0.8019, 0.8742)
## No Information Rate : 0.5429
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6793
## Mcnemar's Test P-Value : 0.6251
##
## Sensitivity : 0.8421
## Specificity : 0.8385
## Pos Pred Value : 0.8610
## Neg Pred Value : 0.8173
## Prevalence : 0.5429
## Detection Rate : 0.4571
## Detection Prevalence : 0.5310
## Balanced Accuracy : 0.8403
##
## 'Positive' Class : 0
##
- Test Accuracy:
pred_te_rf <- predict(te_rf, type = "class") # OOB predictions of the forest fit on the test set
table(test$pep)
##
## 0 1
## 103 77
head(pred_te_rf) # OOB predicted probabilities for the first few test rows
## 2 4 8 9 11 12
## 0.19798130 0.30279029 0.25659203 0.40435789 0.21952548 0.33754731
pred_te_rf <- ifelse(pred_te_rf > 0.5, 1, 0) # 1 = YES, 0 = NO
table(pred_te_rf)
## pred_te_rf
## 0 1
## 110 70
confusionMatrix(table(test$pep, pred_te_rf))
## Confusion Matrix and Statistics
##
## pred_te_rf
## 0 1
## 0 91 12
## 1 19 58
##
## Accuracy : 0.8278
## 95% CI : (0.7645, 0.8799)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 2.458e-10
##
## Kappa : 0.6441
## Mcnemar's Test P-Value : 0.2812
##
## Sensitivity : 0.8273
## Specificity : 0.8286
## Pos Pred Value : 0.8835
## Neg Pred Value : 0.7532
## Prevalence : 0.6111
## Detection Rate : 0.5056
## Detection Prevalence : 0.5722
## Balanced Accuracy : 0.8279
##
## 'Positive' Class : 0
##
- Hyperparameter Tuning:
#ctrl <- trainControl(method = "cv", number = 10)
#grid_rf <- expand.grid(.mtry = c(2, 4, 8, 16))
#set.seed(1004)
#m_rf <- train(pep ~ ., data = train, method = "rf",
# trControl = ctrl, tuneGrid = grid_rf)
#grid_c50 <- expand.grid(.model = "tree",
# .trials = c(10, 20, 30, 40),
# .winnow = 'FALSE')
#set.seed(1004)
#m_c50 <- train(pep ~ ., data = train, method = "C5.0",
# trControl = ctrl, tuneGrid = grid_c50)
#m_rf
#m_c50
- Feature Importance:
featureImportance
## Feature Importance
## age age 56.317327
## sex sex 5.397639
## region region 23.960885
## income income 125.399859
## married married 120.453842
## children children 239.372065
## car car -8.237569
## save_act save_act 68.765097
## current_act current_act -13.958201
## mortgage mortgage 81.616033
round(tr_rf$importance, 3)
## %IncMSE IncNodePurity
## age 0.023 8.579
## sex 0.001 3.438
## region 0.007 8.287
## income 0.063 13.219
## married 0.048 7.649
## children 0.161 25.854
## car -0.002 3.187
## save_act 0.022 5.448
## current_act -0.003 2.864
## mortgage 0.029 7.318
varImpPlot(tr_rf)
p <- ggplot(featureImportance, aes(x=reorder(Feature, Importance), y=Importance)) +
  geom_bar(stat="identity", fill="#53cfff") +
  coord_flip() +
  theme_light(base_size=20) +
  xlab("") + # after coord_flip this axis shows the feature names
  ylab("Importance") +
  ggtitle("Random Forest Feature Importance\n") +
  theme(plot.title=element_text(size=18))
p