A bank in Portugal wants to predict which prospective customers will buy a product when called by the bank. The purpose of this prediction is to help the telemarketing team identify, based on marketing campaign data, which customers will buy the product (yes or no).
library(dplyr)
library(caret)
library(inspectdf)
library(gridExtra)
library(GGally)
library(rsample)
library(e1071)
library(ROCR)
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
customer <- read.csv("bank-full.csv")
head(customer)
Column description:
age : age of the client
job : type of job
marital : marital status (categorical)
education : education level (categorical)
default : has credit in default
balance : account balance
housing : has a housing loan (categorical)
loan : has a personal loan
contact : contact communication type
day : last contact day of the month
month : last contact month of the year
duration : last contact duration, in seconds (numeric)
campaign : number of contacts performed during this campaign and for this client
pdays : number of days that passed after the client was last contacted in a previous campaign
previous : number of contacts performed before this campaign and for this client
poutcome : outcome of the previous marketing campaign
y : whether the client subscribed to a term deposit (target)
glimpse(customer)
#> Rows: 45,211
#> Columns: 17
#> $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, ~
#> $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "~
#> $ marital <chr> "married", "single", "married", "married", "single", "marrie~
#> $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", ~
#> $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",~
#> $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71~
#> $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"~
#> $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"~
#> $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
#> $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
#> $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may~
#> $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,~
#> $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ~
#> $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
#> $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
cust <- customer %>%
mutate_if(is.character, as.factor)
Inspect missing values
colSums(is.na(cust))
#> age job marital education default balance housing loan
#> 0 0 0 0 0 0 0 0
#> contact day month duration campaign pdays previous poutcome
#> 0 0 0 0 0 0 0 0
#> y
#> 0
In this study we will focus on the numeric predictors, so we will drop the job, marital, education, contact, and poutcome columns.
cust_clean <- cust %>%
select(-c("job","marital","education","contact","poutcome"))
The binary categorical columns (default, housing, loan) are recoded as 0/1 classes and put back into the working data:
transform <- model.matrix(y ~ ., data = cust_clean) %>%
as.data.frame() %>%
select(c("defaultyes","housingyes","loanyes"))
cust_clean$default <- transform$defaultyes
cust_clean$housing <- transform$housingyes
cust_clean$loan <- transform$loanyes
head(cust_clean)
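For reference, the same 0/1 recoding could be done more directly with dplyr alone; this is a small sketch, assuming dplyr >= 1.0 for across(), and cust_alt is just an illustrative name:
# equivalent 0/1 recoding without model.matrix, starting again from cust
cust_alt <- cust %>%
  select(-c("job", "marital", "education", "contact", "poutcome")) %>%
  mutate(across(c(default, housing, loan), ~ as.numeric(. == "yes")))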
ggcorr(cust_clean ,hjust = 1, layout.exp = 2, label = T, label_size = 2.9)
From the correlation plot above, the only pair of predictors with a fairly strong relationship is pdays and previous, so the independence assumption is not badly violated and the Naive Bayes model can still be used.
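As a quick numeric check of that claim, the correlation between those two predictors can be computed directly:
# correlation between the two most related predictors
cor(cust_clean$pdays, cust_clean$previous)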
#Proportion of target variables before training data
prop.table(table(cust_clean$y))
#>
#> no yes
#> 0.8830152 0.1169848
Cross Validation
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = cust_clean, prop = 0.8, strata ="y")
cust_train <- training(index)
cust_test <- testing(index)
Data proportion
#Proportion of target variables after training data
prop.table(table(cust_train$y))
#>
#> no yes
#> 0.882997 0.117003
After the split, the training data keeps roughly the same class proportion as the full data. Because the classes are imbalanced, we will use downSample() to balance them.
set.seed(100)
down_cust <- downSample(x = cust_train %>% select(-y),
y = cust_train$y,
yname = "y")
prop.table(table(down_cust$y))
#>
#> no yes
#> 0.5 0.5
cust_naive <- naiveBayes(x = down_cust %>% select(-y),
y = down_cust$y,
laplace = 1)
We can also inspect one of the predictors to see the proportion of each class:
prop.table(table(down_cust$y, down_cust$month)) %>%
as.data.frame() %>%
rename(Class = Var1, Month = Var2)
From the table above, for the month apr the proportion is about 3% for class no and 5% for class yes.
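The same information can also be read from the fitted model itself; as a quick look, assuming the standard e1071 object structure, the class-conditional distribution of each predictor is stored in the model's tables element:
# class-conditional distribution of month given y, from the fitted Naive Bayes model
cust_naive$tables$month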
# predict on the test data
cust_pred <- predict(cust_naive, newdata = cust_test, type = "class") # for the class prediction
Evaluate the model with a confusion matrix:
confusionMatrix(cust_pred, cust_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 6054 224
#> yes 1930 833
#>
#> Accuracy : 0.7618
#> 95% CI : (0.7528, 0.7705)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3213
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.78808
#> Specificity : 0.75827
#> Pos Pred Value : 0.30148
#> Neg Pred Value : 0.96432
#> Prevalence : 0.11691
#> Detection Rate : 0.09214
#> Detection Prevalence : 0.30561
#> Balanced Accuracy : 0.77317
#>
#> 'Positive' Class : yes
#>
From the evaluation above, the Naive Bayes model achieves 76% accuracy, 30% precision, and 78.8% recall on the positive class yes.
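As a sanity check, these numbers can be recomputed by hand from the confusion matrix counts above:
# counts taken from the confusion matrix above
TP <- 833; FN <- 224; FP <- 1930; TN <- 6054
TP / (TP + FN)                   # recall / sensitivity, about 0.788
TP / (TP + FP)                   # precision / pos pred value, about 0.301
(TP + TN) / (TP + TN + FP + FN)  # accuracy, about 0.762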
ROC
Probability output of the prediction results:
cust_prob <- predict(cust_naive, newdata = cust_test, type = "raw")#for the probability
round(head(cust_prob),4)
#> no yes
#> [1,] 1.0000 0.0000
#> [2,] 0.7801 0.2199
#> [3,] 0.7478 0.2522
#> [4,] 0.7830 0.2170
#> [5,] 0.7225 0.2775
#> [6,] 0.9871 0.0129
ROC object
# prediction object
cust_roc <- prediction(predictions = cust_prob[, 2],              # positive class probabilities
                       labels = as.numeric(cust_test$y == "yes")) # positive class labels
# performance object built from the prediction object
perf <- performance(prediction.obj = cust_roc,
measure = "tpr",
x.measure = "fpr")
#plot
plot(perf, main = "ROC")
abline(0,1, lty =2)
auc <- performance(prediction.obj = cust_roc,
measure = "auc")
auc@y.values
#> [[1]]
#> [1] 0.8302735
The ROC is a probability curve representing the degree or measure of separation: it tells how well the model can distinguish between the classes. The closer the curve gets to the top-left of the plot (high true positive rate, low false positive rate), the better the model. Likewise, the higher the AUC score, the better the model separates the target classes.
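To make the curve concrete, a single (FPR, TPR) point on it can be computed by hand by thresholding the predicted probabilities at 0.5, using the cust_prob and cust_test objects from above:
# TPR and FPR at a probability threshold of 0.5
pred_05 <- cust_prob[, "yes"] >= 0.5
mean(pred_05[cust_test$y == "yes"])  # true positive rate at this threshold
mean(pred_05[cust_test$y == "no"])   # false positive rate at this threshold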
Decision Tree
In fitting the model we will use mincriterion = 0.5, meaning a split's p-value must be below 0.5 (1 - mincriterion) for a node to create branches; minsplit = 500, the minimum number of observations a node must contain before a split is attempted; and minbucket = 1200, the minimum number of observations in a terminal node.
library(partykit)
cust_tree <- ctree(formula = y ~ ., data = down_cust,
control = ctree_control(mincriterion = 0.5,
minsplit = 500,
minbucket = 1200))
plot(cust_tree, type = "simple")
From the decision tree plot above, node 1 is the root node; nodes 2, 5, and 6 are internal nodes; and nodes 3, 4, 7, 8, and 9 are leaf nodes.
predict.DT <- predict(cust_tree, newdata = cust_test)# For the class
predict.prob.DT <- predict(cust_tree, newdata = cust_test, type = "prob")# For the probability
confusionMatrix(predict.DT, cust_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 6692 338
#> yes 1292 719
#>
#> Accuracy : 0.8197
#> 95% CI : (0.8116, 0.8276)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3725
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.68023
#> Specificity : 0.83818
#> Pos Pred Value : 0.35753
#> Neg Pred Value : 0.95192
#> Prevalence : 0.11691
#> Detection Rate : 0.07953
#> Detection Prevalence : 0.22243
#> Balanced Accuracy : 0.75920
#>
#> 'Positive' Class : yes
#>
From the decision tree's confusion matrix, the accuracy increases to about 82%. Next, the confusion matrix results will be evaluated with ROC and AUC to find out how well the model separates the two classes.
# prediction object
dt_roc <- prediction(predictions = predict.prob.DT[, 2],          # positive class probabilities
                     labels = as.numeric(cust_test$y == "yes"))   # positive class labels
# performance object built from the prediction object
dt_perf <- performance(prediction.obj = dt_roc,
measure = "tpr",
x.measure = "fpr")
#plot
plot(dt_perf, main = "ROC")
abline(0,1, lty =2)
dt_auc <- performance(prediction.obj = dt_roc,
measure = "auc")
dt_auc@y.values
#> [[1]]
#> [1] 0.8138376
Random Forest
One weakness of random forest is that training the model takes a long time, so the model is saved to an RDS file with saveRDS() so that it can be used immediately without having to be retrained.
# discard columns with variance at or near zero, since they are not useful as predictors
n0_var <- nearZeroVar(down_cust)
down_cust <- down_cust[, -n0_var]
dim(down_cust)
#> [1] 8464 10
#model building
# set.seed(2018)
# ctrl <- trainControl(method="repeatedcv", number=4, repeats=3) # k-fold cross validation
# forest <- train(y ~ ., data=down_cust, method="rf", trControl = ctrl)
# saveRDS(forest, "model.random.forest.rds")
model_forest <- readRDS("model.random.forest.rds")
model_forest
#> Random Forest
#>
#> 8464 samples
#> 9 predictor
#> 2 classes: 'no', 'yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold, repeated 3 times)
#> Summary of sample sizes: 6348, 6348, 6348, 6348, 6348, 6348, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8390044 0.6780088
#> 10 0.8474716 0.6949433
#> 19 0.8438091 0.6876181
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 10.
From the model summary, the optimal number of variables considered for splitting at each tree node (mtry) is 10, since it gives the highest accuracy. We can also inspect the importance of each variable used in our random forest with varImp().
Var_imp <- varImp(model_forest)
plot(Var_imp)
From the result, variable importance can be shown in a plot; the most important variable above is duration.
When using random forest we are not required to split our dataset into train and test sets, because random forest already produces out-of-bag (OOB) estimates, which act as a reliable estimate of the accuracy on unseen examples. However, it is still possible to hold out a regular train-test split, as we did. For example, the OOB estimate we achieved (in the summary below) was generated from our down_cust training dataset.
plot(model_forest$finalModel)
legend("topright", colnames(model_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
model_forest$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 10
#>
#> OOB estimate of error rate: 14.97%
#> Confusion matrix:
#> no yes class.error
#> no 3472 760 0.1795841
#> yes 507 3725 0.1198015
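The reported OOB error rate can be verified directly from the OOB confusion matrix counts:
# misclassified OOB observations divided by all training observations
(760 + 507) / (3472 + 760 + 507 + 3725)  # about 0.1497, matching the 14.97% above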
predict.forest <- predict(model_forest, cust_test, type = "raw")# for the class prediction
predict.prob.forest <- predict(model_forest, cust_test, type = "prob")# for the probability
cm_forest <- confusionMatrix(predict.forest, cust_test$y, positive = "yes")
cm_forest
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 6457 132
#> yes 1527 925
#>
#> Accuracy : 0.8165
#> 95% CI : (0.8084, 0.8244)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.4349
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.8751
#> Specificity : 0.8087
#> Pos Pred Value : 0.3772
#> Neg Pred Value : 0.9800
#> Prevalence : 0.1169
#> Detection Rate : 0.1023
#> Detection Prevalence : 0.2712
#> Balanced Accuracy : 0.8419
#>
#> 'Positive' Class : yes
#>
# prediction object
forest_roc <- prediction(predictions = predict.prob.forest[, 2],   # positive class probabilities
                         labels = as.numeric(cust_test$y == "yes")) # positive class labels
# performance object built from the prediction object
forest_perf <- performance(prediction.obj = forest_roc,
measure = "tpr",
x.measure = "fpr")
#plot
plot(forest_perf, main = "ROC")
abline(0,1, lty =2)
forest_auc <- performance(prediction.obj = forest_roc,
measure = "auc")
forest_auc@y.values
#> [[1]]
#> [1] 0.9041757
Based on the random forest's confusion matrix, the algorithm gives a good result, with an accuracy of 81.7% and a recall of 87.5% on the positive target class yes. Its AUC of 90.4% is also higher than that of the two previous models.
Below are the prediction results of the three models on the test data for the positive class yes:
cust_test$predict.naive <- cust_pred
cust_test$predict.DT <- predict.DT
cust_test$predict.forest <- predict.forest
cust_test %>%
select(y,predict.naive, predict.DT, predict.forest) %>%
filter(y == "yes") %>%
head(5)
df1 <- data.frame(model = c("Naive Bayes","Decision Tree","Random Forest"),
accuracy = c(0.7618,0.8197,0.8165),
sensitivity = c(0.78808,0.68023,0.8751),
specificity = c(0.75827,0.83818,0.8087),
precision = c(0.30148,0.35753,0.3772),
AUC = c(0.8302735,0.8138376,0.9041757)
)
df1
Based on the results of the three models, the random forest performs best: besides high accuracy, it also has the highest AUC and sensitivity (recall), which is the metric that matters most here because the target class is customers who buy the product (yes). The high recall shows that the model produces few false negatives, i.e., few customers predicted as no who would actually subscribe (yes).
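As a final check on that point, the false negatives of each model can be counted directly from the prediction columns added to cust_test above; the three counts should match the corresponding confusion matrix cells (224, 338, and 132):
# actual "yes" customers predicted as "no", per model
cust_test %>%
  filter(y == "yes") %>%
  summarise(fn_naive  = sum(predict.naive  == "no"),
            fn_tree   = sum(predict.DT     == "no"),
            fn_forest = sum(predict.forest == "no"))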