library(dplyr)
library(ggplot2)
library(e1071)
library(partykit)
library(rsample)
library(caret)
library(ROCR)
library(GGally)
library(tidyverse)In business, Churn or Moving Out is something that cannot be avoided. In every industry and business model, there will always be customers coming in and going out. The calculation of the ratio of Churn and not Churn is called the Churn Ratio - a measure of the number of individuals or items moving out of a collective group over a specific period.
For the industry, a customer who Churn is a loss to the business because they cannot sell its services or products to that customer. In most cases, there are a myriad of conditions that lead to customer dropouts. Therefore, knowing the likelihood that customers will leave can help businesses to reduce the number of customers who unsubscribe.
The machine learning algorithm can predict whether a customer will churn or not. This case categorized as classification problem. The purpose of this notebook is to create a model that is able to predict whether a bank customer will churn or not. By knowing customer symptoms earlier, companies can implement strategies to reduce customer churn. The data used in this case were taken from Kaggle, Churn Modeling.
# Import Data
df <- read.csv(file = "Churn_Modelling.csv", row.names = 1, stringsAsFactors = T)
# Take out irrelevant columns : CustomerId, Surname, Geography
churn <- df %>%
select(-c(CustomerId, Surname, Geography)) %>%
mutate(HasCrCard = as.factor(HasCrCard),
IsActiveMember = as.factor(IsActiveMember),
Exited = as.factor(Exited))
# description
glimpse(churn)## Rows: 10,000
## Columns: 10
## $ CreditScore <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528,~
## $ Gender <fct> Female, Female, Female, Female, Female, Male, Male, Fe~
## $ Age <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 25~
## $ Tenure <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9,~
## $ Balance <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78,~
## $ NumOfProducts <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, ~
## $ HasCrCard <fct> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~
## $ IsActiveMember <fct> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, ~
## $ EstimatedSalary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, 1~
## $ Exited <fct> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
In original dataset, there are 13 columns. CustomerId, Surename, Geography colums omitted because it will not used in this analysis.
This data consists of 10,000 rows or cases and 10 columns. Further explanation:
1. CreditScore: A credit score is a number between 300–850 that depicts a consumer’s creditworthiness.
2. Gender: Female or Male
3. Age: Customer’s age
4. Tenure: The period or duration for which the loan amount is sanctioned.
5. Balance: The amount of a loan that is left to be paid
6. NumOfProducts: Number of product that customer’s have
7. HasCrCard: Is customer has credit card or not. (1 = Yes, 0 = No)
8. IsActiveMember: Is customer active or not. (1 = Yes, 0 = No)
9. EstimatedSalary : Estimation of customer’s salary.
Target 10. Exited: Is the customer exit or not. (1 = Yes, 0 = No)
First thing to inspect is wether there are any NA values in rows.
# NA Value
churn %>%
summarise_all(funs(sum(is.na(.))))## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
The dataset used is clean and have no NA or Blank values.
boxplot <- df %>%
select(-c(Surname, Geography)) %>%
mutate(HasCrCard = as.factor(HasCrCard),
IsActiveMember = as.factor(IsActiveMember),
Exited = as.factor(Exited)) %>%
select_if(is.numeric) %>%
pivot_longer(-CustomerId)
ggplot(data = boxplot, aes(x = name, y = value, fill = name)) +
geom_boxplot() +
facet_wrap(facets = ~name, scales = 'free')churncorr <- churn %>%
select_if(is.numeric)
ggcorr(data = churncorr, label = T) +
labs(title = "Correlation Matrix")# Target variable's proportion
data.frame(table(churn$Exited))# Target variable's proportion in percentage
data.frame(prop.table(table(churn$Exited)))From 1000 rows data, 79,63% (7963) are customers that are not churn, and 20.37% data (2037 rows) are customers that churn. The data is imbalanced, therefore we can expect the same if we do cross validation the results will not much different.
# Train Test Split
RNGkind(sample.kind = "Rounding")
set.seed(444)
split <- sample(nrow(churn), nrow(churn)*0.8)
churn_train <- churn[split,]
churn_test <- churn[-split,]Train-Test split is aimed to separate training dataset and testing dataset. This is essential to machine learning. For training the algorithm, we will use the training dataset, and predict, and testing the algorithm we will use testing dataset.
# Target Proportion on Train
prop.table(table(churn_train$Exited))##
## 0 1
## 0.79575 0.20425
table(churn_train$Exited)##
## 0 1
## 6366 1634
# Target Proportion on Test
prop.table(table(churn_test$Exited))##
## 0 1
## 0.7985 0.2015
# DOWNSAMPLE
train_down <- downSample(x = churn_train %>% select(-Exited),
y = churn_train$Exited,
list = F,
yname = "Exited"
)
table(train_down$Exited)##
## 0 1
## 1634 1634
The first algorithm is logistic Regression. The method used is stepwise Regression in the backward direction. Logistic Regression has the advantage of an easy-to-interpret algorithm similar to linear Regression.
# Logistic Regression
modellr <- glm(Exited ~., data = train_down, family = "binomial")
modellrback <- step(object = modellr, direction = "backward", trace = F)
predlr <- predict(modellrback, churn_test, type = "response")
# Classifying Class and Predicting
pred_exited <- ifelse(predlr > 0.5,"1","0")
pred_exited <- as.factor(pred_exited)The threshold used is 0.5. Then the results of the prediction score in the form of a probability that has a score above 0.5 are categorized as Exited; on the other hand, a score below 0.5 is categorized as Stay.
# Model Summary
summary(modellrback)##
## Call:
## glm(formula = Exited ~ CreditScore + Gender + Age + Balance +
## NumOfProducts + IsActiveMember, family = "binomial", data = train_down)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.41872 -0.96357 -0.08514 0.98366 2.48815
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.331e+00 3.336e-01 -6.989 2.77e-12 ***
## CreditScore -6.200e-04 3.995e-04 -1.552 0.1207
## GenderMale -5.263e-01 7.819e-02 -6.730 1.69e-11 ***
## Age 7.717e-02 4.114e-03 18.759 < 2e-16 ***
## Balance 5.293e-06 6.500e-07 8.144 3.83e-16 ***
## NumOfProducts -1.151e-01 5.855e-02 -1.967 0.0492 *
## IsActiveMember1 -9.325e-01 7.938e-02 -11.748 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4530.4 on 3267 degrees of freedom
## Residual deviance: 3847.0 on 3261 degrees of freedom
## AIC: 3861
##
## Number of Fisher Scoring iterations: 3
The summary model results show several variables that have a significant effect on predicting the target class. The significant variables in predicting are Male Gender, Age, balance, Active Member, and Number of Products.
# Confusion Matrix
confusionMatrix(pred_exited, churn_test$Exited, positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1113 137
## 1 484 266
##
## Accuracy : 0.6895
## 95% CI : (0.6687, 0.7097)
## No Information Rate : 0.7985
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2701
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6600
## Specificity : 0.6969
## Pos Pred Value : 0.3547
## Neg Pred Value : 0.8904
## Prevalence : 0.2015
## Detection Rate : 0.1330
## Detection Prevalence : 0.3750
## Balanced Accuracy : 0.6785
##
## 'Positive' Class : 1
##
In classification, a confusion matrix is used to see the performance of a classification model. The confusion matrix compares the predicted results with actual data. The model’s accuracy is 68%, Sensitivity 66%, Specificity 69%, and Pos Pred Value 35%. Given the unbalanced proportion of targets, accuracy measures are not very well used. Therefore, Sensitivity is more suitable because it calculates how many valid results predict positive, from actual positive events.
Algoritma kedua yang digunakan adalah Naive Bayes. Naive Bayes merupakan algoritma yang diturunkan dari teorema probability Bayes. Kata “Naive” di awal nama menandakan algoritma ini tidak mempertimbangkan hubungan antar variabel prediktor. Mengasumsikan antar variabel prediktor tidak memiliki hubungan linier atau non-linier.
# Naive Bayes Modelling
## Model Fit with downsample train data
modelbayesdown <- naiveBayes(Exited ~ ., data = train_down)
## Prediction
predbayesdown <- predict(modelbayesdown, churn_test)
## Confusion Matrix NB
confusionMatrix(predbayesdown, churn_test$Exited, positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1200 116
## 1 397 287
##
## Accuracy : 0.7435
## 95% CI : (0.7238, 0.7625)
## No Information Rate : 0.7985
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3677
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7122
## Specificity : 0.7514
## Pos Pred Value : 0.4196
## Neg Pred Value : 0.9119
## Prevalence : 0.2015
## Detection Rate : 0.1435
## Detection Prevalence : 0.3420
## Balanced Accuracy : 0.7318
##
## 'Positive' Class : 1
##
The evaluation of the Naive Bayes algorithm still uses the Confusion Matrix. Naive Bayes succeeded in predicting 1200 True Negatives and 287 True Negatives. There are 116 False Positive data and 397 False Negative data. The statistical results obtained a sensitivity score of 71% and a Specificity of 75%. The difference in scores shows that the Naive Bayes model is relatively better at predicting the Stay class (Exited = 0) than the Exited class. The Naive Bayes Sensitivity score is still better than the Logistic Regression score.
Algoritma selanjutya adalah Decision Tree menggunakan fungsi ctree. Decision Tree adalah algoritma yang tidak terlalu kompleks, dapat diinterpretasikan, dan dapat dijelaskan melalui gambar pohon pengambilan keputusan.
# Making Upsample Data
set.seed(175)
train_up <- upSample(x = churn_train %>% select(-Exited),
y = churn_train$Exited,
list = F,
yname = "Exited"
)
table(train_up$Exited)##
## 0 1
## 6366 6366
# Boostrapping
boot_index <- sample(nrow(train_up), nrow(train_up), replace = T)
data_train_boot <- train_up[boot_index, ]
head(data_train_boot, 10)control_tree <- ctree_control(mtry = 2)
modeldt <- ctree(Exited ~ ., train_up, control = control_tree)
preddt <- predict(object = modeldt, churn_test, type = "prob")
plot(modeldt)Plot Decision Tree yang dihasilkan tampak kompleks. Hal tersebut dipengaruhi oleh parameter mtry=2.
Melakukan pengulangan atau iterasi model tree dengan 500 pohon.
# Repetition
tree_model <- list()
# Melatih 500 decision Tree
for (i in 1:500) {
# Bootstrapping
set.seed(i)
index <- sample(1:nrow(train_up), nrow(train_up), replace = T)
data_train_boot <- train_up[index, ]
# Membuat Model
tree_model[[i]] <- ctree(Exited ~., data_train_boot, control = ctree_control(mtry = 2))
}Setelah melakukan pengulangan sebanyak 500 pohon, kita melakukan agregasi dengan mengumpulkan hasil prediksi dan kemudian mengambil kesimpulan berdasarkan hasil voting terbanyak.
# Aggregation
tree_predict <- matrix(nrow = nrow(churn_test), ncol = 500) %>% as.data.frame()
for (j in 1:500) {
tree_predict[ ,j] <- predict(tree_model[[j]], newdata = churn_test)
}
colnames(tree_predict) <- paste0("tree_", 1:500)
head(tree_predict, 10)Decision Tree Model Evaluation
rf_prediction <- data.frame(actual = churn_test$Exited,
tree_predict
) %>%
rownames_to_column("index") %>%
pivot_longer(cols = -c(index, actual)) %>%
count(index, actual, value) %>%
pivot_wider(names_from = value, values_from = n) %>%
replace_na(list(neg = 0, pos = 0)) %>%
rename(no = "0", yes = "1") %>%
mutate(prob_yes = yes/(yes+no),
prediction = factor(ifelse(prob_yes > 0.5, "yes", "no")),
actual = factor(ifelse(actual == 1, "yes", "no"))
)
head(rf_prediction, 20)confusionMatrix(rf_prediction$prediction, rf_prediction$actual, positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1267 126
## yes 330 277
##
## Accuracy : 0.772
## 95% CI : (0.753, 0.7902)
## No Information Rate : 0.7985
## P-Value [Acc > NIR] : 0.9984
##
## Kappa : 0.4042
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6873
## Specificity : 0.7934
## Pos Pred Value : 0.4563
## Neg Pred Value : 0.9095
## Prevalence : 0.2015
## Detection Rate : 0.1385
## Detection Prevalence : 0.3035
## Balanced Accuracy : 0.7404
##
## 'Positive' Class : yes
##
The results above show the level of accuracy of the model that has been made. The decision tree model produces an accuracy of 77%, with a sensitivity of 68% and a specificity of 79%. It can be concluded that this model is quite good at predicting both classes (churned/not churned) by looking at its accuracy, which reaches 77%.
Evaluation of the model using the Area Under Curve Plot.
# Model Evaluation
modelrocdt <- prediction(preddt[,2], churn_test$Exited)
rocdt <- performance(modelrocdt,
"tpr", # True Positive Rate (Recall),
"fpr") # False Positive Rate (1 - Specificity)
# PLOT
plot(rocdt)
polygon(c(rocdt@x.values[[1]], 1),
c(rocdt@y.values[[1]], 0), col = "grey")
abline(0,1 , lty = 2)# AUC Value of Decision Tree model
aucdt <- performance(modelrocdt, measure = "auc")@y.values
aucdt## [[1]]
## [1] 0.7935879
The AUC Score is 79%.
With Random Forest, we are using K-Fold Cross-Validation. K-Fold Cross Validation is done by dividing the data into several groups and then repeating. In this case, we divide the data into ten groups with three repetitions.
Fitting Random Forest Model
# RANDOM FOREST
churn_forest <- train(Exited ~ .,
data = train_down,
method="rf",
trControl = ctrl
)
plot(churn_forest)# Predicting Testing Data
pred_churn <- predict(churn_forest, churn_test)
confusionMatrix(pred_churn, churn_test$Exited, positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1221 101
## 1 376 302
##
## Accuracy : 0.7615
## 95% CI : (0.7422, 0.78)
## No Information Rate : 0.7985
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4095
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7494
## Specificity : 0.7646
## Pos Pred Value : 0.4454
## Neg Pred Value : 0.9236
## Prevalence : 0.2015
## Detection Rate : 0.1510
## Detection Prevalence : 0.3390
## Balanced Accuracy : 0.7570
##
## 'Positive' Class : 1
##
The Random Forest model produces a prediction accuracy of 76% with a Sensitivity and Specificity of 74% and 76%, respectively. Although the accuracy of this model is slightly lower than the Decision Tree model, this model has better Sensitivity.
varImp(churn_forest) %>% plot()The chart above shows which variables are essential in making predictions on the Random Forest model. Age is very dominant as a predictor. Then, NumOfProducts, Balance, and EstimatedSalary are around 40 - 50%.
After comparing several classification models, we can gain insight into which model is the best in making predictions. Linear Regression is a simple but powerful model to predict. This model gets 68% accuracy. The Naive Bayes model, only taking a column with a numerical value, gets a prediction accuracy of 74%.
The third model, Decision Tree with Bootstrapping and Ensamble, produces an accuracy of 77%. And, finally, the Random Forest model has a prediction accuracy of 76%.
The last three models have higher accuracy than the Linear Regression Model. However, Linear Regression has the strength that the model is easy to understand and interpret. In contrast, the Naive Bayes, Decision Tree, Random Forest models have more accurate predictions but tend to hard to interpret.
The use of a good classification model does not only look at the level of accuracy or other measures but also looks at the cases and data used. If the purpose of the analysis is to generate interpretations, a simple model such as LInier Regression can be applied. To improve the model, more advanced methods can be used. On the other hand, if the purpose of the model is to predict–accuracy is critical–the more advanced complex model is relatively better than the simple model.