Churn in a business setting refers to losing an acquired, potentially profitable customer. Acquiring a new customer is always more expensive than retaining an existing one. Hence, not letting them churn is the key to a sustained revenue stream.
In a large company such as a telco with a variety of consumers, you will find many people who have the same characteristics. Some of them must have left as customers and remained as customers every month.
Once the importance of customers and how difficult it is to get new customers, we must know by predicting whether there are still many loyal customers and how many will leave.
We will try to make predictions by making a model using the Naive Bayes algorithm and the Decision Tree model
The objective is to predict churn or no churn in order to retain customers.
library(tidyverse)
library(e1071) #naive bayes
library(caret) #confusion matrix
library(ROCR) #test model porformance using ROC/AU
library(partykit) #decision tree
library(rsample) #generate sample dataLoad data source from directory
telco <- read.csv("data_input/telcochurn.csv", stringsAsFactors = F)Check data structure
glimpse(telco)## Rows: 7,043
## Columns: 22
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW~
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",~
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes~
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"~
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ MultipleLines <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye~
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt~
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "~
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N~
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y~
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes~
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye~
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes~
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One ~
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "~
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y~
colnames(telco)## [1] "X" "customerID" "gender" "SeniorCitizen"
## [5] "Partner" "Dependents" "tenure" "PhoneService"
## [9] "MultipleLines" "InternetService" "OnlineSecurity" "OnlineBackup"
## [13] "DeviceProtection" "TechSupport" "StreamingTV" "StreamingMovies"
## [17] "Contract" "PaperlessBilling" "PaymentMethod" "MonthlyCharges"
## [21] "TotalCharges" "Churn"
The data set includes information about: (20 variables)
1. Remove unused column
telco <- telco %>%
select (- X, -customerID)2. Check Missing Value
anyNA(telco)## [1] TRUE
colSums(is.na(telco))## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 11 0
The dataset is generally clean except for 11 rows in the total charges column, this corresponds to the number in the tenor column which is zero (0).
Remove NA values
telco <- na.omit(telco)
#recheck NA
anyNA(telco)## [1] FALSE
No Missing value detected , data is clean
Check data target Churn proportional
telco_nb <- telco %>%
mutate_if(is.character, as.factor) %>%
mutate(SeniorCitizen = as.factor(SeniorCitizen),
tenure = as.factor(tenure))
prop.table(table(telco_nb$Churn))##
## No Yes
## 0.734215 0.265785
From the results, the proportion shows that the probability of a consumer who will churn is 26.58%
Now data is ready to modelling process
Before create model, let split data to telco_train and telco_test with 80% as train and 20% as test data
RNGkind(sample.kind = "Rounding")
set.seed(417)
idxs <- sample(nrow(telco_nb),nrow(telco_nb)*0.8)
telco_train <- telco_nb[idxs,]
telco_test <- telco_nb[-idxs,]prop.table(table(telco_train$Churn))##
## No Yes
## 0.7347556 0.2652444
From the results, the proportion shows is unbalanced
model_nb <- naiveBayes(Churn ~ ., data = telco_nb)telco_test$pred <- predict(model_nb, newdata = telco_test,
type = "class")confusionMatrix(as.factor(telco_test$pred),
as.factor(telco_test$Churn),
positive = "No")## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 839 113
## Yes 191 264
##
## Accuracy : 0.7839
## 95% CI : (0.7615, 0.8052)
## No Information Rate : 0.7321
## P-Value [Acc > NIR] : 0.000004132
##
## Kappa : 0.4831
##
## Mcnemar's Test P-Value : 0.000010043
##
## Sensitivity : 0.8146
## Specificity : 0.7003
## Pos Pred Value : 0.8813
## Neg Pred Value : 0.5802
## Prevalence : 0.7321
## Detection Rate : 0.5963
## Detection Prevalence : 0.6766
## Balanced Accuracy : 0.7574
##
## 'Positive' Class : No
##
Notes :
#create predict using probability output
pred_prob <- predict(model_nb, newdata = telco_test, type = "raw")
#process ROC
pred_rocr <- prediction(predictions = pred_prob[,2],
labels = as.numeric(ifelse(test = telco_test$Churn == 'No',
0,1)
)
)
perf <- performance(prediction.obj = pred_rocr,
measure = "tpr",
x.measure = "fpr")plot(perf)Check AUC (Area Under Curve) Value
auc <- performance(pred_rocr, "auc")
auc@y.values## [[1]]
## [1] 0.8524851
Split data to 80% of data train and 20% of data test
set.seed(417)
idx <- initial_split(telco_nb, prop = 0.8, strata = "Churn")
train_dt <- training(idx)
test_dt <- testing(idx)prop.table(table(train_dt$Churn))##
## No Yes
## 0.7342222 0.2657778
model_dt <- ctree(Churn ~ ., data = train_dt,
control = ctree_control(mincriterion = 0.999,
minsplit = 50,
minbucket = 2))
plot(model_dt, type = "simple")From the results of the plot, it can be seen that the main factor the customer will Churn / No Churn as a customer is the status of the service contract
pred_dt <- predict(model_dt, newdata = test_dt, type= "response")confusionMatrix(pred_dt, test_dt$Churn, positive = "No" )## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 950 219
## Yes 83 155
##
## Accuracy : 0.7854
## 95% CI : (0.763, 0.8066)
## No Information Rate : 0.7342
## P-Value [Acc > NIR] : 0.00000522302022919
##
## Kappa : 0.3779
##
## Mcnemar's Test P-Value : 0.00000000000000795
##
## Sensitivity : 0.9197
## Specificity : 0.4144
## Pos Pred Value : 0.8127
## Neg Pred Value : 0.6513
## Prevalence : 0.7342
## Detection Rate : 0.6752
## Detection Prevalence : 0.8308
## Balanced Accuracy : 0.6670
##
## 'Positive' Class : No
##
pred_dt_prob <- predict(model_dt, newdata = test_dt, type= "prob")
pred_dt_roc <- prediction(predictions = pred_dt_prob[,1],
labels = as.numeric(ifelse(test = test_dt$Churn == 'No',1,0)))
perf_dt <- performance(prediction.obj = pred_dt_roc,
measure = "tpr",
x.measure = "fpr")plot(perf_dt)Check AUC (Area Under Curve) Value
auc <- performance(pred_dt_roc, "auc")
auc@y.values## [[1]]
## [1] 0.8301777
Both models show almost the same good performance and in my opinion both Naive Bayes and Decision Tree models can be used for modeling