1 Overview

Churn in a business setting refers to losing an acquired, potentially profitable customer. Acquiring a new customer is always more expensive than retaining an existing one. Hence, not letting them churn is the key to a sustained revenue stream.

1.1 Bussiness Question

In a large company such as a telco with a variety of consumers, you will find many people who have the same characteristics. Some of them must have left as customers and remained as customers every month.

Once the importance of customers and how difficult it is to get new customers, we must know by predicting whether there are still many loyal customers and how many will leave.

We will try to make predictions by making a model using the Naive Bayes algorithm and the Decision Tree model

The objective is to predict churn or no churn in order to retain customers.

2 Data Collection and Preparation

2.1 Data Collection

2.1.1 Load library needed

library(tidyverse)
library(e1071) #naive bayes
library(caret) #confusion matrix
library(ROCR) #test model porformance using ROC/AU
library(partykit) #decision tree
library(rsample) #generate sample data

2.1.2 Load data and Checking

Load data source from directory

telco <- read.csv("data_input/telcochurn.csv", stringsAsFactors = F)

Check data structure

glimpse(telco)
## Rows: 7,043
## Columns: 22
## $ X                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW~
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",~
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes~
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"~
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ MultipleLines    <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye~
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt~
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "~
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N~
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y~
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes~
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye~
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes~
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One ~
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "~
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y~
colnames(telco)
##  [1] "X"                "customerID"       "gender"           "SeniorCitizen"   
##  [5] "Partner"          "Dependents"       "tenure"           "PhoneService"    
##  [9] "MultipleLines"    "InternetService"  "OnlineSecurity"   "OnlineBackup"    
## [13] "DeviceProtection" "TechSupport"      "StreamingTV"      "StreamingMovies" 
## [17] "Contract"         "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"  
## [21] "TotalCharges"     "Churn"

The data set includes information about: (20 variables)

  • Customers who left within the last month – the column is called Churn
  • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers – gender, age range, and if they have partners and dependents Our target is the Churn column

2.2 Data Wrangling

1. Remove unused column

telco <- telco %>% 
  select (- X, -customerID)

2. Check Missing Value

anyNA(telco)
## [1] TRUE
colSums(is.na(telco))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0               11                0

The dataset is generally clean except for 11 rows in the total charges column, this corresponds to the number in the tenor column which is zero (0).

Remove NA values

telco <- na.omit(telco)

#recheck NA

anyNA(telco)
## [1] FALSE

No Missing value detected , data is clean


Check data target Churn proportional

telco_nb <- telco %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(SeniorCitizen = as.factor(SeniorCitizen),
         tenure = as.factor(tenure))

prop.table(table(telco_nb$Churn))
## 
##       No      Yes 
## 0.734215 0.265785

From the results, the proportion shows that the probability of a consumer who will churn is 26.58%

Now data is ready to modelling process

3 Model Fitting

3.1 Naive Bayes Model

3.1.1 Cross Validation

Before create model, let split data to telco_train and telco_test with 80% as train and 20% as test data

RNGkind(sample.kind = "Rounding")
set.seed(417)

idxs <- sample(nrow(telco_nb),nrow(telco_nb)*0.8)

telco_train <- telco_nb[idxs,]

telco_test <- telco_nb[-idxs,]

3.1.2 Check Proportion

prop.table(table(telco_train$Churn))
## 
##        No       Yes 
## 0.7347556 0.2652444

From the results, the proportion shows is unbalanced

3.1.3 Create Model

model_nb <- naiveBayes(Churn ~ ., data = telco_nb)

3.1.4 Prediction

telco_test$pred <- predict(model_nb, newdata = telco_test, 
                           type = "class")

3.1.5 Model Performance using confusion matrix

confusionMatrix(as.factor(telco_test$pred), 
                as.factor(telco_test$Churn), 
                positive = "No")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  839 113
##        Yes 191 264
##                                           
##                Accuracy : 0.7839          
##                  95% CI : (0.7615, 0.8052)
##     No Information Rate : 0.7321          
##     P-Value [Acc > NIR] : 0.000004132     
##                                           
##                   Kappa : 0.4831          
##                                           
##  Mcnemar's Test P-Value : 0.000010043     
##                                           
##             Sensitivity : 0.8146          
##             Specificity : 0.7003          
##          Pos Pred Value : 0.8813          
##          Neg Pred Value : 0.5802          
##              Prevalence : 0.7321          
##          Detection Rate : 0.5963          
##    Detection Prevalence : 0.6766          
##       Balanced Accuracy : 0.7574          
##                                           
##        'Positive' Class : No              
## 

Notes :

  • From the results of checking the performance of the model, the accuracy of the model is 78.4% and the most important thing here is that the sensitivity/recall value reaches 81.5% and this indicates that the model is good to use. Because we focus on how many customers will stay and customers will leave

3.1.6 Model performance using ROC/AUC

#create predict using probability output
pred_prob <- predict(model_nb, newdata = telco_test, type = "raw")
#process ROC
pred_rocr <- prediction(predictions = pred_prob[,2],
                        labels = as.numeric(ifelse(test = telco_test$Churn == 'No',
                                                   0,1)
                                            )
                        )

perf <- performance(prediction.obj = pred_rocr, 
                    measure = "tpr",
                    x.measure = "fpr")

3.1.7 ROC Visualisation

plot(perf)

Check AUC (Area Under Curve) Value

auc <- performance(pred_rocr, "auc")

auc@y.values
## [[1]]
## [1] 0.8524851

3.2 Decicion Tree

3.2.1 Cross Validation

Split data to 80% of data train and 20% of data test

set.seed(417)

idx <- initial_split(telco_nb, prop = 0.8, strata = "Churn")

train_dt <- training(idx)

test_dt <- testing(idx)

3.2.2 Check Proportion

prop.table(table(train_dt$Churn))
## 
##        No       Yes 
## 0.7342222 0.2657778

3.2.3 Create Model

model_dt <- ctree(Churn ~ ., data = train_dt,
                  control = ctree_control(mincriterion = 0.999,
                                          minsplit = 50,
                                          minbucket = 2))

plot(model_dt, type = "simple")

From the results of the plot, it can be seen that the main factor the customer will Churn / No Churn as a customer is the status of the service contract

3.2.4 Create Prediction

pred_dt <- predict(model_dt, newdata = test_dt, type= "response")

3.2.5 Check Model

confusionMatrix(pred_dt, test_dt$Churn, positive = "No" )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  950 219
##        Yes  83 155
##                                              
##                Accuracy : 0.7854             
##                  95% CI : (0.763, 0.8066)    
##     No Information Rate : 0.7342             
##     P-Value [Acc > NIR] : 0.00000522302022919
##                                              
##                   Kappa : 0.3779             
##                                              
##  Mcnemar's Test P-Value : 0.00000000000000795
##                                              
##             Sensitivity : 0.9197             
##             Specificity : 0.4144             
##          Pos Pred Value : 0.8127             
##          Neg Pred Value : 0.6513             
##              Prevalence : 0.7342             
##          Detection Rate : 0.6752             
##    Detection Prevalence : 0.8308             
##       Balanced Accuracy : 0.6670             
##                                              
##        'Positive' Class : No                 
## 

3.2.6 Check Performance model using ROC / AUC

pred_dt_prob <- predict(model_dt, newdata = test_dt, type= "prob")

pred_dt_roc <- prediction(predictions = pred_dt_prob[,1],
                        labels = as.numeric(ifelse(test = test_dt$Churn == 'No',1,0)))

perf_dt <- performance(prediction.obj = pred_dt_roc, 
                       measure = "tpr",
                       x.measure = "fpr")

3.2.7 ROC Visualisation

plot(perf_dt)

Check AUC (Area Under Curve) Value

auc <- performance(pred_dt_roc, "auc")

auc@y.values
## [[1]]
## [1] 0.8301777

4 Conclusion

  • Accuracy
    • Naive Bayes : 71.4 %
    • Decision Tree: 79.7 %
  • Recall / Sensitivity
    • Naive Bayes : 81.5%
    • Decision Tree: 88.9 %
  • AUC
    • Naive Bayes : 85.25 %
    • Decision Tree: 85.20 %

Both models show almost the same good performance and in my opinion both Naive Bayes and Decision Tree models can be used for modeling