1 Intro

Customer churn is the loss of customers from a business, typically measured as the share of customers who leave within a given period. Tracking churn matters because it directly reflects how well a business retains its customers.
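
As a minimal illustration (with hypothetical numbers, not taken from this data set), the churn rate for a period is simply customers lost divided by customers at the start of the period:

# Illustrative churn-rate calculation (hypothetical numbers)
customers_start <- 1000 # customers at the start of the period
customers_lost <- 50    # customers who left during the period
customers_lost / customers_start # 0.05, i.e. 5% churn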

Losing customers is costly: commonly cited estimates put the cost of acquiring a new customer at about 5 times that of retaining an existing one, and the cost of turning a new customer into a loyal one at up to 16 times more. A business therefore needs a strategy to reduce churn and retain the customers it already has, since they are its main source of revenue. In this case study, we will build models to predict customer churn.

1.1 Load Library & Read Data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
## Loading required package: lattice
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:gtools':
## 
##     permutations
churn <- read.csv("Telco_Customer_Churn.csv", stringsAsFactors = T)
head(churn)

Each row represents a customer, and each column contains one of the customer’s attributes, as described in the data set’s metadata.

The data set includes information about:

- Churn: customers who left within the last month
- Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information: how long they have been a customer (tenure), contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers: gender, age range, and whether they have partners and dependents

2 Data Pre-Processing

str(churn)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Good, the data types are correct. Next, we drop customerID, since it is only an identifier and has no predictive value.

churn <- churn %>% 
  select(-customerID)
head(churn)

Next, we check each column for missing values.

colSums(is.na(churn))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0               11                0

The TotalCharges column has 11 missing values. Since this is only a small fraction of the 7,043 rows, we simply remove those rows.

churn <- na.omit(churn)
colSums(is.na(churn))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0

Great, the data now contains no missing values.
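
As a side note, the rows with missing TotalCharges in this data set generally correspond to brand-new customers with a tenure of 0, so an alternative to dropping them would be to impute 0 instead. A sketch using tidyr’s replace_na() (not used in the rest of this analysis, where the rows are dropped as above):

# Alternative (sketch): impute 0 for new customers instead of dropping rows
# churn <- churn %>%
#   mutate(TotalCharges = replace_na(TotalCharges, 0))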

3 Cross Validation

First, we split the data into a training set (80%) and a test set (20%).

library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- sample(nrow(churn), size = nrow(churn)*0.8)

churn_train <- churn[index,]
churn_test <- churn[-index,]
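
The split above uses base R’s sample(). Since rsample is already loaded, a stratified split that preserves the churn proportion in both sets would be an alternative; a sketch (shown for reference, not used below):

# Alternative (sketch): stratified 80/20 split with rsample
# set.seed(100)
# churn_split <- initial_split(churn, prop = 0.8, strata = Churn)
# churn_train <- training(churn_split)
# churn_test  <- testing(churn_split)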

Next, we check the proportion of the target variable in the training data.

prop.table(table(churn_train$Churn))
## 
##        No       Yes 
## 0.7363556 0.2636444

The target variable is imbalanced: roughly 74% of training customers did not churn. To keep the models from simply favoring the majority class, we up-sample the minority class to balance the training data.

RNGkind(sample.kind = "Rounding")
set.seed(100)

churn_train_balance <- upSample(x = churn_train %>% select(-Churn),
                                y = churn_train$Churn,
                                yname = "Churn")

prop.table(table(churn_train_balance$Churn))
## 
##  No Yes 
## 0.5 0.5

Great, the target variable is now balanced in the training data.
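
caret also provides downSample(), which balances the classes by discarding majority-class rows instead of duplicating minority-class rows; it has the same interface as upSample(). A sketch (not used below):

# Alternative (sketch): down-sample the majority class instead
churn_train_down <- downSample(x = churn_train %>% select(-Churn),
                               y = churn_train$Churn,
                               yname = "Churn")
prop.table(table(churn_train_down$Churn))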

4 Modelling

4.1 Naive Bayes

model_naive_bayes <- naiveBayes(Churn~., data = churn_train_balance)
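
Note that by default naiveBayes() applies no Laplace smoothing, so a factor level never observed with a given class gets a zero conditional probability. If that were a concern, smoothing can be enabled via the laplace argument; a variant (not used below):

# Variant (sketch): Naive Bayes with Laplace smoothing
model_nb_laplace <- naiveBayes(Churn ~ ., data = churn_train_balance, laplace = 1)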

Now we predict on the test set.

predict_churn_test2 <- predict(model_naive_bayes, churn_test, type = "class")

Next, we evaluate the model with a confusion matrix.

confusionMatrix(predict_churn_test2, churn_test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  689  75
##        Yes 332 311
##                                           
##                Accuracy : 0.7107          
##                  95% CI : (0.6863, 0.7343)
##     No Information Rate : 0.7257          
##     P-Value [Acc > NIR] : 0.9001          
##                                           
##                   Kappa : 0.3981          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8057          
##             Specificity : 0.6748          
##          Pos Pred Value : 0.4837          
##          Neg Pred Value : 0.9018          
##              Prevalence : 0.2743          
##          Detection Rate : 0.2210          
##    Detection Prevalence : 0.4570          
##       Balanced Accuracy : 0.7403          
##                                           
##        'Positive' Class : Yes             
## 
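
Instead of hard class predictions, naiveBayes() can also return posterior probabilities with type = "raw", which allows a custom decision threshold, for example to catch more churners at the cost of more false alarms. A sketch, with 0.4 as a purely illustrative threshold:

# Posterior probabilities instead of hard class labels
prob_churn_nb <- predict(model_naive_bayes, churn_test, type = "raw")

# Classify as "Yes" when P(churn) exceeds an illustrative 0.4 threshold
predict_churn_custom <- factor(ifelse(prob_churn_nb[, "Yes"] > 0.4, "Yes", "No"),
                               levels = levels(churn_test$Churn))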

4.2 Decision Tree

model_decision_tree <- ctree(formula = Churn~., data = churn_train_balance)
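
The fitted tree can be plotted directly to inspect its splits; with this many predictors the full tree is large, so partykit’s simple node style keeps it more readable (optional, for inspection only):

# Visualize the fitted conditional inference tree (can be large)
plot(model_decision_tree, type = "simple")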

Now we predict on the test set.

predict_churn_test1 <- predict(model_decision_tree, churn_test, type = "response")

Next, we evaluate the model with a confusion matrix.

confusionMatrix(predict_churn_test1, reference = churn_test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  790 115
##        Yes 231 271
##                                           
##                Accuracy : 0.7541          
##                  95% CI : (0.7307, 0.7764)
##     No Information Rate : 0.7257          
##     P-Value [Acc > NIR] : 0.008609        
##                                           
##                   Kappa : 0.4352          
##                                           
##  Mcnemar's Test P-Value : 6.312e-10       
##                                           
##             Sensitivity : 0.7021          
##             Specificity : 0.7738          
##          Pos Pred Value : 0.5398          
##          Neg Pred Value : 0.8729          
##              Prevalence : 0.2743          
##          Detection Rate : 0.1926          
##    Detection Prevalence : 0.3568          
##       Balanced Accuracy : 0.7379          
##                                           
##        'Positive' Class : Yes             
## 

4.3 Random Forest

set.seed(100)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4) # repeated 4-fold cross-validation
forest <- train(Churn ~ ., data=churn_train_balance, method="rf", trControl = ctrl)
forest
## Random Forest 
## 
## 8284 samples
##   19 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times) 
## Summary of sample sizes: 6213, 6213, 6213, 6213, 6213, 6213, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7848564  0.5697134
##   16    0.8918708  0.7837415
##   30    0.8867703  0.7735404
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.
varImp(forest)
## rf variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                               Overall
## TotalCharges                  100.000
## MonthlyCharges                 89.569
## tenure                         89.023
## ContractTwo year               48.781
## InternetServiceFiber optic     31.952
## ContractOne year               23.458
## PaymentMethodElectronic check  12.660
## genderMale                     11.200
## PaperlessBillingYes             9.148
## PartnerYes                      8.211
## SeniorCitizen                   8.179
## OnlineSecurityYes               7.832
## DependentsYes                   7.296
## OnlineBackupYes                 7.120
## TechSupportYes                  6.942
## DeviceProtectionYes             5.590
## MultipleLinesYes                5.456
## StreamingTVYes                  5.291
## PaymentMethodMailed check       5.189
## StreamingMoviesYes              5.056
forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 16
## 
##         OOB estimate of  error rate: 8.32%
## Confusion matrix:
##       No  Yes class.error
## No  3544  598  0.14437470
## Yes   91 4051  0.02197006
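
By default, caret tries only a few mtry values (here 2, 16, and 30). To search a custom grid instead, a tuneGrid can be supplied; a sketch (not run here, since retraining is slow):

# Sketch: search a custom mtry grid instead of caret's default
# grid <- expand.grid(mtry = c(4, 8, 12, 16, 20))
# forest_tuned <- train(Churn ~ ., data = churn_train_balance,
#                       method = "rf", trControl = ctrl, tuneGrid = grid)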

Now we predict on the test set.

churn_test_predict <- predict(forest, newdata = churn_test, type = "raw") # type = "raw" returns class labels

Next, we evaluate the model with a confusion matrix.

confusionMatrix(data = churn_test_predict, reference = churn_test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  862 171
##        Yes 159 215
##                                           
##                Accuracy : 0.7655          
##                  95% CI : (0.7424, 0.7874)
##     No Information Rate : 0.7257          
##     P-Value [Acc > NIR] : 0.0003826       
##                                           
##                   Kappa : 0.4052          
##                                           
##  Mcnemar's Test P-Value : 0.5448269       
##                                           
##             Sensitivity : 0.5570          
##             Specificity : 0.8443          
##          Pos Pred Value : 0.5749          
##          Neg Pred Value : 0.8345          
##              Prevalence : 0.2743          
##          Detection Rate : 0.1528          
##    Detection Prevalence : 0.2658          
##       Balanced Accuracy : 0.7006          
##                                           
##        'Positive' Class : Yes             
## 

5 Conclusion

Based on the modeling above, the Naive Bayes model achieves an accuracy of about 71%, the decision tree about 75%, and the random forest about 76%. By accuracy, the random forest is therefore the best of the three models for predicting customer churn. Note, however, that Naive Bayes has the highest sensitivity (about 81% of actual churners detected), so if missing a churner costs the business more than a false alarm, it may still be the preferred model.
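
For reference, a quick way to collect the headline metrics of all three models into a single table, using caret’s confusionMatrix() accessors:

# Collect accuracy and sensitivity for all three models in one table
cm_nb <- confusionMatrix(predict_churn_test2, churn_test$Churn, positive = "Yes")
cm_dt <- confusionMatrix(predict_churn_test1, churn_test$Churn, positive = "Yes")
cm_rf <- confusionMatrix(churn_test_predict, churn_test$Churn, positive = "Yes")

data.frame(
  Model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
  Accuracy    = c(cm_nb$overall["Accuracy"], cm_dt$overall["Accuracy"], cm_rf$overall["Accuracy"]),
  Sensitivity = c(cm_nb$byClass["Sensitivity"], cm_dt$byClass["Sensitivity"], cm_rf$byClass["Sensitivity"])
)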