Customer churn, also known as customer attrition, occurs when customers stop doing business with a company or stop using its services. By monitoring the churn rate, companies can measure how well they retain customers and identify strategies for improvement. A machine learning model can help reveal the customer behaviors and attributes that signal the risk of churn. In this project, I predict Telco customer churn based on customer behavior, using a dataset taken from Kaggle.
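For reference, the code in this post assumes the following setup. The package list is inferred from the functions used below, and the CSV file name is the standard one from the Kaggle Telco churn dataset, so both are assumptions that may differ from the original environment.
library(dplyr)         # data wrangling (select, mutate, group_by)
library(ggplot2)       # plots
library(stringr)       # str_wrap() for long axis labels
library(cowplot)       # plot_grid()
library(DT)            # datatable()
library(e1071)         # naiveBayes()
library(partykit)      # ctree()
library(caret)         # train(), confusionMatrix(), createDataPartition()
library(randomForest)  # random forest backend for caret
library(ROCR)          # prediction(), performance()
library(animation)     # cv.ani() demo
# assumed file name, as distributed on Kaggle
churn <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = TRUE)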
# drop the ID column and convert SeniorCitizen (coded 0/1) to a factor
churn <- churn %>%
  select(-customerID) %>%
  mutate(SeniorCitizen = as.factor(SeniorCitizen))
# interactive preview of the cleaned data
datatable(churn, rownames = FALSE, filter = "top",
          options = list(pageLength = 6, scrollX = TRUE))
churn %>%
  group_by(Churn) %>%
  summarise(Number = n()) %>%
  mutate(Percent = prop.table(Number) * 100) %>%
  ggplot(aes(Churn, Percent)) +
  geom_col(aes(fill = Churn)) +
  geom_text(aes(label = sprintf("%.2f%%", Percent)), vjust = -0.5, size = 4) +
  labs(title = "Churn Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))  # placed after theme_minimal() so the centering is not overridden
From the plot, we can see that 26.54% of customers churned, while the remaining 73.46% did not.
churn %>%
  group_by(tenure, Churn) %>%
  summarise(Number = n()) %>%
  ggplot(aes(tenure, Number)) +
  geom_line(aes(col = Churn)) +
  labs(x = "Tenure (month)",
       y = "Number of Customers",
       title = "Churn Based on Tenure") +
  scale_x_continuous(breaks = seq(0, 70, 10)) +
  theme_minimal()
From the graph, we can see that churn is highest within the first month of tenure.
plot_grid(
  ggplot(churn, aes(x = gender, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = SeniorCitizen, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = Partner, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = Dependents, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage") +
    theme_minimal() +
    scale_x_discrete(labels = function(x) str_wrap(x, width = 10)),
  align = "h"
)
From the customer characteristics, several insights emerge:
Male and female customers churn at roughly the same rate.
Senior citizens are more likely to churn.
Customers without a partner are more likely to churn than customers who have one.
Customers without dependents are more likely to churn than customers with dependents.
plot_grid(
  ggplot(churn, aes(x = InternetService, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = OnlineSecurity, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = OnlineBackup, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = DeviceProtection, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = PhoneService, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = MultipleLines, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = TechSupport, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage"),
  ggplot(churn, aes(x = StreamingTV, fill = Churn)) + geom_bar(position = "fill") + labs(y = "Percentage") +
    theme_minimal() +
    scale_x_discrete(labels = function(x) str_wrap(x, width = 10)),
  align = "h"
)
From the services used by the customers, several insights emerge:
Customers with fiber optic internet service are more likely to churn than others.
Customers who do not use online security, online backup, device protection, streaming TV, or tech support are more likely to churn than others.
Next, inspect the correlation between the numerical columns of the dataset. Because TotalCharges is strongly correlated with both tenure and MonthlyCharges, the TotalCharges column can be dropped, as sketched below.
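A minimal sketch of this check; the exact calls are assumed rather than taken from the original code.
# correlation matrix of the numeric columns; TotalCharges contains a few NAs
churn %>%
  select(tenure, MonthlyCharges, TotalCharges) %>%
  cor(use = "complete.obs")
# drop the redundant column
churn <- churn %>% select(-TotalCharges)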
In this project, I use three machine learning models: Naive Bayes, Decision Tree, and Random Forest.
To build the models, I split the dataset into training and test data: the training data is used to fit the models, and the test data serves as unseen data for evaluating the predictions.
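The split code itself is not shown in the original, but the reported row counts (5,636 train / 1,407 test, with identical churn prevalence in both) are consistent with a stratified 80/20 split such as caret's createDataPartition; the seed is an assumption.
set.seed(123)  # assumed seed, for reproducibility
idx <- createDataPartition(churn$Churn, p = 0.8, list = FALSE)
churn_train <- churn[idx, ]
churn_test  <- churn[-idx, ]
test_label  <- churn_test$Churn  # used later by ROCR::prediction()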
# fit a Naive Bayes classifier on the training data
model_naive <- naiveBayes(formula = Churn ~ ., data = churn_train)
# class predictions on the train and test sets
predict1 <- predict(object = model_naive, newdata = churn_train)
predict2 <- predict(object = model_naive, newdata = churn_test)
Next, create confusion matrices for the train and test data, to check whether the model overfits.
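The confusionMatrix() calls producing the output below are not shown in the original; presumably they follow the same pattern as in the decision tree section:
# for data train
confusionMatrix(data = predict1, reference = churn_train$Churn, positive = "Yes")
# for data test
confusionMatrix(data = predict2, reference = churn_test$Churn, positive = "Yes")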
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 2844 284
Yes 1296 1212
Accuracy : 0.7197
95% CI : (0.7077, 0.7314)
No Information Rate : 0.7346
P-Value [Acc > NIR] : 0.9944
Kappa : 0.4088
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8102
Specificity : 0.6870
Pos Pred Value : 0.4833
Neg Pred Value : 0.9092
Prevalence : 0.2654
Detection Rate : 0.2150
Detection Prevalence : 0.4450
Balanced Accuracy : 0.7486
'Positive' Class : Yes
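As a sanity check on where these numbers come from: with Yes as the positive class, sensitivity = TP / (TP + FN) = 1212 / (1212 + 284) = 0.8102, and specificity = TN / (TN + FP) = 2844 / (2844 + 1296) = 0.6870, matching the summary above. The corresponding output for the test data: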
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 712 58
Yes 322 315
Accuracy : 0.7299
95% CI : (0.7059, 0.753)
No Information Rate : 0.7349
P-Value [Acc > NIR] : 0.6761
Kappa : 0.4347
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8445
Specificity : 0.6886
Pos Pred Value : 0.4945
Neg Pred Value : 0.9247
Prevalence : 0.2651
Detection Rate : 0.2239
Detection Prevalence : 0.4527
Balanced Accuracy : 0.7665
'Positive' Class : Yes
Visualize ROC curve for Naive-Bayes Model
ROC is a curve that illustrates how well the model separates the positive and negative classes. The plot has two axes:
x: False Positive Rate (1 - specificity)
y: True Positive Rate (sensitivity, i.e. recall)
# predicted churn probabilities on the test set
pred_prob_naive <- predict(object = model_naive, newdata = churn_test, type = "raw")
# ROCR objects: column 2 holds P(Churn = "Yes")
pred.naive <- prediction(pred_prob_naive[,2], labels = test_label)
perf.naive <- performance(prediction.obj = pred.naive, measure = "tpr", "fpr")
plot(perf.naive, colorize = T)
Checking AUC (Area under the curve) value
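The AUC call itself is not shown in the original; with ROCR it is presumably computed like this (the same pattern applies to the decision tree and random forest models below):
auc_naive <- performance(prediction.obj = pred.naive, measure = "auc")
auc_naive@y.values[[1]]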
[1] 0.8412656
# conditional inference tree; mincriterion = 0.95 requires each split to be
# statistically significant at the 1 - 0.95 = 0.05 level
model_dt <- ctree(formula = Churn ~ ., data = churn_train,
                  control = ctree_control(mincriterion = 0.95))
plot(model_dt, type = "simple")
Make predictions with the decision tree model
pred_train_dt <- predict(object = model_dt, newdata = churn_train)
pred_test_dt <- predict(object = model_dt, newdata = churn_test)
Create confusion matrices for the train and test data.
# for data train
confusionMatrix(data = pred_train_dt, reference = churn_train$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 3818 816
Yes 322 680
Accuracy : 0.7981
95% CI : (0.7874, 0.8085)
No Information Rate : 0.7346
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4212
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4545
Specificity : 0.9222
Pos Pred Value : 0.6786
Neg Pred Value : 0.8239
Prevalence : 0.2654
Detection Rate : 0.1207
Detection Prevalence : 0.1778
Balanced Accuracy : 0.6884
'Positive' Class : Yes
# for data test
confusionMatrix(data = pred_test_dt, reference = churn_test$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 937 196
Yes 97 177
Accuracy : 0.7918
95% CI : (0.7696, 0.8127)
No Information Rate : 0.7349
P-Value [Acc > NIR] : 4.327e-07
Kappa : 0.416
Mcnemar's Test P-Value : 1.033e-08
Sensitivity : 0.4745
Specificity : 0.9062
Pos Pred Value : 0.6460
Neg Pred Value : 0.8270
Prevalence : 0.2651
Detection Rate : 0.1258
Detection Prevalence : 0.1947
Balanced Accuracy : 0.6904
'Positive' Class : Yes
Comparing the two decision tree confusion matrices, the model is slightly more accurate on the training data than on the test data, which is expected because the model was fit on the training data. Since the difference is small, the model does not appear to overfit. The matrices show a poor sensitivity score but a good specificity score, meaning the decision tree model is good at identifying customers who do not churn but weak at catching the ones who do.
Visualize ROC curve for Decision Tree Model
pred_prob_dt <- predict(object = model_dt, newdata = churn_test, type = "prob")
pred.dt <- prediction(pred_prob_dt[,2], labels = test_label)
perf.dt <- performance(prediction.obj = pred.dt, measure = "tpr", "fpr")
plot(perf.dt, colorize = T)
Checking AUC (Area under the curve) value
[1] 0.8365921
Next, I create a random forest model on the training data with k-fold cross-validation, using 5 folds and 3 repetitions.
# training is slow, so the call is kept commented out here
# ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# model_forest <- train(Churn ~ ., data = churn_train, method = "rf", trControl = ctrl)
Below is a demonstration of k-fold cross-validation, animated with the animation package:
ani.options(interval = 1, nmax = 25)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
Random Forest
5636 samples
18 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 4509, 4508, 4509, 4509, 4509, 4508, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7884444 0.3429087
15 0.7947143 0.4342739
29 0.7921121 0.4299515
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 15
OOB estimate of error rate: 20.53%
Confusion matrix:
No Yes class.error
No 3716 424 0.1024155
Yes 733 763 0.4899733
The model shows the best accuracy at mtry = 15, i.e. with 15 candidate predictors randomly sampled at each split.
Next, evaluate the random forest model on the train and test data.
# for data train
pred_train_rf <- predict(model_forest, churn_train)
confusionMatrix(data = pred_train_rf, reference = churn_train$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 4051 183
Yes 89 1313
Accuracy : 0.9517
95% CI : (0.9458, 0.9572)
No Information Rate : 0.7346
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8737
Mcnemar's Test P-Value : 1.711e-08
Sensitivity : 0.8777
Specificity : 0.9785
Pos Pred Value : 0.9365
Neg Pred Value : 0.9568
Prevalence : 0.2654
Detection Rate : 0.2330
Detection Prevalence : 0.2488
Balanced Accuracy : 0.9281
'Positive' Class : Yes
# for data test
churn_pred <- predict(model_forest, churn_test)
confusionMatrix(data = churn_pred, reference = churn_test$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1004 37
Yes 30 336
Accuracy : 0.9524
95% CI : (0.9399, 0.9629)
No Information Rate : 0.7349
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8771
Mcnemar's Test P-Value : 0.4635
Sensitivity : 0.9008
Specificity : 0.9710
Pos Pred Value : 0.9180
Neg Pred Value : 0.9645
Prevalence : 0.2651
Detection Rate : 0.2388
Detection Prevalence : 0.2601
Balanced Accuracy : 0.9359
'Positive' Class : Yes
From the random forest model, we can also inspect the most important variables used across each fold and repetition, as sketched below.
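The importance output itself is not shown in the original; with a caret-trained model it would presumably be produced along these lines:
# variable importance from the final random forest
varImp(model_forest)
plot(varImp(model_forest), top = 10)  # top 10 predictors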
Visualize ROC Curve for Random Forest Model
pred_prob_rf <- predict(object = model_forest, newdata = churn_test, type = "prob")
pred.forest <- prediction(pred_prob_rf[,2], labels = test_label)
perf.rf <- performance(prediction.obj = pred.forest, measure = "tpr", "fpr")
plot(perf.rf, colorize = T)
Checking AUC (Area under the curve) value
[1] 0.9863787
From the three models, I obtained the following accuracy, recall, and AUC on the test data:
Naive Bayes model: Accuracy = 72.99%, Recall = 84.45%, AUC = 84.13%
Decision Tree model: Accuracy = 79.18%, Recall = 47.45%, AUC = 83.66%
Random Forest model: Accuracy = 95.24%, Recall = 90.08%, AUC = 98.64%
I conclude that the random forest model gives the best predictions among the three, scoring highest on accuracy, recall, and AUC.