1 About the Dataset

In this case we use the Orange Telecom’s churn dataset, which consists of cleaned customer activity data (features), along with a churn label specifying whether a customer canceled the subscription, will be used to develop predictive models.

1.1 Business Question

How to mitigate churn customer in Orange Telecom?

In this case we will comparing predictive model with Logistic Regression, Stepwise Regression, and KNN to find highest sensitivity (recall) with suitable accuracy & AIC.

2 Prepare the Data

Read Data

telcochurn_train <- read.csv("churn-bigml-80.csv", stringsAsFactors = TRUE) #80% of data
telcochurn_test <- read.csv("churn-bigml-20.csv", stringsAsFactors = TRUE) #20% of data

nrow(telcochurn_train)
#> [1] 2666
nrow(telcochurn_test)
#> [1] 667

Data Train

head(telcochurn_train)

Data Test

head(telcochurn_test)

Selection variable in both data

dropcols <- c("State","Area.code")

telcochurn_train <-  telcochurn_train %>%
  select(-one_of(dropcols))

telcochurn_test <- telcochurn_test %>%
  select(-one_of(dropcols))

3 Explarotary Data

Focus on Data Train to create model

glimpse(telcochurn_train)
#> Rows: 2,666
#> Columns: 18
#> $ Account.length         <int> 128, 107, 137, 84, 75, 118, 121, 147, 141, 74, …
#> $ International.plan     <fct> No, No, No, Yes, Yes, Yes, No, Yes, Yes, No, No…
#> $ Voice.mail.plan        <fct> Yes, Yes, No, No, No, No, Yes, No, Yes, No, No,…
#> $ Number.vmail.messages  <int> 25, 26, 0, 0, 0, 0, 24, 0, 37, 0, 0, 0, 0, 27, …
#> $ Total.day.minutes      <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2…
#> $ Total.day.calls        <int> 110, 123, 114, 71, 113, 98, 88, 79, 84, 127, 96…
#> $ Total.day.charge       <dbl> 45.07, 27.47, 41.38, 50.90, 28.34, 37.98, 37.09…
#> $ Total.eve.minutes      <dbl> 197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5,…
#> $ Total.eve.calls        <int> 99, 103, 110, 88, 122, 101, 108, 94, 111, 148, …
#> $ Total.eve.charge       <dbl> 16.78, 16.62, 10.30, 5.26, 12.61, 18.75, 29.62,…
#> $ Total.night.minutes    <dbl> 244.7, 254.4, 162.6, 196.9, 186.9, 203.9, 212.6…
#> $ Total.night.calls      <int> 91, 103, 104, 89, 121, 118, 118, 96, 97, 94, 12…
#> $ Total.night.charge     <dbl> 11.01, 11.45, 7.32, 8.86, 8.41, 9.18, 9.57, 9.5…
#> $ Total.intl.minutes     <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 11.…
#> $ Total.intl.calls       <int> 3, 3, 5, 7, 3, 6, 7, 6, 5, 5, 2, 5, 6, 4, 3, 5,…
#> $ Total.intl.charge      <dbl> 2.70, 3.70, 3.29, 1.78, 2.73, 1.70, 2.03, 1.92,…
#> $ Customer.service.calls <int> 1, 1, 0, 2, 3, 0, 3, 0, 0, 0, 1, 3, 4, 1, 3, 1,…
#> $ Churn                  <fct> False, False, False, False, False, False, False…

STATE: 51 Unique States in United States of America Account Length. Length of The Account Area Code 415 relates to San Francisco,408 is of San Jose and 510 is of City of Okland International Plan Yes Indicate International Plan is Present and No Indicates no subscription for Internatinal Plan Voice Mail Plan Yes Indicates Voice Mail Plan is Present and No Indicates no subscription for Voice Mail Plan Number vmail messages Number of Voice Mail Messages ranging from 0 to 50 Total day minutes Total Number of Minutes Spent By Customers in Morning Total day calls Total Number of Calls made by Customer in Morning. Total day charge Total Charge to the Customers in Morning. Total eve minutesTotal Number of Minutes Spent By Customers in Evening Total eve calls Total Number of Calls made by Customer in Evening. Total eve charge Total Charge to the Customers in Morning. Total night minutes Total Number of Minutes Spent By Customers in the Night. Total night calls Total Number of Calls made by Customer in Night. Total night charge Total Charge to the Customers in Night.

4 Data Wrangling

Check NA

colSums(is.na(telcochurn_train))
#>         Account.length     International.plan        Voice.mail.plan 
#>                      0                      0                      0 
#>  Number.vmail.messages      Total.day.minutes        Total.day.calls 
#>                      0                      0                      0 
#>       Total.day.charge      Total.eve.minutes        Total.eve.calls 
#>                      0                      0                      0 
#>       Total.eve.charge    Total.night.minutes      Total.night.calls 
#>                      0                      0                      0 
#>     Total.night.charge     Total.intl.minutes       Total.intl.calls 
#>                      0                      0                      0 
#>      Total.intl.charge Customer.service.calls                  Churn 
#>                      0                      0                      0
summary(telcochurn_train)
#>  Account.length  International.plan Voice.mail.plan Number.vmail.messages
#>  Min.   :  1.0   No :2396           No :1933        Min.   : 0.000       
#>  1st Qu.: 73.0   Yes: 270           Yes: 733        1st Qu.: 0.000       
#>  Median :100.0                                      Median : 0.000       
#>  Mean   :100.6                                      Mean   : 8.022       
#>  3rd Qu.:127.0                                      3rd Qu.:19.000       
#>  Max.   :243.0                                      Max.   :50.000       
#>  Total.day.minutes Total.day.calls Total.day.charge Total.eve.minutes
#>  Min.   :  0.0     Min.   :  0.0   Min.   : 0.00    Min.   :  0.0    
#>  1st Qu.:143.4     1st Qu.: 87.0   1st Qu.:24.38    1st Qu.:165.3    
#>  Median :179.9     Median :101.0   Median :30.59    Median :200.9    
#>  Mean   :179.5     Mean   :100.3   Mean   :30.51    Mean   :200.4    
#>  3rd Qu.:215.9     3rd Qu.:114.0   3rd Qu.:36.70    3rd Qu.:235.1    
#>  Max.   :350.8     Max.   :160.0   Max.   :59.64    Max.   :363.7    
#>  Total.eve.calls Total.eve.charge Total.night.minutes Total.night.calls
#>  Min.   :  0     Min.   : 0.00    Min.   : 43.7       Min.   : 33.0    
#>  1st Qu.: 87     1st Qu.:14.05    1st Qu.:166.9       1st Qu.: 87.0    
#>  Median :100     Median :17.08    Median :201.2       Median :100.0    
#>  Mean   :100     Mean   :17.03    Mean   :201.2       Mean   :100.1    
#>  3rd Qu.:114     3rd Qu.:19.98    3rd Qu.:236.5       3rd Qu.:113.0    
#>  Max.   :170     Max.   :30.91    Max.   :395.0       Max.   :166.0    
#>  Total.night.charge Total.intl.minutes Total.intl.calls Total.intl.charge
#>  Min.   : 1.970     Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
#>  1st Qu.: 7.513     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
#>  Median : 9.050     Median :10.20      Median : 4.000   Median :2.750    
#>  Mean   : 9.053     Mean   :10.24      Mean   : 4.467   Mean   :2.764    
#>  3rd Qu.:10.640     3rd Qu.:12.10      3rd Qu.: 6.000   3rd Qu.:3.270    
#>  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400    
#>  Customer.service.calls   Churn     
#>  Min.   :0.000          False:2278  
#>  1st Qu.:1.000          True : 388  
#>  Median :1.000                      
#>  Mean   :1.563                      
#>  3rd Qu.:2.000                      
#>  Max.   :9.000

4.1 Tuning Target Variable

Check proportion of target variable

table(telcochurn_train$Churn)
#> 
#> False  True 
#>  2278   388
prop.table(table(telcochurn_train$Churn))
#> 
#>     False      True 
#> 0.8544636 0.1455364

From result above we know our proportion of Churn variable as target is unbalance. The next step is balancing the Churn variable with up sample, down sample, both sample, ROSE, and SMOTE.

#Up Sampling
set.seed(309)
telcochurn_train_ups <- upSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
a <- table(telcochurn_train_ups$Churn)

#Down Sampling
set.seed(309)
telcochurn_train_downs <- downSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
b <- table(telcochurn_train_downs$Churn)

#Both Sampling
telcochurn_train_boths <- ovun.sample(Churn~.,data = telcochurn_train, method= "both", p=0.51, N = nrow(telcochurn_train), seed = 309)$data
c <- table(telcochurn_train_boths$Churn)

#ROSE method
set.seed(309)
telcochurn_train_rose <- ROSE(Churn~.,data=telcochurn_train)$data
d <- table(telcochurn_train_rose$Churn)

#SMOTE method
set.seed(309)
telcochurn_train_smote <- SMOTE(Churn~.,data = telcochurn_train, k = 51, perc.over = 180, perc.under = 170)
e <- table(telcochurn_train_smote$Churn)

data.tuning <- data.frame(a,b,c,d,e)%>%
  select(Var1,Freq,Freq.1,Freq.2,Freq.3,Freq.4) %>%
  rename(Churn = Var1,
         "Up Sample" = Freq,
         "Down Sample" = Freq.1,
         "Both Sample" = Freq.2,
         ROSE = Freq.3,
         SMOTE = Freq.4)

data.tuning

5 Naive Bayes Method

1. Making Naive Bayes model using all predictor variables, without upsampling or downsampling the data train.

  • Making model.
model_nb <- naiveBayes(Churn~., telcochurn_train_smote)
pred <- predict(model_nb, telcochurn_test, type = "class")
  • Model Evaluation
confusionMatrix(pred, telcochurn_test$Churn, positive = "True")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction False True
#>      False   403   20
#>      True    169   75
#>                                              
#>                Accuracy : 0.7166             
#>                  95% CI : (0.6808, 0.7506)   
#>     No Information Rate : 0.8576             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.2987             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7895             
#>             Specificity : 0.7045             
#>          Pos Pred Value : 0.3074             
#>          Neg Pred Value : 0.9527             
#>              Prevalence : 0.1424             
#>          Detection Rate : 0.1124             
#>    Detection Prevalence : 0.3658             
#>       Balanced Accuracy : 0.7470             
#>                                              
#>        'Positive' Class : True               
#> 

Note:

  • Model Summary :
    • Accuracy = 0.7166
    • Sensitivity = 0.7895

6 Decision Tree

1. Making Decision Tree model.

  • Making model.
model_dt <- ctree(Churn ~ ., telcochurn_train_smote)
  • Inspect model Decision Tree.
plot(model_dt, type = "simple")

  • Decision Tree Model Evaluation.
class_predict_dt <- predict(model_dt, telcochurn_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telcochurn_test$Churn, positive = "True")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction False True
#>      False   512   22
#>      True     60   73
#>                                          
#>                Accuracy : 0.8771         
#>                  95% CI : (0.8497, 0.901)
#>     No Information Rate : 0.8576         
#>     P-Value [Acc > NIR] : 0.08103        
#>                                          
#>                   Kappa : 0.5687         
#>                                          
#>  Mcnemar's Test P-Value : 0.00004389     
#>                                          
#>             Sensitivity : 0.7684         
#>             Specificity : 0.8951         
#>          Pos Pred Value : 0.5489         
#>          Neg Pred Value : 0.9588         
#>              Prevalence : 0.1424         
#>          Detection Rate : 0.1094         
#>    Detection Prevalence : 0.1994         
#>       Balanced Accuracy : 0.8318         
#>                                          
#>        'Positive' Class : True           
#> 

2. Pruning Decision Tree model.

  • Making model.
model_dt_pruning <- ctree(Churn ~ ., telcochurn_train_smote,
                          control = ctree_control(minsplit = 100,
                                                  minbucket = 50))
  • Inspect model Decision Tree.
plot(model_dt_pruning, type = "simple")

  • Decision Tree Model Evaluation.
class_predict_dt <- predict(model_dt_pruning, telcochurn_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telcochurn_test$Churn, positive = "True")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction False True
#>      False   471   13
#>      True    101   82
#>                                                
#>                Accuracy : 0.8291               
#>                  95% CI : (0.7983, 0.8569)     
#>     No Information Rate : 0.8576               
#>     P-Value [Acc > NIR] : 0.9828               
#>                                                
#>                   Kappa : 0.4953               
#>                                                
#>  Mcnemar's Test P-Value : 0.0000000000000003691
#>                                                
#>             Sensitivity : 0.8632               
#>             Specificity : 0.8234               
#>          Pos Pred Value : 0.4481               
#>          Neg Pred Value : 0.9731               
#>              Prevalence : 0.1424               
#>          Detection Rate : 0.1229               
#>    Detection Prevalence : 0.2744               
#>       Balanced Accuracy : 0.8433               
#>                                                
#>        'Positive' Class : True                 
#> 

Notes:

  • Model Decision Tree / model_dt:
    • Accuracy = 0.877
    • Sensitivity = 0.768
  • Model Decision Tree after pruning / model_dt_pruning:
    • Accuracy : 0.829 
    • Sensitivity : 0.863
  • Model Decision Tree after pruning or model_dt_pruning is better than models generated by Naive Bayes because it produce better Recall / Sensitivity, which is 0.863

7 Conclusion

Here’s some conclusions we can take from this study case:

  • For this dataset, we try 4 method up sampling, down sampling, both sampling, ROSE, and SMOTE.
  • The chosen Matrix Evaluation for this study case is Recall / Sensitivity, because we want to focus on class positive. In this case, as we want to improve our services regarding to customer’s feedback, firstly we have to know which customer is going to stop using our services.
  • Model Desicion Tree with SMOTE is the best model for this study case. It generate sensitivity of 0.863.