In this case we use the Orange Telecom’s churn dataset, which consists of cleaned customer activity data (features), along with a churn label specifying whether a customer canceled the subscription, will be used to develop predictive models.
How to mitigate churn customer in Orange Telecom?
In this case we will comparing predictive model with Logistic Regression, Stepwise Regression, and KNN to find highest sensitivity (recall) with suitable accuracy & AIC.
Read Data
telcochurn_train <- read.csv("churn-bigml-80.csv", stringsAsFactors = TRUE) #80% of data
telcochurn_test <- read.csv("churn-bigml-20.csv", stringsAsFactors = TRUE) #20% of data
nrow(telcochurn_train)
#> [1] 2666
nrow(telcochurn_test)
#> [1] 667
Data Train
head(telcochurn_train)
Data Test
head(telcochurn_test)
Selection variable in both data
dropcols <- c("State","Area.code")
telcochurn_train <- telcochurn_train %>%
select(-one_of(dropcols))
telcochurn_test <- telcochurn_test %>%
select(-one_of(dropcols))
Focus on Data Train to create model
glimpse(telcochurn_train)
#> Rows: 2,666
#> Columns: 18
#> $ Account.length <int> 128, 107, 137, 84, 75, 118, 121, 147, 141, 74, …
#> $ International.plan <fct> No, No, No, Yes, Yes, Yes, No, Yes, Yes, No, No…
#> $ Voice.mail.plan <fct> Yes, Yes, No, No, No, No, Yes, No, Yes, No, No,…
#> $ Number.vmail.messages <int> 25, 26, 0, 0, 0, 0, 24, 0, 37, 0, 0, 0, 0, 27, …
#> $ Total.day.minutes <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2…
#> $ Total.day.calls <int> 110, 123, 114, 71, 113, 98, 88, 79, 84, 127, 96…
#> $ Total.day.charge <dbl> 45.07, 27.47, 41.38, 50.90, 28.34, 37.98, 37.09…
#> $ Total.eve.minutes <dbl> 197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5,…
#> $ Total.eve.calls <int> 99, 103, 110, 88, 122, 101, 108, 94, 111, 148, …
#> $ Total.eve.charge <dbl> 16.78, 16.62, 10.30, 5.26, 12.61, 18.75, 29.62,…
#> $ Total.night.minutes <dbl> 244.7, 254.4, 162.6, 196.9, 186.9, 203.9, 212.6…
#> $ Total.night.calls <int> 91, 103, 104, 89, 121, 118, 118, 96, 97, 94, 12…
#> $ Total.night.charge <dbl> 11.01, 11.45, 7.32, 8.86, 8.41, 9.18, 9.57, 9.5…
#> $ Total.intl.minutes <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 11.…
#> $ Total.intl.calls <int> 3, 3, 5, 7, 3, 6, 7, 6, 5, 5, 2, 5, 6, 4, 3, 5,…
#> $ Total.intl.charge <dbl> 2.70, 3.70, 3.29, 1.78, 2.73, 1.70, 2.03, 1.92,…
#> $ Customer.service.calls <int> 1, 1, 0, 2, 3, 0, 3, 0, 0, 0, 1, 3, 4, 1, 3, 1,…
#> $ Churn <fct> False, False, False, False, False, False, False…
STATE: 51 Unique States in United States of America Account Length. Length of The Account Area Code 415 relates to San Francisco,408 is of San Jose and 510 is of City of Okland International Plan Yes Indicate International Plan is Present and No Indicates no subscription for Internatinal Plan Voice Mail Plan Yes Indicates Voice Mail Plan is Present and No Indicates no subscription for Voice Mail Plan Number vmail messages Number of Voice Mail Messages ranging from 0 to 50 Total day minutes Total Number of Minutes Spent By Customers in Morning Total day calls Total Number of Calls made by Customer in Morning. Total day charge Total Charge to the Customers in Morning. Total eve minutesTotal Number of Minutes Spent By Customers in Evening Total eve calls Total Number of Calls made by Customer in Evening. Total eve charge Total Charge to the Customers in Morning. Total night minutes Total Number of Minutes Spent By Customers in the Night. Total night calls Total Number of Calls made by Customer in Night. Total night charge Total Charge to the Customers in Night.
Check NA
colSums(is.na(telcochurn_train))
#> Account.length International.plan Voice.mail.plan
#> 0 0 0
#> Number.vmail.messages Total.day.minutes Total.day.calls
#> 0 0 0
#> Total.day.charge Total.eve.minutes Total.eve.calls
#> 0 0 0
#> Total.eve.charge Total.night.minutes Total.night.calls
#> 0 0 0
#> Total.night.charge Total.intl.minutes Total.intl.calls
#> 0 0 0
#> Total.intl.charge Customer.service.calls Churn
#> 0 0 0
summary(telcochurn_train)
#> Account.length International.plan Voice.mail.plan Number.vmail.messages
#> Min. : 1.0 No :2396 No :1933 Min. : 0.000
#> 1st Qu.: 73.0 Yes: 270 Yes: 733 1st Qu.: 0.000
#> Median :100.0 Median : 0.000
#> Mean :100.6 Mean : 8.022
#> 3rd Qu.:127.0 3rd Qu.:19.000
#> Max. :243.0 Max. :50.000
#> Total.day.minutes Total.day.calls Total.day.charge Total.eve.minutes
#> Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
#> 1st Qu.:143.4 1st Qu.: 87.0 1st Qu.:24.38 1st Qu.:165.3
#> Median :179.9 Median :101.0 Median :30.59 Median :200.9
#> Mean :179.5 Mean :100.3 Mean :30.51 Mean :200.4
#> 3rd Qu.:215.9 3rd Qu.:114.0 3rd Qu.:36.70 3rd Qu.:235.1
#> Max. :350.8 Max. :160.0 Max. :59.64 Max. :363.7
#> Total.eve.calls Total.eve.charge Total.night.minutes Total.night.calls
#> Min. : 0 Min. : 0.00 Min. : 43.7 Min. : 33.0
#> 1st Qu.: 87 1st Qu.:14.05 1st Qu.:166.9 1st Qu.: 87.0
#> Median :100 Median :17.08 Median :201.2 Median :100.0
#> Mean :100 Mean :17.03 Mean :201.2 Mean :100.1
#> 3rd Qu.:114 3rd Qu.:19.98 3rd Qu.:236.5 3rd Qu.:113.0
#> Max. :170 Max. :30.91 Max. :395.0 Max. :166.0
#> Total.night.charge Total.intl.minutes Total.intl.calls Total.intl.charge
#> Min. : 1.970 Min. : 0.00 Min. : 0.000 Min. :0.000
#> 1st Qu.: 7.513 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
#> Median : 9.050 Median :10.20 Median : 4.000 Median :2.750
#> Mean : 9.053 Mean :10.24 Mean : 4.467 Mean :2.764
#> 3rd Qu.:10.640 3rd Qu.:12.10 3rd Qu.: 6.000 3rd Qu.:3.270
#> Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
#> Customer.service.calls Churn
#> Min. :0.000 False:2278
#> 1st Qu.:1.000 True : 388
#> Median :1.000
#> Mean :1.563
#> 3rd Qu.:2.000
#> Max. :9.000
Check proportion of target variable
table(telcochurn_train$Churn)
#>
#> False True
#> 2278 388
prop.table(table(telcochurn_train$Churn))
#>
#> False True
#> 0.8544636 0.1455364
From result above we know our proportion of Churn variable as target is unbalance. The next step is balancing the Churn variable with up sample, down sample, both sample, ROSE, and SMOTE.
#Up Sampling
set.seed(309)
telcochurn_train_ups <- upSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
a <- table(telcochurn_train_ups$Churn)
#Down Sampling
set.seed(309)
telcochurn_train_downs <- downSample(x = telcochurn_train %>% select(-Churn), y = telcochurn_train$Churn, yname = "Churn")
b <- table(telcochurn_train_downs$Churn)
#Both Sampling
telcochurn_train_boths <- ovun.sample(Churn~.,data = telcochurn_train, method= "both", p=0.51, N = nrow(telcochurn_train), seed = 309)$data
c <- table(telcochurn_train_boths$Churn)
#ROSE method
set.seed(309)
telcochurn_train_rose <- ROSE(Churn~.,data=telcochurn_train)$data
d <- table(telcochurn_train_rose$Churn)
#SMOTE method
set.seed(309)
telcochurn_train_smote <- SMOTE(Churn~.,data = telcochurn_train, k = 51, perc.over = 180, perc.under = 170)
e <- table(telcochurn_train_smote$Churn)
data.tuning <- data.frame(a,b,c,d,e)%>%
select(Var1,Freq,Freq.1,Freq.2,Freq.3,Freq.4) %>%
rename(Churn = Var1,
"Up Sample" = Freq,
"Down Sample" = Freq.1,
"Both Sample" = Freq.2,
ROSE = Freq.3,
SMOTE = Freq.4)
data.tuning
1. Making Naive Bayes model using all predictor variables, without upsampling or downsampling the data train.
model_nb <- naiveBayes(Churn~., telcochurn_train_smote)
pred <- predict(model_nb, telcochurn_test, type = "class")
confusionMatrix(pred, telcochurn_test$Churn, positive = "True")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction False True
#> False 403 20
#> True 169 75
#>
#> Accuracy : 0.7166
#> 95% CI : (0.6808, 0.7506)
#> No Information Rate : 0.8576
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2987
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7895
#> Specificity : 0.7045
#> Pos Pred Value : 0.3074
#> Neg Pred Value : 0.9527
#> Prevalence : 0.1424
#> Detection Rate : 0.1124
#> Detection Prevalence : 0.3658
#> Balanced Accuracy : 0.7470
#>
#> 'Positive' Class : True
#>
Note:
1. Making Decision Tree model.
model_dt <- ctree(Churn ~ ., telcochurn_train_smote)
plot(model_dt, type = "simple")
class_predict_dt <- predict(model_dt, telcochurn_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telcochurn_test$Churn, positive = "True")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction False True
#> False 512 22
#> True 60 73
#>
#> Accuracy : 0.8771
#> 95% CI : (0.8497, 0.901)
#> No Information Rate : 0.8576
#> P-Value [Acc > NIR] : 0.08103
#>
#> Kappa : 0.5687
#>
#> Mcnemar's Test P-Value : 0.00004389
#>
#> Sensitivity : 0.7684
#> Specificity : 0.8951
#> Pos Pred Value : 0.5489
#> Neg Pred Value : 0.9588
#> Prevalence : 0.1424
#> Detection Rate : 0.1094
#> Detection Prevalence : 0.1994
#> Balanced Accuracy : 0.8318
#>
#> 'Positive' Class : True
#>
2. Pruning Decision Tree model.
model_dt_pruning <- ctree(Churn ~ ., telcochurn_train_smote,
control = ctree_control(minsplit = 100,
minbucket = 50))
plot(model_dt_pruning, type = "simple")
class_predict_dt <- predict(model_dt_pruning, telcochurn_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telcochurn_test$Churn, positive = "True")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction False True
#> False 471 13
#> True 101 82
#>
#> Accuracy : 0.8291
#> 95% CI : (0.7983, 0.8569)
#> No Information Rate : 0.8576
#> P-Value [Acc > NIR] : 0.9828
#>
#> Kappa : 0.4953
#>
#> Mcnemar's Test P-Value : 0.0000000000000003691
#>
#> Sensitivity : 0.8632
#> Specificity : 0.8234
#> Pos Pred Value : 0.4481
#> Neg Pred Value : 0.9731
#> Prevalence : 0.1424
#> Detection Rate : 0.1229
#> Detection Prevalence : 0.2744
#> Balanced Accuracy : 0.8433
#>
#> 'Positive' Class : True
#>
Notes:
model_dt:model_dt_pruning:model_dt_pruning is better than models generated by Naive Bayes because it produce better Recall / Sensitivity, which is 0.863Here’s some conclusions we can take from this study case:
0.863.