telco <- read.csv("telcochurn.csv", stringsAsFactors = TRUE)
glimpse(telco)
#> Rows: 7,043
#> Columns: 22
#> $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
#> $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-~
#> $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
#> $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
#> $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
#> $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
#> $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
#> $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
#> $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
#> $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
#> $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,~
#> $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N~
#> $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y~
#> $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,~
#> $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye~
#> $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes~
#> $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M~
#> $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
#> $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr~
#> $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
#> $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
#> $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
Data Explanation:
Overall, the column in dataset is self explanatory. Maybe only a few that needed further explanation.
Business Question:
We assume ourself to be a data analyst in a Telco company. We were asked to build the best model to predict the customer that intent to stop using our services, so we could ask them for their feedback of our services in order to improve our services.
telco <- telco %>%
select(-c(X, customerID))
colSums(is.na(telco))
#> gender SeniorCitizen Partner Dependents
#> 0 0 0 0
#> tenure PhoneService MultipleLines InternetService
#> 0 0 0 0
#> OnlineSecurity OnlineBackup DeviceProtection TechSupport
#> 0 0 0 0
#> StreamingTV StreamingMovies Contract PaperlessBilling
#> 0 0 0 0
#> PaymentMethod MonthlyCharges TotalCharges Churn
#> 0 0 11 0
telco_clean <- telco %>%
na.omit()
colSums(is.na(telco_clean))
#> gender SeniorCitizen Partner Dependents
#> 0 0 0 0
#> tenure PhoneService MultipleLines InternetService
#> 0 0 0 0
#> OnlineSecurity OnlineBackup DeviceProtection TechSupport
#> 0 0 0 0
#> StreamingTV StreamingMovies Contract PaperlessBilling
#> 0 0 0 0
#> PaymentMethod MonthlyCharges TotalCharges Churn
#> 0 0 0 0
prop.table(table(telco_clean$Churn))
#>
#> No Yes
#> 0.734215 0.265785
table(telco_clean$Churn)
#>
#> No Yes
#> 5163 1869
Note:
The proportion of target variable is imbalance, but we keep going for now. Will do the upsampling and/or downsampling later.
Split the original data to data train and data test, and then check proportion of target variable in data train.
RNGkind(sample.kind= "Rounding")
set.seed(417)
idx <- sample(nrow(telco_clean), nrow(telco_clean)*0.8)
telco_train <- telco_clean[idx,]
telco_test <- telco_clean[-idx,]
prop.table(table(telco_train$Churn))
#>
#> No Yes
#> 0.7347556 0.2652444
1. Making Naive Bayes model using all predictor variables, without upsampling or downsampling the data train.
model_nb <- naiveBayes(Churn~., telco_train)
pred <- predict(model_nb, telco_test, type = "class")
confusionMatrix(pred, telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 829 113
#> Yes 201 264
#>
#> Accuracy : 0.7768
#> 95% CI : (0.7542, 0.7983)
#> No Information Rate : 0.7321
#> P-Value [Acc > NIR] : 0.0000641500
#>
#> Kappa : 0.4703
#>
#> Mcnemar's Test P-Value : 0.0000009122
#>
#> Sensitivity : 0.7003
#> Specificity : 0.8049
#> Pos Pred Value : 0.5677
#> Neg Pred Value : 0.8800
#> Prevalence : 0.2679
#> Detection Rate : 0.1876
#> Detection Prevalence : 0.3305
#> Balanced Accuracy : 0.7526
#>
#> 'Positive' Class : Yes
#>
Note:
2. Making Naive Bayes model using all predictor variables, and downsampling the data train.
telco_train_down <- downSample(x = telco_train %>% select(-Churn), y = telco_train$Churn, yname = "Churn")
table(telco_train_down$Churn)
#>
#> No Yes
#> 1492 1492
prop.table(table(telco_train_down$Churn))
#>
#> No Yes
#> 0.5 0.5
model_nb_down <- naiveBayes(Churn~., telco_train_down)
pred_down <- predict(model_nb_down, telco_test, type = "class")
confusionMatrix(pred_down, telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 755 82
#> Yes 275 295
#>
#> Accuracy : 0.7463
#> 95% CI : (0.7227, 0.7688)
#> No Information Rate : 0.7321
#> P-Value [Acc > NIR] : 0.1199
#>
#> Kappa : 0.4435
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7825
#> Specificity : 0.7330
#> Pos Pred Value : 0.5175
#> Neg Pred Value : 0.9020
#> Prevalence : 0.2679
#> Detection Rate : 0.2097
#> Detection Prevalence : 0.4051
#> Balanced Accuracy : 0.7578
#>
#> 'Positive' Class : Yes
#>
Note:
3. Making Naive Bayes model using all predictor variables, and upsampling the data train.
telco_train_up <- upSample(x = telco_train %>% select(-Churn), y = telco_train$Churn, yname = "Churn")
table(telco_train_up$Churn)
#>
#> No Yes
#> 4133 4133
prop.table(table(telco_train_up$Churn))
#>
#> No Yes
#> 0.5 0.5
model_nb_up <- naiveBayes(Churn~., telco_train_up)
pred_up <- predict(model_nb_up, telco_test, type = "class")
confusionMatrix(pred_up, telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 755 83
#> Yes 275 294
#>
#> Accuracy : 0.7456
#> 95% CI : (0.7219, 0.7681)
#> No Information Rate : 0.7321
#> P-Value [Acc > NIR] : 0.1325
#>
#> Kappa : 0.4416
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7798
#> Specificity : 0.7330
#> Pos Pred Value : 0.5167
#> Neg Pred Value : 0.9010
#> Prevalence : 0.2679
#> Detection Rate : 0.2090
#> Detection Prevalence : 0.4044
#> Balanced Accuracy : 0.7564
#>
#> 'Positive' Class : Yes
#>
Note:
ROC and AUC
pred_prob <- predict(model_nb, telco_test, type = "raw")
head(pred_prob)
#> No Yes
#> [1,] 0.02250140 0.97749860197
#> [2,] 0.01464784 0.98535215574
#> [3,] 0.99998922 0.00001078168
#> [4,] 0.99420460 0.00579539609
#> [5,] 0.99942708 0.00057292442
#> [6,] 0.99841864 0.00158136367
pred_rocr <- prediction(predictions = pred_prob[,2],
labels = as.numeric(ifelse(test = telco_test$Churn == "Yes", 1, 0)))
perform <- performance(prediction.obj = pred_rocr, measure = "tpr", x.measure = "fpr")
# roc plot
plot(perform)
# auc
auc <- performance(pred_rocr, "auc")
auc@y.values
#> [[1]]
#> [1] 0.8408424
pred_prob_down <- predict(model_nb_down, telco_test, type = "raw")
head(pred_prob_down)
#> No Yes
#> [1,] 0.008816022 0.99118397761
#> [2,] 0.005479125 0.99452087518
#> [3,] 0.999969267 0.00003073294
#> [4,] 0.984380651 0.01561934880
#> [5,] 0.998128070 0.00187192969
#> [6,] 0.995777509 0.00422249090
pred_rocr_down <- prediction(predictions = pred_prob_down[,2],
labels = as.numeric(ifelse(test = telco_test$Churn == "Yes", 1, 0)))
perf_down <- performance(prediction.obj = pred_rocr_down, measure = "tpr", x.measure = "fpr")
# roc plot
plot(perf_down)
# auc
auc_down <- performance(pred_rocr_down, "auc")
auc_down@y.values
#> [[1]]
#> [1] 0.8408115
Naive Bayes model evaluation:
model_nb_down is the better model. The details is below:model_nb_down produced less AUC value rather than model_nb, which is 0.8408115 compared to 0.8408424, but it has the better Recall / Sensitivity as a Matrix.1. Making Decision Tree model.
model_dt <- ctree(Churn ~ ., telco_train_down)
plot(model_dt, type = "simple")
class_predict_dt <- predict(model_dt, telco_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 780 86
#> Yes 250 291
#>
#> Accuracy : 0.7612
#> 95% CI : (0.738, 0.7833)
#> No Information Rate : 0.7321
#> P-Value [Acc > NIR] : 0.0069
#>
#> Kappa : 0.465
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7719
#> Specificity : 0.7573
#> Pos Pred Value : 0.5379
#> Neg Pred Value : 0.9007
#> Prevalence : 0.2679
#> Detection Rate : 0.2068
#> Detection Prevalence : 0.3845
#> Balanced Accuracy : 0.7646
#>
#> 'Positive' Class : Yes
#>
2. Pruning Decision Tree model.
model_dt_pruning <- ctree(Churn ~ ., telco_train_down,
control = ctree_control(minsplit = 100,
minbucket = 50))
plot(model_dt_pruning, type = "simple")
class_predict_dt <- predict(model_dt_pruning, telco_test, type = "response")
confusionMatrix(as.factor(class_predict_dt),telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 775 78
#> Yes 255 299
#>
#> Accuracy : 0.7633
#> 95% CI : (0.7402, 0.7853)
#> No Information Rate : 0.7321
#> P-Value [Acc > NIR] : 0.004052
#>
#> Kappa : 0.4749
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.7931
#> Specificity : 0.7524
#> Pos Pred Value : 0.5397
#> Neg Pred Value : 0.9086
#> Prevalence : 0.2679
#> Detection Rate : 0.2125
#> Detection Prevalence : 0.3937
#> Balanced Accuracy : 0.7728
#>
#> 'Positive' Class : Yes
#>
Notes:
model_dt:model_dt_pruning:model_dt_pruning is better than models generated by Naive Bayes because it produce better Recall / Sensitivity, which is 0.7931set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4)
# model_rf <- train(Churn ~ ., data = telco_train_down, method="rf", trControl = ctrl)
# saveRDS(model_rf, file = "model_rf.rds")
model_rf <- readRDS("model_rf.RDS")
# Print Output
model_rf
#> Random Forest
#>
#> 2984 samples
#> 19 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold, repeated 4 times)
#> Summary of sample sizes: 2238, 2238, 2238, 2238, 2238, 2238, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.7615617 0.5231233
#> 12 0.7554457 0.5108914
#> 23 0.7533512 0.5067024
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
# Model Configuration
model_rf$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 23.59%
#> Confusion matrix:
#> No Yes class.error
#> No 1084 408 0.2734584
#> Yes 296 1196 0.1983914
# Model Visualization
plot(model_rf)
# Important predictor variables to predict the target variables
varImp(model_rf)
#> rf variable importance
#>
#> only 20 most important variables shown (out of 23)
#>
#> Overall
#> tenure 100.000
#> TotalCharges 82.679
#> MonthlyCharges 78.872
#> InternetServiceFiber optic 46.534
#> ContractTwo year 44.927
#> PaymentMethodElectronic check 29.326
#> ContractOne year 24.326
#> InternetServiceNo 20.647
#> TechSupportYes 17.443
#> OnlineSecurityYes 16.276
#> PaperlessBillingYes 13.717
#> OnlineBackupYes 7.447
#> SeniorCitizen 7.060
#> DependentsYes 6.853
#> PartnerYes 6.587
#> PaymentMethodCredit card (automatic) 4.827
#> StreamingMoviesYes 4.684
#> DeviceProtectionYes 4.587
#> MultipleLinesYes 4.525
#> genderMale 4.417
pred_rf <- predict(model_rf, newdata = telco_test, positive = "Yes")
confusionMatrix(pred_rf, reference = telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 763 65
#> Yes 267 312
#>
#> Accuracy : 0.764
#> 95% CI : (0.741, 0.786)
#> No Information Rate : 0.7321
#> P-Value [Acc > NIR] : 0.003369
#>
#> Kappa : 0.4858
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8276
#> Specificity : 0.7408
#> Pos Pred Value : 0.5389
#> Neg Pred Value : 0.9215
#> Prevalence : 0.2679
#> Detection Rate : 0.2217
#> Detection Prevalence : 0.4115
#> Balanced Accuracy : 0.7842
#>
#> 'Positive' Class : Yes
#>
Random Forest Model tuning.
mtry.set.seed(417)
grid <- expand.grid(.mtry = c(3,4,5))
# model_rf_tune <- train(Churn ~ ., data = telco_train_down,
# method="rf",
# trControl = ctrl,
# tuneGrid = grid)
# saveRDS(model_rf_tune, file = "model_rf_tune.rds")
model_rf_tune <- readRDS("model_rf_tune.RDS")
# Print Output
model_rf_tune
#> Random Forest
#>
#> 2984 samples
#> 19 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold, repeated 4 times)
#> Summary of sample sizes: 2238, 2238, 2238, 2238, 2238, 2238, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 3 0.7647453 0.5294906
#> 4 0.7620643 0.5241287
#> 5 0.7592158 0.5184316
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 3.
# Model Configuration
model_rf_tune$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 3
#>
#> OOB estimate of error rate: 24.3%
#> Confusion matrix:
#> No Yes class.error
#> No 1083 409 0.2741287
#> Yes 316 1176 0.2117962
# Model Visualization
plot(model_rf_tune)
# Important predictor variables to predict the target variables
varImp(model_rf_tune)
#> rf variable importance
#>
#> only 20 most important variables shown (out of 23)
#>
#> Overall
#> tenure 100.000
#> TotalCharges 94.292
#> MonthlyCharges 89.705
#> ContractTwo year 39.427
#> InternetServiceFiber optic 37.174
#> PaymentMethodElectronic check 20.812
#> ContractOne year 18.846
#> InternetServiceNo 14.023
#> TechSupportYes 12.830
#> OnlineSecurityYes 11.967
#> PaperlessBillingYes 10.857
#> genderMale 8.004
#> OnlineBackupYes 6.998
#> PartnerYes 6.608
#> DependentsYes 6.561
#> SeniorCitizen 6.205
#> MultipleLinesYes 5.531
#> StreamingMoviesYes 4.962
#> StreamingTVYes 4.611
#> DeviceProtectionYes 4.345
pred_rf_tune <- predict(model_rf_tune, newdata = telco_test, positive = "Yes")
confusionMatrix(pred_rf_tune, reference = telco_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 770 71
#> Yes 260 306
#>
#> Accuracy : 0.7647
#> 95% CI : (0.7417, 0.7867)
#> No Information Rate : 0.7321
#> P-Value [Acc > NIR] : 0.002791
#>
#> Kappa : 0.4826
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8117
#> Specificity : 0.7476
#> Pos Pred Value : 0.5406
#> Neg Pred Value : 0.9156
#> Prevalence : 0.2679
#> Detection Rate : 0.2175
#> Detection Prevalence : 0.4023
#> Balanced Accuracy : 0.7796
#>
#> 'Positive' Class : Yes
#>
Note:
model_rf:model_rf_tune:
mtry and trees, but in this case it won’t do anything good to our model. Therefore the best model is still the model that generated by Random Forest. It has better accuracy and Sensitivity.Here’s some conclusions we can take from this study case:
model_rf is the best model for this study case. It generate sensitivity of 0.8276.tenure or their duration of using our services.