Customer churn is the loss of customers from a business. It is measured by how many customers leave the business in a given period. Churn matters because it shows how well a business retains its customers.
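As a small illustration of that calculation, the sketch below computes a monthly churn rate as the share of customers lost. The figures are hypothetical and do not come from the data set used later.

```r
# Hypothetical figures for one month (not taken from the Telco data)
customers_at_start <- 1000
customers_lost <- 50

# Churn rate: share of the starting customers who left during the period
churn_rate <- customers_lost / customers_at_start
retention_rate <- 1 - churn_rate

cat(sprintf("Churn rate: %.1f%%\n", 100 * churn_rate))
cat(sprintf("Retention rate: %.1f%%\n", 100 * retention_rate))
```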
Losing customers is costly. By commonly cited estimates, acquiring a new customer is 5 times more expensive than retaining an existing one, and bringing a new customer to the same level of loyalty is 16 times more expensive. A business therefore needs a strategy to reduce churn and keep the customers it already has, because they are its main source of revenue. In this case study we predict customer churn.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
## Loading required package: lattice
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:gtools':
##
## permutations
churn <- read.csv("Telco_Customer_Churn.csv", stringsAsFactors = TRUE)
head(churn)
Each row represents a customer, and each column contains one of the customer's attributes, as described in the column metadata.
The data set includes information about:
- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents
str(churn)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
The data types are correct. We now drop the customerID column because it is not needed for modeling.
churn <- churn %>%
  select(-customerID)
head(churn)
Next, we check for missing values.
colSums(is.na(churn))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 11 0
TotalCharges has 11 missing values. That is a small share of the 7043 rows, so we remove those rows.
churn <- na.omit(churn)
colSums(is.na(churn))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
Now there are no missing values.
First, we split the data into training data (80%) and test data (20%).
library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(churn), size = nrow(churn)*0.8)
churn_train <- churn[index,]
churn_test <- churn[-index,]
Next, we check the proportion of the target variable in the training data.
prop.table(table(churn_train$Churn))
##
## No Yes
## 0.7363556 0.2636444
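To see why these proportions matter, consider a trivial classifier that always predicts the majority class. The sketch below uses only the two proportions printed above; the variable names are illustrative. Such a classifier would look accurate while never identifying a single churner.

```r
# Class proportions reported above for the training data
train_props <- c(No = 0.7363556, Yes = 0.2636444)

# A trivial "model" that always predicts the most frequent class
majority_class <- names(which.max(train_props))

# Its accuracy equals the share of the majority class
baseline_accuracy <- max(train_props)

# It never predicts "Yes", so it catches no churners at all
baseline_sensitivity <- 0

cat("Always predict:", majority_class, "\n")
cat(sprintf("Baseline accuracy: %.1f%%\n", 100 * baseline_accuracy))
cat(sprintf("Baseline sensitivity: %.1f%%\n", 100 * baseline_sensitivity))
```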
The target variable is imbalanced: about 74% of the customers in the training data did not churn. We therefore upsample the minority class to balance the training data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
churn_train_balance <- upSample(x = churn_train %>% select(-Churn),
                                y = churn_train$Churn,
                                yname = "Churn")
prop.table(table(churn_train_balance$Churn))
##
## No Yes
## 0.5 0.5
Now the target variable is balanced in the training data.
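The idea behind upsampling can be sketched in base R: rows of the smaller class are resampled with replacement until every class has as many rows as the largest one. This is an illustration on hypothetical toy data, not caret's actual implementation of upSample.

```r
set.seed(100)

# Toy data (hypothetical): 8 "No" rows and 2 "Yes" rows
toy <- data.frame(
  x = 1:10,
  Churn = factor(c(rep("No", 8), rep("Yes", 2)))
)

# Target size: the row count of the largest class
n_max <- max(table(toy$Churn))

# For each class, keep the original rows and draw extra rows
# with replacement until the class reaches n_max rows
idx_by_class <- split(seq_len(nrow(toy)), toy$Churn)
idx_balanced <- unlist(lapply(idx_by_class, function(idx) {
  extra <- idx[sample.int(length(idx), n_max - length(idx), replace = TRUE)]
  c(idx, extra)
}))

toy_balanced <- toy[idx_balanced, ]
table(toy_balanced$Churn)
```

Both classes end up with 8 rows. Because the extra "Yes" rows are duplicates, upsampling should be applied to the training data only, as is done above, so that the test data stays untouched.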
The first model is naive Bayes.
model_naive_bayes <- naiveBayes(Churn ~ ., data = churn_train_balance)
Next, we predict the class of each customer in the test data.
predict_churn_test2 <- predict(model_naive_bayes, churn_test, type = "class")
Then we evaluate the model.
confusionMatrix(predict_churn_test2, churn_test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 689 75
## Yes 332 311
##
## Accuracy : 0.7107
## 95% CI : (0.6863, 0.7343)
## No Information Rate : 0.7257
## P-Value [Acc > NIR] : 0.9001
##
## Kappa : 0.3981
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8057
## Specificity : 0.6748
## Pos Pred Value : 0.4837
## Neg Pred Value : 0.9018
## Prevalence : 0.2743
## Detection Rate : 0.2210
## Detection Prevalence : 0.4570
## Balanced Accuracy : 0.7403
##
## 'Positive' Class : Yes
##
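The main statistics in this output follow directly from the four counts in the confusion matrix. As a worked check, the sketch below recomputes them by hand with "Yes" as the positive class; the variable names are illustrative.

```r
# Counts copied from the naive Bayes confusion matrix above
tp <- 311  # predicted Yes, actually Yes
fn <- 75   # predicted No,  actually Yes
fp <- 332  # predicted Yes, actually No
tn <- 689  # predicted No,  actually No

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)                   # share of actual churners that are caught
specificity <- tn / (tn + fp)                   # share of non-churners correctly identified
precision   <- tp / (tp + fp)                   # "Pos Pred Value" in the output above

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The rounded values are 0.7107, 0.8057, 0.6748, and 0.4837, which agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines reported by confusionMatrix.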
The second model is a decision tree (a conditional inference tree fitted with ctree).
model_decision_tree <- ctree(formula = Churn ~ ., data = churn_train_balance)
Next, we predict the class of each customer in the test data.
predict_churn_test1 <- predict(model_decision_tree, churn_test, type = "response")
Then we evaluate the model.
confusionMatrix(predict_churn_test1, reference = churn_test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 790 115
## Yes 231 271
##
## Accuracy : 0.7541
## 95% CI : (0.7307, 0.7764)
## No Information Rate : 0.7257
## P-Value [Acc > NIR] : 0.008609
##
## Kappa : 0.4352
##
## Mcnemar's Test P-Value : 6.312e-10
##
## Sensitivity : 0.7021
## Specificity : 0.7738
## Pos Pred Value : 0.5398
## Neg Pred Value : 0.8729
## Prevalence : 0.2743
## Detection Rate : 0.1926
## Detection Prevalence : 0.3568
## Balanced Accuracy : 0.7379
##
## 'Positive' Class : Yes
##
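The third model is a random forest, tuned with repeated 4-fold cross-validation.
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 4) # 4-fold cross-validation, repeated 4 times
forest <- train(Churn ~ ., data = churn_train_balance, method = "rf", trControl = ctrl)
forest
## Random Forest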
##
## 8284 samples
## 19 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times)
## Summary of sample sizes: 6213, 6213, 6213, 6213, 6213, 6213, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7848564 0.5697134
## 16 0.8918708 0.7837415
## 30 0.8867703 0.7735404
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.
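The sample sizes in this output can be checked with a little arithmetic. The sketch below uses only numbers printed above and assumes, as is caret's usual behavior, that the final model is refitted once on the full training data after tuning.

```r
n_samples <- 8284  # rows in churn_train_balance, from the output above
k <- 4             # number = 4 in trainControl
repeats <- 4       # repeats = 4 in trainControl
n_mtry <- 3        # mtry candidates tried: 2, 16, 30

fold_size <- n_samples / k            # rows held out in each fold
train_size <- n_samples - fold_size   # rows used for fitting in each fold
n_resamples <- k * repeats            # resamples evaluated per mtry value

# Forests fitted in total: one per resample and mtry value,
# plus one final refit on all rows (assumed)
n_fits <- n_resamples * n_mtry + 1

c(fold_size = fold_size, train_size = train_size,
  n_resamples = n_resamples, n_fits = n_fits)
```

The training size of 6213 matches the "Summary of sample sizes" line, and each of the three mtry values is scored on 16 resamples before the best one is selected.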
varImp(forest)
## rf variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## TotalCharges 100.000
## MonthlyCharges 89.569
## tenure 89.023
## ContractTwo year 48.781
## InternetServiceFiber optic 31.952
## ContractOne year 23.458
## PaymentMethodElectronic check 12.660
## genderMale 11.200
## PaperlessBillingYes 9.148
## PartnerYes 8.211
## SeniorCitizen 8.179
## OnlineSecurityYes 7.832
## DependentsYes 7.296
## OnlineBackupYes 7.120
## TechSupportYes 6.942
## DeviceProtectionYes 5.590
## MultipleLinesYes 5.456
## StreamingTVYes 5.291
## PaymentMethodMailed check 5.189
## StreamingMoviesYes 5.056
forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 16
##
## OOB estimate of error rate: 8.32%
## Confusion matrix:
## No Yes class.error
## No 3544 598 0.14437470
## Yes 91 4051 0.02197006
Next, we predict the class of each customer in the test data.
churn_test_predict <- predict(forest, newdata = churn_test, type = "raw") # returns the predicted class
Then we evaluate the model.
confusionMatrix(data = churn_test_predict, reference = churn_test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 862 171
## Yes 159 215
##
## Accuracy : 0.7655
## 95% CI : (0.7424, 0.7874)
## No Information Rate : 0.7257
## P-Value [Acc > NIR] : 0.0003826
##
## Kappa : 0.4052
##
## Mcnemar's Test P-Value : 0.5448269
##
## Sensitivity : 0.5570
## Specificity : 0.8443
## Pos Pred Value : 0.5749
## Neg Pred Value : 0.8345
## Prevalence : 0.2743
## Detection Rate : 0.1528
## Detection Prevalence : 0.2658
## Balanced Accuracy : 0.7006
##
## 'Positive' Class : Yes
##
On the test data, the naive Bayes model reaches an accuracy of 71%, the decision tree 75%, and the random forest 77%. By accuracy, the random forest is therefore the best of the three models for predicting customer churn.
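For a side-by-side view, the reported test metrics can be collected in one data frame. This is a sketch that copies the values from the three confusionMatrix outputs above; the object names are illustrative. It also shows that the ranking depends on the metric chosen.

```r
# Test-set metrics copied from the confusionMatrix outputs above
results <- data.frame(
  model = c("naive Bayes", "decision tree", "random forest"),
  accuracy = c(0.7107, 0.7541, 0.7655),
  sensitivity = c(0.8057, 0.7021, 0.5570),
  specificity = c(0.6748, 0.7738, 0.8443)
)

# Best model under two different criteria
best_by_accuracy <- results$model[which.max(results$accuracy)]
best_by_sensitivity <- results$model[which.max(results$sensitivity)]

print(results)
cat("Highest accuracy:", best_by_accuracy, "\n")
cat("Highest sensitivity:", best_by_sensitivity, "\n")
```

The random forest has the highest accuracy, while naive Bayes has the highest sensitivity, meaning it catches the largest share of actual churners. Which criterion should decide depends on how costly a missed churner is compared with a false alarm.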