Background
We have probably all been dissatisfied enough with a telecommunications company's services to switch to another provider. This phenomenon is known as customer churn: in the telecommunications business, customer churn is the tendency of customers to stop doing business with a company. It can be caused by various factors, such as prices that are too high, poor signal quality, or unsatisfactory service. To anticipate customer churn, companies need to understand its two types: voluntary churn occurs when a customer deliberately chooses to unsubscribe and switch to another provider, while involuntary churn is caused by external factors beyond the customer's control, such as relocation.
By utilizing machine learning technology, companies can develop predictive models to identify customers who are likely to churn and take preventive measures to retain them.
Workflow
Import Data
The data used is customer profile data from a telecommunications
company obtained from Kaggle.
The dataset contains records for 7043 customers, covering customer
demographics, account and payment information, and the services each
customer has signed up for. From this information, we want to predict
whether a customer will churn or not.
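Before importing the data, load the packages used throughout this analysis. The list below is inferred from the functions called in later chunks; adjust it to your own setup (lime is loaded later, in the section where it is first used).
library(dplyr)     # data wrangling: %>%, select(), mutate()
library(inspectdf) # inspect_cat(), inspect_num()
library(rsample)   # initial_split(), training(), testing()
library(caret)     # train(), trainControl(), upSample(), confusionMatrix()
library(ROCR)      # prediction(), performance()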
# read the raw data; stringsAsFactors = TRUE imports character columns as factors
customer <- read.csv("data_input/Telco-Customer-Churn.csv", stringsAsFactors = TRUE)
head(customer)
The following is a description of each variable:
- CustomerID: Customer ID
- Gender: Gender of the customer (Female, Male)
- SeniorCitizen: Whether the customer is a senior citizen (0: No, 1: Yes)
- Partner: Whether the customer has a partner (Yes, No)
- Dependents: Whether the customer has dependents (Yes, No)
- Tenure: Number of months the customer has used the company's product
- PhoneService: Whether the customer has phone service (Yes, No)
- MultipleLines: Whether the customer has multiple lines (Yes, No, No phone service)
- InternetService: The customer's internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security
- OnlineBackup: Whether the customer has online backup
- DeviceProtection: Whether the customer has device protection
- TechSupport: Whether the customer has technical support
- StreamingTV: Whether the customer subscribes to streaming TV
- StreamingMovies: Whether the customer subscribes to streaming movies
- Contract: Term of the subscription contract (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing (Yes, No)
- PaymentMethod: Payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer each month
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned or not (Yes, No)
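To confirm that each column was imported with the expected type, we can take a quick look at the structure (a minimal check using dplyr's glimpse; base R's str() works just as well):
glimpse(customer)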
Data Cleansing
Before entering the modeling stage, let's clean the data.
First, check the completeness of the data: this tells us whether any values are missing.
colSums(is.na(customer))
#> customerID gender SeniorCitizen Partner
#> 0 0 0 0
#> Dependents tenure PhoneService MultipleLines
#> 0 0 0 0
#> InternetService OnlineSecurity OnlineBackup DeviceProtection
#> 0 0 0 0
#> TechSupport StreamingTV StreamingMovies Contract
#> 0 0 0 0
#> PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
#> 0 0 0 11
#> Churn
#> 0
Out of 7043 observations, 11 have missing values (NA) in the
TotalCharges column.
Since the number of NAs is quite small, we can simply discard these
observations.
Second, we discard the variable that is
not relevant to the modeling, customerID.
Third, we convert the
SeniorCitizen column from numeric to categorical.
customer <- customer %>%
  select(-customerID) %>%                          # drop the identifier column
  na.omit() %>%                                    # remove the 11 rows with missing TotalCharges
  mutate(SeniorCitizen = as.factor(SeniorCitizen)) # convert 0/1 to a factor
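A quick sanity check on the result (expected values derived from the counts above: 7043 − 11 = 7032 rows, 21 − 1 = 20 columns):
anyNA(customer) # should now be FALSE
dim(customer)   # expect 7032 rows and 20 columns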
Exploratory Data Analysis
Next, let’s explore the data for both categorical and numerical columns.
To find out the proportion of classes in each categorical variable,
we can use the inspect_cat function from the package
inspectdf as follows:
customer %>% inspect_cat() %>% show_plot()
From the visualization above, it can be seen that the classes of
the target variable Churn are imbalanced: the
No category is considerably larger than Yes. For the
other variables, the proportions are mostly balanced.
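The plot gives a visual impression; to attach exact numbers to the imbalance of the target variable, a quick check on the cleansed data:
prop.table(table(customer$Churn))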
Next, we can explore the distributions of the numeric variables with
the inspect_num function from the same package
as follows:
customer %>% inspect_num() %>% show_plot()
From the visualization above, it can be concluded that the distributions of the numeric variables differ considerably from one variable to another.
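As a numeric companion to the histograms, base R's summary() gives the quartiles and means of the numeric columns (SeniorCitizen is excluded because we converted it to a factor earlier):
summary(customer[sapply(customer, is.numeric)])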
Train-Test Splitting
After data cleansing and exploration, the next step is train-test splitting: dividing the data into train and test sets with a proportion of 80:20. The train data is used to build the model, while the test data is used to evaluate model performance.
set.seed(100)
idx <- initial_split(data = customer,
                     prop = 0.8,
                     strata = "Churn") # stratify so both sets keep the Churn ratio
data_train <- training(idx)
data_test <- testing(idx)
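Because we stratified on Churn, both sets should preserve roughly the original class ratio; this can be verified directly:
prop.table(table(data_train$Churn))
prop.table(table(data_test$Churn))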
Modeling
Next, we will perform modeling using the Random
Forest algorithm (via the caret package),
specifying the number of cross-validation folds and repetitions, as well
as the target variable and the predictors used from the train
data.
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", # repeated k-fold cross-validation
                     number = 5,            # 5 folds
                     repeats = 3)           # repeated 3 times
model_forest <- train(Churn ~ .,            # predict Churn from all other columns
                      data = data_train,
                      method = "rf",
                      trControl = ctrl)
# saveRDS(model_forest, "assets/model_forest.rds")
The chunk above takes quite a long time to execute. To save time, let's load the model that was previously saved to an RDS file.
model_forest <- readRDS("assets/model_forest.rds")
model_forest
#> Random Forest
#>
#> 5627 samples
#> 19 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 4501, 4502, 4501, 4502, 4502, 4501, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.7837817 0.3252122
#> 16 0.7750746 0.3779712
#> 30 0.7731203 0.3727503
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
For now, we obtain a Random Forest model with a cross-validation accuracy of 78.38% on the train data, using an optimal mtry value of 2.
Next, we will tune the model by upsampling, i.e. resampling the minority class until both classes of the target variable appear in equal proportions.
data_train_up <- upSample(x = data_train[, -20], # predictors (column 20 is Churn)
                          y = data_train$Churn,  # target
                          yname = "Churn")
# check the class proportions
prop.table(table(data_train_up$Churn))
#>
#> No Yes
#> 0.5 0.5
Using the upsampled data, we rebuild the Random Forest model.
set.seed(100)
ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 3)
model_forest_up <- train(Churn ~ .,
                         data = data_train_up,
                         method = "rf",
                         trControl = ctrl)
# saveRDS(model_forest_up, "assets/model_forest_up.rds")
Again, to save time, let's load the previously saved model from its RDS file.
model_forest_up <- readRDS("assets/model_forest_up.rds")
model_forest_up
#> Random Forest
#>
#> 8262 samples
#> 19 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 6609, 6610, 6609, 6610, 6610, 6610, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.7760017 0.5520022
#> 16 0.8911472 0.7822945
#> 30 0.8875167 0.7750336
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 16.
After upsampling, the cross-validation accuracy on the train data increases to 89.11%, with an optimal mtry value of 16. Note that because upsampling duplicates minority-class rows before cross-validation, copies of the same observation can land in both the training and validation folds, so this estimate may be optimistic; the evaluation on the untouched test data below is the more reliable measure.
Model Evaluation
Finally, let's test the random forest model we have created on the test data. In this case, we want the recall (sensitivity) to be as high as possible, so that the model detects as many churning customers as possible.
pred <- predict(model_forest_up, newdata = data_test, type = "prob")
# classify as "Yes" when the predicted churn probability exceeds 0.45;
# a threshold slightly below 0.5 trades some precision for higher recall
pred$result <- as.factor(ifelse(pred$Yes > 0.45, "Yes", "No"))
confusionMatrix(pred$result, data_test$Churn, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 1031 19
#> Yes 2 355
#>
#> Accuracy : 0.9851
#> 95% CI : (0.9773, 0.9907)
#> No Information Rate : 0.7342
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9612
#>
#> Mcnemar's Test P-Value : 0.0004803
#>
#> Sensitivity : 0.9492
#> Specificity : 0.9981
#> Pos Pred Value : 0.9944
#> Neg Pred Value : 0.9819
#> Prevalence : 0.2658
#> Detection Rate : 0.2523
#> Detection Prevalence : 0.2537
#> Balanced Accuracy : 0.9736
#>
#> 'Positive' Class : Yes
#>
By using a threshold of 0.45, we obtained a recall of 94.92% with an accuracy of 98.51%.
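Lowering the threshold trades precision for recall. To see how sensitive the recall is to this choice, we can evaluate a few candidate thresholds (a minimal sketch using caret's sensitivity(); the candidate values here are arbitrary):
thresholds <- c(0.30, 0.40, 0.45, 0.50)
sapply(thresholds, function(t) {
  result <- factor(ifelse(pred$Yes > t, "Yes", "No"),
                   levels = levels(data_test$Churn))
  sensitivity(result, data_test$Churn, positive = "Yes")
})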
In addition to the confusion matrix, we can plot a ROC curve
and compute the AUC value using the ROCR package as
follows:
pred_prob <- predict(object = model_forest_up, newdata = data_test, type = "prob")
pred <- prediction(pred_prob[, 2], labels = data_test$Churn) # column 2 holds P(Yes)
perf <- performance(prediction.obj = pred, measure = "tpr", x.measure = "fpr")
plot(perf)
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]
#> [1] 0.9925235
The AUC of 99.25% indicates that the model separates the positive
(Churn) class from the negative class in the test data very well: a
randomly chosen churner receives a higher predicted churn probability
than a randomly chosen non-churner 99.25% of the time.
Conclusion
With a model to predict customer churn, telecommunications companies can easily identify which customers have a tendency to churn.
The following visualization shows the prediction results for two customers. Both have a high chance of churning, and we can also see which variables support and which contradict the model's prediction.
library(lime)
# keep only the predictors; lime explains the features, not the target
test_x <- data_test %>%
  dplyr::select(-Churn)
explainer <- lime(test_x, model_forest_up)
# explain the "Yes" (churn) prediction for the first two test customers
explanation <- lime::explain(test_x[1:2, ],
                             explainer,
                             labels = c("Yes"),
                             n_features = 8)
plot_features(explanation)
It can be concluded that the strongest reasons these two customers are likely to churn are that they are on a month-to-month contract and their tenure is still below 8 months. From here, the marketing team can promote products with longer-term contracts so that these two customers stay longer.
External Resources
- Dataset: Kaggle: Telco Customer Churn
- Repository: GitHub: Ahmad Fauzi