Rapid development of technology has led to major advancements in the telecommunication industry. However, the competition between providers has become even fiercer, and customer retention is now one of the biggest challenges that must be overcome in order to survive. It is often stated that the cost of acquiring a new customer is far greater than that of retaining an existing one. Service providers in the telecommunication industry therefore need a tool to understand customer behavior and to predict whether their customers will leave their services. The dataset used here is obtained from Kaggle and contains 3,333 observations on customer churn.
The following packages are required in this notebook. Use install.packages() to install any packages that are not already installed, and load them using the library() function.
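As a small convenience, here is a sketch that installs only the packages that are still missing (assuming this exact package list):

pkgs <- c("tidyverse", "gtools", "smotefamily", "caret", "class")
# Install only the packages that are not yet present
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)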
library(tidyverse)
library(gtools)
library(smotefamily)
library(caret)
library(class)

churn <- read.csv("data_input/telecom_churn.csv")
head(churn)

glimpse(churn)

## Rows: 3,333
## Columns: 11
## $ Churn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, ~
## $ AccountWeeks <int> 128, 107, 137, 84, 75, 118, 121, 147, 117, 141, 65, 74~
## $ ContractRenewal <int> 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ DataPlan <int> 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ~
## $ DataUsage <dbl> 2.70, 3.70, 0.00, 0.00, 0.00, 0.00, 2.03, 0.00, 0.19, ~
## $ CustServCalls <int> 1, 1, 0, 2, 3, 0, 3, 0, 1, 0, 4, 0, 1, 3, 4, 4, 1, 3, ~
## $ DayMins <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0~
## $ DayCalls <int> 110, 123, 114, 71, 113, 98, 88, 79, 97, 84, 137, 127, ~
## $ MonthlyCharge <dbl> 89.0, 82.0, 52.0, 57.0, 41.0, 57.0, 87.3, 36.0, 63.9, ~
## $ OverageFee <dbl> 9.87, 9.78, 6.06, 3.10, 7.42, 11.03, 17.43, 5.16, 17.5~
## $ RoamMins <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 8.7, 11.2,~
Here is some information about the features: Churn is the target variable (1 if the customer churned), ContractRenewal and DataPlan are binary indicators, and the remaining features are numeric measurements such as account tenure in weeks (AccountWeeks), monthly data usage (DataUsage), number of customer-service calls (CustServCalls), average daytime minutes and calls (DayMins, DayCalls), the monthly charge (MonthlyCharge), the overage fee (OverageFee), and the average roaming minutes (RoamMins).
Our data consists of 3,333 observations and 11 features, most of them numeric. Using glimpse(), we can see that some feature types are incorrect. We will correct them later, after dealing with missing and duplicated values.
Checking whether the data has any missing values.
colSums(is.na(churn))

## Churn AccountWeeks ContractRenewal DataPlan DataUsage
## 0 0 0 0 0
## CustServCalls DayMins DayCalls MonthlyCharge OverageFee
## 0 0 0 0 0
## RoamMins
## 0
The data has no missing values; let's now check whether it has any duplicated rows. In a dataset like this, I do not think duplicated rows should exist: since the data contains features measuring the average number of roaming minutes, monthly data usage, and so on, every observation should plausibly be unique.
sum(duplicated(churn))

## [1] 0
It turns out that the data also has no duplicated observations. Let's fix the data types before proceeding to the modeling part.
We can see that some of the categorical features are read as integers. We need to convert Churn, ContractRenewal, and DataPlan into factors.
churn <- churn %>%
mutate(across(c(Churn, ContractRenewal, DataPlan),
as.factor))

We want our target variable to have a balanced class proportion so that the model can classify all classes well, instead of only the majority class. Let's first check the class proportions.
prop.table(table(churn$Churn))

##
## 0 1
## 0.8550855 0.1449145
table(churn$Churn)

##
## 0 1
## 2850 483
We have a moderate class imbalance. This is a problem, since proceeding with the data as-is might lead to a model that predicts only one class well. Let's first split the data into a 70% train set and a 30% test set: the train set for constructing the model and the test set for measuring out-of-sample performance.
RNGkind(sample.kind = "Rounding")
set.seed(417)
idx <- sample(nrow(churn), nrow(churn) * 0.7)
train <- churn[idx, ]
test <- churn[-idx, ]

Rechecking the class proportion in the train set.
table(train$Churn)

##
## 0 1
## 1988 345
prop.table(table(train$Churn))

##
## 0 1
## 0.8521217 0.1478783
Since the data is quite small, I need to fix the balance between classes. There are several common methods for handling class imbalance: downsampling (randomly removing majority-class observations), upsampling (randomly duplicating minority-class observations), and SMOTE (synthesizing new minority-class observations from existing ones).
Since SMOTE is based on the KNN algorithm, it can only accept a dataframe containing numeric features. Besides, I want to create both a logistic regression model and a KNN model for this dataset and evaluate which method is better. So, I am going to use upsampling to handle the class imbalance, then compare the logistic regression and KNN models. (A SMOTE sketch is shown below for reference.)
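For completeness, here is a minimal sketch of how SMOTE from the smotefamily package (loaded above but otherwise unused) could be applied instead; it is not run in this notebook, and it assumes the factor columns are simply dropped rather than encoded:

# SMOTE accepts only numeric predictors, so keep the numeric columns
train_num <- train %>%
  select(where(is.numeric))
smote_res <- SMOTE(X = train_num,        # numeric predictors
                   target = train$Churn, # class labels
                   K = 5)                # neighbors used to synthesize points
train_smote <- smote_res$data            # balanced data; labels are in column "class"
table(train_smote$class)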
train_up <- upSample(x = train[, -1],
y = train$Churn,
yname = "Churn")
prop.table(table(train_up$Churn))

##
## 0 1
## 0.5 0.5
We are done pre-processing the data. Note that the model might be biased since we used upsampling to handle the class imbalance: upsampling duplicates minority-class observations and may therefore make the model overfit to them.
Let's first create a binary logistic regression model from the train data. (Note that the code below fits on the original train set; the upsampled train_up set will be used later for KNN.)
model_logreg <- glm(Churn ~ .,
data = train,
family = "binomial")
summary(model_logreg)

##
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9769 -0.5186 -0.3583 -0.2195 2.9639
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.9390434 0.6452955 -7.654 0.0000000000000195 ***
## AccountWeeks -0.0006168 0.0016312 -0.378 0.70532
## ContractRenewal1 -1.9990908 0.1676028 -11.928 < 0.0000000000000002 ***
## DataPlan1 -1.4371045 0.6464985 -2.223 0.02622 *
## DataUsage 2.1158316 2.2762817 0.930 0.35262
## CustServCalls 0.5000905 0.0462865 10.804 < 0.0000000000000002 ***
## DayMins 0.0449673 0.0384947 1.168 0.24275
## DayCalls 0.0005983 0.0032536 0.184 0.85411
## MonthlyCharge -0.1951222 0.2263181 -0.862 0.38860
## OverageFee 0.4555130 0.3858588 1.181 0.23779
## RoamMins 0.0706709 0.0261827 2.699 0.00695 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1955.1 on 2332 degrees of freedom
## Residual deviance: 1564.6 on 2322 degrees of freedom
## AIC: 1586.6
##
## Number of Fisher Scoring iterations: 5
We can see that there are still insignificant variables in the model. I personally think that all of the independent variables in our data are important for determining whether customers will leave the service, but some of them are statistically insignificant with respect to the target variable. I will use the backward stepwise algorithm to improve the model.
model_logreg_2 <- step(object = model_logreg,
direction = "backward",
trace = F)
summary(model_logreg_2)

##
## Call:
## glm(formula = Churn ~ ContractRenewal + DataPlan + CustServCalls +
## DayMins + OverageFee + RoamMins, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9861 -0.5217 -0.3585 -0.2205 2.9544
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.080538 0.512817 -9.907 < 0.0000000000000002 ***
## ContractRenewal1 -2.006294 0.167471 -11.980 < 0.0000000000000002 ***
## DataPlan1 -0.968907 0.172405 -5.620 0.0000000191 ***
## CustServCalls 0.497958 0.046107 10.800 < 0.0000000000000002 ***
## DayMins 0.011786 0.001254 9.402 < 0.0000000000000002 ***
## OverageFee 0.123382 0.026997 4.570 0.0000048745 ***
## RoamMins 0.077687 0.024309 3.196 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1955.1 on 2332 degrees of freedom
## Residual deviance: 1566.0 on 2326 degrees of freedom
## AIC: 1580
##
## Number of Fisher Scoring iterations: 5
Making the predictions and evaluating their performance using a confusion matrix.
test$pred_prob <- predict(model_logreg_2,
newdata = test,
type = "response")
test$pred <- as.factor(ifelse(test$pred_prob > 0.5, 1, 0))
confusionMatrix(test$pred,
reference = test$Churn,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 839 112
## 1 23 26
##
## Accuracy : 0.865
## 95% CI : (0.8422, 0.8856)
## No Information Rate : 0.862
## P-Value [Acc > NIR] : 0.4134
##
## Kappa : 0.2218
##
## Mcnemar's Test P-Value : 0.00000000000003624
##
## Sensitivity : 0.1884
## Specificity : 0.9733
## Pos Pred Value : 0.5306
## Neg Pred Value : 0.8822
## Prevalence : 0.1380
## Detection Rate : 0.0260
## Detection Prevalence : 0.0490
## Balanced Accuracy : 0.5809
##
## 'Positive' Class : 1
##
Although our model has a pretty high accuracy of 86.5%, the metrics show that it classifies class 0 very well but is terrible at classifying class 1. This makes sense, since our data initially had a moderate class imbalance. Regarding our model and data, there are two important cases: a false negative, where the model predicts that a customer will stay (class 0) but they actually churn (class 1), and a false positive, where the model predicts churn for a customer who actually stays.
It was stated before that the cost of acquiring a new customer is far greater than that of retaining an existing one. Thus, the impact of a false negative is much more severe to the company. Looking at the metrics in the confusion matrix, the sensitivity of only 18.84% shows that the model misses most of the customers who actually churn. Since we want to minimize false negatives, we want to improve our model's recall/sensitivity.
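As a quick reminder of what is being measured, recall (sensitivity) is the fraction of actual churners that the model catches. Plugging in the counts from the confusion matrix above (26 true positives, 112 false negatives):

\[\text{Recall} = \frac{TP}{TP + FN} = \frac{26}{26 + 112} \approx 0.1884\]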
I will plot recall, specificity, precision, and accuracy for all possible cutoffs.
metrics <- function(cutoff){
prediction <- as.factor(ifelse(test$pred_prob > cutoff, 1, 0))
conf <- confusionMatrix(prediction,
reference = test$Churn,
positive = "1")
res <- c(conf$overall[1], conf$byClass[1], conf$byClass[2], conf$byClass[3])
return(res)
}

cutoffs <- seq(0.01, 0.99, length = 99)
result <- matrix(nrow = 99, ncol = 4)
for(i in 1:99){
result[i, ] <- metrics(cutoffs[i])
}
result <- as.data.frame(result) %>%
rename(Accuracy = V1,
Recall = V2,
Specificity = V3,
Precision = V4) %>%
mutate(Cutoff = cutoffs)

Plotting the metrics for all possible cutoffs.
result %>%
gather(key = "metrics", value = "value", -Cutoff) %>%
ggplot(mapping = aes(x = Cutoff,
y = value,
col = metrics)) +
geom_line(lwd = 1) +
labs(title = "Metrics for Different Cutoffs",
y = "Value") +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5))

I personally think the best cutoff for this data is 0.16. It gives us 77.7% accuracy, 76.8% recall, 77.8% specificity, and 35.7% precision. Honestly, the model is not good enough because of the class imbalance, but since I do not have any more data for this case, I will stick with it. Recall what I mentioned about how severe the impact of a false negative could be; that is why I choose 0.16 as the model's optimum cutoff, since it gives a fair trade-off between accuracy and recall. Here is the final result.
test$pred <- as.factor(ifelse(test$pred_prob > 0.16, 1, 0))
conf_matrix <- confusionMatrix(test$pred,
reference = test$Churn,
positive = "1")
eval_logreg <- data.frame("Accuracy" = conf_matrix$overall[1],
"Recall" = conf_matrix$byClass[1],
"Specificity" = conf_matrix$byClass[2],
"Precision" = conf_matrix$byClass[3],
"Neg.Pred.Value" = conf_matrix$byClass[4],
row.names = NULL)
eval_logreg

One of the advantages of a logistic regression model is its interpretability. We can easily interpret which features are most influential on our target variable.
data.frame("Coefficient" = exp(model_logreg_2$coefficients)) %>%
arrange(desc(Coefficient))

To determine whether customers will leave the service, the most influential feature appears to be CustServCalls. For a one-unit increase in CustServCalls, we expect a ~64.5% increase in the odds of leaving the service. This might be because more calls to customer service mean that a customer is experiencing more trouble.
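As a quick check of that figure: the summary above gives the CustServCalls coefficient in model_logreg_2 as 0.497958, and exponentiating it yields the odds ratio

\[e^{0.497958} \approx 1.645,\]

so the odds of churning are multiplied by about 1.645 (a 64.5% increase) for each additional customer-service call.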
KNN does not build a model the way logistic regression does. Instead, it directly infers the class a point belongs to based on its K nearest records, with distance measured using the Euclidean distance. KNN can only handle numeric data, so we have to remove non-numeric features and scale the rest so that all of them are on the same range.
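For reference, for two observations \(x\) and \(y\) with \(p\) numeric features, the Euclidean distance is

\[d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2},\]

which is also why we scale the features first: otherwise, features with larger ranges would dominate the distance.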
train_x_scaled <- train_up %>%
select_if(is.numeric) %>%
scale()
test_x_scaled <- test %>%
select(-c(pred_prob, pred)) %>%
select_if(is.numeric) %>%
scale(center = attr(train_x_scaled, "scaled:center"),
scale = attr(train_x_scaled, "scaled:scale"))
train_y <- train_up$Churn
test_y <- test$Churn

The next step is to find the optimum K. A common practice is to begin with K equal to the square root of the number of training observations. Also, since our target has two classes, we should pick an odd value of K to avoid ties in the vote.
sqrt(nrow(train_up))

## [1] 63.05553
We will begin with \(K = 63\).
pred_knn <- knn(train = train_x_scaled,
test = test_x_scaled,
cl = train_y,
k = 63)

We will use a confusion matrix to evaluate the performance of KNN on our data.
confusionMatrix(pred_knn,
reference = test_y,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 745 39
## 1 117 99
##
## Accuracy : 0.844
## 95% CI : (0.82, 0.866)
## No Information Rate : 0.862
## P-Value [Acc > NIR] : 0.9532
##
## Kappa : 0.4701
##
## Mcnemar's Test P-Value : 0.0000000007051
##
## Sensitivity : 0.7174
## Specificity : 0.8643
## Pos Pred Value : 0.4583
## Neg Pred Value : 0.9503
## Prevalence : 0.1380
## Detection Rate : 0.0990
## Detection Prevalence : 0.2160
## Balanced Accuracy : 0.7908
##
## 'Positive' Class : 1
##
Although the accuracy is slightly lower than that of the initial logistic regression model, KNN seems to classify both class 0 and class 1 pretty well.
I will plot the same metrics as the ones used in logistic regression for different values of K.
knn_metrics <- function(k){
prediction <- knn(train = train_x_scaled,
test = test_x_scaled,
cl = train_y,
k = k)
conf <- confusionMatrix(prediction,
reference = test_y,
positive = "1")
res <- c(conf$overall[1], conf$byClass[1], conf$byClass[2], conf$byClass[3])
return(res)
}

knn_result <- matrix(nrow = 100, ncol = 4)
for(i in 1:100){
knn_result[i, ] <- knn_metrics(i)
}
knn_result <- as.data.frame(knn_result) %>%
rename(Accuracy = V1,
Recall = V2,
Specificity = V3,
Precision = V4) %>%
mutate(k = seq(1, 100, length = 100))

Plotting the metrics for different values of K.
knn_result %>%
gather(key = "metrics", value = "value", -k) %>%
ggplot(mapping = aes(x = k,
y = value,
col = metrics)) +
geom_line(lwd = 1) +
labs(title = "Metrics for Different Cutoffs",
y = "Value") +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5))

Again, deciding which value of K is optimal depends on the business case. I personally think that \(K = 41\) is the best choice here.
pred_knn <- knn(train = train_x_scaled,
test = test_x_scaled,
cl = train_y,
k = 41)
conf_matrix <- confusionMatrix(pred_knn,
reference = test_y,
positive = "1")
eval_knn <- data.frame("Accuracy" = conf_matrix$overall[1],
"Recall" = conf_matrix$byClass[1],
"Specificity" = conf_matrix$byClass[2],
"Precision" = conf_matrix$byClass[3],
"Neg.Pred.Value" = conf_matrix$byClass[4],
row.names = NULL)
eval_knn

Here is the final result: using KNN with \(K = 41\) gives us 82.1% accuracy, 74.64% recall, 83.3% specificity, 41.7% precision, and a 95.35% negative predictive value.
Here are the results of the optimized logistic regression and KNN models.
# Logistic Regression Evaluation
eval_logreg

# KNN Evaluation
eval_knn

There is only a slight difference between the metrics of logistic regression and KNN. Overall, the K-Nearest Neighbors algorithm gives better predictions on this dataset, although its recall is slightly lower than that of logistic regression.
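For a compact side-by-side view, the two evaluation frames can be stacked; this is a small convenience sketch, relying on the fact that both data frames share the same columns:

# Stack the one-row evaluation frames; argument names become row names
rbind("Logistic Regression" = eval_logreg,
      "KNN" = eval_knn)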
For predicting telecommunication customer churn, both KNN and logistic regression give similar results, with KNN having slightly better metrics overall. Despite that, the logistic regression algorithm offers more benefits than KNN, since its results are more interpretable and it can also handle categorical variables. The models would perform better if there were more data for this case: there is a moderate class imbalance, with 85% of the observations belonging to customers who keep using the provider's services, so the models had difficulty learning the characteristics of the customers who leave. If there were more observations of churning customers, I am fairly confident the models would classify them better.