1 Background

1.1 Brief

The rapid development of technology has led to major advances in the telecommunication industry. However, competition between providers has become even fiercer, and customer retention has become one of the major challenges a provider must overcome to survive. It is often stated that the cost of acquiring a new customer is far greater than that of retaining an existing one. Service providers in the telecommunication industry therefore need a tool to understand customer behavior and predict whether their customers will leave their services. The dataset used here is obtained from Kaggle and contains 3,333 observations about customer churn.

1.2 Libraries and Setup

The following packages are required in this notebook. Use install.packages() to install any packages that are not already installed, and load them with the library() function.
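
As a small convenience, here is an optional sketch that installs only the packages that are missing (the pkgs and missing_pkgs names below are my own helpers, not part of the original setup):

# Optional helper: install any required packages that are not yet installed
pkgs <- c("tidyverse", "gtools", "smotefamily", "caret", "class")
missing_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)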

library(tidyverse)
library(gtools)
library(smotefamily)
library(caret)
library(class)

2 Data Wrangling

2.1 Data Inspection

churn <- read.csv("data_input/telecom_churn.csv")
head(churn)
glimpse(churn)
## Rows: 3,333
## Columns: 11
## $ Churn           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, ~
## $ AccountWeeks    <int> 128, 107, 137, 84, 75, 118, 121, 147, 117, 141, 65, 74~
## $ ContractRenewal <int> 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ DataPlan        <int> 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ~
## $ DataUsage       <dbl> 2.70, 3.70, 0.00, 0.00, 0.00, 0.00, 2.03, 0.00, 0.19, ~
## $ CustServCalls   <int> 1, 1, 0, 2, 3, 0, 3, 0, 1, 0, 4, 0, 1, 3, 4, 4, 1, 3, ~
## $ DayMins         <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0~
## $ DayCalls        <int> 110, 123, 114, 71, 113, 98, 88, 79, 97, 84, 137, 127, ~
## $ MonthlyCharge   <dbl> 89.0, 82.0, 52.0, 57.0, 41.0, 57.0, 87.3, 36.0, 63.9, ~
## $ OverageFee      <dbl> 9.87, 9.78, 6.06, 3.10, 7.42, 11.03, 17.43, 5.16, 17.5~
## $ RoamMins        <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 8.7, 11.2,~

Here is some information about the features:

  • Churn: 1 if customer cancelled service, 0 if not
  • AccountWeeks: number of weeks customer has had active account
  • ContractRenewal: 1 if customer recently renewed contract, 0 if not
  • DataPlan: 1 if customer has data plan, 0 if not
  • DataUsage: gigabytes of monthly data usage
  • CustServCalls: number of calls into customer service
  • DayMins: average daytime minutes per month
  • DayCalls: average number of daytime calls
  • MonthlyCharge: average monthly bill
  • OverageFee: largest overage fee in last 12 months
  • RoamMins: average number of roaming minutes

Our data consists of 3,333 observations and 11 features. Most of the features are numeric. We can see from glimpse() that some feature types are incorrect; we will correct them after dealing with missing and duplicated values.

2.2 Missing and Duplicated Values

Checking whether the data has any missing values.

colSums(is.na(churn))
##           Churn    AccountWeeks ContractRenewal        DataPlan       DataUsage 
##               0               0               0               0               0 
##   CustServCalls         DayMins        DayCalls   MonthlyCharge      OverageFee 
##               0               0               0               0               0 
##        RoamMins 
##               0

The data has no missing values. Let’s now check whether the data has any duplicated observations. In a dataset like this, I do not think duplicates should exist: since there are features measuring the average number of roaming minutes, monthly data usage, and so on, every observation should have unique values.

sum(duplicated(churn))
## [1] 0

It turns out that the data also has no duplicated observations. Let’s fix the data types before proceeding to the modeling part.

2.3 Incorrect Feature Types

We can see that some of the categorical features are read as integers. We need to convert Churn, ContractRenewal, and DataPlan into factors.

churn <- churn %>% 
  mutate(across(c(Churn, ContractRenewal, DataPlan),
                as.factor))

2.4 Class Balance and Cross Validation

We want our target variable to have a balanced class proportion so that the model can classify all classes well instead of only the majority class. Let’s first check the class proportion.

prop.table(table(churn$Churn))
## 
##         0         1 
## 0.8550855 0.1449145
table(churn$Churn)
## 
##    0    1 
## 2850  483

We have a moderate class imbalance. This is a problem, since proceeding with the data as-is might lead to a model that predicts well on only one class. Let’s first split the data into a 70% train set and a 30% test set: the train set is for constructing the model and the test set for measuring out-of-sample accuracy.

RNGkind(sample.kind = "Rounding")
set.seed(417)
idx <- sample(nrow(churn), nrow(churn) * 0.7)
train <- churn[idx, ]
test <- churn[-idx, ]

Rechecking the class proportion in the train set.

table(train$Churn)
## 
##    0    1 
## 1988  345
prop.table(table(train$Churn))
## 
##         0         1 
## 0.8521217 0.1478783

Since the data is quite small, I need to be careful about how I fix the balance between classes. There are several methods for handling class imbalance:

  • Upsampling: duplicating data points corresponding to minority class.
  • Downsampling: drop observations from the majority class. Since our data is quite small, this would leave only 690 observations for constructing the model, and I am afraid the model would not have enough data to learn from.
  • SMOTE (Synthetic Minority Oversampling Technique): synthetically generate data points in the proximity of the minority class based on the KNN (K-Nearest Neighbors) algorithm.

Since SMOTE is based on the KNN algorithm, it can only receive a data frame containing numerical data. Besides, I want to build both a logistic regression model and a KNN model for this dataset and evaluate which method is better. So, I am going to use upsampling to handle the class imbalance, then compare the logistic regression and KNN models.
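
For reference only, here is a minimal sketch of what SMOTE could look like with the smotefamily package loaded earlier, applied to the numeric predictors; it is not used in the pipeline below, and the num_train and smoted names are my own:

# Illustration only: SMOTE on the numeric predictors (not used further below)
num_train <- train %>% select_if(is.numeric)   # SMOTE requires numeric data
smoted <- SMOTE(X = num_train, target = train$Churn, K = 5)
table(smoted$data$class)                       # classes become roughly balanced

Now, applying upsampling with caret’s upSample():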

train_up <- upSample(x = train[, -1],
                     y = train$Churn,
                     yname = "Churn")

prop.table(table(train_up$Churn))
## 
##   0   1 
## 0.5 0.5

We are done pre-processing the data. Note that the model might be biased because we used upsampling to handle the class imbalance: upsampling duplicates minority-class observations, which might make the model overfit to them.

3 Binary Logistic Regression

Let’s first create a binary logistic regression model from the train data. Note that the model is fit on the original (imbalanced) train set; its imbalance is handled later through cutoff tuning, while the upsampled train_up set is used for the KNN model in Section 4.

model_logreg <- glm(Churn ~ .,
                    data = train,
                    family = "binomial")

summary(model_logreg)
## 
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9769  -0.5186  -0.3583  -0.2195   2.9639  
## 
## Coefficients:
##                    Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)      -4.9390434  0.6452955  -7.654   0.0000000000000195 ***
## AccountWeeks     -0.0006168  0.0016312  -0.378              0.70532    
## ContractRenewal1 -1.9990908  0.1676028 -11.928 < 0.0000000000000002 ***
## DataPlan1        -1.4371045  0.6464985  -2.223              0.02622 *  
## DataUsage         2.1158316  2.2762817   0.930              0.35262    
## CustServCalls     0.5000905  0.0462865  10.804 < 0.0000000000000002 ***
## DayMins           0.0449673  0.0384947   1.168              0.24275    
## DayCalls          0.0005983  0.0032536   0.184              0.85411    
## MonthlyCharge    -0.1951222  0.2263181  -0.862              0.38860    
## OverageFee        0.4555130  0.3858588   1.181              0.23779    
## RoamMins          0.0706709  0.0261827   2.699              0.00695 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1955.1  on 2332  degrees of freedom
## Residual deviance: 1564.6  on 2322  degrees of freedom
## AIC: 1586.6
## 
## Number of Fisher Scoring iterations: 5

We can see that there are still insignificant variables in the model. I personally think that all the independent variables in our data are important for determining whether customers will leave the service, but some are statistically insignificant with respect to the target variable. I will use the backward stepwise algorithm to improve the model.

model_logreg_2 <- step(object = model_logreg,
                       direction = "backward",
                       trace = F)

summary(model_logreg_2)
## 
## Call:
## glm(formula = Churn ~ ContractRenewal + DataPlan + CustServCalls + 
##     DayMins + OverageFee + RoamMins, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9861  -0.5217  -0.3585  -0.2205   2.9544  
## 
## Coefficients:
##                   Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)      -5.080538   0.512817  -9.907 < 0.0000000000000002 ***
## ContractRenewal1 -2.006294   0.167471 -11.980 < 0.0000000000000002 ***
## DataPlan1        -0.968907   0.172405  -5.620         0.0000000191 ***
## CustServCalls     0.497958   0.046107  10.800 < 0.0000000000000002 ***
## DayMins           0.011786   0.001254   9.402 < 0.0000000000000002 ***
## OverageFee        0.123382   0.026997   4.570         0.0000048745 ***
## RoamMins          0.077687   0.024309   3.196              0.00139 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1955.1  on 2332  degrees of freedom
## Residual deviance: 1566.0  on 2326  degrees of freedom
## AIC: 1580
## 
## Number of Fisher Scoring iterations: 5

3.1 Model Evaluation

Making the prediction and evaluating its performance using a confusion matrix.

test$pred_prob <- predict(model_logreg_2,
                          newdata = test,
                          type = "response")

test$pred <- as.factor(ifelse(test$pred_prob > 0.5, 1, 0))

confusionMatrix(test$pred,
                reference = test$Churn,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 839 112
##          1  23  26
##                                              
##                Accuracy : 0.865              
##                  95% CI : (0.8422, 0.8856)   
##     No Information Rate : 0.862              
##     P-Value [Acc > NIR] : 0.4134             
##                                              
##                   Kappa : 0.2218             
##                                              
##  Mcnemar's Test P-Value : 0.00000000000003624
##                                              
##             Sensitivity : 0.1884             
##             Specificity : 0.9733             
##          Pos Pred Value : 0.5306             
##          Neg Pred Value : 0.8822             
##              Prevalence : 0.1380             
##          Detection Rate : 0.0260             
##    Detection Prevalence : 0.0490             
##       Balanced Accuracy : 0.5809             
##                                              
##        'Positive' Class : 1                  
## 

Although our model has a fairly high accuracy of 86.5%, looking at the other metrics we can see that it classifies class 0 very well but is terrible at classifying class 1. This makes sense, since our data initially has a moderate class imbalance. Regarding our model and data, there are two important error cases:

  • False Positive: the customer is going to keep using the service, but the model predicts that they will leave. The provider may offer promos to retain the customer and lose some funds in the process.
  • False Negative: the customer is going to leave the service, but the model predicts that they will keep using it. The provider may lose the customer entirely.

As stated before, the cost of acquiring a new customer is far greater than that of retaining an existing one. Thus, the impact of a False Negative is much more severe for the company. Looking at the metrics in the confusion matrix:

  • Recall / Sensitivity: out of all customers who will leave the service, how many are correctly classified?
  • Specificity: out of all customers who will stay, how many are correctly classified?
  • Positive Predictive Value / Precision: out of all customers who are predicted to leave, how many are correctly classified?
  • Accuracy: how many customers are correctly classified out of all cases?

Since we want to minimize false negative cases, we want to improve our model’s recall/sensitivity.
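
As a quick sanity check, these metrics can be recomputed by hand from the confusion matrix above:

# By hand from the confusion matrix above: TP = 26, FN = 112, TN = 839, FP = 23
26 / (26 + 112)    # Sensitivity / Recall = TP / (TP + FN) = 0.1884
839 / (839 + 23)   # Specificity = TN / (TN + FP) = 0.9733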

3.2 Cutoff Tuning

I will plot recall, specificity, precision, and accuracy for all possible cutoffs.

# Compute accuracy, recall, specificity, and precision for a given cutoff
metrics <- function(cutoff){
  prediction <- as.factor(ifelse(test$pred_prob > cutoff, 1, 0))
  conf <- confusionMatrix(prediction, 
                          reference = test$Churn,
                          positive = "1")
  res <- c(conf$overall[1], conf$byClass[1], conf$byClass[2], conf$byClass[3])
  return(res)
}
cutoffs <- seq(0.01, 0.99, length = 99)
result <- matrix(nrow = 99, ncol = 4)

for(i in 1:99){
  result[i, ] <- metrics(cutoffs[i])
}

result <- as.data.frame(result) %>% 
  rename(Accuracy = V1,
         Recall = V2,
         Specificity = V3,
         Precision = V4) %>% 
  mutate(Cutoff = cutoffs)

Plotting metrics for all possible cutoffs.

result %>% 
  gather(key = "metrics", value = "value", -Cutoff) %>% 
  ggplot(mapping = aes(x = Cutoff,
                       y = value,
                       col = metrics)) +
  geom_line(lwd = 1) +
  labs(title = "Metrics for Different Cutoffs",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5))

I personally think that the best cutoff for this data is 0.16. It gives us 77.7% accuracy, 76.8% recall, 77.8% specificity, and 35.7% precision. Honestly, the model is not good enough because of the class imbalance, but since I do not have any more data for this case, I will stick with it. Recall how severe the impact of a false negative could be; that is why I chose 0.16 as the model’s optimum cutoff, since it gives a fair balance between accuracy and recall. Here is the final result.

test$pred <- as.factor(ifelse(test$pred_prob > 0.16, 1, 0))
conf_matrix <- confusionMatrix(test$pred,
                               reference = test$Churn,
                               positive = "1")

eval_logreg <- data.frame("Accuracy" = conf_matrix$overall[1],
                          "Recall" = conf_matrix$byClass[1],
                          "Specificity" = conf_matrix$byClass[2],
                          "Precision" = conf_matrix$byClass[3],
                          "Neg.Pred.Value" = conf_matrix$byClass[4],
                          row.names = NULL)

eval_logreg

3.3 Model Interpretation

One of the advantages of the logistic regression model is its interpretability. By exponentiating the coefficients, we can see how strongly each feature influences the odds of the target variable.

data.frame("Coefficient" = exp(model_logreg_2$coefficients)) %>% 
  arrange(desc(Coefficient))

To determine whether customers will leave the service, the feature with the largest positive effect seems to be CustServCalls. For a one-unit increase in CustServCalls, we expect to see a ~64.5% increase in the odds of leaving the service. This might be because more calls to customer service mean that the customer is experiencing more trouble.
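
As a quick check, this odds ratio follows directly from the coefficient reported in the model summary above:

# exp(0.497958) = 1.645: each additional customer service call multiplies
# the odds of churning by about 1.645, i.e. a ~64.5% increase
exp(coef(model_logreg_2)["CustServCalls"])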

4 K - Nearest Neighbor

KNN does not build a model the way logistic regression does. Instead, it directly infers the class of a point based on its K nearest records, with distance measured using Euclidean distance (see the formula below). KNN can only handle numerical data, so we have to remove the non-numeric features and scale the rest so that they all have the same range.
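
For two observations \(x\) and \(y\) with \(p\) numeric features, the Euclidean distance is

\[ d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2} \]

which is why features on larger scales would dominate the distance if left unscaled.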

# Scale the train features; scale() stores the centers and scales as attributes
train_x_scaled <- train_up %>% 
  select_if(is.numeric) %>% 
  scale()

# Scale the test features using the *train* centers and scales to avoid leakage
test_x_scaled <- test %>% 
  select(-c(pred_prob, pred)) %>% 
  select_if(is.numeric) %>% 
  scale(center = attr(train_x_scaled, "scaled:center"),
        scale = attr(train_x_scaled, "scaled:scale"))

train_y <- train_up$Churn
test_y <- test$Churn

The next step is to find the optimum K. A common practice is to begin with K equal to the square root of the number of training observations. Also, since the number of classes in our data is even, we should pick an odd value for K to avoid ties.

sqrt(nrow(train_up))
## [1] 63.05553

We will begin with \(K = 63\).
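
As a small convenience (my own helper, not part of the original analysis), the square-root rule can be rounded to the nearest odd integer programmatically:

# Hypothetical helper: start from floor(sqrt(n)) and bump even values to odd
k_start <- floor(sqrt(nrow(train_up)))
if (k_start %% 2 == 0) k_start <- k_start + 1
k_start   # 63 here, since floor(63.06) = 63 is already odd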

pred_knn <- knn(train = train_x_scaled,
                test = test_x_scaled,
                cl = train_y,
                k = 63)

4.1 Evaluation

We will use a confusion matrix to evaluate the performance of KNN on our data.

confusionMatrix(pred_knn,
                reference = test_y,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 745  39
##          1 117  99
##                                          
##                Accuracy : 0.844          
##                  95% CI : (0.82, 0.866)  
##     No Information Rate : 0.862          
##     P-Value [Acc > NIR] : 0.9532         
##                                          
##                   Kappa : 0.4701         
##                                          
##  Mcnemar's Test P-Value : 0.0000000007051
##                                          
##             Sensitivity : 0.7174         
##             Specificity : 0.8643         
##          Pos Pred Value : 0.4583         
##          Neg Pred Value : 0.9503         
##              Prevalence : 0.1380         
##          Detection Rate : 0.0990         
##    Detection Prevalence : 0.2160         
##       Balanced Accuracy : 0.7908         
##                                          
##        'Positive' Class : 1              
## 

Although the accuracy is slightly lower than that of the initial logistic regression model, KNN seems to classify both class 0 and class 1 fairly well.

4.2 Tuning

I will plot the same metrics as the ones used in logistic regression for different values of K.

# Compute accuracy, recall, specificity, and precision for a given K
knn_metrics <- function(k){
  prediction <- knn(train = train_x_scaled,
                    test = test_x_scaled,
                    cl = train_y,
                    k = k)
  
  conf <- confusionMatrix(prediction, 
                          reference = test_y,
                          positive = "1")
  
  res <- c(conf$overall[1], conf$byClass[1], conf$byClass[2], conf$byClass[3])
  return(res)
}
knn_result <- matrix(nrow = 100, ncol = 4)
  
for(i in 1:100){
  knn_result[i, ] <- knn_metrics(i)
}

knn_result <- as.data.frame(knn_result) %>% 
  rename(Accuracy = V1,
         Recall = V2,
         Specificity = V3,
         Precision = V4) %>% 
  mutate(k = seq(1, 100, length = 100))

Plotting the metrics for different values of K.

knn_result %>% 
  gather(key = "metrics", value = "value", -k) %>% 
  ggplot(mapping = aes(x = k,
                       y = value,
                       col = metrics)) +
  geom_line(lwd = 1) +
  labs(title = "Metrics for Different Values of K",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5))

Again, deciding which value of K is optimal will depend on the business case. I personally think that \(K = 41\) is the best choice here, as it gives a fair balance between recall and the other metrics.

pred_knn <- knn(train = train_x_scaled,
                 test = test_x_scaled,
                 cl = train_y,
                 k = 41)

conf_matrix <- confusionMatrix(pred_knn,
                               reference = test_y,
                               positive = "1")

eval_knn <- data.frame("Accuracy" = conf_matrix$overall[1],
                       "Recall" = conf_matrix$byClass[1],
                       "Specificity" = conf_matrix$byClass[2],
                       "Precision" = conf_matrix$byClass[3],
                       "Neg.Pred.Value" = conf_matrix$byClass[4],
                       row.names = NULL)

eval_knn

Here is the final result: using KNN with \(K = 41\) gives us 82.1% accuracy, 74.64% recall, 83.3% specificity, 41.7% precision, and a 95.35% negative predictive value.

5 Comparison Between Logistic Regression and KNN

Here are the results of the optimized logistic regression and KNN models.

# Logistic Regression Evaluation

eval_logreg
# KNN Evaluation

eval_knn
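
For a side-by-side view, the two one-row evaluation tables can also be stacked (a small convenience, not part of the original output):

# Optional: combine both evaluations into a single comparison table
rbind("Logistic Regression" = eval_logreg,
      "KNN (k = 41)" = eval_knn)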

There is only a slight difference between the metrics of logistic regression and KNN. Overall, the K-Nearest Neighbors algorithm gives better predictions on this dataset, although its recall is slightly lower than that of logistic regression.

6 Conclusion

For predicting customer churn on this telecommunication dataset, both KNN and logistic regression give similar results, with KNN having slightly better metrics overall. Accuracy aside, the logistic regression model offers more benefits than KNN, since its results are more interpretable and it can handle categorical variables directly. The models would also perform better if there were more data for this case. There is a moderate class imbalance in the data, where 85% of the observations belong to customers who keep using the provider’s services; thus, the models had difficulty learning the characteristics of the customers who will leave. With more observations of churning customers, I am fairly confident the models would classify them better.