Introduction

In this session, we analyze customer churn using two machine learning models: Logistic Regression and K-NN. The dataset used is WA_Fn-UseC_-Telco-Customer-Churn.csv, which contains customer data from a telecommunications company.

Objectives

  • Determine which variables are predictor variables and which is the target variable.
  • Build two models: Logistic Regression and K-NN.
  • Compare both models to determine which one performs better.

Data Preprocessing

Libraries

The libraries used in this analysis are:

library(dplyr)
library(tidyverse)
library(GGally)
library(caret)
library(car)

Read Data

churn <-  read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
head(churn)

Check and Adjust Data Types

We will check the data types of each column to ensure they are correctly formatted.

str(churn)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

Data Description

The dataset consists of 7,043 rows and 21 columns.

  • customerID : Unique ID for each customer.
  • gender : Customer’s gender.
  • SeniorCitizen : Whether the customer is a senior citizen (0 = No, 1 = Yes).
  • Partner : Whether the customer has a partner (Yes or No).
  • Dependents : Whether the customer has dependents (Yes or No).
  • tenure : Number of months the customer has been using the service.
  • PhoneService : Whether the customer has phone service (Yes or No).
  • MultipleLines : Whether the customer has multiple phone lines.
  • InternetService : Type of internet service (DSL, Fiber optic, No).
  • OnlineSecurity : Additional online security service.
  • OnlineBackup : Additional online backup service.
  • DeviceProtection: Additional device protection service.
  • TechSupport : Additional tech support service.
  • StreamingTV : Whether the customer subscribes to streaming TV (Yes or No).
  • StreamingMovies : Whether the customer subscribes to streaming movies (Yes or No).
  • Contract : Type of customer contract (Month-to-month, One year, Two year).
  • PaperlessBilling: Whether the customer uses paperless billing.
  • PaymentMethod : Customer’s payment method.
  • MonthlyCharges : Monthly charge paid by the customer.
  • TotalCharges : Total amount paid by the customer.
  • Churn : Whether the customer has canceled the subscription (Yes or No).

The structure above shows that the categorical columns were read as character vectors, so we convert them to factors before proceeding to the next steps.

churn <- churn %>% 
  mutate_if(is.character, as.factor)

str(churn)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
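Note that the conversion above also turns customerID into a factor with 7,043 levels, one per row, which carries no predictive information. The following is a minimal sketch on a toy data frame (the `toy` values and the `toy_clean` name are made up for illustration and are not part of the original analysis). It performs the same conversion with `across()`, the current dplyr idiom, and drops the identifier column first:

```r
library(dplyr)

# Toy stand-in for the churn data (values are made up for illustration)
toy <- data.frame(
  customerID = c("A-1", "B-2", "C-3"),
  gender     = c("Female", "Male", "Male"),
  tenure     = c(1L, 34L, 2L),
  Churn      = c("No", "No", "Yes"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  select(-customerID) %>%                           # identifier, not a predictor
  mutate(across(where(is.character), as.factor))    # same effect as mutate_if()

str(toy_clean)
```

Numeric columns such as tenure are left untouched; only the character columns change type.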

Check for Missing Values (NA)

churn %>% 
  is.na() %>% 
  colSums()
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

The TotalCharges column contains 11 missing values (NA), which we handle as follows.

Fill NA with 0

churn_clean <- churn %>% 
  mutate(TotalCharges = replace_na(TotalCharges, 0))
head(churn_clean)
churn_clean %>% 
  is.na() %>% 
  colSums()
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0                0 
##            Churn 
##                0

Checking again confirms that the dataset no longer contains any missing values (NA).

Check Correlation

We check the correlation between variables to guide the choice of predictors for model development. Note that ggcorr() only works on numeric columns, so the factor columns are ignored, as the warning in the output indicates.

ggcorr(churn_clean, label=TRUE)
## Warning in ggcorr(churn_clean, label = TRUE): data in column(s) 'customerID',
## 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
## 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
## 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
## 'PaperlessBilling', 'PaymentMethod', 'Churn' are not numeric and were ignored
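Because the factor columns are ignored, the correlation plot only covers SeniorCitizen, tenure, MonthlyCharges, and TotalCharges. As a rough sketch of the same idea in base R, the snippet below builds a toy data frame (all values are invented), keeps the numeric columns, and computes their pairwise correlations with `cor()`:

```r
# Toy columns mimicking tenure, MonthlyCharges and a factor (made-up values)
toy <- data.frame(
  tenure         = c(1, 34, 2, 45, 8),
  MonthlyCharges = c(29.9, 57.0, 53.9, 42.3, 99.7),
  gender         = factor(c("F", "M", "M", "M", "F"))
)
toy$TotalCharges <- toy$tenure * toy$MonthlyCharges

num_cols <- sapply(toy, is.numeric)   # logical mask of numeric columns
cor_mat  <- cor(toy[, num_cols])      # pairwise correlation matrix

round(cor_mat, 2)
```

The factor column gender does not appear in the result, mirroring what ggcorr() does with non-numeric data.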

Train-Test Split

We split the dataset into 80% training data and 20% testing data (a simple hold-out split), setting the seed to 123 with set.seed() so that the split is reproducible.

RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(churn_clean)

index <- sample(row_data, row_data*0.8)

data_train <- churn_clean[ index, ]
data_test <- churn_clean[ -index, ] 
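The split above draws rows at random without regard to class, so the Churn ratio in the two parts can differ slightly. One alternative, shown here only as a sketch on simulated data (the `toy`, `toy_train`, and `toy_test` names are hypothetical), is caret's `createDataPartition()`, which samples within each class:

```r
library(caret)

set.seed(123)
# Toy target with an imbalance similar to Churn (values are made up)
toy <- data.frame(
  x     = rnorm(200),
  Churn = factor(rep(c("No", "Yes"), times = c(150, 50)))
)

# Sampling happens per class, so both parts keep roughly the same ratio
idx       <- createDataPartition(toy$Churn, p = 0.8, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

prop.table(table(toy_train$Churn))
prop.table(table(toy_test$Churn))
```

This is not what the original analysis does; it is one option to consider when the class ratio must be preserved in both parts.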

Target Class Proportion

After splitting the data, the next step is to check whether the target variable (Churn) in the training data is balanced or imbalanced.

prop.table(table(data_train$Churn))
## 
##        No       Yes 
## 0.7328718 0.2671282

The target classes are imbalanced (about 73% No versus 27% Yes), so we downsample the training data:

set.seed(123)
data_train_downsample <- downSample(x = data_train %>% select(-Churn),
                               y = data_train$Churn, 
                               list = F,
                               yname = "Churn"
                               )

Downsampling is the process of reducing the majority class until its count matches the minority class, achieving a balanced class proportion. In our training data, the No class will be sampled and reduced until it matches the number of Yes instances.

prop.table(table(data_train_downsample$Churn))
## 
##  No Yes 
## 0.5 0.5
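To make the mechanics of downsampling concrete, here is a minimal base R sketch of the idea on a toy data frame. It is an illustration under made-up data, not the internals of caret's downSample(): each class is randomly reduced to the size of the smallest class.

```r
set.seed(123)
# Toy imbalanced data: 8 "No" rows and 3 "Yes" rows (made-up values)
toy <- data.frame(
  tenure = c(1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 5),
  Churn  = factor(c("No", "No", "Yes", "No", "Yes", "Yes",
                    "No", "No", "No", "No", "No"))
)

n_min         <- min(table(toy$Churn))              # size of the smallest class
rows_by_class <- split(seq_len(nrow(toy)), toy$Churn)

# Draw n_min row indices from every class, then stack the kept rows
keep <- unlist(lapply(rows_by_class,
                      function(i) i[sample.int(length(i), n_min)]))
toy_down <- toy[keep, ]

table(toy_down$Churn)
```

Only the training data should be balanced this way; the test data keeps its natural class ratio so that the evaluation reflects real conditions.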

The target classes are now balanced, so we proceed to the next step, which is the model-building process.

Model Building

Logistic Regression

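Based on the previous correlation analysis, which covers only the numeric columns, we will use four predictor variables (SeniorCitizen, tenure, MonthlyCharges, TotalCharges) to predict the target variable Churn. The step() function then performs stepwise selection on this initial model.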

model_down <- glm(Churn ~ SeniorCitizen + tenure + MonthlyCharges + TotalCharges,
                  data_train_downsample, family = "binomial")
model_step_down <- step(model_down, trace = 0)
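A fitted logistic regression is easier to read once its coefficients are converted from log-odds to odds ratios. The sketch below uses simulated data (the relationship between tenure and churn is invented for illustration, and `fit` and `odds_ratio` are hypothetical names) to show the conversion with `exp(coef())`:

```r
set.seed(123)
# Simulated data: churn probability falls as tenure grows (made-up relationship)
n      <- 500
tenure <- runif(n, 0, 72)
p      <- plogis(1 - 0.05 * tenure)
toy    <- data.frame(
  tenure = tenure,
  Churn  = factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))
)

fit <- glm(Churn ~ tenure, data = toy, family = "binomial")

# glm() models the second factor level ("Yes");
# exp() turns the log-odds coefficients into odds ratios
odds_ratio <- exp(coef(fit))
odds_ratio
```

An odds ratio below 1 for tenure means that each additional month lowers the odds of churn in this simulated example. The same call could be applied to model_step_down, whose actual coefficients are not shown here.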

Prediction

After building the model, the next step is to make predictions on the data_test dataset.

pred_downsample <- predict(model_step_down, data_test, type = "response")
pred_class_down <- ifelse(pred_downsample > 0.5, "Yes", "No") %>% 
  as.factor()
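The 0.5 cut-off is a convention rather than a requirement, and moving it trades recall against precision. The following sketch uses hypothetical probabilities and labels (all values are invented, and `classify()` and `recall_at()` are illustrative helpers) to show how recall for the Yes class changes with the threshold:

```r
# Hypothetical predicted probabilities and true labels (made-up values)
prob  <- c(0.10, 0.35, 0.45, 0.55, 0.65, 0.90)
truth <- factor(c("No", "No", "Yes", "No", "Yes", "Yes"),
                levels = c("No", "Yes"))

# Fixing the levels keeps both classes present even if one is never predicted
classify <- function(prob, threshold) {
  factor(ifelse(prob > threshold, "Yes", "No"), levels = c("No", "Yes"))
}

# Recall for "Yes": share of actual churners that are predicted as churners
recall_at <- function(threshold) {
  pred <- classify(prob, threshold)
  sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")
}

recall_at(0.5)  # 2 of the 3 churners are caught
recall_at(0.4)  # all 3 churners are caught
```

Lowering the threshold catches more churners at the cost of more false alarms, which matters when missing a churner is more expensive than contacting a loyal customer.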

Evaluation

Finally, we evaluate the model’s performance by calculating Accuracy, Sensitivity, and other metrics.

log_conf <- confusionMatrix(pred_class_down, data_test$Churn, positive = "Yes")
log_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  761  91
##        Yes 284 273
##                                         
##                Accuracy : 0.7339        
##                  95% CI : (0.71, 0.7568)
##     No Information Rate : 0.7417        
##     P-Value [Acc > NIR] : 0.7588        
##                                         
##                   Kappa : 0.4078        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.7500        
##             Specificity : 0.7282        
##          Pos Pred Value : 0.4901        
##          Neg Pred Value : 0.8932        
##              Prevalence : 0.2583        
##          Detection Rate : 0.1938        
##    Detection Prevalence : 0.3953        
##       Balanced Accuracy : 0.7391        
##                                         
##        'Positive' Class : Yes           
## 
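The main metrics in this output can be reproduced by hand from the four cells of the confusion matrix, which helps when checking which class is treated as positive. The sketch below copies the counts from the matrix above and recomputes them in base R (the variable names are illustrative):

```r
# Counts copied from the confusion matrix above (positive class = "Yes")
tp <- 273; fn <- 91    # actual "Yes": predicted Yes / predicted No
tn <- 761; fp <- 284   # actual "No":  predicted No  / predicted Yes

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The results match the reported values of 0.7339, 0.7500, 0.7282, and 0.4901. Note that the accuracy is slightly below the No Information Rate of 0.7417, so accuracy alone does not show an improvement over always predicting No.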

K-NN Model

Before building the K-Nearest Neighbors (K-NN) model, we apply Z-Score scaling to standardize the data.

train_x <- data_train_downsample %>% 
  select(c(SeniorCitizen,tenure,MonthlyCharges,TotalCharges)) %>%
  scale()

train_y <- data_train_downsample$Churn

Then, we scale the test data with the mean and standard deviation computed on the training data, passing them to scale() through its center and scale arguments. Using the training statistics for both datasets keeps them on the same scale and prevents data leakage.

test_x <- data_test %>% 
  select(c(SeniorCitizen,tenure,MonthlyCharges,TotalCharges)) %>%
  scale(center = attr(train_x, "scaled:center"),
                              scale = attr(train_x, "scaled:scale")
        )
test_y <- data_test$Churn
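The scaling step relies on the fact that scale() stores the column means and standard deviations it used as attributes of its result. A small sketch with made-up matrices (`train_raw` and `test_raw` are hypothetical names) shows how those attributes carry over to new data:

```r
# Toy train/test matrices (made-up values)
train_raw <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                    dimnames = list(NULL, c("tenure", "MonthlyCharges")))
test_raw  <- matrix(c(2.5, 25), ncol = 2,
                    dimnames = list(NULL, c("tenure", "MonthlyCharges")))

train_scaled <- scale(train_raw)

# Reuse the training means and standard deviations for the test data
test_scaled <- scale(test_raw,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

attr(train_scaled, "scaled:center")  # training means: 2.5 and 25
test_scaled                          # a test row at the training means maps to 0
```

Calling scale() on the test data without these arguments would standardize it with its own statistics, putting the two datasets on different scales.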

To choose the K value for the K-NN model, we use the square root of the number of training samples. This is a common rule-of-thumb starting point rather than a tuned optimum.

k_choose <- sqrt(nrow(train_x)) %>% 
  round()
k_choose
## [1] 55
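With two classes, an odd K avoids tied votes among the neighbors, and the value 55 obtained here is already odd. The helper below is a hypothetical sketch (`choose_k()` is not part of caret or the original analysis) that applies the square-root heuristic and bumps even results to the next odd number:

```r
# Square-root heuristic, nudged to an odd number to avoid tied votes
choose_k <- function(n_train) {
  k <- round(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1
  k
}

choose_k(3000)  # sqrt is about 54.8, rounds to 55 (already odd)
choose_k(100)   # sqrt is 10, bumped to 11
```

The input sizes 3000 and 100 are arbitrary illustrations. A more thorough approach would compare several K values on held-out data.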

Prediction

pred_knn <- knn3Train(train = train_x, 
                      cl = train_y,
                      test = test_x,
                      k = k_choose
                      ) %>% 
  as.factor()

Evaluation

As with the logistic regression, we set positive = "Yes" so that sensitivity, specificity, and the predictive values describe the churn class and can be compared across the two models.

knn_conf <- confusionMatrix(pred_knn, test_y, positive = "Yes")
knn_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  735  78
##        Yes 310 286
##                                           
##                Accuracy : 0.7246          
##                  95% CI : (0.7005, 0.7478)
##     No Information Rate : 0.7417          
##     P-Value [Acc > NIR] : 0.9313          
##                                           
##                   Kappa : 0.405           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7857          
##             Specificity : 0.7033          
##          Pos Pred Value : 0.4799          
##          Neg Pred Value : 0.9041          
##              Prevalence : 0.2583          
##          Detection Rate : 0.2030          
##    Detection Prevalence : 0.4230          
##       Balanced Accuracy : 0.7445          
##                                           
##        'Positive' Class : Yes             
## 

Model Performance

The final step is to compare the performance of the two models: Logistic Regression and K-NN. We will evaluate them using key metrics such as:

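  • Accuracy – The proportion of all predictions that are correct
  • Precision – The proportion of predicted positives that are actually positive
  • Recall (Sensitivity) – The proportion of actual positives that the model detects
  • Specificity – The proportion of actual negatives that the model correctly identifies
  • F1-Score – The harmonic mean of precision and recall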
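The F1-Score is not reported by the summary tables in this section, but it follows directly from the confusion-matrix counts. The sketch below defines an illustrative helper (`f1_score()` is a hypothetical name) and applies it to the counts of both models, taking Yes (churn) as the positive class:

```r
# F1 from confusion-matrix counts, with "Yes" (churn) as the positive class
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Counts taken from the two confusion matrices in this document
f1_logit <- f1_score(tp = 273, fp = 284, fn = 91)
f1_knn   <- f1_score(tp = 286, fp = 310, fn = 78)

round(c(logit = f1_logit, knn = f1_knn), 4)
```

Both values come out close to 0.59 (about 0.5928 for Logistic Regression and 0.5958 for K-NN), so the F1-Score barely separates the two models.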

Logistic Regression

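# tibble() replaces the deprecated data_frame(); metrics are selected by name
eval_logit <- tibble(Accuracy = log_conf$overall["Accuracy"],
           Recall = log_conf$byClass["Sensitivity"],
           Specificity = log_conf$byClass["Specificity"],
           Precision = log_conf$byClass["Pos Pred Value"])
eval_logit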

K-NN

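# tibble() replaces the deprecated data_frame(); metrics are selected by name.
# These values describe the positive class of knn_conf, which should be "Yes"
# to be comparable with eval_logit.
eval_knn <- tibble(Accuracy = knn_conf$overall["Accuracy"],
           Recall = knn_conf$byClass["Sensitivity"],
           Specificity = knn_conf$byClass["Specificity"],
           Precision = knn_conf$byClass["Pos Pred Value"])

eval_knn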

Conclusion

Based on the analysis performed, we can draw the following conclusions:

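  • The predictor variables supplied to the Logistic Regression model are SeniorCitizen, tenure, MonthlyCharges, and TotalCharges; the same four are used for K-NN. The target variable is Churn.

  • Comparing the two models on the test data with Yes (churn) as the positive class, Logistic Regression has slightly higher accuracy (0.7339 vs. 0.7246), specificity (0.7282 vs. 0.7033), and precision (0.4901 vs. 0.4799), while K-NN has higher recall (0.7857, i.e. 286 of 364 churners, vs. 0.7500).

  • The differences are small, and both accuracies fall below the No Information Rate of 0.7417, so neither model beats always predicting No on accuracy alone. Their value lies in detecting roughly three quarters of the customers who actually churn.

  • Final Recommendation: Logistic Regression remains the preferred model for predicting customer churn in this dataset, given its slight edge in accuracy and precision and its interpretable coefficients. If catching as many churners as possible is the priority, the higher recall of K-NN should be weighed against this.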