In this session, we conduct a churn analysis using two machine
learning models: Logistic Regression and K-NN. The dataset used is
WA_Fn-UseC_-Telco-Customer-Churn.csv, which contains
customer data from a telecommunications company.
The libraries used in this analysis are:
library(dplyr)
library(tidyverse)
library(GGally)
library(caret)
library(car)
churn <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
head(churn)
We will check the data types of each column to ensure they are correctly formatted.
str(churn)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
Data Description
The dataset consists of 7,043 rows and 21 columns.
customerID : Unique ID for each customer.
gender : Customer’s gender.
SeniorCitizen : Whether the customer is a senior citizen (0 = No, 1 = Yes).
Partner : Whether the customer has a partner (Yes or No).
Dependents : Whether the customer has dependents (Yes or No).
tenure : Number of months the customer has been using the service.
PhoneService : Whether the customer has phone service (Yes or No).
MultipleLines : Whether the customer has multiple phone lines.
InternetService : Type of internet service (DSL, Fiber optic, No).
OnlineSecurity : Additional online security service.
OnlineBackup : Additional online backup service.
DeviceProtection : Additional device protection service.
TechSupport : Additional tech support service.
StreamingTV : Whether the customer subscribes to streaming TV (Yes or No).
StreamingMovies : Whether the customer subscribes to streaming movies (Yes or No).
Contract : Type of customer contract (Month-to-month, One year, Two year).
PaperlessBilling : Whether the customer uses paperless billing.
PaymentMethod : Customer’s payment method.
MonthlyCharges : Monthly charge paid by the customer.
TotalCharges : Total amount paid by the customer.
Churn : Whether the customer has canceled the subscription (Yes or No).
From the above data, we can see that some data types need to be adjusted before proceeding to the next steps.
churn <- churn %>%
mutate_if(is.character, as.factor)
str(churn)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
churn %>%
is.na() %>%
colSums()
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
The TotalCharges column contains 11 missing values (NA). We handle them by replacing each NA with 0:
churn_clean <- churn %>%
mutate(TotalCharges = replace_na(TotalCharges, 0))
head(churn_clean)
churn_clean %>%
is.na() %>%
colSums()
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 0
## Churn
## 0
Checking for missing values again confirms that the dataset no longer contains any missing values (NA).
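The imputation step can be sketched on a small made-up data frame, including a look at the affected rows before they are filled in. The values below are illustrative only and are not taken from the dataset; the sketch uses base R rather than replace_na().

```r
# Toy data standing in for the real table (values are made up for illustration)
toy <- data.frame(
  tenure         = c(0, 12, 0, 5),
  MonthlyCharges = c(52.5, 70.0, 20.3, 89.9),
  TotalCharges   = c(NA, 840.0, NA, 449.5)
)

# Inspect the rows where TotalCharges is missing before changing anything
na_rows <- toy[is.na(toy$TotalCharges), ]
print(na_rows)

# Replace NA with 0 (same effect as replace_na(TotalCharges, 0))
toy$TotalCharges[is.na(toy$TotalCharges)] <- 0

# Count remaining missing values per column
print(colSums(is.na(toy)))
```

If the missing rows all turn out to have a tenure of 0, filling with 0 is a natural choice, since such customers would not yet have been billed.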
Next, we check the correlation between variables to guide the choice of predictors. Note that ggcorr() only uses numeric columns, so the factor columns (including Churn) are ignored, as the warning shows.
ggcorr(churn_clean, label=TRUE)
## Warning in ggcorr(churn_clean, label = TRUE): data in column(s) 'customerID',
## 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
## 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
## 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
## 'PaperlessBilling', 'PaymentMethod', 'Churn' are not numeric and were ignored
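Because Churn is a factor, it is left out of the correlation plot. One possible way to relate the numeric predictors to the target is to code Churn as 0/1 first. The sketch below does this on made-up values; the column churn_num is a name introduced here for illustration.

```r
# Toy data (illustrative values, not the real dataset)
toy <- data.frame(
  tenure         = c(1, 34, 2, 45, 2, 8),
  MonthlyCharges = c(29.9, 57.0, 53.9, 42.3, 70.7, 99.7),
  Churn          = factor(c("No", "No", "Yes", "No", "Yes", "Yes"))
)

# Code the target as 1 for "Yes" and 0 for "No"
toy$churn_num <- as.integer(toy$Churn == "Yes")

# Pearson correlation of each numeric predictor with the 0/1 target
cors <- cor(toy[, c("tenure", "MonthlyCharges")], toy$churn_num)
print(round(cors, 2))
```

A negative value for tenure would mean that longer-standing customers churn less often in the sample.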
We will split the dataset into 80% training data and 20% testing
data, setting the random seed to 123 with set.seed() so that the
split is reproducible.
RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(churn_clean)
index <- sample(row_data, row_data*0.8)
data_train <- churn_clean[ index, ]
data_test <- churn_clean[ -index, ]
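As a quick check on the split sizes, the same index logic can be run on the row count alone. This sketch uses floor() to make the rounding of 7043 * 0.8 explicit, which is an assumption about the intended behavior.

```r
n <- 7043                      # number of rows reported by str(churn)
set.seed(123)

# Draw 80% of the row numbers without replacement
index <- sample(n, floor(n * 0.8))

n_train <- length(index)
n_test  <- n - n_train
print(c(train = n_train, test = n_test))
```

The test size of 1409 matches the total number of observations in the confusion matrices reported later.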
After splitting the data into training and testing sets, the next step is to check whether the target variable (Churn) in the training data is balanced or imbalanced.
prop.table(table(data_train$Churn))
##
## No Yes
## 0.7328718 0.2671282
The classes of the target variable are imbalanced (about 73% No versus 27% Yes), so we downsample the training data:
set.seed(123)
data_train_downsample <- downSample(x = data_train %>% select(-Churn),
y = data_train$Churn,
list = FALSE,
yname = "Churn"
)
Downsampling is the process of reducing the majority class until its count matches the minority class, achieving a balanced class proportion. In our training data, the No class will be sampled and reduced until it matches the number of Yes instances.
prop.table(table(data_train_downsample$Churn))
##
## No Yes
## 0.5 0.5
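The idea behind downsampling can be sketched in base R on a toy data frame. This is not caret's implementation of downSample(), only an illustration of the same principle: keep as many randomly chosen rows per class as the smallest class has.

```r
set.seed(123)

# Toy data with 7 "No" and 3 "Yes" rows (illustrative only)
toy <- data.frame(
  x     = 1:10,
  Churn = factor(c(rep("No", 7), rep("Yes", 3)))
)

# Size of the minority class
n_min <- min(table(toy$Churn))

# For each class, keep n_min randomly chosen row numbers
rows_by_class <- split(seq_len(nrow(toy)), toy$Churn)
keep <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

toy_down <- toy[keep, ]
print(table(toy_down$Churn))
```

Downsampling is applied to the training data only; the test data keeps its original class proportions so that the evaluation reflects real conditions.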
It can be seen that the proportion of the target variable classes is now balanced. Therefore, we proceed to the next step, which is the model-building process.
Since the correlation analysis covered only the numeric columns, we
use those four variables (SeniorCitizen, tenure,
MonthlyCharges, TotalCharges) as predictors of the
target variable Churn.
model_down <- glm(Churn ~ SeniorCitizen + tenure + MonthlyCharges + TotalCharges,
data_train_downsample, family = "binomial")
model_step_down <- step(model_down, trace = 0)
After building the model, the next step is to make predictions on the
data_test dataset.
pred_downsample <- predict(model_step_down, data_test, type = "response")
pred_class_down <- ifelse(pred_downsample > 0.5, "Yes", "No") %>%
as.factor()
Finally, we evaluate the model’s performance by calculating
Accuracy, Sensitivity, and other metrics.
log_conf <- confusionMatrix(pred_class_down, data_test$Churn, positive = "Yes")
log_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 761 91
## Yes 284 273
##
## Accuracy : 0.7339
## 95% CI : (0.71, 0.7568)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : 0.7588
##
## Kappa : 0.4078
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7500
## Specificity : 0.7282
## Pos Pred Value : 0.4901
## Neg Pred Value : 0.8932
## Prevalence : 0.2583
## Detection Rate : 0.1938
## Detection Prevalence : 0.3953
## Balanced Accuracy : 0.7391
##
## 'Positive' Class : Yes
##
Before building the K-Nearest Neighbors (K-NN) model, we apply
Z-Score scaling to standardize the data.
train_x <- data_train_downsample %>%
select(c(SeniorCitizen,tenure,MonthlyCharges,TotalCharges)) %>%
scale()
train_y <- data_train_downsample$Churn
Then, we scale the test data with the mean and standard deviation
computed on the training data, passed to the scale() function through
its center and scale arguments. This keeps both datasets on the same
scale and avoids leaking information from the test set.
test_x <- data_test %>%
select(c(SeniorCitizen,tenure,MonthlyCharges,TotalCharges)) %>%
scale(center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale")
)
test_y <- data_test$Churn
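The reuse of the training parameters can be seen on a tiny example. The matrices below are made up for illustration; the attribute names "scaled:center" and "scaled:scale" are the ones scale() attaches to its result.

```r
# Toy training and test matrices (illustrative values)
train <- matrix(c(1, 2, 3, 4, 5,
                  10, 20, 30, 40, 50),
                ncol = 2,
                dimnames = list(NULL, c("tenure", "MonthlyCharges")))
test <- matrix(c(3, 6,
                 30, 60),
               ncol = 2,
               dimnames = list(NULL, c("tenure", "MonthlyCharges")))

# Standardize the training data; scale() stores the parameters as attributes
train_s <- scale(train)
center  <- attr(train_s, "scaled:center")  # column means of the training data
spread  <- attr(train_s, "scaled:scale")   # column standard deviations

# Standardize the test data with the training parameters, not its own
test_s <- scale(test, center = center, scale = spread)
print(test_s)
```

A test value equal to the training mean maps to 0, while a value outside the training range simply maps to a larger z-score.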
To choose a K value for the K-NN model, we use the square root of the number of training samples, a common heuristic. It gives a reasonable starting point rather than a guaranteed optimum.
k_choose <- sqrt(nrow(train_x)) %>%
round()
k_choose
## [1] 55
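With two classes, an odd K avoids tied votes among the neighbors. The value 55 is already odd, but a small helper can enforce this in general. The function choose_k below is hypothetical and introduced only for illustration.

```r
# Hypothetical helper: square-root heuristic, bumped to the next odd number
choose_k <- function(n) {
  k <- round(sqrt(n))
  if (k %% 2 == 0) k + 1 else k
}

print(choose_k(100))   # sqrt(100) = 10 is even, so the helper returns 11
print(choose_k(3025))  # sqrt(3025) = 55 is already odd
```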
pred_knn <- knn3Train(train = train_x,
cl = train_y,
test = test_x,
k = k_choose
) %>%
as.factor()
# Use "Yes" as the positive class so the metrics are comparable with the logistic regression results
knn_conf <- confusionMatrix(pred_knn, test_y, positive = "Yes")
knn_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 735 78
## Yes 310 286
##
## Accuracy : 0.7246
## 95% CI : (0.7005, 0.7478)
## No Information Rate : 0.7417
## P-Value [Acc > NIR] : 0.9313
##
## Kappa : 0.405
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7857
## Specificity : 0.7033
## Pos Pred Value : 0.4799
## Neg Pred Value : 0.9041
## Prevalence : 0.2583
## Detection Rate : 0.2030
## Detection Prevalence : 0.4230
## Balanced Accuracy : 0.7445
##
## 'Positive' Class : Yes
##
The final step is to compare the performance of the two models, Logistic Regression and K-NN, using Accuracy, Recall, Specificity, and Precision:
eval_logit <- tibble(Accuracy = log_conf$overall["Accuracy"],
Recall = log_conf$byClass["Sensitivity"],
Specificity = log_conf$byClass["Specificity"],
Precision = log_conf$byClass["Pos Pred Value"])
eval_logit
eval_knn <- tibble(Accuracy = knn_conf$overall["Accuracy"],
Recall = knn_conf$byClass["Sensitivity"],
Specificity = knn_conf$byClass["Specificity"],
Precision = knn_conf$byClass["Pos Pred Value"])
eval_knn
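The four metrics can also be recomputed by hand from the cell counts of the two confusion matrices, treating Yes (churn) as the positive class for both models. The function metrics_from_counts is a name introduced here for illustration; the counts are read from the matrices printed above.

```r
# Compute the four metrics from confusion matrix cell counts
# tp: predicted Yes, actual Yes   fn: predicted No, actual Yes
# fp: predicted Yes, actual No    tn: predicted No, actual No
metrics_from_counts <- function(tp, fn, fp, tn) {
  c(Accuracy    = (tp + tn) / (tp + fn + fp + tn),
    Recall      = tp / (tp + fn),
    Specificity = tn / (tn + fp),
    Precision   = tp / (tp + fp))
}

logit <- metrics_from_counts(tp = 273, fn = 91, fp = 284, tn = 761)
knn   <- metrics_from_counts(tp = 286, fn = 78, fp = 310, tn = 735)

comparison <- round(rbind(Logistic = logit, KNN = knn), 4)
print(comparison)
```

Reading both models with the same positive class matters: recall for Yes is 273/364 for Logistic Regression and 286/364 for K-NN.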
Based on the analysis performed, we can draw the following conclusions:
The predictor variables used in the Logistic Regression model
are: SeniorCitizen, tenure, MonthlyCharges, and
TotalCharges. Meanwhile, the target variable is
Churn.
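Comparing the two models with Yes (churn) as the positive class, the results are mixed. Logistic Regression has the higher accuracy (0.7339 versus 0.7246), specificity (761/1045 = 0.7282 versus 735/1045 = 0.7033), and precision (273/557 = 0.4901 versus 286/596 = 0.4799).
K-NN has the higher recall for churners (286/364 = 0.7857 versus 273/364 = 0.7500). Neither model's accuracy exceeds the No Information Rate of 0.7417.
Final Recommendation: Based on accuracy, specificity, and precision, Logistic Regression is the preferred model for predicting customer churn in this dataset, although K-NN detects slightly more of the actual churners.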