Telco Customer Churn Prediction and Segmentation

Introduction

Customer churn is a major concern in the telecommunications industry. It is found that, retaining existing customers is often more cost-effective than acquiring new ones. Therefore, understanding why customers leave and identifying those at risk are crucial in building customer loyalty and improving business performance.

This project focuses on analyzing the IBM Telco Customer Churn to build predictive models that forecast churn and apply segmentation technique to group/segment customers with similar characteristics together for effective target marketing.

Project Objectives

Perform Exploratory Data Analysis (EDA) to understand customer demographics and their Telco service usage.
Build and evaluate machine learning models (Logistic Regression, Decision Tree, KNN) to predict churn accurately.
Apply K-Means clustering to segment customers based on demographics and usage behavior.
Interpret and visualize segment characteristics to support marketing and retention strategies.

Dataset Description

A dataset provided by IBM describing customers’ usage pattern on Telecommunication services.
A total of 7,043 records of customer data and 21 attributes/features.
The Target prediction column is ‘Churn’, predicting whether a customer will churn from the Telco services or not. (Yes/No)

Project Questions

Question 1: How do customer demographics and usage patterns differ between churned and non-churned customers?

What visual or insights are gained from Exploratory Data Analysis (EDA)?

Question 2: Predicting whether a customer will churn or not (Yes/No).

Which machine learning model perform the best in predicting Churn?

Question 3: What are the customer segments based on their usage patterns and demographics?

How many distinct customer segments can we identify using clustering algorithms?
What are the characteristics of each segment?
Which segments contain more churn customers?
What is the marketing step for each segment, as part of retention strategy.

Methodology

This section displays the steps of data understanding, data collection, data overview, exploratory data analysis (EDA), data cleaning, and data processing.

1. Importing R Libraries

Load the required libraries for R Programming.

#load libraries
library(knitr) #for kable() function to display output in table
library(ggplot2) #for plotting

library(caret)

## Loading required package: lattice

library(rpart)
library(rpart.plot)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2. Data Collection and Data Understanding

Load the dataset and display the first 5 rows of the dataset

dataset <- read.csv("/Users/sammi/OneDrive/Documents/Telco-Customer-Churn-Data.csv")
head_data <- head(dataset, n=5)
kable(head_data)

customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

Data Understanding

Number of columns and rows:

cat("Number of columns:", ncol(dataset), "\nNumber of rows:", nrow(dataset))

## Number of columns: 21 
## Number of rows: 7043

Dataset Structure

str(dataset)

## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

There are a total of 4 numerical columns and 17 categorical columns. But the ‘Senior Citizen’ column is supposed to be categorical so, let’s change 0 and 1 into No and Yes.

dataset$SeniorCitizen <- ifelse(dataset$SeniorCitizen == 1,
                                "Yes", "No")

Therefore, there are a total of 3 numerical columns and 18 categorical columns.

Count of Missing values

#total missing values
cat("The sum of missing value in the dataset is: ", sum(is.na(dataset)))

## The sum of missing value in the dataset is:  11

#check which column has  missing values
missing_cols <- colSums(is.na(dataset))
missing_cols[missing_cols > 0]

## TotalCharges 
##           11

Only ‘Total Charges’ column has missing values. There is 11 rows in this column that has missing values.

3. Exploratory Data Analysis (EDA)

This section focuses on exploring the dataset through visualization.

#drop customer id - not needed for plotting (EDA)
df_eda <- subset(dataset, select = -c(customerID))

#setting the colours for visuals: https://r-charts.com/colors/
colours <- c("thistle", "thistle1", "#FFBBFF", "#8B7B8B")
colours_2<- c("#90EE90","#FFAEB9","#B0E2FF")

1. Data Distribution for each columns

#plot box plot for all numerical columns and pie chart for all categorical columns
data_plotting <- function(data){
  for (name_of_col in names(data)){
    column <- data[[name_of_col]]
    formatted_col_name = gsub("([a-z])([A-Z])", "\\1 \\2", name_of_col)
    
    if (is.numeric(column)){
     boxplot(column, 
            main = paste("Distribution of", formatted_col_name), 
            ylab = formatted_col_name, 
            col = "lightblue")
    }
    else if (is.character(column)){
      count_for_pie <- table(column)
      percentages <- round(100*count_for_pie/sum(count_for_pie), 2)
      labels <- paste(names(count_for_pie), "-", percentages, "%")
      
      pie(count_for_pie,
          labels = labels,
          main = paste("Distribution of", formatted_col_name),
          col = colours)
    }
  }
} 

#call function to plot
data_plotting(df_eda)

2. Demographics of Churn VS Non-Churn Customers

#gender
ggplot(df_eda, aes(x = gender, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Gender Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

#partner
ggplot(df_eda, aes(x = Partner, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Partner Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

#dependents
ggplot(df_eda, aes(x = Dependents, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Dependents Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

#senior citizen
ggplot(df_eda, aes(x = SeniorCitizen, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Senior Citizen Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

3. Telco Usage Patterns of Churn VS Non-Churn Customers

#get all the telco usage columns and churn col
telco_usage_cols <- c("PhoneService", "MultipleLines", "InternetService",
              "OnlineSecurity", "OnlineBackup", "DeviceProtection",
              "TechSupport", "StreamingTV", "StreamingMovies")

df_telco_usage <- subset(df_eda, select = c("Churn", telco_usage_cols))

for (col_name in names(df_telco_usage)) {
  column <- df_telco_usage[[col_name]]
  formatted_col_name = gsub("([a-z])([A-Z])", "\\1 \\2", col_name)
  
   plot <- ggplot(df_telco_usage, aes(x = column, fill = Churn)) +
          geom_bar(position = "fill") +
          labs(title = paste("Proportion of Churn by", formatted_col_name), 
               x = formatted_col_name, 
               y = "Proportion") + 
          scale_fill_manual(values = colours_2)
   
   print(plot)
}

4. Data Cleaning & Processing

This section is on data cleaning and data processing.

Data Cleaning - Imputation

Based on the Box Plot for Total Charges, it is found that the values in this column has a right skew distribution. Therefore, it is best to impute the missing values with Median instead of Mean.

#calc the median value of Total Charges column
median_val <- round(median(dataset$TotalCharges, na.rm = TRUE),2)
cat("The median value of Total Charges is ", median_val)

## The median value of Total Charges is  1397.47

Impute the missing rows with the Median value of Total Charges

#imputation on Total Charges column using Median
dataset$TotalCharges[is.na(dataset$TotalCharges)] <- median_val

#check the if there is still any missing values
cat("The sum of missing value in the dataset is: ", sum(is.na(dataset)))

## The sum of missing value in the dataset is:  0

Data Processing Take a look into the latest cleaned dataset

head_data <- head(dataset, n=5)
kable(head_data)

customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
7590-VHVEG	Female	No	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
5575-GNVDE	Male	No	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
3668-QPYBK	Male	No	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
7795-CFOCW	Male	No	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
9237-HQITU	Female	No	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

Drop Customer ID column as it is not needed for modelling.

model_dataset <- subset(dataset, select = -c(customerID))

To prepare dataset for classification data modelling and segmentation, the categorical columns needs to be label encoded for K-Nearest Neighbour (KNN) Model and K-Means Clustering.

#encode the categorical columns
dataset_for_encoding <- model_dataset

#factorized the churn column
dataset_for_encoding$Churn <- factor(dataset_for_encoding$Churn, levels = c("No", "Yes"))

#one-hot encode the categorical columns
dummy_model <- dummyVars(Churn ~ ., data = dataset_for_encoding)  # exclude Churn from encoding
encoded_data <- predict(dummy_model, newdata = dataset_for_encoding)

## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'Churn' is not a factor

encoded_data <- as.data.frame(encoded_data)

# add back the Churn column
final_encoded_data <- cbind(encoded_data, Churn = dataset_for_encoding$Churn)

# check the encoded data
str(final_encoded_data)

## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges                          : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn                                 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

head_data <- head(final_encoded_data, n=5)
kable(head_data)

genderFemale	genderMale	SeniorCitizenNo	PartnerNo	PartnerYes	DependentsNo	tenure	PhoneServiceNo	PhoneServiceYes	MultipleLinesNo	MultipleLinesNo phone service	InternetServiceDSL	InternetServiceFiber optic	OnlineSecurityNo	OnlineSecurityYes	OnlineBackupNo	OnlineBackupYes	DeviceProtectionNo	DeviceProtectionYes	TechSupportNo	TechSupportYes	StreamingTVNo	StreamingMoviesNo	ContractMonth-to-month	ContractOne year	PaperlessBillingNo	PaperlessBillingYes	PaymentMethodBank transfer (automatic)	PaymentMethodElectronic check	PaymentMethodMailed check	MonthlyCharges	TotalCharges	Churn
1	0	1	0	1	1	1	1	0	0	1	1	0	1	0	0	1	1	0	1	0	1	1	1	0	0	1	0	1	0	29.85	29.85	No
0	1	1	1	0	1	34	0	1	1	0	1	0	0	1	1	0	0	1	1	0	1	1	0	1	1	0	0	0	1	56.95	1889.50	No
0	1	1	1	0	1	2	0	1	1	0	1	0	0	1	0	1	1	0	1	0	1	1	1	0	0	1	0	0	1	53.85	108.15	Yes
0	1	1	1	0	1	45	1	0	0	1	1	0	0	1	1	0	0	1	0	1	1	1	0	1	1	0	1	0	0	42.30	1840.75	No
1	0	1	1	0	1	2	0	1	1	0	0	1	1	0	1	0	1	0	1	0	1	1	1	0	0	1	0	1	0	70.70	151.65	Yes

Whereas, for Logistic Regression and Decision Tree, we do not need to label encode the columns. We just need to convert the columns into factor.

#make a copy of dataset
factorized_dataset <- model_dataset

#convert character columns to factor
factorized_dataset[] <- lapply(factorized_dataset, function(x) {
  if (is.character(x)) as.factor(x) else x
})

#convert tenure into numeric
factorized_dataset$tenure <- as.numeric(factorized_dataset$tenure)

#check if cols are factor
str(factorized_dataset)

## 'data.frame':    7043 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Results/Data Modelling

This section displays the results and discussion for all the project questions.

Project Question 1: How do customer demographics and usage patterns differ between churned and non-churned customers?

Refer to EDA section and presentation video.

Project Question 2: Predicting whether a customer will churn or not (Yes/No)?

These classification models will be built to predict Churn and the best model will be determined:

Logistic Regression
Decision Tree
K-Nearest Neighbour (KNN)

Data Splitting for Logistic Regression and Decision Tree

For Logistic Regression and Decision Tree modelling, the factorized dataset is used.

#split into train and test 80:20 ratio
set.seed(166)
train_index <- createDataPartition(factorized_dataset$Churn, p = 0.8, list = FALSE)
train_data <- factorized_dataset[train_index, ]
test_data  <- factorized_dataset[-train_index, ]

Logistic Regression

#logistic Regression
logr_model <- glm(Churn ~ ., data = train_data, family = binomial)

#predict the Churn probabilities and convert to respective classes
logr_probs <- predict(logr_model, newdata = test_data, type = "response")
logr_preds <- ifelse(logr_probs > 0.5, "Yes", "No")
logr_preds <- factor(logr_preds, levels = c("No", "Yes"))

logr_true_value <- test_data$Churn

#save the confusion matrix table
conf_matrix_lr <- table(Predicted = logr_preds, Actual = logr_true_value)

Decision Tree

#decision Tree
dtree_model <- rpart(Churn ~ ., data = train_data, method = "class")
rpart.plot(dtree_model)

#predict classes for test set
dtree_preds <- predict(dtree_model, newdata = test_data, type = "class")

dtree_true_value <- test_data$Churn

#save the confusion matrix table
conf_matrix_dtree <- table(Predicted = dtree_preds, Actual = dtree_true_value)

Model Evaluation for Logistic Regression

confusionMatrix(logr_preds, test_data$Churn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  948 181
##        Yes  86 192
##                                           
##                Accuracy : 0.8102          
##                  95% CI : (0.7888, 0.8304)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 2.141e-11       
##                                           
##                   Kappa : 0.4698          
##                                           
##  Mcnemar's Test P-Value : 8.783e-09       
##                                           
##             Sensitivity : 0.9168          
##             Specificity : 0.5147          
##          Pos Pred Value : 0.8397          
##          Neg Pred Value : 0.6906          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6738          
##    Detection Prevalence : 0.8024          
##       Balanced Accuracy : 0.7158          
##                                           
##        'Positive' Class : No              
##

Model Evaluation for Decision Tree

confusionMatrix(dtree_preds, test_data$Churn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  970 224
##        Yes  64 149
##                                           
##                Accuracy : 0.7953          
##                  95% CI : (0.7733, 0.8161)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 8.066e-08       
##                                           
##                   Kappa : 0.3912          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9381          
##             Specificity : 0.3995          
##          Pos Pred Value : 0.8124          
##          Neg Pred Value : 0.6995          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6894          
##    Detection Prevalence : 0.8486          
##       Balanced Accuracy : 0.6688          
##                                           
##        'Positive' Class : No              
##

Data Scaling for K-Nearest Neighbour (KNN)

For K-Nearest Neighbour (KNN) modelling, the label encoded dataset is used.

#KNN is distance-based so we need to scale the numerical columns
cols_to_scale <- c("tenure", "MonthlyCharges", "TotalCharges")

preprocess_model <- preProcess(final_encoded_data[, cols_to_scale], method = c("center", "scale"))
scaled_values <- predict(preprocess_model, final_encoded_data[, cols_to_scale])

#replace original columns with scaled data
final_encoded_data[, cols_to_scale] <- scaled_values

#view the scaled dataset
str(final_encoded_data)

## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ TotalCharges                          : num  -0.994 -0.173 -0.96 -0.195 -0.94 ...
##  $ Churn                                 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Data Splitting for K-Nearest Neighbour (KNN)

set.seed(166)
train_index <- createDataPartition(final_encoded_data$Churn, p = 0.8, list = FALSE)
train_data_knn <- final_encoded_data[train_index, ]
test_data_knn  <- final_encoded_data[-train_index, ]

K-Nearest Neighbour (KNN)

For this modelling, the label encoded dataset is used.

set.seed(166)
knn_model <- train(Churn ~ ., data = train_data_knn,
                   method = "knn",
                   tuneLength = 10,             
                   trControl = trainControl(method = "cv", number = 5)) 

print(knn_model)

## k-Nearest Neighbors 
## 
## 5636 samples
##   46 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4509, 4508, 4509, 4509, 4509 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7659708  0.3895913
##    7  0.7712953  0.4054283
##    9  0.7785689  0.4229484
##   11  0.7835368  0.4346933
##   13  0.7831814  0.4339648
##   15  0.7837141  0.4351059
##   17  0.7853107  0.4410087
##   19  0.7881490  0.4469546
##   21  0.7892154  0.4506070
##   23  0.7874395  0.4466846
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 21.

#predict classes for test set
knn_preds <- predict(knn_model, newdata = test_data_knn)

knn_true_value <- test_data$Churn

#save the confusion matrix table
conf_matrix_knn <- table(Predicted = knn_preds, Actual = knn_true_value)

Model Evaluation for K-Nearest Neighbour (KNN)

confusionMatrix(knn_preds, test_data_knn$Churn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  904 166
##        Yes 130 207
##                                           
##                Accuracy : 0.7896          
##                  95% CI : (0.7674, 0.8107)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 1.127e-06       
##                                           
##                   Kappa : 0.4429          
##                                           
##  Mcnemar's Test P-Value : 0.04192         
##                                           
##             Sensitivity : 0.8743          
##             Specificity : 0.5550          
##          Pos Pred Value : 0.8449          
##          Neg Pred Value : 0.6142          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6425          
##    Detection Prevalence : 0.7605          
##       Balanced Accuracy : 0.7146          
##                                           
##        'Positive' Class : No              
##

The Best Chosen Model Interpretation

Let’s compare the metrics of churn prediction using Precision, Recall, and F1-Score.

#function to calculate metrics
calculate_metrics <- function(conf_matrix) {
  
  #extracting values from confusion matrix
  TP <- conf_matrix[1, 1]  # True Positive
  TN <- conf_matrix[2, 2]  # True Negative
  FP <- conf_matrix[2, 1]  # False Positive
  FN <- conf_matrix[1, 2]  # False Negative
  
  #calc Precision, Recall, and F1-Score
  Precision <- (TP / (TP + FP))*100
  Recall <- (TP / (TP + FN)) *100
  F1_Score <- 2 * ((Precision * Recall) / (Precision + Recall))
  
  #return metrics as a list
  return(list(Precision = Precision, Recall = Recall, F1_Score = F1_Score))
}

#metrics for each model
logreg_metrics <- calculate_metrics(conf_matrix_lr)
tree_metrics <- calculate_metrics(conf_matrix_dtree)
knn_metrics <- calculate_metrics(conf_matrix_knn)

#data frame with the calculated metrics
metrics_df <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "KNN"),
  Precision = c(logreg_metrics$Precision, tree_metrics$Precision, knn_metrics$Precision),
  Recall = c(logreg_metrics$Recall, tree_metrics$Recall, knn_metrics$Recall),
  F1_Score = c(logreg_metrics$F1_Score, tree_metrics$F1_Score, knn_metrics$F1_Score)
)

#view the metrics in a table 
kable(metrics_df, caption = "Model Metrics: Precision, Recall, and F1-Score")

Model Metrics: Precision, Recall, and F1-Score
Model	Precision	Recall	F1_Score
Logistic Regression	91.68279	83.96811	87.65603
Decision Tree	93.81044	81.23953	87.07361
KNN	87.42747	84.48598	85.93156

The best classification model is Logistic Regression. This is because:

Highest Accuracy of 81.02%
Highest F1-Score which balances both precision and recall.

Project Question 3: What are the customer segments based on their usage patterns and demographics?

K-Means Clustering is used to build the customer segmentation model and each segment’s characteristics will be interpreted.

Building the K-Means Segmentation Model

#make a copy of the previously encoded dataset for clustering that is label encoded and scaled
clustering_dataset <- final_encoded_data

#view the dataset first
str(clustering_dataset)

## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ TotalCharges                          : num  -0.994 -0.173 -0.96 -0.195 -0.94 ...
##  $ Churn                                 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

K-Means Clustering requires columns to be numeric, so let’s change Churn back into numeric.

#convert to numeric
clustering_dataset$Churn <- as.numeric(clustering_dataset$Churn)

#ensure all cols are numeric
str(clustering_dataset)

## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ TotalCharges                          : num  -0.994 -0.173 -0.96 -0.195 -0.94 ...
##  $ Churn                                 : num  1 1 2 1 2 2 1 1 2 1 ...

K-Means Clustering

#determine optimal K value using Elbow method
wss <- vector()
for (k in 1:10) {
  km <- kmeans(clustering_dataset, centers = k, nstart = 25)
  wss[k] <- km$tot.withinss
}

## Warning: did not converge in 10 iterations

plot(1:10, wss, type = "b",
     xlab = "Number of Clusters (K)",
     ylab = "Total Within-Cluster SS",
     main = "Elbow Method for K")

From the plot, it is found that, the optimal K value is 3 (the point where it is an Elbow)

#train the K-Means Clustering Model, use K=3
set.seed(166)
kmeans_model <- kmeans(clustering_dataset, centers = 3, nstart = 25)

#assign the segments back to dataset
clustering_dataset$Segment <- as.factor(kmeans_model$cluster)

It is found that, there are three different segments, 1, 2, and 3.

#view the sample data segments
head_clustering_data <- head(clustering_dataset, n=5)
kable(head_clustering_data)

genderFemale	genderMale	SeniorCitizenNo	PartnerNo	PartnerYes	DependentsNo	tenure	PhoneServiceNo	PhoneServiceYes	MultipleLinesNo	MultipleLinesNo phone service	InternetServiceDSL	InternetServiceFiber optic	OnlineSecurityNo	OnlineSecurityYes	OnlineBackupNo	OnlineBackupYes	DeviceProtectionNo	DeviceProtectionYes	TechSupportNo	TechSupportYes	StreamingTVNo	StreamingMoviesNo	ContractMonth-to-month	ContractOne year	PaperlessBillingNo	PaperlessBillingYes	PaymentMethodBank transfer (automatic)	PaymentMethodElectronic check	PaymentMethodMailed check	MonthlyCharges	TotalCharges	Churn	Segment
1	0	1	0	1	1	-1.2773539	1	0	0	1	1	0	1	0	0	1	1	0	1	0	1	1	1	0	0	1	0	1	0	-1.1602405	-0.9941713	1	3
0	1	1	1	0	1	0.0663227	0	1	1	0	1	0	0	1	1	0	0	1	1	0	1	1	0	1	1	0	0	0	1	-0.2596105	-0.1732318	1	3
0	1	1	1	0	1	-1.2366364	0	1	1	0	1	0	0	1	0	1	1	0	1	0	1	1	1	0	0	1	0	0	1	-0.3626346	-0.9596059	2	3
0	1	1	1	0	1	0.5142149	1	0	0	1	1	0	0	1	1	0	0	1	0	1	1	1	0	1	1	0	1	0	0	-0.7464825	-0.1947524	1	3
1	0	1	1	0	1	-1.2366364	0	1	1	0	0	1	1	0	1	0	1	0	1	0	1	1	1	0	0	1	0	1	0	0.1973512	-0.9404029	2	3

Let’s calculate the proportion of each segments

segment_summary <- clustering_dataset %>%
  group_by(Segment) %>%
  summarise(Count = n()) %>%
  mutate(Proportion = round((Count / sum(Count)) * 100, 2))

kable(segment_summary)

Segment	Count	Proportion
1	2330	33.08
2	1526	21.67
3	3187	45.25

K-Means Clusters Interpretation

The dataset is label encoded, so we need to append the Segment Columns back to the original dataset for interpretation.

#view the original dataset first
kable(head(clustering_dataset, n=5))

genderFemale	genderMale	SeniorCitizenNo	PartnerNo	PartnerYes	DependentsNo	tenure	PhoneServiceNo	PhoneServiceYes	MultipleLinesNo	MultipleLinesNo phone service	InternetServiceDSL	InternetServiceFiber optic	OnlineSecurityNo	OnlineSecurityYes	OnlineBackupNo	OnlineBackupYes	DeviceProtectionNo	DeviceProtectionYes	TechSupportNo	TechSupportYes	StreamingTVNo	StreamingMoviesNo	ContractMonth-to-month	ContractOne year	PaperlessBillingNo	PaperlessBillingYes	PaymentMethodBank transfer (automatic)	PaymentMethodElectronic check	PaymentMethodMailed check	MonthlyCharges	TotalCharges	Churn	Segment
1	0	1	0	1	1	-1.2773539	1	0	0	1	1	0	1	0	0	1	1	0	1	0	1	1	1	0	0	1	0	1	0	-1.1602405	-0.9941713	1	3
0	1	1	1	0	1	0.0663227	0	1	1	0	1	0	0	1	1	0	0	1	1	0	1	1	0	1	1	0	0	0	1	-0.2596105	-0.1732318	1	3
0	1	1	1	0	1	-1.2366364	0	1	1	0	1	0	0	1	0	1	1	0	1	0	1	1	1	0	0	1	0	0	1	-0.3626346	-0.9596059	2	3
0	1	1	1	0	1	0.5142149	1	0	0	1	1	0	0	1	1	0	0	1	0	1	1	1	0	1	1	0	1	0	0	-0.7464825	-0.1947524	1	3
1	0	1	1	0	1	-1.2366364	0	1	1	0	0	1	1	0	1	0	1	0	1	0	1	1	1	0	0	1	0	1	0	0.1973512	-0.9404029	2	3

#check if both dataset have the same no.of rows
nrow(dataset) == length(clustering_dataset$Segment)

## [1] TRUE

#append segments back to original dataset
dataset$Segment <- clustering_dataset$Segment

#view dataset
kable(head(dataset, n=5))

customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn	Segment
7590-VHVEG	Female	No	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No	3
5575-GNVDE	Male	No	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No	3
3668-QPYBK	Male	No	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes	3
7795-CFOCW	Male	No	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No	3
9237-HQITU	Female	No	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes	3

round_by_two_decimals <- function(value){
  round(value, 2)
}

segment_stats <- dataset%>% group_by(Segment) %>%
                summarise(
                  AvgTenureMonths = round(mean(tenure),0),
                  AvgMonthlyCharges = round_by_two_decimals(mean(MonthlyCharges)),
                  PercentageOfSeniorCitizen = round_by_two_decimals(mean(as.numeric(SeniorCitizen == "Yes")) * 100),
                  PercentageWithPartner = round_by_two_decimals(mean(as.numeric(Partner == "Yes")) * 100),
                  PercentageWithDependent = round_by_two_decimals(mean(as.numeric(Dependents == "Yes")) * 100),
                  PercentageWithPhoneService = round_by_two_decimals(mean(as.numeric(PhoneService == "Yes")) * 100),
                  PercentageWithInternet = round_by_two_decimals(mean(as.numeric(InternetService != "No")) * 100)
                )
kable(segment_stats)

Segment	AvgTenureMonths	AvgMonthlyCharges	PercentageOfSeniorCitizen	PercentageWithPartner	PercentageWithDependent	PercentageWithPhoneService	PercentageWithInternet
1	57	89.02	19.70	70.04	35.67	91.97	100
2	31	21.08	3.41	48.36	42.14	100.00	0
3	15	67.94	19.80	32.38	19.96	84.47	100

From the characteristics table above, we are able to identify three different segments. Let’s name them:

Segment 1: Loyal Premium Digital Customers
Segment 2: Low-Value Offline Young Customers
Segment 3: New & Tech-Savvy Singles

Visualise the Customer Segments

ggplot(dataset, aes(x = tenure, y = MonthlyCharges, color = Segment)) +
  geom_point(alpha = 0.6) +
  labs(title = "Customer Segments based on their Telco Usage",
       x = "Tenure (Months)", y = "Monthly Charges") +
  scale_color_manual(
    values = c("1" = "#1b9e77", "2" = "#d95f02", "3" = "#7570b3"),
    labels = c("Loyal Premium Digital", "Low-Value Offline Young", "New & Tech-Savvy Singles")) + theme_minimal()

#function to calculate Churn and Non-Churn percentage in dataset
calc_churn <- function(data) {
  data %>%
    count(Churn) %>%
    rename(Count = n) %>%
    mutate(Percentage = round(Count / sum(Count) * 100, 2))
}

Segment 1: Loyal Premium Digital Customers

Highest Average Tenure of 57 months (almost 5 years) –> Loyal Customers
Highest Average Monthly Charges on Telco Service of 89.02 –> Premium Customers
Everyone in this segment uses Internet Service –> Digital Customers
Highest Partner Percentage –> 70% of them are in relationship/have family

Look into Churn Proportion for this Segment

seg1_data <- dataset %>% filter(Segment == 1)
kable(calc_churn(seg1_data))

Churn	Count	Percentage
No	1991	85.45
Yes	339	14.55

Segment 2: Low-Value Offline Young Customers

Moderate Tenure of 31 months (almost 2.5 years)
Very low spenders of only 21.08 Average Monthly Charges –> Low-Value Customers
No one in this segment uses Internet, they only use phone service –> Offline Customers
There are only 3% of senior citizens in this segment –> Young Customers

Look into Churn Proportion for this Segment

seg2_data <- dataset %>% filter(Segment == 2)
kable(calc_churn(seg2_data))

Churn	Count	Percentage
No	1413	92.6
Yes	113	7.4

Segment 3: New & Tech-Savvy Singles

Lowest Average Tenure of 15 months –> New Customers
Mid to High Spenders with 67.94 Average Monthly Charges
All of them uses Internet Service –> Tech-Savvy Customers
Only 32% of them has partner –> Single Customers

Look into Churn Proportion for this Segment

seg3_data <- dataset %>% filter(Segment == 3)
kable(calc_churn(seg3_data))

Churn	Count	Percentage
No	1770	55.54
Yes	1417	44.46

Majority of Churn Customers are from which segment and what are their characterisitcs?

It is found that, the highest churn amount are customers from the New & Tech-Savvy Singles Segment whereas, the lowest churn amount are from the Loyal Premium Digital Customers Segment.

What are the retention strategies?

Loyal Premium Digital Customer Segment are high paying customers who stayed for a long time. They are important to retain.

Comfortable with reliable Telco Services, money is not an issue, they can pay.
Ideal to focus on promoting cross-selling and provide additional benefits for their loyalty.

Low-Value Offline Young Customer Segment are budget-friendly customers.

Attracted to affordable Telco Services.
Focus on promoting budget, value for money Telco Packages.

New & Tech-Savvy Singles Segment are new digital users.

Attracted to fast internet subscriptions.
Focus on promoting loyalty perks due to their low tenure history.

Conclusion

In conclusion, all the project objectives/questions of this study are achieved.

Churn Classification Model

Three models were trained (Logistic Regression, Decision Tree, KNN) and the best model is Logistic Regression.

Telco Customers Segmentation/Clustering

K-Means clustering is applied to segment customers based on their demographics and telco usage patterns. The clustering revealed three distinct customer segments:

Loyal Premium Digital Customers – long-tenure, high-paying, digitally engaged customers.
Low-Value Offline Young Customers – mid-tenure, low spenders with basic services.
New & Tech-Savvy Singles – short-tenure, internet-first users with churn risk.

By combining classification and clustering, this project achieved both predictive power and descriptive insight. This enables the Telco business to not only predict churn but also to understand and act on customer behavior which is, a crucial step in improving customer lifetime value and retention rate.

Telco Customer Churn Prediction and Segmentation

Sammi Chia Li (22107446)

Semester 2, March - June 2025

Introduction

Project Objectives

Dataset Description

Project Questions

Methodology

1. Importing R Libraries

2. Data Collection and Data Understanding

3. Exploratory Data Analysis (EDA)

1. Data Distribution for each columns

2. Demographics of Churn VS Non-Churn Customers

3. Telco Usage Patterns of Churn VS Non-Churn Customers

4. Data Cleaning & Processing

Results/Data Modelling

Project Question 1: How do customer demographics and usage patterns differ between churned and non-churned customers?

Project Question 2: Predicting whether a customer will churn or not (Yes/No)?

Data Splitting for Logistic Regression and Decision Tree

Logistic Regression

Decision Tree

Model Evaluation for Logistic Regression

Model Evaluation for Decision Tree

Data Scaling for K-Nearest Neighbour (KNN)

Data Splitting for K-Nearest Neighbour (KNN)

K-Nearest Neighbour (KNN)

Model Evaluation for K-Nearest Neighbour (KNN)

The Best Chosen Model Interpretation

Project Question 3: What are the customer segments based on their usage patterns and demographics?

Building the K-Means Segmentation Model

K-Means Clustering

K-Means Clusters Interpretation

Visualise the Customer Segments

Segment 1: Loyal Premium Digital Customers

Segment 2: Low-Value Offline Young Customers

Segment 3: New & Tech-Savvy Singles

Majority of Churn Customers are from which segment and what are their characterisitcs?

What are the retention strategies?

Conclusion