Introduction

Customer churn is a major concern in the telecommunications industry. It is found that, retaining existing customers is often more cost-effective than acquiring new ones. Therefore, understanding why customers leave and identifying those at risk are crucial in building customer loyalty and improving business performance.

This project focuses on analyzing the IBM Telco Customer Churn to build predictive models that forecast churn and apply segmentation technique to group/segment customers with similar characteristics together for effective target marketing.

Project Objectives

  1. Perform Exploratory Data Analysis (EDA) to understand customer demographics and their Telco service usage.
  2. Build and evaluate machine learning models (Logistic Regression, Decision Tree, KNN) to predict churn accurately.
  3. Apply K-Means clustering to segment customers based on demographics and usage behavior.
  4. Interpret and visualize segment characteristics to support marketing and retention strategies.

Dataset Description

Project Questions

Question 1: How do customer demographics and usage patterns differ between churned and non-churned customers?

Question 2: Predicting whether a customer will churn or not (Yes/No).

Question 3: What are the customer segments based on their usage patterns and demographics?

Methodology

This section displays the steps of data understanding, data collection, data overview, exploratory data analysis (EDA), data cleaning, and data processing.

1. Importing R Libraries

Load the required libraries for R Programming.

#load libraries
library(knitr) #for kable() function to display output in table
library(ggplot2) #for plotting

library(caret)
## Loading required package: lattice
library(rpart)
library(rpart.plot)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2. Data Collection and Data Understanding

Load the dataset and display the first 5 rows of the dataset

dataset <- read.csv("/Users/sammi/OneDrive/Documents/Telco-Customer-Churn-Data.csv")
head_data <- head(dataset, n=5)
kable(head_data)
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

Data Understanding

  1. Number of columns and rows:
cat("Number of columns:", ncol(dataset), "\nNumber of rows:", nrow(dataset))
## Number of columns: 21 
## Number of rows: 7043
  1. Dataset Structure
str(dataset)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

There are a total of 4 numerical columns and 17 categorical columns. But the ‘Senior Citizen’ column is supposed to be categorical so, let’s change 0 and 1 into No and Yes.

dataset$SeniorCitizen <- ifelse(dataset$SeniorCitizen == 1,
                                "Yes", "No")

Therefore, there are a total of 3 numerical columns and 18 categorical columns.

  1. Count of Missing values
#total missing values
cat("The sum of missing value in the dataset is: ", sum(is.na(dataset)))
## The sum of missing value in the dataset is:  11
#check which column has  missing values
missing_cols <- colSums(is.na(dataset))
missing_cols[missing_cols > 0]
## TotalCharges 
##           11

Only ‘Total Charges’ column has missing values. There is 11 rows in this column that has missing values.

3. Exploratory Data Analysis (EDA)

This section focuses on exploring the dataset through visualization.

#drop customer id - not needed for plotting (EDA)
df_eda <- subset(dataset, select = -c(customerID))

#setting the colours for visuals: https://r-charts.com/colors/
colours <- c("thistle", "thistle1", "#FFBBFF", "#8B7B8B")
colours_2<- c("#90EE90","#FFAEB9","#B0E2FF")

1. Data Distribution for each columns

#plot box plot for all numerical columns and pie chart for all categorical columns
data_plotting <- function(data){
  for (name_of_col in names(data)){
    column <- data[[name_of_col]]
    formatted_col_name = gsub("([a-z])([A-Z])", "\\1 \\2", name_of_col)
    
    if (is.numeric(column)){
     boxplot(column, 
            main = paste("Distribution of", formatted_col_name), 
            ylab = formatted_col_name, 
            col = "lightblue")
    }
    else if (is.character(column)){
      count_for_pie <- table(column)
      percentages <- round(100*count_for_pie/sum(count_for_pie), 2)
      labels <- paste(names(count_for_pie), "-", percentages, "%")
      
      pie(count_for_pie,
          labels = labels,
          main = paste("Distribution of", formatted_col_name),
          col = colours)
    }
  }
} 

#call function to plot
data_plotting(df_eda)

2. Demographics of Churn VS Non-Churn Customers

#gender
ggplot(df_eda, aes(x = gender, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Gender Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

#partner
ggplot(df_eda, aes(x = Partner, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Partner Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

#dependents
ggplot(df_eda, aes(x = Dependents, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Dependents Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

#senior citizen
ggplot(df_eda, aes(x = SeniorCitizen, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Senior Citizen Proportion by Churn Status", y = "Proportion") + 
  scale_fill_manual(values = colours_2)

3. Telco Usage Patterns of Churn VS Non-Churn Customers

#get all the telco usage columns and churn col
telco_usage_cols <- c("PhoneService", "MultipleLines", "InternetService",
              "OnlineSecurity", "OnlineBackup", "DeviceProtection",
              "TechSupport", "StreamingTV", "StreamingMovies")

df_telco_usage <- subset(df_eda, select = c("Churn", telco_usage_cols))

for (col_name in names(df_telco_usage)) {
  column <- df_telco_usage[[col_name]]
  formatted_col_name = gsub("([a-z])([A-Z])", "\\1 \\2", col_name)
  
   plot <- ggplot(df_telco_usage, aes(x = column, fill = Churn)) +
          geom_bar(position = "fill") +
          labs(title = paste("Proportion of Churn by", formatted_col_name), 
               x = formatted_col_name, 
               y = "Proportion") + 
          scale_fill_manual(values = colours_2)
   
   print(plot)
}

4. Data Cleaning & Processing

This section is on data cleaning and data processing.

  1. Data Cleaning - Imputation

Based on the Box Plot for Total Charges, it is found that the values in this column has a right skew distribution. Therefore, it is best to impute the missing values with Median instead of Mean.

#calc the median value of Total Charges column
median_val <- round(median(dataset$TotalCharges, na.rm = TRUE),2)
cat("The median value of Total Charges is ", median_val)
## The median value of Total Charges is  1397.47

Impute the missing rows with the Median value of Total Charges

#imputation on Total Charges column using Median
dataset$TotalCharges[is.na(dataset$TotalCharges)] <- median_val

#check the if there is still any missing values
cat("The sum of missing value in the dataset is: ", sum(is.na(dataset)))
## The sum of missing value in the dataset is:  0
  1. Data Processing Take a look into the latest cleaned dataset
head_data <- head(dataset, n=5)
kable(head_data)
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7590-VHVEG Female No Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
5575-GNVDE Male No No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
3668-QPYBK Male No No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
7795-CFOCW Male No No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
9237-HQITU Female No No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

Drop Customer ID column as it is not needed for modelling.

model_dataset <- subset(dataset, select = -c(customerID))

To prepare dataset for classification data modelling and segmentation, the categorical columns needs to be label encoded for K-Nearest Neighbour (KNN) Model and K-Means Clustering.

#encode the categorical columns
dataset_for_encoding <- model_dataset

#factorized the churn column
dataset_for_encoding$Churn <- factor(dataset_for_encoding$Churn, levels = c("No", "Yes"))

#one-hot encode the categorical columns
dummy_model <- dummyVars(Churn ~ ., data = dataset_for_encoding)  # exclude Churn from encoding
encoded_data <- predict(dummy_model, newdata = dataset_for_encoding)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'Churn' is not a factor
encoded_data <- as.data.frame(encoded_data)

# add back the Churn column
final_encoded_data <- cbind(encoded_data, Churn = dataset_for_encoding$Churn)

# check the encoded data
str(final_encoded_data)
## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges                          : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn                                 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
head_data <- head(final_encoded_data, n=5)
kable(head_data)
genderFemale genderMale SeniorCitizenNo SeniorCitizenYes PartnerNo PartnerYes DependentsNo DependentsYes tenure PhoneServiceNo PhoneServiceYes MultipleLinesNo MultipleLinesNo phone service MultipleLinesYes InternetServiceDSL InternetServiceFiber optic InternetServiceNo OnlineSecurityNo OnlineSecurityNo internet service OnlineSecurityYes OnlineBackupNo OnlineBackupNo internet service OnlineBackupYes DeviceProtectionNo DeviceProtectionNo internet service DeviceProtectionYes TechSupportNo TechSupportNo internet service TechSupportYes StreamingTVNo StreamingTVNo internet service StreamingTVYes StreamingMoviesNo StreamingMoviesNo internet service StreamingMoviesYes ContractMonth-to-month ContractOne year ContractTwo year PaperlessBillingNo PaperlessBillingYes PaymentMethodBank transfer (automatic) PaymentMethodCredit card (automatic) PaymentMethodElectronic check PaymentMethodMailed check MonthlyCharges TotalCharges Churn
1 0 1 0 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 29.85 29.85 No
0 1 1 0 1 0 1 0 34 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 56.95 1889.50 No
0 1 1 0 1 0 1 0 2 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 53.85 108.15 Yes
0 1 1 0 1 0 1 0 45 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 42.30 1840.75 No
1 0 1 0 1 0 1 0 2 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 70.70 151.65 Yes

Whereas, for Logistic Regression and Decision Tree, we do not need to label encode the columns. We just need to convert the columns into factor.

#make a copy of dataset
factorized_dataset <- model_dataset

#convert character columns to factor
factorized_dataset[] <- lapply(factorized_dataset, function(x) {
  if (is.character(x)) as.factor(x) else x
})

#convert tenure into numeric
factorized_dataset$tenure <- as.numeric(factorized_dataset$tenure)

#check if cols are factor
str(factorized_dataset)
## 'data.frame':    7043 obs. of  20 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Results/Data Modelling

This section displays the results and discussion for all the project questions.

Project Question 1: How do customer demographics and usage patterns differ between churned and non-churned customers?

Refer to EDA section and presentation video.

Project Question 2: Predicting whether a customer will churn or not (Yes/No)?

These classification models will be built to predict Churn and the best model will be determined:

  • Logistic Regression
  • Decision Tree
  • K-Nearest Neighbour (KNN)

Data Splitting for Logistic Regression and Decision Tree

For Logistic Regression and Decision Tree modelling, the factorized dataset is used.

#split into train and test 80:20 ratio
set.seed(166)
train_index <- createDataPartition(factorized_dataset$Churn, p = 0.8, list = FALSE)
train_data <- factorized_dataset[train_index, ]
test_data  <- factorized_dataset[-train_index, ]

Logistic Regression

#logistic Regression
logr_model <- glm(Churn ~ ., data = train_data, family = binomial)

#predict the Churn probabilities and convert to respective classes
logr_probs <- predict(logr_model, newdata = test_data, type = "response")
logr_preds <- ifelse(logr_probs > 0.5, "Yes", "No")
logr_preds <- factor(logr_preds, levels = c("No", "Yes"))

logr_true_value <- test_data$Churn

#save the confusion matrix table
conf_matrix_lr <- table(Predicted = logr_preds, Actual = logr_true_value)

Decision Tree

#decision Tree
dtree_model <- rpart(Churn ~ ., data = train_data, method = "class")
rpart.plot(dtree_model)

#predict classes for test set
dtree_preds <- predict(dtree_model, newdata = test_data, type = "class")

dtree_true_value <- test_data$Churn

#save the confusion matrix table
conf_matrix_dtree <- table(Predicted = dtree_preds, Actual = dtree_true_value)

Model Evaluation for Logistic Regression

confusionMatrix(logr_preds, test_data$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  948 181
##        Yes  86 192
##                                           
##                Accuracy : 0.8102          
##                  95% CI : (0.7888, 0.8304)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 2.141e-11       
##                                           
##                   Kappa : 0.4698          
##                                           
##  Mcnemar's Test P-Value : 8.783e-09       
##                                           
##             Sensitivity : 0.9168          
##             Specificity : 0.5147          
##          Pos Pred Value : 0.8397          
##          Neg Pred Value : 0.6906          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6738          
##    Detection Prevalence : 0.8024          
##       Balanced Accuracy : 0.7158          
##                                           
##        'Positive' Class : No              
## 

Model Evaluation for Decision Tree

confusionMatrix(dtree_preds, test_data$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  970 224
##        Yes  64 149
##                                           
##                Accuracy : 0.7953          
##                  95% CI : (0.7733, 0.8161)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 8.066e-08       
##                                           
##                   Kappa : 0.3912          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9381          
##             Specificity : 0.3995          
##          Pos Pred Value : 0.8124          
##          Neg Pred Value : 0.6995          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6894          
##    Detection Prevalence : 0.8486          
##       Balanced Accuracy : 0.6688          
##                                           
##        'Positive' Class : No              
## 

Data Scaling for K-Nearest Neighbour (KNN)

For K-Nearest Neighbour (KNN) modelling, the label encoded dataset is used.

#KNN is distance-based so we need to scale the numerical columns
cols_to_scale <- c("tenure", "MonthlyCharges", "TotalCharges")

preprocess_model <- preProcess(final_encoded_data[, cols_to_scale], method = c("center", "scale"))
scaled_values <- predict(preprocess_model, final_encoded_data[, cols_to_scale])

#replace original columns with scaled data
final_encoded_data[, cols_to_scale] <- scaled_values

#view the scaled dataset
str(final_encoded_data)
## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ TotalCharges                          : num  -0.994 -0.173 -0.96 -0.195 -0.94 ...
##  $ Churn                                 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Data Splitting for K-Nearest Neighbour (KNN)

set.seed(166)
train_index <- createDataPartition(final_encoded_data$Churn, p = 0.8, list = FALSE)
train_data_knn <- final_encoded_data[train_index, ]
test_data_knn  <- final_encoded_data[-train_index, ]

K-Nearest Neighbour (KNN)

For this modelling, the label encoded dataset is used.

set.seed(166)
knn_model <- train(Churn ~ ., data = train_data_knn,
                   method = "knn",
                   tuneLength = 10,             
                   trControl = trainControl(method = "cv", number = 5)) 

print(knn_model)
## k-Nearest Neighbors 
## 
## 5636 samples
##   46 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4509, 4508, 4509, 4509, 4509 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7659708  0.3895913
##    7  0.7712953  0.4054283
##    9  0.7785689  0.4229484
##   11  0.7835368  0.4346933
##   13  0.7831814  0.4339648
##   15  0.7837141  0.4351059
##   17  0.7853107  0.4410087
##   19  0.7881490  0.4469546
##   21  0.7892154  0.4506070
##   23  0.7874395  0.4466846
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 21.
#predict classes for test set
knn_preds <- predict(knn_model, newdata = test_data_knn)

knn_true_value <- test_data$Churn

#save the confusion matrix table
conf_matrix_knn <- table(Predicted = knn_preds, Actual = knn_true_value)

Model Evaluation for K-Nearest Neighbour (KNN)

confusionMatrix(knn_preds, test_data_knn$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  904 166
##        Yes 130 207
##                                           
##                Accuracy : 0.7896          
##                  95% CI : (0.7674, 0.8107)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 1.127e-06       
##                                           
##                   Kappa : 0.4429          
##                                           
##  Mcnemar's Test P-Value : 0.04192         
##                                           
##             Sensitivity : 0.8743          
##             Specificity : 0.5550          
##          Pos Pred Value : 0.8449          
##          Neg Pred Value : 0.6142          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6425          
##    Detection Prevalence : 0.7605          
##       Balanced Accuracy : 0.7146          
##                                           
##        'Positive' Class : No              
## 

The Best Chosen Model Interpretation

Let’s compare the metrics of churn prediction using Precision, Recall, and F1-Score.

#function to calculate metrics
calculate_metrics <- function(conf_matrix) {
  
  #extracting values from confusion matrix
  TP <- conf_matrix[1, 1]  # True Positive
  TN <- conf_matrix[2, 2]  # True Negative
  FP <- conf_matrix[2, 1]  # False Positive
  FN <- conf_matrix[1, 2]  # False Negative
  
  #calc Precision, Recall, and F1-Score
  Precision <- (TP / (TP + FP))*100
  Recall <- (TP / (TP + FN)) *100
  F1_Score <- 2 * ((Precision * Recall) / (Precision + Recall))
  
  #return metrics as a list
  return(list(Precision = Precision, Recall = Recall, F1_Score = F1_Score))
}

#metrics for each model
logreg_metrics <- calculate_metrics(conf_matrix_lr)
tree_metrics <- calculate_metrics(conf_matrix_dtree)
knn_metrics <- calculate_metrics(conf_matrix_knn)

#data frame with the calculated metrics
metrics_df <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "KNN"),
  Precision = c(logreg_metrics$Precision, tree_metrics$Precision, knn_metrics$Precision),
  Recall = c(logreg_metrics$Recall, tree_metrics$Recall, knn_metrics$Recall),
  F1_Score = c(logreg_metrics$F1_Score, tree_metrics$F1_Score, knn_metrics$F1_Score)
)

#view the metrics in a table 
kable(metrics_df, caption = "Model Metrics: Precision, Recall, and F1-Score")
Model Metrics: Precision, Recall, and F1-Score
Model Precision Recall F1_Score
Logistic Regression 91.68279 83.96811 87.65603
Decision Tree 93.81044 81.23953 87.07361
KNN 87.42747 84.48598 85.93156

The best classification model is Logistic Regression. This is because:

  • Highest Accuracy of 81.02%
  • Highest F1-Score which balances both precision and recall.

Project Question 3: What are the customer segments based on their usage patterns and demographics?

K-Means Clustering is used to build the customer segmentation model and each segment’s characteristics will be interpreted.

Building the K-Means Segmentation Model

#make a copy of the previously encoded dataset for clustering that is label encoded and scaled
clustering_dataset <- final_encoded_data

#view the dataset first
str(clustering_dataset)
## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ TotalCharges                          : num  -0.994 -0.173 -0.96 -0.195 -0.94 ...
##  $ Churn                                 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

K-Means Clustering requires columns to be numeric, so let’s change Churn back into numeric.

#convert to numeric
clustering_dataset$Churn <- as.numeric(clustering_dataset$Churn)

#ensure all cols are numeric
str(clustering_dataset)
## 'data.frame':    7043 obs. of  47 variables:
##  $ genderFemale                          : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                            : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizenNo                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SeniorCitizenYes                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerNo                             : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ PartnerYes                            : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsNo                          : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ DependentsYes                         : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneServiceNo                        : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ PhoneServiceYes                       : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo                       : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ MultipleLinesNo phone service         : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceDSL                    : num  1 1 1 1 0 0 0 1 0 1 ...
##  $ InternetServiceFiber optic            : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo                      : num  1 0 0 0 1 1 1 0 1 0 ...
##  $ OnlineSecurityNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                     : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo                        : num  0 1 0 1 1 1 0 1 1 0 ...
##  $ OnlineBackupNo internet service       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                       : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo                    : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DeviceProtectionNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                   : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo                         : num  1 1 1 0 1 1 1 1 0 1 ...
##  $ TechSupportNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                        : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo                         : num  1 1 1 1 1 0 0 1 0 1 ...
##  $ StreamingTVNo internet service        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                        : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo                     : num  1 1 1 1 1 0 1 1 0 1 ...
##  $ StreamingMoviesNo internet service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                    : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth-to-month                : num  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne year                      : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingNo                    : num  0 1 0 1 0 0 0 1 0 1 ...
##  $ PaperlessBillingYes                   : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank transfer (automatic): num  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit card (automatic)  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check         : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check             : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ TotalCharges                          : num  -0.994 -0.173 -0.96 -0.195 -0.94 ...
##  $ Churn                                 : num  1 1 2 1 2 2 1 1 2 1 ...

K-Means Clustering

#determine optimal K value using Elbow method
wss <- vector()
for (k in 1:10) {
  km <- kmeans(clustering_dataset, centers = k, nstart = 25)
  wss[k] <- km$tot.withinss
}
## Warning: did not converge in 10 iterations
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters (K)",
     ylab = "Total Within-Cluster SS",
     main = "Elbow Method for K")

From the plot, it is found that, the optimal K value is 3 (the point where it is an Elbow)

#train the K-Means Clustering Model, use K=3
set.seed(166)
kmeans_model <- kmeans(clustering_dataset, centers = 3, nstart = 25)

#assign the segments back to dataset
clustering_dataset$Segment <- as.factor(kmeans_model$cluster)

It is found that, there are three different segments, 1, 2, and 3.

#view the sample data segments
head_clustering_data <- head(clustering_dataset, n=5)
kable(head_clustering_data)
genderFemale genderMale SeniorCitizenNo SeniorCitizenYes PartnerNo PartnerYes DependentsNo DependentsYes tenure PhoneServiceNo PhoneServiceYes MultipleLinesNo MultipleLinesNo phone service MultipleLinesYes InternetServiceDSL InternetServiceFiber optic InternetServiceNo OnlineSecurityNo OnlineSecurityNo internet service OnlineSecurityYes OnlineBackupNo OnlineBackupNo internet service OnlineBackupYes DeviceProtectionNo DeviceProtectionNo internet service DeviceProtectionYes TechSupportNo TechSupportNo internet service TechSupportYes StreamingTVNo StreamingTVNo internet service StreamingTVYes StreamingMoviesNo StreamingMoviesNo internet service StreamingMoviesYes ContractMonth-to-month ContractOne year ContractTwo year PaperlessBillingNo PaperlessBillingYes PaymentMethodBank transfer (automatic) PaymentMethodCredit card (automatic) PaymentMethodElectronic check PaymentMethodMailed check MonthlyCharges TotalCharges Churn Segment
1 0 1 0 0 1 1 0 -1.2773539 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 -1.1602405 -0.9941713 1 3
0 1 1 0 1 0 1 0 0.0663227 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 -0.2596105 -0.1732318 1 3
0 1 1 0 1 0 1 0 -1.2366364 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 -0.3626346 -0.9596059 2 3
0 1 1 0 1 0 1 0 0.5142149 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 -0.7464825 -0.1947524 1 3
1 0 1 0 1 0 1 0 -1.2366364 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0.1973512 -0.9404029 2 3

Let’s calculate the proportion of each segments

segment_summary <- clustering_dataset %>%
  group_by(Segment) %>%
  summarise(Count = n()) %>%
  mutate(Proportion = round((Count / sum(Count)) * 100, 2))

kable(segment_summary)
Segment Count Proportion
1 2330 33.08
2 1526 21.67
3 3187 45.25

K-Means Clusters Interpretation

The dataset is label encoded, so we need to append the Segment Columns back to the original dataset for interpretation.

#view the original dataset first
kable(head(clustering_dataset, n=5))
genderFemale genderMale SeniorCitizenNo SeniorCitizenYes PartnerNo PartnerYes DependentsNo DependentsYes tenure PhoneServiceNo PhoneServiceYes MultipleLinesNo MultipleLinesNo phone service MultipleLinesYes InternetServiceDSL InternetServiceFiber optic InternetServiceNo OnlineSecurityNo OnlineSecurityNo internet service OnlineSecurityYes OnlineBackupNo OnlineBackupNo internet service OnlineBackupYes DeviceProtectionNo DeviceProtectionNo internet service DeviceProtectionYes TechSupportNo TechSupportNo internet service TechSupportYes StreamingTVNo StreamingTVNo internet service StreamingTVYes StreamingMoviesNo StreamingMoviesNo internet service StreamingMoviesYes ContractMonth-to-month ContractOne year ContractTwo year PaperlessBillingNo PaperlessBillingYes PaymentMethodBank transfer (automatic) PaymentMethodCredit card (automatic) PaymentMethodElectronic check PaymentMethodMailed check MonthlyCharges TotalCharges Churn Segment
1 0 1 0 0 1 1 0 -1.2773539 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 -1.1602405 -0.9941713 1 3
0 1 1 0 1 0 1 0 0.0663227 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 -0.2596105 -0.1732318 1 3
0 1 1 0 1 0 1 0 -1.2366364 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 -0.3626346 -0.9596059 2 3
0 1 1 0 1 0 1 0 0.5142149 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 -0.7464825 -0.1947524 1 3
1 0 1 0 1 0 1 0 -1.2366364 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0.1973512 -0.9404029 2 3
#check if both dataset have the same no.of rows
nrow(dataset) == length(clustering_dataset$Segment)
## [1] TRUE
#append segments back to original dataset
dataset$Segment <- clustering_dataset$Segment

#view dataset
kable(head(dataset, n=5))
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn Segment
7590-VHVEG Female No Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No 3
5575-GNVDE Male No No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No 3
3668-QPYBK Male No No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes 3
7795-CFOCW Male No No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No 3
9237-HQITU Female No No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes 3
round_by_two_decimals <- function(value){
  round(value, 2)
}

segment_stats <- dataset%>% group_by(Segment) %>%
                summarise(
                  AvgTenureMonths = round(mean(tenure),0),
                  AvgMonthlyCharges = round_by_two_decimals(mean(MonthlyCharges)),
                  PercentageOfSeniorCitizen = round_by_two_decimals(mean(as.numeric(SeniorCitizen == "Yes")) * 100),
                  PercentageWithPartner = round_by_two_decimals(mean(as.numeric(Partner == "Yes")) * 100),
                  PercentageWithDependent = round_by_two_decimals(mean(as.numeric(Dependents == "Yes")) * 100),
                  PercentageWithPhoneService = round_by_two_decimals(mean(as.numeric(PhoneService == "Yes")) * 100),
                  PercentageWithInternet = round_by_two_decimals(mean(as.numeric(InternetService != "No")) * 100)
                )
kable(segment_stats)
Segment AvgTenureMonths AvgMonthlyCharges PercentageOfSeniorCitizen PercentageWithPartner PercentageWithDependent PercentageWithPhoneService PercentageWithInternet
1 57 89.02 19.70 70.04 35.67 91.97 100
2 31 21.08 3.41 48.36 42.14 100.00 0
3 15 67.94 19.80 32.38 19.96 84.47 100

From the characteristics table above, we are able to identify three different segments. Let’s name them:

  • Segment 1: Loyal Premium Digital Customers
  • Segment 2: Low-Value Offline Young Customers
  • Segment 3: New & Tech-Savvy Singles

Visualise the Customer Segments

ggplot(dataset, aes(x = tenure, y = MonthlyCharges, color = Segment)) +
  geom_point(alpha = 0.6) +
  labs(title = "Customer Segments based on their Telco Usage",
       x = "Tenure (Months)", y = "Monthly Charges") +
  scale_color_manual(
    values = c("1" = "#1b9e77", "2" = "#d95f02", "3" = "#7570b3"),
    labels = c("Loyal Premium Digital", "Low-Value Offline Young", "New & Tech-Savvy Singles")) + theme_minimal()

#function to calculate Churn and Non-Churn percentage in dataset
calc_churn <- function(data) {
  data %>%
    count(Churn) %>%
    rename(Count = n) %>%
    mutate(Percentage = round(Count / sum(Count) * 100, 2))
}
Segment 1: Loyal Premium Digital Customers
  • Highest Average Tenure of 57 months (almost 5 years) –> Loyal Customers
  • Highest Average Monthly Charges on Telco Service of 89.02 –> Premium Customers
  • Everyone in this segment uses Internet Service –> Digital Customers
  • Highest Partner Percentage –> 70% of them are in relationship/have family

Look into Churn Proportion for this Segment

seg1_data <- dataset %>% filter(Segment == 1)
kable(calc_churn(seg1_data))
Churn Count Percentage
No 1991 85.45
Yes 339 14.55
Segment 2: Low-Value Offline Young Customers
  • Moderate Tenure of 31 months (almost 2.5 years)
  • Very low spenders of only 21.08 Average Monthly Charges –> Low-Value Customers
  • No one in this segment uses Internet, they only use phone service –> Offline Customers
  • There are only 3% of senior citizens in this segment –> Young Customers

Look into Churn Proportion for this Segment

seg2_data <- dataset %>% filter(Segment == 2)
kable(calc_churn(seg2_data))
Churn Count Percentage
No 1413 92.6
Yes 113 7.4
Segment 3: New & Tech-Savvy Singles
  • Lowest Average Tenure of 15 months –> New Customers
  • Mid to High Spenders with 67.94 Average Monthly Charges
  • All of them uses Internet Service –> Tech-Savvy Customers
  • Only 32% of them has partner –> Single Customers

Look into Churn Proportion for this Segment

seg3_data <- dataset %>% filter(Segment == 3)
kable(calc_churn(seg3_data))
Churn Count Percentage
No 1770 55.54
Yes 1417 44.46

Majority of Churn Customers are from which segment and what are their characterisitcs?

It is found that, the highest churn amount are customers from the New & Tech-Savvy Singles Segment whereas, the lowest churn amount are from the Loyal Premium Digital Customers Segment.

What are the retention strategies?

Loyal Premium Digital Customer Segment are high paying customers who stayed for a long time. They are important to retain.

  • Comfortable with reliable Telco Services, money is not an issue, they can pay.
  • Ideal to focus on promoting cross-selling and provide additional benefits for their loyalty.

Low-Value Offline Young Customer Segment are budget-friendly customers.

  • Attracted to affordable Telco Services.
  • Focus on promoting budget, value for money Telco Packages.

New & Tech-Savvy Singles Segment are new digital users.

  • Attracted to fast internet subscriptions.
  • Focus on promoting loyalty perks due to their low tenure history.

Conclusion

In conclusion, all the project objectives/questions of this study are achieved.

Churn Classification Model

Telco Customers Segmentation/Clustering

K-Means clustering is applied to segment customers based on their demographics and telco usage patterns. The clustering revealed three distinct customer segments:

  1. Loyal Premium Digital Customers – long-tenure, high-paying, digitally engaged customers.
  2. Low-Value Offline Young Customers – mid-tenure, low spenders with basic services.
  3. New & Tech-Savvy Singles – short-tenure, internet-first users with churn risk.

By combining classification and clustering, this project achieved both predictive power and descriptive insight. This enables the Telco business to not only predict churn but also to understand and act on customer behavior which is, a crucial step in improving customer lifetime value and retention rate.