Customer churn is a major concern in the telecommunications industry. It is found that, retaining existing customers is often more cost-effective than acquiring new ones. Therefore, understanding why customers leave and identifying those at risk are crucial in building customer loyalty and improving business performance.
This project focuses on analyzing the IBM Telco Customer Churn to build predictive models that forecast churn and apply segmentation technique to group/segment customers with similar characteristics together for effective target marketing.
Question 1: How do customer demographics and usage patterns differ between churned and non-churned customers?
Question 2: Predicting whether a customer will churn or not (Yes/No).
Question 3: What are the customer segments based on their usage patterns and demographics?
This section displays the steps of data understanding, data collection, data overview, exploratory data analysis (EDA), data cleaning, and data processing.
Load the required libraries for R Programming.
#load libraries
library(knitr) #for kable() function to display output in table
library(ggplot2) #for plotting
library(caret)
## Loading required package: lattice
library(rpart)
library(rpart.plot)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Load the dataset and display the first 5 rows of the dataset
dataset <- read.csv("/Users/sammi/OneDrive/Documents/Telco-Customer-Churn-Data.csv")
head_data <- head(dataset, n=5)
kable(head_data)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
Data Understanding
cat("Number of columns:", ncol(dataset), "\nNumber of rows:", nrow(dataset))
## Number of columns: 21
## Number of rows: 7043
str(dataset)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
There are a total of 4 numerical columns and 17 categorical columns. But the ‘Senior Citizen’ column is supposed to be categorical so, let’s change 0 and 1 into No and Yes.
dataset$SeniorCitizen <- ifelse(dataset$SeniorCitizen == 1,
"Yes", "No")
Therefore, there are a total of 3 numerical columns and 18 categorical columns.
#total missing values
cat("The sum of missing value in the dataset is: ", sum(is.na(dataset)))
## The sum of missing value in the dataset is: 11
#check which column has missing values
missing_cols <- colSums(is.na(dataset))
missing_cols[missing_cols > 0]
## TotalCharges
## 11
Only ‘Total Charges’ column has missing values. There is 11 rows in this column that has missing values.
This section focuses on exploring the dataset through visualization.
#drop customer id - not needed for plotting (EDA)
df_eda <- subset(dataset, select = -c(customerID))
#setting the colours for visuals: https://r-charts.com/colors/
colours <- c("thistle", "thistle1", "#FFBBFF", "#8B7B8B")
colours_2<- c("#90EE90","#FFAEB9","#B0E2FF")
#plot box plot for all numerical columns and pie chart for all categorical columns
data_plotting <- function(data){
for (name_of_col in names(data)){
column <- data[[name_of_col]]
formatted_col_name = gsub("([a-z])([A-Z])", "\\1 \\2", name_of_col)
if (is.numeric(column)){
boxplot(column,
main = paste("Distribution of", formatted_col_name),
ylab = formatted_col_name,
col = "lightblue")
}
else if (is.character(column)){
count_for_pie <- table(column)
percentages <- round(100*count_for_pie/sum(count_for_pie), 2)
labels <- paste(names(count_for_pie), "-", percentages, "%")
pie(count_for_pie,
labels = labels,
main = paste("Distribution of", formatted_col_name),
col = colours)
}
}
}
#call function to plot
data_plotting(df_eda)
#gender
ggplot(df_eda, aes(x = gender, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Gender Proportion by Churn Status", y = "Proportion") +
scale_fill_manual(values = colours_2)
#partner
ggplot(df_eda, aes(x = Partner, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Partner Proportion by Churn Status", y = "Proportion") +
scale_fill_manual(values = colours_2)
#dependents
ggplot(df_eda, aes(x = Dependents, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Dependents Proportion by Churn Status", y = "Proportion") +
scale_fill_manual(values = colours_2)
#senior citizen
ggplot(df_eda, aes(x = SeniorCitizen, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Senior Citizen Proportion by Churn Status", y = "Proportion") +
scale_fill_manual(values = colours_2)
#get all the telco usage columns and churn col
telco_usage_cols <- c("PhoneService", "MultipleLines", "InternetService",
"OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies")
df_telco_usage <- subset(df_eda, select = c("Churn", telco_usage_cols))
for (col_name in names(df_telco_usage)) {
column <- df_telco_usage[[col_name]]
formatted_col_name = gsub("([a-z])([A-Z])", "\\1 \\2", col_name)
plot <- ggplot(df_telco_usage, aes(x = column, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = paste("Proportion of Churn by", formatted_col_name),
x = formatted_col_name,
y = "Proportion") +
scale_fill_manual(values = colours_2)
print(plot)
}
This section is on data cleaning and data processing.
Based on the Box Plot for Total Charges, it is found that the values in this column has a right skew distribution. Therefore, it is best to impute the missing values with Median instead of Mean.
#calc the median value of Total Charges column
median_val <- round(median(dataset$TotalCharges, na.rm = TRUE),2)
cat("The median value of Total Charges is ", median_val)
## The median value of Total Charges is 1397.47
Impute the missing rows with the Median value of Total Charges
#imputation on Total Charges column using Median
dataset$TotalCharges[is.na(dataset$TotalCharges)] <- median_val
#check the if there is still any missing values
cat("The sum of missing value in the dataset is: ", sum(is.na(dataset)))
## The sum of missing value in the dataset is: 0
head_data <- head(dataset, n=5)
kable(head_data)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7590-VHVEG | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 5575-GNVDE | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 3668-QPYBK | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 7795-CFOCW | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 9237-HQITU | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
Drop Customer ID column as it is not needed for modelling.
model_dataset <- subset(dataset, select = -c(customerID))
To prepare dataset for classification data modelling and segmentation, the categorical columns needs to be label encoded for K-Nearest Neighbour (KNN) Model and K-Means Clustering.
#encode the categorical columns
dataset_for_encoding <- model_dataset
#factorized the churn column
dataset_for_encoding$Churn <- factor(dataset_for_encoding$Churn, levels = c("No", "Yes"))
#one-hot encode the categorical columns
dummy_model <- dummyVars(Churn ~ ., data = dataset_for_encoding) # exclude Churn from encoding
encoded_data <- predict(dummy_model, newdata = dataset_for_encoding)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'Churn' is not a factor
encoded_data <- as.data.frame(encoded_data)
# add back the Churn column
final_encoded_data <- cbind(encoded_data, Churn = dataset_for_encoding$Churn)
# check the encoded data
str(final_encoded_data)
## 'data.frame': 7043 obs. of 47 variables:
## $ genderFemale : num 1 0 0 0 1 1 0 1 1 0 ...
## $ genderMale : num 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizenNo : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SeniorCitizenYes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PartnerNo : num 0 1 1 1 1 1 1 1 0 1 ...
## $ PartnerYes : num 1 0 0 0 0 0 0 0 1 0 ...
## $ DependentsNo : num 1 1 1 1 1 1 0 1 1 0 ...
## $ DependentsYes : num 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneServiceNo : num 1 0 0 1 0 0 0 1 0 0 ...
## $ PhoneServiceYes : num 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLinesNo : num 0 1 1 0 1 0 0 0 0 1 ...
## $ MultipleLinesNo phone service : num 1 0 0 1 0 0 0 1 0 0 ...
## $ MultipleLinesYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetServiceDSL : num 1 1 1 1 0 0 0 1 0 1 ...
## $ InternetServiceFiber optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetServiceNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityNo : num 1 0 0 0 1 1 1 0 1 0 ...
## $ OnlineSecurityNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityYes : num 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackupNo : num 0 1 0 1 1 1 0 1 1 0 ...
## $ OnlineBackupNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineBackupYes : num 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtectionNo : num 1 0 1 0 1 0 1 1 0 1 ...
## $ DeviceProtectionNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DeviceProtectionYes : num 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupportNo : num 1 1 1 0 1 1 1 1 0 1 ...
## $ TechSupportNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TechSupportYes : num 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTVNo : num 1 1 1 1 1 0 0 1 0 1 ...
## $ StreamingTVNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingTVYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMoviesNo : num 1 1 1 1 1 0 1 1 0 1 ...
## $ StreamingMoviesNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingMoviesYes : num 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractMonth-to-month : num 1 0 1 0 1 1 1 1 1 0 ...
## $ ContractOne year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBillingNo : num 0 1 0 1 0 0 0 1 0 1 ...
## $ PaperlessBillingYes : num 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodBank transfer (automatic): num 0 0 0 1 0 0 0 0 0 1 ...
## $ PaymentMethodCredit card (automatic) : num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed check : num 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
head_data <- head(final_encoded_data, n=5)
kable(head_data)
| genderFemale | genderMale | SeniorCitizenNo | SeniorCitizenYes | PartnerNo | PartnerYes | DependentsNo | DependentsYes | tenure | PhoneServiceNo | PhoneServiceYes | MultipleLinesNo | MultipleLinesNo phone service | MultipleLinesYes | InternetServiceDSL | InternetServiceFiber optic | InternetServiceNo | OnlineSecurityNo | OnlineSecurityNo internet service | OnlineSecurityYes | OnlineBackupNo | OnlineBackupNo internet service | OnlineBackupYes | DeviceProtectionNo | DeviceProtectionNo internet service | DeviceProtectionYes | TechSupportNo | TechSupportNo internet service | TechSupportYes | StreamingTVNo | StreamingTVNo internet service | StreamingTVYes | StreamingMoviesNo | StreamingMoviesNo internet service | StreamingMoviesYes | ContractMonth-to-month | ContractOne year | ContractTwo year | PaperlessBillingNo | PaperlessBillingYes | PaymentMethodBank transfer (automatic) | PaymentMethodCredit card (automatic) | PaymentMethodElectronic check | PaymentMethodMailed check | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 29.85 | 29.85 | No |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 34 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 56.95 | 1889.50 | No |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 53.85 | 108.15 | Yes |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 45 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 42.30 | 1840.75 | No |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 70.70 | 151.65 | Yes |
Whereas, for Logistic Regression and Decision Tree, we do not need to label encode the columns. We just need to convert the columns into factor.
#make a copy of dataset
factorized_dataset <- model_dataset
#convert character columns to factor
factorized_dataset[] <- lapply(factorized_dataset, function(x) {
if (is.character(x)) as.factor(x) else x
})
#convert tenure into numeric
factorized_dataset$tenure <- as.numeric(factorized_dataset$tenure)
#check if cols are factor
str(factorized_dataset)
## 'data.frame': 7043 obs. of 20 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
This section displays the results and discussion for all the project questions.
Refer to EDA section and presentation video.
These classification models will be built to predict Churn and the best model will be determined:
For Logistic Regression and Decision Tree modelling, the factorized dataset is used.
#split into train and test 80:20 ratio
set.seed(166)
train_index <- createDataPartition(factorized_dataset$Churn, p = 0.8, list = FALSE)
train_data <- factorized_dataset[train_index, ]
test_data <- factorized_dataset[-train_index, ]
#logistic Regression
logr_model <- glm(Churn ~ ., data = train_data, family = binomial)
#predict the Churn probabilities and convert to respective classes
logr_probs <- predict(logr_model, newdata = test_data, type = "response")
logr_preds <- ifelse(logr_probs > 0.5, "Yes", "No")
logr_preds <- factor(logr_preds, levels = c("No", "Yes"))
logr_true_value <- test_data$Churn
#save the confusion matrix table
conf_matrix_lr <- table(Predicted = logr_preds, Actual = logr_true_value)
#decision Tree
dtree_model <- rpart(Churn ~ ., data = train_data, method = "class")
rpart.plot(dtree_model)
#predict classes for test set
dtree_preds <- predict(dtree_model, newdata = test_data, type = "class")
dtree_true_value <- test_data$Churn
#save the confusion matrix table
conf_matrix_dtree <- table(Predicted = dtree_preds, Actual = dtree_true_value)
confusionMatrix(logr_preds, test_data$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 948 181
## Yes 86 192
##
## Accuracy : 0.8102
## 95% CI : (0.7888, 0.8304)
## No Information Rate : 0.7349
## P-Value [Acc > NIR] : 2.141e-11
##
## Kappa : 0.4698
##
## Mcnemar's Test P-Value : 8.783e-09
##
## Sensitivity : 0.9168
## Specificity : 0.5147
## Pos Pred Value : 0.8397
## Neg Pred Value : 0.6906
## Prevalence : 0.7349
## Detection Rate : 0.6738
## Detection Prevalence : 0.8024
## Balanced Accuracy : 0.7158
##
## 'Positive' Class : No
##
confusionMatrix(dtree_preds, test_data$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 970 224
## Yes 64 149
##
## Accuracy : 0.7953
## 95% CI : (0.7733, 0.8161)
## No Information Rate : 0.7349
## P-Value [Acc > NIR] : 8.066e-08
##
## Kappa : 0.3912
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9381
## Specificity : 0.3995
## Pos Pred Value : 0.8124
## Neg Pred Value : 0.6995
## Prevalence : 0.7349
## Detection Rate : 0.6894
## Detection Prevalence : 0.8486
## Balanced Accuracy : 0.6688
##
## 'Positive' Class : No
##
For K-Nearest Neighbour (KNN) modelling, the label encoded dataset is used.
#KNN is distance-based so we need to scale the numerical columns
cols_to_scale <- c("tenure", "MonthlyCharges", "TotalCharges")
preprocess_model <- preProcess(final_encoded_data[, cols_to_scale], method = c("center", "scale"))
scaled_values <- predict(preprocess_model, final_encoded_data[, cols_to_scale])
#replace original columns with scaled data
final_encoded_data[, cols_to_scale] <- scaled_values
#view the scaled dataset
str(final_encoded_data)
## 'data.frame': 7043 obs. of 47 variables:
## $ genderFemale : num 1 0 0 0 1 1 0 1 1 0 ...
## $ genderMale : num 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizenNo : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SeniorCitizenYes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PartnerNo : num 0 1 1 1 1 1 1 1 0 1 ...
## $ PartnerYes : num 1 0 0 0 0 0 0 0 1 0 ...
## $ DependentsNo : num 1 1 1 1 1 1 0 1 1 0 ...
## $ DependentsYes : num 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
## $ PhoneServiceNo : num 1 0 0 1 0 0 0 1 0 0 ...
## $ PhoneServiceYes : num 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLinesNo : num 0 1 1 0 1 0 0 0 0 1 ...
## $ MultipleLinesNo phone service : num 1 0 0 1 0 0 0 1 0 0 ...
## $ MultipleLinesYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetServiceDSL : num 1 1 1 1 0 0 0 1 0 1 ...
## $ InternetServiceFiber optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetServiceNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityNo : num 1 0 0 0 1 1 1 0 1 0 ...
## $ OnlineSecurityNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityYes : num 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackupNo : num 0 1 0 1 1 1 0 1 1 0 ...
## $ OnlineBackupNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineBackupYes : num 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtectionNo : num 1 0 1 0 1 0 1 1 0 1 ...
## $ DeviceProtectionNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DeviceProtectionYes : num 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupportNo : num 1 1 1 0 1 1 1 1 0 1 ...
## $ TechSupportNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TechSupportYes : num 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTVNo : num 1 1 1 1 1 0 0 1 0 1 ...
## $ StreamingTVNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingTVYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMoviesNo : num 1 1 1 1 1 0 1 1 0 1 ...
## $ StreamingMoviesNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingMoviesYes : num 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractMonth-to-month : num 1 0 1 0 1 1 1 1 1 0 ...
## $ ContractOne year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBillingNo : num 0 1 0 1 0 0 0 1 0 1 ...
## $ PaperlessBillingYes : num 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodBank transfer (automatic): num 0 0 0 1 0 0 0 0 0 1 ...
## $ PaymentMethodCredit card (automatic) : num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed check : num 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num -1.16 -0.26 -0.363 -0.746 0.197 ...
## $ TotalCharges : num -0.994 -0.173 -0.96 -0.195 -0.94 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
set.seed(166)
train_index <- createDataPartition(final_encoded_data$Churn, p = 0.8, list = FALSE)
train_data_knn <- final_encoded_data[train_index, ]
test_data_knn <- final_encoded_data[-train_index, ]
For this modelling, the label encoded dataset is used.
set.seed(166)
knn_model <- train(Churn ~ ., data = train_data_knn,
method = "knn",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 5))
print(knn_model)
## k-Nearest Neighbors
##
## 5636 samples
## 46 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4509, 4508, 4509, 4509, 4509
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7659708 0.3895913
## 7 0.7712953 0.4054283
## 9 0.7785689 0.4229484
## 11 0.7835368 0.4346933
## 13 0.7831814 0.4339648
## 15 0.7837141 0.4351059
## 17 0.7853107 0.4410087
## 19 0.7881490 0.4469546
## 21 0.7892154 0.4506070
## 23 0.7874395 0.4466846
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 21.
#predict classes for test set
knn_preds <- predict(knn_model, newdata = test_data_knn)
knn_true_value <- test_data$Churn
#save the confusion matrix table
conf_matrix_knn <- table(Predicted = knn_preds, Actual = knn_true_value)
confusionMatrix(knn_preds, test_data_knn$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 904 166
## Yes 130 207
##
## Accuracy : 0.7896
## 95% CI : (0.7674, 0.8107)
## No Information Rate : 0.7349
## P-Value [Acc > NIR] : 1.127e-06
##
## Kappa : 0.4429
##
## Mcnemar's Test P-Value : 0.04192
##
## Sensitivity : 0.8743
## Specificity : 0.5550
## Pos Pred Value : 0.8449
## Neg Pred Value : 0.6142
## Prevalence : 0.7349
## Detection Rate : 0.6425
## Detection Prevalence : 0.7605
## Balanced Accuracy : 0.7146
##
## 'Positive' Class : No
##
Let’s compare the metrics of churn prediction using Precision, Recall, and F1-Score.
#function to calculate metrics
calculate_metrics <- function(conf_matrix) {
#extracting values from confusion matrix
TP <- conf_matrix[1, 1] # True Positive
TN <- conf_matrix[2, 2] # True Negative
FP <- conf_matrix[2, 1] # False Positive
FN <- conf_matrix[1, 2] # False Negative
#calc Precision, Recall, and F1-Score
Precision <- (TP / (TP + FP))*100
Recall <- (TP / (TP + FN)) *100
F1_Score <- 2 * ((Precision * Recall) / (Precision + Recall))
#return metrics as a list
return(list(Precision = Precision, Recall = Recall, F1_Score = F1_Score))
}
#metrics for each model
logreg_metrics <- calculate_metrics(conf_matrix_lr)
tree_metrics <- calculate_metrics(conf_matrix_dtree)
knn_metrics <- calculate_metrics(conf_matrix_knn)
#data frame with the calculated metrics
metrics_df <- data.frame(
Model = c("Logistic Regression", "Decision Tree", "KNN"),
Precision = c(logreg_metrics$Precision, tree_metrics$Precision, knn_metrics$Precision),
Recall = c(logreg_metrics$Recall, tree_metrics$Recall, knn_metrics$Recall),
F1_Score = c(logreg_metrics$F1_Score, tree_metrics$F1_Score, knn_metrics$F1_Score)
)
#view the metrics in a table
kable(metrics_df, caption = "Model Metrics: Precision, Recall, and F1-Score")
| Model | Precision | Recall | F1_Score |
|---|---|---|---|
| Logistic Regression | 91.68279 | 83.96811 | 87.65603 |
| Decision Tree | 93.81044 | 81.23953 | 87.07361 |
| KNN | 87.42747 | 84.48598 | 85.93156 |
The best classification model is Logistic Regression. This is because:
K-Means Clustering is used to build the customer segmentation model and each segment’s characteristics will be interpreted.
#make a copy of the previously encoded dataset for clustering that is label encoded and scaled
clustering_dataset <- final_encoded_data
#view the dataset first
str(clustering_dataset)
## 'data.frame': 7043 obs. of 47 variables:
## $ genderFemale : num 1 0 0 0 1 1 0 1 1 0 ...
## $ genderMale : num 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizenNo : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SeniorCitizenYes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PartnerNo : num 0 1 1 1 1 1 1 1 0 1 ...
## $ PartnerYes : num 1 0 0 0 0 0 0 0 1 0 ...
## $ DependentsNo : num 1 1 1 1 1 1 0 1 1 0 ...
## $ DependentsYes : num 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
## $ PhoneServiceNo : num 1 0 0 1 0 0 0 1 0 0 ...
## $ PhoneServiceYes : num 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLinesNo : num 0 1 1 0 1 0 0 0 0 1 ...
## $ MultipleLinesNo phone service : num 1 0 0 1 0 0 0 1 0 0 ...
## $ MultipleLinesYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetServiceDSL : num 1 1 1 1 0 0 0 1 0 1 ...
## $ InternetServiceFiber optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetServiceNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityNo : num 1 0 0 0 1 1 1 0 1 0 ...
## $ OnlineSecurityNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityYes : num 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackupNo : num 0 1 0 1 1 1 0 1 1 0 ...
## $ OnlineBackupNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineBackupYes : num 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtectionNo : num 1 0 1 0 1 0 1 1 0 1 ...
## $ DeviceProtectionNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DeviceProtectionYes : num 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupportNo : num 1 1 1 0 1 1 1 1 0 1 ...
## $ TechSupportNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TechSupportYes : num 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTVNo : num 1 1 1 1 1 0 0 1 0 1 ...
## $ StreamingTVNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingTVYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMoviesNo : num 1 1 1 1 1 0 1 1 0 1 ...
## $ StreamingMoviesNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingMoviesYes : num 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractMonth-to-month : num 1 0 1 0 1 1 1 1 1 0 ...
## $ ContractOne year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBillingNo : num 0 1 0 1 0 0 0 1 0 1 ...
## $ PaperlessBillingYes : num 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodBank transfer (automatic): num 0 0 0 1 0 0 0 0 0 1 ...
## $ PaymentMethodCredit card (automatic) : num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed check : num 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num -1.16 -0.26 -0.363 -0.746 0.197 ...
## $ TotalCharges : num -0.994 -0.173 -0.96 -0.195 -0.94 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
K-Means Clustering requires columns to be numeric, so let’s change Churn back into numeric.
#convert to numeric
clustering_dataset$Churn <- as.numeric(clustering_dataset$Churn)
#ensure all cols are numeric
str(clustering_dataset)
## 'data.frame': 7043 obs. of 47 variables:
## $ genderFemale : num 1 0 0 0 1 1 0 1 1 0 ...
## $ genderMale : num 0 1 1 1 0 0 1 0 0 1 ...
## $ SeniorCitizenNo : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SeniorCitizenYes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PartnerNo : num 0 1 1 1 1 1 1 1 0 1 ...
## $ PartnerYes : num 1 0 0 0 0 0 0 0 1 0 ...
## $ DependentsNo : num 1 1 1 1 1 1 0 1 1 0 ...
## $ DependentsYes : num 0 0 0 0 0 0 1 0 0 1 ...
## $ tenure : num -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
## $ PhoneServiceNo : num 1 0 0 1 0 0 0 1 0 0 ...
## $ PhoneServiceYes : num 0 1 1 0 1 1 1 0 1 1 ...
## $ MultipleLinesNo : num 0 1 1 0 1 0 0 0 0 1 ...
## $ MultipleLinesNo phone service : num 1 0 0 1 0 0 0 1 0 0 ...
## $ MultipleLinesYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetServiceDSL : num 1 1 1 1 0 0 0 1 0 1 ...
## $ InternetServiceFiber optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetServiceNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityNo : num 1 0 0 0 1 1 1 0 1 0 ...
## $ OnlineSecurityNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurityYes : num 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackupNo : num 0 1 0 1 1 1 0 1 1 0 ...
## $ OnlineBackupNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineBackupYes : num 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtectionNo : num 1 0 1 0 1 0 1 1 0 1 ...
## $ DeviceProtectionNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DeviceProtectionYes : num 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupportNo : num 1 1 1 0 1 1 1 1 0 1 ...
## $ TechSupportNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TechSupportYes : num 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTVNo : num 1 1 1 1 1 0 0 1 0 1 ...
## $ StreamingTVNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingTVYes : num 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMoviesNo : num 1 1 1 1 1 0 1 1 0 1 ...
## $ StreamingMoviesNo internet service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StreamingMoviesYes : num 0 0 0 0 0 1 0 0 1 0 ...
## $ ContractMonth-to-month : num 1 0 1 0 1 1 1 1 1 0 ...
## $ ContractOne year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ ContractTwo year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBillingNo : num 0 1 0 1 0 0 0 1 0 1 ...
## $ PaperlessBillingYes : num 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethodBank transfer (automatic): num 0 0 0 1 0 0 0 0 0 1 ...
## $ PaymentMethodCredit card (automatic) : num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethodElectronic check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethodMailed check : num 0 1 1 0 0 0 0 1 0 0 ...
## $ MonthlyCharges : num -1.16 -0.26 -0.363 -0.746 0.197 ...
## $ TotalCharges : num -0.994 -0.173 -0.96 -0.195 -0.94 ...
## $ Churn : num 1 1 2 1 2 2 1 1 2 1 ...
#determine optimal K value using Elbow method
wss <- vector()
for (k in 1:10) {
km <- kmeans(clustering_dataset, centers = k, nstart = 25)
wss[k] <- km$tot.withinss
}
## Warning: did not converge in 10 iterations
plot(1:10, wss, type = "b",
xlab = "Number of Clusters (K)",
ylab = "Total Within-Cluster SS",
main = "Elbow Method for K")
From the plot, it is found that, the optimal K value is 3 (the point where it is an Elbow)
#train the K-Means Clustering Model, use K=3
set.seed(166)
kmeans_model <- kmeans(clustering_dataset, centers = 3, nstart = 25)
#assign the segments back to dataset
clustering_dataset$Segment <- as.factor(kmeans_model$cluster)
It is found that, there are three different segments, 1, 2, and 3.
#view the sample data segments
head_clustering_data <- head(clustering_dataset, n=5)
kable(head_clustering_data)
| genderFemale | genderMale | SeniorCitizenNo | SeniorCitizenYes | PartnerNo | PartnerYes | DependentsNo | DependentsYes | tenure | PhoneServiceNo | PhoneServiceYes | MultipleLinesNo | MultipleLinesNo phone service | MultipleLinesYes | InternetServiceDSL | InternetServiceFiber optic | InternetServiceNo | OnlineSecurityNo | OnlineSecurityNo internet service | OnlineSecurityYes | OnlineBackupNo | OnlineBackupNo internet service | OnlineBackupYes | DeviceProtectionNo | DeviceProtectionNo internet service | DeviceProtectionYes | TechSupportNo | TechSupportNo internet service | TechSupportYes | StreamingTVNo | StreamingTVNo internet service | StreamingTVYes | StreamingMoviesNo | StreamingMoviesNo internet service | StreamingMoviesYes | ContractMonth-to-month | ContractOne year | ContractTwo year | PaperlessBillingNo | PaperlessBillingYes | PaymentMethodBank transfer (automatic) | PaymentMethodCredit card (automatic) | PaymentMethodElectronic check | PaymentMethodMailed check | MonthlyCharges | TotalCharges | Churn | Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | -1.2773539 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | -1.1602405 | -0.9941713 | 1 | 3 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0.0663227 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | -0.2596105 | -0.1732318 | 1 | 3 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | -1.2366364 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | -0.3626346 | -0.9596059 | 2 | 3 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0.5142149 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | -0.7464825 | -0.1947524 | 1 | 3 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | -1.2366364 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.1973512 | -0.9404029 | 2 | 3 |
Let’s calculate the proportion of each segments
segment_summary <- clustering_dataset %>%
group_by(Segment) %>%
summarise(Count = n()) %>%
mutate(Proportion = round((Count / sum(Count)) * 100, 2))
kable(segment_summary)
| Segment | Count | Proportion |
|---|---|---|
| 1 | 2330 | 33.08 |
| 2 | 1526 | 21.67 |
| 3 | 3187 | 45.25 |
The dataset is label encoded, so we need to append the Segment Columns back to the original dataset for interpretation.
#view the original dataset first
kable(head(clustering_dataset, n=5))
| genderFemale | genderMale | SeniorCitizenNo | SeniorCitizenYes | PartnerNo | PartnerYes | DependentsNo | DependentsYes | tenure | PhoneServiceNo | PhoneServiceYes | MultipleLinesNo | MultipleLinesNo phone service | MultipleLinesYes | InternetServiceDSL | InternetServiceFiber optic | InternetServiceNo | OnlineSecurityNo | OnlineSecurityNo internet service | OnlineSecurityYes | OnlineBackupNo | OnlineBackupNo internet service | OnlineBackupYes | DeviceProtectionNo | DeviceProtectionNo internet service | DeviceProtectionYes | TechSupportNo | TechSupportNo internet service | TechSupportYes | StreamingTVNo | StreamingTVNo internet service | StreamingTVYes | StreamingMoviesNo | StreamingMoviesNo internet service | StreamingMoviesYes | ContractMonth-to-month | ContractOne year | ContractTwo year | PaperlessBillingNo | PaperlessBillingYes | PaymentMethodBank transfer (automatic) | PaymentMethodCredit card (automatic) | PaymentMethodElectronic check | PaymentMethodMailed check | MonthlyCharges | TotalCharges | Churn | Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | -1.2773539 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | -1.1602405 | -0.9941713 | 1 | 3 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0.0663227 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | -0.2596105 | -0.1732318 | 1 | 3 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | -1.2366364 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | -0.3626346 | -0.9596059 | 2 | 3 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0.5142149 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | -0.7464825 | -0.1947524 | 1 | 3 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | -1.2366364 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.1973512 | -0.9404029 | 2 | 3 |
#check if both dataset have the same no.of rows
nrow(dataset) == length(clustering_dataset$Segment)
## [1] TRUE
#append segments back to original dataset
dataset$Segment <- clustering_dataset$Segment
#view dataset
kable(head(dataset, n=5))
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7590-VHVEG | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 3 |
| 5575-GNVDE | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No | 3 |
| 3668-QPYBK | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 3 |
| 7795-CFOCW | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | 3 |
| 9237-HQITU | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | 3 |
round_by_two_decimals <- function(value){
round(value, 2)
}
segment_stats <- dataset%>% group_by(Segment) %>%
summarise(
AvgTenureMonths = round(mean(tenure),0),
AvgMonthlyCharges = round_by_two_decimals(mean(MonthlyCharges)),
PercentageOfSeniorCitizen = round_by_two_decimals(mean(as.numeric(SeniorCitizen == "Yes")) * 100),
PercentageWithPartner = round_by_two_decimals(mean(as.numeric(Partner == "Yes")) * 100),
PercentageWithDependent = round_by_two_decimals(mean(as.numeric(Dependents == "Yes")) * 100),
PercentageWithPhoneService = round_by_two_decimals(mean(as.numeric(PhoneService == "Yes")) * 100),
PercentageWithInternet = round_by_two_decimals(mean(as.numeric(InternetService != "No")) * 100)
)
kable(segment_stats)
| Segment | AvgTenureMonths | AvgMonthlyCharges | PercentageOfSeniorCitizen | PercentageWithPartner | PercentageWithDependent | PercentageWithPhoneService | PercentageWithInternet |
|---|---|---|---|---|---|---|---|
| 1 | 57 | 89.02 | 19.70 | 70.04 | 35.67 | 91.97 | 100 |
| 2 | 31 | 21.08 | 3.41 | 48.36 | 42.14 | 100.00 | 0 |
| 3 | 15 | 67.94 | 19.80 | 32.38 | 19.96 | 84.47 | 100 |
From the characteristics table above, we are able to identify three different segments. Let’s name them:
ggplot(dataset, aes(x = tenure, y = MonthlyCharges, color = Segment)) +
geom_point(alpha = 0.6) +
labs(title = "Customer Segments based on their Telco Usage",
x = "Tenure (Months)", y = "Monthly Charges") +
scale_color_manual(
values = c("1" = "#1b9e77", "2" = "#d95f02", "3" = "#7570b3"),
labels = c("Loyal Premium Digital", "Low-Value Offline Young", "New & Tech-Savvy Singles")) + theme_minimal()
#function to calculate Churn and Non-Churn percentage in dataset
calc_churn <- function(data) {
data %>%
count(Churn) %>%
rename(Count = n) %>%
mutate(Percentage = round(Count / sum(Count) * 100, 2))
}
Look into Churn Proportion for this Segment
seg2_data <- dataset %>% filter(Segment == 2)
kable(calc_churn(seg2_data))
| Churn | Count | Percentage |
|---|---|---|
| No | 1413 | 92.6 |
| Yes | 113 | 7.4 |
Look into Churn Proportion for this Segment
seg3_data <- dataset %>% filter(Segment == 3)
kable(calc_churn(seg3_data))
| Churn | Count | Percentage |
|---|---|---|
| No | 1770 | 55.54 |
| Yes | 1417 | 44.46 |
It is found that, the highest churn amount are customers from the New & Tech-Savvy Singles Segment whereas, the lowest churn amount are from the Loyal Premium Digital Customers Segment.
Loyal Premium Digital Customer Segment are high paying customers who stayed for a long time. They are important to retain.
Low-Value Offline Young Customer Segment are budget-friendly customers.
New & Tech-Savvy Singles Segment are new digital users.
In conclusion, all the project objectives/questions of this study are achieved.
Churn Classification Model
Telco Customers Segmentation/Clustering
K-Means clustering is applied to segment customers based on their demographics and telco usage patterns. The clustering revealed three distinct customer segments:
By combining classification and clustering, this project achieved both predictive power and descriptive insight. This enables the Telco business to not only predict churn but also to understand and act on customer behavior which is, a crucial step in improving customer lifetime value and retention rate.