This study project was made as part of Business Anlaytics course, 2020. The analysis is conducted on the Telco customer churn: IBM dataset.
Clients from this cluster leave mainly because of monthly charges and Internet service that are also interconnected. Long term contracts and the usage of paperless billing can persuade them to stay. Two strategies suggested to prevent the churn of these clients:
Offer bonuses to clients, who use paperless billing. For example, additional service options such as “Streaming Movies” or “Tech Support” that are not used frequently.
Introduce free trials for several services for one- and two-years contracts. It will engage clients’ to use more services in the future and, consequently stay longer in the service.
Clients from this group are subscribed to many services and pay highest monthly charges. That is why it is quite important to keep them in the company. Two strategies suggested to prevent the churn of these clients:
Extend the length of the contracts, for example introduce three-year contract with the same monthly charges for this type of clients.
Explore more the user experience of paying by electronic check. It might appear that customers churn because of the difficulties in the process of payment this way.
df <- read_csv("~/WA_Fn-UseC_-Telco-Customer-Churn.csv")
#computation of churn rate
ch1 <- df %>%
group_by(Churn) %>%
summarise(n = n()) %>%
mutate(rate = paste0(round(100 * n/sum(n), 0), "%")) %>%
arrange(-n)
#resulting table
library(kableExtra)
ch1 %>%
kbl() %>%
kable_material(c("striped", "hover"))
Churn | n | rate |
---|---|---|
No | 5174 | 73% |
Yes | 1869 | 27% |
Overall churn rate is 27%, i.e. almost one third of clients leave The row “Yes” identifies the total number of clients churn, the column “No” identifies clients, who are still staying with the company. It is seen that 1869 clients had churn.
There is no high difference in churn among two genders, while in the case of type of Internet service the highest churn rate of show clients of “Fiber Optics” Internet service.
#churn by gender
p1 <- ggplot(data = df, aes(x = gender))+
geom_bar(aes(fill = Churn),
position = "dodge",
alpha = 0.6) +
scale_fill_manual(values = c(
"No" = "#14b542",
"Yes" = "#b50e05")) +
theme(panel.background = element_blank(),
plot.title = element_text(color = "#045e2c")) +
labs(title = "Clients churn by gender",
x = "Gender",
y = "Clients")
#churn by type of service
p2 <- ggplot(data = df, aes(x = InternetService))+
geom_bar(aes(fill = Churn),
position = "dodge",
alpha = 0.6) +
scale_fill_manual(values = c(
"No" = "#14b542",
"Yes" = "#b50e05")) +
theme(panel.background = element_blank(),
plot.title = element_text(color = "#045e2c")) +
labs(title = "Clients churn by Internet service",
x = "Internet service",
y = "Clients")
ggpubr::ggarrange(p1, p2,
ncol = 2, nrow = 2)
Clients who pay charges around $65- $100 per month show have high churn rate, which is a sign that the clients who bring higher profits to the company have tendency to leave.
Critical point, when clients are more likely to leave: first 1-1,5 years in the company’s service.
Clients, who choose month-to-month type of contract have highers risks to churn.
#Monthly charges
p3 <- ggplot(data = df)+
geom_histogram(aes(x = MonthlyCharges, fill = Churn), alpha = 0.6) +
scale_fill_manual(values = c(
"No" = "#14b542",
"Yes" = "#b50e05")) +
theme(panel.background = element_blank(),
plot.title = element_text(color = "#045e2c")) +
labs(title = "Distribution of Monthly Charges",
x = "Monthly Charges",
y = " ")
#Tenure - time with the company
p4 <- ggplot(data = df)+
geom_histogram(aes(x = tenure, fill = Churn), alpha = 0.6)+
scale_fill_manual(values = c(
"No" = "#14b542",
"Yes" = "#b50e05")) +
theme(panel.background = element_blank(),
plot.title = element_text(color = "#045e2c"))+
labs(title = "Distribution of tenure",
x = "Tenure",
y = " ")
ggpubr::ggarrange(p3, p4,
ncol = 2, nrow = 2)
For the further analysis I created the variable “Service score”, which indicates number of services, which client uses. Due to the reason that some variables were measured in different scale, all of them were standardized.
df1 <- select(df, c("gender","SeniorCitizen", "tenure", "PhoneService", "MultipleLines", "InternetService",
"OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
"StreamingMovies", "Contract", "PaymentMethod","PaperlessBilling", "MonthlyCharges", "Churn"))
df1$PhoneService <- ifelse(df1$PhoneService == "Yes", 1, 0) #PhoneService recode
df1$PhoneService <- as.numeric(df1$PhoneService)
df1$MultipleLines <- ifelse(df1$MultipleLines == "Yes", 1, 0) #Multiple lines recode
df1$MultipleLines <- as.numeric(df1$MultipleLines)
df1$OnlineSecurity <- ifelse(df1$OnlineSecurity == "Yes", 1, 0) #Online Security recode
df1$OnlineSecurity <- as.numeric(df1$OnlineSecurity)
df1$OnlineBackup <- ifelse(df1$OnlineBackup == "Yes", 1, 0) #Online Backup recode
df1$OnlineBackup <- as.numeric(df1$OnlineBackup)
df1$DeviceProtection <- ifelse(df1$DeviceProtection == "Yes", 1, 0) #Device protection recode
df1$DeviceProtection <- as.numeric(df1$DeviceProtection)
df1$TechSupport <- ifelse(df1$TechSupport == "Yes", 1, 0) #Tech Support recode
df1$TechSupport <- as.numeric(df1$TechSupport)
df1$StreamingTV <- ifelse(df1$StreamingTV == "Yes", 1, 0) #Streaming tv recode
df1$StreamingTV <- as.numeric(df1$StreamingTV)
df1$StreamingMovies <- ifelse(df1$StreamingMovies == "Yes", 1, 0) #Streaming movies recode
df1$StreamingMovies<- as.numeric(df1$StreamingMovies)
#Internet service recode - principle of points: churn rate by Internet service
df1$InternetService <- ifelse(df1$InternetService == "DSL", 1, ifelse(df1$InternetService == "Fiber optic", 2, 0))
df1$InternetService <- as.numeric(df1$InternetService)
#Sum and standartization
colnms = c("PhoneService", "MultipleLines", "InternetService",
"OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
"StreamingMovies")
df1$service_score <- rowSums(df1[ ,colnms])
#Standardization function
mean_sd_standard <- function(x) {
(x - mean(x)) / sd(x)
}
df_standardized <- df1 %>%
mutate_if(is.numeric, mean_sd_standard)
#summary(df_standardized)
Variables selected for clustering: Tenure, Monthly Charges, and Servuce score.
The best number of clusters is 4. It was identified with the help of silhouette method showed in the graph below.
df1_stand <- select(df_standardized, c("tenure", "MonthlyCharges", "service_score"))
fviz_nbclust(df1_stand, kmeans, method = "silhouette", k.max = 24) +
theme_minimal() +
ggtitle("The Silhouette Plot")
Cluster | tenure | MonthlyCharges | service_score |
---|---|---|---|
1 | 57.85803 | 92.30174 | 7.387667 |
2 | 15.28338 | 77.68989 | 5.023022 |
3 | 54.51603 | 30.51127 | 2.170341 |
4 | 10.25064 | 30.03194 | 1.711637 |
Interpretation:
“High - High” - The first cluster includes clients, who use many telecom services (7.39 points in average) and pay high monthly charges (92$ in average). Clients in this cluster in average stay in the service during 58 months (more than 4,5 years).
“Medium - High” - The second cluster unites clients, who have average service score (5 points), but pay quite high charges per month (78$ in average). People in this cluster do not stay very long in the service and churn after 15 months in average. Taking into account that the mean value for the variable “Internet Service” is 1.6-1.7, which is close to 2, it can be assumed that people in these cluster mainly are clients of “Fiber Optics” Internet service, where the churn rate is high.
“Low - Low” - The third cluster contains clients, who have low service score (i.e. do not use many options provided by the service) and pay low monthly charges (31$ in average). From the other side, they also show quite high loyalty and stay in service 54 month in average (i.e. more than 4 years).
“Very Low - Low” - The forth cluster includes clients, who have the lowest service score (1.7 points), lowest monthly charges (30$ in average) and the shortest period of staying in service - only 10 months in average.
The highest rate of churn is shown by the cluster “Medium - High”, i.e. the cluster with medium service score, high charges and short tenure.
df1$cluster <- gsub("1", "High - High", df1$cluster)
df1$cluster <- gsub("2", "Medium - High", df1$cluster)
df1$cluster <- gsub("3", "Low - Low", df1$cluster)
df1$cluster <- gsub("4", "Very Low - Low", df1$cluster)
ggplot(data = df1, aes(x = cluster))+
geom_bar(aes(fill = Churn), position = "dodge", alpha = 0.6) +
scale_fill_manual(values = c(
"No" = "#14b542",
"Yes" = "#b50e05")) +
theme(panel.background = element_blank(),
plot.title = element_text(color = "#045e2c")) +
labs(title = "Churn in clusters",
x = "Cluster",
y = "Clients")
To be precisely, 47% of customers, who belong to this cluster, are more likely to churn.
cluster2 <- filter(df1, df1$cluster == "Medium - High" )
ch2 <- cluster2 %>%
group_by(Churn) %>%
summarise(n = n()) %>%
mutate(rate = paste0(round(100 * n/sum(n), 0), "%")) %>%
arrange(-n)
#resulting table
ch2 %>%
kbl() %>%
kable_material(c("striped", "hover"))
Churn | n | rate |
---|---|---|
No | 1275 | 53% |
Yes | 1114 | 47% |
cluster2$Churn <- as.factor(cluster2$Churn)
cluster2_model <- select(cluster2, c("gender","SeniorCitizen", "tenure", "service_score", "Contract", "PaymentMethod","PaperlessBilling", "MonthlyCharges", "Churn"))
cluster2_model$Churn <- as.factor(cluster2_model$Churn)
cluster2_model$gender <- as.factor(cluster2_model$gender)
cluster2_model$Contract <- as.factor(cluster2_model$Contract)
cluster2_model$PaymentMethod <- as.factor(cluster2_model$PaymentMethod)
cluster2_model$PaperlessBilling <- as.factor(cluster2_model$PaperlessBilling)
cluster2_model$SeniorCitizen<- as.factor(cluster2_model$SeniorCitizen)
m1 <- glm(Churn ~., family = binomial, cluster2_model)
library(sjPlot)
tab_model(m1)
Churn | |||
---|---|---|---|
Predictors | Odds Ratios | CI | p |
(Intercept) | 0.06 | 0.03 – 0.12 | <0.001 |
gender [Male] | 0.80 | 0.67 – 0.96 | 0.019 |
SeniorCitizen [1] | 1.31 | 1.05 – 1.64 | 0.019 |
tenure | 0.96 | 0.95 – 0.97 | <0.001 |
service_score | 0.73 | 0.65 – 0.81 | <0.001 |
Contract [One year] | 0.38 | 0.26 – 0.57 | <0.001 |
Contract [Two year] | 0.06 | 0.00 – 0.31 | 0.008 |
PaymentMethod [Credit card (automatic)] |
0.96 | 0.68 – 1.36 | 0.836 |
PaymentMethod [Electronic check] |
1.49 | 1.14 – 1.97 | 0.004 |
PaymentMethod [Mailed check] |
1.03 | 0.73 – 1.45 | 0.878 |
PaperlessBilling [Yes] | 1.68 | 1.36 – 2.07 | <0.001 |
MonthlyCharges | 1.06 | 1.05 – 1.07 | <0.001 |
Observations | 2389 | ||
R2 Tjur | 0.234 |
Interpretation of significant predictors of the model:
Increasing in tenure by one month decreases the odds of churn by 4%;
The increasing in service score by one point decreases the odd of churn by 27%;
The increase in monthly charges per $1 increases the odds of churn by 6%.
Having the one-year contract reduces odds of churn by 62%, while having two year contract reduces the odds of churn by 94%;
If the client pays by electronic check, then the odds of churn increase by the factor 1.49, while the overall fact of usage paperless billing increases odds of churn almost by twice;
Being senior citizen increases odds of churn almost by 1.31 times;
For male clients the odds of churn are 20% lower than for female clients from this cluster;
set.seed(1)
test_ind = createDataPartition(cluster2_model$Churn, p = 0.2, list = FALSE)
cluster2_model.test = cluster2_model[test_ind,]
cluster2_model.train = cluster2_model[-test_ind,]
cluster2_model.test = cluster2_model.test%>% na.omit()
cluster2_model.train = cluster2_model.train%>% na.omit()
m1_train <- m1 <- glm(Churn~., family = binomial, cluster2_model.train)
test1 <- predict(m1_train, cluster2_model.test, type="response")
prediction1 <- factor(ifelse(test1 > 0.5,"Yes","No"))
matrix1 <- caret::confusionMatrix(prediction1, cluster2_model.test$Churn, mode = "prec_recall")
matrix1
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 173 52
## Yes 82 171
##
## Accuracy : 0.7197
## 95% CI : (0.6771, 0.7595)
## No Information Rate : 0.5335
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.4415
##
## Mcnemar's Test P-Value : 0.01224
##
## Precision : 0.7689
## Recall : 0.6784
## F1 : 0.7208
## Prevalence : 0.5335
## Detection Rate : 0.3619
## Detection Prevalence : 0.4707
## Balanced Accuracy : 0.7226
##
## 'Positive' Class : No
##
The false negative rate is not high, but also cannot be called low (173 correct vs. 52 incorrect). False positive rate is also not low (82 incorrect vs. 171 correct). The recall of the model is 0.678. It means that about the model correctly identified about 67.8% of customers who leaved. The precision of the model is 0.77 and f1-score is about 0.72. The model’s accuracy is also about 72%, meaning that received model performs well, but not ideally.
Now, lets explore another “problematic” cluster, which was the forth cluster called “High - High” - clients with highest service score and highest monthly charges. The churn rate in this cluster is 16%. However, this cluster represents the clients who bring the highest profit for the company and use many company’s service. That is why it is important to know reasons of their churn.
cluster1 <- filter(df1, df1$cluster == "High - High" )
ch3 <- cluster1 %>%
group_by(Churn) %>%
summarise(n = n()) %>%
mutate(rate = paste0(round(100 * n/sum(n), 0), "%")) %>%
arrange(-n)
#resulting table
ch3 %>%
kbl() %>%
kable_material(c("striped", "hover"))
Churn | n | rate |
---|---|---|
No | 1760 | 84% |
Yes | 332 | 16% |
cluster1$Churn <- as.factor(cluster1$Churn)
cluster1$Churn <- as.factor(cluster1$Churn)
cluster1_model <- select(cluster1, c("gender","SeniorCitizen", "tenure", "service_score", "Contract",
"PaymentMethod","PaperlessBilling", "MonthlyCharges", "Churn"))
cluster1_model$Churn <- as.factor(cluster1_model$Churn)
cluster1_model$gender <- as.factor(cluster1_model$gender)
cluster1_model$Contract <- as.factor(cluster1_model$Contract)
cluster1_model$PaymentMethod <- as.factor(cluster1_model$PaymentMethod)
cluster1_model$PaperlessBilling <- as.factor(cluster1_model$PaperlessBilling)
cluster1_model$SeniorCitizen<- as.factor(cluster1_model$SeniorCitizen)
m2 <- glm(Churn ~., family = binomial, cluster1_model)
tab_model(m2)
Churn | |||
---|---|---|---|
Predictors | Odds Ratios | CI | p |
(Intercept) | 0.02 | 0.01 – 0.08 | <0.001 |
gender [Male] | 1.24 | 0.96 – 1.61 | 0.094 |
SeniorCitizen [1] | 1.21 | 0.91 – 1.60 | 0.193 |
tenure | 0.98 | 0.97 – 0.99 | 0.001 |
service_score | 0.76 | 0.63 – 0.92 | 0.004 |
Contract [One year] | 0.62 | 0.46 – 0.84 | 0.002 |
Contract [Two year] | 0.21 | 0.13 – 0.33 | <0.001 |
PaymentMethod [Credit card (automatic)] |
0.98 | 0.68 – 1.42 | 0.931 |
PaymentMethod [Electronic check] |
1.44 | 1.05 – 1.99 | 0.024 |
PaymentMethod [Mailed check] |
0.73 | 0.35 – 1.38 | 0.351 |
PaperlessBilling [Yes] | 1.15 | 0.84 – 1.60 | 0.385 |
MonthlyCharges | 1.06 | 1.04 – 1.08 | <0.001 |
Observations | 2092 | ||
R2 Tjur | 0.152 |
Interpretation of significant predictors of the model:
Increasing in tenure by one month decreases the odds of churn only by 2%;
The increasing in service score by one point decreases the odd of churn by 24%;
The increase in monthly charges per $1 increases the odds of churn by 6%.
Having the one-year contract reduces odds of churn by 38%, while having two year contract reduces the odds of churn by 79%;
If the client pays by electronic check, then the odds of churn increase by the factor 1.44;
set.seed(1)
test_ind = createDataPartition(cluster1_model$Churn, p = 0.2, list = FALSE)
cluster1_model.test = cluster1_model[test_ind,]
cluster1_model.train = cluster1_model[-test_ind,]
cluster1_model.test = cluster1_model.test%>% na.omit()
cluster1_model.train = cluster1_model.train%>% na.omit()
m2_train <- m1 <- glm(Churn~., family = binomial, cluster1_model.train)
test1 <- predict(m2_train, cluster1_model.test, type="response")
prediction1 <- factor(ifelse(test1 > 0.5,"Yes","No"))
matrix2 <- caret::confusionMatrix(prediction1, cluster1_model.test$Churn, mode = "prec_recall")
matrix2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 347 61
## Yes 5 6
##
## Accuracy : 0.8425
## 95% CI : (0.804, 0.876)
## No Information Rate : 0.8401
## P-Value [Acc > NIR] : 0.4794
##
## Kappa : 0.1139
##
## Mcnemar's Test P-Value : 1.288e-11
##
## Precision : 0.8505
## Recall : 0.9858
## F1 : 0.9132
## Prevalence : 0.8401
## Detection Rate : 0.8282
## Detection Prevalence : 0.9737
## Balanced Accuracy : 0.5377
##
## 'Positive' Class : No
##
The false negative rate is very low (347 correct vs. 61 incorrect). False positive appeared to be high due to the small number of churned clients in the test dataset. The recall of the model is 0.986. It means that about the model correctly identified about 98,6% of customers who leaved, which is very good result. The precision of the model is 0.85 and f1-score is about 0.91. The accuracy of the model is about 84,25%, meaning that performance of the model is pretty good.
Conclusions
The list of company’s clients might be divided into 4 clusters using client’s time in service, numbers of servics and monthly payments. These clusters clusters are:
“High-High” - clients with high number of services and high monthly payments.
“Medium-High” - clients with medium number of services, but with high monthly charges.
“Low-Low” - clients with low number of services and with low charges.
“Very Low-Low” - clients with the minimum number of services and with low charges.
The cluster with the highest churn rate: “Medium-High” (rate: 47%)
For both analyzed clusters (“High-High” and “Medium-Low”) increase in variables describing tenure, the service score and length of contracts decrease the odds of churn, while the payments by electronic check and the increase in monthly charges increase the odds of client’s churn.