1 Introduction

This study project was made as part of Business Anlaytics course, 2020. The analysis is conducted on the Telco customer churn: IBM dataset.

2 Summary - Churn Prevention Recommendation

2.1 Recommendations for the cluster “Medium - High”

Clients from this cluster leave mainly because of monthly charges and Internet service that are also interconnected. Long term contracts and the usage of paperless billing can persuade them to stay. Two strategies suggested to prevent the churn of these clients:

  1. Offer bonuses to clients, who use paperless billing. For example, additional service options such as “Streaming Movies” or “Tech Support” that are not used frequently.

  2. Introduce free trials for several services for one- and two-years contracts. It will engage clients’ to use more services in the future and, consequently stay longer in the service.

2.2 Recommendations for the cluster “High - High”

Clients from this group are subscribed to many services and pay highest monthly charges. That is why it is quite important to keep them in the company. Two strategies suggested to prevent the churn of these clients:

  1. Extend the length of the contracts, for example introduce three-year contract with the same monthly charges for this type of clients.

  2. Explore more the user experience of paying by electronic check. It might appear that customers churn because of the difficulties in the process of payment this way.

3 Analysis

3.1 Data exploration

3.1.1 Churn rate

df <- read_csv("~/WA_Fn-UseC_-Telco-Customer-Churn.csv")

#computation of churn rate
ch1 <- df %>% 
  group_by(Churn) %>% 
  summarise(n = n()) %>% 
  mutate(rate = paste0(round(100 * n/sum(n), 0), "%")) %>% 
  arrange(-n) 

#resulting table
library(kableExtra)
ch1 %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
Churn n rate
No 5174 73%
Yes 1869 27%

Overall churn rate is 27%, i.e. almost one third of clients leave The row “Yes” identifies the total number of clients churn, the column “No” identifies clients, who are still staying with the company. It is seen that 1869 clients had churn.

3.1.2 Correlation of churn

There is no high difference in churn among two genders, while in the case of type of Internet service the highest churn rate of show clients of “Fiber Optics” Internet service.

#churn by gender
p1 <- ggplot(data = df, aes(x = gender))+
  geom_bar(aes(fill = Churn), 
           position = "dodge", 
           alpha = 0.6) +
  scale_fill_manual(values = c(
    "No" = "#14b542", 
    "Yes" = "#b50e05")) +
  theme(panel.background = element_blank(),
        plot.title = element_text(color = "#045e2c")) + 
  labs(title = "Clients churn by gender", 
       x = "Gender", 
       y = "Clients")


#churn by type of service
p2 <- ggplot(data = df, aes(x = InternetService))+
  geom_bar(aes(fill = Churn), 
           position = "dodge",
           alpha = 0.6) +
  scale_fill_manual(values = c(
    "No" = "#14b542", 
    "Yes" = "#b50e05")) +
   theme(panel.background = element_blank(),
        plot.title = element_text(color = "#045e2c")) +
  labs(title = "Clients churn by Internet service", 
       x = "Internet service", 
       y = "Clients")


ggpubr::ggarrange(p1, p2,
          ncol = 2, nrow = 2)

  1. Clients who pay charges around $65- $100 per month show have high churn rate, which is a sign that the clients who bring higher profits to the company have tendency to leave.

  2. Critical point, when clients are more likely to leave: first 1-1,5 years in the company’s service.

  3. Clients, who choose month-to-month type of contract have highers risks to churn.

#Monthly charges
p3 <- ggplot(data = df)+
  geom_histogram(aes(x = MonthlyCharges, fill = Churn), alpha = 0.6) +
  scale_fill_manual(values = c(
    "No" = "#14b542", 
    "Yes" = "#b50e05")) +
   theme(panel.background = element_blank(),
        plot.title = element_text(color = "#045e2c")) +
  labs(title = "Distribution of Monthly Charges", 
       x = "Monthly Charges", 
       y = " ")


#Tenure - time with the company
p4 <- ggplot(data = df)+
  geom_histogram(aes(x = tenure, fill = Churn), alpha = 0.6)+
  scale_fill_manual(values = c(
    "No" = "#14b542", 
    "Yes" = "#b50e05")) +
   theme(panel.background = element_blank(),
        plot.title = element_text(color = "#045e2c"))+  
  labs(title = "Distribution of tenure", 
       x = "Tenure",
       y = " ") 

ggpubr::ggarrange(p3, p4,
          ncol = 2, nrow = 2)

3.1.3 Service scores

For the further analysis I created the variable “Service score”, which indicates number of services, which client uses. Due to the reason that some variables were measured in different scale, all of them were standardized.

df1 <- select(df, c("gender","SeniorCitizen", "tenure", "PhoneService", "MultipleLines", "InternetService", 
                       "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", 
                       "StreamingMovies", "Contract", "PaymentMethod","PaperlessBilling", "MonthlyCharges", "Churn"))

df1$PhoneService <- ifelse(df1$PhoneService == "Yes", 1, 0) #PhoneService recode
df1$PhoneService <- as.numeric(df1$PhoneService)

df1$MultipleLines <- ifelse(df1$MultipleLines == "Yes", 1, 0) #Multiple lines recode
df1$MultipleLines <- as.numeric(df1$MultipleLines)

df1$OnlineSecurity <- ifelse(df1$OnlineSecurity == "Yes", 1, 0) #Online Security recode
df1$OnlineSecurity <- as.numeric(df1$OnlineSecurity)

df1$OnlineBackup <- ifelse(df1$OnlineBackup == "Yes", 1, 0) #Online Backup recode
df1$OnlineBackup <- as.numeric(df1$OnlineBackup)

df1$DeviceProtection <- ifelse(df1$DeviceProtection == "Yes", 1, 0) #Device protection recode
df1$DeviceProtection <- as.numeric(df1$DeviceProtection)

df1$TechSupport <- ifelse(df1$TechSupport == "Yes", 1, 0) #Tech Support recode
df1$TechSupport <- as.numeric(df1$TechSupport)

df1$StreamingTV <- ifelse(df1$StreamingTV == "Yes", 1, 0) #Streaming tv recode
df1$StreamingTV <- as.numeric(df1$StreamingTV)

df1$StreamingMovies <- ifelse(df1$StreamingMovies == "Yes", 1, 0) #Streaming movies recode
df1$StreamingMovies<- as.numeric(df1$StreamingMovies)

#Internet service recode - principle of points: churn rate by Internet service
df1$InternetService <- ifelse(df1$InternetService == "DSL", 1, ifelse(df1$InternetService == "Fiber optic", 2, 0))
df1$InternetService <- as.numeric(df1$InternetService)

#Sum and standartization
colnms = c("PhoneService", "MultipleLines", "InternetService", 
           "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", 
           "StreamingMovies")
df1$service_score <- rowSums(df1[ ,colnms])

#Standardization function 
mean_sd_standard <- function(x) {
  (x - mean(x)) / sd(x)
}

df_standardized <- df1 %>%
  mutate_if(is.numeric, mean_sd_standard)

#summary(df_standardized)

4 Segmentation

Variables selected for clustering: Tenure, Monthly Charges, and Servuce score.

4.1 Best number of clusters

The best number of clusters is 4. It was identified with the help of silhouette method showed in the graph below.

df1_stand <- select(df_standardized, c("tenure", "MonthlyCharges", "service_score"))

fviz_nbclust(df1_stand, kmeans, method = "silhouette", k.max = 24) + 
  theme_minimal() + 
  ggtitle("The Silhouette Plot")

4.2 Clustering

Cluster tenure MonthlyCharges service_score
1 57.85803 92.30174 7.387667
2 15.28338 77.68989 5.023022
3 54.51603 30.51127 2.170341
4 10.25064 30.03194 1.711637

Interpretation:

  1. “High - High” - The first cluster includes clients, who use many telecom services (7.39 points in average) and pay high monthly charges (92$ in average). Clients in this cluster in average stay in the service during 58 months (more than 4,5 years).

  2. “Medium - High” - The second cluster unites clients, who have average service score (5 points), but pay quite high charges per month (78$ in average). People in this cluster do not stay very long in the service and churn after 15 months in average. Taking into account that the mean value for the variable “Internet Service” is 1.6-1.7, which is close to 2, it can be assumed that people in these cluster mainly are clients of “Fiber Optics” Internet service, where the churn rate is high.

  3. “Low - Low” - The third cluster contains clients, who have low service score (i.e. do not use many options provided by the service) and pay low monthly charges (31$ in average). From the other side, they also show quite high loyalty and stay in service 54 month in average (i.e. more than 4 years).

  4. “Very Low - Low” - The forth cluster includes clients, who have the lowest service score (1.7 points), lowest monthly charges (30$ in average) and the shortest period of staying in service - only 10 months in average.

4.3 Churn in received clusters

The highest rate of churn is shown by the cluster “Medium - High”, i.e. the cluster with medium service score, high charges and short tenure.

df1$cluster <- gsub("1", "High - High", df1$cluster)
df1$cluster <- gsub("2", "Medium - High", df1$cluster)
df1$cluster <- gsub("3", "Low - Low", df1$cluster)
df1$cluster <- gsub("4", "Very Low - Low", df1$cluster)

ggplot(data = df1, aes(x = cluster))+
  geom_bar(aes(fill = Churn), position = "dodge", alpha = 0.6) +
    scale_fill_manual(values = c(
    "No" = "#14b542", 
    "Yes" = "#b50e05")) +
   theme(panel.background = element_blank(),
        plot.title = element_text(color = "#045e2c")) +
   labs(title = "Churn in clusters", 
       x = "Cluster", 
       y = "Clients")

4.3.1 Churn prediction for the cluster “Medium - High”

To be precisely, 47% of customers, who belong to this cluster, are more likely to churn.

cluster2 <- filter(df1, df1$cluster == "Medium - High" )

ch2 <- cluster2 %>% 
  group_by(Churn) %>% 
  summarise(n = n()) %>% 
  mutate(rate = paste0(round(100 * n/sum(n), 0), "%")) %>% 
  arrange(-n)

#resulting table
ch2 %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
Churn n rate
No 1275 53%
Yes 1114 47%

4.3.1.1 Regression model

cluster2$Churn <- as.factor(cluster2$Churn)
cluster2_model <- select(cluster2, c("gender","SeniorCitizen", "tenure", "service_score", "Contract", "PaymentMethod","PaperlessBilling", "MonthlyCharges", "Churn"))

cluster2_model$Churn <- as.factor(cluster2_model$Churn)
cluster2_model$gender <- as.factor(cluster2_model$gender)
cluster2_model$Contract <- as.factor(cluster2_model$Contract)
cluster2_model$PaymentMethod <- as.factor(cluster2_model$PaymentMethod)
cluster2_model$PaperlessBilling <- as.factor(cluster2_model$PaperlessBilling)
cluster2_model$SeniorCitizen<- as.factor(cluster2_model$SeniorCitizen)

m1 <- glm(Churn ~., family = binomial, cluster2_model)

library(sjPlot)
tab_model(m1)
  Churn
Predictors Odds Ratios CI p
(Intercept) 0.06 0.03 – 0.12 <0.001
gender [Male] 0.80 0.67 – 0.96 0.019
SeniorCitizen [1] 1.31 1.05 – 1.64 0.019
tenure 0.96 0.95 – 0.97 <0.001
service_score 0.73 0.65 – 0.81 <0.001
Contract [One year] 0.38 0.26 – 0.57 <0.001
Contract [Two year] 0.06 0.00 – 0.31 0.008
PaymentMethod [Credit
card (automatic)]
0.96 0.68 – 1.36 0.836
PaymentMethod [Electronic
check]
1.49 1.14 – 1.97 0.004
PaymentMethod [Mailed
check]
1.03 0.73 – 1.45 0.878
PaperlessBilling [Yes] 1.68 1.36 – 2.07 <0.001
MonthlyCharges 1.06 1.05 – 1.07 <0.001
Observations 2389
R2 Tjur 0.234

Interpretation of significant predictors of the model:

  1. Increasing in tenure by one month decreases the odds of churn by 4%;

  2. The increasing in service score by one point decreases the odd of churn by 27%;

  3. The increase in monthly charges per $1 increases the odds of churn by 6%.

  4. Having the one-year contract reduces odds of churn by 62%, while having two year contract reduces the odds of churn by 94%;

  5. If the client pays by electronic check, then the odds of churn increase by the factor 1.49, while the overall fact of usage paperless billing increases odds of churn almost by twice;

  6. Being senior citizen increases odds of churn almost by 1.31 times;

  7. For male clients the odds of churn are 20% lower than for female clients from this cluster;

set.seed(1)
test_ind = createDataPartition(cluster2_model$Churn, p = 0.2, list = FALSE)
cluster2_model.test = cluster2_model[test_ind,]
cluster2_model.train = cluster2_model[-test_ind,]
cluster2_model.test = cluster2_model.test%>% na.omit()
cluster2_model.train =  cluster2_model.train%>% na.omit()

m1_train <- m1 <- glm(Churn~., family = binomial, cluster2_model.train)

test1 <- predict(m1_train, cluster2_model.test, type="response")
prediction1 <- factor(ifelse(test1 > 0.5,"Yes","No"))
matrix1 <- caret::confusionMatrix(prediction1, cluster2_model.test$Churn, mode = "prec_recall")
matrix1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  173  52
##        Yes  82 171
##                                           
##                Accuracy : 0.7197          
##                  95% CI : (0.6771, 0.7595)
##     No Information Rate : 0.5335          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.4415          
##                                           
##  Mcnemar's Test P-Value : 0.01224         
##                                           
##               Precision : 0.7689          
##                  Recall : 0.6784          
##                      F1 : 0.7208          
##              Prevalence : 0.5335          
##          Detection Rate : 0.3619          
##    Detection Prevalence : 0.4707          
##       Balanced Accuracy : 0.7226          
##                                           
##        'Positive' Class : No              
## 

The false negative rate is not high, but also cannot be called low (173 correct vs. 52 incorrect). False positive rate is also not low (82 incorrect vs. 171 correct). The recall of the model is 0.678. It means that about the model correctly identified about 67.8% of customers who leaved. The precision of the model is 0.77 and f1-score is about 0.72. The model’s accuracy is also about 72%, meaning that received model performs well, but not ideally.

4.3.2 Churn prediction for the cluster “High - High”

Now, lets explore another “problematic” cluster, which was the forth cluster called “High - High” - clients with highest service score and highest monthly charges. The churn rate in this cluster is 16%. However, this cluster represents the clients who bring the highest profit for the company and use many company’s service. That is why it is important to know reasons of their churn.

cluster1 <- filter(df1, df1$cluster == "High - High" )

ch3 <- cluster1 %>% 
  group_by(Churn) %>% 
  summarise(n = n()) %>% 
  mutate(rate = paste0(round(100 * n/sum(n), 0), "%")) %>% 
  arrange(-n)

#resulting table
ch3 %>%
  kbl() %>%
  kable_material(c("striped", "hover"))
Churn n rate
No 1760 84%
Yes 332 16%

4.3.2.1 Regression model

cluster1$Churn <- as.factor(cluster1$Churn)

cluster1$Churn <- as.factor(cluster1$Churn)
cluster1_model <- select(cluster1, c("gender","SeniorCitizen", "tenure", "service_score", "Contract", 
                                       "PaymentMethod","PaperlessBilling", "MonthlyCharges", "Churn"))

cluster1_model$Churn <- as.factor(cluster1_model$Churn)
cluster1_model$gender <- as.factor(cluster1_model$gender)
cluster1_model$Contract <- as.factor(cluster1_model$Contract)
cluster1_model$PaymentMethod <- as.factor(cluster1_model$PaymentMethod)
cluster1_model$PaperlessBilling <- as.factor(cluster1_model$PaperlessBilling)
cluster1_model$SeniorCitizen<- as.factor(cluster1_model$SeniorCitizen)

m2 <- glm(Churn ~., family = binomial, cluster1_model)

tab_model(m2)
  Churn
Predictors Odds Ratios CI p
(Intercept) 0.02 0.01 – 0.08 <0.001
gender [Male] 1.24 0.96 – 1.61 0.094
SeniorCitizen [1] 1.21 0.91 – 1.60 0.193
tenure 0.98 0.97 – 0.99 0.001
service_score 0.76 0.63 – 0.92 0.004
Contract [One year] 0.62 0.46 – 0.84 0.002
Contract [Two year] 0.21 0.13 – 0.33 <0.001
PaymentMethod [Credit
card (automatic)]
0.98 0.68 – 1.42 0.931
PaymentMethod [Electronic
check]
1.44 1.05 – 1.99 0.024
PaymentMethod [Mailed
check]
0.73 0.35 – 1.38 0.351
PaperlessBilling [Yes] 1.15 0.84 – 1.60 0.385
MonthlyCharges 1.06 1.04 – 1.08 <0.001
Observations 2092
R2 Tjur 0.152

Interpretation of significant predictors of the model:

  1. Increasing in tenure by one month decreases the odds of churn only by 2%;

  2. The increasing in service score by one point decreases the odd of churn by 24%;

  3. The increase in monthly charges per $1 increases the odds of churn by 6%.

  4. Having the one-year contract reduces odds of churn by 38%, while having two year contract reduces the odds of churn by 79%;

  5. If the client pays by electronic check, then the odds of churn increase by the factor 1.44;

set.seed(1)
test_ind = createDataPartition(cluster1_model$Churn, p = 0.2, list = FALSE)
cluster1_model.test = cluster1_model[test_ind,]
cluster1_model.train = cluster1_model[-test_ind,]
cluster1_model.test = cluster1_model.test%>% na.omit()
cluster1_model.train =  cluster1_model.train%>% na.omit()

m2_train <- m1 <- glm(Churn~., family = binomial, cluster1_model.train)

test1 <- predict(m2_train, cluster1_model.test, type="response")
prediction1 <- factor(ifelse(test1 > 0.5,"Yes","No"))
matrix2 <- caret::confusionMatrix(prediction1, cluster1_model.test$Churn, mode = "prec_recall")
matrix2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  347  61
##        Yes   5   6
##                                         
##                Accuracy : 0.8425        
##                  95% CI : (0.804, 0.876)
##     No Information Rate : 0.8401        
##     P-Value [Acc > NIR] : 0.4794        
##                                         
##                   Kappa : 0.1139        
##                                         
##  Mcnemar's Test P-Value : 1.288e-11     
##                                         
##               Precision : 0.8505        
##                  Recall : 0.9858        
##                      F1 : 0.9132        
##              Prevalence : 0.8401        
##          Detection Rate : 0.8282        
##    Detection Prevalence : 0.9737        
##       Balanced Accuracy : 0.5377        
##                                         
##        'Positive' Class : No            
## 

The false negative rate is very low (347 correct vs. 61 incorrect). False positive appeared to be high due to the small number of churned clients in the test dataset. The recall of the model is 0.986. It means that about the model correctly identified about 98,6% of customers who leaved, which is very good result. The precision of the model is 0.85 and f1-score is about 0.91. The accuracy of the model is about 84,25%, meaning that performance of the model is pretty good.

Conclusions

  1. The list of company’s clients might be divided into 4 clusters using client’s time in service, numbers of servics and monthly payments. These clusters clusters are:

    • “High-High” - clients with high number of services and high monthly payments.

    • “Medium-High” - clients with medium number of services, but with high monthly charges.

    • “Low-Low” - clients with low number of services and with low charges.

    • “Very Low-Low” - clients with the minimum number of services and with low charges.

  2. The cluster with the highest churn rate: “Medium-High” (rate: 47%)

  3. For both analyzed clusters (“High-High” and “Medium-Low”) increase in variables describing tenure, the service score and length of contracts decrease the odds of churn, while the payments by electronic check and the increase in monthly charges increase the odds of client’s churn.