1. Preparing the data

First, I load all the libraries I need for this project (I will point out the specific packages where relevant):

library(broom)
library(plyr)
library(dplyr)
library(randomForest)
library(rpart)
library(rpart.plot)
library(caret)
library(ggplot2)
library(gridExtra)
library(readr)
library(corrplot)
library(factoextra)
library(dendextend)
library(MASS)

Then I load the actual data. It includes 21 variables and 7,043 observations; after removing rows with missing values, 7,032 observations remain:

data <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
data <- na.omit(data)  # drop rows with missing values (7,043 -> 7,032 observations)
summary(data)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7032        Length:7032        Min.   :0.0000   Length:7032       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1624                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7032        Min.   : 1.00   Length:7032        Length:7032       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.42                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7032        Length:7032        Length:7032        Length:7032       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7032        Length:7032        Length:7032        Length:7032       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7032        Length:7032        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.59   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.80   Mean   :2283.3  
##                                        3rd Qu.: 89.86   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##     Churn          
##  Length:7032       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

As our task is to examine churn and suggest ways to reduce it, I start by exploring the problem visually:


data <- as.data.frame(data)
ggplot(data, aes(x = Churn)) +
  geom_bar(fill = "#00bfff") +
  ggtitle("Churned and Retained Customers") +
  xlab("Churn?") +
  ylab("Number of Observations") +
  theme_classic()

count(data %>% filter(data$Churn == "Yes")) / nrow(data)
##          n
## 1 0.265785
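
A simpler equivalent, assuming the "Churn" column keeps its "Yes"/"No" coding, is to take the mean of a logical vector:

# Same churn share in one call (hedged alternative, not part of the original pipeline)
mean(data$Churn == "Yes")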

A bit more than a quarter of all our customers leave - approximately 26.6%! Next, I explore the correlations between “Churn” and the other variables: the easiest way to do that is to build a correlation matrix with all the variables and inspect the row we are interested in (the “corrplot” package is used here). All variables are converted to numeric for this operation:

cordata <- data %>% mutate_if(is.character, as.factor) %>% mutate_if(is.factor, as.numeric)

corrplot(cor(cordata), method = "square")
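
If we only care about the “Churn” row, a quick hedged alternative is to pull that column of the correlation matrix and sort it:

# Correlations of every variable with Churn, ordered by absolute strength (illustrative only)
sort(abs(cor(cordata)[, "Churn"]), decreasing = TRUE)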

It seems that the demographic variables are not important at all, while the most influential are (1) “tenure” and (2) “Contract”:

paste("correlation between Churn and tenure =", cor(cordata$Churn, cordata$tenure))
## [1] "correlation between Churn and tenure = -0.354049358953251"
paste("correlation between Churn and Contract =", cor(cordata$Churn, cordata$Contract))
## [1] "correlation between Churn and Contract = -0.396149532993654"

Further, I remove the redundant variables: (1) “customerID”, as it merely identifies each customer and is not helpful in the analysis, (2) “PhoneService”, as it is already captured by “MultipleLines”, and (3) “TotalCharges”, as it is essentially the product of “MonthlyCharges” and “tenure”.

data1 <- data %>% dplyr::select(-customerID, -PhoneService, -TotalCharges)

Now I recode the service-usage variables (multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies) into numeric form (0-1, or 0-2 for the number of lines and the internet type). I do it manually, so to speak, but it works well enough; a more compact sketch follows the code block below. After these operations, I also create a new variable called “TelecomServices” that sums up all the services a customer uses:

data1$MultipleLines <- as.factor(data1$MultipleLines)
data1$InternetService <- as.factor(data1$InternetService)
data1$OnlineSecurity <- as.factor(data1$OnlineSecurity)
data1$OnlineBackup <- as.factor(data1$OnlineBackup)
data1$DeviceProtection <- as.factor(data1$DeviceProtection)
data1$TechSupport <- as.factor(data1$TechSupport)
data1$StreamingTV <- as.factor(data1$StreamingTV)
data1$StreamingMovies <- as.factor(data1$StreamingMovies)

data1$MultipleLines <- revalue(data1$MultipleLines, c("No phone service"=0, "No" = 1, "Yes"=2))
data1$InternetService <- revalue(data1$InternetService, c("No"=0, "Fiber optic" = 2, "DSL"=1))
data1$OnlineSecurity <- revalue(data1$OnlineSecurity, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$OnlineBackup <- revalue(data1$OnlineBackup, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$DeviceProtection <- revalue(data1$DeviceProtection, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$TechSupport <- revalue(data1$TechSupport, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$StreamingTV <- revalue(data1$StreamingTV, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$StreamingMovies <- revalue(data1$StreamingMovies, c("No internet service"="0", "No" = "0", "Yes"="1"))

data1$MultipleLines = as.numeric(as.character(data1$MultipleLines))
data1$InternetService = as.numeric(as.character(data1$InternetService))
data1$OnlineSecurity = as.numeric(as.character(data1$OnlineSecurity))
data1$OnlineBackup = as.numeric(as.character(data1$OnlineBackup))
data1$DeviceProtection = as.numeric(as.character(data1$DeviceProtection))
data1$TechSupport = as.numeric(as.character(data1$TechSupport))
data1$StreamingTV = as.numeric(as.character(data1$StreamingTV))
data1$StreamingMovies = as.numeric(as.character(data1$StreamingMovies))

data1$TelecomServices <- data1$MultipleLines + data1$InternetService + data1$OnlineSecurity + data1$OnlineBackup + data1$DeviceProtection + data1$TechSupport + data1$StreamingTV + data1$StreamingMovies
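
For reference, here is a more compact (untested) sketch of the same recoding using dplyr verbs; it assumes the columns still hold their original text values and should give the same result as the manual block above:

# Hypothetical compact equivalent of the manual recoding above
service_cols <- c("OnlineSecurity", "OnlineBackup", "DeviceProtection",
                  "TechSupport", "StreamingTV", "StreamingMovies")
data1 <- data1 %>%
  mutate(
    MultipleLines = case_when(MultipleLines == "Yes" ~ 2,
                              MultipleLines == "No"  ~ 1,
                              TRUE                   ~ 0),   # "No phone service"
    InternetService = case_when(InternetService == "Fiber optic" ~ 2,
                                InternetService == "DSL"         ~ 1,
                                TRUE                             ~ 0), # "No"
    across(all_of(service_cols), ~ as.numeric(.x == "Yes")),
    TelecomServices = MultipleLines + InternetService + rowSums(across(all_of(service_cols)))
  )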

Finally, I standardize 3 continuous variables: “tenure”, “MonthlyCharges”, and “TelecomServices”:

data1$tenure <- scale(data1$tenure, center = TRUE, scale = TRUE)
data1$MonthlyCharges <- scale(data1$MonthlyCharges, center = TRUE, scale = TRUE)
data1$TelecomServices <- scale(data1$TelecomServices, center = TRUE, scale = TRUE)
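
One small caveat: scale() returns one-column matrices rather than plain vectors. That is harmless for the steps below, but flattening them back avoids surprises later (a precaution I would add, not something the original code requires):

# Turn the one-column matrices produced by scale() back into ordinary numeric vectors
data1$tenure          <- as.numeric(data1$tenure)
data1$MonthlyCharges  <- as.numeric(data1$MonthlyCharges)
data1$TelecomServices <- as.numeric(data1$TelecomServices)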

2. Clusterization

The 3 variables I normalized are used here for the clusterization. To define the optimal number of clusters, I use the fancy fviz_nbclust function from the “factoextra” package:

clustdata <- data1 %>% dplyr::select("tenure", "MonthlyCharges", "TelecomServices")
fviz_nbclust(clustdata, kmeans, method = "silhouette")
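
As a cross-check (my addition, not part of the original run), the elbow method from the same package answers the same question from a different angle:

# Elbow ("within sum of squares") view of the optimal number of clusters; purely illustrative
fviz_nbclust(clustdata, kmeans, method = "wss")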

Also, I want to briefly look at the results of hierarchical clustering with the “complete” and “average” linkage methods. The respective dendrograms are shown below (the “dendextend” package is used to color the branches):

d <- dist(clustdata, method = "euclidean")  # avoid masking the base dist() function

hc_complete <- hclust(d, method = "complete")
hc_average <- hclust(d, method = "average")

plot(color_branches(as.dendrogram(hc_complete), h = 4), main = "complete linkage")

plot(color_branches(as.dendrogram(hc_average), h = 2), main = "average linkage")

The cut heights were chosen visually and produce 3 clusters, which differs from the number suggested by fviz_nbclust above. So, to decide which number I need, I present cross-tables of the cluster memberships produced by the two linkage methods below:

complete3 <- cutree(hc_complete, 3)
average3 <- cutree(hc_average, 3)
table(complete3, average3)
##          average3
## complete3    1    2    3
##         1 1714  189  147
##         2    0 2538  742
##         3    0  703  999
complete4 <- cutree(hc_complete, 4)
average4 <- cutree(hc_average, 4)
table(complete4, average4)
##          average4
## complete4    1    2    3    4
##         1 1126  189  147  588
##         2    0 2379   43    0
##         3    0  703  999    0
##         4    0  159  699    0

It may not be the most advanced heuristic, but the second table contains more zeros - the two linkage methods “agree” more often, so to say. For that reason, I divide the data into 4 clusters:

clusters <- kmeans(clustdata, centers = 4)
data1$cluster <- factor(clusters$cluster)
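
A note on reproducibility: kmeans() starts from random centroids, so the cluster labels and sizes can shift between runs. A minimal sketch of how I would pin them down (the seed value is arbitrary):

set.seed(123)                                            # arbitrary seed, fixes the random initialization
clusters <- kmeans(clustdata, centers = 4, nstart = 25)  # 25 random restarts, keep the best solution
data1$cluster <- factor(clusters$cluster)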

Let’s now explore each cluster separately:

data1 %>% dplyr::select("tenure", "MonthlyCharges", "TelecomServices", "cluster") %>% group_by(cluster) %>% summarize_if(is.numeric, mean)
## # A tibble: 4 x 4
##   cluster tenure MonthlyCharges TelecomServices
##   <fct>    <dbl>          <dbl>           <dbl>
## 1 1       -0.836         0.362           0.0813
## 2 2        1.00          1.16            1.33  
## 3 3        0.833         0.0888          0.284 
## 4 4       -0.246        -1.30           -1.21

Before interpreting, I add some illustrations as well:

ggplot(data1, aes(x = cluster)) + geom_bar(fill = "#00bfff") +
  xlab("Cluster") +
  ggtitle("What are the cluster sizes?") +
  ylab("Number of consumers") +
  theme_classic()

ggplot(data1, aes(fill = Churn, x = cluster)) +
  geom_bar(position = "dodge") +
  ggtitle("How churn is distributed among clusters") +
  xlab("Cluster") +
  ylab("Number of consumers") +
  theme_classic()

The striking fact in the table is the distinctiveness of the first cluster: its members are relatively new (low tenure) and pay above-average monthly charges while using roughly an average number of services. The size chart also shows that this cluster is the biggest one, so the problem looks serious. Why is it so different from the 4th cluster, the second largest but with a low churn rate? That is a question.

paste("the percent of churn in the 1st cluster is", count(data1 %>% filter(cluster == "1") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "1")))
## [1] "the percent of churn in the 1st cluster is 0.486369536996971"
paste("the percent of churn in the 2nd cluster is", count(data1 %>% filter(cluster == "2") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "2")))
## [1] "the percent of churn in the 2nd cluster is 0.196732026143791"
paste("the percent of churn in the 3rd cluster is", count(data1 %>% filter(cluster == "3") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "3")))
## [1] "the percent of churn in the 3rd cluster is 0.121239744758432"
paste("the percent of churn in the 4th cluster is", count(data1 %>% filter(cluster == "4") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "4")))
## [1] "the percent of churn in the 4th cluster is 0.148519579751671"

The percentages calculated above show that the fourth cluster is “worse” than the third one - but nevertheless, it is still “better” than the main source of the company’s financial losses. To make the distinctions between clusters more tangible, I rename them:

data1$cluster <- as.factor(data1$cluster)
data1$cluster <- revalue(data1$cluster, c("1" = "50% leavers and huge", "2" = "20% leavers and small", "3" = "12% leavers and small", "4" = "15% leavers and huge"))

A short explanation of the new names: I think the size and the share of leavers are the most important cluster characteristics, because together they give an impression of what is going on in each group and where the policy should change.

3. Churn prediction

In this section, I create & train logistic regression models for the clusters “50% leavers and huge” and “20% leavers and small”. I also turn the “Churn” variable into a factor and split each data set into “train” and “test” parts:

fifty_leavers <- data1 %>% filter(cluster == "50% leavers and huge") %>% dplyr::select(!"cluster")
twenty_leavers <- data1 %>% filter(cluster == "20% leavers and small") %>% dplyr::select(!"cluster")

fifty_leavers$Churn <- as.factor(fifty_leavers$Churn)
twenty_leavers$Churn <- as.factor(twenty_leavers$Churn)

test_churn = createDataPartition(fifty_leavers$Churn, p=0.2, list = FALSE)
fifty_test = fifty_leavers[test_churn,]
fifty_train = fifty_leavers[-test_churn,]

test_churn = createDataPartition(twenty_leavers$Churn, p=0.2, list = FALSE)
twenty_test = twenty_leavers[test_churn,]
twenty_train = twenty_leavers[-test_churn,]
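
createDataPartition() also draws a random sample, so the exact splits (and the confusion matrices below) differ between knits. If reproducibility matters, a seed can be set before each split; a minimal sketch with an arbitrary seed:

# Fix the RNG so the stratified train/test split is reproducible across knits
set.seed(2021)
test_churn <- createDataPartition(fifty_leavers$Churn, p = 0.2, list = FALSE)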

“50% leavers and huge” & model

model_fifty <- glm(Churn~., family = binomial, fifty_train)
model_fifty2 <- stepAIC(model_fifty, trace = 0)
summary(model_fifty)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = fifty_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.99356  -0.95116  -0.00039   0.90968   2.65487  
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -7.11400    1.24946  -5.694 1.24e-08 ***
## genderMale                            -0.17715    0.10670  -1.660 0.096873 .  
## SeniorCitizen                          0.12335    0.13868   0.889 0.373768    
## PartnerYes                            -0.08109    0.13030  -0.622 0.533705    
## DependentsYes                          0.14794    0.15641   0.946 0.344238    
## tenure                                -1.50475    0.16288  -9.238  < 2e-16 ***
## MultipleLines                          0.73463    0.18420   3.988 6.66e-05 ***
## InternetService                        2.70521    0.58818   4.599 4.24e-06 ***
## OnlineSecurity                        -0.10404    0.17712  -0.587 0.556954    
## OnlineBackup                           0.14580    0.15953   0.914 0.360755    
## DeviceProtection                       0.26192    0.16012   1.636 0.101885    
## TechSupport                            0.12206    0.18034   0.677 0.498506    
## StreamingTV                            0.94402    0.23819   3.963 7.39e-05 ***
## StreamingMovies                        0.86465    0.23657   3.655 0.000257 ***
## ContractOne year                      -0.93099    0.25948  -3.588 0.000333 ***
## ContractTwo year                     -15.23187  385.17643  -0.040 0.968456    
## PaperlessBillingYes                    0.31599    0.12149   2.601 0.009296 ** 
## PaymentMethodCredit card (automatic)  -0.09314    0.20401  -0.457 0.648010    
## PaymentMethodElectronic check          0.18900    0.16085   1.175 0.240002    
## PaymentMethodMailed check             -0.08015    0.19358  -0.414 0.678837    
## MonthlyCharges                        -2.00655    0.65667  -3.056 0.002246 ** 
## TelecomServices                             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2560.5  on 1847  degrees of freedom
## Residual deviance: 2090.3  on 1827  degrees of freedom
## AIC: 2132.3
## 
## Number of Fisher Scoring iterations: 15
summary(model_fifty2)
## 
## Call:
## glm(formula = Churn ~ gender + tenure + MultipleLines + InternetService + 
##     DeviceProtection + StreamingTV + StreamingMovies + Contract + 
##     PaperlessBilling + MonthlyCharges, family = binomial, data = fifty_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9725  -0.9636  -0.0004   0.9062   2.6389  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -6.7753     0.6541 -10.358  < 2e-16 ***
## genderMale           -0.1827     0.1059  -1.725 0.084527 .  
## tenure               -1.5174     0.1551  -9.782  < 2e-16 ***
## MultipleLines         0.6945     0.1314   5.286 1.25e-07 ***
## InternetService       2.5905     0.3279   7.899 2.81e-15 ***
## DeviceProtection      0.2247     0.1379   1.629 0.103238    
## StreamingTV           0.9013     0.1671   5.393 6.93e-08 ***
## StreamingMovies       0.8235     0.1671   4.927 8.33e-07 ***
## ContractOne year     -0.9748     0.2567  -3.797 0.000147 ***
## ContractTwo year    -15.2230   384.5038  -0.040 0.968419    
## PaperlessBillingYes   0.3414     0.1203   2.838 0.004539 ** 
## MonthlyCharges       -1.8063     0.3895  -4.638 3.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2560.5  on 1847  degrees of freedom
## Residual deviance: 2099.1  on 1836  degrees of freedom
## AIC: 2123.1
## 
## Number of Fisher Scoring iterations: 15

I built 2 models; the stepAIC() function from the “MASS” package was used for the second one. The AIC is smaller for the second model (2123.1 < 2132.3), so the most suitable formula is:

Churn ~ gender + tenure + MultipleLines + InternetService + DeviceProtection + StreamingTV + StreamingMovies + Contract + PaperlessBilling + MonthlyCharges

Next, I test how the models behave:

test50 <- predict(model_fifty, fifty_test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
prediction50 <- factor(ifelse(test50 > 0.5, "Yes", "No"))
confusion50 <- confusionMatrix(prediction50, fifty_test$Churn, mode = "prec_recall")
confusion50
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  162  60
##        Yes  76 165
##                                           
##                Accuracy : 0.7063          
##                  95% CI : (0.6625, 0.7474)
##     No Information Rate : 0.514           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4132          
##                                           
##  Mcnemar's Test P-Value : 0.1984          
##                                           
##               Precision : 0.7297          
##                  Recall : 0.6807          
##                      F1 : 0.7043          
##              Prevalence : 0.5140          
##          Detection Rate : 0.3499          
##    Detection Prevalence : 0.4795          
##       Balanced Accuracy : 0.7070          
##                                           
##        'Positive' Class : No              
## 
test50 <- predict(model_fifty2, fifty_test, type = "response")
prediction50 <- factor(ifelse(test50 > 0.5, "Yes", "No"))
confusion50 <- confusionMatrix(prediction50, fifty_test$Churn, mode = "prec_recall")
confusion50
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  160  65
##        Yes  78 160
##                                          
##                Accuracy : 0.6911         
##                  95% CI : (0.6469, 0.733)
##     No Information Rate : 0.514          
##     P-Value [Acc > NIR] : 7.625e-15      
##                                          
##                   Kappa : 0.3828         
##                                          
##  Mcnemar's Test P-Value : 0.3156         
##                                          
##               Precision : 0.7111         
##                  Recall : 0.6723         
##                      F1 : 0.6911         
##              Prevalence : 0.5140         
##          Detection Rate : 0.3456         
##    Detection Prevalence : 0.4860         
##       Balanced Accuracy : 0.6917         
##                                          
##        'Positive' Class : No             
## 

Surprisingly, the accuracy of both models is only about 0.70, meaning they misclassify roughly 30% of the test data. I use the second model further, as it has the lower AIC and a more compact formula. Its most significant variables (the coefficients are log-odds; recall that tenure and MonthlyCharges were standardized) are:

  1. tenure: each additional standard deviation of tenure lowers the log-odds of churn by about 1.52. In other words, customers who have already stayed longer are much more likely to stay in the future.
  2. MultipleLines: each step up this recoded variable (no phone service -> one line -> multiple lines) raises the log-odds of churn by about 0.69.
  3. InternetService: each step up (none -> DSL -> fiber optic) raises the log-odds of churn by about 2.59, which is a lot. In practice, it is the fiber-optic customers in this cluster who churn far more often than the DSL ones.
  4. StreamingTV: using this service raises the log-odds of churn by about 0.90, so those who stream TV are more likely to leave.
  5. StreamingMovies: a similar story, with a coefficient of about 0.82. It is interesting that the phenomenon of streaming is so important.
  6. Contract (one year): customers with a one-year contract are less likely to churn (about -0.97 on the log-odds scale). The two-year coefficient is a huge negative number with an enormous standard error, which points to (quasi-)separation: almost nobody on a two-year contract in this cluster churns, so that estimate is not reliable.
  7. MonthlyCharges: each additional standard deviation of monthly charges lowers the log-odds of churn by about 1.81 once the service variables are accounted for.

These variables are the most powerful churn predictors in this cluster, judging by their small p-values; the sketch below converts the log-odds into odds ratios, which are easier to communicate.
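
Since log-odds are hard to read, a small hedged helper (not part of the original analysis) turns the stepwise model's coefficients into odds ratios with Wald confidence intervals; values above 1 raise the odds of churn, values below 1 lower them:

# Odds ratios and 95% Wald intervals for the stepwise model (illustrative only)
exp(cbind(odds_ratio = coef(model_fifty2), confint.default(model_fifty2)))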

“20% leavers and small” & model

All the operations are similar here; all we need to change are the names & interpretations.

model_twenty <- glm(Churn~., family = binomial, twenty_train)
model_twenty2 <- stepAIC(model_twenty, trace = 0)
summary(model_twenty)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = twenty_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5530  -0.6233  -0.3609  -0.1734   3.0006  
## 
## Coefficients: (1 not defined because of singularities)
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -2.80926    3.31662  -0.847 0.396981    
## genderMale                            0.24704    0.16394   1.507 0.131839    
## SeniorCitizen                        -0.06910    0.18819  -0.367 0.713477    
## PartnerYes                           -0.01697    0.18591  -0.091 0.927255    
## DependentsYes                        -0.15177    0.20987  -0.723 0.469580    
## tenure                               -0.58900    0.16763  -3.514 0.000442 ***
## MultipleLines                         0.42285    0.41127   1.028 0.303882    
## InternetService                       0.23512    1.61623   0.145 0.884335    
## OnlineSecurity                       -0.25074    0.36735  -0.683 0.494880    
## OnlineBackup                         -0.37261    0.35839  -1.040 0.298492    
## DeviceProtection                     -0.15829    0.36380  -0.435 0.663487    
## TechSupport                          -0.59279    0.36699  -1.615 0.106253    
## StreamingTV                           0.69664    0.71058   0.980 0.326901    
## StreamingMovies                       0.06774    0.66956   0.101 0.919411    
## ContractOne year                     -0.69713    0.20055  -3.476 0.000509 ***
## ContractTwo year                     -1.48279    0.29992  -4.944 7.66e-07 ***
## PaperlessBillingYes                   0.18611    0.22466   0.828 0.407447    
## PaymentMethodCredit card (automatic) -0.19083    0.23901  -0.798 0.424629    
## PaymentMethodElectronic check         0.20834    0.20111   1.036 0.300216    
## PaymentMethodMailed check            -0.71467    0.47507  -1.504 0.132493    
## MonthlyCharges                        0.96181    1.89813   0.507 0.612356    
## TelecomServices                            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1211.1  on 1222  degrees of freedom
## Residual deviance:  954.7  on 1202  degrees of freedom
## AIC: 996.7
## 
## Number of Fisher Scoring iterations: 6
summary(model_twenty2)
## 
## Call:
## glm(formula = Churn ~ gender + tenure + MultipleLines + OnlineSecurity + 
##     OnlineBackup + TechSupport + StreamingTV + Contract + PaymentMethod + 
##     MonthlyCharges, family = binomial, data = twenty_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6567  -0.6307  -0.3590  -0.1899   2.9278  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -2.3726     0.6693  -3.545 0.000393 ***
## genderMale                             0.2451     0.1634   1.500 0.133676    
## tenure                                -0.5867     0.1625  -3.610 0.000306 ***
## MultipleLines                          0.3830     0.2592   1.478 0.139528    
## OnlineSecurity                        -0.2845     0.1837  -1.548 0.121508    
## OnlineBackup                          -0.3815     0.1790  -2.131 0.033080 *  
## TechSupport                           -0.6247     0.1817  -3.437 0.000587 ***
## StreamingTV                            0.6535     0.3273   1.997 0.045876 *  
## ContractOne year                      -0.7343     0.1959  -3.749 0.000177 ***
## ContractTwo year                      -1.5766     0.2859  -5.514 3.51e-08 ***
## PaymentMethodCredit card (automatic)  -0.1756     0.2380  -0.738 0.460680    
## PaymentMethodElectronic check          0.2261     0.2000   1.131 0.258213    
## PaymentMethodMailed check             -0.7092     0.4722  -1.502 0.133163    
## MonthlyCharges                         1.1373     0.4025   2.825 0.004724 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1211.12  on 1222  degrees of freedom
## Residual deviance:  957.16  on 1209  degrees of freedom
## AIC: 985.16
## 
## Number of Fisher Scoring iterations: 6

The AIC of the second model is again a bit lower (985.2 < 996.7), so the formula comes from the second model once more:

Churn ~ gender + tenure + MultipleLines + OnlineSecurity + OnlineBackup + TechSupport + StreamingTV + Contract + PaymentMethod + MonthlyCharges

To tell the truth, I was surprised: some variables that were not relevant for the previous cluster appear in this formula. Let’s test the models now:

test20 <- predict(model_twenty, twenty_test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
prediction20 <- factor(ifelse(test20 > 0.5, "Yes", "No"))
confusion20 <- confusionMatrix(prediction20, twenty_test$Churn, mode = "prec_recall")
confusion20
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  227  46
##        Yes  19  15
##                                           
##                Accuracy : 0.7883          
##                  95% CI : (0.7383, 0.8326)
##     No Information Rate : 0.8013          
##     P-Value [Acc > NIR] : 0.74280         
##                                           
##                   Kappa : 0.2023          
##                                           
##  Mcnemar's Test P-Value : 0.00126         
##                                           
##               Precision : 0.8315          
##                  Recall : 0.9228          
##                      F1 : 0.8748          
##              Prevalence : 0.8013          
##          Detection Rate : 0.7394          
##    Detection Prevalence : 0.8893          
##       Balanced Accuracy : 0.5843          
##                                           
##        'Positive' Class : No              
## 
test20 <- predict(model_twenty2, twenty_test, type = "response")
prediction20 <- factor(ifelse(test20 > 0.5, "Yes", "No"))
confusion20 <- confusionMatrix(prediction20, twenty_test$Churn, mode = "prec_recall")
confusion20
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  224  45
##        Yes  22  16
##                                           
##                Accuracy : 0.7818          
##                  95% CI : (0.7313, 0.8267)
##     No Information Rate : 0.8013          
##     P-Value [Acc > NIR] : 0.824296        
##                                           
##                   Kappa : 0.2014          
##                                           
##  Mcnemar's Test P-Value : 0.007194        
##                                           
##               Precision : 0.8327          
##                  Recall : 0.9106          
##                      F1 : 0.8699          
##              Prevalence : 0.8013          
##          Detection Rate : 0.7296          
##    Detection Prevalence : 0.8762          
##       Balanced Accuracy : 0.5864          
##                                           
##        'Positive' Class : No              
## 

Hooray! The accuracy of each model is about 0.78, which is at least better than what I got for the previous cluster, although it does not beat the no-information rate of 0.80: this cluster contains only about 20% churners, so always predicting “No” would do roughly as well (a cutoff check follows the variable list below). The first model is marginally more accurate on the test set, but the second has the lower AIC and a simpler formula, so I use it for the variable selection:

  1. tenure: each additional standard deviation of tenure lowers the log-odds of churn by about 0.59 - a weaker effect than in the first cluster.
  2. OnlineSecurity and OnlineBackup: using these services lowers the log-odds of churn (by about 0.28 and 0.38 respectively), making customers more “friendly” to the company, although only OnlineBackup is significant at the 0.05 level.
  3. TechSupport: similarly, its usage lowers the log-odds of churn by about 0.62.
  4. StreamingTV: streaming TV raises the log-odds of churn by about 0.65, echoing the pattern from the first cluster.
  5. Contract (one year / two year): one-year contracts lower the log-odds of churn by about 0.73 and two-year contracts by about 1.58; notably, the two-year term showed no reliable effect in the previous cluster.
  6. MonthlyCharges: each additional standard deviation of monthly charges raises the log-odds of churn by about 1.14 - the opposite of the pattern in the previous cluster.

As before, these variables are the strongest churn predictors in this cluster, judging by their p-values.
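
One caveat worth checking (my addition, not in the original analysis): with only about 20% churners in this cluster, a 0.5 cutoff almost never predicts “Yes”, which is why the recall for churners is so low. Lowering the cutoff trades some precision for a better chance of catching churners; a quick sketch with an arbitrary 0.3 threshold:

# Re-evaluate the stepwise model with a lower, arbitrary cutoff and "Yes" as the positive class
prediction20_low <- factor(ifelse(test20 > 0.3, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(prediction20_low, twenty_test$Churn, positive = "Yes", mode = "prec_recall")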

4. Churn-prevention recommendations

The general advice for both clusters is to motivate customers to stay longer. This should be the key idea behind all the suggestions we make: the longer people stay with us, the less likely they are to leave. Judging by my own user experience, this rule is fairly universal - we do not like to change platforms we have been using for a long time. In terms of the variables, this advice rests on the significance of tenure and Contract, and the contract terms in fact show the same pattern as tenure.

Further, I want to speak separately about the first cluster, “50% leavers and huge”. Streaming services increase churn in this group, so we need to modify these options somehow. It might be done quantitatively (by increasing the number of shows or films we offer) or qualitatively (by offering original, high-quality content). Either way, such changes imply unpredictable investments - so, if the company does not aim to compete with Netflix, they may not be necessary. Next, we could more often advise customers against fiber optic, since it strongly increases their likelihood of churning; let them choose DSL more frequently or skip our Internet services altogether. The same goes for multiple lines: fewer is better unless they are truly needed. Finally, a distinctive feature of this group is that its members are more likely to stay when they pay more. This insight is hard to interpret, but it is good to know: perhaps we could offer them a set of higher-priced options - just offer - and thereby “capture” those who respond.

A different picture comes from the second cluster, “20% leavers and small”. The members of this cluster do not like to pay more - in contrast with the group we have just analyzed. On the other hand, they value comfort (if I may call it that): they appreciate tech support, online backup, and online security. Probably these services are simply good and encourage these customers to stay longer. Either way, we could offer them discounts on these options.

5. Final comments

This paper aims to present some practical advice based on customer clusterization & churn prediction. It is clear from the results that segmenting customers is a worthwhile practice, as it allows us to be “smart” in our decision-making. So to say, it prevents bad decisions, such as offering something that one group likes and another does not. The importance of the clusters also emerges when we speak about the reasons behind churn: different groups leave the company for different reasons and need different approaches to churn prevention.

I hope it was not too boring to read through. Thank you for your attention!