Firstly, I upload all the libraries I need for this project (I will specify the particular packages if needed):
library(broom)
library(plyr)
library(dplyr)
library(randomForest)
library(rpart)
library(rpart.plot)
library(caret)
library(ggplot2)
library(gridExtra)
library(readr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.0.3
library(dendextend)
## Warning: package 'dendextend' was built under R version 4.0.3
library(MASS)
Than, I load the actual data. It includes 21 variables and 7043 obvservations in general:
data <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
data = na.omit(data)
summary(data)
## customerID gender SeniorCitizen Partner
## Length:7032 Length:7032 Min. :0.0000 Length:7032
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1624
## 3rd Qu.:0.0000
## Max. :1.0000
## Dependents tenure PhoneService MultipleLines
## Length:7032 Min. : 1.00 Length:7032 Length:7032
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.42
## 3rd Qu.:55.00
## Max. :72.00
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7032 Length:7032 Length:7032 Length:7032
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7032 Length:7032 Length:7032 Length:7032
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7032 Length:7032 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.59 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.80 Mean :2283.3
## 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## Churn
## Length:7032
## Class :character
## Mode :character
##
##
##
As our task is to examine the churn and suggest some ideas on its reduction, I explored the problem visually as a beginning:
geom_text(aes(label=len), vjust=1.6, color=“white”, size=3.5)
data <- as.data.frame(data)
ggplot(data, aes(x = Churn)) +
geom_bar(fill = "#00bfff") +
ggtitle("Churned and Retained Customers") +
xlab("Churn?") +
ylab("Number of Observations") +
theme_classic()
count(data %>% filter(data$Churn == "Yes")) / nrow(data)
## n
## 1 0.265785
A bit more than a quarter of all our customers leave - approximately 25.5%! Further, I explore the correlations between “Churn” and the other variables: the easiest way to do that is to built a correlation matrix with all the variables and inspect the rows we are interested in (“corrplot” package is used here). Also, I made all the variables numeric for this operation:
cordata <- data %>% mutate_if(is.character, as.factor) %>% mutate_if(is.factor, as.numeric)
corrplot(cor(cordata), method = "square")
It seems that the demographic variables are not important at all, while the most influensive are (1) “tenure” and (2) “Contract”:
paste("correlation between Churn and tenure =", cor(cordata$Churn, cordata$tenure))
## [1] "correlation between Churn and tenure = -0.354049358953251"
paste("correlation between Churn and Contract =", cor(cordata$Churn, cordata$Contract))
## [1] "correlation between Churn and Contract = -0.396149532993654"
Further, I remove the redundant variables: (1) “customerID” as it just defines each customer and is not helpful in the analysis at all, (2) “PhoneSevice” as it is already included in the “MultipleLines”, and (3) “TotalCharges” as it is the outcome of multiplication between “MonthlyCharges” and “tenure”.
data1 <- data %>% dplyr::select(!"customerID" & !"PhoneService" & !"TotalCharges")
Now I recode the service-used variables (multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies) into the numeric 0-1. I perform it manually, so to say, but it works pretty well. After doing this operations, I also create a new variable called “TelecomService” that summarizes all the services a customer uses:
data1$MultipleLines <- as.factor(data1$MultipleLines)
data1$InternetService <- as.factor(data1$InternetService)
data1$OnlineSecurity <- as.factor(data1$OnlineSecurity)
data1$OnlineBackup <- as.factor(data1$OnlineBackup)
data1$DeviceProtection <- as.factor(data1$DeviceProtection)
data1$TechSupport <- as.factor(data1$TechSupport)
data1$StreamingTV <- as.factor(data1$StreamingTV)
data1$StreamingMovies <- as.factor(data1$StreamingMovies)
data1$MultipleLines <- revalue(data1$MultipleLines, c("No phone service"=0, "No" = 1, "Yes"=2))
data1$InternetService <- revalue(data1$InternetService, c("No"=0, "Fiber optic" = 2, "DSL"=1))
data1$OnlineSecurity <- revalue(data1$OnlineSecurity, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$OnlineBackup <- revalue(data1$OnlineBackup, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$DeviceProtection <- revalue(data1$DeviceProtection, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$TechSupport <- revalue(data1$TechSupport, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$StreamingTV <- revalue(data1$StreamingTV, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$StreamingMovies <- revalue(data1$StreamingMovies, c("No internet service"="0", "No" = "0", "Yes"="1"))
data1$MultipleLines = as.numeric(as.character(data1$MultipleLines))
data1$InternetService = as.numeric(as.character(data1$InternetService))
data1$OnlineSecurity = as.numeric(as.character(data1$OnlineSecurity))
data1$OnlineBackup = as.numeric(as.character(data1$OnlineBackup))
data1$DeviceProtection = as.numeric(as.character(data1$DeviceProtection))
data1$TechSupport = as.numeric(as.character(data1$TechSupport))
data1$StreamingTV = as.numeric(as.character(data1$StreamingTV))
data1$StreamingMovies = as.numeric(as.character(data1$StreamingMovies))
data1$TelecomServices <- data1$MultipleLines + data1$InternetService + data1$OnlineSecurity + data1$OnlineBackup + data1$DeviceProtection + data1$TechSupport + data1$StreamingTV + data1$StreamingMovies
Finally, I standardize 2 continuous variables, “tenure”, “MontlhyCharges”, and “TelecomServices”:
data1$tenure <- scale(data1$tenure, center = TRUE, scale = TRUE)
data1$MonthlyCharges <- scale(data1$MonthlyCharges, center = TRUE, scale = TRUE)
data1$TelecomServices <- scale(data1$TelecomServices, center = TRUE, scale = TRUE)
The 3 variables I normalized are used here for the clusterization. To define the optimal number of clusters, I use the fancy fviz_nbclust function from the “factoextra” package:
clustdata <- data1 %>% dplyr::select("tenure", "MonthlyCharges", "TelecomServices")
fviz_nbclust(clustdata, kmeans, method = "silhouette")
Also, I want to briefly look at the results of hierarchical clustering done with the “complete” and “average” methods. The respective dendrograms are shown below (to note, the “dendextend” package was used to bring color):
dist <- dist(clustdata, method = "euclidean")
hc_complete <- hclust(dist, method = "complete")
hc_average <- hclust(dist, method = "average")
plot(color_branches((as.dendrogram(hc_complete)), h = 4)) + title(main = "complete linkage")
## integer(0)
plot(color_branches((as.dendrogram(hc_average)), h = 2)) + title(main = "average linkage")
## integer(0)
The values of height were chosen visually thus creating 3 clusters what is different from the kmeans result. So, to determine what number I need, I present the tables containing the numbers of observations in each cluster below:
complete3 <- cutree(hc_complete, 3)
average3 <- cutree(hc_average, 3)
table(complete3, average3)
## average3
## complete3 1 2 3
## 1 1714 189 147
## 2 0 2538 742
## 3 0 703 999
complete4 <- cutree(hc_complete, 4)
average4 <- cutree(hc_average, 4)
table(complete4, average4)
## average4
## complete4 1 2 3 4
## 1 1126 189 147 588
## 2 0 2379 43 0
## 3 0 703 999 0
## 4 0 159 699 0
It may be seen as not that advanced heuristic but the number of zeros in the second table is higher - there are more “matches” between the methods, so to say. Due to that, I divide the data into 4 clusters:
clusters <- kmeans(clustdata, centers = 4)
data1$cluster <- factor(clusters$cluster)
Let’s now explore each cluster separately:
data1 %>% dplyr::select("tenure", "MonthlyCharges", "TelecomServices", "cluster") %>% group_by(cluster) %>% summarize_if(is.numeric, mean)
## # A tibble: 4 x 4
## cluster tenure MonthlyCharges TelecomServices
## <fct> <dbl> <dbl> <dbl>
## 1 1 -0.836 0.362 0.0813
## 2 2 1.00 1.16 1.33
## 3 3 0.833 0.0888 0.284
## 4 4 -0.246 -1.30 -1.21
Befor some interpretations, I want to put some illustrations as well:
ggplot(data1, aes(x = cluster)) + geom_bar(fill = "#00bfff") +
xlab("clusters") +
ggtitle("What is the clusters size?") +
ylab("number of consumers") +
theme_classic()
ggplot(data1, aes(fill=Churn, x = cluster)) +
geom_bar(position="dodge")+
ggtitle("How churn is distributed among clusters")+
xlab("Churn?")+
ylab("Number of consumers") +
theme_classic()
The dramatic fact I notice in the table content is the dramatic (!) distinctiveness of the first cluster: its members pay relatively more money than the others, although they have relatively low number of services. The graph about size also shows that this cluster is the biggest one - the problem seems to be serious. Why is it so different from the 4th cluster, the one that is the second on the parameter of size but with a low churn rate? That is a question.
paste("the percent of churn in the 1st cluster is", count(data1 %>% filter(cluster == "1") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "1")))
## [1] "the percent of churn in the 1st cluster is 0.486369536996971"
paste("the percent of churn in the 2nd cluster is", count(data1 %>% filter(cluster == "2") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "2")))
## [1] "the percent of churn in the 2nd cluster is 0.196732026143791"
paste("the percent of churn in the 3rd cluster is", count(data1 %>% filter(cluster == "3") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "3")))
## [1] "the percent of churn in the 3rd cluster is 0.121239744758432"
paste("the percent of churn in the 4th cluster is", count(data1 %>% filter(cluster == "4") %>% filter(Churn == "Yes")) / count(data1 %>% filter(cluster == "4")))
## [1] "the percent of churn in the 4th cluster is 0.148519579751671"
The percentages calculated above show that the fourth cluster is “worse” than the third one - but nevertheless, it is still “better” than the main source of the company’s financial losses. To make the distinctions between clusters more tangible, I rename them:
data1$cluster <- as.factor(data1$cluster)
data1$cluster <- revalue(data1$cluster, c("1" = "50% leavers and huge", "2" = "20% leavers and small", "3" = "12% leavers and small", "4" = "15% leavers and huge"))
Short explanation about the new names: I guess that the size & number of leavers are the most important cluster characteristics because they both may give an impression of what is going on in each group and where the policy should be changed.
In this section, I try to create & train the model of logistic regression for the clusters “50% leavers and huge” and “20% leavers and small”. I also turn the “Churn” variable into a factor one and split the data sets to get the “train” and “test” parts:
fifty_leavers <- data1 %>% filter(cluster == "50% leavers and huge") %>% dplyr::select(!"cluster")
twenty_leavers <- data1 %>% filter(cluster == "20% leavers and small") %>% dplyr::select(!"cluster")
fifty_leavers$Churn <- as.factor(fifty_leavers$Churn)
twenty_leavers$Churn <- as.factor(twenty_leavers$Churn)
test_churn = createDataPartition(fifty_leavers$Churn, p=0.2, list = FALSE)
fifty_test = fifty_leavers[test_churn,]
fifty_train = fifty_leavers[-test_churn,]
test_churn = createDataPartition(twenty_leavers$Churn, p=0.2, list = FALSE)
twenty_test = twenty_leavers[test_churn,]
twenty_train = twenty_leavers[-test_churn,]
model_fifty <- glm(Churn~., family = binomial, fifty_train)
model_fifty2 <- stepAIC(model_fifty, trace = 0)
summary(model_fifty)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = fifty_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.99356 -0.95116 -0.00039 0.90968 2.65487
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.11400 1.24946 -5.694 1.24e-08 ***
## genderMale -0.17715 0.10670 -1.660 0.096873 .
## SeniorCitizen 0.12335 0.13868 0.889 0.373768
## PartnerYes -0.08109 0.13030 -0.622 0.533705
## DependentsYes 0.14794 0.15641 0.946 0.344238
## tenure -1.50475 0.16288 -9.238 < 2e-16 ***
## MultipleLines 0.73463 0.18420 3.988 6.66e-05 ***
## InternetService 2.70521 0.58818 4.599 4.24e-06 ***
## OnlineSecurity -0.10404 0.17712 -0.587 0.556954
## OnlineBackup 0.14580 0.15953 0.914 0.360755
## DeviceProtection 0.26192 0.16012 1.636 0.101885
## TechSupport 0.12206 0.18034 0.677 0.498506
## StreamingTV 0.94402 0.23819 3.963 7.39e-05 ***
## StreamingMovies 0.86465 0.23657 3.655 0.000257 ***
## ContractOne year -0.93099 0.25948 -3.588 0.000333 ***
## ContractTwo year -15.23187 385.17643 -0.040 0.968456
## PaperlessBillingYes 0.31599 0.12149 2.601 0.009296 **
## PaymentMethodCredit card (automatic) -0.09314 0.20401 -0.457 0.648010
## PaymentMethodElectronic check 0.18900 0.16085 1.175 0.240002
## PaymentMethodMailed check -0.08015 0.19358 -0.414 0.678837
## MonthlyCharges -2.00655 0.65667 -3.056 0.002246 **
## TelecomServices NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2560.5 on 1847 degrees of freedom
## Residual deviance: 2090.3 on 1827 degrees of freedom
## AIC: 2132.3
##
## Number of Fisher Scoring iterations: 15
summary(model_fifty2)
##
## Call:
## glm(formula = Churn ~ gender + tenure + MultipleLines + InternetService +
## DeviceProtection + StreamingTV + StreamingMovies + Contract +
## PaperlessBilling + MonthlyCharges, family = binomial, data = fifty_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9725 -0.9636 -0.0004 0.9062 2.6389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.7753 0.6541 -10.358 < 2e-16 ***
## genderMale -0.1827 0.1059 -1.725 0.084527 .
## tenure -1.5174 0.1551 -9.782 < 2e-16 ***
## MultipleLines 0.6945 0.1314 5.286 1.25e-07 ***
## InternetService 2.5905 0.3279 7.899 2.81e-15 ***
## DeviceProtection 0.2247 0.1379 1.629 0.103238
## StreamingTV 0.9013 0.1671 5.393 6.93e-08 ***
## StreamingMovies 0.8235 0.1671 4.927 8.33e-07 ***
## ContractOne year -0.9748 0.2567 -3.797 0.000147 ***
## ContractTwo year -15.2230 384.5038 -0.040 0.968419
## PaperlessBillingYes 0.3414 0.1203 2.838 0.004539 **
## MonthlyCharges -1.8063 0.3895 -4.638 3.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2560.5 on 1847 degrees of freedom
## Residual deviance: 2099.1 on 1836 degrees of freedom
## AIC: 2123.1
##
## Number of Fisher Scoring iterations: 15
I built 2 models; the function stepAIC() from the “MASS” package was used for the second one. The AIC value is smaller for the second model (2094.5 < 2103.3), so the most suitable formula is:
Churn ~ gender + tenure + MultipleLines + InternetService + DeviceProtection + StreamingTV + StreamingMovies + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges
Next, I test how the models behave:
test50 <- predict(model_fifty, fifty_test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
prediction50 <- factor(ifelse(test50 > 0.5, "Yes", "No"))
confusion50 <- confusionMatrix(prediction50, fifty_test$Churn, mode = "prec_recall")
confusion50
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 162 60
## Yes 76 165
##
## Accuracy : 0.7063
## 95% CI : (0.6625, 0.7474)
## No Information Rate : 0.514
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4132
##
## Mcnemar's Test P-Value : 0.1984
##
## Precision : 0.7297
## Recall : 0.6807
## F1 : 0.7043
## Prevalence : 0.5140
## Detection Rate : 0.3499
## Detection Prevalence : 0.4795
## Balanced Accuracy : 0.7070
##
## 'Positive' Class : No
##
test50 <- predict(model_fifty2, fifty_test, type = "response")
prediction50 <- factor(ifelse(test50 > 0.5, "Yes", "No"))
confusion50 <- confusionMatrix(prediction50, fifty_test$Churn, mode = "prec_recall")
confusion50
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 160 65
## Yes 78 160
##
## Accuracy : 0.6911
## 95% CI : (0.6469, 0.733)
## No Information Rate : 0.514
## P-Value [Acc > NIR] : 7.625e-15
##
## Kappa : 0.3828
##
## Mcnemar's Test P-Value : 0.3156
##
## Precision : 0.7111
## Recall : 0.6723
## F1 : 0.6911
## Prevalence : 0.5140
## Detection Rate : 0.3456
## Detection Prevalence : 0.4860
## Balanced Accuracy : 0.6917
##
## 'Positive' Class : No
##
Surprisingly, the accuracy of both models is about 0.67 meaning they fail to predict approximately 33% of the test data. I want to use the second model further; at least, it provides me with the most suitable formula earlier. Its most significant variables are:
These variables are the most powerful predictors of the churn due to their relatively small p-values.
All the operations are similar here; all we need to change are the names & interpretations.
model_twenty <- glm(Churn~., family = binomial, twenty_train)
model_twenty2 <- stepAIC(model_twenty, trace = 0)
summary(model_twenty)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = twenty_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5530 -0.6233 -0.3609 -0.1734 3.0006
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.80926 3.31662 -0.847 0.396981
## genderMale 0.24704 0.16394 1.507 0.131839
## SeniorCitizen -0.06910 0.18819 -0.367 0.713477
## PartnerYes -0.01697 0.18591 -0.091 0.927255
## DependentsYes -0.15177 0.20987 -0.723 0.469580
## tenure -0.58900 0.16763 -3.514 0.000442 ***
## MultipleLines 0.42285 0.41127 1.028 0.303882
## InternetService 0.23512 1.61623 0.145 0.884335
## OnlineSecurity -0.25074 0.36735 -0.683 0.494880
## OnlineBackup -0.37261 0.35839 -1.040 0.298492
## DeviceProtection -0.15829 0.36380 -0.435 0.663487
## TechSupport -0.59279 0.36699 -1.615 0.106253
## StreamingTV 0.69664 0.71058 0.980 0.326901
## StreamingMovies 0.06774 0.66956 0.101 0.919411
## ContractOne year -0.69713 0.20055 -3.476 0.000509 ***
## ContractTwo year -1.48279 0.29992 -4.944 7.66e-07 ***
## PaperlessBillingYes 0.18611 0.22466 0.828 0.407447
## PaymentMethodCredit card (automatic) -0.19083 0.23901 -0.798 0.424629
## PaymentMethodElectronic check 0.20834 0.20111 1.036 0.300216
## PaymentMethodMailed check -0.71467 0.47507 -1.504 0.132493
## MonthlyCharges 0.96181 1.89813 0.507 0.612356
## TelecomServices NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1211.1 on 1222 degrees of freedom
## Residual deviance: 954.7 on 1202 degrees of freedom
## AIC: 996.7
##
## Number of Fisher Scoring iterations: 6
summary(model_twenty2)
##
## Call:
## glm(formula = Churn ~ gender + tenure + MultipleLines + OnlineSecurity +
## OnlineBackup + TechSupport + StreamingTV + Contract + PaymentMethod +
## MonthlyCharges, family = binomial, data = twenty_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6567 -0.6307 -0.3590 -0.1899 2.9278
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3726 0.6693 -3.545 0.000393 ***
## genderMale 0.2451 0.1634 1.500 0.133676
## tenure -0.5867 0.1625 -3.610 0.000306 ***
## MultipleLines 0.3830 0.2592 1.478 0.139528
## OnlineSecurity -0.2845 0.1837 -1.548 0.121508
## OnlineBackup -0.3815 0.1790 -2.131 0.033080 *
## TechSupport -0.6247 0.1817 -3.437 0.000587 ***
## StreamingTV 0.6535 0.3273 1.997 0.045876 *
## ContractOne year -0.7343 0.1959 -3.749 0.000177 ***
## ContractTwo year -1.5766 0.2859 -5.514 3.51e-08 ***
## PaymentMethodCredit card (automatic) -0.1756 0.2380 -0.738 0.460680
## PaymentMethodElectronic check 0.2261 0.2000 1.131 0.258213
## PaymentMethodMailed check -0.7092 0.4722 -1.502 0.133163
## MonthlyCharges 1.1373 0.4025 2.825 0.004724 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1211.12 on 1222 degrees of freedom
## Residual deviance: 957.16 on 1209 degrees of freedom
## AIC: 985.16
##
## Number of Fisher Scoring iterations: 6
The AIC of the second model is a bit better: 1013.3 > 1026.1 Thus, our formula comes from the second model again:
Churn ~ gender + tenure + MultipleLines + OnlineSecurity + OnlineBackup + TechSupport + StreamingTV + Contract + MonthlyCharges
To tell the truth, I was surprised: some of the variables that were not relevant for the previous cluster appear in the formula. Let’s test models now:
test20 <- predict(model_twenty, twenty_test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
prediction20 <- factor(ifelse(test20 > 0.5, "Yes", "No"))
confusion20 <- confusionMatrix(prediction20, twenty_test$Churn, mode = "prec_recall")
confusion20
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 227 46
## Yes 19 15
##
## Accuracy : 0.7883
## 95% CI : (0.7383, 0.8326)
## No Information Rate : 0.8013
## P-Value [Acc > NIR] : 0.74280
##
## Kappa : 0.2023
##
## Mcnemar's Test P-Value : 0.00126
##
## Precision : 0.8315
## Recall : 0.9228
## F1 : 0.8748
## Prevalence : 0.8013
## Detection Rate : 0.7394
## Detection Prevalence : 0.8893
## Balanced Accuracy : 0.5843
##
## 'Positive' Class : No
##
test20 <- predict(model_twenty2, twenty_test, type = "response")
prediction20 <- factor(ifelse(test20 > 0.5, "Yes", "No"))
confusion20 <- confusionMatrix(prediction20, twenty_test$Churn, mode = "prec_recall")
confusion20
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 224 45
## Yes 22 16
##
## Accuracy : 0.7818
## 95% CI : (0.7313, 0.8267)
## No Information Rate : 0.8013
## P-Value [Acc > NIR] : 0.824296
##
## Kappa : 0.2014
##
## Mcnemar's Test P-Value : 0.007194
##
## Precision : 0.8327
## Recall : 0.9106
## F1 : 0.8699
## Prevalence : 0.8013
## Detection Rate : 0.7296
## Detection Prevalence : 0.8762
## Balanced Accuracy : 0.5864
##
## 'Positive' Class : No
##
Hooray! The accuracy of each model is around 0.81 what is cool (at least it is better than the accuracy of the models I constructed for another cluster). The second one is a bit more accurate, so I will use it for the variables selection:
Similarly, these variables are the most powerful predictors of the churn due to their relatively small p-values.
The general advice for both cluster policies is to motivate customers to stay longer. It should be the key idea behind all the suggestions we made: the longer people are in the less they are to leave. Referring to my own user experience, this rule is universal - we do not like to change the platforms if we have been using it for a long interval. In variable terms, this advice relies on the significance of tenure and contract - the second one in fact shows as the same pattern as the first.
Further, I want to speak separately about the first cluster, “50% leavers and huge”. Streaming services increase the churn from this group, so we need to modify this options somehow. It might be done quantitatively (by increasing the number of shows or films we offer) or qualitatively (by suggesting an original high-ranking content). Anyway, such changes imply an unpredictable investments - so, if the company does not aim to be an opponent for “Netflix”, it might be not necessary. Next, we may more often advice customers not to use fibre optic - it influences their potential churn significantly. Let them possess DSL more frequently or not use our Internet services at all. The same story is about multiple lines - it is better if not necessary! Finally, a distinctive parameter of this group is that they more likely to stay if they pay more. This insight is hard to interpret but it is good to know. Maybe, we may suggest them a set of options of high cost - just suggest - and we will thus “capture” them if they react.
Another picture comes from the second cluster, “20% leavers and small”. The members of this cluster do not like to pay more - in contrast with a group we have just analyzed. On the other hand, they value the comfort (if I call it right): they enjoy tech support and online security, Probably, these 2 services are simply great and thus stimulate these customers to stay longer or they really enjoy “comfort”. Anyway, we may provide them with some discounts for these 2 options.
This paper aims to present some practical advises based on the customers clusterization & churn prediction. It is clear from the results that segmenting customers is a wonderful practice as it allows as to be “smart” in terms of decision-making. So to say, it prevents our bad decisions, like offering something that somebody likes and somebody else does not like. Also, the importance of clusters emerges when we speak about the reasons behind the churn: different groups leave the company differently and need diverse approaches for churn prevention.
Hope it was not that boring to go through. Thank you for the attention!