The following data anylysis is conducted with aim to find out the reasons for churn and also to offer some data-based insights for churn prevention.
Let’s download our data and have a brief look on it. The variables which are the features for customers was summarised as the “Score” variable which has 0-10 points counting every feature customer is using.
data = read.csv("data.csv")
data = mutate(data, Score =
(MultipleLines =="No") +
2*(MultipleLines =="Yes") +
(InternetService == "DSL") +
2*(InternetService == "Fiber optic") +
(OnlineSecurity == "Yes")+
(OnlineBackup == "Yes")+
(DeviceProtection == "Yes")+
(TechSupport == "Yes")+
(StreamingMovies == "Yes")+
(StreamingTV == "Yes")
)
data = data %>% dplyr::select(-c(TotalCharges,customerID )) %>% na.omit()
data$SeniorCitizen <- as.factor(data$SeniorCitizen)
summary(data)
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:3488 0:5901 No :3641 No :4933 Min. : 0.00 No : 682
## Male :3555 1:1142 Yes:3402 Yes:2110 1st Qu.: 9.00 Yes:6361
## Median :29.00
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
## MultipleLines InternetService OnlineSecurity
## No :3390 DSL :2421 No :3498
## No phone service: 682 Fiber optic:3096 No internet service:1526
## Yes :2971 No :1526 Yes :2019
##
##
##
## OnlineBackup DeviceProtection
## No :3088 No :3095
## No internet service:1526 No internet service:1526
## Yes :2429 Yes :2422
##
##
##
## TechSupport StreamingTV
## No :3473 No :2810
## No internet service:1526 No internet service:1526
## Yes :2044 Yes :2707
##
##
##
## StreamingMovies Contract PaperlessBilling
## No :2785 Month-to-month:3875 No :2872
## No internet service:1526 One year :1473 Yes:4171
## Yes :2732 Two year :1695
##
##
##
## PaymentMethod MonthlyCharges Churn Score
## Bank transfer (automatic):1544 Min. : 18.25 No :5174 Min. : 1.000
## Credit card (automatic) :1522 1st Qu.: 35.50 Yes:1869 1st Qu.: 2.000
## Electronic check :2365 Median : 70.35 Median : 5.000
## Mailed check :1612 Mean : 64.76 Mean : 4.586
## 3rd Qu.: 89.85 3rd Qu.: 7.000
## Max. :118.75 Max. :10.000
First, let’s see how big is actual problem: Seems like one fourth of our customer churn.That’s pity! We need to see which variables correlate with churn. The variables with which we can build correlation matrix:
data$churn1 <- as.numeric(data$Churn)
m = cor(dplyr::select(data,churn1,Score,tenure,MonthlyCharges))
corrplot(m, method = "number", type = "upper")
Tenure and Monthly Charges have weak correlation, where as Score has almost no corellation with Churn.
The customer personal information variables: Only gender does not have any correlation with Churn! Payment and contract information:
Seems like electronic and paperless payments have some correlation with churn as well as contract legth
We take Tenure, MonthlyCharges and Score variables for segmenting our customers. Let’s find out opitmal number of clusters to be build using silhouette method:
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
| cluster | tenure | MonthlyCharges | Score | churn1 |
|---|---|---|---|---|
| 1 | 30.65263 | 23.92487 | 1.471579 | 0.1063158 |
| 2 | 19.84003 | 87.57246 | 6.040161 | 0.4732262 |
| 3 | 11.65551 | 60.42042 | 3.588557 | 0.4169203 |
| 4 | 60.29860 | 90.00765 | 7.269192 | 0.1370887 |
Some pattern stars to appear, but for full picture lets see how other variables in our data differs between our clusters: With all these tables and graphs we can summarize our customer clusters as: 1st cluster - Phone users with average tenure, low charges and using almost no features. Family guys in middle age probably with children. Long contracts and small churn.Stable customers.
2st cluster - “Netflix&Chill” guys Cinema and TV watching guys with fiber optic internet and month to month contract and quite small - tenure.Using a lot of services paying a lot they have rather huge churn.
3rd cluster - “Try out” guys: they use not much features by the average charges, The tenure is very small with tremendous churn
4th cluster - “take it all” guys: They are rather rich using all our features and being charged a lot. Usually sign long terms contracts and have small churn
Let’s assign our clusters to the new dataset.I dicided not to use score in Netflix cluster as to look at features provided, and for Try out a leave only Score variable
Netflix <- filter(data,cluster == "2")
Netflix <- dplyr::select(Netflix,-Score,-churn1,-cluster)
tryout <- filter(data,cluster == "3")
tryout <- dplyr::select(tryout,Score,Churn,Contract,MonthlyCharges,gender,SeniorCitizen,Partner,Dependents,tenure,PaperlessBilling,PaymentMethod)
For evaluting our model i will divide our datasets into training and test groups(8:2)
sample <- sample.int(n = nrow(tryout), size = floor(.8*nrow(tryout)), replace = F)
train <- tryout[sample, ]
test2 <- tryout[-sample, ]
sample <- sample.int(n = nrow(Netflix), size = floor(.8*nrow(Netflix)), replace = F)
train1 <- Netflix[sample, ]
test1 <- Netflix[-sample, ]
We build two models on our Netflix cluster: one with regular logistic regression and one with step wise AIC method:
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0388 -0.9832 -0.2407 0.9439 2.6553
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.351275 1.868009 1.794 0.072807 .
## genderMale -0.197862 0.134261 -1.474 0.140560
## SeniorCitizen1 0.204075 0.161841 1.261 0.207322
## PartnerYes 0.087430 0.150975 0.579 0.562521
## DependentsYes -0.284090 0.188571 -1.507 0.131929
## tenure -0.043293 0.006146 -7.044 1.86e-12 ***
## PhoneServiceYes 3.113319 1.628936 1.911 0.055971 .
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 1.270939 0.335008 3.794 0.000148 ***
## InternetServiceFiber optic 5.259192 1.527019 3.444 0.000573 ***
## OnlineSecurityYes 0.422019 0.346955 1.216 0.223852
## OnlineBackupYes 0.562488 0.326954 1.720 0.085361 .
## DeviceProtectionYes 0.866091 0.329797 2.626 0.008636 **
## TechSupportYes 0.601290 0.343150 1.752 0.079728 .
## StreamingTVYes 1.862543 0.612009 3.043 0.002340 **
## StreamingMoviesYes 1.985264 0.608490 3.263 0.001104 **
## ContractOne year -0.396616 0.247203 -1.604 0.108623
## ContractTwo year -2.226099 1.043932 -2.132 0.032972 *
## PaperlessBillingYes 0.423183 0.163321 2.591 0.009566 **
## PaymentMethodCredit card (automatic) 0.004909 0.239724 0.020 0.983663
## PaymentMethodElectronic check 0.305546 0.192216 1.590 0.111926
## PaymentMethodMailed check 0.285112 0.267643 1.065 0.286754
## MonthlyCharges -0.165282 0.059449 -2.780 0.005432 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1653.3 on 1194 degrees of freedom
## Residual deviance: 1339.2 on 1173 degrees of freedom
## AIC: 1383.2
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## glm(formula = Churn ~ gender + Dependents + tenure + MultipleLines +
## InternetService + DeviceProtection + StreamingTV + StreamingMovies +
## Contract + PaperlessBilling + MonthlyCharges, family = binomial,
## data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9274 -0.9853 -0.2473 0.9598 2.6048
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.650847 1.061307 2.498 0.01250 *
## genderMale -0.207121 0.132293 -1.566 0.11744
## DependentsYes -0.271249 0.170341 -1.592 0.11130
## tenure -0.042567 0.005888 -7.230 4.83e-13 ***
## MultipleLinesNo phone service -1.246813 1.148079 -1.086 0.27748
## MultipleLinesYes 0.784391 0.152789 5.134 2.84e-07 ***
## InternetServiceFiber optic 2.922001 0.447192 6.534 6.40e-11 ***
## DeviceProtectionYes 0.385344 0.158466 2.432 0.01503 *
## StreamingTVYes 0.960072 0.209305 4.587 4.50e-06 ***
## StreamingMoviesYes 1.067216 0.210141 5.079 3.80e-07 ***
## ContractOne year -0.468886 0.243228 -1.928 0.05388 .
## ContractTwo year -2.203821 1.040028 -2.119 0.03409 *
## PaperlessBillingYes 0.457148 0.161400 2.832 0.00462 **
## MonthlyCharges -0.072679 0.018014 -4.035 5.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1653.3 on 1194 degrees of freedom
## Residual deviance: 1349.4 on 1181 degrees of freedom
## AIC: 1377.4
##
## Number of Fisher Scoring iterations: 6
And now let’s compare this models and decide which will we choose:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 131 67
## Yes 27 74
##
## Accuracy : 0.6856
## 95% CI : (0.6297, 0.7378)
## No Information Rate : 0.5284
## P-Value [Acc > NIR] : 2.332e-08
##
## Kappa : 0.3594
##
## Mcnemar's Test P-Value : 5.757e-05
##
## Sensitivity : 0.8291
## Specificity : 0.5248
## Pos Pred Value : 0.6616
## Neg Pred Value : 0.7327
## Prevalence : 0.5284
## Detection Rate : 0.4381
## Detection Prevalence : 0.6622
## Balanced Accuracy : 0.6770
##
## 'Positive' Class : No
##
test <- predict(Model1aic, test1)
predicted <- factor(ifelse(test > 0.5,"Yes","No"))
caret::confusionMatrix(predicted, test1$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 130 73
## Yes 28 68
##
## Accuracy : 0.6622
## 95% CI : (0.6055, 0.7156)
## No Information Rate : 0.5284
## P-Value [Acc > NIR] : 1.871e-06
##
## Kappa : 0.3104
##
## Mcnemar's Test P-Value : 1.197e-05
##
## Sensitivity : 0.8228
## Specificity : 0.4823
## Pos Pred Value : 0.6404
## Neg Pred Value : 0.7083
## Prevalence : 0.5284
## Detection Rate : 0.4348
## Detection Prevalence : 0.6789
## Balanced Accuracy : 0.6525
##
## 'Positive' Class : No
##
We choose our first model as it has slightly more accuracy
The predictors mainly contributing to the churn: Postive effect: Tenure Negative effect: MultipleLinesYes, InternetServiceFiberoptic, PaperlessBillingYes, StreamingMoviesYes,StreamingTVYes Now let’s the exactly same procudure with Try out guys:
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8052 -0.9823 -0.3884 0.9973 2.8104
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.934674 0.390779 -2.392 0.016765 *
## Score -0.303957 0.094144 -3.229 0.001244 **
## ContractOne year -1.090138 0.343201 -3.176 0.001491 **
## ContractTwo year -14.874302 382.137242 -0.039 0.968951
## MonthlyCharges 0.029810 0.006174 4.828 1.38e-06 ***
## genderMale 0.072065 0.125548 0.574 0.565967
## SeniorCitizen1 0.372590 0.172586 2.159 0.030861 *
## PartnerYes -0.093329 0.161146 -0.579 0.562484
## DependentsYes 0.077993 0.182602 0.427 0.669292
## tenure -0.053568 0.007521 -7.122 1.06e-12 ***
## PaperlessBillingYes 0.519827 0.135435 3.838 0.000124 ***
## PaymentMethodCredit card (automatic) 0.122982 0.252405 0.487 0.626087
## PaymentMethodElectronic check 0.294963 0.199814 1.476 0.139895
## PaymentMethodMailed check -0.228209 0.221767 -1.029 0.303457
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1783.9 on 1313 degrees of freedom
## Residual deviance: 1486.6 on 1300 degrees of freedom
## AIC: 1514.6
##
## Number of Fisher Scoring iterations: 15
##
## Call:
## glm(formula = Churn ~ Score + Contract + MonthlyCharges + SeniorCitizen +
## tenure + PaperlessBilling + PaymentMethod, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8139 -0.9823 -0.3873 0.9988 2.8293
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.898299 0.380557 -2.360 0.018251 *
## Score -0.302136 0.094018 -3.214 0.001311 **
## ContractOne year -1.085172 0.342558 -3.168 0.001536 **
## ContractTwo year -14.850696 382.134660 -0.039 0.969000
## MonthlyCharges 0.029558 0.006160 4.798 1.60e-06 ***
## SeniorCitizen1 0.366399 0.170280 2.152 0.031417 *
## tenure -0.054284 0.007373 -7.362 1.81e-13 ***
## PaperlessBillingYes 0.518176 0.135367 3.828 0.000129 ***
## PaymentMethodCredit card (automatic) 0.133006 0.251967 0.528 0.597589
## PaymentMethodElectronic check 0.299045 0.199671 1.498 0.134214
## PaymentMethodMailed check -0.214051 0.220969 -0.969 0.332697
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1783.9 on 1313 degrees of freedom
## Residual deviance: 1487.3 on 1303 degrees of freedom
## AIC: 1509.3
##
## Number of Fisher Scoring iterations: 15
And now let’s compare this models and decide which will we choose:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 169 90
## Yes 21 49
##
## Accuracy : 0.6626
## 95% CI : (0.6087, 0.7136)
## No Information Rate : 0.5775
## P-Value [Acc > NIR] : 0.0009635
##
## Kappa : 0.2593
##
## Mcnemar's Test P-Value : 1.087e-10
##
## Sensitivity : 0.8895
## Specificity : 0.3525
## Pos Pred Value : 0.6525
## Neg Pred Value : 0.7000
## Prevalence : 0.5775
## Detection Rate : 0.5137
## Detection Prevalence : 0.7872
## Balanced Accuracy : 0.6210
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 172 88
## Yes 18 51
##
## Accuracy : 0.6778
## 95% CI : (0.6244, 0.728)
## No Information Rate : 0.5775
## P-Value [Acc > NIR] : 0.0001186
##
## Kappa : 0.2919
##
## Mcnemar's Test P-Value : 2.058e-11
##
## Sensitivity : 0.9053
## Specificity : 0.3669
## Pos Pred Value : 0.6615
## Neg Pred Value : 0.7391
## Prevalence : 0.5775
## Detection Rate : 0.5228
## Detection Prevalence : 0.7903
## Balanced Accuracy : 0.6361
##
## 'Positive' Class : No
##
We choose our first model as it has slightly more accuracy
The predictors mainly contributing to the churn: Postive effect: Score,ContractOne year,tenure
Negative effect: MonthlyCharges,PaperlessBillingYes
These are some recomendations to be done reducing churn for every cluster: Netflix guys: 1. Try to increase the tenure by stimulating customers to stay longer.
2.Reduce paperless billing by introducing oblige paper bill system
Try out guys:
1.Encorage people to sign longer contracts by inroducing bonuses and discounts
2.Reduce paperless billing by introducing oblige paper bill system