Intro

The following data anylysis is conducted with aim to find out the reasons for churn and also to offer some data-based insights for churn prevention.

Data

Let’s download our data and have a brief look on it. The variables which are the features for customers was summarised as the “Score” variable which has 0-10 points counting every feature customer is using.

data = read.csv("data.csv")
data = mutate(data, Score = 
                (MultipleLines =="No") + 
                2*(MultipleLines =="Yes") +
                (InternetService == "DSL") +
                2*(InternetService == "Fiber optic") +
                (OnlineSecurity == "Yes")+
                (OnlineBackup == "Yes")+
                (DeviceProtection == "Yes")+
                (TechSupport == "Yes")+
                (StreamingMovies == "Yes")+
                (StreamingTV == "Yes")
                )
data = data %>% dplyr::select(-c(TotalCharges,customerID )) %>% na.omit()
data$SeniorCitizen <- as.factor(data$SeniorCitizen)
summary(data)
##     gender     SeniorCitizen Partner    Dependents     tenure      PhoneService
##  Female:3488   0:5901        No :3641   No :4933   Min.   : 0.00   No : 682    
##  Male  :3555   1:1142        Yes:3402   Yes:2110   1st Qu.: 9.00   Yes:6361    
##                                                    Median :29.00               
##                                                    Mean   :32.37               
##                                                    3rd Qu.:55.00               
##                                                    Max.   :72.00               
##           MultipleLines     InternetService             OnlineSecurity
##  No              :3390   DSL        :2421   No                 :3498  
##  No phone service: 682   Fiber optic:3096   No internet service:1526  
##  Yes             :2971   No         :1526   Yes                :2019  
##                                                                       
##                                                                       
##                                                                       
##               OnlineBackup             DeviceProtection
##  No                 :3088   No                 :3095   
##  No internet service:1526   No internet service:1526   
##  Yes                :2429   Yes                :2422   
##                                                        
##                                                        
##                                                        
##               TechSupport                StreamingTV  
##  No                 :3473   No                 :2810  
##  No internet service:1526   No internet service:1526  
##  Yes                :2044   Yes                :2707  
##                                                       
##                                                       
##                                                       
##             StreamingMovies           Contract    PaperlessBilling
##  No                 :2785   Month-to-month:3875   No :2872        
##  No internet service:1526   One year      :1473   Yes:4171        
##  Yes                :2732   Two year      :1695                   
##                                                                   
##                                                                   
##                                                                   
##                    PaymentMethod  MonthlyCharges   Churn          Score       
##  Bank transfer (automatic):1544   Min.   : 18.25   No :5174   Min.   : 1.000  
##  Credit card (automatic)  :1522   1st Qu.: 35.50   Yes:1869   1st Qu.: 2.000  
##  Electronic check         :2365   Median : 70.35              Median : 5.000  
##  Mailed check             :1612   Mean   : 64.76              Mean   : 4.586  
##                                   3rd Qu.: 89.85              3rd Qu.: 7.000  
##                                   Max.   :118.75              Max.   :10.000

Churn and Its Correlations

First, let’s see how big is actual problem: Seems like one fourth of our customer churn.That’s pity! We need to see which variables correlate with churn. The variables with which we can build correlation matrix:

data$churn1 <- as.numeric(data$Churn)
m = cor(dplyr::select(data,churn1,Score,tenure,MonthlyCharges))
corrplot(m, method = "number", type = "upper")

Tenure and Monthly Charges have weak correlation, where as Score has almost no corellation with Churn.

The customer personal information variables: Only gender does not have any correlation with Churn! Payment and contract information: Seems like electronic and paperless payments have some correlation with churn as well as contract legth

Clustering

We take Tenure, MonthlyCharges and Score variables for segmenting our customers. Let’s find out opitmal number of clusters to be build using silhouette method:

## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4
Our number is 4, now let’s use it in our clustering and have a look on our clustering: Does not tell that much, lets see the table with variables we buld our clusters on:
cluster tenure MonthlyCharges Score churn1
1 30.65263 23.92487 1.471579 0.1063158
2 19.84003 87.57246 6.040161 0.4732262
3 11.65551 60.42042 3.588557 0.4169203
4 60.29860 90.00765 7.269192 0.1370887

Some pattern stars to appear, but for full picture lets see how other variables in our data differs between our clusters: With all these tables and graphs we can summarize our customer clusters as: 1st cluster - Phone users with average tenure, low charges and using almost no features. Family guys in middle age probably with children. Long contracts and small churn.Stable customers.

2st cluster - “Netflix&Chill” guys Cinema and TV watching guys with fiber optic internet and month to month contract and quite small - tenure.Using a lot of services paying a lot they have rather huge churn.

3rd cluster - “Try out” guys: they use not much features by the average charges, The tenure is very small with tremendous churn

4th cluster - “take it all” guys: They are rather rich using all our features and being charged a lot. Usually sign long terms contracts and have small churn

Predictions

Let’s assign our clusters to the new dataset.I dicided not to use score in Netflix cluster as to look at features provided, and for Try out a leave only Score variable

Netflix <- filter(data,cluster == "2")
Netflix <- dplyr::select(Netflix,-Score,-churn1,-cluster)
tryout <- filter(data,cluster == "3")
tryout <-  dplyr::select(tryout,Score,Churn,Contract,MonthlyCharges,gender,SeniorCitizen,Partner,Dependents,tenure,PaperlessBilling,PaymentMethod)

For evaluting our model i will divide our datasets into training and test groups(8:2)

sample <- sample.int(n = nrow(tryout), size = floor(.8*nrow(tryout)), replace = F)
train <- tryout[sample, ]
test2  <- tryout[-sample, ]
sample <- sample.int(n = nrow(Netflix), size = floor(.8*nrow(Netflix)), replace = F)
train1 <- Netflix[sample, ]
test1  <- Netflix[-sample, ]

We build two models on our Netflix cluster: one with regular logistic regression and one with step wise AIC method:

## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0388  -0.9832  -0.2407   0.9439   2.6553  
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           3.351275   1.868009   1.794 0.072807 .  
## genderMale                           -0.197862   0.134261  -1.474 0.140560    
## SeniorCitizen1                        0.204075   0.161841   1.261 0.207322    
## PartnerYes                            0.087430   0.150975   0.579 0.562521    
## DependentsYes                        -0.284090   0.188571  -1.507 0.131929    
## tenure                               -0.043293   0.006146  -7.044 1.86e-12 ***
## PhoneServiceYes                       3.113319   1.628936   1.911 0.055971 .  
## MultipleLinesNo phone service               NA         NA      NA       NA    
## MultipleLinesYes                      1.270939   0.335008   3.794 0.000148 ***
## InternetServiceFiber optic            5.259192   1.527019   3.444 0.000573 ***
## OnlineSecurityYes                     0.422019   0.346955   1.216 0.223852    
## OnlineBackupYes                       0.562488   0.326954   1.720 0.085361 .  
## DeviceProtectionYes                   0.866091   0.329797   2.626 0.008636 ** 
## TechSupportYes                        0.601290   0.343150   1.752 0.079728 .  
## StreamingTVYes                        1.862543   0.612009   3.043 0.002340 ** 
## StreamingMoviesYes                    1.985264   0.608490   3.263 0.001104 ** 
## ContractOne year                     -0.396616   0.247203  -1.604 0.108623    
## ContractTwo year                     -2.226099   1.043932  -2.132 0.032972 *  
## PaperlessBillingYes                   0.423183   0.163321   2.591 0.009566 ** 
## PaymentMethodCredit card (automatic)  0.004909   0.239724   0.020 0.983663    
## PaymentMethodElectronic check         0.305546   0.192216   1.590 0.111926    
## PaymentMethodMailed check             0.285112   0.267643   1.065 0.286754    
## MonthlyCharges                       -0.165282   0.059449  -2.780 0.005432 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1653.3  on 1194  degrees of freedom
## Residual deviance: 1339.2  on 1173  degrees of freedom
## AIC: 1383.2
## 
## Number of Fisher Scoring iterations: 6
## 
## Call:
## glm(formula = Churn ~ gender + Dependents + tenure + MultipleLines + 
##     InternetService + DeviceProtection + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + MonthlyCharges, family = binomial, 
##     data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9274  -0.9853  -0.2473   0.9598   2.6048  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    2.650847   1.061307   2.498  0.01250 *  
## genderMale                    -0.207121   0.132293  -1.566  0.11744    
## DependentsYes                 -0.271249   0.170341  -1.592  0.11130    
## tenure                        -0.042567   0.005888  -7.230 4.83e-13 ***
## MultipleLinesNo phone service -1.246813   1.148079  -1.086  0.27748    
## MultipleLinesYes               0.784391   0.152789   5.134 2.84e-07 ***
## InternetServiceFiber optic     2.922001   0.447192   6.534 6.40e-11 ***
## DeviceProtectionYes            0.385344   0.158466   2.432  0.01503 *  
## StreamingTVYes                 0.960072   0.209305   4.587 4.50e-06 ***
## StreamingMoviesYes             1.067216   0.210141   5.079 3.80e-07 ***
## ContractOne year              -0.468886   0.243228  -1.928  0.05388 .  
## ContractTwo year              -2.203821   1.040028  -2.119  0.03409 *  
## PaperlessBillingYes            0.457148   0.161400   2.832  0.00462 ** 
## MonthlyCharges                -0.072679   0.018014  -4.035 5.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1653.3  on 1194  degrees of freedom
## Residual deviance: 1349.4  on 1181  degrees of freedom
## AIC: 1377.4
## 
## Number of Fisher Scoring iterations: 6

And now let’s compare this models and decide which will we choose:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  131  67
##        Yes  27  74
##                                           
##                Accuracy : 0.6856          
##                  95% CI : (0.6297, 0.7378)
##     No Information Rate : 0.5284          
##     P-Value [Acc > NIR] : 2.332e-08       
##                                           
##                   Kappa : 0.3594          
##                                           
##  Mcnemar's Test P-Value : 5.757e-05       
##                                           
##             Sensitivity : 0.8291          
##             Specificity : 0.5248          
##          Pos Pred Value : 0.6616          
##          Neg Pred Value : 0.7327          
##              Prevalence : 0.5284          
##          Detection Rate : 0.4381          
##    Detection Prevalence : 0.6622          
##       Balanced Accuracy : 0.6770          
##                                           
##        'Positive' Class : No              
## 
test <- predict(Model1aic, test1)
predicted <- factor(ifelse(test > 0.5,"Yes","No"))
caret::confusionMatrix(predicted, test1$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  130  73
##        Yes  28  68
##                                           
##                Accuracy : 0.6622          
##                  95% CI : (0.6055, 0.7156)
##     No Information Rate : 0.5284          
##     P-Value [Acc > NIR] : 1.871e-06       
##                                           
##                   Kappa : 0.3104          
##                                           
##  Mcnemar's Test P-Value : 1.197e-05       
##                                           
##             Sensitivity : 0.8228          
##             Specificity : 0.4823          
##          Pos Pred Value : 0.6404          
##          Neg Pred Value : 0.7083          
##              Prevalence : 0.5284          
##          Detection Rate : 0.4348          
##    Detection Prevalence : 0.6789          
##       Balanced Accuracy : 0.6525          
##                                           
##        'Positive' Class : No              
## 

We choose our first model as it has slightly more accuracy

The predictors mainly contributing to the churn: Postive effect: Tenure Negative effect: MultipleLinesYes, InternetServiceFiberoptic, PaperlessBillingYes, StreamingMoviesYes,StreamingTVYes Now let’s the exactly same procudure with Try out guys:

## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8052  -0.9823  -0.3884   0.9973   2.8104  
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -0.934674   0.390779  -2.392 0.016765 *  
## Score                                 -0.303957   0.094144  -3.229 0.001244 ** 
## ContractOne year                      -1.090138   0.343201  -3.176 0.001491 ** 
## ContractTwo year                     -14.874302 382.137242  -0.039 0.968951    
## MonthlyCharges                         0.029810   0.006174   4.828 1.38e-06 ***
## genderMale                             0.072065   0.125548   0.574 0.565967    
## SeniorCitizen1                         0.372590   0.172586   2.159 0.030861 *  
## PartnerYes                            -0.093329   0.161146  -0.579 0.562484    
## DependentsYes                          0.077993   0.182602   0.427 0.669292    
## tenure                                -0.053568   0.007521  -7.122 1.06e-12 ***
## PaperlessBillingYes                    0.519827   0.135435   3.838 0.000124 ***
## PaymentMethodCredit card (automatic)   0.122982   0.252405   0.487 0.626087    
## PaymentMethodElectronic check          0.294963   0.199814   1.476 0.139895    
## PaymentMethodMailed check             -0.228209   0.221767  -1.029 0.303457    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1783.9  on 1313  degrees of freedom
## Residual deviance: 1486.6  on 1300  degrees of freedom
## AIC: 1514.6
## 
## Number of Fisher Scoring iterations: 15
## 
## Call:
## glm(formula = Churn ~ Score + Contract + MonthlyCharges + SeniorCitizen + 
##     tenure + PaperlessBilling + PaymentMethod, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8139  -0.9823  -0.3873   0.9988   2.8293  
## 
## Coefficients:
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -0.898299   0.380557  -2.360 0.018251 *  
## Score                                 -0.302136   0.094018  -3.214 0.001311 ** 
## ContractOne year                      -1.085172   0.342558  -3.168 0.001536 ** 
## ContractTwo year                     -14.850696 382.134660  -0.039 0.969000    
## MonthlyCharges                         0.029558   0.006160   4.798 1.60e-06 ***
## SeniorCitizen1                         0.366399   0.170280   2.152 0.031417 *  
## tenure                                -0.054284   0.007373  -7.362 1.81e-13 ***
## PaperlessBillingYes                    0.518176   0.135367   3.828 0.000129 ***
## PaymentMethodCredit card (automatic)   0.133006   0.251967   0.528 0.597589    
## PaymentMethodElectronic check          0.299045   0.199671   1.498 0.134214    
## PaymentMethodMailed check             -0.214051   0.220969  -0.969 0.332697    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1783.9  on 1313  degrees of freedom
## Residual deviance: 1487.3  on 1303  degrees of freedom
## AIC: 1509.3
## 
## Number of Fisher Scoring iterations: 15

And now let’s compare this models and decide which will we choose:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  169  90
##        Yes  21  49
##                                           
##                Accuracy : 0.6626          
##                  95% CI : (0.6087, 0.7136)
##     No Information Rate : 0.5775          
##     P-Value [Acc > NIR] : 0.0009635       
##                                           
##                   Kappa : 0.2593          
##                                           
##  Mcnemar's Test P-Value : 1.087e-10       
##                                           
##             Sensitivity : 0.8895          
##             Specificity : 0.3525          
##          Pos Pred Value : 0.6525          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.5775          
##          Detection Rate : 0.5137          
##    Detection Prevalence : 0.7872          
##       Balanced Accuracy : 0.6210          
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  172  88
##        Yes  18  51
##                                          
##                Accuracy : 0.6778         
##                  95% CI : (0.6244, 0.728)
##     No Information Rate : 0.5775         
##     P-Value [Acc > NIR] : 0.0001186      
##                                          
##                   Kappa : 0.2919         
##                                          
##  Mcnemar's Test P-Value : 2.058e-11      
##                                          
##             Sensitivity : 0.9053         
##             Specificity : 0.3669         
##          Pos Pred Value : 0.6615         
##          Neg Pred Value : 0.7391         
##              Prevalence : 0.5775         
##          Detection Rate : 0.5228         
##    Detection Prevalence : 0.7903         
##       Balanced Accuracy : 0.6361         
##                                          
##        'Positive' Class : No             
## 

We choose our first model as it has slightly more accuracy

The predictors mainly contributing to the churn: Postive effect: Score,ContractOne year,tenure
Negative effect: MonthlyCharges,PaperlessBillingYes

Recomendations

These are some recomendations to be done reducing churn for every cluster: Netflix guys: 1. Try to increase the tenure by stimulating customers to stay longer.

2.Reduce paperless billing by introducing oblige paper bill system

  1. Improve quality of Movies and TV so that people won’t churn as they seek for it especially

Try out guys:

1.Encorage people to sign longer contracts by inroducing bonuses and discounts

2.Reduce paperless billing by introducing oblige paper bill system

  1. Offer more features by lower price,