Introduction

Companies need to focus their effort on reducing customer churn, that is, the number of customers who leave. In this tutorial, I will do a churn analysis of telecom customers using data that I found on Kaggle. You can download the data from this link: https://www.kaggle.com/becksddf/churn-in-telecoms-dataset/data#

Loading data

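# stringsAsFactors = TRUE keeps text columns as factors (no longer the default since R 4.0)
data <- read.csv('C:/Users/TOSHIBA/Desktop/churn.csv', header = TRUE, stringsAsFactors = TRUE)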
str(data)
## 'data.frame':    3333 obs. of  21 variables:
##  $ state                 : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
##  $ account.length        : int  128 107 137 84 75 118 121 147 117 141 ...
##  $ area.code             : int  415 415 415 408 415 510 510 415 408 415 ...
##  $ phone.number          : Factor w/ 3333 levels "327-1058","327-1319",..: 1927 1576 1118 1708 111 2254 1048 81 292 118 ...
##  $ international.plan    : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
##  $ voice.mail.plan       : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
##  $ number.vmail.messages : int  25 26 0 0 0 0 24 0 0 37 ...
##  $ total.day.minutes     : num  265 162 243 299 167 ...
##  $ total.day.calls       : int  110 123 114 71 113 98 88 79 97 84 ...
##  $ total.day.charge      : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ total.eve.minutes     : num  197.4 195.5 121.2 61.9 148.3 ...
##  $ total.eve.calls       : int  99 103 110 88 122 101 108 94 80 111 ...
##  $ total.eve.charge      : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ total.night.minutes   : num  245 254 163 197 187 ...
##  $ total.night.calls     : int  91 103 104 89 121 118 118 96 90 97 ...
##  $ total.night.charge    : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ total.intl.minutes    : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ total.intl.calls      : int  3 3 5 7 3 6 7 6 4 5 ...
##  $ total.intl.charge     : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ customer.service.calls: int  1 1 0 2 3 0 3 0 1 0 ...
##  $ churn                 : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
##      state      account.length    area.code       phone.number 
##  WV     : 106   Min.   :  1.0   Min.   :408.0   327-1058:   1  
##  MN     :  84   1st Qu.: 74.0   1st Qu.:408.0   327-1319:   1  
##  NY     :  83   Median :101.0   Median :415.0   327-3053:   1  
##  AL     :  80   Mean   :101.1   Mean   :437.2   327-3587:   1  
##  OH     :  78   3rd Qu.:127.0   3rd Qu.:510.0   327-3850:   1  
##  OR     :  78   Max.   :243.0   Max.   :510.0   327-3954:   1  
##  (Other):2824                                   (Other) :3327  
##  international.plan voice.mail.plan number.vmail.messages
##  no :3010           no :2411        Min.   : 0.000       
##  yes: 323           yes: 922        1st Qu.: 0.000       
##                                     Median : 0.000       
##                                     Mean   : 8.099       
##                                     3rd Qu.:20.000       
##                                     Max.   :51.000       
##                                                          
##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0.0   Min.   : 0.00    Min.   :  0.0    
##  1st Qu.:143.7     1st Qu.: 87.0   1st Qu.:24.43    1st Qu.:166.6    
##  Median :179.4     Median :101.0   Median :30.50    Median :201.4    
##  Mean   :179.8     Mean   :100.4   Mean   :30.56    Mean   :201.0    
##  3rd Qu.:216.4     3rd Qu.:114.0   3rd Qu.:36.79    3rd Qu.:235.3    
##  Max.   :350.8     Max.   :165.0   Max.   :59.64    Max.   :363.7    
##                                                                      
##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   : 23.2       Min.   : 33.0    
##  1st Qu.: 87.0   1st Qu.:14.16    1st Qu.:167.0       1st Qu.: 87.0    
##  Median :100.0   Median :17.12    Median :201.2       Median :100.0    
##  Mean   :100.1   Mean   :17.08    Mean   :200.9       Mean   :100.1    
##  3rd Qu.:114.0   3rd Qu.:20.00    3rd Qu.:235.3       3rd Qu.:113.0    
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.0    
##                                                                        
##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 1.040     Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
##  1st Qu.: 7.520     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
##  Median : 9.050     Median :10.30      Median : 4.000   Median :2.780    
##  Mean   : 9.039     Mean   :10.24      Mean   : 4.479   Mean   :2.765    
##  3rd Qu.:10.590     3rd Qu.:12.10      3rd Qu.: 6.000   3rd Qu.:3.270    
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400    
##                                                                          
##  customer.service.calls   churn     
##  Min.   :0.000          False:2850  
##  1st Qu.:1.000          True : 483  
##  Median :1.000                      
##  Mean   :1.563                      
##  3rd Qu.:2.000                      
##  Max.   :9.000                      
## 
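The summary above shows 483 churned customers out of 3333, so the two classes are clearly imbalanced. As a quick sketch, the churn rate can be recomputed from those two counts alone. The vector `churn_counts` below is typed in by hand from the summary output (it is not read from the file), and its name is only illustrative.

```r
# Counts copied by hand from the summary(data) output above
churn_counts <- c(False = 2850, True = 483)

# Share of each class, in percent
churn_pct <- round(100 * churn_counts / sum(churn_counts), 2)
print(churn_pct)
```

With the data loaded, `prop.table(table(data$churn))` should give the same proportions. Since only about 14.5% of customers churn, a model that always predicts "False" would already be right about 85.5% of the time, which is worth keeping in mind when reading the error rates later on.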

Loading libraries

library(ggplot2)
library(plotly)
library(dplyr)
library(randomForest)

Visualisation of the variables

To visualise the data we will use two R packages: ggplot2 and plotly.

p=ggplot(data)+geom_bar(aes(x=churn))
p
Figure: Churn

p1=ggplot(data)+geom_bar(aes(x=state,fill=churn))
p1
Figure: Churn by state

p2=ggplot(data)+geom_bar(aes(x=as.factor(area.code),fill=churn))
p2
Figure: Churn by area code

p3=ggplot(data)+geom_bar(aes(x=international.plan,fill=churn))
p3
Figure: Churn by international plan

p4=ggplot(data)+geom_bar(aes(x=voice.mail.plan,fill=churn))
p4
Figure: Churn by voice mail plan

Random forest

# Keep columns 7 to 21: the numeric usage variables plus churn
data2 <- data[, 7:21]
churn_RandomForest <- randomForest(churn~.,data=data2, ntree = 100, 
                                  mtry = 2, na.action = na.roughfix)
print(churn_RandomForest)
## 
## Call:
##  randomForest(formula = churn ~ ., data = data2, ntree = 100,      mtry = 2, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 8.31%
## Confusion matrix:
##       False True class.error
## False  2824   26 0.009122807
## True    251  232 0.519668737
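The overall OOB error of 8.31% hides a large difference between the two classes: the class error for "True" is about 52%, so roughly half of the churners are missed. The sketch below recomputes the error rate and adds recall and precision for the churn class. The matrix `conf` is copied by hand from the printed output above, and the variable names are only illustrative.

```r
# Confusion matrix copied by hand from the print(churn_RandomForest) output:
# rows are the actual class, columns are the predicted class
conf <- matrix(c(2824,  26,
                  251, 232),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("False", "True"),
                               predicted = c("False", "True")))

# Overall out-of-bag error rate: share of customers that are misclassified
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Recall for churn: share of actual churners that the forest flags
recall <- conf["True", "True"] / sum(conf["True", ])

# Precision for churn: share of predicted churners that really churned
precision <- conf["True", "True"] / sum(conf[, "True"])

print(round(c(oob_error = oob_error, recall = recall, precision = precision), 4))
```

With the fitted model at hand, the same matrix should be available as `churn_RandomForest$confusion` (with an extra `class.error` column). Because no seed is set before the call to `randomForest`, the exact counts will vary slightly from run to run.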

Logistic regression

reg <- glm(churn ~., 
  data = data2, family = binomial(logit))
reg
## 
## Call:  glm(formula = churn ~ ., family = binomial(logit), data = data2)
## 
## Coefficients:
##            (Intercept)   number.vmail.messages       total.day.minutes  
##             -7.8815516              -0.0245235              -0.5741144  
##        total.day.calls        total.day.charge       total.eve.minutes  
##              0.0029735               3.4515631               0.3840558  
##        total.eve.calls        total.eve.charge     total.night.minutes  
##              0.0009446              -4.4402475               0.0102128  
##      total.night.calls      total.night.charge      total.intl.minutes  
##              0.0009018              -0.1669207              -1.4374061  
##       total.intl.calls       total.intl.charge  customer.service.calls  
##             -0.0793376               5.6669682               0.4549603  
## 
## Degrees of Freedom: 3332 Total (i.e. Null);  3318 Residual
## Null Deviance:       2758 
## Residual Deviance: 2363  AIC: 2393
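The coefficients above are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio. As a small sketch, here is the conversion for `customer.service.calls`; the value is copied by hand from the output above and `beta_calls` is just an illustrative name.

```r
# Coefficient for customer.service.calls, copied by hand from the glm output
beta_calls <- 0.4549603

# exp() turns a log-odds coefficient into an odds ratio
odds_ratio <- exp(beta_calls)
print(round(odds_ratio, 2))
```

The result is about 1.58: under this model, each additional customer service call multiplies the estimated odds of churn by roughly that factor, holding the other variables fixed. With the fitted model, `exp(coef(reg))` does the same for every coefficient at once.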
summary(reg)
## 
## Call:
## glm(formula = churn ~ ., family = binomial(logit), data = data2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7853  -0.5661  -0.4016  -0.2502   2.9868  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.8815516  0.6695246 -11.772  < 2e-16 ***
## number.vmail.messages  -0.0245235  0.0045098  -5.438 5.39e-08 ***
## total.day.minutes      -0.5741144  3.1345784  -0.183 0.854676    
## total.day.calls         0.0029735  0.0026444   1.124 0.260809    
## total.day.charge        3.4515631 18.4388176   0.187 0.851512    
## total.eve.minutes       0.3840558  1.5630932   0.246 0.805913    
## total.eve.calls         0.0009446  0.0026287   0.359 0.719342    
## total.eve.charge       -4.4402475 18.3892705  -0.241 0.809200    
## total.night.minutes     0.0102128  0.8339306   0.012 0.990229    
## total.night.calls       0.0009018  0.0027169   0.332 0.739955    
## total.night.charge     -0.1669207 18.5312690  -0.009 0.992813    
## total.intl.minutes     -1.4374061  5.0276526  -0.286 0.774955    
## total.intl.calls       -0.0793376  0.0237729  -3.337 0.000846 ***
## total.intl.charge       5.6669682 18.6201516   0.304 0.760864    
## customer.service.calls  0.4549603  0.0371624  12.242  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2758.3  on 3332  degrees of freedom
## Residual deviance: 2362.8  on 3318  degrees of freedom
## AIC: 2392.8
## 
## Number of Fisher Scoring iterations: 5
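In the summary above, every minutes and charge coefficient has a very large standard error (around 18 for the charges) and none of them is significant. A likely reason is collinearity: in this dataset each charge appears to be the corresponding minutes multiplied by a fixed rate. The sketch below checks this on the first ten international values printed by `str(data)` earlier. The two vectors are typed in by hand from that output, so this is only a spot check, not a test on the full data.

```r
# First ten values copied by hand from the str(data) output
intl_minutes <- c(10, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 8.7, 11.2)
intl_charge  <- c(2.7, 3.7, 3.29, 1.78, 2.73, 1.7, 2.03, 1.92, 2.35, 3.02)

# Charge per minute: almost constant across these rows
rate <- intl_charge / intl_minutes
print(round(range(rate), 3))

# Hence the two columns are almost perfectly correlated
print(cor(intl_minutes, intl_charge))
```

If the same pattern holds on the full data, one option is to keep only one variable from each minutes/charge pair, for example by refitting with `glm(churn ~ . - total.day.charge - total.eve.charge - total.night.charge - total.intl.charge, data = data2, family = binomial(logit))`. This is a suggestion rather than part of the original analysis.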