This file analyses what factors govern the churn of a telecommunications company.

ibm.df<-read.csv(paste("IBM.csv"),)

Dimensions of the data set

Analysis is best when the dataset is rich in data.

dim(ibm.df)

## [1] 7043   21

Summary Statistics

What the dataset looks like

library(car)
some(ibm.df)

##      customerID gender SeniorCitizen Partner Dependents tenure
## 99   3212-KXOCR   Male             0      No         No     52
## 1060 0376-YMCJC   Male             0      No         No     23
## 1304 0042-RLHYP Female             0     Yes        Yes     69
## 1308 1866-NXPSP Female             0      No         No     36
## 1360 7943-RQCHR Female             0      No         No      9
## 2192 3988-RQIXO Female             1      No         No      1
## 4054 7446-YPODE   Male             1      No         No     11
## 4394 2057-ZBLPD Female             0     Yes         No     21
## 4456 8280-MQRQN Female             0      No         No      1
## 6894 5876-HZVZM Female             0     Yes        Yes      6
##      PhoneService MultipleLines InternetService      OnlineSecurity
## 99            Yes            No              No No internet service
## 1060          Yes           Yes     Fiber optic                  No
## 1304          Yes            No              No No internet service
## 1308          Yes            No             DSL                 Yes
## 1360          Yes           Yes     Fiber optic                  No
## 2192          Yes            No     Fiber optic                  No
## 4054          Yes           Yes             DSL                 Yes
## 4394          Yes           Yes     Fiber optic                  No
## 4456          Yes            No             DSL                  No
## 6894          Yes            No             DSL                 Yes
##             OnlineBackup    DeviceProtection         TechSupport
## 99   No internet service No internet service No internet service
## 1060                  No                  No                 Yes
## 1304 No internet service No internet service No internet service
## 1308                 Yes                 Yes                 Yes
## 1360                  No                  No                  No
## 2192                  No                  No                  No
## 4054                 Yes                  No                  No
## 4394                  No                  No                  No
## 4456                  No                 Yes                  No
## 6894                 Yes                  No                  No
##              StreamingTV     StreamingMovies       Contract
## 99   No internet service No internet service       Two year
## 1060                 Yes                  No Month-to-month
## 1304 No internet service No internet service       Two year
## 1308                  No                 Yes       One year
## 1360                 Yes                 Yes Month-to-month
## 2192                 Yes                 Yes Month-to-month
## 4054                  No                  No Month-to-month
## 4394                  No                 Yes Month-to-month
## 4456                  No                  No Month-to-month
## 6894                  No                  No Month-to-month
##      PaperlessBilling             PaymentMethod MonthlyCharges
## 99                 No Bank transfer (automatic)          21.00
## 1060              Yes          Electronic check          90.60
## 1304               No Bank transfer (automatic)          19.70
## 1308              Yes              Mailed check          75.55
## 1360              Yes          Electronic check          94.75
## 2192              Yes          Electronic check          91.30
## 4054               No Bank transfer (automatic)          60.25
## 4394              Yes          Electronic check          86.50
## 4456              Yes              Mailed check          50.45
## 6894               No   Credit card (automatic)          55.90
##      TotalCharges Churn
## 99        1107.20    No
## 1060      1943.20   Yes
## 1304      1396.90    No
## 1308      2680.15    No
## 1360       889.90   Yes
## 2192        91.30   Yes
## 4054       662.95    No
## 4394      1808.70   Yes
## 4456        50.45   Yes
## 6894       365.35   Yes

Summary

summary(ibm.df)

##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies
##  No                 :2810   No                 :2785  
##  No internet service:1526   No internet service:1526  
##  Yes                :2707   Yes                :2732  
##                                                       
##                                                       
##                                                       
##                                                       
##            Contract    PaperlessBilling                   PaymentMethod 
##  Month-to-month:3875   No :2872         Bank transfer (automatic):1544  
##  One year      :1473   Yes:4171         Credit card (automatic)  :1522  
##  Two year      :1695                    Electronic check         :2365  
##                                         Mailed check             :1612  
##                                                                         
##                                                                         
##                                                                         
##  MonthlyCharges    TotalCharges    Churn     
##  Min.   : 18.25   Min.   :  18.8   No :5174  
##  1st Qu.: 35.50   1st Qu.: 401.4   Yes:1869  
##  Median : 70.35   Median :1397.5             
##  Mean   : 64.76   Mean   :2283.3             
##  3rd Qu.: 89.85   3rd Qu.:3794.7             
##  Max.   :118.75   Max.   :8684.8             
##                   NA's   :11

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

describe(ibm.df)

##                   vars    n    mean      sd  median trimmed     mad   min
## customerID*          1 7043 3522.00 2033.28 3522.00 3522.00 2610.86  1.00
## gender*              2 7043    1.50    0.50    2.00    1.51    0.00  1.00
## SeniorCitizen        3 7043    0.16    0.37    0.00    0.08    0.00  0.00
## Partner*             4 7043    1.48    0.50    1.00    1.48    0.00  1.00
## Dependents*          5 7043    1.30    0.46    1.00    1.25    0.00  1.00
## tenure               6 7043   32.37   24.56   29.00   31.43   32.62  0.00
## PhoneService*        7 7043    1.90    0.30    2.00    2.00    0.00  1.00
## MultipleLines*       8 7043    1.94    0.95    2.00    1.93    1.48  1.00
## InternetService*     9 7043    1.87    0.74    2.00    1.84    1.48  1.00
## OnlineSecurity*     10 7043    1.79    0.86    2.00    1.74    1.48  1.00
## OnlineBackup*       11 7043    1.91    0.88    2.00    1.88    1.48  1.00
## DeviceProtection*   12 7043    1.90    0.88    2.00    1.88    1.48  1.00
## TechSupport*        13 7043    1.80    0.86    2.00    1.75    1.48  1.00
## StreamingTV*        14 7043    1.99    0.89    2.00    1.98    1.48  1.00
## StreamingMovies*    15 7043    1.99    0.89    2.00    1.99    1.48  1.00
## Contract*           16 7043    1.69    0.83    1.00    1.61    0.00  1.00
## PaperlessBilling*   17 7043    1.59    0.49    2.00    1.62    0.00  1.00
## PaymentMethod*      18 7043    2.57    1.07    3.00    2.59    1.48  1.00
## MonthlyCharges      19 7043   64.76   30.09   70.35   64.97   35.66 18.25
## TotalCharges        20 7032 2283.30 2266.77 1397.47 1970.14 1812.92 18.80
## Churn*              21 7043    1.27    0.44    1.00    1.21    0.00  1.00
##                       max  range  skew kurtosis    se
## customerID*       7043.00 7042.0  0.00    -1.20 24.23
## gender*              2.00    1.0 -0.02    -2.00  0.01
## SeniorCitizen        1.00    1.0  1.83     1.36  0.00
## Partner*             2.00    1.0  0.07    -2.00  0.01
## Dependents*          2.00    1.0  0.87    -1.23  0.01
## tenure              72.00   72.0  0.24    -1.39  0.29
## PhoneService*        2.00    1.0 -2.73     5.43  0.00
## MultipleLines*       3.00    2.0  0.12    -1.88  0.01
## InternetService*     3.00    2.0  0.21    -1.15  0.01
## OnlineSecurity*      3.00    2.0  0.42    -1.52  0.01
## OnlineBackup*        3.00    2.0  0.18    -1.68  0.01
## DeviceProtection*    3.00    2.0  0.19    -1.68  0.01
## TechSupport*         3.00    2.0  0.40    -1.54  0.01
## StreamingTV*         3.00    2.0  0.03    -1.72  0.01
## StreamingMovies*     3.00    2.0  0.01    -1.72  0.01
## Contract*            3.00    2.0  0.63    -1.27  0.01
## PaperlessBilling*    2.00    1.0 -0.38    -1.86  0.01
## PaymentMethod*       4.00    3.0 -0.17    -1.21  0.01
## MonthlyCharges     118.75  100.5 -0.22    -1.26  0.36
## TotalCharges      8684.80 8666.0  0.96    -0.23 27.03
## Churn*               2.00    1.0  1.06    -0.87  0.01

Datatypes of the columns

str(ibm.df)

## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

For easier interpretation, lets change some values

ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==0]<-'No'
ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==1]<-'Yes'
ibm.df$SeniorCitizen<-factor(ibm.df$SeniorCitizen)
str(ibm.df$SeniorCitizen)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

ibm.df$MultipleLines[ibm.df$MultipleLines=="No phone service"]<-'No'
ibm.df$MultipleLines<-factor(ibm.df$MultipleLines)
str(ibm.df$MultipleLines)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...

ibm.df$OnlineSecurity[ibm.df$OnlineSecurity=="No internet service"]<-'No'
ibm.df$OnlineSecurity<-factor(ibm.df$OnlineSecurity)
str(ibm.df$OnlineSecurity)

##  Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...

ibm.df$OnlineBackup[ibm.df$OnlineBackup=="No internet service"]<-'No'
ibm.df$OnlineBackup<-factor(ibm.df$OnlineBackup)
str(ibm.df$OnlineBackup)

##  Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...

ibm.df$DeviceProtection[ibm.df$DeviceProtection=="No internet service"]<-'No'
ibm.df$DeviceProtection<-factor(ibm.df$DeviceProtection)
str(ibm.df$DeviceProtection)

##  Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...

ibm.df$TechSupport[ibm.df$TechSupport=="No internet service"]<-'No'
ibm.df$TechSupport<-factor(ibm.df$TechSupport)
str(ibm.df$TechSupport)

##  Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...

ibm.df$StreamingTV[ibm.df$StreamingTV=="No internet service"]<-'No'
ibm.df$StreamingTV<-factor(ibm.df$StreamingTV)
str(ibm.df$StreamingTV)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...

ibm.df$StreamingMovies[ibm.df$StreamingMovies=="No internet service"]<-'No'
ibm.df$StreamingMovies<-factor(ibm.df$StreamingMovies)
str(ibm.df$StreamingMovies)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...

The data set now looks like

some(ibm.df)

##      customerID gender SeniorCitizen Partner Dependents tenure
## 1102 4712-UYOOI Female            No     Yes        Yes     20
## 1280 3452-FLHYD   Male            No     Yes         No     25
## 1684 0354-VXMJC   Male            No     Yes        Yes     23
## 2892 0495-RVCBF Female            No      No         No      1
## 3457 0436-TWFFZ Female            No      No         No     67
## 3608 9987-LUTYD Female            No      No         No     13
## 4660 6616-AALSR Female            No     Yes        Yes     65
## 4830 1600-DILPE Female            No      No         No     12
## 6046 4487-ZYJZK Female            No     Yes        Yes     38
## 6883 8065-QBYTO Female           Yes      No         No     71
##      PhoneService MultipleLines InternetService OnlineSecurity
## 1102          Yes            No              No             No
## 1280          Yes            No              No             No
## 1684          Yes            No              No             No
## 2892          Yes            No     Fiber optic             No
## 3457          Yes           Yes             DSL            Yes
## 3608          Yes            No             DSL            Yes
## 4660          Yes           Yes     Fiber optic            Yes
## 4830          Yes            No             DSL             No
## 6046          Yes            No              No             No
## 6883          Yes           Yes     Fiber optic            Yes
##      OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
## 1102           No               No          No          No              No
## 1280           No               No          No          No              No
## 1684           No               No          No          No              No
## 2892           No               No          No          No             Yes
## 3457           No              Yes         Yes         Yes             Yes
## 3608           No               No         Yes          No              No
## 4660           No              Yes          No         Yes             Yes
## 4830           No               No          No          No              No
## 6046           No               No          No          No              No
## 6883          Yes               No         Yes         Yes              No
##            Contract PaperlessBilling             PaymentMethod
## 1102 Month-to-month               No          Electronic check
## 1280       One year              Yes Bank transfer (automatic)
## 1684       Two year               No   Credit card (automatic)
## 2892 Month-to-month              Yes          Electronic check
## 3457       Two year              Yes              Mailed check
## 3608       One year               No              Mailed check
## 4660       Two year              Yes   Credit card (automatic)
## 4830 Month-to-month              Yes Bank transfer (automatic)
## 6046       One year               No   Credit card (automatic)
## 6883       One year              Yes   Credit card (automatic)
##      MonthlyCharges TotalCharges Churn
## 1102          20.00       417.65    No
## 1280          20.95       495.15    No
## 1684          19.60       426.65    No
## 2892          79.70        79.70   Yes
## 3457          85.25      5714.20    No
## 3608          55.15       742.90    No
## 4660         104.30      6725.30    No
## 4830          45.00       524.35    No
## 6046          19.60       763.10    No
## 6883          99.65      7181.25    No

Analysis of the Variables

Gender Analysis

gender<-table(ibm.df$gender)
gender

## 
## Female   Male 
##   3488   3555

It would provide a clear picture to see how churn varies with gender

plot(ibm.df$gender,main="Number of males and females")

gender2<-xtabs(~ibm.df$Churn+ibm.df$gender)
gender2

##             ibm.df$gender
## ibm.df$Churn Female Male
##          No    2549 2625
##          Yes    939  930

Percentage of male and female participation in churn

gender3<-prop.table(gender2,2)*100
format(round(gender3, 2), nsmall = 2)

##             ibm.df$gender
## ibm.df$Churn Female  Male   
##          No  "73.08" "73.84"
##          Yes "26.92" "26.16"

This shows that around equal percentages of males and females left the subscription last month.

Senior Citizen

sc<-table(ibm.df$SeniorCitizen)
sc

## 
##   No  Yes 
## 5901 1142

plot(ibm.df$SeniorCitizen,main="Number of Senior Citizen")

It would provide a clear picture to see how churn varies with Senior citizenship.

sc2<-xtabs(~ibm.df$Churn+ibm.df$SeniorCitizen)
sc2

##             ibm.df$SeniorCitizen
## ibm.df$Churn   No  Yes
##          No  4508  666
##          Yes 1393  476

Percentage of male and female participation in churn

sc3<-prop.table(sc2,2)*100
format(round(sc3, 2), nsmall = 2)

##             ibm.df$SeniorCitizen
## ibm.df$Churn No      Yes    
##          No  "76.39" "58.32"
##          Yes "23.61" "41.68"

A greater percentage of senior citizens have not ended the subscription.

Partner Analysis

The number of people who have and don’t have partners

partner<-table(ibm.df$Partner)
partner

## 
##   No  Yes 
## 3641 3402

It would provide a clear picture to see how churn varies with partnership

plot(ibm.df$Partner,main="Number of Partners")

partner2<-xtabs(~ibm.df$Churn+ibm.df$Partner)
partner2

##             ibm.df$Partner
## ibm.df$Churn   No  Yes
##          No  2441 2733
##          Yes 1200  669

Percentage of participation in churn

partner3<-prop.table(partner2,2)*100
format(round(partner3, 2), nsmall = 2)

##             ibm.df$Partner
## ibm.df$Churn No      Yes    
##          No  "67.04" "80.34"
##          Yes "32.96" "19.66"

Dependents Analysis

The number of people who are and are not dependent.

dependent<-table(ibm.df$Dependents)
dependent

## 
##   No  Yes 
## 4933 2110

This shows that mostly people don’t have dependents.

plot(ibm.df$Partner,main="Number of People having Dependents")

How churn is affected by dependents.

dependent2<-xtabs(~ibm.df$Churn+ibm.df$Dependents)
dependent2

##             ibm.df$Dependents
## ibm.df$Churn   No  Yes
##          No  3390 1784
##          Yes 1543  326

Percentage of dependent participation in churn

dependent3<-prop.table(dependent2,2)*100
format(round(dependent3, 2), nsmall = 2)

##             ibm.df$Dependents
## ibm.df$Churn No      Yes    
##          No  "68.72" "84.55"
##          Yes "31.28" "15.45"

Tenure

tenure<-table(ibm.df$tenure)
tenure

## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
##  11 613 238 200 176 133 110 131 123 119 116  99 117 109  76  99  80  87 
##  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 
##  97  73  71  63  90  85  94  79  79  72  57  72  72  65  69  64  65  88 
##  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53 
##  50  65  59  56  64  70  65  65  51  61  74  68  64  66  68  68  80  70 
##  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71 
##  68  64  80  65  67  60  76  76  70  72  80  76  89  98 100  95 119 170 
##  72 
## 362

boxplot(ibm.df$tenure,horizontal = TRUE, main="Tenure of Subscribers",xlab="Months",col="grey")

hist(ibm.df$tenure,breaks = 30,main = "Frequency of Tenure Months",xlab = "Tenure",col="grey")

Average Tenure of all the Subscribers w.r.t churn

aggregate(tenure~Churn,data = ibm.df,mean)

##   Churn   tenure
## 1    No 37.56997
## 2   Yes 17.97913

library(lattice)
barchart(tenure~Churn,data = ibm.df,col="grey")

Mean tenure based on gender

aggregate(tenure~gender,data = ibm.df,mean)

##   gender   tenure
## 1 Female 32.24455
## 2   Male 32.49536

boxplot(tenure~gender,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Gender",xlab="Tenure")

Mean tenure based on Senior Citizenship

aggregate(tenure~SeniorCitizen,data = ibm.df,mean)

##   SeniorCitizen   tenure
## 1            No 32.19217
## 2           Yes 33.29597

boxplot(tenure~SeniorCitizen,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Senior Citizenship",xlab="Tenure")

Hypothesis 1

Hypothesis : The difference in mean tenure of those subscibers who ended their contract and those who did not is not significant.

t.test(tenure~Churn,data = ibm.df)

## 
##  Welch Two Sample t-test
## 
## data:  tenure by Churn
## t = 34.824, df = 4048.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  18.48789 20.69378
## sample estimates:
##  mean in group No mean in group Yes 
##          37.56997          17.97913

By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean tenure of those subscibers who ended their contract and those who did not.

Subscription Analysis

par(mfrow=c(3,3))
plot(ibm.df$PhoneService,main="People who subscribed for Phone Service")
plot(ibm.df$MultipleLines,main="People who subscribed for Multiple Lines")
plot(ibm.df$InternetService,main="People who subscribed for Internet Service")
plot(ibm.df$OnlineSecurity,main="People who subscribed for Online Security")
plot(ibm.df$OnlineBackup,main="People who subscribed for Online Backup")
plot(ibm.df$DeviceProtection,main="People who subscribed for Device Protection")
plot(ibm.df$TechSupport,main="People who subscribed for Tech Support")
plot(ibm.df$StreamingTV,main="People who subscribed for Streaming TV")
plot(ibm.df$StreamingMovies,main="People who subscribed for Streaming Movies")

Contract Analysis

Analysing the contract may give some insight about churn.

table(ibm.df$Contract)

## 
## Month-to-month       One year       Two year 
##           3875           1473           1695

plot(ibm.df$Contract,main="Frequency of Different Contracts")

It may be interesting to analyse how tenure varies with contracts.

aggregate(tenure~Contract,data = ibm.df,mean)

##         Contract   tenure
## 1 Month-to-month 18.03665
## 2       One year 42.04481
## 3       Two year 56.73510

boxplot(tenure~Contract,data = ibm.df,horizontal=TRUE,col="grey",main="Variation of Tenure with Contracts",xlab="Tenure")

Lets see how churn is affected by contracts made by subscribers.

xtabs(~ibm.df$Churn+ibm.df$Contract)

##             ibm.df$Contract
## ibm.df$Churn Month-to-month One year Two year
##          No            2220     1307     1647
##          Yes           1655      166       48

Monthly Charges Analysis

boxplot(ibm.df$MonthlyCharges,col="grey",horizontal=TRUE, main="Median of Monthly Charges")

Lets see how churn depends on Monthly Charges

aggregate(MonthlyCharges~Churn,data = ibm.df,mean)

##   Churn MonthlyCharges
## 1    No       61.26512
## 2   Yes       74.44133

boxplot(MonthlyCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Monthly Charges",xlab="Monthly Charges",ylab="Churn")

Variation of monthly charges with tenure and Churn

scatterplot(MonthlyCharges~tenure|Churn,data = ibm.df,cex=0.5)

It can be seen rom the above plot that majority of the subscribers who ended their contracts had high monthly charges and lower tenures.

Hypothesis 2

hypothesis: The differnce in monthly charges based on churn is not significant.

t.test(MonthlyCharges~Churn,data = ibm.df)

## 
##  Welch Two Sample t-test
## 
## data:  MonthlyCharges by Churn
## t = -18.408, df = 4135.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.57957 -11.77284
## sample estimates:
##  mean in group No mean in group Yes 
##          61.26512          74.44133

By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean monthly charges of those subscibers who ended their contract and those who did not.

Total Charges Analysis

boxplot(ibm.df$TotalCharges,col="grey",horizontal=TRUE, main="Median of Total Charges")

Lets see how churn depends on Total Charges

aggregate(TotalCharges~Churn,data = ibm.df,mean)

##   Churn TotalCharges
## 1    No     2555.344
## 2   Yes     1531.796

boxplot(TotalCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Total Charges",xlab="Total Charges",ylab="Churn")

Variation of Total charges with tenure and Churn

scatterplot(TotalCharges~tenure|Churn,data = ibm.df,cex=0.5)

It can be seen rom the above plot that majority of the subscribers who ended their contracts had high total charges.

Hypothesis 3

hypothesis: The differnce in total charges based on churn is not significant.

t.test(TotalCharges~Churn,data = ibm.df)

## 
##  Welch Two Sample t-test
## 
## data:  TotalCharges by Churn
## t = 18.801, df = 4042.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   916.8121 1130.2840
## sample estimates:
##  mean in group No mean in group Yes 
##          2555.344          1531.796

By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean total charges of those subscibers who ended their contract and those who did not.

Logistic Regression

Since Churn is a factor variable, a linear regression cannot be applied as we need to analyse the dependence of churn on the other variables. As a result, a logistic regression is applied for the analysis.

Now since this is a logistic regression, we need to divide the data into two sets: ‘train’ - on which the model of the regression will be based on and ‘test’ - on which the model will be tested.

train <- ibm.df[1:6950,]
test <- ibm.df[6951:7043,]

Model1

The dependent variable is Churn and all other variables( excluding customer ID) are considered to be independent.

Model: Churn = b + b1 x TotalCharges + b2 x MonthlyCharges + b3 x tenure + b4 x gender + b5 x PhoneService + b6 x SeniorCitizen + b7 x Partner + b8 x Dependents + b9 x MultipleLines + b10 x InternetService + b11 x OnlineSecurity + b12 x OnlineBackup + b13 x DeviceProtection + b14 x TechSupport + b15 x StreamingTV + b16 x StreamingMovies + b17 x Contract + e

model1=glm(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data = train,family = binomial)

summary(model1)

## 
## Call:
## glm(formula = Churn ~ TotalCharges + MonthlyCharges + tenure + 
##     gender + PhoneService + SeniorCitizen + Partner + Dependents + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8752  -0.6925  -0.2906   0.7552   3.3872  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.823e+00  8.112e-01   2.248  0.02459 *  
## TotalCharges                3.038e-04  7.051e-05   4.308 1.65e-05 ***
## MonthlyCharges             -5.439e-02  3.188e-02  -1.706  0.08796 .  
## tenure                     -5.889e-02  6.177e-03  -9.533  < 2e-16 ***
## genderMale                 -2.357e-02  6.494e-02  -0.363  0.71667    
## PhoneServiceYes             4.140e-01  6.512e-01   0.636  0.52499    
## SeniorCitizenYes            2.431e-01  8.460e-02   2.873  0.00406 ** 
## PartnerYes                  1.015e-02  7.784e-02   0.130  0.89628    
## DependentsYes              -1.650e-01  8.977e-02  -1.837  0.06614 .  
## MultipleLinesYes            5.316e-01  1.781e-01   2.986  0.00283 ** 
## InternetServiceFiber optic  2.227e+00  8.010e-01   2.780  0.00544 ** 
## InternetServiceNo          -2.296e+00  8.100e-01  -2.834  0.00459 ** 
## OnlineSecurityYes          -1.806e-01  1.793e-01  -1.007  0.31379    
## OnlineBackupYes             9.931e-02  1.761e-01   0.564  0.57273    
## DeviceProtectionYes         2.115e-01  1.768e-01   1.196  0.23152    
## TechSupportYes             -1.456e-01  1.811e-01  -0.804  0.42133    
## StreamingTVYes              7.939e-01  3.277e-01   2.423  0.01540 *  
## StreamingMoviesYes          7.939e-01  3.277e-01   2.422  0.01542 *  
## ContractOne year           -7.368e-01  1.076e-01  -6.845 7.65e-12 ***
## ContractTwo year           -1.459e+00  1.757e-01  -8.302  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8029.0  on 6938  degrees of freedom
## Residual deviance: 5789.6  on 6919  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 5829.6
## 
## Number of Fisher Scoring iterations: 6

From the above regression analysis, its clear that Total Charges, Tenure and Contracts are the most significant variables in determining the churn in the company.

library(leaps)
leap1<- regsubsets(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data=train,nbest=1)
summary(leap1)

## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + MonthlyCharges + tenure + 
##     gender + PhoneService + SeniorCitizen + Partner + Dependents + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract, data = train, nbest = 1)
## 19 Variables  (and intercept)
##                            Forced in Forced out
## TotalCharges                   FALSE      FALSE
## MonthlyCharges                 FALSE      FALSE
## tenure                         FALSE      FALSE
## genderMale                     FALSE      FALSE
## PhoneServiceYes                FALSE      FALSE
## SeniorCitizenYes               FALSE      FALSE
## PartnerYes                     FALSE      FALSE
## DependentsYes                  FALSE      FALSE
## MultipleLinesYes               FALSE      FALSE
## InternetServiceFiber optic     FALSE      FALSE
## InternetServiceNo              FALSE      FALSE
## OnlineSecurityYes              FALSE      FALSE
## OnlineBackupYes                FALSE      FALSE
## DeviceProtectionYes            FALSE      FALSE
## TechSupportYes                 FALSE      FALSE
## StreamingTVYes                 FALSE      FALSE
## StreamingMoviesYes             FALSE      FALSE
## ContractOne year               FALSE      FALSE
## ContractTwo year               FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          TotalCharges MonthlyCharges tenure genderMale PhoneServiceYes
## 1  ( 1 ) " "          " "            "*"    " "        " "            
## 2  ( 1 ) " "          " "            "*"    " "        " "            
## 3  ( 1 ) "*"          " "            " "    " "        " "            
## 4  ( 1 ) "*"          " "            " "    " "        " "            
## 5  ( 1 ) "*"          "*"            " "    " "        "*"            
## 6  ( 1 ) "*"          " "            " "    " "        " "            
## 7  ( 1 ) "*"          "*"            " "    " "        "*"            
## 8  ( 1 ) "*"          "*"            " "    " "        "*"            
##          SeniorCitizenYes PartnerYes DependentsYes MultipleLinesYes
## 1  ( 1 ) " "              " "        " "           " "             
## 2  ( 1 ) " "              " "        " "           " "             
## 3  ( 1 ) " "              " "        " "           " "             
## 4  ( 1 ) " "              " "        " "           " "             
## 5  ( 1 ) " "              " "        " "           " "             
## 6  ( 1 ) " "              " "        " "           " "             
## 7  ( 1 ) " "              " "        " "           " "             
## 8  ( 1 ) " "              " "        " "           " "             
##          InternetServiceFiber optic InternetServiceNo OnlineSecurityYes
## 1  ( 1 ) " "                        " "               " "              
## 2  ( 1 ) "*"                        " "               " "              
## 3  ( 1 ) "*"                        "*"               " "              
## 4  ( 1 ) "*"                        "*"               " "              
## 5  ( 1 ) " "                        " "               "*"              
## 6  ( 1 ) "*"                        "*"               " "              
## 7  ( 1 ) " "                        " "               "*"              
## 8  ( 1 ) " "                        " "               "*"              
##          OnlineBackupYes DeviceProtectionYes TechSupportYes StreamingTVYes
## 1  ( 1 ) " "             " "                 " "            " "           
## 2  ( 1 ) " "             " "                 " "            " "           
## 3  ( 1 ) " "             " "                 " "            " "           
## 4  ( 1 ) " "             " "                 " "            " "           
## 5  ( 1 ) " "             " "                 "*"            " "           
## 6  ( 1 ) " "             " "                 " "            " "           
## 7  ( 1 ) " "             " "                 "*"            " "           
## 8  ( 1 ) "*"             " "                 "*"            " "           
##          StreamingMoviesYes ContractOne year ContractTwo year
## 1  ( 1 ) " "                " "              " "             
## 2  ( 1 ) " "                " "              " "             
## 3  ( 1 ) " "                " "              " "             
## 4  ( 1 ) "*"                " "              " "             
## 5  ( 1 ) " "                " "              " "             
## 6  ( 1 ) "*"                "*"              "*"             
## 7  ( 1 ) " "                "*"              "*"             
## 8  ( 1 ) " "                "*"              "*"

plot(leap1,scale="adjr2")

Coefficiencts of Model1

Let us view the coffecients of model1 again.

model1$coefficients

##                (Intercept)               TotalCharges 
##               1.8233111979               0.0003037555 
##             MonthlyCharges                     tenure 
##              -0.0543932133              -0.0588872582 
##                 genderMale            PhoneServiceYes 
##              -0.0235657745               0.4139685779 
##           SeniorCitizenYes                 PartnerYes 
##               0.2430841292               0.0101467163 
##              DependentsYes           MultipleLinesYes 
##              -0.1649522270               0.5316171354 
## InternetServiceFiber optic          InternetServiceNo 
##               2.2267266925              -2.2958244053 
##          OnlineSecurityYes            OnlineBackupYes 
##              -0.1805734677               0.0993077378 
##        DeviceProtectionYes             TechSupportYes 
##               0.2114850749              -0.1456074125 
##             StreamingTVYes         StreamingMoviesYes 
##               0.7938830923               0.7939111022 
##           ContractOne year           ContractTwo year 
##              -0.7368360340              -1.4585078047

library(coefplot)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

coefplot(model1)

Inference for model1

As is clear this model seems wrong as it shows that monthly charges and phone service subscription (and many other) are not significant to the churn in the company.

Therefore, this model is not good enough for the data.

Model2

This time let us ditch some variables.

Model2: Churn = b + b1 x TotalCharges + b2 x tenure + b3 x gender + b4 x PhoneService + b5 x SeniorCitizen + b6 x MultipleLines + b7 x InternetService + b8 x Contract + e

We have chosen this model as these are the most basic and sidnificant factors on which churn could result.

model2=glm(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data = train,family = binomial)
summary(model2)

## 
## Call:
## glm(formula = Churn ~ TotalCharges + tenure + gender + PhoneService + 
##     SeniorCitizen + MultipleLines + InternetService + Contract, 
##     family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7307  -0.7093  -0.3050   0.8173   3.5096  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 3.483e-01  1.234e-01   2.821 0.004786 ** 
## TotalCharges                3.120e-04  6.316e-05   4.939 7.86e-07 ***
## tenure                     -6.203e-02  5.885e-03 -10.540  < 2e-16 ***
## genderMale                 -1.477e-02  6.417e-02  -0.230 0.817903    
## PhoneServiceYes            -7.805e-01  1.295e-01  -6.026 1.68e-09 ***
## SeniorCitizenYes            3.299e-01  8.173e-02   4.036 5.43e-05 ***
## MultipleLinesYes            3.037e-01  7.862e-02   3.863 0.000112 ***
## InternetServiceFiber optic  1.083e+00  9.291e-02  11.653  < 2e-16 ***
## InternetServiceNo          -7.178e-01  1.277e-01  -5.619 1.92e-08 ***
## ContractOne year           -8.096e-01  1.050e-01  -7.708 1.28e-14 ***
## ContractTwo year           -1.675e+00  1.727e-01  -9.700  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8029.0  on 6938  degrees of freedom
## Residual deviance: 5888.3  on 6928  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 5910.3
## 
## Number of Fisher Scoring iterations: 6

This model is seems satisfactory enough as except gender all the other variables seem to significant(p value<0.05).

leap2<- regsubsets(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data=train,nbest=1)
summary(leap2)

## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + tenure + gender + PhoneService + 
##     SeniorCitizen + MultipleLines + InternetService + Contract, 
##     data = train, nbest = 1)
## 10 Variables  (and intercept)
##                            Forced in Forced out
## TotalCharges                   FALSE      FALSE
## tenure                         FALSE      FALSE
## genderMale                     FALSE      FALSE
## PhoneServiceYes                FALSE      FALSE
## SeniorCitizenYes               FALSE      FALSE
## MultipleLinesYes               FALSE      FALSE
## InternetServiceFiber optic     FALSE      FALSE
## InternetServiceNo              FALSE      FALSE
## ContractOne year               FALSE      FALSE
## ContractTwo year               FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          TotalCharges tenure genderMale PhoneServiceYes SeniorCitizenYes
## 1  ( 1 ) " "          "*"    " "        " "             " "             
## 2  ( 1 ) " "          "*"    " "        " "             " "             
## 3  ( 1 ) "*"          " "    " "        " "             " "             
## 4  ( 1 ) "*"          " "    " "        " "             " "             
## 5  ( 1 ) "*"          " "    " "        " "             " "             
## 6  ( 1 ) "*"          " "    " "        " "             " "             
## 7  ( 1 ) "*"          " "    " "        " "             "*"             
## 8  ( 1 ) "*"          "*"    " "        " "             "*"             
##          MultipleLinesYes InternetServiceFiber optic InternetServiceNo
## 1  ( 1 ) " "              " "                        " "              
## 2  ( 1 ) " "              "*"                        " "              
## 3  ( 1 ) " "              "*"                        "*"              
## 4  ( 1 ) " "              "*"                        "*"              
## 5  ( 1 ) " "              "*"                        "*"              
## 6  ( 1 ) "*"              "*"                        "*"              
## 7  ( 1 ) "*"              "*"                        "*"              
## 8  ( 1 ) "*"              "*"                        "*"              
##          ContractOne year ContractTwo year
## 1  ( 1 ) " "              " "             
## 2  ( 1 ) " "              " "             
## 3  ( 1 ) " "              " "             
## 4  ( 1 ) "*"              " "             
## 5  ( 1 ) "*"              "*"             
## 6  ( 1 ) "*"              "*"             
## 7  ( 1 ) "*"              "*"             
## 8  ( 1 ) "*"              "*"

plot(leap2,scale="adjr2")

Coefficients of Model2

model2$coefficients

##                (Intercept)               TotalCharges 
##               0.3482585988               0.0003119577 
##                     tenure                 genderMale 
##              -0.0620319045              -0.0147743337 
##            PhoneServiceYes           SeniorCitizenYes 
##              -0.7805109440               0.3298960142 
##           MultipleLinesYes InternetServiceFiber optic 
##               0.3037316602               1.0826036296 
##          InternetServiceNo           ContractOne year 
##              -0.7177828701              -0.8096150070 
##           ContractTwo year 
##              -1.6749967414

coefplot(model2)

The model seems good enough to predict the churn.

Lets see how this model behaves for the dataset test.

Test the Dataset

fitted.results <- predict(model2,newdata=subset(test,select=c(2,3,6,7,8,9,16,20)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,'Yes','No')

misClasificError <- mean(fitted.results != test$Churn)
print(paste('Accuracy',1-misClasificError))

## [1] "Accuracy 0.795698924731183"

The accuracy of prediction is above 75%. This proves that the model is good enough.

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

p <- predict(model2, newdata=subset(test,select=c(2,3,6,7,8,9,16,20)), type="response")
pr <- prediction(p, test$Churn)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.8318681

Inference of Model2

Since auc value is closer is closer to 1 than to 0.5, model2 is a good predictive model.
Since it is a logistic regression the coefficients in the analysis are the log of odds of churn.
- The analysis shows that a 1 unit increase in total charges increases the log of odds of churn.
- The analysis shows that a 1 unit increase in tenure decreases the log of odds of churn.
- The analysis shows that gender is not a significant factor in churn determination.
- The analysis shows that being a senior citizen increases the log of odds of churn.
- The analysis shows that having subscribed for the phone service decreases the log of odds of churn.
- The analysis shows that having subscribed for the multiple lines increases the log of odds of churn.
- The analysis shows that having subscribed for the fiber optic internet service increases the log of odds of churn.
- The analysis shows that having subscribed for no internet service increases the log of odds of churn.
- The analysis shows that having subscribed for one and two year contracts with the company decreases the log of odds of churn.
The accuracy level of the model is > 75%.

The model passes all the tests for being applied in the analysis.

                                              THE END

Churn in Telecommunications Company

Rahul Poddar

26 January 2018