This file analyses what factors govern the churn of a telecommunications company.

ibm.df<-read.csv(paste("IBM.csv"),)

Dimensions of the data set

Analysis is best when the dataset is rich in data.

dim(ibm.df)

## [1] 7043   21

Summary Statistics

What the dataset looks like

library(car)
some(ibm.df)

##      customerID gender SeniorCitizen Partner Dependents tenure
## 681  1448-PWKYE   Male             0     Yes        Yes      1
## 754  3115-CZMZD   Male             0      No        Yes      0
## 1985 4373-MAVJG Female             0     Yes        Yes     14
## 2300 6177-PEVRA Female             0      No         No     48
## 2750 9086-YJYXS   Male             0     Yes        Yes     34
## 3667 3194-ORPIK Female             0     Yes        Yes     50
## 4277 1226-UDFZR Female             0      No         No      2
## 4591 4884-TVUQF Female             1      No         No     57
## 4797 3807-BPOMJ Female             0     Yes         No     55
## 6435 6608-QQLVK   Male             0      No         No      1
##      PhoneService MultipleLines InternetService      OnlineSecurity
## 681           Yes            No     Fiber optic                  No
## 754           Yes            No              No No internet service
## 1985          Yes           Yes     Fiber optic                  No
## 2300          Yes            No             DSL                 Yes
## 2750          Yes           Yes             DSL                  No
## 3667          Yes           Yes     Fiber optic                  No
## 4277          Yes            No             DSL                  No
## 4591          Yes           Yes     Fiber optic                 Yes
## 4797          Yes            No     Fiber optic                 Yes
## 6435          Yes            No             DSL                  No
##             OnlineBackup    DeviceProtection         TechSupport
## 681                   No                  No                  No
## 754  No internet service No internet service No internet service
## 1985                 Yes                  No                  No
## 2300                 Yes                  No                  No
## 2750                  No                 Yes                  No
## 3667                  No                  No                  No
## 4277                 Yes                  No                  No
## 4591                  No                 Yes                 Yes
## 4797                  No                  No                  No
## 6435                  No                 Yes                  No
##              StreamingTV     StreamingMovies       Contract
## 681                   No                 Yes Month-to-month
## 754  No internet service No internet service       Two year
## 1985                  No                 Yes Month-to-month
## 2300                  No                  No       Two year
## 2750                 Yes                 Yes       One year
## 3667                 Yes                  No Month-to-month
## 4277                  No                  No Month-to-month
## 4591                  No                 Yes       Two year
## 4797                 Yes                 Yes       One year
## 6435                  No                  No Month-to-month
##      PaperlessBilling             PaymentMethod MonthlyCharges
## 681                No          Electronic check          80.00
## 754                No              Mailed check          20.25
## 1985              Yes Bank transfer (automatic)          90.90
## 2300               No   Credit card (automatic)          55.50
## 2750               No Bank transfer (automatic)          77.20
## 3667              Yes Bank transfer (automatic)          84.40
## 4277              Yes              Mailed check          49.60
## 4591               No   Credit card (automatic)         101.30
## 4797              Yes          Electronic check          94.75
## 6435              Yes          Electronic check          50.50
##      TotalCharges Churn
## 681         80.00   Yes
## 754            NA    No
## 1985      1259.00   Yes
## 2300      2627.35    No
## 2750      2753.80    No
## 3667      4116.15   Yes
## 4277       114.70   Yes
## 4591      5779.60    No
## 4797      5276.10    No
## 6435        50.50   Yes

Summary

summary(ibm.df)

##       customerID      gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies
##  No                 :2810   No                 :2785  
##  No internet service:1526   No internet service:1526  
##  Yes                :2707   Yes                :2732  
##                                                       
##                                                       
##                                                       
##                                                       
##            Contract    PaperlessBilling                   PaymentMethod 
##  Month-to-month:3875   No :2872         Bank transfer (automatic):1544  
##  One year      :1473   Yes:4171         Credit card (automatic)  :1522  
##  Two year      :1695                    Electronic check         :2365  
##                                         Mailed check             :1612  
##                                                                         
##                                                                         
##                                                                         
##  MonthlyCharges    TotalCharges    Churn     
##  Min.   : 18.25   Min.   :  18.8   No :5174  
##  1st Qu.: 35.50   1st Qu.: 401.4   Yes:1869  
##  Median : 70.35   Median :1397.5             
##  Mean   : 64.76   Mean   :2283.3             
##  3rd Qu.: 89.85   3rd Qu.:3794.7             
##  Max.   :118.75   Max.   :8684.8             
##                   NA's   :11

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

describe(ibm.df)

##                   vars    n    mean      sd  median trimmed     mad   min
## customerID*          1 7043 3522.00 2033.28 3522.00 3522.00 2610.86  1.00
## gender*              2 7043    1.50    0.50    2.00    1.51    0.00  1.00
## SeniorCitizen        3 7043    0.16    0.37    0.00    0.08    0.00  0.00
## Partner*             4 7043    1.48    0.50    1.00    1.48    0.00  1.00
## Dependents*          5 7043    1.30    0.46    1.00    1.25    0.00  1.00
## tenure               6 7043   32.37   24.56   29.00   31.43   32.62  0.00
## PhoneService*        7 7043    1.90    0.30    2.00    2.00    0.00  1.00
## MultipleLines*       8 7043    1.94    0.95    2.00    1.93    1.48  1.00
## InternetService*     9 7043    1.87    0.74    2.00    1.84    1.48  1.00
## OnlineSecurity*     10 7043    1.79    0.86    2.00    1.74    1.48  1.00
## OnlineBackup*       11 7043    1.91    0.88    2.00    1.88    1.48  1.00
## DeviceProtection*   12 7043    1.90    0.88    2.00    1.88    1.48  1.00
## TechSupport*        13 7043    1.80    0.86    2.00    1.75    1.48  1.00
## StreamingTV*        14 7043    1.99    0.89    2.00    1.98    1.48  1.00
## StreamingMovies*    15 7043    1.99    0.89    2.00    1.99    1.48  1.00
## Contract*           16 7043    1.69    0.83    1.00    1.61    0.00  1.00
## PaperlessBilling*   17 7043    1.59    0.49    2.00    1.62    0.00  1.00
## PaymentMethod*      18 7043    2.57    1.07    3.00    2.59    1.48  1.00
## MonthlyCharges      19 7043   64.76   30.09   70.35   64.97   35.66 18.25
## TotalCharges        20 7032 2283.30 2266.77 1397.47 1970.14 1812.92 18.80
## Churn*              21 7043    1.27    0.44    1.00    1.21    0.00  1.00
##                       max  range  skew kurtosis    se
## customerID*       7043.00 7042.0  0.00    -1.20 24.23
## gender*              2.00    1.0 -0.02    -2.00  0.01
## SeniorCitizen        1.00    1.0  1.83     1.36  0.00
## Partner*             2.00    1.0  0.07    -2.00  0.01
## Dependents*          2.00    1.0  0.87    -1.23  0.01
## tenure              72.00   72.0  0.24    -1.39  0.29
## PhoneService*        2.00    1.0 -2.73     5.43  0.00
## MultipleLines*       3.00    2.0  0.12    -1.88  0.01
## InternetService*     3.00    2.0  0.21    -1.15  0.01
## OnlineSecurity*      3.00    2.0  0.42    -1.52  0.01
## OnlineBackup*        3.00    2.0  0.18    -1.68  0.01
## DeviceProtection*    3.00    2.0  0.19    -1.68  0.01
## TechSupport*         3.00    2.0  0.40    -1.54  0.01
## StreamingTV*         3.00    2.0  0.03    -1.72  0.01
## StreamingMovies*     3.00    2.0  0.01    -1.72  0.01
## Contract*            3.00    2.0  0.63    -1.27  0.01
## PaperlessBilling*    2.00    1.0 -0.38    -1.86  0.01
## PaymentMethod*       4.00    3.0 -0.17    -1.21  0.01
## MonthlyCharges     118.75  100.5 -0.22    -1.26  0.36
## TotalCharges      8684.80 8666.0  0.96    -0.23 27.03
## Churn*               2.00    1.0  1.06    -0.87  0.01

Datatypes of the columns

str(ibm.df)

## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

For easier interpretation, lets change some values

ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==0]<-'No'
ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==1]<-'Yes'
ibm.df$SeniorCitizen<-factor(ibm.df$SeniorCitizen)
str(ibm.df$SeniorCitizen)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

ibm.df$MultipleLines[ibm.df$MultipleLines=="No phone service"]<-'No'
ibm.df$MultipleLines<-factor(ibm.df$MultipleLines)
str(ibm.df$MultipleLines)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...

ibm.df$OnlineSecurity[ibm.df$OnlineSecurity=="No internet service"]<-'No'
ibm.df$OnlineSecurity<-factor(ibm.df$OnlineSecurity)
str(ibm.df$OnlineSecurity)

##  Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...

ibm.df$OnlineBackup[ibm.df$OnlineBackup=="No internet service"]<-'No'
ibm.df$OnlineBackup<-factor(ibm.df$OnlineBackup)
str(ibm.df$OnlineBackup)

##  Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...

ibm.df$DeviceProtection[ibm.df$DeviceProtection=="No internet service"]<-'No'
ibm.df$DeviceProtection<-factor(ibm.df$DeviceProtection)
str(ibm.df$DeviceProtection)

##  Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...

ibm.df$TechSupport[ibm.df$TechSupport=="No internet service"]<-'No'
ibm.df$TechSupport<-factor(ibm.df$TechSupport)
str(ibm.df$TechSupport)

##  Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...

ibm.df$StreamingTV[ibm.df$StreamingTV=="No internet service"]<-'No'
ibm.df$StreamingTV<-factor(ibm.df$StreamingTV)
str(ibm.df$StreamingTV)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...

ibm.df$StreamingMovies[ibm.df$StreamingMovies=="No internet service"]<-'No'
ibm.df$StreamingMovies<-factor(ibm.df$StreamingMovies)
str(ibm.df$StreamingMovies)

##  Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...

The data set now looks like

some(ibm.df)

##      customerID gender SeniorCitizen Partner Dependents tenure
## 2198 8548-AWOFC   Male            No     Yes         No     66
## 2815 7599-NTMDP Female            No     Yes        Yes     62
## 3029 8740-XLHDR   Male            No      No         No      5
## 3154 9099-FTUHS Female            No      No         No     23
## 3199 8564-LDKFL   Male            No     Yes         No     40
## 3897 8590-OHDIW Female            No     Yes        Yes     38
## 5024 6653-CBBOM Female            No      No         No      1
## 5404 0292-WEGCH Female            No     Yes        Yes     54
## 5778 6087-YPWHO   Male            No     Yes         No     72
## 6248 3428-MMGUB   Male            No      No         No     60
##      PhoneService MultipleLines InternetService OnlineSecurity
## 2198          Yes           Yes             DSL             No
## 2815           No            No             DSL            Yes
## 3029           No            No             DSL            Yes
## 3154          Yes           Yes             DSL            Yes
## 3199          Yes           Yes     Fiber optic             No
## 3897          Yes            No              No             No
## 5024          Yes            No     Fiber optic             No
## 5404          Yes           Yes             DSL             No
## 5778          Yes           Yes             DSL            Yes
## 6248          Yes           Yes     Fiber optic             No
##      OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
## 2198           No               No         Yes          No             Yes
## 2815          Yes               No         Yes         Yes              No
## 3029          Yes              Yes         Yes          No              No
## 3154           No               No          No          No              No
## 3199          Yes              Yes          No         Yes             Yes
## 3897           No               No          No          No              No
## 5024           No               No          No          No              No
## 5404          Yes              Yes         Yes         Yes             Yes
## 5778          Yes              Yes         Yes          No              No
## 6248           No              Yes          No         Yes              No
##            Contract PaperlessBilling             PaymentMethod
## 2198 Month-to-month               No          Electronic check
## 2815       Two year               No Bank transfer (automatic)
## 3029 Month-to-month              Yes              Mailed check
## 3154 Month-to-month               No          Electronic check
## 3199       One year              Yes Bank transfer (automatic)
## 3897       One year               No              Mailed check
## 5024 Month-to-month              Yes          Electronic check
## 5404 Month-to-month              Yes          Electronic check
## 5778       Two year               No              Mailed check
## 6248       Two year              Yes          Electronic check
##      MonthlyCharges TotalCharges Churn
## 2198          63.85      4264.60    No
## 2815          48.70      3008.55    No
## 3029          43.25       219.00   Yes
## 3154          54.40      1249.25    No
## 3199         106.00      4178.65    No
## 3897          20.30       749.35    No
## 5024          70.30        70.30   Yes
## 5404          86.20      4524.05    No
## 5778          68.15      4808.70    No
## 6248          89.55      5231.20    No

Analysis of the Variables

Gender Analysis

gender<-table(ibm.df$gender)
gender

## 
## Female   Male 
##   3488   3555

It would provide a clear picture to see how churn varies with gender

plot(ibm.df$gender,main="Number of males and females")

gender2<-xtabs(~ibm.df$Churn+ibm.df$gender)
gender2

##             ibm.df$gender
## ibm.df$Churn Female Male
##          No    2549 2625
##          Yes    939  930

Percentage of male and female participation in churn

gender3<-prop.table(gender2,2)*100
format(round(gender3, 2), nsmall = 2)

##             ibm.df$gender
## ibm.df$Churn Female  Male   
##          No  "73.08" "73.84"
##          Yes "26.92" "26.16"

This shows that around equal percentages of males and females left the subscription last month.

Senior Citizen

sc<-table(ibm.df$SeniorCitizen)
sc

## 
##   No  Yes 
## 5901 1142

plot(ibm.df$SeniorCitizen,main="Number of Senior Citizen")

It would provide a clear picture to see how churn varies with Senior citizenship.

sc2<-xtabs(~ibm.df$Churn+ibm.df$SeniorCitizen)
sc2

##             ibm.df$SeniorCitizen
## ibm.df$Churn   No  Yes
##          No  4508  666
##          Yes 1393  476

Percentage of male and female participation in churn

sc3<-prop.table(sc2,2)*100
format(round(sc3, 2), nsmall = 2)

##             ibm.df$SeniorCitizen
## ibm.df$Churn No      Yes    
##          No  "76.39" "58.32"
##          Yes "23.61" "41.68"

A greater percentage of senior citizens have not ended the subscription.

Partner Analysis

The number of people who have and don’t have partners

partner<-table(ibm.df$Partner)
partner

## 
##   No  Yes 
## 3641 3402

It would provide a clear picture to see how churn varies with partnership

plot(ibm.df$Partner,main="Number of Partners")

partner2<-xtabs(~ibm.df$Churn+ibm.df$Partner)
partner2

##             ibm.df$Partner
## ibm.df$Churn   No  Yes
##          No  2441 2733
##          Yes 1200  669

Percentage of participation in churn

partner3<-prop.table(partner2,2)*100
format(round(partner3, 2), nsmall = 2)

##             ibm.df$Partner
## ibm.df$Churn No      Yes    
##          No  "67.04" "80.34"
##          Yes "32.96" "19.66"

Dependents Analysis

The number of people who are and are not dependent.

dependent<-table(ibm.df$Dependents)
dependent

## 
##   No  Yes 
## 4933 2110

This shows that mostly people don’t have dependents.

plot(ibm.df$Partner,main="Number of People having Dependents")

How churn is affected by dependents.

dependent2<-xtabs(~ibm.df$Churn+ibm.df$Dependents)
dependent2

##             ibm.df$Dependents
## ibm.df$Churn   No  Yes
##          No  3390 1784
##          Yes 1543  326

Percentage of dependent participation in churn

dependent3<-prop.table(dependent2,2)*100
format(round(dependent3, 2), nsmall = 2)

##             ibm.df$Dependents
## ibm.df$Churn No      Yes    
##          No  "68.72" "84.55"
##          Yes "31.28" "15.45"

Tenure

tenure<-table(ibm.df$tenure)
tenure

## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
##  11 613 238 200 176 133 110 131 123 119 116  99 117 109  76  99  80  87 
##  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 
##  97  73  71  63  90  85  94  79  79  72  57  72  72  65  69  64  65  88 
##  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53 
##  50  65  59  56  64  70  65  65  51  61  74  68  64  66  68  68  80  70 
##  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71 
##  68  64  80  65  67  60  76  76  70  72  80  76  89  98 100  95 119 170 
##  72 
## 362

boxplot(ibm.df$tenure,horizontal = TRUE, main="Tenure of Subscribers",xlab="Months",col="grey")

hist(ibm.df$tenure,breaks = 30,main = "Frequency of Tenure Months",xlab = "Tenure",col="grey")

Average Tenure of all the Subscribers w.r.t churn

aggregate(tenure~Churn,data = ibm.df,mean)

##   Churn   tenure
## 1    No 37.56997
## 2   Yes 17.97913

library(lattice)
barchart(tenure~Churn,data = ibm.df,col="grey")

Mean tenure based on gender

aggregate(tenure~gender,data = ibm.df,mean)

##   gender   tenure
## 1 Female 32.24455
## 2   Male 32.49536

boxplot(tenure~gender,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Gender",xlab="Tenure")

Mean tenure based on Senior Citizenship

aggregate(tenure~SeniorCitizen,data = ibm.df,mean)

##   SeniorCitizen   tenure
## 1            No 32.19217
## 2           Yes 33.29597

boxplot(tenure~SeniorCitizen,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Senior Citizenship",xlab="Tenure")

Hypothesis 1

Hypothesis : The difference in mean tenure of those subscibers who ended their contract and those who did not is not significant.

t.test(tenure~Churn,data = ibm.df)

## 
##  Welch Two Sample t-test
## 
## data:  tenure by Churn
## t = 34.824, df = 4048.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  18.48789 20.69378
## sample estimates:
##  mean in group No mean in group Yes 
##          37.56997          17.97913

By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean tenure of those subscibers who ended their contract and those who did not.

Subscription Analysis

par(mfrow=c(3,3))
plot(ibm.df$PhoneService,main="People who subscribed for Phone Service")
plot(ibm.df$MultipleLines,main="People who subscribed for Multiple Lines")
plot(ibm.df$InternetService,main="People who subscribed for Internet Service")
plot(ibm.df$OnlineSecurity,main="People who subscribed for Online Security")
plot(ibm.df$OnlineBackup,main="People who subscribed for Online Backup")
plot(ibm.df$DeviceProtection,main="People who subscribed for Device Protection")
plot(ibm.df$TechSupport,main="People who subscribed for Tech Support")
plot(ibm.df$StreamingTV,main="People who subscribed for Streaming TV")
plot(ibm.df$StreamingMovies,main="People who subscribed for Streaming Movies")

Contract Analysis

Analysing the contract may give some insight about churn.

table(ibm.df$Contract)

## 
## Month-to-month       One year       Two year 
##           3875           1473           1695

plot(ibm.df$Contract,main="Frequency of Different Contracts")

It may be interesting to analyse how tenure varies with contracts.

aggregate(tenure~Contract,data = ibm.df,mean)

##         Contract   tenure
## 1 Month-to-month 18.03665
## 2       One year 42.04481
## 3       Two year 56.73510

boxplot(tenure~Contract,data = ibm.df,horizontal=TRUE,col="grey",main="Variation of Tenure with Contracts",xlab="Tenure")

Lets see how churn is affected by contracts made by subscribers.

xtabs(~ibm.df$Churn+ibm.df$Contract)

##             ibm.df$Contract
## ibm.df$Churn Month-to-month One year Two year
##          No            2220     1307     1647
##          Yes           1655      166       48

Monthly Charges Analysis

boxplot(ibm.df$MonthlyCharges,col="grey",horizontal=TRUE, main="Median of Monthly Charges")

Lets see how churn depends on Monthly Charges

aggregate(MonthlyCharges~Churn,data = ibm.df,mean)

##   Churn MonthlyCharges
## 1    No       61.26512
## 2   Yes       74.44133

boxplot(MonthlyCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Monthly Charges",xlab="Monthly Charges",ylab="Churn")

Variation of monthly charges with tenure and Churn

scatterplot(MonthlyCharges~tenure|Churn,data = ibm.df,cex=0.5)

It can be seen rom the above plot that majority of the subscribers who ended their contracts had high monthly charges and lower tenures.

Hypothesis 2

hypothesis: The differnce in monthly charges based on churn is not significant.

t.test(MonthlyCharges~Churn,data = ibm.df)

## 
##  Welch Two Sample t-test
## 
## data:  MonthlyCharges by Churn
## t = -18.408, df = 4135.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.57957 -11.77284
## sample estimates:
##  mean in group No mean in group Yes 
##          61.26512          74.44133

By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean monthly charges of those subscibers who ended their contract and those who did not.

Total Charges Analysis

boxplot(ibm.df$TotalCharges,col="grey",horizontal=TRUE, main="Median of Total Charges")

Lets see how churn depends on Total Charges

aggregate(TotalCharges~Churn,data = ibm.df,mean)

##   Churn TotalCharges
## 1    No     2555.344
## 2   Yes     1531.796

boxplot(TotalCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Total Charges",xlab="Total Charges",ylab="Churn")

Variation of Total charges with tenure and Churn

scatterplot(TotalCharges~tenure|Churn,data = ibm.df,cex=0.5)

It can be seen rom the above plot that majority of the subscribers who ended their contracts had high total charges.

Hypothesis 3

hypothesis: The differnce in total charges based on churn is not significant.

t.test(TotalCharges~Churn,data = ibm.df)

## 
##  Welch Two Sample t-test
## 
## data:  TotalCharges by Churn
## t = 18.801, df = 4042.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   916.8121 1130.2840
## sample estimates:
##  mean in group No mean in group Yes 
##          2555.344          1531.796

By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean total charges of those subscibers who ended their contract and those who did not.

Logistic Regression

Since Churn is a factor variable, a linear regression cannot be applied as we need to analyse the dependence of churn on the other variables. As a result, a logistic regression is applied for the analysis.

Now since this is a logistic regression, we need to divide the data into two sets: ‘train’ - on which the model of the regression will be based on and ‘test’ - on which the model will be tested.

train <- ibm.df[1:6950,]
test <- ibm.df[6951:7043,]

Model1

The dependent variable is Churn and all other variables( excluding customer ID) are considered to be independent.

Model: \[Churn = \beta_0 + \beta_1TotalCharges + \beta_2MonthlyCharges + \beta_3tenure + \beta_4gender + \beta_5PhoneService + \beta_6SeniorCitizen + \beta_7Partner + \beta_8Dependents + \beta_9MultipleLines + \beta_10InternetService + \beta_11OnlineSecurity + \beta_12OnlineBackup + \beta_13DeviceProtection + \beta_14TechSupport + \beta_15StreamingTV + \beta_16StreamingMovies + \beta_17Contract + \epsilon\]

model1=glm(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data = train,family = binomial)

summary(model1)

## 
## Call:
## glm(formula = Churn ~ TotalCharges + MonthlyCharges + tenure + 
##     gender + PhoneService + SeniorCitizen + Partner + Dependents + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8752  -0.6925  -0.2906   0.7552   3.3872  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.823e+00  8.112e-01   2.248  0.02459 *  
## TotalCharges                3.038e-04  7.051e-05   4.308 1.65e-05 ***
## MonthlyCharges             -5.439e-02  3.188e-02  -1.706  0.08796 .  
## tenure                     -5.889e-02  6.177e-03  -9.533  < 2e-16 ***
## genderMale                 -2.357e-02  6.494e-02  -0.363  0.71667    
## PhoneServiceYes             4.140e-01  6.512e-01   0.636  0.52499    
## SeniorCitizenYes            2.431e-01  8.460e-02   2.873  0.00406 ** 
## PartnerYes                  1.015e-02  7.784e-02   0.130  0.89628    
## DependentsYes              -1.650e-01  8.977e-02  -1.837  0.06614 .  
## MultipleLinesYes            5.316e-01  1.781e-01   2.986  0.00283 ** 
## InternetServiceFiber optic  2.227e+00  8.010e-01   2.780  0.00544 ** 
## InternetServiceNo          -2.296e+00  8.100e-01  -2.834  0.00459 ** 
## OnlineSecurityYes          -1.806e-01  1.793e-01  -1.007  0.31379    
## OnlineBackupYes             9.931e-02  1.761e-01   0.564  0.57273    
## DeviceProtectionYes         2.115e-01  1.768e-01   1.196  0.23152    
## TechSupportYes             -1.456e-01  1.811e-01  -0.804  0.42133    
## StreamingTVYes              7.939e-01  3.277e-01   2.423  0.01540 *  
## StreamingMoviesYes          7.939e-01  3.277e-01   2.422  0.01542 *  
## ContractOne year           -7.368e-01  1.076e-01  -6.845 7.65e-12 ***
## ContractTwo year           -1.459e+00  1.757e-01  -8.302  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8029.0  on 6938  degrees of freedom
## Residual deviance: 5789.6  on 6919  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 5829.6
## 
## Number of Fisher Scoring iterations: 6

From the above regression analysis, its clear that Total Charges, Tenure and Contracts are the most significant variables in determining the churn in the company.

library(leaps)
leap1<- regsubsets(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data=train,nbest=1)
summary(leap1)

## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + MonthlyCharges + tenure + 
##     gender + PhoneService + SeniorCitizen + Partner + Dependents + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract, data = train, nbest = 1)
## 19 Variables  (and intercept)
##                            Forced in Forced out
## TotalCharges                   FALSE      FALSE
## MonthlyCharges                 FALSE      FALSE
## tenure                         FALSE      FALSE
## genderMale                     FALSE      FALSE
## PhoneServiceYes                FALSE      FALSE
## SeniorCitizenYes               FALSE      FALSE
## PartnerYes                     FALSE      FALSE
## DependentsYes                  FALSE      FALSE
## MultipleLinesYes               FALSE      FALSE
## InternetServiceFiber optic     FALSE      FALSE
## InternetServiceNo              FALSE      FALSE
## OnlineSecurityYes              FALSE      FALSE
## OnlineBackupYes                FALSE      FALSE
## DeviceProtectionYes            FALSE      FALSE
## TechSupportYes                 FALSE      FALSE
## StreamingTVYes                 FALSE      FALSE
## StreamingMoviesYes             FALSE      FALSE
## ContractOne year               FALSE      FALSE
## ContractTwo year               FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          TotalCharges MonthlyCharges tenure genderMale PhoneServiceYes
## 1  ( 1 ) " "          " "            "*"    " "        " "            
## 2  ( 1 ) " "          " "            "*"    " "        " "            
## 3  ( 1 ) "*"          " "            " "    " "        " "            
## 4  ( 1 ) "*"          " "            " "    " "        " "            
## 5  ( 1 ) "*"          "*"            " "    " "        "*"            
## 6  ( 1 ) "*"          " "            " "    " "        " "            
## 7  ( 1 ) "*"          "*"            " "    " "        "*"            
## 8  ( 1 ) "*"          "*"            " "    " "        "*"            
##          SeniorCitizenYes PartnerYes DependentsYes MultipleLinesYes
## 1  ( 1 ) " "              " "        " "           " "             
## 2  ( 1 ) " "              " "        " "           " "             
## 3  ( 1 ) " "              " "        " "           " "             
## 4  ( 1 ) " "              " "        " "           " "             
## 5  ( 1 ) " "              " "        " "           " "             
## 6  ( 1 ) " "              " "        " "           " "             
## 7  ( 1 ) " "              " "        " "           " "             
## 8  ( 1 ) " "              " "        " "           " "             
##          InternetServiceFiber optic InternetServiceNo OnlineSecurityYes
## 1  ( 1 ) " "                        " "               " "              
## 2  ( 1 ) "*"                        " "               " "              
## 3  ( 1 ) "*"                        "*"               " "              
## 4  ( 1 ) "*"                        "*"               " "              
## 5  ( 1 ) " "                        " "               "*"              
## 6  ( 1 ) "*"                        "*"               " "              
## 7  ( 1 ) " "                        " "               "*"              
## 8  ( 1 ) " "                        " "               "*"              
##          OnlineBackupYes DeviceProtectionYes TechSupportYes StreamingTVYes
## 1  ( 1 ) " "             " "                 " "            " "           
## 2  ( 1 ) " "             " "                 " "            " "           
## 3  ( 1 ) " "             " "                 " "            " "           
## 4  ( 1 ) " "             " "                 " "            " "           
## 5  ( 1 ) " "             " "                 "*"            " "           
## 6  ( 1 ) " "             " "                 " "            " "           
## 7  ( 1 ) " "             " "                 "*"            " "           
## 8  ( 1 ) "*"             " "                 "*"            " "           
##          StreamingMoviesYes ContractOne year ContractTwo year
## 1  ( 1 ) " "                " "              " "             
## 2  ( 1 ) " "                " "              " "             
## 3  ( 1 ) " "                " "              " "             
## 4  ( 1 ) "*"                " "              " "             
## 5  ( 1 ) " "                " "              " "             
## 6  ( 1 ) "*"                "*"              "*"             
## 7  ( 1 ) " "                "*"              "*"             
## 8  ( 1 ) " "                "*"              "*"

plot(leap1,scale="adjr2")

Coefficiencts of Model1

Let us view the coffecients of model1 again.

model1$coefficients

##                (Intercept)               TotalCharges 
##               1.8233111979               0.0003037555 
##             MonthlyCharges                     tenure 
##              -0.0543932133              -0.0588872582 
##                 genderMale            PhoneServiceYes 
##              -0.0235657745               0.4139685779 
##           SeniorCitizenYes                 PartnerYes 
##               0.2430841292               0.0101467163 
##              DependentsYes           MultipleLinesYes 
##              -0.1649522270               0.5316171354 
## InternetServiceFiber optic          InternetServiceNo 
##               2.2267266925              -2.2958244053 
##          OnlineSecurityYes            OnlineBackupYes 
##              -0.1805734677               0.0993077378 
##        DeviceProtectionYes             TechSupportYes 
##               0.2114850749              -0.1456074125 
##             StreamingTVYes         StreamingMoviesYes 
##               0.7938830923               0.7939111022 
##           ContractOne year           ContractTwo year 
##              -0.7368360340              -1.4585078047

library(coefplot)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

coefplot(model1)

Inference for model1

As is clear this model seems wrong as it shows that monthly charges and phone service subscription (and many other) are not significant to the churn in the company.

Therefore, this model is not good enough for the data.

Model2

This time let us ditch some variables.

Model2: \[Churn = \beta_0 + \beta_1TotalCharges + \beta_2tenure + \beta_3gender + \beta_4PhoneService + \beta_5SeniorCitizen + \beta_6MultipleLines + \beta_7InternetService + \beta_8Contract + \epsilon\]

We have chosen this model as these are the most basic and sidnificant factors on which churn could result.

model2=glm(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data = train,family = binomial)
summary(model2)

## 
## Call:
## glm(formula = Churn ~ TotalCharges + tenure + gender + PhoneService + 
##     SeniorCitizen + MultipleLines + InternetService + Contract, 
##     family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7307  -0.7093  -0.3050   0.8173   3.5096  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 3.483e-01  1.234e-01   2.821 0.004786 ** 
## TotalCharges                3.120e-04  6.316e-05   4.939 7.86e-07 ***
## tenure                     -6.203e-02  5.885e-03 -10.540  < 2e-16 ***
## genderMale                 -1.477e-02  6.417e-02  -0.230 0.817903    
## PhoneServiceYes            -7.805e-01  1.295e-01  -6.026 1.68e-09 ***
## SeniorCitizenYes            3.299e-01  8.173e-02   4.036 5.43e-05 ***
## MultipleLinesYes            3.037e-01  7.862e-02   3.863 0.000112 ***
## InternetServiceFiber optic  1.083e+00  9.291e-02  11.653  < 2e-16 ***
## InternetServiceNo          -7.178e-01  1.277e-01  -5.619 1.92e-08 ***
## ContractOne year           -8.096e-01  1.050e-01  -7.708 1.28e-14 ***
## ContractTwo year           -1.675e+00  1.727e-01  -9.700  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8029.0  on 6938  degrees of freedom
## Residual deviance: 5888.3  on 6928  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 5910.3
## 
## Number of Fisher Scoring iterations: 6

This model is seems satisfactory enough as except gender all the other variables seem to significant(p value<0.05).

leap2<- regsubsets(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data=train,nbest=1)
summary(leap2)

## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + tenure + gender + PhoneService + 
##     SeniorCitizen + MultipleLines + InternetService + Contract, 
##     data = train, nbest = 1)
## 10 Variables  (and intercept)
##                            Forced in Forced out
## TotalCharges                   FALSE      FALSE
## tenure                         FALSE      FALSE
## genderMale                     FALSE      FALSE
## PhoneServiceYes                FALSE      FALSE
## SeniorCitizenYes               FALSE      FALSE
## MultipleLinesYes               FALSE      FALSE
## InternetServiceFiber optic     FALSE      FALSE
## InternetServiceNo              FALSE      FALSE
## ContractOne year               FALSE      FALSE
## ContractTwo year               FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          TotalCharges tenure genderMale PhoneServiceYes SeniorCitizenYes
## 1  ( 1 ) " "          "*"    " "        " "             " "             
## 2  ( 1 ) " "          "*"    " "        " "             " "             
## 3  ( 1 ) "*"          " "    " "        " "             " "             
## 4  ( 1 ) "*"          " "    " "        " "             " "             
## 5  ( 1 ) "*"          " "    " "        " "             " "             
## 6  ( 1 ) "*"          " "    " "        " "             " "             
## 7  ( 1 ) "*"          " "    " "        " "             "*"             
## 8  ( 1 ) "*"          "*"    " "        " "             "*"             
##          MultipleLinesYes InternetServiceFiber optic InternetServiceNo
## 1  ( 1 ) " "              " "                        " "              
## 2  ( 1 ) " "              "*"                        " "              
## 3  ( 1 ) " "              "*"                        "*"              
## 4  ( 1 ) " "              "*"                        "*"              
## 5  ( 1 ) " "              "*"                        "*"              
## 6  ( 1 ) "*"              "*"                        "*"              
## 7  ( 1 ) "*"              "*"                        "*"              
## 8  ( 1 ) "*"              "*"                        "*"              
##          ContractOne year ContractTwo year
## 1  ( 1 ) " "              " "             
## 2  ( 1 ) " "              " "             
## 3  ( 1 ) " "              " "             
## 4  ( 1 ) "*"              " "             
## 5  ( 1 ) "*"              "*"             
## 6  ( 1 ) "*"              "*"             
## 7  ( 1 ) "*"              "*"             
## 8  ( 1 ) "*"              "*"

plot(leap2,scale="adjr2")

Coefficients of Model2

model2$coefficients

##                (Intercept)               TotalCharges 
##               0.3482585988               0.0003119577 
##                     tenure                 genderMale 
##              -0.0620319045              -0.0147743337 
##            PhoneServiceYes           SeniorCitizenYes 
##              -0.7805109440               0.3298960142 
##           MultipleLinesYes InternetServiceFiber optic 
##               0.3037316602               1.0826036296 
##          InternetServiceNo           ContractOne year 
##              -0.7177828701              -0.8096150070 
##           ContractTwo year 
##              -1.6749967414

coefplot(model2)

The model seems good enough to predict the churn.

Lets see how this model behaves for the dataset test.

Test the Dataset

fitted.results <- predict(model2,newdata=subset(test,select=c(2,3,6,7,8,9,16,20)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,'Yes','No')

misClasificError <- mean(fitted.results != test$Churn)
print(paste('Accuracy',1-misClasificError))

## [1] "Accuracy 0.795698924731183"

The accuracy of prediction is above 75%. This proves that the model is good enough.

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

p <- predict(model2, newdata=subset(test,select=c(2,3,6,7,8,9,16,20)), type="response")
pr <- prediction(p, test$Churn)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.8318681

Inference of Model2

Since auc value is closer is closer to 1 than to 0.5, model2 is a good predictive model.
Since it is a logistic regression the coefficients in the analysis are the log of odds of churn.
- The analysis shows that a 1 unit increase in total charges increases the log of odds of churn.
- The analysis shows that a 1 unit increase in tenure decreases the log of odds of churn.
- The analysis shows that gender is not a significant factor in churn determination.
- The analysis shows that being a senior citizen increases the log of odds of churn.
- The analysis shows that having subscribed for the phone service decreases the log of odds of churn.
- The analysis shows that having subscribed for the multiple lines increases the log of odds of churn.
- The analysis shows that having subscribed for the fiber optic internet service increases the log of odds of churn.
- The analysis shows that having subscribed for no internet service increases the log of odds of churn.
- The analysis shows that having subscribed for one and two year contracts with the company decreases the log of odds of churn.
The accuracy level of the model is > 75%.

The model passes all the tests for being applied in the analysis.

                                              THE END

Churn in Telecommunications Company

Rahul Poddar

26 January 2018