This file analyses what factors govern the churn of a telecommunications company.
ibm.df<-read.csv(paste("IBM.csv"),)
Analysis is best when the dataset is rich in data.
dim(ibm.df)
## [1] 7043 21
library(car)
some(ibm.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 681 1448-PWKYE Male 0 Yes Yes 1
## 754 3115-CZMZD Male 0 No Yes 0
## 1985 4373-MAVJG Female 0 Yes Yes 14
## 2300 6177-PEVRA Female 0 No No 48
## 2750 9086-YJYXS Male 0 Yes Yes 34
## 3667 3194-ORPIK Female 0 Yes Yes 50
## 4277 1226-UDFZR Female 0 No No 2
## 4591 4884-TVUQF Female 1 No No 57
## 4797 3807-BPOMJ Female 0 Yes No 55
## 6435 6608-QQLVK Male 0 No No 1
## PhoneService MultipleLines InternetService OnlineSecurity
## 681 Yes No Fiber optic No
## 754 Yes No No No internet service
## 1985 Yes Yes Fiber optic No
## 2300 Yes No DSL Yes
## 2750 Yes Yes DSL No
## 3667 Yes Yes Fiber optic No
## 4277 Yes No DSL No
## 4591 Yes Yes Fiber optic Yes
## 4797 Yes No Fiber optic Yes
## 6435 Yes No DSL No
## OnlineBackup DeviceProtection TechSupport
## 681 No No No
## 754 No internet service No internet service No internet service
## 1985 Yes No No
## 2300 Yes No No
## 2750 No Yes No
## 3667 No No No
## 4277 Yes No No
## 4591 No Yes Yes
## 4797 No No No
## 6435 No Yes No
## StreamingTV StreamingMovies Contract
## 681 No Yes Month-to-month
## 754 No internet service No internet service Two year
## 1985 No Yes Month-to-month
## 2300 No No Two year
## 2750 Yes Yes One year
## 3667 Yes No Month-to-month
## 4277 No No Month-to-month
## 4591 No Yes Two year
## 4797 Yes Yes One year
## 6435 No No Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges
## 681 No Electronic check 80.00
## 754 No Mailed check 20.25
## 1985 Yes Bank transfer (automatic) 90.90
## 2300 No Credit card (automatic) 55.50
## 2750 No Bank transfer (automatic) 77.20
## 3667 Yes Bank transfer (automatic) 84.40
## 4277 Yes Mailed check 49.60
## 4591 No Credit card (automatic) 101.30
## 4797 Yes Electronic check 94.75
## 6435 Yes Electronic check 50.50
## TotalCharges Churn
## 681 80.00 Yes
## 754 NA No
## 1985 1259.00 Yes
## 2300 2627.35 No
## 2750 2753.80 No
## 3667 4116.15 Yes
## 4277 114.70 Yes
## 4591 5779.60 No
## 4797 5276.10 No
## 6435 50.50 Yes
summary(ibm.df)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies
## No :2810 No :2785
## No internet service:1526 No internet service:1526
## Yes :2707 Yes :2732
##
##
##
##
## Contract PaperlessBilling PaymentMethod
## Month-to-month:3875 No :2872 Bank transfer (automatic):1544
## One year :1473 Yes:4171 Credit card (automatic) :1522
## Two year :1695 Electronic check :2365
## Mailed check :1612
##
##
##
## MonthlyCharges TotalCharges Churn
## Min. : 18.25 Min. : 18.8 No :5174
## 1st Qu.: 35.50 1st Qu.: 401.4 Yes:1869
## Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
describe(ibm.df)
## vars n mean sd median trimmed mad min
## customerID* 1 7043 3522.00 2033.28 3522.00 3522.00 2610.86 1.00
## gender* 2 7043 1.50 0.50 2.00 1.51 0.00 1.00
## SeniorCitizen 3 7043 0.16 0.37 0.00 0.08 0.00 0.00
## Partner* 4 7043 1.48 0.50 1.00 1.48 0.00 1.00
## Dependents* 5 7043 1.30 0.46 1.00 1.25 0.00 1.00
## tenure 6 7043 32.37 24.56 29.00 31.43 32.62 0.00
## PhoneService* 7 7043 1.90 0.30 2.00 2.00 0.00 1.00
## MultipleLines* 8 7043 1.94 0.95 2.00 1.93 1.48 1.00
## InternetService* 9 7043 1.87 0.74 2.00 1.84 1.48 1.00
## OnlineSecurity* 10 7043 1.79 0.86 2.00 1.74 1.48 1.00
## OnlineBackup* 11 7043 1.91 0.88 2.00 1.88 1.48 1.00
## DeviceProtection* 12 7043 1.90 0.88 2.00 1.88 1.48 1.00
## TechSupport* 13 7043 1.80 0.86 2.00 1.75 1.48 1.00
## StreamingTV* 14 7043 1.99 0.89 2.00 1.98 1.48 1.00
## StreamingMovies* 15 7043 1.99 0.89 2.00 1.99 1.48 1.00
## Contract* 16 7043 1.69 0.83 1.00 1.61 0.00 1.00
## PaperlessBilling* 17 7043 1.59 0.49 2.00 1.62 0.00 1.00
## PaymentMethod* 18 7043 2.57 1.07 3.00 2.59 1.48 1.00
## MonthlyCharges 19 7043 64.76 30.09 70.35 64.97 35.66 18.25
## TotalCharges 20 7032 2283.30 2266.77 1397.47 1970.14 1812.92 18.80
## Churn* 21 7043 1.27 0.44 1.00 1.21 0.00 1.00
## max range skew kurtosis se
## customerID* 7043.00 7042.0 0.00 -1.20 24.23
## gender* 2.00 1.0 -0.02 -2.00 0.01
## SeniorCitizen 1.00 1.0 1.83 1.36 0.00
## Partner* 2.00 1.0 0.07 -2.00 0.01
## Dependents* 2.00 1.0 0.87 -1.23 0.01
## tenure 72.00 72.0 0.24 -1.39 0.29
## PhoneService* 2.00 1.0 -2.73 5.43 0.00
## MultipleLines* 3.00 2.0 0.12 -1.88 0.01
## InternetService* 3.00 2.0 0.21 -1.15 0.01
## OnlineSecurity* 3.00 2.0 0.42 -1.52 0.01
## OnlineBackup* 3.00 2.0 0.18 -1.68 0.01
## DeviceProtection* 3.00 2.0 0.19 -1.68 0.01
## TechSupport* 3.00 2.0 0.40 -1.54 0.01
## StreamingTV* 3.00 2.0 0.03 -1.72 0.01
## StreamingMovies* 3.00 2.0 0.01 -1.72 0.01
## Contract* 3.00 2.0 0.63 -1.27 0.01
## PaperlessBilling* 2.00 1.0 -0.38 -1.86 0.01
## PaymentMethod* 4.00 3.0 -0.17 -1.21 0.01
## MonthlyCharges 118.75 100.5 -0.22 -1.26 0.36
## TotalCharges 8684.80 8666.0 0.96 -0.23 27.03
## Churn* 2.00 1.0 1.06 -0.87 0.01
str(ibm.df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
For easier interpretation, lets change some values
ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==0]<-'No'
ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==1]<-'Yes'
ibm.df$SeniorCitizen<-factor(ibm.df$SeniorCitizen)
str(ibm.df$SeniorCitizen)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
ibm.df$MultipleLines[ibm.df$MultipleLines=="No phone service"]<-'No'
ibm.df$MultipleLines<-factor(ibm.df$MultipleLines)
str(ibm.df$MultipleLines)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
ibm.df$OnlineSecurity[ibm.df$OnlineSecurity=="No internet service"]<-'No'
ibm.df$OnlineSecurity<-factor(ibm.df$OnlineSecurity)
str(ibm.df$OnlineSecurity)
## Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
ibm.df$OnlineBackup[ibm.df$OnlineBackup=="No internet service"]<-'No'
ibm.df$OnlineBackup<-factor(ibm.df$OnlineBackup)
str(ibm.df$OnlineBackup)
## Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
ibm.df$DeviceProtection[ibm.df$DeviceProtection=="No internet service"]<-'No'
ibm.df$DeviceProtection<-factor(ibm.df$DeviceProtection)
str(ibm.df$DeviceProtection)
## Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
ibm.df$TechSupport[ibm.df$TechSupport=="No internet service"]<-'No'
ibm.df$TechSupport<-factor(ibm.df$TechSupport)
str(ibm.df$TechSupport)
## Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
ibm.df$StreamingTV[ibm.df$StreamingTV=="No internet service"]<-'No'
ibm.df$StreamingTV<-factor(ibm.df$StreamingTV)
str(ibm.df$StreamingTV)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
ibm.df$StreamingMovies[ibm.df$StreamingMovies=="No internet service"]<-'No'
ibm.df$StreamingMovies<-factor(ibm.df$StreamingMovies)
str(ibm.df$StreamingMovies)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
The data set now looks like
some(ibm.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 2198 8548-AWOFC Male No Yes No 66
## 2815 7599-NTMDP Female No Yes Yes 62
## 3029 8740-XLHDR Male No No No 5
## 3154 9099-FTUHS Female No No No 23
## 3199 8564-LDKFL Male No Yes No 40
## 3897 8590-OHDIW Female No Yes Yes 38
## 5024 6653-CBBOM Female No No No 1
## 5404 0292-WEGCH Female No Yes Yes 54
## 5778 6087-YPWHO Male No Yes No 72
## 6248 3428-MMGUB Male No No No 60
## PhoneService MultipleLines InternetService OnlineSecurity
## 2198 Yes Yes DSL No
## 2815 No No DSL Yes
## 3029 No No DSL Yes
## 3154 Yes Yes DSL Yes
## 3199 Yes Yes Fiber optic No
## 3897 Yes No No No
## 5024 Yes No Fiber optic No
## 5404 Yes Yes DSL No
## 5778 Yes Yes DSL Yes
## 6248 Yes Yes Fiber optic No
## OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
## 2198 No No Yes No Yes
## 2815 Yes No Yes Yes No
## 3029 Yes Yes Yes No No
## 3154 No No No No No
## 3199 Yes Yes No Yes Yes
## 3897 No No No No No
## 5024 No No No No No
## 5404 Yes Yes Yes Yes Yes
## 5778 Yes Yes Yes No No
## 6248 No Yes No Yes No
## Contract PaperlessBilling PaymentMethod
## 2198 Month-to-month No Electronic check
## 2815 Two year No Bank transfer (automatic)
## 3029 Month-to-month Yes Mailed check
## 3154 Month-to-month No Electronic check
## 3199 One year Yes Bank transfer (automatic)
## 3897 One year No Mailed check
## 5024 Month-to-month Yes Electronic check
## 5404 Month-to-month Yes Electronic check
## 5778 Two year No Mailed check
## 6248 Two year Yes Electronic check
## MonthlyCharges TotalCharges Churn
## 2198 63.85 4264.60 No
## 2815 48.70 3008.55 No
## 3029 43.25 219.00 Yes
## 3154 54.40 1249.25 No
## 3199 106.00 4178.65 No
## 3897 20.30 749.35 No
## 5024 70.30 70.30 Yes
## 5404 86.20 4524.05 No
## 5778 68.15 4808.70 No
## 6248 89.55 5231.20 No
gender<-table(ibm.df$gender)
gender
##
## Female Male
## 3488 3555
It would provide a clear picture to see how churn varies with gender
plot(ibm.df$gender,main="Number of males and females")
gender2<-xtabs(~ibm.df$Churn+ibm.df$gender)
gender2
## ibm.df$gender
## ibm.df$Churn Female Male
## No 2549 2625
## Yes 939 930
Percentage of male and female participation in churn
gender3<-prop.table(gender2,2)*100
format(round(gender3, 2), nsmall = 2)
## ibm.df$gender
## ibm.df$Churn Female Male
## No "73.08" "73.84"
## Yes "26.92" "26.16"
This shows that around equal percentages of males and females left the subscription last month.
sc<-table(ibm.df$SeniorCitizen)
sc
##
## No Yes
## 5901 1142
plot(ibm.df$SeniorCitizen,main="Number of Senior Citizen")
It would provide a clear picture to see how churn varies with Senior citizenship.
sc2<-xtabs(~ibm.df$Churn+ibm.df$SeniorCitizen)
sc2
## ibm.df$SeniorCitizen
## ibm.df$Churn No Yes
## No 4508 666
## Yes 1393 476
Percentage of male and female participation in churn
sc3<-prop.table(sc2,2)*100
format(round(sc3, 2), nsmall = 2)
## ibm.df$SeniorCitizen
## ibm.df$Churn No Yes
## No "76.39" "58.32"
## Yes "23.61" "41.68"
A greater percentage of senior citizens have not ended the subscription.
The number of people who have and don’t have partners
partner<-table(ibm.df$Partner)
partner
##
## No Yes
## 3641 3402
It would provide a clear picture to see how churn varies with partnership
plot(ibm.df$Partner,main="Number of Partners")
partner2<-xtabs(~ibm.df$Churn+ibm.df$Partner)
partner2
## ibm.df$Partner
## ibm.df$Churn No Yes
## No 2441 2733
## Yes 1200 669
Percentage of participation in churn
partner3<-prop.table(partner2,2)*100
format(round(partner3, 2), nsmall = 2)
## ibm.df$Partner
## ibm.df$Churn No Yes
## No "67.04" "80.34"
## Yes "32.96" "19.66"
The number of people who are and are not dependent.
dependent<-table(ibm.df$Dependents)
dependent
##
## No Yes
## 4933 2110
This shows that mostly people don’t have dependents.
plot(ibm.df$Partner,main="Number of People having Dependents")
How churn is affected by dependents.
dependent2<-xtabs(~ibm.df$Churn+ibm.df$Dependents)
dependent2
## ibm.df$Dependents
## ibm.df$Churn No Yes
## No 3390 1784
## Yes 1543 326
Percentage of dependent participation in churn
dependent3<-prop.table(dependent2,2)*100
format(round(dependent3, 2), nsmall = 2)
## ibm.df$Dependents
## ibm.df$Churn No Yes
## No "68.72" "84.55"
## Yes "31.28" "15.45"
tenure<-table(ibm.df$tenure)
tenure
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 11 613 238 200 176 133 110 131 123 119 116 99 117 109 76 99 80 87
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 97 73 71 63 90 85 94 79 79 72 57 72 72 65 69 64 65 88
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 50 65 59 56 64 70 65 65 51 61 74 68 64 66 68 68 80 70
## 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
## 68 64 80 65 67 60 76 76 70 72 80 76 89 98 100 95 119 170
## 72
## 362
boxplot(ibm.df$tenure,horizontal = TRUE, main="Tenure of Subscribers",xlab="Months",col="grey")
hist(ibm.df$tenure,breaks = 30,main = "Frequency of Tenure Months",xlab = "Tenure",col="grey")
Average Tenure of all the Subscribers w.r.t churn
aggregate(tenure~Churn,data = ibm.df,mean)
## Churn tenure
## 1 No 37.56997
## 2 Yes 17.97913
library(lattice)
barchart(tenure~Churn,data = ibm.df,col="grey")
aggregate(tenure~gender,data = ibm.df,mean)
## gender tenure
## 1 Female 32.24455
## 2 Male 32.49536
boxplot(tenure~gender,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Gender",xlab="Tenure")
aggregate(tenure~SeniorCitizen,data = ibm.df,mean)
## SeniorCitizen tenure
## 1 No 32.19217
## 2 Yes 33.29597
boxplot(tenure~SeniorCitizen,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Senior Citizenship",xlab="Tenure")
Hypothesis : The difference in mean tenure of those subscibers who ended their contract and those who did not is not significant.
t.test(tenure~Churn,data = ibm.df)
##
## Welch Two Sample t-test
##
## data: tenure by Churn
## t = 34.824, df = 4048.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 18.48789 20.69378
## sample estimates:
## mean in group No mean in group Yes
## 37.56997 17.97913
By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean tenure of those subscibers who ended their contract and those who did not.
par(mfrow=c(3,3))
plot(ibm.df$PhoneService,main="People who subscribed for Phone Service")
plot(ibm.df$MultipleLines,main="People who subscribed for Multiple Lines")
plot(ibm.df$InternetService,main="People who subscribed for Internet Service")
plot(ibm.df$OnlineSecurity,main="People who subscribed for Online Security")
plot(ibm.df$OnlineBackup,main="People who subscribed for Online Backup")
plot(ibm.df$DeviceProtection,main="People who subscribed for Device Protection")
plot(ibm.df$TechSupport,main="People who subscribed for Tech Support")
plot(ibm.df$StreamingTV,main="People who subscribed for Streaming TV")
plot(ibm.df$StreamingMovies,main="People who subscribed for Streaming Movies")
Analysing the contract may give some insight about churn.
table(ibm.df$Contract)
##
## Month-to-month One year Two year
## 3875 1473 1695
plot(ibm.df$Contract,main="Frequency of Different Contracts")
It may be interesting to analyse how tenure varies with contracts.
aggregate(tenure~Contract,data = ibm.df,mean)
## Contract tenure
## 1 Month-to-month 18.03665
## 2 One year 42.04481
## 3 Two year 56.73510
boxplot(tenure~Contract,data = ibm.df,horizontal=TRUE,col="grey",main="Variation of Tenure with Contracts",xlab="Tenure")
Lets see how churn is affected by contracts made by subscribers.
xtabs(~ibm.df$Churn+ibm.df$Contract)
## ibm.df$Contract
## ibm.df$Churn Month-to-month One year Two year
## No 2220 1307 1647
## Yes 1655 166 48
boxplot(ibm.df$MonthlyCharges,col="grey",horizontal=TRUE, main="Median of Monthly Charges")
Lets see how churn depends on Monthly Charges
aggregate(MonthlyCharges~Churn,data = ibm.df,mean)
## Churn MonthlyCharges
## 1 No 61.26512
## 2 Yes 74.44133
boxplot(MonthlyCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Monthly Charges",xlab="Monthly Charges",ylab="Churn")
scatterplot(MonthlyCharges~tenure|Churn,data = ibm.df,cex=0.5)
It can be seen rom the above plot that majority of the subscribers who ended their contracts had high monthly charges and lower tenures.
hypothesis: The differnce in monthly charges based on churn is not significant.
t.test(MonthlyCharges~Churn,data = ibm.df)
##
## Welch Two Sample t-test
##
## data: MonthlyCharges by Churn
## t = -18.408, df = 4135.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.57957 -11.77284
## sample estimates:
## mean in group No mean in group Yes
## 61.26512 74.44133
By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean monthly charges of those subscibers who ended their contract and those who did not.
boxplot(ibm.df$TotalCharges,col="grey",horizontal=TRUE, main="Median of Total Charges")
Lets see how churn depends on Total Charges
aggregate(TotalCharges~Churn,data = ibm.df,mean)
## Churn TotalCharges
## 1 No 2555.344
## 2 Yes 1531.796
boxplot(TotalCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Total Charges",xlab="Total Charges",ylab="Churn")
scatterplot(TotalCharges~tenure|Churn,data = ibm.df,cex=0.5)
It can be seen rom the above plot that majority of the subscribers who ended their contracts had high total charges.
hypothesis: The differnce in total charges based on churn is not significant.
t.test(TotalCharges~Churn,data = ibm.df)
##
## Welch Two Sample t-test
##
## data: TotalCharges by Churn
## t = 18.801, df = 4042.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 916.8121 1130.2840
## sample estimates:
## mean in group No mean in group Yes
## 2555.344 1531.796
By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean total charges of those subscibers who ended their contract and those who did not.
Since Churn is a factor variable, a linear regression cannot be applied as we need to analyse the dependence of churn on the other variables. As a result, a logistic regression is applied for the analysis.
Now since this is a logistic regression, we need to divide the data into two sets: ‘train’ - on which the model of the regression will be based on and ‘test’ - on which the model will be tested.
train <- ibm.df[1:6950,]
test <- ibm.df[6951:7043,]
The dependent variable is Churn and all other variables( excluding customer ID) are considered to be independent.
Model: \[Churn = \beta_0 + \beta_1TotalCharges + \beta_2MonthlyCharges + \beta_3tenure + \beta_4gender + \beta_5PhoneService + \beta_6SeniorCitizen + \beta_7Partner + \beta_8Dependents + \beta_9MultipleLines + \beta_10InternetService + \beta_11OnlineSecurity + \beta_12OnlineBackup + \beta_13DeviceProtection + \beta_14TechSupport + \beta_15StreamingTV + \beta_16StreamingMovies + \beta_17Contract + \epsilon\]
model1=glm(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data = train,family = binomial)
summary(model1)
##
## Call:
## glm(formula = Churn ~ TotalCharges + MonthlyCharges + tenure +
## gender + PhoneService + SeniorCitizen + Partner + Dependents +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies +
## Contract, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8752 -0.6925 -0.2906 0.7552 3.3872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.823e+00 8.112e-01 2.248 0.02459 *
## TotalCharges 3.038e-04 7.051e-05 4.308 1.65e-05 ***
## MonthlyCharges -5.439e-02 3.188e-02 -1.706 0.08796 .
## tenure -5.889e-02 6.177e-03 -9.533 < 2e-16 ***
## genderMale -2.357e-02 6.494e-02 -0.363 0.71667
## PhoneServiceYes 4.140e-01 6.512e-01 0.636 0.52499
## SeniorCitizenYes 2.431e-01 8.460e-02 2.873 0.00406 **
## PartnerYes 1.015e-02 7.784e-02 0.130 0.89628
## DependentsYes -1.650e-01 8.977e-02 -1.837 0.06614 .
## MultipleLinesYes 5.316e-01 1.781e-01 2.986 0.00283 **
## InternetServiceFiber optic 2.227e+00 8.010e-01 2.780 0.00544 **
## InternetServiceNo -2.296e+00 8.100e-01 -2.834 0.00459 **
## OnlineSecurityYes -1.806e-01 1.793e-01 -1.007 0.31379
## OnlineBackupYes 9.931e-02 1.761e-01 0.564 0.57273
## DeviceProtectionYes 2.115e-01 1.768e-01 1.196 0.23152
## TechSupportYes -1.456e-01 1.811e-01 -0.804 0.42133
## StreamingTVYes 7.939e-01 3.277e-01 2.423 0.01540 *
## StreamingMoviesYes 7.939e-01 3.277e-01 2.422 0.01542 *
## ContractOne year -7.368e-01 1.076e-01 -6.845 7.65e-12 ***
## ContractTwo year -1.459e+00 1.757e-01 -8.302 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8029.0 on 6938 degrees of freedom
## Residual deviance: 5789.6 on 6919 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 5829.6
##
## Number of Fisher Scoring iterations: 6
From the above regression analysis, its clear that Total Charges, Tenure and Contracts are the most significant variables in determining the churn in the company.
library(leaps)
leap1<- regsubsets(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data=train,nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + MonthlyCharges + tenure +
## gender + PhoneService + SeniorCitizen + Partner + Dependents +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies +
## Contract, data = train, nbest = 1)
## 19 Variables (and intercept)
## Forced in Forced out
## TotalCharges FALSE FALSE
## MonthlyCharges FALSE FALSE
## tenure FALSE FALSE
## genderMale FALSE FALSE
## PhoneServiceYes FALSE FALSE
## SeniorCitizenYes FALSE FALSE
## PartnerYes FALSE FALSE
## DependentsYes FALSE FALSE
## MultipleLinesYes FALSE FALSE
## InternetServiceFiber optic FALSE FALSE
## InternetServiceNo FALSE FALSE
## OnlineSecurityYes FALSE FALSE
## OnlineBackupYes FALSE FALSE
## DeviceProtectionYes FALSE FALSE
## TechSupportYes FALSE FALSE
## StreamingTVYes FALSE FALSE
## StreamingMoviesYes FALSE FALSE
## ContractOne year FALSE FALSE
## ContractTwo year FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## TotalCharges MonthlyCharges tenure genderMale PhoneServiceYes
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " "*" " " " "
## 3 ( 1 ) "*" " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " "*"
## 6 ( 1 ) "*" " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" " " " " "*"
## SeniorCitizenYes PartnerYes DependentsYes MultipleLinesYes
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " " " " "
## 8 ( 1 ) " " " " " " " "
## InternetServiceFiber optic InternetServiceNo OnlineSecurityYes
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) " " " " "*"
## 6 ( 1 ) "*" "*" " "
## 7 ( 1 ) " " " " "*"
## 8 ( 1 ) " " " " "*"
## OnlineBackupYes DeviceProtectionYes TechSupportYes StreamingTVYes
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " "*" " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " "*" " "
## 8 ( 1 ) "*" " " "*" " "
## StreamingMoviesYes ContractOne year ContractTwo year
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) "*" " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
plot(leap1,scale="adjr2")
Let us view the coffecients of model1 again.
model1$coefficients
## (Intercept) TotalCharges
## 1.8233111979 0.0003037555
## MonthlyCharges tenure
## -0.0543932133 -0.0588872582
## genderMale PhoneServiceYes
## -0.0235657745 0.4139685779
## SeniorCitizenYes PartnerYes
## 0.2430841292 0.0101467163
## DependentsYes MultipleLinesYes
## -0.1649522270 0.5316171354
## InternetServiceFiber optic InternetServiceNo
## 2.2267266925 -2.2958244053
## OnlineSecurityYes OnlineBackupYes
## -0.1805734677 0.0993077378
## DeviceProtectionYes TechSupportYes
## 0.2114850749 -0.1456074125
## StreamingTVYes StreamingMoviesYes
## 0.7938830923 0.7939111022
## ContractOne year ContractTwo year
## -0.7368360340 -1.4585078047
library(coefplot)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
coefplot(model1)
As is clear this model seems wrong as it shows that monthly charges and phone service subscription (and many other) are not significant to the churn in the company.
Therefore, this model is not good enough for the data.
This time let us ditch some variables.
Model2: \[Churn = \beta_0 + \beta_1TotalCharges + \beta_2tenure + \beta_3gender + \beta_4PhoneService + \beta_5SeniorCitizen + \beta_6MultipleLines + \beta_7InternetService + \beta_8Contract + \epsilon\]
We have chosen this model as these are the most basic and sidnificant factors on which churn could result.
model2=glm(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data = train,family = binomial)
summary(model2)
##
## Call:
## glm(formula = Churn ~ TotalCharges + tenure + gender + PhoneService +
## SeniorCitizen + MultipleLines + InternetService + Contract,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7307 -0.7093 -0.3050 0.8173 3.5096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.483e-01 1.234e-01 2.821 0.004786 **
## TotalCharges 3.120e-04 6.316e-05 4.939 7.86e-07 ***
## tenure -6.203e-02 5.885e-03 -10.540 < 2e-16 ***
## genderMale -1.477e-02 6.417e-02 -0.230 0.817903
## PhoneServiceYes -7.805e-01 1.295e-01 -6.026 1.68e-09 ***
## SeniorCitizenYes 3.299e-01 8.173e-02 4.036 5.43e-05 ***
## MultipleLinesYes 3.037e-01 7.862e-02 3.863 0.000112 ***
## InternetServiceFiber optic 1.083e+00 9.291e-02 11.653 < 2e-16 ***
## InternetServiceNo -7.178e-01 1.277e-01 -5.619 1.92e-08 ***
## ContractOne year -8.096e-01 1.050e-01 -7.708 1.28e-14 ***
## ContractTwo year -1.675e+00 1.727e-01 -9.700 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8029.0 on 6938 degrees of freedom
## Residual deviance: 5888.3 on 6928 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 5910.3
##
## Number of Fisher Scoring iterations: 6
This model is seems satisfactory enough as except gender all the other variables seem to significant(p value<0.05).
leap2<- regsubsets(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data=train,nbest=1)
summary(leap2)
## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + tenure + gender + PhoneService +
## SeniorCitizen + MultipleLines + InternetService + Contract,
## data = train, nbest = 1)
## 10 Variables (and intercept)
## Forced in Forced out
## TotalCharges FALSE FALSE
## tenure FALSE FALSE
## genderMale FALSE FALSE
## PhoneServiceYes FALSE FALSE
## SeniorCitizenYes FALSE FALSE
## MultipleLinesYes FALSE FALSE
## InternetServiceFiber optic FALSE FALSE
## InternetServiceNo FALSE FALSE
## ContractOne year FALSE FALSE
## ContractTwo year FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## TotalCharges tenure genderMale PhoneServiceYes SeniorCitizenYes
## 1 ( 1 ) " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) "*" " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " "
## 7 ( 1 ) "*" " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " "*"
## MultipleLinesYes InternetServiceFiber optic InternetServiceNo
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" "*"
## 4 ( 1 ) " " "*" "*"
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## ContractOne year ContractTwo year
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
plot(leap2,scale="adjr2")
model2$coefficients
## (Intercept) TotalCharges
## 0.3482585988 0.0003119577
## tenure genderMale
## -0.0620319045 -0.0147743337
## PhoneServiceYes SeniorCitizenYes
## -0.7805109440 0.3298960142
## MultipleLinesYes InternetServiceFiber optic
## 0.3037316602 1.0826036296
## InternetServiceNo ContractOne year
## -0.7177828701 -0.8096150070
## ContractTwo year
## -1.6749967414
coefplot(model2)
The model seems good enough to predict the churn.
Lets see how this model behaves for the dataset test.
fitted.results <- predict(model2,newdata=subset(test,select=c(2,3,6,7,8,9,16,20)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,'Yes','No')
misClasificError <- mean(fitted.results != test$Churn)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.795698924731183"
The accuracy of prediction is above 75%. This proves that the model is good enough.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(model2, newdata=subset(test,select=c(2,3,6,7,8,9,16,20)), type="response")
pr <- prediction(p, test$Churn)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8318681
The model passes all the tests for being applied in the analysis.
THE END