This file analyses what factors govern the churn of a telecommunications company.
ibm.df<-read.csv(paste("IBM.csv"),)
Analysis is best when the dataset is rich in data.
dim(ibm.df)
## [1] 7043 21
library(car)
some(ibm.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 99 3212-KXOCR Male 0 No No 52
## 1060 0376-YMCJC Male 0 No No 23
## 1304 0042-RLHYP Female 0 Yes Yes 69
## 1308 1866-NXPSP Female 0 No No 36
## 1360 7943-RQCHR Female 0 No No 9
## 2192 3988-RQIXO Female 1 No No 1
## 4054 7446-YPODE Male 1 No No 11
## 4394 2057-ZBLPD Female 0 Yes No 21
## 4456 8280-MQRQN Female 0 No No 1
## 6894 5876-HZVZM Female 0 Yes Yes 6
## PhoneService MultipleLines InternetService OnlineSecurity
## 99 Yes No No No internet service
## 1060 Yes Yes Fiber optic No
## 1304 Yes No No No internet service
## 1308 Yes No DSL Yes
## 1360 Yes Yes Fiber optic No
## 2192 Yes No Fiber optic No
## 4054 Yes Yes DSL Yes
## 4394 Yes Yes Fiber optic No
## 4456 Yes No DSL No
## 6894 Yes No DSL Yes
## OnlineBackup DeviceProtection TechSupport
## 99 No internet service No internet service No internet service
## 1060 No No Yes
## 1304 No internet service No internet service No internet service
## 1308 Yes Yes Yes
## 1360 No No No
## 2192 No No No
## 4054 Yes No No
## 4394 No No No
## 4456 No Yes No
## 6894 Yes No No
## StreamingTV StreamingMovies Contract
## 99 No internet service No internet service Two year
## 1060 Yes No Month-to-month
## 1304 No internet service No internet service Two year
## 1308 No Yes One year
## 1360 Yes Yes Month-to-month
## 2192 Yes Yes Month-to-month
## 4054 No No Month-to-month
## 4394 No Yes Month-to-month
## 4456 No No Month-to-month
## 6894 No No Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges
## 99 No Bank transfer (automatic) 21.00
## 1060 Yes Electronic check 90.60
## 1304 No Bank transfer (automatic) 19.70
## 1308 Yes Mailed check 75.55
## 1360 Yes Electronic check 94.75
## 2192 Yes Electronic check 91.30
## 4054 No Bank transfer (automatic) 60.25
## 4394 Yes Electronic check 86.50
## 4456 Yes Mailed check 50.45
## 6894 No Credit card (automatic) 55.90
## TotalCharges Churn
## 99 1107.20 No
## 1060 1943.20 Yes
## 1304 1396.90 No
## 1308 2680.15 No
## 1360 889.90 Yes
## 2192 91.30 Yes
## 4054 662.95 No
## 4394 1808.70 Yes
## 4456 50.45 Yes
## 6894 365.35 Yes
summary(ibm.df)
## customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies
## No :2810 No :2785
## No internet service:1526 No internet service:1526
## Yes :2707 Yes :2732
##
##
##
##
## Contract PaperlessBilling PaymentMethod
## Month-to-month:3875 No :2872 Bank transfer (automatic):1544
## One year :1473 Yes:4171 Credit card (automatic) :1522
## Two year :1695 Electronic check :2365
## Mailed check :1612
##
##
##
## MonthlyCharges TotalCharges Churn
## Min. : 18.25 Min. : 18.8 No :5174
## 1st Qu.: 35.50 1st Qu.: 401.4 Yes:1869
## Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
describe(ibm.df)
## vars n mean sd median trimmed mad min
## customerID* 1 7043 3522.00 2033.28 3522.00 3522.00 2610.86 1.00
## gender* 2 7043 1.50 0.50 2.00 1.51 0.00 1.00
## SeniorCitizen 3 7043 0.16 0.37 0.00 0.08 0.00 0.00
## Partner* 4 7043 1.48 0.50 1.00 1.48 0.00 1.00
## Dependents* 5 7043 1.30 0.46 1.00 1.25 0.00 1.00
## tenure 6 7043 32.37 24.56 29.00 31.43 32.62 0.00
## PhoneService* 7 7043 1.90 0.30 2.00 2.00 0.00 1.00
## MultipleLines* 8 7043 1.94 0.95 2.00 1.93 1.48 1.00
## InternetService* 9 7043 1.87 0.74 2.00 1.84 1.48 1.00
## OnlineSecurity* 10 7043 1.79 0.86 2.00 1.74 1.48 1.00
## OnlineBackup* 11 7043 1.91 0.88 2.00 1.88 1.48 1.00
## DeviceProtection* 12 7043 1.90 0.88 2.00 1.88 1.48 1.00
## TechSupport* 13 7043 1.80 0.86 2.00 1.75 1.48 1.00
## StreamingTV* 14 7043 1.99 0.89 2.00 1.98 1.48 1.00
## StreamingMovies* 15 7043 1.99 0.89 2.00 1.99 1.48 1.00
## Contract* 16 7043 1.69 0.83 1.00 1.61 0.00 1.00
## PaperlessBilling* 17 7043 1.59 0.49 2.00 1.62 0.00 1.00
## PaymentMethod* 18 7043 2.57 1.07 3.00 2.59 1.48 1.00
## MonthlyCharges 19 7043 64.76 30.09 70.35 64.97 35.66 18.25
## TotalCharges 20 7032 2283.30 2266.77 1397.47 1970.14 1812.92 18.80
## Churn* 21 7043 1.27 0.44 1.00 1.21 0.00 1.00
## max range skew kurtosis se
## customerID* 7043.00 7042.0 0.00 -1.20 24.23
## gender* 2.00 1.0 -0.02 -2.00 0.01
## SeniorCitizen 1.00 1.0 1.83 1.36 0.00
## Partner* 2.00 1.0 0.07 -2.00 0.01
## Dependents* 2.00 1.0 0.87 -1.23 0.01
## tenure 72.00 72.0 0.24 -1.39 0.29
## PhoneService* 2.00 1.0 -2.73 5.43 0.00
## MultipleLines* 3.00 2.0 0.12 -1.88 0.01
## InternetService* 3.00 2.0 0.21 -1.15 0.01
## OnlineSecurity* 3.00 2.0 0.42 -1.52 0.01
## OnlineBackup* 3.00 2.0 0.18 -1.68 0.01
## DeviceProtection* 3.00 2.0 0.19 -1.68 0.01
## TechSupport* 3.00 2.0 0.40 -1.54 0.01
## StreamingTV* 3.00 2.0 0.03 -1.72 0.01
## StreamingMovies* 3.00 2.0 0.01 -1.72 0.01
## Contract* 3.00 2.0 0.63 -1.27 0.01
## PaperlessBilling* 2.00 1.0 -0.38 -1.86 0.01
## PaymentMethod* 4.00 3.0 -0.17 -1.21 0.01
## MonthlyCharges 118.75 100.5 -0.22 -1.26 0.36
## TotalCharges 8684.80 8666.0 0.96 -0.23 27.03
## Churn* 2.00 1.0 1.06 -0.87 0.01
str(ibm.df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
For easier interpretation, lets change some values
ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==0]<-'No'
ibm.df$SeniorCitizen[ibm.df$SeniorCitizen==1]<-'Yes'
ibm.df$SeniorCitizen<-factor(ibm.df$SeniorCitizen)
str(ibm.df$SeniorCitizen)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
ibm.df$MultipleLines[ibm.df$MultipleLines=="No phone service"]<-'No'
ibm.df$MultipleLines<-factor(ibm.df$MultipleLines)
str(ibm.df$MultipleLines)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
ibm.df$OnlineSecurity[ibm.df$OnlineSecurity=="No internet service"]<-'No'
ibm.df$OnlineSecurity<-factor(ibm.df$OnlineSecurity)
str(ibm.df$OnlineSecurity)
## Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
ibm.df$OnlineBackup[ibm.df$OnlineBackup=="No internet service"]<-'No'
ibm.df$OnlineBackup<-factor(ibm.df$OnlineBackup)
str(ibm.df$OnlineBackup)
## Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
ibm.df$DeviceProtection[ibm.df$DeviceProtection=="No internet service"]<-'No'
ibm.df$DeviceProtection<-factor(ibm.df$DeviceProtection)
str(ibm.df$DeviceProtection)
## Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
ibm.df$TechSupport[ibm.df$TechSupport=="No internet service"]<-'No'
ibm.df$TechSupport<-factor(ibm.df$TechSupport)
str(ibm.df$TechSupport)
## Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
ibm.df$StreamingTV[ibm.df$StreamingTV=="No internet service"]<-'No'
ibm.df$StreamingTV<-factor(ibm.df$StreamingTV)
str(ibm.df$StreamingTV)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
ibm.df$StreamingMovies[ibm.df$StreamingMovies=="No internet service"]<-'No'
ibm.df$StreamingMovies<-factor(ibm.df$StreamingMovies)
str(ibm.df$StreamingMovies)
## Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
The data set now looks like
some(ibm.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 1102 4712-UYOOI Female No Yes Yes 20
## 1280 3452-FLHYD Male No Yes No 25
## 1684 0354-VXMJC Male No Yes Yes 23
## 2892 0495-RVCBF Female No No No 1
## 3457 0436-TWFFZ Female No No No 67
## 3608 9987-LUTYD Female No No No 13
## 4660 6616-AALSR Female No Yes Yes 65
## 4830 1600-DILPE Female No No No 12
## 6046 4487-ZYJZK Female No Yes Yes 38
## 6883 8065-QBYTO Female Yes No No 71
## PhoneService MultipleLines InternetService OnlineSecurity
## 1102 Yes No No No
## 1280 Yes No No No
## 1684 Yes No No No
## 2892 Yes No Fiber optic No
## 3457 Yes Yes DSL Yes
## 3608 Yes No DSL Yes
## 4660 Yes Yes Fiber optic Yes
## 4830 Yes No DSL No
## 6046 Yes No No No
## 6883 Yes Yes Fiber optic Yes
## OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
## 1102 No No No No No
## 1280 No No No No No
## 1684 No No No No No
## 2892 No No No No Yes
## 3457 No Yes Yes Yes Yes
## 3608 No No Yes No No
## 4660 No Yes No Yes Yes
## 4830 No No No No No
## 6046 No No No No No
## 6883 Yes No Yes Yes No
## Contract PaperlessBilling PaymentMethod
## 1102 Month-to-month No Electronic check
## 1280 One year Yes Bank transfer (automatic)
## 1684 Two year No Credit card (automatic)
## 2892 Month-to-month Yes Electronic check
## 3457 Two year Yes Mailed check
## 3608 One year No Mailed check
## 4660 Two year Yes Credit card (automatic)
## 4830 Month-to-month Yes Bank transfer (automatic)
## 6046 One year No Credit card (automatic)
## 6883 One year Yes Credit card (automatic)
## MonthlyCharges TotalCharges Churn
## 1102 20.00 417.65 No
## 1280 20.95 495.15 No
## 1684 19.60 426.65 No
## 2892 79.70 79.70 Yes
## 3457 85.25 5714.20 No
## 3608 55.15 742.90 No
## 4660 104.30 6725.30 No
## 4830 45.00 524.35 No
## 6046 19.60 763.10 No
## 6883 99.65 7181.25 No
gender<-table(ibm.df$gender)
gender
##
## Female Male
## 3488 3555
It would provide a clear picture to see how churn varies with gender
plot(ibm.df$gender,main="Number of males and females")
gender2<-xtabs(~ibm.df$Churn+ibm.df$gender)
gender2
## ibm.df$gender
## ibm.df$Churn Female Male
## No 2549 2625
## Yes 939 930
Percentage of male and female participation in churn
gender3<-prop.table(gender2,2)*100
format(round(gender3, 2), nsmall = 2)
## ibm.df$gender
## ibm.df$Churn Female Male
## No "73.08" "73.84"
## Yes "26.92" "26.16"
This shows that around equal percentages of males and females left the subscription last month.
sc<-table(ibm.df$SeniorCitizen)
sc
##
## No Yes
## 5901 1142
plot(ibm.df$SeniorCitizen,main="Number of Senior Citizen")
It would provide a clear picture to see how churn varies with Senior citizenship.
sc2<-xtabs(~ibm.df$Churn+ibm.df$SeniorCitizen)
sc2
## ibm.df$SeniorCitizen
## ibm.df$Churn No Yes
## No 4508 666
## Yes 1393 476
Percentage of male and female participation in churn
sc3<-prop.table(sc2,2)*100
format(round(sc3, 2), nsmall = 2)
## ibm.df$SeniorCitizen
## ibm.df$Churn No Yes
## No "76.39" "58.32"
## Yes "23.61" "41.68"
A greater percentage of senior citizens have not ended the subscription.
The number of people who have and don’t have partners
partner<-table(ibm.df$Partner)
partner
##
## No Yes
## 3641 3402
It would provide a clear picture to see how churn varies with partnership
plot(ibm.df$Partner,main="Number of Partners")
partner2<-xtabs(~ibm.df$Churn+ibm.df$Partner)
partner2
## ibm.df$Partner
## ibm.df$Churn No Yes
## No 2441 2733
## Yes 1200 669
Percentage of participation in churn
partner3<-prop.table(partner2,2)*100
format(round(partner3, 2), nsmall = 2)
## ibm.df$Partner
## ibm.df$Churn No Yes
## No "67.04" "80.34"
## Yes "32.96" "19.66"
The number of people who are and are not dependent.
dependent<-table(ibm.df$Dependents)
dependent
##
## No Yes
## 4933 2110
This shows that mostly people don’t have dependents.
plot(ibm.df$Partner,main="Number of People having Dependents")
How churn is affected by dependents.
dependent2<-xtabs(~ibm.df$Churn+ibm.df$Dependents)
dependent2
## ibm.df$Dependents
## ibm.df$Churn No Yes
## No 3390 1784
## Yes 1543 326
Percentage of dependent participation in churn
dependent3<-prop.table(dependent2,2)*100
format(round(dependent3, 2), nsmall = 2)
## ibm.df$Dependents
## ibm.df$Churn No Yes
## No "68.72" "84.55"
## Yes "31.28" "15.45"
tenure<-table(ibm.df$tenure)
tenure
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 11 613 238 200 176 133 110 131 123 119 116 99 117 109 76 99 80 87
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 97 73 71 63 90 85 94 79 79 72 57 72 72 65 69 64 65 88
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 50 65 59 56 64 70 65 65 51 61 74 68 64 66 68 68 80 70
## 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
## 68 64 80 65 67 60 76 76 70 72 80 76 89 98 100 95 119 170
## 72
## 362
boxplot(ibm.df$tenure,horizontal = TRUE, main="Tenure of Subscribers",xlab="Months",col="grey")
hist(ibm.df$tenure,breaks = 30,main = "Frequency of Tenure Months",xlab = "Tenure",col="grey")
Average Tenure of all the Subscribers w.r.t churn
aggregate(tenure~Churn,data = ibm.df,mean)
## Churn tenure
## 1 No 37.56997
## 2 Yes 17.97913
library(lattice)
barchart(tenure~Churn,data = ibm.df,col="grey")
aggregate(tenure~gender,data = ibm.df,mean)
## gender tenure
## 1 Female 32.24455
## 2 Male 32.49536
boxplot(tenure~gender,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Gender",xlab="Tenure")
aggregate(tenure~SeniorCitizen,data = ibm.df,mean)
## SeniorCitizen tenure
## 1 No 32.19217
## 2 Yes 33.29597
boxplot(tenure~SeniorCitizen,data = ibm.df,horizontal=TRUE,col="grey",main="Median of Tenure wrt Senior Citizenship",xlab="Tenure")
Hypothesis : The difference in mean tenure of those subscibers who ended their contract and those who did not is not significant.
t.test(tenure~Churn,data = ibm.df)
##
## Welch Two Sample t-test
##
## data: tenure by Churn
## t = 34.824, df = 4048.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 18.48789 20.69378
## sample estimates:
## mean in group No mean in group Yes
## 37.56997 17.97913
By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean tenure of those subscibers who ended their contract and those who did not.
par(mfrow=c(3,3))
plot(ibm.df$PhoneService,main="People who subscribed for Phone Service")
plot(ibm.df$MultipleLines,main="People who subscribed for Multiple Lines")
plot(ibm.df$InternetService,main="People who subscribed for Internet Service")
plot(ibm.df$OnlineSecurity,main="People who subscribed for Online Security")
plot(ibm.df$OnlineBackup,main="People who subscribed for Online Backup")
plot(ibm.df$DeviceProtection,main="People who subscribed for Device Protection")
plot(ibm.df$TechSupport,main="People who subscribed for Tech Support")
plot(ibm.df$StreamingTV,main="People who subscribed for Streaming TV")
plot(ibm.df$StreamingMovies,main="People who subscribed for Streaming Movies")
Analysing the contract may give some insight about churn.
table(ibm.df$Contract)
##
## Month-to-month One year Two year
## 3875 1473 1695
plot(ibm.df$Contract,main="Frequency of Different Contracts")
It may be interesting to analyse how tenure varies with contracts.
aggregate(tenure~Contract,data = ibm.df,mean)
## Contract tenure
## 1 Month-to-month 18.03665
## 2 One year 42.04481
## 3 Two year 56.73510
boxplot(tenure~Contract,data = ibm.df,horizontal=TRUE,col="grey",main="Variation of Tenure with Contracts",xlab="Tenure")
Lets see how churn is affected by contracts made by subscribers.
xtabs(~ibm.df$Churn+ibm.df$Contract)
## ibm.df$Contract
## ibm.df$Churn Month-to-month One year Two year
## No 2220 1307 1647
## Yes 1655 166 48
boxplot(ibm.df$MonthlyCharges,col="grey",horizontal=TRUE, main="Median of Monthly Charges")
Lets see how churn depends on Monthly Charges
aggregate(MonthlyCharges~Churn,data = ibm.df,mean)
## Churn MonthlyCharges
## 1 No 61.26512
## 2 Yes 74.44133
boxplot(MonthlyCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Monthly Charges",xlab="Monthly Charges",ylab="Churn")
scatterplot(MonthlyCharges~tenure|Churn,data = ibm.df,cex=0.5)
It can be seen rom the above plot that majority of the subscribers who ended their contracts had high monthly charges and lower tenures.
hypothesis: The differnce in monthly charges based on churn is not significant.
t.test(MonthlyCharges~Churn,data = ibm.df)
##
## Welch Two Sample t-test
##
## data: MonthlyCharges by Churn
## t = -18.408, df = 4135.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.57957 -11.77284
## sample estimates:
## mean in group No mean in group Yes
## 61.26512 74.44133
By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean monthly charges of those subscibers who ended their contract and those who did not.
boxplot(ibm.df$TotalCharges,col="grey",horizontal=TRUE, main="Median of Total Charges")
Lets see how churn depends on Total Charges
aggregate(TotalCharges~Churn,data = ibm.df,mean)
## Churn TotalCharges
## 1 No 2555.344
## 2 Yes 1531.796
boxplot(TotalCharges~Churn,data = ibm.df,col="grey",horizontal=TRUE, main="Churn vs Total Charges",xlab="Total Charges",ylab="Churn")
scatterplot(TotalCharges~tenure|Churn,data = ibm.df,cex=0.5)
It can be seen rom the above plot that majority of the subscribers who ended their contracts had high total charges.
hypothesis: The differnce in total charges based on churn is not significant.
t.test(TotalCharges~Churn,data = ibm.df)
##
## Welch Two Sample t-test
##
## data: TotalCharges by Churn
## t = 18.801, df = 4042.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 916.8121 1130.2840
## sample estimates:
## mean in group No mean in group Yes
## 2555.344 1531.796
By the above t-test, the p value(<0.05) proves that the hypothesis is wrong and there is significant difference in mean total charges of those subscibers who ended their contract and those who did not.
Since Churn is a factor variable, a linear regression cannot be applied as we need to analyse the dependence of churn on the other variables. As a result, a logistic regression is applied for the analysis.
Now since this is a logistic regression, we need to divide the data into two sets: ‘train’ - on which the model of the regression will be based on and ‘test’ - on which the model will be tested.
train <- ibm.df[1:6950,]
test <- ibm.df[6951:7043,]
The dependent variable is Churn and all other variables( excluding customer ID) are considered to be independent.
Model: Churn = b + b1 x TotalCharges + b2 x MonthlyCharges + b3 x tenure + b4 x gender + b5 x PhoneService + b6 x SeniorCitizen + b7 x Partner + b8 x Dependents + b9 x MultipleLines + b10 x InternetService + b11 x OnlineSecurity + b12 x OnlineBackup + b13 x DeviceProtection + b14 x TechSupport + b15 x StreamingTV + b16 x StreamingMovies + b17 x Contract + e
model1=glm(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data = train,family = binomial)
summary(model1)
##
## Call:
## glm(formula = Churn ~ TotalCharges + MonthlyCharges + tenure +
## gender + PhoneService + SeniorCitizen + Partner + Dependents +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies +
## Contract, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8752 -0.6925 -0.2906 0.7552 3.3872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.823e+00 8.112e-01 2.248 0.02459 *
## TotalCharges 3.038e-04 7.051e-05 4.308 1.65e-05 ***
## MonthlyCharges -5.439e-02 3.188e-02 -1.706 0.08796 .
## tenure -5.889e-02 6.177e-03 -9.533 < 2e-16 ***
## genderMale -2.357e-02 6.494e-02 -0.363 0.71667
## PhoneServiceYes 4.140e-01 6.512e-01 0.636 0.52499
## SeniorCitizenYes 2.431e-01 8.460e-02 2.873 0.00406 **
## PartnerYes 1.015e-02 7.784e-02 0.130 0.89628
## DependentsYes -1.650e-01 8.977e-02 -1.837 0.06614 .
## MultipleLinesYes 5.316e-01 1.781e-01 2.986 0.00283 **
## InternetServiceFiber optic 2.227e+00 8.010e-01 2.780 0.00544 **
## InternetServiceNo -2.296e+00 8.100e-01 -2.834 0.00459 **
## OnlineSecurityYes -1.806e-01 1.793e-01 -1.007 0.31379
## OnlineBackupYes 9.931e-02 1.761e-01 0.564 0.57273
## DeviceProtectionYes 2.115e-01 1.768e-01 1.196 0.23152
## TechSupportYes -1.456e-01 1.811e-01 -0.804 0.42133
## StreamingTVYes 7.939e-01 3.277e-01 2.423 0.01540 *
## StreamingMoviesYes 7.939e-01 3.277e-01 2.422 0.01542 *
## ContractOne year -7.368e-01 1.076e-01 -6.845 7.65e-12 ***
## ContractTwo year -1.459e+00 1.757e-01 -8.302 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8029.0 on 6938 degrees of freedom
## Residual deviance: 5789.6 on 6919 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 5829.6
##
## Number of Fisher Scoring iterations: 6
From the above regression analysis, its clear that Total Charges, Tenure and Contracts are the most significant variables in determining the churn in the company.
library(leaps)
leap1<- regsubsets(Churn~TotalCharges+MonthlyCharges+tenure+gender+PhoneService+SeniorCitizen+Partner+Dependents+MultipleLines+InternetService+OnlineSecurity+OnlineBackup+DeviceProtection+TechSupport+StreamingTV+StreamingMovies+Contract,data=train,nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + MonthlyCharges + tenure +
## gender + PhoneService + SeniorCitizen + Partner + Dependents +
## MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
## DeviceProtection + TechSupport + StreamingTV + StreamingMovies +
## Contract, data = train, nbest = 1)
## 19 Variables (and intercept)
## Forced in Forced out
## TotalCharges FALSE FALSE
## MonthlyCharges FALSE FALSE
## tenure FALSE FALSE
## genderMale FALSE FALSE
## PhoneServiceYes FALSE FALSE
## SeniorCitizenYes FALSE FALSE
## PartnerYes FALSE FALSE
## DependentsYes FALSE FALSE
## MultipleLinesYes FALSE FALSE
## InternetServiceFiber optic FALSE FALSE
## InternetServiceNo FALSE FALSE
## OnlineSecurityYes FALSE FALSE
## OnlineBackupYes FALSE FALSE
## DeviceProtectionYes FALSE FALSE
## TechSupportYes FALSE FALSE
## StreamingTVYes FALSE FALSE
## StreamingMoviesYes FALSE FALSE
## ContractOne year FALSE FALSE
## ContractTwo year FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## TotalCharges MonthlyCharges tenure genderMale PhoneServiceYes
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " "*" " " " "
## 3 ( 1 ) "*" " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " "*"
## 6 ( 1 ) "*" " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" " " " " "*"
## SeniorCitizenYes PartnerYes DependentsYes MultipleLinesYes
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " " " " "
## 8 ( 1 ) " " " " " " " "
## InternetServiceFiber optic InternetServiceNo OnlineSecurityYes
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) " " " " "*"
## 6 ( 1 ) "*" "*" " "
## 7 ( 1 ) " " " " "*"
## 8 ( 1 ) " " " " "*"
## OnlineBackupYes DeviceProtectionYes TechSupportYes StreamingTVYes
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " "*" " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " "*" " "
## 8 ( 1 ) "*" " " "*" " "
## StreamingMoviesYes ContractOne year ContractTwo year
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) "*" " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
plot(leap1,scale="adjr2")
Let us view the coffecients of model1 again.
model1$coefficients
## (Intercept) TotalCharges
## 1.8233111979 0.0003037555
## MonthlyCharges tenure
## -0.0543932133 -0.0588872582
## genderMale PhoneServiceYes
## -0.0235657745 0.4139685779
## SeniorCitizenYes PartnerYes
## 0.2430841292 0.0101467163
## DependentsYes MultipleLinesYes
## -0.1649522270 0.5316171354
## InternetServiceFiber optic InternetServiceNo
## 2.2267266925 -2.2958244053
## OnlineSecurityYes OnlineBackupYes
## -0.1805734677 0.0993077378
## DeviceProtectionYes TechSupportYes
## 0.2114850749 -0.1456074125
## StreamingTVYes StreamingMoviesYes
## 0.7938830923 0.7939111022
## ContractOne year ContractTwo year
## -0.7368360340 -1.4585078047
library(coefplot)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
coefplot(model1)
As is clear this model seems wrong as it shows that monthly charges and phone service subscription (and many other) are not significant to the churn in the company.
Therefore, this model is not good enough for the data.
This time let us ditch some variables.
Model2: Churn = b + b1 x TotalCharges + b2 x tenure + b3 x gender + b4 x PhoneService + b5 x SeniorCitizen + b6 x MultipleLines + b7 x InternetService + b8 x Contract + e
We have chosen this model as these are the most basic and sidnificant factors on which churn could result.
model2=glm(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data = train,family = binomial)
summary(model2)
##
## Call:
## glm(formula = Churn ~ TotalCharges + tenure + gender + PhoneService +
## SeniorCitizen + MultipleLines + InternetService + Contract,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7307 -0.7093 -0.3050 0.8173 3.5096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.483e-01 1.234e-01 2.821 0.004786 **
## TotalCharges 3.120e-04 6.316e-05 4.939 7.86e-07 ***
## tenure -6.203e-02 5.885e-03 -10.540 < 2e-16 ***
## genderMale -1.477e-02 6.417e-02 -0.230 0.817903
## PhoneServiceYes -7.805e-01 1.295e-01 -6.026 1.68e-09 ***
## SeniorCitizenYes 3.299e-01 8.173e-02 4.036 5.43e-05 ***
## MultipleLinesYes 3.037e-01 7.862e-02 3.863 0.000112 ***
## InternetServiceFiber optic 1.083e+00 9.291e-02 11.653 < 2e-16 ***
## InternetServiceNo -7.178e-01 1.277e-01 -5.619 1.92e-08 ***
## ContractOne year -8.096e-01 1.050e-01 -7.708 1.28e-14 ***
## ContractTwo year -1.675e+00 1.727e-01 -9.700 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8029.0 on 6938 degrees of freedom
## Residual deviance: 5888.3 on 6928 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 5910.3
##
## Number of Fisher Scoring iterations: 6
This model is seems satisfactory enough as except gender all the other variables seem to significant(p value<0.05).
leap2<- regsubsets(Churn~TotalCharges+tenure+gender+PhoneService+SeniorCitizen+MultipleLines+InternetService+Contract,data=train,nbest=1)
summary(leap2)
## Subset selection object
## Call: regsubsets.formula(Churn ~ TotalCharges + tenure + gender + PhoneService +
## SeniorCitizen + MultipleLines + InternetService + Contract,
## data = train, nbest = 1)
## 10 Variables (and intercept)
## Forced in Forced out
## TotalCharges FALSE FALSE
## tenure FALSE FALSE
## genderMale FALSE FALSE
## PhoneServiceYes FALSE FALSE
## SeniorCitizenYes FALSE FALSE
## MultipleLinesYes FALSE FALSE
## InternetServiceFiber optic FALSE FALSE
## InternetServiceNo FALSE FALSE
## ContractOne year FALSE FALSE
## ContractTwo year FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## TotalCharges tenure genderMale PhoneServiceYes SeniorCitizenYes
## 1 ( 1 ) " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) "*" " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " "
## 7 ( 1 ) "*" " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " "*"
## MultipleLinesYes InternetServiceFiber optic InternetServiceNo
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" "*"
## 4 ( 1 ) " " "*" "*"
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## ContractOne year ContractTwo year
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
plot(leap2,scale="adjr2")
model2$coefficients
## (Intercept) TotalCharges
## 0.3482585988 0.0003119577
## tenure genderMale
## -0.0620319045 -0.0147743337
## PhoneServiceYes SeniorCitizenYes
## -0.7805109440 0.3298960142
## MultipleLinesYes InternetServiceFiber optic
## 0.3037316602 1.0826036296
## InternetServiceNo ContractOne year
## -0.7177828701 -0.8096150070
## ContractTwo year
## -1.6749967414
coefplot(model2)
The model seems good enough to predict the churn.
Lets see how this model behaves for the dataset test.
fitted.results <- predict(model2,newdata=subset(test,select=c(2,3,6,7,8,9,16,20)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,'Yes','No')
misClasificError <- mean(fitted.results != test$Churn)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.795698924731183"
The accuracy of prediction is above 75%. This proves that the model is good enough.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(model2, newdata=subset(test,select=c(2,3,6,7,8,9,16,20)), type="response")
pr <- prediction(p, test$Churn)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8318681
The model passes all the tests for being applied in the analysis.
THE END