Project Title: Using Customer Behavior Data to Improve Customer Retention
NAME: ASWATHY GUNADEEP
EMAIL: aswathygunadeep@gmail.com
COLLEGE / COMPANY: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA
A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving.
The data set includes information about:
PRELIMINARY WORK
Reading the raw data into a dataframe
setwd("D:/desktop/Data Analytics internship-sameer mathur/project")
ash.df <- read.csv(paste("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=""))
attach(ash.df)
dim(ash.df)
## [1] 7043 21
View(ash.df)
str(ash.df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
library(psych)
## Warning: package 'psych' was built under R version 3.5.3
describe(ash.df)
## vars n mean sd median trimmed mad min
## customerID* 1 7043 3522.00 2033.28 3522.00 3522.00 2610.86 1.00
## gender* 2 7043 1.50 0.50 2.00 1.51 0.00 1.00
## SeniorCitizen 3 7043 0.16 0.37 0.00 0.08 0.00 0.00
## Partner* 4 7043 1.48 0.50 1.00 1.48 0.00 1.00
## Dependents* 5 7043 1.30 0.46 1.00 1.25 0.00 1.00
## tenure 6 7043 32.37 24.56 29.00 31.43 32.62 0.00
## PhoneService* 7 7043 1.90 0.30 2.00 2.00 0.00 1.00
## MultipleLines* 8 7043 1.94 0.95 2.00 1.93 1.48 1.00
## InternetService* 9 7043 1.87 0.74 2.00 1.84 1.48 1.00
## OnlineSecurity* 10 7043 1.79 0.86 2.00 1.74 1.48 1.00
## OnlineBackup* 11 7043 1.91 0.88 2.00 1.88 1.48 1.00
## DeviceProtection* 12 7043 1.90 0.88 2.00 1.88 1.48 1.00
## TechSupport* 13 7043 1.80 0.86 2.00 1.75 1.48 1.00
## StreamingTV* 14 7043 1.99 0.89 2.00 1.98 1.48 1.00
## StreamingMovies* 15 7043 1.99 0.89 2.00 1.99 1.48 1.00
## Contract* 16 7043 1.69 0.83 1.00 1.61 0.00 1.00
## PaperlessBilling* 17 7043 1.59 0.49 2.00 1.62 0.00 1.00
## PaymentMethod* 18 7043 2.57 1.07 3.00 2.59 1.48 1.00
## MonthlyCharges 19 7043 64.76 30.09 70.35 64.97 35.66 18.25
## TotalCharges 20 7032 2283.30 2266.77 1397.47 1970.14 1812.92 18.80
## Churn* 21 7043 1.27 0.44 1.00 1.21 0.00 1.00
## max range skew kurtosis se
## customerID* 7043.00 7042.0 0.00 -1.20 24.23
## gender* 2.00 1.0 -0.02 -2.00 0.01
## SeniorCitizen 1.00 1.0 1.83 1.36 0.00
## Partner* 2.00 1.0 0.07 -2.00 0.01
## Dependents* 2.00 1.0 0.87 -1.23 0.01
## tenure 72.00 72.0 0.24 -1.39 0.29
## PhoneService* 2.00 1.0 -2.73 5.43 0.00
## MultipleLines* 3.00 2.0 0.12 -1.88 0.01
## InternetService* 3.00 2.0 0.21 -1.15 0.01
## OnlineSecurity* 3.00 2.0 0.42 -1.52 0.01
## OnlineBackup* 3.00 2.0 0.18 -1.68 0.01
## DeviceProtection* 3.00 2.0 0.19 -1.68 0.01
## TechSupport* 3.00 2.0 0.40 -1.54 0.01
## StreamingTV* 3.00 2.0 0.03 -1.72 0.01
## StreamingMovies* 3.00 2.0 0.01 -1.72 0.01
## Contract* 3.00 2.0 0.63 -1.27 0.01
## PaperlessBilling* 2.00 1.0 -0.38 -1.86 0.01
## PaymentMethod* 4.00 3.0 -0.17 -1.21 0.01
## MonthlyCharges 118.75 100.5 -0.22 -1.26 0.36
## TotalCharges 8684.80 8666.0 0.96 -0.23 27.03
## Churn* 2.00 1.0 1.06 -0.87 0.01
library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
some(ash.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 1163 9359-UGBTK Female 0 No No 67
## 1536 7905-NJMXS Male 1 Yes No 7
## 2568 0407-BDJKB Male 0 Yes No 60
## 3338 0840-DFEZH Female 0 No No 7
## 3410 4806-DXQCE Female 1 Yes No 70
## 3691 6038-GCYEC Female 0 No No 24
## 5590 7253-UVNDW Female 0 No No 46
## 5674 6260-XLACS Male 0 No No 4
## 5687 4369-HTUIF Male 1 No No 6
## 6405 1428-IEDPR Male 0 No No 52
## PhoneService MultipleLines InternetService OnlineSecurity
## 1163 Yes No DSL Yes
## 1536 Yes Yes DSL Yes
## 2568 Yes Yes Fiber optic No
## 3338 Yes Yes Fiber optic No
## 3410 Yes Yes Fiber optic Yes
## 3691 No No phone service DSL No
## 5590 Yes No DSL No
## 5674 Yes No No No internet service
## 5687 Yes No Fiber optic Yes
## 6405 Yes Yes Fiber optic Yes
## OnlineBackup DeviceProtection TechSupport
## 1163 No Yes Yes
## 1536 Yes No Yes
## 2568 Yes Yes No
## 3338 No No No
## 3410 Yes Yes Yes
## 3691 Yes Yes No
## 5590 No Yes Yes
## 5674 No internet service No internet service No internet service
## 5687 No No No
## 6405 No Yes Yes
## StreamingTV StreamingMovies Contract
## 1163 No Yes One year
## 1536 No No Month-to-month
## 2568 Yes No Month-to-month
## 3338 No No Month-to-month
## 3410 Yes Yes Two year
## 3691 Yes Yes Month-to-month
## 5590 No No Two year
## 5674 No internet service No internet service One year
## 5687 Yes No Month-to-month
## 6405 Yes Yes Two year
## PaperlessBilling PaymentMethod MonthlyCharges
## 1163 Yes Bank transfer (automatic) 72.35
## 1536 No Electronic check 64.20
## 2568 Yes Electronic check 95.75
## 3338 Yes Electronic check 75.35
## 3410 Yes Electronic check 113.65
## 3691 Yes Credit card (automatic) 54.15
## 5590 No Credit card (automatic) 54.35
## 5674 No Mailed check 19.70
## 5687 Yes Bank transfer (automatic) 85.15
## 6405 Yes Electronic check 109.30
## TotalCharges Churn
## 1163 4991.50 No
## 1536 475.00 No
## 2568 5742.90 Yes
## 3338 564.65 No
## 3410 7714.65 No
## 3691 1240.25 Yes
## 5590 2460.15 Yes
## 5674 117.80 No
## 5687 503.60 Yes
## 6405 5731.40 No
VISUALIZATION
x <- c(5174 , 1869)
pct <- round(x/sum(x)*100)
labels <- c("No", "Yes")
lbls <- paste(labels, pct)
lbls <- paste(lbls,"%",sep="")
pie(x,labels = lbls,main="churn", col=c("skyblue","yellow"))
The obvious guess is Total charges and Monthly charges and tenure.To make sure of this, we plot the corrgram, create correlation matrix and scatterplot matrix , excluding the the less important factor variables , having 2 levels-“Yes” and “No”.
CORRGRAM
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.5.3
par(mfrow=c(1,1))
corrgram(ash.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of dataset")
SCATTERPLOTMATRIX
scatterplotMatrix(~TotalCharges+MonthlyCharges+tenure | Churn, data=ash.df,main="scatterplotmatrix")
CORRELATION MATRIX
den.df <- subset(ash.df[,c(6,19,20)], ash.df$TotalCharges>=0 && ash.df$MonthlyCharges>=0)
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.5.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.5.3
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
## Loading required package: survival
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 3.5.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
data <- cor(den.df,use = "complete.obs")
round(data,2)
## tenure MonthlyCharges TotalCharges
## tenure 1.00 0.25 0.83
## MonthlyCharges 0.25 1.00 0.65
## TotalCharges 0.83 0.65 1.00
The key drivers are tenure, contract, internet service, total charges and monthly charges.
par(mfrow=c(1,2))
d2 <- density(ash.df$tenure)
plot(d2, main="tenure")
polygon(d2, col="red", border="blue")
histogram(~ tenure | Churn,col="green")
This visualization shows that customers who have been customers for shorter periods are more likely to leave.
xyplot(TotalCharges ~ MonthlyCharges, data=den.df, main="total charges v/s monthly charges", type = c("p", "g"),xlab="monthly charges", ylab="total charges")
par(mfrow=c(1,2))
den.df <- subset(ash.df, ash.df$TotalCharges>=0)
d <- density(den.df$TotalCharges)
plot(d, main="Total charges")
polygon(d, col="red", border="blue")
den1.df <- subset(ash.df, ash.df$MonthlyCharges>=0)
d1 <- density(den1.df$MonthlyCharges)
plot(d1, main="Monthly charges")
polygon(d1, col="red", border="blue")
par(mfrow=c(2,1))
library(lattice)
histogram(~ TotalCharges | Churn, data=ash.df)
histogram(~ MonthlyCharges | Churn, data=ash.df, col="orange")
library(vcd)
## Warning: package 'vcd' was built under R version 3.5.3
## Loading required package: grid
tab1 <- xtabs(~ Churn + Contract)
tab1
## Contract
## Churn Month-to-month One year Two year
## No 2220 1307 1647
## Yes 1655 166 48
mosaic(tab1, shade=TRUE, legend=TRUE, main="contract basis")
tab2 <- xtabs(~ Churn + InternetService)
tab2
## InternetService
## Churn DSL Fiber optic No
## No 1962 1799 1413
## Yes 459 1297 113
mosaic(tab2, shade=TRUE, legend=TRUE, main="internet service effects")
boxplot(TotalCharges ~ InternetService, data=den.df,main="total charges with type of internet service", col=c("skyblue","orange","green"),xlab="type of internet service",ylab="total charges")
library(ggplot2)
qplot(x = TotalCharges, y = tenure, data = den.df, color = factor(Churn))
boxplot(tenure ~ Contract, data=ash.df,main="tenure v/s contract", col=c("red","orange","yellow"),xlab="type of contract",ylab="tenure")
boxplot(tenure ~ OnlineSecurity, data=ash.df,main="tenure v/s online security", col=c("burlywood","darkolivegreen"),xlab="online security",ylab="tenure")
From the above analysis and visualizations, customers who leave tend to be ones who are on a month-to-month contract, have fiber optic internet service, and have been customers for shorter periods.
CHI-SQUARED TESTS:
chisq.test(Churn,TotalCharges)
## Warning in chisq.test(Churn, TotalCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and TotalCharges
## X-squared = 6503.6, df = 6529, p-value = 0.5859
chisq.test(Churn,MonthlyCharges)
## Warning in chisq.test(Churn, MonthlyCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and MonthlyCharges
## X-squared = 2123.6, df = 1584, p-value < 2.2e-16
chisq.test(Churn,tenure)
## Warning in chisq.test(Churn, tenure): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and tenure
## X-squared = 1065.3, df = 72, p-value < 2.2e-16
chisq.test(Churn,OnlineSecurity)
##
## Pearson's Chi-squared test
##
## data: Churn and OnlineSecurity
## X-squared = 850, df = 2, p-value < 2.2e-16
chisq.test(Churn,Contract)
##
## Pearson's Chi-squared test
##
## data: Churn and Contract
## X-squared = 1184.6, df = 2, p-value < 2.2e-16
Hence, since all the p-values are very less (except for totalcharges), our original hypothesis is correct.