Project Title: Using Customer Behavior Data to Improve Customer Retention
NAME: ASWATHY GUNADEEP
EMAIL: aswathygunadeep@gmail.com
COLLEGE / COMPANY: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA
A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving.
The data set includes information about:
PRELIMINARY WORK
Reading the raw data into a dataframe
setwd("D:/desktop/Data Analytics internship-sameer mathur/project")
ash.df <- read.csv(paste("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=""))
attach(ash.df)
dim(ash.df)
## [1] 7043 21
View(ash.df)
str(ash.df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
library(psych)
describe(ash.df)
## vars n mean sd median trimmed mad min
## customerID* 1 7043 3522.00 2033.28 3522.00 3522.00 2610.86 1.00
## gender* 2 7043 1.50 0.50 2.00 1.51 0.00 1.00
## SeniorCitizen 3 7043 0.16 0.37 0.00 0.08 0.00 0.00
## Partner* 4 7043 1.48 0.50 1.00 1.48 0.00 1.00
## Dependents* 5 7043 1.30 0.46 1.00 1.25 0.00 1.00
## tenure 6 7043 32.37 24.56 29.00 31.43 32.62 0.00
## PhoneService* 7 7043 1.90 0.30 2.00 2.00 0.00 1.00
## MultipleLines* 8 7043 1.94 0.95 2.00 1.93 1.48 1.00
## InternetService* 9 7043 1.87 0.74 2.00 1.84 1.48 1.00
## OnlineSecurity* 10 7043 1.79 0.86 2.00 1.74 1.48 1.00
## OnlineBackup* 11 7043 1.91 0.88 2.00 1.88 1.48 1.00
## DeviceProtection* 12 7043 1.90 0.88 2.00 1.88 1.48 1.00
## TechSupport* 13 7043 1.80 0.86 2.00 1.75 1.48 1.00
## StreamingTV* 14 7043 1.99 0.89 2.00 1.98 1.48 1.00
## StreamingMovies* 15 7043 1.99 0.89 2.00 1.99 1.48 1.00
## Contract* 16 7043 1.69 0.83 1.00 1.61 0.00 1.00
## PaperlessBilling* 17 7043 1.59 0.49 2.00 1.62 0.00 1.00
## PaymentMethod* 18 7043 2.57 1.07 3.00 2.59 1.48 1.00
## MonthlyCharges 19 7043 64.76 30.09 70.35 64.97 35.66 18.25
## TotalCharges 20 7032 2283.30 2266.77 1397.47 1970.14 1812.92 18.80
## Churn* 21 7043 1.27 0.44 1.00 1.21 0.00 1.00
## max range skew kurtosis se
## customerID* 7043.00 7042.0 0.00 -1.20 24.23
## gender* 2.00 1.0 -0.02 -2.00 0.01
## SeniorCitizen 1.00 1.0 1.83 1.36 0.00
## Partner* 2.00 1.0 0.07 -2.00 0.01
## Dependents* 2.00 1.0 0.87 -1.23 0.01
## tenure 72.00 72.0 0.24 -1.39 0.29
## PhoneService* 2.00 1.0 -2.73 5.43 0.00
## MultipleLines* 3.00 2.0 0.12 -1.88 0.01
## InternetService* 3.00 2.0 0.21 -1.15 0.01
## OnlineSecurity* 3.00 2.0 0.42 -1.52 0.01
## OnlineBackup* 3.00 2.0 0.18 -1.68 0.01
## DeviceProtection* 3.00 2.0 0.19 -1.68 0.01
## TechSupport* 3.00 2.0 0.40 -1.54 0.01
## StreamingTV* 3.00 2.0 0.03 -1.72 0.01
## StreamingMovies* 3.00 2.0 0.01 -1.72 0.01
## Contract* 3.00 2.0 0.63 -1.27 0.01
## PaperlessBilling* 2.00 1.0 -0.38 -1.86 0.01
## PaymentMethod* 4.00 3.0 -0.17 -1.21 0.01
## MonthlyCharges 118.75 100.5 -0.22 -1.26 0.36
## TotalCharges 8684.80 8666.0 0.96 -0.23 27.03
## Churn* 2.00 1.0 1.06 -0.87 0.01
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
some(ash.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 359 9057-MSWCO Male 1 Yes No 27
## 1136 1779-PWPMG Female 1 Yes No 72
## 1210 8268-YDIXR Male 0 Yes No 56
## 1673 4737-AQCPU Male 0 Yes Yes 72
## 2156 0524-IAVZO Female 0 Yes No 30
## 2577 1074-AMIOH Female 0 Yes Yes 53
## 3312 6137-MFAJN Female 0 No No 48
## 3995 0406-BPDVR Female 1 Yes No 54
## 5169 6000-APYLU Male 0 No No 9
## 5888 4316-XCSLJ Male 0 No Yes 17
## PhoneService MultipleLines InternetService OnlineSecurity
## 359 No No phone service DSL Yes
## 1136 Yes Yes Fiber optic Yes
## 1210 Yes Yes Fiber optic Yes
## 1673 Yes Yes DSL Yes
## 2156 Yes Yes Fiber optic No
## 2577 Yes No Fiber optic No
## 3312 No No phone service DSL Yes
## 3995 Yes Yes Fiber optic No
## 5169 Yes No Fiber optic No
## 5888 Yes No DSL Yes
## OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
## 359 No No No No No
## 1136 Yes Yes Yes Yes Yes
## 1210 Yes Yes No No No
## 1673 Yes Yes Yes No No
## 2156 No No No No Yes
## 2577 Yes Yes Yes Yes Yes
## 3312 Yes No No Yes No
## 3995 Yes No No Yes Yes
## 5169 No No No No Yes
## 5888 No No No No No
## Contract PaperlessBilling PaymentMethod
## 359 Month-to-month No Credit card (automatic)
## 1136 Two year Yes Bank transfer (automatic)
## 1210 One year Yes Electronic check
## 1673 Two year No Credit card (automatic)
## 2156 Month-to-month Yes Electronic check
## 2577 Month-to-month Yes Bank transfer (automatic)
## 3312 Month-to-month Yes Electronic check
## 3995 One year Yes Credit card (automatic)
## 5169 Month-to-month Yes Electronic check
## 5888 Month-to-month Yes Mailed check
## MonthlyCharges TotalCharges Churn
## 359 30.75 805.10 Yes
## 1136 114.65 8333.95 No
## 1210 93.15 5253.95 No
## 1673 72.10 5016.65 No
## 2156 85.00 2624.25 Yes
## 2577 108.25 5935.10 No
## 3312 44.80 2104.55 No
## 3995 101.50 5373.10 Yes
## 5169 80.80 713.10 Yes
## 5888 50.30 846.80 No
VISUALIZATION
pie(table(ash.df$Churn),main="churn", col=c("skyblue","yellow"))
The obvious guess is Total charges and Monthly charges and tenure.To make sure of this, we plot the corrgram, create correlation matrix and scatterplot matrix , excluding the the less important factor variables , having 2 levels-“Yes” and “No”.
CORRGRAM
library(corrgram)
par(mfrow=c(1,1))
corrgram(ash.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of dataset")
SCATTERPLOTMATRIX
scatterplotMatrix(~TotalCharges+MonthlyCharges+tenure | Churn, data=ash.df,main="scatterplotmatrix")
CORRELATION MATRIX
den.df <- subset(ash.df[,c(6,19,20)], ash.df$TotalCharges>=0 && ash.df$MonthlyCharges>=0)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
data <- cor(den.df,use = "complete.obs")
round(data,2)
## tenure MonthlyCharges TotalCharges
## tenure 1.00 0.25 0.83
## MonthlyCharges 0.25 1.00 0.65
## TotalCharges 0.83 0.65 1.00
The key drivers are tenure, contract, internet service, total charges and monthly charges.
par(mfrow=c(1,2))
d2 <- density(ash.df$tenure)
plot(d2, main="tenure")
polygon(d2, col="red", border="blue")
histogram(~ tenure | Churn,col="green")
This visualization shows that customers who have been customers for shorter periods are more likely to leave.
xyplot(TotalCharges ~ MonthlyCharges, data=den.df, main="total charges v/s monthly charges", type = c("p", "g"),xlab="monthly charges", ylab="total charges")
par(mfrow=c(1,2))
den.df <- subset(ash.df, ash.df$TotalCharges>=0)
d <- density(den.df$TotalCharges)
plot(d, main="Total charges")
polygon(d, col="red", border="blue")
den1.df <- subset(ash.df, ash.df$MonthlyCharges>=0)
d1 <- density(den1.df$MonthlyCharges)
plot(d1, main="Monthly charges")
polygon(d1, col="red", border="blue")
par(mfrow=c(2,1))
library(lattice)
histogram(~ TotalCharges | Churn, data=ash.df)
histogram(~ MonthlyCharges | Churn, data=ash.df, col="orange")
library(vcd)
## Loading required package: grid
tab1 <- xtabs(~ Churn + Contract)
tab1
## Contract
## Churn Month-to-month One year Two year
## No 2220 1307 1647
## Yes 1655 166 48
mosaic(tab1, shade=TRUE, legend=TRUE, main="contract basis")
tab2 <- xtabs(~ Churn + InternetService)
tab2
## InternetService
## Churn DSL Fiber optic No
## No 1962 1799 1413
## Yes 459 1297 113
mosaic(tab2, shade=TRUE, legend=TRUE, main="internet service effects")
boxplot(TotalCharges ~ InternetService, data=den.df,main="total charges with type of internet service", col=c("skyblue","orange","green"),xlab="type of internet service",ylab="total charges")
library(ggplot2)
qplot(x = TotalCharges, y = tenure, data = den.df, color = factor(Churn))
boxplot(tenure ~ Contract, data=ash.df,main="tenure v/s contract", col=c("red","orange","yellow"),xlab="type of contract",ylab="tenure")
boxplot(tenure ~ OnlineSecurity, data=ash.df,main="tenure v/s online security", col=c("burlywood","darkolivegreen"),xlab="online security",ylab="tenure")
From the above analysis and visualizations, customers who leave tend to be ones who are on a month-to-month contract, have fiber optic internet service, and have been customers for shorter periods.
CHI-SQUARED TESTS:
chisq.test(Churn,TotalCharges)
## Warning in chisq.test(Churn, TotalCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and TotalCharges
## X-squared = 6503.6, df = 6529, p-value = 0.5859
chisq.test(Churn,MonthlyCharges)
## Warning in chisq.test(Churn, MonthlyCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and MonthlyCharges
## X-squared = 2123.6, df = 1584, p-value < 2.2e-16
chisq.test(Churn,tenure)
## Warning in chisq.test(Churn, tenure): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and tenure
## X-squared = 1065.3, df = 72, p-value < 2.2e-16
chisq.test(Churn,OnlineSecurity)
##
## Pearson's Chi-squared test
##
## data: Churn and OnlineSecurity
## X-squared = 850, df = 2, p-value < 2.2e-16
chisq.test(Churn,Contract)
##
## Pearson's Chi-squared test
##
## data: Churn and Contract
## X-squared = 1184.6, df = 2, p-value < 2.2e-16
Hence, since all the p-values are very less (except for totalcharges), our original hypothesis is correct.