Project Title: Using Customer Behavior Data to Improve Customer Retention

NAME: ASWATHY GUNADEEP

EMAIL:

COLLEGE / COMPANY: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA

A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving.

The data set includes information about:

PRELIMINARY WORK

Reading the raw data into a dataframe

setwd("D:/desktop/Data Analytics internship-sameer mathur/project")
ash.df <- read.csv(paste("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=""))
attach(ash.df)
dim(ash.df)
## [1] 7043   21
View(ash.df)
str(ash.df)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
library(psych)
## Warning: package 'psych' was built under R version 3.5.3
describe(ash.df)
##                   vars    n    mean      sd  median trimmed     mad   min
## customerID*          1 7043 3522.00 2033.28 3522.00 3522.00 2610.86  1.00
## gender*              2 7043    1.50    0.50    2.00    1.51    0.00  1.00
## SeniorCitizen        3 7043    0.16    0.37    0.00    0.08    0.00  0.00
## Partner*             4 7043    1.48    0.50    1.00    1.48    0.00  1.00
## Dependents*          5 7043    1.30    0.46    1.00    1.25    0.00  1.00
## tenure               6 7043   32.37   24.56   29.00   31.43   32.62  0.00
## PhoneService*        7 7043    1.90    0.30    2.00    2.00    0.00  1.00
## MultipleLines*       8 7043    1.94    0.95    2.00    1.93    1.48  1.00
## InternetService*     9 7043    1.87    0.74    2.00    1.84    1.48  1.00
## OnlineSecurity*     10 7043    1.79    0.86    2.00    1.74    1.48  1.00
## OnlineBackup*       11 7043    1.91    0.88    2.00    1.88    1.48  1.00
## DeviceProtection*   12 7043    1.90    0.88    2.00    1.88    1.48  1.00
## TechSupport*        13 7043    1.80    0.86    2.00    1.75    1.48  1.00
## StreamingTV*        14 7043    1.99    0.89    2.00    1.98    1.48  1.00
## StreamingMovies*    15 7043    1.99    0.89    2.00    1.99    1.48  1.00
## Contract*           16 7043    1.69    0.83    1.00    1.61    0.00  1.00
## PaperlessBilling*   17 7043    1.59    0.49    2.00    1.62    0.00  1.00
## PaymentMethod*      18 7043    2.57    1.07    3.00    2.59    1.48  1.00
## MonthlyCharges      19 7043   64.76   30.09   70.35   64.97   35.66 18.25
## TotalCharges        20 7032 2283.30 2266.77 1397.47 1970.14 1812.92 18.80
## Churn*              21 7043    1.27    0.44    1.00    1.21    0.00  1.00
##                       max  range  skew kurtosis    se
## customerID*       7043.00 7042.0  0.00    -1.20 24.23
## gender*              2.00    1.0 -0.02    -2.00  0.01
## SeniorCitizen        1.00    1.0  1.83     1.36  0.00
## Partner*             2.00    1.0  0.07    -2.00  0.01
## Dependents*          2.00    1.0  0.87    -1.23  0.01
## tenure              72.00   72.0  0.24    -1.39  0.29
## PhoneService*        2.00    1.0 -2.73     5.43  0.00
## MultipleLines*       3.00    2.0  0.12    -1.88  0.01
## InternetService*     3.00    2.0  0.21    -1.15  0.01
## OnlineSecurity*      3.00    2.0  0.42    -1.52  0.01
## OnlineBackup*        3.00    2.0  0.18    -1.68  0.01
## DeviceProtection*    3.00    2.0  0.19    -1.68  0.01
## TechSupport*         3.00    2.0  0.40    -1.54  0.01
## StreamingTV*         3.00    2.0  0.03    -1.72  0.01
## StreamingMovies*     3.00    2.0  0.01    -1.72  0.01
## Contract*            3.00    2.0  0.63    -1.27  0.01
## PaperlessBilling*    2.00    1.0 -0.38    -1.86  0.01
## PaymentMethod*       4.00    3.0 -0.17    -1.21  0.01
## MonthlyCharges     118.75  100.5 -0.22    -1.26  0.36
## TotalCharges      8684.80 8666.0  0.96    -0.23 27.03
## Churn*               2.00    1.0  1.06    -0.87  0.01
library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
some(ash.df)
##      customerID gender SeniorCitizen Partner Dependents tenure
## 1163 9359-UGBTK Female             0      No         No     67
## 1536 7905-NJMXS   Male             1     Yes         No      7
## 2568 0407-BDJKB   Male             0     Yes         No     60
## 3338 0840-DFEZH Female             0      No         No      7
## 3410 4806-DXQCE Female             1     Yes         No     70
## 3691 6038-GCYEC Female             0      No         No     24
## 5590 7253-UVNDW Female             0      No         No     46
## 5674 6260-XLACS   Male             0      No         No      4
## 5687 4369-HTUIF   Male             1      No         No      6
## 6405 1428-IEDPR   Male             0      No         No     52
##      PhoneService    MultipleLines InternetService      OnlineSecurity
## 1163          Yes               No             DSL                 Yes
## 1536          Yes              Yes             DSL                 Yes
## 2568          Yes              Yes     Fiber optic                  No
## 3338          Yes              Yes     Fiber optic                  No
## 3410          Yes              Yes     Fiber optic                 Yes
## 3691           No No phone service             DSL                  No
## 5590          Yes               No             DSL                  No
## 5674          Yes               No              No No internet service
## 5687          Yes               No     Fiber optic                 Yes
## 6405          Yes              Yes     Fiber optic                 Yes
##             OnlineBackup    DeviceProtection         TechSupport
## 1163                  No                 Yes                 Yes
## 1536                 Yes                  No                 Yes
## 2568                 Yes                 Yes                  No
## 3338                  No                  No                  No
## 3410                 Yes                 Yes                 Yes
## 3691                 Yes                 Yes                  No
## 5590                  No                 Yes                 Yes
## 5674 No internet service No internet service No internet service
## 5687                  No                  No                  No
## 6405                  No                 Yes                 Yes
##              StreamingTV     StreamingMovies       Contract
## 1163                  No                 Yes       One year
## 1536                  No                  No Month-to-month
## 2568                 Yes                  No Month-to-month
## 3338                  No                  No Month-to-month
## 3410                 Yes                 Yes       Two year
## 3691                 Yes                 Yes Month-to-month
## 5590                  No                  No       Two year
## 5674 No internet service No internet service       One year
## 5687                 Yes                  No Month-to-month
## 6405                 Yes                 Yes       Two year
##      PaperlessBilling             PaymentMethod MonthlyCharges
## 1163              Yes Bank transfer (automatic)          72.35
## 1536               No          Electronic check          64.20
## 2568              Yes          Electronic check          95.75
## 3338              Yes          Electronic check          75.35
## 3410              Yes          Electronic check         113.65
## 3691              Yes   Credit card (automatic)          54.15
## 5590               No   Credit card (automatic)          54.35
## 5674               No              Mailed check          19.70
## 5687              Yes Bank transfer (automatic)          85.15
## 6405              Yes          Electronic check         109.30
##      TotalCharges Churn
## 1163      4991.50    No
## 1536       475.00    No
## 2568      5742.90   Yes
## 3338       564.65    No
## 3410      7714.65    No
## 3691      1240.25   Yes
## 5590      2460.15   Yes
## 5674       117.80    No
## 5687       503.60   Yes
## 6405      5731.40    No

VISUALIZATION

x <- c(5174 , 1869)
pct <- round(x/sum(x)*100)
labels <- c("No", "Yes")
lbls <- paste(labels, pct)
lbls <- paste(lbls,"%",sep="")
pie(x,labels = lbls,main="churn", col=c("skyblue","yellow"))

  1. Which customers are likely to leave?

The obvious guess is Total charges and Monthly charges and tenure.To make sure of this, we plot the corrgram, create correlation matrix and scatterplot matrix , excluding the the less important factor variables , having 2 levels-“Yes” and “No”.

CORRGRAM

library(corrgram)
## Warning: package 'corrgram' was built under R version 3.5.3
par(mfrow=c(1,1))
corrgram(ash.df, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Corrgram of dataset")

SCATTERPLOTMATRIX

scatterplotMatrix(~TotalCharges+MonthlyCharges+tenure | Churn, data=ash.df,main="scatterplotmatrix")

CORRELATION MATRIX

den.df <- subset(ash.df[,c(6,19,20)], ash.df$TotalCharges>=0 && ash.df$MonthlyCharges>=0)
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.5.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.5.3
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
## 
##     panel.fill
## Loading required package: survival
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 3.5.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
data <- cor(den.df,use = "complete.obs")
round(data,2)
##                tenure MonthlyCharges TotalCharges
## tenure           1.00           0.25         0.83
## MonthlyCharges   0.25           1.00         0.65
## TotalCharges     0.83           0.65         1.00

The key drivers are tenure, contract, internet service, total charges and monthly charges.

  1. tenure drives churn?
par(mfrow=c(1,2))
d2 <- density(ash.df$tenure)
plot(d2, main="tenure")
polygon(d2, col="red", border="blue")
histogram(~ tenure | Churn,col="green")

This visualization shows that customers who have been customers for shorter periods are more likely to leave.

  1. Role of total charges and monthly charges in retention.
xyplot(TotalCharges ~ MonthlyCharges, data=den.df, main="total charges v/s monthly charges", type = c("p", "g"),xlab="monthly charges", ylab="total charges")

par(mfrow=c(1,2))
den.df <- subset(ash.df, ash.df$TotalCharges>=0)
d <- density(den.df$TotalCharges)
plot(d, main="Total charges")
polygon(d, col="red", border="blue")
den1.df <- subset(ash.df, ash.df$MonthlyCharges>=0)
d1 <- density(den1.df$MonthlyCharges)
plot(d1, main="Monthly charges")
polygon(d1, col="red", border="blue")

par(mfrow=c(2,1))
library(lattice)
histogram(~ TotalCharges | Churn, data=ash.df)

histogram(~ MonthlyCharges | Churn, data=ash.df, col="orange")

  1. role of contract
library(vcd)
## Warning: package 'vcd' was built under R version 3.5.3
## Loading required package: grid
tab1 <- xtabs(~ Churn + Contract)
tab1
##      Contract
## Churn Month-to-month One year Two year
##   No            2220     1307     1647
##   Yes           1655      166       48
mosaic(tab1, shade=TRUE, legend=TRUE, main="contract basis")

  1. role of internet service
tab2 <- xtabs(~ Churn + InternetService)
tab2
##      InternetService
## Churn  DSL Fiber optic   No
##   No  1962        1799 1413
##   Yes  459        1297  113
mosaic(tab2, shade=TRUE, legend=TRUE, main="internet service effects")

boxplot(TotalCharges ~ InternetService, data=den.df,main="total charges with type of internet service", col=c("skyblue","orange","green"),xlab="type of internet service",ylab="total charges")

library(ggplot2)
qplot(x = TotalCharges, y = tenure, data = den.df, color = factor(Churn))

boxplot(tenure ~ Contract, data=ash.df,main="tenure v/s contract", col=c("red","orange","yellow"),xlab="type of contract",ylab="tenure")

boxplot(tenure ~ OnlineSecurity, data=ash.df,main="tenure v/s online security", col=c("burlywood","darkolivegreen"),xlab="online security",ylab="tenure")

From the above analysis and visualizations, customers who leave tend to be ones who are on a month-to-month contract, have fiber optic internet service, and have been customers for shorter periods.

CHI-SQUARED TESTS:

chisq.test(Churn,TotalCharges)
## Warning in chisq.test(Churn, TotalCharges): Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  Churn and TotalCharges
## X-squared = 6503.6, df = 6529, p-value = 0.5859
chisq.test(Churn,MonthlyCharges)
## Warning in chisq.test(Churn, MonthlyCharges): Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  Churn and MonthlyCharges
## X-squared = 2123.6, df = 1584, p-value < 2.2e-16
chisq.test(Churn,tenure)
## Warning in chisq.test(Churn, tenure): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  Churn and tenure
## X-squared = 1065.3, df = 72, p-value < 2.2e-16
chisq.test(Churn,OnlineSecurity)
## 
##  Pearson's Chi-squared test
## 
## data:  Churn and OnlineSecurity
## X-squared = 850, df = 2, p-value < 2.2e-16
chisq.test(Churn,Contract)
## 
##  Pearson's Chi-squared test
## 
## data:  Churn and Contract
## X-squared = 1184.6, df = 2, p-value < 2.2e-16

Hence, since all the p-values are very less (except for totalcharges), our original hypothesis is correct.