Project Title: Using Customer Behavior Data to Improve Customer Retention
NAME: ASWATHY GUNADEEP
EMAIL: aswathygunadeep@gmail.com
COLLEGE / COMPANY: NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA
A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving.
The data set includes information about:
PRELIMINARY WORK
setwd("C:/Users/user/Desktop/tarsha systems summer internship/basic script")
ash.df <- read.csv(paste("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=""))
attach(ash.df)
dim(ash.df)
## [1] 7043 21
str(ash.df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
library(car)
## Loading required package: carData
some(ash.df)
## customerID gender SeniorCitizen Partner Dependents tenure
## 604 2001-MCUUW Male 0 No No 46
## 1372 6567-HOOPW Female 1 No No 1
## 1925 1845-CSBRZ Female 1 Yes Yes 22
## 2044 6557-BZXLQ Male 1 No No 16
## 2340 4094-NSEDU Female 1 No No 21
## 2412 6901-GOGZG Male 0 No Yes 60
## 3219 6869-FGJJC Male 0 No No 68
## 5951 2862-PFNIK Male 0 No Yes 1
## 6016 2325-NBPZG Female 0 No No 3
## 6255 9933-QRGTX Female 0 Yes No 60
## PhoneService MultipleLines InternetService OnlineSecurity
## 604 Yes Yes Fiber optic No
## 1372 Yes No Fiber optic No
## 1925 Yes Yes Fiber optic No
## 2044 Yes No Fiber optic No
## 2340 Yes Yes Fiber optic No
## 2412 Yes Yes DSL Yes
## 3219 Yes No Fiber optic No
## 5951 No No phone service DSL No
## 6016 Yes No Fiber optic Yes
## 6255 Yes No Fiber optic Yes
## OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
## 604 Yes Yes Yes Yes Yes
## 1372 No No No Yes No
## 1925 No No No Yes No
## 2044 No No No No No
## 2340 No No No Yes Yes
## 2412 No Yes Yes Yes Yes
## 3219 Yes Yes No No No
## 5951 No No No No No
## 6016 No No No No No
## 6255 No No Yes Yes Yes
## Contract PaperlessBilling PaymentMethod
## 604 Two year No Electronic check
## 1372 Month-to-month Yes Electronic check
## 1925 Month-to-month No Bank transfer (automatic)
## 2044 Month-to-month No Electronic check
## 2340 Month-to-month Yes Electronic check
## 2412 Two year Yes Electronic check
## 3219 One year No Credit card (automatic)
## 5951 Month-to-month Yes Electronic check
## 6016 Month-to-month Yes Electronic check
## 6255 Two year Yes Electronic check
## MonthlyCharges TotalCharges Churn
## 604 108.65 4903.20 No
## 1372 79.20 79.20 Yes
## 1925 85.35 1961.60 No
## 2044 69.65 1043.30 No
## 2340 94.25 1973.75 Yes
## 2412 84.95 4984.85 No
## 3219 79.60 5461.45 No
## 5951 24.70 24.70 No
## 6016 74.55 233.65 No
## 6255 97.20 5611.75 No
VISUALIZATION
pie(table(ash.df$Churn),main="churn", col=c("skyblue","yellow"))
mytable <- with(ash.df, table(Churn))
prop.table(mytable)*100
## Churn
## No Yes
## 73.46301 26.53699
Which customers are likely to leave?
The obvious guess is Total charges and Monthly charges and tenure. To make sure of this, we plot the corrgram, create correlation matrix and scatterplot matrix , excluding the the less important factor variables , having 2 levels-“Yes” and “No”.
CORRGRAM
library(corrgram)
par(mfrow=c(1,1))
corrgram(ash.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of dataset")
These are some of the possible conclusions that can be deduced from the above graph:
SCATTERPLOTMATRIX
library(car)
scatterplotMatrix(~TotalCharges+MonthlyCharges+tenure | Churn, data=ash.df,main="scatterplotmatrix", col=c("red","black"))
CORRELATION MATRIX
den.df <- subset(ash.df[,c(6,19,20)], ash.df$TotalCharges>=0 && ash.df$MonthlyCharges>=0)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
data <- cor(den.df,use = "complete.obs")
round(data,2)
## tenure MonthlyCharges TotalCharges
## tenure 1.00 0.25 0.83
## MonthlyCharges 0.25 1.00 0.65
## TotalCharges 0.83 0.65 1.00
The correlation matrix also suggests that tenure affects total charges greatly.
I 1.Demographic variables such as gender and no. of senior citizens:
par(mfrow=c(1,2))
mytable1 <- with(ash.df, table(gender))
barchart(mytable1, col="grey", main="genderwise breakup")
prop.table(mytable1)*100
## gender
## Female Male
## 49.52435 50.47565
mytable <- with(ash.df, table(SeniorCitizen))
barchart(mytable,main="no. of seniorcitizens",col="grey")
prop.table(mytable)*100
## SeniorCitizen
## 0 1
## 83.78532 16.21468
There are more males than females.
par(mfrow=c(1,1))
d <- density(ash.df$tenure)
plot(d, main="tenure", col="blue")
polygon(d, col="blue", border="turquoise")
Many customers use the company’s service for a shorter period.
par(mfrow=c(1,2))
mytable1 <- with(ash.df, table(PhoneService))
histogram(ash.df$PhoneService, main="phone service", col=c("red", "orange"), xlab="phone service")
prop.table(mytable1)*100
## PhoneService
## No Yes
## 9.683374 90.316626
mytable <- with(ash.df, table(MultipleLines))
histogram(ash.df$MultipleLines, main="multiple lines",col=c("red", "orange", "yellow"),xlab="multiple lines service")
prop.table(mytable)*100
## MultipleLines
## No No phone service Yes
## 48.132898 9.683374 42.183729
Most customers use the company’s phone service and do not have multiple lines.
pie(table(ash.df$InternetService),main="types of internet services", col=c("magenta","violet","pink"))
mytable <- with(ash.df, table(InternetService))
prop.table(mytable)*100
## InternetService
## DSL Fiber optic No
## 34.37456 43.95854 21.66690
Almost 44% customers have fiber optic type of internet service.
par(mfrow=c(1,2))
mytable <- with(ash.df, table(OnlineSecurity))
prop.table(mytable)*100
## OnlineSecurity
## No No internet service Yes
## 49.66634 21.66690 28.66676
barchart(mytable, main="online security split up", horizontal=FALSE, xlab="online security", col=c("burlywood", "darkolivegreen"))
mytable1 <- with(ash.df, table(OnlineBackup))
prop.table(mytable1)*100
## OnlineBackup
## No No internet service Yes
## 43.84495 21.66690 34.48814
barchart(mytable1, main="online backup split up", horizontal=FALSE, xlab="online backup", col=c("burlywood","darkolivegreen"))
par(mfrow=c(1,1))
mytable <- with(ash.df, table(Contract))
prop.table(mytable)*100
## Contract
## Month-to-month One year Two year
## 55.01917 20.91438 24.06645
barchart(mytable, main="contract", horizontal=FALSE, col=c("skyblue","orange","green"), xlab="type of contract")
55% customers have availed month-to-month contract basis.
par(mfrow=c(1,2))
den.df <- subset(ash.df, ash.df$TotalCharges>=0)
d <- density(den.df$TotalCharges)
plot(d, main="Total charges")
polygon(d, col="red", border="blue")
den1.df <- subset(ash.df, ash.df$MonthlyCharges>=0)
d1 <- density(den1.df$MonthlyCharges)
plot(d1, main="Monthly charges")
polygon(d1, col="red", border="blue")
Monthly charges are more than total charges, which could be reason for customers to leave.
The key drivers are tenure, contract, internet service, total charges and monthly charges.
II 1. tenure drives churn?
par(mfrow=c(1,2))
histogram(~ tenure | Churn,col="green", main="tenure with churn")
This shows that customers who tend to leave are people who have been customers for shorter periods.
xyplot(TotalCharges ~ MonthlyCharges, data=den.df, main="total charges v/s monthly charges", type = c("p", "g"),xlab="monthly charges", ylab="total charges")
par(mfrow=c(2,1))
library(lattice)
histogram(~ TotalCharges | Churn, data=ash.df)
histogram(~ MonthlyCharges | Churn, data=ash.df, col="orange")
As stated above, due to high monthly charges, churn percentage is more in that case.
library(vcd)
## Loading required package: grid
par(mfrow=c(1,1))
tab1 <- xtabs(~ Churn + Contract)
tab1
## Contract
## Churn Month-to-month One year Two year
## No 2220 1307 1647
## Yes 1655 166 48
mosaic(tab1, shade=TRUE, legend=TRUE, main="contract basis")
Customers who tend to leave are on a month-to-month contract basis.
par(mfrow=c(1,1))
tab2 <- xtabs(~ Churn + InternetService)
tab2
## InternetService
## Churn DSL Fiber optic No
## No 1962 1799 1413
## Yes 459 1297 113
mosaic(tab2, shade=TRUE, legend=TRUE, main="internet service effects")
Customers who tend to leave are havinf fiber optic type of internet connection.
boxplot(TotalCharges ~ InternetService, data=den.df,main="total charges with type of internet service", col=c("blue","red","yellow"),xlab="type of internet service",ylab="total charges")
Total charges are more for fiber optic type of internet service.
boxplot(tenure ~ Contract, data=ash.df,main="tenure v/s contract", col=c("green","orange","pink"),xlab="type of contract",ylab="tenure")
boxplot(tenure ~ OnlineSecurity, data=ash.df,main="tenure v/s online security", col=c("darkcyan","yellowgreen","darkseagreen1"),xlab="online security",ylab="tenure")
Evidently, customers prefer to use the company’s services for a longer time if they are assured online security.
plot(x=ash.df$tenure, y=ash.df$TotalCharges, main="total charges v/s tenure", col="red", cex=0.6, xlab="tenure", ylab="total charges")
From the above analysis and visualizations, customers who leave tend to be ones who are on a month-to-month contract, have fiber optic internet service, and have been customers for shorter periods.
III
library(ggplot2)
par(mfrow=c(1,2))
qplot(x = TotalCharges, y = tenure, data = ash.df, color = factor(Churn))
## Warning: Removed 11 rows containing missing values (geom_point).
qplot(x = TotalCharges, y = MonthlyCharges, data = ash.df, color = factor(Churn))
## Warning: Removed 11 rows containing missing values (geom_point).
CHI-SQUARED TESTS:
chisq.test(Churn,TotalCharges)
## Warning in chisq.test(Churn, TotalCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and TotalCharges
## X-squared = 6503.6, df = 6529, p-value = 0.5859
chisq.test(Churn,MonthlyCharges)
## Warning in chisq.test(Churn, MonthlyCharges): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and MonthlyCharges
## X-squared = 2123.6, df = 1584, p-value < 2.2e-16
chisq.test(Churn,tenure)
## Warning in chisq.test(Churn, tenure): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: Churn and tenure
## X-squared = 1065.3, df = 72, p-value < 2.2e-16
chisq.test(Churn,Contract)
##
## Pearson's Chi-squared test
##
## data: Churn and Contract
## X-squared = 1184.6, df = 2, p-value < 2.2e-16
Hence, since all the p-values are very less (except for totalcharges), our original hypothesis is correct and NULL hypothesis of independence is rejected.
t-Tests NULL hypothesis: There is no difference in total charges for people abandoning and still using the company’s services.
t.test(TotalCharges ~ Churn, data=ash.df)
##
## Welch Two Sample t-test
##
## data: TotalCharges by Churn
## t = 18.801, df = 4042.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 916.8121 1130.2840
## sample estimates:
## mean in group No mean in group Yes
## 2555.344 1531.796
Based on the above output of the t-test, we can reject the hypothesis that the total charges of people leaving and still using the company’s services are the same. (p<0.001)
Regression models:
Simple linear regression (1)
sfit1 <- lm(TotalCharges ~ tenure, data=ash.df)
summary(sfit1)
##
## Call:
## lm(formula = TotalCharges ~ tenure, data = ash.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3997.2 -416.1 157.1 526.1 3382.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -189.529 25.254 -7.505 6.91e-14 ***
## tenure 76.271 0.621 122.810 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1278 on 7030 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.6821, Adjusted R-squared: 0.682
## F-statistic: 1.508e+04 on 1 and 7030 DF, p-value: < 2.2e-16
plot(y=TotalCharges, x=tenure, main="total charges v/s tenure", col="orange")
abline(sfit1)
The regression coefficient (76.27) is significantly dfferent from zero (p < 0.001). The multiple R-squared indciates that the model accounts for 68% of the variance in tenure. Although, the residual standard error is quite high for this model to predict total charges. This is not a good model.
Simple linear regression (2)
sfit2 <- lm(MonthlyCharges ~ tenure, data=ash.df)
summary(sfit2)
##
## Call:
## lm(formula = MonthlyCharges ~ tenure, data = ash.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.498 -27.251 6.245 24.943 54.376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.92978 0.57476 95.57 <2e-16 ***
## tenure 0.30372 0.01415 21.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.15 on 7041 degrees of freedom
## Multiple R-squared: 0.06145, Adjusted R-squared: 0.06132
## F-statistic: 461 on 1 and 7041 DF, p-value: < 2.2e-16
plot(y=MonthlyCharges, x=tenure, main="monthly charges v/s tenure", col="red")
abline(sfit2)
The regression coefficient (0.303) is dfferent from zero (p < 0.001). The multiple R-squared indciates that the model accounts for 6% of the variance in tenure. Although, the residual standard error is quite high for this model to predict monthly charges. This is not at all a good model.
Logistic regression
model <- glm(Churn ~ TotalCharges + OnlineSecurity + OnlineBackup + tenure + MonthlyCharges + gender + Contract + PhoneService + MultipleLines + InternetService, family=binomial(link='logit'),data=ash.df)
summary(model)
##
## Call:
## glm(formula = Churn ~ TotalCharges + OnlineSecurity + OnlineBackup +
## tenure + MonthlyCharges + gender + Contract + PhoneService +
## MultipleLines + InternetService, family = binomial(link = "logit"),
## data = ash.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7072 -0.7073 -0.2926 0.8059 3.4724
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.733e-01 2.027e-01 0.855 0.39250
## TotalCharges 3.057e-04 7.022e-05 4.353 1.34e-05
## OnlineSecurityNo internet service -5.870e-01 1.873e-01 -3.133 0.00173
## OnlineSecurityYes -5.550e-01 8.490e-02 -6.537 6.29e-11
## OnlineBackupNo internet service NA NA NA NA
## OnlineBackupYes -2.356e-01 7.834e-02 -3.007 0.00264
## tenure -5.915e-02 6.135e-03 -9.641 < 2e-16
## MonthlyCharges 1.075e-02 4.263e-03 2.521 0.01171
## genderMale -3.006e-02 6.397e-02 -0.470 0.63839
## ContractOne year -8.254e-01 1.056e-01 -7.817 5.39e-15
## ContractTwo year -1.682e+00 1.729e-01 -9.726 < 2e-16
## PhoneServiceYes -9.549e-01 1.463e-01 -6.527 6.71e-11
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 2.627e-01 8.099e-02 3.244 0.00118
## InternetServiceFiber optic 7.417e-01 1.390e-01 5.338 9.41e-08
## InternetServiceNo NA NA NA NA
##
## (Intercept)
## TotalCharges ***
## OnlineSecurityNo internet service **
## OnlineSecurityYes ***
## OnlineBackupNo internet service
## OnlineBackupYes **
## tenure ***
## MonthlyCharges *
## genderMale
## ContractOne year ***
## ContractTwo year ***
## PhoneServiceYes ***
## MultipleLinesNo phone service
## MultipleLinesYes **
## InternetServiceFiber optic ***
## InternetServiceNo
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8143.4 on 7031 degrees of freedom
## Residual deviance: 5938.0 on 7019 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 5964
##
## Number of Fisher Scoring iterations: 6
It is evident that MonthlyCharges and gender is not signiificant at all. As for the statistically significant variables, tenure has the lowest p-value suggesting a strong association of the period for which the customers used the company’s services with the churn prediction.
As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. The ROC(receiving operating characteristic) is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(gplots)
fitted.results <- predict(model,newdata=subset(ash.df,select=c(10,20,6,16,7,9)),type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pr <- prediction(fitted.results, ash.df$Churn)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8418911
Forecasting future values:
library(forecast)
ash1.df <- subset(ash.df[c(6,19,20)])
myts <- ts(ash1.df)
forecast(myts, 3)
## Warning in ets(object, lambda = lambda, allow.multiplicative.trend =
## allow.multiplicative.trend, : Missing values encountered. Using longest
## contiguous portion of time series
## tenure
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 7044 32.38652 0.9130221 63.86002 -15.74804 80.52109
## 7045 32.38652 0.9130219 63.86002 -15.74804 80.52109
## 7046 32.38652 0.9130218 63.86002 -15.74804 80.52109
##
## MonthlyCharges
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 7044 64.7517 26.24427 103.2591 5.859675 123.6437
## 7045 64.7517 26.24427 103.2591 5.859675 123.6437
## 7046 64.7517 26.24427 103.2591 5.859674 123.6437
##
## TotalCharges
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 3332 2254.99 -651.9964 5161.977 -2190.862 6700.843
## 3333 2254.99 -651.9964 5161.977 -2190.862 6700.843
## 3334 2254.99 -651.9965 5161.977 -2190.862 6700.843
plot(forecast(myts, 3))
## Warning in ets(object, lambda = lambda, allow.multiplicative.trend =
## allow.multiplicative.trend, : Missing values encountered. Using longest
## contiguous portion of time series
From the above analysis and visualizations, customers who leave tend to be ones who are on a month-to-month contract, have fiber optic internet service, and have been customers for shorter periods.