Telecom Churn - EDA and Prediction

With the rapid growth of the telecommunications industry, the service providers are more inclined towards customer base expansion. In order to fulfill the need to thrive in a competitive world, maintaining current customers has become a huge challenge. The cost of attracting a new customer is said to be much higher than that of maintaining the current one. Hence, it is important for the telecom companies to use advanced analytics to consider consumer behavior and forecast in-turn customer interaction as whether they would exit the business or not. This collection of data provides information regarding a telecom company’s level of customers. Specific characteristics are reported for each customer relating to the services used.

We will be performing various steps to arrive at model to predict the future churns in this industry.

Introduction

Customer churns happen when consumers or subscribers cease business with a company or service. This is also known as consumer turnover or defection. Many service providers, such as telecommunications companies, internet companies, or insurance firms often use customer churn analysis as one of the main business indicators to maintain the customer satisfaction.

About the data

Target variable is called Churn - it represents customers who left.
Predictor variables include all the remaining variables.

Objective

This the demonstration in R to analyze all relevant customer data and predict Customer churn.

Insights

Some possible insights could be -

What variables are contributing to customer churn?
Who are the customers more likely to churn?
What actions can be taken to stop them from leaving?

Data Import

We have taken the dataset from Kaggle, data source link Customer Churn.
We are importing the Telecom Churn data and converting the response variable to factors.

library(rmarkdown)
library(data.table)
library(ggplot2)
library(GGally)
library(grid)
library(gridExtra)
library(corrplot)
library(tidyverse)
library(tidyr)
library(caret)
library(knitr)
library(mgcv)
library(nnet)
library(NeuralNetTools)
library(e1071)
library(verification)
library(glmnet)
library(dplyr)
library(ROCR)
set.seed(12345)

telecom <- read.csv("~/telecom_churn.csv")
telecom$Churn <- factor(telecom$Churn)
dim(telecom)

## [1] 3333   11

str(telecom)

## 'data.frame':    3333 obs. of  11 variables:
##  $ Churn          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AccountWeeks   : int  128 107 137 84 75 118 121 147 117 141 ...
##  $ ContractRenewal: int  1 1 1 0 0 0 1 0 1 0 ...
##  $ DataPlan       : int  1 1 0 0 0 0 1 0 0 1 ...
##  $ DataUsage      : num  2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
##  $ CustServCalls  : int  1 1 0 2 3 0 3 0 1 0 ...
##  $ DayMins        : num  265 162 243 299 167 ...
##  $ DayCalls       : int  110 123 114 71 113 98 88 79 97 84 ...
##  $ MonthlyCharge  : num  89 82 52 57 41 57 87.3 36 63.9 93.2 ...
##  $ OverageFee     : num  9.87 9.78 6.06 3.1 7.42 ...
##  $ RoamMins       : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...

There are 3333 records of customer and 11 variables.

We will convert variables ContractRenewal and DataPlan to factors as well since it takes only two levels of values, namely “0” and “1”.

telecom$DataPlan <- factor(telecom$DataPlan)
telecom$ContractRenewal <- factor(telecom$ContractRenewal)

Here is the data.

paged_table(telecom)

Data Cleaning and Split

Data Cleaning

Missing Values

We are checking if there are any missing values in any of the variables.

colSums(is.na(telecom))

##           Churn    AccountWeeks ContractRenewal        DataPlan       DataUsage 
##               0               0               0               0               0 
##   CustServCalls         DayMins        DayCalls   MonthlyCharge      OverageFee 
##               0               0               0               0               0 
##        RoamMins 
##               0

print(paste(sum(complete.cases(telecom)),"Complete cases!"))

## [1] "3333 Complete cases!"

any(is.null(telecom))

## [1] FALSE

We see that there are no missing values in any of the variables and there are 3333 complete cases. Also, there are no invalid or spurious data in the dataset.

Duplicate Records

We are checking if there are duplicate records in the dataset which is redundant and hence needs to be removed.

dim(unique(telecom))[1]

## [1] 3333

There is no duplicate records in the dataset

Data Split

We are splitting the data into training and testing set with 80:20 percentage of data. We will use the trainings set to train the model and testing set to validate the model.

index <- sample(nrow(telecom),nrow(telecom)*0.80)
telecom_train = telecom[index,]
telecom_test = telecom[-index,]

Data Dictionary

Variable	Type	Description
Churn	factor	1 if customer cancelled service, 0 if not
AccountWeeks	integer	number of weeks customer has had active account
ContractRenewal	factor	1 if customer recently renewed contract, 0 if not
DataPlan	factor	1 if customer has data plan, 0 if not
DataUsage	double	gigabytes of monthly data usage
CustServCalls	integer	number of calls into customer service
DayMins	double	average daytime minutes per month
DayCalls	integer	average number of daytime calls
MonthlyCharge	double	average monthly bill
OverageFee	double	largest overage fee in last 12 months
RoamMins	double	average number of roaming minutes

EDA

Stayed Vs Cancelled

In the bar graph below, we can see that the number of customers’ turnover is made up 1/4 of the customers deciding to stay with their current service provider.

ggplot(telecom, aes(Churn)) + geom_bar(aes(fill = Churn), width = 0.6) + labs(title = "Stayed vs Cancelled", subtitle = "From Customer Churn dataset", x = "Customer Churn", y = "Count") + guides(fill = FALSE) + scale_x_discrete(labels = c("0" = "Stayed", "1" = "Cancelled")) + theme_classic()

We can see from the above bar graph that we have high count of customers who did not cancel their service.

Correlation between variables

corrplot(cor(telecom[,-1]), type = "lower", method = "number")

We can see the correlation coefficient of 0.95 between Data Usage and Data Plan. This indicates a strong positive linear relationship. It is a logical statement because customer who has data plan tends to use more data. Also, looking at the correlation between Monthly Charge and Data Plan or Monthly Charge and Data Usage, the coefficients are 0.74 and 0.78 respectively. These are strong linear relationships. Lastly, Days Mins and Monthly Charge has 0.57 correlation coefficent showing a moderate positive linear relationship.

ggpairs(telecom)

The plot shows that DayMins, AverageFee, RoamingMinutes are normally ditributed.AccountWeeks, Data Usage and Monthy Charges are right skewed and DailyCalls is left skewed.

Relation between Churn and Categorical variables

ggplot(telecom, aes(Churn)) + geom_bar(aes(fill = ContractRenewal))
ggplot(telecom, aes(Churn)) + geom_bar(aes(fill = DataPlan))

We can see that the the count of customers who churned did not renew their contract but they have active data plan.

Relation between Churn and Numerical variables and Outliers

Account Weeks + Day Calls: Customers decided to stay with the current service carrier and those chose to opt out the service have used almost the same number of weeks and Day calls. The population in two groups are almost in normal distribution.
Data Usage: Consumers in group 0 used more data than group 1, however group 1 has more outliers than the other group. This means that customers who decided to leave the service provider does not show the trend on how much they use data. -Customer Call: Consumers in group 1 who chose to terminate the serive shows they have spend much times on calling customer service. Consumers in group 0 are skew positively.
Monthlt Charge + Overage Fee +Roam Mins: In these catagories, there is not much differece between both groups. Consumers in group 1 has slightly higher number in monthly charge and overage fee.
Day Mins: Consumer in group 1 shows to spend more time in day calls.

Logistic Regression

To begin, a logistic regression model containing all variables is fit to our training data, this model use the default logit link function. Train a logistic regression model with all predictor variables:

telecom.glm0 <- glm(Churn~., family=binomial, data=telecom_train)
summary(telecom.glm0)

## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = telecom_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8204  -0.5214  -0.3607  -0.2168   2.9479  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.6456455  0.5946808  -9.494  < 2e-16 ***
## AccountWeeks     -0.0001974  0.0015357  -0.129 0.897716    
## ContractRenewal1 -1.9158591  0.1597725 -11.991  < 2e-16 ***
## DataPlan1        -1.4915812  0.5980313  -2.494 0.012626 *  
## DataUsage         0.5945315  2.1193691   0.281 0.779076    
## CustServCalls     0.4912088  0.0428183  11.472  < 2e-16 ***
## DayMins           0.0184734  0.0358158   0.516 0.606001    
## DayCalls          0.0028711  0.0030045   0.956 0.339278    
## MonthlyCharge    -0.0387268  0.2104983  -0.184 0.854032    
## OverageFee        0.2056159  0.3594994   0.572 0.567355    
## RoamMins          0.0796817  0.0238582   3.340 0.000838 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2208.6  on 2665  degrees of freedom
## Residual deviance: 1785.3  on 2655  degrees of freedom
## AIC: 1807.3
## 
## Number of Fisher Scoring iterations: 5

We see that some of the variables has p-valus < 0.05 and therefore, plays a significant role in the model.

Model performance:

AIC(telecom.glm0)

## [1] 1807.332

BIC(telecom.glm0)

## [1] 1872.103

telecom.glm0$deviance

## [1] 1785.332

AUC for training data:

pred.glm0.train <- predict(telecom.glm0, type="response")
pred0 <- prediction(pred.glm0.train, telecom_train$Churn)
unlist(slot(performance(pred0, "auc"), "y.values"))

## [1] 0.8093944

class.glm0.train<- (pred.glm0.train>0.5)*1

AUC for testing data:

pred.glm0.test <- predict(telecom.glm0, telecom_test, type="response")
pred0 <- prediction(pred.glm0.test, telecom_test$Churn)
unlist(slot(performance(pred0, "auc"), "y.values"))

## [1] 0.850737

class.glm0.test<- (pred.glm0.test>0.5)*1

Missclassification Rate for training data:

mean(telecom_test$Churn != class.glm0.test)

## [1] 0.1229385

Missclassification Rate for testing data:

mean(telecom_train$Churn != class.glm0.train)

## [1] 0.1425356

We see that the missclassification rate of the full model is not too good so to improve upon the model, we perform the variable selection.

Variable Selection

In order to see if we could improve upon a simple model including all possible variables, various variable selection procedures were used including stepwise selection using both AIC and BIC criteria.

AIC-stepwise We build the model using AIC stepwise.

model_aic <- step(telecom.glm0)

summary(model_aic)

## 
## Call:
## glm(formula = Churn ~ ContractRenewal + DataPlan + CustServCalls + 
##     DayMins + OverageFee + RoamMins, family = binomial, data = telecom_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8718  -0.5291  -0.3604  -0.2185   2.9321  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.476759   0.475315 -11.522  < 2e-16 ***
## ContractRenewal1 -1.917849   0.159366 -12.034  < 2e-16 ***
## DataPlan1        -0.899978   0.159449  -5.644 1.66e-08 ***
## CustServCalls     0.487195   0.042614  11.433  < 2e-16 ***
## DayMins           0.011909   0.001175  10.132  < 2e-16 ***
## OverageFee        0.139130   0.025060   5.552 2.82e-08 ***
## RoamMins          0.089987   0.022005   4.089 4.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2208.6  on 2665  degrees of freedom
## Residual deviance: 1787.3  on 2659  degrees of freedom
## AIC: 1801.3
## 
## Number of Fisher Scoring iterations: 5

AUC for training data:

pred.aic.train <- predict(model_aic, type="response")
pred_aic <- prediction(pred.aic.train, telecom_train$Churn)
auc_aic_train<-unlist(slot(performance(pred_aic, "auc"), "y.values"))
auc_aic_train

## [1] 0.8085984

class.glm0.train.aic<- (pred.aic.train>0.5)*1

AUC for testing data:

pred.aic.test <- predict(model_aic, telecom_test, type="response")
pred_aic1 <- prediction(pred.aic.test, telecom_test$Churn)
auc_aic_test<-unlist(slot(performance(pred_aic1, "auc"), "y.values"))
auc_aic_test

## [1] 0.85216

class.glm0.test.aic<- (pred.aic.test>0.5)*1

Missclassification Rate for training data:

MR_aic_train<-mean(telecom_train$Churn != class.glm0.train.aic)
MR_aic_train

## [1] 0.1414104

Missclassification Rate for testing data:

MR_aic_test<-mean(telecom_test$Churn != class.glm0.test.aic)
MR_aic_test

## [1] 0.125937

BIC-stepwise We build the model using BIC stepwise.

model_bic <- step(telecom.glm0, k=log(nrow(telecom_train)))

summary(model_bic)

## 
## Call:
## glm(formula = Churn ~ ContractRenewal + DataPlan + CustServCalls + 
##     DayMins + OverageFee + RoamMins, family = binomial, data = telecom_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8718  -0.5291  -0.3604  -0.2185   2.9321  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.476759   0.475315 -11.522  < 2e-16 ***
## ContractRenewal1 -1.917849   0.159366 -12.034  < 2e-16 ***
## DataPlan1        -0.899978   0.159449  -5.644 1.66e-08 ***
## CustServCalls     0.487195   0.042614  11.433  < 2e-16 ***
## DayMins           0.011909   0.001175  10.132  < 2e-16 ***
## OverageFee        0.139130   0.025060   5.552 2.82e-08 ***
## RoamMins          0.089987   0.022005   4.089 4.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2208.6  on 2665  degrees of freedom
## Residual deviance: 1787.3  on 2659  degrees of freedom
## AIC: 1801.3
## 
## Number of Fisher Scoring iterations: 5

AUC for training data:

pred.bic.train <- predict(model_bic, type="response")
pred_bic <- prediction(pred.bic.train, telecom_train$Churn)
unlist(slot(performance(pred_bic, "auc"), "y.values"))

## [1] 0.8085984

class.glm0.train.bic<- (pred.bic.train>0.5)*1

AUC for testing data:

pred.bic.test <- predict(model_bic, telecom_test, type="response")
pred_bic1 <- prediction(pred.bic.test, telecom_test$Churn)
unlist(slot(performance(pred_bic1, "auc"), "y.values"))

## [1] 0.85216

class.glm0.test.bic<- (pred.bic.test>0.5)*1

Missclassification Rate for training data:

mean(telecom_train$Churn != class.glm0.train.bic)

## [1] 0.1414104

Missclassification Rate for testing data:

mean(telecom_test$Churn != class.glm0.test.bic)

## [1] 0.125937

Model Performance- Logistic Regression

Given that all three models AIC and BIC model is the same and is better than the full model in terms of missclassification rate of the testing data. As the model fit using AIC (same as BIC in this case) (hereafter the ‘AIC model’) has the smallest MR we considered it our best model.

In order to determine if our model has sufficient ability to discriminate between true positives and false positives, we plotted the ROC curve. Figure 1 displays the ROC curve for the AIC model based on our training data and Figure 2 displays the ROC curve for the AIC model based on the remaining 20% testing data.

ROC curve for training data:

pred <- prediction(pred.aic.train, telecom_train$Churn)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

ROC curve for testing data:

pred <- prediction(pred.aic.test, telecom_test$Churn)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

Conclusion- Logistic Regression

The final model contains 6 predictors. The in-sample and out-of-sample performance of this model is comparable. The model can accurately predict the churn classification for more than 88% of the customers in the data set.

Neural Network

Our goal here is to develop a neural network to determine if a customer cancels the telecom service or not.

We now generate the error of the neural network model, along with the weights between the inputs, hidden layers, and outputs.

library(neuralnet)

## 
## Attaching package: 'neuralnet'

## The following object is masked from 'package:ROCR':
## 
##     prediction

## The following object is masked from 'package:dplyr':
## 
##     compute

nn <- neuralnet(Churn ~ ., data = telecom_train, hidden = c(2,1),act.fct = "logistic", linear.output = FALSE,stepmax = 1e+08, likelihood = TRUE)
nn$result.matrix

##                                      [,1]
## error                        8.844941e+01
## reached.threshold            9.797548e-03
## steps                        1.906010e+05
## aic                          2.308988e+02
## bic                          3.898839e+02
## Intercept.to.1layhid1        6.172225e+00
## AccountWeeks.to.1layhid1    -3.867337e-01
## ContractRenewal.to.1layhid1 -6.828850e+00
## DataPlan.to.1layhid1        -1.875615e-01
## DataUsage.to.1layhid1        8.504366e-01
## CustServCalls.to.1layhid1   -9.084971e+00
## DayMins.to.1layhid1          3.515021e+00
## DayCalls.to.1layhid1        -8.295585e-01
## MonthlyCharge.to.1layhid1   -6.726830e-01
## OverageFee.to.1layhid1       2.096985e+00
## RoamMins.to.1layhid1         2.481003e-01
## Intercept.to.1layhid2       -1.813270e+00
## AccountWeeks.to.1layhid2     8.806643e-01
## ContractRenewal.to.1layhid2 -3.172061e+01
## DataPlan.to.1layhid2        -2.961457e+00
## DataUsage.to.1layhid2       -3.277022e+01
## CustServCalls.to.1layhid2   -1.179714e+01
## DayMins.to.1layhid2          1.485483e+01
## DayCalls.to.1layhid2        -2.991698e-01
## MonthlyCharge.to.1layhid2    3.879850e+01
## OverageFee.to.1layhid2       6.154376e+00
## RoamMins.to.1layhid2         4.641863e+00
## Intercept.to.2layhid1       -2.340426e+00
## 1layhid1.to.2layhid1         3.733727e+01
## 1layhid2.to.2layhid1        -3.516225e+01
## Intercept.to.Churn           2.241288e+00
## 2layhid1.to.Churn           -5.803788e+00

We create a neural network with 2 hidden layers and our neural network looks like:

plot(nn, rep = "best")

In-sample performance

We get the confusion matrix for the training data:

temp_train <- subset(telecom_train, select = c("AccountWeeks","ContractRenewal","DataPlan","DataUsage","CustServCalls","DayMins","DayCalls","MonthlyCharge","OverageFee","RoamMins"))
nn.results1 <- compute(nn, temp_train)
results1 <- data.frame(actual = telecom_train$Churn, prediction = nn.results1$net.result)
prob.result.train <- nn.results1$net.result
roundedresults1 <- sapply(results1,round,digits = 0)
roundedresultsdf1 = data.frame(roundedresults1)
attach(roundedresultsdf1)
confusion_matrix_train <- table(actual,prediction)
confusion_matrix_train

##       prediction
## actual    0    1
##      0 2220   52
##      1  190  204

Missclassification rate(MR) for the training data:

MR_train <- (confusion_matrix_train[1,2] + confusion_matrix_train[2,1])/nrow(telecom_train)
MR_train

## [1] 0.09077269

ROC curve for training data and AUC:

nn.pred = ROCR::prediction(prob.result.train, telecom_train$Churn)
pref <- performance(nn.pred, "tpr", "fpr")
plot(pref, colorize=TRUE)

auc_train<-unlist(slot(ROCR::performance(nn.pred, "auc"),"y.values"))
auc_train

## [1] 0.8895576

Out-of-sample performance

We get the confusion matrix for the testing data:

temp_test <- subset(telecom_test, select = c("AccountWeeks","ContractRenewal","DataPlan","DataUsage","CustServCalls","DayMins","DayCalls","MonthlyCharge","OverageFee","RoamMins"))
nn.results <- compute(nn, temp_test)
results <- data.frame(actual = telecom_test$Churn, prediction = nn.results$net.result)
prob.result.test <- nn.results$net.result
roundedresults <- sapply(results,round,digits = 0)
roundedresultsdf = data.frame(roundedresults)
attach(roundedresultsdf)
confusion_matrix_test <- table(actual,prediction)
confusion_matrix_test

##       prediction
## actual   0   1
##      0 569   9
##      1  39  50

Missclassification rate(MR) for the testing data:

MR_test <- (confusion_matrix_test[1,2] + confusion_matrix_test[2,1])/nrow(telecom_test)
MR_test

## [1] 0.07196402

ROC curve for testing data and AUC:

nn.pred = ROCR::prediction(prob.result.test, telecom_test$Churn)
pref <- performance(nn.pred, "tpr", "fpr")
plot(pref, colorize=TRUE)

auc_test<-unlist(slot(ROCR::performance(nn.pred, "auc"),"y.values"))
auc_test

## [1] 0.8989347

Conclusion

We will compare the AUC and missclassification rate (MR) of the training and testing data of both the approaches - Logistic regression and Neural Network. The lower the MR, the better the accuracy of prediction of the model. The higher the AUC, the better the model.

Logistic Regression

##  MR for training   MR for testing AUC for training  AUC for testing 
##        0.1414104        0.1259370        0.8085984        0.8521600

Neural Network

##  MR for training   MR for testing AUC for training  AUC for testing 
##       0.09077269       0.07196402       0.88955760       0.89893472

We can see that the Neural Network performs much better than the Logistic model here and therefore, we will choose the neural network as our final model for predicting the new data in the future.