The biggest problem a telecom company faces is of Churned Customers. Churning is a term used in this industry to describe whether the consumer or the user is going to continue the service with the company any further or not.

Churning has a huge impact on Revenue of a company. If the company predicts the churn rate of the customers with high accuracy, it gives the company a gestimate of how its revenues would look like and in turn give it freedom to plan finances ahead.

In this project we are more concerned about the customers who are going to get churned. We are going to predict whether the customer be Churned or not on the basis of its billing information and customer demographics.

The most important scenario for a company from a business perspective is to retain customers. Thus, we are more interested in the instances where we predicted that the customer would not be churned but infact the customer was churned. This shows potential losses for the company and is denoted by SPECIFICITY. Thus, our primary goal is to increases the SPECIFICITY of the model and select the model that trades off between ACCURACY and SPECIFICITY.

Loading Important Libraries

library(dplyr)
library(rpart)
library(randomForest)
library(ROCR)
library(rpart.plot)
library(dummies)
library(caret)
library(ggplot2)
library(pROC)
library(DT)

Loading the Data set

churn <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
churn$SeniorCitizen <- as.factor(churn$SeniorCitizen)
str(churn)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
churn_copy <- churn[,-1]
sum(is.na(churn_copy))
## [1] 11

We can see that there aree 11 entries in our data set which are empty or have NA values. The number of NA’s as compared to the whole dataset is relatively less and hence we would drop these 11 values.

Cleaning the Data set and creating TRAIN-TEST splits

set.seed(131313)
churn_final <- churn_copy[complete.cases(churn_copy),]
intrain <- createDataPartition(y = churn_final$Churn, p = 0.7, list = FALSE)
training <- churn_final[intrain, ]
testing <- churn_final[-intrain, ]

Stepwise Logistic Regression Model

The goal of this project is the Classify whether the customer would be Churned or Not. It builds up a classic Classification probelm and hence we would run LOGISTIC regression on our data set.

After building the logistic model we would run a STEPWISE logistic regression in forward as well as backward direction to drop the unnecessary variables and obtain a model with as low as possible AIC.

model_glm <- glm(Churn ~ ., data = training, family = binomial)
model_step <- step(model_glm, direction = "both", steps = 10000, trace = FALSE)
predictions_step <- predict(model_step, testing, type = "response")

We predict the class probabilities of all the testing instances by using the predict function on our Stepwise logistic regression model. The next step is to select the right cut-off point so that we can distinguish the instances into 0’s and 1’s depending on their probabilities.

To decide the cut off point

pred_step <- prediction(predictions_step, testing$Churn)
plot(performance(pred_step, "tpr", "fpr"), colorize = TRUE)

auc_step <- performance(pred_step, "auc")
auc_step
## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.8434016
## 
## 
## Slot "alpha.values":
## list()
roc_step <- roc(response = testing$Churn, predictor = predictions_step)
c <- coords(roc_step, "best", "threshold")
c
##   threshold specificity sensitivity 
##   0.2865218   0.7525840   0.7821429

We find that the point 0.2865218 is the best cut-off point to distinguish the class probabilities into 0’s and 1’s. We now form the Predicted scores using ifelse loops.

Forming Confusion-matrix from the decided cut-off point

Churn_step <- ifelse(predictions_step >= c[[1]], 
                     "Yes", "No")
cm_log <- confusionMatrix(testing$Churn, Churn_step)
cm_log
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1165  383
##        Yes  122  438
##                                           
##                Accuracy : 0.7604          
##                  95% CI : (0.7416, 0.7785)
##     No Information Rate : 0.6105          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4655          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9052          
##             Specificity : 0.5335          
##          Pos Pred Value : 0.7526          
##          Neg Pred Value : 0.7821          
##              Prevalence : 0.6105          
##          Detection Rate : 0.5527          
##    Detection Prevalence : 0.7343          
##       Balanced Accuracy : 0.7194          
##                                           
##        'Positive' Class : No              
## 

From the confusion matrix for stepwise logisitic model we find that the Accuracy of the model is 0.7604364 & Specificity is 0.5334957.

Random-Forest

tc_rf <- trainControl(method = "repeatedcv",repeats = 2,number = 3, search = "random")
rf_train <- train(Churn ~ ., data = training, method = "rf", trainControl = tc_rf) ##Time consuming
plot(varImp(rf_train, scale = F))

predict_rftrain <- predict(rf_train, testing)
cm_rf <- confusionMatrix(predict_rftrain, testing$Churn)
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1471  352
##        Yes   77  208
##                                           
##                Accuracy : 0.7965          
##                  95% CI : (0.7787, 0.8135)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 1.872e-11       
##                                           
##                   Kappa : 0.3815          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9503          
##             Specificity : 0.3714          
##          Pos Pred Value : 0.8069          
##          Neg Pred Value : 0.7298          
##              Prevalence : 0.7343          
##          Detection Rate : 0.6978          
##    Detection Prevalence : 0.8648          
##       Balanced Accuracy : 0.6608          
##                                           
##        'Positive' Class : No              
## 

From the confusion matrix for Random Forest model we find that the Accuracy of the model is 0.7964896 & Specificity is 0.3714286.

Decision Trees - RPART

rp <- rpart(Churn ~ ., data = training, method = "class")
predictions_rp <- predict(rp, testing, type = "class")
rpart.plot(rp, type=1, extra=100, branch.lty=3, box.palette="RdYlGn", tweak = 1.2, fallen.leaves = FALSE)

cm_rpart <- confusionMatrix(predictions_rp, testing$Churn)
cm_rpart
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1404  304
##        Yes  144  256
##                                           
##                Accuracy : 0.7875          
##                  95% CI : (0.7694, 0.8048)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 9.284e-09       
##                                           
##                   Kappa : 0.4006          
##  Mcnemar's Test P-Value : 5.821e-14       
##                                           
##             Sensitivity : 0.9070          
##             Specificity : 0.4571          
##          Pos Pred Value : 0.8220          
##          Neg Pred Value : 0.6400          
##              Prevalence : 0.7343          
##          Detection Rate : 0.6660          
##    Detection Prevalence : 0.8102          
##       Balanced Accuracy : 0.6821          
##                                           
##        'Positive' Class : No              
## 

From the confusion matrix for RPART model we find that the Accuracy of the model is 0.7874763 & Specificity is 0.4571429.

Principal Component Analysis

As majority of our variables are Factor variables, we create dummy variables for our data to perform Principal Component Analysis.

dummy.data <- dummy.data.frame(churn_final)
intrain.pca <- createDataPartition(y = dummy.data$ChurnYes, p = 0.7, list = FALSE)
pca.train <- dummy.data[intrain.pca, ]
pca.test <- dummy.data[-intrain.pca, ]
pca.train1 <- pca.train[,-c(47,48)] #Removing the dependent variables
pca.test1 <- pca.test[,-c(47,48)] #Removing the dependent variables
pr_train <- prcomp(pca.train1, center = T, scale. = T)
summary(pr_train)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6
## Standard deviation     3.3801 2.4573 2.06827 1.61305 1.5181 1.41535
## Proportion of Variance 0.2484 0.1313 0.09299 0.05656 0.0501 0.04355
## Cumulative Proportion  0.2484 0.3796 0.47264 0.52920 0.5793 0.62285
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.37808 1.25900 1.22355 1.15630 1.14324 1.12002
## Proportion of Variance 0.04128 0.03446 0.03255 0.02907 0.02841 0.02727
## Cumulative Proportion  0.66413 0.69859 0.73114 0.76020 0.78862 0.81589
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     1.09540 1.07106 1.05333 1.00670 0.97358 0.96478
## Proportion of Variance 0.02608 0.02494 0.02412 0.02203 0.02061 0.02023
## Cumulative Proportion  0.84197 0.86691 0.89103 0.91306 0.93367 0.95390
##                           PC19   PC20    PC21    PC22    PC23      PC24
## Standard deviation     0.93056 0.9022 0.61726 0.24203 0.03076 4.365e-14
## Proportion of Variance 0.01882 0.0177 0.00828 0.00127 0.00002 0.000e+00
## Cumulative Proportion  0.97273 0.9904 0.99871 0.99998 1.00000 1.000e+00
##                           PC25      PC26      PC27      PC28      PC29
## Standard deviation     2.9e-14 2.477e-14 5.999e-15 5.969e-15 4.992e-15
## Proportion of Variance 0.0e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.0e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC30      PC31      PC32      PC33      PC34
## Standard deviation     4.025e-15 3.441e-15 3.264e-15 2.538e-15 2.309e-15
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC35      PC36      PC37  PC38      PC39
## Standard deviation     1.986e-15 1.428e-15 1.307e-15 1e-15 6.966e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1e+00 1.000e+00
##                            PC40     PC41     PC42     PC43     PC44
## Standard deviation     2.41e-16 2.41e-16 2.41e-16 2.41e-16 2.41e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00
##                            PC45      PC46
## Standard deviation     2.41e-16 7.346e-17
## Proportion of Variance 0.00e+00 0.000e+00
## Cumulative Proportion  1.00e+00 1.000e+00
plot(pr_train, type = "l")

biplot(pr_train, scale = 0)

pca_var <- pr_train$sdev ^ 2
pca_var_prop <- pca_var / sum(pca_var)
plot(cumsum(pca_var_prop), type = 'b', ylab = "Cummulative Variance", xlab = "Principal Components",main = "Principal Components VS Variance Explained")

pca.final.train <- data.frame(ChurnYes = pca.train$ChurnYes, pr_train$x)
pca.final.train1 <- pca.final.train[,1:16] #Selecting the first 15 Principal components
pca.final.test <- predict(pr_train, pca.test1)
pca.final.test <- as.data.frame(pca.final.test)
pca.final.test1 <- pca.final.test[,1:15]
rpart_pca <- rpart(ChurnYes ~ ., data = pca.final.train1)
predict_rpart.pca <- predict(rpart_pca, newdata = pca.final.test1)
pred_rpart_pca <- prediction(predict_rpart.pca, pca.test$ChurnNo)
perf_rpart_pca <- performance(pred_rpart_pca, "tpr", "fpr")
plot(perf_rpart_pca, colorize = TRUE)

myroc <- roc(pca.test$ChurnNo, predict_rpart.pca)
c_pca <- coords(myroc, "best", ret = "threshold")
churned <- ifelse(predict_rpart.pca > c_pca[[1]], 1, 0)
cm_pca <- confusionMatrix(churned, pca.test$ChurnYes)
cm_pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 966 103
##          1 565 475
##                                           
##                Accuracy : 0.6833          
##                  95% CI : (0.6629, 0.7031)
##     No Information Rate : 0.7259          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3626          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6310          
##             Specificity : 0.8218          
##          Pos Pred Value : 0.9036          
##          Neg Pred Value : 0.4567          
##              Prevalence : 0.7259          
##          Detection Rate : 0.4580          
##    Detection Prevalence : 0.5069          
##       Balanced Accuracy : 0.7264          
##                                           
##        'Positive' Class : 0               
## 

From the confusion matrix for Decision trees on Principal components model we find that the Accuracy of the model is 0.6832622 & Specificity is 0.8217993.

Observation matrix

Model Specificty Accuracy
Logistic 0.53 0.76
Random Forest 0.37 0.79
RPART 0.46 0.78
PCA + RPART 0.82 0.68

Conclusion

  1. According to our problem statement, PCA + Decisison trees yeilds SPECIFICITY of 82%, better than any other model and hence we select it as a final model.

  2. If accuracy is taken into account, Random forest out performs every other problem.