A telecommunications company requires a predictive model to identify which customers will leave their plan. The results will inform the Marketing and Customer Retention teams which customers are likely to churn so that resources can be directed towards them.

churn_raw <- read.csv("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv", header = T)

str(churn_raw)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Checking for missing values

sum(is.na(churn_raw))
## [1] 11

Where are the 11 missing values?

sapply(churn_raw, function(x) sum(is.na(x)))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

All of the NAs are in the TotalCharges column. I may be able to compute the missing totals, since we have data for monthly charges and tenure in months.

churn_raw[is.na(churn_raw$TotalCharges),1:6]
##      customerID gender SeniorCitizen Partner Dependents tenure
## 489  4472-LVYGI Female             0     Yes        Yes      0
## 754  3115-CZMZD   Male             0      No        Yes      0
## 937  5709-LVOEQ Female             0     Yes        Yes      0
## 1083 4367-NUYAO   Male             0     Yes        Yes      0
## 1341 1371-DWPAZ Female             0     Yes        Yes      0
## 3332 7644-OMVMY   Male             0     Yes        Yes      0
## 3827 3213-VVOLG   Male             0     Yes        Yes      0
## 4381 2520-SGTTA Female             0     Yes        Yes      0
## 5219 2923-ARZLG   Male             0     Yes        Yes      0
## 6671 4075-WKNIU Female             0     Yes        Yes      0
## 6755 2775-SEFEE   Male             0      No        Yes      0

These customers all show a tenure of zero months, so they haven’t made their first payment yet. Are there any other zero-tenure customers in the data set?

library(dplyr)
churn_raw %>%
    filter(tenure == 0) %>%
    summarize("Zero Tenure" = n())
##   Zero Tenure
## 1          11

These eleven are the only customers with zero tenure, so they can safely be removed.

churnnoNAs <- churn_raw[complete.cases(churn_raw),]
dim(churnnoNAs)
## [1] 7032   21

Cleaning the Data

Customer ID isn’t useful for our analysis; neither is Total Charges, since it is highly correlated with Monthly Charges (and, as noted above, with Monthly Charges multiplied by tenure).
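
As a quick sanity check (output omitted here), the correlations between the charge columns can be computed directly:

# Sanity check: TotalCharges tracks MonthlyCharges, and MonthlyCharges * tenure even more closely
cor(churnnoNAs$TotalCharges, churnnoNAs$MonthlyCharges)
cor(churnnoNAs$TotalCharges, churnnoNAs$MonthlyCharges * churnnoNAs$tenure)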

churn_neat <- churnnoNAs %>%
                select(-customerID, -TotalCharges) %>%
                rename(Gender = gender, Tenure = tenure)

table(churn_neat$SeniorCitizen)
## 
##    0    1 
## 5890 1142
churn_neat$SeniorCitizen <- as.factor(ifelse(churn_neat$SeniorCitizen == 1, "Yes", "No"))

table(churn_neat$SeniorCitizen)
## 
##   No  Yes 
## 5890 1142
str(churn_neat)
## 'data.frame':    7032 obs. of  19 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ Tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

The variables OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies all require an internet connection, and the variable MultipleLines requires a phone service, so I will replace “No internet service” and “No phone service” with “No”.

factorrenames <- names(churn_neat)[9:14]

data <- churn_neat %>%
  mutate_at(.vars = factorrenames,
            .funs = ~recode_factor(., `No internet service` = "No")) %>%
  mutate_at(.vars = "MultipleLines",
            .funs = ~recode_factor(., `No phone service` = "No"))

str(data)
## 'data.frame':    7032 obs. of  19 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ Tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Data Exploration

churnrate <- table(data$Churn) / nrow(data)

churnrate
## 
##       No      Yes 
## 0.734215 0.265785

Over the entire data set, 26.6% of customers churned.

I will create a trainControl object so that all of the models use the same 10-fold cross-validation, with 70% of the data as a training set. I will then use the remaining 30% of the data to test model accuracy.

I will be using both the area under the ROC curve (AUC) and the accuracy percentage as metrics for assessing model performance.

set.seed(1)

rowindices <- sample(nrow(data))
data_shuffled <- data[rowindices,]

First the rows are shuffled to eliminate any bias from the ordering of the observations in the data set.

split <- round(nrow(data_shuffled) * 0.7)
split
## [1] 4922

The data set will be split on the 4922nd row. Observations 1 to 4922 will make up the training set and the remaining 2110 observations will be the test set.

train <- data_shuffled[1:split,]
test <- data_shuffled[(split+1):nrow(data_shuffled),]

dim(train)
## [1] 4922   19
dim(test)
## [1] 2110   19
library(caret)
control <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = FALSE
)
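
Note that a purely random split does not guarantee identical churn rates in the two sets. caret's createDataPartition performs a stratified split that preserves the class proportions; a minimal alternative sketch (not used below; the object names are illustrative):

# Stratified 70/30 split that preserves the Churn proportions in both sets
set.seed(1)
in_train <- createDataPartition(data$Churn, p = 0.7, list = FALSE)
train_strat <- data[in_train, ]
test_strat  <- data[-in_train, ]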

Logistic Regression Model

glm_model <- train(Churn ~ ., data = train,
                   metric = "ROC",
                   method = "glm",
                   trControl = control)

glm_model
## Generalized Linear Model 
## 
## 4922 samples
##   18 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4430, 4429, 4430, 4430, 4429, 4429, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8433015  0.8985381  0.5359928

The Logistic Regression model gives an AUC of 0.84 on the training data. Note that caret treats the first factor level, ‘No’, as the positive class here, so the sensitivity of 0.90 means that 90% of customers who stayed were correctly identified, while the specificity of 0.54 means that only 54% of churners were.

Predictive Capability

glm_pred <- predict(glm_model, newdata = test)


glmcm <- confusionMatrix(glm_pred, test[["Churn"]])
glmaccuracy <- glmcm$overall[c(1,3,4)]
glmcm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1382  253
##        Yes  154  321
##                                           
##                Accuracy : 0.8071          
##                  95% CI : (0.7896, 0.8237)
##     No Information Rate : 0.728           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4852          
##  Mcnemar's Test P-Value : 1.188e-06       
##                                           
##             Sensitivity : 0.8997          
##             Specificity : 0.5592          
##          Pos Pred Value : 0.8453          
##          Neg Pred Value : 0.6758          
##              Prevalence : 0.7280          
##          Detection Rate : 0.6550          
##    Detection Prevalence : 0.7749          
##       Balanced Accuracy : 0.7295          
##                                           
##        'Positive' Class : No              
## 

When the model is applied to the test data it yields an accuracy of 80.7%: 90.0% of customers who stayed with the service were correctly classified (sensitivity) and 55.9% of customers who churned were correctly classified (specificity).
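
Accuracy alone does not use the predicted probabilities, so it is worth computing the test-set AUC as well; a sketch using the pROC package (an assumption, as it is not loaded elsewhere in this analysis):

# Test-set AUC from the predicted class probabilities
library(pROC)
glm_probs <- predict(glm_model, newdata = test, type = "prob")[, "Yes"]
auc(roc(test$Churn, glm_probs, levels = c("No", "Yes")))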

Generalised Linear Model - Ridge and Lasso Regression

The training data is used to find optimal values for alpha, the elastic net mixing parameter (alpha = 0 is pure ridge, alpha = 1 is pure lasso), and lambda, the penalty strength.

glmnet_model <- train(Churn ~ ., data = train,
  metric = "ROC",
  method = "glmnet",
  trControl = control,
  preProcess = c("center","scale")
)

plot(glmnet_model)

glmnet_model$bestTune$alpha
## [1] 0.55

Alpha = 0.55 maximises the AUC.
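
glmnet tunes the penalty strength lambda alongside alpha, and the coefficients that survive the penalty can be read off the final fit (output omitted):

# The selected pair of tuning parameters: alpha (mixing) and lambda (penalty strength)
glmnet_model$bestTune

# Coefficients at the chosen lambda; any zeros were shrunk away by the lasso component
coef(glmnet_model$finalModel, s = glmnet_model$bestTune$lambda)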

Predictive Capability

glmnet_pred <- predict(glmnet_model, newdata = test)

glmnetcm <- confusionMatrix(glmnet_pred, test[["Churn"]])
glmnetaccuracy <- glmnetcm$overall[c(1,3,4)]
glmnetcm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1383  256
##        Yes  153  318
##                                           
##                Accuracy : 0.8062          
##                  95% CI : (0.7886, 0.8228)
##     No Information Rate : 0.728           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4815          
##  Mcnemar's Test P-Value : 4.569e-07       
##                                           
##             Sensitivity : 0.9004          
##             Specificity : 0.5540          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.6752          
##              Prevalence : 0.7280          
##          Detection Rate : 0.6555          
##    Detection Prevalence : 0.7768          
##       Balanced Accuracy : 0.7272          
##                                           
##        'Positive' Class : No              
## 

The elastic net model correctly classifies 80.6% of all test-set customers, including 90.0% of those who stayed and 55.4% of those who churned, essentially matching the plain logistic model.

Random Forest

Cross-validation on the training set selects mtry, the number of predictors randomly sampled as candidates at each split.

rf_model <- train(Churn ~ ., data=train,
  metric = "ROC",
  method = "ranger",
  trControl = control
)

plot(rf_model)

mtry = 2 yields the largest AUC value on the training data.
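
caret's ranger method tunes mtry together with the split rule and minimum node size; if finer control were wanted, an explicit grid could be supplied (a sketch with illustrative values, not run here):

# Explicit tuning grid for ranger; all three columns are required by caret
rf_grid <- expand.grid(mtry = c(2, 4, 8, 12),
                       splitrule = c("gini", "extratrees"),
                       min.node.size = 1)
rf_model_tuned <- train(Churn ~ ., data = train,
                        metric = "ROC",
                        method = "ranger",
                        trControl = control,
                        tuneGrid = rf_grid)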

Predictive Capability

rf_pred <- predict(rf_model, newdata = test)
rfcm <- confusionMatrix(rf_pred, test[["Churn"]])
rfaccuracy <- rfcm$overall[c(1,3,4)]
rfcm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1424  317
##        Yes  112  257
##                                           
##                Accuracy : 0.7967          
##                  95% CI : (0.7789, 0.8137)
##     No Information Rate : 0.728           
##     P-Value [Acc > NIR] : 1.659e-13       
##                                           
##                   Kappa : 0.422           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9271          
##             Specificity : 0.4477          
##          Pos Pred Value : 0.8179          
##          Neg Pred Value : 0.6965          
##              Prevalence : 0.7280          
##          Detection Rate : 0.6749          
##    Detection Prevalence : 0.8251          
##       Balanced Accuracy : 0.6874          
##                                           
##        'Positive' Class : No              
## 

The Random Forest model yields an accuracy of 79.7% and correctly identifies 92.7% of customers who stayed, but its specificity of 44.8% means it misses more than half of the customers who actually churned.
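
If catching churners matters more than raw accuracy, one remedy worth exploring is rebalancing the training folds via the sampling argument of trainControl; a sketch, not used in this analysis (control_down is a hypothetical name):

# Down-sample the majority "No" class within each cross-validation fold
control_down <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  sampling = "down"
)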

K-Nearest Neighbours

Cross-validation on the training set is used to find the optimal value for K.

knn_model <- train(Churn ~ ., data = train, 
                   method = "knn", trControl = control,
                   preProcess = c("center","scale"), tuneLength = 50)
knn_model
## k-Nearest Neighbors 
## 
## 4922 samples
##   18 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4430, 4429, 4430, 4430, 4429, 4430, ... 
## Resampling results across tuning parameters:
## 
##   k    ROC        Sens       Spec     
##     5  0.7679915  0.8409266  0.5026834
##     7  0.7863464  0.8453427  0.5335540
##     9  0.7959654  0.8503014  0.5405725
##    11  0.8032439  0.8555416  0.5321169
##    13  0.8084182  0.8610573  0.5336374
##    15  0.8113219  0.8662968  0.5444305
##    17  0.8126446  0.8665731  0.5560346
##    19  0.8144746  0.8627102  0.5598927
##    21  0.8195417  0.8651911  0.5629756
##    23  0.8209569  0.8651880  0.5660763
##    25  0.8216082  0.8651835  0.5652713
##    27  0.8225264  0.8660114  0.5614252
##    29  0.8229787  0.8662892  0.5560167
##    31  0.8239682  0.8676689  0.5598629
##    33  0.8238503  0.8684931  0.5637150
##    35  0.8246429  0.8676689  0.5668277
##    37  0.8243899  0.8676689  0.5660465
##    39  0.8244966  0.8715279  0.5606202
##    41  0.8245858  0.8665662  0.5652713
##    43  0.8250267  0.8657413  0.5683602
##    45  0.8253998  0.8673965  0.5691294
##    47  0.8255240  0.8649141  0.5668098
##    49  0.8260608  0.8665700  0.5637448
##    51  0.8261070  0.8646394  0.5614073
##    53  0.8263300  0.8662915  0.5621765
##    55  0.8268182  0.8668432  0.5567680
##    57  0.8266926  0.8687708  0.5606321
##    59  0.8268643  0.8671164  0.5559988
##    61  0.8281500  0.8657390  0.5529219
##    63  0.8281291  0.8668425  0.5591055
##    65  0.8283792  0.8665693  0.5575432
##    67  0.8281783  0.8665685  0.5583065
##    69  0.8278101  0.8671195  0.5552236
##    71  0.8279462  0.8679467  0.5528980
##    73  0.8284916  0.8676697  0.5529100
##    75  0.8283043  0.8665677  0.5459451
##    77  0.8284033  0.8668455  0.5559750
##    79  0.8279869  0.8662930  0.5567501
##    81  0.8280169  0.8649133  0.5598271
##    83  0.8280381  0.8676689  0.5559690
##    85  0.8278481  0.8668409  0.5605963
##    87  0.8278187  0.8695965  0.5559750
##    89  0.8277724  0.8690463  0.5544305
##    91  0.8277313  0.8693210  0.5536613
##    93  0.8275961  0.8684946  0.5544305
##    95  0.8279002  0.8693218  0.5559809
##    97  0.8277472  0.8690448  0.5551998
##    99  0.8275440  0.8690456  0.5521049
##   101  0.8278845  0.8704230  0.5559511
##   103  0.8280408  0.8704222  0.5544007
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 73.
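
As with the other models, the tuning profile of ROC against K can be plotted:

plot(knn_model, main = "Cross validation to determine K")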

Predictive Capability

knn_pred <- predict(knn_model, newdata = test)
knncm <- confusionMatrix(knn_pred, test[["Churn"]])
knnaccuracy <- knncm$overall[c(1,3,4)]
knncm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1331  252
##        Yes  205  322
##                                           
##                Accuracy : 0.7834          
##                  95% CI : (0.7652, 0.8008)
##     No Information Rate : 0.728           
##     P-Value [Acc > NIR] : 2.751e-09       
##                                           
##                   Kappa : 0.4388          
##  Mcnemar's Test P-Value : 0.03141         
##                                           
##             Sensitivity : 0.8665          
##             Specificity : 0.5610          
##          Pos Pred Value : 0.8408          
##          Neg Pred Value : 0.6110          
##              Prevalence : 0.7280          
##          Detection Rate : 0.6308          
##    Detection Prevalence : 0.7502          
##       Balanced Accuracy : 0.7138          
##                                           
##        'Positive' Class : No              
## 

The KNN model has a predictive accuracy of 78.3%; it correctly classifies 86.7% of customers who stayed and 56.1% of those who churned.

Support Vector Classifier

Cross-validation on the training set is run with six different values of the cost parameter C.

grid <- expand.grid(C = c(0.01, 0.05, 0.1, 0.25, 0.5, 1))

svm_linear_model <- train(Churn ~ ., data = train,
                          metric = "ROC",
                          method = "svmLinear",
                          trControl = control,
                          preProcess = c("center", "scale"),
                          tuneGrid = grid)  # the explicit grid makes tuneLength redundant

svm_linear_model
## Support Vector Machines with Linear Kernel 
## 
## 4922 samples
##   18 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4430, 4429, 4430, 4430, 4431, 4430, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.01  0.8396572  0.9046139  0.5119320
##   0.05  0.8394989  0.9054366  0.5119261
##   0.10  0.8391577  0.9057121  0.5165534
##   0.25  0.8391774  0.9054366  0.5134645
##   0.50  0.8391731  0.9059876  0.5157782
##   1.00  0.8393649  0.9057121  0.5142338
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.
plot(svm_linear_model, main = "Cross validation to determine cost parameter")

svm_linear_pred <- predict(svm_linear_model, newdata = test)
svmcm <- confusionMatrix(svm_linear_pred, test[["Churn"]])
svm_linearaccuracy <- svmcm$overall[c(1,3,4)]
svmcm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1382  274
##        Yes  154  300
##                                           
##                Accuracy : 0.7972          
##                  95% CI : (0.7794, 0.8141)
##     No Information Rate : 0.728           
##     P-Value [Acc > NIR] : 1.126e-13       
##                                           
##                   Kappa : 0.452           
##  Mcnemar's Test P-Value : 8.815e-09       
##                                           
##             Sensitivity : 0.8997          
##             Specificity : 0.5226          
##          Pos Pred Value : 0.8345          
##          Neg Pred Value : 0.6608          
##              Prevalence : 0.7280          
##          Detection Rate : 0.6550          
##    Detection Prevalence : 0.7848          
##       Balanced Accuracy : 0.7112          
##                                           
##        'Positive' Class : No              
## 

The Support Vector Classifier yields an accuracy of 79.7%, correctly classifying 90.0% of customers who stayed and 52.3% of those who churned.

Model Comparison

model_list <- list("Logistic" = glm_model, "GLMnet" = glmnet_model, "Random Forest" = rf_model, "kNN" = knn_model, "SVM" = svm_linear_model)

resamples <- resamples(model_list)

dotplot(resamples, metric="ROC", main = "Area Under Curve with 95% CI")
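
The summary method on the resamples object gives the fold-by-fold distribution of each metric, which is useful alongside the dotplot (output omitted):

# Distribution of ROC, sensitivity and specificity across the ten folds
summary(resamples, metric = c("ROC", "Sens", "Spec"))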

The GLMnet and Logistic models showed the greatest overall predictive accuracy. The Random Forest model showed the highest sensitivity, correctly identifying 92.7% of customers who stayed, but also the lowest specificity, catching only 44.8% of churners. Additional domain knowledge would be required from the telecoms company to weigh these trade-offs and decide which model is the most useful.

models <- c("Logistic", "GLMnet", "Random Forest", "SVM", "KNN")

accuracysummary <- bind_rows(Logistic = glmaccuracy, GLMnet = glmnetaccuracy, RandomForest = rfaccuracy, kNN = knnaccuracy, SVM = svm_linearaccuracy)
library(tibble)
accuracysummary2 <- add_column(accuracysummary, "Model" = models, .before = "Accuracy")

accuracysummary2
## # A tibble: 5 x 4
##   Model         Accuracy AccuracyLower AccuracyUpper
##   <chr>            <dbl>         <dbl>         <dbl>
## 1 Logistic         0.807         0.790         0.824
## 2 GLMnet           0.806         0.789         0.823
## 3 Random Forest    0.797         0.779         0.814
## 4 KNN              0.783         0.765         0.801
## 5 SVM              0.797         0.779         0.814
library(ggthemes)
ggplot(accuracysummary2, aes(x = Model, y = Accuracy)) + geom_bar(stat = "identity") + 
        geom_errorbar(width = 0.2, aes(ymin = AccuracyLower, ymax = AccuracyUpper), color = "black") +
        coord_cartesian(ylim = c(0.7, 0.85)) +
        labs(y = "Accuracy %", x = "Model", title = "Model Prediction Accuracy with 95% CI") +
        theme_minimal()

Conclusion

In this case all models show significantly greater predictive accuracy than the null model that predicts ‘No’ for every customer, which has an accuracy of 72.8%. The Logistic and GLMnet methods are almost identical in terms of results.
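
That null-model figure is the ‘no information rate’ reported in the confusion matrices, i.e. the majority-class share of the test set, which reproduces the 0.728 shown above:

# Accuracy of always predicting "No": the proportion of the majority class
max(prop.table(table(test$Churn)))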

If maximum accuracy is the goal then I would recommend the logistic model: its results are almost identical to those of the GLMnet model that blends lasso and ridge regression, and it is much more interpretable.

Identifying attributes of customers likely to churn

summary(glm_model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9616  -0.6763  -0.2895   0.6848   3.1879  
## 
## Coefficients:
##                                         Estimate Std. Error z value
## (Intercept)                             0.775181   0.984493   0.787
## GenderMale                             -0.049477   0.077851  -0.636
## SeniorCitizenYes                        0.177516   0.102686   1.729
## PartnerYes                              0.051354   0.092172   0.557
## DependentsYes                          -0.204377   0.105942  -1.929
## Tenure                                 -0.033802   0.002896 -11.674
## PhoneServiceYes                         0.192776   0.790082   0.244
## MultipleLinesYes                        0.530945   0.217205   2.444
## `InternetServiceFiber optic`            1.842418   0.971471   1.897
## InternetServiceNo                      -1.721730   0.982493  -1.752
## OnlineSecurityYes                      -0.178644   0.218480  -0.818
## OnlineBackupYes                        -0.026976   0.213708  -0.126
## DeviceProtectionYes                     0.185122   0.215591   0.859
## TechSupportYes                         -0.176914   0.219763  -0.805
## StreamingTVYes                          0.623458   0.398743   1.564
## StreamingMoviesYes                      0.619407   0.397912   1.557
## `ContractOne year`                     -0.705278   0.128256  -5.499
## `ContractTwo year`                     -1.425403   0.217093  -6.566
## PaperlessBillingYes                     0.373285   0.088498   4.218
## `PaymentMethodCredit card (automatic)` -0.044267   0.138485  -0.320
## `PaymentMethodElectronic check`         0.364357   0.113537   3.209
## `PaymentMethodMailed check`             0.084719   0.135625   0.625
## MonthlyCharges                         -0.037768   0.038660  -0.977
##                                        Pr(>|z|)    
## (Intercept)                             0.43105    
## GenderMale                              0.52508    
## SeniorCitizenYes                        0.08386 .  
## PartnerYes                              0.57742    
## DependentsYes                           0.05371 .  
## Tenure                                  < 2e-16 ***
## PhoneServiceYes                         0.80723    
## MultipleLinesYes                        0.01451 *  
## `InternetServiceFiber optic`            0.05789 .  
## InternetServiceNo                       0.07970 .  
## OnlineSecurityYes                       0.41355    
## OnlineBackupYes                         0.89955    
## DeviceProtectionYes                     0.39052    
## TechSupportYes                          0.42081    
## StreamingTVYes                          0.11792    
## StreamingMoviesYes                      0.11956    
## `ContractOne year`                     3.82e-08 ***
## `ContractTwo year`                     5.17e-11 ***
## PaperlessBillingYes                    2.46e-05 ***
## `PaymentMethodCredit card (automatic)`  0.74923    
## `PaymentMethodElectronic check`         0.00133 ** 
## `PaymentMethodMailed check`             0.53220    
## MonthlyCharges                          0.32860    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5672.9  on 4921  degrees of freedom
## Residual deviance: 4065.5  on 4899  degrees of freedom
## AIC: 4111.5
## 
## Number of Fisher Scoring iterations: 6

Tenure, ContractOneYear, ContractTwoYear, PaymentMethodElectronic check, PaperlessBillingYes and MultipleLinesYes are the most significant predictors. The sign of each coefficient matters: longer tenure and one- or two-year contracts reduce the odds of churn (negative estimates), while paperless billing, payment by electronic check and multiple phone lines increase them (positive estimates).
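
Exponentiating the coefficients converts them from log-odds to odds ratios, which are easier to communicate to the retention teams (output omitted):

# Odds ratios: values above 1 increase the odds of churn, values below 1 decrease them
exp(coef(glm_model$finalModel))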

levels(data$Contract)
## [1] "Month-to-month" "One year"       "Two year"

Being signed up to a one year or two year contract is a strong indicator that a customer will not leave the service, so those on month-to-month contracts are the most likely to churn.

levels(data$PaymentMethod)
## [1] "Bank transfer (automatic)" "Credit card (automatic)"  
## [3] "Electronic check"          "Mailed check"

Customers who pay by electronic check are significantly more likely to churn than those paying by bank transfer, the baseline level.

Recommendation

Customers who have signed up recently on a month-to-month contract, use paperless billing, pay by electronic check and have multiple phone lines are the most likely to churn. Resources should be focussed on these customers to move them onto the products associated with retention. Marketing and retention teams should prioritise the following in descending order of importance:

  1. Two-year contracts
  2. One-year contracts
  3. Automatic payment methods (bank transfer or credit card) in place of electronic check

Paperless billing and additional telephone lines are better treated as churn risk flags than as retention products, since both are associated with higher churn in the model.