A telecommunications company requires a predictive model to choose which customers will leave their plan. The results will inform the Marketing and Customer Retention teams about which customer are likely to leave their plan so that resources can be directed to these customers.
churn_raw <- read.csv("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv", header = T)
str(churn_raw)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
sum(is.na(churn_raw))
## [1] 11
sapply(churn_raw, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
All of the NAs are in the TotalCharges column, but I might be able to compute the total charges since we have data for monthly charges and tenure in months
churn_raw[is.na(churn_raw$TotalCharges),1:6]
## customerID gender SeniorCitizen Partner Dependents tenure
## 489 4472-LVYGI Female 0 Yes Yes 0
## 754 3115-CZMZD Male 0 No Yes 0
## 937 5709-LVOEQ Female 0 Yes Yes 0
## 1083 4367-NUYAO Male 0 Yes Yes 0
## 1341 1371-DWPAZ Female 0 Yes Yes 0
## 3332 7644-OMVMY Male 0 Yes Yes 0
## 3827 3213-VVOLG Male 0 Yes Yes 0
## 4381 2520-SGTTA Female 0 Yes Yes 0
## 5219 2923-ARZLG Male 0 Yes Yes 0
## 6671 4075-WKNIU Female 0 Yes Yes 0
## 6755 2775-SEFEE Male 0 No Yes 0
These customers all show tenure of zero months so they haven’t made their first payment yet. Are there any other zero tenure customers in the data set?
library(dplyr)
churn_raw %>%
filter(tenure == 0) %>%
summarize("Zero Tenure" = n())
## Zero Tenure
## 1 11
These eleven are the only customers with zero tenure so they can safely be removed
churnnoNAs <- churn_raw[complete.cases(churn_raw),]
dim(churnnoNAs)
## [1] 7032 21
Customer ID isn’t useful to our analysis, neither is Total Charges since it is highly correlated with Monthly Charges
churn_neat <- churnnoNAs %>%
select(-customerID, -TotalCharges) %>%
rename(Gender = gender, Tenure = tenure)
table(churn_neat$SeniorCitizen)
##
## 0 1
## 5890 1142
churn_neat$SeniorCitizen <- as.factor(ifelse(churn_neat$SeniorCitizen == 1, "Yes", "No"))
table(churn_neat$SeniorCitizen)
##
## No Yes
## 5890 1142
str(churn_neat)
## 'data.frame': 7032 obs. of 19 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ Tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
The variables OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies all require an internet connection and the variable MultipleLines needs a phone service so I will replace “No internet service” and “No phone service” with “No”.
factorrenames <- names(churn_neat[9:14])
data <- churn_neat %>%
mutate_at(.vars=factorrenames,
.funs=~recode_factor(., `No internet service`="No")) %>%
mutate_at(.vars="MultipleLines",
.funs=~recode_factor(., `No phone service`="No"))
str(data)
## 'data.frame': 7032 obs. of 19 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ Tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
churnrate <- table(data$Churn) / nrow(data)
churnrate
##
## No Yes
## 0.734215 0.265785
Over the entire data set, 26.5% of customers churned.
I will create a a trainControl object so that all of the models use the same 10-fold cross validation on 70% of the data as a training set. I will then use the remaining 30% of the data to test the model accuracy.
I will be using the both area under the ROC curve (AUC) and Accuracy percentage as metrics for assessing model accuracy.
set.seed(1)
rowindices <- sample(nrow(data))
data_shuffled <- data[rowindices,]
First the rows are shuffled to eliminate any bias in the order of the observations in the data set
split <- round(nrow(data_shuffled) * 0.7)
split
## [1] 4922
The data set will be split on the 4922nd row. Observations 1 to 4922 will make up the training set and the remaining 2110 observations will be the test set.
train <- data_shuffled[1:split,]
test <- data_shuffled[(split+1):nrow(data_shuffled),]
dim(train)
## [1] 4922 19
dim(test)
## [1] 2110 19
library(caret)
control <- trainControl(
method = "cv",
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = FALSE
)
glm_model <- train(Churn ~ ., data = train,
method="glm",
trControl = control
)
glm_model
## Generalized Linear Model
##
## 4922 samples
## 18 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4430, 4429, 4430, 4430, 4429, 4429, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8433015 0.8985381 0.5359928
The Logistic Regression model on the training data gives an AUC value of 0.84. 90% of customers that the model identified as leaving their service did so.
glm_pred <- predict(glm_model, newdata = test)
glmcm <- confusionMatrix(glm_pred, test[["Churn"]])
glmaccuracy <- glmcm$overall[c(1,3,4)]
glmcm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1382 253
## Yes 154 321
##
## Accuracy : 0.8071
## 95% CI : (0.7896, 0.8237)
## No Information Rate : 0.728
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4852
## Mcnemar's Test P-Value : 1.188e-06
##
## Sensitivity : 0.8997
## Specificity : 0.5592
## Pos Pred Value : 0.8453
## Neg Pred Value : 0.6758
## Prevalence : 0.7280
## Detection Rate : 0.6550
## Detection Prevalence : 0.7749
## Balanced Accuracy : 0.7295
##
## 'Positive' Class : No
##
When the model is applied to the test data it yields accuracy of 80.7%, with 90% of customers that the model identified as leaving their service doing so and 53.46% of customers that the model identified as staying with the service staying with the company.
The training data is used to find the optimal value for alpha, the elasticnet mixing parameter.
glmnet_model <- train(Churn ~ ., data = train,
metric = "ROC",
method = "glmnet",
trControl = control,
preProcess = c("center","scale")
)
plot(glmnet_model)
glmnet_model$bestTune$alpha
## [1] 0.55
Alpha = 0.55 maximises the AUC.
glmnet_pred <- predict(glmnet_model, newdata = test)
glmnetcm <- confusionMatrix(glmnet_pred, test[["Churn"]])
glmnetaccuracy <- glmnetcm$overall[c(1,3,4)]
glmnetcm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1383 256
## Yes 153 318
##
## Accuracy : 0.8062
## 95% CI : (0.7886, 0.8228)
## No Information Rate : 0.728
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4815
## Mcnemar's Test P-Value : 4.569e-07
##
## Sensitivity : 0.9004
## Specificity : 0.5540
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.6752
## Prevalence : 0.7280
## Detection Rate : 0.6555
## Detection Prevalence : 0.7768
## Balanced Accuracy : 0.7272
##
## 'Positive' Class : No
##
The Lasso model correctly classifies 81% of all customers, including 90% of all customers that went on to leave their service.
Cross validation on the training set selects the number of variables to add to the model at each branch
rf_model <- train(Churn ~ ., data=train,
metric = "ROC",
method = "ranger",
trControl = control
)
plot(rf_model)
mtry = 2 yields the largest AUC value on the training data
rf_pred <- predict(rf_model, newdata = test)
rfcm <- confusionMatrix(rf_pred, test[["Churn"]])
rfaccuracy <- rfcm$overall[c(1,3,4)]
rfcm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1424 317
## Yes 112 257
##
## Accuracy : 0.7967
## 95% CI : (0.7789, 0.8137)
## No Information Rate : 0.728
## P-Value [Acc > NIR] : 1.659e-13
##
## Kappa : 0.422
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9271
## Specificity : 0.4477
## Pos Pred Value : 0.8179
## Neg Pred Value : 0.6965
## Prevalence : 0.7280
## Detection Rate : 0.6749
## Detection Prevalence : 0.8251
## Balanced Accuracy : 0.6874
##
## 'Positive' Class : No
##
The Random Forest model yields accuracy of 80% but it correctly identifies 92% of all customers that did go on to leave the service.
Cross validation on the training set is used to find the optimal value for K
knn_model <- train(Churn ~ ., data = train,
method = "knn", trControl = control,
preProcess = c("center","scale"), tuneLength = 50)
knn_model
## k-Nearest Neighbors
##
## 4922 samples
## 18 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4430, 4429, 4430, 4430, 4429, 4430, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.7679915 0.8409266 0.5026834
## 7 0.7863464 0.8453427 0.5335540
## 9 0.7959654 0.8503014 0.5405725
## 11 0.8032439 0.8555416 0.5321169
## 13 0.8084182 0.8610573 0.5336374
## 15 0.8113219 0.8662968 0.5444305
## 17 0.8126446 0.8665731 0.5560346
## 19 0.8144746 0.8627102 0.5598927
## 21 0.8195417 0.8651911 0.5629756
## 23 0.8209569 0.8651880 0.5660763
## 25 0.8216082 0.8651835 0.5652713
## 27 0.8225264 0.8660114 0.5614252
## 29 0.8229787 0.8662892 0.5560167
## 31 0.8239682 0.8676689 0.5598629
## 33 0.8238503 0.8684931 0.5637150
## 35 0.8246429 0.8676689 0.5668277
## 37 0.8243899 0.8676689 0.5660465
## 39 0.8244966 0.8715279 0.5606202
## 41 0.8245858 0.8665662 0.5652713
## 43 0.8250267 0.8657413 0.5683602
## 45 0.8253998 0.8673965 0.5691294
## 47 0.8255240 0.8649141 0.5668098
## 49 0.8260608 0.8665700 0.5637448
## 51 0.8261070 0.8646394 0.5614073
## 53 0.8263300 0.8662915 0.5621765
## 55 0.8268182 0.8668432 0.5567680
## 57 0.8266926 0.8687708 0.5606321
## 59 0.8268643 0.8671164 0.5559988
## 61 0.8281500 0.8657390 0.5529219
## 63 0.8281291 0.8668425 0.5591055
## 65 0.8283792 0.8665693 0.5575432
## 67 0.8281783 0.8665685 0.5583065
## 69 0.8278101 0.8671195 0.5552236
## 71 0.8279462 0.8679467 0.5528980
## 73 0.8284916 0.8676697 0.5529100
## 75 0.8283043 0.8665677 0.5459451
## 77 0.8284033 0.8668455 0.5559750
## 79 0.8279869 0.8662930 0.5567501
## 81 0.8280169 0.8649133 0.5598271
## 83 0.8280381 0.8676689 0.5559690
## 85 0.8278481 0.8668409 0.5605963
## 87 0.8278187 0.8695965 0.5559750
## 89 0.8277724 0.8690463 0.5544305
## 91 0.8277313 0.8693210 0.5536613
## 93 0.8275961 0.8684946 0.5544305
## 95 0.8279002 0.8693218 0.5559809
## 97 0.8277472 0.8690448 0.5551998
## 99 0.8275440 0.8690456 0.5521049
## 101 0.8278845 0.8704230 0.5559511
## 103 0.8280408 0.8704222 0.5544007
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 73.
knn_pred <- predict(knn_model, newdata = test)
knncm <- confusionMatrix(knn_pred, test[["Churn"]])
knnaccuracy <- knncm$overall[c(1,3,4)]
knncm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1331 252
## Yes 205 322
##
## Accuracy : 0.7834
## 95% CI : (0.7652, 0.8008)
## No Information Rate : 0.728
## P-Value [Acc > NIR] : 2.751e-09
##
## Kappa : 0.4388
## Mcnemar's Test P-Value : 0.03141
##
## Sensitivity : 0.8665
## Specificity : 0.5610
## Pos Pred Value : 0.8408
## Neg Pred Value : 0.6110
## Prevalence : 0.7280
## Detection Rate : 0.6308
## Detection Prevalence : 0.7502
## Balanced Accuracy : 0.7138
##
## 'Positive' Class : No
##
The KNN model has predictive accuracy of 78% and correctly classifies 87% of customers who went on to leave.
Cross validation on the training set with 6 different values for the Cost parameter
grid <- expand.grid(C = c(0.01, 0.05, 0.1, 0.25, 0.5, 1))
svm_linear_model <- train(Churn ~., data = train, method = "svmLinear",
trControl= control,
preProcess = c("center", "scale"),
tuneLength = 6,
tuneGrid = grid)
svm_linear_model
## Support Vector Machines with Linear Kernel
##
## 4922 samples
## 18 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4430, 4429, 4430, 4430, 4431, 4430, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.01 0.8396572 0.9046139 0.5119320
## 0.05 0.8394989 0.9054366 0.5119261
## 0.10 0.8391577 0.9057121 0.5165534
## 0.25 0.8391774 0.9054366 0.5134645
## 0.50 0.8391731 0.9059876 0.5157782
## 1.00 0.8393649 0.9057121 0.5142338
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.
plot(svm_linear_model, main = "Cross validation to determine cost parameter")
svm_linear_pred <- predict(svm_linear_model, newdata = test)
svmcm <- confusionMatrix(svm_linear_pred, test[["Churn"]])
svm_linearaccuracy <- svmcm$overall[c(1,3,4)]
svmcm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1382 274
## Yes 154 300
##
## Accuracy : 0.7972
## 95% CI : (0.7794, 0.8141)
## No Information Rate : 0.728
## P-Value [Acc > NIR] : 1.126e-13
##
## Kappa : 0.452
## Mcnemar's Test P-Value : 8.815e-09
##
## Sensitivity : 0.8997
## Specificity : 0.5226
## Pos Pred Value : 0.8345
## Neg Pred Value : 0.6608
## Prevalence : 0.7280
## Detection Rate : 0.6550
## Detection Prevalence : 0.7848
## Balanced Accuracy : 0.7112
##
## 'Positive' Class : No
##
The Support Vector Classifier resulted in accuracy of 80% and 90% of all customers who went on to leave being correctly classified.
model_list <- list("Logistic" = glm_model, "GLMnet" = glmnet_model, "Random Forest" = rf_model, "kNN" = knn_model, "SVM" = svm_linear_model)
resamples <- resamples(model_list)
dotplot(resamples, metric="ROC", main = "Area Under Curve with 95% CI")
The GLMnet and Logistic models showed the greatest overall predictive accuracy but the Random Forest model showed the highest sensitivity, correctly identifying 92.6% of customers who did go on to churn. Additional domain knowledge would be required from the telecoms company to decide which model is the most useful.
models <- c("Logistic", "GLMnet", "Random Forest", "SVM", "KNN")
accuracysummary <- bind_rows(Logistic = glmaccuracy, GLMnet = glmnetaccuracy, RandomForest = rfaccuracy, kNN = knnaccuracy, SVM = svm_linearaccuracy)
library(tibble)
accuracysummary2 <- add_column(accuracysummary, "Model" = models, .before = "Accuracy")
accuracysummary2
## # A tibble: 5 x 4
## Model Accuracy AccuracyLower AccuracyUpper
## <chr> <dbl> <dbl> <dbl>
## 1 Logistic 0.807 0.790 0.824
## 2 GLMnet 0.806 0.789 0.823
## 3 Random Forest 0.797 0.779 0.814
## 4 SVM 0.783 0.765 0.801
## 5 KNN 0.797 0.779 0.814
library(ggthemes)
ggplot(accuracysummary2, aes(x = Model, y = Accuracy)) + geom_bar(stat = "identity") +
geom_errorbar(width = 0.2, aes(ymin = AccuracyLower, ymax = AccuracyUpper), color = "black") +
coord_cartesian(ylim = c(0.7, 0.85)) +
labs(y = "Accuracy %", x = "Model", title = "Model Prediction Accuracy with 95% CI") +
theme_minimal()
In this case all models show significantly greater predictive accuracy than the null model that predicts ‘No’ for every customer with accuracy of 72.6%. The Logistic and GLMnet methods are almost identical in terms of results.
If maximum accuracy is the goal then I would recommend that the logistic model since it is much more interpretable than the GLMnet model that blends lasso and ridge regression since it is more interpretable.
summary(glm_model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9616 -0.6763 -0.2895 0.6848 3.1879
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.775181 0.984493 0.787
## GenderMale -0.049477 0.077851 -0.636
## SeniorCitizenYes 0.177516 0.102686 1.729
## PartnerYes 0.051354 0.092172 0.557
## DependentsYes -0.204377 0.105942 -1.929
## Tenure -0.033802 0.002896 -11.674
## PhoneServiceYes 0.192776 0.790082 0.244
## MultipleLinesYes 0.530945 0.217205 2.444
## `InternetServiceFiber optic` 1.842418 0.971471 1.897
## InternetServiceNo -1.721730 0.982493 -1.752
## OnlineSecurityYes -0.178644 0.218480 -0.818
## OnlineBackupYes -0.026976 0.213708 -0.126
## DeviceProtectionYes 0.185122 0.215591 0.859
## TechSupportYes -0.176914 0.219763 -0.805
## StreamingTVYes 0.623458 0.398743 1.564
## StreamingMoviesYes 0.619407 0.397912 1.557
## `ContractOne year` -0.705278 0.128256 -5.499
## `ContractTwo year` -1.425403 0.217093 -6.566
## PaperlessBillingYes 0.373285 0.088498 4.218
## `PaymentMethodCredit card (automatic)` -0.044267 0.138485 -0.320
## `PaymentMethodElectronic check` 0.364357 0.113537 3.209
## `PaymentMethodMailed check` 0.084719 0.135625 0.625
## MonthlyCharges -0.037768 0.038660 -0.977
## Pr(>|z|)
## (Intercept) 0.43105
## GenderMale 0.52508
## SeniorCitizenYes 0.08386 .
## PartnerYes 0.57742
## DependentsYes 0.05371 .
## Tenure < 2e-16 ***
## PhoneServiceYes 0.80723
## MultipleLinesYes 0.01451 *
## `InternetServiceFiber optic` 0.05789 .
## InternetServiceNo 0.07970 .
## OnlineSecurityYes 0.41355
## OnlineBackupYes 0.89955
## DeviceProtectionYes 0.39052
## TechSupportYes 0.42081
## StreamingTVYes 0.11792
## StreamingMoviesYes 0.11956
## `ContractOne year` 3.82e-08 ***
## `ContractTwo year` 5.17e-11 ***
## PaperlessBillingYes 2.46e-05 ***
## `PaymentMethodCredit card (automatic)` 0.74923
## `PaymentMethodElectronic check` 0.00133 **
## `PaymentMethodMailed check` 0.53220
## MonthlyCharges 0.32860
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5672.9 on 4921 degrees of freedom
## Residual deviance: 4065.5 on 4899 degrees of freedom
## AIC: 4111.5
##
## Number of Fisher Scoring iterations: 6
Tenure, ContratOneYear, ContractTwoYear, PaymentMethodElectronic check, PaperlessBillingYes and MultipleLinesYes are the most significant predictors. Customers with these attributes are the most likely to not churn.
levels(data$Contract)
## [1] "Month-to-month" "One year" "Two year"
Being signed up to a one year or two year contract is a strong indicator that a customer will not leave the service, so those on month-to-month contracts are the most likely to churn.
levels(data$PaymentMethod)
## [1] "Bank transfer (automatic)" "Credit card (automatic)"
## [3] "Electronic check" "Mailed check"
Customers who pay by Electronic Check tend to churn much less.
Customers that have signed up recently on a month-to-month contract with a single telephone line and who pay with an alternative method to electronic check are the most likely to churn. Resources should be focussed on these customers to move them to products that are indicators of brand loyalty. Marketing and retention teams should prioritse the following products in descending order of importance: