Introduction

The business question being analyzed is which factors affect customer status. Status indicates whether a customer currently uses this service or has switched away from it. Regork should be interested in this analysis because it can explain customer behavior and show which groups are more likely to leave our services.

I began by exploring the customer retention data for trends and visualizing them before moving to machine learning. First, I computed the average tenure length for each level of each predictor and looked at how it differed across levels. Second, I computed the percentage of customers who are current and who have left for each level of each predictor. After finding a few trends, I used machine learning to identify the most impactful factors. The first method was a decision tree and the second was a random forest. Comparing the two, the random forest had the higher AUC (area under the ROC curve), which made it the better model for predicting status. The third method was MARS (Multivariate Adaptive Regression Splines). Comparing the AUC values of the random forest and MARS models, MARS had the higher value and was the best of the three. With the best model chosen, I extracted its most important features and compared the validation and generalization errors.

My proposed solution is to focus on tenure, because it is the most impactful factor in whether people leave or stay. Focusing on this factor can help increase the customer retention rate. The shorter a customer's tenure, the more likely that customer is to leave. If we focus on keeping people for longer periods of time, they become less likely to leave.

Packages Required

Exploratory Data Analysis

The table below shows how the different factors relate to average tenure length. As a baseline, it first compares the average tenure of current customers with that of customers who have left. Predictors whose levels show a larger range in average tenure are more likely to matter for status, although a large range on its own is not a formal test of statistical significance.

Predictor         Level                      Average Tenure
Status            Current                    37.55823
Status            Left                       18.01293
Gender            Female                     32.22873
Gender            Male                       32.51897
SeniorCitizen     Not Senior                 32.18701
SeniorCitizen     Senior                     33.34951
Partner           No                         23.32134
Partner           Yes                        42.03603
Dependents        No                         29.78280
Dependents        Yes                        38.40238
PhoneService      No                         31.82692
PhoneService      Yes                        32.43381
MultipleLines     No                         24.11243
MultipleLines     No phone service           31.82692
MultipleLines     Yes                        41.93631
InternetService   DSL                        32.88150
InternetService   Fiber optic                32.89919
InternetService   No                         30.51284
OnlineSecurity    No                         25.84385
OnlineSecurity    No internet service        30.51284
OnlineSecurity    Yes                        45.06770
OnlineBackup      No                         23.70879
OnlineBackup      No internet service        30.51284
OnlineBackup      Yes                        44.58880
DeviceProtection  No                         23.72251
DeviceProtection  No internet service        30.51284
DeviceProtection  Yes                        44.60598
TechSupport       No                         25.83179
TechSupport       No internet service        30.51284
TechSupport       Yes                        44.87057
StreamingTV       No                         25.04513
StreamingTV       No internet service        30.51284
StreamingTV       Yes                        41.04129
StreamingMovies   No                         24.73425
StreamingMovies   No internet service        30.51284
StreamingMovies   Yes                        41.18065
Contract          Month-to-month             18.02106
Contract          One year                   42.04437
Contract          Two year                   56.71132
PaperlessBilling  No                         32.17750
PaperlessBilling  Yes                        32.51197
PaymentMethod     Bank transfer (automatic)  43.62907
PaymentMethod     Credit card (automatic)    43.28175
PaymentMethod     Electronic check           25.16085
PaymentMethod     Mailed check               21.89457

The graph below shows the average tenure length for each predictor and each option underneath the predictor.

In the graph, predictors whose levels have similar average tenure lengths are unlikely to be important for the model. Factors that appear significant are contract, dependents, partner, multiple lines, online security, online backup, device protection, tech support, streaming TV, streaming movies, and payment method; all of these show a large range. Factors that do not appear significant are gender, senior citizen status, phone service, internet service, and paperless billing; their averages are similar across levels.
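The range comparison can be reproduced directly from the table. The following Python sketch is illustrative only (the report's own output appears to come from R); it copies a subset of rows from the average tenure table and ranks predictors by the gap between their highest and lowest level averages. The names `avg_tenure` and `tenure_spread` are hypothetical, and a large gap is a descriptive signal, not a statistical test.

```python
# Average tenure per level, copied from the table above (subset of predictors).
avg_tenure = {
    "Contract": {"Month-to-month": 18.02106, "One year": 42.04437, "Two year": 56.71132},
    "Partner": {"No": 23.32134, "Yes": 42.03603},
    "PaymentMethod": {
        "Bank transfer (automatic)": 43.62907,
        "Credit card (automatic)": 43.28175,
        "Electronic check": 25.16085,
        "Mailed check": 21.89457,
    },
    "Dependents": {"No": 29.78280, "Yes": 38.40238},
    "Gender": {"Female": 32.22873, "Male": 32.51897},
    "PaperlessBilling": {"No": 32.17750, "Yes": 32.51197},
}


def tenure_spread(levels):
    """Gap between the largest and smallest level average (hypothetical helper)."""
    values = list(levels.values())
    return max(values) - min(values)


spreads = {name: tenure_spread(levels) for name, levels in avg_tenure.items()}

# Largest gap first: these are the predictors whose levels differ most in tenure.
ranked = sorted(spreads.items(), key=lambda item: item[1], reverse=True)

for name, gap in ranked:
    print(f"{name:<18}{gap:6.2f}")
```

Among these six predictors, contract type shows the widest gap and gender the narrowest, which matches the reading of the graph.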

The table below shows the share of customers who are current or have left for each level of each predictor. If the share that left differs widely between the levels of a predictor, that predictor may be an important factor in whether a customer will churn.

Predictor         Status   Level                      Proportion
Gender            Current  Female                     0.7306028
Gender            Current  Male                       0.7389581
Gender            Left     Female                     0.2693972
Gender            Left     Male                       0.2610419
SeniorCitizen     Current  Not Senior                 0.7640641
SeniorCitizen     Current  Senior                     0.5834069
SeniorCitizen     Left     Not Senior                 0.2359359
SeniorCitizen     Left     Senior                     0.4165931
Partner           Current  No                         0.6706338
Partner           Current  Yes                        0.8033077
Partner           Left     No                         0.3293662
Partner           Left     Yes                        0.1966923
Dependents        Current  No                         0.6871680
Dependents        Current  Yes                        0.8456057
Dependents        Left     No                         0.3128320
Dependents        Left     Yes                        0.1543943
PhoneService      Current  No                         0.7500000
PhoneService      Current  Yes                        0.7331963
PhoneService      Left     No                         0.2500000
PhoneService      Left     Yes                        0.2668037
MultipleLines     Current  No                         0.7496292
MultipleLines     Current  No phone service           0.7500000
MultipleLines     Current  Yes                        0.7144309
MultipleLines     Left     No                         0.2503708
MultipleLines     Left     No phone service           0.2500000
MultipleLines     Left     Yes                        0.2855691
InternetService   Current  DSL                        0.8112266
InternetService   Current  Fiber optic                0.5808130
InternetService   Current  No                         0.9256090
InternetService   Left     DSL                        0.1887734
InternetService   Left     Fiber optic                0.4191870
InternetService   Left     No                         0.0743910
OnlineSecurity    Current  No                         0.5819649
OnlineSecurity    Current  No internet service        0.9256090
OnlineSecurity    Current  Yes                        0.8546541
OnlineSecurity    Left     No                         0.4180351
OnlineSecurity    Left     No internet service        0.0743910
OnlineSecurity    Left     Yes                        0.1453459
OnlineBackup      Current  No                         0.6009772
OnlineBackup      Current  No internet service        0.9256090
OnlineBackup      Current  Yes                        0.7850622
OnlineBackup      Left     No                         0.3990228
OnlineBackup      Left     No internet service        0.0743910
OnlineBackup      Left     Yes                        0.2149378
DeviceProtection  Current  No                         0.6093038
DeviceProtection  Current  No internet service        0.9256090
DeviceProtection  Current  Yes                        0.7747298
DeviceProtection  Left     No                         0.3906962
DeviceProtection  Left     No internet service        0.0743910
DeviceProtection  Left     Yes                        0.2252702
TechSupport       Current  No                         0.5838167
TechSupport       Current  No internet service        0.9256090
TechSupport       Current  Yes                        0.8484252
TechSupport       Left     No                         0.4161833
TechSupport       Left     No internet service        0.0743910
TechSupport       Left     Yes                        0.1515748
StreamingTV       Current  No                         0.6661891
StreamingTV       Current  No internet service        0.9256090
StreamingTV       Current  Yes                        0.6982887
StreamingTV       Left     No                         0.3338109
StreamingTV       Left     No internet service        0.0743910
StreamingTV       Left     Yes                        0.3017113
StreamingMovies   Current  No                         0.6647357
StreamingMovies   Current  No internet service        0.9256090
StreamingMovies   Current  Yes                        0.6994113
StreamingMovies   Left     No                         0.3352643
StreamingMovies   Left     No internet service        0.0743910
StreamingMovies   Left     Yes                        0.3005887
Contract          Current  Month-to-month             0.5729140
Contract          Current  One year                   0.8873720
Contract          Current  Two year                   0.9715471
Contract          Left     Month-to-month             0.4270860
Contract          Left     One year                   0.1126280
Contract          Left     Two year                   0.0284529
PaperlessBilling  Current  No                         0.8361286
PaperlessBilling  Current  Yes                        0.6647329
PaperlessBilling  Left     No                         0.1638714
PaperlessBilling  Left     Yes                        0.3352671
PaymentMethod     Current  Bank transfer (automatic)  0.8324641
PaymentMethod     Current  Credit card (automatic)    0.8485450
PaymentMethod     Current  Electronic check           0.5459574
PaymentMethod     Current  Mailed check               0.8109794
PaymentMethod     Left     Bank transfer (automatic)  0.1675359
PaymentMethod     Left     Credit card (automatic)    0.1514550
PaymentMethod     Left     Electronic check           0.4540426
PaymentMethod     Left     Mailed check               0.1890206

Below is a graph of the table above, which visualizes the relationship.

The graph above shows the percentage of customers who are current or past users of the service according to the different predictors and the factors underneath the predictors.

According to the graph and table, the significant factors are senior citizen status, partner, dependents, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, contract, paperless billing, and payment method (about 45% of electronic check customers left, versus 15% to 19% for the other payment methods). Variables that are not significant are gender, phone service, and multiple lines. The factors that stood out in both the average tenure comparison and the current-versus-left comparison were partner, dependents, streaming movies/TV, and tech support. These are seen as the most significant.
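The same kind of comparison can be run on the share of customers who left. The Python sketch below is illustrative only (the report's own output appears to come from R); it copies the "Left" rows for a subset of predictors from the table above and ranks them by the gap, in percentage points, between the level with the highest share and the level with the lowest. The names `left_share` and `churn_gap` are hypothetical.

```python
# Share of customers who left, per level, copied from the "Left" rows of the table above.
left_share = {
    "Contract": {"Month-to-month": 0.4270860, "One year": 0.1126280, "Two year": 0.0284529},
    "PaymentMethod": {
        "Bank transfer (automatic)": 0.1675359,
        "Credit card (automatic)": 0.1514550,
        "Electronic check": 0.4540426,
        "Mailed check": 0.1890206,
    },
    "SeniorCitizen": {"Not Senior": 0.2359359, "Senior": 0.4165931},
    "Partner": {"No": 0.3293662, "Yes": 0.1966923},
    "PhoneService": {"No": 0.2500000, "Yes": 0.2668037},
    "Gender": {"Female": 0.2693972, "Male": 0.2610419},
}


def churn_gap(levels):
    """Return (gap in percentage points, level with highest share, level with lowest share)."""
    highest = max(levels, key=levels.get)
    lowest = min(levels, key=levels.get)
    gap_points = (levels[highest] - levels[lowest]) * 100
    return gap_points, highest, lowest


gaps = {name: churn_gap(levels) for name, levels in left_share.items()}

# Largest gap first: predictors whose levels differ most in the share that left.
ranked = sorted(gaps.items(), key=lambda item: item[1][0], reverse=True)

for name, (gap_points, highest, lowest) in ranked:
    print(f"{name:<15}{gap_points:5.1f} points  (highest: {highest}, lowest: {lowest})")
```

A gap of this kind is only descriptive; it does not account for how many customers fall in each level or for overlap between predictors.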

Machine Learning

Decision Trees

The first machine learning model is a decision tree, tuned to find the best AUC so that it can be compared with the other models.

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.789     5 0.00537 Preprocessor1_Model1
## 2 roc_auc  binary     0.801     5 0.00651 Preprocessor1_Model1
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estima…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
## 1    0.0000000001          8    30 roc_auc binary    0.824     5 0.00698 Prepro…
## 2    0.0000000178          8    30 roc_auc binary    0.824     5 0.00698 Prepro…
## 3    0.00000316            8    30 roc_auc binary    0.824     5 0.00698 Prepro…
## 4    0.0000000001          8    21 roc_auc binary    0.823     5 0.00548 Prepro…
## 5    0.0000000178          8    21 roc_auc binary    0.823     5 0.00548 Prepro…
## # … with abbreviated variable name ¹​.estimator

Confusion Matrix

Below is the confusion matrix for the decision tree algorithm. Of the 463 customers the model predicted would leave, 310 actually left and 153 were still current. Of the 1,636 customers it predicted would stay, 1,389 actually stayed and 247 left.

##           Truth
## Prediction Current Left
##    Current    1389  247
##    Left        153  310
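The counts in the matrix can be turned into the usual summary rates. This Python sketch is illustrative only (the report's own output appears to come from R); the variable names are hypothetical, and "Left" is treated as the class of interest, which is an assumption about how the matrix is meant to be read.

```python
# Counts copied from the confusion matrix above.
# Rows are the model's predictions, columns are the true status.
pred_current_true_current = 1389
pred_current_true_left = 247
pred_left_true_current = 153
pred_left_true_left = 310

total = (pred_current_true_current + pred_current_true_left
         + pred_left_true_current + pred_left_true_left)

# Share of all customers classified correctly.
accuracy = (pred_current_true_current + pred_left_true_left) / total

# Precision: of the customers predicted to leave, how many actually left.
precision_left = pred_left_true_left / (pred_left_true_left + pred_left_true_current)

# Recall: of the customers who actually left, how many the model caught.
recall_left = pred_left_true_left / (pred_left_true_left + pred_current_true_left)

# Specificity: of the customers who actually stayed, how many were predicted to stay.
specificity = pred_current_true_current / (pred_current_true_current + pred_left_true_current)

for name, value in [
    ("accuracy", accuracy),
    ("precision (Left)", precision_left),
    ("recall (Left)", recall_left),
    ("specificity", specificity),
]:
    print(f"{name:<18}{value:.3f}")
```

With these counts the model catches a little over half of the customers who actually left (310 of 557), so it is better at confirming who stays than at flagging who leaves.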

Random Forest

The second machine learning model is a random forest, tuned to find the best AUC so that it can be compared with the other models.

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.788     5 0.00223 Preprocessor1_Model1
## 2 roc_auc  binary     0.804     5 0.00826 Preprocessor1_Model1
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.793     5 0.00198 Preprocessor1_Model1
## 2 roc_auc  binary     0.834     5 0.00373 Preprocessor1_Model1
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2   200    15 roc_auc binary     0.841     5 0.00374 Preprocessor1_Model0…
## 2     2    87    20 roc_auc binary     0.841     5 0.00310 Preprocessor1_Model0…
## 3     2   162    20 roc_auc binary     0.841     5 0.00403 Preprocessor1_Model0…
## 4     2   125    20 roc_auc binary     0.841     5 0.00399 Preprocessor1_Model0…
## 5     2   200    20 roc_auc binary     0.841     5 0.00342 Preprocessor1_Model0…

Confusion Matrix

Below is the confusion matrix for the random forest algorithm. Of the 463 customers the model predicted would leave, 310 actually left and 153 were still current. Of the 1,636 customers it predicted would stay, 1,389 actually stayed and 247 left.

##           Truth
## Prediction Current Left
##    Current    1389  247
##    Left        153  310

MARS

The third machine learning model is MARS, tuned to find the best AUC so that it can be compared with the other models.

## # A tibble: 100 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        17           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  3        21           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  4        23           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  5        25           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  6        27           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  7        29           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  8        31           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
##  9        33           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
## 10        35           1 roc_auc binary     0.848     5 0.00557 Preprocessor1_M…
## # … with 90 more rows

Confusion Matrix

Below is the confusion matrix for the MARS algorithm. Of the 463 customers the model predicted would leave, 310 actually left and 153 were still current. Of the 1,636 customers it predicted would stay, 1,389 actually stayed and 247 left. All three models produced the same confusion matrix; because identical results from three different algorithms are unusual, this should be checked against the fitted models.

##           Truth
## Prediction Current Left
##    Current    1389  247
##    Left        153  310
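
Validation and Generalization Errors

The validation error was estimated with 5-fold cross-validation. The best MARS model by AUC used 17 terms (num_terms) and a product degree of 1 (prod_degree), giving a cross-validated AUC of 0.848. The generalization value was 0.801, which is below the cross-validated AUC, so performance on unseen data may be somewhat lower than the validation result suggests.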
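For readers unfamiliar with these metrics, the Python sketch below illustrates two ideas on made-up data: AUC as the share of (left, current) customer pairs in which the customer who left receives the higher predicted score, and a 5-fold summary reported as a mean and standard error, like the `mean`, `n`, and `std_err` columns in the tuning output. The labels, scores, and function names are hypothetical, and the sketch only shows the evaluation side: in real cross-validation the model is refit on the other four folds before each fold is scored. It is not the code that produced the report's results.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Made-up data: 1 = left, 0 = current; scores loosely track the label.
rng = random.Random(42)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(500)]
scores = [0.35 * y + rng.random() * 0.65 for y in labels]

# Score each held-out fold separately (model refitting is omitted in this sketch).
fold_aucs = []
for fold in k_fold_indices(len(labels), k=5):
    fold_labels = [labels[i] for i in fold]
    fold_scores = [scores[i] for i in fold]
    fold_aucs.append(roc_auc(fold_labels, fold_scores))

mean_auc = statistics.mean(fold_aucs)
std_err = statistics.stdev(fold_aucs) / len(fold_aucs) ** 0.5
print(f"mean AUC = {mean_auc:.3f}, n = {len(fold_aucs)}, std_err = {std_err:.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the model with the highest cross-validated AUC was preferred.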

Summary

The most important factors that Regork needs to focus on to keep customers from leaving are tenure length and total charges. Other factors to look at are monthly charges, two-year contracts, one-year contracts, the "No phone service" level of multiple lines, having online security, having tech support, the electronic check payment method, and fiber optic internet service.

It is not yet clear how much potential monthly revenue we will lose if we do not start looking into these groups.

The business question was which factors can help predict whether a customer will stay or leave. The analysis of the customer retention data found that the most important factors to consider are tenure length and total charges. The longer customers have used our service and the more they have spent in total, the more likely they are to stay. This means we should focus our incentive scheme on newer customers and encourage them to stay longer, instead of spending that effort on customers who are already long-term.

A limitation of this data is that it is only a sample, not all of our customers. Other factors not captured here also play into whether people are willing to stay.