The business question being analyzed is which factors affect customer status. Status indicates whether a customer currently uses this service or has switched out of it. Regork should be interested in this analysis because it can explain customer behavior and identify which groups are more likely to switch out of our services.
I started by analyzing the customer retention data for trends, using summaries and visualizations, before moving to machine learning. First, I found the average tenure length for each level of each variable and looked at how the length differed between predictors. Second, I found the percentage of customers who are current and who have left for each predictor and each level within the predictor. After I found a few trends, I moved to machine learning to find the most impactful factors. The first method I used was decision trees, followed by random forests. When I compared these methods, the random forest had a higher AUC (area under the ROC curve) value, which meant it was a better model for predicting status. The third machine learning method was MARS, or Multivariate Adaptive Regression Splines. I compared the AUC values for the random forest and MARS models and found that the MARS model had the higher value and was the best model of the three. Once I found the best model, I extracted its top feature importances and compared the validation and generalization errors.
My proposed solution is to focus on tenure because it is the most impactful factor in whether people leave or stay. Focusing on this factor can help increase the customer retention rate. The shorter the tenure, the more likely a customer is to leave. If we focus on keeping people for longer periods of time, they will become less likely to leave.
The table below shows how the different factors affect the average tenure length. As a baseline, the first two rows give the average tenure length for current customers and for customers who have left, to show the difference in length. Factors with a larger range between their levels are treated as more significant in affecting the status.
Predictor | Level | Average Tenure |
---|---|---|
Status | Current | 37.55823 |
Status | Left | 18.01293 |
Gender | Female | 32.22873 |
Gender | Male | 32.51897 |
SeniorCitizen | Not Senior | 32.18701 |
SeniorCitizen | Senior | 33.34951 |
Partner | No | 23.32134 |
Partner | Yes | 42.03603 |
Dependents | No | 29.78280 |
Dependents | Yes | 38.40238 |
PhoneService | No | 31.82692 |
PhoneService | Yes | 32.43381 |
MultipleLines | No | 24.11243 |
MultipleLines | No phone service | 31.82692 |
MultipleLines | Yes | 41.93631 |
InternetService | DSL | 32.88150 |
InternetService | Fiber optic | 32.89919 |
InternetService | No | 30.51284 |
OnlineSecurity | No | 25.84385 |
OnlineSecurity | No internet service | 30.51284 |
OnlineSecurity | Yes | 45.06770 |
OnlineBackup | No | 23.70879 |
OnlineBackup | No internet service | 30.51284 |
OnlineBackup | Yes | 44.58880 |
DeviceProtection | No | 23.72251 |
DeviceProtection | No internet service | 30.51284 |
DeviceProtection | Yes | 44.60598 |
TechSupport | No | 25.83179 |
TechSupport | No internet service | 30.51284 |
TechSupport | Yes | 44.87057 |
StreamingTV | No | 25.04513 |
StreamingTV | No internet service | 30.51284 |
StreamingTV | Yes | 41.04129 |
StreamingMovies | No | 24.73425 |
StreamingMovies | No internet service | 30.51284 |
StreamingMovies | Yes | 41.18065 |
Contract | Month-to-month | 18.02106 |
Contract | One year | 42.04437 |
Contract | Two year | 56.71132 |
PaperlessBilling | No | 32.17750 |
PaperlessBilling | Yes | 32.51197 |
PaymentMethod | Bank transfer (automatic) | 43.62907 |
PaymentMethod | Credit card (automatic) | 43.28175 |
PaymentMethod | Electronic check | 25.16085 |
PaymentMethod | Mailed check | 21.89457 |
The graph below shows the average tenure length for each predictor and each option underneath the predictor.
Looking at the graph, you can see that certain predictors have similar tenure lengths across their levels, which suggests these factors are not significant for the model. Factors that are significant according to the graph are contract, dependents, partner, multiple lines, online security, online backup, device protection, tech support, streaming TV, streaming movies, and payment method. All of these factors have a large range. Factors that are not significant are gender, senior citizen status, phone service, internet service, and paperless billing. All of these factors have a similar average regardless of the level of the factor.
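The range comparison described above can be sketched in a few lines. This is only an illustration in Python, not the code used to produce the report; the dictionary holds a subset of the averages copied from the table, and the names `avg_tenure` and `tenure_range` are hypothetical.

```python
# Average tenure per level, copied from the table above (subset of predictors).
avg_tenure = {
    "Gender": {"Female": 32.22873, "Male": 32.51897},
    "Partner": {"No": 23.32134, "Yes": 42.03603},
    "Contract": {
        "Month-to-month": 18.02106,
        "One year": 42.04437,
        "Two year": 56.71132,
    },
    "PaperlessBilling": {"No": 32.17750, "Yes": 32.51197},
    "PaymentMethod": {
        "Bank transfer (automatic)": 43.62907,
        "Credit card (automatic)": 43.28175,
        "Electronic check": 25.16085,
        "Mailed check": 21.89457,
    },
}


def tenure_range(levels):
    """Return the gap between the highest and lowest average tenure."""
    values = list(levels.values())
    return max(values) - min(values)


# Rank predictors from the largest range to the smallest.
ranges = {name: tenure_range(levels) for name, levels in avg_tenure.items()}
ranked = sorted(ranges.items(), key=lambda item: item[1], reverse=True)

for name, spread in ranked:
    print(f"{name:<18}{spread:8.2f}")
```

With these values, contract has the largest range and gender the smallest, which matches the reading of the graph. A large range is a descriptive signal only; it does not by itself establish statistical significance.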
The table below shows the percentage of customers who are current or have left for each level of each predictor. If the difference in percentages between levels is large, then that predictor may be a significant factor in whether a customer will churn.
Predictor | Status | Level | Proportion |
---|---|---|---|
Gender | Current | Female | 0.7306028 |
Gender | Current | Male | 0.7389581 |
Gender | Left | Female | 0.2693972 |
Gender | Left | Male | 0.2610419 |
SeniorCitizen | Current | Not Senior | 0.7640641 |
SeniorCitizen | Current | Senior | 0.5834069 |
SeniorCitizen | Left | Not Senior | 0.2359359 |
SeniorCitizen | Left | Senior | 0.4165931 |
Partner | Current | No | 0.6706338 |
Partner | Current | Yes | 0.8033077 |
Partner | Left | No | 0.3293662 |
Partner | Left | Yes | 0.1966923 |
Dependents | Current | No | 0.6871680 |
Dependents | Current | Yes | 0.8456057 |
Dependents | Left | No | 0.3128320 |
Dependents | Left | Yes | 0.1543943 |
PhoneService | Current | No | 0.7500000 |
PhoneService | Current | Yes | 0.7331963 |
PhoneService | Left | No | 0.2500000 |
PhoneService | Left | Yes | 0.2668037 |
MultipleLines | Current | No | 0.7496292 |
MultipleLines | Current | No phone service | 0.7500000 |
MultipleLines | Current | Yes | 0.7144309 |
MultipleLines | Left | No | 0.2503708 |
MultipleLines | Left | No phone service | 0.2500000 |
MultipleLines | Left | Yes | 0.2855691 |
InternetService | Current | DSL | 0.8112266 |
InternetService | Current | Fiber optic | 0.5808130 |
InternetService | Current | No | 0.9256090 |
InternetService | Left | DSL | 0.1887734 |
InternetService | Left | Fiber optic | 0.4191870 |
InternetService | Left | No | 0.0743910 |
OnlineSecurity | Current | No | 0.5819649 |
OnlineSecurity | Current | No internet service | 0.9256090 |
OnlineSecurity | Current | Yes | 0.8546541 |
OnlineSecurity | Left | No | 0.4180351 |
OnlineSecurity | Left | No internet service | 0.0743910 |
OnlineSecurity | Left | Yes | 0.1453459 |
OnlineBackup | Current | No | 0.6009772 |
OnlineBackup | Current | No internet service | 0.9256090 |
OnlineBackup | Current | Yes | 0.7850622 |
OnlineBackup | Left | No | 0.3990228 |
OnlineBackup | Left | No internet service | 0.0743910 |
OnlineBackup | Left | Yes | 0.2149378 |
DeviceProtection | Current | No | 0.6093038 |
DeviceProtection | Current | No internet service | 0.9256090 |
DeviceProtection | Current | Yes | 0.7747298 |
DeviceProtection | Left | No | 0.3906962 |
DeviceProtection | Left | No internet service | 0.0743910 |
DeviceProtection | Left | Yes | 0.2252702 |
TechSupport | Current | No | 0.5838167 |
TechSupport | Current | No internet service | 0.9256090 |
TechSupport | Current | Yes | 0.8484252 |
TechSupport | Left | No | 0.4161833 |
TechSupport | Left | No internet service | 0.0743910 |
TechSupport | Left | Yes | 0.1515748 |
StreamingTV | Current | No | 0.6661891 |
StreamingTV | Current | No internet service | 0.9256090 |
StreamingTV | Current | Yes | 0.6982887 |
StreamingTV | Left | No | 0.3338109 |
StreamingTV | Left | No internet service | 0.0743910 |
StreamingTV | Left | Yes | 0.3017113 |
StreamingMovies | Current | No | 0.6647357 |
StreamingMovies | Current | No internet service | 0.9256090 |
StreamingMovies | Current | Yes | 0.6994113 |
StreamingMovies | Left | No | 0.3352643 |
StreamingMovies | Left | No internet service | 0.0743910 |
StreamingMovies | Left | Yes | 0.3005887 |
Contract | Current | Month-to-month | 0.5729140 |
Contract | Current | One year | 0.8873720 |
Contract | Current | Two year | 0.9715471 |
Contract | Left | Month-to-month | 0.4270860 |
Contract | Left | One year | 0.1126280 |
Contract | Left | Two year | 0.0284529 |
PaperlessBilling | Current | No | 0.8361286 |
PaperlessBilling | Current | Yes | 0.6647329 |
PaperlessBilling | Left | No | 0.1638714 |
PaperlessBilling | Left | Yes | 0.3352671 |
PaymentMethod | Current | Bank transfer (automatic) | 0.8324641 |
PaymentMethod | Current | Credit card (automatic) | 0.8485450 |
PaymentMethod | Current | Electronic check | 0.5459574 |
PaymentMethod | Current | Mailed check | 0.8109794 |
PaymentMethod | Left | Bank transfer (automatic) | 0.1675359 |
PaymentMethod | Left | Credit card (automatic) | 0.1514550 |
PaymentMethod | Left | Electronic check | 0.4540426 |
PaymentMethod | Left | Mailed check | 0.1890206 |
Below is a graph of the table above, which visualizes the relationship.
The graph above shows the percentage of customers who are current or past users of the service according to the different predictors and the factors underneath the predictors.
According to the graph and table, significant factors are senior status, partners, dependents, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, contract, and paperless billing. Variables that show little difference are gender, phone service, and multiple lines. Payment method is mixed: three of the four methods have similar rates, but electronic check stands out, with about 45% of those customers having left. The factors that affected both the average tenure length and the percentage of current versus left customers were partners, dependents, streaming movies/TV, and tech support. These factors are seen as the most significant.
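The comparison of churn rates between levels can be sketched as follows. This is an illustrative Python snippet rather than the original analysis code; it uses a subset of the "Left" proportions from the table, and `left_rate` and `churn_spread` are hypothetical names.

```python
# Proportion of customers who left, per level, copied from the table above.
left_rate = {
    "Gender": {"Female": 0.2693972, "Male": 0.2610419},
    "SeniorCitizen": {"Not Senior": 0.2359359, "Senior": 0.4165931},
    "Contract": {
        "Month-to-month": 0.4270860,
        "One year": 0.1126280,
        "Two year": 0.0284529,
    },
    "PaymentMethod": {
        "Bank transfer (automatic)": 0.1675359,
        "Credit card (automatic)": 0.1514550,
        "Electronic check": 0.4540426,
        "Mailed check": 0.1890206,
    },
}


def churn_spread(levels):
    """Return (spread, riskiest_level) for one predictor.

    spread is the highest churn proportion minus the lowest;
    riskiest_level is the level with the highest churn proportion.
    """
    riskiest = max(levels, key=levels.get)
    spread = levels[riskiest] - min(levels.values())
    return spread, riskiest


summary = {name: churn_spread(levels) for name, levels in left_rate.items()}

for name, (spread, riskiest) in summary.items():
    print(f"{name:<15} spread={spread:.3f}  highest churn: {riskiest}")
```

Reporting the riskiest level alongside the spread makes cases like payment method easier to read, where one level (electronic check) drives most of the difference.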
The first machine learning model is a decision tree, tuned to find the best AUC so that the result can be compared with the other models.
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.789 5 0.00537 Preprocessor1_Model1
## 2 roc_auc binary 0.801 5 0.00651 Preprocessor1_Model1
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estima…¹ mean n std_err .config
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 8 30 roc_auc binary 0.824 5 0.00698 Prepro…
## 2 0.0000000178 8 30 roc_auc binary 0.824 5 0.00698 Prepro…
## 3 0.00000316 8 30 roc_auc binary 0.824 5 0.00698 Prepro…
## 4 0.0000000001 8 21 roc_auc binary 0.823 5 0.00548 Prepro…
## 5 0.0000000178 8 21 roc_auc binary 0.823 5 0.00548 Prepro…
## # … with abbreviated variable name ¹.estimator
Below is the confusion matrix for the decision tree algorithm. Rows are predictions and columns are the true status. Of the 463 customers the model predicted would leave, 310 actually left, and of the 1636 customers it predicted would stay, 1389 actually stayed.
## Truth
## Prediction Current Left
## Current 1389 247
## Left 153 310
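The counts in the confusion matrix can be turned into summary metrics. The sketch below is illustrative Python, not the code behind the report; it assumes, as the printed layout indicates, that rows are predictions and columns are the truth, and treats "Left" as the class of interest.

```python
# Confusion matrix counts keyed by (prediction, truth), from the output above.
confusion = {
    ("Current", "Current"): 1389,
    ("Current", "Left"): 247,
    ("Left", "Current"): 153,
    ("Left", "Left"): 310,
}

total = sum(confusion.values())

# Share of all customers whose status was predicted correctly.
accuracy = (confusion[("Current", "Current")] + confusion[("Left", "Left")]) / total

# Of the customers predicted to leave, the share that actually left.
predicted_left = confusion[("Left", "Current")] + confusion[("Left", "Left")]
precision_left = confusion[("Left", "Left")] / predicted_left

# Of the customers who actually left, the share the model caught.
actual_left = confusion[("Current", "Left")] + confusion[("Left", "Left")]
recall_left = confusion[("Left", "Left")] / actual_left

print(f"accuracy:       {accuracy:.3f}")
print(f"precision Left: {precision_left:.3f}")
print(f"recall Left:    {recall_left:.3f}")
```

Recall for "Left" is the more relevant number for retention work, since it measures how many departing customers the model would flag in time to act.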
The second machine learning model is a random forest, tuned to find the best AUC so that the result can be compared with the other models.
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.788 5 0.00223 Preprocessor1_Model1
## 2 roc_auc binary 0.804 5 0.00826 Preprocessor1_Model1
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.793 5 0.00198 Preprocessor1_Model1
## 2 roc_auc binary 0.834 5 0.00373 Preprocessor1_Model1
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 200 15 roc_auc binary 0.841 5 0.00374 Preprocessor1_Model0…
## 2 2 87 20 roc_auc binary 0.841 5 0.00310 Preprocessor1_Model0…
## 3 2 162 20 roc_auc binary 0.841 5 0.00403 Preprocessor1_Model0…
## 4 2 125 20 roc_auc binary 0.841 5 0.00399 Preprocessor1_Model0…
## 5 2 200 20 roc_auc binary 0.841 5 0.00342 Preprocessor1_Model0…
Below is the confusion matrix for the random forest algorithm. Of the 463 customers the model predicted would leave, 310 actually left, and of the 1636 customers it predicted would stay, 1389 actually stayed.
## Truth
## Prediction Current Left
## Current 1389 247
## Left 153 310
The third machine learning model is MARS, tuned to find the best AUC so that the result can be compared with the other models.
## # A tibble: 100 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 17 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 2 19 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 3 21 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 4 23 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 5 25 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 6 27 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 7 29 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 8 31 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 9 33 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 10 35 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## # … with 90 more rows
When the AUC values were compared for all three machine learning models, the MARS model had the highest value. The decision tree model had an AUC value of 0.824, the random forest model had a value of 0.841, and the MARS model had a value of 0.848. Since the MARS model performed best, its top 10 feature importances were examined.
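For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen customer who left receives a higher predicted score than a randomly chosen customer who stayed. The Python sketch below illustrates that definition on made-up toy data and then picks the best of the three reported values; the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the analysis.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. Left), 0 otherwise.
    scores: predicted score for the positive class. Ties count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, for illustration only.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)

# Cross-validated AUC values reported in this section.
model_auc = {"decision tree": 0.824, "random forest": 0.841, "MARS": 0.848}
best_model = max(model_auc, key=model_auc.get)
print(best_model, model_auc[best_model])
```

The pairwise loop is quadratic and is meant only to make the definition concrete; library implementations use a sort-based calculation instead.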
Below is a graph of the feature importance, which shows the ten most impactful predictors in order of impact.
The top predictor for determining whether a customer will leave is tenure length, followed by total charges.
Below is the confusion matrix for the MARS algorithm. Of the 463 customers the model predicted would leave, 310 actually left, and of the 1636 customers it predicted would stay, 1389 actually stayed. All three models were found to have the same confusion matrix.
## Truth
## Prediction Current Left
## Current 1389 247
## Left 153 310
The validation error was estimated with 5-fold cross-validation. The best model by AUC used 17 terms (num_terms) and a product degree of 1 (prod_degree), resulting in an AUC value of 0.848. The generalized error was 0.801, which shows that the error decreased with machine learning.
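The 5-fold procedure mentioned above splits the training rows into five parts and holds out each part once for validation. The Python sketch below shows only the splitting step, under the assumption of a simple shuffled split with no stratification; `k_fold_indices` is a hypothetical helper and not the resampling function used in the analysis.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (training, validation) row-index lists for k-fold CV.

    Rows are shuffled once with a fixed seed, then dealt into k folds.
    Each fold serves as the validation set exactly once.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield training, validation


# Example with 10 rows: each split trains on 8 rows and validates on 2.
splits = list(k_fold_indices(10, k=5))
for training, validation in splits:
    print(sorted(validation), len(training))
```

A model would be fit on each training list and scored on the matching validation list, and the reported metric is the mean of the five scores, which is what the `mean` and `n` columns in the tuning output summarize.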
The most important factors that Regork needs to focus on to keep customers from leaving are tenure length and total charges. Other factors to look at are monthly charges, two-year contracts, one-year contracts, the no-phone-service level of multiple lines, having online security, having tech support, the electronic check payment method, and fiber optic internet service.
It is not clear how much potential revenue per month we will lose if we do not start looking into these groups.
The business question was which factors can help predict whether a customer will stay or leave. Through analyzing the customer retention data, it was found that the most important factors to consider are tenure length and total charges. The longer a customer has used our service and the more they have spent in total, the more likely they are to stay. This means we need to focus our incentive scheme on our newer customers to encourage them to stay with our services, rather than focusing our energy on customers who are already long-term.
A limitation of this data is that it covers only a sample of our customers, not all of them. Other factors not captured here also play into whether people are willing to stay.