Final Project

Introduction

Problem Statement:

We set out to find why customers are leaving Regork's services and how to retain the existing customer base. This involves searching the customer data for patterns in variables such as relationship status and payment options that distinguish those who are still customers from those who have left.

Implementation:

We used the customer data to determine why customers were leaving and what could be done to retain those who remain. We focused on customers who had already left in order to identify what was driving them away, and then used those tendencies to shape the promotions Regork should create to keep its current customers.

Solution:

We intend for our findings to assist in the retention of Regork's current customers. We found that many of those who left cancelled their subscription before the 18-month mark, and that these were also the customers with the highest monthly charges. Therefore, we believe it would be wise to develop a plan for keeping customers beyond the 18-month mark, for example a subscription plan that rewards customers who stay past 18 months.

Packages Required

The required packages are listed and followed by their use:

tidyverse = Data manipulation; includes ggplot2, which is used to create the graphs.
tidymodels = Modeling and machine learning.
gridExtra = Arranges multiple graphs side by side (grid.arrange).
baguette = Provides bagged tree (bootstrap aggregation) models.
ranger = Supports the implementation of random forests.
vip = Used to construct variable importance plots.
scales = Formats axis labels (dollar and comma formats) to clean up the graphs.

library(tidyverse)
library(tidymodels)
library(gridExtra)
library(baguette)
library(ranger)
library(vip)
library(scales)

Data Preparation

Explanation of Data

Explanation: The dataset we were given describes Regork's customers and various aspects of their use of Regork's services. It includes variables such as gender, whether the customer has a partner, whether they are a senior citizen, the services to which they subscribe, their monthly charges and payment method, and their status as either a current or former customer. This is useful because the data contains many examples of both current customers and customers who have left and are no longer using Regork's services. In total, the dataset contains 6,999 rows and 20 columns (variables).

Importing of Data

Below is how we imported the data.

retention <- read_csv("customer_retention.csv")
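
As a quick check that the import matches the description above (a minimal sketch; dim() is base R and glimpse() comes with the tidyverse loaded earlier), the dimensions and column types can be inspected:

dim(retention)      # should show roughly 6,999 rows and 20 columns
glimpse(retention)  # column names and types at a glance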

Organization of Data

We did not have to do much to organize the data and get it to the point we could model it.

Below is how we changed the variable of Status from a character class to a factor class.

retention <- mutate(retention, Status = factor(Status))

We then took the dataset and split it into the customers who were listed as “Left” (showing that they are no longer a customer) and those who were listed as “Current” (showing they are currently a customer).

left_retention <- retention %>% filter(Status == 'Left')

stay_retention <- retention %>% filter(Status == 'Current')

Summary of Variables

Out of the 20 variables, we focused mainly on 11 of them. Those that we focused on are listed below with a description.

Tenure: How long the customer has been (or was) a customer, in months.
Contract: Type of contract agreed upon: month-to-month, one year, or two year.
Partner: Whether the customer had a partner or not.
Dependents: Whether the customer had any dependents or not.
InternetService: What kind of internet service provider the customer paid for.
OnlineSecurity: Whether the customer had online security or not.
OnlineBackup: Whether the customer had online backup or not.
PaymentMethod: The customer’s payment method used.
MonthlyCharges: Amount the customer pays every month for Regork’s services.
TotalCharges: Total amount of money paid throughout the customer’s tenure.
Status: Denotes whether the customer is existing or former.

Exploratory Data Analysis

This is the first half of the analysis of the project. The findings of our analysis through exploration of the dataset will be represented visually through graphs and written analysis.

Monthly Charges Analysis

To start, we thought that an analysis of Monthly Charge and its impact on Customer Status would be worthwhile to look at. When exploring the dataset, we found that a box-and-whisker plot would be the best way to represent the data, with number callouts at the median values. Therefore, the first step was to find the median values of both the left and current customers, which was done here:

median(left_retention$MonthlyCharges)
## [1] 79.65
median(stay_retention$MonthlyCharges)
## [1] 64.4

With the medians in hand, the box-and-whisker plot was built using ggplot2, a package within the tidyverse. To clean up the graph, these changes were made:

  1. Filled in the graph with a dark green color.
  2. Changed the axis title and number formats of the y-axis.
  3. Changed the axis title of the x-axis.
  4. Added in custom callouts using annotate to display the median values.
  5. Added in a title for the graph.
monthcharge <- ggplot(retention, aes(x = Status, y = MonthlyCharges)) + 
  geom_boxplot(fill = 'darkgreen') +
  scale_y_continuous(name = "Monthly Charges", labels = dollar_format()) +
  xlab("Customer Status") +
  annotate("text", x = "Current", y = 69 , label = '$64.40', color = 'grey') +
  annotate("text", x = "Left", y = 84 , label = '$79.65', color = 'grey') +
  ggtitle("Range of Monthly Charges by Customer Status")
monthcharge

As can be seen in the graph, the median values for current and left customers differ noticeably, with left customers paying $15.25 more per month at the median. Additionally, the 1st and 3rd quartiles for left customers are higher than for current customers, which shows that left customers generally paid more across the board. This is an important insight, and one that forms the basis for the rest of our analysis.

Contract Type Analysis

The next predictor variable we looked at was Contract, which has 3 possible values: Month-to-Month, One year, and Two year. We made a bar graph for this analysis, which can be seen below:

contractleft <- ggplot(left_retention, aes(x = Contract)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Left Customers", labels = comma_format()) +
  xlab("Customer Status") +
  ggtitle("Number of Left Customers by \nContract Type")

contractcurrent <- ggplot(stay_retention, aes(x = Contract)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Current Customers", labels = comma_format()) +
  xlab("Customer Status") +
  ggtitle("Number of Current Customers \nby Contract Type")

grid.arrange(contractleft, contractcurrent, ncol = 2)

As can be seen, the contract types are relatively evenly distributed among current customers, with month-to-month contracts making up a slight majority. On the other hand, month-to-month contracts make up the vast majority of contracts among left customers. Diving deeper into this, as a general rule, monthly payments usually decrease as contract length increases. Therefore month-to-month contracts, which dominate among customers who left, would carry the highest monthly payments. Since we found that left customers had higher monthly payments on average than current customers, this lines up with our earlier finding: most customers who left had month-to-month contracts and therefore paid more per month than current customers, who are spread more evenly across the longer, cheaper contract types.
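
To put numbers on the "slight majority" and "vast majority" claims above, a quick proportion table can be computed (a sketch using the retention data prepared earlier; the same pattern works for the other categorical predictors in this section):

retention %>%
  count(Status, Contract) %>%
  group_by(Status) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()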

Partner and Dependent Status Analysis

Next, we turned our analysis to two more predictor variables: Partner and Dependents. Both of these variables have only two possible values, Yes and No. For the partner analysis, we compared current customers to left customers, and for the dependent analysis, we focused only on left customers:

partnerleft <- ggplot(left_retention, aes(x = Partner)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Left Customers", labels = comma_format()) +
  xlab("Partner Status") +
  ggtitle("Number of Left Customers by \nPartner Status")

partnercurrent <- ggplot(stay_retention, aes(x = Partner)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Current Customers", labels = comma_format()) +
  xlab("Partner Status") +
  ggtitle("Number of Current Customers \nby Partner Status")

grid.arrange(partnerleft, partnercurrent, ncol = 2)

dependentsleft <- ggplot(left_retention, aes(x = Dependents)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Left Customers", labels = comma_format()) +
  xlab("Dependent Status") +
  ggtitle("Number of Left Customers by Dependent Status")
dependentsleft

Looking at the partner status graphs, roughly twice as many left customers did not have a partner as did. Among current customers, slightly more customers have a partner than do not.

The dependents graph is pretty similar to the left customer graph for partners, with many more left customers not having any dependents than ones who do. This ties strongly into contract types, since customers with families or long-term partners are much more likely to purchase longer contracts (one year, two years) than customers without any dependents or partners. Since they are more likely to purchase longer contracts, their monthly payments would be less as well, which ties into our monthly payment analysis earlier. Since the large majority of left customers do not have any dependents or partners, they are more likely to purchase shorter contracts (month-to-month), leading to higher monthly payments on average.

Internet Service Analysis

The next predictor variable we looked at is InternetService. This variable has three possible values: DSL, Fiber optic, and No. The bar graph for it can be seen below:

serviceleft <- ggplot(left_retention, aes(x = InternetService)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Left Customers", labels = comma_format()) +
  xlab("Internet Service") +
  ggtitle("Number of Left Customers by \nInternet Service")

servicecurrent <- ggplot(stay_retention, aes(x = InternetService)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Current Customers", labels = comma_format()) +
  xlab("Internet Service") +
  ggtitle("Number of Current Customers \nby Internet Service")

grid.arrange(serviceleft, servicecurrent, ncol = 2)

The current customers graph seems to make sense: DSL is the budget-conscious option, so it has the most customers, followed by Fiber optic, the more expensive option, and then by customers with no internet service at all. The left customers graph tells a different story, however. Fiber optic makes up the vast majority of customers who left, with DSL and no provider falling significantly behind. This also ties into monthly payments: Fiber optic is the most expensive service option, so the left customers who used it carried higher monthly charges, which can be seen in the monthly charges graph at the beginning of this section.

Tenure Analysis

Tenure represents the number of months a customer has stayed with the company. To start the tenure analysis, we first wanted to find the average tenure for customers who left Regork versus current customers:

left_retention %>% select('Status','Tenure') %>% summarize(Average = mean(Tenure))
## # A tibble: 1 × 1
##   Average
##     <dbl>
## 1    18.0
stay_retention %>% select('Status','Tenure') %>% summarize(Average = mean(Tenure))
## # A tibble: 1 × 1
##   Average
##     <dbl>
## 1    37.6

Customers who left had an average tenure of 18 months, or 1 year and 6 months. Compared to the contract types offered, this falls between the one-year and two-year contracts. This may give an indication of which contracts Regork should push and recommend to its customers. Month-to-month contracts give customers the flexibility to leave at any time; because they are not committing to a fixed length of time, they may be more likely to leave.

We then built histograms for tenure lengths across left and current customers, which can be seen below:

tenureleft <- ggplot(left_retention, aes(x = Tenure)) + 
  geom_histogram(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Customers", labels = comma_format()) +
  xlab("Months of Tenure") +
  ggtitle("Number of Left Customers by Tenure Length")

tenurecurrent <- ggplot(stay_retention, aes(x = Tenure)) + 
  geom_histogram(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Customers", labels = comma_format()) +
  xlab("Months of Tenure") +
  ggtitle("Number of Current Customers by Tenure Length")

grid.arrange(tenureleft, tenurecurrent, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As expected, the majority of left customers had lower tenures, while current customers mostly have higher tenures. Something that is interesting about the current customer tenure graph is that there appears to be a large spike around 72 months. We believe this might be the result of an upper cap placed on the data at 72 months. This might be due to a large change in Regork, a change in the products offered at that time, or simply not having data from before that time period.
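
One quick way to gauge the size of that spike (a sketch, assuming the split data frames created earlier) is to count current customers at the top of the tenure range:

stay_retention %>%
  filter(Tenure >= 70) %>%
  count(Tenure)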

Online Security and Online Backup Analysis

The next two predictor variables we looked at are the OnlineSecurity and OnlineBackup variables. Both of these variables have three possible values: Yes, No, and No internet service. The two graphs we made can be seen below:

securityleft <- ggplot(left_retention, aes(x = OnlineSecurity)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Left Customers", labels = comma_format()) +
  xlab("Online Security Status") +
  ggtitle("Number of Left Customers by \nOnline Security Status")

securitycurrent <- ggplot(stay_retention, aes(x = OnlineSecurity)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Current Customers", labels = comma_format()) +
  xlab("Online Security Status") +
  ggtitle("Number of Current Customers \nby Online Security Status")

grid.arrange(securityleft, securitycurrent, ncol = 2)

backupleft <- ggplot(left_retention, aes(x = OnlineBackup)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Left Customers", labels = comma_format()) +
  xlab("Online Backup Status") +
  ggtitle("Number of Left Customers by \nOnline Backup Status")

backupcurrent <- ggplot(stay_retention, aes(x = OnlineBackup)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Current Customers", labels = comma_format()) +
  xlab("Online Backup Status") +
  ggtitle("Number of Current Customers \nby Online Backup Status")

grid.arrange(backupleft, backupcurrent, ncol = 2)

Both sets of graphs are very similar. Looking at current customers, most do not have online security, while most do have online backup. Looking at left customers reveals an interesting insight: the vast majority of customers who left had neither online security nor online backup. The left-customer distributions are almost identical across the two variables, indicating this is something Regork should look into. This could mean one of two things: either customers who left did not care about online security and online backup, or they left because they were never effectively offered these options, whether through a lack of customer awareness or pricing that was too high for these services.
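
A cross-tabulation makes the overlap between the two variables explicit for left customers (a sketch using the filtered data from the data preparation step):

left_retention %>%
  count(OnlineSecurity, OnlineBackup) %>%
  mutate(prop = n / sum(n))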

Payment Method Analysis

Finally, the last variable we looked at in our EDA was PaymentMethod. This variable has four possible values: Electronic check, Mailed check, Bank transfer (automatic), and Credit card (automatic). The graph we made for left customers is below:

paymentleft <- ggplot(left_retention, aes(x = PaymentMethod)) + 
  geom_bar(fill = "darkgreen") +
  scale_y_continuous(name = "Number of Left Customers", labels = comma_format()) +
  xlab("Payment Method") +
  ggtitle("Number of Left Customers by Payment Method")
paymentleft

This graph reveals that the large majority of customers who left used electronic checks as their payment method. When we looked at the same variable for current customers, we found that the payment methods were split almost evenly, making this graph stand out even more. We were not able to think of a good reason why so many left customers used electronic checks as opposed to other payment methods, so this would require further analysis and information. Was there a marketing campaign that led customers to use electronic checks? Was there an issue with the electronic check process Regork used? These are questions that can only be answered with data we do not currently have in this dataset.
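
For reference, the current-customer comparison described above can be reproduced with a simple count (a sketch; a bar chart version would mirror the left-customer code above):

stay_retention %>%
  count(PaymentMethod) %>%
  mutate(prop = n / sum(n))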

Machine Learning

In this section our group splits the data and fits three different types of models: the Bagged Tree Model, the Decision Tree Model, and lastly the model that performed best for us, Logistic Regression. The results help us determine which model best predicts customer status on unseen data.

Data Splitting

Before creating a model we must split our data into training and test sets using a 70-30 split:

set.seed(123)
ret <- initial_split(retention, prop = 0.7, strata = "Status")
ret_train <- training(ret)
ret_test <- testing(ret)
kfold <- vfold_cv(ret_train, v = 5)
ret_model_recipe <- recipe(Status ~ ., data = ret_train)

Bagged Tree Model

Our first model is a bagged tree model.

# create bagged CART model object with
# tuning option set for number of bagged trees
ret_bag_mod <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")
# create the hyperparameter grid

ret_bag_hyper_grid <- expand.grid(times = c(5, 25, 50))

# train our model across the hyper parameter grid
set.seed(123)
ret_bag_results <- tune_grid(ret_bag_mod, ret_model_recipe, resamples = kfold, grid = ret_bag_hyper_grid)
# model results
show_best(ret_bag_results, metric = "roc_auc")
## # A tibble: 3 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1    50 roc_auc binary     0.816     5 0.00567 Preprocessor1_Model3
## 2    25 roc_auc binary     0.814     5 0.00813 Preprocessor1_Model2
## 3     5 roc_auc binary     0.773     5 0.00675 Preprocessor1_Model1
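
To see how the AUC changes with the number of bagged trees, the resampled results can also be plotted (a sketch that reuses collect_metrics() and ggplot2 as elsewhere in this report):

collect_metrics(ret_bag_results) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(aes(x = times, y = mean)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of bagged trees", y = "Mean AUC (5-fold CV)")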

In this model we perform bootstrapping, hyperparameter tuning over a hyperparameter grid, and k-fold cross-validation. Even with all of this preparation, it performs the worst of the models we tried (which is surprising, because the model that required the least preparation did the best). The best AUC seen for this model was 0.816, which in and of itself is better than the most basic model from class, which had an AUC of approximately 0.7. Next we'll go over the decision tree model.

Decision Tree Model

# Step 1: create decision tree model object
ret_mod <- decision_tree(mode = 'classification') %>%
  set_engine("rpart")
# Step 2: create model recipe
ret_model_recipe <- recipe(Status ~ ., data = ret_train)
# Step 3: fit model workflow
ret_fit <- workflow() %>%
  add_recipe(ret_model_recipe) %>%
  add_model(ret_mod) %>%
  fit(data = ret_train)
rpart.plot::rpart.plot(ret_fit$fit$fit$fit)

Based on this decision tree, the root node splits on the Contract variable. This shows that, in this model, the most influential factor for customer status was whether or not the customer had a one-year or two-year contract.
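
The same takeaway can be checked with a variable importance plot for the fitted tree (a sketch; extract_fit_engine() pulls the underlying rpart object out of the fitted workflow):

ret_fit %>%
  extract_fit_engine() %>%
  vip(num_features = 10)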

# train model
ret_results <- fit_resamples(ret_mod, ret_model_recipe, kfold)
# model results
collect_metrics(ret_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.787     5 0.00598 Preprocessor1_Model1
## 2 roc_auc  binary     0.801     5 0.00849 Preprocessor1_Model1

After obtaining the AUC for this model, we decided to take it a little further and tune a model that takes hyperparameters into account (since the AUC here was only around 0.80). The following model tunes the hyperparameters cost_complexity, tree_depth, and min_n, performs k-fold cross-validation, and searches a regular grid with 3 levels. Below are the results of this model.

ret_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>% 
  set_engine("rpart")
# create the hyperparameter grid
ret_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 3
)
# train our model across the hyper parameter grid
set.seed(123)
ret_results <- tune_grid(ret_mod, ret_model_recipe, resamples = kfold, grid = ret_hyper_grid)
# get best results
show_best(ret_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estima…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
## 1    0.0000000001          8    40 roc_auc binary    0.821     5 0.00460 Prepro…
## 2    0.00000316            8    40 roc_auc binary    0.821     5 0.00460 Prepro…
## 3    0.0000000001          8    21 roc_auc binary    0.821     5 0.00436 Prepro…
## 4    0.00000316            8    21 roc_auc binary    0.821     5 0.00436 Prepro…
## 5    0.0000000001         15    40 roc_auc binary    0.817     5 0.00546 Prepro…
## # … with abbreviated variable name ¹​.estimator

As we can see, the model with the highest AUC had these hyperparameters: cost_complexity = 1e-10, tree_depth = 8, and min_n = 40. This resulted in a mean AUC of 0.821, which is better than the untuned model. This is still not our best model; the next section covers the model that performed best.
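
If we wanted to carry the winning combination forward, the tuned specification could be finalized and refit on the full training data (a sketch using standard tidymodels helpers):

# finalize the decision tree with the best hyperparameters and refit on the training set
best_params <- select_best(ret_results, metric = "roc_auc")
final_tree_fit <- workflow() %>%
  add_recipe(ret_model_recipe) %>%
  add_model(finalize_model(ret_mod, best_params)) %>%
  fit(data = ret_train)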

Logistic Regression Model

The final model we will discuss is a simple logistic regression model. Since there are no hyperparameters to tune for this model, we used k-fold resamples to get a more reliable estimate of its performance. Since this is the model that performed best for us, we also plot the important features and examine the confusion matrix to judge whether it is a good model.

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(ret_train, v = 5)
# train model via cross validation
results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)
# collect the average accuracy rate
collect_metrics(results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.796     5 0.00994 Preprocessor1_Model1
## 2 roc_auc  binary     0.844     5 0.00769 Preprocessor1_Model1

After collecting our results, we can clearly see this model performs the best of all our models, with an AUC of 0.844 and almost 80% accuracy. Next we plot the important features and the confusion matrix.
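
Before that, here is a quick sketch of how the same model's AUC could be checked on the held-out test set (assuming yardstick's default event level, which here is the first factor level, Current):

logistic_reg() %>%
  fit(Status ~ ., data = ret_train) %>%
  predict(ret_test, type = "prob") %>%
  bind_cols(ret_test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)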

Is this a Good Model?

In this section we answer whether this model is good and explore what it does well and what it could do better. In short, the model is good but, like any model, could always improve; let's explore how it could be better.

final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = ret_train)
tidy(final_fit)
## # A tibble: 31 × 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.03     0.975       1.06   2.89e- 1
##  2 GenderMale                     -0.0174   0.0778     -0.224  8.23e- 1
##  3 SeniorCitizen                   0.187    0.101       1.85   6.47e- 2
##  4 PartnerYes                     -0.0630   0.0934     -0.674  5.00e- 1
##  5 DependentsYes                  -0.0397   0.106      -0.372  7.10e- 1
##  6 Tenure                         -0.0561   0.00752    -7.45   9.16e-14
##  7 PhoneServiceYes                 0.0864   0.774       0.112  9.11e- 1
##  8 MultipleLinesNo phone service  NA       NA          NA     NA       
##  9 MultipleLinesYes                0.464    0.211       2.20   2.80e- 2
## 10 InternetServiceFiber optic      1.76     0.949       1.85   6.42e- 2
## # … with 21 more rows
final_fit %>%
  predict(ret_test) %>%
  bind_cols(ret_test %>% select(Status))
## # A tibble: 2,100 × 2
##    .pred_class Status 
##    <fct>       <fct>  
##  1 Current     Current
##  2 Left        Current
##  3 Left        Left   
##  4 Current     Current
##  5 Current     Current
##  6 Current     Current
##  7 Current     Current
##  8 Current     Current
##  9 Current     Current
## 10 Current     Current
## # … with 2,090 more rows
vip(final_fit$fit, num_features = 20)

Here, we learn that the three most influential features are Tenure, Contract (one year or two year), and TotalCharges. It makes sense that these features influence whether someone stays. Tenure, the amount of time a customer has stayed, makes sense because of how people handle bills in the real world: if you have been with a provider for a while, you tend to continue with the status quo unless a much better opportunity comes along. Contract also makes sense because many people are tied to a provider by one- or two-year contracts that help finance their plan; breaking the contract would require paying off the remaining balance, so customers on a contract are more likely to stay. Finally, TotalCharges makes sense as well: as total charges accumulate, more customers will, on average, choose to leave. For the direction and size of these effects we can look at the model coefficients. The estimates are -0.0561 for Tenure, -0.8015 for the one-year contract, -1.3963 for the two-year contract, and 0.0003 for TotalCharges. Negative coefficients indicate a decrease in the chance of leaving, because they decrease the predicted odds of the "Left" status.
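
For reference, those coefficients can be pulled out of the fitted model by name (a sketch; the exact term names assume R's default dummy coding of the Contract variable):

# pull the four coefficients discussed above
tidy(final_fit) %>%
  filter(term %in% c("Tenure", "ContractOne year", "ContractTwo year", "TotalCharges"))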

final_fit %>%
  predict(ret_test) %>%
  bind_cols(ret_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1389  247
##    Left        153  310

Next let’s understand how well my model is predicting against the actual data. Our true positive value is 1389 which means that the model predicted people would stay and people did stay 1389 times. Next our true negative value is 310 meaning the model predicted people would leave and 310 people left. Our next 2 values are the values the model didn’t quite capture. Our false positive is 153 meaning our model predicted 153 left but they actually stayed. Our false negative is 247 which means our model predicted 247 would stay and but they left. Lets look at these as percentages to better understand this. If the total predictions are 2099 (by summing all numbers) we can see that the model had 7% false positives and almost 12% false negatives. This shows us that our model is being more conservative and predicting more people are staying than leaving. Overall, this model predicted well but more complex algorithms could predict a more accurate model. Examples of Models that could have been used would be MARS, K-Nearest Neighbors, and even Naive Bayes.

Finally, to answer the question of whether this model is good: after looking over all of the results, we can see that the model performs well. Any model can be tweaked further, but ours correctly predicts customer status about 80% of the time and achieves an AUC of roughly 0.84. As the person responsible for making business decisions, I can use the information about influential features to guide those decisions. I have given practical reasoning for why each feature matters in the real world, and will carry this into the next section to make complete recommendations for the business.

Business Analysis & Conclusion

Predictor Variable Ratings

We would rate the predictors Tenure, Contract, PaymentMethod, MonthlyCharges, and InternetService higher than the other predictors in the model. As a business manager, I would focus on these predictor variables because they had the largest impact on the response variable in our model.

Tenure being the most important predictor variable makes sense, as customers with longer tenures are much more likely to be current customers than customers with shorter tenures. While offering incentives and deals to longer-tenure customers to reward long-term loyalty is important, Regork should focus on offering promotions and deals to customers with shorter tenures to make sure they do not switch to another provider. It is much harder for customers with longer tenures to switch to competitors, because they have become comfortable with their current services and are likely enjoying the discounts that come with being a loyal customer. Therefore, by focusing on lower-tenure customers, Regork can best maximize its promotional efforts.

Customers with one-year and two-year contracts are much more likely to stay with Regork than customers with month-to-month contracts, which gives those contract types high importance. Based on that information, as business managers we would incentivize customers to choose one-year and two-year contracts over month-to-month, through deals, promotions, and bundles with other services. By doing this while continuing to advertise month-to-month contracts, Regork can maximize its revenue while keeping customers it might otherwise lose.

As we found in our EDA, the electronic check payment method accounts for the vast majority of lost customers compared to the other payment methods. This could be due to many factors; however, we were not able to find the exact reason with the data provided to us. Some of our theories include a potential issue with the way Regork currently processes electronic checks, or a marketing campaign related to electronic checks that failed badly. Regardless, the electronic check payment method has a large influence on whether a customer stays or leaves, so it is definitely something we would look into as business managers.

We also found in our EDA that lost customers had higher monthly charges, on average, than current customers, and we inferred that this insight would be the basis for the rest of our analysis, which our model seems to confirm. Since MonthlyCharges is among the 20 most influential variables, it should be something we look into. This could mean offering promotions that lower monthly charges for customers with lower tenures, or potentially a referral program that rewards newer customers for bringing their friends into the Regork ecosystem.

And finally, internet service is an interesting variable to look at. We found in our EDA that customers with fiber optic internet service made up the vast majority of lost customers compared to DSL and no internet. This is also shown in our model, with InternetServiceFiber optic ranking highly in terms of variable importance. Therefore, as business managers we should look into offering deals that decrease the cost of fiber optic service, or running promotions that bundle in other services, like online security or online backup, for a cheaper overall price. Interestingly, InternetServiceNo also ranked highly for variable importance, and it has a negative coefficient in our model, meaning customers with no internet service are more likely to stay with Regork than to leave. This is something we would look into further as business managers.

Lost Customer Predictions and Incentives Scheme

Looking at our confusion matrix, 557 of the customers in our test set left Regork. Multiplying this by the mean monthly charge for lost customers, approximately $74, gives roughly $41,218. This means that if Regork makes no further effort to retain customers like these, it stands to lose about $41,218 in revenue per month. Multiplying this by 12 gives a predicted loss of nearly $500,000 in revenue over the course of a year. Therefore, some effort should be put toward preventing this loss in revenue.
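
The arithmetic behind that estimate is straightforward (a sketch; the mean monthly charge for lost customers is computed from the data prepared earlier and should come out to roughly $74):

lost_customers <- 557                               # lost customers in the test set
mean_charge <- mean(left_retention$MonthlyCharges)  # roughly $74 per month
monthly_loss <- lost_customers * mean_charge        # about $41,000 per month
annual_loss <- monthly_loss * 12                    # nearly $500,000 per year
c(monthly = monthly_loss, annual = annual_loss)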

For our incentive scheme, we would like to target the variables we found to be the most influential on making customers stay in the Regork ecosystem.

Our first step would be to start offering contracts in 6-month periods in addition to the month-to-month, one-year, and two-year contracts. This is a good idea because we found that the average tenure for lost customers is 18 months, or 1 year and 6 months. A 6-month option would give customers a middle ground between the one-year and two-year contracts, which are quite different in terms of commitment. Additionally, Regork could then target customers around the 6-month mark with incentives and deals to keep them with the company for longer, rather than continually offering deals every month that may not get used. The four contract types offered would then be month-to-month, 6-month, one-year, and two-year contracts, with pricing and monthly charges decreasing as the contract length increases. While this would initially carry a startup cost to get the 6-month contracts working, it would pay off significantly. Not only would this add a stream of revenue through an additional contract type, it would also decrease promotional spending, since Regork would not have to offer promotions as frequently as it does for month-to-month customers. By implementing methods like scheduling contracts to end around the holidays, Regork can more strategically promote its offerings and deals to move customers into another, longer-term contract.

Another aspect of our incentive scheme would be targeting newer customers that have a fiber optic service plan. As we found through our EDA and model, fiber optic service plans are causing new customers to leave because of the higher monthly charges. This issue can be fixed in two ways: implementing a switching program where customers can switch from a fiber optic plan to a DSL plan easily and get money for it, or by simply lowering the cost of fiber optic through bundles with services like internet security and backup, which a majority of lost customers also do not have. Bundling streaming services could also be a possible initiative to explore as well.

Improving Regork’s current payment methods would also be very important for retaining these predicted lost customers. As we found in our EDA and model, the electronic check payment method plays a large role in causing customers to leave the company. This may be because of an issue in the existing electronic check processing method, which could leave customers dissatisfied over time and eventually cause them to leave. We also found in our EDA that payment method usage is roughly equal across all types among current customers, showing that Regork’s customers use all of the offered payment methods about equally. Therefore, by investing in improving its payment processes, with a particular focus on electronic checks, Regork can improve a service that every customer uses while addressing a variable that may be driving a large number of customers away. The costs associated with improving these systems would be relatively small compared with the potential returns in revenue such a change would bring.

Conclusion Statement

Overall, by offering 6-month contracts in addition to the existing contract plans, offering a switching program and bundles to newer customers with fiber optic plans, and improving Regork’s current payment methods (with a focus on electronic checks), Regork can best maximize its revenue and minimize its costs. These steps would also greatly reduce the roughly $41,218 in revenue per month that Regork stands to lose from the 557 lost customers identified in our test set. While the initial costs of these incentives may be high, they would pay off in significant revenue over time.