Luisa Lopez Final Project (Data Mining)

Introduction

BUSINESS PROBLEM

Regork decided to expand as a business and enter the telecommunications market. This is a great step but it comes with its challenges. One of them is that in this industry, customer retention is vital. In most cases, lead generation can be very costly. With some items such as marketing, sales, discounts, free trials, media promotion, these can add up and decrease the profit by customer. For Regork right now is important to understand its clients and know what items could improve customer retention. Additionally, Regork needs to plan for these customer that show behavior of leaving the company services. If we attack this problem, we will be able to predict which customers are prone to leaving, and create strategies to incentyvize loyalty.

APPROACH

To approach this problem we are using the following libraries to help in our analysis: - Tidymodels: simplify model and preprocessing data

Tidyverse: For data manipulation, visualization and analysis
vip: measure the importance of forecasting features
Kernlab: used for forecasting with methods as regression and classification
Earth: helps perform non-linear regression analysis for complez relationships
ggplot2: to create organized and clean graphs
ggridges: used to visualized data across multiple variables to provide a better analysis

setwd('S:/2025/Data mining/DM SS2025 002/Final Project')


library(tidymodels)
library(tidyverse)
library(vip)
library(kernlab)
library(earth)
library(ggplot2)
library(ggridges)
library(DT)

For our data, I used the retention dataset provided by Regork with the following variables:

retention <- read_csv('customer_retention.csv')

retentionvar <- data.frame(Variable = names(retention), Type = sapply(retention, class))

datatable(retentionvar)

FOR REGORK:

This analysis will help Regork because they will be able to create a well structured strategy to handle customers that are prone to leaving by:

Identify them as early as possible. If they are showing the behaviors our model flags as leaving, Regork can know which customers and flag them.
Understand their motivations for leaving. It could be high costs or dissatisfaction with a service.
Implement a promotion or incentive to keep customer’s satisfaction and increase tenure.

Our proposed solution for Regork is to:

Implement our recommendations for certain services such as Internet and Phone
Use the forecast model with the highest accuracy, to predict behaviors that show customers with higher changes of leaving.
Prepare for the future based on our predictions of customer leaving.

Data Exploration

Data Cleaning

First, we decided to perform some data cleaning to have accurate analysis.

To better analyze the data, we modified the columns that were related to internet service. Since had 3 variables (“yes”, “no”, “no internet service”), we coded the “no internet service” as simply “no” to simplify analysis.

Then, we created a new column just for internet service that contained “yes” and “no” variables. I kept the original “InternetService” column so I could analyze if the type of internet affected our problem.

To help our nonlinear model we factorized our “Status” variable to guide our model to predict the Left status by making it the default level.

retention$Status <- relevel(as.factor(retention$Status), ref="Left")

# check levels
levels(retention$Status)

## [1] "Left"    "Current"

Finally, here is a view of the new table with the specific changes.

lapply(retention[, c("StreamingMovies", "StreamingTV", "OnlineBackup", "MultipleLines", "OnlineSecurity", "DeviceProtection", "TechSupport", "StreamingTV")], function(x) unique(x))

## $StreamingMovies
## [1] "No"  "Yes"
## 
## $StreamingTV
## [1] "No"  "Yes"
## 
## $OnlineBackup
## [1] "Yes" "No" 
## 
## $MultipleLines
## [1] "Yes" "No" 
## 
## $OnlineSecurity
## [1] "No"  "Yes"
## 
## $DeviceProtection
## [1] "No"  "Yes"
## 
## $TechSupport
## [1] "No"  "Yes"
## 
## $StreamingTV
## [1] "No"  "Yes"

Data analysis

To start at our analysis, we wanted to take a first look of how is the current situation at Regork. We noticed that the majority of its customers are still active. This is great news since we know first hand that Regork has a good business model for the moment. Although there’s still a 27% of customers that left Regork.

retention %>%
  count(Status) %>%
  mutate(perc = n / sum(n) * 100) %>%
  ggplot(aes(x = "", y = perc, fill = Status)) +
  geom_bar(stat = "identity") +  
  scale_fill_manual(values = c("Left" = "#FFB3BA", 
                              "Current" = "#FFDFBA",
                              "Other" = "#BAE1FF")) +
  theme_minimal() +
  labs(title = "Customer Status Distribution", y = "Percentage (%)", x = "")

Now we wanted to understand the financial contribution by each group (Current vs Left). We discovered that customers that left, have a higher monthly charge compared to current customers. This definitely showed us a trend that high costs maybe be one of the reasons for cancelling services with Regork. So we decided to dig in deeper.

retention %>%
  ggplot(aes(x = MonthlyCharges, fill = Status)) +
  geom_density(alpha = 0.5) +  
  scale_fill_manual(values = c("Left" = "#FFB3BA",
                              "Current" = "#FFDFBA",
                              "Other" = "#BAE1FF")) +
  theme_minimal() +
  labs(title = "Density of Monthly Charges by Customer Status", x = "Monthly Charges", y = "Density")

Since their 2 main lines of businesses are internet and phone service, we decided to analyze the distribution of each and we found the following: We divided our customer by if they had phone service or not. Then we made a boxplot distribution of their monthly charges by type of internet service.

Mostly our current customers have a HIGHER median monthly charge compared to customer that left independently of the type of service
However, there’s multiple outliers in our customers that left, had phone service and used DSL.
This could represent that customers that were willing to pay for premium services, were not satisfied with this service. This could show that there’s room for improvement for our DSL service.
This could also explain why our previous graph showed that nonactive customers had a higher monthly charge.
We could interpret dissatisfaction with certain preimum services.

retention %>%
  ggplot(aes(x = InternetService, y = MonthlyCharges, fill = Status)) +
  geom_boxplot() + 
  facet_wrap(~ PhoneService) +  
  scale_fill_manual(values = c("Left" = "#FFB3BA",
                              "Current" = "#FFDFBA",
                              "Other" = "#BAE1FF")) +
  theme_minimal() +
  labs(title = "Monthly Charges by Internet Service and Status",
       x = "Internet Service",
       y = "Monthly Charges") +
  theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))

After looking at this finding, we decided to add Tenure to the picture. By analyzing the tenure distribution by internet service only for customers that left. This is the conclusion we drew:

Majority of its customers use fiber optics (however this is across their whole business)
DSL services show low tenure and the majority of its customers have a low tenure.

This represents that there is performance issues with this internet type or the cost is too high for customers’ satisfaction.

retention %>%
  filter(Status == "Left") %>%
  ggplot(aes(x = Tenure, y = InternetService, color = InternetService)) +
  geom_jitter(width = 0.2, alpha = 0.6) +  
  scale_color_manual(values = c("DSL" = "#FFB3BA",
                               "Fiber optic" = "#FFDFBA",
                               "No internet service" = "#BAE1FF")) +  
  theme_minimal() +
  labs(title = "Relationship between Tenure and Type of Internet Service",
       x = "Months with Company",
       y = "Type of Internet Service") +
  theme(legend.position = "none")

Machine Learning

Multiple Linear Regression

MULTIPLE lINEAR REGRESSION

The first model I chose is multiple linear regression due to its simplicity and its widely used by analysts. Here are the steps done to measure the model’s performance

Split dataset to create test and train sets

set.seed(56)
ret_split <- initial_split(retention, prop = 0.7, strata = "Status")
ret_train <- training(ret_split)
ret_test <- testing(ret_split)

Create model and add cross validation to measure models performance with more generalized data

This model shows an AUC of 0.8427. 1 would be a perfect value, but 0.8427 is a good start for our model.

lr_mod <- logistic_reg()

set.seed(56)
kfold <- vfold_cv(ret_train, v=5)

results_lr <- fit_resamples(lr_mod, Status ~ ., kfold, 
                            control =                  control_resamples(save_pred = TRUE))

auc_results <-collect_metrics(results_lr) %>%
  filter(.metric == "roc_auc")

datatable(auc_results)

Assess confusion matrix

Then I want to look if our over is being over positive or over negative. This means if itd having more false negatives or false positives. I will assess this by creating a confusion matrix This model is good at predicting current customers, however, it has false negatives. It means it says customers are going to stay but they actually leave.

results_lr_pred <- collect_predictions(results_lr)

# Create a confusion matrix
conf_matrix <- conf_mat(results_lr_pred, truth = Status, estimate = .pred_class)

# Print the confusion matrix
print(conf_matrix)

##           Truth
## Prediction Left Current
##    Left     707     389
##    Current  592    3202

Fit to the whole dataset and analyze intercepts

final_lr_fit <- lr_mod %>%
  fit(Status ~ ., data = ret_train)

final_lr_fit

## parsnip model object
## 
## 
## Call:  stats::glm(formula = Status ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                          (Intercept)                            GenderMale  
##                           -1.3701141                            -0.0504146  
##                        SeniorCitizen                            PartnerYes  
##                           -0.2283368                             0.0262553  
##                        DependentsYes                                Tenure  
##                            0.1791876                             0.0639146  
##                      PhoneServiceYes                      MultipleLinesYes  
##                           -1.1988873                            -0.6444637  
##           InternetServiceFiber optic                     InternetServiceNo  
##                           -2.4305828                             4.5719948  
##                    OnlineSecurityYes                       OnlineBackupYes  
##                           -0.0189472                            -0.1266872  
##                  DeviceProtectionYes                        TechSupportYes  
##                           -0.2877474                             0.0917098  
##                       StreamingTVYes                    StreamingMoviesYes  
##                           -0.8822981                            -0.8491109  
##                     ContractOne year                      ContractTwo year  
##                            0.6199628                             1.2577589  
##                  PaperlessBillingYes  PaymentMethodCredit card (automatic)  
##                           -0.3238204                             0.0778706  
##        PaymentMethodElectronic check             PaymentMethodMailed check  
##                           -0.2561627                             0.0775908  
##                       MonthlyCharges                          TotalCharges  
##                            0.0680896                            -0.0003614  
##                InternetServiceY_NYes  
##                                   NA  
## 
## Degrees of Freedom: 4889 Total (i.e. Null);  4866 Residual
##   (9 observations deleted due to missingness)
## Null Deviance:       5661 
## Residual Deviance: 4060  AIC: 4108

Since this is our best model, we will perform feature importance and determine which variables are key to determine if a customer is going to stay

 final_lr_fit %>%
  vip(num_features = 20)

We see that Tenure, contracttwoyear, contractoneyear, and totalcharges are most influential varibles

Decision Tree

To have a non linear approach, we decided to use decision trees to give more flexibility to our model. Here are the steps of our analysis.

Initial splitting (Done in the previous model)
Create the model object, recipe, and tuning placeholders.

# create model with tuning placeholders 
dt_model <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

# Create recipe
model_recipe <- recipe(
  Status ~ ., 
  data = ret_train
)

# create grids 
hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n()
)

Create grid to fit resamples
Fit all the possible hyperparameters and show the ones with the highest AUC

tunedresults <- tune_grid(dt_model, model_recipe, resamples = kfold, grid = hyper_grid, control =                  control_resamples(save_pred = TRUE))

# get results
datatable(show_best(tunedresults, metric = "roc_auc", n = 10))

The best hyperparameters (costcomplexity = 1, tree depth = 8, min trees = 21) only give us the an average of 0.81607 but we still need to fit it to our model

Fit model with best parameters and measure metrics

best_model <- select_best(tunedresults, metric = 'roc_auc')

final_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model (dt_model) %>%
  finalize_workflow(best_model)

final_fit <- final_wf %>%
  fit(data = ret_train)

Assess confusion matrix

results_dt_pred <- collect_predictions(tunedresults)

# Create a confusion matrix
conf_matrix <- conf_mat(results_dt_pred, truth = Status, estimate = .pred_class)

# Print the confusion matrix
print(conf_matrix)

##           Truth
## Prediction  Left Current
##    Left     7922    6246
##    Current 27151   90954

Looking at the confusion matrix we can determine the following. This model has a high error in giving false negatives. This means our models predicts they will still be customer but in reality the customer leaves Regork. This portrays that our model is still too optimistic for our purposes.

Random Forests

Lastly, we decided to use one of the most popular forecasting model in the industry. Random Forests gives us the flexibility and eliminates biases from our decision trees. This are the steps we’re following:

Splitting, recipe and kfolds is done from our previous models
Create model, grids for re-sampling and placeholders for tuning

rf_modtune <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

# create grid

hyper_dfgrid <- grid_regular(
  mtry(range= c(2, 15)),
  min_n(range=c(5, 40)),
  trees(range =c(40, 350)),
  levels = 5
        
)

rfresults <- tune_grid(rf_modtune, model_recipe, resamples = kfold, grid = hyper_dfgrid, control =                  control_resamples(save_pred = TRUE))

# show best

datatable(show_best(rfresults, metric = "roc_auc"))

When looking at our model we determine that the parameters that yield the highest AUC are: 0.

Fit our best parameters and assess confusion matrix

best_hyperparameters <- select_best(rfresults, metric = "roc_auc")

results_rf_pred <- collect_predictions(rfresults)

# Create a confusion matrix
conf_matrix <- conf_mat(results_rf_pred, truth = Status, estimate = .pred_class)

# Print the confusion matrix
print(conf_matrix)

##           Truth
## Prediction   Left Current
##    Left     78897   43601
##    Current  83478  406399

After assessing our model we see that just like the other models, it does a good job at predicting when customers will stay current. Compared to other models, it’s more stabilized in its false negatives (predicted current but actual left) and false positives (predicted left but actual current) However, is still has a lower AUC compared to multiple linear regression

The graphs shows us that Tenure is the feature that determines customer status. Then it’s followed by TotalCharges, Contract, MonthlyCharges, and OnlineSecurity. This aligns with one of our hypothesis that Monthly charges was influential in determining of customer will leave or stay. However, this model denies our hypothesis that the type of internet is important in customer behavior.

Optimal Model

After analyzing all the model we decided that MULTIPLE LINEAR REGRESSION has the highest AUC and more stable confusion matrix. Here is a plot of the auc and confusion matrix for reference

results_lr_pred <- collect_predictions(results_lr)

# 2. Create the ROC curve
results_lr_pred %>%
  roc_curve(truth = Status, .pred_Left) %>%
  autoplot()

MODEL FITTING

Since we determined that this is the best model, we will test our model

final_mr_model <- lr_mod
#final fit
final_mr_fit <- final_mr_model %>%
  fit(Status ~ ., data = ret_train)

mr_predictions_class <- predict(final_mr_fit, ret_test, type = "class")

# Add the true labels to the predictions
mr_predictions_class <- bind_cols(mr_predictions_class, truth = ret_test$Status)

# Compute accuracy using the predicted class labels and true labels
accuracy_value <- accuracy(mr_predictions_class, truth = truth, estimate = .pred_class)

# Print the accuracy value
print(accuracy_value)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.816

# Predict probabilities (for AUC)
mr_predictions_prob <- predict(final_mr_fit, ret_test, type = "prob")

mr_predictions_prob <- bind_cols(mr_predictions_prob, truth = ret_test$Status)
# Compute AUC using the predicted probabilities for the "Left" class
roc_auc_value <- roc_auc(mr_predictions_prob, truth = truth, .pred_Left)

# Print the AUC value
print(roc_auc_value)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.849

print(mr_predictions_prob)

## # A tibble: 2,100 × 3
##    .pred_Left .pred_Current truth  
##         <dbl>         <dbl> <fct>  
##  1     0.648          0.352 Current
##  2     0.322          0.678 Left   
##  3     0.0104         0.990 Current
##  4     0.442          0.558 Current
##  5     0.0133         0.987 Current
##  6     0.325          0.675 Left   
##  7     0.0300         0.970 Current
##  8     0.0281         0.972 Current
##  9     0.212          0.788 Current
## 10     0.430          0.570 Current
## # ℹ 2,090 more rows

conf_matrixfinal <- conf_mat(mr_predictions_class, truth = truth, estimate = .pred_class)

# Print the confusion matrix
print(conf_matrixfinal)

##           Truth
## Prediction Left Current
##    Left     310     140
##    Current  247    1401

Our generalization error showed similar values between the first model and validating it with unseen data. We saw a 0.84 of area under the curve (AUC). Regork should expect an accuracy of around 81%, which is a very good number is the industry. And if we include the interpretation from our confusion matrix, we determine that most of our models error come from false negatives.

After looking at this observations I learned that for such a complex business problem it was very appropriate using a model that worked with this variables. A linear approach seemed to be the correct path and it will help Regork predict its customers

If we look at the feature importance shows that customers with low tenure need more attention that the ones with high tenure. This also shows that to increase retention we need to make customers get involved with more services with Regork. The higher their financial contribution, the higher motivation they have to stay with us. Finally, it shows that the length of the contract is important as well.

Business Analysis

Factors to focus on

Factor to focus on of a certain customer are:

The length of its tenure: Customer with lower tenure have to be taken care more compared to the ones with higher months as Regork customers
Total Charges: Regork has to incentivize its customers to engage in more premium services (online security, multiple lines etc)
Contract Length: Regork has to motivate its customers to engage in longer contracts (1 year - 2 yearS)
Paperless Billings: customers engaging in this service have higher chances of leaving. Regork should puts its focusing in improving this service

Potential Loss

Using our latest model we will estimate the revenue loss from the customers that we predict will leave.

# Add predicted probabilities to the test data (ret_test)
ret_test_pred <- bind_cols(ret_test, mr_predictions_prob)

# Create a new column that indicates whether the customer is predicted to leave
ret_test_pred <- ret_test_pred %>%
  mutate(predicted_status = if_else(.pred_Left > 0.5, "Left", "Current"))

# Filter customers predicted to leave
predicted_left_customers <- ret_test_pred %>%
  filter(predicted_status == "Left")

# Estimate the total revenue lost from customers predicted to leave
revenue_lost_predicted <- predicted_left_customers %>%
  summarise(
    total_revenue_lost = sum(TotalCharges)
  )


print(revenue_lost_predicted)

## # A tibble: 1 × 1
##   total_revenue_lost
##                <dbl>
## 1            409884.

we would expect a total loss of $409,988.20 if Regork doesn’t develop a strategy to increase these customers loyalty

Incentive Scheme

As mentioned before Regork has to focus in increasing the engagement of the customer with the company. They need to take advantage of the multiple extra services offered: online backup, paperless billing, online support. And also, Regork needs to make sure that these are high quality services that leave the customer with a smile on their face.

Some incentives are:

Lower rates for customer with longer contracts
Special discounts and promotions for new customers
Better bundles for phone and internet service
More premium bundles that include all the extra services at a fixed cost
Recognition to old customer with high tenure

All of these will target the main variables that determine if a customer will leave or stay.

Proposal to Regork Telecom

The implications of my analysis show that Regork’s customers are willing to pay for a good service, and it shows that their reasons for leaving are mainly low engagement, low perception of benefits, bad experience with a service.

My proposal to Regork is to use the results from this analysis and start implementing all the incentives from the points above. Additionally, Regork should also work on improving this premium services to increase satisfaction of its customers. This will make customers realize that is valuable what Regork had to offer. Finally, I recommend Regork to deploy this method quarterly to track its results and find areas of growth. Since customers behaviors fluctuate, it important for Regork to be updated and accommodate as necessary

Limitations

Although we were able to asses multiple features, there are some limitations to our analysis. We got aggregated costs, so we were not able to analyze how much was each customer paying for each service. This could have helped us in developing a pricing strategy.

Additionally, we couldn’t assess the customers experince with us. A variables such as customer rating (1 - 5) could ahve helped us in determine how their satisfaction with the service impacted their decision to stay.

Regork Customer Retention Strategy

2025-04-10