BANA 4080 - FINAL PROJECT

Regork Telecommunication Analysis

Introduction

Problem Statement

We are collaborating with a company called Regork, which has recently launched telecommunication products such as internet services, streaming platforms, and phone services. The CEO of Regork has assigned us the task of creating models to predict potential customer churn. Our focus is on analyzing customer retention by considering a variety of relevant variables.

Solution Overview

We tackled these challenges by performing a more in-depth analysis of the data set. Our goal was to identify trends among customers who continued using the service versus those who decided to leave.
Our primary focus turned to internet service, given its widespread usage. We observed that a significant portion of those who discontinued the service were on fiber optics.
Next, we examined the lengths of customers’ contracts. We believe that promoting a reduced price for longer commitment periods could encourage more customers to remain with us.
Finally, we analyzed customer tenure, noting a clear trend in the number of individuals who stayed compared to those who left. We think that providing enhanced offers for those with longer tenure could be a crucial incentive for retention.

Proposed Solutions

The core of our solution will involve creating enticing offers to encourage individuals to stay with Regork.
We will provide package deals that combine specific products, such as offering access to multiple streaming services at a reduced bundle price.

Required Packages

These are the packages required to conduct our analysis.

library(ggplot2)
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(kernlab)
library(rpart.plot)
library(pdp)
library(ranger)
library(readr)

Exploratory Data Analysis

We analyzed customer retention data to find key trends and patterns. We also created a variable summary to highlight the main factors. Below is the structure of our dataset:

cr <- read_csv("data/customer_retention.csv")
variables <- ls(cr)

cr <- mutate(cr, Status = factor(Status)) %>%
  na.omit()

The first trend to figure out is which customers decided to stay or leave, based off of which internet service that they had.

cr %>%
  ggplot(aes(x = InternetService)) +
  geom_bar(fill = 'skyblue', size = 1.5) + facet_wrap(~ Status) + 
  ylab('# of Consumers') +
  xlab('Mode of Service') +
  ggtitle('Customer Retention with Different Internet Services')

The next trend we wanted to see was was the length our consumer base stayed with the company using our service. The graph shows that, with time, our customers gain loyalty and are much less likely to leave.

cr %>%
  ggplot(aes(Tenure)) +
  geom_histogram(fill = 'aquamarine', size = .3, color = "black") +
  facet_wrap(~ Status) + 
  ylab('# of Consumers') +
  xlab('Time in Months') +
  ggtitle('Consumer Retention with Tenure')

The final trend that we decided to focus on was the effects of contracts on customer retention. Through these graphs, we found that the most customers left when they had month-to-month contracts. The least amount of customers left with 2 year contracts.

cr %>% 
  ggplot(aes(Contract)) +
  geom_bar(width = .8, fill = 'pink') +
  facet_wrap(~ Status) +
  ylab('# of Consumers') +
  xlab('Length of Contract') +
  ggtitle('Consumer Retention relating to Contracts')

Predictive Analysis & Machine Learning

Data Setup

First, we have established test and training sets that will be used for our upcoming prediction models:

set.seed(123)
cr_split <- initial_split(cr, prop = 0.7, strata = Status)
cr_train <- training(cr_split)
cr_test <- testing(cr_split)


cr_recipe <- recipe(Status ~ ., data = cr_train)

Logistic Regression Model

In our initial approach, we opted to implement a simple logistic regression model, which yielded an AUC of 0.844—an encouraging outcome. Nonetheless, I believe there is scope for enhancement through the application of more sophisticated models.

set.seed(123)
kfold <- vfold_cv(cr_train, v = 5)

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)

Results

collect_metrics(results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.797     5 0.00705 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00374 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00974 Preprocessor1_Model1

Random Forest Model

For our subsequent model, we opted to employ a random forest approach. We believed that implementing hyperparameter tuning would enhance the precision of our results. Ultimately, we achieved an AUC of 0.835, which is slightly lower in accuracy compared to our logistic regression model.

set.seed(123)
kfold <- vfold_cv(cr_train, v = 5)

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., kfold)

Results

collect_metrics(results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.797     5 0.00705 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00374 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00974 Preprocessor1_Model1

Decision Tree

In accordance with our third model’s findings, the inaugural row yields the most favorable outcomes, achieving an AUC of 0.814. However, this performance falls short of that exhibited by both the logistic regression and random forest models, positioning this particular model as the least optimal among the three.

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")


dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

set.seed(123)
dt_results <- tune_grid(dt_mod, cr_recipe, resamples = kfold, grid = dt_hyper_grid)

Results

print(dt_results)

## # Tuning results
## # 5-fold cross-validation 
## # A tibble: 5 × 4
##   splits             id    .metrics           .notes          
##   <list>             <chr> <list>             <list>          
## 1 <split [3912/979]> Fold1 <tibble [375 × 7]> <tibble [0 × 3]>
## 2 <split [3913/978]> Fold2 <tibble [375 × 7]> <tibble [0 × 3]>
## 3 <split [3913/978]> Fold3 <tibble [375 × 7]> <tibble [0 × 3]>
## 4 <split [3913/978]> Fold4 <tibble [375 × 7]> <tibble [0 × 3]>
## 5 <split [3913/978]> Fold5 <tibble [375 × 7]> <tibble [0 × 3]>

Optimal Model

Our Optimal Model ultimately emerged as the logistic regression model, which achieved the highest AUC of 0.844. Armed with this insight, we can now strategically implement offers and refine pricing structures to enhance customer retention.

Feature Importance

The following represents the five most significant variables within our model, accompanied by the accuracy derived from our confusion matrix.

final_fit <- logistic_reg() %>%
fit(Status ~ ., data = cr_test)

tidy(final_fit)

## # A tibble: 31 × 5
##    term                          estimate std.error statistic     p.value
##    <chr>                            <dbl>     <dbl>     <dbl>       <dbl>
##  1 (Intercept)                    2.17       1.55      1.40    0.160     
##  2 GenderMale                    -0.00317    0.120    -0.0264  0.979     
##  3 SeniorCitizen                  0.101      0.157     0.643   0.520     
##  4 PartnerYes                     0.328      0.149     2.20    0.0275    
##  5 DependentsYes                 -0.382      0.166    -2.30    0.0215    
##  6 Tenure                        -0.0511     0.0110   -4.64    0.00000351
##  7 PhoneServiceYes                1.07       1.22      0.877   0.381     
##  8 MultipleLinesNo phone service NA         NA        NA      NA         
##  9 MultipleLinesYes               0.551      0.336     1.64    0.101     
## 10 InternetServiceFiber optic     2.93       1.51      1.94    0.0521    
## # ℹ 21 more rows

final_fit %>%
predict(cr_test) %>%
bind_cols(cr_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1386  243
##    Left        154  314

The confusion matrix revealed 154 false positives and 243 false negatives. The occurrence of false negatives indicates a higher number of customers who departed than what our model had initially projected.

The five most influential variables identified were tenure, the presence of two-year contracts, the utilization of paperless billing, familial status, and individuals with spouses.

Conclusion and Recommendation

Based on our exploratory analysis and the insights gained from our machine learning models, we recommend sticking with our original proposed solution of offering package deals with specific products for a lower price.

Internet Service

Through our analysis, we found that fiber-optic customers are much more likely to churn compared to DSL customers or those without internet service. To combat this, we recommend offering a deal that targets those who prefer fiber-optic. For example, we could offer a year long contract that includes 1-3 months of an additional service, such as a streaming subscription or membership to a location of their choosing. This could increase customer retention while also increasing value.

Contract Length

Our data has shown that customers with longer contracts are less likely to turnover. When plans are month-to-month, our customers are much more likely to churn, whereas our customers with 2-year contracts remained with us. We are recommending making our long term contracts mpre appealing to customers, this could be achieved by reducing the costs for longer deals or bundling temporary upgrades like a 3 or 6-month higher internet speed. That may be enough for some customers to bite on a 1 or 2-year contract.

Consumer Tenure

Customer retention improves greatly as customers spend more time with our company. As our recommendation is to encourage our customers to take longer contracts, we can assist with customer tenure by adjusting our offers to look more appealing. After the first contract, our customers are much more likely to become loyal to our brand.

BANA 4080 - FINAL PROJECT

GROUP 21: Aidan Schmidt, Tobin Smith

2025-04-27

Regork Telecommunication Analysis

Introduction

Problem Statement

Solution Overview

Proposed Solutions

Required Packages

Exploratory Data Analysis

We analyzed customer retention data to find key trends and patterns. We also created a variable summary to highlight the main factors. Below is the structure of our dataset:

The first trend to figure out is which customers decided to stay or leave, based off of which internet service that they had.

The next trend we wanted to see was was the length our consumer base stayed with the company using our service. The graph shows that, with time, our customers gain loyalty and are much less likely to leave.

The final trend that we decided to focus on was the effects of contracts on customer retention. Through these graphs, we found that the most customers left when they had month-to-month contracts. The least amount of customers left with 2 year contracts.

Predictive Analysis & Machine Learning

Data Setup

Logistic Regression Model

Results

Random Forest Model

Results

Decision Tree

Results

Optimal Model

Feature Importance

Conclusion and Recommendation

Internet Service

Contract Length

Consumer Tenure