Introduction

As Regork expands into the telecommunications sector, retaining existing customers has become a crucial focus, given the significantly higher costs associated with acquiring new customers. To support these efforts, we conducted an analysis of customer data to identify patterns that could help predict customer churn and guide retention strategies.

Our findings highlight several important trends.

Data Preparation & Exploratory Data Analysis

Packages Required

To successfully replicate the results presented in this report, it is necessary to install and load the following packages:

library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
library(vip)
## Warning: package 'vip' was built under R version 4.4.3
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
library(pdp)
## Warning: package 'pdp' was built under R version 4.4.3
library(kernlab)
library(baguette)
## Warning: package 'baguette' was built under R version 4.4.3
library(dplyr)
library(ggplot2)

Data Preperation

Data Import:

The dataset customer_retention.csv is read into R from a specified file path using read_csv(). This function loads the data into a tibble (a modern version of a data frame) called retention.

Data Type Transformation:

The Status variable within the retention dataset is converted into a factor using mutate(). Factors are used in R to represent categorical variables, making them suitable for statistical modeling and plotting.

There were 6988 observations for the Status variable, 1856 (26.6%) left Regork’s program while 5132 (73.4%) stayed.

Missing Value Handling:

The na.omit() function is applied to the retention dataset to remove any rows containing missing (NA) values.

After this step, the dataset contains only complete cases with no missing data.

#1. Data Import
retention <- read_csv("~/Data Mining ⚒/customer_retention.csv")

#2. Data Type Transformation 
retention <- mutate(retention, Status = factor(Status))

#3. Missing Values Handling
retention <- na.omit(retention)
retention %>%
  count(Status)
## # A tibble: 2 × 2
##   Status      n
##   <fct>   <int>
## 1 Current  5132
## 2 Left     1856

Exploratory Data Analysis

Graph_1

retention %>%
  group_by(SeniorCitizen, TechSupport) %>%
  summarize(count = n()) %>%
  group_by(SeniorCitizen) %>%
  mutate(prop = count / sum(count)) %>%
  ggplot(aes(x = TechSupport, y = prop, fill = SeniorCitizen)) +
  geom_col(position = "dodge") +
  labs(title = "Proportion of Senior Citizens with and without Tech Support",
       x = "Tech Support",
       y = "Proportion",
       fill = "Senior Citizen") +
  theme_minimal()

## Technology Support and Senior Citizens: Graph_1

A minority of customers sign up for tech support (approximately 30% overall), but among those who do sign up for tech support, the majority are Senior citizens. Importantly, amomg the participating customers very few senior citizens lacked internet service, indicating that once they are connected, they tend to maintain their internet access consistently. Regork could improve this by creating targeted strategies to encourage tech support adoption among seniors, emphasizing simplicity and reliability.

Graph_2

ggplot(retention, aes(x = OnlineSecurity, y = Tenure, fill = OnlineSecurity)) +
  geom_boxplot() +
  labs(title = "Tenure by Online Security",
       x = "Online Security",
       y = "Tenure (months)") +
  theme_minimal()

## Online Security Services and Customer Tenure: Graph_2

Our analysis, supported by a boxplot visualization, shows a strong relationship between customer tenure and the use of online security services. Customers who have been with Regork for a longer period are more likely to subscribe to the company’s online security offerings. This trend suggests that loyalty and time with Regork contribute to greater adoption of additional services. Newer customers tend to be less engaged with these offerings, indicating a potential opportunity for targeted marketing early in the customer lifecycle. Encouraging new users to adopt online security services sooner could help strengthen long-term customer relationships and satisfaction.

Graph_3

ggplot(retention, aes(x = Partner, fill = Dependents)) +
  geom_bar(position = "dodge") +
  labs(title = "Dependents by Partner Status",
       x = "Partners",
       y = "Count",
       fill = "Has Dependents") +
  theme_minimal()

## Partners and Dependents: Graph_3

We observed a clear relationship between having a partner and having dependents. Customers without partners are much more likely to not have dependents, suggesting a strong link between relationship status and family responsibilities. Among customers with partners, slightly more have dependents than those who do not, although the difference is relatively small. Understanding these family-related factors can help Regork tailor its service offerings and retention strategies to better meet the needs of different customer groups.

Machine Learning

Logisitc Regression

set.seed(123)
logistic_split <- initial_split(retention, prop = 0.7, strata = Status)
logistic_train <- training(logistic_split)
logisitc_test <- testing(logistic_split)
set.seed(123)
logistic_kfold <- vfold_cv(logistic_train, v = 5)

results <- logistic_reg() %>%
  fit_resamples(Status ~ ., logistic_kfold)
  collect_metrics(results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.797     5 0.00705 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00374 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00974 Preprocessor1_Model1
final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = logistic_train)

tidy(final_fit)
## # A tibble: 31 × 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.02     0.969      1.05    2.95e- 1
##  2 GenderMale                     -0.0461   0.0779    -0.592   5.54e- 1
##  3 SeniorCitizen                   0.258    0.101      2.55    1.08e- 2
##  4 PartnerYes                     -0.141    0.0925    -1.52    1.29e- 1
##  5 DependentsYes                  -0.0475   0.108     -0.441   6.59e- 1
##  6 Tenure                         -0.0646   0.00762   -8.47    2.35e-17
##  7 PhoneServiceYes                -0.0417   0.774     -0.0539  9.57e- 1
##  8 MultipleLinesNo phone service  NA       NA         NA      NA       
##  9 MultipleLinesYes                0.442    0.211      2.10    3.56e- 2
## 10 InternetServiceFiber optic      1.48     0.950      1.56    1.18e- 1
## # ℹ 21 more rows
exp(coef(final_fit$fit))
##                          (Intercept)                           GenderMale 
##                            2.7593894                            0.9549166 
##                        SeniorCitizen                           PartnerYes 
##                            1.2949103                            0.8688686 
##                        DependentsYes                               Tenure 
##                            0.9536487                            0.9374419 
##                      PhoneServiceYes        MultipleLinesNo phone service 
##                            0.9591253                                   NA 
##                     MultipleLinesYes           InternetServiceFiber optic 
##                            1.5562922                            4.4143178 
##                    InternetServiceNo    OnlineSecurityNo internet service 
##                            0.2214383                                   NA 
##                    OnlineSecurityYes      OnlineBackupNo internet service 
##                            0.7591602                                   NA 
##                      OnlineBackupYes  DeviceProtectionNo internet service 
##                            0.9028811                                   NA 
##                  DeviceProtectionYes       TechSupportNo internet service 
##                            1.1801326                                   NA 
##                       TechSupportYes       StreamingTVNo internet service 
##                            0.8583097                                   NA 
##                       StreamingTVYes   StreamingMoviesNo internet service 
##                            1.6073481                                   NA 
##                   StreamingMoviesYes                     ContractOne year 
##                            1.7542279                            0.4452648 
##                     ContractTwo year                  PaperlessBillingYes 
##                            0.2497494                            1.3698551 
## PaymentMethodCredit card (automatic)        PaymentMethodElectronic check 
##                            1.0223826                            1.3813743 
##            PaymentMethodMailed check                       MonthlyCharges 
##                            0.9410159                            0.9689548 
##                         TotalCharges 
##                            1.0003906

Significant variables that were predictive of retention in the logistic regression were: Longer Tenure (OR=0.94, CI: 0.92 - 0.95); Contract OneYear (OR=0.45, CI: 0.343-0.58); and Contract TwoYear (OR=0.25, CI: 0.17-0.37).

Factors that significantly contributed to leaving were: PaperlessBillingYes (OR=1.4, CI: 1.15=1.63); Payment with Electronic check (OR=1.38, CI: 1.11-1.73); having multiple lines (OR=1.56, CI: 1.03 - 2.35); and being a Senior Citizen (OR=1.295, CI 1.1-1.58). Lastly, while having fiberoptic cable had an OR of 4.14, the confidence interval for this variable was quite wide (0.68 - 28.5) indicating it was not significant and highly variable.

MARS

set.seed(123)
mars_split <- initial_split(retention, prop = 0.7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
retention_recipe <- recipe(Status ~ ., data = mars_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

set.seed(123)
mars_kfold <- vfold_cv(mars_train, v = 5, strata = Status)

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")

mars_grid <- grid_regular(num_terms(c(1, 30)), prod_degree(), levels = 25)

retention_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(mars_mod)

tuning_results <- retention_wf %>%
  tune_grid(resamples = mars_kfold, grid = mars_grid)
## Warning: package 'earth' was built under R version 4.4.3
## Warning: package 'plotmo' was built under R version 4.4.3
tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 50 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        15           1 roc_auc binary     0.850     5 0.00509 Preprocessor1_M…
##  2        16           1 roc_auc binary     0.849     5 0.00502 Preprocessor1_M…
##  3        19           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  4        20           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  5        21           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  6        22           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  7        23           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  8        25           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  9        26           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
## 10        27           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
## # ℹ 40 more rows
autoplot(tuning_results)

best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")

final_wf <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(mars_mod) %>%
  finalize_workflow(best_hyperparameters)

final_fit <- final_wf %>%
  fit(data = mars_train)

final_fit %>%
  extract_fit_parsnip() %>%
  vip()

Tenure is the most important feature in the model, meaning it has the greatest influence on predicting customer retention. This aligns with the common assumption that customers who have been with Regork for a longer period are less likely to churn in the future. Longer tenure often correlates with customer loyalty and satisfaction, as established relationships and familiarity with services typically reduce the likelihood of departure. Therefore, focusing on retaining long-term customers and nurturing their relationship with Regork could be a key strategy to reduce churn.

Decision Tree

set.seed(123)
decision_split <- initial_split(retention, prop = 0.7, strata = "Status")
decision_train <- training(decision_split)
decision_test <- testing(decision_split)
dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

model_recipe <- recipe(Status ~ ., data = decision_train)

dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = decision_train)

set.seed(123)
kfold <- vfold_cv(decision_train, v = 5)

dt_results <- fit_resamples(dt_mod, model_recipe, kfold)

collect_metrics(dt_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.785     5 0.00552 Preprocessor1_Model1
## 2 brier_class binary     0.147     5 0.00263 Preprocessor1_Model1
## 3 roc_auc     binary     0.803     5 0.00777 Preprocessor1_Model1
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)

show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15    40 roc_auc binary     0.814     5 0.00780
## 2    0.0000000178         15    40 roc_auc binary     0.814     5 0.00780
## 3    0.00000316           15    40 roc_auc binary     0.814     5 0.00780
## 4    0.0000000001         11    40 roc_auc binary     0.814     5 0.00827
## 5    0.0000000178         11    40 roc_auc binary     0.814     5 0.00827
## # ℹ 1 more variable: .config <chr>
dt_best_model <- select_best(dt_results, metric = 'roc_auc')

dt_final_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  finalize_workflow(dt_best_model)

dt_final_fit <- dt_final_wf %>%
  fit(data = decision_train)

dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(20)

The decision tree results also highlight contract and tenure as key factors influencing whether a customer will stay or leave. These two features consistently emerge as the most important predictors in every machine learning model trained and tested on this dataset. As discussed in our business analysis, both contract length and customer tenure are strong indicators of retention, emphasizing their critical role in understanding customer behavior and guiding retention strategies.

Business Analysis and Conclusion

Our analysis of customer retention, utilizing Logistic Regression, MARS, and Decision Trees, provides valuable insights into the factors influencing customer behavior. Contract length and tenure consistently emerge as the most significant predictors of retention. Customers with longer contracts, particularly those on 2-year contracts, are less likely to churn, reflecting the value of long-term commitments. This suggests that Regork can improve retention by offering loyalty incentives to long-term customers.

Our analysis also reveals that senior citizens are slightly more likely to leave and are less likely to have tech support, with a greater proportion of seniors lacking this service compared to non-senior customers. (Interestingly, seniors who do not have internet service are in the minority, suggesting that once connected, they remain consistent in maintaining service.) This insight indicates that providing dedicated tech support for senior customers could help reduce churn in this segment by offering additional reassurance and assistance.

Finally, when examining partners and dependents, customers without partners are more likely to have no dependents, while those with partners are more likely to have dependents. This suggests that family-related factors influence customer behavior and service needs. Understanding these dynamics can help Regork tailor retention strategies, such as offering family-focused plans or discounts.

In conclusion, Regork can improve retention by focusing on long-term customer relationships, providing tailored tech support for senior citizens, and addressing the family-oriented needs of customers.