Required Libraries:
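
The code in this report appears to rely on the tidyverse and tidymodels ecosystems, plus the earth and vip packages; a minimal setup, assuming these are the packages behind the calls below:

library(tidyverse)   # dplyr, ggplot2, etc. for data wrangling and plots
library(tidymodels)  # rsample, parsnip, recipes, workflows, tune, yardstick
library(earth)       # engine behind parsnip's mars() specification
library(vip)         # variable importance plots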

Introduction:

As the economic climate and customer demands change over time, understanding consumer behavior becomes essential to maximizing retention and improving customer relationships. Using the dataset provided, we analyzed the factors that drive customers to stay with Regork and offer recommendations to keep unsatisfied customers from leaving.

Telecommunications is a highly competitive sector, so it is crucial that Regork's retention decisions are data-backed. This report provides a detailed analysis and a model that predicts whether a customer will leave Regork at some point in the future, giving the company the most comprehensive information possible for making informed decisions about keeping its customers.

Data Preparation & Exploratory Data Analysis

The first analysis we performed examines which household types make up Regork's customer base. The graph below shows average customer tenure by partner status and dependent status. Customers with a partner tend to stay longer, while having dependents makes less of a difference. Marketing aimed at retaining single customers would therefore be one strategy for improving retention duration.
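
The aggregated data frame behind this graph is not shown; a minimal sketch of how it could be built, assuming the raw data frame df has Partner, Dependents, and Tenure columns (the column names are assumptions):

# Assumed preprocessing: average tenure per partner/dependents combination
tenure_by_partner_dependents <- df %>%
  mutate(Partner_Dependents = paste0("Partner: ", Partner, " / Dependents: ", Dependents)) %>%
  group_by(Partner_Dependents) %>%
  summarise(AvgTenure = mean(Tenure, na.rm = TRUE), .groups = "drop")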

Graph code:

# Bar chart of average tenure for each partner/dependents combination
ggplot(tenure_by_partner_dependents, aes(x = Partner_Dependents, y = AvgTenure, fill = Partner_Dependents)) +
  geom_col(alpha = 0.8) +
  labs(
    title = "Average Tenure by Partner & Dependents Status",
    x = "Partner & Dependents Status",
    y = "Average Tenure (Months)",
    fill = "Partner & Dependents Status"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank())

Next, we examined how long customers stay with Regork before leaving. Customers who left did so after roughly 18 months on average, although the median tenure at churn was only 10 months, so departures are concentrated early in the relationship. Retention efforts would therefore be most valuable early in a customer's tenure, since longer-tenured customers tend to stay.
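
The histogram and the summary table below use only the customers who churned; a sketch of that subset and of code that would produce the summary table, assuming churned customers are labeled Status == "Left" (the label is an assumption):

# Assumed filter: keep only customers who have left
df_left <- df %>% filter(Status == "Left")

# Tenure summary statistics for churned customers (produces the table shown below)
df_left %>%
  summarise(Mean_Tenure   = mean(Tenure),
            Median_Tenure = median(Tenure),
            Min_Tenure    = min(Tenure),
            Max_Tenure    = max(Tenure))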

Graph code:

# Histogram of tenure among customers who have left
ggplot(df_left, aes(x = Tenure)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Tenure for Customers Who Left",
    x = "Tenure (Months)",
    y = "Frequency"
  ) +
  theme_minimal()

##   Mean_Tenure Median_Tenure Min_Tenure Max_Tenure
## 1    18.01293            10          1         72

The last factor we examined was which contract types customers were most likely to leave. Our analysis revealed that month-to-month customers churned most often, so finding ways to retain month-to-month subscribers would be the highest-impact intervention.
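
A sketch of the counts being plotted, assuming the df_left subset from the previous step:

# Assumed aggregation: churned customers per contract type
left_contract_counts <- df_left %>% count(Contract)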

# Bar chart of churn counts by contract type
ggplot(left_contract_counts, aes(x = Contract, y = n, fill = Contract)) +
  geom_col(color = "black") +
  labs(
    title = "Number of Customers Who Left by Contract Type",
    x = "Contract Type",
    y = "Number of Customers",
    fill = "Contract Type"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Machine Learning

For our first machine learning model, we decided to perform a logistic regression:

# Clean the data: encode the outcome and contract type as factors, drop missing rows
df <- df %>%
  mutate(Status = factor(Status),
         Contract = factor(Contract, levels = c("Month-to-month", "One year", "Two year"))) %>%
  na.omit()

# 70/30 train/test split, stratified on the outcome
set.seed(123)
regression_split <- initial_split(df, prop = 0.7, strata = Status)
regression_train <- training(regression_split)
regression_test <- testing(regression_split)

# 5-fold cross-validation on the training set
set.seed(123)
regression_kfolds <- vfold_cv(regression_train, v = 5, strata = Status)

# Fit a logistic regression across the folds and collect the resampled metrics
regression_results <- logistic_reg() %>% 
  fit_resamples(Status ~ ., regression_kfolds) %>%
  collect_metrics()

print(regression_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.799     5 0.00357 Preprocessor1_Model1
## 2 brier_class binary     0.136     5 0.00186 Preprocessor1_Model1
## 3 roc_auc     binary     0.845     5 0.00514 Preprocessor1_Model1
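
The 30% test split created above is never scored in this section; a minimal sketch of how it could be evaluated, assuming the churn level of Status is labeled "Left" (an assumption):

# Fit the logistic model on the full training set
log_fit <- logistic_reg() %>% fit(Status ~ ., data = regression_train)

# Score the held-out test set
log_preds <- regression_test %>%
  select(Status) %>%
  bind_cols(predict(log_fit, regression_test)) %>%
  bind_cols(predict(log_fit, regression_test, type = "prob"))

accuracy(log_preds, truth = Status, estimate = .pred_class)
roc_auc(log_preds, truth = Status, .pred_Left)  # ".pred_Left" assumes "Left" is the event (first) level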

Next, we fit a multivariate adaptive regression splines (MARS) model:

set.seed(123)

# Splitting data
multivariate_split <- initial_split(df, prop = 0.7, strata = Status)
multivariate_train <- training(multivariate_split)
multivariate_test <- testing(multivariate_split)

# Recipe for modeling
multivariate_recipe <- recipe(Status ~ ., data = multivariate_train)

# Cross-validation
set.seed(123)
multivariate_kfolds <- vfold_cv(multivariate_train, v = 5, strata = Status)

# MARS model specification
multivariate_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")

# Hyperparameter grid: num_terms 1-20 crossed with prod_degree 1-2
# (levels = 50 is capped at the 20 unique integer values of num_terms,
# so the grid holds 20 x 2 = 40 candidates, matching the 40 rows below)
multivariate_grid <- grid_regular(num_terms(range = c(1, 20)), prod_degree(), levels = 50)

# Workflow setup
multivariate_wf <- workflow() %>%
  add_recipe(multivariate_recipe) %>%
  add_model(multivariate_mod)

# Tuning the MARS model across the grid with 5-fold CV
multivariate_results <- multivariate_wf %>%
  tune_grid(resamples = multivariate_kfolds, grid = multivariate_grid)

# Collecting metrics and filtering for roc_auc
metrics <- multivariate_results %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  arrange(desc(mean))

# Viewing the metrics
print(metrics)
## # A tibble: 40 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        20           1 roc_auc binary     0.850     5 0.00486 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  3        18           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  4        17           1 roc_auc binary     0.849     5 0.00503 Preprocessor1_M…
##  5        16           1 roc_auc binary     0.849     5 0.00490 Preprocessor1_M…
##  6        15           1 roc_auc binary     0.847     5 0.00486 Preprocessor1_M…
##  7        14           1 roc_auc binary     0.847     5 0.00455 Preprocessor1_M…
##  8        13           1 roc_auc binary     0.845     5 0.00493 Preprocessor1_M…
##  9        12           1 roc_auc binary     0.844     5 0.00465 Preprocessor1_M…
## 10        11           1 roc_auc binary     0.843     5 0.00493 Preprocessor1_M…
## # ℹ 30 more rows

# Plotting the tuning results
autoplot(multivariate_results)

# Selecting the best model based on roc_auc
multivariate_best <- select_best(multivariate_results, metric = "roc_auc")

# Finalizing the tuned recipe-based workflow with the best hyperparameters
multivariate_final_wf <- multivariate_wf %>% 
  finalize_workflow(multivariate_best)

# Fitting the final workflow to training data
multivariate_fit <- multivariate_final_wf %>% 
  fit(data = multivariate_train)

# Extracting and plotting feature importance
multivariate_fit %>% 
  extract_fit_parsnip() %>% 
  vip()
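
The held-out multivariate_test split is never scored above; tidymodels' last_fit() offers a one-step way to do so, a sketch assuming the finalized workflow from the previous chunk:

# Refit on the training data and evaluate on the held-out split in one step
multivariate_last <- last_fit(multivariate_final_wf, multivariate_split)
collect_metrics(multivariate_last)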