As Regork expands into the telecommunications sector, retaining existing customers has become a crucial focus, given the significantly higher costs associated with acquiring new customers. To support these efforts, we conducted an analysis of customer data to identify patterns that could help predict customer churn and guide retention strategies.
Our findings highlight several important trends.
To successfully replicate the results presented in this report, it is necessary to install and load the following packages:
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
library(vip)
## Warning: package 'vip' was built under R version 4.4.3
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
library(pdp)
## Warning: package 'pdp' was built under R version 4.4.3
library(kernlab)
library(baguette)
## Warning: package 'baguette' was built under R version 4.4.3
library(dplyr)
library(ggplot2)
Data Import:
The dataset customer_retention.csv is read into R from a specified file path using read_csv(). This function loads the data into a tibble (a modern version of a data frame) called retention.
Data Type Transformation:
The Status variable within the retention dataset is converted into a factor using mutate(). Factors are used in R to represent categorical variables, making them suitable for statistical modeling and plotting.
There were 6988 observations for the Status variable, 1856 (26.6%) left Regork’s program while 5132 (73.4%) stayed.
Missing Value Handling:
The na.omit() function is applied to the retention dataset to remove any rows containing missing (NA) values.
After this step, the dataset contains only complete cases with no missing data.
#1. Data Import
retention <- read_csv("~/Data Mining ⚒/customer_retention.csv")
#2. Data Type Transformation
retention <- mutate(retention, Status = factor(Status))
#3. Missing Values Handling
retention <- na.omit(retention)
retention %>%
count(Status)
## # A tibble: 2 × 2
## Status n
## <fct> <int>
## 1 Current 5132
## 2 Left 1856
Graph_1
retention %>%
group_by(SeniorCitizen, TechSupport) %>%
summarize(count = n()) %>%
group_by(SeniorCitizen) %>%
mutate(prop = count / sum(count)) %>%
ggplot(aes(x = TechSupport, y = prop, fill = SeniorCitizen)) +
geom_col(position = "dodge") +
labs(title = "Proportion of Senior Citizens with and without Tech Support",
x = "Tech Support",
y = "Proportion",
fill = "Senior Citizen") +
theme_minimal()
## Technology Support and Senior Citizens: Graph_1
A minority of customers sign up for tech support (approximately 30% overall), but among those who do sign up for tech support, the majority are Senior citizens. Importantly, amomg the participating customers very few senior citizens lacked internet service, indicating that once they are connected, they tend to maintain their internet access consistently. Regork could improve this by creating targeted strategies to encourage tech support adoption among seniors, emphasizing simplicity and reliability.
Graph_2
ggplot(retention, aes(x = OnlineSecurity, y = Tenure, fill = OnlineSecurity)) +
geom_boxplot() +
labs(title = "Tenure by Online Security",
x = "Online Security",
y = "Tenure (months)") +
theme_minimal()
## Online Security Services and Customer Tenure: Graph_2
Our analysis, supported by a boxplot visualization, shows a strong relationship between customer tenure and the use of online security services. Customers who have been with Regork for a longer period are more likely to subscribe to the company’s online security offerings. This trend suggests that loyalty and time with Regork contribute to greater adoption of additional services. Newer customers tend to be less engaged with these offerings, indicating a potential opportunity for targeted marketing early in the customer lifecycle. Encouraging new users to adopt online security services sooner could help strengthen long-term customer relationships and satisfaction.
Graph_3
ggplot(retention, aes(x = Partner, fill = Dependents)) +
geom_bar(position = "dodge") +
labs(title = "Dependents by Partner Status",
x = "Partners",
y = "Count",
fill = "Has Dependents") +
theme_minimal()
## Partners and Dependents: Graph_3
We observed a clear relationship between having a partner and having dependents. Customers without partners are much more likely to not have dependents, suggesting a strong link between relationship status and family responsibilities. Among customers with partners, slightly more have dependents than those who do not, although the difference is relatively small. Understanding these family-related factors can help Regork tailor its service offerings and retention strategies to better meet the needs of different customer groups.
set.seed(123)
logistic_split <- initial_split(retention, prop = 0.7, strata = Status)
logistic_train <- training(logistic_split)
logisitc_test <- testing(logistic_split)
set.seed(123)
logistic_kfold <- vfold_cv(logistic_train, v = 5)
results <- logistic_reg() %>%
fit_resamples(Status ~ ., logistic_kfold)
collect_metrics(results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.797 5 0.00705 Preprocessor1_Model1
## 2 brier_class binary 0.137 5 0.00374 Preprocessor1_Model1
## 3 roc_auc binary 0.844 5 0.00974 Preprocessor1_Model1
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = logistic_train)
tidy(final_fit)
## # A tibble: 31 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.02 0.969 1.05 2.95e- 1
## 2 GenderMale -0.0461 0.0779 -0.592 5.54e- 1
## 3 SeniorCitizen 0.258 0.101 2.55 1.08e- 2
## 4 PartnerYes -0.141 0.0925 -1.52 1.29e- 1
## 5 DependentsYes -0.0475 0.108 -0.441 6.59e- 1
## 6 Tenure -0.0646 0.00762 -8.47 2.35e-17
## 7 PhoneServiceYes -0.0417 0.774 -0.0539 9.57e- 1
## 8 MultipleLinesNo phone service NA NA NA NA
## 9 MultipleLinesYes 0.442 0.211 2.10 3.56e- 2
## 10 InternetServiceFiber optic 1.48 0.950 1.56 1.18e- 1
## # ℹ 21 more rows
exp(coef(final_fit$fit))
## (Intercept) GenderMale
## 2.7593894 0.9549166
## SeniorCitizen PartnerYes
## 1.2949103 0.8688686
## DependentsYes Tenure
## 0.9536487 0.9374419
## PhoneServiceYes MultipleLinesNo phone service
## 0.9591253 NA
## MultipleLinesYes InternetServiceFiber optic
## 1.5562922 4.4143178
## InternetServiceNo OnlineSecurityNo internet service
## 0.2214383 NA
## OnlineSecurityYes OnlineBackupNo internet service
## 0.7591602 NA
## OnlineBackupYes DeviceProtectionNo internet service
## 0.9028811 NA
## DeviceProtectionYes TechSupportNo internet service
## 1.1801326 NA
## TechSupportYes StreamingTVNo internet service
## 0.8583097 NA
## StreamingTVYes StreamingMoviesNo internet service
## 1.6073481 NA
## StreamingMoviesYes ContractOne year
## 1.7542279 0.4452648
## ContractTwo year PaperlessBillingYes
## 0.2497494 1.3698551
## PaymentMethodCredit card (automatic) PaymentMethodElectronic check
## 1.0223826 1.3813743
## PaymentMethodMailed check MonthlyCharges
## 0.9410159 0.9689548
## TotalCharges
## 1.0003906
Significant variables that were predictive of retention in the logistic regression were: Longer Tenure (OR=0.94, CI: 0.92 - 0.95); Contract OneYear (OR=0.45, CI: 0.343-0.58); and Contract TwoYear (OR=0.25, CI: 0.17-0.37).
Factors that significantly contributed to leaving were: PaperlessBillingYes (OR=1.4, CI: 1.15=1.63); Payment with Electronic check (OR=1.38, CI: 1.11-1.73); having multiple lines (OR=1.56, CI: 1.03 - 2.35); and being a Senior Citizen (OR=1.295, CI 1.1-1.58). Lastly, while having fiberoptic cable had an OR of 4.14, the confidence interval for this variable was quite wide (0.68 - 28.5) indicating it was not significant and highly variable.
set.seed(123)
mars_split <- initial_split(retention, prop = 0.7, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
retention_recipe <- recipe(Status ~ ., data = mars_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
set.seed(123)
mars_kfold <- vfold_cv(mars_train, v = 5, strata = Status)
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode("classification")
mars_grid <- grid_regular(num_terms(c(1, 30)), prod_degree(), levels = 25)
retention_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(mars_mod)
tuning_results <- retention_wf %>%
tune_grid(resamples = mars_kfold, grid = mars_grid)
## Warning: package 'earth' was built under R version 4.4.3
## Warning: package 'plotmo' was built under R version 4.4.3
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 50 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 1 roc_auc binary 0.850 5 0.00509 Preprocessor1_M…
## 2 16 1 roc_auc binary 0.849 5 0.00502 Preprocessor1_M…
## 3 19 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 4 20 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 5 21 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 6 22 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 7 23 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 8 25 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 9 26 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## 10 27 1 roc_auc binary 0.849 5 0.00503 Preprocessor1_M…
## # ℹ 40 more rows
autoplot(tuning_results)
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(best_hyperparameters)
final_fit <- final_wf %>%
fit(data = mars_train)
final_fit %>%
extract_fit_parsnip() %>%
vip()
Tenure is the most important feature in the model, meaning it has the
greatest influence on predicting customer retention. This aligns with
the common assumption that customers who have been with Regork for a
longer period are less likely to churn in the future. Longer tenure
often correlates with customer loyalty and satisfaction, as established
relationships and familiarity with services typically reduce the
likelihood of departure. Therefore, focusing on retaining long-term
customers and nurturing their relationship with Regork could be a key
strategy to reduce churn.
set.seed(123)
decision_split <- initial_split(retention, prop = 0.7, strata = "Status")
decision_train <- training(decision_split)
decision_test <- testing(decision_split)
dt_mod <- decision_tree(mode = "classification") %>%
set_engine("rpart")
model_recipe <- recipe(Status ~ ., data = decision_train)
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = decision_train)
set.seed(123)
kfold <- vfold_cv(decision_train, v = 5)
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.785 5 0.00552 Preprocessor1_Model1
## 2 brier_class binary 0.147 5 0.00263 Preprocessor1_Model1
## 3 roc_auc binary 0.803 5 0.00777 Preprocessor1_Model1
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 15 40 roc_auc binary 0.814 5 0.00780
## 2 0.0000000178 15 40 roc_auc binary 0.814 5 0.00780
## 3 0.00000316 15 40 roc_auc binary 0.814 5 0.00780
## 4 0.0000000001 11 40 roc_auc binary 0.814 5 0.00827
## 5 0.0000000178 11 40 roc_auc binary 0.814 5 0.00827
## # ℹ 1 more variable: .config <chr>
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
dt_final_fit <- dt_final_wf %>%
fit(data = decision_train)
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(20)
The decision tree results also highlight contract and tenure as key
factors influencing whether a customer will stay or leave. These two
features consistently emerge as the most important predictors in every
machine learning model trained and tested on this dataset. As discussed
in our business analysis, both contract length and customer tenure are
strong indicators of retention, emphasizing their critical role in
understanding customer behavior and guiding retention strategies.
Our analysis of customer retention, utilizing Logistic Regression, MARS, and Decision Trees, provides valuable insights into the factors influencing customer behavior. Contract length and tenure consistently emerge as the most significant predictors of retention. Customers with longer contracts, particularly those on 2-year contracts, are less likely to churn, reflecting the value of long-term commitments. This suggests that Regork can improve retention by offering loyalty incentives to long-term customers.
Our analysis also reveals that senior citizens are slightly more likely to leave and are less likely to have tech support, with a greater proportion of seniors lacking this service compared to non-senior customers. (Interestingly, seniors who do not have internet service are in the minority, suggesting that once connected, they remain consistent in maintaining service.) This insight indicates that providing dedicated tech support for senior customers could help reduce churn in this segment by offering additional reassurance and assistance.
Finally, when examining partners and dependents, customers without partners are more likely to have no dependents, while those with partners are more likely to have dependents. This suggests that family-related factors influence customer behavior and service needs. Understanding these dynamics can help Regork tailor retention strategies, such as offering family-focused plans or discounts.
In conclusion, Regork can improve retention by focusing on long-term customer relationships, providing tailored tech support for senior citizens, and addressing the family-oriented needs of customers.