Introduction:

Customer churn significantly impacts Regork Telecom’s revenue and growth. We analyzed customer data using logistic regression, decision trees, and random forest models to predict churn based on attributes like contract type, device protection, and tech support usage. Our models identified key drivers of churn that can inform targeted retention strategies. This data-driven approach enables Regork Telecom to proactively reduce churn and enhance customer satisfaction.

Data Preparation:

Packages Needed for the Report:

library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(here)
library(kernlab)
library(ggplot2)
library(earth)

Data Sets:

The following is the data needed for the Report:

setwd("C:/Users/15134/OneDrive/Desktop/BANA 4080 - Data Mining/UC-Data Mining")

data <- read_csv("customer_retention.csv")

data
## # A tibble: 6,999 × 20
##    Gender SeniorCitizen Partner Dependents Tenure PhoneService MultipleLines   
##    <chr>          <dbl> <chr>   <chr>       <dbl> <chr>        <chr>           
##  1 Female             0 Yes     No              1 No           No phone service
##  2 Male               0 No      No             34 Yes          No              
##  3 Male               0 No      No              2 Yes          No              
##  4 Male               0 No      No             45 No           No phone service
##  5 Female             0 No      No              2 Yes          No              
##  6 Female             0 No      No              8 Yes          Yes             
##  7 Male               0 No      Yes            22 Yes          Yes             
##  8 Female             0 No      No             10 No           No phone service
##  9 Female             0 Yes     No             28 Yes          Yes             
## 10 Male               0 No      Yes            62 Yes          No              
## # ℹ 6,989 more rows
## # ℹ 13 more variables: InternetService <chr>, OnlineSecurity <chr>,
## #   OnlineBackup <chr>, DeviceProtection <chr>, TechSupport <chr>,
## #   StreamingTV <chr>, StreamingMovies <chr>, Contract <chr>,
## #   PaperlessBilling <chr>, PaymentMethod <chr>, MonthlyCharges <dbl>,
## #   TotalCharges <dbl>, Status <chr>

Exploratory Data Analysis

Tenure Vs. Device protection

Using our Exploratory Data Analysis tools in R, we created this data chart contrasting Tenure with Device Protection. Customers with device protection (“Yes”) tend to have a longer tenure on average compared to those without it. Those with no protection (“None”) show more variability in tenure. Overall, having device protection is clearly associated with longer customer retention.

ggplot(data, aes(x = DeviceProtection, y = Tenure, fill = DeviceProtection)) +
  geom_boxplot(alpha = 0.7, outlier.color = "black") +
  ggtitle("Tenure Distribution by Device Protection Status") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 15, hjust = 1)
  ) +
  labs(x = "Device Protection", y = "Length of Tenure (months)") +
  scale_fill_brewer(palette = "Purples")

Customer Churn based on Dependent Status

In our next Exploratory Data Analysis Chart, we measured Customer Churn and Dependent Status against one another. Customers without dependents make up a larger share of both those who stayed and those who churned. However, the difference is more pronounced among those who left, suggesting customers with dependents may be slightly more likely to stay. This shows that dependents being a stabilizing factor in customer retention.

ggplot(data, aes(x = Dependents)) + 
  geom_bar(fill = "red") + 
  facet_wrap(~Status) +
  labs(
    title = "Customer Churn by Dependent Status",
    subtitle = "Comparison of customers who stayed or left based on whether they have dependents",
    x = "Has Dependents",
    y = "Number of Customers"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12, margin = margin(b = 10)),
    axis.text.x = element_text(angle = 15, vjust = 0.9, hjust = 1),
    strip.text = element_text(face = "bold", size = 12),
    panel.grid.major.y = element_line(color = "grey80"),
    panel.grid.minor = element_blank()
  )

Tech Support Usage by Contract Type

Our last EDA Graph transposes Tech Support Usage and Contract Type. Customers with month-to-month contracts are much less likely to use tech support compared to those with longer-term contracts. Tech support usage increases with the length of the contract, especially among one-year and two-year customers. This suggests that long-term customers may be more engaged with support services, thus increasing their customer satisfaction.

ggplot(data, aes(x = TechSupport)) + 
  geom_bar(fill = "navyblue") +
  geom_text(
    stat = "count",
    aes(label = ..count..),
    hjust = -0.075,              
    size = 4
  ) +
  facet_wrap(~Contract) +
  coord_flip() +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.25)) 
  ) +
  labs(
    title = "Tech Support Usage by Contract Type",
    subtitle = "How customers with different contracts use tech support services",
    x = "Tech Support Response",
    y = "Number of Customers"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    plot.subtitle = element_text(size = 13, hjust = 0.5, margin = margin(b = 10)),
    strip.text = element_text(face = "bold", size = 13),
    axis.text.y = element_text(size = 12),
    axis.text.x = element_text(size = 12),
    panel.grid.major.x = element_line(color = "grey85"),
    panel.grid.minor = element_blank(),
    plot.margin = margin(t = 15, r = 20, b = 15, l = 15)
  )

Machine Learning

Logistic Regression

data$Status <- as.factor(data$Status)

set.seed(123)
split <- initial_split(data, prop = 0.7, strata = "Status")
logistic_train <- training(split)
logistic_test <- testing(split)

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(logistic_train, v = 5)

# titanic_train model via cross validation
results <- logistic_reg() %>%
    fit_resamples(Status ~ ., kfold)

# collect the average accuracy rate
collect_metrics(results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.796     5 0.00994 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00412 Preprocessor1_Model1
## 3 roc_auc     binary     0.844     5 0.00769 Preprocessor1_Model1

Above we built a logistic regression model to predict customer status (like churn) using a machine learning approach. It split the data into training and testing sets with stratified sampling, then used 5-fold cross-validation to evaluate the model’s performance. Finally, it collects the average accuracy across the folds to assess how well the model generalizes. As you can see, this logistic regression model has an Area Under the Curve of 0.84. Not bad!


# retrain our model across the entire training data
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = logistic_train)
tidy(final_fit)
## # A tibble: 31 × 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.03     0.975       1.06   2.89e- 1
##  2 GenderMale                     -0.0174   0.0778     -0.224  8.23e- 1
##  3 SeniorCitizen                   0.187    0.101       1.85   6.47e- 2
##  4 PartnerYes                     -0.0630   0.0934     -0.674  5.00e- 1
##  5 DependentsYes                  -0.0397   0.106      -0.372  7.10e- 1
##  6 Tenure                         -0.0561   0.00752    -7.45   9.16e-14
##  7 PhoneServiceYes                 0.0864   0.774       0.112  9.11e- 1
##  8 MultipleLinesNo phone service  NA       NA          NA     NA       
##  9 MultipleLinesYes                0.464    0.211       2.20   2.80e- 2
## 10 InternetServiceFiber optic      1.76     0.949       1.85   6.42e- 2
## # ℹ 21 more rows

final_fit %>%
predict(logistic_test) %>%
bind_cols(logistic_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1389  247
##    Left        153  310

vip(final_fit$fit, num_features = 20)

The confusion matrix reveals that our logistic regression model correctly predicted 1,389 customers who stayed and 310 customers who left, demonstrating solid classification ability. However, it misclassified 247 customers who actually stayed as having left, and 153 who actually left as still current, indicating some room for improvement

Decision Tree

# Step 1: create decision tree model object
dt_mod <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Step 2: create model recipe
model_recipe <- recipe(Status ~ ., data = logistic_train)

# Step 3: fit model workflow
dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = logistic_train) 

# create resampling procedure
set.seed(123)
kfold <- vfold_cv(logistic_train, v = 5)

# train model
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)


# model 

rpart.plot::rpart.plot(dt_fit$fit$fit$fit)


# model results
collect_metrics(dt_results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.789     5 0.00537 Preprocessor1_Model1
## 2 brier_class binary     0.147     5 0.00318 Preprocessor1_Model1
## 3 roc_auc     binary     0.801     5 0.00651 Preprocessor1_Model1

The decision tree model achieved an AUC of 0.80, indicating it performs fairly well at distinguishing between customers who stay and those who leave. The model is simple and easy to interpret, as shown by the decision tree plot, which helps visualize the key factors influencing customer status. While slightly less accurate than the logistic regression model (AUC 0.84), it still provides valuable insight and is useful for quick, explainable decisions.

dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

# create the hyperparameter grid
dt_hyper_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# train our model across the hyper parameter grid
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)

# get best results
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    30 roc_auc binary     0.824     5 0.00698
## 2    0.0000000178          8    30 roc_auc binary     0.824     5 0.00698
## 3    0.00000316            8    30 roc_auc binary     0.824     5 0.00698
## 4    0.0000000001          8    21 roc_auc binary     0.823     5 0.00548
## 5    0.0000000178          8    21 roc_auc binary     0.823     5 0.00548
## # ℹ 1 more variable: .config <chr>

# get best hyperparameter values
dt_best_model <- select_best(dt_results, metric = 'roc_auc')

The hyperparameter tuning for the decision tree model tested combinations of cost complexity, tree depth, and minimum node size to improve model performance. The best models achieved an AUC of 0.8244, slightly better than the initial decision tree model (AUC 0.80), showing that tuning helped boost predictive power. These top results were consistent across different parameter settings, indicating a stable and reliable model.

# put together final workflow
dt_final_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  finalize_workflow(dt_best_model)

# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
  fit(data = logistic_train)

# plot feature importance
dt_final_fit %>%
  extract_fit_parsnip() %>%
  vip(20)

This feature importance plot from the final decision tree model shows which variables had the biggest impact on predicting customer status. Contract type, tenure, and total charges were the top three most influential factors, suggesting that long-term customers with specific contract types and higher overall spending are more likely to stay. Features like PhoneService and PaperlessBilling had minimal influence on the model’s predictions.

Random Forrest

set.seed(123)
rf_split <- initial_split(data, prop = 0.7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

rf_recipe <- recipe(Status ~ ., data = rf_train)

rf_mod <- rand_forest(mode = "classification") %>%
set_engine("ranger")

set.seed(123)
rf_kfold <- vfold_cv(rf_train, v = 5)

results <- fit_resamples(rf_mod, rf_recipe, rf_kfold)

collect_metrics(results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.796     5 0.00545 Preprocessor1_Model1
## 2 brier_class binary     0.138     5 0.00189 Preprocessor1_Model1
## 3 roc_auc     binary     0.837     5 0.00498 Preprocessor1_Model1

The random forest model achieved an accuracy of about 79.6% and a roc_auc score of 0.8369, showing strong overall performance in classifying customer status. The low brier_class value suggests the model’s probability predictions were also well-calibrated. Overall, the random forest outperformed the earlier decision tree in both accuracy and AUC; however it is not more accurate than our origainal logistic regression.

rf_mod <- rand_forest(mode = "classification", trees = tune(), mtry = tune(), min_n = tune()) %>%
set_engine("ranger", importance = "impurity")

rf_hyper_grid <- grid_regular(trees(range = c(100, 1000)), mtry(range = c(1, 50)), min_n(range = c(1, 20)), levels = 5)

set.seed(123)
results <- tune_grid(rf_mod, rf_recipe, resamples = rf_kfold, grid = rf_hyper_grid)

show_best(results)
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1   775     5 roc_auc binary     0.839     5 0.00626 Preprocessor1_Model0…
## 2     1   775    20 roc_auc binary     0.839     5 0.00596 Preprocessor1_Model1…
## 3     1   550     1 roc_auc binary     0.839     5 0.00626 Preprocessor1_Model0…
## 4     1   550    10 roc_auc binary     0.839     5 0.00639 Preprocessor1_Model0…
## 5     1  1000    10 roc_auc binary     0.838     5 0.00579 Preprocessor1_Model0…

After tuning the random forest model’s hyperparameters, the best result achieved a roc_auc score of 0.8384, showing a slight improvement over the original model. This suggests that fine-tuning parameters like the number of trees, variables at each split (mtry), and minimum samples (min_n) helped optimize the model’s ability to separate the classes.

rf_best <- select_best(results, metric = "roc_auc")

final_rf_wf <- workflow() %>% add_recipe(rf_recipe) %>% add_model(rf_mod) %>%
  finalize_workflow(rf_best)

final_rf_fit <- final_rf_wf %>% fit(data = rf_train)

final_rf_fit %>% extract_fit_parsnip() %>% vip(num_features = 10)

Optimal Model: Logistic Regression

collect_metrics(results, summarize = FALSE) %>% filter(.metric == "roc_auc")
## # A tibble: 625 × 8
##    id     mtry trees min_n .metric .estimator .estimate .config               
##    <chr> <int> <int> <int> <chr>   <chr>          <dbl> <chr>                 
##  1 Fold1     1   100     1 roc_auc binary         0.825 Preprocessor1_Model001
##  2 Fold2     1   100     1 roc_auc binary         0.849 Preprocessor1_Model001
##  3 Fold3     1   100     1 roc_auc binary         0.822 Preprocessor1_Model001
##  4 Fold4     1   100     1 roc_auc binary         0.853 Preprocessor1_Model001
##  5 Fold5     1   100     1 roc_auc binary         0.838 Preprocessor1_Model001
##  6 Fold1     1   325     1 roc_auc binary         0.827 Preprocessor1_Model002
##  7 Fold2     1   325     1 roc_auc binary         0.849 Preprocessor1_Model002
##  8 Fold3     1   325     1 roc_auc binary         0.821 Preprocessor1_Model002
##  9 Fold4     1   325     1 roc_auc binary         0.854 Preprocessor1_Model002
## 10 Fold5     1   325     1 roc_auc binary         0.840 Preprocessor1_Model002
## # ℹ 615 more rows
final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = logistic_train)

tidy(final_fit)
## # A tibble: 31 × 5
##    term                          estimate std.error statistic   p.value
##    <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     1.03     0.975       1.06   2.89e- 1
##  2 GenderMale                     -0.0174   0.0778     -0.224  8.23e- 1
##  3 SeniorCitizen                   0.187    0.101       1.85   6.47e- 2
##  4 PartnerYes                     -0.0630   0.0934     -0.674  5.00e- 1
##  5 DependentsYes                  -0.0397   0.106      -0.372  7.10e- 1
##  6 Tenure                         -0.0561   0.00752    -7.45   9.16e-14
##  7 PhoneServiceYes                 0.0864   0.774       0.112  9.11e- 1
##  8 MultipleLinesNo phone service  NA       NA          NA     NA       
##  9 MultipleLinesYes                0.464    0.211       2.20   2.80e- 2
## 10 InternetServiceFiber optic      1.76     0.949       1.85   6.42e- 2
## # ℹ 21 more rows

vip::vip(final_fit)

Which predictor variables appear to be most influential in customer behavior?

The top three predictor variables are Tenure, two year contract, and one year contract. These three variables are what contribute most to the the status(whether a customer is current or left) out of all the variables.

Why are those specific predictor variables the most influential?

Those variables are the most influential because when using the vip function in R it shows that those three variables of tenure, two year contract, and one year contract are of the highest importance.

final_fit %>% 
  predict(logistic_test, type = "prob") %>%
  mutate(truth = logistic_test$Status) %>%
  roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.848

How does this generalization error compare to the cross validation error seen in earlier results?

This generalization error is slightly higher than the one we saw in the earlier results.

As a person responsible for making business decisions, what else are you learning from the observations in this section?

Overall, after running our model through the training data and the generalization error and having positive results. We can takeaway that our model would have a high chance of success in the future on other data as well.

Conclusion and Business Analysis:

The key factors that Regork Telecom should focus on to reduce customer churn are clearly identified above by our Random Forest Analysis. Contract type, monthly charges, tenure, internet service type, technical support access, and bundled services like online security are the most important variables. The random forest model helped confirm the top risk factors by ranking them based on importance. By improving these areas, managers can more effectively target customers at risk and boost retention.

It is clear that if no action is taken, Regork Telecom could lose significant revenue. With a churn rate of around 26.5% and an average monthly charge of 64.76 dollars in the set of data of just over 7,000 customers, the company risks losing approximately 120,837 dollars per month. Over the course of a year, this adds up to over $1.4 million in lost revenue. Acting quickly is crucial to protect both monthly and long-term financial performance.

An incentive scheme for retaining customers has been clearly presented. Key strategies include offering Loyalty Programs for Long-Tenure Customers, Incentives for Longer Contracts, and Proactive Support for At-Risk Customers. Personalized outreach to at-risk customers is recommended to strengthen their relationship with the brand. Together, these tactics are designed to directly address the biggest churn factors identified in the analysis.

Consumers will benefit from longer contracts offering better value, improved service bundling, and targeted loyalty rewards. Customers will experience more personalized attention, leading to higher satisfaction and retention. To the RegorkTelecom CEO, I propose implementing a retention program based on the random forest findings and improving the first six months of the customer experience. Investing in predictive churn modeling and competitive pricing will help the company save millions while improving customer loyalty.

There are some limitations to the analysis, including the lack of customer satisfaction data and the complexity of interpreting the random forest model. To improve, future work should include customer feedback, support call logs, or app usage patterns. These were not included but could significantly impact churn and improve predictions. A deeper look into demograhpics would also be beneficial, because it wouls tell us more clearly who and why people are leaving Regork Telecom

Final Project

Derek Christ and Seth Engelhardt

2025-04-27