Customer Retention Machine Learning Analysis

Introduction

Summary of report and problem statement:

As the technology landscape continues to evolve, the demand for advanced and versatile devices grows increasingly complex. In response, our team has developed a new generation of Smart Home TVs, designed to enhance accessibility to news, entertainment, social connection, gaming, and more for every household.

Unlike the previous generation of Smart TVs, our latest model introduces innovative features to maximize comfort and interactivity. By leveraging insights from the “Customer Retention” analysis, we gained a deeper understanding of our target customers, ideal price points, and premium package offerings.

This new generation of Smart TVs sets a groundbreaking standard for the industry by enabling direct human interaction with the TV. Additionally, customers will enjoy a complimentary three-month trial of our premium package upon purchasing the TV. After the trial period, a competitively priced subscription will allow continued access to premium services.

The premium package includes device protection for enhanced privacy and security, as well as access to third-party platforms like Netflix and Spotify. This ensures that customers can enjoy the best streaming and entertainment experiences, effectively transforming the TV into a large, multifunctional smart device.

Packages Required

library(tidymodels)
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(corrplot)
library(caret)
library(randomForest)
library(earth)
library(ROCR)
library(vip)
library(ranger) 
library(rsample)

Data Preparation

data <- read.csv("C:/Users/Admin/Downloads/2023_BANA4080_group24_final_project.html/customer_retention.csv")

class(data)
summary(data)
str(data)
data <- na.omit(data)
table(data$Status)
table(data$Gender)
table(data$InternetService)
summary(data$Tenure)
summary(data$MonthlyCharges)

data$Gender <- as.factor(data$Gender)
data$SeniorCitizen <- as.factor(data$SeniorCitizen)
data$Partner <- as.factor(data$Partner)
data$Dependents <- as.factor(data$Dependents)
data$PhoneService <- as.factor(data$PhoneService)
data$MultipleLines <- as.factor(data$MultipleLines)
data$InternetService <- as.factor(data$InternetService)
data$OnlineSecurity <- as.factor(data$OnlineSecurity)
data$OnlineBackup <- as.factor(data$OnlineBackup)
data$DeviceProtection <- as.factor(data$DeviceProtection)
data$TechSupport <- as.factor(data$TechSupport)
data$StreamingTV <- as.factor(data$StreamingTV)
data$StreamingMovies <- as.factor(data$StreamingMovies)
data$Contract <- as.factor(data$Contract)
data$PaperlessBilling <- as.factor(data$PaperlessBilling)
data$PaymentMethod <- as.factor(data$PaymentMethod)
data$Status <- as.factor(data$Status)
data$Tenure <- scale(data$Tenure)
data$MonthlyCharges <- scale(data$MonthlyCharges)
data$TotalCharges <- scale(data$TotalCharges)
sum(is.na(data))
data <- na.omit(data)

Exploratory Analysis

ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Distribution of Tenure by Customer Status", x = "Tenure (months)", y = "Count") +
  theme_minimal()

According to the graph titled “Tenure by Customer Status”, we observe that customer loyalty to Regork Premium tends to increase over time. Specifically, the longer customers remain subscribed to our premium pack, the more money and time they invest in it. The analysis indicates that customers who consistently use the Premium Pack for extended periods—70 months or more—represent the largest group. Conversely, former Premium users tend to show a different pattern: a high number of them use the service only for the first few months before discontinuing their subscription.

ggplot(data, aes(x = TotalCharges)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Total Charges by Price Range", 
       x = "Total Charges ($)", y = "Count") +
  theme_minimal()

According to the illustration, it is evident that most of our customers are willing to spend nearly $1,000 on a TV. Approximately 1,000 survey respondents indicated that they find $970 to be a reasonable price for a TV. Following that, $1,050 was the second most popular choice, with a significant group of respondents agreeing it is a fair price. To summarize, our team has decided to set the price range for our Smart TVs between $1,050 and $1,250, depending on the features and memory capacity, to best meet customer expectations.

ggplot(data, aes(x = PaymentMethod)) +
  geom_bar(fill = "lightgreen") +
  labs(title = "Distribution of Payment Methods", 
       x = "Payment Method", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 360, hjust = 1))

According to the bar chart above on payment method frequency, it is surprising that electronic checks are the most frequently used payment method among the four options. Following this, mailed checks are the second most common. Lastly, both automatic methods—bank transfer and credit—have approximately the same number of users.

ggplot(data, aes(x = SeniorCitizen, fill = DeviceProtection)) +
  geom_bar(position = "dodge") +
  labs(title = "Senior Citizens and Device Protection", x = "Senior Citizen", y = "Count")

If senior citizens, or customers above a certain age, demonstrate a higher tendency to opt for Device Protection due to their increased risk awareness, this presents a strategic opportunity for targeted marketing.

Our team could enhance promotional efforts for Regork Premium, highlighting its inclusion of Device Protection, specifically tailored to senior customers. This strategy aims to provide added value and peace of mind, ensuring a superior experience for this demographic while increasing the adoption of both our Smart Home TV and Premium package offerings.

Machine Learning

Logistic Regression

set.seed(42)
split_data <- initial_split(data, prop = 0.75, strata = Status)

train_data <- training(split_data)
test_data <- testing(split_data)

set.seed(42)  
cv_folds <- vfold_cv(train_data, v = 10, strata = Status)

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%  
  set_mode("classification")

logistic_resamples <- fit_resamples(logistic_model, Status ~ ., resamples = cv_folds)
logistic_resamples %>%
  collect_metrics() 
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.803    10 0.00679 Preprocessor1_Model1
## 2 brier_class binary     0.138    10 0.00270 Preprocessor1_Model1
## 3 roc_auc     binary     0.838    10 0.00653 Preprocessor1_Model1

The model performs smoothly especially terms of discrimination across classes, as evidenced by its high ROC AUC score. The accuracy and Brier score also indicate that the model makes quite precise predictions and that the estimated probabilities are fairly adequately calibrated.

Multivariate Adaptive Regression Spline(MARS

set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = "Status")
train_data <- training(data_split)
test_data <- testing(data_split)

mars_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes())

set.seed(123)
mars_kfolds <- vfold_cv(train_data, v = 7, strata = "Status")

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification") %>%
  set_engine("earth")

mars_grid <- grid_regular(num_terms(range = c(1, 20)), prod_degree(), levels = 5)

mars_wf <- workflow() %>%
  add_recipe(mars_recipe) %>%
  add_model(mars_mod)

mars_results <- mars_wf %>%
  tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 10 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        20           1 roc_auc binary     0.847     7 0.00365 Preprocessor1_M…
##  2        15           1 roc_auc binary     0.846     7 0.00374 Preprocessor1_M…
##  3        20           2 roc_auc binary     0.842     7 0.00481 Preprocessor1_M…
##  4        10           1 roc_auc binary     0.841     7 0.00504 Preprocessor1_M…
##  5        15           2 roc_auc binary     0.841     7 0.00459 Preprocessor1_M…
##  6        10           2 roc_auc binary     0.835     7 0.00703 Preprocessor1_M…
##  7         5           1 roc_auc binary     0.823     7 0.00549 Preprocessor1_M…
##  8         5           2 roc_auc binary     0.816     7 0.00496 Preprocessor1_M…
##  9         1           1 roc_auc binary     0.5       7 0       Preprocessor1_M…
## 10         1           2 roc_auc binary     0.5       7 0       Preprocessor1_M…

The best-performing configurations are those with 20 or 15 terms and prod_degree = 1, which consistently yield high ROC AUC scores, reflecting good classification performance. Increasing the prod_degree slightly reduces performance, and reducing num_terms below 10 significantly lowers the model’s ability to capture important patterns in the data. The 1-term configurations are ineffective and underfit the data, resulting in a ROC AUC of 0.5, which indicates no better than random guessing.

autoplot(mars_results)

The MARS model performs optimally with 15-20 terms and a degree of interaction = 1, achieving high accuracy, low Brier score, and excellent ROC AUC. These settings balance complexity and predictive performance, making them ideal for practical applications. Increasing interaction complexity provides minimal benefits and is unnecessary for this dataset.

mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
  add_model(mars_mod) %>%
  add_recipe(mars_recipe) %>%
  finalize_workflow(mars_best)

mars_final_model <- mars_final_wf %>%
  fit(data = train_data)

mars_final_model %>%
  extract_fit_parsnip() %>%
  vip(10, type = "rss")

The analysis underscores that behavioral and financial factors like tenure, total charges, and monthly charges dominate customer retention. Additionally, value-added services (e.g., online security, tech support) and long-term contracts play pivotal roles in enhancing customer satisfaction and reducing churn.

Random Forest Model

set.seed(123)
rf_split <- initial_split(data, prop = 0.7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

rf_recipe <- recipe(Status ~ ., data = rf_train) %>%
  step_dummy(all_nominal(), -all_outcomes())   

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

set.seed(123)
rf_kfold <- vfold_cv(rf_train, v = 5)

results <- fit_resamples(rf_mod, rf_recipe, rf_kfold)

collect_metrics(results)
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.800     5 0.00591 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00350 Preprocessor1_Model1
## 3 roc_auc     binary     0.842     5 0.0105  Preprocessor1_Model1

The model appears to be performing well with an accuracy of 80%, a decent Brier score indicating good calibration, and a strong ROC AUC score showing good discriminative power. Minor improvements in calibration or feature engineering might help further boost performance.

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

rf_param_grid <- grid_regular(
  trees(range = c(200, 2000)),  
  mtry(range = c(2, 20)),        
  min_n(range = c(2, 15)),       
  levels = 5                     
)

rf_tune_results <- tune_grid(
  rf_mod, 
  rf_recipe, 
  resamples = rf_kfold, 
  grid = rf_param_grid
)

show_best(rf_tune_results)
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.841     5  0.0108 Preprocessor1_Model1

This model can be considered effective for the classification task, with further optimization potentially improving performance

rf_best <- select_best(rf_tune_results, metric = "roc_auc")

final_rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best)

final_rf_fit <- final_rf_wf %>% fit(data = rf_train)

final_rf_fit %>% 
  extract_fit_parsnip() %>%
  vip::vip(num_features = 10) + 
  theme_minimal() + 
  labs(
    title = "Top 10 Important Features for Random Forest Model",
    x = "Importance",
    y = "Features"
  )

The feature importance analysis reveals that a combination of customer commitment (through tenure and contract type), financial investment (monthly and total charges), and service engagement (internet services, online security) are the most influential factors in predicting customer retention. These insights provide actionable strategies for businesses to reduce churn and improve customer satisfaction. By focusing on retaining long-term, committed customers and offering value-added services, businesses can enhance customer loyalty and ultimately boost their bottom line.

Confusion Matrix

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

rf_predictions <- rf_mod %>%
  fit(Status ~ ., data = rf_train) %>%
  predict(rf_test)

rf_predictions %>%
  bind_cols(rf_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1380  256
##    Left        160  301

The Random Forest model performs reasonably well with an accuracy of 81.2%, but there is room for improvement in predicting customers who leave. Focus should be given to improving precision and recall for the “Left” class, especially if that is a critical outcome for the business. Further model adjustments and techniques for handling class imbalance could enhance performance.

Business Analysis & Conclusion

Important Predictor Variable:

  • In our report, the five essential variables include Tenure, Total Charges, Monthly Charges, Payment Method and Online Security.

  • Using machine learning analysis, our team determined that the Tenure is the most significant among the five variables, as it plays a key role in predicting the months of our loyal customers towards our products. Through analysis and graphing, we gained deeper insights into how these variables interact, enhancing the quality of our report.

Predicted Revenue Loss per Month:

  • To estimate the monthly revenue loss if our product does not attract sufficient customer support, our team conducted detailed calculations:

  • Predicted Revenue Loss = Average Revenue per Customer × Number of Customers (Current or Targeted) = (1000+1250)/2 * 1000 = 1,1250,000 (dollars)

Implication:

  • Using RStudio, we were able to establish a link between senior customers and device protection, enabling us to take the next step: promoting the right product or package to the appropriate customer segment.

Limitation:

  • The machine learning analysis of the Smart Home TV product confirms that the key limitation of customer demand uncertainty has a substantial impact on predicting product performance and customer retention. The 80% accuracy of the Logistic Regression model underscores that while useful, the model is not entirely reliable in forecasting demand, which is crucial for ensuring product success. This limitation, coupled with budget and time constraints, emphasizes the need for a balance between model complexity and resource allocation. Moving forward, while enhancing the model with more advanced techniques might improve accuracy, budget and time constraints will likely continue to influence the approach taken.