Final Project

#Introduction

#Packages Required

library(tidyverse)

## Warning: package 'purrr' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(completejourney)

## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.4.3

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──

## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1

## Warning: package 'dials' was built under R version 4.4.3

## Warning: package 'infer' was built under R version 4.4.3

## Warning: package 'modeldata' was built under R version 4.4.3

## Warning: package 'parsnip' was built under R version 4.4.3

## Warning: package 'recipes' was built under R version 4.4.3

## Warning: package 'rsample' was built under R version 4.4.3

## Warning: package 'tune' was built under R version 4.4.3

## Warning: package 'workflows' was built under R version 4.4.3

## Warning: package 'workflowsets' was built under R version 4.4.3

## Warning: package 'yardstick' was built under R version 4.4.3

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity

## The following object is masked from 'package:purrr':
## 
##     lift

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.3

## corrplot 0.95 loaded

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.3

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(earth)

## Warning: package 'earth' was built under R version 4.4.3

## Loading required package: Formula

## Loading required package: plotmo

## Warning: package 'plotmo' was built under R version 4.4.3

## Loading required package: plotrix

## 
## Attaching package: 'plotrix'

## The following object is masked from 'package:scales':
## 
##     rescale

library(vip)

## Warning: package 'vip' was built under R version 4.4.3

## 
## Attaching package: 'vip'

## The following object is masked from 'package:utils':
## 
##     vi

library(ranger)

## Warning: package 'ranger' was built under R version 4.4.3

## 
## Attaching package: 'ranger'

## The following object is masked from 'package:randomForest':
## 
##     importance

library(rsample)

#Data Preparation

library(readr)
data <- read_csv("customer_retention.csv")

## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

class(data)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

summary(data)

##     Gender          SeniorCitizen      Partner           Dependents       
##  Length:6999        Min.   :0.0000   Length:6999        Length:6999       
##  Class :character   1st Qu.:0.0000   Class :character   Class :character  
##  Mode  :character   Median :0.0000   Mode  :character   Mode  :character  
##                     Mean   :0.1619                                        
##                     3rd Qu.:0.0000                                        
##                     Max.   :1.0000                                        
##                                                                           
##      Tenure      PhoneService       MultipleLines      InternetService   
##  Min.   : 0.00   Length:6999        Length:6999        Length:6999       
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :29.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :32.38                                                           
##  3rd Qu.:55.00                                                           
##  Max.   :72.00                                                           
##                                                                          
##  OnlineSecurity     OnlineBackup       DeviceProtection   TechSupport       
##  Length:6999        Length:6999        Length:6999        Length:6999       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  StreamingTV        StreamingMovies      Contract         PaperlessBilling  
##  Length:6999        Length:6999        Length:6999        Length:6999       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaymentMethod      MonthlyCharges    TotalCharges       Status         
##  Length:6999        Min.   : 18.25   Min.   :  18.8   Length:6999       
##  Class :character   1st Qu.: 35.48   1st Qu.: 401.9   Class :character  
##  Mode  :character   Median : 70.35   Median :1397.5   Mode  :character  
##                     Mean   : 64.75   Mean   :2283.1                     
##                     3rd Qu.: 89.85   3rd Qu.:3796.9                     
##                     Max.   :118.75   Max.   :8684.8                     
##                                      NA's   :11

str(data)

## spc_tbl_ [6,999 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Gender          : chr [1:6999] "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : num [1:6999] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr [1:6999] "Yes" "No" "No" "No" ...
##  $ Dependents      : chr [1:6999] "No" "No" "No" "No" ...
##  $ Tenure          : num [1:6999] 1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr [1:6999] "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr [1:6999] "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr [1:6999] "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr [1:6999] "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr [1:6999] "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr [1:6999] "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr [1:6999] "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr [1:6999] "No" "No" "No" "No" ...
##  $ StreamingMovies : chr [1:6999] "No" "No" "No" "No" ...
##  $ Contract        : chr [1:6999] "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr [1:6999] "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr [1:6999] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num [1:6999] 29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num [1:6999] 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Status          : chr [1:6999] "Current" "Current" "Left" "Current" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Gender = col_character(),
##   ..   SeniorCitizen = col_double(),
##   ..   Partner = col_character(),
##   ..   Dependents = col_character(),
##   ..   Tenure = col_double(),
##   ..   PhoneService = col_character(),
##   ..   MultipleLines = col_character(),
##   ..   InternetService = col_character(),
##   ..   OnlineSecurity = col_character(),
##   ..   OnlineBackup = col_character(),
##   ..   DeviceProtection = col_character(),
##   ..   TechSupport = col_character(),
##   ..   StreamingTV = col_character(),
##   ..   StreamingMovies = col_character(),
##   ..   Contract = col_character(),
##   ..   PaperlessBilling = col_character(),
##   ..   PaymentMethod = col_character(),
##   ..   MonthlyCharges = col_double(),
##   ..   TotalCharges = col_double(),
##   ..   Status = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

data <- na.omit(data)
table(data$Status)

## 
## Current    Left 
##    5132    1856

table(data$Gender)

## 
## Female   Male 
##   3462   3526

table(data$InternetService)

## 
##         DSL Fiber optic          No 
##        2400        3075        1513

summary(data$Tenure)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   29.00   32.43   55.00   72.00

summary(data$MonthlyCharges)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.54   70.35   64.79   89.90  118.75

data$Gender <- as.factor(data$Gender)
data$SeniorCitizen <- as.factor(data$SeniorCitizen)
data$Partner <- as.factor(data$Partner)
data$Dependents <- as.factor(data$Dependents)
data$PhoneService <- as.factor(data$PhoneService)
data$MultipleLines <- as.factor(data$MultipleLines)
data$InternetService <- as.factor(data$InternetService)
data$OnlineSecurity <- as.factor(data$OnlineSecurity)
data$OnlineBackup <- as.factor(data$OnlineBackup)
data$DeviceProtection <- as.factor(data$DeviceProtection)
data$TechSupport <- as.factor(data$TechSupport)
data$StreamingTV <- as.factor(data$StreamingTV)
data$StreamingMovies <- as.factor(data$StreamingMovies)
data$Contract <- as.factor(data$Contract)
data$PaperlessBilling <- as.factor(data$PaperlessBilling)
data$PaymentMethod <- as.factor(data$PaymentMethod)
data$Status <- as.factor(data$Status)
data$Tenure <- scale(data$Tenure)
data$MonthlyCharges <- scale(data$MonthlyCharges)
data$TotalCharges <- scale(data$TotalCharges)
sum(is.na(data))

## [1] 0

data <- na.omit(data)

#Exploratory Analysis

ggplot(data, aes(x = Tenure, fill = Status)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Distribution of Tenure by Customer Status", x = "Tenure (months)", y = "Count") +
  theme_minimal()

The graph titled “Tenure by Customer Status” shows that loyalty to Regork Premium grows over time. Customers who stay subscribed longer tend to invest more time and money into the service. The data reveals that those who have used the Premium Pack for 70 months or more make up the largest portion of users. In contrast, many former subscribers tend to cancel within the first few months of using the service.

ggplot(data, aes(x = TotalCharges)) +
  geom_histogram(bins = 30, fill = "cyan", color = "pink") +
  labs(title = "Distribution of Total Charges by Price Range", 
       x = "Total Charges ($)", y = "Count") +
  theme_minimal()

The illustration clearly shows that most of our customers are comfortable spending close to $1,000 on a TV. Around 1,000 people surveyed said they consider $970 a fair price. The next most popular price point was $1,050, which also had strong support. Based on these insights, our team has chosen to price our Smart TVs between $1,050 and $1,250, depending on the features and memory, to align with what customers are looking for.

ggplot(data, aes(x = PaymentMethod)) +
  geom_bar(fill = "purple") +
  labs(title = "Distribution of Payment Methods", 
       x = "Payment Method", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 360, hjust = 1))

According to the bar chart above on payment method frequency, it is surprising to see that electronic checks are the most frequently used payment method among the four options. Mailed checks come in second, while the two automatic methods—bank transfer and credit—have roughly the same number of users.

ggplot(data, aes(x = SeniorCitizen, fill = DeviceProtection)) +
  geom_bar(position = "dodge") +
  labs(title = "Senior Citizens and Device Protection", x = "Senior Citizen", y = "Count")

If older customers are more likely to choose Device Protection because they’re more aware of potential risks, this opens up a great chance for targeted marketing.

We can boost our promotional efforts for Regork Premium by emphasizing that it includes Device Protection, specifically geared toward senior customers. This approach would not only offer them extra value and reassurance but also encourage more of them to choose our Smart Home TV and Premium package.

#Machine Learning

Logistic Regression

set.seed(42)
split_data <- initial_split(data, prop = 0.75, strata = Status)

train_data <- training(split_data)
test_data <- testing(split_data)

set.seed(42)  
cv_folds <- vfold_cv(train_data, v = 10, strata = Status)

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%  
  set_mode("classification")

logistic_resamples <- fit_resamples(logistic_model, Status ~ ., resamples = cv_folds)
logistic_resamples %>%
  collect_metrics()

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.803    10 0.00679 Preprocessor1_Model1
## 2 brier_class binary     0.138    10 0.00270 Preprocessor1_Model1
## 3 roc_auc     binary     0.838    10 0.00653 Preprocessor1_Model1

The model performs well, particularly in distinguishing between different classes, as shown by its high ROC AUC score. Its accuracy and Brier score further suggest that the predictions are reliable and the probability estimates are reasonably well-calibrated.

Multivariate Adaptive Regression Spline (MARS)

set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = "Status")
train_data <- training(data_split)
test_data <- testing(data_split)

mars_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes())

set.seed(123)
mars_kfolds <- vfold_cv(train_data, v = 7, strata = "Status")

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification") %>%
  set_engine("earth")

mars_grid <- grid_regular(num_terms(range = c(1, 20)), prod_degree(), levels = 5)

mars_wf <- workflow() %>%
  add_recipe(mars_recipe) %>%
  add_model(mars_mod)

mars_results <- mars_wf %>%
  tune_grid(resamples = mars_kfolds, grid = mars_grid)

mars_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

## # A tibble: 10 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        20           1 roc_auc binary     0.847     7 0.00365 Preprocessor1_M…
##  2        15           1 roc_auc binary     0.846     7 0.00374 Preprocessor1_M…
##  3        20           2 roc_auc binary     0.842     7 0.00481 Preprocessor1_M…
##  4        10           1 roc_auc binary     0.841     7 0.00504 Preprocessor1_M…
##  5        15           2 roc_auc binary     0.841     7 0.00459 Preprocessor1_M…
##  6        10           2 roc_auc binary     0.835     7 0.00703 Preprocessor1_M…
##  7         5           1 roc_auc binary     0.823     7 0.00549 Preprocessor1_M…
##  8         5           2 roc_auc binary     0.816     7 0.00496 Preprocessor1_M…
##  9         1           1 roc_auc binary     0.5       7 0       Preprocessor1_M…
## 10         1           2 roc_auc binary     0.5       7 0       Preprocessor1_M…

The top-performing setups use 20 or 15 terms with a prod_degree of 1, consistently achieving high ROC AUC scores, which indicate strong classification performance. Raising the prod_degree slightly lowers performance, while using fewer than 10 terms significantly weakens the model’s ability to recognize key patterns in the data. Configurations with only one term perform poorly and underfit the data, producing a ROC AUC of 0.5—no better than random guessing.

autoplot(mars_results)

The MARS model delivers the best results with 15 to 20 terms and an interaction degree of 1, showing high accuracy, a low Brier score, and a strong ROC AUC. This configuration strikes a good balance between model complexity and predictive power, making it well-suited for real-world use. Adding more interaction complexity offers little improvement and isn’t needed for this dataset.

mars_best <- select_best(mars_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
  add_model(mars_mod) %>%
  add_recipe(mars_recipe) %>%
  finalize_workflow(mars_best)

mars_final_model <- mars_final_wf %>%
  fit(data = train_data)

mars_final_model %>%
  extract_fit_parsnip() %>%
  vip(10, type = "rss")

The analysis highlights that behavioral and financial factors—such as tenure, total charges, and monthly fees—are key drivers of customer retention. Moreover, value-added services like online security and tech support, along with long-term contracts, significantly contribute to higher customer satisfaction and lower churn rates.

#Random Forest Model

set.seed(123)
rf_split <- initial_split(data, prop = 0.7, strata = Status)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)

rf_recipe <- recipe(Status ~ ., data = rf_train) %>%
  step_dummy(all_nominal(), -all_outcomes())   

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

set.seed(123)
rf_kfold <- vfold_cv(rf_train, v = 5)

results <- fit_resamples(rf_mod, rf_recipe, rf_kfold)

collect_metrics(results)

## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.800     5 0.00591 Preprocessor1_Model1
## 2 brier_class binary     0.137     5 0.00350 Preprocessor1_Model1
## 3 roc_auc     binary     0.842     5 0.0105  Preprocessor1_Model1

The model is performing effectively, with an accuracy of 80%, a solid Brier score suggesting good calibration, and a strong ROC AUC score indicating strong discriminative ability. Small adjustments in calibration or feature engineering could further enhance performance.

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

rf_param_grid <- grid_regular(
  trees(range = c(200, 2000)),  
  mtry(range = c(2, 20)),        
  min_n(range = c(2, 15)),       
  levels = 5                     
)

rf_tune_results <- tune_grid(
  rf_mod, 
  rf_recipe, 
  resamples = rf_kfold, 
  grid = rf_param_grid
)

## Warning: No tuning parameters have been detected, performance will be evaluated
## using the resamples with no tuning. Did you want to [tune()] parameters?

show_best(rf_tune_results)

## Warning in show_best(rf_tune_results): No value of `metric` was given;
## "roc_auc" will be used.

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.841     5  0.0108 Preprocessor1_Model1

This model is effective for the classification task, and additional optimization could further enhance its performance.

rf_best <- select_best(rf_tune_results, metric = "roc_auc")

final_rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best)

final_rf_fit <- final_rf_wf %>% fit(data = rf_train)

final_rf_fit %>% 
  extract_fit_parsnip() %>%
  vip::vip(num_features = 10) + 
  theme_minimal() + 
  labs(
    title = "Top 10 Important Features for Random Forest Model",
    x = "Importance",
    y = "Features"
  )

The feature importance analysis shows that customer commitment (tenure and contract type), financial investment (monthly and total charges), and service engagement (internet services, online security) are the key factors in predicting customer retention. These findings offer practical strategies for businesses to reduce churn and increase customer satisfaction. By prioritizing the retention of long-term, loyal customers and offering value-added services, businesses can strengthen customer loyalty and improve their profitability.

Confusion Matrix

rf_mod <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

rf_predictions <- rf_mod %>%
  fit(Status ~ ., data = rf_train) %>%
  predict(rf_test)

rf_predictions %>%
  bind_cols(rf_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1379  251
##    Left        161  306

The Random Forest model performs fairly well with an accuracy of 81.2%, but there is potential to improve its ability to predict customer churn. Emphasis should be placed on boosting precision and recall for the “Left” class, particularly if this outcome is crucial for the business. Additional adjustments to the model and methods for addressing class imbalance could further improve performance.

#Business Analysis & Conclusion #Important Predictor Variable In our report, we identified five critical variables: Tenure, Total Charges, Monthly Charges, Payment Method, and Online Security.

Through machine learning analysis, we found that Tenure is the most influential variable. It plays a crucial role in predicting how long our loyal customers will continue using our products. By analyzing and visualizing these variables, we gained valuable insights into their interactions, which improved the overall quality of our report.

#Predicted Revenue Loss per Month To calculate the potential monthly revenue loss if our product fails to gain enough customer support, our team performed detailed calculations:

Estimated Revenue Loss = Average Revenue per Customer × Number of Customers (Current or Targeted) = (1000+1250)/2 * 1000 = $1,125,000

#Implication By using RStudio, we identified a connection between senior customers and the need for device protection, allowing us to move forward with promoting the most suitable product or package to the right customer segment.

#Limitation The machine learning analysis of the Smart Home TV product reveals that the uncertainty of customer demand significantly affects predictions of product performance and customer retention. The 80% accuracy of the Logistic Regression model highlights that, although helpful, the model is not fully dependable for forecasting demand, which is critical for product success. This limitation, along with budget and time constraints, underscores the importance of balancing model complexity with resource allocation. Going forward, while using more advanced techniques could enhance accuracy, the constraints on budget and time will likely continue to shape our approach.

Final Project

Truc Huynh, Khoi Pham

2025-04-22