Introduction

Data Preparation & Exploratory Data Analysis

Data Cleaning

I started by preparing the data. I changed the partner, dependents, and status to display 1 and 0. This will allow me to run code in a more efficient way.

retention <- retention %>%
  mutate(
    Partner = ifelse(Partner == "Yes", 1, 0),
    Dependents = ifelse(Dependents == "Yes", 1, 0),
    Status = ifelse(Status == "Left", 1, 0)  # Encode "Left" as 1 and "Current" as 0
  )

Calculate and visualize the baseline churn rate

I started by calculating what percentage of people were current customers. I found that 73.5% of the records are for current clients while 26.5% left.

retention %>%
  count(Status) %>%
  mutate(percentage = n / sum(n) * 100)

Create graphs to compare churn rate with factors

Next I created a sequence of graphs to see which factors had an effect on whether or not people stayed as clients.

# Gender and churn
ggplot(retention, aes(x = Gender, fill = factor(Status))) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Gender", x = "Gender", y = "Proportion")

# Contract Type and churn
ggplot(retention, aes(x = Contract, fill = factor(Status))) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Contract Type", x = "Contract Type", y = "Proportion")

# Senior Citizen and Churn
ggplot(retention, aes(x = factor(SeniorCitizen), fill = factor(Status))) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Senior Citizen Status", x = "Senior Citizen (0 = No, 1 = Yes)", y = "Proportion")

# Payment Method and Churn
ggplot(retention, aes(x = PaymentMethod, fill = factor(Status))) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Payment Method", x = "Payment Method", y = "Proportion")

# Churn Rate by Tenure
ggplot(retention, aes(x = Tenure, fill = factor(Status))) +
  geom_histogram(binwidth = 5, position = "fill") +
  labs(title = "Churn Rate by Tenure (Months)", x = "Tenure (Months)", y = "Proportion")

# Create a variable for the number of services used by each customer
retention <- retention %>%
  mutate(
    NumServices = rowSums(
      select(., PhoneService, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies) == "Yes", 
      na.rm = TRUE
    )
  )

# Number of Services vs. Churn
ggplot(retention, aes(x = NumServices, fill = factor(Status))) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Number of Services Used", x = "Number of Services", y = "Proportion")

#Partner and Dependents Impact
stayed_customers <- retention %>% filter(Status == 0)
partner_proportion_stayed <- stayed_customers %>%
  summarise(partner_rate = mean(Partner)) %>%
  pull(partner_rate)

dependents_proportion_stayed <- stayed_customers %>%
  summarise(dependents_rate = mean(Dependents)) %>%
  pull(dependents_rate)

partner_proportion_stayed
dependents_proportion_stayed
stayed_customers <- retention %>%
  filter(Status == 0) %>%  # Only customers who stayed
  group_by(Tenure) %>%
  summarise(
    Partner_Proportion = mean(Partner),
    Dependents_Proportion = mean(Dependents)
  ) %>%
  pivot_longer(cols = c(Partner_Proportion, Dependents_Proportion), 
               names_to = "Category", 
               values_to = "Proportion")
ggplot(stayed_customers, aes(x = Tenure, y = Proportion, color = Category)) +
  geom_line(size = 1) +
  labs(
    title = "Proportion of Customers Who Stayed by Tenure",
    x = "Tenure (Months)",
    y = "Proportion",
    color = "Category"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("Partner_Proportion" = "blue", 
                                "Dependents_Proportion" = "green"),
                     labels = c("Partners", "Dependents"))

From this we can see that there is nearly not difference between churn rate between genders.

This data is exactly how you would expect it. As contracts increase in length customers are more likely to stay. This makes sense because if you are willing to sign a long term contract you are likely invested in staying as a customer. My takeaway from this is that we should focus on retaining our month-to-month clients.

From this we can see that senior citizens are more likely to leave. This isn’t overly surprising because of their age and they are more than likely month-to-month customers. However, this probably isn’t a group we should heavily focus on keeping.

This is a very surprising graph as it shows that those that use an electronic check are over twice as likely to leave. This is a good demographic to target and attempt to keep moving forward.

This shows that as customer tenure increased their churn rate decreases. This is what you would expect because as a long time customer you are more likely to stay. With that said, a focus should be put on retaining the new customers in their first year.

Unsurprisingly as people use more and more services they are less likely to leave. This makes sense as people with lots of services are less volatile. I would recommend focusing on people with 0-2 services to attempt to lock them in long term.

The next comparison I wanted to see was between partners and dependents. I wanted to see How likely people who have them were to stay over time. What I found is that people with dependents are more likely to stay in the long term. This makes sense as they are more likely to try and keep consistent spending.

Calculate the averages for customers who left and stayed

I final thing I wanted to analyze was the average charges for people who stayed and left. What i found was the the average monthly charges of people who were current customers was $61.20 while past customers was $74.50. This is obviously a major reason that some of these people decided to leave.

charges_summary <- retention %>%
  group_by(Status) %>%
  summarise(
    AvgTotalCharges = mean(TotalCharges, na.rm = TRUE),
    AvgMonthlyCharges = mean(MonthlyCharges, na.rm = TRUE)
  )
print(charges_summary)

Machine Learning

The first step in the machine learning process was to create training and test sets and then create a preprocess.

# Split the data into training and test sets
set.seed(123)
data_split <- initial_split(retention, prop = 0.8, strata = Status)
train_data <- training(data_split)
test_data <- testing(data_split)

# Create a recipe for preprocessing
recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%  # One-hot encoding for categorical variables
  step_normalize(all_numeric(), -all_outcomes())  # Normalizing numerical variables
prepped_recipe <- prep(recipe, training = train_data)
train_data_processed <- bake(prepped_recipe, new_data = train_data)
test_data_processed <- bake(prepped_recipe, new_data = test_data)
# Ensure 'Status' is a factor
train_data_processed$Status <- as.factor(train_data_processed$Status)
test_data_processed$Status <- as.factor(test_data_processed$Status)
str(train_data_processed$Status)

After this I created my three separate models.

#Logistic regression model
log_reg_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
log_reg_fit <- fit(log_reg_model, Status ~ ., data = train_data_processed)
log_reg_preds <- predict(log_reg_fit, new_data = test_data_processed, type = "prob")

#Decision tree model
dt_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
dt_fit <- fit(dt_model, Status ~ ., data = train_data_processed)
dt_preds <- predict(dt_fit, new_data = test_data_processed, type = "prob")

#Random forest model
rf_model <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
rf_fit <- fit(rf_model, Status ~ ., data = train_data_processed)
rf_preds <- predict(rf_fit, new_data = test_data_processed, type = "prob")

I then created a confusion matrix to track the false negatives and false positives of each of the models.

#Confusion Matrix

# Convert probabilities to class predictions
log_reg_class_preds <- predict(log_reg_fit, new_data = test_data_processed, type = "class")$.pred_class
dt_class_preds <- predict(dt_fit, new_data = test_data_processed, type = "class")$.pred_class
rf_class_preds <- predict(rf_fit, new_data = test_data_processed, type = "class")$.pred_class

# Add predictions to test data for evaluation
test_data_processed$log_reg_preds <- log_reg_class_preds
test_data_processed$dt_preds <- dt_class_preds
test_data_processed$rf_preds <- rf_class_preds

# Confusion Matrices
log_reg_conf_matrix <- conf_mat(test_data_processed, truth = Status, estimate = log_reg_preds)
dt_conf_matrix <- conf_mat(test_data_processed, truth = Status, estimate = dt_preds)
rf_conf_matrix <- conf_mat(test_data_processed, truth = Status, estimate = rf_preds)

# Print confusion matrices
log_reg_conf_matrix
dt_conf_matrix
rf_conf_matrix

Next I ran a comparison to find out which model had the best results

#Cross-validation
cv_folds <- vfold_cv(train_data_processed, v = 5, strata = Status)
log_reg_cv <- fit_resamples(log_reg_model, Status ~ ., resamples = cv_folds)
log_reg_cv_metrics <- collect_metrics(log_reg_cv)
dt_cv <- fit_resamples(dt_model, Status ~ ., resamples = cv_folds)
dt_cv_metrics <- collect_metrics(dt_cv)
rf_cv <- fit_resamples(rf_model, Status ~ ., resamples = cv_folds)
rf_cv_metrics <- collect_metrics(rf_cv)

#Compare AUC for each model
log_reg_auc <- log_reg_cv_metrics %>% filter(.metric == "roc_auc")
dt_auc <- dt_cv_metrics %>% filter(.metric == "roc_auc")
rf_auc <- rf_cv_metrics %>% filter(.metric == "roc_auc")
auc_comparison <- bind_rows(log_reg_auc, dt_auc, rf_auc)
auc_comparison
##           Truth
## Prediction   0   1
##          0 943 154
##          1  85 218
##           Truth
## Prediction   0   1
##          0 972 222
##          1  57 150
##           Truth
## Prediction   0   1
##          0 951 181
##          1  78 191
## # A tibble: 3 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.844     5 0.00613 Preprocessor1_Model1
## 2 roc_auc binary     0.705     5 0.00822 Preprocessor1_Model1
## 3 roc_auc binary     0.840     5 0.00488 Preprocessor1_Model1

This data is very interesting to look out. We can see that the decision tree model is not an option because of it’s excessive false positives and low area under the curve. However, the logistic regression model and decision tree model are both viable options. They both have very similar false positives and false negatives meaning the decision should be based on the standard error and mean area under the curve. The decision tree model has a lower standard error which means it is more accurate. However, the logistic regression model has a higher mean area under the curve which is why I selected it as the best model. Next I ran the following code to find which factors are most important to the logistic regression model

# Extract coefficients from the logistic regression model
log_reg_coefs <- tidy(log_reg_fit$fit) %>%  # Access the glm object
  filter(term != "(Intercept)") %>%  # Remove the intercept
  mutate(abs_estimate = abs(estimate)) %>%  # Compute absolute values of coefficients
  arrange(desc(abs_estimate))  # Sort by absolute importance

# Visualize feature importance
ggplot(log_reg_coefs, aes(x = reorder(term, abs_estimate), y = abs_estimate)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(
    title = "Feature Importance for Logistic Regression",
    x = "Features",
    y = "Absolute Value of Coefficients"
  ) +
  theme_minimal()

# Display the sorted coefficients table
log_reg_coefs
## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_bar()`).

## # A tibble: 31 × 6
##    term                       estimate std.error statistic  p.value abs_estimate
##    <chr>                         <dbl>     <dbl>     <dbl>    <dbl>        <dbl>
##  1 Tenure                       -1.67     0.177      -9.41 4.80e-21        1.67 
##  2 MonthlyCharges               -1.50     1.06       -1.41 1.58e- 1        1.50 
##  3 NumServices                   1.18     0.669       1.76 7.86e- 2        1.18 
##  4 InternetService_Fiber.opt…    0.985    0.441       2.23 2.55e- 2        0.985
##  5 TotalCharges                  0.954    0.183       5.20 1.95e- 7        0.954
##  6 InternetService_No           -0.819    0.370      -2.21 2.68e- 2        0.819
##  7 Contract_Two.year            -0.648    0.0871     -7.44 1.04e-13        0.648
##  8 OnlineSecurity_Yes           -0.371    0.0996     -3.72 1.96e- 4        0.371
##  9 TechSupport_Yes              -0.332    0.100      -3.32 9.12e- 4        0.332
## 10 Contract_One.year            -0.279    0.0493     -5.66 1.48e- 8        0.279
## # ℹ 21 more rows

The most important factors to consider are tenure, monthly charges, and number or services. This matches up with what I found in my comparisons in the data analysis. Tenure is very important because customers become more loyal over time. People with lower prices are also much more likely to stay. Similarly, people with more services are more stable financially and overall.

Business Analysis & Conclusion

Factors

As a business manager I would focus my attention on tenure, monthly charges, number of services, and payment method. My model identified the first three is very important in terms of churn rate and my comparison’s found interesting things with in each of these. The company should focus on people who are with in their first year because there is plenty of turnover here. Similarly with people who have high monthly charges. As a business manager I would offer programs to lower monthly charges if people increase from month-to-month to year long contracts. This would make staying beneficial for them while also developing loyalty. I would also focus on providing good service to those with low number of services as well as those using electronic checks. These demographics have high turnover rates and need a reason to stay.

Future Leavers

I ran the following code to determine which customers are going to leave in the future.

# Predict Leavers
rf_test_preds <- predict(rf_fit, new_data = test_data_processed, type = "prob") %>%
  as.data.frame() %>%
  mutate(Status = test_data_processed$Status)
rf_test_preds <- as.data.frame(rf_test_preds)
at_risk_customers <- bind_cols(test_data_processed, rf_test_preds) %>%
  filter(`Status...8` == 0, .pred_1 > 0.5)

# View the result
at_risk_customers
## New names:
## • `Status` -> `Status...8`
## • `Status` -> `Status...38`
## # A tibble: 78 × 38
##    SeniorCitizen Partner Dependents   Tenure MonthlyCharges TotalCharges
##            <dbl>   <dbl>      <dbl>    <dbl>          <dbl>        <dbl>
##  1         2.28   -0.965     -0.654 -1.27            -0.648       -0.985
##  2        -0.438  -0.965     -0.654 -0.782            0.704       -0.468
##  3        -0.438  -0.965      1.53  -0.782            0.450       -0.548
##  4        -0.438  -0.965     -0.654 -1.15             0.852       -0.847
##  5         2.28   -0.965     -0.654 -1.27             0.244       -0.973
##  6        -0.438   1.04       1.53  -0.905            0.704       -0.626
##  7        -0.438  -0.965     -0.654 -1.27            -0.322       -0.981
##  8        -0.438  -0.965     -0.654 -0.823            0.824       -0.554
##  9        -0.438  -0.965     -0.654 -1.19             0.332       -0.900
## 10        -0.438  -0.965     -0.654 -0.00962          0.694        0.171
## # ℹ 68 more rows
## # ℹ 32 more variables: NumServices <dbl>, Status...8 <fct>, Gender_Male <dbl>,
## #   PhoneService_Yes <dbl>, MultipleLines_No.phone.service <dbl>,
## #   MultipleLines_Yes <dbl>, InternetService_Fiber.optic <dbl>,
## #   InternetService_No <dbl>, OnlineSecurity_No.internet.service <dbl>,
## #   OnlineSecurity_Yes <dbl>, OnlineBackup_No.internet.service <dbl>,
## #   OnlineBackup_Yes <dbl>, DeviceProtection_No.internet.service <dbl>, …

Revenue Loss

From the following code I was able to determine the future revenue loss each month if these customers are allowed to leave.

# Money Lost
predicted_customers <- bind_cols(test_data_processed, rf_test_preds)
at_risk_customers <- predicted_customers %>%
  filter(`Status...8` == 0, .pred_1 > 0.5)
predicted_loss <- sum(at_risk_customers$MonthlyCharges, na.rm = TRUE)
predicted_loss
## New names:
## • `Status` -> `Status...8`
## • `Status` -> `Status...38`
## [1] 27.89113

Inentive Scheme

I would propose offering year long programs at a decreased monthly rate to the predicted leavers. Many of these people are in their first few months and are also paying higher prices than many of the long term customers. While the company would lose a little bit on these decreased rates they would turn a profit because they will stay for months to years longer than expected.

Conclusion

I propose a discounted year long contract in order to keep customers who woudl otherwise leave. These customers have low tenure, month-to-month contracts, and high monthly charges and would benefit heavily from this deal.