Final Project | Regork Customer Retention Opportunities

Riley Brandenburg | 12/5/2023

Overview

Because retaining existing customers is critical in the telecommunications market, I analyzed Regork's customer data and built a predictive model that flags customers who are likely to churn. The goal is to let Regork act proactively, for example by offering targeted incentives, to retain at-risk customers.

Data Preparation & Exploratory Data Analysis:

Introduction: In this analysis, we aim to explore the customer dataset for Regork Telecom and identify patterns that may influence customer churn. The dataset includes information on services, demographics, and the customer status (churn or not churn).

Data Exploration Highlights: Identified that certain services have higher tenure than others, suggesting service-specific retention strategies. Observed relationships between demographics and service usage, which can guide targeted marketing efforts. Explored the baseline percentage of churn and non-churn customers to establish a benchmark. Used visualizations and correlation analysis to identify predictor variables with potential relationships with customer status. Addressed missing values and ensured consistency in categorical variable levels during data cleaning.

Machine Learning:

Model Evaluation: Split the data into training and test sets. Evaluated three algorithms (logistic regression, decision tree, random forest) using 5-fold cross-validation, focusing on AUC as the primary metric. Selected the random forest algorithm as the optimal model based on AUC. Analyzed confusion matrices to understand model behavior.
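
The comparison summarized above could be sketched roughly as follows. This is a minimal illustration, not the report's actual modeling code: it runs on a small simulated data frame (`toy`, whose columns and churn pattern are made up) so that it is self-contained, it uses default hyperparameters, and it assumes the `rpart` and `ranger` engine packages are installed.

```r
library(tidymodels)

# Simulated stand-in for the customer data (hypothetical values throughout)
set.seed(123)
n <- 600
toy <- tibble(
  Tenure         = sample(1:72, n, replace = TRUE),
  MonthlyCharges = runif(n, 20, 120),
  Contract       = factor(sample(c("Month-to-month", "One year", "Two year"),
                                 n, replace = TRUE))
)
# Made-up churn mechanism: shorter tenure and higher charges raise churn risk
p_left <- plogis(1 - 0.05 * toy$Tenure + 0.01 * toy$MonthlyCharges)
toy$Status <- factor(ifelse(runif(n) < p_left, "Left", "Current"),
                     levels = c("Current", "Left"))

# 5-fold cross-validation, stratified on the outcome
folds <- vfold_cv(toy, v = 5, strata = Status)

# The three candidate algorithms, each with default settings
models <- list(
  logistic      = logistic_reg() %>% set_engine("glm"),
  decision_tree = decision_tree() %>% set_engine("rpart") %>%
    set_mode("classification"),
  random_forest = rand_forest() %>% set_engine("ranger") %>%
    set_mode("classification")
)

# Fit every model on the same folds and collect the mean AUC
cv_auc <- bind_rows(lapply(names(models), function(name) {
  fit_resamples(models[[name]], Status ~ ., resamples = folds,
                metrics = metric_set(roc_auc)) %>%
    collect_metrics() %>%
    mutate(model = name)
}))

cv_auc %>%
  select(model, mean, std_err) %>%
  arrange(desc(mean))
```

Using the same `folds` object for all three models keeps the comparison fair, since each algorithm is scored on identical held-out rows.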

Feature Importance: Plotted and interpreted feature importance for the random forest model. Identified influential predictors, such as tenure, service usage, and contract type.
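
As a rough sketch of how such an importance ranking can be produced, the example below fits a random forest with impurity-based importance and passes the underlying `ranger` object to `vip`. The data frame `toy`, its `Noise` column, and the choice of 200 trees are assumptions made for illustration, not values taken from the report.

```r
library(tidymodels)
library(vip)

# Simulated data (hypothetical): Noise is unrelated to the outcome by design
set.seed(123)
n <- 600
toy <- tibble(
  Tenure         = sample(1:72, n, replace = TRUE),
  MonthlyCharges = runif(n, 20, 120),
  Contract       = factor(sample(c("Month-to-month", "One year", "Two year"),
                                 n, replace = TRUE)),
  Noise          = rnorm(n)
)
p_left <- plogis(1 - 0.05 * toy$Tenure + 0.01 * toy$MonthlyCharges)
toy$Status <- factor(ifelse(runif(n) < p_left, "Left", "Current"),
                     levels = c("Current", "Left"))

# importance = "impurity" asks ranger to record variable importance scores
rf_fit <- rand_forest(trees = 200) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification") %>%
  fit(Status ~ ., data = toy)

# Importance scores as a tibble, then the same scores as a bar chart
importance_tbl <- vi(extract_fit_engine(rf_fit))
importance_tbl
vip(extract_fit_engine(rf_fit), num_features = 4)
```

Impurity-based scores can favor continuous predictors, so the ranking is best read as a guide to where to look rather than as a precise measure of effect size.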

Business Insights: Rated predictors based on relative importance and identified focus areas for retention efforts. Selected customers predicted to leave for further analysis.

Business Analysis & Conclusion:

Predicted Loss and Incentive Proposal: Estimated the predicted loss in revenue per month if no action is taken using the model. Proposed an incentive scheme based on the model’s insights, considering the cost of incentives against the benefits of retaining customers. Justified the proposal through a cost-benefit analysis.
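
The arithmetic behind a cost-benefit comparison of this kind can be illustrated with a small sketch. Every number below is hypothetical: `at_risk`, `discount_rate`, and `retention_rate` are names and values invented for the example, and the sketch assumes that every flagged customer would leave if no action were taken.

```r
# Hypothetical customers a churn model has flagged as likely to leave
at_risk <- data.frame(
  customer_id    = 1:5,
  MonthlyCharges = c(70, 95, 45, 110, 80)
)

discount_rate  <- 0.20  # assumed incentive: 20% off the monthly bill
retention_rate <- 0.50  # assumed share of flagged customers who accept and stay

# Monthly revenue lost if every flagged customer leaves and nothing is done
predicted_loss <- sum(at_risk$MonthlyCharges)

# Revenue kept from customers who stay, and what the discount costs on that revenue
revenue_retained <- retention_rate * predicted_loss
incentive_cost   <- revenue_retained * discount_rate
net_benefit      <- revenue_retained - incentive_cost

c(predicted_loss = predicted_loss,
  incentive_cost = incentive_cost,
  net_benefit    = net_benefit)
```

With these made-up inputs the sketch reports a predicted loss of 400 per month, an incentive cost of 40, and a net benefit of 160. The incentive is worthwhile under these assumptions whenever `net_benefit` is positive.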

Packages Required

completejourney - provides access to data sets characterizing household-level transactions over one year from a group of 2,469 households.

ggplot2 - data visualization package.

tidymodels - allows you to perform discrete parts of the ML workflow with discrete packages.

dplyr - aims to provide a function for each basic verb of data manipulation.

pdp - a package for constructing partial dependence plots and individual conditional expectation curves.

vip - a package for constructing variable importance plots.

tidyverse - an R programming package that helps to transform and better present data.

Data Preparation

#Load the packages I plan to use:
library(completejourney)
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vip)
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(pdp)
## 
## Attaching package: 'pdp'
## 
## The following object is masked from 'package:purrr':
## 
##     partial
#Importing Required Data:
library(readr)
customer_retention <- read_csv("~/Downloads/customer_retention.csv")
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(customer_retention)

#Data Organization:
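# Read from the same path used above so the script does not depend on the working directory
CustomerRetention <- read.csv("~/Downloads/customer_retention.csv")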
CustomerRetention <- CustomerRetention %>% 
  dplyr::mutate(Status = as.factor(Status))

#Data Splitting:
set.seed(123)
split_retention <- initial_split(CustomerRetention, prop = .7, strata = "Status")
train_retention <- training(split_retention)
test_retention  <- testing(split_retention)

dim(test_retention)
## [1] 2100   20
dim(train_retention)
## [1] 4899   20
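
One way to confirm that a stratified split preserved the churn rate is to compare class shares across the full, training, and test sets. The sketch below does this on simulated data so that it runs on its own. The `toy` data frame, the `class_share()` helper, and the 73/27 class mix are assumptions for illustration only.

```r
library(rsample)

# Simulated outcome with an assumed class mix (not the report's actual rate)
set.seed(123)
toy <- data.frame(
  Tenure = sample(1:72, 1000, replace = TRUE),
  Status = factor(sample(c("Current", "Left"), 1000,
                         replace = TRUE, prob = c(0.73, 0.27)))
)

toy_split <- initial_split(toy, prop = 0.7, strata = Status)

# Hypothetical helper: share of each Status level in a data frame
class_share <- function(df) prop.table(table(df$Status))

shares <- rbind(
  full  = class_share(toy),
  train = class_share(training(toy_split)),
  test  = class_share(testing(toy_split))
)
round(shares, 3)
```

The same helper could be applied to `CustomerRetention`, `train_retention`, and `test_retention`. The share of the churn class in the full data also serves as the baseline against which model performance is judged.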

Visualizing the Numbers

cr <- read.csv("~/Downloads/customer_retention.csv")
cr <- na.omit(cr)


ggplot(cr, aes(InternetService)) +
  geom_bar(fill = "aliceblue", color = "blue1") +
  facet_wrap(~Contract) +
  coord_flip() +
  ggtitle("Customer Internet Service and Length of Contract") +
  labs(y = "Count of Customers", x = "Customer Internet Service")

ggplot(cr, aes(Tenure)) +
  geom_bar(fill = "darkslategray3") +
  facet_wrap(~Contract) +
  ggtitle("Comparison of Tenure and Type of Contract") +
  theme(plot.title = element_text(hjust = 0)) +
  labs(y = "Count of Customers", x = "Length of Tenure (Months)")

Interpreting the Data

From these visualizations, I identified three major themes and recommendations:

  1. By analyzing the relationship between customers’ internet service and their contract type, we can see that the Fiber Optic service accounts for the most month-to-month contracts. This information helps in understanding market segmentation and how Regork can best target different internet service segments. Based on these findings, I would recommend retention campaigns aimed specifically at current Fiber Optic customers.

  2. The visualizations show that DSL customers account for the most one-year contracts and customers with no internet service account for the most two-year contracts. This analysis is crucial in determining which type of contract Regork is most interested in pursuing.

  3. It’s also crucial to understand how the choice of contract relates to a customer’s tenure. We can see that new customers are most likely to begin with a month-to-month contract and that two-year contracts become more common as tenure with the company increases.
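
Readings like these can be backed with a table of counts and within-service shares rather than relying on bar heights alone. The sketch below shows the idea on a handful of made-up rows. The `toy` data frame and its category labels are assumptions, and the resulting shares say nothing about the real data.

```r
library(dplyr)

# Made-up rows standing in for the customer data
toy <- data.frame(
  InternetService = c("DSL", "DSL", "DSL", "Fiber optic",
                      "Fiber optic", "Fiber optic", "No", "No"),
  Contract        = c("One year", "Month-to-month", "One year", "Month-to-month",
                      "Month-to-month", "Two year", "Two year", "Two year")
)

# Count customers per service and contract, then convert to shares within service
contract_mix <- toy %>%
  count(InternetService, Contract, name = "customers") %>%
  group_by(InternetService) %>%
  mutate(share = customers / sum(customers)) %>%
  ungroup()

contract_mix
```

Running the same pipeline on `cr` would give the exact counts behind the first chart, which makes claims such as "accounts for the most month-to-month contracts" easy to verify.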