Final Project

Introduction

Goal: Analyze customer data and find a model to predict whether customers will leave in the future to help Regork take preventative action for retention of customers. Through this, we sorted through the data to identify any initial trends. Then we split, tested, and analyzed the data through 3 different models. We used a logistic regression, multivariate adaptive regression splines, and a random forest model to analyze the data set. Each model performance was compared by the value of area under the curve, looking for the model which had the highest metric.

Data Preparation

Packages used:

library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(kernlab)

Tidy Data Manipulation:

#Importing Data
data <- read.csv("customer_retention.csv")
data <- data %>%
  mutate(Status = factor(Status)) %>%
  na.omit()
#Splitting the Data
set.seed(123)
split <- initial_split(data, prop = .7, strata = Status)
retention_train <- training(split)
retention_test <- testing(split)
r_recipe <- recipe(Status ~ ., data = retention_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

Exploratory Analysis

Tenure Data Analysis

How long do customers stay under different contracts? We found that the largest churn of customers occurred in the month-to-month contracts where as the most loyal customers had 2 year contracts.

ggplot(data, aes(Tenure)) +   
  geom_histogram(fill = "red") +   
  facet_wrap(~Contract) +   
  ggtitle("Tenure vs. Type of Contract ") +   
  theme(plot.title = element_text(hjust = .5)) +   
  labs(y = "Count of Customers", x = "Length of Tenure (months)")

Internet Service Data Analysis

What customer group makes up most of Regork’s internet services? We found that most senior customers, female or male had fiber optic internet service. For the non-senior females and males, DSL and fiber optic internet were the more popular options; however, some had no internet service at all. Regork should target young males and females both internet services and focus on fiber optic for seniors.

#Specifying Data
male_senior <- subset(data, Gender == "Male" & SeniorCitizen == 1)
male_non_senior <- subset(data, Gender == "Male" & SeniorCitizen == 0)
female_senior <- subset(data, Gender == "Female" & SeniorCitizen == 1)
female_non_senior <- subset(data, Gender == "Female" & SeniorCitizen == 0)

male_senior_services <- male_senior$InternetService
female_senior_services <- female_senior$InternetService
female_non_senior_services <- female_non_senior$InternetService
male_non_senior_services <- male_non_senior$InternetService

streaming_data <- data.frame(Subset = c(rep("Male Senior", length(male_senior_services)),              
                                      rep("Female Senior", length(female_senior_services)),              
                                      rep("Male Non-Senior", length(male_non_senior_services)),              
                                      rep("Female Non-Senior", length(female_non_senior_services))),
                             InternetServiceStatus = c(male_senior_services, female_senior_services, male_non_senior_services, female_non_senior_services))
streaming_data <- streaming_data %>%
  mutate(InternetServiceStatus = factor(InternetServiceStatus, levels = c("DSL", "Fiber optic", "No"), labels = c("DSL", "Fiber optic", "None")))
#Plot
  ggplot(streaming_data, aes(x = InternetServiceStatus, fill = Subset)) +   
  geom_bar(position = "dodge", stat = "count") +   
  labs(title = "Internet Service Distribution",        
       x = "Internet Service", y = "Number of Customers", fill = "Customer Type") +   
  theme(axis.text.x = element_text(hjust = .5, size = 10, color = "black")) +
  scale_y_continuous(limits = c(0,1150))

Demographics Analysis

What are the customer demographics Regork should focus on? We see that most of Regork’s phone service customers are single with no children males and females. However, we also see a large amount of customers that have a partner or has both a partner and children. Based off of this, Regork may target family phone service plans or possible multi-line deals while keeping single line prices moderate.

gender_data <- data %>%
  mutate(PhoneService = factor(PhoneService, levels = c("Yes", "No"))) %>%
  mutate(Dependents = factor(Dependents, levels = c("Yes", "No"), labels = c("Has Dependents", "Does Not Have Dependents"))) %>%
  mutate(Partner = factor(Partner, levels = c("Yes", "No"), labels = c("Has a Partner", "Does Not Have a Partner"))) %>%
  count(PhoneService, Gender, Dependents, Partner) %>%
  complete(Gender, PhoneService, fill = list(n = 0)) %>%
  ggplot(aes(PhoneService, n, fill = Gender)) + geom_col(position = "dodge") +
  labs(y = "Number of Customers", x = "Has Phone Service?") +
  facet_wrap(~Dependents + Partner)
gender_data

Revenue Data Analysis

What services are generating the most monthly revenue for Regork? Based off this graph, we determined that Regork makes the most for customers who stream both movies and TV. However, this only raises the average monthly charge by $10 since those who do not stream movies or TV still pay about $60 per month. Streaming both increases their average monthly charge by about $10, and only streaming movies and not TV brings the charge up about $15.

glimpse(data$StreamingTV)

##  chr [1:6988] "No" "No" "No" "No" "No" "Yes" "Yes" "No" "Yes" "No" "No" ...

contract_data <- data %>%
  mutate(StreamingMovies = factor(StreamingMovies, levels = c("Yes", "No", "No internet service"),
                                  labels = c("Streams Movies", "Does Not Stream Movies", "No Internet Service"))) %>%
  mutate(StreamingTV = factor(StreamingTV, levels = c("Yes", "No", "No internet service"),
                              labels = c("Yes", "No", "No Internet Service"))) %>%
  count(StreamingMovies, StreamingTV, MonthlyCharges) %>%
  ggplot(aes(StreamingTV, MonthlyCharges)) + geom_boxplot(position = "dodge") +
  labs(y = "Monthly Charges", x = "Does the Customer Stream TV?", caption = "Average Monthly Charge Plotted and Labeled") +
  scale_y_continuous(labels=scales::dollar_format()) +
  theme(axis.text.x = element_text(color = "black", size = 10)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  stat_summary(fun=mean, colour="green", geom="point", 
               shape=16, size=3, show.legend=FALSE) +
  stat_summary(fun.y=mean, colour="black", geom="text", show_guide = FALSE, 
               vjust=-1.5, aes(label=round(..y.., digits=2)))+
  facet_wrap(~StreamingMovies)
contract_data

Machine Learning

Linear Regression

set.seed(123)
 
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)

logistic_reg() %>% 
  fit_resamples(Status ~ ., kfolds) %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc")

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1

In our linear regression, we find AUC to be .845.

final_fit <- logistic_reg() %>%
  fit(Status ~ ., data = retention_train)
vip::vip(final_fit)

The 3 most important predictors according to this model is Tenure, 2 year contracts, and 1 year contracts.

MARS (Multivariate Adaptive Regression Splines)

set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification")

mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 25)

m_wf <- workflow() %>%
  add_recipe(r_recipe) %>%
  add_model(mars_mod)

tuning_results <- m_wf %>%
  tune_grid(resamples = kfolds, grid = mars_grid)

tuning_results %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

## # A tibble: 50 × 8
##    num_terms prod_degree .metric .estimator  mean     n std_err .config         
##        <int>       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
##  1        20           1 roc_auc binary     0.850     5 0.00486 Preprocessor1_M…
##  2        19           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  3        21           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  4        22           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  5        23           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  6        25           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  7        26           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  8        27           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
##  9        28           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## 10        30           1 roc_auc binary     0.849     5 0.00486 Preprocessor1_M…
## # ℹ 40 more rows

Here we find that the minimum number of terms as 20 generates the highest AUC at .85.

A graphical representation of the tuned MARS model:

autoplot(tuning_results)

After analyzing what model works best in the MARS model we can extract the most important features.

m_best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")

m_final_wf <- workflow() %>%
  add_recipe(r_recipe) %>%
  add_model(mars_mod) %>%
  finalize_workflow(m_best_hyperparameters)

m_final_fit <- m_final_wf %>%
  fit(data = retention_train)
vip::vip(final_fit)

Here we find that Total Charges, Tenure, and Monthly Charges are the 3 most influential predictors.

Random Forest Model

set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)

rf_mod <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune(),
  min_n = tune()
  ) %>%
  set_engine("ranger", importance = "impurity")

rf_hyper_grid <- grid_regular(
  trees(range = c(5, 500)),
  mtry(range = c(2, 50)),
  min_n(range = c(1, 20)),
  levels = 5
  )

set.seed(123)
rf_results <- tune_grid(rf_mod, r_recipe, resamples = kfolds, grid = rf_hyper_grid)

show_best(rf_results, metric = "roc_auc")

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    14   128    20 roc_auc binary     0.840     5 0.00350 Preprocessor1_Model1…
## 2     2   252    20 roc_auc binary     0.840     5 0.00458 Preprocessor1_Model1…
## 3     2   500    15 roc_auc binary     0.840     5 0.00446 Preprocessor1_Model0…
## 4     2   376    20 roc_auc binary     0.840     5 0.00468 Preprocessor1_Model1…
## 5     2   500     1 roc_auc binary     0.840     5 0.00413 Preprocessor1_Model0…

Here we see that our most optimal random forest model is one with 128 trees, with an AUC of .840.

rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")

final_rf_wf <- workflow() %>%
  add_recipe(r_recipe) %>%
  add_model(rf_mod) %>%
  finalize_workflow(rf_best_hyperparameters)

rf_final_fit <- final_rf_wf %>%
  fit(data = retention_train)

rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 20)

Here we see that Tenure, Total Charges, and Monthly Charges are the most influential predictors.

Confusion Matrix

To better understand how our best model (MARS) performs, we decided to build a confusion matrix to test any false negatives and positives.

cm <- m_final_wf %>%
  fit(data = retention_train)

cm %>% predict(retention_test) %>% 
  bind_cols(retention_test %>% select(Status)) %>%
  conf_mat(truth = Status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1383  242
##    Left        157  315

Business Analysis

Most Influential Predictor Variables:

Across our 3 models, we found that Tenure, Total Charges, Monthly Charges, 2 year contracts, and 1 year contracts were the most influential predictors for this data set.

Tenure: This has been a top 3 influential feature across all 3 models. It goes without saying that customers who have been with Regork the longest have lower churn and are more likely to stay customers in the telecommunications services. Regork should target these individuals and cater their telecommunication services to those needs when expanding into this space.
Charges (Total and Monthly): Customers are often sensitive to price, and being one of the top features that impacts the status of a customer, Regork should be emphasizing the balance of charges. There must be more data analysis to better understand what Regork should be charging for services to widen their customer base. For our analysis, we know that it is an important feature to be further looked upon when trying to optimize business and expand the customer base.
Contract Type: Similarly to Tenure, the contract type was another significant predictor in the status of a customer. Customers who have longer contracts have lower churn, which emphasizes the importance of implementing longer term contracts to customers for Regork. We suggest eliminating the option of a monthly contract to help stabilize and reduce churn in customers. Additionally, more analysis may be required to understand at what prices and services certain contracts should be to increase the customer base while preventing any current customer loss.

Conclusion:

We believe that Regork has the capability to make the telecommunication sector a profitable sector of the company. As the financial teams and data analysis teams work to further understand the more influential predictors of customer status for actual implementation, we see the business being successful. In this regard, Regork should focus on getting customers into long term contracts for their services to reduce churn. With our confusion matrix, we are able to estimate that about 315 customers may leave. This could result in lost revenues ranging from $7,122.15 to $27,868.05 per month. With this, Regork needs to focus on customer retention through longer contracts and the finance team may need to conduct a deeper analysis into customer price sensitivity. Finally, Regork must target their loyal customer base which has longer tenure to ensure a successful sector in telecommunications.