Goal: Analyze customer data and find a model to predict whether customers will leave in the future to help Regork take preventative action for retention of customers. Through this, we sorted through the data to identify any initial trends. Then we split, tested, and analyzed the data through 3 different models. We used a logistic regression, multivariate adaptive regression splines, and a random forest model to analyze the data set. Each model performance was compared by the value of area under the curve, looking for the model which had the highest metric.
Packages used:
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(kernlab)
Tidy Data Manipulation:
#Importing Data
data <- read.csv("customer_retention.csv")
data <- data %>%
mutate(Status = factor(Status)) %>%
na.omit()
#Splitting the Data
set.seed(123)
split <- initial_split(data, prop = .7, strata = Status)
retention_train <- training(split)
retention_test <- testing(split)
r_recipe <- recipe(Status ~ ., data = retention_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
How long do customers stay under different contracts? We found that the largest churn of customers occurred in the month-to-month contracts where as the most loyal customers had 2 year contracts.
ggplot(data, aes(Tenure)) +
geom_histogram(fill = "red") +
facet_wrap(~Contract) +
ggtitle("Tenure vs. Type of Contract ") +
theme(plot.title = element_text(hjust = .5)) +
labs(y = "Count of Customers", x = "Length of Tenure (months)")
What customer group makes up most of Regork’s internet services? We found that most senior customers, female or male had fiber optic internet service. For the non-senior females and males, DSL and fiber optic internet were the more popular options; however, some had no internet service at all. Regork should target young males and females both internet services and focus on fiber optic for seniors.
#Specifying Data
male_senior <- subset(data, Gender == "Male" & SeniorCitizen == 1)
male_non_senior <- subset(data, Gender == "Male" & SeniorCitizen == 0)
female_senior <- subset(data, Gender == "Female" & SeniorCitizen == 1)
female_non_senior <- subset(data, Gender == "Female" & SeniorCitizen == 0)
male_senior_services <- male_senior$InternetService
female_senior_services <- female_senior$InternetService
female_non_senior_services <- female_non_senior$InternetService
male_non_senior_services <- male_non_senior$InternetService
streaming_data <- data.frame(Subset = c(rep("Male Senior", length(male_senior_services)),
rep("Female Senior", length(female_senior_services)),
rep("Male Non-Senior", length(male_non_senior_services)),
rep("Female Non-Senior", length(female_non_senior_services))),
InternetServiceStatus = c(male_senior_services, female_senior_services, male_non_senior_services, female_non_senior_services))
streaming_data <- streaming_data %>%
mutate(InternetServiceStatus = factor(InternetServiceStatus, levels = c("DSL", "Fiber optic", "No"), labels = c("DSL", "Fiber optic", "None")))
#Plot
ggplot(streaming_data, aes(x = InternetServiceStatus, fill = Subset)) +
geom_bar(position = "dodge", stat = "count") +
labs(title = "Internet Service Distribution",
x = "Internet Service", y = "Number of Customers", fill = "Customer Type") +
theme(axis.text.x = element_text(hjust = .5, size = 10, color = "black")) +
scale_y_continuous(limits = c(0,1150))
What are the customer demographics Regork should focus on? We see that most of Regork’s phone service customers are single with no children males and females. However, we also see a large amount of customers that have a partner or has both a partner and children. Based off of this, Regork may target family phone service plans or possible multi-line deals while keeping single line prices moderate.
gender_data <- data %>%
mutate(PhoneService = factor(PhoneService, levels = c("Yes", "No"))) %>%
mutate(Dependents = factor(Dependents, levels = c("Yes", "No"), labels = c("Has Dependents", "Does Not Have Dependents"))) %>%
mutate(Partner = factor(Partner, levels = c("Yes", "No"), labels = c("Has a Partner", "Does Not Have a Partner"))) %>%
count(PhoneService, Gender, Dependents, Partner) %>%
complete(Gender, PhoneService, fill = list(n = 0)) %>%
ggplot(aes(PhoneService, n, fill = Gender)) + geom_col(position = "dodge") +
labs(y = "Number of Customers", x = "Has Phone Service?") +
facet_wrap(~Dependents + Partner)
gender_data
What services are generating the most monthly revenue for Regork? Based off this graph, we determined that Regork makes the most for customers who stream both movies and TV. However, this only raises the average monthly charge by $10 since those who do not stream movies or TV still pay about $60 per month. Streaming both increases their average monthly charge by about $10, and only streaming movies and not TV brings the charge up about $15.
glimpse(data$StreamingTV)
## chr [1:6988] "No" "No" "No" "No" "No" "Yes" "Yes" "No" "Yes" "No" "No" ...
contract_data <- data %>%
mutate(StreamingMovies = factor(StreamingMovies, levels = c("Yes", "No", "No internet service"),
labels = c("Streams Movies", "Does Not Stream Movies", "No Internet Service"))) %>%
mutate(StreamingTV = factor(StreamingTV, levels = c("Yes", "No", "No internet service"),
labels = c("Yes", "No", "No Internet Service"))) %>%
count(StreamingMovies, StreamingTV, MonthlyCharges) %>%
ggplot(aes(StreamingTV, MonthlyCharges)) + geom_boxplot(position = "dodge") +
labs(y = "Monthly Charges", x = "Does the Customer Stream TV?", caption = "Average Monthly Charge Plotted and Labeled") +
scale_y_continuous(labels=scales::dollar_format()) +
theme(axis.text.x = element_text(color = "black", size = 10)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
stat_summary(fun=mean, colour="green", geom="point",
shape=16, size=3, show.legend=FALSE) +
stat_summary(fun.y=mean, colour="black", geom="text", show_guide = FALSE,
vjust=-1.5, aes(label=round(..y.., digits=2)))+
facet_wrap(~StreamingMovies)
contract_data
set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
logistic_reg() %>%
fit_resamples(Status ~ ., kfolds) %>%
collect_metrics() %>%
filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
In our linear regression, we find AUC to be .845.
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = retention_train)
vip::vip(final_fit)
The 3 most important predictors according to this model is Tenure, 2 year contracts, and 1 year contracts.
set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode("classification")
mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 25)
m_wf <- workflow() %>%
add_recipe(r_recipe) %>%
add_model(mars_mod)
tuning_results <- m_wf %>%
tune_grid(resamples = kfolds, grid = mars_grid)
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 50 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 20 1 roc_auc binary 0.850 5 0.00486 Preprocessor1_M…
## 2 19 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 3 21 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 4 22 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 5 23 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 6 25 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 7 26 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 8 27 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 9 28 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 10 30 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## # ℹ 40 more rows
Here we find that the minimum number of terms as 20 generates the highest AUC at .85.
A graphical representation of the tuned MARS model:
autoplot(tuning_results)
After analyzing what model works best in the MARS model we can extract the most important features.
m_best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
m_final_wf <- workflow() %>%
add_recipe(r_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(m_best_hyperparameters)
m_final_fit <- m_final_wf %>%
fit(data = retention_train)
vip::vip(final_fit)
Here we find that Total Charges, Tenure, and Monthly Charges are the 3 most influential predictors.
set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
rf_hyper_grid <- grid_regular(
trees(range = c(5, 500)),
mtry(range = c(2, 50)),
min_n(range = c(1, 20)),
levels = 5
)
set.seed(123)
rf_results <- tune_grid(rf_mod, r_recipe, resamples = kfolds, grid = rf_hyper_grid)
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 14 128 20 roc_auc binary 0.840 5 0.00350 Preprocessor1_Model1…
## 2 2 252 20 roc_auc binary 0.840 5 0.00458 Preprocessor1_Model1…
## 3 2 500 15 roc_auc binary 0.840 5 0.00446 Preprocessor1_Model0…
## 4 2 376 20 roc_auc binary 0.840 5 0.00468 Preprocessor1_Model1…
## 5 2 500 1 roc_auc binary 0.840 5 0.00413 Preprocessor1_Model0…
Here we see that our most optimal random forest model is one with 128 trees, with an AUC of .840.
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
add_recipe(r_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
fit(data = retention_train)
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
Here we see that Tenure, Total Charges, and Monthly Charges are the most influential predictors.
To better understand how our best model (MARS) performs, we decided to build a confusion matrix to test any false negatives and positives.
cm <- m_final_wf %>%
fit(data = retention_train)
cm %>% predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
Most Influential Predictor Variables:
Across our 3 models, we found that Tenure, Total Charges, Monthly Charges, 2 year contracts, and 1 year contracts were the most influential predictors for this data set.
Tenure: This has been a top 3 influential feature across all 3 models. It goes without saying that customers who have been with Regork the longest have lower churn and are more likely to stay customers in the telecommunications services. Regork should target these individuals and cater their telecommunication services to those needs when expanding into this space.
Charges (Total and Monthly): Customers are often sensitive to price, and being one of the top features that impacts the status of a customer, Regork should be emphasizing the balance of charges. There must be more data analysis to better understand what Regork should be charging for services to widen their customer base. For our analysis, we know that it is an important feature to be further looked upon when trying to optimize business and expand the customer base.
Contract Type: Similarly to Tenure, the contract type was another significant predictor in the status of a customer. Customers who have longer contracts have lower churn, which emphasizes the importance of implementing longer term contracts to customers for Regork. We suggest eliminating the option of a monthly contract to help stabilize and reduce churn in customers. Additionally, more analysis may be required to understand at what prices and services certain contracts should be to increase the customer base while preventing any current customer loss.
Conclusion:
We believe that Regork has the capability to make the telecommunication sector a profitable sector of the company. As the financial teams and data analysis teams work to further understand the more influential predictors of customer status for actual implementation, we see the business being successful. In this regard, Regork should focus on getting customers into long term contracts for their services to reduce churn. With our confusion matrix, we are able to estimate that about 315 customers may leave. This could result in lost revenues ranging from $7,122.15 to $27,868.05 per month. With this, Regork needs to focus on customer retention through longer contracts and the finance team may need to conduct a deeper analysis into customer price sensitivity. Finally, Regork must target their loyal customer base which has longer tenure to ensure a successful sector in telecommunications.