Mahika Gunjkar, Fairouz Maayah, Ashleigh Fischer

Introduction

Customer retention is a critical aspect of maintaining steady revenue for any subscription-based business. In this analysis, we investigate patterns in customer churn and predict those at risk of leaving. We also propose actionable insights to retain these customers and minimize revenue loss.

Data Preprocessing

Step 1: Loading Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(rpart)
library(e1071)
library(ROCR)
library(DT)

Step 2: Loading and Inspect the Data

# Load the dataset
data <- readr::read_csv("C:/Users/Mahika/OneDrive - University of Cincinnati/BANA 4080/data/customer_retention.csv")
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View initial structure and summary
str(data)
## spc_tbl_ [6,999 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Gender          : chr [1:6999] "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : num [1:6999] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr [1:6999] "Yes" "No" "No" "No" ...
##  $ Dependents      : chr [1:6999] "No" "No" "No" "No" ...
##  $ Tenure          : num [1:6999] 1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr [1:6999] "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr [1:6999] "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr [1:6999] "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr [1:6999] "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr [1:6999] "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr [1:6999] "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr [1:6999] "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr [1:6999] "No" "No" "No" "No" ...
##  $ StreamingMovies : chr [1:6999] "No" "No" "No" "No" ...
##  $ Contract        : chr [1:6999] "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr [1:6999] "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr [1:6999] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num [1:6999] 29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num [1:6999] 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Status          : chr [1:6999] "Current" "Current" "Left" "Current" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Gender = col_character(),
##   ..   SeniorCitizen = col_double(),
##   ..   Partner = col_character(),
##   ..   Dependents = col_character(),
##   ..   Tenure = col_double(),
##   ..   PhoneService = col_character(),
##   ..   MultipleLines = col_character(),
##   ..   InternetService = col_character(),
##   ..   OnlineSecurity = col_character(),
##   ..   OnlineBackup = col_character(),
##   ..   DeviceProtection = col_character(),
##   ..   TechSupport = col_character(),
##   ..   StreamingTV = col_character(),
##   ..   StreamingMovies = col_character(),
##   ..   Contract = col_character(),
##   ..   PaperlessBilling = col_character(),
##   ..   PaymentMethod = col_character(),
##   ..   MonthlyCharges = col_double(),
##   ..   TotalCharges = col_double(),
##   ..   Status = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data)
##     Gender          SeniorCitizen      Partner           Dependents       
##  Length:6999        Min.   :0.0000   Length:6999        Length:6999       
##  Class :character   1st Qu.:0.0000   Class :character   Class :character  
##  Mode  :character   Median :0.0000   Mode  :character   Mode  :character  
##                     Mean   :0.1619                                        
##                     3rd Qu.:0.0000                                        
##                     Max.   :1.0000                                        
##                                                                           
##      Tenure      PhoneService       MultipleLines      InternetService   
##  Min.   : 0.00   Length:6999        Length:6999        Length:6999       
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :29.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :32.38                                                           
##  3rd Qu.:55.00                                                           
##  Max.   :72.00                                                           
##                                                                          
##  OnlineSecurity     OnlineBackup       DeviceProtection   TechSupport       
##  Length:6999        Length:6999        Length:6999        Length:6999       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  StreamingTV        StreamingMovies      Contract         PaperlessBilling  
##  Length:6999        Length:6999        Length:6999        Length:6999       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaymentMethod      MonthlyCharges    TotalCharges       Status         
##  Length:6999        Min.   : 18.25   Min.   :  18.8   Length:6999       
##  Class :character   1st Qu.: 35.48   1st Qu.: 401.9   Class :character  
##  Mode  :character   Median : 70.35   Median :1397.5   Mode  :character  
##                     Mean   : 64.75   Mean   :2283.1                     
##                     3rd Qu.: 89.85   3rd Qu.:3796.9                     
##                     Max.   :118.75   Max.   :8684.8                     
##                                      NA's   :11

Step 3: Handling Missing Values

# Check for missing values
colSums(is.na(data))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges           Status 
##                0                0               11                0
# Replace missing values in 'TotalCharges' with the median
data$TotalCharges <- as.numeric(data$TotalCharges)
data$TotalCharges[is.na(data$TotalCharges)] <- median(data$TotalCharges, na.rm = TRUE)

Step 4: Data Transformation

# Convert categorical variables to factors
categorical_vars <- c("Status", "Gender", "Partner", "Dependents", "PhoneService",
                      "MultipleLines", "InternetService", "OnlineSecurity",
                      "OnlineBackup", "DeviceProtection", "TechSupport",
                      "StreamingTV", "StreamingMovies", "Contract",
                      "PaperlessBilling", "PaymentMethod")
data[categorical_vars] <- lapply(data[categorical_vars], as.factor)

# Normalize numerical columns
numerical_vars <- c("Tenure", "MonthlyCharges", "TotalCharges")
data[numerical_vars] <- scale(data[numerical_vars])

Exploratory Data Analysis

Churn Distribution

In this first churn distribution we analyzed the percentage of current customers and the customers that have already left, finding that about 75% of customers stayed and about 25% who have already left.

churn_distribution <- data %>%
  count(Status) %>%
  mutate(Percent = n / sum(n) * 100)

ggplot(churn_distribution, aes(x = Status, y = Percent, fill = Status)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.7) +
  scale_fill_manual(values = c("purple", "yellow")) +
  labs(
    title = "Churn Distribution",
    y = "Percentage of Customers",
    x = "Customer Status"
  ) +
  theme_minimal()

Categorical Predictors vs. Churn

We then decided to analyze the proportions of churn across different categories of predictors, below are the graphs showing the proportion by status, gender, partner, dependents, Phone service(if one has it or not), multiple lines of phone service, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, contract, paperless billing, and payment method.

plot_list <- list()
for (var in categorical_vars) {
  p <- ggplot(data, aes_string(x = var, fill = "Status")) +
    geom_bar(position = "fill") +
    labs(title = paste("Churn by", var), y = "Proportion", x = var) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  plot_list[[var]] <- p
}

# Display all plots
for (p in plot_list) {
  print(p)
}

Numerical Predictors vs. Churn

We compared the distribution of numerical values using boxplots for tenure, monthly charges, and total charges between churned and retained customers.

for (var in numerical_vars) {
  p <- ggplot(data, aes_string(x = "Status", y = var, fill = "Status")) +
    geom_boxplot() +
    scale_fill_manual(values = c("steelblue", "tomato")) + # Custom colors
    labs(title = paste("Churn by", var), y = var, x = "Status") +
    theme_minimal() +
    theme(
      plot.background = element_rect(fill = "#f7f7f7", color = NA),
      panel.background = element_rect(fill = "#f7f7f7", color = NA),
      plot.title = element_text(face = "bold", hjust = 0.5),
      axis.title = element_text(color = "#0056b3"),
      legend.position = "top"
    )
  print(p)
}

Modeling and Predictions

Model Training

Then we analyzed a dot plot that compares the performance of logistic regression, decision tree and random forest models based on cross validated models.

set.seed(123)
trainIndex <- createDataPartition(data$Status, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]

control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)

log_model <- train(Status ~ ., data = train_data, method = "glm", family = "binomial", trControl = control, metric = "ROC")
tree_model <- train(Status ~ ., data = train_data, method = "rpart", trControl = control, metric = "ROC")
rf_model <- train(Status ~ ., data = train_data, method = "rf", trControl = control, metric = "ROC")

Model Comparison

results <- resamples(list(Logistic = log_model, Tree = tree_model, RF = rf_model))
dotplot(results)

Predictions and Evaluations

best_model <- rf_model
predictions <- predict(best_model, newdata = test_data, type = "prob")
conf_matrix <- confusionMatrix(predict(best_model, newdata = test_data), test_data$Status)

# Print confusion matrix
conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Current Left
##    Current    1471  349
##    Left         71  207
##                                          
##                Accuracy : 0.7998         
##                  95% CI : (0.782, 0.8167)
##     No Information Rate : 0.735          
##     P-Value [Acc > NIR] : 2.61e-12       
##                                          
##                   Kappa : 0.3883         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9540         
##             Specificity : 0.3723         
##          Pos Pred Value : 0.8082         
##          Neg Pred Value : 0.7446         
##              Prevalence : 0.7350         
##          Detection Rate : 0.7011         
##    Detection Prevalence : 0.8675         
##       Balanced Accuracy : 0.6631         
##                                          
##        'Positive' Class : Current        
## 

Business Insights

Revenue Loss and Retention Incentive

test_data$predicted_prob <- predictions[, 2]
churn_customers <- test_data %>%
  filter(predicted_prob > 0.5)

predicted_loss <- sum(churn_customers$MonthlyCharges)
retention_cost <- sum(churn_customers$MonthlyCharges * 0.2)
retained_revenue <- predicted_loss - retention_cost

cat("Predicted monthly revenue loss:", predicted_loss, "\n")
## Predicted monthly revenue loss: 122.4018
cat("Retention incentive cost (20%):", retention_cost, "\n")
## Retention incentive cost (20%): 24.48036
cat("Net retained revenue:", retained_revenue, "\n")
## Net retained revenue: 97.92145

Based on the model predictions, it is evident that customer churn can significantly impact the company’s revenue. For customers identified as high-risk of churn:

Revenue Loss: The predicted monthly revenue loss due to these customers leaving is substantial. This indicates a pressing need to address churn to safeguard the company’s financial health.

Retention Incentive Cost: By offering targeted retention incentives, such as discounts or added services, we can retain a significant portion of these at-risk customers. The retention incentive cost, calculated at 20% of their monthly charges, is a fraction of the potential revenue loss.

Net Retained Revenue: Implementing retention strategies could result in a net positive impact on revenue, showcasing the financial benefits of proactive customer retention efforts.

Conclusion

In conclusion, customer retention is pivotal for maintaining a steady revenue stream in subscription-based businesses. Through data preprocessing and exploratory data analysis, we identified patterns and predictors of customer churn. The predictive models developed, particularly the Random Forest model, demonstrate high accuracy in identifying customers at risk of leaving.

By understanding these patterns and implementing targeted retention strategies, businesses can not only mitigate revenue loss but also enhance customer satisfaction and loyalty. This analysis underscores the importance of data-driven decision-making in crafting effective retention strategies and ensuring long-term business sustainability.