Regork, a telecommunications company, aims to enhance customer retention by predicting customer churn. As the data scientist on this project, my goal is to analyze customer data and build a predictive model that identifies customers likely to leave. This involves preparing and cleaning the data, performing exploratory data analysis, and building and comparing machine learning models.
Customer churn is a major threat to profitability in the telecommunications industry. Acquiring a new customer often costs 5–10x more than retaining an existing one. For Regork Telecom, understanding and predicting customer churn is critical to improving customer lifetime value, reducing marketing costs, and sustaining competitive advantage.
In this project, I used customer demographic and service usage data to develop predictive models that identify customers at risk of churn, helping Regork take proactive steps toward customer retention.
Clean and reliable data is foundational for building accurate machine learning models. In this stage, I addressed missing or incorrect values, ensured data consistency across categorical variables, and prepared the dataset for robust analysis and modeling.
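The code in this report assumes the following packages are loaded (normally done in a setup chunk not shown here): the tidyverse for wrangling and plotting, tidymodels for modeling, DT for interactive tables, and vip for variable importance plots.
# Packages assumed loaded in the setup chunk
library(tidyverse)   # readr, dplyr, tidyr, ggplot2
library(tidymodels)  # rsample, recipes, parsnip, workflows, yardstick
library(DT)          # datatable() for interactive tables
library(vip)         # variable importance plots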
# Load the customer retention dataset
data <- read_csv("C:/Users/sofia/OneDrive - University of Cincinnati/Documents/University of Cincinnati/Spring 2025/BANA 4080/BANA 4080 - Final Project/customer_retention.csv")
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View(data)  # interactive inspection only; not evaluated when knitting
colSums(is.na(data))  # count missing values in each column
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Status
## 0 0 11 0
# Drop the 11 rows with a missing TotalCharges value
data <- data %>% drop_na(TotalCharges)
colSums(is.na(data))  # confirm no missing values remain
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Status
## 0 0 0 0
# Collapse "No internet service" / "No phone service" into "No" so each
# add-on service becomes a simple Yes/No factor
replace_no_service <- function(x) {
  x <- as.character(x)
  x[x == "No internet service"] <- "No"
  x[x == "No phone service"] <- "No"
  as.factor(x)
}
data <- data %>%
mutate(across(c(MultipleLines, OnlineSecurity, OnlineBackup,
DeviceProtection, TechSupport, StreamingTV, StreamingMovies),
replace_no_service)) %>%
mutate(Status = as.factor(Status))
glimpse(data)
## Rows: 6,988
## Columns: 20
## $ Gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status <fct> Current, Current, Left, Current, Left, Left, Current,…
Before modeling, I explored the data to uncover underlying trends and relationships. Understanding patterns in customer behavior helps flag churn risk factors early and informs feature selection for the models.
data %>%
count(Status) %>%
mutate(percent = n / sum(n) * 100) %>%
datatable()
data %>%
ggplot(aes(x = Status, fill = Status)) +
geom_bar() +
labs(title = "Customer Status Distribution", y = "Count") +
theme_minimal()
data %>%
ggplot(aes(x = Status, y = Tenure, fill = Status)) +
geom_boxplot() +
labs(title = "Tenure by Customer Status", y = "Tenure (Months)") +
theme_minimal()
data %>%
ggplot(aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_boxplot() +
labs(title = "Monthly Charges by Customer Status", y = "Monthly Charges ($)") +
theme_minimal()
data %>%
ggplot(aes(x = Contract, fill = Status)) +
geom_bar(position = "fill") +
labs(title = "Contract Type vs Churn", y = "Proportion") +
theme_minimal()
data %>%
ggplot(aes(x = InternetService, fill = Status)) +
geom_bar(position = "fill") +
labs(title = "Internet Service vs Churn", y = "Proportion") +
theme_minimal()
From the visualizations, it’s clear that customers with short tenure, month-to-month contracts, and higher monthly charges are much more likely to leave. These insights guide the model-building process and highlight important intervention opportunities for the business.
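To attach numbers to these patterns, the churn rate can also be computed directly; a quick sketch for contract type:
# Churn rate by contract type, quantifying the proportions plotted above
data %>%
  group_by(Contract) %>%
  summarize(customers = n(),
            churn_rate = mean(Status == "Left")) %>%
  arrange(desc(churn_rate))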
To predict customer churn, I implemented three different machine learning algorithms: Logistic Regression, Decision Tree, and Random Forest. Each model was trained and validated using 5-fold cross-validation to ensure reliable performance estimation.
# 70/30 train/test split, stratified on Status to preserve the churn rate
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
train_data <- training(data_split)
test_data <- testing(data_split)

# Preprocessing recipe: dummy-encode categorical predictors, normalize numeric ones
churn_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# 5-fold cross-validation, stratified on Status
set.seed(123)
churn_folds <- vfold_cv(train_data, v = 5, strata = Status)
# Model 1: Logistic regression (glm engine)
logistic_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
logistic_workflow <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(churn_recipe)
logistic_res <- logistic_workflow %>%
fit_resamples(
resamples = churn_folds,
metrics = metric_set(roc_auc),
control = control_resamples(save_pred = TRUE)
)
# Model 2: Decision tree (rpart engine)
tree_model <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_workflow <- workflow() %>%
add_model(tree_model) %>%
add_recipe(churn_recipe)
tree_res <- tree_workflow %>%
fit_resamples(
resamples = churn_folds,
metrics = metric_set(roc_auc),
control = control_resamples(save_pred = TRUE)
)
# Model 3: Random forest, 500 trees; impurity importance enables the VIP plot later
rf_model <- rand_forest(mtry = 5, trees = 500, min_n = 5) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(churn_recipe)
rf_res <- rf_workflow %>%
fit_resamples(
resamples = churn_folds,
metrics = metric_set(roc_auc),
control = control_resamples(save_pred = TRUE)
)
The Random Forest model achieved the highest cross-validated AUC (approximately 0.91), indicating the strongest predictive performance of the three. Its test-set confusion matrix (below) also reflects a reasonable balance between false positives and false negatives, making it the most reliable choice for real-world decision-making.
# Collect cross-validated AUCs and label each model
logistic_auc <- collect_metrics(logistic_res) %>% mutate(model = "Logistic Regression")
tree_auc <- collect_metrics(tree_res) %>% mutate(model = "Decision Tree")
rf_auc <- collect_metrics(rf_res) %>% mutate(model = "Random Forest")
model_comparison <- bind_rows(logistic_auc, tree_auc, rf_auc)
datatable(model_comparison)
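For an at-a-glance view, the cross-validated AUCs in model_comparison can also be plotted; a quick sketch:
# Bar chart of mean cross-validated AUC per model
model_comparison %>%
  filter(.metric == "roc_auc") %>%
  ggplot(aes(x = reorder(model, mean), y = mean)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Cross-Validated AUC by Model", x = NULL, y = "Mean AUC") +
  theme_minimal()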
# Fit the best-performing model (Random Forest) on the full training set
final_rf <- rf_workflow %>%
fit(data = train_data)
# Predict on test data
rf_preds <- predict(final_rf, test_data, type = "prob") %>%
bind_cols(predict(final_rf, test_data)) %>%
bind_cols(test_data %>% select(Status))
# Calculate AUC on test set
roc_auc(rf_preds, truth = Status, .pred_Current)  # "Current" is the first factor level (default event)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.833
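The AUC summarizes the full ROC curve, which can be drawn from the same test-set predictions; a quick sketch:
# ROC curve for the Random Forest on the test set
roc_curve(rf_preds, truth = Status, .pred_Current) %>%
  autoplot()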
# Create a confusion matrix
rf_preds_class <- predict(final_rf, test_data) %>%
bind_cols(test_data %>% select(Status))
conf_mat(rf_preds_class, truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1366 251
## Left 174 306
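The confusion matrix can also be expanded into standard classification metrics (accuracy, sensitivity, specificity, and so on) via yardstick's summary() method; a quick sketch:
# Accuracy, sensitivity, specificity, etc., derived from the confusion matrix
conf_mat(rf_preds_class, truth = Status, estimate = .pred_class) %>%
  summary()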
Feature importance analysis shows that Contract Type, Tenure, and Monthly Charges are the strongest drivers of churn. These findings validate the patterns observed during exploratory analysis and give Regork Telecom actionable areas to focus on.
# Variable importance plot
final_rf %>%
extract_fit_parsnip() %>%
vip(num_features = 15)
# Add predicted class to the test set
rf_preds_final <- predict(final_rf, test_data) %>%
bind_cols(test_data %>% select(Status, MonthlyCharges, TotalCharges))
# Filter customers predicted to leave
likely_to_leave <- rf_preds_final %>%
filter(.pred_class == "Left")
# View how many customers predicted to leave
nrow(likely_to_leave)
## [1] 480
datatable(likely_to_leave)
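Since a flat class prediction treats all of these customers alike, outreach could instead be prioritized by predicted churn probability; a sketch (the .pred_Left column comes from the probability predictions):
# Rank at-risk customers by predicted probability of leaving
at_risk_ranked <- predict(final_rf, test_data, type = "prob") %>%
  bind_cols(test_data %>% select(Status, MonthlyCharges)) %>%
  arrange(desc(.pred_Left))
head(at_risk_ranked, 10)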
Using the final model, I identified 480 customers at high risk of leaving. Without intervention, Regork could lose approximately $36,500 per month in revenue from this group.
# Estimate total monthly revenue loss if these customers leave
total_monthly_loss <- sum(likely_to_leave$MonthlyCharges, na.rm = TRUE)
total_monthly_loss
## [1] 36519.2
I propose offering a $15/month credit for 6 months to these at-risk customers. Even a modest success rate would generate a significant return on investment by reducing churn and protecting future revenue streams.
Proposal: Offer a $15/month discount for 6 months to at-risk customers to encourage retention.

Cost estimate:
- $15 x 6 months = $90 per customer
- $90 x 480 predicted churners = $43,200 total incentive cost

Benefit estimate:
- The 480 at-risk customers average roughly $76/month each ($36,519 / 480).
- If a retained customer stays even 4 more months (~$304 in revenue), the $90 credit more than pays for itself.

Cost vs. benefit:
- The total incentive cost compares favorably with the expected retained revenue even at modest success rates (e.g., 30-40%). At a 30% success rate, roughly 144 retained customers would preserve about $11,000 in monthly revenue, recovering the $43,200 outlay within a few months.

This makes offering the incentive a worthwhile investment compared to the monthly revenue losses from churn.
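The same arithmetic as a back-of-the-envelope calculation in R; the 30% and 40% success rates are assumptions, not estimates from the data:
# Cost-benefit sketch: incentive cost vs. revenue retained at assumed success rates
n_at_risk      <- nrow(likely_to_leave)               # 480 predicted churners
incentive_cost <- 90 * n_at_risk                      # $15/month x 6 months each
avg_charge     <- mean(likely_to_leave$MonthlyCharges)

for (success_rate in c(0.3, 0.4)) {
  retained_monthly <- success_rate * n_at_risk * avg_charge
  cat(sprintf("Success %.0f%%: cost $%s vs. ~$%s retained revenue per month\n",
              success_rate * 100,
              format(round(incentive_cost), big.mark = ","),
              format(round(retained_monthly), big.mark = ",")))
}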
Through this analysis, I developed a predictive model that successfully identifies customers at risk of leaving Regork Telecom. The Random Forest model delivered strong predictive performance (a test-set AUC of 0.83), enabling targeted intervention rather than broad, unfocused retention efforts. Key drivers of churn included short tenure, higher monthly charges, and month-to-month contracts.
If no action is taken, Regork faces a potential loss of roughly $36,500 per month in revenue from the customers identified as likely to leave. By proactively offering a $15/month incentive for six months to these at-risk customers, the company can meaningfully reduce churn at a cost that is small relative to the revenue retained.
In conclusion, by leveraging machine learning to predict churn, Regork Telecom can shift from reactive to proactive customer retention strategies. Targeted incentives directed at high-risk customers present a cost-effective solution that maximizes revenue preservation. This approach positions Regork to enhance customer loyalty, reduce churn, and maintain a strong competitive position in a saturated telecommunications market.