This machine learning project revolves around the essential task of predicting customer churn. The project’s primary objective is to build a strong predictive model capable of identifying customers who are likely to churn. By effectively recognizing potential churners, the model aims to optimize decision-making in customer retention strategies, reduce business risks, and contribute to more informed actions in the field of customer relationship management.
Here are some of the business advantages this project offers:

- Enhanced Risk Management: Accurate identification of potential churners minimizes the risk of revenue loss and enables proactive intervention strategies.
- Personalized Customer Engagement: Predicting churn allows for targeted retention efforts, fostering stronger customer relationships and loyalty.
- Operational Efficiency: Automated churn prediction streamlines resource allocation, reducing manual effort and operational costs.
- Maximized Customer Lifetime Value: Retaining high-value customers through predictive modeling increases their long-term contribution to the business.
- Financial Stability: Minimizing customer attrition ensures a steady revenue stream and enhances overall financial stability.
- Data-Driven Decision-Making: Churn prediction provides valuable insights into customer behavior, enabling informed strategic decisions and improvements.
- Competitive Advantage: Effective churn management enhances the company’s reputation for customer-centric practices, differentiating it from competitors.
- Optimized Marketing Strategies: Accurate churn predictions lead to targeted marketing campaigns, maximizing their impact and minimizing waste.
- Enhanced Profitability: Reduced churn rates translate to increased customer retention, contributing to improved financial performance and profitability.
- Continuous Improvement: Ongoing analysis of churn patterns enables the refinement of retention strategies, adapting to evolving customer preferences.

In short: save the business time, save the business money, give the business more insight, and make the business more money.
Now, let’s jump into the coding process!
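Before anything else, let’s load the packages and the data, then check whether any values are missing. A minimal setup sketch, assuming the standard Telco churn CSV; the file name and read options are assumptions:

library(tidyverse)  # wrangling (dplyr, tidyr) and plotting (ggplot2)
library(caret)      # upSample(), createDataPartition(), train(), confusionMatrix()

# Read the raw data (file name assumed); treating blank strings as NA
# is what turns the empty TotalCharges entries into missing values
churn <- read.csv("telco_churn.csv", na.strings = c("", " "))

# Is there any missing value at all?
anyNA(churn)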
## [1] TRUE
There are missing values. Let’s inspect them!
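A per-column count of missing values does the job; a sketch using colSums():

colSums(is.na(churn))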
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
There are only 11 rows that have a missing value, all in TotalCharges. Let’s remove those rows!
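A sketch of the removal, using tidyr’s drop_na(), followed by a re-check:

churn <- churn %>% drop_na()

colSums(is.na(churn))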
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 0
## Churn
## 0
Cool! We are now free of missing values!
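Let’s look at the structure of the cleaned data with glimpse():

churn %>% glimpse()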
## Rows: 7,032
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
We need to do two things:

- Remove the customerID column
- Parse the categorical columns into factors
churn <- churn %>%
  select(-customerID) %>%                          # drop the identifier column
  mutate_if(is.character, as.factor) %>%           # parse categorical columns into factors
  mutate(SeniorCitizen = as.factor(SeniorCitizen)) # 0/1 flag is categorical, not numeric
churn %>% glimpse()

## Rows: 7,032
## Columns: 20
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
Cool. All categorical columns are now in the correct data type.
The proportion of the target variable is imbalanced. Let’s upsample the data!
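A quick look at the class proportions first (a sketch):

churn$Churn %>% table() %>% prop.table()
# roughly 73% "No" vs 27% "Yes" in this dataset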
up_churn <- upSample(x = churn %>% select(-Churn),
                     y = churn$Churn,
                     yname = "Churn")

up_churn$Churn %>% table() %>% barplot()

Cool. The target proportion is now balanced!
Cool. The categorical columns all look reasonably distributed across their levels.
(Histograms of the numerical columns: tenure, MonthlyCharges, and TotalCharges.)
Cool. Nothing anomalous stands out in the numerical columns’ distributions either.
Because we don’t have a lot of records, we will set the training proportion to 85%.
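A sketch of the 85/15 split with caret’s createDataPartition(); the seed value is an assumption, and test_data is a hypothetical intermediate name (train_data, X_test, and y_test are the objects used later on):

set.seed(100)  # seed value assumed, for reproducibility
idx <- createDataPartition(up_churn$Churn, p = 0.85, list = FALSE)

train_data <- up_churn[idx, ]
test_data  <- up_churn[-idx, ]

X_test <- test_data %>% select(-Churn)  # predictors only
y_test <- test_data$Churn               # true labels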
I chose the Random Forest algorithm because, in cases like predicting human behavior, a simple one-rule algorithm may not suffice. Random Forest, being an ensemble method that combines many rules (trees), is well suited to such scenarios because it doesn’t rely on any single rule.
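A sketch of the baseline fit, assuming the randomForest package with default hyperparameters; the object names model_rf and pred_rf are assumptions:

library(randomForest)

set.seed(100)  # assumed seed
model_rf <- randomForest(Churn ~ ., data = train_data)

pred_rf <- predict(model_rf, X_test)
confusionMatrix(pred_rf, y_test, positive = "Yes")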
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 637 27
## Yes 137 747
##
## Accuracy : 0.8941
## 95% CI : (0.8776, 0.909)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7881
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9651
## Specificity : 0.8230
## Pos Pred Value : 0.8450
## Neg Pred Value : 0.9593
## Prevalence : 0.5000
## Detection Rate : 0.4826
## Detection Prevalence : 0.5711
## Balanced Accuracy : 0.8941
##
## 'Positive' Class : Yes
##
The metric we’ll focus on is Sensitivity. Why? Because we want to minimize the cases where customers who are likely to churn are predicted as not likely to churn.
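To make the metric concrete: with Yes as the positive class, sensitivity is the share of actual churners the model catches, computed straight from the confusion matrix above:

# sensitivity = true positives / (true positives + false negatives)
747 / (747 + 27)   # = 0.9651, the Sensitivity reported above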
The random forest model achieved an overall Accuracy of 89% with a Sensitivity of 96.5%. This high sensitivity indicates that the model is effective at correctly identifying customers who are likely to churn, which is crucial for proactively addressing potential churn and retaining valuable customers. We’ll tune the model and see if we can increase its performance!
We’ll tune the model with repeated k-fold cross-validation: 3 repeats of 5 folds.
ctrl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3
)

model_rf_tunned <- train(
  Churn ~ .,
  data = train_data,
  method = "rf",
  trControl = ctrl
)

pred_rf_tunned <- predict(model_rf_tunned, X_test)
confusionMatrix(pred_rf_tunned, y_test, positive = "Yes")

## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 653 23
## Yes 121 751
##
## Accuracy : 0.907
## 95% CI : (0.8914, 0.921)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.814
##
## Mcnemar's Test P-Value : 6.302e-16
##
## Sensitivity : 0.9703
## Specificity : 0.8437
## Pos Pred Value : 0.8612
## Neg Pred Value : 0.9660
## Prevalence : 0.5000
## Detection Rate : 0.4851
## Detection Prevalence : 0.5633
## Balanced Accuracy : 0.9070
##
## 'Positive' Class : Yes
##
The performance increased!

The tuned random forest model achieved a higher Accuracy of 90.7% and a Sensitivity of 97%, indicating a stronger ability to accurately identify customers who are likely to churn, making it the winning model of this project!
My way of explaining the ROC AUC score is that it reflects how reliably the model separates the two classes: formally, it is the probability that the model ranks a randomly chosen churner above a randomly chosen non-churner. A high ROC AUC indicates that the model is very confident in its predictions. As people who use the model, we want it to be highly confident, since we rely on it for making decisions; we don’t want a model that is unsure about its predictions. This is why the ROC AUC is a crucial measure of whether the model is ready for practical use.
pred_rf_tunned_raw <- predict(model_rf_tunned, X_test, type = "prob")
plotROC(pred_rf_tunned_raw[, 2], ifelse(y_test == "Yes", 1, 0)) # This function is from my personal package

Magnificent. Why? Because the closer the AUC is to 1, the better the model is at distinguishing customers who will churn from those who won’t.
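If you don’t have that personal package, the pROC package computes the same AUC; a sketch, reusing the objects from above:

library(pROC)

roc_obj <- roc(response = y_test,
               predictor = pred_rf_tunned_raw[, "Yes"],
               levels = c("No", "Yes"))  # control level first, case level second
auc(roc_obj)   # should land around the 0.96 reported below
plot(roc_obj)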
In conclusion, the final model performs very well, with an Accuracy of 90.7% and a Sensitivity of 97%. It combines high accuracy, high sensitivity, and strong specificity, signifying its prowess at effectively identifying customers who are likely to churn while also performing well at identifying customers who are not.

Furthermore, the impressive ROC AUC score of 0.96 serves as an additional testament to the model’s readiness for practical deployment. A near-perfect ROC AUC indicates that the model distinguishes sharply between customers who are likely to churn and those who are not. With its strong performance across all of these metrics, the model has proven itself ready for reliable use in real-world applications.