This project explores churn management within the cellular telecommunications industry. The primary objective is to develop a predictive model for customer churn and leverage its insights to design a strategic incentive plan aimed at customer retention.
print(dim(df))
[1] 71047 69
The dataset consists of 69 attributes (columns) for 71,047 customers (rows), with the Churn column indicating whether a customer left the company two months after observation. The available data includes the following categories:

* Customer's Phone Usage - Monthly Call Minutes, Recurring Charges, Roaming Calls, Blocked Calls, Inbound Calls
* Customer's Phone Equipment - Handset Model, CurrentEquipmentDays, HandsetRefurbished?, HandsetWebCapable?
* Customer Demographics - CreditRating (e.g., A, AA), PrizmCode (e.g., Rural, Town), Occupation (e.g., Professor, Clerical), Marital Status
* Advertising Data - BuysViaMailOrder?, RespondsToMailOffers?
After data preparation, 69,309 of the original 71,047 customer records remain:
dim(df)
[1] 69309 69
Correlation Between Numerical Variables:
Given the large number of features in the dataset, the first step is to identify pairs of highly correlated variables, since strong correlations between predictors (multicollinearity) inflate coefficient variance and obscure each variable's individual contribution. Removing highly correlated variables improves model stability, robustness, and interpretability.
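The correlation matrix and plot are not reproduced here; below is a minimal sketch of how such a check might be produced, assuming the corrplot package (any correlation-plot tool would do):

# Correlation matrix over the numeric columns only
num.cols <- sapply(df, is.numeric)
corr.matrix <- cor(df[, num.cols], use = "pairwise.complete.obs")

# Visualize; shrink the text labels because there are many variables
library(corrplot)
corrplot(corr.matrix, method = "color", tl.cex = 0.5)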
From the correlation plot, we observe high correlations between the following variables:

1. Handset & HandsetPrice - 0.91
2. Handset & HandsetModel - 0.91
3. DroppedCalls & DroppedBlockedCalls - 0.88
4. DroppedBlockedCalls, PeakCallsInOut, and OffPeakCallsInOut - high correlations with several other predictors
As a result, HandsetModel and HandsetPrice were removed, as their effects were already captured by Handset. Additionally, DroppedBlockedCalls, PeakCallsInOut, and OffPeakCallsInOut were dropped due to their high correlation with multiple other variables.
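A sketch of this removal step (the column names are taken from the discussion above):

# Drop the redundant / highly collinear predictors
drop.cols <- c("HandsetModel", "HandsetPrice", "DroppedBlockedCalls",
               "PeakCallsInOut", "OffPeakCallsInOut")
df <- df[, !(names(df) %in% drop.cols)]

This takes the 69 columns down to the 64 confirmed below.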
dim(df)
[1] 69309 64
# Scale the dataset: min-max normalize each numeric column to [0, 1],
# leaving non-numeric (categorical) columns unchanged
df_scaled <- as.data.frame(lapply(df, function(x) {
  if (is.numeric(x)) {
    (x - min(x)) / (max(x) - min(x))
  } else {
    x
  }
}))
Before proceeding with further steps such as feature selection and modeling, we will first split the dataset into a calibration set and a test set.
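The split itself is not shown in the source; below is a plausible sketch assuming a simple random split, sized to match the dimensions reported later (38,941 calibration rows and 30,368 test rows; the seed is an arbitrary placeholder). The reported test set also has 62 rather than 64 columns, so a couple of additional columns were evidently dropped along the way.

set.seed(123)  # placeholder seed; the original is not shown
calib.idx <- sample(nrow(df_scaled), size = 38941)
calibrat.data <- df_scaled[calib.idx, ]
test.data <- df_scaled[-calib.idx, ]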
print(dim(test.data))
[1] 30368 62
To select the most relevant predictors for the predictive model, we employed both forward and backward feature selection. These techniques help identify a compact subset of variables that preserves the model's predictive performance.
Backward selection begins with a model that includes all variables and iteratively removes the least important features, evaluating the model’s performance at each step. In contrast, Forward selection starts with a model that only contains the intercept, gradually adding the most significant features at each iteration.
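The selection calls themselves are not shown; below is a sketch using base R's step(), matching the object names used in the summaries that follow (the scope and the stopping criterion, AIC by default, are assumptions):

# Boundary models on the calibration set
full.model <- glm(Churn ~ ., data = calibrat.data, family = "binomial")
null.model <- glm(Churn ~ 1, data = calibrat.data, family = "binomial")

# Backward: start from the full model, drop the weakest term at each step
logit.backward <- step(full.model, direction = "backward", trace = 0)

# Forward: start from the intercept, add the best term at each step
reality.forward <- step(null.model, direction = "forward",
                        scope = formula(full.model), trace = 0)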
summary(logit.backward)
Call:
glm(formula = Churn ~ MonthlyRevenue + MonthlyMinutes + TotalRecurringCharge +
OverageMinutes + RoamingCalls + PercChangeMinutes + PercChangeRevenues +
DroppedCalls + BlockedCalls + CustomerCareCalls + ThreewayCalls +
InboundCalls + MonthsInService + UniqueSubs + ActiveSubs +
Handsets + CurrentEquipmentDays + AgeHH1 + ChildrenInHH. +
CreditRatingA. + CreditRatingAA. + PrizmCodeSuburban. + HandsetRefurbished. +
HandsetWebCapable. + OccupationHomemaker. + MaritalStatusMissing. +
RespondsToMailOffers. + NewCellphoneUser. + OwnsMotorcycle. +
HandsetPriceMissing. + MadeCallToRetentionTeam., family = "binomial",
data = calibrat.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5199 -1.1389 -0.6925 1.1488 2.4782
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.33623 0.33043 -1.018 0.308902
MonthlyRevenue 2.28941 0.96200 2.380 0.017321 *
MonthlyMinutes -2.03429 0.29494 -6.897 5.30e-12 ***
TotalRecurringCharge -1.39297 0.35079 -3.971 7.16e-05 ***
OverageMinutes 3.22493 1.18550 2.720 0.006522 **
RoamingCalls 8.15306 2.28447 3.569 0.000358 ***
PercChangeMinutes -4.45759 0.48145 -9.259 < 2e-16 ***
PercChangeRevenues 8.32346 1.32086 6.302 2.95e-10 ***
DroppedCalls 1.81708 0.33950 5.352 8.69e-08 ***
BlockedCalls 0.97697 0.35099 2.783 0.005378 **
CustomerCareCalls -1.95427 0.89980 -2.172 0.029864 *
ThreewayCalls -1.89818 0.73479 -2.583 0.009786 **
InboundCalls -1.21619 0.42089 -2.890 0.003858 **
MonthsInService -1.23041 0.10226 -12.032 < 2e-16 ***
UniqueSubs 35.75536 3.88456 9.204 < 2e-16 ***
ActiveSubs -10.90967 1.47409 -7.401 1.35e-13 ***
Handsets 1.52622 0.33966 4.493 7.01e-06 ***
CurrentEquipmentDays 2.63237 0.13256 19.857 < 2e-16 ***
AgeHH1 -0.32529 0.06370 -5.107 3.28e-07 ***
ChildrenInHH.1 0.10548 0.02656 3.971 7.15e-05 ***
CreditRatingA.1 -0.17497 0.03528 -4.959 7.09e-07 ***
CreditRatingAA.1 -0.36479 0.03426 -10.646 < 2e-16 ***
PrizmCodeSuburban.1 -0.06211 0.02237 -2.776 0.005503 **
HandsetRefurbished.1 0.23519 0.03169 7.422 1.15e-13 ***
HandsetWebCapable.1 -0.15784 0.03744 -4.216 2.49e-05 ***
OccupationHomemaker.1 0.28766 0.18910 1.521 0.128214
MaritalStatusMissing.1 0.06429 0.02739 2.347 0.018921 *
RespondsToMailOffers.1 -0.12937 0.02632 -4.915 8.89e-07 ***
NewCellphoneUser.1 -0.07196 0.02650 -2.716 0.006612 **
OwnsMotorcycle.1 0.14334 0.08785 1.632 0.102758
HandsetPriceMissing.1 -0.14887 0.03351 -4.442 8.91e-06 ***
MadeCallToRetentionTeam.1 0.73769 0.05773 12.779 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 53983 on 38940 degrees of freedom
Residual deviance: 52340 on 38909 degrees of freedom
AIC: 52404
Number of Fisher Scoring iterations: 4
summary(reality.forward)
Call:
glm(formula = Churn ~ CurrentEquipmentDays + MadeCallToRetentionTeam. +
CreditRatingAA. + AgeHH1 + HandsetRefurbished. + MonthsInService +
PercChangeMinutes + UniqueSubs + ActiveSubs + TotalRecurringCharge +
MonthlyRevenue + MonthlyMinutes + PercChangeRevenues + HandsetPriceMissing. +
DroppedCalls + CreditRatingA. + RespondsToMailOffers. + HandsetWebCapable. +
Handsets + ChildrenInHH. + RoamingCalls + InboundCalls +
NewCellphoneUser. + PrizmCodeSuburban. + OverageMinutes +
MaritalStatusMissing. + ThreewayCalls + BlockedCalls + CustomerCareCalls +
OwnsMotorcycle. + OccupationHomemaker., family = "binomial",
data = calibrat.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5199 -1.1389 -0.6925 1.1488 2.4782
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.33623 0.33043 -1.018 0.308902
CurrentEquipmentDays 2.63237 0.13256 19.857 < 2e-16 ***
MadeCallToRetentionTeam.1 0.73769 0.05773 12.779 < 2e-16 ***
CreditRatingAA.1 -0.36479 0.03426 -10.646 < 2e-16 ***
AgeHH1 -0.32529 0.06370 -5.107 3.28e-07 ***
HandsetRefurbished.1 0.23519 0.03169 7.422 1.15e-13 ***
MonthsInService -1.23041 0.10226 -12.032 < 2e-16 ***
PercChangeMinutes -4.45759 0.48145 -9.259 < 2e-16 ***
UniqueSubs 35.75536 3.88456 9.204 < 2e-16 ***
ActiveSubs -10.90967 1.47409 -7.401 1.35e-13 ***
TotalRecurringCharge -1.39297 0.35079 -3.971 7.16e-05 ***
MonthlyRevenue 2.28941 0.96200 2.380 0.017321 *
MonthlyMinutes -2.03429 0.29494 -6.897 5.30e-12 ***
PercChangeRevenues 8.32346 1.32086 6.302 2.95e-10 ***
HandsetPriceMissing.1 -0.14887 0.03351 -4.442 8.91e-06 ***
DroppedCalls 1.81708 0.33950 5.352 8.69e-08 ***
CreditRatingA.1 -0.17497 0.03528 -4.959 7.09e-07 ***
RespondsToMailOffers.1 -0.12937 0.02632 -4.915 8.89e-07 ***
HandsetWebCapable.1 -0.15784 0.03744 -4.216 2.49e-05 ***
Handsets 1.52622 0.33966 4.493 7.01e-06 ***
ChildrenInHH.1 0.10548 0.02656 3.971 7.15e-05 ***
RoamingCalls 8.15306 2.28447 3.569 0.000358 ***
InboundCalls -1.21619 0.42089 -2.890 0.003858 **
NewCellphoneUser.1 -0.07196 0.02650 -2.716 0.006612 **
PrizmCodeSuburban.1 -0.06211 0.02237 -2.776 0.005503 **
OverageMinutes 3.22493 1.18550 2.720 0.006522 **
MaritalStatusMissing.1 0.06429 0.02739 2.347 0.018921 *
ThreewayCalls -1.89818 0.73479 -2.583 0.009786 **
BlockedCalls 0.97697 0.35099 2.783 0.005378 **
CustomerCareCalls -1.95427 0.89980 -2.172 0.029864 *
OwnsMotorcycle.1 0.14334 0.08785 1.632 0.102758
OccupationHomemaker.1 0.28766 0.18910 1.521 0.128214
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 53983 on 38940 degrees of freedom
Residual deviance: 52340 on 38909 degrees of freedom
AIC: 52404
Number of Fisher Scoring iterations: 4
Both forward and backward selection converged on the same set of features (listed in a different order in the two summaries):
Churn ~ CurrentEquipmentDays + MadeCallToRetentionTeam. +
CreditRatingAA. + AgeHH1 + HandsetRefurbished. + MonthsInService +
PercChangeMinutes + UniqueSubs + ActiveSubs + TotalRecurringCharge +
MonthlyRevenue + MonthlyMinutes + PercChangeRevenues + HandsetPriceMissing. +
DroppedCalls + CreditRatingA. + RespondsToMailOffers. + HandsetWebCapable. +
Handsets + ChildrenInHH. + RoamingCalls + InboundCalls +
NewCellphoneUser. + PrizmCodeSuburban. + OverageMinutes +
MaritalStatusMissing. + ThreewayCalls + BlockedCalls + CustomerCareCalls +
OwnsMotorcycle. + OccupationHomemaker.
Using the selected features, we can train a logistic regression model on the calibration dataset and evaluate its performance on the unseen test dataset.
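A sketch of this step, reusing the formula selected above and storing predicted churn probabilities in the predict_logit column referenced below:

# Refit the logistic regression on the calibration set
logit.model <- glm(formula(reality.forward), data = calibrat.data,
                   family = "binomial")

# Predicted probability of churn for each test customer
test.data$predict_logit <- predict(logit.model, newdata = test.data,
                                   type = "response")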
summary(logit.model)
Call:
glm(formula = Churn ~ MonthlyRevenue + MonthlyMinutes + TotalRecurringCharge +
OverageMinutes + RoamingCalls + PercChangeMinutes + PercChangeRevenues +
DroppedCalls + BlockedCalls + CustomerCareCalls + ThreewayCalls +
InboundCalls + MonthsInService + UniqueSubs + ActiveSubs +
Handsets + CurrentEquipmentDays + AgeHH1 + ChildrenInHH. +
CreditRatingA. + CreditRatingAA. + PrizmCodeSuburban. + HandsetRefurbished. +
HandsetWebCapable. + OccupationHomemaker. + MaritalStatusMissing. +
RespondsToMailOffers. + NewCellphoneUser. + OwnsMotorcycle. +
HandsetPriceMissing. + MadeCallToRetentionTeam., family = "binomial",
data = calibrat.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5199 -1.1389 -0.6925 1.1488 2.4782
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.33623 0.33043 -1.018 0.308902
MonthlyRevenue 2.28941 0.96200 2.380 0.017321 *
MonthlyMinutes -2.03429 0.29494 -6.897 5.30e-12 ***
TotalRecurringCharge -1.39297 0.35079 -3.971 7.16e-05 ***
OverageMinutes 3.22493 1.18550 2.720 0.006522 **
RoamingCalls 8.15306 2.28447 3.569 0.000358 ***
PercChangeMinutes -4.45759 0.48145 -9.259 < 2e-16 ***
PercChangeRevenues 8.32346 1.32086 6.302 2.95e-10 ***
DroppedCalls 1.81708 0.33950 5.352 8.69e-08 ***
BlockedCalls 0.97697 0.35099 2.783 0.005378 **
CustomerCareCalls -1.95427 0.89980 -2.172 0.029864 *
ThreewayCalls -1.89818 0.73479 -2.583 0.009786 **
InboundCalls -1.21619 0.42089 -2.890 0.003858 **
MonthsInService -1.23041 0.10226 -12.032 < 2e-16 ***
UniqueSubs 35.75536 3.88456 9.204 < 2e-16 ***
ActiveSubs -10.90967 1.47409 -7.401 1.35e-13 ***
Handsets 1.52622 0.33966 4.493 7.01e-06 ***
CurrentEquipmentDays 2.63237 0.13256 19.857 < 2e-16 ***
AgeHH1 -0.32529 0.06370 -5.107 3.28e-07 ***
ChildrenInHH.1 0.10548 0.02656 3.971 7.15e-05 ***
CreditRatingA.1 -0.17497 0.03528 -4.959 7.09e-07 ***
CreditRatingAA.1 -0.36479 0.03426 -10.646 < 2e-16 ***
PrizmCodeSuburban.1 -0.06211 0.02237 -2.776 0.005503 **
HandsetRefurbished.1 0.23519 0.03169 7.422 1.15e-13 ***
HandsetWebCapable.1 -0.15784 0.03744 -4.216 2.49e-05 ***
OccupationHomemaker.1 0.28766 0.18910 1.521 0.128214
MaritalStatusMissing.1 0.06429 0.02739 2.347 0.018921 *
RespondsToMailOffers.1 -0.12937 0.02632 -4.915 8.89e-07 ***
NewCellphoneUser.1 -0.07196 0.02650 -2.716 0.006612 **
OwnsMotorcycle.1 0.14334 0.08785 1.632 0.102758
HandsetPriceMissing.1 -0.14887 0.03351 -4.442 8.91e-06 ***
MadeCallToRetentionTeam.1 0.73769 0.05773 12.779 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 53983 on 38940 degrees of freedom
Residual deviance: 52340 on 38909 degrees of freedom
AIC: 52404
Number of Fisher Scoring iterations: 4
confusionMatrix(as.factor(as.numeric(test.data$predict_logit > 0.5)), as.factor(test.data$Churn))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 17893 252
1 11889 334
Accuracy : 0.6002
95% CI : (0.5947, 0.6057)
No Information Rate : 0.9807
P-Value [Acc > NIR] : 1
Kappa : 0.0159
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.60080
Specificity : 0.56997
Pos Pred Value : 0.98611
Neg Pred Value : 0.02733
Prevalence : 0.98070
Detection Rate : 0.58921
Detection Prevalence : 0.59750
Balanced Accuracy : 0.58538
'Positive' Class : 0
print(paste0("Recall = ", recall(as.factor(as.numeric(test.data$predict_logit > 0.5)), as.factor(test.data$Churn))))
[1] "Recall = 0.600799140420388"
Based on the confusion matrix, the model correctly identifies 17,893 non-churners and 334 of the 586 churners in the test set. However, overall accuracy is only 60%, primarily because 11,889 non-churners are misclassified as churners. This reflects the pronounced class imbalance in the data: only about 2% of customers actually churn, so the 50% cutoff flags far more customers as churners than actually leave.
The organization can assess the costs associated with two types of errors: Type I error, where additional incentives are provided to a customer who is unlikely to churn, and Type II error, where a customer who actually churns is not provided with incentives due to the prediction that they would not churn. By evaluating these costs, the company could decide if a different threshold, other than 50%, should be applied to classify customers as likely to churn.
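As an illustration, the sketch below sweeps candidate thresholds and selects the one minimizing total expected cost on the test set; the per-error costs are purely hypothetical placeholders that the company would replace with its own estimates:

# Hypothetical unit costs (placeholders, not from the source)
cost.typeI  <- 10    # incentive wasted on a customer who would have stayed
cost.typeII <- 200   # revenue lost when a churner receives no incentive

thresholds <- seq(0.05, 0.95, by = 0.05)
total.cost <- sapply(thresholds, function(t) {
  pred <- as.numeric(test.data$predict_logit > t)
  fp <- sum(pred == 1 & test.data$Churn == 0)  # Type I errors
  fn <- sum(pred == 0 & test.data$Churn == 1)  # Type II errors
  fp * cost.typeI + fn * cost.typeII
})
thresholds[which.min(total.cost)]  # cost-minimizing cutoff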
Despite the low overall accuracy, the model is exceptionally reliable when it predicts that a customer will stay: 98.6% of the customers it labels as non-churners (the positive class "0" in the output above) did in fact remain. This is valuable for the company, because customers the model clears can safely be excluded from the incentive program, concentrating retention spending on the flagged group.
The classification threshold can also be adjusted without refitting the model. For instance, we can test performance when a customer is labeled a churner only if their predicted probability of churning exceeds 70%, instead of the default 50%.
print(confusionMatrix(as.factor(as.numeric(test.data$predict_logit > 0.7)), as.factor(test.data$Churn)))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 29293 562
1 489 24
Accuracy : 0.9654
95% CI : (0.9633, 0.9674)
No Information Rate : 0.9807
P-Value [Acc > NIR] : 1.00000
Kappa : 0.0261
Mcnemar's Test P-Value : 0.02636
Sensitivity : 0.98358
Specificity : 0.04096
Pos Pred Value : 0.98118
Neg Pred Value : 0.04678
Prevalence : 0.98070
Detection Rate : 0.96460
Detection Prevalence : 0.98311
Balanced Accuracy : 0.51227
'Positive' Class : 0
print(paste0("Recall (70% threshold) = ", recall(as.factor(as.numeric(test.data$predict_logit > 0.7)), as.factor(test.data$Churn))))
[1] "Recall (70% threshold) = 0.983580686320596"
With the 70% threshold, overall accuracy rises sharply to 96.5%, and recall climbs from 60% to 98.4%, while precision dips slightly from 98.6% to 98.1%. Note, however, that caret treats "0" (non-churn) as the positive class, so these figures describe how well non-churners are identified; churner detection itself deteriorates at this cutoff, with only 24 of the 586 churners caught (specificity of 0.041) and balanced accuracy falling from 0.585 to 0.512.
Both models demonstrate strengths in different aspects of classification, and the company can choose which model to use based on its specific business goals:

* Minimizing the risk of missing a churner (business cost: losing customers):
  * Model I (50% threshold): fewer churners were predicted to be non-churners (252).
  * Model II (70% threshold): more churners were predicted to be non-churners (562).
* Minimizing the cost of incorrectly identifying a churner (business cost: spending unnecessary resources on customers who won't churn):
  * Model I (50% threshold): a high number of non-churners were predicted to churn (11,889).
  * Model II (70% threshold): far fewer non-churners were predicted to churn (489).
The final step of this project is to identify the key factors that contribute most to customer churn. This analysis can provide valuable insights to help the telecommunications company design a contingency-based incentive plan aimed at reducing churn.
The quantitative importance of the variables is calculated as:

* Numeric variables: standard deviation of the feature × marginal effect of the feature
* Categorical variables: marginal effect of the feature
Based on this, we can compute an importance score for every feature and rank the features accordingly.
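A sketch of that calculation, using the absolute model coefficient as a stand-in for the marginal effect (an assumption; the source does not show the exact computation). Dummy-coded categorical terms such as ChildrenInHH.1 are not numeric columns of the calibration data, so they fall through to the categorical rule:

# Importance: sd(feature) * |coefficient| for numeric predictors,
# |coefficient| alone for categorical dummies
coefs <- coef(logit.model)[-1]  # drop the intercept
importance <- sapply(names(coefs), function(v) {
  b <- abs(coefs[[v]])
  if (v %in% names(calibrat.data) && is.numeric(calibrat.data[[v]])) {
    b * sd(calibrat.data[[v]])
  } else {
    b
  }
})
sort(importance, decreasing = TRUE)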
This ranking highlights the predictors in our dataset with the greatest impact on a customer's likelihood to churn. Key factors include whether the customer has contacted the retention team, the number of days they have used their current equipment, their customer profile (such as credit rating), and details about their equipment and handsets.
It is important to note that while certain features, such as whether the customer is a homemaker or owns a motorcycle, receive non-trivial importance scores, their coefficients are not statistically significant (p = 0.128 and p = 0.103 in the model summary).
Based on this analysis, some recommended retention actions include:
* CurrentEquipmentDays is strongly positively associated with churn: the longer a customer has kept their current handset, the more likely they are to leave. Offering equipment-upgrade incentives to customers with aging handsets is therefore a natural retention lever.