PART1: Predicting Customer Churn with Precision: Unleashing the Power of Machine Learning and Logistic Regression in Telecommunication Analytics
Strategic Insights Unveiled: Machine Learning and Logistic Regression Illuminate the Path to Proactive Anti-Churn Promotions by Identifying At-Risk Consumers
Author
Affiliations
John Karuitha
Karatina University, School of Business
University of the Witwatersrand, Johannesburg, School of Construction Economics & Management
Published
January 12, 2024
Modified
January 12, 2024
Executive Summary
This study focuses on optimizing the anti-churn campaign of a telecommunications company through advanced machine learning techniques. The analysis applies logistic regression to pinpoint 2,000 consumers for urgent contact in the anti-churn initiative. Through the logistic regression model, a total of 2,262 at-risk clients were identified, providing targeted insights to significantly enhance the efficacy of the campaign. The findings underscore the value of sophisticated machine learning methodologies for precise customer churn prediction and strategic intervention in the telecommunications sector.
1 Introduction
As of April 2023, many customers have been canceling their contracts with a mobile phone company, negatively affecting its performance. To tackle this issue, the company wants to reach out to active customers and offer them special deals to prevent them from canceling in the future—a strategy known as “anti-churn.” However, due to budget constraints, they can only contact 2,000 people. The goal is to identify those most likely to cancel in the next 3 months.
In this analysis, various tools are used to find 2,000 customers at risk of leaving for the anti-churn campaign. Here’s a breakdown: Section 2 explains the goals, Section 3 lists the methods used, Section 4 explores the data and finds key features related to customer cancellations, Section 5 puts the methods into action and assesses their performance, and finally, Section 6 concludes the analysis.
2 Objective
The primary objective of this analysis is to identify and select a targeted group of 2,000 active consumers who are at a heightened risk of canceling their contracts within the next 3 months.
3 Technique
The principle of the project is to construct several alternative target lists of 2,000 customers, using statistical methods of varying complexity, in order to improve the performance of the marketing campaign:
Random targeting
Business targeting
Profiled targeting
Scored targeting V1
Scored targeting V2
4 Data
There are two sets of data.
base_telecom_2022_12.txt has 44529 rows and 42 columns.
base_telecom_2023_03.txt has 22528 rows and 41 columns.
The extra column in base_telecom_2022_12.txt, flag_resiliation, indicates whether the consumer churned. I therefore use this data set to train the models and test them on the second data set, base_telecom_2023_03.txt.
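The import code is not reproduced in the text; a minimal loading sketch might look as follows, assuming the files are plain delimited text readable with readr. The delimiter is an assumption, and the object names telcom_two and telcom_one simply follow the later code chunks, with telcom_two taken to be the December 2022 training base.

library(readr)

# Assumed tab-delimited; adjust `delim` to match the actual files
telcom_two <- read_delim("base_telecom_2022_12.txt", delim = "\t")  # training base (has flag_resiliation)
telcom_one <- read_delim("base_telecom_2023_03.txt", delim = "\t")  # March 2023 base to be scored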
4.1 Data Exploration
The table below shows the summary statistics for numeric variables in the data. We see that the variables taille_ville and revenu_moyen_ville have a few missing observations.
I start by converting the date columns into the proper format. I also create an age column, age, defined as the current year (2023) minus the year of birth (date_naissance), to check whether age has a bearing on consumer churn. Similarly, I create a duration column that captures the time elapsed since the start of the contract. Finally, I create a feature indicating whether the client is under commitment, meaning they can only exit the plan by paying a contract fee: any client whose commitment period ends on or before December 31, 2023 is flagged as under commitment; otherwise they are not.
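As a rough illustration of this feature engineering (only date_naissance is named above; the contract-start and commitment-end column names, date_debut_contrat and date_fin_engagement, and the April 2023 reference date are assumptions):

library(dplyr)
library(lubridate)

telcom_two <- telcom_two %>%
  mutate(
    age       = 2023 - year(date_naissance),                                   # age in years
    duration  = as.numeric(ymd("2023-04-01") - as_date(date_debut_contrat)),   # days since contract start (reference date assumed)
    committed = if_else(as_date(date_fin_engagement) <= ymd("2023-12-31"), 1, 0)  # under commitment per the rule above
  )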
I plot the extent of churn in the data. The figure below shows that churn cases are relatively few compared with the consumers who stay. This is an important consideration for modelling: the data may need to be balanced so that the characteristics of the customers who churn are aptly captured.
Code
telcom_two %>%
  ggplot(aes(x = factor(flag_resiliation))) +
  geom_bar() +
  labs(
    title = "Prevalence of Churn in the Data",
    subtitle = "There is a serious imbalance, with non-churn over-represented in the data",
    x = "Churned?",
    y = "Count"
  )
I create a correlation funnel, which bins the data and computes the correlation of each bin with churn. Reading from the top, we see the variables with the strongest linear relationships with churn. In this case, the nb_reengagements and vols_appels variables are the most strongly related to churn, while sex has the weakest linear relationship with churn.
CorrelationFunnel Package
The correlationfunnel package includes a streamlined 3-step process for preparing data and performing visual correlation analysis. The visualization produced uncovers insights by elevating high-correlation features and lowering low-correlation features. The shape looks like a funnel (hence the name “Correlation Funnel”), making it very efficient to understand which features are most likely to provide business insights and lend themselves well to a [machine learning model](https://cran.r-project.org/web/packages/correlationfunnel/vignettes/introducing_correlation_funnel.html).
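A minimal sketch of that three-step workflow applied here might look like this. The binarized target name flag_resiliation__1 assumes the flag is coded 0/1, and the identifier column dropped first is assumed to be called id; missing values are assumed to have been handled beforehand.

library(correlationfunnel)
library(dplyr)

telcom_two %>%
  select(-id) %>%                                   # drop the client identifier before binning (column name assumed)
  binarize(n_bins = 4, thresh_infreq = 0.01) %>%    # step 1: bin numeric features, one-hot encode categoricals
  correlate(target = flag_resiliation__1) %>%       # step 2: correlate every bin with churn
  plot_correlation_funnel()                         # step 3: draw the funnel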
5 Model Implementation and Assessment
To run the machine learning models, I start by creating a recipe that I use throughout the analysis.
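The recipe itself is not reproduced in the text. Below is a sketch of what it might contain, using the recipes and themis packages: the imputation targets follow the missing values noted in Section 4.1 and the downsampling step addresses the class imbalance flagged earlier, but the client-ID role assignment and the exact steps are assumptions.

library(recipes)
library(themis)   # provides step_downsample()

# Sketch only; assumes flag_resiliation has been converted to a factor
churn_recipe <- recipe(flag_resiliation ~ ., data = telcom_two) %>%
  update_role(id, new_role = "id") %>%                        # keep the client ID out of the predictors (column name assumed)
  step_impute_median(taille_ville, revenu_moyen_ville) %>%    # fill the few missing numeric values
  step_dummy(all_nominal_predictors()) %>%                    # one-hot encode categorical features
  step_downsample(flag_resiliation)                           # balance churners and non-churners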
5.1 Logistic regression model
Logistic regression is a statistical model widely used in machine learning for binary and multi-class classification tasks. Despite its name, logistic regression is used for classification, not regression (Boateng and Abaye 2019). It models the probability that an observation belongs to a particular class based on one or more predictor variables; the logistic function, also known as the sigmoid function, is used to squash the output into the range (0, 1). The model is useful mainly for its simplicity and interpretability. The AUC curve below shows the model does better than the null model, and the tables in the following subsections report the metrics used to compare the models (Das 2021).
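Formally, for predictors \(x_1, \dots, x_k\) the model estimates

$$
\Pr(\text{flag\_resiliation} = 1 \mid x) = \sigma\!\left(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}},
$$

and an observation is classed as a churner when this probability is at least 0.5, which is the cutoff applied in the code below.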
5.1.1 Model 1: Full Model
I start by fitting the full model with all the variables. Some variables, such as the client ID, add little value; in the following models, we make gradual improvements by removing redundant variables.
Code
## Set up the base model with all predictors (engineered columns excluded here)
model1 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>% dplyr::select(-age, -duration, -committed),
  family = "binomial"
)

## Helper: confusion-matrix summary metrics for a fitted glm
## (predict() on a glm takes `newdata`, not `new_data`)
metrics <- function(model, type = "response", new_data = telcom_two) {
  tibble(pred = predict(model, newdata = new_data, type = type)) %>%
    bind_cols(new_data) %>%
    mutate(class = case_when(pred >= 0.5 ~ 1, .default = 0)) %>%
    mutate(
      flag_resiliation = factor(flag_resiliation),
      class = factor(class)
    ) %>%
    conf_mat(truth = flag_resiliation, estimate = class) %>%
    summary()
}

## Helper: plot the confusion matrix for a fitted glm
plotter <- function(model, type = "response", new_data = telcom_two) {
  tibble(pred = predict(model, newdata = new_data, type = type)) %>%
    bind_cols(new_data) %>%
    mutate(class = case_when(pred >= 0.5 ~ 1, .default = 0)) %>%
    mutate(
      flag_resiliation = factor(flag_resiliation),
      class = factor(class)
    ) %>%
    conf_mat(truth = flag_resiliation, estimate = class) %>%
    autoplot()
}
We look at the metrics for model 1, which we treat as the base model. The overall accuracy and sensitivity are reasonably high, but the specificity, which here reflects how well the churn cases are detected, is low at 0.37.
Code
metrics(model = model1) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Plotting the confusion matrix shows that the model has a less than 50% chance of detecting the churn cases.
Code
plotter(model = model1)
Confusion matrix for model 1
5.1.2 Model 2: Remove the statistically insignificant sex variable
In this model, I remove the sex variable, which is not a useful predictor of churn in the training set, going by its p-value. There is a marginal improvement in the number of churn cases captured. While dropping the irrelevant variable reduces redundancy, the improvement in specificity and sensitivity, the two crucial metrics in this case, is marginal.
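For reference, one way to inspect the p-values that guide these variable removals is via broom; this is an assumption, and the author may equally have read them off summary(model1).

library(broom)
library(dplyr)

# List the weakest predictors in the full model by p-value
tidy(model1) %>%
  arrange(desc(p.value)) %>%
  head(10)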
Code
## Model 2: drop the sexe variable
model2 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>% dplyr::select(-sexe, -age, -duration, -committed),
  family = "binomial"
)

## Score the March 2023 base and convert probabilities to classes
model2_pred <- predict(model2, newdata = telcom_one, type = "response")

classes2 <- tibble(prob = model2_pred) %>%
  mutate(class = case_when(prob >= 0.5 ~ 1, .default = 0))
Code
metrics(model2) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Code
plotter(model2)
Confusion matrix for model 2
5.1.3 Model 3: Removal of variable mode_paiement
In this model, I remove one further variable mode_paiement. We note that the model is able to identify more cases of churn in the test set.
Code
## Model 3: additionally drop mode_paiement
model3 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>%
    dplyr::select(-sexe, -mode_paiement, -age, -duration, -committed),
  family = "binomial"
)

model3_pred <- predict(model3, newdata = telcom_one, type = "response")

classes3 <- tibble(prob = model3_pred) %>%
  mutate(class = case_when(prob >= 0.5 ~ 1, .default = 0))
Let us examine the metrics for the model. Removing this variable simplifies the model without a significant loss of power. A simpler model with fewer variables is preferable when the additional variables do not add value.
Code
metrics(model3) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Looking at the confusion matrix paints a similar picture.
Code
plotter(model3)
Confusion matrix for model 3
5.1.4 Model 4: Model without the Variable flag_appels_vers_international
The model summary shows that the variable flag_appels_vers_international is not significant. We again remove this variable from the model. We see marginal gains in sensitivity and specificity.
Code
## Model 4: additionally drop flag_appels_vers_international
model4 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>%
    dplyr::select(-sexe, -mode_paiement, -flag_appels_vers_international,
                  -age, -duration, -committed),
  family = "binomial"
)

model4_pred <- predict(model4, newdata = telcom_one, type = "response")

classes4 <- tibble(prob = model4_pred) %>%
  mutate(class = case_when(prob >= 0.5 ~ 1, .default = 0))
Code
metrics(model4) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Code
plotter(model4)
Confusion matrix for model 4
5.1.5 Model 5: Model without the variable flag_appels_depuis_international
Removing the variable flag_appels_depuis_international simplifies the model further, with marginal gains in the metrics. The removal does not make the model worse off, and we prefer this model because it carries no redundant variables.
Here I add an age column, the difference between the current year and the year of birth (captured in days in the data). I postulate that older clients are less likely to leave than younger ones. Including the age variable raises the specificity and marginally lowers the sensitivity of the model, so this variable is a useful addition.
In this model, I include the duration that the client has been with the company. We postulate that a client who has stayed longer will find it more difficult to leave, having built up a list of contacts who communicate with them regularly on that line. However, we find that duration, while statistically significant, lowers specificity while leaving sensitivity at the same level. We shall therefore consider removing the duration variable.
We include the committed variable in this model. Most of the metrics remain stable, but commitment is a statistically significant factor in explaining churn, so we retain the variable.
I use model 10 to pick my list of 2,000 people to be reached immediately. The model has the fewest variables but does not lose power relative to the models developed previously. Examining the performance of this model out of sample gives a list of the following 2262 people.
Code
classes10 %>% count(class) %>% gt()
| class | n     |
|-------|-------|
| 0     | 20267 |
| 1     | 2261  |
Let us view the first 10 people by ID. The full listing is in a .txt file named C4_Name.txt.
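As a sketch of how such a contact list might be produced and saved: model10_pred (the probabilities from model 10) and the client-ID column name id are assumptions, following the naming pattern of the earlier prediction objects.

library(dplyr)
library(readr)

# Rank clients in the March 2023 base by predicted churn probability and keep the top 2,000
target_list <- telcom_one %>%
  mutate(prob = model10_pred) %>%
  slice_max(prob, n = 2000) %>%
  select(id, prob)

# Write the IDs out for the campaign team
write_lines(target_list$id, "C4_Name.txt")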
6 Conclusion
The aim of this analysis was to enhance the effectiveness of a telecommunications company's anti-churn campaign. Ten machine learning models were employed to identify the 2,000 consumers deemed crucial for urgent contact in the anti-churn initiative. Using the logistic regression model, a total of 2,262 clients were identified for targeted intervention in the anti-churn campaign.
References
Boateng, Ernest Yeboah, and Daniel A. Abaye. 2019. “A Review of the Logistic Regression Model with Emphasis on Medical Research.” Journal of Data Analysis and Information Processing 7 (4): 190–207.
Das, Abhik. 2021. “Logistic Regression.” In Encyclopedia of Quality of Life and Well-Being Research, 1–2. Springer.