PART1: Predicting Customer Churn with Precision: Unleashing the Power of Machine Learning and Logistic Regression in Telecommunication Analytics
Strategic Insights Unveiled: Machine Learning and Logistic Regression Illuminate the Path to Proactive Anti-Churn Promotions by Identifying At-Risk Consumers
Author
Affiliations
John Karuitha
Karatina University, School of Business
University of the Witwatersrand, Johannesburg, School of Construction Economics & Management
Published
January 12, 2024
Modified
January 12, 2024
Executive Summary
This study focuses on optimizing the anti-churn campaign of a telecommunications company through advanced machine learning techniques. The analysis applies logistic regression to pinpoint 2,000 consumers for urgent contact in the anti-churn initiative. Through the logistic regression model, a total of 2,262 at-risk clients were identified, providing targeted insights to significantly enhance the efficacy of the campaign. The findings underscore the value of sophisticated machine learning methodologies for precise customer churn prediction and strategic intervention in the telecommunications sector.
1 Introduction
As of April 2023, many customers have been canceling their contracts with a mobile phone company, negatively affecting its performance. To tackle this issue, the company wants to reach out to active customers and offer them special deals to prevent them from canceling in the future—a strategy known as “anti-churn.” However, due to budget constraints, they can only contact 2,000 people. The goal is to identify those most likely to cancel in the next 3 months.
In this analysis, various tools are used to find 2,000 customers at risk of leaving for the anti-churn campaign. Here’s a breakdown: Section 2 explains the goals, Section 3 lists the methods used, Section 4 explores the data and finds key features related to customer cancellations, Section 5 puts the methods into action and assesses their performance, and finally, Section 6 concludes the analysis.
2 Objective
The primary objective of this analysis is to identify and select a targeted group of 2,000 active consumers who are at a heightened risk of canceling their contracts within the next 3 months.
3 Technique
The principle of the project is to construct several alternative target lists of 2,000 customers, using statistical methods of varying complexity, in order to improve the performance of the marketing campaign:
Random targeting
Business targeting
Profiled targeting
Scored targeting V1
Scored targeting V2
4 Data
There are two sets of data.
base_telecom_2022_12.txt has 44529 rows and 42 columns.
base_telecom_2023_03.txt has 22528 rows and 41 columns.
The extra column in base_telecom_2022_12.txt, flag_resiliation, indicates whether the consumer churned. I therefore use this data set to train the models and test them on the second data set, base_telecom_2023_03.txt.
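The import code is not reproduced in the text; a minimal loading sketch might look as follows, assuming the files are plain delimited text readable with readr. The delimiter is an assumption, and the object names telcom_two and telcom_one simply follow the later code chunks, with telcom_two taken to be the December 2022 training base.

library(readr)

# Assumed tab-delimited; adjust `delim` to match the actual files
telcom_two <- read_delim("base_telecom_2022_12.txt", delim = "\t")  # training base (has flag_resiliation)
telcom_one <- read_delim("base_telecom_2023_03.txt", delim = "\t")  # March 2023 base to be scored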
4.1 Data Exploration
The table below shows the summary statistics for numeric variables in the data. We see that the variables taille_ville and revenu_moyen_ville have a few missing observations.
I start by converting the date columns into the proper format. I also create an age column, age, defined as the current year (2023) minus the year of birth (date_naissance), to check whether age has a bearing on consumer churn. Similarly, I create a duration column that captures the time elapsed since the start of the contract. Finally, I create a feature indicating whether the client is under commitment, meaning they can only exit the plan by paying a contract fee: any client whose commitment period ends on or before December 31, 2023 is flagged as under commitment; otherwise they are not.
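As a rough illustration of this feature engineering (only date_naissance is named above; the contract-start and commitment-end column names, date_debut_contrat and date_fin_engagement, and the April 2023 reference date are assumptions):

library(dplyr)
library(lubridate)

telcom_two <- telcom_two %>%
  mutate(
    age       = 2023 - year(date_naissance),                                   # age in years
    duration  = as.numeric(ymd("2023-04-01") - as_date(date_debut_contrat)),   # days since contract start (reference date assumed)
    committed = if_else(as_date(date_fin_engagement) <= ymd("2023-12-31"), 1, 0)  # under commitment per the rule above
  )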
I plot the extent of churn in the data. The figure below shows that churn cases are relatively few compared with the consumers who stay. This is an important consideration for modelling: the data may need to be balanced so that the characteristics of the customers who churn are aptly captured.
Code
telcom_two %>%
  ggplot(aes(x = factor(flag_resiliation))) +
  geom_bar() +
  labs(
    title = "Prevalence of Churn in the Data",
    subtitle = "There is a serious imbalance, with non-churn over-represented in the data",
    x = "Churned?",
    y = "Count"
  )
I create a correlation funnel, which bins the data and computes the correlation of each bin with churn. Reading from the top, we see the variables with the strongest linear relationships with churn. In this case, the nb_reengagements and vols_appels variables are the most strongly related to churn, while sex has the weakest linear relationship with churn.
CorrelationFunnel Package
The correlationfunnel package includes a streamlined 3-step process for preparing data and performing visual correlation analysis. The visualization produced uncovers insights by elevating high-correlation features and lowering low-correlation features. The shape looks like a funnel (hence the name “Correlation Funnel”), making it very efficient to understand which features are most likely to provide business insights and lend themselves well to a [machine learning model](https://cran.r-project.org/web/packages/correlationfunnel/vignettes/introducing_correlation_funnel.html).
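A minimal sketch of that three-step workflow applied here might look like this. The binarized target name flag_resiliation__1 assumes the flag is coded 0/1, and the identifier column dropped first is assumed to be called id; missing values are assumed to have been handled beforehand.

library(correlationfunnel)
library(dplyr)

telcom_two %>%
  select(-id) %>%                                   # drop the client identifier before binning (column name assumed)
  binarize(n_bins = 4, thresh_infreq = 0.01) %>%    # step 1: bin numeric features, one-hot encode categoricals
  correlate(target = flag_resiliation__1) %>%       # step 2: correlate every bin with churn
  plot_correlation_funnel()                         # step 3: draw the funnel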
5 Model Implementation and Assessment
To run the machine learning models, I start by creating a recipe that I use throughout the analysis.
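The recipe itself is not reproduced in the text. Below is a sketch of what it might contain, using the recipes and themis packages: the imputation targets follow the missing values noted in Section 4.1 and the downsampling step addresses the class imbalance flagged earlier, but the client-ID role assignment and the exact steps are assumptions.

library(recipes)
library(themis)   # provides step_downsample()

# Sketch only; assumes flag_resiliation has been converted to a factor
churn_recipe <- recipe(flag_resiliation ~ ., data = telcom_two) %>%
  update_role(id, new_role = "id") %>%                        # keep the client ID out of the predictors (column name assumed)
  step_impute_median(taille_ville, revenu_moyen_ville) %>%    # fill the few missing numeric values
  step_dummy(all_nominal_predictors()) %>%                    # one-hot encode categorical features
  step_downsample(flag_resiliation)                           # balance churners and non-churners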
5.1 Logistic regression model
Logistic regression is a statistical model widely used in machine learning for binary and multi-class classification tasks. Despite its name, logistic regression is used for classification, not regression (Boateng and Abaye 2019). It models the probability that an observation belongs to a particular class based on one or more predictor variables; the logistic function, also known as the sigmoid function, is used to squash the output into the range (0, 1). The model is useful mainly for its simplicity and interpretability. The AUC curve below shows the model does better than the null model, and the tables in the following subsections report the metrics used to compare the models (Das 2021).
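Formally, for predictors \(x_1, \dots, x_k\) the model estimates

$$
\Pr(\text{flag\_resiliation} = 1 \mid x) = \sigma\!\left(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}},
$$

and an observation is classed as a churner when this probability is at least 0.5, which is the cutoff applied in the code below.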
5.1.1 Model 1: Full Model
I start by fitting the full model with all the variables. Some variables, such as the client ID, add little value; in the following models, we make gradual improvements by removing redundant variables.
Code
## Set up the base model with all predictors (engineered columns excluded here)
model1 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>% dplyr::select(-age, -duration, -committed),
  family = "binomial"
)

## Helper: confusion-matrix summary metrics for a fitted glm
## (predict() on a glm takes `newdata`, not `new_data`)
metrics <- function(model, type = "response", new_data = telcom_two) {
  tibble(pred = predict(model, newdata = new_data, type = type)) %>%
    bind_cols(new_data) %>%
    mutate(class = case_when(pred >= 0.5 ~ 1, .default = 0)) %>%
    mutate(
      flag_resiliation = factor(flag_resiliation),
      class = factor(class)
    ) %>%
    conf_mat(truth = flag_resiliation, estimate = class) %>%
    summary()
}

## Helper: plot the confusion matrix for a fitted glm
plotter <- function(model, type = "response", new_data = telcom_two) {
  tibble(pred = predict(model, newdata = new_data, type = type)) %>%
    bind_cols(new_data) %>%
    mutate(class = case_when(pred >= 0.5 ~ 1, .default = 0)) %>%
    mutate(
      flag_resiliation = factor(flag_resiliation),
      class = factor(class)
    ) %>%
    conf_mat(truth = flag_resiliation, estimate = class) %>%
    autoplot()
}
We look at the metrics for model 1, which we treat as the base model. The overall accuracy and sensitivity are reasonably high, but the specificity, which here reflects how well the churn cases are detected, is low at 0.37.
Code
metrics(model = model1) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Plotting the confusion matrix shows that the model has a less than 50% chance of detecting the churn cases.
Code
plotter(model = model1)
Confusion matrix for model 1
5.1.2 Model 2: Remove the statistically insignificant sex variable
In this model, I remove the sex variable, which is not a useful predictor of churn in the training set, going by its p-value. There is a marginal improvement in the number of churn cases captured. While dropping the irrelevant variable reduces redundancy, the improvement in specificity and sensitivity, the two crucial metrics in this case, is marginal.
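For reference, one way to inspect the p-values that guide these variable removals is via broom; this is an assumption, and the author may equally have read them off summary(model1).

library(broom)
library(dplyr)

# List the weakest predictors in the full model by p-value
tidy(model1) %>%
  arrange(desc(p.value)) %>%
  head(10)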
Code
## Model 2: drop the sexe variable
model2 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>% dplyr::select(-sexe, -age, -duration, -committed),
  family = "binomial"
)

## Score the March 2023 base and convert probabilities to classes
model2_pred <- predict(model2, newdata = telcom_one, type = "response")

classes2 <- tibble(prob = model2_pred) %>%
  mutate(class = case_when(prob >= 0.5 ~ 1, .default = 0))
Code
metrics(model2) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Code
plotter(model2)
Confusion matrix for model 2
5.1.3 Model 3: Removal of variable mode_paiement
In this model, I remove one further variable mode_paiement. We note that the model is able to identify more cases of churn in the test set.
Code
## Model 3: additionally drop mode_paiement
model3 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>%
    dplyr::select(-sexe, -mode_paiement, -age, -duration, -committed),
  family = "binomial"
)

model3_pred <- predict(model3, newdata = telcom_one, type = "response")

classes3 <- tibble(prob = model3_pred) %>%
  mutate(class = case_when(prob >= 0.5 ~ 1, .default = 0))
Let us examine the metrics for the model. Removing this variable simplifies the model without a significant loss of power. A simpler model with fewer variables is preferable when the additional variables do not add value.
Code
metrics(model3) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Looking at the confusion matrix paints a similar picture.
Code
plotter(model3)
Confusion matrix for model 3
5.1.4 Model 4: Model without the Variable flag_appels_vers_international
The model summary shows that the variable flag_appels_vers_international is not significant. We again remove this variable from the model. We see marginal gains in sensitivity and specificity.
Code
## Model 4: additionally drop flag_appels_vers_international
model4 <- glm(
  factor(flag_resiliation) ~ .,
  data = telcom_two %>%
    dplyr::select(-sexe, -mode_paiement, -flag_appels_vers_international,
                  -age, -duration, -committed),
  family = "binomial"
)

model4_pred <- predict(model4, newdata = telcom_one, type = "response")

classes4 <- tibble(prob = model4_pred) %>%
  mutate(class = case_when(prob >= 0.5 ~ 1, .default = 0))
Code
metrics(model4) %>% gt()
| .metric              | .estimator | .estimate |
|----------------------|------------|-----------|
| accuracy             | binary     | 0.85      |
| kap                  | binary     | 0.39      |
| sens                 | binary     | 0.96      |
| spec                 | binary     | 0.37      |
| ppv                  | binary     | 0.87      |
| npv                  | binary     | 0.66      |
| mcc                  | binary     | 0.42      |
| j_index              | binary     | 0.33      |
| bal_accuracy         | binary     | 0.66      |
| detection_prevalence | binary     | 0.90      |
| precision            | binary     | 0.87      |
| recall               | binary     | 0.96      |
| f_meas               | binary     | 0.91      |
Code
plotter(model4)
Confusion matrix for model 4
5.1.5 Model 5: Model without the variable flag_appels_depuis_international
Removing the variable flag_appels_depuis_international simplifies the model further, with marginal gains in the metrics. The removal does not make the model worse off, and we prefer this model because it carries no redundant variables.
Here I add an age column, the difference between the current year and the year of birth (captured in days in the data). I postulate that older clients are less likely to leave than younger ones. Including the age variable raises the specificity and marginally lowers the sensitivity of the model, so this variable is a useful addition.
In this model, I include the duration that the client has been with the company. We postulate that a client who has stayed longer will find it more difficult to leave, having built up a list of contacts who communicate with them regularly on that line. However, we find that duration, while statistically significant, lowers specificity while leaving sensitivity at the same level. We shall therefore consider removing the duration variable.
We include the committed variable in this model. Most of the metrics remain stable, but commitment is a statistically significant factor in explaining churn, so we retain the variable.
I use model 10 to pick my list of 2,000 people to be reached immediately. The model has the fewest variables but does not lose power relative to the models developed previously. Examining the performance of this model out of sample gives a list of the following 2262 people.
Code
classes10 %>% count(class) %>% gt()
| class | n     |
|-------|-------|
| 0     | 20267 |
| 1     | 2261  |
Let us view the first 10 people by ID. The full listing is in a .txt file named C4_Name.txt.
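As a sketch of how such a contact list might be produced and saved: model10_pred (the probabilities from model 10) and the client-ID column name id are assumptions, following the naming pattern of the earlier prediction objects.

library(dplyr)
library(readr)

# Rank clients in the March 2023 base by predicted churn probability and keep the top 2,000
target_list <- telcom_one %>%
  mutate(prob = model10_pred) %>%
  slice_max(prob, n = 2000) %>%
  select(id, prob)

# Write the IDs out for the campaign team
write_lines(target_list$id, "C4_Name.txt")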
6 Conclusion
The aim of this analysis was to enhance the effectiveness of a telecommunications company's anti-churn campaign. Ten machine learning models were employed to identify the 2,000 consumers deemed crucial for urgent contact in the anti-churn initiative. Using the logistic regression model, a total of 2,262 clients were identified for targeted intervention in the anti-churn campaign.
References
Boateng, Ernest Yeboah, and Daniel A. Abaye. 2019. “A Review of the Logistic Regression Model with Emphasis on Medical Research.” Journal of Data Analysis and Information Processing 7 (4): 190–207.
Das, Abhik. 2021. “Logistic Regression.” In Encyclopedia of Quality of Life and Well-Being Research, 1–2. Springer.