PART 1: Predicting Customer Churn with Precision: Unleashing the Power of Machine Learning and Logistic Regression in Telecommunication Analytics

Strategic Insights Unveiled: Machine Learning and Logistic Regression Illuminate the Path to Proactive Anti-Churn Promotions by Identifying At-Risk Consumers

Author
Affiliations

John Karuitha

Karatina University, School of Business

University of the Witwatersrand, Johannesburg, School of Construction Economics & Management

Published

January 12, 2024

Modified

January 12, 2024

Executive Summary

This study focuses on optimizing the anti-churn campaign of a telecommunications company through machine learning. The analysis applies logistic regression to pinpoint 2,000 consumers for urgent contact in the anti-churn initiative. The logistic regression model flags a total of 2,261 clients as at risk, from whom the 2,000 contacts can be drawn, providing targeted insights to significantly enhance the efficacy of the anti-churn campaign. The findings underscore the value of employing machine learning methodologies for precise customer churn prediction and strategic intervention in the telecommunications sector.

1 Introduction

As of April 2023, many customers have been canceling their contracts with a mobile phone company, negatively affecting its performance. To tackle this issue, the company wants to reach out to active customers and offer them special deals to prevent them from canceling in the future—a strategy known as “anti-churn.” However, due to budget constraints, they can only contact 2,000 people. The goal is to identify those most likely to cancel in the next 3 months.

In this analysis, various tools are used to find 2,000 customers at risk of leaving for the anti-churn campaign. Here’s a breakdown: Section 2 explains the goals, Section 3 lists the methods used, Section 4 explores the data and finds key features related to customer cancellations, Section 5 puts the methods into action and assesses their performance, and finally, Section 6 concludes the analysis.

I successfully identify 2,261 at-risk clients.

Code
if(!require(pacman)){
        install.packages('pacman')
        library(pacman) # attach pacman after a fresh install so p_load works
}

## Load (installing first if necessary) all packages used in the analysis
p_load(tidyverse, janitor, skimr, 
       ggthemes, gt, correlationfunnel,
       mice, doParallel, tidymodels,
       klaR, ranger, rpart, kknn,
       kernlab, LiblineaR, brulee,
       conflicted, themis, xgboost,
       usemodels, AppliedPredictiveModeling,
       discrim, baguette, nnet, patchwork,
       kableExtra, caret, corrplot)

theme_set(ggthemes::theme_wsj())
options(digits = 2)
options(scipen = 999)

## Speed: hasten code execution through parallel computing
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makeCluster(all_cores)
registerDoParallel(cl)
## Release the workers with stopCluster(cl) once the analysis is done
Code
## Test set (March 2023) and training set (December 2022); the training
## set is run through mice to impute missing values
telcom_one <- read.csv2("base_telecom_2023_03.txt")
telcom_two <- read.csv2("base_telecom_2022_12.txt") %>% 
        mice(seed = 234, 
             printFlag = FALSE) %>% 
        complete()

2 Objective

The primary objective of this analysis is to identify and select a targeted group of 2,000 active consumers who are at a heightened risk of canceling their contracts within the next 3 months.

3 Technique

The principle of the project is to construct several target lists of 2,000 customers each, using statistical methods of increasing complexity, in order to improve the performance of the marketing campaign:

  1. Random targeting.
  2. Business targeting.
  3. Profiled targeting.
  4. Scored targeting V1.
  5. Scored targeting V2.

4 Data

There are two sets of data.

  • base_telecom_2022_12.txt has 44529 rows and 42 columns.

  • base_telecom_2023_03.txt has 22528 rows and 41 columns.

The extra column in base_telecom_2022_12.txt, flag_resiliation, indicates whether the consumer churned or not. Hence, I use this data set to train the models and test them on the second data set, base_telecom_2023_03.txt.
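
As a quick sanity check (a sketch using the objects created above), the stated dimensions can be verified after loading the files:

Code
## Verify the stated dimensions of the two data sets
dim(telcom_two)   # expected: 44529 rows, 42 columns
dim(telcom_one)   # expected: 22528 rows, 41 columns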

4.1 Data Exploration

The table below shows the summary statistics for the numeric variables in the data. In the raw file, the variables taille_ville and revenu_moyen_ville have a few missing observations; after the mice imputation above, every numeric variable shows a complete rate of 1.
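
To see the original missingness (a sketch; the file is re-read so the imputed telcom_two above is left untouched):

Code
## Count missing values per column in the raw training file
raw <- read.csv2("base_telecom_2022_12.txt")
na_counts <- colSums(is.na(raw))
na_counts[na_counts > 0]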

Code
telcom_two %>% 
        dplyr::select(where(is.numeric)) %>% 
        skim_without_charts() %>% 
        dplyr::select(-skim_type, -n_missing) %>% 
        rename(
            Variable = skim_variable,
            Complete = complete_rate,
            Mean = numeric.mean,
            SD = numeric.sd,
            Min = numeric.p0,
            Q1 = numeric.p25,
            Median = numeric.p50,
            Q3 = numeric.p75,
            Max = numeric.p100
        ) %>% 
        gt(caption = "Summary Statistics: Numeric Variables")
Summary Statistics: Numeric Variables
Variable Complete Mean SD Min Q1 Median Q3 Max
flag_resiliation 1 0.18 0.38 0 0 0 0 1
taille_ville 1 58223.13 78778.66 37 6231 25701 73002 798074
revenu_moyen_ville 1 15744.46 4930.16 0 12482 14160 18071 57257
nb_migrations 1 1.46 1.44 0 0 1 2 13
flag_migration_hausse 1 0.35 0.48 0 0 0 1 1
flag_migration_baisse 1 0.50 0.50 0 0 1 1 1
nb_services 1 2.96 1.83 0 2 3 4 18
flag_personnalisation_repondeur 1 0.20 0.40 0 0 0 0 1
flag_telechargement_sonnerie 1 0.14 0.34 0 0 0 0 1
nb_reengagements 1 0.64 0.70 0 0 1 1 4
vol_appels_m6 1 18012.26 10562.61 0 9631 16683 25484 46229
vol_appels_m5 1 18002.92 10559.17 0 9626 16714 25532 46347
vol_appels_m4 1 18017.06 10597.15 0 9617 16667 25477 46613
vol_appels_m3 1 18056.74 10646.03 0 9575 16731 25554 46172
vol_appels_m2 1 18046.34 10619.43 0 9584 16692 25576 47101
vol_appels_m1 1 18029.88 10639.35 0 9541 16631 25618 47151
flag_appels_vers_international 1 0.26 0.44 0 0 0 1 1
flag_appels_depuis_international 1 0.17 0.37 0 0 0 0 1
flag_appels_numeros_speciaux 1 0.58 0.49 0 0 1 1 1
nb_sms_m6 1 101.75 133.43 0 11 29 135 534
nb_sms_m5 1 101.59 133.35 0 11 30 135 534
nb_sms_m4 1 101.60 133.51 0 11 30 135 538
nb_sms_m3 1 101.54 133.65 0 10 31 135 535
nb_sms_m2 1 101.42 133.56 0 10 31 134 537
nb_sms_m1 1 101.17 133.47 0 10 32 134 536


Code
telcom_two %>% 
        dplyr::select(where(is.character)) %>% 
        skim_without_charts() %>% 
        dplyr::select(-n_missing) %>% 
        dplyr::rename(
                Variable = skim_variable,
                Complete = complete_rate,
                Char_min = character.min,
                Char_max = character.max,
                Empty = character.empty,
                Unique = character.n_unique,
                Blank = character.whitespace
                
        ) %>% 
        gt(caption = "Summary Statistics for Character Variables")
Summary Statistics for Character Variables
skim_type Variable Complete Char_min Char_max Empty Unique Blank
character id_client 1 15 15 0 44529 0
character date_naissance 1 0 10 45 15796 0
character sexe 1 0 8 6 3 0
character csp 1 5 19 0 8 0
character code_postal 1 4 5 0 3628 0
character type_ville 1 0 14 1944 5 0
character date_activation 1 10 10 0 2414 0
character enseigne 1 8 19 0 3 0
character mode_paiement 1 3 8 0 3 0
character duree_offre_init 1 1 3 0 8 0
character duree_offre 1 1 3 0 8 0
character telephone_init 1 12 15 0 4 0
character telephone 1 12 15 0 3 0
character date_fin_engagement 1 0 10 602 1956 0
character date_dernier_reengagement 1 0 10 21429 1016 0
character situation_impayes 1 12 15 0 3 0
character segment 1 1 1 0 3 0


4.2 Feature Engineering

I start by converting the date columns into proper date formats. I then create an age column, age, measured as the number of days between the current date and the date of birth (date_naissance); the goal is to find out whether age has a bearing on consumer churn. Similarly, I create a duration column that captures the number of days since the start of the contract. Finally, I create a feature, committed, that captures whether the client is under commitment, meaning they can only exit the plan by paying a contract fee. Following the code below, any client whose commitment period ends on or before December 31, 2022 is flagged as committed; otherwise, they are not committed.

Code
telcom_two <- telcom_two %>% 
        mutate(
                date_naissance = dmy(date_naissance),
                date_activation = dmy(date_activation),
                date_fin_engagement = dmy(date_fin_engagement),
                date_dernier_reengagement = dmy(date_dernier_reengagement)
        ) %>% 
        mutate(
                age = as.numeric(today() - date_naissance),
                duration = as.numeric(today() - date_activation)
        ) %>% 
        mutate(committed = case_when(
              date_fin_engagement <= as.Date("2022-12-31") ~ "Committed",
              .default = "Not Committed"
        )) %>% 
        dplyr::select(-starts_with("date"),
               -code_postal,
               -id_client) %>% 
        ## Impute again: the new age and duration columns contain NAs
        ## wherever the underlying dates were missing
        mice(seed = 234, printFlag = FALSE) %>% 
        complete()


## Apply the same feature engineering to the test set; the date columns
## and id_client are retained here since id_client is needed for the
## final listing of targeted clients
telcom_one <- telcom_one %>% 
        mutate(
                date_naissance = dmy(date_naissance),
                date_activation = dmy(date_activation),
                date_fin_engagement = dmy(date_fin_engagement),
                date_dernier_reengagement = dmy(date_dernier_reengagement)
        ) %>% 
        mutate(
                age = as.numeric(today() - date_naissance),
                duration = as.numeric(today() - date_activation)
        ) %>% 
        mutate(committed = case_when(
              date_fin_engagement <= as.Date("2022-12-31") ~ "Committed",
              .default = "Not Committed"
        ))

4.3 Data Visualization

I plot the extent of churn in our data. The figure below shows that incidents of churn are relatively few compared to the consumers who stay. During modelling this is an important consideration, as it may require balancing the data to aptly capture the characteristics of the people who churn; a sketch of one balancing approach follows the figure.

Code
telcom_two %>% 
        ggplot(aes(x = factor(flag_resiliation))) + 
        geom_bar() + 
        labs(title = "Prevalence of Churn in the Data",
             subtitle = "Non-churn is heavily over-represented in the data",
             x = "Churned?", y = "Count")

I create a correlation funnel that bins the data and computes the correlation of each binned feature with churn. Starting from the top, we see the variables that have the strongest linear relationships with churn: here, nb_reengagements and the vol_appels variables are highly related to churn, while sexe has the weakest linear relationship with churn.

CorrelationFunnel Package

The correlationfunnel package includes a streamlined 3-step process for preparing data and performing visual correlation analysis. The visualization produced uncovers insights by elevating high-correlation features and lowering low-correlation features. The shape looks like a funnel (hence the name “Correlation Funnel”), making it very efficient to understand which features are most likely to provide business insights and lend themselves well to a [machine learning model](https://cran.r-project.org/web/packages/correlationfunnel/vignettes/introducing_correlation_funnel.html).

Code
telcom_two %>% 
        binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>% 
        correlate("flag_resiliation__0") %>% 
        plot_correlation_funnel()

Correlation Funnel

5 Modeling

To run the models, I fit a series of logistic regressions, starting from a full model and progressively removing variables that add little value.

5.1 Logistic regression model

Logistic regression is a statistical model widely used in machine learning for binary and multi-class classification tasks. Despite its name, logistic regression is used for classification, not regression (Boateng and Abaye 2019). It models the probability of a sample belonging to a particular class based on one or more predictor variables. The logistic function, also known as the sigmoid function, squashes the output of the linear predictor into the range (0, 1). The model is useful mainly due to its simplicity and interpretability. The metrics tables below allow the model to be compared with the alternative specifications that follow (Das 2021).
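
As a brief illustration (a sketch, separate from the analysis itself), the sigmoid links the linear predictor of a glm to its predicted probability:

Code
## The sigmoid maps any real number onto the (0, 1) probability scale
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(c(-4, 0, 4))   # approx. 0.018, 0.500, 0.982

## For a fitted glm, predict(type = "link") returns the linear predictor
## and predict(type = "response") the probability; the sigmoid of the
## former equals the latter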

5.1.1 Model 1: Full Model

I start by fitting the full model with all the remaining variables (the client IDs and raw dates were already dropped during feature engineering). Some of the remaining variables still add little value, and in the following models we shall make gradual improvements by removing them.

Code
## Set up the full model; age, duration, and committed are held back
## and introduced in later models
model1 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                              -age,
                              -duration,
                              -committed
                      ), 
                  family = "binomial")

## Helper: confusion-matrix metrics for a fitted glm, using a 0.5
## probability cutoff; defaults to evaluating on the training data
metrics <- function(model, type = "response", 
                    new_data = telcom_two){
        
        tibble(pred = predict(model, 
                              newdata = new_data, 
                              type = type)) %>% 
        bind_cols(new_data) %>% 
                mutate(class = case_when(
                pred >= 0.5 ~ 1,
                .default = 0
        )) %>% 
        mutate(flag_resiliation = factor(flag_resiliation),
               class = factor(class)) %>% 
                conf_mat(truth = flag_resiliation,
                         estimate = class) %>% 
                summary()
}

## Helper: plot the confusion matrix for a fitted glm
plotter <- function(model, type = "response", 
                    new_data = telcom_two){
        
        tibble(pred = predict(model, 
                              newdata = new_data, 
                              type = type)) %>% 
        bind_cols(new_data) %>% 
                mutate(class = case_when(
                pred >= 0.5 ~ 1,
                .default = 0
        )) %>% 
        mutate(flag_resiliation = factor(flag_resiliation),
               class = factor(class)) %>% 
                conf_mat(truth = flag_resiliation,
                         estimate = class) %>% 
                autoplot()
}

We look at the metrics for model 1, which we treat as the base model. The model shows a high sensitivity (0.96) on the majority non-churn class but a low specificity (0.37) on the churn class.

Code
metrics(model = model1) %>% 
        gt()
.metric .estimator .estimate
accuracy binary 0.85
kap binary 0.39
sens binary 0.96
spec binary 0.37
ppv binary 0.87
npv binary 0.66
mcc binary 0.42
j_index binary 0.33
bal_accuracy binary 0.66
detection_prevalence binary 0.90
precision binary 0.87
recall binary 0.96
f_meas binary 0.91

Plotting the confusion matrix shows that the model has a less than 50% chance of detecting the churn cases. A sketch of how the 0.5 cutoff shapes this trade-off follows the plot.

Code
plotter(model = model1)

Confusion matrix for model 1
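
As a hedged sketch (not part of the original analysis), lowering the 0.5 cutoff flags more customers as churners, trading specificity on non-churners for better detection of churners:

Code
## Count how many training-set customers would be flagged as churners
## at a few candidate probability cutoffs
probs <- predict(model1, type = "response")
sapply(c(0.3, 0.4, 0.5), function(cutoff) sum(probs >= cutoff))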

5.1.2 Model 2: Remove the statistically insignificant sex variable

In this model, I remove the sexe variable, which is not a useful predictor of churn in the training set, going by its p-value; the sketch below shows how such p-values can be inspected. Removing the irrelevant variable reduces redundancy, but the improvement in the number of churn cases captured is marginal, as are the changes in specificity and sensitivity, the two crucial metrics in this case.
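
A minimal sketch of the inspection step (broom is loaded as part of tidymodels): tidy() lists the coefficients with their p-values, and terms with large p-values are candidates for removal.

Code
## Coefficients of the full model, least significant first
broom::tidy(model1) %>% 
        arrange(desc(p.value)) %>% 
        head()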

Code
model2 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        -age,
                        -duration,
                        -committed
                      ), 
                  family = "binomial")

model2_pred <- predict(model2, newdata = telcom_one, 
                       type = "response")

classes2 <- tibble(prob = model2_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))
Code
metrics(model2) %>% gt()
.metric .estimator .estimate
accuracy binary 0.85
kap binary 0.39
sens binary 0.96
spec binary 0.37
ppv binary 0.87
npv binary 0.66
mcc binary 0.42
j_index binary 0.33
bal_accuracy binary 0.66
detection_prevalence binary 0.90
precision binary 0.87
recall binary 0.96
f_meas binary 0.91
Code
plotter(model2)

Confusion matrix for model 2

5.1.3 Model 3: Removal of variable mode_paiement

In this model, I remove one further variable, mode_paiement. We note that the model is able to identify more cases of churn in the test set.

Code
model3 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        -age,
                        -duration,
                        -committed
                      ), 
                  family = "binomial")

model3_pred <- predict(model3, newdata = telcom_one, 
                       type = "response")

classes3 <- tibble(prob = model3_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))

Let us examine the metrics for the model. We see that the removal of this variable simplifies the model without a significant loss of power. A simpler model with fewer variables is preferable when the additional variables do not add value; the sketch below shows one common way to verify this with AIC.
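
A hedged sketch of such a check: AIC penalizes complexity, so a dropped variable is harmless when AIC does not rise materially.

Code
## Compare the nested models on AIC (lower is better)
AIC(model1, model2, model3)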

Code
metrics(model3) %>% gt()
.metric .estimator .estimate
accuracy binary 0.85
kap binary 0.39
sens binary 0.96
spec binary 0.37
ppv binary 0.87
npv binary 0.66
mcc binary 0.42
j_index binary 0.33
bal_accuracy binary 0.66
detection_prevalence binary 0.90
precision binary 0.87
recall binary 0.96
f_meas binary 0.91

Looking at the confusion matrix paints a similar picture.

Code
plotter(model3)

Confusion matrix for model 3

5.1.4 Model 4: Model without the Variable flag_appels_vers_international

The model summary shows that the variable flag_appels_vers_international is not significant, so we again remove this variable from the model. Any gains in sensitivity and specificity are marginal at the reported precision.

Code
model4 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        - flag_appels_vers_international,
                        -age,
                        -duration,
                        -committed
                      ), 
                  family = "binomial")


model4_pred <- predict(model4, newdata = telcom_one, 
                       type = "response")

classes4 <- tibble(prob = model4_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))
Code
metrics(model4) %>% gt()
.metric .estimator .estimate
accuracy binary 0.85
kap binary 0.39
sens binary 0.96
spec binary 0.37
ppv binary 0.87
npv binary 0.66
mcc binary 0.42
j_index binary 0.33
bal_accuracy binary 0.66
detection_prevalence binary 0.90
precision binary 0.87
recall binary 0.96
f_meas binary 0.91
Code
plotter(model4)

Confusion matrix for model 4

5.1.5 Model 5: Model without the variable flag_appels_depuis_international

Removing the variable flag_appels_depuis_international simplifies the model further with, at most, marginal changes to the metrics. Note that the removal of this variable does not make the model worse off; we prefer this model because it does not carry redundant variables.

Code
model5 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        - flag_appels_vers_international,
                        - flag_appels_depuis_international,
                        -age,
                        -duration,
                        - committed
                      ), 
                  family = "binomial")


model5_pred <- predict(model5, newdata = telcom_one, 
                       type = "response")

classes5 <- tibble(prob = model5_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))

Let us see the metrics.

Code
metrics(model5) %>% gt()
.metric .estimator .estimate
accuracy binary 0.85
kap binary 0.39
sens binary 0.96
spec binary 0.37
ppv binary 0.87
npv binary 0.66
mcc binary 0.42
j_index binary 0.33
bal_accuracy binary 0.66
detection_prevalence binary 0.90
precision binary 0.87
recall binary 0.96
f_meas binary 0.91
Code
plotter(model5)

Confusion matrix for model 5

5.1.6 Model 6: Including the age variable

Here I add the age column, the difference between the current date and the date of birth, measured in days. I postulate that older clients are less likely to leave than younger ones. The reported metrics barely move at this precision, but age is a theoretically sensible addition to the model.
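
A quick sketch of the postulate: comparing the average age, converted from days to years, across churn status.

Code
## Mean age in years by churn status in the training data
telcom_two %>% 
        group_by(flag_resiliation) %>% 
        summarise(mean_age_years = mean(age, na.rm = TRUE) / 365.25)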

Code
model6 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        - flag_appels_vers_international,
                        - flag_appels_depuis_international,
                        -duration,
                        -committed
                      ), 
                  family = "binomial")


model6_pred <- predict(model6, newdata = telcom_one, 
                       type = "response")

classes6 <- tibble(prob = model6_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))
Code
metrics(model6) %>% gt()
.metric .estimator .estimate
accuracy binary 0.85
kap binary 0.39
sens binary 0.96
spec binary 0.37
ppv binary 0.87
npv binary 0.66
mcc binary 0.42
j_index binary 0.33
bal_accuracy binary 0.66
detection_prevalence binary 0.90
precision binary 0.87
recall binary 0.96
f_meas binary 0.91
Code
plotter(model = model6)

Confusion matrix for model 6

5.1.7 Model 7: Including the duration variable

In this model, I include duration, the number of days the client has been with the company. We postulate that a client who has stayed longer will find it harder to leave, having built up a list of contacts with whom they communicate often on their line. Duration is statistically significant, and specificity edges up from 0.37 to 0.38 while sensitivity stays at the same level. We shall nonetheless consider removing the duration variable later to keep the specification lean.

Code
model7 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        - flag_appels_vers_international,
                        - flag_appels_depuis_international,
                        -committed
                      ), 
                  family = "binomial")


model7_pred <- predict(model7, newdata = telcom_one, 
                       type = "response")

classes7 <- tibble(prob = model7_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))
Code
metrics(model7) %>% gt()
.metric .estimator .estimate
accuracy binary 0.85
kap binary 0.41
sens binary 0.96
spec binary 0.38
ppv binary 0.88
npv binary 0.67
mcc binary 0.43
j_index binary 0.34
bal_accuracy binary 0.67
detection_prevalence binary 0.90
precision binary 0.88
recall binary 0.96
f_meas binary 0.91
Code
plotter(model7)

Confusion matrix for model 7

5.1.8 Model 8: Add the committed variable

We include the committed variable in this model. Several metrics improve: specificity rises from 0.38 to 0.44 and kappa from 0.41 to 0.46. Commitment is a statistically significant factor in explaining churn, so we retain the variable; the sketch below illustrates the raw relationship.
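
A brief sketch of the underlying pattern: the raw churn rate by commitment status in the training data.

Code
## Churn rate within each commitment group
telcom_two %>% 
        group_by(committed) %>% 
        summarise(churn_rate = mean(flag_resiliation),
                  n = n())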

Code
model8 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        - flag_appels_vers_international,
                        - flag_appels_depuis_international
                      ), 
                  family = "binomial")

model8_pred <- predict(model8, newdata = telcom_one, 
                       type = "response")

classes8 <- tibble(prob = model8_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))
Code
metrics(model8) %>% gt()
.metric .estimator .estimate
accuracy binary 0.86
kap binary 0.46
sens binary 0.95
spec binary 0.44
ppv binary 0.89
npv binary 0.68
mcc binary 0.47
j_index binary 0.39
bal_accuracy binary 0.70
detection_prevalence binary 0.88
precision binary 0.89
recall binary 0.95
f_meas binary 0.92
Code
plotter(model8)

Confusion matrix for model 8

5.1.9 Model 9: Remove revenu_moyen_ville variable

I remove yet another statistically insignificant variable, revenu_moyen_ville. The removal of this variable simplifies the model without degrading its metrics.

Code
model9 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        - flag_appels_vers_international,
                        - flag_appels_depuis_international,
                        - revenu_moyen_ville,
                        -duration
                      ), 
                  family = "binomial")


model9_pred <- predict(model9, newdata = telcom_one, 
                       type = "response")

classes9 <- tibble(prob = model9_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))
Code
metrics(model9) %>% gt()
.metric .estimator .estimate
accuracy binary 0.86
kap binary 0.46
sens binary 0.95
spec binary 0.44
ppv binary 0.89
npv binary 0.68
mcc binary 0.47
j_index binary 0.39
bal_accuracy binary 0.70
detection_prevalence binary 0.88
precision binary 0.89
recall binary 0.95
f_meas binary 0.92
Code
plotter(model9)

Confusion matrix for model 9

5.1.10 Model 10: Remove the variable nb_migrations

In this final model, I remove yet another variable, nb_migrations. Here the model stabilizes, as seen in the metrics and the confusion matrix plot below.

Code
model10 <- glm(factor(flag_resiliation) ~ ., 
              data = telcom_two %>% 
                      dplyr::select(
                        - sexe,
                        - mode_paiement,
                        - flag_appels_vers_international,
                        - flag_appels_depuis_international,
                        - revenu_moyen_ville,
                        - nb_migrations
                      ), 
                  family = "binomial")


model10_pred <- predict(model10, 
                        newdata = telcom_one, 
                       type = "response")

classes10 <- tibble(prob = model10_pred) %>% 
        mutate(class = case_when(
                prob >= 0.5 ~ 1,
                .default = 0
        ))
Code
metrics(model10) %>% gt()
.metric .estimator .estimate
accuracy binary 0.86
kap binary 0.46
sens binary 0.95
spec binary 0.44
ppv binary 0.89
npv binary 0.68
mcc binary 0.47
j_index binary 0.39
bal_accuracy binary 0.70
detection_prevalence binary 0.88
precision binary 0.89
recall binary 0.95
f_meas binary 0.92
Code
plotter(model10)

Confusion matrix for model 10

6 Targeted People

I use model 10 to pick the list of 2,000 people to be reached immediately. The model has the fewest variables yet loses no power relative to the models developed previously. Scoring this model out of sample flags the following 2,261 people, from whom the 2,000 contacts can be drawn; a sketch of the ranking step follows the count below.

Code
classes10 %>% 
        count(class) %>% 
        gt()
class n
0 20267
1 2261
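
Since the flagged list exceeds the 2,000-person budget, here is a hedged sketch of one way to trim it: rank the flagged clients by predicted churn probability and keep the top 2,000.

Code
## Keep the 2,000 flagged clients with the highest predicted churn probability
top_2000 <- classes10 %>% 
        bind_cols(telcom_one) %>% 
        dplyr::filter(class == 1) %>% 
        arrange(desc(prob)) %>% 
        slice_head(n = 2000) %>% 
        dplyr::select(id_client, prob)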

Let us view the first 10 people by ID. The full listing is in a .txt file named C4_Name.txt.

Code
classes10 %>% 
        bind_cols(telcom_one) %>% 
        dplyr::filter(class == 1) %>% 
        dplyr::select(id_client) %>% 
        head(10) %>% 
        gt()
id_client
ID_460929730842
ID_257426459183
ID_479882497610
ID_584332167071
ID_421830917719
ID_214259825048
ID_333542594422
ID_975317711232
ID_550935426471
ID_208791053579
Code
# classes10 %>% 
#         bind_cols(telcom_one) %>%
#         dplyr::filter(class == 1) %>% 
#         dplyr::select(id_client) %>% 
#         write_csv('C4_Name.txt')

7 Conclusion

In conclusion, the aim of this analysis is to enhance the effectiveness of a telecommunications company’s anti-churn campaign. Ten logistic regression specifications were fitted to identify the 2,000 consumers deemed crucial for urgent contact in the anti-churn initiative. The final model flags a total of 2,261 clients for targeted intervention in the anti-churn campaign.

References

Boateng, Ernest Yeboah, and Daniel A Abaye. 2019. “A Review of the Logistic Regression Model with Emphasis on Medical Research.” Journal of Data Analysis and Information Processing 7 (4): 190–207.
Das, Abhik. 2021. “Logistic Regression.” In Encyclopedia of Quality of Life and Well-Being Research, 1–2. Springer.