Telco Customer Churn

Kaggle Competition

1 Description

This project attempts to investigate the Kaggle Telco Customer Churn dataset. The investigation conducted the EDA, the Chi-Squared Test and H2O’s AutoML to build the suitable model predicting customer churn.

2 Data Overview

2.1 About the data

The dataset from Telco consists of 7,043 records with twenty attributes divided into two categories: customer demographic data and information related to their wireless accounts. The demographic features include the customer’s gender, whether they have a partner, or dependents, and 65 years or older. The features related to their account information include how long the customer has been with Telco, their monthly and total charges, the contract each customer carries (month-to-month, one year, or two years), and the type of phone, internet, and TV services they have. Our target variable for this study is Churn, a binary indicator that represents whether or not the customer left within the last month.

library(tidyverse)
library(readr)
library(stringr)
data_raw = read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
data_raw %>% glimpse

## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

2.2 Fields & formats

sapply(data_raw, function(x) sum(is.na (x)))

##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

There were 11 observations with missing TotalCharges data. Since it is a fairly small amount, these observations will be removed prior to beginning the analysis, leaving 7,032 customers in the data set. Also, We would need to drop the customerID. We would then divide the features into binary, categorical, or numerical variables.

data = data_raw %>% drop_na %>% select(-customerID)

binary = sapply(data %>% 
                  select(-Churn), function(x) n_distinct(x)) %>%
  sort %>%
  .[. == 2] %>%
  names

categorical = sapply(data %>% 
                       select(-Churn), function(x) n_distinct(x)) %>%
  sort %>%
  .[. <= 50 & . > 2] %>%
  names

numerical = sapply(data %>% 
                     select(-Churn), function(x) n_distinct(x)) %>%
  sort %>%
  .[. > 50] %>%
  names

paste("The Number of binary, categorical and numerical variables are ", length(binary), ", ", length(categorical), ", ", length(numerical),sep = "")

## [1] "The Number of binary, categorical and numerical variables are 6, 10, 3"

library(plotly)
num_hist = map(numerical,
    ~data %>%
    plot_ly(x = ~.data[[.x]], name = .x) %>%
    add_histogram() %>%
    layout(title = paste('Histogram of', .x), 
           plot_bgcolor = "#e5ecf6",
           xaxis = list(title = .x),
           yaxis = list(title = "Count")))

subplot(num_hist) %>% layout(title = "Histogram of Numerical Variables")

It seems from the freq histograms that the large cluster of customers have high total charges (2904 customers). While a large cluster of customers have a low teneure count (1970 customers), Monthly charges seem to be less skewed.

library(aplot)
cat_bar = map(categorical,
    ~data %>%
    count(.data[[.x]]) %>%
    plot_ly(x = ~.data[[.x]], 
            y = ~n, 
            name = .x) %>% 
    add_bars()   %>%
    layout(plot_bgcolor = "#e5ecf6", 
           xaxis = list(title = .x),
           yaxis = list(title = "Count")))

subplot(cat_bar, nrows = 5, margin = 0.05) %>% layout(title = "Barcharts of Categorical Variables")

Additionally, several of the Yes/No categorical variables contained an additional group indicating that the customer had no phone or internet service. These were recorded and combined with the value No.

data <- data %>%
  mutate_at(binary, as.factor) %>%
  mutate_at(c("MultipleLines",
              "OnlineSecurity",
              "OnlineBackup",
              "DeviceProtection",
              "TechSupport",
              "StreamingTV",
              "StreamingMovies"), 
            ~case_when(. == "Yes" ~ "Yes", TRUE ~ "No")) %>%
  mutate_if(is.character, as.factor) %>%
  select(Churn,is.factor,everything())

data %>% glimpse

## Rows: 7,032
## Columns: 20
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…

3 Exploratory Data Analysis

3.1 Tenure & charges

library(scales)
ggplot(data, aes(x = fct_rev(Churn), y = tenure, fill = fct_rev(Churn))) +
  geom_bar(stat = "summary", fun = "mean", alpha = 0.6, color = "grey20", show.legend = F) +
  stat_summary(aes(label = paste(round(..y.., 0), "months")), fun = mean, 
               geom = "text", size = 3.5, vjust = -0.5) +
  labs(title = "Average Customer Tenure \n", x = "", y = "Customer Tenure\n") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data, aes(x = fct_rev(Churn), y = MonthlyCharges, fill = fct_rev(Churn))) +
  geom_bar(stat = "summary", fun = "mean", alpha = 0.6, color = "grey20", show.legend = F) +
  stat_summary(aes(label = dollar(..y..)), fun = mean, 
               geom = "text", size = 3.5, vjust = -0.5) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Average Monthly Charges \n", x = "", y = "Monthly Charges \n") +
  theme(plot.title = element_text(hjust = 0.5))

In the graphs above, Telco’s current and churned customers’ average tenure and their monthly fees are shown. Clients that left Telco kept their services for roughly 18 months, while current customers have been with the company for slightly over 3 years.

ggplot(data, aes(x = Contract, y = MonthlyCharges, fill = fct_rev(Churn))) +
  geom_bar(position = "dodge", stat = "summary", fun = "mean", alpha = 0.6, color = "grey20") +
  stat_summary(aes(label = dollar(..y..)), fun = mean, 
               geom = "text", size = 3.5, vjust = -0.5,
               position = position_dodge(width = 0.9)) +
  coord_cartesian(ylim = c(0, 95)) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "\nAverage Monthly Charges by Contract Type", x = "\n Contract Type", 
       y = "Monthly Charges \n", fill = "") +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "top", legend.justification = "left")

Additionally, attrited customers had higher monthly charges on average by about $10. This holds true across each contract type.

3.2 Type of account services

ggplot(data, aes(x = Contract, group = fct_rev(Churn))) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
           alpha = 0.6, color = "grey20", show.legend = F) +
  geom_text(aes(label = percent(..prop..), y = ..prop.. ), 
            size = 4, stat = "count", vjust = -0.5) +
  facet_grid(~fct_rev(Churn)) +
  scale_y_continuous(labels = percent_format()) +
  coord_cartesian(ylim = c(0, .95)) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Customer Churn by Contract Type\n", x = "\n Contract Type", y = "") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data, aes(x = InternetService, group = fct_rev(Churn))) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
           alpha = 0.6, color = "grey20", show.legend = F) +
  geom_text(aes(label = percent(..prop..), y = ..prop.. ), 
            size = 4, stat = "count", vjust = -0.5) +
  facet_grid(~fct_rev(Churn)) +
  scale_y_continuous(labels = percent_format()) +
  coord_cartesian(ylim = c(0, .9)) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "\n Customer Churn by Internet Service \n", x = "\n Internet Service", y = "") +
  theme(plot.title = element_text(hjust = 0.5))

Nearly 89% of former customers were on month-to-month contracts, with a much smaller proportion on one or two-year contracts. Of customers who left, 70% had Fiber Optic internet. This could be an indicator of potential dissatisfaction with the service.

3.3 Customer Demographics

ggplot(data, aes(x = fct_rev(ifelse(SeniorCitizen==1, "Yes", "No")), group = Churn)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
           alpha = 0.6, color = "grey20", show.legend = F) +
  geom_text(aes(label = percent(..prop.., accuracy = 0.1), y = ..prop..), 
            size = 4, stat = "count", position = position_stack(vjust = 0.5)) +
  facet_grid(~fct_rev(Churn)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  coord_cartesian(ylim = c(0, .9)) +
  labs(x = "\n Senior Citizen", y = "")

Based on the demographic attributes of Telco’s customers, about a quarter of those who left were senior citizens.

ggplot(data, aes(x = gender, group = Churn)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
           alpha = 0.6, color = "grey20", show.legend = F) +
  geom_text(aes(label = percent(..prop.., accuracy = 0.1), y = ..prop..), 
            size = 4, stat = "count", position = position_stack(vjust = 0.5)) +
  facet_grid(~fct_rev(Churn)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  coord_cartesian(ylim = c(0, .6)) +
  labs(x = "\n Gender", y = "")

The distribution of gender is proportional in both current and former customers, with an approximately equal number of men and women indicating the insignificance of the feature.

3.4 Correlations & Distribution

library(GGally)
g = data %>% 
  select(tenure, MonthlyCharges, TotalCharges, Churn) %>%
  ggpairs(aes(color = fct_rev(Churn)), title = "Customer Account Distributions and Correlations \n",
          columnLabels = c("Tenure", "Monthly Charges", "Total Charges", "Churn"),
          upper = list(combo = wrap("box_no_facet", alpha = 0.7)),
          diag = list(continuous = wrap("densityDiag", alpha = 0.6), 
                      discrete = wrap("barDiag", alpha = 0.7, color = "grey30")),
          lower = list(combo = wrap("box_no_facet", alpha = 0.7), continuous = wrap("smooth", alpha = 0.15))) 

ggplotly(g)

The correlations between our numeric variables show that TotalCharges is strongly correlated with customer tenure, especially among customers who left (Churn = Yes), with a correlation of more than 0.95. The histogram of MonthlyCharges has a multimodal shape, while the distribution of customer tenure is relatively uniform among current customers but skewed to the right in customers who left.

4 Train/Valid/Test sets

Since the contest period did not provide the testing set, we need to split the data into train and test on the 70-30 ratio. We then further split the train into train and valid with the note that the validation data has exactly the same number of observations in test data

library(caret)
set.seed(1)
id <- createDataPartition(y = data %>% pull(Churn), p = 0.7, list = FALSE)
train_valid <- data[id, ] 
df_test <- data[-id, ] # Test data.  
true_test_labels <- as.factor(df_test$Churn)
set.seed(1)
id_new <- createDataPartition(y = train_valid %>% pull(Churn), p = nrow(df_test) / nrow(train_valid), list = FALSE)
df_train <- train_valid[-id_new, ]
true_train_labels <- as.factor(df_train$Churn)
df_valid <- train_valid[id_new, ]
true_valid_labels <- as.factor(df_valid$Churn)

5 Feature Selection

We will use the Chi-Squared Test for feature selection. This test evaluates the association between two categorical variables. The null hypothesis for this test is that there is no relationship between our response variable and the categorical feature, and the alternative hypothesis is that there is a relationship.

library(broom)
cat.var <- train_valid %>% select_if(is.factor) %>% select(-Churn)
chi <- map(cat.var, ~chisq.test(train_valid$Churn, .x))

chi_fs = do.call(rbind, lapply(chi, tidy)) %>%
  mutate(variable = cat.var %>% colnames) %>%
  arrange(p.value) %>%
  mutate_at(c(1,2), funs(round(., 3))) %>% select(variable, p.value)

chi_fs

## # A tibble: 16 × 2
##    variable         p.value
##    <chr>              <dbl>
##  1 Contract           0    
##  2 InternetService    0    
##  3 PaymentMethod      0    
##  4 PaperlessBilling   0    
##  5 TechSupport        0    
##  6 OnlineSecurity     0    
##  7 Dependents         0    
##  8 Partner            0    
##  9 SeniorCitizen      0    
## 10 OnlineBackup       0    
## 11 StreamingTV        0    
## 12 DeviceProtection   0    
## 13 StreamingMovies    0    
## 14 MultipleLines      0.001
## 15 gender             0.239
## 16 PhoneService       0.648

Looking at the results of the tests, Gender and PhoneService have very small chi-squared statistics and p-values that are greater than the significance threshold, $a$, of 0.05, indicating they are independent of our target variable. The rest of the categorical features do have a statistically significant association with customer churn.

var_selected = chi_fs %>%
  filter(p.value <= 0.05) %>% select(variable) %>% pull

paste("We select", length(var_selected),"categorical variables:",paste(var_selected,collapse=", "))

## [1] "We select 14 categorical variables: Contract, InternetService, PaymentMethod, PaperlessBilling, TechSupport, OnlineSecurity, Dependents, Partner, SeniorCitizen, OnlineBackup, StreamingTV, DeviceProtection, StreamingMovies, MultipleLines"

6 Automated Machine Learning

We will score the leaderboard using the validation dataset using 4-fold cross-validation. The metric used for early stopping is AUC for this classification problem. The AutoML will be stopped if the simple moving average does not improve for 10 (stopping_rounds) scoring events or if the relative improvement is not at least 0.025. The maximum run time would be 1 hour.

To achieve the leader model, we will use the valid dataset to obtain the appropriate threshold, which will then be used to evaluate the classification report in the test set.

6.1 Train AutoML

library(h2o)
h2o.init(nthreads = 20, max_mem_size = "32g")

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 days 3 hours 
##     H2O cluster timezone:       Asia/Ho_Chi_Minh 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.38.0.1 
##     H2O cluster version age:    2 months and 4 days  
##     H2O cluster name:           H2O_started_from_R_macos_sbv884 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.96 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.2.2 (2022-10-31)

h2o.no_progress()
train_h2o <- as.h2o(df_train)
valid_h2o <- as.h2o(df_valid)
test_h2o <- as.h2o(df_test)

# Train Auto Machine Learning:
start <- Sys.time ()
autoML <- h2o.automl(x = var_selected,
                     y = "Churn",
                     training_frame = train_h2o,
                     nfolds = 4,
                     stopping_metric = "AUC",
                     stopping_rounds = 10,
                     stopping_tolerance = 0.025,
                     max_models = 15,
                     max_runtime_secs = 60*60,
                     seed = 1,
                     sort_metric = "AUC")
run = Sys.time () - start

autoML@leader@model$model_summary

## GLM Model: summary
##     family  link              regularization
## 1 binomial logit Ridge ( lambda = 0.001775 )
##                                                                    lambda_search
## 1 nlambda = 30, lambda.max = 9.4055, lambda.min = 0.001775, lambda.1se = 0.01194
##   number_of_predictors_total number_of_active_predictors number_of_iterations
## 1                         32                          32                   38
##                                          training_frame
## 1 AutoML_32_20221124_42256_training_df_train_sid_bf90_1

h2o.performance(autoML@leader, test_h2o)@metrics$AUC

## [1] 0.8385792

The model took 1 seconds, and produce the leader GLM with the corresponding AUC 0.84.

Furthermore, we can observe the feature importance of this leader. This yields immediate insights on how the companies should act given the data.

h2o.varimp_plot(autoML@leader)

We may further observe the AUC of other models on the leaderboard.

df_results <- autoML@leaderboard %>% as.data.frame() %>% 
  select(model_id, auc) %>% 
    mutate(Rank = 1:nrow(.), auc = round(auc, 4)) %>% 
  rename(AUC_Val = auc)

df_results

##                                                   model_id AUC_Val Rank
## 1                           GLM_1_AutoML_32_20221124_42256  0.8268    1
## 2     StackedEnsemble_AllModels_1_AutoML_32_20221124_42256  0.8254    2
## 3  StackedEnsemble_BestOfFamily_1_AutoML_32_20221124_42256  0.8242    3
## 4                           GBM_1_AutoML_32_20221124_42256  0.8214    4
## 5              GBM_grid_1_AutoML_32_20221124_42256_model_1  0.8213    5
## 6                           XRT_1_AutoML_32_20221124_42256  0.8144    6
## 7                  DeepLearning_1_AutoML_32_20221124_42256  0.8144    7
## 8                           DRF_1_AutoML_32_20221124_42256  0.8046    8
## 9                       XGBoost_1_AutoML_32_20221124_42256  0.8011    9
## 10                          GBM_5_AutoML_32_20221124_42256  0.7979   10
## 11                          GBM_4_AutoML_32_20221124_42256  0.7966   11
## 12                          GBM_2_AutoML_32_20221124_42256  0.7965   12
## 13         XGBoost_grid_1_AutoML_32_20221124_42256_model_1  0.7953   13
## 14                          GBM_3_AutoML_32_20221124_42256  0.7950   14
## 15         XGBoost_grid_1_AutoML_32_20221124_42256_model_2  0.7946   15
## 16                      XGBoost_3_AutoML_32_20221124_42256  0.7833   16
## 17                      XGBoost_2_AutoML_32_20221124_42256  0.7828   17

For this dataset, GLM wins the best model. We may additionally investigate this leader when it comes to the valid set.

6.2 AUC on valid data

getAUC_onValidData <- function(i) {
    
    # Extract i-th model:
    best_ith <- h2o.getModel(autoML@leaderboard[i, 1])
    
    # Model performance by ith model by AUC on Test data:
    metrics_ith <- h2o.performance(model = best_ith, newdata = valid_h2o)
    
    # Return output:
    return(data.frame(AUC_Test = metrics_ith@metrics$AUC, model_id = best_ith@model_id))
    
}

# Calculate AUC for the leaderboard:
auc_on_validData <- map(1:nrow(df_results), getAUC_onValidData) %>% bind_rows() %>% arrange(desc(AUC_Test))

auc_on_validData

##     AUC_Test                                                model_id
## 1  0.8048307 StackedEnsemble_BestOfFamily_1_AutoML_32_20221124_42256
## 2  0.8047305                          GLM_1_AutoML_32_20221124_42256
## 3  0.8045094    StackedEnsemble_AllModels_1_AutoML_32_20221124_42256
## 4  0.7989327                          GBM_1_AutoML_32_20221124_42256
## 5  0.7967863             GBM_grid_1_AutoML_32_20221124_42256_model_1
## 6  0.7920634                 DeepLearning_1_AutoML_32_20221124_42256
## 7  0.7898778                          XRT_1_AutoML_32_20221124_42256
## 8  0.7800514                          DRF_1_AutoML_32_20221124_42256
## 9  0.7700287                      XGBoost_1_AutoML_32_20221124_42256
## 10 0.7695946                          GBM_4_AutoML_32_20221124_42256
## 11 0.7686820                          GBM_2_AutoML_32_20221124_42256
## 12 0.7681420                          GBM_5_AutoML_32_20221124_42256
## 13 0.7679859                          GBM_3_AutoML_32_20221124_42256
## 14 0.7679577         XGBoost_grid_1_AutoML_32_20221124_42256_model_1
## 15 0.7655638                      XGBoost_2_AutoML_32_20221124_42256
## 16 0.7651360         XGBoost_grid_1_AutoML_32_20221124_42256_model_2
## 17 0.7507634                      XGBoost_3_AutoML_32_20221124_42256

The top performing model on the leaderboard for the valid set is still GLM so we will use this for predicting the customer churn in the testing set.

6.3 Use model for prediction

6.3.1 Maximizing Accuracy

library(ConfusionTableR)
thres_max_accuracy = h2o.performance(autoML@leader, valid_h2o)@metrics$max_criteria_and_metric_scores %>% 
  filter(metric == "max accuracy") %>% 
  pull(threshold)

prob_h20 <- h2o.predict(autoML@leader, test_h2o) %>%
  as.data.frame() %>%
  pull(Yes)
pred_labels = ifelse(prob_h20 >= thres_max_accuracy, "Yes", "No") %>% as.factor
df_cm_max_accuracy = binary_class_cm(pred_labels,true_test_labels)

## [INFO] Building a record level confusion matrix to store in dataset

## [INFO] Build finished and to expose record level cm use the record_level_cm list item

confusionMatrix(data = pred_labels,true_test_labels)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1417  318
##        Yes  131  242
##                                           
##                Accuracy : 0.787           
##                  95% CI : (0.7689, 0.8043)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 1.252e-08       
##                                           
##                   Kappa : 0.389           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9154          
##             Specificity : 0.4321          
##          Pos Pred Value : 0.8167          
##          Neg Pred Value : 0.6488          
##              Prevalence : 0.7343          
##          Detection Rate : 0.6722          
##    Detection Prevalence : 0.8231          
##       Balanced Accuracy : 0.6738          
##                                           
##        'Positive' Class : No              
##

We will not choose the threshold maximizing accuracy as the testing Specificity is less than $0.5$.

6.3.2 Youden Method

library(pROC)
prob_h20_valid <- h2o.predict(autoML@leader, valid_h2o) %>%
  as.data.frame() %>%
  pull(Yes)

thres_youden = coords(roc(true_valid_labels, prob_h20_valid), "best", best.method="youden")$threshold
pred_labels = ifelse(prob_h20_valid >= thres_youden, "Yes", "No") %>% as.factor

library(ConfusionTableR)
prob_h20 <- h2o.predict(autoML@leader, test_h2o) %>%
  as.data.frame() %>%
  pull(Yes)
pred_labels = ifelse(prob_h20 >= thres_youden, "Yes", "No") %>% as.factor
df_cm_youden = binary_class_cm(pred_labels,true_test_labels)
confusionMatrix(data = pred_labels,true_test_labels)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1160  129
##        Yes  388  431
##                                          
##                Accuracy : 0.7547         
##                  95% CI : (0.7358, 0.773)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 0.01745        
##                                          
##                   Kappa : 0.4522         
##                                          
##  Mcnemar's Test P-Value : < 2e-16        
##                                          
##             Sensitivity : 0.7494         
##             Specificity : 0.7696         
##          Pos Pred Value : 0.8999         
##          Neg Pred Value : 0.5263         
##              Prevalence : 0.7343         
##          Detection Rate : 0.5503         
##    Detection Prevalence : 0.6115         
##       Balanced Accuracy : 0.7595         
##                                          
##        'Positive' Class : No             
##

The $\text{p-value}$ for the NIR test is greater than $0.01$ so we cannot reject the null hypothesis $Acc = NIR$ meaning the model is only good when we accept the confidence level $0.05$.

6.3.3 Maximizing F1

thres_max_f1 = h2o.performance(autoML@leader, valid_h2o)@metrics$max_criteria_and_metric_scores %>% 
  filter(metric == "max f1") %>% 
  pull(threshold)

library(ConfusionTableR)
prob_h20 <- h2o.predict(autoML@leader, test_h2o) %>%
  as.data.frame() %>%
  pull(Yes)
pred_labels = ifelse(prob_h20 >= thres_max_f1, "Yes", "No") %>% as.factor
df_cm_max_f1 = binary_class_cm(pred_labels,true_test_labels)

## [INFO] Building a record level confusion matrix to store in dataset

## [INFO] Build finished and to expose record level cm use the record_level_cm list item

confusionMatrix(data = pred_labels,true_test_labels)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1173  134
##        Yes  375  426
##                                           
##                Accuracy : 0.7585          
##                  95% CI : (0.7397, 0.7767)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 0.006012        
##                                           
##                   Kappa : 0.4559          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7578          
##             Specificity : 0.7607          
##          Pos Pred Value : 0.8975          
##          Neg Pred Value : 0.5318          
##              Prevalence : 0.7343          
##          Detection Rate : 0.5565          
##    Detection Prevalence : 0.6200          
##       Balanced Accuracy : 0.7592          
##                                           
##        'Positive' Class : No              
##

The $\text{p-value}$ for the NIR test is less than $0.01$ so we reject the null hypothesis $Acc = NIR$ implying the model is strong.

rbind(df_cm_max_accuracy[[2]],df_cm_youden[[2]],df_cm_max_f1[[2]]) %>% 
  mutate(approach = c("Max Accuracy","Youden", "Max F1")) %>% 
  select(approach, Accuracy, Sensitivity, Specificity, Precision,F1) %>%
  t %>% 
  data.frame() %>% .[-1,] %>% rownames_to_column("metrics") %>% select(metrics, "Max Accuracy" = X1, Youden = X2, "Max F1" = X3)

##       metrics Max Accuracy    Youden    Max F1
## 1    Accuracy    0.7870019 0.7547438 0.7585389
## 2 Sensitivity    0.9153747 0.7493540 0.7577519
## 3 Specificity    0.4321429 0.7696429 0.7607143
## 4   Precision    0.8167147 0.8999224 0.8974751
## 5          F1    0.8632348 0.8177652 0.8217163

As the numbers Sensitivity and Specificity are the most balanced among the 3 approaches, we will eventually use this approach for future data .

7 Summary

In the EDA section, we found that the most significant information is the customer’s monthly charges, the type of internet service and contract they have, and the length of time they have been customers with Telco. To proactively reduce their churn rate, Telco could target customers who are on month-to-month contracts, use fiber optic internet, have higher monthly charges on average, and have a shorter tenure of fewer than 18 months.

Using the Chi-Squared Test we can see some of the important predictors of customer attrition include Tenure, MonthlyCharges, InternetService, PaymentMethod, Contract, OnlineSecurity, TechSupport, and PaperlessBilling.

Using Chi-Squared Test for feature selection and AutoML, we have already got very satisfactory results pertaining to the GLM model. For future data, we can consider maximizing F1 to obtain the balance for Sensitivity and Specificity.