Telco Customer Churn
Kaggle Competition
1 Description
This project investigates the Kaggle Telco Customer Churn dataset. The analysis combines exploratory data analysis (EDA), Chi-Squared tests for feature selection, and H2O’s AutoML to build a suitable model for predicting customer churn.
2 Data Overview
2.1 About the data
The Telco dataset consists of 7,043 records with twenty attributes divided into two categories: customer demographics and information related to the customer’s account. The demographic features include the customer’s gender, whether they have a partner or dependents, and whether they are 65 years or older. The account-related features include how long the customer has been with Telco (tenure), their monthly and total charges, the contract each customer carries (month-to-month, one year, or two years), and the type of phone, internet, and TV services they have. Our target variable for this study is Churn, a binary indicator of whether or not the customer left within the last month.
library(tidyverse)
library(readr)
library(stringr)
data_raw = read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
data_raw %>% glimpse
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
2.2 Fields & formats
sapply(data_raw, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
There are 11 observations with missing TotalCharges. Since this is a fairly small number, these observations are removed before the analysis begins, leaving 7,032 customers in the data set. We also drop customerID, and then divide the remaining features into binary, categorical, and numerical variables.
data = data_raw %>% drop_na %>% select(-customerID)
binary = sapply(data %>%
select(-Churn), function(x) n_distinct(x)) %>%
sort %>%
.[. == 2] %>%
names
categorical = sapply(data %>%
select(-Churn), function(x) n_distinct(x)) %>%
sort %>%
.[. <= 50 & . > 2] %>%
names
numerical = sapply(data %>%
select(-Churn), function(x) n_distinct(x)) %>%
sort %>%
.[. > 50] %>%
names
paste("The Number of binary, categorical and numerical variables are ", length(binary), ", ", length(categorical), ", ", length(numerical),sep = "")## [1] "The Number of binary, categorical and numerical variables are 6, 10, 3"
library(plotly)
num_hist = map(numerical,
~data %>%
plot_ly(x = ~.data[[.x]], name = .x) %>%
add_histogram() %>%
layout(title = paste('Histogram of', .x),
plot_bgcolor = "#e5ecf6",
xaxis = list(title = .x),
yaxis = list(title = "Count")))
subplot(num_hist) %>% layout(title = "Histogram of Numerical Variables")
The frequency histograms show a large cluster of customers with low total charges (2,904 customers) and a large cluster with low tenure (1,970 customers), while monthly charges appear less skewed.
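As a rough check on this skewness, one can compare means and medians: for a right-skewed variable the mean exceeds the median, which is markedly the case for TotalCharges (output omitted):
# Compare medians and means of the numeric variables (output omitted):
data %>%
summarise(across(c(tenure, MonthlyCharges, TotalCharges),
list(median = median, mean = mean)))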
library(aplot)
cat_bar = map(categorical,
~data %>%
count(.data[[.x]]) %>%
plot_ly(x = ~.data[[.x]],
y = ~n,
name = .x) %>%
add_bars() %>%
layout(plot_bgcolor = "#e5ecf6",
xaxis = list(title = .x),
yaxis = list(title = "Count")))
subplot(cat_bar, nrows = 5, margin = 0.05) %>% layout(title = "Barcharts of Categorical Variables")
Additionally, several of the Yes/No categorical variables contained an extra level indicating that the customer had no phone or internet service. These were recoded and combined with the value No.
data <- data %>%
mutate_at(binary, as.factor) %>%
# Collapse "No phone service" / "No internet service" levels into "No":
mutate_at(c("MultipleLines",
"OnlineSecurity",
"OnlineBackup",
"DeviceProtection",
"TechSupport",
"StreamingTV",
"StreamingMovies"),
~case_when(. == "Yes" ~ "Yes", TRUE ~ "No")) %>%
mutate_if(is.character, as.factor) %>%
# Put Churn first, then the factor columns, then the numeric columns:
select(Churn, where(is.factor), everything())
data %>% glimpse
## Rows: 7,032
## Columns: 20
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
3 Exploratory Data Analysis
3.1 Tenure & charges
library(scales)
ggplot(data, aes(x = fct_rev(Churn), y = tenure, fill = fct_rev(Churn))) +
geom_bar(stat = "summary", fun = "mean", alpha = 0.6, color = "grey20", show.legend = F) +
stat_summary(aes(label = paste(round(..y.., 0), "months")), fun = mean,
geom = "text", size = 3.5, vjust = -0.5) +
labs(title = "Average Customer Tenure \n", x = "", y = "Customer Tenure\n") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data, aes(x = fct_rev(Churn), y = MonthlyCharges, fill = fct_rev(Churn))) +
geom_bar(stat = "summary", fun = "mean", alpha = 0.6, color = "grey20", show.legend = F) +
stat_summary(aes(label = dollar(..y..)), fun = mean,
geom = "text", size = 3.5, vjust = -0.5) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Average Monthly Charges \n", x = "", y = "Monthly Charges \n") +
theme(plot.title = element_text(hjust = 0.5))
The graphs above show the average tenure and average monthly charges for Telco’s current and churned customers. Clients who left Telco kept their services for roughly 18 months, while current customers have been with the company for slightly over three years.
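These plotted averages can also be confirmed numerically (output omitted):
# Average tenure and monthly charges by churn status (output omitted):
data %>%
group_by(Churn) %>%
summarise(avg_tenure = mean(tenure), avg_monthly_charges = mean(MonthlyCharges))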
ggplot(data, aes(x = Contract, y = MonthlyCharges, fill = fct_rev(Churn))) +
geom_bar(position = "dodge", stat = "summary", fun = "mean", alpha = 0.6, color = "grey20") +
stat_summary(aes(label = dollar(..y..)), fun = mean,
geom = "text", size = 3.5, vjust = -0.5,
position = position_dodge(width = 0.9)) +
coord_cartesian(ylim = c(0, 95)) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "\nAverage Monthly Charges by Contract Type", x = "\n Contract Type",
y = "Monthly Charges \n", fill = "") +
theme(plot.title = element_text(hjust = 0.5), legend.position = "top", legend.justification = "left")
Additionally, churned customers had monthly charges that were higher by about $10 on average, and this holds true across each contract type.
3.2 Type of account services
ggplot(data, aes(x = Contract, group = fct_rev(Churn))) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
alpha = 0.6, color = "grey20", show.legend = F) +
geom_text(aes(label = percent(..prop..), y = ..prop.. ),
size = 4, stat = "count", vjust = -0.5) +
facet_grid(~fct_rev(Churn)) +
scale_y_continuous(labels = percent_format()) +
coord_cartesian(ylim = c(0, .95)) +
scale_fill_brewer(palette = "Paired") +
labs(title = "Customer Churn by Contract Type\n", x = "\n Contract Type", y = "") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data, aes(x = InternetService, group = fct_rev(Churn))) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
alpha = 0.6, color = "grey20", show.legend = F) +
geom_text(aes(label = percent(..prop..), y = ..prop.. ),
size = 4, stat = "count", vjust = -0.5) +
facet_grid(~fct_rev(Churn)) +
scale_y_continuous(labels = percent_format()) +
coord_cartesian(ylim = c(0, .9)) +
scale_fill_brewer(palette = "Paired") +
labs(title = "\n Customer Churn by Internet Service \n", x = "\n Internet Service", y = "") +
theme(plot.title = element_text(hjust = 0.5))
Nearly 89% of former customers were on month-to-month contracts, with a much smaller proportion on one or two-year contracts. Of customers who left, 70% had Fiber Optic internet. This could be an indicator of potential dissatisfaction with the service.
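The quoted proportions can be reproduced with a simple count over churned customers (output omitted):
# Proportions of churned customers by contract type and internet service (output omitted):
data %>% filter(Churn == "Yes") %>% count(Contract) %>% mutate(prop = n / sum(n))
data %>% filter(Churn == "Yes") %>% count(InternetService) %>% mutate(prop = n / sum(n))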
3.3 Customer Demographics
ggplot(data, aes(x = fct_rev(ifelse(SeniorCitizen==1, "Yes", "No")), group = Churn)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
alpha = 0.6, color = "grey20", show.legend = F) +
geom_text(aes(label = percent(..prop.., accuracy = 0.1), y = ..prop..),
size = 4, stat = "count", position = position_stack(vjust = 0.5)) +
facet_grid(~fct_rev(Churn)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
coord_cartesian(ylim = c(0, .9)) +
labs(x = "\n Senior Citizen", y = "")Based on the demographic attributes of Telco’s customers, about a quarter of those who left were senior citizens.
ggplot(data, aes(x = gender, group = Churn)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = "count",
alpha = 0.6, color = "grey20", show.legend = F) +
geom_text(aes(label = percent(..prop.., accuracy = 0.1), y = ..prop..),
size = 4, stat = "count", position = position_stack(vjust = 0.5)) +
facet_grid(~fct_rev(Churn)) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
coord_cartesian(ylim = c(0, .6)) +
labs(x = "\n Gender", y = "")The distribution of gender is proportional in both current and former customers, with an approximately equal number of men and women indicating the insignificance of the feature.
3.4 Correlations & Distribution
library(GGally)
g = data %>%
select(tenure, MonthlyCharges, TotalCharges, Churn) %>%
ggpairs(aes(color = fct_rev(Churn)), title = "Customer Account Distributions and Correlations \n",
columnLabels = c("Tenure", "Monthly Charges", "Total Charges", "Churn"),
upper = list(combo = wrap("box_no_facet", alpha = 0.7)),
diag = list(continuous = wrap("densityDiag", alpha = 0.6),
discrete = wrap("barDiag", alpha = 0.7, color = "grey30")),
lower = list(combo = wrap("box_no_facet", alpha = 0.7), continuous = wrap("smooth", alpha = 0.15)))
ggplotly(g)
The correlations between our numeric variables show that TotalCharges is strongly correlated with customer tenure, especially among customers who left (Churn = Yes), where the correlation exceeds 0.95. The histogram of MonthlyCharges has a multimodal shape, while the distribution of customer tenure is relatively uniform among current customers but skewed to the right among customers who left.
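The per-group correlation quoted above can be checked directly (output omitted):
# Correlation between tenure and TotalCharges within each churn group (output omitted):
data %>%
group_by(Churn) %>%
summarise(cor_tenure_totalcharges = cor(tenure, TotalCharges))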
4 Train/Valid/Test sets
Since no separate test set is provided, we split the data into training and test sets using a 70/30 ratio. We then further split the training portion into train and validation sets, sizing the validation set to contain approximately the same number of observations as the test set.
library(caret)
set.seed(1)
id <- createDataPartition(y = data %>% pull(Churn), p = 0.7, list = FALSE)
train_valid <- data[id, ]
df_test <- data[-id, ] # Test data.
true_test_labels <- as.factor(df_test$Churn)
set.seed(1)
id_new <- createDataPartition(y = train_valid %>% pull(Churn), p = nrow(df_test) / nrow(train_valid), list = FALSE)
df_train <- train_valid[-id_new, ]
true_train_labels <- as.factor(df_train$Churn)
df_valid <- train_valid[id_new, ]
true_valid_labels <- as.factor(df_valid$Churn)
5 Feature Selection
We will use the Chi-Squared Test for feature selection. This test evaluates the association between two categorical variables. The null hypothesis for this test is that there is no relationship between our response variable and the categorical feature, and the alternative hypothesis is that there is a relationship.
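As an illustration with a single feature, the test for Contract compares the observed contingency table of Contract versus Churn against the counts expected under independence (output omitted):
# Illustration of the Chi-Squared test for one feature (output omitted):
table(train_valid$Contract, train_valid$Churn)
chisq.test(train_valid$Churn, train_valid$Contract)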
library(broom)
cat.var <- train_valid %>% select_if(is.factor) %>% select(-Churn)
chi <- map(cat.var, ~chisq.test(train_valid$Churn, .x))
chi_fs = do.call(rbind, lapply(chi, tidy)) %>%
mutate(variable = cat.var %>% colnames) %>%
arrange(p.value) %>%
mutate_at(c(1, 2), ~round(., 3)) %>% select(variable, p.value)
chi_fs
## # A tibble: 16 × 2
## variable p.value
## <chr> <dbl>
## 1 Contract 0
## 2 InternetService 0
## 3 PaymentMethod 0
## 4 PaperlessBilling 0
## 5 TechSupport 0
## 6 OnlineSecurity 0
## 7 Dependents 0
## 8 Partner 0
## 9 SeniorCitizen 0
## 10 OnlineBackup 0
## 11 StreamingTV 0
## 12 DeviceProtection 0
## 13 StreamingMovies 0
## 14 MultipleLines 0.001
## 15 gender 0.239
## 16 PhoneService 0.648
Looking at the results of the tests, gender and PhoneService have very small chi-squared statistics and p-values greater than the significance threshold \(\alpha = 0.05\), so we fail to reject the null hypothesis that they are independent of our target variable. The remaining categorical features all show a statistically significant association with customer churn.
var_selected = chi_fs %>%
filter(p.value <= 0.05) %>% select(variable) %>% pull
paste("We select", length(var_selected),"categorical variables:",paste(var_selected,collapse=", "))## [1] "We select 14 categorical variables: Contract, InternetService, PaymentMethod, PaperlessBilling, TechSupport, OnlineSecurity, Dependents, Partner, SeniorCitizen, OnlineBackup, StreamingTV, DeviceProtection, StreamingMovies, MultipleLines"
6 Automated Machine Learning
We train AutoML with 4-fold cross-validation on the training set and sort the leaderboard by AUC, which is also the early-stopping metric for this classification problem. Training of an individual model stops if the simple moving average of the AUC does not improve for 10 scoring events (stopping_rounds) or if the relative improvement is less than 0.025 (stopping_tolerance). The maximum run time is 1 hour.
Once the leader model is selected, we use the validation set to choose an appropriate classification threshold, which is then used to produce the classification report on the test set.
6.1 Train AutoML
library(h2o)
h2o.init(nthreads = 20, max_mem_size = "32g")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 3 hours
## H2O cluster timezone: Asia/Ho_Chi_Minh
## H2O data parsing timezone: UTC
## H2O cluster version: 3.38.0.1
## H2O cluster version age: 2 months and 4 days
## H2O cluster name: H2O_started_from_R_macos_sbv884
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.96 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.2.2 (2022-10-31)
h2o.no_progress()
train_h2o <- as.h2o(df_train)
valid_h2o <- as.h2o(df_valid)
test_h2o <- as.h2o(df_test)
# Train Auto Machine Learning:
start <- Sys.time()
autoML <- h2o.automl(x = var_selected,
y = "Churn",
training_frame = train_h2o,
nfolds = 4,
stopping_metric = "AUC",
stopping_rounds = 10,
stopping_tolerance = 0.025,
max_models = 15,
max_runtime_secs = 60*60,
seed = 1,
sort_metric = "AUC")
run = Sys.time() - start
autoML@leader@model$model_summary
## GLM Model: summary
## family link regularization
## 1 binomial logit Ridge ( lambda = 0.001775 )
## lambda_search
## 1 nlambda = 30, lambda.max = 9.4055, lambda.min = 0.001775, lambda.1se = 0.01194
## number_of_predictors_total number_of_active_predictors number_of_iterations
## 1 32 32 38
## training_frame
## 1 AutoML_32_20221124_42256_training_df_train_sid_bf90_1
h2o.performance(autoML@leader, test_h2o)@metrics$AUC
## [1] 0.8385792
The run took about 1 second and produced a GLM as the leader, with a corresponding test-set AUC of 0.84.
Furthermore, we can inspect the feature importance of this leader, which yields immediate insight into how the company should act on the data.
h2o.varimp_plot(autoML@leader)
We can also examine the AUC of the other models on the leaderboard.
df_results <- autoML@leaderboard %>% as.data.frame() %>%
select(model_id, auc) %>%
mutate(Rank = 1:nrow(.), auc = round(auc, 4)) %>%
rename(AUC_Val = auc)
df_results
## model_id AUC_Val Rank
## 1 GLM_1_AutoML_32_20221124_42256 0.8268 1
## 2 StackedEnsemble_AllModels_1_AutoML_32_20221124_42256 0.8254 2
## 3 StackedEnsemble_BestOfFamily_1_AutoML_32_20221124_42256 0.8242 3
## 4 GBM_1_AutoML_32_20221124_42256 0.8214 4
## 5 GBM_grid_1_AutoML_32_20221124_42256_model_1 0.8213 5
## 6 XRT_1_AutoML_32_20221124_42256 0.8144 6
## 7 DeepLearning_1_AutoML_32_20221124_42256 0.8144 7
## 8 DRF_1_AutoML_32_20221124_42256 0.8046 8
## 9 XGBoost_1_AutoML_32_20221124_42256 0.8011 9
## 10 GBM_5_AutoML_32_20221124_42256 0.7979 10
## 11 GBM_4_AutoML_32_20221124_42256 0.7966 11
## 12 GBM_2_AutoML_32_20221124_42256 0.7965 12
## 13 XGBoost_grid_1_AutoML_32_20221124_42256_model_1 0.7953 13
## 14 GBM_3_AutoML_32_20221124_42256 0.7950 14
## 15 XGBoost_grid_1_AutoML_32_20221124_42256_model_2 0.7946 15
## 16 XGBoost_3_AutoML_32_20221124_42256 0.7833 16
## 17 XGBoost_2_AutoML_32_20221124_42256 0.7828 17
For this dataset, the GLM is the leading model. We can additionally evaluate the leaderboard models on the validation set.
6.2 AUC on valid data
getAUC_onValidData <- function(i) {
# Extract the i-th model from the leaderboard:
best_ith <- h2o.getModel(autoML@leaderboard[i, 1])
# Performance of the i-th model, measured by AUC on the validation data:
metrics_ith <- h2o.performance(model = best_ith, newdata = valid_h2o)
# Return output:
return(data.frame(AUC_Test = metrics_ith@metrics$AUC, model_id = best_ith@model_id))
}
# Calculate the validation AUC for every model on the leaderboard:
auc_on_validData <- map(1:nrow(df_results), getAUC_onValidData) %>% bind_rows() %>% arrange(desc(AUC_Test))
auc_on_validData
## AUC_Test model_id
## 1 0.8048307 StackedEnsemble_BestOfFamily_1_AutoML_32_20221124_42256
## 2 0.8047305 GLM_1_AutoML_32_20221124_42256
## 3 0.8045094 StackedEnsemble_AllModels_1_AutoML_32_20221124_42256
## 4 0.7989327 GBM_1_AutoML_32_20221124_42256
## 5 0.7967863 GBM_grid_1_AutoML_32_20221124_42256_model_1
## 6 0.7920634 DeepLearning_1_AutoML_32_20221124_42256
## 7 0.7898778 XRT_1_AutoML_32_20221124_42256
## 8 0.7800514 DRF_1_AutoML_32_20221124_42256
## 9 0.7700287 XGBoost_1_AutoML_32_20221124_42256
## 10 0.7695946 GBM_4_AutoML_32_20221124_42256
## 11 0.7686820 GBM_2_AutoML_32_20221124_42256
## 12 0.7681420 GBM_5_AutoML_32_20221124_42256
## 13 0.7679859 GBM_3_AutoML_32_20221124_42256
## 14 0.7679577 XGBoost_grid_1_AutoML_32_20221124_42256_model_1
## 15 0.7655638 XGBoost_2_AutoML_32_20221124_42256
## 16 0.7651360 XGBoost_grid_1_AutoML_32_20221124_42256_model_2
## 17 0.7507634 XGBoost_3_AutoML_32_20221124_42256
On the validation set, the GLM is essentially tied with the stacked ensembles at the top of the leaderboard, so we keep the GLM leader for predicting customer churn on the test set.
6.3 Use model for prediction
6.3.1 Maximizing Accuracy
library(ConfusionTableR)
thres_max_accuracy = h2o.performance(autoML@leader, valid_h2o)@metrics$max_criteria_and_metric_scores %>%
filter(metric == "max accuracy") %>%
pull(threshold)
prob_h20 <- h2o.predict(autoML@leader, test_h2o) %>%
as.data.frame() %>%
pull(Yes)
pred_labels = ifelse(prob_h20 >= thres_max_accuracy, "Yes", "No") %>% as.factor
df_cm_max_accuracy = binary_class_cm(pred_labels, true_test_labels)
## [INFO] Building a record level confusion matrix to store in dataset
## [INFO] Build finished and to expose record level cm use the record_level_cm list item
confusionMatrix(data = pred_labels, true_test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1417 318
## Yes 131 242
##
## Accuracy : 0.787
## 95% CI : (0.7689, 0.8043)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 1.252e-08
##
## Kappa : 0.389
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9154
## Specificity : 0.4321
## Pos Pred Value : 0.8167
## Neg Pred Value : 0.6488
## Prevalence : 0.7343
## Detection Rate : 0.6722
## Detection Prevalence : 0.8231
## Balanced Accuracy : 0.6738
##
## 'Positive' Class : No
##
We will not choose the accuracy-maximizing threshold because the test-set specificity is below \(0.5\): fewer than half of the churned customers are correctly identified.
6.3.2 Youden Method
library(pROC)
prob_h20_valid <- h2o.predict(autoML@leader, valid_h2o) %>%
as.data.frame() %>%
pull(Yes)
thres_youden = coords(roc(true_valid_labels, prob_h20_valid), "best", best.method="youden")$threshold
pred_labels = ifelse(prob_h20_valid >= thres_youden, "Yes", "No") %>% as.factor
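Youden’s method picks the threshold that maximizes \(J = \text{Sensitivity} + \text{Specificity} - 1\) on the validation set. As a sanity check, the same quantity can be computed by hand over the candidate thresholds; this minimal sketch should closely agree with thres_youden, up to tie handling (output omitted):
# Sanity check: compute Youden's J by hand over the candidate thresholds (output omitted).
j_by_threshold <- tibble(threshold = sort(unique(prob_h20_valid))) %>%
mutate(sens = map_dbl(threshold, ~mean(prob_h20_valid[true_valid_labels == "Yes"] >= .x)),
spec = map_dbl(threshold, ~mean(prob_h20_valid[true_valid_labels == "No"] < .x)),
J = sens + spec - 1)
j_by_threshold %>% slice_max(J, n = 1)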
library(ConfusionTableR)
prob_h20 <- h2o.predict(autoML@leader, test_h2o) %>%
as.data.frame() %>%
pull(Yes)
pred_labels = ifelse(prob_h20 >= thres_youden, "Yes", "No") %>% as.factor
df_cm_youden = binary_class_cm(pred_labels,true_test_labels)
confusionMatrix(data = pred_labels, true_test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1160 129
## Yes 388 431
##
## Accuracy : 0.7547
## 95% CI : (0.7358, 0.773)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.01745
##
## Kappa : 0.4522
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.7494
## Specificity : 0.7696
## Pos Pred Value : 0.8999
## Neg Pred Value : 0.5263
## Prevalence : 0.7343
## Detection Rate : 0.5503
## Detection Prevalence : 0.6115
## Balanced Accuracy : 0.7595
##
## 'Positive' Class : No
##
The \(\text{p-value}\) for the “accuracy versus no-information rate” test is 0.017: greater than \(0.01\), so at the 1% level we cannot reject the null hypothesis \(Acc = NIR\), but small enough that the model does beat the no-information rate at the \(0.05\) significance level.
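The NIR test reported by caret is essentially a one-sided binomial test of the accuracy against the no-information rate; it can be reproduced from the confusion-matrix counts printed above (a quick check):
# Reproduce the 'Acc > NIR' p-value from the confusion-matrix counts above:
n_total <- 1160 + 129 + 388 + 431 # test-set size
n_correct <- 1160 + 431 # correctly classified observations
nir <- (1160 + 388) / n_total # proportion of the majority class ("No")
binom.test(n_correct, n_total, p = nir, alternative = "greater")$p.value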
6.3.3 Maximizing F1
thres_max_f1 = h2o.performance(autoML@leader, valid_h2o)@metrics$max_criteria_and_metric_scores %>%
filter(metric == "max f1") %>%
pull(threshold)
library(ConfusionTableR)
prob_h20 <- h2o.predict(autoML@leader, test_h2o) %>%
as.data.frame() %>%
pull(Yes)
pred_labels = ifelse(prob_h20 >= thres_max_f1, "Yes", "No") %>% as.factor
df_cm_max_f1 = binary_class_cm(pred_labels, true_test_labels)
## [INFO] Building a record level confusion matrix to store in dataset
## [INFO] Build finished and to expose record level cm use the record_level_cm list item
confusionMatrix(data = pred_labels, true_test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1173 134
## Yes 375 426
##
## Accuracy : 0.7585
## 95% CI : (0.7397, 0.7767)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 0.006012
##
## Kappa : 0.4559
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7578
## Specificity : 0.7607
## Pos Pred Value : 0.8975
## Neg Pred Value : 0.5318
## Prevalence : 0.7343
## Detection Rate : 0.5565
## Detection Prevalence : 0.6200
## Balanced Accuracy : 0.7592
##
## 'Positive' Class : No
##
The \(\text{p-value}\) for the NIR test is less than \(0.01\), so we reject the null hypothesis \(Acc = NIR\) at the 1% level: the model’s accuracy is significantly better than the no-information rate.
rbind(df_cm_max_accuracy[[2]],df_cm_youden[[2]],df_cm_max_f1[[2]]) %>%
mutate(approach = c("Max Accuracy","Youden", "Max F1")) %>%
select(approach, Accuracy, Sensitivity, Specificity, Precision,F1) %>%
t %>%
data.frame() %>% .[-1,] %>% rownames_to_column("metrics") %>% select(metrics, "Max Accuracy" = X1, Youden = X2, "Max F1" = X3)
## metrics Max Accuracy Youden Max F1
## 1 Accuracy 0.7870019 0.7547438 0.7585389
## 2 Sensitivity 0.9153747 0.7493540 0.7577519
## 3 Specificity 0.4321429 0.7696429 0.7607143
## 4 Precision 0.8167147 0.8999224 0.8974751
## 5 F1 0.8632348 0.8177652 0.8217163
Since the Max F1 threshold gives the most balanced sensitivity and specificity of the three approaches (roughly 0.76 each), we will use this approach for future data.
7 Summary
In the EDA section, we found that the most significant information is the customer’s monthly charges, the type of internet service and contract they have, and the length of time they have been customers with Telco. To proactively reduce their churn rate, Telco could target customers who are on month-to-month contracts, use fiber optic internet, have higher monthly charges on average, and have a shorter tenure of fewer than 18 months.
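As a rough illustration, that at-risk segment can be pulled straight from the data with a rule-of-thumb filter (not a model-based risk score; output omitted):
# Rule-of-thumb view of the at-risk segment among current customers (output omitted):
at_risk <- data %>%
filter(Churn == "No",
Contract == "Month-to-month",
InternetService == "Fiber optic",
tenure < 18)
nrow(at_risk)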
The Chi-Squared tests indicate that important categorical predictors of customer attrition include Contract, InternetService, PaymentMethod, OnlineSecurity, TechSupport, and PaperlessBilling, while the EDA highlights tenure and MonthlyCharges among the numeric features.
Using the Chi-Squared test for feature selection together with AutoML, we obtained very satisfactory results with the GLM model. For future data, we can use the F1-maximizing threshold to balance sensitivity and specificity.
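In practice, scoring future customers amounts to predicting with the GLM leader and applying the F1-maximizing threshold chosen on the validation set. A minimal sketch, where new_customers stands in for a hypothetical data frame with the same feature columns as the training data:
# Minimal sketch for scoring future data with the chosen threshold.
# new_customers is a hypothetical data frame with the same columns as the training data.
score_new <- function(new_customers, model = autoML@leader, threshold = thres_max_f1) {
probs <- h2o.predict(model, as.h2o(new_customers)) %>%
as.data.frame() %>%
pull(Yes)
ifelse(probs >= threshold, "Yes", "No")
}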