Churn data set

Context

Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

Variables

  • CustomerID: A unique ID that identifies each customer.

  • Gender: The customer’s gender: Male, Female

  • Age: The customer’s current age, in years, at the time the fiscal quarter ended.

  • Senior Citizen: Indicates if the customer is 65 or older: Yes, No

  • Married (Partner): Indicates if the customer is married: Yes, No

  • Dependents: Indicates if the customer lives with any dependents: Yes, No. Dependents could be children, parents, grandparents, etc.

  • Number of Dependents: Indicates the number of dependents that live with the customer.

  • Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes, No

  • Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes, No

  • Internet Service: Indicates if the customer subscribes to Internet service with the company: No, DSL, Fiber Optic, Cable.

  • Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes, No

  • Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes, No

  • Device Protection Plan: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No

  • Premium Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes, No

  • Streaming TV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: Yes, No. The company does not charge an additional fee for this service.

  • Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes, No. The company does not charge an additional fee for this service.

  • Contract: Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.

  • Paperless Billing: Indicates if the customer has chosen paperless billing: Yes, No

  • Payment Method: Indicates how the customer pays their bill: Bank Withdrawal, Credit Card, Mailed Check

  • Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.

  • Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.

  • Tenure: Indicates the total amount of months that the customer has been with the company.

  • Churn: Yes = the customer left the company this quarter. No = the customer remained with the company.


Content

In each row represents a customer, each column contains customer’s attributes described in the column Metadata. Source from: Kaggle



How to select Machine Learning Model for churn prediction?

  • When looking at churn target, it only contains data ‘No’, ‘Yes’. This is classification problem.

    • ‘No’ means “a customer remains services”.

    • ‘Yes’ means “a customer cancels or stops to continue service”.


The Steps of Analysis and Evaluation Data

1. Collect and Clean Data

2. Explore and Transform Data

3. Train Model and Scoring

4. Model Evaluation

5. Conclusion


1. Collect and Clean Data

  • Install and call packages.

  • Get data “churn.csv” from my github.

  • Check and drop missing values

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart)
library(rpart.plot)
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(gbm)
## Loaded gbm 2.1.8.1
# Collect Data
churn_url <- "https://raw.githubusercontent.com/Thanida45/churn_prediction/main/WA_Fn-UseC_-Telco-Customer-Churn.csv"
churn_df <- read.csv(churn_url)

# Check Missing Values
cat("Check dataframe completed? : ", ifelse(mean(complete.cases(churn_df)) == 1, "Yes", "No, need to Clean" ))
## Check dataframe completed? :  No, need to Clean
cat("\nThere are missing values : ", sum(is.na(churn_df)), " rows")
## 
## There are missing values :  11  rows
churn_df <- na.omit(churn_df)
cat("Check dataframe completed? :",ifelse(mean(complete.cases(churn_df)) == 1, "Yes", "No, need to Clean" ))
## Check dataframe completed? : Yes
glimpse(churn_df)
## Rows: 7,032
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

2. Explore and Transform Data

  • Explore data and plotting.

  • Transform data.

dt_1 <- churn_df %>%
        select(gender, Churn, Contract,PaymentMethod, TotalCharges) %>%
        top_n(10, wt = (TotalCharges))%>%
        group_by(Contract) %>%
        arrange(desc(TotalCharges))
    
dt_1
## # A tibble: 10 × 5
## # Groups:   Contract [2]
##    gender Churn Contract PaymentMethod             TotalCharges
##    <chr>  <chr> <chr>    <chr>                            <dbl>
##  1 Male   Yes   One year Bank transfer (automatic)        8685.
##  2 Female No    Two year Bank transfer (automatic)        8672.
##  3 Female No    Two year Credit card (automatic)          8670.
##  4 Male   No    Two year Credit card (automatic)          8594.
##  5 Male   No    Two year Electronic check                 8565.
##  6 Male   No    Two year Electronic check                 8547.
##  7 Female No    Two year Credit card (automatic)          8543.
##  8 Female No    Two year Credit card (automatic)          8530.
##  9 Female No    Two year Electronic check                 8497.
## 10 Female No    Two year Bank transfer (automatic)        8478.
table(churn_df$gender, churn_df$Churn, dnn = c("Gender", "Churn"))
##         Churn
## Gender     No  Yes
##   Female 2544  939
##   Male   2619  930
ggplot(churn_df, aes(gender, fill = Churn)) +
    geom_bar(position = "dodge", color = "black") +
    scale_fill_manual(values =c("#c0d6e4", "#e4cec0") )  +
    theme_minimal() +
    labs(title = "The number of churn between male and female")+
    theme(plot.title = element_text(hjust = 0.5))

table(churn_df$SeniorCitizen, churn_df$Churn , dnn = c("SeniorCitizen", "Churn"))
##              Churn
## SeniorCitizen   No  Yes
##             0 4497 1393
##             1  666  476
ggplot(churn_df, aes(SeniorCitizen , fill = Churn)) +
  geom_bar(position = "dodge", color = "black") +
  scale_fill_manual(values =c("#c0d6e4", "#e4cec0") )  +
  theme_minimal() +
  labs( title = "The churn ratio of senior citizen") +
  theme(plot.title = element_text(hjust = 0.5))

table(churn_df$Contract, churn_df$Churn, dnn = c("Contract", "Churn"))
##                 Churn
## Contract           No  Yes
##   Month-to-month 2220 1655
##   One year       1306  166
##   Two year       1637   48
ggplot(churn_df, aes(Contract, fill = Churn)) +
    geom_bar(position = "dodge", color = "black") +
    scale_fill_manual(values =c("#c0d6e4", "#e4cec0") )  +
    theme_minimal() +
    labs( title =  "The churn ratio of each customers's contract.",
          y = "The number of customers in each contract.") +
        theme(plot.title = element_text(hjust = 0.5))

churn_df %>%
        ggplot(aes(tenure , fill = Churn)) + 
        geom_histogram(position = "fill",color = "black", bins = 6) +
        scale_fill_manual(values =c("#c0d6e4", "#e4cec0") )  +
        theme_minimal() +
        labs(title = "The churn ratio of customers in 72 months(12 months for a bin))",
              x = "Months",
              y = "Churn ratio")  +
        theme(plot.title = element_text(hjust = 0.5))

churn_df %>%
        ggplot(aes(tenure, fill = Churn)) +
        geom_density(color = "black", alpha =0.7) +
        scale_fill_manual(values =c("#c0d6e4", "#dcc0ae") ) +
        theme_minimal() +
        labs( title = "The different of churn ratio of customers from 0 to 72 months",
               x = "Months",
               y = "Ratio")  +
        theme(plot.title = element_text(hjust = 0.5))

table(churn_df$InternetService, dnn = c("InternetService"))
## InternetService
##         DSL Fiber optic          No 
##        2416        3096        1520
churn_df %>%
    ggplot(aes(InternetService)) +
    geom_bar(fill = '#c0d6e4', color = "black") +
    theme_minimal() +
    labs(title = "The number of internet services in each category",
         y = "The number of internet services") +
    theme(plot.title = element_text(hjust = 0.5))

churn_df %>%
    ggplot(aes(InternetService, fill = Churn)) +
    geom_bar(position = "fill", color = "black") + 
    scale_fill_manual(values =c("#c0d6e4", "#e4cec0")) +
    facet_wrap(~PaymentMethod) +
    theme_minimal() +
    labs(title = "The churn ratio in each internet service categories by gender",
         y = "Ratio") +
    theme(plot.title = element_text(hjust = 0.5))

table(churn_df$PaymentMethod, churn_df$Churn, dnn = c("PaymentMethod", "Churn"))
##                            Churn
## PaymentMethod                 No  Yes
##   Bank transfer (automatic) 1284  258
##   Credit card (automatic)   1289  232
##   Electronic check          1294 1071
##   Mailed check              1296  308
churn_df %>%
  ggplot(aes(PaymentMethod, fill = Churn)) +
  geom_bar(position = "fill", color = "black") +
  scale_fill_manual(values = c("#c0d6e4", "#e4cec0")) +
  theme_minimal() + 
  labs( title = "The churn ratio of customers in each payment mathod",
        y = "The number of customers") +
  theme(plot.title = element_text(hjust = 0.5))

table(churn_df$TechSupport, churn_df$Churn, dnn = c("TechSupport", "Churn"))
##                      Churn
## TechSupport             No  Yes
##   No                  2026 1446
##   No internet service 1407  113
##   Yes                 1730  310
churn_df %>%
  ggplot(aes(TechSupport, fill = Churn)) +
  geom_bar(color = "black") +
  scale_fill_manual(values = c("#c0d6e4", "#e4cec0")) +
  theme_minimal() +
  labs(title = "The number of churn in TechSupports") +
  theme(plot.title = element_text(hjust = 0.5))

churn_df %>%
  ggplot(aes(TotalCharges)) +
  geom_histogram(bins = 15, color = "black", fill ="#c0d6e4" ) +
  theme_minimal() +
  labs(title = "The number of TotalCharges") +
  theme(plot.title = element_text(hjust = 0.5))

churn_df %>%
  ggplot() +
  geom_density(aes(TotalCharges, fill = Churn), alpha = 0.8) +
  facet_wrap(~Contract) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) x*1000) +
  scale_fill_manual(values = c("#aecadc", "#f0cfc1")) +
  labs(title = "The TotalCharges' ratio in each contract type", 
       y = "The TotalCharges ratio") +
  theme(plot.title = element_text(hjust = 0.5))

churn_df %>%
  ggplot(aes(Contract, TotalCharges, fill = Churn)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#aecadc", "#f0cfc1")) +
  theme_minimal() +
  labs(title = "The TotalCharges' numbers in each contract type", 
       y = "The number of TotalCharges") +
  theme(plot.title = element_text(hjust = 0.5))

churn_df <- churn_df %>% 
              mutate(across(where(is.character), as.factor))
              
colnames(churn_df)[2] <- "Gender"
colnames(churn_df)[6] <- "Tenure"


## Drop CustomerID column
churn_df$customerID = NULL

glimpse(churn_df)
## Rows: 7,032
## Columns: 20
## $ Gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ Tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
## baseline prediction no model accuracy = 73.42%
churn_df %>%
  count(Churn) %>%
  mutate(pct = n/sum(n)*100)
##   Churn    n     pct
## 1    No 5163 73.4215
## 2   Yes 1869 26.5785

3. Train Model and Scoring

  • Random Sampling

  • Train Logistic Regression Model

set.seed(55)
id <- createDataPartition(y = churn_df$Churn,
                          p = 0.7,
                          list = FALSE)

train_df <- churn_df[id, ]
test_df <- churn_df[-id, ]

nrow(train_df)
## [1] 4924
nrow(test_df)
## [1] 2108

3.1 Train Model with caret (rpart)

set.seed(55)
## set control method
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        classProbs = TRUE,
                        summaryFunction = prSummary,
                        verboseIter = TRUE
                        )

## train model
rpart_model <- train( Churn ~ .,
                      data = train_df,
                      method = "rpart",
                      metric = "AUC",
                      trControl = control
                    )
## + Fold1.Rep1: cp=0.003247 
## - Fold1.Rep1: cp=0.003247 
## + Fold2.Rep1: cp=0.003247 
## - Fold2.Rep1: cp=0.003247 
## + Fold3.Rep1: cp=0.003247 
## - Fold3.Rep1: cp=0.003247 
## + Fold4.Rep1: cp=0.003247 
## - Fold4.Rep1: cp=0.003247 
## + Fold5.Rep1: cp=0.003247 
## - Fold5.Rep1: cp=0.003247 
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.00325 on full training set
p_rpart <- predict(rpart_model, newdata = test_df)
mean(p_rpart == test_df$Churn)
## [1] 0.7903226

3.2 Train Model with caret (random forest)

set.seed(55)
## set control method
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        classProbs = TRUE,
                        summaryFunction = prSummary,
                        verboseIter = TRUE
                        )

## train model
rf_model <- train( Churn ~ .,
                      data = train_df,
                      method = "rf",
                      metric = "AUC",
                      trControl = control
                    )
## + Fold1.Rep1: mtry= 2 
## - Fold1.Rep1: mtry= 2 
## + Fold1.Rep1: mtry=16 
## - Fold1.Rep1: mtry=16 
## + Fold1.Rep1: mtry=30 
## - Fold1.Rep1: mtry=30 
## + Fold2.Rep1: mtry= 2 
## - Fold2.Rep1: mtry= 2 
## + Fold2.Rep1: mtry=16 
## - Fold2.Rep1: mtry=16 
## + Fold2.Rep1: mtry=30 
## - Fold2.Rep1: mtry=30 
## + Fold3.Rep1: mtry= 2 
## - Fold3.Rep1: mtry= 2 
## + Fold3.Rep1: mtry=16 
## - Fold3.Rep1: mtry=16 
## + Fold3.Rep1: mtry=30 
## - Fold3.Rep1: mtry=30 
## + Fold4.Rep1: mtry= 2 
## - Fold4.Rep1: mtry= 2 
## + Fold4.Rep1: mtry=16 
## - Fold4.Rep1: mtry=16 
## + Fold4.Rep1: mtry=30 
## - Fold4.Rep1: mtry=30 
## + Fold5.Rep1: mtry= 2 
## - Fold5.Rep1: mtry= 2 
## + Fold5.Rep1: mtry=16 
## - Fold5.Rep1: mtry=16 
## + Fold5.Rep1: mtry=30 
## - Fold5.Rep1: mtry=30 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 16 on full training set
## test model
p_rf <- predict(rf_model, newdata = test_df)
mean(p_rf == test_df$Churn)
## [1] 0.7941176

3.3 Train Model with caret (Generalized Boosted Regression Modeling (GBM))

set.seed(55)
## set control method
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        classProbs = TRUE,
                        summaryFunction = prSummary,
                        verboseIter = TRUE
                        )

## train model
gbm_model <- train( Churn ~ .,
                      data = train_df,
                      method = "gbm",
                      metric = "AUC",
                      trControl = control
                    )
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1356             nan     0.1000    0.0098
##      2        1.1189             nan     0.1000    0.0071
##      3        1.1022             nan     0.1000    0.0091
##      4        1.0894             nan     0.1000    0.0063
##      5        1.0733             nan     0.1000    0.0075
##      6        1.0630             nan     0.1000    0.0051
##      7        1.0496             nan     0.1000    0.0059
##      8        1.0386             nan     0.1000    0.0051
##      9        1.0294             nan     0.1000    0.0039
##     10        1.0191             nan     0.1000    0.0039
##     20        0.9523             nan     0.1000    0.0019
##     40        0.8904             nan     0.1000    0.0006
##     60        0.8604             nan     0.1000    0.0004
##     80        0.8453             nan     0.1000   -0.0002
##    100        0.8345             nan     0.1000   -0.0001
##    120        0.8279             nan     0.1000    0.0001
##    140        0.8240             nan     0.1000   -0.0001
##    150        0.8219             nan     0.1000   -0.0001
## 
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1224             nan     0.1000    0.0167
##      2        1.0964             nan     0.1000    0.0131
##      3        1.0687             nan     0.1000    0.0129
##      4        1.0479             nan     0.1000    0.0095
##      5        1.0302             nan     0.1000    0.0083
##      6        1.0126             nan     0.1000    0.0080
##      7        0.9982             nan     0.1000    0.0067
##      8        0.9871             nan     0.1000    0.0050
##      9        0.9769             nan     0.1000    0.0043
##     10        0.9684             nan     0.1000    0.0036
##     20        0.8960             nan     0.1000    0.0032
##     40        0.8427             nan     0.1000    0.0003
##     60        0.8241             nan     0.1000    0.0000
##     80        0.8122             nan     0.1000   -0.0000
##    100        0.8061             nan     0.1000   -0.0001
##    120        0.8005             nan     0.1000   -0.0002
##    140        0.7951             nan     0.1000   -0.0005
##    150        0.7931             nan     0.1000   -0.0006
## 
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1209             nan     0.1000    0.0202
##      2        1.0857             nan     0.1000    0.0179
##      3        1.0545             nan     0.1000    0.0147
##      4        1.0318             nan     0.1000    0.0103
##      5        1.0102             nan     0.1000    0.0100
##      6        0.9922             nan     0.1000    0.0081
##      7        0.9792             nan     0.1000    0.0064
##      8        0.9670             nan     0.1000    0.0059
##      9        0.9557             nan     0.1000    0.0045
##     10        0.9438             nan     0.1000    0.0056
##     20        0.8757             nan     0.1000    0.0027
##     40        0.8271             nan     0.1000    0.0005
##     60        0.8082             nan     0.1000   -0.0002
##     80        0.7959             nan     0.1000   -0.0003
##    100        0.7875             nan     0.1000   -0.0001
##    120        0.7804             nan     0.1000   -0.0003
##    140        0.7726             nan     0.1000   -0.0002
##    150        0.7690             nan     0.1000   -0.0003
## 
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1385             nan     0.1000    0.0097
##      2        1.1202             nan     0.1000    0.0088
##      3        1.1047             nan     0.1000    0.0076
##      4        1.0901             nan     0.1000    0.0072
##      5        1.0757             nan     0.1000    0.0063
##      6        1.0621             nan     0.1000    0.0069
##      7        1.0502             nan     0.1000    0.0053
##      8        1.0376             nan     0.1000    0.0054
##      9        1.0271             nan     0.1000    0.0051
##     10        1.0183             nan     0.1000    0.0043
##     20        0.9515             nan     0.1000    0.0032
##     40        0.8890             nan     0.1000    0.0006
##     60        0.8598             nan     0.1000    0.0001
##     80        0.8433             nan     0.1000    0.0003
##    100        0.8340             nan     0.1000    0.0000
##    120        0.8265             nan     0.1000    0.0003
##    140        0.8221             nan     0.1000   -0.0001
##    150        0.8205             nan     0.1000   -0.0002
## 
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1305             nan     0.1000    0.0135
##      2        1.1054             nan     0.1000    0.0125
##      3        1.0805             nan     0.1000    0.0117
##      4        1.0595             nan     0.1000    0.0092
##      5        1.0350             nan     0.1000    0.0113
##      6        1.0169             nan     0.1000    0.0086
##      7        1.0043             nan     0.1000    0.0069
##      8        0.9896             nan     0.1000    0.0068
##      9        0.9795             nan     0.1000    0.0044
##     10        0.9645             nan     0.1000    0.0070
##     20        0.8959             nan     0.1000    0.0024
##     40        0.8444             nan     0.1000    0.0001
##     60        0.8255             nan     0.1000   -0.0004
##     80        0.8150             nan     0.1000   -0.0001
##    100        0.8068             nan     0.1000   -0.0002
##    120        0.8022             nan     0.1000   -0.0002
##    140        0.7984             nan     0.1000   -0.0002
##    150        0.7948             nan     0.1000   -0.0002
## 
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1168             nan     0.1000    0.0200
##      2        1.0846             nan     0.1000    0.0162
##      3        1.0545             nan     0.1000    0.0135
##      4        1.0272             nan     0.1000    0.0135
##      5        1.0085             nan     0.1000    0.0092
##      6        0.9876             nan     0.1000    0.0097
##      7        0.9715             nan     0.1000    0.0078
##      8        0.9587             nan     0.1000    0.0056
##      9        0.9469             nan     0.1000    0.0057
##     10        0.9372             nan     0.1000    0.0045
##     20        0.8761             nan     0.1000    0.0005
##     40        0.8283             nan     0.1000   -0.0000
##     60        0.8093             nan     0.1000   -0.0005
##     80        0.7974             nan     0.1000   -0.0002
##    100        0.7875             nan     0.1000   -0.0006
##    120        0.7794             nan     0.1000   -0.0003
##    140        0.7726             nan     0.1000   -0.0004
##    150        0.7688             nan     0.1000   -0.0003
## 
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1360             nan     0.1000    0.0114
##      2        1.1187             nan     0.1000    0.0082
##      3        1.0996             nan     0.1000    0.0087
##      4        1.0831             nan     0.1000    0.0069
##      5        1.0681             nan     0.1000    0.0064
##      6        1.0552             nan     0.1000    0.0065
##      7        1.0419             nan     0.1000    0.0059
##      8        1.0298             nan     0.1000    0.0061
##      9        1.0199             nan     0.1000    0.0052
##     10        1.0103             nan     0.1000    0.0044
##     20        0.9430             nan     0.1000    0.0024
##     40        0.8799             nan     0.1000    0.0008
##     60        0.8477             nan     0.1000    0.0002
##     80        0.8312             nan     0.1000    0.0000
##    100        0.8202             nan     0.1000   -0.0000
##    120        0.8138             nan     0.1000   -0.0002
##    140        0.8089             nan     0.1000   -0.0001
##    150        0.8073             nan     0.1000   -0.0001
## 
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1238             nan     0.1000    0.0177
##      2        1.1007             nan     0.1000    0.0111
##      3        1.0736             nan     0.1000    0.0136
##      4        1.0535             nan     0.1000    0.0100
##      5        1.0364             nan     0.1000    0.0089
##      6        1.0195             nan     0.1000    0.0079
##      7        0.9989             nan     0.1000    0.0099
##      8        0.9846             nan     0.1000    0.0067
##      9        0.9700             nan     0.1000    0.0066
##     10        0.9602             nan     0.1000    0.0044
##     20        0.8870             nan     0.1000    0.0020
##     40        0.8320             nan     0.1000    0.0005
##     60        0.8125             nan     0.1000   -0.0000
##     80        0.8012             nan     0.1000   -0.0001
##    100        0.7928             nan     0.1000   -0.0001
##    120        0.7864             nan     0.1000   -0.0001
##    140        0.7817             nan     0.1000   -0.0001
##    150        0.7778             nan     0.1000   -0.0001
## 
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1194             nan     0.1000    0.0193
##      2        1.0847             nan     0.1000    0.0168
##      3        1.0551             nan     0.1000    0.0139
##      4        1.0314             nan     0.1000    0.0112
##      5        1.0111             nan     0.1000    0.0098
##      6        0.9943             nan     0.1000    0.0073
##      7        0.9755             nan     0.1000    0.0089
##      8        0.9612             nan     0.1000    0.0072
##      9        0.9495             nan     0.1000    0.0048
##     10        0.9343             nan     0.1000    0.0074
##     20        0.8661             nan     0.1000    0.0019
##     40        0.8142             nan     0.1000    0.0000
##     60        0.7952             nan     0.1000   -0.0001
##     80        0.7843             nan     0.1000   -0.0003
##    100        0.7766             nan     0.1000   -0.0006
##    120        0.7679             nan     0.1000   -0.0005
##    140        0.7610             nan     0.1000   -0.0003
##    150        0.7566             nan     0.1000    0.0000
## 
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1371             nan     0.1000    0.0099
##      2        1.1200             nan     0.1000    0.0083
##      3        1.1035             nan     0.1000    0.0076
##      4        1.0898             nan     0.1000    0.0065
##      5        1.0762             nan     0.1000    0.0059
##      6        1.0631             nan     0.1000    0.0063
##      7        1.0527             nan     0.1000    0.0049
##      8        1.0405             nan     0.1000    0.0056
##      9        1.0324             nan     0.1000    0.0035
##     10        1.0227             nan     0.1000    0.0050
##     20        0.9602             nan     0.1000    0.0033
##     40        0.8966             nan     0.1000    0.0010
##     60        0.8696             nan     0.1000    0.0006
##     80        0.8527             nan     0.1000   -0.0003
##    100        0.8431             nan     0.1000   -0.0001
##    120        0.8359             nan     0.1000   -0.0001
##    140        0.8310             nan     0.1000   -0.0000
##    150        0.8298             nan     0.1000   -0.0001
## 
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1249             nan     0.1000    0.0160
##      2        1.1006             nan     0.1000    0.0125
##      3        1.0733             nan     0.1000    0.0126
##      4        1.0575             nan     0.1000    0.0072
##      5        1.0403             nan     0.1000    0.0078
##      6        1.0237             nan     0.1000    0.0074
##      7        1.0063             nan     0.1000    0.0081
##      8        0.9932             nan     0.1000    0.0059
##      9        0.9790             nan     0.1000    0.0068
##     10        0.9662             nan     0.1000    0.0054
##     20        0.8981             nan     0.1000    0.0027
##     40        0.8531             nan     0.1000    0.0005
##     60        0.8347             nan     0.1000    0.0000
##     80        0.8238             nan     0.1000   -0.0001
##    100        0.8172             nan     0.1000   -0.0003
##    120        0.8111             nan     0.1000   -0.0003
##    140        0.8063             nan     0.1000   -0.0002
##    150        0.8045             nan     0.1000   -0.0002
## 
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1216             nan     0.1000    0.0181
##      2        1.0923             nan     0.1000    0.0147
##      3        1.0666             nan     0.1000    0.0125
##      4        1.0467             nan     0.1000    0.0093
##      5        1.0261             nan     0.1000    0.0093
##      6        1.0067             nan     0.1000    0.0084
##      7        0.9898             nan     0.1000    0.0091
##      8        0.9751             nan     0.1000    0.0064
##      9        0.9633             nan     0.1000    0.0053
##     10        0.9502             nan     0.1000    0.0063
##     20        0.8880             nan     0.1000    0.0007
##     40        0.8399             nan     0.1000   -0.0000
##     60        0.8208             nan     0.1000   -0.0004
##     80        0.8085             nan     0.1000    0.0000
##    100        0.8002             nan     0.1000   -0.0003
##    120        0.7914             nan     0.1000   -0.0004
##    140        0.7845             nan     0.1000   -0.0002
##    150        0.7796             nan     0.1000    0.0000
## 
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1358             nan     0.1000    0.0108
##      2        1.1192             nan     0.1000    0.0085
##      3        1.1034             nan     0.1000    0.0081
##      4        1.0870             nan     0.1000    0.0086
##      5        1.0727             nan     0.1000    0.0069
##      6        1.0584             nan     0.1000    0.0068
##      7        1.0470             nan     0.1000    0.0051
##      8        1.0352             nan     0.1000    0.0051
##      9        1.0234             nan     0.1000    0.0051
##     10        1.0154             nan     0.1000    0.0037
##     20        0.9461             nan     0.1000    0.0025
##     40        0.8831             nan     0.1000    0.0012
##     60        0.8529             nan     0.1000    0.0003
##     80        0.8363             nan     0.1000    0.0006
##    100        0.8253             nan     0.1000   -0.0001
##    120        0.8192             nan     0.1000   -0.0000
##    140        0.8143             nan     0.1000   -0.0002
##    150        0.8130             nan     0.1000   -0.0000
## 
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150 
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1220             nan     0.1000    0.0179
##      2        1.0967             nan     0.1000    0.0131
##      3        1.0741             nan     0.1000    0.0108
##      4        1.0561             nan     0.1000    0.0080
##      5        1.0378             nan     0.1000    0.0089
##      6        1.0185             nan     0.1000    0.0086
##      7        1.0044             nan     0.1000    0.0063
##      8        0.9857             nan     0.1000    0.0090
##      9        0.9701             nan     0.1000    0.0075
##     10        0.9623             nan     0.1000    0.0031
##     20        0.8921             nan     0.1000    0.0023
##     40        0.8381             nan     0.1000    0.0003
##     60        0.8205             nan     0.1000    0.0000
##     80        0.8102             nan     0.1000   -0.0005
##    100        0.8028             nan     0.1000   -0.0008
##    120        0.7971             nan     0.1000   -0.0001
##    140        0.7925             nan     0.1000   -0.0003
##    150        0.7898             nan     0.1000   -0.0001
## 
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1153             nan     0.1000    0.0201
##      2        1.0828             nan     0.1000    0.0153
##      3        1.0579             nan     0.1000    0.0127
##      4        1.0343             nan     0.1000    0.0120
##      5        1.0158             nan     0.1000    0.0086
##      6        1.0003             nan     0.1000    0.0072
##      7        0.9861             nan     0.1000    0.0062
##      8        0.9709             nan     0.1000    0.0071
##      9        0.9569             nan     0.1000    0.0059
##     10        0.9452             nan     0.1000    0.0054
##     20        0.8707             nan     0.1000    0.0015
##     40        0.8248             nan     0.1000   -0.0001
##     60        0.8070             nan     0.1000   -0.0002
##     80        0.7944             nan     0.1000   -0.0005
##    100        0.7861             nan     0.1000   -0.0007
##    120        0.7771             nan     0.1000   -0.0003
##    140        0.7708             nan     0.1000   -0.0001
##    150        0.7678             nan     0.1000   -0.0004
## 
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150 
## Aggregating results
## Selecting tuning parameters
## Fitting n.trees = 150, interaction.depth = 2, shrinkage = 0.1, n.minobsinnode = 10 on full training set
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.1222             nan     0.1000    0.0167
##      2        1.1000             nan     0.1000    0.0092
##      3        1.0755             nan     0.1000    0.0111
##      4        1.0501             nan     0.1000    0.0122
##      5        1.0315             nan     0.1000    0.0088
##      6        1.0137             nan     0.1000    0.0095
##      7        0.9966             nan     0.1000    0.0081
##      8        0.9826             nan     0.1000    0.0069
##      9        0.9704             nan     0.1000    0.0061
##     10        0.9630             nan     0.1000    0.0031
##     20        0.8951             nan     0.1000    0.0027
##     40        0.8455             nan     0.1000    0.0007
##     60        0.8282             nan     0.1000   -0.0001
##     80        0.8176             nan     0.1000    0.0001
##    100        0.8105             nan     0.1000   -0.0001
##    120        0.8059             nan     0.1000   -0.0001
##    140        0.8006             nan     0.1000   -0.0002
##    150        0.7983             nan     0.1000   -0.0004
## test model
p_gbm <- predict(gbm_model, newdata = test_df)
mean(p_gbm == test_df$Churn)
## [1] 0.806926
cat("\n gbm Model\n")
## 
##  gbm Model
gbm_model
## Stochastic Gradient Boosting 
## 
## 4924 samples
##   19 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3939, 3939, 3939, 3940, 3939 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  AUC        Precision  Recall     F        
##   1                   50      0.8922882  0.8060274  0.9405256  0.8680355
##   1                  100      0.9253426  0.8218884  0.9233748  0.8696022
##   1                  150      0.9312815  0.8329592  0.9109267  0.8701193
##   2                   50      0.9187952  0.8259954  0.9181189  0.8695827
##   2                  100      0.9296088  0.8336686  0.9087137  0.8695230
##   2                  150      0.9339370  0.8357417  0.9056708  0.8692439
##   3                   50      0.9292849  0.8341968  0.9078838  0.8694013
##   3                  100      0.9315188  0.8381482  0.9031812  0.8693919
##   3                  150      0.9332234  0.8412906  0.9026279  0.8708219
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.

4. Model Evaluation

  1. Recursive Partitioning and Regression Trees Model 79.03% (rpart)

  2. Random Forest Model 79.41% (rf)

  3. Generalized Boosted Regression Model 80.69% (gbm)

  • From 3 models, the optimal model is Generalized Boosted Regression Modeling (gbm)
# gbm
c_gbm <- confusionMatrix(p_gbm , test_df$Churn, 
                mode = "prec_recall", positive = "Yes")
c_gbm$table
##           Reference
## Prediction   No  Yes
##        No  1405  264
##        Yes  143  296
Predict <- factor(c(0, 0, 1, 1))
Actual <- factor(c(0, 1, 0, 1))
y <- c(c_gbm$table[1,1], c_gbm$table[1,2],c_gbm$table[2,1], c_gbm$table[2,2]  )
df <- data.frame(Predict, Actual, y)

ggplot(df, aes(Predict, Actual, fill = y)) +
  geom_tile() +
  scale_fill_gradient(low = "#f0f8ff", high = "#6497b1") +
  geom_text(aes(label = sprintf("%1.0f", y))) +
  theme_minimal()+
  labs(title = "Confusion Matrix",
      subtitle = "0 = No, 1 = Yes") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) 

5. Conclusion

In case of no model, the baseline’s accuracy is predicted at 73.42%.

In the optimal model, the Generalized Boosted Regression Model(gbm) got 80.69%, precision 67.43%, recall 52.86%, F1 59.26% when splitting train 70%, test 30%.

Source: The original data be on Kaggle