Content
In each row represents a customer, each column contains customer’s attributes described in the column Metadata. Source from: Kaggle
Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.
CustomerID: A unique ID that identifies each customer.
Gender: The customer’s gender: Male, Female
Age: The customer’s current age, in years, at the time the fiscal quarter ended.
Senior Citizen: Indicates if the customer is 65 or older: Yes, No
Married (Partner): Indicates if the customer is married: Yes, No
Dependents: Indicates if the customer lives with any dependents: Yes, No. Dependents could be children, parents, grandparents, etc.
Number of Dependents: Indicates the number of dependents that live with the customer.
Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes, No
Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes, No
Internet Service: Indicates if the customer subscribes to Internet service with the company: No, DSL, Fiber Optic, Cable.
Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes, No
Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes, No
Device Protection Plan: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No
Premium Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes, No
Streaming TV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: Yes, No. The company does not charge an additional fee for this service.
Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes, No. The company does not charge an additional fee for this service.
Contract: Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.
Paperless Billing: Indicates if the customer has chosen paperless billing: Yes, No
Payment Method: Indicates how the customer pays their bill: Bank Withdrawal, Credit Card, Mailed Check
Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.
Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.
Tenure: Indicates the total amount of months that the customer has been with the company.
Churn: Yes = the customer left the company this quarter. No = the customer remained with the company.
In each row represents a customer, each column contains customer’s attributes described in the column Metadata. Source from: Kaggle
When looking at churn target, it only contains data ‘No’, ‘Yes’. This is classification problem.
‘No’ means “a customer remains services”.
‘Yes’ means “a customer cancels or stops to continue service”.
1. Collect and Clean Data
2. Explore and Transform Data
3. Train Model and Scoring
4. Model Evaluation
5. Conclusion
Install and call packages.
Get data “churn.csv” from my github.
Check and drop missing values
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(rpart)
library(rpart.plot)
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
library(gbm)
## Loaded gbm 2.1.8.1
# Collect Data
churn_url <- "https://raw.githubusercontent.com/Thanida45/churn_prediction/main/WA_Fn-UseC_-Telco-Customer-Churn.csv"
churn_df <- read.csv(churn_url)
# Check Missing Values
cat("Check dataframe completed? : ", ifelse(mean(complete.cases(churn_df)) == 1, "Yes", "No, need to Clean" ))
## Check dataframe completed? : No, need to Clean
cat("\nThere are missing values : ", sum(is.na(churn_df)), " rows")
##
## There are missing values : 11 rows
churn_df <- na.omit(churn_df)
cat("Check dataframe completed? :",ifelse(mean(complete.cases(churn_df)) == 1, "Yes", "No, need to Clean" ))
## Check dataframe completed? : Yes
glimpse(churn_df)
## Rows: 7,032
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
Explore data and plotting.
Transform data.
dt_1 <- churn_df %>%
select(gender, Churn, Contract,PaymentMethod, TotalCharges) %>%
top_n(10, wt = (TotalCharges))%>%
group_by(Contract) %>%
arrange(desc(TotalCharges))
dt_1
## # A tibble: 10 × 5
## # Groups: Contract [2]
## gender Churn Contract PaymentMethod TotalCharges
## <chr> <chr> <chr> <chr> <dbl>
## 1 Male Yes One year Bank transfer (automatic) 8685.
## 2 Female No Two year Bank transfer (automatic) 8672.
## 3 Female No Two year Credit card (automatic) 8670.
## 4 Male No Two year Credit card (automatic) 8594.
## 5 Male No Two year Electronic check 8565.
## 6 Male No Two year Electronic check 8547.
## 7 Female No Two year Credit card (automatic) 8543.
## 8 Female No Two year Credit card (automatic) 8530.
## 9 Female No Two year Electronic check 8497.
## 10 Female No Two year Bank transfer (automatic) 8478.
table(churn_df$gender, churn_df$Churn, dnn = c("Gender", "Churn"))
## Churn
## Gender No Yes
## Female 2544 939
## Male 2619 930
ggplot(churn_df, aes(gender, fill = Churn)) +
geom_bar(position = "dodge", color = "black") +
scale_fill_manual(values =c("#c0d6e4", "#e4cec0") ) +
theme_minimal() +
labs(title = "The number of churn between male and female")+
theme(plot.title = element_text(hjust = 0.5))
table(churn_df$SeniorCitizen, churn_df$Churn , dnn = c("SeniorCitizen", "Churn"))
## Churn
## SeniorCitizen No Yes
## 0 4497 1393
## 1 666 476
ggplot(churn_df, aes(SeniorCitizen , fill = Churn)) +
geom_bar(position = "dodge", color = "black") +
scale_fill_manual(values =c("#c0d6e4", "#e4cec0") ) +
theme_minimal() +
labs( title = "The churn ratio of senior citizen") +
theme(plot.title = element_text(hjust = 0.5))
table(churn_df$Contract, churn_df$Churn, dnn = c("Contract", "Churn"))
## Churn
## Contract No Yes
## Month-to-month 2220 1655
## One year 1306 166
## Two year 1637 48
ggplot(churn_df, aes(Contract, fill = Churn)) +
geom_bar(position = "dodge", color = "black") +
scale_fill_manual(values =c("#c0d6e4", "#e4cec0") ) +
theme_minimal() +
labs( title = "The churn ratio of each customers's contract.",
y = "The number of customers in each contract.") +
theme(plot.title = element_text(hjust = 0.5))
churn_df %>%
ggplot(aes(tenure , fill = Churn)) +
geom_histogram(position = "fill",color = "black", bins = 6) +
scale_fill_manual(values =c("#c0d6e4", "#e4cec0") ) +
theme_minimal() +
labs(title = "The churn ratio of customers in 72 months(12 months for a bin))",
x = "Months",
y = "Churn ratio") +
theme(plot.title = element_text(hjust = 0.5))
churn_df %>%
ggplot(aes(tenure, fill = Churn)) +
geom_density(color = "black", alpha =0.7) +
scale_fill_manual(values =c("#c0d6e4", "#dcc0ae") ) +
theme_minimal() +
labs( title = "The different of churn ratio of customers from 0 to 72 months",
x = "Months",
y = "Ratio") +
theme(plot.title = element_text(hjust = 0.5))
table(churn_df$InternetService, dnn = c("InternetService"))
## InternetService
## DSL Fiber optic No
## 2416 3096 1520
churn_df %>%
ggplot(aes(InternetService)) +
geom_bar(fill = '#c0d6e4', color = "black") +
theme_minimal() +
labs(title = "The number of internet services in each category",
y = "The number of internet services") +
theme(plot.title = element_text(hjust = 0.5))
churn_df %>%
ggplot(aes(InternetService, fill = Churn)) +
geom_bar(position = "fill", color = "black") +
scale_fill_manual(values =c("#c0d6e4", "#e4cec0")) +
facet_wrap(~PaymentMethod) +
theme_minimal() +
labs(title = "The churn ratio in each internet service categories by gender",
y = "Ratio") +
theme(plot.title = element_text(hjust = 0.5))
table(churn_df$PaymentMethod, churn_df$Churn, dnn = c("PaymentMethod", "Churn"))
## Churn
## PaymentMethod No Yes
## Bank transfer (automatic) 1284 258
## Credit card (automatic) 1289 232
## Electronic check 1294 1071
## Mailed check 1296 308
churn_df %>%
ggplot(aes(PaymentMethod, fill = Churn)) +
geom_bar(position = "fill", color = "black") +
scale_fill_manual(values = c("#c0d6e4", "#e4cec0")) +
theme_minimal() +
labs( title = "The churn ratio of customers in each payment mathod",
y = "The number of customers") +
theme(plot.title = element_text(hjust = 0.5))
table(churn_df$TechSupport, churn_df$Churn, dnn = c("TechSupport", "Churn"))
## Churn
## TechSupport No Yes
## No 2026 1446
## No internet service 1407 113
## Yes 1730 310
churn_df %>%
ggplot(aes(TechSupport, fill = Churn)) +
geom_bar(color = "black") +
scale_fill_manual(values = c("#c0d6e4", "#e4cec0")) +
theme_minimal() +
labs(title = "The number of churn in TechSupports") +
theme(plot.title = element_text(hjust = 0.5))
churn_df %>%
ggplot(aes(TotalCharges)) +
geom_histogram(bins = 15, color = "black", fill ="#c0d6e4" ) +
theme_minimal() +
labs(title = "The number of TotalCharges") +
theme(plot.title = element_text(hjust = 0.5))
churn_df %>%
ggplot() +
geom_density(aes(TotalCharges, fill = Churn), alpha = 0.8) +
facet_wrap(~Contract) +
theme_minimal() +
scale_y_continuous(labels = function(x) x*1000) +
scale_fill_manual(values = c("#aecadc", "#f0cfc1")) +
labs(title = "The TotalCharges' ratio in each contract type",
y = "The TotalCharges ratio") +
theme(plot.title = element_text(hjust = 0.5))
churn_df %>%
ggplot(aes(Contract, TotalCharges, fill = Churn)) +
geom_boxplot() +
scale_fill_manual(values = c("#aecadc", "#f0cfc1")) +
theme_minimal() +
labs(title = "The TotalCharges' numbers in each contract type",
y = "The number of TotalCharges") +
theme(plot.title = element_text(hjust = 0.5))
churn_df <- churn_df %>%
mutate(across(where(is.character), as.factor))
colnames(churn_df)[2] <- "Gender"
colnames(churn_df)[6] <- "Tenure"
## Drop CustomerID column
churn_df$customerID = NULL
glimpse(churn_df)
## Rows: 7,032
## Columns: 20
## $ Gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ Tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
## baseline prediction no model accuracy = 73.42%
churn_df %>%
count(Churn) %>%
mutate(pct = n/sum(n)*100)
## Churn n pct
## 1 No 5163 73.4215
## 2 Yes 1869 26.5785
Random Sampling
Train Logistic Regression Model
set.seed(55)
id <- createDataPartition(y = churn_df$Churn,
p = 0.7,
list = FALSE)
train_df <- churn_df[id, ]
test_df <- churn_df[-id, ]
nrow(train_df)
## [1] 4924
nrow(test_df)
## [1] 2108
set.seed(55)
## set control method
control <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = prSummary,
verboseIter = TRUE
)
## train model
rpart_model <- train( Churn ~ .,
data = train_df,
method = "rpart",
metric = "AUC",
trControl = control
)
## + Fold1.Rep1: cp=0.003247
## - Fold1.Rep1: cp=0.003247
## + Fold2.Rep1: cp=0.003247
## - Fold2.Rep1: cp=0.003247
## + Fold3.Rep1: cp=0.003247
## - Fold3.Rep1: cp=0.003247
## + Fold4.Rep1: cp=0.003247
## - Fold4.Rep1: cp=0.003247
## + Fold5.Rep1: cp=0.003247
## - Fold5.Rep1: cp=0.003247
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.00325 on full training set
p_rpart <- predict(rpart_model, newdata = test_df)
mean(p_rpart == test_df$Churn)
## [1] 0.7903226
set.seed(55)
## set control method
control <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = prSummary,
verboseIter = TRUE
)
## train model
rf_model <- train( Churn ~ .,
data = train_df,
method = "rf",
metric = "AUC",
trControl = control
)
## + Fold1.Rep1: mtry= 2
## - Fold1.Rep1: mtry= 2
## + Fold1.Rep1: mtry=16
## - Fold1.Rep1: mtry=16
## + Fold1.Rep1: mtry=30
## - Fold1.Rep1: mtry=30
## + Fold2.Rep1: mtry= 2
## - Fold2.Rep1: mtry= 2
## + Fold2.Rep1: mtry=16
## - Fold2.Rep1: mtry=16
## + Fold2.Rep1: mtry=30
## - Fold2.Rep1: mtry=30
## + Fold3.Rep1: mtry= 2
## - Fold3.Rep1: mtry= 2
## + Fold3.Rep1: mtry=16
## - Fold3.Rep1: mtry=16
## + Fold3.Rep1: mtry=30
## - Fold3.Rep1: mtry=30
## + Fold4.Rep1: mtry= 2
## - Fold4.Rep1: mtry= 2
## + Fold4.Rep1: mtry=16
## - Fold4.Rep1: mtry=16
## + Fold4.Rep1: mtry=30
## - Fold4.Rep1: mtry=30
## + Fold5.Rep1: mtry= 2
## - Fold5.Rep1: mtry= 2
## + Fold5.Rep1: mtry=16
## - Fold5.Rep1: mtry=16
## + Fold5.Rep1: mtry=30
## - Fold5.Rep1: mtry=30
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 16 on full training set
## test model
p_rf <- predict(rf_model, newdata = test_df)
mean(p_rf == test_df$Churn)
## [1] 0.7941176
set.seed(55)
## set control method
control <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = prSummary,
verboseIter = TRUE
)
## train model
gbm_model <- train( Churn ~ .,
data = train_df,
method = "gbm",
metric = "AUC",
trControl = control
)
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1356 nan 0.1000 0.0098
## 2 1.1189 nan 0.1000 0.0071
## 3 1.1022 nan 0.1000 0.0091
## 4 1.0894 nan 0.1000 0.0063
## 5 1.0733 nan 0.1000 0.0075
## 6 1.0630 nan 0.1000 0.0051
## 7 1.0496 nan 0.1000 0.0059
## 8 1.0386 nan 0.1000 0.0051
## 9 1.0294 nan 0.1000 0.0039
## 10 1.0191 nan 0.1000 0.0039
## 20 0.9523 nan 0.1000 0.0019
## 40 0.8904 nan 0.1000 0.0006
## 60 0.8604 nan 0.1000 0.0004
## 80 0.8453 nan 0.1000 -0.0002
## 100 0.8345 nan 0.1000 -0.0001
## 120 0.8279 nan 0.1000 0.0001
## 140 0.8240 nan 0.1000 -0.0001
## 150 0.8219 nan 0.1000 -0.0001
##
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1224 nan 0.1000 0.0167
## 2 1.0964 nan 0.1000 0.0131
## 3 1.0687 nan 0.1000 0.0129
## 4 1.0479 nan 0.1000 0.0095
## 5 1.0302 nan 0.1000 0.0083
## 6 1.0126 nan 0.1000 0.0080
## 7 0.9982 nan 0.1000 0.0067
## 8 0.9871 nan 0.1000 0.0050
## 9 0.9769 nan 0.1000 0.0043
## 10 0.9684 nan 0.1000 0.0036
## 20 0.8960 nan 0.1000 0.0032
## 40 0.8427 nan 0.1000 0.0003
## 60 0.8241 nan 0.1000 0.0000
## 80 0.8122 nan 0.1000 -0.0000
## 100 0.8061 nan 0.1000 -0.0001
## 120 0.8005 nan 0.1000 -0.0002
## 140 0.7951 nan 0.1000 -0.0005
## 150 0.7931 nan 0.1000 -0.0006
##
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1209 nan 0.1000 0.0202
## 2 1.0857 nan 0.1000 0.0179
## 3 1.0545 nan 0.1000 0.0147
## 4 1.0318 nan 0.1000 0.0103
## 5 1.0102 nan 0.1000 0.0100
## 6 0.9922 nan 0.1000 0.0081
## 7 0.9792 nan 0.1000 0.0064
## 8 0.9670 nan 0.1000 0.0059
## 9 0.9557 nan 0.1000 0.0045
## 10 0.9438 nan 0.1000 0.0056
## 20 0.8757 nan 0.1000 0.0027
## 40 0.8271 nan 0.1000 0.0005
## 60 0.8082 nan 0.1000 -0.0002
## 80 0.7959 nan 0.1000 -0.0003
## 100 0.7875 nan 0.1000 -0.0001
## 120 0.7804 nan 0.1000 -0.0003
## 140 0.7726 nan 0.1000 -0.0002
## 150 0.7690 nan 0.1000 -0.0003
##
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1385 nan 0.1000 0.0097
## 2 1.1202 nan 0.1000 0.0088
## 3 1.1047 nan 0.1000 0.0076
## 4 1.0901 nan 0.1000 0.0072
## 5 1.0757 nan 0.1000 0.0063
## 6 1.0621 nan 0.1000 0.0069
## 7 1.0502 nan 0.1000 0.0053
## 8 1.0376 nan 0.1000 0.0054
## 9 1.0271 nan 0.1000 0.0051
## 10 1.0183 nan 0.1000 0.0043
## 20 0.9515 nan 0.1000 0.0032
## 40 0.8890 nan 0.1000 0.0006
## 60 0.8598 nan 0.1000 0.0001
## 80 0.8433 nan 0.1000 0.0003
## 100 0.8340 nan 0.1000 0.0000
## 120 0.8265 nan 0.1000 0.0003
## 140 0.8221 nan 0.1000 -0.0001
## 150 0.8205 nan 0.1000 -0.0002
##
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1305 nan 0.1000 0.0135
## 2 1.1054 nan 0.1000 0.0125
## 3 1.0805 nan 0.1000 0.0117
## 4 1.0595 nan 0.1000 0.0092
## 5 1.0350 nan 0.1000 0.0113
## 6 1.0169 nan 0.1000 0.0086
## 7 1.0043 nan 0.1000 0.0069
## 8 0.9896 nan 0.1000 0.0068
## 9 0.9795 nan 0.1000 0.0044
## 10 0.9645 nan 0.1000 0.0070
## 20 0.8959 nan 0.1000 0.0024
## 40 0.8444 nan 0.1000 0.0001
## 60 0.8255 nan 0.1000 -0.0004
## 80 0.8150 nan 0.1000 -0.0001
## 100 0.8068 nan 0.1000 -0.0002
## 120 0.8022 nan 0.1000 -0.0002
## 140 0.7984 nan 0.1000 -0.0002
## 150 0.7948 nan 0.1000 -0.0002
##
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1168 nan 0.1000 0.0200
## 2 1.0846 nan 0.1000 0.0162
## 3 1.0545 nan 0.1000 0.0135
## 4 1.0272 nan 0.1000 0.0135
## 5 1.0085 nan 0.1000 0.0092
## 6 0.9876 nan 0.1000 0.0097
## 7 0.9715 nan 0.1000 0.0078
## 8 0.9587 nan 0.1000 0.0056
## 9 0.9469 nan 0.1000 0.0057
## 10 0.9372 nan 0.1000 0.0045
## 20 0.8761 nan 0.1000 0.0005
## 40 0.8283 nan 0.1000 -0.0000
## 60 0.8093 nan 0.1000 -0.0005
## 80 0.7974 nan 0.1000 -0.0002
## 100 0.7875 nan 0.1000 -0.0006
## 120 0.7794 nan 0.1000 -0.0003
## 140 0.7726 nan 0.1000 -0.0004
## 150 0.7688 nan 0.1000 -0.0003
##
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1360 nan 0.1000 0.0114
## 2 1.1187 nan 0.1000 0.0082
## 3 1.0996 nan 0.1000 0.0087
## 4 1.0831 nan 0.1000 0.0069
## 5 1.0681 nan 0.1000 0.0064
## 6 1.0552 nan 0.1000 0.0065
## 7 1.0419 nan 0.1000 0.0059
## 8 1.0298 nan 0.1000 0.0061
## 9 1.0199 nan 0.1000 0.0052
## 10 1.0103 nan 0.1000 0.0044
## 20 0.9430 nan 0.1000 0.0024
## 40 0.8799 nan 0.1000 0.0008
## 60 0.8477 nan 0.1000 0.0002
## 80 0.8312 nan 0.1000 0.0000
## 100 0.8202 nan 0.1000 -0.0000
## 120 0.8138 nan 0.1000 -0.0002
## 140 0.8089 nan 0.1000 -0.0001
## 150 0.8073 nan 0.1000 -0.0001
##
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1238 nan 0.1000 0.0177
## 2 1.1007 nan 0.1000 0.0111
## 3 1.0736 nan 0.1000 0.0136
## 4 1.0535 nan 0.1000 0.0100
## 5 1.0364 nan 0.1000 0.0089
## 6 1.0195 nan 0.1000 0.0079
## 7 0.9989 nan 0.1000 0.0099
## 8 0.9846 nan 0.1000 0.0067
## 9 0.9700 nan 0.1000 0.0066
## 10 0.9602 nan 0.1000 0.0044
## 20 0.8870 nan 0.1000 0.0020
## 40 0.8320 nan 0.1000 0.0005
## 60 0.8125 nan 0.1000 -0.0000
## 80 0.8012 nan 0.1000 -0.0001
## 100 0.7928 nan 0.1000 -0.0001
## 120 0.7864 nan 0.1000 -0.0001
## 140 0.7817 nan 0.1000 -0.0001
## 150 0.7778 nan 0.1000 -0.0001
##
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1194 nan 0.1000 0.0193
## 2 1.0847 nan 0.1000 0.0168
## 3 1.0551 nan 0.1000 0.0139
## 4 1.0314 nan 0.1000 0.0112
## 5 1.0111 nan 0.1000 0.0098
## 6 0.9943 nan 0.1000 0.0073
## 7 0.9755 nan 0.1000 0.0089
## 8 0.9612 nan 0.1000 0.0072
## 9 0.9495 nan 0.1000 0.0048
## 10 0.9343 nan 0.1000 0.0074
## 20 0.8661 nan 0.1000 0.0019
## 40 0.8142 nan 0.1000 0.0000
## 60 0.7952 nan 0.1000 -0.0001
## 80 0.7843 nan 0.1000 -0.0003
## 100 0.7766 nan 0.1000 -0.0006
## 120 0.7679 nan 0.1000 -0.0005
## 140 0.7610 nan 0.1000 -0.0003
## 150 0.7566 nan 0.1000 0.0000
##
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1371 nan 0.1000 0.0099
## 2 1.1200 nan 0.1000 0.0083
## 3 1.1035 nan 0.1000 0.0076
## 4 1.0898 nan 0.1000 0.0065
## 5 1.0762 nan 0.1000 0.0059
## 6 1.0631 nan 0.1000 0.0063
## 7 1.0527 nan 0.1000 0.0049
## 8 1.0405 nan 0.1000 0.0056
## 9 1.0324 nan 0.1000 0.0035
## 10 1.0227 nan 0.1000 0.0050
## 20 0.9602 nan 0.1000 0.0033
## 40 0.8966 nan 0.1000 0.0010
## 60 0.8696 nan 0.1000 0.0006
## 80 0.8527 nan 0.1000 -0.0003
## 100 0.8431 nan 0.1000 -0.0001
## 120 0.8359 nan 0.1000 -0.0001
## 140 0.8310 nan 0.1000 -0.0000
## 150 0.8298 nan 0.1000 -0.0001
##
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1249 nan 0.1000 0.0160
## 2 1.1006 nan 0.1000 0.0125
## 3 1.0733 nan 0.1000 0.0126
## 4 1.0575 nan 0.1000 0.0072
## 5 1.0403 nan 0.1000 0.0078
## 6 1.0237 nan 0.1000 0.0074
## 7 1.0063 nan 0.1000 0.0081
## 8 0.9932 nan 0.1000 0.0059
## 9 0.9790 nan 0.1000 0.0068
## 10 0.9662 nan 0.1000 0.0054
## 20 0.8981 nan 0.1000 0.0027
## 40 0.8531 nan 0.1000 0.0005
## 60 0.8347 nan 0.1000 0.0000
## 80 0.8238 nan 0.1000 -0.0001
## 100 0.8172 nan 0.1000 -0.0003
## 120 0.8111 nan 0.1000 -0.0003
## 140 0.8063 nan 0.1000 -0.0002
## 150 0.8045 nan 0.1000 -0.0002
##
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1216 nan 0.1000 0.0181
## 2 1.0923 nan 0.1000 0.0147
## 3 1.0666 nan 0.1000 0.0125
## 4 1.0467 nan 0.1000 0.0093
## 5 1.0261 nan 0.1000 0.0093
## 6 1.0067 nan 0.1000 0.0084
## 7 0.9898 nan 0.1000 0.0091
## 8 0.9751 nan 0.1000 0.0064
## 9 0.9633 nan 0.1000 0.0053
## 10 0.9502 nan 0.1000 0.0063
## 20 0.8880 nan 0.1000 0.0007
## 40 0.8399 nan 0.1000 -0.0000
## 60 0.8208 nan 0.1000 -0.0004
## 80 0.8085 nan 0.1000 0.0000
## 100 0.8002 nan 0.1000 -0.0003
## 120 0.7914 nan 0.1000 -0.0004
## 140 0.7845 nan 0.1000 -0.0002
## 150 0.7796 nan 0.1000 0.0000
##
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1358 nan 0.1000 0.0108
## 2 1.1192 nan 0.1000 0.0085
## 3 1.1034 nan 0.1000 0.0081
## 4 1.0870 nan 0.1000 0.0086
## 5 1.0727 nan 0.1000 0.0069
## 6 1.0584 nan 0.1000 0.0068
## 7 1.0470 nan 0.1000 0.0051
## 8 1.0352 nan 0.1000 0.0051
## 9 1.0234 nan 0.1000 0.0051
## 10 1.0154 nan 0.1000 0.0037
## 20 0.9461 nan 0.1000 0.0025
## 40 0.8831 nan 0.1000 0.0012
## 60 0.8529 nan 0.1000 0.0003
## 80 0.8363 nan 0.1000 0.0006
## 100 0.8253 nan 0.1000 -0.0001
## 120 0.8192 nan 0.1000 -0.0000
## 140 0.8143 nan 0.1000 -0.0002
## 150 0.8130 nan 0.1000 -0.0000
##
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1220 nan 0.1000 0.0179
## 2 1.0967 nan 0.1000 0.0131
## 3 1.0741 nan 0.1000 0.0108
## 4 1.0561 nan 0.1000 0.0080
## 5 1.0378 nan 0.1000 0.0089
## 6 1.0185 nan 0.1000 0.0086
## 7 1.0044 nan 0.1000 0.0063
## 8 0.9857 nan 0.1000 0.0090
## 9 0.9701 nan 0.1000 0.0075
## 10 0.9623 nan 0.1000 0.0031
## 20 0.8921 nan 0.1000 0.0023
## 40 0.8381 nan 0.1000 0.0003
## 60 0.8205 nan 0.1000 0.0000
## 80 0.8102 nan 0.1000 -0.0005
## 100 0.8028 nan 0.1000 -0.0008
## 120 0.7971 nan 0.1000 -0.0001
## 140 0.7925 nan 0.1000 -0.0003
## 150 0.7898 nan 0.1000 -0.0001
##
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1153 nan 0.1000 0.0201
## 2 1.0828 nan 0.1000 0.0153
## 3 1.0579 nan 0.1000 0.0127
## 4 1.0343 nan 0.1000 0.0120
## 5 1.0158 nan 0.1000 0.0086
## 6 1.0003 nan 0.1000 0.0072
## 7 0.9861 nan 0.1000 0.0062
## 8 0.9709 nan 0.1000 0.0071
## 9 0.9569 nan 0.1000 0.0059
## 10 0.9452 nan 0.1000 0.0054
## 20 0.8707 nan 0.1000 0.0015
## 40 0.8248 nan 0.1000 -0.0001
## 60 0.8070 nan 0.1000 -0.0002
## 80 0.7944 nan 0.1000 -0.0005
## 100 0.7861 nan 0.1000 -0.0007
## 120 0.7771 nan 0.1000 -0.0003
## 140 0.7708 nan 0.1000 -0.0001
## 150 0.7678 nan 0.1000 -0.0004
##
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## Aggregating results
## Selecting tuning parameters
## Fitting n.trees = 150, interaction.depth = 2, shrinkage = 0.1, n.minobsinnode = 10 on full training set
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1222 nan 0.1000 0.0167
## 2 1.1000 nan 0.1000 0.0092
## 3 1.0755 nan 0.1000 0.0111
## 4 1.0501 nan 0.1000 0.0122
## 5 1.0315 nan 0.1000 0.0088
## 6 1.0137 nan 0.1000 0.0095
## 7 0.9966 nan 0.1000 0.0081
## 8 0.9826 nan 0.1000 0.0069
## 9 0.9704 nan 0.1000 0.0061
## 10 0.9630 nan 0.1000 0.0031
## 20 0.8951 nan 0.1000 0.0027
## 40 0.8455 nan 0.1000 0.0007
## 60 0.8282 nan 0.1000 -0.0001
## 80 0.8176 nan 0.1000 0.0001
## 100 0.8105 nan 0.1000 -0.0001
## 120 0.8059 nan 0.1000 -0.0001
## 140 0.8006 nan 0.1000 -0.0002
## 150 0.7983 nan 0.1000 -0.0004
## test model
p_gbm <- predict(gbm_model, newdata = test_df)
mean(p_gbm == test_df$Churn)
## [1] 0.806926
cat("\n gbm Model\n")
##
## gbm Model
gbm_model
## Stochastic Gradient Boosting
##
## 4924 samples
## 19 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3939, 3939, 3939, 3940, 3939
## Resampling results across tuning parameters:
##
## interaction.depth n.trees AUC Precision Recall F
## 1 50 0.8922882 0.8060274 0.9405256 0.8680355
## 1 100 0.9253426 0.8218884 0.9233748 0.8696022
## 1 150 0.9312815 0.8329592 0.9109267 0.8701193
## 2 50 0.9187952 0.8259954 0.9181189 0.8695827
## 2 100 0.9296088 0.8336686 0.9087137 0.8695230
## 2 150 0.9339370 0.8357417 0.9056708 0.8692439
## 3 50 0.9292849 0.8341968 0.9078838 0.8694013
## 3 100 0.9315188 0.8381482 0.9031812 0.8693919
## 3 150 0.9332234 0.8412906 0.9026279 0.8708219
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## AUC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
Recursive Partitioning and Regression Trees Model 79.03% (rpart)
Random Forest Model 79.41% (rf)
Generalized Boosted Regression Model 80.69% (gbm)
# gbm
c_gbm <- confusionMatrix(p_gbm , test_df$Churn,
mode = "prec_recall", positive = "Yes")
c_gbm$table
## Reference
## Prediction No Yes
## No 1405 264
## Yes 143 296
Predict <- factor(c(0, 0, 1, 1))
Actual <- factor(c(0, 1, 0, 1))
y <- c(c_gbm$table[1,1], c_gbm$table[1,2],c_gbm$table[2,1], c_gbm$table[2,2] )
df <- data.frame(Predict, Actual, y)
ggplot(df, aes(Predict, Actual, fill = y)) +
geom_tile() +
scale_fill_gradient(low = "#f0f8ff", high = "#6497b1") +
geom_text(aes(label = sprintf("%1.0f", y))) +
theme_minimal()+
labs(title = "Confusion Matrix",
subtitle = "0 = No, 1 = Yes") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
In case of no model, the baseline’s accuracy is predicted at 73.42%.
In the optimal model, the Generalized Boosted Regression Model(gbm) got 80.69%, precision 67.43%, recall 52.86%, F1 59.26% when splitting train 70%, test 30%.
Source: The original data be on Kaggle