Customer churn or customer attrition is the phenomenon where customers of a business no longer purchase or interact with the business. A high churn means that higher number of customers no longer want to purchase goods and services from the business.
Customer churn is a one of the major problem in the telecommunication industry. Through customer churn analysis, the telco company can understand the behavior of their customer and help them to create strategies to retain their existing customer and attract new customer.
Throughout this article, i try to gain some insight from the data relevant to telco customer profile and services, and try to develop customer churn prediction model using Logistic Regression and K-Nearest Neighbour.
Dataset in this article was obtained from Kaggle. The data set includes information about:
The following are a brief explanations of each variable in this dataset.
| Variable | Description |
|---|---|
| CustomerID | ID Customer |
| Gender | Whether the customer is a male or a female |
| SeniorCitizen | Whether the customer is a senior citizen or not (1, 0) |
| Partner | Whether the customer has a partner or not (Yes, No) |
| Dependents | Whether the customer has dependents or not (Yes, No) |
| Tenure | Number of months the customer has stayed with the company |
| PhoneService | Whether the customer has a phone service or not (Yes, No) |
| MultipleLines | Whether the customer has multiple lines or not (Yes, No, No phone service) |
| InternetService | Customer’s internet service provider (DSL, Fiber optic, No) |
| OnlineSecurity | Whether the customer has online security or not (Yes, No, No internet service) |
| OnlineBackup | Whether the customer has online backup or not (Yes, No, No internet service) |
| DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) |
| TechSupport | Whether the customer has tech support or not (Yes, No, No internet service) |
| StreamingTV | Whether the customer has streaming TV or not (Yes, No, No internet service) |
| StreamingMovies | Whether the customer has streaming movies or not (Yes, No, No internet service) |
| Contract | The contract term of the customer (Month-to-month, One year, Two year) |
| PaperlessBilling | Whether the customer has paperless billing or not (Yes, No) |
| PaymentMethod | The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)) |
| MonthlyCharges | The amount charged to the customer monthly |
| TotalCharges | The total amount charged to the customer |
| Churn | Whether the customer churned or not (Yes or No) |
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(GGally)
library(gtools)
library(caret)
library(car)
library(scales)
library(lmtest)
library(ggplot2)
library(plotly)
library(ggthemes)
library(MLmetrics)
library(performance)
library(ggpubr)Load data csv
cust_churn_raw <- read.csv("data-input/WA_Fn-UseC_-Telco-Customer-Churn.csv")check dataset structur
str(cust_churn_raw)## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
Here are the summary of dataset:
summary(cust_churn_raw %>% mutate_if(is.character, as.factor)%>% select(-customerID)) ## gender SeniorCitizen Partner Dependents tenure
## Female:3488 Min. :0.0000 No :3641 No :4933 Min. : 0.00
## Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110 1st Qu.: 9.00
## Median :0.0000 Median :29.00
## Mean :0.1621 Mean :32.37
## 3rd Qu.:0.0000 3rd Qu.:55.00
## Max. :1.0000 Max. :72.00
##
## PhoneService MultipleLines InternetService
## No : 682 No :3390 DSL :2421
## Yes:6361 No phone service: 682 Fiber optic:3096
## Yes :2971 No :1526
##
##
##
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1544 Min. : 18.25
## Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
## Electronic check :2365 Median : 70.35
## Mailed check :1612 Mean : 64.76
## 3rd Qu.: 89.85
## Max. :118.75
##
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
Checking if there is a missing value:
colSums(is.na(cust_churn_raw))## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
There are eleven missing values in TotalCharges, Let us check their profiles:
cust_churn_raw %>%
select(-customerID) %>%
filter(is.na(TotalCharges))Those who has no information in TotalCharge, are: haveDependents, have no tenure and have Contracts for Two Years. Probably there are missing information in these profile or they are just sign up as new customer.
Because the number of missing value relatively small to number of observation, i decide to remove those.
cust_churn <- cust_churn_raw %>%
na.omit() %>%
select(-c(customerID))
dim(cust_churn)## [1] 7032 20
Our dataset has 7032 rows and 20 columns.
Here’s top six row data
head(cust_churn)Let’s analyze the relation among the features to our target variable (Churn)
Churn against categorical featuresHere are the categorical features in the dataset
cust_churn %>% select_if(is.character) %>% names(.)## [1] "gender" "Partner" "Dependents" "PhoneService"
## [5] "MultipleLines" "InternetService" "OnlineSecurity" "OnlineBackup"
## [9] "DeviceProtection" "TechSupport" "StreamingTV" "StreamingMovies"
## [13] "Contract" "PaperlessBilling" "PaymentMethod" "Churn"
There are 15 categorical variables. I will break the analysis into three segments.
cprofile_p1 <- cust_churn %>%
select(Churn, gender) %>% count(Churn, gender) %>%
ggplot(aes(gender, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
cprofile_p2 <- cust_churn %>%
select(Churn, Partner) %>% count(Churn, Partner) %>%
ggplot(aes(Partner, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
cprofile_p3 <- cust_churn %>%
select(Churn, Dependents) %>% count(Churn, Dependents) %>%
ggplot(aes(Dependents, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
cprofile_p4 <- cust_churn %>%
mutate(SeniorCitizen = ifelse(SeniorCitizen == "1", "Yes", "No")) %>%
select(Churn, SeniorCitizen) %>% count(Churn, SeniorCitizen) %>%
ggplot(aes(SeniorCitizen, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
subplot1 <- ggarrange(cprofile_p1, cprofile_p2, cprofile_p3, cprofile_p4, ncol = 2, nrow = 2,
common.legend = TRUE,
legend = "bottom")
subplot1Insight:
Customer who has Dependents has lower Churn Rate (17.44%) than those who do not have.
Customer who has Phone Service has higher Churn Rate (82.56%) than those who do not have.
SeniorCitizen Customer has lower Churn Rate (25.47%) than those who are not Senior Citizen.
The variance in gender does not seems to have a significant influence to customer churn rate. Both male and female have almost same Churn Rate.
services_p8 <- cust_churn %>%
select(Churn, InternetService) %>% count(Churn, InternetService) %>%
ggplot(aes(InternetService, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
services_p9 <- cust_churn %>%
select(Churn, PhoneService) %>% count(Churn, PhoneService) %>%
ggplot(aes(PhoneService, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
subplot4 <- ggarrange(services_p8, services_p9,
ncol = 2, nrow = 1, common.legend = TRUE, legend = "bottom")
subplot4Insight:
Customer who subscribe to InternetSevice using Fiber Optic has higher churn rate last month compare to those who use DSL or even no using intenet service.
The variance in PhoneService have little influence to customer churn rate. Customer who used PhoneService has higher churn rate than those who not.
services_p1 <- cust_churn %>%
select(Churn, MultipleLines) %>% count(Churn, MultipleLines) %>%
ggplot(aes(MultipleLines, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
services_p2 <- cust_churn %>%
select(Churn, OnlineSecurity) %>% count(Churn, OnlineSecurity) %>%
ggplot(aes(OnlineSecurity, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
services_p3 <- cust_churn %>%
select(Churn, OnlineBackup) %>% count(Churn, OnlineBackup) %>%
ggplot(aes(OnlineBackup, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
services_p4 <- cust_churn %>%
select(Churn, DeviceProtection) %>% count(Churn, DeviceProtection) %>%
ggplot(aes(DeviceProtection, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
services_p5 <- cust_churn %>%
select(Churn, TechSupport) %>% count(Churn, TechSupport) %>%
ggplot(aes(TechSupport, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
services_p6 <- cust_churn %>%
select(Churn, StreamingTV) %>% count(Churn, StreamingTV) %>%
ggplot(aes(StreamingTV, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
services_p7 <- cust_churn %>%
select(Churn, StreamingMovies) %>% count(Churn, StreamingMovies) %>%
ggplot(aes(StreamingMovies, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
subplot2 <- ggarrange(services_p1, services_p2, services_p3, services_p4,
services_p5, services_p6, services_p7,
#labels = c("MultipleLines", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"),
ncol = 2, nrow = 4, common.legend = TRUE, legend = "bottom")
subplot2cust_churn %>% select(Churn,MultipleLines) %>%
#filter(Churn == "Yes") %>%
group_by(MultipleLines, Churn) %>%
summarise(n= n()) %>%
mutate(Percentage = round((n/sum(n))*100,2))cust_churn %>% select(Churn,DeviceProtection) %>%
#filter(Churn == "Yes") %>%
group_by(DeviceProtection, Churn) %>%
summarise(n= n()) %>%
mutate(Percentage = round((n/sum(n))*100,2))cust_churn %>% select(Churn,OnlineBackup) %>%
#filter(Churn == "Yes") %>%
group_by(OnlineBackup, Churn) %>%
summarise(n= n()) %>%
mutate(Percentage = round((n/sum(n))*100,2))cust_churn %>% select(Churn,OnlineSecurity) %>%
#filter(Churn == "Yes") %>%
group_by(OnlineSecurity, Churn) %>%
summarise(n= n()) %>%
mutate(Percentage = round((n/sum(n))*100,2))cust_churn %>% select(Churn,TechSupport) %>%
#filter(Churn == "Yes") %>%
group_by(TechSupport, Churn) %>%
summarise(n= n()) %>%
mutate(Percentage = round((n/sum(n))*100,2))DeviceProtection, OnlineBackup, OnlineSecurity and TechSupport service last month have lower Churn rate compared to the customers who don’t. The Average number of customer who subscribe one of those services and not leave is more than 77 %.#
subsinfo_p1 <- cust_churn %>%
select(Churn, Contract) %>% count(Churn, Contract) %>%
ggplot(aes(Contract, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
subsinfo_p2 <- cust_churn %>%
select(Churn, PaymentMethod) %>% count(Churn, PaymentMethod) %>%
ggplot(aes(PaymentMethod, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
subsinfo_p3 <- cust_churn %>%
select(Churn, PaperlessBilling) %>% count(Churn, PaperlessBilling) %>%
ggplot(aes(PaperlessBilling, n, fill = Churn)) +
geom_bar(stat="identity", position = "fill") +
coord_flip() +
geom_text(aes(label=n),position = position_fill(vjust = 0.5))
subplot3 <- ggarrange(subsinfo_p1, subsinfo_p2, subsinfo_p3,
ncol = 2, nrow = 2,
common.legend = TRUE, legend = "bottom")
subplot3Insight:
Churn against numerical featuresHere are the numerical features in the dataset
cust_churn %>% select_if(is.numeric) %>% names(.)## [1] "SeniorCitizen" "tenure" "MonthlyCharges" "TotalCharges"
These are the distribution of numeric data:
cust_churn %>% select_if(is.numeric) %>% summary(.)## SeniorCitizen tenure MonthlyCharges TotalCharges
## Min. :0.0000 Min. : 1.00 Min. : 18.25 Min. : 18.8
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 35.59 1st Qu.: 401.4
## Median :0.0000 Median :29.00 Median : 70.35 Median :1397.5
## Mean :0.1624 Mean :32.42 Mean : 64.80 Mean :2283.3
## 3rd Qu.:0.0000 3rd Qu.:55.00 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :1.0000 Max. :72.00 Max. :118.75 Max. :8684.8
bp1 <- cust_churn %>%
select(tenure) %>%
ggplot(aes(tenure)) + geom_boxplot()
bp2 <- cust_churn %>%
select(MonthlyCharges) %>%
ggplot(aes(MonthlyCharges)) + geom_boxplot()
bp3 <- cust_churn %>%
select(TotalCharges) %>%
ggplot(aes(TotalCharges)) + geom_boxplot()
subplot4 <- ggarrange(bp1, bp2, bp3,
ncol = 1, nrow = 3,
common.legend = TRUE, legend = "bottom")
subplot4Look like there is no outliers for each of numeric feature.
agg_tenure_churn <- cust_churn %>%
filter(Churn == "Yes") %>%
select(tenure, Churn) %>%
group_by(tenure) %>%
summarize(n = n()) %>% mutate(ChurnRate = round((n/sum(n)),2)*100)
numplot1 <- agg_tenure_churn %>%
ggplot(aes(x=tenure, y=ChurnRate, col=tenure)) +
geom_point(alpha=0.5) +
geom_smooth(method = "lm", col = "red") +
labs(
x = "tenure (in months)", y = "Churn Rate (%)"
) + theme(legend.position = "none") +
annotate("text", label = paste("Corr = ", round(cor(agg_tenure_churn %>% select(-n))[2],3)),
x = 65, y = 20, size = 3, colour = "red")
agg_monthlycharges_churn <- cust_churn %>%
filter(Churn == "Yes") %>%
select(MonthlyCharges, Churn) %>%
group_by(MonthlyCharges) %>%
summarize(n = n()) %>% mutate(ChurnRate = round((n/sum(n)),2)*100)
numplot2 <- agg_monthlycharges_churn %>%
ggplot(aes(x=MonthlyCharges, y=ChurnRate, color=MonthlyCharges)) +
geom_point(alpha=0.5) +
geom_smooth(method = "lm", col = "red") +
labs(
x = "Monthly Charges", y = "Churn Rate (%)"
) + theme(legend.position = "none") +
annotate("text", label = paste("Corr = ", round(cor(agg_monthlycharges_churn %>% select(-n))[2],3)),
x = 100, y = 20, size = 3, colour = "red")
agg_totalcharges_churn <- cust_churn %>%
filter(Churn == "Yes") %>%
select(TotalCharges, Churn) %>%
group_by(TotalCharges) %>%
summarize(n = n()) %>% mutate(ChurnRate = round((n/sum(n)),2)*100)
numplot3 <- agg_totalcharges_churn %>%
ggplot(aes(x=TotalCharges, y=ChurnRate, color=TotalCharges)) +
geom_point(alpha=0.5) +
geom_smooth(method = "lm", col = "red") +
labs(
x = "Total Charges", y = "Churn Rate (%)"
) + theme(legend.position = "none") +
annotate("text", label = paste("Corr = ", round(cor(agg_totalcharges_churn %>% select(-n))[2],6)),
x = 7500, y = 20, size = 3, colour = "red")
subplot5 <- ggarrange(numplot1, numplot2, numplot3,
ncol = 1, nrow = 3)
subplot5Among these three variable, only tenure that has correlation with Churn while MonthlyCharges and TotalCharge almost have no correlation with Churn.
From the EDA, we know that several values in categorical variable have almost same information. For example in StreamingTV and OnlineBackup, there are “No” and “No internet service” values.
I simplified those value by set them as “No”. But before that, i separate the data between between categorical features, numerical features, and target for data wrangling
# separate target, numeric variable, and categorical variable
cust_churn_num <- cust_churn %>% select_if(is.numeric) # 4 columns
cust_churn_char <- cust_churn %>%
select(-Churn) %>%
select_if(is.character) # 15 columns
Churn <- as.factor(cust_churn$Churn) I use gsub() to replace “No phone service” values to “No”. As you can see below, the value in `MultipleLines" now only have two unique values.
#replace "No phone Service" to "No"
cust_churn_char <- lapply(cust_churn_char %>% select_if(is.character),
function(x) {
gsub("No phone service", "No", x)
}) %>% as_tibble()
unique(cust_churn_char$MultipleLines)## [1] "No" "Yes"
The same treatment applied to replace “No internet service” values to “No”.
#replace "No internet service" to "No"
cust_churn_char <- lapply(cust_churn_char %>% select_if(is.character),
function(x) {
gsub("No internet service", "No", x)
}) %>% as_tibble()
unique(cust_churn_char$StreamingTV)## [1] "No" "Yes"
unique(cust_churn_char$OnlineSecurity)## [1] "No" "Yes"
unique(cust_churn_char$StreamingMovies)## [1] "No" "Yes"
Based on insight that we found in EDA section, there are several features that has poor correlation with customer churn such as gender, PhoneService, MonthlyCharges, TotalCharges. So i will remove them from the predictor.
cust_churn_char <- cust_churn_char %>%
select(-c(gender, PhoneService))
cust_churn_num <- cust_churn_num %>%
select(-c(MonthlyCharges, TotalCharges))Because I will use Decision Tree for fitting model which require predictor in numerical, i need to transform the data into dummy variables to ensure all the data in numeric format.
cust_churn_char <- data.frame(sapply(cust_churn_char,function(x) data.frame(model.matrix(~x-1,data =cust_churn_char))[,-1]))
head(cust_churn_char)Here is the final form of data for creating model.
# concat cust_churn_char + cust_churn_num
cust_churn_tidy <- cbind(cust_churn_char, cust_churn_num, Churn)
str(cust_churn_tidy)## 'data.frame': 7032 obs. of 20 variables:
## $ Partner : num 1 0 0 0 0 0 0 0 1 0 ...
## $ Dependents : num 0 0 0 0 0 0 1 0 0 1 ...
## $ MultipleLines : num 0 0 0 0 0 1 1 0 1 0 ...
## $ InternetService.xFiber.optic : num 0 0 0 0 1 1 1 0 1 0 ...
## $ InternetService.xNo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OnlineSecurity : num 0 1 1 1 0 0 0 1 0 1 ...
## $ OnlineBackup : num 1 0 1 0 0 0 1 0 0 1 ...
## $ DeviceProtection : num 0 1 0 1 0 1 0 0 1 0 ...
## $ TechSupport : num 0 0 0 1 0 0 0 0 1 0 ...
## $ StreamingTV : num 0 0 0 0 0 1 1 0 1 0 ...
## $ StreamingMovies : num 0 0 0 0 0 1 0 0 1 0 ...
## $ Contract.xOne.year : num 0 1 0 1 0 0 0 0 0 1 ...
## $ Contract.xTwo.year : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PaperlessBilling : num 1 0 1 0 1 1 1 0 1 0 ...
## $ PaymentMethod.xCredit.card..automatic.: num 0 0 0 0 0 0 1 0 0 0 ...
## $ PaymentMethod.xElectronic.check : num 1 0 0 0 1 1 0 0 1 0 ...
## $ PaymentMethod.xMailed.check : num 0 1 1 0 0 0 0 1 0 0 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
In this section i split 80 % of dataset as data train and the rest as data test.
RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(cust_churn_tidy)
# sampel dan ambil 80% data secara acak
index <- sample(row_data, row_data*0.8)
data_train <- cust_churn_tidy[ index, ]
data_test <- cust_churn_tidy[ -index, ] Before creating model using data train, first we need to check whether imbalance class exist in target variable. Class imbalance can affect our model ability to classify the output.
data_train %>%
group_by(Churn) %>%
summarise(total = n()) %>%
mutate(percentage = round((total/sum(total))*100,2))The target class are not represent equally. Although the ratio is not is not too high, i will treat them as imbalance class case.
In this case i use upscale function from package caret to oversampling data-train class.
library(caret)
set.seed(123)
data_train_up <- upSample(x = data_train %>% select(-Churn),
y = data_train$Churn,
list = F,
yname = "Churn")
table(data_train_up$Churn)##
## No Yes
## 4153 4153
I create model using full predictor and using step-wise backward for feature selection
# model - only intercept
model_null <- glm(Churn~1,data_train_up, family = "binomial")
# model -full predictor
model_full <- glm(Churn~.,data_train_up, family = "binomial")Model Fitting
# model stepwise backward
model_bw <- step(object = model_full, direction = "backward", trace = F)
summary(model_bw)##
## Call:
## glm(formula = Churn ~ Dependents + MultipleLines + InternetService.xFiber.optic +
## InternetService.xNo + OnlineSecurity + OnlineBackup + TechSupport +
## StreamingTV + StreamingMovies + Contract.xOne.year + Contract.xTwo.year +
## PaperlessBilling + PaymentMethod.xElectronic.check + SeniorCitizen +
## tenure, family = "binomial", data = data_train_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.41191 -0.76322 0.08776 0.74987 2.79269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.406706 0.077410 5.254 1.49e-07 ***
## Dependents -0.231041 0.066788 -3.459 0.000542 ***
## MultipleLines 0.182901 0.066407 2.754 0.005883 **
## InternetService.xFiber.optic 0.812849 0.069659 11.669 < 2e-16 ***
## InternetService.xNo -0.990590 0.102480 -9.666 < 2e-16 ***
## OnlineSecurity -0.365417 0.069173 -5.283 1.27e-07 ***
## OnlineBackup -0.186077 0.064977 -2.864 0.004187 **
## TechSupport -0.174745 0.070546 -2.477 0.013248 *
## StreamingTV 0.310541 0.068982 4.502 6.74e-06 ***
## StreamingMovies 0.309845 0.068575 4.518 6.23e-06 ***
## Contract.xOne.year -0.670791 0.084761 -7.914 2.49e-15 ***
## Contract.xTwo.year -1.346374 0.127370 -10.571 < 2e-16 ***
## PaperlessBilling 0.314829 0.061736 5.100 3.40e-07 ***
## PaymentMethod.xElectronic.check 0.297004 0.059970 4.953 7.33e-07 ***
## SeniorCitizen 0.315110 0.073557 4.284 1.84e-05 ***
## tenure -0.032411 0.001879 -17.250 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11514.6 on 8305 degrees of freedom
## Residual deviance: 8055.6 on 8290 degrees of freedom
## AIC: 8087.6
##
## Number of Fisher Scoring iterations: 5
levels(data_test$Churn)## [1] "No" "Yes"
Fitting Model K-NN
#initialize k neighbour
k_neighbour <- sqrt(nrow(data_train_up)) %>% round()
# Model Fitting K-NN + prediction
model_knn <- class::knn(train = data_train_up %>% select(-Churn),
cl = data_train_up$Churn,
test = data_test %>% select(-Churn),
k = k_neighbour)Time to evaluate the models. we use Confusion matrix to evaluate the result between prediction and actual on data test.
Generally there are four metrics to measure performance (for binary classification) based on Confusion Matrix:
Accuracy : Measure how many of our data is correctly predicted. \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
Sensitivity : measures out of all positive outcome, how many are correctly predicted. \[Sensitivity = \frac{TP}{TP + FN}\]
Specificty: measure how many negative outcome is correctly predicted. \[Specificty = \frac{TN}{TN + FP}\]
Precision: measures how many of our positive prediction is correct. \[Precision = \frac{TN}{TN + FP}\]
First i check the proportion target class in data test.
data_test %>%
group_by(Churn) %>%
summarise(total = n()) %>%
mutate(percentage = round((total/sum(total))*100,2))71.78 % data test labeled as Not-Churn Customer while the rest labeled as Churn.
Check the level of target value and predict value in unseen data (data testing) using logistic regression model.
levels(data_test$Churn)## [1] "No" "Yes"
churn_modelbw <- predict(model_bw, data_test, type = "response")Prior to binary classification case, I will convert the probability output in prediction to factor using threshold 0.5. If the probability value more than 0.5 will classified as “Yes” while the rest classified as “No”.
pred_churn_modelbw <- ifelse(churn_modelbw > 0.5, "Yes", "No") %>%
as.factor()library(caret)
cm_model_lr <- confusionMatrix(pred_churn_modelbw,
data_test$Churn,
positive = "Yes")
cm_model_lr## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 732 76
## Yes 278 321
##
## Accuracy : 0.7484
## 95% CI : (0.7249, 0.7709)
## No Information Rate : 0.7178
## P-Value [Acc > NIR] : 0.005509
##
## Kappa : 0.462
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8086
## Specificity : 0.7248
## Pos Pred Value : 0.5359
## Neg Pred Value : 0.9059
## Prevalence : 0.2822
## Detection Rate : 0.2281
## Detection Prevalence : 0.4257
## Balanced Accuracy : 0.7667
##
## 'Positive' Class : Yes
##
Accuracy = 74.84 %
Around 74 % of testing data are correctly classified with this logistic model
Recall / Sensitivity = 80.86 %;
This model correctly identified 80.86 % of actual churned customer.
Specificity = 72.84 %
This model correctly identified 72.84 % of actual non-churned customer.
Precision = 53.59 %
Around 74.84 % of our positive prediction (Churn Customer) are correctly classified.
cm_model_knn <- confusionMatrix(model_knn, data_test$Churn, positive = "Yes")
cm_model_knn## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 683 80
## Yes 327 317
##
## Accuracy : 0.7107
## 95% CI : (0.6863, 0.7343)
## No Information Rate : 0.7178
## P-Value [Acc > NIR] : 0.7339
##
## Kappa : 0.3993
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7985
## Specificity : 0.6762
## Pos Pred Value : 0.4922
## Neg Pred Value : 0.8952
## Prevalence : 0.2822
## Detection Rate : 0.2253
## Detection Prevalence : 0.4577
## Balanced Accuracy : 0.7374
##
## 'Positive' Class : Yes
##
Accuracy = 71.07 %
Around 71 % of testing data are correctly classified with this KNN model
Recall / Sensitivity = 79.85 %
This model correctly identified 79.85 % of actual churned customer.
Specificity = 67.62 %
This KNN model correctly identified 67.62 % of actual non-churned customer.
Precision = 53.59 %
Around 49.22 % of our positive prediction (Churn Customer) are correctly classified.
library(tibble)
summary_eval_lr <- data_frame(Model = "Logistic Regression", Accuracy = round(cm_model_lr$overall[1]*100,2),
Recall = round(cm_model_lr$byClass[1]*100,2),
Specificity = round(cm_model_lr$byClass[2]*100,2),
Precision = round(cm_model_lr$byClass[3]*100,2),
Kappa = round(cm_model_lr$overall[2]*100,2))
summary_eval_knn <- data_frame(Model = "K-NN", Accuracy = round(cm_model_knn$overall[1]*100,2),
Recall = round(cm_model_knn$byClass[1]*100,2),
Specificity = round(cm_model_knn$byClass[2]*100,2),
Precision = round(cm_model_knn$byClass[3]*100,2),
Kappa = round(cm_model_knn$overall[2]*100,2))
summary_eval_model <- rbind(summary_eval_lr, summary_eval_knn)
summary_eval_modelBoth model has decent score of Accuracy and Recall. Based on the metrics table above, we can say that the Logistic Regression model outperform K-NN model.
As previously mention in evaluation section, we can conclude that Logistic Regression model outperform K-NN model.
From business owner perspective, it is important to watch Recall(sensitivity) as main evaluation model because it describes the ability of a model to correctly predict positive class (True Positive) among the total of real positive class.
In this case, our Logistic Regression model has Recall score 80.86 % which means the model will correctly identified 80.86 % customer who will actually churn and around 19.14 % potential churn customer incorrectly predicted as non-churn customer.