Introduction

Customer churn or customer attrition is the phenomenon where customers of a business no longer purchase or interact with the business. A high churn means that higher number of customers no longer want to purchase goods and services from the business.

Customer churn is a one of the major problem in the telecommunication industry. Through customer churn analysis, the telco company can understand the behavior of their customer and help them to create strategies to retain their existing customer and attract new customer.

Throughout this article, i try to gain some insight from the data relevant to telco customer profile and services, and try to develop customer churn prediction model using Logistic Regression and K-Nearest Neighbour.

Dataset in this article was obtained from Kaggle. The data set includes information about:

Customers who left within the last month – the column is called Churn
Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers – gender, age range, and if they have partners and dependents

Data Preparation

About the Data

The following are a brief explanations of each variable in this dataset.

Variable	Description
CustomerID	ID Customer
Gender	Whether the customer is a male or a female
SeniorCitizen	Whether the customer is a senior citizen or not (1, 0)
Partner	Whether the customer has a partner or not (Yes, No)
Dependents	Whether the customer has dependents or not (Yes, No)
Tenure	Number of months the customer has stayed with the company
PhoneService	Whether the customer has a phone service or not (Yes, No)
MultipleLines	Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService	Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity	Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup	Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection	Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport	Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV	Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies	Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract	The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling	Whether the customer has paperless billing or not (Yes, No)
PaymentMethod	The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges	The amount charged to the customer monthly
TotalCharges	The total amount charged to the customer
Churn	Whether the customer churned or not (Yes or No)

Import Library

library(dplyr)
library(tidyr)
library(readr)
library(stringr)

library(GGally)
library(gtools)
library(caret)
library(car)
library(scales)
library(lmtest)
library(ggplot2)
library(plotly)
library(ggthemes)
library(MLmetrics)
library(performance)
library(ggpubr)

Pre-Processing Data

Load data csv

cust_churn_raw <- read.csv("data-input/WA_Fn-UseC_-Telco-Customer-Churn.csv")

check dataset structur

str(cust_churn_raw)

## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

Here are the summary of dataset:

summary(cust_churn_raw %>% mutate_if(is.character, as.factor)%>% select(-customerID))

##     gender     SeniorCitizen    Partner    Dependents     tenure     
##  Female:3488   Min.   :0.0000   No :3641   No :4933   Min.   : 0.00  
##  Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110   1st Qu.: 9.00  
##                Median :0.0000                         Median :29.00  
##                Mean   :0.1621                         Mean   :32.37  
##                3rd Qu.:0.0000                         3rd Qu.:55.00  
##                Max.   :1.0000                         Max.   :72.00  
##                                                                      
##  PhoneService          MultipleLines     InternetService
##  No : 682     No              :3390   DSL        :2421  
##  Yes:6361     No phone service: 682   Fiber optic:3096  
##               Yes             :2971   No         :1526  
##                                                         
##                                                         
##                                                         
##                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
##                   Electronic check         :2365   Median : 70.35  
##                   Mailed check             :1612   Mean   : 64.76  
##                                                    3rd Qu.: 89.85  
##                                                    Max.   :118.75  
##                                                                    
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 401.4   Yes:1869  
##  Median :1397.5             
##  Mean   :2283.3             
##  3rd Qu.:3794.7             
##  Max.   :8684.8             
##  NA's   :11

Checking if there is a missing value:

colSums(is.na(cust_churn_raw))

##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

There are eleven missing values in TotalCharges, Let us check their profiles:

cust_churn_raw %>% 
  select(-customerID) %>% 
  filter(is.na(TotalCharges))

Those who has no information in TotalCharge, are: haveDependents, have no tenure and have Contracts for Two Years. Probably there are missing information in these profile or they are just sign up as new customer.

Because the number of missing value relatively small to number of observation, i decide to remove those.

cust_churn <- cust_churn_raw %>%
  na.omit() %>% 
  select(-c(customerID))

dim(cust_churn)

## [1] 7032   20

Our dataset has 7032 rows and 20 columns.

Here’s top six row data

head(cust_churn)

EDA

Let’s analyze the relation among the features to our target variable (Churn)

1. `Churn` against categorical features

Here are the categorical features in the dataset

cust_churn %>% select_if(is.character) %>% names(.)

##  [1] "gender"           "Partner"          "Dependents"       "PhoneService"    
##  [5] "MultipleLines"    "InternetService"  "OnlineSecurity"   "OnlineBackup"    
##  [9] "DeviceProtection" "TechSupport"      "StreamingTV"      "StreamingMovies" 
## [13] "Contract"         "PaperlessBilling" "PaymentMethod"    "Churn"

There are 15 categorical variables. I will break the analysis into three segments.

a. Customer Profile

cprofile_p1 <- cust_churn %>% 
  select(Churn, gender) %>% count(Churn, gender) %>% 
  ggplot(aes(gender, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

cprofile_p2 <- cust_churn %>% 
  select(Churn, Partner) %>% count(Churn, Partner) %>% 
  ggplot(aes(Partner, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

cprofile_p3 <- cust_churn %>% 
  select(Churn, Dependents) %>% count(Churn, Dependents) %>%
  ggplot(aes(Dependents, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

cprofile_p4 <- cust_churn %>% 
  mutate(SeniorCitizen = ifelse(SeniorCitizen == "1", "Yes", "No")) %>% 
  select(Churn, SeniorCitizen) %>% count(Churn, SeniorCitizen) %>% 
  ggplot(aes(SeniorCitizen, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

subplot1 <- ggarrange(cprofile_p1, cprofile_p2, cprofile_p3, cprofile_p4,  ncol = 2, nrow = 2, 
                      common.legend = TRUE, 
                      legend = "bottom")
subplot1

Insight:

Customer who has Dependents has lower Churn Rate (17.44%) than those who do not have.
Customer who has Phone Service has higher Churn Rate (82.56%) than those who do not have.
SeniorCitizen Customer has lower Churn Rate (25.47%) than those who are not Senior Citizen.
The variance in gender does not seems to have a significant influence to customer churn rate. Both male and female have almost same Churn Rate.

b. Service Subscription

services_p8 <- cust_churn %>% 
  select(Churn, InternetService) %>% count(Churn, InternetService) %>% 
  ggplot(aes(InternetService, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

services_p9 <- cust_churn %>% 
  select(Churn, PhoneService) %>% count(Churn, PhoneService) %>% 
  ggplot(aes(PhoneService, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

subplot4 <- ggarrange(services_p8, services_p9, 
                    ncol = 2, nrow = 1, common.legend = TRUE, legend = "bottom")
subplot4

Insight:

Customer who subscribe to InternetSevice using Fiber Optic has higher churn rate last month compare to those who use DSL or even no using intenet service.
The variance in PhoneService have little influence to customer churn rate. Customer who used PhoneService has higher churn rate than those who not.

services_p1 <- cust_churn %>% 
  select(Churn, MultipleLines) %>% count(Churn, MultipleLines) %>% 
  ggplot(aes(MultipleLines, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

services_p2 <- cust_churn %>% 
  select(Churn, OnlineSecurity) %>% count(Churn, OnlineSecurity) %>% 
  ggplot(aes(OnlineSecurity, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

services_p3 <- cust_churn %>% 
  select(Churn, OnlineBackup) %>% count(Churn, OnlineBackup) %>% 
  ggplot(aes(OnlineBackup, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

services_p4 <- cust_churn %>% 
  select(Churn, DeviceProtection) %>% count(Churn, DeviceProtection) %>% 
  ggplot(aes(DeviceProtection, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

services_p5 <- cust_churn %>% 
  select(Churn, TechSupport) %>% count(Churn, TechSupport) %>% 
  ggplot(aes(TechSupport, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

services_p6 <- cust_churn %>% 
  select(Churn, StreamingTV) %>% count(Churn, StreamingTV) %>% 
  ggplot(aes(StreamingTV, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

services_p7 <- cust_churn %>% 
  select(Churn, StreamingMovies) %>% count(Churn, StreamingMovies) %>% 
  ggplot(aes(StreamingMovies, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))


subplot2 <- ggarrange(services_p1, services_p2, services_p3, services_p4,
                    services_p5, services_p6, services_p7, 
                    #labels = c("MultipleLines", "OnlineSecurity",  "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"),
                    ncol = 2, nrow = 4, common.legend = TRUE, legend = "bottom")
subplot2

cust_churn %>% select(Churn,MultipleLines) %>% 
  #filter(Churn == "Yes") %>%  
  group_by(MultipleLines, Churn) %>%
  summarise(n= n()) %>% 
  mutate(Percentage = round((n/sum(n))*100,2))

cust_churn %>% select(Churn,DeviceProtection) %>% 
  #filter(Churn == "Yes") %>%  
  group_by(DeviceProtection, Churn) %>%
  summarise(n= n()) %>% 
  mutate(Percentage = round((n/sum(n))*100,2))

cust_churn %>% select(Churn,OnlineBackup) %>% 
  #filter(Churn == "Yes") %>%  
  group_by(OnlineBackup, Churn) %>%
  summarise(n= n()) %>% 
  mutate(Percentage = round((n/sum(n))*100,2))

cust_churn %>% select(Churn,OnlineSecurity) %>% 
  #filter(Churn == "Yes") %>%  
  group_by(OnlineSecurity, Churn) %>%
  summarise(n= n()) %>% 
  mutate(Percentage = round((n/sum(n))*100,2))

cust_churn %>% select(Churn,TechSupport) %>% 
  #filter(Churn == "Yes") %>%  
  group_by(TechSupport, Churn) %>%
  summarise(n= n()) %>% 
  mutate(Percentage = round((n/sum(n))*100,2))

Customers who subscribe DeviceProtection, OnlineBackup, OnlineSecurity and TechSupport service last month have lower Churn rate compared to the customers who don’t. The Average number of customer who subscribe one of those services and not leave is more than 77 %.

c. Contract+Billing Type

#
subsinfo_p1 <- cust_churn %>% 
  select(Churn, Contract) %>% count(Churn, Contract) %>% 
  ggplot(aes(Contract, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

subsinfo_p2 <- cust_churn %>% 
  select(Churn, PaymentMethod) %>% count(Churn, PaymentMethod) %>% 
  ggplot(aes(PaymentMethod, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

subsinfo_p3 <- cust_churn %>% 
  select(Churn, PaperlessBilling) %>% count(Churn, PaperlessBilling) %>% 
  ggplot(aes(PaperlessBilling, n, fill = Churn)) +
  geom_bar(stat="identity", position = "fill") +
  coord_flip() +
  geom_text(aes(label=n),position = position_fill(vjust = 0.5))

subplot3 <- ggarrange(subsinfo_p1, subsinfo_p2, subsinfo_p3, 
                      ncol = 2, nrow = 2,
                      common.legend = TRUE, legend = "bottom")
subplot3

Insight:

Customers who sign longer contract have lower churn rate.
Customers who choose paperless billing have higher churn rate.
Customers who pay with electronic check have higher churn rate than customers who pay with other methods.

2. `Churn` against numerical features

Here are the numerical features in the dataset

cust_churn %>% select_if(is.numeric) %>% names(.)

## [1] "SeniorCitizen"  "tenure"         "MonthlyCharges" "TotalCharges"

These are the distribution of numeric data:

cust_churn %>% select_if(is.numeric) %>% summary(.)

##  SeniorCitizen        tenure      MonthlyCharges    TotalCharges   
##  Min.   :0.0000   Min.   : 1.00   Min.   : 18.25   Min.   :  18.8  
##  1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 35.59   1st Qu.: 401.4  
##  Median :0.0000   Median :29.00   Median : 70.35   Median :1397.5  
##  Mean   :0.1624   Mean   :32.42   Mean   : 64.80   Mean   :2283.3  
##  3rd Qu.:0.0000   3rd Qu.:55.00   3rd Qu.: 89.86   3rd Qu.:3794.7  
##  Max.   :1.0000   Max.   :72.00   Max.   :118.75   Max.   :8684.8

bp1 <- cust_churn %>% 
  select(tenure) %>% 
  ggplot(aes(tenure)) + geom_boxplot()

bp2 <- cust_churn %>% 
  select(MonthlyCharges) %>% 
  ggplot(aes(MonthlyCharges)) + geom_boxplot()

bp3 <- cust_churn %>% 
  select(TotalCharges) %>% 
  ggplot(aes(TotalCharges)) + geom_boxplot()

subplot4 <- ggarrange(bp1, bp2, bp3, 
                      ncol = 1, nrow = 3,
                      common.legend = TRUE, legend = "bottom")
subplot4

Look like there is no outliers for each of numeric feature.

agg_tenure_churn <- cust_churn %>% 
  filter(Churn == "Yes") %>% 
  select(tenure, Churn) %>% 
  group_by(tenure) %>% 
  summarize(n = n()) %>% mutate(ChurnRate = round((n/sum(n)),2)*100)

numplot1 <- agg_tenure_churn %>% 
  ggplot(aes(x=tenure, y=ChurnRate, col=tenure)) +
  geom_point(alpha=0.5) +
  geom_smooth(method = "lm", col = "red") +
  labs(
    x = "tenure (in months)", y = "Churn Rate (%)"
  ) + theme(legend.position = "none") +
  annotate("text", label = paste("Corr = ", round(cor(agg_tenure_churn %>% select(-n))[2],3)), 
           x = 65, y = 20, size = 3, colour = "red")


agg_monthlycharges_churn <- cust_churn %>% 
  filter(Churn == "Yes") %>%
  select(MonthlyCharges, Churn) %>% 
  group_by(MonthlyCharges) %>% 
  summarize(n = n()) %>% mutate(ChurnRate = round((n/sum(n)),2)*100)

numplot2 <- agg_monthlycharges_churn %>% 
  ggplot(aes(x=MonthlyCharges, y=ChurnRate, color=MonthlyCharges)) +
  geom_point(alpha=0.5) +
  geom_smooth(method = "lm", col = "red") +
  labs(
    x = "Monthly Charges", y = "Churn Rate (%)"
  ) + theme(legend.position = "none") +
  annotate("text", label = paste("Corr = ", round(cor(agg_monthlycharges_churn %>% select(-n))[2],3)), 
           x = 100, y = 20, size = 3, colour = "red")

agg_totalcharges_churn <- cust_churn %>% 
  filter(Churn == "Yes") %>% 
  select(TotalCharges, Churn) %>% 
  group_by(TotalCharges) %>% 
  summarize(n = n()) %>% mutate(ChurnRate = round((n/sum(n)),2)*100)
  
numplot3 <- agg_totalcharges_churn %>% 
  ggplot(aes(x=TotalCharges, y=ChurnRate, color=TotalCharges)) +
  geom_point(alpha=0.5) +
  geom_smooth(method = "lm", col = "red") +
  labs(
    x = "Total Charges", y = "Churn Rate (%)"
  ) + theme(legend.position = "none") +
  annotate("text", label = paste("Corr = ", round(cor(agg_totalcharges_churn %>% select(-n))[2],6)), 
           x = 7500, y = 20, size = 3, colour = "red")

subplot5 <- ggarrange(numplot1, numplot2, numplot3, 
                      ncol = 1, nrow = 3)
subplot5

Among these three variable, only tenure that has correlation with Churn while MonthlyCharges and TotalCharge almost have no correlation with Churn.

Data Wrangling

Cleaning Categorical Feature

From the EDA, we know that several values in categorical variable have almost same information. For example in StreamingTV and OnlineBackup, there are “No” and “No internet service” values.

I simplified those value by set them as “No”. But before that, i separate the data between between categorical features, numerical features, and target for data wrangling

# separate target, numeric variable, and categorical variable
cust_churn_num <- cust_churn %>% select_if(is.numeric) # 4 columns
cust_churn_char <- cust_churn %>% 
  select(-Churn) %>% 
  select_if(is.character) # 15 columns
Churn <- as.factor(cust_churn$Churn)

I use gsub() to replace “No phone service” values to “No”. As you can see below, the value in `MultipleLines" now only have two unique values.

#replace "No phone Service" to "No"
cust_churn_char <- lapply(cust_churn_char %>% select_if(is.character), 
                  function(x) {
                   gsub("No phone service", "No", x)
                  }) %>% as_tibble()


unique(cust_churn_char$MultipleLines)

## [1] "No"  "Yes"

The same treatment applied to replace “No internet service” values to “No”.

#replace "No internet service" to "No"
cust_churn_char <- lapply(cust_churn_char %>% select_if(is.character), 
                  function(x) {
                   gsub("No internet service", "No", x)
                  }) %>% as_tibble()


unique(cust_churn_char$StreamingTV)

## [1] "No"  "Yes"

unique(cust_churn_char$OnlineSecurity)

## [1] "No"  "Yes"

unique(cust_churn_char$StreamingMovies)

## [1] "No"  "Yes"

Feature Selection

Based on insight that we found in EDA section, there are several features that has poor correlation with customer churn such as gender, PhoneService, MonthlyCharges, TotalCharges. So i will remove them from the predictor.

cust_churn_char <- cust_churn_char %>% 
  select(-c(gender, PhoneService))

cust_churn_num <- cust_churn_num %>% 
  select(-c(MonthlyCharges, TotalCharges))

Create Dummy Variable

Because I will use Decision Tree for fitting model which require predictor in numerical, i need to transform the data into dummy variables to ensure all the data in numeric format.

cust_churn_char <- data.frame(sapply(cust_churn_char,function(x) data.frame(model.matrix(~x-1,data =cust_churn_char))[,-1]))

head(cust_churn_char)

Here is the final form of data for creating model.

# concat cust_churn_char + cust_churn_num
cust_churn_tidy <- cbind(cust_churn_char, cust_churn_num, Churn)
str(cust_churn_tidy)

## 'data.frame':    7032 obs. of  20 variables:
##  $ Partner                               : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ Dependents                            : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ MultipleLines                         : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetService.xFiber.optic          : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetService.xNo                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurity                        : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackup                          : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtection                      : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupport                           : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTV                           : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMovies                       : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ Contract.xOne.year                    : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ Contract.xTwo.year                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBilling                      : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethod.xCredit.card..automatic.: num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethod.xElectronic.check       : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethod.xMailed.check           : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ SeniorCitizen                         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tenure                                : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ Churn                                 : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Fitting Model

Cross-Validation

In this section i split 80 % of dataset as data train and the rest as data test.

RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(cust_churn_tidy)
# sampel dan ambil 80% data secara acak
index <- sample(row_data, row_data*0.8)

data_train <- cust_churn_tidy[ index, ]
data_test <- cust_churn_tidy[ -index, ]

Before creating model using data train, first we need to check whether imbalance class exist in target variable. Class imbalance can affect our model ability to classify the output.

data_train %>% 
  group_by(Churn) %>% 
  summarise(total = n()) %>% 
  mutate(percentage = round((total/sum(total))*100,2))

The target class are not represent equally. Although the ratio is not is not too high, i will treat them as imbalance class case.

In this case i use upscale function from package caret to oversampling data-train class.

library(caret)
set.seed(123)
data_train_up <- upSample(x = data_train %>% select(-Churn),
                          y = data_train$Churn,
                          list = F,
                          yname = "Churn")

table(data_train_up$Churn)

## 
##   No  Yes 
## 4153 4153

Logistic Regression

I create model using full predictor and using step-wise backward for feature selection

# model - only intercept
model_null <- glm(Churn~1,data_train_up, family = "binomial")

# model -full predictor
model_full <- glm(Churn~.,data_train_up, family = "binomial")

Model Fitting

# model stepwise backward
model_bw <- step(object = model_full, direction = "backward", trace = F)
summary(model_bw)

## 
## Call:
## glm(formula = Churn ~ Dependents + MultipleLines + InternetService.xFiber.optic + 
##     InternetService.xNo + OnlineSecurity + OnlineBackup + TechSupport + 
##     StreamingTV + StreamingMovies + Contract.xOne.year + Contract.xTwo.year + 
##     PaperlessBilling + PaymentMethod.xElectronic.check + SeniorCitizen + 
##     tenure, family = "binomial", data = data_train_up)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.41191  -0.76322   0.08776   0.74987   2.79269  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      0.406706   0.077410   5.254 1.49e-07 ***
## Dependents                      -0.231041   0.066788  -3.459 0.000542 ***
## MultipleLines                    0.182901   0.066407   2.754 0.005883 ** 
## InternetService.xFiber.optic     0.812849   0.069659  11.669  < 2e-16 ***
## InternetService.xNo             -0.990590   0.102480  -9.666  < 2e-16 ***
## OnlineSecurity                  -0.365417   0.069173  -5.283 1.27e-07 ***
## OnlineBackup                    -0.186077   0.064977  -2.864 0.004187 ** 
## TechSupport                     -0.174745   0.070546  -2.477 0.013248 *  
## StreamingTV                      0.310541   0.068982   4.502 6.74e-06 ***
## StreamingMovies                  0.309845   0.068575   4.518 6.23e-06 ***
## Contract.xOne.year              -0.670791   0.084761  -7.914 2.49e-15 ***
## Contract.xTwo.year              -1.346374   0.127370 -10.571  < 2e-16 ***
## PaperlessBilling                 0.314829   0.061736   5.100 3.40e-07 ***
## PaymentMethod.xElectronic.check  0.297004   0.059970   4.953 7.33e-07 ***
## SeniorCitizen                    0.315110   0.073557   4.284 1.84e-05 ***
## tenure                          -0.032411   0.001879 -17.250  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11514.6  on 8305  degrees of freedom
## Residual deviance:  8055.6  on 8290  degrees of freedom
## AIC: 8087.6
## 
## Number of Fisher Scoring iterations: 5

levels(data_test$Churn)

## [1] "No"  "Yes"

K-NN

Fitting Model K-NN

#initialize k neighbour
k_neighbour <- sqrt(nrow(data_train_up)) %>% round()

# Model Fitting K-NN + prediction
model_knn <- class::knn(train = data_train_up %>% select(-Churn),
                        cl = data_train_up$Churn,
                        test = data_test %>% select(-Churn),
                        k = k_neighbour)

Evaluation

Time to evaluate the models. we use Confusion matrix to evaluate the result between prediction and actual on data test.

Generally there are four metrics to measure performance (for binary classification) based on Confusion Matrix:

Accuracy : Measure how many of our data is correctly predicted. \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
Sensitivity : measures out of all positive outcome, how many are correctly predicted. \[Sensitivity = \frac{TP}{TP + FN}\]
Specificty: measure how many negative outcome is correctly predicted. \[Specificty = \frac{TN}{TN + FP}\]
Precision: measures how many of our positive prediction is correct. \[Precision = \frac{TN}{TN + FP}\]

First i check the proportion target class in data test.

data_test %>% 
  group_by(Churn) %>% 
  summarise(total = n()) %>% 
  mutate(percentage = round((total/sum(total))*100,2))

71.78 % data test labeled as Not-Churn Customer while the rest labeled as Churn.

Logistic Regression

Check the level of target value and predict value in unseen data (data testing) using logistic regression model.

levels(data_test$Churn)

## [1] "No"  "Yes"

churn_modelbw <- predict(model_bw, data_test, type = "response")

Prior to binary classification case, I will convert the probability output in prediction to factor using threshold 0.5. If the probability value more than 0.5 will classified as “Yes” while the rest classified as “No”.

pred_churn_modelbw <- ifelse(churn_modelbw > 0.5, "Yes", "No") %>% 
  as.factor()

library(caret)
cm_model_lr <- confusionMatrix(pred_churn_modelbw,
                data_test$Churn,
                positive = "Yes")
cm_model_lr

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  732  76
##        Yes 278 321
##                                           
##                Accuracy : 0.7484          
##                  95% CI : (0.7249, 0.7709)
##     No Information Rate : 0.7178          
##     P-Value [Acc > NIR] : 0.005509        
##                                           
##                   Kappa : 0.462           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8086          
##             Specificity : 0.7248          
##          Pos Pred Value : 0.5359          
##          Neg Pred Value : 0.9059          
##              Prevalence : 0.2822          
##          Detection Rate : 0.2281          
##    Detection Prevalence : 0.4257          
##       Balanced Accuracy : 0.7667          
##                                           
##        'Positive' Class : Yes             
##

Accuracy = 74.84 %

Around 74 % of testing data are correctly classified with this logistic model
Recall / Sensitivity = 80.86 %;

This model correctly identified 80.86 % of actual churned customer.
Specificity = 72.84 %

This model correctly identified 72.84 % of actual non-churned customer.
Precision = 53.59 %

Around 74.84 % of our positive prediction (Churn Customer) are correctly classified.

KNN

cm_model_knn <- confusionMatrix(model_knn, data_test$Churn, positive = "Yes")
cm_model_knn

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  683  80
##        Yes 327 317
##                                           
##                Accuracy : 0.7107          
##                  95% CI : (0.6863, 0.7343)
##     No Information Rate : 0.7178          
##     P-Value [Acc > NIR] : 0.7339          
##                                           
##                   Kappa : 0.3993          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7985          
##             Specificity : 0.6762          
##          Pos Pred Value : 0.4922          
##          Neg Pred Value : 0.8952          
##              Prevalence : 0.2822          
##          Detection Rate : 0.2253          
##    Detection Prevalence : 0.4577          
##       Balanced Accuracy : 0.7374          
##                                           
##        'Positive' Class : Yes             
##

Accuracy = 71.07 %

Around 71 % of testing data are correctly classified with this KNN model
Recall / Sensitivity = 79.85 %

This model correctly identified 79.85 % of actual churned customer.
Specificity = 67.62 %

This KNN model correctly identified 67.62 % of actual non-churned customer.
Precision = 53.59 %

Around 49.22 % of our positive prediction (Churn Customer) are correctly classified.

Evaluation Summary

library(tibble)

summary_eval_lr <- data_frame(Model = "Logistic Regression", Accuracy = round(cm_model_lr$overall[1]*100,2),
           Recall = round(cm_model_lr$byClass[1]*100,2),
           Specificity = round(cm_model_lr$byClass[2]*100,2),
           Precision = round(cm_model_lr$byClass[3]*100,2),
           Kappa = round(cm_model_lr$overall[2]*100,2))

summary_eval_knn <- data_frame(Model = "K-NN", Accuracy = round(cm_model_knn$overall[1]*100,2),
           Recall = round(cm_model_knn$byClass[1]*100,2),
           Specificity = round(cm_model_knn$byClass[2]*100,2),
           Precision = round(cm_model_knn$byClass[3]*100,2), 
           Kappa = round(cm_model_knn$overall[2]*100,2))

summary_eval_model <-  rbind(summary_eval_lr, summary_eval_knn)
summary_eval_model

Both model has decent score of Accuracy and Recall. Based on the metrics table above, we can say that the Logistic Regression model outperform K-NN model.

Conclusion

As previously mention in evaluation section, we can conclude that Logistic Regression model outperform K-NN model.

From business owner perspective, it is important to watch Recall(sensitivity) as main evaluation model because it describes the ability of a model to correctly predict positive class (True Positive) among the total of real positive class.

In this case, our Logistic Regression model has Recall score 80.86 % which means the model will correctly identified 80.86 % customer who will actually churn and around 19.14 % potential churn customer incorrectly predicted as non-churn customer.

Telco Case: Customer Churn Prediction

Husain A. Hidayat

February 09, 2021

Introduction

Data Preparation

About the Data

Import Library

Pre-Processing Data

EDA

1. `Churn` against categorical features

a. Customer Profile

b. Service Subscription

c. Contract+Billing Type

2. `Churn` against numerical features

Data Wrangling

Cleaning Categorical Feature

Feature Selection

Create Dummy Variable

Fitting Model

Cross-Validation

Logistic Regression

K-NN

Evaluation

Logistic Regression

KNN

Evaluation Summary

Conclusion

Telco Case: Customer Churn Prediction

Husain A. Hidayat

February 09, 2021

Introduction

Data Preparation

About the Data

Import Library

Pre-Processing Data

EDA

1. Churn against categorical features

a. Customer Profile

b. Service Subscription

c. Contract+Billing Type

2. Churn against numerical features

Data Wrangling

Cleaning Categorical Feature

Feature Selection

Create Dummy Variable

Fitting Model

Cross-Validation

Logistic Regression

K-NN

Evaluation

Logistic Regression

KNN

Evaluation Summary

Conclusion

1. `Churn` against categorical features

2. `Churn` against numerical features