Introduction

In this project, I am going to be working with the “Telco Customer Churn” dataset to analyze customer churn and outline its predictors. The project consists of the following main steps:

  1. Data loading, preprocessing, and exploration

  2. K-means clustering to outline customer segments

  3. Binary logistic regression to predict churn (for a segment with the highest churn)

There are 21 variables and 7043 observations in the dataset. They are described on Kaggle as follows:

customerID- A customer’s ID

(1) demographic variables:

gender - A customer’s gender

SeniorCitizen - Whether the customer is a senior citizen or not (1, 0)

Partner- Whether the customer has a partner or not (Yes, No)

Dependents - Whether the customer has dependents or not (Yes, No)

(2) variables related to the company’s services:

PhoneService - Whether the customer has a phone service or not (Yes, No)

MultipleLines - Whether the customer has multiple lines or not (Yes, No, No phone service)

InternetService - Customer’s internet service provider (DSL, Fiber optic, No)

OnlineSecurity - Whether the customer has online security or not (Yes, No, No internet service)

OnlineBackup - Whether the customer has online backup or not (Yes, No, No internet service)

DeviceProtection - Whether the customer has device protection or not (Yes, No, No internet service)

TechSupport - Whether the customer has tech support or not (Yes, No, No internet service)

StreamingTV - Whether the customer has streaming TV or not (Yes, No, No internet service)

StreamingMovies - Whether the customer has streaming movies or not (Yes, No, No internet service)

(3) contract and payment details:

Contract - The contract term of the customer (Month-to-month, One year, Two year)

PaperlessBilling - Whether the customer has paperless billing or not (Yes, No)

PaymentMethod - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

(4) tenure and charges details:

tenure - Number of months the customer has stayed with the company

MonthlyCharges - The amount charged to the customer monthly

TotalCharges - The total amount charged to the customer

(5) the variable of interest:

Churn - Whether the customer have left the company within the last month (Yes or No)

Importing the libraries.

library(dplyr)
library(ggplot2)
library(patchwork)
library(factoextra)
library(MASS)
library(descr)
library(caret)

Data preprocessing and exploration

Let’s have a first look at our data and check/correct the variable types and coding.

data <- read.csv("telco_customerchurn.csv")

glimpse(data)
## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW~
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",~
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes~
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"~
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "~
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt~
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "~
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N~
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y~
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes~
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye~
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes~
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One ~
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "~
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y~
data$SeniorCitizen <- if_else(data$SeniorCitizen == 1, "Yes", "No")

data[-c(1, 6, 19, 20)] <- lapply(data[-c(1, 6, 19, 20)], as.factor)
summary(data)
##   customerID           gender     SeniorCitizen Partner    Dependents
##  Length:7043        Female:3488   No :5901      No :3641   No :4933  
##  Class :character   Male  :3555   Yes:1142      Yes:3402   Yes:2110  
##  Mode  :character                                                    
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2810   No                 :2785   Month-to-month:3875  
##  No internet service:1526   No internet service:1526   One year      :1473  
##  Yes                :2707   Yes                :2732   Two year      :1695  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling                   PaymentMethod  MonthlyCharges  
##  No :2872         Bank transfer (automatic):1544   Min.   : 18.25  
##  Yes:4171         Credit card (automatic)  :1522   1st Qu.: 35.50  
##                   Electronic check         :2365   Median : 70.35  
##                   Mailed check             :1612   Mean   : 64.76  
##                                                    3rd Qu.: 89.85  
##                                                    Max.   :118.75  
##                                                                    
##   TotalCharges    Churn     
##  Min.   :  18.8   No :5174  
##  1st Qu.: 401.4   Yes:1869  
##  Median :1397.5             
##  Mean   :2283.3             
##  3rd Qu.:3794.7             
##  Max.   :8684.8             
##  NA's   :11

We can notice 11 missing values in the TotalCharges variable, this will be addressed later.

Summarizing the data by churn: out of the total 7043 customers 26.5% have churned (1869 customers).

data$Churn <- relevel(data$Churn, ref = "Yes")

ggplot(data, aes(Churn, fill = Churn)) + geom_bar(width = 0.6) + ggtitle("26.5% of customers left the company last month (n=7043)") + ylab("Number of customers") + xlab("Churn") + scale_y_continuous(limits = c(0, 6000))

Moving forward, there were only 11 missing values, which constitutes approximately 0.2% of the dataset, hence, these 11 observations can be safely removed.

# how many missing values are there
missing_values <- sum(is.na(data))
missing_values
## [1] 11
# share of missing values
missing_values/nrow(data)
## [1] 0.001561834
data %>% filter(is.na(TotalCharges))
##    customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1  4472-LVYGI Female            No     Yes        Yes      0           No
## 2  3115-CZMZD   Male            No      No        Yes      0          Yes
## 3  5709-LVOEQ Female            No     Yes        Yes      0          Yes
## 4  4367-NUYAO   Male            No     Yes        Yes      0          Yes
## 5  1371-DWPAZ Female            No     Yes        Yes      0           No
## 6  7644-OMVMY   Male            No     Yes        Yes      0          Yes
## 7  3213-VVOLG   Male            No     Yes        Yes      0          Yes
## 8  2520-SGTTA Female            No     Yes        Yes      0          Yes
## 9  2923-ARZLG   Male            No     Yes        Yes      0          Yes
## 10 4075-WKNIU Female            No     Yes        Yes      0          Yes
## 11 2775-SEFEE   Male            No      No        Yes      0          Yes
##       MultipleLines InternetService      OnlineSecurity        OnlineBackup
## 1  No phone service             DSL                 Yes                  No
## 2                No              No No internet service No internet service
## 3                No             DSL                 Yes                 Yes
## 4               Yes              No No internet service No internet service
## 5  No phone service             DSL                 Yes                 Yes
## 6                No              No No internet service No internet service
## 7               Yes              No No internet service No internet service
## 8                No              No No internet service No internet service
## 9                No              No No internet service No internet service
## 10              Yes             DSL                  No                 Yes
## 11              Yes             DSL                 Yes                 Yes
##       DeviceProtection         TechSupport         StreamingTV
## 1                  Yes                 Yes                 Yes
## 2  No internet service No internet service No internet service
## 3                  Yes                  No                 Yes
## 4  No internet service No internet service No internet service
## 5                  Yes                 Yes                 Yes
## 6  No internet service No internet service No internet service
## 7  No internet service No internet service No internet service
## 8  No internet service No internet service No internet service
## 9  No internet service No internet service No internet service
## 10                 Yes                 Yes                 Yes
## 11                  No                 Yes                  No
##        StreamingMovies Contract PaperlessBilling             PaymentMethod
## 1                   No Two year              Yes Bank transfer (automatic)
## 2  No internet service Two year               No              Mailed check
## 3                  Yes Two year               No              Mailed check
## 4  No internet service Two year               No              Mailed check
## 5                   No Two year               No   Credit card (automatic)
## 6  No internet service Two year               No              Mailed check
## 7  No internet service Two year               No              Mailed check
## 8  No internet service Two year               No              Mailed check
## 9  No internet service One year              Yes              Mailed check
## 10                  No Two year               No              Mailed check
## 11                  No Two year              Yes Bank transfer (automatic)
##    MonthlyCharges TotalCharges Churn
## 1           52.55           NA    No
## 2           20.25           NA    No
## 3           80.85           NA    No
## 4           25.75           NA    No
## 5           56.05           NA    No
## 6           19.85           NA    No
## 7           25.35           NA    No
## 8           20.00           NA    No
## 9           19.70           NA    No
## 10          73.35           NA    No
## 11          61.90           NA    No
data_full <- na.omit(data)

Now, let’s look at customer churn among variables containing demographic, services and customer account information.

p1 <- ggplot(data_full, aes(fill = Churn, x = gender)) +
  geom_bar(position = "stack") + xlab("Gender?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))

p2 <- ggplot(data_full, aes(fill = Churn, x = SeniorCitizen)) +
  geom_bar(position = "stack") + xlab("Is a senior citizen?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))

p3 <- ggplot(data_full, aes(fill = Churn, x = Partner)) +
  geom_bar(position = "stack") + xlab("Has a partner?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))

p4 <- ggplot(data_full, aes(fill = Churn, x = Dependents)) +
  geom_bar(position = "stack") + xlab("Has dependents?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))

p1 + p2 + p3 + p4 + plot_annotation(title = "Customer churn by demographic variables") + plot_layout(guides = "collect")

p1 <- ggplot(data_full, aes(fill = Churn, x = gender)) +
  geom_bar(position = "fill") + xlab("Gender?") + scale_y_continuous(labels = scales::percent) + ylab(" ")

p2 <- ggplot(data_full, aes(fill = Churn, x = SeniorCitizen)) +
  geom_bar(position = "fill") + xlab("Is a senior citizen?") + scale_y_continuous(labels = scales::percent) + ylab(" ")

p3 <- ggplot(data_full, aes(fill = Churn, x = Partner)) +
  geom_bar(position = "fill") + xlab("Has a partner?") + scale_y_continuous(labels = scales::percent) + ylab(" ")

p4 <- ggplot(data_full, aes(fill = Churn, x = Dependents)) +
  geom_bar(position = "fill") + xlab("Has dependents?") + scale_y_continuous(labels = scales::percent) + ylab(" ")

p1 + p2 + p3 + p4 + plot_annotation(title = "Customer churn by demographic variables") + plot_layout(guides = "collect")

p6 <- ggplot(data_full, aes(fill = Churn, x = PhoneService)) +
  geom_bar(position = "stack") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))

p7 <- ggplot(data_full, aes(fill = Churn, x = MultipleLines)) +
  geom_bar(position = "stack") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))

p6 + p7 + plot_annotation(title = "Customer churn by phone service details") + plot_layout(nrow = 1, guides = "collect")

p6 <- ggplot(data_full, aes(fill = Churn, x = PhoneService)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ")

p7 <- ggplot(data_full, aes(fill = Churn, x = MultipleLines)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ")

p6 + p7 + plot_annotation(title = "Customer churn by phone service details") + plot_layout(nrow = 1, guides = "collect")

p8 <- ggplot(data_full, aes(fill = Churn, x = InternetService)) +
  geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))

p9 <- ggplot(data_full, aes(fill = Churn, x = OnlineSecurity)) +
  geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))

p10 <- ggplot(data_full, aes(fill = Churn, x = OnlineBackup)) +
  geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))

p11 <- ggplot(data_full, aes(fill = Churn, x = DeviceProtection)) +
  geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))

p12 <- ggplot(data_full, aes(fill = Churn, x = TechSupport)) +
  geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))

p13 <- ggplot(data_full, aes(fill = Churn, x = StreamingTV)) +
  geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))

p14 <- ggplot(data_full, aes(fill = Churn, x = StreamingMovies)) +
  geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))

p8 + p9 + p10 + p11 + p12 + p13 + p14 + plot_annotation(title = "Customer churn by internet service details") + plot_layout(nrow = 4, guides = "collect")

p8 <- ggplot(data_full, aes(fill = Churn, x = InternetService)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()

p9 <- ggplot(data_full, aes(fill = Churn, x = OnlineSecurity)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()

p10 <- ggplot(data_full, aes(fill = Churn, x = OnlineBackup)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()

p11 <- ggplot(data_full, aes(fill = Churn, x = DeviceProtection)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()

p12 <- ggplot(data_full, aes(fill = Churn, x = TechSupport)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()

p13 <- ggplot(data_full, aes(fill = Churn, x = StreamingTV)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()

p14 <- ggplot(data_full, aes(fill = Churn, x = StreamingMovies)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()

p8 + p9 + p10 + p11 + p12 + p13 + p14 + plot_annotation(title = "Customer churn by internet service details") + plot_layout(nrow = 4, guides = "collect")

p15 <- ggplot(data_full, aes(fill = Churn, x = Contract)) +
  geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Contract type") + scale_y_continuous(limits = c(0, 6000))

p16 <- ggplot(data_full, aes(fill = Churn, x = PaperlessBilling)) +
  geom_bar(position = "stack") + ylab(" ") + xlab("Is billing paperless?") + scale_y_continuous(limits = c(0, 6000))

p17 <- ggplot(data_full, aes(fill = Churn, x = PaymentMethod)) +
  geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Payment method") + scale_y_continuous(limits = c(0, 6000))

p15 + p16 + p17 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")

p15 <- ggplot(data_full, aes(fill = Churn, x = Contract)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + coord_flip() + ylab(" ") + xlab("Contract type")

p16 <- ggplot(data_full, aes(fill = Churn, x = PaperlessBilling)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + xlab("Is billing paperless?")

p17 <- ggplot(data_full, aes(fill = Churn, x = PaymentMethod)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + coord_flip() + ylab(" ") + xlab("Payment method")

p15 + p16 + p17 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")

p5 <- ggplot(data_full, aes(fill = Churn, x = tenure)) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 1) + xlab("Number of months with the company (binwidth = 1 month)") + scale_x_continuous(limits = c(0, 72))

p18 <- ggplot(data_full, aes(fill = Churn, x = MonthlyCharges)) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 1) + scale_x_continuous(limits = c(0, 120)) + xlab("Monthly charges (binwidth = 1 dollar)")

p19 <- ggplot(data_full, aes(fill = Churn, x = TotalCharges)) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 100) + scale_x_continuous(limits = c(0, 9000)) + xlab("Total charges (binwidth = 100 dollars)")

p5 + p18 + p19 + plot_annotation(title = "Customer churn by tenure, monthly and total charges") + plot_layout(nrow = 3, guides = "collect")

Churn was higher among:

  • senior citizens

  • people without a partner

  • people without dependents

  • customers with month-to-month contracts

  • customers with electronic check as their payment method

  • customers with paperless billing

  • customers whose Internet service provider was Fiber optic

  • customers without online security service

  • customers without online backup

  • customers without tech support

  • customers without device protection

  • customers who had been using the company’s services for around 1-9 months (max tenure was 6 years); the longer customers stayed with the company the less likely they were to churn.

Clustering

Preparing variables for clustering

First, I will check correlations between our numeric variables TotalCharges, MonthlyCharges and tenure to see whether all of them should be used for clustering.

Since the TotalCharges variable has moderate and high correlations with the other 2 variables and will not enrich our interpretation of clusters I will remove it. (I will also remove the customerID variable since it is not useful for us in this project.)

cor(data_full$TotalCharges, data_full$MonthlyCharges)
## [1] 0.6510648
cor(data_full$TotalCharges, data_full$tenure)
## [1] 0.8258805
cor(data_full$tenure, data_full$MonthlyCharges)
## [1] 0.2468618
data_clean <- data_full[-c(1, 20)]

glimpse(data_clean)
## Rows: 7,032
## Columns: 19
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
## $ SeniorCitizen    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N~
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, ~
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No ~
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in~
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in~
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte~
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int~
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte~
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M~
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr~
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~

Next, I will recode 9 variables concerned with the company’s services to facilitate our analysis and create a new numeric variable for clustering. Recoding was necessary since OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies are dependent on whether a customer has InternetService or not. Same goes for MultipleLines and PhoneService variables. Now, for every variable listed above possible values will be 0 (No) and 1 (Yes) referring to whether a customer has purchased this service or not.

data_clean$MultipleLines <- if_else(data_clean$MultipleLines == "Yes", 1, 0)

data_clean$InternetService <- if_else(data_clean$InternetService == "No", 0, 1)

data_clean$PhoneService <- if_else(data_clean$PhoneService == "Yes", 1, 0)

data_clean$OnlineSecurity <- if_else(data_clean$OnlineSecurity == "No internet service", 0, if_else(data_clean$OnlineSecurity == "No", 0, 1))

data_clean$OnlineBackup <- if_else(data_clean$OnlineBackup == "No internet service", 0, if_else(data_clean$OnlineBackup == "No", 0, 1))

data_clean$DeviceProtection <- if_else(data_clean$DeviceProtection == "No internet service", 0, if_else(data_clean$DeviceProtection == "No", 0, 1))

data_clean$TechSupport <- if_else(data_clean$TechSupport == "No internet service", 0, if_else(data_clean$TechSupport == "No", 0, 1))

data_clean$StreamingTV <- if_else(data_clean$StreamingTV == "No internet service", 0, if_else(data_clean$StreamingTV == "No", 0, 1))

data_clean$StreamingMovies <- if_else(data_clean$StreamingMovies == "No internet service", 0, if_else(data_clean$StreamingMovies == "No", 0, 1))

Our 9 service variables are now in the right format and I can create a new variable, TelcoServices, that represents the number of the company’s services used by a customer (ranging from 1 to 9 services).

data_clean <- data_clean %>%
  mutate(TelcoServices = StreamingMovies + StreamingTV + TechSupport + DeviceProtection + OnlineSecurity + OnlineBackup + InternetService + MultipleLines + PhoneService)

summary(data_clean[20])
##  TelcoServices  
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :4.000  
##  Mean   :4.147  
##  3rd Qu.:6.000  
##  Max.   :9.000

Next, I will normalize our 3 clustering variables: tenure, MonthlyCharges, and TelcoServices. Now they range from 0 to 1.

range01 <- function(x){(x-min(x))/(max(x)-min(x))}

data_standardized <- data_clean
data_standardized$tenure <- range01(data_standardized$tenure)
data_standardized$MonthlyCharges <- range01(data_standardized$MonthlyCharges)
data_standardized$TelcoServices <- range01(data_standardized$TelcoServices)

sub_data_standard <- data_standardized %>%
  dplyr::select(tenure, MonthlyCharges, TelcoServices)

summary(sub_data_standard)
##      tenure       MonthlyCharges   TelcoServices   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1127   1st Qu.:0.1725   1st Qu.:0.1250  
##  Median :0.3944   Median :0.5184   Median :0.3750  
##  Mean   :0.4426   Mean   :0.4632   Mean   :0.3934  
##  3rd Qu.:0.7606   3rd Qu.:0.7126   3rd Qu.:0.6250  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

How many clusters?

Now it’s time to determine how many clusters we should have at the end of our clustering procedure.

set.seed(42)

# i'm limiting the max number of clusters to 5 because a bigger number would leave us with a less interpretable set of clusters (in case of the data i'm working with)

fviz_nbclust(sub_data_standard, kmeans, method = "silhouette", k.max = 5)

fviz_nbclust(sub_data_standard, kmeans, method = "wss", k.max = 5)

The silhouette plot indicates that 4 clusters is the optimal choice, whereas the other plot suggests 3-4 clusters. I have tried to use 3 and 4 centers for k-means clustering and 3 centers seems to work better in terms of interpretation of our clusters.

K-means clustering

set.seed(42)

model_customers <- kmeans(sub_data_standard, centers = 3)

clust_customers <- model_customers$cluster

# adding the cluster number variable to several datasets I'll use later
sub_data_standard <- mutate(sub_data_standard, cluster = clust_customers)
data_standardized <- mutate(data_standardized, cluster = clust_customers)
data_clean <- mutate(data_clean, cluster = clust_customers)
data_full <- mutate(data_full, cluster = clust_customers)

sub_data_standard %>% 
  group_by(cluster) %>% 
  count()
## # A tibble: 3 x 2
## # Groups:   cluster [3]
##   cluster     n
##     <int> <int>
## 1       1  2670
## 2       2  2347
## 3       3  2015

Cluster description

Now, I will inspect the clusters and describe them based on our 3 clustering variables: tenure, MonthlyCharges, and TelcoServices.

data_clean %>%
  dplyr::select(tenure, MonthlyCharges, TelcoServices, cluster) %>%
  group_by(cluster) %>% 
  summarize(mean_monthly_charges = mean(MonthlyCharges), mean_telco_services = mean(TelcoServices), mean_tenure = mean(tenure))
## # A tibble: 3 x 4
##   cluster mean_monthly_charges mean_telco_services mean_tenure
##     <int>                <dbl>               <dbl>       <dbl>
## 1       1                 73.5                4.04        13.1
## 2       2                 89.0                6.52        57.1
## 3       3                 25.2                1.52        29.4
data_clean %>%
  dplyr::select(tenure, MonthlyCharges, TelcoServices, cluster) %>%
  group_by(cluster) %>% 
  summarize(median_monthly_charges = median(MonthlyCharges), median_telco_services = median(TelcoServices), median_tenure = median(tenure))
## # A tibble: 3 x 4
##   cluster median_monthly_charges median_telco_services median_tenure
##     <int>                  <dbl>                 <dbl>         <dbl>
## 1       1                   74.9                     4            10
## 2       2                   92                       7            60
## 3       3                   20.5                     1            25

tenure

  • min: 1 month

  • max: 72 months (6 years)

MonthlyCharges

  • min: 18.25 dollars

  • max: 118.75 dollars

TelcoServices

  • min: 1 service

  • max: 9 services


Cluster 1
consists of customers who, on average, had been with the company for 1 year and 1 month, whose monthly charges were relatively high, 74 dollars, and who used around 4 services of the company.

Cluster 2 consists of customers who, on average, had been with the company for 4 years and 9 months, whose monthly charge was the highest among clusters, 89 dollars, and who used 7 services on average.

Cluster 3 consists of customers who, on average, had been with the company for 2 years and 5 months, whose monthly charge was the lowest among clusters, 25 dollars, and who used 1-2 services on average.

So, the clusters can be shortly described as:

  • Cluster 1: short tenure, high monthly pay, half of the services used.

  • Cluster 2: long tenure, high monthly pay, almost all of the services used.

  • Cluster 3: long tenure, low monthly pay, 1-2 services used.

Churn among clusters

Next, I will examine churn rates among each cluster.

ggplot(data_clean, aes(cluster, fill = Churn)) +
  geom_bar() + scale_y_continuous(limits = c(0, 3000)) + ylab("Number of customers") + xlab("Cluster")

ggplot(data_clean, aes(cluster, fill = Churn)) +
  geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab("Number of customers") + xlab("Cluster")

data_clean %>%
  dplyr::select(cluster, Churn) %>%
  group_by(cluster, Churn) %>%
  count()
## # A tibble: 6 x 3
## # Groups:   cluster, Churn [6]
##   cluster Churn     n
##     <int> <fct> <int>
## 1       1 Yes    1261
## 2       1 No     1409
## 3       2 Yes     362
## 4       2 No     1985
## 5       3 Yes     246
## 6       3 No     1769

Cluster 1 had the highest share of customers who left the company (47% left, n=2670), therefore, I will explore this segment further.

Churn prediction

Data preparation

Creating a separate dataset for Cluster 1. The cluster (represents the cluster a customer is assigned to) and the TelcoServices variables can be removed since they are not needed anymore. I will also replace the recoded InternetService variable with its original version comprised of 3 categories (DSL, Fiber optic, and No Internet service) each representing a customer’s Internet service provider. This will be helpful later in our regression analysis.

data_clean$InternetService <- data_full$InternetService

cluster_1_data <- data_clean %>%
  filter(cluster == "1")

cluster_1_data <- cluster_1_data[-c(20, 21)]

Before running logistic regression, I will make sure that all variable types are correct.

cluster_1_data[c(6, 7)] <- lapply(cluster_1_data[c(6, 7)], as.factor)
cluster_1_data[c(9:14)] <- lapply(cluster_1_data[c(9:14)], as.factor)

# there are no customers without internet service in Cluster 1. so, I'll remove this option

cluster_1_data$InternetService <- if_else(cluster_1_data$InternetService == "DSL", "DSL", if_else(cluster_1_data$InternetService == "Fiber optic", "Fiber optic", remove()))

cluster_1_data$InternetService <- as.factor(cluster_1_data$InternetService)

glimpse(cluster_1_data)
## Rows: 2,670
## Columns: 19
## $ gender           <fct> Male, Male, Female, Female, Male, Male, Female, Femal~
## $ SeniorCitizen    <fct> No, No, No, No, No, No, No, No, Yes, No, No, Yes, No,~
## $ Partner          <fct> No, No, No, No, No, Yes, Yes, No, No, No, No, Yes, Ye~
## $ Dependents       <fct> No, No, No, No, Yes, Yes, Yes, No, No, No, Yes, No, Y~
## $ tenure           <int> 34, 2, 2, 8, 22, 13, 10, 21, 1, 30, 17, 2, 27, 1, 5, ~
## $ PhoneService     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ MultipleLines    <fct> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
## $ InternetService  <fct> DSL, DSL, Fiber optic, Fiber optic, Fiber optic, DSL,~
## $ OnlineSecurity   <fct> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ OnlineBackup     <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,~
## $ DeviceProtection <fct> 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,~
## $ TechSupport      <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ StreamingTV      <fct> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,~
## $ StreamingMovies  <fct> 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,~
## $ Contract         <fct> One year, Month-to-month, Month-to-month, Month-to-mo~
## $ PaperlessBilling <fct> No, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, ~
## $ PaymentMethod    <fct> Mailed check, Mailed check, Electronic check, Electro~
## $ MonthlyCharges   <dbl> 56.95, 53.85, 70.70, 99.65, 89.10, 49.95, 55.20, 90.0~
## $ Churn            <fct> No, Yes, Yes, Yes, No, No, Yes, No, Yes, No, Yes, No,~

Binary logistic regression

Setting a reference class to “No” (customer stayed).

cluster_1_data$Churn <- relevel(cluster_1_data$Churn, ref = "No")

First, I’m using all variables as predictors of churn.

logitModelFull1 <- glm(Churn ~ gender + SeniorCitizen + Partner + Dependents + tenure + PhoneService + MultipleLines + InternetService + TechSupport + OnlineBackup + OnlineSecurity + DeviceProtection + StreamingTV + StreamingMovies + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges, family = "binomial", data = cluster_1_data)

summary(logitModelFull1)
## 
## Call:
## glm(formula = Churn ~ gender + SeniorCitizen + Partner + Dependents + 
##     tenure + PhoneService + MultipleLines + InternetService + 
##     TechSupport + OnlineBackup + OnlineSecurity + DeviceProtection + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges, family = "binomial", data = cluster_1_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9701  -0.9724  -0.2817   0.9348   2.7525  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           1.017430   1.135881   0.896   0.3704    
## genderMale                           -0.108522   0.087969  -1.234   0.2173    
## SeniorCitizenYes                      0.161299   0.114239   1.412   0.1580    
## PartnerYes                           -0.012899   0.107364  -0.120   0.9044    
## DependentsYes                         0.019790   0.130372   0.152   0.8793    
## tenure                               -0.050454   0.004891 -10.317  < 2e-16 ***
## PhoneService1                         0.094426   0.913894   0.103   0.9177    
## MultipleLines1                        0.516257   0.240921   2.143   0.0321 *  
## InternetServiceFiber optic            1.873673   1.108938   1.690   0.0911 .  
## TechSupport1                         -0.131855   0.249546  -0.528   0.5972    
## OnlineBackup1                        -0.011320   0.242922  -0.047   0.9628    
## OnlineSecurity1                      -0.191426   0.247210  -0.774   0.4387    
## DeviceProtection1                     0.161894   0.242606   0.667   0.5046    
## StreamingTV1                          0.516579   0.449503   1.149   0.2505    
## StreamingMovies1                      0.557640   0.452342   1.233   0.2177    
## ContractOne year                     -0.906933   0.208783  -4.344 1.40e-05 ***
## ContractTwo year                     -2.610385   1.024992  -2.547   0.0109 *  
## PaperlessBillingYes                   0.412028   0.099510   4.141 3.46e-05 ***
## PaymentMethodCredit card (automatic) -0.034001   0.170437  -0.199   0.8419    
## PaymentMethodElectronic check         0.329177   0.136536   2.411   0.0159 *  
## PaymentMethodMailed check             0.088153   0.160956   0.548   0.5839    
## MonthlyCharges                       -0.036591   0.044114  -0.829   0.4068    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3693.2  on 2669  degrees of freedom
## Residual deviance: 3051.2  on 2648  degrees of freedom
## AIC: 3095.2
## 
## Number of Fisher Scoring iterations: 6
coefsexp <- round(exp(logitModelFull1$coefficients), 2)
coefsexp
##                          (Intercept)                           genderMale 
##                                 2.77                                 0.90 
##                     SeniorCitizenYes                           PartnerYes 
##                                 1.18                                 0.99 
##                        DependentsYes                               tenure 
##                                 1.02                                 0.95 
##                        PhoneService1                       MultipleLines1 
##                                 1.10                                 1.68 
##           InternetServiceFiber optic                         TechSupport1 
##                                 6.51                                 0.88 
##                        OnlineBackup1                      OnlineSecurity1 
##                                 0.99                                 0.83 
##                    DeviceProtection1                         StreamingTV1 
##                                 1.18                                 1.68 
##                     StreamingMovies1                     ContractOne year 
##                                 1.75                                 0.40 
##                     ContractTwo year                  PaperlessBillingYes 
##                                 0.07                                 1.51 
## PaymentMethodCredit card (automatic)        PaymentMethodElectronic check 
##                                 0.97                                 1.39 
##            PaymentMethodMailed check                       MonthlyCharges 
##                                 1.09                                 0.96

The following variables did have a statistically significant influence on customer churn in Cluster 1 (p-value < 0.05):

  • tenure
  • Multiple lines (Yes)
  • Contract (One-year)
  • Contract (Two-year)
  • Paperless Billing (Yes)
  • PaymentMethod (Electronic check)

Coefficient interpretations for the variables mentioned above:

  • tenure - 0.95 (For every 1 month increase in tenure, the odds of churning decreases by 5%.)

  • Multiple lines (Yes) - 1.68 (The odds of churning for customers with the multiple lines service are 68% higher as compared to customers without it.)

  • Contract (One-year) - 0.40 (The odds of churning for customers with the one-year contract are 60% lower as compared to customers with the month-to-month contract.)

  • Contract (Two-year) - 0.07 (The odds of churning for customers with the one-year contract are 93% lower as compared to customers with the month-to-month contract.)

  • Paperless Billing (Yes) - 1.51 (The odds of churning for customers with paperless billing are 51% higher as compared to customers with the paper billing option.)

  • PaymentMethod (Electronic check) - 1.39 (The odds of churning for customers with electronic check as their payment method are 39% higher as compared to customers with the mailed check payment method.)

Let’s visualize churn among our statistically significant variables for Cluster 1.

cluster_1_data$Churn <- relevel(cluster_1_data$Churn, ref = "Yes")

p20 <- ggplot(cluster_1_data, aes(fill = Churn, x = Contract)) +
  geom_bar(position = "fill") + coord_flip() + ylab(" ") + xlab("Contract type") + scale_y_continuous(labels = scales::percent)

p21 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaperlessBilling)) + geom_bar(position = "fill") + ylab(" ") + xlab("Is billing paperless?") + scale_y_continuous(labels = scales::percent)

p22 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaymentMethod)) +
  geom_bar(position = "fill") + coord_flip() + ylab(" ") + xlab("Payment method") + scale_y_continuous(labels = scales::percent)

p20 + p21 + p22 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")

p20 <- ggplot(cluster_1_data, aes(fill = Churn, x = Contract)) +
  geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Contract type") + scale_y_continuous(limits = c(0, 3000))

p21 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaperlessBilling)) + geom_bar(position = "stack") + ylab(" ") + xlab("Is billing paperless?") + scale_y_continuous(limits = c(0, 3000))

p22 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaymentMethod)) +
  geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Payment method") + scale_y_continuous(limits = c(0, 3000))

p20 + p21 + p22 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")

p23 <- ggplot(cluster_1_data, aes(fill = Churn, x = MultipleLines)) +
  geom_bar(position = "fill") + ylab(" ") + scale_y_continuous(labels = scales::percent)

p24 <- ggplot(cluster_1_data, aes(fill = Churn, x = MultipleLines)) +
  geom_bar(position = "stack") + ylab(" ") + scale_y_continuous(limits = c(0, 3000))

p23 + p24 + plot_annotation(title = "Customer churn by phone-related services") + plot_layout(guides = "collect")

p25 <- ggplot(cluster_1_data, aes(fill = Churn, x = tenure)) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 1) + xlab("Number of months with the company (binwidth = 1 month)")

p25 + plot_annotation(title = "Customer churn by tenure")

Overall, these results correspond with our previous finding that those who stayed with the company for longer were less likely to churn. Most of the Cluster 1 customers who left had been using the company’s services for less than 10 months.

Next, I’m using the stepAIC() function to determine the model with an optimal set of independent variables. I’ll use this model later to make predictions.

logitModelNew1 <- stepAIC(logitModelFull1, trace = 0)
summary(logitModelNew1)
## 
## Call:
## glm(formula = Churn ~ SeniorCitizen + tenure + MultipleLines + 
##     InternetService + OnlineSecurity + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + MonthlyCharges, 
##     family = "binomial", data = cluster_1_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7786  -0.9389   0.2726   0.9706   1.9889  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -0.917991   0.434193  -2.114 0.034494 *  
## SeniorCitizenYes                     -0.169345   0.111494  -1.519 0.128795    
## tenure                                0.050791   0.004769  10.650  < 2e-16 ***
## MultipleLines1                       -0.503442   0.107411  -4.687 2.77e-06 ***
## InternetServiceFiber optic           -1.825828   0.252348  -7.235 4.64e-13 ***
## OnlineSecurity1                       0.212038   0.123138   1.722 0.085076 .  
## StreamingTV1                         -0.490939   0.127600  -3.847 0.000119 ***
## StreamingMovies1                     -0.533569   0.126403  -4.221 2.43e-05 ***
## ContractOne year                      0.899755   0.206686   4.353 1.34e-05 ***
## ContractTwo year                      2.626024   1.019904   2.575 0.010030 *  
## PaperlessBillingYes                  -0.404708   0.099067  -4.085 4.40e-05 ***
## PaymentMethodCredit card (automatic)  0.041778   0.169641   0.246 0.805471    
## PaymentMethodElectronic check        -0.322014   0.136031  -2.367 0.017923 *  
## PaymentMethodMailed check            -0.070405   0.160075  -0.440 0.660064    
## MonthlyCharges                        0.033597   0.008585   3.913 9.10e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3693.2  on 2669  degrees of freedom
## Residual deviance: 3056.1  on 2655  degrees of freedom
## AIC: 3086.1
## 
## Number of Fisher Scoring iterations: 6

Model fit assessment

Our model’s Pseudo R-squared (McFadden’s R2) can be interpreted as good fit since the value is around 0.2.

LogRegR2(logitModelNew1)
## Chi2                 637.0748 
## Df                   14 
## Sig.                 0 
## Cox and Snell Index  0.2122739 
## Nagelkerke Index     0.2833226 
## McFadden's R2        0.1724995

Prediction

Next, I will split the data into train and test datasets, fit a logistic regression model and evaluate its performance.

cluster_1_data$Churn <- relevel(cluster_1_data$Churn, ref = "No")
set.seed(42)

cluster_1_data$isTrain <- rbinom(nrow(cluster_1_data), 1, 0.80)
train <- subset(cluster_1_data, cluster_1_data$isTrain == 1)
test <- subset(cluster_1_data, cluster_1_data$isTrain == 0)
# checking distributions
summary(test)
##     gender    SeniorCitizen Partner   Dependents     tenure      PhoneService
##  Female:294   No :435       No :365   No :436    Min.   : 1.00   0: 24       
##  Male  :249   Yes:108       Yes:178   Yes:107    1st Qu.: 3.00   1:519       
##                                                  Median :10.00               
##                                                  Mean   :13.22               
##                                                  3rd Qu.:22.00               
##                                                  Max.   :46.00               
##  MultipleLines    InternetService OnlineSecurity OnlineBackup DeviceProtection
##  0:348         DSL        :199    0:434          0:390        0:404           
##  1:195         Fiber optic:344    1:109          1:153        1:139           
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##  TechSupport StreamingTV StreamingMovies           Contract   PaperlessBilling
##  0:427       0:333       0:347           Month-to-month:471   No :167         
##  1:116       1:210       1:196           One year      : 65   Yes:376         
##                                          Two year      :  7                   
##                                                                               
##                                                                               
##                                                                               
##                    PaymentMethod MonthlyCharges   Churn        isTrain 
##  Bank transfer (automatic): 69   Min.   : 40.15   No :289   Min.   :0  
##  Credit card (automatic)  : 85   1st Qu.: 60.20   Yes:254   1st Qu.:0  
##  Electronic check         :283   Median : 75.15             Median :0  
##  Mailed check             :106   Mean   : 73.98             Mean   :0  
##                                  3rd Qu.: 85.85             3rd Qu.:0  
##                                  Max.   :110.10             Max.   :0
summary(train)
##     gender     SeniorCitizen Partner    Dependents     tenure      PhoneService
##  Female:1036   No :1701      No :1421   No :1704   Min.   : 1.00   0: 118      
##  Male  :1091   Yes: 426      Yes: 706   Yes: 423   1st Qu.: 3.00   1:2009      
##                                                    Median :10.00               
##                                                    Mean   :13.02               
##                                                    3rd Qu.:21.00               
##                                                    Max.   :46.00               
##  MultipleLines    InternetService OnlineSecurity OnlineBackup DeviceProtection
##  0:1335        DSL        : 822   0:1628         0:1558       0:1545          
##  1: 792        Fiber optic:1305   1: 499         1: 569       1: 582          
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##  TechSupport StreamingTV StreamingMovies           Contract    PaperlessBilling
##  0:1647      0:1369      0:1343          Month-to-month:1879   No : 666        
##  1: 480      1: 758      1: 784          One year      : 204   Yes:1461        
##                                          Two year      :  44                   
##                                                                                
##                                                                                
##                                                                                
##                    PaymentMethod  MonthlyCharges   Churn         isTrain 
##  Bank transfer (automatic): 307   Min.   : 35.45   No :1120   Min.   :1  
##  Credit card (automatic)  : 293   1st Qu.: 59.10   Yes:1007   1st Qu.:1  
##  Electronic check         :1093   Median : 74.80              Median :1  
##  Mailed check             : 434   Mean   : 73.32              Mean   :1  
##                                   3rd Qu.: 85.70              3rd Qu.:1  
##                                   Max.   :112.95              Max.   :1
logitTrainNew <- glm(Churn ~ SeniorCitizen + tenure + MultipleLines + InternetService + OnlineSecurity + StreamingTV + StreamingMovies + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges, family = "binomial", data = train[-20])
train$pred <- predict(logitTrainNew, train[-20], type = 'response')
train$pred <- ifelse(train$pred > 0.5, "Yes", "No")

test$pred <- predict(logitTrainNew, type = 'response', newdata = test[-c(19, 20)])
test$pred <- ifelse(test$pred > 0.5, "Yes", "No")
confusionMatrix(as.factor(train$pred), as.factor(train$Churn), positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  768 306
##        Yes 352 701
##                                           
##                Accuracy : 0.6906          
##                  95% CI : (0.6705, 0.7102)
##     No Information Rate : 0.5266          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.381           
##                                           
##  Mcnemar's Test P-Value : 0.07938         
##                                           
##             Sensitivity : 0.6961          
##             Specificity : 0.6857          
##          Pos Pred Value : 0.6657          
##          Neg Pred Value : 0.7151          
##              Prevalence : 0.4734          
##          Detection Rate : 0.3296          
##    Detection Prevalence : 0.4951          
##       Balanced Accuracy : 0.6909          
##                                           
##        'Positive' Class : Yes             
## 

For train data, confusion matrix indicated that sensitivity = 0.70 and specificity = 0.69.

Accuracy is 0.69.

confusionMatrix(as.factor(test$pred), as.factor(test$Churn), positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  196  74
##        Yes  93 180
##                                          
##                Accuracy : 0.6924         
##                  95% CI : (0.6517, 0.731)
##     No Information Rate : 0.5322         
##     P-Value [Acc > NIR] : 2.044e-14      
##                                          
##                   Kappa : 0.3851         
##                                          
##  Mcnemar's Test P-Value : 0.1637         
##                                          
##             Sensitivity : 0.7087         
##             Specificity : 0.6782         
##          Pos Pred Value : 0.6593         
##          Neg Pred Value : 0.7259         
##              Prevalence : 0.4678         
##          Detection Rate : 0.3315         
##    Detection Prevalence : 0.5028         
##       Balanced Accuracy : 0.6934         
##                                          
##        'Positive' Class : Yes            
## 

For test data, sensitivity = 0.71 and specificity = 0.68.

Accuracy in this case is 0.69, which means that the model predicts churn 69% of the time.

Conclusion

Earlier, I performed clustering that resulted in the following customer segments:

  • Cluster 1: short tenure, high monthly pay, half of the services used.

  • Cluster 2: long tenure, high monthly pay, almost all of the services used.

  • Cluster 3: long tenure, low monthly pay, 1-2 services used.

Then, I explored churn for Cluster 1 in which 47% of customers left the company this month (n=2670).

Here are the insights for Cluster 1 customer churn:

  • Longer staying customers were less likely to churn.

  • Customers with the 1 year and 2 year contracts were less likely to churn as compared to people using the month-to-month contract type.

  • Customers with paperless billing were more likely to churn as compared to the people with paper billing.

  • Customers who had multiple lines were more likely to churn as compared to those who had not.

  • Churn was higher among customers with the Fiber optic Internet provider as compared to those using the DSL one.