In this project, I am going to be working with the “Telco Customer Churn” dataset to analyze customer churn and outline its predictors. The project consists of the following main steps:
Data loading, preprocessing, and exploration
K-means clustering to outline customer segments
Binary logistic regression to predict churn (for a segment with the highest churn)
There are 21 variables and 7043 observations in the dataset. They are described on Kaggle as follows:
customerID
- A customer’s ID
(1) demographic variables:
gender
- A customer’s gender
SeniorCitizen
- Whether the customer is a senior citizen
or not (1, 0)
Partner
- Whether the customer has a partner or not (Yes,
No)
Dependents
- Whether the customer has dependents or not
(Yes, No)
(2) variables related to the company’s services:
PhoneService
- Whether the customer has a phone service
or not (Yes, No)
MultipleLines
- Whether the customer has multiple lines
or not (Yes, No, No phone service)
InternetService
- Customer’s internet service provider
(DSL, Fiber optic, No)
OnlineSecurity
- Whether the customer has online
security or not (Yes, No, No internet service)
OnlineBackup
- Whether the customer has online backup or
not (Yes, No, No internet service)
DeviceProtection
- Whether the customer has device
protection or not (Yes, No, No internet service)
TechSupport
- Whether the customer has tech support or
not (Yes, No, No internet service)
StreamingTV
- Whether the customer has streaming TV or
not (Yes, No, No internet service)
StreamingMovies
- Whether the customer has streaming
movies or not (Yes, No, No internet service)
(3) contract and payment details:
Contract
- The contract term of the customer
(Month-to-month, One year, Two year)
PaperlessBilling
- Whether the customer has paperless
billing or not (Yes, No)
PaymentMethod
- The customer’s payment method
(Electronic check, Mailed check, Bank transfer (automatic), Credit card
(automatic))
(4) tenure and charges details:
tenure
- Number of months the customer has stayed with
the company
MonthlyCharges
- The amount charged to the customer
monthly
TotalCharges
- The total amount charged to the
customer
(5) the variable of interest:
Churn
- Whether the customer have left the company
within the last month (Yes or No)
Importing the libraries.
library(dplyr)
library(ggplot2)
library(patchwork)
library(factoextra)
library(MASS)
library(descr)
library(caret)
Let’s have a first look at our data and check/correct the variable types and coding.
data <- read.csv("telco_customerchurn.csv")
glimpse(data)
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW~
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",~
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes~
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"~
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "~
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt~
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "~
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N~
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y~
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes~
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye~
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes~
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One ~
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", ~
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "~
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y~
data$SeniorCitizen <- if_else(data$SeniorCitizen == 1, "Yes", "No")
data[-c(1, 6, 19, 20)] <- lapply(data[-c(1, 6, 19, 20)], as.factor)
summary(data)
## customerID gender SeniorCitizen Partner Dependents
## Length:7043 Female:3488 No :5901 No :3641 No :4933
## Class :character Male :3555 Yes:1142 Yes:3402 Yes:2110
## Mode :character
##
##
##
##
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies Contract
## No :2810 No :2785 Month-to-month:3875
## No internet service:1526 No internet service:1526 One year :1473
## Yes :2707 Yes :2732 Two year :1695
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges
## No :2872 Bank transfer (automatic):1544 Min. : 18.25
## Yes:4171 Credit card (automatic) :1522 1st Qu.: 35.50
## Electronic check :2365 Median : 70.35
## Mailed check :1612 Mean : 64.76
## 3rd Qu.: 89.85
## Max. :118.75
##
## TotalCharges Churn
## Min. : 18.8 No :5174
## 1st Qu.: 401.4 Yes:1869
## Median :1397.5
## Mean :2283.3
## 3rd Qu.:3794.7
## Max. :8684.8
## NA's :11
We can notice 11 missing values in the TotalCharges
variable, this will be addressed later.
Summarizing the data by churn: out of the total 7043 customers 26.5% have churned (1869 customers).
data$Churn <- relevel(data$Churn, ref = "Yes")
ggplot(data, aes(Churn, fill = Churn)) + geom_bar(width = 0.6) + ggtitle("26.5% of customers left the company last month (n=7043)") + ylab("Number of customers") + xlab("Churn") + scale_y_continuous(limits = c(0, 6000))
Moving forward, there were only 11 missing values, which constitutes approximately 0.2% of the dataset, hence, these 11 observations can be safely removed.
# how many missing values are there
missing_values <- sum(is.na(data))
missing_values
## [1] 11
# share of missing values
missing_values/nrow(data)
## [1] 0.001561834
data %>% filter(is.na(TotalCharges))
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 4472-LVYGI Female No Yes Yes 0 No
## 2 3115-CZMZD Male No No Yes 0 Yes
## 3 5709-LVOEQ Female No Yes Yes 0 Yes
## 4 4367-NUYAO Male No Yes Yes 0 Yes
## 5 1371-DWPAZ Female No Yes Yes 0 No
## 6 7644-OMVMY Male No Yes Yes 0 Yes
## 7 3213-VVOLG Male No Yes Yes 0 Yes
## 8 2520-SGTTA Female No Yes Yes 0 Yes
## 9 2923-ARZLG Male No Yes Yes 0 Yes
## 10 4075-WKNIU Female No Yes Yes 0 Yes
## 11 2775-SEFEE Male No No Yes 0 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL Yes No
## 2 No No No internet service No internet service
## 3 No DSL Yes Yes
## 4 Yes No No internet service No internet service
## 5 No phone service DSL Yes Yes
## 6 No No No internet service No internet service
## 7 Yes No No internet service No internet service
## 8 No No No internet service No internet service
## 9 No No No internet service No internet service
## 10 Yes DSL No Yes
## 11 Yes DSL Yes Yes
## DeviceProtection TechSupport StreamingTV
## 1 Yes Yes Yes
## 2 No internet service No internet service No internet service
## 3 Yes No Yes
## 4 No internet service No internet service No internet service
## 5 Yes Yes Yes
## 6 No internet service No internet service No internet service
## 7 No internet service No internet service No internet service
## 8 No internet service No internet service No internet service
## 9 No internet service No internet service No internet service
## 10 Yes Yes Yes
## 11 No Yes No
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 1 No Two year Yes Bank transfer (automatic)
## 2 No internet service Two year No Mailed check
## 3 Yes Two year No Mailed check
## 4 No internet service Two year No Mailed check
## 5 No Two year No Credit card (automatic)
## 6 No internet service Two year No Mailed check
## 7 No internet service Two year No Mailed check
## 8 No internet service Two year No Mailed check
## 9 No internet service One year Yes Mailed check
## 10 No Two year No Mailed check
## 11 No Two year Yes Bank transfer (automatic)
## MonthlyCharges TotalCharges Churn
## 1 52.55 NA No
## 2 20.25 NA No
## 3 80.85 NA No
## 4 25.75 NA No
## 5 56.05 NA No
## 6 19.85 NA No
## 7 25.35 NA No
## 8 20.00 NA No
## 9 19.70 NA No
## 10 73.35 NA No
## 11 61.90 NA No
data_full <- na.omit(data)
Now, let’s look at customer churn among variables containing demographic, services and customer account information.
p1 <- ggplot(data_full, aes(fill = Churn, x = gender)) +
geom_bar(position = "stack") + xlab("Gender?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))
p2 <- ggplot(data_full, aes(fill = Churn, x = SeniorCitizen)) +
geom_bar(position = "stack") + xlab("Is a senior citizen?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))
p3 <- ggplot(data_full, aes(fill = Churn, x = Partner)) +
geom_bar(position = "stack") + xlab("Has a partner?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))
p4 <- ggplot(data_full, aes(fill = Churn, x = Dependents)) +
geom_bar(position = "stack") + xlab("Has dependents?") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))
p1 + p2 + p3 + p4 + plot_annotation(title = "Customer churn by demographic variables") + plot_layout(guides = "collect")
p1 <- ggplot(data_full, aes(fill = Churn, x = gender)) +
geom_bar(position = "fill") + xlab("Gender?") + scale_y_continuous(labels = scales::percent) + ylab(" ")
p2 <- ggplot(data_full, aes(fill = Churn, x = SeniorCitizen)) +
geom_bar(position = "fill") + xlab("Is a senior citizen?") + scale_y_continuous(labels = scales::percent) + ylab(" ")
p3 <- ggplot(data_full, aes(fill = Churn, x = Partner)) +
geom_bar(position = "fill") + xlab("Has a partner?") + scale_y_continuous(labels = scales::percent) + ylab(" ")
p4 <- ggplot(data_full, aes(fill = Churn, x = Dependents)) +
geom_bar(position = "fill") + xlab("Has dependents?") + scale_y_continuous(labels = scales::percent) + ylab(" ")
p1 + p2 + p3 + p4 + plot_annotation(title = "Customer churn by demographic variables") + plot_layout(guides = "collect")
p6 <- ggplot(data_full, aes(fill = Churn, x = PhoneService)) +
geom_bar(position = "stack") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))
p7 <- ggplot(data_full, aes(fill = Churn, x = MultipleLines)) +
geom_bar(position = "stack") + ylab(" ") + scale_y_continuous(limits = c(0, 6000))
p6 + p7 + plot_annotation(title = "Customer churn by phone service details") + plot_layout(nrow = 1, guides = "collect")
p6 <- ggplot(data_full, aes(fill = Churn, x = PhoneService)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ")
p7 <- ggplot(data_full, aes(fill = Churn, x = MultipleLines)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ")
p6 + p7 + plot_annotation(title = "Customer churn by phone service details") + plot_layout(nrow = 1, guides = "collect")
p8 <- ggplot(data_full, aes(fill = Churn, x = InternetService)) +
geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))
p9 <- ggplot(data_full, aes(fill = Churn, x = OnlineSecurity)) +
geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))
p10 <- ggplot(data_full, aes(fill = Churn, x = OnlineBackup)) +
geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))
p11 <- ggplot(data_full, aes(fill = Churn, x = DeviceProtection)) +
geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))
p12 <- ggplot(data_full, aes(fill = Churn, x = TechSupport)) +
geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))
p13 <- ggplot(data_full, aes(fill = Churn, x = StreamingTV)) +
geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))
p14 <- ggplot(data_full, aes(fill = Churn, x = StreamingMovies)) +
geom_bar(position = "stack") + ylab(" ") + coord_flip() + scale_y_continuous(limits = c(0, 6000))
p8 + p9 + p10 + p11 + p12 + p13 + p14 + plot_annotation(title = "Customer churn by internet service details") + plot_layout(nrow = 4, guides = "collect")
p8 <- ggplot(data_full, aes(fill = Churn, x = InternetService)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()
p9 <- ggplot(data_full, aes(fill = Churn, x = OnlineSecurity)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()
p10 <- ggplot(data_full, aes(fill = Churn, x = OnlineBackup)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()
p11 <- ggplot(data_full, aes(fill = Churn, x = DeviceProtection)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()
p12 <- ggplot(data_full, aes(fill = Churn, x = TechSupport)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()
p13 <- ggplot(data_full, aes(fill = Churn, x = StreamingTV)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()
p14 <- ggplot(data_full, aes(fill = Churn, x = StreamingMovies)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + coord_flip()
p8 + p9 + p10 + p11 + p12 + p13 + p14 + plot_annotation(title = "Customer churn by internet service details") + plot_layout(nrow = 4, guides = "collect")
p15 <- ggplot(data_full, aes(fill = Churn, x = Contract)) +
geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Contract type") + scale_y_continuous(limits = c(0, 6000))
p16 <- ggplot(data_full, aes(fill = Churn, x = PaperlessBilling)) +
geom_bar(position = "stack") + ylab(" ") + xlab("Is billing paperless?") + scale_y_continuous(limits = c(0, 6000))
p17 <- ggplot(data_full, aes(fill = Churn, x = PaymentMethod)) +
geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Payment method") + scale_y_continuous(limits = c(0, 6000))
p15 + p16 + p17 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")
p15 <- ggplot(data_full, aes(fill = Churn, x = Contract)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + coord_flip() + ylab(" ") + xlab("Contract type")
p16 <- ggplot(data_full, aes(fill = Churn, x = PaperlessBilling)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab(" ") + xlab("Is billing paperless?")
p17 <- ggplot(data_full, aes(fill = Churn, x = PaymentMethod)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + coord_flip() + ylab(" ") + xlab("Payment method")
p15 + p16 + p17 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")
p5 <- ggplot(data_full, aes(fill = Churn, x = tenure)) +
geom_histogram(position = "identity", alpha = 0.6, binwidth = 1) + xlab("Number of months with the company (binwidth = 1 month)") + scale_x_continuous(limits = c(0, 72))
p18 <- ggplot(data_full, aes(fill = Churn, x = MonthlyCharges)) +
geom_histogram(position = "identity", alpha = 0.6, binwidth = 1) + scale_x_continuous(limits = c(0, 120)) + xlab("Monthly charges (binwidth = 1 dollar)")
p19 <- ggplot(data_full, aes(fill = Churn, x = TotalCharges)) +
geom_histogram(position = "identity", alpha = 0.6, binwidth = 100) + scale_x_continuous(limits = c(0, 9000)) + xlab("Total charges (binwidth = 100 dollars)")
p5 + p18 + p19 + plot_annotation(title = "Customer churn by tenure, monthly and total charges") + plot_layout(nrow = 3, guides = "collect")
Churn was higher among:
senior citizens
people without a partner
people without dependents
customers with month-to-month contracts
customers with electronic check as their payment method
customers with paperless billing
customers whose Internet service provider was Fiber optic
customers without online security service
customers without online backup
customers without tech support
customers without device protection
customers who had been using the company’s services for around 1-9 months (max tenure was 6 years); the longer customers stayed with the company the less likely they were to churn.
First, I will check correlations between our numeric variables
TotalCharges
, MonthlyCharges
and
tenure
to see whether all of them should be used for
clustering.
Since the TotalCharges
variable has moderate and high
correlations with the other 2 variables and will not enrich our
interpretation of clusters I will remove it. (I will also remove the
customerID
variable since it is not useful for us in this
project.)
cor(data_full$TotalCharges, data_full$MonthlyCharges)
## [1] 0.6510648
cor(data_full$TotalCharges, data_full$tenure)
## [1] 0.8258805
cor(data_full$tenure, data_full$MonthlyCharges)
## [1] 0.2468618
data_clean <- data_full[-c(1, 20)]
glimpse(data_clean)
## Rows: 7,032
## Columns: 19
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N~
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, ~
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No ~
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in~
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in~
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte~
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int~
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte~
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M~
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr~
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
Next, I will recode 9 variables concerned with the company’s services
to facilitate our analysis and create a new numeric variable for
clustering. Recoding was necessary since OnlineSecurity
,
OnlineBackup
, DeviceProtection
,
TechSupport
, StreamingTV
,
StreamingMovies
are dependent on whether a customer has
InternetService
or not. Same goes for
MultipleLines
and PhoneService
variables. Now,
for every variable listed above possible values will be 0 (No) and 1
(Yes) referring to whether a customer has purchased this service or
not.
data_clean$MultipleLines <- if_else(data_clean$MultipleLines == "Yes", 1, 0)
data_clean$InternetService <- if_else(data_clean$InternetService == "No", 0, 1)
data_clean$PhoneService <- if_else(data_clean$PhoneService == "Yes", 1, 0)
data_clean$OnlineSecurity <- if_else(data_clean$OnlineSecurity == "No internet service", 0, if_else(data_clean$OnlineSecurity == "No", 0, 1))
data_clean$OnlineBackup <- if_else(data_clean$OnlineBackup == "No internet service", 0, if_else(data_clean$OnlineBackup == "No", 0, 1))
data_clean$DeviceProtection <- if_else(data_clean$DeviceProtection == "No internet service", 0, if_else(data_clean$DeviceProtection == "No", 0, 1))
data_clean$TechSupport <- if_else(data_clean$TechSupport == "No internet service", 0, if_else(data_clean$TechSupport == "No", 0, 1))
data_clean$StreamingTV <- if_else(data_clean$StreamingTV == "No internet service", 0, if_else(data_clean$StreamingTV == "No", 0, 1))
data_clean$StreamingMovies <- if_else(data_clean$StreamingMovies == "No internet service", 0, if_else(data_clean$StreamingMovies == "No", 0, 1))
Our 9 service variables are now in the right format and I can create
a new variable, TelcoServices
, that represents the number
of the company’s services used by a customer (ranging from 1 to 9
services).
data_clean <- data_clean %>%
mutate(TelcoServices = StreamingMovies + StreamingTV + TechSupport + DeviceProtection + OnlineSecurity + OnlineBackup + InternetService + MultipleLines + PhoneService)
summary(data_clean[20])
## TelcoServices
## Min. :1.000
## 1st Qu.:2.000
## Median :4.000
## Mean :4.147
## 3rd Qu.:6.000
## Max. :9.000
Next, I will normalize our 3 clustering variables:
tenure
, MonthlyCharges
, and
TelcoServices
. Now they range from 0 to 1.
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
data_standardized <- data_clean
data_standardized$tenure <- range01(data_standardized$tenure)
data_standardized$MonthlyCharges <- range01(data_standardized$MonthlyCharges)
data_standardized$TelcoServices <- range01(data_standardized$TelcoServices)
sub_data_standard <- data_standardized %>%
dplyr::select(tenure, MonthlyCharges, TelcoServices)
summary(sub_data_standard)
## tenure MonthlyCharges TelcoServices
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1127 1st Qu.:0.1725 1st Qu.:0.1250
## Median :0.3944 Median :0.5184 Median :0.3750
## Mean :0.4426 Mean :0.4632 Mean :0.3934
## 3rd Qu.:0.7606 3rd Qu.:0.7126 3rd Qu.:0.6250
## Max. :1.0000 Max. :1.0000 Max. :1.0000
Now it’s time to determine how many clusters we should have at the end of our clustering procedure.
set.seed(42)
# i'm limiting the max number of clusters to 5 because a bigger number would leave us with a less interpretable set of clusters (in case of the data i'm working with)
fviz_nbclust(sub_data_standard, kmeans, method = "silhouette", k.max = 5)
fviz_nbclust(sub_data_standard, kmeans, method = "wss", k.max = 5)
The silhouette plot indicates that 4 clusters is the optimal choice, whereas the other plot suggests 3-4 clusters. I have tried to use 3 and 4 centers for k-means clustering and 3 centers seems to work better in terms of interpretation of our clusters.
set.seed(42)
model_customers <- kmeans(sub_data_standard, centers = 3)
clust_customers <- model_customers$cluster
# adding the cluster number variable to several datasets I'll use later
sub_data_standard <- mutate(sub_data_standard, cluster = clust_customers)
data_standardized <- mutate(data_standardized, cluster = clust_customers)
data_clean <- mutate(data_clean, cluster = clust_customers)
data_full <- mutate(data_full, cluster = clust_customers)
sub_data_standard %>%
group_by(cluster) %>%
count()
## # A tibble: 3 x 2
## # Groups: cluster [3]
## cluster n
## <int> <int>
## 1 1 2670
## 2 2 2347
## 3 3 2015
Now, I will inspect the clusters and describe them based on our 3
clustering variables: tenure
, MonthlyCharges
,
and TelcoServices
.
data_clean %>%
dplyr::select(tenure, MonthlyCharges, TelcoServices, cluster) %>%
group_by(cluster) %>%
summarize(mean_monthly_charges = mean(MonthlyCharges), mean_telco_services = mean(TelcoServices), mean_tenure = mean(tenure))
## # A tibble: 3 x 4
## cluster mean_monthly_charges mean_telco_services mean_tenure
## <int> <dbl> <dbl> <dbl>
## 1 1 73.5 4.04 13.1
## 2 2 89.0 6.52 57.1
## 3 3 25.2 1.52 29.4
data_clean %>%
dplyr::select(tenure, MonthlyCharges, TelcoServices, cluster) %>%
group_by(cluster) %>%
summarize(median_monthly_charges = median(MonthlyCharges), median_telco_services = median(TelcoServices), median_tenure = median(tenure))
## # A tibble: 3 x 4
## cluster median_monthly_charges median_telco_services median_tenure
## <int> <dbl> <dbl> <dbl>
## 1 1 74.9 4 10
## 2 2 92 7 60
## 3 3 20.5 1 25
tenure
min: 1 month
max: 72 months (6 years)
MonthlyCharges
min: 18.25 dollars
max: 118.75 dollars
TelcoServices
min: 1 service
max: 9 services
Cluster 1 consists of customers who, on average, had been with
the company for 1 year and 1 month, whose monthly charges were
relatively high, 74 dollars, and who used around 4 services of the
company.
Cluster 2 consists of customers who, on average, had been with the company for 4 years and 9 months, whose monthly charge was the highest among clusters, 89 dollars, and who used 7 services on average.
Cluster 3 consists of customers who, on average, had been with the company for 2 years and 5 months, whose monthly charge was the lowest among clusters, 25 dollars, and who used 1-2 services on average.
So, the clusters can be shortly described as:
Cluster 1: short tenure, high monthly pay, half of the services used.
Cluster 2: long tenure, high monthly pay, almost all of the services used.
Cluster 3: long tenure, low monthly pay, 1-2 services used.
Next, I will examine churn rates among each cluster.
ggplot(data_clean, aes(cluster, fill = Churn)) +
geom_bar() + scale_y_continuous(limits = c(0, 3000)) + ylab("Number of customers") + xlab("Cluster")
ggplot(data_clean, aes(cluster, fill = Churn)) +
geom_bar(position = "fill") + scale_y_continuous(labels = scales::percent) + ylab("Number of customers") + xlab("Cluster")
data_clean %>%
dplyr::select(cluster, Churn) %>%
group_by(cluster, Churn) %>%
count()
## # A tibble: 6 x 3
## # Groups: cluster, Churn [6]
## cluster Churn n
## <int> <fct> <int>
## 1 1 Yes 1261
## 2 1 No 1409
## 3 2 Yes 362
## 4 2 No 1985
## 5 3 Yes 246
## 6 3 No 1769
Cluster 1 had the highest share of customers who left the company (47% left, n=2670), therefore, I will explore this segment further.
Creating a separate dataset for Cluster 1. The cluster
(represents the cluster a customer is assigned to) and the
TelcoServices
variables can be removed since they are not
needed anymore. I will also replace the recoded
InternetService
variable with its original version
comprised of 3 categories (DSL, Fiber optic, and No Internet service)
each representing a customer’s Internet service provider. This will be
helpful later in our regression analysis.
data_clean$InternetService <- data_full$InternetService
cluster_1_data <- data_clean %>%
filter(cluster == "1")
cluster_1_data <- cluster_1_data[-c(20, 21)]
Before running logistic regression, I will make sure that all variable types are correct.
cluster_1_data[c(6, 7)] <- lapply(cluster_1_data[c(6, 7)], as.factor)
cluster_1_data[c(9:14)] <- lapply(cluster_1_data[c(9:14)], as.factor)
# there are no customers without internet service in Cluster 1. so, I'll remove this option
cluster_1_data$InternetService <- if_else(cluster_1_data$InternetService == "DSL", "DSL", if_else(cluster_1_data$InternetService == "Fiber optic", "Fiber optic", remove()))
cluster_1_data$InternetService <- as.factor(cluster_1_data$InternetService)
glimpse(cluster_1_data)
## Rows: 2,670
## Columns: 19
## $ gender <fct> Male, Male, Female, Female, Male, Male, Female, Femal~
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, Yes, No, No, Yes, No,~
## $ Partner <fct> No, No, No, No, No, Yes, Yes, No, No, No, No, Yes, Ye~
## $ Dependents <fct> No, No, No, No, Yes, Yes, Yes, No, No, No, Yes, No, Y~
## $ tenure <int> 34, 2, 2, 8, 22, 13, 10, 21, 1, 30, 17, 2, 27, 1, 5, ~
## $ PhoneService <fct> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ MultipleLines <fct> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
## $ InternetService <fct> DSL, DSL, Fiber optic, Fiber optic, Fiber optic, DSL,~
## $ OnlineSecurity <fct> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ OnlineBackup <fct> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,~
## $ DeviceProtection <fct> 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,~
## $ TechSupport <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ StreamingTV <fct> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,~
## $ StreamingMovies <fct> 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,~
## $ Contract <fct> One year, Month-to-month, Month-to-month, Month-to-mo~
## $ PaperlessBilling <fct> No, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, ~
## $ PaymentMethod <fct> Mailed check, Mailed check, Electronic check, Electro~
## $ MonthlyCharges <dbl> 56.95, 53.85, 70.70, 99.65, 89.10, 49.95, 55.20, 90.0~
## $ Churn <fct> No, Yes, Yes, Yes, No, No, Yes, No, Yes, No, Yes, No,~
Setting a reference class to “No” (customer stayed).
cluster_1_data$Churn <- relevel(cluster_1_data$Churn, ref = "No")
First, I’m using all variables as predictors of churn.
logitModelFull1 <- glm(Churn ~ gender + SeniorCitizen + Partner + Dependents + tenure + PhoneService + MultipleLines + InternetService + TechSupport + OnlineBackup + OnlineSecurity + DeviceProtection + StreamingTV + StreamingMovies + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges, family = "binomial", data = cluster_1_data)
summary(logitModelFull1)
##
## Call:
## glm(formula = Churn ~ gender + SeniorCitizen + Partner + Dependents +
## tenure + PhoneService + MultipleLines + InternetService +
## TechSupport + OnlineBackup + OnlineSecurity + DeviceProtection +
## StreamingTV + StreamingMovies + Contract + PaperlessBilling +
## PaymentMethod + MonthlyCharges, family = "binomial", data = cluster_1_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9701 -0.9724 -0.2817 0.9348 2.7525
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.017430 1.135881 0.896 0.3704
## genderMale -0.108522 0.087969 -1.234 0.2173
## SeniorCitizenYes 0.161299 0.114239 1.412 0.1580
## PartnerYes -0.012899 0.107364 -0.120 0.9044
## DependentsYes 0.019790 0.130372 0.152 0.8793
## tenure -0.050454 0.004891 -10.317 < 2e-16 ***
## PhoneService1 0.094426 0.913894 0.103 0.9177
## MultipleLines1 0.516257 0.240921 2.143 0.0321 *
## InternetServiceFiber optic 1.873673 1.108938 1.690 0.0911 .
## TechSupport1 -0.131855 0.249546 -0.528 0.5972
## OnlineBackup1 -0.011320 0.242922 -0.047 0.9628
## OnlineSecurity1 -0.191426 0.247210 -0.774 0.4387
## DeviceProtection1 0.161894 0.242606 0.667 0.5046
## StreamingTV1 0.516579 0.449503 1.149 0.2505
## StreamingMovies1 0.557640 0.452342 1.233 0.2177
## ContractOne year -0.906933 0.208783 -4.344 1.40e-05 ***
## ContractTwo year -2.610385 1.024992 -2.547 0.0109 *
## PaperlessBillingYes 0.412028 0.099510 4.141 3.46e-05 ***
## PaymentMethodCredit card (automatic) -0.034001 0.170437 -0.199 0.8419
## PaymentMethodElectronic check 0.329177 0.136536 2.411 0.0159 *
## PaymentMethodMailed check 0.088153 0.160956 0.548 0.5839
## MonthlyCharges -0.036591 0.044114 -0.829 0.4068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3693.2 on 2669 degrees of freedom
## Residual deviance: 3051.2 on 2648 degrees of freedom
## AIC: 3095.2
##
## Number of Fisher Scoring iterations: 6
coefsexp <- round(exp(logitModelFull1$coefficients), 2)
coefsexp
## (Intercept) genderMale
## 2.77 0.90
## SeniorCitizenYes PartnerYes
## 1.18 0.99
## DependentsYes tenure
## 1.02 0.95
## PhoneService1 MultipleLines1
## 1.10 1.68
## InternetServiceFiber optic TechSupport1
## 6.51 0.88
## OnlineBackup1 OnlineSecurity1
## 0.99 0.83
## DeviceProtection1 StreamingTV1
## 1.18 1.68
## StreamingMovies1 ContractOne year
## 1.75 0.40
## ContractTwo year PaperlessBillingYes
## 0.07 1.51
## PaymentMethodCredit card (automatic) PaymentMethodElectronic check
## 0.97 1.39
## PaymentMethodMailed check MonthlyCharges
## 1.09 0.96
The following variables did have a statistically significant influence on customer churn in Cluster 1 (p-value < 0.05):
tenure
Multiple lines
(Yes)Contract
(One-year)Contract
(Two-year)Paperless Billing
(Yes)PaymentMethod
(Electronic check)Coefficient interpretations for the variables mentioned above:
tenure
- 0.95 (For every 1 month
increase in tenure, the odds of churning decreases by 5%.)
Multiple lines
(Yes) - 1.68 (The
odds of churning for customers with the multiple lines service are 68%
higher as compared to customers without it.)
Contract
(One-year) - 0.40 (The
odds of churning for customers with the one-year contract are 60% lower
as compared to customers with the month-to-month contract.)
Contract
(Two-year) - 0.07 (The
odds of churning for customers with the one-year contract are 93% lower
as compared to customers with the month-to-month contract.)
Paperless Billing
(Yes) - 1.51 (The
odds of churning for customers with paperless billing are 51% higher as
compared to customers with the paper billing option.)
PaymentMethod
(Electronic check) -
1.39 (The odds of churning for customers with
electronic check as their payment method are 39% higher as compared to
customers with the mailed check payment method.)
Let’s visualize churn among our statistically significant variables for Cluster 1.
cluster_1_data$Churn <- relevel(cluster_1_data$Churn, ref = "Yes")
p20 <- ggplot(cluster_1_data, aes(fill = Churn, x = Contract)) +
geom_bar(position = "fill") + coord_flip() + ylab(" ") + xlab("Contract type") + scale_y_continuous(labels = scales::percent)
p21 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaperlessBilling)) + geom_bar(position = "fill") + ylab(" ") + xlab("Is billing paperless?") + scale_y_continuous(labels = scales::percent)
p22 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaymentMethod)) +
geom_bar(position = "fill") + coord_flip() + ylab(" ") + xlab("Payment method") + scale_y_continuous(labels = scales::percent)
p20 + p21 + p22 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")
p20 <- ggplot(cluster_1_data, aes(fill = Churn, x = Contract)) +
geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Contract type") + scale_y_continuous(limits = c(0, 3000))
p21 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaperlessBilling)) + geom_bar(position = "stack") + ylab(" ") + xlab("Is billing paperless?") + scale_y_continuous(limits = c(0, 3000))
p22 <- ggplot(cluster_1_data, aes(fill = Churn, x = PaymentMethod)) +
geom_bar(position = "stack") + coord_flip() + ylab(" ") + xlab("Payment method") + scale_y_continuous(limits = c(0, 3000))
p20 + p21 + p22 + plot_annotation(title = "Customer churn by contract type, billing, and payment method") + plot_layout(nrow = 2, guides = "collect")
p23 <- ggplot(cluster_1_data, aes(fill = Churn, x = MultipleLines)) +
geom_bar(position = "fill") + ylab(" ") + scale_y_continuous(labels = scales::percent)
p24 <- ggplot(cluster_1_data, aes(fill = Churn, x = MultipleLines)) +
geom_bar(position = "stack") + ylab(" ") + scale_y_continuous(limits = c(0, 3000))
p23 + p24 + plot_annotation(title = "Customer churn by phone-related services") + plot_layout(guides = "collect")
p25 <- ggplot(cluster_1_data, aes(fill = Churn, x = tenure)) +
geom_histogram(position = "identity", alpha = 0.6, binwidth = 1) + xlab("Number of months with the company (binwidth = 1 month)")
p25 + plot_annotation(title = "Customer churn by tenure")
Overall, these results correspond with our previous finding that those who stayed with the company for longer were less likely to churn. Most of the Cluster 1 customers who left had been using the company’s services for less than 10 months.
Next, I’m using the stepAIC() function to determine the model with an optimal set of independent variables. I’ll use this model later to make predictions.
logitModelNew1 <- stepAIC(logitModelFull1, trace = 0)
summary(logitModelNew1)
##
## Call:
## glm(formula = Churn ~ SeniorCitizen + tenure + MultipleLines +
## InternetService + OnlineSecurity + StreamingTV + StreamingMovies +
## Contract + PaperlessBilling + PaymentMethod + MonthlyCharges,
## family = "binomial", data = cluster_1_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7786 -0.9389 0.2726 0.9706 1.9889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.917991 0.434193 -2.114 0.034494 *
## SeniorCitizenYes -0.169345 0.111494 -1.519 0.128795
## tenure 0.050791 0.004769 10.650 < 2e-16 ***
## MultipleLines1 -0.503442 0.107411 -4.687 2.77e-06 ***
## InternetServiceFiber optic -1.825828 0.252348 -7.235 4.64e-13 ***
## OnlineSecurity1 0.212038 0.123138 1.722 0.085076 .
## StreamingTV1 -0.490939 0.127600 -3.847 0.000119 ***
## StreamingMovies1 -0.533569 0.126403 -4.221 2.43e-05 ***
## ContractOne year 0.899755 0.206686 4.353 1.34e-05 ***
## ContractTwo year 2.626024 1.019904 2.575 0.010030 *
## PaperlessBillingYes -0.404708 0.099067 -4.085 4.40e-05 ***
## PaymentMethodCredit card (automatic) 0.041778 0.169641 0.246 0.805471
## PaymentMethodElectronic check -0.322014 0.136031 -2.367 0.017923 *
## PaymentMethodMailed check -0.070405 0.160075 -0.440 0.660064
## MonthlyCharges 0.033597 0.008585 3.913 9.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3693.2 on 2669 degrees of freedom
## Residual deviance: 3056.1 on 2655 degrees of freedom
## AIC: 3086.1
##
## Number of Fisher Scoring iterations: 6
Our model’s Pseudo R-squared (McFadden’s R2) can be interpreted as good fit since the value is around 0.2.
LogRegR2(logitModelNew1)
## Chi2 637.0748
## Df 14
## Sig. 0
## Cox and Snell Index 0.2122739
## Nagelkerke Index 0.2833226
## McFadden's R2 0.1724995
Next, I will split the data into train and test datasets, fit a logistic regression model and evaluate its performance.
cluster_1_data$Churn <- relevel(cluster_1_data$Churn, ref = "No")
set.seed(42)
cluster_1_data$isTrain <- rbinom(nrow(cluster_1_data), 1, 0.80)
train <- subset(cluster_1_data, cluster_1_data$isTrain == 1)
test <- subset(cluster_1_data, cluster_1_data$isTrain == 0)
# checking distributions
summary(test)
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:294 No :435 No :365 No :436 Min. : 1.00 0: 24
## Male :249 Yes:108 Yes:178 Yes:107 1st Qu.: 3.00 1:519
## Median :10.00
## Mean :13.22
## 3rd Qu.:22.00
## Max. :46.00
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0:348 DSL :199 0:434 0:390 0:404
## 1:195 Fiber optic:344 1:109 1:153 1:139
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 0:427 0:333 0:347 Month-to-month:471 No :167
## 1:116 1:210 1:196 One year : 65 Yes:376
## Two year : 7
##
##
##
## PaymentMethod MonthlyCharges Churn isTrain
## Bank transfer (automatic): 69 Min. : 40.15 No :289 Min. :0
## Credit card (automatic) : 85 1st Qu.: 60.20 Yes:254 1st Qu.:0
## Electronic check :283 Median : 75.15 Median :0
## Mailed check :106 Mean : 73.98 Mean :0
## 3rd Qu.: 85.85 3rd Qu.:0
## Max. :110.10 Max. :0
summary(train)
## gender SeniorCitizen Partner Dependents tenure PhoneService
## Female:1036 No :1701 No :1421 No :1704 Min. : 1.00 0: 118
## Male :1091 Yes: 426 Yes: 706 Yes: 423 1st Qu.: 3.00 1:2009
## Median :10.00
## Mean :13.02
## 3rd Qu.:21.00
## Max. :46.00
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0:1335 DSL : 822 0:1628 0:1558 0:1545
## 1: 792 Fiber optic:1305 1: 499 1: 569 1: 582
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 0:1647 0:1369 0:1343 Month-to-month:1879 No : 666
## 1: 480 1: 758 1: 784 One year : 204 Yes:1461
## Two year : 44
##
##
##
## PaymentMethod MonthlyCharges Churn isTrain
## Bank transfer (automatic): 307 Min. : 35.45 No :1120 Min. :1
## Credit card (automatic) : 293 1st Qu.: 59.10 Yes:1007 1st Qu.:1
## Electronic check :1093 Median : 74.80 Median :1
## Mailed check : 434 Mean : 73.32 Mean :1
## 3rd Qu.: 85.70 3rd Qu.:1
## Max. :112.95 Max. :1
logitTrainNew <- glm(Churn ~ SeniorCitizen + tenure + MultipleLines + InternetService + OnlineSecurity + StreamingTV + StreamingMovies + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges, family = "binomial", data = train[-20])
train$pred <- predict(logitTrainNew, train[-20], type = 'response')
train$pred <- ifelse(train$pred > 0.5, "Yes", "No")
test$pred <- predict(logitTrainNew, type = 'response', newdata = test[-c(19, 20)])
test$pred <- ifelse(test$pred > 0.5, "Yes", "No")
confusionMatrix(as.factor(train$pred), as.factor(train$Churn), positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 768 306
## Yes 352 701
##
## Accuracy : 0.6906
## 95% CI : (0.6705, 0.7102)
## No Information Rate : 0.5266
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.381
##
## Mcnemar's Test P-Value : 0.07938
##
## Sensitivity : 0.6961
## Specificity : 0.6857
## Pos Pred Value : 0.6657
## Neg Pred Value : 0.7151
## Prevalence : 0.4734
## Detection Rate : 0.3296
## Detection Prevalence : 0.4951
## Balanced Accuracy : 0.6909
##
## 'Positive' Class : Yes
##
For train data, confusion matrix indicated that sensitivity = 0.70 and specificity = 0.69.
Accuracy is 0.69.
confusionMatrix(as.factor(test$pred), as.factor(test$Churn), positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 196 74
## Yes 93 180
##
## Accuracy : 0.6924
## 95% CI : (0.6517, 0.731)
## No Information Rate : 0.5322
## P-Value [Acc > NIR] : 2.044e-14
##
## Kappa : 0.3851
##
## Mcnemar's Test P-Value : 0.1637
##
## Sensitivity : 0.7087
## Specificity : 0.6782
## Pos Pred Value : 0.6593
## Neg Pred Value : 0.7259
## Prevalence : 0.4678
## Detection Rate : 0.3315
## Detection Prevalence : 0.5028
## Balanced Accuracy : 0.6934
##
## 'Positive' Class : Yes
##
For test data, sensitivity = 0.71 and specificity = 0.68.
Accuracy in this case is 0.69, which means that the model predicts churn 69% of the time.
Earlier, I performed clustering that resulted in the following customer segments:
Cluster 1: short tenure, high monthly pay, half of the services used.
Cluster 2: long tenure, high monthly pay, almost all of the services used.
Cluster 3: long tenure, low monthly pay, 1-2 services used.
Then, I explored churn for Cluster 1 in which 47% of customers left the company this month (n=2670).
Here are the insights for Cluster 1 customer churn:
Longer staying customers were less likely to churn.
Customers with the 1 year and 2 year contracts were less likely to churn as compared to people using the month-to-month contract type.
Customers with paperless billing were more likely to churn as compared to the people with paper billing.
Customers who had multiple lines were more likely to churn as compared to those who had not.
Churn was higher among customers with the Fiber optic Internet provider as compared to those using the DSL one.