Introduction

library(readr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(grid)
library(psych)
library(gtable)
library(knitr)
library(kableExtra)
library(formattable)
library(caret)
library(MASS)
library(pROC)
library(GGally)
library(ggcorrplot)
library(purrr)
library(cluster)
library(summarytools)
library(qwraps2)
library(DT)
library(gtable)

WA <- read_csv("~/Documents/БИЗНЕСАНАЛ/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In this project, I want to predict Telco customer churn based on their behavior. This dataset taken from Kaggle dataset. Firstly, I will do a short exploratory analysis with plots to get acquainted with the data at hand. Then, I will make a cluster analysis and define groups, where the persent of cluents who have churned higher. Thirdly, I will use logistic regression to predict churn. Along the way, the most important factors for predicting churn will be outlined. Finally, I will give some reccomendation to the company about the fact how it can decrease customer churn.

Data overview

Let’s download data and have a briefly look on metrics:

dfs = dfSummary(WA[,-1], 
                style = "grid",          
                valid.col = FALSE, varnumbers = FALSE, tmp.img.dir = "/tmp")

dfs$Missing <- NULL 
dfs$`Freqs (% of Valid)` <- NULL 

print(dfs, 
      headings = FALSE,
      footnote = NA, 
      max.tbl.height = 500,
      max.tbl.width = 250,
      method = "render")

Variable

Stats / Values

Graph

gender [character]

1. Female

2. Male

SeniorCitizen [numeric]

Min : 0

Mean : 0.2

Max : 1

Partner [character]

1. No

2. Yes

Dependents [character]

1. No

2. Yes

tenure [numeric]

Mean (sd) : 32.4 (24.6)

min ≤ med ≤ max:

0 ≤ 29 ≤ 72

IQR (CV) : 46 (0.8)

PhoneService [character]

1. No

2. Yes

MultipleLines [character]

1. No

2. No phone service

3. Yes

InternetService [character]

1. DSL

2. Fiber optic

3. No

OnlineSecurity [character]

1. No

2. No internet service

3. Yes

OnlineBackup [character]

1. No

2. No internet service

3. Yes

DeviceProtection [character]

1. No

2. No internet service

3. Yes

TechSupport [character]

1. No

2. No internet service

3. Yes

StreamingTV [character]

1. No

2. No internet service

3. Yes

StreamingMovies [character]

1. No

2. No internet service

3. Yes

Contract [character]

1. Month-to-month

2. One year

3. Two year

PaperlessBilling [character]

1. No

2. Yes

PaymentMethod [character]

1. Bank transfer (automatic)

2. Credit card (automatic)

3. Electronic check

4. Mailed check

MonthlyCharges [numeric]

Mean (sd) : 64.8 (30.1)

min ≤ med ≤ max:

18.2 ≤ 70.3 ≤ 118.8

IQR (CV) : 54.3 (0.5)

TotalCharges [numeric]

Mean (sd) : 2283.3 (2266.8)

min ≤ med ≤ max:

18.8 ≤ 1397.5 ≤ 8684.8

IQR (CV) : 3393.3 (1)

Churn [character]

1. No

2. Yes

All in all there are 7043 observations and 21 variables.

The data set includes information about:

The column “Churn” is customers who left within the last month.
Services that each customer has signed up for phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
Customer account information: how long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
Demographic info about customers: gender, age range, and if they have partners and dependents

Let’s Check for any missing data in the data set.

kable(paste0(round(((nrow(WA) - nrow(na.omit(WA)))/nrow(WA))*100,2), "%"), col.names = "Percent of NA") %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

Percent of NA
0.16%

WA = WA %>% na.omit()

Only 0.16% missing values. I delete them.

How big is the problem?

In order to evaluate how big the problem of customer churn from the company is, it is necessary to know the percentage of these customers from the total number.

th = theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        rect = element_blank(),
        panel.grid.major.y = element_line(color = "grey92", size = 0.5))

WA %>% 
  group_by(Churn) %>% 
  summarise(Number = n()) %>%
  mutate(Percent = prop.table(Number)*100) %>% 
ggplot(aes(Churn, Percent)) + 
  geom_col(aes(fill = Churn)) +
  labs(title = "Churn Percentage") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = sprintf("%.2f%%", Percent)), vjust = 3,size = 4) + 
  th

From the plot, we can see that the percentage of customer that churn is around 27%. It is a litle bit more than 1/4 from the all cluents. It still not a very big problem, but I think, it is also not a great result. To understand, why this problem can occure, let’s make some correlations.

Explore correlations of churn

Continuos variables

To begin with, we will look at the dependence of customer churn on numerical variables, such as Tenure, Monthly Charges and Total Charges. Small description for is above:

tenure: Number of months the customer has stayed with the company
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer

In fact, there is another numerical variable in the data, but it is so defined by R, although it is a factor variable in the raw data from Kaggle. It is Senior Citizen.

Let’s have a briefly look on this metrics.

WA_contin = WA %>% dplyr::select(tenure, MonthlyCharges, TotalCharges)

kable(describe(WA_contin)) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped")) %>% scroll_box(width = "700px", height = "150px")

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
tenure	1	7032	32.42179	24.54526	29.000	31.48685	32.61720	1.00	72.00	71.0	0.2376294	-1.3881476	0.2927037
MonthlyCharges	2	7032	64.79821	30.08597	70.350	65.01030	35.65653	18.25	118.75	100.5	-0.2220082	-1.2566125	0.3587770
TotalCharges	3	7032	2283.30044	2266.77136	1397.475	1970.14024	1812.92328	18.80	8684.80	8666.0	0.9612323	-0.2332742	27.0313843

From the statistics above, we can see that the longest tenure is 72 months or 6 years, and the maximum monthly charge is 118.75 dollars. For Total Charges the maximum is 8684.80 dollars. The minimum tenure is 1 month and minimum monthly charge is about 18.25 dollars. For Total Charges minimum is 18.80 dollars. The customer can expect to have a monthly charge of about 64.80 dollars and the mean tenure is 32 months or around 3 years. The customer can expect to have a Total Charge is 2283.30 dollars.

The use of the metric is not indicative or effective. If the user leaves, the total costs are clearly lower. It is difficult to say anything about correlation with customer churn. Also, we may not consider this variable because it correlates with the other two variables that we are exploring.

corr <- round(cor(WA_contin), 1)
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE,
   ggtheme = ggplot2::theme_classic,
   colors = c("#6D9EC1", "white", "#E46726"))

Now, I would like to make plots, which can illustrates graphically data.

graph1 = WA %>% 
  group_by(tenure, Churn) %>% 
  summarise(Number = n()) %>% 
  ggplot(aes(tenure, Number)) +
  geom_line(aes(col = Churn)) +
  labs(x = "Tenure (month)",
       y = "",
       title = "Churn Based on Tenure") +
  scale_x_continuous(breaks = seq(0, 80, 10)) + 
  th

graph2 = WA %>% 
  group_by(MonthlyCharges, Churn) %>% 
  summarise(Number = n()) %>% 
  ggplot(aes(MonthlyCharges, Number)) +
  geom_line(aes(col = Churn)) +
  labs(x = "Monthly Charges",
       y = "",
       title = "Churn Based on Monthly Charges") +
  scale_x_continuous(breaks = seq(0, 120, 20)) + 
  th

legend = gtable_filter(ggplotGrob(graph2), "guide-box") 

#plot graphs
grid.arrange(arrangeGrob(graph1+ theme(legend.position="none"), graph2 + theme(legend.position="none"), nrow = 2,
             left = textGrob("Number of Customer", rot = 90, vjust = 1, 
                                          gp = gpar(fontsize = 12))),
             legend, 
    widths=unit.c(unit(1, "npc") - legend$width, legend$width), 
    nrow=1)

From the graph we can see that the highest churn happen mostly after 1 month usage. It is noticeable that customers leave much less frequently after 25 months. The monthly charges chart shows that most of the loyal customers that stayed with the company had a lower monthly charge than most of the customers that churned.

Demographic data

I was interested in studying how customer churn can be related to the demographic characteristics of customers. These characteristics include the following variables:

gender: gender (female, male)
Senior Citizen: Whether the customer is a senior citizen or not (1, 0)
Partner Whether: the customer has a partner or not (Yes, No)
Dependents: Whether the customer has dependents or not (Yes, No)

# AS I said earlier, SeniorCitizen is numeric, but it is a factor)
WA <- WA %>% 
  mutate(SeniorCitizen = as.factor(SeniorCitizen))

p1 = WA %>% ggplot(aes(x=gender,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(gender, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p2 = WA %>% ggplot(aes(x=SeniorCitizen,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(SeniorCitizen, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent) + 
  th + 
  theme(legend.position = "none")

p3 = WA %>% ggplot(aes(x=Partner,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(Partner, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p4 = WA %>% ggplot(aes(x=Dependents,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(Dependents, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

g = WA %>% ggplot(aes(x=Dependents,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "Percentage")+
  theme_classic()

legend = gtable_filter(ggplotGrob(g), "guide-box") 

grid.arrange(arrangeGrob(p1, p2, p3, p4, nrow = 2,
             top = textGrob("Demographic characteristics and churn", vjust = 1, gp = gpar(fontface = "bold", cex = 1.5)),
             left = textGrob("Percentage", rot = 90, vjust = 1, 
                                          gp = gpar(fontsize = 12))),
             legend, 
    widths=unit.c(unit(1, "npc") - legend$width, legend$width), 
    nrow=1)

From the customer characteristics, I can get several information :

From the gender, male and female seems to have same proportion to churn.
Senior citizens are more likely to churn
Customer without partner are more likely to churn compared to customer who have partner
Customer without dependents are more likely to churn compared to customer with dependents

Services of the company and churn

Also, Telecom companie have different services which can be used by cluents. The list of this services are below:

PhoneService: Whether the customer has a phone service or not (Yes, No)
MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)

And there are plots which illustarates the distribution of churned and not churned cluents in each category.

p5 = WA %>% ggplot(aes(x=InternetService,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(InternetService, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p6 = WA %>% ggplot(aes(x=OnlineSecurity,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(OnlineSecurity, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p7 = WA %>% ggplot(aes(x=OnlineBackup,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(OnlineBackup, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p8 = WA %>% ggplot(aes(x=DeviceProtection,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(DeviceProtection, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p9 = WA %>% ggplot(aes(x=PhoneService,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(PhoneService, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p10 = WA %>% ggplot(aes(x=MultipleLines,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(MultipleLines, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p11 = WA %>% ggplot(aes(x=TechSupport,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(TechSupport, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p12 = WA %>% ggplot(aes(x=StreamingTV,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(StreamingTV, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p13 = WA %>% ggplot(aes(x=StreamingMovies,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "") +
  geom_text(data = . %>% 
              group_by(StreamingMovies, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

grid.arrange(arrangeGrob(p5, p6, p7, p8, p9, p10, p11, p12,p13, nrow = 3,
             top = textGrob("Services of the company and churn", vjust = 1, gp = gpar(fontface = "bold", cex = 1.5)),
             left = textGrob("Percentage", rot = 90, vjust = 1, 
                                          gp = gpar(fontsize = 12))),
             legend, 
    widths=unit.c(unit(1, "npc") - legend$width, legend$width), 
    nrow=1)

From the service that used by the customer, I can get several information :

Customer who use fibser optic in internet service are more likely to churn compared to others
Customer who don’t use online security, online backup, device protection, streaming TV, and tech support are more likely to churn compared to others
The presence or non-existence of a telephone service at the customer’s premises does not affect the customer’s churn from the company in any way.

Contract and payment status

The remaining categorical variables are related to contract and payment status. There are the foloowing variables:

Contract: The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

p14 = WA %>% ggplot(aes(x=PaymentMethod,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "", x ="", title = "Payment Method") +
  geom_text(data = . %>% 
              group_by(PaymentMethod, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p15 = WA %>% ggplot(aes(x=Contract,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "", x ="", title = "Contract type") +
  geom_text(data = . %>% 
              group_by(Contract, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

p16 = WA %>% ggplot(aes(x=PaperlessBilling,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "", x="", title = "Paperless Billing") +
  geom_text(data = . %>% 
              group_by(PaperlessBilling, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ 
  th + 
  theme(legend.position = "none")

grid.arrange(arrangeGrob(p15, p16, p14, nrow = 2,
             top = textGrob("Contract type, payments and churn", vjust = 1, gp = gpar(fontface = "bold", cex = 1.5)),
             left = textGrob("Percentage", rot = 90, vjust = 1, 
                                          gp = gpar(fontsize = 12))),
             legend, 
    widths=unit.c(unit(1, "npc") - legend$width, legend$width), 
    nrow=1)

From these plots I can coclude that:

Customers who sign monthly contracts leave more often, which is generally logical as they can leave the company at any time without much loss to themselves. As can be seen from the graph, the longer the contract period, the less likely it is that the client will churn It is possible that those who conclude a longer contract already trust the company and take this step consciously.
In general, customers who receive paper invoices are more likely to churn than those who do not receive them.
Those customers who use ellectron payment of bills are the ones who churn the most, which is quite strange as it is now a rather convenient way. The company may have a problem with this service, as the ratios of customer churn on other payment methods are approximately the same.

Cluster analysis

Recoding and Standardize continuous variables

Here I recode the service-used variables into numeric 0-1 and calculate a new feature summing up the number of telecom services used by a customer (of the nine offered). I get new variable - “all_services”. The maximum sum of this variable is 10 (I also deleted PhoneService, because this variable takes in account in MultipleLines)

WA_new = WA %>% dplyr::select(-customerID, -TotalCharges)

WA_new$PhoneService = if_else(WA_new$PhoneService == "Yes", 1, 0)

WA_new$MultipleLines = if_else(WA_new$MultipleLines == "Yes", 1, if_else(WA_new$MultipleLines == "No", 0, 0))
WA_new$InternetService = if_else(WA_new$InternetService == "DSL", 1, if_else(WA_new$InternetService == "No", 0, 2))
WA_new$OnlineSecurity = if_else(WA_new$OnlineSecurity == "Yes", 1, if_else(WA_new$OnlineSecurity == "No", 0, 0))
WA_new$OnlineBackup = if_else(WA_new$OnlineBackup == "Yes", 1, if_else(WA_new$OnlineBackup == "No", 0, 0))
WA_new$DeviceProtection = if_else(WA_new$DeviceProtection == "Yes", 1, if_else(WA_new$DeviceProtection == "No", 0, 0))
WA_new$TechSupport = if_else(WA_new$TechSupport == "Yes", 1, if_else(WA_new$TechSupport == "No", 0, 0))
WA_new$StreamingTV = if_else(WA_new$StreamingTV == "Yes", 1, if_else(WA_new$StreamingTV == "No", 0, 0))
WA_new$StreamingMovies = if_else(WA_new$StreamingMovies == "Yes", 1, if_else(WA_new$StreamingMovies == "No", 0, 0))

WA_new$all_services = WA_new$MultipleLines + WA_new$InternetService + WA_new$OnlineSecurity + WA_new$OnlineBackup + WA_new$DeviceProtection + WA_new$TechSupport + WA_new$StreamingTV + WA_new$StreamingMovies

# Create the mean-sd standardization function
mean_sd_standard <- function(x) {
    (x-mean(x))/sd(x)}

# Apply the function to each numeric variable in the clustering set
WA_new <- WA_new %>%
    mutate(MonthlyCharges_stand = mean_sd_standard(MonthlyCharges))

WA_new <- WA_new %>%
    mutate(tenure_stand = mean_sd_standard(tenure))

WA_new <- WA_new %>%
    mutate(all_services_stnd = mean_sd_standard(all_services))

Elbow (Scree) Plot & Silhouette analysis

Since the data is quite large and I know nothing about the TV company’s clients, it is difficult to assume how many clusters there will be and what number will be optimal. Therefore, I decided to use two methods to determine the optimal number of clusters.

WA_cluster = WA_new %>% dplyr::select(MonthlyCharges_stand, tenure_stand, all_services_stnd)

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = WA_cluster, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
elbow_plot = ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  labs(x = "Number of clusters", y ="Total within-cluster sum of squares",title = "Elbow plot") +
  scale_x_continuous(breaks = 1:10) + theme_linedraw() +
  theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))

# # Use map_dbl to run many models with varying value of k
# sil_width <- map_dbl(2:10,  function(k){
#   model <- pam(x = WA_cluster, k = k)
#   model$silinfo$avg.width
# })
# 
# # Generate a data frame containing both k and sil_width
# sil_df <- data.frame(
#   k = 2:10,
#   sil_width = sil_width
# )

# write.csv(sil_df, "~/Documents/БИЗНЕСАНАЛ/sil_df.csv")
sil_df <- read_csv("~/Documents/БИЗНЕСАНАЛ/sil_df.csv")
sil_df = sil_df[,-1]

# Plot the relationship between k and sil_width
sil_plot = ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line()  +
  labs(x = "Number of clusters", y ="Silhouette Width", title = "Silhouette plot") +
  scale_x_continuous(breaks = 2:10) + theme_linedraw() +
  theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))

grid.arrange(elbow_plot,sil_plot, nrow =2)

As we can see, the elbow method reccomend to devide all data in 2 clusters or 3 clusters, but for our deep analys it looks like really strange (exactly, when I go to the Interpretation part, I understand, why elbow shows the best variant is 2 clusters). Silhouete plot shows that if I take 4 clusters then most of observations are similar to the cluster it is assigned relative to other clusters. After that analysis I decided to devide all data to 4 clusters and save it in dataset.

I use set.seed equal to 123, because I want to save my results. Also, I added to my dataset new colomn “Cluster”, were save the number of cluster. The table below shows the distribution by clusters.

set.seed(123)
clusters <- kmeans(WA_new[,21:23], 4)

# Save the cluster number in the dataset as column 'Borough'
WA_new$Cluster <- as.factor(clusters$cluster)

# Look at the distribution of cluster
kable(table(WA_new$Cluster))%>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

Var1	Freq
1	1750
2	1490
3	1727
4	2065

Interpretation of the clusters

To start assessing the quality and distinctiveness of these groups, we will look at the average values of the groups for each cluster by original variables in our clustering dataset. This will give us the first clear idea of how groups behave. Below I visualize the clusters with a Parallel Coordinate Plot. Also I make a barplot with percent of churn in all clusters.

# Group by the cluster assignment and calculate averages
WA_clus_avg <- WA_new %>%
    dplyr::select(MonthlyCharges_stand, tenure_stand, all_services_stnd,Cluster) %>%
    group_by(Cluster) %>%
    summarize_if(is.numeric,mean)

# Create the min-max scaling function
min_max_standard <- function(x) {
  (x - min(x))/(max(x)-min(x))
}

# Apply this function to each numeric variable in the bustabit_clus_avg object
WA_avg_minmax <- WA_clus_avg %>%
    mutate_if(is.numeric, min_max_standard)

# Load the GGally package
              
# Create a parallel coordinate plot of the values, starts with column 2
parrallel_plot = ggparcoord(WA_avg_minmax, columns = 2:ncol(WA_avg_minmax), 
           groupColumn = "Cluster", scale = "globalminmax", order = "skewness")  +
  labs(x = "", title = "Parallel Coordinate Plot") + theme_linedraw() +
  theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))

clusters_churn_plot = WA_new %>% ggplot(aes(x=Cluster,fill=Churn)) + 
  geom_bar(position = 'fill') +
  labs(y = "Percent", title = "Churn of cluents by clusters") +
  geom_text(data = . %>% 
              group_by(Cluster, Churn) %>%
              tally() %>%
              mutate(p = n / sum(n)) %>%
              ungroup(),
            aes(y = p, label = scales::percent(p)),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) + 
  scale_y_continuous(labels = scales::percent)+ theme_linedraw() +
  theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))

grid.arrange(parrallel_plot, clusters_churn_plot, nrow = 2)

Interpretation: 4th cluster have the highest values compared to the rest of the clusters for all the permanent ones. The clusters are distributed as follows by tensure: clients from the 4th class stayed the longest, clients from the 3rd claster also stayed the longest, but approximately 2 times less frequently, while clients from the 2nd and 1st clusters stayed the shortest. In general, this distribution corresponds to the outflow of clients from these clusters: in cluster 1 and 2 the outflow accounts for almost half of all data, it could be assumed that people from these clusters could also leave the company in the near future. In clusters 3 and 4, outflow accounts for less than 15% of the total data in each cluster. In terms of monthly payments, again, it turns out that people from the 4 clusters pay the most, which is not surprising, because if you look at whether they use services, you can see that they have the maximum, and for services you need an extra payment. At the same time, people in the 3 clusters pay the lowest monthly fee, but they also hardly use any different services. Customers from the second cluster also pay quite a lot for monthly services, but they use fewer services than customers from the 4th cluster. In general, 1 pays less than 2 and 4, but again, they use fewer services. In general, we can understand why the outflow of clients is more frequent in clusters 1 and 2 - they pay a price every month that is disproportionate to the services they use. However, it is possible that the services they use, or the type of service provider they use, somehow affect the monthly costs.

To make more detailed analysis I decided to use summaty function for each cluster. Also, I go back to the original dataset without NA, remove all variables which are not used in cluster division (customerID, TotalCharges, gender, SeniorCitizen, Partner, Dependents, PaperlessBilling, PaymentMethod, Contract). After that I get such a beautiful description of all clusters.

WA_interpret = WA %>% dplyr::select(-customerID, -TotalCharges, -gender,  -Partner, -Dependents, -PaperlessBilling, -PaymentMethod, -Contract)
WA_interpret$Cluster <- as.factor(clusters$cluster)

WA_interpret <- WA_interpret %>%
  mutate_if(sapply(WA_interpret, is.character), as.factor)

# results <- WA_interpret %>%
#   group_by(Cluster) %>%
#   do(the_summary = summary(.))
# WA_interpret %>% summary(.)
#   
# options(knitr.kable.NA = "**")
# kable(results$the_summary) %>%  kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), font_size = 8, full_width = FALSE)

options(qwraps2_markup = "markdown")
our_summary1 <-
  list("Senior Citizen" =
       list("Senior"  = ~ qwraps2::n_perc0(SeniorCitizen == 1),
            "No"      = ~ qwraps2::n_perc0(SeniorCitizen == 0)),
       "Tenure" =
       list("min"       = ~ min(tenure),
            "max"       = ~ max(tenure),
            "mean (sd)" = ~ qwraps2::mean_sd(tenure)),
       "Monthly Charges" =
       list("min"       = ~ min(MonthlyCharges),
            "max"       = ~ max(MonthlyCharges),
            "mean (sd)" = ~ qwraps2::mean_sd(MonthlyCharges)),
       "Phone Service" =
       list("No" = ~ qwraps2::n_perc0(PhoneService == "No"),
            "Yes"  = ~ qwraps2::n_perc0(PhoneService == "Yes")),
       "MultipleLines" =
       list("No" = ~ qwraps2::n_perc0(MultipleLines == "No"),
            "Yes"  = ~ qwraps2::n_perc0(MultipleLines == "Yes"),
            "No phone service" = ~ qwraps2::n_perc0(MultipleLines == "No phone service")),
       "InternetService" =
       list("DSL" = ~ qwraps2::n_perc0(InternetService == "DSL"),
            "Fiber optic"  = ~ qwraps2::n_perc0(InternetService == "Fiber optic"),
            "No"  = ~ qwraps2::n_perc0(InternetService == "No")),
       "Online Security" =
       list("No" = ~ qwraps2::n_perc0(OnlineSecurity == "No"),
            "Yes"  = ~ qwraps2::n_perc0( OnlineSecurity == "Yes"),
            "No internet service"  = ~ qwraps2::n_perc0( OnlineSecurity == "No internet service")),
       "Online Backup" =
       list("No" = ~ qwraps2::n_perc0(OnlineBackup == "No"),
            "Yes"  = ~ qwraps2::n_perc0( OnlineBackup == "Yes"),
            "No internet service"  = ~ qwraps2::n_perc0( OnlineBackup == "No internet service")),
       "Device Protection" =
       list("No" = ~ qwraps2::n_perc0(DeviceProtection == "No"),
            "Yes"  = ~ qwraps2::n_perc0( DeviceProtection == "Yes"),
            "No internet service"  = ~ qwraps2::n_perc0( DeviceProtection == "No internet service")),
       "Tech Support" =
       list("No" = ~ qwraps2::n_perc0(TechSupport == "No"),
            "Yes"  = ~ qwraps2::n_perc0( TechSupport == "Yes"),
            "No internet service"  = ~ qwraps2::n_perc0( TechSupport == "No internet service")),
       "Streaming TV" =
       list("No" = ~ qwraps2::n_perc0(StreamingTV == "No"),
            "Yes"  = ~ qwraps2::n_perc0( StreamingTV == "Yes"),
            "No internet service"  = ~ qwraps2::n_perc0( StreamingTV == "No internet service")),
       "StreamingMovies" =
       list("No" = ~ qwraps2::n_perc0(StreamingMovies == "No"),
            "Yes"  = ~ qwraps2::n_perc0( StreamingMovies == "Yes"),
            "No internet service"  = ~ qwraps2::n_perc0( StreamingMovies == "No internet service")),
       "Churn" =
       list("No" = ~ qwraps2::n_perc0(Churn == "No"),
            "Yes"  = ~ qwraps2::n_perc0( Churn == "Yes"))
       )

print(summary_table(dplyr::group_by(WA_interpret, Cluster), our_summary1), booktabs = TRUE)

	1 (N = 1750)	2 (N = 1490)	3 (N = 1727)	4 (N = 2065)
Senior Citizen
Senior	278 (16)	362 (24)	85 (5)	417 (20)
No	1,472 (84)	1,128 (76)	1,642 (95)	1,648 (80)
Tenure
min	1	1	1	33
max	55	47	72	72
mean (sd)	12.29 ± 12.35	19.68 ± 12.14	30.64 ± 24.02	60.18 ± 10.00
Monthly Charges
min	28.45	52.5	18.25	38.5
max	81.95	112.95	52.7	118.75
mean (sd)	58.80 ± 13.68	87.47 ± 10.88	22.53 ± 5.39	88.88 ± 18.41
Phone Service
No	298 (17)	30 (2)	171 (10)	181 (9)
Yes	1,452 (83)	1,460 (98)	1,556 (90)	1,884 (91)
MultipleLines
No	1,142 (65)	575 (39)	1,212 (70)	456 (22)
Yes	310 (18)	885 (59)	344 (20)	1,428 (69)
No phone service	298 (17)	30 (2)	171 (10)	181 (9)
InternetService
DSL	1,127 (64)	269 (18)	207 (12)	813 (39)
Fiber optic	623 (36)	1,221 (82)	0 (0)	1,252 (61)
No	0 (0)	0 (0)	1,520 (88)	0 (0)
Online Security
No	1,347 (77)	1,094 (73)	168 (10)	888 (43)
Yes	403 (23)	396 (27)	39 (2)	1,177 (57)
No internet service	0 (0)	0 (0)	1,520 (88)	0 (0)
Online Backup
No	1,389 (79)	897 (60)	169 (10)	632 (31)
Yes	361 (21)	593 (40)	38 (2)	1,433 (69)
No internet service	0 (0)	0 (0)	1,520 (88)	0 (0)
Device Protection
No	1,447 (83)	843 (57)	181 (10)	623 (30)
Yes	303 (17)	647 (43)	26 (2)	1,442 (70)
No internet service	0 (0)	0 (0)	1,520 (88)	0 (0)
Tech Support
No	1,403 (80)	1,028 (69)	174 (10)	867 (42)
Yes	347 (20)	462 (31)	33 (2)	1,198 (58)
No internet service	0 (0)	0 (0)	1,520 (88)	0 (0)
Streaming TV
No	1,454 (83)	567 (38)	192 (11)	596 (29)
Yes	296 (17)	923 (62)	15 (1)	1,469 (71)
No internet service	0 (0)	0 (0)	1,520 (88)	0 (0)
StreamingMovies
No	1,452 (83)	564 (38)	192 (11)	573 (28)
Yes	298 (17)	926 (62)	15 (1)	1,492 (72)
No internet service	0 (0)	0 (0)	1,520 (88)	0 (0)
Churn
No	1,037 (59)	783 (53)	1,562 (90)	1,781 (86)
Yes	713 (41)	707 (47)	165 (10)	284 (14)

`Name of the cluster` <- c("Economical","Сinephiles", "Australopithecus", "Take everything from life")

Churn <- c("40.7%","47.4%", "9.6%", "13.8%")

Interpretation <- c("This type of customer, for the most part, does not have access to the Internet. They only use telephone services. They pay the lowest monthly fee, because they do not use the services (they cannot do this because there is no internet connection)",  "Everybody has internet, mostly using fibre optics. Fibre optic is best for downloading music and videos and watching streaming television and films. And that is why customers from this cluster are actively purchasing TV and film striming services. Customers do not mind using other services as well, but here they are distributed 50/50.", "All customers have the internet, but do not like to spend money on various services. They also use telephone services.", "Most customers use all the services offered by the company. It is possible that many of them subscribe, even on an annual basis, which is advantageous.")

Coef_tab <- data.frame(`Name of the cluster`, Interpretation, Churn) %>% datatable(colnames = c('Name of variable' = 2, 'Cluster' = 1))

Coef_tab

Predict churn

In order to prevent customer churn, you need to learn to predict in advance which customers may leave us in the future. As almost half of the clients from the clusters Сinephiles and Economic ones who are leaving the company, I will focus on them.

For the Сinephiles cluster, I decided to use the original dataset, without any summed-up variables, as the main difference between these customers is precisely the use of specific services. Therefore, I left only the standardized variables Mothly Charges and tenure.

I am creating two new data sets to create models. The Сinephiles clusters and the Economical clusters differ in their needs, so they need to be analysed separately. .

# 1st cluster
WA_model = WA_new[,-(5:14)]
WA_model = WA_model %>% dplyr::select(-MonthlyCharges, -all_services)
econom = WA_model %>% filter(Cluster == "1")
econom = WA_model %>% dplyr::select(-Cluster)
econom$Churn = as.factor(econom$Churn)

# 2nd cluster
cinema = WA %>% dplyr::select(-customerID, -TotalCharges, -MonthlyCharges, -tenure)
c = WA_new %>% dplyr::select(MonthlyCharges_stand,tenure_stand,Cluster)
cinema = cbind(cinema,c)
cinema = cinema %>% filter(Cluster == "2") %>% dplyr::select(-Cluster)
cinema$Churn = as.factor(cinema$Churn)

Also I need to split the data into training and testing data sets. Exactly 80% of the original cleaned data will be used for training the model and 20% will be used for testing the model.

set.seed(1)
test_ind = createDataPartition(cinema$Churn, p = 0.2, list = FALSE)
cinema.test = cinema[test_ind,]
cinema.train = cinema[-test_ind,]

test_ind = createDataPartition(econom$Churn, p = 0.2, list = FALSE)
econom.test = econom[test_ind,]
econom.train = econom[-test_ind,]

Predict churn for cluster “Сinephiles”

Create The Model

Finally, it’s time to create and train the model. The statistical model that I will use will be Logistic Regression. This model is good for binary classification.Firstly, I desided to use function stepAIC() from library MASS. This function give me the best one model.

logitModelFull_cinema <- glm(Churn~., family = binomial, cinema.train)

#Build the new model
logitModelFull_cinema_new <- stepAIC(logitModelFull_cinema,trace = 0) 
summary(logitModelFull_cinema)

## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = cinema.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.02226  -0.96374  -0.00049   0.94332   2.70270  
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           -6.55279    2.48829  -2.633  0.00845 ** 
## genderMale                            -0.22516    0.13397  -1.681  0.09282 .  
## SeniorCitizen1                         0.21980    0.16100   1.365  0.17219    
## PartnerYes                             0.06317    0.15007   0.421  0.67378    
## DependentsYes                         -0.17650    0.18445  -0.957  0.33862    
## PhoneServiceYes                        2.04005    1.39375   1.464  0.14327    
## MultipleLinesNo phone service               NA         NA      NA       NA    
## MultipleLinesYes                       1.03408    0.32473   3.184  0.00145 ** 
## InternetServiceFiber optic             3.85931    1.49846   2.576  0.01001 *  
## OnlineSecurityYes                      0.26099    0.33692   0.775  0.43854    
## OnlineBackupYes                        0.49921    0.32077   1.556  0.11964    
## DeviceProtectionYes                    0.50118    0.32256   1.554  0.12024    
## TechSupportYes                         0.15315    0.33652   0.455  0.64903    
## StreamingTVYes                         1.27988    0.59942   2.135  0.03274 *  
## StreamingMoviesYes                     1.45675    0.59889   2.432  0.01500 *  
## ContractOne year                      -0.53756    0.24043  -2.236  0.02536 *  
## ContractTwo year                     -15.07900  367.09277  -0.041  0.96723    
## PaperlessBillingYes                    0.51275    0.16206   3.164  0.00156 ** 
## PaymentMethodCredit card (automatic)   0.08139    0.24140   0.337  0.73598    
## PaymentMethodElectronic check          0.37863    0.19134   1.979  0.04784 *  
## PaymentMethodMailed check              0.19055    0.26276   0.725  0.46834    
## MonthlyCharges_stand                  -3.41179    1.75348  -1.946  0.05169 .  
## tenure_stand                          -0.92041    0.15073  -6.106 1.02e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1648.0  on 1190  degrees of freedom
## Residual deviance: 1335.6  on 1169  degrees of freedom
## AIC: 1379.6
## 
## Number of Fisher Scoring iterations: 15

summary(logitModelFull_cinema_new)

## 
## Call:
## glm(formula = Churn ~ gender + SeniorCitizen + MultipleLines + 
##     InternetService + OnlineBackup + DeviceProtection + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + MonthlyCharges_stand + 
##     tenure_stand, family = binomial, data = cinema.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.99193  -0.98585  -0.00048   0.95266   2.67201  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -3.6609     0.4361  -8.394  < 2e-16 ***
## genderMale                     -0.2111     0.1325  -1.593 0.111141    
## SeniorCitizen1                  0.2643     0.1554   1.701 0.088972 .  
## MultipleLinesNo phone service  -1.3938     0.8557  -1.629 0.103339    
## MultipleLinesYes                0.8651     0.1656   5.224 1.75e-07 ***
## InternetServiceFiber optic      3.0898     0.5246   5.890 3.87e-09 ***
## OnlineBackupYes                 0.3276     0.1686   1.942 0.052089 .  
## DeviceProtectionYes             0.3113     0.1687   1.846 0.064932 .  
## StreamingTVYes                  0.9832     0.2480   3.965 7.34e-05 ***
## StreamingMoviesYes              1.1480     0.2438   4.709 2.49e-06 ***
## ContractOne year               -0.6033     0.2381  -2.533 0.011298 *  
## ContractTwo year              -15.1489   370.0589  -0.041 0.967346    
## PaperlessBillingYes             0.5263     0.1607   3.276 0.001054 ** 
## MonthlyCharges_stand           -2.4768     0.6543  -3.785 0.000153 ***
## tenure_stand                   -0.9587     0.1467  -6.535 6.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1648.0  on 1190  degrees of freedom
## Residual deviance: 1342.2  on 1176  degrees of freedom
## AIC: 1372.2
## 
## Number of Fisher Scoring iterations: 15

So the model, which gave me function stepAIC is next: Churn ~ gender + SeniorCitizen + MultipleLines + InternetService + OnlineBackup + DeviceProtection + StreamingTV + StreamingMovies + Contract + PaperlessBilling + MonthlyCharges_stand + tenure_stand. AIC is 1372.2, which is less than 1379.6 and should be better.

Evaluate The Model

To evaluate the logistic regression model, I will print the predictions and look at the test statistics like precision, recall and the f-1 score.

test <- predict(logitModelFull_cinema, cinema.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, cinema.test$Churn, mode = "prec_recall")
confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  107  35
##        Yes  50 107
##                                           
##                Accuracy : 0.7157          
##                  95% CI : (0.6609, 0.7662)
##     No Information Rate : 0.5251          
##     P-Value [Acc > NIR] : 1.293e-11       
##                                           
##                   Kappa : 0.4329          
##                                           
##  Mcnemar's Test P-Value : 0.1289          
##                                           
##               Precision : 0.7535          
##                  Recall : 0.6815          
##                      F1 : 0.7157          
##              Prevalence : 0.5251          
##          Detection Rate : 0.3579          
##    Detection Prevalence : 0.4749          
##       Balanced Accuracy : 0.7175          
##                                           
##        'Positive' Class : No              
##

Recall is about 68%. It means that the model correctly identified 68% of the customers who retained and missed 32%.The precision of the model was about 75% and the f1-score was about 72%. The accuracy of the model was about 71%. Exactly, 107 cluents were true predicted that they don’t churn and 107 cluent were identified as thise who really churn.

Let’s look on the second model:

test <- predict(logitModelFull_cinema_new, cinema.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, cinema.test$Churn, mode = "prec_recall")
confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  106  37
##        Yes  51 105
##                                           
##                Accuracy : 0.7057          
##                  95% CI : (0.6505, 0.7567)
##     No Information Rate : 0.5251          
##     P-Value [Acc > NIR] : 1.459e-10       
##                                           
##                   Kappa : 0.4127          
##                                           
##  Mcnemar's Test P-Value : 0.1658          
##                                           
##               Precision : 0.7413          
##                  Recall : 0.6752          
##                      F1 : 0.7067          
##              Prevalence : 0.5251          
##          Detection Rate : 0.3545          
##    Detection Prevalence : 0.4783          
##       Balanced Accuracy : 0.7073          
##                                           
##        'Positive' Class : No              
##

The results of the second model is little worse, but the significans of predictors are higher. I thing, that I can choose the second one.

The most significant values are: 1. MultipleLines 2. InternetService 3. StreamingTVYes
4. StreamingMoviesYes 5. MonthlyCharges_stand 6. tenure_stand

Exactly, all these variables have the lowest p-values and can be identified as the best predictors of customer churn for the cluents from the cluster “Сinephiles”

Interpretation of the coefficients

Coefficient <- c("MultipleLines(Yes)","InternetService(Fiber optic)", "StreamingTV(Yes) ", "StreamingMovies(Yes)", "MonthlyCharges_stand", "tenure_stand")

Interpretation <- c("Considering that the customer uses multipline lines, the risk of customer churn increases by 2.38 times or by 138% for those who have multipline lines compared to those who do not. Thus, customers with multiline lines, as opposed to those who have one line or no lines at all, are more inclined to churn out.",
                    "Given that a customer uses fibre optics, the risk of customer churn increases by a factor of 21.97 or 2097 % for those who have fibre optics compared to those who have DSL (There are no customers in the analysed cluster who do not use internet at all).",
                    "Given that the customer is using Streaming TV, the risk of customer churn increases by 2.67 times or by 167% for those who have Streaming TV connected than for those who do not use the service.",
                    "Given that the customer is using Streaming Movies, the risk of customer churn increases by 3.15 times or by 215% for those who have Streaming Movies connected than for those who do not use the service.",
                    "An increase of one dollar in MonthlyCharges of customer decreases the risk of churn by 0.08 times or by 92%.",
                    "An increase of one month in customer staying in the company decreases the risk of churn by  0.38 times or by 62%.")

Coef_tab <- data.frame(Coefficient, Interpretation)

formattable(Coef_tab, 
            align =c("l","l"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Coefficient	Interpretation
MultipleLines(Yes)	Considering that the customer uses multipline lines, the risk of customer churn increases by 2.38 times or by 138% for those who have multipline lines compared to those who do not. Thus, customers with multiline lines, as opposed to those who have one line or no lines at all, are more inclined to churn out.
InternetService(Fiber optic)	Given that a customer uses fibre optics, the risk of customer churn increases by a factor of 21.97 or 2097 % for those who have fibre optics compared to those who have DSL (There are no customers in the analysed cluster who do not use internet at all).
StreamingTV(Yes)	Given that the customer is using Streaming TV, the risk of customer churn increases by 2.67 times or by 167% for those who have Streaming TV connected than for those who do not use the service.
StreamingMovies(Yes)	Given that the customer is using Streaming Movies, the risk of customer churn increases by 3.15 times or by 215% for those who have Streaming Movies connected than for those who do not use the service.
MonthlyCharges_stand	An increase of one dollar in MonthlyCharges of customer decreases the risk of churn by 0.08 times or by 92%.
tenure_stand	An increase of one month in customer staying in the company decreases the risk of churn by 0.38 times or by 62%.

Predict churn for cluster “Economical”

Create The Model

As I have wrote earlie, for the clients from cluster “Economical” I save the variable with sum of all services, because they are not use some of that services actively (exactly, most of them don’t like use any services).

logitModelFull_econom <- glm(Churn~., family = binomial, econom.train)

#Build the new model
logitModelFull_econom_new <- stepAIC(logitModelFull_econom,trace = 0) 
summary(logitModelFull_econom)

## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = econom.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8315  -0.6981  -0.3049   0.7438   3.1209  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -1.35487    0.12304 -11.011  < 2e-16 ***
## genderMale                           -0.02655    0.07138  -0.372 0.709919    
## SeniorCitizen1                        0.34213    0.09336   3.665 0.000248 ***
## PartnerYes                            0.07658    0.08541   0.897 0.369953    
## DependentsYes                        -0.31835    0.09881  -3.222 0.001274 ** 
## ContractOne year                     -0.85359    0.11711  -7.289 3.13e-13 ***
## ContractTwo year                     -1.55308    0.19041  -8.156 3.45e-16 ***
## PaperlessBillingYes                   0.38886    0.08084   4.810 1.51e-06 ***
## PaymentMethodCredit card (automatic) -0.07018    0.12655  -0.555 0.579204    
## PaymentMethodElectronic check         0.38970    0.10378   3.755 0.000173 ***
## PaymentMethodMailed check            -0.10931    0.12332  -0.886 0.375438    
## MonthlyCharges_stand                  0.85553    0.09669   8.848  < 2e-16 ***
## tenure_stand                         -0.84424    0.06334 -13.329  < 2e-16 ***
## all_services_stnd                    -0.26017    0.10650  -2.443 0.014568 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6513.9  on 5624  degrees of freedom
## Residual deviance: 4802.2  on 5611  degrees of freedom
## AIC: 4830.2
## 
## Number of Fisher Scoring iterations: 6

summary(logitModelFull_econom_new)

## 
## Call:
## glm(formula = Churn ~ SeniorCitizen + Dependents + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges_stand + 
##     tenure_stand + all_services_stnd, family = binomial, data = econom.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8476  -0.6987  -0.3050   0.7380   3.0944  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          -1.34056    0.11420 -11.739  < 2e-16 ***
## SeniorCitizen1                        0.35176    0.09270   3.795 0.000148 ***
## DependentsYes                        -0.28159    0.08961  -3.142 0.001676 ** 
## ContractOne year                     -0.85195    0.11710  -7.276 3.45e-13 ***
## ContractTwo year                     -1.55331    0.19041  -8.158 3.41e-16 ***
## PaperlessBillingYes                   0.38829    0.08081   4.805 1.55e-06 ***
## PaymentMethodCredit card (automatic) -0.07417    0.12643  -0.587 0.557406    
## PaymentMethodElectronic check         0.38736    0.10372   3.735 0.000188 ***
## PaymentMethodMailed check            -0.11538    0.12316  -0.937 0.348847    
## MonthlyCharges_stand                  0.85629    0.09664   8.861  < 2e-16 ***
## tenure_stand                         -0.83344    0.06199 -13.444  < 2e-16 ***
## all_services_stnd                    -0.25868    0.10644  -2.430 0.015084 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6513.9  on 5624  degrees of freedom
## Residual deviance: 4803.1  on 5613  degrees of freedom
## AIC: 4827.1
## 
## Number of Fisher Scoring iterations: 6

So the model, which gave me function stepAIC is next: Churn ~ SeniorCitizen + Dependents + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges_stand + tenure_stand + all_services_stnd. AIC is 4827.1, which is less than 4830.2 and should be better.

Evaluate The Model

To evaluate the logistic regression model, I will print the predictions and look at the test statistics like precision, recall and the f-1 score.

test <- predict(logitModelFull_econom, econom.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, econom.test$Churn, mode = "prec_recall")
confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  929 177
##        Yes 104 197
##                                           
##                Accuracy : 0.8003          
##                  95% CI : (0.7784, 0.8209)
##     No Information Rate : 0.7342          
##     P-Value [Acc > NIR] : 4.549e-09       
##                                           
##                   Kappa : 0.4543          
##                                           
##  Mcnemar's Test P-Value : 1.746e-05       
##                                           
##               Precision : 0.8400          
##                  Recall : 0.8993          
##                      F1 : 0.8686          
##              Prevalence : 0.7342          
##          Detection Rate : 0.6603          
##    Detection Prevalence : 0.7861          
##       Balanced Accuracy : 0.7130          
##                                           
##        'Positive' Class : No              
##

Recall is about 90%. It means that the model correctly identified 90% of the customers who retained and missed 10%.The precision of the model was about 84% and the f1-score was about 87%. The accuracy of the model was about 79,5%. Exactly, 929 cluents were true predicted that they don’t churn and 197 cluent were identified as thise who really churn.

Let’s look on the second model:

test <- predict(logitModelFull_econom_new, econom.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, econom.test$Churn, mode = "prec_recall")
confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  927 176
##        Yes 106 198
##                                           
##                Accuracy : 0.7996          
##                  95% CI : (0.7777, 0.8202)
##     No Information Rate : 0.7342          
##     P-Value [Acc > NIR] : 6.639e-09       
##                                           
##                   Kappa : 0.4539          
##                                           
##  Mcnemar's Test P-Value : 3.976e-05       
##                                           
##               Precision : 0.8404          
##                  Recall : 0.8974          
##                      F1 : 0.8680          
##              Prevalence : 0.7342          
##          Detection Rate : 0.6588          
##    Detection Prevalence : 0.7839          
##       Balanced Accuracy : 0.7134          
##                                           
##        'Positive' Class : No              
##

In the second model accuracy(0.7996) is little less than in the first model(0.8003). And all other metrics are in the same situation. So, I would like to plot ROC curve and AUC for prediction model on test set, to define maybe which one is better:

log.model.econom = glm(Churn~., data = econom.train, family = binomial(link = 'logit'))
pred1 = predict(log.model.econom, newdata = econom.test, type = "response")
ROC_econom1 = roc(response = econom.test$Churn, predictor = pred1)

log.model.econom2 = glm(Churn ~ SeniorCitizen + Dependents + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges_stand + tenure_stand, data = econom.train, family = binomial(link = 'logit'))
pred2 = predict(log.model.econom2, newdata = econom.test, type = "response")
ROC_econom2 = roc(response = econom.test$Churn, predictor = pred2)

ggplot() + geom_path(aes(y=ROC_econom1$sensitivities, x=1-ROC_econom1$specificities), color = "green") +
  geom_path(aes(y=ROC_econom2$sensitivities, x=1-ROC_econom2$specificities), color = "blue") + labs(title = "ROC curve") + xlab("FPR") + ylab("TPR") + theme_linedraw() +theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))

pROC::auc(ROC_econom1)

## Area under the curve: 0.8364

pROC::auc(ROC_econom2)

## Area under the curve: 0.8365

It is still difficult to say, which one is better, but the AUC of the first model(0.8364) is less by 0.0001 than second (0.8365). Still for me better to have good presision and recall numbers, because they are related to the model’s mistakes and how well it selects its customers. So I will describe the most significant values based on the first model.

The most significant values are: 1. SeniorCitizen1 2. ContractOne year 3. ContractTwo year
4. PaperlessBillingYes 5. PaymentMethodElectronic check 6. MonthlyCharges_stand 7. tenure_stand

Exactly, all these variables have the lowest p-values and can be identified as the best predictors of customer churn for the cluents from the cluster “Economical”

Interpretation of the coefficients

Coefficient <- c("Senior Citizen(1)","Contract (One year)", "Contract (Two year)", "Paperless Billing (Yes)","Payment Method(Electronic check)", "MonthlyCharges_stand","tenure_stand")

Interpretation <- c("Considering that the customer is elderly, the risk of customer churn increases by a factor of 1.41 or 41% for those who are elderly compared to those who are younger.",
                    "Given that the customer signs a contract for one year, the risk of customer churn decreases by a factor of 0.43 or 57% for those who sign a monthly contract. So to say, clients who have, for example, one-year contract are 57% more likely to stay",
                    "Given that the customer signs a contract for two year, the risk of customer churn decreases by a factor of 0.21 or 79% for those who sign a monthly contract.So to say, clients who have, for example, two-year contract are 79% more likely to stay.",
                    "Considering that the customer receives paper receipts, the risk of customer churn increases by 1.48 or 48% for those who receive these receipts compared to those who do not receive this receipt.",
                    "Taking into account that the customer pays bills by electronic check, the risk of customer churn increases by 1.48 or 48% for those who receive electronic cheques compared to those who use Bank transfer (automatic).",
                    "An increase of one dollar in MonthlyCharges of customer increases the risk of churn by 2.35  times or by 135%.", 
                    "An increase of one month that customer staying in the company decreases the risk of churn by 0.43 times or by 57%.")

Coef_tab <- data.frame(Coefficient, Interpretation)

formattable(Coef_tab, 
            align =c("l","l"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Coefficient	Interpretation
Senior Citizen(1)	Considering that the customer is elderly, the risk of customer churn increases by a factor of 1.41 or 41% for those who are elderly compared to those who are younger.
Contract (One year)	Given that the customer signs a contract for one year, the risk of customer churn decreases by a factor of 0.43 or 57% for those who sign a monthly contract. So to say, clients who have, for example, one-year contract are 57% more likely to stay
Contract (Two year)	Given that the customer signs a contract for two year, the risk of customer churn decreases by a factor of 0.21 or 79% for those who sign a monthly contract.So to say, clients who have, for example, two-year contract are 79% more likely to stay.
Paperless Billing (Yes)	Considering that the customer receives paper receipts, the risk of customer churn increases by 1.48 or 48% for those who receive these receipts compared to those who do not receive this receipt.
Payment Method(Electronic check)	Taking into account that the customer pays bills by electronic check, the risk of customer churn increases by 1.48 or 48% for those who receive electronic cheques compared to those who use Bank transfer (automatic).
MonthlyCharges_stand	An increase of one dollar in MonthlyCharges of customer increases the risk of churn by 2.35 times or by 135%.
tenure_stand	An increase of one month that customer staying in the company decreases the risk of churn by 0.43 times or by 57%.

Overall findings & Churn-prevention recommendations

In the course of the analysis, four main clusters were identified. Based on a more in-depth analysis, the main characteristics of each cluster were identified. The clusters were named as follows: Economical, Сinephiles, Australopithecus, Take everything from life.

In the cluster Economical, Сinephiles, were defined the biggest churn of clients, which were 40.7% and 47,4% respectefully. For each of these clusters we made models, which aim is to predict the churn of clients. Because cluents in this cluent have sifferent chafacteristics, for both cluster will be given recommendations.

Churn-prevention recommendations for clients from “Сinephiles”:

It is most likely that the company has some problems with fibre optics, as the analysis shows great risks in terms of customer outflow if they use fibre optics.
Optionally, it is necessary to study the operation of fibre optics in more detail and offer customers the possibility of switching to DSL. Unfortunately, this data does not specify the cost per service. Therefore, if customers suddenly do not want to switch to DSL Ь for the cost, you can offer them a discount on the use of TV and film services as part of the bonus. It is possible to make 3 months of watching TV and movies free of charge, so that customers can appreciate the difference and the end result on DSL.

Churn-prevention recommendations for clients from “Economical”:

In this cluster, more attention should be paid to the demographics of customers and the type of contract they enter into and how they pay for services.

It is not entirely clear how a person relates to the group of elderly people, but perhaps they could be of retirement age (~60 years). They could be offered a pension rate and, if they have the internet, offer some discounts on security services, because this could be good protection against fraudsters.
Also, customers who sign annual and two annual contracts are less likely to outflow, which is logical. After all, they have already paid the money and signed a contract for a long time, while people with monthly contracts can leave at any time. It is worth making some kind of action, like this: a year’s contract is more profitable than a 10% monthly payment.
The company may have some problem with electronic payment for services, it is worth analyzing this point and understanding what the problem is (e.g. fraudsters, there is a delay in payment and connection of the service, inconvenient interface).

Business analytics: churn prediction in Telco

Kostrova Anna

10/24/2020

Introduction

Data overview

How big is the problem?

Explore correlations of churn

Continuos variables

Demographic data

Services of the company and churn

Contract and payment status

Cluster analysis

Recoding and Standardize continuous variables

Elbow (Scree) Plot & Silhouette analysis

Interpretation of the clusters

Predict churn

Predict churn for cluster “Сinephiles”

Create The Model

Evaluate The Model

Interpretation of the coefficients

Predict churn for cluster “Economical”

Create The Model

Evaluate The Model

Interpretation of the coefficients

Overall findings & Churn-prevention recommendations