library(readr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(grid)
library(psych)
library(gtable)
library(knitr)
library(kableExtra)
library(formattable)
library(caret)
library(MASS)
library(pROC)
library(GGally)
library(ggcorrplot)
library(purrr)
library(cluster)
library(summarytools)
library(qwraps2)
library(DT)
library(gtable)
WA <- read_csv("~/Documents/БИЗНЕСАНАЛ/WA_Fn-UseC_-Telco-Customer-Churn.csv")
In this project, I want to predict Telco customer churn based on their behavior. This dataset taken from Kaggle dataset. Firstly, I will do a short exploratory analysis with plots to get acquainted with the data at hand. Then, I will make a cluster analysis and define groups, where the persent of cluents who have churned higher. Thirdly, I will use logistic regression to predict churn. Along the way, the most important factors for predicting churn will be outlined. Finally, I will give some reccomendation to the company about the fact how it can decrease customer churn.
Let’s download data and have a briefly look on metrics:
dfs = dfSummary(WA[,-1],
style = "grid",
valid.col = FALSE, varnumbers = FALSE, tmp.img.dir = "/tmp")
dfs$Missing <- NULL
dfs$`Freqs (% of Valid)` <- NULL
print(dfs,
headings = FALSE,
footnote = NA,
max.tbl.height = 500,
max.tbl.width = 250,
method = "render")
| Variable | Stats / Values | Graph | ||||
|---|---|---|---|---|---|---|
| gender [character] |
|
|||||
| SeniorCitizen [numeric] |
|
|||||
| Partner [character] |
|
|||||
| Dependents [character] |
|
|||||
| tenure [numeric] |
|
|||||
| PhoneService [character] |
|
|||||
| MultipleLines [character] |
|
|||||
| InternetService [character] |
|
|||||
| OnlineSecurity [character] |
|
|||||
| OnlineBackup [character] |
|
|||||
| DeviceProtection [character] |
|
|||||
| TechSupport [character] |
|
|||||
| StreamingTV [character] |
|
|||||
| StreamingMovies [character] |
|
|||||
| Contract [character] |
|
|||||
| PaperlessBilling [character] |
|
|||||
| PaymentMethod [character] |
|
|||||
| MonthlyCharges [numeric] |
|
|||||
| TotalCharges [numeric] |
|
|||||
| Churn [character] |
|
All in all there are 7043 observations and 21 variables.
The data set includes information about:
The column “Churn” is customers who left within the last month.
Services that each customer has signed up for phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
Customer account information: how long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
Demographic info about customers: gender, age range, and if they have partners and dependents
Let’s Check for any missing data in the data set.
kable(paste0(round(((nrow(WA) - nrow(na.omit(WA)))/nrow(WA))*100,2), "%"), col.names = "Percent of NA") %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
| Percent of NA |
|---|
| 0.16% |
WA = WA %>% na.omit()
Only 0.16% missing values. I delete them.
In order to evaluate how big the problem of customer churn from the company is, it is necessary to know the percentage of these customers from the total number.
th = theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"),
axis.ticks = element_blank(),
panel.grid = element_blank(),
rect = element_blank(),
panel.grid.major.y = element_line(color = "grey92", size = 0.5))
WA %>%
group_by(Churn) %>%
summarise(Number = n()) %>%
mutate(Percent = prop.table(Number)*100) %>%
ggplot(aes(Churn, Percent)) +
geom_col(aes(fill = Churn)) +
labs(title = "Churn Percentage") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = sprintf("%.2f%%", Percent)), vjust = 3,size = 4) +
th
From the plot, we can see that the percentage of customer that churn is around 27%. It is a litle bit more than 1/4 from the all cluents. It still not a very big problem, but I think, it is also not a great result. To understand, why this problem can occure, let’s make some correlations.
To begin with, we will look at the dependence of customer churn on numerical variables, such as Tenure, Monthly Charges and Total Charges. Small description for is above:
tenure: Number of months the customer has stayed with the company
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
In fact, there is another numerical variable in the data, but it is so defined by R, although it is a factor variable in the raw data from Kaggle. It is Senior Citizen.
Let’s have a briefly look on this metrics.
WA_contin = WA %>% dplyr::select(tenure, MonthlyCharges, TotalCharges)
kable(describe(WA_contin)) %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped")) %>% scroll_box(width = "700px", height = "150px")
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tenure | 1 | 7032 | 32.42179 | 24.54526 | 29.000 | 31.48685 | 32.61720 | 1.00 | 72.00 | 71.0 | 0.2376294 | -1.3881476 | 0.2927037 |
| MonthlyCharges | 2 | 7032 | 64.79821 | 30.08597 | 70.350 | 65.01030 | 35.65653 | 18.25 | 118.75 | 100.5 | -0.2220082 | -1.2566125 | 0.3587770 |
| TotalCharges | 3 | 7032 | 2283.30044 | 2266.77136 | 1397.475 | 1970.14024 | 1812.92328 | 18.80 | 8684.80 | 8666.0 | 0.9612323 | -0.2332742 | 27.0313843 |
From the statistics above, we can see that the longest tenure is 72 months or 6 years, and the maximum monthly charge is 118.75 dollars. For Total Charges the maximum is 8684.80 dollars. The minimum tenure is 1 month and minimum monthly charge is about 18.25 dollars. For Total Charges minimum is 18.80 dollars. The customer can expect to have a monthly charge of about 64.80 dollars and the mean tenure is 32 months or around 3 years. The customer can expect to have a Total Charge is 2283.30 dollars.
The use of the metric is not indicative or effective. If the user leaves, the total costs are clearly lower. It is difficult to say anything about correlation with customer churn. Also, we may not consider this variable because it correlates with the other two variables that we are exploring.
corr <- round(cor(WA_contin), 1)
ggcorrplot(corr, hc.order = TRUE, type = "lower",
lab = TRUE,
ggtheme = ggplot2::theme_classic,
colors = c("#6D9EC1", "white", "#E46726"))
Now, I would like to make plots, which can illustrates graphically data.
graph1 = WA %>%
group_by(tenure, Churn) %>%
summarise(Number = n()) %>%
ggplot(aes(tenure, Number)) +
geom_line(aes(col = Churn)) +
labs(x = "Tenure (month)",
y = "",
title = "Churn Based on Tenure") +
scale_x_continuous(breaks = seq(0, 80, 10)) +
th
graph2 = WA %>%
group_by(MonthlyCharges, Churn) %>%
summarise(Number = n()) %>%
ggplot(aes(MonthlyCharges, Number)) +
geom_line(aes(col = Churn)) +
labs(x = "Monthly Charges",
y = "",
title = "Churn Based on Monthly Charges") +
scale_x_continuous(breaks = seq(0, 120, 20)) +
th
legend = gtable_filter(ggplotGrob(graph2), "guide-box")
#plot graphs
grid.arrange(arrangeGrob(graph1+ theme(legend.position="none"), graph2 + theme(legend.position="none"), nrow = 2,
left = textGrob("Number of Customer", rot = 90, vjust = 1,
gp = gpar(fontsize = 12))),
legend,
widths=unit.c(unit(1, "npc") - legend$width, legend$width),
nrow=1)
From the graph we can see that the highest churn happen mostly after 1 month usage. It is noticeable that customers leave much less frequently after 25 months. The monthly charges chart shows that most of the loyal customers that stayed with the company had a lower monthly charge than most of the customers that churned.
I was interested in studying how customer churn can be related to the demographic characteristics of customers. These characteristics include the following variables:
# AS I said earlier, SeniorCitizen is numeric, but it is a factor)
WA <- WA %>%
mutate(SeniorCitizen = as.factor(SeniorCitizen))
p1 = WA %>% ggplot(aes(x=gender,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(gender, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p2 = WA %>% ggplot(aes(x=SeniorCitizen,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(SeniorCitizen, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent) +
th +
theme(legend.position = "none")
p3 = WA %>% ggplot(aes(x=Partner,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(Partner, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p4 = WA %>% ggplot(aes(x=Dependents,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(Dependents, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
g = WA %>% ggplot(aes(x=Dependents,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "Percentage")+
theme_classic()
legend = gtable_filter(ggplotGrob(g), "guide-box")
grid.arrange(arrangeGrob(p1, p2, p3, p4, nrow = 2,
top = textGrob("Demographic characteristics and churn", vjust = 1, gp = gpar(fontface = "bold", cex = 1.5)),
left = textGrob("Percentage", rot = 90, vjust = 1,
gp = gpar(fontsize = 12))),
legend,
widths=unit.c(unit(1, "npc") - legend$width, legend$width),
nrow=1)
From the customer characteristics, I can get several information :
Also, Telecom companie have different services which can be used by cluents. The list of this services are below:
And there are plots which illustarates the distribution of churned and not churned cluents in each category.
p5 = WA %>% ggplot(aes(x=InternetService,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(InternetService, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p6 = WA %>% ggplot(aes(x=OnlineSecurity,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(OnlineSecurity, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p7 = WA %>% ggplot(aes(x=OnlineBackup,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(OnlineBackup, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p8 = WA %>% ggplot(aes(x=DeviceProtection,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(DeviceProtection, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p9 = WA %>% ggplot(aes(x=PhoneService,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(PhoneService, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p10 = WA %>% ggplot(aes(x=MultipleLines,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(MultipleLines, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p11 = WA %>% ggplot(aes(x=TechSupport,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(TechSupport, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p12 = WA %>% ggplot(aes(x=StreamingTV,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(StreamingTV, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p13 = WA %>% ggplot(aes(x=StreamingMovies,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "") +
geom_text(data = . %>%
group_by(StreamingMovies, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
grid.arrange(arrangeGrob(p5, p6, p7, p8, p9, p10, p11, p12,p13, nrow = 3,
top = textGrob("Services of the company and churn", vjust = 1, gp = gpar(fontface = "bold", cex = 1.5)),
left = textGrob("Percentage", rot = 90, vjust = 1,
gp = gpar(fontsize = 12))),
legend,
widths=unit.c(unit(1, "npc") - legend$width, legend$width),
nrow=1)
From the service that used by the customer, I can get several information :
Customer who use fibser optic in internet service are more likely to churn compared to others
Customer who don’t use online security, online backup, device protection, streaming TV, and tech support are more likely to churn compared to others
The presence or non-existence of a telephone service at the customer’s premises does not affect the customer’s churn from the company in any way.
The remaining categorical variables are related to contract and payment status. There are the foloowing variables:
p14 = WA %>% ggplot(aes(x=PaymentMethod,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "", x ="", title = "Payment Method") +
geom_text(data = . %>%
group_by(PaymentMethod, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p15 = WA %>% ggplot(aes(x=Contract,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "", x ="", title = "Contract type") +
geom_text(data = . %>%
group_by(Contract, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
p16 = WA %>% ggplot(aes(x=PaperlessBilling,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "", x="", title = "Paperless Billing") +
geom_text(data = . %>%
group_by(PaperlessBilling, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+
th +
theme(legend.position = "none")
grid.arrange(arrangeGrob(p15, p16, p14, nrow = 2,
top = textGrob("Contract type, payments and churn", vjust = 1, gp = gpar(fontface = "bold", cex = 1.5)),
left = textGrob("Percentage", rot = 90, vjust = 1,
gp = gpar(fontsize = 12))),
legend,
widths=unit.c(unit(1, "npc") - legend$width, legend$width),
nrow=1)
From these plots I can coclude that:
Customers who sign monthly contracts leave more often, which is generally logical as they can leave the company at any time without much loss to themselves. As can be seen from the graph, the longer the contract period, the less likely it is that the client will churn It is possible that those who conclude a longer contract already trust the company and take this step consciously.
In general, customers who receive paper invoices are more likely to churn than those who do not receive them.
Those customers who use ellectron payment of bills are the ones who churn the most, which is quite strange as it is now a rather convenient way. The company may have a problem with this service, as the ratios of customer churn on other payment methods are approximately the same.
Here I recode the service-used variables into numeric 0-1 and calculate a new feature summing up the number of telecom services used by a customer (of the nine offered). I get new variable - “all_services”. The maximum sum of this variable is 10 (I also deleted PhoneService, because this variable takes in account in MultipleLines)
WA_new = WA %>% dplyr::select(-customerID, -TotalCharges)
WA_new$PhoneService = if_else(WA_new$PhoneService == "Yes", 1, 0)
WA_new$MultipleLines = if_else(WA_new$MultipleLines == "Yes", 1, if_else(WA_new$MultipleLines == "No", 0, 0))
WA_new$InternetService = if_else(WA_new$InternetService == "DSL", 1, if_else(WA_new$InternetService == "No", 0, 2))
WA_new$OnlineSecurity = if_else(WA_new$OnlineSecurity == "Yes", 1, if_else(WA_new$OnlineSecurity == "No", 0, 0))
WA_new$OnlineBackup = if_else(WA_new$OnlineBackup == "Yes", 1, if_else(WA_new$OnlineBackup == "No", 0, 0))
WA_new$DeviceProtection = if_else(WA_new$DeviceProtection == "Yes", 1, if_else(WA_new$DeviceProtection == "No", 0, 0))
WA_new$TechSupport = if_else(WA_new$TechSupport == "Yes", 1, if_else(WA_new$TechSupport == "No", 0, 0))
WA_new$StreamingTV = if_else(WA_new$StreamingTV == "Yes", 1, if_else(WA_new$StreamingTV == "No", 0, 0))
WA_new$StreamingMovies = if_else(WA_new$StreamingMovies == "Yes", 1, if_else(WA_new$StreamingMovies == "No", 0, 0))
WA_new$all_services = WA_new$MultipleLines + WA_new$InternetService + WA_new$OnlineSecurity + WA_new$OnlineBackup + WA_new$DeviceProtection + WA_new$TechSupport + WA_new$StreamingTV + WA_new$StreamingMovies
# Create the mean-sd standardization function
mean_sd_standard <- function(x) {
(x-mean(x))/sd(x)}
# Apply the function to each numeric variable in the clustering set
WA_new <- WA_new %>%
mutate(MonthlyCharges_stand = mean_sd_standard(MonthlyCharges))
WA_new <- WA_new %>%
mutate(tenure_stand = mean_sd_standard(tenure))
WA_new <- WA_new %>%
mutate(all_services_stnd = mean_sd_standard(all_services))
Since the data is quite large and I know nothing about the TV company’s clients, it is difficult to assume how many clusters there will be and what number will be optimal. Therefore, I decided to use two methods to determine the optimal number of clusters.
WA_cluster = WA_new %>% dplyr::select(MonthlyCharges_stand, tenure_stand, all_services_stnd)
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = WA_cluster, centers = k)
model$tot.withinss
})
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
# Plot the elbow plot
elbow_plot = ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
labs(x = "Number of clusters", y ="Total within-cluster sum of squares",title = "Elbow plot") +
scale_x_continuous(breaks = 1:10) + theme_linedraw() +
theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))
# # Use map_dbl to run many models with varying value of k
# sil_width <- map_dbl(2:10, function(k){
# model <- pam(x = WA_cluster, k = k)
# model$silinfo$avg.width
# })
#
# # Generate a data frame containing both k and sil_width
# sil_df <- data.frame(
# k = 2:10,
# sil_width = sil_width
# )
# write.csv(sil_df, "~/Documents/БИЗНЕСАНАЛ/sil_df.csv")
sil_df <- read_csv("~/Documents/БИЗНЕСАНАЛ/sil_df.csv")
sil_df = sil_df[,-1]
# Plot the relationship between k and sil_width
sil_plot = ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
labs(x = "Number of clusters", y ="Silhouette Width", title = "Silhouette plot") +
scale_x_continuous(breaks = 2:10) + theme_linedraw() +
theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))
grid.arrange(elbow_plot,sil_plot, nrow =2)
As we can see, the elbow method reccomend to devide all data in 2 clusters or 3 clusters, but for our deep analys it looks like really strange (exactly, when I go to the Interpretation part, I understand, why elbow shows the best variant is 2 clusters). Silhouete plot shows that if I take 4 clusters then most of observations are similar to the cluster it is assigned relative to other clusters. After that analysis I decided to devide all data to 4 clusters and save it in dataset.
I use set.seed equal to 123, because I want to save my results. Also, I added to my dataset new colomn “Cluster”, were save the number of cluster. The table below shows the distribution by clusters.
set.seed(123)
clusters <- kmeans(WA_new[,21:23], 4)
# Save the cluster number in the dataset as column 'Borough'
WA_new$Cluster <- as.factor(clusters$cluster)
# Look at the distribution of cluster
kable(table(WA_new$Cluster))%>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
| Var1 | Freq |
|---|---|
| 1 | 1750 |
| 2 | 1490 |
| 3 | 1727 |
| 4 | 2065 |
To start assessing the quality and distinctiveness of these groups, we will look at the average values of the groups for each cluster by original variables in our clustering dataset. This will give us the first clear idea of how groups behave. Below I visualize the clusters with a Parallel Coordinate Plot. Also I make a barplot with percent of churn in all clusters.
# Group by the cluster assignment and calculate averages
WA_clus_avg <- WA_new %>%
dplyr::select(MonthlyCharges_stand, tenure_stand, all_services_stnd,Cluster) %>%
group_by(Cluster) %>%
summarize_if(is.numeric,mean)
# Create the min-max scaling function
min_max_standard <- function(x) {
(x - min(x))/(max(x)-min(x))
}
# Apply this function to each numeric variable in the bustabit_clus_avg object
WA_avg_minmax <- WA_clus_avg %>%
mutate_if(is.numeric, min_max_standard)
# Load the GGally package
# Create a parallel coordinate plot of the values, starts with column 2
parrallel_plot = ggparcoord(WA_avg_minmax, columns = 2:ncol(WA_avg_minmax),
groupColumn = "Cluster", scale = "globalminmax", order = "skewness") +
labs(x = "", title = "Parallel Coordinate Plot") + theme_linedraw() +
theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))
clusters_churn_plot = WA_new %>% ggplot(aes(x=Cluster,fill=Churn)) +
geom_bar(position = 'fill') +
labs(y = "Percent", title = "Churn of cluents by clusters") +
geom_text(data = . %>%
group_by(Cluster, Churn) %>%
tally() %>%
mutate(p = n / sum(n)) %>%
ungroup(),
aes(y = p, label = scales::percent(p)),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
scale_y_continuous(labels = scales::percent)+ theme_linedraw() +
theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))
grid.arrange(parrallel_plot, clusters_churn_plot, nrow = 2)
Interpretation: 4th cluster have the highest values compared to the rest of the clusters for all the permanent ones. The clusters are distributed as follows by tensure: clients from the 4th class stayed the longest, clients from the 3rd claster also stayed the longest, but approximately 2 times less frequently, while clients from the 2nd and 1st clusters stayed the shortest. In general, this distribution corresponds to the outflow of clients from these clusters: in cluster 1 and 2 the outflow accounts for almost half of all data, it could be assumed that people from these clusters could also leave the company in the near future. In clusters 3 and 4, outflow accounts for less than 15% of the total data in each cluster. In terms of monthly payments, again, it turns out that people from the 4 clusters pay the most, which is not surprising, because if you look at whether they use services, you can see that they have the maximum, and for services you need an extra payment. At the same time, people in the 3 clusters pay the lowest monthly fee, but they also hardly use any different services. Customers from the second cluster also pay quite a lot for monthly services, but they use fewer services than customers from the 4th cluster. In general, 1 pays less than 2 and 4, but again, they use fewer services. In general, we can understand why the outflow of clients is more frequent in clusters 1 and 2 - they pay a price every month that is disproportionate to the services they use. However, it is possible that the services they use, or the type of service provider they use, somehow affect the monthly costs.
To make more detailed analysis I decided to use summaty function for each cluster. Also, I go back to the original dataset without NA, remove all variables which are not used in cluster division (customerID, TotalCharges, gender, SeniorCitizen, Partner, Dependents, PaperlessBilling, PaymentMethod, Contract). After that I get such a beautiful description of all clusters.
WA_interpret = WA %>% dplyr::select(-customerID, -TotalCharges, -gender, -Partner, -Dependents, -PaperlessBilling, -PaymentMethod, -Contract)
WA_interpret$Cluster <- as.factor(clusters$cluster)
WA_interpret <- WA_interpret %>%
mutate_if(sapply(WA_interpret, is.character), as.factor)
# results <- WA_interpret %>%
# group_by(Cluster) %>%
# do(the_summary = summary(.))
# WA_interpret %>% summary(.)
#
# options(knitr.kable.NA = "**")
# kable(results$the_summary) %>% kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), font_size = 8, full_width = FALSE)
options(qwraps2_markup = "markdown")
our_summary1 <-
list("Senior Citizen" =
list("Senior" = ~ qwraps2::n_perc0(SeniorCitizen == 1),
"No" = ~ qwraps2::n_perc0(SeniorCitizen == 0)),
"Tenure" =
list("min" = ~ min(tenure),
"max" = ~ max(tenure),
"mean (sd)" = ~ qwraps2::mean_sd(tenure)),
"Monthly Charges" =
list("min" = ~ min(MonthlyCharges),
"max" = ~ max(MonthlyCharges),
"mean (sd)" = ~ qwraps2::mean_sd(MonthlyCharges)),
"Phone Service" =
list("No" = ~ qwraps2::n_perc0(PhoneService == "No"),
"Yes" = ~ qwraps2::n_perc0(PhoneService == "Yes")),
"MultipleLines" =
list("No" = ~ qwraps2::n_perc0(MultipleLines == "No"),
"Yes" = ~ qwraps2::n_perc0(MultipleLines == "Yes"),
"No phone service" = ~ qwraps2::n_perc0(MultipleLines == "No phone service")),
"InternetService" =
list("DSL" = ~ qwraps2::n_perc0(InternetService == "DSL"),
"Fiber optic" = ~ qwraps2::n_perc0(InternetService == "Fiber optic"),
"No" = ~ qwraps2::n_perc0(InternetService == "No")),
"Online Security" =
list("No" = ~ qwraps2::n_perc0(OnlineSecurity == "No"),
"Yes" = ~ qwraps2::n_perc0( OnlineSecurity == "Yes"),
"No internet service" = ~ qwraps2::n_perc0( OnlineSecurity == "No internet service")),
"Online Backup" =
list("No" = ~ qwraps2::n_perc0(OnlineBackup == "No"),
"Yes" = ~ qwraps2::n_perc0( OnlineBackup == "Yes"),
"No internet service" = ~ qwraps2::n_perc0( OnlineBackup == "No internet service")),
"Device Protection" =
list("No" = ~ qwraps2::n_perc0(DeviceProtection == "No"),
"Yes" = ~ qwraps2::n_perc0( DeviceProtection == "Yes"),
"No internet service" = ~ qwraps2::n_perc0( DeviceProtection == "No internet service")),
"Tech Support" =
list("No" = ~ qwraps2::n_perc0(TechSupport == "No"),
"Yes" = ~ qwraps2::n_perc0( TechSupport == "Yes"),
"No internet service" = ~ qwraps2::n_perc0( TechSupport == "No internet service")),
"Streaming TV" =
list("No" = ~ qwraps2::n_perc0(StreamingTV == "No"),
"Yes" = ~ qwraps2::n_perc0( StreamingTV == "Yes"),
"No internet service" = ~ qwraps2::n_perc0( StreamingTV == "No internet service")),
"StreamingMovies" =
list("No" = ~ qwraps2::n_perc0(StreamingMovies == "No"),
"Yes" = ~ qwraps2::n_perc0( StreamingMovies == "Yes"),
"No internet service" = ~ qwraps2::n_perc0( StreamingMovies == "No internet service")),
"Churn" =
list("No" = ~ qwraps2::n_perc0(Churn == "No"),
"Yes" = ~ qwraps2::n_perc0( Churn == "Yes"))
)
print(summary_table(dplyr::group_by(WA_interpret, Cluster), our_summary1), booktabs = TRUE)
| 1 (N = 1750) | 2 (N = 1490) | 3 (N = 1727) | 4 (N = 2065) | |
|---|---|---|---|---|
| Senior Citizen | ||||
| Senior | 278 (16) | 362 (24) | 85 (5) | 417 (20) |
| No | 1,472 (84) | 1,128 (76) | 1,642 (95) | 1,648 (80) |
| Tenure | ||||
| min | 1 | 1 | 1 | 33 |
| max | 55 | 47 | 72 | 72 |
| mean (sd) | 12.29 ± 12.35 | 19.68 ± 12.14 | 30.64 ± 24.02 | 60.18 ± 10.00 |
| Monthly Charges | ||||
| min | 28.45 | 52.5 | 18.25 | 38.5 |
| max | 81.95 | 112.95 | 52.7 | 118.75 |
| mean (sd) | 58.80 ± 13.68 | 87.47 ± 10.88 | 22.53 ± 5.39 | 88.88 ± 18.41 |
| Phone Service | ||||
| No | 298 (17) | 30 (2) | 171 (10) | 181 (9) |
| Yes | 1,452 (83) | 1,460 (98) | 1,556 (90) | 1,884 (91) |
| MultipleLines | ||||
| No | 1,142 (65) | 575 (39) | 1,212 (70) | 456 (22) |
| Yes | 310 (18) | 885 (59) | 344 (20) | 1,428 (69) |
| No phone service | 298 (17) | 30 (2) | 171 (10) | 181 (9) |
| InternetService | ||||
| DSL | 1,127 (64) | 269 (18) | 207 (12) | 813 (39) |
| Fiber optic | 623 (36) | 1,221 (82) | 0 (0) | 1,252 (61) |
| No | 0 (0) | 0 (0) | 1,520 (88) | 0 (0) |
| Online Security | ||||
| No | 1,347 (77) | 1,094 (73) | 168 (10) | 888 (43) |
| Yes | 403 (23) | 396 (27) | 39 (2) | 1,177 (57) |
| No internet service | 0 (0) | 0 (0) | 1,520 (88) | 0 (0) |
| Online Backup | ||||
| No | 1,389 (79) | 897 (60) | 169 (10) | 632 (31) |
| Yes | 361 (21) | 593 (40) | 38 (2) | 1,433 (69) |
| No internet service | 0 (0) | 0 (0) | 1,520 (88) | 0 (0) |
| Device Protection | ||||
| No | 1,447 (83) | 843 (57) | 181 (10) | 623 (30) |
| Yes | 303 (17) | 647 (43) | 26 (2) | 1,442 (70) |
| No internet service | 0 (0) | 0 (0) | 1,520 (88) | 0 (0) |
| Tech Support | ||||
| No | 1,403 (80) | 1,028 (69) | 174 (10) | 867 (42) |
| Yes | 347 (20) | 462 (31) | 33 (2) | 1,198 (58) |
| No internet service | 0 (0) | 0 (0) | 1,520 (88) | 0 (0) |
| Streaming TV | ||||
| No | 1,454 (83) | 567 (38) | 192 (11) | 596 (29) |
| Yes | 296 (17) | 923 (62) | 15 (1) | 1,469 (71) |
| No internet service | 0 (0) | 0 (0) | 1,520 (88) | 0 (0) |
| StreamingMovies | ||||
| No | 1,452 (83) | 564 (38) | 192 (11) | 573 (28) |
| Yes | 298 (17) | 926 (62) | 15 (1) | 1,492 (72) |
| No internet service | 0 (0) | 0 (0) | 1,520 (88) | 0 (0) |
| Churn | ||||
| No | 1,037 (59) | 783 (53) | 1,562 (90) | 1,781 (86) |
| Yes | 713 (41) | 707 (47) | 165 (10) | 284 (14) |
`Name of the cluster` <- c("Economical","Сinephiles", "Australopithecus", "Take everything from life")
Churn <- c("40.7%","47.4%", "9.6%", "13.8%")
Interpretation <- c("This type of customer, for the most part, does not have access to the Internet. They only use telephone services. They pay the lowest monthly fee, because they do not use the services (they cannot do this because there is no internet connection)", "Everybody has internet, mostly using fibre optics. Fibre optic is best for downloading music and videos and watching streaming television and films. And that is why customers from this cluster are actively purchasing TV and film striming services. Customers do not mind using other services as well, but here they are distributed 50/50.", "All customers have the internet, but do not like to spend money on various services. They also use telephone services.", "Most customers use all the services offered by the company. It is possible that many of them subscribe, even on an annual basis, which is advantageous.")
Coef_tab <- data.frame(`Name of the cluster`, Interpretation, Churn) %>% datatable(colnames = c('Name of variable' = 2, 'Cluster' = 1))
Coef_tab
In order to prevent customer churn, you need to learn to predict in advance which customers may leave us in the future. As almost half of the clients from the clusters Сinephiles and Economic ones who are leaving the company, I will focus on them.
For the Сinephiles cluster, I decided to use the original dataset, without any summed-up variables, as the main difference between these customers is precisely the use of specific services. Therefore, I left only the standardized variables Mothly Charges and tenure.
I am creating two new data sets to create models. The Сinephiles clusters and the Economical clusters differ in their needs, so they need to be analysed separately. .
# 1st cluster
WA_model = WA_new[,-(5:14)]
WA_model = WA_model %>% dplyr::select(-MonthlyCharges, -all_services)
econom = WA_model %>% filter(Cluster == "1")
econom = WA_model %>% dplyr::select(-Cluster)
econom$Churn = as.factor(econom$Churn)
# 2nd cluster
cinema = WA %>% dplyr::select(-customerID, -TotalCharges, -MonthlyCharges, -tenure)
c = WA_new %>% dplyr::select(MonthlyCharges_stand,tenure_stand,Cluster)
cinema = cbind(cinema,c)
cinema = cinema %>% filter(Cluster == "2") %>% dplyr::select(-Cluster)
cinema$Churn = as.factor(cinema$Churn)
Also I need to split the data into training and testing data sets. Exactly 80% of the original cleaned data will be used for training the model and 20% will be used for testing the model.
set.seed(1)
test_ind = createDataPartition(cinema$Churn, p = 0.2, list = FALSE)
cinema.test = cinema[test_ind,]
cinema.train = cinema[-test_ind,]
test_ind = createDataPartition(econom$Churn, p = 0.2, list = FALSE)
econom.test = econom[test_ind,]
econom.train = econom[-test_ind,]
Finally, it’s time to create and train the model. The statistical model that I will use will be Logistic Regression. This model is good for binary classification.Firstly, I desided to use function stepAIC() from library MASS. This function give me the best one model.
logitModelFull_cinema <- glm(Churn~., family = binomial, cinema.train)
#Build the new model
logitModelFull_cinema_new <- stepAIC(logitModelFull_cinema,trace = 0)
summary(logitModelFull_cinema)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = cinema.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.02226 -0.96374 -0.00049 0.94332 2.70270
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.55279 2.48829 -2.633 0.00845 **
## genderMale -0.22516 0.13397 -1.681 0.09282 .
## SeniorCitizen1 0.21980 0.16100 1.365 0.17219
## PartnerYes 0.06317 0.15007 0.421 0.67378
## DependentsYes -0.17650 0.18445 -0.957 0.33862
## PhoneServiceYes 2.04005 1.39375 1.464 0.14327
## MultipleLinesNo phone service NA NA NA NA
## MultipleLinesYes 1.03408 0.32473 3.184 0.00145 **
## InternetServiceFiber optic 3.85931 1.49846 2.576 0.01001 *
## OnlineSecurityYes 0.26099 0.33692 0.775 0.43854
## OnlineBackupYes 0.49921 0.32077 1.556 0.11964
## DeviceProtectionYes 0.50118 0.32256 1.554 0.12024
## TechSupportYes 0.15315 0.33652 0.455 0.64903
## StreamingTVYes 1.27988 0.59942 2.135 0.03274 *
## StreamingMoviesYes 1.45675 0.59889 2.432 0.01500 *
## ContractOne year -0.53756 0.24043 -2.236 0.02536 *
## ContractTwo year -15.07900 367.09277 -0.041 0.96723
## PaperlessBillingYes 0.51275 0.16206 3.164 0.00156 **
## PaymentMethodCredit card (automatic) 0.08139 0.24140 0.337 0.73598
## PaymentMethodElectronic check 0.37863 0.19134 1.979 0.04784 *
## PaymentMethodMailed check 0.19055 0.26276 0.725 0.46834
## MonthlyCharges_stand -3.41179 1.75348 -1.946 0.05169 .
## tenure_stand -0.92041 0.15073 -6.106 1.02e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1648.0 on 1190 degrees of freedom
## Residual deviance: 1335.6 on 1169 degrees of freedom
## AIC: 1379.6
##
## Number of Fisher Scoring iterations: 15
summary(logitModelFull_cinema_new)
##
## Call:
## glm(formula = Churn ~ gender + SeniorCitizen + MultipleLines +
## InternetService + OnlineBackup + DeviceProtection + StreamingTV +
## StreamingMovies + Contract + PaperlessBilling + MonthlyCharges_stand +
## tenure_stand, family = binomial, data = cinema.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.99193 -0.98585 -0.00048 0.95266 2.67201
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6609 0.4361 -8.394 < 2e-16 ***
## genderMale -0.2111 0.1325 -1.593 0.111141
## SeniorCitizen1 0.2643 0.1554 1.701 0.088972 .
## MultipleLinesNo phone service -1.3938 0.8557 -1.629 0.103339
## MultipleLinesYes 0.8651 0.1656 5.224 1.75e-07 ***
## InternetServiceFiber optic 3.0898 0.5246 5.890 3.87e-09 ***
## OnlineBackupYes 0.3276 0.1686 1.942 0.052089 .
## DeviceProtectionYes 0.3113 0.1687 1.846 0.064932 .
## StreamingTVYes 0.9832 0.2480 3.965 7.34e-05 ***
## StreamingMoviesYes 1.1480 0.2438 4.709 2.49e-06 ***
## ContractOne year -0.6033 0.2381 -2.533 0.011298 *
## ContractTwo year -15.1489 370.0589 -0.041 0.967346
## PaperlessBillingYes 0.5263 0.1607 3.276 0.001054 **
## MonthlyCharges_stand -2.4768 0.6543 -3.785 0.000153 ***
## tenure_stand -0.9587 0.1467 -6.535 6.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1648.0 on 1190 degrees of freedom
## Residual deviance: 1342.2 on 1176 degrees of freedom
## AIC: 1372.2
##
## Number of Fisher Scoring iterations: 15
So the model, which gave me function stepAIC is next: Churn ~ gender + SeniorCitizen + MultipleLines + InternetService + OnlineBackup + DeviceProtection + StreamingTV + StreamingMovies + Contract + PaperlessBilling + MonthlyCharges_stand + tenure_stand. AIC is 1372.2, which is less than 1379.6 and should be better.
To evaluate the logistic regression model, I will print the predictions and look at the test statistics like precision, recall and the f-1 score.
test <- predict(logitModelFull_cinema, cinema.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, cinema.test$Churn, mode = "prec_recall")
confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 107 35
## Yes 50 107
##
## Accuracy : 0.7157
## 95% CI : (0.6609, 0.7662)
## No Information Rate : 0.5251
## P-Value [Acc > NIR] : 1.293e-11
##
## Kappa : 0.4329
##
## Mcnemar's Test P-Value : 0.1289
##
## Precision : 0.7535
## Recall : 0.6815
## F1 : 0.7157
## Prevalence : 0.5251
## Detection Rate : 0.3579
## Detection Prevalence : 0.4749
## Balanced Accuracy : 0.7175
##
## 'Positive' Class : No
##
Recall is about 68%. It means that the model correctly identified 68% of the customers who retained and missed 32%.The precision of the model was about 75% and the f1-score was about 72%. The accuracy of the model was about 71%. Exactly, 107 cluents were true predicted that they don’t churn and 107 cluent were identified as thise who really churn.
Let’s look on the second model:
test <- predict(logitModelFull_cinema_new, cinema.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, cinema.test$Churn, mode = "prec_recall")
confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 106 37
## Yes 51 105
##
## Accuracy : 0.7057
## 95% CI : (0.6505, 0.7567)
## No Information Rate : 0.5251
## P-Value [Acc > NIR] : 1.459e-10
##
## Kappa : 0.4127
##
## Mcnemar's Test P-Value : 0.1658
##
## Precision : 0.7413
## Recall : 0.6752
## F1 : 0.7067
## Prevalence : 0.5251
## Detection Rate : 0.3545
## Detection Prevalence : 0.4783
## Balanced Accuracy : 0.7073
##
## 'Positive' Class : No
##
The results of the second model is little worse, but the significans of predictors are higher. I thing, that I can choose the second one.
The most significant values are: 1. MultipleLines 2.
InternetService 3. StreamingTVYes
4. StreamingMoviesYes 5. MonthlyCharges_stand 6. tenure_stand
Exactly, all these variables have the lowest p-values and can be identified as the best predictors of customer churn for the cluents from the cluster “Сinephiles”
Coefficient <- c("MultipleLines(Yes)","InternetService(Fiber optic)", "StreamingTV(Yes) ", "StreamingMovies(Yes)", "MonthlyCharges_stand", "tenure_stand")
Interpretation <- c("Considering that the customer uses multipline lines, the risk of customer churn increases by 2.38 times or by 138% for those who have multipline lines compared to those who do not. Thus, customers with multiline lines, as opposed to those who have one line or no lines at all, are more inclined to churn out.",
"Given that a customer uses fibre optics, the risk of customer churn increases by a factor of 21.97 or 2097 % for those who have fibre optics compared to those who have DSL (There are no customers in the analysed cluster who do not use internet at all).",
"Given that the customer is using Streaming TV, the risk of customer churn increases by 2.67 times or by 167% for those who have Streaming TV connected than for those who do not use the service.",
"Given that the customer is using Streaming Movies, the risk of customer churn increases by 3.15 times or by 215% for those who have Streaming Movies connected than for those who do not use the service.",
"An increase of one dollar in MonthlyCharges of customer decreases the risk of churn by 0.08 times or by 92%.",
"An increase of one month in customer staying in the company decreases the risk of churn by 0.38 times or by 62%.")
Coef_tab <- data.frame(Coefficient, Interpretation)
formattable(Coef_tab,
align =c("l","l"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
| Coefficient | Interpretation |
|---|---|
| MultipleLines(Yes) | Considering that the customer uses multipline lines, the risk of customer churn increases by 2.38 times or by 138% for those who have multipline lines compared to those who do not. Thus, customers with multiline lines, as opposed to those who have one line or no lines at all, are more inclined to churn out. |
| InternetService(Fiber optic) | Given that a customer uses fibre optics, the risk of customer churn increases by a factor of 21.97 or 2097 % for those who have fibre optics compared to those who have DSL (There are no customers in the analysed cluster who do not use internet at all). |
| StreamingTV(Yes) | Given that the customer is using Streaming TV, the risk of customer churn increases by 2.67 times or by 167% for those who have Streaming TV connected than for those who do not use the service. |
| StreamingMovies(Yes) | Given that the customer is using Streaming Movies, the risk of customer churn increases by 3.15 times or by 215% for those who have Streaming Movies connected than for those who do not use the service. |
| MonthlyCharges_stand | An increase of one dollar in MonthlyCharges of customer decreases the risk of churn by 0.08 times or by 92%. |
| tenure_stand | An increase of one month in customer staying in the company decreases the risk of churn by 0.38 times or by 62%. |
As I have wrote earlie, for the clients from cluster “Economical” I save the variable with sum of all services, because they are not use some of that services actively (exactly, most of them don’t like use any services).
logitModelFull_econom <- glm(Churn~., family = binomial, econom.train)
#Build the new model
logitModelFull_econom_new <- stepAIC(logitModelFull_econom,trace = 0)
summary(logitModelFull_econom)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = econom.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8315 -0.6981 -0.3049 0.7438 3.1209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.35487 0.12304 -11.011 < 2e-16 ***
## genderMale -0.02655 0.07138 -0.372 0.709919
## SeniorCitizen1 0.34213 0.09336 3.665 0.000248 ***
## PartnerYes 0.07658 0.08541 0.897 0.369953
## DependentsYes -0.31835 0.09881 -3.222 0.001274 **
## ContractOne year -0.85359 0.11711 -7.289 3.13e-13 ***
## ContractTwo year -1.55308 0.19041 -8.156 3.45e-16 ***
## PaperlessBillingYes 0.38886 0.08084 4.810 1.51e-06 ***
## PaymentMethodCredit card (automatic) -0.07018 0.12655 -0.555 0.579204
## PaymentMethodElectronic check 0.38970 0.10378 3.755 0.000173 ***
## PaymentMethodMailed check -0.10931 0.12332 -0.886 0.375438
## MonthlyCharges_stand 0.85553 0.09669 8.848 < 2e-16 ***
## tenure_stand -0.84424 0.06334 -13.329 < 2e-16 ***
## all_services_stnd -0.26017 0.10650 -2.443 0.014568 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6513.9 on 5624 degrees of freedom
## Residual deviance: 4802.2 on 5611 degrees of freedom
## AIC: 4830.2
##
## Number of Fisher Scoring iterations: 6
summary(logitModelFull_econom_new)
##
## Call:
## glm(formula = Churn ~ SeniorCitizen + Dependents + Contract +
## PaperlessBilling + PaymentMethod + MonthlyCharges_stand +
## tenure_stand + all_services_stnd, family = binomial, data = econom.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8476 -0.6987 -0.3050 0.7380 3.0944
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.34056 0.11420 -11.739 < 2e-16 ***
## SeniorCitizen1 0.35176 0.09270 3.795 0.000148 ***
## DependentsYes -0.28159 0.08961 -3.142 0.001676 **
## ContractOne year -0.85195 0.11710 -7.276 3.45e-13 ***
## ContractTwo year -1.55331 0.19041 -8.158 3.41e-16 ***
## PaperlessBillingYes 0.38829 0.08081 4.805 1.55e-06 ***
## PaymentMethodCredit card (automatic) -0.07417 0.12643 -0.587 0.557406
## PaymentMethodElectronic check 0.38736 0.10372 3.735 0.000188 ***
## PaymentMethodMailed check -0.11538 0.12316 -0.937 0.348847
## MonthlyCharges_stand 0.85629 0.09664 8.861 < 2e-16 ***
## tenure_stand -0.83344 0.06199 -13.444 < 2e-16 ***
## all_services_stnd -0.25868 0.10644 -2.430 0.015084 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6513.9 on 5624 degrees of freedom
## Residual deviance: 4803.1 on 5613 degrees of freedom
## AIC: 4827.1
##
## Number of Fisher Scoring iterations: 6
So the model, which gave me function stepAIC is next: Churn ~ SeniorCitizen + Dependents + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges_stand + tenure_stand + all_services_stnd. AIC is 4827.1, which is less than 4830.2 and should be better.
To evaluate the logistic regression model, I will print the predictions and look at the test statistics like precision, recall and the f-1 score.
test <- predict(logitModelFull_econom, econom.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, econom.test$Churn, mode = "prec_recall")
confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 929 177
## Yes 104 197
##
## Accuracy : 0.8003
## 95% CI : (0.7784, 0.8209)
## No Information Rate : 0.7342
## P-Value [Acc > NIR] : 4.549e-09
##
## Kappa : 0.4543
##
## Mcnemar's Test P-Value : 1.746e-05
##
## Precision : 0.8400
## Recall : 0.8993
## F1 : 0.8686
## Prevalence : 0.7342
## Detection Rate : 0.6603
## Detection Prevalence : 0.7861
## Balanced Accuracy : 0.7130
##
## 'Positive' Class : No
##
Recall is about 90%. It means that the model correctly identified 90% of the customers who retained and missed 10%.The precision of the model was about 84% and the f1-score was about 87%. The accuracy of the model was about 79,5%. Exactly, 929 cluents were true predicted that they don’t churn and 197 cluent were identified as thise who really churn.
Let’s look on the second model:
test <- predict(logitModelFull_econom_new, econom.test, type="response")
pred <- factor(ifelse(test > 0.5,"Yes","No"))
confusion <- caret::confusionMatrix(pred, econom.test$Churn, mode = "prec_recall")
confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 927 176
## Yes 106 198
##
## Accuracy : 0.7996
## 95% CI : (0.7777, 0.8202)
## No Information Rate : 0.7342
## P-Value [Acc > NIR] : 6.639e-09
##
## Kappa : 0.4539
##
## Mcnemar's Test P-Value : 3.976e-05
##
## Precision : 0.8404
## Recall : 0.8974
## F1 : 0.8680
## Prevalence : 0.7342
## Detection Rate : 0.6588
## Detection Prevalence : 0.7839
## Balanced Accuracy : 0.7134
##
## 'Positive' Class : No
##
In the second model accuracy(0.7996) is little less than in the first model(0.8003). And all other metrics are in the same situation. So, I would like to plot ROC curve and AUC for prediction model on test set, to define maybe which one is better:
log.model.econom = glm(Churn~., data = econom.train, family = binomial(link = 'logit'))
pred1 = predict(log.model.econom, newdata = econom.test, type = "response")
ROC_econom1 = roc(response = econom.test$Churn, predictor = pred1)
log.model.econom2 = glm(Churn ~ SeniorCitizen + Dependents + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges_stand + tenure_stand, data = econom.train, family = binomial(link = 'logit'))
pred2 = predict(log.model.econom2, newdata = econom.test, type = "response")
ROC_econom2 = roc(response = econom.test$Churn, predictor = pred2)
ggplot() + geom_path(aes(y=ROC_econom1$sensitivities, x=1-ROC_econom1$specificities), color = "green") +
geom_path(aes(y=ROC_econom2$sensitivities, x=1-ROC_econom2$specificities), color = "blue") + labs(title = "ROC curve") + xlab("FPR") + ylab("TPR") + theme_linedraw() +theme(plot.title = element_text(size=14, hjust = 0.5, face="bold"))
pROC::auc(ROC_econom1)
## Area under the curve: 0.8364
pROC::auc(ROC_econom2)
## Area under the curve: 0.8365
It is still difficult to say, which one is better, but the AUC of the first model(0.8364) is less by 0.0001 than second (0.8365). Still for me better to have good presision and recall numbers, because they are related to the model’s mistakes and how well it selects its customers. So I will describe the most significant values based on the first model.
The most significant values are: 1. SeniorCitizen1
2. ContractOne year 3. ContractTwo year
4. PaperlessBillingYes 5. PaymentMethodElectronic check 6.
MonthlyCharges_stand 7. tenure_stand
Exactly, all these variables have the lowest p-values and can be identified as the best predictors of customer churn for the cluents from the cluster “Economical”
Coefficient <- c("Senior Citizen(1)","Contract (One year)", "Contract (Two year)", "Paperless Billing (Yes)","Payment Method(Electronic check)", "MonthlyCharges_stand","tenure_stand")
Interpretation <- c("Considering that the customer is elderly, the risk of customer churn increases by a factor of 1.41 or 41% for those who are elderly compared to those who are younger.",
"Given that the customer signs a contract for one year, the risk of customer churn decreases by a factor of 0.43 or 57% for those who sign a monthly contract. So to say, clients who have, for example, one-year contract are 57% more likely to stay",
"Given that the customer signs a contract for two year, the risk of customer churn decreases by a factor of 0.21 or 79% for those who sign a monthly contract.So to say, clients who have, for example, two-year contract are 79% more likely to stay.",
"Considering that the customer receives paper receipts, the risk of customer churn increases by 1.48 or 48% for those who receive these receipts compared to those who do not receive this receipt.",
"Taking into account that the customer pays bills by electronic check, the risk of customer churn increases by 1.48 or 48% for those who receive electronic cheques compared to those who use Bank transfer (automatic).",
"An increase of one dollar in MonthlyCharges of customer increases the risk of churn by 2.35 times or by 135%.",
"An increase of one month that customer staying in the company decreases the risk of churn by 0.43 times or by 57%.")
Coef_tab <- data.frame(Coefficient, Interpretation)
formattable(Coef_tab,
align =c("l","l"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
| Coefficient | Interpretation |
|---|---|
| Senior Citizen(1) | Considering that the customer is elderly, the risk of customer churn increases by a factor of 1.41 or 41% for those who are elderly compared to those who are younger. |
| Contract (One year) | Given that the customer signs a contract for one year, the risk of customer churn decreases by a factor of 0.43 or 57% for those who sign a monthly contract. So to say, clients who have, for example, one-year contract are 57% more likely to stay |
| Contract (Two year) | Given that the customer signs a contract for two year, the risk of customer churn decreases by a factor of 0.21 or 79% for those who sign a monthly contract.So to say, clients who have, for example, two-year contract are 79% more likely to stay. |
| Paperless Billing (Yes) | Considering that the customer receives paper receipts, the risk of customer churn increases by 1.48 or 48% for those who receive these receipts compared to those who do not receive this receipt. |
| Payment Method(Electronic check) | Taking into account that the customer pays bills by electronic check, the risk of customer churn increases by 1.48 or 48% for those who receive electronic cheques compared to those who use Bank transfer (automatic). |
| MonthlyCharges_stand | An increase of one dollar in MonthlyCharges of customer increases the risk of churn by 2.35 times or by 135%. |
| tenure_stand | An increase of one month that customer staying in the company decreases the risk of churn by 0.43 times or by 57%. |
In the course of the analysis, four main clusters were identified. Based on a more in-depth analysis, the main characteristics of each cluster were identified. The clusters were named as follows: Economical, Сinephiles, Australopithecus, Take everything from life.
In the cluster Economical, Сinephiles, were defined the biggest churn of clients, which were 40.7% and 47,4% respectefully. For each of these clusters we made models, which aim is to predict the churn of clients. Because cluents in this cluent have sifferent chafacteristics, for both cluster will be given recommendations.
Churn-prevention recommendations for clients from “Сinephiles”:
It is most likely that the company has some problems with fibre optics, as the analysis shows great risks in terms of customer outflow if they use fibre optics.
Optionally, it is necessary to study the operation of fibre optics in more detail and offer customers the possibility of switching to DSL. Unfortunately, this data does not specify the cost per service. Therefore, if customers suddenly do not want to switch to DSL Ь for the cost, you can offer them a discount on the use of TV and film services as part of the bonus. It is possible to make 3 months of watching TV and movies free of charge, so that customers can appreciate the difference and the end result on DSL.
Churn-prevention recommendations for clients from “Economical”:
In this cluster, more attention should be paid to the demographics of customers and the type of contract they enter into and how they pay for services.
It is not entirely clear how a person relates to the group of elderly people, but perhaps they could be of retirement age (~60 years). They could be offered a pension rate and, if they have the internet, offer some discounts on security services, because this could be good protection against fraudsters.
Also, customers who sign annual and two annual contracts are less likely to outflow, which is logical. After all, they have already paid the money and signed a contract for a long time, while people with monthly contracts can leave at any time. It is worth making some kind of action, like this: a year’s contract is more profitable than a 10% monthly payment.
The company may have some problem with electronic payment for services, it is worth analyzing this point and understanding what the problem is (e.g. fraudsters, there is a delay in payment and connection of the service, inconvenient interface).