PROJECT III - ASSOCIATION RULES

 

 

This project is dedicated to the Churn Telco Data. The dataset covers following information concerning a fictional telco company:

  • customers, who left within last month
  • services of each customer
  • account information
  • demographic data

The dataset was found and downloaded from kaggle: (https://www.kaggle.com/blastchar/telco-customer-churn).

 

The project consists of 6 parts: 1. Introduction 2. EDA part 3. Association Rules - preliminary assumptions 4. Association Rules for whole dataset 5. Association Rules - churn analysis 6. Final Conclusions

 

Following libraries were used during the analysis:

  • knitr
  • rmarkdown
  • arules
  • arulesViz
  • arulesCBA
  • tidyverse
  • psych
  • data.table
  • factoextra
  • cluster
  • ClusterR
  • ggpubr

 

 

 

INTRODUCTION

Originally, the dataset consists of 21 columns and 7043 observations. The features present as follows:

  • customerID
  • gender
  • SeniorCitizen - 0-1 variable stating whether customer is a senior
  • Partner - variable stating whether customer has a partner (yes/no)
  • Dependents - variable stating whether customer has any dependents (children, parents, grandparents etc.)
  • tenure - how much customer stayed with the company (in)
  • PhoneService - does a customer have a phone service?
  • MultipleLines - does a customer have multiple lines?
  • InternetService - customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity - does a customer have an online security?
  • OnlineBackup - does a customer have an online backup service?
  • DeviceProtection - does a customer have a device protection service?
  • TechSupport - does a customer subscribe to an additional technical support plan?
  • StreamingTV - does a customer use Internet to stream TV?
  • StreamingMovies - does a customer use Internet to stream movies?
  • Contract - contract term of customer (month-to-month, one/two years)
  • PaperlessBilling - paper or paperless billing?
  • PaymentMethod - payment method
  • MonthlyCharges - monthly commitment
  • TotalCharges - total charged commitment
  • Churn - did customer churn last month?

 

 

Let’s take a look at the dataset.

# inspect the dataset
paged_table(data[1:10,])

The dataset is created for a fictional telco company. It contains information about the customer (demographic data), services owned by the customer (fix/mobile/additional services) and the information whether the customer left the company within last month (churn). There are 21 features, 7043 rows, and no missing data.

 

The aim of this project is to find some association rules which would help to determine the customer churn and to spot and express some hidden patterns concerning the customer behaviour, owned services, value etc.

 

 

 

EDA

First of all - let’s take a look at the continuous variables description - tenure

data$MonthlyCharges <- as.numeric(data$MonthlyCharges)
data$TotalCharges <- as.numeric(data$TotalCharges)
paged_table(describe(data[,c('tenure','MonthlyCharges','TotalCharges')]))

The average customer stays at the telco for ~32 months and pays monthly $64.76. The most loyal customer subscribes to the telco’s services for 6 years. The most valuable one pays $118.75 monthly. The kurtosis and skewness seem to be moderate in case of these variables, the skewness of monthly commitment is negative, what indicates, that there are some concentrated observations with low value in comparison with the rest of the distribution (probably there is a relatively big group of customers, who simply would like to use one basic service - one sim card / a cheap internet connection) - we are going to check that in a moment.

Let’s take a look at the tenure histogram in order to dig into details:

  ggplot(data=data,aes(x = tenure)) +
  geom_histogram() +
  stat_bin(bins=50)+
  theme_minimal() +
  labs(x = "", y = "", title = "Tenure Variable Distribution") +
  scale_x_continuous(breaks = c(0,6,12,18,24,30,36,42,48,54,60,66,72)) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see, there are quite a lot newcomers (0-3 month). Then the quantity of customers decrease as increases the tenure duration - up to 24 months. Since then the data seems to be distributed quite equally, but it is not the end of the story. There is also a peak in the “loyal wing” (66-72 months stay). From the modelling perspective, these two groups seem to be especially interesting.

 

 

Let’s check additionally the distribution of monthly charges.

  ggplot(data=data,aes(x = MonthlyCharges)) +
  geom_histogram() +
  stat_bin(bins=50)+
  theme_minimal() +
  labs(x = "", y = "", title = "Monthly Charges Distribution") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  #scale_x_continuous(breaks = c(0,6,12,18,24,30,36,42,48,54,60,66,72)) 

As stated above, there is a big group, which wants a cheap service (probably they do not care about the additional services).

 

 

Let’s concentrate on the demographic data. At the beginning - male/female division:

data %>%
  group_by(gender) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(gender, -counts))) +
  geom_bar(stat="identity", fill = 'darkgrey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=20, face='bold'))  +
  labs(x = "", y = "", title = "Gender counts") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

There is a slight difference for male/female counts - 67 more males. As the difference is negligable, there would be no sense in constructing the offer taking into account the focus on only one sex. However, it could be informative to check the sex/senior, sex/dependencies and sex/partner breakdown. Let’s check that.

 

 

First of all - gender ~ senior breakdown:

data %>%
  group_by(gender, SeniorCitizen) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(SeniorCitizen, -counts))) +
  geom_bar(stat="identity", fill = 'brown') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=20, face='bold'))  +
  labs(x = "", y = "", title = "Gender ~ Senior breakdown") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) +
  facet_wrap(~ gender)
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.

There is no major difference between the genders concerning senior citizens. Let’s check other dimensions.

 

 

Below the gender ~ dependents division:

data %>%
  group_by(gender, Dependents) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(Dependents, -counts))) +
  geom_bar(stat="identity", fill = 'purple') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=20, face='bold'))  +
  labs(x = "", y = "", title = "Gender ~ Dependents breakdown") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) +
  facet_wrap(~ gender)
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.

Again, we can see similar results for both genders. It seems that the genders are equal concerning this dimension as well. Let’s check the last case.

 

 

Finally, the gender ~ partner division:

data %>%
  group_by(gender, Partner) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(Partner, -counts))) +
  geom_bar(stat="identity", fill = 'black') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=20, face='bold'))  +
  labs(x = "", y = "", title = "Gender ~ Partner breakdown") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) +
  facet_wrap(~ gender)
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.

Again, there is no significant difference between genders concerning partners. At the same time, both males and females gravitate towards being single (still, it is a slight tendency).

 

The next part of the exploration is to analyze the services owned by the customers

 

 

Firstly let’s check how many customers do subscribe to phone / internet main services (separately and mutualy).

Below the information on phone number service

data %>%
  group_by(PhoneService) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(PhoneService, -counts))) +
  geom_bar(stat="identity", fill = 'darkgreen') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=13, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to a phone service at the Telco Company?") +
  geom_text(aes(label = counts), vjust = 1.3, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

As we can see, the vast majority (90%) of customers do subscribe to at least one phone service.

 

 

Let’s take a look at the internet connection services:

data %>%
  group_by(InternetService) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(InternetService, -counts))) +
  geom_bar(stat="identity", fill = 'darkblue') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=12, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to an internet service at the Telco Company?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

The Telco Company is more focused on the phone services - at least it should be according to the profiles of their customers. There is still a potential encourage 1500 customers to buy an internet service. Additionally, there is also an opportunity (assuming that the company has at disposal an appropriate fiber infrastructure) to upgrade the DSL services into faster and more stable Fiber ones.

 

 

The next step in our services analysis is to check the connection between phone and internet main services. It is a popular trend among the telecom companies to provide the customer with the “convergent offer” - mixing the phone and internet (and, possibly, the TV) services together as a bigger offer.

# define a new variable - mutual phone/internet services
data$convergence <- ifelse(data$PhoneService == 'Yes' & data$InternetService == 'DSL', 'Phone + DSL',
                           ifelse(data$PhoneService == 'Yes' & data$InternetService == 'Fiber optic', 'Phone + Fiber', 
                                  ifelse(data$PhoneService == 'Yes' & data$InternetService == 'No', 'Phone, No Net',
                                         ifelse(data$PhoneService == 'No' & data$InternetService == 'DSL', 'No Phone, DSL', 
                                                ifelse(data$PhoneService == 'No' & data$InternetService == 'Fiber optic', 'No Phone, Fiber', 'No Phone, No Net')))))


data %>%
  group_by(convergence) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(convergence, -counts))) +
  geom_bar(stat="identity", fill = 'chocolate') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=12, face='bold'))  +
  labs(x = "", y = "", title = "Are you a convergent customer?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

There are four major groups taking into account the phone and internet services - we can approach each of them differently. But first: Good news for the company! The largest group (3096 customers) subscribes to Phone service and Fiber optic internet service. It constitutes our perfect convergence offer. There is also a relatively high potential - there are 1739 customers who subscribe to Phone and DSL internet service - it could be profitable to upgrade them into the Fiber. A little bit smaller, yet still large group constitute the customers with Phone service and no Internet service. The smallest group - 682 customers - are those with only DSL internet, no phone services. It is desirable to provide them with an additional phone service.

We are going to come back to the convergance later, for now let’s move on to the additional services owned by the customers.

 

 

First of all let’s check how many additional services do possess the customers. We are going to check that by plotting the services counts distribution. There are six additional services:

  • OnlineSecurity
  • OnlineBackup
  • DeviceProtection
  • TechSupport
  • StreamingTV
  • StreamingMovies

Since then, let’s take a look at the additional services distribution:

#first - recode all addiitonal services variables into 0-1 form:
data$os <- ifelse(data$OnlineSecurity == 'Yes', 1,0)
data$ob <- ifelse(data$OnlineBackup == 'Yes', 1,0)
data$dp <- ifelse(data$DeviceProtection == 'Yes', 1,0)
data$ts <- ifelse(data$TechSupport == 'Yes', 1,0)
data$st <- ifelse(data$StreamingTV == 'Yes', 1,0)
data$sm <- ifelse(data$StreamingMovies == 'Yes', 1,0)

#second - calculate the row-wise sums of owned additional services:
data$add_services <- apply(data[,c('os','ob','dp','ts','st','sm')], 1, sum)

#third - show the distribution
data %>%
  group_by(add_services) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = as.factor(add_services))) +
  geom_bar(stat="identity", fill = 'pink') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=12, face='bold'))  +
  labs(x = "", y = "", title = "How many additional services do you subscribe to?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

Another opportunity for the company ahead. Many of the customers seem to be quite conservative concerning their telco services habits. There are many customers who use no additional services. Among customers using the additional services, most of them possess three of them. Since then there is also a potential gap to fill - sell next additional offers.

Now let’s check which additional services are less / more popular among the customers:

#first - recode all addiitonal services variables into 0-1 form:
data$os <- ifelse(data$OnlineSecurity == 'Yes', 1,0)
data$ob <- ifelse(data$OnlineBackup == 'Yes', 1,0)
data$dp <- ifelse(data$DeviceProtection == 'Yes', 1,0)
data$ts <- ifelse(data$TechSupport == 'Yes', 1,0)
data$st <- ifelse(data$StreamingTV == 'Yes', 1,0)
data$sm <- ifelse(data$StreamingMovies == 'Yes', 1,0)

#second - calculate the row-wise sums of owned additional services:
data$add_services <- apply(data[,c('os','ob','dp','ts','st','sm')], 1, sum)

#third - show the distribution
os <- data %>%
  group_by(OnlineSecurity) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(OnlineSecurity, -counts))) +
  geom_bar(stat="identity", fill = 'grey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=10),
        title =element_text(size=10, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to Online Security?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

ob <- data %>%
  group_by(OnlineBackup) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(OnlineBackup, -counts))) +
  geom_bar(stat="identity", fill = 'grey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=10),
        title =element_text(size=10, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to Online Backup?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

dp <- data %>%
  group_by(DeviceProtection) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(DeviceProtection, -counts))) +
  geom_bar(stat="identity", fill = 'grey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=10),
        title =element_text(size=10, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to Device Protection?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

ts <- data %>%
  group_by(TechSupport) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(TechSupport, -counts))) +
  geom_bar(stat="identity", fill = 'grey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=10),
        title =element_text(size=10, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to Tech Support?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

st <- data %>%
  group_by(StreamingTV) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(StreamingTV, -counts))) +
  geom_bar(stat="identity", fill = 'grey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=10),
        title =element_text(size=10, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to Streaming TV?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

sm <- data %>%
  group_by(StreamingMovies) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(StreamingMovies, -counts))) +
  geom_bar(stat="identity", fill = 'grey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=10),
        title =element_text(size=10, face='bold'))  +
  labs(x = "", y = "", title = "Do you subscribe to Streaming Movies?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 
ggarrange(os, ob, ncol = 2, nrow = 1)

ggarrange(dp, ts, ncol = 2, nrow = 1)

ggarrange(st, sm, ncol = 2, nrow = 1)

The most popular additional services (~2700 customer) are streaming TV / Movies. The next are the Online Backup and Device Protection services (~2400 customers). The last, and least popular group are the Online Security and Tech Support (~2000 customers). Still, for each of the additional internet service there is a great potential to achieve better results (about half of the customer not addressed for each additional service).

 

 

As a final cut of the EDA part, let’s take a look at the churn rate at the Telco Company.

data %>%
  group_by(Churn) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = as.factor(Churn))) +
  geom_bar(stat="identity", fill = 'darkgreen') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=18, face='bold'))  +
  labs(x = "", y = "", title = "Did the customer churn last month?") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

26.5% of all customers churned last month. It is a clear indicator, that the dataset is provided for a fictional telecom company. In the reality, the month-to-month churn rate as a fraction of whole customer base is about 2-4%, whereas in the cohort lifecycle approach, about 6-15% of customers within one cohort churn. Yet, the dataset constitutes a good base to search for useful association rules.

 

 

 

SEARCHING FOR THE ASSOCIATION RULES

 

In order to perform the association rules search, we need to construct a transaction matrix. All of the features are supossed to be factors, so I am going to transform the data to isolate the rules in a readable manner.

All of the 0-1 and yes/no columns are going to be recoded into informative factors (e.g. SeniorCitizen: 0-1 -> senior -> not_senior). The more interesting part is the transformation of continuous variables. These are:

  • Tenure
  • Monthly Charges
  • Total Charges

The Tenure is going to be recoded into three factors:

  • Fresher - 0-6 months
  • Loyal - 60-72 months
  • Ordinary - the rest

 

 

Let’s take a look at old and new variables:

data$tenure_new <- ifelse(data$tenure <= 6, 'fresher', 
                          ifelse(data$tenure >=60, 'loyal',
                                'ordinary'))
data$tenure_new <- as.factor(data$tenure_new)

  ggplot(data=data,aes(x = tenure)) +
  geom_histogram() +
  stat_bin(bins=50)+
  theme_minimal() +
  labs(x = "", y = "", title = "Tenure Variable Distribution - Continuous") +
  scale_x_continuous(breaks = c(0,6,12,18,24,30,36,42,48,54,60,66,72)) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

data %>%
  group_by(tenure_new) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = as.factor(tenure_new))) +
  geom_bar(stat="identity", fill = 'navyblue') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=18, face='bold'))  +
  labs(x = "", y = "", title = "Tenure Variable Distribution - Factor") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

There are 1480 freshers and loyal customers, most of them are the ordinary ones. This distribution is arbitrary, but from the telecom perspective these groups constitute a valuable distinction between customers.

 

 

As long as the Monthly Charges are concerned, I decided to recode the variables into five levels based on monthly bills ranges:

  • [18.25;25]
  • (25;50]
  • (50;75]
  • (75;100]
  • more than 100

 

 

Let’s take a look at the distributions for the old and new versions of the variable:

data$MonthlyCharges_new <- ifelse(data$MonthlyCharges >= 18.25 & data$MonthlyCharges <= 25, '1. 18.25-25',
                           ifelse(data$MonthlyCharges >25 & data$MonthlyCharges <= 50, '2. 25-50',
                           ifelse(data$MonthlyCharges > 50 & data$MonthlyCharges <= 75, '3. 50-75',
                        ifelse(data$MonthlyCharges > 75 & data$MonthlyCharges <= 100, '4. 75-100','5. >100'))))

  ggplot(data=data,aes(x = MonthlyCharges)) +
  geom_histogram() +
  stat_bin(bins=50)+
  theme_minimal() +
  labs(x = "", y = "", title = "Monthly Charges Distribution - Continuous") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  data %>%
  group_by(MonthlyCharges_new) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = as.factor(MonthlyCharges_new))) +
  geom_bar(stat="identity", fill = 'violet') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=18, face='bold'))  +
  labs(x = "", y = "", title = "Monthly Charges Distribution - Factor") +
  geom_text(aes(label = counts), vjust = 2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) 

The largest group of the customers pay 75-100$ monthly. The most valuable group is also quite big (902 customers). What is a little bit surprising, the 25-50$ is the least popular one. It is also a next indicator, that the data is fictional, it would be difficult in the real life to find such a distribution of bills among the customers (maybe, if we play a little bit and manipulate the data :)).

 

As far as the last continuous variable - TotalCharges - is concerned, I am going to discard this one from the transactions matrix. I decided to do that, since this variable is simply a combination of the tenure and monthly charges. Similarly, the customer id is going to be removed - it provides no information for the association rules analysis.

The last change in the transformation process is to squeeze the PhoneService and MultipleLines into one variable, which will inform if the customer has a phone service and whether it is a single or multiple lines service.

 

Below the code chunk concerning the transaction matrix construction:

dt <- data[,2]
dt$senior <- ifelse(data$SeniorCitizen == '1', 'senior','not_senior')
dt$partner <- ifelse(data$Partner == 'Yes', 'couple', 'single')
dt$dependents <- ifelse(data$Dependents == 'Yes', 'family', 'no_family')
dt$tenure <- ifelse(data$tenure <= 6, 'fresher', 
                          ifelse(data$tenure >=60, 'loyal',
                                'ordinary'))
dt$phone_service <- ifelse(data$MultipleLines == 'Yes', 'multiple_phone_service', 
                           ifelse(data$MultipleLines == 'No', 'single_phone_service', 'no_phone_service'))
dt$internet_service <- ifelse(data$InternetService == 'DSL', 'DSL_internet',
                              ifelse(data$InternetService == 'No', 'no_internet','Fiber_internet'))
dt$online_security <- ifelse(data$OnlineSecurity == 'Yes','online_security',
                             ifelse(data$OnlineSecurity == 'No', 'no_online_security',
                                    'no_online_security_no_net'))
dt$online_backup <- ifelse(data$OnlineBackup == 'Yes','online_backup',
                             ifelse(data$OnlineBackup == 'No', 'no_online_backup',
                                    'no_online_backup_no_net'))
dt$device_protection <- ifelse(data$DeviceProtection == 'Yes','device_protection',
                             ifelse(data$DeviceProtection == 'No', 'no_device_protection',
                                    'no_device_protection_no_net'))
dt$tech_support <- ifelse(data$TechSupport == 'Yes','tech_support',
                             ifelse(data$TechSupport == 'No', 'no_tech_support',
                                    'no_tech_support_no_net'))
dt$stream_tv <- ifelse(data$StreamingTV == 'Yes','stream_tv',
                             ifelse(data$StreamingTV == 'No', 'no_stream_tv',
                                    'no_stream_tv_no_net'))
dt$stream_movies <- ifelse(data$StreamingMovies == 'Yes','stream_movies',
                             ifelse(data$StreamingMovies == 'No', 'no_stream_movies',
                                    'no_stream_movies_no_net'))
dt$contract <- data$Contract
dt$paperless_billing <- ifelse(data$PaperlessBilling=='Yes','paperless_billing','paper_billing')
dt$payment_method <- data$PaymentMethod
dt$monthly_charges <- ifelse(data$MonthlyCharges >= 18.25 & data$MonthlyCharges <= 25, '1. 18.25-25',
                           ifelse(data$MonthlyCharges >25 & data$MonthlyCharges <= 50, '2. 25-50',
                           ifelse(data$MonthlyCharges > 50 & data$MonthlyCharges <= 75, '3. 50-75',
                        ifelse(data$MonthlyCharges > 75 & data$MonthlyCharges <= 100, '4. 75-100','5. >100'))))

dt$churn <- ifelse(data$Churn =='Yes','churn','no_churn')



write.csv(dt, 'churn_transactions.csv', row.names = FALSE)

trans <- read.transactions("churn_transactions.csv", sep = ",", header = T)

 

 

 

ASSOCIATION RULES - WHOLE DATASET

 

Let’s take a look at the item frequencies - the plot is showing all of the items which occur > 3000 times in the dataset:

item_freq <- itemFrequency(trans, type="absolute")
data_freq <- data.frame(item=names(item_freq), freq=item_freq, row.names=NULL)
data_freq <- data_freq[order(data_freq$freq, decreasing = TRUE),]

data_freq[data_freq$freq>3000,] %>%
  ggplot(aes(y=freq, x = reorder(item, freq))) +
  geom_bar(stat="identity", fill = 'orange') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=12, face='bold'))  +
  labs(x = "", y = "", title = "Item frequencies - absolute counts:") +
  geom_text(aes(label = freq), hjust = 1, size = 4, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  coord_flip() +
  scale_y_discrete(labels = NULL) 

The most frequent situations are no-senior customers, no-churners, people with no family and with paperless billing, and customers with 6m-5y tenure (ordinary).

 

 

We are going to base the rules search on three measures, let’s present them and shortly describe what they are informaing about:

  • support - informs, how often the item/itemset/rule occurs in the data -> Support = count(item)/count(all)
  • confidence - shows how often the item occurs in presence of other item -> Confidence(x->y) = support(x,y)/support(x)
  • lift - this measure describes how increases the probability of having one item in itemset with knowledge of the second item present in itemset over probability of having one item without knowledge about the second one.
    • Lift > 1 - positive association between two items/itemsets.
    • Lift ~ 1 - items/itemsets independent
    • Lift < 1 - negative association between two items/itemsets.

 

I am going to search for useful rules concerning telco customers using the apriori algorithm. Firstly, we are going to perform the search on whole dataset. The threshold for minimum support and confidence levels are 5% and 25%, respectively.

telco_rules <- apriori(trans, parameter = list(support = 0.05, confidence = 0.25, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 352 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[51 item(s), 7043 transaction(s)] done [0.01s].
## sorting and recoding items ... [51 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(trans, parameter = list(support = 0.05, confidence = 0.25, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
##  done [1.03s].
## writing ... [738597 rule(s)] done [0.22s].
## creating S4 object  ... done [0.73s].
summary(telco_rules)
## set of 738597 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8      9     10 
##   1463  17307  70939 133864 151167 140525 113943  74349  35040 
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  2.000000  5.000000  6.000000  6.538999  8.000000 10.000000 
## 
## summary of quality measures:
##     support             confidence           coverage         
##  Min.   :0.05012069   Min.   :0.2500000   Min.   :0.05012069  
##  1st Qu.:0.05480619   1st Qu.:0.6519337   1st Qu.:0.06403521  
##  Median :0.06289933   Median :0.8751323   Median :0.08348715  
##  Mean   :0.07142036   Mean   :0.8116622   Mean   :0.09477312  
##  3rd Qu.:0.07908562   3rd Qu.:1.0000000   3rd Qu.:0.10989635  
##  Max.   :0.64006815   Max.   :1.0000000   Max.   :0.83785319  
##       lift               count          
##  Min.   :0.4616598   Min.   : 353.0000  
##  1st Qu.:1.2070512   1st Qu.: 386.0000  
##  Median :1.7570394   Median : 443.0000  
##  Mean   :2.5817493   Mean   : 503.0136  
##  3rd Qu.:4.6153342   3rd Qu.: 557.0000  
##  Max.   :7.8082040   Max.   :4508.0000  
## 
## mining info:
##   data ntransactions support confidence
##  trans          7043    0.05       0.25
##                                                                                    call
##  apriori(data = trans, parameter = list(support = 0.05, confidence = 0.25, minlen = 2))

 

There are quite a lot of rules found - 738597! It may seem too large to analyze, but we are going to use appropriate measures to sort the rules and make sense of them in a way to construct some recommendations towards the telco company.

The maximum support among the rules is 0.64, there are some rules that cover much of the data. Yet, 75% of the rules present the support level >= 7.14% - there are numerous infrequent rules. I suppose it is great - often the real value is hidden in small chunks of data! There are many rules with 100% confidence and an average rule presents the confidence level equal to 0.81.

Let’s take a look at some randomly choosen rules:

set.seed(123)
sample <- floor(runif(10, min = 1, max = 738597))
df <- DATAFRAME(telco_rules[sample,])
rownames(df) <- NULL
paged_table(df)

Let’s take a look at the rule #4 - it seems to be quite interesting case. there is a subset of not senior, single, male customers, who do not possess additional services and did not churn and they consequently pay the lowest commitment range (18.25-25 monthly). It seems to be quite intuitive - young single males, who do not care about any additional features, want to get the core service and pay as little as possible. It could be an appropriate strategy to contact them as early as we spot the contract expiration phase and offer renewal at a little bit higher cost, giving better core service (e.g. DSL -> Fiber Optic). It is a small subset (small support value), but the elements co-occur frequently (confidence level very high).

The rest of the randomly choosen rules are not interesting - it is a drawback of that many rules. However, thanks to a rather small support threshold for such a dataset, we are able to catch some useful nuggets, such as the rule described above. Let’s take a look at the found rules via some visualisation methods.

Firstly, let’s create a scatterplot for the rules concerning the support-confidence-lift intensity:

plot(telco_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Rules with high support and confidence levels tend to present rather low lift (as it should be, according to the lift equation). The most interesting part constitute the red points with extremely high lift. These are rare, but potentially very valuable rules. We are going to inspect them later.

 

 

Now let’s take a look at the confidence-support-rule length intensity plot:

plot(telco_rules, shading="order", control=list(main="Two-key plot"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The plot illustrates how increasing length of the rule translates into decreasing support, while often the confidence increases.

 

 

At this point, I am not going plot the graph of the rules - we are going to use this tool later. For now, let’s move on to the very core of this part - rules analysis!

 

I am going to present top 30 rules according to support/confidence/lift and try to single out interesting chunks. Let’s start with the support-wise rules - these are the ones, which occur quite frequently in the transactional data. Yet, there is a trap as far as support-wise rules are concerned - I will cover this topic in a moment. Let’s take a look at the rules:

df <- DATAFRAME(sort(telco_rules, by = "support")[1:30])
rownames(df) <- NULL
paged_table(df)

All of the presented rules are two-element combinations. The first rule is very, very interesting and potentially important information. Actually, senior citizens tend to be non-churners. They seem to be loyal customers and do not change the services. Telco could use this information to try increasing seniors’ value by proposing additional/better services. Seniors tolerate such trials more and it is more probable that they will stay with us anyway. On the other hand, the telco could focus more on younger customers (e.g. by some special discounts/ adjusting the offer), since the senior target seems to be safe.

Take also a look at the rule #23. Younger customers tend to possess single phone services. It creates some potential to sell additional sim-cards and/or use this information as an opportunity to sell additional mobile line and some internet services (e.g. by creating the convergence offer).

Still, in this case we have to remember, that for high support rules, there is lift ~1 -> it indicates that the itemsets are independent. Yet, it is useful to know that e.g. senior and non-churn coexist together

 

Let’s move on to the rules, which present the highest confidence level:

df <- DATAFRAME(sort(telco_rules, by = "confidence")[1:30])
rownames(df) <- NULL
paged_table(df)

There are two interesing chunks. Customers who possess the DSL internet, do not subscribe to the phone service (prospective convergence subset of customers). Fiber internet users are the most valuable customers segment. It may be a clear sign, in which direction the telco should influence their customers (focus on fiber optic internet, pairing it with phone services / additional internet services).

 

And finally - lift-sorted rules:

df <- DATAFRAME(sort(telco_rules, by = "lift")[1:30])
rownames(df) <- NULL
paged_table(df)

These rules apply to high-value segment of the telco customers. From the financial perspective of the entrepreneuship, this is the key group for the telco. As stated above, these are Fiber internet users, they often buy additional services and have multiple mobile lines. Among the additional services within high-value segment, the streaming ones are the most popular. Also in some of the subgroups, the customers are non-seniors, sometimes these are customers with no family (streaming for solitary people?). What is interesting - many of the high-value subgroups consist of loyal customers (>5 years tenure). Hence, the telco should aim at keeping the customers while trying to increase their value (e.g. within the renewal processes and/or up/cross-sell campaigns - it seems that in many cases that helped in the past).

It seems that we were able to find interesting rules for the whole dataset. Now, let’s move on to especially interesting part - what are the circumstances likely to affect the customer churn?

 

 

 

ASSOCIATION RULES - WHAT LEADS TO CHURN?

 

Customer churn is probably the largest challenge for the telecom sector. The infrastructure and cutomer maintenance costs are relatively high and there is always some probability, that we are going to lose the customer, which translates into prospective profit loss.

Let’s check for the rules, that could be useful for tackling this problem. Again, I am going to set the threshold values for support and confidence as 0.05 and 0.25:

churn_rules <- apriori(data=trans, parameter=list(supp=0.05,conf = 0.25,minlen = 2), 
                       appearance=list(default="lhs", rhs="churn"), control=list(verbose=F))
summary(churn_rules)
## set of 5298 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6    7    8    9 
##   22  243  949 1752 1512  677  134    9 
## 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 2.000000 5.000000 5.000000 5.340317 6.000000 9.000000 
## 
## summary of quality measures:
##     support             confidence           coverage         
##  Min.   :0.05012069   Min.   :0.2504425   Min.   :0.06673293  
##  1st Qu.:0.05423825   1st Qu.:0.4836389   1st Qu.:0.09683374  
##  Median :0.06062757   Median :0.5464975   Median :0.11529178  
##  Mean   :0.06697951   Mean   :0.5401622   Mean   :0.12927430  
##  3rd Qu.:0.07255431   3rd Qu.:0.6050461   3rd Qu.:0.14422121  
##  Max.   :0.23498509   Max.   :0.7737844   Max.   :0.70041176  
##       lift               count          
##  Min.   :0.9437487   Min.   : 353.0000  
##  1st Qu.:1.8225088   1st Qu.: 382.0000  
##  Median :2.0593805   Median : 427.0000  
##  Mean   :2.0355070   Mean   : 471.7367  
##  3rd Qu.:2.2800107   3rd Qu.: 511.0000  
##  Max.   :2.9158712   Max.   :1655.0000  
## 
## mining info:
##   data ntransactions support confidence
##  trans          7043    0.05       0.25
##                                                                                                                                                           call
##  apriori(data = trans, parameter = list(supp = 0.05, conf = 0.25, minlen = 2), appearance = list(default = "lhs", rhs = "churn"), control = list(verbose = F))

There are 5298 rules found. Most of them consist of 4-7 elements. Max support of the rules is 0.235 and the max confidence is 0.774. Let’s take a look at some random rules concerning churn:

 

set.seed(123)
sample <- floor(runif(10, min = 1, max = 5298))
df <- DATAFRAME(churn_rules[sample,])
rownames(df) <- NULL
paged_table(df)

Looking at the rule #6 - there is a subset of churners, who have month-to-month contract, are single, fiber internet users with rather high value (second most valuable subgroup). In a moment we are going to analyze the rules applying ordering by support/confidence/lift. Firtsly though - let’s visualise the rules.

 

Below - rules scatterplots:

plot(churn_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(churn_rules, shading="order", control=list(main="Two-key plot"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Again, the larger is support, lift and confidence are decreasing. What is more interesting - the confidence-support combinations for each rule-length segment are layered by triangular patterns.

 

 

Let’s create the graph of top 5 rules according to the lift value:

plot(churn_rules, method = "graph", measure = "support", shading = "lift", main = "Association Rules Graph", max = 5)
## Warning: Unknown control parameters: main
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE
## Warning: Too many rules supplied. Only plotting the best 5 using 'lift' (change
## control parameter max if needed).

As we can see, The strongest link is observed between churn and additional internet services -> no online security/backup. Let’s analyze the most popular rules according to the quality measures.

 

First - rules ordered by the support value:

df <- DATAFRAME(sort(churn_rules, by = "support")[1:30])
rownames(df) <- NULL
paged_table(df)

There are two interesting rules -> #18 and #20. There is a link between these two - churners do possess month-to-month contract. Additionally, both rules have lift > 2, so we observe positive dependency. It seems, that month-to-month contracts might be easier to resign from (it is just a hypothesis).

 

Let’s verify the rules according to confidence measure:

df <- DATAFRAME(sort(churn_rules, by = "confidence")[1:30])
rownames(df) <- NULL
paged_table(df)

The vast majority of rules consist of Fiber_internet and fresher. Since then, there are churners, who subscribe to fiber internet and their tenure is lower than six months. Again, in many cases (rules) the contract type among churners is month-to-month one. Let’s take a look at rule #20. There is a subset of churners, who are single freshers, who subscribe to the fiber internet within month-to-month contract. Here I suppose, that this might be a difficult, choosy segment of customers, who check diligently all of the conditions and change the operators rather frequently in order to achieve the best quality/price index (again, it is only one of many possibilities).

 

Lastly - let’s take a look at the rules ordered by the lift measure:

df <- DATAFRAME(sort(churn_rules, by = "lift")[1:30])
rownames(df) <- NULL
paged_table(df)

Actually, the results are very similar to the previos ones. Since then, there is no sense in re-interpreting the same rules second time.

 

 

 

FINAL CONCLUSIONS

90% of the telco customers subscribe to a phone service, 16% of them are senior citizens, and 27% are churners.

There are 3000 Fiber Optic Internet users -> there is a huge potential to sell more internet services and upsell existing customers (DSL technology).

Another story is the convergence - probably here lies the source of prospective profits for the telco. 31% of the customers subscribe to a phone or internet service, but not both. I strongly recommend, that the telco should give preferential treatment to these customers.

Assuming 5% and 25% support and confidence thresholds, a lot of association rules via apriori algorithm were specified - over 730000!

A subset of young, single men tend to choose an offer as cheap as possible and stay (do not churn) at the telco.

There are some DSL customers, who do not subscribe to the phone service (prospective convergence through cross-selling and/or increasing value through up-selling -> upgrade to fiber optic internet).

The fiber internet users proove to be the most valuable customers. They also often subscribe to additional internet services and are loyal (more than five years tenure).

From the churners perspective, some of them have a month-to-month contract. Moreover, many subsets of churners “contain” freshers (up to six months tenure) and also fiber optic internet users. As we can see - the great value goes hand in hand with risk.

Some of the customers could be troublesome and choosy - it could be a good idea to define the segments of churners and “healthy”/“desirable” customers and treat them appropriately. The association rules exercise is a solid starting point for further research to reduce churn and increase the overall customer value.