Introduction

Association rule mining is an unsupervised learning technique designed to uncover associations between items in a dataset. An association rule is often summarized as a conditional statement: "if something happens, then something else also tends to happen". Applications of this technique are found in many parts of the consumer services market. Probably the best-known application is market basket analysis, which tries to identify products that are frequently bought together. Association rules are also used in recommendation systems and in credit risk analysis.

In the following work, I will focus on another application, namely churn analysis. It can help discover which customer groups are likely to resign from a service company and which are not. Such information undoubtedly allows the company to design thoughtful marketing strategies targeted at the relevant groups.

Association rule measures

There are three common measures used to assess the quality of rules: support, confidence and lift. I briefly explain them below for the general rule "if itemset A (the antecedent), then itemset B (the consequent)".

Support

Support is simply a measure of how often the joint itemset appears in the database, i.e. the fraction of transactions that contain both A and B.
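
Formally, for a database of N transactions: supp(A ⇒ B) = P(A ∩ B) = (number of transactions containing both A and B) / N.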

Confidence

Confidence is the probability that the antecedent and the consequent appear together divided by the probability that the antecedent appears, i.e. the conditional probability of the consequent given the antecedent. In other words, it measures how often the rule turns out to be correct. The higher the confidence, the stronger the rule.
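
Formally: conf(A ⇒ B) = P(A ∩ B) / P(A) = supp(A ⇒ B) / supp(A).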

Lift

Lift is the probability that the antecedent and the consequent appear together divided by the product of their individual probabilities of appearing. The higher the lift, the greater the chance of co-occurrence of A and B (a small worked example follows the list below). This definition indicates that

  • if a rule has lift greater than one, then A and B are positively dependent on each other,
  • if a rule has lift below one, then A and B are negatively dependent, i.e. the appearance of A has a negative effect on the appearance of B,
  • if a rule has lift equal to one, then A and B are independent of each other and thus no rule can be derived from them.
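
Formally: lift(A ⇒ B) = P(A ∩ B) / (P(A) · P(B)) = conf(A ⇒ B) / supp(B).

To make the three measures concrete, below is a minimal sketch that computes them by hand for a single rule {A} ⇒ {B} on ten made-up transactions (the logical vectors A and B are invented purely for illustration):

A <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)  # antecedent present?
B <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)  # consequent present?
n <- length(A)
support    <- sum(A & B) / n             # P(A and B)
confidence <- sum(A & B) / sum(A)        # P(A and B) / P(A)
lift       <- confidence / (sum(B) / n)  # confidence / P(B)
round(c(support = support, confidence = confidence, lift = lift), 3)
##    support confidence       lift 
##      0.500      0.833      1.389

Here the lift is above 1, so A and B in this toy example are positively dependent.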

Data

In this paper I will use the Telco customer churn dataset. It contains 7043 rows, each representing a customer (the cleaned version loaded below has 7032). The dataset includes information about the services each customer has signed up for, demographic information about the customer, and account information. For a churn analysis, however, the most interesting column is "Churn", which indicates whether the customer left the company or not.

We can then load the data directly from GitHub.

data <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")

In the following analysis, I will limit myself to a few of the most important variables, to make the association rules easier to understand and visualize.

data <- data[,c(1:3,5,18,20)]

The considered variables are the following:

library(dplyr)
glimpse(data)
## Rows: 7,032
## Columns: 6
## $ Gender         <fct> Female, Male, Male, Male, Female, Female, Male, Fema...
## $ SeniorCitizen  <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, ...
## $ Partner        <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Y...
## $ Tenure         <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, ...
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29....
## $ Churn          <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, ...

Data preparation

First, I check whether there are any missing values.

any(is.na(data))
## [1] FALSE

Next, as one may have noticed, there are two continuous variables (Tenure and MonthlyCharges). To properly prepare the data for association rule mining, they need to be carefully discretized.

Tenure

First, I will look at the Tenure variable, whose values range from 1 to 72 and denote the number of months the customer has been with the company. The easiest way to see its distribution is a histogram.

hist(data$Tenure, breaks = seq(0,72), col = "blue")

By far the largest group of customers has been with the company for just one month. There are also many values of 72, which may suggest that this level actually means "72 or more". For analyzing customer resignation it is worth singling out the first month. Many changes in customer satisfaction can also be noticed in the first six months and in the first year. I therefore decided to discretize the Tenure variable as below.

data$Tenure <- as.factor(ifelse(data$Tenure <= 1, "1",
                         ifelse(data$Tenure <= 6, "(1,6]",
                         ifelse(data$Tenure <= 12, "(6,12]",
                         ifelse(data$Tenure <= 24, "(12,24]",
                         ifelse(data$Tenure <= 48, "(24,48]",
                         ifelse(data$Tenure <= 60, "(48,60]", "(60,72]")))))))
table(data$Tenure)[c(7,1,5,2:4,6)]
## 
##       1   (1,6]  (6,12] (12,24] (24,48] (48,60] (60,72] 
##     613     857     705    1024    1594     832    1407
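
As a side note, the same binning could be written more compactly with base R's cut(). A quick sketch, applied to a hypothetical raw_tenure vector holding the original integer values (data$Tenure itself has already been converted above); the right-closed intervals of cut() match the ifelse() chain exactly:

# Hypothetical vector with the original integer Tenure values (first rows of the data)
raw_tenure <- c(1, 34, 2, 45, 2, 8)
# Equivalent binning: right-closed breaks reproduce the levels used above
cut(raw_tenure,
    breaks = c(0, 1, 6, 12, 24, 48, 60, 72),
    labels = c("1", "(1,6]", "(6,12]", "(12,24]", "(24,48]", "(48,60]", "(60,72]"))
## [1] 1       (24,48] (1,6]   (24,48] (1,6]   (6,12] 
## Levels: 1 (1,6] (6,12] (12,24] (24,48] (48,60] (60,72]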

MonthlyCharges

I have decided to discretize the MonthlyCharges variable based (approximately) on its quartiles, in order to obtain groups of customers of similar size.

quantile(data$MonthlyCharges)
##       0%      25%      50%      75%     100% 
##  18.2500  35.5875  70.3500  89.8625 118.7500
data$MonthlyCharges <- as.factor(ifelse(data$MonthlyCharges <= 35, "0-35",
                                 ifelse(data$MonthlyCharges <= 70, "36-70",
                                 ifelse(data$MonthlyCharges <= 90, "71-90", "91-120"))))
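
As a quick sanity check that the four groups are indeed of comparable size, one can tabulate the discretized variable (output omitted here):

table(data$MonthlyCharges)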

I also decided to shorten some of the variable names.

colnames(data)[c(2,5)] <- c("Senior","MCharges")

At the end, we get the following data structure:

glimpse(data)
## Rows: 7,032
## Columns: 6
## $ Gender   <fct> Female, Male, Male, Male, Female, Female, Male, Female, Fe...
## $ Senior   <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No...
## $ Partner  <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Yes, No...
## $ Tenure   <fct> "1", "(24,48]", "(1,6]", "(24,48]", "(1,6]", "(6,12]", "(1...
## $ MCharges <fct> 0-35, 36-70, 36-70, 36-70, 71-90, 91-120, 71-90, 0-35, 91-...
## $ Churn    <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, No, Ye...

Association rules

Now we can move on to finding association rules in the data. To do this, I will use the apriori() function from the arules package. It is worth noting that it requires the data to be either an object of class transactions or a data structure that can be coerced to one (such as a matrix or data frame). Since each row of our data already represents one customer, it is in the correct form.
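
apriori() performs this coercion internally, but it can be instructive to do it explicitly and inspect the resulting item coding; a short sketch (the trans object is introduced here only for illustration):

library(arules)
# Each level of each factor becomes a separate binary item, e.g. "Churn=No"
trans <- as(data, "transactions")
summary(trans)                       # item labels, density, transaction lengths
itemFrequencyPlot(trans, topN = 10)  # ten most frequent items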

In similar analyses one sometimes comes across an approach in which the value "No" of binary variables is replaced with NA. Here, however, that is not advisable, since we would lose information carried, for example, by the fact that someone does not have a partner or is not a senior citizen.

The first rules will be produced with the default settings, shown in the parameter specification below.

library(arules)
library(arulesViz)
rules_all <- apriori(data)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 703 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[19 item(s), 7032 transaction(s)] done [0.01s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [63 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspectDT(rules_all)

This way we obtained 63 association rules. As one can easily observe from the first 10 rules listed, the consequents are not only values of the Churn variable but also, for example, whether the customer is a senior citizen or not.
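
If the interactive table is not available (for example in a static document), the first rules can also be printed with arules' own inspect(); a sketch:

inspect(head(rules_all, n = 10))  # static alternative to inspectDT()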

{Churn = No} rules

Now we move on to the implications that interest us (the company) the most, i.e. the rules with the Churn variable as the consequent. Since the rules with the {Churn=Yes} consequent generally have lower support, I have decided to analyze each value of Churn separately. We start with {Churn=No}, which we pass to rhs inside the appearance argument as below.

rules_N_all <- apriori(data,
                       parameter = list(supp = 0.01, conf = 0.89),
                       appearance = list(rhs = c("Churn=No")),
                       control = list(verbose = FALSE))

rules_N_all
## set of 140 rules

There are 140 rules with support of at least 0.01 and confidence of at least 0.89. Now I sort them by confidence to see which rules are the strongest. In addition, I round the quality measures for better readability.

quality(rules_N_all) <- round(quality(rules_N_all), digits = 4)

rules_N_all <- sort(rules_N_all, by = "confidence")

inspectDT(rules_N_all)

Redundant rules

Clearly, too many association rules have been discovered in the dataset. Before studying the rules and looking for the most interesting ones, it is necessary to remove the redundant ones. For example, rules #3 and #4 do not provide any additional information compared to rule #1: everything they say is already implied by that first rule. This means that rules #3 and #4 can be considered redundant and should be omitted. Below I remove all redundant rules, relying on the sorting above.

subsets_N <- is.subset(rules_N_all, rules_N_all)
subsets_N[lower.tri(subsets_N, diag = T)] <- F
redundant_N <- (colSums(subsets_N) >= 1)

# removing redundant rules
rules_N <- rules_N_all[!redundant_N]
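
As a side note, arules also ships a built-in helper for this task. Note that is.redundant() uses a confidence-improvement criterion, so it may prune a slightly different set than the subset-matrix approach above:

# Alternative pruning with the built-in helper (confidence-improvement based)
rules_N_builtin <- rules_N_all[!is.redundant(rules_N_all)]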

I check whether all the obtained rules are statistically significant (by default, is.significant() applies a Fisher's exact test).

all(is.significant(rules_N, data))
## [1] TRUE
rules_N
## set of 60 rules
inspectDT(rules_N)

At first glance, one can already notice some interesting patterns. Namely, there are three rules with confidence equal to 1. They mean that for every customer in the dataset matching those antecedents, the Churn value is No. More precisely, none of the customers who pay at most 35 a month, have stayed with the company for more than five years, and are male or have no partner have churned. For the company this is vital information. What is more, these rules cover 148 and 74 cases respectively.

If we look at rule #4, we can see even more. For customers who pay no more than 35 a month and have been with the company for over five years, the probability of leaving is also negligible. The rule covers 278 customers and its confidence equals 0.996, which means that almost everyone (99.6%) with these characteristics has not churned.

In the discussed rules the lift is, unsurprisingly, above 1, which means that the antecedents and consequents are positively dependent and have a high probability of co-occurrence.
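
Beyond support, confidence and lift, further interest measures can be attached to a rule set with interestMeasure(); a small sketch (the chi-squared statistic is just one example of the many available measures):

# Attach an additional quality measure, e.g. the chi-squared statistic
quality(rules_N)$chiSquared <- interestMeasure(rules_N, measure = "chiSquared",
                                               transactions = data)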

It is also worth looking at which rules have the highest support.

rules_N <- sort(rules_N, by = "support")
inspectDT(rules_N)

It turns out that the itemset {Churn=No, MCharges=0-35} has the highest support, covering more than one in five customers.

{Churn = Yes} rules

We now turn to the rules with the {Churn=Yes} consequent. As mentioned before, in order to obtain any rules at all, the parameter constraints (mainly the minimum support) had to be slightly relaxed. The rules are again sorted by confidence.

rules_Y_all <- apriori(data,
                       parameter = list(supp = 0.005, conf = 0.80),
                       appearance = list(rhs = c("Churn=Yes")),
                       control = list(verbose = FALSE))

rules_Y_all
## set of 14 rules
quality(rules_Y_all) <- round(quality(rules_Y_all), digits = 4)

rules_Y_all <- sort(rules_Y_all, by = "confidence")

inspectDT(rules_Y_all)

Only 14 rules were obtained. Nevertheless, as before, the redundant ones should be removed.

Redundant rules

subsets_Y <- is.subset(rules_Y_all, rules_Y_all)
subsets_Y[lower.tri(subsets_Y, diag = T)] <- F
redundant_Y <- (colSums(subsets_Y) >= 1)

# removing redundant rules
rules_Y <- rules_Y_all[!redundant_Y]

I again check whether all obtained rules are significant.

all(is.significant(rules_Y, data))
## [1] TRUE
rules_Y
## set of 7 rules
inspectDT(rules_Y)

Only 7 association rules are left.

First of all, note that the confidence of these rules lies in the range 0.81-0.91 (which is rather high), but above all the lift is in every case greater than 3, indicating a strong dependence between the antecedents and the consequents.

An important observation is certainly that a senior citizen who has been with the company for only a month is very likely to churn. The confidence in this case is 0.86, which means that 86% of the senior citizens in the database with tenure equal to 1 have churned. Moreover, if we restrict these customers to women, the confidence rises to 0.91, which makes the rule very strong.

In general, it is easy to see that the above rules are mostly related to low values of the Tenure variable (usually up to 6 months) and to high monthly charges, and that they mostly concern women.
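
If one wants to isolate, for example, all surviving rules whose antecedent involves the first-month tenure level, arules' subset() does the job; a sketch:

# Keep only rules whose antecedent (lhs) contains the item Tenure=1
rules_Y_month1 <- subset(rules_Y, subset = lhs %in% "Tenure=1")
inspect(rules_Y_month1)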

Visualization

To make the obtained results more visually appealing, I provide below some plots that may help in better understanding them. For this I will use the arulesViz package, which is already loaded.

{Churn = No} rules

The simplest place to start is a scatter plot showing all the rules with the {Churn=No} consequent as points, in terms of support, confidence and lift.

plot(rules_N, method = "scatterplot", measure = c("support","confidence"), shading = c("lift"))

As we can see, there are some rules with confidence equal to 1, which were already mentioned. The majority of the rules have support below 0.07. It is also worth noticing that all of them have high confidence and high lift values, which means the obtained rules are rather strong.
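
A common variant of this plot is the so-called two-key plot, which shades the points by rule length (order) instead of lift; a sketch:

plot(rules_N, method = "two-key plot")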

In the next visualizations I limit the number of rules to 10, since otherwise they would not be clearly legible. Below I first visualize the rules in grouped form, where circle size depends on the support value and color intensity depends on the lift value.

plot(rules_N[1:10], method = "grouped", measure = "confidence")

The next graph presents a similar structure with circles, but in a visually much richer way. The rules are shown as a network whose edges represent the relationships between items; each rule's edges pass through a dot whose size reflects the rule's support and whose color reflects its lift.

set.seed(123)
plot(rules_N[1:10], method = "graph", cex = 0.7)

For example, we can see that the biggest dot represents the rule {Tenure=(60,72], MCharges=0-35} => {Churn=No}, which was already mentioned as the rule with the highest support. It is also worth noticing that many arrows come from the {Tenure=(60,72]} and {MCharges=0-35} nodes, which emphasizes their importance in the 10 strongest association rules with the {Churn=No} consequent.

The same graph can be presented in an alternative, circular layout, as below. Some may find it clearer when there are not too many nodes.

plot(rules_N[1:10], method = "graph", cex = 0.7, control = list(layout=igraph::in_circle()))

Finally, we can draw a parallel coordinates plot to see how the consequent is reached.

plot(rules_N[1:10], method = "paracoord", measure = "confidence", lty = "dotted")

{Churn = Yes} rules

Similar plots can be created for the rules with the {Churn=Yes} consequent. However, since there are only 7 rules, there is no need for a scatter plot.

plot(rules_Y, method = "grouped", measure = "confidence")

set.seed(987)
plot(rules_Y, method = "graph", cex = 0.7)

plot(rules_Y, method = "graph", cex = 0.7, control = list(layout=igraph::in_circle()))

In the two graphs above it is worth highlighting that many edges come from {Tenure=1} and {MCharges=71-90}, which indicates their importance for potential customer churn. What is more, the {Tenure=1} node contributes to both the biggest dot (the highest-support rule) and the reddest dot (the highest-lift rule). It follows that creating a separate level for the first month was a good idea.

plot(rules_Y, method = "paracoord", measure = "confidence", lty = "dotted")

Summary

In this paper, a churn analysis was conducted on the Telco customer churn dataset as a demonstration of one application of association rules. While such an analysis could just as well be performed with other tools, this method provides interesting insight and a visual understanding of the data, which is equally important in analytical work.