---
title: "Association rule mining"
output:
flexdashboard::flex_dashboard:
social: menu
source_code: embed
theme: united
---
```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(kableExtra)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE)
```
Workflow {.storyboard}
=========================================
Inputs {.sidebar}
-------------------------------------
Association rule mining is an approach to discovering patterns of co-occurrence in a (large) dataset, by identifying entities that frequently appear together in a group. As an unsupervised learning technique, association rule mining can be used to identify novel patterns/relationships amongst entities in a large set of data.
This type of patterns are summarized by **association rules**, which predicts the occurrence of one or more entities based on the occurrences of other entities in a certain grouping, such as a transaction or an individual.
As usual, we will perform association rule mining on the [IBM Telco customer churn dataset](https://developer.ibm.com/patterns/predict-customer-churn-using-watson-studio-and-jupyter-notebooks/), to identify customer characteristics and purchasing behaviours that tend to appear together, which may be helpful in informing marketing and customer retention strategies.
Hope you enjoy your stay! :)
### **Prepare & inspect data** Convert dataset to transaction format {data-commentary-width=550}
```{r}
## Import library
library(plyr)
library(dplyr)
library(arulesCBA)
library(arules)
library(ggplot2)
## Import data that has been cleaned up
df <-read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_yes_no.csv")
## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., df[, c('MonthlyCharges', 'Churn')], method='mdlp')$MonthlyCharges
## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges = revalue(df$Binned_MonthlyCharges,
c("[-Inf,29.4)"="$0-29.4",
"[29.4,56)"="$29.4-56",
"[56,68.8)"="$56-68.8",
"[68.8,107)"="$68.8-107",
"[107, Inf]" = "$107-118.75"))
## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ .,
df[, c('Tenure', 'Churn')],
method='mdlp')$Tenure
## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure = revalue(df$Binned_Tenure,
c("[-Inf,1.5)"="1-1.5m",
"[1.5,5.5)"="1.5-5.5m",
"[5.5,17.5)"="5.5-17.5m",
"[17.5,43.5)"="17.5-43.5m",
"[43.5,59.5)"="43.5-59.5m",
"[59.5,70.5)"="59.5-70.5m",
"[70.5, Inf]"="70.5-72m"))
## Replace "No"s with empty values
df[df=="No"]<-NA
df[] <- lapply(df, function(x) levels(x)[x])
## Replace "Yes"s with the column name
w <- which(df == "Yes", arr.ind = TRUE)
df[w] <- names(df)[w[,"col"]]
## Output to CSV
write.csv(df, './final_df.csv', row.names=FALSE)
## Convert dataframe to transaction format
tData <- read.transactions('./final_df.csv',
format = "basket", sep = ",",
header=TRUE)
## Get item frequency
x <- data.frame(sort(table(unlist(LIST(tData))), decreasing=TRUE))
## Plot
ggplot(data=x, aes(x=factor(x$Var1), y=x$Freq)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust=1),
panel.grid.major = element_blank(),
panel.background = element_blank()) +
ylab('Frequency') +
xlab('Items')
```
***
Description
General workflow
There are two things we need to do before the dataset is ready for association rule mining:
1. There are two continuous variables whose relationship to customer churn we are interested in: `MonthlyCharges` and `Tenure`. We will ignore the TotalCharges variable here, as it is a product of these two. To make use of them, we need to **discretize** them into categorical variables. However, simply dividing up the values into equal sized bins would almost definitely result in information loss. To obtain the most informative binning, we will use a **supervised discretization** function from the `arulesCBA` package, which identifies bin breaks that retain the most predictive power with respect to the target variable that we are interested in, `Churn`.
2. The `arules` package requires the input dataset to be in the **transaction** format. In our case, this means that each row corresponds to a single customer, listing all their personal characteristics and purchasing behaviours as comma-separated items.
Once all of this is done, we can look at the prevalence of each of these characteristics or behaviours is amongst the customers. For example, month-to-month contracts and paperless billing are the most common.
```{r eval=F, echo=T}
## Import libraries
library(arulesCBA)
library(arules)
library(ggplot2)
## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn"
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ .,
df[, c('MonthlyCharges', 'Churn')],
method='mdlp')$MonthlyCharges
## Discretize "Tenure" with respect to "Churn"/"No Churn"
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ .,
df[, c('Tenure', 'Churn')],
method='mdlp')$Tenure
## Convert dataframe to transaction format
tData <- read.transactions(df,
format = "basket",
sep = ",",
header=TRUE)
## Get item frequency
x <- data.frame(sort(table(unlist(LIST(tData))),
decreasing=TRUE))
## Plot
ggplot(data=x, aes(x=factor(x$Var1), y=x$Freq)) +
geom_col()
```
### **Generate & filter rules** Extract the most informative rules by measures of "interestingness" and statistical significance {data-commentary-width=500}
```{r message=FALSE, warning=FALSE, include=FALSE}
## Import libraries
library(knitr)
library(kableExtra)
## Create rules
rules <- apriori(tData,
parameter = list(supp = 0.001,
conf=0.9,
minlen=3,
maxlen=5))
## Filter rules by lift
filtered_rules <- subset(rules, subset = lift > 1.5)
## Filter out redundant rules
nonr_rules <- filtered_rules[!is.redundant(filtered_rules)]
## Filter out statistically insignificant rules
sig_rules <- nonr_rules[!is.significant(nonr_rules,
tData,
method = "fisher",
adjust = 'bonferroni')]
## Extract rules that have "Churn" as consequent
churn_rules <- subset(sig_rules, subset=rhs %pin% 'Churn')
## Convert rules matrix to dataframe
churn_rules_df <- DATAFRAME(churn_rules, setStart='', setEnd='', separate = TRUE)
```
```{r}
## Sort the rules by how often they are true, which is measured by "support"
kable(head(churn_rules_df[order(-churn_rules_df$support),], 20), format='html', escape=F) %>%
kable_styling(full_width=T, bootstrap_options = c("striped", "hover"))
```
***
Description
General workflow
The Apriori algorithm is most commonly used for association rule mining. We can set various parameters to limit the number of rules created from the dataset, usually how often the rule is observed (**support**), how often it is true (**confidence**) and the minimum/maximum length of the rule. As an example, let’s generate rules that appear in at least 0.1% of customers, holds true 90% of the time, and contain 3-5 “items”.
The number of rules obtained at this step will almost certainly be too numerous to provide insights. We can further filter the rules by their **lift**, which is a measure of how much more or less likely the items in a given rule appear together as compared to by chance. Therefore, it is a metric of the importance of a rule. In addition, redundant and statistically insignificant rules should also be removed to leave only the most informative ones. Finally, as we are most interested in customer churn, we can extract *only* the rules that contain "Churn" in the right-hand side.
After all this is done, we can then convert the association rules to a dataframe for easy inspection and take a look. “LHS” and “RHS” refer to items on the left- and right-hand side of each rule, respectively. “Count” gives how many instances, whether it be transactions or customers, in which the rule appears. In our case, the “count” of each rule divided by the total number of customers in the dataset equals its support.
```{r, eval=F, echo=T}
## Import libraries
library(arules)
## Convert dataframe to transaction format
tData <- read.transactions(df,
format = "basket",
sep = ",",
header=TRUE)
## Create rules
rules <- apriori(tData,
parameter = list(supp = 0.001,
conf=0.9,
minlen=3,
maxlen=5))
## Filter rules by lift
filtered_rules <- subset(rules, subset = lift > 1.5)
## Filter out redundant rules
nonr_rules <- filtered_rules[!is.redundant(filtered_rules)]
## Filter out statistically insignificant rules
sig_rules <- nonr_rules[!is.significant(nonr_rules, tData,
method = "fisher",
adjust = 'bonferroni')]
## Extract rules that have "Churn" as consequent
churn_rules <- subset(sig_rules,
subset=rhs %pin% 'Churn')
## Convert rules matrix to dataframe
churn_rules_df <- DATAFRAME(churn_rules,
setStart='', setEnd='',
separate = TRUE)
```
### **Grouped matrix** Overview of the association rules {data-commentary-width=550}
```{r}
## Import libraries
library(ggplot2)
library(arulesViz)
## Grouped matrix
plot(churn_rules, method="grouped", control=list(main=NULL))
```
***
Description
General workflow
The grouped matrix plot provides a birds eye overview of the most informative rules. The antecedents of each rule are grouped in the columns using clustering. Each group is labeled with the "items" that are most prevalent in its antecedent. Balloons in the matrix are used to represent with what consequent the antecedents are connected.
From this plot, we can see that purchasing fiber optic internet service and having a month-to-month contract are associated with higher likelihood of the customer leaving the company.
However, its utility is limited precisely by how distilled the information is. So, that is where the circular grouped plot comes in, as we will see next.
```{r, eval=F, echo=T}
## Import libraries
library(arules)
library(arulesViz)
## Convert dataframe to transaction format
tData <- read.transactions(df,
format = "basket",
sep = ",",
header=TRUE)
## Create rules
rules <- apriori(tData,
parameter = list(supp = 0.001,
conf=0.9,
minlen=3,
maxlen=5))
## Plot
plot(rules, method="grouped",
control=list(main=NULL))
```
### **Circular grouped plot** More granular summary of the association rules {data-commentary-width=450}
```{r, fig.height=15, fig.width=17}
plot(churn_rules, method="graph", cex=1.7,
control=list(main=NULL, layout=igraph::in_circle()))
```
***
Description
General workflow
The circular grouped plot provides a more detailed summary of the rules than the grouped matrix, but obviously at the cost of the number of rules that can be visualized before it gets too cluttered.
Looking at customer characteristics and purchasing behaviours that have the most number of arrows stemming from it, we see that customers with short tenures (1-5.5 months), dependents and purchased fiber optic internet service seem to be the most likely to leave the company. This is consistent with what we saw in the grouped matrix plot previously.
These insights can help to inform new marketing campaigns and retention strategies, such as offering discounted year-long contracts to new customers, create family plans, and make the internet service package more competitive.
```{r, eval=F, echo=T}
## Import libraries
library(arules)
library(arulesViz)
## Convert dataframe to transaction format
tData <- read.transactions(df,
format = "basket",
sep = ",",
header=TRUE)
## Create rules
rules <- apriori(tData,
parameter = list(supp = 0.001,
conf=0.9,
minlen=3,
maxlen=5))
## Plot
plot(rules, method="graph",
control=list(main=NULL,
layout=igraph::in_circle()))
```
### **Interactive network graph** User-friendly filtering and exploration of association rules {data-commentary-width=450}
```{r message=FALSE, warning=FALSE, echo=F, include=FALSE, fig.width=10, fig.height=10}
x <- plot(churn_rules, method = "graph",
control=list(main=NULL), engine = "htmlwidget")
```
```{r}
x
```
***
Description
General workflow
With a handy dropdown menu, this plot allows interactive exploration of rules that contain specific items in this antecedent.
In addition, the zoom-in/out, drag-and-drop and highlight-upon-hovering capabilities make it easy to zero in on something of interest.
Feel free to play!
```{r, eval=F, echo=T}
## Import libraries
library(arules)
library(arulesViz)
## Convert dataframe to transaction format
tData <- read.transactions(df,
format = "basket",
sep = ",",
header=TRUE)
## Create rules
rules <- apriori(tData,
parameter = list(supp = 0.001,
conf=0.9,
minlen=3,
maxlen=5))
## Plot
plot(churn_rules, method = "graph",
control=list(main=NULL),
engine = "htmlwidget")
```
### **Shiny app** A light-weight all-in-one exploration and recommender tool {data-commentary-width=250}
```{r out.width='100%'}
knitr::include_app("https://nancy-chelaru-centea.shinyapps.io/arules_app/", height='700px')
```
***
A shiny app
Session info
=========================================
Column
--------------------------------------------
```{r}
sessionInfo()
```