SYS 6018: Tutorial on Association Analysis

Hannah Frederick (hbf3k), Sean Grace (smg2mx), Annie Williams (maw3as), André Zazzera (alz9cb)

Date: 12/10/2020

Introduction and Applications

Association analysis is an analytical tool used in the examination of datasets, useful for understanding the relationships between elements. Association rule analysis has become an increasingly popular analytical method because of how valuable it can be when applied to commercial databases. It can be used any time we have sets of items and a variety of items that may or may not be included in a given set. Association analysis is fairly versatile and can be applied in a number of domains. For example, it can be used in medical diagnosis, where association rules can help determine the probability of a patient having an illness given a set of symptoms. It can also be used to analyze census data, where association rules can inform how to efficiently set up a variety of public services. In truth, association analysis can be useful whenever one is interested in uncovering interesting or significant relationships within large datasets. However, the most practical application for this type of analysis is called “market basket” analysis, in which vendors try to determine which groups of items are sold together and compute metrics related to these groupings of items. The value proposition for this type of analysis is massive, as it can help optimize vital business aspects such as recommendation algorithms, special deals, and inventory management.

In the context of this “market basket” analysis, the goal is generally to seek joint values of items \(I = \{i_1, i_2, \dots, i_d\}\) which consistently arise in a database’s transactions \(T = \{t_1, t_2, \dots, t_N\}\). While it might be simpler and more aesthetically pleasing to think of each transaction as a set, such as

\(t_1 = \{i_1, i_3, i_4\}, \quad t_2 = \{i_1,i_2,i_5\},\)

it may be more practical to represent the data in binary form:

         \(i_1\)  \(i_2\)  \(i_3\)  \(i_4\)  \(i_5\)
\(t_1\)    1        0        1        1        0
\(t_2\)    1        1        0        0        1

Here, each cell contains a 1 if the transaction contains the given item and a 0 otherwise.
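
To make this binary representation concrete, the following small sketch builds the incidence matrix above in base R; the item labels i1–i5 and the two transactions are made up for illustration and are not part of our grocery data.

#toy illustration of the binary (incidence) representation
items <- paste0("i", 1:5)
t1 <- c("i1", "i3", "i4")
t2 <- c("i1", "i2", "i5")
#each row is a transaction; a cell is 1 if that transaction contains the item
binary_rep <- rbind(t1 = as.integer(items %in% t1),
                    t2 = as.integer(items %in% t2))
colnames(binary_rep) <- items
binary_rep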

Put simply, the goal of this analysis is to find conditions under which the probability of finding a given item \(x_i\) in a set, \(P(x_i)\), is larger than it is outside of those conditions.

For the purposes of this tutorial, we will walk through a market basket analysis on a sample dataset of grocery purchases. We will start by introducing the dataset. We will then describe the goals and steps of the Apriori algorithm, a vital algorithm for association rule analysis. Finally, we will examine some of the most important interest measures for association rules, including support, confidence, and lift. This tutorial is aimed at students with some background in probability theory and statistical computing who are interested in learning more about a common method in the practice of data mining.

Data Description

The data we use for this tutorial can be found on Kaggle here. There are 38,765 rows of purchase orders from grocery stores, but we need to perform some cleaning before we can make effective use of these data. We use the unite function to combine the Member_number and Date columns into a new column, called transactions, which we will use to identify the transactions of this dataset. We then rename the other column to just be called items, which we will use to identify the items in these transactions. This allows us to use the language of association analysis (transactions and items) more clearly as we proceed. Lastly, we keep only unique rows of the data, so that no transaction will have duplicate items. In effect, this allows us to concern ourselves only with the presence or absence of an item in a transaction, rather than how many times an item appears. We do this because working with vectors over a field of two elements is much easier than working with vectors over the integers.

#https://www.kaggle.com/heeraldedhia/groceries-dataset
#loading the packages used throughout this tutorial
library(tidyverse)  #read_csv, unite, rename, distinct, count, ggplot2, fct_reorder
library(arules)     #transactions class, apriori(), interestMeasure()
library(arulesViz)  #plot() methods for rules
#reading in the data
groceries <- read_csv("groceries_dataset.csv")
#cleaning the data
groceries <- groceries %>% unite(transactions, Member_number, Date, sep = '_')
groceries <- groceries %>% rename(items = itemDescription)
#transactions: customer identification + _ + date of transaction
#items: item in transaction
#removing duplicate rows so that no transaction contains the same item twice
groceries <- groceries %>% distinct(transactions, items, .keep_all = TRUE)

Next, we will need the number of transactions and the number of items for future reference, as well as for some simple visualization, so we calculate them here.

num_trans = n_distinct(groceries$transactions) #number of transactions: 14,963
num_items = n_distinct(groceries$items) # number of items: 167

To try to visualize our dataset, we plot a simple bar graph. We see that most of our transactions have 2 items in them.

count(groceries, transactions) %>% 
  ggplot(aes(n)) + 
  geom_bar() + 
  xlab("number of items per transaction") + 
  ggtitle('Distribution of items per transaction') + 
  theme(plot.title = element_text(hjust = 0.5))

For a more complete understanding of the dataset, we examine the summary statistics. These tell us that we have 14,963 transactions with 167 items among them, show our most frequent items and the distribution of transaction lengths, and more.

trans_list = split(groceries$items, groceries$transactions) # get transaction list
trans_class = as(trans_list, "transactions") #get transaction class
summary(trans_class)
## transactions as itemMatrix in sparse format with
##  14963 rows (elements/itemsets/transactions) and
##  167 columns (items) and a density of 0.01520957 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2363             1827             1646             1453 
##           yogurt          (Other) 
##             1285            29432 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##   205 10012  2727  1273   338   179   113    96    19     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    2.00    2.54    3.00   10.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##     transactionID
## 1 1000_15-03-2015
## 2 1000_24-06-2014
## 3 1000_24-07-2015

Before we continue much further, it can be useful to understand what our most frequent items are. We compute their counts and support here.

#finding counts and support for most frequent items
freq_items = count(groceries, items, sort=TRUE) %>% mutate(support=n/num_trans)

Now we visually examine the 20 most frequent items. This allows us to build an intuition of what items are going to be the most significant actors in our exploration of these data.

#plotting top 20 most frequent items
freq_items %>% slice(1:20) %>% 
  ggplot(aes(fct_reorder(items, n), n)) + # order bars by n
  geom_col() +         # barplot
  coord_flip() +       # rotate plot 90 deg
  theme(axis.title.y = element_blank()) # remove y axis title

Apriori Algorithm

Before we describe the Apriori algorithm, we should define the support \(S(A)\) of an itemset \(A\) as the proportion of transactions in the dataset that have \(A\), and it approximates the probability that the itemset \(A\) will show up in a future transaction. Support can take on a value from 0 to 1; and as the support of an itemset increases, the probability of that itemset being in a basket also increases. An itemset is deemed a “frequent itemset” if its support meets some support threshold that the user selects. We should also define confidence \(C(A \rightarrow B)\) as an approximation of the probability that itemset \(B\) will show up in a transaction with prior information that the transaction already has itemset \(A\). Equivalently, this can be expressed as \(P(B \vert A)\). Confidence can also take on a value from 0 to 1; and as the confidence of an association rule (or the relationship between two itemsets \(A\) and \(B\)) increases, the probability of \(B\) being in a basket conditioned on \(A\) also increases. An association rule is deemed a “high confidence” rule if its confidence meets some confidence threshold that the user selects.
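
To make these definitions concrete, here is a small (and admittedly slow) sketch that estimates one support and one confidence directly from the cleaned groceries tibble built in the Data Description section. The support_of helper and the object names are ours, defined purely for illustration; the arules functions we use later compute these quantities far more efficiently.

#estimating support and confidence by hand (illustration only)
support_of <- function(itemset) {
  groceries %>%
    group_by(transactions) %>%
    summarise(has_all = all(itemset %in% items), .groups = "drop") %>%
    summarise(support = mean(has_all)) %>%
    pull(support)
}
s_yogurt <- support_of("yogurt")                   #S({yogurt})
s_both   <- support_of(c("yogurt", "whole milk"))  #S({yogurt, whole milk})
s_both / s_yogurt                                  #C({yogurt} -> {whole milk}), estimates P(whole milk | yogurt)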

Description

In association analysis, one might want to find all itemsets in the dataset with support that meets some support threshold \(s\). But if there are \(x\) distinct items in the dataset, then there are \(2^{x} - 1\) possible non-empty itemsets, which can make the search space prohibitively expensive in both computing time and memory; with the 167 items in our groceries data, for instance, there are \(2^{167} - 1 \approx 1.9 \times 10^{50}\) possible non-empty itemsets. The Apriori algorithm seeks to decrease the time to find all such itemsets by restricting the search space. To do this, it uses the rule that if an itemset is frequent, then all of its subsets must also be frequent; equivalently, if an itemset is not frequent, then none of its supersets can be frequent. This makes sense since one rule of probability is that \(P(A, B) \leq P(A)\); in terms of our goal, this means that \(S(\{a, b\}) \leq S(\{a\})\), so if \(S(\{a\}) < s\), then \(S(\{a, b\}) < s\). The Apriori algorithm uses this rule to search over the dataset and find all itemsets with a support of at least \(s\) as follows.

The first pass over the data computes the support of all itemsets of size one, and any item whose support is less than \(s\) is discarded from further consideration. The next pass computes the support of all itemsets of size two built from the items that survived the first pass; any size-two itemset whose support is less than \(s\) is discarded (although its individual items may still appear in other candidate itemsets). For each pass, this method saves all itemsets of that size that meet the support threshold and throws out the itemsets that do not. This process continues until no itemsets of the current size have a support of at least \(s\), and then it returns all itemsets that were saved in each pass. Since each pass restricts the search space for the next, the search time decreases as well, which means that this process will run in reasonable time even with enormous datasets.

The graph above shows a visual representation of the support-based pruning in the Apriori algorithm. This process is akin to alpha-beta pruning, a search method that decreases the number of nodes evaluated by the minimax algorithm in its search tree. In the same way, the Apriori algorithm reduces the number of itemsets it evaluates in its search space. In this example, the algorithm finds that the itemset \(\{a, b\}\) does not meet the support threshold \(s\), or \(S(\{a, b\}) < s\). This means that all supersets of \(\{a, b\}\), or itemsets of greater size that contain \(a\) and \(b\), will also fail to meet the support threshold \(s\), since \(S(\{x, y\}) \leq S(\{x\})\) for any items \(x, y\). The Apriori algorithm then discards all supersets of \(\{a, b\}\) from the search space, such as \(\{a, b, c\}, \{a, b, c, d\},\) and \(\{a, b, c, d, e\}\). This pruning process saves significant computing time and space even for large datasets, as long as the data is sparse enough and the threshold is high enough.
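
The level-wise search described above can also be sketched directly on our trans_list object. This toy version is our own code, written only to illustrate the idea and much slower than arules: pass one keeps only single items that meet the threshold, and pass two builds candidate pairs exclusively from those survivors, which is exactly the pruning at work.

#pass 1: support of all size-one itemsets; drop items below the threshold s
s <- 0.01
n_trans <- length(trans_list)
item_support <- table(unlist(trans_list)) / n_trans
frequent_1 <- names(item_support)[item_support >= s]
#pass 2: candidate pairs are formed only from items that survived pass 1
candidate_pairs <- combn(sort(frequent_1), 2, simplify = FALSE)
pair_support <- sapply(candidate_pairs, function(p) {
  mean(sapply(trans_list, function(t) all(p %in% t)))
})
frequent_2 <- candidate_pairs[pair_support >= s]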

Each itemset \(K\) that meets some support threshold can be cast into a set of association rules that show relationships between two itemsets in the form of \(A \rightarrow B\). The first itemset \(A\) is called the antecedent, and the second itemset \(B\) is called the consequent; and no one item can be shared in both \(A\) and \(B\). The support of the rule \(S(A \rightarrow B)\) is the fraction of observations in the union of the antecedent and the consequent, which is just the support of the itemset \(K\) from which \(A\) and \(B\) were split. Support approximates \(P(A, B)\), or the probability of observing in the population itemsets \(A\) and \(B\) simultaneously in a basket. The confidence of the rule \(C(A \rightarrow B)\) is the support of the rule over the support of the antecedent, or \(C(A \rightarrow B) = \frac{S(A \rightarrow B)}{S(A)}\). Confidence approximates \(P(B \vert A)\), or the probability of observing in the population itemset \(B\) in a basket with prior information that itemset \(A\) is in that basket.
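
As a small illustration, a frequent itemset \(K = \{a, b, c\}\) can be split into candidate rules such as \(\{a, b\} \rightarrow \{c\}\), \(\{a, c\} \rightarrow \{b\}\), \(\{b, c\} \rightarrow \{a\}\), and \(\{a\} \rightarrow \{b, c\}\). Every one of these rules has the same support, \(S(K)\), but their confidences generally differ because each antecedent has a different support.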

In this way, the Apriori algorithm can find all rules that meet some confidence threshold \(c\) containing all itemsets that meet some support threshold \(s\). Thus, the output of this method is a collection of association rules that meet the constraints that \(S(A \rightarrow B) \geq s\) and \(C(A \rightarrow B) \geq c\). In terms of market basket research, this means that it finds all relationships between items in \(A\) and \(B\) in which the probability of observing all items in the union of \(A\) and \(B\) in a basket is greater than \(s\), and the probability of observing itemset \(B\) in a basket with prior information of itemset \(A\) in that basket is greater than \(c\). It should be noted that these relationships represent association and not causation between antecedents and consequents, which means that when \(C(A \rightarrow B)\) is high, we should not assume that if a customer buys itemset \(A\), then this selection influences the customer to also buy itemset \(B\). We should also be careful about assumptions if expected transactions in the future come from a different distribution than the observed transactions in the dataset. We should not extrapolate the conclusions of this association analysis outside the population of the dataset we use.

Implementation

You can run this algorithm in R with the apriori function in the arules package, and you can find the documentation for this function here. The arguments for this function are as follows:

  • data is an object of the transactions class that holds the items and transactions or a data structure that can be coerced to the transactions class (you can find more information on this class here)
  • parameter is an object of the APparameter class or a list that sets the minimum support (support), the minimum confidence (confidence), the minimum and maximum itemset sizes in total (minlen and maxlen), the type of association mined (target), and other variables that you can find here
  • appearance is an object of the APappearance class or a list that sets the restrictions on the associations mined, such as what items are in the antecedent (lhs), what items are in the consequent (rhs), what items can be in the itemset (item), what items cannot be in the itemset (none), and other variables that you can find here
  • control is an object of the APcontrol class or a list that sets the algorithmic parameters of the mining algorithm, such as how to sort items (sort), if it should report progress (verbose), how to filter unused items (filter), and other variables that you can find here

The function returns an object of the rules class (you can find more information on that here) or the itemsets class (you can find more information on that here) depending on what you set for the target parameter. The default values for the support, confidence, minlen, and maxlen parameters are 0.1, 0.8, 1, and 10, respectively. It should also be noted that this function only returns rules with one item in the consequent; and if minlen is set to 1, rules with no items in the antecedent, \(\{\} \rightarrow B\), can also be returned. For such rules, the confidence is the probability of observing itemset \(B\) in a basket regardless of what items are already in the basket, which is just the support of \(B\). You can forgo these sorts of rules by setting minlen to 2.

#function for printing output from apriori() function
#source: Michael Porter
apriori_output <- function(x){
  if(class(x) == "itemsets"){
    out = data.frame(items=arules::labels(x), x@quality, stringsAsFactors = FALSE)
  }
  else if(class(x) == "rules"){
    out = data.frame(
      lhs = arules::labels(lhs(x)),
      rhs = arules::labels(rhs(x)),
      x@quality, 
      stringsAsFactors = FALSE)
  }
  else stop("only works with class of itemsets or rules")
  if(require(tibble)) as_tibble(out) else out
}

For our groceries dataset, let’s say we want to find all frequent itemsets with a support of at least 0.01, with at least 1 item, and with at most 2 items. We can use the apriori function and first pass it our dataset, cast to a transactions class. We can then pass it a list of parameters that set the support to 0.01, the minlen to 1, the maxlen to 2, and the target to “frequent” since we are interested in finding frequent itemsets. We can then use the apriori_output function defined by Michael Porter to print out the results of this method, including the items, their support, and their count. We can also sort the items by their support in descending order, and then we are left with the following table. From this table, we can see that whole milk has the greatest support (0.158) out of all itemsets in this dataset that have at least one and at most two items. This means that whole milk occurs in 2,363 of the 14,963 transactions, so \(\hat{P}(\text{{whole milk}})\), or the approximate probability in the population of whole milk being in a basket, is 2363/14963 = 0.158.

#finding all frequent itemsets >= support threshold
s = 0.01
min = 1
max = 2
freq_items1 = apriori(trans_class, #run apriori function to find frequent items
               parameter = list(support = s, minlen = min, maxlen = max, target="frequent"))
apriori_output(freq_items1) %>% 
  arrange(-support) #order by support (largest to smallest)
## # A tibble: 69 x 4
##    items              support transIdenticalToItemsets count
##    <chr>                <dbl>                    <dbl> <int>
##  1 {whole milk}        0.158                   0.0144   2363
##  2 {other vegetables}  0.122                   0.0100   1827
##  3 {rolls/buns}        0.110                   0.00916  1646
##  4 {soda}              0.0971                  0.00849  1453
##  5 {yogurt}            0.0859                  0.00668  1285
##  6 {root vegetables}   0.0696                  0.00648  1041
##  7 {tropical fruit}    0.0678                  0.00581  1014
##  8 {bottled water}     0.0607                  0.00434   908
##  9 {sausage}           0.0603                  0.00521   903
## 10 {citrus fruit}      0.0531                  0.00401   795
## # … with 59 more rows

But we may also want to find all association rules with a support of at least 0.01, with a confidence of at least 0.10, and with at least 2 items in each rule. The apriori function will take all of the size-two itemsets from our last run that had a support of at least 0.01 and split them into antecedents and consequents to find all rules with a confidence of at least 0.10. We can again use the apriori function and first pass it our dataset, cast to a transactions class. We can then pass it a list of parameters that set the support to 0.01, the confidence to 0.10, the minlen to 2, and the target to “rules” since we are interested in finding association rules. We can then print out the results of this function with our apriori_output function, including the itemset of the antecedent (lhs), the itemset of the consequent (rhs), and the support, the confidence, and the count of each rule. We can also sort the rules by their confidence in descending order, and then we are left with the following table.

#finding all association rules >= support threshold and >= confidence threshold
s = 0.01
c = 0.10
min = 2
freq_rules1 <- apriori(trans_class, 
               parameter = list(support = s, confidence = c, minlen = min, 
               target="rules"))
apriori_output(freq_rules1) %>% arrange(-confidence) #order by confidence (largest to smallest)
## # A tibble: 4 x 7
##   lhs                rhs          support confidence coverage  lift count
##   <chr>              <chr>          <dbl>      <dbl>    <dbl> <dbl> <int>
## 1 {yogurt}           {whole milk}  0.0112      0.130   0.0859 0.823   167
## 2 {rolls/buns}       {whole milk}  0.0140      0.127   0.110  0.804   209
## 3 {other vegetables} {whole milk}  0.0148      0.122   0.122  0.769   222
## 4 {soda}             {whole milk}  0.0116      0.120   0.0971 0.758   174

From this table, we can see that the rule {yogurt} \(\rightarrow\) {whole milk} has the greatest confidence (0.130) out of all rules derived from this dataset that have at least two items. This means that \(\hat{P}(\text{{whole milk}}\vert \text{{yogurt}})\), or the approximate probability in the population of whole milk being in a basket given that yogurt is in that basket is 0.130. Market owners could use this information to increase their sales of whole milk by placing whole milk next to yogurt since it seems that customers who buy yogurt are also likely to buy whole milk. But we could have also calculated the confidence of this rule by first finding all itemsets with a support of at least 0.01 that contain whole milk or yogurt or both. We know that \(C(\text{{yogurt}} \rightarrow \text{{whole milk}}) = \frac{S(\text{{yogurt}} \rightarrow \text{{whole milk}})}{S(\text{yogurt})} = \frac{S(\text{{yogurt, whole milk}})}{S(\text{yogurt})}\), so the confidence of {yogurt} \(\rightarrow\) {whole milk} is 0.0112/0.0859 = 0.130.

itemset = c("whole milk", "yogurt")
s = 0.01
milk_and_yogurt = apriori(trans_class, 
        parameter = list(support = s, target="frequent"), 
        appearance = list(items = itemset))
apriori_output(milk_and_yogurt)
## # A tibble: 3 x 4
##   items               support transIdenticalToItemsets count
##   <chr>                 <dbl>                    <dbl> <int>
## 1 {yogurt}             0.0859                   0.0747  1285
## 2 {whole milk}         0.158                    0.147   2363
## 3 {whole milk,yogurt}  0.0112                   0.0112   167

While this can be helpful much of the time, not all rules with high support and high confidence are useful. If the support of {whole milk} is very high and the support of {yogurt} is also high, then the rule {yogurt} \(\rightarrow\) {whole milk} will have high support and high confidence regardless of whether there is a real association between these itemsets. It is also possible for itemsets with low support and low confidence to show useful relationships. And when a transaction dataset has an uneven distribution of support across individual items, these “rare items” are sometimes pruned out by the Apriori algorithm. For these reasons, it is important to consider other interest measures that can quantify the associations among itemsets.

Interest Measures

The concept of interestingness is subjective, and depends on the goals of a business or a research question. The interestingness of a rule oftentimes determines its value, depending on the given context. There are multiple different ways to measure the interestingness of a rule, many of which are included in the arules package.

The author of this package, Michael Hahsler, separates support and confidence from the rest of the interest measures based on their importance. We chose only a few alternative interest measures for the sake of this tutorial, but more can be found on Hahsler’s page.

Support and Confidence Review

Support

  • Support is the proportion of transactions that contain itemset \(A\). It can also be thought of as the probability of \(A\) being included in a future transaction. Low support could indicate that a rule happens by chance.

    • \(S(A)\) = (# of transactions containing \(A\)) / (# total transactions) \(\hat{=}P(A)\)

Confidence

  • Confidence is an estimate of the conditional probability of \(B\) being in a basket given the basket already contains \(A\).

    • \(C(A \rightarrow B) \hat{=} P(B \vert A)\)

Alternative Interest Measures

We will use the interestMeasure function from the arules package to calculate the following alternative interest measures.

#adding other interest measures
output = apriori_output(freq_rules1) %>% 
  mutate(jaccard = interestMeasure(freq_rules1, measure="jaccard", trans_class), 
         cosine = interestMeasure(freq_rules1, measure="cosine", trans_class),
         rpf = interestMeasure(freq_rules1, measure="rulePowerFactor", trans_class)
         )

Lift

  • Lift is based on statistical independence, and is not directional. Hahsler says “lift measures how many times more often \(A\) and \(B\) occur together than expected if they were statistically independent.” In other words, lift defines an interesting rule as one that happens more or less than we would normally expect.

    • \(\text{lift}(A \rightarrow B) = \text{lift}(B \rightarrow A) = \frac{C(A \rightarrow B)}{S(B)} = \frac{C(B \rightarrow A)}{S(A)} = \frac{P(A \cap B)}{P(A)P(B)}\)
    • Lift = 1 indicates independence between itemsets \(A\) and \(B\).
    • Lift > 1 indicates positive association between itemsets \(A\) and \(B\).
      • This means that when itemset \(A\) is present in a transaction, itemset \(B\) is more likely to be included, or \(P(B \vert A) > P(B)\).
    • Lift < 1 indicates negative association between itemsets \(A\) and \(B\).
      • This means that when itemset \(A\) is present in a transaction, itemset \(B\) is less likely to be included, or \(P(B \vert A) < P(B)\).

We see here that the lift of all four of our identified rules is less than one. This indicates an inhibitive association between \(A\) and \(B\) for each rule.
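
As a quick check on the first row of the table below, the lift of {yogurt} \(\rightarrow\) {whole milk} can be recovered from quantities we already computed: its confidence (0.130) and the support of {whole milk} (0.158).

\(\text{lift}(\text{{yogurt}} \rightarrow \text{{whole milk}}) = \frac{C(\text{{yogurt}} \rightarrow \text{{whole milk}})}{S(\text{{whole milk}})} \approx \frac{0.130}{0.158} \approx 0.823\)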

output %>% arrange(-lift)
## # A tibble: 4 x 10
##   lhs      rhs    support confidence coverage  lift count jaccard cosine     rpf
##   <chr>    <chr>    <dbl>      <dbl>    <dbl> <dbl> <int>   <dbl>  <dbl>   <dbl>
## 1 {yogurt} {whol…  0.0112      0.130   0.0859 0.823   167  0.0480 0.0958 0.00145
## 2 {rolls/… {whol…  0.0140      0.127   0.110  0.804   209  0.055  0.106  0.00177
## 3 {other … {whol…  0.0148      0.122   0.122  0.769   222  0.0559 0.107  0.00180
## 4 {soda}   {whol…  0.0116      0.120   0.0971 0.758   174  0.0478 0.0939 0.00139

The graph below shows the differences in lift for the four rules in the table above. The color and size of each node (or rule) represent the value of the lift for that rule. Each word with an outgoing edge to a node is the antecedent of that rule, and each word with an incoming edge from a node is the consequent of that rule. At a glance, we can see that the rules rank from greatest to least lift as {yogurt} \(\rightarrow\) {whole milk}, {rolls/buns} \(\rightarrow\) {whole milk}, {other vegetables} \(\rightarrow\) {whole milk}, and {soda} \(\rightarrow\) {whole milk}. Since whole milk is in the consequent of each rule, market owners could use this graph to rank which items should be placed next to whole milk in order to increase sales of whole milk. We can also set the measure parameter to “support” or “confidence” to show the differences in support or confidence for these four rules.

#plot - some nodes are items, others are rules
plot(freq_rules1, method="graph", measure="lift")
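
As noted above, the same plotting call accepts other interest measures; for example, the call below recolors and resizes the nodes by confidence instead of lift.

#same network view, with nodes sized and colored by confidence instead of lift
plot(freq_rules1, method="graph", measure="confidence")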

Cosine

  • Cosine is a measure of correlation between the items in \(A\) and \(B\), as defined in Tan et al. (2004):

    • \(\cos(A \rightarrow B) = \frac{P(A \cap B)}{\sqrt{P(A)P(B)}} = \sqrt{P(A|B)P(B|A)}\)
    • Value in [0, 1]
    • A cosine value of 0.5 means that there is no correlation between itemsets \(A\) and \(B\).
    • A cosine value close to 0 means that there is a negative association between itemsets \(A\) and \(B\).
    • A cosine value close to 1 means that there is a positive association between itemsets \(A\) and \(B\).
    • Null-invariant: does not change with the number of null transactions

In this example, all of the cosine measures are relatively low, which indicates that there is also a negative association.
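
As a quick check using rounded values from the earlier tables, for {yogurt} \(\rightarrow\) {whole milk} we have \(\cos = \frac{0.0112}{\sqrt{0.0859 \times 0.158}} \approx 0.096\), which agrees (up to rounding) with the value in the table below.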

output %>% arrange(-cosine)
## # A tibble: 4 x 10
##   lhs      rhs    support confidence coverage  lift count jaccard cosine     rpf
##   <chr>    <chr>    <dbl>      <dbl>    <dbl> <dbl> <int>   <dbl>  <dbl>   <dbl>
## 1 {other … {whol…  0.0148      0.122   0.122  0.769   222  0.0559 0.107  0.00180
## 2 {rolls/… {whol…  0.0140      0.127   0.110  0.804   209  0.055  0.106  0.00177
## 3 {yogurt} {whol…  0.0112      0.130   0.0859 0.823   167  0.0480 0.0958 0.00145
## 4 {soda}   {whol…  0.0116      0.120   0.0971 0.758   174  0.0478 0.0939 0.00139

Jaccard’s coefficient

  • Jaccard’s coefficient is a measure for dependence using the Jaccard similarity between the two sets of transactions that contain the items in \(A\) and \(B\), respectively (Hahsler), and is defined as:

    • \(J(A \rightarrow B) = \frac{P(A \cap B)}{P(A) + P(B) - P(A \cap B)}\)
    • Value in [0, 1]
    • A Jaccard’s coefficient close to 0 means that there is a negative association between itemsets \(A\) and \(B\).
    • A Jaccard’s coefficient close to 1 means that there is a positive association between itemsets \(A\) and \(B\).
    • Null-invariant: does not change with the number of null transactions

Similarly, all of the Jaccard’s coefficients are relatively low, which confirms what the cosine measure told us.
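
Again as a quick check with rounded values, for {yogurt} \(\rightarrow\) {whole milk} we have \(J = \frac{0.0112}{0.0859 + 0.158 - 0.0112} \approx 0.048\), matching the table below.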

output %>% arrange(-jaccard)
## # A tibble: 4 x 10
##   lhs      rhs    support confidence coverage  lift count jaccard cosine     rpf
##   <chr>    <chr>    <dbl>      <dbl>    <dbl> <dbl> <int>   <dbl>  <dbl>   <dbl>
## 1 {other … {whol…  0.0148      0.122   0.122  0.769   222  0.0559 0.107  0.00180
## 2 {rolls/… {whol…  0.0140      0.127   0.110  0.804   209  0.055  0.106  0.00177
## 3 {yogurt} {whol…  0.0112      0.130   0.0859 0.823   167  0.0480 0.0958 0.00145
## 4 {soda}   {whol…  0.0116      0.120   0.0971 0.758   174  0.0478 0.0939 0.00139

Rule Power Factor

  • Rule Power Factor weights the confidence of a rule by its support, which can correct for changes in the ratio between the support of a rule and the support of its antecedent (Hahsler), and is defined as:

    • \(rpf(A \rightarrow B) = S(A \rightarrow B) \times C(A \rightarrow B)\)
    • Value in [0, 1]
    • A rule power factor value close to 0 means that there is a negative association between itemsets \(A\) and \(B\).
    • A rule power factor value close to 1 means that there is a positive association between itemsets \(A\) and \(B\).

In this example, the rule power factor is also relatively low for all of the rules, which also means that there is a negative association.
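
One more quick check with rounded values: for {yogurt} \(\rightarrow\) {whole milk}, \(rpf = 0.0112 \times 0.130 \approx 0.00146\), which agrees (up to rounding) with the value in the table below.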

output %>% arrange(-rpf)
## # A tibble: 4 x 10
##   lhs      rhs    support confidence coverage  lift count jaccard cosine     rpf
##   <chr>    <chr>    <dbl>      <dbl>    <dbl> <dbl> <int>   <dbl>  <dbl>   <dbl>
## 1 {other … {whol…  0.0148      0.122   0.122  0.769   222  0.0559 0.107  0.00180
## 2 {rolls/… {whol…  0.0140      0.127   0.110  0.804   209  0.055  0.106  0.00177
## 3 {yogurt} {whol…  0.0112      0.130   0.0859 0.823   167  0.0480 0.0958 0.00145
## 4 {soda}   {whol…  0.0116      0.120   0.0971 0.758   174  0.0478 0.0939 0.00139

References

  1. Hahsler, M. (2020, August 21). A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules. Retrieved December 10, 2020, from https://michael.hahsler.net/research/association_rules/measures.html

  2. Hahsler, M. (n.d.). Arules v1.6-6. Retrieved December 10, 2020, from https://www.rdocumentation.org/packages/arules/versions/1.6-6

  3. Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer.

  4. Porter, Michael. (2020). 09-Association Analysis. Personal Collection of M. Porter, University of Virginia, Charlottesville VA. https://mdporter.github.io/SYS6018/lectures/09-association.pdf