This module introduces market basket analysis.
The name of this technique, market basket analysis, probably derives from its frequent application to analyzing the buying behavior of consumers in supermarkets. At check-out, the items in the cart or basket are recorded. Especially when linked to customer or membership cards, these data provide valuable insight into buying patterns.
The same holds true for e-commerce organizations like Amazon. Compared to traditional bookstores, Amazon has much better insight into buying patterns. From purchasing patterns, Amazon can detect which titles are frequently ordered by the same buyers. This knowledge enables Amazon to cross-sell effectively, by recommending titles to buyers under headings like "customers who bought this item also bought ..."
However, as we will show, the technique can be applied in all cases where we are looking for patterns or relationships between characteristics of objects or persons. We will see several examples in this module.
The popularity of the technique has everything to do with the vast amount of data that large organizations (supermarkets; online sellers, such as Amazon) have about people (customers; members, etc.) and their transactions.
Traditionally, marketing specialists rely on intuitive knowledge of buying behavior, sometimes supported by information from consumer surveys. Driving questions are, for example, what are the expected sales numbers if the price for a product is reduced by 5%? Or, for supermarkets, what is the optimal store layout (order of products and shelves; place on the shelf)?
Of course, it is possible to use experiments, but frequent changes in store layout, for example, will easily lead to customer confusion and annoyance. With the emergence of barcodes and scanning systems that are linked to customer cards, companies can better map who buys what, when and in what quantities.
Online stores such as Amazon have an enormous competitive advantage over traditional stores. To traditional stores, most customers are just anonymous buyers. But online stores know the profile of the customers and their buying habits. Amazon knows the genres of books their customers enjoy, and can recommend books by other authors.
Like cluster analysis via K-means clustering, market basket analysis is an example of unsupervised learning. That is, it is not necessary to first train the algorithm, and then see if the algorithm provides good predictions in a test set of data not used in the training.
What we are looking for is a set of rules that relate characteristics to one another. We can then use these rules for all types of decisions. In the setting of the supermarkets, the characteristics are the groceries (items) in the shopping basket of the customer.
The relationship between items is the joint occurrence of items in the shopping basket.
In the jargon of market basket analysis, we speak of item sets, and of association rules.
An item set is simply a collection of one or more items. We display these collections with curly brackets {}.
An example of an item set in a customer's shopping basket in the supermarket is the combination of bread and peanut butter, shown as:
{bread, peanut butter}
But also
{bread}
and
{bread, peanut butter, toilet cleaner}
are examples of item sets.
The number of possible item sets quickly takes on enormous proportions.
Suppose a very small store sells 10 types of products (or items). A customer may or may not buy any item, which results in:
\(2^{10} = 1024\)
possible item sets! For a slightly larger supermarket with 100 different items, the number of options is
\(2^{100} \approx 1.27*10^{30}\)
(that is, a 127 followed by 28 zeros)
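We can check these numbers directly in R; note that R switches to scientific notation for very large numbers:

2^10     # 1024 possible item sets for 10 items
2^100    # about 1.27e+30 for 100 items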
In those situations it is not feasible to evaluate all possible combinations. We have to look for a smart and efficient algorithm to generate rules that connect item sets.
An example of such a rule is:
{bread} → {peanut butter, sprinkles}
In this rule, a link is made between an item or item set to the left of the arrow (commonly referred to as LHS, or left hand side) and an item or item set to the right of the arrow (RHS, or right hand side).
The left side sets the condition, and the right side is the result.
In words, this rule says that the purchase of bread results in the purchase of peanut butter and sprinkles.
Market basket analysis uses a smart algorithm to detect “interesting” association rules between items.
What is interesting depends on the type of application.
In a large supermarket with a thousand products and many thousands of customers and millions of transactions per year, combinations that make up a small part of the whole can carry important information. Different criteria will apply in medical applications or smaller data sets.
The name of the technique, market basket analysis, seems to limit its application. But the principles of analyzing association rules, with the core concepts of support, confidence and lift, are widely applicable, as evident from the Titanic data video above and in the example below.
As an example of market basket analysis we will use a small data set with characteristics of 10 people who have shown criminal behavior.
Psychologists have interviewed these individuals and coded the information from the conversations using keywords (such as "drugs" if there was a drug addiction; "divorced", and so on).
The question now is whether these characteristics (items) are related to each other. For example, are drug users relatively often divorced; or vice versa, do criminals who are divorced more often use drugs?
Note that this technique is geared toward finding associations. It is unsupervised, as we do not have a target variable to explain or predict. Nor do we have a theory that uses logical reasoning to link the LHS to the RHS.
That said, there often is a logic to the associations found. Consumers who buy bread may also buy peanut butter and chocolate sprinkles to put on the bread. But the point is that we do not start analyzing the data to test a priori theories. We will keep associations even if we do not fully understand their logic, as in the beer-and-diapers example in the video!
We can record the information from the interviews in a database.
Such a data file (in Excel) can look like this:
In the first interview, for example, it turns out that the interviewee has divorced parents and has been guilty of shoplifting.
Shoplifting does not occur in any of the other interviews.
One conclusion is that shoplifting is always associated with divorced parents (although the evidence for this conclusion, with only one case of shoplifting, is pretty thin).
Conversely, there are several interviewees with divorced parents who have not been guilty of shoplifting.
Note that due to the lack of structure in the data set, it is not so easy to identify these types of relationships even in this small data set!
A number of things stand out in the database.
Columns in a data file normally have a fixed meaning: each column contains a variable with an unambiguous meaning or label (e.g., age or gender).
The columns in this file are the "loose" comments noted down by the psychologist during the interview.
The drug comment appears as the first comment for some persons (e.g., person 2) and as the third comment for others (person 3).
It is not relevant to our example, but in principle the interviewer can code comments twice or more in one and the same interview.
This way of storing data has great advantages.
Think of the thousands of products offered by supermarkets, out of which only a very small proportion is part of each transaction.
If every column were to represent a product (item), then every record in the data file would mainly consist of zeros or empty cells!
An online store would have to create a line with as many columns as there are items in its assortment, even though each transaction is bound to include only one or a few items.
In its quest to be all things to all people, Amazon has built an unbelievable catalog of more than 12 million products, books, media, wine, and services. If you expand this to Amazon Marketplace sellers, as well, the number is closer to more than 350 million products.
Source: www.bigcommerce.com
In our small example, too, this method of data storage has advantages: the interviewer can note down the comments during the interview, without having to worry about the sequence and the number of possible comments.
In the same vein, the cashier in the supermarket also scans the products in any "random" order in which the customers put them on the belt!
Self test: try to see for yourself if you can detect some pattern in the data. Although there are only 10 people in the file and the number of comments has a maximum of 5, it is not an easy task! Algorithms help!
When we look for interesting association rules in a huge set of possible rules, the challenge is to develop an efficient algorithm for reducing the number of possibilities.
Such an algorithm has two steps.
In our example, we could limit our quest to those comments (items) and combinations of comments that are mentioned in at least 2 out of 10 cases.
The comment "adhd" (which stands for Attention Deficit Hyperactivity Disorder) would then be dropped as it occurs only once. But here's the trick: if "adhd" only occurs once, then all potential combinations of other comments plus "adhd" cannot occur twice or more! In other words: with the exclusion of one item a lot of item sets will go with it.
Examples:
{adhd; drugs}; {adhd; divorced}; {adhd; drugs; divorced}; ...
We refer to the relative frequency of items and item sets as support.
Support = frequency of item set / total number of transactions
The support for the comment "adhd" is 1 in 10 (0.10, or 10%).
Keep in mind that an item set can consist of 1 or more items. A single comment itself is therefore also an item set, even if the set only contains one item!
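The pruning trick described above can be stated in general form: an item set can never have a higher support than any of its subsets, because requiring an extra item can only remove transactions. In symbols,
\[
\text{Support}(X \cup Y) \le \min\bigl(\text{Support}(X),\ \text{Support}(Y)\bigr)
\]
so every item set containing "adhd" has a support of at most 0.10 and can be discarded right away once we require a support of at least 0.20 (2 out of 10 cases).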
The formula for confidence is:
Confidence (X→Y) = (Support (X,Y)) / (Support (X))
In words, this formula expresses the probability that item set X (as a condition) leads to item set Y (the consequence).
In the example of the supermarket, the rule that buying bread leads to buying peanut butter has a probability equal to the number of times peanut butter and bread are bought together, divided by the number of times bread is bought.
In a numerical example: imagine a supermarket analyzing 80 shopping baskets.
It is helpful to think back of what we have learned about set theory and Venn diagrams, in the refresher course.
A package for generating high-quality Venn-diagrams is ggVennDiagram.
Below, we will transform the above figure into a data set, and present it as a Venn diagram.
# 80 shopping baskets: 60 contain bread (records 1-60), 50 contain peanut
# butter (records 21-70), so 40 contain both and 10 contain neither
bread <- c(rep(1,60), rep(0,20))
peanutbutter <- c(rep(0,20), rep(1,50), rep(0,10))
pb <- as.data.frame(cbind(bread, peanutbutter))
pb$rec <- as.numeric(rownames(pb))        # record (basket) number 1..80
A <- pb[pb$bread == 1, "rec"]             # basket numbers containing bread
B <- pb[pb$peanutbutter == 1, "rec"]      # basket numbers containing peanut butter
# ggVennDiagram expects character vectors of element identifiers
karakterA <- paste("k", A, sep = "")
karakterB <- paste("k", B, sep = "")
str(karakterA); str(karakterB)
## chr [1:60] "k1" "k2" "k3" "k4" "k5" "k6" "k7" "k8" "k9" "k10" "k11" "k12" ...
## chr [1:50] "k21" "k22" "k23" "k24" "k25" "k26" "k27" "k28" "k29" "k30" ...
library(ggVennDiagram)
kx <- list(Brood = karakterA, PeanutButter = karakterB)
ggVennDiagram(kx, label="both",color="black")
The support for bread is 60/80 = 0.75 (or 75%).
The support for peanut butter is 50/80 = 0.625 (or 62.5%).
We can also calculate the support for the item set {bread, peanut butter}: 40/80 = 0.50.
We can now calculate the confidence for the association rules.
{bread} → {peanut butter}
and
{peanut butter} → {bread}
For the first association {bread} → {peanut butter} the confidence can be calculated as:
Confidence (bread → peanut butter) = 0.50 / 0.75 = 0.67
In 2 out of 3 cases (67%), buying bread leads to buying peanut butter.
This seems to be informative. Or is it? What do you think? What exactly is the "value" of this information?
The 67% confidence for this rule is only informative if it deviates from the support for peanut butter! Without prior information on buying bread, we would estimate the probability of buying peanut butter at 62.5%.
But if we know that bread is part of the shopping basket, then the probability of peanut butter being in the basket is only slightly higher (67%).
For assessing the value of information, we need - in addition to support and confidence - a third measure, which we call lift. We will come back to that later on.
Note that confidence for {bread} → {peanut butter} is not the same as confidence for {peanut butter} → {bread}!
Confidence (peanut butter → bread) = 0.50 / 0.625 = 0.80
We can interpret the difference.
If peanut butter is bought, chances are very high (80%) that bread will also be bought.
It is more common to buy bread but no peanut butter than to buy peanut butter but no bread. How come? Well, not everyone likes peanut butter on their bread; conversely, peanut butter has few uses other than as a spread on bread.
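These numbers are easy to reproduce from the pb data frame created above (a quick sketch; the object names are just illustrative, and the bread and peanutbutter columns are 0/1 indicators per basket):

n <- nrow(pb)                                                   # 80 baskets
support_bread <- sum(pb$bread == 1) / n                         # 60/80 = 0.75
support_pb    <- sum(pb$peanutbutter == 1) / n                  # 50/80 = 0.625
support_both  <- sum(pb$bread == 1 & pb$peanutbutter == 1) / n  # 40/80 = 0.50
support_both / support_bread    # confidence {bread} -> {peanut butter} = 0.67
support_both / support_pb       # confidence {peanut butter} -> {bread} = 0.80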
Let's apply this to our example.
As an example we can look at the relationship between {parents divorced} and {divorced}.
It may of course happen in incidental cases that your parents divorce after you have been divorced yourself, but much more important in the context of explanations for criminal behavior are the consequences of divorced parents on the lives of their children. A divorce can be related to the loss of a job; job loss can lead to drug use and lack of money; and so on.
So the interesting relationship for us is:
{parents divorced} → {divorced}
In the formula for confidence we have to calculate the support for the item {parents divorced} and for the item set {parents divorced, divorced}. The support for {divorced} itself is not needed for the confidence of this rule.
Because the data file is small, we can easily calculate it, for example in Excel.
In the penultimate column, we indicated with a 1 whether the comment “parents divorced” applies to each of the 10 offenders in our study. This appears to be the case 6 times. The support for {parents divorced} is 6/10 = 0.60.
In the last column we look at the item pair {parents divorced, divorced}. We have entered an “x” if the parents are not divorced: if the parents are not divorced, then the item pair {parents divorced, divorced} certainly cannot occur!
We have indicated with a \(0\) or \(1\) whether the person is divorced or not, given that the parents are divorced. This occurs in 3 out of 6 cases. The confidence of the relationship is therefore 3/6 = 0.50.
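The same calculation is easy to do in R. A minimal sketch, assuming pd and div are 0/1 indicator vectors of length 10 taken from the two columns in the Excel sheet (1 = the comment applies; the vector names are hypothetical):

support_pd   <- sum(pd == 1) / length(pd)              # {parents divorced}: 6/10 = 0.60
support_both <- sum(pd == 1 & div == 1) / length(pd)   # {parents divorced, divorced}: 3/10 = 0.30
support_both / support_pd                              # confidence = 0.50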
Self test
In our sample file, you are looking for the relationship between “job loss” and “drugs”.
Write out the association rule for yourself. Then calculate:
In a large file with many people and many items, it is impossible to make these calculations manually. We will therefore use R.
A powerful package for market basket analysis is the arules package.
If you are going to use arules for the first time, you first have to install it on your computer with install.packages(). If you want to use the functions of arules in your R session, you load the package with the library() command.
# install.packages("arules")
library(arules)
## Warning: package 'arules' was built under R version 4.0.5
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
In the console, a warning may appear that the package was built under a later version of R than the version installed on your computer. Something like:
Warning message: package "arules" was built under R version 3.2.5
Packages are regularly updated (bug fixes, new functions). For this reason alone, it is recommended to regularly install the latest version of R, and the latest versions of the packages you use most often.
There is, what we call, “backward compatibility”. In recent versions of R, packages built under older versions will work, but the reverse is not always the case. R gives a warning because the functions in the package may not work correctly under the older R version.
We normally read our data as a rectangular file, in which the rows and columns have a fixed meaning: each row is an object (say, a person or a transaction), and each column is a variable.
We can do the same with our data. Our data, stored in Excel, was saved as a csv-file and then read into R. A csv-file is just a text file, with (here) 10 lines. Each of the 10 lines represents the data for a specific person. Every line consists of up to 5 elements, corresponding to the columns in the Excel file.
R can read (import) data from Excel directly, but it is more common to store and share data in csv-files, probably because they are plain and simple to use by most software programs.
Below we use Notepad++, to display the csv-file. Notepad++ is a popular free source code editor.
If we read the comma separated values csv-file crime.csv, with read.csv(), we get the following result. The header=F option tells R that the first line does not contain header information, that is, names of variables.
crimitest <- read.csv("crime.csv", header=F)
crimitest
Apart from missing data for some variables, we do see data in rectangular format - as expected.
Our file does not contain names for the variables in the first row, so we specified that there is no header in the file (header = F; F is a short notation for False). R itself generates names for the variables (V1 to V5).
The number of variables (5) is determined by the maximum number of comments for one person, in this case person 3.
Note that a certain comment, for example "drugs", can occur as a value of V1 (persons 2, 5 and 7) or of V3 (persons 3 and 9). This is not very useful for analysis, for several reasons:
Our latter wish would require a design in which all possible comments ("adhd", "divorced", "drugs", ...) are variables, coded 0 (does not occur) or 1 (does occur). But that is the very thing we wanted to avoid, since interviews - and shopping baskets, for that matter - are likely to contain only a small subset of all possible words, comments and products. The number of zeros in such a data set would be tremendous.
The arules package has an extremely useful function to read such data in an efficient way: the read.transactions() function.
The data is organized in a so-called sparse matrix, a rectangle that uses space sparingly.
Since we are not going to use the rectangular data frame crimitest created above, we will first delete it using rm().
rm(crimitest)
crimi <- read.transactions("crime.csv",sep=",")
summary(crimi)
## transactions as itemMatrix in sparse format with
## 10 rows (elements/itemsets/transactions) and
## 8 columns (items) and a density of 0.3375
##
## most frequent items:
## parents divorced divorced drug abuse lost job
## 6 5 5 3
## tax debt (Other)
## 3 5
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 5
## 5 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 2.0 2.5 2.7 3.0 5.0
##
## includes extended item information - examples:
## labels
## 1 adhd
## 2 divorced
## 3 drug abuse
You see that the summary() function in R is smart. You can apply the function to a data frame, as in previous chapters. But you can also use the same function to summarize the object crimi that is not a data frame! Depending on the type of object (in R: the object class) summary() gives a different result.
Let's look at the output step-by-step.
The first line states that it is indeed a matrix in a sparse format. Sparse here means that each line contains only the comments that have been made (the "ones" in the file; not the "zeros" for all comments that have not been made). In a shopping basket analogy: we list all the types of products that are in the shopping basket. We do not list all types of products that are not in the basket!
The matrix has 10 rows and 8 columns.
The 10 rows are the interviews with 10 individuals with criminal behavior.
The 8 columns represent all possible comments (the items) that have been made. You can easily check for yourself, in this small sample file, that indeed 8 different comments have been made.
The remark "parents divorced" occurs (as we have already calculated in Excel, above) 6 times.
The support can be calculated as 6 divided by the number of rows (10), which is 0.60.
The three remaining comment types are grouped under "(Other)", with 5 occurrences in total. Check for yourself which three comments these are!
An interesting statistic is the density.
If we think of the sparse matrix as a rectangle of 10 people times 8 comments filled with zeros and ones, then how many of the \(10 * 8 = 80\) cells of the matrix contains a 1?
We can calculate that ourselves in our simple example:
A total of 27 comments were made. In a matrix with \(10*8=80\) cells, this gives a density of 27/80 = 0.3375. The output confirms this.
The density in the shopping baskets at a supermarket will of course be very small!
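As a quick check, we can recompute the density ourselves with two arules helper functions: size() gives the number of items per transaction, and itemLabels() the item names.

# total number of 1s divided by the number of cells (10 rows x 8 items)
sum(size(crimi)) / (length(crimi) * length(itemLabels(crimi)))
## [1] 0.3375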
In most cases (5 people) the number of comments is limited to two. Three comments are made in four cases, and five comments in one case.
The distribution of comments is shown statistically, including the minimum (2), the mean (\(27/10 = 2.7\)) and the maximum (5).
Finally, a number of examples are given of the values in the matrix.
The labels are the comments in the data file, sorted alphabetically: column 1 of the matrix returns a vector of zeros and ones for the comment "adhd". We will rarely use such matrices directly (for a supermarket with thousands of items, such a matrix would be far too large).
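If you are curious, you can materialize the dense 0/1 matrix behind the sparse representation - feasible for 10 x 8, but not for thousands of items. A small sketch (the object name m is just illustrative):

m <- 1 * as(crimi, "matrix")   # coerce the logical item matrix to 0/1
m[, "adhd"]                    # the column of zeros and ones for the comment "adhd"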
You can view your data (the 10 lines of the sparse matrix) with the inspect() command.
To show the first three rows we use:
inspect(crimi[1:3])
## items
## [1] {parents divorced,
## shop lifting}
## [2] {divorced,
## drug abuse,
## violent crime}
## [3] {drug abuse,
## lost job,
## parents divorced,
## tax debt,
## violent crime}
Compare this with the data in the Excel file!
We can also create a table of the relative frequency of the different items, using the itemFrequency() function.
In the market basket analysis jargon, this function shows the support of the various items.
For example:
You can see that the items are arranged in alphabetical order. This is not useful if we have a lot of items (products, comments, etc.) in our database. The output will be huge. You can limit the output to, say, the first 3 items, but those are probably not the most important (or most frequent) items!
The output of itemFrequency() is a vector that we can sort. Because we are interested in the items with the highest support, we sort using the option decreasing = TRUE (the default is sort ascending, so decreasing = FALSE).
In the case of a large number of items, it is recommended that you only display the items with the highest support.
To keep the commands short and clear, we first create a sorted vector (xs) below.
Much more convenient is the itemFrequencyPlot() function because we can directly indicate the minimum support for items displayed, or the topN (for example the top three) items with the highest support.
Since we only have 8 items in our simple example, a graphical representation is quite possible (see figures below).
itemFrequency(crimi)
## adhd divorced drug abuse lost job
## 0.1 0.5 0.5 0.3
## parents divorced shop lifting tax debt violent crime
## 0.6 0.1 0.3 0.3
xs <- sort(itemFrequency(crimi), decreasing=TRUE)
xs[1:3]
## parents divorced divorced drug abuse
## 0.6 0.5 0.5
sort(itemFrequency(crimi), decreasing=TRUE)
## parents divorced divorced drug abuse lost job
## 0.6 0.5 0.5 0.3
## tax debt violent crime adhd shop lifting
## 0.3 0.3 0.1 0.1
itemFrequencyPlot(crimi)
itemFrequencyPlot(crimi, support=.3)
itemFrequencyPlot(crimi, topN=3)
The patterns in the data can be visualized with the image() function.
Applied to shopping baskets, this function shows in a matrix diagram (a graph with blocks) which consumers buy which products. This really only makes sense if both the number of rows (in our case, criminals) and the number of items (comments about these criminals) are small.
For large data files, the image() function produces a huge amount of dots in a rectangle that is difficult to interpret.
In our example:
image(crimi) # all data
image(crimi[1:3,]) # first 3 cases (rows)
image(crimi[,6:8]) # items (columns) 6 to 8
Try to discover patterns yourself on the basis of the figure.
Now that we have read and explored the data, we are ready for the next step: analyzing the associations between the items.
We have already started with this in the image() function in step 3, but it will be clear that the graph is not easy to interpret - especially if we have a lot of data.
We can efficiently trace relationships (or associations) in the file with the apriori() function.
The apriori() function takes the transactions object as its first argument; in the parameter list (between brackets) we can specify, among other things, the minimum support, the minimum confidence, and the minimum number of items per rule (minlen).
By indicating a minimum support, we can filter out item sets that are rare.
The same applies to the minimum confidence: we are interested in the relationship between items, and in the probability that the presence of one item predicts the presence of other items.
Some examples:
cr1 <- apriori(crimi)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [40 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(cr1)
## lhs rhs support confidence coverage lift count
## [1] {shop lifting} => {parents divorced} 0.1 1 0.1 1.666667 1
## [2] {adhd} => {parents divorced} 0.1 1 0.1 1.666667 1
## [3] {lost job,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [4] {tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [5] {drug abuse,
## tax debt} => {lost job} 0.1 1 0.1 3.333333 1
## [6] {lost job,
## parents divorced} => {tax debt} 0.1 1 0.1 3.333333 1
## [7] {lost job,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [8] {lost job,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [9] {lost job,
## parents divorced} => {violent crime} 0.1 1 0.1 3.333333 1
## [10] {lost job,
## parents divorced} => {drug abuse} 0.1 1 0.1 2.000000 1
## [11] {tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [12] {drug abuse,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [13] {tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [14] {divorced,
## tax debt} => {parents divorced} 0.1 1 0.1 1.666667 1
## [15] {drug abuse,
## tax debt} => {parents divorced} 0.1 1 0.1 1.666667 1
## [16] {lost job,
## tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [17] {drug abuse,
## lost job,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [18] {drug abuse,
## lost job,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [19] {drug abuse,
## tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [20] {lost job,
## tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [21] {lost job,
## parents divorced,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [22] {lost job,
## parents divorced,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [23] {parents divorced,
## tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [24] {drug abuse,
## lost job,
## tax debt} => {parents divorced} 0.1 1 0.1 1.666667 1
## [25] {lost job,
## parents divorced,
## tax debt} => {drug abuse} 0.1 1 0.1 2.000000 1
## [26] {drug abuse,
## lost job,
## parents divorced} => {tax debt} 0.1 1 0.1 3.333333 1
## [27] {drug abuse,
## parents divorced,
## tax debt} => {lost job} 0.1 1 0.1 3.333333 1
## [28] {drug abuse,
## lost job,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [29] {lost job,
## parents divorced,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [30] {drug abuse,
## lost job,
## parents divorced} => {violent crime} 0.1 1 0.1 3.333333 1
## [31] {drug abuse,
## parents divorced,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [32] {drug abuse,
## tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [33] {parents divorced,
## tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [34] {drug abuse,
## parents divorced,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [35] {drug abuse,
## parents divorced,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [36] {drug abuse,
## lost job,
## tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [37] {lost job,
## parents divorced,
## tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [38] {drug abuse,
## lost job,
## parents divorced,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [39] {drug abuse,
## lost job,
## parents divorced,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [40] {drug abuse,
## parents divorced,
## tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
The first specification uses the default values and results in a large number of lines.
If the optional parameters are omitted, the default values are: a minimum support of 0.1, a minimum confidence of 0.8, and a minimum rule length (minlen) of 1.
The association rules are stored in a new object (cr1).
Default values are usually not the best ones. The optimal values vary from case to case.
In our case, we can use a fairly high value for support, say 0.2 or 0.3, because an item or item pair must appear at least 2 out of 10 times.
In a supermarket with thousands of items, we set the support at a much lower level, as it is unlikely for combinations of products to have support levels as high as 0.2.
Anyway, we only want to see relationships between items, and then we need to have at least 2 items (minlen=2).
We want to have a manageable set of meaningful relationships. If we set the thresholds too low, we will have a lot of rules which makes it hard to digest. If we set the thresholds too high then the number of lines is small, or even zero.
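One quick way to get a feel for sensible thresholds is to count how many rules each candidate support level produces. A sketch (the confidence and minlen values here are just illustrative; verbose=FALSE suppresses the progress output):

sapply(c(0.1, 0.2, 0.3), function(s)
  length(apriori(crimi,
                 parameter = list(support = s, confidence = 0.5, minlen = 2),
                 control = list(verbose = FALSE))))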
Especially in large data files, we are forced to use trial and error to see what gives the best result. You can compare it to googling. If you google for information about the presidential elections in the US, the search string "presidential elections" yields over 300 million hits (including presidential elections in countries other than the US).
Adding "amsterdam" to your search string reduces the number of hits substantially to less than 3 million, still not something you can digest within a short time. The hits include a lot of documents linked to the University of Amsterdam.
Replacing "amsterdam" by "tietjerksteradeel" (a small municipality in the north of the Netherlands) yields 1 hit only. Very manageable, but not necessarily relevant. It turns out that this hit is a Master thesis by a student on a Dutch right-wing politician who has strong feelings on both the US presidential elections and climate change; his views on climate change were criticized by a Dutch weatherman who was born and raised in Tietjerksteradeel. Probably not what we are looking for.
Let's look at the first association rule in the output.
{shop lifting} => {parents divorced}
Let's focus on the three statistics of interest: support, confidence and lift.
By now, you should be able to check that yourself from the small data set presented before. There is only one interviewee with indications of both "shop lifting" and "parents divorced". It's the very first case, out of the set of 10. Support is therefore \(1/10=0.10\), as shown in the output.
The confidence can be calculated as:
Confidence (X→Y) = (Support (X,Y)) / (Support (X))
or here:
X = Shop Lifting, and Y = Parents Divorced
Confidence (X→Y) = 0.10 / 0.10 = 1
This reflects the fact that in all cases of shop lifting, the parents turn out to be divorced. There is only one case of shop lifting, and the only other comment in that interview was about the parents being divorced, resulting in this finding. Obviously, the evidence is quite thin, with only 10 cases in total and 1 case in support of this finding.
Moreover, it is questionable whether this order of events (shoplifting leading to the parents' divorce) makes much sense. And, since parents are often divorced in our sample (6 out of 10 cases), the rule may not be that informative anyway.
The formula for lift is:
Lift (X→Y) = (Confidence (X→Y)) / (Support (Y))
The outcome of the formula is:
(X = Shop Lifting, and Y = Parents Divorced)
Lift (X→Y) = 1 / (6/10) = 1 / 0.6 = 1.67
The reasoning is as follows.
If the chance of divorced parents is higher given prior information about shoplifting than without it, then this is informative.
In our example, shoplifting is always associated with divorced parents (even if we only have one case): the chance is therefore 100%. Without prior information on shoplifting, the probability of divorced parents is 6/10 = 60% (the support for {parents divorced}). Lift is the ratio between these probabilities: 100% / 60% = 1.67.
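You can check this in R with the item frequencies computed earlier (a quick sketch):

conf <- 0.10 / 0.10                               # confidence of {shop lifting} => {parents divorced}
conf / itemFrequency(crimi)["parents divorced"]   # lift = 1 / 0.6 = 1.67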
Note that items with 100% support can never be informative!
This applies to (combinations of) items on the left (LHS) and the right (RHS)!
In an example:
If every consumer has bread in his or her shopping basket, and 30% of all consumers also buy peanut butter, then the confidence of {bread} → {peanut butter} is also 30%: 30% buy peanut butter, and 30% buy both peanut butter and bread (since everybody buys bread)!
The lift, then, for Bread → Peanut Butter can be computed as:
Lift(X→Y) = 30% (confidence of item pair), divided by 30% (support for peanut butter) = 1
Now consider the reverse rule, {peanut butter} → {bread}.
If everyone buys bread and 30% buy peanut butter, then the confidence equals the support for the item pair {bread, peanut butter} (30%) divided by the support for {peanut butter} (30%): \(30\%/30\% = 1\).
The lift then is the confidence (1) divided by the support for {bread} (1): \(1/1 = 1\).
This is a complicated way of stating that items that always occur provide no information for predicting other items. And predicting them is pointless, since they always occur anyway.
While confidence(X→Y) and confidence(Y→X) generally differ, by definition lift(X→Y) = lift(Y→X)!
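We can see why by writing out the formulas:
\[
\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)} = \frac{\text{Support}(X, Y)}{\text{Support}(X)\,\text{Support}(Y)},
\]
which is symmetric in X and Y, so swapping the LHS and RHS leaves the lift unchanged.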
To recap:
To control our output, we can set the various options in the parameter list, in brackets. If we set these values to the defaults, then obviously the output will be identical to what we have above.
Run both sets of commands and verify that the result is indeed the same! Just to make sure.
cr2 <- apriori(crimi, parameter = list(support=.1, confidence=.8, minlen=1))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [40 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(cr2)
## lhs rhs support confidence coverage lift count
## [1] {shop lifting} => {parents divorced} 0.1 1 0.1 1.666667 1
## [2] {adhd} => {parents divorced} 0.1 1 0.1 1.666667 1
## [3] {lost job,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [4] {tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [5] {drug abuse,
## tax debt} => {lost job} 0.1 1 0.1 3.333333 1
## [6] {lost job,
## parents divorced} => {tax debt} 0.1 1 0.1 3.333333 1
## [7] {lost job,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [8] {lost job,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [9] {lost job,
## parents divorced} => {violent crime} 0.1 1 0.1 3.333333 1
## [10] {lost job,
## parents divorced} => {drug abuse} 0.1 1 0.1 2.000000 1
## [11] {tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [12] {drug abuse,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [13] {tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [14] {divorced,
## tax debt} => {parents divorced} 0.1 1 0.1 1.666667 1
## [15] {drug abuse,
## tax debt} => {parents divorced} 0.1 1 0.1 1.666667 1
## [16] {lost job,
## tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [17] {drug abuse,
## lost job,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [18] {drug abuse,
## lost job,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [19] {drug abuse,
## tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [20] {lost job,
## tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [21] {lost job,
## parents divorced,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [22] {lost job,
## parents divorced,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [23] {parents divorced,
## tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [24] {drug abuse,
## lost job,
## tax debt} => {parents divorced} 0.1 1 0.1 1.666667 1
## [25] {lost job,
## parents divorced,
## tax debt} => {drug abuse} 0.1 1 0.1 2.000000 1
## [26] {drug abuse,
## lost job,
## parents divorced} => {tax debt} 0.1 1 0.1 3.333333 1
## [27] {drug abuse,
## parents divorced,
## tax debt} => {lost job} 0.1 1 0.1 3.333333 1
## [28] {drug abuse,
## lost job,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [29] {lost job,
## parents divorced,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [30] {drug abuse,
## lost job,
## parents divorced} => {violent crime} 0.1 1 0.1 3.333333 1
## [31] {drug abuse,
## parents divorced,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
## [32] {drug abuse,
## tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [33] {parents divorced,
## tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [34] {drug abuse,
## parents divorced,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [35] {drug abuse,
## parents divorced,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [36] {drug abuse,
## lost job,
## tax debt,
## violent crime} => {parents divorced} 0.1 1 0.1 1.666667 1
## [37] {lost job,
## parents divorced,
## tax debt,
## violent crime} => {drug abuse} 0.1 1 0.1 2.000000 1
## [38] {drug abuse,
## lost job,
## parents divorced,
## tax debt} => {violent crime} 0.1 1 0.1 3.333333 1
## [39] {drug abuse,
## lost job,
## parents divorced,
## violent crime} => {tax debt} 0.1 1 0.1 3.333333 1
## [40] {drug abuse,
## parents divorced,
## tax debt,
## violent crime} => {lost job} 0.1 1 0.1 3.333333 1
Great! Nothing changed.
Below, we set the bar a bit higher. We only want rules with support levels of .3 or higher.
This leaves only 7 rules, but some of them have a length of 1 (with just right hand sides, and no left hand sides). Not rules at all, really.
cr3 <- apriori(crimi, parameter = list(support=.3, confidence=.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.3 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(cr3)
## lhs rhs support confidence coverage lift
## [1] {} => {divorced} 0.5 0.5 1.0 1.0
## [2] {} => {drug abuse} 0.5 0.5 1.0 1.0
## [3] {} => {parents divorced} 0.6 0.6 1.0 1.0
## [4] {divorced} => {drug abuse} 0.3 0.6 0.5 1.2
## [5] {drug abuse} => {divorced} 0.3 0.6 0.5 1.2
## [6] {divorced} => {parents divorced} 0.3 0.6 0.5 1.0
## [7] {parents divorced} => {divorced} 0.3 0.5 0.6 1.0
## count
## [1] 5
## [2] 5
## [3] 6
## [4] 3
## [5] 3
## [6] 3
## [7] 3
And we can set minlen at 2. This leaves out the one-item rules.
From rules 1 versus 2, and rules 3 versus 4, you note that the support levels are the same (as the item pairs are the same), and that the lift levels are the same, by definition: lift(X→Y) = lift(Y→X).
cr4 <- apriori(crimi, parameter = list(support=.3, confidence=.4, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.3 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [4 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(cr4)
## lhs rhs support confidence coverage lift
## [1] {divorced} => {drug abuse} 0.3 0.6 0.5 1.2
## [2] {drug abuse} => {divorced} 0.3 0.6 0.5 1.2
## [3] {divorced} => {parents divorced} 0.3 0.6 0.5 1.0
## [4] {parents divorced} => {divorced} 0.3 0.5 0.6 1.0
## count
## [1] 3
## [2] 3
## [3] 3
## [4] 3
Of course, depending on the size of the output, the case at hand and what you want to learn from it, you can play around with the settings.
Remember that, as a data scientist, your challenge is to find meaningful information and to present it in a clear way!
cr5 <- apriori(crimi, parameter = list(support=.2, confidence=.3, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.2 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(cr5)
## lhs rhs support confidence coverage
## [1] {lost job} => {tax debt} 0.2 0.6666667 0.3
## [2] {tax debt} => {lost job} 0.2 0.6666667 0.3
## [3] {lost job} => {drug abuse} 0.2 0.6666667 0.3
## [4] {drug abuse} => {lost job} 0.2 0.4000000 0.5
## [5] {tax debt} => {parents divorced} 0.2 0.6666667 0.3
## [6] {parents divorced} => {tax debt} 0.2 0.3333333 0.6
## [7] {violent crime} => {divorced} 0.2 0.6666667 0.3
## [8] {divorced} => {violent crime} 0.2 0.4000000 0.5
## [9] {violent crime} => {drug abuse} 0.2 0.6666667 0.3
## [10] {drug abuse} => {violent crime} 0.2 0.4000000 0.5
## [11] {violent crime} => {parents divorced} 0.2 0.6666667 0.3
## [12] {parents divorced} => {violent crime} 0.2 0.3333333 0.6
## [13] {divorced} => {drug abuse} 0.3 0.6000000 0.5
## [14] {drug abuse} => {divorced} 0.3 0.6000000 0.5
## [15] {divorced} => {parents divorced} 0.3 0.6000000 0.5
## [16] {parents divorced} => {divorced} 0.3 0.5000000 0.6
## [17] {drug abuse} => {parents divorced} 0.2 0.4000000 0.5
## [18] {parents divorced} => {drug abuse} 0.2 0.3333333 0.6
## lift count
## [1] 2.2222222 2
## [2] 2.2222222 2
## [3] 1.3333333 2
## [4] 1.3333333 2
## [5] 1.1111111 2
## [6] 1.1111111 2
## [7] 1.3333333 2
## [8] 1.3333333 2
## [9] 1.3333333 2
## [10] 1.3333333 2
## [11] 1.1111111 2
## [12] 1.1111111 2
## [13] 1.2000000 3
## [14] 1.2000000 3
## [15] 1.0000000 3
## [16] 1.0000000 3
## [17] 0.6666667 2
## [18] 0.6666667 2
It is important to interpret the association rules correctly.
There are two rules with a relatively large lift (\(lift>2\)).
Both relate to the items {lost job} and {tax debt}.
For the 10 individuals in our database, it is more than twice as likely to have a tax debt knowing that they have lost their job than it is for the group as a whole. The reverse also holds: it is more than twice as likely to have lost a job knowing there is a tax debt than for the group as a whole.
As an analyst, your main interest may be in certain aspects from the interviews, such as the relationships between (comments about) drug abuse and other items.
Before zooming in on those items, it is useful to sort all the rules by lift.
It is handy to save the rules sorted by lift as an object (say, short). The object is a data frame. We can select rules with, for example, \(lift>1.2\) like below. Since lift is a variable in a data frame, we have to refer to it as short$lift.
short <- inspect(sort(cr5, by="lift"))
## lhs rhs support confidence coverage
## [1] {lost job} => {tax debt} 0.2 0.6666667 0.3
## [2] {tax debt} => {lost job} 0.2 0.6666667 0.3
## [3] {lost job} => {drug abuse} 0.2 0.6666667 0.3
## [4] {drug abuse} => {lost job} 0.2 0.4000000 0.5
## [5] {violent crime} => {divorced} 0.2 0.6666667 0.3
## [6] {divorced} => {violent crime} 0.2 0.4000000 0.5
## [7] {violent crime} => {drug abuse} 0.2 0.6666667 0.3
## [8] {drug abuse} => {violent crime} 0.2 0.4000000 0.5
## [9] {divorced} => {drug abuse} 0.3 0.6000000 0.5
## [10] {drug abuse} => {divorced} 0.3 0.6000000 0.5
## [11] {tax debt} => {parents divorced} 0.2 0.6666667 0.3
## [12] {parents divorced} => {tax debt} 0.2 0.3333333 0.6
## [13] {violent crime} => {parents divorced} 0.2 0.6666667 0.3
## [14] {parents divorced} => {violent crime} 0.2 0.3333333 0.6
## [15] {divorced} => {parents divorced} 0.3 0.6000000 0.5
## [16] {parents divorced} => {divorced} 0.3 0.5000000 0.6
## [17] {drug abuse} => {parents divorced} 0.2 0.4000000 0.5
## [18] {parents divorced} => {drug abuse} 0.2 0.3333333 0.6
## lift count
## [1] 2.2222222 2
## [2] 2.2222222 2
## [3] 1.3333333 2
## [4] 1.3333333 2
## [5] 1.3333333 2
## [6] 1.3333333 2
## [7] 1.3333333 2
## [8] 1.3333333 2
## [9] 1.2000000 3
## [10] 1.2000000 3
## [11] 1.1111111 2
## [12] 1.1111111 2
## [13] 1.1111111 2
## [14] 1.1111111 2
## [15] 1.0000000 3
## [16] 1.0000000 3
## [17] 0.6666667 2
## [18] 0.6666667 2
short[short$lift>1.2,]
We can select lines from the object cr5 created above with the subset() command.
After the name of the object with the association rules, you must indicate which items you want to select.
To select all association rules with drug abuse in either the LHS or the RHS, use the following command.
It is handy to save the rules sorted by lift as an object (say, x). The object is a data frame. We can select rules with, for example, \(lift>1\) like below.
drugsRules <- subset(cr5, items %in% "drug abuse")
x<-inspect(sort(drugsRules, by="lift"))
## lhs rhs support confidence coverage
## [1] {lost job} => {drug abuse} 0.2 0.6666667 0.3
## [2] {drug abuse} => {lost job} 0.2 0.4000000 0.5
## [3] {violent crime} => {drug abuse} 0.2 0.6666667 0.3
## [4] {drug abuse} => {violent crime} 0.2 0.4000000 0.5
## [5] {divorced} => {drug abuse} 0.3 0.6000000 0.5
## [6] {drug abuse} => {divorced} 0.3 0.6000000 0.5
## [7] {drug abuse} => {parents divorced} 0.2 0.4000000 0.5
## [8] {parents divorced} => {drug abuse} 0.2 0.3333333 0.6
## lift count
## [1] 1.3333333 2
## [2] 1.3333333 2
## [3] 1.3333333 2
## [4] 1.3333333 2
## [5] 1.2000000 3
## [6] 1.2000000 3
## [7] 0.6666667 2
## [8] 0.6666667 2
x[x$lift>1,]
Note that the string value after the %in% operator has to match the item content exactly!
Using parts of it ("abuse"), or extensions ("drug abuse and related"), won't do.
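If you do want partial matching on item labels, the arules package also provides the %pin% operator (and %ain%, which requires that all listed items appear in the rule). A short sketch:

drugsRules2 <- subset(cr5, items %pin% "abuse")   # matches any label containing "abuse"
inspect(drugsRules2)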
In the shopping baskets (and their translation into the sparse matrix) it can happen that items appear twice.
For example, at the checkout, the cashier scans two cartons of yogurt, one at a time.
When analyzing which products are bought together, it is less important which quantities are bought, and one entry for the yogurt item is sufficient. The same holds true for mentioning, say, "drugs" in our sample of coded interviews.
Of course it is possible to remove the duplications manually or with some smart algorithm, but there's no need to do so.
The read.transactions() function does the job for us.
Let's test this.
We have added the "drug abuse" item to records 5 and 7, so that the item appears twice in those records.
As a result, the total number of items (comments) increases from 27 to 29, two of which are duplications (items that occur twice or more within a record).
Since the two data files differ, reading them with read.csv() yields different data frames, and the identical() function returns FALSE.
(x1t <- read.csv("crime.txt", header=F)) # No duplicates
(x2t <- read.csv("crimed.txt",header=F)) # Some duplicates
identical(x1t,x2t)
## [1] FALSE
But when reading the different files as sparse matrices, it turns out that the resulting objects are identical.
In market basket analysis, we are not interested in the quantities within a basket - only in whether an item is present or not!
x1 <- read.transactions("crime.txt",sep=",")
summary(x1)
## transactions as itemMatrix in sparse format with
## 10 rows (elements/itemsets/transactions) and
## 9 columns (items) and a density of 0.3
##
## most frequent items:
## divorced drug abuse parents divorced lost job
## 5 5 5 3
## tax debt (Other)
## 3 6
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 5
## 5 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 2.0 2.5 2.7 3.0 5.0
##
## includes extended item information - examples:
## labels
## 1 adhd
## 2 divorced
## 3 drug abuse
x2 <- read.transactions("crimed.txt",sep=",")
## Warning in asMethod(object): removing duplicated items in transactions
summary(x2)
## transactions as itemMatrix in sparse format with
## 10 rows (elements/itemsets/transactions) and
## 9 columns (items) and a density of 0.3
##
## most frequent items:
## divorced drug abuse parents divorced lost job
## 5 5 5 3
## tax debt (Other)
## 3 6
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 5
## 5 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 2.0 2.5 2.7 3.0 5.0
##
## includes extended item information - examples:
## labels
## 1 adhd
## 2 divorced
## 3 drug abuse
identical(x1,x2)
## [1] TRUE
There are various ways to visualize the relationships found.
The arulesViz package offers a wide range of possibilities, of which we will briefly discuss only two popular types of charts.
For an overview, we definitely recommend viewing the vignette of the package.
First of all, if you haven't already, you need to install arulesViz.
And, of course, to use the package's functions, use the library() command.
# install.packages ("arulesViz")
library (arulesViz)
## Warning: package 'arulesViz' was built under R version 4.0.5
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
The first type of graph maps all association rules.
We recommend mapping only the most important rules, for example by using subset() to select rules with a lift exceeding a specified value.
Let's use the set of rules cr5, as it is already based on minimum values for support and confidence.
The following command gives a first impression of the graph.
plot (cr5, method = "graph")
The option engine="interactive" offers the possibility to edit the graph interactively.
plot(cr5,method="graph",engine="interactive")
With the mouse you can then move the items and the associations between the items to give a better overview.
Some standard methods for the layout of the graph are available.
The graph will initially look like this:
The pink circles and arrows represent the associations.
The size of the pink circles corresponds to the support: the larger the circle, the greater the support of the item pair.
Since, as stated earlier, the lift of, for example, {drug abuse} → {divorced} is the same as that of the reverse, {divorced} → {drug abuse}, there are two paths between these items, with equal lift.
In the interactive mode you can drag all circles to a different position. Try it out for yourself!
Below we have made some adjustments to the bottom of the chart.
In most cases it is better not to include all association rules in the graph, but to make a deliberate selection.
We will make two selections:
For example, it makes more sense that losing a job leads to a tax debt rather than the other way around.
Or the chance of a divorce is greater if your parents are divorced, while it is less likely that your parents are getting a divorce as a result of your divorce.
First of all, we determine a subset with lift>1 and print it.
We won't use the interactive option for editing.
This has two advantages.
Firstly, the result is very clear. Second, the graph provides information on both the support (the size of the circles) and the lift (the color of the circles; the darker the color, the greater the lift).
cr5s <- subset(cr5, lift>1)
inspect(cr5s)
## lhs rhs support confidence coverage
## [1] {lost job} => {tax debt} 0.2 0.6666667 0.3
## [2] {tax debt} => {lost job} 0.2 0.6666667 0.3
## [3] {lost job} => {drug abuse} 0.2 0.6666667 0.3
## [4] {drug abuse} => {lost job} 0.2 0.4000000 0.5
## [5] {tax debt} => {parents divorced} 0.2 0.6666667 0.3
## [6] {parents divorced} => {tax debt} 0.2 0.3333333 0.6
## [7] {violent crime} => {divorced} 0.2 0.6666667 0.3
## [8] {divorced} => {violent crime} 0.2 0.4000000 0.5
## [9] {violent crime} => {drug abuse} 0.2 0.6666667 0.3
## [10] {drug abuse} => {violent crime} 0.2 0.4000000 0.5
## [11] {violent crime} => {parents divorced} 0.2 0.6666667 0.3
## [12] {parents divorced} => {violent crime} 0.2 0.3333333 0.6
## [13] {divorced} => {drug abuse} 0.3 0.6000000 0.5
## [14] {drug abuse} => {divorced} 0.3 0.6000000 0.5
## lift count
## [1] 2.222222 2
## [2] 2.222222 2
## [3] 1.333333 2
## [4] 1.333333 2
## [5] 1.111111 2
## [6] 1.111111 2
## [7] 1.333333 2
## [8] 1.333333 2
## [9] 1.333333 2
## [10] 1.333333 2
## [11] 1.111111 2
## [12] 1.111111 2
## [13] 1.200000 3
## [14] 1.200000 3
plot(cr5s,method="graph")
Next, we make a selection:
cr5s2 <- subset(cr5s, c(1,3,4,6,7,8,10,12,13,14)); inspect(cr5s2)
## lhs rhs support confidence coverage lift
## [1] {lost job} => {tax debt} 0.2 0.6666667 0.3 2.222222
## [2] {lost job} => {drug abuse} 0.2 0.6666667 0.3 1.333333
## [3] {drug abuse} => {lost job} 0.2 0.4000000 0.5 1.333333
## [4] {parents divorced} => {tax debt} 0.2 0.3333333 0.6 1.111111
## [5] {violent crime} => {divorced} 0.2 0.6666667 0.3 1.333333
## [6] {divorced} => {violent crime} 0.2 0.4000000 0.5 1.333333
## [7] {drug abuse} => {violent crime} 0.2 0.4000000 0.5 1.333333
## [8] {parents divorced} => {violent crime} 0.2 0.3333333 0.6 1.111111
## [9] {divorced} => {drug abuse} 0.3 0.6000000 0.5 1.200000
## [10] {drug abuse} => {divorced} 0.3 0.6000000 0.5 1.200000
## count
## [1] 2
## [2] 2
## [3] 2
## [4] 2
## [5] 2
## [6] 2
## [7] 2
## [8] 2
## [9] 3
## [10] 3
plot(cr5s2,method="graph")
The same information can be displayed in a grouped matrix.
The matrix displays all items to the LHS and all items to the RHS, and for each item pair displays the support and lift, with the size and color of the circles.
The graph shows that drug abuse often goes hand in hand with divorce, and that job loss is a good predictor of tax debt (or vice versa).
Note that the graph is symmetrical: the lift for {lost job} → {tax debt} is the same as for {tax debt} → {lost job}, and the support for the item pair is fixed.
In other words: the two circles at the top left are the same size (support), and have the same dark red color (lift).
plot(cr5, method="grouped")
As before, we can limit ourselves to the rules that make the most sense to us.
We then apply the command to the object cr5s2.
A slight disadvantage is that the LHS and RHS items are no longer in the same order, which makes reading the graph a bit more challenging.
The advantage is that illogical relationships and duplicate information are no longer displayed, which allows us to focus on the most relevant pieces of information.
An ideal design is to have as few items as possible in the selection that occur on both the LHS and the RHS.
plot(cr5s2, method="grouped")
Let's practice market basket analysis with a larger example.
The example is based on part of the freely downloadable groceries csv-file. Click here for this file and other examples.
The file contains the shopping baskets of consumers in a supermarket.
Each record in the file lists, in no particular order, the items that make up the basket.
We have made a selection (mmba_assignment.csv) of shopping baskets from the original database of nearly 10 thousand purchases in the supermarket.
Let's work step by step, by breaking down the problem into the following parts.
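To get you started, here is a minimal sketch of the first steps. It assumes that mmba_assignment.csv is in your working directory with comma-separated items per basket; the support and confidence thresholds shown are only illustrative starting values:

groceries <- read.transactions("mmba_assignment.csv", sep = ",")
summary(groceries)
itemFrequencyPlot(groceries, topN = 10)
rules <- apriori(groceries, parameter = list(support = 0.01, confidence = 0.25, minlen = 2))
inspect(head(sort(rules, by = "lift"), 10))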