Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.
# Import required R libraries
#library(tidyverse)
library(arules)
library(arulesViz)
# Set seed for assignment
set.seed(200)
The arules package (https://www.rdocumentation.org/packages/arules/versions/1.7-1) “provides the infrastructure for representing, manipulating and analyzing transaction data and patterns using frequent itemsets and association rules.” The arules library contains the data structure definitions and mining algorithms - APRIORI and ECLAT.
The arulesViz library provides visualizations for the association rules.
Using the example provided at http://r-statistics.co/Association-Mining-With-R.html, I read in the provided CSV file as a transactions object per the arules package.
# http://r-statistics.co/Association-Mining-With-R.html
grocery_ds <- read.transactions("GroceryDataSet.csv", sep=",")
class(grocery_ds)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
The class() function confirms the grocery_ds object is a transactions object from the arules package.
summary(grocery_ds)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The summary of the grocery data set indicates 9835 rows (individual receipts, stored as transactions) and 169 columns (unique items). The most frequent items are whole milk, other vegetables, rolls/buns, soda, and yogurt. The median number of items per transaction is 3, and the mean is 4.4 items per transaction. The minimum is 1, as expected, and the largest transaction contains 32 items.
size(head(grocery_ds))
## [1] 4 3 1 4 4 5
LIST(head(grocery_ds, 3))
## [[1]]
## [1] "citrus fruit" "margarine" "ready soups"
## [4] "semi-finished bread"
##
## [[2]]
## [1] "coffee" "tropical fruit" "yogurt"
##
## [[3]]
## [1] "whole milk"
inspect(head(grocery_ds, 3))
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
The size() function confirms the item count per transaction, and the LIST() function, shown for the first three transactions, confirms the data object meets expectations. The results above match the raw CSV file. The inspect() function appears to return the same content as LIST() but with a cleaner presentation.
# calculates 'support' of the frequent items in the dataset
support_val <- 0.07
frequentItems <- eclat(grocery_ds, parameter = list(supp=support_val, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.07 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 688
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing ... [19 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
The eclat() function “finds frequent item sets with the Eclat algorithm, which carries out a depth first search on the subset lattice and determines the support of item sets by intersecting transaction lists.” (https://borgelt.net/eclat.html) The parameter supp represents \(support\), which is defined as the proportion of transactions in the dataset that contain the itemset. The parameter value is the minimum threshold an itemset must reach to be reported.
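To make the "intersecting transaction lists" idea concrete, here is a minimal base R sketch. It is not how arules implements Eclat internally; the toy transactions and the helper name itemset_support are made up for illustration.

```r
# Toy transactions (hypothetical data, not taken from the Groceries file)
transactions <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("other vegetables", "whole milk", "soda"),
  c("soda")
)

# Vertical layout: for each item, the ids of the transactions that contain it
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(items, function(item) {
  which(vapply(transactions, function(t) item %in% t, logical(1)))
})
names(tid_lists) <- items

# Support of an itemset = size of the intersection of its items' id lists,
# divided by the total number of transactions (hypothetical helper)
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tid_lists[itemset])
  length(tids) / length(transactions)
}

itemset_support("whole milk")                         # 3 of 4 transactions
itemset_support(c("other vegetables", "whole milk"))  # 2 of 4 transactions
```

With a minimum support of 0.5, both itemsets in this toy example would be kept; the pair is found by intersecting two id lists rather than rescanning every transaction.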
inspect(frequentItems)
## items support count
## [1] {other vegetables, whole milk} 0.07483477 736
## [2] {whole milk} 0.25551601 2513
## [3] {other vegetables} 0.19349263 1903
## [4] {rolls/buns} 0.18393493 1809
## [5] {yogurt} 0.13950178 1372
## [6] {soda} 0.17437722 1715
## [7] {root vegetables} 0.10899847 1072
## [8] {tropical fruit} 0.10493137 1032
## [9] {bottled water} 0.11052364 1087
## [10] {sausage} 0.09395018 924
## [11] {shopping bags} 0.09852567 969
## [12] {citrus fruit} 0.08276563 814
## [13] {pastry} 0.08896797 875
## [14] {pip fruit} 0.07564820 744
## [15] {whipped/sour cream} 0.07168277 705
## [16] {fruit/vegetable juice} 0.07229283 711
## [17] {newspapers} 0.07981698 785
## [18] {bottled beer} 0.08052872 792
## [19] {canned beer} 0.07768175 764
With a minimum support of 0.07, the output contains 19 frequent itemsets: 18 single items and only one pair, whole milk and other vegetables. The result makes sense given that those two items are the most frequent individually and are grocery staples.
itemFrequencyPlot(grocery_ds, topN=10, type="absolute", main="Item Frequency")
The plot above displays a count of the 10 most frequent items, with whole milk and other vegetables occurring most often, matching the results of the summary() function above.
# Define minimum support
supp_val <- 0.001
# Define minimum confidence (increase to get stronger rules)
conf_val <- 0.9
# Increase maxlen to get longer rules
maxlen_val <- 5
rules <- apriori(grocery_ds, parameter=list(supp=supp_val, conf=conf_val, maxlen=maxlen_val))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [123 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_conf <- sort(rules, by="confidence", decreasing=TRUE)
The apriori() function “finds association rules and frequent item sets with the Apriori algorithm, which carries out a breadth first search on the subset lattice and determines the support of item sets by subset tests.” (https://borgelt.net/apriori.html)
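The breadth-first search and subset tests in that description can be sketched in a few lines of base R. This is only an illustration of the idea on made-up transactions, stopping at pairs; the names support_of, frequent_1, and frequent_2 are hypothetical and not part of arules.

```r
# Toy transactions (hypothetical data, not taken from the Groceries file)
transactions <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("other vegetables", "whole milk", "soda"),
  c("soda")
)
min_support <- 0.5  # illustrative threshold

# Support by subset test: share of transactions that contain every item
support_of <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level 1: keep the single items that meet the threshold
items <- sort(unique(unlist(transactions)))
frequent_1 <- items[vapply(items, support_of, numeric(1)) >= min_support]

# Level 2: build candidate pairs only from frequent single items, since a
# pair cannot be frequent unless both of its items are frequent
candidates_2 <- combn(frequent_1, 2, simplify = FALSE)
frequent_2 <- Filter(function(p) support_of(p) >= min_support, candidates_2)

frequent_1  # "other vegetables" "soda" "whole milk"
frequent_2  # one pair: "other vegetables" "whole milk"
```

Yogurt is dropped at level 1, so no pair containing yogurt is ever tested. That pruning is what keeps the search manageable at a low threshold such as 0.001.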
inspect(head(rules_conf, 10))
## lhs rhs support confidence coverage lift count
## [1] {rice,
## sugar} => {whole milk} 0.001220132 1 0.001220132 3.913649 12
## [2] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
## [3] {butter,
## rice,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [4] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1 0.001728521 3.913649 17
## [5] {butter,
## domestic eggs,
## soft cheese} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [6] {citrus fruit,
## root vegetables,
## soft cheese} => {other vegetables} 0.001016777 1 0.001016777 5.168156 10
## [7] {butter,
## hygiene articles,
## pip fruit} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [8] {hygiene articles,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [9] {hygiene articles,
## pip fruit,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [10] {cream cheese,
## domestic eggs,
## sugar} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
With a minimum support of 0.001 and a minimum confidence of 0.9, the 10 rules above all have a confidence of 1. A confidence of 1 indicates that the item on the right-hand side occurs in every transaction that contains the items on the left-hand side. Not surprisingly, 9 of the 10 rules have whole milk on the right-hand side, and the remaining rule has other vegetables. Given the high frequency of those two items, these results are expected. Note that each of these rules rests on only 10 to 17 transactions.
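As a check on the confidence column, the usual definition can be applied to rule [1] in the table above. This assumes the coverage column is the support of the LHS, which is how I read the arules output.

```latex
\[
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)},
\qquad
\mathrm{conf}(\{\text{rice}, \text{sugar}\} \Rightarrow \{\text{whole milk}\})
  = \frac{0.001220132}{0.001220132} = 1
\]
```

In other words, all 12 transactions that contain rice and sugar also contain whole milk.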
rules_lift <- sort(rules, by="lift", decreasing=TRUE)
inspect(head(rules_lift, 10))
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {citrus fruit,
## fruit/vegetable juice,
## other vegetables,
## soda} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [3] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [4] {butter,
## sliced cheese,
## tropical fruit,
## whole milk} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [5] {cream cheese,
## curd,
## other vegetables,
## whipped/sour cream} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [6] {butter,
## other vegetables,
## tropical fruit,
## white bread} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [7] {citrus fruit,
## root vegetables,
## soft cheese} => {other vegetables} 0.001016777 1.0000000 0.001016777 5.168156 10
## [8] {brown bread,
## pip fruit,
## whipped/sour cream} => {other vegetables} 0.001118454 1.0000000 0.001118454 5.168156 11
## [9] {grapes,
## tropical fruit,
## whole milk,
## yogurt} => {other vegetables} 0.001016777 1.0000000 0.001016777 5.168156 10
## [10] {ham,
## pip fruit,
## tropical fruit,
## yogurt} => {other vegetables} 0.001016777 1.0000000 0.001016777 5.168156 10
Using the same support and confidence thresholds, and thus the same set of rules, I sorted the rules by lift to find the 10 rules with the highest lift. The lift value indicates “the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS.” (https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf) A lift above 1 means the LHS and RHS occur together more often than expected under independence; the higher the lift, the stronger the association.
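The lift column can be reproduced by hand from numbers already shown. Taking the top rule, with the support of bottled beer (0.08052872) read from the Eclat output earlier, a worked check looks like this:

```latex
\[
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)},
\qquad
\mathrm{lift}(\{\text{liquor}, \text{red/blush wine}\} \Rightarrow \{\text{bottled beer}\})
  = \frac{0.9047619}{0.08052872} \approx 11.24
\]
```

The same formula explains why so many rules share a lift value: with a confidence of 1, lift reduces to 1 / supp(RHS), which is about 3.91 for whole milk and about 5.17 for other vegetables.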
Note: I decided on a confidence value of 0.9 so I could get some high-lift rules that didn’t have whole milk or other vegetables on the right-hand side. The top lift result (11.235269) given the parameter values has an LHS of liquor and red/blush wine and an RHS of bottled beer. That association makes sense for someone buying alcohol together. The rules with an RHS of yogurt show different LHS items, including butter, cream cheese, curd, and whipped/sour cream, which seems valid given those items are typically in the same refrigerated section of a grocery store. Overall, none of the above rules sticks out like the infamous diapers to beer association.
# Could I find the diapers => beer rule
supp_baby_val <- 0.001
conf_baby_val <- 0.05
rules_baby <- apriori(grocery_ds,
                      parameter=list(supp=supp_baby_val, conf=conf_baby_val),
                      appearance = list(default="rhs", lhs="baby food"),
                      control = list(verbose=FALSE))
rules_baby_conf <- sort(rules_baby, by="confidence", decreasing=TRUE)
inspect(head(rules_baby_conf))
## lhs rhs support confidence coverage lift count
## [1] {} => {whole milk} 0.2555160 0.2555160 1 1 2513
## [2] {} => {other vegetables} 0.1934926 0.1934926 1 1 1903
## [3] {} => {rolls/buns} 0.1839349 0.1839349 1 1 1809
## [4] {} => {soda} 0.1743772 0.1743772 1 1 1715
## [5] {} => {yogurt} 0.1395018 0.1395018 1 1 1372
## [6] {} => {bottled water} 0.1105236 0.1105236 1 1 1087
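Nope, I couldn’t find the baby product to beer rule. Every rule shown has an empty LHS, so “baby food” as the LHS didn’t produce any meaningful association, nor did “baby cosmetics” as the LHS. One likely reason is that these items appear in fewer transactions than the minimum support of 0.001 requires (an absolute count of 9, per the Apriori output above).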
I used the plotting functions from the arulesViz library to help understand the association rules through visualizations.
# https://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf
options(digits = 2)
plot(rules)
The above scatterplot shows the relationship between Support and Confidence for the 123 association rules generated with a minimum support of 0.001 and a minimum confidence of 0.9.
plot(rules, measure=c("support", "lift"), shading="confidence")
The above scatterplot shows the relationship between Support and Lift for the 123 association rules generated with a minimum support of 0.001 and a minimum confidence of 0.9, with shading by Confidence.
plot(rules, method="two-key plot")
The above two-key scatterplot shows the relationship between Support and Confidence for the 123 association rules generated with a minimum support of 0.001 and a minimum confidence of 0.9, in which the color ("order") identifies the number of items in the rule.
plot(rules, method="grouped", control = list(k = 10))
The above grouped matrix-based visualization uses a balloon plot to show the grouped LHS itemsets as columns and the RHS items as rows. The color of each balloon shows the aggregated interest measure and the size of each balloon shows the aggregated support.
subrules2 <- head(rules, n=10, by="lift")
plot(subrules2, method="graph")
The above graph-based visualization shows the items and rules as vertices and connections with directed edges. The plot helps identify which rules share items.
Overall, the market basket analysis proved straightforward with the arules package. Teasing out more “interesting” associations would require further tuning of the support and confidence thresholds to find associations with few occurrences but high confidence values. Too bad this dataset didn’t have the diapers to beer connection.