Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The data set is attached.
library(arules)  # provides read.transactions() and apriori()
(grocery <- read.transactions("GroceryDataSet.csv", sep = ","))
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
Below is a summary of the grocery items. Whole milk is the most frequently bought item, followed by other vegetables.
summary(grocery)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
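The density and mean basket size reported in the summary can be cross-checked from the item counts it prints. The following is a minimal base R sketch that uses only numbers copied from the output above; the variable names are illustrative.

```r
# Counts copied from the summary() output above
n_transactions <- 9835
n_items        <- 169
item_counts <- c(`whole milk` = 2513, `other vegetables` = 1903,
                 `rolls/buns` = 1809, soda = 1715, yogurt = 1372,
                 `(Other)` = 34055)

# Total number of item occurrences across all receipts
total_items <- sum(item_counts)

# Density: share of cells in the 9835 x 169 matrix that are filled
density <- total_items / (n_transactions * n_items)

# Mean basket size: item occurrences per receipt
mean_basket <- total_items / n_transactions

round(density, 8)      # 0.02609146
round(mean_basket, 3)  # 4.409
```

Both values agree with the summary: about 2.6% of all possible item/receipt combinations occur, and an average receipt holds between 4 and 5 items.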
Below is a frequency plot of the 20 most frequently purchased grocery items. As you can see, the top 3 items are whole milk, other vegetables, and rolls/buns.
itemFrequencyPlot(grocery, topN=20)
The apriori function of the arules package is used to generate the association rules.
The parameter support is “defined as the proportion of transactions in the data set which contain the itemset. For example the itemset {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).” I tried using a support greater than 0.009, but the apriori function threw an error.
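To make the quoted definition concrete, here is a small base R sketch. The five receipts are hypothetical, chosen so that {milk, bread} appears in 2 of 5 transactions as in the quote, and `support()` is an illustrative helper, not a function from arules.

```r
# Hypothetical five-receipt database (not the grocery data)
toy <- list(
  c("milk", "bread"),
  c("butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# Support: share of transactions that contain every item in the itemset
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

support(c("milk", "bread"), toy)  # 2 of 5 receipts -> 0.4
```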
The parameter confidence is set to 0.55. For a rule such as {milk, bread} => {butter}, this means that at least 55% of the transactions containing milk and bread must also contain butter for the rule to be kept.
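Confidence can be sketched the same way in base R. The receipts below are hypothetical, and `count_with()` and `confidence()` are illustrative helpers rather than arules functions.

```r
# Hypothetical five-receipt database (not the grocery data)
toy <- list(
  c("milk", "bread"), c("butter"), c("beer"),
  c("milk", "bread", "butter"), c("bread", "butter")
)

# Number of transactions that contain every item in the itemset
count_with <- function(itemset, transactions) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# Confidence of lhs => rhs: count(lhs and rhs together) / count(lhs)
confidence <- function(lhs, rhs, transactions) {
  count_with(c(lhs, rhs), transactions) / count_with(lhs, transactions)
}

conf <- confidence(c("milk", "bread"), "butter", toy)
conf          # 1 of the 2 {milk, bread} receipts has butter -> 0.5
conf >= 0.55  # FALSE: this toy rule would not pass a 0.55 threshold
```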
Set minlen = 2 so that the LHS (antecedent) is not empty. By default, apriori uses a minlen of 1, which allows rules with an empty LHS.
The association rules need to meet the minimum support and confidence values. The model below, with a support of 0.009 and a confidence of 0.55, generated 10 association rules.
Source: https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf
https://www.rdocumentation.org/packages/arules/versions/1.6-7/topics/apriori
basket_model <- apriori(grocery, parameter = list(support=.009, confidence=0.55 , minlen=2))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.55    0.1    1 none FALSE            TRUE       5   0.009      2
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 88
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [93 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
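The "Absolute minimum support count" line in the log is the relative support expressed as a number of receipts. The arithmetic below is a quick base R check; that the log's 88 corresponds to rounding down is an observation from this output, not a statement about how arules computes it internally.

```r
n_transactions <- 9835
min_support    <- 0.009

# Relative support expressed as a number of receipts
min_support * n_transactions         # 88.515

# The log reports 88, which matches rounding this value down
floor(min_support * n_transactions)  # 88
```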
Below is a summary of the base model, with a support value of 0.009 and a confidence value of 0.55, applied to the data set of 9,835 transactions.
summary(basket_model)
## set of 10 rules
##
## rule length distribution (lhs + rhs):sizes
##  3
## 10
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       3       3       3       3       3       3
##
## summary of quality measures:
##     support           confidence        coverage            lift
##  Min.   :0.009354   Min.   :0.5525   Min.   :0.01464   Min.   :2.162
##  1st Qu.:0.009914   1st Qu.:0.5648   1st Qu.:0.01721   1st Qu.:2.210
##  Median :0.010930   Median :0.5738   Median :0.01886   Median :2.246
##  Mean   :0.011174   Mean   :0.5779   Mean   :0.01941   Mean   :2.408
##  3rd Qu.:0.012227   3rd Qu.:0.5840   3rd Qu.:0.02105   3rd Qu.:2.445
##  Max.   :0.014540   Max.   :0.6389   Max.   :0.02583   Max.   :3.030
##      count
##  Min.   : 92.0
##  1st Qu.: 97.5
##  Median :107.5
##  Mean   :109.9
##  3rd Qu.:120.2
##  Max.   :143.0
##
## mining info:
##     data ntransactions support confidence
##  grocery          9835   0.009       0.55
When too many association rules satisfy the support and confidence constraints, lift can be used to further filter or rank the rules. Higher lift values indicate a stronger association.
The strongest association, with the highest lift value, is {citrus fruit,root vegetables} => {other vegetables}. At the bottom is {domestic eggs,other vegetables} => {whole milk}.
inspect(sort(basket_model, by="lift")[1:10])
##      lhs                                     rhs                support
## [1]  {citrus fruit,root vegetables}       => {other vegetables} 0.010371124
## [2]  {root vegetables,tropical fruit}     => {other vegetables} 0.012302999
## [3]  {butter,yogurt}                      => {whole milk}       0.009354347
## [4]  {curd,yogurt}                        => {whole milk}       0.010066090
## [5]  {curd,other vegetables}              => {whole milk}       0.009862735
## [6]  {butter,other vegetables}            => {whole milk}       0.011489578
## [7]  {root vegetables,tropical fruit}     => {whole milk}       0.011997966
## [8]  {root vegetables,yogurt}             => {whole milk}       0.014539908
## [9]  {root vegetables,whipped/sour cream} => {whole milk}       0.009456024
## [10] {domestic eggs,other vegetables}     => {whole milk}       0.012302999
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102
## [2]  0.5845411  0.02104728 3.020999 121
## [3]  0.6388889  0.01464159 2.500387  92
## [4]  0.5823529  0.01728521 2.279125  99
## [5]  0.5739645  0.01718353 2.246296  97
## [6]  0.5736041  0.02003050 2.244885 113
## [7]  0.5700483  0.02104728 2.230969 118
## [8]  0.5629921  0.02582613 2.203354 143
## [9]  0.5535714  0.01708185 2.166484  93
## [10] 0.5525114  0.02226741 2.162336 121
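The quality measures in this table are tied together by simple ratios. As a rough check in base R, the sketch below recomputes support, confidence, and lift for rule [1] from the count and coverage columns above and from the "other vegetables" count in summary(grocery). The variable names are illustrative, and small differences in the last digits come from using rounded inputs.

```r
n_transactions <- 9835

# Rule [1]: {citrus fruit,root vegetables} => {other vegetables}
count_rule <- 102         # receipts containing lhs and rhs (count column)
coverage   <- 0.01769192  # share of receipts containing the lhs
count_rhs  <- 1903        # receipts containing "other vegetables"

support    <- count_rule / n_transactions            # share with lhs and rhs
confidence <- support / coverage                     # share of lhs receipts with rhs
lift       <- confidence / (count_rhs / n_transactions)  # confidence vs. rhs base rate

round(support, 5)     # 0.01037
round(confidence, 4)  # 0.5862
round(lift, 3)        # 3.03
```

A lift of about 3 says that receipts with citrus fruit and root vegetables contain other vegetables roughly three times as often as receipts in general.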
library(arulesViz)  # provides the plot() method for rules
rules <- head(basket_model, n = 10, by = "lift")
plot(rules, method = "graph")
Convert the grocery transaction data into a data frame. This data frame has 169 rows (items) and 9835 columns (transactions), so the grocery items are going to be clustered based on the transactions they appear in.
grocery_df <- as.data.frame(as.matrix(grocery@data))
row.names(grocery_df) <- grocery@itemInfo$labels
dim(grocery_df)
## [1] 169 9835
The grocery items are going to be clustered into 5 groups, as specified by the centers parameter. The algorithm is repeated 50 times, as specified by nstart (each time with a different set of starting centers).
The majority of the items fall in one group of 143 grocery items, followed by a group of 22 items. The remaining groups have only 1 or 2 items each.
set.seed(1)
k_cluster <- kmeans(grocery_df, centers=5, nstart=50, iter.max=10)
table(k_cluster$cluster)
##
##   1   2   3   4   5
##  22   1   2   1 143
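To illustrate what nstart does, the sketch below runs kmeans on hypothetical simulated data (not the grocery items) with one random start and with 50. Because both fits begin from the same seed, the 50-start fit should be at least as good as the single-start fit in terms of total within-cluster sum of squares.

```r
set.seed(1)
# Hypothetical data: 150 points in 2 dimensions drawn around three centres
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

set.seed(1)
fit_single <- kmeans(toy, centers = 3, nstart = 1)   # one random start
set.seed(1)
fit_multi  <- kmeans(toy, centers = 3, nstart = 50)  # best of 50 random starts

# kmeans keeps the run with the lowest total within-cluster sum of squares
c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss)
table(fit_multi$cluster)
```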
Below, the grocery items are clustered into 10 groups. The vast majority of the items are clustered in one group of 138 items, followed by a group of 21 items. The remaining groups have only 1 or 2 items each.
set.seed(1)
k_cluster2 <- kmeans(grocery_df, centers=10, nstart=50, iter.max=10)
table(k_cluster2$cluster)
##
##   1   2   3   4   5   6   7   8   9  10
##   1   1   2   2   1   1   1 138  21   1
Clustering the grocery items into 50 groups still results in one group containing most of the grocery items, with 115 items. Apart from one group of 5 items and one group of 2 items, the other groups have only 1 item each.
set.seed(1)
k_cluster3 <- kmeans(grocery_df, centers=50, nstart=50, iter.max=10)
table(k_cluster3$cluster)
##
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   1   1   1   1   1   1   1   1   1   1   1 115   1   1   1   1   1   1   1   1
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
##   1   1   1   1   5   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
##  41  42  43  44  45  46  47  48  49  50
##   1   2   1   1   1   1   1   1   1   1
Clustering the grocery items further into 100 groups still results in one dominant group, which holds 68 of the 169 items. Apart from two groups of 2 items, the other groups have only 1 item each.
It appears that the items concentrate in one large group whether they are clustered with 5, 10, 50, or 100 centers.
set.seed(1)
k_cluster4 <- kmeans(grocery_df, centers=100, nstart=50, iter.max=10)
table(k_cluster4$cluster)
##
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
##   1   1   1   1   1   1   1   1   1   1  68   1   1   1   1   1   1   1   1   1
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
##   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
Below are the 68 grocery items that fall in that one group (cluster 31) when 100 centers are used.
k_cluster4$cluster[k_cluster4$cluster == 31]
##         abrasive cleaner         artif. sweetener           baby cosmetics
##                       31                       31                       31
##                baby food                     bags         bathroom cleaner
##                       31                       31                       31
##                   brandy             canned fruit                  cereals
##                       31                       31                       31
##                  cleaner             cocoa drinks        cooking chocolate
##                       31                       31                       31
##                 cookware                    cream              curd cheese
##                       31                       31                       31
##              decalcifier              dental care female sanitary products
##                       31                       31                       31
##        finished products                     fish   flower soil/fertilizer
##                       31                       31                       31
##           frozen chicken            frozen fruits               hair spray
##                       31                       31                       31
##                    honey           instant coffee                      jam
##                       31                       31                       31
##                  ketchup           kitchen towels          kitchen utensil
##                       31                       31                       31
##              light bulbs                  liqueur       liquor (appetizer)
##                       31                       31                       31
##               liver loaf          make up remover           male cosmetics
##                       31                       31                       31
##             meat spreads                nut snack              nuts/prunes
##                       31                       31                       31
##         organic products          organic sausage                  popcorn
##                       31                       31                       31
##          potato products    preservation products                 prosecco
##                       31                       31                       31
##           pudding powder              ready soups          rubbing alcohol
##                       31                       31                       31
##                      rum           salad dressing                   sauces
##                       31                       31                       31
##                skin care           snack products                     soap
##                       31                       31                       31
##                 softener     sound storage medium                    soups
##                       31                       31                       31
##           sparkling wine            specialty fat     specialty vegetables
##                       31                       31                       31
##                   spices                    syrup                      tea
##                       31                       31                       31
##                  tidbits           toilet cleaner                  vinegar
##                       31                       31                       31
##                   whisky                 zwieback
##                       31                       31
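One possible reading of these lopsided clusters is that k-means works on Euclidean distance, and rarely bought items have rows that are almost all zeros, so they sit close to one another whether or not they are ever bought together. The base R sketch below uses a hypothetical 4-item, 10-receipt matrix (not the grocery data) to show the effect.

```r
# Hypothetical item-by-transaction matrix: 4 items, 10 receipts
items <- rbind(
  milk  = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0),  # frequent item
  bread = c(1, 1, 1, 0, 0, 0, 1, 1, 0, 0),  # frequent item
  soap  = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # rare item
  honey = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1)   # rare item
)

# Squared Euclidean distance between 0/1 rows equals the number of
# receipts in which exactly one of the two items appears
sq_dist <- round(as.matrix(dist(items))^2)
sq_dist
```

In this toy matrix, soap and honey never share a receipt, yet they are the closest pair (squared distance 2), closer than milk and bread (squared distance 5), which share three receipts. Under that reading, the large cluster above would mostly reflect low purchase frequency rather than items being bought together.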