Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift. Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
Here, the data is read in and the top ten items are displayed. The data summary below lists the most frequent items: whole milk, other vegetables, rolls/buns, soda, and yogurt.
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
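The report does not show the code behind this summary. A minimal sketch of the loading step follows. It assumes the attached file matches the `Groceries` data bundled with arules, which has the same 9835 transactions and 169 items. The file name in the commented-out line is hypothetical, and `groceryData` is the object name that appears in the Apriori output.

```r
library(arules)

# Hypothetical file name; the source does not name the attached file.
# groceryData <- read.transactions("GroceryDataSet.csv", format = "basket", sep = ",")

# Assumption: the bundled Groceries data stands in for the attached file.
data("Groceries")
groceryData <- Groceries

# Transaction count, item count, density, and most frequent items.
summary(groceryData)

# Bar chart of the ten most frequent items by absolute count.
itemFrequencyPlot(groceryData, topN = 10, type = "absolute")
```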
Next, the arules package is used to mine the grocery data for association
rules with the Apriori algorithm.
The plot shows that whole milk is the most frequent of the top ten items.
The table below reports support, confidence, and lift for the
top 10 rules by lift.
| LHS | RHS | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {root vegetables,tropical fruit} | {other vegetables} | 0.0123030 | 0.5845411 | 0.0210473 | 3.020999 | 121 |
| {rolls/buns,root vegetables} | {other vegetables} | 0.0122013 | 0.5020921 | 0.0243010 | 2.594890 | 120 |
| {root vegetables,yogurt} | {other vegetables} | 0.0129131 | 0.5000000 | 0.0258261 | 2.584078 | 127 |
| {root vegetables,yogurt} | {whole milk} | 0.0145399 | 0.5629921 | 0.0258261 | 2.203354 | 143 |
| {domestic eggs,other vegetables} | {whole milk} | 0.0123030 | 0.5525114 | 0.0222674 | 2.162336 | 121 |
| {rolls/buns,root vegetables} | {whole milk} | 0.0127097 | 0.5230126 | 0.0243010 | 2.046888 | 125 |
| {other vegetables,pip fruit} | {whole milk} | 0.0135231 | 0.5175097 | 0.0261312 | 2.025351 | 133 |
| {tropical fruit,yogurt} | {whole milk} | 0.0151500 | 0.5173611 | 0.0292832 | 2.024770 | 149 |
| {other vegetables,yogurt} | {whole milk} | 0.0222674 | 0.5128806 | 0.0434164 | 2.007235 | 219 |
| {other vegetables,whipped/sour cream} | {whole milk} | 0.0146416 | 0.5070423 | 0.0288765 | 1.984385 | 144 |
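As a check on how the columns relate, the measures for one row can be recomputed by hand. For a rule A => B, support is the share of transactions containing both A and B, confidence is that support divided by the support of A (the coverage column), and lift is confidence divided by the support of B. The sketch below uses only numbers already printed in this report, for the rule {root vegetables,yogurt} => {other vegetables}. The variable names are illustrative.

```r
# Values copied from the tables in this report.
n_transactions <- 9835       # total receipts
count_rule     <- 127        # receipts containing both sides of the rule
coverage_lhs   <- 0.0258261  # share of receipts containing {root vegetables, yogurt}
count_rhs      <- 1903       # receipts containing other vegetables

support    <- count_rule / n_transactions                 # share with LHS and RHS together
confidence <- support / coverage_lhs                      # share of LHS receipts that also have the RHS
lift       <- confidence / (count_rhs / n_transactions)   # confidence relative to the RHS base rate

round(c(support = support, confidence = confidence, lift = lift), 4)
```

Because the coverage value is rounded in the table, the results should agree with the table row to about four decimal places rather than exactly.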
With a minimum support of 0.01 and a minimum confidence of 0.2, the Apriori output below shows 232 association rules.
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [232 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 232 rules
##
## rule length distribution (lhs + rhs):sizes
##   1   2   3
##   1 151  80
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   2.000   2.341   3.000   3.000
##
## summary of quality measures:
##     support          confidence        coverage            lift
##  Min.   :0.01007   Min.   :0.2006   Min.   :0.01729   Min.   :0.8991
##  1st Qu.:0.01200   1st Qu.:0.2470   1st Qu.:0.03437   1st Qu.:1.4432
##  Median :0.01490   Median :0.3170   Median :0.05241   Median :1.7277
##  Mean   :0.02005   Mean   :0.3321   Mean   :0.06708   Mean   :1.7890
##  3rd Qu.:0.02227   3rd Qu.:0.4033   3rd Qu.:0.07565   3rd Qu.:2.0762
##  Max.   :0.25552   Max.   :0.5862   Max.   :1.00000   Max.   :3.2950
##      count
##  Min.   :  99.0
##  1st Qu.: 118.0
##  Median : 146.5
##  Mean   : 197.2
##  3rd Qu.: 219.0
##  Max.   :2513.0
##
## mining info:
##         data ntransactions support confidence
##  groceryData          9835    0.01        0.2
##                                                                             call
##  apriori(data = groceryData, parameter = list(support = 0.01, confidence = 0.2))
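The call recorded in the mining info can be rerun and the rules ranked by lift, roughly as sketched here. The sketch again assumes the bundled `Groceries` data stands in for the attached file. The `rules` and `top10` names are illustrative. Any extra filtering used to build the ten-row table in this report is not shown in the source, so the sketch simply sorts all mined rules.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for the attached file.
data("Groceries")
groceryData <- Groceries

# Same parameters as the call recorded in the mining info.
rules <- apriori(groceryData,
                 parameter = list(support = 0.01, confidence = 0.2))
summary(rules)

# sort() orders rules in decreasing order of the chosen measure by default.
top10 <- sort(rules, by = "lift")[1:10]

# Prints LHS, RHS, support, confidence, coverage, lift, and count.
inspect(top10)
```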
The stats package was used to help perform the cluster analysis. We can see that whole milk and other vegetables are the top item cluster features, which is consistent with our rules. For the cluster analysis, we look only at items with more than 4% support.
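The clustering code is not shown in the report, so the following is only one plausible sketch of a simple item clustering under the 4% support cutoff. The Jaccard dissimilarity (the default in arules) and Ward linkage are assumptions, as are the `frequentItems`, `itemDist`, and `itemClusters` names.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for the attached file.
data("Groceries")
groceryData <- Groceries

# Keep only items that appear in more than 4% of transactions.
frequentItems <- groceryData[, itemFrequency(groceryData) > 0.04]

# Pairwise dissimilarity between items (Jaccard is the arules default).
itemDist <- dissimilarity(frequentItems, which = "items")

# Hierarchical clustering from the stats package; Ward linkage is an assumption.
itemClusters <- hclust(itemDist, method = "ward.D2")
plot(itemClusters, main = "Items with > 4% support")
```

Items that merge low in the dendrogram tend to appear on the same receipts, which is one way to compare the clusters against the mined rules.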