HW 10

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift. Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

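Here, the data is read in and the top ten items are displayed. The data summary lists the most frequent items: whole milk (2513 transactions), other vegetables (1903), rolls/buns (1809), soda (1715), and yogurt (1372). The data set contains 9835 transactions over 169 items, and the median basket holds 3 items.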

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
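The density and mean basket size in the summary above can be cross-checked from the item counts it reports. The following is a minimal sketch in base R; the variable names are illustrative and the numbers are copied from the summary, not recomputed from the data file.

```r
# Figures copied from the data summary above
n_transactions <- 9835
n_items        <- 169
# Counts for the five named items plus the "(Other)" bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

# Total number of item occurrences across all baskets
total_purchases <- sum(item_counts)

# Density: share of non-empty cells in the transaction-by-item matrix
density <- total_purchases / (n_transactions * n_items)

# Mean basket size: item occurrences per transaction
mean_basket <- total_purchases / n_transactions

# Support of the most frequent item (whole milk)
support_milk <- 2513 / n_transactions

print(c(density = density, mean_basket = mean_basket, support_milk = support_milk))
```

The results should agree with the reported density of 0.02609146 and mean of 4.409, up to rounding.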

Next, the arules package is used to mine the grocery data for association rules with the Apriori algorithm.
In the plot of the top ten items, whole milk is the most frequent. The table below reports support, confidence, coverage, lift, and count for the top 10 rules by lift. Every rule listed has a confidence of at least 0.5.

LHS	RHS	support	confidence	coverage	lift	count
{root vegetables,tropical fruit}	{other vegetables}	0.0123030	0.5845411	0.0210473	3.020999	121
{rolls/buns,root vegetables}	{other vegetables}	0.0122013	0.5020921	0.0243010	2.594890	120
{root vegetables,yogurt}	{other vegetables}	0.0129131	0.5000000	0.0258261	2.584078	127
{root vegetables,yogurt}	{whole milk}	0.0145399	0.5629921	0.0258261	2.203354	143
{domestic eggs,other vegetables}	{whole milk}	0.0123030	0.5525114	0.0222674	2.162336	121
{rolls/buns,root vegetables}	{whole milk}	0.0127097	0.5230126	0.0243010	2.046888	125
{other vegetables,pip fruit}	{whole milk}	0.0135231	0.5175097	0.0261312	2.025351	133
{tropical fruit,yogurt}	{whole milk}	0.0151500	0.5173611	0.0292832	2.024770	149
{other vegetables,yogurt}	{whole milk}	0.0222674	0.5128806	0.0434164	2.007235	219
{other vegetables,whipped/sour cream}	{whole milk}	0.0146416	0.5070423	0.0288765	1.984385	144
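To make the three measures concrete, the first rule in the table can be recomputed by hand. This is a minimal base R sketch; the variable names are illustrative, the rule figures are copied from the table, and the count for other vegetables (1903) is taken from the data summary.

```r
# Figures for {root vegetables, tropical fruit} => {other vegetables}
n_transactions <- 9835       # total transactions in the data set
count_rule     <- 121        # transactions containing LHS and RHS together
coverage       <- 0.0210473  # support of the LHS alone (from the table)
count_rhs      <- 1903       # transactions containing other vegetables

# Support: share of all transactions that contain every item in the rule
support <- count_rule / n_transactions

# Confidence: share of LHS transactions that also contain the RHS
confidence <- support / coverage

# Lift: confidence relative to the baseline frequency of the RHS
lift <- confidence / (count_rhs / n_transactions)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

A lift above 1 means the right-hand side appears more often in baskets containing the left-hand side than it does in baskets overall.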

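Mining with a minimum support of 0.01 and a minimum confidence of 0.2 yields 232 association rules. Across the full rule set, lift ranges from 0.8991 to 3.2950.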

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [232 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

## set of 232 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3 
##   1 151  80 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.341   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01007   Min.   :0.2006   Min.   :0.01729   Min.   :0.8991  
##  1st Qu.:0.01200   1st Qu.:0.2470   1st Qu.:0.03437   1st Qu.:1.4432  
##  Median :0.01490   Median :0.3170   Median :0.05241   Median :1.7277  
##  Mean   :0.02005   Mean   :0.3321   Mean   :0.06708   Mean   :1.7890  
##  3rd Qu.:0.02227   3rd Qu.:0.4033   3rd Qu.:0.07565   3rd Qu.:2.0762  
##  Max.   :0.25552   Max.   :0.5862   Max.   :1.00000   Max.   :3.2950  
##      count       
##  Min.   :  99.0  
##  1st Qu.: 118.0  
##  Median : 146.5  
##  Mean   : 197.2  
##  3rd Qu.: 219.0  
##  Max.   :2513.0  
## 
## mining info:
##         data ntransactions support confidence
##  groceryData          9835    0.01        0.2
##                                                                             call
##  apriori(data = groceryData, parameter = list(support = 0.01, confidence = 0.2))
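A run like the one logged above could be set up roughly as follows. This is a sketch under stated assumptions, not the code used for this report: it assumes the arules package is installed and uses the `Groceries` data bundled with arules as a stand-in for the attached file. The object names `rules` and `top10` are illustrative.

```r
# Assumption: arules is installed; its bundled Groceries data stands in
# for the attached data set (called groceryData in the output above).
library(arules)
data("Groceries")

# If reading the attached file instead, something like the following could
# be used (the file name here is hypothetical):
# groceryData <- read.transactions("groceries.csv", format = "basket", sep = ",")

# Mine rules with the same thresholds shown in the parameter specification
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.2))

# Sort by lift, strongest first, and keep the first ten rules
top10 <- head(sort(rules, by = "lift"), n = 10)

# Print LHS, RHS, and the quality measures for each rule
inspect(top10)
```

Applying `summary(rules)` to the result would produce a report in the same format as the one above.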

Cluster Analysis

The stats package was used to perform the cluster analysis. For the cluster analysis, we look only at items with more than 4% support. Whole milk and other vegetables are the top item cluster features. This is consistent with our rules, where these two items make up every right-hand side in the top 10 table.
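One way to cluster items with base R is hierarchical clustering on Jaccard distances. The sketch below is an illustration only: the toy baskets are hypothetical data, not the Groceries set, and the choice of `dist(method = "binary")` with average-linkage `hclust` is an assumption rather than the exact procedure used for this report.

```r
# Toy transaction-by-item incidence matrix (hypothetical data):
# one row per basket, 1 means the item was purchased.
baskets <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    1, 0, 1, 0,
    0, 1, 0, 1,
    1, 1, 0, 0,
    0, 0, 1, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("whole milk", "other vegetables", "soda", "yogurt"))
)

# Keep only items whose support exceeds the 4% threshold
min_support  <- 0.04
item_support <- colMeans(baskets)
frequent     <- baskets[, item_support > min_support, drop = FALSE]

# dist() compares rows, so transpose to compare items.
# method = "binary" gives the Jaccard distance between items.
item_dist <- dist(t(frequent), method = "binary")

# Average-linkage hierarchical clustering, cut into two clusters
item_tree <- hclust(item_dist, method = "average")
clusters  <- cutree(item_tree, k = 2)

print(clusters)
```

Items that are often bought together have a small Jaccard distance and therefore merge early in the tree. Calling `plot(item_tree)` draws the dendrogram.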