Homework 10

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Load data of 9,835 transactions and 169 items.

library(arules)  # provides read.transactions(), apriori(), and inspect()
(grocery <- read.transactions("GroceryDataSet.csv", sep = ","))

## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

Summary

Below is a summary of the grocery transactions. Whole milk is the most frequently bought item (2,513 transactions), followed by other vegetables (1,903).

summary(grocery)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Top 20 items by frequency

Below is a frequency plot of the 20 most frequently purchased grocery items. The top 3 items are whole milk, other vegetables, and rolls/buns.

itemFrequencyPlot(grocery, topN=20)

Association Rules

The apriori function of the arules package is used to generate the association rules.

The parameter support is “defined as the proportion of transactions in the data set which contain the itemset. For example the itemset {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).” I tried using support greater than 0.009, but the apriori function threw an error.

The parameter confidence is set to 0.55. This is a minimum threshold: for a rule such as {milk, bread} => {butter} to be kept, at least 55% of the transactions containing milk and bread must also contain butter.
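The support and confidence definitions can be checked by hand on a tiny example. The sketch below uses base R only and a hypothetical five-receipt data set; the baskets data frame and the support() and confidence() helpers are invented for illustration and are not part of arules.

```r
# Hypothetical toy data: 5 receipts, one logical column per item.
baskets <- data.frame(
  milk   = c(TRUE,  TRUE,  FALSE, TRUE,  FALSE),
  bread  = c(TRUE,  TRUE,  TRUE,  FALSE, FALSE),
  butter = c(TRUE,  FALSE, FALSE, TRUE,  FALSE)
)

# Support of an itemset: share of receipts that contain every item in it.
support <- function(items) {
  mean(rowSums(baskets[, items, drop = FALSE]) == length(items))
}

# Confidence of lhs => rhs: support(lhs and rhs together) / support(lhs).
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

supp_mb <- support(c("milk", "bread"))               # 2 of 5 receipts
conf_mb <- confidence(c("milk", "bread"), "butter")  # 1 of those 2 receipts
c(support = supp_mb, confidence = conf_mb)
```

In this toy data {milk, bread} has a support of 0.4, but the rule {milk, bread} => {butter} has a confidence of only 0.5, so a minimum confidence of 0.55 would discard it.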

Use minlen = 2 so that the LHS (antecedent) is not empty. By default, apriori uses minlen = 1, which allows rules with an empty LHS.

Rules are kept only if they meet both the minimum support and the minimum confidence. The model below, with support of 0.009 and confidence of 0.55, generated 10 association rules.

Source: https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf

https://www.rdocumentation.org/packages/arules/versions/1.6-7/topics/apriori

basket_model <- apriori(grocery, parameter = list(support=.009, confidence=0.55 , minlen=2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.55    0.1    1 none FALSE            TRUE       5   0.009      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 88 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [93 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Summary of model

Below is a summary of the base model, with a support value of 0.009 and a confidence value of 0.55, applied to the data set of 9,835 transactions. All 10 rules have length 3 (LHS and RHS combined).

summary(basket_model)

## set of 10 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence        coverage            lift      
##  Min.   :0.009354   Min.   :0.5525   Min.   :0.01464   Min.   :2.162  
##  1st Qu.:0.009914   1st Qu.:0.5648   1st Qu.:0.01721   1st Qu.:2.210  
##  Median :0.010930   Median :0.5738   Median :0.01886   Median :2.246  
##  Mean   :0.011174   Mean   :0.5779   Mean   :0.01941   Mean   :2.408  
##  3rd Qu.:0.012227   3rd Qu.:0.5840   3rd Qu.:0.02105   3rd Qu.:2.445  
##  Max.   :0.014540   Max.   :0.6389   Max.   :0.02583   Max.   :3.030  
##      count      
##  Min.   : 92.0  
##  1st Qu.: 97.5  
##  Median :107.5  
##  Mean   :109.9  
##  3rd Qu.:120.2  
##  Max.   :143.0  
## 
## mining info:
##     data ntransactions support confidence
##  grocery          9835   0.009       0.55

Rules sorted by Lift

When many association rules satisfy the support and confidence constraints, lift can be used to further filter or rank them. Higher lift values indicate a stronger association.

The rule with the highest lift (3.03) is {citrus fruit, root vegetables} => {other vegetables}. The rule with the lowest lift (2.16) is {domestic eggs, other vegetables} => {whole milk}.

inspect(sort(basket_model, by="lift")[1:10])

##      lhs                                     rhs                support    
## [1]  {citrus fruit,root vegetables}       => {other vegetables} 0.010371124
## [2]  {root vegetables,tropical fruit}     => {other vegetables} 0.012302999
## [3]  {butter,yogurt}                      => {whole milk}       0.009354347
## [4]  {curd,yogurt}                        => {whole milk}       0.010066090
## [5]  {curd,other vegetables}              => {whole milk}       0.009862735
## [6]  {butter,other vegetables}            => {whole milk}       0.011489578
## [7]  {root vegetables,tropical fruit}     => {whole milk}       0.011997966
## [8]  {root vegetables,yogurt}             => {whole milk}       0.014539908
## [9]  {root vegetables,whipped/sour cream} => {whole milk}       0.009456024
## [10] {domestic eggs,other vegetables}     => {whole milk}       0.012302999
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.6388889  0.01464159 2.500387  92  
## [4]  0.5823529  0.01728521 2.279125  99  
## [5]  0.5739645  0.01718353 2.246296  97  
## [6]  0.5736041  0.02003050 2.244885 113  
## [7]  0.5700483  0.02104728 2.230969 118  
## [8]  0.5629921  0.02582613 2.203354 143  
## [9]  0.5535714  0.01708185 2.166484  93  
## [10] 0.5525114  0.02226741 2.162336 121
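The quality measures in this table are related by simple arithmetic, which can be verified for rule [1] using only numbers printed in this report: 9,835 transactions, 1,903 transactions containing other vegetables, a rule count of 102, and a confidence of 0.5862069. The sketch below assumes the usual definitions (lift = confidence / support of the RHS, coverage = support / confidence); the variable names are chosen here for illustration.

```r
# Figures copied from the outputs above (summary(grocery) and rule [1]).
n_transactions <- 9835       # total number of receipts
n_other_veg    <- 1903       # receipts containing "other vegetables" (the RHS)
rule_count     <- 102        # receipts containing LHS and RHS together
rule_conf      <- 0.5862069  # reported confidence of rule [1]

rule_support  <- rule_count / n_transactions   # share of receipts matching the whole rule
rhs_support   <- n_other_veg / n_transactions  # baseline share of receipts with the RHS
rule_lift     <- rule_conf / rhs_support       # how much the LHS raises the odds of the RHS
rule_coverage <- rule_support / rule_conf      # share of receipts containing the LHS

round(c(support = rule_support, coverage = rule_coverage, lift = rule_lift), 4)
```

A lift of about 3 means that receipts containing citrus fruit and root vegetables include other vegetables roughly three times as often as receipts in general.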

Graph of Rules

library(arulesViz)  # provides the plot method for rules
rules <- head(basket_model, n = 10, by = "lift")
plot(rules, method = "graph")

Clustering

Convert the grocery transaction data into a data frame with 169 rows (one per item) and 9,835 columns (one per transaction), so that the grocery items are clustered based on the transactions in which they appear.

grocery_df <- as.data.frame(as.matrix(grocery@data)) 
row.names(grocery_df) <- grocery@itemInfo$labels
dim(grocery_df)

## [1]  169 9835

K-means clustering

The grocery items are clustered into 5 groups, as specified by the centers parameter. The algorithm is repeated 50 times, as specified by nstart, each time with a different random set of initial centers, and the best result is kept.

Most of the items fall into one group of 143 items, followed by a group of 22 items. The remaining three groups contain only 1 or 2 items each.

set.seed(1)
k_cluster <- kmeans(grocery_df, centers=5, nstart=50, iter.max=10)
table(k_cluster$cluster)

## 
##   1   2   3   4   5 
##  22   1   2   1 143
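To see what the centers and nstart arguments do in isolation, here is a minimal sketch on a hypothetical 6-item, 8-transaction matrix. The toy matrix and its item names are invented for illustration; rows are items and columns are transactions, mirroring the layout of grocery_df.

```r
# Hypothetical toy matrix: 6 items (rows) x 8 transactions (columns), 1 = item was bought.
toy <- rbind(
  apples  = c(1, 1, 1, 1, 0, 0, 0, 0),
  pears   = c(1, 1, 1, 0, 0, 0, 0, 0),
  grapes  = c(1, 1, 0, 1, 0, 0, 0, 0),
  soap    = c(0, 0, 0, 0, 1, 1, 1, 1),
  shampoo = c(0, 0, 0, 0, 1, 1, 1, 0),
  sponge  = c(0, 0, 0, 0, 1, 1, 0, 1)
)

set.seed(1)
# centers = 2 asks for two groups; nstart = 10 tries 10 random sets of
# starting centers and keeps the run with the lowest total within-cluster
# sum of squares.
toy_fit <- kmeans(toy, centers = 2, nstart = 10)

# List the item names that ended up in each cluster.
split(rownames(toy), toy_fit$cluster)
```

Because the two sets of items in this toy data never share a transaction, the fruit items should land in one cluster and the cleaning items in the other. The real grocery data is far less cleanly separated.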

Below, the grocery items are clustered into 10 groups. The vast majority of the items fall into one group of 138 items, followed by a group of 21 items. The remaining eight groups contain only 1 or 2 items each.

set.seed(1)
k_cluster2 <- kmeans(grocery_df, centers=10, nstart=50, iter.max=10)
table(k_cluster2$cluster)

## 
##   1   2   3   4   5   6   7   8   9  10 
##   1   1   2   2   1   1   1 138  21   1

Clustering the grocery items into 50 groups still results in one group containing most of the grocery items (115 items). One other group has 5 items, one has 2 items, and the remaining 47 groups have 1 item each.

set.seed(1)
k_cluster3 <- kmeans(grocery_df, centers=50, nstart=50, iter.max=10)
table(k_cluster3$cluster)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1 115   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   5   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50 
##   1   2   1   1   1   1   1   1   1   1

Clustering the grocery items into 100 groups still produces one dominant group, now with 68 items. Two groups have 2 items each and the remaining 97 groups have 1 item each.

A single large group persists whether the items are clustered with 5, 10, 50, or 100 centers, although it shrinks from 143 items to 68 as the number of centers grows.

set.seed(1)
k_cluster4 <- kmeans(grocery_df, centers=100, nstart=50, iter.max=10)
table(k_cluster4$cluster)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   1   1   1   1   1   1  68   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

Below are the 68 grocery items in the largest group (cluster 31) when 100 centers are used.

k_cluster4$cluster[k_cluster4$cluster == 31]

##         abrasive cleaner         artif. sweetener           baby cosmetics 
##                       31                       31                       31 
##                baby food                     bags         bathroom cleaner 
##                       31                       31                       31 
##                   brandy             canned fruit                  cereals 
##                       31                       31                       31 
##                  cleaner             cocoa drinks        cooking chocolate 
##                       31                       31                       31 
##                 cookware                    cream              curd cheese 
##                       31                       31                       31 
##              decalcifier              dental care female sanitary products 
##                       31                       31                       31 
##        finished products                     fish   flower soil/fertilizer 
##                       31                       31                       31 
##           frozen chicken            frozen fruits               hair spray 
##                       31                       31                       31 
##                    honey           instant coffee                      jam 
##                       31                       31                       31 
##                  ketchup           kitchen towels          kitchen utensil 
##                       31                       31                       31 
##              light bulbs                  liqueur       liquor (appetizer) 
##                       31                       31                       31 
##               liver loaf          make up remover           male cosmetics 
##                       31                       31                       31 
##             meat spreads                nut snack              nuts/prunes 
##                       31                       31                       31 
##         organic products          organic sausage                  popcorn 
##                       31                       31                       31 
##          potato products    preservation products                 prosecco 
##                       31                       31                       31 
##           pudding powder              ready soups          rubbing alcohol 
##                       31                       31                       31 
##                      rum           salad dressing                   sauces 
##                       31                       31                       31 
##                skin care           snack products                     soap 
##                       31                       31                       31 
##                 softener     sound storage medium                    soups 
##                       31                       31                       31 
##           sparkling wine            specialty fat     specialty vegetables 
##                       31                       31                       31 
##                   spices                    syrup                      tea 
##                       31                       31                       31 
##                  tidbits           toilet cleaner                  vinegar 
##                       31                       31                       31 
##                   whisky                 zwieback 
##                       31                       31
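One possible reading of the persistent large group, offered as an assumption and not something this analysis tests, is that k-means uses Euclidean distance on the 0/1 purchase vectors. For such vectors the squared distance between two items equals the number of transactions that contain exactly one of them, so items that are rarely bought sit close together even if they are never bought together. The sketch below illustrates this on a hypothetical 4-item, 10-transaction matrix; the item names are invented.

```r
# Hypothetical toy data: 4 items (rows) x 10 transactions (columns).
items <- rbind(
  rare_a     = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  rare_b     = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  frequent_a = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
  frequent_b = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
)

# dist() returns Euclidean distances between rows; squaring them gives,
# for 0/1 vectors, the count of transactions containing exactly one of
# the two items.
sq_dist <- as.matrix(dist(items))^2
round(sq_dist)
```

Here the two rare items never share a receipt yet are only 2 apart in squared distance, while the two frequent items share two receipts and are 8 apart. Under this reading, infrequent items would tend to be pulled into one group, which would be consistent with the 68 items listed above, but the item frequencies have not been checked here.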