R Markdown

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. In other words, each receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Observe Data

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
grocerydata <- read.transactions("GroceryDataSet.csv", sep = ",")
summary(grocerydata)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
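
As a quick sanity check, the density and mean basket size reported by summary() can be reproduced from the item counts printed above. This is a minimal base R sketch: the variable names are illustrative, and the counts are copied by hand from the summary output rather than recomputed from the file.

```r
# Values copied from the summary() output above
n_transactions <- 9835
n_items        <- 169
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items plus "(Other)"

# Total number of purchased items across all receipts
total_items <- sum(item_counts)

# Density: share of filled cells in the transactions-by-items matrix
density <- total_items / (n_transactions * n_items)

# Mean basket size: average number of items per receipt
mean_basket <- total_items / n_transactions

print(round(density, 8))
print(round(mean_basket, 3))
```

Both values should agree with the density of 0.02609146 and the mean of 4.409 shown in the summary.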

Visualize Frequent Items

We can visualize the most frequent items using itemFrequencyPlot. The summary above shows that the top items are whole milk, other vegetables and rolls/buns. It would be interesting to see the top grocery items during Covid.

itemFrequencyPlot(grocerydata, topN = 30, main = "Top 30 Frequent Items")
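
The item counts in the summary output can be turned into relative frequencies, which, as far as I know, is what itemFrequencyPlot draws by default. The sketch below uses only base R and the five counts printed by summary(); the vector names are illustrative.

```r
# Counts of the five most frequent items, copied from the summary() output
n_transactions <- 9835
top_counts <- c("whole milk"       = 2513,
                "other vegetables" = 1903,
                "rolls/buns"       = 1809,
                "soda"             = 1715,
                "yogurt"           = 1372)

# Relative frequency: share of receipts that contain each item
rel_freq <- top_counts / n_transactions

print(round(rel_freq, 4))
```

Whole milk, for example, appears in roughly a quarter of all receipts.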

Create Rules

We created rules using a minimum support of 0.005, a minimum confidence of 0.6 and a maximum rule length of 3. This returned 13 rules.

rules <- apriori(grocerydata, parameter = list(supp = 0.005, conf = 0.6, maxlen=3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target   ext
##       3  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.04s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3
## Warning in apriori(grocerydata, parameter = list(supp = 0.005, conf = 0.6, :
## Mining stopped (maxlen reached). Only patterns up to a length of 3 returned!
##  done [0.01s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules)
## set of 13 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 13 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence          lift           count      
##  Min.   :0.005186   Min.   :0.6022   Min.   :2.357   Min.   :51.00  
##  1st Qu.:0.005592   1st Qu.:0.6071   1st Qu.:2.434   1st Qu.:55.00  
##  Median :0.005999   Median :0.6224   Median :2.480   Median :59.00  
##  Mean   :0.006398   Mean   :0.6249   Mean   :2.562   Mean   :62.92  
##  3rd Qu.:0.006711   3rd Qu.:0.6378   3rd Qu.:2.537   3rd Qu.:66.00  
##  Max.   :0.009354   Max.   :0.6600   Max.   :3.124   Max.   :92.00  
## 
## mining info:
##         data ntransactions support confidence
##  grocerydata          9835   0.005        0.6
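
To relate the support threshold to actual receipt counts, the arithmetic can be sketched in base R. The numbers are taken from the apriori log and the summary above; the variable names are illustrative, and the truncation step is an assumption that happens to match the "Absolute minimum support count: 49" line in the log.

```r
n_transactions <- 9835
min_support    <- 0.005

# Support threshold expressed as a number of receipts
threshold <- min_support * n_transactions   # 49.175
abs_min_count <- trunc(threshold)           # assumed truncation, matches the logged 49

# Smallest and largest rule counts from the summary, converted back to support
min_rule_support <- 51 / n_transactions
max_rule_support <- 92 / n_transactions

print(abs_min_count)
print(round(c(min_rule_support, max_rule_support), 6))
```

The two supports should line up with the Min. and Max. of the support column in the summary, and both sit above the 0.005 threshold.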

Inspect Rules

inspect(rules)
##      lhs                               rhs                support    
## [1]  {onions,root vegetables}       => {other vegetables} 0.005693950
## [2]  {curd,tropical fruit}          => {whole milk}       0.006507372
## [3]  {domestic eggs,margarine}      => {whole milk}       0.005185562
## [4]  {butter,domestic eggs}         => {whole milk}       0.005998983
## [5]  {butter,whipped/sour cream}    => {whole milk}       0.006710727
## [6]  {bottled water,butter}         => {whole milk}       0.005388917
## [7]  {butter,tropical fruit}        => {whole milk}       0.006202339
## [8]  {butter,root vegetables}       => {whole milk}       0.008235892
## [9]  {butter,yogurt}                => {whole milk}       0.009354347
## [10] {domestic eggs,pip fruit}      => {whole milk}       0.005388917
## [11] {domestic eggs,tropical fruit} => {whole milk}       0.006914082
## [12] {pip fruit,whipped/sour cream} => {other vegetables} 0.005592272
## [13] {pip fruit,whipped/sour cream} => {whole milk}       0.005998983
##      confidence lift     count
## [1]  0.6021505  3.112008 56   
## [2]  0.6336634  2.479936 64   
## [3]  0.6219512  2.434099 51   
## [4]  0.6210526  2.430582 59   
## [5]  0.6600000  2.583008 66   
## [6]  0.6022727  2.357084 53   
## [7]  0.6224490  2.436047 61   
## [8]  0.6377953  2.496107 81   
## [9]  0.6388889  2.500387 92   
## [10] 0.6235294  2.440275 53   
## [11] 0.6071429  2.376144 68   
## [12] 0.6043956  3.123610 55   
## [13] 0.6483516  2.537421 59
library("arulesViz")
## Loading required package: grid
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
library("grid")
plot(rules, method='two-key plot')

top5 <- head(rules, n=5, by ="lift")
plot(top5, method = "graph", engine="htmlwidget")

When we inspect our rules, we see that the majority of them are associated with milk, butter and vegetables.
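
The quality measures in the inspect output can be checked by hand. Support is the share of receipts that contain every item in the rule, and lift is the rule's confidence divided by the support of its right-hand side. The base R sketch below does this for {butter,yogurt} => {whole milk}, using the count and confidence from the inspect output and the whole milk count from the earlier summary; the variable names are illustrative.

```r
n_transactions  <- 9835
rule_count      <- 92         # receipts containing butter, yogurt and whole milk
rule_confidence <- 0.6388889  # confidence reported by inspect()
rhs_count       <- 2513       # receipts containing whole milk, from summary()

# Support of the rule: fraction of receipts containing all of its items
support <- rule_count / n_transactions

# Support of the right-hand side alone
rhs_support <- rhs_count / n_transactions

# Lift: how much more likely whole milk is given butter and yogurt
lift <- rule_confidence / rhs_support

print(round(support, 9))
print(round(lift, 4))
```

A lift of about 2.5 means that whole milk shows up about two and a half times as often in baskets with butter and yogurt as it does across all baskets.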

Top 10 Rules by Lift

inspect(head(rules, n = 10, by = "lift"))
##      lhs                               rhs                support    
## [1]  {pip fruit,whipped/sour cream} => {other vegetables} 0.005592272
## [2]  {onions,root vegetables}       => {other vegetables} 0.005693950
## [3]  {butter,whipped/sour cream}    => {whole milk}       0.006710727
## [4]  {pip fruit,whipped/sour cream} => {whole milk}       0.005998983
## [5]  {butter,yogurt}                => {whole milk}       0.009354347
## [6]  {butter,root vegetables}       => {whole milk}       0.008235892
## [7]  {curd,tropical fruit}          => {whole milk}       0.006507372
## [8]  {domestic eggs,pip fruit}      => {whole milk}       0.005388917
## [9]  {butter,tropical fruit}        => {whole milk}       0.006202339
## [10] {domestic eggs,margarine}      => {whole milk}       0.005185562
##      confidence lift     count
## [1]  0.6043956  3.123610 55   
## [2]  0.6021505  3.112008 56   
## [3]  0.6600000  2.583008 66   
## [4]  0.6483516  2.537421 59   
## [5]  0.6388889  2.500387 92   
## [6]  0.6377953  2.496107 81   
## [7]  0.6336634  2.479936 64   
## [8]  0.6235294  2.440275 53   
## [9]  0.6224490  2.436047 61   
## [10] 0.6219512  2.434099 51