DATA 624 Homework 10 - Market Basket Analysis

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Do a simple cluster analysis on the data as well. Use whichever packages you like.

library(arules)
library(RColorBrewer)
library(kableExtra)

Load Data

Let’s load the data and examine it.

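# stringsAsFactors = TRUE so that summary() reports per-column item counts (the default is FALSE since R 4.0)
data <- read.csv("https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_624/data/GroceryDataSet.csv", header=FALSE, stringsAsFactors = TRUE)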

# View Summary
summary(data) %>%
  kable() %>%
    kable_styling()

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	V29	V30	V31	V32
sausage : 825	:2159	:3802	:5101	:6106	:6961	:7606	:8151	:8589	:8939	:9185	:9367	:9484	:9562	:9639	:9694	:9740	:9769	:9783	:9797	:9806	:9817	:9821	:9827	:9828	:9828	:9829	:9830	:9831	:9834	:9834	:9834
whole milk : 717	whole milk : 654	whole milk : 506	whole milk : 315	rolls/buns : 176	soda : 150	soda : 120	shopping bags: 76	soda : 61	shopping bags : 49	shopping bags: 40	soda : 30	soda : 24	shopping bags : 18	shopping bags : 16	shopping bags : 11	napkins : 8	candy : 5	detergent : 4	bottled beer : 3	napkins : 4	napkins : 2	waffles : 2	bottled beer : 2	chocolate : 2	chocolate : 1	abrasive cleaner : 1	chocolate : 1	cooking chocolate : 1	skin care: 1	hygiene articles: 1	candles: 1
frankfurter : 580	other vegetables: 550	other vegetables: 415	other vegetables: 254	soda : 168	rolls/buns : 146	shopping bags: 107	bottled water: 68	shopping bags : 56	soda : 39	newspapers : 36	shopping bags : 19	shopping bags : 18	fruit/vegetable juice: 17	napkins : 13	napkins : 9	chocolate : 5	chocolate : 5	fruit/vegetable juice: 4	napkins : 3	fruit/vegetable juice : 2	baking powder : 1	chocolate marshmallow: 1	bottled water: 1	fruit/vegetable juice : 1	female sanitary products: 1	chocolate : 1	hygiene articles: 1	house keeping products: 2	NA	NA	NA
tropical fruit : 482	root vegetables : 383	rolls/buns : 293	rolls/buns : 238	yogurt : 160	shopping bags: 107	rolls/buns : 92	newspapers : 66	fruit/vegetable juice: 55	fruit/vegetable juice: 34	pastry : 27	chocolate : 17	fruit/vegetable juice: 16	newspapers : 14	fruit/vegetable juice: 11	chocolate : 8	newspapers : 5	napkins : 5	shopping bags : 4	pot plants : 3	house keeping products: 2	bottled beer : 1	cling film/bags : 1	cake bar : 1	liquor (appetizer) : 1	long life bakery product: 1	hygiene articles : 2	napkins : 2	soups : 1	NA	NA	NA
other vegetables: 460	rolls/buns : 378	yogurt : 289	soda : 211	whole milk : 149	bottled water: 95	newspapers : 68	rolls/buns : 59	bottled water : 54	newspapers : 33	bottled water: 25	fruit/vegetable juice: 17	napkins : 14	soda : 14	hygiene articles : 11	hygiene articles : 6	candy : 4	newspapers : 3	chocolate : 3	candy : 2	hygiene articles : 2	cleaner : 1	dental care : 1	coffee : 1	long life bakery product: 1	margarine : 1	long life bakery product: 1	sugar : 1	NA	NA	NA	NA
citrus fruit : 453	tropical fruit : 355	soda : 229	yogurt : 202	shopping bags: 145	yogurt : 93	domestic eggs: 57	soda : 59	newspapers : 51	bottled water : 26	napkins : 23	napkins : 17	newspapers : 14	napkins : 11	candy : 9	long life bakery product: 6	fruit/vegetable juice: 4	bottled water: 2	bottled water : 2	hygiene articles: 2	candles : 1	cling film/bags: 1	dog food : 1	flour : 1	pasta : 1	rum : 1	specialty fat : 1	NA	NA	NA	NA	NA
(Other) :6318	(Other) :5356	(Other) :4301	(Other) :3514	(Other) :2931	(Other) :2283	(Other) :1785	(Other) :1356	(Other) : 969	(Other) : 715	(Other) : 499	(Other) : 368	(Other) : 265	(Other) : 199	(Other) : 136	(Other) : 101	(Other) : 69	(Other) : 46	(Other) : 35	(Other) : 25	(Other) : 18	(Other) : 12	(Other) : 8	(Other) : 2	white wine : 1	(Other) : 2	NA	NA	NA	NA	NA	NA

groceryDataset = read.transactions("https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_624/data/GroceryDataSet.csv", sep = ',', rm.duplicates = TRUE)

Analyse data

# View summary
summary(groceryDataset)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

As the summary shows, ‘whole milk’ is the most frequent item (2513 transactions), followed by ‘other vegetables’ (1903). Let’s visualize this with an item frequency plot.

itemFrequencyPlot(groceryDataset,topN=10, type="absolute", col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")

The plot makes the ranking easier to read than the summary table. itemFrequencyPlot draws a bar plot of item frequencies from the itemMatrix; here it shows absolute counts for the 10 most frequent items.
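The absolute counts can also be read as relative frequencies, and the density reported by summary() can be checked by hand. The sketch below is only an illustration: it reuses the numbers printed in the summary output above rather than reloading the data, and all variable names are made up for this example.

```r
# Figures copied from the summary(groceryDataset) output above
n_transactions <- 9835
n_items        <- 169
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
other_count <- 34055

# Relative frequency = share of transactions that contain the item
rel_freq <- top_counts / n_transactions
print(round(rel_freq, 4))

# Density = filled cells / all cells of the transaction-by-item matrix
total_purchases <- sum(top_counts) + other_count
density <- total_purchases / (n_transactions * n_items)
print(density)

# Mean basket size follows from the same total
mean_basket <- total_purchases / n_transactions
print(mean_basket)
```

The computed density and mean basket size should agree with the 0.02609146 and 4.409 reported in the summary.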

Train/ Extract

Let’s run the Apriori algorithm to extract rules, after defining minimum support and confidence values. Support is the fraction of all transactions that contain an itemset, i.e. how likely the items are to be purchased together.

# Minimum support: 42 transactions (6 * 7) as a fraction of all transactions
min_support <- 6 * 7 / nrow(groceryDataset)
min_support

## [1] 0.004270463

The apriori calls below use this value rounded to 0.004. The second threshold is confidence, which is the likelihood of a product being purchased given that another product is purchased.

Confidence(p1 -> p2) = (# of transactions where p1 and p2 were purchased) / (# of transactions where p1 was purchased)
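To make the two definitions concrete, here is a minimal base R sketch on five made-up receipts. The items and the helper contains_all are hypothetical and are not taken from the grocery file; "bread" plays the role of p1 and "milk" the role of p2.

```r
# Five made-up receipts (illustrative data, not taken from the grocery file)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk"),
  c("bread", "eggs", "milk")
)

# TRUE for each basket that contains every item in `items`
contains_all <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

n_p1   <- sum(contains_all("bread"))             # baskets with p1
n_both <- sum(contains_all(c("bread", "milk")))  # baskets with p1 and p2

support    <- n_both / length(baskets)  # share of all baskets with both items
confidence <- n_both / n_p1             # share of p1 baskets that also have p2

print(c(support = support, confidence = confidence))
```

Support is measured against all baskets, while confidence is measured only against the baskets that already contain p1, which is why confidence is never smaller than support for the same rule.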

# Training Apriori on the grocery dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.6))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [40 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

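Inspect the top 10 rules by lift. Lift is the rule’s confidence divided by the support of the right-hand side; values above 1 mean the items occur together more often than would be expected if they were independent.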

inspect(sort(rules, by = 'lift')[1:10])

##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       tropical fruit}     => {other vegetables} 0.004473818  0.7857143 0.005693950 4.060694    44
## [2]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [3]  {pip fruit,                                                                                 
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005490595  0.6136364 0.008947636 3.171368    54
## [4]  {root vegetables,                                                                           
##       tropical fruit,                                                                            
##       yogurt}             => {other vegetables} 0.004982206  0.6125000 0.008134215 3.165495    49
## [5]  {pip fruit,                                                                                 
##       whipped/sour cream} => {other vegetables} 0.005592272  0.6043956 0.009252669 3.123610    55
## [6]  {onions,                                                                                    
##       root vegetables}    => {other vegetables} 0.005693950  0.6021505 0.009456024 3.112008    56
## [7]  {curd,                                                                                      
##       domestic eggs}      => {whole milk}       0.004778851  0.7343750 0.006507372 2.874086    47
## [8]  {butter,                                                                                    
##       curd}               => {whole milk}       0.004880529  0.7164179 0.006812405 2.803808    48
## [9]  {tropical fruit,                                                                            
##       whipped/sour cream,                                                                        
##       yogurt}             => {whole milk}       0.004372140  0.7049180 0.006202339 2.758802    43
## [10] {root vegetables,                                                                           
##       tropical fruit,                                                                            
##       yogurt}             => {whole milk}       0.005693950  0.7000000 0.008134215 2.739554    56
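The columns of rule [1] can be reproduced by hand, which shows how support, confidence, coverage and lift relate. This is a sketch that only reuses numbers printed above (the count and coverage of rule [1], and the ‘other vegetables’ count from the summary); the variable names are illustrative.

```r
# Counts read from the output above
n_transactions <- 9835
count_rule     <- 44    # baskets containing lhs and rhs (count column, rule [1])
count_rhs      <- 1903  # baskets containing 'other vegetables' (summary output)

# Coverage is support(lhs); the lhs count is recovered from the printed value
count_lhs <- round(0.005693950 * n_transactions)

support    <- count_rule / n_transactions
confidence <- count_rule / count_lhs
lift       <- confidence / (count_rhs / n_transactions)

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

The result should match the support, confidence and lift columns of rule [1]: customers who buy citrus fruit, root vegetables and tropical fruit are about four times as likely to buy other vegetables as a customer picked at random.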

We started with a minimum confidence of 0.6, which yields only 40 rules. Let’s lower it to 0.4 and see whether rules with higher lift appear.

# Training Apriori on the dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.4))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [432 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Let’s see the top rules after lowering the minimum confidence to 0.4.

# Inspect the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])

##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {liquor}             => {bottled beer}     0.004677173  0.4220183 0.011082867 5.240594    46
## [2]  {herbs,                                                                                     
##       whole milk}         => {root vegetables}  0.004168785  0.5394737 0.007727504 4.949369    41
## [3]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       tropical fruit}     => {root vegetables}  0.004473818  0.4943820 0.009049314 4.535678    44
## [4]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       root vegetables}    => {tropical fruit}   0.004473818  0.4313725 0.010371124 4.110997    44
## [5]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       whole milk}         => {root vegetables}  0.005795628  0.4453125 0.013014743 4.085493    57
## [6]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       tropical fruit}     => {other vegetables} 0.004473818  0.7857143 0.005693950 4.060694    44
## [7]  {herbs}              => {root vegetables}  0.007015760  0.4312500 0.016268429 3.956477    69
## [8]  {tropical fruit,                                                                            
##       whipped/sour cream,                                                                        
##       whole milk}         => {yogurt}           0.004372140  0.5512821 0.007930859 3.951792    43
## [9]  {citrus fruit,                                                                              
##       pip fruit}          => {tropical fruit}   0.005592272  0.4044118 0.013828165 3.854060    55
## [10] {whipped/sour cream,                                                                        
##       whole milk,                                                                                
##       yogurt}             => {tropical fruit}   0.004372140  0.4018692 0.010879512 3.829829    43

The top lift is higher than before (5.24 versus 4.06), but there is still room for improvement. For example, ‘citrus fruit’ appears in several of the rules. We can lower the minimum confidence further to 0.2 and evaluate the result.

# Training Apriori on the dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1268 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Inspect the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])

##      lhs                     rhs                      support confidence    coverage     lift count
## [1]  {flour}              => {sugar}              0.004982206  0.2865497 0.017386884 8.463112    49
## [2]  {processed cheese}   => {white bread}        0.004168785  0.2515337 0.016573462 5.975445    41
## [3]  {liquor}             => {bottled beer}       0.004677173  0.4220183 0.011082867 5.240594    46
## [4]  {berries,                                                                                     
##       whole milk}         => {whipped/sour cream} 0.004270463  0.3620690 0.011794611 5.050990    42
## [5]  {herbs,                                                                                       
##       whole milk}         => {root vegetables}    0.004168785  0.5394737 0.007727504 4.949369    41
## [6]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       tropical fruit}     => {root vegetables}    0.004473818  0.4943820 0.009049314 4.535678    44
## [7]  {other vegetables,                                                                            
##       root vegetables,                                                                             
##       tropical fruit}     => {citrus fruit}       0.004473818  0.3636364 0.012302999 4.393567    44
## [8]  {whipped/sour cream,                                                                          
##       yogurt}             => {curd}               0.004575496  0.2205882 0.020742247 4.140239    45
## [9]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       root vegetables}    => {tropical fruit}     0.004473818  0.4313725 0.010371124 4.110997    44
## [10] {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       whole milk}         => {root vegetables}    0.005795628  0.4453125 0.013014743 4.085493    57
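Lowering the confidence threshold also admits many overlapping rules. One possible way to thin them out is is.redundant() from arules, which flags a rule when a more general rule with the same right-hand side has equal or higher confidence. The sketch below makes one loud assumption: it uses the Groceries data bundled with arules as a stand-in for the CSV loaded earlier. It does not remove rules that are merely permutations of the same itemset.

```r
library(arules)

# Assumption: the Groceries data shipped with arules stands in for the CSV above
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.004, confidence = 0.2),
                 control = list(verbose = FALSE))

# Drop rules whose extra left-hand-side items add no confidence
# over a simpler rule with the same right-hand side
pruned <- rules[!is.redundant(rules)]

# Top 10 remaining rules by lift
top <- head(sort(pruned, by = "lift"), 10)
inspect(top)
```

Sorting after pruning keeps the report focused on rules that each add information, at the cost of one more pass over the rule set.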

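With a minimum confidence of 0.2 we get 1268 rules and the highest lift so far: {flour} => {sugar} at 8.46. The association rules look better by lift, but note the trade-off: the top rules now have lower confidence, so only about 29% of baskets with flour also contain sugar.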
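The effect of the confidence threshold across the three runs can be collected in one small table. This sketch just copies the rule counts and the top rule of each run from the outputs above into a data frame; the column names are illustrative.

```r
# Values copied from the three apriori runs above (support fixed at 0.004)
runs <- data.frame(
  min_confidence = c(0.6, 0.4, 0.2),
  n_rules        = c(40, 432, 1268),
  top_rule       = c("{citrus fruit, root vegetables, tropical fruit} => {other vegetables}",
                     "{liquor} => {bottled beer}",
                     "{flour} => {sugar}"),
  top_lift       = c(4.060694, 5.240594, 8.463112),
  top_confidence = c(0.7857143, 0.4220183, 0.2865497)
)

# Growth in rule count relative to the strictest threshold
runs$rules_ratio <- runs$n_rules / runs$n_rules[1]

print(runs)
```

Reading down the table, each lower threshold raises both the number of rules and the best lift, while the confidence of the top rule falls.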

Monu Chacko

5/9/2021
