Imagine 10000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Do a simple cluster analysis on the data as well. Use whichever packages you like.
library(arules)
library(RColorBrewer)
library(kableExtra)
Let’s load the data and examine it.
data <- read.csv("https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_624/data/GroceryDataSet.csv", header=FALSE)
# View Summary
summary(data) %>%
kable() %>%
kable_styling()
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sausage : 825 | :2159 | :3802 | :5101 | :6106 | :6961 | :7606 | :8151 | :8589 | :8939 | :9185 | :9367 | :9484 | :9562 | :9639 | :9694 | :9740 | :9769 | :9783 | :9797 | :9806 | :9817 | :9821 | :9827 | :9828 | :9828 | :9829 | :9830 | :9831 | :9834 | :9834 | :9834 | |
whole milk : 717 | whole milk : 654 | whole milk : 506 | whole milk : 315 | rolls/buns : 176 | soda : 150 | soda : 120 | shopping bags: 76 | soda : 61 | shopping bags : 49 | shopping bags: 40 | soda : 30 | soda : 24 | shopping bags : 18 | shopping bags : 16 | shopping bags : 11 | napkins : 8 | candy : 5 | detergent : 4 | bottled beer : 3 | napkins : 4 | napkins : 2 | waffles : 2 | bottled beer : 2 | chocolate : 2 | chocolate : 1 | abrasive cleaner : 1 | chocolate : 1 | cooking chocolate : 1 | skin care: 1 | hygiene articles: 1 | candles: 1 | |
frankfurter : 580 | other vegetables: 550 | other vegetables: 415 | other vegetables: 254 | soda : 168 | rolls/buns : 146 | shopping bags: 107 | bottled water: 68 | shopping bags : 56 | soda : 39 | newspapers : 36 | shopping bags : 19 | shopping bags : 18 | fruit/vegetable juice: 17 | napkins : 13 | napkins : 9 | chocolate : 5 | chocolate : 5 | fruit/vegetable juice: 4 | napkins : 3 | fruit/vegetable juice : 2 | baking powder : 1 | chocolate marshmallow: 1 | bottled water: 1 | fruit/vegetable juice : 1 | female sanitary products: 1 | chocolate : 1 | hygiene articles: 1 | house keeping products: 2 | NA | NA | NA | |
tropical fruit : 482 | root vegetables : 383 | rolls/buns : 293 | rolls/buns : 238 | yogurt : 160 | shopping bags: 107 | rolls/buns : 92 | newspapers : 66 | fruit/vegetable juice: 55 | fruit/vegetable juice: 34 | pastry : 27 | chocolate : 17 | fruit/vegetable juice: 16 | newspapers : 14 | fruit/vegetable juice: 11 | chocolate : 8 | newspapers : 5 | napkins : 5 | shopping bags : 4 | pot plants : 3 | house keeping products: 2 | bottled beer : 1 | cling film/bags : 1 | cake bar : 1 | liquor (appetizer) : 1 | long life bakery product: 1 | hygiene articles : 2 | napkins : 2 | soups : 1 | NA | NA | NA | |
other vegetables: 460 | rolls/buns : 378 | yogurt : 289 | soda : 211 | whole milk : 149 | bottled water: 95 | newspapers : 68 | rolls/buns : 59 | bottled water : 54 | newspapers : 33 | bottled water: 25 | fruit/vegetable juice: 17 | napkins : 14 | soda : 14 | hygiene articles : 11 | hygiene articles : 6 | candy : 4 | newspapers : 3 | chocolate : 3 | candy : 2 | hygiene articles : 2 | cleaner : 1 | dental care : 1 | coffee : 1 | long life bakery product: 1 | margarine : 1 | long life bakery product: 1 | sugar : 1 | NA | NA | NA | NA | |
citrus fruit : 453 | tropical fruit : 355 | soda : 229 | yogurt : 202 | shopping bags: 145 | yogurt : 93 | domestic eggs: 57 | soda : 59 | newspapers : 51 | bottled water : 26 | napkins : 23 | napkins : 17 | newspapers : 14 | napkins : 11 | candy : 9 | long life bakery product: 6 | fruit/vegetable juice: 4 | bottled water: 2 | bottled water : 2 | hygiene articles: 2 | candles : 1 | cling film/bags: 1 | dog food : 1 | flour : 1 | pasta : 1 | rum : 1 | specialty fat : 1 | NA | NA | NA | NA | NA | |
(Other) :6318 | (Other) :5356 | (Other) :4301 | (Other) :3514 | (Other) :2931 | (Other) :2283 | (Other) :1785 | (Other) :1356 | (Other) : 969 | (Other) : 715 | (Other) : 499 | (Other) : 368 | (Other) : 265 | (Other) : 199 | (Other) : 136 | (Other) : 101 | (Other) : 69 | (Other) : 46 | (Other) : 35 | (Other) : 25 | (Other) : 18 | (Other) : 12 | (Other) : 8 | (Other) : 2 | white wine : 1 | (Other) : 2 | NA | NA | NA | NA | NA | NA |
groceryDataset = read.transactions("https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_624/data/GroceryDataSet.csv", sep = ',', rm.duplicates = TRUE)
# View summary
summary(groceryDataset)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
As the summary shows, ‘whole milk’ is the most frequent item, appearing in 2513 transactions, followed by ‘other vegetables’ with 1903. Let’s visualize this with an item frequency plot.
itemFrequencyPlot(groceryDataset,topN=10, type="absolute", col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")
The plot above makes the ranking easier to see. itemFrequencyPlot draws a bar plot of item frequencies from an itemMatrix, which here is the transactions object.
Let’s run the apriori algorithm to extract rules, after choosing a minimum support and a minimum confidence. Support is the fraction of all transactions that contain an itemset, so it is basically the likelihood of the product being purchased at all.
min_support <- 6 * 7 / nrow(groceryDataset)
min_support
## [1] 0.004270463
Next comes confidence, which is the likelihood of a product being purchased given that another product is purchased.
Confidence(p1 -> p2) = (# of transactions where both p1 and p2 were purchased) / (# of transactions where p1 was purchased)
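As a minimal sketch of these two definitions, the base R snippet below computes support and confidence by hand on five made-up baskets. The `baskets` list and the helper names `support` and `confidence` are assumptions for illustration only; they are not part of arules and are not taken from the Groceries file.

```r
# Toy baskets (hypothetical data, not from the Groceries file)
baskets <- list(
  c("whole milk", "curd", "butter"),
  c("whole milk", "curd"),
  c("curd", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("soda")
)

# Support: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence(lhs -> rhs) = support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_curd      <- support("curd", baskets)                  # 3 of 5 baskets
supp_curd_milk <- support(c("curd", "whole milk"), baskets) # 2 of 5 baskets
conf_curd_milk <- confidence("curd", "whole milk", baskets) # 2 of the 3 curd baskets

print(c(supp_curd, supp_curd_milk, conf_curd_milk))
```

In this toy data, curd appears in three of the five baskets and two of those also contain whole milk, so the confidence of curd -> whole milk is 2/3.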
# Training Apriori on the grocery dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.004 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 39
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [40 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
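The "Absolute minimum support count: 39" line in the log can be related to the relative threshold with a little arithmetic. The sketch below only multiplies the numbers reported above; how arules rounds internally is not checked here, so the `floor()` step is an assumption that happens to match the log.

```r
n_transactions    <- 9835   # transactions reported by summary(groceryDataset)
support_threshold <- 0.004  # value passed to apriori()

# Relative support times the number of transactions gives the implied count
implied_count <- support_threshold * n_transactions
print(implied_count)

# Dropping the fractional part matches the count shown in the log (assumption)
min_count <- floor(implied_count)
print(min_count)
```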
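Inspect the top 10 rules by lift. Lift compares a rule’s confidence with the overall support of its right-hand side: a lift above 1 means the items appear together more often than would be expected if they were purchased independently.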
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## root vegetables,
## tropical fruit} => {other vegetables} 0.004473818 0.7857143 0.005693950 4.060694 44
## [2] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [3] {pip fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005490595 0.6136364 0.008947636 3.171368 54
## [4] {root vegetables,
## tropical fruit,
## yogurt} => {other vegetables} 0.004982206 0.6125000 0.008134215 3.165495 49
## [5] {pip fruit,
## whipped/sour cream} => {other vegetables} 0.005592272 0.6043956 0.009252669 3.123610 55
## [6] {onions,
## root vegetables} => {other vegetables} 0.005693950 0.6021505 0.009456024 3.112008 56
## [7] {curd,
## domestic eggs} => {whole milk} 0.004778851 0.7343750 0.006507372 2.874086 47
## [8] {butter,
## curd} => {whole milk} 0.004880529 0.7164179 0.006812405 2.803808 48
## [9] {tropical fruit,
## whipped/sour cream,
## yogurt} => {whole milk} 0.004372140 0.7049180 0.006202339 2.758802 43
## [10] {root vegetables,
## tropical fruit,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
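The lift column above can be checked by hand, since lift is the rule's confidence divided by the support of its right-hand side. The sketch below does this for rules [1] and [7] using only numbers already reported in this document; the variable names are made up for illustration.

```r
n_transactions <- 9835  # from summary(groceryDataset)

# Item counts taken from the "most frequent items" part of the summary
support_other_veg  <- 1903 / n_transactions
support_whole_milk <- 2513 / n_transactions

# Confidence values copied from rules [1] and [7] in the table above
conf_rule_1 <- 0.7857143  # {citrus fruit, root vegetables, tropical fruit} => {other vegetables}
conf_rule_7 <- 0.7343750  # {curd, domestic eggs} => {whole milk}

# Lift = confidence / support of the right-hand side
lift_rule_1 <- conf_rule_1 / support_other_veg
lift_rule_7 <- conf_rule_7 / support_whole_milk

print(round(c(lift_rule_1, lift_rule_7), 4))
```

Both values should agree with the lift column reported by `inspect()` (4.060694 and 2.874086) up to rounding of the copied confidence values.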
We started with a minimum confidence of 0.6, which produced 40 rules. Let’s reduce the confidence to 0.4 and see whether the rules improve.
# Training Apriori on the dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.004 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 39
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [432 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Let’s look at the top rules after lowering the confidence to 0.4, which produced 432 rules.
# Inspect the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor} => {bottled beer} 0.004677173 0.4220183 0.011082867 5.240594 46
## [2] {herbs,
## whole milk} => {root vegetables} 0.004168785 0.5394737 0.007727504 4.949369 41
## [3] {citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.004473818 0.4943820 0.009049314 4.535678 44
## [4] {citrus fruit,
## other vegetables,
## root vegetables} => {tropical fruit} 0.004473818 0.4313725 0.010371124 4.110997 44
## [5] {citrus fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.005795628 0.4453125 0.013014743 4.085493 57
## [6] {citrus fruit,
## root vegetables,
## tropical fruit} => {other vegetables} 0.004473818 0.7857143 0.005693950 4.060694 44
## [7] {herbs} => {root vegetables} 0.007015760 0.4312500 0.016268429 3.956477 69
## [8] {tropical fruit,
## whipped/sour cream,
## whole milk} => {yogurt} 0.004372140 0.5512821 0.007930859 3.951792 43
## [9] {citrus fruit,
## pip fruit} => {tropical fruit} 0.005592272 0.4044118 0.013828165 3.854060 55
## [10] {whipped/sour cream,
## whole milk,
## yogurt} => {tropical fruit} 0.004372140 0.4018692 0.010879512 3.829829 43
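Several of the rules above share the same support and count, which suggests they are different splits of one underlying itemset. As a rough check, the base R sketch below types in rules [3], [4] and [6] by hand and builds a sorted key from all items on both sides. The data frame and the `itemset_key` helper are hypothetical and not part of arules.

```r
# Rules [3], [4] and [6] from the table above, entered by hand
rules_df <- data.frame(
  lhs = c("citrus fruit,other vegetables,tropical fruit",
          "citrus fruit,other vegetables,root vegetables",
          "citrus fruit,root vegetables,tropical fruit"),
  rhs = c("root vegetables", "tropical fruit", "other vegetables"),
  count = c(44, 44, 44),
  stringsAsFactors = FALSE
)

# Canonical key: every item from lhs and rhs, sorted and pasted together
itemset_key <- function(lhs, rhs) {
  items <- c(strsplit(lhs, ",", fixed = TRUE)[[1]], rhs)
  paste(sort(items), collapse = " + ")
}

rules_df$itemset <- mapply(itemset_key, rules_df$lhs, rules_df$rhs,
                           USE.NAMES = FALSE)

# Number of distinct itemsets behind the three rules
n_distinct_itemsets <- length(unique(rules_df$itemset))
print(unique(rules_df$itemset))
print(n_distinct_itemsets)
```

All three rules collapse to the single itemset of citrus fruit, other vegetables, root vegetables and tropical fruit, so they describe the same 44 transactions from different directions.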
Although it looks better, there is still room for improvement. For example, ‘citrus fruit’ appears in several of the rules. We can lower the minimum confidence further, to 0.2, and evaluate the result.
# Training Apriori on the dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.004 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 39
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1268 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Inspect the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support confidence coverage lift count
## [1] {flour} => {sugar} 0.004982206 0.2865497 0.017386884 8.463112 49
## [2] {processed cheese} => {white bread} 0.004168785 0.2515337 0.016573462 5.975445 41
## [3] {liquor} => {bottled beer} 0.004677173 0.4220183 0.011082867 5.240594 46
## [4] {berries,
## whole milk} => {whipped/sour cream} 0.004270463 0.3620690 0.011794611 5.050990 42
## [5] {herbs,
## whole milk} => {root vegetables} 0.004168785 0.5394737 0.007727504 4.949369 41
## [6] {citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.004473818 0.4943820 0.009049314 4.535678 44
## [7] {other vegetables,
## root vegetables,
## tropical fruit} => {citrus fruit} 0.004473818 0.3636364 0.012302999 4.393567 44
## [8] {whipped/sour cream,
## yogurt} => {curd} 0.004575496 0.2205882 0.020742247 4.140239 45
## [9] {citrus fruit,
## other vegetables,
## root vegetables} => {tropical fruit} 0.004473818 0.4313725 0.010371124 4.110997 44
## [10] {citrus fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.005795628 0.4453125 0.013014743 4.085493 57
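With the minimum confidence at 0.2 we get 1268 rules, and the top rules by lift are more varied: {flour} => {sugar} leads with a lift of 8.46, followed by {processed cheese} => {white bread} at 5.98 and {liquor} => {bottled beer} at 5.24. The trade-off is that these high-lift rules have modest confidence (0.29 and 0.25 for the top two).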