# Import packages
library(tidyverse)
library(arules)
## Warning: package 'arules' was built under R version 4.3.3
library(arulesViz)
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.
Your assignment is to use R to mine the data for association rules.
You should report:
Your top 10 rules by lift.
# Import data
items_data <- read.transactions("/Users/dirkhartog/Desktop/CUNY_MSDS/DATA_624/GroceryDataSet (1).csv", sep = ",")
summary(items_data)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
There are 9,835 transactions with 169 unique products. The most frequent items included - Whole Milk - Other vegatables - Rolls/buns - Soda - Yogurt
2,159 transactions contain 1 item and 1 transaction contains 32 unique items. There a total of \(2^{169} - 1\) possible item sets. We can print off the first 5 in our data.
# List the first 5 transactions in the data
inspect(items_data[1:5])
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
The first five transactions are listed above. We can see that 3/5 top items are in these transaction along with several other items. We can calculate the support of each item to find the fraction of transactions within the data set that contain each item
# Calculate the support for the top five most frequent items
itemFrequency(items_data[,c("whole milk", "other vegetables",
"rolls/buns", "soda", "yogurt")])
## whole milk other vegetables rolls/buns soda
## 0.2555160 0.1934926 0.1839349 0.1743772
## yogurt
## 0.1395018
# Create a tibble with item frequency/support for all items
item_frequency <- tibble(
Items = names(itemFrequency(items_data)), Frequency = itemFrequency(items_data)
)
item_frequency <- item_frequency %>% arrange(desc(Frequency))
# Top 20 most frequent items
head(item_frequency, 20)
## # A tibble: 20 × 2
## Items Frequency
## <chr> <dbl>
## 1 whole milk 0.256
## 2 other vegetables 0.193
## 3 rolls/buns 0.184
## 4 soda 0.174
## 5 yogurt 0.140
## 6 bottled water 0.111
## 7 root vegetables 0.109
## 8 tropical fruit 0.105
## 9 shopping bags 0.0985
## 10 sausage 0.0940
## 11 pastry 0.0890
## 12 citrus fruit 0.0828
## 13 bottled beer 0.0805
## 14 newspapers 0.0798
## 15 canned beer 0.0777
## 16 pip fruit 0.0756
## 17 fruit/vegetable juice 0.0723
## 18 whipped/sour cream 0.0717
## 19 brown bread 0.0649
## 20 domestic eggs 0.0634
The data contain 169 unique items which gives us \(3^{169} - 2{169+1} + 1\) =
r3^169 - 2^(169+1) + 1
number of rules. Evaluating all of
the rules would computationally challenging and may not be important. We
will use the apriori function in the arules package to find the
association rules based on certain criteria that determine (the support,
confidence and minimum length of a rule. This alogorithm will look
through all of the possible item sets and determine if they meet the
minimum support critera that we set. This will only retrieve rules that
show up in the data at the support threshold that we choose. For example
if we were interested in the item sets that show up at least 5% of the
time we would then need to set our support value at 0.05
One way of identifying the strength of a rule within a dataset is to consider the degree of certainty of each rule using confidence. Confidence is the ratio of the number of transactions that include both the antecedent and consequent to the number of transactions that include only the antecedent.
In our case we will se the support to 0.015 and confidence to 0.25 with a minimum rule length of 2
# Finding association rules
item_rules <- apriori(items_data,
parameter = list(support = 0.015,
confidence = 0.25,
minlen = 2
))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.015 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 147
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [73 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [78 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(item_rules)
## set of 78 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 62 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.205 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01505 Min. :0.2537 Min. :0.02928 Min. :0.9932
## 1st Qu.:0.01790 1st Qu.:0.3084 1st Qu.:0.05247 1st Qu.:1.5047
## Median :0.02191 Median :0.3546 Median :0.06121 Median :1.7400
## Mean :0.02558 Mean :0.3608 Mean :0.07319 Mean :1.7632
## 3rd Qu.:0.02888 3rd Qu.:0.4056 3rd Qu.:0.08277 3rd Qu.:1.9427
## Max. :0.07483 Max. :0.5174 Max. :0.25552 Max. :3.0404
## count
## Min. :148.0
## 1st Qu.:176.0
## Median :215.5
## Mean :251.6
## 3rd Qu.:284.0
## Max. :736.0
##
## mining info:
## data ntransactions support confidence
## items_data 9835 0.015 0.25
## call
## apriori(data = items_data, parameter = list(support = 0.015, confidence = 0.25, minlen = 2))
# View the first 10 rules
inspect(item_rules[1:10])
## lhs rhs support confidence coverage
## [1] {sugar} => {whole milk} 0.01504830 0.4444444 0.03385867
## [2] {cream cheese} => {whole milk} 0.01647178 0.4153846 0.03965430
## [3] {chicken} => {other vegetables} 0.01789527 0.4170616 0.04290798
## [4] {chicken} => {whole milk} 0.01759024 0.4099526 0.04290798
## [5] {white bread} => {whole milk} 0.01708185 0.4057971 0.04209456
## [6] {chocolate} => {whole milk} 0.01667514 0.3360656 0.04961871
## [7] {coffee} => {whole milk} 0.01870869 0.3222417 0.05805796
## [8] {frozen vegetables} => {other vegetables} 0.01779359 0.3699789 0.04809354
## [9] {frozen vegetables} => {whole milk} 0.02043721 0.4249471 0.04809354
## [10] {beef} => {root vegetables} 0.01738688 0.3313953 0.05246568
## lift count
## [1] 1.739400 148
## [2] 1.625670 162
## [3] 2.155439 176
## [4] 1.604411 173
## [5] 1.588147 168
## [6] 1.315243 164
## [7] 1.261141 184
## [8] 1.912108 175
## [9] 1.663094 201
## [10] 3.040367 171
# Sort by lift
item_rules %>% sort(by = "lift") %>% head(10) %>% inspect()
## lhs rhs support confidence
## [1] {beef} => {root vegetables} 0.01738688 0.3313953
## [2] {other vegetables, whole milk} => {root vegetables} 0.02318251 0.3097826
## [3] {whole milk, yogurt} => {tropical fruit} 0.01514997 0.2704174
## [4] {pip fruit} => {tropical fruit} 0.02043721 0.2701613
## [5] {tropical fruit, whole milk} => {yogurt} 0.01514997 0.3581731
## [6] {root vegetables, whole milk} => {other vegetables} 0.02318251 0.4740125
## [7] {curd} => {yogurt} 0.01728521 0.3244275
## [8] {root vegetables} => {other vegetables} 0.04738180 0.4347015
## [9] {chicken} => {other vegetables} 0.01789527 0.4170616
## [10] {other vegetables, whole milk} => {yogurt} 0.02226741 0.2975543
## coverage lift count
## [1] 0.05246568 3.040367 171
## [2] 0.07483477 2.842082 228
## [3] 0.05602440 2.577089 149
## [4] 0.07564820 2.574648 201
## [5] 0.04229792 2.567516 149
## [6] 0.04890696 2.449770 228
## [7] 0.05327911 2.325615 170
## [8] 0.10899847 2.246605 466
## [9] 0.04290798 2.155439 176
## [10] 0.07483477 2.132979 219
When we sort the rules by lift we can see interesting findings. The first rule with the highest lift tells us that those who buy beef are 3x more likely to also purchase root vegetables. We can also see that many of the top 10 rules by lift include the whole milk yogurt, and other vegetables which were in the top five most frequently purchased items. Lets evaluate the top rules excluding these items to identify other rules that might be insightful.
item_rules %>% subset(!items %in% c("whole milk", "yogurt", "other vegetables")) %>% sort(by = "lift") %>%
inspect()
## lhs rhs support confidence
## [1] {beef} => {root vegetables} 0.01738688 0.3313953
## [2] {pip fruit} => {tropical fruit} 0.02043721 0.2701613
## [3] {frankfurter} => {rolls/buns} 0.01921708 0.3258621
## [4] {sausage} => {rolls/buns} 0.03060498 0.3257576
## [5] {bottled water} => {soda} 0.02897814 0.2621895
## [6] {sausage} => {soda} 0.02430097 0.2586580
## [7] {fruit/vegetable juice} => {soda} 0.01840366 0.2545710
## coverage lift count
## [1] 0.05246568 3.040367 171
## [2] 0.07564820 2.574648 201
## [3] 0.05897306 1.771616 189
## [4] 0.09395018 1.771048 301
## [5] 0.11052364 1.503577 285
## [6] 0.09395018 1.483324 239
## [7] 0.07229283 1.459887 181
Excluding some of the top items reveals possible pairings to take into account when setting up the items in the store, more specifcally the last four rules. Placing soda and rolls/buns near the sausage/frankfurters may lead to more purchases as these items are likely to be purhased together.
We can inspect the results sorted by confidence or suppor as well for further insights.
item_rules %>%
sort(by = "confidence") %>% head(10) %>%
inspect()
## lhs rhs support
## [1] {tropical fruit, yogurt} => {whole milk} 0.01514997
## [2] {other vegetables, yogurt} => {whole milk} 0.02226741
## [3] {butter} => {whole milk} 0.02755465
## [4] {curd} => {whole milk} 0.02613116
## [5] {other vegetables, root vegetables} => {whole milk} 0.02318251
## [6] {other vegetables, tropical fruit} => {whole milk} 0.01708185
## [7] {root vegetables, whole milk} => {other vegetables} 0.02318251
## [8] {domestic eggs} => {whole milk} 0.02999492
## [9] {rolls/buns, yogurt} => {whole milk} 0.01555669
## [10] {whipped/sour cream} => {whole milk} 0.03223183
## confidence coverage lift count
## [1] 0.5173611 0.02928317 2.024770 149
## [2] 0.5128806 0.04341637 2.007235 219
## [3] 0.4972477 0.05541434 1.946053 271
## [4] 0.4904580 0.05327911 1.919481 257
## [5] 0.4892704 0.04738180 1.914833 228
## [6] 0.4759207 0.03589222 1.862587 168
## [7] 0.4740125 0.04890696 2.449770 228
## [8] 0.4727564 0.06344687 1.850203 295
## [9] 0.4526627 0.03436706 1.771563 153
## [10] 0.4496454 0.07168277 1.759754 317