Association rule mining is a machine learning method that helps uncover relationships between variables. Its aim is to show how frequently items appear together in transactions. It can help predict client behaviour: if a client bought item A, which product will they choose next? Applied to retail purchases, this method is called Market Basket Analysis. This paper uses the Groceries Dataset found on Kaggle.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
In order to work on the data, we first need to load the dataset.
df = read.csv('Groceries_dataset.csv', row.names=NULL, sep=",")
head(df)
## Member_number Date itemDescription
## 1 1808 21-07-2015 tropical fruit
## 2 2552 05-01-2015 whole milk
## 3 2300 19-09-2015 pip fruit
## 4 1187 12-12-2015 other vegetables
## 5 3037 01-02-2015 whole milk
## 6 4941 14-02-2015 rolls/buns
str(df)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : int 1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
## $ Date : chr "21-07-2015" "05-01-2015" "19-09-2015" "12-12-2015" ...
## $ itemDescription: chr "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...
The dataset has 38,765 observations of 3 variables:
* Member_number - unique ID of a client
* Date - date of purchase
* itemDescription - category of item purchased
summary(df)
## Member_number Date itemDescription
## Min. :1000 Length:38765 Length:38765
## 1st Qu.:2002 Class :character Class :character
## Median :3005 Mode :character Mode :character
## Mean :3004
## 3rd Qu.:4007
## Max. :5000
As the Date variable is stored as character, it needs to be converted to the Date format.
df$Date <-as.Date(df$Date, format="%d-%m-%Y")
To perform market basket analysis, the dataset has to be converted into a transactions object. Grouping the data by Member_number and Date allows us to create unique transactions. Grouping by Member_number alone would mix together purchases made by the same client on different days.
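The effect of the grouping key can be seen on a small made-up example. The sketch below uses base R only; the `purchases` data frame and its values are hypothetical and merely mimic the three columns of the dataset.

```r
# Toy purchase log with the same three columns as the dataset
# (all values are invented for illustration).
purchases <- data.frame(
  Member_number   = c(1, 1, 1, 2, 2),
  Date            = as.Date(c("2015-01-05", "2015-01-05", "2015-03-10",
                              "2015-01-05", "2015-01-05")),
  itemDescription = c("whole milk", "yogurt", "soda", "sausage", "whole milk"),
  stringsAsFactors = FALSE
)

# One basket per (member, date) pair: each shop visit is its own transaction.
visit_key <- paste(purchases$Member_number, purchases$Date)
baskets_by_visit <- split(purchases$itemDescription, visit_key)

# One basket per member: visits on different days are merged together.
baskets_by_member <- split(purchases$itemDescription, purchases$Member_number)

length(baskets_by_visit)    # 3 baskets
length(baskets_by_member)   # 2 baskets
baskets_by_member[["1"]]    # soda from the March visit lands next to milk and yogurt
```

With the member-only key, soda would appear to be bought together with whole milk and yogurt, although it was purchased two months later.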
transactions_list <- df %>%
group_by(Member_number, Date) %>%
summarise(items = list(itemDescription)) %>%
ungroup() %>%
.$items
## `summarise()` has grouped output by 'Member_number'. You can override using the
## `.groups` argument.
transactions<- as(transactions_list, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
## 14963 rows (elements/itemsets/transactions) and
## 167 columns (items) and a density of 0.01520957
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2363 1827 1646 1453
## yogurt (Other)
## 1285 29432
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 205 10012 2727 1273 338 179 113 96 19 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 2.00 2.54 3.00 10.00
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The summary above gives some basic information, such as the most frequently bought items: whole milk, other vegetables, rolls/buns, soda and yogurt.
To learn more about the transactions, we can create a frequency plot showing the 20 most frequently bought products.
itemFrequencyPlot(transactions, topN=20, type="absolute")
The frequency plot shows the most popular products, but we can also explore the least popular ones:
item_freq <- as.data.frame(itemFrequency(transactions,type="absolute"), cols = 'product')
colnames(item_freq) <- 'nb_of_purchases'
item_freq$product_names <- names(itemFrequency(transactions, type = "absolute"))
item_freq %>%
group_by(.,nb_of_purchases) %>%
summarise(
nb_of_products = n(),
product_names = paste(product_names, collapse=",")
) %>%
head(.,5)
## # A tibble: 5 × 3
## nb_of_purchases nb_of_products product_names
## <int> <int> <chr>
## 1 1 2 kitchen utensil,preservation products
## 2 3 1 baby cosmetics
## 3 4 1 bags
## 4 5 4 frozen chicken,make up remover,rubbing alcohol…
## 5 6 1 salad dressing
Two products appear in only one transaction each: kitchen utensil and preservation products. Baby cosmetics appear in 3 transactions and bags in 4.
Association rules are based on a couple of key concepts. The first is support, which is the proportion of transactions in the dataset in which a particular item or itemset appears. The higher the support, the more often the item appears in the dataset. The second is confidence, the probability that item B will appear in a transaction if item A is already in the basket. High confidence means a strong likelihood that B will appear when A does. Lastly, lift measures the strength of the association between products. Lift lower than 1 means that the items are negatively correlated, while lift above 1 means positive correlation. If the items are independent, lift is equal to 1.
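The three measures can be computed by hand on a tiny example. The following is a minimal base R sketch; the five baskets and the helper `support()` are hypothetical and are not part of the arules package.

```r
# Five made-up baskets; A, B and C stand for arbitrary items.
baskets <- list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A"),
  c("B", "C"),
  c("C")
)

# Share of baskets that contain every item listed in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_A  <- support("A", baskets)           # 3 of 5 baskets
supp_B  <- support("B", baskets)           # 3 of 5 baskets
supp_AB <- support(c("A", "B"), baskets)   # 2 of 5 baskets

# Confidence of A => B: among baskets with A, the share that also contain B.
confidence_A_B <- supp_AB / supp_A         # 2/3

# Lift of A => B: confidence relative to the overall share of B.
lift_A_B <- confidence_A_B / supp_B        # (2/3) / (3/5) = 10/9

c(support = supp_AB, confidence = confidence_A_B, lift = lift_A_B)
```

Here the lift is slightly above 1, so B is a little more common in baskets that contain A than in baskets overall.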
We need to set thresholds for these measures before we can mine itemsets and rules. We start with a minimum support of 0.1, which is the default value.
freq_items<-eclat(transactions, parameter=list(supp=0.1, maxlen=10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.1 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 1496
##
## create itemset ...
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [3 item(s)] done [0.00s].
## creating sparse bit matrix ... [3 row(s), 14963 column(s)] done [0.00s].
## writing ... [3 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq_items)
## items support count
## [1] {whole milk} 0.1579229 2363
## [2] {other vegetables} 0.1221012 1827
## [3] {rolls/buns} 0.1100047 1646
With support set to 0.1, only 3 frequent items are returned. To get more results we need to lower the support threshold.
freq_items<-eclat(transactions, parameter=list(supp=0.05, maxlen=10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 748
##
## create itemset ...
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating sparse bit matrix ... [11 row(s), 14963 column(s)] done [0.00s].
## writing ... [11 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq_items)
## items support count
## [1] {whole milk} 0.15792288 2363
## [2] {other vegetables} 0.12210118 1827
## [3] {rolls/buns} 0.11000468 1646
## [4] {soda} 0.09710620 1453
## [5] {yogurt} 0.08587850 1285
## [6] {tropical fruit} 0.06776716 1014
## [7] {root vegetables} 0.06957161 1041
## [8] {sausage} 0.06034886 903
## [9] {bottled water} 0.06068302 908
## [10] {citrus fruit} 0.05313106 795
## [11] {pastry} 0.05172759 774
All of the frequent itemsets, however, contain a single item. To obtain itemsets of two or more items, the support needs to be set even lower.
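A quick calculation shows why a much lower threshold is needed. The sketch below converts each relative support into the smallest transaction count that satisfies it, using the 14,963 transactions reported in the summary and the count of 222 that the later rule listing reports for the most frequent pair, other vegetables with whole milk. The variable names are illustrative only.

```r
n_transactions <- 14963                 # from summary(transactions)
min_support    <- c(0.1, 0.05, 0.001)   # the three thresholds tried in this paper

# Smallest whole number of transactions that reaches each threshold.
min_count <- ceiling(min_support * n_transactions)

# Count of the most frequent pair, {other vegetables, whole milk}.
top_pair_count <- 222

data.frame(
  min_support   = min_support,
  min_count     = min_count,
  pair_survives = top_pair_count >= min_count
)
```

Even the most common pair falls far short of the counts required at 0.1 and 0.05, so only single items can pass those thresholds.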
freq_items<-eclat(transactions, parameter=list(supp=0.001, maxlen=10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 14
##
## create itemset ...
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating sparse bit matrix ... [149 row(s), 14963 column(s)] done [0.00s].
## writing ... [750 set(s)] done [0.02s].
## Creating S4 object ... done [0.00s].
freq_rules<- ruleInduction(freq_items, transactions, confidence=0.1)
summary(freq_rules)
## set of 131 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 114 17
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 2.00 2.13 2.00 3.00
##
## summary of quality measures:
## support confidence lift itemset
## Min. :0.001002 Min. :0.1000 Min. :0.6458 Min. : 1.0
## 1st Qu.:0.001337 1st Qu.:0.1098 1st Qu.:0.8074 1st Qu.: 42.0
## Median :0.001938 Median :0.1215 Median :0.8795 Median :164.0
## Mean :0.002933 Mean :0.1257 Mean :0.9464 Mean :247.8
## 3rd Qu.:0.003776 3rd Qu.:0.1347 3rd Qu.:1.0319 3rd Qu.:495.5
## Max. :0.014837 Max. :0.2558 Max. :2.1829 Max. :601.0
##
## mining info:
## data ntransactions support
## transactions 14963 0.001
## call
## eclat(data = transactions, parameter = list(supp = 0.001, maxlen = 10))
## confidence
## 0.1
By setting support to 0.001 and confidence to 0.1, we get 131 rules. Most of them (114) have length two, which means that both the lhs (left-hand side) and the rhs (right-hand side) consist of one product. The remaining 17 rules have length 3, meaning that the lhs has two items and the rhs has one.
The summary above also gives the mean values of support, confidence and lift: the mean support is 0.0029, the mean confidence is 0.13 and the mean lift is 0.95.
We can analyze the rules by sorting them by confidence, support and lift. This can help identify the most relevant insights.
rules.by.conf<-sort(freq_rules, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence lift
## [1] {sausage, yogurt} => {whole milk} 0.001470293 0.2558140 1.619866
## [2] {rolls/buns, sausage} => {whole milk} 0.001136136 0.2125000 1.345594
## [3] {sausage, soda} => {whole milk} 0.001069304 0.1797753 1.138374
## [4] {semi-finished bread} => {whole milk} 0.001670788 0.1760563 1.114825
## [5] {rolls/buns, yogurt} => {whole milk} 0.001336630 0.1709402 1.082428
## [6] {sausage, whole milk} => {yogurt} 0.001470293 0.1641791 1.911760
## itemset
## [1] 565
## [2] 567
## [3] 566
## [4] 11
## [5] 586
## [6] 565
Confidence measures how often the rhs item appears in transactions that contain the lhs items. The highest confidence is 25.58%, meaning that whole milk appears in 25.58% of the transactions that contain both sausage and yogurt.
rules.by.supp<-sort(freq_rules, by="support", decreasing=TRUE)
inspect(head(rules.by.supp))
## lhs rhs support confidence lift itemset
## [1] {other vegetables} => {whole milk} 0.014836597 0.1215107 0.7694305 601
## [2] {rolls/buns} => {whole milk} 0.013967787 0.1269745 0.8040284 599
## [3] {soda} => {whole milk} 0.011628684 0.1197522 0.7582957 595
## [4] {yogurt} => {whole milk} 0.011160863 0.1299611 0.8229402 588
## [5] {sausage} => {whole milk} 0.008955423 0.1483942 0.9396627 568
## [6] {tropical fruit} => {whole milk} 0.008220277 0.1213018 0.7681077 581
Support shows the proportion of transactions in which all items of a rule appear. The highest support is 0.0148, meaning that the rule “other vegetables => whole milk” holds in 1.48% of all transactions.
rules.by.lift<-sort(freq_rules, by="lift", decreasing=TRUE)
inspect(head(rules.by.lift))
## lhs rhs support confidence lift
## [1] {whole milk, yogurt} => {sausage} 0.001470293 0.1317365 2.182917
## [2] {sausage, whole milk} => {yogurt} 0.001470293 0.1641791 1.911760
## [3] {sausage, yogurt} => {whole milk} 0.001470293 0.2558140 1.619866
## [4] {flour} => {tropical fruit} 0.001069304 0.1095890 1.617141
## [5] {processed cheese} => {root vegetables} 0.001069304 0.1052632 1.513019
## [6] {soft cheese} => {yogurt} 0.001269799 0.1266667 1.474952
## itemset
## [1] 565
## [2] 565
## [3] 565
## [4] 16
## [5] 21
## [6] 25
Lift measures how much more likely the rhs items are to be bought when the lhs items are purchased. The highest lift is 2.18: the likelihood of purchasing sausage increases 2.18 times when whole milk and yogurt are bought.
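The reported values for this rule can be reproduced from raw counts. The sketch below takes the counts from the outputs in this paper: 14,963 transactions, 903 baskets with sausage, 167 baskets with both whole milk and yogurt, and 22 baskets with all three items. The variable names are illustrative.

```r
n      <- 14963   # all transactions
n_rhs  <- 903     # baskets containing sausage
n_lhs  <- 167     # baskets containing whole milk and yogurt
n_both <- 22      # baskets containing whole milk, yogurt and sausage

support    <- n_both / n              # share of all baskets with all three items
confidence <- n_both / n_lhs          # share of milk-and-yogurt baskets with sausage
lift       <- confidence / (n_rhs / n)  # confidence relative to the overall sausage share

round(c(support = support, confidence = confidence, lift = lift), 6)
```

The results agree with the support, confidence and lift printed for the rule {whole milk, yogurt} => {sausage}.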
We can also draw a scatter plot of the rules that includes all three measures: support and confidence are plotted on the x and y axes, and lift is shown by the shading of the dots.
plot(freq_rules, measure=c("support", "confidence"), shading="lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The majority of rules have support below 0.005 and confidence below 0.15. The rules with the highest confidence have very low support, as do the rules with the highest lift.
To find the rules that appear in baskets most often, we sort by confidence first and then by support, so that support determines the final order and confidence only separates rules with equal support.
inspect(head(sort(sort(freq_rules, by ="confidence"),by="support"),10))
## lhs rhs support confidence lift
## [1] {other vegetables} => {whole milk} 0.014836597 0.1215107 0.7694305
## [2] {rolls/buns} => {whole milk} 0.013967787 0.1269745 0.8040284
## [3] {soda} => {whole milk} 0.011628684 0.1197522 0.7582957
## [4] {yogurt} => {whole milk} 0.011160863 0.1299611 0.8229402
## [5] {sausage} => {whole milk} 0.008955423 0.1483942 0.9396627
## [6] {tropical fruit} => {whole milk} 0.008220277 0.1213018 0.7681077
## [7] {root vegetables} => {whole milk} 0.007551962 0.1085495 0.6873575
## [8] {bottled beer} => {whole milk} 0.007150972 0.1578171 0.9993303
## [9] {citrus fruit} => {whole milk} 0.007150972 0.1345912 0.8522590
## [10] {bottled water} => {whole milk} 0.007150972 0.1178414 0.7461959
## itemset
## [1] 601
## [2] 599
## [3] 595
## [4] 588
## [5] 568
## [6] 581
## [7] 575
## [8] 488
## [9] 548
## [10] 557
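The ordering of the tied rows in this listing can be reproduced with an explicit two-key sort. This is a base R sketch on a plain data frame holding four of the rules shown above, not a call into arules; the name `rules_df` is hypothetical.

```r
# Four of the rules listed above, deliberately entered out of order.
rules_df <- data.frame(
  lhs        = c("bottled water", "root vegetables", "citrus fruit", "bottled beer"),
  support    = c(0.007150972, 0.007551962, 0.007150972, 0.007150972),
  confidence = c(0.1178414, 0.1085495, 0.1345912, 0.1578171),
  stringsAsFactors = FALSE
)

# Primary key: support (descending); ties broken by confidence (descending).
ordered <- rules_df[order(-rules_df$support, -rules_df$confidence), ]
ordered$lhs
```

Root vegetables comes first because of its higher support, and the three rules with identical support follow in order of decreasing confidence, as in rows 7 to 10 of the listing.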
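In all ten rules whole milk is on the rhs, as it is the most frequently appearing item. To visualize rules we can use a matrix plot, which places the lhs itemsets on the x axis and the rhs itemsets on the y axis, and shows lift by shading. Here we plot the ten rules with the highest confidence.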
plot_rules <- freq_rules %>%
sort(by = "confidence") %>%
head(10) %>%
sort(by = "support")
plot(plot_rules, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{sausage,whole milk}" "{sausage,yogurt}" "{rolls/buns,sausage}"
## [4] "{sausage,soda}" "{semi-finished bread}" "{rolls/buns,yogurt}"
## [7] "{detergent}" "{ham}" "{bottled beer}"
## [10] "{frozen fish}"
## Itemsets in Consequent (RHS)
## [1] "{whole milk}" "{yogurt}"
If we want to analyze and visualize relationships between items further, a good choice is a parallel coordinates plot. It shows which item a customer tends to add, based on what they already have in their basket.
plot(plot_rules, method="paracoord")
In this case, a person who buys whole milk and sausage is more likely than average to also buy yogurt (lift 1.91).
To deepen the analysis, it’s possible to check what motivates people to buy certain items. To showcase this, we will analyse whole milk, as the most frequently bought item.
whole_milk <-apriori(data = transactions, parameter = list(support=0.001, confidence=0.1), appearance = list(default="lhs", rhs="whole milk"), control=list(verbose=F))
inspect(sort(whole_milk, by="lift"))
## lhs rhs support confidence
## [1] {sausage, yogurt} => {whole milk} 0.001470293 0.2558140
## [2] {rolls/buns, sausage} => {whole milk} 0.001136136 0.2125000
## [3] {sausage, soda} => {whole milk} 0.001069304 0.1797753
## [4] {semi-finished bread} => {whole milk} 0.001670788 0.1760563
## [5] {rolls/buns, yogurt} => {whole milk} 0.001336630 0.1709402
## [6] {detergent} => {whole milk} 0.001403462 0.1627907
## [7] {ham} => {whole milk} 0.002740092 0.1601562
## [8] {} => {whole milk} 0.157922876 0.1579229
## [9] {bottled beer} => {whole milk} 0.007150972 0.1578171
## [10] {frozen fish} => {whole milk} 0.001069304 0.1568627
## [11] {candy} => {whole milk} 0.002138609 0.1488372
## [12] {sausage} => {whole milk} 0.008955423 0.1483942
## [13] {onions} => {whole milk} 0.002940587 0.1452145
## [14] {processed cheese} => {whole milk} 0.001470293 0.1447368
## [15] {newspapers} => {whole milk} 0.005613847 0.1443299
## [16] {domestic eggs} => {whole milk} 0.005279690 0.1423423
## [17] {cat food} => {whole milk} 0.001670788 0.1412429
## [18] {waffles} => {whole milk} 0.002606429 0.1407942
## [19] {hamburger meat} => {whole milk} 0.003074250 0.1406728
## [20] {other vegetables, yogurt} => {whole milk} 0.001136136 0.1404959
## [21] {frankfurter} => {whole milk} 0.005279690 0.1398230
## [22] {sugar} => {whole milk} 0.002472766 0.1396226
## [23] {chewing gum} => {whole milk} 0.001670788 0.1388889
## [24] {beef} => {whole milk} 0.004678206 0.1377953
## [25] {flour} => {whole milk} 0.001336630 0.1369863
## [26] {frozen vegetables} => {whole milk} 0.003809397 0.1360382
## [27] {pork} => {whole milk} 0.005012364 0.1351351
## [28] {pip fruit} => {whole milk} 0.006616320 0.1348774
## [29] {citrus fruit} => {whole milk} 0.007150972 0.1345912
## [30] {long life bakery product} => {whole milk} 0.002405935 0.1343284
## [31] {grapes} => {whole milk} 0.001938114 0.1342593
## [32] {shopping bags} => {whole milk} 0.006348994 0.1334270
## [33] {butter} => {whole milk} 0.004678206 0.1328273
## [34] {pasta} => {whole milk} 0.001069304 0.1322314
## [35] {meat} => {whole milk} 0.002205440 0.1309524
## [36] {white bread} => {whole milk} 0.003141081 0.1309192
## [37] {oil} => {whole milk} 0.001938114 0.1300448
## [38] {yogurt} => {whole milk} 0.011160863 0.1299611
## [39] {fruit/vegetable juice} => {whole milk} 0.004410880 0.1296660
## [40] {pot plants} => {whole milk} 0.001002473 0.1282051
## [41] {canned beer} => {whole milk} 0.006014837 0.1282051
## [42] {ice cream} => {whole milk} 0.001938114 0.1277533
## [43] {hard cheese} => {whole milk} 0.001871282 0.1272727
## [44] {rolls/buns} => {whole milk} 0.013967787 0.1269745
## [45] {hygiene articles} => {whole milk} 0.001737619 0.1268293
## [46] {margarine} => {whole milk} 0.004076723 0.1265560
## [47] {pastry} => {whole milk} 0.006482657 0.1253230
## [48] {chocolate} => {whole milk} 0.002940587 0.1246459
## [49] {rolls/buns, soda} => {whole milk} 0.001002473 0.1239669
## [50] {curd} => {whole milk} 0.004143554 0.1230159
## [51] {chicken} => {whole milk} 0.003408407 0.1223022
## [52] {other vegetables} => {whole milk} 0.014836597 0.1215107
## [53] {cream cheese } => {whole milk} 0.002873755 0.1214689
## [54] {tropical fruit} => {whole milk} 0.008220277 0.1213018
## [55] {coffee} => {whole milk} 0.003809397 0.1205074
## [56] {soft cheese} => {whole milk} 0.001202967 0.1200000
## [57] {soda} => {whole milk} 0.011628684 0.1197522
## [58] {specialty bar} => {whole milk} 0.001670788 0.1196172
## [59] {brown bread} => {whole milk} 0.004477712 0.1190053
## [60] {UHT-milk} => {whole milk} 0.002539598 0.1187500
## [61] {bottled water} => {whole milk} 0.007150972 0.1178414
## [62] {other vegetables, soda} => {whole milk} 0.001136136 0.1172414
## [63] {beverages} => {whole milk} 0.001938114 0.1169355
## [64] {frozen meals} => {whole milk} 0.001938114 0.1155378
## [65] {other vegetables, rolls/buns} => {whole milk} 0.001202967 0.1139241
## [66] {pickled vegetables} => {whole milk} 0.001002473 0.1119403
## [67] {napkins} => {whole milk} 0.002405935 0.1087613
## [68] {white wine} => {whole milk} 0.001269799 0.1085714
## [69] {root vegetables} => {whole milk} 0.007551962 0.1085495
## [70] {herbs} => {whole milk} 0.001136136 0.1075949
## [71] {whipped/sour cream} => {whole milk} 0.004611375 0.1055046
## [72] {sliced cheese} => {whole milk} 0.001470293 0.1047619
## [73] {berries} => {whole milk} 0.002272272 0.1042945
## [74] {salty snack} => {whole milk} 0.001938114 0.1032028
## [75] {dessert} => {whole milk} 0.002405935 0.1019830
## coverage lift count
## [1] 0.005747511 1.6198664 22
## [2] 0.005346521 1.3455935 17
## [3] 0.005948005 1.1383739 16
## [4] 0.009490076 1.1148248 25
## [5] 0.007819288 1.0824282 20
## [6] 0.008621266 1.0308240 21
## [7] 0.017108869 1.0141422 41
## [8] 1.000000000 1.0000000 2363
## [9] 0.045311769 0.9993303 107
## [10] 0.006816815 0.9932870 16
## [11] 0.014368776 0.9424677 32
## [12] 0.060348861 0.9396627 134
## [13] 0.020249950 0.9195281 44
## [14] 0.010158391 0.9165033 22
## [15] 0.038895943 0.9139265 84
## [16] 0.037091492 0.9013409 79
## [17] 0.011829179 0.8943792 25
## [18] 0.018512330 0.8915379 39
## [19] 0.021853906 0.8907689 46
## [20] 0.008086614 0.8896486 17
## [21] 0.037759808 0.8853879 79
## [22] 0.017710352 0.8841192 37
## [23] 0.012029673 0.8794729 25
## [24] 0.033950411 0.8725479 70
## [25] 0.009757402 0.8674253 20
## [26] 0.028002406 0.8614217 57
## [27] 0.037091492 0.8557034 75
## [28] 0.049054334 0.8540712 99
## [29] 0.053131057 0.8522590 107
## [30] 0.017910847 0.8505947 36
## [31] 0.014435608 0.8501571 29
## [32] 0.047584041 0.8448869 95
## [33] 0.035220210 0.8410898 70
## [34] 0.008086614 0.8373163 16
## [35] 0.016841542 0.8292173 33
## [36] 0.023992515 0.8290073 47
## [37] 0.014903428 0.8234706 29
## [38] 0.085878500 0.8229402 167
## [39] 0.034017243 0.8210717 66
## [40] 0.007819288 0.8118211 15
## [41] 0.046915725 0.8118211 90
## [42] 0.015170755 0.8089601 29
## [43] 0.014702934 0.8059170 28
## [44] 0.110004678 0.8040284 209
## [45] 0.013700461 0.8031089 26
## [46] 0.032212792 0.8013786 61
## [47] 0.051727595 0.7935709 97
## [48] 0.023591526 0.7892833 44
## [49] 0.008086614 0.7849841 15
## [50] 0.033683085 0.7789617 62
## [51] 0.027868743 0.7744423 51
## [52] 0.122101183 0.7694305 222
## [53] 0.023658357 0.7691661 43
## [54] 0.067767159 0.7681077 123
## [55] 0.031611308 0.7630775 57
## [56] 0.010024728 0.7598646 18
## [57] 0.097106195 0.7582957 174
## [58] 0.013967787 0.7574408 25
## [59] 0.037626144 0.7535661 67
## [60] 0.021386086 0.7519493 38
## [61] 0.060683018 0.7461959 107
## [62] 0.009690570 0.7423964 17
## [63] 0.016574216 0.7404594 29
## [64] 0.016774711 0.7316093 29
## [65] 0.010559380 0.7213904 18
## [66] 0.008955423 0.7088289 15
## [67] 0.022121232 0.6886990 36
## [68] 0.011695516 0.6874965 19
## [69] 0.069571610 0.6873575 113
## [70] 0.010559380 0.6813132 17
## [71] 0.043707813 0.6680767 69
## [72] 0.014034619 0.6633738 22
## [73] 0.021787075 0.6604140 34
## [74] 0.018779656 0.6535016 29
## [75] 0.023591526 0.6457773 36
is.significant(whole_milk, transactions)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE
The function above tests whether the rules found are statistically significant, using Fisher’s Exact Test. The output means that rules for “whole milk” were found, but none of them is statistically significant, so the associations might be weak. One solution could be to increase support and confidence, but then fewer rules would be returned. Instead, we can try to find significant rules for other products.
Instead of choosing the most frequently bought item, let’s focus on the rhs item of the rule with the highest lift, which is sausage.
sausage <-apriori(data = transactions, parameter = list(support=0.001, confidence=0.1), appearance = list(default="lhs", rhs="sausage"), control=list(verbose=F))
inspect(sort(sausage, by="lift"))
## lhs rhs support confidence coverage
## [1] {whole milk, yogurt} => {sausage} 0.001470293 0.1317365 0.01116086
## lift count
## [1] 2.182917 22
is.significant(sausage, transactions)
## [1] TRUE
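The rule found for sausage shows a meaningful relationship between items. When whole milk and yogurt are purchased together, the customer is about 2.18 times more likely than average to buy sausage as well, although sausage still appears in only about 13% of such baskets.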
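The idea behind the significance check can be illustrated with base R's `fisher.test()` on the 2x2 contingency table of this rule. This is a sketch under stated assumptions, not necessarily what `is.significant()` does internally: its significance level and any multiple-testing correction are not reproduced here. The counts come from the outputs in this paper (14,963 transactions, 167 baskets with whole milk and yogurt, 903 with sausage, 22 with all three).

```r
n      <- 14963   # all transactions
n_lhs  <- 167     # baskets with whole milk and yogurt
n_rhs  <- 903     # baskets with sausage
n_both <- 22      # baskets with all three items

# Rows: lhs present / absent. Columns: sausage present / absent.
contingency <- matrix(
  c(n_both,         n_lhs - n_both,
    n_rhs - n_both, n - n_lhs - n_rhs + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(milk_and_yogurt = c("yes", "no"),
                  sausage         = c("yes", "no"))
)

# Number of joint baskets expected if lhs and rhs were independent.
expected_both <- n_lhs * n_rhs / n

# One-sided test: is sausage over-represented in milk-and-yogurt baskets?
test <- fisher.test(contingency, alternative = "greater")

contingency
expected_both
test$p.value
```

Roughly 10 joint baskets would be expected under independence, while 22 were observed. A small p-value indicates that such an excess is unlikely to arise by chance.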
Market Basket Analysis can be a useful tool, providing valuable insight into customer behaviour. In this paper a dataset of grocery transactions was analyzed. Using association rule mining, frequently bought items were identified and the relationships between them were explored. The support, confidence and lift metrics were essential in that analysis.
Whole milk is the most frequently bought item; however, the rules predicting a milk purchase were not statistically significant. This might be because it is a staple item, bought often and independently of other products. Sausage, yogurt and whole milk exhibited a strong association: if a customer bought whole milk and yogurt, they were more likely to also buy sausage.
The majority of rules involved small, two-product itemsets. This suggests that customers typically purchase specific pairs of products rather than larger sets of items. A few niche purchases were also made, including kitchen utensils and baby cosmetics.
The aim of this paper was to showcase market basket analysis using association rule mining. Insights like these can be used in several ways. Businesses might use such an analysis to create personalized promotions or cheaper product bundles based on strong associations between items. Market basket analysis is a powerful method that helps to understand customer behaviour.