Market basket analysis is an unsupervised machine learning technique that can be useful for finding patterns in transactional data.
It can be a very powerful tool for analyzing the purchasing patterns of consumers.
It is used for knowledge discovery rather than prediction.
This analysis results in a set of association rules that identify patterns of relationships among items.
The main algorithm used in market basket analysis is the Apriori algorithm.
The three statistical measures in market basket analysis are support, confidence, and lift.
The data contain 7501 rows, each of which is one transaction from a single store.
The 119 columns are features, one for each of the 119 different items that might appear in someone’s grocery basket.
Each cell in the matrix is 1 if the item was purchased in the corresponding transaction, and 0 otherwise.
The density value of 0.0329 (3.3%) is the proportion of non-zero matrix cells.
A total of 1754 transactions contained only a single item, while one transaction had 20 items.
The mean number of items per transaction is 3.914, while the median is 3.
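As a sanity check, the figures quoted above can be recomputed from the element length distribution in the summary output. The following is a minimal Python sketch written for illustration, not the code used in the original analysis; the variable and function names are my own.

```python
# Transaction-size distribution copied from the summary output:
# basket size -> number of transactions of that size.
size_counts = {
    1: 1754, 2: 1358, 3: 1044, 4: 816, 5: 667, 6: 493, 7: 391, 8: 324,
    9: 259, 10: 139, 11: 102, 12: 67, 13: 40, 14: 22, 15: 17, 16: 4,
    18: 1, 19: 2, 20: 1,
}
n_items = 119  # number of columns (distinct items)

n_transactions = sum(size_counts.values())
# Total number of non-zero cells in the transaction matrix.
n_purchases = sum(size * count for size, count in size_counts.items())

mean_size = n_purchases / n_transactions
density = n_purchases / (n_transactions * n_items)


def median_size(counts):
    """Median basket size from a size -> count table (odd totals only)."""
    total = sum(counts.values())
    middle = (total + 1) // 2  # rank of the middle transaction
    running = 0
    for size in sorted(counts):
        running += counts[size]
        if running >= middle:
            return size


print(n_transactions, n_purchases)
print(round(mean_size, 3), round(density, 4), median_size(size_counts))
```

The density is simply the number of purchased items divided by the number of cells in the 7501 x 119 matrix.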
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
Relative frequency (support) of each item:
## almonds antioxydant juice asparagus
## 0.0203972804 0.0089321424 0.0047993601
## avocado babies food bacon
## 0.0333288895 0.0045327290 0.0086655113
## barbecue sauce black tea blueberries
## 0.0107985602 0.0142647647 0.0091987735
## body spray bramble brownies
## 0.0114651380 0.0018664178 0.0337288362
## bug spray burger sauce burgers
## 0.0086655113 0.0058658845 0.0871883749
## butter cake candy bars
## 0.0301293161 0.0810558592 0.0097320357
## carrots cauliflower cereals
## 0.0153312892 0.0047993601 0.0257299027
## champagne chicken chili
## 0.0467937608 0.0599920011 0.0061325157
## chocolate chocolate bread chutney
## 0.1638448207 0.0042660979 0.0041327823
## cider clothes accessories cookies
## 0.0105319291 0.0083988801 0.0803892814
## cooking oil corn cottage cheese
## 0.0510598587 0.0047993601 0.0318624183
## cream dessert wine eggplant
## 0.0009332089 0.0043994134 0.0131982402
## eggs energy bar energy drink
## 0.1797093721 0.0270630583 0.0266631116
## escalope extra dark chocolate flax seed
## 0.0793227570 0.0119984002 0.0090654579
## french fries french wine fresh bread
## 0.1709105453 0.0225303293 0.0430609252
## fresh tuna fromage blanc frozen smoothie
## 0.0222636982 0.0135981869 0.0633248900
## frozen vegetables gluten free bar grated cheese
## 0.0953206239 0.0069324090 0.0523930143
## green beans green grapes green tea
## 0.0086655113 0.0090654579 0.1321157179
## ground beef gums ham
## 0.0982535662 0.0134648714 0.0265297960
## hand protein bar herb & pepper honey
## 0.0051993068 0.0494600720 0.0474603386
## hot dogs ketchup light cream
## 0.0323956806 0.0043994134 0.0155979203
## light mayo low fat yogurt magazines
## 0.0271963738 0.0765231302 0.0109318757
## mashed potato mayonnaise meatballs
## 0.0041327823 0.0061325157 0.0209305426
## melons milk mineral water
## 0.0119984002 0.1295827223 0.2383682176
## mint mint green tea muffins
## 0.0174643381 0.0055992534 0.0241301160
## mushroom cream sauce napkins nonfat milk
## 0.0190641248 0.0006665778 0.0103986135
## oatmeal oil olive oil
## 0.0043994134 0.0230635915 0.0658578856
## pancakes parmesan cheese pasta
## 0.0950539928 0.0198640181 0.0157312358
## pepper pet food pickles
## 0.0265297960 0.0065324623 0.0059992001
## protein bar red wine rice
## 0.0185308626 0.0281295827 0.0187974937
## salad salmon salt
## 0.0049326756 0.0425276630 0.0091987735
## sandwich shallot shampoo
## 0.0045327290 0.0077323024 0.0049326756
## shrimp soda soup
## 0.0714571390 0.0062658312 0.0505265965
## spaghetti sparkling water spinach
## 0.1741101187 0.0062658312 0.0070657246
## strawberries strong cheese tea
## 0.0213304893 0.0077323024 0.0038661512
## tomato juice tomato sauce tomatoes
## 0.0303959472 0.0141314491 0.0683908812
## toothpaste turkey vegetables mix
## 0.0081322490 0.0625249967 0.0257299027
## water spray white wine whole weat flour
## 0.0003999467 0.0165311292 0.0093320891
## whole wheat pasta whole wheat rice yams
## 0.0294627383 0.0585255299 0.0114651380
## yogurt cake zucchini
## 0.0273296894 0.0094654046
Absolute frequency (number of transactions) of each item:
## almonds antioxydant juice asparagus
## 153 67 36
## avocado babies food bacon
## 250 34 65
## barbecue sauce black tea blueberries
## 81 107 69
## body spray bramble brownies
## 86 14 253
## bug spray burger sauce burgers
## 65 44 654
## butter cake candy bars
## 226 608 73
## carrots cauliflower cereals
## 115 36 193
## champagne chicken chili
## 351 450 46
## chocolate chocolate bread chutney
## 1229 32 31
## cider clothes accessories cookies
## 79 63 603
## cooking oil corn cottage cheese
## 383 36 239
## cream dessert wine eggplant
## 7 33 99
## eggs energy bar energy drink
## 1348 203 200
## escalope extra dark chocolate flax seed
## 595 90 68
## french fries french wine fresh bread
## 1282 169 323
## fresh tuna fromage blanc frozen smoothie
## 167 102 475
## frozen vegetables gluten free bar grated cheese
## 715 52 393
## green beans green grapes green tea
## 65 68 991
## ground beef gums ham
## 737 101 199
## hand protein bar herb & pepper honey
## 39 371 356
## hot dogs ketchup light cream
## 243 33 117
## light mayo low fat yogurt magazines
## 204 574 82
## mashed potato mayonnaise meatballs
## 31 46 157
## melons milk mineral water
## 90 972 1788
## mint mint green tea muffins
## 131 42 181
## mushroom cream sauce napkins nonfat milk
## 143 5 78
## oatmeal oil olive oil
## 33 173 494
## pancakes parmesan cheese pasta
## 713 149 118
## pepper pet food pickles
## 199 49 45
## protein bar red wine rice
## 139 211 141
## salad salmon salt
## 37 319 69
## sandwich shallot shampoo
## 34 58 37
## shrimp soda soup
## 536 47 379
## spaghetti sparkling water spinach
## 1306 47 53
## strawberries strong cheese tea
## 160 58 29
## tomato juice tomato sauce tomatoes
## 228 106 513
## toothpaste turkey vegetables mix
## 61 469 193
## water spray white wine whole weat flour
## 3 124 70
## whole wheat pasta whole wheat rice yams
## 221 439 86
## yogurt cake zucchini
## 205 71
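Next, I want to visualise the Jaccard distance among the items with a frequency higher than 10%. The Jaccard coefficient is $J = \frac{f_{11}}{f_{1+} + f_{+1} - f_{11}}$, where $f_{11}$ is the number of transactions containing both items and $f_{1+}$ and $f_{+1}$ are the numbers of transactions containing the first and the second item, respectively; the distance is $1 - J$.
The most dissimilar items are milk and green tea (distance 0.93), while the most similar are spaghetti and mineral water (distance 0.83).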
##               chocolate eggs french fries green tea milk mineral water
## eggs               0.89
## french fries       0.89 0.88
## green tea          0.91 0.91         0.90
## milk               0.88 0.89         0.91      0.93
## mineral water      0.85 0.86         0.91      0.91 0.85
## spaghetti          0.87 0.88         0.91      0.91 0.87          0.83
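To make the distance concrete, one entry of the matrix can be recomputed by hand. This is a small Python sketch for illustration only (the function name is my own); it uses the item counts from the frequency output and the count reported for the rule {spaghetti} => {mineral water} in the rule output of this document.

```python
def jaccard_distance(f11, f1_plus, f_plus1):
    """Return 1 - J, where J = f11 / (f1+ + f+1 - f11)."""
    union = f1_plus + f_plus1 - f11  # transactions containing either item
    return 1 - f11 / union


# spaghetti: 1306 transactions, mineral water: 1788, both together: 448
d = jaccard_distance(f11=448, f1_plus=1306, f_plus1=1788)
print(round(d, 2))
```

Rounded to two decimals this gives the value shown in the spaghetti / mineral water cell of the matrix.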
Then, for a graphical display, I plot these distances as a histogram.
The Apriori algorithm relies on a simple prior belief about the properties of frequent itemsets: all subsets of a frequent itemset must also be frequent. Equivalently, any itemset that contains an infrequent subset cannot itself be frequent, which limits the number of itemsets, and therefore rules, that have to be searched.
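The pruning idea can be sketched in a few lines. The following is a deliberately simple, unoptimised Python illustration on hypothetical toy baskets; it is not the implementation behind the outputs in this document, and the function name and data are assumptions made for the example.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step (the Apriori property): drop any candidate that has an
        # infrequent (k-1)-subset, without counting it in the data.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy baskets, not the store data.
toy = [
    {"spaghetti", "ground beef", "mineral water"},
    {"spaghetti", "ground beef"},
    {"spaghetti", "mineral water"},
    {"eggs", "mineral water"},
    {"eggs"},
]
result = apriori_frequent_itemsets(toy, min_support=0.4)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

In this toy example the three-item set is never counted: its subset {ground beef, mineral water} is infrequent, so the candidate is discarded in the prune step.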
Two statistical measures can be used to determine whether a rule is interesting:
Support measures how frequently an itemset appears in a given transactional data set.
Confidence measures a rule’s predictive power or accuracy: the proportion of transactions containing the left-hand side that also contain the right-hand side.
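Both measures are simple proportions. The sketch below, in plain Python on a hypothetical four-basket example, shows one way to compute them; the function names and the toy data are assumptions for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


# Hypothetical toy baskets, not the store data.
toy = [
    ["burgers", "eggs"],
    ["burgers", "eggs", "milk"],
    ["burgers"],
    ["eggs", "milk"],
]
sup = support(["burgers", "eggs"], toy)         # 2 of the 4 baskets
conf = confidence(["burgers"], ["eggs"], toy)   # 2 of the 3 burger baskets
print(sup, conf)
```

Note that confidence is not symmetric: {burgers} => {eggs} and {eggs} => {burgers} share the same support but divide by different left-hand-side supports.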
The first step in creating a set of association rules is to determine suitable thresholds for support and confidence.
I try different values of support and confidence and see graphically how many rules are generated for each combination.
I decide to try:
Support values of 10%, 5%, 2% and 1%
Confidence values from 10% to 80%
After creating vectors with the different values of support and confidence, I plot the number of rules in order to find the best values.
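The threshold search amounts to counting how many rules survive each (support, confidence) pair. Here is a minimal Python sketch of that idea, restricted to one-item => one-item rules on hypothetical toy baskets; it is an illustration under those assumptions, not the code that produced the plot.

```python
from itertools import permutations


def pair_rules(transactions):
    """Yield (lhs, rhs, support, confidence) for every one-to-one rule."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    items = sorted({i for b in baskets for i in b})
    for lhs, rhs in permutations(items, 2):
        both = sum(lhs in b and rhs in b for b in baskets)
        lhs_count = sum(lhs in b for b in baskets)
        if both:
            yield lhs, rhs, both / n, both / lhs_count


def count_rules(transactions, supports, confidences):
    """Number of rules that survive each (support, confidence) pair."""
    rules = list(pair_rules(transactions))
    return {
        (s, c): sum(sup >= s and conf >= c for _, _, sup, conf in rules)
        for s in supports
        for c in confidences
    }


# Hypothetical toy baskets, not the store data.
toy = [
    {"spaghetti", "mineral water"},
    {"spaghetti", "mineral water", "milk"},
    {"spaghetti", "milk"},
    {"mineral water"},
]
grid = count_rules(toy, supports=[0.5, 0.25], confidences=[0.5, 0.8])
for (s, c), n_rules in sorted(grid.items(), reverse=True):
    print(f"support >= {s:.2f}, confidence >= {c:.1f}: {n_rules} rules")
```

Lowering either threshold can only keep or increase the number of rules, which is why very low support values tend to produce too many rules to inspect.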
From the plot I can see that:
Support level of 10%: I find few rules, and with very low confidence levels. Hence, I cannot use this value because the resulting rules are unrepresentative.
Support level of 5%: the result is more or less the same as for a support level of 10%.
Support level of 2%: this time I find 20 rules with a confidence of at least 30%.
Support level of 1%: too many rules.
Based on the graph above, I decide to create the association rules with the Apriori algorithm using support = 2% and confidence = 30%.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 150
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [20 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 20 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 20
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02013 Min. :0.3060 Min. :0.05053 Min. :1.316
## 1st Qu.:0.02290 1st Qu.:0.3303 1st Qu.:0.06522 1st Qu.:1.435
## Median :0.02593 Median :0.3515 Median :0.07399 Median :1.563
## Mean :0.03080 Mean :0.3609 Mean :0.08613 Mean :1.618
## 3rd Qu.:0.03660 3rd Qu.:0.3836 3rd Qu.:0.09605 3rd Qu.:1.758
## Max. :0.05973 Max. :0.4565 Max. :0.17411 Max. :2.291
## count
## Min. :151.0
## 1st Qu.:171.8
## Median :194.5
## Mean :231.1
## 3rd Qu.:274.5
## Max. :448.0
##
## mining info:
## data ntransactions support confidence
## Basket 7501 0.02 0.3
Let’s now reorder the rules so that we can inspect the top 5 rules ordered by support, confidence, lift and count,
where:
Support measures how frequently an itemset appears in a given transactional data set.
Confidence measures a rule’s predictive power or accuracy.
Lift is the ratio of the observed support to that expected if the two sides of the rule were independent; high lift values indicate stronger associations.
Count is the number of transactions that contain all the items in the rule.
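These quantities can all be derived from four counts. As a worked check, the Python sketch below recomputes the measures for the rule {ground beef} => {spaghetti} from the item counts and the rule count reported in this document; it is an illustration, not the original code.

```python
n = 7501            # total number of transactions
count_lhs = 737     # transactions containing ground beef
count_rhs = 1306    # transactions containing spaghetti
count_both = 294    # transactions containing both (the rule's count)

support = count_both / n
coverage = count_lhs / n                  # support of the left-hand side
confidence = count_both / count_lhs
lift = confidence / (count_rhs / n)       # observed vs. expected under independence

print(f"support={support:.5f} confidence={confidence:.5f} "
      f"coverage={coverage:.5f} lift={lift:.4f}")
```

A lift above 1 means the two items appear together more often than they would if they were purchased independently.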
## lhs rhs support confidence coverage lift
## [1] {spaghetti} => {mineral water} 0.05972537 0.3430322 0.17411012 1.439085
## [2] {chocolate} => {mineral water} 0.05265965 0.3213995 0.16384482 1.348332
## [3] {milk} => {mineral water} 0.04799360 0.3703704 0.12958272 1.553774
## [4] {ground beef} => {mineral water} 0.04092788 0.4165536 0.09825357 1.747522
## [5] {ground beef} => {spaghetti} 0.03919477 0.3989145 0.09825357 2.291162
## count
## [1] 448
## [2] 395
## [3] 360
## [4] 307
## [5] 294
## lhs rhs support confidence coverage lift
## [1] {soup} => {mineral water} 0.02306359 0.4564644 0.05052660 1.914955
## [2] {olive oil} => {mineral water} 0.02759632 0.4190283 0.06585789 1.757904
## [3] {ground beef} => {mineral water} 0.04092788 0.4165536 0.09825357 1.747522
## [4] {ground beef} => {spaghetti} 0.03919477 0.3989145 0.09825357 2.291162
## [5] {cooking oil} => {mineral water} 0.02013065 0.3942559 0.05105986 1.653978
## count
## [1] 173
## [2] 207
## [3] 307
## [4] 294
## [5] 151
## lhs rhs support confidence coverage lift
## [1] {ground beef} => {spaghetti} 0.03919477 0.3989145 0.09825357 2.291162
## [2] {olive oil} => {spaghetti} 0.02293028 0.3481781 0.06585789 1.999758
## [3] {soup} => {mineral water} 0.02306359 0.4564644 0.05052660 1.914955
## [4] {burgers} => {eggs} 0.02879616 0.3302752 0.08718837 1.837830
## [5] {olive oil} => {mineral water} 0.02759632 0.4190283 0.06585789 1.757904
## count
## [1] 294
## [2] 172
## [3] 173
## [4] 216
## [5] 207
## lhs rhs support confidence coverage lift
## [1] {spaghetti} => {mineral water} 0.05972537 0.3430322 0.17411012 1.439085
## [2] {chocolate} => {mineral water} 0.05265965 0.3213995 0.16384482 1.348332
## [3] {milk} => {mineral water} 0.04799360 0.3703704 0.12958272 1.553774
## [4] {ground beef} => {mineral water} 0.04092788 0.4165536 0.09825357 1.747522
## [5] {ground beef} => {spaghetti} 0.03919477 0.3989145 0.09825357 2.291162
## count
## [1] 448
## [2] 395
## [3] 360
## [4] 307
## [5] 294
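The four listings above differ only in the sort key. A minimal Python sketch of the same reordering, using a handful of rules copied from the output (the tuple layout and helper name are my own choices), could look like this:

```python
from operator import itemgetter

# A few rules copied from the output: (lhs, rhs, support, confidence, lift).
rules = [
    ("spaghetti", "mineral water", 0.05972537, 0.3430322, 1.439085),
    ("ground beef", "spaghetti", 0.03919477, 0.3989145, 2.291162),
    ("soup", "mineral water", 0.02306359, 0.4564644, 1.914955),
    ("olive oil", "spaghetti", 0.02293028, 0.3481781, 1.999758),
]
FIELDS = {"support": 2, "confidence": 3, "lift": 4}  # tuple positions


def top_rules(rules, by, n=3):
    """Return the n rules with the largest value of the chosen measure."""
    return sorted(rules, key=itemgetter(FIELDS[by]), reverse=True)[:n]


for measure in FIELDS:
    best = top_rules(rules, by=measure, n=1)[0]
    print(f"highest {measure}: {best[0]} => {best[1]}")
```

The ranking changes with the measure: the most frequent rule is not the one with the highest confidence or lift.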
To conclude, I want to find associations with particular items.
I create association rules for eggs with a support of 1% and the same confidence of 30% as before, and inspect them.
## set of 2 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01946 Min. :0.3113 Min. :0.06252 Min. :1.732
## 1st Qu.:0.02180 1st Qu.:0.3160 1st Qu.:0.06869 1st Qu.:1.759
## Median :0.02413 Median :0.3208 Median :0.07486 Median :1.785
## Mean :0.02413 Mean :0.3208 Mean :0.07486 Mean :1.785
## 3rd Qu.:0.02646 3rd Qu.:0.3255 3rd Qu.:0.08102 3rd Qu.:1.811
## Max. :0.02880 Max. :0.3303 Max. :0.08719 Max. :1.838
## count
## Min. :146.0
## 1st Qu.:163.5
## Median :181.0
## Mean :181.0
## 3rd Qu.:198.5
## Max. :216.0
##
## mining info:
## data ntransactions support confidence
## Basket 7501 0.01 0.3
After ordering by confidence level, I inspect them.
It is easy to see that for this product only two other items, burgers and turkey, have an association at the chosen threshold values.
## lhs rhs support confidence coverage lift count
## [1] {turkey} => {eggs} 0.01946407 0.3113006 0.06252500 1.732245 146
## [2] {burgers} => {eggs} 0.02879616 0.3302752 0.08718837 1.837830 216
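The effect of the support threshold on these two rules can be illustrated with a small filter. This Python sketch selects rules by their right-hand side after the fact, using values copied from the outputs; the original analysis presumably restricted the right-hand side during mining, so treat the helper below as a hypothetical stand-in.

```python
# Rules copied from the outputs in this document.
rules = [
    {"lhs": "turkey", "rhs": "eggs", "support": 0.01946407, "confidence": 0.3113006},
    {"lhs": "burgers", "rhs": "eggs", "support": 0.02879616, "confidence": 0.3302752},
    {"lhs": "ground beef", "rhs": "spaghetti", "support": 0.03919477, "confidence": 0.3989145},
]


def rules_with_rhs(rules, rhs, min_support, min_confidence):
    """Keep rules that predict `rhs` and pass both thresholds."""
    return [
        r for r in rules
        if r["rhs"] == rhs
        and r["support"] >= min_support
        and r["confidence"] >= min_confidence
    ]


at_2_percent = rules_with_rhs(rules, "eggs", min_support=0.02, min_confidence=0.3)
at_1_percent = rules_with_rhs(rules, "eggs", min_support=0.01, min_confidence=0.3)
print([r["lhs"] for r in at_2_percent])
print([r["lhs"] for r in at_1_percent])
```

With a 2% support threshold only the burgers rule survives; the turkey rule has a support just below 2% and appears only once the threshold is lowered to 1%.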
So, let’s try to do the same with another, perhaps more common, product: spaghetti.
Now I find 17 rules.
## set of 17 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 8 9
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.529 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01013 Min. :0.3004 Min. :0.02760 Min. :1.725
## 1st Qu.:0.01093 1st Qu.:0.3155 1st Qu.:0.03373 1st Qu.:1.812
## Median :0.01573 Median :0.3288 Median :0.04253 Median :1.889
## Mean :0.01585 Mean :0.3377 Mean :0.04669 Mean :1.940
## 3rd Qu.:0.01653 3rd Qu.:0.3482 3rd Qu.:0.05239 3rd Qu.:2.000
## Max. :0.03919 Max. :0.4169 Max. :0.09825 Max. :2.395
## count
## Min. : 76.0
## 1st Qu.: 82.0
## Median :118.0
## Mean :118.9
## 3rd Qu.:124.0
## Max. :294.0
##
## mining info:
## data ntransactions support confidence
## Basket 7501 0.01 0.3
Let’s order by confidence level and inspect the first 6.
The best associations with spaghetti are ground beef, mineral water, olive oil and red wine.
Hence, in conclusion, I can say that the associations present in these market baskets represent ingredients that are often consumed together.
## lhs rhs support confidence coverage
## [1] {ground beef,mineral water} => {spaghetti} 0.01706439 0.4169381 0.04092788
## [2] {ground beef} => {spaghetti} 0.03919477 0.3989145 0.09825357
## [3] {mineral water,olive oil} => {spaghetti} 0.01026530 0.3719807 0.02759632
## [4] {red wine} => {spaghetti} 0.01026530 0.3649289 0.02812958
## [5] {olive oil} => {spaghetti} 0.02293028 0.3481781 0.06585789
## [6] {chocolate,milk} => {spaghetti} 0.01093188 0.3402490 0.03212905
## lift count
## [1] 2.394681 128
## [2] 2.291162 294
## [3] 2.136468 77
## [4] 2.095966 77
## [5] 1.999758 172
## [6] 1.954217 82