Companies do whatever they can to survive and to outperform competitors in their field. In retail this largely means becoming ever more attractive to customers while earning as much as possible from every person who enters the shop. That cannot be done without deep knowledge of customer behaviour, including, or perhaps especially, the behaviour that customers themselves are unaware of.
This project was inspired by the seventh chapter of the book “The Power of Habit” by Charles Duhigg. The book is available online here. For many years retailers have used different psychological tricks that directly affect the decisions customers make in a store. For example, fruits and vegetables are often placed close to the entrance, because people who already have healthy food in their basket are supposed to be more likely to buy snacks and unhealthy food afterwards. In fact, you are very likely to see this arrangement in any shop you visit. In his book the author focuses on the Target Corporation and the way it goes one or even two steps further, creating personalised ads and coupons based on many individual details rather than on the behaviour of the majority, so that the offer can be tailored to each customer separately. Nowadays personalised advertising is indeed powerful and present almost everywhere. However, as long as brick-and-mortar shops exist and the retail market is not entirely taken over by e-shopping, product arrangement remains a relevant problem and may still be a key to optimising revenue. It is also extremely important to attract not so much the customers who are already regulars of a specific shop as those who come for the first time, or who shop in many places and are not convinced that one shop is better than another. These are the people the company knows very little about and has no effective way of reaching with a personalised offer.
The world is changing, and so are people’s choices and eating habits. On the one hand more and more people are obese; on the other hand healthy lifestyles and special diets are probably more popular than ever, so vegetables and snacks may not end up in the same basket as often as before. It is possible that rules which have been known and used for many years are no longer the most important ones from a company’s perspective, or no longer hold at all. Fortunately, with the large amounts of data gathered these days it is easy to investigate current customer behaviour and compare it with those long-established rules. Market basket analysis is a technique that can help with that.
library(arules)
library(arulesViz)
The dataset used in this analysis was downloaded from kaggle.com and can be found here. It contains 7501 transactions from grocery stores, covering 119 different products. Based on that data, a market basket analysis will be performed with a focus on food generally considered healthy or unhealthy.
trans<-read.transactions("Market_Basket_Optimisation.csv", format="basket", sep=",", skip=0)
trans
## transactions in sparse format with
## 7501 transactions (rows) and
## 119 items (columns)
LIST(head(trans))
## [[1]]
## [1] "almonds" "antioxydant juice" "avocado"
## [4] "cottage cheese" "energy drink" "frozen smoothie"
## [7] "green grapes" "green tea" "honey"
## [10] "low fat yogurt" "mineral water" "olive oil"
## [13] "salad" "salmon" "shrimp"
## [16] "spinach" "tomato juice" "vegetables mix"
## [19] "whole weat flour" "yams"
##
## [[2]]
## [1] "burgers" "eggs" "meatballs"
##
## [[3]]
## [1] "chutney"
##
## [[4]]
## [1] "avocado" "turkey"
##
## [[5]]
## [1] "energy bar" "green tea" "milk"
## [4] "mineral water" "whole wheat rice"
##
## [[6]]
## [1] "low fat yogurt"
sort(itemFrequency(trans, type="absolute"))
## water spray napkins cream
## 3 5 7
## bramble tea chutney
## 14 29 31
## mashed potato chocolate bread dessert wine
## 31 32 33
## ketchup oatmeal babies food
## 33 33 34
## sandwich asparagus cauliflower
## 34 36 36
## corn salad shampoo
## 36 37 37
## hand protein bar mint green tea burger sauce
## 39 42 44
## pickles chili mayonnaise
## 45 46 46
## soda sparkling water pet food
## 47 47 49
## gluten free bar spinach shallot
## 52 53 58
## strong cheese toothpaste clothes accessories
## 58 61 63
## bacon bug spray green beans
## 65 65 65
## antioxydant juice flax seed green grapes
## 67 68 68
## blueberries salt whole weat flour
## 69 69 70
## zucchini candy bars nonfat milk
## 71 73 78
## cider barbecue sauce magazines
## 79 81 82
## body spray yams extra dark chocolate
## 86 86 90
## melons eggplant gums
## 90 99 101
## fromage blanc tomato sauce black tea
## 102 106 107
## carrots light cream pasta
## 115 117 118
## white wine mint protein bar
## 124 131 139
## rice mushroom cream sauce parmesan cheese
## 141 143 149
## almonds meatballs strawberries
## 153 157 160
## fresh tuna french wine oil
## 167 169 173
## muffins cereals vegetables mix
## 181 193 193
## ham pepper energy drink
## 199 199 200
## energy bar light mayo yogurt cake
## 203 204 205
## red wine whole wheat pasta butter
## 211 221 226
## tomato juice cottage cheese hot dogs
## 228 239 243
## avocado brownies salmon
## 250 253 319
## fresh bread champagne honey
## 323 351 356
## herb & pepper soup cooking oil
## 371 379 383
## grated cheese whole wheat rice chicken
## 393 439 450
## turkey frozen smoothie olive oil
## 469 475 494
## tomatoes shrimp low fat yogurt
## 513 536 574
## escalope cookies cake
## 595 603 608
## burgers pancakes frozen vegetables
## 654 713 715
## ground beef milk green tea
## 737 972 991
## chocolate french fries spaghetti
## 1229 1282 1306
## eggs mineral water
## 1348 1788
The dataset contains over 100 different items. However, products in a store must be arranged in an order that is logical for customers, so that they do not get frustrated when they cannot find something. For that reason the products will be grouped into 19 categories.
names.real<-c("almonds", "antioxydant juice", "asparagus", "avocado", "other", "bacon", "barbecue sauce", "black tea", "blueberries", "body spray", "bramble", "brownies", "bug spray", "burger sauce", "burgers", "butter", "cake", "candy bars", "carrots", "cauliflower", "cereals", "champagne", "chicken", "chili", "chocolate", "chocolate bread", "chutney", "cider", "clothes accessories" ,"cookies", "cooking oil", "corn", "cottage cheese", "cream", "dessert wine", "eggplant", "eggs", "energy bar", "energy drink", "escalope", "extra dark chocolate", "flax seed", "french fries", "french wine", "fresh bread", "fresh tuna", "fromage blanc", "frozen smoothie", "frozen vegetables", "gluten free bar", "grated cheese", "green beans", "green grapes", "green tea", "ground beef", "gums", "ham", "hand protein bar", "herb & pepper", "honey", "hot dogs", "ketchup", "light cream", "light mayo", "low fat yogurt", "magazines", "mashed potato", "mayonnaise", "meatballs", "melons", "milk", "mineral water", "mint", "mint green tea", "muffins", "mushroom cream sauce", "napkins", "nonfat milk", "oatmeal", "oil", "olive oil", "pancakes", "parmesan cheese", "pasta", "pepper", "pet food", "pickles", "protein bar", "red wine", "rice", "salad", "salmon", "salt", "sandwich", "shallot", "shampoo", "shrimp", "soda", "soup", "spaghetti", "sparkling water", "spinach", "strawberries", "strong cheese", "tea", "tomato juice", "tomato sauce", "tomatoes", "toothpaste", "turkey", "vegetables mix", "water spray", "white wine", "whole weat flour", "whole wheat pasta", "whole wheat rice", "yams", "yogurt cake", "zucchini")
names.level1<-c("nuts and seeds", "beverages", "vegetables", "fruits", "other", "meat", "sauces and spices", "beverages", "fruits", "cosmetics", "fruits", "snacks", "other", "sauces and spices", "fast food meals", "dairy products", "snacks", "snacks", "vegetables", "vegetables", "cereals", "alcoholic", "meat", "vegetables", "snacks", "snacks", "sauces and spices", "alcoholic", "other", "snacks", "oils and flour", "vegetables", "dairy products", "dairy products", "alcoholic", "vegetables", "dairy products", "snacks", "beverages", "meat", "snacks", "nuts and seeds", "fast food meals", "alcoholic", "bread", "fishes", "dairy products", "beverages", "vegetables", "snacks", "dairy products", "vegetables", "fruits", "beverages", "meat", "snacks", "meat", "snacks", "sauces and spices", "honey", "fast food meals", "sauces and spices", "cosmetics", "sauces and spices", "dairy products", "other", "vegetables", "sauces and spices", "meat", "fruits", "dairy products" ,"beverages", "sauces and spices", "beverages", "snacks", "sauces and spices","other", "dairy products", "cereals", "oils and flour", "oils and flour", "ready-made meals", "dairy products", "pasta and rice", "vegetables", "other", "vegetables", "snacks", "alcoholic", "pasta and rice", "vegetables", "fishes", "sauces and spices", "bread", "vegetables", "cosmetics", "fishes", "beverages", "ready-made meals", "ready-made meals", "beverages", "vegetables", "fruits", "dairy products", "beverages", "beverages", "sauces and spices", "vegetables", "cosmetics", "meat", "vegetables", "cosmetics", "alcoholic", "oils and flour", "pasta and rice", "pasta and rice", "vegetables", "snacks", "vegetables")
itemInfo(trans) <- data.frame(labels = names.real, level1 = names.level1)
itemInfo(trans)
## labels level1
## 1 almonds nuts and seeds
## 2 antioxydant juice beverages
## 3 asparagus vegetables
## 4 avocado fruits
## 5 other other
## 6 bacon meat
## 7 barbecue sauce sauces and spices
## 8 black tea beverages
## 9 blueberries fruits
## 10 body spray cosmetics
## 11 bramble fruits
## 12 brownies snacks
## 13 bug spray other
## 14 burger sauce sauces and spices
## 15 burgers fast food meals
## 16 butter dairy products
## 17 cake snacks
## 18 candy bars snacks
## 19 carrots vegetables
## 20 cauliflower vegetables
## 21 cereals cereals
## 22 champagne alcoholic
## 23 chicken meat
## 24 chili vegetables
## 25 chocolate snacks
## 26 chocolate bread snacks
## 27 chutney sauces and spices
## 28 cider alcoholic
## 29 clothes accessories other
## 30 cookies snacks
## 31 cooking oil oils and flour
## 32 corn vegetables
## 33 cottage cheese dairy products
## 34 cream dairy products
## 35 dessert wine alcoholic
## 36 eggplant vegetables
## 37 eggs dairy products
## 38 energy bar snacks
## 39 energy drink beverages
## 40 escalope meat
## 41 extra dark chocolate snacks
## 42 flax seed nuts and seeds
## 43 french fries fast food meals
## 44 french wine alcoholic
## 45 fresh bread bread
## 46 fresh tuna fishes
## 47 fromage blanc dairy products
## 48 frozen smoothie beverages
## 49 frozen vegetables vegetables
## 50 gluten free bar snacks
## 51 grated cheese dairy products
## 52 green beans vegetables
## 53 green grapes fruits
## 54 green tea beverages
## 55 ground beef meat
## 56 gums snacks
## 57 ham meat
## 58 hand protein bar snacks
## 59 herb & pepper sauces and spices
## 60 honey honey
## 61 hot dogs fast food meals
## 62 ketchup sauces and spices
## 63 light cream cosmetics
## 64 light mayo sauces and spices
## 65 low fat yogurt dairy products
## 66 magazines other
## 67 mashed potato vegetables
## 68 mayonnaise sauces and spices
## 69 meatballs meat
## 70 melons fruits
## 71 milk dairy products
## 72 mineral water beverages
## 73 mint sauces and spices
## 74 mint green tea beverages
## 75 muffins snacks
## 76 mushroom cream sauce sauces and spices
## 77 napkins other
## 78 nonfat milk dairy products
## 79 oatmeal cereals
## 80 oil oils and flour
## 81 olive oil oils and flour
## 82 pancakes ready-made meals
## 83 parmesan cheese dairy products
## 84 pasta pasta and rice
## 85 pepper vegetables
## 86 pet food other
## 87 pickles vegetables
## 88 protein bar snacks
## 89 red wine alcoholic
## 90 rice pasta and rice
## 91 salad vegetables
## 92 salmon fishes
## 93 salt sauces and spices
## 94 sandwich bread
## 95 shallot vegetables
## 96 shampoo cosmetics
## 97 shrimp fishes
## 98 soda beverages
## 99 soup ready-made meals
## 100 spaghetti ready-made meals
## 101 sparkling water beverages
## 102 spinach vegetables
## 103 strawberries fruits
## 104 strong cheese dairy products
## 105 tea beverages
## 106 tomato juice beverages
## 107 tomato sauce sauces and spices
## 108 tomatoes vegetables
## 109 toothpaste cosmetics
## 110 turkey meat
## 111 vegetables mix vegetables
## 112 water spray cosmetics
## 113 white wine alcoholic
## 114 whole weat flour oils and flour
## 115 whole wheat pasta pasta and rice
## 116 whole wheat rice pasta and rice
## 117 yams vegetables
## 118 yogurt cake snacks
## 119 zucchini vegetables
trans_level2<-aggregate(trans, by="level1")
itemFrequencyPlot(trans_level2, topN=20, type="relative", main="Item Frequency")
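To make the grouping step more concrete, here is a minimal sketch of what aggregate() does to the transactions. The baskets, item names and groups are invented for illustration and are not part of the dataset; only the arules calls already used above (itemInfo, aggregate, itemFrequency) are assumed.

```r
library(arules)

# Toy baskets; item names and groups are made up for illustration.
baskets <- list(
  c("apple", "carrot", "chips"),
  c("banana"),
  c("carrot", "chips")
)
toy <- as(baskets, "transactions")

# Map every item label to its group, independent of the internal item order.
groups <- c(apple = "fruits", banana = "fruits",
            carrot = "vegetables", chips = "snacks")
itemInfo(toy)$level1 <- factor(unname(groups[itemLabels(toy)]))

# Collapse the items into their groups: a group is present in a basket
# if at least one of its items is present.
toy_grouped <- aggregate(toy, by = "level1")
group_counts <- itemFrequency(toy_grouped, type = "absolute")
group_counts
```

Mapping the groups by label rather than by position avoids relying on the order in which the items happen to be stored, which is the main risk when two long vectors have to be kept aligned by hand.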
A two-dimensional cross table shows the connection between every pair of categories in the dataset.
# How many times two types of products appear in the same transaction
ctab<-crossTable(trans_level2, sort=TRUE)
ctab
## beverages dairy products snacks meat ready-made meals
## beverages 3218 1555 1432 1154 1119
## dairy products 1555 3169 1391 1116 1089
## snacks 1432 1391 3151 923 918
## meat 1154 1116 923 2225 868
## ready-made meals 1119 1089 918 868 2073
## vegetables 1088 1005 899 777 780
## fast food meals 907 932 878 648 585
## sauces and spices 587 561 514 478 463
## oils and flour 592 573 499 439 475
## fishes 540 539 470 387 412
## alcoholic 441 427 404 308 323
## pasta and rice 487 463 405 381 312
## fruits 327 292 276 227 233
## honey 197 174 162 148 134
## bread 187 188 167 134 114
## cosmetics 173 146 165 135 151
## other 161 153 136 119 108
## cereals 145 127 108 86 102
## nuts and seeds 138 131 107 91 87
## vegetables fast food meals sauces and spices
## beverages 1088 907 587
## dairy products 1005 932 561
## snacks 899 878 514
## meat 777 648 478
## ready-made meals 780 585 463
## vegetables 1974 588 381
## fast food meals 588 1940 318
## sauces and spices 381 318 1144
## oils and flour 401 280 252
## fishes 447 253 216
## alcoholic 284 272 164
## pasta and rice 316 283 180
## fruits 220 199 100
## honey 132 110 67
## bread 137 106 83
## cosmetics 115 96 73
## other 106 81 77
## cereals 91 71 66
## nuts and seeds 89 86 42
## oils and flour fishes alcoholic pasta and rice fruits
## beverages 592 540 441 487 327
## dairy products 573 539 427 463 292
## snacks 499 470 404 405 276
## meat 439 387 308 381 227
## ready-made meals 475 412 323 312 233
## vegetables 401 447 284 316 220
## fast food meals 280 253 272 283 199
## sauces and spices 252 216 164 180 100
## oils and flour 1032 226 151 193 127
## fishes 226 961 155 176 120
## alcoholic 151 155 904 145 90
## pasta and rice 193 176 145 881 93
## fruits 127 120 90 93 616
## honey 74 79 51 57 48
## bread 73 75 70 53 50
## cosmetics 87 57 43 49 49
## other 61 61 55 54 32
## cereals 61 49 42 38 27
## nuts and seeds 48 48 35 31 28
## honey bread cosmetics other cereals nuts and seeds
## beverages 197 187 173 161 145 138
## dairy products 174 188 146 153 127 131
## snacks 162 167 165 136 108 107
## meat 148 134 135 119 86 91
## ready-made meals 134 114 151 108 102 87
## vegetables 132 137 115 106 91 89
## fast food meals 110 106 96 81 71 86
## sauces and spices 67 83 73 77 66 42
## oils and flour 74 73 87 61 61 48
## fishes 79 75 57 61 49 48
## alcoholic 51 70 43 55 42 35
## pasta and rice 57 53 49 54 38 31
## fruits 48 50 49 32 27 28
## honey 356 25 23 16 15 16
## bread 25 355 20 28 23 16
## cosmetics 23 20 302 22 17 9
## other 16 28 22 290 15 18
## cereals 15 23 17 15 225 11
## nuts and seeds 16 16 9 18 11 219
The number of products in each category of course has a major impact on the number of transactions in which that category occurs together with others. To account for this, crossTable() can also report a chi-squared measure, which compares the observed co-occurrence of each pair with the co-occurrence expected under the null hypothesis that the two categories occur independently.
chi2tab<-crossTable(trans_level2, measure="chiSquared", sort=TRUE)
chi2tab
## beverages dairy products snacks meat
## beverages NA 0.0037466879 6.341890e-04 5.556089e-03
## dairy products 0.0037466879 NA 3.578228e-04 4.392554e-03
## snacks 0.0006341890 0.0003578228 NA 1.943191e-05
## meat 0.0055560889 0.0043925545 1.943191e-05 NA
## ready-made meals 0.0079067527 0.0069194876 3.407711e-04 1.388761e-02
## vegetables 0.0091535267 0.0046760292 7.825446e-04 8.345843e-03
## fast food meals 0.0008943448 0.0020547881 6.503093e-04 1.219169e-03
## sauces and spices 0.0025145310 0.0016647181 3.100548e-04 7.553312e-03
## oils and flour 0.0067086344 0.0057393520 1.318536e-03 7.689797e-03
## fishes 0.0052749863 0.0058083803 1.451877e-03 4.860123e-03
## alcoholic 0.0009720034 0.0007094019 2.064478e-04 7.894749e-04
## pasta and rice 0.0041940242 0.0029529230 4.390582e-04 7.305932e-03
## fruits 0.0019851173 0.0005165335 1.529892e-04 1.430406e-03
## honey 0.0017109393 0.0004936112 1.382333e-04 2.269692e-03
## bread 0.0010541092 0.0012849576 2.855593e-04 1.042620e-03
## cosmetics 0.0019416435 0.0003542176 1.528374e-03 3.069946e-03
## other 0.0014344176 0.0010110142 2.199670e-04 1.685482e-03
## cereals 0.0032451062 0.0014309939 2.563991e-04 7.408843e-04
## nuts and seeds 0.0027529674 0.0021332815 3.261877e-04 1.391436e-03
## ready-made meals vegetables fast food meals
## beverages 0.0079067527 0.0091535267 8.943448e-04
## dairy products 0.0069194876 0.0046760292 2.054788e-03
## snacks 0.0003407711 0.0007825446 6.503093e-04
## meat 0.0138876089 0.0083458428 1.219169e-03
## ready-made meals NA 0.0134334598 5.935067e-04
## vegetables 0.0134334598 NA 1.566776e-03
## fast food meals 0.0005935067 0.0015667760 NA
## sauces and spices 0.0090921524 0.0028297493 2.205618e-04
## oils and flour 0.0168376867 0.0082211848 8.560581e-05
## fishes 0.0107608982 0.0198597562 1.064308e-05
## alcoholic 0.0028567432 0.0011908743 8.319115e-04
## pasta and rice 0.0025710529 0.0040719540 1.779244e-03
## fruits 0.0030845370 0.0027560396 1.317702e-03
## honey 0.0017187382 0.0020888237 4.653297e-04
## bread 0.0003431462 0.0027097444 2.921890e-04
## cosmetics 0.0072860873 0.0021168712 5.464640e-04
## other 0.0012906225 0.0015390273 6.391755e-05
## cereals 0.0033992524 0.0022750653 3.758046e-04
## nuts and seeds 0.0015441047 0.0022758912 2.028864e-03
## sauces and spices oils and flour fishes
## beverages 2.514531e-03 6.708634e-03 5.274986e-03
## dairy products 1.664718e-03 5.739352e-03 5.808380e-03
## snacks 3.100548e-04 1.318536e-03 1.451877e-03
## meat 7.553312e-03 7.689797e-03 4.860123e-03
## ready-made meals 9.092152e-03 1.683769e-02 1.076090e-02
## vegetables 2.829749e-03 8.221185e-03 1.985976e-02
## fast food meals 2.205618e-04 8.560581e-05 1.064308e-05
## sauces and spices NA 7.581184e-03 4.385383e-03
## oils and flour 7.581184e-03 NA 8.868592e-03
## fishes 4.385383e-03 8.868592e-03 NA
## alcoholic 6.601250e-04 7.599232e-04 1.767264e-03
## pasta and rice 2.066399e-03 5.668643e-03 4.707242e-03
## fruits 5.197447e-05 2.807934e-03 2.850789e-03
## honey 3.963682e-04 1.704030e-03 3.258933e-03
## bread 2.050570e-03 1.593061e-03 2.554134e-03
## cosmetics 2.100859e-03 6.628078e-03 1.155031e-03
## other 3.237140e-03 1.487790e-03 2.040423e-03
## cereals 3.900203e-03 3.887380e-03 1.882224e-03
## nuts and seeds 2.951833e-04 1.412885e-03 1.889703e-03
## alcoholic pasta and rice fruits honey
## beverages 0.0009720034 0.0041940242 1.985117e-03 1.710939e-03
## dairy products 0.0007094019 0.0029529230 5.165335e-04 4.936112e-04
## snacks 0.0002064478 0.0004390582 1.529892e-04 1.382333e-04
## meat 0.0007894749 0.0073059319 1.430406e-03 2.269692e-03
## ready-made meals 0.0028567432 0.0025710529 3.084537e-03 1.718738e-03
## vegetables 0.0011908743 0.0040719540 2.756040e-03 2.088824e-03
## fast food meals 0.0008319115 0.0017792435 1.317702e-03 4.653297e-04
## sauces and spices 0.0006601250 0.0020663988 5.197447e-05 3.963682e-04
## oils and flour 0.0007599232 0.0056686428 2.807934e-03 1.704030e-03
## fishes 0.0017672644 0.0047072422 2.850789e-03 3.258933e-03
## alcoholic NA 0.0018926169 4.461065e-04 2.036605e-04
## pasta and rice 0.0018926169 NA 7.857621e-04 7.354334e-04
## fruits 0.0004461065 0.0007857621 NA 1.605610e-03
## honey 0.0002036605 0.0007354334 1.605610e-03 NA
## bread 0.0023081480 0.0004086285 1.987282e-03 5.257814e-04
## cosmetics 0.0001597389 0.0006880178 3.147811e-03 6.986802e-04
## other 0.0015334232 0.0015561158 3.749786e-04 4.844963e-05
## cereals 0.0010890954 0.0006757293 5.240432e-04 2.331424e-04
## nuts and seeds 0.0003741646 0.0001443965 7.435227e-04 4.031261e-04
## bread cosmetics other cereals
## beverages 0.0010541092 1.941643e-03 1.434418e-03 0.0032451062
## dairy products 0.0012849576 3.542176e-04 1.011014e-03 0.0014309939
## snacks 0.0002855593 1.528374e-03 2.199670e-04 0.0002563991
## meat 0.0010426197 3.069946e-03 1.685482e-03 0.0007408843
## ready-made meals 0.0003431462 7.286087e-03 1.290623e-03 0.0033992524
## vegetables 0.0027097444 2.116871e-03 1.539027e-03 0.0022750653
## fast food meals 0.0002921890 5.464640e-04 6.391755e-05 0.0003758046
## sauces and spices 0.0020505702 2.100859e-03 3.237140e-03 0.0039002033
## oils and flour 0.0015930607 6.628078e-03 1.487790e-03 0.0038873798
## fishes 0.0025541337 1.155031e-03 2.040423e-03 0.0018822244
## alcoholic 0.0023081480 1.597389e-04 1.533423e-03 0.0010890954
## pasta and rice 0.0004086285 6.880178e-04 1.556116e-03 0.0006757293
## fruits 0.0019872819 3.147811e-03 3.749786e-04 0.0005240432
## honey 0.0005257814 6.986802e-04 4.844963e-05 0.0002331424
## bread NA 3.038203e-04 1.979410e-03 0.0019099539
## cosmetics 0.0003038203 NA 1.217054e-03 0.0009280763
## other 0.0019794103 1.217054e-03 NA 0.0006084999
## cereals 0.0019099539 9.280763e-04 6.084999e-04 NA
## nuts and seeds 0.0004084832 5.051095e-07 1.430964e-03 0.0003984305
## nuts and seeds
## beverages 2.752967e-03
## dairy products 2.133281e-03
## snacks 3.261877e-04
## meat 1.391436e-03
## ready-made meals 1.544105e-03
## vegetables 2.275891e-03
## fast food meals 2.028864e-03
## sauces and spices 2.951833e-04
## oils and flour 1.412885e-03
## fishes 1.889703e-03
## alcoholic 3.741646e-04
## pasta and rice 1.443965e-04
## fruits 7.435227e-04
## honey 4.031261e-04
## bread 4.084832e-04
## cosmetics 5.051095e-07
## other 1.430964e-03
## cereals 3.984305e-04
## nuts and seeds NA
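These values are not p-values, so they cannot be compared with a significance level directly. For each pair, crossTable() reports (p − e)²/e, where p is the observed joint support and e is the product of the two individual supports. For beverages and dairy products, for example, p = 1555/7501 ≈ 0.207 and e ≈ 0.181, which gives the tabulated 0.0037. Larger values indicate a stronger departure from independence; the largest ones appear for pairs such as fishes and vegetables, or oils and flour and ready-made meals. With 7501 transactions even small deviations can be statistically significant, but a formal decision about the null hypothesis requires a test on the underlying counts, such as the chi-squared test or Fisher’s exact test.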
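A formal independence test for a single pair can be run directly on the counts from the cross table. The sketch below is only an illustration for one pair (beverages and dairy products), using base R's chisq.test(); the variable names are mine. It also reproduces the value that crossTable() reports for this pair, which is a deviation measure on the support scale rather than a p-value.

```r
# Counts taken from the cross table above.
n      <- 7501   # total number of transactions
n_bev  <- 3218   # transactions containing beverages
n_dai  <- 3169   # transactions containing dairy products
n_both <- 1555   # transactions containing both

# 2x2 contingency table: rows = beverages yes/no, columns = dairy yes/no.
tab <- matrix(
  c(n_both,         n_bev - n_both,
    n_dai - n_both, n - n_bev - n_dai + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(beverages = c("yes", "no"), dairy = c("yes", "no"))
)

# Pearson chi-squared test of independence (no continuity correction).
test <- chisq.test(tab, correct = FALSE)
test$statistic
test$p.value

# The measure printed by crossTable(): (p - e)^2 / e on the support scale.
p <- n_both / n
e <- (n_bev / n) * (n_dai / n)
cross_table_value <- (p - e)^2 / e
cross_table_value
```

Repeating this for every pair would call for a multiple-testing correction, since the cross table contains 171 distinct pairs.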
The Apriori algorithm is used to create the itemsets and rules for the given data. In each pass it keeps only the itemsets whose support (the number of transactions containing all products in the itemset divided by the total number of transactions) reaches the threshold, and then extends them by one element. The confidence of a rule is the number of transactions containing all products in the itemset divided by the number of transactions containing the left-hand side of the rule (the antecedent itemset). To obtain any results for analysis, the confidence threshold had to be lowered from its default value of 80% to 50%. Nine rules were obtained.
rules.trans<-apriori(trans_level2, parameter=list(supp=0.1, conf=0.5))
rules.by.conf<-sort(rules.trans, by="confidence", decreasing=TRUE)
inspect(rules.by.conf)
## lhs rhs support confidence
## [1] {dairy products,snacks} => {beverages} 0.1079856 0.5823149
## [2] {beverages,snacks} => {dairy products} 0.1079856 0.5656425
## [3] {vegetables} => {beverages} 0.1450473 0.5511651
## [4] {ready-made meals} => {beverages} 0.1491801 0.5397974
## [5] {ready-made meals} => {dairy products} 0.1451806 0.5253256
## [6] {beverages,dairy products} => {snacks} 0.1079856 0.5209003
## [7] {meat} => {beverages} 0.1538462 0.5186517
## [8] {vegetables} => {dairy products} 0.1339821 0.5091185
## [9] {meat} => {dairy products} 0.1487802 0.5015730
## lift count
## [1] 1.357347 810
## [2] 1.338872 810
## [3] 1.284739 1088
## [4] 1.258241 1119
## [5] 1.243442 1089
## [6] 1.240011 810
## [7] 1.208952 1154
## [8] 1.205080 1005
## [9] 1.187220 1116
Each of these rules has an antecedent of one or two items. All rules have fairly similar support (below 0.16) and confidence (below 0.6). Another useful measure is lift, which equals the confidence divided by the expected confidence of the rule, i.e. the support of the consequent. In other words, it says how much more often the items appear together in a transaction than we would expect if they were independent. Beverages and dairy products each appear as the consequent four times. There is also one rule showing that people buy snacks if they buy beverages and dairy products. The antecedents are more diverse.
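The three measures can be reproduced by hand from the co-occurrence counts. As a small worked example, the sketch below recomputes support, confidence and lift for rule [3], {vegetables} => {beverages}, using the counts from the cross table; the variable names are my own.

```r
# Counts taken from the cross table above.
n      <- 7501   # total number of transactions
n_veg  <- 1974   # transactions containing vegetables (antecedent)
n_bev  <- 3218   # transactions containing beverages (consequent)
n_both <- 1088   # transactions containing both

support    <- n_both / n               # share of all transactions with both
confidence <- n_both / n_veg           # share of vegetable baskets with beverages
lift       <- confidence / (n_bev / n) # confidence relative to beverages' base rate

round(c(support = support, confidence = confidence, lift = lift), 4)
```

A lift of about 1.28 means that beverages turn up in baskets containing vegetables roughly 28% more often than in an average basket.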
plot(rules.trans, measure=c("confidence","lift"), shading="support")
The plot also shows that confidence and lift increase together, but the rules with the highest values of these two measures are the ones with the smallest support.
is.significant(rules.trans, trans_level2)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.maximal(rules.trans)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.redundant(rules.trans)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
All of these rules are significant according to Fisher’s exact test, and all of them are maximal. None of them is redundant, i.e. there is no more general rule with a confidence at least as high.
trans.closed<-apriori(trans_level2,
parameter=list(target="closed frequent itemsets", support=0.01))
rules.closed<-ruleInduction(trans.closed, trans_level2, control=list(verbose=TRUE))
rules.closed
## set of 1 rules
inspect(rules.closed)
## lhs rhs support confidence lift itemset
## [1] {pasta and rice,
## ready-made meals,
## snacks,
## vegetables} => {beverages} 0.01013198 0.8085106 1.884599 600
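Closed itemsets are those whose proper supersets all have lower support. Among the closed itemsets with support over 1%, only one yields a rule that reaches the 80% confidence threshold that ruleInduction() applies by default, and this rule also has a high lift. It is based on a fairly long itemset of five items.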
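The notion of a closed itemset is easier to see on a tiny example. The sketch below uses three invented baskets (not the project's data) and the arules function is.closed() to separate closed from non-closed frequent itemsets.

```r
library(arules)

# Three invented baskets: "a" and "b" always occur together, "c" only once.
baskets <- list(c("a", "b"), c("a", "b"), c("a", "b", "c"))
toy <- as(baskets, "transactions")

# Mine all frequent itemsets (every itemset occurring at least once qualifies).
freq <- apriori(toy,
                parameter = list(target = "frequent itemsets", support = 0.3),
                control = list(verbose = FALSE))

# Keep only the itemsets that have no superset with the same support.
closed <- freq[is.closed(freq)]
inspect(closed)
```

Here {a} and {b} are not closed, because {a, b} occurs in exactly the same three baskets, while {a, b} itself is closed, because adding "c" reduces the support to a single basket.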
d.jac.i<-dissimilarity(trans_level2, which="items", method = "dice")
round(d.jac.i,2)
## alcoholic beverages bread cereals cosmetics
## beverages 0.79
## bread 0.89 0.90
## cereals 0.93 0.92 0.92
## cosmetics 0.93 0.90 0.94 0.94
## dairy products 0.79 0.51 0.89 0.93 0.92
## fast food meals 0.81 0.65 0.91 0.93 0.91
## fishes 0.83 0.74 0.89 0.92 0.91
## fruits 0.88 0.83 0.90 0.94 0.89
## honey 0.92 0.89 0.93 0.95 0.93
## meat 0.80 0.58 0.90 0.93 0.89
## nuts and seeds 0.94 0.92 0.94 0.95 0.97
## oils and flour 0.84 0.72 0.89 0.90 0.87
## other 0.91 0.91 0.91 0.94 0.93
## pasta and rice 0.84 0.76 0.91 0.93 0.92
## ready-made meals 0.78 0.58 0.91 0.91 0.87
## sauces and spices 0.84 0.73 0.89 0.90 0.90
## snacks 0.80 0.55 0.90 0.94 0.90
## vegetables 0.80 0.58 0.88 0.92 0.90
## dairy products fast food meals fishes fruits honey meat
## beverages
## bread
## cereals
## cosmetics
## dairy products
## fast food meals 0.64
## fishes 0.74 0.83
## fruits 0.85 0.84 0.85
## honey 0.90 0.90 0.88 0.90
## meat 0.59 0.69 0.76 0.84 0.89
## nuts and seeds 0.92 0.92 0.92 0.93 0.94 0.93
## oils and flour 0.73 0.81 0.77 0.85 0.89 0.73
## other 0.91 0.93 0.90 0.93 0.95 0.91
## pasta and rice 0.77 0.80 0.81 0.88 0.91 0.75
## ready-made meals 0.58 0.71 0.73 0.83 0.89 0.60
## sauces and spices 0.74 0.79 0.79 0.89 0.91 0.72
## snacks 0.56 0.66 0.77 0.85 0.91 0.66
## vegetables 0.61 0.70 0.70 0.83 0.89 0.63
## nuts and seeds oils and flour other pasta and rice
## beverages
## bread
## cereals
## cosmetics
## dairy products
## fast food meals
## fishes
## fruits
## honey
## meat
## nuts and seeds
## oils and flour 0.92
## other 0.93 0.91
## pasta and rice 0.94 0.80 0.91
## ready-made meals 0.92 0.69 0.91 0.79
## sauces and spices 0.94 0.77 0.89 0.82
## snacks 0.94 0.76 0.92 0.80
## vegetables 0.92 0.73 0.91 0.78
## ready-made meals sauces and spices snacks
## beverages
## bread
## cereals
## cosmetics
## dairy products
## fast food meals
## fishes
## fruits
## honey
## meat
## nuts and seeds
## oils and flour
## other
## pasta and rice
## ready-made meals
## sauces and spices 0.71
## snacks 0.65 0.76
## vegetables 0.61 0.76 0.65
According to the Dice coefficient, all groups of products are fairly dissimilar: the dissimilarities range from 0.51 (beverages and dairy products) to 0.97 (cosmetics and nuts and seeds). The dendrogram for these categories is presented below.
plot(hclust(d.jac.i, method = "ward.D2"), main = "Dendrogram for items")
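The Dice dissimilarities in the table above can be checked by hand. The sketch below assumes the usual definition, one minus twice the number of shared transactions divided by the sum of the two category counts, and applies it to beverages and dairy products with the counts from the cross table.

```r
# Counts taken from the cross table above.
n_bev  <- 3218   # transactions containing beverages
n_dai  <- 3169   # transactions containing dairy products
n_both <- 1555   # transactions containing both

# Dice dissimilarity: 1 - 2 * |A and B| / (|A| + |B|)
dice_dissimilarity <- 1 - 2 * n_both / (n_bev + n_dai)
round(dice_dissimilarity, 2)
```

The result agrees with the tabulated value for this pair, which suggests that the assumed definition matches what dissimilarity() computes with method = "dice".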
Now let’s focus on the four types of products most closely connected with healthy or unhealthy eating: vegetables, fruits, snacks and fast food meals. To obtain rules with these specific consequents, the support and confidence thresholds were lowered again, to 5% and 20% respectively (2% and 10% for fruits).
rules.veg.r<-apriori(data=trans_level2, parameter=list(supp=0.05,conf = 0.2),
appearance=list(default="lhs", rhs="vegetables"), control=list(verbose=F))
rules.veg.r.bylift<-sort(rules.veg.r, by="lift", decreasing=TRUE)
inspect(rules.veg.r.bylift)
## lhs rhs support
## [1] {fishes} => {vegetables} 0.05959205
## [2] {beverages,ready-made meals} => {vegetables} 0.06719104
## [3] {meat,ready-made meals} => {vegetables} 0.05172644
## [4] {ready-made meals,snacks} => {vegetables} 0.05239301
## [5] {meat,snacks} => {vegetables} 0.05132649
## [6] {beverages,meat} => {vegetables} 0.06372484
## [7] {dairy products,ready-made meals} => {vegetables} 0.05972537
## [8] {dairy products,meat} => {vegetables} 0.05959205
## [9] {beverages,snacks} => {vegetables} 0.07452340
## [10] {beverages,dairy products} => {vegetables} 0.08078923
## [11] {oils and flour} => {vegetables} 0.05345954
## [12] {ready-made meals} => {vegetables} 0.10398614
## [13] {dairy products,snacks} => {vegetables} 0.06972404
## [14] {meat} => {vegetables} 0.10358619
## [15] {beverages} => {vegetables} 0.14504733
## [16] {sauces and spices} => {vegetables} 0.05079323
## [17] {dairy products} => {vegetables} 0.13398214
## [18] {fast food meals} => {vegetables} 0.07838955
## [19] {snacks} => {vegetables} 0.11985069
## [20] {} => {vegetables} 0.26316491
## confidence lift count
## [1] 0.4651405 1.767487 447
## [2] 0.4504021 1.711483 504
## [3] 0.4470046 1.698572 388
## [4] 0.4281046 1.626754 393
## [5] 0.4171181 1.585006 385
## [6] 0.4142114 1.573961 478
## [7] 0.4113866 1.563227 448
## [8] 0.4005376 1.522002 447
## [9] 0.3903631 1.483340 559
## [10] 0.3897106 1.480861 606
## [11] 0.3885659 1.476511 401
## [12] 0.3762663 1.429774 780
## [13] 0.3759885 1.428718 523
## [14] 0.3492135 1.326976 777
## [15] 0.3380982 1.284739 1088
## [16] 0.3330420 1.265526 381
## [17] 0.3171347 1.205080 1005
## [18] 0.3030928 1.151722 588
## [19] 0.2853063 1.084135 899
## [20] 0.2631649 1.000000 1974
plot(rules.veg.r, method="paracoord", control=list(reorder=TRUE))
There are 19 rules plus the rule with an empty antecedent. Judging by lift, fishes are surprisingly strongly associated with vegetables, as are itemsets that combine ready-made meals with something else (beverages, meat, snacks). Snacks appear several times, even alone in the antecedent, although that rule comes last among the rules with a non-empty antecedent, with a lift only slightly above 1. Fast food meals appear only alone on the left-hand side, with an occurrence only 15% higher than under independence. There are no fruits among the rules at all, which follows from the support threshold: fruits and vegetables occur together in only 220 transactions (about 2.9%), below the required 5%.
is.significant(rules.veg.r, trans_level2)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
inspect(rules.veg.r[!is.significant(rules.veg.r, trans_level2)])
## lhs rhs support confidence lift count
## [1] {} => {vegetables} 0.2631649 0.2631649 1 1974
is.redundant(rules.veg.r)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
All of the rules are significant except the one with no antecedent, and none of them is redundant.
rules.fru.r<-apriori(data=trans_level2, parameter=list(supp=0.02,conf = 0.1),
appearance=list(default="lhs", rhs="fruits"), control=list(verbose=F))
rules.fru.r.bylift<-sort(rules.fru.r, by="lift", decreasing=TRUE)
inspect(rules.fru.r.bylift)
## lhs rhs support confidence lift
## [1] {beverages,dairy products} => {fruits} 0.02373017 0.1144695 1.393889
## [2] {ready-made meals} => {fruits} 0.03106252 0.1123975 1.368658
## [3] {vegetables} => {fruits} 0.02932942 0.1114488 1.357107
## [4] {dairy products,snacks} => {fruits} 0.02026396 0.1092739 1.330623
## [5] {beverages,snacks} => {fruits} 0.02079723 0.1089385 1.326539
## [6] {fast food meals} => {fruits} 0.02652980 0.1025773 1.249079
## [7] {meat} => {fruits} 0.03026263 0.1020225 1.242322
## [8] {beverages} => {fruits} 0.04359419 0.1016159 1.237372
## count
## [1] 178
## [2] 233
## [3] 220
## [4] 152
## [5] 156
## [6] 199
## [7] 227
## [8] 327
plot(rules.fru.r, method="graph")
No rules were found at the support and confidence thresholds used for the other products, so both had to be lowered further, to 0.02 and 0.1. According to lift, vegetables are among the products most strongly associated with buying fruits, but overall all of these rules are quite weak.
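Why the thresholds had to be lowered can be read from the rules themselves. Dividing confidence by lift gives the support of the consequent, so the output above implies that only about 8% of baskets contain fruits. The sketch below, in base R with numbers copied from the table, also shows the lift a rule would need to reach a confidence of 0.2; that threshold is an assumption taken from the snack and fast food calls in this section.

```r
# Confidence and lift copied from the eight fruit rules above
conf <- c(0.1144695, 0.1123975, 0.1114488, 0.1092739,
          0.1089385, 0.1025773, 0.1020225, 0.1016159)
lift <- c(1.393889, 1.368658, 1.357107, 1.330623,
          1.326539, 1.249079, 1.242322, 1.237372)

# support(fruits) = confidence / lift, the same value for every rule
supp_fruits <- conf / lift
round(supp_fruits, 4)

# Lift needed to reach a confidence of 0.2 (assumed threshold)
needed_lift <- 0.2 / mean(supp_fruits)
round(needed_lift, 2)
```

The required lift is well above anything observed for fruits, which is consistent with no rules passing the original thresholds.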
is.significant(rules.fru.r, trans_level2)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.redundant(rules.fru.r)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
All of the rules are significant and none of them is redundant.
rules.snacks.r <- apriori(data = trans_level2, parameter = list(supp = 0.05, conf = 0.2),
appearance = list(default = "lhs", rhs = "snacks"), control = list(verbose = FALSE))
rules.snacks.r.bylift <- sort(rules.snacks.r, by = "lift", decreasing = TRUE)
inspect(rules.snacks.r.bylift)
## lhs rhs support confidence
## [1] {beverages,fast food meals} => {snacks} 0.06479136 0.5358324
## [2] {dairy products,fast food meals} => {snacks} 0.06532462 0.5257511
## [3] {beverages,dairy products} => {snacks} 0.10798560 0.5209003
## [4] {dairy products,vegetables} => {snacks} 0.06972404 0.5203980
## [5] {beverages,vegetables} => {snacks} 0.07452340 0.5137868
## [6] {ready-made meals,vegetables} => {snacks} 0.05239301 0.5038462
## [7] {beverages,ready-made meals} => {snacks} 0.07452340 0.4995532
## [8] {meat,vegetables} => {snacks} 0.05132649 0.4954955
## [9] {fishes} => {snacks} 0.06265831 0.4890739
## [10] {dairy products,ready-made meals} => {snacks} 0.07052393 0.4857668
## [11] {dairy products,meat} => {snacks} 0.07225703 0.4856631
## [12] {oils and flour} => {snacks} 0.06652446 0.4835271
## [13] {beverages,meat} => {snacks} 0.07345687 0.4774697
## [14] {meat,ready-made meals} => {snacks} 0.05479269 0.4735023
## [15] {pasta and rice} => {snacks} 0.05399280 0.4597049
## [16] {vegetables} => {snacks} 0.11985069 0.4554205
## [17] {fast food meals} => {snacks} 0.11705106 0.4525773
## [18] {sauces and spices} => {snacks} 0.06852420 0.4493007
## [19] {alcoholic} => {snacks} 0.05385949 0.4469027
## [20] {beverages} => {snacks} 0.19090788 0.4449969
## [21] {ready-made meals} => {snacks} 0.12238368 0.4428365
## [22] {dairy products} => {snacks} 0.18544194 0.4389397
## [23] {} => {snacks} 0.42007732 0.4200773
## [24] {meat} => {snacks} 0.12305026 0.4148315
## lift count
## [1] 1.2755566 486
## [2] 1.2515579 490
## [3] 1.2400106 810
## [4] 1.2388148 523
## [5] 1.2230766 559
## [6] 1.1994129 393
## [7] 1.1891934 559
## [8] 1.1795340 385
## [9] 1.1642473 470
## [10] 1.1563746 529
## [11] 1.1561278 542
## [12] 1.1510432 499
## [13] 1.1366233 551
## [14] 1.1271789 411
## [15] 1.0943340 405
## [16] 1.0841349 899
## [17] 1.0773667 878
## [18] 1.0695667 514
## [19] 1.0638581 404
## [20] 1.0593214 1432
## [21] 1.0541785 918
## [22] 1.0449022 1391
## [23] 1.0000000 3151
## [24] 0.9875122 923
plot(rules.snacks.r, method="paracoord", control=list(reorder=TRUE))
There are more rules for snacks. Their lift values are not as high as those of the vegetable rules, but there are many interesting combinations. Fast food meals appear in both of the top two antecedent itemsets, so it seems that unhealthy food sticks together.
is.significant(rules.snacks.r, trans_level2)
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE
inspect(rules.snacks.r[!is.significant(rules.snacks.r, trans_level2)])
## lhs rhs support confidence lift count
## [1] {} => {snacks} 0.42007732 0.4200773 1.0000000 3151
## [2] {alcoholic} => {snacks} 0.05385949 0.4469027 1.0638581 404
## [3] {pasta and rice} => {snacks} 0.05399280 0.4597049 1.0943340 405
## [4] {sauces and spices} => {snacks} 0.06852420 0.4493007 1.0695667 514
## [5] {fast food meals} => {snacks} 0.11705106 0.4525773 1.0773667 878
## [6] {ready-made meals} => {snacks} 0.12238368 0.4428365 1.0541785 918
## [7] {meat} => {snacks} 0.12305026 0.4148315 0.9875122 923
## [8] {dairy products} => {snacks} 0.18544194 0.4389397 1.0449022 1391
is.redundant(rules.snacks.r)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE
inspect(rules.snacks.r[is.redundant(rules.snacks.r)])
## lhs rhs support confidence lift count
## [1] {meat} => {snacks} 0.1230503 0.4148315 0.9875122 923
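Several rules with a single item in the antecedent are not significant, including the one with fast food meals, and neither is the rule with an empty antecedent. One of them, with meat on the left-hand side, is also redundant: its confidence (0.415) is lower than that of the rule with an empty antecedent (0.420), so knowing that meat is in the basket does not improve the prediction.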
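The redundancy result can be illustrated with a small base R sketch. It assumes the definition that is.redundant() appears to use by default, namely that a rule is redundant when a more general rule (one whose antecedent is a proper subset) has the same or a higher confidence. The helper is_redundant() and the list layout are hypothetical and only cover four rules copied from the output above.

```r
# Four snack rules from the output above: antecedent items and confidence
rules <- list(
  list(lhs = character(0),                conf = 0.4200773),
  list(lhs = "meat",                      conf = 0.4148315),
  list(lhs = "dairy products",            conf = 0.4389397),
  list(lhs = c("dairy products", "meat"), conf = 0.4856631)
)

# Hypothetical helper: TRUE if some more general rule (antecedent is a
# proper subset) has the same or a higher confidence
is_redundant <- function(rule, all_rules) {
  any(vapply(all_rules, function(other) {
    more_general <- all(other$lhs %in% rule$lhs) &&
      length(other$lhs) < length(rule$lhs)
    more_general && other$conf >= rule$conf
  }, logical(1)))
}

redundant <- vapply(rules, is_redundant, logical(1), all_rules = rules)
names(redundant) <- vapply(
  rules,
  function(r) paste0("{", paste(r$lhs, collapse = ","), "}"),
  character(1)
)
redundant
```

Under this definition only {meat} is flagged, because the rule with an empty antecedent already predicts snacks at least as well.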
rules.ff.r <- apriori(data = trans_level2, parameter = list(supp = 0.05, conf = 0.2),
appearance = list(default = "lhs", rhs = "fast food meals"), control = list(verbose = FALSE))
rules.ff.r.bylift <- sort(rules.ff.r, by = "lift", decreasing = TRUE)
inspect(rules.ff.r.bylift)
## lhs rhs support confidence
## [1] {dairy products,snacks} => {fast food meals} 0.06532462 0.3522646
## [2] {beverages,snacks} => {fast food meals} 0.06479136 0.3393855
## [3] {beverages,dairy products} => {fast food meals} 0.06799093 0.3279743
## [4] {vegetables} => {fast food meals} 0.07838955 0.2978723
## [5] {dairy products} => {fast food meals} 0.12425010 0.2940991
## [6] {meat} => {fast food meals} 0.08638848 0.2912360
## [7] {ready-made meals} => {fast food meals} 0.07798960 0.2821997
## [8] {beverages} => {fast food meals} 0.12091721 0.2818521
## [9] {snacks} => {fast food meals} 0.11705106 0.2786417
## [10] {} => {fast food meals} 0.25863218 0.2586322
## lift count
## [1] 1.362029 490
## [2] 1.312232 486
## [3] 1.268111 510
## [4] 1.151722 588
## [5] 1.137133 932
## [6] 1.126062 648
## [7] 1.091124 585
## [8] 1.089780 907
## [9] 1.077367 878
## [10] 1.000000 1940
plot(rules.ff.r, method="graph")
The results for fast food meals confirm the pattern seen in the part about snacks: the two strongest rules by lift both contain snacks in the antecedent, and no other antecedent gives a stronger rule.
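The link between the snack and fast food results follows from the fact that lift is symmetric: {snacks} => {fast food meals} and {fast food meals} => {snacks} have the same lift, and only their confidences differ. A short base R check with the supports copied from the outputs above; the variable names are illustrative.

```r
supp_snacks <- 0.42007732  # support of snacks, from {} => {snacks}
supp_ff     <- 0.25863218  # support of fast food meals, from {} => {fast food meals}
supp_both   <- 0.11705106  # support of baskets containing both items

# lift = support(A and B) / (support(A) * support(B)), symmetric in A and B
lift <- supp_both / (supp_snacks * supp_ff)

# The confidences differ because they condition on different items
conf_ff_to_snacks <- supp_both / supp_ff      # {fast food meals} => {snacks}
conf_snacks_to_ff <- supp_both / supp_snacks  # {snacks} => {fast food meals}

round(c(lift = lift,
        ff_to_snacks = conf_ff_to_snacks,
        snacks_to_ff = conf_snacks_to_ff), 4)
```

This is why sorting by lift ranks the pair identically in both tables, while the confidence columns tell different stories.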
is.significant(rules.ff.r, trans_level2)
## [1] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
inspect(rules.ff.r[!is.significant(rules.ff.r, trans_level2)])
## lhs rhs support confidence lift
## [1] {} => {fast food meals} 0.2586322 0.2586322 1.000000
## [2] {ready-made meals} => {fast food meals} 0.0779896 0.2821997 1.091124
## count
## [1] 1940
## [2] 585
is.redundant(rules.ff.r)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
In this case the rule with an empty antecedent (fast food meals alone) and the rule in which ready-made meals lead to buying fast food meals are not significant. None of the rules is redundant.
Association rules are not always obvious, even after a market basket analysis, but the technique offers a great opportunity to mine the data and look for patterns that reveal how customers actually behave.