The aim of this report is to analyze the relationships between products bought together in a shopping basket. The data was downloaded from the Kaggle website.
Summary of the data set:
summary(raw_data)
## transactions as itemMatrix in sparse format with
## 9039 rows (elements/itemsets/transactions) and
## 168 columns (items) and a density of 0.02621708
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2325 1740 1656 1579
## yogurt (Other)
## 1258 31254
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1981 1513 1210 919 775 594 504 405 319 219 166 112 72 71 48 44
## 17 18 19 20 21 22 23 24 27 28 29
## 25 13 14 9 10 4 6 1 1 1 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.404 6.000 29.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Data summary: There are 9039 transactions and 168 unique items.
Whole milk, other vegetables, rolls/buns, soda and yogurt are the most
frequently bought products.
Single-item baskets are the most common size (1981 transactions, about
22% of the total), and the largest shopping basket includes 29 items.
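The figures in the summary can be cross-checked from the basket-size distribution alone. The sketch below uses base R only. The vectors `sizes` and `n_baskets` are copied by hand from the output above, and all variable names are illustrative rather than part of the original analysis.

```r
# Basket sizes and the number of transactions of each size,
# copied from the summary output (sizes 25 and 26 do not occur)
sizes     <- c(1:24, 27, 28, 29)
n_baskets <- c(1981, 1513, 1210, 919, 775, 594, 504, 405, 319, 219, 166, 112,
               72, 71, 48, 44, 25, 13, 14, 9, 10, 4, 6, 1, 1, 1, 3)

n_transactions <- sum(n_baskets)          # number of baskets
n_items_sold   <- sum(sizes * n_baskets)  # total item occurrences
mean_size      <- n_items_sold / n_transactions
density        <- n_items_sold / (n_transactions * 168)  # 168 item columns
share_single   <- n_baskets[1] / n_transactions

round(c(mean_size = mean_size, density = density, share_single = share_single), 4)
```

The mean basket size and the density agree with the values reported by summary(), and the share of single-item baskets comes out at roughly 22%.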
hist(size(raw_data), main = 'Number of items per basket', xlab = '# Items', col = 'cornflowerblue')
Let’s check the top products.
Items that appear in at least 5% of all transactions:
itemFrequencyPlot(raw_data, support = 0.05, col = 'pink', xlab = 'Item name', ylab = 'Frequency', main = 'Frequency plot with 5% support')
Top 10 items with highest (absolute) frequency:
itemFrequencyPlot(raw_data, topN = 10, type = 'absolute', col = 'lightslateblue', xlab = 'Item name', ylab = 'Absolute frequency', main = 'Absolute item frequency plot - top 10')
Top 10 items with highest (relative) frequency:
itemFrequencyPlot(raw_data, topN = 10, type = 'relative', col = 'coral1', xlab = 'Item name', ylab = 'Relative frequency', main = 'Relative item frequency plot - top 10')
Below is the absolute frequency of every item in the data set, in decreasing order:
sort(round(itemFrequency(raw_data, type = 'absolute'),4), decreasing = TRUE)
## whole milk other vegetables rolls/buns
## 2325 1740 1656
## soda yogurt root vegetables
## 1579 1258 990
## bottled water tropical fruit shopping bags
## 969 949 902
## sausage pastry bottled beer
## 855 795 732
## citrus fruit newspapers pip fruit
## 730 726 701
## canned beer fruit/vegetable juice whipped/sour cream
## 699 648 646
## brown bread domestic eggs frankfurter
## 585 577 531
## margarine pork coffee
## 528 528 514
## butter napkins curd
## 501 476 472
## beef chocolate frozen vegetables
## 469 461 433
## chicken white bread cream cheese
## 395 378 355
## waffles long life bakery product dessert
## 348 342 337
## salty snack sugar berries
## 336 300 299
## UHT-milk hamburger meat hygiene articles
## 297 289 287
## onions specialty chocolate candy
## 286 281 267
## frozen meals butter milk misc. beverages
## 256 252 251
## oil specialty bar beverages
## 245 245 241
## meat ham ice cream
## 237 232 230
## hard cheese sliced cheese cat food
## 222 221 211
## grapes chewing gum white wine
## 196 186 173
## detergent red/blush wine semi-finished bread
## 171 164 163
## baking powder pickled vegetables dishes
## 162 162 161
## soft cheese potted plants flour
## 159 158 156
## herbs processed cheese canned fish
## 147 144 138
## seasonal products pasta cake bar
## 134 130 122
## mustard packaged fruit/vegetables frozen fish
## 113 111 110
## cling film/bags liquor spread cheese
## 105 104 102
## canned vegetables salt flower (seeds)
## 101 99 98
## condensed milk frozen dessert dish cleaner
## 95 95 91
## roll products pet care sweet spreads
## 91 87 86
## chocolate marshmallow candles dog food
## 84 83 83
## mayonnaise photo/film house keeping products
## 82 82 78
## specialty cheese turkey frozen potato products
## 76 76 75
## Instant food products popcorn liquor (appetizer)
## 72 69 68
## rice instant coffee finished products
## 67 66 62
## soups zwieback vinegar
## 62 61 60
## female sanitary products jam dental care
## 57 53 52
## kitchen towels cereals sauces
## 52 50 50
## cleaner softener sparkling wine
## 49 49 49
## liver loaf spices curd cheese
## 47 47 44
## male cosmetics ketchup brandy
## 40 39 38
## meat spreads rum tea
## 38 38 37
## light bulbs nuts/prunes specialty fat
## 35 32 31
## artif. sweetener canned fruit skin care
## 30 30 30
## syrup nut snack fish
## 30 29 28
## snack products abrasive cleaner potato products
## 28 27 27
## cooking chocolate cookware organic sausage
## 24 23 22
## pudding powder tidbits bathroom cleaner
## 22 22 21
## cocoa drinks soap flower soil/fertilizer
## 21 21 19
## prosecco ready soups specialty vegetables
## 19 17 16
## decalcifier organic products cream
## 15 15 13
## honey frozen fruits hair spray
## 13 12 9
## liqueur make up remover rubbing alcohol
## 9 8 8
## salad dressing whisky toilet cleaner
## 8 8 6
## frozen chicken baby cosmetics bags
## 5 4 4
## kitchen utensil preservation products sound storage medium
## 4 2 1
Here I apply the Apriori algorithm with a minimum support of 2% and a minimum confidence of 40%, keeping only rules that contain at least two items (minlen = 2).
# creating the rules - 2% support, 40% confidence
rules <- apriori(raw_data, parameter=list(supp=0.02, conf=0.4, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.02 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 180
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 9039 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
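To see how the relative threshold translates into a number of baskets, the following base R sketch converts the 2% minimum support into a transaction count. The variable names are illustrative, and the reading that the log line shows the truncated value is an inference from the output above, not something the package states.

```r
n_transactions <- 9039
min_support    <- 0.02

# 2% of 9039 baskets is not a whole number
threshold <- min_support * n_transactions

floor(threshold)    # matches the "Absolute minimum support count" in the log
ceiling(threshold)  # smallest whole count whose support is at least 2%
```

All 15 rules clear this bar: the least frequent one, {frozen vegetables} => {whole milk}, occurs in 186 transactions.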
Support tells us how popular a rule is, as measured by the
proportion of transactions that contain every item in the rule
(left-hand side and right-hand side together).
Rules with the highest support value:
inspect(sort(rules, by = 'support'))
## lhs rhs support
## [1] {yogurt} => {whole milk} 0.05675407
## [2] {root vegetables} => {whole milk} 0.05000553
## [3] {root vegetables} => {other vegetables} 0.04746100
## [4] {tropical fruit} => {whole milk} 0.04270384
## [5] {whipped/sour cream} => {whole milk} 0.03219383
## [6] {domestic eggs} => {whole milk} 0.03064498
## [7] {whipped/sour cream} => {other vegetables} 0.02876424
## [8] {butter} => {whole milk} 0.02810045
## [9] {curd} => {whole milk} 0.02522403
## [10] {margarine} => {whole milk} 0.02433898
## [11] {other vegetables, root vegetables} => {whole milk} 0.02345392
## [12] {root vegetables, whole milk} => {other vegetables} 0.02345392
## [13] {other vegetables, yogurt} => {whole milk} 0.02267950
## [14] {beef} => {whole milk} 0.02090939
## [15] {frozen vegetables} => {whole milk} 0.02057750
## confidence coverage lift count
## [1] 0.4077901 0.13917469 1.585383 513
## [2] 0.4565657 0.10952539 1.775009 452
## [3] 0.4333333 0.10952539 2.251092 429
## [4] 0.4067439 0.10498949 1.581315 386
## [5] 0.4504644 0.07146808 1.751289 291
## [6] 0.4800693 0.06383449 1.866386 277
## [7] 0.4024768 0.07146808 2.090797 260
## [8] 0.5069860 0.05542649 1.971031 254
## [9] 0.4830508 0.05221817 1.877977 228
## [10] 0.4166667 0.05841354 1.619892 220
## [11] 0.4941725 0.04746100 1.921215 212
## [12] 0.4690265 0.05000553 2.436512 212
## [13] 0.5242967 0.04325700 2.038330 205
## [14] 0.4029851 0.05188627 1.566702 189
## [15] 0.4295612 0.04790353 1.670023 186
The highest support value is almost 5.7%, which means that yogurt and
whole milk appear together in 5.7% of all transactions (513 of 9039).
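As a quick check, the support of this rule can be recomputed from the count column. This is a minimal base R sketch with illustrative variable names. The two numbers are copied from the output above.

```r
n_transactions <- 9039
count_rule     <- 513   # baskets containing both yogurt and whole milk

# Support = share of all baskets that contain every item in the rule
support <- count_rule / n_transactions
round(support, 4)
```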
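Lift is the ratio between the confidence of a rule and its expected
confidence, that is, the share of all transactions that contain the
right-hand side. A lift above 1 means that the two sides occur together
more often than they would if they were independent.
Rules with the highest lift value: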
inspect(sort(rules, by = 'lift'))
## lhs rhs support
## [1] {root vegetables, whole milk} => {other vegetables} 0.02345392
## [2] {root vegetables} => {other vegetables} 0.04746100
## [3] {whipped/sour cream} => {other vegetables} 0.02876424
## [4] {other vegetables, yogurt} => {whole milk} 0.02267950
## [5] {butter} => {whole milk} 0.02810045
## [6] {other vegetables, root vegetables} => {whole milk} 0.02345392
## [7] {curd} => {whole milk} 0.02522403
## [8] {domestic eggs} => {whole milk} 0.03064498
## [9] {root vegetables} => {whole milk} 0.05000553
## [10] {whipped/sour cream} => {whole milk} 0.03219383
## [11] {frozen vegetables} => {whole milk} 0.02057750
## [12] {margarine} => {whole milk} 0.02433898
## [13] {yogurt} => {whole milk} 0.05675407
## [14] {tropical fruit} => {whole milk} 0.04270384
## [15] {beef} => {whole milk} 0.02090939
## confidence coverage lift count
## [1] 0.4690265 0.05000553 2.436512 212
## [2] 0.4333333 0.10952539 2.251092 429
## [3] 0.4024768 0.07146808 2.090797 260
## [4] 0.5242967 0.04325700 2.038330 205
## [5] 0.5069860 0.05542649 1.971031 254
## [6] 0.4941725 0.04746100 1.921215 212
## [7] 0.4830508 0.05221817 1.877977 228
## [8] 0.4800693 0.06383449 1.866386 277
## [9] 0.4565657 0.10952539 1.775009 452
## [10] 0.4504644 0.07146808 1.751289 291
## [11] 0.4295612 0.04790353 1.670023 186
## [12] 0.4166667 0.05841354 1.619892 220
## [13] 0.4077901 0.13917469 1.585383 513
## [14] 0.4067439 0.10498949 1.581315 386
## [15] 0.4029851 0.05188627 1.566702 189
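The highest lift value is around 2.44 for the rule {root
vegetables, whole milk -> other vegetables}: baskets that contain root
vegetables and whole milk include other vegetables about 2.44 times as
often as baskets in general.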
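The lift of this rule can be reproduced from counts that already appear in the output. The base R sketch below is only an illustration with made-up variable names. The left-hand-side count of 452 is taken from the rule {root vegetables} => {whole milk}, and 1740 is the item frequency of other vegetables.

```r
n_transactions <- 9039
count_rule <- 212    # root vegetables, whole milk and other vegetables together
count_lhs  <- 452    # root vegetables and whole milk together
count_rhs  <- 1740   # other vegetables

confidence          <- count_rule / count_lhs       # P(rhs | lhs)
expected_confidence <- count_rhs / n_transactions   # P(rhs) over all baskets
lift                <- confidence / expected_confidence

round(c(confidence = confidence, lift = lift), 4)
```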
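Confidence is the estimated conditional probability that a transaction containing the left-hand side of a rule also contains its right-hand side. It is the support of the whole rule divided by the support of the left-hand side (the coverage column).
Rules with the highest confidence value: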
inspect(sort(rules, by = 'confidence'))
## lhs rhs support
## [1] {other vegetables, yogurt} => {whole milk} 0.02267950
## [2] {butter} => {whole milk} 0.02810045
## [3] {other vegetables, root vegetables} => {whole milk} 0.02345392
## [4] {curd} => {whole milk} 0.02522403
## [5] {domestic eggs} => {whole milk} 0.03064498
## [6] {root vegetables, whole milk} => {other vegetables} 0.02345392
## [7] {root vegetables} => {whole milk} 0.05000553
## [8] {whipped/sour cream} => {whole milk} 0.03219383
## [9] {root vegetables} => {other vegetables} 0.04746100
## [10] {frozen vegetables} => {whole milk} 0.02057750
## [11] {margarine} => {whole milk} 0.02433898
## [12] {yogurt} => {whole milk} 0.05675407
## [13] {tropical fruit} => {whole milk} 0.04270384
## [14] {beef} => {whole milk} 0.02090939
## [15] {whipped/sour cream} => {other vegetables} 0.02876424
## confidence coverage lift count
## [1] 0.5242967 0.04325700 2.038330 205
## [2] 0.5069860 0.05542649 1.971031 254
## [3] 0.4941725 0.04746100 1.921215 212
## [4] 0.4830508 0.05221817 1.877977 228
## [5] 0.4800693 0.06383449 1.866386 277
## [6] 0.4690265 0.05000553 2.436512 212
## [7] 0.4565657 0.10952539 1.775009 452
## [8] 0.4504644 0.07146808 1.751289 291
## [9] 0.4333333 0.10952539 2.251092 429
## [10] 0.4295612 0.04790353 1.670023 186
## [11] 0.4166667 0.05841354 1.619892 220
## [12] 0.4077901 0.13917469 1.585383 513
## [13] 0.4067439 0.10498949 1.581315 386
## [14] 0.4029851 0.05188627 1.566702 189
## [15] 0.4024768 0.07146808 2.090797 260
The highest confidence value is 52.4% for the rule {other
vegetables, yogurt -> whole milk}, which means that of the transactions
that contain other vegetables and yogurt, 52.4% also contain whole
milk.
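The same figure follows from the count and coverage columns. This base R sketch uses illustrative variable names. The left-hand-side count is not printed directly, so it is recovered here from the coverage value, which is an assumption about how to read that column.

```r
n_transactions <- 9039
count_rule <- 205        # other vegetables, yogurt and whole milk together
coverage   <- 0.04325700 # support of {other vegetables, yogurt}, from the table

# Convert coverage back into a whole number of baskets
count_lhs  <- round(coverage * n_transactions)

# Confidence = baskets matching the whole rule / baskets matching the lhs
confidence <- count_rule / count_lhs
round(confidence, 3)
```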
Below is a simple scatter plot with support and lift on the axes and
confidence represented by the color of the points.
plot(rules, measure = c('support','lift'), shading = 'confidence')
The next visualization represents the rules as a graph. The rules are
represented as items connected by arrows.
plot(rules, method = 'graph')
The rules can also be visualized as a grouped matrix. Support is represented by the size of the balloons, and lift by the color of the balloons.
plot(rules, method = 'grouped')
Association rules can help us find interesting patterns in
customer buying preferences and habits. In the analyzed data set, whole
milk, other vegetables, rolls/buns, soda and yogurt are the most
frequently bought products.
{yogurt -> whole milk}, {root vegetables -> whole milk} and {root
vegetables -> other vegetables} are the rules with the highest support.
{root vegetables, whole milk -> other vegetables}, {root vegetables
-> other vegetables} and {whipped/sour cream -> other vegetables}
are the rules with the highest lift.
{other vegetables, yogurt -> whole milk}, {butter -> whole milk} and
{other vegetables, root vegetables -> whole milk} are the rules with
the highest confidence.