Market Basket Analysis

Explore the data

The aim of this report is to analyze the relationships between products that are bought together in the same basket. The data set was downloaded from the Kaggle website.
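
The report does not show how raw_data was created. The following is only a sketch of one possible way to do it, assuming the arules package and a file with one basket per line and comma-separated items. The temporary file and the name toy_data are hypothetical stand-ins for the downloaded Kaggle file and for raw_data.

```r
library(arules)  # provides read.transactions(), apriori(), itemFrequencyPlot()

# Hypothetical stand-in for the downloaded file:
# one basket per line, items separated by commas.
basket_file <- tempfile(fileext = ".csv")
writeLines(c("whole milk,yogurt",
             "whole milk,rolls/buns,soda",
             "yogurt"),
           basket_file)

# Read the baskets into a sparse transactions object.
toy_data <- read.transactions(basket_file, format = "basket", sep = ",")
summary(toy_data)
```

The plot(rules, ...) calls used later in the report additionally need the arulesViz package to be loaded.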

Summary of the data set:

summary(raw_data)
## transactions as itemMatrix in sparse format with
##  9039 rows (elements/itemsets/transactions) and
##  168 columns (items) and a density of 0.02621708 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2325             1740             1656             1579 
##           yogurt          (Other) 
##             1258            31254 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1981 1513 1210  919  775  594  504  405  319  219  166  112   72   71   48   44 
##   17   18   19   20   21   22   23   24   27   28   29 
##   25   13   14    9   10    4    6    1    1    1    3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.404   6.000  29.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Data summary: There are 9039 transactions and 168 unique items.
Whole milk, other vegetables, rolls/buns, soda and yogurt are the most frequently bought products.
Single-item baskets are the most common size (1981 transactions, about 22% of the total); the median basket holds 3 items and the largest holds 29.
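
The figures in the summary are linked to each other, which can be checked with base R alone. The sketch below retypes the basket size distribution from the output above (the variable names are my own) and recovers the number of transactions, the mean basket size and the density of the item matrix.

```r
# Basket size distribution copied from summary(raw_data)
sizes  <- c(1:24, 27, 28, 29)
counts <- c(1981, 1513, 1210, 919, 775, 594, 504, 405, 319, 219, 166, 112,
            72, 71, 48, 44, 25, 13, 14, 9, 10, 4, 6, 1, 1, 1, 3)
n_items <- 168

n_transactions     <- sum(counts)          # 9039
n_item_occurrences <- sum(sizes * counts)  # 39812

# Mean basket size: item occurrences per transaction (about 4.40)
mean_basket_size <- n_item_occurrences / n_transactions

# Density: share of filled cells in the 9039 x 168 matrix (about 0.0262)
density <- n_item_occurrences / (n_transactions * n_items)

# Share of baskets with a single item (about 0.22)
share_single_item <- counts[1] / n_transactions
```

The 39812 item occurrences also equal the sum of the "most frequent items" counts in the summary, including the (Other) group.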

hist(size(raw_data), main = 'Number of items per basket', xlab = '# Items', col = 'cornflowerblue')

Let’s check the top products:

Items that appear in at least 5% of all transactions:

itemFrequencyPlot(raw_data, support = 0.05, col = 'pink', xlab = 'Item name', ylab = 'Frequency', main = 'Frequency plot with 5% support' )

Top 10 items with highest (absolute) frequency:

itemFrequencyPlot(raw_data, topN = 10, type = 'absolute', col = 'lightslateblue', xlab = 'Item name', ylab = 'Absolute frequency', main = 'Absolute item frequency plot- top 10')

Top 10 items with highest (relative) frequency:

itemFrequencyPlot(raw_data, topN = 10, type = 'relative', col = 'coral1', xlab = 'Item name', ylab = 'Relative frequency', main = 'Relative item frequency plot- top 10' )

Below is the absolute frequency of every item in the data set, in decreasing order:

sort(round(itemFrequency(raw_data, type = 'absolute'),4), decreasing = TRUE)
##                whole milk          other vegetables                rolls/buns 
##                      2325                      1740                      1656 
##                      soda                    yogurt           root vegetables 
##                      1579                      1258                       990 
##             bottled water            tropical fruit             shopping bags 
##                       969                       949                       902 
##                   sausage                    pastry              bottled beer 
##                       855                       795                       732 
##              citrus fruit                newspapers                 pip fruit 
##                       730                       726                       701 
##               canned beer     fruit/vegetable juice        whipped/sour cream 
##                       699                       648                       646 
##               brown bread             domestic eggs               frankfurter 
##                       585                       577                       531 
##                 margarine                      pork                    coffee 
##                       528                       528                       514 
##                    butter                   napkins                      curd 
##                       501                       476                       472 
##                      beef                 chocolate         frozen vegetables 
##                       469                       461                       433 
##                   chicken               white bread              cream cheese 
##                       395                       378                       355 
##                   waffles  long life bakery product                   dessert 
##                       348                       342                       337 
##               salty snack                     sugar                   berries 
##                       336                       300                       299 
##                  UHT-milk            hamburger meat          hygiene articles 
##                       297                       289                       287 
##                    onions       specialty chocolate                     candy 
##                       286                       281                       267 
##              frozen meals               butter milk           misc. beverages 
##                       256                       252                       251 
##                       oil             specialty bar                 beverages 
##                       245                       245                       241 
##                      meat                       ham                 ice cream 
##                       237                       232                       230 
##               hard cheese             sliced cheese                  cat food 
##                       222                       221                       211 
##                    grapes               chewing gum                white wine 
##                       196                       186                       173 
##                 detergent            red/blush wine       semi-finished bread 
##                       171                       164                       163 
##             baking powder        pickled vegetables                    dishes 
##                       162                       162                       161 
##               soft cheese             potted plants                     flour 
##                       159                       158                       156 
##                     herbs          processed cheese               canned fish 
##                       147                       144                       138 
##         seasonal products                     pasta                  cake bar 
##                       134                       130                       122 
##                   mustard packaged fruit/vegetables               frozen fish 
##                       113                       111                       110 
##           cling film/bags                    liquor             spread cheese 
##                       105                       104                       102 
##         canned vegetables                      salt            flower (seeds) 
##                       101                        99                        98 
##            condensed milk            frozen dessert              dish cleaner 
##                        95                        95                        91 
##             roll products                  pet care             sweet spreads 
##                        91                        87                        86 
##     chocolate marshmallow                   candles                  dog food 
##                        84                        83                        83 
##                mayonnaise                photo/film    house keeping products 
##                        82                        82                        78 
##          specialty cheese                    turkey    frozen potato products 
##                        76                        76                        75 
##     Instant food products                   popcorn        liquor (appetizer) 
##                        72                        69                        68 
##                      rice            instant coffee         finished products 
##                        67                        66                        62 
##                     soups                  zwieback                   vinegar 
##                        62                        61                        60 
##  female sanitary products                       jam               dental care 
##                        57                        53                        52 
##            kitchen towels                   cereals                    sauces 
##                        52                        50                        50 
##                   cleaner                  softener            sparkling wine 
##                        49                        49                        49 
##                liver loaf                    spices               curd cheese 
##                        47                        47                        44 
##            male cosmetics                   ketchup                    brandy 
##                        40                        39                        38 
##              meat spreads                       rum                       tea 
##                        38                        38                        37 
##               light bulbs               nuts/prunes             specialty fat 
##                        35                        32                        31 
##          artif. sweetener              canned fruit                 skin care 
##                        30                        30                        30 
##                     syrup                 nut snack                      fish 
##                        30                        29                        28 
##            snack products          abrasive cleaner           potato products 
##                        28                        27                        27 
##         cooking chocolate                  cookware           organic sausage 
##                        24                        23                        22 
##            pudding powder                   tidbits          bathroom cleaner 
##                        22                        22                        21 
##              cocoa drinks                      soap    flower soil/fertilizer 
##                        21                        21                        19 
##                  prosecco               ready soups      specialty vegetables 
##                        19                        17                        16 
##               decalcifier          organic products                     cream 
##                        15                        15                        13 
##                     honey             frozen fruits                hair spray 
##                        13                        12                         9 
##                   liqueur           make up remover           rubbing alcohol 
##                         9                         8                         8 
##            salad dressing                    whisky            toilet cleaner 
##                         8                         8                         6 
##            frozen chicken            baby cosmetics                      bags 
##                         5                         4                         4 
##           kitchen utensil     preservation products      sound storage medium 
##                         4                         2                         1

Apriori algorithm

Here I apply the Apriori algorithm with a minimum support of 2% and a minimum confidence of 40%. Setting minlen = 2 keeps only rules with at least one item on each side.

# creating the rules - 2% support, 40% confidence
rules <- apriori(raw_data, parameter=list(supp=0.02, conf=0.4, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.02      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 180 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 9039 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
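
The "Absolute minimum support count: 180" line in the log can be related to the 2% threshold with a little arithmetic. The sketch below uses base R only. The four item counts are copied from the frequency list above, and the variable names are my own.

```r
n_transactions <- 9039
min_support    <- 0.02

# 2% of all transactions; the log shows this value truncated to 180
min_support * n_transactions   # 180.78

# Items around the threshold, counts taken from the frequency list
item_counts <- c(grapes = 196, "chewing gum" = 186,
                 "white wine" = 173, detergent = 171)

# Keep the items whose relative frequency reaches the minimum support
frequent <- names(item_counts)[item_counts / n_transactions >= min_support]
frequent
```

Chewing gum (186) is the least frequent item that still reaches 2%. It is the 59th entry in the sorted frequency list, which is consistent with the 59 items reported in the "sorting and recoding items" step.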

Support measures how frequent an itemset is: the proportion of transactions that contain it. For a rule, it is the proportion of transactions that contain every item on both sides of the rule.

Rules with highest support value:

inspect(sort(rules, by = 'support'))
##      lhs                                    rhs                support   
## [1]  {yogurt}                            => {whole milk}       0.05675407
## [2]  {root vegetables}                   => {whole milk}       0.05000553
## [3]  {root vegetables}                   => {other vegetables} 0.04746100
## [4]  {tropical fruit}                    => {whole milk}       0.04270384
## [5]  {whipped/sour cream}                => {whole milk}       0.03219383
## [6]  {domestic eggs}                     => {whole milk}       0.03064498
## [7]  {whipped/sour cream}                => {other vegetables} 0.02876424
## [8]  {butter}                            => {whole milk}       0.02810045
## [9]  {curd}                              => {whole milk}       0.02522403
## [10] {margarine}                         => {whole milk}       0.02433898
## [11] {other vegetables, root vegetables} => {whole milk}       0.02345392
## [12] {root vegetables, whole milk}       => {other vegetables} 0.02345392
## [13] {other vegetables, yogurt}          => {whole milk}       0.02267950
## [14] {beef}                              => {whole milk}       0.02090939
## [15] {frozen vegetables}                 => {whole milk}       0.02057750
##      confidence coverage   lift     count
## [1]  0.4077901  0.13917469 1.585383 513  
## [2]  0.4565657  0.10952539 1.775009 452  
## [3]  0.4333333  0.10952539 2.251092 429  
## [4]  0.4067439  0.10498949 1.581315 386  
## [5]  0.4504644  0.07146808 1.751289 291  
## [6]  0.4800693  0.06383449 1.866386 277  
## [7]  0.4024768  0.07146808 2.090797 260  
## [8]  0.5069860  0.05542649 1.971031 254  
## [9]  0.4830508  0.05221817 1.877977 228  
## [10] 0.4166667  0.05841354 1.619892 220  
## [11] 0.4941725  0.04746100 1.921215 212  
## [12] 0.4690265  0.05000553 2.436512 212  
## [13] 0.5242967  0.04325700 2.038330 205  
## [14] 0.4029851  0.05188627 1.566702 189  
## [15] 0.4295612  0.04790353 1.670023 186

The highest support value is almost 5.7%: yogurt and whole milk appear together in 513 of the 9039 transactions (rule {yogurt -> whole milk}).
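
The support and coverage columns can be reproduced from the count column and the item frequencies. This is a minimal check in base R with numbers copied from the output above, and the variable names are my own.

```r
n_transactions <- 9039
count_rule     <- 513   # count column of {yogurt} => {whole milk}
count_yogurt   <- 1258  # absolute frequency of yogurt

# Support of the rule: baskets containing both items / all baskets
support_rule <- count_rule / n_transactions     # about 0.0568

# Coverage: support of the left-hand side alone
coverage_rule <- count_yogurt / n_transactions  # about 0.1392
```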

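Lift is the ratio between the confidence of a rule and its expected confidence, which is the support of the right-hand side. A lift above 1 means that the left-hand side makes the right-hand side more likely than it is across all baskets.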
Rules with highest lift value:

inspect(sort(rules, by = 'lift'))
##      lhs                                    rhs                support   
## [1]  {root vegetables, whole milk}       => {other vegetables} 0.02345392
## [2]  {root vegetables}                   => {other vegetables} 0.04746100
## [3]  {whipped/sour cream}                => {other vegetables} 0.02876424
## [4]  {other vegetables, yogurt}          => {whole milk}       0.02267950
## [5]  {butter}                            => {whole milk}       0.02810045
## [6]  {other vegetables, root vegetables} => {whole milk}       0.02345392
## [7]  {curd}                              => {whole milk}       0.02522403
## [8]  {domestic eggs}                     => {whole milk}       0.03064498
## [9]  {root vegetables}                   => {whole milk}       0.05000553
## [10] {whipped/sour cream}                => {whole milk}       0.03219383
## [11] {frozen vegetables}                 => {whole milk}       0.02057750
## [12] {margarine}                         => {whole milk}       0.02433898
## [13] {yogurt}                            => {whole milk}       0.05675407
## [14] {tropical fruit}                    => {whole milk}       0.04270384
## [15] {beef}                              => {whole milk}       0.02090939
##      confidence coverage   lift     count
## [1]  0.4690265  0.05000553 2.436512 212  
## [2]  0.4333333  0.10952539 2.251092 429  
## [3]  0.4024768  0.07146808 2.090797 260  
## [4]  0.5242967  0.04325700 2.038330 205  
## [5]  0.5069860  0.05542649 1.971031 254  
## [6]  0.4941725  0.04746100 1.921215 212  
## [7]  0.4830508  0.05221817 1.877977 228  
## [8]  0.4800693  0.06383449 1.866386 277  
## [9]  0.4565657  0.10952539 1.775009 452  
## [10] 0.4504644  0.07146808 1.751289 291  
## [11] 0.4295612  0.04790353 1.670023 186  
## [12] 0.4166667  0.05841354 1.619892 220  
## [13] 0.4077901  0.13917469 1.585383 513  
## [14] 0.4067439  0.10498949 1.581315 386  
## [15] 0.4029851  0.05188627 1.566702 189

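The highest lift value is about 2.44, for the rule {root vegetables, whole milk -> other vegetables}: baskets with root vegetables and whole milk contain other vegetables about 2.4 times as often as baskets overall.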
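
As a worked example, the lift of the top rule can be recomputed from counts that already appear in this report. The sketch uses base R only, and the variable names are my own.

```r
n_transactions    <- 9039
count_lhs_and_rhs <- 212   # root vegetables, whole milk and other vegetables
count_lhs         <- 452   # root vegetables and whole milk
count_rhs         <- 1740  # other vegetables

# Confidence of the rule: share of left-hand-side baskets that also
# contain the right-hand side
confidence <- count_lhs_and_rhs / count_lhs

# Expected confidence if both sides were independent: support of the rhs
expected_confidence <- count_rhs / n_transactions

lift <- confidence / expected_confidence  # about 2.44
```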

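Confidence is the conditional probability that a transaction containing the left-hand side of a rule also contains its right-hand side.
Rules with highest confidence value: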

inspect(sort(rules, by = 'confidence'))
##      lhs                                    rhs                support   
## [1]  {other vegetables, yogurt}          => {whole milk}       0.02267950
## [2]  {butter}                            => {whole milk}       0.02810045
## [3]  {other vegetables, root vegetables} => {whole milk}       0.02345392
## [4]  {curd}                              => {whole milk}       0.02522403
## [5]  {domestic eggs}                     => {whole milk}       0.03064498
## [6]  {root vegetables, whole milk}       => {other vegetables} 0.02345392
## [7]  {root vegetables}                   => {whole milk}       0.05000553
## [8]  {whipped/sour cream}                => {whole milk}       0.03219383
## [9]  {root vegetables}                   => {other vegetables} 0.04746100
## [10] {frozen vegetables}                 => {whole milk}       0.02057750
## [11] {margarine}                         => {whole milk}       0.02433898
## [12] {yogurt}                            => {whole milk}       0.05675407
## [13] {tropical fruit}                    => {whole milk}       0.04270384
## [14] {beef}                              => {whole milk}       0.02090939
## [15] {whipped/sour cream}                => {other vegetables} 0.02876424
##      confidence coverage   lift     count
## [1]  0.5242967  0.04325700 2.038330 205  
## [2]  0.5069860  0.05542649 1.971031 254  
## [3]  0.4941725  0.04746100 1.921215 212  
## [4]  0.4830508  0.05221817 1.877977 228  
## [5]  0.4800693  0.06383449 1.866386 277  
## [6]  0.4690265  0.05000553 2.436512 212  
## [7]  0.4565657  0.10952539 1.775009 452  
## [8]  0.4504644  0.07146808 1.751289 291  
## [9]  0.4333333  0.10952539 2.251092 429  
## [10] 0.4295612  0.04790353 1.670023 186  
## [11] 0.4166667  0.05841354 1.619892 220  
## [12] 0.4077901  0.13917469 1.585383 513  
## [13] 0.4067439  0.10498949 1.581315 386  
## [14] 0.4029851  0.05188627 1.566702 189  
## [15] 0.4024768  0.07146808 2.090797 260

The highest confidence value is 52.4%, for the rule {other vegetables, yogurt -> whole milk}: 52.4% of the baskets that contain other vegetables and yogurt also contain whole milk.
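
Confidence depends on the direction of the rule, which a short base R check makes visible. The numbers are copied from the outputs above and the variable names are my own.

```r
n_transactions <- 9039

# {other vegetables, yogurt} => {whole milk}
count_rule <- 205
count_lhs  <- round(0.04325700 * n_transactions)  # coverage * N = 391
confidence_top <- count_rule / count_lhs          # about 0.524

# Direction matters: yogurt and whole milk share 513 baskets
count_pair       <- 513
count_yogurt     <- 1258
count_whole_milk <- 2325
conf_yogurt_to_milk <- count_pair / count_yogurt      # about 0.408
conf_milk_to_yogurt <- count_pair / count_whole_milk  # about 0.221
```

The reverse rule {whole milk -> yogurt} has a confidence of about 22%, below the 40% threshold, which is why only {yogurt -> whole milk} appears among the 15 rules.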

Visualize association rules

Below is a simple scatter plot with support and lift on the axes and confidence represented by the color of the points.

plot(rules, measure = c('support','lift'), shading = 'confidence')


The next visualization represents the rules as a graph. The rules are represented as items connected by arrows.

plot(rules, method = 'graph')

The rules can also be visualized as a grouped matrix. The size of each balloon represents support, and its color represents lift.

plot(rules, method = 'grouped')

Conclusion

Association rules can help us find interesting patterns in customer buying preferences and habits. In the analyzed data set, whole milk, other vegetables, rolls/buns, soda and yogurt are the most frequently bought products. All 15 rules found have a lift above 1.5 and predict either whole milk or other vegetables, the two most frequent items.
{yogurt -> whole milk}, {root vegetables -> whole milk} and {root vegetables -> other vegetables} are the rules with the highest support.
{root vegetables, whole milk -> other vegetables}, {root vegetables -> other vegetables} and {whipped/sour cream -> other vegetables} are the rules with the highest lift.
{other vegetables, yogurt -> whole milk}, {butter -> whole milk} and {other vegetables, root vegetables -> whole milk} are the rules with the highest confidence.