Analysis introduction

Market basket analysis is an unsupervised machine learning technique that can be useful for finding patterns in transactional data.

It can be a very powerful tool for analyzing the purchasing patterns of consumers.

It is used for knowledge discovery rather than prediction.

This analysis results in a set of association rules that identify patterns of relationships among items.

The main algorithm used in market basket analysis is the Apriori algorithm.

The three statistical measures in market basket analysis are support, confidence, and lift.

Description of dataset

The dataset contains 7501 rows, each of which represents one store transaction.

The 119 columns are features, one for each of the 119 different items that might appear in someone’s grocery basket.

Each cell in the matrix is 1 if the item was purchased in the corresponding transaction, and 0 otherwise.

The density value of 0.0329 (3.3%) is the proportion of non-zero matrix cells.

A total of 1754 transactions contained only a single item, while one transaction had 20 items.

The mean number of items per transaction is 3.914, while the median is 3.

## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus
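
As a quick arithmetic check, the density, mean and median can be recomputed from the length distribution printed in the summary. The Python sketch below is only an illustration (it is not the code that produced the output, and the variable names are assumptions); the size counts are copied directly from the summary.

```python
from statistics import median

# Basket-size distribution copied from the summary output:
# basket size -> number of transactions with that size.
size_counts = {
    1: 1754, 2: 1358, 3: 1044, 4: 816, 5: 667, 6: 493, 7: 391, 8: 324,
    9: 259, 10: 139, 11: 102, 12: 67, 13: 40, 14: 22, 15: 17, 16: 4,
    18: 1, 19: 2, 20: 1,
}
n_items = 119  # number of columns (distinct items)

n_transactions = sum(size_counts.values())
total_purchases = sum(size * n for size, n in size_counts.items())

# Density is the share of non-zero cells in the transactions-by-items matrix.
density = total_purchases / (n_transactions * n_items)
mean_size = total_purchases / n_transactions
median_size = median(size for size, n in size_counts.items() for _ in range(n))

print(n_transactions, total_purchases)
print(round(density, 4), round(mean_size, 3), median_size)
```

The non-zero cells are simply the sum of all basket sizes, which is why the density and the mean basket size are two views of the same total.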

Item frequency

Relative frequency

##              almonds    antioxydant juice            asparagus 
##         0.0203972804         0.0089321424         0.0047993601 
##              avocado          babies food                bacon 
##         0.0333288895         0.0045327290         0.0086655113 
##       barbecue sauce            black tea          blueberries 
##         0.0107985602         0.0142647647         0.0091987735 
##           body spray              bramble             brownies 
##         0.0114651380         0.0018664178         0.0337288362 
##            bug spray         burger sauce              burgers 
##         0.0086655113         0.0058658845         0.0871883749 
##               butter                 cake           candy bars 
##         0.0301293161         0.0810558592         0.0097320357 
##              carrots          cauliflower              cereals 
##         0.0153312892         0.0047993601         0.0257299027 
##            champagne              chicken                chili 
##         0.0467937608         0.0599920011         0.0061325157 
##            chocolate      chocolate bread              chutney 
##         0.1638448207         0.0042660979         0.0041327823 
##                cider  clothes accessories              cookies 
##         0.0105319291         0.0083988801         0.0803892814 
##          cooking oil                 corn       cottage cheese 
##         0.0510598587         0.0047993601         0.0318624183 
##                cream         dessert wine             eggplant 
##         0.0009332089         0.0043994134         0.0131982402 
##                 eggs           energy bar         energy drink 
##         0.1797093721         0.0270630583         0.0266631116 
##             escalope extra dark chocolate            flax seed 
##         0.0793227570         0.0119984002         0.0090654579 
##         french fries          french wine          fresh bread 
##         0.1709105453         0.0225303293         0.0430609252 
##           fresh tuna        fromage blanc      frozen smoothie 
##         0.0222636982         0.0135981869         0.0633248900 
##    frozen vegetables      gluten free bar        grated cheese 
##         0.0953206239         0.0069324090         0.0523930143 
##          green beans         green grapes            green tea 
##         0.0086655113         0.0090654579         0.1321157179 
##          ground beef                 gums                  ham 
##         0.0982535662         0.0134648714         0.0265297960 
##     hand protein bar        herb & pepper                honey 
##         0.0051993068         0.0494600720         0.0474603386 
##             hot dogs              ketchup          light cream 
##         0.0323956806         0.0043994134         0.0155979203 
##           light mayo       low fat yogurt            magazines 
##         0.0271963738         0.0765231302         0.0109318757 
##        mashed potato           mayonnaise            meatballs 
##         0.0041327823         0.0061325157         0.0209305426 
##               melons                 milk        mineral water 
##         0.0119984002         0.1295827223         0.2383682176 
##                 mint       mint green tea              muffins 
##         0.0174643381         0.0055992534         0.0241301160 
## mushroom cream sauce              napkins          nonfat milk 
##         0.0190641248         0.0006665778         0.0103986135 
##              oatmeal                  oil            olive oil 
##         0.0043994134         0.0230635915         0.0658578856 
##             pancakes      parmesan cheese                pasta 
##         0.0950539928         0.0198640181         0.0157312358 
##               pepper             pet food              pickles 
##         0.0265297960         0.0065324623         0.0059992001 
##          protein bar             red wine                 rice 
##         0.0185308626         0.0281295827         0.0187974937 
##                salad               salmon                 salt 
##         0.0049326756         0.0425276630         0.0091987735 
##             sandwich              shallot              shampoo 
##         0.0045327290         0.0077323024         0.0049326756 
##               shrimp                 soda                 soup 
##         0.0714571390         0.0062658312         0.0505265965 
##            spaghetti      sparkling water              spinach 
##         0.1741101187         0.0062658312         0.0070657246 
##         strawberries        strong cheese                  tea 
##         0.0213304893         0.0077323024         0.0038661512 
##         tomato juice         tomato sauce             tomatoes 
##         0.0303959472         0.0141314491         0.0683908812 
##           toothpaste               turkey       vegetables mix 
##         0.0081322490         0.0625249967         0.0257299027 
##          water spray           white wine     whole weat flour 
##         0.0003999467         0.0165311292         0.0093320891 
##    whole wheat pasta     whole wheat rice                 yams 
##         0.0294627383         0.0585255299         0.0114651380 
##          yogurt cake             zucchini 
##         0.0273296894         0.0094654046

Absolute frequency

##              almonds    antioxydant juice            asparagus 
##                  153                   67                   36 
##              avocado          babies food                bacon 
##                  250                   34                   65 
##       barbecue sauce            black tea          blueberries 
##                   81                  107                   69 
##           body spray              bramble             brownies 
##                   86                   14                  253 
##            bug spray         burger sauce              burgers 
##                   65                   44                  654 
##               butter                 cake           candy bars 
##                  226                  608                   73 
##              carrots          cauliflower              cereals 
##                  115                   36                  193 
##            champagne              chicken                chili 
##                  351                  450                   46 
##            chocolate      chocolate bread              chutney 
##                 1229                   32                   31 
##                cider  clothes accessories              cookies 
##                   79                   63                  603 
##          cooking oil                 corn       cottage cheese 
##                  383                   36                  239 
##                cream         dessert wine             eggplant 
##                    7                   33                   99 
##                 eggs           energy bar         energy drink 
##                 1348                  203                  200 
##             escalope extra dark chocolate            flax seed 
##                  595                   90                   68 
##         french fries          french wine          fresh bread 
##                 1282                  169                  323 
##           fresh tuna        fromage blanc      frozen smoothie 
##                  167                  102                  475 
##    frozen vegetables      gluten free bar        grated cheese 
##                  715                   52                  393 
##          green beans         green grapes            green tea 
##                   65                   68                  991 
##          ground beef                 gums                  ham 
##                  737                  101                  199 
##     hand protein bar        herb & pepper                honey 
##                   39                  371                  356 
##             hot dogs              ketchup          light cream 
##                  243                   33                  117 
##           light mayo       low fat yogurt            magazines 
##                  204                  574                   82 
##        mashed potato           mayonnaise            meatballs 
##                   31                   46                  157 
##               melons                 milk        mineral water 
##                   90                  972                 1788 
##                 mint       mint green tea              muffins 
##                  131                   42                  181 
## mushroom cream sauce              napkins          nonfat milk 
##                  143                    5                   78 
##              oatmeal                  oil            olive oil 
##                   33                  173                  494 
##             pancakes      parmesan cheese                pasta 
##                  713                  149                  118 
##               pepper             pet food              pickles 
##                  199                   49                   45 
##          protein bar             red wine                 rice 
##                  139                  211                  141 
##                salad               salmon                 salt 
##                   37                  319                   69 
##             sandwich              shallot              shampoo 
##                   34                   58                   37 
##               shrimp                 soda                 soup 
##                  536                   47                  379 
##            spaghetti      sparkling water              spinach 
##                 1306                   47                   53 
##         strawberries        strong cheese                  tea 
##                  160                   58                   29 
##         tomato juice         tomato sauce             tomatoes 
##                  228                  106                  513 
##           toothpaste               turkey       vegetables mix 
##                   61                  469                  193 
##          water spray           white wine     whole weat flour 
##                    3                  124                   70 
##    whole wheat pasta     whole wheat rice                 yams 
##                  221                  439                   86 
##          yogurt cake             zucchini 
##                  205                   71
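
The two tables are related by a single division: the relative frequency of an item is its absolute count divided by the 7501 transactions. A minimal Python sketch, using a handful of counts copied from the table above (the dictionary and variable names are illustrative assumptions):

```python
n_transactions = 7501

# Absolute counts for a few items, copied from the absolute frequency table.
absolute = {
    "mineral water": 1788, "eggs": 1348, "spaghetti": 1306,
    "french fries": 1282, "chocolate": 1229, "green tea": 991,
    "milk": 972, "ground beef": 737, "frozen vegetables": 715,
}

# Relative frequency = absolute count / number of transactions.
relative = {item: count / n_transactions for item, count in absolute.items()}

# Items that appear in more than 10% of all transactions.
frequent = sorted(item for item, share in relative.items() if share > 0.10)

print(round(relative["mineral water"], 4))
print(frequent)
```

Only seven items pass the 10% threshold; ground beef and frozen vegetables fall just below it.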

Plot of top 10 item frequency

Relative frequency plot

Absolute frequency plot

Distance between the Items

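Next, I want to visualise the Jaccard distance among the items with a relative frequency higher than 10%. The Jaccard coefficient of two items is $J = \frac{f_{11}}{f_{1+} + f_{+1} - f_{11}}$, where $f_{11}$ is the number of transactions containing both items and $f_{1+}$ and $f_{+1}$ are the numbers of transactions containing the first and the second item respectively. The distance reported in the table is $1 - J$.

The most dissimilar items are milk and green tea (distance 0.93), while the most similar are spaghetti and mineral water (distance 0.83).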

##               chocolate eggs french fries green tea milk mineral water
## eggs               0.89                                               
## french fries       0.89 0.88                                          
## green tea          0.91 0.91         0.90                             
## milk               0.88 0.89         0.91      0.93                   
## mineral water      0.85 0.86         0.91      0.91 0.85              
## spaghetti          0.87 0.88         0.91      0.91 0.87          0.83
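
One entry of this table can be verified by hand from counts that appear elsewhere in this document: spaghetti occurs in 1306 baskets, mineral water in 1788, and the rule {spaghetti} => {mineral water} has a count of 448 baskets containing both. The Python function below is a sketch with assumed names, not the routine that produced the table.

```python
def jaccard_distance(f11, f1_plus, f_plus1):
    """Jaccard distance between two items from co-occurrence counts.

    f11     -- transactions containing both items
    f1_plus -- transactions containing the first item
    f_plus1 -- transactions containing the second item
    """
    union = f1_plus + f_plus1 - f11  # transactions containing at least one item
    return 1 - f11 / union

# spaghetti vs. mineral water, counts taken from this document
d_spaghetti_water = jaccard_distance(448, 1306, 1788)
# milk (972 baskets) vs. mineral water, 360 baskets in common
d_milk_water = jaccard_distance(360, 972, 1788)

print(round(d_spaghetti_water, 2), round(d_milk_water, 2))
```

Both values agree with the corresponding cells of the matrix (0.83 and 0.85).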

Then, for a graphical display, I plot these distances as a histogram.

The Apriori Algorithm

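The Apriori algorithm takes its name from a simple prior (a priori) belief about the properties of frequent itemsets: all subsets of a frequent itemset must also be frequent. Equivalently, an itemset that contains an infrequent subset cannot itself be frequent, so such candidates can be discarded without counting them. This limits the number of rules to search for.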
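
The pruning idea can be illustrated with a small level-wise search. The following Python sketch is a simplified teaching version under stated assumptions: the five baskets are hypothetical toy data, the function name is made up for this example, and it is not the optimised implementation used to produce the results in this document.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Candidates of size k are unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy baskets, for illustration only.
toy = [
    {"spaghetti", "ground beef", "mineral water"},
    {"spaghetti", "ground beef"},
    {"spaghetti", "mineral water"},
    {"eggs", "mineral water"},
    {"eggs"},
]
result = frequent_itemsets(toy, min_support=0.4)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

In this toy example the three-item candidate is pruned without being counted, because its subset {ground beef, mineral water} appears in only one of the five baskets.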

There are two statistical measures that can be used to determine whether a rule is interesting:

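  • Support measures how frequently an item or itemset appears in a given transactional data set.

  • Confidence measures a rule’s predictive power or accuracy: the proportion of transactions containing the left-hand side of the rule that also contain the right-hand side.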

Finding the right value of support and confidence

The first step in order to create a set of association rules is to determine the optimal thresholds for support and confidence.

I try different values of support and confidence and see graphically how many rules are generated for each combination.

I decide to try with:

Support values of 10%, 5%, 2% and 1%

Confidence values from 10% to 80%

Having created vectors with the different values of support and confidence, I now plot the number of rules for each combination in order to find the best values.

From the plot I can see that:

  • Support level of 10%: I find only a few rules, and with very low confidence levels. Hence, I cannot use this value, because the resulting rules are unrepresentative.

  • Support level of 5%: the results are more or less the same as for a support level of 10%.

  • Support level of 2%: this time I find 20 rules with a confidence of at least 30%.

  • Support level of 1%: too many rules.
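
The mechanics of such a threshold sweep can be sketched in a few lines of Python. The example below is an illustration under a stated limitation: it counts only the ten distinct rules whose support and confidence are printed in the top-5 tables later in this document, so the totals are for that subset and not for the full rule set. The function name is an assumption.

```python
from itertools import product

# (rule, support, confidence) for the ten distinct rules printed in the
# top-5 tables of this document; values rounded to four decimals.
rules = [
    ("spaghetti => mineral water",   0.0597, 0.3430),
    ("chocolate => mineral water",   0.0527, 0.3214),
    ("milk => mineral water",        0.0480, 0.3704),
    ("ground beef => mineral water", 0.0409, 0.4166),
    ("ground beef => spaghetti",     0.0392, 0.3989),
    ("soup => mineral water",        0.0231, 0.4565),
    ("olive oil => mineral water",   0.0276, 0.4190),
    ("cooking oil => mineral water", 0.0201, 0.3943),
    ("olive oil => spaghetti",       0.0229, 0.3482),
    ("burgers => eggs",              0.0288, 0.3303),
]

def count_rules(rules, min_support, min_confidence):
    """Number of rules that meet both thresholds."""
    return sum(1 for _, s, c in rules
               if s >= min_support and c >= min_confidence)

# Sweep a small grid of thresholds and tabulate the surviving rules.
grid = {(s, c): count_rules(rules, s, c)
        for s, c in product([0.10, 0.05, 0.02], [0.3, 0.4])}
for (s, c), n in grid.items():
    print(f"support >= {s:.2f}, confidence >= {c:.1f}: {n} rule(s)")
```

Even on this subset the pattern described in the list is visible: nothing survives at 10% support, very little at 5%, and the count grows quickly once the support threshold drops to 2%.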

As the plot above shows, I decide to generate the association rules with the Apriori algorithm using support = 2% and confidence = 30%.

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 150 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [20 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 20 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2 
## 20 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02013   Min.   :0.3060   Min.   :0.05053   Min.   :1.316  
##  1st Qu.:0.02290   1st Qu.:0.3303   1st Qu.:0.06522   1st Qu.:1.435  
##  Median :0.02593   Median :0.3515   Median :0.07399   Median :1.563  
##  Mean   :0.03080   Mean   :0.3609   Mean   :0.08613   Mean   :1.618  
##  3rd Qu.:0.03660   3rd Qu.:0.3836   3rd Qu.:0.09605   3rd Qu.:1.758  
##  Max.   :0.05973   Max.   :0.4565   Max.   :0.17411   Max.   :2.291  
##      count      
##  Min.   :151.0  
##  1st Qu.:171.8  
##  Median :194.5  
##  Mean   :231.1  
##  3rd Qu.:274.5  
##  Max.   :448.0  
## 
## mining info:
##    data ntransactions support confidence
##  Basket          7501    0.02        0.3

Plot the rule

Scatterplot

Graph

Parallel coordinates

Grouped matrix

Visualize the rule

Let’s now sort the rules so that we can inspect the top 5 rules ordered by support, confidence, lift and count, where:

  • Support measures how frequently an item or itemset appears in a given transactional data set.

  • Confidence measures a rule’s predictive power or accuracy: the proportion of transactions containing the left-hand side of the rule that also contain the right-hand side.

  • Lift is the ratio of the observed support to that expected if the two sides of the rule were independent; high lift values indicate stronger associations.

  • Count is the number of transactions that contain all the items in the rule.
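
These definitions can be checked against one rule from this document, {ground beef} => {spaghetti}, using only counts that appear in the frequency table and the rule output: 737 baskets with ground beef, 1306 with spaghetti, and 294 with both. The Python function below is an illustrative sketch with assumed names.

```python
def rule_measures(count_both, count_lhs, count_rhs, n_transactions):
    """Support, confidence and lift of a rule lhs => rhs from raw counts."""
    support = count_both / n_transactions            # P(lhs and rhs)
    confidence = count_both / count_lhs              # P(rhs | lhs)
    lift = confidence / (count_rhs / n_transactions) # P(rhs | lhs) / P(rhs)
    return support, confidence, lift

# {ground beef} => {spaghetti}, counts copied from this document
support, confidence, lift = rule_measures(
    count_both=294, count_lhs=737, count_rhs=1306, n_transactions=7501
)
print(round(support, 4), round(confidence, 4), round(lift, 3))
```

A lift above 1 means that spaghetti appears in baskets with ground beef more often than its overall frequency would suggest; here it is a little more than twice as often.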

Support

##     lhs              rhs             support    confidence coverage   lift    
## [1] {spaghetti}   => {mineral water} 0.05972537 0.3430322  0.17411012 1.439085
## [2] {chocolate}   => {mineral water} 0.05265965 0.3213995  0.16384482 1.348332
## [3] {milk}        => {mineral water} 0.04799360 0.3703704  0.12958272 1.553774
## [4] {ground beef} => {mineral water} 0.04092788 0.4165536  0.09825357 1.747522
## [5] {ground beef} => {spaghetti}     0.03919477 0.3989145  0.09825357 2.291162
##     count
## [1] 448  
## [2] 395  
## [3] 360  
## [4] 307  
## [5] 294

Confidence

##     lhs              rhs             support    confidence coverage   lift    
## [1] {soup}        => {mineral water} 0.02306359 0.4564644  0.05052660 1.914955
## [2] {olive oil}   => {mineral water} 0.02759632 0.4190283  0.06585789 1.757904
## [3] {ground beef} => {mineral water} 0.04092788 0.4165536  0.09825357 1.747522
## [4] {ground beef} => {spaghetti}     0.03919477 0.3989145  0.09825357 2.291162
## [5] {cooking oil} => {mineral water} 0.02013065 0.3942559  0.05105986 1.653978
##     count
## [1] 173  
## [2] 207  
## [3] 307  
## [4] 294  
## [5] 151

Lift

##     lhs              rhs             support    confidence coverage   lift    
## [1] {ground beef} => {spaghetti}     0.03919477 0.3989145  0.09825357 2.291162
## [2] {olive oil}   => {spaghetti}     0.02293028 0.3481781  0.06585789 1.999758
## [3] {soup}        => {mineral water} 0.02306359 0.4564644  0.05052660 1.914955
## [4] {burgers}     => {eggs}          0.02879616 0.3302752  0.08718837 1.837830
## [5] {olive oil}   => {mineral water} 0.02759632 0.4190283  0.06585789 1.757904
##     count
## [1] 294  
## [2] 172  
## [3] 173  
## [4] 216  
## [5] 207

Count

##     lhs              rhs             support    confidence coverage   lift    
## [1] {spaghetti}   => {mineral water} 0.05972537 0.3430322  0.17411012 1.439085
## [2] {chocolate}   => {mineral water} 0.05265965 0.3213995  0.16384482 1.348332
## [3] {milk}        => {mineral water} 0.04799360 0.3703704  0.12958272 1.553774
## [4] {ground beef} => {mineral water} 0.04092788 0.4165536  0.09825357 1.747522
## [5] {ground beef} => {spaghetti}     0.03919477 0.3989145  0.09825357 2.291162
##     count
## [1] 448  
## [2] 395  
## [3] 360  
## [4] 307  
## [5] 294
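
The Support and Count tables are identical because support is simply count divided by the 7501 transactions, so both measures rank the rules in the same order. A short Python sketch with six of the rules printed above (the `Rule` tuple and helper names are assumptions made for this illustration):

```python
from collections import namedtuple

N_TRANSACTIONS = 7501
Rule = namedtuple("Rule", "name count confidence lift")

# Six rules copied from the tables above (confidence and lift rounded).
rules = [
    Rule("spaghetti => mineral water",   448, 0.3430, 1.439),
    Rule("chocolate => mineral water",   395, 0.3214, 1.348),
    Rule("milk => mineral water",        360, 0.3704, 1.554),
    Rule("ground beef => mineral water", 307, 0.4166, 1.748),
    Rule("ground beef => spaghetti",     294, 0.3989, 2.291),
    Rule("soup => mineral water",        173, 0.4565, 1.915),
]

def support(rule):
    """Support is the rule count as a share of all transactions."""
    return rule.count / N_TRANSACTIONS

def top(rules, key, k):
    """Names of the k rules with the largest value of key."""
    return [r.name for r in sorted(rules, key=key, reverse=True)[:k]]

by_support = top(rules, support, 5)
by_count = top(rules, lambda r: r.count, 5)
by_confidence = top(rules, lambda r: r.confidence, 2)
by_lift = top(rules, lambda r: r.lift, 2)

print(by_support == by_count)
print(by_confidence)
print(by_lift)
```

Confidence and lift, by contrast, produce different rankings, which is why it is worth inspecting the rules under more than one measure.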

Conclusion

To conclude, I want to find the associations that involve particular items.

Eggs

I create association rules for eggs, using a support of 1% and the same confidence of 30% as before, and inspect them.

## set of 2 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 
## 2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01946   Min.   :0.3113   Min.   :0.06252   Min.   :1.732  
##  1st Qu.:0.02180   1st Qu.:0.3160   1st Qu.:0.06869   1st Qu.:1.759  
##  Median :0.02413   Median :0.3208   Median :0.07486   Median :1.785  
##  Mean   :0.02413   Mean   :0.3208   Mean   :0.07486   Mean   :1.785  
##  3rd Qu.:0.02646   3rd Qu.:0.3255   3rd Qu.:0.08102   3rd Qu.:1.811  
##  Max.   :0.02880   Max.   :0.3303   Max.   :0.08719   Max.   :1.838  
##      count      
##  Min.   :146.0  
##  1st Qu.:163.5  
##  Median :181.0  
##  Mean   :181.0  
##  3rd Qu.:198.5  
##  Max.   :216.0  
## 
## mining info:
##    data ntransactions support confidence
##  Basket          7501    0.01        0.3

After ordering the rules by confidence, I inspect them.

It is easy to see that for this product only two other items, burgers and turkey, have an association at the chosen threshold values.

##     lhs          rhs    support    confidence coverage   lift     count
## [1] {turkey}  => {eggs} 0.01946407 0.3113006  0.06252500 1.732245 146  
## [2] {burgers} => {eggs} 0.02879616 0.3302752  0.08718837 1.837830 216

Spaghetti

So, let’s try to do the same with another common product, spaghetti.

This time I find 17 rules.

## set of 17 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 8 9 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.529   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01013   Min.   :0.3004   Min.   :0.02760   Min.   :1.725  
##  1st Qu.:0.01093   1st Qu.:0.3155   1st Qu.:0.03373   1st Qu.:1.812  
##  Median :0.01573   Median :0.3288   Median :0.04253   Median :1.889  
##  Mean   :0.01585   Mean   :0.3377   Mean   :0.04669   Mean   :1.940  
##  3rd Qu.:0.01653   3rd Qu.:0.3482   3rd Qu.:0.05239   3rd Qu.:2.000  
##  Max.   :0.03919   Max.   :0.4169   Max.   :0.09825   Max.   :2.395  
##      count      
##  Min.   : 76.0  
##  1st Qu.: 82.0  
##  Median :118.0  
##  Mean   :118.9  
##  3rd Qu.:124.0  
##  Max.   :294.0  
## 
## mining info:
##    data ntransactions support confidence
##  Basket          7501    0.01        0.3

Let’s order by confidence level and inspect the first 6.

The best associations with spaghetti are ground beef, mineral water, olive oil and red wine.

Hence, in conclusion, I can say that the associations present in these market baskets represent ingredients that are often consumed together.

##     lhs                            rhs         support    confidence coverage  
## [1] {ground beef,mineral water} => {spaghetti} 0.01706439 0.4169381  0.04092788
## [2] {ground beef}               => {spaghetti} 0.03919477 0.3989145  0.09825357
## [3] {mineral water,olive oil}   => {spaghetti} 0.01026530 0.3719807  0.02759632
## [4] {red wine}                  => {spaghetti} 0.01026530 0.3649289  0.02812958
## [5] {olive oil}                 => {spaghetti} 0.02293028 0.3481781  0.06585789
## [6] {chocolate,milk}            => {spaghetti} 0.01093188 0.3402490  0.03212905
##     lift     count
## [1] 2.394681 128  
## [2] 2.291162 294  
## [3] 2.136468  77  
## [4] 2.095966  77  
## [5] 1.999758 172  
## [6] 1.954217  82