Market Basket Analysis

Problem statement

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name “Market Basket Analysis”.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.

Brief explanation

Initially, I read the data with read_csv. Although this reads the CSV file (GroceryDataSet.csv) without trouble, the result didn’t help in downstream analysis. To mine the data for association rules, I searched online and learned that the apriori() function is required. This is not a function we have used before. My search led me to the following page:

https://blog.aptitive.com/building-the-transactions-class-for-association-rule-mining-in-r-using-arules-and-apriori-c6be64268bc4                     

The page gives an overview of the transactions class, the apriori() function and related topics. The arules package is required, which I added to the list of libraries above. The phrase “Market Basket Analysis” was a good clue for the search.

Here is an explanation of some association-rule terms that we’ll encounter below:

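Support of a set of items is the fraction of transactions in the dataset that contain every item in the set.

Confidence of a rule X => Y is the fraction of transactions containing X that also contain Y, that is, support(X and Y together) / support(X).

Lift of a rule X => Y is the ratio of its actual support to the support expected if X and Y were independent: support(X and Y together) / (support(X) * support(Y)). A lift above 1 means X and Y are bought together more often than independence would predict.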
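
To make the three measures concrete, here is a minimal base R sketch on five made-up baskets. The baskets and the helper name support() are illustrative assumptions and are not taken from the Groceries data.

```r
# Toy data: five hypothetical baskets (not taken from the Groceries data).
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("milk")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "bread"
rhs <- "milk"

supp_rule  <- support(c(lhs, rhs))       # support of {bread, milk}
confidence <- supp_rule / support(lhs)   # share of bread baskets that also hold milk
lift       <- confidence / support(rhs)  # confidence relative to milk's own support

print(c(support = supp_rule, confidence = confidence, lift = lift))
```

In this toy example the lift of {bread} => {milk} comes out below 1, because milk is in 4 of the 5 baskets overall but only in 2 of the 3 baskets that contain bread.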

Reading data and summary

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
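
A summary like the one above can be obtained by reading the file as a transactions object rather than as a data frame. The sketch below is an assumption about how that step might look, not the exact code behind this report: it writes three made-up receipts to a temporary file so that it runs on its own, and the real GroceryDataSet.csv would be passed as `path` instead.

```r
library(arules)

# Stand-in for GroceryDataSet.csv: three made-up receipts, one per line,
# with items separated by commas. Point `path` at the real file to work
# with the actual data.
path <- tempfile(fileext = ".csv")
writeLines(c("whole milk,yogurt",
             "rolls/buns,soda,whole milk",
             "soda"), path)

# format = "basket": each line is one transaction; sep = "," splits the items.
grocery_transactions <- read.transactions(path, format = "basket", sep = ",")

summary(grocery_transactions)
```

Because receipts have different numbers of items, a sparse item matrix of this kind is a more natural fit than the rectangular table that read_csv produces.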

From the summary, we see that the most frequent items include “whole milk”, “other vegetables”, “rolls/buns” and “soda”. To visualize this better, I’ll use the itemFrequencyPlot() function.

Frequency of top 20 most frequent items

This graph shows the frequencies of the 20 most frequent items. It corroborates the observations from the summary.
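
A plot of this kind could be drawn roughly as follows. This is a sketch that assumes the Groceries data bundled with arules can stand in for GroceryDataSet.csv (it has the same 9835 transactions and 169 items as the summary above); the variable name top20 is illustrative.

```r
library(arules)

# Assumption: the Groceries data shipped with arules stands in for
# GroceryDataSet.csv.
data("Groceries", package = "arules")

# Bar chart of the 20 items that occur in the most transactions.
itemFrequencyPlot(Groceries, topN = 20, type = "absolute",
                  main = "Frequency of top 20 most frequent items")

# The same counts as a named vector, for checking against the plot.
top20 <- sort(itemFrequency(Groceries, type = "absolute"),
              decreasing = TRUE)[1:20]
print(top20)
```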

Further analysis

Now I’ll use the apriori() function for the “Market Basket Analysis”. I explored apriori() by varying the support and confidence parameters. With some combinations I didn’t get any results at all; the run simply errored out. With support = 0.001 and confidence = 0.4, I got the rules summarized here, which are tabulated further below in descending order of lift.

## set of 8955 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##   81 2771 4804 1245   54 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.824   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.4000   Min.   :0.001017   Min.   : 1.565  
##  1st Qu.:0.001118   1st Qu.:0.4583   1st Qu.:0.001932   1st Qu.: 2.316  
##  Median :0.001322   Median :0.5319   Median :0.002542   Median : 2.870  
##  Mean   :0.001811   Mean   :0.5579   Mean   :0.003478   Mean   : 3.191  
##  3rd Qu.:0.001830   3rd Qu.:0.6296   3rd Qu.:0.003559   3rd Qu.: 3.733  
##  Max.   :0.056024   Max.   :1.0000   Max.   :0.139502   Max.   :21.494  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 17.81  
##  3rd Qu.: 18.00  
##  Max.   :551.00  
## 
## mining info:
##                  data ntransactions support confidence
##  grocery_transactions          9835   0.001        0.4

An important observation from the summary is that there are 8955 rules, with lengths from 2 to 6.
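
The mining step with these thresholds might look like the sketch below. It again assumes that the Groceries data bundled with arules can stand in for the file used in this report, so the object names differ from the grocery_transactions shown in the mining info.

```r
library(arules)

# Assumed stand-in for the report's data.
data("Groceries", package = "arules")

# Mine association rules with the thresholds quoted in the text:
# keep rules whose support is at least 0.001 and confidence at least 0.4.
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.4))

summary(rules)
```

Raising either threshold shrinks the rule set, and a combination that is too strict can leave no rules at all for later steps to work with.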

In the following, I’ll display 10 of these rules with their support, confidence and lift, sorted in descending order of lift.

## Selecting by count
LHS                   RHS                 support    confidence  coverage   lift      count
{root vegetables}     {other vegetables}  0.0473818  0.4347015   0.1089985  2.246605  466
{whipped/sour cream}  {other vegetables}  0.0288765  0.4028369   0.0716828  2.081924  284
{butter}              {whole milk}        0.0275547  0.4972477   0.0554143  1.946053  271
{curd}                {whole milk}        0.0261312  0.4904580   0.0532791  1.919480  257
{domestic eggs}       {whole milk}        0.0299949  0.4727564   0.0634469  1.850203  295
{whipped/sour cream}  {whole milk}        0.0322318  0.4496454   0.0716828  1.759754  317
{root vegetables}     {whole milk}        0.0489070  0.4486940   0.1089985  1.756031  481
{margarine}           {whole milk}        0.0241993  0.4131944   0.0585663  1.617098  238
{tropical fruit}      {whole milk}        0.0422979  0.4031008   0.1049314  1.577595  416
{yogurt}              {whole milk}        0.0560244  0.4016035   0.1395018  1.571735  551
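
As a sanity check, the measures in the first row can be recomputed by hand from figures already reported: the rule count (466), the number of transactions (9835), the rule’s coverage (0.1089985) and the count of “other vegetables” in the data summary (1903). The variable names below are illustrative.

```r
n         <- 9835       # transactions in the data set
count_xy  <- 466        # transactions with root vegetables AND other vegetables
coverage  <- 0.1089985  # support of the LHS, {root vegetables}
count_rhs <- 1903       # transactions with "other vegetables" (data summary)

support    <- count_xy / n                  # share of all transactions
confidence <- support / coverage            # share of LHS transactions
lift       <- confidence / (count_rhs / n)  # relative to the RHS base rate

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

The results agree with the table row (0.0473818, 0.4347015 and 2.246605) up to rounding of the inputs.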

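What is this table telling us? Among the rules shown, the one with the greatest lift (2.246605) is {root vegetables} => {other vegetables}: customers who buy root vegetables are about 2.2 times as likely to buy other vegetables as customers in general. The support and confidence of this rule are 0.0473818 and 0.4347015 respectively. Note, however, that the “Selecting by count” message above the table indicates that the ten rules were selected by count rather than by lift. The rule summary reports a maximum lift of 21.494, so the rules with the highest lift overall are not among those shown.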
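
One way to pick the ten rules by lift itself is to sort the whole rule set before taking the first ten. This is a sketch under the same assumption as before, namely that the Groceries data bundled with arules stands in for the report’s file; top_by_lift is an illustrative name.

```r
library(arules)

# Assumed stand-in for the report's data.
data("Groceries", package = "arules")

rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.4),
                 control = list(verbose = FALSE))

# Order all rules by lift first, then keep the first ten.
top_by_lift <- head(sort(rules, by = "lift", decreasing = TRUE), n = 10)

inspect(top_by_lift)
```

Sorting before truncating matters here: taking ten rules by some other column and sorting afterwards only orders those ten rules among themselves.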

The following graph visualizes how the items are associated.

Cluster analysis

In order to do cluster analysis, groupings must be identified. After creating a network graph from the given data, I’ll use cluster_louvain() to detect communities (clusters) in it.

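The following step assigns customers (transactions) and items to 19 clusters. In the table, name gives the cluster number and its items, and members gives the number of transactions assigned to the cluster. The member counts sum to 9835, the total number of transactions.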

members  name
   1292  8: chocolate + soda + specialty bar + pastry + salty snack + waffles + candy + dessert + chocolate marshmallow + specialty chocolate + popcorn + cake bar + snack products + finished products + make up remover + potato products + hair spray + light bulbs + baby food + tidbits
   1087  10: other vegetables + rice + abrasive cleaner + flour + beef + chicken + root vegetables + bathroom cleaner + spices + pork + turkey + oil + curd cheese + onions + herbs + dog food + frozen fish + salad dressing + vinegar + roll products + frozen fruits
   1053  12: ready soups + rolls/buns + frankfurter + sausage + spread cheese + hard cheese + canned fish + seasonal products + frozen potato products + sliced cheese + soft cheese + meat + mustard + mayonnaise + nut snack + ketchup + cream
    857  13: whole milk + butter + cereals + curd + detergent + hamburger meat + flower (seeds) + canned vegetables + pasta + softener + Instant food products + honey + cocoa drinks + cleaner + soups + soap + pudding powder
    730  5: liquor (appetizer) + canned beer + shopping bags + misc. beverages + chewing gum + brandy + liqueur + whisky
    674  7: yogurt + cream cheese + meat spreads + packaged fruit/vegetables + butter milk + berries + whipped/sour cream + baking powder + specialty cheese + instant coffee + organic sausage + cooking chocolate + kitchen utensil
    624  4: tropical fruit + pip fruit + white bread + processed cheese + sweet spreads + beverages + ham + cookware + tea + syrup + baby cosmetics + specialty vegetables + sound storage medium
    569  15: citrus fruit + hygiene articles + domestic eggs + cat food + cling film/bags + canned fruit + dental care + flower soil/fertilizer + female sanitary products + dish cleaner + house keeping products + rubbing alcohol + preservation products
    432  16: bottled beer + red/blush wine + prosecco + liquor + rum
    349  11: UHT-milk + bottled water + white wine + male cosmetics
    341  2: long life bakery product + pot plants + fruit/vegetable juice + pickled vegetables + jam + bags
    298  3: semi-finished bread + newspapers + pet care + nuts/prunes + toilet cleaner
    293  6: dishes + napkins + grapes + zwieback + decalcifier
    287  1: coffee + condensed milk + sparkling wine + fish + kitchen towels
    273  18: sugar + frozen vegetables + salt + skin care + liver loaf + frozen chicken
    262  14: frozen dessert + ice cream + frozen meals
    207  9: margarine + artif. sweetener + specialty fat + candles + organic products
    128  17: brown bread + sauces
     79  19: photo/film
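
A grouping of this kind could be produced along the following lines. The sketch assumes a graph with one node per customer and one node per item, joined by an edge whenever the item appears on that customer’s receipt; the report does not show how its graph was built, so this construction and the toy receipts are assumptions.

```r
library(igraph)

# Hypothetical toy receipts; in the report the edges would come from the
# grocery transactions instead.
edges <- data.frame(
  customer = c("t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"),
  item     = c("beer", "wine", "beer", "wine",
               "milk", "butter", "milk", "butter")
)

# Undirected graph: one node per customer and per item, one edge per
# (customer, item) pair.
g <- graph_from_data_frame(edges, directed = FALSE)

# Louvain community detection groups densely connected nodes together.
communities <- cluster_louvain(g)

print(membership(communities))  # community id for every node
print(sizes(communities))       # number of nodes in each community
```

In this toy graph the drinks and the dairy items share no customers, so they cannot end up in the same community.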
