Market Basket Analysis with Association Rules

1.Introduction

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected. (Wikipedia)

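This dataset is the Groceries market basket data. It is described as containing 9835 transactions by customers shopping for groceries and 169 unique items. Note that the apriori output in Section 4.1 reports 7932 transactions and 167 items for the file as read here, so the relative frequencies and counts in this paper refer to those 7932 transactions.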

This paper analyzes the dataset with association rules to find the product combinations that customers buy most often, and so to help the store decide how best to combine and place its products.

2.Data description

As the item frequencies and the chart below show, whole milk, soda, rolls/buns, yogurt, bottled water, shopping bags, tropical fruit, canned beer, sausage, and pastry are the ten most frequently purchased items.

library(arules)
library(arulesViz)

setwd("E:/Master of Data science/6-unsupervised learing/7-ULproject/project3-Association rules")
trans1<-read.transactions("groceries.csv", format="basket", sep=",", skip=0)
itemFrequency(trans1, type="relative")

##          abrasive cleaner          artif. sweetener            baby cosmetics 
##              0.0023953606              0.0027735754              0.0006303580 
##                      bags             baking powder          bathroom cleaner 
##              0.0005042864              0.0128593041              0.0021432173 
##                      beef                   berries                 beverages 
##              0.0405950580              0.0284921836              0.0258446798 
##              bottled beer             bottled water                    brandy 
##              0.0798033283              0.1062783661              0.0044125063 
##               brown bread                    butter               butter milk 
##              0.0572365103              0.0438729198              0.0218103883 
##                  cake bar                   candles                     candy 
##              0.0117246596              0.0081946546              0.0284921836 
##               canned beer               canned fish              canned fruit 
##              0.0850983359              0.0123550177              0.0026475038 
##         canned vegetables                  cat food                   cereals 
##              0.0075642965              0.0208018154              0.0045385779 
##               chewing gum                   chicken                 chocolate 
##              0.0204236006              0.0310136157              0.0457639939 
##     chocolate marshmallow              citrus fruit                   cleaner 
##              0.0086989410              0.0668179526              0.0042864347 
##           cling film/bags              cocoa drinks                    coffee 
##              0.0100857287              0.0021432173              0.0553454362 
##            condensed milk         cooking chocolate                  cookware 
##              0.0095814423              0.0022692890              0.0027735754 
##                     cream              cream cheese                      curd 
##              0.0007564297              0.0321482602              0.0447554211 
##               curd cheese               decalcifier               dental care 
##              0.0036560767              0.0012607161              0.0046646495 
##                   dessert                 detergent              dish cleaner 
##              0.0316439738              0.0158850227              0.0100857287 
##                    dishes                  dog food             domestic eggs 
##              0.0143721634              0.0078164397              0.0510590015 
##  female sanitary products         finished products                      fish 
##              0.0057992940              0.0055471508              0.0027735754 
##                     flour            flower (seeds)    flower soil/fertilizer 
##              0.0137418053              0.0081946546              0.0021432173 
##               frankfurter            frozen chicken            frozen dessert 
##              0.0526979324              0.0007564297              0.0088250126 
##               frozen fish             frozen fruits              frozen meals 
##              0.0086989410              0.0005042864              0.0258446798 
##    frozen potato products         frozen vegetables     fruit/vegetable juice 
##              0.0071860817              0.0375693394              0.0635400908 
##                    grapes                hair spray                       ham 
##              0.0165153807              0.0011346445              0.0209278870 
##            hamburger meat               hard cheese                     herbs 
##              0.0240796773              0.0186585981              0.0105900151 
##                     honey    house keeping products          hygiene articles 
##              0.0015128593              0.0069339385              0.0289964700 
##                 ice cream            instant coffee     Instant food products 
##              0.0247100353              0.0068078669              0.0065557237 
##                       jam                   ketchup            kitchen towels 
##              0.0044125063              0.0034039334              0.0046646495 
##           kitchen utensil               light bulbs                   liqueur 
##              0.0003782148              0.0035300050              0.0010085729 
##                    liquor        liquor (appetizer)                liver loaf 
##              0.0121028744              0.0080685830              0.0044125063 
##  long life bakery product           make up remover            male cosmetics 
##              0.0331568331              0.0008825013              0.0046646495 
##                 margarine                mayonnaise                      meat 
##              0.0481593545              0.0069339385              0.0196671710 
##              meat spreads           misc. beverages                   mustard 
##              0.0041603631              0.0282400403              0.0108421583 
##                   napkins                newspapers                 nut snack 
##              0.0470247100              0.0750126072              0.0028996470 
##               nuts/prunes                       oil                    onions 
##              0.0030257186              0.0224407463              0.0208018154 
##          organic products           organic sausage packaged fruit/vegetables 
##              0.0012607161              0.0021432173              0.0122289460 
##                     pasta                    pastry                  pet care 
##              0.0133635905              0.0823247605              0.0093292990 
##                photo/film        pickled vegetables                 pip fruit 
##              0.0100857287              0.0142460918              0.0613968734 
##                   popcorn                      pork           potato products 
##              0.0069339385              0.0446293495              0.0025214322 
##             potted plants     preservation products          processed cheese 
##              0.0160110943              0.0001260716              0.0137418053 
##                  prosecco            pudding powder               ready soups 
##              0.0021432173              0.0018910741              0.0015128593 
##            red/blush wine                      rice             roll products 
##              0.0176500252              0.0045385779              0.0068078669 
##                rolls/buns           root vegetables           rubbing alcohol 
##              0.1752395361              0.0763993949              0.0007564297 
##                       rum            salad dressing                      salt 
##              0.0036560767              0.0003782148              0.0088250126 
##               salty snack                    sauces                   sausage 
##              0.0335350479              0.0049167927              0.0830811901 
##         seasonal products       semi-finished bread             shopping bags 
##              0.0131114473              0.0155068079              0.0934190620 
##                 skin care             sliced cheese            snack products 
##              0.0028996470              0.0191628845              0.0026475038 
##                      soap                      soda               soft cheese 
##              0.0026475038              0.1756177509              0.0123550177 
##                  softener      sound storage medium                     soups 
##              0.0047907211              0.0001260716              0.0045385779 
##            sparkling wine             specialty bar          specialty cheese 
##              0.0050428643              0.0269793243              0.0052950076 
##       specialty chocolate             specialty fat      specialty vegetables 
##              0.0301311145              0.0031517902              0.0012607161 
##                    spices             spread cheese                     sugar 
##              0.0040342915              0.0100857287              0.0286182552 
##             sweet spreads                     syrup                       tea 
##              0.0081946546              0.0026475038              0.0028996470 
##                   tidbits            toilet cleaner            tropical fruit 
##              0.0023953606              0.0006303580              0.0856026223 
##                    turkey                  UHT-milk                   vinegar 
##              0.0051689360              0.0313918306              0.0050428643 
##                   waffles        whipped/sour cream                    whisky 
##              0.0351739788              0.0530761473              0.0007564297 
##               white bread                white wine                whole milk 
##              0.0351739788              0.0208018154              0.2240292486 
##                    yogurt                  zwieback 
##              0.1191376702              0.0064296520

itemFrequencyPlot(trans1, topN=30, type="relative", main="Item Frequency")

3.Key metrics in Association Rules

3.1 Support

Support is the proportion of all transactions that contain the product combination, so its value lies in the interval [0,1]. The larger the value, the more frequently the combination occurs in the data.

\[ Support=\frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions}=P(A\cap B)\]

3.2 Confidence

Confidence is the proportion of the transactions containing A that also contain B, and its value also lies in [0,1]. It estimates the conditional probability of B given A: the larger the value, the more likely it is that a customer who buys A also buys B. Confidence is often used for cross-selling recommendations. A higher confidence usually indicates a better recommendation, although a very frequent item B can produce high confidence without any real association, which is why lift is examined as well.

\[ Confidence=\frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions\ with\ A}=\frac{P(A\cap B)}{P(A)}\]

3.3 Lift

Lift is the ratio of the observed support of A and B together to the support that would be expected if A and B were purchased independently, which is the product of their individual supports. A lift of 1 indicates independence, a lift above 1 indicates that the two items are bought together more often than expected, and a lift below 1 indicates that they are bought together less often than expected.

\[ Lift=\frac{Confidence}{Expected\ Confidence}=\frac{P(A\cap B)}{P(A)\cdot P(B)}\]
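The three metrics can be computed by hand on a handful of baskets. The sketch below uses base R only and a made-up set of five transactions (the data and the helper names basket_support and rule_metrics are illustrative assumptions, not part of groceries.csv or of arules).

```r
# Toy transactions (hypothetical data, not taken from groceries.csv)
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt"),
  c("bread"),
  c("milk")
)

# Share of baskets that contain every item in `items`
basket_support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Support, confidence and lift of the rule lhs => rhs
rule_metrics <- function(lhs, rhs, baskets) {
  supp_both <- basket_support(c(lhs, rhs), baskets)
  conf <- supp_both / basket_support(lhs, baskets)
  lift <- conf / basket_support(rhs, baskets)
  c(support = supp_both, confidence = conf, lift = lift)
}

m <- rule_metrics("yogurt", "milk", baskets)
print(m)
```

Here yogurt and milk appear together in 2 of 5 baskets (support 0.4), every yogurt basket also contains milk (confidence 1), and milk alone appears in 4 of 5 baskets, so the lift is 1 / 0.8 = 1.25.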

4.Analysis

4.1 Apply apriori function

After experimenting with several parameter values, the minimum support was set to 0.01 and the minimum confidence to 0.3. With these settings no rule consists of a single item; the rules combine two or three items, which is more useful for deciding how to sell products in bundles.

rules.basket <- apriori(trans1, parameter = list(supp = 0.01, conf = 0.3, minlen=1, maxlen=15))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      15  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 79 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 7932 transaction(s)] done [0.00s].
## sorting and recoding items ... [81 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [30 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(head(sort(sort(rules.basket, by ="confidence"),by="support",decreasing = TRUE),15))

##      lhs                        rhs          support    confidence coverage  
## [1]  {yogurt}                => {whole milk} 0.04185577 0.3513228  0.11913767
## [2]  {root vegetables}       => {whole milk} 0.03189612 0.4174917  0.07639939
## [3]  {tropical fruit}        => {whole milk} 0.03126576 0.3652430  0.08560262
## [4]  {pastry}                => {whole milk} 0.02811397 0.3415008  0.08232476
## [5]  {sausage}               => {rolls/buns} 0.02697932 0.3247344  0.08308119
## [6]  {newspapers}            => {whole milk} 0.02357539 0.3142857  0.07501261
## [7]  {domestic eggs}         => {whole milk} 0.02193646 0.4296296  0.05105900
## [8]  {whipped/sour cream}    => {whole milk} 0.02181039 0.4109264  0.05307615
## [9]  {citrus fruit}          => {whole milk} 0.02168432 0.3245283  0.06681795
## [10] {pip fruit}             => {whole milk} 0.02054967 0.3347023  0.06139687
## [11] {curd}                  => {whole milk} 0.02017146 0.4507042  0.04475542
## [12] {fruit/vegetable juice} => {whole milk} 0.02004539 0.3154762  0.06354009
## [13] {butter}                => {whole milk} 0.01991931 0.4540230  0.04387292
## [14] {brown bread}           => {whole milk} 0.01966717 0.3436123  0.05723651
## [15] {margarine}             => {whole milk} 0.01853253 0.3848168  0.04815935
##      lift     count
## [1]  1.568200 332  
## [2]  1.863559 253  
## [3]  1.630336 248  
## [4]  1.524358 223  
## [5]  1.853089 214  
## [6]  1.402878 187  
## [7]  1.917739 174  
## [8]  1.834253 173  
## [9]  1.448598 172  
## [10] 1.494011 163  
## [11] 2.011810 160  
## [12] 1.408192 159  
## [13] 2.026624 158  
## [14] 1.533783 156  
## [15] 1.717708 147

After sorting the rules in descending order of support (following a preliminary sort by confidence), the 15 rules with the largest support are shown. Fourteen of them have whole milk on the right-hand side, while the left-hand sides are mixed; the left-hand items with the largest support are yogurt and root vegetables. With confidence values between about 0.31 and 0.45, roughly a third to just under half of the customers who buy one of these items also buy whole milk.
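As a sanity check, the quality measures of the first rule, {yogurt} => {whole milk}, can be reproduced from the numbers printed above. The sketch below is base R only; the values are copied from the itemFrequency and apriori outputs in this paper, and the variable names are illustrative.

```r
# Values copied from the outputs above
n_trans <- 7932                  # transactions reported by apriori()
supp_yogurt <- 0.1191376702      # relative frequency of yogurt
supp_whole_milk <- 0.2240292486  # relative frequency of whole milk
count_both <- 332                # count of {yogurt} => {whole milk}

supp_rule <- count_both / n_trans            # support of the rule
conf_rule <- supp_rule / supp_yogurt         # confidence = P(A and B) / P(A)
lift_rule <- conf_rule / supp_whole_milk     # lift = confidence / P(B)

check <- c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
print(round(check, 4))
```

The rounded results agree with the support, confidence and lift reported for rule [1] in the table.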

The store also sells butter milk, UHT-milk, and condensed milk, but each is bought far less often than whole milk (relative frequencies of about 0.02, 0.03, and 0.01 against 0.22), which suggests that customers prefer whole milk.

4.2 Visualization

plot(rules.basket, method="grouped")

As the graph above shows, whole milk is the item that customers purchase most often. {curd}~{yogurt} has the largest lift value, which means that curd and yogurt are bought together more often than would be expected if they were purchased independently. {yogurt}~{whole milk} has the largest support overall (0.042 in the table above), so this pair accounts for the largest share of transactions. {sausage}~{rolls/buns} has the largest support (0.027) among the rules that do not have whole milk on the right-hand side, so sausage and rolls/buns are also frequently bought together.

plot(rules.basket, method="graph")

As the graph shows, whole milk sits at the center, which means that customers buy it together with many other items. Root vegetables and yogurt are the items most often purchased together with whole milk.

5.Induction

The analysis above shows that whole milk is the item customers purchase most frequently. The next step is to find which items are most frequently purchased together with whole milk.

# keep only rules with whole milk on the right-hand side
rules_Whole_milk <- apriori(data=trans1, parameter=list(supp=0.02, conf=0.3),
                            appearance=list(default="lhs", rhs="whole milk"),
                            control=list(verbose=FALSE))
inspect(sort(rules_Whole_milk, by="lift"))

##      lhs                        rhs          support    confidence coverage  
## [1]  {curd}                  => {whole milk} 0.02017146 0.4507042  0.04475542
## [2]  {domestic eggs}         => {whole milk} 0.02193646 0.4296296  0.05105900
## [3]  {root vegetables}       => {whole milk} 0.03189612 0.4174917  0.07639939
## [4]  {whipped/sour cream}    => {whole milk} 0.02181039 0.4109264  0.05307615
## [5]  {tropical fruit}        => {whole milk} 0.03126576 0.3652430  0.08560262
## [6]  {yogurt}                => {whole milk} 0.04185577 0.3513228  0.11913767
## [7]  {pastry}                => {whole milk} 0.02811397 0.3415008  0.08232476
## [8]  {pip fruit}             => {whole milk} 0.02054967 0.3347023  0.06139687
## [9]  {citrus fruit}          => {whole milk} 0.02168432 0.3245283  0.06681795
## [10] {fruit/vegetable juice} => {whole milk} 0.02004539 0.3154762  0.06354009
## [11] {newspapers}            => {whole milk} 0.02357539 0.3142857  0.07501261
##      lift     count
## [1]  2.011810 160  
## [2]  1.917739 174  
## [3]  1.863559 253  
## [4]  1.834253 173  
## [5]  1.630336 248  
## [6]  1.568200 332  
## [7]  1.524358 223  
## [8]  1.494011 163  
## [9]  1.448598 172  
## [10] 1.408192 159  
## [11] 1.402878 187

is.significant(rules_Whole_milk, trans1)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

is.superset(rules_Whole_milk)

## 11 x 11 sparse Matrix of class "ngCMatrix"
##                                                         
## {curd,whole milk}                  | . . . . . . . . . .
## {whipped/sour cream,whole milk}    . | . . . . . . . . .
## {domestic eggs,whole milk}         . . | . . . . . . . .
## {newspapers,whole milk}            . . . | . . . . . . .
## {pip fruit,whole milk}             . . . . | . . . . . .
## {fruit/vegetable juice,whole milk} . . . . . | . . . . .
## {citrus fruit,whole milk}          . . . . . . | . . . .
## {pastry,whole milk}                . . . . . . . | . . .
## {root vegetables,whole milk}       . . . . . . . . | . .
## {tropical fruit,whole milk}        . . . . . . . . . | .
## {whole milk,yogurt}                . . . . . . . . . . |

The results above show that all 11 rules are significant and that none of them is a superset of another. The largest lift values belong to {curd}~{whole milk} and then {domestic eggs}~{whole milk}, but their counts are comparatively low (160 and 174). The largest count belongs to {yogurt}~{whole milk} (332).
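To make the idea of a significant rule concrete, the sketch below runs a one-sided Fisher's exact test on the 2x2 contingency table for {yogurt} => {whole milk}, using base R and counts derived from the outputs above. Treating this as equivalent to what is.significant does is an assumption: the test method, significance level and multiple-comparison adjustment used by the installed arules version may differ.

```r
# Counts derived from the outputs above
n_trans <- 7932
n_yogurt <- round(0.1191376702 * n_trans)  # transactions with yogurt
n_milk <- round(0.2240292486 * n_trans)    # transactions with whole milk
n_both <- 332                              # transactions with both

# Rows: yogurt yes/no; columns: whole milk yes/no
tab <- matrix(
  c(n_both,          n_yogurt - n_both,
    n_milk - n_both, n_trans - n_yogurt - n_milk + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(yogurt = c("yes", "no"), whole_milk = c("yes", "no"))
)
print(tab)

# H1: yogurt buyers are more likely to buy whole milk than non-buyers
ft <- fisher.test(tab, alternative = "greater")
print(ft$p.value)
```

A p-value below the chosen significance level indicates that the co-occurrence of the two items is unlikely to be due to chance alone.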

# cex is not among the control parameters accepted by the graph method, so it is omitted
plot(rules_Whole_milk, method="graph")

As the chart above shows, the items most often sold with whole milk are domestic eggs, root vegetables, and yogurt.

6.Dissimilarity measures

6.1 Jaccard Index

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic for comparing the similarity of finite sample sets. It is defined as the ratio of the size of the intersection of two sets to the size of their union, and is used to quantify the degree of similarity between the two sets. The Jaccard coefficient (similarity) is calculated using the following formula:

\[ J(X,Y)=\frac{\lvert X\cap Y \rvert}{\lvert X\cup Y \rvert}\]

The Jaccard distance (dissimilarity) is one minus the Jaccard coefficient:

\[ d_j(X,Y)=1-J(X,Y)=\frac{\lvert X\cup Y \rvert-\lvert X\cap Y \rvert}{\lvert X\cup Y \rvert}\]

The Jaccard coefficient lies in [0,1]. A value of 0 means that the two sets share no elements, and a value of 1 means that they are identical; the larger the coefficient, the more similar the two sets. For items, each set consists of the transactions that contain the item, so a high coefficient means that two products tend to appear in the same baskets.
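A small worked example may help. The sketch below computes the Jaccard coefficient and distance for two items in base R, where each item is represented by the IDs of the transactions that contain it (the IDs are made up for illustration).

```r
# Hypothetical transaction IDs in which each item appears
tx_milk <- c(1, 2, 3, 5)
tx_yogurt <- c(2, 3, 4)

# |X intersect Y| / |X union Y|
jaccard_coef <- length(intersect(tx_milk, tx_yogurt)) /
  length(union(tx_milk, tx_yogurt))
jaccard_dist <- 1 - jaccard_coef

print(c(coefficient = jaccard_coef, distance = jaccard_dist))
```

The two items share transactions 2 and 3 out of the five transactions that contain either item, so the coefficient is 2/5 = 0.4 and the distance is 0.6.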

trans.sel<-trans1[,itemFrequency(trans1)>0.1]
jac<-dissimilarity(trans.sel, which="items") 
round(jac,digits=3)

##            bottled water rolls/buns  soda whole milk
## rolls/buns         0.920                            
## soda               0.886      0.888                 
## whole milk         0.903      0.863 0.912           
## yogurt             0.911      0.893 0.913      0.861

plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")

The results show that ‘bottled water’ and ‘rolls/buns’ have the largest Jaccard distance (0.920): only about 8% of the transactions that contain either item contain both, so this pair is the most dissimilar.

According to the dendrogram, the most similar pairs of goods are whole milk and yogurt, and bottled water and soda.

6.2 Affinity measure

Affinity is a measure of the similarity of two items and can be represented as:

\[ A(i,j)=\frac{supp(i,j)}{supp(i)+supp(j)-supp(i,j)}\]

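The larger the value, the more similar the two products are, and the more customers tend to buy them together. This expression is the Jaccard coefficient of the two items, so each affinity value equals one minus the corresponding Jaccard distance from Section 6.1.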
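The formula can be checked against numbers already reported in this paper. The sketch below, in base R, plugs the supports of whole milk, yogurt, and the pair {whole milk, yogurt} into the affinity formula; the helper name affinity_from_support is an illustrative assumption and not an arules function.

```r
# Affinity computed directly from the three supports
affinity_from_support <- function(supp_i, supp_j, supp_ij) {
  supp_ij / (supp_i + supp_j - supp_ij)
}

# Supports copied from the outputs above
supp_whole_milk <- 0.2240292486
supp_yogurt <- 0.1191376702
supp_pair <- 0.04185577  # support of {yogurt} => {whole milk}

aff <- affinity_from_support(supp_whole_milk, supp_yogurt, supp_pair)
print(round(aff, 3))      # affinity of whole milk and yogurt
print(round(1 - aff, 3))  # the corresponding Jaccard distance
```

Rounded to three digits, the two values agree with the whole milk and yogurt entries of the affinity matrix and of the Jaccard distance matrix.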

a = affinity(trans.sel)
round(a, digits=3)

## An object of class "ar_similarity"
##               bottled water rolls/buns  soda whole milk yogurt
## bottled water         0.000      0.080 0.114      0.097  0.089
## rolls/buns            0.080      0.000 0.112      0.137  0.107
## soda                  0.114      0.112 0.000      0.088  0.087
## whole milk            0.097      0.137 0.088      0.000  0.139
## yogurt                0.089      0.107 0.087      0.139  0.000
## Slot "method":
## [1] "Affinity"

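par(mar=c(4,8,4,4))  # widen the left margin for the item labels
image(a, axes=FALSE)
# label both axes with the item names
axis(1, at=seq(0, 1, length.out=ncol(a)), labels=rownames(a), cex.axis=0.5)
axis(2, at=seq(0, 1, length.out=ncol(a)), labels=rownames(a), cex.axis=0.6, las=2)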

In the graph above, the darker the color of a cell, the more likely it is that the two items are purchased together. The pairs that customers most often buy together are whole milk and yogurt, whole milk and rolls/buns, soda and bottled water, and soda and rolls/buns.

7.Conclusions

Based on the above analysis, the product most frequently purchased by the customers of this store is whole milk. The products most frequently purchased along with whole milk are yogurt and rolls/buns. Other frequently purchased combinations include soda and bottled water, and soda and rolls/buns.

The owner of the store can adjust the product placement based on the results of the above analysis.

The quantity of whole milk in stock can be increased, and whole milk can be placed in a conspicuous position that is visible on the way in as well as on the way out, which makes it convenient for customers to buy;
Yogurt, rolls/buns, and whole milk can be placed together to make them convenient for customers to buy;
Soda, bottled water, and rolls/buns can be placed together for customers’ convenience.