Market Basket Analysis

In this study we mine association rules and explore frequent itemsets in a supermarket transaction data set using the arules package.

Generating Rules

We will use the Apriori algorithm, which counts item combinations to generate rules. Two important parameters must be defined, each acting as a minimum threshold.
Support is the proportion of transactions that contain a given itemset.
Confidence is the conditional probability that a transaction containing item A also contains item B.
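
To make the two definitions concrete, here is a minimal base R sketch. The baskets are hypothetical toy data (not the supermarket data set), and the helper names `support` and `confidence` are illustrative only; they are not functions from arules.

```r
# Hypothetical toy baskets, not the supermarket data used in this study
baskets <- list(
  c("Wine", "Fresh Vegetables", "Sauces"),
  c("Wine", "Fresh Vegetables"),
  c("Wine", "Candles"),
  c("Fresh Fruit", "Fresh Vegetables"),
  c("Juice", "Sliced Bread")
)

# Support: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_wine     <- support("Wine", baskets)                         # 3 of 5 baskets
supp_wine_veg <- support(c("Wine", "Fresh Vegetables"), baskets)  # 2 of 5 baskets
conf_wine_veg <- confidence("Wine", "Fresh Vegetables", baskets)  # 2 of the 3 Wine baskets
```

In this toy data the rule {Wine} => {Fresh Vegetables} has support 0.4 and confidence of about 0.67, so it would pass thresholds such as support = 0.01 and confidence = 0.5.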

# Run apriori algorithm and generate rules
rules <- apriori(GTransactions, parameter = list(support=0.01, confidence=0.5))

# summary of rules
summary(rules)

## set of 7501 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5 
##   32 2622 4447  400 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.695   4.000   5.000 
## 
## summary of quality measures:
##     support          confidence          lift       
##  Min.   :0.01001   Min.   :0.5002   Min.   : 1.622  
##  1st Qu.:0.01052   1st Qu.:0.6673   1st Qu.: 6.373  
##  Median :0.01119   Median :0.7648   Median : 8.007  
##  Mean   :0.01165   Mean   :0.7335   Mean   : 8.877  
##  3rd Qu.:0.01211   3rd Qu.:0.8048   3rd Qu.:11.468  
##  Max.   :0.03301   Max.   :0.9742   Max.   :28.188  
## 
## mining info:
##           data ntransactions support confidence
##  GTransactions         64808    0.01        0.5

Creating SubRules

If we are interested in rules that contain certain items, we can create a subset of the rules with the subset function.
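
As a rough illustration of how such subsets can be selected, the sketch below mines a few hypothetical toy baskets (not the supermarket data) and filters the rules with three of the arules matching operators: `%in%` matches exact item labels, `%pin%` matches labels partially, and `%ain%` requires all listed items. The object names (`toy`, `toy_rules`, `exact`, `partial`, `both`) and the thresholds are assumptions chosen for this example.

```r
library(arules)

# Hypothetical toy baskets, not the supermarket data used in this study
baskets <- list(
  c("Wine", "Fresh Vegetables", "Sauces"),
  c("Wine", "Fresh Vegetables"),
  c("Wine Glasses", "Candles"),
  c("Fresh Fruit", "Fresh Vegetables"),
  c("Wine", "Sauces", "Fresh Vegetables")
)
toy <- as(baskets, "transactions")

# Mine rules with at least two items; thresholds are arbitrary for the toy data
toy_rules <- apriori(toy,
                     parameter = list(support = 0.2, confidence = 0.5, minlen = 2),
                     control = list(verbose = FALSE))

# Rules whose left-hand side contains the exact item "Wine"
exact <- subset(toy_rules, subset = lhs %in% "Wine")

# Rules whose left-hand side contains any item whose label includes "Wine"
# (this also picks up "Wine Glasses")
partial <- subset(toy_rules, subset = lhs %pin% "Wine")

# Rules whose left-hand side contains both "Wine" and "Sauces"
both <- subset(toy_rules, subset = lhs %ain% c("Wine", "Sauces"))

inspect(exact)
```

Because `%pin%` matches partial labels, it can return more rules than `%in%` when several item names share a word, which is worth keeping in mind when reading the subsets.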

Rules for Wine Purchase

For example, if we want to know what else a customer is likely to purchase when there is wine in the basket, we can inspect the rules that include wine on the left-hand side.

The rules show that customers who purchase wine together with another item also tend to buy fresh vegetables.

Support and confidence alone are not good indicators of correlation. Lift is an important measure for understanding whether events are correlated. A lift value larger than 1 indicates a positive correlation between the events, a lift value equal to 1 indicates that the events are independent, and a lift value smaller than 1 indicates a negative correlation.
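
The interpretation of lift can be checked with a small base R sketch. The incidence matrix and the item names A, B and C are hypothetical, and the helper `lift` is illustrative only; it computes support(lhs and rhs) / (support(lhs) * support(rhs)).

```r
# Hypothetical incidence matrix: rows are baskets, columns are items
m <- matrix(c(
  TRUE,  TRUE,  TRUE,
  TRUE,  FALSE, TRUE,
  FALSE, TRUE,  FALSE,
  FALSE, FALSE, FALSE
), ncol = 3, byrow = TRUE, dimnames = list(NULL, c("A", "B", "C")))

# Lift of lhs => rhs: joint support divided by the product of the item supports
lift <- function(lhs, rhs, m) {
  supp_both <- mean(m[, lhs] & m[, rhs])
  supp_both / (mean(m[, lhs]) * mean(m[, rhs]))
}

lift_AB <- lift("A", "B", m)  # 0.25 / (0.5 * 0.5): A and B are independent
lift_AC <- lift("A", "C", m)  # 0.50 / (0.5 * 0.5): A and C occur together
```

Here `lift_AB` equals 1 (independence) while `lift_AC` equals 2, meaning C is twice as likely in baskets that contain A as in baskets overall.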

# rules with Wine on the left-hand side
WineRulesL <- subset(rules, subset = lhs %pin% "Wine")
inspect(WineRulesL)

##     lhs                     rhs                support    confidence
## [1] {Candles,Wine}       => {Fresh Vegetables} 0.01029194 0.8707572 
## [2] {Fresh Chicken,Wine} => {Fresh Vegetables} 0.01023022 0.8851802 
## [3] {Sauces,Wine}        => {Fresh Vegetables} 0.01492100 0.9088346 
## [4] {Cooking Oil,Wine}   => {Fresh Vegetables} 0.01272991 0.7405745 
## [5] {Rice,Wine}          => {Fresh Vegetables} 0.01030737 0.8077388 
## [6] {Juice,Wine}         => {Fresh Vegetables} 0.01024565 0.7272727 
## [7] {Fresh Fruit,Wine}   => {Fresh Vegetables} 0.01530675 0.5814771 
##     lift    
## [1] 2.821460
## [2] 2.868195
## [3] 2.944840
## [4] 2.399638
## [5] 2.617266
## [6] 2.356537
## [7] 1.884124

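# rules with Wine on the right-hand side, drawn as a graph shaded by lift
WineRulesR <- subset(rules, subset = rhs %pin% "Wine")
library(arulesViz)  # provides plot() for rules, if not already loaded
plot(WineRulesR, method = "graph", shading = "lift")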

Rules for Fresh Food

If there is a fresh food item in the basket, the probability of purchasing deodorizers and sour cream increases more than 15 times.

FreshRulesL <- subset(rules, subset = lhs %pin% "Fresh" & lift > 15 & confidence > 0.8)
plot(FreshRulesL, method = "grouped", control = list(k = 50))

Itemsets such as {Cheese, Milk, Pancake Mix, Sliced Bread} and {Cheese, Juice, Milk, Pancake Mix} increase the probability of purchasing Canned Vegetables roughly eightfold (lift of about 8.15).

CannedRulesR <- subset(rules, subset = rhs %pin% "Canned" & lift > 5 & confidence > 0.8)
inspect(sort(CannedRulesR, by = "lift")[1:5])

##     lhs               rhs                    support confidence     lift
## [1] {Cheese,                                                            
##      Milk,                                                              
##      Pancake Mix,                                                       
##      Sliced Bread} => {Canned Vegetables} 0.01064683  0.8508015 8.154206
## [2] {Cheese,                                                            
##      Juice,                                                             
##      Milk,                                                              
##      Pancake Mix}  => {Canned Vegetables} 0.01055425  0.8507463 8.153677
## [3] {Juice,                                                             
##      Milk,                                                              
##      Pancake Mix}  => {Canned Vegetables} 0.01242131  0.8491561 8.138437
## [4] {Cheese,                                                            
##      Milk,                                                              
##      Pancake Mix}  => {Canned Vegetables} 0.01232872  0.8490967 8.137867
## [5] {Juice,                                                             
##      Milk,                                                              
##      Pancake Mix,                                                       
##      Sliced Bread} => {Canned Vegetables} 0.01063140  0.8485222 8.132361

Small vs Large Transactions

If we define small transactions as those with six or fewer items and large transactions as those with more than six, we obtain the rules below.

# Split the transactions by basket size: more than six items vs. six or fewer
LTransactions <- subset(GTransactions, subset = size(GTransactions) > 6)
STransactions <- subset(GTransactions, subset = size(GTransactions) <= 6)
# Mine each group with its own thresholds
Lrules <- apriori(LTransactions, parameter = list(support = 0.05, confidence = 0.8))
Srules <- apriori(STransactions, parameter = list(support = 0.01, confidence = 0.25))
library("RColorBrewer")

plot(Lrules, control = list(col = brewer.pal(5, "Spectral")), main = "")

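In the rules mined from transactions with more than six items, the strongest patterns involve Jam and Cottage Cheese: baskets that contain Jam or Jelly together with Waffles are about five times more likely to include Cottage Cheese.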

inspect(sort(Lrules, by='lift')[1:5])

##     lhs                             rhs              support    confidence
## [1] {Jam,Sliced Bread,Waffles}   => {Cottage Cheese} 0.05549930 0.8127896 
## [2] {Bagels,Milk}                => {Muffins}        0.05701810 0.8342593 
## [3] {Cereal,Jam,Waffles}         => {Cottage Cheese} 0.05473991 0.8061510 
## [4] {Jelly,Sliced Bread,Waffles} => {Cottage Cheese} 0.05353753 0.8041825 
## [5] {Jam,Jelly,Waffles}          => {Cottage Cheese} 0.05341096 0.8038095 
##     lift    
## [1] 5.276788
## [2] 5.235490
## [3] 5.233688
## [4] 5.220909
## [5] 5.218487

Other Patterns Between Two Categories

There is a strong positive association between shower soap and cleaners. The presence of shower soap makes cleaners about 12.7 times more likely to appear in the same transaction (lift of 12.67). Out of 100 customers who bought shower soap, about 68 also bought cleaners (confidence of 0.68).

ProductRules <- subset(rules, size(rules) < 3) 
inspect(sort(ProductRules, by='lift')[1:15])

##      lhs                  rhs              support    confidence lift     
## [1]  {Shower Soap}     => {Cleaners}       0.01004506 0.6760125  12.673131
## [2]  {Deodorizers}     => {Pancake Mix}    0.02084619 0.5029784   9.329429
## [3]  {Deodorizers}     => {Frozen Chicken} 0.02163313 0.5219657   7.762174
## [4]  {Dishwasher Soap} => {Sliced Bread}   0.01013764 0.6759259   6.454311
## [5]  {Shower Soap}     => {Sliced Bread}   0.01002963 0.6749740   6.445221
## [6]  {Dishwasher Soap} => {Juice}          0.01032280 0.6882716   6.301103
## [7]  {Shower Soap}     => {Juice}          0.01006049 0.6770509   6.198377
## [8]  {Deodorizers}     => {Cereal}         0.02132453 0.5145197   5.745175
## [9]  {Pancake Mix}     => {Juice}          0.03300518 0.6121923   5.604600
## [10] {Bagels}          => {Sliced Bread}   0.02711085 0.5738080   5.479202
## [11] {Bagels}          => {Juice}          0.02738859 0.5796865   5.307010
## [12] {Deodorizers}     => {Sliced Bread}   0.02280583 0.5502606   5.254352
## [13] {Sauces}          => {Wine}           0.01641773 0.5256917   5.130878
## [14] {Deodorizers}     => {Juice}          0.02258980 0.5450484   4.989899
## [15] {Pancake Mix}     => {Sliced Bread}   0.02789779 0.5174585   4.941130
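
Since lift is confidence divided by the support of the right-hand side, the table above also lets us back out how common Cleaners are overall. The short sketch below does this arithmetic for row [1]; the variable names are illustrative, and the result is only as precise as the printed values.

```r
# Values copied from row [1] of the table above: {Shower Soap} => {Cleaners}
confidence <- 0.6760125
lift       <- 12.673131

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
baseline <- confidence / lift

# Share of all baskets that contain Cleaners, rounded for reading
round(baseline, 3)
```

So roughly 5% of all baskets contain Cleaners, compared with about 68% of the baskets that contain Shower Soap, which is what a lift of about 12.7 expresses.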

Ezgi

3/20/2017

Getting Data