Market Basket Analysis

INTRODUCTION
DATA
ANALYSIS
CONCLUSION
REFERENCES

INTRODUCTION

This analysis studies association rules, a machine learning method that aims to find “if this, then that” relations between purchased products. In other words, association rules describe relationships between sets of items within each transaction; the “if this” part is called the antecedent and the “then that” part is called the consequent. The study focuses on market basket analysis, one of the most popular applications of association rules. First, the set of transactions is extracted in order to find rules; with these rules, the occurrence of an item can be predicted from the occurrences of other items in the same transaction. Three rule evaluation metrics are used in this study: support, confidence and lift.

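Support: Support indicates how frequently the if/then relationship appears in the database, i.e. the share of transactions that contain both the antecedent and the consequent. (Reference 2)

Confidence: Confidence is the share of transactions containing the antecedent that also contain the consequent, i.e. how often the relationship has been found to be true. (Reference 2)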

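Lift: The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. Under that assumption the expected confidence is simply the relative frequency of {Y}, so the lift is the confidence divided by the frequency of {Y}. A lift above 1 means that X and Y occur together more often than independence would predict. (Reference 3)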
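
The three metrics can be illustrated with a minimal base R sketch. The baskets and the helper `support()` below are hypothetical and are not part of the grocery dataset or of the arules package; they only show how the definitions translate into arithmetic.

```r
# Toy transactions (hypothetical data, not from the grocery dataset)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Share of baskets that contain every item in `items` (hypothetical helper)
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

x <- "bread"   # antecedent ("if this")
y <- "milk"    # consequent ("then that")

supp_xy <- support(c(x, y), baskets)       # support of the rule {bread} => {milk}
conf_xy <- supp_xy / support(x, baskets)   # confidence: share of bread baskets with milk
lift_xy <- conf_xy / support(y, baskets)   # lift: confidence / frequency of {milk}

print(c(support = supp_xy, confidence = conf_xy, lift = lift_xy))
```

In this toy example the lift is below 1, so bread and milk appear together slightly less often than independence would predict.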

DATA

The data were collected from a grocery store, and each transaction lists the items purchased in a single basket at one time. The data are transactional and consist of 7501 transactions stored in 20 columns (the largest basket holds 20 items).
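
The raw file has 20 columns, while the summary later in the analysis reports 119 item columns. The base R sketch below illustrates the difference, assuming the transactions are stored internally as one logical column per distinct item. The rows are hypothetical and the code does not use arules.

```r
# Hypothetical basket-format rows: one row per transaction
rows <- list(
  c("burgers", "eggs", "meatballs"),
  c("chutney"),
  c("avocado", "turkey"),
  c("eggs", "turkey")
)

items <- sort(unique(unlist(rows)))   # one column per distinct item

# Logical incidence matrix: TRUE where the transaction contains the item
incidence <- t(vapply(rows, function(r) items %in% r, logical(length(items))))
colnames(incidence) <- items

print(dim(incidence))        # transactions x distinct items
print(max(lengths(rows)))    # widest basket = columns needed in the raw file
print(mean(incidence))       # density: share of TRUE cells
```

The raw file needs only as many columns as the widest basket, whereas the incidence matrix needs one column per distinct item and is therefore mostly empty.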

ANALYSIS

Here the market basket analysis is carried out step by step. Where an output is extremely long, only the important part is shown.

The libraries shown below are used in this study:

library(arulesViz)
library(arules)
library(dplyr)

The dataset is imported into R as follows.

mbo<-read.transactions("Market_Basket_Optimisation.csv", format="basket", sep=",", skip=0)

The details about the dataset before the analysis are shown below.

inspect(head(mbo))

##     items              
## [1] {almonds,          
##      antioxydant juice,
##      avocado,          
##      cottage cheese,   
##      energy drink,     
##      frozen smoothie,  
##      green grapes,     
##      green tea,        
##      honey,            
##      low fat yogurt,   
##      mineral water,    
##      olive oil,        
##      salad,            
##      salmon,           
##      shrimp,           
##      spinach,          
##      tomato juice,     
##      vegetables mix,   
##      whole weat flour, 
##      yams}             
## [2] {burgers,          
##      eggs,             
##      meatballs}        
## [3] {chutney}          
## [4] {avocado,          
##      turkey}           
## [5] {energy bar,       
##      green tea,        
##      milk,             
##      mineral water,    
##      whole wheat rice} 
## [6] {low fat yogurt}

summary(mbo)

## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

length(mbo)

## [1] 7501
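
The figures in the summary can be cross-checked against each other with simple arithmetic. The sketch below only re-uses numbers printed in the summary output; the variable names are chosen here for illustration.

```r
n_transactions <- 7501
n_items        <- 119

# "most frequent items" block: five named items plus (Other)
item_occurrences <- sum(c(1788, 1348, 1306, 1282, 1229, 22405))

# Basket-size distribution: each size and the number of transactions of that size
sizes  <- c(1:16, 18:20)
counts <- c(1754, 1358, 1044, 816, 667, 493, 391, 324, 259, 139,
            102, 67, 40, 22, 17, 4, 1, 2, 1)

density   <- item_occurrences / (n_transactions * n_items)  # share of filled cells
mean_size <- sum(sizes * counts) / sum(counts)              # average basket size

print(sum(counts))          # should equal the number of transactions
print(sum(sizes * counts))  # should equal item_occurrences
print(density)
print(mean_size)
```

Both ways of counting give the same total number of item occurrences, and the density and mean basket size agree with the values reported by summary.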

At this stage, all items included in the analysis can be observed together with their number of appearances. According to the output, the most frequent item is mineral water with 1788 appearances, while water spray is the least frequent item with 3 appearances.

itemFrequency(mbo, type="absolute")

##              almonds    antioxydant juice            asparagus 
##                  153                   67                   36 
##              avocado          babies food                bacon 
##                  250                   34                   65 
##       barbecue sauce            black tea          blueberries 
##                   81                  107                   69 
##           body spray              bramble             brownies 
##                   86                   14                  253 
##            bug spray         burger sauce              burgers 
##                   65                   44                  654 
##               butter                 cake           candy bars 
##                  226                  608                   73 
##              carrots          cauliflower              cereals 
##                  115                   36                  193 
##            champagne              chicken                chili 
##                  351                  450                   46 
##            chocolate      chocolate bread              chutney 
##                 1229                   32                   31 
##                cider  clothes accessories              cookies 
##                   79                   63                  603 
##          cooking oil                 corn       cottage cheese 
##                  383                   36                  239 
##                cream         dessert wine             eggplant 
##                    7                   33                   99 
##                 eggs           energy bar         energy drink 
##                 1348                  203                  200 
##             escalope extra dark chocolate            flax seed 
##                  595                   90                   68 
##         french fries          french wine          fresh bread 
##                 1282                  169                  323 
##           fresh tuna        fromage blanc      frozen smoothie 
##                  167                  102                  475 
##    frozen vegetables      gluten free bar        grated cheese 
##                  715                   52                  393 
##          green beans         green grapes            green tea 
##                   65                   68                  991 
##          ground beef                 gums                  ham 
##                  737                  101                  199 
##     hand protein bar        herb & pepper                honey 
##                   39                  371                  356 
##             hot dogs              ketchup          light cream 
##                  243                   33                  117 
##           light mayo       low fat yogurt            magazines 
##                  204                  574                   82 
##        mashed potato           mayonnaise            meatballs 
##                   31                   46                  157 
##               melons                 milk        mineral water 
##                   90                  972                 1788 
##                 mint       mint green tea              muffins 
##                  131                   42                  181 
## mushroom cream sauce              napkins          nonfat milk 
##                  143                    5                   78 
##              oatmeal                  oil            olive oil 
##                   33                  173                  494 
##             pancakes      parmesan cheese                pasta 
##                  713                  149                  118 
##               pepper             pet food              pickles 
##                  199                   49                   45 
##          protein bar             red wine                 rice 
##                  139                  211                  141 
##                salad               salmon                 salt 
##                   37                  319                   69 
##             sandwich              shallot              shampoo 
##                   34                   58                   37 
##               shrimp                 soda                 soup 
##                  536                   47                  379 
##            spaghetti      sparkling water              spinach 
##                 1306                   47                   53 
##         strawberries        strong cheese                  tea 
##                  160                   58                   29 
##         tomato juice         tomato sauce             tomatoes 
##                  228                  106                  513 
##           toothpaste               turkey       vegetables mix 
##                   61                  469                  193 
##          water spray           white wine     whole weat flour 
##                    3                  124                   70 
##    whole wheat pasta     whole wheat rice                 yams 
##                  221                  439                   86 
##          yogurt cake             zucchini 
##                  205                   71

itemFrequencyPlot(mbo, topN=10, type="absolute", main="Item Frequency")

The bar chart above shows the 10 most frequent items in the dataset; as can be clearly seen, mineral water is the most frequent item.

At this stage, the apriori algorithm is applied in order to obtain support, confidence and lift values. The rules are then sorted by each measure.

rules.mbo<-apriori(mbo, parameter=list(supp=0.1, conf=0.1))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 750 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

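As can be seen above, the apriori algorithm generates 7 rules. As the tables below show, every rule has an empty left-hand side (the parameter listing shows minlen = 1, which allows this), so each rule simply restates the frequency of a single item: its confidence equals its support and its lift is 1. The rules are sorted below in descending order by count, support, confidence and lift in turn.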

rules.by.count<- sort(rules.mbo, by="count", decreasing=TRUE) 
inspect(rules.by.count)

##     lhs    rhs             support   confidence lift count
## [1] {}  => {mineral water} 0.2383682 0.2383682  1    1788 
## [2] {}  => {eggs}          0.1797094 0.1797094  1    1348 
## [3] {}  => {spaghetti}     0.1741101 0.1741101  1    1306 
## [4] {}  => {french fries}  0.1709105 0.1709105  1    1282 
## [5] {}  => {chocolate}     0.1638448 0.1638448  1    1229 
## [6] {}  => {green tea}     0.1321157 0.1321157  1     991 
## [7] {}  => {milk}          0.1295827 0.1295827  1     972

rules.by.supp<-sort(rules.mbo, by = "support", decreasing=TRUE) 
inspect(rules.by.supp)

##     lhs    rhs             support   confidence lift count
## [1] {}  => {mineral water} 0.2383682 0.2383682  1    1788 
## [2] {}  => {eggs}          0.1797094 0.1797094  1    1348 
## [3] {}  => {spaghetti}     0.1741101 0.1741101  1    1306 
## [4] {}  => {french fries}  0.1709105 0.1709105  1    1282 
## [5] {}  => {chocolate}     0.1638448 0.1638448  1    1229 
## [6] {}  => {green tea}     0.1321157 0.1321157  1     991 
## [7] {}  => {milk}          0.1295827 0.1295827  1     972

rules.by.conf <- sort(rules.mbo, by = "confidence", decreasing=TRUE) 
inspect(rules.by.conf)

##     lhs    rhs             support   confidence lift count
## [1] {}  => {mineral water} 0.2383682 0.2383682  1    1788 
## [2] {}  => {eggs}          0.1797094 0.1797094  1    1348 
## [3] {}  => {spaghetti}     0.1741101 0.1741101  1    1306 
## [4] {}  => {french fries}  0.1709105 0.1709105  1    1282 
## [5] {}  => {chocolate}     0.1638448 0.1638448  1    1229 
## [6] {}  => {green tea}     0.1321157 0.1321157  1     991 
## [7] {}  => {milk}          0.1295827 0.1295827  1     972

rules.by.lift<-sort(rules.mbo, by = "lift", decreasing=TRUE) 
inspect(rules.by.lift)

##     lhs    rhs             support   confidence lift count
## [1] {}  => {green tea}     0.1321157 0.1321157  1     991 
## [2] {}  => {french fries}  0.1709105 0.1709105  1    1282 
## [3] {}  => {chocolate}     0.1638448 0.1638448  1    1229 
## [4] {}  => {eggs}          0.1797094 0.1797094  1    1348 
## [5] {}  => {spaghetti}     0.1741101 0.1741101  1    1306 
## [6] {}  => {mineral water} 0.2383682 0.2383682  1    1788 
## [7] {}  => {milk}          0.1295827 0.1295827  1     972
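
The number of rules and their measures can be reproduced by hand from the item counts. The sketch below is plain arithmetic on values transcribed from the outputs above, not a call to arules; the variable names are chosen here for illustration.

```r
n <- 7501                 # transactions (from summary(mbo))
min_count <- 0.1 * n      # supp = 0.1 expressed as a number of transactions

# Absolute counts transcribed from the itemFrequency() output;
# ground beef is included as the most frequent item that misses the cut
counts <- c("mineral water" = 1788, eggs = 1348, spaghetti = 1306,
            "french fries" = 1282, chocolate = 1229, "green tea" = 991,
            milk = 972, "ground beef" = 737)

frequent <- counts[counts >= min_count]
support  <- frequent / n

# With an empty antecedent, confidence equals the support of the consequent,
# so lift = confidence / support is exactly 1 for every rule
confidence <- support
lift       <- confidence / support

print(round(support, 7))
print(lift)
```

Only seven items reach the threshold, which matches the seven rules. Since every lift is 1, the ordering of the table sorted by lift carries no information.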

Mineral water ranks first in every table except the one sorted by lift, where all seven rules tie at a lift of 1 and the order carries no information. Next, the rules are examined to see which types of transactions lead to mineral water.

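# Items with relative frequency above 5% (mbo.sel is not used in the calls below)
mbo.sel<-mbo[,itemFrequency(mbo)>0.05] 
# Rules with mineral water as the consequent, mined from the full dataset
rules.mw<-apriori(data=mbo, parameter=list(supp=0.001,conf = 0.08), 
                            appearance=list(default="lhs", rhs="mineral water"), control=list(verbose=FALSE)) 
rules.mw.byconf<-sort(rules.mw, by="confidence", decreasing=TRUE)
inspect(head(rules.mw.byconf))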

##     lhs                    rhs                 support confidence     lift count
## [1] {ground beef,                                                               
##      light cream,                                                               
##      olive oil}         => {mineral water} 0.001199840  1.0000000 4.195190     9
## [2] {cake,                                                                      
##      olive oil,                                                                 
##      shrimp}            => {mineral water} 0.001199840  1.0000000 4.195190     9
## [3] {red wine,                                                                  
##      soup}              => {mineral water} 0.001866418  0.9333333 3.915511    14
## [4] {ground beef,                                                               
##      pancakes,                                                                  
##      whole wheat rice}  => {mineral water} 0.001333156  0.9090909 3.813809    10
## [5] {frozen vegetables,                                                         
##      milk,                                                                      
##      spaghetti,                                                                 
##      turkey}            => {mineral water} 0.001199840  0.9000000 3.775671     9
## [6] {chocolate,                                                                 
##      frozen vegetables,                                                         
##      olive oil,                                                                 
##      shrimp}            => {mineral water} 0.001199840  0.9000000 3.775671     9
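
The support and lift columns of this table follow directly from the count and confidence columns. The sketch below recomputes them for the first four rules, using only values printed above; it is a plain arithmetic check, not an arules call.

```r
n       <- 7501
supp_mw <- 1788 / n   # support of {mineral water}, from the item counts

# Count and confidence columns copied from the first four rules above
rules <- data.frame(
  count      = c(9, 9, 14, 10),
  confidence = c(1.0000000, 1.0000000, 0.9333333, 0.9090909)
)

rules$support <- rules$count / n              # share of all baskets
rules$lift    <- rules$confidence / supp_mw   # confidence / frequency of consequent

print(rules)
```

The top rules have a confidence of 1 but rest on only 9 of the 7501 baskets, so they should be read with care.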

Now the opposite situation is analyzed: which items a customer buys given that mineral water is already in the basket.

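# Rules with mineral water as the antecedent
rules.mw<-apriori(data=mbo, parameter=list(supp=0.001,conf = 0.08), 
                            appearance=list(default="rhs",lhs="mineral water"), control=list(verbose=FALSE)) 
# Sorted by ascending support, so head() shows the lowest-support rules
rules.mw.bysupp<-sort(rules.mw, by="support", decreasing=FALSE)
inspect(head(rules.mw.bysupp))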

##     lhs                rhs                support    confidence lift     count
## [1] {mineral water} => {turkey}           0.01919744 0.08053691 1.288075 144  
## [2] {mineral water} => {cooking oil}      0.02013065 0.08445190 1.653978 151  
## [3] {mineral water} => {whole wheat rice} 0.02013065 0.08445190 1.442993 151  
## [4] {mineral water} => {frozen smoothie}  0.02026396 0.08501119 1.342461 152  
## [5] {mineral water} => {chicken}          0.02279696 0.09563758 1.594172 171  
## [6] {mineral water} => {soup}             0.02306359 0.09675615 1.914955 173

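A customer who buys mineral water is more likely than average to also have turkey, cooking oil, whole wheat rice, frozen smoothie, chicken or soup in the basket, since every rule above has a lift greater than 1. Because the rules are sorted by ascending support, these six are the lowest-support rules that pass the thresholds, and their confidence is only about 8 to 10 percent.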
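
To see which of these items is most strongly associated with mineral water, the six rules can be ranked by lift. The sketch below copies the values from the table above into a data frame; the `baseline` column (confidence divided by lift) is added here for illustration and is the item's overall frequency.

```r
# Values copied from the inspect() output above
rules <- data.frame(
  rhs        = c("turkey", "cooking oil", "whole wheat rice",
                 "frozen smoothie", "chicken", "soup"),
  confidence = c(0.08053691, 0.08445190, 0.08445190,
                 0.08501119, 0.09563758, 0.09675615),
  lift       = c(1.288075, 1.653978, 1.442993,
                 1.342461, 1.594172, 1.914955),
  stringsAsFactors = FALSE
)

# Overall frequency of each item, implied by lift = confidence / frequency
rules$baseline <- rules$confidence / rules$lift

by_lift <- rules[order(rules$lift, decreasing = TRUE), ]
print(by_lift, row.names = FALSE)
```

Soup has the highest lift: it is in about 5 percent of all baskets but in almost 10 percent of the baskets that contain mineral water.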

Here plots are generated for the rules of the market basket optimization dataset.

plot(rules.mbo) #graph belonging to rules

plot(rules.mbo, measure=c("support","lift"), shading="confidence")

plot(rules.mbo, shading="order", control=list(main="Two-key plot"))

plot(rules.mbo, method="graph")

plot(rules.mbo, method="graph", control=list(type="items"))

The last call passes type="items" in control; this version of arulesViz apparently does not recognize that parameter and lists the available control parameters instead:

## Available control parameters (with default values):
## main  =  Graph for 7 rules
## nodeColors    =  c("#66CC6680", "#9999CC80")
## nodeCol   =  c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF",  "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF",  "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## edgeCol   =  c("#474747FF", "#494949FF", "#4B4B4BFF", "#4D4D4DFF", "#4F4F4FFF", "#515151FF", "#535353FF", "#555555FF", "#575757FF", "#595959FF", "#5B5B5BFF", "#5E5E5EFF", "#606060FF", "#626262FF", "#646464FF", "#666666FF", "#686868FF", "#6A6A6AFF", "#6C6C6CFF", "#6E6E6EFF", "#707070FF", "#727272FF", "#747474FF", "#767676FF", "#787878FF", "#7A7A7AFF", "#7C7C7CFF", "#7E7E7EFF", "#808080FF", "#828282FF", "#848484FF", "#868686FF", "#888888FF", "#8A8A8AFF", "#8C8C8CFF", "#8D8D8DFF", "#8F8F8FFF", "#919191FF", "#939393FF",  "#959595FF", "#979797FF", "#999999FF", "#9A9A9AFF", "#9C9C9CFF", "#9E9E9EFF", "#A0A0A0FF", "#A2A2A2FF", "#A3A3A3FF", "#A5A5A5FF", "#A7A7A7FF", "#A9A9A9FF", "#AAAAAAFF", "#ACACACFF", "#AEAEAEFF", "#AFAFAFFF", "#B1B1B1FF", "#B3B3B3FF", "#B4B4B4FF", "#B6B6B6FF", "#B7B7B7FF", "#B9B9B9FF", "#BBBBBBFF", "#BCBCBCFF", "#BEBEBEFF", "#BFBFBFFF", "#C1C1C1FF", "#C2C2C2FF", "#C3C3C4FF", "#C5C5C5FF", "#C6C6C6FF", "#C8C8C8FF", "#C9C9C9FF", "#CACACAFF", "#CCCCCCFF", "#CDCDCDFF", "#CECECEFF", "#CFCFCFFF", "#D1D1D1FF",  "#D2D2D2FF", "#D3D3D3FF", "#D4D4D4FF", "#D5D5D5FF", "#D6D6D6FF", "#D7D7D7FF", "#D8D8D8FF", "#D9D9D9FF", "#DADADAFF", "#DBDBDBFF", "#DCDCDCFF", "#DDDDDDFF", "#DEDEDEFF", "#DEDEDEFF", "#DFDFDFFF", "#E0E0E0FF", "#E0E0E0FF", "#E1E1E1FF", "#E1E1E1FF", "#E2E2E2FF", "#E2E2E2FF", "#E2E2E2FF")
## alpha     =  0.5
## cex   =  1
## itemLabels    =  TRUE
## labelCol  =  #000000B3
## measureLabels     =  FALSE
## precision     =  3
## layout    =  NULL
## layoutParams  =  list()
## arrowSize     =  0.5
## engine    =  igraph
## plot  =  TRUE
## plot_options  =  list()
## max   =  100
## verbose   =  FALSE

Next, closed frequent itemsets are mined with the apriori algorithm.

mbo.closed<-apriori(mbo, parameter=list(target="closed frequent itemsets",support=0.15))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.15      1
##  maxlen                   target   ext
##      10 closed frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1125 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## filtering closed item sets ... done [0.00s].
## writing ... [5 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

mbo.closed

## set of 5 itemsets

inspect(mbo.closed)

##     items           support   count
## [1] {french fries}  0.1709105 1282 
## [2] {chocolate}     0.1638448 1229 
## [3] {eggs}          0.1797094 1348 
## [4] {spaghetti}     0.1741101 1306 
## [5] {mineral water} 0.2383682 1788

class(mbo.closed)

## [1] "itemsets"
## attr(,"package")
## [1] "arules"
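
An itemset is closed when no proper superset has the same support. The base R sketch below applies that definition to a small hypothetical example; the baskets, the candidate itemsets and the helper `count_of()` are assumptions made for illustration and do not come from arules.

```r
# Hypothetical baskets
baskets <- list(
  c("a", "b"),
  c("a", "b"),
  c("a", "b", "c"),
  c("c")
)

# Number of baskets containing every item of an itemset (hypothetical helper)
count_of <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

itemsets <- list("a", "b", "c", c("a", "b"), c("a", "c"), c("b", "c"),
                 c("a", "b", "c"))
counts <- vapply(itemsets, count_of, integer(1))

# Closed: no proper superset among the candidates has the same count
is_closed <- vapply(seq_along(itemsets), function(i) {
  is_superset <- vapply(itemsets, function(s) {
    length(s) > length(itemsets[[i]]) && all(itemsets[[i]] %in% s)
  }, logical(1))
  !any(counts[is_superset] == counts[i])
}, logical(1))

names(is_closed) <- vapply(itemsets, paste, character(1), collapse = ",")
print(is_closed)
```

Here {a} is not closed because {a, b} occurs in exactly the same baskets. In the grocery data all five frequent itemsets are single items with no frequent superset, so each of them is closed.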

The significance of the rules is checked with is.significant; none of the seven rules is significant, which is expected for rules whose lift is 1.

is.significant(rules.mbo, mbo)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
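
The rules tested above all have an empty antecedent, so there is no association to detect. For a rule with a real antecedent, such as {mineral water} => {soup} from the earlier output, a significance test can be sketched with base R. The code below runs a one-sided Fisher's exact test on the 2x2 contingency table built from counts printed above. It is assumed here to resemble the default test of is.significant, but it is not the arules implementation.

```r
n      <- 7501   # transactions
n_mw   <- 1788   # baskets with mineral water (item frequency table)
n_soup <- 379    # baskets with soup (item frequency table)
n_both <- 173    # baskets with both (count of {mineral water} => {soup})

# Rows: mineral water yes/no; columns: soup yes/no
contingency <- matrix(
  c(n_both,          n_mw - n_both,
    n_soup - n_both, n - n_mw - n_soup + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(mineral_water = c("yes", "no"), soup = c("yes", "no"))
)

# One-sided test: do the two items co-occur more often than under independence?
ft <- fisher.test(contingency, alternative = "greater")
p_value <- ft$p.value

print(contingency)
print(p_value)
```

A small p-value indicates that the co-occurrence is unlikely to be due to chance alone.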

is.maximal checks whether the itemset of each rule is maximal, that is, not contained in the itemset of any other rule.

is.maximal(rules.mbo)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

is.redundant(rules.mbo) #finding redundant rules

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

inspect(rules.mbo[!is.redundant(rules.mbo)])

##     lhs    rhs             support   confidence lift count
## [1] {}  => {green tea}     0.1321157 0.1321157  1     991 
## [2] {}  => {milk}          0.1295827 0.1295827  1     972 
## [3] {}  => {french fries}  0.1709105 0.1709105  1    1282 
## [4] {}  => {chocolate}     0.1638448 0.1638448  1    1229 
## [5] {}  => {eggs}          0.1797094 0.1797094  1    1348 
## [6] {}  => {spaghetti}     0.1741101 0.1741101  1    1306 
## [7] {}  => {mineral water} 0.2383682 0.2383682  1    1788

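At this stage, the superset and subset relations between the rules are shown below. Both matrices are diagonal because each single-item itemset is a subset and a superset only of itself.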

is.superset(rules.mbo) #finds supersets

## 7 x 7 sparse Matrix of class "ngCMatrix"
##                 {green tea} {milk} {french fries} {chocolate} {eggs}
## {green tea}               |      .              .           .      .
## {milk}                    .      |              .           .      .
## {french fries}            .      .              |           .      .
## {chocolate}               .      .              .           |      .
## {eggs}                    .      .              .           .      |
## {spaghetti}               .      .              .           .      .
## {mineral water}           .      .              .           .      .
##                 {spaghetti} {mineral water}
## {green tea}               .               .
## {milk}                    .               .
## {french fries}            .               .
## {chocolate}               .               .
## {eggs}                    .               .
## {spaghetti}               |               .
## {mineral water}           .               |

is.subset(rules.mbo) # finds subsets

## 7 x 7 sparse Matrix of class "ngCMatrix"
##                 {green tea} {milk} {french fries} {chocolate} {eggs}
## {green tea}               |      .              .           .      .
## {milk}                    .      |              .           .      .
## {french fries}            .      .              |           .      .
## {chocolate}               .      .              .           |      .
## {eggs}                    .      .              .           .      |
## {spaghetti}               .      .              .           .      .
## {mineral water}           .      .              .           .      .
##                 {spaghetti} {mineral water}
## {green tea}               .               .
## {milk}                    .               .
## {french fries}            .               .
## {chocolate}               .               .
## {eggs}                    .               .
## {spaghetti}               |               .
## {mineral water}           .               |

supportingTransactions(rules.mbo, mbo)

## tidLists in sparse format with
##  7 items/itemsets (rows) and
##  7501 transactions (columns)
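
The result is a set of transaction ID lists: for each itemset, the IDs of the transactions that contain it. The idea can be sketched in base R as follows; the baskets, their IDs and the itemsets are hypothetical and the code does not use arules.

```r
# Hypothetical baskets, named by transaction ID
baskets <- list(
  "1" = c("eggs", "milk"),
  "2" = c("milk"),
  "3" = c("eggs", "green tea"),
  "4" = c("milk", "green tea")
)

itemsets <- list("eggs", "milk", c("milk", "green tea"))

# For each itemset, keep the IDs of the baskets that contain all of its items
tid_lists <- lapply(itemsets, function(s) {
  names(baskets)[vapply(baskets, function(b) all(s %in% b), logical(1))]
})
names(tid_lists) <- vapply(itemsets, paste, character(1), collapse = ", ")

print(tid_lists)
print(lengths(tid_lists))   # number of supporting transactions per itemset
```

The length of each ID list is the absolute support count of the itemset.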

CONCLUSION

In conclusion, this study applied market basket analysis, one of the most popular applications of association rules, to the “market basket optimization” dataset. The dataset contains 7501 transactions and 119 different items, and its main properties were inspected with the summary method. The “arules” and “arulesViz” packages were the main tools used in the analysis. The transactions were read in and rules were mined from them; with the chosen thresholds, 7 rules were obtained together with their support, confidence and lift. The rules were then sorted by each measure. According to the results, mineral water is the most frequent item. The analysis then examined which types of transactions lead to mineral water and, conversely, which items follow from it. The results were plotted and the rules were tested for significance, maximality and redundancy. Lastly, subsets and supersets were obtained.

Mert ÇAMUR

28 02 2020
