
Market Basket Analysis:

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each column in a row represents an item.

The dataset (comma separated file) is attached to this post.

Your assignment is to use R to mine the data for association rules. Provide information on all relevant statistics, such as support, confidence, and lift. Also provide your top 10 rules by lift with the associated metrics.
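The code in this report uses a transactions object named `grocery_data`, but the step that creates it is not shown. The sketch below illustrates how a comma-separated basket file of this shape could be read. The three receipts and the temporary file are made up for illustration and stand in for the attached file, and the `arules` step runs only if that package is installed.

```r
# Hypothetical stand-in for the attached file: one receipt per line,
# items separated by commas (contents made up for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "whole milk,yogurt,curd",
  "other vegetables,root vegetables",
  "whole milk,other vegetables,butter"
), csv_path)

# Base R view of the structure: one character vector per receipt
receipts <- strsplit(readLines(csv_path), ",", fixed = TRUE)
n_receipts <- length(receipts)
n_distinct_items <- length(unique(unlist(receipts)))

# With arules installed, the same file becomes a transactions object.
# format = "basket" means each line is one transaction.
if (requireNamespace("arules", quietly = TRUE)) {
  toy_data <- arules::read.transactions(csv_path, format = "basket", sep = ",")
  print(toy_data)
}
```

For the real analysis, the same `read.transactions()` call would presumably point at the attached file, with the result assigned to `grocery_data`.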

itemFrequencyPlot(grocery_data, topN = 20, type = "absolute", main = "Top 20 Items Frequency")

It can be seen above that whole milk and other vegetables are the most frequently purchased items.

rules <- apriori(grocery_data, parameter = list(supp = 0.01, conf = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 15 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 15 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01007   Min.   :0.5000   Min.   :0.01729   Min.   :1.984  
##  1st Qu.:0.01174   1st Qu.:0.5151   1st Qu.:0.02089   1st Qu.:2.036  
##  Median :0.01230   Median :0.5245   Median :0.02430   Median :2.203  
##  Mean   :0.01316   Mean   :0.5411   Mean   :0.02454   Mean   :2.299  
##  3rd Qu.:0.01403   3rd Qu.:0.5718   3rd Qu.:0.02598   3rd Qu.:2.432  
##  Max.   :0.02227   Max.   :0.5862   Max.   :0.04342   Max.   :3.030  
##      count      
##  Min.   : 99.0  
##  1st Qu.:115.5  
##  Median :121.0  
##  Mean   :129.4  
##  3rd Qu.:138.0  
##  Max.   :219.0  
## 
## mining info:
##          data ntransactions support confidence
##  grocery_data          9835    0.01        0.5
##                                                                     call
##  apriori(data = grocery_data, parameter = list(supp = 0.01, conf = 0.5))

In this code I am using the Apriori algorithm with a minimum support threshold of 1% and a minimum confidence threshold of 50%. The algorithm ignores itemsets that appear in fewer than 1% of transactions (an absolute minimum support count of 98 out of 9,835) and discards rules with less than 50% confidence. These thresholds prune the search considerably and leave 15 rules, each with two items on the left-hand side and one on the right, which keeps the analysis focused on frequent and reasonably reliable associations.
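To make the thresholds concrete, here is a minimal base R sketch of how support, confidence, and lift are computed for a single rule. The five baskets and the rule {yogurt} => {milk} are made up for illustration and are not taken from the Groceries data; `support()` is a helper defined here, not an `arules` function.

```r
# Toy baskets (made up for illustration)
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt", "curd"),
  c("bread", "butter"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "yogurt"
rhs <- "milk"

# Support of the rule: share of baskets containing both sides
supp_rule <- support(c(lhs, rhs), baskets)

# Confidence: of the baskets containing the left side, the share that
# also contain the right side
conf_rule <- supp_rule / support(lhs, baskets)

# Lift: confidence relative to how common the right side is overall;
# values above 1 mean the two sides co-occur more often than chance
lift_rule <- conf_rule / support(rhs, baskets)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A rule passes the filters used in this report only if its support is at least 0.01 and its confidence is at least 0.5.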

top_10_rules <- sort(rules, by = "lift")[1:10]
# The ggplot2 engine does not accept `type` as a control parameter, so the defaults are used.
plot(top_10_rules, method = "graph")

This plot provides several insights into the purchasing habits captured in the Groceries data. Several items, such as whole milk, other vegetables, root vegetables, and tropical fruit, appear clustered together, indicating that customers often purchase them in combination. The dark red coloring marks high lift, and here it indicates a strong association between root vegetables and other vegetables. As expected, both whole milk and other vegetables have large nodes, indicating their high purchase frequency.

top_10_rules_df <- as(top_10_rules, "data.frame")
print(top_10_rules_df)

##                                                     rules    support confidence
## 7    {citrus fruit,root vegetables} => {other vegetables} 0.01037112  0.5862069
## 8  {root vegetables,tropical fruit} => {other vegetables} 0.01230300  0.5845411
## 13     {rolls/buns,root vegetables} => {other vegetables} 0.01220132  0.5020921
## 11         {root vegetables,yogurt} => {other vegetables} 0.01291307  0.5000000
## 1                           {curd,yogurt} => {whole milk} 0.01006609  0.5823529
## 2               {butter,other vegetables} => {whole milk} 0.01148958  0.5736041
## 9        {root vegetables,tropical fruit} => {whole milk} 0.01199797  0.5700483
## 12               {root vegetables,yogurt} => {whole milk} 0.01453991  0.5629921
## 3        {domestic eggs,other vegetables} => {whole milk} 0.01230300  0.5525114
## 4             {whipped/sour cream,yogurt} => {whole milk} 0.01087951  0.5245098
##      coverage     lift count
## 7  0.01769192 3.029608   102
## 8  0.02104728 3.020999   121
## 13 0.02430097 2.594890   120
## 11 0.02582613 2.584078   127
## 1  0.01728521 2.279125    99
## 2  0.02003050 2.244885   113
## 9  0.02104728 2.230969   118
## 12 0.02582613 2.203354   143
## 3  0.02226741 2.162336   121
## 4  0.02074225 2.052747   107
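The columns in this table are tied together by simple identities: coverage is the support of the left-hand side, confidence is support divided by coverage, and lift is confidence divided by the support of the right-hand side. As a quick sanity check, the sketch below recomputes these from the printed values for rule 7; the variable names are chosen here for illustration.

```r
# Values copied from the row for rule 7 in the table above
rule_support    <- 0.01037112
rule_confidence <- 0.5862069
rule_lift       <- 3.029608
n_transactions  <- 9835  # from the mining info in summary(rules)

# Coverage is the support of the left-hand side
coverage_implied <- rule_support / rule_confidence

# Count is the number of transactions containing the whole rule
count_implied <- round(rule_support * n_transactions)

# Lift = confidence / support(rhs), so the table implies the overall
# support of {other vegetables}
rhs_support_implied <- rule_confidence / rule_lift

print(c(coverage = coverage_implied,
        count = count_implied,
        rhs_support = rhs_support_implied))
```

The implied coverage and count should agree with the `coverage` and `count` columns printed for rule 7, up to rounding.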

plot(top_10_rules, method = "grouped", control = list(k = 10))

plot(top_10_rules, method = "paracoord", control = list(reorder = TRUE))

The grouped plot above displays how the items purchased by customers are related to one another. Each circle represents a group of rules in which certain items tend to lead to other items being bought. The size of a circle shows support (how often the rule occurs), and the color shows lift (how strong the relationship is). This plot confirms what was seen above regarding the frequent joint purchase of other vegetables and root vegetables.

The parallel coordinates plot helps visualize the top 10 rules, showing which items are often bought together. Thicker lines mark more frequent rules, and this plot leads to the same conclusion: other vegetables and root vegetables are commonly bought together.
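The same conclusion can be checked directly against the printed top 10 table. The base R sketch below re-enters the rule labels and lift values from that table by hand (an assumption made so the snippet is self-contained; in the report itself `top_10_rules_df` already holds them) and picks out the rules that lead from root vegetables to other vegetables.

```r
# Rule labels and lift values copied from the printed top 10 table
top_rules <- data.frame(
  rules = c(
    "{citrus fruit,root vegetables} => {other vegetables}",
    "{root vegetables,tropical fruit} => {other vegetables}",
    "{rolls/buns,root vegetables} => {other vegetables}",
    "{root vegetables,yogurt} => {other vegetables}",
    "{curd,yogurt} => {whole milk}",
    "{butter,other vegetables} => {whole milk}",
    "{root vegetables,tropical fruit} => {whole milk}",
    "{root vegetables,yogurt} => {whole milk}",
    "{domestic eggs,other vegetables} => {whole milk}",
    "{whipped/sour cream,yogurt} => {whole milk}"
  ),
  lift = c(3.029608, 3.020999, 2.594890, 2.584078, 2.279125,
           2.244885, 2.230969, 2.203354, 2.162336, 2.052747),
  stringsAsFactors = FALSE
)

# Split each label into its left-hand and right-hand side
parts <- strsplit(top_rules$rules, " => ", fixed = TRUE)
lhs <- vapply(parts, `[`, character(1), 1)
rhs <- vapply(parts, `[`, character(1), 2)

# Rules with root vegetables on the left and other vegetables on the right
is_root_to_other <- grepl("root vegetables", lhs, fixed = TRUE) &
  rhs == "{other vegetables}"
root_to_other <- top_rules[is_root_to_other, ]

print(root_to_other)
```

Comparing the lift values of the selected rules with the rest of the table shows where the root vegetables and other vegetables pairing ranks among the top 10.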