Market Basket Analysis:
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket – and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.
The dataset (comma separated file) is attached to this post.
Your assignment is to use R to mine the data for association rules. Provide information on all relevant statistics like support, confidence, lift and others. Also, provide your top 10 rules by lift with the associated metrics.
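The code in this post assumes that `grocery_data` already exists as an `arules` transactions object; the loading step is not shown. A minimal sketch of how a comma-separated basket file might be read is given here. The temporary file and the object name `grocery_demo` are hypothetical stand-ins, since the actual file name of the attached dataset is not stated.

```r
library(arules)

# Hypothetical stand-in for the attached file: three made-up receipts,
# one per line, with items separated by commas.
basket_file <- tempfile(fileext = ".csv")
writeLines(c(
  "whole milk,yogurt,curd",
  "other vegetables,root vegetables",
  "whole milk,other vegetables"
), basket_file)

# format = "basket" treats each line as one transaction.
grocery_demo <- read.transactions(basket_file, format = "basket", sep = ",")

summary(grocery_demo)
```

For the real data, the path to the attached file would take the place of `basket_file`.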
itemFrequencyPlot(grocery_data, topN = 20, type = "absolute", main = "Top 20 Items Frequency")
It can be seen above that whole milk and other vegetables are the most
frequently purchased items.
rules <- apriori(grocery_data, parameter = list(supp = 0.01, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 15 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01007 Min. :0.5000 Min. :0.01729 Min. :1.984
## 1st Qu.:0.01174 1st Qu.:0.5151 1st Qu.:0.02089 1st Qu.:2.036
## Median :0.01230 Median :0.5245 Median :0.02430 Median :2.203
## Mean :0.01316 Mean :0.5411 Mean :0.02454 Mean :2.299
## 3rd Qu.:0.01403 3rd Qu.:0.5718 3rd Qu.:0.02598 3rd Qu.:2.432
## Max. :0.02227 Max. :0.5862 Max. :0.04342 Max. :3.030
## count
## Min. : 99.0
## 1st Qu.:115.5
## Median :121.0
## Mean :129.4
## 3rd Qu.:138.0
## Max. :219.0
##
## mining info:
## data ntransactions support confidence
## grocery_data 9835 0.01 0.5
## call
## apriori(data = grocery_data, parameter = list(supp = 0.01, conf = 0.5))
In this code I am using the Apriori algorithm with a minimum support threshold of 1% and a minimum confidence threshold of 50%. The algorithm ignores itemsets that appear in fewer than 1% of transactions and discards rules with less than 50% confidence. This keeps the rule set small (15 rules here) and limits the analysis to associations that are both reasonably frequent and reliable.
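To make the thresholds concrete, here is a small base R sketch of how support, confidence, and lift are defined for a single rule. The five transactions are made up for illustration and are not taken from the Groceries data; the helper name `itemset_support` is likewise hypothetical.

```r
# Toy transactions (illustrative only).
transactions <- list(
  c("whole milk", "yogurt", "curd"),
  c("whole milk", "yogurt"),
  c("yogurt", "curd", "whole milk"),
  c("other vegetables", "root vegetables"),
  c("whole milk", "other vegetables")
)

# Fraction of transactions that contain every item in `items`.
itemset_support <- function(items) {
  mean(vapply(transactions, function(tx) all(items %in% tx), logical(1)))
}

lhs <- c("curd", "yogurt")
rhs <- "whole milk"

supp_rule <- itemset_support(c(lhs, rhs))     # share of receipts with lhs and rhs
conf_rule <- supp_rule / itemset_support(lhs) # share of lhs receipts that also have rhs
lift_rule <- conf_rule / itemset_support(rhs) # confidence relative to the rhs baseline

# The rule is kept only if it clears both thresholds used above.
keep_rule <- supp_rule >= 0.01 && conf_rule >= 0.5

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A lift above 1 means the right-hand item shows up more often alongside the left-hand items than it does across all receipts.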
top_10_rules <- sort(rules, by = "lift")[1:10]
plot(top_10_rules, method = "graph", control = list(type = "items"))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
This plot provides several insights into the purchasing habits of
individuals at Market Basket. Several items, such as whole milk,
other vegetables, root vegetables, and tropical fruit, appear
clustered together, indicating that customers often purchase two or
more of them together. The dark red coloring marks high lift values
and indicates a strong association between root vegetables and other
vegetables. As expected, both whole milk and other vegetables have
large nodes, indicating their high purchase frequency.
top_10_rules_df <- as(top_10_rules, "data.frame")
print(top_10_rules_df)
## rules support confidence
## 7 {citrus fruit,root vegetables} => {other vegetables} 0.01037112 0.5862069
## 8 {root vegetables,tropical fruit} => {other vegetables} 0.01230300 0.5845411
## 13 {rolls/buns,root vegetables} => {other vegetables} 0.01220132 0.5020921
## 11 {root vegetables,yogurt} => {other vegetables} 0.01291307 0.5000000
## 1 {curd,yogurt} => {whole milk} 0.01006609 0.5823529
## 2 {butter,other vegetables} => {whole milk} 0.01148958 0.5736041
## 9 {root vegetables,tropical fruit} => {whole milk} 0.01199797 0.5700483
## 12 {root vegetables,yogurt} => {whole milk} 0.01453991 0.5629921
## 3 {domestic eggs,other vegetables} => {whole milk} 0.01230300 0.5525114
## 4 {whipped/sour cream,yogurt} => {whole milk} 0.01087951 0.5245098
## coverage lift count
## 7 0.01769192 3.029608 102
## 8 0.02104728 3.020999 121
## 13 0.02430097 2.594890 120
## 11 0.02582613 2.584078 127
## 1 0.01728521 2.279125 99
## 2 0.02003050 2.244885 113
## 9 0.02104728 2.230969 118
## 12 0.02582613 2.203354 143
## 3 0.02226741 2.162336 121
## 4 0.02074225 2.052747 107
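The columns in this table are related to one another, which gives a quick sanity check. The sketch below uses only the printed values for rule 7 and the transaction count from the apriori output; the variable names are my own, and the implied right-hand-side support is derived from the table rather than read from the data.

```r
n_transactions <- 9835   # from the apriori output

# Printed values for rule 7: {citrus fruit,root vegetables} => {other vegetables}
rule_support    <- 0.01037112
rule_confidence <- 0.5862069
rule_lift       <- 3.029608

# count: number of receipts containing both sides of the rule
rule_count <- round(rule_support * n_transactions)

# coverage: support of the left-hand side alone
rule_coverage <- rule_support / rule_confidence

# Implied support of {other vegetables}, since lift = confidence / support(rhs)
rhs_support <- rule_confidence / rule_lift

print(c(count = rule_count, coverage = rule_coverage, rhs_support = rhs_support))
```

The computed count and coverage agree with the `count` and `coverage` columns for rule 7, and the implied support suggests that other vegetables appear in roughly 19% of all receipts.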
plot(top_10_rules, method = "grouped", control = list(k = 10))
plot(top_10_rules, method = "paracoord", control = list(reorder = TRUE))
The grouped plot above displays how different items purchased by
customers are related to one another. Each circle represents a rule
where certain items tend to lead to other items being bought. The size
of the circle shows how frequently the rule occurs, and the color shows
how strong the relationship is. This plot confirms what was seen above
regarding the high frequency of purchases of other vegetables and root
vegetables together.
This parallel coordinates plot helps visualize the top 10 rules that show which items are often bought together. Thicker lines mark more frequent rules, and the plot shows the same result as before: other vegetables and root vegetables are commonly bought together.