Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
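Before mining the real data, it may help to see the three measures computed by hand. The sketch below uses a made-up set of five baskets (hypothetical data, not taken from the Groceries file) and a helper named `support()` that is introduced here only for illustration. It assumes the usual definitions: support is the share of baskets containing all items of the rule, confidence is that support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side.

```r
# Toy baskets (hypothetical data, not from the Groceries file)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk", "butter"),
  c("milk", "bread", "butter")
)

# Fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("milk", "bread")
rhs <- "butter"

supp_rule <- support(c(lhs, rhs))      # share of baskets with lhs and rhs together
conf_rule <- supp_rule / support(lhs)  # share of lhs baskets that also contain rhs
lift_rule <- conf_rule / support(rhs)  # confidence relative to the base rate of rhs

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy case the lift is below 1, so buying milk and bread together makes butter slightly less likely than its base rate. A lift above 1 would indicate a positive association.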
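library(arules)     # read.transactions(), apriori(), inspect()
library(arulesViz)  # plot() methods for rules
df <- read.transactions("GroceryDataSet.csv", format = "basket", sep = ",")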
summary(df)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Whole milk is the most popular item, followed by other vegetables and rolls/buns. There are 9835 transactions (rows) and 169 items (columns).
The top 10 items are plotted below.
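The figures in the summary output are consistent with each other, which can be checked with a little arithmetic. The sketch below uses only numbers copied from the output above; the variable names are illustrative. It assumes that density is the share of filled cells in the 9835 x 169 item matrix.

```r
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Counts from "most frequent items", including the (Other) bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

total_items  <- n_transactions * n_items * density  # filled cells in the matrix
mean_basket  <- n_items * density                   # average items per transaction
milk_support <- 2513 / n_transactions               # share of baskets with whole milk

round(total_items)       # 43367, the same as sum(item_counts)
round(mean_basket, 3)    # 4.409, the Mean of the length distribution
round(milk_support, 4)   # 0.2555
```

So roughly one basket in four contains whole milk, and the average basket holds between four and five items.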
itemFrequencyPlot(df, topN=10, type="absolute", main="Top 10 Items")
rules <- apriori(df, parameter = list(supp = 0.001, conf = 0.8, maxlen=3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## Warning in apriori(df, parameter = list(supp = 0.001, conf = 0.8, maxlen = 3)):
## Mining stopped (maxlen reached). Only patterns up to a length of 3 returned!
## done [0.01s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules <- sort(rules, by = "confidence", decreasing = TRUE)
rules
## set of 29 rules
inspect(head(rules, 10))
## lhs rhs support confidence coverage lift count
## [1] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 0.001220132 3.913649 12
## [2] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1.0000000 0.001118454 3.913649 11
## [3] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [4] {bottled water,
## rice} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [6] {grapes,
## onions} => {other vegetables} 0.001118454 0.9166667 0.001220132 4.737476 11
## [7] {hard cheese,
## oil} => {other vegetables} 0.001118454 0.9166667 0.001220132 4.737476 11
## [8] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [9] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [10] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
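The quality columns in this table can be reproduced from raw counts. As a sketch, the code below recomputes rule [3] ({house keeping products, whipped/sour cream} => {whole milk}) using only numbers printed above: its count and coverage, the number of transactions, and the whole milk frequency from the summary. The variable names are illustrative.

```r
n         <- 9835         # transactions in the data set
rhs_count <- 2513         # transactions containing whole milk
count     <- 12           # transactions containing lhs and rhs (rule [3])
coverage  <- 0.001321810  # support of the lhs alone (rule [3])

lhs_count  <- round(coverage * n)          # 13 transactions contain the lhs
support    <- count / n                    # 0.001220132
confidence <- count / lhs_count            # 0.9230769
lift       <- confidence / (rhs_count / n) # 3.612599

print(c(support = support, confidence = confidence, lift = lift))
```

A confidence of 12/13 looks impressive, but it rests on only 13 baskets. The lift of about 3.6 says these baskets contain whole milk about 3.6 times as often as baskets in general.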
summary(rules)
## set of 29 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 29
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001118 Min. : 3.131
## 1st Qu.:0.001118 1st Qu.:0.8125 1st Qu.:0.001220 1st Qu.: 3.261
## Median :0.001220 Median :0.8462 Median :0.001525 Median : 3.613
## Mean :0.001473 Mean :0.8613 Mean :0.001732 Mean : 4.000
## 3rd Qu.:0.001729 3rd Qu.:0.9091 3rd Qu.:0.002135 3rd Qu.: 4.199
## Max. :0.002542 Max. :1.0000 Max. :0.003152 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:11.00
## Median :12.00
## Mean :14.48
## 3rd Qu.:17.00
## Max. :25.00
##
## mining info:
## data ntransactions support confidence
## df 9835 0.001 0.8
## call
## apriori(data = df, parameter = list(supp = 0.001, conf = 0.8, maxlen = 3))
plot(rules, method="graph", engine = "igraph", layout = igraph::in_circle(), limit = 10)
plot(rules, method="graph", engine = "igraph", layout = igraph::in_circle(), limit = 20)
The frequency plot shows that the most frequently purchased items are, in decreasing order, whole milk, other vegetables, rolls/buns, soda, yogurt, bottled water and so on. To avoid overly long rules, apriori() is run with maxlen=3; with a minimum support of 0.001 and a minimum confidence of 0.8 it returns 29 association rules, each with two items on the left-hand side and one on the right. In the table of the top 10 rules by confidence, 7 rules have whole milk on the right-hand side, 2 have other vegetables and 1 has bottled beer. That last rule, {liquor, red/blush wine} => {bottled beer}, has by far the highest lift (11.24). In the network plot limited to 10 rules, most of the associations point to other vegetables, the #2 most purchased item. After increasing limit to 20, many associations with whole milk, the #1 most purchased item, also appear. These relationships are pictured in the network plots above, where most arrows point toward whole milk and other vegetables.
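The assignment asks for the top 10 rules by lift, whereas the listing above is sorted by confidence. A minimal sketch of the lift ranking follows. It assumes that the Groceries data bundled with the arules package can stand in for GroceryDataSet.csv (it has the same 9835 transactions and 169 items); the names rules_by_lift and top10_lift are introduced here for illustration.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")

# Same thresholds as above; verbose = FALSE only silences the progress log
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8, maxlen = 3),
                 control   = list(verbose = FALSE))

# Rank by lift instead of confidence and keep the ten strongest rules
rules_by_lift <- sort(rules, by = "lift", decreasing = TRUE)
top10_lift    <- head(rules_by_lift, 10)
inspect(top10_lift)
```

With the CSV file at hand, the same two lines (sort() and head()) can be applied directly to the rules object mined from df.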
References:
https://cran.r-project.org/web/packages/arulesViz/arulesViz.pdf
http://www.salemmarafi.com/code/market-basket-analysis-with-r/