library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(cluster)
set.seed(12345)
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
groceryfile = "https://raw.githubusercontent.com/jerryjerald27/data624/refs/heads/main/GroceryDataSet.csv"
data <- read.csv(groceryfile, header=FALSE)
head(data)
## V1 V2 V3 V4
## 1 citrus fruit semi-finished bread margarine ready soups
## 2 tropical fruit yogurt coffee
## 3 whole milk
## 4 pip fruit yogurt cream cheese meat spreads
## 5 other vegetables whole milk condensed milk long life bakery product
## 6 whole milk butter yogurt rice
## V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1
## 2
## 3
## 4
## 5
## 6 abrasive cleaner
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1
## 2
## 3
## 4
## 5
## 6
We are going to have to read them in as transactions for the best fit
tdata = read.transactions(groceryfile, sep = ',')
summary(tdata)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Here we see that when read as transactions using read.transactions there are 9835 total transactions, and the most number of unique items is 169. is there is at least 1 transactions with 169 unique items. We also see the top 5 most frequent items in the transactions Whole milk followed by other vegetables, rolls/buns, soda and yogurt
Now we can mine with apriori from the arules library that helps us with association rules. I am expecting to see the top 5 show up a lot on the Right hand side of these rules when not ordered. Confidence lets us chose the strength of the rules that show up, the higher they are the more certain that these items show up together. Support is the proportion of transactions that contain all the items in the rule. In this example we are okay with the support being around .001 which means the rules we uncover are going to have items that occur together in around 9.8 transactions out of 9835. This seems reasonable compared to setting a higher support that might hide seasonal patterns or event specific buying patterns.
rules <- apriori(tdata, parameter = list(supp = 0.001,conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## tdata 9835 0.001 0.8
## call
## apriori(data = tdata, parameter = list(supp = 0.001, conf = 0.8))
Now we inspect the rules to see the confidence lift and support. Inspecting gives us the coverage, lift and count for the rules along with the support and confidence. Count is basically the number of times the rule appears, we have set a minimum of 9.8 with our confidence at 0.001. Coverage is how often the left hand side appears across all the transactions regardless of the RHS being present. And Lift tells us the deviation between the RHS occurring randomly vs RHS occurring when the LHS has occurred. The higher the Lift the more positive the association.
inspect(rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 0.002135231 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 0.001220132 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 0.001626843 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 0.001525165 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 0.001220132 3.913649 12
inspect(sort(rules, by = "lift", decreasing = TRUE)[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {citrus fruit,
## fruit/vegetable juice,
## other vegetables,
## soda} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [3] {oil,
## other vegetables,
## tropical fruit,
## whole milk,
## yogurt} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [4] {citrus fruit,
## fruit/vegetable juice,
## grapes} => {tropical fruit} 0.001118454 0.8461538 0.001321810 8.063879 11
## [5] {other vegetables,
## rice,
## whole milk,
## yogurt} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [6] {oil,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [7] {ham,
## other vegetables,
## pip fruit,
## yogurt} => {tropical fruit} 0.001016777 0.8333333 0.001220132 7.941699 10
## [8] {beef,
## citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.001016777 0.8333333 0.001220132 7.645367 10
## [9] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [10] {butter,
## sliced cheese,
## tropical fruit,
## whole milk} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
We can explain our top rule here and its metrics. About 0.19 percent of all the transactions, or 19, contains liquor,wine, and bottled beer. The LHS alone, liquor and redwine together appear around 0.21 or 21 times as shown by the coverage. I.e., out of 21 transactions that have the LHS, 19 of them have the RHS, which puts the probability that they buy bottled beer when they have liquor and red wine, at around 90 percent (19/21 * 100), which is denoted by our confidence at 90 percent. Our Lift tells us that customers who buy liquor and wine are about 11 times more likely to also buy bottled beer, which is pretty high.This makes a lot of sense as it looks like my ideal night in with friends grocery run! The next set of citrus fruits, juice and yogurt looks like what id run out to get to cure my hangover the next day.
Clustering by size , lets say shopping cart big small,shopping basket big small, or a tiny bag. WIth a size break
sizes <- size(tdata)
clusters <- cut(sizes, breaks = 5, labels = FALSE)
table(clusters)
## clusters
## 1 2 3 4 5
## 8151 1411 235 31 7
breaks <- seq(min(sizes), max(sizes), length.out = 6)
print(breaks)
## [1] 1.0 7.2 13.4 19.6 25.8 32.0
for (i in 1:5) {
clusters_t <- tdata[clusters == i]
item_no_t <- itemFrequency(clusters_t, type = "absolute")
cat("Cluster number ", i )
print(head(sort(item_no_t, decreasing = TRUE), 5))
}
## Cluster number 1 whole milk rolls/buns soda other vegetables
## 1598 1274 1222 1065
## yogurt
## 761
## Cluster number 2 whole milk other vegetables yogurt rolls/buns
## 727 662 473 446
## soda
## 387
## Cluster number 3 whole milk other vegetables yogurt root vegetables
## 161 148 116 107
## soda
## 94
## Cluster number 4other vegetables whole milk tropical fruit pastry
## 24 21 19 16
## pip fruit
## 16
## Cluster number 5 butter root vegetables tropical fruit whole milk yogurt
## 7 6 6 6 6
Most transactions seems to fall in the small cluster with only very few in the biggest transactions cluster. As the sizes get bigger the top items seem to be going from snack materials to actual household groceries.
WE could also use something like k-means or kmedoids.We would first need to convert tdata into a binary matrix with 1/0 for each item, run the clustering and see how the items are distributed
tmatrix <- as(tdata, "matrix") * 1
dist_matrix <- daisy(tmatrix, metric = "gower")
pam_res <- pam(dist_matrix, k = 5)
clusters <- pam_res$clustering
table(clusters)
## clusters
## 1 2 3 4 5
## 3817 1175 1631 1661 1551
for (i in 1:5) {
clusters_t <- tdata[clusters == i]
item_no_t <- itemFrequency(clusters_t, type = "absolute")
cat("Cluster number ", i )
print(head(sort(item_no_t, decreasing = TRUE), 5))
}
## Cluster number 1 canned beer shopping bags bottled beer newspapers pastry
## 719 347 317 247 240
## Cluster number 2 yogurt bottled water soda rolls/buns tropical fruit
## 713 688 255 203 185
## Cluster number 3 whole milk rolls/buns yogurt root vegetables tropical fruit
## 1631 316 260 231 219
## Cluster number 4other vegetables whole milk root vegetables yogurt
## 1661 634 402 316
## tropical fruit
## 302
## Cluster number 5 soda rolls/buns sausage shopping bags pastry
## 977 921 207 168 146