The Grocery Dataset contains purchase receipts. Each row represents one receipt and lists the items purchased. Using R, I will mine the data for association rules.
Market basket analysis, the study of relationships in transaction data sets, seeks to discover patterns and associations between items. In retail transactions, items that are closely associated are more likely to be bought together. Association rules describe these relationships.
In my analysis, I will report the support, confidence, and lift of the rules and list the top 10 rules by lift.
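To make the three measures concrete, here is a minimal sketch in base R on a made-up set of five receipts. The `receipts` list and the `support()` helper are illustrative only and are not part of the grocery data. For a rule A -> B, support is the share of receipts containing both A and B, confidence is that share divided by the share containing A, and lift is confidence divided by the share containing B.

```r
# Hypothetical toy receipts, for illustration only
receipts <- list(
  c("herbs", "root vegetables", "whole milk"),
  c("herbs", "root vegetables"),
  c("whole milk", "soda"),
  c("root vegetables", "whole milk"),
  c("soda")
)

# Share of receipts that contain every item in `items`
support <- function(items) {
  mean(vapply(receipts, function(r) all(items %in% r), logical(1)))
}

# Rule {herbs} -> {root vegetables}
supp_rule <- support(c("herbs", "root vegetables"))  # both sides together
conf_rule <- supp_rule / support("herbs")            # share of herb receipts that also have the rhs
lift_rule <- conf_rule / support("root vegetables")  # confidence relative to the rhs baseline

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A lift above 1 means the right-hand item shows up alongside the left-hand item more often than its overall frequency would suggest.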
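library(tidyverse)   # read_csv(), %>%, mutate(), across()
library(arules)      # read.transactions(), apriori(), inspect()
library(arulesViz)   # plot() method for rules
data <- read_csv("C:\\Users\\urios\\OneDrive\\Documents\\GroceryDataSet.csv", col_names = FALSE)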
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 9835 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
## lgl (10): X23, X24, X25, X26, X27, X28, X29, X30, X31, X32
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
One of the first things we need to address is the column types. The columns are currently split between character and logical data types; the ten logical columns (X23 to X32) were read entirely as NA. All of the columns should be character, since they only hold item names.
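An alternative to converting after the fact is to declare the column type at read time, so readr does not have to guess. The parsing warning above suggests that guessing may have gone wrong for the later columns, although the source does not confirm this. The sketch below uses a made-up two-receipt CSV string (`csv_text` is hypothetical) whose last column is always empty, and reads every column as character with `cols(.default = col_character())`.

```r
library(readr)

# Hypothetical two-receipt file; the third column is always empty,
# so type guessing would not see any text in it
csv_text <- "whole milk,soda,\nherbs,,\n"

demo <- read_csv(
  I(csv_text),                                   # I() marks the string as literal data
  col_names = FALSE,
  col_types = cols(.default = col_character())  # read every column as character
)

print(sapply(demo, class))
```

The same `col_types` argument could be passed to the `read_csv()` call on the grocery file.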
summary(data)
## X1 X2 X3 X4
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X5 X6 X7 X8
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X9 X10 X11 X12
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X13 X14 X15 X16
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X17 X18 X19 X20
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X21 X22 X23 X24
## Length:9835 Length:9835 Mode:logical Mode:logical
## Class :character Class :character NA's:9835 NA's:9835
## Mode :character Mode :character
## X25 X26 X27 X28 X29
## Mode:logical Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:9835 NA's:9835 NA's:9835 NA's:9835 NA's:9835
##
## X30 X31 X32
## Mode:logical Mode:logical Mode:logical
## NA's:9835 NA's:9835 NA's:9835
##
To prepare the data set for analysis, we first convert every column to character and rename the columns.
Once the data is clean, I will write it to a file and read it back in basket format for analysis.
clean <- data %>%
  mutate(across(everything(), as.character))
colnames(clean) <- paste0("Item_", 1:ncol(clean))
write.csv(clean, "grocery_items.csv", quote = FALSE, row.names = TRUE, na = "" )
transaction <- read.transactions("grocery_items.csv", format = 'basket', sep = ',')
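One thing worth checking in this step: `write.csv()` with `row.names = TRUE` also writes a header line and a leading row-number column, and `read.transactions()` in basket format treats every line as a receipt and every field as an item. That would account for apriori later reporting 9836 transactions, one more than the 9835 rows read, along with an inflated item count. A minimal base R sketch, using a made-up two-receipt data frame (`toy` is hypothetical), compares the file contents with and without those extras:

```r
# Hypothetical stand-in for the cleaned receipts
toy <- data.frame(
  Item_1 = c("whole milk", "herbs"),
  Item_2 = c("soda", NA),
  stringsAsFactors = FALSE
)

with_extras <- tempfile(fileext = ".csv")
items_only  <- tempfile(fileext = ".csv")

# As in the text: header line plus a row-number column
write.csv(toy, with_extras, quote = FALSE, row.names = TRUE, na = "")

# Alternative: write.table() can drop both, leaving only the items
write.table(toy, items_only, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE, na = "")

print(readLines(with_extras))
print(readLines(items_only))
```

Reading the second kind of file with `read.transactions()` should give one transaction per receipt. The rule metrics reported in this write-up come from the file as originally written, so they would shift slightly.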
The most frequently purchased items include whole milk, other vegetables, rolls/buns, and soda.
itemFrequencyPlot(transaction, topN = 10, type = 'absolute')
When applying the apriori method, I had to change the parameters from their defaults. With the default confidence of 0.8 and support of 0.1, apriori returns 0 rules. I lowered the support to 0.5% and the confidence to 40%. This keeps only itemsets that appear in at least 0.5% of transactions and only rules with a confidence of at least 40%. Setting minlen = 2 excludes rules with an empty left-hand side.
Lowering these thresholds yields results.
rules <- apriori(transaction, parameter = list(supp = 0.005, conf = 0.4, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10036 item(s), 9836 transaction(s)] done [0.02s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [262 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
When inspecting the first 10 rules, I notice the following:
Among these ten, the rule {herbs} -> {root vegetables} has the highest lift (3.96). This tells us that a customer who buys herbs is close to 4 times as likely to buy root vegetables as a customer in general.
Whole milk is the most common right-hand side: it is the consequent in eight of the ten rules.
The rule {detergent} -> {whole milk} covers the most transactions of the ten (a count of 88).
inspect(rules[1:10])
## lhs rhs support confidence
## [1] {cake bar} => {whole milk} 0.005591704 0.4263566
## [2] {mustard} => {whole milk} 0.005185035 0.4322034
## [3] {pot plants} => {whole milk} 0.006913379 0.4000000
## [4] {pasta} => {whole milk} 0.005998373 0.4013605
## [5] {herbs} => {root vegetables} 0.007015047 0.4312500
## [6] {herbs} => {other vegetables} 0.007726718 0.4750000
## [7] {herbs} => {whole milk} 0.007726718 0.4750000
## [8] {processed cheese} => {whole milk} 0.007015047 0.4233129
## [9] {semi-finished bread} => {whole milk} 0.007116714 0.4022989
## [10] {detergent} => {whole milk} 0.008946726 0.4656085
## coverage lift count
## [1] 0.01311509 1.668780 55
## [2] 0.01199675 1.691664 51
## [3] 0.01728345 1.565619 68
## [4] 0.01494510 1.570944 59
## [5] 0.01626678 3.956880 69
## [6] 0.01626678 2.455123 76
## [7] 0.01626678 1.859172 76
## [8] 0.01657178 1.656867 69
## [9] 0.01769012 1.574617 70
## [10] 0.01921513 1.822413 88
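The columns in this output are linked by simple ratios, which makes the table easy to sanity-check. As a sketch, take the printed values for rule [5], {herbs} => {root vegetables}, and the 9836 transactions reported by apriori. Confidence is support divided by coverage, dividing confidence by lift recovers the overall support of the right-hand item, and support times the number of transactions gives the count. The variable names are illustrative.

```r
# Values copied from rule [5] in the output above
support    <- 0.007015047
confidence <- 0.4312500
coverage   <- 0.01626678   # support of the left-hand side, {herbs}
lift       <- 3.956880
n          <- 9836         # transactions reported by apriori

# confidence = support / coverage
conf_check <- support / coverage

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift

# count = support * number of transactions
count_check <- round(support * n)

print(c(conf_check = conf_check, rhs_support = rhs_support, count_check = count_check))
```

In other words, root vegetables appear in roughly 11% of all receipts but in about 43% of the receipts that contain herbs, which is where the lift of nearly 4 comes from.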
When organizing by lift, we can observe the following:
The strongest rule by lift is {citrus fruit, other vegetables, whole milk} -> {root vegetables}. Its lift of 4.085908 means that root vegetables are about 4 times as likely to be purchased when citrus fruit, other vegetables, and whole milk are all in the basket.
Of the top ten, the rule {beef, other vegetables} -> {root vegetables} occurs most frequently (78 transactions).
Root vegetables are the right-hand side of six of the ten rules, which suggests they are a popular complementary item.
inspect(sort(rules, by = "lift")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.005795039 0.4453125 0.013013420 4.085908 57
## [2] {herbs} => {root vegetables} 0.007015047 0.4312500 0.016266775 3.956880 69
## [3] {citrus fruit,
## pip fruit} => {tropical fruit} 0.005591704 0.4044118 0.013826759 3.854452 55
## [4] {other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.007015047 0.4107143 0.017080114 3.768457 69
## [5] {other vegetables,
## pip fruit,
## whole milk} => {root vegetables} 0.005490037 0.4060150 0.013521757 3.725339 54
## [6] {curd,
## tropical fruit} => {yogurt} 0.005286702 0.5148515 0.010268402 3.691020 52
## [7] {beef,
## other vegetables} => {root vegetables} 0.007930053 0.4020619 0.019723465 3.689068 78
## [8] {onions,
## other vegetables} => {root vegetables} 0.005693371 0.4000000 0.014233428 3.670149 56
## [9] {root vegetables,
## tropical fruit,
## whole milk} => {yogurt} 0.005693371 0.4745763 0.011996747 3.402283 56
## [10] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795039 0.6333333 0.009150061 3.273498 57
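To count how often an item appears as a consequent without reading the table by eye, the sorted rules can be filtered with `subset()`. The sketch below is an assumption-laden stand-in: it uses the `Groceries` transactions bundled with arules rather than the CSV from this write-up, so the exact numbers may differ slightly from the output above. The names `rules_demo`, `top_lift`, and `root_rules` are illustrative.

```r
library(arules)

# Assumption: the Groceries transactions bundled with arules stand in
# for the CSV used in this write-up
data("Groceries")

rules_demo <- apriori(
  Groceries,
  parameter = list(supp = 0.005, conf = 0.4, minlen = 2),
  control = list(verbose = FALSE)  # suppress the progress log
)

# Ten strongest rules by lift (sort() orders by decreasing lift)
top_lift <- head(sort(rules_demo, by = "lift"), 10)

# Which of them have root vegetables on the right-hand side?
root_rules <- subset(top_lift, rhs %in% "root vegetables")

inspect(top_lift)
print(length(root_rules))
```

The same `head(sort(...))` pattern gives a small rule set that can be passed to `plot()` when the full set is too large to draw.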
In this plot, we can visualize some of these relationships.
plot(rules, method = "graph")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).