In this analysis, we apply Association Rule Mining using the Apriori algorithm on the Instacart Market Basket Analysis dataset. The goal is to uncover interesting relationships between products frequently purchased together, which can be useful for recommendation systems and targeted marketing.
# Set the CRAN mirror to ensure that packages can be installed
options(repos = c(CRAN = "https://cloud.r-project.org/"))
# Loading necessary libraries
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ff)
## Loading required package: bit
##
## Attaching package: 'bit'
## The following object is masked from 'package:dplyr':
##
## symdiff
## The following object is masked from 'package:base':
##
## xor
## Attaching package ff
## - getOption("fftempdir")=="/var/folders/ff/_j3220254qz_r_lngknyzc8r0000gn/T//RtmpyD6a41/ff"
## - getOption("ffextension")=="ff"
## - getOption("ffdrop")==TRUE
## - getOption("fffinonexit")==TRUE
## - getOption("ffpagesize")==65536
## - getOption("ffcaching")=="mmnoflush" -- consider "ffeachflush" if your system stalls on large writes
## - getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system
## - getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system
##
## Attaching package: 'ff'
## The following objects are masked from 'package:utils':
##
## write.csv, write.csv2
## The following objects are masked from 'package:base':
##
## is.factor, is.ordered
# Install arulesCBA (classification based on association rules); it is not used in the mining steps below
install.packages("arulesCBA")
##
## The downloaded binary packages are in
## /var/folders/ff/_j3220254qz_r_lngknyzc8r0000gn/T//RtmpyD6a41/downloaded_packages
library(arulesCBA)
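# Check if instacart_ff exists, and if not, load the data
if (!exists("instacart_ff")) {
# read.transactions() comes from arules and builds an in-memory sparse matrix; ff is not used here.
# With format = "single", cols = c(transaction ID column, item column).
# Note: in the standard Instacart file, column 1 is order_id, column 2 is product_id and
# column 3 is add_to_cart_order, so baskets of products per order would need cols = c(1, 2) and header = TRUE.
instacart_ff <- read.transactions("order_products__prior.csv", format = "single", sep = ",", cols = c(2, 3))
}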
# Summary of the transaction data
dim(instacart_ff)
## [1] 49678 146
summary(instacart_ff)
## transactions as itemMatrix in sparse format with
## 49678 rows (elements/itemsets/transactions) and
## 146 columns (items) and a density of 0.1515866
##
## most frequent items:
## 1 2 4 3 5 (Other)
## 41826 41664 41653 41614 41131 891568
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 154 450 935 1386 1479 1543 1525 1478 1419 1389 1374 1368 1262 1326 1256 1259
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 1327 1283 1276 1276 1204 1218 1215 1138 1231 1170 1169 1135 1036 1055 1026 958
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 956 861 866 811 780 708 621 628 588 519 492 433 405 378 328 309
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 242 238 193 176 142 124 110 89 71 57 48 44 29 23 14 22
## 65 66 67 68 69 70 71 72 73 74 75 76
## 11 15 7 8 3 1 1 1 2 2 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 11.00 21.00 22.13 31.00 76.00
##
## includes extended item information - examples:
## labels
## 1 1
## 2 10
## 3 100
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 10
## 3 100
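We use the arules package to read the dataset as transaction data. The file consists of order-product pairs, where each row represents a product purchased in an order, and read.transactions() with format = "single" converts this long format into a sparse transaction matrix suitable for association rule mining; cols = c(2, 3) names the transaction ID column and the item column, in that order. We first check whether instacart_ff already exists and read the file only if it does not, which avoids reloading a large file. Despite the object name and the loaded ff package, the data is read by arules and held in memory, not on disk. The resulting object has 49,678 transactions and 146 items, with a density of about 0.15 and a median transaction length of 21 items.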
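The long-format-to-transactions step described above can be tried on a toy file. This is a minimal sketch, not the report's own code: the file contents, the item names, and the object names toy_file and toy_trans are made up for illustration, and the column layout (transaction ID first, item second, with a header row) is an assumption.

```r
library(arules)

# Hypothetical toy file in long format: one row per (order, product) pair.
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "order_id,product_id",
  "1,milk",
  "1,bread",
  "2,milk",
  "2,eggs",
  "3,milk",
  "3,bread",
  "3,eggs"
), toy_file)

# cols = c(transaction ID column, item column); header = TRUE skips the header row.
toy_trans <- read.transactions(toy_file, format = "single", sep = ",",
                               cols = c(1, 2), header = TRUE)

dim(toy_trans)      # 3 transactions (orders) by 3 items (products)
inspect(toy_trans)  # lists the items in each order
```

Checking dim() against the expected number of orders and distinct products is a quick way to confirm that the two columns were passed in the intended order.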
# Inspecting the most frequently purchased items
itemFrequencyPlot(instacart_ff, topN = 20, col = "lightblue", main = "Top 20 Most Frequent Items")
We visualize the most frequently purchased items to understand buying patterns.
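The counts behind such a plot can also be read as a table. The sketch below uses a small hypothetical basket list (toy_baskets is not part of the original analysis) and the arules function itemFrequency() to rank items by absolute count.

```r
library(arules)

# Hypothetical toy baskets standing in for the real transactions object.
toy_baskets <- list(
  c("milk", "bread"),
  c("milk", "eggs"),
  c("milk", "bread", "eggs"),
  c("bread")
)
toy_trans <- as(toy_baskets, "transactions")

# Absolute counts per item, sorted so the most frequent items come first.
item_counts <- sort(itemFrequency(toy_trans, type = "absolute"), decreasing = TRUE)
head(item_counts, 3)
```

The same two lines applied to the real transactions object would give the numbers that the bar plot draws.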
# Taking a random sample of 0.02% of the transactions (size = 0.0002 * nrow)
set.seed(123)
sample_instacart_ff <- instacart_ff[sample(1:nrow(instacart_ff), size = 0.0002 * nrow(instacart_ff)), ]
# Remove infrequent items
item_freq <- itemFrequency(sample_instacart_ff, type = "absolute")
infrequent_items <- names(item_freq[item_freq < 5])
sample_instacart_ff <- sample_instacart_ff[, !colnames(sample_instacart_ff) %in% infrequent_items]
# Generating association rules with reduced support
rules <- apriori(sample_instacart_ff, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 0
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[17 item(s), 9 transaction(s)] done [0.00s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(sample_instacart_ff, parameter = list(supp = 0.05, conf =
## 0.6, : Mining stopped (maxlen reached). Only patterns up to a length of 10
## returned!
## done [0.02s].
## writing ... [857975 rule(s)] done [0.08s].
## creating S4 object ... done [0.19s].
# Inspecting the top 10 rules based on lift
inspect(head(sort(rules, by = "lift"), 10))
## lhs rhs support confidence coverage lift count
## [1] {21} => {11} 0.5555556 1 0.5555556 1.8 5
## [2] {11} => {21} 0.5555556 1 0.5555556 1.8 5
## [3] {17, 8} => {21} 0.3333333 1 0.3333333 1.8 3
## [4] {17, 8} => {11} 0.3333333 1 0.3333333 1.8 3
## [5] {21, 8} => {11} 0.4444444 1 0.4444444 1.8 4
## [6] {11, 8} => {21} 0.4444444 1 0.4444444 1.8 4
## [7] {16, 8} => {21} 0.3333333 1 0.3333333 1.8 3
## [8] {10, 21} => {8} 0.3333333 1 0.3333333 1.8 3
## [9] {13, 8} => {21} 0.4444444 1 0.4444444 1.8 4
## [10] {12, 8} => {21} 0.4444444 1 0.4444444 1.8 4
# Force garbage collection to free up memory
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 2369283 126.6 4603077 245.9 NA 3035081 162.1
## Vcells 12977283 99.1 116574912 889.4 16384 179809249 1371.9
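We take a random sample of 0.02% of the transactions (9 of 49,678) to reduce memory usage and speed up computation, then drop items that occur in fewer than 5 of the sampled transactions, which leaves 17 items. The Apriori algorithm is applied with a minimum support of 0.05, a minimum confidence of 0.6, and a minimum rule length of 2. With only 9 transactions, the support threshold corresponds to an absolute count of 0, so it is effectively inactive: 857,975 rules are returned, and mining stops at the default maximum length of 10. The resulting rules are sorted by lift, which measures how much more often the items appear together than would be expected if they were independent. All ten of the top rules have confidence 1 and lift 1.8, each supported by 3 to 5 transactions, so they reflect a very small sample and should not be read as stable patterns.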
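The figures in the output above can be reproduced by hand, which makes the effect of the small sample easier to see. The sketch below uses only numbers printed in the log and in the rule table; the variable names are illustrative, and floor() is used here as an assumption that matches the 9 transactions and the absolute support count of 0 that the log reports.

```r
# Figures taken from the output above.
n_total  <- 49678   # transactions in the full object
fraction <- 0.0002  # sampling fraction used in the code

# 0.0002 * 49678 = 9.9356; the log reports 9 sampled transactions.
n_sample <- floor(fraction * n_total)

# Relative support 0.05 on 9 transactions is 0.45, below a single transaction.
min_count <- floor(0.05 * n_sample)

# Rule [1], {21} => {11}: counts recovered from the support and coverage columns.
count_lhs  <- 5  # transactions containing item 21 (coverage 0.5555556 * 9)
count_rhs  <- 5  # transactions containing item 11 (coverage of rule [2])
count_both <- 5  # the "count" column of rule [1]

support    <- count_both / n_sample             # 5 / 9
confidence <- count_both / count_lhs            # 5 / 5
lift       <- confidence / (count_rhs / n_sample)  # 1 / (5 / 9)

round(c(support = support, confidence = confidence, lift = lift), 7)
```

The result matches the first row of the rule table: support 0.5555556, confidence 1, lift 1.8. Because item 11 already appears in 5 of 9 transactions, a lift of 1.8 is the largest value a confidence-1 rule with that right-hand side can reach in this sample.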
# Plotting rules using the graph method
plot(rules, method = "graph", engine = "htmlwidget")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
# Scatterplot of rules with support and confidence measures, shaded by lift
plot(rules, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The arulesViz package is used for visualization. The graph visualization shows interconnected items, where nodes represent items and edges represent the associations between them; because the rule set is so large, only the 100 rules with the highest lift are drawn. The scatterplot shows each rule's support and confidence, shaded by lift, which helps identify the rules that are both frequent and reliable.
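When a rule set is too large to plot in full, one option is to reduce it before plotting. The sketch below is an illustration under stated assumptions, not the report's method: it uses the Groceries dataset that ships with arules as a stand-in for the Instacart sample, the thresholds are arbitrary, and demo_rules and top_rules are hypothetical names.

```r
library(arules)

# Assumption: Groceries (shipped with arules) stands in for the Instacart sample.
data("Groceries")

# Thresholds are illustrative only.
demo_rules <- apriori(Groceries,
                      parameter = list(supp = 0.01, conf = 0.3, minlen = 2),
                      control = list(verbose = FALSE))

# Drop rules that a more general rule already covers with equal or higher confidence.
demo_rules <- demo_rules[!is.redundant(demo_rules)]

# Keep the 20 rules with the highest lift.
top_rules <- head(sort(demo_rules, by = "lift"), 20)
inspect(head(top_rules, 5))

# With arulesViz loaded, the reduced set can be plotted directly:
# plot(top_rules, method = "graph")
```

Choosing the subset explicitly keeps the selection criterion visible in the code, so it does not depend on the plotting function's default cutoff.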
In this project, we analyzed Instacart transaction data using association rule mining. Our key findings include:
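Frequently co-occurring items, for example items 21 and 11, which appear together in 5 of the 9 sampled transactions. The rules are reported by numeric item ID, so product-level pairs such as milk & bread or vegetables & cooking oil can only be confirmed after mapping the IDs to product names.
High-lift rules (a lift of 1.8 for the top ten) indicating strong relationships between certain item pairs within the sample.
Potential applications for personalized recommendations, product placement optimization, and targeted promotions in e-commerce.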
By leveraging these insights, businesses can enhance their customer experience and marketing strategies.