Introduction

In this analysis, we apply association rule mining with the Apriori algorithm to the Instacart Market Basket Analysis dataset. The goal is to uncover relationships between products that are frequently purchased together, which can inform recommendation systems and targeted marketing.

Data Loading and Preprocessing

# Set the CRAN mirror to ensure that packages can be installed
options(repos = c(CRAN = "https://cloud.r-project.org/"))

# Loading necessary libraries
library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ff)

## Loading required package: bit

## 
## Attaching package: 'bit'

## The following object is masked from 'package:dplyr':
## 
##     symdiff

## The following object is masked from 'package:base':
## 
##     xor

## Attaching package ff

## - getOption("fftempdir")=="/var/folders/ff/_j3220254qz_r_lngknyzc8r0000gn/T//RtmpyD6a41/ff"

## - getOption("ffextension")=="ff"

## - getOption("ffdrop")==TRUE

## - getOption("fffinonexit")==TRUE

## - getOption("ffpagesize")==65536

## - getOption("ffcaching")=="mmnoflush"  -- consider "ffeachflush" if your system stalls on large writes

## - getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system

## - getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system

## 
## Attaching package: 'ff'

## The following objects are masked from 'package:utils':
## 
##     write.csv, write.csv2

## The following objects are masked from 'package:base':
## 
##     is.factor, is.ordered

# Install the arulesCBA package (classification based on association rules)
install.packages("arulesCBA")

## 
## The downloaded binary packages are in
##  /var/folders/ff/_j3220254qz_r_lngknyzc8r0000gn/T//RtmpyD6a41/downloaded_packages

library(arulesCBA)


# Load the data only if instacart_ff is not already in the workspace
if (!exists("instacart_ff")) {
  # Read the CSV as transactions with arules (in memory, not via ff);
  # cols = c(2, 3) uses column 2 as transaction ID and column 3 as item
  instacart_ff <- read.transactions("order_products__prior.csv", format = "single", sep = ",", cols = c(2, 3))
}

# Summary of the transaction data
dim(instacart_ff)

## [1] 49678   146

summary(instacart_ff)

## transactions as itemMatrix in sparse format with
##  49678 rows (elements/itemsets/transactions) and
##  146 columns (items) and a density of 0.1515866 
## 
## most frequent items:
##       1       2       4       3       5 (Other) 
##   41826   41664   41653   41614   41131  891568 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  154  450  935 1386 1479 1543 1525 1478 1419 1389 1374 1368 1262 1326 1256 1259 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
## 1327 1283 1276 1276 1204 1218 1215 1138 1231 1170 1169 1135 1036 1055 1026  958 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  956  861  866  811  780  708  621  628  588  519  492  433  405  378  328  309 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##  242  238  193  176  142  124  110   89   71   57   48   44   29   23   14   22 
##   65   66   67   68   69   70   71   72   73   74   75   76 
##   11   15    7    8    3    1    1    1    2    2    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   11.00   21.00   22.13   31.00   76.00 
## 
## includes extended item information - examples:
##   labels
## 1      1
## 2     10
## 3    100
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2            10
## 3           100

Explanation:

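We use the arules package to read the dataset as transaction data. The file consists of order-product pairs, where each row represents one product purchased in one order, and read.transactions() with format = "single" groups those rows into transactions suitable for association rule mining. The if (!exists("instacart_ff")) guard skips the read when the object is already in the workspace. Although the ff package is loaded, the data is read by read.transactions() into an in-memory sparse matrix, not into an on-disk ff object.

Note that cols = c(2, 3) takes the second column of the file as the transaction ID and the third as the item. Assuming the standard Instacart column order (order_id, product_id, add_to_cart_order, reordered), this groups add_to_cart_order values by product_id, which is consistent with the 146 items and the numeric labels in the summary above. Order-level baskets of products would instead need cols = c(1, 2) together with header = TRUE.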
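To see how format = "single" and cols interact, the following sketch builds a small, made-up CSV with the column order assumed for the Instacart file (order_id, product_id, add_to_cart_order, reordered) and reads it with order_id as the transaction ID and product_id as the item. The file contents and the toy_* names are hypothetical; only read.transactions() and its arguments come from arules.

```r
library(arules)

# Hypothetical toy file mimicking the assumed Instacart column layout
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "order_id,product_id,add_to_cart_order,reordered",
  "1,101,1,0",
  "1,102,2,1",
  "2,101,1,0",
  "2,103,2,0",
  "3,102,1,1"
), toy_csv)

# header = TRUE skips the column-name row;
# cols = c(1, 2) means: column 1 is the transaction ID, column 2 is the item
toy_trans <- read.transactions(toy_csv, format = "single", sep = ",",
                               header = TRUE, cols = c(1, 2))

dim(toy_trans)      # number of transactions, number of distinct items
inspect(toy_trans)  # one row per order, listing its product IDs
```

Without header = TRUE, the column-name row would be read as data and add a spurious transaction and item.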

Exploratory Data Analysis

# Inspecting the most frequently purchased items
itemFrequencyPlot(instacart_ff, topN = 20, col = "lightblue", main = "Top 20 Most Frequent Items")

Explanation:

We visualize the most frequently purchased items to understand buying patterns.
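The counts behind such a plot can also be inspected numerically. The sketch below uses a handful of made-up baskets (toy_baskets is hypothetical, not Instacart data) and arules' itemFrequency() to list items from most to least frequent.

```r
library(arules)
library(methods)  # provides as() for the list-to-transactions coercion

# Hypothetical baskets, used only for illustration
toy_baskets <- list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "d"),
  c("b", "c")
)
toy_trans <- as(toy_baskets, "transactions")

# Absolute counts per item, sorted from most to least frequent
item_counts <- sort(itemFrequency(toy_trans, type = "absolute"),
                    decreasing = TRUE)
head(item_counts, 3)
```

Applied to instacart_ff, the same call would return the values that itemFrequencyPlot() draws as bars.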

Applying Apriori Algorithm

# Taking a random sample of 0.02% of the transactions (9 of 49,678)
set.seed(123)
sample_instacart_ff <- instacart_ff[sample(1:nrow(instacart_ff), size = 0.0002 * nrow(instacart_ff)), ]

# Remove infrequent items
item_freq <- itemFrequency(sample_instacart_ff, type = "absolute")
infrequent_items <- names(item_freq[item_freq < 5])
sample_instacart_ff <- sample_instacart_ff[, !colnames(sample_instacart_ff) %in% infrequent_items]

# Generating association rules with reduced support 
rules <- apriori(sample_instacart_ff, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[17 item(s), 9 transaction(s)] done [0.00s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(sample_instacart_ff, parameter = list(supp = 0.05, conf =
## 0.6, : Mining stopped (maxlen reached). Only patterns up to a length of 10
## returned!

##  done [0.02s].
## writing ... [857975 rule(s)] done [0.08s].
## creating S4 object  ... done [0.19s].

# Inspecting the top 10 rules based on lift
inspect(head(sort(rules, by = "lift"), 10))

##      lhs         rhs  support   confidence coverage  lift count
## [1]  {21}     => {11} 0.5555556 1          0.5555556 1.8  5    
## [2]  {11}     => {21} 0.5555556 1          0.5555556 1.8  5    
## [3]  {17, 8}  => {21} 0.3333333 1          0.3333333 1.8  3    
## [4]  {17, 8}  => {11} 0.3333333 1          0.3333333 1.8  3    
## [5]  {21, 8}  => {11} 0.4444444 1          0.4444444 1.8  4    
## [6]  {11, 8}  => {21} 0.4444444 1          0.4444444 1.8  4    
## [7]  {16, 8}  => {21} 0.3333333 1          0.3333333 1.8  3    
## [8]  {10, 21} => {8}  0.3333333 1          0.3333333 1.8  3    
## [9]  {13, 8}  => {21} 0.4444444 1          0.4444444 1.8  4    
## [10] {12, 8}  => {21} 0.4444444 1          0.4444444 1.8  4

# Force garbage collection to free up memory
gc()

##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  2369283 126.6    4603077 245.9         NA   3035081  162.1
## Vcells 12977283  99.1  116574912 889.4      16384 179809249 1371.9

Explanation:

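We take a random sample of 0.02% of the transactions (9 of 49,678) to reduce memory usage and speed up computation, then drop items that occur in fewer than 5 of the sampled transactions, which leaves 17 items. The Apriori algorithm is applied with a minimum support of 0.05 and a minimum confidence of 0.6. With only 9 transactions, the absolute minimum support count is 0, so the support threshold filters nothing: any itemset that occurs at all has a support of at least 1/9. This is why 857,975 rules are generated before the maxlen limit of 10 stops the search. The resulting rules are sorted by lift, which measures how much more often the items in a rule appear together than would be expected if they were independent.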
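The measures in the rule table can be reproduced by hand. The sketch below recomputes support, confidence, and lift for the first rule, {21} => {11}, from the figures in the output above; the counts for the left-hand and right-hand sides are not printed directly and are inferred here from confidence = 1 and lift = 1.8.

```r
# Values read from the apriori() output above
n_transactions <- 9      # sampled transactions
min_support    <- 0.05   # supp parameter

# Number of transactions an itemset needs to reach min_support
min_support * n_transactions   # 0.45, i.e. less than one transaction

# Rule {21} => {11}: count = 5, confidence = 1, lift = 1.8
count_both <- 5   # transactions containing both 21 and 11
count_lhs  <- 5   # transactions containing 21 (inferred from confidence = 1)
count_rhs  <- 5   # transactions containing 11 (inferred from lift = 1.8)

support    <- count_both / n_transactions               # P(lhs and rhs)
confidence <- count_both / count_lhs                    # P(rhs | lhs)
lift       <- confidence / (count_rhs / n_transactions) # confidence / P(rhs)

round(c(support = support, confidence = confidence, lift = lift), 7)
```

A lift of 1.8 here means only that items 21 and 11 always occur together in the five sampled transactions that contain either of them; with nine transactions in total, this is weak evidence.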

Visualizing Association Rules

# Plotting rules using the graph method
plot(rules, method = "graph", engine = "htmlwidget")

## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

# Scatterplot of rules with support and confidence measures, shaded by lift
plot(rules, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Explanation:

The arulesViz package is used for visualization. The graph visualization shows items and rules as nodes, with arrows linking each rule to the items on its left-hand and right-hand sides; because far more than 100 rules were supplied, only the 100 with the highest lift are drawn. The scatterplot places each rule by its support and confidence and shades it by lift, which helps identify rules that are both frequent and reliable.
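Plots of this kind are easier to read when the rule set is reduced first. As a minimal sketch, the code below mines rules from the Groceries dataset that ships with arules (a stand-in, not the Instacart data) and keeps the ten rules with the highest lift; the thresholds and the names demo_rules and top_rules are illustrative assumptions.

```r
library(arules)

# Groceries ships with arules and stands in for the Instacart sample here
data("Groceries")

# Thresholds are illustrative assumptions, not the settings used above
demo_rules <- apriori(Groceries,
                      parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                      control = list(verbose = FALSE))

# Keep only the ten rules with the highest lift
top_rules <- head(sort(demo_rules, by = "lift"), 10)
inspect(top_rules)
```

Applying the same head(sort(rules, by = "lift"), 10) selection to rules gives a subset that plot(..., method = "graph") can draw without the "Too many rules" warning.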

Conclusion

In this project, we analyzed Instacart transaction data using association rule mining. Our key findings include:

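Frequently co-occurring items, reported in the output as numeric item IDs (for example, {21} and {11}). These IDs must be mapped to product names before they can be read as pairs such as milk & bread or vegetables & cooking oil.

High-lift rules indicating strong relationships between certain item pairs. The top ten rules all have confidence 1 and lift 1.8, based on a sample of nine transactions.

Potential applications for personalized recommendations, product placement optimization, and targeted promotions in e-commerce.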

By leveraging these insights, businesses can enhance their customer experience and marketing strategies.

Association Rule Mining on Instacart Dataset

Nyasha Nyarirangwe

2025-02-01
