Objective

The Grocery Dataset contains rows of purchase receipts. Each line represents one receipt and lists the items purchased. Using R, I will mine the data for association rules.

Market Basket Analysis, the study of relationships in transaction data sets, seeks to discover patterns and associations between items. In retail transactions, items that are closely associated are likely to be bought together. Association rules describe these relationships.

In my analysis, I will report the support, confidence, and lift of the rules and list the top 10 rules by lift.
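
To make the three measures concrete, here is a minimal base R sketch on a handful of made-up baskets. The `baskets` list and the `support()` helper are hypothetical and exist only for illustration; they are not part of the grocery file or of any package.

```r
# Toy baskets (hypothetical data, not taken from the grocery file)
baskets <- list(
  c("herbs", "root vegetables", "whole milk"),
  c("herbs", "root vegetables"),
  c("whole milk", "soda"),
  c("root vegetables", "whole milk"),
  c("soda")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "herbs"
rhs <- "root vegetables"

supp_rule <- support(c(lhs, rhs))      # share of baskets with both sides
conf_rule <- supp_rule / support(lhs)  # share of lhs baskets that also hold rhs
lift_rule <- conf_rule / support(rhs)  # confidence relative to the rhs baseline

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

A lift above 1 means the right-hand side shows up more often in baskets containing the left-hand side than it does overall.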

library(tidyverse); library(arules); library(arulesViz)  # packages used throughout
data <- read_csv("C:\\Users\\urios\\OneDrive\\Documents\\GroceryDataSet.csv", col_names = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 9835 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
## lgl (10): X23, X24, X25, X26, X27, X28, X29, X30, X31, X32
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Exploration

One of the first things we need to address is the column types. The columns are currently split between character and logical data types: X1 through X22 are character, while X23 through X32 were read as logical and contain only NA values. All of the columns should be character, since the items are text.

summary(data)
##       X1                 X2                 X3                 X4           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       X5                 X6                 X7                 X8           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       X9                X10                X11                X12           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X13                X14                X15                X16           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X17                X18                X19                X20           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X21                X22              X23            X24         
##  Length:9835        Length:9835        Mode:logical   Mode:logical  
##  Class :character   Class :character   NA's:9835      NA's:9835     
##  Mode  :character   Mode  :character                                
##    X25            X26            X27            X28            X29         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:9835      NA's:9835      NA's:9835      NA's:9835      NA's:9835     
##                                                                            
##    X30            X31            X32         
##  Mode:logical   Mode:logical   Mode:logical  
##  NA's:9835      NA's:9835      NA's:9835     
## 
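
The summary shows X23 through X32 as logical with 9835 NA values each, and the import raised a parsing warning. One plausible reading, which I have not verified against the file, is that readr guessed the column types from the first rows (its `guess_max` argument defaults to 1000) and that items appearing later in those columns could not be parsed as logical values. A way to sidestep type guessing is to declare every column as character at read time. The sketch below uses a small temporary file as a stand-in for the real CSV.

```r
library(readr)

# Hypothetical miniature of the receipts file; the third column is mostly empty
tmp <- tempfile(fileext = ".csv")
writeLines(c("whole milk,soda,", "herbs,,", "yogurt,curd,pasta"), tmp)

# Declare every column as character instead of letting readr guess
toy <- read_csv(
  tmp,
  col_names = FALSE,
  col_types = cols(.default = col_character())
)

sapply(toy, class)
```

With the types fixed up front, a sparse column keeps any item names it holds rather than being coerced to logical.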

Data Preparation

To prepare our data set for analysis, we first transform all the column types to character datatypes and rename our columns.

Once clean, I will transform the dataset into a basket format to allow for easy analysis.

clean <- data %>%
  mutate(across(everything(), as.character))
colnames(clean) <- paste0("Item_", 1:ncol(clean))
write.csv(clean, "grocery_items.csv", quote = FALSE, row.names = TRUE, na = "" )
transaction <- read.transactions("grocery_items.csv", format = 'basket', sep = ',')
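
One thing to watch for in this step: the apriori log further down reports 9836 transactions and 10036 items, although the file has 9835 receipts. A likely explanation, offered here as an assumption, is that `write.csv()` with `row.names = TRUE` adds a header line and a leading row-number column, which `read.transactions()` then reads as an extra basket and as extra items. The base R sketch below shows the effect on a hypothetical two-receipt data frame and one way to write only the item values.

```r
# Hypothetical two-receipt stand-in for `clean`
toy <- data.frame(Item_1 = c("whole milk", "soda"),
                  Item_2 = c("herbs", NA))

tmp <- tempfile(fileext = ".csv")

# Same arguments as in the report: header line and row numbers are written too
write.csv(toy, tmp, quote = FALSE, row.names = TRUE, na = "")
with_extras <- readLines(tmp)
with_extras

# Writing only the item values keeps exactly one line per receipt
write.table(toy, tmp, sep = ",", quote = FALSE, na = "",
            row.names = FALSE, col.names = FALSE)
items_only <- readLines(tmp)
items_only
```

The supports and counts reported in the rest of this analysis come from the file as written by the original code, so they would shift slightly if the file were rewritten this way.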

Analysis

The most frequently bought items include whole milk, other vegetables, rolls/buns, and soda.

itemFrequencyPlot(transaction, topN = 10, type = 'absolute')

When applying the apriori method, I had to lower the parameters from their defaults. The default confidence of 0.8 and support of 0.1 return 0 rules. I lowered the support to 0.5% and the confidence to 40%. This keeps only itemsets that appear in at least 0.5% of transactions and only rules with a confidence of at least 40%. Setting minlen = 2 excludes rules with an empty left-hand side.

By lowering the parameters we are able to get results.
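
To see what the two support thresholds mean in receipts, the short calculation below converts them to transaction counts. It takes the 9836 transactions reported in the apriori log as given; the variable names are my own.

```r
n_transactions <- 9836  # as reported in the apriori log

# Relative support thresholds: the apriori() default and the value used here
thresholds <- c(default = 0.1, chosen = 0.005)

# Approximate number of receipts an itemset must appear in under each threshold
min_receipts <- thresholds * n_transactions
min_receipts
```

The chosen threshold works out to roughly 49 receipts, in line with the "Absolute minimum support count: 49" line in the log, while the default would require an itemset to appear in close to a thousand receipts.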

rules <- apriori(transaction, parameter = list(supp = 0.005, conf = 0.4, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10036 item(s), 9836 transaction(s)] done [0.02s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [262 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

When inspecting the first 10 rules, I notice the following:

Among these ten, the rule {herbs} -> {root vegetables} has the highest lift (3.96). A customer who purchases herbs is close to 4 times more likely to purchase root vegetables than an average customer.

Whole milk is the most common right-hand side: it appears in eight of the ten rules.

The rule {detergent} -> {whole milk} is supported by the most transactions (88) among these ten.

inspect(rules[1:10])
##      lhs                      rhs                support     confidence
## [1]  {cake bar}            => {whole milk}       0.005591704 0.4263566 
## [2]  {mustard}             => {whole milk}       0.005185035 0.4322034 
## [3]  {pot plants}          => {whole milk}       0.006913379 0.4000000 
## [4]  {pasta}               => {whole milk}       0.005998373 0.4013605 
## [5]  {herbs}               => {root vegetables}  0.007015047 0.4312500 
## [6]  {herbs}               => {other vegetables} 0.007726718 0.4750000 
## [7]  {herbs}               => {whole milk}       0.007726718 0.4750000 
## [8]  {processed cheese}    => {whole milk}       0.007015047 0.4233129 
## [9]  {semi-finished bread} => {whole milk}       0.007116714 0.4022989 
## [10] {detergent}           => {whole milk}       0.008946726 0.4656085 
##      coverage   lift     count
## [1]  0.01311509 1.668780 55   
## [2]  0.01199675 1.691664 51   
## [3]  0.01728345 1.565619 68   
## [4]  0.01494510 1.570944 59   
## [5]  0.01626678 3.956880 69   
## [6]  0.01626678 2.455123 76   
## [7]  0.01626678 1.859172 76   
## [8]  0.01657178 1.656867 69   
## [9]  0.01769012 1.574617 70   
## [10] 0.01921513 1.822413 88
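
The columns in this output are tied together by simple ratios, which can be checked by hand. The sketch below copies the values of rule [5] from the output above and recomputes coverage, the implied support of the right-hand side, and the count. The transaction total of 9836 is taken from the apriori log.

```r
# Values copied from rule [5] {herbs} => {root vegetables} in the output above
support    <- 0.007015047
confidence <- 0.4312500
lift       <- 3.956880
n          <- 9836  # transactions reported in the apriori log

coverage    <- support / confidence  # support of the left-hand side, {herbs}
rhs_support <- confidence / lift     # implied support of {root vegetables}
count       <- support * n           # receipts containing both items

round(c(coverage = coverage, rhs_support = rhs_support, count = count), 4)
```

The recomputed coverage and count agree with the reported 0.01626678 and 69, and the implied support shows that root vegetables appear in roughly 11% of receipts overall.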

When organizing by lift, we can observe the following:

The strongest rule by lift is {citrus fruit, other vegetables, whole milk} -> {root vegetables}. Its lift of 4.085908 means that root vegetables are about 4 times more likely to be in a basket that contains citrus fruit, other vegetables, and whole milk together than in an average basket.

Among the top 10, the rule {beef, other vegetables} -> {root vegetables} is supported by the most transactions (78).

Root vegetables appear on the right-hand side of six of the top 10 rules, which suggests they are popular complementary items.

inspect(sort(rules, by = "lift")[1:10])
##      lhs                    rhs                    support confidence    coverage     lift count
## [1]  {citrus fruit,                                                                             
##       other vegetables,                                                                         
##       whole milk}        => {root vegetables}  0.005795039  0.4453125 0.013013420 4.085908    57
## [2]  {herbs}             => {root vegetables}  0.007015047  0.4312500 0.016266775 3.956880    69
## [3]  {citrus fruit,                                                                             
##       pip fruit}         => {tropical fruit}   0.005591704  0.4044118 0.013826759 3.854452    55
## [4]  {other vegetables,                                                                         
##       tropical fruit,                                                                           
##       whole milk}        => {root vegetables}  0.007015047  0.4107143 0.017080114 3.768457    69
## [5]  {other vegetables,                                                                         
##       pip fruit,                                                                                
##       whole milk}        => {root vegetables}  0.005490037  0.4060150 0.013521757 3.725339    54
## [6]  {curd,                                                                                     
##       tropical fruit}    => {yogurt}           0.005286702  0.5148515 0.010268402 3.691020    52
## [7]  {beef,                                                                                     
##       other vegetables}  => {root vegetables}  0.007930053  0.4020619 0.019723465 3.689068    78
## [8]  {onions,                                                                                   
##       other vegetables}  => {root vegetables}  0.005693371  0.4000000 0.014233428 3.670149    56
## [9]  {root vegetables,                                                                          
##       tropical fruit,                                                                           
##       whole milk}        => {yogurt}           0.005693371  0.4745763 0.011996747 3.402283    56
## [10] {citrus fruit,                                                                             
##       root vegetables,                                                                          
##       whole milk}        => {other vegetables} 0.005795039  0.6333333 0.009150061 3.273498    57
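
To check how often a given item appears as a consequent without counting by eye, the rules can be filtered by their right-hand side. The sketch below is a stand-in, not the analysis above: it uses the `Groceries` transactions bundled with arules so that it runs without the CSV file, which means its rule counts may differ from the ones reported here.

```r
library(arules)

# Stand-in data: the Groceries transactions bundled with arules,
# used here only so the sketch runs without the CSV file
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(supp = 0.005, conf = 0.4, minlen = 2),
                 control = list(verbose = FALSE))

# Top 10 rules by lift
top10 <- head(sort(rules, by = "lift"), n = 10)

# Of those, the rules whose right-hand side is root vegetables
root_rules <- subset(top10, subset = rhs %in% "root vegetables")

inspect(top10)
length(root_rules)
```

The same `subset()` call works with any item label, so the filter can be repeated for yogurt or whole milk to compare consequents.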

In this plot, we can visualize some of these relationships.

plot(rules, method = "graph")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).