2_association

Introduction

This document is a basic analysis of association rules existing in a set of transactions of a random grocery shop. The data is taken from kaggle. The project explores association rules in the data based on the 3 key assumptions of the Apriori algorithm, those being:

Support - a proportion of all transactions in the data set, where all the items from a rule appear together
Confidence - the probability of finding the supercedent item in the transaction, if the precedent item appeared
Lift - an extension of Confidence, such that if a lift values is higher than 1, the likelihood of a supercedent item appearing in a transaction is raised by ((Lift - 1) * 100) percent

Libraries

library(dplyr)
library(knitr)
library(tidyr)
library(readr)
library(lubridate)
library(ggplot2)
library(Matrix)
library(arules)
library(arulesViz)

The Data

First, the data is loaded into R. It has a clear structure, however not all rows follow it and it required somewhat significant wrangling to convert the existing format into a more usable one.

cart <- read.csv("dataset.csv", header = FALSE)
str(cart)

## 'data.frame':    1637 obs. of  32 variables:
##  $ V1 : chr  "1/1/2000yogurt" "1/1/2000toilet paper" "2/1/2000soda" "2/1/2000cereals" ...
##  $ V2 : chr  " pork" " shampoo" " pork" " juice" ...
##  $ V3 : chr  " sandwich bags" " hand soap" " soap" " lunch meat" ...
##  $ V4 : chr  " lunch meat" " waffles" " ice cream" " soda" ...
##  $ V5 : chr  " all- purpose" " vegetables" " toilet paper" " toilet paper" ...
##  $ V6 : chr  " flour" " cheeses" " dinner rolls" " all- purpose" ...
##  $ V7 : chr  " soda" " mixes" " hand soap" " " ...
##  $ V8 : chr  " butter" " milk" " spaghetti sauce" "" ...
##  $ V9 : chr  " vegetables" " sandwich bags" " milk" "" ...
##  $ V10: chr  " beef" " laundry detergent" " ketchup" "" ...
##  $ V11: chr  " aluminum foil" " dishwashing liquid/detergent" " sandwich loaves" "" ...
##  $ V12: chr  " all- purpose" " waffles" " poultry" "" ...
##  $ V13: chr  " dinner rolls" " individual meals" " toilet paper" "" ...
##  $ V14: chr  " shampoo" " hand soap" " ice cream" "" ...
##  $ V15: chr  " all- purpose" " vegetables" " ketchup" "" ...
##  $ V16: chr  " mixes" " individual meals" " vegetables" "" ...
##  $ V17: chr  " soap" " yogurt" " laundry detergent" "" ...
##  $ V18: chr  " laundry detergent" " cereals" " spaghetti sauce" "" ...
##  $ V19: chr  " ice cream" " shampoo" " bagels" "" ...
##  $ V20: chr  " dinner rolls" " vegetables" " soap" "" ...
##  $ V21: chr  " " " aluminum foil" " ice cream" "" ...
##  $ V22: chr  "" " tortillas" " shampoo" "" ...
##  $ V23: chr  "" " mixes" " lunch meat" "" ...
##  $ V24: chr  "" " " " cereals" "" ...
##  $ V25: chr  "" "" " spaghetti sauce" "" ...
##  $ V26: chr  "" "" " pork" "" ...
##  $ V27: chr  "" "" " vegetables" "" ...
##  $ V28: chr  "" "" " cheeses" "" ...
##  $ V29: chr  "" "" " eggs" "" ...
##  $ V30: chr  "" "" " vegetables" "" ...
##  $ V31: chr  "" "" " vegetables" "" ...
##  $ V32: chr  "" "" " " "" ...

Separating the date from the items in the first column. It’s unsuccessful for 138 rows where the date does not exist. Those rows need to be omitted as the transaction cannot be properly assigned to a date, or even trusted to have occurred.

# separating the date from the product in the first column -> adding transaction id
cart_clean <- cart %>%
  mutate(V1 = as.character(V1)) %>% 
  separate(V1, into = c("Date", "V1"), sep = "(?<=\\d{4})")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 138 rows [11, 22, 29, 46,
## 62, 74, 101, 112, 123, 128, 146, 156, 169, 193, 195, 204, 221, 227, 234, 236,
## ...].

cart_clean <- cart_clean %>%
  mutate(Date = if_else(!is.na(dmy(Date)), as.character(Date), NA_character_))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Date = if_else(!is.na(dmy(Date)), as.character(Date),
##   NA_character_)`.
## Caused by warning:
## !  138 failed to parse.

cart_clean <- na.omit(cart_clean) # 138 omissions

cart_clean <- cart_clean %>% 
  mutate(transaction_id = row_number()) %>%
  select(transaction_id, everything())

head(cart_clean)

##   transaction_id     Date                V1            V2             V3
## 1              1 1/1/2000            yogurt          pork  sandwich bags
## 2              2 1/1/2000      toilet paper       shampoo      hand soap
## 3              3 2/1/2000              soda          pork           soap
## 4              4 2/1/2000           cereals         juice     lunch meat
## 5              5 2/1/2000   sandwich loaves         pasta      tortillas
## 6              6 2/1/2000 laundry detergent  toilet paper           eggs
##              V4            V5            V6                            V7
## 1    lunch meat  all- purpose         flour                          soda
## 2       waffles    vegetables       cheeses                         mixes
## 3     ice cream  toilet paper  dinner rolls                     hand soap
## 4          soda  toilet paper  all- purpose                              
## 5         mixes     hand soap  toilet paper                    vegetables
## 6  toilet paper    vegetables        bagels  dishwashing liquid/detergent
##                 V8             V9                V10
## 1           butter     vegetables               beef
## 2             milk  sandwich bags  laundry detergent
## 3  spaghetti sauce           milk            ketchup
## 4                                                   
## 5       vegetables   paper towels         vegetables
## 6          cereals   paper towels  laundry detergent
##                             V11           V12               V13           V14
## 1                 aluminum foil  all- purpose      dinner rolls       shampoo
## 2  dishwashing liquid/detergent       waffles  individual meals     hand soap
## 3               sandwich loaves       poultry      toilet paper     ice cream
## 4                                                                            
## 5                         flour    vegetables              pork       poultry
## 6                        butter       cereals            bagels  paper towels
##             V15               V16                V17                V18
## 1  all- purpose             mixes               soap  laundry detergent
## 2    vegetables  individual meals             yogurt            cereals
## 3       ketchup        vegetables  laundry detergent    spaghetti sauce
## 4                                                                      
## 5          eggs        vegetables               pork    spaghetti sauce
## 6       shampoo      toilet paper               soap               soap
##           V19           V20            V21               V22            V23
## 1   ice cream  dinner rolls                                                
## 2     shampoo    vegetables  aluminum foil         tortillas          mixes
## 3      bagels          soap      ice cream           shampoo     lunch meat
## 4                                                                          
## 5  vegetables          milk        waffles  individual meals     vegetables
## 6       pasta    coffee/tea        poultry            bagels  aluminum foil
##             V24              V25      V26           V27      V28   V29
## 1                                                                     
## 2                                                                     
## 3       cereals  spaghetti sauce     pork    vegetables  cheeses  eggs
## 4                                                                     
## 5  dinner rolls     all- purpose     soda        yogurt               
## 6        butter  spaghetti sauce  ketchup  all- purpose     milk      
##           V30         V31 V32
## 1                            
## 2                            
## 3  vegetables  vegetables    
## 4                            
## 5                            
## 6

Transforming the cleaned data into a format accepted by the “arules” library. Each product in one row is assigned an artificial transaction ID and the whole data frame is transformed so that the items in transactions are tracked vertically by that ID, rather than horizontally. Additionally, unnecessary spaces are removed from appearing before the names of the first products in a transaction. A left over from the previous transformation.

# creating a transaction like data frame
transactions <- cart_clean %>%
  pivot_longer(
    cols = starts_with("V"),
    names_to = "product_key", 
    values_to = "product" ) %>%
  select(-product_key) %>%  
  mutate(product = trimws(product, which = "left")) %>%
  filter(product != "")

head(transactions)

## # A tibble: 6 × 3
##   transaction_id Date     product      
##            <int> <chr>    <chr>        
## 1              1 1/1/2000 yogurt       
## 2              1 1/1/2000 pork         
## 3              1 1/1/2000 sandwich bags
## 4              1 1/1/2000 lunch meat   
## 5              1 1/1/2000 all- purpose 
## 6              1 1/1/2000 flour

str(transactions)

## tibble [28,936 × 3] (S3: tbl_df/tbl/data.frame)
##  $ transaction_id: int [1:28936] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date          : chr [1:28936] "1/1/2000" "1/1/2000" "1/1/2000" "1/1/2000" ...
##  $ product       : chr [1:28936] "yogurt" "pork" "sandwich bags" "lunch meat" ...

Finally, a transaction object is created

transaction_list <- split(transactions[["product"]], transactions[["transaction_id"]])
transaction_obj <- as(transaction_list, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

inspect(transaction_obj[1:5])

##     items                           transactionID
## [1] {all- purpose,                               
##      aluminum foil,                              
##      beef,                                       
##      butter,                                     
##      dinner rolls,                               
##      flour,                                      
##      ice cream,                                  
##      laundry detergent,                          
##      lunch meat,                                 
##      mixes,                                      
##      pork,                                       
##      sandwich bags,                              
##      shampoo,                                    
##      soap,                                       
##      soda,                                       
##      vegetables,                                 
##      yogurt}                                    1
## [2] {aluminum foil,                              
##      cereals,                                    
##      cheeses,                                    
##      dishwashing liquid/detergent,               
##      hand soap,                                  
##      individual meals,                           
##      laundry detergent,                          
##      milk,                                       
##      mixes,                                      
##      sandwich bags,                              
##      shampoo,                                    
##      toilet paper,                               
##      tortillas,                                  
##      vegetables,                                 
##      waffles,                                    
##      yogurt}                                    2
## [3] {bagels,                                     
##      cereals,                                    
##      cheeses,                                    
##      dinner rolls,                               
##      eggs,                                       
##      hand soap,                                  
##      ice cream,                                  
##      ketchup,                                    
##      laundry detergent,                          
##      lunch meat,                                 
##      milk,                                       
##      pork,                                       
##      poultry,                                    
##      sandwich loaves,                            
##      shampoo,                                    
##      soap,                                       
##      soda,                                       
##      spaghetti sauce,                            
##      toilet paper,                               
##      vegetables}                                3
## [4] {all- purpose,                               
##      cereals,                                    
##      juice,                                      
##      lunch meat,                                 
##      soda,                                       
##      toilet paper}                              4
## [5] {all- purpose,                               
##      dinner rolls,                               
##      eggs,                                       
##      flour,                                      
##      hand soap,                                  
##      individual meals,                           
##      milk,                                       
##      mixes,                                      
##      paper towels,                               
##      pasta,                                      
##      pork,                                       
##      poultry,                                    
##      sandwich loaves,                            
##      soda,                                       
##      spaghetti sauce,                            
##      toilet paper,                               
##      tortillas,                                  
##      vegetables,                                 
##      waffles,                                    
##      yogurt}                                    5

Frequency of occurance

Item frequency plots of all unique items present in the every transaction and of the top 15 most frequently occuring ones, show that one clearly dominates above all others - vegetables.

itemFrequencyPlot(transaction_obj, type="absolute", main="Occurences of all items", col = "royalblue2")

itemFrequencyPlot(transaction_obj, topN = 10, type="absolute", main="Top 15 most occuring items", col = "orange")

This is the first hint that the rules of association will most probably revolve around it. To examine that, the Apriori algorithm is run on the data and the first set of rules is created, with a rather high support of 10% and a confidence of 80%.

Association rules

rules <- apriori(transaction_obj, parameter = list(support = 0.1, confidence = 0.8, minlen = 2))

summary(rules)

## set of 420 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
##   3 417 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.993   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1074   Min.   :0.8000   Min.   :0.1328   Min.   :1.102  
##  1st Qu.:0.1214   1st Qu.:0.8099   1st Qu.:0.1481   1st Qu.:1.116  
##  Median :0.1264   Median :0.8217   Median :0.1541   Median :1.132  
##  Mean   :0.1282   Mean   :0.8235   Mean   :0.1557   Mean   :1.135  
##  3rd Qu.:0.1321   3rd Qu.:0.8333   3rd Qu.:0.1601   3rd Qu.:1.148  
##  Max.   :0.3095   Max.   :0.8755   Max.   :0.3809   Max.   :1.206  
##      count      
##  Min.   :161.0  
##  1st Qu.:182.0  
##  Median :189.5  
##  Mean   :192.2  
##  3rd Qu.:198.0  
##  Max.   :464.0  
## 
## mining info:
##             data ntransactions support confidence
##  transaction_obj          1499     0.1        0.8
##                                                                                            call
##  apriori(data = transaction_obj, parameter = list(support = 0.1, confidence = 0.8, minlen = 2))

inspect(rules[1:10])

##      lhs                              rhs          support   confidence
## [1]  {laundry detergent}           => {vegetables} 0.3028686 0.8092692 
## [2]  {eggs}                        => {vegetables} 0.3095397 0.8126095 
## [3]  {yogurt}                      => {vegetables} 0.3082055 0.8148148 
## [4]  {hand soap, sandwich loaves}  => {vegetables} 0.1154103 0.8357488 
## [5]  {hand soap, sugar}            => {vegetables} 0.1167445 0.8101852 
## [6]  {hand soap, paper towels}     => {vegetables} 0.1074049 0.8009950 
## [7]  {hand soap, juice}            => {vegetables} 0.1114076 0.8067633 
## [8]  {hand soap, individual meals} => {vegetables} 0.1140761 0.8260870 
## [9]  {hand soap, pasta}            => {vegetables} 0.1187458 0.8054299 
## [10] {dinner rolls, hand soap}     => {vegetables} 0.1180787 0.8009050 
##      coverage  lift     count
## [1]  0.3742495 1.114977 454  
## [2]  0.3809206 1.119579 464  
## [3]  0.3782522 1.122617 462  
## [4]  0.1380921 1.151459 173  
## [5]  0.1440961 1.116239 175  
## [6]  0.1340894 1.103577 161  
## [7]  0.1380921 1.111524 167  
## [8]  0.1380921 1.138147 171  
## [9]  0.1474316 1.109687 178  
## [10] 0.1474316 1.103453 177

plot(rules)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(rules, method="grouped")

At these levels of support and confidence, all created rules have vegetables as the supercedent and with the number of rules created, evidence points to the fact that buying just about anything, leads to buying vegetables. Computing an even bigger rule set is impractical, as an extrapolation can be made that most, if not all of it, will still point to vegetables. Out of this set of rules, the most relevant ones should be considered.

With that said, only 3 rules of length 2 were created, which give an idea of the most direct precedent to supercedent relationship, with support of over 30% and both confidence and lift values in the neighborhood of other most highly rated rules, they are most common type of rules in the set. According to this data, whenever laundry detergent, eggs or yogurt appear in a basket, people also buy vegetables over 80% of the time.

Narrowing this rule set down to exclude the first 3 rules of length 2, raising the support level and ordering by lift, paints a clearer picture of the rest of the set.

rules2 <- apriori(transaction_obj, parameter = list(support = 0.14, confidence = 0.8, minlen = 3))

inspect(sort(rules2, by = "lift")[1:10])

##      lhs                                     rhs          support   confidence
## [1]  {cheeses, eggs}                      => {vegetables} 0.1454303 0.8755020 
## [2]  {dinner rolls, eggs}                 => {vegetables} 0.1420947 0.8729508 
## [3]  {milk, yogurt}                       => {vegetables} 0.1400934 0.8677686 
## [4]  {eggs, poultry}                      => {vegetables} 0.1434290 0.8600000 
## [5]  {eggs, yogurt}                       => {vegetables} 0.1414276 0.8582996 
## [6]  {dishwashing liquid/detergent, eggs} => {vegetables} 0.1414276 0.8582996 
## [7]  {aluminum foil, yogurt}              => {vegetables} 0.1454303 0.8549020 
## [8]  {cheeses, sandwich bags}             => {vegetables} 0.1447632 0.8509804 
## [9]  {eggs, soda}                         => {vegetables} 0.1487658 0.8479087 
## [10] {aluminum foil, bagels}              => {vegetables} 0.1407605 0.8473896 
##      coverage  lift     count
## [1]  0.1661107 1.206229 218  
## [2]  0.1627752 1.202714 213  
## [3]  0.1614410 1.195575 210  
## [4]  0.1667779 1.184871 215  
## [5]  0.1647765 1.182529 212  
## [6]  0.1647765 1.182529 212  
## [7]  0.1701134 1.177847 218  
## [8]  0.1701134 1.172444 217  
## [9]  0.1754503 1.168213 223  
## [10] 0.1661107 1.167497 211

plot(rules2)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(rules2, method="paracoord", control=list(reorder=TRUE))

plot(rules2, method="graph")

There is a bigger mix of rules, however eggs and yogurt appear again, this time alongside cheese, dinner rolls, milk or each other, with the highest lift values out of the entire set.

Conclusions

While it seems that whenever laundry detergent appears in a basket in this shop, vegetables are purchased too, the best predictors seem to be eggs and yogurt, in that order, having the highest lift values of all other items, whether by themselves or in combination with other products.
Whenever people buy eggs or yogurt, they are much more likely to buy vegetables.

2_association_rules

Szymon Zientalak

2024-02-28

Introduction

The Data

Frequency of occurance

Association rules

Conclusions