
The Dataset

The analysis uses a dataset transformed from the Kaggle dataset Random Shopping cart (https://www.kaggle.com/fanatiks/shopping-cart).

The following packages are loaded for the analysis.

library(rmarkdown)
library(arules)
library(arulesViz)
library(dplyr)
library(ggplot2)
library(wordcloud2)

The dataset is loaded and the columns are given meaningful names.

basket <- read.csv("~/dataset_group.csv", header = FALSE)
dim(basket)
## [1] 22343     3
colnames(basket) <- c("Date", "Transaction_Id", "Item")
str(basket)
## 'data.frame':    22343 obs. of  3 variables:
##  $ Date          : Factor w/ 603 levels "2000-01-01","2000-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Transaction_Id: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Item          : Factor w/ 38 levels "all- purpose",..: 38 25 27 20 1 12 31 5 36 4 ...
paged_table(basket)

The dataset contains 22343 observations of three variables: Date, Transaction_Id, and Item.


Theory

The Apriori algorithm extracts associative patterns from transactions, expressed as association rules for our dataset. The analysis of these rules depends on three measures: Support, Confidence, and Lift.
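For a rule X => Y mined from N transactions, the standard definitions are:

$$\mathrm{support}(X) = \frac{|\{t : X \subseteq t\}|}{N}, \qquad \mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}, \qquad \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}$$

A lift greater than 1 indicates that the antecedent and consequent occur together more often than expected if they were independent.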


Data Engineering

We will check for any discrepancies in the dataset like missing or null values, negative or invalid values, and duplicate records.

Missing or null values
colSums(is.na(basket))
##           Date Transaction_Id           Item 
##              0              0              0
any(is.null(basket))
## [1] FALSE

There are no missing or null values in the dataset.

Negative or invalid values
sum(basket$Transaction_Id < 0, na.rm = TRUE) # only Transaction_Id is numeric; comparing the Item factor with 0 would yield NA
## [1] 0

There are no negative values in the dataset.

Duplicate Records
dim(unique(basket))[1]
## [1] 16753

There are duplicate records: the dataset contains 22343 rows but only 16753 of them are unique.

We will remove these duplicate records from the dataset.

basket <- unique(basket)
dim(basket)
## [1] 16753     3

We have 16753 records and 3 columns in our cleaned data for further analysis.

Adding new variables to the Dataframe
# create year, month and day variables from the date
basket$Date <- as.character(basket$Date)
basket$Year <- sapply(basket$Date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][1]})
basket$Month <- sapply(basket$Date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][2]})
basket$Day <- sapply(basket$Date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][3]})
basket$Date <- as.Date(basket$Date)
str(basket)
## 'data.frame':    16753 obs. of  6 variables:
##  $ Date          : Date, format: "2000-01-01" "2000-01-01" ...
##  $ Transaction_Id: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Item          : Factor w/ 38 levels "all- purpose",..: 38 25 27 20 1 12 31 5 36 4 ...
##  $ Year          : chr  "2000" "2000" "2000" "2000" ...
##  $ Month         : chr  "01" "01" "01" "01" ...
##  $ Day           : chr  "01" "01" "01" "01" ...
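Once Date has been converted with as.Date(), the same components can also be extracted directly with format(); a minimal equivalent sketch:

# equivalent extraction from the Date class (alternative sketch)
basket$Year  <- format(basket$Date, "%Y")
basket$Month <- format(basket$Date, "%m")
basket$Day   <- format(basket$Date, "%d")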

Visualizing and Exploring the Data
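The pipeline below plots per-item purchase counts from a most_sold data frame. Its definition is not shown here, so the following reconstruction (purchase counts per item, sorted in descending order) is an assumption:

# assumed definition of most_sold: number of purchases per item
most_sold <- basket %>% count(Item) %>% arrange(desc(n))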

most_sold %>%
  arrange(n) %>%    # sort by count; this sorts the data frame but NOT the factor levels
  ungroup() %>%
  mutate(Item = factor(Item, levels = Item)) %>%   # update the factor levels to match the sort order
  ggplot(aes(x = Item, y = n)) +
    geom_segment(aes(xend = Item, yend = 0), color = "steelblue", alpha = 0.8) +
    geom_text(aes(label = n), hjust = -0.4, color = "#A9A9A9", size = 3.5) +
    geom_point(size = 3, color = "steelblue") +
    ggtitle("Items and their frequency") +
    coord_flip() +
    theme_light() +
    theme(panel.border = element_blank(), axis.title = element_blank(),
          axis.text.x = element_blank(), panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank())

The above plot shows that vegetables is by far the most purchased item, while the remaining items differ little in how often they are bought.


Here are all the items in the dataset.

wordcloud2(most_sold, size = 0.5)

Vegetables is the most bought item.


monthly_sold <- basket %>% group_by(Month) %>% count() %>% arrange(desc(n))
ggplot(monthly_sold, aes(x = Month, y = n)) + geom_bar(stat = "identity", aes(fill = as.factor(Month)), show.legend = FALSE) + ggtitle("Monthly Sales") + theme_bw() + scale_fill_brewer(palette = "Paired") # Paired supports 12 levels; Spectral tops out at 11

The above plot shows the monthly purchase frequency. We observe that the most purchases were made in January, after which they declined until April; purchases rose sharply in May, dipped in June, and then climbed again through September.


yearly_sold <- basket %>% group_by(Year) %>% count() %>% arrange(desc(n))
ggplot(yearly_sold, aes(x = Year, y = n)) + geom_bar(stat = "identity", aes(fill = as.factor(Year)), show.legend = FALSE) + ggtitle("Yearly Sales") + theme_bw() +  scale_fill_brewer(palette = "Spectral")

The above plot shows fewer purchases in 2002, but this is probably because the dataset contains less data for 2002.
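One way to check this (a quick sketch) is to count how many distinct dates each year covers:

# if 2002 spans fewer days than 2000 or 2001, its lower total is a coverage artefact
basket %>% group_by(Year) %>% summarise(days_covered = n_distinct(Date), purchases = n())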


daily_sold <- basket %>% group_by(Day) %>% count() %>% arrange(desc(n))
ggplot(daily_sold, aes(x = Day, y = n)) + geom_bar(stat = "identity", aes(fill = as.factor(Day)), show.legend = FALSE) + ggtitle("Daily Sales") + theme_bw() 

The above plot shows the day-of-month frequency of purchases.


most_sold <- most_sold[1:10, "Item"]
top_10 <- basket[basket$Item %in% most_sold$Item, ]
ggplot(top_10, aes(x = Item, y = Transaction_Id, fill = Item)) + geom_boxplot(show.legend = FALSE) + ggtitle("No. of transaction by Top 10 items") + theme_bw() +  scale_fill_brewer(palette = "Spectral")

The above boxplot shows the number of transactions of top 10 purchased items.


basket_sizes <- basket %>% group_by(Transaction_Id) %>% summarise(total = n())
ggplot(basket_sizes, aes(x = total)) + geom_histogram(fill = "indianred", binwidth = 1) + geom_rug() + ggtitle("No. of transactions with different basket sizes") + theme_bw() + xlab("Basket size") + ylab("No. of transactions")

The above plot shows the number of transactions for each basket size.


Market Basket Analysis

We will create a transactions object from the basket data frame for Market Basket Analysis (association rule mining).

basket_trans <- split(x = basket[,"Item"], f = basket$Transaction_Id)
basket_trans <- as(basket_trans, "transactions")
summary(basket_trans)
## transactions as itemMatrix in sparse format with
##  1139 rows (elements/itemsets/transactions) and
##  38 columns (items) and a density of 0.3870662 
## 
## most frequent items:
## vegetables    poultry  ice cream    cereals lunch meat    (Other) 
##        842        480        454        451        450      14076 
## 
## element (itemset/transaction) length distribution:
## sizes
##  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
## 12 34 35 41 51 56 62 67 48 55 71 64 58 79 77 75 91 54 58 25 16  7  3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   10.00   15.00   14.71   19.00   26.00 
## 
## includes extended item information - examples:
##          labels
## 1  all- purpose
## 2 aluminum foil
## 3        bagels
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
itemFrequencyPlot(basket_trans, topN = 20, type = "absolute")

Apriori Algorithm

apriori_rules <- apriori(basket_trans, parameter = list(sup = 0.15, conf = 0.8, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.15      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 170 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1139 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.01s].
## writing ... [22 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(apriori_rules)
## set of 22 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
##  6 16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.250   3.000   2.727   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence          lift           count      
##  Min.   :0.1501   Min.   :0.8082   Min.   :1.093   Min.   :171.0  
##  1st Qu.:0.1530   1st Qu.:0.8327   1st Qu.:1.126   1st Qu.:174.2  
##  Median :0.1572   Median :0.8507   Median :1.151   Median :179.0  
##  Mean   :0.1968   Mean   :0.8541   Mean   :1.155   Mean   :224.1  
##  3rd Qu.:0.2524   3rd Qu.:0.8765   3rd Qu.:1.186   3rd Qu.:287.5  
##  Max.   :0.3266   Max.   :0.8995   Max.   :1.217   Max.   :372.0  
## 
## mining info:
##          data ntransactions support confidence
##  basket_trans          1139    0.15        0.8
plot(apriori_rules)

plot(head(sort(apriori_rules, by = "lift"), 22), method = "graph")

plot(apriori_rules, method = "grouped")

inspect(sort(apriori_rules, by = "lift", decreasing = T))
##      lhs                                       rhs          support  
## [1]  {eggs,yogurt}                          => {vegetables} 0.1571554
## [2]  {dinner rolls,eggs}                    => {vegetables} 0.1562774
## [3]  {dishwashing liquid/detergent,eggs}    => {vegetables} 0.1536435
## [4]  {cereals,laundry detergent}            => {vegetables} 0.1510097
## [5]  {cheeses,eggs}                         => {vegetables} 0.1501317
## [6]  {eggs,poultry}                         => {vegetables} 0.1553995
## [7]  {cereals,eggs}                         => {vegetables} 0.1510097
## [8]  {aluminum foil,yogurt}                 => {vegetables} 0.1527656
## [9]  {mixes,poultry}                        => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## [11] {lunch meat,waffles}                   => {vegetables} 0.1571554
## [12] {lunch meat,poultry}                   => {vegetables} 0.1580334
## [13] {poultry,sugar}                        => {vegetables} 0.1518876
## [14] {eggs,soda}                            => {vegetables} 0.1580334
## [15] {poultry,yogurt}                       => {vegetables} 0.1527656
## [16] {eggs}                                 => {vegetables} 0.3266023
## [17] {yogurt}                               => {vegetables} 0.3195786
## [18] {dinner rolls,poultry}                 => {vegetables} 0.1615452
## [19] {sugar}                                => {vegetables} 0.2976295
## [20] {laundry detergent}                    => {vegetables} 0.3090430
## [21] {sandwich loaves}                      => {vegetables} 0.2827041
## [22] {aluminum foil}                        => {vegetables} 0.3107989
##      confidence lift     count
## [1]  0.8994975  1.216779 179  
## [2]  0.8989899  1.216092 178  
## [3]  0.8974359  1.213990 175  
## [4]  0.8911917  1.205543 172  
## [5]  0.8860104  1.198534 171  
## [6]  0.8805970  1.191211 177  
## [7]  0.8643216  1.169195 172  
## [8]  0.8613861  1.165224 174  
## [9]  0.8599034  1.163218 178  
## [10] 0.8544601  1.155855 182  
## [11] 0.8523810  1.153043 179  
## [12] 0.8490566  1.148546 180  
## [13] 0.8480392  1.147169 173  
## [14] 0.8450704  1.143153 180  
## [15] 0.8446602  1.142599 174  
## [16] 0.8378378  1.133370 372  
## [17] 0.8310502  1.124188 364  
## [18] 0.8288288  1.121183 184  
## [19] 0.8248175  1.115757 339  
## [20] 0.8167053  1.104783 352  
## [21] 0.8090452  1.094421 322  
## [22] 0.8082192  1.093304 354

Eclat Algorithm

itemsets <- eclat(data = basket_trans, parameter = list(support = 0.15, minlen = 2))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.15      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 170 
## 
## create itemset ... 
## set transactions ...[38 item(s), 1139 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating bit matrix ... [38 row(s), 1139 column(s)] done [0.00s].
## writing  ... [518 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
eclat_rules <- ruleInduction(itemsets, basket_trans, confidence = 0.8)
summary(eclat_rules)
## set of 22 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
##  6 16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.250   3.000   2.727   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence          lift          itemset     
##  Min.   :0.1501   Min.   :0.8082   Min.   :1.093   Min.   :  5.0  
##  1st Qu.:0.1530   1st Qu.:0.8327   1st Qu.:1.126   1st Qu.:397.2  
##  Median :0.1572   Median :0.8507   Median :1.151   Median :454.5  
##  Mean   :0.1968   Mean   :0.8541   Mean   :1.155   Mean   :394.1  
##  3rd Qu.:0.2524   3rd Qu.:0.8765   3rd Qu.:1.186   3rd Qu.:489.2  
##  Max.   :0.3266   Max.   :0.8995   Max.   :1.217   Max.   :509.0  
## 
## mining info:
##          data ntransactions support confidence
##  basket_trans          1139    0.15        0.8
plot(eclat_rules)

plot(head(sort(eclat_rules, by = "lift"), 22), method = "graph")

plot(eclat_rules, method = "grouped")

inspect(sort(eclat_rules, by = "lift", decreasing = T)[1:22])
##      lhs                                       rhs          support  
## [1]  {eggs,yogurt}                          => {vegetables} 0.1571554
## [2]  {dinner rolls,eggs}                    => {vegetables} 0.1562774
## [3]  {dishwashing liquid/detergent,eggs}    => {vegetables} 0.1536435
## [4]  {cereals,laundry detergent}            => {vegetables} 0.1510097
## [5]  {cheeses,eggs}                         => {vegetables} 0.1501317
## [6]  {eggs,poultry}                         => {vegetables} 0.1553995
## [7]  {cereals,eggs}                         => {vegetables} 0.1510097
## [8]  {aluminum foil,yogurt}                 => {vegetables} 0.1527656
## [9]  {mixes,poultry}                        => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## [11] {lunch meat,waffles}                   => {vegetables} 0.1571554
## [12] {lunch meat,poultry}                   => {vegetables} 0.1580334
## [13] {poultry,sugar}                        => {vegetables} 0.1518876
## [14] {eggs,soda}                            => {vegetables} 0.1580334
## [15] {poultry,yogurt}                       => {vegetables} 0.1527656
## [16] {eggs}                                 => {vegetables} 0.3266023
## [17] {yogurt}                               => {vegetables} 0.3195786
## [18] {dinner rolls,poultry}                 => {vegetables} 0.1615452
## [19] {sugar}                                => {vegetables} 0.2976295
## [20] {laundry detergent}                    => {vegetables} 0.3090430
## [21] {sandwich loaves}                      => {vegetables} 0.2827041
## [22] {aluminum foil}                        => {vegetables} 0.3107989
##      confidence lift     itemset
## [1]  0.8994975  1.216779 476    
## [2]  0.8989899  1.216092 454    
## [3]  0.8974359  1.213990 441    
## [4]  0.8911917  1.205543 300    
## [5]  0.8860104  1.198534 501    
## [6]  0.8805970  1.191211 508    
## [7]  0.8643216  1.169195 507    
## [8]  0.8613861  1.165224 428    
## [9]  0.8599034  1.163218 387    
## [10] 0.8544601  1.155855 442    
## [11] 0.8523810  1.153043 493    
## [12] 0.8490566  1.148546 494    
## [13] 0.8480392  1.147169  59    
## [14] 0.8450704  1.143153 466    
## [15] 0.8446602  1.142599 477    
## [16] 0.8378378  1.133370 509    
## [17] 0.8310502  1.124188 478    
## [18] 0.8288288  1.121183 455    
## [19] 0.8248175  1.115757  60    
## [20] 0.8167053  1.104783 301    
## [21] 0.8090452  1.094421   5    
## [22] 0.8082192  1.093304 429

Conclusions

For the same minimum support, both algorithms must give the same result; any difference would indicate an error in their implementation, since there is only one correct output. ECLAT improves on Apriori in the frequent-itemset extraction step: Apriori has to scan the database multiple times, whereas ECLAT does not need to rescan the database to count the support of k-itemsets (k >= 1). In R, apriori() can output either the frequent itemsets or the association rules, while eclat() outputs only the frequent itemsets; ruleInduction() must then be used to extract rules from those itemsets.
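A minimal sketch of the two routes to the same rules, assuming basket_trans as built above:

# apriori() can target itemsets or rules directly
freq_apriori  <- apriori(basket_trans, parameter = list(support = 0.15, target = "frequent itemsets"))
rules_apriori <- apriori(basket_trans, parameter = list(support = 0.15, confidence = 0.8, target = "rules"))

# eclat() only mines frequent itemsets; rules are derived afterwards
freq_eclat  <- eclat(basket_trans, parameter = list(support = 0.15, minlen = 2))
rules_eclat <- ruleInduction(freq_eclat, basket_trans, confidence = 0.8)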

We see meaningful results from our analysis: the higher the lift value, the stronger the association between the items. The data shows that vegetables is the consequent of every rule, which makes sense because it is an everyday staple and the most frequently purchased item. Let's look at the more interesting item associations (format: antecedent(s) -> consequent), with a worked check of the top rule's lift after the list:

(eggs + yogurt) -> vegetables
(dinner rolls + eggs) -> vegetables
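For that check: vegetables appears in 842 of the 1139 transactions (see the transactions summary above), so the lift of the top rule is

$$\mathrm{lift} = \frac{\mathrm{confidence}}{\mathrm{support(vegetables)}} = \frac{0.8994975}{842/1139} \approx \frac{0.8995}{0.7392} \approx 1.2168,$$

matching the reported value of 1.216779.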

So how is this knowledge useful to grocery stores? Businesses are always looking to optimize their setup and drive up sales, and this kind of analysis can be applied to any retail store or marketplace. Because we now know which items customers tend to buy together, the business can make decisions based on these findings. For example, this store might place its fresh vegetables near the eggs and yogurt, since customers who purchase eggs and yogurt also tend to buy vegetables.
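The rules behind that recommendation can be pulled out directly; a small sketch using arules' %ain% ("all in") match operator:

# rules whose antecedent contains both eggs and yogurt
inspect(subset(apriori_rules, subset = lhs %ain% c("eggs", "yogurt")))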