The dataset used here for association rule analysis is transformed from the Kaggle dataset Random Shopping Cart (https://www.kaggle.com/fanatiks/shopping-cart).
The following packages are loaded for the analysis.
library(rmarkdown)  # paged_table() for table display
library(arules)     # apriori(), eclat() and transaction objects
library(arulesViz)  # plotting methods for association rules
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(wordcloud2) # word cloud of item frequencies
The dataset is loaded and the columns are renamed to meaningful names.
basket <- read.csv("~/dataset_group.csv", header = FALSE)
dim(basket)
## [1] 22343 3
colnames(basket) <- c("Date", "Transaction_Id", "Item")
str(basket)
## 'data.frame': 22343 obs. of 3 variables:
## $ Date : Factor w/ 603 levels "2000-01-01","2000-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Transaction_Id: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Item : Factor w/ 38 levels "all- purpose",..: 38 25 27 20 1 12 31 5 36 4 ...
paged_table(basket)
The data set contains 22343 observations and the following columns:
Date - Categorical variable that tells us the date of the transaction (YYYY-MM-DD format). The column includes dates from 2000-01-01 to 2002-02-26.
Transaction_Id - Quantitative variable that allows us to differentiate the transactions. Rows that share the same value in this field belong to the same transaction, which is why the data set has fewer transactions than observations (verified in the snippet after this list).
Item - Categorical variable containing the products.
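As a quick check, the number of distinct transactions can be counted directly and compared against the total number of rows:
# number of distinct transactions; far fewer than the 22343 rows
length(unique(basket$Transaction_Id))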
The Apriori algorithm finds associative patterns within transactions, also known as association rules. The analysis of these association rules depends on three measures: Support, Confidence, and Lift.
Support
Support can be thought of as the percentage of all transactions that contain a given item or itemset. This is perhaps best understood through a simple equation:
Support(Item1) = (Transactions containing Item1) / (Total transactions)
Confidence
Confidence tells us how likely it is that purchasing Item1 results in a purchase of Item2.
Confidence(Item1 -> Item2) = (Transactions containing both Item1 and Item2) / (Transactions containing Item1)
Lift
Lift refers to how much the chance of Item2 being purchased increases given that Item1 is purchased.
Lift(Item1 -> Item2) = (Confidence(Item1 -> Item2)) / (Support(Item2))
A lift of 1 means there is no association between Item1 and Item2. A lift greater than 1 means the two items are more likely to be bought together. Finally, a lift of less than 1 means the two items are unlikely to be bought together.
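To make these measures concrete, here is a small worked example with made-up counts (all numbers below are hypothetical, not from this dataset):
# hypothetical counts for a toy example
total_transactions <- 1000
n_item1 <- 300  # transactions containing Item1
n_item2 <- 400  # transactions containing Item2
n_both  <- 150  # transactions containing both items
support_item1 <- n_item1 / total_transactions                   # 0.30
confidence_12 <- n_both / n_item1                               # 0.50
lift_12       <- confidence_12 / (n_item2 / total_transactions) # 1.25
c(support = support_item1, confidence = confidence_12, lift = lift_12)
A lift of 1.25 would mean that buying Item1 makes a purchase of Item2 25% more likely than its baseline rate.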
We will check for discrepancies in the dataset such as missing or null values, negative or invalid values, and duplicate records.
colSums(is.na(basket))
## Date Transaction_Id Item
## 0 0 0
any(is.null(basket))
## [1] FALSE
There are no missing or null values in the dataset.
sum(basket$Transaction_Id < 0)  # Transaction_Id is the only numeric column
## [1] 0
There are no negative Transaction_Id values in the dataset.
dim(unique(basket))[1]
## [1] 16753
There are duplicate records, since the dataset contains 22343 records but only 16753 of them are unique.
We will remove these duplicate records from the dataset.
basket <- unique(basket)
dim(basket)
## [1] 16753 3
We have 16753 records and 3 columns in our cleaned data for further analysis.
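As a sanity check, duplicated() should now find no remaining duplicate rows:
sum(duplicated(basket))  # expected to be 0 after unique()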
# create year, month and day variables from the Date column
basket$Date <- as.character(basket$Date)
basket$Year <- sapply(basket$Date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][1]})
basket$Month <- sapply(basket$Date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][2]})
basket$Day <- sapply(basket$Date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][3]})
basket$Date <- as.Date(basket$Date)
str(basket)
## 'data.frame': 16753 obs. of 6 variables:
## $ Date : Date, format: "2000-01-01" "2000-01-01" ...
## $ Transaction_Id: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Item : Factor w/ 38 levels "all- purpose",..: 38 25 27 20 1 12 31 5 36 4 ...
## $ Year : chr "2000" "2000" "2000" "2000" ...
## $ Month : chr "01" "01" "01" "01" ...
## $ Day : chr "01" "01" "01" "01" ...
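For reference, the same three columns could be derived more concisely with format(), since Date is a proper Date object at this point (an equivalent alternative, not the approach used above):
# equivalent extraction of the date parts via format()
basket$Year  <- format(basket$Date, "%Y")
basket$Month <- format(basket$Date, "%m")
basket$Day   <- format(basket$Date, "%d")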
Next, we count how often each item was sold and visualize the item frequencies.
# item frequencies; most_sold is reused later for the word cloud and the top-10 plot
most_sold <- basket %>% count(Item) %>% arrange(desc(n))
most_sold %>%
  arrange(n) %>% # sort by count; this sorts the data frame but NOT the factor levels
  mutate(Item = factor(Item, levels = Item)) %>% # update the factor levels so the plot keeps this order
  ggplot(aes(x = Item, y = n)) +
  geom_segment(aes(xend = Item, yend = 0), color = "steelblue", alpha = 0.8) +
  geom_text(aes(label = n), hjust = -0.4, color = "#A9A9A9", size = 3.5) +
  geom_point(size = 3, color = "steelblue") +
  ggtitle("Items and their frequency") +
  coord_flip() +
  theme_light() +
  theme(panel.border = element_blank(), axis.title = element_blank(),
        axis.text.x = element_blank(), panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
The above plot shows that vegetables is the most bought item, while the remaining items do not differ much from each other in how often they are bought.
Here are all the items in the dataset.
wordcloud2(most_sold, size = 0.5)
Vegetables is the most bought item.
monthly_sold <- basket %>% group_by(Month) %>% count() %>% arrange(desc(n))
ggplot(monthly_sold, aes(x = Month, y = n)) + geom_bar(stat = "identity", aes(fill = as.factor(Month)), show.legend = FALSE) + ggtitle("Monthly Sales") + theme_bw() + scale_fill_brewer(palette = "Spectral")
The above plot shows the monthly purchase frequency. We observe that the most purchases were made in January, after which purchases decreased until April. Purchases suddenly increased in May, decreased again in June, and then rose steadily until September.
yearly_sold <- basket %>% group_by(Year) %>% count() %>% arrange(desc(n))
ggplot(yearly_sold, aes(x = Year, y = n)) + geom_bar(stat = "identity", aes(fill = as.factor(Year)), show.legend = FALSE) + ggtitle("Yearly Sales") + theme_bw() + scale_fill_brewer(palette = "Spectral")
The above plot shows that there were fewer purchases in 2002, but this is probably because the dataset only covers 2002 through February 26.
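This can be confirmed by checking the date range covered in each year:
# date coverage per year; 2002 should end in late February
basket %>% group_by(Year) %>% summarise(first_date = min(Date), last_date = max(Date))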
daily_sold <- basket %>% group_by(Day) %>% count() %>% arrange(desc(n))
ggplot(daily_sold, aes(x = Day, y = n)) + geom_bar(stat = "identity", aes(fill = as.factor(Day)), show.legend = FALSE) + ggtitle("Daily Sales") + theme_bw()
The above plot shows the day-of-month frequency of purchases.
# keep only the 10 most frequent items (most_sold is sorted by descending n)
top_items <- most_sold$Item[1:10]
top_10 <- basket[basket$Item %in% top_items, ]
ggplot(top_10, aes(x = Item, y = Transaction_Id, fill = Item)) + geom_boxplot(show.legend = FALSE) + ggtitle("No. of transaction by Top 10 items") + theme_bw() + scale_fill_brewer(palette = "Spectral")
The above boxplot shows the number of transactions of top 10 purchased items.
basket_sizes <- basket %>% group_by(Transaction_Id) %>% summarise(total = n())
ggplot(basket_sizes, aes(x = total)) + geom_histogram(fill = "indianred", binwidth = 1) + geom_rug() + ggtitle("No. of transactions with different basket sizes") + theme_bw() + xlab("Basket size") + ylab("No. of transactions")
The above plot shows the number of transactions with different basket sizes.
We will create a transactions object from the basket data frame for Market Basket Analysis / Association Rule Mining.
# group items by transaction id, then coerce the list into an arules transactions object
basket_trans <- split(x = basket[, "Item"], f = basket$Transaction_Id)
basket_trans <- as(basket_trans, "transactions")
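For reference, arules can also build an equivalent transactions object straight from the raw file with read.transactions() in "single" format (one item per row, identified by a transaction id column); a sketch assuming the same CSV layout as above (column 2 = transaction id, column 3 = item):
# one-step alternative to split() + as(), reading the file directly
basket_trans_alt <- read.transactions("~/dataset_group.csv", format = "single",
                                      sep = ",", cols = c(2, 3), rm.duplicates = TRUE)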
summary(basket_trans)
## transactions as itemMatrix in sparse format with
## 1139 rows (elements/itemsets/transactions) and
## 38 columns (items) and a density of 0.3870662
##
## most frequent items:
## vegetables poultry ice cream cereals lunch meat (Other)
## 842 480 454 451 450 14076
##
## element (itemset/transaction) length distribution:
## sizes
## 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 12 34 35 41 51 56 62 67 48 55 71 64 58 79 77 75 91 54 58 25 16 7 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 10.00 15.00 14.71 19.00 26.00
##
## includes extended item information - examples:
## labels
## 1 all- purpose
## 2 aluminum foil
## 3 bagels
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
itemFrequencyPlot(basket_trans, topN = 20, type = "absolute")
apriori_rules <- apriori(basket_trans, parameter = list(sup = 0.15, conf = 0.8, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.15 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 170
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1139 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.01s].
## writing ... [22 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(apriori_rules)
## set of 22 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 6 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.250 3.000 2.727 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.1501 Min. :0.8082 Min. :1.093 Min. :171.0
## 1st Qu.:0.1530 1st Qu.:0.8327 1st Qu.:1.126 1st Qu.:174.2
## Median :0.1572 Median :0.8507 Median :1.151 Median :179.0
## Mean :0.1968 Mean :0.8541 Mean :1.155 Mean :224.1
## 3rd Qu.:0.2524 3rd Qu.:0.8765 3rd Qu.:1.186 3rd Qu.:287.5
## Max. :0.3266 Max. :0.8995 Max. :1.217 Max. :372.0
##
## mining info:
## data ntransactions support confidence
## basket_trans 1139 0.15 0.8
plot(apriori_rules)
plot(head(sort(apriori_rules, by = "lift"), 22), method = "graph")
plot(apriori_rules, method = "grouped")
inspect(sort(apriori_rules, by = "lift", decreasing = T))
## lhs rhs support
## [1] {eggs,yogurt} => {vegetables} 0.1571554
## [2] {dinner rolls,eggs} => {vegetables} 0.1562774
## [3] {dishwashing liquid/detergent,eggs} => {vegetables} 0.1536435
## [4] {cereals,laundry detergent} => {vegetables} 0.1510097
## [5] {cheeses,eggs} => {vegetables} 0.1501317
## [6] {eggs,poultry} => {vegetables} 0.1553995
## [7] {cereals,eggs} => {vegetables} 0.1510097
## [8] {aluminum foil,yogurt} => {vegetables} 0.1527656
## [9] {mixes,poultry} => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## [11] {lunch meat,waffles} => {vegetables} 0.1571554
## [12] {lunch meat,poultry} => {vegetables} 0.1580334
## [13] {poultry,sugar} => {vegetables} 0.1518876
## [14] {eggs,soda} => {vegetables} 0.1580334
## [15] {poultry,yogurt} => {vegetables} 0.1527656
## [16] {eggs} => {vegetables} 0.3266023
## [17] {yogurt} => {vegetables} 0.3195786
## [18] {dinner rolls,poultry} => {vegetables} 0.1615452
## [19] {sugar} => {vegetables} 0.2976295
## [20] {laundry detergent} => {vegetables} 0.3090430
## [21] {sandwich loaves} => {vegetables} 0.2827041
## [22] {aluminum foil} => {vegetables} 0.3107989
## confidence lift count
## [1] 0.8994975 1.216779 179
## [2] 0.8989899 1.216092 178
## [3] 0.8974359 1.213990 175
## [4] 0.8911917 1.205543 172
## [5] 0.8860104 1.198534 171
## [6] 0.8805970 1.191211 177
## [7] 0.8643216 1.169195 172
## [8] 0.8613861 1.165224 174
## [9] 0.8599034 1.163218 178
## [10] 0.8544601 1.155855 182
## [11] 0.8523810 1.153043 179
## [12] 0.8490566 1.148546 180
## [13] 0.8480392 1.147169 173
## [14] 0.8450704 1.143153 180
## [15] 0.8446602 1.142599 174
## [16] 0.8378378 1.133370 372
## [17] 0.8310502 1.124188 364
## [18] 0.8288288 1.121183 184
## [19] 0.8248175 1.115757 339
## [20] 0.8167053 1.104783 352
## [21] 0.8090452 1.094421 322
## [22] 0.8082192 1.093304 354
We now mine the frequent itemsets with the ECLAT algorithm for comparison.
itemsets <- eclat(data = basket_trans, parameter = list(support = 0.15, minlen = 2))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.15 2 10 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 170
##
## create itemset ...
## set transactions ...[38 item(s), 1139 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating bit matrix ... [38 row(s), 1139 column(s)] done [0.00s].
## writing ... [518 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
eclat_rules <- ruleInduction(itemsets, basket_trans, confidence = 0.8)
summary(eclat_rules)
## set of 22 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 6 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.250 3.000 2.727 3.000 3.000
##
## summary of quality measures:
## support confidence lift itemset
## Min. :0.1501 Min. :0.8082 Min. :1.093 Min. : 5.0
## 1st Qu.:0.1530 1st Qu.:0.8327 1st Qu.:1.126 1st Qu.:397.2
## Median :0.1572 Median :0.8507 Median :1.151 Median :454.5
## Mean :0.1968 Mean :0.8541 Mean :1.155 Mean :394.1
## 3rd Qu.:0.2524 3rd Qu.:0.8765 3rd Qu.:1.186 3rd Qu.:489.2
## Max. :0.3266 Max. :0.8995 Max. :1.217 Max. :509.0
##
## mining info:
## data ntransactions support confidence
## basket_trans 1139 0.15 0.8
plot(eclat_rules)
plot(head(sort(eclat_rules, by = "lift"), 22), method = "graph")
plot(eclat_rules, method = "grouped")
inspect(sort(eclat_rules, by = "lift", decreasing = T)[1:22])
## lhs rhs support
## [1] {eggs,yogurt} => {vegetables} 0.1571554
## [2] {dinner rolls,eggs} => {vegetables} 0.1562774
## [3] {dishwashing liquid/detergent,eggs} => {vegetables} 0.1536435
## [4] {cereals,laundry detergent} => {vegetables} 0.1510097
## [5] {cheeses,eggs} => {vegetables} 0.1501317
## [6] {eggs,poultry} => {vegetables} 0.1553995
## [7] {cereals,eggs} => {vegetables} 0.1510097
## [8] {aluminum foil,yogurt} => {vegetables} 0.1527656
## [9] {mixes,poultry} => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## [11] {lunch meat,waffles} => {vegetables} 0.1571554
## [12] {lunch meat,poultry} => {vegetables} 0.1580334
## [13] {poultry,sugar} => {vegetables} 0.1518876
## [14] {eggs,soda} => {vegetables} 0.1580334
## [15] {poultry,yogurt} => {vegetables} 0.1527656
## [16] {eggs} => {vegetables} 0.3266023
## [17] {yogurt} => {vegetables} 0.3195786
## [18] {dinner rolls,poultry} => {vegetables} 0.1615452
## [19] {sugar} => {vegetables} 0.2976295
## [20] {laundry detergent} => {vegetables} 0.3090430
## [21] {sandwich loaves} => {vegetables} 0.2827041
## [22] {aluminum foil} => {vegetables} 0.3107989
## confidence lift itemset
## [1] 0.8994975 1.216779 476
## [2] 0.8989899 1.216092 454
## [3] 0.8974359 1.213990 441
## [4] 0.8911917 1.205543 300
## [5] 0.8860104 1.198534 501
## [6] 0.8805970 1.191211 508
## [7] 0.8643216 1.169195 507
## [8] 0.8613861 1.165224 428
## [9] 0.8599034 1.163218 387
## [10] 0.8544601 1.155855 442
## [11] 0.8523810 1.153043 493
## [12] 0.8490566 1.148546 494
## [13] 0.8480392 1.147169 59
## [14] 0.8450704 1.143153 466
## [15] 0.8446602 1.142599 477
## [16] 0.8378378 1.133370 509
## [17] 0.8310502 1.124188 478
## [18] 0.8288288 1.121183 455
## [19] 0.8248175 1.115757 60
## [20] 0.8167053 1.104783 301
## [21] 0.8090452 1.094421 5
## [22] 0.8082192 1.093304 429
For the same minimum support, both algorithms must give the same result; there is only one correct output, so any difference would indicate an error in the implementation. ECLAT improves on Apriori in the frequent-itemset extraction step: Apriori has to scan the database multiple times, whereas ECLAT does not need to rescan the database to count the support of k-itemsets (k >= 1). In R, apriori() can output either the frequent itemsets or the association rules, whereas eclat() outputs only the frequent itemsets; ruleInduction() must then be used to extract rules from those itemsets.
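Assuming the set operations that arules provides for rule sets, this claim can be checked directly:
# the two rule sets should contain exactly the same rules
setequal(apriori_rules, eclat_rules)  # expected: TRUE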
We see meaningful results from our analysis: the higher the lift value, the stronger the association between the items. The data clearly shows that vegetables is the most popular consequent (in fact it is the consequent of every rule found), which makes sense because it is an essential item. Let's look at the more interesting item associations (format: antecedent(s) -> consequent):
(eggs + yogurt) -> vegetables
(dinner rolls + eggs) -> vegetables
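To examine one of these rules in isolation, arules supports subsetting rules by their antecedent with the %ain% ("matches all") operator; for example:
# inspect the rule whose antecedent contains both eggs and yogurt
inspect(subset(apriori_rules, subset = lhs %ain% c("eggs", "yogurt")))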
So how is this useful knowledge for grocery stores? Businesses are always looking to optimize their setup and drive up their sales, and stores are no different; this kind of analysis can be done on any kind of retail store or marketplace. Because we now know the associations between items and the common interests of the customers, the business can make decisions based on these findings. For example, this store might want to place its fresh vegetables near its eggs and yogurt, since customers who purchase eggs and yogurt seem to also be enticed by vegetables.