library(arules)
library(arulesViz)
Market Basket Analysis is one of the main methods large retailers use to uncover relationships between products. It works by searching for combinations of products that frequently appear together in transactions, which lets retailers determine whether the items customers purchase are associated. Association rules are widely used to analyse retail basket or transaction data. They aim to identify strong rules in that data using measures of interestingness such as support, confidence, and lift.
The dataset contains 38765 rows, each recording one item purchased by a customer at a grocery retailer. Market Basket Analysis, using algorithms such as Apriori, can be applied to these purchases to generate association rules between items.
Apriori is an algorithm for mining frequent itemsets and learning association rules over relational databases. It first determines which individual items occur frequently in the database and then extends them into progressively larger itemsets, as long as those itemsets still occur frequently enough. The frequent itemsets found by Apriori can then be used to derive association rules that highlight general patterns in the database, which is useful in applications such as market basket analysis.
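The level-wise search described above can be illustrated with a minimal base-R sketch. It is a simplified illustration, not the implementation used by the arules package: the toy baskets, the `support()` helper, and `apriori_sketch()` are all hypothetical names introduced here, and the full subset-pruning step of Apriori is omitted.

```r
# Toy transactions (hypothetical data, not the grocery dataset)
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "soda"),
  c("bread", "butter"),
  c("milk", "bread", "butter", "soda")
)

# Fraction of baskets that contain every item in `itemset`
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level-wise search: frequent k-itemsets are extended to (k+1)-itemsets
apriori_sketch <- function(baskets, min_support) {
  items <- sort(unique(unlist(baskets)))
  current <- Filter(function(s) support(s, baskets) >= min_support,
                    as.list(items))
  frequent_items <- unlist(current)  # only frequent items can extend a set
  frequent <- current
  while (length(current) > 0) {
    candidates <- list()
    for (s in current) {
      # Extend with items that sort after the last one to avoid duplicates
      for (i in frequent_items[frequent_items > s[length(s)]]) {
        candidates[[length(candidates) + 1]] <- c(s, i)
      }
    }
    current <- Filter(function(s) support(s, baskets) >= min_support,
                      candidates)
    frequent <- c(frequent, current)
  }
  frequent
}

frequent_sets <- apriori_sketch(baskets, min_support = 0.6)
frequent_labels <- vapply(frequent_sets, paste, character(1), collapse = ", ")
print(frequent_labels)
```

The search stops as soon as no candidate at the current size reaches the minimum support, which is what keeps the number of itemsets to test manageable.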
First, the dataset is read and inspected:
data <- read.csv("Groceries_dataset.csv")
head(data)
The dimensions of the dataset are checked:
dim(data)
## [1] 38765 3
The summary of the dataset is obtained.
summary(data)
## Member_number Date itemDescription
## Min. :1000 Length:38765 Length:38765
## 1st Qu.:2002 Class :character Class :character
## Median :3005 Mode :character Mode :character
## Mean :3004
## 3rd Qu.:4007
## Max. :5000
The first column is removed, and the number of unique items in the dataset is checked; there are 167.
data <- data[,c(2:3)]
length(unique(data$itemDescription))
## [1] 167
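Next, the data are read as transactions and the summary is printed. Because cols = c(2,3) uses the Date column as the transaction ID, each transaction here is the set of all items bought on one date, not a single customer's basket. The summary shows that whole milk and other vegetables are the most frequently occurring items, and that the median transaction size is 35 items. The summary reports 168 items and one transaction of size 1, compared with the 167 unique items counted above; this is most likely the CSV header row being read as data, since header = TRUE is not set.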
transactions_data <- read.transactions("Groceries_dataset.csv",
                                       format = "single",
                                       sep = ",",
                                       cols = c(2, 3))
summary(transactions_data)
## transactions as itemMatrix in sparse format with
## 729 rows (elements/itemsets/transactions) and
## 168 columns (items) and a density of 0.2060226
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 697 666 649 617
## yogurt (Other)
## 600 22003
##
## element (itemset/transaction) length distribution:
## sizes
## 1 17 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
## 1 1 2 3 6 6 5 11 16 24 21 32 33 24 45 42 47 33 61 34 47 41 30 40 23 22
## 43 44 45 46 47 48 49 51 52 55 56 59
## 22 16 10 5 11 3 6 1 2 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 30.00 35.00 34.61 39.00 59.00
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
##
## includes extended transaction information - examples:
## transactionID
## 1 01-01-2014
## 2 01-01-2015
## 3 01-02-2014
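The transaction IDs in the summary are dates, so each transaction pools every customer's purchases for that day. If one basket per customer per day is wanted instead, one possible approach is to build a combined ID before grouping. The sketch below uses a small hypothetical data frame in the same three-column layout as the CSV; the `purchases`, `basket_id`, and `baskets` names are assumptions made for illustration.

```r
# Hypothetical rows in the same three-column layout as Groceries_dataset.csv
purchases <- data.frame(
  Member_number   = c(1001, 1001, 1002, 1001, 1002),
  Date            = c("01-01-2014", "01-01-2014", "01-01-2014",
                      "02-01-2014", "02-01-2014"),
  itemDescription = c("whole milk", "soda", "yogurt",
                      "rolls/buns", "whole milk"),
  stringsAsFactors = FALSE
)

# One basket per customer per day instead of one basket per day
basket_id <- paste(purchases$Member_number, purchases$Date, sep = "_")
baskets <- split(purchases$itemDescription, basket_id)

print(length(baskets))   # number of customer-day baskets
print(lengths(baskets))  # items per basket

# With arules loaded, the list could then be coerced with
# as(baskets, "transactions"); that step is not run here.
```

Baskets built this way are much smaller than the daily transactions above, so item supports would be far lower and the thresholds used later would need to be adjusted.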
The top 5 most frequent items in the dataset are plotted below. Whole milk, other vegetables, and rolls/buns are the three most frequently occurring items.
itemFrequencyPlot(transactions_data,
                  topN = 5,
                  type = "absolute",
                  main = "Top 5 items by frequency",
                  col = "steel blue")
Frequent itemsets must be mined before rules can be derived from the data. Apriori is the classic choice, but I will use the Eclat algorithm, a different frequent-itemset miner that works on vertical transaction-ID lists and is often faster on large databases. Eclat finds the item combinations that occur together most often. The output shows that 16557 frequent itemsets are found with a minimum support of 0.20 and a maximum length of 10.
association_rules <- eclat(transactions_data,
                           parameter = list(supp = 0.20,
                                            maxlen = 10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.2 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 145
##
## create itemset ...
## set transactions ...[168 item(s), 729 transaction(s)] done [0.00s].
## sorting and recoding items ... [60 item(s)] done [0.00s].
## creating bit matrix ... [60 row(s), 729 column(s)] done [0.00s].
## writing ... [16557 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
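The idea behind Eclat can be sketched in a few lines of base R: each item is mapped to the list of transaction IDs that contain it, and the support of an itemset is obtained by intersecting those lists. This is a toy illustration under assumed data, not the arules implementation; `baskets`, `tid_lists`, and `itemset_support()` are hypothetical names.

```r
# Toy transactions with IDs (hypothetical data)
baskets <- list(
  t1 = c("milk", "bread", "butter"),
  t2 = c("milk", "bread"),
  t3 = c("milk", "soda"),
  t4 = c("bread", "butter")
)

# Vertical layout: for each item, the IDs of the transactions containing it
items <- unlist(baskets, use.names = FALSE)
tids  <- rep(names(baskets), lengths(baskets))
tid_lists <- split(tids, items)

# Support of an itemset = size of the intersection of its items' tid-lists,
# divided by the number of transactions
itemset_support <- function(itemset) {
  shared <- Reduce(intersect, tid_lists[itemset])
  length(shared) / length(baskets)
}

print(tid_lists$milk)
print(itemset_support(c("milk", "bread")))
```

Because supports come from set intersections, the transactions do not have to be rescanned for every candidate itemset.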
Next, the top 10 itemsets by support are printed below. Support is the proportion of transactions that contain an itemset, whereas count is the actual number of transactions in which it appears. The listing shows that the vast majority of transactions contain whole milk (about 96%) and other vegetables (about 91%).
inspect(head(sort(association_rules,
                  by = "support"),
             10))
## items support count
## [1] {whole milk} 0.9561043 697
## [2] {other vegetables} 0.9135802 666
## [3] {rolls/buns} 0.8902606 649
## [4] {other vegetables, whole milk} 0.8737997 637
## [5] {rolls/buns, whole milk} 0.8504801 620
## [6] {soda} 0.8463649 617
## [7] {yogurt} 0.8230453 600
## [8] {soda, whole milk} 0.8161866 595
## [9] {other vegetables, rolls/buns} 0.8148148 594
## [10] {whole milk, yogurt} 0.7873800 574
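As a quick check, the support column can be reproduced from the count column and the 729 transactions reported in the summary. The snippet below copies three counts from the listing above; the variable names are chosen here for illustration.

```r
n_transactions <- 729  # from the transaction summary

# Counts copied from the listing above
counts <- c("whole milk"                   = 697,
            "other vegetables"             = 666,
            "other vegetables, whole milk" = 637)

# Support is the count divided by the number of transactions
support <- counts / n_transactions
print(round(support, 7))
```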
Confidence is the probability that A and B occur together divided by the probability of A, that is, the conditional probability of B given A. According to the results, the confidence for whole milk reaches 1 when a transaction already contains chicken, citrus fruit, tropical fruit, and yogurt. Because the majority of the baskets contain whole milk, most of the rules also have whole milk on the right-hand side as a natural consequence, and the lift values close to 1 show that these rules add little beyond the base rate of whole milk.
frequent_rules <- ruleInduction(association_rules,
                                transactions_data,
                                confidence = 0.8)
inspect(head(sort(frequent_rules,
                  by = "confidence",
                  decreasing = TRUE),
             10))
## lhs rhs support confidence lift itemset
## [1] {chicken,
## citrus fruit,
## tropical fruit,
## yogurt} => {whole milk} 0.2043896 1.0000000 1.045911 2041
## [2] {brown bread,
## butter,
## rolls/buns,
## soda} => {whole milk} 0.2167353 1.0000000 1.045911 6528
## [3] {butter,
## root vegetables,
## soda,
## whipped/sour cream} => {whole milk} 0.2112483 1.0000000 1.045911 6578
## [4] {butter,
## other vegetables,
## rolls/buns,
## soda,
## whipped/sour cream} => {whole milk} 0.2153635 1.0000000 1.045911 6597
## [5] {butter,
## rolls/buns,
## soda,
## whipped/sour cream} => {whole milk} 0.2331962 1.0000000 1.045911 6598
## [6] {butter,
## soda,
## whipped/sour cream} => {whole milk} 0.2524005 0.9945946 1.040257 6601
## [7] {chicken,
## frankfurter} => {whole milk} 0.2427984 0.9943820 1.040035 1867
## [8] {brown bread,
## butter,
## soda} => {whole milk} 0.2359396 0.9942197 1.039865 6530
## [9] {chicken,
## citrus fruit,
## tropical fruit} => {whole milk} 0.2345679 0.9941860 1.039830 2046
## [10] {chicken,
## citrus fruit,
## other vegetables,
## yogurt} => {whole milk} 0.2345679 0.9941860 1.039830 2065
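The relationship between support, confidence, and lift can be made concrete with a small base-R sketch. The toy baskets and the `support()` and `rule_metrics()` helpers are hypothetical and only illustrate the definitions; the last line checks the lift of the first rule in the output above using the values printed there.

```r
# Toy transactions (hypothetical data)
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "soda"),
  c("bread", "butter")
)

# Fraction of baskets that contain every item in `itemset`
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Metrics for the rule lhs => rhs
rule_metrics <- function(lhs, rhs) {
  conf <- support(c(lhs, rhs)) / support(lhs)  # P(rhs | lhs)
  c(support    = support(c(lhs, rhs)),
    confidence = conf,
    lift       = conf / support(rhs))          # confidence relative to P(rhs)
}

print(rule_metrics(lhs = "butter", rhs = "bread"))

# First rule above: confidence 1 and support({whole milk}) = 0.9561043
print(round(1.0 / 0.9561043, 6))
```

A lift this close to 1 means the left-hand side barely changes the chance of finding whole milk, which is already in almost every transaction.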
Because whole milk is the most frequently purchased item, the strongest rules in the basket analysis all involve it, so they reveal little about other products. Let’s examine the rules item by item to identify a wider variety of recurring patterns, beginning with other vegetables, the second most frequently purchased product. I will use the apriori algorithm because it makes it possible to restrict rules to a specific product. The confidence threshold needs to be lowered in order to do this.
other_vegetables <- apriori(transactions_data,
                            parameter = list(supp = 0.1,
                                             conf = 0.48),
                            appearance = list(default = "lhs",
                                              rhs = "other vegetables"),
                            control = list(verbose = FALSE))
inspect(head(sort(other_vegetables,
                  by = "support",
                  decreasing = TRUE),
             10))
## lhs rhs support confidence
## [1] {} => {other vegetables} 0.9135802 0.9135802
## [2] {whole milk} => {other vegetables} 0.8737997 0.9139168
## [3] {rolls/buns} => {other vegetables} 0.8148148 0.9152542
## [4] {rolls/buns, whole milk} => {other vegetables} 0.7777778 0.9145161
## [5] {soda} => {other vegetables} 0.7722908 0.9124797
## [6] {yogurt} => {other vegetables} 0.7572016 0.9200000
## [7] {soda, whole milk} => {other vegetables} 0.7434842 0.9109244
## [8] {whole milk, yogurt} => {other vegetables} 0.7242798 0.9198606
## [9] {root vegetables} => {other vegetables} 0.6886145 0.9127273
## [10] {rolls/buns, soda} => {other vegetables} 0.6844993 0.9139194
## coverage lift count
## [1] 1.0000000 1.0000000 666
## [2] 0.9561043 1.0003684 637
## [3] 0.8902606 1.0018323 594
## [4] 0.8504801 1.0010244 567
## [5] 0.8463649 0.9987954 563
## [6] 0.8230453 1.0070270 552
## [7] 0.8161866 0.9970929 542
## [8] 0.7873800 1.0068745 528
## [9] 0.7544582 0.9990663 502
## [10] 0.7489712 1.0003713 499
The rules above are shown as a graph below. Rules for other items can be created in the same way as for other vegetables.
plot(other_vegetables,
     method = "graph")