Market Basket Analysis using Apriori Algorithms and Association Rules

An Overview of Association Rules

Association rule mining is a data mining approach used to explore large transactional datasets and identify recurring patterns and rules. These patterns describe relationships between the items that appear together in transactions. The approach is often referred to as market basket analysis because it is widely used to study customer purchasing habits. Rules learned from past transactions help identify and anticipate purchasing behavior. Using this approach, we can answer questions such as which items customers tend to buy together, that is, which sets of goods occur frequently, and how strongly products are associated with one another.

Besides increasing sales, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to co-occur can help to improve patient care and medicine prescription.

Libraries

The following packages are used to analyze the dataset.

library("arules")
library("arulesViz")
library("plotly")

Exploration of Dataset

The dataset used in this article describes 9835 transactions by customers in a grocery shop. The dataset was downloaded from Kaggle but I made a few modifications.

# Read the raw CSV as a plain data frame (the file is comma-separated)
shop <- read.csv("groceries_data.csv")
nrow(shop)

## [1] 9835

ncol(shop)

## [1] 32

The `nrow(shop)` command returns the number of rows, one per transaction, while `ncol(shop)` returns the number of columns in the raw file, which here is 32. That figure reflects the layout of the file rather than the number of distinct items on sale. Next, I will read the data as transactions using the `arules` package.

trans<-read.transactions("groceries_data.csv", format = "basket", sep=",", header = TRUE)
trans

## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

The output above shows that the grocery store sells 169 distinct items and that all 9835 transactions draw on this list of 169 products. With so many baskets sharing a limited catalogue, it is reasonable to expect relationships between the goods that customers pick together.

More vital information can be obtained by utilising the `summary` function.

summary(trans)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The most frequent items section of the summary lists the items purchased most often. In this dataset, the most purchased item is ‘whole milk’ (2513 transactions), followed by ‘other vegetables’ (1903 transactions).

The length distribution section shows how many baskets contain a given number of items. For example, 2159 baskets contained just one item, whilst only one basket contained 32 items.

The summary also shows that the mean number of items in a basket is about 4.4, the median is 3, and the largest basket holds 32 items.

To visualize this information, a bar graph shows the 10 most purchased items from the supermarket.
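The mean basket size and the density reported by `summary(trans)` can be reproduced by hand from the length distribution printed above. The following is a minimal base R sketch; the vectors `sizes` and `counts` are simply typed in from that output, and the variable names are illustrative.

```r
# Basket sizes and the number of baskets of each size, copied from summary(trans)
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

n_transactions <- sum(counts)          # total number of baskets
n_purchases    <- sum(sizes * counts)  # total number of item occurrences
n_items        <- 169                  # distinct items, from the summary

mean_size <- n_purchases / n_transactions              # average items per basket
density   <- n_purchases / (n_transactions * n_items)  # share of filled cells

round(mean_size, 3)
round(density, 8)
```

The mean comes out at roughly 4.409 items per basket and the density at roughly 0.0261, in line with the summary output.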

itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")

To get an idea of the less frequent items, I sort the items by frequency and look at the tail of the list.

tail(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=10)

##        salad dressing                whisky        toilet cleaner 
##                     8                     8                     7 
##        baby cosmetics        frozen chicken                  bags 
##                     6                     6                     4 
##       kitchen utensil preservation products             baby food 
##                     4                     2                     1 
##  sound storage medium 
##                     1

Association Rules

Association rule analysis is a technique for uncovering how items are associated with each other. There are three common measures of association: support, confidence and lift. These serve as the constraints used to select the best rules from the set of possible rules. In this paper, I set the support threshold to 0.01, meaning that an itemset must appear in roughly 1% of the 9835 transactions (about 98 baskets), and the confidence threshold to 0.50, that is, 50%. The rules were generated as shown below.

rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.50))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
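To make the three measures concrete before each is discussed in turn, here is a small base R sketch that computes them from raw baskets. The baskets are a hypothetical toy example and are not taken from the grocery data; the helper `contains()` is an illustrative name, not an `arules` function.

```r
# Hypothetical toy baskets (not from the grocery dataset)
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt"),
  c("bread"),
  c("milk", "bread", "butter")
)

# TRUE for each basket that contains every item in `items`
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

lhs <- "bread"
rhs <- "milk"

support    <- mean(contains(c(lhs, rhs)))    # share of baskets holding both sides
confidence <- support / mean(contains(lhs))  # share of lhs baskets that also hold rhs
lift       <- confidence / mean(contains(rhs))  # confidence relative to rhs frequency

c(support = support, confidence = confidence, lift = lift)
```

For the rule `{bread} => {milk}` in this toy data the support is 0.6, the confidence is 0.75 and the lift is 0.9375. A lift below 1 indicates that the two items appear together slightly less often than they would if they were unrelated.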

Support

The support of a rule A => B is the number of transactions in which all the items of A and B end up in the same basket, divided by the total number of transactions.

\[ \text{Support} = \frac{\text{Number of Transactions with both A and B}}{\text{Total Number of Transactions}} \]
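As a quick check of the formula, the support values in the rule listing of this section can be recomputed from the `count` column. This is a minimal sketch in base R; the counts 219 and 99 are read off that listing and the variable names are illustrative.

```r
n_transactions <- 9835  # total number of transactions in the dataset

# Baskets containing every item of the rule, taken from the count column
count_highest <- 219  # {other vegetables, yogurt} => {whole milk}
count_lowest  <- 99   # {curd, yogurt} => {whole milk}

support_highest <- count_highest / n_transactions
support_lowest  <- count_lowest / n_transactions

# Number of baskets implied by the 0.01 support threshold
threshold_count <- 0.01 * n_transactions

round(c(support_highest, support_lowest), 8)
threshold_count
```

The two supports come out at about 0.0223 and 0.0101, and the threshold corresponds to 98.35 baskets, which is why no rule with fewer than 99 occurrences appears in the listing.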

library("DT")
support_rules <- sort(rules, by = "support", decreasing = TRUE)
support_table <- inspect(support_rules)

##      lhs                                       rhs                support   
## [1]  {other vegetables, yogurt}             => {whole milk}       0.02226741
## [2]  {tropical fruit, yogurt}               => {whole milk}       0.01514997
## [3]  {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
## [4]  {root vegetables, yogurt}              => {whole milk}       0.01453991
## [5]  {other vegetables, pip fruit}          => {whole milk}       0.01352313
## [6]  {root vegetables, yogurt}              => {other vegetables} 0.01291307
## [7]  {rolls/buns, root vegetables}          => {whole milk}       0.01270971
## [8]  {domestic eggs, other vegetables}      => {whole milk}       0.01230300
## [9]  {root vegetables, tropical fruit}      => {other vegetables} 0.01230300
## [10] {rolls/buns, root vegetables}          => {other vegetables} 0.01220132
## [11] {root vegetables, tropical fruit}      => {whole milk}       0.01199797
## [12] {butter, other vegetables}             => {whole milk}       0.01148958
## [13] {whipped/sour cream, yogurt}           => {whole milk}       0.01087951
## [14] {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [15] {curd, yogurt}                         => {whole milk}       0.01006609
##      confidence coverage   lift     count
## [1]  0.5128806  0.04341637 2.007235 219  
## [2]  0.5173611  0.02928317 2.024770 149  
## [3]  0.5070423  0.02887646 1.984385 144  
## [4]  0.5629921  0.02582613 2.203354 143  
## [5]  0.5175097  0.02613116 2.025351 133  
## [6]  0.5000000  0.02582613 2.584078 127  
## [7]  0.5230126  0.02430097 2.046888 125  
## [8]  0.5525114  0.02226741 2.162336 121  
## [9]  0.5845411  0.02104728 3.020999 121  
## [10] 0.5020921  0.02430097 2.594890 120  
## [11] 0.5700483  0.02104728 2.230969 118  
## [12] 0.5736041  0.02003050 2.244885 113  
## [13] 0.5245098  0.02074225 2.052747 107  
## [14] 0.5862069  0.01769192 3.029608 102  
## [15] 0.5823529  0.01728521 2.279125  99

datatable(support_table)

Sorting the rules by support shows that the rule with the highest support is `{other vegetables, yogurt} => {whole milk}`, which appeared in 219 transactions, about 2.2% of the total. The rule with the lowest support was `{curd, yogurt} => {whole milk}`, which occurred only 99 times, about 1% of all transactions.

Confidence

The confidence of a rule A => B is the number of transactions containing both A and B divided by the number of transactions containing A. It measures how often B is bought when A is already in the basket.

\[ \text{Confidence} = \frac{\text{Number of Transactions with both A and B}}{\text{Total Number of Transactions with A}} \]
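The confidence formula can be checked against the rule `{citrus fruit, root vegetables} => {other vegetables}` from the listing in this section. The sketch below is base R only; it assumes that the `coverage` column is the support of the left-hand side, so that multiplying it by the number of transactions recovers the number of baskets containing A. Variable names are illustrative.

```r
n_transactions <- 9835

count_both <- 102         # baskets with A and B, from the count column
coverage   <- 0.01769192  # support of A alone, from the coverage column

# Recover the number of baskets that contain the left-hand side A
count_lhs <- round(coverage * n_transactions)

confidence <- count_both / count_lhs
round(confidence, 7)
```

The left-hand side appears in 174 baskets, 102 of which also contain other vegetables, which gives a confidence of about 0.586.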

confidence_rules <- sort(rules, by = "confidence", decreasing = TRUE)
confidence_table <- inspect(confidence_rules)

##      lhs                                       rhs                support   
## [1]  {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit}      => {other vegetables} 0.01230300
## [3]  {curd, yogurt}                         => {whole milk}       0.01006609
## [4]  {butter, other vegetables}             => {whole milk}       0.01148958
## [5]  {root vegetables, tropical fruit}      => {whole milk}       0.01199797
## [6]  {root vegetables, yogurt}              => {whole milk}       0.01453991
## [7]  {domestic eggs, other vegetables}      => {whole milk}       0.01230300
## [8]  {whipped/sour cream, yogurt}           => {whole milk}       0.01087951
## [9]  {rolls/buns, root vegetables}          => {whole milk}       0.01270971
## [10] {other vegetables, pip fruit}          => {whole milk}       0.01352313
## [11] {tropical fruit, yogurt}               => {whole milk}       0.01514997
## [12] {other vegetables, yogurt}             => {whole milk}       0.02226741
## [13] {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
## [14] {rolls/buns, root vegetables}          => {other vegetables} 0.01220132
## [15] {root vegetables, yogurt}              => {other vegetables} 0.01291307
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.5823529  0.01728521 2.279125  99  
## [4]  0.5736041  0.02003050 2.244885 113  
## [5]  0.5700483  0.02104728 2.230969 118  
## [6]  0.5629921  0.02582613 2.203354 143  
## [7]  0.5525114  0.02226741 2.162336 121  
## [8]  0.5245098  0.02074225 2.052747 107  
## [9]  0.5230126  0.02430097 2.046888 125  
## [10] 0.5175097  0.02613116 2.025351 133  
## [11] 0.5173611  0.02928317 2.024770 149  
## [12] 0.5128806  0.04341637 2.007235 219  
## [13] 0.5070423  0.02887646 1.984385 144  
## [14] 0.5020921  0.02430097 2.594890 120  
## [15] 0.5000000  0.02582613 2.584078 127

datatable(confidence_table)

Sorting by confidence, the highest value belongs to the rule `{citrus fruit, root vegetables} => {other vegetables}`, at about 58.6%. This means that 58.6% of the baskets containing citrus fruit and root vegetables also contained other vegetables.

Lift And Expected Confidence

Another vital measure used in association rules is lift. It is the confidence of a rule divided by its expected confidence.

\[ \text{Lift} = \frac{\text{Confidence}}{\text{Expected Confidence}} \]

The expected confidence is the number of transactions with B over the total number of transactions, in other words the support of B on its own.

\[ \text{Expected Confidence} = \frac{\text{Number of Transactions with B}}{\text{Total Number of Transactions}} \]
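The two formulas can be combined into a short base R helper and checked against the highest and lowest lift values in the listing of this section. This is only a sketch: `lift_from_counts()` is a hypothetical helper, the left-hand side counts (174 and 284) are recovered from the `coverage` column multiplied by 9835, and the item counts for other vegetables (1903) and whole milk (2513) come from the `summary(trans)` output.

```r
n_transactions <- 9835

# Lift = confidence / expected confidence,
# where expected confidence is the support of the right-hand side
lift_from_counts <- function(count_both, count_lhs, count_rhs, n = n_transactions) {
  confidence          <- count_both / count_lhs
  expected_confidence <- count_rhs / n
  confidence / expected_confidence
}

# {citrus fruit, root vegetables} => {other vegetables}
lift_top <- lift_from_counts(count_both = 102, count_lhs = 174, count_rhs = 1903)

# {other vegetables, whipped/sour cream} => {whole milk}
lift_low <- lift_from_counts(count_both = 144, count_lhs = 284, count_rhs = 2513)

round(c(lift_top, lift_low), 6)
```

The results, roughly 3.03 and 1.98, agree with the `lift` column reported by `arules`.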

lift_rules <- sort(rules, by = "lift", decreasing = TRUE)
lift_table <- inspect(lift_rules)

##      lhs                                       rhs                support   
## [1]  {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit}      => {other vegetables} 0.01230300
## [3]  {rolls/buns, root vegetables}          => {other vegetables} 0.01220132
## [4]  {root vegetables, yogurt}              => {other vegetables} 0.01291307
## [5]  {curd, yogurt}                         => {whole milk}       0.01006609
## [6]  {butter, other vegetables}             => {whole milk}       0.01148958
## [7]  {root vegetables, tropical fruit}      => {whole milk}       0.01199797
## [8]  {root vegetables, yogurt}              => {whole milk}       0.01453991
## [9]  {domestic eggs, other vegetables}      => {whole milk}       0.01230300
## [10] {whipped/sour cream, yogurt}           => {whole milk}       0.01087951
## [11] {rolls/buns, root vegetables}          => {whole milk}       0.01270971
## [12] {other vegetables, pip fruit}          => {whole milk}       0.01352313
## [13] {tropical fruit, yogurt}               => {whole milk}       0.01514997
## [14] {other vegetables, yogurt}             => {whole milk}       0.02226741
## [15] {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.5020921  0.02430097 2.594890 120  
## [4]  0.5000000  0.02582613 2.584078 127  
## [5]  0.5823529  0.01728521 2.279125  99  
## [6]  0.5736041  0.02003050 2.244885 113  
## [7]  0.5700483  0.02104728 2.230969 118  
## [8]  0.5629921  0.02582613 2.203354 143  
## [9]  0.5525114  0.02226741 2.162336 121  
## [10] 0.5245098  0.02074225 2.052747 107  
## [11] 0.5230126  0.02430097 2.046888 125  
## [12] 0.5175097  0.02613116 2.025351 133  
## [13] 0.5173611  0.02928317 2.024770 149  
## [14] 0.5128806  0.04341637 2.007235 219  
## [15] 0.5070423  0.02887646 1.984385 144

datatable(lift_table)

Sorting by lift, the rule with the highest value was `{citrus fruit, root vegetables} => {other vegetables}`, with a lift of 3.03. This means that baskets containing the LHS items are about 3 times more likely to contain the RHS item than they would be if the two sides were unrelated. The rule with the lowest lift was `{other vegetables, whipped/sour cream} => {whole milk}` with a value of 1.98, meaning that whole milk is about twice as likely to be purchased when the LHS items are in the basket.

Next, I visualize these rules relative to the support, confidence and lift measures. The plot is drawn by `arulesViz` using the `plotly` engine. Hovering the mouse over the plotted points shows the support, confidence and lift values.

plot(rules, engine="plotly")

Rules for Specific Items

Suppose the grocery shop wants to run a clearance sale and attach discounts to particular products. It would be wise to use the association rules to increase sales of products that are frequently purchased together with other products.

This can be achieved by discounting items on the right-hand side (rhs) for customers who purchase the items on the left-hand side (lhs).

For the purpose of this paper, let us check rules for yogurt in our transactions.

Rules for Yogurt

yogurt_rules <- apriori(
    data = trans,
    parameter = list(supp = 0.001, conf = 0.9),
    appearance = list(default = "lhs", rhs = "yogurt"),
    control = list(verbose = FALSE)
  )
yogurt_rules_table <- inspect(yogurt_rules, linebreak = FALSE)

##     lhs                                                           rhs     
## [1] {butter, cream cheese, root vegetables}                    => {yogurt}
## [2] {butter, sliced cheese, tropical fruit, whole milk}        => {yogurt}
## [3] {cream cheese, curd, other vegetables, whipped/sour cream} => {yogurt}
## [4] {butter, other vegetables, tropical fruit, white bread}    => {yogurt}
##     support     confidence coverage    lift     count
## [1] 0.001016777 0.9090909  0.001118454 6.516698 10   
## [2] 0.001016777 0.9090909  0.001118454 6.516698 10   
## [3] 0.001016777 0.9090909  0.001118454 6.516698 10   
## [4] 0.001016777 0.9090909  0.001118454 6.516698 10

datatable(yogurt_rules_table)

Visualization

plot(yogurt_rules, method="graph")

Four rules were created for yogurt, each with a confidence a little over 90% and a lift above 6. This makes it easier to identify the products that sell together with yogurt: a basket containing the items on the left-hand side is more than 6 times as likely to contain yogurt as an average basket. Note that each rule rests on only 10 transactions.
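The figures reported for the yogurt rules can be verified with the same arithmetic. The base R sketch below assumes the left-hand side count is the `coverage` value multiplied by 9835, and takes the yogurt item count (1372) from the `summary(trans)` output; the variable names are illustrative.

```r
n_transactions <- 9835

count_both   <- 10    # baskets matching the whole rule, from the count column
count_lhs    <- round(0.001118454 * n_transactions)  # baskets with the lhs items
count_yogurt <- 1372  # baskets containing yogurt, from summary(trans)

yogurt_confidence <- count_both / count_lhs
yogurt_lift       <- yogurt_confidence / (count_yogurt / n_transactions)

round(c(confidence = yogurt_confidence, lift = yogurt_lift), 6)
```

Ten of the eleven baskets containing the left-hand side also contain yogurt, giving a confidence of about 0.909 and a lift of about 6.52.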

Conclusion

My objective in this paper was to identify how goods should be displayed in the grocery shop so that the purchase of one product can encourage a customer to purchase another. The analysis showed that whole milk was the most frequently purchased item and identified the items that tend to share a basket with it. It would therefore be wise to display whole milk close to yogurt and other vegetables, since customers who buy those two items are about twice as likely to buy whole milk as the average customer.

Ayoola Ayetigbo

1/16/2022
