library(bookdown)

1 Introduction

Market basket analysis is the process of finding the most common in-store shopping patterns. It analyzes transaction databases to determine which combinations of items tend to be purchased together, detecting products whose presence in a transaction increases the chance that other products, or combinations of them, appear as well.

Market basket analysis helps optimize the assortment and inventory, arrange goods in the sales area, and increase sales by offering related products to customers. For example, if the analysis shows that bread and butter are typically purchased together, placing them on the same display may encourage the buyer to purchase both.

The study addresses two questions:
a. Which products are purchased together?
b. Which rules usefully describe consumer behavior?

Transaction data was analyzed with association rules in order to find recurring patterns in the sale of goods.

2 Database and description of variables

The dataset comes from Kaggle and contains 9002 transactions: https://www.kaggle.com/apmonisha08/market-basket-analysis?select=Groceries.csv

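# load the packages used below (arulesViz provides the rule plots)
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
## Loading required package: grid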
# read the data
setwd("/Users/nehrebeckiwp.pl/Desktop/UL3")
trans1<-read.transactions("Groceries.csv", rm.duplicates=FALSE, format="basket", sep=",", skip=0)
length(trans1)
## [1] 9002
LIST(head(trans1))
## [[1]]
## [1] "citrus fruit"        "margarine"           "ready soups"        
## [4] "semi-finished bread"
## 
## [[2]]
## [1] "coffee"         "tropical fruit" "yogurt"        
## 
## [[3]]
## [1] "whole milk"
## 
## [[4]]
## [1] "cream cheese" "meat spreads" "pip fruit"    "yogurt"      
## 
## [[5]]
## [1] "condensed milk"           "long life bakery product"
## [3] "other vegetables"         "whole milk"              
## 
## [[6]]
## [1] "abrasive cleaner" "butter"           "rice"             "whole milk"      
## [5] "yogurt"

Before the analysis, it is useful to look at the distribution of basket sizes (the number of items per transaction).

summary(size(trans1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.828   6.000  32.000

The descriptive statistics show that a typical basket contains about 4 item categories (mean 3.8, median 3), while the largest basket contains 32. The minimum of 0 indicates that the file also contains at least one empty transaction.
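A minimal sketch of how basket sizes can be tabulated and empty baskets dropped, written in base R on a made-up list (`toy_baskets` is hypothetical, not the Kaggle data). On the real object, the analogous size vector is the one returned by `size(trans1)` above.

```r
# Hypothetical toy baskets; the empty one mimics an empty transaction
toy_baskets <- list(
  c("whole milk", "yogurt"),
  character(0),
  c("coffee"),
  c("butter", "rice", "whole milk")
)

# Number of items in each basket
basket_sizes <- lengths(toy_baskets)

# How many baskets have each size (size 0 = empty basket)
size_table <- table(basket_sizes)
print(size_table)

# Keep only non-empty baskets before mining rules
non_empty <- toy_baskets[basket_sizes > 0]
length(non_empty)
```

Whether empty transactions should be removed is a judgment call; they do not create rules, but they do enter the denominator of every support value.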

2.1 Basic descriptive statistics

It is worth starting data analysis by verifying the frequency of items.

round(itemFrequency(trans1, type="relative"),3)
##                         `          abrasive cleaner          artif. sweetener 
##                     0.000                     0.003                     0.003 
##            baby cosmetics                 baby food                      bags 
##                     0.001                     0.000                     0.000 
##             baking powder          bathroom cleaner                      beef 
##                     0.016                     0.002                     0.044 
##                   berries                 beverages              bottled beer 
##                     0.028                     0.024                     0.071 
##             bottled water                    brandy               brown bread 
##                     0.097                     0.003                     0.057 
##                    butter               butter milk                  cake bar 
##                     0.049                     0.024                     0.011 
##                   candles                     candy               canned beer 
##                     0.008                     0.026                     0.070 
##               canned fish              canned fruit         canned vegetables 
##                     0.013                     0.003                     0.009 
##                  cat food                   cereals               chewing gum 
##                     0.020                     0.005                     0.018 
##                   chicken                 chocolate     chocolate marshmallow 
##                     0.036                     0.041                     0.007 
##              citrus fruit                   cleaner           cling film/bags 
##                     0.073                     0.004                     0.010 
##              cocoa drinks                    coffee            condensed milk 
##                     0.002                     0.052                     0.009 
##         cooking chocolate                  cookware                     cream 
##                     0.002                     0.003                     0.001 
##              cream cheese                      curd               curd cheese 
##                     0.034                     0.046                     0.004 
##               decalcifier               dental care                   dessert 
##                     0.001                     0.005                     0.032 
##                 detergent              dish cleaner                    dishes 
##                     0.017                     0.009                     0.016 
##                  dog food             domestic eggs  female sanitary products 
##                     0.007                     0.056                     0.005 
##         finished products                      fish                     flour 
##                     0.006                     0.003                     0.015 
##            flower (seeds)    flower soil/fertilizer               frankfurter 
##                     0.009                     0.002                     0.051 
##            frozen chicken            frozen dessert               frozen fish 
##                     0.001                     0.009                     0.009 
##             frozen fruits              frozen meals    frozen potato products 
##                     0.001                     0.023                     0.008 
##         frozen vegetables     fruit/vegetable juice                    grapes 
##                     0.041                     0.063                     0.017 
##                hair spray                       ham            hamburger meat 
##                     0.001                     0.022                     0.028 
##               hard cheese                     herbs                     honey 
##                     0.022                     0.014                     0.001 
##    house keeping products          hygiene articles                 ice cream 
##                     0.007                     0.029                     0.022 
##            instant coffee     Instant food products                       jam 
##                     0.007                     0.007                     0.005 
##                   ketchup            kitchen towels           kitchen utensil 
##                     0.004                     0.005                     0.000 
##               light bulbs                   liqueur                    liquor 
##                     0.004                     0.001                     0.010 
##        liquor (appetizer)                liver loaf  long life bakery product 
##                     0.008                     0.005                     0.032 
##           make up remover            male cosmetics                 margarine 
##                     0.001                     0.004                     0.052 
##                mayonnaise                      meat              meat spreads 
##                     0.008                     0.022                     0.004 
##           misc. beverages                   mustard                   napkins 
##                     0.026                     0.012                     0.043 
##                newspapers                 nut snack               nuts/prunes 
##                     0.069                     0.003                     0.002 
##                       oil                    onions          organic products 
##                     0.024                     0.027                     0.002 
##           organic sausage          other vegetables packaged fruit/vegetables 
##                     0.002                     0.167                     0.012 
##                     pasta                    pastry                  pet care 
##                     0.013                     0.076                     0.008 
##                photo/film        pickled vegetables                 pip fruit 
##                     0.009                     0.017                     0.066 
##                   popcorn                      pork                pot plants 
##                     0.006                     0.050                     0.016 
##           potato products     preservation products          processed cheese 
##                     0.002                     0.000                     0.014 
##                  prosecco            pudding powder               ready soups 
##                     0.002                     0.002                     0.002 
##            red/blush wine                      rice             roll products 
##                     0.017                     0.007                     0.010 
##                rolls/buns           root vegetables           rubbing alcohol 
##                     0.161                     0.096                     0.001 
##                       rum            salad dressing                      salt 
##                     0.004                     0.001                     0.009 
##               salty snack                    sauces                   sausage 
##                     0.032                     0.005                     0.081 
##         seasonal products       semi-finished bread             shopping bags 
##                     0.012                     0.015                     0.084 
##                 skin care             sliced cheese            snack products 
##                     0.003                     0.020                     0.003 
##                      soap                      soda               soft cheese 
##                     0.002                     0.154                     0.015 
##                  softener      sound storage medium                     soups 
##                     0.005                     0.000                     0.006 
##            sparkling wine             specialty bar          specialty cheese 
##                     0.004                     0.023                     0.008 
##       specialty chocolate             specialty fat      specialty vegetables 
##                     0.026                     0.004                     0.001 
##                    spices             spread cheese                     sugar 
##                     0.005                     0.010                     0.029 
##             sweet spreads                     syrup                       tea 
##                     0.009                     0.002                     0.003 
##                   tidbits            toilet cleaner            tropical fruit 
##                     0.002                     0.001                     0.091 
##                    turkey                  UHT-milk                   vinegar 
##                     0.008                     0.029                     0.006 
##                   waffles        whipped/sour cream                    whisky 
##                     0.032                     0.063                     0.001 
##               white bread                white wine                whole milk 
##                     0.037                     0.016                     0.222 
##                    yogurt                  zwieback 
##                     0.119                     0.006

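The item frequencies show that a few items dominate: whole milk appears in 22.2% of transactions, followed by other vegetables (16.7%), rolls/buns (16.1%), soda (15.4%) and yogurt (11.9%). Most other items occur in fewer than 5% of transactions.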
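The relative frequency of an item is the share of baskets that contain it. The following base R sketch computes it on a made-up list (`toy_baskets` is hypothetical) and keeps the most frequent items; on the real data, sorting the vector returned by `itemFrequency(trans1)` in decreasing order should give the same kind of ranking.

```r
# Hypothetical toy baskets (not the Kaggle data)
toy_baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "soda"),
  c("soda"),
  c("whole milk", "rolls/buns", "yogurt")
)

# Count each item at most once per basket, then divide by the number of baskets
item_counts <- table(unlist(lapply(toy_baskets, unique)))
rel_freq <- sort(item_counts / length(toy_baskets), decreasing = TRUE)

# Three most frequent items with their relative frequency
top_items <- head(rel_freq, 3)
print(round(top_items, 3))
```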

itemFrequencyPlot(trans1, topN=30, type="relative", main="Item Frequency") 

3 Empirical verification

3.1 Apriori algorithm

Many algorithms exist for finding associations within sets of goods, but the most popular is the Apriori algorithm (Agrawal and Srikant, 1994). It relies on prior knowledge about frequent itemsets: an itemset can only be frequent if all of its subsets are frequent, so candidates containing an infrequent subset are discarded before their support is counted.

On large datasets the Apriori algorithm usually gives good results. When the amount of data is small, however, the resulting rules may rest on very few transactions, be hard to justify on common-sense grounds, and sometimes be spurious.
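To illustrate the level-wise idea behind Apriori, here is a simplified base R sketch on made-up baskets. It is not the optimized implementation used by arules: candidates at each level are simply all combinations of the items that survived the previous level. The names `toy_baskets`, `min_count` and `support_count` are hypothetical.

```r
# Hypothetical toy baskets and an absolute minimum support count
toy_baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "milk"),
  c("bread", "butter", "jam")
)
min_count <- 3

# Support count of an itemset = number of baskets containing all its items
support_count <- function(itemset) {
  sum(vapply(toy_baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(toy_baskets)))
frequent <- Filter(function(s) support_count(s) >= min_count, as.list(items))
all_frequent <- frequent

# Level k: build candidates only from items that are still frequent,
# because an itemset with an infrequent subset cannot be frequent
k <- 2
while (length(frequent) > 0) {
  survivors <- sort(unique(unlist(frequent)))
  if (length(survivors) < k) break
  candidates <- combn(survivors, k, simplify = FALSE)
  frequent <- Filter(function(s) support_count(s) >= min_count, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

# Print every frequent itemset found
for (s in all_frequent) cat("{", paste(s, collapse = ", "), "}\n")
```

In this toy example "jam" is dropped at level 1, so no pair or triple containing it is ever counted.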

For the initial analysis, the apriori algorithm was run with the default minimum support of 0.1 and a minimum confidence of 0.9.

rules.trans0 <- apriori(trans1, parameter = list(supp = 0.1, conf = 0.9, minlen=1))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 900 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

With these thresholds no rule was found: only 5 items reach a support of 0.1. The support threshold was therefore lowered to 0.05.

rules.trans0a <- apriori(trans1, parameter = list(supp = 0.05, conf = 0.9, minlen=1))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 450 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The algorithm still did not find any rule, so the support threshold was lowered further, to 0.01.

rules.trans0b <- apriori(trans1, parameter = list(supp = 0.01, conf = 0.9, minlen=1))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 90 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [77 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Again, no associations were found. The support threshold was lowered once more, to 0.001.

rules.trans1 <- apriori(trans1, parameter = list(supp = 0.001, conf = 0.9, minlen=1))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [154 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [107 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

With a minimum support of 0.001 (an absolute minimum support count of 9 transactions), 107 association rules were finally obtained.
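The step-by-step lowering of the support threshold can also be written as a single loop. The sketch below is an illustration only: it uses the `Groceries` dataset that ships with arules as a stand-in for the Kaggle file, so the rule counts it produces will differ from those reported here. The names `supports` and `n_rules` are hypothetical.

```r
library(arules)

# Built-in stand-in dataset (assumption: not the Kaggle CSV used in this study)
data("Groceries")

# Support thresholds to try, from strictest to loosest
supports <- c(0.1, 0.05, 0.01, 0.001)

# Run apriori once per threshold and record how many rules come back
n_rules <- sapply(supports, function(s) {
  rules <- apriori(Groceries,
                   parameter = list(supp = s, conf = 0.9, minlen = 1),
                   control = list(verbose = FALSE))
  length(rules)
})

# One row per threshold
print(data.frame(support = supports, rules = n_rules))
```

Because every rule that passes a stricter support threshold also passes a looser one, the rule count can only stay the same or grow as the threshold is lowered.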

3.2 Verification of analysis

Let’s print the first 5 association rules:

inspect(rules.trans1[1:5])
##     lhs                         rhs                    support confidence    coverage     lift count
## [1] {house keeping products,                                                                        
##      whipped/sour cream}     => {whole milk}       0.001110864  0.9090909 0.001221951 4.087730    10
## [2] {rice,                                                                                          
##      sugar}                  => {whole milk}       0.001110864  1.0000000 0.001110864 4.496503    10
## [3] {bottled water,                                                                                 
##      rice}                   => {whole milk}       0.001221951  0.9166667 0.001333037 4.121795    11
## [4] {frozen fish,                                                                                   
##      pip fruit,                                                                                     
##      whole milk}             => {other vegetables} 0.001110864  0.9090909 0.001221951 5.434021    10
## [5] {citrus fruit,                                                                                  
##      herbs,                                                                                         
##      tropical fruit}         => {whole milk}       0.001110864  0.9090909 0.001221951 4.087730    10

The rules can be read as follows:
* in 91% of the transactions containing house keeping products and whipped/sour cream, whole milk was purchased as well;
* in 100% of the transactions containing rice and sugar, whole milk was purchased as well;
* in 92% of the transactions containing bottled water and rice, whole milk was purchased as well, etc.

Each of these rules is based on only 10 or 11 transactions, so the percentages should be read with caution.
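As a quick arithmetic check, the confidence values can be recovered from the `count` and `coverage` columns printed by `inspect()` for the first three rules. This is a sketch in base R; the variable names are hypothetical and the numbers are copied from the output above.

```r
# Total number of transactions in the dataset
n <- 9002

# Values copied from the inspect() output for rules 1 to 3
count    <- c(10, 10, 11)                             # baskets with lhs and rhs
coverage <- c(0.001221951, 0.001110864, 0.001333037)  # share of baskets with lhs

# Number of baskets that contain the left-hand side
lhs_count <- round(coverage * n)

# Confidence = baskets with lhs and rhs / baskets with lhs
confidence <- count / lhs_count
print(round(confidence, 4))
```

The check makes the small sample behind each rule visible: a confidence of 1 for {rice, sugar} means 10 out of 10 baskets.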

Changing the parameters of the Apriori algorithm (support, confidence, minlen) produces different lists of association rules. Raising the support threshold excludes rarely purchased goods; in this study the threshold had to be lowered to 0.001 before any rule was found. A high confidence threshold, in turn, keeps only those rules whose consequent appears in nearly all transactions that contain the antecedent.

The quality of the rules obtained should therefore be verified.

summary(rules.trans1)
## set of 107 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
##  3 53 45  6 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.505   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.001111   Min.   :0.9000   Min.   :0.001111   Min.   :4.068  
##  1st Qu.:0.001111   1st Qu.:0.9091   1st Qu.:0.001222   1st Qu.:4.122  
##  Median :0.001111   Median :0.9167   Median :0.001222   Median :4.497  
##  Mean   :0.001270   Mean   :0.9344   Mean   :0.001361   Mean   :5.004  
##  3rd Qu.:0.001333   3rd Qu.:0.9375   3rd Qu.:0.001444   3rd Qu.:5.479  
##  Max.   :0.002999   Max.   :1.0000   Max.   :0.003333   Max.   :9.697  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :10.00  
##  Mean   :11.43  
##  3rd Qu.:12.00  
##  Max.   :27.00  
## 
## mining info:
##    data ntransactions support confidence
##  trans1          9002   0.001        0.9

Under these conditions a total of 107 rules were obtained. Their length ranges from 3 to 6 items: 3 rules have length 3, 53 have length 4, 45 have length 5, and 6 have length 6.

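The rules also carry a “lift” value, which should be checked as well. Lift is the confidence of a rule divided by the support of its right-hand side; it shows how many times more often the right-hand-side product is purchased together with the left-hand-side products than would be expected if they were purchased independently.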
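For reference, the quality measures reported by `inspect()` can be written as follows for a rule $X \Rightarrow Y$ over $N$ transactions. This is the standard textbook formulation, given here as a reading aid rather than taken from the arules source.

$$
\mathrm{supp}(X \Rightarrow Y) = \frac{|\{t : X \cup Y \subseteq t\}|}{N}, \qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}, \qquad
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
$$

As a rough check against the output above: the rule {rice, sugar} => {whole milk} has confidence 1.0, and whole milk has a relative frequency of about 0.222, so its lift is about $1.0 / 0.222 \approx 4.5$, which agrees with the reported value of 4.4965.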

# order the rules from highest to lowest lift
rules.trans1 <- sort(rules.trans1, by="lift")
inspect(rules.trans1[1:5])
##     lhs                     rhs                   support confidence    coverage     lift count
## [1] {oil,                                                                                      
##      other vegetables,                                                                         
##      tropical fruit,                                                                           
##      whole milk}         => {root vegetables} 0.001444124  0.9285714 0.001555210 9.697216    13
## [2] {oil,                                                                                      
##      other vegetables,                                                                         
##      tropical fruit,                                                                           
##      whole milk,                                                                               
##      yogurt}             => {root vegetables} 0.001110864  0.9090909 0.001221951 9.493778    10
## [3] {cream cheese,                                                                             
##      curd,                                                                                     
##      whipped/sour cream,                                                                       
##      whole milk}         => {yogurt}          0.001221951  0.9166667 0.001333037 7.683271    11
## [4] {frankfurter,                                                                              
##      rolls/buns,                                                                               
##      root vegetables,                                                                          
##      whole milk}         => {yogurt}          0.001221951  0.9166667 0.001333037 7.683271    11
## [5] {butter,                                                                                   
##      cream cheese,                                                                             
##      root vegetables}    => {yogurt}          0.001110864  0.9090909 0.001221951 7.619773    10

The rule with the highest lift (9.70) is:
{oil, other vegetables, tropical fruit, whole milk} => {root vegetables}
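Several of the top rules overlap: the second rule only adds yogurt to the left-hand side of the first and has lower confidence. One possible follow-up, sketched below, is to drop such redundant rules with `is.redundant()` from arules before ranking by lift. The sketch again uses the built-in `Groceries` data as a stand-in (an assumption), so its output will not match the tables in this report; `rules_pruned` and `top5` are hypothetical names.

```r
library(arules)

# Built-in stand-in dataset (assumption: not the Kaggle CSV used in this study)
data("Groceries")

# Same thresholds as the final run in the text
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.9),
                 control = list(verbose = FALSE))

# Drop rules for which a more general rule has equal or higher confidence
rules_pruned <- rules[!is.redundant(rules)]

# Five rules with the highest lift among those that remain
top5 <- head(sort(rules_pruned, by = "lift"), 5)
inspect(top5)
```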

3.3 Result visualization

The association rules obtained are visualized below.

plot(rules.trans1, measure=c("support","lift"), shading="confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.


The scatter plot shows all 107 rules. Each point is one rule, positioned by its support and lift, with the shading indicating its confidence.

For a more detailed analysis, it is worth examining how often a given element appears in associations.

plot(rules.trans1, method="grouped")


The grouped chart shows the number and quality of associations between the categories: the LHS shows the antecedents of the rules, while the RHS shows their consequents.

4 Conclusion

Association analysis provides information on customer behavior that is especially useful in marketing. Useful conclusions can be drawn from transaction data to support business planning.

5 Bibliography

Lantz B. (2013), Machine Learning with R. Packt Publishing, Birmingham - Mumbai.