Market basket analysis is the search for the most common in-store purchasing patterns. It analyzes transaction databases to determine which combinations of items are related to each other, detecting products whose presence in a transaction increases the likelihood that other products, or combinations of products, appear in the same transaction.
Market basket analysis helps to optimize the assortment and inventory, to arrange goods in sales areas, and to increase sales by offering related products to customers. For example, if the analysis shows that bread and butter are typically purchased together, placing them on the same display may encourage the buyer to purchase both.
The study aims to answer two questions: (a) which products are purchased together, and (b) which rules usefully describe consumer behavior. To this end, the transaction data were analyzed with association rules in order to find recurring patterns in the sale of goods.
The dataset comes from Kaggle and contains 9002 transactions: https://www.kaggle.com/apmonisha08/market-basket-analysis?select=Groceries.csv
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: grid
# read the data
setwd("/Users/nehrebeckiwp.pl/Desktop/UL3")
trans1 <- read.transactions("Groceries.csv", rm.duplicates = FALSE, format = "basket", sep = ",", skip = 0)
## [1] 9002
## [[1]]
## [1] "citrus fruit" "margarine" "ready soups"
## [4] "semi-finished bread"
##
## [[2]]
## [1] "coffee" "tropical fruit" "yogurt"
##
## [[3]]
## [1] "whole milk"
##
## [[4]]
## [1] "cream cheese" "meat spreads" "pip fruit" "yogurt"
##
## [[5]]
## [1] "condensed milk" "long life bakery product"
## [3] "other vegetables" "whole milk"
##
## [[6]]
## [1] "abrasive cleaner" "butter" "rice" "whole milk"
## [5] "yogurt"
Before the analysis, the distribution of basket length (the number of item categories per transaction) is presented.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.828 6.000 32.000
The descriptive statistics show that a basket contains about 4 item categories on average (mean 3.83, median 3), while the largest basket contains 32 categories.
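The code that produced the summary is not shown in the output. As a minimal sketch of the same idea in base R, the first four baskets printed above can be stored as a plain list (an assumption made for illustration; the analysis itself works on the `trans1` object) and their lengths summarized. With arules, `size(trans1)` returns the corresponding vector of basket lengths.

```r
# First four baskets as printed above, stored as a plain list for illustration
baskets <- list(
  c("citrus fruit", "margarine", "ready soups", "semi-finished bread"),
  c("coffee", "tropical fruit", "yogurt"),
  c("whole milk"),
  c("cream cheese", "meat spreads", "pip fruit", "yogurt")
)

# Basket length = number of item categories in each transaction
basket_size <- lengths(baskets)

# Five-number summary plus the mean, the same layout as the output above
summary(basket_size)
```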
The analysis starts with the relative frequency of each item, that is, the share of transactions in which the item appears.
## ` abrasive cleaner artif. sweetener
## 0.000 0.003 0.003
## baby cosmetics baby food bags
## 0.001 0.000 0.000
## baking powder bathroom cleaner beef
## 0.016 0.002 0.044
## berries beverages bottled beer
## 0.028 0.024 0.071
## bottled water brandy brown bread
## 0.097 0.003 0.057
## butter butter milk cake bar
## 0.049 0.024 0.011
## candles candy canned beer
## 0.008 0.026 0.070
## canned fish canned fruit canned vegetables
## 0.013 0.003 0.009
## cat food cereals chewing gum
## 0.020 0.005 0.018
## chicken chocolate chocolate marshmallow
## 0.036 0.041 0.007
## citrus fruit cleaner cling film/bags
## 0.073 0.004 0.010
## cocoa drinks coffee condensed milk
## 0.002 0.052 0.009
## cooking chocolate cookware cream
## 0.002 0.003 0.001
## cream cheese curd curd cheese
## 0.034 0.046 0.004
## decalcifier dental care dessert
## 0.001 0.005 0.032
## detergent dish cleaner dishes
## 0.017 0.009 0.016
## dog food domestic eggs female sanitary products
## 0.007 0.056 0.005
## finished products fish flour
## 0.006 0.003 0.015
## flower (seeds) flower soil/fertilizer frankfurter
## 0.009 0.002 0.051
## frozen chicken frozen dessert frozen fish
## 0.001 0.009 0.009
## frozen fruits frozen meals frozen potato products
## 0.001 0.023 0.008
## frozen vegetables fruit/vegetable juice grapes
## 0.041 0.063 0.017
## hair spray ham hamburger meat
## 0.001 0.022 0.028
## hard cheese herbs honey
## 0.022 0.014 0.001
## house keeping products hygiene articles ice cream
## 0.007 0.029 0.022
## instant coffee Instant food products jam
## 0.007 0.007 0.005
## ketchup kitchen towels kitchen utensil
## 0.004 0.005 0.000
## light bulbs liqueur liquor
## 0.004 0.001 0.010
## liquor (appetizer) liver loaf long life bakery product
## 0.008 0.005 0.032
## make up remover male cosmetics margarine
## 0.001 0.004 0.052
## mayonnaise meat meat spreads
## 0.008 0.022 0.004
## misc. beverages mustard napkins
## 0.026 0.012 0.043
## newspapers nut snack nuts/prunes
## 0.069 0.003 0.002
## oil onions organic products
## 0.024 0.027 0.002
## organic sausage other vegetables packaged fruit/vegetables
## 0.002 0.167 0.012
## pasta pastry pet care
## 0.013 0.076 0.008
## photo/film pickled vegetables pip fruit
## 0.009 0.017 0.066
## popcorn pork pot plants
## 0.006 0.050 0.016
## potato products preservation products processed cheese
## 0.002 0.000 0.014
## prosecco pudding powder ready soups
## 0.002 0.002 0.002
## red/blush wine rice roll products
## 0.017 0.007 0.010
## rolls/buns root vegetables rubbing alcohol
## 0.161 0.096 0.001
## rum salad dressing salt
## 0.004 0.001 0.009
## salty snack sauces sausage
## 0.032 0.005 0.081
## seasonal products semi-finished bread shopping bags
## 0.012 0.015 0.084
## skin care sliced cheese snack products
## 0.003 0.020 0.003
## soap soda soft cheese
## 0.002 0.154 0.015
## softener sound storage medium soups
## 0.005 0.000 0.006
## sparkling wine specialty bar specialty cheese
## 0.004 0.023 0.008
## specialty chocolate specialty fat specialty vegetables
## 0.026 0.004 0.001
## spices spread cheese sugar
## 0.005 0.010 0.029
## sweet spreads syrup tea
## 0.009 0.002 0.003
## tidbits toilet cleaner tropical fruit
## 0.002 0.001 0.091
## turkey UHT-milk vinegar
## 0.008 0.029 0.006
## waffles whipped/sour cream whisky
## 0.032 0.063 0.001
## white bread white wine whole milk
## 0.037 0.016 0.222
## yogurt zwieback
## 0.119 0.006
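The frequencies show that a few items dominate: whole milk appears in 22.2% of transactions, followed by other vegetables (16.7%), rolls/buns (16.1%), soda (15.4%) and yogurt (11.9%), while most items occur in fewer than 5% of baskets.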
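A relative item frequency is simply the share of baskets that contain the item. The sketch below computes it in base R for the six baskets printed earlier, stored as a plain list for illustration (an assumption; the table above covers all 9002 transactions, so the values differ). In arules, `itemFrequency()` computes the same quantity for a transactions object.

```r
# The six baskets printed earlier, as a plain list (illustration only)
baskets <- list(
  c("citrus fruit", "margarine", "ready soups", "semi-finished bread"),
  c("coffee", "tropical fruit", "yogurt"),
  c("whole milk"),
  c("cream cheese", "meat spreads", "pip fruit", "yogurt"),
  c("condensed milk", "long life bakery product", "other vegetables", "whole milk"),
  c("abrasive cleaner", "butter", "rice", "whole milk", "yogurt")
)

# Relative frequency of an item = share of baskets containing it
items <- sort(unique(unlist(baskets)))
item_freq <- vapply(
  items,
  function(it) mean(vapply(baskets, function(b) it %in% b, logical(1))),
  numeric(1)
)

# Most frequent items first
round(sort(item_freq, decreasing = TRUE), 3)
```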
The literature offers many algorithms for finding relationships within a set of goods. The most popular is the Apriori algorithm (Agrawal and Srikant, 1994). It relies on prior knowledge about how frequently consumers choose specific sets of goods: an itemset can be frequent only if all of its subsets are frequent, so larger itemsets are built only from smaller ones that are already known to be frequent.
On large datasets the Apriori algorithm usually gives good results, but when the amount of data is small, the resulting rules may be hard to justify by common sense and are sometimes spurious.
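To make the level-wise idea behind Apriori concrete, here is a minimal base R sketch on hypothetical toy baskets (not taken from Groceries.csv). It is not the arules implementation; the helper names `support` and `frequent_itemsets` are invented for this illustration. Candidates of size k + 1 are formed only from frequent itemsets of size k, and a candidate is dropped as soon as one of its subsets is not frequent.

```r
# Toy baskets (hypothetical data, not taken from Groceries.csv)
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk", "yogurt"),
  c("bread", "butter", "yogurt")
)

# Support of an itemset = share of baskets that contain all of its items
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level-wise search for all itemsets with support >= min_support
frequent_itemsets <- function(baskets, min_support) {
  items <- sort(unique(unlist(baskets)))
  level <- Filter(function(s) support(s, baskets) >= min_support, as.list(items))
  result <- level
  while (length(level) > 1) {
    k <- length(level[[1]])
    # Join step: unions of two frequent k-sets that have exactly k + 1 items
    candidates <- list()
    for (i in seq_len(length(level) - 1)) {
      for (j in (i + 1):length(level)) {
        u <- sort(union(level[[i]], level[[j]]))
        if (length(u) == k + 1) candidates[[paste(u, collapse = ",")]] <- u
      }
    }
    # Prune step: every k-subset of a candidate must itself be frequent
    keys <- vapply(level, paste, character(1), collapse = ",")
    candidates <- Filter(function(cand) {
      subs <- combn(cand, k, simplify = FALSE)
      all(vapply(subs, paste, character(1), collapse = ",") %in% keys)
    }, candidates)
    # Count step: keep only candidates that reach the minimum support
    level <- unname(Filter(function(s) support(s, baskets) >= min_support, candidates))
    result <- c(result, level)
  }
  result
}

freq <- frequent_itemsets(baskets, min_support = 0.5)
freq
```

With a minimum support of 0.5, yogurt (2 of 5 baskets) is discarded at the first level, so no itemset containing yogurt is ever counted. This pruning is what keeps the search tractable on data with many items.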
In the initial analysis, the Apriori algorithm was run with the default minimum support of 0.1 and a minimum confidence of 0.9, as listed in the parameter specification below.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 900
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
With these settings, no rule meets the thresholds. The minimum support was therefore lowered to 0.05.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 450
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The algorithm still finds no rule, so the minimum support was lowered further, to 0.01.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 90
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [77 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Again, no associations were found, so the minimum support was lowered once more, to 0.001.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[170 item(s), 9002 transaction(s)] done [0.00s].
## sorting and recoding items ... [154 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [107 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Finally, with a minimum support of 0.001, 107 association rules were obtained.
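The four runs can be summarized in one table. In the sketch below, the item and rule counts are copied from the logs above, and the absolute minimum support count is recomputed from the support threshold and the 9002 transactions. Truncating the product is an assumption; it reproduces the counts 900, 450, 90 and 9 printed in the logs.

```r
n_transactions <- 9002

runs <- data.frame(
  min_support = c(0.1, 0.05, 0.01, 0.001),
  items_kept  = c(5, 23, 77, 154),  # from the "sorting and recoding items" lines
  rules_found = c(0, 0, 0, 107)     # from the "writing ... rule(s)" lines
)

# Minimum number of transactions an itemset must appear in (truncation assumed)
runs$min_count <- floor(runs$min_support * n_transactions)

runs
```

Lowering the support from 0.1 to 0.001 raises the number of items that enter the search from 5 to 154, but rules appear only once itemsets supported by as few as 9 transactions are allowed.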
Let’s print the first 5 association rules:
## lhs rhs support confidence coverage lift count
## [1] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001110864 0.9090909 0.001221951 4.087730 10
## [2] {rice,
## sugar} => {whole milk} 0.001110864 1.0000000 0.001110864 4.496503 10
## [3] {bottled water,
## rice} => {whole milk} 0.001221951 0.9166667 0.001333037 4.121795 11
## [4] {frozen fish,
## pip fruit,
## whole milk} => {other vegetables} 0.001110864 0.9090909 0.001221951 5.434021 10
## [5] {citrus fruit,
## herbs,
## tropical fruit} => {whole milk} 0.001110864 0.9090909 0.001221951 4.087730 10
The first three rules can be read as follows:
* 91% of the baskets that contain house keeping products and whipped/sour cream also contain whole milk;
* 100% of the baskets that contain rice and sugar also contain whole milk;
* 92% of the baskets that contain bottled water and rice also contain whole milk.
Note that each of these rules rests on only 10 or 11 transactions.
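The confidence values can be checked against the other columns of the table. The sketch below uses the printed values of the first rule; the variable names are chosen for this illustration. Confidence is the support of the whole rule divided by the coverage, which is the support of the left-hand side alone.

```r
n_transactions <- 9002

# Printed values for {house keeping products, whipped/sour cream} => {whole milk}
support  <- 0.001110864  # share of baskets containing all three items
coverage <- 0.001221951  # share of baskets containing the left-hand side

# Confidence = support of the rule / support of the left-hand side
confidence <- support / coverage

# Back to absolute numbers of transactions
count_rule <- round(support * n_transactions)
count_lhs  <- round(coverage * n_transactions)

c(confidence = confidence, count_rule = count_rule, count_lhs = count_lhs)
```

In absolute terms, 11 baskets contain the left-hand side and 10 of them also contain whole milk, which gives 10 / 11, or about 0.909.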
Changing the parameters of the Apriori algorithm (support, confidence, minlen) yields different lists of association rules. Raising the support threshold excludes rarely purchased goods; in this article it had to be lowered to 0.001 before any rule passed the confidence threshold of 0.9. A high confidence threshold keeps only rules whose right-hand side appears in nearly all baskets that contain the left-hand side.
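The effect of the support threshold can be sketched with `apriori()` from the arules package on hypothetical toy baskets (not taken from Groceries.csv). The thresholds 0.5, 0.3 and 0.1 are chosen to suit ten baskets and are not the values used in the analysis; `minlen = 2` is added here to exclude rules with an empty left-hand side.

```r
library(arules)  # assumed to be installed; provides apriori() and the transactions class

# Toy baskets (hypothetical data, not taken from Groceries.csv)
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("milk", "yogurt"),
  c("bread", "milk"),
  c("bread", "butter", "yogurt"),
  c("milk"),
  c("bread", "milk", "yogurt"),
  c("soda"),
  c("milk", "soda")
)
trans <- as(baskets, "transactions")

# Count the rules found at each minimum support, with confidence fixed at 0.9
min_support <- c(0.5, 0.3, 0.1)
n_rules <- vapply(min_support, function(s) {
  rules <- apriori(
    trans,
    parameter = list(support = s, confidence = 0.9, minlen = 2),
    control = list(verbose = FALSE)  # suppress the progress log
  )
  length(rules)
}, numeric(1))

data.frame(min_support, n_rules)
```

Because every rule that passes a stricter support threshold also passes a looser one, the rule count can only grow as the threshold is lowered.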
Next, the quality of the association rules found by the Apriori algorithm is examined.
## set of 107 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 3 53 45 6
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.505 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001111 Min. :0.9000 Min. :0.001111 Min. :4.068
## 1st Qu.:0.001111 1st Qu.:0.9091 1st Qu.:0.001222 1st Qu.:4.122
## Median :0.001111 Median :0.9167 Median :0.001222 Median :4.497
## Mean :0.001270 Mean :0.9344 Mean :0.001361 Mean :5.004
## 3rd Qu.:0.001333 3rd Qu.:0.9375 3rd Qu.:0.001444 3rd Qu.:5.479
## Max. :0.002999 Max. :1.0000 Max. :0.003333 Max. :9.697
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :10.00
## Mean :11.43
## 3rd Qu.:12.00
## Max. :27.00
##
## mining info:
## data ntransactions support confidence
## trans1 9002 0.001 0.9
Under these conditions, a total of 107 rules were obtained. Rule length (left-hand side plus right-hand side) ranges from 3 to 6 items: 3 rules have length 3, 53 have length 4, 45 have length 5, and 6 have length 6.
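As a quick consistency check, the rule length distribution printed in the summary reproduces both the total number of rules and the reported mean length. The values below are copied from the summary output.

```r
# Rule length distribution from the summary output (lhs + rhs)
rule_length <- c(3, 4, 5, 6)
n_rules     <- c(3, 53, 45, 6)

# Total number of rules
total_rules <- sum(n_rules)

# Mean rule length, weighted by the number of rules of each length
mean_length <- weighted.mean(rule_length, n_rules)

c(total_rules = total_rules, mean_length = round(mean_length, 3))
```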
The association rules also report the lift measure, which indicates how many times more likely the right-hand-side product is to be purchased when the left-hand-side products are in the basket than it is in general. The five rules with the highest lift are listed below.
## lhs rhs support confidence coverage lift count
## [1] {oil,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.001444124 0.9285714 0.001555210 9.697216 13
## [2] {oil,
## other vegetables,
## tropical fruit,
## whole milk,
## yogurt} => {root vegetables} 0.001110864 0.9090909 0.001221951 9.493778 10
## [3] {cream cheese,
## curd,
## whipped/sour cream,
## whole milk} => {yogurt} 0.001221951 0.9166667 0.001333037 7.683271 11
## [4] {frankfurter,
## rolls/buns,
## root vegetables,
## whole milk} => {yogurt} 0.001221951 0.9166667 0.001333037 7.683271 11
## [5] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001110864 0.9090909 0.001221951 7.619773 10
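The rule with the highest lift is {oil, other vegetables, tropical fruit, whole milk} => {root vegetables}: baskets that contain these four items include root vegetables about 9.7 times more often than baskets in general.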
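Lift is the confidence of a rule divided by the overall support of its right-hand side. Using the printed values of the top rule (variable names are chosen for this illustration), the sketch below backs out the implied support of root vegetables and compares the observed number of baskets with the number expected if the two sides were independent.

```r
n_transactions <- 9002

# Printed values for {oil, other vegetables, tropical fruit, whole milk} => {root vegetables}
confidence <- 0.9285714
lift       <- 9.697216
coverage   <- 0.001555210  # support of the left-hand side
count      <- 13           # baskets containing both sides

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift

# Baskets expected to contain both sides if they were independent
expected_count <- coverage * rhs_support * n_transactions

c(rhs_support = rhs_support, expected_count = expected_count,
  observed_over_expected = count / expected_count)
```

The implied support of about 0.096 agrees with the frequency of root vegetables in the item frequency table, and the ratio of observed to expected baskets recovers the lift.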
The remainder of the paper presents the association rules graphically.
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The scatter plot shows all 107 rules and relates each rule's support and confidence to its lift.
For a more detailed analysis, it is worth examining how often a given item appears in the associations.
The chart shows the number and quality of associations between the categories: the LHS lists the antecedents, while the RHS gives the consequent.
Association analysis provides information on customer behavior that is especially useful in marketing. Conclusions drawn from large transaction datasets can support the implementation of a business plan.
Lantz B. (2013), Machine Learning with R. Packt Publishing, Birmingham - Mumbai.