In this paper I will implement Market Basket Analysis on real data collected from Kaggle: https://www.kaggle.com/heeraldedhia/groceries-dataset. The main goal of this paper is to examine this dataset’s characteristics.
The goal of market basket analysis is to find a set of ‘strong’ rules that hold across the transactions in the data, where strength is usually measured by support, confidence and lift. A transaction can be understood as the set of items bought by a single person at a given moment. Based on the rules found in the dataset, statements about purchasing behaviour can be formulated.
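To make the three measures concrete, here is a minimal sketch in base R on invented toy baskets. The item names and the helper `has` are illustrative assumptions and are not part of the analysis. For a rule {bread} => {milk}, support is the share of baskets containing both items, confidence is that share divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side.

```r
# Five invented baskets (illustrative only)
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt"),
  c("bread"),
  c("yogurt", "soda")
)

# TRUE for every basket that contains all of the given items
has <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

supp_lhs  <- mean(has("bread"))             # 3 of 5 baskets
supp_rhs  <- mean(has("milk"))              # 3 of 5 baskets
supp_rule <- mean(has(c("bread", "milk")))  # 2 of 5 baskets

confidence <- supp_rule / supp_lhs  # P(milk | bread)
lift       <- confidence / supp_rhs # > 1 means bread buyers take milk more often than average

round(c(support = supp_rule, confidence = confidence, lift = lift), 3)
```

A lift close to 1 means the two sides of the rule occur together about as often as independence would predict.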
The selected dataset contains 3 variables and 38765 observations. The key of the dataset is Member_number + Date; this combination will be called a transaction. The data is already grouped into categories such as ‘beef’, ‘sausage’, etc. This part of the process would normally be done by a data preparation team. Product names can be grouped using a wide range of text mining tools, such as stemming and clustering, but for the purpose of this paper the data is already grouped.
library(dplyr); library(arules); library(arulesViz)  # packages used throughout
df = read.csv('Groceries_dataset.csv', header=TRUE)
head(df, 10)
## Member_number Date itemDescription
## 1 1808 21-07-2015 tropical fruit
## 2 2552 05-01-2015 whole milk
## 3 2300 19-09-2015 pip fruit
## 4 1187 12-12-2015 other vegetables
## 5 3037 01-02-2015 whole milk
## 6 4941 14-02-2015 rolls/buns
## 7 4501 08-05-2015 other vegetables
## 8 3803 23-12-2015 pot plants
## 9 2762 20-03-2015 whole milk
## 10 4119 12-02-2015 tropical fruit
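There are 14963 transactions in total (distinct Member_number + Date combinations).
# Number of items per basket (Member_number + Date)
xx = df %>% group_by(Member_number, Date) %>% count()
xx1 = xx %>% filter(n<200)  # guard against implausibly large baskets; the maximum here is 11
hist(xx1$n, main='Histogram of number of products in basket')
summary(xx1$n)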
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.591 3.000 11.000
The distribution is strongly right-skewed: most baskets contain two or three products. Its shape resembles a geometric distribution, that is, a sequence of Bernoulli trials in which each additional product bought by the customer counts as a success.
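# Item frequencies, most frequent first
xd1 = sort(table(df$itemDescription), decreasing=TRUE)
# Bar chart of the items that appear more than 500 times (cex.names is a barplot argument)
barplot(xd1[xd1>500], las = 2, cex.names = 0.3)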
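In order to mine rules from the dataset, the data frame has to be transformed into a transactions object. Note that the code below splits the items by Member_number only, so each of the resulting 3898 transactions holds everything one customer bought over the whole period, not a single Member_number + Date basket.
# Keep the three columns of interest as a plain data frame
grouping_for_AA <- df %>%
  dplyr::select(Member_number, itemDescription, Date) %>%
  data.frame()
# Split item names by customer and coerce the resulting list to a transactions object
trans <- as(split(grouping_for_AA[,"itemDescription"], grouping_for_AA[,"Member_number"]), "transactions")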
## Warning in asMethod(object): removing duplicated items in transactions
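The call above groups items per customer. If per-visit baskets were wanted instead, the split key could combine both identifying columns. The following is only a sketch on invented rows in the same column layout; the values and the names `toy`, `basket_id`, `trans_by_basket` and `trans_by_member` are hypothetical.

```r
library(arules)

# Invented rows in the same layout as the Kaggle file (illustrative only)
toy <- data.frame(
  Member_number   = c(1, 1, 1, 2, 2),
  Date            = c("01-01-2015", "01-01-2015", "05-02-2015",
                      "01-01-2015", "01-01-2015"),
  itemDescription = c("whole milk", "yogurt", "soda",
                      "whole milk", "rolls/buns"),
  stringsAsFactors = FALSE
)

# One transaction per visit: key = Member_number + Date
basket_id <- paste(toy$Member_number, toy$Date, sep = "_")
trans_by_basket <- as(split(toy$itemDescription, basket_id), "transactions")

# One transaction per customer: key = Member_number only
trans_by_member <- as(split(toy$itemDescription, toy$Member_number), "transactions")

length(trans_by_basket)  # number of visits
length(trans_by_member)  # number of customers
```

The two choices answer different questions: per-visit baskets describe what is bought together in one trip, while per-customer histories describe what the same person buys over time.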
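Now the frequent itemsets can be mined with Eclat, using a minimum support of 0.001, and rules with a confidence of at least 0.3 can be induced from them.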
freq_items<-eclat(trans, parameter=list(supp=0.001, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.001 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 3
##
## create itemset ...
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [164 item(s)] done [0.00s].
## creating sparse bit matrix ... [164 row(s), 3898 column(s)] done [0.00s].
## writing ... [217551 set(s)] done [0.12s].
## Creating S4 object ... done [0.07s].
freq_rules<-ruleInduction(freq_items, trans, confidence=0.3)
summary(freq_rules)
## set of 533867 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8
## 866 30881 170025 226520 92390 12609 576
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 5.000 4.784 5.000 8.000
##
## summary of quality measures:
## support confidence lift itemset
## Min. :0.001026 Min. :0.3000 Min. : 0.6548 Min. : 1
## 1st Qu.:0.001026 1st Qu.:0.4000 1st Qu.: 1.5873 1st Qu.: 67957
## Median :0.001283 Median :0.5000 Median : 2.0804 Median :123656
## Mean :0.001654 Mean :0.5591 Mean : 2.4271 Mean :119349
## 3rd Qu.:0.001539 3rd Qu.:0.6667 3rd Qu.: 2.7883 3rd Qu.:174775
## Max. :0.191380 Max. :1.0000 Max. :74.2476 Max. :217387
##
## mining info:
## data ntransactions support
## trans 3898 0.001
## call confidence
## eclat(data = trans, parameter = list(supp = 0.001, maxlen = 15)) 0.3
inspect(head(sort(freq_rules, by ="lift"),10))
## lhs rhs support confidence lift itemset
## [1] {chicken,
## domestic eggs,
## rolls/buns,
## soda,
## whole milk} => {cereals} 0.001026167 0.8000000 74.24762 2083
## [2] {bottled water,
## long life bakery product,
## root vegetables,
## sausage} => {house keeping products} 0.001026167 0.6666667 57.74815 3829
## [3] {bottled water,
## long life bakery product,
## root vegetables,
## whole milk} => {house keeping products} 0.001026167 0.5000000 43.31111 3832
## [4] {bottled beer,
## bottled water,
## curd,
## pork,
## whole milk} => {curd cheese} 0.001026167 0.5000000 42.36957 4549
## [5] {chocolate,
## pickled vegetables,
## pip fruit,
## whole milk} => {salt} 0.001026167 0.8000000 35.03820 15664
## [6] {chocolate,
## pickled vegetables,
## pip fruit} => {salt} 0.001026167 0.8000000 35.03820 15666
## [7] {domestic eggs,
## hamburger meat,
## shopping bags,
## soda} => {canned fish} 0.001026167 1.0000000 33.89565 21715
## [8] {chicken,
## domestic eggs,
## rolls/buns,
## soda} => {cereals} 0.001026167 0.3636364 33.74892 2085
## [9] {chicken,
## domestic eggs,
## rolls/buns,
## whole milk} => {cereals} 0.001026167 0.3636364 33.74892 2086
## [10] {cream cheese ,
## domestic eggs,
## fruit/vegetable juice,
## other vegetables} => {photo/film} 0.001026167 0.6666667 33.74892 11541
These are the top 10 rules by lift. Next, let's plot all of the rules found in the dataset by support and confidence, shaded by lift.
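Before plotting, it may help to check how many transactions actually stand behind such a high lift. The sketch below only rearranges the definitions of support, confidence and lift, using the values printed for rule [1] and the 3898 transactions reported by eclat; the variable names are mine.

```r
n_trans    <- 3898         # transactions reported by eclat
support    <- 0.001026167  # rule [1]: {chicken, ..., whole milk} => {cereals}
confidence <- 0.8
lift       <- 74.24762

count_rule <- support * n_trans        # transactions containing lhs and rhs
count_lhs  <- count_rule / confidence  # transactions containing the lhs
supp_rhs   <- confidence / lift        # support of {cereals}
count_rhs  <- supp_rhs * n_trans       # transactions containing cereals

round(c(rule = count_rule, lhs = count_lhs, rhs = count_rhs))
# rule 4, lhs 5, rhs 42
```

So the rule rests on 4 of 5 matching transactions, and the lift is large mainly because cereals occur in only about 42 of the 3898 transactions. Rules with such small counts should be interpreted with caution.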
plot(freq_rules, measure=c("support", "confidence"), shading="lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
rules_for_plot = head(sort(sort(freq_rules, by ="confidence"),by="support"),10)
plot(rules_for_plot, method="paracoord")
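The parallel coordinates plot visualizes which products on the left-hand side of a rule lead to which product on the right-hand side.
Next, whole milk is taken as an example product: only rules with whole milk on the right-hand side are mined.
# Mine rules whose right-hand side is whole milk
rules.wholemilk<-apriori(data=trans, parameter=list(supp=0.001, conf=0.08), appearance=list(default="lhs", rhs="whole milk"), control=list(verbose=FALSE))
# Order by confidence, highest first
rules.wholemilk<-sort(rules.wholemilk, by="confidence", decreasing=TRUE)
inspect(head(rules.wholemilk))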
## lhs rhs support confidence
## [1] {whisky} => {whole milk} 0.002052335 1
## [2] {frozen vegetables, hair spray} => {whole milk} 0.001026167 1
## [3] {frozen fruits, other vegetables} => {whole milk} 0.001026167 1
## [4] {bottled water, whisky} => {whole milk} 0.001282709 1
## [5] {root vegetables, whisky} => {whole milk} 0.001539251 1
## [6] {rolls/buns, whisky} => {whole milk} 0.001026167 1
## coverage lift count
## [1] 0.002052335 2.182531 8
## [2] 0.001026167 2.182531 4
## [3] 0.001026167 2.182531 4
## [4] 0.001282709 2.182531 5
## [5] 0.001539251 2.182531 6
## [6] 0.001026167 2.182531 4
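The rules with the highest confidence are each based on only a handful of transactions (counts of 4 to 8). We can therefore keep only the statistically significant rules from the set above; is.significant uses Fisher's exact test by default.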
rules.wholemilk2 = rules.wholemilk[is.significant(rules.wholemilk, trans)]
inspect(rules.wholemilk2)
## lhs rhs support
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [2] {bottled water, other vegetables} => {whole milk} 0.05618266
## [3] {other vegetables, yogurt} => {whole milk} 0.07183171
## [4] {rolls/buns, yogurt} => {whole milk} 0.06593125
## [5] {other vegetables, rolls/buns} => {whole milk} 0.08209338
## [6] {yogurt} => {whole milk} 0.15059005
## confidence coverage lift count
## [1] 0.6568627 0.05233453 1.433623 134
## [2] 0.5983607 0.09389430 1.305941 219
## [3] 0.5970149 0.12031811 1.303003 280
## [4] 0.5921659 0.11133915 1.292420 257
## [5] 0.5594406 0.14674192 1.220996 320
## [6] 0.5321850 0.28296562 1.161510 587
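To illustrate what such a significance test looks at, rule [6] ({yogurt} => {whole milk}) can be checked by hand. This is a sketch under assumptions: the counts are reconstructed from the printed coverage, confidence and lift, a one-sided Fisher's exact test is used, and any multiple-comparison adjustment that the installed arules version may apply is not reproduced.

```r
n      <- 3898                             # transactions in trans
n_both <- 587                              # count of rule [6]
n_lhs  <- round(0.28296562 * n)            # coverage * n: transactions with yogurt
n_rhs  <- round(0.5321850 / 1.161510 * n)  # (confidence / lift) * n: transactions with whole milk

# 2 x 2 contingency table: rows = yogurt, columns = whole milk
tab <- matrix(
  c(n_both,         n_lhs - n_both,
    n_rhs - n_both, n - n_lhs - n_rhs + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(yogurt = c("yes", "no"), whole_milk = c("yes", "no"))
)

# One-sided test: do yogurt buyers take whole milk more often than the rest?
p_value <- fisher.test(tab, alternative = "greater")$p.value
p_value < 0.01
```

The reconstructed table has 1103 transactions with yogurt and 1786 with whole milk, so whole milk appears in roughly 46% of all transactions, which is the baseline against which every lift in this section is measured.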
# cex is not among the control parameters accepted by the graph method, so it is dropped
plot(rules.wholemilk2, method="graph")
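The graph shows the relations between the products in the significant rules and how they connect to whole milk.
Similarity between items can also be measured directly. For detailed definitions, please see the Jaccard index: https://en.wikipedia.org/wiki/Jaccard_index
and the affinity measure: https://en.wikipedia.org/wiki/Affinity_analysis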
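# Keep only items that occur in more than 20% of the transactions
trans.sel<-trans[,itemFrequency(trans)>0.2]
# Jaccard dissimilarity between items (the default method of dissimilarity)
jac<-dissimilarity(trans.sel, which="items")
round(jac,digits=3)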
## bottled water other vegetables rolls/buns root vegetables
## other vegetables 0.811
## rolls/buns 0.836 0.747
## root vegetables 0.861 0.817 0.814
## sausage 0.867 0.810 0.826 0.852
## soda 0.831 0.781 0.780 0.824
## tropical fruit 0.856 0.824 0.822 0.859
## whole milk 0.799 0.703 0.716 0.803
## yogurt 0.846 0.777 0.786 0.838
## sausage soda tropical fruit whole milk
## other vegetables
## rolls/buns
## root vegetables
## sausage
## soda 0.825
## tropical fruit 0.858 0.824
## whole milk 0.808 0.757 0.798
## yogurt 0.818 0.805 0.828 0.745
a = affinity(trans.sel)
round(a, digits=3)
## An object of class "ar_similarity"
## bottled water other vegetables rolls/buns root vegetables
## bottled water 0.000 0.189 0.164 0.139
## other vegetables 0.189 0.000 0.253 0.184
## rolls/buns 0.164 0.253 0.000 0.186
## root vegetables 0.139 0.184 0.186 0.000
## sausage 0.133 0.190 0.174 0.148
## soda 0.169 0.219 0.220 0.176
## tropical fruit 0.144 0.176 0.178 0.141
## whole milk 0.201 0.297 0.284 0.197
## yogurt 0.154 0.223 0.214 0.162
## sausage soda tropical fruit whole milk yogurt
## bottled water 0.133 0.169 0.144 0.201 0.154
## other vegetables 0.190 0.219 0.176 0.297 0.223
## rolls/buns 0.174 0.220 0.178 0.284 0.214
## root vegetables 0.148 0.176 0.141 0.197 0.162
## sausage 0.000 0.175 0.142 0.192 0.182
## soda 0.175 0.000 0.176 0.243 0.195
## tropical fruit 0.142 0.176 0.000 0.202 0.172
## whole milk 0.192 0.243 0.202 0.000 0.255
## yogurt 0.182 0.195 0.172 0.255 0.000
## Slot "method":
## [1] "Affinity"
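The two tables above are complementary: each affinity equals one minus the corresponding Jaccard dissimilarity, for example 0.189 = 1 - 0.811 for bottled water and other vegetables. A minimal base R sketch on invented purchase indicators (the vectors `milk` and `yogurt` are illustrative, not taken from the dataset) shows how the Jaccard values are computed:

```r
# TRUE when a customer bought the item (invented values)
milk   <- c(TRUE, TRUE,  TRUE, FALSE, FALSE, TRUE)
yogurt <- c(TRUE, FALSE, TRUE, TRUE,  FALSE, FALSE)

n_both   <- sum(milk & yogurt)  # customers who bought both items
n_either <- sum(milk | yogurt)  # customers who bought at least one of them

jaccard_similarity    <- n_both / n_either
jaccard_dissimilarity <- 1 - jaccard_similarity

c(similarity = jaccard_similarity, dissimilarity = jaccard_dissimilarity)
```

Customers who bought neither item do not enter the calculation, which makes the measure suitable for sparse purchase data.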
Using Market Basket Analysis we can extract rules that describe which products customers tend to choose together. In our case we successfully extracted 6 statistically significant rules associated with a higher probability of buying whole milk. A similar analysis may be conducted for all other products, for example to decide on the optimal layout of a grocery store: it shows which products are often bought with other products and, consequently, which products should be placed near each other (within the customer's line of sight). Such placement may influence customers to buy another product.