Research Question
You are a data analyst at Carrefour Kenya, working on a project to inform the marketing department about the marketing strategies most likely to produce the highest number of sales (total price including tax). The project has been divided into four parts in which you will explore a recent marketing dataset using various unsupervised learning techniques and then provide recommendations based on your insights.
Create association rules that will allow us to identify relationships between variables in the dataset.
Find the most important association rules with a lift greater than 1, ordered by confidence, in the dataset provided.
The dataset provided contains various transactions by Carrefour Supermarket customers. We are able to perform market basket analysis from these transactions.
Reading the data
Checking the data - data understanding
Implementing the solution
Challenge the solution
Follow up Questions
Conclusion.
Recommendations.
We aim to find which association rules matter most to the supermarket, so that it can increase item sales and focus marketing effort on specific products, which in turn should increase profit.
# import arules and arulesviz
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::recode() masks arules::recode()
## ✖ tidyr::unpack() masks Matrix::unpack()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(glue)
# libraries for visualization
library(ggiraph)
library(ggiraphExtra)
# load the dataset as Transactions
df <- read.transactions('http://bit.ly/SupermarketDatasetII', sep=',')
## Warning in asMethod(object): removing duplicated items in transactions
#check data structure of our dataset
class(df)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
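read.transactions() removes items that are repeated within a single transaction, which is what the warning above reports. Identical transactions are kept, so every recorded basket still counts toward support.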
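To see what the warning refers to, here is a minimal sketch using a made-up three-line basket file (the file contents and the names `tmp` and `tx` are illustrative assumptions, not part of the supermarket data). Passing `rm.duplicates = TRUE` makes the removal of repeated items explicit.

```r
library(arules)

# Hypothetical basket file: line 1 repeats "milk"; lines 2 and 3 are identical baskets
tmp <- tempfile(fileext = ".csv")
writeLines(c("milk,eggs,milk", "bread,butter", "bread,butter"), tmp)

# rm.duplicates = TRUE drops repeated items *within* each transaction
tx <- read.transactions(tmp, format = "basket", sep = ",", rm.duplicates = TRUE)

n_transactions <- length(tx)  # number of baskets read; identical baskets are not merged
basket_sizes   <- size(tx)    # number of distinct items in each basket
```

In this sketch the repeated "milk" is collapsed, while the two identical bread-and-butter baskets both remain in `tx`.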
# Previewing the first 6 transactions (head() returns 6 by default)
#
inspect(head(df))
## items
## [1] {almonds,
## antioxydant juice,
## avocado,
## cottage cheese,
## energy drink,
## frozen smoothie,
## green grapes,
## green tea,
## honey,
## low fat yogurt,
## mineral water,
## olive oil,
## salad,
## salmon,
## shrimp,
## spinach,
## tomato juice,
## vegetables mix,
## whole weat flour,
## yams}
## [2] {burgers,
## eggs,
## meatballs}
## [3] {chutney}
## [4] {avocado,
## turkey}
## [5] {energy bar,
## green tea,
## milk,
## mineral water,
## whole wheat rice}
## [6] {low fat yogurt}
# checking summary of dataset
summary(df)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
The dataset contains 7,501 transactions. Customer identity is not recorded, so we cannot tell how many distinct customers these represent.
There are 119 products in our dataset.
The most common basket size is a single item (1,754 of the 7,501 transactions); the median basket holds 3 items and the mean is about 3.9.
#frequency/support plot for items with minimum support of 0.1
df %>%
  itemFrequency() %>%
  as_tibble(rownames = "items") %>%
  rename("support" = "value") %>%
  filter(support >= 0.1) %>%
  arrange(-support) %>%
  ggDonut(aes(donuts = items, count = support), explode = c(2, 4, 6, 8), labelposition = 0)
The most frequent items are: mineral water, eggs, spaghetti, french fries, chocolate, green tea and milk in that order.
# checking item frequency for first 5 items
itemFrequency(df[, 1:5],type = "absolute")
## almonds antioxydant juice asparagus avocado
## 153 67 36 250
## babies food
## 34
# relative frequency: percentage of all transactions that contain each item
round(itemFrequency(df[, 1:5],type = "relative")*100,2)
## almonds antioxydant juice asparagus avocado
## 2.04 0.89 0.48 3.33
## babies food
## 0.45
# create subplot
par(mfrow = c(1, 2))
# plot the frequency of top 10 most frequent items
itemFrequencyPlot(df, topN = 10,col="#34495E")
# plot the frequency of items with a minimum support of 0.1
itemFrequencyPlot(df, support = 0.1, col="#935116")
Plotting relative item frequency, we see that of the 10 most frequent items, only 7 have a support of at least 0.1.
The apriori algorithm relies on the following threshold measures:
The support of an itemset or rule measures how frequently it occurs in the data: $\mathrm{support}(X) = \frac{\mathrm{count}(X)}{N}$, where $N$ is the total number of transactions.
The confidence measures the strength of a rule; the closer it is to 1, the more often the consequent appears in transactions that contain the antecedent: $\mathrm{confidence}(X \rightarrow Y) = \frac{\mathrm{support}(X, Y)}{\mathrm{support}(X)}$
Lift measures how much more often the items occur together than would be expected if they were bought independently; a lift above 1 indicates a positive association: $\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{confidence}(X \rightarrow Y)}{\mathrm{support}(Y)}$
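As a worked illustration of the three definitions, the sketch below recomputes the measures for the rule {frozen smoothie, spinach} => {mineral water} from counts reported in this document's outputs: 7,501 transactions, 1,788 containing mineral water, 8 containing all three items, and 9 containing the antecedent (the last figure is inferred from the reported coverage of 0.001199840 and is an assumption of this sketch). The variable names are illustrative.

```r
# Counts taken from (or inferred from) the outputs in this report
n_total <- 7501  # all transactions
n_rhs   <- 1788  # transactions containing mineral water
n_lhs   <- 9     # transactions containing frozen smoothie and spinach (inferred from coverage)
n_both  <- 8     # transactions containing all three items

support    <- n_both / n_total               # support(X, Y)
confidence <- n_both / n_lhs                 # support(X, Y) / support(X)
lift       <- confidence / (n_rhs / n_total) # confidence / support(Y)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

The rounded values agree with the support, confidence and lift that `apriori()` reports for this rule.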
# building an association rules based model
# set minimum support = 0.001 and minimum confidence = 0.8
rules <- apriori (df, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 74 rules
The final result is 74 rules.
# check rules summary
summary(rules)
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 15 42 16 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.041 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001067 Min. : 3.356
## 1st Qu.:0.001067 1st Qu.:0.8000 1st Qu.:0.001333 1st Qu.: 3.432
## Median :0.001133 Median :0.8333 Median :0.001333 Median : 3.795
## Mean :0.001256 Mean :0.8504 Mean :0.001479 Mean : 4.823
## 3rd Qu.:0.001333 3rd Qu.:0.8889 3rd Qu.:0.001600 3rd Qu.: 4.877
## Max. :0.002533 Max. :1.0000 Max. :0.002666 Max. :12.722
## count
## Min. : 8.000
## 1st Qu.: 8.000
## Median : 8.500
## Mean : 9.419
## 3rd Qu.:10.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## df 7501 0.001 0.8
## call
## apriori(data = df, parameter = list(supp = 0.001, conf = 0.8))
Here, we discover that the majority of rules (42 of 74) involve 4 items in total (antecedent plus consequent); only 15 rules involve 3 items.
The summary of quality measures also gives descriptive statistics for the support, confidence, coverage, lift and count of the 74 rules.
The mean confidence across the 74 rules is about 85%.
# Observing rules built in our model i.e. first 5 model rules
# ---
#
inspect(rules[1:5])
## lhs rhs support confidence
## [1] {frozen smoothie, spinach} => {mineral water} 0.001066524 0.8888889
## [2] {bacon, pancakes} => {spaghetti} 0.001733102 0.8125000
## [3] {nonfat milk, turkey} => {mineral water} 0.001199840 0.8181818
## [4] {ground beef, nonfat milk} => {mineral water} 0.001599787 0.8571429
## [5] {mushroom cream sauce, pasta} => {escalope} 0.002532996 0.9500000
## coverage lift count
## [1] 0.001199840 3.729058 8
## [2] 0.002133049 4.666587 13
## [3] 0.001466471 3.432428 9
## [4] 0.001866418 3.595877 12
## [5] 0.002666311 11.976387 19
The first rule states that 89% of the transactions containing frozen smoothie and spinach also contain mineral water (confidence 0.889). There are 8 such transactions in our dataset. The lift of 3.73 is greater than 1, which indicates a positive association between these products.
# Looking at the first ten rules sorted by confidence
#
#
rules<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {french fries,
## mushroom cream sauce,
## pasta} => {escalope} 0.001066524 1.0000000 0.001066524 12.606723 8
## [2] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [3] {cake,
## meatballs,
## mineral water} => {milk} 0.001066524 1.0000000 0.001066524 7.717078 8
## [4] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [5] {mushroom cream sauce,
## pasta} => {escalope} 0.002532996 0.9500000 0.002666311 11.976387 19
## [6] {red wine,
## soup} => {mineral water} 0.001866418 0.9333333 0.001999733 3.915511 14
## [7] {eggs,
## mineral water,
## pasta} => {shrimp} 0.001333156 0.9090909 0.001466471 12.722185 10
## [8] {herb & pepper,
## mineral water,
## rice} => {ground beef} 0.001333156 0.9090909 0.001466471 9.252498 10
## [9] {ground beef,
## pancakes,
## whole wheat rice} => {mineral water} 0.001333156 0.9090909 0.001466471 3.813809 10
## [10] {frozen vegetables,
## milk,
## spaghetti,
## turkey} => {mineral water} 0.001199840 0.9000000 0.001333156 3.775671 9
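The first 4 rules have a confidence of 100%: every transaction containing the antecedent also contained the consequent. Each of them is backed by only 8 or 9 transactions, however.
The lift is greater than 1 for all of the top 10 rules, indicating a positive association in every case.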
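The stated goal (rules with a lift greater than 1, ordered by confidence) can be made explicit with `subset()` followed by `sort()`. The sketch below is only an illustration: it uses the `Groceries` dataset bundled with arules as a stand-in for the supermarket data so that it runs without a download, and the names `demo_rules` and `strong_rules` are hypothetical.

```r
library(arules)

# Stand-in data: Groceries ships with arules (this is NOT the Carrefour dataset)
data("Groceries")

# Same thresholds as in the analysis above
demo_rules <- apriori(Groceries,
                      parameter = list(supp = 0.001, conf = 0.8),
                      control = list(verbose = FALSE))

# Keep only rules with lift > 1, then order by confidence (highest first)
strong_rules <- subset(demo_rules, subset = lift > 1)
strong_rules <- sort(strong_rules, by = "confidence", decreasing = TRUE)

inspect(head(strong_rules, 5))
```

With a minimum confidence of 0.8 the lift filter rarely removes anything, as the summary above shows (minimum lift 3.356), but stating it keeps the selection criterion visible in the code.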
Let’s visualize the top 10 rules sorted by lift.
# visualizing the rules with ggiraph for interactivity
plot_rules <-
  rules %>%
  # sort rules by lift (highest first)
  arules::sort(by = "lift") %>%
  # convert output to a data frame
  DATAFRAME() %>%
  # convert the data frame to a tibble
  as_tibble() %>%
  # take the first 10 rules
  head(10) %>%
  mutate(
    # build a rule label and order its factor levels by lift
    ruleName = paste(LHS, "=>", RHS) %>% fct_reorder(lift),
    support = support,
    # format confidence as a percentage
    confidence = confidence %>% percent(),
    # round lift to 2 decimal places
    lift = lift %>% round(2)
  ) %>%
  # keep the rule name, support, confidence and lift for the plot
  select(ruleName, support, confidence, lift) %>%
  # create plot
  ggplot(aes(x = ruleName, y = lift)) +
  ggtitle('Top 10 Rules Plot') +
  geom_segment(aes(xend = ruleName, yend = 0),
               color = "#DC7633",
               size = 1) +
  # make plot interactive
  geom_point_interactive(aes(tooltip = glue("Support: {support}\nConfidence: {confidence}\nLift: {lift}"),
                             data_id = support),
                         size = 3,
                         color = "#85C1E9") +
  coord_flip() +
  theme_minimal() +
  theme(
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    # background color of the panel
    panel.background = element_rect(fill = "#979A9A", color = NA),
    # background color of the whole plot
    plot.background = element_rect(fill = "#D4EFDF", color = NA)
  ) +
  xlab("") +
  ylab("")
# display plot
girafe(ggobj = plot_rules)
The top rule by lift (12.72) implies that customers who buy eggs, mineral water and pasta are about 12.7 times as likely to also buy shrimp as customers in general. To increase shrimp sales, these antecedent products could be offered with a group discount.
# checking mineral water appearance on rhs
water <- subset(rules, subset = rhs %pin% "mineral water")
# Then order by confidence
water <- sort(water, by="confidence", decreasing=TRUE)
inspect(water)
## lhs rhs support confidence coverage lift count
## [1] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [2] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [3] {red wine,
## soup} => {mineral water} 0.001866418 0.9333333 0.001999733 3.915511 14
## [4] {ground beef,
## pancakes,
## whole wheat rice} => {mineral water} 0.001333156 0.9090909 0.001466471 3.813809 10
## [5] {frozen vegetables,
## milk,
## spaghetti,
## turkey} => {mineral water} 0.001199840 0.9000000 0.001333156 3.775671 9
## [6] {chocolate,
## frozen vegetables,
## olive oil,
## shrimp} => {mineral water} 0.001199840 0.9000000 0.001333156 3.775671 9
## [7] {frozen smoothie,
## spinach} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [8] {cake,
## meatballs,
## milk} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [9] {cake,
## olive oil,
## whole wheat pasta} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [10] {brownies,
## eggs,
## ground beef} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [11] {chicken,
## fresh bread,
## pancakes} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [12] {chocolate,
## soup,
## turkey} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [13] {chocolate,
## frozen vegetables,
## shrimp,
## spaghetti} => {mineral water} 0.001733102 0.8666667 0.001999733 3.635831 13
## [14] {ground beef,
## nonfat milk} => {mineral water} 0.001599787 0.8571429 0.001866418 3.595877 12
## [15] {turkey,
## whole wheat pasta} => {mineral water} 0.001466471 0.8461538 0.001733102 3.549776 11
## [16] {frozen vegetables,
## milk,
## shrimp,
## spaghetti} => {mineral water} 0.001466471 0.8461538 0.001733102 3.549776 11
## [17] {chocolate,
## eggs,
## frozen vegetables,
## ground beef} => {mineral water} 0.001466471 0.8461538 0.001733102 3.549776 11
## [18] {olive oil,
## soup,
## tomatoes} => {mineral water} 0.001333156 0.8333333 0.001599787 3.495992 10
## [19] {frozen vegetables,
## olive oil,
## shrimp} => {mineral water} 0.001866418 0.8235294 0.002266364 3.454862 14
## [20] {nonfat milk,
## turkey} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [21] {cooking oil,
## fromage blanc} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [22] {french fries,
## herb & pepper,
## milk} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [23] {burgers,
## frozen vegetables,
## olive oil} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [24] {frozen vegetables,
## milk,
## olive oil,
## soup} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [25] {chocolate,
## eggs,
## olive oil,
## spaghetti} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [26] {chocolate,
## milk,
## shrimp,
## spaghetti} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [27] {frozen vegetables,
## olive oil,
## soup} => {mineral water} 0.001733102 0.8125000 0.002133049 3.408592 13
## [28] {black tea,
## salmon} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [29] {pancakes,
## tomato sauce} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [30] {milk,
## spaghetti,
## strong cheese} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [31] {grated cheese,
## ground beef,
## rice} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [32] {oil,
## shrimp,
## spaghetti} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [33] {escalope,
## hot dogs,
## milk} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [34] {chocolate,
## hot dogs,
## milk} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [35] {chocolate,
## olive oil,
## soup} => {mineral water} 0.001599787 0.8000000 0.001999733 3.356152 12
## [36] {cooking oil,
## eggs,
## olive oil} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [37] {burgers,
## frozen vegetables,
## low fat yogurt} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [38] {cake,
## eggs,
## milk,
## turkey} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [39] {chocolate,
## eggs,
## milk,
## olive oil} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [40] {chocolate,
## frozen vegetables,
## pancakes,
## shrimp} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [41] {french fries,
## milk,
## pancakes,
## spaghetti} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
Out of the 74 rules, mineral water is a consequent product in 41 instances.
# investigating mineral water as an antecedent
# checking mineral water appearance on lhs
mineral <- subset(rules, subset = lhs %pin% "mineral water")
# Then order by confidence
mineral <- sort(mineral, by="confidence", decreasing=TRUE)
inspect(mineral)
## lhs rhs support confidence coverage lift count
## [1] {cake,
## meatballs,
## mineral water} => {milk} 0.001066524 1.0000000 0.001066524 7.717078 8
## [2] {eggs,
## mineral water,
## pasta} => {shrimp} 0.001333156 0.9090909 0.001466471 12.722185 10
## [3] {herb & pepper,
## mineral water,
## rice} => {ground beef} 0.001333156 0.9090909 0.001466471 9.252498 10
## [4] {light cream,
## mineral water,
## shrimp} => {spaghetti} 0.001066524 0.8888889 0.001199840 5.105326 8
## [5] {grated cheese,
## mineral water,
## rice} => {ground beef} 0.001066524 0.8888889 0.001199840 9.046887 8
## [6] {escalope,
## hot dogs,
## mineral water} => {milk} 0.001066524 0.8888889 0.001199840 6.859625 8
## [7] {chocolate,
## ground beef,
## milk,
## mineral water,
## spaghetti} => {frozen vegetables} 0.001066524 0.8888889 0.001199840 9.325253 8
## [8] {frozen vegetables,
## ground beef,
## mineral water,
## shrimp} => {spaghetti} 0.001733102 0.8666667 0.001999733 4.977693 13
## [9] {mineral water,
## pasta,
## shrimp} => {eggs} 0.001333156 0.8333333 0.001599787 4.637117 10
## [10] {frozen vegetables,
## ground beef,
## mineral water,
## tomatoes} => {spaghetti} 0.001199840 0.8181818 0.001466471 4.699220 9
## [11] {milk,
## mineral water,
## parmesan cheese} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [12] {cooking oil,
## mineral water,
## red wine} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [13] {frozen vegetables,
## mineral water,
## olive oil,
## tomatoes} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
## [14] {chocolate,
## french fries,
## mineral water,
## olive oil} => {spaghetti} 0.001066524 0.8000000 0.001333156 4.594793 8
Among the 14 rules with mineral water in the antecedent, spaghetti is the most common consequent (7 rules), ahead of ground beef and milk (2 rules each) and shrimp, frozen vegetables and eggs (1 rule each).
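Counting consequents by eye is error-prone for longer rule lists, so a tally can be computed instead. The sketch below again uses the bundled `Groceries` data as a stand-in (with "whole milk" playing the role of mineral water); `demo_rules`, `with_milk` and `rhs_counts` are hypothetical names. For the rules in this analysis, the same two lines would be applied to the `mineral` object.

```r
library(arules)

# Stand-in data and rules (NOT the Carrefour dataset)
data("Groceries")
demo_rules <- apriori(Groceries,
                      parameter = list(supp = 0.001, conf = 0.8),
                      control = list(verbose = FALSE))

# Rules whose antecedent contains the item of interest
with_milk <- subset(demo_rules, subset = lhs %pin% "whole milk")

# Tally how often each consequent appears, most frequent first
rhs_counts <- sort(table(labels(rhs(with_milk))), decreasing = TRUE)
rhs_counts
```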
# checking eggs appearance on rhs
eggsrhs <- subset(rules, subset = rhs %pin% "eggs")
# Then order by confidence
eggsrhs <- sort(eggsrhs, by="confidence", decreasing=TRUE)
inspect(eggsrhs)
## lhs rhs support confidence coverage
## [1] {black tea, spaghetti, turkey} => {eggs} 0.001066524 0.8888889 0.001199840
## [2] {mineral water, pasta, shrimp} => {eggs} 0.001333156 0.8333333 0.001599787
## lift count
## [1] 4.946258 8
## [2] 4.637117 10
# checking eggs appearance on lhs
eggslhs <- subset(rules, subset = lhs %pin% "eggs")
# Then order by confidence
eggslhs <- sort(eggslhs, by="confidence", decreasing=TRUE)
inspect(eggslhs)
## lhs rhs support confidence coverage lift count
## [1] {eggs,
## mineral water,
## pasta} => {shrimp} 0.001333156 0.9090909 0.001466471 12.722185 10
## [2] {brownies,
## eggs,
## ground beef} => {mineral water} 0.001066524 0.8888889 0.001199840 3.729058 8
## [3] {chocolate,
## eggs,
## frozen vegetables,
## ground beef} => {mineral water} 0.001466471 0.8461538 0.001733102 3.549776 11
## [4] {chocolate,
## eggs,
## olive oil,
## spaghetti} => {mineral water} 0.001199840 0.8181818 0.001466471 3.432428 9
## [5] {cooking oil,
## eggs,
## olive oil} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [6] {cake,
## eggs,
## milk,
## turkey} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
## [7] {chocolate,
## eggs,
## milk,
## olive oil} => {mineral water} 0.001066524 0.8000000 0.001333156 3.356152 8
Since we do not know how many distinct customers are behind these transactions, we cannot rule out the possibility of one customer buying the same products several times. However, we consider this unlikely, and it does not prevent us from trusting our findings.
We have the right data containing transactions in a supermarket.
The supermarket (the client) wanted to find out which products are associated.
According to our analysis, there are 74 rules that the client can apply. We shall, however, focus on the top 10 and test them against customers’ actual transaction behaviour.
Mineral water is the most frequently bought item in the supermarket. To increase profits, the supermarket can discount the products that tend to accompany mineral water in a basket and rearrange the shelves so that these products sit close to the mineral water. These products include spaghetti, ground beef, milk, eggs, frozen vegetables and shrimp.