You are a data analyst at Carrefour Kenya and are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax).
Identify the association rules of product transactions in the store.
The provided data was acquired from Carrefour stores in Kenya and records transactions made over time; the goal of this research is to draw insights from analysing it.
The data is relevant for this kind of study since it reflects how current transactions happen and the associations between the products bought in them.
library(modelr)
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x broom::bootstrap() masks modelr::bootstrap()
## x dplyr::filter() masks stats::filter()
## x data.table::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x data.table::last() masks dplyr::last()
## x purrr::lift() masks caret::lift()
## x purrr::transpose() masks data.table::transpose()
sales <- fread('http://bit.ly/SupermarketDatasetII')
## Warning in fread("http://bit.ly/SupermarketDatasetII"): Detected 2 column
## names but the data has 3 columns (i.e. invalid file). Added 1 extra default
## column name for the first column which is guessed to be row names or an index.
## Use setnames() afterwards if this guess is not correct, or fix the file write
## command that created the file to create a valid file.
## Warning in fread("http://bit.ly/SupermarketDatasetII"): Stopped early on line
## 10. Expected 3 fields but found 1. Consider fill=TRUE and comment.char=. First
## discarded non-empty line: <<french fries>>
head(sales)
## V1 whole wheat pasta french fries
## 1: soup light cream shallot
## 2: frozen vegetables spaghetti green tea
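Checking the structure of the dataset. As the warnings above indicate, fread() stopped early and kept only 2 rows of 3 columns, because each line of the file holds a different number of items rather than a fixed set of columns.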
str(sales)
## Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
## $ V1 : chr "soup" "frozen vegetables"
## $ whole wheat pasta: chr "light cream" "spaghetti"
## $ french fries : chr "shallot" "green tea"
## - attr(*, ".internal.selfref")=<externalptr>
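The warnings from fread() suggest that the file is in basket format: one transaction per line, with a varying number of comma-separated items. A minimal base-R sketch of that layout is shown here. The temporary file and its three baskets are made up for illustration and are not the real dataset.

```r
# Write a tiny basket-format file: one transaction per line, ragged lengths.
# (Illustrative data only; the file name comes from tempfile().)
tmp <- tempfile(fileext = ".csv")
writeLines(c("burgers,eggs,meatballs",
             "chutney",
             "avocado,turkey"), tmp)

# Read the raw lines and split each one on commas instead of forcing columns.
baskets <- strsplit(readLines(tmp), ",", fixed = TRUE)

# Number of items per basket; a rectangular reader expects these to be equal.
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

Because the basket sizes differ, a table reader has no consistent column count to work with, which is why a transaction-aware reader is a better fit for this file.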
Converting entries to transactions.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
path <- "http://bit.ly/SupermarketDatasetII"
Transactions <- read.transactions(path, sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
Transactions
## transactions in sparse format with
## 7501 transactions (rows) and
## 119 items (columns)
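The "removing duplicated items in transactions" warning arises when a basket lists the same item more than once. The following sketch reproduces that situation on a small made-up file and passes `rm.duplicates = TRUE` to `read.transactions()` so the duplicates are dropped explicitly. The toy baskets and the object name `toy` are assumptions for illustration.

```r
library(arules)

# Toy basket file (illustrative only): the first basket repeats "milk".
tmp <- tempfile(fileext = ".csv")
writeLines(c("milk,green tea,milk",
             "eggs,burgers",
             "mineral water"), tmp)

# format = "basket": one transaction per line, items separated by sep.
# rm.duplicates = TRUE drops repeated items within a transaction.
toy <- read.transactions(tmp, format = "basket", sep = ",",
                         rm.duplicates = TRUE)

# Number of distinct items in each transaction.
print(size(toy))
```

Each transaction is stored as a set of items, so a repeated item carries no extra information for association rules.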
Verifying the object’s class
class(Transactions)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
Previewing our first 5 transactions.
inspect(Transactions[1:5])
## items
## [1] {almonds,
## antioxydant juice,
## avocado,
## cottage cheese,
## energy drink,
## frozen smoothie,
## green grapes,
## green tea,
## honey,
## low fat yogurt,
## mineral water,
## olive oil,
## salad,
## salmon,
## shrimp,
## spinach,
## tomato juice,
## vegetables mix,
## whole weat flour,
## yams}
## [2] {burgers,
## eggs,
## meatballs}
## [3] {chutney}
## [4] {avocado,
## turkey}
## [5] {energy bar,
## green tea,
## milk,
## mineral water,
## whole wheat rice}
Previewing items that make up our dataset
items <- as.data.frame(itemLabels(Transactions))
colnames(items) <- "Item"
head(items, 10)
## Item
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
## 4 avocado
## 5 babies food
## 6 bacon
## 7 barbecue sauce
## 8 black tea
## 9 blueberries
## 10 body spray
Summary of the transaction dataset.
summary(Transactions)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
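The density and mean basket size reported in the summary can be checked by hand from the other figures it prints. This short sketch uses only numbers copied from the output above; the variable names are chosen here for illustration.

```r
# Figures copied from the summary() output above.
n_transactions <- 7501
n_items <- 119

# Item occurrences: the five most frequent items plus "(Other)".
item_counts <- c(1788, 1348, 1306, 1282, 1229, 22405)
total_occurrences <- sum(item_counts)

# Density: share of cells in the transaction-by-item matrix that are filled.
density <- total_occurrences / (n_transactions * n_items)

# Mean basket size: item occurrences per transaction.
mean_basket_size <- total_occurrences / n_transactions

print(c(density = density, mean_basket_size = mean_basket_size))
```

A density of about 3.3% means the matrix is very sparse: on average a basket holds roughly 4 of the 119 available items.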
Exploring the frequency of the items at indices 3 to 8, first as absolute counts and then as percentages of all transactions.
itemFrequency(Transactions[, 3:8], type = "absolute")
## asparagus avocado babies food bacon barbecue sauce
## 36 250 34 65 81
## black tea
## 107
round(itemFrequency(Transactions[, 3:8], type = "relative") * 100, 2)
## asparagus avocado babies food bacon barbecue sauce
## 0.48 3.33 0.45 0.87 1.08
## black tea
## 1.43
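The relative frequencies are simply the absolute counts divided by the number of transactions. A quick check in base R, using the counts printed above (the vector names are typed in by hand for illustration):

```r
# Absolute counts copied from the itemFrequency() output above.
n_transactions <- 7501
absolute <- c(asparagus = 36, avocado = 250, `babies food` = 34,
              bacon = 65, `barbecue sauce` = 81, `black tea` = 107)

# Relative frequency as a percentage of all transactions, rounded to 2 places.
relative_pct <- round(absolute / n_transactions * 100, 2)
print(relative_pct)
```

The relative frequency of a single item is also its support, which is the quantity the `supp` threshold of apriori() is compared against.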
Building the model
Model 1: support = 0.001 and confidence = 0.8
sales.rules <- apriori(Transactions, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
sales.rules
## set of 74 rules
Model 1 has 74 rules.
Model 2: support = 0.002 and confidence = 0.8
sales.rules2 <- apriori(Transactions, parameter = list(supp = 0.002, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 15
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
sales.rules2
## set of 2 rules
Model 2 has only 2 rules, which means a lot of possible rules will be overlooked.
Model 3: support = 0.002 and confidence = 0.6
sales.rules3 <- apriori(Transactions, parameter = list(supp = 0.002, conf = 0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 15
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [43 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
sales.rules3
## set of 43 rules
Model 3 produces 43 rules.
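The three models differ only in their support and confidence thresholds, so the comparison can also be written as a small sweep over a grid of thresholds. The sketch below does this on a made-up set of six baskets, not on the Carrefour data. The thresholds, the `minlen = 2` setting (used here to skip rules with an empty left-hand side) and all object names are assumptions for illustration.

```r
library(arules)

# Toy baskets (illustrative only).
baskets <- list(
  c("pasta", "sauce", "escalope"),
  c("pasta", "sauce", "escalope"),
  c("pasta", "sauce"),
  c("water", "eggs"),
  c("water", "eggs", "pasta"),
  c("water")
)
toy <- as(baskets, "transactions")

# Every combination of the candidate thresholds.
grid <- expand.grid(supp = c(0.3, 0.5), conf = c(0.6, 0.8))

# Mine rules for each combination and record how many are produced.
grid$n_rules <- mapply(function(s, cf) {
  rules <- apriori(toy,
                   parameter = list(supp = s, conf = cf, minlen = 2),
                   control = list(verbose = FALSE))
  length(rules)
}, grid$supp, grid$conf)

print(grid)
```

Raising either threshold can only keep or shrink the rule set, which is the same pattern seen when moving from model 1 to model 2, and the reverse when relaxing confidence from model 2 to model 3.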
Exploring model 1
summary(sales.rules)
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 15 42 16 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.041 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001067 Min. : 3.356
## 1st Qu.:0.001067 1st Qu.:0.8000 1st Qu.:0.001333 1st Qu.: 3.432
## Median :0.001133 Median :0.8333 Median :0.001333 Median : 3.795
## Mean :0.001256 Mean :0.8504 Mean :0.001479 Mean : 4.823
## 3rd Qu.:0.001333 3rd Qu.:0.8889 3rd Qu.:0.001600 3rd Qu.: 4.877
## Max. :0.002533 Max. :1.0000 Max. :0.002666 Max. :12.722
## count
## Min. : 8.000
## 1st Qu.: 8.000
## Median : 8.500
## Mean : 9.419
## 3rd Qu.:10.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## Transactions 7501 0.001 0.8
## call
## apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.8))
Observing the first five rules built by model 1
inspect(sales.rules[1:5])
## lhs rhs support confidence
## [1] {frozen smoothie, spinach} => {mineral water} 0.001066524 0.8888889
## [2] {bacon, pancakes} => {spaghetti} 0.001733102 0.8125000
## [3] {nonfat milk, turkey} => {mineral water} 0.001199840 0.8181818
## [4] {ground beef, nonfat milk} => {mineral water} 0.001599787 0.8571429
## [5] {mushroom cream sauce, pasta} => {escalope} 0.002532996 0.9500000
## coverage lift count
## [1] 0.001199840 3.729058 8
## [2] 0.002133049 4.666587 13
## [3] 0.001466471 3.432428 9
## [4] 0.001866418 3.595877 12
## [5] 0.002666311 11.976387 19
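The quality measures in this table are linked by simple ratios, which can be verified for rule [5], {mushroom cream sauce, pasta} => {escalope}. The sketch below uses only the count, confidence and lift printed above. The support of the right-hand side is derived from them rather than looked up, and the variable names are chosen here for illustration.

```r
# Values copied from the inspect() output for rule [5].
n_transactions <- 7501
count <- 19            # transactions containing lhs and rhs together
confidence <- 0.95
lift <- 11.976387

# support = share of all transactions that contain lhs and rhs together.
support <- count / n_transactions

# coverage = support of the lhs alone, since confidence = support / coverage.
coverage <- support / confidence

# Support of the rhs alone, since lift = confidence / support(rhs).
rhs_support <- confidence / lift

print(c(support = support, coverage = coverage, rhs_support = rhs_support))
```

A lift of about 12 means that baskets containing mushroom cream sauce and pasta include escalope roughly twelve times as often as baskets in general.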
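The best set of rules is model 3, with a support of 0.002 and a confidence of 0.6: it yields 43 rules, compared with only 2 for model 2, and its higher support threshold (an absolute minimum support count of 15, against 7 for model 1) means each rule rests on more transactions than the 74 rules of model 1.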