Association Rules

CarreFour Kenya Sales increment strategies .

1. Research Question

DA data analyst at Carrefour Kenya and are currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest no. of sales (total price including tax).

2. Metric of Success

Identify the association rules of product transactions in the store.

3. Understanding the context.

The provided data has been acquired from the carre four stores in Kenya of transactions that have been made over time and my goal of this research is to come up with insights from the analysis.

4. Recording the Experimental Design

Data Loading
Data Cleaning and preprocessing
Exploratory Data Analysis
Associative rules implememntation
Recommendations and Conclusions.

5. Data Relevance.

The provided data is relevant for this kind of study since it has a reflection of how current transactions happen and the association it portrays.

Data Preview

Loading the libraries

library(modelr)
library(broom)

## 
## Attaching package: 'broom'

## The following object is masked from 'package:modelr':
## 
##     bootstrap

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(rpart)
library(ggplot2)
library(Amelia)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.6     v purrr   0.3.4
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x broom::bootstrap()    masks modelr::bootstrap()
## x dplyr::filter()       masks stats::filter()
## x data.table::first()   masks dplyr::first()
## x dplyr::lag()          masks stats::lag()
## x data.table::last()    masks dplyr::last()
## x purrr::lift()         masks caret::lift()
## x purrr::transpose()    masks data.table::transpose()

Loading the data

sales <- fread('http://bit.ly/SupermarketDatasetII')

## Warning in fread("http://bit.ly/SupermarketDatasetII"): Detected 2 column
## names but the data has 3 columns (i.e. invalid file). Added 1 extra default
## column name for the first column which is guessed to be row names or an index.
## Use setnames() afterwards if this guess is not correct, or fix the file write
## command that created the file to create a valid file.

## Warning in fread("http://bit.ly/SupermarketDatasetII"): Stopped early on line
## 10. Expected 3 fields but found 1. Consider fill=TRUE and comment.char=. First
## discarded non-empty line: <<french fries>>

head(sales)

##                   V1 whole wheat pasta french fries
## 1:              soup       light cream      shallot
## 2: frozen vegetables         spaghetti    green tea

checking out the dataset

str(sales)

## Classes 'data.table' and 'data.frame':   2 obs. of  3 variables:
##  $ V1               : chr  "soup" "frozen vegetables"
##  $ whole wheat pasta: chr  "light cream" "spaghetti"
##  $ french fries     : chr  "shallot" "green tea"
##  - attr(*, ".internal.selfref")=<externalptr>

Converting entries to transactions.

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## 
## Attaching package: 'arules'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

path <-"http://bit.ly/SupermarketDatasetII"
Transactions<-read.transactions(path, sep = ",")

## Warning in asMethod(object): removing duplicated items in transactions

Transactions

## transactions in sparse format with
##  7501 transactions (rows) and
##  119 items (columns)

Verifying the object’s class

class(Transactions)

## [1] "transactions"
## attr(,"package")
## [1] "arules"

previewing our first 5 transations.

inspect(Transactions[1:5])

##     items               
## [1] {almonds,           
##      antioxydant juice, 
##      avocado,           
##      cottage cheese,    
##      energy drink,      
##      frozen smoothie,   
##      green grapes,      
##      green tea,         
##      honey,             
##      low fat yogurt,    
##      mineral water,     
##      olive oil,         
##      salad,             
##      salmon,            
##      shrimp,            
##      spinach,           
##      tomato juice,      
##      vegetables mix,    
##      whole weat flour,  
##      yams}              
## [2] {burgers,           
##      eggs,              
##      meatballs}         
## [3] {chutney}           
## [4] {avocado,           
##      turkey}            
## [5] {energy bar,        
##      green tea,         
##      milk,              
##      mineral water,     
##      whole wheat rice}

Previewing items that make up our dataset

items<-as.data.frame(itemLabels(Transactions))
colnames(items) <- "Item"
head(items, 10)

##                 Item
## 1            almonds
## 2  antioxydant juice
## 3          asparagus
## 4            avocado
## 5        babies food
## 6              bacon
## 7     barbecue sauce
## 8          black tea
## 9        blueberries
## 10        body spray

summary of the transaction dataset.

summary(Transactions)

## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

Exploring frequency of some articles at index 3 to 8.

itemFrequency(Transactions[, 3:8],type = "absolute")

##      asparagus        avocado    babies food          bacon barbecue sauce 
##             36            250             34             65             81 
##      black tea 
##            107

round(itemFrequency(Transactions[, 3:8],type = "relative")*100,2)

##      asparagus        avocado    babies food          bacon barbecue sauce 
##           0.48           3.33           0.45           0.87           1.08 
##      black tea 
##           1.43

Build the model model 1 : support =0.001 and confidence = 0.8

sales.rules <- apriori (Transactions, parameter = list(supp = 0.001, conf = 0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

sales.rules

## set of 74 rules

model 1 has 74 rules.

model 2 : support =0.002 and confidence = 0.8

sales.rules2 <- apriori (Transactions, parameter = list(supp = 0.002, conf = 0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 15 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

sales.rules2

## set of 2 rules

model 2 has 2 rules which means alot of ossible rules will be overlooked.

model 3 : support =0.002 and confidence = 0.6

sales.rules3 <- apriori (Transactions, parameter = list(supp = 0.002, conf = 0.6))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 15 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [43 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

sales.rules3

## set of 43 rules

model 3 produces 43 rules.

Exploring our model

summary(sales.rules)

## set of 74 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 42 16  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.041   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.8000   Min.   :0.001067   Min.   : 3.356  
##  1st Qu.:0.001067   1st Qu.:0.8000   1st Qu.:0.001333   1st Qu.: 3.432  
##  Median :0.001133   Median :0.8333   Median :0.001333   Median : 3.795  
##  Mean   :0.001256   Mean   :0.8504   Mean   :0.001479   Mean   : 4.823  
##  3rd Qu.:0.001333   3rd Qu.:0.8889   3rd Qu.:0.001600   3rd Qu.: 4.877  
##  Max.   :0.002533   Max.   :1.0000   Max.   :0.002666   Max.   :12.722  
##      count       
##  Min.   : 8.000  
##  1st Qu.: 8.000  
##  Median : 8.500  
##  Mean   : 9.419  
##  3rd Qu.:10.000  
##  Max.   :19.000  
## 
## mining info:
##          data ntransactions support confidence
##  Transactions          7501   0.001        0.8
##                                                                      call
##  apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.8))

Observing rules built in our model

inspect(sales.rules[1:5])

##     lhs                              rhs             support     confidence
## [1] {frozen smoothie, spinach}    => {mineral water} 0.001066524 0.8888889 
## [2] {bacon, pancakes}             => {spaghetti}     0.001733102 0.8125000 
## [3] {nonfat milk, turkey}         => {mineral water} 0.001199840 0.8181818 
## [4] {ground beef, nonfat milk}    => {mineral water} 0.001599787 0.8571429 
## [5] {mushroom cream sauce, pasta} => {escalope}      0.002532996 0.9500000 
##     coverage    lift      count
## [1] 0.001199840  3.729058  8   
## [2] 0.002133049  4.666587 13   
## [3] 0.001466471  3.432428  9   
## [4] 0.001866418  3.595877 12   
## [5] 0.002666311 11.976387 19

summary(sales.rules)

## set of 74 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 42 16  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.041   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.8000   Min.   :0.001067   Min.   : 3.356  
##  1st Qu.:0.001067   1st Qu.:0.8000   1st Qu.:0.001333   1st Qu.: 3.432  
##  Median :0.001133   Median :0.8333   Median :0.001333   Median : 3.795  
##  Mean   :0.001256   Mean   :0.8504   Mean   :0.001479   Mean   : 4.823  
##  3rd Qu.:0.001333   3rd Qu.:0.8889   3rd Qu.:0.001600   3rd Qu.: 4.877  
##  Max.   :0.002533   Max.   :1.0000   Max.   :0.002666   Max.   :12.722  
##      count       
##  Min.   : 8.000  
##  1st Qu.: 8.000  
##  Median : 8.500  
##  Mean   : 9.419  
##  3rd Qu.:10.000  
##  Max.   :19.000  
## 
## mining info:
##          data ntransactions support confidence
##  Transactions          7501   0.001        0.8
##                                                                      call
##  apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.8))

Recommendations.

The best set of rules is model 3 with a support of 0.002 and confidence of 0.6.