MARKET BASKET ANALYSIS - ASSOCIATION RULES

1. Defining the Question

a) Specifying the Question

Which goods are likely to be purchased together?
Which customer type is likely to return?
Which items have the highest transaction frequency?

b) Defining the Metric for Success

Accurately deriving insights on the most popular items, items that have the highest relationship, and types of customers that are likely to return.

c) Understanding the Context

Carrefour Kenya was undertaking a project that would inform the marketing department on the most relevant marketing strategies that would result in the highest number of sales. They provided a data set on which insights were to be derived.

The insights derived are important for planning stocking, advertisements, and promotions, to boost sales. Therefore, highly accurate information is required to prevent losses incurred by a high number of closing stock, especially for perishable goods.

d) Recording the Experimental Design

Data sourcing/loading.
Data preparation.
Determining item frequency.
Implementing the Apriori algorithm.
Determining item sets with the highest confidence.
Determining item sets with the highest lift.
Implementing the solution.
Challenging the solution.
Conclusion.
Recommendations.
Follow up questions.

e) Data Relevance

For the data to be relevant, it should provide necessary insights that can be derived and used by the marketing and sales departments to improve on currently implemented strategies used to increase sales.

2. Data Understanding

a) Importing Required Libraries

# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)

# Libraries
library(data.table) # Data Table library
library (plyr)
library(psych)
library(coda)
library(base)       # Date-time conversion
library(ggplot2)    # Plotting Library

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(moments)    # Measures of distribution 
library(ggcorrplot) # Correlation plotting 
library(mice)       # Missing values

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(devtools)

## Loading required package: usethis

library(arules)     # Apriori

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)  # Apriori visualization

b) Reading the Data

# Loading data set
part3 <- read.csv('dataset_3.csv', header = F)

c) Checking the Data

i) Number of Records

# Data set records summary
records.summary <- function(data){
  cat('Number of rows = ', nrow(data), ' and columns = ', ncol(data), '\n')
  
}

# Data set summary

print('Dataset 3')

## [1] "Dataset 3"

records.summary(part3)

## Number of rows =  7501  and columns =  20

ii) Top Dataset Preview

# Top data set preview
head(part3)

##               V1        V2         V3               V4           V5
## 1         shrimp   almonds    avocado   vegetables mix green grapes
## 2        burgers meatballs       eggs                              
## 3        chutney                                                   
## 4         turkey   avocado                                         
## 5  mineral water      milk energy bar whole wheat rice    green tea
## 6 low fat yogurt                                                   
##                 V6   V7             V8           V9          V10            V11
## 1 whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt
## 2                                                                              
## 3                                                                              
## 4                                                                              
## 5                                                                              
## 6                                                                              
##         V12   V13   V14           V15    V16               V17             V18
## 1 green tea honey salad mineral water salmon antioxydant juice frozen smoothie
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6                                                                             
##       V19       V20
## 1 spinach olive oil
## 2                  
## 3                  
## 4                  
## 5                  
## 6

iii) Bottom Dataset Preview

# Bottom  data set preview
tail(part3)

##            V1                V2          V3             V4        V5        V6
## 7496 pancakes        light mayo                                               
## 7497   butter        light mayo fresh bread                                   
## 7498  burgers frozen vegetables        eggs   french fries magazines green tea
## 7499  chicken                                                                 
## 7500 escalope         green tea                                               
## 7501     eggs   frozen smoothie yogurt cake low fat yogurt                    
##      V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 7496                                                     
## 7497                                                     
## 7498                                                     
## 7499                                                     
## 7500                                                     
## 7501

The date column will be changed to a date data type.

d) Checking Datatypes

, 10)

# Data type for the third data set.
str(part3)

## 'data.frame':    7501 obs. of  20 variables:
##  $ V1 : chr  "shrimp" "burgers" "chutney" "turkey" ...
##  $ V2 : chr  "almonds" "meatballs" "" "avocado" ...
##  $ V3 : chr  "avocado" "eggs" "" "" ...
##  $ V4 : chr  "vegetables mix" "" "" "" ...
##  $ V5 : chr  "green grapes" "" "" "" ...
##  $ V6 : chr  "whole weat flour" "" "" "" ...
##  $ V7 : chr  "yams" "" "" "" ...
##  $ V8 : chr  "cottage cheese" "" "" "" ...
##  $ V9 : chr  "energy drink" "" "" "" ...
##  $ V10: chr  "tomato juice" "" "" "" ...
##  $ V11: chr  "low fat yogurt" "" "" "" ...
##  $ V12: chr  "green tea" "" "" "" ...
##  $ V13: chr  "honey" "" "" "" ...
##  $ V14: chr  "salad" "" "" "" ...
##  $ V15: chr  "mineral water" "" "" "" ...
##  $ V16: chr  "salmon" "" "" "" ...
##  $ V17: chr  "antioxydant juice" "" "" "" ...
##  $ V18: chr  "frozen smoothie" "" "" "" ...
##  $ V19: chr  "spinach" "" "" "" ...
##  $ V20: chr  "olive oil" "" "" "" ...

The data types are correct. The NAN values will be dealt with and analyzed later.

3. External Data Set Validation

The data sets have been provided by the client, therefore, external data set validation will not be used.

4. Data Preperation

a) Validation

# Data set columns
colnames(part3)

##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
## [13] "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20"

All columns will be retained as they contain product categories.

b) Consistency

Part 3

# Missing values
colSums(is.na(part3))

##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

There are no missing values in the second data set.

c) Completeness

# Checking for duplicates
sum(duplicated(part3))

## [1] 2325

There are duplicates in the second data set. They will be dropped later when the transaction class is formed, if necessary.

d) Uniformity

Checking the uniformity of column names and values.

# Column names
colnames(part3)

##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
## [13] "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20"

The column names are uniform, therefore, they will not be renamed.

e) Outliers

# Numerical columns
num <- unlist(lapply(part3, is.numeric))
num

##    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11   V12   V13 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##   V14   V15   V16   V17   V18   V19   V20 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE

There are no numerical columns, therefore, no outliers can be checked.

head(part3)

##               V1        V2         V3               V4           V5
## 1         shrimp   almonds    avocado   vegetables mix green grapes
## 2        burgers meatballs       eggs                              
## 3        chutney                                                   
## 4         turkey   avocado                                         
## 5  mineral water      milk energy bar whole wheat rice    green tea
## 6 low fat yogurt                                                   
##                 V6   V7             V8           V9          V10            V11
## 1 whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt
## 2                                                                              
## 3                                                                              
## 4                                                                              
## 5                                                                              
## 6                                                                              
##         V12   V13   V14           V15    V16               V17             V18
## 1 green tea honey salad mineral water salmon antioxydant juice frozen smoothie
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6                                                                             
##       V19       V20
## 1 spinach olive oil
## 2                  
## 3                  
## 4                  
## 5                  
## 6

Due to the format of the data set, descriptive analysis will not be done using the common methods. Instead, these insights will be derived in the following section, using functions specific to the transaction class and Apriori function.

Part 3: Association Rules

# As the data set does not require any alterations, it will be re-loaded as a
# transactions object.

transactions <- read.transactions('dataset_3.csv', format = 'basket',
                                  rm.duplicates = T, sep = ',')

## distribution of transactions with duplicates:
## 1 
## 5

items<-as.data.frame(itemLabels(transactions))
colnames(items) <- "Item"
head(items, 10)

##                 Item
## 1            almonds
## 2  antioxydant juice
## 3          asparagus
## 4            avocado
## 5        babies food
## 6              bacon
## 7     barbecue sauce
## 8          black tea
## 9        blueberries
## 10        body spray

For this data set, duplicates were removed, and the separation of the items set to ‘,’ to prevent item splitting because of white spaces.

# transactions summary
summary(transactions)

## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

Mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.
The smallest item set has 1 item, while the super set has 20 items.

# Inspecting the transaction object
inspect(transactions[1:5])

##     items               
## [1] {almonds,           
##      antioxydant juice, 
##      avocado,           
##      cottage cheese,    
##      energy drink,      
##      frozen smoothie,   
##      green grapes,      
##      green tea,         
##      honey,             
##      low fat yogurt,    
##      mineral water,     
##      olive oil,         
##      salad,             
##      salmon,            
##      shrimp,            
##      spinach,           
##      tomato juice,      
##      vegetables mix,    
##      whole weat flour,  
##      yams}              
## [2] {burgers,           
##      eggs,              
##      meatballs}         
## [3] {chutney}           
## [4] {avocado,           
##      turkey}            
## [5] {energy bar,        
##      green tea,         
##      milk,              
##      mineral water,     
##      whole wheat rice}

# Item Frequency plot

itemFrequencyPlot(transactions, topN = 20, type = 'absolute',
                  col="darkblue",
                  main = 'Absolute Item Frequency Plot')

As specified above: mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.

# Association rules
rules <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Summary
summary(rules)

## set of 74 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 42 16  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.041   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.8000   Min.   :0.001067   Min.   : 3.356  
##  1st Qu.:0.001067   1st Qu.:0.8000   1st Qu.:0.001333   1st Qu.: 3.432  
##  Median :0.001133   Median :0.8333   Median :0.001333   Median : 3.795  
##  Mean   :0.001256   Mean   :0.8504   Mean   :0.001479   Mean   : 4.823  
##  3rd Qu.:0.001333   3rd Qu.:0.8889   3rd Qu.:0.001600   3rd Qu.: 4.877  
##  Max.   :0.002533   Max.   :1.0000   Max.   :0.002666   Max.   :12.722  
##      count       
##  Min.   : 8.000  
##  1st Qu.: 8.000  
##  Median : 8.500  
##  Mean   : 9.419  
##  3rd Qu.:10.000  
##  Max.   :19.000  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          7501   0.001        0.8
##                                                                      call
##  apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))

A support of 0.002 leads to 2 rules being formed, while 0.001 = 74 rules. Support = 0.001 will be used to prevent loss of meaningful insights. The are a maximum of 6 items in a rule, and a minimum of 3. * A rule length of 4 has the most rules = 42, while a length of 6 has the lowest = 1. * The mean lift if 4.8, which means that most antecedents and consequents are highly correlated, meaning that the consequents are likely to be bought after the antecedents have been bought.

Removing Redundant Rules

# Remove redundancy
subset.rules <- which(colSums(is.subset(rules, rules)) > 1)
sub.rules <- rules[-subset.rules]
summary(sub.rules)

## set of 62 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 33 13  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       4       4       4       4       6 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.8000   Min.   :0.001200   Min.   : 3.356  
##  1st Qu.:0.001067   1st Qu.:0.8000   1st Qu.:0.001333   1st Qu.: 3.432  
##  Median :0.001200   Median :0.8284   Median :0.001333   Median : 3.752  
##  Mean   :0.001279   Mean   :0.8451   Mean   :0.001514   Mean   : 4.533  
##  3rd Qu.:0.001333   3rd Qu.:0.8889   3rd Qu.:0.001600   3rd Qu.: 4.824  
##  Max.   :0.002533   Max.   :1.0000   Max.   :0.002666   Max.   :11.995  
##      count       
##  Min.   : 8.000  
##  1st Qu.: 8.000  
##  Median : 9.000  
##  Mean   : 9.597  
##  3rd Qu.:10.000  
##  Max.   :19.000  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          7501   0.001        0.8
##                                                                      call
##  apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))

62 rules remain. The largest rule length is still 6, and the mean lift increases to 4.5.

Lift

# The first 5 rules
inspect(head(sort(sub.rules, by = 'lift'), 5))

##     lhs                        rhs                     support confidence    coverage      lift count
## [1] {milk,                                                                                           
##      pasta}                 => {shrimp}            0.001599787  0.8571429 0.001866418 11.995203    12
## [2] {mushroom cream sauce,                                                                           
##      pasta}                 => {escalope}          0.002532996  0.9500000 0.002666311 11.976387    19
## [3] {chocolate,                                                                                      
##      ground beef,                                                                                    
##      milk,                                                                                           
##      mineral water,                                                                                  
##      spaghetti}             => {frozen vegetables} 0.001066524  0.8888889 0.001199840  9.325253     8
## [4] {herb & pepper,                                                                                  
##      mineral water,                                                                                  
##      rice}                  => {ground beef}       0.001333156  0.9090909 0.001466471  9.252498    10
## [5] {meatballs,                                                                                      
##      whole wheat pasta}     => {milk}              0.001333156  0.8333333 0.001599787  6.430898    10

The lift shows the preference of the rhs items being bought once the lhs was/ were bought.

Customers who bought milk and pasta were 12 more times likely to buy shrimp.
Customers who bough mushroom cream sauce and pasta were 12 times more likely to buy escalopes.
The same format applies for the remaining observations.

Confidence

# The first 5 rules
inspect(head(sort(sub.rules, by = 'confidence'), 5))

##     lhs                        rhs                 support confidence    coverage      lift count
## [1] {ground beef,                                                                                
##      light cream,                                                                                
##      olive oil}             => {mineral water} 0.001199840  1.0000000 0.001199840  4.195190     9
## [2] {cake,                                                                                       
##      olive oil,                                                                                  
##      shrimp}                => {mineral water} 0.001199840  1.0000000 0.001199840  4.195190     9
## [3] {mushroom cream sauce,                                                                       
##      pasta}                 => {escalope}      0.002532996  0.9500000 0.002666311 11.976387    19
## [4] {red wine,                                                                                   
##      soup}                  => {mineral water} 0.001866418  0.9333333 0.001999733  3.915511    14
## [5] {herb & pepper,                                                                              
##      mineral water,                                                                              
##      rice}                  => {ground beef}   0.001333156  0.9090909 0.001466471  9.252498    10

Customers who bought ground beef, light cream, olive oil, cake, and shrimp were 4 times more likely to buy mineral water. These sets had a confidence of 100%, therefore, these customers would always return.
Customers who bought mushroom cream sauce and pasta were 4 times more likely to buy escalope. This set had a confidence of 95%, therefore, these customers were also likely to return.
Customers who bought red wine and soup were 3.9 times more likely to buy mineral water. This set had a confidence of 93%, therefore, these customers were also likely to return.
Customers who bought herb & pepper, mineral water and rice were 9.3 times more likely to buy ground beef. This set had a confidence of 91%, therefore, these customers were also likely to return.

Limiting the item sets to 3

This provides a narrower scope for analysis.

shorter.rules <- apriori(transactions, parameter = list(supp=0.001, conf=0.8,
                                                        maxlen=3))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Only 15 rules remain.
summary(shorter.rules)

## set of 15 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 15 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001067   Min.   :0.8000   Min.   :0.001200   Min.   : 3.356  
##  1st Qu.:0.001133   1st Qu.:0.8153   1st Qu.:0.001400   1st Qu.: 3.491  
##  Median :0.001200   Median :0.8182   Median :0.001466   Median : 3.916  
##  Mean   :0.001413   Mean   :0.8434   Mean   :0.001662   Mean   : 5.289  
##  3rd Qu.:0.001600   3rd Qu.:0.8571   3rd Qu.:0.001866   3rd Qu.: 5.598  
##  Max.   :0.002533   Max.   :0.9500   Max.   :0.002666   Max.   :11.995  
##      count     
##  Min.   : 8.0  
##  1st Qu.: 8.5  
##  Median : 9.0  
##  Mean   :10.6  
##  3rd Qu.:12.0  
##  Max.   :19.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          7501   0.001        0.8
##                                                                                  call
##  apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8, maxlen = 3))

Lift

# The first 5 rules
inspect(head(sort(shorter.rules, by = 'lift'), 5))

##     lhs                               rhs         support     confidence
## [1] {milk, pasta}                  => {shrimp}    0.001599787 0.8571429 
## [2] {mushroom cream sauce, pasta}  => {escalope}  0.002532996 0.9500000 
## [3] {meatballs, whole wheat pasta} => {milk}      0.001333156 0.8333333 
## [4] {black tea, frozen smoothie}   => {milk}      0.001199840 0.8181818 
## [5] {red wine, tomato sauce}       => {chocolate} 0.001066524 0.8000000 
##     coverage    lift      count
## [1] 0.001866418 11.995203 12   
## [2] 0.002666311 11.976387 19   
## [3] 0.001599787  6.430898 10   
## [4] 0.001466471  6.313973  9   
## [5] 0.001333156  4.882669  8

This gives a more concise summary, with very high lift and confidence values.

Customers who purchase milk and pasta are 12 times more likely to purchase shrimp.
Customers who bought mushroom cream sauce and pasta are 12 times more likely to buy escalope. This seems to be most popular combination.
The same format applies to the rest.
Selling the antecedents will lead to a high likelihood of the consequents being bought.

Confidence

# The first 5 rules
inspect(head(sort(shorter.rules, by = 'confidence'), 5))

##     lhs                              rhs             support     confidence
## [1] {mushroom cream sauce, pasta} => {escalope}      0.002532996 0.9500000 
## [2] {red wine, soup}              => {mineral water} 0.001866418 0.9333333 
## [3] {frozen smoothie, spinach}    => {mineral water} 0.001066524 0.8888889 
## [4] {ground beef, nonfat milk}    => {mineral water} 0.001599787 0.8571429 
## [5] {milk, pasta}                 => {shrimp}        0.001599787 0.8571429 
##     coverage    lift      count
## [1] 0.002666311 11.976387 19   
## [2] 0.001999733  3.915511 14   
## [3] 0.001199840  3.729058  8   
## [4] 0.001866418  3.595877 12   
## [5] 0.001866418 11.995203 12

Customers who buy mushroom cream sauce and pasta will buy escalope. They have a 95% chance of returning. This remains to be the most popular set, even for smaller item sets.
Customers who buy red wine and soup will buy mineral water. They have a 93% chance of returning.
Customers who purchase these goods are >86% likely to return as this metric depends on the number of transactions.

Visualizations

# Initial rules with all items set lengths
plot(sub.rules, jitter = 0) # Jitter = 0 - reduce  over plotting

High lift values are not restricted to high confidence values.
Most observations have a support < 0.002.

# Limited rule size
plot(shorter.rules, jitter = 0) # Jitter = 0 - reduce  overplotting

* The highest lift values corresponds to a confidence of 95% and ~86%. Most support values are < 0.00175.

Items that are likely to be bought together

Larger sets

The following plots are interactive, feel free to select items from the drop down menu and zoom in to the image for a clearer view.

top10 <- head(sub.rules, n = 10, by = 'lift')
plot(top10, method = "graph",  engine = "htmlwidget")

Shorter sets

top10.short <- head(shorter.rules, n = 10, by = 'lift')
plot(top10.short, method = "graph",  engine = "htmlwidget")

Customers likely to return

Larger sets

top10.c <- head(sub.rules, n = 10, by = 'confidence')
plot(top10.c, method = "graph",  engine = "htmlwidget")

Shorter sets

top10.short.c <- head(shorter.rules, n = 10, by = 'confidence')
plot(top10.short.c, method = "graph",  engine = "htmlwidget")

6. Implementing the Solution

Analysis Summary

1. Most popular items

Mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.

2. Items that have the highest relationship

(purchase of an item/items leads to a high likelihood of the purchase of an item/items.)

Customers who bought milk and pasta were 12 more times likely to buy shrimp.
Customers who bough mushroom cream sauce and pasta were also 12 times more likely to buy escalopes.

3. Customers most likely to return

—Longer item sets—

Customers who bought ground beef, light cream, olive oil, cake, and shrimp were 4 times more likely to buy mineral water. These sets had a confidence of 100%, therefore, these customers would always return.
Customers who bought mushroom cream sauce and pasta were 4 times more likely to buy escalope. This set had a confidence of 95%, therefore, these customers were also likely to return.
Customers who bought red wine and soup were 3.9 times more likely to buy mineral water. This set had a confidence of 93%, therefore, these customers were also likely to return.
Customers who bought herb & pepper, mineral water and rice were 9.3 times more likely to buy ground beef. This set had a confidence of 91%, therefore, these customers were also likely to return.

—Shorter item sets—

Customers who buy mushroom cream sauce and pasta will buy escalope. They have a 95% chance of returning. This remains to be the most popular set, even for smaller item sets.
Customers who buy red wine and soup will buy mineral water. They have a 93% chance of returning.

7. Challenging the Solution

The length of the item sets was limited to a maximum of 3. The main observation is that mushroom cream and pasta -> escalope has a high lift nd confidence. This means that these customers are highly likely to purchase this combination of goods, as well are return to the store (due to the high number of transactions).

8. Conclusion

In conclusion:

Mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.
Milk, pasta and shrimp; mushroom cream sauce, pasta and escopes had the highest likelihood of being purchased together. Customers who purchased these are also likely to return.
Customers who are most likely to return bought:

Larger maximum item sets:
- ground beef, light cream, olive oil, cake, and shrimp -> mineral water (100%),
- mushroom cream sauce and pasta -> escalope (95%),
- red wine and soup -> mineral water (93%),
- herb & pepper, mineral water and rice -> beef (91%).
Shorter maximum item sets:
- mushroom cream sauce and pasta -> escalope (95%),
- red wine and soup -> mineral water (93%).

9. Recommendations

Stocking up on the most popular items, which are mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.
Setting up promotions for antecedents and consequents with the highest lift and confidence. Furthermore, they can be placed closed to each other on the item aisles. These actions will boost sales.
Obtaining more data to provide more in depth insights that can be used to improve marketing and sales.

10. Follow Up Questions

a) Did we have the right data?

Yes, we have the right data as meaningful insights were derived from the data.

b) Do we need other data to answer our question?

Yes, more data can be used for better generalization of customer purchase behavior and the most popular items and item sets.

c) Did we have the right question?

Yes, we have the right question, as the analysis sought to cater to the client’s requirements.

# Suppressing warnings
options(warn = defaultW)

Association Rules

Deborah Masibo

2022-06-10