1. Research question

You are a data analyst at Carrefour Kenya working on a project to inform the marketing department about the marketing strategies most likely to produce the highest sales (total price including tax). This section of the project studies associations among the items sold by the supermarket.

2. Success Criteria

The most significant associations are identified to help the supermarket improve its marketing strategies.

3. Research Methodology

· Defining the research questions and work plan
· Loading the data set
· Previewing the data set
· Data validation
· Univariate analysis
· Sorting out purchases in descending order
· Targeting specific items in the data set
· Creating a visualization
· Providing a conclusion
· Defining further questions

4. Understanding Data provided

The data was obtained from Carrefour supermarket and details product transactions. The data set has a total of 7501 transactions.

Loading libraries

# Loading the libraries
#
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)

Loading the dataset

# loading the dataset
#
purchases <- read.transactions("http://bit.ly/SupermarketDatasetII", sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions

Previewing the dataset

# preview first six records of the dataset
#
head(purchases)
## transactions in sparse format with
##  6 transactions (rows) and
##  119 items (columns)
# preview dimensions of data set
#
dim(purchases)
## [1] 7501  119

The data set has 7501 transactions (rows) and 119 items (columns).

# Previewing the class of the data set
#
class(purchases)
## [1] "transactions"
## attr(,"package")
## [1] "arules"

The data set is an object of class "transactions" from the arules package.

Data validation

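No further validation is performed because the data is provided by the client and is therefore assumed to be accurate. Note that read.transactions reported removing duplicated items within transactions when the file was loaded.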

Univariate analysis

# Previewing the items that make up our data set
#
items<-as.data.frame(itemLabels(purchases))
colnames(items) <- "Item"

# Dimensions of the item list
dim(items) 
## [1] 119   1

There are a total of 119 items making up our transactions data set

# Generating a summary of the transaction data set
# 
summary(purchases)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

The three most frequently purchased items were mineral water, eggs and spaghetti.
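The density and mean transaction size reported by summary() can be reproduced from the transaction length distribution alone. The following is a minimal base R sketch that uses the counts printed in the output above; the variable names are illustrative and not part of the original analysis.

```r
# Transaction sizes and their counts, copied from the summary() output
sizes  <- c(1:16, 18, 19, 20)
counts <- c(1754, 1358, 1044, 816, 667, 493, 391, 324, 259, 139,
            102, 67, 40, 22, 17, 4, 1, 2, 1)

n_trans     <- sum(counts)          # number of transactions
n_items     <- 119                  # number of distinct items (columns)
total_items <- sum(sizes * counts)  # non-empty cells in the sparse matrix

mean_size <- total_items / n_trans              # average basket size
density   <- total_items / (n_trans * n_items)  # share of non-empty cells

print(c(n_trans = n_trans, total_items = total_items))
print(round(c(mean_size = mean_size, density = density), 4))
```

A density of about 3.3 percent means that a typical basket contains only about 4 of the 119 available items, which is why the data is stored in sparse format.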

# Exploring the frequency of some items chosen at random (columns 6-9)
# 
itemFrequency(purchases[, 6:9],type = "absolute")
##          bacon barbecue sauce      black tea    blueberries 
##             65             81            107             69
round(itemFrequency(purchases[, 6:9],type = "relative")*100,2)
##          bacon barbecue sauce      black tea    blueberries 
##           0.87           1.08           1.43           0.92

Among the four selected items, black tea was the most popular. It appeared in 107 transactions, which is 1.43 percent of all transactions.
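The relative frequencies are simply the absolute counts divided by the 7501 transactions. A small base R check, with the counts copied from the output above (the variable names are illustrative):

```r
# Absolute item counts copied from the itemFrequency() output
abs_freq <- c(bacon = 65, "barbecue sauce" = 81,
              "black tea" = 107, blueberries = 69)
n_trans <- 7501  # total number of transactions

# Relative frequency as a percentage of all transactions
rel_pct <- round(abs_freq / n_trans * 100, 2)
print(rel_pct)
```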

# Producing a chart of frequencies and filtering 
# 
par(mfrow = c(1, 2))

# plot the frequency of items
# Displaying top 10 most common items in the transactions dataset 
#
itemFrequencyPlot(purchases, topN = 10,col="darkgreen")

# Displaying the items whose relative frequency (support) is at least 10%
#
itemFrequencyPlot(purchases, support = 0.1,col="darkred")

# Building a model based on association rules using the apriori function 
#
rules <- apriori (purchases, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Show the first 5 rules (not yet sorted), printed with 2 significant digits
options(digits=2)
inspect(rules[1:5])
##     lhs                              rhs             support confidence
## [1] {frozen smoothie, spinach}    => {mineral water} 0.0011  0.89      
## [2] {bacon, pancakes}             => {spaghetti}     0.0017  0.81      
## [3] {nonfat milk, turkey}         => {mineral water} 0.0012  0.82      
## [4] {ground beef, nonfat milk}    => {mineral water} 0.0016  0.86      
## [5] {mushroom cream sauce, pasta} => {escalope}      0.0025  0.95      
##     coverage lift count
## [1] 0.0012    3.7  8   
## [2] 0.0021    4.7 13   
## [3] 0.0015    3.4  9   
## [4] 0.0019    3.6 12   
## [5] 0.0027   12.0 19

Observations

Taking the second of the five rules displayed: 81 percent of the transactions that contain bacon and pancakes also contain spaghetti.
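The quality measures of the second rule can be recomputed by hand as a sanity check. The sketch below uses the rule count (13) and the spaghetti count (1306) from the outputs above. The left-hand-side count of 16 is not printed anywhere; it is inferred from the coverage (0.0021 × 7501 ≈ 16) and should be read as an assumption.

```r
n_trans   <- 7501  # total transactions
count_xy  <- 13    # transactions with bacon, pancakes AND spaghetti
count_lhs <- 16    # ASSUMPTION: transactions with bacon and pancakes (inferred from coverage)
count_rhs <- 1306  # transactions with spaghetti

support    <- count_xy / n_trans              # P(lhs and rhs)
coverage   <- count_lhs / n_trans             # P(lhs)
confidence <- count_xy / count_lhs            # P(rhs | lhs)
lift       <- confidence / (count_rhs / n_trans)  # P(rhs | lhs) / P(rhs)

print(signif(c(support = support, confidence = confidence,
               coverage = coverage, lift = lift), 2))
```

A lift well above 1 indicates that spaghetti appears far more often in baskets with bacon and pancakes than in baskets overall.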

# summary of the rules
summary(rules)
## set of 74 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 42 16  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       4       4       4       4       6 
## 
## summary of quality measures:
##     support          confidence      coverage            lift     
##  Min.   :0.00107   Min.   :0.80   Min.   :0.00107   Min.   : 3.4  
##  1st Qu.:0.00107   1st Qu.:0.80   1st Qu.:0.00133   1st Qu.: 3.4  
##  Median :0.00113   Median :0.83   Median :0.00133   Median : 3.8  
##  Mean   :0.00126   Mean   :0.85   Mean   :0.00148   Mean   : 4.8  
##  3rd Qu.:0.00133   3rd Qu.:0.89   3rd Qu.:0.00160   3rd Qu.: 4.9  
##  Max.   :0.00253   Max.   :1.00   Max.   :0.00267   Max.   :12.7  
##      count     
##  Min.   : 8.0  
##  1st Qu.: 8.0  
##  Median : 8.5  
##  Mean   : 9.4  
##  3rd Qu.:10.0  
##  Max.   :19.0  
## 
## mining info:
##       data ntransactions support confidence
##  purchases          7501   0.001        0.8
##                                                                   call
##  apriori(data = purchases, parameter = list(supp = 0.001, conf = 0.8))

Observations

From the summary of the rules:

· Number of rules produced: 74
· Rule length distribution: rules contain between 3 and 6 items; 4-item rules are the most common (42 of 74) and only one rule has 6 items
· Quality measures (medians): support 0.00113, confidence 0.83, lift 3.8
· Total data: 7501 transactions

Sorting out purchases

# sorting out the most important rules
#
rules<-sort(rules, by="confidence", decreasing=TRUE)

# return top 10 rules, but only 2 digits
#
options(digits=2)
inspect(rules[1:10])
##      lhs                        rhs             support confidence coverage lift count
## [1]  {french fries,                                                                   
##       mushroom cream sauce,                                                           
##       pasta}                 => {escalope}       0.0011       1.00   0.0011 12.6     8
## [2]  {ground beef,                                                                    
##       light cream,                                                                    
##       olive oil}             => {mineral water}  0.0012       1.00   0.0012  4.2     9
## [3]  {cake,                                                                           
##       meatballs,                                                                      
##       mineral water}         => {milk}           0.0011       1.00   0.0011  7.7     8
## [4]  {cake,                                                                           
##       olive oil,                                                                      
##       shrimp}                => {mineral water}  0.0012       1.00   0.0012  4.2     9
## [5]  {mushroom cream sauce,                                                           
##       pasta}                 => {escalope}       0.0025       0.95   0.0027 12.0    19
## [6]  {red wine,                                                                       
##       soup}                  => {mineral water}  0.0019       0.93   0.0020  3.9    14
## [7]  {eggs,                                                                           
##       mineral water,                                                                  
##       pasta}                 => {shrimp}         0.0013       0.91   0.0015 12.7    10
## [8]  {herb & pepper,                                                                  
##       mineral water,                                                                  
##       rice}                  => {ground beef}    0.0013       0.91   0.0015  9.3    10
## [9]  {ground beef,                                                                    
##       pancakes,                                                                       
##       whole wheat rice}      => {mineral water}  0.0013       0.91   0.0015  3.8    10
## [10] {frozen vegetables,                                                              
##       milk,                                                                           
##       spaghetti,                                                                      
##       turkey}                => {mineral water}  0.0012       0.90   0.0013  3.8     9

Observations

From the analysis, if a customer buys:

1. french fries, mushroom cream sauce, and pasta, they will also buy escalope with a confidence of 100%
2. ground beef, light cream, and olive oil, they will also buy mineral water with a confidence of 100%
3. cake, meatballs, and mineral water, they will also buy milk with a confidence of 100%
4. cake, olive oil, and shrimp, they will also buy mineral water with a confidence of 100%
5. mushroom cream sauce and pasta, they will also buy escalope with a confidence of 95%

These are the five associations with the highest confidence.

Dealing with Redundancies

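Some rules add no information beyond what a simpler rule already provides. A rule is redundant if a more general rule (the same right-hand side with a subset of the left-hand side items) has the same or a higher confidence. The redundant rules will be removed.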
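The redundancy check can be illustrated on a toy example. The sketch below flags a rule when another rule with the same right-hand side, a strictly smaller left-hand side and an equal or higher confidence exists. It is written in base R for illustration only and is not the arules implementation; the three rules, their confidences and the helper name is_redundant_toy are all hypothetical.

```r
# Hypothetical rules: an lhs item set, an rhs item and a confidence
rules_toy <- list(
  list(lhs = c("pasta"),           rhs = "escalope", conf = 0.90),
  list(lhs = c("pasta", "eggs"),   rhs = "escalope", conf = 0.85),
  list(lhs = c("pasta", "shrimp"), rhs = "escalope", conf = 0.95)
)

# A rule is flagged when a more general rule is at least as confident
is_redundant_toy <- function(rule, all_rules) {
  any(vapply(all_rules, function(other) {
    identical(other$rhs, rule$rhs) &&
      length(other$lhs) < length(rule$lhs) &&
      all(other$lhs %in% rule$lhs) &&
      other$conf >= rule$conf
  }, logical(1)))
}

redundant <- vapply(rules_toy, is_redundant_toy, logical(1),
                    all_rules = rules_toy)
print(redundant)
```

Here only the second rule is flagged: adding eggs to the left-hand side lowers the confidence relative to the simpler pasta rule, whereas adding shrimp raises it.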

# removing redundant rules
#
rules <- rules[!is.redundant(rules)]
 
# summary of rules after removing redundancies
#
summary(rules)
## set of 73 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6 
## 15 42 15  1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       4       4       4       4       6 
## 
## summary of quality measures:
##     support          confidence      coverage            lift     
##  Min.   :0.00107   Min.   :0.80   Min.   :0.00107   Min.   : 3.4  
##  1st Qu.:0.00107   1st Qu.:0.80   1st Qu.:0.00133   1st Qu.: 3.4  
##  Median :0.00120   Median :0.83   Median :0.00133   Median : 3.8  
##  Mean   :0.00126   Mean   :0.85   Mean   :0.00148   Mean   : 4.8  
##  3rd Qu.:0.00133   3rd Qu.:0.89   3rd Qu.:0.00160   3rd Qu.: 4.9  
##  Max.   :0.00253   Max.   :1.00   Max.   :0.00267   Max.   :12.7  
##      count     
##  Min.   : 8.0  
##  1st Qu.: 8.0  
##  Median : 9.0  
##  Mean   : 9.4  
##  3rd Qu.:10.0  
##  Max.   :19.0  
## 
## mining info:
##       data ntransactions support confidence
##  purchases          7501   0.001        0.8
##                                                                   call
##  apriori(data = purchases, parameter = list(supp = 0.001, conf = 0.8))

Observations

Of the 74 original rules, one was redundant. It has been removed, leaving 73 rules.

Targeting items

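Here we explore the customers most likely to buy the most common item, mineral water. Questions: 1. Which items do customers buy that lead them to also buy mineral water? 2. If customers buy mineral water, what else are they likely to buy? The code below addresses the first question by fixing mineral water on the right-hand side of the rules.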

# investigating customers most likely to buy mineral water
#
rules <- apriori(data = purchases, parameter = list(supp = 0.001, conf = 0.08),
                 appearance = list(default = "lhs", rhs = "mineral water"),
                 control = list(verbose = FALSE))
rules <- sort(rules, decreasing = TRUE, by = "confidence")
inspect(rules[1:5])
##     lhs                     rhs             support confidence coverage lift count
## [1] {ground beef,                                                                 
##      light cream,                                                                 
##      olive oil}          => {mineral water}  0.0012       1.00   0.0012  4.2     9
## [2] {cake,                                                                        
##      olive oil,                                                                   
##      shrimp}             => {mineral water}  0.0012       1.00   0.0012  4.2     9
## [3] {red wine,                                                                    
##      soup}               => {mineral water}  0.0019       0.93   0.0020  3.9    14
## [4] {ground beef,                                                                 
##      pancakes,                                                                    
##      whole wheat rice}   => {mineral water}  0.0013       0.91   0.0015  3.8    10
## [5] {frozen vegetables,                                                           
##      milk,                                                                        
##      spaghetti,                                                                   
##      turkey}             => {mineral water}  0.0012       0.90   0.0013  3.8     9

Observations

The LHS lists the item combinations whose buyers are most likely to also buy mineral water (the RHS). All five rules have a confidence of at least 90%.

Comparing buyers of green tea and chocolate

Here we explore what customers who buy green tea are likely to buy compared with customers who buy chocolate, both of which are treated here as beverages.

· The confidence threshold is set to 0.15 because no rules reach a confidence of 0.8.
· To avoid rules with an empty left-hand side, the minimum rule length is set to 2 (minlen = 2).

# investigating the top 5 items most likely to be bought by customers who buy chocolate
#
rules <- apriori(data = purchases, parameter = list(supp = 0.001, conf = 0.15, minlen = 2),
                 appearance = list(default = "rhs", lhs = "chocolate"),
                 control = list(verbose = FALSE))
rules <- sort(rules, decreasing = TRUE, by = "confidence")
inspect(rules[1:5])
##     lhs            rhs             support confidence coverage lift count
## [1] {chocolate} => {mineral water} 0.053   0.32       0.16     1.3  395  
## [2] {chocolate} => {spaghetti}     0.039   0.24       0.16     1.4  294  
## [3] {chocolate} => {french fries}  0.034   0.21       0.16     1.2  258  
## [4] {chocolate} => {eggs}          0.033   0.20       0.16     1.1  249  
## [5] {chocolate} => {milk}          0.032   0.20       0.16     1.5  241
# investigating the top 5 items most likely to be bought by customers who buy green tea
#
rules <- apriori(data = purchases, parameter = list(supp = 0.001, conf = 0.15, minlen = 2),
                 appearance = list(default = "rhs", lhs = "green tea"),
                 control = list(verbose = FALSE))
rules <- sort(rules, decreasing = TRUE, by = "confidence")
inspect(rules[1:5])
##     lhs            rhs             support confidence coverage lift count
## [1] {green tea} => {mineral water} 0.031   0.24       0.13     0.99 233  
## [2] {green tea} => {french fries}  0.029   0.22       0.13     1.26 214  
## [3] {green tea} => {spaghetti}     0.027   0.20       0.13     1.15 199  
## [4] {green tea} => {eggs}          0.025   0.19       0.13     1.07 191  
## [5] {green tea} => {chocolate}     0.023   0.18       0.13     1.08 176

Observations

For both green tea and chocolate, each of the top five rules pairs the item with exactly one other item. All ten rules have low confidence (below 40%) and lift values close to 1 (between 0.99 and 1.5). Hence it cannot be said conclusively that a customer who buys chocolate, for instance, will also buy mineral water.

Both chocolate and green tea have weak associations with the other items.
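One way to read a lift close to 1 is to compare the observed number of joint purchases with the number expected if the two items were bought independently. The sketch below does this for chocolate and mineral water in base R, using the item counts from the summary output (1229 and 1788) and the rule count (395); the variable names are illustrative.

```r
n_trans     <- 7501  # total transactions
n_chocolate <- 1229  # transactions containing chocolate
n_water     <- 1788  # transactions containing mineral water
n_both      <- 395   # transactions containing both items

# Joint purchases expected if the two items were independent
expected_both <- n_chocolate * n_water / n_trans

confidence <- n_both / n_chocolate    # P(mineral water | chocolate)
lift       <- n_both / expected_both  # observed / expected

print(signif(c(expected = expected_both, confidence = confidence,
               lift = lift), 3))
```

The observed count is only about a third higher than the roughly 293 joint purchases expected under independence, which is consistent with the weak association reported for this rule.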

Visualization

Visualization is the final step.

# Visualizing the rules (support = 0.001, and confidence = 0.9)
#
rules <- apriori (purchases, parameter = list(supp = 0.001, conf = 0.9))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 7 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
plot(rules, method="graph")

Observations

From the visualization:

· The most popular combination was mineral water and escalope
· Another popular combination was pasta and french fries
· Many people buy milk along with cake
· Customers who buy red wine are most likely to also buy soup

Conclusion

From the study, the most popular items in Carrefour supermarket were mineral water and escalope. Soup and red wine are a popular combination among the supermarket's clients. Mineral water, pasta and eggs are also a popular combination among customers.

Further Questions

A) Do we have the right data?

Yes, we do. Transaction data helps us identify the most important associations, which in turn help the marketing department develop relevant marketing strategies.

B) Do we have the right question?

Yes. Studying associations among the various items will help the supermarket know which items it can prioritize to boost profitability.