Accurately deriving insights on the most popular items, items that have the highest relationship, and types of customers that are likely to return.
Carrefour Kenya was undertaking a project that would inform the marketing department on the most relevant marketing strategies that would result in the highest number of sales. They provided a data set on which insights were to be derived.
The insights derived are important for planning stocking, advertisements, and promotions, to boost sales. Therefore, highly accurate information is required to prevent losses incurred by a high number of closing stock, especially for perishable goods.
For the data to be relevant, it should provide necessary insights that can be derived and used by the marketing and sales departments to improve on currently implemented strategies used to increase sales.
# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Libraries
library(data.table) # Data Table library
library (plyr)
library(psych)
library(coda)
library(base) # Date-time conversion
library(ggplot2) # Plotting Library
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(moments) # Measures of distribution
library(ggcorrplot) # Correlation plotting
library(mice) # Missing values
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(devtools)
## Loading required package: usethis
library(arules) # Apriori
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz) # Apriori visualization
# Loading data set
part3 <- read.csv('dataset_3.csv', header = F)
# Data set records summary
records.summary <- function(data){
cat('Number of rows = ', nrow(data), ' and columns = ', ncol(data), '\n')
}
# Data set summary
print('Dataset 3')
## [1] "Dataset 3"
records.summary(part3)
## Number of rows = 7501 and columns = 20
# Top data set preview
head(part3)
## V1 V2 V3 V4 V5
## 1 shrimp almonds avocado vegetables mix green grapes
## 2 burgers meatballs eggs
## 3 chutney
## 4 turkey avocado
## 5 mineral water milk energy bar whole wheat rice green tea
## 6 low fat yogurt
## V6 V7 V8 V9 V10 V11
## 1 whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt
## 2
## 3
## 4
## 5
## 6
## V12 V13 V14 V15 V16 V17 V18
## 1 green tea honey salad mineral water salmon antioxydant juice frozen smoothie
## 2
## 3
## 4
## 5
## 6
## V19 V20
## 1 spinach olive oil
## 2
## 3
## 4
## 5
## 6
# Bottom data set preview
tail(part3)
## V1 V2 V3 V4 V5 V6
## 7496 pancakes light mayo
## 7497 butter light mayo fresh bread
## 7498 burgers frozen vegetables eggs french fries magazines green tea
## 7499 chicken
## 7500 escalope green tea
## 7501 eggs frozen smoothie yogurt cake low fat yogurt
## V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 7496
## 7497
## 7498
## 7499
## 7500
## 7501
The date column will be changed to a date data type.
, 10)
# Data type for the third data set.
str(part3)
## 'data.frame': 7501 obs. of 20 variables:
## $ V1 : chr "shrimp" "burgers" "chutney" "turkey" ...
## $ V2 : chr "almonds" "meatballs" "" "avocado" ...
## $ V3 : chr "avocado" "eggs" "" "" ...
## $ V4 : chr "vegetables mix" "" "" "" ...
## $ V5 : chr "green grapes" "" "" "" ...
## $ V6 : chr "whole weat flour" "" "" "" ...
## $ V7 : chr "yams" "" "" "" ...
## $ V8 : chr "cottage cheese" "" "" "" ...
## $ V9 : chr "energy drink" "" "" "" ...
## $ V10: chr "tomato juice" "" "" "" ...
## $ V11: chr "low fat yogurt" "" "" "" ...
## $ V12: chr "green tea" "" "" "" ...
## $ V13: chr "honey" "" "" "" ...
## $ V14: chr "salad" "" "" "" ...
## $ V15: chr "mineral water" "" "" "" ...
## $ V16: chr "salmon" "" "" "" ...
## $ V17: chr "antioxydant juice" "" "" "" ...
## $ V18: chr "frozen smoothie" "" "" "" ...
## $ V19: chr "spinach" "" "" "" ...
## $ V20: chr "olive oil" "" "" "" ...
The data types are correct. The NAN values will be dealt with and analyzed later.
The data sets have been provided by the client, therefore, external data set validation will not be used.
# Data set columns
colnames(part3)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12"
## [13] "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20"
All columns will be retained as they contain product categories.
Part 3
# Missing values
colSums(is.na(part3))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
There are no missing values in the second data set.
# Checking for duplicates
sum(duplicated(part3))
## [1] 2325
There are duplicates in the second data set. They will be dropped later when the transaction class is formed, if necessary.
Checking the uniformity of column names and values.
# Column names
colnames(part3)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12"
## [13] "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20"
The column names are uniform, therefore, they will not be renamed.
# Numerical columns
num <- unlist(lapply(part3, is.numeric))
num
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## V14 V15 V16 V17 V18 V19 V20
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE
There are no numerical columns, therefore, no outliers can be checked.
head(part3)
## V1 V2 V3 V4 V5
## 1 shrimp almonds avocado vegetables mix green grapes
## 2 burgers meatballs eggs
## 3 chutney
## 4 turkey avocado
## 5 mineral water milk energy bar whole wheat rice green tea
## 6 low fat yogurt
## V6 V7 V8 V9 V10 V11
## 1 whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt
## 2
## 3
## 4
## 5
## 6
## V12 V13 V14 V15 V16 V17 V18
## 1 green tea honey salad mineral water salmon antioxydant juice frozen smoothie
## 2
## 3
## 4
## 5
## 6
## V19 V20
## 1 spinach olive oil
## 2
## 3
## 4
## 5
## 6
Due to the format of the data set, descriptive analysis will not be done using the common methods. Instead, these insights will be derived in the following section, using functions specific to the transaction class and Apriori function.
# As the data set does not require any alterations, it will be re-loaded as a
# transactions object.
transactions <- read.transactions('dataset_3.csv', format = 'basket',
rm.duplicates = T, sep = ',')
## distribution of transactions with duplicates:
## 1
## 5
items<-as.data.frame(itemLabels(transactions))
colnames(items) <- "Item"
head(items, 10)
## Item
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
## 4 avocado
## 5 babies food
## 6 bacon
## 7 barbecue sauce
## 8 black tea
## 9 blueberries
## 10 body spray
For this data set, duplicates were removed, and the separation of the items set to ‘,’ to prevent item splitting because of white spaces.
# transactions summary
summary(transactions)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
# Inspecting the transaction object
inspect(transactions[1:5])
## items
## [1] {almonds,
## antioxydant juice,
## avocado,
## cottage cheese,
## energy drink,
## frozen smoothie,
## green grapes,
## green tea,
## honey,
## low fat yogurt,
## mineral water,
## olive oil,
## salad,
## salmon,
## shrimp,
## spinach,
## tomato juice,
## vegetables mix,
## whole weat flour,
## yams}
## [2] {burgers,
## eggs,
## meatballs}
## [3] {chutney}
## [4] {avocado,
## turkey}
## [5] {energy bar,
## green tea,
## milk,
## mineral water,
## whole wheat rice}
# Item Frequency plot
itemFrequencyPlot(transactions, topN = 20, type = 'absolute',
col="darkblue",
main = 'Absolute Item Frequency Plot')
As specified above: mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.
# Association rules
rules <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Summary
summary(rules)
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 15 42 16 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.041 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001067 Min. : 3.356
## 1st Qu.:0.001067 1st Qu.:0.8000 1st Qu.:0.001333 1st Qu.: 3.432
## Median :0.001133 Median :0.8333 Median :0.001333 Median : 3.795
## Mean :0.001256 Mean :0.8504 Mean :0.001479 Mean : 4.823
## 3rd Qu.:0.001333 3rd Qu.:0.8889 3rd Qu.:0.001600 3rd Qu.: 4.877
## Max. :0.002533 Max. :1.0000 Max. :0.002666 Max. :12.722
## count
## Min. : 8.000
## 1st Qu.: 8.000
## Median : 8.500
## Mean : 9.419
## 3rd Qu.:10.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## transactions 7501 0.001 0.8
## call
## apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))
A support of 0.002 leads to 2 rules being formed, while 0.001 = 74 rules. Support = 0.001 will be used to prevent loss of meaningful insights. The are a maximum of 6 items in a rule, and a minimum of 3. * A rule length of 4 has the most rules = 42, while a length of 6 has the lowest = 1. * The mean lift if 4.8, which means that most antecedents and consequents are highly correlated, meaning that the consequents are likely to be bought after the antecedents have been bought.
Removing Redundant Rules
# Remove redundancy
subset.rules <- which(colSums(is.subset(rules, rules)) > 1)
sub.rules <- rules[-subset.rules]
summary(sub.rules)
## set of 62 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 15 33 13 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 4 4 4 4 6
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001200 Min. : 3.356
## 1st Qu.:0.001067 1st Qu.:0.8000 1st Qu.:0.001333 1st Qu.: 3.432
## Median :0.001200 Median :0.8284 Median :0.001333 Median : 3.752
## Mean :0.001279 Mean :0.8451 Mean :0.001514 Mean : 4.533
## 3rd Qu.:0.001333 3rd Qu.:0.8889 3rd Qu.:0.001600 3rd Qu.: 4.824
## Max. :0.002533 Max. :1.0000 Max. :0.002666 Max. :11.995
## count
## Min. : 8.000
## 1st Qu.: 8.000
## Median : 9.000
## Mean : 9.597
## 3rd Qu.:10.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## transactions 7501 0.001 0.8
## call
## apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))
Lift
# The first 5 rules
inspect(head(sort(sub.rules, by = 'lift'), 5))
## lhs rhs support confidence coverage lift count
## [1] {milk,
## pasta} => {shrimp} 0.001599787 0.8571429 0.001866418 11.995203 12
## [2] {mushroom cream sauce,
## pasta} => {escalope} 0.002532996 0.9500000 0.002666311 11.976387 19
## [3] {chocolate,
## ground beef,
## milk,
## mineral water,
## spaghetti} => {frozen vegetables} 0.001066524 0.8888889 0.001199840 9.325253 8
## [4] {herb & pepper,
## mineral water,
## rice} => {ground beef} 0.001333156 0.9090909 0.001466471 9.252498 10
## [5] {meatballs,
## whole wheat pasta} => {milk} 0.001333156 0.8333333 0.001599787 6.430898 10
The lift shows the preference of the rhs items being bought once the lhs was/ were bought.
Confidence
# The first 5 rules
inspect(head(sort(sub.rules, by = 'confidence'), 5))
## lhs rhs support confidence coverage lift count
## [1] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [2] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.0000000 0.001199840 4.195190 9
## [3] {mushroom cream sauce,
## pasta} => {escalope} 0.002532996 0.9500000 0.002666311 11.976387 19
## [4] {red wine,
## soup} => {mineral water} 0.001866418 0.9333333 0.001999733 3.915511 14
## [5] {herb & pepper,
## mineral water,
## rice} => {ground beef} 0.001333156 0.9090909 0.001466471 9.252498 10
Limiting the item sets to 3
This provides a narrower scope for analysis.
shorter.rules <- apriori(transactions, parameter = list(supp=0.001, conf=0.8,
maxlen=3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Only 15 rules remain.
summary(shorter.rules)
## set of 15 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001067 Min. :0.8000 Min. :0.001200 Min. : 3.356
## 1st Qu.:0.001133 1st Qu.:0.8153 1st Qu.:0.001400 1st Qu.: 3.491
## Median :0.001200 Median :0.8182 Median :0.001466 Median : 3.916
## Mean :0.001413 Mean :0.8434 Mean :0.001662 Mean : 5.289
## 3rd Qu.:0.001600 3rd Qu.:0.8571 3rd Qu.:0.001866 3rd Qu.: 5.598
## Max. :0.002533 Max. :0.9500 Max. :0.002666 Max. :11.995
## count
## Min. : 8.0
## 1st Qu.: 8.5
## Median : 9.0
## Mean :10.6
## 3rd Qu.:12.0
## Max. :19.0
##
## mining info:
## data ntransactions support confidence
## transactions 7501 0.001 0.8
## call
## apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8, maxlen = 3))
Lift
# The first 5 rules
inspect(head(sort(shorter.rules, by = 'lift'), 5))
## lhs rhs support confidence
## [1] {milk, pasta} => {shrimp} 0.001599787 0.8571429
## [2] {mushroom cream sauce, pasta} => {escalope} 0.002532996 0.9500000
## [3] {meatballs, whole wheat pasta} => {milk} 0.001333156 0.8333333
## [4] {black tea, frozen smoothie} => {milk} 0.001199840 0.8181818
## [5] {red wine, tomato sauce} => {chocolate} 0.001066524 0.8000000
## coverage lift count
## [1] 0.001866418 11.995203 12
## [2] 0.002666311 11.976387 19
## [3] 0.001599787 6.430898 10
## [4] 0.001466471 6.313973 9
## [5] 0.001333156 4.882669 8
This gives a more concise summary, with very high lift and confidence values.
Confidence
# The first 5 rules
inspect(head(sort(shorter.rules, by = 'confidence'), 5))
## lhs rhs support confidence
## [1] {mushroom cream sauce, pasta} => {escalope} 0.002532996 0.9500000
## [2] {red wine, soup} => {mineral water} 0.001866418 0.9333333
## [3] {frozen smoothie, spinach} => {mineral water} 0.001066524 0.8888889
## [4] {ground beef, nonfat milk} => {mineral water} 0.001599787 0.8571429
## [5] {milk, pasta} => {shrimp} 0.001599787 0.8571429
## coverage lift count
## [1] 0.002666311 11.976387 19
## [2] 0.001999733 3.915511 14
## [3] 0.001199840 3.729058 8
## [4] 0.001866418 3.595877 12
## [5] 0.001866418 11.995203 12
Visualizations
# Initial rules with all items set lengths
plot(sub.rules, jitter = 0) # Jitter = 0 - reduce over plotting
# Limited rule size
plot(shorter.rules, jitter = 0) # Jitter = 0 - reduce overplotting
* The highest lift values corresponds to a confidence of 95% and ~86%.
Most support values are < 0.00175.
Items that are likely to be bought together
Larger sets
The following plots are interactive, feel free to select items from the drop down menu and zoom in to the image for a clearer view.
top10 <- head(sub.rules, n = 10, by = 'lift')
plot(top10, method = "graph", engine = "htmlwidget")
Shorter sets
top10.short <- head(shorter.rules, n = 10, by = 'lift')
plot(top10.short, method = "graph", engine = "htmlwidget")
Customers likely to return
Larger sets
top10.c <- head(sub.rules, n = 10, by = 'confidence')
plot(top10.c, method = "graph", engine = "htmlwidget")
Shorter sets
top10.short.c <- head(shorter.rules, n = 10, by = 'confidence')
plot(top10.short.c, method = "graph", engine = "htmlwidget")
Analysis Summary
1. Most popular items
Mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.
2. Items that have the highest relationship
(purchase of an item/items leads to a high likelihood of the purchase of an item/items.)
Customers who bought milk and pasta were 12 more times likely to buy shrimp.
Customers who bough mushroom cream sauce and pasta were also 12 times more likely to buy escalopes.
3. Customers most likely to return
—Longer item sets—
Customers who bought ground beef, light cream, olive oil, cake, and shrimp were 4 times more likely to buy mineral water. These sets had a confidence of 100%, therefore, these customers would always return.
Customers who bought mushroom cream sauce and pasta were 4 times more likely to buy escalope. This set had a confidence of 95%, therefore, these customers were also likely to return.
Customers who bought red wine and soup were 3.9 times more likely to buy mineral water. This set had a confidence of 93%, therefore, these customers were also likely to return.
Customers who bought herb & pepper, mineral water and rice were 9.3 times more likely to buy ground beef. This set had a confidence of 91%, therefore, these customers were also likely to return.
—Shorter item sets—
Customers who buy mushroom cream sauce and pasta will buy escalope. They have a 95% chance of returning. This remains to be the most popular set, even for smaller item sets.
Customers who buy red wine and soup will buy mineral water. They have a 93% chance of returning.
The length of the item sets was limited to a maximum of 3. The main observation is that mushroom cream and pasta -> escalope has a high lift nd confidence. This means that these customers are highly likely to purchase this combination of goods, as well are return to the store (due to the high number of transactions).
In conclusion:
Mineral water, eggs, spaghetti, french fries and chocolate have the highest frequency.
Milk, pasta and shrimp; mushroom cream sauce, pasta and escopes had the highest likelihood of being purchased together. Customers who purchased these are also likely to return.
Customers who are most likely to return bought:
Yes, we have the right data as meaningful insights were derived from the data.
Yes, more data can be used for better generalization of customer purchase behavior and the most popular items and item sets.
Yes, we have the right question, as the analysis sought to cater to the client’s requirements.
# Suppressing warnings
options(warn = defaultW)