Market Basket Analysis Report

Technical appendix

Data understanding

To tackle this task, our first step was to see the magnitude of the companies we were considering. We will look at it in terms of items sold in a month. Even if it’s not a perfectly representative magnitude, it can give us a big picture with the data that we have available:

library(arules)
library(arulesViz)
library(ggplot2)
library(readr)
library(reshape2)

setwd("C:/Users/user/Documents/M2 T4 market basket analysis/Datasets")

Blackwell <- read_csv("./existingproductattributes2017.csv")

Electronidex_tr <- 
  read.transactions("./ElectronidexTransactions2017.csv", sep = ",")

tr_df <- 
  read_csv("./ElectronidexTransactions2017.csv", col_names = FALSE)

ProductTypeKey <-
  read_delim("./producttypekey_Blackwell product types.csv", ";")

sum(Blackwell$Volume)

## [1] 56400

sum(! is.na(tr_df))

## [1] 43276

comparison <- data.frame("company" = c("Blackwell", "Electronidex"), "volume" = c(sum(Blackwell$Volume), sum(! is.na(tr_df))))

ggplot(comparison, aes(x = company, y = volume, fill = company)) + geom_col()

Looking at last month’s volume (assuming that the data that we are given is for the same month) for both companies we can see they are similar in size: Blackwell is only Electronidex times 1,3. This tells us that we would not only benefit from their product range, online retail value chain, and know-how, but also from their sales volume. Cannibalisation would be limited, as they only compete closely in one product range as we’ll see later on.

If we look deeper into Product Category, Blackwell would be interested in increasing their sales of PC, Display and Laptops, the top 3 categories with the highest margin in $ per unit within their current product range as we can see in this boxplot. Instead, Accessories is the category with the lowest margins:

Blackwell$ProductProfit <- (Blackwell$Price * Blackwell$ProfitMargin)
ggplot(Blackwell, aes(x = ProductType, y = ProductProfit, fill = ProductType)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

If we look at Blackwell’s current sales, these are not the top seller categories in volume, but they are for Electronidex, who sells more units than Blackwell even with a lower number of items sold:

volume <- sort(itemFrequency(Electronidex_tr, type = "absolute"), decreasing = T)

Electronidex <- as.data.frame(volume)
Electronidex$'product name' <- row.names(Electronidex)

Electronidex_wProducttype <- merge(Electronidex, ProductTypeKey)
Electronidex_wProducttype <- Electronidex_wProducttype[,(c(3, 1, 2))]
colnames(Electronidex_wProducttype) <- c("ProductType", "ProductName", "Volume")


Blackwell$Company <- "Blackwell"
Electronidex_wProducttype$Company <- "Electronidex"


ggplot() +
  geom_col(data = Electronidex_wProducttype, aes(x = ProductType, y = Volume, fill = ProductType)) +
  geom_col(data = Blackwell, aes(x = ProductType, y = Volume, fill = ProductType)) +
  facet_grid(Company~.) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

As stated before, this product category distribution would reduce cannibalisation risk, as they are only competing in the Accessories category, not strategic due to its low margin. Instead, the following categories would only add value:

Extended Warranty
Game Console
Laptop
PC
Smartphone
Software

Now we understand the 2 companies together and what we want out of Electronidex, we will have a look at the company’s transactions to get further information.

tr <- Electronidex_tr

Our first step will be to understand the data: we have been given data for a month of Electronidex online sales (9835 transactions). Electronidex has a range of 125 products, and this are the top 15:

itemFrequencyPlot(tr, type = "absolute", topN = 15, horiz = F)

When it comes to transactions, we can see that we have very big purchases (the average purchase contains 4 items or more, which is not usual for individual customers). We can already think we will have 2 client types: B2B and B2C.

itemsbought <- sort(size(tr), decreasing = T)
table(itemsbought)

## itemsbought
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    2 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72 
##   15   16   17   18   19   20   21   22   23   25   26   27   29   30 
##   56   41   26   20   10   10   10    5    3    1    1    3    1    1

summary(itemsbought)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.383   6.000  30.000

hist(itemsbought, breaks = 30)

As a 2nd check, we look at the rules this dataset would give us, and we already see that 6 out of our top 8 rules imply buying 2 or more computers together, which is definetly not common for individual customers and confirms our hypothesis:

RulesName<- apriori (tr, parameter = list(supp = 0.05, conf = 0.1, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 491 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(sort(RulesName, by = "support"))

##     lhs                           rhs                        support   
## [1] {HP Laptop}                => {iMac}                     0.07554652
## [2] {iMac}                     => {HP Laptop}                0.07554652
## [3] {Lenovo Desktop Computer}  => {iMac}                     0.05876970
## [4] {iMac}                     => {Lenovo Desktop Computer}  0.05876970
## [5] {CYBERPOWER Gamer Desktop} => {iMac}                     0.05673615
## [6] {iMac}                     => {CYBERPOWER Gamer Desktop} 0.05673615
## [7] {Dell Desktop}             => {iMac}                     0.05460092
## [8] {iMac}                     => {Dell Desktop}             0.05460092
##     confidence lift     count
## [1] 0.3892090  1.519599 743  
## [2] 0.2949583  1.519599 743  
## [3] 0.3969780  1.549932 578  
## [4] 0.2294561  1.549932 578  
## [5] 0.3084577  1.204320 558  
## [6] 0.2215165  1.204320 558  
## [7] 0.4074355  1.590762 537  
## [8] 0.2131798  1.590762 537

Pre processing

After realizing this, we’ll have to first split the dataset in B2B = more than 4 products, B2C = 4 products or less including this information in a new column:

tr_df$itemsbought <- rowSums(!is.na(tr_df))

and we’ll include laptop and desktop information for the split as well:

product_key <-
  read_delim("~/producttypekey.csv", ";", escape_double = FALSE, trim_ws = TRUE)

desktop_or_laptop <- rbind(product_key[(which(product_key == "Desktop")-125),], 
                           product_key[(which(product_key == "Laptops")-125),])

computercount <- c()

for (i in 1:nrow(tr_df)) {
  
  hola <- sum(tr_df[i, ] %in% desktop_or_laptop$`product name`)
  computercount <- rbind(computercount, hola)
  
}

tr_df <- cbind(tr_df, computercount)

However, once we split the datasets we see that some customers that buy more than 4 products are buying gaming products, which is not a common behaviour for a company, and that out of these gaming customers a lot of them buy a computer specific for gaming and a normal one on the same purchase. Therefore, we’ll separate them from the normal B2C into a Gaming category, being that the 1st condition for our splitting:

gaming <- c()

for (i in 1:nrow(tr_df)) {
  
  hola <- sum(grepl("Gam", tr_df[i, ], ))
  gaming <- rbind(gaming, hola)
  
}

tr_df <- cbind(tr_df, gaming)

tr_df$customer_type <- ifelse(
  tr_df$gaming > 0, "Gamer", ifelse(
    tr_df$computercount > 1, "B2B", ifelse(
        tr_df$itemsbought > 4, "B2B", "B2C"
    )))

Gamer_transactions <- tr[which(tr_df$customer_type == "Gamer"),]
B2B_transactions <- tr[which(tr_df$customer_type == "B2B"),]
B2C_transactions <- tr[which(tr_df$customer_type == "B2C"),]

length (B2B_transactions)

## [1] 2509

length (B2C_transactions)

## [1] 3906

length (Gamer_transactions)

## [1] 3420

Modeling

Now we have the dataset split in B2B, Gamer and B2C we will insepct them again separately. For each dataset we will set the support to number of products / number of transactions, as it is equal to the probability of having one of them alone. We will set confidence to a number low enough to get around 10 rules:

B2B

itemsbought <- sort(size(B2B_transactions), decreasing = T)
table(itemsbought)

## itemsbought
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   5 209 299 327 516 330 257 198 132  78  54  29  17  18   6   9   9   8 
##  19  20  21  22  23 
##   1   2   1   2   2

summary(itemsbought)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   5.841   7.000  23.000

hist(itemsbought, breaks = 30)

itemFrequencyPlot(B2B_transactions, type = "absolute", topN = 10, horiz = F)

B2BRulesName<- apriori (B2B_transactions, parameter = list(supp = 0.046, conf = 0.4, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.046      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 115 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 2509 transaction(s)] done [0.00s].
## sorting and recoding items ... [39 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Looking at the top 3 rules, we can see that out of the B2B transactions a 35% of them bought a computer and an iMac together, and that 31% of them bought a PC and a screen together:

inspect(sort(B2BRulesName, by = "support"))

##      lhs                                         rhs         support   
## [1]  {HP Laptop}                              => {iMac}      0.13989637
## [2]  {Lenovo Desktop Computer}                => {iMac}      0.10601833
## [3]  {Dell Desktop}                           => {iMac}      0.10163412
## [4]  {ViewSonic Monitor}                      => {iMac}      0.08011160
## [5]  {ViewSonic Monitor}                      => {HP Laptop} 0.07612595
## [6]  {Microsoft Office Home and Student 2016} => {iMac}      0.05420486
## [7]  {3-Button Mouse}                         => {iMac}      0.05300917
## [8]  {Apple Magic Keyboard}                   => {iMac}      0.05261060
## [9]  {HP Monitor}                             => {iMac}      0.05261060
## [10] {Samsung Monitor}                        => {iMac}      0.04942208
##      confidence lift     count
## [1]  0.4259709  1.049863 351  
## [2]  0.4222222  1.040624 266  
## [3]  0.4166667  1.026932 255  
## [4]  0.4785714  1.179505 201  
## [5]  0.4547619  1.384706 191  
## [6]  0.5787234  1.426343 136  
## [7]  0.5037879  1.241654 133  
## [8]  0.5000000  1.232318 132  
## [9]  0.4631579  1.141516 132  
## [10] 0.4147157  1.022124 124

Gamer:

itemsbought <- sort(size(Gamer_transactions), decreasing = T)
table(itemsbought)

## itemsbought
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 213 380 416 372 340 316 283 241 221 169 117  90  60  54  50  32  17  12 
##  19  20  21  22  23  25  26  27  29  30 
##   9   8   9   3   1   1   1   3   1   1

summary(itemsbought)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   6.246   9.000  30.000

hist(itemsbought, breaks = 30)

itemFrequencyPlot(Gamer_transactions, type = "absolute", topN = 10, horiz = F)

GamerRulesName<- apriori (Gamer_transactions, parameter = list(supp = 0.05, conf = 0.5, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 171 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 3420 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

From the top rules we can see that 24% of the transactions that included a Cyberpower Gamer Desktop, included another computer as well, and that 14% bought a screen with it:

inspect(sort(GamerRulesName, by = "support"))

##     lhs                          rhs                           support confidence      lift count
## [1] {Apple Earpods}           => {CYBERPOWER Gamer Desktop} 0.11023392  0.5764526 1.0898109   377
## [2] {Lenovo Desktop Computer} => {CYBERPOWER Gamer Desktop} 0.10350877  0.5190616 0.9813105   354
## [3] {Samsung Monitor}         => {CYBERPOWER Gamer Desktop} 0.08918129  0.6407563 1.2113801   305
## [4] {ViewSonic Monitor}       => {iMac}                     0.07865497  0.5233463 1.5727982   269
## [5] {ViewSonic Monitor}       => {HP Laptop}                0.07573099  0.5038911 1.9605317   259
## [6] {Apple MacBook Pro}       => {CYBERPOWER Gamer Desktop} 0.06959064  0.5360360 1.0134015   238
## [7] {Apple MacBook Air}       => {CYBERPOWER Gamer Desktop} 0.06929825  0.5435780 1.0276599   237
## [8] {3-Button Mouse}          => {CYBERPOWER Gamer Desktop} 0.06023392  0.5350649 1.0115655   206
## [9] {Acer Monitor}            => {CYBERPOWER Gamer Desktop} 0.05526316  0.6057692 1.1452354   189

B2C:

itemsbought <- sort(size(B2C_transactions), decreasing = T)
table(itemsbought)

## itemsbought
##    0    1    2    3    4 
##    2 1945 1058  579  322

summary(itemsbought)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.814   2.000   4.000

hist(itemsbought)

itemFrequencyPlot(B2C_transactions, type = "absolute", topN = 10, horiz = F)

B2CRulesName<- apriori (B2C_transactions, parameter = list(supp = 0.0075, conf = 0.1, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5  0.0075      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 3906 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

To get rules we needed to lower the support below the probability of finding a product alone. These rules should be more significant as they have higher lift than the rules found for Gamers and B2B, but still they don’t happen often enough to take them into account:

inspect(sort(B2CRulesName, by = "support"))

##     lhs                                                rhs                     support confidence     lift count
## [1] {Apple MacBook Pro}                             => {Apple Earpods}     0.011520737  0.2647059 1.856268    45
## [2] {3-Button Mouse}                                => {Apple Earpods}     0.010240655  0.1769912 1.241162    40
## [3] {iPhone Charger Cable}                          => {Apple MacBook Air} 0.009216590  0.4800000 3.078621    36
## [4] {Microsoft Wireless Desktop Keyboard and Mouse} => {Apple MacBook Air} 0.008448541  0.1594203 1.022489    33
## [5] {Microsoft Wireless Desktop Keyboard and Mouse} => {Apple Earpods}     0.007936508  0.1497585 1.050191    31
## [6] {3-Button Mouse}                                => {iMac}              0.007680492  0.1327434 1.428362    30