In this task, Danielle asked us to investigate Electronidex’s last month sales data in order to determine whether Blackwell Electronics should acquire it or not. Our conclusion is that Blackwell should consider acquiring Electronidex, due to the following reasons extracted from the data given:
Electronidex possesses 2 client categories that account for 60% of last month transactions count and who are very good performers:
B2B clients: Blackwell doesn’t have this client category currently, and they are heavy buyers of PCs, the category with the biggest margins within Blackwell
Gamers: they do not only buy categories with the biggest margins (PCs and Display), but are also likely to become buyers of Blackwell’s current top performer category inexistent for Electronidex, Game Consoles
To tackle this task, our first step was to see the magnitude of the companies we were considering. We will look at it in terms of items sold in a month. Even if it’s not a perfectly representative magnitude, it can give us a big picture with the data that we have available:
library(arules)
library(arulesViz)
library(ggplot2)
library(readr)
library(reshape2)
setwd("C:/Users/user/Documents/M2 T4 market basket analysis/Datasets")
Blackwell <- read_csv("./existingproductattributes2017.csv")
Electronidex_tr <-
read.transactions("./ElectronidexTransactions2017.csv", sep = ",")
tr_df <-
read_csv("./ElectronidexTransactions2017.csv", col_names = FALSE)
ProductTypeKey <-
read_delim("./producttypekey_Blackwell product types.csv", ";")
sum(Blackwell$Volume)
## [1] 56400
sum(! is.na(tr_df))
## [1] 43276
comparison <- data.frame("company" = c("Blackwell", "Electronidex"), "volume" = c(sum(Blackwell$Volume), sum(! is.na(tr_df))))
ggplot(comparison, aes(x = company, y = volume, fill = company)) + geom_col()
Looking at last month’s volume (assuming that the data that we are given is for the same month) for both companies we can see they are similar in size: Blackwell is only Electronidex times 1,3. This tells us that we would not only benefit from their product range, online retail value chain, and know-how, but also from their sales volume. Cannibalisation would be limited, as they only compete closely in one product range as we’ll see later on.
If we look deeper into Product Category, Blackwell would be interested in increasing their sales of PC, Display and Laptops, the top 3 categories with the highest margin in $ per unit within their current product range as we can see in this boxplot. Instead, Accessories is the category with the lowest margins:
Blackwell$ProductProfit <- (Blackwell$Price * Blackwell$ProfitMargin)
ggplot(Blackwell, aes(x = ProductType, y = ProductProfit, fill = ProductType)) + geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
If we look at Blackwell’s current sales, these are not the top seller categories in volume, but they are for Electronidex, who sells more units than Blackwell even with a lower number of items sold:
volume <- sort(itemFrequency(Electronidex_tr, type = "absolute"), decreasing = T)
Electronidex <- as.data.frame(volume)
Electronidex$'product name' <- row.names(Electronidex)
Electronidex_wProducttype <- merge(Electronidex, ProductTypeKey)
Electronidex_wProducttype <- Electronidex_wProducttype[,(c(3, 1, 2))]
colnames(Electronidex_wProducttype) <- c("ProductType", "ProductName", "Volume")
Blackwell$Company <- "Blackwell"
Electronidex_wProducttype$Company <- "Electronidex"
ggplot() +
geom_col(data = Electronidex_wProducttype, aes(x = ProductType, y = Volume, fill = ProductType)) +
geom_col(data = Blackwell, aes(x = ProductType, y = Volume, fill = ProductType)) +
facet_grid(Company~.) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
As stated before, this product category distribution would reduce cannibalisation risk, as they are only competing in the Accessories category, not strategic due to its low margin. Instead, the following categories would only add value:
Extended Warranty
Game Console
Laptop
PC
Smartphone
Software
Now we understand the 2 companies together and what we want out of Electronidex, we will have a look at the company’s transactions to get further information.
tr <- Electronidex_tr
Our first step will be to understand the data: we have been given data for a month of Electronidex online sales (9835 transactions). Electronidex has a range of 125 products, and this are the top 15:
itemFrequencyPlot(tr, type = "absolute", topN = 15, horiz = F)
When it comes to transactions, we can see that we have very big purchases (the average purchase contains 4 items or more, which is not usual for individual customers). We can already think we will have 2 client types: B2B and B2C.
itemsbought <- sort(size(tr), decreasing = T)
table(itemsbought)
## itemsbought
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 2 2163 1647 1294 1021 856 646 540 439 353 247 171 119 77 72
## 15 16 17 18 19 20 21 22 23 25 26 27 29 30
## 56 41 26 20 10 10 10 5 3 1 1 3 1 1
summary(itemsbought)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.383 6.000 30.000
hist(itemsbought, breaks = 30)
As a 2nd check, we look at the rules this dataset would give us, and we already see that 6 out of our top 8 rules imply buying 2 or more computers together, which is definetly not common for individual customers and confirms our hypothesis:
RulesName<- apriori (tr, parameter = list(supp = 0.05, conf = 0.1, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 491
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(RulesName, by = "support"))
## lhs rhs support
## [1] {HP Laptop} => {iMac} 0.07554652
## [2] {iMac} => {HP Laptop} 0.07554652
## [3] {Lenovo Desktop Computer} => {iMac} 0.05876970
## [4] {iMac} => {Lenovo Desktop Computer} 0.05876970
## [5] {CYBERPOWER Gamer Desktop} => {iMac} 0.05673615
## [6] {iMac} => {CYBERPOWER Gamer Desktop} 0.05673615
## [7] {Dell Desktop} => {iMac} 0.05460092
## [8] {iMac} => {Dell Desktop} 0.05460092
## confidence lift count
## [1] 0.3892090 1.519599 743
## [2] 0.2949583 1.519599 743
## [3] 0.3969780 1.549932 578
## [4] 0.2294561 1.549932 578
## [5] 0.3084577 1.204320 558
## [6] 0.2215165 1.204320 558
## [7] 0.4074355 1.590762 537
## [8] 0.2131798 1.590762 537
After realizing this, we’ll have to first split the dataset in B2B = more than 4 products, B2C = 4 products or less including this information in a new column:
tr_df$itemsbought <- rowSums(!is.na(tr_df))
and we’ll include laptop and desktop information for the split as well:
product_key <-
read_delim("~/producttypekey.csv", ";", escape_double = FALSE, trim_ws = TRUE)
desktop_or_laptop <- rbind(product_key[(which(product_key == "Desktop")-125),],
product_key[(which(product_key == "Laptops")-125),])
computercount <- c()
for (i in 1:nrow(tr_df)) {
hola <- sum(tr_df[i, ] %in% desktop_or_laptop$`product name`)
computercount <- rbind(computercount, hola)
}
tr_df <- cbind(tr_df, computercount)
However, once we split the datasets we see that some customers that buy more than 4 products are buying gaming products, which is not a common behaviour for a company, and that out of these gaming customers a lot of them buy a computer specific for gaming and a normal one on the same purchase. Therefore, we’ll separate them from the normal B2C into a Gaming category, being that the 1st condition for our splitting:
gaming <- c()
for (i in 1:nrow(tr_df)) {
hola <- sum(grepl("Gam", tr_df[i, ], ))
gaming <- rbind(gaming, hola)
}
tr_df <- cbind(tr_df, gaming)
tr_df$customer_type <- ifelse(
tr_df$gaming > 0, "Gamer", ifelse(
tr_df$computercount > 1, "B2B", ifelse(
tr_df$itemsbought > 4, "B2B", "B2C"
)))
Gamer_transactions <- tr[which(tr_df$customer_type == "Gamer"),]
B2B_transactions <- tr[which(tr_df$customer_type == "B2B"),]
B2C_transactions <- tr[which(tr_df$customer_type == "B2C"),]
length (B2B_transactions)
## [1] 2509
length (B2C_transactions)
## [1] 3906
length (Gamer_transactions)
## [1] 3420
Now we have the dataset split in B2B, Gamer and B2C we will insepct them again separately. For each dataset we will set the support to number of products / number of transactions, as it is equal to the probability of having one of them alone. We will set confidence to a number low enough to get around 10 rules:
itemsbought <- sort(size(B2B_transactions), decreasing = T)
table(itemsbought)
## itemsbought
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 5 209 299 327 516 330 257 198 132 78 54 29 17 18 6 9 9 8
## 19 20 21 22 23
## 1 2 1 2 2
summary(itemsbought)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 5.841 7.000 23.000
hist(itemsbought, breaks = 30)
itemFrequencyPlot(B2B_transactions, type = "absolute", topN = 10, horiz = F)
B2BRulesName<- apriori (B2B_transactions, parameter = list(supp = 0.046, conf = 0.4, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.046 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 115
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 2509 transaction(s)] done [0.00s].
## sorting and recoding items ... [39 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Looking at the top 3 rules, we can see that out of the B2B transactions a 35% of them bought a computer and an iMac together, and that 31% of them bought a PC and a screen together:
inspect(sort(B2BRulesName, by = "support"))
## lhs rhs support
## [1] {HP Laptop} => {iMac} 0.13989637
## [2] {Lenovo Desktop Computer} => {iMac} 0.10601833
## [3] {Dell Desktop} => {iMac} 0.10163412
## [4] {ViewSonic Monitor} => {iMac} 0.08011160
## [5] {ViewSonic Monitor} => {HP Laptop} 0.07612595
## [6] {Microsoft Office Home and Student 2016} => {iMac} 0.05420486
## [7] {3-Button Mouse} => {iMac} 0.05300917
## [8] {Apple Magic Keyboard} => {iMac} 0.05261060
## [9] {HP Monitor} => {iMac} 0.05261060
## [10] {Samsung Monitor} => {iMac} 0.04942208
## confidence lift count
## [1] 0.4259709 1.049863 351
## [2] 0.4222222 1.040624 266
## [3] 0.4166667 1.026932 255
## [4] 0.4785714 1.179505 201
## [5] 0.4547619 1.384706 191
## [6] 0.5787234 1.426343 136
## [7] 0.5037879 1.241654 133
## [8] 0.5000000 1.232318 132
## [9] 0.4631579 1.141516 132
## [10] 0.4147157 1.022124 124
itemsbought <- sort(size(Gamer_transactions), decreasing = T)
table(itemsbought)
## itemsbought
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 213 380 416 372 340 316 283 241 221 169 117 90 60 54 50 32 17 12
## 19 20 21 22 23 25 26 27 29 30
## 9 8 9 3 1 1 1 3 1 1
summary(itemsbought)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 6.246 9.000 30.000
hist(itemsbought, breaks = 30)
itemFrequencyPlot(Gamer_transactions, type = "absolute", topN = 10, horiz = F)
GamerRulesName<- apriori (Gamer_transactions, parameter = list(supp = 0.05, conf = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 171
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 3420 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
From the top rules we can see that 24% of the transactions that included a Cyberpower Gamer Desktop, included another computer as well, and that 14% bought a screen with it:
inspect(sort(GamerRulesName, by = "support"))
## lhs rhs support confidence lift count
## [1] {Apple Earpods} => {CYBERPOWER Gamer Desktop} 0.11023392 0.5764526 1.0898109 377
## [2] {Lenovo Desktop Computer} => {CYBERPOWER Gamer Desktop} 0.10350877 0.5190616 0.9813105 354
## [3] {Samsung Monitor} => {CYBERPOWER Gamer Desktop} 0.08918129 0.6407563 1.2113801 305
## [4] {ViewSonic Monitor} => {iMac} 0.07865497 0.5233463 1.5727982 269
## [5] {ViewSonic Monitor} => {HP Laptop} 0.07573099 0.5038911 1.9605317 259
## [6] {Apple MacBook Pro} => {CYBERPOWER Gamer Desktop} 0.06959064 0.5360360 1.0134015 238
## [7] {Apple MacBook Air} => {CYBERPOWER Gamer Desktop} 0.06929825 0.5435780 1.0276599 237
## [8] {3-Button Mouse} => {CYBERPOWER Gamer Desktop} 0.06023392 0.5350649 1.0115655 206
## [9] {Acer Monitor} => {CYBERPOWER Gamer Desktop} 0.05526316 0.6057692 1.1452354 189
itemsbought <- sort(size(B2C_transactions), decreasing = T)
table(itemsbought)
## itemsbought
## 0 1 2 3 4
## 2 1945 1058 579 322
summary(itemsbought)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.814 2.000 4.000
hist(itemsbought)
itemFrequencyPlot(B2C_transactions, type = "absolute", topN = 10, horiz = F)
B2CRulesName<- apriori (B2C_transactions, parameter = list(supp = 0.0075, conf = 0.1, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.0075 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 3906 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
To get rules we needed to lower the support below the probability of finding a product alone. These rules should be more significant as they have higher lift than the rules found for Gamers and B2B, but still they don’t happen often enough to take them into account:
inspect(sort(B2CRulesName, by = "support"))
## lhs rhs support confidence lift count
## [1] {Apple MacBook Pro} => {Apple Earpods} 0.011520737 0.2647059 1.856268 45
## [2] {3-Button Mouse} => {Apple Earpods} 0.010240655 0.1769912 1.241162 40
## [3] {iPhone Charger Cable} => {Apple MacBook Air} 0.009216590 0.4800000 3.078621 36
## [4] {Microsoft Wireless Desktop Keyboard and Mouse} => {Apple MacBook Air} 0.008448541 0.1594203 1.022489 33
## [5] {Microsoft Wireless Desktop Keyboard and Mouse} => {Apple Earpods} 0.007936508 0.1497585 1.050191 31
## [6] {3-Button Mouse} => {iMac} 0.007680492 0.1327434 1.428362 30