The board of directors of the fictional company we work for is considering the acquisition of Electronidex, a start-up online electronics retailer. Management has asked us to help them better understand the clientele that Electronidex currently serves and whether it would be an optimal partnership.
Our objective is to gain insight into Electronidex’s clientele and products by conducting a market basket analysis of their online transactions.
There are two data files that we will be working with:
The file labeled ElectronidexTransactions2017.csv contains 30 days’ worth of Electronidex’s online transactions. There are 9,835 transactions in total.
The file labeled ElectronidexItems2017.pdf lists all the electronics that they currently sell. There are 125 items for sale across 17 product categories.
We can perform a market basket analysis to identify purchasing patterns, based on the relationships between customer transactions and the item(s) they have purchased.
The first step is to load the ElectronidexTransactions2017 file and familiarize ourselves with the data.
Because we are dealing with transactional data, where each row represents a single transaction and the purchased items are separated by commas, we need to load the dataset with the read.transactions() function.
# load the libraries needed to perform the analysis
library(arules)
library(arulesViz)
# load Electronidex's data; rm.duplicates = TRUE drops items repeated within a transaction
ElectronidexTransactions2017 <- read.transactions("~/Documents/Data Science/Data Analytics and Big Data/Predicting Customer Preferences/ElectronidexTransactions2017.csv", format = "basket", sep = ",", rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## items
## 1 2
## 191 10
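To make the basket format concrete, here is a minimal sketch that writes a tiny file in the same layout and reads it back. The three transactions and the name toy_file are made up for illustration; only read.transactions() and its arguments come from the analysis above.

```r
library(arules)

# Hypothetical three-line basket file: one transaction per row,
# items separated by commas (the second row repeats "iMac" on purpose)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("iMac,HP Laptop",
             "iMac,Apple Earpods,iMac",
             "HP Laptop"),
           toy_file)

toy <- read.transactions(toy_file, format = "basket", sep = ",",
                         rm.duplicates = TRUE)

inspect(toy)   # one row per transaction
size(toy)      # distinct items per transaction: 2 2 1
```

Because the repeated "iMac" is dropped, the second transaction counts two items rather than three. This is the same clean-up that produces the "transactions with duplicates" message for the real file.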
Now, we can have a look at the most important statistics of our dataset:
summary(ElectronidexTransactions2017)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 125 columns (items) and a density of 0.03506172
##
## most frequent items:
## iMac HP Laptop CYBERPOWER Gamer Desktop
## 2519 1909 1809
## Apple Earpods Apple MacBook Air (Other)
## 1715 1530 33622
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2 2163 1647 1294 1021 856 646 540 439 353 247 171 119 77 72 56
## 16 17 18 19 20 21 22 23 25 26 27 29 30
## 41 26 20 10 10 10 5 3 1 1 3 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.383 6.000 30.000
##
## includes extended item information - examples:
## labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3 3-Button Mouse
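In summary, we see the following: the dataset contains 9,835 transactions and 125 distinct items, with a density of about 0.035 (roughly 3.5% of the cells in the transaction-by-item matrix are filled). The “iMac” is the most frequently purchased item (2,519 transactions), followed by the “HP Laptop” (1,909) and the “CYBERPOWER Gamer Desktop” (1,809). Transactions contain a median of 3 items and a mean of 4.383, with a maximum of 30, and 2 transactions are empty.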
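The figures in the summary output can be cross-checked with a little arithmetic. The sketch below uses only numbers printed above; the variable names are illustrative.

```r
n_transactions <- 9835
n_items        <- 125
density        <- 0.03506172

# Total purchased items = share of filled cells x number of cells
total_items <- density * n_transactions * n_items
round(total_items)                             # 43104

# The "most frequent items" counts, including "(Other)", add up to the same total
sum(c(2519, 1909, 1809, 1715, 1530, 33622))    # 43104

# Average basket size, matching the Mean in the length distribution
round(total_items / n_transactions, 3)         # 4.383
```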
To better understand the data, we can visualize the top 20 most frequent items.
# relative frequency histogram for the top 20 items
itemFrequencyPlot(ElectronidexTransactions2017, topN = 20, type = "relative", main = "Relative Item Frequency Plot")
Four of the five best-selling items are computers. As the plot above shows, Electronidex’s customers mostly purchase computer goods such as desktops, laptops, keyboards, and monitors. A partnership with Electronidex should therefore be considered if the objective is to increase sales volume in the computer market. Should management choose to pursue other markets, such as smartphones or video game consoles, acquiring Electronidex would not be an optimal business decision.
The Apriori algorithm is useful for uncovering insights in transactional datasets.
It can be used to identify itemsets that occur frequently. An itemset is a group of goods that have been purchased together. For example, one itemset could be {monitor, keyboard, mouse}, where all of these items have been bought in a single transaction.
In addition, the Apriori algorithm evaluates association rules. For example, one rule might be: {monitor, keyboard} -> {mouse}. This means that IF a computer monitor and keyboard are purchased, THEN a mouse is likely to be purchased too.
Let’s begin by mining the most frequent itemsets, subject to two conditions: each itemset contains at least 2 items, and at least 5 out of every 1,000 transactions include all of those items.
# frequent itemsets
itemsets <- apriori(ElectronidexTransactions2017, parameter = list(support = .005, minlen = 2, target = 'frequent'))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1017 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(itemsets)
## set of 1017 itemsets
##
## most frequent items:
## iMac HP Laptop Lenovo Desktop Computer
## 266 223 150
## Dell Desktop CYBERPOWER Gamer Desktop (Other)
## 145 124 1507
##
## element (itemset/transaction) length distribution:sizes
## 2 3 4
## 658 337 22
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.375 3.000 4.000
##
## summary of quality measures:
## support count
## Min. :0.005084 Min. : 50.0
## 1st Qu.:0.005999 1st Qu.: 59.0
## Median :0.007422 Median : 73.0
## Mean :0.009894 Mean : 97.3
## 3rd Qu.:0.010676 3rd Qu.:105.0
## Max. :0.075547 Max. :743.0
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## ElectronidexTransactions2017 9835 0.005 1
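As a quick check on the support threshold, the arithmetic below converts the relative support of 0.005 into a transaction count. It uses only figures printed in the output above; the variable names are illustrative.

```r
n_transactions <- 9835
min_support    <- 0.005

# Relative support expressed as a number of transactions
min_support * n_transactions             # 49.175

# Smallest whole number of transactions whose support is at least 0.005
min_count <- ceiling(min_support * n_transactions)
min_count                                # 50
round(min_count / n_transactions, 6)     # 0.005084
```

This agrees with the minimum count (50) and minimum support (0.005084) in the summary. The value of 49 in the Apriori log corresponds to the same product rounded down.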
There are 1,017 itemsets that meet our length and support (5/1000 = .005) requirements. Here are the top 10 itemsets, sorted by frequency:
# top 10 itemsets sorted by support
inspect(sort(itemsets, by = 'support', decreasing = TRUE)[1:10])
## items support count
## [1] {HP Laptop,iMac} 0.07554652 743
## [2] {iMac,Lenovo Desktop Computer} 0.05876970 578
## [3] {CYBERPOWER Gamer Desktop,iMac} 0.05673615 558
## [4] {Dell Desktop,iMac} 0.05460092 537
## [5] {iMac,ViewSonic Monitor} 0.04941535 486
## [6] {HP Laptop,ViewSonic Monitor} 0.04799187 472
## [7] {HP Laptop,Lenovo Desktop Computer} 0.04616167 454
## [8] {Dell Desktop,HP Laptop} 0.04494154 442
## [9] {CYBERPOWER Gamer Desktop,HP Laptop} 0.04260295 419
## [10] {Apple Earpods,iMac} 0.04026436 396
We already knew that the “iMac” and “HP Laptop” were the first and second best-sellers of the month, respectively. We now see that the most frequent itemset contains both products. Moreover, most of the top itemsets combine goods that could be considered substitutes (e.g. desktops and laptops of different brands).
Retail customers rarely purchase two computers in one transaction. It is worth exploring whether Electronidex mainly serves other businesses, or whether there is a discount pricing strategy for buying more than one computer.
Now it’s time to mine association rules. As previously stated, rules are based on the theory that if you purchase a certain group of items, you are more (or less) likely to purchase another group of items.
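The Apriori algorithm filters rules using two measures. The first is support, the proportion of all transactions that contain every item in the rule, which tells us how frequently the rule applies. The second is confidence, the proportion of transactions containing the left-hand side items that also contain the right-hand side item, which tells us how often the rule holds.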
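To make the two measures concrete, here is a small sketch in base R that computes them by hand for the rule {monitor, keyboard} -> {mouse}. The five baskets are made up for illustration and are not taken from the Electronidex data; contains_all() is a hypothetical helper.

```r
# Toy baskets (hypothetical, for illustration only)
baskets <- list(
  c("monitor", "keyboard", "mouse"),
  c("monitor", "keyboard"),
  c("monitor", "keyboard", "mouse", "laptop"),
  c("laptop", "mouse"),
  c("monitor")
)

# TRUE for each basket that contains every item in `items`
contains_all <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

lhs <- c("monitor", "keyboard")
rhs <- "mouse"

# Support: share of all baskets containing lhs and rhs together
support_rule <- mean(contains_all(c(lhs, rhs)))   # 2 of 5 baskets

# Confidence: among baskets containing lhs, the share that also contain rhs
support_lhs <- mean(contains_all(lhs))            # 3 of 5 baskets
confidence  <- support_rule / support_lhs         # 2 of 3

c(support = support_rule, confidence = confidence)
```

In this toy data the rule applies to 40% of baskets, and it holds in two out of the three baskets that contain both a monitor and a keyboard.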
The minimum support and confidence are user-specified, and different values produce a different set of rules. Here we will choose a low support and a moderate confidence:
# create rules with support = 0.005 and confidence = 0.5
rulesCustomized <- apriori(ElectronidexTransactions2017, parameter = list(supp = 0.005, conf = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [151 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rulesCustomized)
## set of 151 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 1 114 36
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.232 3.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.005084 Min. :0.5000 Min. :1.952 Min. : 50.00
## 1st Qu.:0.005694 1st Qu.:0.5196 1st Qu.:2.082 1st Qu.: 56.00
## Median :0.006406 Median :0.5490 Median :2.276 Median : 63.00
## Mean :0.007456 Mean :0.5590 Mean :2.410 Mean : 73.33
## 3rd Qu.:0.008693 3rd Qu.:0.5829 3rd Qu.:2.682 3rd Qu.: 85.50
## Max. :0.023081 Max. :0.8125 Max. :4.186 Max. :227.00
##
## mining info:
## data ntransactions support confidence
## ElectronidexTransactions2017 9835 0.005 0.5
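A set of 151 rules was generated. How do we know which rules are the “best”? We can use lift, which measures how much more often the items in a rule occur together than would be expected if they were purchased independently. It is the rule’s confidence divided by the support of the right-hand side, and a lift above 1 indicates a positive association.
Let’s inspect the 10 rules with the highest lift.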
# sort rules by high lift
inspect(sort(rulesCustomized, by = "lift", decreasing = TRUE)[1:10])
## lhs rhs support confidence lift count
## [1] {Acer Aspire,
## Dell Desktop,
## ViewSonic Monitor} => {HP Laptop} 0.005287239 0.8125000 4.185928 52
## [2] {ASUS 2 Monitor,
## Dell Desktop,
## iMac} => {Lenovo Desktop Computer} 0.005185562 0.5730337 3.870732 51
## [3] {Apple Magic Keyboard,
## Dell Desktop,
## iMac} => {Lenovo Desktop Computer} 0.005287239 0.5200000 3.512500 52
## [4] {HP Laptop,
## HP Monitor,
## iMac} => {Lenovo Desktop Computer} 0.005388917 0.5096154 3.442354 53
## [5] {Acer Aspire,
## iMac,
## ViewSonic Monitor} => {HP Laptop} 0.006202339 0.6630435 3.415942 61
## [6] {Acer Desktop,
## iMac,
## ViewSonic Monitor} => {HP Laptop} 0.006405694 0.6363636 3.278489 63
## [7] {Dell Desktop,
## Lenovo Desktop Computer,
## ViewSonic Monitor} => {HP Laptop} 0.006202339 0.6224490 3.206802 61
## [8] {Computer Game,
## ViewSonic Monitor} => {HP Laptop} 0.007422471 0.6186441 3.187200 73
## [9] {Computer Game,
## Dell Desktop} => {HP Laptop} 0.005693950 0.6086957 3.135946 56
## [10] {Acer Aspire,
## ViewSonic Monitor} => {HP Laptop} 0.010777834 0.6022727 3.102856 106
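The lift of the top rule can be reproduced from numbers already shown: its confidence and count from the table above, and the “HP Laptop” count from the dataset summary. The variable names below are illustrative.

```r
n_transactions <- 9835
count_hp       <- 1909     # transactions containing "HP Laptop" (dataset summary)
confidence     <- 0.8125   # confidence of rule [1]
count_rule     <- 52       # count of rule [1]

# Transactions containing the whole left-hand side of rule [1]
count_rule / confidence    # 64

# Lift = confidence / support of the right-hand side
support_rhs <- count_hp / n_transactions
lift <- confidence / support_rhs
round(lift, 4)             # 4.1859
```

In other words, a transaction containing the Acer Aspire, Dell Desktop and ViewSonic Monitor is roughly four times as likely to include an HP Laptop as a transaction chosen at random.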
The rules corroborate our findings from previous steps. The “iMac” and “HP Laptop” are popular items, frequently sold together or with other computers from different brands.
It could also be interesting to know what customers buy alongside the “iMac.”
# view rules that include the item "iMac" on either side (lhs or rhs)
iMacrules <- subset(rulesCustomized, items %in% "iMac")
inspect(iMacrules[1:15])
## lhs rhs support confidence lift count
## [1] {Smart Light Bulb} => {iMac} 0.009252669 0.5229885 2.041918 91
## [2] {Etekcity Power Extension Cord Cable,
## HP Laptop} => {iMac} 0.005388917 0.5196078 2.028719 53
## [3] {Dell 2 Desktop,
## HP Laptop} => {iMac} 0.005185562 0.5000000 1.952164 51
## [4] {Brother Printer,
## HP Laptop} => {iMac} 0.005287239 0.5473684 2.137105 52
## [5] {ASUS Desktop,
## HP Laptop} => {iMac} 0.005693950 0.5333333 2.082308 56
## [6] {Intel Desktop,
## Lenovo Desktop Computer} => {iMac} 0.007117438 0.5384615 2.102330 70
## [7] {AOC Monitor,
## Dell Desktop} => {iMac} 0.005998983 0.5462963 2.132919 59
## [8] {HP Wireless Mouse,
## ViewSonic Monitor} => {iMac} 0.005998983 0.5514019 2.152853 59
## [9] {CYBERPOWER Gamer Desktop,
## HP Wireless Mouse} => {iMac} 0.005388917 0.5520833 2.155514 53
## [10] {Computer Game,
## ViewSonic Monitor} => {iMac} 0.005998983 0.5000000 1.952164 59
## [11] {Computer Game,
## iMac} => {HP Laptop} 0.008642603 0.5214724 2.686580 85
## [12] {Epson Printer,
## ViewSonic Monitor} => {iMac} 0.006202339 0.5258621 2.053138 61
## [13] {Dell Desktop,
## Epson Printer} => {iMac} 0.006507372 0.5871560 2.292449 64
## [14] {CYBERPOWER Gamer Desktop,
## Epson Printer} => {iMac} 0.005083884 0.5000000 1.952164 50
## [15] {Epson Printer,
## HP Laptop} => {iMac} 0.009761057 0.5454545 2.129633 96
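Note that rule [11] has the “iMac” on the left-hand side, because items %in% matches either side of a rule. The sketch below shows how the same subset() call can be restricted to one side. Since the Electronidex file is not bundled with any package, it uses the Groceries dataset that ships with arules as a stand-in, with "whole milk" playing the role of the “iMac”; the object names are illustrative.

```r
library(arules)

# Stand-in data (assumption): the Groceries transactions shipped with arules
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.005, conf = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Item anywhere in the rule (what `items %in%` does)
rules_any <- subset(rules, items %in% "whole milk")

# Item on the right-hand side: "what leads to buying it?"
rules_rhs <- subset(rules, rhs %in% "whole milk")

# Item on the left-hand side: "what else do its buyers pick up?"
rules_lhs <- subset(rules, lhs %in% "whole milk")

c(any = length(rules_any), rhs = length(rules_rhs), lhs = length(rules_lhs))
```

An item cannot appear on both sides of the same rule, so the two one-sided subsets together make up the "any" subset.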
With 151 rules generated from the data, it is important to present our findings in an easy-to-understand manner.
We can create an interactive plot, where each rule can be viewed with its statistical measurements (support, confidence and lift).
# interactive scatter plot (plotly_arules() is deprecated in recent arulesViz releases)
plot(rulesCustomized, engine = "plotly")
Also, we can use a graph plot where itemsets and rules are connected via arrows. To avoid congestion, it is recommended to plot just a handful of rules. We will select the top 3 rules based on their lift measurement.
# plot graph-based visualization
subrules.lift <- head(sort(rulesCustomized, by = "lift"), 3)
plot(subrules.lift, method = "graph", engine = "htmlwidget")
Based on our market basket analysis, here are the initial recommendations and remarks about a potential acquisition of Electronidex:
Electronidex is a company that focuses mostly on selling computers and computer accessories. If the objective is to increase our presence in the computer market, acquiring Electronidex could be a strategic business decision.
Customers of Electronidex often purchase more than one computer in a single transaction. This is atypical for retail purchases. It is worth investigating if the customer base of Electronidex is mostly other businesses.
The “iMac” and “HP Laptop” were the best-selling products, purchased together or alongside other computer goods. It is worth exploring whether there is a discount pricing strategy to increase sales volume.
The data provided covers one month’s worth of online transactions. To make a final decision, more information is required. However, this initial analysis serves to better understand Electronidex and its clients.