Introduction

The board of directors of the fictional company we work for is considering the acquisition of Electronidex, a start-up online electronics retailer. Management has asked us to help them better understand the clientele that Electronidex currently serves and whether an acquisition would be a good fit.

Our objective is to gain insight into Electronidex’s clientele and products by conducting a market basket analysis of their online transactions.

There are two data files that we will be working with:

  1. The file labeled ElectronidexTransactions2017.csv contains 30 days’ worth of Electronidex’s online transactions. There are 9,835 transactions in total.

  2. The file labeled ElectronidexItems2017.pdf lists all the electronics that Electronidex currently sells. There are 125 items for sale across 17 product categories.

A market basket analysis identifies purchasing patterns by looking at which items customers buy together in the same transaction.

Getting to know the dataset

The first step is to load the ElectronidexTransactions2017 file and familiarize ourselves with the data.

Because we are dealing with transactional data, where each row represents a single transaction and the purchased items are separated by commas, we need to load the dataset with the read.transactions() function.

# load the libraries needed to perform the analysis
library(arules)
library(arulesViz)
# load Electronidex's data; rm.duplicates = TRUE drops items repeated within a transaction
ElectronidexTransactions2017 <- read.transactions("~/Documents/Data Science/Data Analytics and Big Data/Predicting Customer Preferences/ElectronidexTransactions2017.csv", format = "basket", sep = ",", rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## items
##   1   2 
## 191  10
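
To make the basket format more concrete, here is a minimal sketch in base R that mimics what happens to each row of the file. It does not use arules, the three lines of data are made up for illustration, and the variable names are ours. It only shows the idea of splitting rows on commas and dropping repeated items, not how read.transactions() is implemented.

```r
# Toy basket-format data: one transaction per line, items separated by commas.
# These lines are invented for illustration; they are not from the Electronidex file.
basket_lines <- c(
  "iMac,HP Laptop",
  "Apple Earpods,Apple Earpods,iMac",  # the same item is listed twice
  "ViewSonic Monitor"
)

# Split each line on commas to get the items of each transaction.
baskets <- strsplit(basket_lines, split = ",", fixed = TRUE)

# Drop repeated items within a transaction, similar in effect to rm.duplicates = TRUE.
baskets <- lapply(baskets, unique)

# Number of distinct items per transaction after de-duplication.
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

The second toy transaction shrinks from three entries to two distinct items, which is the kind of clean-up the "transactions with duplicates" message above reports for the real file.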

Now, we can have a look at the most important statistics of our dataset:

summary(ElectronidexTransactions2017)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  125 columns (items) and a density of 0.03506172 
## 
## most frequent items:
##                     iMac                HP Laptop CYBERPOWER Gamer Desktop 
##                     2519                     1909                     1809 
##            Apple Earpods        Apple MacBook Air                  (Other) 
##                     1715                     1530                    33622 
## 
## element (itemset/transaction) length distribution:
## sizes
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##    2 2163 1647 1294 1021  856  646  540  439  353  247  171  119   77   72   56 
##   16   17   18   19   20   21   22   23   25   26   27   29   30 
##   41   26   20   10   10   10    5    3    1    1    3    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.383   6.000  30.000 
## 
## includes extended item information - examples:
##                             labels
## 1 1TB Portable External Hard Drive
## 2 2TB Portable External Hard Drive
## 3                   3-Button Mouse

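In summary, we see the following:

  1. The dataset holds 9,835 transactions across 125 distinct items, with a density of about 0.035, meaning that roughly 3.5% of all possible transaction-item combinations are filled.

  2. The five most frequent items are the iMac (2,519 transactions), HP Laptop (1,909), CYBERPOWER Gamer Desktop (1,809), Apple Earpods (1,715) and Apple MacBook Air (1,530).

  3. A transaction contains about 4.4 items on average, with a median of 3 and a maximum of 30. Single-item transactions are the most common size (2,163), and 2 transactions are empty.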
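
The density and mean figures in the summary output are linked by simple arithmetic. As a rough sketch in base R, using only numbers printed by summary() above (the variable names are ours, chosen for illustration):

```r
n_transactions <- 9835        # rows reported by summary()
n_items        <- 125         # columns reported by summary()
density        <- 0.03506172  # density reported by summary()

# Density is the share of filled cells in the transactions-by-items matrix,
# so multiplying by the number of cells gives the total number of
# item occurrences across all transactions.
total_item_occurrences <- density * n_transactions * n_items

# Dividing by the number of transactions gives the mean basket size.
mean_basket_size <- total_item_occurrences / n_transactions

# Share of transactions that contain the most frequent item, the iMac.
imac_share <- 2519 / n_transactions

print(round(total_item_occurrences))
print(round(mean_basket_size, 3))
print(round(imac_share, 3))
```

The mean basket size computed this way should agree with the Mean value in the length distribution, and the iMac share tells us that about one in four transactions includes an iMac.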

Visualize the dataset

To better understand the data, we can visualize the top 20 most frequent items.

# relative frequency histogram for the top 20 items
itemFrequencyPlot(ElectronidexTransactions2017, topN = 20, type = "relative", main = "Relative Item Frequency Plot")

Four of the five best-selling items are computers. As the plot above shows, Electronidex’s customers mainly purchase computer goods such as desktops, laptops, keyboards, and monitors. A partnership with Electronidex should therefore be considered if the objective is to increase sales volume in the computer market. Should management instead choose to pursue other markets, such as smartphones or video game consoles, acquiring Electronidex would not be an optimal business decision.

Using the Apriori algorithm to perform a market basket analysis

The Apriori algorithm is a common way to uncover purchasing patterns in transactional datasets.

It can be used to identify itemsets that occur frequently. An itemset is a group of goods that have been purchased together. For example, one itemset could be {monitor, keyboard, mouse}, where all of these items were bought in a single transaction.

In addition, the Apriori algorithm evaluates association rules. For example, one rule might be: {monitor, keyboard} -> {mouse}. This means that IF a computer monitor and keyboard are purchased, THEN a mouse is likely to be purchased too.
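
To see what "occurs frequently" and "likely to be purchased too" mean in numbers, here is a small base R sketch on five made-up transactions. The data and the helper contains_itemset() are hypothetical and only serve as an illustration; this is plain counting, not the Apriori algorithm itself.

```r
# Five invented transactions, each a character vector of purchased items.
transactions <- list(
  c("monitor", "keyboard", "mouse"),
  c("monitor", "keyboard"),
  c("monitor", "keyboard", "mouse", "laptop"),
  c("laptop", "mouse"),
  c("keyboard")
)

# TRUE for each transaction that contains every item of the itemset.
contains_itemset <- function(itemset) {
  vapply(transactions, function(tx) all(itemset %in% tx), logical(1))
}

# Share of all transactions that contain {monitor, keyboard, mouse}.
itemset_share <- mean(contains_itemset(c("monitor", "keyboard", "mouse")))

# Rule {monitor, keyboard} -> {mouse}: among the transactions that contain
# the left-hand side, the share that also contain the right-hand side.
has_lhs  <- contains_itemset(c("monitor", "keyboard"))
has_both <- contains_itemset(c("monitor", "keyboard", "mouse"))
rule_share <- sum(has_both) / sum(has_lhs)

print(itemset_share)
print(rule_share)
```

In this toy data the itemset appears in two of five transactions, and the rule holds in two of the three transactions that contain a monitor and a keyboard.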

Discovering frequent itemsets

Let’s begin by mining the most frequent itemsets, subject to two conditions: each itemset contains at least 2 items, and at least 5 out of every 1,000 transactions include all of its items.

# frequent itemsets
itemsets <- apriori(ElectronidexTransactions2017, parameter = list(support = .005, minlen = 2, target = 'frequent'))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen            target   ext
##      10 frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1017 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(itemsets)
## set of 1017 itemsets
## 
## most frequent items:
##                     iMac                HP Laptop  Lenovo Desktop Computer 
##                      266                      223                      150 
##             Dell Desktop CYBERPOWER Gamer Desktop                  (Other) 
##                      145                      124                     1507 
## 
## element (itemset/transaction) length distribution:sizes
##   2   3   4 
## 658 337  22 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   4.000 
## 
## summary of quality measures:
##     support             count      
##  Min.   :0.005084   Min.   : 50.0  
##  1st Qu.:0.005999   1st Qu.: 59.0  
##  Median :0.007422   Median : 73.0  
##  Mean   :0.009894   Mean   : 97.3  
##  3rd Qu.:0.010676   3rd Qu.:105.0  
##  Max.   :0.075547   Max.   :743.0  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##                          data ntransactions support confidence
##  ElectronidexTransactions2017          9835   0.005          1

There are 1,017 itemsets that meet our length and support (5/1000 = .005) requirements. Here are the top 10 itemsets, sorted by frequency:

# top 10 itemsets sorted by support
inspect(sort(itemsets, by = 'support', decreasing = TRUE)[1:10])
##      items                                support    count
## [1]  {HP Laptop,iMac}                     0.07554652 743  
## [2]  {iMac,Lenovo Desktop Computer}       0.05876970 578  
## [3]  {CYBERPOWER Gamer Desktop,iMac}      0.05673615 558  
## [4]  {Dell Desktop,iMac}                  0.05460092 537  
## [5]  {iMac,ViewSonic Monitor}             0.04941535 486  
## [6]  {HP Laptop,ViewSonic Monitor}        0.04799187 472  
## [7]  {HP Laptop,Lenovo Desktop Computer}  0.04616167 454  
## [8]  {Dell Desktop,HP Laptop}             0.04494154 442  
## [9]  {CYBERPOWER Gamer Desktop,HP Laptop} 0.04260295 419  
## [10] {Apple Earpods,iMac}                 0.04026436 396
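
The support and count columns above carry the same information: support is simply the count divided by the total number of transactions. A quick check in base R, with counts copied from the first three rows of the table (the variable names are ours):

```r
n_transactions <- 9835  # total transactions in the dataset

# Counts copied from the first three rows of the table above.
counts <- c("{HP Laptop,iMac}"                = 743,
            "{iMac,Lenovo Desktop Computer}"  = 578,
            "{CYBERPOWER Gamer Desktop,iMac}" = 558)

# Support is the share of all transactions that contain the itemset.
support <- counts / n_transactions
print(round(support, 8))

# Smallest whole-number count that satisfies support >= 0.005.
min_count <- ceiling(0.005 * n_transactions)
print(min_count)
```

The last value lines up with the minimum count of 50 reported in the summary of quality measures.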

We knew the “iMac” and “HP Laptop” were the first and second best-sellers of the month, respectively. However, we now see that the most frequent itemset contains both products. Moreover, most itemsets include a combination of goods that could be considered substitutes (e.g. desktops of different brands, laptops).

Retail customers do not frequently purchase two computers in one transaction. It is worth exploring whether Electronidex mainly serves other businesses, or whether it offers discounts on purchases of more than one computer.

Discovering rules of association

Now it’s time to mine association rules. As previously stated, rules are based on the theory that if you purchase a certain group of items, you are more (or less) likely to purchase another group of items.

The Apriori algorithm creates rules using two measures. The first is support, the share of all transactions that contain every item in the rule. The second is confidence, the share of transactions containing the left-hand side of the rule that also contain the right-hand side.

These parameters are user-specified, and different values produce a different set of rules. Here we will choose low support and moderate confidence:

# create rules with support = 0.005 and confidence = 0.5
rulesCustomized <- apriori(ElectronidexTransactions2017, parameter = list(supp = 0.005, conf = 0.5, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [151 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rulesCustomized)
## set of 151 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
##   1 114  36 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.232   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift           count       
##  Min.   :0.005084   Min.   :0.5000   Min.   :1.952   Min.   : 50.00  
##  1st Qu.:0.005694   1st Qu.:0.5196   1st Qu.:2.082   1st Qu.: 56.00  
##  Median :0.006406   Median :0.5490   Median :2.276   Median : 63.00  
##  Mean   :0.007456   Mean   :0.5590   Mean   :2.410   Mean   : 73.33  
##  3rd Qu.:0.008693   3rd Qu.:0.5829   3rd Qu.:2.682   3rd Qu.: 85.50  
##  Max.   :0.023081   Max.   :0.8125   Max.   :4.186   Max.   :227.00  
## 
## mining info:
##                          data ntransactions support confidence
##  ElectronidexTransactions2017          9835   0.005        0.5

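A set of 151 rules was generated! How do we know which rules are the “best”? We can use lift, which is the rule’s confidence divided by the share of all transactions that contain the right-hand side. A lift above 1 means the two sides of the rule are bought together more often than we would expect if they were purchased independently.

Let’s look at the top 10 rules, sorted by lift.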

# sort rules by high lift
inspect(sort(rulesCustomized, by = "lift", decreasing = TRUE)[1:10])
##      lhs                          rhs                           support confidence     lift count
## [1]  {Acer Aspire,                                                                               
##       Dell Desktop,                                                                              
##       ViewSonic Monitor}       => {HP Laptop}               0.005287239  0.8125000 4.185928    52
## [2]  {ASUS 2 Monitor,                                                                            
##       Dell Desktop,                                                                              
##       iMac}                    => {Lenovo Desktop Computer} 0.005185562  0.5730337 3.870732    51
## [3]  {Apple Magic Keyboard,                                                                      
##       Dell Desktop,                                                                              
##       iMac}                    => {Lenovo Desktop Computer} 0.005287239  0.5200000 3.512500    52
## [4]  {HP Laptop,                                                                                 
##       HP Monitor,                                                                                
##       iMac}                    => {Lenovo Desktop Computer} 0.005388917  0.5096154 3.442354    53
## [5]  {Acer Aspire,                                                                               
##       iMac,                                                                                      
##       ViewSonic Monitor}       => {HP Laptop}               0.006202339  0.6630435 3.415942    61
## [6]  {Acer Desktop,                                                                              
##       iMac,                                                                                      
##       ViewSonic Monitor}       => {HP Laptop}               0.006405694  0.6363636 3.278489    63
## [7]  {Dell Desktop,                                                                              
##       Lenovo Desktop Computer,                                                                   
##       ViewSonic Monitor}       => {HP Laptop}               0.006202339  0.6224490 3.206802    61
## [8]  {Computer Game,                                                                             
##       ViewSonic Monitor}       => {HP Laptop}               0.007422471  0.6186441 3.187200    73
## [9]  {Computer Game,                                                                             
##       Dell Desktop}            => {HP Laptop}               0.005693950  0.6086957 3.135946    56
## [10] {Acer Aspire,                                                                               
##       ViewSonic Monitor}       => {HP Laptop}               0.010777834  0.6022727 3.102856   106
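
The lift column can be reproduced by hand from numbers already printed in this report. As a sketch in base R, taking rule [1] above and the HP Laptop frequency from the earlier summary() output (the variable names are ours):

```r
n_transactions <- 9835

# Number of transactions containing the right-hand-side item "HP Laptop",
# taken from the "most frequent items" section of summary().
hp_laptop_count <- 1909

# Confidence of rule [1]:
# {Acer Aspire, Dell Desktop, ViewSonic Monitor} => {HP Laptop}
confidence <- 0.8125

# Lift compares the rule's confidence with the baseline share of
# transactions that contain the right-hand side.
rhs_support <- hp_laptop_count / n_transactions
lift <- confidence / rhs_support
print(round(lift, 4))
```

In other words, customers who buy the three left-hand-side items are roughly four times as likely to buy an HP Laptop as the average customer.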

The rules corroborate our findings from previous steps. The “iMac” and “HP Laptop” are popular items, frequently sold together or with other computers from different brands.

It could also be interesting to know what customers buy alongside the “iMac.”

# view rules that include the item "iMac"
iMacrules <- subset(rulesCustomized, items %in% "iMac")
inspect(iMacrules[1:15])
##      lhs                                      rhs             support confidence     lift count
## [1]  {Smart Light Bulb}                    => {iMac}      0.009252669  0.5229885 2.041918    91
## [2]  {Etekcity Power Extension Cord Cable,                                                     
##       HP Laptop}                           => {iMac}      0.005388917  0.5196078 2.028719    53
## [3]  {Dell 2 Desktop,                                                                          
##       HP Laptop}                           => {iMac}      0.005185562  0.5000000 1.952164    51
## [4]  {Brother Printer,                                                                         
##       HP Laptop}                           => {iMac}      0.005287239  0.5473684 2.137105    52
## [5]  {ASUS Desktop,                                                                            
##       HP Laptop}                           => {iMac}      0.005693950  0.5333333 2.082308    56
## [6]  {Intel Desktop,                                                                           
##       Lenovo Desktop Computer}             => {iMac}      0.007117438  0.5384615 2.102330    70
## [7]  {AOC Monitor,                                                                             
##       Dell Desktop}                        => {iMac}      0.005998983  0.5462963 2.132919    59
## [8]  {HP Wireless Mouse,                                                                       
##       ViewSonic Monitor}                   => {iMac}      0.005998983  0.5514019 2.152853    59
## [9]  {CYBERPOWER Gamer Desktop,                                                                
##       HP Wireless Mouse}                   => {iMac}      0.005388917  0.5520833 2.155514    53
## [10] {Computer Game,                                                                           
##       ViewSonic Monitor}                   => {iMac}      0.005998983  0.5000000 1.952164    59
## [11] {Computer Game,                                                                           
##       iMac}                                => {HP Laptop} 0.008642603  0.5214724 2.686580    85
## [12] {Epson Printer,                                                                           
##       ViewSonic Monitor}                   => {iMac}      0.006202339  0.5258621 2.053138    61
## [13] {Dell Desktop,                                                                            
##       Epson Printer}                       => {iMac}      0.006507372  0.5871560 2.292449    64
## [14] {CYBERPOWER Gamer Desktop,                                                                
##       Epson Printer}                       => {iMac}      0.005083884  0.5000000 1.952164    50
## [15] {Epson Printer,                                                                           
##       HP Laptop}                           => {iMac}      0.009761057  0.5454545 2.129633    96

Visualize the rules

With 151 rules generated from the data, it is important to present our findings in an easy-to-understand manner.

We can create an interactive plot, where each rule can be viewed with its statistical measurements (support, confidence and lift).

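# interactive scatter plot
# (plotly_arules() is deprecated in newer arulesViz versions; plot() with engine = "plotly" replaces it)
plot(rulesCustomized, engine = "plotly")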

We can also use a graph plot, where itemsets and rules are connected by arrows. To avoid clutter, it is best to plot just a handful of rules. We will select the top 3 rules by lift.

# plot graph-based visualization
subrules.lift <- head(sort(rulesCustomized, by = "lift"), n = 3)
plot(subrules.lift, method = "graph", engine = "htmlwidget")

Conclusion and recommendations for management

Based on our market basket analysis, here are the initial recommendations and remarks about a potential acquisition of Electronidex: