Perform a market basket analysis to help the company’s board of directors better understand the clientele of an online electronics retailer and judge whether it would be an optimal acquisition.
The board of directors is considering acquiring a start-up online electronics retailer. The board has asked us to help them better understand the clientele that the retailer serves and whether it would be an optimal partnership.
They need our help to identify purchasing patterns that will provide insight into the retailer’s clientele.
We were provided a CSV file that contains one month’s (30 days) worth of the retailer’s online transactions and a file listing all the electronics that they currently sell. The transaction data records the items purchased in each customer transaction.
Each row of the data represents a single transaction, with the purchased items separated by commas; this is called the "basket" format. We will load the CSV file with the read.transactions() function. It converts the dataset to a sparse matrix in which each row represents a transaction and each column represents an item that a customer might purchase. The entries are binary (1 = item purchased in that transaction, 0 = not purchased).
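To make the binary layout concrete, here is a minimal base R sketch that builds the same kind of transaction-by-item matrix by hand. The three baskets are made-up toy data, not rows from the retailer's file, and arules stores the matrix sparsely rather than as the dense matrix shown here.

```r
# Toy baskets (hypothetical data for illustration only)
baskets <- list(
  c("iMac", "HP Laptop"),
  c("iMac"),
  c("HP Laptop", "Belkin Mouse Pad", "iMac")
)

# One column per distinct item
items <- sort(unique(unlist(baskets)))

# One row per transaction: 1 if the item is in the basket, 0 otherwise
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items

print(incidence)
```

Each row sums to the number of distinct items in that basket, which is what size() reports for a transactions object.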
Call the libraries
# load the arules and arulesViz packages
library(arules)
library(arulesViz)
Import the dataset
# import the data
transactions <- read.transactions(file = "ElectronidexTransactions2017.csv",
format = "basket",
sep = ",",
rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## items
## 1 2
## 191 10
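Here, we set rm.duplicates to TRUE. This removes duplicate items within a single transaction (it does not remove repeated transactions), because an item listed twice in one basket would otherwise distort the analysis and model performance later on. The output above shows that 191 transactions contained one duplicated item and 10 contained two.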
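The effect of rm.duplicates can be checked on a tiny file. The sketch below writes two made-up baskets to a temporary file (the file contents are an assumption for illustration) and reads them back the same way as above; the first basket lists iMac twice.

```r
library(arules)

# Write a two-line toy file in basket format (hypothetical data)
path <- tempfile(fileext = ".csv")
writeLines(c("iMac,HP Laptop,iMac", "Dell Desktop"), path)

# Same call as for the real file, with duplicate items removed per basket
toy <- read.transactions(path,
                         format = "basket",
                         sep = ",",
                         rm.duplicates = TRUE)

# Number of items per transaction after removing the repeated iMac
size(toy)
inspect(toy)
```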
Inspect the transactions
# inspect the first 5 transactions
inspect(transactions[1:5])
## items
## [1] {Acer Aspire,
## Belkin Mouse Pad,
## Brother Printer Toner,
## VGA Monitor Cable}
## [2] {Apple Wireless Keyboard,
## Dell Desktop,
## Lenovo Desktop Computer}
## [3] {iMac}
## [4] {Acer Desktop,
## Intel Desktop,
## Lenovo Desktop Computer,
## XIBERIA Gaming Headset}
## [5] {ASUS Desktop,
## Epson Black Ink,
## HP Laptop,
## iMac}
We can also use the LIST() function to inspect each transaction.
LIST(transactions[1:5])
## [[1]]
## [1] "Acer Aspire" "Belkin Mouse Pad" "Brother Printer Toner"
## [4] "VGA Monitor Cable"
##
## [[2]]
## [1] "Apple Wireless Keyboard" "Dell Desktop"
## [3] "Lenovo Desktop Computer"
##
## [[3]]
## [1] "iMac"
##
## [[4]]
## [1] "Acer Desktop" "Intel Desktop"
## [3] "Lenovo Desktop Computer" "XIBERIA Gaming Headset"
##
## [[5]]
## [1] "ASUS Desktop" "Epson Black Ink" "HP Laptop" "iMac"
Investigate how many transactions we have
length(transactions)
## [1] 9835
Investigate how many items are in each transaction
size(transactions)[1:10]
## [1] 4 3 1 4 4 5 1 5 1 2
Investigate how many products we have
length(itemLabels(transactions))
## [1] 125
We have 125 items from the dataset.
# item frequency
itemFrequencyPlot(transactions, type = "absolute", topN =10, col = rainbow(10), main = "High Frequency")
# create a table of low frequency items
lowFreqency <- sort(table(unlist(LIST(transactions))), decreasing = FALSE)[1:10]
lowFreqency
##
## Logitech Wireless Keyboard VGA Monitor Cable
## 22 22
## Panasonic On-Ear Stereo Headphones 1TB Portable External Hard Drive
## 23 27
## Canon Ink Logitech Stereo Headset
## 27 30
## Ethernet Cable Canon Office Printer
## 32 35
## Gaming Mouse Professional Audio Cable
## 35 36
# plot low frequency items
par(mar=c(10, 17, 2, 1))
barplot(lowFreqency,
horiz=TRUE,
las = 1,
col=rainbow(10),
main = "Low Frequency")
image(sample(transactions, 200))
In the 200 randomly sampled transactions, we do see patterns for certain items. Some items show an almost solid vertical line; those items were purchased more frequently.
We will use the Apriori algorithm to perform the market basket analysis. The Apriori algorithm is helpful when working with large datasets and is used to uncover insights in transactional data. It is based on item frequency: an itemset can be frequent only if all of its subsets are frequent. For example, the itemset {Item 1, Item 2, Item 3, Item 4} can occur at most as often as the least frequent of {Item 1}, {Item 2}, {Item 3} and {Item 4}.
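The frequency property that Apriori relies on can be checked by hand. The following base R sketch uses made-up baskets and a hypothetical helper, support_of(), to show that a pair of items is never more frequent than either item alone.

```r
# Toy baskets (hypothetical data for illustration only)
baskets <- list(
  c("iMac", "HP Laptop"),
  c("iMac", "HP Laptop", "Dell Desktop"),
  c("iMac", "Apple Magic Keyboard"),
  c("HP Laptop", "Dell Desktop"),
  c("iMac", "HP Laptop", "Apple Magic Keyboard"),
  c("Dell Desktop")
)

# Hypothetical helper: share of baskets that contain every item in `itemset`
support_of <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

pair <- c("iMac", "HP Laptop")
s_pair <- support_of(pair, baskets)
s_single <- vapply(pair, function(i) support_of(i, baskets), numeric(1))

print(s_single)
print(s_pair)

# The pair can never be more frequent than its rarest member
s_pair <= min(s_single)
```

This is why Apriori can discard every superset of an itemset that already falls below the minimum support.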
The Apriori algorithm assesses association rules using two measures. The first is support, which measures how frequently an itemset or rule occurs in the transactional data. The second is confidence, which measures how often the rule holds: the proportion of transactions containing the left-hand side that also contain the right-hand side.
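As a worked example, both measures can be reproduced from counts that appear in this report. The sketch uses the rule {Logitech Wireless Keyboard} => {iMac}, whose count is 13 in the rule output, together with the total of 9835 transactions and the frequency of 22 for Logitech Wireless Keyboard from the low-frequency table.

```r
n_transactions <- 9835  # length(transactions)
count_rule     <- 13    # transactions containing both the keyboard and the iMac
count_lhs      <- 22    # transactions containing Logitech Wireless Keyboard

# Support: share of all transactions in which the whole rule occurs
support <- count_rule / n_transactions

# Confidence: share of left-hand-side transactions that also contain the right-hand side
confidence <- count_rule / count_lhs

print(c(support = support, confidence = confidence))
```

These values agree with the support and confidence columns that inspect() reports for this rule.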
Apply and inspect the first 10 rules
# applying apriori rule
aprioriRule <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.55, minlen = 2, maxlen = 4))
# inspect the first 10 rules
inspect(aprioriRule[1:10])
## lhs rhs support confidence lift count
## [1] {Logitech Wireless Keyboard} => {iMac} 0.001321810 0.5909091 2.307102 13
## [2] {Panasonic On-Ear Stereo Headphones} => {iMac} 0.001321810 0.5652174 2.206794 13
## [3] {Generic Black 3-Button} => {iMac} 0.003660397 0.6428571 2.509925 36
## [4] {Mackie CR Speakers} => {iMac} 0.004677173 0.6133333 2.394654 46
## [5] {Backlit LED Gaming Keyboard,
## Large Mouse Pad} => {Apple MacBook Air} 0.001321810 0.8125000 5.222835 13
## [6] {ASUS 2 Monitor,
## Generic Black 3-Button} => {iMac} 0.001016777 0.9090909 3.549388 10
## [7] {Generic Black 3-Button,
## ViewSonic Monitor} => {iMac} 0.001016777 0.7692308 3.003329 10
## [8] {Dell Desktop,
## Generic Black 3-Button} => {iMac} 0.001220132 0.8571429 3.346566 12
## [9] {Generic Black 3-Button,
## Lenovo Desktop Computer} => {iMac} 0.001728521 0.8095238 3.160646 17
## [10] {Generic Black 3-Button,
## HP Laptop} => {iMac} 0.001321810 0.6500000 2.537813 13
Check the statistical numbers of the rules
summary(aprioriRule)
## set of 4031 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 4 837 3190
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 4.00 4.00 3.79 4.00 4.00
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001017 Min. :0.5500 Min. : 2.147 Min. : 10.00
## 1st Qu.:0.001118 1st Qu.:0.5882 1st Qu.: 2.510 1st Qu.: 11.00
## Median :0.001322 Median :0.6271 Median : 2.991 Median : 13.00
## Mean :0.001664 Mean :0.6542 Mean : 3.243 Mean : 16.36
## 3rd Qu.:0.001830 3rd Qu.:0.7059 3rd Qu.: 3.680 3rd Qu.: 18.00
## Max. :0.015760 Max. :1.0000 Max. :17.697 Max. :155.00
##
## mining info:
## data ntransactions support confidence
## transactions 9835 0.001 0.55
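Lift measures the strength of a rule: it is the rule's confidence divided by the support of its right-hand side. A lift above 1 means the items are purchased together more often than would be expected if they were independent, and higher values indicate a stronger association. We will inspect the top 10 rules by lift.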
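Lift can be reproduced the same way for the rule {Logitech Wireless Keyboard} => {iMac}. In this sketch the iMac count of 2519 is an assumption: it is not printed directly in the report but is inferred from the {iMac} => {HP Laptop} row (count 743 divided by confidence 0.2949583).

```r
n_transactions <- 9835   # length(transactions)
confidence     <- 13 / 22  # confidence of {Logitech Wireless Keyboard} => {iMac}
count_rhs      <- 2519   # assumed iMac count, inferred from 743 / 0.2949583

# Support of the right-hand side on its own
support_rhs <- count_rhs / n_transactions

# Lift: how much more often the rhs appears with the lhs than it does overall
lift <- confidence / support_rhs

print(lift)
```

The result matches the lift column for this rule to the printed precision, and a value above 1 means keyboard buyers pick an iMac more often than the average customer does.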
We will then manually adjust the support and confidence parameters in the apriori() call above while monitoring the lift values until they stop changing. This gives us our optimal 10 rules.
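The manual tuning described above could also be scripted. The following is only a sketch on made-up baskets, with an arbitrary grid of thresholds chosen for the toy data; for the real transactions the grid would sit near the values used above (support around 0.001, confidence around 0.55).

```r
library(arules)

# Toy baskets (hypothetical data, not the retailer's transactions)
baskets <- list(
  c("iMac", "HP Laptop"),
  c("iMac", "HP Laptop", "Dell Desktop"),
  c("iMac", "Apple Magic Keyboard"),
  c("HP Laptop", "Dell Desktop"),
  c("iMac", "HP Laptop", "Apple Magic Keyboard"),
  c("Dell Desktop")
)
toy <- as(baskets, "transactions")

# Every combination of the candidate thresholds (supp varies fastest)
grid <- expand.grid(supp = c(0.2, 0.3, 0.5), conf = c(0.5, 0.8))
grid$n_rules  <- NA_integer_
grid$max_lift <- NA_real_

for (i in seq_len(nrow(grid))) {
  rules <- apriori(toy,
                   parameter = list(supp = grid$supp[i],
                                    conf = grid$conf[i],
                                    minlen = 2),
                   control = list(verbose = FALSE))
  grid$n_rules[i]  <- length(rules)
  grid$max_lift[i] <- if (length(rules) > 0) max(quality(rules)$lift) else NA_real_
}

print(grid)
```

Reading the table row by row shows how quickly the rule count shrinks as the thresholds rise and whether the highest lift is stable across settings.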
Inspect the rules by lift measurements
inspect(sort(aprioriRule, by = "lift")[1:10])
## lhs rhs support confidence lift count
## [1] {Apple Earpods,
## Logitech MK360 Wireless Keyboard and Mouse Combo} => {Eluktronics Pro Gaming Laptop} 0.001220132 0.6315789 17.696806 12
## [2] {Apple Earpods,
## Microsoft Wireless Comfort Keyboard and Mouse} => {Slim Wireless Mouse} 0.001220132 0.6315789 16.697793 12
## [3] {Dell Wired Keyboard,
## HDMI Cable 6ft} => {AOC Monitor} 0.001931876 0.6333333 15.045491 19
## [4] {Dell Desktop,
## HDMI Cable 6ft,
## HP Monitor} => {AOC Monitor} 0.001118454 0.6111111 14.517579 11
## [5] {Computer Game,
## iMac,
## Intel Desktop} => {Apple Magic Keyboard} 0.001118454 0.7333333 10.230260 11
## [6] {ASUS 2 Monitor,
## Dell Desktop,
## Intel Desktop} => {Apple Magic Keyboard} 0.001118454 0.7333333 10.230260 11
## [7] {Dell Desktop,
## Microsoft Office Home and Student 2016,
## Rii LED Gaming Keyboard & Mouse Combo} => {Belkin Mouse Pad} 0.001016777 0.5882353 10.043913 10
## [8] {Apple Magic Keyboard,
## Eluktronics Pro Gaming Laptop,
## Lenovo Desktop Computer} => {ASUS Monitor} 0.001016777 0.5555556 10.025484 10
## [9] {ASUS Chromebook,
## Computer Game,
## Lenovo Desktop Computer} => {ASUS Monitor} 0.001016777 0.5555556 10.025484 10
## [10] {ASUS Monitor,
## iMac,
## Logitech MK550 Wireless Wave Keyboard and Mouse Combo} => {Apple Magic Keyboard} 0.001423488 0.6666667 9.300236 14
# see Kindle's rule
xRules <- subset(aprioriRule, items %in% "Kindle")
inspect(xRules)
summary(xRules)
## set of 0 rules
There are no rules related to Kindle.
Here is an example of how to specify which items should appear on either the left-hand side or the right-hand side of the rules. The purpose of this approach is to find which items should be displayed or bundled with the item that we want to promote.
# left hand and right hand
# lhs = iMac
iMacRules_lh <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.1),
appearance = list(default = "rhs", lhs = "iMac"))
inspect(sample(iMacRules_lh, 5))
## lhs rhs support confidence
## [1] {iMac} => {HP Laptop} 0.07554652 0.2949583
## [2] {iMac} => {CYBERPOWER Gamer Desktop} 0.05673615 0.2215165
## [3] {iMac} => {Microsoft Office Home and Student 2016} 0.03101169 0.1210798
## [4] {iMac} => {3-Button Mouse} 0.03335028 0.1302104
## [5] {iMac} => {Acer Aspire} 0.03070666 0.1198888
## lift count
## [1] 1.519599 743
## [2] 1.204320 558
## [3] 1.820825 305
## [4] 1.463565 328
## [5] 1.448534 302
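The example above fixes iMac on the left-hand side. The mirror case, fixing it on the right-hand side to see which items lead to an iMac purchase, can be sketched as follows. The baskets are made-up toy data, the thresholds are chosen for that toy data, and iMacRules_rh is a name introduced here for illustration.

```r
library(arules)

# Toy baskets (hypothetical data, not the retailer's transactions)
baskets <- list(
  c("iMac", "HP Laptop"),
  c("iMac", "HP Laptop", "Dell Desktop"),
  c("iMac", "Apple Magic Keyboard"),
  c("HP Laptop", "Dell Desktop"),
  c("iMac", "HP Laptop", "Apple Magic Keyboard"),
  c("Dell Desktop")
)
toy <- as(baskets, "transactions")

# rhs = iMac: every other item may appear only on the left-hand side
iMacRules_rh <- apriori(toy,
                        parameter = list(supp = 0.1, conf = 0.5, minlen = 2),
                        appearance = list(default = "lhs", rhs = "iMac"),
                        control = list(verbose = FALSE))

inspect(sort(iMacRules_rh, by = "lift"))
```

Setting minlen = 2 keeps out rules with an empty left-hand side, which would only restate how common the iMac is overall.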
plot(aprioriRule, method = "scatterplot", measure = c("support","confidence"),
shading = "lift")
The darker red points in the scatterplot mark the rules with the highest lift, which are the most important ones.
# sort the rules by lift, highest first
aRule <- sort(aprioriRule, by = "lift")
# plot the subset of rules
plot(aRule, method = "graph", control = list(type = "items"), max = 10)
From the graph, we can see that Dell Desktop, ViewSonic Monitor, and Acer Aspire are associated with some important rules.
Cross selling opportunities
Remove low frequency items
The online retailer has very popular and profitable items, such as Apple products and the Dell Desktop. However, each of the 10 least frequently purchased items appears in fewer than 0.4 percent of all transactions (22 to 36 of 9,835). Meanwhile, these products take up shelf space and inventory. We recommend removing the low-frequency items.
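The share of transactions for each low-frequency item can be checked with simple arithmetic. This sketch re-enters the counts from the low-frequency table above and divides them by the 9835 transactions in the dataset.

```r
n_transactions <- 9835  # length(transactions)

# Counts copied from the low-frequency table
low_counts <- c(
  "Logitech Wireless Keyboard"         = 22,
  "VGA Monitor Cable"                  = 22,
  "Panasonic On-Ear Stereo Headphones" = 23,
  "1TB Portable External Hard Drive"   = 27,
  "Canon Ink"                          = 27,
  "Logitech Stereo Headset"            = 30,
  "Ethernet Cable"                     = 32,
  "Canon Office Printer"               = 35,
  "Gaming Mouse"                       = 35,
  "Professional Audio Cable"           = 36
)

# Percentage of all transactions that contain each item
share_pct <- round(100 * low_counts / n_transactions, 2)
print(share_pct)
```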
Here ends our project