Section 1 - The Problem
Section 2 - Our Solution
Section 3 - Data Import
Section 4 - Exploratory Data Analysis
Section 5 - Market Basket Association Rules
Section 6 - Plot of ARM Rules Distribution
Section 7 - Frequency Plot
Section 8 - Ruleset Discovery for Product Categories
Section 9 - Conclusion
Section 10 - Appendix
Section 10.1 - Required Packages
Section 10.2 - Session Information
Blackwell Electronics’ board of directors is considering acquiring Electronidex, a start-up electronics online retailer. The board of directors has asked us to help them:
1) Better understand the clientele that Electronidex currently is serving and if it would be an optimal partnership.
2) Identify purchasing patterns that will provide insight into Electronidex’s clientele.
In order to solve the above problems, this document examines a CSV file that contains one month’s (30 days’ worth) of Electronidex’s online transactions and a file containing all the electronics that they currently sell. Due to their lack of funding, Electronidex is only able to pull data on the items that customers purchased per their transactions.
Via discovering any interesting relationships, (or associations), between customer’s transactions and the item(s) they’ve purchased, sales-oriented initiatives such as recommender systems are then possible to develop.
This document performs Association Rules Analysis in order to assist Blackwell’s board of directors to form a clearer picture of Electronidex’s customer buying patterns. Interesting patterns or item relationships within Electronidex’s transactions are examined. The benefit to Blackwell from selling Electronidex’s items is examined. Thereby, a conclusion of whether Blackwell should acquire Electronidex is determined, with recommedations for Blackwell concerning acquiring Electronidex. For example, cross-selling items, sale promotions, removing items, etc.
# Load transactional data
df <- read.transactions("Assignment 2/Data/ElectronidexTransactions2017.csv",
format = "basket",
sep=",",
rm.duplicates=F,
cols = NULL)
# Remove "" for improved processing
df@itemInfo$labels <- gsub("\"","",df@itemInfo$labels)
# Format item data into a "myLabels" variable
myLabels <- c("HardDrive_External_1TB_Portable","HardDrive_External_2TBPortable", "ComputerMouse_3Button", "HardDrive_External_3TBPortable","HardDrive_Desktop_5TB","Laptop_Acer_Aspire", "Desktop_Acer", "Monitor_Acer", "ComputerHeadPhones_Ailihen_Stereo", "Laptop_Alienware_AW17R4-7345SLV-PUS17", "Monitor_AOC", "ActiveHeadPhones_APIE_Bluetooh","ActiveHeadPhones_Earpods_Apple", "Laptop_AppleMacBookAir", "Laptop_AppleMacBookPro", "Keyboard_AppleMagic", "SmartHomeDevice_TV_Apple","Keyboard_AppleWired", "Keyboard_AppleWireless", "Monitor_ASUS2", "Laptop_ChromebookASUS", "Desktop_ASUS", "Monitor_ASUS","ComputerCord_AudioCable", "Keyboard_BacklitLED_Gaming", "Accessories_MousePad_Belkin", "Speaker_BoseCompanion", "Printer_Brother", "PrinterInk_Toner_Brother", "Speaker_Bluetooth_Cambridge","PrinterInk_Canon", "Printer_Canon_Office", "Software_ComputerGame", "Speaker_CyberAcoustics", "Desktop_CYBERPOWERGamer", "Desktop_Dell2", "Desktop_Dell", "MouseAndKeyboardCombo_Dell_KM117Wireless", "Laptop_Dell", "Monitor_Dell", "Keyboard_Dell_Wired", "Speaker_DOSS_Touch_Wireless_Bluetooth", "Printer_DYMO_LabelMaker","PrinterInk_DYMOLabelingTape", "MouseAndKeyboardCombo_EagleTec_Wireless", "Laptop_Eluktronics_ProGaming", "PrinterInk_Epson_Black", "Printer_Epson", "ComputerCord_Etekcity_PowerExtension","ComputerCord_EthernetCable", "ComputerTablet_FireHD", "SmartHomeDevice_FireTVStick", "ComputerStand_MonitorFullMotion", "ComputerMouse_GamingProfessional", "ComputerMouse_GenericBlack_3Button", "SmartHomeDevice_GoogleHome", "ComputerStand_HalterAcrylicMonitorStand", "ComputerStand_Halter_MeshMetal_MonitorStand", "ComputerCord_HDMIAdapter", "ComputerCord_HDMICable6ft", "ComputerStand_HeightAdjustable_StandingDesk", "PrinterInk_HP_BlackAndTricolor", "Desktop_HP", "Laptop_HP", "Monitor_HP", "Laptop_HP_Notebook_TouchScreen_PC", "Keyboard_HP_USB", "ComputerMouse_HP_Wireless", "Printer_HP_Wireless", "Desktop_iMac", "Desktop_Intel", "ComputerTablet_iPad", "ComputerTablet_iPadPro", "Accessories_iPhone_Charger_Cable", "Speaker_JBL_Splashproof_Portable_Bluetooth", "ComputerHeadPhones_Kensington", "ComputerTablet_Kindle","ComputerHeadPhones_Koss_Home", "Accessories_LargeMousePad", "Desktop_Lenovo_Computer", "Monitor_LG", "Laptop_LG_Touchscreen", "ComputerMouse_Logitech_3button", "ComputerHeadPhones_HeadSet_Logitech_ClearChat", "MouseAndKeyboardCombo_Logitech_Desktop_MK120", "Keyboard_Logitech", "MouseAndKeyboardCombo_Logitech_MK270_Wireless", "MouseAndKeyboardCombo_Logitech_MK360_Wireless", "MouseAndKeyboardCombo_Logitech_MK550_Wireless_Wave", "Speaker_Logitech_Multimedia", "ComputerHeadPhones_Headset_Logitech_Stereo", "Keyboard_Logitech_Wireless", "ComputerMouse_Logitech_Wireless", "Speaker_Mackie_CR", "ComputerMouse_Microsoft_BasicOptical", "ComputerHeadPhones_HeadSet_Microsoft", "Software_MicrosoftOffice_HomeandStudent2016","MouseAndKeyboardCombo_Microsoft_Wireless_Comfort","MouseAndKeyboardCombo_Microsoft_Wireless_Desktop", "ActiveHeadPhones_MonsterBeats_ByDrDre","ComputerStand_MultiMediaStand","ActiveHeadPhones_Otium_Wireless_Sports_Bluetooth", "ActiveHeadPhones_Panasonic_In-Ear", "ComputerHeadPhones_Panasonic_On-Ear_Stereo_RP-HT21", "ComputerHeadPhones_HeadSet_PC_Gaming", "ActiveHeadPhones_Philips_Flexible_Earhook","ComputerMouse_Redragon_Gaming", "MouseAndKeyboardCombo_Rii_LED_Gaming", "Keyboard_Rii_LED", "Speaker_Rokono_Mini", "SmartHomeDevice_RokuExpress", "ComputerCord_Samsung_ChargingCable", "ComputerTablet_SamsungGalaxy", "Monitor_Samsung", "Monitor_Sceptre","HardDrive_External_Slim_2TB_Portable", "ComputerMouse_Slim_Wireless", "SmartHomeDevice_SmartLightBulb", "Speaker_Sonos", "ComputerCord_USB_Cable", "ComputerCord_VGA_MonitorCable","Monitor_ViewSonic", "ComputerMouse_Wireless_Portable", "ComputerHeadPhones_HeadSet_XIBERIA_Gaming", "ComputerHeadPhones_HeadSet_Zombie_Gaming")
# Enhance the dataset with previosly formatted item labels
itemLabels(df) <- myLabels
# Create Levels
myLevel1 <- as.factor(sub("\\_.*", "", myLabels))
# Add Levels
df@itemInfo$level1 <- myLevel1
# Remove empty rows
df <- df[which(size(df)!= 0)]
# We have 9833 transactions
dim(df)[1]
## [1] 9833
# Find items that were consumed alone
oneCat <- df[which(size(df) == 1), ]
dim(oneCat)[1]
## [1] 2163
The dataset is summarized to find useful information. A plot of item frequencies is displayed that shows the number of times a product was in a unique transaction. The most popular product was the iMac (Desktop), follow by HP Laptop. These products would benefit Blackwell’s struggling laptop and PC sales.
The most popular Electronidex products that were purchased alone are:
1) Apple MacBook Air (383 purchases)
2) iMac 121 purchases
3) CYBERPOWER GAMER Desktop 109 purchases
# Summary
print(paste("Table 1: Electronidex Transactions 2017 Summary Statistics"))
## [1] "Table 1: Electronidex Transactions 2017 Summary Statistics"
summary(df)
## transactions as itemMatrix in sparse format with
## 9833 rows (elements/itemsets/transactions) and
## 125 columns (items) and a density of 0.03506885
##
## most frequent items:
## Desktop_iMac Laptop_HP
## 2519 1909
## Desktop_CYBERPOWERGamer ActiveHeadPhones_Earpods_Apple
## 1809 1715
## Laptop_AppleMacBookAir (Other)
## 1530 33622
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2163 1647 1294 1021 856 646 540 439 353 247 171 119 77 72 56 41
## 17 18 19 20 21 22 23 25 26 27 29 30
## 26 20 10 10 10 5 3 1 1 3 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.384 6.000 30.000
##
## includes extended item information - examples:
## labels level1
## 1 HardDrive_External_1TB_Portable HardDrive
## 2 HardDrive_External_2TBPortable HardDrive
## 3 ComputerMouse_3Button ComputerMouse
# Plot item frequencies
itemFrequencyPlot(df,
topN=10,
main='Graph 1: Item Frequency Plot',
type="absolute",
ylab="Item Frequency")
# Most products in one cat transactions
barplot(sort(itemFrequency(oneCat, type="absolute"), decreasing=T), main = "Graph 2: Bar Plot of One Category Item Frequencies")
The Apriori algorithm is used to analyse the data and output a set of rules with calculated support, confidence and lift. The higher the confidence and lift are, the better the rule is.
# Apriori Rules processing, with sup at 0.01 and conf at 0.5
basket_rules <- apriori(df, parameter = list(sup = 0.01,
conf = 0.4,
minlen = 2,
target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9833 transaction(s)] done [0.00s].
## sorting and recoding items ... [82 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [70 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# The number of Association Rules Mining (ARM) rules
basket_rules
## set of 70 rules
# The first six ARM rules
print(paste("Table 2: The First Six Association Rules"))
## [1] "Table 2: The First Six Association Rules"
inspect(head(basket_rules))
## lhs rhs support confidence lift count
## [1] {MouseAndKeyboardCombo_Logitech_MK550_Wireless_Wave} => {Desktop_iMac} 0.01006814 0.4107884 1.603526 99
## [2] {Laptop_Alienware_AW17R4-7345SLV-PUS17} => {Desktop_iMac} 0.01159361 0.4145455 1.618192 114
## [3] {ComputerCord_HDMICable6ft} => {Desktop_iMac} 0.01149191 0.4414062 1.723044 113
## [4] {ActiveHeadPhones_Panasonic_In-Ear} => {Desktop_iMac} 0.01078003 0.4326531 1.688876 106
## [5] {Laptop_Eluktronics_ProGaming} => {Laptop_HP} 0.01444117 0.4045584 2.083825 142
## [6] {Laptop_Eluktronics_ProGaming} => {Desktop_iMac} 0.01576325 0.4415954 1.723782 155
The scatterplot below demonstrates the distribution of 70 rules generated from all products. There are a few rules with high confidence and lift. Better performing rules are generated by generalising the data.
# Scatter plot of ARM rules found using the apriori algorithm
plot(basket_rules, main = "Graph 3: Distribution of 70 ARM rules")
The top 6 rules are sorted by highest confidence. A redundancy check is performed to remove any low tier rulesets.
# 'high-confidence' rules.
rules_conf <- sort(basket_rules, by="confidence", decreasing=TRUE)
# Show the support, lift and confidence for all rules
print(paste("Table 3: Support, Lift and Confidence for all Rules"))
## [1] "Table 3: Support, Lift and Confidence for all Rules"
inspect(head(rules_conf))
## lhs rhs support confidence lift count
## [1] {Laptop_Acer_Aspire,
## Monitor_ViewSonic} => {Laptop_HP} 0.01078003 0.6022727 3.102225 106
## [2] {Monitor_ASUS2,
## Desktop_Lenovo_Computer} => {Desktop_iMac} 0.01088172 0.5911602 2.307614 107
## [3] {Keyboard_AppleMagic,
## Desktop_Dell} => {Desktop_iMac} 0.01016984 0.5847953 2.282768 100
## [4] {Monitor_ASUS,
## Laptop_HP} => {Desktop_iMac} 0.01179701 0.5829146 2.275426 116
## [5] {Monitor_ASUS2,
## Laptop_HP} => {Desktop_iMac} 0.01108512 0.5828877 2.275321 109
## [6] {Desktop_Dell,
## Monitor_ViewSonic} => {Laptop_HP} 0.01525475 0.5747126 2.960267 150
# Check for redundant rules
print(paste("Table 4: Redundant Rules Check"))
## [1] "Table 4: Redundant Rules Check"
is.redundant(basket_rules)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Running the Apriori algorithm again, (but this time on the categories), we get a sizable amount of rules. 5000+ total. A plot of category frequencies is displayed that shows the number of times a product category wasin a unique transaction. This demonstrates Electronidex’s high level of desktop, laptop, and monitor sales.
# Aggregate by Category
dfByType <- aggregate(df, by= df@itemInfo$level1)
# Plot frequency for categories
itemFrequencyPlot(dfByType,
topN=10,
main='Graph 4: Category Frequency Plot',
type="absolute",
ylab="Category Frequency")
Sales of Blackwell’s laptops, PCs and desktops would benefit from acquisition of Electronidex via increased sales and an improved margin of revenues and profits. Sales of accessories would also increase greatly with Electronidex’s customer base, and Blackwell Electronics would benefit from expanding into business to business sales, in addition to business to customer sales.
Electronidex is a business to business company that sells multiple laptops and PC’s simultaneously. Electronidex’s Laptop association rules mining rule set has the highest support and confidence. Therefore, Blackwell’s acquisition of Electroidex will greatly increase laptop sales.
The variety of products offered by Blackwell Electronics will increase exponentially via inclusion of Eletronidex’s 125 products within 15 categories. The product categories shared by Blackwell Electronics and Electronidex include Laptops, Printers, PC, Monitors, Tablets and Accessories. One possibly problem is the data from Electronidex being deficit in date range, possibly leading to inaccurate analysis of annual sales.
List of Required Packages | |
---|---|
Required Packages | ‘arules’ ‘arulesViz’ ‘dplyr’ ‘caret’ ‘treemap’ ‘reshape2’ ‘base’ ‘stats’ ‘e1071’ ‘magrittr’ ‘lattice’ ‘datasets’ ‘RColorBrewer’ ‘ggplot2’ ‘knitr’ |
Session Information | |
---|---|
R Version | R version 3.6.2 (2019-12-12) |
Platform | x86_64-w64-mingw32/x64 (64-bit) |
Running | Windows 10 x64 (build 17763) |
RStudio Citation | RStudio: Integrated Development Environment for R |
RStudio Version | 1.0.153 |