Students will continue to work as data analysts at Blackwell Electronics. Their job is to extend Blackwell’s application of data mining methods to develop predictive models.
In this course, students will use the R statistical programming language, augmented with machine learning packages, to predict which product types a customer is likely to buy. Next, students will create a recommender system that suggests additional purchases based on customer choices. Finally, students will present to management, explaining their insights and suggesting data mining process improvements.
Market Basket Analysis
Hi,
Blackwell Electronics’ board of directors is considering acquiring Electronidex, a start-up online electronics retailer. The board has asked us to help them better understand the clientele Electronidex currently serves and whether the acquisition would be an optimal partnership.
They need our help to identify purchasing patterns that will provide insight into Electronidex’s clientele. Attached is a CSV file containing one month’s (30 days’ worth) of Electronidex’s online transactions and a file listing all the electronics they currently sell. Due to limited funding, Electronidex is only able to pull data on the items customers purchased in each transaction.
I would like you to use R to conduct a market basket analysis. You will be discovering any interesting relationships (or associations) between customers’ transactions and the item(s) they’ve purchased. These associations can then be used to drive sales-oriented initiatives such as recommender systems like the ones used by Amazon and other eCommerce sites.
To help Blackwell’s board of directors form a clearer picture of Electronidex’s customer buying patterns, please consider the following questions while you’re performing your analysis:
Once you’ve completed your market basket analysis, please put together a formal business report in Word. Thank you in advance!
Best,
Danielle Sherman
Chief Technology Officer
Blackwell Electronics
www.blackwellelectronics.com
suppressPackageStartupMessages(library(arulesViz))
suppressPackageStartupMessages(library(arules))
suppressPackageStartupMessages(library(tidyverse))
library(readxl)
library(knitr)
library(ggplot2)
library(lubridate)
library(rmarkdown)
library(stringr)
library(RColorBrewer)
df_product <- read.csv("ElectronidexTransactions2017.csv", header = FALSE, sep = ",")
product <- read.transactions("ElectronidexTransactions2017.csv", format = "basket", sep = ",")
## Warning in readLines(file, encoding = encoding): incomplete final line found on
## 'ElectronidexTransactions2017.csv'
## Warning in asMethod(object): removing duplicated items in transactions
blackwellold <- read.csv("existingproductattributes2017.csv")
blackwellnew <- read.csv("newproductattributes2017.csv")
Product_Equivalence <- read.csv("product.csv", header = TRUE, sep = ";")
NOTE:
The file ElectronidexTransactions.csv is a record of one month’s (30 days’ worth) of online transactions: 9835 transactions recording which items were purchased out of the 125 products Electronidex sells.
The file product.csv is an equivalence list between the products and categories sold by Electronidex, with the 125 products broken down into 17 product types.
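A quick sanity check of these counts against the objects loaded above (a minimal sketch; the expected values are the figures quoted in the note):
length(product) # expected: 9835 transactions
length(itemLabels(product)) # expected: 125 distinct products
nrow(Product_Equivalence) # expected: 125 rows in the product/category equivalence list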
equivalence <- data.frame(
Category_Blackwell = c(
"NULL", "PC & Laptop & Netbook", "Accessories", "Software", "Display", "Printer",
"Printer Supplies", "Extended Warranty", "Tablet", "Smartphone", "Game Console", "NULL",
"NULL", "NULL", "NULL", "NULL"
),
Category_Electronidex = c(
"Desktop", "Laptop",
"Accessories, Computer Stands, Mouse and Keyboard Combo, Computer Cords, Computer Mice, Keyboard",
"NULL", "Monitors", "Printer", "Printer Ink", "NULL", "Computer Tablets", "NULL",
"NULL", "Speakers", "External Hard Drives", "Smart Home Devices",
"Computer Headphones", "Active Headphones"
)
)
equivalence
Comment:
# Create an empty dataframe df_category with the same dimensions as df_product;
# it will hold the product category for each purchased item
df_category <- data.frame(matrix(nrow = nrow(df_product), ncol = ncol(df_product)))
# Loop through the indices of df_product
#for (i in 1:nrow(df_product)) {
# try({
# for (e in 1:ncol(df_product)) {
# print(i)
# if (nchar(str_trim(df_product[i, e])) == 0) {
# df_category[i, e] <- df_product[i, e]
# } else {
# df_category[i, e] <- Product_Equivalence[which(Product_Equivalence$producto == str_trim(df_product[i, e])), 1]
# print(str_trim(df_product[i, e]))
# print(Product_Equivalence[which(Product_Equivalence$producto == str_trim(df_product[i, e])), 1])
# }
# }
# })
#}
# Write df_category to a CSV file (run once; the result is re-read below)
#write.csv(df_category, "category.csv", row.names = FALSE)
# Read transactions from the CSV file into category
df_category <- read.csv("category.csv", sep = ",")
category <- read.transactions("category.csv", format = "basket", sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
COMMENT:
Based on the information above:
df_product holds the data from “ElectronidexTransactions2017.csv”. It contains records of 9835 online transactions that occurred over a period of 30 days; each transaction lists the items purchased from the 125 products sold by Electronidex.
Product_Equivalence holds the data from “product.csv”. It is an equivalence list that maps each of the 125 products sold by Electronidex to its category; the categories are broken down into 17 product types.
The purpose of the (commented-out) code is to build df_category, a copy of df_product in which each product is replaced by its corresponding category from Product_Equivalence. The code iterates over each row and column of df_product, checks whether the cell is empty, and if not, looks up the matching product in Product_Equivalence and writes the corresponding category into df_category. The loop was run once and its result saved to category.csv, which is then re-read above.
This mapping enriches the transaction data with category information, giving a category-level view of the transactions alongside the product-level view.
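As an aside, the commented-out nested loop can also be written as a vectorized lookup. The sketch below assumes, as the loop does, that Product_Equivalence has a producto column holding the product names and that its first column holds the category:
# Vectorized sketch of the same product-to-category mapping (same assumptions as the loop above)
map_to_category <- function(x) {
  x <- str_trim(x)
  idx <- match(x, str_trim(Product_Equivalence$producto)) # position of each product in the equivalence list
  out <- Product_Equivalence[[1]][idx] # category of the matched product (column 1 assumed to hold it)
  out[x == "" | is.na(idx)] <- "" # keep empty cells (and unmatched products) empty
  out
}
df_category_alt <- as.data.frame(lapply(df_product, map_to_category), stringsAsFactors = FALSE)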
# Exploratory Data Analysis (EDA) - category
# View the transactions in category
#inspect(category)
# Check the number of transactions
length(category) # 9836 Number of transactions.
## [1] 9836
# Check the number of items per transaction
#size(category) # Number of items per transaction
# List the transactions by conversion (LIST must be capitalized)
#LIST(category)
# View the item labels
#itemLabels(category)
# Summary statistics of the category
#summary(category)
# Item frequency plot for the category
itemFrequencyPlot(category, topN = 17, type = "absolute", col = brewer.pal(8, 'Pastel2'), main = "Absolute Item Frequency for Category")
category
## transactions in sparse format with
## 9836 transactions (rows) and
## 49 items (columns)
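The summary reports 49 item labels, although only 17 categories are expected; the extra labels are likely stray variants (e.g. whitespace or spelling differences) in category.csv, and only 17 of them survive the minimum-support filter in the apriori run below. A quick way to inspect them:
# Count how often each category label occurs; labels outside the 17 expected
# categories point to data-quality issues in category.csv
sort(itemFrequency(category, type = "absolute"), decreasing = TRUE)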
# Exploratory Data Analysis (EDA) - product
# View the transactions in product
#inspect(product)
# Check the number of transactions
length(product) # 9835 Number of transactions.
## [1] 9835
# Check the number of items per transaction
#size(product) # Number of items per transaction
# List the transactions by conversion (LIST must be capitalized)
#LIST(product)
# View the item labels
#itemLabels(product)
# Summary statistics of the product
#summary(product)
# Item frequency plot for the product
if (!require("RColorBrewer")) {
install.packages("RColorBrewer")
library(RColorBrewer)
}
itemFrequencyPlot(product, topN = 30, type = "absolute", col = brewer.pal(8, 'Pastel2'), main = "Absolute Item Frequency for Product")
rules_cat <- apriori(category, parameter = list(supp=0.005, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[49 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.00s].
## writing ... [3387 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Keep rules with positive association (lift > 1), sort by support,
# drop redundant rules, and remove rules whose generating itemsets are duplicated
rules_cat <- subset(rules_cat, lift > 1)
rules_cat <- sort(rules_cat, by = 'support', decreasing = TRUE)
rules_cat <- rules_cat[!is.redundant(rules_cat)]
gi <- generatingItemsets(rules_cat)
d <- which(duplicated(gi))
if (length(d) > 0) rules_cat <- rules_cat[-d] # guard: subsetting with an empty index would drop all rules
summary(rules_cat)
## set of 1579 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7
## 41 332 653 436 112 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 4.000 4.165 5.000 7.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005083 Min. :0.5000 Min. :0.005287 Min. :1.035
## 1st Qu.:0.006405 1st Qu.:0.7440 1st Qu.:0.008032 1st Qu.:1.444
## Median :0.008743 Median :0.8156 Median :0.011387 Median :1.524
## Mean :0.015233 Mean :0.7926 Mean :0.020464 Mean :1.589
## 3rd Qu.:0.014742 3rd Qu.:0.8687 3rd Qu.:0.019164 3rd Qu.:1.659
## Max. :0.320557 Max. :0.9808 Max. :0.534160 Max. :2.532
## count
## Min. : 50.0
## 1st Qu.: 63.0
## Median : 86.0
## Mean : 149.8
## 3rd Qu.: 145.0
## Max. :3153.0
##
## mining info:
## data ntransactions support confidence
## category 9836 0.005 0.5
## call
## apriori(data = category, parameter = list(supp = 0.005, conf = 0.5))
plot(rules_cat, measure = c("support", "confidence"), shading = "lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
topRules <- rules_cat[1:5]
plot(topRules, jitter=0)
plot(head(sort(topRules, by = "lift")), method = "graph")
plot(topRules, method="grouped")
plot(topRules)
inspectDT(topRules)
rules_prod <- apriori(product, parameter = list(supp=0.01, conf=0.25))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [82 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [204 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_prod <- subset(rules_prod, lift > 1)
rules_prod <- sort(rules_prod, by='lift', decreasing = TRUE)
rules_prod <- rules_prod[!is.redundant(rules_prod)]
summary(rules_prod)
## set of 201 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 102 99
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.493 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01007 Min. :0.2500 Min. :0.01739 Min. :1.155
## 1st Qu.:0.01118 1st Qu.:0.3055 1st Qu.:0.03071 1st Qu.:1.596
## Median :0.01474 Median :0.3578 Median :0.04209 Median :1.851
## Mean :0.01814 Mean :0.3720 Mean :0.05154 Mean :1.914
## 3rd Qu.:0.02105 3rd Qu.:0.4286 3rd Qu.:0.05877 3rd Qu.:2.190
## Max. :0.07555 Max. :0.6023 Max. :0.25613 Max. :3.360
## count
## Min. : 99.0
## 1st Qu.:110.0
## Median :145.0
## Mean :178.4
## 3rd Qu.:207.0
## Max. :743.0
##
## mining info:
## data ntransactions support confidence
## product 9835 0.01 0.25
## call
## apriori(data = product, parameter = list(supp = 0.01, conf = 0.25))
# Remove rules whose generating itemsets are duplicated
gi <- generatingItemsets(rules_prod)
d <- which(duplicated(gi))
if (length(d) > 0) rules_prod <- rules_prod[-d] # guard: subsetting with an empty index would drop all rules
summary(rules_prod)
## set of 145 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 101 44
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.303 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01007 Min. :0.2500 Min. :0.01901 Min. :1.155
## 1st Qu.:0.01149 1st Qu.:0.2920 1st Qu.:0.03416 1st Qu.:1.518
## Median :0.01576 Median :0.3421 Median :0.04616 Median :1.795
## Mean :0.01931 Mean :0.3519 Mean :0.05650 Mean :1.872
## 3rd Qu.:0.02267 3rd Qu.:0.3947 3rd Qu.:0.07168 3rd Qu.:2.156
## Max. :0.07555 Max. :0.5829 Max. :0.19410 Max. :3.360
## count
## Min. : 99.0
## 1st Qu.:113.0
## Median :155.0
## Mean :189.9
## 3rd Qu.:223.0
## Max. :743.0
##
## mining info:
## data ntransactions support confidence
## product 9835 0.01 0.25
## call
## apriori(data = product, parameter = list(supp = 0.01, conf = 0.25))
plot(rules_prod, measure = c("support", "confidence"), shading = "lift")
topRules <- rules_prod[1:9]
plot(topRules, jitter=0)
plot(head(sort(topRules, by = "lift"), 22), method = "graph")
plot(topRules, method="grouped")
plot(topRules)
inspectDT(rules_prod)
Cat_Rules1 <- subset(rules_cat, items %in% "desktop")
CatRules1 <- sort(Cat_Rules1, by = "support", decreasing = TRUE)
top5_categories <- head(CatRules1, 5)
head(top5_categories)
## set of 5 rules
plot(top5_categories)
inspectDT(top5_categories)
COMMENT:
Desktop is by far the most frequently purchased category at Electronidex.
A striking 20% of transactions involve buying a laptop together with a desktop.
This is closely followed by 17% of transactions in which customers buy a monitor together with a desktop.
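The percentages quoted above can be read directly from the rule quality measures; a minimal sketch using the same "desktop" subset (the same check applies to the laptop and monitor subsets below):
# Support is the share of all transactions containing the itemset, so support * 100 gives the quoted percentages
desktop_rules <- sort(subset(rules_cat, items %in% "desktop"), by = "support", decreasing = TRUE)
inspect(head(desktop_rules, 5))
quality(head(desktop_rules, 5))$support * 100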
Cat_Rules2 <- subset(rules_cat, items %in% "laptop")
CatRules2 <- sort(Cat_Rules2, by = "support", decreasing = TRUE)
top5_categories <- head(CatRules2, 5)
head(top5_categories)
## set of 5 rules
plot(top5_categories)
inspectDT(top5_categories)
Comment:
Laptop is the second most purchased category. It is interesting to note that 15% of transactions involve buying a monitor together with a laptop.
This is closely followed by 12% in which customers buy a computer mouse together with a laptop.
Cat_Rules3 <- subset(rules_cat, items %in% "monitors")
CatRules3 <- sort(Cat_Rules3, by = "support", decreasing = TRUE)
top5_categories <- head(CatRules3, 5)
head(top5_categories)
## set of 5 rules
plot(top5_categories)
inspectDT(top5_categories)
Comment:
Monitors is the third most purchased category.
It is interesting to note that 10% of transactions that include a computer mouse also include a monitor and a laptop.
This is closely followed by 9% in which customers buy a keyboard and also a laptop.
rules_product <- apriori(product, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[125 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [82 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
sorted_rules <- sort(rules_product, by = "support", decreasing = TRUE)
sorted_rules
## set of 19 rules
top5_products <- head(sorted_rules, 5)
head(top5_products)
## set of 5 rules
plot(top5_products)
inspectDT(top5_products)
Item_Rules<- subset(rules_product, items %in% "iMac")
ItemRules <- sort(Item_Rules, by = "support", decreasing = TRUE)
top5_products <- head(ItemRules, 5)
head(top5_products)
## set of 5 rules
plot(top5_products)
inspectDT(top5_products)
Item_Rules1<- subset(rules_product, items %in% "HP Laptop")
ItemRules1 <- sort(Item_Rules1, by = "support", decreasing = TRUE)
top5_products <- head(ItemRules1, 5)
head(top5_products)
## set of 5 rules
plot(top5_products)
inspectDT(top5_products)
Item_Rules2<- subset(rules_product, items %in% "CYBERPOWER Gamer Desktop")
ItemRules2 <- sort(Item_Rules2, by = "support", decreasing = TRUE)
top5_products <- head(ItemRules2, 5)
head(top5_products)
## set of 2 rules
plot(top5_products)
inspectDT(top5_products)
# 2. Would Blackwell benefit from buying Electronidex?
ItemRules3 <- subset(rules_cat, items %in% "Speakers")
sorted_ItemRules3<- sort(ItemRules3, by = "support", decreasing = TRUE)
top5_Speakers <- head(sorted_ItemRules3, 5)
head(top5_Speakers)
## set of 5 rules
plot(top5_Speakers)
inspectDT(top5_Speakers)
ItemRules4 <- subset(rules_cat, items %in% "External Hardrives")
sorted_ItemRules4<- sort(ItemRules4, by = "support", decreasing = TRUE)
top5_External_Hardrives <- head(sorted_ItemRules4, 5)
head(top5_External_Hardrives)
## set of 5 rules
plot(top5_External_Hardrives)
inspectDT(top5_External_Hardrives)
# unique(df_new) # df_new is not created in this script, so this call is left commented out
categories_dict <- c("laptop" = 1650, "desktop" = 1750, "monitors" = 300, "Computer Tablets" = 500, "Computer Mice" = 80, "Keyboard" = 300,
"Mouse and Keyboard Combo" = 400, "Active Headphones" = 100, "Computer Headphones" = 150, "Computer Cords" = 30,
"Computer Stands" = 60, "Accessories" = 70, "Speakers" = 200, "Printers" = 100, "Printer Ink" = 30,
"External Hardrives" = 120, "Smart Home Devices" = 50)
df_category$Average_Price_Category_Transaction <- NA # Create an empty per-transaction price column
# For each transaction, sum the assumed average price of each distinct category it contains
for (i in 1:nrow(df_category)) {
categories <- unique(unlist(df_category[i,]))
categories <- categories[!is.na(categories) & categories %in% names(categories_dict)]
if (length(categories) > 0) {
df_category$Average_Price_Category_Transaction[i] <- sum(categories_dict[categories])
}
}
df_category$Average_Price_Category <- rowSums(df_category[, "Average_Price_Category_Transaction", drop = FALSE], na.rm = TRUE) # Copy of the per-transaction total (rowSums over a single column returns that column)
df_category$Average_Price_Category_Transaction <- as.numeric(as.character(df_category$Average_Price_Category_Transaction))
df_category <- na.omit(df_category)
summary(df_category$Average_Price_Category_Transaction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30 1650 2120 2340 3700 5590
hist(df_category$Average_Price_Category_Transaction)
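As a cross-check, the same per-transaction estimate can be written more compactly. This is only a sketch: the prices in categories_dict are rough assumed averages rather than values taken from the data, and, like the loop, it counts each distinct category once per transaction.
# Compact alternative to the loop above (same assumptions; distinct categories counted once per transaction)
estimate_basket_value <- function(row) {
  cats <- unique(as.character(row))
  cats <- cats[cats %in% names(categories_dict)]
  if (length(cats) == 0) NA_real_ else sum(categories_dict[cats])
}
raw_categories <- read.csv("category.csv", sep = ",") # raw category columns, as read earlier
basket_value <- apply(raw_categories, 1, estimate_basket_value)
summary(basket_value)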
# Initialize the pricing range column
df_category$Range <- ""
# Assign a pricing range based on the estimated transaction value;
# the >= 3000 threshold is used to flag transactions likely made by companies (B2B)
df_category$Range[df_category$Average_Price_Category_Transaction < 30] <- "Very Low Price (Light)"
df_category$Range[df_category$Average_Price_Category_Transaction >= 30 & df_category$Average_Price_Category_Transaction < 150] <- "Low"
df_category$Range[df_category$Average_Price_Category_Transaction >= 150 & df_category$Average_Price_Category_Transaction < 500] <- "Standard"
df_category$Range[df_category$Average_Price_Category_Transaction >= 500 & df_category$Average_Price_Category_Transaction < 1250] <- "Mid-High Price"
df_category$Range[df_category$Average_Price_Category_Transaction >= 1250 & df_category$Average_Price_Category_Transaction < 2000] <- "High Price"
df_category$Range[df_category$Average_Price_Category_Transaction >= 2000 & df_category$Average_Price_Category_Transaction < 3000] <- "Premium"
df_category$Range[df_category$Average_Price_Category_Transaction >= 3000] <- "B2B"
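The banding above can also be expressed with cut(), which keeps the breakpoints in one place; a sketch using the same thresholds, stored in a separate Range_cut column so it does not overwrite Range:
# Equivalent banding with cut(); right = FALSE makes the intervals [lower, upper), matching the >=/< comparisons above
price_breaks <- c(-Inf, 30, 150, 500, 1250, 2000, 3000, Inf)
price_labels <- c("Very Low Price (Light)", "Low", "Standard", "Mid-High Price", "High Price", "Premium", "B2B")
df_category$Range_cut <- cut(df_category$Average_Price_Category_Transaction, breaks = price_breaks, labels = price_labels, right = FALSE)
table(df_category$Range_cut)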
# Count the frequency of each range category
range_counts <- table(df_category$Range)
# Sort the range_counts in descending order
range_counts <- sort(range_counts, decreasing = TRUE)
# Create a vertical bar plot with ordered categories
barplot(range_counts, main = "Range Distribution", xlab = "Range", ylab = "Count", col = "steelblue", ylim = c(0, max(range_counts)))
# Calculate the percentage of each range category
percentage <- round((range_counts / sum(range_counts)) * 100, 1)
# Create a cross-tabulation table with percentages
cross_tab <- prop.table(range_counts) * 100
cross_tab <- round(cross_tab, 1)
# Print the cross-tabulation table
cat("\nCross-Tabulation of Range with Percentages:\n")
##
## Cross-Tabulation of Range with Percentages:
print(cross_tab)
##
## B2B High Price Premium Standard Low
## 33.1 23.5 22.9 9.9 5.7
## Mid-High Price
## 4.9
# df_product$Rules <- "Other"
#
# for (i in 1:nrow(df_product)) {
# if (sum(grepl("Laptop", df_product[i, ]) | grepl("Desktop", df_product[i, ]) | grepl("Monitor", df_product[i, ]) |
# grepl("Keyboard", df_product[i, ]) | grepl("Computer Mice", df_product[i, ]) | grepl("Computer Cords", df_product[i, ]) |
# grepl("Mouse and Keyboard Combo", df_product[i, ]) | grepl("Computer Headphones", df_product[i, ]) |
# grepl("Active Headphones", df_product[i, ]) | grepl("Speakers", df_product[i, ]) | grepl("Printers", df_product[i, ]) |
# grepl("Printer Ink", df_product[i, ]) | grepl("Computer Stands", df_product[i, ]) | grepl("Computer Tablets", df_product[i, ]) |
# grepl("External Hardrives", df_product[i, ]) | grepl("Smart Home Devices", df_product[i, ])) >= 4) {
# df_product$Rules[i] <- "Office"
# }
# }
# for (i in 1:nrow(df_product)) {
# if (sum(
# grepl("Keyboard", df_product[i,]) |
# grepl("Computer Mice", df_product[i,]) |
# grepl("Computer Cords", df_product[i,]) |
# grepl("Mouse and Keyboard Combo", df_product[i,]) |
# grepl("Computer Headphones", df_product[i,]) |
# grepl("Active Headphones", df_product[i,]) |
# grepl("Speakers", df_product[i,]) |
# grepl("Printers", df_product[i,]) |
# grepl("Printer Ink", df_product[i,]) |
# grepl("Computer Stands", df_product[i,]) |
# grepl("Computer Tablets", df_product[i,]) |
# grepl("External Hardrives", df_product[i,]) |
# grepl("Smart Home Devices", df_product[i,]) |
# grepl("monitors", df_product[i,])
# ) >= 1) {
# df_product$Rules[i] <- "Casual Buyers"
# }
# }
# for (i in 1:nrow(df_product)) {
# if (sum(grepl("Laptop", df_product[i,]) | grepl("Desktop", df_product[i,])>= 1) {
# df_product$Rules[i] <- "Computer"
# }
# }
# # Count the frequency of each range category
# range_counts <- table(df_product$Rules)
#
# # Sort the range_counts in descending order
# range_counts <- sort(range_counts, decreasing = TRUE)
#
# # Create a vertical bar plot with ordered categories
# barplot(range_counts, main = "Range Distribution", xlab = "Range", ylab = "Count", col = "steelblue", ylim = c(0, max(range_counts)))
#
# # Calculate the percentage of each range category
# percentage <- round((range_counts / sum(range_counts)) * 100, 1)
#
# # Create a cross-tabulation table with percentages
# cross_tab <- prop.table(range_counts) * 100
# cross_tab <- round(cross_tab, 1)
#
# # Print the cross-tabulation table
# cat("\nCross-Tabulation of Range with Percentages:\n")
# print(cross_tab)
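Since the segmentation loops above are left commented out, here is a compact, runnable sketch of the same idea. The category terms and the four-or-more threshold for the Office segment are taken from the commented code; the precedence of the rules (Office, then Computer, then Casual Buyers) is an assumption, since the commented loops simply let the last matching rule win.
# Hedged sketch of the commented-out segmentation; treat the resulting labels as exploratory only
office_terms <- c("Laptop", "Desktop", "Monitor", "Keyboard", "Computer Mice",
                  "Computer Cords", "Mouse and Keyboard Combo", "Computer Headphones",
                  "Active Headphones", "Speakers", "Printers", "Printer Ink",
                  "Computer Stands", "Computer Tablets", "External Hardrives",
                  "Smart Home Devices")
computer_terms <- c("Laptop", "Desktop")
accessory_terms <- setdiff(office_terms, computer_terms)
# Count how many of the given terms appear anywhere in a transaction row
count_matches <- function(row, terms) {
  sum(sapply(terms, function(t) any(grepl(t, row, fixed = TRUE))))
}
segment <- apply(df_product, 1, function(row) {
  if (count_matches(row, office_terms) >= 4) return("Office")
  if (count_matches(row, computer_terms) >= 1) return("Computer")
  if (count_matches(row, accessory_terms) >= 1) return("Casual Buyers")
  "Other"
})
barplot(sort(table(segment), decreasing = TRUE),
        main = "Transaction Segments", xlab = "Segment", ylab = "Count", col = "steelblue")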