We choose to make a market basket analysis program for our group project. Market basket analysis is a data mining technique used in retail and marketing to uncover associations between products customers frequently purchase. Various business applications can apply the insight gained from market basket analysis, including cross-selling, product placement optimization, recommendation systems, and targeted marketing campaigns. By leveraging market basket analysis, businesses can gain a competitive edge by understanding customer behavior, improving operational efficiency, and increasing customer satisfaction. It allows retailers and marketers to tailor their offerings to customer preferences, ultimately driving sales and fostering long-term customer loyalty.
To code this market basket analysis, we chose R as our software. The latter has implemented functions such as “apriori”, which are effective for mining the data set and discovering frequent item sets and association rules. It is also the software taught to us in several of our courses during our bachelor’s degree, so we already had the basics of this software.
Our code consists first of the cleaning of our datasets, of some graphs on the primary data, then of the market basket analysis. The results of the market basket analysis are displayed in the form of an interactive HTML widget. Finally, we create a function that takes a product as an input and returns a product that is frequently bought with the input product.
# Load packages
library(arules)
library(arulesViz)
library(tidyverse)
library(readxl)
library(lubridate)
library(arulesViz)
# Import dataset
online_retail <- read_excel("~/Desktop/Online Retail.xlsx")
# View the dataset
View(online_retail)
We use the “online retail” data set provided by UIC Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/online+retail). This is a transnational dataset which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
| Variable name | Description |
|---|---|
| InvoiceNo | Invoice number. It is a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘C’, it indicates a cancellation. |
| StockCode | Product (item) code. It is a 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name. |
| Quantity | The quantities of each product (item) per transaction. |
| InvoiceDate: | Invoice date and time. The day and time when a transaction was generated. |
| UnitPrice | Unit price. Product price per unit in sterling. |
| CustomerID | Customer number. It is a 5-digit integral number uniquely assigned to each customer. |
| Country | Country name. The name of the country where a customer resides. |
####################
# NA values
####################
# Check if there are NA values
sum(is.na(online_retail))
## [1] 136534
# Check which columns have NA values
na_counts <- data.frame(column = names(online_retail), na_count = colSums(is.na(online_retail)))
print(na_counts)
## column na_count
## InvoiceNo InvoiceNo 0
## StockCode StockCode 0
## Description Description 1454
## Quantity Quantity 0
## InvoiceDate InvoiceDate 0
## UnitPrice UnitPrice 0
## CustomerID CustomerID 135080
## Country Country 0
# The NA values are in the columns "Description" and "CustomerID". "CustomerID" is not relevant for the market basket analysis -> The NA values in this column could be kept. However, most NA values in the "CustomerID" column are for special transactions that are irrelevant for our analysis (e.g., debt adjustment) so, we remove them. Moreover, we will use "Description" for the analysis -> We remove NA values that are in this column. So, in conclusion, we remove all NA values
# Remove NA values
online_retail <- na.omit(online_retail)
####################
# Negative quantities
####################
# Negative quantities seem to be used for e.g., errors or damaged products, they are irrelevant
# There are no null quantities, so we don't have to consider them
online_retail <- online_retail %>% filter(Quantity >= 1)
####################
# Duplicates
####################
# Check if there are duplicates
num_duplicates <- sum(duplicated(online_retail))
print(paste("Number of duplicate rows: ", num_duplicates))
## [1] "Number of duplicate rows: 5192"
# Remove duplicate rows
online_retail <- online_retail[!duplicated(online_retail), ]
####################
# Cancelled transactions
####################
# Cancelled transactions start with "C" in the invoice number
# Remove cancelled transactions
online_retail <- online_retail[!grepl("^C", online_retail$InvoiceNo), ]
####################
# Irrelevant values
####################
# There are irrelevant stockcodes (e.g., "Post")
online_retail <- online_retail %>% filter(!(StockCode %in% c("POST", "PADS", "M", "m", "S", "DOT", "C2", "BANK CHARGES")))
####################
# Convert data types
####################
online_retail$Country <- as.factor(online_retail$Country)
online_retail$Description <- as.factor(online_retail$Description)
online_retail$InvoiceDate <- ymd_hms(online_retail$InvoiceDate) # Using the lubridate package
# Overview of the cleaned dataset
head(online_retail)
## # A tibble: 6 × 8
## InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate UnitP…⁴ Custo…⁵ Country
## <chr> <chr> <fct> <dbl> <dttm> <dbl> <dbl> <fct>
## 1 536365 85123A WHITE … 6 2010-12-01 08:26:00 2.55 17850 United…
## 2 536365 71053 WHITE … 6 2010-12-01 08:26:00 3.39 17850 United…
## 3 536365 84406B CREAM … 8 2010-12-01 08:26:00 2.75 17850 United…
## 4 536365 84029G KNITTE… 6 2010-12-01 08:26:00 3.39 17850 United…
## 5 536365 84029E RED WO… 6 2010-12-01 08:26:00 3.39 17850 United…
## 6 536365 22752 SET 7 … 2 2010-12-01 08:26:00 7.65 17850 United…
## # … with abbreviated variable names ¹StockCode, ²Description, ³Quantity,
## # ⁴UnitPrice, ⁵CustomerID
Creating a graph that shows the distribution of the number of items by basket:
####################
# Distribution of nb of items by basket
####################
online_retail %>%
# Group the data frame by the invoice number to summarize the "Quantity for each unique invoice number
group_by(InvoiceNo) %>%
# Calculate the mean of the "Quantity" column for each invoice and create a new column called "n_items" with the mean values.
summarize(n_items = mean(Quantity)) %>%
# Generate an histogram plot with the package ggplot2, "aes" argument "x=" specify that the x axis will be mapped to our new column
ggplot(aes(x = n_items)) +
#"fill =" define the colour of the bins and "bins =" the width of the bins
geom_histogram(fill = "black", bins = 100000) +
#select the limits of the x-axis
coord_cartesian(xlim = c(0, 100)) +
labs(title = "Distribution of the number of items by basket")
Best sellers:
####################
#Best sellers
####################
online_retail %>%
# Group data by StockCode and Description
group_by(StockCode, Description) %>%
# Calculate the count of each group and drop the grouping
summarize(count = n(), .groups = "drop") %>%
# Sort data by count in descending order
arrange(desc(count)) %>%
# Select the top 10 items with the highest count
slice_max(order_by = count, n = 10) %>%
# Create a bar plot
ggplot(aes(
# Reorder Description based on count for plotting
x = reorder(Description, count),
y = count
)) +
# Add bars with 'identity' stat and black fill color
geom_bar(stat = "identity", fill = "black") +
# Add title and remove x axis label
labs(title = "Best sellers", x = "") +
# Flip coordinates to get a horizontal bar plot
coord_flip()
####################
# Market basket analysis
####################
# Split the data
data_split <- split(online_retail$Description, online_retail$InvoiceNo)
# Convert the data into transactions using the arules package
trans <- as(data_split, "transactions")
# Inspect the transactions
inspect(head(trans, 2))
## items transactionID
## [1] {CREAM CUPID HEARTS COAT HANGER,
## GLASS STAR FROSTED T-LIGHT HOLDER,
## KNITTED UNION FLAG HOT WATER BOTTLE,
## RED WOOLLY HOTTIE WHITE HEART.,
## SET 7 BABUSHKA NESTING BOXES,
## WHITE HANGING HEART T-LIGHT HOLDER,
## WHITE METAL LANTERN} 536365
## [2] {HAND WARMER RED POLKA DOT,
## HAND WARMER UNION JACK} 536366
Support, confidence and lift are important concepts in market basket analysis and are defined as follow:
Support: % of the transactions that contain items 1 and 2
Confidence: Conditional probability that a customer that buy item1 will also by item2
Lift: % increase in probability that a customer will buy item2 if he or she buys item1 -100% (e.g., lift of 120% means an increase of 20% of probability)
In the following code snippet, we generate rules:
# Generate association rules
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 184
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3860 item(s), 18405 transaction(s)] done [0.14s].
## sorting and recoding items ... [621 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [269 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Sort the rules by confidence
rules <- sort(rules, by='confidence', decreasing = TRUE)
# Infer the top rules (top 20)
top_rules <- head(rules, 20)
# Convert rules into a data frame
rules_df <- as(rules, "data.frame")
# Get an overview of the data frame
head(rules_df)
## rules
## 141 {REGENCY TEA PLATE PINK,REGENCY TEA PLATE ROSES} => {REGENCY TEA PLATE GREEN}
## 140 {REGENCY TEA PLATE GREEN,REGENCY TEA PLATE PINK} => {REGENCY TEA PLATE ROSES}
## 143 {POPPY'S PLAYHOUSE BEDROOM,POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE KITCHEN}
## 264 {PINK REGENCY TEACUP AND SAUCER,REGENCY CAKESTAND 3 TIER,ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER}
## 1 {REGENCY TEA PLATE PINK} => {REGENCY TEA PLATE GREEN}
## 149 {PINK REGENCY TEACUP AND SAUCER,ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER}
## support confidence coverage lift count
## 141 0.01005162 0.9390863 0.01070361 64.01438 185
## 140 0.01005162 0.9158416 0.01097528 51.39044 185
## 143 0.01010595 0.9073171 0.01113828 48.26350 186
## 264 0.01298560 0.9018868 0.01439826 24.02204 239
## 1 0.01097528 0.9017857 0.01217061 61.47173 202
## 149 0.02118989 0.8944954 0.02368921 23.82516 390
# Generate a scatter plot
plot(rules, measure=c("support", "lift"), shading="confidence", jitter = 0)
# Generate a interactive plot showing the 20 top rules
plot(top_rules,
method = "graph",
engine = "htmlwidget")
The market basket analysis can be useful to make recommendations based on which products are usually bought together.
####################
# Create two separate columns for rules for each product in the data frame
####################
# Convert rules to character format.
rules_df$rules <- as.character(rules_df$rules)
# Remove the braces {}.
rules_df$rules <- gsub("[{}]", "", rules_df$rules)
# Split the rules column.
split_rules <- strsplit(rules_df$rules, " => ")
# Create new columns for left and right
rules_df$rule_left <- sapply(split_rules, `[`, 1)
rules_df$rule_right <- sapply(split_rules, `[`, 2)
####################
# Recommendation function
####################
# Create a recommendation function
recommend_product <- function(product, rules_df, data){
# Check if the product is in the dataframe
if(!(product %in% data$Description)){
return("The product is not in the data.")
}
# Look for rules where the product is on the left
product_rules <- subset(rules_df, rule_left %in% product)
# If no rules are found, return a message
if(nrow(product_rules)==0){
return("There are no products frequently bought with this product.")
}
# Sort by confidence and return the product(s) on the right of the top rule(s)
product_rules <- product_rules[order(-product_rules$confidence), ]
return(product_rules$rule_right[1])
}
# Test the function
recommend_product("REGENCY TEA PLATE PINK", rules_df, online_retail)
## [1] "REGENCY TEA PLATE GREEN"
recommend_product("KNITTED UNION FLAG HOT WATER BOTTLE", rules_df, online_retail)
## [1] "There are no products frequently bought with this product."
recommend_product("PEN", rules_df, online_retail)
## [1] "The product is not in the data."
The code presented above is an example of a simple market basket analysis. This analysis is then used to create a recommendation algorithm. A possible improvement would be to use advanced techniques involving Natural Language Processing (NLP) to create categories of items: Some rules presented above are not very useful since they involve items of the same category (e.g., same product in different colors).