Project description

We choose to make a market basket analysis program for our group project. Market basket analysis is a data mining technique used in retail and marketing to uncover associations between products customers frequently purchase. Various business applications can apply the insight gained from market basket analysis, including cross-selling, product placement optimization, recommendation systems, and targeted marketing campaigns. By leveraging market basket analysis, businesses can gain a competitive edge by understanding customer behavior, improving operational efficiency, and increasing customer satisfaction. It allows retailers and marketers to tailor their offerings to customer preferences, ultimately driving sales and fostering long-term customer loyalty.

To code this market basket analysis, we chose R as our software. The latter has implemented functions such as “apriori”, which are effective for mining the data set and discovering frequent item sets and association rules. It is also the software taught to us in several of our courses during our bachelor’s degree, so we already had the basics of this software.

Our code consists first of the cleaning of our datasets, of some graphs on the primary data, then of the market basket analysis. The results of the market basket analysis are displayed in the form of an interactive HTML widget. Finally, we create a function that takes a product as an input and returns a product that is frequently bought with the input product.

Load packages and import data set

# Load packages
library(arules)
library(arulesViz)
library(tidyverse)
library(readxl)
library(lubridate)
library(arulesViz)

# Import dataset
online_retail <- read_excel("~/Desktop/Online Retail.xlsx")

# View the dataset
View(online_retail)

Data description

We use the “online retail” data set provided by UIC Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/online+retail). This is a transnational dataset which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Variables description

Variable name Description
InvoiceNo Invoice number. It is a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘C’, it indicates a cancellation.
StockCode Product (item) code. It is a 5-digit integral number uniquely assigned to each distinct product.
Description Product (item) name.
Quantity The quantities of each product (item) per transaction.
InvoiceDate: Invoice date and time. The day and time when a transaction was generated.
UnitPrice Unit price. Product price per unit in sterling.
CustomerID Customer number. It is a 5-digit integral number uniquely assigned to each customer.
Country Country name. The name of the country where a customer resides.

Data cleaning and data preparation

####################
# NA values
####################
# Check if there are NA values
sum(is.na(online_retail))
## [1] 136534
# Check which columns have NA values
na_counts <- data.frame(column = names(online_retail), na_count = colSums(is.na(online_retail)))

print(na_counts)
##                  column na_count
## InvoiceNo     InvoiceNo        0
## StockCode     StockCode        0
## Description Description     1454
## Quantity       Quantity        0
## InvoiceDate InvoiceDate        0
## UnitPrice     UnitPrice        0
## CustomerID   CustomerID   135080
## Country         Country        0
# The NA values are in the columns "Description" and "CustomerID". "CustomerID" is not relevant for the market basket analysis -> The NA values in this column could be kept. However, most NA values in the "CustomerID" column are for special transactions that are irrelevant for our analysis (e.g., debt adjustment) so, we remove them. Moreover, we will use "Description" for the analysis -> We remove NA values that are in this column. So, in conclusion, we remove all NA values

# Remove NA values
online_retail <- na.omit(online_retail)

####################
# Negative quantities
####################

# Negative quantities seem to be used for e.g., errors or damaged products, they are irrelevant
# There are no null quantities, so we don't have to consider them
online_retail <- online_retail %>% filter(Quantity >= 1)

####################
# Duplicates
####################

# Check if there are duplicates
num_duplicates <- sum(duplicated(online_retail))
print(paste("Number of duplicate rows: ", num_duplicates))
## [1] "Number of duplicate rows:  5192"
# Remove duplicate rows
online_retail <- online_retail[!duplicated(online_retail), ]

####################
# Cancelled transactions
####################

# Cancelled transactions start with "C" in the invoice number
# Remove cancelled transactions
online_retail <- online_retail[!grepl("^C", online_retail$InvoiceNo), ]

####################
# Irrelevant values
####################

# There are irrelevant stockcodes (e.g., "Post")
online_retail <- online_retail %>% filter(!(StockCode %in% c("POST", "PADS", "M", "m", "S", "DOT", "C2", "BANK CHARGES")))

####################
# Convert data types
####################
online_retail$Country <- as.factor(online_retail$Country)
online_retail$Description <- as.factor(online_retail$Description)
online_retail$InvoiceDate <- ymd_hms(online_retail$InvoiceDate) # Using the lubridate package

# Overview of the cleaned dataset
head(online_retail)
## # A tibble: 6 × 8
##   InvoiceNo StockC…¹ Descr…² Quant…³ InvoiceDate         UnitP…⁴ Custo…⁵ Country
##   <chr>     <chr>    <fct>     <dbl> <dttm>                <dbl>   <dbl> <fct>  
## 1 536365    85123A   WHITE …       6 2010-12-01 08:26:00    2.55   17850 United…
## 2 536365    71053    WHITE …       6 2010-12-01 08:26:00    3.39   17850 United…
## 3 536365    84406B   CREAM …       8 2010-12-01 08:26:00    2.75   17850 United…
## 4 536365    84029G   KNITTE…       6 2010-12-01 08:26:00    3.39   17850 United…
## 5 536365    84029E   RED WO…       6 2010-12-01 08:26:00    3.39   17850 United…
## 6 536365    22752    SET 7 …       2 2010-12-01 08:26:00    7.65   17850 United…
## # … with abbreviated variable names ¹​StockCode, ²​Description, ³​Quantity,
## #   ⁴​UnitPrice, ⁵​CustomerID

Exploring the data

Creating a graph that shows the distribution of the number of items by basket:

####################
# Distribution of nb of items by basket
####################

online_retail %>% 
# Group the data frame by the invoice number to summarize the "Quantity for each unique invoice number
  group_by(InvoiceNo) %>% 
# Calculate the mean of the "Quantity" column for each invoice and create a new column called "n_items" with the mean values.
  summarize(n_items = mean(Quantity)) %>%
# Generate an histogram plot with the package ggplot2, "aes" argument "x=" specify that the x axis will be mapped to our new column   
  ggplot(aes(x = n_items)) +
#"fill =" define the colour of the bins and "bins =" the width of the bins
  geom_histogram(fill = "black", bins = 100000) + 
#select the limits of the x-axis  
  coord_cartesian(xlim = c(0, 100)) +
  labs(title = "Distribution of the number of items by basket")

Best sellers:

####################
#Best sellers
####################

online_retail %>%
  # Group data by StockCode and Description
  group_by(StockCode, Description) %>% 
  # Calculate the count of each group and drop the grouping
  summarize(count = n(), .groups = "drop") %>% 
  # Sort data by count in descending order
  arrange(desc(count)) %>% 
  # Select the top 10 items with the highest count
  slice_max(order_by = count, n = 10) %>% 
  # Create a bar plot
  ggplot(aes(
    # Reorder Description based on count for plotting
    x = reorder(Description, count),
    y = count
  )) +
  # Add bars with 'identity' stat and black fill color
  geom_bar(stat = "identity", fill = "black") +
  # Add title and remove x axis label
  labs(title = "Best sellers", x = "") +
  # Flip coordinates to get a horizontal bar plot
  coord_flip()

Market basket analysis

Transforming the data into transactions

####################
# Market basket analysis
####################

# Split the data
data_split <- split(online_retail$Description, online_retail$InvoiceNo)

# Convert the data into transactions using the arules package
trans <- as(data_split, "transactions")

# Inspect the transactions
inspect(head(trans, 2))
##     items                                  transactionID
## [1] {CREAM CUPID HEARTS COAT HANGER,                    
##      GLASS STAR FROSTED T-LIGHT HOLDER,                 
##      KNITTED UNION FLAG HOT WATER BOTTLE,               
##      RED WOOLLY HOTTIE WHITE HEART.,                    
##      SET 7 BABUSHKA NESTING BOXES,                      
##      WHITE HANGING HEART T-LIGHT HOLDER,                
##      WHITE METAL LANTERN}                         536365
## [2] {HAND WARMER RED POLKA DOT,                         
##      HAND WARMER UNION JACK}                      536366

Association rules

Support, confidence and lift are important concepts in market basket analysis and are defined as follow:

Support: % of the transactions that contain items 1 and 2

Confidence: Conditional probability that a customer that buy item1 will also by item2

Lift: % increase in probability that a customer will buy item2 if he or she buys item1 -100% (e.g., lift of 120% means an increase of 20% of probability)

In the following code snippet, we generate rules:

# Generate association rules
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 184 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3860 item(s), 18405 transaction(s)] done [0.14s].
## sorting and recoding items ... [621 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [269 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Sort the rules by confidence
rules <- sort(rules, by='confidence', decreasing = TRUE)

# Infer the top rules (top 20)
top_rules <- head(rules, 20)

# Convert rules into a data frame
rules_df <- as(rules, "data.frame")

# Get an overview of the data frame
head(rules_df)
##                                                                                                                              rules
## 141                                                  {REGENCY TEA PLATE PINK,REGENCY TEA PLATE ROSES} => {REGENCY TEA PLATE GREEN}
## 140                                                  {REGENCY TEA PLATE GREEN,REGENCY TEA PLATE PINK} => {REGENCY TEA PLATE ROSES}
## 143                                        {POPPY'S PLAYHOUSE BEDROOM,POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE KITCHEN}
## 264 {PINK REGENCY TEACUP AND SAUCER,REGENCY CAKESTAND 3 TIER,ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER}
## 1                                                                            {REGENCY TEA PLATE PINK} => {REGENCY TEA PLATE GREEN}
## 149                          {PINK REGENCY TEACUP AND SAUCER,ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER}
##        support confidence   coverage     lift count
## 141 0.01005162  0.9390863 0.01070361 64.01438   185
## 140 0.01005162  0.9158416 0.01097528 51.39044   185
## 143 0.01010595  0.9073171 0.01113828 48.26350   186
## 264 0.01298560  0.9018868 0.01439826 24.02204   239
## 1   0.01097528  0.9017857 0.01217061 61.47173   202
## 149 0.02118989  0.8944954 0.02368921 23.82516   390

Visualization

# Generate a scatter plot
plot(rules, measure=c("support", "lift"), shading="confidence", jitter = 0)

# Generate a interactive plot showing the 20 top rules
plot(top_rules,
     method = "graph",
     engine = "htmlwidget")

Recommendations

The market basket analysis can be useful to make recommendations based on which products are usually bought together.

####################
# Create two separate columns for rules for each product in the data frame
####################

# Convert rules to character format.
rules_df$rules <- as.character(rules_df$rules)

# Remove the braces {}.
rules_df$rules <- gsub("[{}]", "", rules_df$rules)

# Split the rules column.
split_rules <- strsplit(rules_df$rules, " => ")

# Create new columns for left and right
rules_df$rule_left <- sapply(split_rules, `[`, 1)
rules_df$rule_right <- sapply(split_rules, `[`, 2)

####################
# Recommendation function
####################

# Create a recommendation function
recommend_product <- function(product, rules_df, data){
  
  # Check if the product is in the dataframe
  if(!(product %in% data$Description)){
    return("The product is not in the data.")
  }
  
  # Look for rules where the product is on the left
  product_rules <- subset(rules_df, rule_left %in% product)

  # If no rules are found, return a message
  if(nrow(product_rules)==0){
    return("There are no products frequently bought with this product.")
  }
  
  # Sort by confidence and return the product(s) on the right of the top rule(s)
  product_rules <- product_rules[order(-product_rules$confidence), ]
  
  return(product_rules$rule_right[1])
}

# Test the function
recommend_product("REGENCY TEA PLATE PINK", rules_df, online_retail)
## [1] "REGENCY TEA PLATE GREEN"
recommend_product("KNITTED UNION FLAG HOT WATER BOTTLE", rules_df, online_retail)
## [1] "There are no products frequently bought with this product."
recommend_product("PEN", rules_df, online_retail)
## [1] "The product is not in the data."

Conclusion

The code presented above is an example of a simple market basket analysis. This analysis is then used to create a recommendation algorithm. A possible improvement would be to use advanced techniques involving Natural Language Processing (NLP) to create categories of items: Some rules presented above are not very useful since they involve items of the same category (e.g., same product in different colors).