Association Rules Analysis of Online Retail Dataset

Introduction

This project aims to uncover hidden patterns in an online retail dataset using association rule mining. The goal is to identify relationships between products that are frequently purchased together and generate actionable insights for business decisions, such as product bundling or targeted marketing strategies.

The dataset consists of transaction data from an online retailer and includes columns such as InvoiceNo, StockCode, Description, Quantity, UnitPrice, CustomerID, and Country. This analysis focuses on extracting association rules and insights from the purchase data.

Theoretical Background

Association Rule Mining is a fundamental technique in data mining used to discover interesting relationships (associations) between variables in large datasets. The most commonly used algorithm for mining association rules is the Apriori algorithm, which finds frequent itemsets and generates association rules based on the frequency of these itemsets. Each rule consists of two parts: - Antecedent (LHS): The condition or the item that is present. - Consequent (RHS): The item that is predicted to be purchased when the antecedent is purchased.

The strength of these rules is determined by: - Support: The proportion of transactions that contain both the antecedent and consequent. - Confidence: The likelihood that the consequent is bought when the antecedent is bought. - Lift: The ratio of the observed support to the expected support if the antecedent and consequent were independent.

Loading and Preprocessing the Data

First, we load the data and perform necessary preprocessing to clean it.

# Load required libraries
library(dplyr)
library(ggplot2)
library(tidyr)

# Load the dataset
data <- read.csv("~/Desktop/Data science and Business Analytics/Unsupervised Learning/Online_Retail.csv", sep=',', nrows = 1000)

# View the first few rows of the dataset
head(data)

##   InvoiceNo StockCode                         Description Quantity  InvoiceDate
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26
## 2    536365     71053                 WHITE METAL LANTERN        6 12/1/10 8:26
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26
##   UnitPrice CustomerID        Country
## 1      2.55      17850 United Kingdom
## 2      3.39      17850 United Kingdom
## 3      2.75      17850 United Kingdom
## 4      3.39      17850 United Kingdom
## 5      3.39      17850 United Kingdom
## 6      7.65      17850 United Kingdom

Data Preprocessing Explanation

Before performing association rule mining, it is crucial to clean the dataset. This includes removing transactions with negative quantities (i.e., returns) and eliminating duplicate invoices. These steps ensure that only valid transactions are considered in the analysis.

Data Cleaning

Next, we filter the dataset to include only transactions with positive quantities (i.e., no returns).

# Filter out rows with non-positive quantities
data_clean <- data %>% filter(Quantity > 0)

# Check for missing values and handle them
sum(is.na(data_clean))

## [1] 1

# Check for duplicated invoices
data_clean <- data_clean %>% distinct()

# View cleaned data
head(data_clean)

##   InvoiceNo StockCode                         Description Quantity  InvoiceDate
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26
## 2    536365     71053                 WHITE METAL LANTERN        6 12/1/10 8:26
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26
##   UnitPrice CustomerID        Country
## 1      2.55      17850 United Kingdom
## 2      3.39      17850 United Kingdom
## 3      2.75      17850 United Kingdom
## 4      3.39      17850 United Kingdom
## 5      3.39      17850 United Kingdom
## 6      7.65      17850 United Kingdom

Creating Itemsets

We will create itemsets where each row represents an invoice, and each column represents a product purchased in that invoice. We will represent products with a binary value (1 if purchased, 0 if not).

# Create a basket of items purchased per invoice
basket <- data_clean %>%
  group_by(InvoiceNo, Description) %>%
  summarise(Quantity = sum(Quantity)) %>%
  spread(key = Description, value = Quantity, fill = 0)

## `summarise()` has grouped output by 'InvoiceNo'. You can override using the
## `.groups` argument.

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## ℹ The deprecated feature was likely used in the tidyr package.
##   Please report the issue at <https://github.com/tidyverse/tidyr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Convert to binary (1 if product is bought, 0 if not)
basket_binary <- basket %>%
  mutate_all(~ifelse(. > 0, 1, 0)) %>%
  select(-InvoiceNo)

## `mutate_all()` ignored the following grouping variables:
## Adding missing grouping variables: `InvoiceNo`
## • Column `InvoiceNo`
## ℹ Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.

# View the first few rows of the binary basket
head(basket_binary)

## # A tibble: 6 × 589
## # Groups:   InvoiceNo [6]
##   InvoiceNo    V1 ` SET 2 TEA TOWELS I LOVE LONDON ` `10 COLOUR SPACEBOY PEN`
##   <chr>     <dbl>                              <dbl>                    <dbl>
## 1 536365        0                                  0                        0
## 2 536366        0                                  0                        0
## 3 536367        0                                  0                        0
## 4 536368        0                                  0                        0
## 5 536369        0                                  0                        0
## 6 536370        0                                  1                        0
## # ℹ 585 more variables: `12 DAISY PEGS IN WOOD BOX` <dbl>,
## #   `12 MESSAGE CARDS WITH ENVELOPES` <dbl>,
## #   `12 PENCILS TALL TUBE SKULLS` <dbl>,
## #   `3 PIECE SPACEBOY COOKIE CUTTER SET` <dbl>,
## #   `3 STRIPEY MICE FELTCRAFT` <dbl>, `3 TIER CAKE TIN GREEN AND CREAM` <dbl>,
## #   `3 TIER CAKE TIN RED AND CREAM` <dbl>, `36 FOIL HEART CAKE CASES` <dbl>,
## #   `36 FOIL STAR CAKE CASES ` <dbl>, `36 PENCILS TUBE RED RETROSPOT` <dbl>, …

Explanation of Itemsets

In this step, we create binary matrices to represent which products were bought in each transaction. A 1 indicates that the product was purchased, while a 0 indicates that it was not. This is crucial for running association rule mining algorithms, as it allows the identification of frequent itemsets (combinations of products) across all transactions.

Visualizing the Data

Distribution of Product Quantities

Let’s first visualize the distribution of product quantities to understand which products are bought in bulk.

# Plotting the distribution of product quantities
ggplot(data_clean, aes(x=Quantity)) +
  geom_histogram(bins=30, fill='skyblue', color='black') +
  theme_minimal() +
  labs(title="Distribution of Product Quantities",
       x="Quantity",
       y="Frequency")

Interpretation of the Histogram

The histogram shows the frequency distribution of product quantities purchased in the dataset. Most of the purchases are for low quantities, and there are a few high-quantity purchases scattered across the data.

The bulk of purchases (over 600 occurrences) are for small quantities, likely indicating that customers are purchasing products individually or in small amounts. There are outliers where certain products are bought in higher quantities, likely due to bulk buying or promotional sales, where customers might have purchased large quantities at once. Business Insight: This distribution helps businesses understand consumer purchasing behavior. Retailers can use this information to manage stock levels effectively, ensuring that products with high purchase volumes are adequately stocked while minimizing overstocking of less popular items.

Top 10 Most Frequently Purchased Products

Now, let’s visualize the top 10 most frequently purchased products.

# Top 10 most frequently purchased products
top_products <- data_clean %>%
  group_by(Description) %>%
  summarise(Total_Sales = sum(Quantity)) %>%
  arrange(desc(Total_Sales)) %>%
  top_n(10)

## Selecting by Total_Sales

# Plotting the top 10 products
ggplot(top_products, aes(x=reorder(Description, Total_Sales), y=Total_Sales)) +
  geom_bar(stat="identity", fill="lightcoral") +
  theme_minimal() +
  coord_flip() +
  labs(title="Top 10 Most Frequently Purchased Products",
       x="Product",
       y="Total Sales")

Interpretation of Bar Chart

This bar chart ranks the top 10 most frequently purchased products by their total sales volume (quantity).

The products like NAMASTE SWAGAT INCENSE and BLACK RECORD COVER FRAME appear to dominate the sales, with NAMASTE SWAGAT INCENSE having the highest total sales, significantly higher than the other products. Products like DISCO BALL CHRISTMAS DECORATION and JUMBO BAG RED RETROSPOT also make it into the top 10, showcasing a variety of home decor and seasonal items.

These are the most in-demand products. For a business, this is vital information for inventory management, as these products should be stocked more frequently to meet customer demand. Additionally, these products could be included in promotional bundles or targeted marketing campaigns to increase sales further.

Generating Association Rules

We will now generate association rules based on the frequency of product pairs using ggplot2 for visualizing the pairs.

# Generate item pairs
item_pairs <- data_clean %>%
  select(InvoiceNo, Description) %>%
  distinct() %>%
  unite("pair", Description, InvoiceNo, sep = "-") %>%
  count(pair) %>%
  arrange(desc(n))

# Top 10 most frequent item pairs
top_pairs <- item_pairs %>%
  head(10)

# Plotting the top 10 pairs
ggplot(top_pairs, aes(x=reorder(pair, n), y=n)) +
  geom_bar(stat="identity", fill="lightblue") +
  theme_minimal() +
  coord_flip() +
  labs(title="Top 10 Most Frequent Item Pairs",
       x="Item Pair",
       y="Frequency")

Interpretation of Pair Chart

This chart shows the most common item pairs purchased together. For example, a product like 12 PENCILS TALL TUBE SKULLS is often bought together with other products (indicated by the product code).

The top pairs are composed mostly of items like 12 PENCILS TALL TUBE SKULLS and 12 MESSAGE CARDS WITH ENVELOPES, both of which could be related to a particular customer segment (such as arts and crafts enthusiasts or gifts). Multiple entries for products like SET 2 TEA TOWELS I LOVE LONDON suggest that these items are commonly bundled together for customers, possibly as part of a gift set or theme. Business Insight: These frequent item pairs can be crucial for product bundling and cross-selling strategies. For example, promoting the sale of TEA TOWELS alongside other related items like mugs or plates can increase overall sales. Additionally, businesses can create product recommendations based on these common pairings.

Insights and Conclusion

Insights

Insights High-Correlation Products: Certain products, such as 12 PENCILS TALL TUBE SKULLS and 12 MESSAGE CARDS WITH ENVELOPES, are frequently purchased together, indicating a strong correlation between them. This suggests that these products belong to the same category or share a similar customer base.

Business Recommendation: These products are prime candidates for promotional bundling. By packaging these items together, you can create discounted bundles that increase the average transaction value. Additionally, bundling related products (e.g., arts and crafts or gift-related items) offers convenience for customers, which may increase the likelihood of purchase. Offering a “Buy 2, Save More” type of promotion can help capitalize on these strong associations.

Product Preferences: Items with high total sales quantities, such as NAMASTE SWAGAT INCENSE and DISCO BALL CHRISTMAS DECORATION, reflect significant customer demand and are central to the product mix. These products not only contribute to high sales volume but also help in identifying broader customer preferences based on seasonal trends, lifestyle, and hobbies.

Business Recommendation: Identifying popular products can guide inventory management. Ensuring a consistent stock of high-demand items prevents stockouts during peak purchasing periods (such as holidays or specific seasons). These products should also be highlighted in marketing campaigns to maintain their visibility, such as including them in email newsletters or targeted advertisements.

Targeted Marketing: The insights from association analysis suggest that certain products are frequently purchased by the same customers, pointing to an opportunity for personalized marketing. By leveraging this data, businesses can create recommendation engines that suggest products based on a customer’s past purchases, enhancing the likelihood of repeat business.

Business Recommendation: Use this association data to develop personalized email campaigns, where products frequently bought together are suggested to customers who have already purchased one item. Cross-selling or up-selling these recommended products can lead to a higher conversion rate. For example, if a customer buys a home decor item, they can be automatically recommended complementary products like decorative lights or wall art that have been frequently bought together by other customers.

Further Actionable Insights:

Product Availability & Visibility: Ensure that the most popular and frequently purchased items are given prominence on your website or in your physical store. Position them in strategic locations (e.g., homepage, email banners, or front-store shelves) for easy access, leading to more visibility and, consequently, higher sales.

Inventory Optimization: High-frequency products should be forecasted for restocking well in advance. It may be beneficial to increase the stock levels of items that are seasonally popular or in high demand, as this will help avoid lost sales due to stockouts. Furthermore, items with frequent low sales might be candidates for discounts or clearance sales to optimize inventory turnover.

Leveraging Data for Dynamic Pricing: Based on frequent co-purchases, introduce dynamic pricing models. For instance, products that are often bought together could have a discounted combined price. Using the data, the business can identify bundles that make the most financial sense, especially during peak seasons.

Seasonality & Trend Analysis: Some products, such as Christmas-themed items, may show clear seasonality. By analyzing the sales patterns over time, you can ensure better planning for high-demand seasons, such as Christmas or the back-to-school period, by adjusting marketing and inventory strategies well ahead of time.

Detailed Conclusion

The analysis of the online retail dataset using association rule mining and product pairing analysis has provided valuable insights. The most frequently purchased products can be identified, and the relationships between products that are often bought together have been uncovered. These findings can assist businesses in making informed decisions regarding inventory management, product bundling, and targeted marketing campaigns. By focusing on high-correlation products and popular items, businesses can enhance customer experience and optimize sales strategies.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499).
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). Elsevier.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/