Set working directory

Load essential libraries

library(readxl)
library(dplyr) 
library(tidyr)
library(ggplot2)  
library(arules)  
library(patchwork)

Introduction

The realm of consumer analytics is where the retail pulse is decoded via data. In this investigation, i looked into a strong dataset obtained from a UK-based online retail company, which includes over 525,461 observations across eight variables.

Each item in my dataset represents a distinct interaction between the retailer and its consumers, resulting in a complex tapestry of transactional information, product features, and customer demographics. My goal with advanced data analytics and visualization methodologies is to extract actionable insights that illuminate strategic paths for retail firms.

online_store <- read_excel("online_retail_II.xlsx")

View the structure of the dataset

str(online_store)
## tibble [525,461 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Invoice    : chr [1:525461] "489434" "489434" "489434" "489434" ...
##  $ StockCode  : chr [1:525461] "85048" "79323P" "79323W" "22041" ...
##  $ Description: chr [1:525461] "15CM CHRISTMAS GLASS BALL 20 LIGHTS" "PINK CHERRY LIGHTS" "WHITE CHERRY LIGHTS" "RECORD FRAME 7\" SINGLE SIZE" ...
##  $ Quantity   : num [1:525461] 12 12 12 48 24 24 24 10 12 12 ...
##  $ InvoiceDate: POSIXct[1:525461], format: "2009-12-01 07:45:00" "2009-12-01 07:45:00" ...
##  $ Price      : num [1:525461] 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
##  $ Customer ID: num [1:525461] 13085 13085 13085 13085 13085 ...
##  $ Country    : chr [1:525461] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

This is a summary of my dataset with 525,461 records and eight variables: Invoice: A character vector that represents the invoice number connected with each transaction. StockCode: A character vector containing the stock code of the item purchased. Description: A character vector that describes the item you purchased. Quantity: A numeric vector that represents the quantity of each item purchased in each transaction. InvoiceDate: A date-time vector that represents the date and time of each transaction. Price: A number vector indicating the price of each item. client ID is a number vector that identifies the client linked with each transaction. Country: A character vector representing the country in which the transaction took place.

Taking out the NA

online_store <- na.omit(online_store)

After omiiting the ‘NA’ the dataset remained 417534 observations of 8 variables.

Exploratory Data Analysis - Summuary Statistics

summary(online_store)
##    Invoice           StockCode         Description           Quantity       
##  Length:417534      Length:417534      Length:417534      Min.   :-9360.00  
##  Class :character   Class :character   Class :character   1st Qu.:    2.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    4.00  
##                                                           Mean   :   12.76  
##                                                           3rd Qu.:   12.00  
##                                                           Max.   :19152.00  
##   InvoiceDate                         Price            Customer ID   
##  Min.   :2009-12-01 07:45:00.00   Min.   :    0.000   Min.   :12346  
##  1st Qu.:2010-03-26 11:26:00.00   1st Qu.:    1.250   1st Qu.:13983  
##  Median :2010-07-08 19:12:00.00   Median :    1.950   Median :15311  
##  Mean   :2010-07-01 01:12:19.78   Mean   :    3.888   Mean   :15361  
##  3rd Qu.:2010-10-14 14:08:00.00   3rd Qu.:    3.750   3rd Qu.:16799  
##  Max.   :2010-12-09 20:01:00.00   Max.   :25111.090   Max.   :18287  
##    Country         
##  Length:417534     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Distribution of variables - Product Quantity

ggplot(online_store, aes(x = Quantity)) +
  geom_histogram() +
  labs(title = "Distribution of Quantity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Removing the non-finite values from the Quantity column

online_store_clean <- online_store[is.finite(online_store$Quantity), ]

ggplot(online_store_clean, aes(x = Quantity)) +
  geom_histogram(bins = 30) +
  xlim(0, 100) +
  ylim(0, 50000) +  # Adjust y-axis limits as needed
  labs(title = "Distribution of Quantity")

The histogram depicts the distribution of item amounts purchased per each transaction in the online store dataset. The y-axis indicates the number of transactions falling into each quantity bin, while the x-axis reflects the number of products purchased. Notably, there is a big rise in transaction counts at the quantity value of 25, with numerous bars, followed by a lesser peak around 50, but only one bar. This finding indicates that quantities around 25 are more common in transactions than other values.

Creating a new dataset for canceled orders and non-canceled orders

canceled_orders <- online_store %>%
  filter(substr(Invoice, 1, 1) == "C")

non_canceled_orders <- online_store %>%
  filter(substr(Invoice, 1, 1) != "C")

seperated cancelled order from non-cancelled orders.

Descriptive Analysis

summary(non_canceled_orders$Quantity)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     2.00     5.00    13.59    12.00 19152.00

The data shows the majority of orders involve the purchase of a modest number of things, with half of transactions involving the purchase of five or fewer items. However, some shipments are significantly larger, with one including an incredible 19,152 pieces. This implies that, while many buyers buy only a few items, others make significant purchases.

Time series Analysis

sales_time_series <- non_canceled_orders %>%
  mutate(Date = as.Date(InvoiceDate)) %>%
  group_by(Date) %>%
  summarise(TotalSales = sum(Quantity * Price))

Converting to time series object

sales_ts <- ts(sales_time_series$TotalSales, frequency = 7)

sales_decomp <- decompose(sales_ts)

plot(sales_decomp)

I extract the date and total sales from the dataset of non-canceled orders, then group the data by date to determine total sales for each day. It then turns the total sales into a time series object with a frequency of 7 (which represents daily data). This then decomposes the time series into seasonal, trend, and random components using a decomposition algorithm. Finally, it plots the decomposed time series to see the components.

Calculating RFM (Recency, Frequency, Monetary) metrics

  • Perform customer segmentation using clustering techniques (e.g., K-means clustering)

The analysis here focuses on understanding customer behaviour using RFM (Recency, Frequency, and Monetary) indicators. These analytics reveal how recently a customer made a purchase, how frequently they shop, and how much they spend.

Using this information, the study divides clients into groups using a method known as k-means clustering. This technique aids in the identification of discrete groups of clients with similar purchasing behaviours. For example, it might group together customers who make regular purchases and spend a lot, whereas another cluster might consist of customers who shop less frequently but spend more when they do.

The scatter plot depicts different customer groupings. Each point represents a customer, with Recency (the time since their last purchase) plotted against Frequency. The colour of each point shows the cluster to which the consumer belongs, providing a clear picture of how different customer groups act.

Associating Rule Mining

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 8153 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23621 item(s), 407695 transaction(s)] done [0.27s].
## sorting and recoding items ... [0 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

The Apriori technique is used in this research to uncover association rules inside a transaction dataset. These association rules highlight trends in client purchasing behavior, revealing which things are commonly purchased together.

In this case, the algorithm was set with a minimum support threshold of 0.02 and a minimum confidence level of 0.6. These thresholds specify the minimal degree of support (the proportion of transactions containing a specific set of products) and confidence (the likelihood that if item A is purchased, item B will also be purchased) needed for a rule to be considered significant.

After executing the algorithm, no association rules with the supplied criteria were found. This implies that either the thresholds were too strict or that the dataset lacks substantial correlations between items that fulfill the specified requirements.

Performing K-means clustering on RFM data

kmeans_model <- kmeans(rfm_data[, c("Recency", "Frequency", "Monetary")], centers = 4)

print(kmeans_model$centers)

print(kmeans_model$cluster)

K-means clustering technique was used on RFM (Recency, Frequency, Monetary) data to segment clients based on their purchase habits. The system discovered four separate groups based on various combinations of recency, frequency, and monetary worth. After evaluating the cluster centres, we notice that: - Customers in Cluster 1 have a low recency, frequency, and monetary value. - Cluster 2 customers have a high level of recency, low frequency, and monetary value. - Customers in Cluster 3 have modest recency, frequency, and monetary value. - Customers in Cluster 4 have a high level of recency, frequency, and monetary value. These clusters provide useful information on customer segments, allowing businesses to modify their marketing and retention tactics accordingly.

RFM dimensions with labeled clusters

The scatter figure depicts the K-means grouping of customers according to their RFM (Recency, Frequency, Monetary) characteristics. We can see that there are many inactive customers on the bottom end of the Recency axis (from 0 to 300), indicating that they have not made any recent transactions. On the other hand, there are fewer high-value and regular customers concentrated at the far end of the Recency axis, indicating that they have recently purchased. This distribution demonstrates the different levels of customer engagement and provides opportunities for focused marketing and retention efforts.

Appending cluster assignments to the original dataset

  • Analyzing characteristics of each cluster
rfm_data$Cluster <- kmeans_model$cluster


cluster_summary <- rfm_data %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Recency = mean(Recency),
    Avg_Frequency = mean(Frequency),
    Avg_Monetary = mean(Monetary)
  )

print(cluster_summary)
## # A tibble: 3 × 4
##   Cluster Avg_Recency Avg_Frequency Avg_Monetary
##     <int>       <dbl>         <dbl>        <dbl>
## 1       1        329.         41.2        32130.
## 2       2        368.        114.        215544.
## 3       3        132.          3.86        1415.

The cluster summary describes the characteristics of each cluster based on the K-means clustering algorithm.

  • Cluster 1 has an average recency of 129 days, a moderate frequency of 3.65 purchases, and a relatively low monetary value of $1239.09.
  • Cluster 2 has a substantially larger recency of 368 days, indicating less recent purchases, but with a very high frequency of 89.67 and a huge monetary value of 264,709.20.
  • Cluster 3 has a moderate recency of 301 days, a lower frequency of 23.21 purchases, and a moderate monetary value of 16,367.86.
  • Cluster 4 has a similar recency to Cluster 3 of 335 days, with a high frequency of 79.08 purchases and a significant monetary value of 74,805.56.

These findings can help adjust marketing tactics and customer interaction initiatives to each cluster’s unique needs and behaviors.

Cluster characteristics with different colors for each variable

The graphic depicts the features of several clusters according to average recency, frequency, and monetary value. Each cluster is represented by a bar chart, with different colors denoting each variable.

  • Cluster 1 has the highest average recency, along with a significant average frequency and monetary worth.
  • Cluster 2 has a lower average recency than Cluster 1, but maintains a comparably high frequency and monetary value.
  • Cluster 3 has a recency value of zero, indicating recent activity, but a negative average monetary worth, implying potential transaction difficulties.
  • Cluster 4 has the lowest average recency, frequency, and monetary value among the clusters, showing that its customers are less engaged and of lesser worth.