library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(arules)
library(patchwork)
The realm of consumer analytics is where the retail pulse is decoded via data. In this investigation, i looked into a strong dataset obtained from a UK-based online retail company, which includes over 525,461 observations across eight variables.
Each item in my dataset represents a distinct interaction between the retailer and its consumers, resulting in a complex tapestry of transactional information, product features, and customer demographics. My goal with advanced data analytics and visualization methodologies is to extract actionable insights that illuminate strategic paths for retail firms.
online_store <- read_excel("online_retail_II.xlsx")
str(online_store)
## tibble [525,461 × 8] (S3: tbl_df/tbl/data.frame)
## $ Invoice : chr [1:525461] "489434" "489434" "489434" "489434" ...
## $ StockCode : chr [1:525461] "85048" "79323P" "79323W" "22041" ...
## $ Description: chr [1:525461] "15CM CHRISTMAS GLASS BALL 20 LIGHTS" "PINK CHERRY LIGHTS" "WHITE CHERRY LIGHTS" "RECORD FRAME 7\" SINGLE SIZE" ...
## $ Quantity : num [1:525461] 12 12 12 48 24 24 24 10 12 12 ...
## $ InvoiceDate: POSIXct[1:525461], format: "2009-12-01 07:45:00" "2009-12-01 07:45:00" ...
## $ Price : num [1:525461] 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
## $ Customer ID: num [1:525461] 13085 13085 13085 13085 13085 ...
## $ Country : chr [1:525461] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
This is a summary of my dataset with 525,461 records and eight variables: Invoice: A character vector that represents the invoice number connected with each transaction. StockCode: A character vector containing the stock code of the item purchased. Description: A character vector that describes the item you purchased. Quantity: A numeric vector that represents the quantity of each item purchased in each transaction. InvoiceDate: A date-time vector that represents the date and time of each transaction. Price: A number vector indicating the price of each item. client ID is a number vector that identifies the client linked with each transaction. Country: A character vector representing the country in which the transaction took place.
online_store <- na.omit(online_store)
After omiiting the ‘NA’ the dataset remained 417534 observations of 8 variables.
summary(online_store)
## Invoice StockCode Description Quantity
## Length:417534 Length:417534 Length:417534 Min. :-9360.00
## Class :character Class :character Class :character 1st Qu.: 2.00
## Mode :character Mode :character Mode :character Median : 4.00
## Mean : 12.76
## 3rd Qu.: 12.00
## Max. :19152.00
## InvoiceDate Price Customer ID
## Min. :2009-12-01 07:45:00.00 Min. : 0.000 Min. :12346
## 1st Qu.:2010-03-26 11:26:00.00 1st Qu.: 1.250 1st Qu.:13983
## Median :2010-07-08 19:12:00.00 Median : 1.950 Median :15311
## Mean :2010-07-01 01:12:19.78 Mean : 3.888 Mean :15361
## 3rd Qu.:2010-10-14 14:08:00.00 3rd Qu.: 3.750 3rd Qu.:16799
## Max. :2010-12-09 20:01:00.00 Max. :25111.090 Max. :18287
## Country
## Length:417534
## Class :character
## Mode :character
##
##
##
ggplot(online_store, aes(x = Quantity)) +
geom_histogram() +
labs(title = "Distribution of Quantity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
online_store_clean <- online_store[is.finite(online_store$Quantity), ]
ggplot(online_store_clean, aes(x = Quantity)) +
geom_histogram(bins = 30) +
xlim(0, 100) +
ylim(0, 50000) + # Adjust y-axis limits as needed
labs(title = "Distribution of Quantity")
The histogram depicts the distribution of item amounts purchased per each transaction in the online store dataset. The y-axis indicates the number of transactions falling into each quantity bin, while the x-axis reflects the number of products purchased. Notably, there is a big rise in transaction counts at the quantity value of 25, with numerous bars, followed by a lesser peak around 50, but only one bar. This finding indicates that quantities around 25 are more common in transactions than other values.
canceled_orders <- online_store %>%
filter(substr(Invoice, 1, 1) == "C")
non_canceled_orders <- online_store %>%
filter(substr(Invoice, 1, 1) != "C")
seperated cancelled order from non-cancelled orders.
summary(non_canceled_orders$Quantity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 5.00 13.59 12.00 19152.00
The data shows the majority of orders involve the purchase of a modest number of things, with half of transactions involving the purchase of five or fewer items. However, some shipments are significantly larger, with one including an incredible 19,152 pieces. This implies that, while many buyers buy only a few items, others make significant purchases.
sales_time_series <- non_canceled_orders %>%
mutate(Date = as.Date(InvoiceDate)) %>%
group_by(Date) %>%
summarise(TotalSales = sum(Quantity * Price))
sales_ts <- ts(sales_time_series$TotalSales, frequency = 7)
sales_decomp <- decompose(sales_ts)
plot(sales_decomp)
I extract the date and total sales from the dataset of non-canceled orders, then group the data by date to determine total sales for each day. It then turns the total sales into a time series object with a frequency of 7 (which represents daily data). This then decomposes the time series into seasonal, trend, and random components using a decomposition algorithm. Finally, it plots the decomposed time series to see the components.
The analysis here focuses on understanding customer behaviour using RFM (Recency, Frequency, and Monetary) indicators. These analytics reveal how recently a customer made a purchase, how frequently they shop, and how much they spend.
Using this information, the study divides clients into groups using a method known as k-means clustering. This technique aids in the identification of discrete groups of clients with similar purchasing behaviours. For example, it might group together customers who make regular purchases and spend a lot, whereas another cluster might consist of customers who shop less frequently but spend more when they do.
The scatter plot depicts different customer groupings. Each point represents a customer, with Recency (the time since their last purchase) plotted against Frequency. The colour of each point shows the cluster to which the consumer belongs, providing a clear picture of how different customer groups act.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 8153
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23621 item(s), 407695 transaction(s)] done [0.27s].
## sorting and recoding items ... [0 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
The Apriori technique is used in this research to uncover association rules inside a transaction dataset. These association rules highlight trends in client purchasing behavior, revealing which things are commonly purchased together.
In this case, the algorithm was set with a minimum support threshold of 0.02 and a minimum confidence level of 0.6. These thresholds specify the minimal degree of support (the proportion of transactions containing a specific set of products) and confidence (the likelihood that if item A is purchased, item B will also be purchased) needed for a rule to be considered significant.
After executing the algorithm, no association rules with the supplied criteria were found. This implies that either the thresholds were too strict or that the dataset lacks substantial correlations between items that fulfill the specified requirements.
kmeans_model <- kmeans(rfm_data[, c("Recency", "Frequency", "Monetary")], centers = 4)
print(kmeans_model$centers)
print(kmeans_model$cluster)
K-means clustering technique was used on RFM (Recency, Frequency, Monetary) data to segment clients based on their purchase habits. The system discovered four separate groups based on various combinations of recency, frequency, and monetary worth. After evaluating the cluster centres, we notice that: - Customers in Cluster 1 have a low recency, frequency, and monetary value. - Cluster 2 customers have a high level of recency, low frequency, and monetary value. - Customers in Cluster 3 have modest recency, frequency, and monetary value. - Customers in Cluster 4 have a high level of recency, frequency, and monetary value. These clusters provide useful information on customer segments, allowing businesses to modify their marketing and retention tactics accordingly.
The scatter figure depicts the K-means grouping of customers according to their RFM (Recency, Frequency, Monetary) characteristics. We can see that there are many inactive customers on the bottom end of the Recency axis (from 0 to 300), indicating that they have not made any recent transactions. On the other hand, there are fewer high-value and regular customers concentrated at the far end of the Recency axis, indicating that they have recently purchased. This distribution demonstrates the different levels of customer engagement and provides opportunities for focused marketing and retention efforts.
rfm_data$Cluster <- kmeans_model$cluster
cluster_summary <- rfm_data %>%
group_by(Cluster) %>%
summarise(
Avg_Recency = mean(Recency),
Avg_Frequency = mean(Frequency),
Avg_Monetary = mean(Monetary)
)
print(cluster_summary)
## # A tibble: 3 × 4
## Cluster Avg_Recency Avg_Frequency Avg_Monetary
## <int> <dbl> <dbl> <dbl>
## 1 1 329. 41.2 32130.
## 2 2 368. 114. 215544.
## 3 3 132. 3.86 1415.
The cluster summary describes the characteristics of each cluster based on the K-means clustering algorithm.
These findings can help adjust marketing tactics and customer interaction initiatives to each cluster’s unique needs and behaviors.
The graphic depicts the features of several clusters according to average recency, frequency, and monetary value. Each cluster is represented by a bar chart, with different colors denoting each variable.