Enhancing Customer Insights and Business Growth in Restaurants
In today’s competitive landscape, restaurants are increasingly seeking data-driven strategies to stay ahead of the curve. Three key methodologies that can substantially impact their bottom line are Sales Forecasting, Market Basket Analysis (MBA), and Customer Segmentation Analysis. Sales Forecasting employs predictive models such as SVM, Decision Trees, Random Forest, and Linear Regression to estimate upcoming daily or weekly sales, enabling restaurants to optimize inventory, staffing, and promotional activities. Customer Segmentation Analysis groups customers based on purchasing behavior, spending patterns, or demographic features, helping restaurants tailor marketing campaigns and enhance the overall customer experience. Meanwhile, Market Basket Analysis leverages transactional data to reveal which items are frequently purchased together, enabling more effective promotional bundles, targeted recommendations, and inventory optimization.
Despite these advantages, restaurants face unique challenges compared to traditional retail.
By applying techniques such as machine learning models for sales forecasting, clustering algorithms for customer segmentation, and association rule mining for MBA, restaurants can gain invaluable insights into purchasing behaviors. These insights not only optimize menu engineering and bundle offerings but also inform loyalty programs, promotional campaigns, and targeted advertisements.
In this project, we consolidate transactional data from a local restaurant across multiple Excel and CSV files. We then demonstrate how to clean, preprocess, and merge these data sources to form a robust dataset. From there, we use several machine learning models to forecast next-day sales as part of the Sales Forecasting. We perform Market Basket Analysis to uncover item affinities and propose potential bundle deals. In parallel, we conduct Customer Segmentation Analysis to cluster patrons based on spending and purchase patterns. These approaches underscore how data science techniques can significantly elevate a restaurant’s strategic decision-making, from operational logistics like inventory planning to high-level marketing initiatives, such as personalized promotions and meal bundles.
The primary objectives of this analysis are:

- to forecast next-day sales using machine learning models trained on consolidated daily transaction data;
- to uncover item affinities through Market Basket Analysis and propose potential bundle deals; and
- to segment customers based on spending and purchase patterns using clustering techniques.
Through these objectives, we hope to demonstrate how advanced analytics can empower a restaurant to make more evidence-based decisions, from creating enticing meal sets to crafting personalized promotions for different customer segments.
Our dataset is collected from a local restaurant’s point-of-sale (POS) system and other digital records, covering the period from 1 November 2023 to 31 October 2024. These data have been provided in multiple Excel and CSV files, each capturing a distinct facet of the restaurant’s daily operations. The dataset is anonymized, focusing purely on transactional and operational details with no personal information about customers.
Title: H Restaurant Puchong – Transactions, Orders, Discounts, and Item Sales.
Year: One year of data spanning 1 November 2023 to 31 October 2024.
Purpose: To analyze the restaurant’s daily operations, including sales transactions, discount usage, item-level purchases, and payment methods. The ultimate goal is to perform Sales Forecasting, Market Basket Analysis, and Customer Segmentation Analysis for business insights.
Dimension:
Content:
Structure:
Summary:
Read and Inspect the Transactions data for the month of December 2023:
## # A tibble: 6 x 15
## Date `Receipt No` Type Register Method Amount `Additional Inputs` Reason
## <chr> <chr> <chr> <chr> <chr> <dbl> <lgl> <lgl>
## 1 31/12/20~ 231231PNFQ Sale H Resta~ TnG E~ 104. NA NA
## 2 31/12/20~ 231231USVH Sale H Resta~ Cash 51.8 NA NA
## 3 31/12/20~ 2312315OPS Sale H Resta~ Card 89.8 NA NA
## 4 31/12/20~ 23123133IL Sale H Resta~ Cash 56.2 NA NA
## 5 31/12/20~ 231231PFUH Sale H Resta~ TnG E~ 34.7 NA NA
## 6 31/12/20~ 2312312ZWG Sale H Resta~ TnG E~ 58.6 NA NA
## # i 7 more variables: `Time to Pay` <chr>, Device <chr>, Location <chr>,
## # `Order No` <chr>, `Order Status` <chr>, `Order Total` <dbl>,
## # `Order Payment Status` <chr>
Read and Inspect the Discount data for the month of September 2024:
## # A tibble: 6 x 13
## Date `Applied To` Item Type Name Discount Device Location
## <dttm> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2024-09-30 19:54:00 Order <NA> Discount Open~ 5 H Res~ H Resta~
## 2 2024-09-30 17:17:00 Order <NA> Discount Open~ 0.32 H Res~ H Resta~
## 3 2024-09-30 15:04:00 Order <NA> Discount Open~ 5 H Res~ H Resta~
## 4 2024-09-30 14:36:00 Order <NA> Discount Open~ 0.49 H Res~ H Resta~
## 5 2024-09-30 13:21:00 Order <NA> Loyalty~ Birt~ 15 H Res~ H Resta~
## 6 2024-09-30 12:21:00 Order <NA> Discount Open~ 5 H Res~ H Resta~
## # i 5 more variables: `Order No` <chr>, `Order Status` <chr>,
## # `Order Total` <dbl>, `Order Payment Status` <chr>, `Order Notes` <chr>
Read and Inspect the Orders data for the month of September 2024:
## # A tibble: 6 x 41
## Date `Dining Option` Table `Ticket No` `Order No`
## <dttm> <chr> <chr> <lgl> <chr>
## 1 2024-09-30 20:26:00 Dine In T8 NA 240930NZVTBR0
## 2 2024-09-30 20:14:00 Dine In T6 NA 240930NZVTZFY
## 3 2024-09-30 19:59:00 Takeaway A16 NA 240930NZVJFA6
## 4 2024-09-30 19:57:00 Dine In T7 NA 240930NZVZ87U
## 5 2024-09-30 19:54:00 Dine In T9 NA 240930NZVUMED
## 6 2024-09-30 19:48:00 Foodpanda FP 5 NA 240930NZVVZAT
## # i 36 more variables: `Order Status` <chr>, Total <dbl>,
## # `Payment Status` <chr>, Items <chr>, `Items (Amount)` <chr>,
## # Discounts <chr>, `Discounts (Amount)` <dbl>,
## # `Manufacturer Discounts` <chr>, `Manufacturer Discounts (Amount)` <dbl>,
## # Payments <chr>, `Payments (Amount)` <chr>, Refunds <chr>,
## # `Refunds (Amount)` <dbl>, Device <chr>, Location <chr>, Channel <chr>,
## # Paid <dbl>, Remaining <dbl>, Rounding <dbl>, Gross <dbl>, ...
Read and Inspect the Item Sales data for the month of September and October 2024:
## # A tibble: 6 x 10
## Date Name Category Price Qty Total Net Device Location `Order No`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 31/10/2024 ~ "LD0~ "茶点~ 11.9 1 11.9 11.9 H Res~ Bandar ~ 241031NZV~
## 2 31/10/2024 ~ "A03~ "Add-On" 3.5 3 10.5 10.5 H Res~ Bandar ~ 241031NZV~
## 3 31/10/2024 ~ "D33~ "çƒé¥®~ 1 1 1 1 H Res~ Bandar ~ 241031NZV~
## 4 31/10/2024 ~ "D33~ "çƒé¥®~ 1 2 2 2 H Res~ Bandar ~ 241031NZV~
## 5 31/10/2024 ~ "D30~ "冷饮~ 6.9 2 13.8 13.8 H Res~ Bandar ~ 241031NZV~
## 6 31/10/2024 ~ "MS1~ "æ¤\u00~ 29.9 1 29.9 29.9 H Res~ Bandar ~ 241031NZV~
All four data sources—Transactions, Item Sales, Discounts, and Orders—rely on the Order No (or a close variation such as Receipt No) as the primary key for joining. The steps to integrate them and consolidate multiple months into one master dataset are summarized below:
First, read each monthly file into its own data frame (e.g., transactions_202409_df, transactions_202410_df, etc.). Once we have monthly data frames for each data type, we append them (row-bind) to create unified data frames that hold all months. For example:
# Pseudocode for transactions
transactions_df <- bind_rows(
  transactions_202311_df,
  transactions_202312_df,
  transactions_202401_df,
  transactions_202402_df
  # ..., continuing month by month through October 2024
)

This results in four consolidated data frames:
- combined_transactions_df (combined transactions for all months)
- combined_item_sales_df (combined item sales for all months)
- combined_discounts_df (combined discounts for all months)
- combined_orders_df (combined orders for all months)

We then use Order No (or Receipt No where necessary) to link these four consolidated data frames together. For example:
- Join the consolidated transactions to the orders on Order No and date to confirm total amounts, payment status, and item lists.
- Join the item sales to the orders on Order No to attach each line-item detail to the corresponding order.
- Join the discounts to the orders on Order No (and possibly item names if the discount is item-specific) to quantify the impact of discounts or promotions on each order.

By following these steps, we ensure the data is properly combined across multiple time periods, accurately merged on the correct keys, and ready for the next data cleaning process. A minimal sketch of these joins is shown below.
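For illustration only, the joins could look like the following sketch. It assumes the four consolidated data frames and the column names shown in the inspected output above (in particular the `Order No` column); the actual names may differ slightly after cleaning.

library(dplyr)

# Attach order-level details (status, totals, payment status) to each payment transaction
transactions_orders <- combined_transactions_df %>%
  left_join(combined_orders_df, by = "Order No")

# Attach each line item to its parent order
item_sales_orders <- combined_item_sales_df %>%
  left_join(combined_orders_df, by = "Order No")

# Attach discount records to the orders they were applied to
discounts_orders <- combined_discounts_df %>%
  left_join(combined_orders_df, by = "Order No")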
Read and inspect the item sales data
## # A tibble: 6 x 4
## Date Name Category Total
## <chr> <chr> <chr> <dbl>
## 1 31/12/2023 20:17 "R08 å’–å–±æ\u009d‚è\u008fœé¥ Curry Mixed Ve~ "é¥ç±»~ 7.9
## 2 31/12/2023 20:17 "Take Away / Container Charge" "Add-On" 1
## 3 31/12/2023 20:17 "D01C 冰海å\u008d—å’–å•¡ Ice Hainanese Coff~ "冷饮~ 11.8
## 4 31/12/2023 20:17 "D08C 冰美禄 Ice Milo" "冷饮~ 5.9
## 5 31/12/2023 20:17 "N05 沙煲è€\u0081é¼ ç²‰ Claypot Lou Shu Fun" "é\u009~ 12.9
## 6 31/12/2023 20:17 "N01 邦咯å\u008f»æ²™ Pangkor Laksa" "é\u009~ 15.9
Read and inspect the orders data
## # A tibble: 6 x 4
## Date Items Total `Order No`
## <dttm> <chr> <dbl> <chr>
## 1 2024-09-30 20:26:00 "BS03 <U+7518><U+699C><U+751F><U+719F><U+86CB> (2<U+7C92>) Half-boiled Kampu~ 31.9 240930NZV~
## 2 2024-09-30 20:14:00 "N09 <U+767D><U+841D><U+535C><U+732A><U+809A><U+80E1><U+6912><U+7CBF><U+6761><U+7C73><U+7C89><U+6C64> White Radi~ 17.8 240930NZV~
## 3 2024-09-30 19:59:00 "Kaya<U+9171> (<U+4E00><U+74F6><U+88C5>) Kaya (1 Bottle)" 9.43 240930NZV~
## 4 2024-09-30 19:57:00 "N06 <U+9E21><U+4E1D><U+6CB3><U+7C89> Shredded VChicken Hor Fun\~ 27.8 240930NZV~
## 5 2024-09-30 19:54:00 "R08 <U+5496><U+55B1><U+6742><U+83DC><U+996D> Curry Mixed Vege Rice\nR~ 29.9 240930NZV~
## 6 2024-09-30 19:48:00 "FP A07 <U+714E><U+86CB> Fried Egg\nFP D04C <U+51B0><U+9E33><U+9E2F> I~ 50.5 240930NZV~
The following function removes non-ASCII characters by dropping any byte greater than 0x7F, since ASCII values lie between 0x00 and 0x7F:
remove_non_ascii <- function(text) {
  rawBytes <- charToRaw(text)                  # convert the string to raw bytes
  cleanedBytes <- rawBytes[rawBytes <= 0x7F]   # keep only ASCII bytes (0x00–0x7F)
  return(trimws(rawToChar(cleanedBytes)))      # back to character, trimming leftover spaces
}
Now we apply this function with sapply() to the affected columns.
combined_item_sales_df <- combined_item_sales_df |>
mutate(
Name = sapply(Name, remove_non_ascii),
Category = sapply(Category, remove_non_ascii)
)
head(combined_item_sales_df[, c("Date", "Name", "Category", "Total")])
## # A tibble: 6 x 4
## Date Name Category Total
## <chr> <chr> <chr> <dbl>
## 1 31/12/2023 20:17 R08 Curry Mixed Vege Rice Rice Set 7.9
## 2 31/12/2023 20:17 Take Away / Container Charge Add-On 1
## 3 31/12/2023 20:17 D01C Ice Hainanese Coffee Cold Drinks 11.8
## 4 31/12/2023 20:17 D08C Ice Milo Cold Drinks 5.9
## 5 31/12/2023 20:17 N05 Claypot Lou Shu Fun Noodles 12.9
## 6 31/12/2023 20:17 N01 Pangkor Laksa Noodles 15.9
combined_orders_df <- combined_orders_df %>%
mutate(
Items = sapply(Items, remove_non_ascii)
)
head(combined_orders_df[, c("Date", "Items", "Total", "Order No")])
## # A tibble: 6 x 4
## Date Items Total `Order No`
## <dttm> <chr> <dbl> <chr>
## 1 2024-09-30 20:26:00 "BS03 (2) Half-boiled Kampung Egg (2pcs~ 31.9 240930NZV~
## 2 2024-09-30 20:14:00 "N09 White Radish Pepper Soup Kuey Teow~ 17.8 240930NZV~
## 3 2024-09-30 19:59:00 "Kaya () Kaya (1 Bottle)" 9.43 240930NZV~
## 4 2024-09-30 19:57:00 "N06 Shredded VChicken Hor Fun\nN03 Ca~ 27.8 240930NZV~
## 5 2024-09-30 19:54:00 "R08 Curry Mixed Vege Rice\nR02 () Sig~ 29.9 240930NZV~
## 6 2024-09-30 19:48:00 "FP A07 Fried Egg\nFP D04C Ice Hainane~ 50.5 240930NZV~
Next, handle missing values; for instance, some discount fields might be empty if no discount was applied. Numeric NAs are replaced with 0 and character NAs with empty strings:
combined_item_sales_df <- combined_item_sales_df %>%
mutate(across(where(is.numeric), ~if_else(is.na(.), 0, .))) %>%
mutate(across(where(is.character), ~if_else(is.na(.), "", .)))
combined_orders_df <- combined_orders_df %>%
mutate(across(where(is.numeric), ~if_else(is.na(.), 0, .))) %>%
mutate(across(where(is.character), ~if_else(is.na(.), "", .)))
Filter refunds: depending on the analysis, we keep only sales records. Here, refunds (rows with negative totals) are filtered out of the item sales and orders data, and the filtered data is then verified.
combined_item_sales_df <- combined_item_sales_df %>%
filter(Total >= 0)
combined_orders_df <- combined_orders_df %>%
filter(Total >= 0)
head(combined_item_sales_df[, c("Date", "Name", "Category", "Total")])
## # A tibble: 6 x 4
## Date Name Category Total
## <chr> <chr> <chr> <dbl>
## 1 31/12/2023 20:17 R08 Curry Mixed Vege Rice Rice Set 7.9
## 2 31/12/2023 20:17 Take Away / Container Charge Add-On 1
## 3 31/12/2023 20:17 D01C Ice Hainanese Coffee Cold Drinks 11.8
## 4 31/12/2023 20:17 D08C Ice Milo Cold Drinks 5.9
## 5 31/12/2023 20:17 N05 Claypot Lou Shu Fun Noodles 12.9
## 6 31/12/2023 20:17 N01 Pangkor Laksa Noodles 15.9
head(combined_orders_df[, c("Date", "Items", "Total", "Order No")])
## # A tibble: 6 x 4
## Date Items Total `Order No`
## <dttm> <chr> <dbl> <chr>
## 1 2024-09-30 20:26:00 "BS03 (2) Half-boiled Kampung Egg (2pcs~ 31.9 240930NZV~
## 2 2024-09-30 20:14:00 "N09 White Radish Pepper Soup Kuey Teow~ 17.8 240930NZV~
## 3 2024-09-30 19:59:00 "Kaya () Kaya (1 Bottle)" 9.43 240930NZV~
## 4 2024-09-30 19:57:00 "N06 Shredded VChicken Hor Fun\nN03 Ca~ 27.8 240930NZV~
## 5 2024-09-30 19:54:00 "R08 Curry Mixed Vege Rice\nR02 () Sig~ 29.9 240930NZV~
## 6 2024-09-30 19:48:00 "FP A07 Fried Egg\nFP D04C Ice Hainane~ 50.5 240930NZV~
Save the cleaned data into their respective files.
library(openxlsx)  # write.xlsx() is assumed to come from the openxlsx package
write.xlsx(combined_item_sales_df, "combined_item_sales.xlsx")
write.xlsx(combined_orders_df, "combined_orders.xlsx")
Read & inspect transactions data
## Warning: Expecting logical in H3783 / R3783C8: got 'Pay by TnG'
## Warning: Expecting logical in H3787 / R3787C8: got 'Pay by card'
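These warnings come from readxl, which guesses each column's type from the first rows only; the sparse Additional Inputs column is guessed as logical and later text values trigger the warnings. A minimal sketch of a more robust read (the file name here is hypothetical) is to raise guess_max so more rows are inspected before the type is fixed:

library(readxl)

# Scan more rows before deciding each column's type, so late-appearing
# text values such as "Pay by TnG" are not dropped.
transactions_202409_df <- read_excel("transactions_202409.xlsx", guess_max = 10000)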
Consolidate the data to daily granularity, computing the sum, min, max, mean, median, and count of daily transactions.
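A minimal sketch of this daily aggregation, assuming the combined transactions keep the character Date and numeric Amount columns shown earlier (the derived column names are assumptions):

library(dplyr)
library(lubridate)

daily_df <- combined_transactions_df %>%
  # parse "31/12/2023 20:17"-style strings (with or without a time part)
  mutate(Day = as.Date(parse_date_time(Date, orders = c("dmy HM", "dmy")))) %>%
  group_by(Day) %>%
  summarise(
    sum_sales      = sum(Amount),
    min_sales      = min(Amount),
    max_sales      = max(Amount),
    mean_sales     = mean(Amount),
    median_sales   = median(Amount),
    n_transactions = n()
  ) %>%
  arrange(Day)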
Create a new column called After_1_day_sum_sales by shifting the daily sales total one day forward, so that today's features are paired with tomorrow's sales as the target variable.
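Continuing the sketch above, the lead of the daily total becomes the target:

daily_df <- daily_df %>%
  mutate(After_1_day_sum_sales = lead(sum_sales, 1)) %>%  # tomorrow's total sales
  filter(!is.na(After_1_day_sum_sales))                   # the last day has no target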
Train four models: linear regression, SVM, random forest, and decision tree.
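A hedged sketch of the model training using lm(), e1071::svm(), randomForest::randomForest(), and rpart::rpart() on a chronological 80/20 split; the exact feature set, split, and packages used in the project are assumptions:

library(e1071)        # svm()
library(randomForest) # randomForest()
library(rpart)        # rpart()

# Chronological train/test split (assumed 80/20)
n_train  <- floor(0.8 * nrow(daily_df))
train_df <- daily_df[1:n_train, ]
test_df  <- daily_df[(n_train + 1):nrow(daily_df), ]

form <- After_1_day_sum_sales ~ sum_sales + min_sales + max_sales +
  mean_sales + median_sales + n_transactions

lm_fit  <- lm(form, data = train_df)
svm_fit <- svm(form, data = train_df)
rf_fit  <- randomForest(form, data = train_df)
dt_fit  <- rpart(form, data = train_df)

# Compare the models by RMSE on the held-out days
rmse <- function(fit) sqrt(mean((predict(fit, test_df) - test_df$After_1_day_sum_sales)^2))
sapply(list(lm = lm_fit, svm = svm_fit, rf = rf_fit, dt = dt_fit), rmse)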
Next step: remove potential outliers and train again.
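One way this could be done is with the 1.5 × IQR rule on the daily sales total (a sketch; the exact outlier rule used in the project is an assumption):

q   <- quantile(daily_df$sum_sales, c(0.25, 0.75))
iqr <- q[2] - q[1]

daily_df_trimmed <- daily_df %>%
  filter(sum_sales >= q[1] - 1.5 * iqr,
         sum_sales <= q[2] + 1.5 * iqr)
# ...then refit the four models on daily_df_trimmed and compare RMSE again.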
The table below shows the summary of the After_1_day_sum_sales column.
This section analyzes customer segmentation using K-Means, Hierarchical Clustering, and Gaussian Mixture Models (GMM). The dataset is cleaned, prepared, and subjected to clustering analysis, with various visualizations and metrics provided to evaluate the clustering performance.
library(readxl)
library(dplyr)
library(ggplot2)
library(cluster)
library(factoextra)
library(NbClust)
library(mclust)
library(clValid)
library(reshape2)
# Load the consolidated transactions dataset (exported earlier with reformatted dates)
df <- read_excel("combined_transactions-reformatdate.xlsx")
# Clean column names and parse date column
colnames(df) <- make.names(colnames(df), unique = TRUE)
df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y %H:%M")
# Remove refund transactions
cleaned_df <- df %>% filter(Type != "Refund")
# Select relevant columns for segmentation
segmentation_data <- cleaned_df %>%
select(Amount, Order.Total, Time.to.Pay) %>%
mutate(Time.to.Pay = as.numeric(gsub("m", "", Time.to.Pay))) %>%
na.omit()
# Sample data for analysis
set.seed(123)
segmentation_data <- segmentation_data[sample(1:nrow(segmentation_data), 5000), ]
# Normalize data
segmentation_data_scaled <- scale(segmentation_data)
# Cleanup unused variables
rm(cleaned_df)
rm(combined_transactions)
rm(df)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2962987 158.3 4799895 256.4 4799895 256.4
## Vcells 17517116 133.7 54677384 417.2 52101503 397.6
# Distribution of features
ggplot(segmentation_data, aes(x = Amount)) +
geom_histogram(binwidth = 10, fill = "blue", alpha = 0.7) +
labs(title = "Distribution of Amount")
ggplot(segmentation_data, aes(x = Order.Total)) +
geom_histogram(binwidth = 10, fill = "green", alpha = 0.7) +
labs(title = "Distribution of Order Total")
ggplot(segmentation_data, aes(x = Time.to.Pay)) +
geom_histogram(binwidth = 50, fill = "red", alpha = 0.7) +
labs(title = "Distribution of Time to Pay")
The distributions of Amount, Order Total, and Time to Pay are all concentrated near the lower range, resulting in positive skew. This could imply that customers frequently make small purchases and that many payments were processed instantly or within a short period. The graphs also indicate potential outliers: very high values that occur infrequently, which could be due to occasional bulk purchases, delayed payments, or data entry errors.
# Elbow Method
fviz_nbclust(segmentation_data_scaled, kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 2, color = "red") +
labs(subtitle = "Elbow Method for Optimal Clusters")
The graph plots the Total Within Sum of Squares (WSS) on the y-axis against the number of clusters k on the x-axis. The WSS measures the compactness of clusters; lower values indicate that points are closer to their cluster centers. The red dashed line at k = 3 marks the elbow point, where the rate of decrease in WSS slows significantly. The elbow at k = 3 suggests that 3 clusters is the optimal choice for segmenting this data, as it provides well-defined groups without overcomplicating the segmentation. Beyond k = 3, adding more clusters yields diminishing returns, as the reduction in WSS becomes less significant.
# Silhouette Method
fviz_nbclust(segmentation_data_scaled, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette Method for Optimal Clusters")
The silhouette width measures how similar an object is to its own cluster compared to other clusters. The average silhouette width (y-axis) indicates the quality of clustering for a given number of clusters k (x-axis); higher values mean better-defined clusters. The silhouette width peaks at k = 2, as marked by the vertical dashed line, suggesting that 2 clusters would be optimal, since this achieves the highest (closest to 1) average silhouette width and therefore well-separated, compact clusters. As the number of clusters increases beyond k = 2, the silhouette width decreases, indicating diminishing cluster quality.
Conclusion: We have chosen k = 3 as suggested by the Elbow Method, and the silhouette score at k = 3 is still relatively high. Splitting the data into 3 clusters provides more nuanced segmentation than 2 clusters, for example into low-, medium-, and high-spending segments.
# Perform K-Means
set.seed(123)
kmeans_model <- kmeans(segmentation_data_scaled, centers = 3, nstart = 25)
segmentation_data$Cluster <- as.factor(kmeans_model$cluster)
# Visualize Clusters
fviz_cluster(kmeans_model, data = segmentation_data_scaled, geom = "point") +
labs(title = "K-Means Clustering with 3 Clusters")
# Analyze Cluster Characteristics
cluster_summary <- segmentation_data %>%
group_by(Cluster) %>%
summarise(
Avg_Amount = mean(Amount),
Avg_Order_Total = mean(Order.Total),
Avg_Time_to_Pay = mean(Time.to.Pay)
)
cluster_summary
## # A tibble: 3 x 4
## Cluster Avg_Amount Avg_Order_Total Avg_Time_to_Pay
## <fct> <dbl> <dbl> <dbl>
## 1 1 110. 198. 207.
## 2 2 2591. 2591. 98.5
## 3 3 45.2 45.5 33.5
The data is well segmented into 3 clusters with clear differences in patterns. The x-axis (Dim1) explains 64.5% of the variance, while the y-axis (Dim2) explains 31%. Together, these two dimensions capture 95.5% of the total variance, making the projection a good representation of the original dataset.
Interpreting this along with the tibble summary above, Cluster 1 represents moderate spenders with average transaction values and the longest time to pay, Cluster 3 represents low spenders with quicker payments than Cluster 1, and Cluster 2 represents high spenders.
# Perform Hierarchical Clustering
dist_matrix <- dist(segmentation_data_scaled, method = "euclidean")
hclust_model <- hclust(dist_matrix, method = "ward.D2")
plot(hclust_model, labels = FALSE, main = "Hierarchical Clustering Dendrogram")
# Assign clusters
segmentation_data$HCluster <- cutree(hclust_model, k = 3)
# Visualize Dendrogram
fviz_dend(hclust_model, k = 3, rect = TRUE, show_labels = FALSE) +
labs(title = "Dendrogram with 3 Clusters (Sampled Data)")
# Analyze Cluster Characteristics
hcluster_summary <- segmentation_data %>%
group_by(HCluster) %>%
summarise(
Avg_Amount = mean(Amount),
Avg_Order_Total = mean(Order.Total),
Avg_Time_to_Pay = mean(Time.to.Pay)
)
hcluster_summary
## # A tibble: 3 x 4
## HCluster Avg_Amount Avg_Order_Total Avg_Time_to_Pay
## <int> <dbl> <dbl> <dbl>
## 1 1 45.8 45.9 35.3
## 2 2 130. 305. 254.
## 3 3 2591. 2591. 98.5
The dendrogram shows the data cut into 3 clusters, indicated by red, green, and blue. The clusters are obtained by cutting the tree at a height where the distance between clusters is large, resulting in well-separated groups. Interpreting this with the tibble summary above, cluster 1 (red) represents low spenders with quick payments, cluster 2 (green) represents moderate spenders with delayed payments, and cluster 3 (blue) represents high spenders with a moderate payment delay.
# Perform GMM
gmm_model <- Mclust(segmentation_data_scaled)
summary(gmm_model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9
## components:
##
## log-likelihood n df BIC ICL
## 32405.08 5000 89 64052.14 61425.92
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 31 1835 1109 366 248 788 39 450 134
segmentation_data$GMMCluster <- as.factor(gmm_model$classification)
# Visualize Clusters
fviz_mclust(gmm_model, what = "classification") +
labs(title = "GMM Clustering Classification")
# Analyze Cluster Characteristics
gmm_summary <- segmentation_data %>%
group_by(GMMCluster) %>%
summarise(
Avg_Amount = mean(Amount),
Avg_Order_Total = mean(Order.Total),
Avg_Time_to_Pay = mean(Time.to.Pay),
Cluster_Size = n()
)
gmm_summary
## # A tibble: 9 x 5
## GMMCluster Avg_Amount Avg_Order_Total Avg_Time_to_Pay Cluster_Size
## <fct> <dbl> <dbl> <dbl> <int>
## 1 1 553. 1027. 136. 31
## 2 2 46.2 46.2 30.4 1835
## 3 3 15.9 15.9 29.9 1109
## 4 4 11.2 11.2 0.320 366
## 5 5 38.1 38.1 0.560 248
## 6 6 100. 100. 44.3 788
## 7 7 26.2 36.9 24.8 39
## 8 8 34.0 34.0 89.6 450
## 9 9 136. 136. 190. 134
GMM identified 9 clusters as optimal based on the Bayesian Information Criterion (BIC). Cluster 2 is the largest followed by Cluster 3, whereas Cluster 1 is the smallest, likely representing high-value outliers or niche segments. Dim1 (62.9%) and Dim2 (32.1%) together explain 95% of the variance, making this projection a good representation of the data. Most data points are concentrated in overlapping regions, suggesting similar characteristics among these groups.
Combining this with the tibble summary above, Clusters 2 to 5 represent a significant portion of customers making relatively small and quick transactions, while Cluster 1 captures the highest transaction amounts and Cluster 9 the longest payment times.
# Silhouette Scores
fviz_silhouette(silhouette(kmeans_model$cluster, dist_matrix)) +
labs(title = "Silhouette Plot for K-Means")
## cluster size ave.sil.width
## 1 1 158 0.04
## 2 2 4 0.59
## 3 3 4838 0.83
fviz_silhouette(silhouette(segmentation_data$HCluster, dist_matrix)) +
labs(title = "Silhouette Plot for Hierarchical Clustering")
## cluster size ave.sil.width
## 1 1 4912 0.87
## 2 2 84 0.06
## 3 3 4 0.58
fviz_silhouette(silhouette(as.integer(segmentation_data$GMMCluster), dist_matrix)) +
labs(title = "Silhouette Plot for GMM")
## cluster size ave.sil.width
## 1 1 31 -0.23
## 2 2 1835 0.30
## 3 3 1109 0.16
## 4 4 366 0.76
## 5 5 248 0.28
## 6 6 788 0.13
## 7 7 39 -0.48
## 8 8 450 0.31
## 9 9 134 -0.19
Silhouette Plot: The silhouette plots visualize the clustering quality for K-Means, Hierarchical Clustering, and GMM. K-Means and Hierarchical Clustering show similar patterns: one dominant cluster with a high average silhouette width, while the remaining two clusters have much lower widths and contain observations with negative silhouette values, suggesting possible misclassification. For GMM, although the data is split into 9 clusters, several clusters (1, 7, and 9) have negative average silhouette widths and most of the others contain negative values, indicating poorer clustering quality and potential misclassification.
# Dunn Index for K-Means
dunn_kmeans <- dunn(clusters = kmeans_model$cluster, Data = segmentation_data_scaled)
# Dunn Index for Hierarchical Clustering
dunn_hclust <- dunn(clusters = cutree(hclust_model, k = 3), Data = segmentation_data_scaled)
# Dunn Index for GMM
dunn_gmm <- dunn(clusters = gmm_model$classification, Data = segmentation_data_scaled)
print(dunn_kmeans)
## [1] 0.002099785
print(dunn_hclust)
## [1] 0.006098638
print(dunn_gmm)
## [1] 3.727276e-06
calculate_ch_index <- function(data, clusters) {
n <- nrow(data) # Number of observations
k <- length(unique(clusters)) # Number of clusters
# Overall mean
overall_mean <- colMeans(data)
# Within-cluster and between-cluster sum of squares
SSW <- 0
SSB <- 0
for (cluster in unique(clusters)) {
cluster_points <- data[clusters == cluster, , drop = FALSE]
cluster_mean <- colMeans(cluster_points)
cluster_size <- nrow(cluster_points)
# Within-cluster sum of squares (sweep centres each column by the cluster mean)
SSW <- SSW + sum(sweep(cluster_points, 2, cluster_mean)^2)
# Between-cluster sum of squares
SSB <- SSB + cluster_size * sum((cluster_mean - overall_mean)^2)
}
# Calinski-Harabasz Index
CH <- (SSB / (k - 1)) / (SSW / (n - k))
return(CH)
}
# Calculate CH Index for GMM
ch_gmm <- calculate_ch_index(segmentation_data_scaled, gmm_model$classification)
print(ch_gmm)
## [1] 400.7199
# Calculate CH Index for K-Means
ch_kmeans <- calculate_ch_index(segmentation_data_scaled, kmeans_model$cluster)
print(ch_kmeans)
## [1] 2060.636
# Calculate CH Index for Hierarchical Clustering
ch_hclust <- calculate_ch_index(segmentation_data_scaled, cutree(hclust_model, k = 3))
print(ch_hclust)
## [1] 2094.058
# BIC
calculate_bic_kmeans <- function(kmeans_model, data) {
n <- nrow(data)
m <- ncol(data)
k <- nrow(kmeans_model$centers)  # number of clusters
wcss <- kmeans_model$tot.withinss
bic <- n * log(wcss / n) + log(n) * m * k
return(bic)
}
# Compute BIC for K-Means
bic_kmeans <- calculate_bic_kmeans(kmeans_model, segmentation_data_scaled)
print(bic_kmeans)
## [1] 1651.063
calculate_bic_hclust <- function(hclust_model, data, k) {
clusters <- cutree(hclust_model, k = k)
# Within-cluster sum of squares: squared distance of each point to its cluster centroid
wcss <- sum(sapply(unique(clusters), function(cl) {
  pts <- data[clusters == cl, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)
}))
n <- nrow(data)
m <- ncol(data)
bic <- n * log(wcss / n) + log(n) * m * k
return(bic)
}
# Compute BIC for Hierarchical Clustering
bic_hclust <- calculate_bic_hclust(hclust_model, segmentation_data_scaled, k = 3)
print(bic_hclust)
## [1] 1245.954
# BIC for GMM
bic_gmm <- gmm_model$BIC
print(bic_gmm)
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE
## 1 -42599.22 -42599.224 -42616.26 -42616.259 -42616.26 -42616.2589 -36105.15
## 2 -36912.92 -19792.154 -35751.85 -18006.530 -30732.38 -17587.6715 -27960.43
## 3 -33477.19 -15952.305 -33032.40 -12298.441 -26676.02 -10970.3527 -27995.33
## 4 -33511.22 -12199.057 -33066.44 -8094.862 -23196.83 -8705.7127 -28029.42
## 5 -33545.33 -10523.139 -33100.53 -5383.518 -21745.05 -4858.7858 -25465.43
## 6 -33579.42 -9138.932 -33134.66 -3971.977 -20003.11 -2432.1653 -25499.50
## 7 -27557.67 -7928.140 -30540.31 -1887.690 -17433.24 -1395.9117 -25533.58
## 8 -24042.84 -6497.730 -30574.36 -4522.574 -17316.89 925.3369 -25567.69
## 9 -23267.93 -5860.156 -29392.48 -4154.020 -16780.33 2603.3037 -24332.40
## VEE EVE VVE EEV VEV EVV VVV
## 1 -36105.15 -36105.151 -36105.15 -36105.151 -36105.15 -36105.151 -36105.15
## 2 51721.86 -10668.209 54964.17 -23320.498 51712.58 -5591.000 55095.31
## 3 55842.09 6833.361 58711.31 -19132.084 56229.01 7513.579 59162.73
## 4 56516.82 7074.710 61261.10 -11668.155 57340.26 7855.120 61690.69
## 5 57222.09 7355.317 62410.25 -8238.531 57623.56 8138.137 62757.48
## 6 58013.91 7304.221 62996.00 -4930.436 58440.33 NA 63211.86
## 7 58748.43 7253.803 63170.16 -2349.463 59127.18 7984.888 63611.28
## 8 58735.18 7553.007 63596.18 -4719.667 59767.71 8268.358 63751.76
## 9 59065.98 7398.147 63994.97 -4779.314 59482.86 NA 64052.14
##
## Top 3 models based on the BIC criterion:
## VVV,9 VVE,9 VVV,8
## 64052.14 63994.97 63751.76
Dunn Index: The Dunn Index measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance; higher values indicate better separation and compactness. None of the models achieves a high Dunn Index, although Hierarchical Clustering slightly outperforms the others.
CH Index: The CH Index measures the ratio of between-cluster variance to within-cluster variance; higher values indicate better-defined clusters. Hierarchical Clustering achieves the highest CH Index, marginally ahead of K-Means, with GMM far behind, indicating that both partition-style solutions produce well-separated and compact clusters.
Bayesian Information Criterion (BIC): The BIC balances model fit against complexity. For the custom formulas used here for K-Means and Hierarchical Clustering, lower values indicate a better trade-off, and Hierarchical Clustering achieves the lowest BIC. Note that mclust reports BIC on a scale where higher values indicate a better fit, so its values are not directly comparable with the custom formulas above; nonetheless, the fact that the best-fitting GMM (VVV with 9 components) requires so many components and free parameters suggests an overly complex model that is prone to overfitting.
library(arules)
library(arulesViz)
df <- read.csv("D:/Download D/combined_item_sales-reformatdate.xlsx - Sheet1.csv")
head(df)
## Date Name Category
## 1 1/11/2023 7:54 Curry Chicken Potato with Turmeric Rice Rice Set
## 2 1/11/2023 7:54 Shredded Chicken Hor Fun Noodles
## 3 1/11/2023 8:02 Nyonya Kuih (3pcs) Local Delicacies
## 4 1/11/2023 8:06 Hot Hainanese Coffee Hot Drinks
## 5 1/11/2023 8:06 Hainanese Butter & Kaya Toast Breakfast Series
## 6 1/11/2023 8:06 Hainanese Butter & Kaya Toast Breakfast Series
## Price Qty Total Net Device Location Order.No
## 1 13.9 1 13.9 13.9 H Restaurant Puchong iPad Puchong 231101X98H
## 2 12.9 1 12.9 12.9 H Restaurant Puchong iPad Puchong 231101X98H
## 3 6.9 3 20.7 20.7 H Restaurant Puchong Pastry iPad Puchong 231101XSGR
## 4 4.5 1 4.5 4.5 H Restaurant Puchong Pastry iPad Puchong 231101XSGR
## 5 3.5 2 7.0 7.0 H Restaurant Puchong Pastry iPad Puchong 231101XSGR
## 6 3.5 1 3.5 3.5 H Restaurant Puchong Pastry iPad Puchong 231101XSGR
Convert the dataset to transactional format (one basket of item names per Order No):
transactions <- split(df$Name, df$Order.No)
transactions <- lapply(transactions, function(x) unique(x))
transactions_with_int_ids <- setNames(transactions, as.character(1:length(transactions)))
trans <- as(transactions_with_int_ids, "transactions")
inspect(trans[1:5])
## items transactionID
## [1] {Ice Hainanese Coffee,
## Kampung Style Fried Meehoon} 1
## [2] {Brown Rice,
## Claypot Lou Shu Fun,
## Fried Egg,
## GonLou Dry Noodles with BBQ Chicken,
## Hainan Roasted & White-Cut Chicken Rice,
## Hainanese Butter & Kaya Toast,
## Luncheon VMeat,
## Mango Feast,
## O Hot Hainanese Coffee O,
## Signature Nasi Lemak with Rendang Chicken} 2
## [3] {Cold Soymilk,
## Loh Mee,
## Passion Fruit Juice,
## Signature Hainan Roasted Chicken Rice} 3
## [4] {Black sesame Bao,
## Ginger Mushroom Bao} 4
## [5] {Hot Hainanese Coffee,
## Shredded Chicken Hor Fun,
## Signature VChar Siew Rice} 5
frequentItems <- eclat (trans, parameter = list(supp = 0.07, maxlen = 15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.07 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 2108
##
## create itemset ...
## set transactions ...[302 item(s), 30116 transaction(s)] done [0.02s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating sparse bit matrix ... [10 row(s), 30116 column(s)] done [0.02s].
## writing ... [10 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
itemFrequencyPlot(trans, topN=10, type="absolute", main="Item Frequency")
1. Hainanese Butter & Kaya Toast, an authentic Malaysian breakfast favourite, is the restaurant's top seller. This is likely due to its affordable price and its role as both a light breakfast and a side at lunch.
2. It is followed by lunch main dishes such as the Signature Nasi Lemak with Rendang Chicken and the Special Curry Mee Hoon & Noodles, indicating that these are the favourite lunch options in the restaurant.
3. Hot and iced Hainanese coffee are the top beverages; as authentic drinks, they suit both breakfast and lunch.
Mine association rules with minimum support 0.01 and minimum confidence 0.5:
frequent_itemsets <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 301
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[302 item(s), 30116 transaction(s)] done [0.03s].
## sorting and recoding items ... [87 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## lhs rhs support confidence coverage lift count
## [1] {Hot Milo} => {Hainanese Butter & Kaya Toast} 0.01215301 0.5069252 0.02397397 2.211904 366
## [2] {Half-boiled Kampung Egg (2pcs)} => {Hainanese Butter & Kaya Toast} 0.06999602 0.6374357 0.10980874 2.781370 2108
## [3] {Half-boiled Kampung Egg (2pcs),
## Pangkor Shell Curry Puff} => {Hainanese Butter & Kaya Toast} 0.01006110 0.6433121 0.01563953 2.807011 303
## [4] {Half-boiled Kampung Egg (2pcs),
## Ice Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.01646965 0.6657718 0.02473768 2.905011 496
## [5] {Half-boiled Kampung Egg (2pcs),
## Hot Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.02128437 0.7226607 0.02945278 3.153238 641
## [6] {Half-boiled Kampung Egg (2pcs),
## Special Curry Mee Hoon & Noodles} => {Hainanese Butter & Kaya Toast} 0.01361403 0.6602254 0.02062027 2.880810 410
## [7] {Half-boiled Kampung Egg (2pcs),
## Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413 0.6364764 0.02676318 2.777184 513
View the association rules sorted by confidence, lift, and support:
arules::inspect(frequent_itemsets)
## lhs rhs support confidence coverage lift count
## [1] {Hot Milo} => {Hainanese Butter & Kaya Toast} 0.01215301 0.5069252 0.02397397 2.211904 366
## [2] {Half-boiled Kampung Egg (2pcs)} => {Hainanese Butter & Kaya Toast} 0.06999602 0.6374357 0.10980874 2.781370 2108
## [3] {Half-boiled Kampung Egg (2pcs),
## Pangkor Shell Curry Puff} => {Hainanese Butter & Kaya Toast} 0.01006110 0.6433121 0.01563953 2.807011 303
## [4] {Half-boiled Kampung Egg (2pcs),
## Ice Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.01646965 0.6657718 0.02473768 2.905011 496
## [5] {Half-boiled Kampung Egg (2pcs),
## Hot Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.02128437 0.7226607 0.02945278 3.153238 641
## [6] {Half-boiled Kampung Egg (2pcs),
## Special Curry Mee Hoon & Noodles} => {Hainanese Butter & Kaya Toast} 0.01361403 0.6602254 0.02062027 2.880810 410
## [7] {Half-boiled Kampung Egg (2pcs),
## Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413 0.6364764 0.02676318 2.777184 513
rules_conf <- sort(frequent_itemsets, by="confidence", decreasing=TRUE)
inspect(head(rules_conf))
## lhs rhs support confidence coverage lift count
## [1] {Half-boiled Kampung Egg (2pcs),
## Hot Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.02128437 0.7226607 0.02945278 3.153238 641
## [2] {Half-boiled Kampung Egg (2pcs),
## Ice Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.01646965 0.6657718 0.02473768 2.905011 496
## [3] {Half-boiled Kampung Egg (2pcs),
## Special Curry Mee Hoon & Noodles} => {Hainanese Butter & Kaya Toast} 0.01361403 0.6602254 0.02062027 2.880810 410
## [4] {Half-boiled Kampung Egg (2pcs),
## Pangkor Shell Curry Puff} => {Hainanese Butter & Kaya Toast} 0.01006110 0.6433121 0.01563953 2.807011 303
## [5] {Half-boiled Kampung Egg (2pcs)} => {Hainanese Butter & Kaya Toast} 0.06999602 0.6374357 0.10980874 2.781370 2108
## [6] {Half-boiled Kampung Egg (2pcs),
## Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413 0.6364764 0.02676318 2.777184 513
rules_lift <- sort(frequent_itemsets, by="lift", decreasing=TRUE)
inspect(head(rules_lift))
## lhs rhs support confidence coverage lift count
## [1] {Half-boiled Kampung Egg (2pcs),
## Hot Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.02128437 0.7226607 0.02945278 3.153238 641
## [2] {Half-boiled Kampung Egg (2pcs),
## Ice Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.01646965 0.6657718 0.02473768 2.905011 496
## [3] {Half-boiled Kampung Egg (2pcs),
## Special Curry Mee Hoon & Noodles} => {Hainanese Butter & Kaya Toast} 0.01361403 0.6602254 0.02062027 2.880810 410
## [4] {Half-boiled Kampung Egg (2pcs),
## Pangkor Shell Curry Puff} => {Hainanese Butter & Kaya Toast} 0.01006110 0.6433121 0.01563953 2.807011 303
## [5] {Half-boiled Kampung Egg (2pcs)} => {Hainanese Butter & Kaya Toast} 0.06999602 0.6374357 0.10980874 2.781370 2108
## [6] {Half-boiled Kampung Egg (2pcs),
## Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413 0.6364764 0.02676318 2.777184 513
rules_support <- sort(frequent_itemsets, by="support", decreasing=TRUE)
inspect(head(rules_support))
## lhs rhs support confidence coverage lift count
## [1] {Half-boiled Kampung Egg (2pcs)} => {Hainanese Butter & Kaya Toast} 0.06999602 0.6374357 0.10980874 2.781370 2108
## [2] {Half-boiled Kampung Egg (2pcs),
## Hot Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.02128437 0.7226607 0.02945278 3.153238 641
## [3] {Half-boiled Kampung Egg (2pcs),
## Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413 0.6364764 0.02676318 2.777184 513
## [4] {Half-boiled Kampung Egg (2pcs),
## Ice Hainanese Coffee} => {Hainanese Butter & Kaya Toast} 0.01646965 0.6657718 0.02473768 2.905011 496
## [5] {Half-boiled Kampung Egg (2pcs),
## Special Curry Mee Hoon & Noodles} => {Hainanese Butter & Kaya Toast} 0.01361403 0.6602254 0.02062027 2.880810 410
## [6] {Hot Milo} => {Hainanese Butter & Kaya Toast} 0.01215301 0.5069252 0.02397397 2.211904 366
Summary of Important Rules
# View the summary data frame (the construction of rules_table is not shown here)
print(rules_table)
## Rules Antecedent
## 1 1 Half-boiled Kampung Egg
## 2 2 Hot Hainanese Coffee
## 3 3 Half-boiled Kampung Egg, Hot Hainanese Coffee
## 4 4 Half-boiled Kampung Egg, Special Curry Mee Hoon & Noodles
## 5 5 Half-boiled Kampung Egg, Pangkor Shell Curry Puff
## 6 6 Half-boiled Kampung Egg, Nasi Lemak with Rendang Chicken
## 7 7 Brown Rice
## 8 8 Homemade Pumpkin Tofu
## Consequent Support Confidence Lift
## 1 Hainanese Butter & Kaya Toast 0.070 0.637 2.781
## 2 Hainanese Butter & Kaya Toast 0.062 0.415 1.809
## 3 Hainanese Butter & Kaya Toast 0.021 0.723 3.153
## 4 Hainanese Butter & Kaya Toast 0.014 0.660 2.881
## 5 Hainanese Butter & Kaya Toast 0.010 0.644 2.810
## 6 Hainanese Butter & Kaya Toast 0.017 0.636 2.777
## 7 Homemade Pumpkin Tofu 0.018 0.463 12.371
## 8 Brown Rice 0.018 0.476 12.371
The table presents the results of a market basket analysis, highlighting the relationships between items frequently purchased together. Each row represents an association rule with antecedents (items already in the basket) and consequents (items likely to be added). Key metrics include support, indicating the proportion of transactions containing the antecedent and consequent; confidence, measuring the likelihood of purchasing the consequent given the antecedent; and lift, showing how much more likely the items are to be purchased together compared to random chance.
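For reference, the three metrics have the standard definitions below, and the table's figures are internally consistent with them:

$$
\mathrm{support}(A \Rightarrow B) = P(A \cap B), \qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)}, \qquad
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{confidence}(A \Rightarrow B)}{P(B)}.
$$

As a quick check, for Rule 3 ({Half-boiled Kampung Egg, Hot Hainanese Coffee} ⇒ {Hainanese Butter & Kaya Toast}) the implied baseline is confidence / lift = 0.723 / 3.153 ≈ 0.229, i.e. the toast appears in roughly 23% of all orders; Rule 1 gives the same ratio (0.637 / 2.781 ≈ 0.229), as expected.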
One of the strongest association rules links hot Hainanese coffee and the half-boiled kampung egg with Hainanese butter & kaya toast. Although Rules 1 and 2 have higher support than Rule 3, buying their antecedents together produces a higher confidence: 72.3% of customers who buy a half-boiled kampung egg and a hot Hainanese coffee also buy Hainanese butter & kaya toast. The lift of 3.153 indicates that this combination makes the purchase of Hainanese butter & kaya toast roughly three times more likely than chance. In other words, a half-boiled egg with the signature hot beverage often leads to the purchase of crunchy buttered kaya toast, the classic Malaysian breakfast combination.
Rule 4 indicates that the half-boiled egg and Special Curry Mee Hoon & Noodles together with Hainanese butter & kaya toast form another favourite combination, with a 66% confidence that they are bought together. A lift of 2.881 likewise shows the purchase is far more likely than if the items were unrelated. Beyond breakfast, the data shows this combination is also bought as a lunch bundle, indicating that Hainanese butter & kaya toast can be sold as a side to the main dish and appeals to customers looking for a lighter option alongside it.
Beyond the breakfast sets, another notable finding on the lunch side is a strong connection between Brown Rice and Homemade Pumpkin Tofu, with moderate confidence in both directions (0.463 and 0.476, respectively) and an exceptionally high lift of 12.371, indicating a very strong association. This suggests customers regularly purchase these two items together, so they could be offered as a lunch set.
This reveals clear potential for the restaurant to optimize bundle promotions. A breakfast bundle, for example, could consist of a half-boiled egg, one authentic beverage such as hot Hainanese coffee or Milo, and kaya toast with butter. Similarly, a lunch set could pair a half-boiled egg and a main dish such as Nasi Lemak with Rendang Chicken or Special Curry Mee Hoon & Noodles with kaya toast with butter. Bundling higher-value items like the Nasi Lemak or Curry Mee Hoon can significantly boost sales during lunch.
In Sales Forecasting, SVM outperforms the other algorithms (decision tree, random forest, and linear regression) in predicting the next day's sales.
Overall, the Customer Segmentation Analysis shows that Hierarchical Clustering has the lowest BIC, a slightly better Dunn Index than the other methods, and a CH Index marginally higher than that of K-Means, suggesting it produces relatively better-separated clusters with balanced complexity. K-Means also achieves a high CH Index, indicating well-defined and compact clusters, but its Dunn Index and BIC are less favourable than Hierarchical Clustering's. GMM performs poorly on the Dunn and CH indices, and its best model (a 9-component VVV mixture) is considerably more complex, suggesting overfitting.
Market Basket Analysis shows strong potential for the restaurant to optimize bundle promotions. Breakfast bundles and lunch sets could be designed from the association rules, and bundling higher-value items can significantly boost sales.