Group Members (Team 13)

Tan Yee Thong (u2004927)
Koo Jian Wei (22105628)
Foo Kai Wen (23083543)
Yeo Minyi (24060558)
Chua Ziwan (23061223)

Project Title

Enhancing Customer Insights and Business Growth in Restaurant

Introduction

In today’s competitive landscape, restaurants are increasingly seeking data-driven strategies to stay ahead of the curve. Three key methodologies that can substantially impact their bottom line are Sales Forecasting, Market Basket Analysis (MBA) and Customer Segmentation Analysis. Sales Forecasting employs predictive models like SVM, Decision Trees, Random Forest, Linear Regression to estimate upcoming daily or weekly sales, enabling the restaurants to optimize inventory, staffing, and promotional activities. Customer Segmentation Analysis groups customers based on purchasing behavior, spending patterns, or demographic features, helping restaurants tailor marketing campaigns and enhance the overall customer experience. Meanwhile, Market Basket Analysis leverages transactional data to reveal which items are frequently purchased together. This enables more effective promotional bundles, targeted recommendations, and inventory optimization.

Despite these advantages, restaurants face unique challenges compared to traditional retail:

Dynamic Menus: Frequent changes in menu items, portion sizes, and pricing can disrupt conventional predictive models.
Irregular Peak Hours: Customer traffic is not uniform; it fluctuates during breakfast, lunch, and dinner rushes, requiring real-time adaptability of any recommendation or segmentation system.
Multi-Channel Sales: The proliferation of dine-in, take-out, and delivery platforms complicates the data landscape and demands cohesive analysis across varied transactional sources.

By applying techniques such as machine learning models for sales forecasting, clustering algorithms for customer segmentation and association rule mining for MBA, restaurants can gain invaluable insights into purchasing behaviors. These insights not only optimize menu engineering and bundle offerings but also inform loyalty programs, promotional campaigns, and targeted advertisements.

In this project, we consolidate transactional data from a local restaurant across multiple Excel and CSV files. We then demonstrate how to clean, preprocess, and merge these data sources to form a robust dataset. From there, use several machine learning models to forecast next day sales as part of the Sales Forecasting. We perform Market Basket Analysis to uncover item affinities and propose potential bundle deals. In parallel, we conduct Customer Segmentation Analysis to cluster patrons based on spending and purchase patterns. These approaches underscore how data science techniques can significantly elevate a restaurant’s strategic decision-making, from operational logistics like inventory planning to high-level marketing initiatives, such as personalized promos and meal bundles.

Objectives

The primary objectives of this analysis are:

Sales Forecasting
- To develop and compare multiple models to predict the next day’s sales.
- To identify which model offers the highest accuracy or lowest error metrics in forecasting.
- To utilize forecasting insights to improve operation of restaurant.
Customer Segmentation Analysis:
- To cluster customers based on attributes like spending amount, time to pay, and order totals, using algorithms such as K-Means, Hierarchical Clustering, and Gaussian Mixture Models.
- To profile each cluster to understand their unique behavioral traits (e.g., high spenders, value-seekers, moderate spenders with delayed payment, etc.).
- To generate targeted marketing strategies for each segment, such as loyalty rewards, discounts, or upselling tactics, thereby enhancing customer satisfaction and retention.
Market Basket Analysis (MBA):
- To identify frequently co-purchased items to discover strong association rules.
- To propose data-driven bundle deals or combos that cater to popular customer preferences, aiming to increase average order value.

Through these objectives, we hope to demonstrate how advanced analytics can empower a restaurant to make more evidence-based decisions, from creating enticing meal sets to crafting personalized promotions for different customer segments.

Dataset

Our dataset is collected from a local restaurant’s point-of-sale (POS) system and other digital records, covering the period from 1 November 2023 to 31 October 2024. These data have been provided in multiple Excel and CSV files, each capturing a distinct facet of the restaurant’s daily operations. The dataset is anonymized, focusing purely on transactional and operational details with no personal information about customers.

Dataset Details

Title:
H Restaurant Puchong – Transactions, Orders, Discounts, and Item Sales.
Year:
1 Year Data of 2023/2024 (data spanning 1 November 2023 to 31 October 2024).
Purpose:
To analyze the restaurant’s daily operations, including sales transactions, discount usage, item-level purchases, and payment methods. The ultimate goal is to perform Market Basket Analysis and Customer Segmentation Analysis for business insights.
Dimension:
1. Transactions Dataset:
  - Approx. number of rows: Several thousand (each representing a transaction).
  - Columns include: Date, Receipt No, Type (Sale/Refund), Register, Method, Amount, Time to Pay, Device, Location, Order No, Order Status, Order Payment Status, etc.
2. Discounts Dataset:
  - Rows represent discount-related records (order-level or item-level).
  - Columns include: Date, Applied To (Order or Item), Type (Discount, Loyalty Reward, Promotion), Discount Amount, Order No, etc.
3. Orders Dataset:
  - Rows represent high-level order entries.
  - Columns include: Date, Dining Option, Ticket No, Order No, Order Status, Payment Status, Items, Discounts, Payments, Channel, etc.
4. Item Sales Dataset:
  - Rows show each item’s sale information.
  - Columns include: Date, Name, Category, Price, Qty, Total, Net, Device, Order No, etc.
Content:
- Transactions & Orders: Summaries of each sale, including total amounts, payment methods, and statuses.
- Discounts: Full overview of applied discounts or promotions, specifying discount type and magnitude.
- Item Sales: Granular breakdown of every line item purchased, revealing item-specific data essential for Market Basket Analysis.
Structure:
- Each dataset is stored in Excel (.xlsx) format.
- Primary key for integration is Order No (occasionally Receipt No), allowing for cross-referencing.
- Data types are numeric (amounts, totals), categorical (payment method, item category), date-time (transaction timestamps), and text (item names, discount descriptions).
Summary:
- Overall, the combined datasets capture thousands of transactions across multiple ordering channels (dine-in, takeaway, delivery), dozens of discount types, and hundreds of unique menu items.
- When merged, they enable in-depth analyses of purchase patterns, discount effectiveness, consumer payment preferences, and item-level co-occurrences.
- This extensive coverage, spanning an entire month, provides a robust basis for discovering meaningful insights via predictive models, clustering algorithms, and association rule mining.

Data Sources

1. H Restaurant Puchong Transactions

Read and Inspect the Transactions data for the month of December 2023:

## # A tibble: 6 x 15
##   Date      `Receipt No` Type  Register Method Amount `Additional Inputs` Reason
##   <chr>     <chr>        <chr> <chr>    <chr>   <dbl> <lgl>               <lgl> 
## 1 31/12/20~ 231231PNFQ   Sale  H Resta~ TnG E~  104.  NA                  NA    
## 2 31/12/20~ 231231USVH   Sale  H Resta~ Cash     51.8 NA                  NA    
## 3 31/12/20~ 2312315OPS   Sale  H Resta~ Card     89.8 NA                  NA    
## 4 31/12/20~ 23123133IL   Sale  H Resta~ Cash     56.2 NA                  NA    
## 5 31/12/20~ 231231PFUH   Sale  H Resta~ TnG E~   34.7 NA                  NA    
## 6 31/12/20~ 2312312ZWG   Sale  H Resta~ TnG E~   58.6 NA                  NA    
## # i 7 more variables: `Time to Pay` <chr>, Device <chr>, Location <chr>,
## #   `Order No` <chr>, `Order Status` <chr>, `Order Total` <dbl>,
## #   `Order Payment Status` <chr>

Content: Transaction-level details such as Date, Receipt Number, Transaction type (sale/refund), Register used, Payment Method, Amount, Time to Pay, Order No, Order Status.
Key Fields:
- Date and Time to Pay show when transactions occurred and how quickly they were settled.
- Method (e.g., Card, Cash, TnG E-Wallet, Foodpanda) indicates the payment channel.
- Order No and Order Status link to additional data, such as items purchased or discounts.
Usefulness:
- Allows for time-series analysis of daily revenue trends.
- Facilitates matching of payment methods to transaction amounts and speeds.

2. H Restaurant Puchong Discount Transactions

Read and Inspect the Discount data for the month of September 2024:

## # A tibble: 6 x 13
##   Date                `Applied To` Item  Type     Name  Discount Device Location
##   <dttm>              <chr>        <chr> <chr>    <chr>    <dbl> <chr>  <chr>   
## 1 2024-09-30 19:54:00 Order        <NA>  Discount Open~     5    H Res~ H Resta~
## 2 2024-09-30 17:17:00 Order        <NA>  Discount Open~     0.32 H Res~ H Resta~
## 3 2024-09-30 15:04:00 Order        <NA>  Discount Open~     5    H Res~ H Resta~
## 4 2024-09-30 14:36:00 Order        <NA>  Discount Open~     0.49 H Res~ H Resta~
## 5 2024-09-30 13:21:00 Order        <NA>  Loyalty~ Birt~    15    H Res~ H Resta~
## 6 2024-09-30 12:21:00 Order        <NA>  Discount Open~     5    H Res~ H Resta~
## # i 5 more variables: `Order No` <chr>, `Order Status` <chr>,
## #   `Order Total` <dbl>, `Order Payment Status` <chr>, `Order Notes` <chr>

Content: Records of applied discounts or promotions (order-level or item-level), including discount types (Open Disc, Loyalty Rewards, Promotion) and discount amounts.
Key Fields:
- Applied To: Indicates the scope of discount, whether applied to entire order or specific item only.
- Name, Discount: The discount type and amount (e.g., 5 RM, 15 RM loyalty rewards, percentage discount).
- Order No: Links back to the main transaction.
Usefulness:
- Analyzing how discounts correlate with final bill amounts and payment statuses.
- Understanding which discount types are most popular or effective in boosting sales.

3. H Restaurant Puchong Orders

Read and Inspect the Orders data for the month of September 2024:

## # A tibble: 6 x 41
##   Date                `Dining Option` Table `Ticket No` `Order No`   
##   <dttm>              <chr>           <chr> <lgl>       <chr>        
## 1 2024-09-30 20:26:00 Dine In         T8    NA          240930NZVTBR0
## 2 2024-09-30 20:14:00 Dine In         T6    NA          240930NZVTZFY
## 3 2024-09-30 19:59:00 Takeaway        A16   NA          240930NZVJFA6
## 4 2024-09-30 19:57:00 Dine In         T7    NA          240930NZVZ87U
## 5 2024-09-30 19:54:00 Dine In         T9    NA          240930NZVUMED
## 6 2024-09-30 19:48:00 Foodpanda       FP 5  NA          240930NZVVZAT
## # i 36 more variables: `Order Status` <chr>, Total <dbl>,
## #   `Payment Status` <chr>, Items <chr>, `Items (Amount)` <chr>,
## #   Discounts <chr>, `Discounts (Amount)` <dbl>,
## #   `Manufacturer Discounts` <chr>, `Manufacturer Discounts (Amount)` <dbl>,
## #   Payments <chr>, `Payments (Amount)` <chr>, Refunds <chr>,
## #   `Refunds (Amount)` <dbl>, Device <chr>, Location <chr>, Channel <chr>,
## #   Paid <dbl>, Remaining <dbl>, Rounding <dbl>, Gross <dbl>, ...

Content: High-level overview of each order, including dining option, table/ticket number, order status, payment status, and a comprehensive Items column listing purchased items.
Key Fields:
- Dining Option: Dine-in, Takeaway, Foodpanda, etc.
- Order Total, Discounts, Payments: Aggregated amounts that link to the more granular transaction sheets.
- Channel: Additional column showing whether the sale came from in-house or a specific delivery partner.
Usefulness:
- Ties each order to a specific table or channel, enabling analysis of in-store vs. delivery patterns.
- Provides a quick snapshot of all items per order, helpful for building Market Basket transactions.

4. H Restaurant Puchong Item Sales Transaction

Read and Inspect the Item Sales data for the month of September and October 2024:

## # A tibble: 6 x 10
##   Date         Name  Category Price   Qty Total   Net Device Location `Order No`
##   <chr>        <chr> <chr>    <dbl> <dbl> <dbl> <dbl> <chr>  <chr>    <chr>     
## 1 31/10/2024 ~ "LD0~ "èŒ¶ç‚¹~  11.9     1  11.9  11.9 H Res~ Bandar ~ 241031NZV~
## 2 31/10/2024 ~ "A03~ "Add-On"   3.5     3  10.5  10.5 H Res~ Bandar ~ 241031NZV~
## 3 31/10/2024 ~ "D33~ "çƒé¥®~    1       1   1     1   H Res~ Bandar ~ 241031NZV~
## 4 31/10/2024 ~ "D33~ "çƒé¥®~    1       2   2     2   H Res~ Bandar ~ 241031NZV~
## 5 31/10/2024 ~ "D30~ "å†·é¥®~   6.9     2  13.8  13.8 H Res~ Bandar ~ 241031NZV~
## 6 31/10/2024 ~ "MS1~ "æ¤\u00~  29.9     1  29.9  29.9 H Res~ Bandar ~ 241031NZV~

Content: A granular, line-by-line breakdown of items sold, their categories, price, quantity, and net total after any adjustments.
Key Fields:
- Name & Category: E.g., “N03 Shredded Chicken Hor Fun,” “Hot Hainanese Coffee,” “Signature Nasi Lemak.”
- Qty and Total: How many units of an item were sold in a single transaction, along with extended cost.
- Order No: Allows joining to the Orders table for context on how items combine.
Usefulness:
- Foundational for Market Basket Analysis, since items from the same Order No can be grouped to detect frequently co-purchased items.
- Helps identify best-selling items and time-based patterns (e.g., breakfast vs. lunch).

Data Integration Process

All four data sources—Transactions, Item Transaction, Discount Transactions, and Orders—rely on the Order No (or a close variation like Receipt No) as the primary key for joining. The steps to integrate them and consolidate multiple months into one master dataset are summarized below:

Read Individual Monthly Files
- For each data type (Transactions, Item Transactions, Discounts, Orders), we have separate Excel files for different months (e.g., September, October). We first read each monthly file into its own data frame, for example, transactions_202409_df, transactions_202410_df, etc.).
Append Monthly Data
- Once we have monthly data frames for each data type, we append them (row‐bind) to create unified data frames that hold all months. For example:
```
# Pseudocode for transactions
transactions_df <- bind_rows(
  transactions_202311_df,
  transactions_202312_df,
  transactions_202401_df,
  transactions_202402_df
  # etc., as it goes on until October 2024 Dataset
)
```
- This results in four consolidated data frames:
  - combined_transactions_df (combined transactions for all months)
  - combined_item_sales_df (combined item sales for all months)
  - combined_discounts_df (combined discounts for all months)
  - combined_orders_df (combined orders for all months)
Join or Merge on Order No
- Use Order No (or Receipt No if necessary) to link these four consolidated data frames together. For example:
  1. Transactions ↔︎ Orders: Matched via Order No and date to confirm total amounts, payment status, and item lists.
  2. Item Transactions ↔︎ Orders: Joined through Order No to attach each line‐item detail to the corresponding order.
  3. Discount Transactions ↔︎ Orders: Linked by Order No (and possibly item names if the discount is item‐specific) to quantify the impact of discounts or promotions on each order.
Finalize Consolidated Data Frame
- After these merges, we end up with one master data set that contains all records (transactions, orders, items, and discounts) spanning all months under analysis. This single, unified data set is then ready for further analyses on Market Basket Analysis, Sales Forecasting and Customer Segmentation.

By following these steps, we ensure the data is properly combined across multiple time periods, accurately merged on the correct keys, and ready for the next data cleaning process.

Data Cleaning

Read and inspect the item sales data

## # A tibble: 6 x 4
##   Date             Name                                           Category Total
##   <chr>            <chr>                                          <chr>    <dbl>
## 1 31/12/2023 20:17 "R08 å’–å–±æ\u009d‚è\u008fœé¥ Curry Mixed Ve~  "é¥ç±»~    7.9
## 2 31/12/2023 20:17 "Take Away / Container Charge"                 "Add-On"   1  
## 3 31/12/2023 20:17 "D01C å†°æµ·å\u008d—å’–å•¡ Ice Hainanese Coff~ "å†·é¥®~  11.8
## 4 31/12/2023 20:17 "D08C å†°ç¾Žç¦„ Ice Milo"                      "å†·é¥®~   5.9
## 5 31/12/2023 20:17 "N05 æ²™ç…²è€\u0081é¼ ç²‰ Claypot Lou Shu Fun" "é\u009~  12.9
## 6 31/12/2023 20:17 "N01 é‚¦å’¯å\u008f»æ²™ Pangkor Laksa"          "é\u009~  15.9

Read and inspect the orders data

## # A tibble: 6 x 4
##   Date                Items                                     Total `Order No`
##   <dttm>              <chr>                                     <dbl> <chr>     
## 1 2024-09-30 20:26:00 "BS03 <U+7518><U+699C><U+751F><U+719F><U+86CB> (2<U+7C92>) Half-boiled Kampu~ 31.9  240930NZV~
## 2 2024-09-30 20:14:00 "N09 <U+767D><U+841D><U+535C><U+732A><U+809A><U+80E1><U+6912><U+7CBF><U+6761><U+7C73><U+7C89><U+6C64> White Radi~ 17.8  240930NZV~
## 3 2024-09-30 19:59:00 "Kaya<U+9171> (<U+4E00><U+74F6><U+88C5>) Kaya (1 Bottle)"          9.43 240930NZV~
## 4 2024-09-30 19:57:00 "N06 <U+9E21><U+4E1D><U+6CB3><U+7C89> Shredded VChicken Hor Fun\~ 27.8  240930NZV~
## 5 2024-09-30 19:54:00 "R08 <U+5496><U+55B1><U+6742><U+83DC><U+996D> Curry Mixed Vege Rice\nR~ 29.9  240930NZV~
## 6 2024-09-30 19:48:00 "FP A07 <U+714E><U+86CB> Fried Egg\nFP D04C <U+51B0><U+9E33><U+9E2F> I~ 50.5  240930NZV~

Findings

In item sales data, Name and Category columns have weird characters (non-ASCII characters).
In orders data, items column has non-ASCII characters instead.

This is the function to remove the non-ASCII character, by filtering out that are greater than 0x7F, since ASCII values are between 0x00 to 0x7F

remove_non_ascii <- function(text) {
  rawBytes <- charToRaw(text)
  cleanedBytes <- rawBytes[rawBytes <= 0x7F]
  return(trimws(rawToChar(cleanedBytes)))
}

Now we perform sapply the those mentioned columns.

combined_item_sales_df <- combined_item_sales_df |>
  mutate(
    Name = sapply(Name, remove_non_ascii),
    Category = sapply(Category, remove_non_ascii)
  )
head(combined_item_sales_df[, c("Date", "Name", "Category", "Total")])

## # A tibble: 6 x 4
##   Date             Name                         Category    Total
##   <chr>            <chr>                        <chr>       <dbl>
## 1 31/12/2023 20:17 R08  Curry Mixed Vege Rice   Rice Set      7.9
## 2 31/12/2023 20:17 Take Away / Container Charge Add-On        1  
## 3 31/12/2023 20:17 D01C  Ice Hainanese Coffee   Cold Drinks  11.8
## 4 31/12/2023 20:17 D08C  Ice Milo               Cold Drinks   5.9
## 5 31/12/2023 20:17 N05  Claypot Lou Shu Fun     Noodles      12.9
## 6 31/12/2023 20:17 N01  Pangkor Laksa           Noodles      15.9

combined_orders_df <- combined_orders_df %>%
  mutate(
    Items = sapply(Items, remove_non_ascii)
  )
head(combined_orders_df[, c("Date", "Items", "Total", "Order No")])

## # A tibble: 6 x 4
##   Date                Items                                     Total `Order No`
##   <dttm>              <chr>                                     <dbl> <chr>     
## 1 2024-09-30 20:26:00 "BS03  (2) Half-boiled Kampung Egg (2pcs~ 31.9  240930NZV~
## 2 2024-09-30 20:14:00 "N09  White Radish Pepper Soup Kuey Teow~ 17.8  240930NZV~
## 3 2024-09-30 19:59:00 "Kaya () Kaya (1 Bottle)"                  9.43 240930NZV~
## 4 2024-09-30 19:57:00 "N06  Shredded VChicken Hor Fun\nN03  Ca~ 27.8  240930NZV~
## 5 2024-09-30 19:54:00 "R08  Curry Mixed Vege Rice\nR02  () Sig~ 29.9  240930NZV~
## 6 2024-09-30 19:48:00 "FP A07  Fried Egg\nFP D04C  Ice Hainane~ 50.5  240930NZV~

Handle Missing or Null Entries

For instance, some discount fields might be empty if no discount was applied.

combined_item_sales_df <- combined_item_sales_df %>%
  mutate(across(where(is.numeric), ~if_else(is.na(.), 0, .))) %>%
  mutate(across(where(is.character), ~if_else(is.na(.), "", .)))

combined_orders_df <- combined_orders_df %>%
  mutate(across(where(is.numeric), ~if_else(is.na(.), 0, .))) %>%
  mutate(across(where(is.character), ~if_else(is.na(.), "", .)))

Clean and Validate

Filter Refunds: Depending on the analysis, filter only sales transactions.
Filter out refunds from item sales data and orders data. Then verify the filtered data.

combined_item_sales_df <- combined_item_sales_df %>%
  filter(Total >= 0)

combined_orders_df <- combined_orders_df %>%
  filter(Total >= 0)

head(combined_item_sales_df[, c("Date", "Name", "Category", "Total")])

## # A tibble: 6 x 4
##   Date             Name                         Category    Total
##   <chr>            <chr>                        <chr>       <dbl>
## 1 31/12/2023 20:17 R08  Curry Mixed Vege Rice   Rice Set      7.9
## 2 31/12/2023 20:17 Take Away / Container Charge Add-On        1  
## 3 31/12/2023 20:17 D01C  Ice Hainanese Coffee   Cold Drinks  11.8
## 4 31/12/2023 20:17 D08C  Ice Milo               Cold Drinks   5.9
## 5 31/12/2023 20:17 N05  Claypot Lou Shu Fun     Noodles      12.9
## 6 31/12/2023 20:17 N01  Pangkor Laksa           Noodles      15.9

head(combined_orders_df[, c("Date", "Items", "Total", "Order No")])

## # A tibble: 6 x 4
##   Date                Items                                     Total `Order No`
##   <dttm>              <chr>                                     <dbl> <chr>     
## 1 2024-09-30 20:26:00 "BS03  (2) Half-boiled Kampung Egg (2pcs~ 31.9  240930NZV~
## 2 2024-09-30 20:14:00 "N09  White Radish Pepper Soup Kuey Teow~ 17.8  240930NZV~
## 3 2024-09-30 19:59:00 "Kaya () Kaya (1 Bottle)"                  9.43 240930NZV~
## 4 2024-09-30 19:57:00 "N06  Shredded VChicken Hor Fun\nN03  Ca~ 27.8  240930NZV~
## 5 2024-09-30 19:54:00 "R08  Curry Mixed Vege Rice\nR02  () Sig~ 29.9  240930NZV~
## 6 2024-09-30 19:48:00 "FP A07  Fried Egg\nFP D04C  Ice Hainane~ 50.5  240930NZV~

Save the cleaned data into respective files

write.xlsx(combined_item_sales_df, "combined_item_sales.xlsx")
write.xlsx(combined_orders_df, "combined_orders.xlsx")

Sales Forecasting

Read & inspect transactions data

## Warning: Expecting logical in H3783 / R3783C8: got 'Pay by TnG'

## Warning: Expecting logical in H3787 / R3787C8: got 'Pay by card'

Consolidate the data into daily with sum/min/max/mean/median/number of daily transaction

Create a new column called After_1_day_sum_sales, in that column, move the target variable to a day forward, to simulate that the metadata of today pair with target variable of tomorrow

Train with 4 models: linear regression, svm, random forest and decision tree

Findings 1

SVM has the lowest value in MAE, whereas LR has the lowest value in RMSE.
This means LR may handle extreme deviations better than SVM but overall SVM performs better in predicting typical sales.

Findings 2

None of the algorithms able to handle outliers (6 data points that are around 10,000 to 20,000).
Random Forest algorithm predicts some data (supposedly 5,000 actual value) unexpectedly high, until 12,000.
Visually, Decision Tree algorithm has a rigid predicted values, due to the nature of the decision rules it has.

Next Step: Remove potential outliers and train again

The table below shows the summary of the After_1_day_sum_sales column

Findings 3

SVM is still the lowest MAE.
Random Forest algorithm now becomes better in predicting in lowest RMSE score.
Overall RMSE dropped more than half of the original score, from 4,000 to lesser than 2,000.
MAE has reduced around 1/3 of original score too, from 2,000 to 1,500.
The standard deviation of the target column is 1.7k, so the MAE is still in acceptable range.

Findings 4

The outliers significantly reduced the performance of models, by making them predict values with lesser errors.
SVM outperforms other algorithms in predicting the next day’s sales.

Customer Segmentation Analysis

This section analyzes customer segmentation using K-Means, Hierarchical Clustering, and Gaussian Mixture Models (GMM). The dataset is cleaned, prepared, and subjected to clustering analysis, with various visualizations and metrics provided to evaluate the clustering performance.

Load Necessary Libraries

library(readxl)
library(dplyr)
library(ggplot2)
library(cluster)
library(factoextra)
library(NbClust)
library(mclust)
library(clValid)
library(reshape2)
library(readxl)

Data Cleaning and Preparation

# Load dataset (assumed to be preloaded as `combined_transactions`)
df <- read_excel("combined_transactions-reformatdate.xlsx")

# Clean column names and parse date column
colnames(df) <- make.names(colnames(df), unique = TRUE)
df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y %H:%M")

# Remove refund transactions
cleaned_df <- df %>% filter(Type != "Refund")

# Select relevant columns for segmentation
segmentation_data <- cleaned_df %>%
  select(Amount, Order.Total, Time.to.Pay) %>%
  mutate(Time.to.Pay = as.numeric(gsub("m", "", Time.to.Pay))) %>%
  na.omit()

# Sample data for analysis
set.seed(123)
segmentation_data <- segmentation_data[sample(1:nrow(segmentation_data), 5000), ]

# Normalize data
segmentation_data_scaled <- scale(segmentation_data)

# Cleanup unused variables
rm(cleaned_df)
rm(combined_transactions)
rm(df)
gc()

##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  2962987 158.3    4799895 256.4  4799895 256.4
## Vcells 17517116 133.7   54677384 417.2 52101503 397.6

Visualize Feature Distributions

# Distribution of features
ggplot(segmentation_data, aes(x = Amount)) + 
  geom_histogram(binwidth = 10, fill = "blue", alpha = 0.7) + 
  labs(title = "Distribution of Amount")

ggplot(segmentation_data, aes(x = Order.Total)) + 
  geom_histogram(binwidth = 10, fill = "green", alpha = 0.7) + 
  labs(title = "Distribution of Order Total")

ggplot(segmentation_data, aes(x = Time.to.Pay)) + 
  geom_histogram(binwidth = 50, fill = "red", alpha = 0.7) + 
  labs(title = "Distribution of Time to Pay")

Interpretation 1

The Distribution of Amount, Order Total and Time To Pay are all concentrated near the lower range, resulting in a positive skewness. This could imply that customers frequently make small purchases and many payments were processed instantly or within a short period. The graphs also indicates potential outliers with very high values but occur less frequently which could be due to infrequent bulk purchases, delayed payments or errors in data entry.

Determine Optimal Number of Clusters

# Elbow Method
fviz_nbclust(segmentation_data_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2, color = "red") + 
  labs(subtitle = "Elbow Method for Optimal Clusters")

Interpretation 2

The graph plots the Total Within Sum of Squares (WSS) (y-axis) against the number of clusters k (x-axis). The WSS measures the compactness of clusters - lower values indicate that points are closer to their cluster centers. The red dashed line at k =3 indicates the elbow point, where the rate of decrease in WSS slows significantly. The “elbow” at k = 3 suggests that 3 clusters is the optimal choice for segmenting this data, , as it provides well-defined groups without overcomplicating the segmentation. Beyond k = 3, adding more clusters results in diminishing returns, as the reduction in WSS becomes less significant.

# Silhouette Method
fviz_nbclust(segmentation_data_scaled, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette Method for Optimal Clusters")

Interpretation 3

The silhouette width measures how similar an object is to its own cluster compared to other clusters. The average silhouette width (y-axis) indicates the quality of clustering for a given number of clusters (k) on the x-axis. Higher values mean better-defined clusters. The peak value of the silhouette width occurs at k=2 as marked by the vertical dashed line which suggests that 2 clusters is the optimal choice, as it achieves the highest silhouette width and closest to 1, indicating well-separated and compact clusters. As the number of clusters increases beyond k=2, the silhouette width decreases, indicating diminishing cluster quality.

Conclusion: We have chosen k=3 as suggested by the Elbow Method, and the silhouette score at k=3 is still relatively high. Splitting the data into 3 clusters provides more nuanced segmentation compared to 2 clusters, such as segmenting them into low, medium and highsegments.

K-Means Clustering

# Perform K-Means
set.seed(123)
kmeans_model <- kmeans(segmentation_data_scaled, centers = 3, nstart = 25)
segmentation_data$Cluster <- as.factor(kmeans_model$cluster)

# Visualize Clusters
fviz_cluster(kmeans_model, data = segmentation_data_scaled, geom = "point") +
  labs(title = "K-Means Clustering with 3 Clusters")

# Analyze Cluster Characteristics
cluster_summary <- segmentation_data %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Amount = mean(Amount),
    Avg_Order_Total = mean(Order.Total),
    Avg_Time_to_Pay = mean(Time.to.Pay)
  )
cluster_summary

## # A tibble: 3 x 4
##   Cluster Avg_Amount Avg_Order_Total Avg_Time_to_Pay
##   <fct>        <dbl>           <dbl>           <dbl>
## 1 1            110.            198.            207. 
## 2 2           2591.           2591.             98.5
## 3 3             45.2            45.5            33.5

Interpretation 4

The data is well-segmented into 3 clusters with clear differences in patterns.The x-axis (Dim1) explains 64.5% of the variance, while the y-axis (Dim2) explains 31%. Together, these two dimensions capture 95.5% of the total variance, making the projection a good representation of the original dataset.

Interpreting along with the Tibble Summary, we can know that Cluster 1 represents moderate spenders with average transaction values and longest payment behaviour, Cluster 3 represents low spenders with quick payment behaviour compared to Cluster 1, and Cluster 2 represents high spenders.

Hierarchical Clustering

# Perform Hierarchical Clustering
dist_matrix <- dist(segmentation_data_scaled, method = "euclidean")
hclust_model <- hclust(dist_matrix, method = "ward.D2")
plot(hclust_model, labels = FALSE, main = "Hierarchical Clustering Dendrogram")

# Assign clusters
segmentation_data$HCluster <- cutree(hclust_model, k = 3)

# Visualize Dendrogram
fviz_dend(hclust_model, k = 3, rect = TRUE, show_labels = FALSE) +
  labs(title = "Dendrogram with 3 Clusters (Sampled Data)")

# Analyze Cluster Characteristics
hcluster_summary <- segmentation_data %>%
  group_by(HCluster) %>%
  summarise(
    Avg_Amount = mean(Amount),
    Avg_Order_Total = mean(Order.Total),
    Avg_Time_to_Pay = mean(Time.to.Pay)
  )
hcluster_summary

## # A tibble: 3 x 4
##   HCluster Avg_Amount Avg_Order_Total Avg_Time_to_Pay
##      <int>      <dbl>           <dbl>           <dbl>
## 1        1       45.8            45.9            35.3
## 2        2      130.            305.            254. 
## 3        3     2591.           2591.             98.5

Interpretation 5

The dendrogram shows the data cut into 3 clusters indicated by red, green, and blue colors. The clusters are chosen by cutting the tree at a height where the distance between clusters is large, resulting in well-separated groups. Interpreting with the Tibble Summary below, we can find that cluster 1 in red represents low and quick spenders, cluster 2 in green represents moderate spenders with delayed payments, and cluster 3 in blue represents high spenders with moderate payment delay.

Gaussian Mixture Model (GMM)

# Perform GMM
gmm_model <- Mclust(segmentation_data_scaled)
summary(gmm_model)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9
## components: 
## 
##  log-likelihood    n df      BIC      ICL
##        32405.08 5000 89 64052.14 61425.92
## 
## Clustering table:
##    1    2    3    4    5    6    7    8    9 
##   31 1835 1109  366  248  788   39  450  134

segmentation_data$GMMCluster <- as.factor(gmm_model$classification)

# Visualize Clusters
fviz_mclust(gmm_model, what = "classification") +
  labs(title = "GMM Clustering Classification")

# Analyze Cluster Characteristics
gmm_summary <- segmentation_data %>%
  group_by(GMMCluster) %>%
  summarise(
    Avg_Amount = mean(Amount),
    Avg_Order_Total = mean(Order.Total),
    Avg_Time_to_Pay = mean(Time.to.Pay),
    Cluster_Size = n()
  )
gmm_summary

## # A tibble: 9 x 5
##   GMMCluster Avg_Amount Avg_Order_Total Avg_Time_to_Pay Cluster_Size
##   <fct>           <dbl>           <dbl>           <dbl>        <int>
## 1 1               553.           1027.          136.              31
## 2 2                46.2            46.2          30.4           1835
## 3 3                15.9            15.9          29.9           1109
## 4 4                11.2            11.2           0.320          366
## 5 5                38.1            38.1           0.560          248
## 6 6               100.            100.           44.3            788
## 7 7                26.2            36.9          24.8             39
## 8 8                34.0            34.0          89.6            450
## 9 9               136.            136.          190.             134

Interpretation 6

GMM identified 9 clusters as optimal based on the Bayesian Information Criterion (BIC). Cluster 2 is the largest followed by Cluster 3, whereas Cluster 1 is the smallest, likely representing high-value outliers or niche segments. Dim1 (62.9%) and Dim2 (32.1%) together explain 95% of the variance, making this projection a good representation of the data. Most data points are concentrated in overlapping regions, suggesting similar characteristics among these groups.

Combining with the tibble summary below, Cluster 2 to 5 represent a significant portion of customers making relatively smaller and quicker transactions, Cluster 1 and 9 represents customers with highest transaction amount and longest payment time.

Evaluation of Clustering Models

# Silhouette Scores
fviz_silhouette(silhouette(kmeans_model$cluster, dist_matrix)) +
  labs(title = "Silhouette Plot for K-Means")

##   cluster size ave.sil.width
## 1       1  158          0.04
## 2       2    4          0.59
## 3       3 4838          0.83

fviz_silhouette(silhouette(segmentation_data$HCluster, dist_matrix)) +
  labs(title = "Silhouette Plot for Hierarchical Clustering")

##   cluster size ave.sil.width
## 1       1 4912          0.87
## 2       2   84          0.06
## 3       3    4          0.58

fviz_silhouette(silhouette(as.integer(segmentation_data$GMMCluster), dist_matrix)) +
  labs(title = "Silhouette Plot for GMM")

##   cluster size ave.sil.width
## 1       1   31         -0.23
## 2       2 1835          0.30
## 3       3 1109          0.16
## 4       4  366          0.76
## 5       5  248          0.28
## 6       6  788          0.13
## 7       7   39         -0.48
## 8       8  450          0.31
## 9       9  134         -0.19

Interpretation 7

Silhouette Plot: The silhouette plots visualize the quality of clustering for K-Means, Hierarchical Clustering and GMM. K means and Hierarchical Clustering have similar results with one cluster with highest silhouette width, and the other two lower width and with negative values, suggesting there might be misclassification. As for GMM, although it is separated into 9 clusters, only cluster 4 does not have negative values, whereas all others have negative values suggesting poor clustering quality and potential misclassifications.

# Dunn Index for K-Means
dunn_kmeans <- dunn(clusters = kmeans_model$cluster, Data = segmentation_data_scaled)

# Dunn Index for Hierarchical Clustering
dunn_hclust <- dunn(clusters = cutree(hclust_model, k = 3), Data = segmentation_data_scaled)

# Dunn Index for GMM
dunn_gmm <- dunn(clusters = gmm_model$classification, Data = segmentation_data_scaled)

print(dunn_kmeans)

## [1] 0.002099785

print(dunn_hclust)

## [1] 0.006098638

print(dunn_gmm)

## [1] 3.727276e-06

calculate_ch_index <- function(data, clusters) {
  n <- nrow(data) # Number of observations
  k <- length(unique(clusters)) # Number of clusters
  
  # Overall mean
  overall_mean <- colMeans(data)
  
  # Within-cluster and between-cluster sum of squares
  SSW <- 0
  SSB <- 0
  
  for (cluster in unique(clusters)) {
    cluster_points <- data[clusters == cluster, , drop = FALSE]
    cluster_mean <- colMeans(cluster_points)
    cluster_size <- nrow(cluster_points)
    
    # Within-cluster sum of squares
    SSW <- SSW + sum(rowSums((cluster_points - cluster_mean)^2))
    
    # Between-cluster sum of squares
    SSB <- SSB + cluster_size * sum((cluster_mean - overall_mean)^2)
  }
  
  # Calinski-Harabasz Index
  CH <- (SSB / (k - 1)) / (SSW / (n - k))
  return(CH)
}

# Calculate CH Index for GMM
ch_gmm <- calculate_ch_index(segmentation_data_scaled, gmm_model$classification)
print(ch_gmm)

## [1] 400.7199

# Calculate CH Index for K-Means
ch_kmeans <- calculate_ch_index(segmentation_data_scaled, kmeans_model$cluster)
print(ch_kmeans)

## [1] 2060.636

# Calculate CH Index for Hierarchical Clustering
ch_hclust <- calculate_ch_index(segmentation_data_scaled, cutree(hclust_model, k = 3))
print(ch_hclust)

## [1] 2094.058

# BIC
calculate_bic_kmeans <- function(kmeans_model, data) {
  n <- nrow(data)
  m <- ncol(data)
  k <- kmeans_model$centers
  wcss <- kmeans_model$tot.withinss
  bic <- n * log(wcss / n) + log(n) * m * length(k)
  return(bic)
}

# Compute BIC for K-Means
bic_kmeans <- calculate_bic_kmeans(kmeans_model, segmentation_data_scaled)
print(bic_kmeans)

## [1] 1651.063

calculate_bic_hclust <- function(hclust_model, data, k) {
  clusters <- cutree(hclust_model, k = k)
  wcss <- sum((data - rowMeans(data[clusters == as.numeric(clusters), ]))^2)
  n <- nrow(data)
  m <- ncol(data)
  bic <- n * log(wcss / n) + log(n) * m * k
  return(bic)
}

# Compute BIC for Hierarchical Clustering
bic_hclust <- calculate_bic_hclust(hclust_model, segmentation_data_scaled, k = 3)
print(bic_hclust)

## [1] 1245.954

# BIC for GMM
bic_gmm <- gmm_model$BIC
print(bic_gmm)

## Bayesian Information Criterion (BIC): 
##         EII        VII       EEI        VEI       EVI         VVI       EEE
## 1 -42599.22 -42599.224 -42616.26 -42616.259 -42616.26 -42616.2589 -36105.15
## 2 -36912.92 -19792.154 -35751.85 -18006.530 -30732.38 -17587.6715 -27960.43
## 3 -33477.19 -15952.305 -33032.40 -12298.441 -26676.02 -10970.3527 -27995.33
## 4 -33511.22 -12199.057 -33066.44  -8094.862 -23196.83  -8705.7127 -28029.42
## 5 -33545.33 -10523.139 -33100.53  -5383.518 -21745.05  -4858.7858 -25465.43
## 6 -33579.42  -9138.932 -33134.66  -3971.977 -20003.11  -2432.1653 -25499.50
## 7 -27557.67  -7928.140 -30540.31  -1887.690 -17433.24  -1395.9117 -25533.58
## 8 -24042.84  -6497.730 -30574.36  -4522.574 -17316.89    925.3369 -25567.69
## 9 -23267.93  -5860.156 -29392.48  -4154.020 -16780.33   2603.3037 -24332.40
##         VEE        EVE       VVE        EEV       VEV        EVV       VVV
## 1 -36105.15 -36105.151 -36105.15 -36105.151 -36105.15 -36105.151 -36105.15
## 2  51721.86 -10668.209  54964.17 -23320.498  51712.58  -5591.000  55095.31
## 3  55842.09   6833.361  58711.31 -19132.084  56229.01   7513.579  59162.73
## 4  56516.82   7074.710  61261.10 -11668.155  57340.26   7855.120  61690.69
## 5  57222.09   7355.317  62410.25  -8238.531  57623.56   8138.137  62757.48
## 6  58013.91   7304.221  62996.00  -4930.436  58440.33         NA  63211.86
## 7  58748.43   7253.803  63170.16  -2349.463  59127.18   7984.888  63611.28
## 8  58735.18   7553.007  63596.18  -4719.667  59767.71   8268.358  63751.76
## 9  59065.98   7398.147  63994.97  -4779.314  59482.86         NA  64052.14
## 
## Top 3 models based on the BIC criterion: 
##    VVV,9    VVE,9    VVV,8 
## 64052.14 63994.97 63751.76

Interpretation 8

Dunn Index: The Dunn Index measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. Higher values indicate better separation and compactness. However, none of the models achieve a high Dunn Index, but Hierarchical Clustering slightly outperforms the others.

Interpretation 9

CH Index: The CH Index measures the ratio of the between-cluster variance to the within-cluster variance. Higher values indicate better-defined clusters. K-Means achieves the highest CH Index, making it the best performer for well-separated and compact clusters.

Interpretation 10

Bayesian Information Criterion (BIC): The BIC measures the model fit while penalizing complexity. Lower values indicate better model performance. Hierarchical Clustering has the lowest BIC value, indicating the best performance for balancing fit and complexity. The GMM model with the VVV covariance structure has the highest BIC, suggesting it is overfitting due to complexity.

Market Basket Analysis

The association analysis offers great insights into customer purchases and helps track those natural bundles of items purchased together.
The association rules it has generated clearly identify strong relationships between items in a menu item based on metrics such as support, confidence and lift.

library(arules)
library(arulesViz)
df <- read.csv("D:/Download D/combined_item_sales-reformatdate.xlsx - Sheet1.csv")
head(df)

##             Date                                    Name          Category
## 1 1/11/2023 7:54 Curry Chicken Potato with Turmeric Rice          Rice Set
## 2 1/11/2023 7:54                Shredded Chicken Hor Fun           Noodles
## 3 1/11/2023 8:02                      Nyonya Kuih (3pcs)  Local Delicacies
## 4 1/11/2023 8:06                    Hot Hainanese Coffee        Hot Drinks
## 5 1/11/2023 8:06           Hainanese Butter & Kaya Toast  Breakfast Series
## 6 1/11/2023 8:06           Hainanese Butter & Kaya Toast  Breakfast Series
##   Price Qty Total  Net                           Device Location   Order.No
## 1  13.9   1  13.9 13.9        H Restaurant Puchong iPad  Puchong 231101X98H
## 2  12.9   1  12.9 12.9        H Restaurant Puchong iPad  Puchong 231101X98H
## 3   6.9   3  20.7 20.7 H Restaurant Puchong Pastry iPad  Puchong 231101XSGR
## 4   4.5   1   4.5  4.5 H Restaurant Puchong Pastry iPad  Puchong 231101XSGR
## 5   3.5   2   7.0  7.0 H Restaurant Puchong Pastry iPad  Puchong 231101XSGR
## 6   3.5   1   3.5  3.5 H Restaurant Puchong Pastry iPad  Puchong 231101XSGR

Change the dataset to transactional dataset format

transactions <- split(df$Name, df$Order.No)
transactions <- lapply(transactions, function(x) unique(x))
transactions_with_int_ids <- setNames(transactions, as.character(1:length(transactions)))
trans <- as(transactions_with_int_ids, "transactions")
inspect(trans[1:5])

##     items                                       transactionID
## [1] {Ice Hainanese Coffee,                                   
##      Kampung Style Fried Meehoon}                           1
## [2] {Brown Rice,                                             
##      Claypot Lou Shu Fun,                                    
##      Fried Egg,                                              
##      GonLou Dry Noodles with BBQ Chicken,                    
##      Hainan Roasted & White-Cut Chicken Rice,                
##      Hainanese Butter & Kaya Toast,                          
##      Luncheon VMeat,                                         
##      Mango Feast,                                            
##      O Hot Hainanese Coffee O,                               
##      Signature Nasi Lemak with Rendang Chicken}             2
## [3] {Cold Soymilk,                                           
##      Loh Mee,                                                
##      Passion Fruit Juice,                                    
##      Signature Hainan Roasted Chicken Rice}                 3
## [4] {Black sesame Bao,                                       
##      Ginger Mushroom Bao}                                   4
## [5] {Hot Hainanese Coffee,                                   
##      Shredded Chicken Hor Fun,                               
##      Signature VChar Siew Rice}                             5

frequentItems <- eclat (trans, parameter = list(supp = 0.07, maxlen = 15))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.07      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 2108 
## 
## create itemset ... 
## set transactions ...[302 item(s), 30116 transaction(s)] done [0.02s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating sparse bit matrix ... [10 row(s), 30116 column(s)] done [0.02s].
## writing  ... [10 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

Item Frequency Plot Interpretation

itemFrequencyPlot(trans, topN=10, type="absolute", main="Item Frequency")

1. Hainanese Butter & Kaya Toast, which is a Malaysian authentic favourite breakfast shows the top 1 sale in the restaurant, this may be caused by the affordable price and light meal to be breakfast for all customers and a side for lunch. 2. The sales followed by lunch main dish such as nasi lemak with rendang chicken and curry mee hoon & noodles. This indicates that these two foods are the favourite lunch option in the restaurant. 3. Beverage such as hot Hainanese coffee and iced Hainanese coffee are the top beverages in the restaurant as it is also an authentic beverage that is suitable for breakfast and lunch.

Association rules with minimum support 0.01 and minimum lift 0.5

frequent_itemsets <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 301 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[302 item(s), 30116 transaction(s)] done [0.03s].
## sorting and recoding items ... [87 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(frequent_itemsets)

##     lhs                                            rhs                                support confidence   coverage     lift count
## [1] {Hot Milo}                                  => {Hainanese Butter & Kaya Toast} 0.01215301  0.5069252 0.02397397 2.211904   366
## [2] {Half-boiled Kampung Egg (2pcs)}            => {Hainanese Butter & Kaya Toast} 0.06999602  0.6374357 0.10980874 2.781370  2108
## [3] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Pangkor Shell Curry Puff}                  => {Hainanese Butter & Kaya Toast} 0.01006110  0.6433121 0.01563953 2.807011   303
## [4] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Ice Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.01646965  0.6657718 0.02473768 2.905011   496
## [5] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Hot Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.02128437  0.7226607 0.02945278 3.153238   641
## [6] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Special Curry Mee Hoon & Noodles}          => {Hainanese Butter & Kaya Toast} 0.01361403  0.6602254 0.02062027 2.880810   410
## [7] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413  0.6364764 0.02676318 2.777184   513

View the association rules sorting by confidence, lift and support

arules::inspect(frequent_itemsets)

##     lhs                                            rhs                                support confidence   coverage     lift count
## [1] {Hot Milo}                                  => {Hainanese Butter & Kaya Toast} 0.01215301  0.5069252 0.02397397 2.211904   366
## [2] {Half-boiled Kampung Egg (2pcs)}            => {Hainanese Butter & Kaya Toast} 0.06999602  0.6374357 0.10980874 2.781370  2108
## [3] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Pangkor Shell Curry Puff}                  => {Hainanese Butter & Kaya Toast} 0.01006110  0.6433121 0.01563953 2.807011   303
## [4] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Ice Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.01646965  0.6657718 0.02473768 2.905011   496
## [5] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Hot Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.02128437  0.7226607 0.02945278 3.153238   641
## [6] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Special Curry Mee Hoon & Noodles}          => {Hainanese Butter & Kaya Toast} 0.01361403  0.6602254 0.02062027 2.880810   410
## [7] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413  0.6364764 0.02676318 2.777184   513

rules_conf <- sort(frequent_itemsets, by="confidence", decreasing=TRUE)
inspect(head(rules_conf))

##     lhs                                            rhs                                support confidence   coverage     lift count
## [1] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Hot Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.02128437  0.7226607 0.02945278 3.153238   641
## [2] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Ice Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.01646965  0.6657718 0.02473768 2.905011   496
## [3] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Special Curry Mee Hoon & Noodles}          => {Hainanese Butter & Kaya Toast} 0.01361403  0.6602254 0.02062027 2.880810   410
## [4] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Pangkor Shell Curry Puff}                  => {Hainanese Butter & Kaya Toast} 0.01006110  0.6433121 0.01563953 2.807011   303
## [5] {Half-boiled Kampung Egg (2pcs)}            => {Hainanese Butter & Kaya Toast} 0.06999602  0.6374357 0.10980874 2.781370  2108
## [6] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413  0.6364764 0.02676318 2.777184   513

rules_lift <- sort(frequent_itemsets, by="lift", decreasing=TRUE)
inspect(head(rules_lift))

##     lhs                                            rhs                                support confidence   coverage     lift count
## [1] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Hot Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.02128437  0.7226607 0.02945278 3.153238   641
## [2] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Ice Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.01646965  0.6657718 0.02473768 2.905011   496
## [3] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Special Curry Mee Hoon & Noodles}          => {Hainanese Butter & Kaya Toast} 0.01361403  0.6602254 0.02062027 2.880810   410
## [4] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Pangkor Shell Curry Puff}                  => {Hainanese Butter & Kaya Toast} 0.01006110  0.6433121 0.01563953 2.807011   303
## [5] {Half-boiled Kampung Egg (2pcs)}            => {Hainanese Butter & Kaya Toast} 0.06999602  0.6374357 0.10980874 2.781370  2108
## [6] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413  0.6364764 0.02676318 2.777184   513

rules_support <- sort(frequent_itemsets, by="support", decreasing=TRUE)
inspect(head(rules_support))

##     lhs                                            rhs                                support confidence   coverage     lift count
## [1] {Half-boiled Kampung Egg (2pcs)}            => {Hainanese Butter & Kaya Toast} 0.06999602  0.6374357 0.10980874 2.781370  2108
## [2] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Hot Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.02128437  0.7226607 0.02945278 3.153238   641
## [3] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Signature Nasi Lemak with Rendang Chicken} => {Hainanese Butter & Kaya Toast} 0.01703413  0.6364764 0.02676318 2.777184   513
## [4] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Ice Hainanese Coffee}                      => {Hainanese Butter & Kaya Toast} 0.01646965  0.6657718 0.02473768 2.905011   496
## [5] {Half-boiled Kampung Egg (2pcs),                                                                                              
##      Special Curry Mee Hoon & Noodles}          => {Hainanese Butter & Kaya Toast} 0.01361403  0.6602254 0.02062027 2.880810   410
## [6] {Hot Milo}                                  => {Hainanese Butter & Kaya Toast} 0.01215301  0.5069252 0.02397397 2.211904   366

Summary of Important Rules

Summary Interpretation

# View the data frame
print(rules_table)

##   Rules                                                Antecedent
## 1     1                                   Half-boiled Kampung Egg
## 2     2                                      Hot Hainanese Coffee
## 3     3             Half-boiled Kampung Egg, Hot Hainanese Coffee
## 4     4 Half-boiled Kampung Egg, Special Curry Mee Hoon & Noodles
## 5     5         Half-boiled Kampung Egg, Pangkor Shell Curry Puff
## 6     6  Half-boiled Kampung Egg, Nasi Lemak with Rendang Chicken
## 7     7                                                Brown Rice
## 8     8                                     Homemade Pumpkin Tofu
##                      Consequent Support Confidence   Lift
## 1 Hainanese Butter & Kaya Toast   0.070      0.637  2.781
## 2 Hainanese Butter & Kaya Toast   0.062      0.415  1.809
## 3 Hainanese Butter & Kaya Toast   0.021      0.723  3.153
## 4 Hainanese Butter & Kaya Toast   0.014      0.660  2.881
## 5 Hainanese Butter & Kaya Toast   0.010      0.644  2.810
## 6 Hainanese Butter & Kaya Toast   0.017      0.636  2.777
## 7         Homemade Pumpkin Tofu   0.018      0.463 12.371
## 8                    Brown Rice   0.018      0.476 12.371

Conclusion

In Sales Forecasting, SVM outperforms other algorithms such as decision tree, random forest and linear regression in predicting the next day’s sales.
Customer Segmentation Analysis in overall shows Hierarchical Clustering has the lowest BIC and slightly better Dunn Index than the other methods. This suggests it produces relatively better-separated clusters with balanced complexity. K-Means has the highest CH Index, indicating well-defined and compact clusters. However, its Dunn Index and BIC are less favorable compared to Hierarchical Clustering. GMM performs poorly on all metrics, with very low Dunn and CH Index values. The high BIC suggests that it overfits due to the complexity of the model.
Market Basket Analysis shows huge potential for the restaurant to optimize bundle promotions. Breakfast bundle and lunch set could be considered based on the rules by association analysis as bundling higher-value items can significantly boost the sales.

WQD7004 Group Assignment