4.1 Data Import & Setup

library(tidyverse)
library(lubridate)
library(skimr)
library(dlookr)
library(scales)

## Load Data
retail_raw <- read_csv("Online Retail.csv")

cat("Dataset dimensions:", nrow(retail_raw), "rows ×", ncol(retail_raw), "variables\n\n")

## Dataset dimensions: 541909 rows × 8 variables

##Confirming correct import
head(retail_raw, 10)

## # A tibble: 10 × 8
##    InvoiceNo StockCode Description     Quantity InvoiceDate UnitPrice CustomerID
##    <chr>     <chr>     <chr>              <dbl> <chr>           <dbl>      <dbl>
##  1 536365    85123A    WHITE HANGING …        6 01/12/2010…      2.55      17850
##  2 536365    71053     WHITE METAL LA…        6 01/12/2010…      3.39      17850
##  3 536365    84406B    CREAM CUPID HE…        8 01/12/2010…      2.75      17850
##  4 536365    84029G    KNITTED UNION …        6 01/12/2010…      3.39      17850
##  5 536365    84029E    RED WOOLLY HOT…        6 01/12/2010…      3.39      17850
##  6 536365    22752     SET 7 BABUSHKA…        2 01/12/2010…      7.65      17850
##  7 536365    21730     GLASS STAR FRO…        6 01/12/2010…      4.25      17850
##  8 536366    22633     HAND WARMER UN…        6 01/12/2010…      1.85      17850
##  9 536366    22632     HAND WARMER RE…        6 01/12/2010…      1.85      17850
## 10 536367    84879     ASSORTED COLOU…       32 01/12/2010…      1.69      13047
## # ℹ 1 more variable: Country <chr>

glimpse(retail_raw)

## Rows: 541,909
## Columns: 8
## $ InvoiceNo   <chr> "536365", "536365", "536365", "536365", "536365", "536365"…
## $ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752", …
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN…
## $ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, 3, …
## $ InvoiceDate <chr> "01/12/2010 08:26", "01/12/2010 08:26", "01/12/2010 08:26"…
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1.69…
## $ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17…
## $ Country     <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…

4.2 Data Diagnostics

Overall Diagnosis

# Overall diagnosis
diagnose(retail_raw) %>%
  arrange(desc(missing_count)) %>%
  select(variables, types, missing_count, missing_percent, unique_count)

## # A tibble: 8 × 5
##   variables   types     missing_count missing_percent unique_count
##   <chr>       <chr>             <int>           <dbl>        <int>
## 1 CustomerID  numeric          135080          24.9           4373
## 2 Description character          1454           0.268         4212
## 3 InvoiceNo   character             0           0            25900
## 4 StockCode   character             0           0             4070
## 5 Quantity    numeric               0           0              722
## 6 InvoiceDate character             0           0            23260
## 7 UnitPrice   numeric               0           0             1630
## 8 Country     character             0           0               38

Key Data Quality Observations:

The dataset contains 541,909 transactions with 8 variables. The most notable data quality issue is 135,080 missing CustomerID values (24.93% of records), which limits customer-level analysis. These missing identifiers likely represent guest checkouts.

Additionally, 1,454 records (0.27%) have missing product descriptions, though this is relatively minor. All other core fields are complete with no missing values.

The dataset has 25,900 unique invoice numbers, 4,070 distinct stock codes, 4,212 different product descriptions, and 4,373 unique customers. Geographic diversity is limited to 38 distinct countries.

Numeric Diagnostics

diagnose_numeric(retail_raw) %>%
  select(variables, min, mean, median, max, zero, minus)

## # A tibble: 3 × 7
##   variables      min     mean   median   max  zero minus
##   <chr>        <dbl>    <dbl>    <dbl> <dbl> <int> <int>
## 1 Quantity   -80995      9.55     3    80995     0 10624
## 2 UnitPrice  -11062.     4.61     2.08 38970  2515     2
## 3 CustomerID  12346  15288.   15152    18287     0     0

Outlier Detection

diagnose_outlier(retail_raw) %>%
  filter(outliers_cnt > 0) %>%
  arrange(desc(outliers_ratio))

## # A tibble: 2 × 6
##   variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>            <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 Quantity         58619          10.8           50.6      9.55         4.57
## 2 UnitPrice        39627           7.31          31.1      4.61         2.52

Missing Values

# Missing values
colSums(is.na(retail_raw))

##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0        1454           0           0           0 
##  CustomerID     Country 
##      135080           0

round(colMeans(is.na(retail_raw)) * 100, 2)

##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##        0.00        0.00        0.27        0.00        0.00        0.00 
##  CustomerID     Country 
##       24.93        0.00

Missing Value Impact:

CustomerID is missing for nearly one-quarter of all transactions (24.93%). Customer segmentation, lifetime value analysis, and repeat purchase behavior can only be assessed for the 75% of transactions with valid customer identifiers. The minimal missing descriptions (0.27%) have negligible impact on product-level analysis.

Returns and Cancellations

# Negative quantities (returns or cancellations)
neg_qty_summary <- retail_raw %>%
  summarise(
    negative_rows = sum(Quantity < 0, na.rm = TRUE),
    pct_negative  = round(mean(Quantity < 0, na.rm = TRUE) * 100, 3)
  )
neg_qty_summary

## # A tibble: 1 × 2
##   negative_rows pct_negative
##           <int>        <dbl>
## 1         10624         1.96

Returns and Cancellations:

The dataset contains 10,624 transactions with negative quantities (1.96% of all records). These represent product returns or order cancellations and are identified by invoice numbers beginning with “C”. While relatively small in proportion, these cancellations must be handled separately in revenue calculations to avoid misstatement.

Price Issues

# Zero or negative Unit Price
price_issue_summary <- retail_raw %>%
  summarise(
    zero_or_negative   = sum(UnitPrice <= 0, na.rm = TRUE),
    pct_zero_negative  = round(mean(UnitPrice <= 0, na.rm = TRUE) * 100, 3)
  )
price_issue_summary

## # A tibble: 1 × 2
##   zero_or_negative pct_zero_negative
##              <int>             <dbl>
## 1             2517             0.464

Pricing Value Issues:

There are 2,517 records (0.464%) with zero or negative unit prices. These records will be filtered out during data preparation to ensure revenue calculations are based only on valid, positive prices.

# Duplicate rows
total_rows   <- nrow(retail_raw)
unique_rows  <- nrow(distinct(retail_raw))
duplicate_rows <- total_rows - unique_rows

cat("Total rows:", total_rows, "\n")

## Total rows: 541909

cat("Unique rows:", unique_rows, "\n")

## Unique rows: 536641

cat("Duplicate rows:", duplicate_rows, "\n")

## Duplicate rows: 5268

cat("Duplicate percentage:", round((duplicate_rows / total_rows) * 100, 2), "%\n")

## Duplicate percentage: 0.97 %

Duplicate Records:

The dataset contains 5,268 exact duplicate rows (0.97% of total records). These could be from system errors, double-entry, or data export issues rather than legitimate repeat transactions (which would have different invoice numbers or timestamps). These duplicates will be removed during data cleaning.

Date Range

# Check Date range
date_range <- retail_raw %>%
  summarise(
    min_date = min(dmy_hm(InvoiceDate), na.rm = TRUE),
    max_date = max(dmy_hm(InvoiceDate), na.rm = TRUE)
  )
date_range

## # A tibble: 1 × 2
##   min_date            max_date           
##   <dttm>              <dttm>             
## 1 2010-12-01 08:26:00 2011-12-09 12:50:00

Date Check:

The dataset spans over one year, from December 1, 2010 to December 9, 2011 in line with expectations.

4.3 Data Preparation

retail_clean <- retail_raw %>%
  distinct() %>%                          #removes exact duplicates
  mutate(
    InvoiceDate = dmy_hm(InvoiceDate),
    Date        = as.Date(InvoiceDate),
    Year        = year(InvoiceDate),
    Month       = month(InvoiceDate, label = TRUE),
    YearMonth   = floor_date(InvoiceDate, "month"),
    DayOfWeek   = wday(InvoiceDate, label = TRUE, week_start = 1),
    Hour        = hour(InvoiceDate),
    Country     = str_trim(Country),
    IsCancelled = str_starts(InvoiceNo, "C"),
    LineValue   = Quantity * UnitPrice     # can be negative
  ) %>%
  filter(UnitPrice > 0) %>%               # drop price <= 0
  mutate(IsUK = Country == "United Kingdom") ## true/false flag if needed

# Transactions for main sales analysis (complete, positive quantity)
retail_actuals <- retail_clean %>%
  filter(!IsCancelled, Quantity > 0) %>%
  mutate(Revenue = Quantity * UnitPrice)

# Transactions for customer analysis
retail_customers <- retail_actuals %>%
  filter(!is.na(CustomerID))

cat("Rows after cleaning:", nrow(retail_clean), "\n")

## Rows after cleaning: 534129

cat("Rows in sales dataset (no cancellations, positive qty):", nrow(retail_actuals), "\n")

## Rows in sales dataset (no cancellations, positive qty): 524878

cat("Rows with valid CustomerID:", nrow(retail_customers), "\n")

## Rows with valid CustomerID: 392692

Data Cleaning Steps:

The cleaning process creates three datasets to help with analysis:

retail_clean (534,129 rows): Removes 5,268 duplicate records and 2,517 transactions with invalid prices (≤0). Parses invoice dates into proper datetime format and extracts components. Identifies invoices starting with “C” for cancellations, calculates line amounts and adds a UK/non-UK indicator.
retail_actuals (524,878 rows): Filters transactions by excluding cancellations and negative quantities.
retail_customers (392,692 rows): Filters retail_actuals to include only transactions with valid CustomerID values. This 75% subset allows customer segmentation.

4.4 Question 1: Most popular products and Sales over time

Top 10 Products by Quantity Sold

main_col <- "#1b9e77"
top_qty <- retail_actuals %>%
  group_by(StockCode, Description) %>%
  summarise(
    TotalQty = sum(Quantity),
    TotalRevenue = sum(Revenue),
    .groups = "drop"
  ) %>%
  slice_max(TotalQty, n = 10) %>%
  arrange(TotalQty)

##Plot
ggplot(top_qty, aes(
  x = TotalQty,
  y = fct_reorder(Description, TotalQty)
)) +
  geom_col(fill = main_col) +
  geom_text(
    aes(label = scales::comma(TotalQty, accuracy = 1)),
    hjust = 1.1, ##for label inside bar
    color = "white",
    size = 3.5
  ) +
  labs(
    title = "Top 10 Products by Quantity Sold",
    x = "Units Sold",
    y = NULL
  ) +
  theme_minimal()

Top 10 products by Quantity Sold

Observations: Home decor and craft items appear to dominate sales by volume with PAPER CRAFT, LITTLE BIRDIE leading at 80,995 units.

Top 10 Products by Revenue

top_rev <- retail_actuals %>%
  group_by(StockCode, Description) %>%
  summarise(
    TotalQty = sum(Quantity),
    TotalRevenue = sum(Revenue),
    .groups = "drop"
  ) %>%
  slice_max(TotalRevenue, n = 10) %>%
  arrange(TotalRevenue)

##Plot
ggplot(top_rev, aes(
  x = TotalRevenue,
  y = fct_reorder(Description, TotalRevenue)
)) +
  geom_col(fill = main_col) +
  geom_text(
    aes(label = scales::dollar_format(prefix = "£", accuracy = 1)(TotalRevenue)),
    hjust = 1.1,       
    color = "white",
    size = 3.5
  ) +
  scale_x_continuous(labels = scales::dollar_format(prefix = "£")) +
  labs(
    title = "Top 10 Products by Revenue",
    x = "Revenue (£)",
    y = NULL
  ) +
  theme_minimal()

$Top 10 products by Revenue Generated$

Top 10 products by Revenue Generated

Observations: Revenue leaders differ from volume leaders—DOTCOM POSTAGE (£206k) and POSTAGE (£78k) could possibly be shipping charges rather than product sales, inflating reported revenue.

Excluding shipping, the REGENCY CAKESTAND 3 TIER (£174k) leads in product revenue despite lower volume, indicating higher-priced items drive different revenue dynamics than the high-volume, lower-priced craft items dominating quantity sales.

Monthly Revenue Trend

monthly_sales <- retail_actuals %>%
  group_by(YearMonth) %>%
  summarise(MonthlyRevenue = sum(Revenue), .groups = "drop")

##Plot
ggplot(monthly_sales, aes(x = YearMonth, y = MonthlyRevenue)) +
  geom_line(color = main_col, linewidth = 1) +
  geom_point(color = main_col, size = 2) +
  scale_y_continuous(labels = scales::dollar_format(prefix = "£")) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") + ##utf8 error fix
  labs(
    title = "Monthly Revenue Trend",
    x = NULL,
    y = "Revenue (£)"
  ) +
  theme_minimal()

Monthly revenue showing seasonal patterns and growth trends

Observations: Clear seasonality with strong Q4 growth peaking at £1.5M in November 2011, driven by holiday shopping demand. The sharp December 2011 drop reflects incomplete data as dataset ends Dec 9. Revenue shows overall upward trajectory through 2011, with a summer plateau (May-July around £750k) before accelerating into peak season—typical retail seasonality.

Revenue by Hour of Day

hour_rev <- retail_actuals %>%
  group_by(Hour) %>%
  summarise(Revenue = sum(Revenue), .groups = "drop")

##Plot
ggplot(hour_rev, aes(x = Hour, y = Revenue)) +
  geom_line(color = main_col, linewidth = 1.1) +
  geom_point(color = main_col, size = 2) +
  scale_x_continuous(breaks = 0:23) +
  scale_y_continuous(
    labels = scales::dollar_format(prefix = "£"),
    breaks = scales::pretty_breaks(n = 6),
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = "Revenue Pattern by Hour of Day",
    x = "Hour of Day",
    y = "Revenue (£)"
  ) +
  theme_minimal(base_size = 13)

Revenue Distribution by Hour of Day

Observations:

Revenue concentrates during business hours (10am-3pm), with minimal activity after 6pm and virtually none overnight. This pattern strongly indicates B2B operations—likely wholesale customers ordering during their own business hours—rather than B2C consumer shopping which would show evening and weekend activity.

4.4.1 AI-Assisted

I used ChatGPT to improve the visual appearance of the hourly revenue line chart. It suggested showing all 24 hours explicitly using scale_x_continuous(breaks = 0:23) rather than letting R choose arbitrary intervals, applying expansion(mult = c(0, 0.05)) to remove unnecessary bottom padding while maintaining top margin, and using scales::pretty_breaks(n = 6) for cleaner y-axis intervals,along with consistent currency formatting using scales::dollar_format(prefix = “£”) and fixing a UTF-8 error for the GBP symbol

4.5 Question 2: Customer Segmentation

customer_rev <- retail_customers %>%
  group_by(CustomerID) %>%
  summarise(
    TotalRevenue = sum(Revenue),
    Orders       = n_distinct(InvoiceNo),
    TotalQty     = sum(Quantity),
    .groups = "drop"
  ) %>%
  arrange(desc(TotalRevenue))

customer_rev <- customer_rev %>%
  mutate(
    Rank       = row_number(),
    CumRevenue = cumsum(TotalRevenue),
    CumShare   = CumRevenue / sum(TotalRevenue),
    Percentile = Rank / n()
  )

## Quick Check
head(customer_rev, 10)

## # A tibble: 10 × 8
##    CustomerID TotalRevenue Orders TotalQty  Rank CumRevenue CumShare Percentile
##         <dbl>        <dbl>  <int>    <dbl> <int>      <dbl>    <dbl>      <dbl>
##  1      14646      280206.     73   196915     1    280206.   0.0315   0.000231
##  2      18102      259657.     60    64124     2    539863.   0.0607   0.000461
##  3      17450      194391.     46    69973     3    734254.   0.0826   0.000692
##  4      16446      168472.      2    80997     4    902727.   0.102    0.000922
##  5      14911      143711.    201    80240     5   1046438.   0.118    0.00115 
##  6      12415      124915.     21    77374     6   1171352.   0.132    0.00138 
##  7      14156      117210.     55    57768     7   1288562.   0.145    0.00161 
##  8      17511       91062.     31    64549     8   1379625.   0.155    0.00184 
##  9      16029       80851.     63    40108     9   1460476.   0.164    0.00207 
## 10      12346       77184.      1    74215    10   1537659.   0.173    0.00231

Customer Revenue Distribution (Linear Scale)

ggplot(customer_rev, aes(x = TotalRevenue)) +
  geom_histogram(fill = main_col, bins = 50) +
  scale_x_continuous(labels = scales::dollar_format(prefix = "£")) +
  labs(
    title = "Customer Revenue Distribution (Linear Scale)",
    x = "Total Revenue per Customer",
    y = "Number of Customers"
  ) +
  theme_minimal()

Customer revenue distribution on linear scale showing concentration

Observations: Extreme right-skewed distribution with the vast majority of customers clustered at low revenue levels and a minority contributing disproportionately

Customer Revenue Distribution (Log Scale)

ggplot(customer_rev, aes(x = TotalRevenue)) +
  geom_histogram(fill = main_col, bins = 50) +
  scale_x_log10(labels = scales::dollar_format(prefix = "£")) +
  labs(
    title = "Customer Revenue Distribution (Log Scale)",
    x = "Total Revenue (log10)",
    y = "Number of Customers"
  ) +
  theme_minimal()

Customer revenue distribution on log scale revealing full range

Observations: Applying a log scale reveals the underlying structure of customer revenue more clearly. We can see revenue spans several orders of magnitude, confirming that customers differ widely in purchasing intensity. This long-tail pattern is typical of wholesale/online retail settings and suggests that only a minority of customers generate most of the value.

The transformation reveals a roughly normal distribution.

Pareto Analysis

customer_rev <- customer_rev %>%
  mutate(
    Segment = case_when(
      Percentile <= 0.10 ~ "Top 10% (Key Accounts)",
      Percentile <= 0.30 ~ "Next 20% (High Value)",
      Percentile <= 0.60 ~ "Next 30% (Mid Value)",
      TRUE               ~ "Bottom 40% (Low Value)"
    )
  )

segment_summary <- customer_rev %>%
  group_by(Segment) %>%
  summarise(
    Customers = n(),
    SegmentRevenue = sum(TotalRevenue),
    .groups = "drop"
  ) %>%
  mutate(
    RevenueShare = SegmentRevenue / sum(SegmentRevenue)
  ) %>%
  arrange(desc(RevenueShare)) %>%
  mutate(
    Segment = factor(Segment, levels = Segment),
    CumShare = cumsum(RevenueShare)
  )
segment_summary

## # A tibble: 4 × 5
##   Segment                Customers SegmentRevenue RevenueShare CumShare
##   <fct>                      <int>          <dbl>        <dbl>    <dbl>
## 1 Top 10% (Key Accounts)       433       5457734.       0.614     0.614
## 2 Next 20% (High Value)        868       1903348.       0.214     0.828
## 3 Next 30% (Mid Value)        1301       1083199.       0.122     0.950
## 4 Bottom 40% (Low Value)      1736        442929.       0.0498    1

# Compute sample size + total revenue for caption
n_customers <- nrow(customer_rev)
total_rev <- sum(customer_rev$TotalRevenue)

## plot
ggplot(segment_summary, aes(x = Segment)) +
  
  # bars showing share of total revenue
  geom_col(aes(y = RevenueShare), fill = main_col) +
  
  # bar labels inside
  geom_text(
    aes(y = RevenueShare - 0.02,
        label = scales::percent(RevenueShare, accuracy = 1)),
    colour = "white",
    fontface = "bold",
    size = 4
  ) +
  
  # cumulative line
  geom_line(aes(y = CumShare, group = 1),
            colour = "black", linewidth = 1) +
  geom_point(aes(y = CumShare),
             colour = "black", size = 2) +
  
  # cumulative labels top of line
  geom_text(
    aes(y = CumShare + 0.03,
        label = scales::percent(CumShare, accuracy = 1)),
    size = 4
  ) +
  
  scale_y_continuous(
    breaks = c(0, 0.25, 0.50, 0.75, 1.00),
    labels = scales::percent_format(accuracy = 1),
    limits = c(0, 1.15)
  )+
  labs(
    title = "Pareto Chart: Customer Segments and Revenue Contribution",
    x = "Customer Segment",
    y = "Share of Total Revenue (%)",
    ) +
  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 11)
  )

Pareto Chart showing Revenue Concentration across Customer Segments

Observations: Decided to segment customers based on their share of revenue contribution.

The Pareto chart shows clear value differentiation:

The Top 10% (Key Accounts) segment contributes a disproportionately large share of revenue.

The Next 20% still delivers substantial value but far less than the top tier.

The Next 30% are moderate buyers with potential for growth.

The Bottom 40% contribute minimal revenue despite forming the largest group.

Customer purchasing behaviour varies dramatically, with a small number of customers generating most of the revenue.

A Pareto-based segmentation captures these differences clearly and offers practical groups for targeted commercial efforts.

4.5.1 AI-Assisted

I asked ChatGPT to guide me through the structural approach of overlaying a cumulative line on top of bar charts position labels strategically placing percentage labels inside bars, while positioning cumulative percentages above the line to avoid overlaps. ChatGPT also suggested setting explicit limits to accommodate the upper labels, using breaks = c(0, 0.25, 0.50, 0.75, 1.00) for clear percentage intervals, and adding a dynamic caption with paste0() to display total customers and revenue.

4.6 Question 3: Country-Specific Patterns

UK vs International Markets

uk_split <- retail_actuals %>%
  group_by(IsUK) %>%
  summarise(Revenue = sum(Revenue), .groups = "drop") %>%
  mutate(
    Region = if_else(IsUK == TRUE, "United Kingdom", "Other Countries"),
    Share = Revenue / sum(Revenue)
  )
uk_split

## # A tibble: 2 × 4
##   IsUK   Revenue Region          Share
##   <lgl>    <dbl> <chr>           <dbl>
## 1 FALSE 1640367. Other Countries 0.154
## 2 TRUE  9001744. United Kingdom  0.846

ggplot(uk_split, aes(x = Region, y = Revenue)) +
  geom_col(fill = main_col) +
  
  # % in bar
  geom_text(
    aes(label = scales::percent(Share, accuracy = 0.1)),
    vjust = 1.4,
    colour = "white",
    fontface = "bold",
    size = 4
  ) +
  
  # £ top of bar (UTF-8 safe)
  geom_text(
    aes(label = scales::dollar_format(prefix = "\u00A3")(Revenue)),
    vjust = -0.4,
    size = 4
  ) +
  
  scale_y_continuous(
    labels = scales::label_dollar(prefix = "\u00A3"),
    expand = expansion(mult = c(0, 0.15))
  ) +
  
  labs(
    title = "Revenue Contribution: UK vs Non-UK",
    x = "",
    y = "Revenue (\u00A3)"
  ) +
  
  theme_minimal()

Revenue split between UK and international markets

Observations:

The UK generates 84.6% of revenue versus just 15.4% internationally.

This is expected as the retailer is UK-based, UK orders will likely have lower shipping frictions and a larger customer base.

However, over dependency on one market could pose business risks as well.

Top 10 Non-UK Countries by Revenue

top_non_uk <- retail_actuals %>%
  filter(Country != "United Kingdom") %>%
  group_by(Country) %>%
  summarise(Revenue = sum(Revenue)) %>%
  mutate(Share = Revenue / sum(Revenue)) %>%
  slice_max(Revenue, n = 10) %>%
  arrange(Revenue)

ggplot(top_non_uk, aes(x = reorder(Country, Revenue), y = Revenue)) +
  geom_col(fill = main_col) +
  
  # % label inside bar
  geom_text(
    aes(label = scales::percent(Share, accuracy = 0.1)),
    hjust = 1.15,
    colour = "white",
    fontface = "bold",
    size = 3.6
  ) +
  
  # £ top bar
  geom_text(
    aes(label = scales::dollar_format(prefix = "\u00A3")(Revenue)),
    hjust = -0.15,
    size = 3.6
  ) +
  
  coord_flip() +
  
  scale_y_continuous(
    labels = scales::label_dollar(prefix = "\u00A3"),
    expand = expansion(mult = c(0, 0.20))
  ) +
  
  labs(
    title = "Top 10 Non-UK Countries by Revenue",
    subtitle = "Analysis period: Dec 2010 – Dec 2011",
    x = "Country",
    y = "Revenue (\u00A3)" ##for UTF8 error
  ) +
  
  theme_minimal() +
  theme(panel.grid = element_blank())

Leading non-UK markets by revenue contribution

Observations:

Among non-UK markets, neighboring European countries dominate: Netherlands (17.4%) and Ireland (17.3%) lead, followed by Germany (13.9%, £229k) and France (12.8%, £210k). These four countries capture 61% of all international sales, suggesting geographic distance facilitates business. Notably, Australia ranks 5th despite geographic distance (8.4%, £138k).

Average Quantity per Order by Country

non_uk_patterns <- retail_actuals %>%
  filter(Country != "United Kingdom") %>%
  group_by(Country) %>%
  summarise(
    Orders = n_distinct(InvoiceNo),
    TotalQty = sum(Quantity),
    AvgQtyPerOrder = TotalQty / Orders
  ) %>%
  slice_max(AvgQtyPerOrder, n = 10) %>%
  arrange(AvgQtyPerOrder)

ggplot(non_uk_patterns, 
       aes(x = reorder(Country, AvgQtyPerOrder), y = AvgQtyPerOrder)) +
  geom_col(fill = main_col) +
  
  geom_text(
    aes(label = round(AvgQtyPerOrder, 1)),
    hjust = -0.1,
    size = 3.8
  ) +
  
  coord_flip() +
  
  labs(
    title = "Top 10 Non-UK Countries by Average Quantity per Order",
    subtitle = "Analysis period: Dec 2010 – Dec 2011",
    x = "Country",
    y = "Avg Quantity per Order"
  ) +
  
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.ticks = element_blank()
  )

Average order size varies significantly by market

Observations: Certain countries buy larger quantities per order. This reinforces a bulk-purchasing pattern where there are fewer orders but much larger baskets. These markets behave differently from the UK, which has many orders smaller quantities more retail-like behaviour.

Non-UK Market Summary

non_uk_summary <- retail_actuals %>%
  filter(Country != "United Kingdom") %>%
  group_by(Country) %>%
  summarise(
    Revenue = sum(Revenue),
    Orders = n_distinct(InvoiceNo),
    TotalQty = sum(Quantity),
    AOV = Revenue / Orders,
    AvgQtyPerOrder = TotalQty / Orders
  ) %>%
  arrange(desc(Revenue))

head(non_uk_summary, 10)

## # A tibble: 10 × 6
##    Country     Revenue Orders TotalQty   AOV AvgQtyPerOrder
##    <chr>         <dbl>  <int>    <dbl> <dbl>          <dbl>
##  1 Netherlands 285446.     94   200361 3037.          2132.
##  2 EIRE        283141.    288   147007  983.           510.
##  3 Germany     228678.    457   119154  500.           261.
##  4 France      209625.    392   112060  535.           286.
##  5 Australia   138454.     57    83891 2429.          1472.
##  6 Spain        61559.     90    27933  684.           310.
##  7 Switzerland  57068.     54    30617 1057.           567.
##  8 Belgium      41196.     98    23237  420.           237.
##  9 Sweden       38368.     36    36078 1066.          1002.
## 10 Japan        37416.     19    26016 1969.          1369.

Observations: Order size patterns reveal distinct customer profiles by market. Customers outside Europe consolidate orders into larger baskets.

Some smaller countries place fewer but significantly higher-value orders. Countries like Switzerland, Australia, and Japan typically have higher AOV, higher shipping costs, bulk purchases rather than frequent small orders. This suggests unique purchasing patterns driven by logistics.

Netherlands leads dramatically with 2,131 units per order (just 94 orders generating £285k)—clearly wholesale/bulk purchasing.

Similarly, Australia (1,472 units/order) and Japan (1,369 units/order) exhibit bulk-buying behavior.

In contrast, revenue leaders like Germany and France show retail-oriented patterns with lower quantities per order (261 and 286 respectively) but higher order frequency (457 and 392 orders).

4.7 Question 4 (Bonus): Cancellations impact

Quarterly Waterfall Analysis

quarterly_cancel <- retail_clean %>%
  mutate(
    QuarterLabel = paste0("Q", quarter(Date), " ", year(Date)),
    QuarterDate  = floor_date(Date, "quarter"),
    
    # correctsigns:
    SalesAmount      = if_else(!IsCancelled & Quantity > 0, LineValue, 0),
    CancelAmount     = if_else(IsCancelled, LineValue, 0)   
  ) %>%
  group_by(QuarterLabel, QuarterDate) %>%
  summarise(
    SalesRevenue     = sum(SalesAmount, na.rm = TRUE),
    CancelledRevenue = sum(CancelAmount, na.rm = TRUE),  # negative
    .groups = "drop"
  ) %>%
  arrange(QuarterDate)

# make a waterfall vector first
values <- c(quarterly_cancel$SalesRevenue,
            quarterly_cancel$CancelledRevenue)

labels <- c(
  paste0(quarterly_cancel$QuarterLabel, " Revenue"),
  paste0(quarterly_cancel$QuarterLabel, " Cancelled")
)

colors <- c(
  rep("forestgreen", nrow(quarterly_cancel)),
  rep("firebrick", nrow(quarterly_cancel))
)

# then cumulative start/end
ends   <- cumsum(values)
starts <- c(0, head(ends, -1))

wf <- data.frame(
  id = seq_along(values),
  label = labels,
  start = starts,
  end   = ends,
  value = values,
  fill  = colors
)

final_net <- sum(values)
ggplot(wf) +
  geom_rect(aes(
    xmin = id - 0.4,
    xmax = id + 0.4,
    ymin = start,
    ymax = end,
    fill = fill
  )) +
  
  # labels for bar
  geom_text(
    aes(x = id, y = end,
        label = scales::comma(value)),
    vjust = ifelse(values > 0, -0.3, 1.3),
    size = 3.5
  ) +
  
  # final net bar
  geom_col(
    data = data.frame(id = max(wf$id) + 1, value = final_net),
    aes(x = id, y = value),
    fill = "steelblue",
    width = 0.6
  ) +
  geom_text(
    data = data.frame(id = max(wf$id) + 1, value = final_net),
    aes(x = id, y = value,
        label = scales::comma(final_net)),
    vjust = -0.3,
    size = 3.5
  ) +
  
  scale_x_continuous(
    breaks = c(wf$id, max(wf$id) + 1),
    labels = c(wf$label, "Final Net Revenue")
  ) +
  
  # axis 
  scale_y_continuous(
    labels = scales::comma,
    expand = expansion(mult = c(0, 0.15))
  ) +
  
  scale_fill_identity() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 40, hjust = 1)
  ) +
  
  labs(
    title = "Quarterly Waterfall: Revenue -> Cancellations -> Net Revenue (GBP)",
    subtitle = "Impact of order cancellations on quarterly realised revenue",
    x = "",
    y = "Revenue (GBP)"
  )

Quarterly waterfall showing impact of cancellations on net revenue

Observations: Waterfall chart shows Q4 2011’s peak month generates £3.3M revenue but suffers the largest cancellations at £335K—creating double risk when the business is most dependent on performance. Cancellations scale proportionally with revenue growth (£75K in Q4 2010 → £335K in Q4 2011)

4.6.1 AI-Assisted

I asked ChatGPT what would be the best way to show impact of cancellations on overall revenue,suggestions included: Stacked Bar Chart, Before/After comparison bars and a Waterfall Chart. Settled on using Waterfall chart because it makes the deductions easily visible and shows the bridge to revenue. I then used ChatGPT to help create the visual by breaking into smaller chunks and customising the aesthetic aspects.

4.8 Summary & Key Takeaways

Through this R analysis of the Online Retail dataset I have looked at answering key questions regarding what the top products for the business are both by volume and by sales generated. I have examined how customers segmented on the basis of their revenue contribution using Pareto segmentation. I have examined the split of revenue between local (UK) and International regions. Finally, I have looked at the impact of cancellations on the total revenue by quarters using a waterfall analysis.

4.9 Tableau Dashboard Description

The dashboard provides the CEO context of what top-selling products are, how much they contribute to revenue. What region is most important to business and the monthly revenue trends allowing them to visualise growth trends and times of the year that are crucial for the business.

MN50749 - Databases and BI - Individual Assignment

Online Retail Data Analysis

Olivia D Souza

2025-11-19

4.1 Data Import & Setup

4.2 Data Diagnostics

Overall Diagnosis

Numeric Diagnostics

Outlier Detection

Missing Values

Returns and Cancellations

Price Issues

Date Range

4.3 Data Preparation

4.4 Question 1: Most popular products and Sales over time

Top 10 Products by Quantity Sold

Top 10 Products by Revenue

Monthly Revenue Trend

Revenue by Hour of Day

4.4.1 AI-Assisted

4.5 Question 2: Customer Segmentation

Customer Revenue Distribution (Linear Scale)

Customer Revenue Distribution (Log Scale)

Pareto Analysis

4.5.1 AI-Assisted

4.6 Question 3: Country-Specific Patterns

UK vs International Markets

Top 10 Non-UK Countries by Revenue

Average Quantity per Order by Country

Non-UK Market Summary

4.7 Question 4 (Bonus): Cancellations impact

Quarterly Waterfall Analysis

4.6.1 AI-Assisted

4.8 Summary & Key Takeaways

4.9 Tableau Dashboard Description