library(tidyverse)
library(lubridate)
library(skimr)
library(dlookr)
library(scales)
## Load Data
retail_raw <- read_csv("Online Retail.csv")
cat("Dataset dimensions:", nrow(retail_raw), "rows ×", ncol(retail_raw), "variables\n\n")
## Dataset dimensions: 541909 rows × 8 variables
##Confirming correct import
head(retail_raw, 10)
## # A tibble: 10 × 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 536365 85123A WHITE HANGING … 6 01/12/2010… 2.55 17850
## 2 536365 71053 WHITE METAL LA… 6 01/12/2010… 3.39 17850
## 3 536365 84406B CREAM CUPID HE… 8 01/12/2010… 2.75 17850
## 4 536365 84029G KNITTED UNION … 6 01/12/2010… 3.39 17850
## 5 536365 84029E RED WOOLLY HOT… 6 01/12/2010… 3.39 17850
## 6 536365 22752 SET 7 BABUSHKA… 2 01/12/2010… 7.65 17850
## 7 536365 21730 GLASS STAR FRO… 6 01/12/2010… 4.25 17850
## 8 536366 22633 HAND WARMER UN… 6 01/12/2010… 1.85 17850
## 9 536366 22632 HAND WARMER RE… 6 01/12/2010… 1.85 17850
## 10 536367 84879 ASSORTED COLOU… 32 01/12/2010… 1.69 13047
## # ℹ 1 more variable: Country <chr>
glimpse(retail_raw)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo <chr> "536365", "536365", "536365", "536365", "536365", "536365"…
## $ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752", …
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN…
## $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, 3, …
## $ InvoiceDate <chr> "01/12/2010 08:26", "01/12/2010 08:26", "01/12/2010 08:26"…
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1.69…
## $ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17…
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
# Overall diagnosis
diagnose(retail_raw) %>%
arrange(desc(missing_count)) %>%
select(variables, types, missing_count, missing_percent, unique_count)
## # A tibble: 8 × 5
## variables types missing_count missing_percent unique_count
## <chr> <chr> <int> <dbl> <int>
## 1 CustomerID numeric 135080 24.9 4373
## 2 Description character 1454 0.268 4212
## 3 InvoiceNo character 0 0 25900
## 4 StockCode character 0 0 4070
## 5 Quantity numeric 0 0 722
## 6 InvoiceDate character 0 0 23260
## 7 UnitPrice numeric 0 0 1630
## 8 Country character 0 0 38
Key Data Quality Observations:
The dataset contains 541,909 transactions with 8 variables. The most notable data quality issue is 135,080 missing CustomerID values (24.93% of records), which limits customer-level analysis. These missing identifiers likely represent guest checkouts.
Additionally, 1,454 records (0.27%) have missing product descriptions, though this is relatively minor. All other core fields are complete with no missing values.
The dataset has 25,900 unique invoice numbers, 4,070 distinct stock codes, 4,212 different product descriptions, and 4,373 unique customers. Geographic diversity is limited to 38 distinct countries.
diagnose_numeric(retail_raw) %>%
select(variables, min, mean, median, max, zero, minus)
## # A tibble: 3 × 7
## variables min mean median max zero minus
## <chr> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 Quantity -80995 9.55 3 80995 0 10624
## 2 UnitPrice -11062. 4.61 2.08 38970 2515 2
## 3 CustomerID 12346 15288. 15152 18287 0 0
diagnose_outlier(retail_raw) %>%
filter(outliers_cnt > 0) %>%
arrange(desc(outliers_ratio))
## # A tibble: 2 × 6
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Quantity 58619 10.8 50.6 9.55 4.57
## 2 UnitPrice 39627 7.31 31.1 4.61 2.52
# Missing values
colSums(is.na(retail_raw))
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 1454 0 0 0
## CustomerID Country
## 135080 0
round(colMeans(is.na(retail_raw)) * 100, 2)
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0.00 0.00 0.27 0.00 0.00 0.00
## CustomerID Country
## 24.93 0.00
Missing Value Impact:
CustomerID is missing for nearly one-quarter of all transactions (24.93%). Customer segmentation, lifetime value analysis, and repeat purchase behavior can only be assessed for the 75% of transactions with valid customer identifiers. The minimal missing descriptions (0.27%) have negligible impact on product-level analysis.
# Negative quantities (returns or cancellations)
neg_qty_summary <- retail_raw %>%
summarise(
negative_rows = sum(Quantity < 0, na.rm = TRUE),
pct_negative = round(mean(Quantity < 0, na.rm = TRUE) * 100, 3)
)
neg_qty_summary
## # A tibble: 1 × 2
## negative_rows pct_negative
## <int> <dbl>
## 1 10624 1.96
Returns and Cancellations:
The dataset contains 10,624 transactions with negative quantities (1.96% of all records). These represent product returns or order cancellations and are identified by invoice numbers beginning with “C”. While relatively small in proportion, these cancellations must be handled separately in revenue calculations to avoid misstatement.
# Zero or negative Unit Price
price_issue_summary <- retail_raw %>%
summarise(
zero_or_negative = sum(UnitPrice <= 0, na.rm = TRUE),
pct_zero_negative = round(mean(UnitPrice <= 0, na.rm = TRUE) * 100, 3)
)
price_issue_summary
## # A tibble: 1 × 2
## zero_or_negative pct_zero_negative
## <int> <dbl>
## 1 2517 0.464
Pricing Value Issues:
There are 2,517 records (0.464%) with zero or negative unit prices. These records will be filtered out during data preparation to ensure revenue calculations are based only on valid, positive prices.
# Duplicate rows
total_rows <- nrow(retail_raw)
unique_rows <- nrow(distinct(retail_raw))
duplicate_rows <- total_rows - unique_rows
cat("Total rows:", total_rows, "\n")
## Total rows: 541909
cat("Unique rows:", unique_rows, "\n")
## Unique rows: 536641
cat("Duplicate rows:", duplicate_rows, "\n")
## Duplicate rows: 5268
cat("Duplicate percentage:", round((duplicate_rows / total_rows) * 100, 2), "%\n")
## Duplicate percentage: 0.97 %
Duplicate Records:
The dataset contains 5,268 exact duplicate rows (0.97% of total records). These could be from system errors, double-entry, or data export issues rather than legitimate repeat transactions (which would have different invoice numbers or timestamps). These duplicates will be removed during data cleaning.
# Check Date range
date_range <- retail_raw %>%
summarise(
min_date = min(dmy_hm(InvoiceDate), na.rm = TRUE),
max_date = max(dmy_hm(InvoiceDate), na.rm = TRUE)
)
date_range
## # A tibble: 1 × 2
## min_date max_date
## <dttm> <dttm>
## 1 2010-12-01 08:26:00 2011-12-09 12:50:00
Date Check:
The dataset spans over one year, from December 1, 2010 to December 9, 2011 in line with expectations.
retail_clean <- retail_raw %>%
distinct() %>% #removes exact duplicates
mutate(
InvoiceDate = dmy_hm(InvoiceDate),
Date = as.Date(InvoiceDate),
Year = year(InvoiceDate),
Month = month(InvoiceDate, label = TRUE),
YearMonth = floor_date(InvoiceDate, "month"),
DayOfWeek = wday(InvoiceDate, label = TRUE, week_start = 1),
Hour = hour(InvoiceDate),
Country = str_trim(Country),
IsCancelled = str_starts(InvoiceNo, "C"),
LineValue = Quantity * UnitPrice # can be negative
) %>%
filter(UnitPrice > 0) %>% # drop price <= 0
mutate(IsUK = Country == "United Kingdom") ## true/false flag if needed
# Transactions for main sales analysis (complete, positive quantity)
retail_actuals <- retail_clean %>%
filter(!IsCancelled, Quantity > 0) %>%
mutate(Revenue = Quantity * UnitPrice)
# Transactions for customer analysis
retail_customers <- retail_actuals %>%
filter(!is.na(CustomerID))
cat("Rows after cleaning:", nrow(retail_clean), "\n")
## Rows after cleaning: 534129
cat("Rows in sales dataset (no cancellations, positive qty):", nrow(retail_actuals), "\n")
## Rows in sales dataset (no cancellations, positive qty): 524878
cat("Rows with valid CustomerID:", nrow(retail_customers), "\n")
## Rows with valid CustomerID: 392692
Data Cleaning Steps:
The cleaning process creates three datasets to help with analysis:
retail_clean (534,129 rows):
Removes 5,268 duplicate records and 2,517 transactions with invalid
prices (≤0). Parses invoice dates into proper datetime format and
extracts components. Identifies invoices starting with “C” for
cancellations, calculates line amounts and adds a UK/non-UK
indicator.
retail_actuals (524,878 rows):
Filters transactions by excluding cancellations and negative
quantities.
retail_customers (392,692 rows):
Filters retail_actuals to include only transactions with
valid CustomerID values. This 75% subset allows customer
segmentation.
main_col <- "#1b9e77"
top_qty <- retail_actuals %>%
group_by(StockCode, Description) %>%
summarise(
TotalQty = sum(Quantity),
TotalRevenue = sum(Revenue),
.groups = "drop"
) %>%
slice_max(TotalQty, n = 10) %>%
arrange(TotalQty)
##Plot
ggplot(top_qty, aes(
x = TotalQty,
y = fct_reorder(Description, TotalQty)
)) +
geom_col(fill = main_col) +
geom_text(
aes(label = scales::comma(TotalQty, accuracy = 1)),
hjust = 1.1, ##for label inside bar
color = "white",
size = 3.5
) +
labs(
title = "Top 10 Products by Quantity Sold",
x = "Units Sold",
y = NULL
) +
theme_minimal()
Top 10 products by Quantity Sold
Observations: Home decor and craft items appear to dominate sales by volume with PAPER CRAFT, LITTLE BIRDIE leading at 80,995 units.
top_rev <- retail_actuals %>%
group_by(StockCode, Description) %>%
summarise(
TotalQty = sum(Quantity),
TotalRevenue = sum(Revenue),
.groups = "drop"
) %>%
slice_max(TotalRevenue, n = 10) %>%
arrange(TotalRevenue)
##Plot
ggplot(top_rev, aes(
x = TotalRevenue,
y = fct_reorder(Description, TotalRevenue)
)) +
geom_col(fill = main_col) +
geom_text(
aes(label = scales::dollar_format(prefix = "£", accuracy = 1)(TotalRevenue)),
hjust = 1.1,
color = "white",
size = 3.5
) +
scale_x_continuous(labels = scales::dollar_format(prefix = "£")) +
labs(
title = "Top 10 Products by Revenue",
x = "Revenue (£)",
y = NULL
) +
theme_minimal()
Top 10 products by Revenue Generated
Observations: Revenue leaders differ from volume leaders—DOTCOM POSTAGE (£206k) and POSTAGE (£78k) could possibly be shipping charges rather than product sales, inflating reported revenue.
Excluding shipping, the REGENCY CAKESTAND 3 TIER (£174k) leads in product revenue despite lower volume, indicating higher-priced items drive different revenue dynamics than the high-volume, lower-priced craft items dominating quantity sales.
monthly_sales <- retail_actuals %>%
group_by(YearMonth) %>%
summarise(MonthlyRevenue = sum(Revenue), .groups = "drop")
##Plot
ggplot(monthly_sales, aes(x = YearMonth, y = MonthlyRevenue)) +
geom_line(color = main_col, linewidth = 1) +
geom_point(color = main_col, size = 2) +
scale_y_continuous(labels = scales::dollar_format(prefix = "£")) +
scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") + ##utf8 error fix
labs(
title = "Monthly Revenue Trend",
x = NULL,
y = "Revenue (£)"
) +
theme_minimal()
Monthly revenue showing seasonal patterns and growth trends
Observations: Clear seasonality with strong Q4 growth peaking at £1.5M in November 2011, driven by holiday shopping demand. The sharp December 2011 drop reflects incomplete data as dataset ends Dec 9. Revenue shows overall upward trajectory through 2011, with a summer plateau (May-July around £750k) before accelerating into peak season—typical retail seasonality.
hour_rev <- retail_actuals %>%
group_by(Hour) %>%
summarise(Revenue = sum(Revenue), .groups = "drop")
##Plot
ggplot(hour_rev, aes(x = Hour, y = Revenue)) +
geom_line(color = main_col, linewidth = 1.1) +
geom_point(color = main_col, size = 2) +
scale_x_continuous(breaks = 0:23) +
scale_y_continuous(
labels = scales::dollar_format(prefix = "£"),
breaks = scales::pretty_breaks(n = 6),
expand = expansion(mult = c(0, 0.05))
) +
labs(
title = "Revenue Pattern by Hour of Day",
x = "Hour of Day",
y = "Revenue (£)"
) +
theme_minimal(base_size = 13)
Revenue Distribution by Hour of Day
Observations:
Revenue concentrates during business hours (10am-3pm), with minimal activity after 6pm and virtually none overnight. This pattern strongly indicates B2B operations—likely wholesale customers ordering during their own business hours—rather than B2C consumer shopping which would show evening and weekend activity.
I used ChatGPT to improve the visual appearance of the hourly revenue line chart. It suggested showing all 24 hours explicitly using scale_x_continuous(breaks = 0:23) rather than letting R choose arbitrary intervals, applying expansion(mult = c(0, 0.05)) to remove unnecessary bottom padding while maintaining top margin, and using scales::pretty_breaks(n = 6) for cleaner y-axis intervals,along with consistent currency formatting using scales::dollar_format(prefix = “£”) and fixing a UTF-8 error for the GBP symbol
customer_rev <- retail_customers %>%
group_by(CustomerID) %>%
summarise(
TotalRevenue = sum(Revenue),
Orders = n_distinct(InvoiceNo),
TotalQty = sum(Quantity),
.groups = "drop"
) %>%
arrange(desc(TotalRevenue))
customer_rev <- customer_rev %>%
mutate(
Rank = row_number(),
CumRevenue = cumsum(TotalRevenue),
CumShare = CumRevenue / sum(TotalRevenue),
Percentile = Rank / n()
)
## Quick Check
head(customer_rev, 10)
## # A tibble: 10 × 8
## CustomerID TotalRevenue Orders TotalQty Rank CumRevenue CumShare Percentile
## <dbl> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 14646 280206. 73 196915 1 280206. 0.0315 0.000231
## 2 18102 259657. 60 64124 2 539863. 0.0607 0.000461
## 3 17450 194391. 46 69973 3 734254. 0.0826 0.000692
## 4 16446 168472. 2 80997 4 902727. 0.102 0.000922
## 5 14911 143711. 201 80240 5 1046438. 0.118 0.00115
## 6 12415 124915. 21 77374 6 1171352. 0.132 0.00138
## 7 14156 117210. 55 57768 7 1288562. 0.145 0.00161
## 8 17511 91062. 31 64549 8 1379625. 0.155 0.00184
## 9 16029 80851. 63 40108 9 1460476. 0.164 0.00207
## 10 12346 77184. 1 74215 10 1537659. 0.173 0.00231
ggplot(customer_rev, aes(x = TotalRevenue)) +
geom_histogram(fill = main_col, bins = 50) +
scale_x_continuous(labels = scales::dollar_format(prefix = "£")) +
labs(
title = "Customer Revenue Distribution (Linear Scale)",
x = "Total Revenue per Customer",
y = "Number of Customers"
) +
theme_minimal()
Customer revenue distribution on linear scale showing concentration
Observations: Extreme right-skewed distribution with the vast majority of customers clustered at low revenue levels and a minority contributing disproportionately
ggplot(customer_rev, aes(x = TotalRevenue)) +
geom_histogram(fill = main_col, bins = 50) +
scale_x_log10(labels = scales::dollar_format(prefix = "£")) +
labs(
title = "Customer Revenue Distribution (Log Scale)",
x = "Total Revenue (log10)",
y = "Number of Customers"
) +
theme_minimal()
Customer revenue distribution on log scale revealing full range
Observations: Applying a log scale reveals the underlying structure of customer revenue more clearly. We can see revenue spans several orders of magnitude, confirming that customers differ widely in purchasing intensity. This long-tail pattern is typical of wholesale/online retail settings and suggests that only a minority of customers generate most of the value.
The transformation reveals a roughly normal distribution.
customer_rev <- customer_rev %>%
mutate(
Segment = case_when(
Percentile <= 0.10 ~ "Top 10% (Key Accounts)",
Percentile <= 0.30 ~ "Next 20% (High Value)",
Percentile <= 0.60 ~ "Next 30% (Mid Value)",
TRUE ~ "Bottom 40% (Low Value)"
)
)
segment_summary <- customer_rev %>%
group_by(Segment) %>%
summarise(
Customers = n(),
SegmentRevenue = sum(TotalRevenue),
.groups = "drop"
) %>%
mutate(
RevenueShare = SegmentRevenue / sum(SegmentRevenue)
) %>%
arrange(desc(RevenueShare)) %>%
mutate(
Segment = factor(Segment, levels = Segment),
CumShare = cumsum(RevenueShare)
)
segment_summary
## # A tibble: 4 × 5
## Segment Customers SegmentRevenue RevenueShare CumShare
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Top 10% (Key Accounts) 433 5457734. 0.614 0.614
## 2 Next 20% (High Value) 868 1903348. 0.214 0.828
## 3 Next 30% (Mid Value) 1301 1083199. 0.122 0.950
## 4 Bottom 40% (Low Value) 1736 442929. 0.0498 1
# Compute sample size + total revenue for caption
n_customers <- nrow(customer_rev)
total_rev <- sum(customer_rev$TotalRevenue)
## plot
ggplot(segment_summary, aes(x = Segment)) +
# bars showing share of total revenue
geom_col(aes(y = RevenueShare), fill = main_col) +
# bar labels inside
geom_text(
aes(y = RevenueShare - 0.02,
label = scales::percent(RevenueShare, accuracy = 1)),
colour = "white",
fontface = "bold",
size = 4
) +
# cumulative line
geom_line(aes(y = CumShare, group = 1),
colour = "black", linewidth = 1) +
geom_point(aes(y = CumShare),
colour = "black", size = 2) +
# cumulative labels top of line
geom_text(
aes(y = CumShare + 0.03,
label = scales::percent(CumShare, accuracy = 1)),
size = 4
) +
scale_y_continuous(
breaks = c(0, 0.25, 0.50, 0.75, 1.00),
labels = scales::percent_format(accuracy = 1),
limits = c(0, 1.15)
)+
labs(
title = "Pareto Chart: Customer Segments and Revenue Contribution",
x = "Customer Segment",
y = "Share of Total Revenue (%)",
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(size = 11)
)
Pareto Chart showing Revenue Concentration across Customer Segments
Observations: Decided to segment customers based on their share of revenue contribution.
The Pareto chart shows clear value differentiation:
The Top 10% (Key Accounts) segment contributes a disproportionately large share of revenue.
The Next 20% still delivers substantial value but far less than the top tier.
The Next 30% are moderate buyers with potential for growth.
The Bottom 40% contribute minimal revenue despite forming the largest group.
Customer purchasing behaviour varies dramatically, with a small number of customers generating most of the revenue.
A Pareto-based segmentation captures these differences clearly and offers practical groups for targeted commercial efforts.
I asked ChatGPT to guide me through the structural approach of overlaying a cumulative line on top of bar charts position labels strategically placing percentage labels inside bars, while positioning cumulative percentages above the line to avoid overlaps. ChatGPT also suggested setting explicit limits to accommodate the upper labels, using breaks = c(0, 0.25, 0.50, 0.75, 1.00) for clear percentage intervals, and adding a dynamic caption with paste0() to display total customers and revenue.
uk_split <- retail_actuals %>%
group_by(IsUK) %>%
summarise(Revenue = sum(Revenue), .groups = "drop") %>%
mutate(
Region = if_else(IsUK == TRUE, "United Kingdom", "Other Countries"),
Share = Revenue / sum(Revenue)
)
uk_split
## # A tibble: 2 × 4
## IsUK Revenue Region Share
## <lgl> <dbl> <chr> <dbl>
## 1 FALSE 1640367. Other Countries 0.154
## 2 TRUE 9001744. United Kingdom 0.846
ggplot(uk_split, aes(x = Region, y = Revenue)) +
geom_col(fill = main_col) +
# % in bar
geom_text(
aes(label = scales::percent(Share, accuracy = 0.1)),
vjust = 1.4,
colour = "white",
fontface = "bold",
size = 4
) +
# £ top of bar (UTF-8 safe)
geom_text(
aes(label = scales::dollar_format(prefix = "\u00A3")(Revenue)),
vjust = -0.4,
size = 4
) +
scale_y_continuous(
labels = scales::label_dollar(prefix = "\u00A3"),
expand = expansion(mult = c(0, 0.15))
) +
labs(
title = "Revenue Contribution: UK vs Non-UK",
x = "",
y = "Revenue (\u00A3)"
) +
theme_minimal()
Revenue split between UK and international markets
Observations:
The UK generates 84.6% of revenue versus just 15.4% internationally.
This is expected as the retailer is UK-based, UK orders will likely have lower shipping frictions and a larger customer base.
However, over dependency on one market could pose business risks as well.
top_non_uk <- retail_actuals %>%
filter(Country != "United Kingdom") %>%
group_by(Country) %>%
summarise(Revenue = sum(Revenue)) %>%
mutate(Share = Revenue / sum(Revenue)) %>%
slice_max(Revenue, n = 10) %>%
arrange(Revenue)
ggplot(top_non_uk, aes(x = reorder(Country, Revenue), y = Revenue)) +
geom_col(fill = main_col) +
# % label inside bar
geom_text(
aes(label = scales::percent(Share, accuracy = 0.1)),
hjust = 1.15,
colour = "white",
fontface = "bold",
size = 3.6
) +
# £ top bar
geom_text(
aes(label = scales::dollar_format(prefix = "\u00A3")(Revenue)),
hjust = -0.15,
size = 3.6
) +
coord_flip() +
scale_y_continuous(
labels = scales::label_dollar(prefix = "\u00A3"),
expand = expansion(mult = c(0, 0.20))
) +
labs(
title = "Top 10 Non-UK Countries by Revenue",
subtitle = "Analysis period: Dec 2010 – Dec 2011",
x = "Country",
y = "Revenue (\u00A3)" ##for UTF8 error
) +
theme_minimal() +
theme(panel.grid = element_blank())
Leading non-UK markets by revenue contribution
Observations:
Among non-UK markets, neighboring European countries dominate: Netherlands (17.4%) and Ireland (17.3%) lead, followed by Germany (13.9%, £229k) and France (12.8%, £210k). These four countries capture 61% of all international sales, suggesting geographic distance facilitates business. Notably, Australia ranks 5th despite geographic distance (8.4%, £138k).
non_uk_patterns <- retail_actuals %>%
filter(Country != "United Kingdom") %>%
group_by(Country) %>%
summarise(
Orders = n_distinct(InvoiceNo),
TotalQty = sum(Quantity),
AvgQtyPerOrder = TotalQty / Orders
) %>%
slice_max(AvgQtyPerOrder, n = 10) %>%
arrange(AvgQtyPerOrder)
ggplot(non_uk_patterns,
aes(x = reorder(Country, AvgQtyPerOrder), y = AvgQtyPerOrder)) +
geom_col(fill = main_col) +
geom_text(
aes(label = round(AvgQtyPerOrder, 1)),
hjust = -0.1,
size = 3.8
) +
coord_flip() +
labs(
title = "Top 10 Non-UK Countries by Average Quantity per Order",
subtitle = "Analysis period: Dec 2010 – Dec 2011",
x = "Country",
y = "Avg Quantity per Order"
) +
theme_minimal() +
theme(
panel.grid = element_blank(),
axis.ticks = element_blank()
)
Average order size varies significantly by market
Observations: Certain countries buy larger quantities per order. This reinforces a bulk-purchasing pattern where there are fewer orders but much larger baskets. These markets behave differently from the UK, which has many orders smaller quantities more retail-like behaviour.
non_uk_summary <- retail_actuals %>%
filter(Country != "United Kingdom") %>%
group_by(Country) %>%
summarise(
Revenue = sum(Revenue),
Orders = n_distinct(InvoiceNo),
TotalQty = sum(Quantity),
AOV = Revenue / Orders,
AvgQtyPerOrder = TotalQty / Orders
) %>%
arrange(desc(Revenue))
head(non_uk_summary, 10)
## # A tibble: 10 × 6
## Country Revenue Orders TotalQty AOV AvgQtyPerOrder
## <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Netherlands 285446. 94 200361 3037. 2132.
## 2 EIRE 283141. 288 147007 983. 510.
## 3 Germany 228678. 457 119154 500. 261.
## 4 France 209625. 392 112060 535. 286.
## 5 Australia 138454. 57 83891 2429. 1472.
## 6 Spain 61559. 90 27933 684. 310.
## 7 Switzerland 57068. 54 30617 1057. 567.
## 8 Belgium 41196. 98 23237 420. 237.
## 9 Sweden 38368. 36 36078 1066. 1002.
## 10 Japan 37416. 19 26016 1969. 1369.
Observations: Order size patterns reveal distinct customer profiles by market. Customers outside Europe consolidate orders into larger baskets.
Some smaller countries place fewer but significantly higher-value orders. Countries like Switzerland, Australia, and Japan typically have higher AOV, higher shipping costs, bulk purchases rather than frequent small orders. This suggests unique purchasing patterns driven by logistics.
Netherlands leads dramatically with 2,131 units per order (just 94 orders generating £285k)—clearly wholesale/bulk purchasing.
Similarly, Australia (1,472 units/order) and Japan (1,369 units/order) exhibit bulk-buying behavior.
In contrast, revenue leaders like Germany and France show retail-oriented patterns with lower quantities per order (261 and 286 respectively) but higher order frequency (457 and 392 orders).
quarterly_cancel <- retail_clean %>%
mutate(
QuarterLabel = paste0("Q", quarter(Date), " ", year(Date)),
QuarterDate = floor_date(Date, "quarter"),
# correctsigns:
SalesAmount = if_else(!IsCancelled & Quantity > 0, LineValue, 0),
CancelAmount = if_else(IsCancelled, LineValue, 0)
) %>%
group_by(QuarterLabel, QuarterDate) %>%
summarise(
SalesRevenue = sum(SalesAmount, na.rm = TRUE),
CancelledRevenue = sum(CancelAmount, na.rm = TRUE), # negative
.groups = "drop"
) %>%
arrange(QuarterDate)
# make a waterfall vector first
values <- c(quarterly_cancel$SalesRevenue,
quarterly_cancel$CancelledRevenue)
labels <- c(
paste0(quarterly_cancel$QuarterLabel, " Revenue"),
paste0(quarterly_cancel$QuarterLabel, " Cancelled")
)
colors <- c(
rep("forestgreen", nrow(quarterly_cancel)),
rep("firebrick", nrow(quarterly_cancel))
)
# then cumulative start/end
ends <- cumsum(values)
starts <- c(0, head(ends, -1))
wf <- data.frame(
id = seq_along(values),
label = labels,
start = starts,
end = ends,
value = values,
fill = colors
)
final_net <- sum(values)
ggplot(wf) +
geom_rect(aes(
xmin = id - 0.4,
xmax = id + 0.4,
ymin = start,
ymax = end,
fill = fill
)) +
# labels for bar
geom_text(
aes(x = id, y = end,
label = scales::comma(value)),
vjust = ifelse(values > 0, -0.3, 1.3),
size = 3.5
) +
# final net bar
geom_col(
data = data.frame(id = max(wf$id) + 1, value = final_net),
aes(x = id, y = value),
fill = "steelblue",
width = 0.6
) +
geom_text(
data = data.frame(id = max(wf$id) + 1, value = final_net),
aes(x = id, y = value,
label = scales::comma(final_net)),
vjust = -0.3,
size = 3.5
) +
scale_x_continuous(
breaks = c(wf$id, max(wf$id) + 1),
labels = c(wf$label, "Final Net Revenue")
) +
# axis
scale_y_continuous(
labels = scales::comma,
expand = expansion(mult = c(0, 0.15))
) +
scale_fill_identity() +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 40, hjust = 1)
) +
labs(
title = "Quarterly Waterfall: Revenue -> Cancellations -> Net Revenue (GBP)",
subtitle = "Impact of order cancellations on quarterly realised revenue",
x = "",
y = "Revenue (GBP)"
)
Quarterly waterfall showing impact of cancellations on net revenue
Observations: Waterfall chart shows Q4 2011’s peak month generates £3.3M revenue but suffers the largest cancellations at £335K—creating double risk when the business is most dependent on performance. Cancellations scale proportionally with revenue growth (£75K in Q4 2010 → £335K in Q4 2011)
I asked ChatGPT what would be the best way to show impact of cancellations on overall revenue,suggestions included: Stacked Bar Chart, Before/After comparison bars and a Waterfall Chart. Settled on using Waterfall chart because it makes the deductions easily visible and shows the bridge to revenue. I then used ChatGPT to help create the visual by breaking into smaller chunks and customising the aesthetic aspects.
Through this R analysis of the Online Retail dataset I have looked at answering key questions regarding what the top products for the business are both by volume and by sales generated. I have examined how customers segmented on the basis of their revenue contribution using Pareto segmentation. I have examined the split of revenue between local (UK) and International regions. Finally, I have looked at the impact of cancellations on the total revenue by quarters using a waterfall analysis.
The dashboard provides the CEO context of what top-selling products are, how much they contribute to revenue. What region is most important to business and the monthly revenue trends allowing them to visualise growth trends and times of the year that are crucial for the business.