1 Executive Summary

This project demonstrates a full data analysis workflow on real-world e-commerce transaction data: starting from raw, messy records and ending with concrete business recommendations. The dataset contains over one million transactions from a UK-based online retailer across two years.

Key business questions answered:

How clean is the raw data, and what issues need handling before any analysis?
What are the seasonal and weekly sales patterns?
Which countries and products drive the most revenue?
What is the impact of returns on net revenue?

Headline findings (preview):

~25% of raw rows required removal or correction (missing customer IDs, cancellations, invalid prices)
Sales peak sharply in November, driven by holiday-season B2B restocking
The top 10 customers account for a disproportionate share of revenue — a classic Pareto pattern
Returns reduce gross revenue by a measurable but predictable percentage, which can be forecast

2 Setup and Libraries

library(tidyverse)   # data manipulation and visualization
library(lubridate)   # date handling
library(scales)      # nicer axis formatting
library(knitr)       # tables in reports

# A consistent visual theme for all charts
theme_set(
  theme_minimal(base_size = 12) +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      plot.subtitle = element_text(color = "grey40"),
      panel.grid.minor = element_blank()
    )
)

# Brand-style color palette
palette_main <- c("#2C5F8D", "#E8833A", "#5BA77F", "#C44536", "#7B6FAA")

3 Data Loading

The dataset used here is the Online Retail II dataset from the UCI Machine Learning Repository, containing transactions from a UK-based online retailer between Dec 2009 and Dec 2011.

# Adjust the path to wherever you saved the file
# Original file: "online_retail_II.xlsx" (two sheets)
# For faster loading, this report uses the CSV-converted version

data_path <- "data/online_retail_II.csv"

raw_data <- read_csv(data_path, show_col_types = FALSE)

# Quick inspection
glimpse(raw_data)

## Rows: 1,067,371
## Columns: 8
## $ Invoice       <chr> "489434", "489434", "489434", "489434", "489434", "48943…
## $ StockCode     <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ Description   <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGH…
## $ Quantity      <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, …
## $ InvoiceDate   <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07…
## $ Price         <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.…
## $ `Customer ID` <dbl> 13085, 13085, 13085, 13085, 13085, 13085, 13085, 13085, …
## $ Country       <chr> "United Kingdom", "United Kingdom", "United Kingdom", "U…

cat("Total rows:", format(nrow(raw_data), big.mark = ","), "\n")

## Total rows: 1,067,371

cat("Total columns:", ncol(raw_data), "\n")

## Total columns: 8

cat("Date range:", 
    as.character(min(raw_data$InvoiceDate, na.rm = TRUE)), 
    "to", 
    as.character(max(raw_data$InvoiceDate, na.rm = TRUE)))

## Date range: 2009-12-01 07:45:00 to 2011-12-09 12:50:00

4 Data Quality Audit

Before any analysis, we systematically check for issues that would distort findings.

4.1 Missing values

missing_summary <- raw_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing_Count") %>%
  mutate(
    Missing_Percent = round(Missing_Count / nrow(raw_data) * 100, 2)
  ) %>%
  arrange(desc(Missing_Count))

kable(missing_summary, caption = "Missing values by column")

Missing values by column
Column	Missing_Count	Missing_Percent
Customer ID	243007	22.77
Description	4382	0.41
Invoice	0	0.00
StockCode	0	0.00
Quantity	0	0.00
InvoiceDate	0	0.00
Price	0	0.00
Country	0	0.00

Observation: Customer ID has substantial missing values. These rows cannot be used for customer-level analysis but are still useful for product- and time-level analysis. We will handle them context-dependently.

4.2 Cancellations and returns

In this dataset, invoice numbers starting with “C” indicate cancelled orders, and negative Quantity values usually represent returns.

cancellations <- raw_data %>%
  filter(str_starts(Invoice, "C"))

returns_negative_qty <- raw_data %>%
  filter(Quantity < 0, !str_starts(Invoice, "C"))

cat("Cancelled orders (Invoice starting with 'C'):", 
    format(nrow(cancellations), big.mark = ","), "\n")

## Cancelled orders (Invoice starting with 'C'): 19,494

cat("Negative quantity rows (non-cancellation):", 
    format(nrow(returns_negative_qty), big.mark = ","), "\n")

## Negative quantity rows (non-cancellation): 3,457

4.3 Invalid records

invalid_summary <- tibble(
  Issue = c(
    "Negative or zero price",
    "Negative or zero quantity (excl. cancellations)",
    "Missing Customer ID",
    "Missing Description"
  ),
  Count = c(
    sum(raw_data$Price <= 0, na.rm = TRUE),
    sum(raw_data$Quantity <= 0 & !str_starts(raw_data$Invoice, "C"), na.rm = TRUE),
    sum(is.na(raw_data$`Customer ID`)),
    sum(is.na(raw_data$Description))
  )
)

kable(invalid_summary, caption = "Data quality issues identified")

Data quality issues identified
Issue	Count
Negative or zero price	6207
Negative or zero quantity (excl. cancellations)	3457
Missing Customer ID	243007
Missing Description	4382

5 Data Cleaning

Based on the audit above, the cleaning rules are:

Rule	Reason
Remove rows where `Price <= 0`	Cannot be valid transactions
Remove rows where `Quantity <= 0` (except cancellations)	Likely data entry errors
Separate cancellations into their own table	Useful for return-rate analysis
Trim whitespace and standardize text in `Description` and `Country`	Prevent fragmented categories
Compute `Revenue = Quantity × Price`	The key derived metric

# Standardize column names for easier handling
clean_data <- raw_data %>%
  rename(CustomerID = `Customer ID`) %>%
  mutate(
    Description = str_trim(Description),
    Country = str_trim(Country),
    Revenue = Quantity * Price
  )

# Split into cancellations vs. valid sales
cancellations <- clean_data %>%
  filter(str_starts(Invoice, "C"))

sales <- clean_data %>%
  filter(
    !str_starts(Invoice, "C"),
    Price > 0,
    Quantity > 0,
    !is.na(Description)
  )

cat("Cleaned sales rows:", format(nrow(sales), big.mark = ","), "\n")

## Cleaned sales rows: 1,041,670

cat("Cancellations rows:", format(nrow(cancellations), big.mark = ","), "\n")

## Cancellations rows: 19,494

cat("Rows removed:", 
    format(nrow(raw_data) - nrow(sales) - nrow(cancellations), big.mark = ","), "\n")

## Rows removed: 6,207

6 Exploratory Analysis

6.1 Overall sales metrics

summary_metrics <- tibble(
  Metric = c("Total Revenue", "Total Transactions", "Unique Customers", 
             "Unique Products", "Countries Served", "Avg. Order Value"),
  Value = c(
    dollar(sum(sales$Revenue), prefix = "£"),
    comma(n_distinct(sales$Invoice)),
    comma(n_distinct(sales$CustomerID, na.rm = TRUE)),
    comma(n_distinct(sales$StockCode)),
    n_distinct(sales$Country),
    dollar(sum(sales$Revenue) / n_distinct(sales$Invoice), prefix = "£")
  )
)

kable(summary_metrics, caption = "Headline business metrics")

Headline business metrics
Metric	Value
Total Revenue	£20,972,595
Total Transactions	40,077
Unique Customers	5,878
Unique Products	4,916
Countries Served	43
Avg. Order Value	£523.31

6.2 Monthly sales trend

monthly_sales <- sales %>%
  mutate(Month = floor_date(InvoiceDate, "month")) %>%
  group_by(Month) %>%
  summarise(
    Revenue = sum(Revenue),
    Orders = n_distinct(Invoice),
    .groups = "drop"
  )

ggplot(monthly_sales, aes(x = Month, y = Revenue)) +
  geom_line(color = palette_main[1], linewidth = 1) +
  geom_point(color = palette_main[1], size = 2) +
  scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-3, suffix = "K")) +
  scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m") +
  labs(
    title = "Monthly Revenue Trend",
    subtitle = "Clear holiday-season spike in Nov each year",
    x = NULL,
    y = "Revenue"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Insight: Revenue spikes dramatically in November. This is consistent with B2B customers placing pre-holiday restocking orders. The retailer should plan inventory and staffing around this annual peak.

6.3 Day-of-week patterns

weekday_sales <- sales %>%
  mutate(Weekday = wday(InvoiceDate, label = TRUE, week_start = 1)) %>%
  group_by(Weekday) %>%
  summarise(
    Revenue = sum(Revenue),
    Orders = n_distinct(Invoice),
    .groups = "drop"
  )

ggplot(weekday_sales, aes(x = Weekday, y = Revenue)) +
  geom_col(fill = palette_main[2]) +
  scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-6, suffix = "M")) +
  labs(
    title = "Revenue by Day of Week",
    subtitle = "Thursday is the strongest sales day; weekends are quieter",
    x = NULL,
    y = "Revenue"
  )

Insight: This is a B2B-leaning business. Saturday is missing entirely (no transactions), and weekend volume is minimal — patterns typical of business buyers, not retail consumers. Marketing campaigns should target weekdays.

6.4 Top countries by revenue

top_countries <- sales %>%
  group_by(Country) %>%
  summarise(
    Revenue = sum(Revenue),
    Orders = n_distinct(Invoice),
    Customers = n_distinct(CustomerID, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(Revenue)) %>%
  slice_head(n = 10)

kable(
  top_countries %>%
    mutate(
      Revenue = dollar(Revenue, prefix = "£"),
      Orders = comma(Orders),
      Customers = comma(Customers)
    ),
  caption = "Top 10 countries by revenue"
)

Top 10 countries by revenue
Country	Revenue	Orders	Customers
United Kingdom	£17,870,978	36,535	5,350
EIRE	£664,432	626	5
Netherlands	£554,232	228	22
Germany	£431,262	789	107
France	£356,945	622	95
Australia	£169,968	95	15
Spain	£109,179	154	41
Switzerland	£101,011	93	22
Sweden	£91,904	105	19
Denmark	£69,862	43	12

top_countries %>%
  mutate(Country = fct_reorder(Country, Revenue)) %>%
  ggplot(aes(x = Country, y = Revenue)) +
  geom_col(fill = palette_main[1]) +
  coord_flip() +
  scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-6, suffix = "M")) +
  labs(
    title = "Top 10 Countries by Revenue",
    subtitle = "The UK dominates; international expansion has room to grow",
    x = NULL,
    y = "Revenue"
  )

Insight: The UK accounts for the overwhelming share of revenue. Despite serving 40+ countries, international revenue is concentrated in only a handful of markets (Netherlands, EIRE, Germany). This concentration signals an opportunity for targeted expansion in already-validated EU markets rather than spreading thin.

6.5 Top products

top_products <- sales %>%
  group_by(StockCode, Description) %>%
  summarise(
    Revenue = sum(Revenue),
    Quantity = sum(Quantity),
    .groups = "drop"
  ) %>%
  arrange(desc(Revenue)) %>%
  slice_head(n = 15)

kable(
  top_products %>%
    mutate(
      Revenue = dollar(Revenue, prefix = "£"),
      Quantity = comma(Quantity)
    ),
  caption = "Top 15 products by revenue"
)

Top 15 products by revenue
StockCode	Description	Revenue	Quantity
22423	REGENCY CAKESTAND 3 TIER	£344,563	27,577
M	Manual	£340,716	10,050
DOT	DOTCOM POSTAGE	£322,657	1,436
85123A	WHITE HANGING HEART T-LIGHT HOLDER	£262,931	96,086
23843	PAPER CRAFT , LITTLE BIRDIE	£168,470	80,995
47566	PARTY BUNTING	£149,187	28,378
85099B	JUMBO BAG RED RETROSPOT	£148,824	78,860
84879	ASSORTED COLOUR BIRD ORNAMENT	£132,188	81,809
POST	POSTAGE	£127,597	5,461
22086	PAPER CHAIN KIT 50’S CHRISTMAS	£123,142	36,581
79321	CHILLI LIGHTS	£85,490	16,840
23166	MEDIUM CERAMIC TOP STORAGE JAR	£81,701	78,033
84347	ROTATING SILVER ANGELS T-LIGHT HLDR	£74,449	32,525
21137	BLACK RECORD COVER FRAME	£69,329	19,954
85099F	JUMBO BAG STRAWBERRY	£68,489	37,124

6.6 Pareto analysis: customer concentration

How concentrated is revenue across the customer base?

customer_revenue <- sales %>%
  filter(!is.na(CustomerID)) %>%
  group_by(CustomerID) %>%
  summarise(Revenue = sum(Revenue), .groups = "drop") %>%
  arrange(desc(Revenue)) %>%
  mutate(
    Cumulative_Revenue = cumsum(Revenue),
    Cumulative_Pct = Cumulative_Revenue / sum(Revenue),
    Customer_Rank = row_number(),
    Customer_Pct = Customer_Rank / n()
  )

# What % of revenue comes from top 20% of customers?
top_20_pct_revenue <- customer_revenue %>%
  filter(Customer_Pct <= 0.2) %>%
  summarise(share = max(Cumulative_Pct)) %>%
  pull(share)

cat("Top 20% of customers account for", 
    percent(top_20_pct_revenue, accuracy = 0.1), 
    "of total revenue.\n")

## Top 20% of customers account for 77.2% of total revenue.

ggplot(customer_revenue, aes(x = Customer_Pct, y = Cumulative_Pct)) +
  geom_line(color = palette_main[3], linewidth = 1.2) +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0.2, linetype = "dashed", color = "grey50") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Customer Revenue Concentration (Pareto Curve)",
    subtitle = "The 80/20 rule visualized: a small share of customers drives most revenue",
    x = "Cumulative % of Customers",
    y = "Cumulative % of Revenue"
  )

Insight: Revenue is heavily concentrated in a small fraction of customers — the classic Pareto pattern. Customer retention strategy should prioritize this top tier with dedicated account management.

6.7 Return rate analysis

total_sales_value <- sum(sales$Revenue)
total_returns_value <- sum(abs(cancellations$Quantity * cancellations$Price), na.rm = TRUE)
return_rate <- total_returns_value / total_sales_value

cat("Total sales value:", dollar(total_sales_value, prefix = "£"), "\n")

## Total sales value: £20,972,595

cat("Total returns value:", dollar(total_returns_value, prefix = "£"), "\n")

## Total returns value: £1,527,415

cat("Return rate:", percent(return_rate, accuracy = 0.01), "\n")

## Return rate: 7.28%

monthly_returns <- cancellations %>%
  mutate(
    Month = floor_date(InvoiceDate, "month"),
    Return_Value = abs(Quantity * Price)
  ) %>%
  group_by(Month) %>%
  summarise(Returns = sum(Return_Value, na.rm = TRUE), .groups = "drop")

monthly_combined <- monthly_sales %>%
  left_join(monthly_returns, by = "Month") %>%
  mutate(
    Returns = replace_na(Returns, 0),
    Return_Rate = Returns / Revenue
  )

ggplot(monthly_combined, aes(x = Month, y = Return_Rate)) +
  geom_line(color = palette_main[4], linewidth = 1) +
  geom_point(color = palette_main[4], size = 2) +
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m") +
  labs(
    title = "Monthly Return Rate",
    subtitle = "Returns as a percentage of monthly revenue",
    x = NULL,
    y = "Return Rate"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Insight: The return rate varies month-to-month but stays within a predictable band. Months with abnormally high return rates warrant operational investigation (defective batches, shipping issues, etc.).

7 Business Recommendations

Based on the analysis, four concrete actions:

Inventory and staffing for November peak. Revenue in November is dramatically higher than other months. Plan logistics, inventory, and customer service capacity accordingly.
Reframe marketing around B2B weekday patterns. The data shows minimal weekend activity and no Saturday transactions. Email campaigns, ads, and outreach should concentrate on Tuesday–Thursday.
Build a top-customer retention program. A small fraction of customers drives a majority of revenue. Losing one is far more damaging than losing many casual buyers. A dedicated account manager for the top tier is justified.
Targeted EU expansion, not broad expansion. The UK dominates revenue, but a few EU markets (Netherlands, EIRE, Germany) have proven traction. Doubling down on these is lower-risk than entering new geographies.

8 Appendix: Cleaned Data Export

# Run this chunk manually to export the cleaned data
write_csv(sales, "output/sales_cleaned.csv")
write_csv(cancellations, "output/cancellations.csv")
write_csv(monthly_sales, "output/monthly_summary.csv")

Project by Yuhsuan K. — Data Analyst | Excel, R, SQL
Source data: Online Retail II — UCI Machine Learning Repository

E-Commerce Sales Analysis: Cleaning Messy Transaction Data and Uncovering Business Insights

An end-to-end data cleaning and exploratory analysis project

Yuhsuan K.

五月 2026