1 Executive Summary

This project demonstrates a full data analysis workflow on real-world e-commerce transaction data: starting from raw, messy records and ending with concrete business recommendations. The dataset contains over one million transactions from a UK-based online retailer across two years.

Key business questions answered:

  1. How clean is the raw data, and what issues need handling before any analysis?
  2. What are the seasonal and weekly sales patterns?
  3. Which countries and products drive the most revenue?
  4. What is the impact of returns on net revenue?

Headline findings (preview):

  • ~25% of raw rows required removal or correction (missing customer IDs, cancellations, invalid prices)
  • Sales peak sharply in November, driven by holiday-season B2B restocking
  • The top 10 customers account for a disproportionate share of revenue — a classic Pareto pattern
  • Returns reduce gross revenue by a measurable but predictable percentage, which can be forecast

2 Setup and Libraries

library(tidyverse)   # data manipulation and visualization
library(lubridate)   # date handling
library(scales)      # nicer axis formatting
library(knitr)       # tables in reports

# A consistent visual theme for all charts
theme_set(
  theme_minimal(base_size = 12) +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      plot.subtitle = element_text(color = "grey40"),
      panel.grid.minor = element_blank()
    )
)

# Brand-style color palette
palette_main <- c("#2C5F8D", "#E8833A", "#5BA77F", "#C44536", "#7B6FAA")

3 Data Loading

The dataset used here is the Online Retail II dataset from the UCI Machine Learning Repository, containing transactions from a UK-based online retailer between Dec 2009 and Dec 2011.

# Adjust the path to wherever you saved the file
# Original file: "online_retail_II.xlsx" (two sheets)
# For faster loading, this report uses the CSV-converted version

data_path <- "data/online_retail_II.csv"

raw_data <- read_csv(data_path, show_col_types = FALSE)

# Quick inspection
glimpse(raw_data)
## Rows: 1,067,371
## Columns: 8
## $ Invoice       <chr> "489434", "489434", "489434", "489434", "489434", "48943…
## $ StockCode     <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ Description   <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGH…
## $ Quantity      <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, …
## $ InvoiceDate   <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07…
## $ Price         <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.…
## $ `Customer ID` <dbl> 13085, 13085, 13085, 13085, 13085, 13085, 13085, 13085, …
## $ Country       <chr> "United Kingdom", "United Kingdom", "United Kingdom", "U…
cat("Total rows:", format(nrow(raw_data), big.mark = ","), "\n")
## Total rows: 1,067,371
cat("Total columns:", ncol(raw_data), "\n")
## Total columns: 8
cat("Date range:", 
    as.character(min(raw_data$InvoiceDate, na.rm = TRUE)), 
    "to", 
    as.character(max(raw_data$InvoiceDate, na.rm = TRUE)))
## Date range: 2009-12-01 07:45:00 to 2011-12-09 12:50:00

4 Data Quality Audit

Before any analysis, we systematically check for issues that would distort findings.

4.1 Missing values

missing_summary <- raw_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing_Count") %>%
  mutate(
    Missing_Percent = round(Missing_Count / nrow(raw_data) * 100, 2)
  ) %>%
  arrange(desc(Missing_Count))

kable(missing_summary, caption = "Missing values by column")
Missing values by column
Column Missing_Count Missing_Percent
Customer ID 243007 22.77
Description 4382 0.41
Invoice 0 0.00
StockCode 0 0.00
Quantity 0 0.00
InvoiceDate 0 0.00
Price 0 0.00
Country 0 0.00

Observation: Customer ID has substantial missing values. These rows cannot be used for customer-level analysis but are still useful for product- and time-level analysis. We will handle them context-dependently.

4.2 Cancellations and returns

In this dataset, invoice numbers starting with “C” indicate cancelled orders, and negative Quantity values usually represent returns.

cancellations <- raw_data %>%
  filter(str_starts(Invoice, "C"))

returns_negative_qty <- raw_data %>%
  filter(Quantity < 0, !str_starts(Invoice, "C"))

cat("Cancelled orders (Invoice starting with 'C'):", 
    format(nrow(cancellations), big.mark = ","), "\n")
## Cancelled orders (Invoice starting with 'C'): 19,494
cat("Negative quantity rows (non-cancellation):", 
    format(nrow(returns_negative_qty), big.mark = ","), "\n")
## Negative quantity rows (non-cancellation): 3,457

4.3 Invalid records

invalid_summary <- tibble(
  Issue = c(
    "Negative or zero price",
    "Negative or zero quantity (excl. cancellations)",
    "Missing Customer ID",
    "Missing Description"
  ),
  Count = c(
    sum(raw_data$Price <= 0, na.rm = TRUE),
    sum(raw_data$Quantity <= 0 & !str_starts(raw_data$Invoice, "C"), na.rm = TRUE),
    sum(is.na(raw_data$`Customer ID`)),
    sum(is.na(raw_data$Description))
  )
)

kable(invalid_summary, caption = "Data quality issues identified")
Data quality issues identified
Issue Count
Negative or zero price 6207
Negative or zero quantity (excl. cancellations) 3457
Missing Customer ID 243007
Missing Description 4382

5 Data Cleaning

Based on the audit above, the cleaning rules are:

Rule Reason
Remove rows where Price <= 0 Cannot be valid transactions
Remove rows where Quantity <= 0 (except cancellations) Likely data entry errors
Separate cancellations into their own table Useful for return-rate analysis
Trim whitespace and standardize text in Description and Country Prevent fragmented categories
Compute Revenue = Quantity × Price The key derived metric
# Standardize column names for easier handling
clean_data <- raw_data %>%
  rename(CustomerID = `Customer ID`) %>%
  mutate(
    Description = str_trim(Description),
    Country = str_trim(Country),
    Revenue = Quantity * Price
  )

# Split into cancellations vs. valid sales
cancellations <- clean_data %>%
  filter(str_starts(Invoice, "C"))

sales <- clean_data %>%
  filter(
    !str_starts(Invoice, "C"),
    Price > 0,
    Quantity > 0,
    !is.na(Description)
  )

cat("Cleaned sales rows:", format(nrow(sales), big.mark = ","), "\n")
## Cleaned sales rows: 1,041,670
cat("Cancellations rows:", format(nrow(cancellations), big.mark = ","), "\n")
## Cancellations rows: 19,494
cat("Rows removed:", 
    format(nrow(raw_data) - nrow(sales) - nrow(cancellations), big.mark = ","), "\n")
## Rows removed: 6,207

6 Exploratory Analysis

6.1 Overall sales metrics

summary_metrics <- tibble(
  Metric = c("Total Revenue", "Total Transactions", "Unique Customers", 
             "Unique Products", "Countries Served", "Avg. Order Value"),
  Value = c(
    dollar(sum(sales$Revenue), prefix = "£"),
    comma(n_distinct(sales$Invoice)),
    comma(n_distinct(sales$CustomerID, na.rm = TRUE)),
    comma(n_distinct(sales$StockCode)),
    n_distinct(sales$Country),
    dollar(sum(sales$Revenue) / n_distinct(sales$Invoice), prefix = "£")
  )
)

kable(summary_metrics, caption = "Headline business metrics")
Headline business metrics
Metric Value
Total Revenue £20,972,595
Total Transactions 40,077
Unique Customers 5,878
Unique Products 4,916
Countries Served 43
Avg. Order Value £523.31

6.2 Monthly sales trend

monthly_sales <- sales %>%
  mutate(Month = floor_date(InvoiceDate, "month")) %>%
  group_by(Month) %>%
  summarise(
    Revenue = sum(Revenue),
    Orders = n_distinct(Invoice),
    .groups = "drop"
  )

ggplot(monthly_sales, aes(x = Month, y = Revenue)) +
  geom_line(color = palette_main[1], linewidth = 1) +
  geom_point(color = palette_main[1], size = 2) +
  scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-3, suffix = "K")) +
  scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m") +
  labs(
    title = "Monthly Revenue Trend",
    subtitle = "Clear holiday-season spike in Nov each year",
    x = NULL,
    y = "Revenue"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Insight: Revenue spikes dramatically in November. This is consistent with B2B customers placing pre-holiday restocking orders. The retailer should plan inventory and staffing around this annual peak.

6.3 Day-of-week patterns

weekday_sales <- sales %>%
  mutate(Weekday = wday(InvoiceDate, label = TRUE, week_start = 1)) %>%
  group_by(Weekday) %>%
  summarise(
    Revenue = sum(Revenue),
    Orders = n_distinct(Invoice),
    .groups = "drop"
  )

ggplot(weekday_sales, aes(x = Weekday, y = Revenue)) +
  geom_col(fill = palette_main[2]) +
  scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-6, suffix = "M")) +
  labs(
    title = "Revenue by Day of Week",
    subtitle = "Thursday is the strongest sales day; weekends are quieter",
    x = NULL,
    y = "Revenue"
  )

Insight: This is a B2B-leaning business. Saturday is missing entirely (no transactions), and weekend volume is minimal — patterns typical of business buyers, not retail consumers. Marketing campaigns should target weekdays.

6.4 Top countries by revenue

top_countries <- sales %>%
  group_by(Country) %>%
  summarise(
    Revenue = sum(Revenue),
    Orders = n_distinct(Invoice),
    Customers = n_distinct(CustomerID, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(Revenue)) %>%
  slice_head(n = 10)

kable(
  top_countries %>%
    mutate(
      Revenue = dollar(Revenue, prefix = "£"),
      Orders = comma(Orders),
      Customers = comma(Customers)
    ),
  caption = "Top 10 countries by revenue"
)
Top 10 countries by revenue
Country Revenue Orders Customers
United Kingdom £17,870,978 36,535 5,350
EIRE £664,432 626 5
Netherlands £554,232 228 22
Germany £431,262 789 107
France £356,945 622 95
Australia £169,968 95 15
Spain £109,179 154 41
Switzerland £101,011 93 22
Sweden £91,904 105 19
Denmark £69,862 43 12
top_countries %>%
  mutate(Country = fct_reorder(Country, Revenue)) %>%
  ggplot(aes(x = Country, y = Revenue)) +
  geom_col(fill = palette_main[1]) +
  coord_flip() +
  scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-6, suffix = "M")) +
  labs(
    title = "Top 10 Countries by Revenue",
    subtitle = "The UK dominates; international expansion has room to grow",
    x = NULL,
    y = "Revenue"
  )

Insight: The UK accounts for the overwhelming share of revenue. Despite serving 40+ countries, international revenue is concentrated in only a handful of markets (Netherlands, EIRE, Germany). This concentration signals an opportunity for targeted expansion in already-validated EU markets rather than spreading thin.

6.5 Top products

top_products <- sales %>%
  group_by(StockCode, Description) %>%
  summarise(
    Revenue = sum(Revenue),
    Quantity = sum(Quantity),
    .groups = "drop"
  ) %>%
  arrange(desc(Revenue)) %>%
  slice_head(n = 15)

kable(
  top_products %>%
    mutate(
      Revenue = dollar(Revenue, prefix = "£"),
      Quantity = comma(Quantity)
    ),
  caption = "Top 15 products by revenue"
)
Top 15 products by revenue
StockCode Description Revenue Quantity
22423 REGENCY CAKESTAND 3 TIER £344,563 27,577
M Manual £340,716 10,050
DOT DOTCOM POSTAGE £322,657 1,436
85123A WHITE HANGING HEART T-LIGHT HOLDER £262,931 96,086
23843 PAPER CRAFT , LITTLE BIRDIE £168,470 80,995
47566 PARTY BUNTING £149,187 28,378
85099B JUMBO BAG RED RETROSPOT £148,824 78,860
84879 ASSORTED COLOUR BIRD ORNAMENT £132,188 81,809
POST POSTAGE £127,597 5,461
22086 PAPER CHAIN KIT 50’S CHRISTMAS £123,142 36,581
79321 CHILLI LIGHTS £85,490 16,840
23166 MEDIUM CERAMIC TOP STORAGE JAR £81,701 78,033
84347 ROTATING SILVER ANGELS T-LIGHT HLDR £74,449 32,525
21137 BLACK RECORD COVER FRAME £69,329 19,954
85099F JUMBO BAG STRAWBERRY £68,489 37,124

6.6 Pareto analysis: customer concentration

How concentrated is revenue across the customer base?

customer_revenue <- sales %>%
  filter(!is.na(CustomerID)) %>%
  group_by(CustomerID) %>%
  summarise(Revenue = sum(Revenue), .groups = "drop") %>%
  arrange(desc(Revenue)) %>%
  mutate(
    Cumulative_Revenue = cumsum(Revenue),
    Cumulative_Pct = Cumulative_Revenue / sum(Revenue),
    Customer_Rank = row_number(),
    Customer_Pct = Customer_Rank / n()
  )

# What % of revenue comes from top 20% of customers?
top_20_pct_revenue <- customer_revenue %>%
  filter(Customer_Pct <= 0.2) %>%
  summarise(share = max(Cumulative_Pct)) %>%
  pull(share)

cat("Top 20% of customers account for", 
    percent(top_20_pct_revenue, accuracy = 0.1), 
    "of total revenue.\n")
## Top 20% of customers account for 77.2% of total revenue.
ggplot(customer_revenue, aes(x = Customer_Pct, y = Cumulative_Pct)) +
  geom_line(color = palette_main[3], linewidth = 1.2) +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "grey50") +
  geom_vline(xintercept = 0.2, linetype = "dashed", color = "grey50") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Customer Revenue Concentration (Pareto Curve)",
    subtitle = "The 80/20 rule visualized: a small share of customers drives most revenue",
    x = "Cumulative % of Customers",
    y = "Cumulative % of Revenue"
  )

Insight: Revenue is heavily concentrated in a small fraction of customers — the classic Pareto pattern. Customer retention strategy should prioritize this top tier with dedicated account management.

6.7 Return rate analysis

total_sales_value <- sum(sales$Revenue)
total_returns_value <- sum(abs(cancellations$Quantity * cancellations$Price), na.rm = TRUE)
return_rate <- total_returns_value / total_sales_value

cat("Total sales value:", dollar(total_sales_value, prefix = "£"), "\n")
## Total sales value: £20,972,595
cat("Total returns value:", dollar(total_returns_value, prefix = "£"), "\n")
## Total returns value: £1,527,415
cat("Return rate:", percent(return_rate, accuracy = 0.01), "\n")
## Return rate: 7.28%
monthly_returns <- cancellations %>%
  mutate(
    Month = floor_date(InvoiceDate, "month"),
    Return_Value = abs(Quantity * Price)
  ) %>%
  group_by(Month) %>%
  summarise(Returns = sum(Return_Value, na.rm = TRUE), .groups = "drop")

monthly_combined <- monthly_sales %>%
  left_join(monthly_returns, by = "Month") %>%
  mutate(
    Returns = replace_na(Returns, 0),
    Return_Rate = Returns / Revenue
  )

ggplot(monthly_combined, aes(x = Month, y = Return_Rate)) +
  geom_line(color = palette_main[4], linewidth = 1) +
  geom_point(color = palette_main[4], size = 2) +
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m") +
  labs(
    title = "Monthly Return Rate",
    subtitle = "Returns as a percentage of monthly revenue",
    x = NULL,
    y = "Return Rate"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Insight: The return rate varies month-to-month but stays within a predictable band. Months with abnormally high return rates warrant operational investigation (defective batches, shipping issues, etc.).


7 Business Recommendations

Based on the analysis, four concrete actions:

  1. Inventory and staffing for November peak. Revenue in November is dramatically higher than other months. Plan logistics, inventory, and customer service capacity accordingly.

  2. Reframe marketing around B2B weekday patterns. The data shows minimal weekend activity and no Saturday transactions. Email campaigns, ads, and outreach should concentrate on Tuesday–Thursday.

  3. Build a top-customer retention program. A small fraction of customers drives a majority of revenue. Losing one is far more damaging than losing many casual buyers. A dedicated account manager for the top tier is justified.

  4. Targeted EU expansion, not broad expansion. The UK dominates revenue, but a few EU markets (Netherlands, EIRE, Germany) have proven traction. Doubling down on these is lower-risk than entering new geographies.


8 Appendix: Cleaned Data Export

# Run this chunk manually to export the cleaned data
write_csv(sales, "output/sales_cleaned.csv")
write_csv(cancellations, "output/cancellations.csv")
write_csv(monthly_sales, "output/monthly_summary.csv")

Project by Yuhsuan K. — Data Analyst | Excel, R, SQL
Source data: Online Retail II — UCI Machine Learning Repository