This project demonstrates a full data analysis workflow on real-world e-commerce transaction data: starting from raw, messy records and ending with concrete business recommendations. The dataset contains over one million transactions from a UK-based online retailer across two years.
Key business questions answered:
Headline findings (preview):
library(tidyverse) # data manipulation and visualization
library(lubridate) # date handling
library(scales) # nicer axis formatting
library(knitr) # tables in reports
# A consistent visual theme for all charts
theme_set(
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40"),
panel.grid.minor = element_blank()
)
)
# Brand-style color palette
palette_main <- c("#2C5F8D", "#E8833A", "#5BA77F", "#C44536", "#7B6FAA")
The dataset used here is the Online Retail II dataset from the UCI Machine Learning Repository, containing transactions from a UK-based online retailer between Dec 2009 and Dec 2011.
# Adjust the path to wherever you saved the file
# Original file: "online_retail_II.xlsx" (two sheets)
# For faster loading, this report uses the CSV-converted version
data_path <- "data/online_retail_II.csv"
raw_data <- read_csv(data_path, show_col_types = FALSE)
# Quick inspection
glimpse(raw_data)
## Rows: 1,067,371
## Columns: 8
## $ Invoice <chr> "489434", "489434", "489434", "489434", "489434", "48943…
## $ StockCode <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ Description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGH…
## $ Quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, …
## $ InvoiceDate <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07…
## $ Price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.…
## $ `Customer ID` <dbl> 13085, 13085, 13085, 13085, 13085, 13085, 13085, 13085, …
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "U…
cat("Total rows:", format(nrow(raw_data), big.mark = ","), "\n")
## Total rows: 1,067,371
cat("Total columns:", ncol(raw_data), "\n")
## Total columns: 8
cat("Date range:",
as.character(min(raw_data$InvoiceDate, na.rm = TRUE)),
"to",
as.character(max(raw_data$InvoiceDate, na.rm = TRUE)))
## Date range: 2009-12-01 07:45:00 to 2011-12-09 12:50:00
Before any analysis, we systematically check for issues that would distort findings.
missing_summary <- raw_data %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Column", values_to = "Missing_Count") %>%
mutate(
Missing_Percent = round(Missing_Count / nrow(raw_data) * 100, 2)
) %>%
arrange(desc(Missing_Count))
kable(missing_summary, caption = "Missing values by column")
| Column | Missing_Count | Missing_Percent |
|---|---|---|
| Customer ID | 243007 | 22.77 |
| Description | 4382 | 0.41 |
| Invoice | 0 | 0.00 |
| StockCode | 0 | 0.00 |
| Quantity | 0 | 0.00 |
| InvoiceDate | 0 | 0.00 |
| Price | 0 | 0.00 |
| Country | 0 | 0.00 |
Observation: Customer ID has
substantial missing values. These rows cannot be used for customer-level
analysis but are still useful for product- and time-level analysis. We
will handle them context-dependently.
In this dataset, invoice numbers starting with “C” indicate cancelled
orders, and negative Quantity values usually represent
returns.
cancellations <- raw_data %>%
filter(str_starts(Invoice, "C"))
returns_negative_qty <- raw_data %>%
filter(Quantity < 0, !str_starts(Invoice, "C"))
cat("Cancelled orders (Invoice starting with 'C'):",
format(nrow(cancellations), big.mark = ","), "\n")
## Cancelled orders (Invoice starting with 'C'): 19,494
cat("Negative quantity rows (non-cancellation):",
format(nrow(returns_negative_qty), big.mark = ","), "\n")
## Negative quantity rows (non-cancellation): 3,457
invalid_summary <- tibble(
Issue = c(
"Negative or zero price",
"Negative or zero quantity (excl. cancellations)",
"Missing Customer ID",
"Missing Description"
),
Count = c(
sum(raw_data$Price <= 0, na.rm = TRUE),
sum(raw_data$Quantity <= 0 & !str_starts(raw_data$Invoice, "C"), na.rm = TRUE),
sum(is.na(raw_data$`Customer ID`)),
sum(is.na(raw_data$Description))
)
)
kable(invalid_summary, caption = "Data quality issues identified")
| Issue | Count |
|---|---|
| Negative or zero price | 6207 |
| Negative or zero quantity (excl. cancellations) | 3457 |
| Missing Customer ID | 243007 |
| Missing Description | 4382 |
Based on the audit above, the cleaning rules are:
| Rule | Reason |
|---|---|
Remove rows where Price <= 0 |
Cannot be valid transactions |
Remove rows where Quantity <= 0 (except
cancellations) |
Likely data entry errors |
| Separate cancellations into their own table | Useful for return-rate analysis |
Trim whitespace and standardize text in Description and
Country |
Prevent fragmented categories |
Compute Revenue = Quantity × Price |
The key derived metric |
# Standardize column names for easier handling
clean_data <- raw_data %>%
rename(CustomerID = `Customer ID`) %>%
mutate(
Description = str_trim(Description),
Country = str_trim(Country),
Revenue = Quantity * Price
)
# Split into cancellations vs. valid sales
cancellations <- clean_data %>%
filter(str_starts(Invoice, "C"))
sales <- clean_data %>%
filter(
!str_starts(Invoice, "C"),
Price > 0,
Quantity > 0,
!is.na(Description)
)
cat("Cleaned sales rows:", format(nrow(sales), big.mark = ","), "\n")
## Cleaned sales rows: 1,041,670
cat("Cancellations rows:", format(nrow(cancellations), big.mark = ","), "\n")
## Cancellations rows: 19,494
cat("Rows removed:",
format(nrow(raw_data) - nrow(sales) - nrow(cancellations), big.mark = ","), "\n")
## Rows removed: 6,207
summary_metrics <- tibble(
Metric = c("Total Revenue", "Total Transactions", "Unique Customers",
"Unique Products", "Countries Served", "Avg. Order Value"),
Value = c(
dollar(sum(sales$Revenue), prefix = "£"),
comma(n_distinct(sales$Invoice)),
comma(n_distinct(sales$CustomerID, na.rm = TRUE)),
comma(n_distinct(sales$StockCode)),
n_distinct(sales$Country),
dollar(sum(sales$Revenue) / n_distinct(sales$Invoice), prefix = "£")
)
)
kable(summary_metrics, caption = "Headline business metrics")
| Metric | Value |
|---|---|
| Total Revenue | £20,972,595 |
| Total Transactions | 40,077 |
| Unique Customers | 5,878 |
| Unique Products | 4,916 |
| Countries Served | 43 |
| Avg. Order Value | £523.31 |
monthly_sales <- sales %>%
mutate(Month = floor_date(InvoiceDate, "month")) %>%
group_by(Month) %>%
summarise(
Revenue = sum(Revenue),
Orders = n_distinct(Invoice),
.groups = "drop"
)
ggplot(monthly_sales, aes(x = Month, y = Revenue)) +
geom_line(color = palette_main[1], linewidth = 1) +
geom_point(color = palette_main[1], size = 2) +
scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-3, suffix = "K")) +
scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m") +
labs(
title = "Monthly Revenue Trend",
subtitle = "Clear holiday-season spike in Nov each year",
x = NULL,
y = "Revenue"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Insight: Revenue spikes dramatically in November. This is consistent with B2B customers placing pre-holiday restocking orders. The retailer should plan inventory and staffing around this annual peak.
weekday_sales <- sales %>%
mutate(Weekday = wday(InvoiceDate, label = TRUE, week_start = 1)) %>%
group_by(Weekday) %>%
summarise(
Revenue = sum(Revenue),
Orders = n_distinct(Invoice),
.groups = "drop"
)
ggplot(weekday_sales, aes(x = Weekday, y = Revenue)) +
geom_col(fill = palette_main[2]) +
scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-6, suffix = "M")) +
labs(
title = "Revenue by Day of Week",
subtitle = "Thursday is the strongest sales day; weekends are quieter",
x = NULL,
y = "Revenue"
)
Insight: This is a B2B-leaning business. Saturday is missing entirely (no transactions), and weekend volume is minimal — patterns typical of business buyers, not retail consumers. Marketing campaigns should target weekdays.
top_countries <- sales %>%
group_by(Country) %>%
summarise(
Revenue = sum(Revenue),
Orders = n_distinct(Invoice),
Customers = n_distinct(CustomerID, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(Revenue)) %>%
slice_head(n = 10)
kable(
top_countries %>%
mutate(
Revenue = dollar(Revenue, prefix = "£"),
Orders = comma(Orders),
Customers = comma(Customers)
),
caption = "Top 10 countries by revenue"
)
| Country | Revenue | Orders | Customers |
|---|---|---|---|
| United Kingdom | £17,870,978 | 36,535 | 5,350 |
| EIRE | £664,432 | 626 | 5 |
| Netherlands | £554,232 | 228 | 22 |
| Germany | £431,262 | 789 | 107 |
| France | £356,945 | 622 | 95 |
| Australia | £169,968 | 95 | 15 |
| Spain | £109,179 | 154 | 41 |
| Switzerland | £101,011 | 93 | 22 |
| Sweden | £91,904 | 105 | 19 |
| Denmark | £69,862 | 43 | 12 |
top_countries %>%
mutate(Country = fct_reorder(Country, Revenue)) %>%
ggplot(aes(x = Country, y = Revenue)) +
geom_col(fill = palette_main[1]) +
coord_flip() +
scale_y_continuous(labels = label_dollar(prefix = "£", scale = 1e-6, suffix = "M")) +
labs(
title = "Top 10 Countries by Revenue",
subtitle = "The UK dominates; international expansion has room to grow",
x = NULL,
y = "Revenue"
)
Insight: The UK accounts for the overwhelming share of revenue. Despite serving 40+ countries, international revenue is concentrated in only a handful of markets (Netherlands, EIRE, Germany). This concentration signals an opportunity for targeted expansion in already-validated EU markets rather than spreading thin.
top_products <- sales %>%
group_by(StockCode, Description) %>%
summarise(
Revenue = sum(Revenue),
Quantity = sum(Quantity),
.groups = "drop"
) %>%
arrange(desc(Revenue)) %>%
slice_head(n = 15)
kable(
top_products %>%
mutate(
Revenue = dollar(Revenue, prefix = "£"),
Quantity = comma(Quantity)
),
caption = "Top 15 products by revenue"
)
| StockCode | Description | Revenue | Quantity |
|---|---|---|---|
| 22423 | REGENCY CAKESTAND 3 TIER | £344,563 | 27,577 |
| M | Manual | £340,716 | 10,050 |
| DOT | DOTCOM POSTAGE | £322,657 | 1,436 |
| 85123A | WHITE HANGING HEART T-LIGHT HOLDER | £262,931 | 96,086 |
| 23843 | PAPER CRAFT , LITTLE BIRDIE | £168,470 | 80,995 |
| 47566 | PARTY BUNTING | £149,187 | 28,378 |
| 85099B | JUMBO BAG RED RETROSPOT | £148,824 | 78,860 |
| 84879 | ASSORTED COLOUR BIRD ORNAMENT | £132,188 | 81,809 |
| POST | POSTAGE | £127,597 | 5,461 |
| 22086 | PAPER CHAIN KIT 50’S CHRISTMAS | £123,142 | 36,581 |
| 79321 | CHILLI LIGHTS | £85,490 | 16,840 |
| 23166 | MEDIUM CERAMIC TOP STORAGE JAR | £81,701 | 78,033 |
| 84347 | ROTATING SILVER ANGELS T-LIGHT HLDR | £74,449 | 32,525 |
| 21137 | BLACK RECORD COVER FRAME | £69,329 | 19,954 |
| 85099F | JUMBO BAG STRAWBERRY | £68,489 | 37,124 |
How concentrated is revenue across the customer base?
customer_revenue <- sales %>%
filter(!is.na(CustomerID)) %>%
group_by(CustomerID) %>%
summarise(Revenue = sum(Revenue), .groups = "drop") %>%
arrange(desc(Revenue)) %>%
mutate(
Cumulative_Revenue = cumsum(Revenue),
Cumulative_Pct = Cumulative_Revenue / sum(Revenue),
Customer_Rank = row_number(),
Customer_Pct = Customer_Rank / n()
)
# What % of revenue comes from top 20% of customers?
top_20_pct_revenue <- customer_revenue %>%
filter(Customer_Pct <= 0.2) %>%
summarise(share = max(Cumulative_Pct)) %>%
pull(share)
cat("Top 20% of customers account for",
percent(top_20_pct_revenue, accuracy = 0.1),
"of total revenue.\n")
## Top 20% of customers account for 77.2% of total revenue.
ggplot(customer_revenue, aes(x = Customer_Pct, y = Cumulative_Pct)) +
geom_line(color = palette_main[3], linewidth = 1.2) +
geom_hline(yintercept = 0.8, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0.2, linetype = "dashed", color = "grey50") +
scale_x_continuous(labels = percent_format()) +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Customer Revenue Concentration (Pareto Curve)",
subtitle = "The 80/20 rule visualized: a small share of customers drives most revenue",
x = "Cumulative % of Customers",
y = "Cumulative % of Revenue"
)
Insight: Revenue is heavily concentrated in a small fraction of customers — the classic Pareto pattern. Customer retention strategy should prioritize this top tier with dedicated account management.
total_sales_value <- sum(sales$Revenue)
total_returns_value <- sum(abs(cancellations$Quantity * cancellations$Price), na.rm = TRUE)
return_rate <- total_returns_value / total_sales_value
cat("Total sales value:", dollar(total_sales_value, prefix = "£"), "\n")
## Total sales value: £20,972,595
cat("Total returns value:", dollar(total_returns_value, prefix = "£"), "\n")
## Total returns value: £1,527,415
cat("Return rate:", percent(return_rate, accuracy = 0.01), "\n")
## Return rate: 7.28%
monthly_returns <- cancellations %>%
mutate(
Month = floor_date(InvoiceDate, "month"),
Return_Value = abs(Quantity * Price)
) %>%
group_by(Month) %>%
summarise(Returns = sum(Return_Value, na.rm = TRUE), .groups = "drop")
monthly_combined <- monthly_sales %>%
left_join(monthly_returns, by = "Month") %>%
mutate(
Returns = replace_na(Returns, 0),
Return_Rate = Returns / Revenue
)
ggplot(monthly_combined, aes(x = Month, y = Return_Rate)) +
geom_line(color = palette_main[4], linewidth = 1) +
geom_point(color = palette_main[4], size = 2) +
scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m") +
labs(
title = "Monthly Return Rate",
subtitle = "Returns as a percentage of monthly revenue",
x = NULL,
y = "Return Rate"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Insight: The return rate varies month-to-month but stays within a predictable band. Months with abnormally high return rates warrant operational investigation (defective batches, shipping issues, etc.).
Based on the analysis, four concrete actions:
Inventory and staffing for November peak. Revenue in November is dramatically higher than other months. Plan logistics, inventory, and customer service capacity accordingly.
Reframe marketing around B2B weekday patterns. The data shows minimal weekend activity and no Saturday transactions. Email campaigns, ads, and outreach should concentrate on Tuesday–Thursday.
Build a top-customer retention program. A small fraction of customers drives a majority of revenue. Losing one is far more damaging than losing many casual buyers. A dedicated account manager for the top tier is justified.
Targeted EU expansion, not broad expansion. The UK dominates revenue, but a few EU markets (Netherlands, EIRE, Germany) have proven traction. Doubling down on these is lower-risk than entering new geographies.
# Run this chunk manually to export the cleaned data
write_csv(sales, "output/sales_cleaned.csv")
write_csv(cancellations, "output/cancellations.csv")
write_csv(monthly_sales, "output/monthly_summary.csv")
Project by Yuhsuan K. — Data Analyst | Excel, R, SQL
Source data: Online
Retail II — UCI Machine Learning Repository