Exploratory Data Analysis for UK E-commerce Transaction

Author

Karim Abdul Aziz Chatab

I. Dataset Description

Retrieved from: https://www.kaggle.com/datasets/gabrielramos87/an-online-shop-business

This is a sales transaction data set of UK-based e-commerce (online retail) for one year. This London-based shop has been selling gifts and homewares for adults and children through the website since 2007. Their customers come from all over the world and usually make direct purchases for themselves. There are also small businesses that buy in bulk and sell to other customers through retail outlet channels.

The data set contains 536,350 rows and 8 columns. The following is the description of each column.

Column	Type	Description
TransactionNo	Categorical	A six-digit unique number that defines each transaction. The letter “C” in the code indicates a cancellation.
Date	Numeric	The date when each transaction was generated.
ProductNo	Categorical	A five or six-digit unique character used to identify a specific product.
Product	Categorical	Product/item name.
Price	Numeric	The price of each product per unit in pound sterling (£).
Quantity	Numeric	The quantity of each product per transaction. Negative values related to cancelled transactions.
CustomerNo	Categorical	A five-digit unique number that defines each customer.
Country	Categorical	Name of the country where the customer resides.

There is a small percentage of order cancellation in the data set. Most of these cancellations were due to out-of-stock conditions on some products. Under this situation, customers tend to cancel an order as they want all products delivered all at once.

II. Load Libraries

show code

```{r}
library(scales)
library(tidyverse)
library(patchwork)
```

III. Initial Data loading and Wrangling

For each plot there will be further data wrangling to complement the graph. This initial data wrangling is to truly clean it first.

show code

```{r}
# Read dataset
sales = read_csv('Sales Transaction v.4a.csv')

# Check for Missing Values
MV = sum(is.na(sales))

# Show first 5 data
head(sales, 10)

# Data Wrangling to get three new variables
canceled = sales %>%
  filter(startsWith(TransactionNo, 'C'))

# Check for Missing Values
MVC = sum(is.na(canceled))

sales %>%
  summarize_all(~ sum(is.na(.)))
```

# A tibble: 10 × 8
   TransactionNo Date    ProductNo ProductName Price Quantity CustomerNo Country
   <chr>         <chr>   <chr>     <chr>       <dbl>    <dbl>      <dbl> <chr>  
 1 581482        12/9/2… 22485     Set Of 2 W…  21.5       12      17490 United…
 2 581475        12/9/2… 22596     Christmas …  10.6       36      13069 United…
 3 581475        12/9/2… 23235     Storage Ti…  11.5       12      13069 United…
 4 581475        12/9/2… 23272     Tree T-Lig…  10.6       12      13069 United…
 5 581475        12/9/2… 23239     Set Of 4 K…  11.9        6      13069 United…
 6 581475        12/9/2… 21705     Bag 500g S…  10.6       24      13069 United…
 7 581475        12/9/2… 22118     Joy Wooden…  11.5       18      13069 United…
 8 581475        12/9/2… 22119     Peace Wood…  12.2       12      13069 United…
 9 581475        12/9/2… 22217     T-Light Ho…  10.6       12      13069 United…
10 581475        12/9/2… 22216     T-Light Ho…  10.6       24      13069 United…
# A tibble: 1 × 8
  TransactionNo  Date ProductNo ProductName Price Quantity CustomerNo Country
          <int> <int>     <int>       <int> <int>    <int>      <int>   <int>
1             0     0         0           0     0        0         55       0

This dataset has 55 missing values all from “CustomerNo” which 54 of missing values is from cancelled transactions.

show code

```{r}
canceled %>% 
  arrange(Quantity) %>% 
  head(5)
```

# A tibble: 5 × 8
  TransactionNo Date     ProductNo ProductName Price Quantity CustomerNo Country
  <chr>         <chr>    <chr>     <chr>       <dbl>    <dbl>      <dbl> <chr>  
1 C581484       12/9/20… 23843     Paper Craf…  6.19   -80995      16446 United…
2 C541433       1/18/20… 23166     Medium Cer… 11.3    -74215      12346 United…
3 C536757       12/2/20… 84347     Rotating S… 10.3     -9360      15838 United…
4 C550456       4/18/20… 21108     Fairy Cake… 12.4     -3114      15749 United…
5 C550456       4/18/20… 21175     Gin And To… 12.2     -2000      15749 United…

show code

```{r}
sales %>% 
  arrange(desc(Quantity)) %>% 
  head(5)
```

# A tibble: 5 × 8
  TransactionNo Date     ProductNo ProductName Price Quantity CustomerNo Country
  <chr>         <chr>    <chr>     <chr>       <dbl>    <dbl>      <dbl> <chr>  
1 581483        12/9/20… 23843     Paper Craf… 12.4     80995      16446 United…
2 541431        1/18/20… 23166     Medium Cer… 11.3     74215      12346 United…
3 578841        11/25/2… 84826     Asstd Desi…  6.19    12540      13256 United…
4 573008        10/27/2… 84077     World War … 10.5      4800      12901 United…
5 554868        5/27/20… 22197     Popcorn Ho… 11.0      4300      13135 United…

This dataset consists of the “Cancelled” Transactions, but the identical Quantity and TransactionNo from the first two rows suggests the data was not deleted after it was being cancelled. Hence, we update the dataset to subtract the transaction data with its cancellation data counterparts.

However, Their TransactionNo is not identical. But their Quantity, ProductName and CustomerNo is.

show code

```{r}
canceled_ = canceled %>%
  mutate(Quantity = Quantity - Quantity -Quantity) 

df = sales %>% 
  anti_join(canceled) %>%
  anti_join(canceled_, 
            by = c("ProductNo","ProductName","Quantity","CustomerNo")) %>%
  mutate(Revenue = Price * Quantity) %>% 
  group_by(CustomerNo) %>%
  mutate(Revenue_percust = sum(Revenue)) %>% 
  ungroup() %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))%>% 
  arrange(desc(Quantity))
```

IV. Visualization Analysis

Plot 1. Time Series

Daily Transaction plot

Objective:

Visualize the company’s daily transaction
Will be useful to give insights of the sales’ traffic

show code

```{r}
# Transactions per day
date_transaction = df %>% 
  count(Date)

# Get the two peak's in revenue
top_2_T = slice_max(date_transaction, n, n =2)

# Transaction movement per day
ggplot(date_transaction, aes(Date, n))+ 
  geom_line() +
  theme_minimal() +
  ggtitle("Daily Transaction") + ylab(NULL) +
  geom_text(data = top_2_T, 
            aes(label = prettyNum(n, 
                                  big.mark = ',',
                                  scientific = FALSE, digits = 10)),
            vjust = -0.5)+
  geom_hline(yintercept = mean(date_transaction$n), linetype = "dashed", color ="red")+
  geom_label(y = 2000, x =as.Date('2019-2-01', format = "%Y-%m-%d"), 
             label = paste0("Yearly Avg: ",comma(mean(date_transaction$n))), color="red") + ylim(0,5500)
```

Purpose

Show the fluctuations of the transaction throughout year.
initial datasets needs to be grouped and summarize accordingly to each date to plot the time series, hence, group_by and summarize wrangling method was used.

This graph displays daily transaction counts for a UK company over the course of a year, from late 2018 to late 2019. The data shows significant day-to-day fluctuations, with a general upward trend towards the end of the period. Most daily transactions fall between 1,000 and 3,000, but there are several notable spikes exceeding 4,000 transactions, particularly in the last quarter. The yearly average is indicated as 1,730 transactions per day, represented by a red dashed line.

Daily Revenue Plot

Objective:

Determine the stability of the revenue cash flow.
The shareholder demands information about the cash flow fluctuations in revenue.

show code

```{r}
# Revenue per date
date_revenue = df %>% 
  group_by(Date) %>% 
  summarise(Revenue = sum(Revenue)) %>% 
  ungroup() 

# Get the two peak's in revenue
top_2 = slice_max(date_revenue, Revenue, n =2)

# Calculate Mean revenue
mean_revenue = prettyNum(mean(date_revenue$Revenue),big.mark=",",scientific=FALSE, digits = 8)

# Plot revenue movement per day
ggplot(date_revenue, aes(Date, Revenue)) + 
  geom_line() +
  theme_minimal() +
  ggtitle("Daily Revenue", subtitle = "From December 2018 to December 2019") +
  # Add text labels only for the top 2 highest revenue days
  geom_text(data = top_2, 
            aes(label = prettyNum(Revenue, big.mark = ',',
                                  scientific = FALSE, digits = 10)),
            vjust = -0.5) +
  xlim(as.Date('2018-12-01'), as.Date('2019-12-30'))+
  scale_y_continuous(labels = comma, limits = c(0,600000))+
  geom_hline(yintercept = mean(date_revenue$Revenue), linetype = "dashed", 
             color ="red")+
  geom_label(y = 280000, x =as.Date('2019-3-01', format = "%Y-%m-%d"), 
             label = paste0("Yearly Avg: ",mean_revenue), color="red")
```

Purpose

Show the fluctuations of the revenue throughout year.
initial datasets needs to be grouped and summarize accordingly to each date to plot the time series, hence, group_by and summarize wrangling method was used.

The graph illustrates revenue changes over the year, with most figures averaging around £195,763. Significant peaks occur in October and November 2019, at £527,634 and £553,496, indicating strong sales in the last quarter. These spikes suggest seasonal influences, possibly from holiday events or promotions. Overall, the revenue trend fluctuates but remains close to the average, indicating steady daily sales with occasional increases.

Plot. 2: Bar/Column chart

Country Revenue and Top 5 Revenue Countries

Objective:

Discover the majority revenue streams from outside the origin country (United Kingdom).
Finding a possible market demand from other countries.

show code

```{r}
# Get each country's revenue
country_revenue = df %>% 
  group_by(Country) %>%
  summarise(Revenue = sum(Revenue))

# Top 5 countries' revenues
top_5  = country_revenue %>% 
  arrange(desc(Revenue)) %>%
  head(5)

# UK has the highest, due to the large differences in revenue, we exclude it 
top_5_exclude_uk  = country_revenue %>% 
  filter(Country != "United Kingdom") %>%
  arrange(desc(Revenue)) %>%
  head(5)

# Top 5 countries' revenues plot
top_5_plot = ggplot(top_5_exclude_uk, 
       aes(reorder(Country, Revenue ),Revenue )) + 
  geom_col(aes(fill= Country), show.legend = NULL) + 
  ggtitle("Top 5 Revenue Excluding UK") + 
  theme_minimal() + ylab('Revenue') +xlab(NULL)+
  geom_text(label = comma(top_5_exclude_uk$Revenue), vjust = -0.3, size = 3)+
  ylim(0,2400000) + ylab(NULL) + scale_y_continuous(labels = NULL)

# Visualize all countries 
rev_nouk = country_revenue%>% 
  filter(Country != "United Kingdom") %>% arrange(desc(Revenue))

rev_nouk <- rev_nouk %>%
  mutate(is_top_5 = row_number() <= 5,
         country_label = ifelse(is_top_5, Country, ""))

countries_plot = ggplot(rev_nouk, 
       aes(reorder(Country, Revenue), Revenue)) + 
  geom_col(aes(fill = is_top_5), show.legend = FALSE) + 
  scale_fill_manual(values = c("TRUE" = "#00BA38", "FALSE" = "grey")) +
  ggtitle("Countries Revenue excluding UK") + 
  theme_minimal() + 
  ylab('Revenue') +
  xlab(NULL) + 
  annotate(x ='Sweden', y = 1600000, 
           label = "Top 5 Countries", 
           geom = "label", color = "#00BA38") + 
  coord_flip()

# Plot all countries vs top 5 countries
countries_plot + top_5_plot 
```

Purpose:

Determining which country has the most customers outside of the store’s origin country to highlight potential market presence in those countries’
Dataset required revenue generated per country while excluding United Kingdom, Hence, group_by, summarise, filter,and arrange wrangling method was used.

The graph emphasizes the stark revenue disparities among countries, highlighting how a few top-performing countries contribute significantly more to revenue compared to others. This analysis could suggest a need for targeted strategies in lower-performing regions to boost revenue. The bar chart showcases the top five revenue-generating countries, excluding the UK, with the Netherlands leading at £2,137,603. EIRE follows at £1,602,341, with Germany, France, and Australia contributing £1,358,280, £1,310,007, and £989,246, respectively. The significant revenue gap indicates a stronger market presence in the Netherlands, suggesting that focusing marketing efforts on these high-revenue regions could be beneficial.

Top 10 most ordered product

Objective:

Discover the most ordered product.
Give a good understanding of the stores’ product demand

show code

```{r}
# Data Wrangling to get the plot
product_count = df %>% count(ProductName, Quantity) %>% 
  mutate(Total = Quantity * n)%>% group_by(ProductName) %>%
  summarise(SumTotal = sum(Total, na.rm = TRUE))%>% arrange(desc(SumTotal)) 

# Plot the bargraph
ggplot(head(product_count, 10), 
       aes(SumTotal, fct_reorder(ProductName, SumTotal))) + 
  geom_col(aes(fill = ProductName), 
           show.legend = FALSE) +
  geom_text(aes(label = comma(SumTotal)), hjust = -0.05) + 
  ylab(NULL) + xlab("Count") + xlim(0,61000)+
  theme_minimal() + ggtitle("Top 10 most ordered product by Quantity") 
```

Purpose:

To determine the winning product. The company plans to invest in a category of product, but wanted to know which product has the most transaction.
Dataset required manipulating datasets and count respective items. Hence, group_by, summarise, mutate and arrange wrangling method was used.

Top 5 Countries’ customer count

Objective:

Discover which country holds most customers outside the origin country (United Kingdom).
Finding a possible market demand from other countries.

show code

```{r}
# Create a customers dataframe
customers = df %>% group_by(Country) %>% 
  count(CustomerNo) %>%
  arrange(desc(n)) %>% ungroup() %>%
  count(Country) %>% arrange(desc(n))

# calculate total customers
customers_sum = customers$n %>% sum()

# Plot the top 5 countries' customers
top_5_cust = customers %>% filter(Country %in% top_5_exclude_uk$Country) %>% ggplot(aes(fct_reorder(Country, n ),n, fill = Country)) + 
  geom_col(show.legend =FALSE)+
  geom_text(aes(label = n), vjust = -0.2, size = 3) + 
  theme_minimal() + 
  ggtitle("Top 5 Countries",subtitle = "(Excluding UK)") +
  xlab(NULL) + ylab("Customers") + 
  ylab(NULL) + scale_y_continuous(labels = NULL)
# Visualize all countries 
cust_nouk = customers%>% 
  filter(Country != "United Kingdom") %>% arrange(desc(n))

cust_nouk <- cust_nouk %>%
  mutate(is_top_5 = row_number() <= 5,
         country_label = ifelse(is_top_5, Country, ""))

customer_plot = ggplot(cust_nouk, 
       aes(reorder(Country,  n), n)) + 
  geom_col(aes(fill = is_top_5), show.legend = FALSE) + 
  scale_fill_manual(values = c("TRUE" = "#00BA38", "FALSE" = "grey")) +
  theme_minimal() + 
  ylab('Customers') +
  xlab(NULL) + 
  ggtitle("Countries customer count",subtitle = "(Excluding UK)") +
  annotate(x ='Switzerland', y = 55, 
           label = "Top 5 Countries", 
           geom = "label", color = "#00BA38") +
  coord_flip() 

# Plot all countries vs top 5 countries
customer_plot + top_5_cust 
```

Purpose:

Determining which country has the most customers outside of the store’s origin country to highlight potential market presence in those countries.
Dataset required transaction generated per country while excluding United Kingdom, Hence, group_by, summarise, filter,and arrange wrangling method was used.

This visualization depicts international customer distribution, excluding the UK, through two complementary charts. Germany leads with 91 customers, followed closely by France with 87, showing strong dominance in central European markets. There’s a significant drop to the next tier, with Ireland (13 customers), Netherlands (9), and Australia (9) rounding out the top five. The broader chart reveals a long tail of markets with minimal presence, from Belgium to Brazil, suggesting a business primarily focused on Western Europe with limited penetration in other global markets, including notably low numbers in major economies like the USA.

Plot. 3: Histogram (Bar Graph diagram plot)

Objective:

Determine the distribution of revenue generated per customer.
Revealing concentration of customers within certain revenue ranges.

show code

```{r}
customer_revenue <- df %>%
  group_by(CustomerNo) %>%
  summarize(TotalRevenue = sum(Revenue))

hist_1 = ggplot(customer_revenue, aes(x = TotalRevenue)) +
  geom_histogram(bins = 50, fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Total Revenue per Customer",
       x = "Total Revenue per Customer",
       y = "Count") +
  scale_x_continuous(labels = scales::comma)

hist_2 = ggplot(df, aes(x = Revenue)) +
  geom_histogram(bins = 50, fill = "lightgreen", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Revenue per Transaction",
       x = "Revenue per Transaction",
       y = "Count") +
  scale_x_continuous(labels = scales::comma) 

# Calculate the 95th percentile
q95 <- quantile(df$Revenue_percust, 0.95)

hist_3_q95 = ggplot(df, aes(x = Revenue_percust)) +
  geom_histogram(bins = 50, fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Total Revenue per Customer (95th percentile)",
       x = "Total Revenue per Customer",
       y = "Count") +
  scale_y_continuous(labels = scales::comma) + 
  scale_x_continuous(labels = scales::comma, limits = c(0, q95))

hist_3 = ggplot(df, aes(x = Revenue_percust)) +
  geom_histogram(bins = 50, fill = "pink", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Total Revenue per Customer ",
       x = "Total Revenue per Customer",
       y = "Count") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) + 
  annotate("text", x = 2020000, y = 80000, label = "Top Revenue", color = "red", fontface = "bold", size = 3)+
  annotate("text", x = 2020000, y = 40000, color = "red", fontface = "bold",label = comma(max(df$Revenue_percust)),size = 3)

hist_3 / hist_3_q95
```

show code

```{r}
central_revenue_percust = df %>% summarize(mean = mean(Revenue_percust),
                 median = median(Revenue_percust),
                 max = max(Revenue_percust),
                 q95 = quantile(Revenue_percust, 0.95))

ggplot(df, aes(x = Revenue_percust)) +
  geom_histogram(bins = 50, fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Total Revenue per Customer", 
       subtitle = "limit to 95th percentile",
       x = "Total Revenue per Customer (£)",
       y = "Customer count") +
  scale_y_continuous(labels = comma) + 
  scale_x_continuous(labels = comma, 
                     limits = c(0, central_revenue_percust$q95))+
  geom_vline(xintercept = central_revenue_percust$mean, 
               linetype = "dashed", linewidth = 0.7, color = "darkseagreen4")+
  geom_vline(xintercept = central_revenue_percust$median, 
             linetype = "dashed", linewidth = 0.7,  color = "darkgreen")+
  #median
  annotate("label", x = 43000 ,y = 55000, 
           label = paste0("Median: £", 
                          comma(central_revenue_percust$median)), 
           color ="darkgreen") + 
  #mean
  annotate("label", x = 81000 ,y = 80000,
           label = paste0("Mean: £", 
                          comma(central_revenue_percust$mean)),
           color ="darkseagreen4") + 
  geom_segment(aes(x = 180000, y = 25000, xend = 240000, yend = 10000), 
               arrow = arrow(length = unit(0.3, "cm")), color = "darkred") + 
  annotate("label", x = 200000, y = 25000, 
           label = paste0("Top Revenue up to £", 
                          comma(central_revenue_percust$max)), 
           color = "darkred")
```

Purpose:

To show the distribution of total revenue generated per customer.
Dataset required vertical line visualizing the central tendencies of revenue per customer while zooming into the 95% percentile due to significant outlier, giving a more insightful visualization.
We’ve used summarize wrangling method to calculate the mean, median, max and 95% quantile through mean, median, max, and quantile method

This histogram limit to 95th percentile. The data is highly right-skewed, with most customers generating lower revenues and a long tail of high-value customers. The median revenue is 16,326, while the mean is significantly higher at 57,610, indicating some very high-revenue customers. The top revenue per customer reaches $2,098,752, though this appears to be an extreme case.

Plot. 4: Scatterplot

Objective:

Understand the relationship between revenue and quantity sold, indicate whether more units sold lead to proportionally higher revenue.
Identify patterns or trends such as linear or non-linear relationships, clusters, or outliers in the data, providing insights into sales performance.

show code

```{r}
# Calculate medians
mean_qty = mean(df$Quantity)

mean_rev = mean(df$Revenue)

# use these ranges to give more context relative to its central tendencies
df_ <- df %>%
  mutate(Quantity_range = case_when(
    Quantity >= mean_qty + (mean_qty / 2) ~ "High",
    Quantity <= mean_qty - (mean_qty / 2) ~ "Low",
    TRUE ~ "Average"
  ),
  Revenue_range = case_when(
    Revenue >= mean_rev + (mean_rev / 2) ~ "High",
    Revenue <= mean_rev - (mean_rev / 2) ~ "Low",
    TRUE ~ "Average"))


# Fit the linear model to print the R-squared
fmla = as.formula("Revenue ~ Quantity")
linear_model = lm(fmla, df)
model_summary = summary(linear_model)
r_squared = model_summary$r.squared

# Find correlation 
correlation = cor(df$Quantity, df$Revenue)
# Create the plot
ggplot(df_, aes(x = Quantity, y = Revenue)) +
  geom_point(alpha = 0.2, aes(color = Quantity_range, shape = Revenue_range ), 
             size = 5) +
  geom_smooth(method = "lm", se = FALSE, size = 0.5, linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Relationship between Revenue and Quantity",
    x = "Quantity per transaction",
    y = "Revenue per transaction",
    shape = "Revenue Range",
    color = "Quantity Range"
  ) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) + 
  annotate("text", x = 11100, y = 80000, label = "Top Outlier") + 
  annotate("text", x = 12000, y = Inf, 
           label = paste("R-squared:", round(r_squared, 2)),
           hjust = 1.1, vjust = 1.5, size = 4, color = "red") +
  annotate("text", x = 4000, y = Inf, 
           label = paste("Correlation:", round(correlation, 2)),
           hjust = 1.1, vjust = 1.5, size = 4, color = "blue") 
```

Purpose:

Illustrate how revenue increases with the quantity of products sold.
Data requires the mean of quantity and revenue, creating two variables through mean respectively while using mutate and case_when to add two new columns representing the revenue and quantity range.
We’ve also added correlation and calculated the r.squared to emphasizes the two variables strength and direction of the linear relationship.

The scatterplot shows a strong positive correlation (0.97) and high R-squared (0.94) between quantity sold and revenue, with most points clustered at lower values, indicating frequent smaller transactions. A clear linear trend suggests consistent pricing across volumes as increasing quantities are generally associated with proportionally higher revenues. Few high-value outliers likely represent bulk purchases by organizations rather than individuals.