show code
```{r}
library(scales)
library(tidyverse)
library(patchwork)
```Retrieved from: https://www.kaggle.com/datasets/gabrielramos87/an-online-shop-business
This is a sales transaction data set of UK-based e-commerce (online retail) for one year. This London-based shop has been selling gifts and homewares for adults and children through the website since 2007. Their customers come from all over the world and usually make direct purchases for themselves. There are also small businesses that buy in bulk and sell to other customers through retail outlet channels.
The data set contains 536,350 rows and 8 columns. The following is the description of each column.
| Column | Type | Description |
|---|---|---|
| TransactionNo | Categorical | A six-digit unique number that defines each transaction. The letter “C” in the code indicates a cancellation. |
| Date | Numeric | The date when each transaction was generated. |
| ProductNo | Categorical | A five or six-digit unique character used to identify a specific product. |
| Product | Categorical | Product/item name. |
| Price | Numeric | The price of each product per unit in pound sterling (£). |
| Quantity | Numeric | The quantity of each product per transaction. Negative values related to cancelled transactions. |
| CustomerNo | Categorical | A five-digit unique number that defines each customer. |
| Country | Categorical | Name of the country where the customer resides. |
There is a small percentage of order cancellation in the data set. Most of these cancellations were due to out-of-stock conditions on some products. Under this situation, customers tend to cancel an order as they want all products delivered all at once.
```{r}
library(scales)
library(tidyverse)
library(patchwork)
```For each plot there will be further data wrangling to complement the graph. This initial data wrangling is to truly clean it first.
```{r}
# Read dataset
sales = read_csv('Sales Transaction v.4a.csv')
# Check for Missing Values
MV = sum(is.na(sales))
# Show first 5 data
head(sales, 10)
# Data Wrangling to get three new variables
canceled = sales %>%
filter(startsWith(TransactionNo, 'C'))
# Check for Missing Values
MVC = sum(is.na(canceled))
sales %>%
summarize_all(~ sum(is.na(.)))
```# A tibble: 10 × 8
TransactionNo Date ProductNo ProductName Price Quantity CustomerNo Country
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 581482 12/9/2… 22485 Set Of 2 W… 21.5 12 17490 United…
2 581475 12/9/2… 22596 Christmas … 10.6 36 13069 United…
3 581475 12/9/2… 23235 Storage Ti… 11.5 12 13069 United…
4 581475 12/9/2… 23272 Tree T-Lig… 10.6 12 13069 United…
5 581475 12/9/2… 23239 Set Of 4 K… 11.9 6 13069 United…
6 581475 12/9/2… 21705 Bag 500g S… 10.6 24 13069 United…
7 581475 12/9/2… 22118 Joy Wooden… 11.5 18 13069 United…
8 581475 12/9/2… 22119 Peace Wood… 12.2 12 13069 United…
9 581475 12/9/2… 22217 T-Light Ho… 10.6 12 13069 United…
10 581475 12/9/2… 22216 T-Light Ho… 10.6 24 13069 United…
# A tibble: 1 × 8
TransactionNo Date ProductNo ProductName Price Quantity CustomerNo Country
<int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 55 0
This dataset has 55 missing values all from “CustomerNo” which 54 of missing values is from cancelled transactions.
```{r}
canceled %>%
arrange(Quantity) %>%
head(5)
```# A tibble: 5 × 8
TransactionNo Date ProductNo ProductName Price Quantity CustomerNo Country
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 C581484 12/9/20… 23843 Paper Craf… 6.19 -80995 16446 United…
2 C541433 1/18/20… 23166 Medium Cer… 11.3 -74215 12346 United…
3 C536757 12/2/20… 84347 Rotating S… 10.3 -9360 15838 United…
4 C550456 4/18/20… 21108 Fairy Cake… 12.4 -3114 15749 United…
5 C550456 4/18/20… 21175 Gin And To… 12.2 -2000 15749 United…
```{r}
sales %>%
arrange(desc(Quantity)) %>%
head(5)
```# A tibble: 5 × 8
TransactionNo Date ProductNo ProductName Price Quantity CustomerNo Country
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 581483 12/9/20… 23843 Paper Craf… 12.4 80995 16446 United…
2 541431 1/18/20… 23166 Medium Cer… 11.3 74215 12346 United…
3 578841 11/25/2… 84826 Asstd Desi… 6.19 12540 13256 United…
4 573008 10/27/2… 84077 World War … 10.5 4800 12901 United…
5 554868 5/27/20… 22197 Popcorn Ho… 11.0 4300 13135 United…
This dataset consists of the “Cancelled” Transactions, but the identical Quantity and TransactionNo from the first two rows suggests the data was not deleted after it was being cancelled. Hence, we update the dataset to subtract the transaction data with its cancellation data counterparts.
However, Their TransactionNo is not identical. But their Quantity, ProductName and CustomerNo is.
```{r}
canceled_ = canceled %>%
mutate(Quantity = Quantity - Quantity -Quantity)
df = sales %>%
anti_join(canceled) %>%
anti_join(canceled_,
by = c("ProductNo","ProductName","Quantity","CustomerNo")) %>%
mutate(Revenue = Price * Quantity) %>%
group_by(CustomerNo) %>%
mutate(Revenue_percust = sum(Revenue)) %>%
ungroup() %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y"))%>%
arrange(desc(Quantity))
```Visualize the company’s daily transaction
Will be useful to give insights of the sales’ traffic
```{r}
# Transactions per day
date_transaction = df %>%
count(Date)
# Get the two peak's in revenue
top_2_T = slice_max(date_transaction, n, n =2)
# Transaction movement per day
ggplot(date_transaction, aes(Date, n))+
geom_line() +
theme_minimal() +
ggtitle("Daily Transaction") + ylab(NULL) +
geom_text(data = top_2_T,
aes(label = prettyNum(n,
big.mark = ',',
scientific = FALSE, digits = 10)),
vjust = -0.5)+
geom_hline(yintercept = mean(date_transaction$n), linetype = "dashed", color ="red")+
geom_label(y = 2000, x =as.Date('2019-2-01', format = "%Y-%m-%d"),
label = paste0("Yearly Avg: ",comma(mean(date_transaction$n))), color="red") + ylim(0,5500)
```Show the fluctuations of the transaction throughout year.
initial datasets needs to be grouped and summarize accordingly to each date to plot the time series, hence, group_by and summarize wrangling method was used.
This graph displays daily transaction counts for a UK company over the course of a year, from late 2018 to late 2019. The data shows significant day-to-day fluctuations, with a general upward trend towards the end of the period. Most daily transactions fall between 1,000 and 3,000, but there are several notable spikes exceeding 4,000 transactions, particularly in the last quarter. The yearly average is indicated as 1,730 transactions per day, represented by a red dashed line.
Determine the stability of the revenue cash flow.
The shareholder demands information about the cash flow fluctuations in revenue.
```{r}
# Revenue per date
date_revenue = df %>%
group_by(Date) %>%
summarise(Revenue = sum(Revenue)) %>%
ungroup()
# Get the two peak's in revenue
top_2 = slice_max(date_revenue, Revenue, n =2)
# Calculate Mean revenue
mean_revenue = prettyNum(mean(date_revenue$Revenue),big.mark=",",scientific=FALSE, digits = 8)
# Plot revenue movement per day
ggplot(date_revenue, aes(Date, Revenue)) +
geom_line() +
theme_minimal() +
ggtitle("Daily Revenue", subtitle = "From December 2018 to December 2019") +
# Add text labels only for the top 2 highest revenue days
geom_text(data = top_2,
aes(label = prettyNum(Revenue, big.mark = ',',
scientific = FALSE, digits = 10)),
vjust = -0.5) +
xlim(as.Date('2018-12-01'), as.Date('2019-12-30'))+
scale_y_continuous(labels = comma, limits = c(0,600000))+
geom_hline(yintercept = mean(date_revenue$Revenue), linetype = "dashed",
color ="red")+
geom_label(y = 280000, x =as.Date('2019-3-01', format = "%Y-%m-%d"),
label = paste0("Yearly Avg: ",mean_revenue), color="red")
```Show the fluctuations of the revenue throughout year.
initial datasets needs to be grouped and summarize accordingly to each date to plot the time series, hence, group_by and summarize wrangling method was used.
The graph illustrates revenue changes over the year, with most figures averaging around £195,763. Significant peaks occur in October and November 2019, at £527,634 and £553,496, indicating strong sales in the last quarter. These spikes suggest seasonal influences, possibly from holiday events or promotions. Overall, the revenue trend fluctuates but remains close to the average, indicating steady daily sales with occasional increases.
Discover the majority revenue streams from outside the origin country (United Kingdom).
Finding a possible market demand from other countries.
```{r}
# Get each country's revenue
country_revenue = df %>%
group_by(Country) %>%
summarise(Revenue = sum(Revenue))
# Top 5 countries' revenues
top_5 = country_revenue %>%
arrange(desc(Revenue)) %>%
head(5)
# UK has the highest, due to the large differences in revenue, we exclude it
top_5_exclude_uk = country_revenue %>%
filter(Country != "United Kingdom") %>%
arrange(desc(Revenue)) %>%
head(5)
# Top 5 countries' revenues plot
top_5_plot = ggplot(top_5_exclude_uk,
aes(reorder(Country, Revenue ),Revenue )) +
geom_col(aes(fill= Country), show.legend = NULL) +
ggtitle("Top 5 Revenue Excluding UK") +
theme_minimal() + ylab('Revenue') +xlab(NULL)+
geom_text(label = comma(top_5_exclude_uk$Revenue), vjust = -0.3, size = 3)+
ylim(0,2400000) + ylab(NULL) + scale_y_continuous(labels = NULL)
# Visualize all countries
rev_nouk = country_revenue%>%
filter(Country != "United Kingdom") %>% arrange(desc(Revenue))
rev_nouk <- rev_nouk %>%
mutate(is_top_5 = row_number() <= 5,
country_label = ifelse(is_top_5, Country, ""))
countries_plot = ggplot(rev_nouk,
aes(reorder(Country, Revenue), Revenue)) +
geom_col(aes(fill = is_top_5), show.legend = FALSE) +
scale_fill_manual(values = c("TRUE" = "#00BA38", "FALSE" = "grey")) +
ggtitle("Countries Revenue excluding UK") +
theme_minimal() +
ylab('Revenue') +
xlab(NULL) +
annotate(x ='Sweden', y = 1600000,
label = "Top 5 Countries",
geom = "label", color = "#00BA38") +
coord_flip()
# Plot all countries vs top 5 countries
countries_plot + top_5_plot
```Determining which country has the most customers outside of the store’s origin country to highlight potential market presence in those countries’
Dataset required revenue generated per country while excluding United Kingdom, Hence, group_by, summarise, filter,and arrange wrangling method was used.
The graph emphasizes the stark revenue disparities among countries, highlighting how a few top-performing countries contribute significantly more to revenue compared to others. This analysis could suggest a need for targeted strategies in lower-performing regions to boost revenue. The bar chart showcases the top five revenue-generating countries, excluding the UK, with the Netherlands leading at £2,137,603. EIRE follows at £1,602,341, with Germany, France, and Australia contributing £1,358,280, £1,310,007, and £989,246, respectively. The significant revenue gap indicates a stronger market presence in the Netherlands, suggesting that focusing marketing efforts on these high-revenue regions could be beneficial.
Discover the most ordered product.
Give a good understanding of the stores’ product demand
```{r}
# Data Wrangling to get the plot
product_count = df %>% count(ProductName, Quantity) %>%
mutate(Total = Quantity * n)%>% group_by(ProductName) %>%
summarise(SumTotal = sum(Total, na.rm = TRUE))%>% arrange(desc(SumTotal))
# Plot the bargraph
ggplot(head(product_count, 10),
aes(SumTotal, fct_reorder(ProductName, SumTotal))) +
geom_col(aes(fill = ProductName),
show.legend = FALSE) +
geom_text(aes(label = comma(SumTotal)), hjust = -0.05) +
ylab(NULL) + xlab("Count") + xlim(0,61000)+
theme_minimal() + ggtitle("Top 10 most ordered product by Quantity")
```To determine the winning product. The company plans to invest in a category of product, but wanted to know which product has the most transaction.
Dataset required manipulating datasets and count respective items. Hence, group_by, summarise, mutate and arrange wrangling method was used.
Discover which country holds most customers outside the origin country (United Kingdom).
Finding a possible market demand from other countries.
```{r}
# Create a customers dataframe
customers = df %>% group_by(Country) %>%
count(CustomerNo) %>%
arrange(desc(n)) %>% ungroup() %>%
count(Country) %>% arrange(desc(n))
# calculate total customers
customers_sum = customers$n %>% sum()
# Plot the top 5 countries' customers
top_5_cust = customers %>% filter(Country %in% top_5_exclude_uk$Country) %>% ggplot(aes(fct_reorder(Country, n ),n, fill = Country)) +
geom_col(show.legend =FALSE)+
geom_text(aes(label = n), vjust = -0.2, size = 3) +
theme_minimal() +
ggtitle("Top 5 Countries",subtitle = "(Excluding UK)") +
xlab(NULL) + ylab("Customers") +
ylab(NULL) + scale_y_continuous(labels = NULL)
# Visualize all countries
cust_nouk = customers%>%
filter(Country != "United Kingdom") %>% arrange(desc(n))
cust_nouk <- cust_nouk %>%
mutate(is_top_5 = row_number() <= 5,
country_label = ifelse(is_top_5, Country, ""))
customer_plot = ggplot(cust_nouk,
aes(reorder(Country, n), n)) +
geom_col(aes(fill = is_top_5), show.legend = FALSE) +
scale_fill_manual(values = c("TRUE" = "#00BA38", "FALSE" = "grey")) +
theme_minimal() +
ylab('Customers') +
xlab(NULL) +
ggtitle("Countries customer count",subtitle = "(Excluding UK)") +
annotate(x ='Switzerland', y = 55,
label = "Top 5 Countries",
geom = "label", color = "#00BA38") +
coord_flip()
# Plot all countries vs top 5 countries
customer_plot + top_5_cust
```Determining which country has the most customers outside of the store’s origin country to highlight potential market presence in those countries.
Dataset required transaction generated per country while excluding United Kingdom, Hence, group_by, summarise, filter,and arrange wrangling method was used.
This visualization depicts international customer distribution, excluding the UK, through two complementary charts. Germany leads with 91 customers, followed closely by France with 87, showing strong dominance in central European markets. There’s a significant drop to the next tier, with Ireland (13 customers), Netherlands (9), and Australia (9) rounding out the top five. The broader chart reveals a long tail of markets with minimal presence, from Belgium to Brazil, suggesting a business primarily focused on Western Europe with limited penetration in other global markets, including notably low numbers in major economies like the USA.
Determine the distribution of revenue generated per customer.
Revealing concentration of customers within certain revenue ranges.
```{r}
customer_revenue <- df %>%
group_by(CustomerNo) %>%
summarize(TotalRevenue = sum(Revenue))
hist_1 = ggplot(customer_revenue, aes(x = TotalRevenue)) +
geom_histogram(bins = 50, fill = "lightblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Total Revenue per Customer",
x = "Total Revenue per Customer",
y = "Count") +
scale_x_continuous(labels = scales::comma)
hist_2 = ggplot(df, aes(x = Revenue)) +
geom_histogram(bins = 50, fill = "lightgreen", color = "black") +
theme_minimal() +
labs(title = "Distribution of Revenue per Transaction",
x = "Revenue per Transaction",
y = "Count") +
scale_x_continuous(labels = scales::comma)
# Calculate the 95th percentile
q95 <- quantile(df$Revenue_percust, 0.95)
hist_3_q95 = ggplot(df, aes(x = Revenue_percust)) +
geom_histogram(bins = 50, fill = "lightblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Total Revenue per Customer (95th percentile)",
x = "Total Revenue per Customer",
y = "Count") +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma, limits = c(0, q95))
hist_3 = ggplot(df, aes(x = Revenue_percust)) +
geom_histogram(bins = 50, fill = "pink", color = "black") +
theme_minimal() +
labs(title = "Distribution of Total Revenue per Customer ",
x = "Total Revenue per Customer",
y = "Count") +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma) +
annotate("text", x = 2020000, y = 80000, label = "Top Revenue", color = "red", fontface = "bold", size = 3)+
annotate("text", x = 2020000, y = 40000, color = "red", fontface = "bold",label = comma(max(df$Revenue_percust)),size = 3)
hist_3 / hist_3_q95
``````{r}
central_revenue_percust = df %>% summarize(mean = mean(Revenue_percust),
median = median(Revenue_percust),
max = max(Revenue_percust),
q95 = quantile(Revenue_percust, 0.95))
ggplot(df, aes(x = Revenue_percust)) +
geom_histogram(bins = 50, fill = "lightblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Total Revenue per Customer",
subtitle = "limit to 95th percentile",
x = "Total Revenue per Customer (£)",
y = "Customer count") +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma,
limits = c(0, central_revenue_percust$q95))+
geom_vline(xintercept = central_revenue_percust$mean,
linetype = "dashed", linewidth = 0.7, color = "darkseagreen4")+
geom_vline(xintercept = central_revenue_percust$median,
linetype = "dashed", linewidth = 0.7, color = "darkgreen")+
#median
annotate("label", x = 43000 ,y = 55000,
label = paste0("Median: £",
comma(central_revenue_percust$median)),
color ="darkgreen") +
#mean
annotate("label", x = 81000 ,y = 80000,
label = paste0("Mean: £",
comma(central_revenue_percust$mean)),
color ="darkseagreen4") +
geom_segment(aes(x = 180000, y = 25000, xend = 240000, yend = 10000),
arrow = arrow(length = unit(0.3, "cm")), color = "darkred") +
annotate("label", x = 200000, y = 25000,
label = paste0("Top Revenue up to £",
comma(central_revenue_percust$max)),
color = "darkred")
```To show the distribution of total revenue generated per customer.
Dataset required vertical line visualizing the central tendencies of revenue per customer while zooming into the 95% percentile due to significant outlier, giving a more insightful visualization.
We’ve used summarize wrangling method to calculate the mean, median, max and 95% quantile through mean, median, max, and quantile method
This histogram limit to 95th percentile. The data is highly right-skewed, with most customers generating lower revenues and a long tail of high-value customers. The median revenue is 16,326, while the mean is significantly higher at 57,610, indicating some very high-revenue customers. The top revenue per customer reaches $2,098,752, though this appears to be an extreme case.
Understand the relationship between revenue and quantity sold, indicate whether more units sold lead to proportionally higher revenue.
Identify patterns or trends such as linear or non-linear relationships, clusters, or outliers in the data, providing insights into sales performance.
```{r}
# Calculate medians
mean_qty = mean(df$Quantity)
mean_rev = mean(df$Revenue)
# use these ranges to give more context relative to its central tendencies
df_ <- df %>%
mutate(Quantity_range = case_when(
Quantity >= mean_qty + (mean_qty / 2) ~ "High",
Quantity <= mean_qty - (mean_qty / 2) ~ "Low",
TRUE ~ "Average"
),
Revenue_range = case_when(
Revenue >= mean_rev + (mean_rev / 2) ~ "High",
Revenue <= mean_rev - (mean_rev / 2) ~ "Low",
TRUE ~ "Average"))
# Fit the linear model to print the R-squared
fmla = as.formula("Revenue ~ Quantity")
linear_model = lm(fmla, df)
model_summary = summary(linear_model)
r_squared = model_summary$r.squared
# Find correlation
correlation = cor(df$Quantity, df$Revenue)
# Create the plot
ggplot(df_, aes(x = Quantity, y = Revenue)) +
geom_point(alpha = 0.2, aes(color = Quantity_range, shape = Revenue_range ),
size = 5) +
geom_smooth(method = "lm", se = FALSE, size = 0.5, linetype = "dashed") +
theme_minimal() +
labs(
title = "Relationship between Revenue and Quantity",
x = "Quantity per transaction",
y = "Revenue per transaction",
shape = "Revenue Range",
color = "Quantity Range"
) +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma) +
annotate("text", x = 11100, y = 80000, label = "Top Outlier") +
annotate("text", x = 12000, y = Inf,
label = paste("R-squared:", round(r_squared, 2)),
hjust = 1.1, vjust = 1.5, size = 4, color = "red") +
annotate("text", x = 4000, y = Inf,
label = paste("Correlation:", round(correlation, 2)),
hjust = 1.1, vjust = 1.5, size = 4, color = "blue")
```Illustrate how revenue increases with the quantity of products sold.
Data requires the mean of quantity and revenue, creating two variables through mean respectively while using mutate and case_when to add two new columns representing the revenue and quantity range.
We’ve also added correlation and calculated the r.squared to emphasizes the two variables strength and direction of the linear relationship.
The scatterplot shows a strong positive correlation (0.97) and high R-squared (0.94) between quantity sold and revenue, with most points clustered at lower values, indicating frequent smaller transactions. A clear linear trend suggests consistent pricing across volumes as increasing quantities are generally associated with proportionally higher revenues. Few high-value outliers likely represent bulk purchases by organizations rather than individuals.