Data Import & Setup

4.1 Data Import & Setup

library(tidyverse)       
library(lubridate) 
library(xray)
retail <- read.csv("Online Retail.csv")

4.2 Data Diagnostics: Analyze the percentage of NA, blanks, inf values in the data set.

xray::anomalies(retail) 
## $variables
##      Variable      q    qNA    pNA qZero pZero qBlank pBlank qInf pInf
## 1  CustomerID 541909 135080 24.93%     0     -      0      -    0    -
## 2   UnitPrice 541909      0      -  2515 0.46%      0      -    0    -
## 3 Description 541909      0      -     0     -   1454  0.27%    0    -
## 4     Country 541909      0      -     0     -      0      -    0    -
## 5    Quantity 541909      0      -     0     -      0      -    0    -
## 6   StockCode 541909      0      -     0     -      0      -    0    -
## 7 InvoiceDate 541909      0      -     0     -      0      -    0    -
## 8   InvoiceNo 541909      0      -     0     -      0      -    0    -
##   qDistinct      type anomalous_percent
## 1      4373   Integer            24.93%
## 2      1630   Numeric             0.46%
## 3      4224 Character             0.27%
## 4        38 Character                 -
## 5       722   Integer                 -
## 6      4070 Character                 -
## 7     23260 Character                 -
## 8     25900 Character                 -
## 
## $problem_variables
##  [1] Variable          q                 qNA               pNA              
##  [5] qZero             pZero             qBlank            pBlank           
##  [9] qInf              pInf              qDistinct         type             
## [13] anomalous_percent problems         
## <0 rows> (or 0-length row.names)

Key observation: Around one quarter of the values in Customer ID are NA’s. This can lead to inaccuracies while analising our data set. Besides this observation, the rest of the anomalous percent seems not significant to our purposes.

glimpse(retail)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo   <chr> "536365", "536365", "536365", "536365", "536365", "536365"…
## $ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752", …
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN…
## $ Quantity    <int> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, 3, …
## $ InvoiceDate <chr> "01/12/2010 08:26", "01/12/2010 08:26", "01/12/2010 08:26"…
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1.69…
## $ CustomerID  <int> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17…
## $ Country     <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…

Basic statistics for numeric variables, Check missing values and Check missing values

summary(retail)          
##   InvoiceNo          StockCode         Description           Quantity         
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.000  
##  Class :character   Class :character   Class :character   1st Qu.:     1.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.000  
##                                                           Mean   :     9.552  
##                                                           3rd Qu.:    10.000  
##                                                           Max.   : 80995.000  
##                                                                               
##  InvoiceDate          UnitPrice            CustomerID       Country         
##  Length:541909      Min.   :-11062.060   Min.   :12346    Length:541909     
##  Class :character   1st Qu.:     1.250   1st Qu.:13953    Class :character  
##  Mode  :character   Median :     2.080   Median :15152    Mode  :character  
##                     Mean   :     4.611   Mean   :15288                      
##                     3rd Qu.:     4.130   3rd Qu.:16791                      
##                     Max.   : 38970.000   Max.   :18287                      
##                                          NA's   :135080
sum(is.na(retail))       
## [1] 135080
colSums(is.na(retail))
##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0           0           0           0           0 
##  CustomerID     Country 
##      135080           0

Key observation: Consistency with the NA’s for Customer ID in all cases

4.3 Data Preparation

retail_clean <- retail %>%
  # Remove negative/cancelled entries
  filter(Quantity > 0, UnitPrice > 0, !is.na(CustomerID)) %>%   
  # Convert date format
  mutate(InvoiceDate = dmy_hm(InvoiceDate)) %>% 
  # Create TotalValue variable
  mutate(TotalValue = Quantity * UnitPrice)                     
head(retail_clean,10)
##    InvoiceNo StockCode                         Description Quantity
## 1     536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2     536365     71053                 WHITE METAL LANTERN        6
## 3     536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4     536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5     536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6     536365     22752        SET 7 BABUSHKA NESTING BOXES        2
## 7     536365     21730   GLASS STAR FROSTED T-LIGHT HOLDER        6
## 8     536366     22633              HAND WARMER UNION JACK        6
## 9     536366     22632           HAND WARMER RED POLKA DOT        6
## 10    536367     84879       ASSORTED COLOUR BIRD ORNAMENT       32
##            InvoiceDate UnitPrice CustomerID        Country TotalValue
## 1  2010-12-01 08:26:00      2.55      17850 United Kingdom      15.30
## 2  2010-12-01 08:26:00      3.39      17850 United Kingdom      20.34
## 3  2010-12-01 08:26:00      2.75      17850 United Kingdom      22.00
## 4  2010-12-01 08:26:00      3.39      17850 United Kingdom      20.34
## 5  2010-12-01 08:26:00      3.39      17850 United Kingdom      20.34
## 6  2010-12-01 08:26:00      7.65      17850 United Kingdom      15.30
## 7  2010-12-01 08:26:00      4.25      17850 United Kingdom      25.50
## 8  2010-12-01 08:28:00      1.85      17850 United Kingdom      11.10
## 9  2010-12-01 08:28:00      1.85      17850 United Kingdom      11.10
## 10 2010-12-01 08:34:00      1.69      13047 United Kingdom      54.08

4.4 Question 1: Most popular products and sales over time

As seen in class, we can use pipes to make our code more concise. As a way to facilitate the visualization of the top products by quantity, I used ChatGPT.

AI-Assisted: I need to visualize the top 10 products by quantity and in descendent order. I have installed the tidyverse library. Which is my optimal option to arrange my data?

I have learnt how to approach arrange the top 10 product to have a quick insight on the data presented. This helped me to change the structure to any that I consider more fit for business purposes.

Top products sold by amount of orders Q1. What are the most popular products?

top_products <- retail_clean %>%
  group_by(Description) %>%
  summarise(TotalQty = sum(Quantity)) %>%
  arrange(desc(TotalQty)) %>%
  slice(1:10)
top_products
## # A tibble: 10 × 2
##    Description                          TotalQty
##    <chr>                                   <int>
##  1 "PAPER CRAFT , LITTLE BIRDIE"           80995
##  2 "MEDIUM CERAMIC TOP STORAGE JAR"        77916
##  3 "WORLD WAR 2 GLIDERS ASSTD DESIGNS"     54415
##  4 "JUMBO BAG RED RETROSPOT"               46181
##  5 "WHITE HANGING HEART T-LIGHT HOLDER"    36725
##  6 "ASSORTED COLOUR BIRD ORNAMENT"         35362
##  7 "PACK OF 72 RETROSPOT CAKE CASES"       33693
##  8 "POPCORN HOLDER"                        30931
##  9 "RABBIT NIGHT LIGHT"                    27202
## 10 "MINI PAINT SET VINTAGE "               26076

Using the same logic, now I rearranged it by Cashflow or revenue to see the similarities (if any) of the top 10 products by revenue. The question I want to answer is: Are my most ordered products the highest in revenue?

Top products sold by revenue Q1 What are the most popular products?

top_products_byTotalValue <- retail_clean %>%
  group_by(Description) %>%
  summarise(TotalCashflow = sum(TotalValue)) %>%
  arrange(desc(TotalCashflow)) %>%
  slice(1:10)
top_products_byTotalValue
## # A tibble: 10 × 2
##    Description                        TotalCashflow
##    <chr>                                      <dbl>
##  1 PAPER CRAFT , LITTLE BIRDIE              168470.
##  2 REGENCY CAKESTAND 3 TIER                 142593.
##  3 WHITE HANGING HEART T-LIGHT HOLDER       100448.
##  4 JUMBO BAG RED RETROSPOT                   85221.
##  5 MEDIUM CERAMIC TOP STORAGE JAR            81417.
##  6 POSTAGE                                   77804.
##  7 PARTY BUNTING                             68844.
##  8 ASSORTED COLOUR BIRD ORNAMENT             56580.
##  9 Manual                                    53780.
## 10 RABBIT NIGHT LIGHT                        51346.

Key observation: Half of my top 10 products by revenue are not listed by quantity. This means that there could be an area of opportunity to advertise different products with higher margin.

Amount of sales per month: 1. How do sales vary over time?

sales_time <- retail_clean %>%
  mutate(Month = floor_date(InvoiceDate, "month")) %>%
  group_by(Month) %>%
  summarise(MonthlySales = sum(TotalValue))
sales_time
## # A tibble: 13 × 2
##    Month               MonthlySales
##    <dttm>                     <dbl>
##  1 2010-12-01 00:00:00      572714.
##  2 2011-01-01 00:00:00      569445.
##  3 2011-02-01 00:00:00      447137.
##  4 2011-03-01 00:00:00      595501.
##  5 2011-04-01 00:00:00      469200.
##  6 2011-05-01 00:00:00      678595.
##  7 2011-06-01 00:00:00      661214.
##  8 2011-07-01 00:00:00      600091.
##  9 2011-08-01 00:00:00      645344.
## 10 2011-09-01 00:00:00      952838.
## 11 2011-10-01 00:00:00     1039319.
## 12 2011-11-01 00:00:00     1161817.
## 13 2011-12-01 00:00:00      518193.

Plot on sales trend over time -> How do sales vary over time?

ggplot(sales_time, aes(Month, MonthlySales)) +
  geom_line() +
  labs(title="Monthly Sales Trend", x="Month", y="Sales (£)")

Key observation: With the presented graph, we can see a marked seasonality at the end of the year, probably prior Christmas. By the data we have, we cannot say that the December data is incomplete given that both December 2010 and December 2011 presented similar performance.

4.5 Question 2: Can we segment customers based on their purchasing behavior?

I am going to approach this question through the RFM analysis. The RFM analysis is a customer segmentation method that leverages existing customer behavior data to assess, predict, and improve key business processes by focusing on three key factors: (1) Recency, (2) Frequency and (Monetary Value).

AI-Assisted: How can I segment customers in an eCommerce environment? My dataset has revenue, quantity ordered, Customer ID as the most important columns for this analysis

I’ve learnt about the RFM model-based customer segmentation allows businesses (mainly eCommerce) to understand their customers better and optimize the efforts for each customer segment. The analysis sets the direction for improving marketing effectiveness, refining customer retention strategies, and optimizing resource allocation.

In this first part, we can see the monetary value of each customer. Not necessarily having a high frequency of purchase.

rfm <- retail_clean %>%
  group_by(CustomerID) %>%
  summarise(
    Recency = as.numeric(max(InvoiceDate) - max(InvoiceDate)),
    Frequency = n_distinct(InvoiceNo),
    Monetary = sum(TotalValue)
  )%>%
  arrange(desc(Monetary))
rfm
## # A tibble: 4,338 × 4
##    CustomerID Recency Frequency Monetary
##         <int>   <dbl>     <int>    <dbl>
##  1      14646       0        73  280206.
##  2      18102       0        60  259657.
##  3      17450       0        46  194551.
##  4      16446       0         2  168472.
##  5      14911       0       201  143825.
##  6      12415       0        21  124915.
##  7      14156       0        55  117380.
##  8      17511       0        31   91062.
##  9      16029       0        63   81025.
## 10      12346       0         1   77184.
## # ℹ 4,328 more rows

Here are the customers that buy from the business more often, regardless of the total revenue they represent.

rfm <- retail_clean %>%
  group_by(CustomerID) %>%
  summarise(
    Recency = as.numeric(max(InvoiceDate) - max(InvoiceDate)),
    Frequency = n_distinct(InvoiceNo),
    Monetary = sum(TotalValue)
  )%>%
  arrange(desc(Frequency))
rfm
## # A tibble: 4,338 × 4
##    CustomerID Recency Frequency Monetary
##         <int>   <dbl>     <int>    <dbl>
##  1      12748       0       209   33720.
##  2      14911       0       201  143825.
##  3      17841       0       124   40992.
##  4      13089       0        97   58826.
##  5      14606       0        93   12157.
##  6      15311       0        91   60768.
##  7      12971       0        86   11190.
##  8      14646       0        73  280206.
##  9      16029       0        63   81025.
## 10      13408       0        62   28117.
## # ℹ 4,328 more rows

Key observation: This approach gives us the insight of our most recurring customer and the monetary impact they have on the business. A different approach would be to ascendingly arrange the data by monetary value to see the customers that bring more money to the business.

4.6 Question 3: Are there specific countries that contribute more to sales or have unique patterns?

country_sales <- retail_clean %>%
  group_by(Country) %>%
  summarise(TotalSales = sum(TotalValue)) %>%
  arrange(desc(TotalSales))
country_sales
## # A tibble: 37 × 2
##    Country        TotalSales
##    <chr>               <dbl>
##  1 United Kingdom   7308392.
##  2 Netherlands       285446.
##  3 EIRE              265546.
##  4 Germany           228867.
##  5 France            209024.
##  6 Australia         138521.
##  7 Spain              61577.
##  8 Switzerland        56444.
##  9 Belgium            41196.
## 10 Sweden             38378.
## # ℹ 27 more rows
ggplot(country_sales[1:10,], aes(x=reorder(Country, TotalSales), y=TotalSales)) +
  geom_col() +
  coord_flip() +
  labs(title="Top Countries by Sales", x="Country", y="Sales (£)")

Key observation: Sales outside the United Kingdom represent roughly one fifth than the ones inside. This is undeniable evidence the dominance of internal market.

4.7 Question 4 (Bonus): Cancellations impact

# Identify cancelled invoices (InvoiceNo starting with "C")
cancellations <- retail %>%
  filter(str_detect(InvoiceNo, "^C"))
head(cancellations,10)
##    InvoiceNo StockCode                       Description Quantity
## 1    C536379         D                          Discount       -1
## 2    C536383    35004C   SET OF 3 COLOURED  FLYING DUCKS       -1
## 3    C536391     22556    PLASTERS IN TIN CIRCUS PARADE       -12
## 4    C536391     21984  PACK OF 12 PINK PAISLEY TISSUES       -24
## 5    C536391     21983  PACK OF 12 BLUE PAISLEY TISSUES       -24
## 6    C536391     21980 PACK OF 12 RED RETROSPOT TISSUES       -24
## 7    C536391     21484       CHICK GREY HOT WATER BOTTLE      -12
## 8    C536391     22557  PLASTERS IN TIN VINTAGE PAISLEY       -12
## 9    C536391     22553            PLASTERS IN TIN SKULLS      -24
## 10   C536506     22960          JAM MAKING SET WITH JARS       -6
##         InvoiceDate UnitPrice CustomerID        Country
## 1  01/12/2010 09:41     27.50      14527 United Kingdom
## 2  01/12/2010 09:49      4.65      15311 United Kingdom
## 3  01/12/2010 10:24      1.65      17548 United Kingdom
## 4  01/12/2010 10:24      0.29      17548 United Kingdom
## 5  01/12/2010 10:24      0.29      17548 United Kingdom
## 6  01/12/2010 10:24      0.29      17548 United Kingdom
## 7  01/12/2010 10:24      3.45      17548 United Kingdom
## 8  01/12/2010 10:24      1.65      17548 United Kingdom
## 9  01/12/2010 10:24      1.65      17548 United Kingdom
## 10 01/12/2010 12:38      4.25      17897 United Kingdom

AI-Assisted: How can I aggregate the total amount of cancelled orders and completed orders? Consider that all cancelled orders start with C in the Invoice No column.

I have learnt how the if structure can be smoothly added to the pipe and I can integrate a new value (is_cancelled) to it and make the comparison I need for the cancelled sales and active ones.

# Compare cancelled vs non-cancelled value
cancel_vs_sales <- retail %>%
  mutate(is_cancelled = if_else(str_detect(InvoiceNo, "^C"), "Cancelled", "Completed")) %>%
  group_by(is_cancelled) %>%
  summarise(TotalValue = sum(Quantity * UnitPrice, na.rm = TRUE))

cancel_vs_sales
## # A tibble: 2 × 2
##   is_cancelled TotalValue
##   <chr>             <dbl>
## 1 Cancelled      -896812.
## 2 Completed     10644560.

AI-Assisted: How can I put the breakdown by month?

By observing this chunk of code, I noticed the function ‘floor_date’ and it helps to round down any date on a dataset. This comes handy when you want to aggregate certain data in a specific range of time (month) and, in this case, how many cancelled and completed orders are in the dataset.

# Monthly impact, seggregation of orders cancelled by revenue
cancel_monthly <- retail %>%
  mutate(Month = floor_date(dmy_hm(InvoiceDate), "month"),
         is_cancelled = if_else(str_detect(InvoiceNo, "^C"), "Cancelled", "Completed")) %>%
  group_by(Month, is_cancelled) %>%
  summarise(MonthlyValue = sum(Quantity * UnitPrice, na.rm = TRUE))
cancel_monthly
## # A tibble: 26 × 3
## # Groups:   Month [13]
##    Month               is_cancelled MonthlyValue
##    <dttm>              <chr>               <dbl>
##  1 2010-12-01 00:00:00 Cancelled         -74789.
##  2 2010-12-01 00:00:00 Completed         823746.
##  3 2011-01-01 00:00:00 Cancelled        -131364.
##  4 2011-01-01 00:00:00 Completed         691365.
##  5 2011-02-01 00:00:00 Cancelled         -25569.
##  6 2011-02-01 00:00:00 Completed         523632.
##  7 2011-03-01 00:00:00 Cancelled         -34372.
##  8 2011-03-01 00:00:00 Completed         717639.
##  9 2011-04-01 00:00:00 Cancelled         -44602.
## 10 2011-04-01 00:00:00 Completed         537809.
## # ℹ 16 more rows
# Visualisation barplot
ggplot(cancel_monthly, aes(Month, abs(MonthlyValue), fill = is_cancelled)) +
  geom_col() +
  labs(title="Monthly Sales vs Cancellations", x="Month", y="Value (£)")

ggplot(cancel_monthly, aes(Month, abs(MonthlyValue), colour = is_cancelled)) +
  geom_line() +
  labs(title="Monthly Sales vs Cancellations", x="Month", y="Value (£)")

Key observation: Overall, we cannot say that cancellations have a huge impact on the sales besides December 2011 as the lowest sales month and the highest cancelled orders. Conversely, this trend can be present at the beginnig of the year.

TABLEAU

In 3-4 sentences, describe: (1) What story does your dashboard tell? (2) Which 2-3 key insights did you choose to visualize? (3) Why are these insights important for business decisions?

This dashboard presents a map depicting the impact of cancelled invoices and analyzes which country has the highest cancellation rate in relation to the total amount of orders. Additionally, using the stock code we can infer the most cancelled and profitable items in each market. All of these can be dynamically filtered by selecting any country in the map. For instance, we can observe that more than one third of the orders allocated in the USA, are cancelled, whereas the rest of countries are below 10% (except for Portugal and Czech Republic) or in the UK (the largest market), the highest selling stock code gets cancelled at the same rate as it has revenue. These insights are important to decision making because they reveal patterns that help optimize inventory, reduce losses from cancellations, and guide strategic decisions for pricing and market prioritization.