4.1 Data Import & Setup
# Load packages for data wrangling, date handling and quick data diagnostics
library(tidyverse)
library(lubridate)
library(xray)
# Import the Online Retail dataset
retail <- read.csv("Online Retail.csv")
4.2 Data Diagnostics: Analyze the percentage of NA, blank, and Inf values in the dataset.
xray::anomalies(retail)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf
## 1 CustomerID 541909 135080 24.93% 0 - 0 - 0 -
## 2 UnitPrice 541909 0 - 2515 0.46% 0 - 0 -
## 3 Description 541909 0 - 0 - 1454 0.27% 0 -
## 4 Country 541909 0 - 0 - 0 - 0 -
## 5 Quantity 541909 0 - 0 - 0 - 0 -
## 6 StockCode 541909 0 - 0 - 0 - 0 -
## 7 InvoiceDate 541909 0 - 0 - 0 - 0 -
## 8 InvoiceNo 541909 0 - 0 - 0 - 0 -
## qDistinct type anomalous_percent
## 1 4373 Integer 24.93%
## 2 1630 Numeric 0.46%
## 3 4224 Character 0.27%
## 4 38 Character -
## 5 722 Integer -
## 6 4070 Character -
## 7 23260 Character -
## 8 25900 Character -
##
## $problem_variables
## [1] Variable q qNA pNA
## [5] qZero pZero qBlank pBlank
## [9] qInf pInf qDistinct type
## [13] anomalous_percent problems
## <0 rows> (or 0-length row.names)
Key observation: Around one quarter of the values in CustomerID are NAs. This can lead to inaccuracies when analysing the dataset. Beyond this, the remaining anomalous percentages do not appear significant for our purposes.
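As a quick cross-check of that figure (not part of the original output), the share and count of missing CustomerID values can be computed directly; a minimal sketch:
# Proportion and count of missing CustomerID values (should match the 24.93% reported by xray)
mean(is.na(retail$CustomerID))
sum(is.na(retail$CustomerID))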
glimpse(retail)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo <chr> "536365", "536365", "536365", "536365", "536365", "536365"…
## $ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752", …
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN…
## $ Quantity <int> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, 3, …
## $ InvoiceDate <chr> "01/12/2010 08:26", "01/12/2010 08:26", "01/12/2010 08:26"…
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1.69…
## $ CustomerID <int> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17…
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
Basic statistics for the numeric variables and a check of missing values.
summary(retail)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.000
## Class :character Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Mode :character Median : 3.000
## Mean : 9.552
## 3rd Qu.: 10.000
## Max. : 80995.000
##
## InvoiceDate UnitPrice CustomerID Country
## Length:541909 Min. :-11062.060 Min. :12346 Length:541909
## Class :character 1st Qu.: 1.250 1st Qu.:13953 Class :character
## Mode :character Median : 2.080 Median :15152 Mode :character
## Mean : 4.611 Mean :15288
## 3rd Qu.: 4.130 3rd Qu.:16791
## Max. : 38970.000 Max. :18287
## NA's :135080
sum(is.na(retail))
## [1] 135080
colSums(is.na(retail))
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country
## 135080 0
Key observation: The missing-value counts are consistent across checks; all 135,080 NAs belong to CustomerID.
4.3 Data Preparation
retail_clean <- retail %>%
# Remove negative/cancelled entries
filter(Quantity > 0, UnitPrice > 0, !is.na(CustomerID)) %>%
# Convert date format
mutate(InvoiceDate = dmy_hm(InvoiceDate)) %>%
# Create TotalValue variable
mutate(TotalValue = Quantity * UnitPrice)
head(retail_clean,10)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## 7 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6
## 8 536366 22633 HAND WARMER UNION JACK 6
## 9 536366 22632 HAND WARMER RED POLKA DOT 6
## 10 536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32
## InvoiceDate UnitPrice CustomerID Country TotalValue
## 1 2010-12-01 08:26:00 2.55 17850 United Kingdom 15.30
## 2 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34
## 3 2010-12-01 08:26:00 2.75 17850 United Kingdom 22.00
## 4 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34
## 5 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34
## 6 2010-12-01 08:26:00 7.65 17850 United Kingdom 15.30
## 7 2010-12-01 08:26:00 4.25 17850 United Kingdom 25.50
## 8 2010-12-01 08:28:00 1.85 17850 United Kingdom 11.10
## 9 2010-12-01 08:28:00 1.85 17850 United Kingdom 11.10
## 10 2010-12-01 08:34:00 1.69 13047 United Kingdom 54.08
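Before moving on, a simple sanity check (an addition, not in the original analysis) is to compare row counts before and after cleaning:
# Rows removed by the cleaning filters (missing CustomerID, cancellations, non-positive quantities/prices)
nrow(retail) - nrow(retail_clean)
nrow(retail_clean)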
4.4 Question 1: Most popular products and sales over time
As seen in class, we can use pipes to make our code more concise. To help visualize the top products by quantity, I used ChatGPT.
AI-Assisted: I need to visualize the top 10 products by quantity, in descending order. I have installed the tidyverse library. What is my best option to arrange my data?
I have learnt how to arrange and slice the top 10 products to get a quick insight into the data, and how to adapt this structure to whatever layout I consider most fit for business purposes.
Top products sold by quantity ordered. Q1: What are the most popular products?
top_products <- retail_clean %>%
group_by(Description) %>%
summarise(TotalQty = sum(Quantity)) %>%
arrange(desc(TotalQty)) %>%
slice(1:10)
top_products
## # A tibble: 10 × 2
## Description TotalQty
## <chr> <int>
## 1 "PAPER CRAFT , LITTLE BIRDIE" 80995
## 2 "MEDIUM CERAMIC TOP STORAGE JAR" 77916
## 3 "WORLD WAR 2 GLIDERS ASSTD DESIGNS" 54415
## 4 "JUMBO BAG RED RETROSPOT" 46181
## 5 "WHITE HANGING HEART T-LIGHT HOLDER" 36725
## 6 "ASSORTED COLOUR BIRD ORNAMENT" 35362
## 7 "PACK OF 72 RETROSPOT CAKE CASES" 33693
## 8 "POPCORN HOLDER" 30931
## 9 "RABBIT NIGHT LIGHT" 27202
## 10 "MINI PAINT SET VINTAGE " 26076
Using the same logic, I now rearranged the data by revenue to see the overlap (if any) with the top 10 products by quantity. The question I want to answer is: are my most ordered products also the highest in revenue?
Top products sold by revenue. Q1: What are the most popular products?
top_products_byTotalValue <- retail_clean %>%
group_by(Description) %>%
summarise(TotalCashflow = sum(TotalValue)) %>%
arrange(desc(TotalCashflow)) %>%
slice(1:10)
top_products_byTotalValue
## # A tibble: 10 × 2
## Description TotalCashflow
## <chr> <dbl>
## 1 PAPER CRAFT , LITTLE BIRDIE 168470.
## 2 REGENCY CAKESTAND 3 TIER 142593.
## 3 WHITE HANGING HEART T-LIGHT HOLDER 100448.
## 4 JUMBO BAG RED RETROSPOT 85221.
## 5 MEDIUM CERAMIC TOP STORAGE JAR 81417.
## 6 POSTAGE 77804.
## 7 PARTY BUNTING 68844.
## 8 ASSORTED COLOUR BIRD ORNAMENT 56580.
## 9 Manual 53780.
## 10 RABBIT NIGHT LIGHT 51346.
Key observation: Half of my top 10 products by revenue are not in the top 10 by quantity. This suggests an opportunity to advertise higher-margin products that are currently ordered less often.
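To make that overlap explicit, one could intersect the two top-10 lists computed above; a minimal sketch:
# Products present in both rankings, and revenue leaders missing from the quantity top 10
intersect(top_products$Description, top_products_byTotalValue$Description)
setdiff(top_products_byTotalValue$Description, top_products$Description)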
Sales per month. Q1: How do sales vary over time?
sales_time <- retail_clean %>%
mutate(Month = floor_date(InvoiceDate, "month")) %>%
group_by(Month) %>%
summarise(MonthlySales = sum(TotalValue))
sales_time
## # A tibble: 13 × 2
## Month MonthlySales
## <dttm> <dbl>
## 1 2010-12-01 00:00:00 572714.
## 2 2011-01-01 00:00:00 569445.
## 3 2011-02-01 00:00:00 447137.
## 4 2011-03-01 00:00:00 595501.
## 5 2011-04-01 00:00:00 469200.
## 6 2011-05-01 00:00:00 678595.
## 7 2011-06-01 00:00:00 661214.
## 8 2011-07-01 00:00:00 600091.
## 9 2011-08-01 00:00:00 645344.
## 10 2011-09-01 00:00:00 952838.
## 11 2011-10-01 00:00:00 1039319.
## 12 2011-11-01 00:00:00 1161817.
## 13 2011-12-01 00:00:00 518193.
Plot of the sales trend over time: How do sales vary over time?
ggplot(sales_time, aes(Month, MonthlySales)) +
geom_line() +
labs(title="Monthly Sales Trend", x="Month", y="Sales (£)")
Key observation: The graph shows a marked seasonality at the end of the year, likely in the run-up to Christmas. From the data we have, we cannot say that the December data is incomplete, given that December 2010 and December 2011 show similar performance.
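One way to probe whether either December is only partially covered (an extra check, not in the original analysis) is to count the distinct invoice days per month; the result would either confirm or qualify the reading above:
# Distinct invoice days per month, to spot partially covered months
retail_clean %>%
  mutate(Month = floor_date(InvoiceDate, "month")) %>%
  group_by(Month) %>%
  summarise(DaysWithInvoices = n_distinct(as_date(InvoiceDate)))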
4.5 Question 2: Can we segment customers based on their purchasing behavior?
I am going to approach this question through RFM analysis. RFM analysis is a customer segmentation method that leverages existing customer behaviour data to assess, predict, and improve key business processes by focusing on three factors: (1) Recency, (2) Frequency and (3) Monetary value.
AI-Assisted: How can I segment customers in an eCommerce environment? My dataset has revenue, quantity ordered, Customer ID as the most important columns for this analysis
I've learnt that RFM model-based customer segmentation allows businesses (mainly eCommerce) to understand their customers better and to optimise their efforts for each customer segment. The analysis sets the direction for improving marketing effectiveness, refining customer retention strategies, and optimising resource allocation.
In this first part, we can see the monetary value of each customer, which does not necessarily go hand in hand with a high purchase frequency.
rfm <- retail_clean %>%
group_by(CustomerID) %>%
summarise(
Recency = as.numeric(max(InvoiceDate) - max(InvoiceDate)),
Frequency = n_distinct(InvoiceNo),
Monetary = sum(TotalValue)
)%>%
arrange(desc(Monetary))
rfm
## # A tibble: 4,338 × 4
## CustomerID Recency Frequency Monetary
## <int> <dbl> <int> <dbl>
## 1 14646 0 73 280206.
## 2 18102 0 60 259657.
## 3 17450 0 46 194551.
## 4 16446 0 2 168472.
## 5 14911 0 201 143825.
## 6 12415 0 21 124915.
## 7 14156 0 55 117380.
## 8 17511 0 31 91062.
## 9 16029 0 63 81025.
## 10 12346 0 1 77184.
## # ℹ 4,328 more rows
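Note that Recency is 0 for every customer here, because max(InvoiceDate) - max(InvoiceDate) subtracts each customer's last purchase date from itself. If the intended definition is days since the customer's last purchase, measured against the most recent date in the dataset (an assumption on my part; the object names snapshot_date and rfm_fixed are just for illustration), a sketch would be:
# Days since each customer's last purchase, relative to the latest invoice date in the data
snapshot_date <- max(retail_clean$InvoiceDate)
rfm_fixed <- retail_clean %>%
  group_by(CustomerID) %>%
  summarise(
    Recency   = as.numeric(difftime(snapshot_date, max(InvoiceDate), units = "days")),
    Frequency = n_distinct(InvoiceNo),
    Monetary  = sum(TotalValue)
  )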
Here are the customers that buy from the business most often, regardless of the total revenue they represent.
rfm <- retail_clean %>%
group_by(CustomerID) %>%
summarise(
Recency = as.numeric(max(InvoiceDate) - max(InvoiceDate)),
Frequency = n_distinct(InvoiceNo),
Monetary = sum(TotalValue)
)%>%
arrange(desc(Frequency))
rfm
## # A tibble: 4,338 × 4
## CustomerID Recency Frequency Monetary
## <int> <dbl> <int> <dbl>
## 1 12748 0 209 33720.
## 2 14911 0 201 143825.
## 3 17841 0 124 40992.
## 4 13089 0 97 58826.
## 5 14606 0 93 12157.
## 6 15311 0 91 60768.
## 7 12971 0 86 11190.
## 8 14646 0 73 280206.
## 9 16029 0 63 81025.
## 10 13408 0 62 28117.
## # ℹ 4,328 more rows
Key observation: This approach shows our most recurring customers and the monetary impact they have on the business. A complementary view is to arrange the data in descending order of monetary value, as in the first table, to see which customers bring the most money to the business.
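To turn the raw R, F and M values into actual segments, a common approach (not part of the original code, so treat it as a sketch building on the corrected rfm_fixed table above) is to score each dimension into quartiles with ntile() and label the combinations:
# Quartile scores: 4 = best (most recent, most frequent, highest spend)
rfm_scored <- rfm_fixed %>%
  mutate(
    R_score = ntile(-Recency, 4),   # smaller recency (more recent) gets a higher score
    F_score = ntile(Frequency, 4),
    M_score = ntile(Monetary, 4),
    Segment = case_when(
      R_score >= 3 & F_score >= 3 & M_score >= 3 ~ "Champions",
      F_score >= 3                               ~ "Loyal",
      R_score <= 2 & M_score >= 3                ~ "At-risk high value",
      TRUE                                       ~ "Other"
    )
  )
count(rfm_scored, Segment)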
4.6 Question 3: Are there specific countries that contribute more to sales or have unique patterns?
country_sales <- retail_clean %>%
group_by(Country) %>%
summarise(TotalSales = sum(TotalValue)) %>%
arrange(desc(TotalSales))
country_sales
## # A tibble: 37 × 2
## Country TotalSales
## <chr> <dbl>
## 1 United Kingdom 7308392.
## 2 Netherlands 285446.
## 3 EIRE 265546.
## 4 Germany 228867.
## 5 France 209024.
## 6 Australia 138521.
## 7 Spain 61577.
## 8 Switzerland 56444.
## 9 Belgium 41196.
## 10 Sweden 38378.
## # ℹ 27 more rows
ggplot(country_sales[1:10,], aes(x=reorder(Country, TotalSales), y=TotalSales)) +
geom_col() +
coord_flip() +
labs(title="Top Countries by Sales", x="Country", y="Sales (£)")
Key observation: Sales outside the United Kingdom amount to roughly one fifth of those inside it. This is clear evidence of the dominance of the domestic market.
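The rough "one fifth" figure can be checked directly by computing each region's share of total sales from the table above; a minimal sketch:
# Share of total sales: UK vs rest of world
country_sales %>%
  mutate(Region = if_else(Country == "United Kingdom", "UK", "Rest of world")) %>%
  group_by(Region) %>%
  summarise(TotalSales = sum(TotalSales)) %>%
  mutate(Share = TotalSales / sum(TotalSales))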
4.7 Question 4 (Bonus): Cancellations impact
# Identify cancelled invoices (InvoiceNo starting with "C")
cancellations <- retail %>%
filter(str_detect(InvoiceNo, "^C"))
head(cancellations,10)
## InvoiceNo StockCode Description Quantity
## 1 C536379 D Discount -1
## 2 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1
## 3 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12
## 4 C536391 21984 PACK OF 12 PINK PAISLEY TISSUES -24
## 5 C536391 21983 PACK OF 12 BLUE PAISLEY TISSUES -24
## 6 C536391 21980 PACK OF 12 RED RETROSPOT TISSUES -24
## 7 C536391 21484 CHICK GREY HOT WATER BOTTLE -12
## 8 C536391 22557 PLASTERS IN TIN VINTAGE PAISLEY -12
## 9 C536391 22553 PLASTERS IN TIN SKULLS -24
## 10 C536506 22960 JAM MAKING SET WITH JARS -6
## InvoiceDate UnitPrice CustomerID Country
## 1 01/12/2010 09:41 27.50 14527 United Kingdom
## 2 01/12/2010 09:49 4.65 15311 United Kingdom
## 3 01/12/2010 10:24 1.65 17548 United Kingdom
## 4 01/12/2010 10:24 0.29 17548 United Kingdom
## 5 01/12/2010 10:24 0.29 17548 United Kingdom
## 6 01/12/2010 10:24 0.29 17548 United Kingdom
## 7 01/12/2010 10:24 3.45 17548 United Kingdom
## 8 01/12/2010 10:24 1.65 17548 United Kingdom
## 9 01/12/2010 10:24 1.65 17548 United Kingdom
## 10 01/12/2010 12:38 4.25 17897 United Kingdom
AI-Assisted: How can I aggregate the total value of cancelled orders and completed orders? Consider that all cancelled orders start with C in the InvoiceNo column.
I have learnt how a conditional can be added smoothly to the pipe: I can create a new variable (is_cancelled) with if_else() and make the comparison I need between cancelled and completed sales.
# Compare cancelled vs non-cancelled value
cancel_vs_sales <- retail %>%
mutate(is_cancelled = if_else(str_detect(InvoiceNo, "^C"), "Cancelled", "Completed")) %>%
group_by(is_cancelled) %>%
summarise(TotalValue = sum(Quantity * UnitPrice, na.rm = TRUE))
cancel_vs_sales
## # A tibble: 2 × 2
## is_cancelled TotalValue
## <chr> <dbl>
## 1 Cancelled -896812.
## 2 Completed 10644560.
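Based on these totals, the cancelled value amounts to roughly 8% of the completed order value; a small sketch (an addition, not in the original code) to compute that ratio:
# Cancelled value as a share of completed value
cancel_vs_sales %>%
  summarise(
    CancelledShare = abs(TotalValue[is_cancelled == "Cancelled"]) /
      TotalValue[is_cancelled == "Completed"]
  )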
AI-Assisted: How can I break this down by month?
Looking at this chunk of code, I noticed the function floor_date, which rounds a date down to the start of a given unit. This comes in handy when you want to aggregate data over a specific period (here, the month) and, in this case, see how much cancelled and completed order value falls in each month.
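For example (a standalone illustration of floor_date, not taken from the dataset):
# floor_date rounds a timestamp down to the start of the given unit
floor_date(dmy_hm("15/06/2011 13:45"), "month")
## [1] "2011-06-01 UTC"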
# Monthly impact: breakdown of cancelled vs completed order value
cancel_monthly <- retail %>%
mutate(Month = floor_date(dmy_hm(InvoiceDate), "month"),
is_cancelled = if_else(str_detect(InvoiceNo, "^C"), "Cancelled", "Completed")) %>%
group_by(Month, is_cancelled) %>%
summarise(MonthlyValue = sum(Quantity * UnitPrice, na.rm = TRUE))
cancel_monthly
## # A tibble: 26 × 3
## # Groups: Month [13]
## Month is_cancelled MonthlyValue
## <dttm> <chr> <dbl>
## 1 2010-12-01 00:00:00 Cancelled -74789.
## 2 2010-12-01 00:00:00 Completed 823746.
## 3 2011-01-01 00:00:00 Cancelled -131364.
## 4 2011-01-01 00:00:00 Completed 691365.
## 5 2011-02-01 00:00:00 Cancelled -25569.
## 6 2011-02-01 00:00:00 Completed 523632.
## 7 2011-03-01 00:00:00 Cancelled -34372.
## 8 2011-03-01 00:00:00 Completed 717639.
## 9 2011-04-01 00:00:00 Cancelled -44602.
## 10 2011-04-01 00:00:00 Completed 537809.
## # ℹ 16 more rows
# Visualisation: bar chart and line chart of monthly values
ggplot(cancel_monthly, aes(Month, abs(MonthlyValue), fill = is_cancelled)) +
geom_col() +
labs(title="Monthly Sales vs Cancellations", x="Month", y="Value (£)")
ggplot(cancel_monthly, aes(Month, abs(MonthlyValue), colour = is_cancelled)) +
geom_line() +
labs(title="Monthly Sales vs Cancellations", x="Month", y="Value (£)")
Key observation: Overall, we cannot say that cancellations have a huge impact on sales, apart from December 2011, which combines the lowest sales with the highest value of cancelled orders. A similar trend may also be present at the beginning of the year.
TABLEAU
In 3-4 sentences, describe: (1) What story does your dashboard tell? (2) Which 2-3 key insights did you choose to visualize? (3) Why are these insights important for business decisions?
This dashboard presents a map depicting the impact of cancelled invoices and analyses which country has the highest cancellation rate relative to its total number of orders. Additionally, using the stock code, we can infer the most cancelled and the most profitable items in each market. All of this can be filtered dynamically by selecting any country on the map. For instance, we can observe that more than one third of the orders placed in the USA are cancelled, whereas most other countries are below 10% (except Portugal and the Czech Republic); in the UK (the largest market), the highest-selling stock code is cancelled at nearly the same value as it generates in revenue. These insights are important for decision making because they reveal patterns that help optimise inventory, reduce losses from cancellations, and guide strategic decisions on pricing and market prioritisation.