library(readxl)
library(httr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(tidyverse)
library(scales)
library(janitor)
library(epiDisplay)
retailURL <- "http://archive.ics.uci.edu//ml//machine-learning-databases//00502//online_retail_II.xlsx"
GET(retailURL, write_disk(tempFileName <- tempfile(fileext = ".xlsx")))
## Response [http://archive.ics.uci.edu//ml//machine-learning-databases//00502//online_retail_II.xlsx]
## Date: 2020-10-18 03:36
## Status: 200
## Content-Type: application/x-httpd-php
## Size: 45.6 MB
## <ON DISK> C:\Users\cassi\AppData\Local\Temp\RtmpwJHTUe\file27cc27fe39e0.xlsx
{Detail - reading in Excel vs csv file}
retail_sheet_2009 <- read_excel(tempFileName, sheet = "Year 2009-2010")
retail_sheet_2010 <- read_excel(tempFileName, sheet = "Year 2010-2011")
retaildf <- rbind(retail_sheet_2009, retail_sheet_2010)
The observations of this dataset are invoice line items, capturing the variables you would see on a typical invoice, Item code, qty ordered, price per unit, customer id, and of course invoice number. We can you use this data to describe how sales performed over the months for this store. The timeframe of the data is Dec 2009 to Dec 2011. As we can see, over 1 million invoice lines were generated in this time.
head(retaildf)
## # A tibble: 6 x 8
## Invoice StockCode Description Quantity InvoiceDate Price `Customer ID`
## <chr> <chr> <chr> <dbl> <dttm> <dbl> <dbl>
## 1 489434 85048 "15CM CHRI~ 12 2009-12-01 07:45:00 6.95 13085
## 2 489434 79323P "PINK CHER~ 12 2009-12-01 07:45:00 6.75 13085
## 3 489434 79323W "WHITE CHE~ 12 2009-12-01 07:45:00 6.75 13085
## 4 489434 22041 "RECORD FR~ 48 2009-12-01 07:45:00 2.1 13085
## 5 489434 21232 "STRAWBERR~ 24 2009-12-01 07:45:00 1.25 13085
## 6 489434 22064 "PINK DOUG~ 24 2009-12-01 07:45:00 1.65 13085
## # ... with 1 more variable: Country <chr>
glimpse(retaildf)
## Rows: 1,067,371
## Columns: 8
## $ Invoice <chr> "489434", "489434", "489434", "489434", "489434", "48...
## $ StockCode <chr> "85048", "79323P", "79323W", "22041", "21232", "22064...
## $ Description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY L...
## $ Quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 1...
## $ InvoiceDate <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01...
## $ Price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55,...
## $ `Customer ID` <dbl> 13085, 13085, 13085, 13085, 13085, 13085, 13085, 1308...
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom",...
Because there are so many observations, I’m going to first subset my data into a much smaller dataset to work with in familiarizing myself with how the data looks. From 1 million observations, I have drilled down to 1044 by looking at the first day of this dataset - December 1, 2011.
Let’s slice this dataset to see at each stocking point how much earnings were cumulatively invoiced (which we can interpret as ordered) and the cumulative qty billed (ordered).
retaildf_2009 <- retaildf %>%
filter(as.Date.POSIXct(InvoiceDate)=='2009-12-01') %>%
filter(!is.na(Description) &!is.na(`Customer ID`) & Quantity > 0) %>%
mutate(Dollar_Total = Quantity * Price) %>%
group_by(StockCode, Description) %>%
summarise(Total_Earned = sum(Dollar_Total),
Total_Sold = sum(Quantity)) %>%
arrange(desc(Total_Earned)) %>%
ungroup() %>%
mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))
Lets look at the most popular and least popular SKUs for this day to see the type of items this company sold and their most popular items.
These look to be fairly standard homegood decorations and sundries. The most popular items on this day are Christmas paper chain kits and cherry lights.
I’m very surprised the Love Potion Incense was not a hit
Lets bucket this revenue so we can have a clearer picture of the distribution of SKU earnings
proportion_totals <- retaildf_2009 %>%
mutate(daily_revenue = case_when(
between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
between(Total_Earned, 500, 1000) ~ ">$500",
between(Total_Earned, 301, 500) ~ ">$300",
between(Total_Earned, 201, 300) ~">$200",
between(Total_Earned, 101, 200) ~">$100",
between(Total_Earned, 51, 100) ~">$50",
between(Total_Earned, 25, 50) ~">=$25",
between(Total_Earned, 0, 24) ~"<$25",
TRUE ~ "loss")) %>%
filter(daily_revenue!="loss")
On December 1, 643(!!) SKUs each made less than $25 and roughly 3% of SKUs earned more than $200. Only 10% earned more than $100. This does not, on the surface, appear to be a very profitable company (keeping in mind we are only looking at one day of sales in the holiday season)
tab1(proportion_totals$daily_revenue, sort.group = "increasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Daily Revenue by SKU", cex=.7, cex.names = .8)
## proportion_totals$daily_revenue :
## Frequency Percent Cum. percent
## > $1000 1 0.1 0.1
## >$300 6 0.6 0.7
## >$500 8 0.8 1.5
## >$200 16 1.6 3.0
## >$100 67 6.5 9.5
## >$50 123 11.9 21.4
## >=$25 174 16.9 38.3
## <$25 637 61.7 100.0
## Total 1032 100.0 100.0
I stand a bit corrected when I look at the totals. In one day, the company earned $42k in revenue. However, we are missing the context of their COGS and cost of operation. $42k sounds good but look at HOW MANY unique SKUs they sold in a single day and the volume of units sold just to make $42k. That is 1028 items to track, receive, store, pick, pack, ship - moving volume of 24k units in a single day. You have to wonder at the labor costs involved in an operation like this. I’m starting to see why Amazon operated a literal loss for so many years.
proportion_totals %>%
summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
unique_items_sold=n(),
units_sold=sum(Total_Sold))
## # A tibble: 1 x 3
## Overall_Revenue unique_items_sold units_sold
## <chr> <int> <dbl>
## 1 $43,547.84 1032 24128
One thing to note is that given the nature of many of these items (cherry lights, paper chain kits), the cost & feasibility of coordinating and shipping may be more reasonable than we would otherwise be the case for heavier, more sizeable items.
I dropped any invoice line items that had a negative quantity value as those would be returns or write-offs rather than true orders. Lets take a look at how many items had returns processed on this day and what those items were. (Any invoice line item with a negative quantity value and attached to a Customer ID and StockCode represents product returned).
It appears 110 different returns were made on this day by a total of 38 customers for ~$1300. Retail is brutal. And phone charms are apparently incredibly unpopular.
retaildf_2009_returns <- retaildf %>%
filter(as.Date.POSIXct(InvoiceDate)=='2009-12-01') %>%
filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
filter(Quantity<0) %>%
arrange(Quantity)
retaildf_2009_returns %>%
group_by(`Customer ID`) %>%
summarize(Refunded_Revenue = sum(Quantity*Price)) %>%
arrange(Refunded_Revenue)
## # A tibble: 38 x 2
## `Customer ID` Refunded_Revenue
## <dbl> <dbl>
## 1 13526 -239.
## 2 16321 -196.
## 3 17592 -148.
## 4 14030 -96.8
## 5 17345 -74.6
## 6 16128 -43.9
## 7 15461 -38.4
## 8 13711 -37.0
## 9 15796 -36.6
## 10 15750 -30.3
## # ... with 28 more rows
retaildf_2009_returns %>%
summarize(total_returned = sum(Quantity*Price))
## # A tibble: 1 x 1
## total_returned
## <dbl>
## 1 -1340.
I want to take quick look at items that are “non-stock” - invoice lines that do not tie back to a customer as an order, charge, or return. These codes might indicate the types of issues this company has to write-off. It would be very interesting to see the degree to which these issues are occurring and to what degree are they impacting the bottom line.
retaildf_Non_Stock <- retaildf %>%
filter(is.na(`Customer ID`) & !is.na(Description) &
as.Date.POSIXct(InvoiceDate)>='2009-12-01' & as.Date.POSIXct(InvoiceDate) <='2009-12-31')%>%
mutate(Dollar_Total = Quantity * Price) %>%
group_by(StockCode, Description) %>%
arrange(Quantity)
retaildf_Non_Stock
## # A tibble: 13,240 x 9
## # Groups: StockCode, Description [2,299]
## Invoice StockCode Description Quantity InvoiceDate Price
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 489660 35956 lost -1043 2009-12-01 17:43:00 0
## 2 489899 79323GR sold as go~ -954 2009-12-03 09:41:00 0
## 3 489820 21133 invcd as 8~ -720 2009-12-02 13:23:00 0
## 4 490007 84347 21494 -720 2009-12-03 12:09:00 0
## 5 490130 21493 lost? -600 2009-12-03 18:28:00 0
## 6 491732 79030A smashed -350 2009-12-14 10:00:00 0
## 7 489463 71477 short -240 2009-12-01 10:52:00 0
## 8 490766 51008 wet -200 2009-12-08 11:06:00 0
## 9 489467 85123A 21733 mixed -192 2009-12-01 10:53:00 0
## 10 489663 35605A damages -117 2009-12-01 18:02:00 0
## # ... with 13,230 more rows, and 3 more variables: `Customer ID` <dbl>,
## # Country <chr>, Dollar_Total <dbl>
Now that I have some nice insight into the data and the company, I feel comfortable doing some aggregates over a larger time frame. Specifically I’d like to compare the holiday month sales between 2009 and 2011.
We can already see that there was a significant spike in invoice lines generated between Dec. 2009 and Dec. 2010, and then a drop of 40k between Dec. 2010 and Dec. 2011. I’d like to investigate what happened.
retaildf <- retaildf %>%
mutate(
day = day(InvoiceDate),
month = month(InvoiceDate),
year = year(InvoiceDate),
invoice_date = make_date(year, month, day)
)
Looking at volume of invoice lines generated each day of December 2009 (which we can consider a rough proxy measure of sales orders barring returns and write-offs), it appears the first 2 months of Decemeber are the busiest time for shopping.
retaildf %>%
filter(invoice_date >= '2009-12-01' & invoice_date <= '2009-12-23') %>%
count(invoice_date, sort=TRUE)
## # A tibble: 21 x 2
## invoice_date n
## <date> <int>
## 1 2009-12-14 4117
## 2 2009-12-02 3277
## 3 2009-12-01 3223
## 4 2009-12-16 3216
## 5 2009-12-03 3002
## 6 2009-12-07 2869
## 7 2009-12-04 2559
## 8 2009-12-10 2492
## 9 2009-12-09 2470
## 10 2009-12-08 2440
## # ... with 11 more rows
If we count the number of invoice lines generated each months, does a noticeable/predictable pattern emerge? What months had the least orders? Was there a noticeable increase/decrease in invoices billed year to year? Unsurprisingly, the holiday months of October, November, December racked the greatest number of invoice lines, reflecting increased sales in those months. Interesting to note that December of 2011 only had ~25k invoice line items to the 65k~ invoice lines generated in Dec 2010. January and February appear to be least busiest months of the year.
retaildf %>%
filter(year == '2011') %>%
count(month, sort=TRUE)
## # A tibble: 12 x 2
## month n
## <dbl> <int>
## 1 11 84711
## 2 10 60742
## 3 9 50226
## 4 7 39518
## 5 5 37030
## 6 6 36874
## 7 3 36748
## 8 8 35284
## 9 1 35147
## 10 4 29916
## 11 2 27707
## 12 12 25526
retaildf %>%
filter(year == '2010') %>%
count(month, sort=TRUE)
## # A tibble: 12 x 2
## month n
## <dbl> <int>
## 1 11 78015
## 2 12 65004
## 3 10 59098
## 4 9 42091
## 5 3 41511
## 6 6 39983
## 7 5 35323
## 8 4 34057
## 9 7 33383
## 10 8 33306
## 11 1 31555
## 12 2 29388
retaildf %>%
filter(year == '2009') %>%
count(month, sort=TRUE)
## # A tibble: 1 x 2
## month n
## <dbl> <int>
## 1 12 45228
Lets do a deep dive into the month of December for each year 2009-2011. How did revenue and sales volume differ in the holiday month of each year?
retaildf_2009_all_dec <- retaildf %>%
filter(invoice_date >='2009-12-01' & invoice_date <='2009-12-23') %>%
filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
mutate(Dollar_Total = Quantity * Price) %>%
group_by(StockCode, Description) %>%
summarise(Total_Earned = sum(Dollar_Total),
Total_Sold = sum(Quantity)) %>%
arrange(desc(Total_Earned)) %>%
ungroup() %>%
mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))
retaildf_2010_all_dec <- retaildf %>%
filter(invoice_date <='2010-12-01' & invoice_date <='2010-12-23') %>%
filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
mutate(Dollar_Total = Quantity * Price) %>%
group_by(StockCode, Description) %>%
summarise(Total_Earned = sum(Dollar_Total),
Total_Sold = sum(Quantity)) %>%
arrange(desc(Total_Earned)) %>%
ungroup() %>%
mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))
retaildf_2011_all_dec <- retaildf %>%
filter(invoice_date <='2011-12-01' & invoice_date <='2011-12-23') %>%
filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
mutate(Dollar_Total = Quantity * Price) %>%
group_by(StockCode, Description) %>%
summarise(Total_Earned = sum(Dollar_Total),
Total_Sold = sum(Quantity)) %>%
arrange(desc(Total_Earned)) %>%
ungroup() %>%
mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))
In 2009, the top items were Tealight Holders, cherry lights. It was a great year inexplicably for Edwardian parasols. Tealight Holders and Assorted Bird Ornaments were still the most popular items in December. Champagne tray cards were no more popular this year than in 2009. Regency Cakestands and Jumbo Bags were the rage in 2011 along with the consistent popular Tealight Holders and Bird Ornaments. Cards with cats on them were decidedly not a hit. ## December Monthly Revenue by SKU
proportion_totals_09 <- retaildf_2009_all_dec %>%
mutate(daily_revenue = case_when(
between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
between(Total_Earned, 500, 1000) ~ ">$500",
between(Total_Earned, 301, 500) ~ ">$300",
between(Total_Earned, 201, 300) ~">$200",
between(Total_Earned, 101, 200) ~">$100",
between(Total_Earned, 51, 100) ~">$50",
between(Total_Earned, 25, 50) ~">=$25",
between(Total_Earned, 0, 24) ~"<$25",
TRUE ~ "loss")) %>%
filter(daily_revenue!="loss")
Only 27% of Items sold in December made more than $200. A full 30% of Items sold in December made less than $25. This company could very reasonably benefit from trimming down the range of items on sale when a full 829 SKUs failed to likely make more the stocking point cost to procure and ship in the first place.
tab1(proportion_totals_09$daily_revenue, sort.group = "increasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Dec-09 Revenue by SKU", cex=.7, cex.names = .8)
## proportion_totals_09$daily_revenue :
## Frequency Percent Cum. percent
## > $1000 129 4.8 4.8
## >$500 173 6.4 11.3
## >$200 205 7.6 18.9
## >$300 219 8.2 27.0
## >$50 364 13.6 40.6
## >$100 374 13.9 54.5
## >=$25 391 14.6 69.1
## <$25 829 30.9 100.0
## Total 2684 100.0 100.0
proportion_totals_10 <- retaildf_2010_all_dec %>%
mutate(daily_revenue = case_when(
between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
between(Total_Earned, 500, 1000) ~ ">$500",
between(Total_Earned, 301, 500) ~ ">$300",
between(Total_Earned, 201, 300) ~">$200",
between(Total_Earned, 101, 200) ~">$100",
between(Total_Earned, 51, 100) ~">$50",
between(Total_Earned, 25, 50) ~">=$25",
between(Total_Earned, 0, 24) ~"<$25",
TRUE ~ "loss")) %>%
filter(daily_revenue!="loss")
In December 2010, we can see the company substantially increased its sales from year past. From December 2009, where only 5% of all SKUs made more than $1000, more than 35% of all unique SKUs sold in December 2010 made more than $1000.
tab1(proportion_totals_10$daily_revenue, sort.group = "decreasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Dec-10 Revenue by SKU", cex=.7, cex.names = .8)
## proportion_totals_10$daily_revenue :
## Frequency Percent Cum. percent
## > $1000 1575 35.8 35.8
## >$500 650 14.8 50.6
## >$300 493 11.2 61.8
## >$100 408 9.3 71.0
## <$25 403 9.2 80.2
## >$50 333 7.6 87.8
## >$200 272 6.2 93.9
## >=$25 267 6.1 100.0
## Total 4401 100.0 100.0
proportion_totals_11 <- retaildf_2011_all_dec %>%
mutate(daily_revenue = case_when(
between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
between(Total_Earned, 500, 1000) ~ ">$500",
between(Total_Earned, 301, 500) ~ ">$300",
between(Total_Earned, 201, 300) ~">$200",
between(Total_Earned, 101, 200) ~">$100",
between(Total_Earned, 51, 100) ~">$50",
between(Total_Earned, 25, 50) ~">=$25",
between(Total_Earned, 0, 24) ~"<$25",
TRUE ~ "loss")) %>%
filter(daily_revenue!="loss")
In December 2011, the proportions appear even better - 46% of all unique SKUs made more than $1000. Altbough the amount of unique SKUs offered seems to grown, only 5% of SKUs sold in December 2011 made less than $25 - a complete reversal of revenue distribution by SKU in only 2 years.
tab1(proportion_totals_11$daily_revenue, sort.group = "increasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Dec-11 Revenue by SKU", cex=.7, cex.names = .8)
## proportion_totals_11$daily_revenue :
## Frequency Percent Cum. percent
## >=$25 244 4.7 4.7
## >$200 263 5.0 9.7
## >$50 322 6.1 15.8
## <$25 354 6.8 22.6
## >$100 401 7.6 30.2
## >$300 475 9.1 39.3
## >$500 770 14.7 54.0
## > $1000 2414 46.0 100.0
## Total 5243 100.0 100.0
Now lets look at the aggregate numbers for each December.
In 2009, the company made ~600k in sales revenue, selling 2684 unique item codes, and sold almost ~400,000 units.
In 2010, the company made 8 MILLION in sales revenue, selling over 4,400 unique item codes, and sold more than 5 MILLION units.
In 2011, the company made 16 MILLION in sales revenue, selling over 5,000 unique item codes, and shipped more than 10 MILLION units. Interesting to note that the invoice lines did not actually map to sales revenue or volume sold. As we saw, there was more than 40k reduction in invoice lines between Dec 2010 to Dec 2011. Since item popularity seems to have remained relatively unchanged, we can surmise that customers felt confident to order more product from this vendor.
proportion_totals_09 %>%
summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
unique_items_sold=n(),
units_sold=sum(Total_Sold))
## # A tibble: 1 x 3
## Overall_Revenue unique_items_sold units_sold
## <chr> <int> <dbl>
## 1 $664,357 2684 389700
proportion_totals_10 %>%
summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
unique_items_sold=n(),
units_sold=sum(Total_Sold))
## # A tibble: 1 x 3
## Overall_Revenue unique_items_sold units_sold
## <chr> <int> <dbl>
## 1 $8,269,102 4401 5227420
proportion_totals_11 %>%
summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
unique_items_sold=n(),
units_sold=sum(Total_Sold))
## # A tibble: 1 x 3
## Overall_Revenue unique_items_sold units_sold
## <chr> <int> <dbl>
## 1 $16,551,904 5243 10050318
The amounts that the top 10 customers of each year spent saw significantly increases year over year, confirming the theory that a contributing factor to the increased revenue seen was an increase in consumer confidence reflected in the increased amount of product purchased by customers. In 2009, the top customer purchased 14k units of product and spent ~$41,000. In 2010, the SAME customer purchased >100k units of product and spent ~$328,000, 8x what they spent in the previous year.
retaildf %>%
group_by(year,`Customer ID`) %>%
summarize(Spent=sum(Price * Quantity),
Qty_Purchased=sum(Quantity)) %>%
filter(Spent>0, !is.na(`Customer ID`)) %>%
slice_max(Spent, n=10)
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 30 x 4
## # Groups: year [3]
## year `Customer ID` Spent Qty_Purchased
## <dbl> <dbl> <dbl> <dbl>
## 1 2009 18102 41006. 14084
## 2 2009 13694 20428. 26751
## 3 2009 14646 14696. 10541
## 4 2009 15061 14313. 8234
## 5 2009 14156 11025. 6272
## 6 2009 15311 9293. 3738
## 7 2009 17511 8368. 4999
## 8 2009 17850 7411. 2158
## 9 2009 13777 7125. 3272
## 10 2009 16779 6742. 3935
## # ... with 20 more rows
…