Data Loading

library(readxl)
library(httr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(tidyverse)
library(scales)
library(janitor)
library(epiDisplay)

Data Source (Excel file)

retailURL <- "http://archive.ics.uci.edu//ml//machine-learning-databases//00502//online_retail_II.xlsx"
GET(retailURL, write_disk(tempFileName <- tempfile(fileext = ".xlsx")))
## Response [http://archive.ics.uci.edu//ml//machine-learning-databases//00502//online_retail_II.xlsx]
##   Date: 2020-10-18 03:36
##   Status: 200
##   Content-Type: application/x-httpd-php
##   Size: 45.6 MB
## <ON DISK>  C:\Users\cassi\AppData\Local\Temp\RtmpwJHTUe\file27cc27fe39e0.xlsx

{Detail - reading in Excel vs csv file}

retail_sheet_2009 <- read_excel(tempFileName, sheet = "Year 2009-2010")
retail_sheet_2010 <- read_excel(tempFileName, sheet = "Year 2010-2011")
retaildf <- rbind(retail_sheet_2009, retail_sheet_2010)

Description of Data

The observations of this dataset are invoice line items, capturing the variables you would see on a typical invoice, Item code, qty ordered, price per unit, customer id, and of course invoice number. We can you use this data to describe how sales performed over the months for this store. The timeframe of the data is Dec 2009 to Dec 2011. As we can see, over 1 million invoice lines were generated in this time.

head(retaildf)
## # A tibble: 6 x 8
##   Invoice StockCode Description Quantity InvoiceDate         Price `Customer ID`
##   <chr>   <chr>     <chr>          <dbl> <dttm>              <dbl>         <dbl>
## 1 489434  85048     "15CM CHRI~       12 2009-12-01 07:45:00  6.95         13085
## 2 489434  79323P    "PINK CHER~       12 2009-12-01 07:45:00  6.75         13085
## 3 489434  79323W    "WHITE CHE~       12 2009-12-01 07:45:00  6.75         13085
## 4 489434  22041     "RECORD FR~       48 2009-12-01 07:45:00  2.1          13085
## 5 489434  21232     "STRAWBERR~       24 2009-12-01 07:45:00  1.25         13085
## 6 489434  22064     "PINK DOUG~       24 2009-12-01 07:45:00  1.65         13085
## # ... with 1 more variable: Country <chr>
glimpse(retaildf)
## Rows: 1,067,371
## Columns: 8
## $ Invoice       <chr> "489434", "489434", "489434", "489434", "489434", "48...
## $ StockCode     <chr> "85048", "79323P", "79323W", "22041", "21232", "22064...
## $ Description   <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY L...
## $ Quantity      <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 1...
## $ InvoiceDate   <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01...
## $ Price         <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55,...
## $ `Customer ID` <dbl> 13085, 13085, 13085, 13085, 13085, 13085, 13085, 1308...
## $ Country       <chr> "United Kingdom", "United Kingdom", "United Kingdom",...

Because there are so many observations, I’m going to first subset my data into a much smaller dataset to work with in familiarizing myself with how the data looks. From 1 million observations, I have drilled down to 1044 by looking at the first day of this dataset - December 1, 2011.

Let’s slice this dataset to see at each stocking point how much earnings were cumulatively invoiced (which we can interpret as ordered) and the cumulative qty billed (ordered).

retaildf_2009 <- retaildf %>%
    filter(as.Date.POSIXct(InvoiceDate)=='2009-12-01') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`) & Quantity > 0) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))

Lets look at the most popular and least popular SKUs for this day to see the type of items this company sold and their most popular items.

These look to be fairly standard homegood decorations and sundries. The most popular items on this day are Christmas paper chain kits and cherry lights.

I’m very surprised the Love Potion Incense was not a hit

Lets bucket this revenue so we can have a clearer picture of the distribution of SKU earnings

proportion_totals <- retaildf_2009 %>%
    mutate(daily_revenue = case_when(
        between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
        between(Total_Earned, 500, 1000) ~ ">$500",
        between(Total_Earned, 301, 500) ~ ">$300",
        between(Total_Earned, 201, 300) ~">$200",
        between(Total_Earned, 101, 200) ~">$100",
        between(Total_Earned, 51, 100) ~">$50",
        between(Total_Earned, 25, 50) ~">=$25",
        between(Total_Earned, 0, 24) ~"<$25",
        TRUE ~ "loss")) %>%
    filter(daily_revenue!="loss")

On December 1, 643(!!) SKUs each made less than $25 and roughly 3% of SKUs earned more than $200. Only 10% earned more than $100. This does not, on the surface, appear to be a very profitable company (keeping in mind we are only looking at one day of sales in the holiday season)

tab1(proportion_totals$daily_revenue, sort.group = "increasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Daily Revenue by SKU", cex=.7, cex.names = .8)

## proportion_totals$daily_revenue : 
##         Frequency Percent Cum. percent
## > $1000         1     0.1          0.1
## >$300           6     0.6          0.7
## >$500           8     0.8          1.5
## >$200          16     1.6          3.0
## >$100          67     6.5          9.5
## >$50          123    11.9         21.4
## >=$25         174    16.9         38.3
## <$25          637    61.7        100.0
##   Total      1032   100.0        100.0

I stand a bit corrected when I look at the totals. In one day, the company earned $42k in revenue. However, we are missing the context of their COGS and cost of operation. $42k sounds good but look at HOW MANY unique SKUs they sold in a single day and the volume of units sold just to make $42k. That is 1028 items to track, receive, store, pick, pack, ship - moving volume of 24k units in a single day. You have to wonder at the labor costs involved in an operation like this. I’m starting to see why Amazon operated a literal loss for so many years.

proportion_totals %>%
    summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
              unique_items_sold=n(),
              units_sold=sum(Total_Sold))
## # A tibble: 1 x 3
##   Overall_Revenue unique_items_sold units_sold
##   <chr>                       <int>      <dbl>
## 1 $43,547.84                   1032      24128

One thing to note is that given the nature of many of these items (cherry lights, paper chain kits), the cost & feasibility of coordinating and shipping may be more reasonable than we would otherwise be the case for heavier, more sizeable items.

I dropped any invoice line items that had a negative quantity value as those would be returns or write-offs rather than true orders. Lets take a look at how many items had returns processed on this day and what those items were. (Any invoice line item with a negative quantity value and attached to a Customer ID and StockCode represents product returned).

It appears 110 different returns were made on this day by a total of 38 customers for ~$1300. Retail is brutal. And phone charms are apparently incredibly unpopular.

retaildf_2009_returns <- retaildf %>%
    filter(as.Date.POSIXct(InvoiceDate)=='2009-12-01') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    filter(Quantity<0) %>%
    arrange(Quantity)
retaildf_2009_returns %>%
    group_by(`Customer ID`) %>%
    summarize(Refunded_Revenue = sum(Quantity*Price)) %>%
    arrange(Refunded_Revenue)
## # A tibble: 38 x 2
##    `Customer ID` Refunded_Revenue
##            <dbl>            <dbl>
##  1         13526           -239. 
##  2         16321           -196. 
##  3         17592           -148. 
##  4         14030            -96.8
##  5         17345            -74.6
##  6         16128            -43.9
##  7         15461            -38.4
##  8         13711            -37.0
##  9         15796            -36.6
## 10         15750            -30.3
## # ... with 28 more rows
retaildf_2009_returns %>%
  summarize(total_returned = sum(Quantity*Price))
## # A tibble: 1 x 1
##   total_returned
##            <dbl>
## 1         -1340.

I want to take quick look at items that are “non-stock” - invoice lines that do not tie back to a customer as an order, charge, or return. These codes might indicate the types of issues this company has to write-off. It would be very interesting to see the degree to which these issues are occurring and to what degree are they impacting the bottom line.

retaildf_Non_Stock <- retaildf %>%
 filter(is.na(`Customer ID`) & !is.na(Description) &
as.Date.POSIXct(InvoiceDate)>='2009-12-01' & as.Date.POSIXct(InvoiceDate) <='2009-12-31')%>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    arrange(Quantity)

retaildf_Non_Stock
## # A tibble: 13,240 x 9
## # Groups:   StockCode, Description [2,299]
##    Invoice StockCode Description Quantity InvoiceDate         Price
##    <chr>   <chr>     <chr>          <dbl> <dttm>              <dbl>
##  1 489660  35956     lost           -1043 2009-12-01 17:43:00     0
##  2 489899  79323GR   sold as go~     -954 2009-12-03 09:41:00     0
##  3 489820  21133     invcd as 8~     -720 2009-12-02 13:23:00     0
##  4 490007  84347     21494           -720 2009-12-03 12:09:00     0
##  5 490130  21493     lost?           -600 2009-12-03 18:28:00     0
##  6 491732  79030A    smashed         -350 2009-12-14 10:00:00     0
##  7 489463  71477     short           -240 2009-12-01 10:52:00     0
##  8 490766  51008     wet             -200 2009-12-08 11:06:00     0
##  9 489467  85123A    21733 mixed     -192 2009-12-01 10:53:00     0
## 10 489663  35605A    damages         -117 2009-12-01 18:02:00     0
## # ... with 13,230 more rows, and 3 more variables: `Customer ID` <dbl>,
## #   Country <chr>, Dollar_Total <dbl>

Now that I have some nice insight into the data and the company, I feel comfortable doing some aggregates over a larger time frame. Specifically I’d like to compare the holiday month sales between 2009 and 2011.

We can already see that there was a significant spike in invoice lines generated between Dec. 2009 and Dec. 2010, and then a drop of 40k between Dec. 2010 and Dec. 2011. I’d like to investigate what happened.

retaildf <- retaildf %>%
mutate( 
    day = day(InvoiceDate), 
    month = month(InvoiceDate), 
    year = year(InvoiceDate),
    invoice_date = make_date(year, month, day)
    )

Looking at volume of invoice lines generated each day of December 2009 (which we can consider a rough proxy measure of sales orders barring returns and write-offs), it appears the first 2 months of Decemeber are the busiest time for shopping.

retaildf %>%
    filter(invoice_date >= '2009-12-01' & invoice_date <= '2009-12-23') %>%
    count(invoice_date, sort=TRUE)
## # A tibble: 21 x 2
##    invoice_date     n
##    <date>       <int>
##  1 2009-12-14    4117
##  2 2009-12-02    3277
##  3 2009-12-01    3223
##  4 2009-12-16    3216
##  5 2009-12-03    3002
##  6 2009-12-07    2869
##  7 2009-12-04    2559
##  8 2009-12-10    2492
##  9 2009-12-09    2470
## 10 2009-12-08    2440
## # ... with 11 more rows

Invoice Lines

If we count the number of invoice lines generated each months, does a noticeable/predictable pattern emerge? What months had the least orders? Was there a noticeable increase/decrease in invoices billed year to year? Unsurprisingly, the holiday months of October, November, December racked the greatest number of invoice lines, reflecting increased sales in those months. Interesting to note that December of 2011 only had ~25k invoice line items to the 65k~ invoice lines generated in Dec 2010. January and February appear to be least busiest months of the year.

retaildf %>%
    filter(year == '2011') %>%
    count(month, sort=TRUE)
## # A tibble: 12 x 2
##    month     n
##    <dbl> <int>
##  1    11 84711
##  2    10 60742
##  3     9 50226
##  4     7 39518
##  5     5 37030
##  6     6 36874
##  7     3 36748
##  8     8 35284
##  9     1 35147
## 10     4 29916
## 11     2 27707
## 12    12 25526
retaildf %>%
    filter(year == '2010') %>%
    count(month, sort=TRUE)
## # A tibble: 12 x 2
##    month     n
##    <dbl> <int>
##  1    11 78015
##  2    12 65004
##  3    10 59098
##  4     9 42091
##  5     3 41511
##  6     6 39983
##  7     5 35323
##  8     4 34057
##  9     7 33383
## 10     8 33306
## 11     1 31555
## 12     2 29388
retaildf %>%
    filter(year == '2009') %>%
    count(month, sort=TRUE)
## # A tibble: 1 x 2
##   month     n
##   <dbl> <int>
## 1    12 45228

Lets do a deep dive into the month of December for each year 2009-2011. How did revenue and sales volume differ in the holiday month of each year?

retaildf_2009_all_dec <- retaildf %>%
    filter(invoice_date >='2009-12-01' & invoice_date <='2009-12-23') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))

retaildf_2010_all_dec <- retaildf %>%
    filter(invoice_date <='2010-12-01' & invoice_date <='2010-12-23') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))

retaildf_2011_all_dec <- retaildf %>%
    filter(invoice_date <='2011-12-01' & invoice_date <='2011-12-23') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))