Project 3

Data Loading

library(readxl)
library(httr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(tidyverse)
library(scales)
library(janitor)
library(epiDisplay)

Data Source (Excel file)

retailURL <- "http://archive.ics.uci.edu//ml//machine-learning-databases//00502//online_retail_II.xlsx"
GET(retailURL, write_disk(tempFileName <- tempfile(fileext = ".xlsx")))

## Response [http://archive.ics.uci.edu//ml//machine-learning-databases//00502//online_retail_II.xlsx]
##   Date: 2020-10-18 03:36
##   Status: 200
##   Content-Type: application/x-httpd-php
##   Size: 45.6 MB
## <ON DISK>  C:\Users\cassi\AppData\Local\Temp\RtmpwJHTUe\file27cc27fe39e0.xlsx

{Detail - reading in Excel vs csv file}

retail_sheet_2009 <- read_excel(tempFileName, sheet = "Year 2009-2010")
retail_sheet_2010 <- read_excel(tempFileName, sheet = "Year 2010-2011")
retaildf <- rbind(retail_sheet_2009, retail_sheet_2010)

Description of Data

The observations of this dataset are invoice line items, capturing the variables you would see on a typical invoice, Item code, qty ordered, price per unit, customer id, and of course invoice number. We can you use this data to describe how sales performed over the months for this store. The timeframe of the data is Dec 2009 to Dec 2011. As we can see, over 1 million invoice lines were generated in this time.

head(retaildf)

## # A tibble: 6 x 8
##   Invoice StockCode Description Quantity InvoiceDate         Price `Customer ID`
##   <chr>   <chr>     <chr>          <dbl> <dttm>              <dbl>         <dbl>
## 1 489434  85048     "15CM CHRI~       12 2009-12-01 07:45:00  6.95         13085
## 2 489434  79323P    "PINK CHER~       12 2009-12-01 07:45:00  6.75         13085
## 3 489434  79323W    "WHITE CHE~       12 2009-12-01 07:45:00  6.75         13085
## 4 489434  22041     "RECORD FR~       48 2009-12-01 07:45:00  2.1          13085
## 5 489434  21232     "STRAWBERR~       24 2009-12-01 07:45:00  1.25         13085
## 6 489434  22064     "PINK DOUG~       24 2009-12-01 07:45:00  1.65         13085
## # ... with 1 more variable: Country <chr>

glimpse(retaildf)

## Rows: 1,067,371
## Columns: 8
## $ Invoice       <chr> "489434", "489434", "489434", "489434", "489434", "48...
## $ StockCode     <chr> "85048", "79323P", "79323W", "22041", "21232", "22064...
## $ Description   <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY L...
## $ Quantity      <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 1...
## $ InvoiceDate   <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01...
## $ Price         <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55,...
## $ `Customer ID` <dbl> 13085, 13085, 13085, 13085, 13085, 13085, 13085, 1308...
## $ Country       <chr> "United Kingdom", "United Kingdom", "United Kingdom",...

Because there are so many observations, I’m going to first subset my data into a much smaller dataset to work with in familiarizing myself with how the data looks. From 1 million observations, I have drilled down to 1044 by looking at the first day of this dataset - December 1, 2011.

Let’s slice this dataset to see at each stocking point how much earnings were cumulatively invoiced (which we can interpret as ordered) and the cumulative qty billed (ordered).

retaildf_2009 <- retaildf %>%
    filter(as.Date.POSIXct(InvoiceDate)=='2009-12-01') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`) & Quantity > 0) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))

Lets look at the most popular and least popular SKUs for this day to see the type of items this company sold and their most popular items.

These look to be fairly standard homegood decorations and sundries. The most popular items on this day are Christmas paper chain kits and cherry lights.

I’m very surprised the Love Potion Incense was not a hit

Lets bucket this revenue so we can have a clearer picture of the distribution of SKU earnings

proportion_totals <- retaildf_2009 %>%
    mutate(daily_revenue = case_when(
        between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
        between(Total_Earned, 500, 1000) ~ ">$500",
        between(Total_Earned, 301, 500) ~ ">$300",
        between(Total_Earned, 201, 300) ~">$200",
        between(Total_Earned, 101, 200) ~">$100",
        between(Total_Earned, 51, 100) ~">$50",
        between(Total_Earned, 25, 50) ~">=$25",
        between(Total_Earned, 0, 24) ~"<$25",
        TRUE ~ "loss")) %>%
    filter(daily_revenue!="loss")

On December 1, 643(!!) SKUs each made less than $25 and roughly 3% of SKUs earned more than $200. Only 10% earned more than $100. This does not, on the surface, appear to be a very profitable company (keeping in mind we are only looking at one day of sales in the holiday season)

tab1(proportion_totals$daily_revenue, sort.group = "increasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Daily Revenue by SKU", cex=.7, cex.names = .8)

## proportion_totals$daily_revenue : 
##         Frequency Percent Cum. percent
## > $1000         1     0.1          0.1
## >$300           6     0.6          0.7
## >$500           8     0.8          1.5
## >$200          16     1.6          3.0
## >$100          67     6.5          9.5
## >$50          123    11.9         21.4
## >=$25         174    16.9         38.3
## <$25          637    61.7        100.0
##   Total      1032   100.0        100.0

I stand a bit corrected when I look at the totals. In one day, the company earned $42k in revenue. However, we are missing the context of their COGS and cost of operation. $42k sounds good but look at HOW MANY unique SKUs they sold in a single day and the volume of units sold just to make $42k. That is 1028 items to track, receive, store, pick, pack, ship - moving volume of 24k units in a single day. You have to wonder at the labor costs involved in an operation like this. I’m starting to see why Amazon operated a literal loss for so many years.

proportion_totals %>%
    summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
              unique_items_sold=n(),
              units_sold=sum(Total_Sold))

## # A tibble: 1 x 3
##   Overall_Revenue unique_items_sold units_sold
##   <chr>                       <int>      <dbl>
## 1 $43,547.84                   1032      24128

One thing to note is that given the nature of many of these items (cherry lights, paper chain kits), the cost & feasibility of coordinating and shipping may be more reasonable than we would otherwise be the case for heavier, more sizeable items.

I dropped any invoice line items that had a negative quantity value as those would be returns or write-offs rather than true orders. Lets take a look at how many items had returns processed on this day and what those items were. (Any invoice line item with a negative quantity value and attached to a Customer ID and StockCode represents product returned).

It appears 110 different returns were made on this day by a total of 38 customers for ~$1300. Retail is brutal. And phone charms are apparently incredibly unpopular.

retaildf_2009_returns <- retaildf %>%
    filter(as.Date.POSIXct(InvoiceDate)=='2009-12-01') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    filter(Quantity<0) %>%
    arrange(Quantity)

retaildf_2009_returns %>%
    group_by(`Customer ID`) %>%
    summarize(Refunded_Revenue = sum(Quantity*Price)) %>%
    arrange(Refunded_Revenue)

## # A tibble: 38 x 2
##    `Customer ID` Refunded_Revenue
##            <dbl>            <dbl>
##  1         13526           -239. 
##  2         16321           -196. 
##  3         17592           -148. 
##  4         14030            -96.8
##  5         17345            -74.6
##  6         16128            -43.9
##  7         15461            -38.4
##  8         13711            -37.0
##  9         15796            -36.6
## 10         15750            -30.3
## # ... with 28 more rows

retaildf_2009_returns %>%
  summarize(total_returned = sum(Quantity*Price))

## # A tibble: 1 x 1
##   total_returned
##            <dbl>
## 1         -1340.

I want to take quick look at items that are “non-stock” - invoice lines that do not tie back to a customer as an order, charge, or return. These codes might indicate the types of issues this company has to write-off. It would be very interesting to see the degree to which these issues are occurring and to what degree are they impacting the bottom line.

retaildf_Non_Stock <- retaildf %>%
 filter(is.na(`Customer ID`) & !is.na(Description) &
as.Date.POSIXct(InvoiceDate)>='2009-12-01' & as.Date.POSIXct(InvoiceDate) <='2009-12-31')%>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    arrange(Quantity)

retaildf_Non_Stock

## # A tibble: 13,240 x 9
## # Groups:   StockCode, Description [2,299]
##    Invoice StockCode Description Quantity InvoiceDate         Price
##    <chr>   <chr>     <chr>          <dbl> <dttm>              <dbl>
##  1 489660  35956     lost           -1043 2009-12-01 17:43:00     0
##  2 489899  79323GR   sold as go~     -954 2009-12-03 09:41:00     0
##  3 489820  21133     invcd as 8~     -720 2009-12-02 13:23:00     0
##  4 490007  84347     21494           -720 2009-12-03 12:09:00     0
##  5 490130  21493     lost?           -600 2009-12-03 18:28:00     0
##  6 491732  79030A    smashed         -350 2009-12-14 10:00:00     0
##  7 489463  71477     short           -240 2009-12-01 10:52:00     0
##  8 490766  51008     wet             -200 2009-12-08 11:06:00     0
##  9 489467  85123A    21733 mixed     -192 2009-12-01 10:53:00     0
## 10 489663  35605A    damages         -117 2009-12-01 18:02:00     0
## # ... with 13,230 more rows, and 3 more variables: `Customer ID` <dbl>,
## #   Country <chr>, Dollar_Total <dbl>

Now that I have some nice insight into the data and the company, I feel comfortable doing some aggregates over a larger time frame. Specifically I’d like to compare the holiday month sales between 2009 and 2011.

We can already see that there was a significant spike in invoice lines generated between Dec. 2009 and Dec. 2010, and then a drop of 40k between Dec. 2010 and Dec. 2011. I’d like to investigate what happened.

retaildf <- retaildf %>%
mutate( 
    day = day(InvoiceDate), 
    month = month(InvoiceDate), 
    year = year(InvoiceDate),
    invoice_date = make_date(year, month, day)
    )

Looking at volume of invoice lines generated each day of December 2009 (which we can consider a rough proxy measure of sales orders barring returns and write-offs), it appears the first 2 months of Decemeber are the busiest time for shopping.

retaildf %>%
    filter(invoice_date >= '2009-12-01' & invoice_date <= '2009-12-23') %>%
    count(invoice_date, sort=TRUE)

## # A tibble: 21 x 2
##    invoice_date     n
##    <date>       <int>
##  1 2009-12-14    4117
##  2 2009-12-02    3277
##  3 2009-12-01    3223
##  4 2009-12-16    3216
##  5 2009-12-03    3002
##  6 2009-12-07    2869
##  7 2009-12-04    2559
##  8 2009-12-10    2492
##  9 2009-12-09    2470
## 10 2009-12-08    2440
## # ... with 11 more rows

Invoice Lines

If we count the number of invoice lines generated each months, does a noticeable/predictable pattern emerge? What months had the least orders? Was there a noticeable increase/decrease in invoices billed year to year? Unsurprisingly, the holiday months of October, November, December racked the greatest number of invoice lines, reflecting increased sales in those months. Interesting to note that December of 2011 only had ~25k invoice line items to the 65k~ invoice lines generated in Dec 2010. January and February appear to be least busiest months of the year.

retaildf %>%
    filter(year == '2011') %>%
    count(month, sort=TRUE)

## # A tibble: 12 x 2
##    month     n
##    <dbl> <int>
##  1    11 84711
##  2    10 60742
##  3     9 50226
##  4     7 39518
##  5     5 37030
##  6     6 36874
##  7     3 36748
##  8     8 35284
##  9     1 35147
## 10     4 29916
## 11     2 27707
## 12    12 25526

retaildf %>%
    filter(year == '2010') %>%
    count(month, sort=TRUE)

## # A tibble: 12 x 2
##    month     n
##    <dbl> <int>
##  1    11 78015
##  2    12 65004
##  3    10 59098
##  4     9 42091
##  5     3 41511
##  6     6 39983
##  7     5 35323
##  8     4 34057
##  9     7 33383
## 10     8 33306
## 11     1 31555
## 12     2 29388

retaildf %>%
    filter(year == '2009') %>%
    count(month, sort=TRUE)

## # A tibble: 1 x 2
##   month     n
##   <dbl> <int>
## 1    12 45228

Lets do a deep dive into the month of December for each year 2009-2011. How did revenue and sales volume differ in the holiday month of each year?

retaildf_2009_all_dec <- retaildf %>%
    filter(invoice_date >='2009-12-01' & invoice_date <='2009-12-23') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))

retaildf_2010_all_dec <- retaildf %>%
    filter(invoice_date <='2010-12-01' & invoice_date <='2010-12-23') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))

retaildf_2011_all_dec <- retaildf %>%
    filter(invoice_date <='2011-12-01' & invoice_date <='2011-12-23') %>%
    filter(!is.na(Description) &!is.na(`Customer ID`)) %>%
    mutate(Dollar_Total = Quantity * Price) %>%
    group_by(StockCode, Description) %>%
    summarise(Total_Earned = sum(Dollar_Total),
              Total_Sold = sum(Quantity)) %>%
    arrange(desc(Total_Earned)) %>%
    ungroup() %>%
    mutate(Proportion_of_Revenue = scales::percent( Total_Earned/sum(Total_Earned)))

What were the most and least popular items of each year in December?

In 2009, the top items were Tealight Holders, cherry lights. It was a great year inexplicably for Edwardian parasols. Tealight Holders and Assorted Bird Ornaments were still the most popular items in December. Champagne tray cards were no more popular this year than in 2009. Regency Cakestands and Jumbo Bags were the rage in 2011 along with the consistent popular Tealight Holders and Bird Ornaments. Cards with cats on them were decidedly not a hit. ## December Monthly Revenue by SKU

proportion_totals_09 <- retaildf_2009_all_dec %>%
    mutate(daily_revenue = case_when(
        between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
        between(Total_Earned, 500, 1000) ~ ">$500",
        between(Total_Earned, 301, 500) ~ ">$300",
        between(Total_Earned, 201, 300) ~">$200",
        between(Total_Earned, 101, 200) ~">$100",
        between(Total_Earned, 51, 100) ~">$50",
        between(Total_Earned, 25, 50) ~">=$25",
        between(Total_Earned, 0, 24) ~"<$25",
        TRUE ~ "loss")) %>%
    filter(daily_revenue!="loss")

Only 27% of Items sold in December made more than $200. A full 30% of Items sold in December made less than $25. This company could very reasonably benefit from trimming down the range of items on sale when a full 829 SKUs failed to likely make more the stocking point cost to procure and ship in the first place.

tab1(proportion_totals_09$daily_revenue, sort.group = "increasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Dec-09 Revenue by SKU", cex=.7, cex.names = .8)

## proportion_totals_09$daily_revenue : 
##         Frequency Percent Cum. percent
## > $1000       129     4.8          4.8
## >$500         173     6.4         11.3
## >$200         205     7.6         18.9
## >$300         219     8.2         27.0
## >$50          364    13.6         40.6
## >$100         374    13.9         54.5
## >=$25         391    14.6         69.1
## <$25          829    30.9        100.0
##   Total      2684   100.0        100.0

proportion_totals_10 <- retaildf_2010_all_dec %>%
    mutate(daily_revenue = case_when(
        between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
        between(Total_Earned, 500, 1000) ~ ">$500",
        between(Total_Earned, 301, 500) ~ ">$300",
        between(Total_Earned, 201, 300) ~">$200",
        between(Total_Earned, 101, 200) ~">$100",
        between(Total_Earned, 51, 100) ~">$50",
        between(Total_Earned, 25, 50) ~">=$25",
        between(Total_Earned, 0, 24) ~"<$25",
        TRUE ~ "loss")) %>%
    filter(daily_revenue!="loss")

In December 2010, we can see the company substantially increased its sales from year past. From December 2009, where only 5% of all SKUs made more than $1000, more than 35% of all unique SKUs sold in December 2010 made more than $1000.

tab1(proportion_totals_10$daily_revenue, sort.group = "decreasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Dec-10 Revenue by SKU", cex=.7, cex.names = .8)

## proportion_totals_10$daily_revenue : 
##         Frequency Percent Cum. percent
## > $1000      1575    35.8         35.8
## >$500         650    14.8         50.6
## >$300         493    11.2         61.8
## >$100         408     9.3         71.0
## <$25          403     9.2         80.2
## >$50          333     7.6         87.8
## >$200         272     6.2         93.9
## >=$25         267     6.1        100.0
##   Total      4401   100.0        100.0

proportion_totals_11 <- retaildf_2011_all_dec %>%
    mutate(daily_revenue = case_when(
        between(Total_Earned, 1001, max(Total_Earned)) ~ "> $1000",
        between(Total_Earned, 500, 1000) ~ ">$500",
        between(Total_Earned, 301, 500) ~ ">$300",
        between(Total_Earned, 201, 300) ~">$200",
        between(Total_Earned, 101, 200) ~">$100",
        between(Total_Earned, 51, 100) ~">$50",
        between(Total_Earned, 25, 50) ~">=$25",
        between(Total_Earned, 0, 24) ~"<$25",
        TRUE ~ "loss")) %>%
    filter(daily_revenue!="loss")

In December 2011, the proportions appear even better - 46% of all unique SKUs made more than $1000. Altbough the amount of unique SKUs offered seems to grown, only 5% of SKUs sold in December 2011 made less than $25 - a complete reversal of revenue distribution by SKU in only 2 years.

tab1(proportion_totals_11$daily_revenue, sort.group = "increasing", cum.percent = TRUE, horiz=TRUE, main = "Distribution of Dec-11 Revenue by SKU", cex=.7, cex.names = .8)

## proportion_totals_11$daily_revenue : 
##         Frequency Percent Cum. percent
## >=$25         244     4.7          4.7
## >$200         263     5.0          9.7
## >$50          322     6.1         15.8
## <$25          354     6.8         22.6
## >$100         401     7.6         30.2
## >$300         475     9.1         39.3
## >$500         770    14.7         54.0
## > $1000      2414    46.0        100.0
##   Total      5243   100.0        100.0

Now lets look at the aggregate numbers for each December.

In 2009, the company made ~600k in sales revenue, selling 2684 unique item codes, and sold almost ~400,000 units.

In 2010, the company made 8 MILLION in sales revenue, selling over 4,400 unique item codes, and sold more than 5 MILLION units.

In 2011, the company made 16 MILLION in sales revenue, selling over 5,000 unique item codes, and shipped more than 10 MILLION units. Interesting to note that the invoice lines did not actually map to sales revenue or volume sold. As we saw, there was more than 40k reduction in invoice lines between Dec 2010 to Dec 2011. Since item popularity seems to have remained relatively unchanged, we can surmise that customers felt confident to order more product from this vendor.

proportion_totals_09 %>%
    summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
              unique_items_sold=n(),
              units_sold=sum(Total_Sold))

## # A tibble: 1 x 3
##   Overall_Revenue unique_items_sold units_sold
##   <chr>                       <int>      <dbl>
## 1 $664,357                     2684     389700

proportion_totals_10 %>%
    summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
              unique_items_sold=n(),
              units_sold=sum(Total_Sold))

## # A tibble: 1 x 3
##   Overall_Revenue unique_items_sold units_sold
##   <chr>                       <int>      <dbl>
## 1 $8,269,102                   4401    5227420

proportion_totals_11 %>%
    summarise(Overall_Revenue = scales::dollar(sum(Total_Earned)),
              unique_items_sold=n(),
              units_sold=sum(Total_Sold))

## # A tibble: 1 x 3
##   Overall_Revenue unique_items_sold units_sold
##   <chr>                       <int>      <dbl>
## 1 $16,551,904                  5243   10050318

The amounts that the top 10 customers of each year spent saw significantly increases year over year, confirming the theory that a contributing factor to the increased revenue seen was an increase in consumer confidence reflected in the increased amount of product purchased by customers. In 2009, the top customer purchased 14k units of product and spent ~$41,000. In 2010, the SAME customer purchased >100k units of product and spent ~$328,000, 8x what they spent in the previous year.

retaildf %>%
  group_by(year,`Customer ID`) %>%
  summarize(Spent=sum(Price * Quantity),
            Qty_Purchased=sum(Quantity)) %>%
  filter(Spent>0, !is.na(`Customer ID`)) %>%
  slice_max(Spent, n=10)

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

## # A tibble: 30 x 4
## # Groups:   year [3]
##     year `Customer ID`  Spent Qty_Purchased
##    <dbl>         <dbl>  <dbl>         <dbl>
##  1  2009         18102 41006.         14084
##  2  2009         13694 20428.         26751
##  3  2009         14646 14696.         10541
##  4  2009         15061 14313.          8234
##  5  2009         14156 11025.          6272
##  6  2009         15311  9293.          3738
##  7  2009         17511  8368.          4999
##  8  2009         17850  7411.          2158
##  9  2009         13777  7125.          3272
## 10  2009         16779  6742.          3935
## # ... with 20 more rows

…