DATA 622 HW1

Introduction

This assignment explores and compares the structure and contents of two datasets from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/, considers their similarities and differences, and explores how to analyze and predict an outcome based on the data.

We will use the following packages for this study,

library(tidyverse)
library(caret)
library(ipred)
library(corrplot)
library(forecast)
library(randomForest)
library(ggthemes)
library(egg)
library(kableExtra)
library(doParallel)

The Data

Overview

Dataset 1: 100 Sales Records

data.100 <- read.csv('/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2023/DATA622/HW1/100 Sales Records.csv')
glimpse(data.100)
## Rows: 100
## Columns: 14
## $ Region         <chr> "Australia and Oceania", "Central America and the Carib…
## $ Country        <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe",…
## $ Item.Type      <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "Of…
## $ Sales.Channel  <chr> "Offline", "Online", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H", …
## $ Order.Date     <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2/1…
## $ Order.ID       <int> 669165933, 963881480, 341417157, 514321792, 115456712, …
## $ Ship.Date      <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/6/…
## $ Units.Sold     <int> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070, 6…
## $ Unit.Price     <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27, 1…
## $ Unit.Cost      <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54, 9…
## $ Total.Revenue  <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.02…
## $ Total.Cost     <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.52,…
## $ Total.Profit   <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50, 2…

The dataset consists of:

  • 100 rows of data across 14 columns.
  • 7 continuous variables.
  • 7 variables in string (character) format.
    • 4 categorical variables: Region, Country, Item.Type, Sales.Channel.
    • 1 ordinal variable: Order.Priority., with no order given.
    • 2 date variables: Order.Date, Ship.Date.

We convert categorical, except for Country, which contains 76 unique values, and ordinal variables to factors and date variables to dates,

data.100 <- data.100 |>
  mutate(Region = as.factor(Region),
         Item.Type = as.factor(Item.Type),
         Sales.Channel = as.factor(Sales.Channel),
         Order.Priority = as.factor(Order.Priority),
         Order.Date = as.Date(Order.Date, format = "%m/%d/%Y"),
         Ship.Date = as.Date(Ship.Date, format = "%m/%d/%Y"))

colnames(data.100) <- tolower(colnames(data.100))

We check for any missing values; there are none,

colSums(is.na(data.100))
##         region        country      item.type  sales.channel order.priority 
##              0              0              0              0              0 
##     order.date       order.id      ship.date     units.sold     unit.price 
##              0              0              0              0              0 
##      unit.cost  total.revenue     total.cost   total.profit 
##              0              0              0              0

Now we can view summary statistics for each variable,

summary(data.100)
##                                region     country                    item.type 
##  Asia                             :11   Length:100         Clothes        :13  
##  Australia and Oceania            :11   Class :character   Cosmetics      :13  
##  Central America and the Caribbean: 7   Mode  :character   Office Supplies:12  
##  Europe                           :22                      Fruits         :10  
##  Middle East and North Africa     :10                      Personal Care  :10  
##  North America                    : 3                      Household      : 9  
##  Sub-Saharan Africa               :36                      (Other)        :33  
##  sales.channel order.priority   order.date            order.id        
##  Offline:50    C:22           Min.   :2010-02-02   Min.   :114606559  
##  Online :50    H:30           1st Qu.:2012-02-14   1st Qu.:338922488  
##                L:27           Median :2013-07-12   Median :557708561  
##                M:21           Mean   :2013-09-16   Mean   :555020412  
##                               3rd Qu.:2015-04-07   3rd Qu.:790755081  
##                               Max.   :2017-05-22   Max.   :994022214  
##                                                                       
##    ship.date            units.sold     unit.price       unit.cost     
##  Min.   :2010-02-25   Min.   : 124   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2012-02-24   1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84  
##  Median :2013-08-11   Median :5382   Median :179.88   Median :107.28  
##  Mean   :2013-10-09   Mean   :5129   Mean   :276.76   Mean   :191.05  
##  3rd Qu.:2015-04-28   3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33  
##  Max.   :2017-06-17   Max.   :9925   Max.   :668.27   Max.   :524.96  
##                                                                       
##  total.revenue       total.cost       total.profit    
##  Min.   :   4870   Min.   :   3612   Min.   :   1258  
##  1st Qu.: 268721   1st Qu.: 168868   1st Qu.: 121444  
##  Median : 752314   Median : 363566   Median : 290768  
##  Mean   :1373488   Mean   : 931806   Mean   : 441682  
##  3rd Qu.:2212045   3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :5997055   Max.   :4509794   Max.   :1719922  
## 

We can view the shape of the numeric data,

data.100 |>
  select(where(is.numeric) & -order.id) |>
  gather() |>
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() + 
  labs(x = "", y = "", title = "Distribution of Numerical Variables") +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~ key, scales = "free")

The only numeric variable that appears to be symmetrical is Units.Sold. Let’s take a look at the categorical variables.

data.100 |>
  select(item.type, order.priority, region) |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar(fill='lightblue') +
  facet_wrap(~ key, scales = "free") + 
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x='',
       y='',
       title='distribution of categorical variables')
## Warning: attributes are not identical across measure variables; they will be
## dropped

We know the Sales.Channel variable is an even 50/50 split from the summary statistics. What we can see here is that Order.Priority is relatively symmetrical, while the other categorical variables are not.

data.100 |>
  select(order.date, total.revenue, region) |>
  ggplot(aes(x = order.date, y = total.revenue, color=region)) +
  geom_line() +
  labs(x = 'Date',
       y = 'Total Revenue',
       title = 'Total Revenue by Region: 2010-2017') +
  theme_few()

Let’s take a look at the correlation amongst the numeric variables,

data.100 |>
  select(where(is.numeric), -order.id) |>
  cor() |>
  corrplot(type = "upper")

We see there is high correlation amongst all numerical variables except Unit.Cost and Unit.Price to Units.Sold.

Let’s take a look at the total profit by sales channel,

p1 <- data.100 |>
  select(sales.channel, region, total.profit) |>
  ggplot(aes(x=region, y=total.profit, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Profit by Channel") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

p2 <- data.100 |>  
  select(sales.channel, region, units.sold) |>
  ggplot(aes(x=region, y=units.sold, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Sales by Channel") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

ggarrange(p1, p2, ncol=2)

Let’s take a look at the profitability distribution by Order Priority:

data.100 |>
  group_by(order.priority) |>
  summarise(avg.profit = mean(total.profit)) |>
  ggplot(aes(x=order.priority, y=avg.profit)) +
  geom_bar(stat='identity', fill='lightblue') +
  geom_text(aes(label = paste0('$',round(avg.profit))), vjust = 2, size = 4) +
  labs(x='',
       y='',
       title='avg profit per order by priority') +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_blank(),  # Remove y-axis text
        axis.ticks.y = element_blank())

What is the average turnaround time for the priority levels? To calculate this, we create a new variable turn that represents the time between order.date and ship.date

data.100 <- data.100 |>
  mutate(turn = as.integer(ship.date - order.date))
data.100 |>
  group_by(order.priority) |>
  summarise(total.orders = n(),
            total.units = sum(units.sold),
            avg.turn = round(mean(turn)),
            avg.profit = mean(total.profit)) |>
  arrange(desc(avg.profit)) |>
  kable() |>
  kable_styling()
order.priority total.orders total.units avg.turn avg.profit
H 30 154212 21 563053.3
M 21 94832 25 460454.4
L 27 146876 24 402175.1
C 22 116951 24 306742.2
data.100 |>
  ggplot(aes(x=order.priority, y=turn)) +
  geom_boxplot(fill='lightblue') +
  labs(x='priority',
       y='days',
       title='turnaround time by order priority') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5))

We see that 30% of all orders are priority level “H”, these orders ship 3-4 days faster than the other levels, and the average profit of these orders is $10-15k higher than levels “M” and “L” and nearly twice as high as level “C”.

Is there a relationship between the number of items per order and the turnaround time?

data.100 |>
  ggplot(aes(x=units.sold, y=turn, color=order.priority)) +
  geom_point() +
  labs(x='units',
       y='days',
       title = 'turnaround time by units sold') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

It’s just white noise.

Does the profitability of the items affect turnaround time?

data.100 |>
  ggplot(aes(x=total.profit, y=turn)) +
  geom_point(color='lightblue', alpha=0.8) +
  facet_wrap(~ region, scales="free")  +
  labs(x='profit',
       y='days',
       title = 'turnaround time by order profit') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

There does not appear to be a relationship.

Lastly, let’s take a look at profitability by item type to see if we can observe any patterns. We’ll look at an overview of all orders and then drill down by region,

p1 <- data.100 |>
  ggplot(aes(x=total.profit, y=item.type)) +
  geom_boxplot(fill='lightblue') + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="profit by item type")


p2 <- data.100 |>
  ggplot(aes(x=total.profit, y=item.type, fill=region)) +
  geom_boxplot() + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="",
       fill="")

ggarrange(p1, p2, nrow=2)

We can see that Household and Cosmetics items have the highest overall median profits, but there is some variance by region.

data.100 |>
  group_by(region, item.type) |>
  summarise(total.profit = sum(total.profit)) |>
  top_n(2, wt = total.profit) |>
  arrange(region, desc(total.profit)) |>
  kable() |>
  kable_styling()
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
region item.type total.profit
Asia Household 2002018.4
Asia Office Supplies 1479397.5
Australia and Oceania Cosmetics 1678541.0
Australia and Oceania Baby Food 1236498.1
Central America and the Caribbean Household 1487261.0
Central America and the Caribbean Clothes 403773.1
Europe Cosmetics 5233487.0
Europe Baby Food 2117259.8
Middle East and North Africa Cosmetics 4105940.0
Middle East and North Africa Clothes 1028160.0
North America Household 1152486.4
North America Personal Care 305456.3
Sub-Saharan Africa Office Supplies 2051688.8
Sub-Saharan Africa Cosmetics 2032888.0

Dataset 2: 100,000 Sales Records

data.100k <- read.csv('/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2023/DATA622/HW1/100000 Sales Records.csv')
glimpse(data.100k)
## Rows: 100,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "Central America and th…
## $ Country        <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type      <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel  <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date     <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID       <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date      <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold     <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price     <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost      <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue  <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost     <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit   <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…

The dataset consists of:

  • 100,000 rows of data across 14 columns.
  • 7 continuous variables.
  • 7 variables in string (character) format.
    • 4 categorical variables: Region, Country, Item.Type, Sales.Channel.
    • 1 ordinal variable: Order.Priority., with no order given.
    • 2 date variables: Order.Date, Ship.Date.

We convert categorical, except for Country, which contains 185 unique values, and ordinal variables to factors and date variables to dates,

data.100k <- data.100k |>
  mutate(Region = as.factor(Region),
         Item.Type = as.factor(Item.Type),
         Sales.Channel = as.factor(Sales.Channel),
         Order.Priority = as.factor(Order.Priority),
         Order.Date = as.Date(Order.Date, format = "%m/%d/%Y"),
         Ship.Date = as.Date(Ship.Date, format = "%m/%d/%Y"))

colnames(data.100k) <- tolower(colnames(data.100k))

We check for NAs, none.

colSums(is.na(data.100k))
##         region        country      item.type  sales.channel order.priority 
##              0              0              0              0              0 
##     order.date       order.id      ship.date     units.sold     unit.price 
##              0              0              0              0              0 
##      unit.cost  total.revenue     total.cost   total.profit 
##              0              0              0              0

And we can view some summary statistics,

summary(data.100k)
##                                region        country         
##  Asia                             :14547   Length:100000     
##  Australia and Oceania            : 8113   Class :character  
##  Central America and the Caribbean:10731   Mode  :character  
##  Europe                           :25877                     
##  Middle East and North Africa     :12580                     
##  North America                    : 2133                     
##  Sub-Saharan Africa               :26019                     
##            item.type     sales.channel   order.priority   order.date        
##  Office Supplies: 8426   Offline:49946   C:24951        Min.   :2010-01-01  
##  Cereal         : 8421   Online :50054   H:24945        1st Qu.:2011-11-25  
##  Baby Food      : 8407                   L:25016        Median :2013-10-15  
##  Cosmetics      : 8370                   M:25088        Mean   :2013-10-15  
##  Personal Care  : 8364                                  3rd Qu.:2015-09-07  
##  Meat           : 8320                                  Max.   :2017-07-28  
##  (Other)        :49692                                                      
##     order.id           ship.date            units.sold      unit.price    
##  Min.   :100008904   Min.   :2010-01-02   Min.   :    1   Min.   :  9.33  
##  1st Qu.:326046383   1st Qu.:2011-12-21   1st Qu.: 2505   1st Qu.:109.28  
##  Median :547718512   Median :2013-11-09   Median : 5007   Median :205.70  
##  Mean   :550395554   Mean   :2013-11-09   Mean   : 5001   Mean   :266.70  
##  3rd Qu.:775078534   3rd Qu.:2015-10-02   3rd Qu.: 7495   3rd Qu.:437.20  
##  Max.   :999996459   Max.   :2017-09-16   Max.   :10000   Max.   :668.27  
##                                                                           
##    unit.cost      total.revenue       total.cost       total.profit      
##  Min.   :  6.92   Min.   :     19   Min.   :     14   Min.   :      4.8  
##  1st Qu.: 56.67   1st Qu.: 279753   1st Qu.: 162928   1st Qu.:  95900.0  
##  Median :117.11   Median : 789892   Median : 467937   Median : 283657.5  
##  Mean   :188.02   Mean   :1336067   Mean   : 941975   Mean   : 394091.2  
##  3rd Qu.:364.69   3rd Qu.:1836490   3rd Qu.:1209475   3rd Qu.: 568384.1  
##  Max.   :524.96   Max.   :6682700   Max.   :5249075   Max.   :1738700.0  
## 
data.100k |>
  select(where(is.numeric) & -order.id) |>
  gather() |>
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() + 
  labs(x = "", y = "", title = "Distribution of Numerical Variables") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~ key, scales = "free")

data.100k |>
  select(item.type, order.priority, region) |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar(fill='lightblue') +
  facet_wrap(~ key, scales = "free") + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "",
       y = "",
       title = "Distribution of Categorical Variables")
## Warning: attributes are not identical across measure variables; they will be
## dropped

We can see that although the distribution amongst regions is highly variant, the Item.Types and Order.Prioritys are uniformly distributed amongst categories.

data.100k |>
  select(order.date, total.revenue, region) |>
  ggplot(aes(x = order.date, y = total.revenue, color=region)) +
  geom_line() +
  labs(x = 'Date',
       y = 'Total Revenue',
       title = 'Total Revenue by Region: 2010-2017')

We can see that the data is so granular and variant that we’d either need to expand our plot dramatically or group the data by month (losing some information along the way) in order to visualize it. Zooming in on a single month illustrates this:

data.100k |>
  select(order.date, total.revenue, region) |>
  filter(order.date <= min(order.date + 29)) |>
  ggplot(aes(x = order.date, y = total.revenue, color=region)) +
  geom_line() +
  labs(x = 'Date',
       y = 'Total Revenue',
       title = 'Total Revenue by Region: January 2010')

We can see the high variance among the daily revenue.

Let’s take a look at the correlation between the numeric variables,

data.100k |>
  select(where(is.numeric), -order.id) |>
  cor() |>
  corrplot(type = "upper")

Again, the only variables that do not exhibit strong correlation are Unit.Price and Unit.Cost to Units.Sold.

d1 <- data.100k |>
  select(sales.channel, total.profit, region) |>
  ggplot(aes(x=region, y=total.profit, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Profit") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

d2 <- data.100k |>  
  select(sales.channel, region, units.sold) |>
  ggplot(aes(x=region, y=units.sold, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Sales") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

ggarrange(d1, d2, ncol=2)

Let’s take a look at the profitability distribution by Order Priority:

data.100k |>
  group_by(order.priority) |>
  summarise(avg.profit = mean(total.profit)) |>
  ggplot(aes(x=order.priority, y=avg.profit)) +
  geom_bar(stat='identity', fill='lightblue') +
  geom_text(aes(label = paste0('$',round(avg.profit))), vjust = 2, size = 4) +
  labs(x='',
       y='',
       title='avg profit per order by priority') +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_blank(),  # Remove y-axis text
        axis.ticks.y = element_blank())

For this dataset, the profit by priority is comparable amongst all levels. Let’s take a look at the turnaround time by priority level.

data.100k <- data.100k |>
  mutate(turn = as.integer(ship.date - order.date))
data.100k |>
  group_by(order.priority) |>
  summarise(total.orders = n(),
            avg.order = round(mean(units.sold)),
            avg.turn = round(mean(turn)),
            avg.profit = mean(total.profit)) |>
  arrange(desc(avg.profit)) |>
  kable() |>
  kable_styling()
order.priority total.orders avg.order avg.turn avg.profit
H 24945 5040 25 394755.6
C 24951 4997 25 394146.9
M 25088 4991 25 393904.6
L 25016 4978 25 393560.4
data.100k |>
  ggplot(aes(x=order.priority, y=turn)) +
  geom_boxplot(fill='lightblue') +
  labs(x='priority',
       y='days',
       title='turnaround time by order priority') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5))

For this dataset, the average turnaround time and average profit is equal amongst all the different priority levels, although the profit level is marginally higher for the “H” and “C” priority levels.

Lastly, let’s take a look at profitability by item type to see if we can observe any patterns. We’ll look at an overview of all orders and then drill down by region,

p1 <- data.100k |>
  ggplot(aes(x=total.profit, y=item.type)) +
  geom_boxplot(fill='lightblue') + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="profit by item type")


p2 <- data.100k |>
  ggplot(aes(x=total.profit, y=item.type, fill=region)) +
  geom_boxplot() + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="",
       fill="")

ggarrange(p1, p2, nrow=2)

We can see for this dataset that the regions have no bearing on the distribution of profit by item type.

data.100k |> 
  group_by(sales.channel, order.priority) |> 
  summarise(total.orders = n(), 
            total.sold = sum(units.sold),
            avg.turn = mean(turn))
## `summarise()` has grouped output by 'sales.channel'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 5
## # Groups:   sales.channel [2]
##   sales.channel order.priority total.orders total.sold avg.turn
##   <fct>         <fct>                 <int>      <int>    <dbl>
## 1 Offline       C                     12480   62070435     25.1
## 2 Offline       H                     12471   62888532     24.8
## 3 Offline       L                     12419   61741004     24.9
## 4 Offline       M                     12576   63171201     25.1
## 5 Online        C                     12471   62606910     25.0
## 6 Online        H                     12474   62833684     25.1
## 7 Online        L                     12597   62790722     25.0
## 8 Online        M                     12512   62042129     25.2

Is there a relationship between the number of items per order and the turnaround time?

data.100k |>
  ggplot(aes(x=units.sold, y=turn, color=order.priority)) +
  geom_point() +
  labs(x='units',
       y='days',
       title = 'turnaround time by units sold') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

It’s just white noise, there is too much data to visualize. But there do not appear to be any patterns.

Does the profitability of the items affect turnaround time?

data.100k |>
  ggplot(aes(x=total.profit, y=turn)) +
  geom_point(color='lightblue', alpha=0.8) +
  facet_wrap(~ region, scales="free")  +
  labs(x='profit',
       y='days',
       title = 'turnaround time by order profit') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

Again, it’s too much data to visualize, but we don’t see any patterns whatsoever.

data.100k |>
  group_by(region, item.type, sales.channel) |>
  summarise(total.sold = sum(units.sold)) |>
  ggplot(aes(x=item.type, y=total.sold, fill=sales.channel)) +
  geom_bar(stat='identity', position=position_dodge()) +
  facet_wrap(~region, scales="free") +
  labs(x='',
       y='sold',
       title='total sales by region',
       fill='') +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust=1),
        plot.title = element_text(hjust=0.5)) +
  scale_fill_brewer(palette="YlGnBu")
## `summarise()` has grouped output by 'region', 'item.type'. You can override
## using the `.groups` argument.

We can see that certain products have sell better online or offline in different regions.

Comparing the Data

Both datasets contain observations about inventory sold by date around the world (mostly outside North America) consisting of a mix of categorical and continuous variables. Both datasets exhibit high variance and non-normal distribution amongst the variables. We can observe the following dependencies across both datasets:

total.cost is a function units.sold * unit.cost total.profit is a function total.revenue - total.cost.
total.revenue is a function unit.cost x units.sold.

We can observe that both datasets contain roughly equal distribution of online and offline sales, but the total number of items sold and the total profit from those sales is much higher for offline sales than online. There are no missing values in either dataset.

Model Selection

The guiding question at this point in the analysis is, what do we want to predict? Considering the data, there are a few things we can look at:

  • Predicting Sales Channel.
  • Predicting Order Priority.
  • Predicting Total Profit.

Considering these three options with the data that is available, I wonder if any of these three serve realistic business goals, based upon the available data. For example, if we are given all of the data except sales channel, ie, we know how many of each item type are sold, what value is it to know after the fact whether an order is sold online or offline, or what the order priority is? Along the same lines, how beneficial is it to predict profit given sales totals? That is simple mathematics. With these thoughts in mind, the question to me becomes, how can we predict the total number of items sold by item type if given all the additional information in the dataset? Predicting the numbers of items sold would provide valuable business use regarding pricing and inventory allocation, ie how many items to make available online vs. offline, or perhaps offering a lower price for online sales.

Considering that both datasets contain a mix of categorical and continuous variables, have non-normal distributions, and have unequal variance, I will attempt to model two decision trees on these data to try to predict Units.Sold as the outcome variable.

Feature Selection.

Considering that both datasets contain a mix of categorical and continuous variables, have non-normal distributions, and have unequal variance, and that most of the continous variables are dependent on one another, I am going to build decision trees to predict the total sold per item given the following variables:

Month.
Year.
Region.
Country.
Item.Type.
Unit.Price.
Sales.Channel.

First, we need to extract the Month variable from the Order.Date in each dataset,

d100 <- data.100 |>
  mutate(month = factor(month(order.date, label=TRUE)),
         year = year(order.date),
         price = unit.price,
         item = item.type,
         channel = sales.channel,
         sold = units.sold) |>
  select(sold, item, price, month, year, region, channel) 

d100k <- data.100k |>
  mutate(month = factor(month(order.date, label=TRUE)),
         year = year(order.date),
         price = unit.price,
         item = item.type,
         channel = sales.channel,
         sold = units.sold) |>
  select(sold, item, price, month, year, region, channel)

Previewing our datasets looks like this:

Dataset 1:

kable(head(d100))
sold item price month year region channel
9925 Baby Food 255.28 May 2010 Australia and Oceania Offline
2804 Cereal 205.70 Aug 2012 Central America and the Caribbean Online
1779 Office Supplies 651.21 May 2014 Europe Offline
8102 Fruits 9.33 Jun 2014 Sub-Saharan Africa Online
5062 Office Supplies 651.21 Feb 2013 Sub-Saharan Africa Offline
2974 Baby Food 255.28 Feb 2015 Australia and Oceania Online

Dataset 2:

kable(head(d100k))
sold item price month year region channel
934 Snacks 152.58 Oct 2014 Middle East and North Africa Online
4551 Cosmetics 437.20 Feb 2015 Central America and the Caribbean Offline
9986 Fruits 9.33 Dec 2015 Sub-Saharan Africa Offline
9118 Personal Care 81.73 Sep 2014 Sub-Saharan Africa Online
5858 Household 668.27 Feb 2010 Central America and the Caribbean Offline
1149 Clothes 109.28 Feb 2013 Europe Online

Data Modeling

We are going to use Random Forest and Bagged Trees to generate predictions about total number of items sold on each dataset. First we split each set training and testing sets using an 80/20 split.

set.seed(1)

# set split indices
split.d100 <- createDataPartition(d100$sold, p = .8, list = FALSE)
split.d100k<- createDataPartition(d100k$sold, p = .8, list = FALSE)

# split data and partition predictor (x) and response (y) sets

# dataset1
train.d100 <- d100[split.d100, ]
test.d100 <- d100[-split.d100, ]

# dataset2
train.d100k <- d100k[split.d100k, ]
test.d100k <- d100k[-split.d100k, ]

We set up our cross-validation parameters. Since the dataset 1 training data is only 80 observations, we choose Leave-One-Out cross-validation. For the larger dataset, we will use 10-fold cross-validation.

loocv <- trainControl(
  method = "LOOCV",
  allowParallel = TRUE
)

cv <- trainControl(
  method = "cv",
  number = 5,
  allowParallel = TRUE
)

We define a function to return the prediction metrics, including Mean Absolute Percentage Error (MAPE):

metrics <- function(predicted, actual){
  mape = accuracy(predicted, actual)['Test set','MAPE']
  measures = postResample(predicted, actual) 
  metrics = c(measures, MAPE=mape)
  return(metrics)
}

Selecting a Model

First we prepare additional cores for parallel processing,

num_cores <- 4
cl <- makeCluster(num_cores)

Random Forest

Dataset 1

registerDoParallel(cl)
set.seed(1)

rf100 <- randomForest(sold ~ ., 
                      data = train.d100,
                      importance = TRUE,
                      trControl = loocv,
                      ntree = 1000,
                      parallel = "multicore")

rf100.preds <- predict(rf100, test.d100)
rf100.results <- metrics(rf100.preds, test.d100$sold)
rf100
## 
## Call:
##  randomForest(formula = sold ~ ., data = train.d100, importance = TRUE,      trControl = loocv, ntree = 1000, parallel = "multicore") 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 7597693
##                     % Var explained: -0.47

Dataset 2

set.seed(1)

rf100k <- randomForest(sold ~ ., 
                       data = train.d100k,
                       importance = TRUE,
                       ntree = 10,
                       mtry = sqrt(length(colnames(train.d100k))-1),
                       trControl = cv,
                       parallel = "multicore")

rf100k.preds <- predict(rf100k, test.d100k)
rf100k.results <- metrics(rf100k.preds, test.d100k$sold)
rf100k
## 
## Call:
##  randomForest(formula = sold ~ ., data = train.d100k, importance = TRUE,      ntree = 10, mtry = sqrt(length(colnames(train.d100k)) - 1),      trControl = cv, parallel = "multicore") 
##                Type of random forest: regression
##                      Number of trees: 10
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 8646108
##                     % Var explained: -4.02

Bagged Trees

Dataset 1

set.seed(1)

bagged100 <- train(sold ~ .,
                   data = train.d100,
                   method = "treebag",
                   trControl = loocv,
                   tuneLength = 10)

bagged100.preds <- predict(bagged100, test.d100)
bagged100.results <- metrics(bagged100.preds, test.d100$sold)
bagged100
## Bagged CART 
## 
## 80 samples
##  6 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 79, 79, 79, 79, 79, 79, ... 
## Resampling results:
## 
##   RMSE      Rsquared     MAE     
##   2953.101  0.001819921  2401.879

Dataset 2

set.seed(1)

bagged100k <- train(sold ~ .,
                    data = train.d100k,
                    method = "treebag",
                    trControl = cv,
                    tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
bagged100k.preds <- predict(bagged100k, test.d100k)
bagged100k.results <- metrics(bagged100k.preds, test.d100k$sold)
bagged100k
## Bagged CART 
## 
## 80001 samples
##     6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 64000, 64001, 64000, 64002, 64001 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   2883.047  NaN       2496.025

Results

kable(rbind("Dataset 1: Random Forest" = rf100.results,
      "Dataset 1: Bagged Trees" = bagged100.results,
      "Dataset 2: Random Forest" = rf100k.results,
      "Dataset 2: Bagged Trees" = bagged100k.results)) |> kable_styling()
RMSE Rsquared MAE MAPE
Dataset 1: Random Forest 2958.876 0.0180008 2558.838 272.8429
Dataset 1: Bagged Trees 3049.350 0.0001126 2571.120 257.6522
Dataset 2: Random Forest 2920.719 0.0000063 2520.789 492.8483
Dataset 2: Bagged Trees 2890.612 NA 2504.423 501.7582

Neither model for either dataset is able to predict the number sold based on the chosen variables. The Mean Absolute Percentage Error (MAPE) values indicate the model’s predictive accuracy is off by 257% to 501% – those are unacceptably poor results.

Discussion / Essay

In this study we looked at two datasets from Excel Bianalytics, containing worldwide sales data for the years 2010-2017. We looked at two different sized datasets to identify a business problem and compare the performance of two machine learning algorithms on the datasets. The first dataset is small, consisting of 100 observations. The second dataset is large, containing 100,000 observations. The observations are a mix of 14 categorical, continuous, and date variables, as well as an order ID that does not serve a use for modeling. The 13 remaining variables are region, country, item type, sales channel, order priority, order date, ship date, units sold, unit price, unit cost, total revenue, total cost, and total profit. We determined that dependencies exist between total profit (total revenue - total cost), total cost (units sold x unit cost), and total revenue (units sold x unit price). As such, these three variables exhibited high correlation with their dependence variables and decided to focus on predicting units sold as our outcome variable, since cost, revenue, and profit can all be derived from the units sold. This seemed the most sensible business use to focus our attention on. This left us with the following 9 variables: region, country, item type, sales channel, order priority, order date, ship date, units sold. We calculated the processing time for each order as the difference between the order date and ship date. We also determined that the number of distinct countries was too large to be of use for our modeling: dataset 1 containing 100 observations: 76 different countries; dataset 2, 100k observations: 185 different countries.

After feature engineering and selection we had 7 predictor variables. We conducted exploratory data analysis and observed that the categorical predictors in the smaller dataset were unevenly distributed, while in the larger dataset, they were uniform. We observed no noticeable trend over time in either dataset. We further observed no linearity between the predictors and the outcome variable, and we determined that the processing time variable that we created did not appear to have any relationship with either the predictors or outcome variable.

As the datasets contained a combination of categorical and continuous variables we decided to employ decision tree models to predict the number of items sold. Decision trees are able to handle mixed data types and don’t require strong assumptions about variable distributions. They are also able to be employed for datasets of different sizes. However, they can be prone to overfitting and bias. To reduce this, we used the ensemble methods Bagged Trees and Random Forests. The bagged trees (“B”ootstrap “Agg”regation) model reduces the variance of a model by averaging the predictions of multiple decision trees, taking bootstrap samples (sampling with replacement) of the original dataset, training a separate decision tree on each sample, and averaging the predictions of each tree (aggregating). For the smaller dataset, we used “Leave One Out” (LOO) cross-validation to further reduce bias of the model. For the larger dataset, we used 10-fold cross-validation. We found this model did not predict well on either of the datasets, and was computationally expensive on the larger dataset.

Next we tried the Random Forest model, which extends the bagged trees model. In addition to using bootstrapped samples, random subsets of predictors are selected at each tree node, reducing correlation between the individual trees and improving model generalization. This model also did not perform well at predicting units sold, and was required a lot of processing time to train. For the larger dataset, we had to limit the number of trees to 10 in order to train a model that took less than 10 minutes to compute, which likely had an effect on the variance and predictive accuracy of the model, but not enough to make a difference considering how far off our predictions were. Again, LOO and 10-fold cross-validation were used for the smaller and larger datasets, respectively.

In retrospect I think the problem is with these datasets and the business question selected as far as why these models do not offer predictive accuracy – the datasets contain largely uniform information that requires transformation and feature engineering to generate predictors that might better fit a model. A few thoughts for future exploration would be to calculate the per-unit profit of an item, and aggregating sales by month by region, so there is less variance within the variables.