DATA 622 HW1

Introduction

This assignment explores and compares the structure and contents of two datasets from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/, considers their similarities and differences, and explores how to analyze and predict an outcome based on the data.

We will use the following packages for this study,

library(tidyverse)
library(caret)
library(ipred)
library(corrplot)
library(forecast)
library(randomForest)
library(ggthemes)
library(egg)
library(kableExtra)
library(doParallel)

The Data

Overview

Dataset 1: 100 Sales Records

data.100 <- read.csv('/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2023/DATA622/HW1/100 Sales Records.csv')
glimpse(data.100)

## Rows: 100
## Columns: 14
## $ Region         <chr> "Australia and Oceania", "Central America and the Carib…
## $ Country        <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe",…
## $ Item.Type      <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "Of…
## $ Sales.Channel  <chr> "Offline", "Online", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H", …
## $ Order.Date     <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2/1…
## $ Order.ID       <int> 669165933, 963881480, 341417157, 514321792, 115456712, …
## $ Ship.Date      <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/6/…
## $ Units.Sold     <int> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070, 6…
## $ Unit.Price     <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27, 1…
## $ Unit.Cost      <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54, 9…
## $ Total.Revenue  <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.02…
## $ Total.Cost     <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.52,…
## $ Total.Profit   <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50, 2…

The dataset consists of:

100 rows of data across 14 columns.
7 continuous variables.
7 variables in string (character) format.
- 4 categorical variables: Region, Country, Item.Type, Sales.Channel.
- 1 ordinal variable: Order.Priority., with no order given.
- 2 date variables: Order.Date, Ship.Date.

We convert categorical, except for Country, which contains 76 unique values, and ordinal variables to factors and date variables to dates,

data.100 <- data.100 |>
  mutate(Region = as.factor(Region),
         Item.Type = as.factor(Item.Type),
         Sales.Channel = as.factor(Sales.Channel),
         Order.Priority = as.factor(Order.Priority),
         Order.Date = as.Date(Order.Date, format = "%m/%d/%Y"),
         Ship.Date = as.Date(Ship.Date, format = "%m/%d/%Y"))

colnames(data.100) <- tolower(colnames(data.100))

We check for any missing values; there are none,

colSums(is.na(data.100))

##         region        country      item.type  sales.channel order.priority 
##              0              0              0              0              0 
##     order.date       order.id      ship.date     units.sold     unit.price 
##              0              0              0              0              0 
##      unit.cost  total.revenue     total.cost   total.profit 
##              0              0              0              0

Now we can view summary statistics for each variable,

summary(data.100)

##                                region     country                    item.type 
##  Asia                             :11   Length:100         Clothes        :13  
##  Australia and Oceania            :11   Class :character   Cosmetics      :13  
##  Central America and the Caribbean: 7   Mode  :character   Office Supplies:12  
##  Europe                           :22                      Fruits         :10  
##  Middle East and North Africa     :10                      Personal Care  :10  
##  North America                    : 3                      Household      : 9  
##  Sub-Saharan Africa               :36                      (Other)        :33  
##  sales.channel order.priority   order.date            order.id        
##  Offline:50    C:22           Min.   :2010-02-02   Min.   :114606559  
##  Online :50    H:30           1st Qu.:2012-02-14   1st Qu.:338922488  
##                L:27           Median :2013-07-12   Median :557708561  
##                M:21           Mean   :2013-09-16   Mean   :555020412  
##                               3rd Qu.:2015-04-07   3rd Qu.:790755081  
##                               Max.   :2017-05-22   Max.   :994022214  
##                                                                       
##    ship.date            units.sold     unit.price       unit.cost     
##  Min.   :2010-02-25   Min.   : 124   Min.   :  9.33   Min.   :  6.92  
##  1st Qu.:2012-02-24   1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84  
##  Median :2013-08-11   Median :5382   Median :179.88   Median :107.28  
##  Mean   :2013-10-09   Mean   :5129   Mean   :276.76   Mean   :191.05  
##  3rd Qu.:2015-04-28   3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33  
##  Max.   :2017-06-17   Max.   :9925   Max.   :668.27   Max.   :524.96  
##                                                                       
##  total.revenue       total.cost       total.profit    
##  Min.   :   4870   Min.   :   3612   Min.   :   1258  
##  1st Qu.: 268721   1st Qu.: 168868   1st Qu.: 121444  
##  Median : 752314   Median : 363566   Median : 290768  
##  Mean   :1373488   Mean   : 931806   Mean   : 441682  
##  3rd Qu.:2212045   3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :5997055   Max.   :4509794   Max.   :1719922  
##

We can view the shape of the numeric data,

data.100 |>
  select(where(is.numeric) & -order.id) |>
  gather() |>
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() + 
  labs(x = "", y = "", title = "Distribution of Numerical Variables") +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~ key, scales = "free")

The only numeric variable that appears to be symmetrical is Units.Sold. Let’s take a look at the categorical variables.

data.100 |>
  select(item.type, order.priority, region) |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar(fill='lightblue') +
  facet_wrap(~ key, scales = "free") + 
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x='',
       y='',
       title='distribution of categorical variables')

## Warning: attributes are not identical across measure variables; they will be
## dropped

We know the Sales.Channel variable is an even 50/50 split from the summary statistics. What we can see here is that Order.Priority is relatively symmetrical, while the other categorical variables are not.

data.100 |>
  select(order.date, total.revenue, region) |>
  ggplot(aes(x = order.date, y = total.revenue, color=region)) +
  geom_line() +
  labs(x = 'Date',
       y = 'Total Revenue',
       title = 'Total Revenue by Region: 2010-2017') +
  theme_few()

Let’s take a look at the correlation amongst the numeric variables,

data.100 |>
  select(where(is.numeric), -order.id) |>
  cor() |>
  corrplot(type = "upper")

We see there is high correlation amongst all numerical variables except Unit.Cost and Unit.Price to Units.Sold.

Let’s take a look at the total profit by sales channel,

p1 <- data.100 |>
  select(sales.channel, region, total.profit) |>
  ggplot(aes(x=region, y=total.profit, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Profit by Channel") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

p2 <- data.100 |>  
  select(sales.channel, region, units.sold) |>
  ggplot(aes(x=region, y=units.sold, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Sales by Channel") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

ggarrange(p1, p2, ncol=2)

Let’s take a look at the profitability distribution by Order Priority:

data.100 |>
  group_by(order.priority) |>
  summarise(avg.profit = mean(total.profit)) |>
  ggplot(aes(x=order.priority, y=avg.profit)) +
  geom_bar(stat='identity', fill='lightblue') +
  geom_text(aes(label = paste0('$',round(avg.profit))), vjust = 2, size = 4) +
  labs(x='',
       y='',
       title='avg profit per order by priority') +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_blank(),  # Remove y-axis text
        axis.ticks.y = element_blank())

What is the average turnaround time for the priority levels? To calculate this, we create a new variable turn that represents the time between order.date and ship.date

data.100 <- data.100 |>
  mutate(turn = as.integer(ship.date - order.date))

data.100 |>
  group_by(order.priority) |>
  summarise(total.orders = n(),
            total.units = sum(units.sold),
            avg.turn = round(mean(turn)),
            avg.profit = mean(total.profit)) |>
  arrange(desc(avg.profit)) |>
  kable() |>
  kable_styling()

order.priority	total.orders	total.units	avg.turn	avg.profit
H	30	154212	21	563053.3
M	21	94832	25	460454.4
L	27	146876	24	402175.1
C	22	116951	24	306742.2

data.100 |>
  ggplot(aes(x=order.priority, y=turn)) +
  geom_boxplot(fill='lightblue') +
  labs(x='priority',
       y='days',
       title='turnaround time by order priority') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5))

We see that 30% of all orders are priority level “H”, these orders ship 3-4 days faster than the other levels, and the average profit of these orders is $10-15k higher than levels “M” and “L” and nearly twice as high as level “C”.

Is there a relationship between the number of items per order and the turnaround time?

data.100 |>
  ggplot(aes(x=units.sold, y=turn, color=order.priority)) +
  geom_point() +
  labs(x='units',
       y='days',
       title = 'turnaround time by units sold') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

It’s just white noise.

Does the profitability of the items affect turnaround time?

data.100 |>
  ggplot(aes(x=total.profit, y=turn)) +
  geom_point(color='lightblue', alpha=0.8) +
  facet_wrap(~ region, scales="free")  +
  labs(x='profit',
       y='days',
       title = 'turnaround time by order profit') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

There does not appear to be a relationship.

Lastly, let’s take a look at profitability by item type to see if we can observe any patterns. We’ll look at an overview of all orders and then drill down by region,

p1 <- data.100 |>
  ggplot(aes(x=total.profit, y=item.type)) +
  geom_boxplot(fill='lightblue') + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="profit by item type")


p2 <- data.100 |>
  ggplot(aes(x=total.profit, y=item.type, fill=region)) +
  geom_boxplot() + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="",
       fill="")

ggarrange(p1, p2, nrow=2)

We can see that Household and Cosmetics items have the highest overall median profits, but there is some variance by region.

data.100 |>
  group_by(region, item.type) |>
  summarise(total.profit = sum(total.profit)) |>
  top_n(2, wt = total.profit) |>
  arrange(region, desc(total.profit)) |>
  kable() |>
  kable_styling()

## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.

region	item.type	total.profit
Asia	Household	2002018.4
Asia	Office Supplies	1479397.5
Australia and Oceania	Cosmetics	1678541.0
Australia and Oceania	Baby Food	1236498.1
Central America and the Caribbean	Household	1487261.0
Central America and the Caribbean	Clothes	403773.1
Europe	Cosmetics	5233487.0
Europe	Baby Food	2117259.8
Middle East and North Africa	Cosmetics	4105940.0
Middle East and North Africa	Clothes	1028160.0
North America	Household	1152486.4
North America	Personal Care	305456.3
Sub-Saharan Africa	Office Supplies	2051688.8
Sub-Saharan Africa	Cosmetics	2032888.0

Dataset 2: 100,000 Sales Records

data.100k <- read.csv('/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2023/DATA622/HW1/100000 Sales Records.csv')
glimpse(data.100k)

## Rows: 100,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "Central America and th…
## $ Country        <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type      <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel  <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date     <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID       <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date      <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold     <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price     <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost      <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue  <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost     <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit   <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…

The dataset consists of:

100,000 rows of data across 14 columns.
7 continuous variables.
7 variables in string (character) format.
- 4 categorical variables: Region, Country, Item.Type, Sales.Channel.
- 1 ordinal variable: Order.Priority., with no order given.
- 2 date variables: Order.Date, Ship.Date.

We convert categorical, except for Country, which contains 185 unique values, and ordinal variables to factors and date variables to dates,

data.100k <- data.100k |>
  mutate(Region = as.factor(Region),
         Item.Type = as.factor(Item.Type),
         Sales.Channel = as.factor(Sales.Channel),
         Order.Priority = as.factor(Order.Priority),
         Order.Date = as.Date(Order.Date, format = "%m/%d/%Y"),
         Ship.Date = as.Date(Ship.Date, format = "%m/%d/%Y"))

colnames(data.100k) <- tolower(colnames(data.100k))

We check for NAs, none.

colSums(is.na(data.100k))

##         region        country      item.type  sales.channel order.priority 
##              0              0              0              0              0 
##     order.date       order.id      ship.date     units.sold     unit.price 
##              0              0              0              0              0 
##      unit.cost  total.revenue     total.cost   total.profit 
##              0              0              0              0

And we can view some summary statistics,

summary(data.100k)

##                                region        country         
##  Asia                             :14547   Length:100000     
##  Australia and Oceania            : 8113   Class :character  
##  Central America and the Caribbean:10731   Mode  :character  
##  Europe                           :25877                     
##  Middle East and North Africa     :12580                     
##  North America                    : 2133                     
##  Sub-Saharan Africa               :26019                     
##            item.type     sales.channel   order.priority   order.date        
##  Office Supplies: 8426   Offline:49946   C:24951        Min.   :2010-01-01  
##  Cereal         : 8421   Online :50054   H:24945        1st Qu.:2011-11-25  
##  Baby Food      : 8407                   L:25016        Median :2013-10-15  
##  Cosmetics      : 8370                   M:25088        Mean   :2013-10-15  
##  Personal Care  : 8364                                  3rd Qu.:2015-09-07  
##  Meat           : 8320                                  Max.   :2017-07-28  
##  (Other)        :49692                                                      
##     order.id           ship.date            units.sold      unit.price    
##  Min.   :100008904   Min.   :2010-01-02   Min.   :    1   Min.   :  9.33  
##  1st Qu.:326046383   1st Qu.:2011-12-21   1st Qu.: 2505   1st Qu.:109.28  
##  Median :547718512   Median :2013-11-09   Median : 5007   Median :205.70  
##  Mean   :550395554   Mean   :2013-11-09   Mean   : 5001   Mean   :266.70  
##  3rd Qu.:775078534   3rd Qu.:2015-10-02   3rd Qu.: 7495   3rd Qu.:437.20  
##  Max.   :999996459   Max.   :2017-09-16   Max.   :10000   Max.   :668.27  
##                                                                           
##    unit.cost      total.revenue       total.cost       total.profit      
##  Min.   :  6.92   Min.   :     19   Min.   :     14   Min.   :      4.8  
##  1st Qu.: 56.67   1st Qu.: 279753   1st Qu.: 162928   1st Qu.:  95900.0  
##  Median :117.11   Median : 789892   Median : 467937   Median : 283657.5  
##  Mean   :188.02   Mean   :1336067   Mean   : 941975   Mean   : 394091.2  
##  3rd Qu.:364.69   3rd Qu.:1836490   3rd Qu.:1209475   3rd Qu.: 568384.1  
##  Max.   :524.96   Max.   :6682700   Max.   :5249075   Max.   :1738700.0  
##

data.100k |>
  select(where(is.numeric) & -order.id) |>
  gather() |>
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() + 
  labs(x = "", y = "", title = "Distribution of Numerical Variables") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~ key, scales = "free")

data.100k |>
  select(item.type, order.priority, region) |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar(fill='lightblue') +
  facet_wrap(~ key, scales = "free") + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "",
       y = "",
       title = "Distribution of Categorical Variables")

## Warning: attributes are not identical across measure variables; they will be
## dropped

We can see that although the distribution amongst regions is highly variant, the Item.Types and Order.Prioritys are uniformly distributed amongst categories.

data.100k |>
  select(order.date, total.revenue, region) |>
  ggplot(aes(x = order.date, y = total.revenue, color=region)) +
  geom_line() +
  labs(x = 'Date',
       y = 'Total Revenue',
       title = 'Total Revenue by Region: 2010-2017')

We can see that the data is so granular and variant that we’d either need to expand our plot dramatically or group the data by month (losing some information along the way) in order to visualize it. Zooming in on a single month illustrates this:

data.100k |>
  select(order.date, total.revenue, region) |>
  filter(order.date <= min(order.date + 29)) |>
  ggplot(aes(x = order.date, y = total.revenue, color=region)) +
  geom_line() +
  labs(x = 'Date',
       y = 'Total Revenue',
       title = 'Total Revenue by Region: January 2010')

We can see the high variance among the daily revenue.

Let’s take a look at the correlation between the numeric variables,

data.100k |>
  select(where(is.numeric), -order.id) |>
  cor() |>
  corrplot(type = "upper")

Again, the only variables that do not exhibit strong correlation are Unit.Price and Unit.Cost to Units.Sold.

d1 <- data.100k |>
  select(sales.channel, total.profit, region) |>
  ggplot(aes(x=region, y=total.profit, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Profit") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

d2 <- data.100k |>  
  select(sales.channel, region, units.sold) |>
  ggplot(aes(x=region, y=units.sold, fill=sales.channel)) +
  geom_bar(stat = "identity") +
  labs(x = "",
       y = "",
       fill = "",
       title = "Sales") +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette="YlGnBu")

ggarrange(d1, d2, ncol=2)

Let’s take a look at the profitability distribution by Order Priority:

data.100k |>
  group_by(order.priority) |>
  summarise(avg.profit = mean(total.profit)) |>
  ggplot(aes(x=order.priority, y=avg.profit)) +
  geom_bar(stat='identity', fill='lightblue') +
  geom_text(aes(label = paste0('$',round(avg.profit))), vjust = 2, size = 4) +
  labs(x='',
       y='',
       title='avg profit per order by priority') +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_blank(),  # Remove y-axis text
        axis.ticks.y = element_blank())

For this dataset, the profit by priority is comparable amongst all levels. Let’s take a look at the turnaround time by priority level.

data.100k <- data.100k |>
  mutate(turn = as.integer(ship.date - order.date))

data.100k |>
  group_by(order.priority) |>
  summarise(total.orders = n(),
            avg.order = round(mean(units.sold)),
            avg.turn = round(mean(turn)),
            avg.profit = mean(total.profit)) |>
  arrange(desc(avg.profit)) |>
  kable() |>
  kable_styling()

order.priority	total.orders	avg.order	avg.turn	avg.profit
H	24945	5040	25	394755.6
C	24951	4997	25	394146.9
M	25088	4991	25	393904.6
L	25016	4978	25	393560.4

data.100k |>
  ggplot(aes(x=order.priority, y=turn)) +
  geom_boxplot(fill='lightblue') +
  labs(x='priority',
       y='days',
       title='turnaround time by order priority') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5))

For this dataset, the average turnaround time and average profit is equal amongst all the different priority levels, although the profit level is marginally higher for the “H” and “C” priority levels.

Lastly, let’s take a look at profitability by item type to see if we can observe any patterns. We’ll look at an overview of all orders and then drill down by region,

p1 <- data.100k |>
  ggplot(aes(x=total.profit, y=item.type)) +
  geom_boxplot(fill='lightblue') + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="profit by item type")


p2 <- data.100k |>
  ggplot(aes(x=total.profit, y=item.type, fill=region)) +
  geom_boxplot() + 
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  labs(x="",
       y="",
       title="",
       fill="")

ggarrange(p1, p2, nrow=2)

We can see for this dataset that the regions have no bearing on the distribution of profit by item type.

data.100k |> 
  group_by(sales.channel, order.priority) |> 
  summarise(total.orders = n(), 
            total.sold = sum(units.sold),
            avg.turn = mean(turn))

## `summarise()` has grouped output by 'sales.channel'. You can override using the
## `.groups` argument.

## # A tibble: 8 × 5
## # Groups:   sales.channel [2]
##   sales.channel order.priority total.orders total.sold avg.turn
##   <fct>         <fct>                 <int>      <int>    <dbl>
## 1 Offline       C                     12480   62070435     25.1
## 2 Offline       H                     12471   62888532     24.8
## 3 Offline       L                     12419   61741004     24.9
## 4 Offline       M                     12576   63171201     25.1
## 5 Online        C                     12471   62606910     25.0
## 6 Online        H                     12474   62833684     25.1
## 7 Online        L                     12597   62790722     25.0
## 8 Online        M                     12512   62042129     25.2

Is there a relationship between the number of items per order and the turnaround time?

data.100k |>
  ggplot(aes(x=units.sold, y=turn, color=order.priority)) +
  geom_point() +
  labs(x='units',
       y='days',
       title = 'turnaround time by units sold') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

It’s just white noise, there is too much data to visualize. But there do not appear to be any patterns.

Does the profitability of the items affect turnaround time?

data.100k |>
  ggplot(aes(x=total.profit, y=turn)) +
  geom_point(color='lightblue', alpha=0.8) +
  facet_wrap(~ region, scales="free")  +
  labs(x='profit',
       y='days',
       title = 'turnaround time by order profit') +
  theme_few() +
  theme(plot.title = element_text(hjust=0.5)) +
  facet_wrap(~ region, scales="free")

Again, it’s too much data to visualize, but we don’t see any patterns whatsoever.

data.100k |>
  group_by(region, item.type, sales.channel) |>
  summarise(total.sold = sum(units.sold)) |>
  ggplot(aes(x=item.type, y=total.sold, fill=sales.channel)) +
  geom_bar(stat='identity', position=position_dodge()) +
  facet_wrap(~region, scales="free") +
  labs(x='',
       y='sold',
       title='total sales by region',
       fill='') +
  theme_few() +
  theme(axis.text.x = element_text(angle = 45, hjust=1),
        plot.title = element_text(hjust=0.5)) +
  scale_fill_brewer(palette="YlGnBu")

## `summarise()` has grouped output by 'region', 'item.type'. You can override
## using the `.groups` argument.

We can see that certain products have sell better online or offline in different regions.

Comparing the Data

Both datasets contain observations about inventory sold by date around the world (mostly outside North America) consisting of a mix of categorical and continuous variables. Both datasets exhibit high variance and non-normal distribution amongst the variables. We can observe the following dependencies across both datasets:

total.cost is a function units.sold * unit.cost total.profit is a function total.revenue - total.cost.
total.revenue is a function unit.cost x units.sold.

We can observe that both datasets contain roughly equal distribution of online and offline sales, but the total number of items sold and the total profit from those sales is much higher for offline sales than online. There are no missing values in either dataset.

Model Selection

The guiding question at this point in the analysis is, what do we want to predict? Considering the data, there are a few things we can look at:

Predicting Sales Channel.
Predicting Order Priority.
Predicting Total Profit.

Considering these three options with the data that is available, I wonder if any of these three serve realistic business goals, based upon the available data. For example, if we are given all of the data except sales channel, ie, we know how many of each item type are sold, what value is it to know after the fact whether an order is sold online or offline, or what the order priority is? Along the same lines, how beneficial is it to predict profit given sales totals? That is simple mathematics. With these thoughts in mind, the question to me becomes, how can we predict the total number of items sold by item type if given all the additional information in the dataset? Predicting the numbers of items sold would provide valuable business use regarding pricing and inventory allocation, ie how many items to make available online vs. offline, or perhaps offering a lower price for online sales.

Considering that both datasets contain a mix of categorical and continuous variables, have non-normal distributions, and have unequal variance, I will attempt to model two decision trees on these data to try to predict Units.Sold as the outcome variable.

Feature Selection.

Considering that both datasets contain a mix of categorical and continuous variables, have non-normal distributions, and have unequal variance, and that most of the continous variables are dependent on one another, I am going to build decision trees to predict the total sold per item given the following variables:

Month.
Year.
Region.
Country.
Item.Type.
Unit.Price.
Sales.Channel.

First, we need to extract the Month variable from the Order.Date in each dataset,

d100 <- data.100 |>
  mutate(month = factor(month(order.date, label=TRUE)),
         year = year(order.date),
         price = unit.price,
         item = item.type,
         channel = sales.channel,
         sold = units.sold) |>
  select(sold, item, price, month, year, region, channel) 

d100k <- data.100k |>
  mutate(month = factor(month(order.date, label=TRUE)),
         year = year(order.date),
         price = unit.price,
         item = item.type,
         channel = sales.channel,
         sold = units.sold) |>
  select(sold, item, price, month, year, region, channel)

Previewing our datasets looks like this:

Dataset 1:

kable(head(d100))

sold	item	price	month	year	region	channel
9925	Baby Food	255.28	May	2010	Australia and Oceania	Offline
2804	Cereal	205.70	Aug	2012	Central America and the Caribbean	Online
1779	Office Supplies	651.21	May	2014	Europe	Offline
8102	Fruits	9.33	Jun	2014	Sub-Saharan Africa	Online
5062	Office Supplies	651.21	Feb	2013	Sub-Saharan Africa	Offline
2974	Baby Food	255.28	Feb	2015	Australia and Oceania	Online

Dataset 2:

kable(head(d100k))

sold	item	price	month	year	region	channel
934	Snacks	152.58	Oct	2014	Middle East and North Africa	Online
4551	Cosmetics	437.20	Feb	2015	Central America and the Caribbean	Offline
9986	Fruits	9.33	Dec	2015	Sub-Saharan Africa	Offline
9118	Personal Care	81.73	Sep	2014	Sub-Saharan Africa	Online
5858	Household	668.27	Feb	2010	Central America and the Caribbean	Offline
1149	Clothes	109.28	Feb	2013	Europe	Online

Data Modeling

We are going to use Random Forest and Bagged Trees to generate predictions about total number of items sold on each dataset. First we split each set training and testing sets using an 80/20 split.

set.seed(1)

# set split indices
split.d100 <- createDataPartition(d100$sold, p = .8, list = FALSE)
split.d100k<- createDataPartition(d100k$sold, p = .8, list = FALSE)

# split data and partition predictor (x) and response (y) sets

# dataset1
train.d100 <- d100[split.d100, ]
test.d100 <- d100[-split.d100, ]

# dataset2
train.d100k <- d100k[split.d100k, ]
test.d100k <- d100k[-split.d100k, ]

We set up our cross-validation parameters. Since the dataset 1 training data is only 80 observations, we choose Leave-One-Out cross-validation. For the larger dataset, we will use 10-fold cross-validation.

loocv <- trainControl(
  method = "LOOCV",
  allowParallel = TRUE
)

cv <- trainControl(
  method = "cv",
  number = 5,
  allowParallel = TRUE
)

We define a function to return the prediction metrics, including Mean Absolute Percentage Error (MAPE):

metrics <- function(predicted, actual){
  mape = accuracy(predicted, actual)['Test set','MAPE']
  measures = postResample(predicted, actual) 
  metrics = c(measures, MAPE=mape)
  return(metrics)
}

Selecting a Model

First we prepare additional cores for parallel processing,

num_cores <- 4
cl <- makeCluster(num_cores)

Random Forest

Dataset 1

registerDoParallel(cl)
set.seed(1)

rf100 <- randomForest(sold ~ ., 
                      data = train.d100,
                      importance = TRUE,
                      trControl = loocv,
                      ntree = 1000,
                      parallel = "multicore")

rf100.preds <- predict(rf100, test.d100)
rf100.results <- metrics(rf100.preds, test.d100$sold)

rf100

## 
## Call:
##  randomForest(formula = sold ~ ., data = train.d100, importance = TRUE,      trControl = loocv, ntree = 1000, parallel = "multicore") 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 7597693
##                     % Var explained: -0.47

Dataset 2

set.seed(1)

rf100k <- randomForest(sold ~ ., 
                       data = train.d100k,
                       importance = TRUE,
                       ntree = 10,
                       mtry = sqrt(length(colnames(train.d100k))-1),
                       trControl = cv,
                       parallel = "multicore")

rf100k.preds <- predict(rf100k, test.d100k)
rf100k.results <- metrics(rf100k.preds, test.d100k$sold)

rf100k

## 
## Call:
##  randomForest(formula = sold ~ ., data = train.d100k, importance = TRUE,      ntree = 10, mtry = sqrt(length(colnames(train.d100k)) - 1),      trControl = cv, parallel = "multicore") 
##                Type of random forest: regression
##                      Number of trees: 10
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 8646108
##                     % Var explained: -4.02

Bagged Trees

Dataset 1

set.seed(1)

bagged100 <- train(sold ~ .,
                   data = train.d100,
                   method = "treebag",
                   trControl = loocv,
                   tuneLength = 10)

bagged100.preds <- predict(bagged100, test.d100)
bagged100.results <- metrics(bagged100.preds, test.d100$sold)

bagged100

## Bagged CART 
## 
## 80 samples
##  6 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 79, 79, 79, 79, 79, 79, ... 
## Resampling results:
## 
##   RMSE      Rsquared     MAE     
##   2953.101  0.001819921  2401.879

Dataset 2

set.seed(1)

bagged100k <- train(sold ~ .,
                    data = train.d100k,
                    method = "treebag",
                    trControl = cv,
                    tuneLength = 10)

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

bagged100k.preds <- predict(bagged100k, test.d100k)
bagged100k.results <- metrics(bagged100k.preds, test.d100k$sold)

bagged100k

## Bagged CART 
## 
## 80001 samples
##     6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 64000, 64001, 64000, 64002, 64001 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   2883.047  NaN       2496.025

Results

kable(rbind("Dataset 1: Random Forest" = rf100.results,
      "Dataset 1: Bagged Trees" = bagged100.results,
      "Dataset 2: Random Forest" = rf100k.results,
      "Dataset 2: Bagged Trees" = bagged100k.results)) |> kable_styling()

	RMSE	Rsquared	MAE	MAPE
Dataset 1: Random Forest	2958.876	0.0180008	2558.838	272.8429
Dataset 1: Bagged Trees	3049.350	0.0001126	2571.120	257.6522
Dataset 2: Random Forest	2920.719	0.0000063	2520.789	492.8483
Dataset 2: Bagged Trees	2890.612	NA	2504.423	501.7582

Neither model for either dataset is able to predict the number sold based on the chosen variables. The Mean Absolute Percentage Error (MAPE) values indicate the model’s predictive accuracy is off by 257% to 501% – those are unacceptably poor results.

Discussion / Essay

In this study we looked at two datasets from Excel Bianalytics, containing worldwide sales data for the years 2010-2017. We looked at two different sized datasets to identify a business problem and compare the performance of two machine learning algorithms on the datasets. The first dataset is small, consisting of 100 observations. The second dataset is large, containing 100,000 observations. The observations are a mix of 14 categorical, continuous, and date variables, as well as an order ID that does not serve a use for modeling. The 13 remaining variables are region, country, item type, sales channel, order priority, order date, ship date, units sold, unit price, unit cost, total revenue, total cost, and total profit. We determined that dependencies exist between total profit (total revenue - total cost), total cost (units sold x unit cost), and total revenue (units sold x unit price). As such, these three variables exhibited high correlation with their dependence variables and decided to focus on predicting units sold as our outcome variable, since cost, revenue, and profit can all be derived from the units sold. This seemed the most sensible business use to focus our attention on. This left us with the following 9 variables: region, country, item type, sales channel, order priority, order date, ship date, units sold. We calculated the processing time for each order as the difference between the order date and ship date. We also determined that the number of distinct countries was too large to be of use for our modeling: dataset 1 containing 100 observations: 76 different countries; dataset 2, 100k observations: 185 different countries.

After feature engineering and selection we had 7 predictor variables. We conducted exploratory data analysis and observed that the categorical predictors in the smaller dataset were unevenly distributed, while in the larger dataset, they were uniform. We observed no noticeable trend over time in either dataset. We further observed no linearity between the predictors and the outcome variable, and we determined that the processing time variable that we created did not appear to have any relationship with either the predictors or outcome variable.

As the datasets contained a combination of categorical and continuous variables we decided to employ decision tree models to predict the number of items sold. Decision trees are able to handle mixed data types and don’t require strong assumptions about variable distributions. They are also able to be employed for datasets of different sizes. However, they can be prone to overfitting and bias. To reduce this, we used the ensemble methods Bagged Trees and Random Forests. The bagged trees (“B”ootstrap “Agg”regation) model reduces the variance of a model by averaging the predictions of multiple decision trees, taking bootstrap samples (sampling with replacement) of the original dataset, training a separate decision tree on each sample, and averaging the predictions of each tree (aggregating). For the smaller dataset, we used “Leave One Out” (LOO) cross-validation to further reduce bias of the model. For the larger dataset, we used 10-fold cross-validation. We found this model did not predict well on either of the datasets, and was computationally expensive on the larger dataset.

Next we tried the Random Forest model, which extends the bagged trees model. In addition to using bootstrapped samples, random subsets of predictors are selected at each tree node, reducing correlation between the individual trees and improving model generalization. This model also did not perform well at predicting units sold, and was required a lot of processing time to train. For the larger dataset, we had to limit the number of trees to 10 in order to train a model that took less than 10 minutes to compute, which likely had an effect on the variance and predictive accuracy of the model, but not enough to make a difference considering how far off our predictions were. Again, LOO and 10-fold cross-validation were used for the smaller and larger datasets, respectively.

In retrospect I think the problem is with these datasets and the business question selected as far as why these models do not offer predictive accuracy – the datasets contain largely uniform information that requires transformation and feature engineering to generate predictors that might better fit a model. A few thoughts for future exploration would be to calculate the per-unit profit of an item, and aggregating sales by month by region, so there is less variance within the variables.