DATA 622 HW1
Introduction
This assignment explores and compares the structure and contents of two datasets from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/, considers their similarities and differences, and explores how to analyze and predict an outcome based on the data.
We will use the following packages for this study,
The Data
Overview
Dataset 1: 100 Sales Records
data.100 <- read.csv('/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2023/DATA622/HW1/100 Sales Records.csv')
glimpse(data.100)
## Rows: 100
## Columns: 14
## $ Region <chr> "Australia and Oceania", "Central America and the Carib…
## $ Country <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe",…
## $ Item.Type <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "Of…
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H", …
## $ Order.Date <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2/1…
## $ Order.ID <int> 669165933, 963881480, 341417157, 514321792, 115456712, …
## $ Ship.Date <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/6/…
## $ Units.Sold <int> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070, 6…
## $ Unit.Price <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27, 1…
## $ Unit.Cost <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54, 9…
## $ Total.Revenue <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.02…
## $ Total.Cost <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.52,…
## $ Total.Profit <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50, 2…
The dataset consists of:
- 100 rows of data across 14 columns.
- 7 continuous variables.
- 7 variables in string (character) format.
- 4 categorical variables:
Region
,Country
,Item.Type
,Sales.Channel.
- 1 ordinal variable:
Order.Priority.
, with no order given.
- 2 date variables:
Order.Date
,Ship.Date.
- 4 categorical variables:
We convert categorical, except for Country
, which
contains 76 unique values, and ordinal variables to factors and date
variables to dates,
data.100 <- data.100 |>
mutate(Region = as.factor(Region),
Item.Type = as.factor(Item.Type),
Sales.Channel = as.factor(Sales.Channel),
Order.Priority = as.factor(Order.Priority),
Order.Date = as.Date(Order.Date, format = "%m/%d/%Y"),
Ship.Date = as.Date(Ship.Date, format = "%m/%d/%Y"))
colnames(data.100) <- tolower(colnames(data.100))
We check for any missing values; there are none,
## region country item.type sales.channel order.priority
## 0 0 0 0 0
## order.date order.id ship.date units.sold unit.price
## 0 0 0 0 0
## unit.cost total.revenue total.cost total.profit
## 0 0 0 0
Now we can view summary statistics for each variable,
## region country item.type
## Asia :11 Length:100 Clothes :13
## Australia and Oceania :11 Class :character Cosmetics :13
## Central America and the Caribbean: 7 Mode :character Office Supplies:12
## Europe :22 Fruits :10
## Middle East and North Africa :10 Personal Care :10
## North America : 3 Household : 9
## Sub-Saharan Africa :36 (Other) :33
## sales.channel order.priority order.date order.id
## Offline:50 C:22 Min. :2010-02-02 Min. :114606559
## Online :50 H:30 1st Qu.:2012-02-14 1st Qu.:338922488
## L:27 Median :2013-07-12 Median :557708561
## M:21 Mean :2013-09-16 Mean :555020412
## 3rd Qu.:2015-04-07 3rd Qu.:790755081
## Max. :2017-05-22 Max. :994022214
##
## ship.date units.sold unit.price unit.cost
## Min. :2010-02-25 Min. : 124 Min. : 9.33 Min. : 6.92
## 1st Qu.:2012-02-24 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84
## Median :2013-08-11 Median :5382 Median :179.88 Median :107.28
## Mean :2013-10-09 Mean :5129 Mean :276.76 Mean :191.05
## 3rd Qu.:2015-04-28 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33
## Max. :2017-06-17 Max. :9925 Max. :668.27 Max. :524.96
##
## total.revenue total.cost total.profit
## Min. : 4870 Min. : 3612 Min. : 1258
## 1st Qu.: 268721 1st Qu.: 168868 1st Qu.: 121444
## Median : 752314 Median : 363566 Median : 290768
## Mean :1373488 Mean : 931806 Mean : 441682
## 3rd Qu.:2212045 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :5997055 Max. :4509794 Max. :1719922
##
We can view the shape of the numeric data,
data.100 |>
select(where(is.numeric) & -order.id) |>
gather() |>
ggplot(aes(x = key, y = value)) +
geom_boxplot() +
labs(x = "", y = "", title = "Distribution of Numerical Variables") +
theme_few() +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~ key, scales = "free")
The only numeric variable that appears to be symmetrical is
Units.Sold
. Let’s take a look at the categorical
variables.
data.100 |>
select(item.type, order.priority, region) |>
gather() |>
ggplot(aes(x = value)) +
geom_bar(fill='lightblue') +
facet_wrap(~ key, scales = "free") +
theme_few() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x='',
y='',
title='distribution of categorical variables')
## Warning: attributes are not identical across measure variables; they will be
## dropped
We know the Sales.Channel
variable is an even 50/50
split from the summary statistics. What we can see here is that
Order.Priority
is relatively symmetrical, while the other
categorical variables are not.
data.100 |>
select(order.date, total.revenue, region) |>
ggplot(aes(x = order.date, y = total.revenue, color=region)) +
geom_line() +
labs(x = 'Date',
y = 'Total Revenue',
title = 'Total Revenue by Region: 2010-2017') +
theme_few()
Let’s take a look at the correlation amongst the numeric variables,
We see there is high correlation amongst all numerical variables
except Unit.Cost
and Unit.Price
to
Units.Sold
.
Let’s take a look at the total profit by sales channel,
p1 <- data.100 |>
select(sales.channel, region, total.profit) |>
ggplot(aes(x=region, y=total.profit, fill=sales.channel)) +
geom_bar(stat = "identity") +
labs(x = "",
y = "",
fill = "",
title = "Profit by Channel") +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette="YlGnBu")
p2 <- data.100 |>
select(sales.channel, region, units.sold) |>
ggplot(aes(x=region, y=units.sold, fill=sales.channel)) +
geom_bar(stat = "identity") +
labs(x = "",
y = "",
fill = "",
title = "Sales by Channel") +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette="YlGnBu")
ggarrange(p1, p2, ncol=2)
Let’s take a look at the profitability distribution by Order Priority:
data.100 |>
group_by(order.priority) |>
summarise(avg.profit = mean(total.profit)) |>
ggplot(aes(x=order.priority, y=avg.profit)) +
geom_bar(stat='identity', fill='lightblue') +
geom_text(aes(label = paste0('$',round(avg.profit))), vjust = 2, size = 4) +
labs(x='',
y='',
title='avg profit per order by priority') +
theme_few() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y = element_blank(), # Remove y-axis text
axis.ticks.y = element_blank())
What is the average turnaround time for the priority levels? To
calculate this, we create a new variable turn
that
represents the time between order.date
and
ship.date
data.100 |>
group_by(order.priority) |>
summarise(total.orders = n(),
total.units = sum(units.sold),
avg.turn = round(mean(turn)),
avg.profit = mean(total.profit)) |>
arrange(desc(avg.profit)) |>
kable() |>
kable_styling()
order.priority | total.orders | total.units | avg.turn | avg.profit |
---|---|---|---|---|
H | 30 | 154212 | 21 | 563053.3 |
M | 21 | 94832 | 25 | 460454.4 |
L | 27 | 146876 | 24 | 402175.1 |
C | 22 | 116951 | 24 | 306742.2 |
data.100 |>
ggplot(aes(x=order.priority, y=turn)) +
geom_boxplot(fill='lightblue') +
labs(x='priority',
y='days',
title='turnaround time by order priority') +
theme_few() +
theme(plot.title = element_text(hjust=0.5))
We see that 30% of all orders are priority level “H”, these orders ship 3-4 days faster than the other levels, and the average profit of these orders is $10-15k higher than levels “M” and “L” and nearly twice as high as level “C”.
Is there a relationship between the number of items per order and the turnaround time?
data.100 |>
ggplot(aes(x=units.sold, y=turn, color=order.priority)) +
geom_point() +
labs(x='units',
y='days',
title = 'turnaround time by units sold') +
theme_few() +
theme(plot.title = element_text(hjust=0.5)) +
facet_wrap(~ region, scales="free")
It’s just white noise.
Does the profitability of the items affect turnaround time?
data.100 |>
ggplot(aes(x=total.profit, y=turn)) +
geom_point(color='lightblue', alpha=0.8) +
facet_wrap(~ region, scales="free") +
labs(x='profit',
y='days',
title = 'turnaround time by order profit') +
theme_few() +
theme(plot.title = element_text(hjust=0.5)) +
facet_wrap(~ region, scales="free")
There does not appear to be a relationship.
Lastly, let’s take a look at profitability by item type to see if we can observe any patterns. We’ll look at an overview of all orders and then drill down by region,
p1 <- data.100 |>
ggplot(aes(x=total.profit, y=item.type)) +
geom_boxplot(fill='lightblue') +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
labs(x="",
y="",
title="profit by item type")
p2 <- data.100 |>
ggplot(aes(x=total.profit, y=item.type, fill=region)) +
geom_boxplot() +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
labs(x="",
y="",
title="",
fill="")
ggarrange(p1, p2, nrow=2)
We can see that Household and Cosmetics items have the highest overall
median profits, but there is some variance by region.
data.100 |>
group_by(region, item.type) |>
summarise(total.profit = sum(total.profit)) |>
top_n(2, wt = total.profit) |>
arrange(region, desc(total.profit)) |>
kable() |>
kable_styling()
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
region | item.type | total.profit |
---|---|---|
Asia | Household | 2002018.4 |
Asia | Office Supplies | 1479397.5 |
Australia and Oceania | Cosmetics | 1678541.0 |
Australia and Oceania | Baby Food | 1236498.1 |
Central America and the Caribbean | Household | 1487261.0 |
Central America and the Caribbean | Clothes | 403773.1 |
Europe | Cosmetics | 5233487.0 |
Europe | Baby Food | 2117259.8 |
Middle East and North Africa | Cosmetics | 4105940.0 |
Middle East and North Africa | Clothes | 1028160.0 |
North America | Household | 1152486.4 |
North America | Personal Care | 305456.3 |
Sub-Saharan Africa | Office Supplies | 2051688.8 |
Sub-Saharan Africa | Cosmetics | 2032888.0 |
Dataset 2: 100,000 Sales Records
data.100k <- read.csv('/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2023/DATA622/HW1/100000 Sales Records.csv')
glimpse(data.100k)
## Rows: 100,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "Central America and th…
## $ Country <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…
The dataset consists of:
- 100,000 rows of data across 14 columns.
- 7 continuous variables.
- 7 variables in string (character) format.
- 4 categorical variables:
Region
,Country
,Item.Type
,Sales.Channel.
- 1 ordinal variable:
Order.Priority.
, with no order given.
- 2 date variables:
Order.Date
,Ship.Date.
- 4 categorical variables:
We convert categorical, except for Country
, which
contains 185 unique values, and ordinal variables to factors and date
variables to dates,
data.100k <- data.100k |>
mutate(Region = as.factor(Region),
Item.Type = as.factor(Item.Type),
Sales.Channel = as.factor(Sales.Channel),
Order.Priority = as.factor(Order.Priority),
Order.Date = as.Date(Order.Date, format = "%m/%d/%Y"),
Ship.Date = as.Date(Ship.Date, format = "%m/%d/%Y"))
colnames(data.100k) <- tolower(colnames(data.100k))
We check for NAs, none.
## region country item.type sales.channel order.priority
## 0 0 0 0 0
## order.date order.id ship.date units.sold unit.price
## 0 0 0 0 0
## unit.cost total.revenue total.cost total.profit
## 0 0 0 0
And we can view some summary statistics,
## region country
## Asia :14547 Length:100000
## Australia and Oceania : 8113 Class :character
## Central America and the Caribbean:10731 Mode :character
## Europe :25877
## Middle East and North Africa :12580
## North America : 2133
## Sub-Saharan Africa :26019
## item.type sales.channel order.priority order.date
## Office Supplies: 8426 Offline:49946 C:24951 Min. :2010-01-01
## Cereal : 8421 Online :50054 H:24945 1st Qu.:2011-11-25
## Baby Food : 8407 L:25016 Median :2013-10-15
## Cosmetics : 8370 M:25088 Mean :2013-10-15
## Personal Care : 8364 3rd Qu.:2015-09-07
## Meat : 8320 Max. :2017-07-28
## (Other) :49692
## order.id ship.date units.sold unit.price
## Min. :100008904 Min. :2010-01-02 Min. : 1 Min. : 9.33
## 1st Qu.:326046383 1st Qu.:2011-12-21 1st Qu.: 2505 1st Qu.:109.28
## Median :547718512 Median :2013-11-09 Median : 5007 Median :205.70
## Mean :550395554 Mean :2013-11-09 Mean : 5001 Mean :266.70
## 3rd Qu.:775078534 3rd Qu.:2015-10-02 3rd Qu.: 7495 3rd Qu.:437.20
## Max. :999996459 Max. :2017-09-16 Max. :10000 Max. :668.27
##
## unit.cost total.revenue total.cost total.profit
## Min. : 6.92 Min. : 19 Min. : 14 Min. : 4.8
## 1st Qu.: 56.67 1st Qu.: 279753 1st Qu.: 162928 1st Qu.: 95900.0
## Median :117.11 Median : 789892 Median : 467937 Median : 283657.5
## Mean :188.02 Mean :1336067 Mean : 941975 Mean : 394091.2
## 3rd Qu.:364.69 3rd Qu.:1836490 3rd Qu.:1209475 3rd Qu.: 568384.1
## Max. :524.96 Max. :6682700 Max. :5249075 Max. :1738700.0
##
data.100k |>
select(where(is.numeric) & -order.id) |>
gather() |>
ggplot(aes(x = key, y = value)) +
geom_boxplot() +
labs(x = "", y = "", title = "Distribution of Numerical Variables") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~ key, scales = "free")
data.100k |>
select(item.type, order.priority, region) |>
gather() |>
ggplot(aes(x = value)) +
geom_bar(fill='lightblue') +
facet_wrap(~ key, scales = "free") +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
labs(x = "",
y = "",
title = "Distribution of Categorical Variables")
## Warning: attributes are not identical across measure variables; they will be
## dropped
We can see that although the distribution amongst regions is highly
variant, the Item.Type
s and Order.Priority
s
are uniformly distributed amongst categories.
data.100k |>
select(order.date, total.revenue, region) |>
ggplot(aes(x = order.date, y = total.revenue, color=region)) +
geom_line() +
labs(x = 'Date',
y = 'Total Revenue',
title = 'Total Revenue by Region: 2010-2017')
We can see that the data is so granular and variant that we’d either need to expand our plot dramatically or group the data by month (losing some information along the way) in order to visualize it. Zooming in on a single month illustrates this:
data.100k |>
select(order.date, total.revenue, region) |>
filter(order.date <= min(order.date + 29)) |>
ggplot(aes(x = order.date, y = total.revenue, color=region)) +
geom_line() +
labs(x = 'Date',
y = 'Total Revenue',
title = 'Total Revenue by Region: January 2010')
We can see the high variance among the daily revenue.
Let’s take a look at the correlation between the numeric variables,
Again, the only variables that do not exhibit strong correlation are
Unit.Price
and Unit.Cost
to
Units.Sold
.
d1 <- data.100k |>
select(sales.channel, total.profit, region) |>
ggplot(aes(x=region, y=total.profit, fill=sales.channel)) +
geom_bar(stat = "identity") +
labs(x = "",
y = "",
fill = "",
title = "Profit") +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette="YlGnBu")
d2 <- data.100k |>
select(sales.channel, region, units.sold) |>
ggplot(aes(x=region, y=units.sold, fill=sales.channel)) +
geom_bar(stat = "identity") +
labs(x = "",
y = "",
fill = "",
title = "Sales") +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette="YlGnBu")
ggarrange(d1, d2, ncol=2)
Let’s take a look at the profitability distribution by Order Priority:
data.100k |>
group_by(order.priority) |>
summarise(avg.profit = mean(total.profit)) |>
ggplot(aes(x=order.priority, y=avg.profit)) +
geom_bar(stat='identity', fill='lightblue') +
geom_text(aes(label = paste0('$',round(avg.profit))), vjust = 2, size = 4) +
labs(x='',
y='',
title='avg profit per order by priority') +
theme_few() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y = element_blank(), # Remove y-axis text
axis.ticks.y = element_blank())
For this dataset, the profit by priority is comparable amongst all
levels. Let’s take a look at the turnaround time by priority level.
data.100k |>
group_by(order.priority) |>
summarise(total.orders = n(),
avg.order = round(mean(units.sold)),
avg.turn = round(mean(turn)),
avg.profit = mean(total.profit)) |>
arrange(desc(avg.profit)) |>
kable() |>
kable_styling()
order.priority | total.orders | avg.order | avg.turn | avg.profit |
---|---|---|---|---|
H | 24945 | 5040 | 25 | 394755.6 |
C | 24951 | 4997 | 25 | 394146.9 |
M | 25088 | 4991 | 25 | 393904.6 |
L | 25016 | 4978 | 25 | 393560.4 |
data.100k |>
ggplot(aes(x=order.priority, y=turn)) +
geom_boxplot(fill='lightblue') +
labs(x='priority',
y='days',
title='turnaround time by order priority') +
theme_few() +
theme(plot.title = element_text(hjust=0.5))
For this dataset, the average turnaround time and average profit is equal amongst all the different priority levels, although the profit level is marginally higher for the “H” and “C” priority levels.
Lastly, let’s take a look at profitability by item type to see if we can observe any patterns. We’ll look at an overview of all orders and then drill down by region,
p1 <- data.100k |>
ggplot(aes(x=total.profit, y=item.type)) +
geom_boxplot(fill='lightblue') +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
labs(x="",
y="",
title="profit by item type")
p2 <- data.100k |>
ggplot(aes(x=total.profit, y=item.type, fill=region)) +
geom_boxplot() +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
labs(x="",
y="",
title="",
fill="")
ggarrange(p1, p2, nrow=2)
We can see for this dataset that the regions have no bearing on the distribution of profit by item type.
data.100k |>
group_by(sales.channel, order.priority) |>
summarise(total.orders = n(),
total.sold = sum(units.sold),
avg.turn = mean(turn))
## `summarise()` has grouped output by 'sales.channel'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 5
## # Groups: sales.channel [2]
## sales.channel order.priority total.orders total.sold avg.turn
## <fct> <fct> <int> <int> <dbl>
## 1 Offline C 12480 62070435 25.1
## 2 Offline H 12471 62888532 24.8
## 3 Offline L 12419 61741004 24.9
## 4 Offline M 12576 63171201 25.1
## 5 Online C 12471 62606910 25.0
## 6 Online H 12474 62833684 25.1
## 7 Online L 12597 62790722 25.0
## 8 Online M 12512 62042129 25.2
Is there a relationship between the number of items per order and the turnaround time?
data.100k |>
ggplot(aes(x=units.sold, y=turn, color=order.priority)) +
geom_point() +
labs(x='units',
y='days',
title = 'turnaround time by units sold') +
theme_few() +
theme(plot.title = element_text(hjust=0.5)) +
facet_wrap(~ region, scales="free")
It’s just white noise, there is too much data to visualize. But there do not appear to be any patterns.
Does the profitability of the items affect turnaround time?
data.100k |>
ggplot(aes(x=total.profit, y=turn)) +
geom_point(color='lightblue', alpha=0.8) +
facet_wrap(~ region, scales="free") +
labs(x='profit',
y='days',
title = 'turnaround time by order profit') +
theme_few() +
theme(plot.title = element_text(hjust=0.5)) +
facet_wrap(~ region, scales="free")
Again, it’s too much data to visualize, but we don’t see any patterns whatsoever.
data.100k |>
group_by(region, item.type, sales.channel) |>
summarise(total.sold = sum(units.sold)) |>
ggplot(aes(x=item.type, y=total.sold, fill=sales.channel)) +
geom_bar(stat='identity', position=position_dodge()) +
facet_wrap(~region, scales="free") +
labs(x='',
y='sold',
title='total sales by region',
fill='') +
theme_few() +
theme(axis.text.x = element_text(angle = 45, hjust=1),
plot.title = element_text(hjust=0.5)) +
scale_fill_brewer(palette="YlGnBu")
## `summarise()` has grouped output by 'region', 'item.type'. You can override
## using the `.groups` argument.
We can see that certain products have sell better online or offline in
different regions.
Comparing the Data
Both datasets contain observations about inventory sold by date around the world (mostly outside North America) consisting of a mix of categorical and continuous variables. Both datasets exhibit high variance and non-normal distribution amongst the variables. We can observe the following dependencies across both datasets:
total.cost
is a function units.sold
*
unit.cost
total.profit
is a function
total.revenue
- total.cost
.
total.revenue
is a function unit.cost
x
units.sold
.
We can observe that both datasets contain roughly equal distribution of online and offline sales, but the total number of items sold and the total profit from those sales is much higher for offline sales than online. There are no missing values in either dataset.
Model Selection
The guiding question at this point in the analysis is, what do we want to predict? Considering the data, there are a few things we can look at:
- Predicting Sales Channel.
- Predicting Order Priority.
- Predicting Total Profit.
Considering these three options with the data that is available, I wonder if any of these three serve realistic business goals, based upon the available data. For example, if we are given all of the data except sales channel, ie, we know how many of each item type are sold, what value is it to know after the fact whether an order is sold online or offline, or what the order priority is? Along the same lines, how beneficial is it to predict profit given sales totals? That is simple mathematics. With these thoughts in mind, the question to me becomes, how can we predict the total number of items sold by item type if given all the additional information in the dataset? Predicting the numbers of items sold would provide valuable business use regarding pricing and inventory allocation, ie how many items to make available online vs. offline, or perhaps offering a lower price for online sales.
Considering that both datasets contain a mix of categorical and
continuous variables, have non-normal distributions, and have unequal
variance, I will attempt to model two decision trees on these data to
try to predict Units.Sold
as the outcome variable.
Feature Selection.
Considering that both datasets contain a mix of categorical and continuous variables, have non-normal distributions, and have unequal variance, and that most of the continous variables are dependent on one another, I am going to build decision trees to predict the total sold per item given the following variables:
Month
.
Year
.
Region
.
Country
.
Item.Type
.
Unit.Price
.
Sales.Channel
.
First, we need to extract the Month
variable from the
Order.Date
in each dataset,
d100 <- data.100 |>
mutate(month = factor(month(order.date, label=TRUE)),
year = year(order.date),
price = unit.price,
item = item.type,
channel = sales.channel,
sold = units.sold) |>
select(sold, item, price, month, year, region, channel)
d100k <- data.100k |>
mutate(month = factor(month(order.date, label=TRUE)),
year = year(order.date),
price = unit.price,
item = item.type,
channel = sales.channel,
sold = units.sold) |>
select(sold, item, price, month, year, region, channel)
Previewing our datasets looks like this:
Dataset 1:
sold | item | price | month | year | region | channel |
---|---|---|---|---|---|---|
9925 | Baby Food | 255.28 | May | 2010 | Australia and Oceania | Offline |
2804 | Cereal | 205.70 | Aug | 2012 | Central America and the Caribbean | Online |
1779 | Office Supplies | 651.21 | May | 2014 | Europe | Offline |
8102 | Fruits | 9.33 | Jun | 2014 | Sub-Saharan Africa | Online |
5062 | Office Supplies | 651.21 | Feb | 2013 | Sub-Saharan Africa | Offline |
2974 | Baby Food | 255.28 | Feb | 2015 | Australia and Oceania | Online |
Dataset 2:
sold | item | price | month | year | region | channel |
---|---|---|---|---|---|---|
934 | Snacks | 152.58 | Oct | 2014 | Middle East and North Africa | Online |
4551 | Cosmetics | 437.20 | Feb | 2015 | Central America and the Caribbean | Offline |
9986 | Fruits | 9.33 | Dec | 2015 | Sub-Saharan Africa | Offline |
9118 | Personal Care | 81.73 | Sep | 2014 | Sub-Saharan Africa | Online |
5858 | Household | 668.27 | Feb | 2010 | Central America and the Caribbean | Offline |
1149 | Clothes | 109.28 | Feb | 2013 | Europe | Online |
Data Modeling
We are going to use Random Forest and Bagged Trees to generate predictions about total number of items sold on each dataset. First we split each set training and testing sets using an 80/20 split.
set.seed(1)
# set split indices
split.d100 <- createDataPartition(d100$sold, p = .8, list = FALSE)
split.d100k<- createDataPartition(d100k$sold, p = .8, list = FALSE)
# split data and partition predictor (x) and response (y) sets
# dataset1
train.d100 <- d100[split.d100, ]
test.d100 <- d100[-split.d100, ]
# dataset2
train.d100k <- d100k[split.d100k, ]
test.d100k <- d100k[-split.d100k, ]
We set up our cross-validation parameters. Since the dataset 1 training data is only 80 observations, we choose Leave-One-Out cross-validation. For the larger dataset, we will use 10-fold cross-validation.
loocv <- trainControl(
method = "LOOCV",
allowParallel = TRUE
)
cv <- trainControl(
method = "cv",
number = 5,
allowParallel = TRUE
)
We define a function to return the prediction metrics, including Mean Absolute Percentage Error (MAPE):
metrics <- function(predicted, actual){
mape = accuracy(predicted, actual)['Test set','MAPE']
measures = postResample(predicted, actual)
metrics = c(measures, MAPE=mape)
return(metrics)
}
Selecting a Model
First we prepare additional cores for parallel processing,
Random Forest
Dataset 1
registerDoParallel(cl)
set.seed(1)
rf100 <- randomForest(sold ~ .,
data = train.d100,
importance = TRUE,
trControl = loocv,
ntree = 1000,
parallel = "multicore")
rf100.preds <- predict(rf100, test.d100)
rf100.results <- metrics(rf100.preds, test.d100$sold)
##
## Call:
## randomForest(formula = sold ~ ., data = train.d100, importance = TRUE, trControl = loocv, ntree = 1000, parallel = "multicore")
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 7597693
## % Var explained: -0.47
Dataset 2
set.seed(1)
rf100k <- randomForest(sold ~ .,
data = train.d100k,
importance = TRUE,
ntree = 10,
mtry = sqrt(length(colnames(train.d100k))-1),
trControl = cv,
parallel = "multicore")
rf100k.preds <- predict(rf100k, test.d100k)
rf100k.results <- metrics(rf100k.preds, test.d100k$sold)
##
## Call:
## randomForest(formula = sold ~ ., data = train.d100k, importance = TRUE, ntree = 10, mtry = sqrt(length(colnames(train.d100k)) - 1), trControl = cv, parallel = "multicore")
## Type of random forest: regression
## Number of trees: 10
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 8646108
## % Var explained: -4.02
Bagged Trees
Dataset 1
set.seed(1)
bagged100 <- train(sold ~ .,
data = train.d100,
method = "treebag",
trControl = loocv,
tuneLength = 10)
bagged100.preds <- predict(bagged100, test.d100)
bagged100.results <- metrics(bagged100.preds, test.d100$sold)
## Bagged CART
##
## 80 samples
## 6 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 79, 79, 79, 79, 79, 79, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2953.101 0.001819921 2401.879
Dataset 2
set.seed(1)
bagged100k <- train(sold ~ .,
data = train.d100k,
method = "treebag",
trControl = cv,
tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
bagged100k.preds <- predict(bagged100k, test.d100k)
bagged100k.results <- metrics(bagged100k.preds, test.d100k$sold)
## Bagged CART
##
## 80001 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 64000, 64001, 64000, 64002, 64001
## Resampling results:
##
## RMSE Rsquared MAE
## 2883.047 NaN 2496.025
Results
kable(rbind("Dataset 1: Random Forest" = rf100.results,
"Dataset 1: Bagged Trees" = bagged100.results,
"Dataset 2: Random Forest" = rf100k.results,
"Dataset 2: Bagged Trees" = bagged100k.results)) |> kable_styling()
RMSE | Rsquared | MAE | MAPE | |
---|---|---|---|---|
Dataset 1: Random Forest | 2958.876 | 0.0180008 | 2558.838 | 272.8429 |
Dataset 1: Bagged Trees | 3049.350 | 0.0001126 | 2571.120 | 257.6522 |
Dataset 2: Random Forest | 2920.719 | 0.0000063 | 2520.789 | 492.8483 |
Dataset 2: Bagged Trees | 2890.612 | NA | 2504.423 | 501.7582 |
Neither model for either dataset is able to predict the number sold based on the chosen variables. The Mean Absolute Percentage Error (MAPE) values indicate the model’s predictive accuracy is off by 257% to 501% – those are unacceptably poor results.
Discussion / Essay
In this study we looked at two datasets from Excel Bianalytics, containing worldwide sales data for the years 2010-2017. We looked at two different sized datasets to identify a business problem and compare the performance of two machine learning algorithms on the datasets. The first dataset is small, consisting of 100 observations. The second dataset is large, containing 100,000 observations. The observations are a mix of 14 categorical, continuous, and date variables, as well as an order ID that does not serve a use for modeling. The 13 remaining variables are region, country, item type, sales channel, order priority, order date, ship date, units sold, unit price, unit cost, total revenue, total cost, and total profit. We determined that dependencies exist between total profit (total revenue - total cost), total cost (units sold x unit cost), and total revenue (units sold x unit price). As such, these three variables exhibited high correlation with their dependence variables and decided to focus on predicting units sold as our outcome variable, since cost, revenue, and profit can all be derived from the units sold. This seemed the most sensible business use to focus our attention on. This left us with the following 9 variables: region, country, item type, sales channel, order priority, order date, ship date, units sold. We calculated the processing time for each order as the difference between the order date and ship date. We also determined that the number of distinct countries was too large to be of use for our modeling: dataset 1 containing 100 observations: 76 different countries; dataset 2, 100k observations: 185 different countries.
After feature engineering and selection we had 7 predictor variables. We conducted exploratory data analysis and observed that the categorical predictors in the smaller dataset were unevenly distributed, while in the larger dataset, they were uniform. We observed no noticeable trend over time in either dataset. We further observed no linearity between the predictors and the outcome variable, and we determined that the processing time variable that we created did not appear to have any relationship with either the predictors or outcome variable.
As the datasets contained a combination of categorical and continuous variables we decided to employ decision tree models to predict the number of items sold. Decision trees are able to handle mixed data types and don’t require strong assumptions about variable distributions. They are also able to be employed for datasets of different sizes. However, they can be prone to overfitting and bias. To reduce this, we used the ensemble methods Bagged Trees and Random Forests. The bagged trees (“B”ootstrap “Agg”regation) model reduces the variance of a model by averaging the predictions of multiple decision trees, taking bootstrap samples (sampling with replacement) of the original dataset, training a separate decision tree on each sample, and averaging the predictions of each tree (aggregating). For the smaller dataset, we used “Leave One Out” (LOO) cross-validation to further reduce bias of the model. For the larger dataset, we used 10-fold cross-validation. We found this model did not predict well on either of the datasets, and was computationally expensive on the larger dataset.
Next we tried the Random Forest model, which extends the bagged trees model. In addition to using bootstrapped samples, random subsets of predictors are selected at each tree node, reducing correlation between the individual trees and improving model generalization. This model also did not perform well at predicting units sold, and was required a lot of processing time to train. For the larger dataset, we had to limit the number of trees to 10 in order to train a model that took less than 10 minutes to compute, which likely had an effect on the variance and predictive accuracy of the model, but not enough to make a difference considering how far off our predictions were. Again, LOO and 10-fold cross-validation were used for the smaller and larger datasets, respectively.
In retrospect I think the problem is with these datasets and the business question selected as far as why these models do not offer predictive accuracy – the datasets contain largely uniform information that requires transformation and feature engineering to generate predictors that might better fit a model. A few thoughts for future exploration would be to calculate the per-unit profit of an item, and aggregating sales by month by region, so there is less variance within the variables.