There are 100 sneaker shops, each shop sells only one model of shoe. When calculated the average order value (AOV) over 30 day period, the value is $3145.13.
Discover what could be wrong with the calculation and find a more accurate way to evaluate the data.
Generate a metric to report this dataset
Calculate the value
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(naniar)
## Warning: package 'naniar' was built under R version 4.1.2
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.1.2
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(scales)
## Warning: package 'scales' was built under R version 4.1.2
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.1.2
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(dplyr)
library(waffle)
## Warning: package 'waffle' was built under R version 4.1.2
sneaker <- read.csv("sneaker.csv")
glimpse(sneaker)
## Rows: 5,000
## Columns: 7
## $ order_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~
## $ shop_id <int> 53, 92, 44, 18, 18, 58, 87, 22, 64, 52, 66, 40, 54, 100~
## $ user_id <int> 746, 925, 861, 935, 883, 882, 915, 761, 914, 788, 848, ~
## $ order_amount <int> 224, 90, 144, 156, 156, 138, 149, 292, 266, 146, 322, 3~
## $ total_items <int> 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3, 2000, 1, 1~
## $ payment_method <chr> "cash", "cash", "cash", "credit_card", "credit_card", "~
## $ created_at <chr> "2017-03-13 12:36:56", "2017-03-03 17:38:52", "2017-03-~
summary(sneaker$order_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90 163 284 3145 390 704000
vis_miss(sneaker) #check missing value
No missing data detected. If any, there will showing in black the location of missing values, and provide the percentage of missing values overall (in the legend), and in each variable.
Visualize the order_amount to overview how it’s distribution.
#format date
sneaker$date <- as.Date(sneaker$created_at)
p <- ggplot(sneaker,
aes(x=date,
y=order_amount,
color=shop_id)) +
geom_point(size=4,
alpha=1/5) +
xlab("Time") +
ylab("Order Amount(Dollar)") +
scale_y_continuous(labels=comma) +
theme(axis.text.x = element_text(angle = 45, hjust=1),
axis.ticks.x = element_blank(),
axis.line = element_line())
p
Clearly, there is a significant distance between two order_amount groups: one group below $100,000 which is the majority amount about hundreds dollar and another group distributed about $700,000.
Let’s check how mean and median distributed on the histogram graph.
mean.sneaker <- mean(sneaker$order_amount, trim = 0) #NA data checked
median.sneaker <- median(sneaker$order_amount, trim = 0)
sneaker %>%
filter( order_amount<5000) %>% #order value limit to 5000
ggplot( aes(x=order_amount)) +
geom_histogram(binwidth = 15, aes(y=..density..)) +
geom_vline(data = sneaker, aes(xintercept=mean.sneaker), #add mean line
linetype="dashed") +
geom_vline(data = sneaker, aes(xintercept=median.sneaker), #add median line
linetype="solid")
sd(sneaker$order_amount)
## [1] 41282.54
Now we can verify that there seems non-logical understand with all the number: the mean=3185 with the standard deviation=41282.54 and the median=284. In fact, we see that the maximum order_amount is 704000 while 50% order_amount lower than 284 dollar.
###Finding insights by filter order_amount values.
ggplot(
sneaker,
aes(x=date,
y=order_amount,
color=shop_id)) +
geom_point(
size=4,
alpha=1/5) +
geom_label(
data=sneaker %>% filter(order_amount>10000),
aes(label=shop_id)) +
scale_y_continuous(labels=comma) +
theme(axis.text.x = element_text(angle = 45, hjust=1),
axis.line = element_line())
It is clearly to recognized that there are shop_id 42 and shop_id 78 contributed a huge value order_amount to total value of the data. With a further investigation, the reason shop_id 42 has that high value order_amount due to the massive number of 2000 items per order from shop_id 42. On the other hand, we have shop_id 78 also contributed a large order_amount value upper than $10,000 due to the item value which is $25,000/ item price.
sneaker_sum <- sneaker %>%
group_by(total_items) %>%
summarise(total_order_amount = sum(order_amount)) %>%
mutate(
percentage = paste0(round(total_order_amount/sum(total_order_amount)*100,2),"%"))
sneaker_sum
## # A tibble: 8 x 3
## total_items total_order_amount percentage
## <int> <int> <chr>
## 1 1 763777 4.86%
## 2 2 1374394 8.74%
## 3 3 1120803 7.13%
## 4 4 277672 1.77%
## 5 5 58470 0.37%
## 6 6 161460 1.03%
## 7 8 1064 0.01%
## 8 2000 11968000 76.11%
sum(sneaker_sum$total_order_amount)
## [1] 15725640
Now, it is more clearly to recognize that the total_items 2000 account for 76.11% of the total order_amount $15,6725,640. This total_items 2000 per order from shop_id 42 is the reason of unbalanced average order value, or massive skewed to the right that lead to drag the mean so far away from the median.
It is hard to set one or more than one metrics do evaluate all the aspects of an variable while these variables are probably associated to others.
In the scenario, I would recommend categorize the total_items into groups and then calculate its the AOV. On the other hand, there should be categorize groups between product values like shop_id 78 who sell a pair of sneaker shoe for $25,000 distinct from a group whose sell a pair of sneaker shoe for an average of hundred dollars.
sneaker_aov <- sneaker %>%
group_by(total_items) %>%
summarise(AOV = mean(order_amount),
SD = sd(order_amount),
median_order_value = median(order_amount))
sneaker_aov
## # A tibble: 8 x 4
## total_items AOV SD median_order_value
## <int> <dbl> <dbl> <dbl>
## 1 1 417. 2593. 153
## 2 2 750. 4761. 306
## 3 3 1191. 7471. 459
## 4 4 948. 5978. 592
## 5 5 759. 161. 765
## 6 6 17940 51154. 948
## 7 8 1064 NA 1064
## 8 2000 704000 0 704000
The total_items 8 has a NA standard deviation because of there is only one transaction.
The total_items 2000 has a median account for $704,000 because of the transaction value per order is 704,000 dollar.
Even there are some high Standard Deviation relative to each total_items value such as total_items 1 has an AOV 417,000 while the SD is 2,593. However, we could basically easy to figure out it’s 50% above the value and 50% below when looking at median value for each observation. Furthermore, we can also understanding the shape distribution when compare AVO with median_order_value by it’s gap. Ultimately, Shopify team or Merchant would make more accurate decision for business with this developed report.
- How many orders were shipped by Speedy Express in total?
FROM Orders JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID GROUP BY Orders.ShipperID;Speedy Express: 54 orders
b. What is the last name of the employee with the most orders?
FROM Orders JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID GROUP BY Orders.EmployeeID Order by COUNT(Orders.EmployeeID) Desc;The employee with the most orders: Peacock
c. What product was ordered the most by customers in Germany?
FROM Orders INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID INNER JOIN OrderDetails ON Orders.OrderID=OrderDetails.OrderID INNER JOIN Products ON Products.ProductID=OrderDetails.ProductID WHERE Country = 'Germany' Group By Products.ProductID Order by OrderDetails.Quantity Desc;The most ordered product in Germany is Steeleye Stout.
Thank you for reading 😄