Link to my github for the R Markdown file.
Question 1 : On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.
Dataset is available here
Importing the dataset and exploring a little.
Quick look at the dataset by summarizing our data (I made the id columns and payement method column factors)
data %>%
summary
## order_id shop_id user_id order_amount
## 1 : 1 53 : 68 718 : 28 Min. : 90
## 2 : 1 71 : 66 868 : 27 1st Qu.: 163
## 3 : 1 19 : 64 791 : 26 Median : 284
## 4 : 1 13 : 63 847 : 26 Mean : 3145
## 5 : 1 89 : 61 727 : 25 3rd Qu.: 390
## 6 : 1 59 : 60 786 : 25 Max. :704000
## (Other):4994 (Other):4618 (Other):4843
## total_items payment_method created_at
## Min. : 1.000 cash :1594 Min. :2017-03-01 00:08:09
## 1st Qu.: 1.000 credit_card:1735 1st Qu.:2017-03-08 07:08:04
## Median : 2.000 debit :1671 Median :2017-03-16 00:21:20
## Mean : 8.787 Mean :2017-03-15 22:20:37
## 3rd Qu.: 3.000 3rd Qu.:2017-03-23 10:39:57
## Max. :2000.000 Max. :2017-03-30 23:55:35
##
Zooming in on the order_amount column
summary(data$order_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90 163 284 3145 390 704000
It appears the naive calculation is obtained by averaging all of the order_amount data.
Question 1a: Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.
Response: This calculation tries to evaluate the AOV across all 100 shops over the 30 day window by averageing the order_amount column. This results in the high $3145.13 value.
According to this shopify blog, to obtain the AOV, we divide the total revenue by the total number of orders. Doing this for each shop and getting the average, we have
aov_v1 <- data %>%
group_by(shop_id) %>%
summarise(total_revenue = sum(order_amount),
total_orders = sum(total_items)) %>%
transmute(aov_v1 = total_revenue/total_orders) %>%
ungroup() %>%
summarise(average_ov = mean(aov_v1))
aov_v1
## # A tibble: 1 x 1
## average_ov
## <dbl>
## 1 408.
An AOV of $407.99 sounds much more realistic but it’s still on the high side for affordable sneakers, let’s explore further.
I’d like to calculate the sneaker price for each shop to see what that looks like. This is straightforward as each store sells only one model of sneakers. A simple division of the revenue per store and the number of items (shoes) sold will yield this. Coincidentally, this also fits the definition for the AOV, so running the two following chunks, yield the same results and the sneaker price per shop is same as the AOV per shop.
Getting the sneaker price per shop
data %>%
group_by(shop_id) %>%
mutate(sneaker_price = order_amount/total_items) %>%
summarise(sneaker_price = mean(sneaker_price)) %>%
select(shop_id, sneaker_price) %>%
arrange(desc(sneaker_price))
## # A tibble: 100 x 2
## shop_id sneaker_price
## <fct> <dbl>
## 1 78 25725
## 2 42 352
## 3 12 201
## 4 89 196
## 5 99 195
## 6 50 193
## 7 38 190
## 8 6 187
## 9 51 187
## 10 11 184
## # ... with 90 more rows
Getting the AOV per shop
aov_per_shop <- data %>%
group_by(shop_id) %>%
summarise(total_revenue = sum(order_amount),
total_orders = sum(total_items)) %>%
transmute(shop_id = shop_id,
aov_per_shop = total_revenue/total_orders) %>%
arrange(desc(aov_per_shop))
aov_per_shop
## # A tibble: 100 x 2
## shop_id aov_per_shop
## <fct> <dbl>
## 1 78 25725
## 2 42 352
## 3 12 201
## 4 89 196
## 5 99 195
## 6 50 193
## 7 38 190
## 8 6 187
## 9 51 187
## 10 11 184
## # ... with 90 more rows
Let’s visualise the aov_per_shop.
PLotting this would be devastating. Nevertheless, I make an attempt!
ggplot(aov_per_shop) +
aes(x = "", y = aov_per_shop) +
geom_boxplot(shape = "circle") +
coord_trans(y = "log10") +
scale_y_continuous(breaks=c(50,100, 150, 200, 350, 500, 20000)) +
ylab("AOV per shop") +
xlab("") +
theme_bw()
The outlier at $25,725 makes this mostly unreadable despite my logarithmic efforts so let’s just look at the numbers😄!
summary(aov_per_shop$aov_per_shop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90.0 132.8 153.0 408.0 168.2 25725.0
Given that the 3rd Quantile is just $168.2 and the maximum AOV is $25725, we have to ask, what is going on in shop 78??.
According to this, Shop 78 sells their sneakers at $25725 a pair! (They must be really fly).
Our primary aim is to analyse the AOV for our shops. Given that each shop sells just one model of sneakers, we are essentially trying to determine our average sneaker price.
Shop 78 has a sneaker price of $25,725 which is rather disturbing and needs to be checked. In order not to disrupt our analysis (if we’re adamant on using the average sneaker price) then we should not include this shop’s price in our calculations. As such, our AOV would be obtained as follows.
aov_v2 <- data %>%
group_by(shop_id) %>%
filter(shop_id != 78) %>%
summarise(total_revenue = sum(order_amount),
total_orders = sum(total_items)) %>%
transmute(shop_id = shop_id,
aov_per_shop = total_revenue/total_orders) %>%
arrange(desc(aov_per_shop)) %>%
ungroup() %>%
summarise(aov_v2 = mean(aov_per_shop))
aov_v2
## # A tibble: 1 x 1
## aov_v2
## <dbl>
## 1 152.
This gives an average order value of $152.26.
As it’s against data best practices to delete data, we’ll explore other methods.
Another method would be to use the median instead, this is more robust and helps dampen the effect of outliers on the Order Values. This is implemented as follows
aov_v3 <- data %>%
group_by(shop_id) %>%
summarise(total_revenue = sum(order_amount),
total_orders = sum(total_items)) %>%
transmute(shop_id = shop_id,
aov_per_shop = total_revenue/total_orders) %>%
arrange(desc(aov_per_shop)) %>%
ungroup() %>%
summarise(aov_v3 = median(aov_per_shop))
aov_v3
## # A tibble: 1 x 1
## aov_v3
## <dbl>
## 1 153
The median order value obtained is $153.
One more method would be to find the modal sneaker price (most occuring). As R does not have an inbuilt mode function, we write a quick one to obtain this for us.
getmode <- function(x) {
uniqx <- unique(x)
uniqx[which.max(tabulate(match(x, uniqx)))]
}
aov_all <- data %>%
group_by(shop_id) %>%
summarise(total_revenue = sum(order_amount),
total_orders = sum(total_items)) %>%
transmute(shop_id = shop_id,
aov_per_shop = total_revenue/total_orders) %>%
arrange(desc(aov_per_shop)) %>%
ungroup() %>%
summarise(median_ov = median(aov_per_shop),
mode_ov = getmode(aov_per_shop))
aov_all
## # A tibble: 1 x 2
## median_ov mode_ov
## <dbl> <dbl>
## 1 153 153
This yields a modal order value of $153.
Question 1b: What metric would you report for this dataset?
I would report the modal order value (sneaker prices) as this is safer than the omission method and reduces the effect of outliers. Hence it gives a more accurate description of the value of each order across all 100 shops.
Question 1c: What is its value?
Its value is $153.
I noticed (while playing around some more because why not!) that the orders made by user_id 607 had the exact same transaction and timestamp. I suspect this was an error but would need to confirm with the originator of the data (le DB folks) before I can take any decisions concerning it. As the sneaker price obtained from these transactions is $352, a little pricey but still reasonable, I left the data in my analysis.
Alrighty then, see you at the Life Story Interview! 😉
Question 2 : For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.
Response
Question 2a: How many orders were shipped by Speedy Express in total?
In total, 54 Orders were shipped by Speedy Express.
/* solution 1*/
SELECT
COUNT(ShipperID)
FROM Orders
WHERE ShipperID == 1;
/*Solution 2 */
SELECT COUNT(o.ShipperID)
FROM Orders AS o
WHERE (SELECT ShipperID
FROM Shippers AS s
WHERE s.ShipperName == "Speedy Express") == o.ShipperID;
Question 2b: What is the last name of the employee with the most orders?
Last name is Peacock.
SELECT e.LastName
FROM Employees AS e
WHERE (SELECT o.EmployeeID, COUNT(o.EmployeeID)
FROM Orders AS o
GROUP BY o.EmployeeID
ORDER BY COUNT(o.EmployeeID) DESC
LIMIT 1) == e.EmployeeID;
Question 2c: What product was ordered the most by customers in Germany?
Boston Crab Meat was most ordered by Germany-based customers with a total of 160 orders.
SELECT p.ProductName, SUM(Quantity) AS TotalQuantity
FROM Orders AS o, OrderDetails AS od, Customers AS c, Products AS p
WHERE c.Country = "Germany" AND od.OrderID = o.OrderID AND od.ProductID = p.ProductID AND c.CustomerID = o.CustomerID
GROUP BY p.ProductID
ORDER BY TotalQuantity DESC
LIMIT 1;