Link to my github for the R Markdown file.

Problem Statement:

Question 1 : On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

  1. Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.
  2. What metric would you report for this dataset?
  3. What is its value?

Dataset is available here

My Response:

Importing the dataset and exploring a little.

Quick look at the dataset by summarizing our data (I made the id columns and payement method column factors)

data %>% 
  summary
##     order_id       shop_id        user_id      order_amount   
##  1      :   1   53     :  68   718    :  28   Min.   :    90  
##  2      :   1   71     :  66   868    :  27   1st Qu.:   163  
##  3      :   1   19     :  64   791    :  26   Median :   284  
##  4      :   1   13     :  63   847    :  26   Mean   :  3145  
##  5      :   1   89     :  61   727    :  25   3rd Qu.:   390  
##  6      :   1   59     :  60   786    :  25   Max.   :704000  
##  (Other):4994   (Other):4618   (Other):4843                   
##   total_items           payment_method   created_at                 
##  Min.   :   1.000   cash       :1594   Min.   :2017-03-01 00:08:09  
##  1st Qu.:   1.000   credit_card:1735   1st Qu.:2017-03-08 07:08:04  
##  Median :   2.000   debit      :1671   Median :2017-03-16 00:21:20  
##  Mean   :   8.787                      Mean   :2017-03-15 22:20:37  
##  3rd Qu.:   3.000                      3rd Qu.:2017-03-23 10:39:57  
##  Max.   :2000.000                      Max.   :2017-03-30 23:55:35  
## 

Zooming in on the order_amount column

summary(data$order_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      90     163     284    3145     390  704000

It appears the naive calculation is obtained by averaging all of the order_amount data.

Question 1a: Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

Response: This calculation tries to evaluate the AOV across all 100 shops over the 30 day window by averageing the order_amount column. This results in the high $3145.13 value.

According to this shopify blog, to obtain the AOV, we divide the total revenue by the total number of orders. Doing this for each shop and getting the average, we have

aov_v1 <- data %>% 
  group_by(shop_id) %>% 
  summarise(total_revenue = sum(order_amount),
            total_orders = sum(total_items)) %>% 
  transmute(aov_v1 = total_revenue/total_orders) %>% 
  ungroup() %>% 
  summarise(average_ov = mean(aov_v1))
aov_v1
## # A tibble: 1 x 1
##   average_ov
##        <dbl>
## 1       408.

An AOV of $407.99 sounds much more realistic but it’s still on the high side for affordable sneakers, let’s explore further.

I’d like to calculate the sneaker price for each shop to see what that looks like. This is straightforward as each store sells only one model of sneakers. A simple division of the revenue per store and the number of items (shoes) sold will yield this. Coincidentally, this also fits the definition for the AOV, so running the two following chunks, yield the same results and the sneaker price per shop is same as the AOV per shop.

Getting the sneaker price per shop

data %>% 
  group_by(shop_id) %>% 
  mutate(sneaker_price = order_amount/total_items) %>%
  summarise(sneaker_price = mean(sneaker_price)) %>% 
  select(shop_id, sneaker_price) %>% 
  arrange(desc(sneaker_price))
## # A tibble: 100 x 2
##    shop_id sneaker_price
##    <fct>           <dbl>
##  1 78              25725
##  2 42                352
##  3 12                201
##  4 89                196
##  5 99                195
##  6 50                193
##  7 38                190
##  8 6                 187
##  9 51                187
## 10 11                184
## # ... with 90 more rows

Getting the AOV per shop

aov_per_shop <- data %>% 
  group_by(shop_id) %>% 
  summarise(total_revenue = sum(order_amount),
            total_orders = sum(total_items)) %>% 
  transmute(shop_id = shop_id,
            aov_per_shop = total_revenue/total_orders) %>% 
  arrange(desc(aov_per_shop))

aov_per_shop
## # A tibble: 100 x 2
##    shop_id aov_per_shop
##    <fct>          <dbl>
##  1 78             25725
##  2 42               352
##  3 12               201
##  4 89               196
##  5 99               195
##  6 50               193
##  7 38               190
##  8 6                187
##  9 51               187
## 10 11               184
## # ... with 90 more rows

Let’s visualise the aov_per_shop.

PLotting this would be devastating. Nevertheless, I make an attempt!

ggplot(aov_per_shop) +
  aes(x = "", y = aov_per_shop) +
  geom_boxplot(shape = "circle") +
  coord_trans(y = "log10") +
  scale_y_continuous(breaks=c(50,100, 150, 200, 350, 500, 20000)) +
  ylab("AOV per shop") +
  xlab("") +
  theme_bw()

The outlier at $25,725 makes this mostly unreadable despite my logarithmic efforts so let’s just look at the numbers😄!

summary(aov_per_shop$aov_per_shop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    90.0   132.8   153.0   408.0   168.2 25725.0

Given that the 3rd Quantile is just $168.2 and the maximum AOV is $25725, we have to ask, what is going on in shop 78??.

According to this, Shop 78 sells their sneakers at $25725 a pair! (They must be really fly).

Summary

Our primary aim is to analyse the AOV for our shops. Given that each shop sells just one model of sneakers, we are essentially trying to determine our average sneaker price.

Shop 78 has a sneaker price of $25,725 which is rather disturbing and needs to be checked. In order not to disrupt our analysis (if we’re adamant on using the average sneaker price) then we should not include this shop’s price in our calculations. As such, our AOV would be obtained as follows.

aov_v2 <-  data %>% 
  group_by(shop_id) %>% 
  filter(shop_id != 78) %>% 
  summarise(total_revenue = sum(order_amount),
            total_orders = sum(total_items)) %>% 
  transmute(shop_id = shop_id,
            aov_per_shop = total_revenue/total_orders) %>% 
  arrange(desc(aov_per_shop)) %>% 
  ungroup() %>% 
  summarise(aov_v2 = mean(aov_per_shop))
aov_v2
## # A tibble: 1 x 1
##   aov_v2
##    <dbl>
## 1   152.

This gives an average order value of $152.26.

As it’s against data best practices to delete data, we’ll explore other methods.

Median Order Value

Another method would be to use the median instead, this is more robust and helps dampen the effect of outliers on the Order Values. This is implemented as follows

aov_v3 <-  data %>% 
  group_by(shop_id) %>% 
  summarise(total_revenue = sum(order_amount),
            total_orders = sum(total_items)) %>% 
  transmute(shop_id = shop_id,
            aov_per_shop = total_revenue/total_orders) %>% 
  arrange(desc(aov_per_shop)) %>% 
  ungroup() %>% 
  summarise(aov_v3 = median(aov_per_shop))
aov_v3
## # A tibble: 1 x 1
##   aov_v3
##    <dbl>
## 1    153

The median order value obtained is $153.

Side Note:

I noticed (while playing around some more because why not!) that the orders made by user_id 607 had the exact same transaction and timestamp. I suspect this was an error but would need to confirm with the originator of the data (le DB folks) before I can take any decisions concerning it. As the sneaker price obtained from these transactions is $352, a little pricey but still reasonable, I left the data in my analysis.

Alrighty then, see you at the Life Story Interview! 😉

Problem Statement 2:

Question 2 : For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

  1. How many orders were shipped by Speedy Express in total?
  2. What is the last name of the employee with the most orders?
  3. What product was ordered the most by customers in Germany?

Response

Question 2a: How many orders were shipped by Speedy Express in total?

In total, 54 Orders were shipped by Speedy Express.

/* solution 1*/
SELECT 
COUNT(ShipperID)
FROM Orders
WHERE ShipperID == 1;

/*Solution 2 */

SELECT COUNT(o.ShipperID)
FROM Orders AS o
WHERE (SELECT ShipperID 
    FROM Shippers AS s
    WHERE s.ShipperName == "Speedy Express") == o.ShipperID;

Question 2b: What is the last name of the employee with the most orders?

Last name is Peacock.

SELECT e.LastName
FROM Employees AS e
WHERE (SELECT o.EmployeeID, COUNT(o.EmployeeID)
FROM Orders AS o
GROUP BY o.EmployeeID
ORDER BY COUNT(o.EmployeeID) DESC
LIMIT 1) == e.EmployeeID;

Question 2c: What product was ordered the most by customers in Germany?

Boston Crab Meat was most ordered by Germany-based customers with a total of 160 orders.

SELECT p.ProductName, SUM(Quantity) AS TotalQuantity
FROM Orders AS o, OrderDetails AS od, Customers AS c, Products AS p
WHERE c.Country = "Germany" AND od.OrderID = o.OrderID AND od.ProductID = p.ProductID AND c.CustomerID = o.CustomerID
GROUP BY p.ProductID
ORDER BY TotalQuantity DESC
LIMIT 1;