Question 1

Given some sample data, write a program to answer the following: click here to access the required data set.

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Let us load the data first and do summary statistics for every variable.

data <-  read.csv('~/Downloads/2019 Winter Data Science Intern Challenge Data Set.csv', header=TRUE, stringsAsFactors = FALSE, na.strings=c("","NA"))
anyNA(data) #check whether there is NA in the data set
## [1] FALSE
summary(data)
##     order_id       shop_id          user_id       order_amount   
##  Min.   :   1   Min.   :  1.00   Min.   :607.0   Min.   :    90  
##  1st Qu.:1251   1st Qu.: 24.00   1st Qu.:775.0   1st Qu.:   163  
##  Median :2500   Median : 50.00   Median :849.0   Median :   284  
##  Mean   :2500   Mean   : 50.08   Mean   :849.1   Mean   :  3145  
##  3rd Qu.:3750   3rd Qu.: 75.00   3rd Qu.:925.0   3rd Qu.:   390  
##  Max.   :5000   Max.   :100.00   Max.   :999.0   Max.   :704000  
##   total_items       payment_method      created_at       
##  Min.   :   1.000   Length:5000        Length:5000       
##  1st Qu.:   1.000   Class :character   Class :character  
##  Median :   2.000   Mode  :character   Mode  :character  
##  Mean   :   8.787                                        
##  3rd Qu.:   3.000                                        
##  Max.   :2000.000

Take a closer look at order_amount

summary(data$order_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      90     163     284    3145     390  704000

What could be going wrong with our calculation:

From the above summary output, naively, we can see the mean of the total order amount is 3145, which could be where the original AOV stated in the question comes from. Based on the definition of AOV defined on the Shopify website, the formula is \({TotalOrderAmount}/TotalNumberofOrders\), and we can check below:

TotalOrderAmount = sum(data$order_amount)
TotalNumberofOrders = nrow(data)
(OriginalAOV = TotalOrderAmount/TotalNumberofOrders)
## [1] 3145.128

However, if we were to present this result to our clients, it could be misleading (as stated in the question) because we can see that the mean has a considerable distance from the Median Order Amount, which is 284. It indicates that the data of order amount is highly skewed due to extreme values. There could be a few high-dollar orders that skew up the Average. Let us find out what happened in our data.

Take a look at AOV for each of these 100 sneaker stores:

AOV_pershop <- data %>% group_by(shop_id) %>%
    summarise(Sum_per_shop = sum(order_amount),num_order = n(), 
              AOV_per_shop = Sum_per_shop/num_order) %>%
    arrange(desc(AOV_per_shop)) 
head(AOV_pershop, 10)
## # A tibble: 10 × 4
##    shop_id Sum_per_shop num_order AOV_per_shop
##      <int>        <int>     <int>        <dbl>
##  1      42     11990176        51      235101.
##  2      78      2263800        46       49213.
##  3      50        17756        44         404.
##  4      90        19758        49         403.
##  5      38        13680        35         391.
##  6      81        22656        59         384 
##  7       6        22627        59         384.
##  8      89        23128        61         379.
##  9      33        15051        40         376.
## 10      51        16643        46         362.

Let’s visualize extreme AOVs by using a boxplot:

#Function for extreme values
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

#Boxplot to show the distribution
AOV_pershop %>%
    mutate(outlier = ifelse(is_outlier(AOV_per_shop), shop_id, as.numeric(NA))) %>%
    ggplot(aes(x = "",y = AOV_per_shop)) + 
    geom_boxplot(width = 0.6) + 
    coord_trans(y="log10") +
    scale_y_continuous(breaks=c(150, 200, 300, 500, 50000)) + 
    ylab("Average Order Value per Shop") + xlab("Figure 1. Box Plot of AOV per Shop") +
    geom_text(aes(label = ifelse(is.na(outlier), "", paste("shop id :",outlier)), col = "red"), hjust = -0.3)+
    theme_bw() + theme(legend.position="none")

As shown in Figure 1, shop id 42 and 78 have insanely high AOV. Those values could explain a massive gap between Average Order Amount and Median Order Amount.

Let us look at the data for shop id 42 and 78 since they both have extreme AOV.

Shop 78:

data %>% filter(shop_id == 78) %>% 
  group_by(user_id) %>%
  summarise(order_amount = sum(order_amount), num_items = sum(total_items)) %>%
  arrange(desc(order_amount))
## # A tibble: 45 × 3
##    user_id order_amount num_items
##      <int>        <int>     <int>
##  1     878       154350         6
##  2     834       102900         4
##  3     766        77175         3
##  4     775        77175         3
##  5     787        77175         3
##  6     817        77175         3
##  7     818        77175         3
##  8     855        77175         3
##  9     915        77175         3
## 10     928        77175         3
## # … with 35 more rows

From the above table, shop 78 seems to have a significantly high unit price, given that a customer with user_id 878 bought six items for 154,350 dollars ($25,725/sneaker). We need to confirm with shop 78 to see if the unit price was correct.

Shop 42:

data %>% filter(shop_id == 42) %>% 
  group_by(user_id) %>%
  summarise(order_amount = sum(order_amount), num_items = sum(total_items)) %>%
  arrange(desc(order_amount))
## # A tibble: 31 × 3
##    user_id order_amount num_items
##      <int>        <int>     <int>
##  1     607     11968000     34000
##  2     797         1760         5
##  3     819         1760         5
##  4     868         1408         4
##  5     907         1408         4
##  6     926         1408         4
##  7     726         1056         3
##  8     739         1056         3
##  9     808         1056         3
## 10     720          704         2
## # … with 21 more rows

From the above table, we can see that the customer with user_id 607 has purchased a large quantity in shop 42; this could result in extreme AOV. By further investigation, we tried to remove data of user_id 607 and plotted a boxplot for AOV per shop. The results show that shop 42 without the data of customer 607 still has extreme AOV, indicating user_id 607 is not the only customer causing this extreme AOV.

Ways that can be improved in order to provide better insight:

  1. Find out what causes the extreme AOV and handle it by talking to the seller and potentially removing those extremes from our analysis. In this case, to make our analysis more meaningful to our clients, I recommend removing shop 78 and shop 42. It won’t make a big deal if we only remove 2% of shop data. The analysis is shown below:
data_extremes_rm <- data %>% filter(!((shop_id == 78)| (shop_id ==42))) 
summary(data_extremes_rm$order_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    90.0   163.0   284.0   300.2   386.5  1086.0

The AOV is 300.2 after removing shops 42 and 78, while Median Order Value remains the same.

Let’s visualize the data by a boxplot:

AOV_extremes_rm <- data_extremes_rm %>%
  group_by(shop_id) %>%
  summarise(Sum_per_shop = sum(order_amount),num_order = n(), 
              AOV_per_shop = Sum_per_shop/num_order) %>%
    arrange(desc(AOV_per_shop)) 

AOV_extremes_rm %>%
    mutate(outlier = ifelse(is_outlier(AOV_per_shop), shop_id, as.numeric(NA))) %>%
    ggplot(aes(x = "",y = AOV_per_shop)) + 
    geom_boxplot(width = 0.6) + 
    coord_trans(y="log10") +
    scale_y_continuous(breaks=c(150, 200, 300, 500, 50000)) + 
    ylab("Average Order Value per Shop") + xlab("Figure 1. Box Plot of AOV per Shop") +
    geom_text(aes(label = ifelse(is.na(outlier), "", paste("shop id :",outlier)), col = "red"), hjust = -0.3)+
    theme_bw() + theme(legend.position="none")

  1. Look at different metrics such as Median Order Value.

What metric would you report for this dataset?

  1. Average Order Value after removing extreme values

  2. Median Order Value after removing extreme values

  3. Original Median Order Value

For highly skewed distributions, the mean is not an appropriate representation of the center of the distribution. Even though there is only a tiny percent of the distribution in the tail of a highly skewed distribution, these few substantial values (relative to the rest of the distribution) cause the mean of the distribution to be massive. The median does not have this flaw.

What is its value?

  1. Average Order Value after removing extreme values
mean(data_extremes_rm$order_amount)
## [1] 300.1558
  1. Median Order Value after removing extreme values
median(data_extremes_rm$order_amount)
## [1] 284
  1. Original Median Order Value
median(data$order_amount)
## [1] 284

The AOV seems to be more reasonable now.

Question 2

For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

How many orders were shipped by Speedy Express in total?

There were 54 orders shipped by Speedy Express.

SELECT s.ShipperName, COUNT(DISTINCT o.OrderID) as TotalOrders
FROM Orders as o
LEFT JOIN Shippers as s
ON o.ShipperID = s.ShipperID
WHERE ShipperName = "Speedy Express";

Thought Process: Table “Orders” contains information on OrderID and ShipperID, but we need ShipperName to answer the question. We can use LEFT JOIN to combine table “Orders” and table “Shippers” based on ShipperID, and filter the record by using WHERE to return the number of total orders shipped by Speedy Express. DISTINCT is used if a duplicate OrderID is made by a human mistake or something.

What is the last name of the employee with the most orders?

Peacock with 40 orders.

SELECT e.LastName, COUNT(DISTINCT o.OrderID) as TotalOrders
FROM Orders as o
LEFT JOIN Employees as e
ON o.EmployeeID = e.EmployeeID
GROUP BY e.LastName
ORDER BY TotalOrders DESC
LIMIT 1

Thought Process: Table “Employees” contains LastName but no order information. Thus, we again LEFT JOIN two tables, “Orders” and “Employees”. Then we can use GROUP BY to get TotalOrders of every employee. We want the highest TotalOrders; we can add DESC to the ORDER BY and use LIMIT to limit the output to one row.

What product was ordered the most by customers in Germany?

Boston Crab Meat was most ordered product in Germany with Total Quantity 160.

SELECT p.ProductName, SUM(od.Quantity) as TotalQuan
FROM Orders as o
JOIN Customers as c
  ON c.CustomerID = o.CustomerID
JOIN OrderDetails as od
  ON o.OrderID = od.OrderID
JOIN Products as p
  ON od.ProductID = p.ProductID
WHERE c.Country = 'Germany'
GROUP BY ProductName
ORDER BY TotalQuan DESC
LIMIT 1;

Table “Customers” contains Country and CustomerID.

Table “OrderDetails” contains ProductID, Quantity, and OrderID.

Table “Products” contains ProductID and ProductName.

Table “Orders” contains CustomerID and OrderID.

Thought Process: We need to fetch data from 4 different tables to complete this query. We can repeat the JOIN clause and join four tables. The question specifies the country in Germany; we can use WHERE to filter our records. Finally, GROUP BY ProductName, in this case, we want the highest TotalQuan; we can add DESC to the ORDER BY and use LIMIT to limit the output to one row.