Given some sample data, write a program to answer the following: click here to access the required data set.
On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.
Let us load the data first and do summary statistics for every variable.
data <- read.csv('~/Downloads/2019 Winter Data Science Intern Challenge Data Set.csv', header=TRUE, stringsAsFactors = FALSE, na.strings=c("","NA"))
anyNA(data) #check whether there is NA in the data set
## [1] FALSE
summary(data)
## order_id shop_id user_id order_amount
## Min. : 1 Min. : 1.00 Min. :607.0 Min. : 90
## 1st Qu.:1251 1st Qu.: 24.00 1st Qu.:775.0 1st Qu.: 163
## Median :2500 Median : 50.00 Median :849.0 Median : 284
## Mean :2500 Mean : 50.08 Mean :849.1 Mean : 3145
## 3rd Qu.:3750 3rd Qu.: 75.00 3rd Qu.:925.0 3rd Qu.: 390
## Max. :5000 Max. :100.00 Max. :999.0 Max. :704000
## total_items payment_method created_at
## Min. : 1.000 Length:5000 Length:5000
## 1st Qu.: 1.000 Class :character Class :character
## Median : 2.000 Mode :character Mode :character
## Mean : 8.787
## 3rd Qu.: 3.000
## Max. :2000.000
Take a closer look at order_amount
summary(data$order_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90 163 284 3145 390 704000
From the above summary output, naively, we can see the mean of the total order amount is 3145, which could be where the original AOV stated in the question comes from. Based on the definition of AOV defined on the Shopify website, the formula is \({TotalOrderAmount}/TotalNumberofOrders\), and we can check below:
TotalOrderAmount = sum(data$order_amount)
TotalNumberofOrders = nrow(data)
(OriginalAOV = TotalOrderAmount/TotalNumberofOrders)
## [1] 3145.128
However, if we were to present this result to our clients, it could be misleading (as stated in the question) because we can see that the mean has a considerable distance from the Median Order Amount, which is 284. It indicates that the data of order amount is highly skewed due to extreme values. There could be a few high-dollar orders that skew up the Average. Let us find out what happened in our data.
Take a look at AOV for each of these 100 sneaker stores:
AOV_pershop <- data %>% group_by(shop_id) %>%
summarise(Sum_per_shop = sum(order_amount),num_order = n(),
AOV_per_shop = Sum_per_shop/num_order) %>%
arrange(desc(AOV_per_shop))
head(AOV_pershop, 10)
## # A tibble: 10 × 4
## shop_id Sum_per_shop num_order AOV_per_shop
## <int> <int> <int> <dbl>
## 1 42 11990176 51 235101.
## 2 78 2263800 46 49213.
## 3 50 17756 44 404.
## 4 90 19758 49 403.
## 5 38 13680 35 391.
## 6 81 22656 59 384
## 7 6 22627 59 384.
## 8 89 23128 61 379.
## 9 33 15051 40 376.
## 10 51 16643 46 362.
Let’s visualize extreme AOVs by using a boxplot:
#Function for extreme values
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
#Boxplot to show the distribution
AOV_pershop %>%
mutate(outlier = ifelse(is_outlier(AOV_per_shop), shop_id, as.numeric(NA))) %>%
ggplot(aes(x = "",y = AOV_per_shop)) +
geom_boxplot(width = 0.6) +
coord_trans(y="log10") +
scale_y_continuous(breaks=c(150, 200, 300, 500, 50000)) +
ylab("Average Order Value per Shop") + xlab("Figure 1. Box Plot of AOV per Shop") +
geom_text(aes(label = ifelse(is.na(outlier), "", paste("shop id :",outlier)), col = "red"), hjust = -0.3)+
theme_bw() + theme(legend.position="none")
As shown in Figure 1, shop id 42 and 78 have insanely high AOV. Those values could explain a massive gap between Average Order Amount and Median Order Amount.
Let us look at the data for shop id 42 and 78 since they both have extreme AOV.
Shop 78:
data %>% filter(shop_id == 78) %>%
group_by(user_id) %>%
summarise(order_amount = sum(order_amount), num_items = sum(total_items)) %>%
arrange(desc(order_amount))
## # A tibble: 45 × 3
## user_id order_amount num_items
## <int> <int> <int>
## 1 878 154350 6
## 2 834 102900 4
## 3 766 77175 3
## 4 775 77175 3
## 5 787 77175 3
## 6 817 77175 3
## 7 818 77175 3
## 8 855 77175 3
## 9 915 77175 3
## 10 928 77175 3
## # … with 35 more rows
From the above table, shop 78 seems to have a significantly high unit price, given that a customer with user_id 878 bought six items for 154,350 dollars ($25,725/sneaker). We need to confirm with shop 78 to see if the unit price was correct.
Shop 42:
data %>% filter(shop_id == 42) %>%
group_by(user_id) %>%
summarise(order_amount = sum(order_amount), num_items = sum(total_items)) %>%
arrange(desc(order_amount))
## # A tibble: 31 × 3
## user_id order_amount num_items
## <int> <int> <int>
## 1 607 11968000 34000
## 2 797 1760 5
## 3 819 1760 5
## 4 868 1408 4
## 5 907 1408 4
## 6 926 1408 4
## 7 726 1056 3
## 8 739 1056 3
## 9 808 1056 3
## 10 720 704 2
## # … with 21 more rows
From the above table, we can see that the customer with user_id 607 has purchased a large quantity in shop 42; this could result in extreme AOV. By further investigation, we tried to remove data of user_id 607 and plotted a boxplot for AOV per shop. The results show that shop 42 without the data of customer 607 still has extreme AOV, indicating user_id 607 is not the only customer causing this extreme AOV.
data_extremes_rm <- data %>% filter(!((shop_id == 78)| (shop_id ==42)))
summary(data_extremes_rm$order_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90.0 163.0 284.0 300.2 386.5 1086.0
The AOV is 300.2 after removing shops 42 and 78, while Median Order Value remains the same.
Let’s visualize the data by a boxplot:
AOV_extremes_rm <- data_extremes_rm %>%
group_by(shop_id) %>%
summarise(Sum_per_shop = sum(order_amount),num_order = n(),
AOV_per_shop = Sum_per_shop/num_order) %>%
arrange(desc(AOV_per_shop))
AOV_extremes_rm %>%
mutate(outlier = ifelse(is_outlier(AOV_per_shop), shop_id, as.numeric(NA))) %>%
ggplot(aes(x = "",y = AOV_per_shop)) +
geom_boxplot(width = 0.6) +
coord_trans(y="log10") +
scale_y_continuous(breaks=c(150, 200, 300, 500, 50000)) +
ylab("Average Order Value per Shop") + xlab("Figure 1. Box Plot of AOV per Shop") +
geom_text(aes(label = ifelse(is.na(outlier), "", paste("shop id :",outlier)), col = "red"), hjust = -0.3)+
theme_bw() + theme(legend.position="none")
Average Order Value after removing extreme values
Median Order Value after removing extreme values
Original Median Order Value
For highly skewed distributions, the mean is not an appropriate representation of the center of the distribution. Even though there is only a tiny percent of the distribution in the tail of a highly skewed distribution, these few substantial values (relative to the rest of the distribution) cause the mean of the distribution to be massive. The median does not have this flaw.
mean(data_extremes_rm$order_amount)
## [1] 300.1558
median(data_extremes_rm$order_amount)
## [1] 284
median(data$order_amount)
## [1] 284
The AOV seems to be more reasonable now.
For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.
There were 54 orders shipped by Speedy Express.
SELECT s.ShipperName, COUNT(DISTINCT o.OrderID) as TotalOrders
FROM Orders as o
LEFT JOIN Shippers as s
ON o.ShipperID = s.ShipperID
WHERE ShipperName = "Speedy Express";
Thought Process: Table “Orders” contains information on OrderID and ShipperID, but we need ShipperName to answer the question. We can use LEFT JOIN to combine table “Orders” and table “Shippers” based on ShipperID, and filter the record by using WHERE to return the number of total orders shipped by Speedy Express. DISTINCT is used if a duplicate OrderID is made by a human mistake or something.
Peacock with 40 orders.
SELECT e.LastName, COUNT(DISTINCT o.OrderID) as TotalOrders
FROM Orders as o
LEFT JOIN Employees as e
ON o.EmployeeID = e.EmployeeID
GROUP BY e.LastName
ORDER BY TotalOrders DESC
LIMIT 1
Thought Process: Table “Employees” contains LastName but no order information. Thus, we again LEFT JOIN two tables, “Orders” and “Employees”. Then we can use GROUP BY to get TotalOrders of every employee. We want the highest TotalOrders; we can add DESC to the ORDER BY and use LIMIT to limit the output to one row.
Boston Crab Meat was most ordered product in Germany with Total Quantity 160.
SELECT p.ProductName, SUM(od.Quantity) as TotalQuan
FROM Orders as o
JOIN Customers as c
ON c.CustomerID = o.CustomerID
JOIN OrderDetails as od
ON o.OrderID = od.OrderID
JOIN Products as p
ON od.ProductID = p.ProductID
WHERE c.Country = 'Germany'
GROUP BY ProductName
ORDER BY TotalQuan DESC
LIMIT 1;
Table “Customers” contains Country and CustomerID.
Table “OrderDetails” contains ProductID, Quantity, and OrderID.
Table “Products” contains ProductID and ProductName.
Table “Orders” contains CustomerID and OrderID.
Thought Process: We need to fetch data from 4 different tables to complete this query. We can repeat the JOIN clause and join four tables. The question specifies the country in Germany; we can use WHERE to filter our records. Finally, GROUP BY ProductName, in this case, we want the highest TotalQuan; we can add DESC to the ORDER BY and use LIMIT to limit the output to one row.