Question 1:

Given some sample data, write a program to answer the following: click here to access the required data set

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Let’s load the data, check for any NA values and do a summary for the dataset

data <- read.csv("2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv")
anyNA(data)
## [1] FALSE
summary(data)
##     order_id       shop_id          user_id       order_amount   
##  Min.   :   1   Min.   :  1.00   Min.   :607.0   Min.   :    90  
##  1st Qu.:1251   1st Qu.: 24.00   1st Qu.:775.0   1st Qu.:   163  
##  Median :2500   Median : 50.00   Median :849.0   Median :   284  
##  Mean   :2500   Mean   : 50.08   Mean   :849.1   Mean   :  3145  
##  3rd Qu.:3750   3rd Qu.: 75.00   3rd Qu.:925.0   3rd Qu.:   390  
##  Max.   :5000   Max.   :100.00   Max.   :999.0   Max.   :704000  
##   total_items       payment_method      created_at       
##  Min.   :   1.000   Length:5000        Length:5000       
##  1st Qu.:   1.000   Class :character   Class :character  
##  Median :   2.000   Mode  :character   Mode  :character  
##  Mean   :   8.787                                        
##  3rd Qu.:   3.000                                        
##  Max.   :2000.000

At first glance, the mean and max values of order_amount column are extremely high in comparison to other information provided in summary.

What could be wrong?

The mean of the total order amount is $3145 which is the same as stated in the question. These results could be misleading which is seen by the mean order amount being much larger than the media order amount. This indicates that the data may be inaccurate for some stores or the data could be skewed by extreme values.

Let us break down the AOV per store to identify if there are specific stores causing these results.

shop_aov <- data  %>% group_by(shop_id) %>%
  summarize(aov_per_shop = mean(order_amount)) %>%
  arrange(desc(aov_per_shop))
head(shop_aov, 10)
## # A tibble: 10 × 2
##    shop_id aov_per_shop
##      <int>        <dbl>
##  1      42      235101.
##  2      78       49213.
##  3      50         404.
##  4      90         403.
##  5      38         391.
##  6      81         384 
##  7       6         384.
##  8      89         379.
##  9      33         376.
## 10      51         362.

After running the calculations for AOV for each individual store, we can see that store 42 and store 78 have some extreme AOV values. We can visualize this to consider stores 42 and 78 as outlier.

ggplot(shop_aov) +
  aes(x = "", y = aov_per_shop) +
  geom_boxplot() +
  coord_trans(y = "log10") +
  scale_y_continuous(breaks=c(50, 100, 150, 200, 350, 500, 20000)) +
  ylab("Average Order Value by shop") +
  xlab("Figure 1. Box Plot of AOV per Shop")+
  theme_bw()

As seen in Figure 1. We can clearly see the gap in AOV of stores 42 and 78 in comparison to the rest of the stores.

What metric would you report for this dataset?

Based on the data being provided, unless we can confirm that stores 42 and 78 are errors or invalid, we cannot exclude them from the data. As a result, I would recommend using Median to calculate the AOV of stores as it will not be skewed by the extreme values that are presented in the data. Alternatively, we could remove the extremes and calculate the Mean values or we could calculate the average price per item in the order as we do see high item numbers per order in stores 42 and 78.

What is its value?

median(data$order_amount)
## [1] 284

Using the median of the original data without removing the extremes, the AOV is $284.

Question 2

For this question you’ll need to use SQL.Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below

A. How many orders were shipped by Speedy Express in total?

Select Count(OrderID) AS Count_Orders, Orders.ShipperID, Max(ShipperName) AS Shipper_Name
From Orders
RIGHT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
Group By Orders.ShipperID;

Figure 2. Order count per Shipper

As seen in Figure 2, Speedy Express shipped a total of 54 packages.

B.What is the last name of the employee with the most orders?

Select Count(OrderID) AS Count_Orders, Orders.EmployeeID, Max(Employees.LastName) AS Emp_LastName
From Orders
RIGHT JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
Group By Orders.EmployeeID
Order By Count(OrderID) DESC;

Figure 3. Order Count per Employee

As seen in Figure 3, the last name of the employee with the most orders is Peacock.

C. What product was ordered the most by customers in Germany?

SELECT OrderDetails.ProductID, Sum(Quantity) AS Sum_Quantity, Max(Products.ProductName) AS ProductName
FROM OrderDetails
RIGHT JOIN Products ON OrderDetails.ProductID = Products.ProductID
WHERE OrderID IN (
    SELECT Orders.OrderID
    FROM Customers
    RIGHT JOIN Orders ON Customers.CustomerID = Orders.CustomerID
    WHERE Country = 'Germany'
    Order by OrderID) 
GROUP BY OrderDetails.ProductID
Order BY SUM(Quantity) DESC;

Figure 4. Most Ordered Products From Customers in Germany

As Seen in Figure 4, the most ordered product from customers in Germany is Boston Crab Meat.