library(tidyverse)
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.5
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Question 1A:

Given some sample data, write a program to answer the following: click here to access the required data set.

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

The inflated AOV could be due to fraud or to human error, such as order amounts being logged incorrectly. First, we import the data and examine what each column represents.

dat <- read.csv("2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv")
head(dat)
##   order_id shop_id user_id order_amount total_items payment_method
## 1        1      53     746          224           2           cash
## 2        2      92     925           90           1           cash
## 3        3      44     861          144           1           cash
## 4        4      18     935          156           1    credit_card
## 5        5      18     883          156           1    credit_card
## 6        6      58     882          138           1    credit_card
##            created_at
## 1 2017-03-13 12:36:56
## 2 2017-03-03 17:38:52
## 3  2017-03-14 4:23:56
## 4 2017-03-26 12:43:37
## 5  2017-03-01 4:35:11
## 6 2017-03-14 15:25:01

There are 7 columns: order_id, shop_id, user_id, order_amount, total_items, payment_method, and created_at. order_id is an integer representing a unique sneaker order. shop_id is an integer identifying the shop selling the sneakers. user_id is an integer identifying the user who placed the order. order_amount is an integer giving the price, in dollars, paid by the customer to the respective shop for the order. total_items is an integer counting how many sneakers the customer bought. payment_method is a character column indicating whether the customer paid with cash, debit, or credit card. created_at is also a character column giving the date and 24-hour time at which the order was placed.
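
As a quick structural check, glimpse() confirms the column types described above. Note that created_at is read in as a character column; it could optionally be parsed into a proper datetime, for example with lubridate::ymd_hms() (lubridate is installed with the tidyverse but not attached by library(tidyverse)), although the timestamps are not used in the analysis below.

# Confirm the column types
glimpse(dat)
# Optional: parse created_at into a datetime (not needed for the analysis below)
# dat <- mutate(dat, created_at = lubridate::ymd_hms(created_at))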

Next, we check that there are no missing values in the data and confirm whether the AOV is indeed $3145.13.

# Data exploration
# Check for any missing values
sum(is.na(dat))
## [1] 0
# Observing the summary statistics of the data
summary(dat)
##     order_id       shop_id          user_id       order_amount   
##  Min.   :   1   Min.   :  1.00   Min.   :607.0   Min.   :    90  
##  1st Qu.:1251   1st Qu.: 24.00   1st Qu.:775.0   1st Qu.:   163  
##  Median :2500   Median : 50.00   Median :849.0   Median :   284  
##  Mean   :2500   Mean   : 50.08   Mean   :849.1   Mean   :  3145  
##  3rd Qu.:3750   3rd Qu.: 75.00   3rd Qu.:925.0   3rd Qu.:   390  
##  Max.   :5000   Max.   :100.00   Max.   :999.0   Max.   :704000  
##   total_items       payment_method      created_at       
##  Min.   :   1.000   Length:5000        Length:5000       
##  1st Qu.:   1.000   Class :character   Class :character  
##  Median :   2.000   Mode  :character   Mode  :character  
##  Mean   :   8.787                                        
##  3rd Qu.:   3.000                                        
##  Max.   :2000.000
# Check AOV 
mean(dat$order_amount)
## [1] 3145.128
print(paste0("Average Order Value = $", mean(dat$order_amount)))
## [1] "Average Order Value = $3145.128"
print(paste0("Median Order Value = $", median(dat$order_amount)))
## [1] "Median Order Value = $284"

There appear to be outliers in the data, so the median is a better measure of a typical order value than the mean because the outliers skew the distribution heavily to the right. However, we can take a closer look at the data with some graphs before settling on a statistic.
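
To see that skew directly, a histogram of order_amount on a log10 scale (a quick sketch, separate from the item-count table below) keeps the bulk of the orders readable next to the extreme values.

# Visualizing the right skew of order_amount on a log scale
ggplot(dat, aes(x = order_amount)) +
  geom_histogram(bins = 50, fill = "royalblue") +
  scale_x_log10() +
  labs(title = "Distribution of Order Amounts (log10 scale)",
       x = "Order Amount (USD)",
       y = "Frequency") +
  theme_minimal()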

# Creating a table of each unique count of total items
table(dat$total_items)
## 
##    1    2    3    4    5    6    8 2000 
## 1830 1832  941  293   77    9    1   17
# Creating a data frame of the table to visualize the data
dat_total_items <- as.data.frame(table(dat$total_items))
dat_total_items <- rename(dat_total_items,
                          number = Var1,
                          frequency = Freq)

ggplot(dat_total_items, aes(x = number, y = frequency)) +
  geom_col(
    color = c("skyblue", "royalblue", "blue", "navy", "skyblue", "royalblue", "blue", "navy"),
    fill = c("skyblue", "royalblue", "blue", "navy", "skyblue", "royalblue", "blue", "navy")) + 
  labs(title = "Total Items Ordered Frequencies Across All Shops",
       x = "Unique Number of Total Items Ordered",
       y = "Frequency") + 
  theme_minimal()

From the table, we observe that 17 of the 5,000 orders contain 2,000 items each, which seems out of the ordinary and skews the AOV. So let's check the AOV and median order value for orders with fewer than 2,000 items and for orders with exactly 2,000 items.

# Filtering the data on total items ordered less than 2000 and exactly 2000 to get mean and median values
small_orders <- filter(dat, total_items < 2000)
large_orders <- filter(dat, total_items == 2000)
paste0("Average Order Value = $", mean(small_orders$order_amount))
## [1] "Average Order Value = $754.091912502509"
paste0("Median Order Value = $", median(small_orders$order_amount))
## [1] "Median Order Value = $284"
paste0("Average Order Value = $", mean(large_orders$order_amount))
## [1] "Average Order Value = $704000"
paste0("Median Order Value = $", median(large_orders$order_amount))
## [1] "Median Order Value = $704000"

Even after removing the 2,000-item orders, we still run into a suspiciously high AOV (754.09 USD) alongside the same median order value (284 USD). There still seems to be fraud or human error driving this high AOV, so we should determine what each shop is charging per shoe.

# Calculating the average shoe price of each store
shop_aov <- dat %>%
  group_by(shop_id) %>%
  summarise(mean_price = sum(order_amount) / sum(total_items)) %>%
  arrange(desc(mean_price)) 

head(shop_aov, 5)
## # A tibble: 5 × 2
##   shop_id mean_price
##     <int>      <dbl>
## 1      78      25725
## 2      42        352
## 3      12        201
## 4      89        196
## 5      99        195

Every shop except shop #78 has an average shoe price under 400 USD; shop #78's average shoe price of $25,725 seems very unlikely for a shop selling just sneakers. We should examine shop #78's data to get an idea of why its price is so high.

# Filtering the data to show only shop #78's orders
fraud_dat <- filter(dat, shop_id == 78)
fraud_dat
##    order_id shop_id user_id order_amount total_items payment_method
## 1       161      78     990        25725           1    credit_card
## 2       491      78     936        51450           2          debit
## 3       494      78     983        51450           2           cash
## 4       512      78     967        51450           2           cash
## 5       618      78     760        51450           2           cash
## 6       692      78     878       154350           6          debit
## 7      1057      78     800        25725           1          debit
## 8      1194      78     944        25725           1          debit
## 9      1205      78     970        25725           1    credit_card
## 10     1260      78     775        77175           3    credit_card
## 11     1385      78     867        25725           1           cash
## 12     1420      78     912        25725           1           cash
## 13     1453      78     812        25725           1    credit_card
## 14     1530      78     810        51450           2           cash
## 15     2271      78     855        25725           1    credit_card
## 16     2453      78     709        51450           2           cash
## 17     2493      78     834       102900           4          debit
## 18     2496      78     707        51450           2           cash
## 19     2513      78     935        51450           2          debit
## 20     2549      78     861        25725           1           cash
## 21     2565      78     915        77175           3          debit
## 22     2691      78     962        77175           3          debit
## 23     2774      78     890        25725           1           cash
## 24     2819      78     869        51450           2          debit
## 25     2822      78     814        51450           2           cash
## 26     2907      78     817        77175           3          debit
## 27     2923      78     740        25725           1          debit
## 28     3086      78     910        25725           1           cash
## 29     3102      78     855        51450           2    credit_card
## 30     3152      78     745        25725           1    credit_card
## 31     3168      78     927        51450           2           cash
## 32     3404      78     928        77175           3          debit
## 33     3441      78     982        25725           1          debit
## 34     3706      78     828        51450           2    credit_card
## 35     3725      78     766        77175           3    credit_card
## 36     3781      78     889        25725           1           cash
## 37     4041      78     852        25725           1           cash
## 38     4080      78     946        51450           2           cash
## 39     4193      78     787        77175           3    credit_card
## 40     4312      78     960        51450           2          debit
## 41     4413      78     756        51450           2          debit
## 42     4421      78     969        77175           3          debit
## 43     4506      78     866        25725           1          debit
## 44     4585      78     997        25725           1           cash
## 45     4716      78     818        77175           3          debit
## 46     4919      78     823        25725           1           cash
##             created_at
## 1   2017-03-12 5:56:57
## 2  2017-03-26 17:08:19
## 3  2017-03-16 21:39:35
## 4   2017-03-09 7:23:14
## 5  2017-03-18 11:18:42
## 6  2017-03-27 22:51:43
## 7  2017-03-15 10:16:45
## 8  2017-03-16 16:38:26
## 9  2017-03-17 22:32:21
## 10  2017-03-27 9:27:20
## 11 2017-03-17 16:38:06
## 12 2017-03-30 12:23:43
## 13 2017-03-17 18:09:54
## 14  2017-03-29 7:12:01
## 15 2017-03-14 23:58:22
## 16 2017-03-27 11:04:04
## 17  2017-03-04 4:37:34
## 18  2017-03-26 4:38:52
## 19 2017-03-18 18:57:13
## 20 2017-03-17 19:36:00
## 21  2017-03-25 1:19:35
## 22  2017-03-22 7:33:25
## 23 2017-03-26 10:36:43
## 24  2017-03-17 6:25:51
## 25 2017-03-02 17:13:25
## 26  2017-03-16 3:45:46
## 27 2017-03-12 20:10:58
## 28  2017-03-26 1:59:27
## 29  2017-03-21 5:10:34
## 30 2017-03-18 13:13:07
## 31 2017-03-12 12:23:08
## 32  2017-03-16 9:45:05
## 33 2017-03-19 19:02:54
## 34 2017-03-14 20:43:15
## 35 2017-03-16 14:13:26
## 36 2017-03-11 21:14:50
## 37 2017-03-02 14:31:12
## 38 2017-03-20 21:14:00
## 39  2017-03-18 9:25:32
## 40  2017-03-01 3:02:10
## 41  2017-03-02 4:13:39
## 42 2017-03-09 15:21:35
## 43 2017-03-22 22:06:01
## 44 2017-03-25 21:48:44
## 45  2017-03-05 5:10:44
## 46 2017-03-15 13:26:46

The output above confirms our suspicion that shop #78 is either committing fraud or entering its order amounts incorrectly. One plausible human error is that the first row's order amount of 25725 is really 257.25 USD, with every order amount entered without the decimal point. From here, we should recalculate the overall AOV excluding both shop #78 and the 2,000-item orders (the outliers).
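
Before dropping shop #78 outright, one quick sanity check of the decimal-point hypothesis (a sketch, not part of the required answer) is to divide that shop's order amounts by 100 and recompute the AOV on the sub-2,000-item orders.

# Hypothetical rescaling: treat shop #78's amounts as if the decimal point was dropped
rescaled_dat <- dat %>%
  mutate(order_amount = if_else(shop_id == 78,
                                order_amount / 100,
                                as.numeric(order_amount)))
mean(filter(rescaled_dat, total_items < 2000)$order_amount)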

# Cleaning the data: remove shop #78 and keep orders with fewer than 2000 items
clean_dat <- filter(dat, shop_id != 78)
small_clean_dat <- filter(clean_dat, total_items < 2000)
mean(small_clean_dat$order_amount)
## [1] 302.5805
median(small_clean_dat$order_amount)
## [1] 284

Question 1B:

What metric would you report for this dataset?

I would report the average order value computed after excluding shop #78 and the 2,000-item orders. If we have to use all of the data, I would report the median order value instead.
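
For reference, robust summaries can also be computed on the full, unfiltered data (a sketch; the 1% trimming fraction is an illustrative choice rather than part of the original analysis).

# Robust alternatives to the naive AOV on the full data
median(dat$order_amount)              # median order value
mean(dat$order_amount, trim = 0.01)   # trimmed mean, dropping the extreme 1% in each tail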

Question 1C:

What is its value?

Excluding shop #78 and the 2,000-item orders, the AOV is 302.58 USD, which is the value I would report. If we were to include all the data, I would instead report the median order value, which is 284 USD.

Question 2:

For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

How many orders were shipped by Speedy Express in total?

SELECT COUNT(OrderID)
FROM Orders
WHERE ShipperID = 1;

This returns 54 (in the Shippers table, ShipperID 1 corresponds to Speedy Express).

What is the last name of the employee with the most orders?

SELECT LastName
FROM Employees
WHERE EmployeeID = (SELECT TOP 1 EmployeeID
                    FROM Orders
                    GROUP BY EmployeeID
                    ORDER BY COUNT(OrderID) DESC);

This returns Peacock.

What product was ordered the most by customers in Germany?

SELECT ProductName
FROM Products
WHERE ProductID IN (SELECT TOP 1 ProductID
                    FROM OrderDetails
                    WHERE OrderID IN (SELECT OrderID
                                      FROM Orders
                                      WHERE CustomerID IN (SELECT CustomerID
                                                           FROM Customers
                                                           WHERE Country = 'Germany'))
                    GROUP BY ProductID
                    ORDER BY COUNT(OrderDetailID) DESC);

This returns Gorgonzola Telino (the product appearing in the most orders from German customers, counting order lines rather than total quantity).