library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.5
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Given some sample data, write a program to answer the following: click here to access the required data set.
On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.
Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.
The inflated AOV could be due to fraud or to human error, such as order amounts being logged incorrectly. First, we need to import the data and examine what each column represents.
dat <- read.csv("2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv")
head(dat)
## order_id shop_id user_id order_amount total_items payment_method
## 1 1 53 746 224 2 cash
## 2 2 92 925 90 1 cash
## 3 3 44 861 144 1 cash
## 4 4 18 935 156 1 credit_card
## 5 5 18 883 156 1 credit_card
## 6 6 58 882 138 1 credit_card
## created_at
## 1 2017-03-13 12:36:56
## 2 2017-03-03 17:38:52
## 3 2017-03-14 4:23:56
## 4 2017-03-26 12:43:37
## 5 2017-03-01 4:35:11
## 6 2017-03-14 15:25:01
There are 7 columns: order_id, shop_id, user_id, order_amount, total_items, payment_method, and created_at. order_id is an integer that uniquely identifies a sneaker order. shop_id is an integer identifying the shop selling the sneakers. user_id is an integer identifying the user who placed the order. order_amount is an integer giving the price, in dollars, of the order placed by the customer and paid to the shop. total_items is an integer giving how many pairs of sneakers were bought in the order. payment_method is a character column indicating whether the customer paid by cash, debit, or credit card. created_at is also a character column giving the date and 24-hour time at which the order was placed.
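To double-check these types programmatically, we can inspect the structure of the data frame; the sketch below uses dplyr's glimpse(), with base R's str() as an equivalent.
# Inspect the column types directly (base R equivalent: str(dat))
glimpse(dat)
# Note: created_at is stored as a character column; if we needed time-based
# analysis it could be parsed with as.POSIXct(), but that is not required here.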
Then we want to check that there are no missing values in the data and confirm that the naively calculated AOV really is about $3,145.
# Data exploration
# Check for any missing values
sum(is.na(dat))
## [1] 0
# Observing the summary statistics of the data
summary(dat)
## order_id shop_id user_id order_amount
## Min. : 1 Min. : 1.00 Min. :607.0 Min. : 90
## 1st Qu.:1251 1st Qu.: 24.00 1st Qu.:775.0 1st Qu.: 163
## Median :2500 Median : 50.00 Median :849.0 Median : 284
## Mean :2500 Mean : 50.08 Mean :849.1 Mean : 3145
## 3rd Qu.:3750 3rd Qu.: 75.00 3rd Qu.:925.0 3rd Qu.: 390
## Max. :5000 Max. :100.00 Max. :999.0 Max. :704000
## total_items payment_method created_at
## Min. : 1.000 Length:5000 Length:5000
## 1st Qu.: 1.000 Class :character Class :character
## Median : 2.000 Mode :character Mode :character
## Mean : 8.787
## 3rd Qu.: 3.000
## Max. :2000.000
# Check AOV
mean(dat$order_amount)
## [1] 3145.128
print(paste0("Average Order Value = $", mean(dat$order_amount)))
## [1] "Average Order Value = $3145.128"
print(paste0("Median Order Value = $", median(dat$order_amount)))
## [1] "Median Order Value = $284"
There appear to be outliers in the data, so the median is a better measure of a typical order value because the outliers skew the distribution to the right. However, we can take a closer look at the data to evaluate this through graphs.
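One quick way to see this right skew is to plot the distribution of order_amount on a log scale; the sketch below is one option (bin count chosen arbitrarily).
# Distribution of order amounts; a log10 x-axis keeps the extreme orders visible
ggplot(dat, aes(x = order_amount)) +
  geom_histogram(bins = 50, fill = "royalblue") +
  scale_x_log10() +
  labs(title = "Distribution of Order Amounts",
       x = "Order Amount ($, log10 scale)",
       y = "Number of Orders") +
  theme_minimal()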
# Creating a table of each unique count of total items
table(dat$total_items)
##
## 1 2 3 4 5 6 8 2000
## 1830 1832 941 293 77 9 1 17
# Creating a data frame of the table to visualize the data
dat_total_items <- as.data.frame(table(dat$total_items))
dat_total_items <- rename(dat_total_items,
number = Var1,
frequency = Freq)
ggplot(dat_total_items, aes(x = number, y = frequency)) +
geom_col(
color = c("skyblue", "royalblue", "blue", "navy", "skyblue", "royalblue", "blue", "navy"),
fill = c("skyblue", "royalblue", "blue", "navy", "skyblue", "royalblue", "blue", "navy")) +
labs(title = "Total Items Ordered Frequencies Across All Shops",
x = "Unique Number of Total Items Ordered",
y = "Frequency") +
theme_minimal()
From the table, we observe that 17 of the 5,000 orders contain 2,000 items each, which is out of the ordinary and skews the AOV. So let's check the AOV and median order value when total items ordered is less than 2,000 and when it equals 2,000.
# Filtering the data on total items ordered less than 2000 and exactly 2000 to get mean and median values
small_orders <- filter(dat, total_items < 2000)
large_orders <- filter(dat, total_items == 2000)
paste0("Average Order Value = $", mean(small_orders$order_amount))
## [1] "Average Order Value = $754.091912502509"
paste0("Median Order Value = $", median(small_orders$order_amount))
## [1] "Median Order Value = $284"
paste0("Average Order Value = $", mean(large_orders$order_amount))
## [1] "Average Order Value = $704000"
paste0("Median Order Value = $", median(large_orders$order_amount))
## [1] "Median Order Value = $704000"
Even after separating small and large orders, we still run into a high AOV ($754.09) with the same median order value ($284). There still seems to be fraud or human error driving the AOV up, so we should determine what each shop is charging per pair of shoes.
# Calculating the average shoe price of each store
shop_aov <- dat %>%
group_by(shop_id) %>%
summarise(mean_price = sum(order_amount) / sum(total_items)) %>%
arrange(desc(mean_price))
head(shop_aov, 5)
## # A tibble: 5 × 2
## shop_id mean_price
## <int> <dbl>
## 1 78 25725
## 2 42 352
## 3 12 201
## 4 89 196
## 5 99 195
Every shop except shop #78 has an average shoe price under $400; shop #78's average shoe price is $25,725, which seems very unlikely for a shop selling only sneakers. We should examine shop #78's data to get an idea of why its price is so high.
# Filtering the data to only shop #78's orders
fraud_dat <- filter(dat, shop_id == 78)
fraud_dat
## order_id shop_id user_id order_amount total_items payment_method
## 1 161 78 990 25725 1 credit_card
## 2 491 78 936 51450 2 debit
## 3 494 78 983 51450 2 cash
## 4 512 78 967 51450 2 cash
## 5 618 78 760 51450 2 cash
## 6 692 78 878 154350 6 debit
## 7 1057 78 800 25725 1 debit
## 8 1194 78 944 25725 1 debit
## 9 1205 78 970 25725 1 credit_card
## 10 1260 78 775 77175 3 credit_card
## 11 1385 78 867 25725 1 cash
## 12 1420 78 912 25725 1 cash
## 13 1453 78 812 25725 1 credit_card
## 14 1530 78 810 51450 2 cash
## 15 2271 78 855 25725 1 credit_card
## 16 2453 78 709 51450 2 cash
## 17 2493 78 834 102900 4 debit
## 18 2496 78 707 51450 2 cash
## 19 2513 78 935 51450 2 debit
## 20 2549 78 861 25725 1 cash
## 21 2565 78 915 77175 3 debit
## 22 2691 78 962 77175 3 debit
## 23 2774 78 890 25725 1 cash
## 24 2819 78 869 51450 2 debit
## 25 2822 78 814 51450 2 cash
## 26 2907 78 817 77175 3 debit
## 27 2923 78 740 25725 1 debit
## 28 3086 78 910 25725 1 cash
## 29 3102 78 855 51450 2 credit_card
## 30 3152 78 745 25725 1 credit_card
## 31 3168 78 927 51450 2 cash
## 32 3404 78 928 77175 3 debit
## 33 3441 78 982 25725 1 debit
## 34 3706 78 828 51450 2 credit_card
## 35 3725 78 766 77175 3 credit_card
## 36 3781 78 889 25725 1 cash
## 37 4041 78 852 25725 1 cash
## 38 4080 78 946 51450 2 cash
## 39 4193 78 787 77175 3 credit_card
## 40 4312 78 960 51450 2 debit
## 41 4413 78 756 51450 2 debit
## 42 4421 78 969 77175 3 debit
## 43 4506 78 866 25725 1 debit
## 44 4585 78 997 25725 1 cash
## 45 4716 78 818 77175 3 debit
## 46 4919 78 823 25725 1 cash
## created_at
## 1 2017-03-12 5:56:57
## 2 2017-03-26 17:08:19
## 3 2017-03-16 21:39:35
## 4 2017-03-09 7:23:14
## 5 2017-03-18 11:18:42
## 6 2017-03-27 22:51:43
## 7 2017-03-15 10:16:45
## 8 2017-03-16 16:38:26
## 9 2017-03-17 22:32:21
## 10 2017-03-27 9:27:20
## 11 2017-03-17 16:38:06
## 12 2017-03-30 12:23:43
## 13 2017-03-17 18:09:54
## 14 2017-03-29 7:12:01
## 15 2017-03-14 23:58:22
## 16 2017-03-27 11:04:04
## 17 2017-03-04 4:37:34
## 18 2017-03-26 4:38:52
## 19 2017-03-18 18:57:13
## 20 2017-03-17 19:36:00
## 21 2017-03-25 1:19:35
## 22 2017-03-22 7:33:25
## 23 2017-03-26 10:36:43
## 24 2017-03-17 6:25:51
## 25 2017-03-02 17:13:25
## 26 2017-03-16 3:45:46
## 27 2017-03-12 20:10:58
## 28 2017-03-26 1:59:27
## 29 2017-03-21 5:10:34
## 30 2017-03-18 13:13:07
## 31 2017-03-12 12:23:08
## 32 2017-03-16 9:45:05
## 33 2017-03-19 19:02:54
## 34 2017-03-14 20:43:15
## 35 2017-03-16 14:13:26
## 36 2017-03-11 21:14:50
## 37 2017-03-02 14:31:12
## 38 2017-03-20 21:14:00
## 39 2017-03-18 9:25:32
## 40 2017-03-01 3:02:10
## 41 2017-03-02 4:13:39
## 42 2017-03-09 15:21:35
## 43 2017-03-22 22:06:01
## 44 2017-03-25 21:48:44
## 45 2017-03-05 5:10:44
## 46 2017-03-15 13:26:46
The output above supports our suspicion that shop #78 is either committing fraud or entering its order amounts incorrectly. The human error could be a misplaced decimal: the first row's order amount of 25725 may actually be $257.25, with every order amount entered without the decimal point. From here, we should recalculate the overall AOV excluding both shop #78 and the orders with 2,000 items (the outliers).
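If the misplaced-decimal explanation is right, shop #78's per-pair price should be a constant $25,725, and shifting the decimal two places would give $257.25. The sketch below checks this on the fraud_dat subset; the division by 100 is only an assumption about where the decimal point belongs.
# Implied per-pair price at shop #78 (every row shown above works out to 25725)
unique(fraud_dat$order_amount / fraud_dat$total_items)

# Hypothetical correction: assume the decimal point was dropped
unique(fraud_dat$order_amount / fraud_dat$total_items) / 100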
# Cleaning the data: remove shop #78, then keep orders with fewer than 2000 items
clean_dat <- filter(dat, shop_id != 78)
small_clean_dat <- filter(clean_dat, total_items < 2000)
mean(small_clean_dat$order_amount)
## [1] 302.5805
median(small_clean_dat$order_amount)
## [1] 284
What metric would you report for this dataset?
I would report the average order value computed after excluding shop #78 and the orders of 2,000 items. If we had to use all of the data, I would report the median order value instead.
What is its value?
Excluding shop #78 and the 2,000-item orders, the cleaned AOV is $302.58. If we were to include all of the data, I would report the median order value, which is $284.
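For completeness, both reporting options can be computed together; the sketch below just reuses the filters from the cleaning step above.
# Option 1: AOV after excluding shop #78 and the 2000-item orders
dat %>%
  filter(shop_id != 78, total_items < 2000) %>%
  summarise(cleaned_aov = mean(order_amount))

# Option 2: median order value over all orders
median(dat$order_amount)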
For this question you’ll need to use SQL. Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.
How many orders were shipped by Speedy Express in total?
SELECT COUNT(OrderID) FROM Orders
WHERE ShipperID = 1
This returns 54.
What is the last name of the employee with the most orders?
SELECT LastName FROM Employees
WHERE EmployeeID = (SELECT TOP 1 EmployeeID FROM Orders
                    GROUP BY EmployeeID
                    ORDER BY COUNT(OrderID) DESC);
This returns Peacock.
What product was ordered the most by customers in Germany?
SELECT ProductName FROM Products
WHERE ProductID IN (SELECT TOP 1 ProductID FROM OrderDetails
                    WHERE OrderID IN (SELECT OrderID FROM Orders
                                      WHERE CustomerID IN (SELECT CustomerID FROM Customers
                                                           WHERE Country = 'Germany'))
                    GROUP BY ProductID
                    ORDER BY COUNT(OrderDetailID) DESC);
This returns Gorgonzola Telino.