Question 1

Ask❓

There are 100 sneaker shops, each shop sells only one model of shoe. When calculated the average order value (AOV) over 30 day period, the value is $3145.13.

Business tasks

Discover what could be wrong with the calculation and find a more accurate way to evaluate the data.
Generate a metric to report this dataset
Calculate the value

Prepare📃

Loading packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.2

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.0     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning: package 'tidyr' was built under R version 4.1.2

## Warning: package 'readr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readr)
library(naniar)

## Warning: package 'naniar' was built under R version 4.1.2

library(hrbrthemes)

## Warning: package 'hrbrthemes' was built under R version 4.1.2

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(scales)

## Warning: package 'scales' was built under R version 4.1.2

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.1.2

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(dplyr)
library(waffle)

## Warning: package 'waffle' was built under R version 4.1.2

Loading dataset

sneaker <- read.csv("sneaker.csv")
glimpse(sneaker)

## Rows: 5,000
## Columns: 7
## $ order_id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~
## $ shop_id        <int> 53, 92, 44, 18, 18, 58, 87, 22, 64, 52, 66, 40, 54, 100~
## $ user_id        <int> 746, 925, 861, 935, 883, 882, 915, 761, 914, 788, 848, ~
## $ order_amount   <int> 224, 90, 144, 156, 156, 138, 149, 292, 266, 146, 322, 3~
## $ total_items    <int> 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3, 2000, 1, 1~
## $ payment_method <chr> "cash", "cash", "cash", "credit_card", "credit_card", "~
## $ created_at     <chr> "2017-03-13 12:36:56", "2017-03-03 17:38:52", "2017-03-~

Statistics summary

summary(sneaker$order_amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      90     163     284    3145     390  704000

vis_miss(sneaker) #check missing value

No missing data detected. If any, there will showing in black the location of missing values, and provide the percentage of missing values overall (in the legend), and in each variable.

Process ⌨️

Visualize the order_amount to overview how it’s distribution.

#format date
sneaker$date <- as.Date(sneaker$created_at)

p <- ggplot(sneaker, 
            aes(x=date, 
                y=order_amount, 
                color=shop_id)) +
  geom_point(size=4, 
             alpha=1/5) +
    xlab("Time") +
    ylab("Order Amount(Dollar)") +
    scale_y_continuous(labels=comma) +
    theme(axis.text.x = element_text(angle = 45, hjust=1),
          axis.ticks.x = element_blank(),
          axis.line = element_line())
p

Clearly, there is a significant distance between two order_amount groups: one group below $100,000 which is the majority amount about hundreds dollar and another group distributed about $700,000.

Let’s check how mean and median distributed on the histogram graph.

mean.sneaker <- mean(sneaker$order_amount, trim = 0) #NA data checked
median.sneaker <- median(sneaker$order_amount, trim = 0)

sneaker %>%
    filter( order_amount<5000) %>% #order value limit to 5000
    ggplot( aes(x=order_amount)) +
      geom_histogram(binwidth = 15, aes(y=..density..)) +
      geom_vline(data = sneaker, aes(xintercept=mean.sneaker),   #add mean line
             linetype="dashed") +
      geom_vline(data = sneaker, aes(xintercept=median.sneaker), #add median line
             linetype="solid")

sd(sneaker$order_amount)

## [1] 41282.54

Now we can verify that there seems non-logical understand with all the number: the mean=3185 with the standard deviation=41282.54 and the median=284. In fact, we see that the maximum order_amount is 704000 while 50% order_amount lower than 284 dollar.

Analyze💡

###Finding insights by filter order_amount values.

ggplot(
  sneaker, 
  aes(x=date, 
  y=order_amount,
  color=shop_id)) +
    geom_point(
      size=4, 
      alpha=1/5) +
    geom_label(
      data=sneaker %>% filter(order_amount>10000),
      aes(label=shop_id)) +
        scale_y_continuous(labels=comma) +
        theme(axis.text.x = element_text(angle = 45, hjust=1),
        axis.line = element_line())

It is clearly to recognized that there are shop_id 42 and shop_id 78 contributed a huge value order_amount to total value of the data. With a further investigation, the reason shop_id 42 has that high value order_amount due to the massive number of 2000 items per order from shop_id 42. On the other hand, we have shop_id 78 also contributed a large order_amount value upper than $10,000 due to the item value which is $25,000/ item price.

sneaker_sum <- sneaker %>%
  group_by(total_items) %>%
  summarise(total_order_amount = sum(order_amount)) %>%
  mutate(
    percentage = paste0(round(total_order_amount/sum(total_order_amount)*100,2),"%"))
sneaker_sum

## # A tibble: 8 x 3
##   total_items total_order_amount percentage
##         <int>              <int> <chr>     
## 1           1             763777 4.86%     
## 2           2            1374394 8.74%     
## 3           3            1120803 7.13%     
## 4           4             277672 1.77%     
## 5           5              58470 0.37%     
## 6           6             161460 1.03%     
## 7           8               1064 0.01%     
## 8        2000           11968000 76.11%

sum(sneaker_sum$total_order_amount)

## [1] 15725640

Now, it is more clearly to recognize that the total_items 2000 account for 76.11% of the total order_amount $15,6725,640. This total_items 2000 per order from shop_id 42 is the reason of unbalanced average order value, or massive skewed to the right that lead to drag the mean so far away from the median.

New metrics report for the dataset

It is hard to set one or more than one metrics do evaluate all the aspects of an variable while these variables are probably associated to others.

In the scenario, I would recommend categorize the total_items into groups and then calculate its the AOV. On the other hand, there should be categorize groups between product values like shop_id 78 who sell a pair of sneaker shoe for $25,000 distinct from a group whose sell a pair of sneaker shoe for an average of hundred dollars.

Calculate suggestion new metrics approach

sneaker_aov <- sneaker %>%
  group_by(total_items) %>%
  summarise(AOV = mean(order_amount),
            SD = sd(order_amount),
            median_order_value = median(order_amount)) 
sneaker_aov

## # A tibble: 8 x 4
##   total_items     AOV     SD median_order_value
##         <int>   <dbl>  <dbl>              <dbl>
## 1           1    417.  2593.                153
## 2           2    750.  4761.                306
## 3           3   1191.  7471.                459
## 4           4    948.  5978.                592
## 5           5    759.   161.                765
## 6           6  17940  51154.                948
## 7           8   1064     NA                1064
## 8        2000 704000      0              704000

The total_items 8 has a NA standard deviation because of there is only one transaction.
The total_items 2000 has a median account for $704,000 because of the transaction value per order is 704,000 dollar.

Conclusion

Even there are some high Standard Deviation relative to each total_items value such as total_items 1 has an AOV 417,000 while the SD is 2,593. However, we could basically easy to figure out it’s 50% above the value and 50% below when looking at median value for each observation. Furthermore, we can also understanding the shape distribution when compare AVO with median_order_value by it’s gap. Ultimately, Shopify team or Merchant would make more accurate decision for business with this developed report.

Shopify Technical Challenge Submission

mynguyen811.us@gmail.com

1/12/2022