Details

author: Sujit Tilakraj Thakur

School email :

Personal email :

contact : +1-(806)0-401-2452

Lets read the data first and convert to type in which we can wrangle the data easily

data <- read_excel("C:\\Users\\sjtha\\OneDrive\\Desktop\\shopify\\Data.xlsx")
str(data)
## tibble [5,000 x 7] (S3: tbl_df/tbl/data.frame)
##  $ order_id      : num [1:5000] 16 61 521 1105 1363 ...
##  $ shop_id       : num [1:5000] 42 42 42 42 42 42 42 42 42 42 ...
##  $ user_id       : num [1:5000] 607 607 607 607 607 607 607 607 607 607 ...
##  $ order_amount  : num [1:5000] 704000 704000 704000 704000 704000 704000 704000 704000 704000 704000 ...
##  $ total_items   : num [1:5000] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ payment_method: chr [1:5000] "credit_card" "credit_card" "credit_card" "credit_card" ...
##  $ created_at    : POSIXct[1:5000], format: "2017-03-07 04:00:00" "2017-03-04 04:00:00" ...
data$order_id <- as.factor(data$order_id)
data$shop_id <- as.factor(data$shop_id)
data$user_id <- as.factor(data$user_id)
data$payment_method <- as.factor(data$payment_method)
str(data)
## tibble [5,000 x 7] (S3: tbl_df/tbl/data.frame)
##  $ order_id      : Factor w/ 5000 levels "1","2","3","4",..: 16 61 521 1105 1363 1437 1563 1603 2154 2298 ...
##  $ shop_id       : Factor w/ 100 levels "1","2","3","4",..: 42 42 42 42 42 42 42 42 42 42 ...
##  $ user_id       : Factor w/ 301 levels "607","700","701",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ order_amount  : num [1:5000] 704000 704000 704000 704000 704000 704000 704000 704000 704000 704000 ...
##  $ total_items   : num [1:5000] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ payment_method: Factor w/ 3 levels "cash","credit_card",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ created_at    : POSIXct[1:5000], format: "2017-03-07 04:00:00" "2017-03-04 04:00:00" ...

Lets see what is the AOV , we already know as given in question but lets still confirm it

total_revenue <- sum(data$order_amount)

we know the total number of order is 5000 , Hence the AOV will be

AOV <- total_revenue / 5000
AOV
## [1] 3145.128

Question No 1- Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

Lets wrangle data to know why the AOV is too much high

trial <- data %>% group_by(total_items) %>% summarise(n=n())
trial
## # A tibble: 8 x 2
##   total_items     n
##         <dbl> <int>
## 1           1  1830
## 2           2  1832
## 3           3   941
## 4           4   293
## 5           5    77
## 6           6     9
## 7           8     1
## 8        2000    17
trial$total_items <- as.factor(trial$total_items)


ggplot(data = trial,
       mapping = 
         aes(x =  total_items,
             y = n , fill = total_items
                 )) + 
  geom_bar(stat = "identity")

As we can see from above that maximum number of total items are 2000 , which is a very big jump from the second largest being 8. Hence AOV is getting shifted to such a great value .

A better way to evaulate AOV could be using buckets sellers based on sale volume and then calculating the AOV

Question 2 - What metric would you report for this dataset?

The metric which I would use to report here would be median , as median is the parameter which is least affected by the extreme range of values

Question 3-What is its value?

medi <- median(data$order_amount)
medi
## [1] 284

The Value is $284


Answer for SQL Questions

Question 1 -How many orders were shipped by Speedy Express in total?

First Question Solution Screenshot

54 orders were shipped by Speedy Express in total

Question 2 -What is the last name of the employee with the most orders?

Second Question Solution Screenshot

Peacock is the last name of the employee with the most orders

Question 3 -c. What product was ordered the most by customers in Germany?

Third Question Solution Screenshot

Boston Crab Meat is the product ordered the most by customers in Germany

The End