Name:Aadish Chopra Email:aadishchopra16@gmail.com

Loading all the libraries and the dataset at the beginning of the script Making all the messages and warning silent so that the documentation is more succint

library(readr)
library("dplyr")
library("lubridate")
library("knitr")
orders_table <- read_csv("data_science_screening_exercise_orders.csv")

Exercise Questions

The attached CSV file lists the customer, date, and dollar value of orders placed at a store in 2017. The gender of each customer is also provided.

Complete each of the following activities in a jupyter notebook or similar. Include your name and email at the top. Send back your notebook file.

  1. Assemble a dataframe with one row per customer and the following columns:
    • customer_id
    • gender
    • most_recent_order_date
    • order_count (number of orders placed by this customer)
  2. Plot the count of orders per week.

  3. Compute the mean order value for gender 0 and for gender 1. Do you think the difference is significant?


My solution

Let us view the dataset and get idea about the columns and the nature of values it has

kable(head(orders_table),caption = "Orders_Table")
Orders_Table
customer_id gender date value
1000 0 2017-01-01 00:11:31 198.5
1001 0 2017-01-01 00:29:56 338.0
1002 1 2017-01-01 01:30:31 733.0
1003 1 2017-01-01 01:34:22 772.0
1004 0 2017-01-01 03:11:54 508.0
1005 1 2017-01-01 10:08:05 338.0
names(orders_table)

[1] “customer_id” “gender” “date” “value”

We created a helping dataset which we will use later. This dataset shows the number of orders placed by a customer

str(orders_table)
cust_order_count<-orders_table %>% group_by(customer_id,gender) %>% summarise(order_count=n())

To find the most recent order we can use arrange the table in descending order and then take the maximum value for each individual customer. Date format is in a standard posix so there won’t be any difficulty in playing around with the dates

Solving Part A)

cust_date<-orders_table %>% group_by(customer_id) %>% mutate(order_count=n()) %>% slice(which.max(date)) %>% select(customer_id,gender,most_recent_order_date=date,order_count)
## Warning: package 'bindrcpp' was built under R version 3.5.1
kable(head(cust_date),caption = "Customer_Orders_Table")
Customer_Orders_Table
customer_id gender most_recent_order_date order_count
1000 0 2017-01-01 00:11:31 1
1001 0 2017-01-01 00:29:56 1
1002 1 2017-02-19 21:35:31 3
1003 1 2017-04-26 02:37:20 4
1004 0 2017-01-01 03:11:54 1
1005 1 2017-12-16 01:39:27 2

Running the above code would solve Part A)

Solving Part B)

Assumption: Assuming weeks by their natural order i.e year starts from January 1 and ends at Dec 31

Not taking fiscal year, quarters or some other business logic For calculation of part B lubridate would be quite useful

what_week_of_year shows in what week of the year the order was made. The count of orders would simply be a frequency analysis of what_week_of_year

orders_week<-orders_table %>% mutate(what_week_of_year=week(date))
kable(head(orders_week),caption = "Orders_Week")
Orders_Week
customer_id gender date value what_week_of_year
1000 0 2017-01-01 00:11:31 198.5 1
1001 0 2017-01-01 00:29:56 338.0 1
1002 1 2017-01-01 01:30:31 733.0 1
1003 1 2017-01-01 01:34:22 772.0 1
1004 0 2017-01-01 03:11:54 508.0 1
1005 1 2017-01-01 10:08:05 338.0 1
num_of_orders_weekwise<-orders_week %>% group_by(what_week_of_year) %>% summarise(number_of_orders=n())

kable(head(num_of_orders_weekwise),caption = "Count of order per week")
Count of order per week
what_week_of_year number_of_orders
1 175
2 259
3 356
4 287
5 209
6 198

The above table gives us a tidier version and is more natural and intuitive to see and infer. We can use plot function because there are simply two variables and nothing fancy is to be shown.

plot(orders_week %>% group_by(what_week_of_year) %>% summarise(number_of_orders=n()),type="b",pch=19,cex=1)

kable(num_of_orders_weekwise[which.max(num_of_orders_weekwise$number_of_orders),],caption = "Max order in week")
Max order in week
what_week_of_year number_of_orders
20 554

Although not desired but the maximum orders were placed in the week is 20th week starting from Jan 1 as stated in the assumption above

Answer to Part C)

mean_order<-orders_table %>% group_by(gender) %>% summarise(mean_order_value=mean(value))
mean_order
## # A tibble: 2 x 2
##   gender mean_order_value
##    <int>            <dbl>
## 1      0             364.
## 2      1             351.

To test for significant difference we need to perform a t-test. Performing a two sided t-test

gender_0<-orders_table %>% filter(gender==0)
gender_1<-orders_table %>% filter(gender==1)

t.test(x = gender_0$value,y=gender_1$value,alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  gender_0$value and gender_1$value
## t = 1.9761, df = 13445, p-value = 0.04816
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.1065113 26.2567783
## sample estimates:
## mean of x mean of y 
##  363.8900  350.7084

Although the p-valye is less than 5% threshold and the result may be considered significant, but the confidence interval is too broad. Hence, it is not significantly different

Thank you for the data challenge

Please note that answers have been striclty restricted to questions asked