Loading all the libraries and the dataset at the beginning of the script. Messages and warnings are silenced so that the documentation is more succinct.
library(readr)
library(dplyr)
library(lubridate)
library(knitr)
orders_table <- read_csv("data_science_screening_exercise_orders.csv")
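One way to silence messages and warnings globally is to set the knitr chunk options in the setup chunk of the R Markdown document; this is a sketch of the approach assumed here, and per-chunk options would work equally well.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)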
The attached CSV file lists the customer, date, and dollar value of orders placed at a store in 2017. The gender of each customer is also provided.
Complete each of the following activities in a Jupyter notebook or similar. Include your name and email at the top. Send back your notebook file.
- Plot the count of orders per week.
- Compute the mean order value for gender 0 and for gender 1. Do you think the difference is significant?
My solution
Let us view the dataset to get an idea of the columns and the nature of the values it contains.
kable(head(orders_table),caption = "Orders_Table")
customer_id | gender | date | value |
---|---|---|---|
1000 | 0 | 2017-01-01 00:11:31 | 198.5 |
1001 | 0 | 2017-01-01 00:29:56 | 338.0 |
1002 | 1 | 2017-01-01 01:30:31 | 733.0 |
1003 | 1 | 2017-01-01 01:34:22 | 772.0 |
1004 | 0 | 2017-01-01 03:11:54 | 508.0 |
1005 | 1 | 2017-01-01 10:08:05 | 338.0 |
names(orders_table)
## [1] "customer_id" "gender" "date" "value"
str(orders_table)
We create a helper dataset that we will use later. It records the number of orders placed by each customer.
cust_order_count <- orders_table %>% group_by(customer_id, gender) %>% summarise(order_count = n())
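The same helper table could also be built with dplyr::count(); this is just an equivalent shorthand shown for reference (cust_order_count_alt is my own name and is not used later).
cust_order_count_alt <- orders_table %>% count(customer_id, gender) %>% rename(order_count = n)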
To find each customer's most recent order, we take the row with the maximum date within each customer group (equivalently, we could arrange by date in descending order and keep the first row per customer; see the alternative after Part A below). The dates are already in a standard POSIXct format, so there is no difficulty in working with them.
cust_date <- orders_table %>% group_by(customer_id) %>% mutate(order_count = n()) %>% slice(which.max(date)) %>% select(customer_id, gender, most_recent_order_date = date, order_count)
kable(head(cust_date),caption = "Customer_Orders_Table")
customer_id | gender | most_recent_order_date | order_count |
---|---|---|---|
1000 | 0 | 2017-01-01 00:11:31 | 1 |
1001 | 0 | 2017-01-01 00:29:56 | 1 |
1002 | 1 | 2017-02-19 21:35:31 | 3 |
1003 | 1 | 2017-04-26 02:37:20 | 4 |
1004 | 0 | 2017-01-01 03:11:54 | 1 |
1005 | 1 | 2017-12-16 01:39:27 | 2 |
The code above produces the one-row-per-customer table required for Part A.
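As mentioned above, an equivalent formulation arranges each customer's orders by date in descending order and keeps the first row per customer; this is a sketch for reference only (cust_date_alt is my own name and is not used later).
cust_date_alt <- orders_table %>% group_by(customer_id) %>% mutate(order_count = n()) %>% arrange(desc(date)) %>% slice(1) %>% select(customer_id, gender, most_recent_order_date = date, order_count)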
Assumption: weeks are taken in their natural order, i.e. the year starts on January 1 and ends on December 31. No fiscal year, quarters, or other business calendar logic is applied. For Part B, lubridate is quite useful.
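As a side note on this assumption: lubridate::week() numbers weeks as consecutive 7-day blocks starting January 1, which matches the choice above, whereas isoweek() follows ISO 8601 (weeks start on Monday), so January 1, 2017, a Sunday, falls in ISO week 52 of the previous year. A small sketch, added here only for illustration:
week(ymd("2017-01-01"))    # 1: 7-day blocks counted from Jan 1
isoweek(ymd("2017-01-01")) # 52: ISO 8601 weeks start on Monday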
what_week_of_year records the week of the year in which each order was placed. The count of orders per week is then simply a frequency count of what_week_of_year.
orders_week <- orders_table %>% mutate(what_week_of_year = week(date))
kable(head(orders_week),caption = "Orders_Week")
customer_id | gender | date | value | what_week_of_year |
---|---|---|---|---|
1000 | 0 | 2017-01-01 00:11:31 | 198.5 | 1 |
1001 | 0 | 2017-01-01 00:29:56 | 338.0 | 1 |
1002 | 1 | 2017-01-01 01:30:31 | 733.0 | 1 |
1003 | 1 | 2017-01-01 01:34:22 | 772.0 | 1 |
1004 | 0 | 2017-01-01 03:11:54 | 508.0 | 1 |
1005 | 1 | 2017-01-01 10:08:05 | 338.0 | 1 |
num_of_orders_weekwise <- orders_week %>% group_by(what_week_of_year) %>% summarise(number_of_orders = n())
kable(head(num_of_orders_weekwise),caption = "Count of order per week")
what_week_of_year | number_of_orders |
---|---|
1 | 175 |
2 | 259 |
3 | 356 |
4 | 287 |
5 | 209 |
6 | 198 |
The table above is a tidy summary that is natural and intuitive to read. The base plot() function is sufficient here, since there are only two variables and nothing elaborate needs to be shown.
plot(num_of_orders_weekwise, type = "b", pch = 19, cex = 1)
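The default axis labels are taken from the column names; purely as a cosmetic sketch (same data, with labels of my own choosing), the plot could be labelled explicitly:
plot(num_of_orders_weekwise$what_week_of_year, num_of_orders_weekwise$number_of_orders,
     type = "b", pch = 19, cex = 1,
     xlab = "Week of year (counted from Jan 1)", ylab = "Number of orders",
     main = "Count of orders per week")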
kable(num_of_orders_weekwise[which.max(num_of_orders_weekwise$number_of_orders),],caption = "Max order in week")
what_week_of_year | number_of_orders |
---|---|
20 | 554 |
Although not explicitly asked for, the largest number of orders was placed in week 20, with weeks counted from January 1 as stated in the assumption above.
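For reference, under this week numbering week 20 covers days 134 to 140 of the year; a small lubridate sketch (my own addition) of the corresponding calendar dates:
week_20_start <- ymd("2017-01-01") + days((20 - 1) * 7)
week_20_end <- ymd("2017-01-01") + days(20 * 7 - 1)
c(week_20_start, week_20_end) # expected: "2017-05-14" "2017-05-20"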
mean_order <- orders_table %>% group_by(gender) %>% summarise(mean_order_value = mean(value))
mean_order
## # A tibble: 2 x 2
## gender mean_order_value
## <int> <dbl>
## 1 0 364.
## 2 1 351.
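Before running a test, it can help to look at the group sizes and the spread of order values; a sketch (n_orders, mean_value, and sd_value are column names of my own choosing):
order_value_summary <- orders_table %>% group_by(gender) %>% summarise(n_orders = n(), mean_value = mean(value), sd_value = sd(value))
order_value_summary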
To test whether the difference is significant, we can perform a t-test; here, a two-sided Welch two-sample t-test.
gender_0 <- orders_table %>% filter(gender == 0)
gender_1 <- orders_table %>% filter(gender == 1)
t.test(x = gender_0$value, y = gender_1$value, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: gender_0$value and gender_1$value
## t = 1.9761, df = 13445, p-value = 0.04816
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1065113 26.2567783
## sample estimates:
## mean of x mean of y
## 363.8900 350.7084
Although the p-value is just below the 5% threshold, so the result is nominally statistically significant, the confidence interval is wide and its lower bound is very close to zero; the estimated difference of about $13 is also small relative to mean order values of roughly $351 and $364. Hence, I would not consider the mean order values of the two genders to be meaningfully different.
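As a robustness check (my own addition, not required by the exercise), order values are often right-skewed, so a non-parametric Wilcoxon rank-sum test could be run alongside the t-test to see whether the conclusion changes:
wilcox.test(x = gender_0$value, y = gender_1$value, alternative = "two.sided")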
Thank you for the data challenge.
Please note that the answers have been strictly restricted to the questions asked.