Submitted By: Priyanka Gagneja
Email: priyankaigit at gmail dot com
The attached CSV file for this exercise lists the customer, date, and dollar value of orders placed at a store in 2017. The gender of each customer is also provided.
Answer the questions below and state any considerations or assumptions you made.
# File read
suppressMessages(suppressWarnings(library(data.table)))
orders<-read.table("data_science_screening_exercise_orders.csv",
header=TRUE, sep=",", row.names=NULL)
# Checking input file info
head(orders)
## customer_id gender date value
## 1 1000 0 2017-01-01 00:11:31 198.5
## 2 1001 0 2017-01-01 00:29:56 338.0
## 3 1002 1 2017-01-01 01:30:31 733.0
## 4 1003 1 2017-01-01 01:34:22 772.0
## 5 1004 0 2017-01-01 03:11:54 508.0
## 6 1005 1 2017-01-01 10:08:05 338.0
class(orders)
## [1] "data.frame"
dim(orders)
## [1] 13471 4
uniqueN(orders$customer_id)
## [1] 8814
suppressMessages(library(lubridate))
orders$date<-as.Date(substr(orders$date,1,10),format = "%Y-%m-%d")
Assemble a dataframe with one row per customer and the following columns:
orders<-as.data.table(orders)
order_cnt <- orders[, .(most_recent_order_date = max(date)
, order_count = .N )
, by = .(customer_id) ]
gender<-unique(orders[,.(customer_id,gender)])
customers<-merge(gender,order_cnt,by="customer_id")
head(customers)
## customer_id gender most_recent_order_date order_count
## 1: 1000 0 2017-01-01 1
## 2: 1001 0 2017-01-01 1
## 3: 1002 1 2017-02-19 3
## 4: 1003 1 2017-04-26 4
## 5: 1004 0 2017-01-01 1
## 6: 1005 1 2017-12-16 2
The customers data frame contains the customer level information - customer_id, its gender, latest order date and frequency of orders made in this year.
Plot the count of orders per week
orders$week<-week(orders$date)
suppressMessages(suppressWarnings(library(dplyr)))
order_wk<- orders %>%
group_by(week) %>%
summarise(order_count = n())
head(order_wk)
## # A tibble: 6 x 2
## week order_count
## <dbl> <int>
## 1 1 175
## 2 2 259
## 3 3 356
## 4 4 287
## 5 5 209
## 6 6 198
suppressMessages(library(ggplot2))
plot(order_wk$week,order_wk$order_count, xlab="Week", ylab = "count of orders per week", main = "Distribution of orders by week", type='b')
This object seems to have a peaking sales in May end. There are also some alternating peaks in early part of the year (winter months). Relatively lower sales observed in summer weeks and slowly increasing again in Fall (October and Dec) and so on.
Prominent weekly peaks are:
Compute the mean order value for gender 0 and for gender 1. Do you think the difference is significant?
order_g<- orders %>%
group_by(customer_id) %>%
summarise(avg_ord = mean(value))
order_gen<-merge(order_g,gender,by="customer_id")
t.test(avg_ord ~ gender, order_gen,var.equal = TRUE)
##
## Two Sample t-test
##
## data: avg_ord by gender
## t = 1.4016, df = 8812, p-value = 0.1611
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.208385 25.327483
## sample estimates:
## mean in group 0 mean in group 1
## 348.3448 337.7852
# ggplot(order_gen,aes(fill=factor(gender))) +
# geom_histogram(aes(x=avg_ord))
This is basically a independent 2-sample t-test scenario. An assumption I have taken here is that the variance of the two samples is same.