Submitted By: Priyanka Gagneja
Email: priyankaigit at gmail dot com

Exercise

The attached CSV file for this exercise lists the customer, date, and dollar value of orders placed at a store in 2017. The gender of each customer is also provided.

Answer the questions below and state any considerations or assumptions you made.

# File read
suppressMessages(suppressWarnings(library(data.table)))
orders<-read.table("data_science_screening_exercise_orders.csv", 
                   header=TRUE, sep=",", row.names=NULL)

# Checking input file info 
head(orders)
##   customer_id gender                date value
## 1        1000      0 2017-01-01 00:11:31 198.5
## 2        1001      0 2017-01-01 00:29:56 338.0
## 3        1002      1 2017-01-01 01:30:31 733.0
## 4        1003      1 2017-01-01 01:34:22 772.0
## 5        1004      0 2017-01-01 03:11:54 508.0
## 6        1005      1 2017-01-01 10:08:05 338.0
class(orders)
## [1] "data.frame"
dim(orders)
## [1] 13471     4
uniqueN(orders$customer_id)
## [1] 8814
suppressMessages(library(lubridate))
orders$date<-as.Date(substr(orders$date,1,10),format = "%Y-%m-%d")
Part (A)

Assemble a dataframe with one row per customer and the following columns:

  • customer_id
  • gender
  • most_recent_order_date
  • order_count (number of orders placed by this customer)
orders<-as.data.table(orders)

order_cnt <- orders[, .(most_recent_order_date = max(date)
                      , order_count = .N )
                                            , by = .(customer_id) ]

gender<-unique(orders[,.(customer_id,gender)])

customers<-merge(gender,order_cnt,by="customer_id")
head(customers)
##    customer_id gender most_recent_order_date order_count
## 1:        1000      0             2017-01-01           1
## 2:        1001      0             2017-01-01           1
## 3:        1002      1             2017-02-19           3
## 4:        1003      1             2017-04-26           4
## 5:        1004      0             2017-01-01           1
## 6:        1005      1             2017-12-16           2

The customers data frame contains the customer level information - customer_id, its gender, latest order date and frequency of orders made in this year.

Part (B)

Plot the count of orders per week

orders$week<-week(orders$date)

suppressMessages(suppressWarnings(library(dplyr)))
order_wk<- orders %>% 
           group_by(week) %>%
           summarise(order_count = n())

head(order_wk)
## # A tibble: 6 x 2
##    week order_count
##   <dbl>       <int>
## 1     1         175
## 2     2         259
## 3     3         356
## 4     4         287
## 5     5         209
## 6     6         198
suppressMessages(library(ggplot2))

plot(order_wk$week,order_wk$order_count, xlab="Week", ylab = "count of orders per week", main = "Distribution of orders by week", type='b')

This object seems to have a peaking sales in May end. There are also some alternating peaks in early part of the year (winter months). Relatively lower sales observed in summer weeks and slowly increasing again in Fall (October and Dec) and so on.
Prominent weekly peaks are:

  • leading effect of Memorial Day(week 20),
  • Columbus day October (week 41/42) ,
  • Thanks giving (Nov/Dec, Black Friday sale in week 47 & 48).
Part (C)

Compute the mean order value for gender 0 and for gender 1. Do you think the difference is significant?

order_g<- orders %>% 
          group_by(customer_id) %>%
          summarise(avg_ord = mean(value))

order_gen<-merge(order_g,gender,by="customer_id")

t.test(avg_ord ~ gender, order_gen,var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  avg_ord by gender
## t = 1.4016, df = 8812, p-value = 0.1611
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.208385 25.327483
## sample estimates:
## mean in group 0 mean in group 1 
##        348.3448        337.7852
# ggplot(order_gen,aes(fill=factor(gender))) + 
# geom_histogram(aes(x=avg_ord))

This is basically a independent 2-sample t-test scenario. An assumption I have taken here is that the variance of the two samples is same.

  • p-value here is high (>0.05) meaning that I fail to reject Null Hypothesis (no difference in mean of avg order values between the two genders).
  • In other words, there is high probability(p-value) that the differene in mean avg order value of the two has come up due to chance.
  • There is not sufficient statistical evidence to reject null hypothesis (Ho) to conclude that there is a difference in the buying behavior by gender (even though visually it does seem so).