Libraries used
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.3
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.5.3
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
The data is read in for processing.
data <- read.csv("C:/Users/Sharathchandra/Desktop/Skill Assessment/Klaviyo/screening_exercise_orders_v201810.csv", header = TRUE, stringsAsFactors = FALSE)
The questions are answered below, each followed by its code chunk.
Answer: The data contains one row per order, so a customer with several orders appears in several rows with different timestamps. The rows are grouped by customer_id; order_count holds the number of orders per customer, and most_recent_order_date holds the timestamp of that customer's latest order. The result is sorted by customer_id in ascending order, and only the first 10 rows are displayed in this notebook.
customerOrderCount <- data %>% group_by(customer_id) %>% mutate(order_count=n()) %>% slice(which.max(as.POSIXct(date))) %>% select(customer_id,gender,most_recent_order_date=date,order_count)
top10customerOrderCount <- customerOrderCount[c(1:10),]
top10customerOrderCount
## # A tibble: 10 x 4
## # Groups:   customer_id [10]
##    customer_id gender most_recent_order_date order_count
##          <int>  <int> <chr>                        <int>
##  1        1000      0 2017-01-01 00:11:31              1
##  2        1001      0 2017-01-01 00:29:56              1
##  3        1002      1 2017-02-19 21:35:31              3
##  4        1003      1 2017-04-26 02:37:20              4
##  5        1004      0 2017-01-01 03:11:54              1
##  6        1005      1 2017-12-16 01:39:27              2
##  7        1006      1 2017-05-09 15:27:20              3
##  8        1007      0 2017-01-01 15:59:50              1
##  9        1008      0 2017-12-17 05:47:48              3
## 10        1009      1 2017-01-01 19:27:17              1
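The same per-customer table can also be built with a single summarise() call instead of mutate() plus slice(). The sketch below is only an illustration: the `orders` data frame is a hypothetical toy sample, not the real data set, and it assumes that gender is constant for each customer.

```r
library(dplyr)

# Hypothetical toy orders (NOT the real data set); one row per order.
orders <- data.frame(
  customer_id = c(1000, 1002, 1002, 1002, 1003),
  gender      = c(0, 1, 1, 1, 1),
  date        = c("2017-01-01 00:11:31", "2017-01-01 01:10:00",
                  "2017-02-19 21:35:31", "2017-01-15 08:00:00",
                  "2017-04-26 02:37:20"),
  stringsAsFactors = FALSE
)

summary_tbl <- orders %>%
  # Parse the timestamps so that max() compares times, not strings.
  mutate(date = as.POSIXct(date, tz = "UTC")) %>%
  # Assumption: gender never changes within a customer_id.
  group_by(customer_id, gender) %>%
  summarise(most_recent_order_date = max(date),
            order_count = n()) %>%
  ungroup() %>%
  arrange(customer_id)

print(summary_tbl)
```

Collapsing each group with summarise() avoids carrying every order row through the pipeline, and the explicit ungroup() returns a plain tibble for later steps.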
Answer: lubridate's week() numbers weeks in seven-day blocks counted from January 1, so the week number ranges from 1 to 53. A week_of_the_year column is derived from data$date, the orders are counted per week, and the counts are shown as a bar plot.
ordersWeek <- data %>% mutate(week_of_the_year=week(date))
ordersPerWeek <- ordersWeek %>% group_by(week_of_the_year) %>% summarise(number_of_orders=n())
plot <- ggplot(data = ordersPerWeek) +
  geom_bar(mapping = aes(x = week_of_the_year, y = number_of_orders, fill = number_of_orders), stat = "identity", position = "dodge") +
  labs(x = "Weeks' Number", y = "Number Of Orders Per Week", title = "Number Of Orders Per Week (2017)") +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  # coord_flip() draws the weeks on the vertical axis
  coord_flip() +
  geom_text(aes(x = week_of_the_year, y = number_of_orders, label = number_of_orders), hjust = -0.5, size = 2, inherit.aes = TRUE) +
  # theme_bw() is a complete theme; added last, it replaces the theme() settings above
  theme_bw()
plot
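Week 53 deserves a note when reading the plot. The sketch below reproduces the week-numbering rule in base R (complete seven-day periods since January 1, plus one), which is how lubridate's week() is documented to behave. The sample dates are chosen for illustration only and are not taken from the order data.

```r
# Illustrative dates from 2017 (not taken from the order data).
dates <- as.Date(c("2017-01-01", "2017-01-07", "2017-01-08",
                   "2017-12-30", "2017-12-31"))

# POSIXlt stores the day of the year zero-based (January 1 is 0).
day_of_year0 <- as.POSIXlt(dates)$yday

# Complete seven-day periods since January 1, plus one.
week_no <- day_of_year0 %/% 7 + 1

print(data.frame(date = dates, week_of_the_year = week_no))
```

Under this rule, week 53 of a non-leap year such as 2017 contains only December 31, so its bar covers a single day and is not comparable with the full weeks.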
Answer: The mean order values for gender = 0 and gender = 1 are 363.89 and 350.71 respectively. To check whether the difference is significant, I ran a two-sided Welch two-sample t-test of the null hypothesis that the two means are equal, at a significance level of alpha = 0.05 (95% confidence). The p-value is 0.04816, which is below the 0.05 threshold, so the null hypothesis is rejected: the difference is statistically significant at the 5% level. The margin is narrow, though; the 95% confidence interval for the difference, 0.11 to 26.26, only just excludes zero.
meanOrderByGender <- data %>% group_by(gender) %>% summarise(mean_order=mean(value))
gender0 <- data %>% filter(gender==0)
gender1 <- data %>% filter(gender==1)
t.test(x = gender0$value,y=gender1$value,alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: gender0$value and gender1$value
## t = 1.9761, df = 13445, p-value = 0.04816
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1065113 26.2567783
## sample estimates:
## mean of x mean of y
## 363.8900 350.7084
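The quantities reported by t.test() can be recomputed by hand, which makes the decision rule explicit: reject the null hypothesis of equal means when the p-value is below alpha. The sketch below uses simulated values; the seed, sample sizes, means and standard deviations are made-up assumptions and do not come from the order data.

```r
set.seed(42)

# Simulated order values for two groups (assumed parameters, not real data).
x <- rnorm(200, mean = 360, sd = 300)
y <- rnorm(220, mean = 350, sd = 310)

# Welch t statistic: difference in means over the unpooled standard error.
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Built-in test for comparison (Welch is the default, var.equal = FALSE).
res <- t.test(x, y, alternative = "two.sided")

alpha <- 0.05
reject_null <- res$p.value < alpha  # TRUE means the difference is significant

print(c(t = t_manual, df = df_manual, p = p_manual))
print(reject_null)
```

The manual values should agree with res$statistic, res$parameter and res$p.value up to floating-point error.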
Answer: I wanted to use confusionMatrix() from the “caret” library, but I couldn’t install it because of an R version incompatibility. Here a single predicted gender is assumed for every record (I set the predicted gender to 1 for all), so FN and TN are both zero. The correct predictions are TP = 6712, and the remaining 6759 records are FP.
Therefore, the metrics will be:
Accuracy: (TP+TN)/(TP+FP+FN+TN) = 6712/13471 ≈ 0.4983 (49.83% accurate model)
Precision: TP/(TP+FP) = 6712/13471 ≈ 0.4983 (49.83% precise model), identical to accuracy because TN = FN = 0
Recall: TP/(TP+FN) = 6712/6712 = 1
The prediction is only about 50% accurate because the two classes are almost evenly balanced: 6712 records have gender = 1 and 6759 have gender = 0. Recall = 1 shows that every gender = 1 customer is captured, but only because the prediction is 1 for everyone. Every gender = 0 customer is misclassified as gender = 1, which is why precision is low.
data2 <- data %>% select(customer_id, gender, predicted_gender)
data2$predicted_gender <- 1
data2$positiveCount <- ifelse(data2$gender == data2$predicted_gender, 1, 0)
sum(data2$positiveCount)
## [1] 6712
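Without caret, the confusion matrix and the three metrics can be obtained with base R's table(). The sketch below rebuilds the label vectors from the counts reported above (6712 records with gender = 1 and 6759 with gender = 0) instead of reading the CSV; the variable names are illustrative.

```r
# Rebuild the actual labels from the counts reported in this notebook.
actual <- factor(c(rep(1, 6712), rep(0, 6759)), levels = c(0, 1))

# Every record is predicted as gender = 1, as in the chunk above.
predicted <- factor(rep(1, length(actual)), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes.
cm <- table(actual = actual, predicted = predicted)

tp <- cm["1", "1"]  # actual 1, predicted 1
fp <- cm["0", "1"]  # actual 0, predicted 1
fn <- cm["1", "0"]  # actual 1, predicted 0
tn <- cm["0", "0"]  # actual 0, predicted 0

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

print(cm)
print(round(c(accuracy = accuracy, precision = precision, recall = recall), 4))
```

Fixing the factor levels to c(0, 1) keeps the empty "predicted 0" column in the table, so FN and TN appear explicitly as zeros.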