Libraries used
library(dplyr)
library(knitr)
library(lubridate)
library(ggplot2)
The data is read in for processing:
data <- read.csv("C:/Users/Sharathchandra/Desktop/Skill Assessment/Klaviyo/screening_exercise_orders_v201810.csv",
                 header = TRUE, stringsAsFactors = FALSE)
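A quick structural check right after reading confirms the columns the rest of the notebook relies on (customer_id, gender, date, value); a minimal base-R sketch:
str(data)    # column names and types
head(data)   # first few orders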
The questions are answered below, each with its own code chunk.
Answer: Some customers placed multiple orders, so the data contains several rows per customer with different timestamps. These rows are grouped by customer_id; the order_count column holds the number of orders per customer, and most_recent_order_date holds the timestamp of that customer's latest order. The resulting data frame is sorted in ascending order of customer_id, and only the first 10 rows are displayed in this notebook.
customerOrderCount <- data %>%
  group_by(customer_id) %>%
  mutate(order_count = n()) %>%                    # number of orders per customer
  slice(which.max(as.POSIXct(date))) %>%           # keep only the most recent order row
  select(customer_id, gender, most_recent_order_date = date, order_count)
top10customerOrderCount <- customerOrderCount[1:10, ]
top10customerOrderCount
## # A tibble: 10 x 4
## # Groups:   customer_id [10]
##    customer_id gender most_recent_order_date order_count
##          <int>  <int> <chr>                        <int>
##  1        1000      0 2017-01-01 00:11:31              1
##  2        1001      0 2017-01-01 00:29:56              1
##  3        1002      1 2017-02-19 21:35:31              3
##  4        1003      1 2017-04-26 02:37:20              4
##  5        1004      0 2017-01-01 03:11:54              1
##  6        1005      1 2017-12-16 01:39:27              2
##  7        1006      1 2017-05-09 15:27:20              3
##  8        1007      0 2017-01-01 15:59:50              1
##  9        1008      0 2017-12-17 05:47:48              3
## 10        1009      1 2017-01-01 19:27:17              1
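For reference, the same table can be produced with summarise() instead of mutate() + slice(); a minimal equivalent sketch (customerOrderSummary is just an illustrative name):
customerOrderSummary <- data %>%
  group_by(customer_id, gender) %>%
  summarise(most_recent_order_date = max(as.POSIXct(date)),   # latest order timestamp per customer
            order_count = n()) %>%                            # number of orders per customer
  arrange(customer_id)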
Answer: lubridate's week() maps each date to one of up to 53 seven-day periods counted from January 1, so data$date is mutated into a week-of-the-year number. The orders are then counted per week and displayed as a bar plot.
ordersWeek <- data %>% mutate(week_of_the_year = week(date))   # assign each order to a week of the year
ordersPerWeek <- ordersWeek %>% group_by(week_of_the_year) %>% summarise(number_of_orders = n())   # orders per week
plot <- ggplot(data = ordersPerWeek) +
  geom_col(aes(x = week_of_the_year, y = number_of_orders, fill = number_of_orders)) +   # one bar per week
  geom_text(aes(x = week_of_the_year, y = number_of_orders, label = number_of_orders),
            hjust = -0.5, size = 2) +              # count label at the end of each bar
  labs(x = "Week Number", y = "Number Of Orders Per Week",
       title = "Number Of Orders Per Week (2017)") +
  coord_flip() +
  theme_bw() +                                     # apply the base theme first ...
  theme(axis.text.x = element_text(angle = 90, hjust = 1))   # ... so this tweak is not overridden by it
plot
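knitr is loaded above but not otherwise used; the weekly counts behind the plot can also be rendered as a table with kable(). A short sketch:
kable(head(ordersPerWeek, 10),            # first 10 weeks as a readable table
      col.names = c("Week", "Orders"))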
Answer: The mean order values for gender = 0 and gender = 1 are 363.89 and 350.71 respectively. To test whether this difference is significant, I performed a two-sided Welch t-test of the null hypothesis that the two means are equal, with alpha = 0.05 (95% confidence). The resulting p-value of 0.04816 is below the 0.05 threshold, so the null hypothesis of equal means is rejected: the difference is statistically significant, though only marginally.
meanOrderByGender <- data %>% group_by(gender) %>% summarise(mean_order = mean(value))   # mean order value per gender
gender0 <- data %>% filter(gender == 0)   # orders from gender = 0 customers
gender1 <- data %>% filter(gender == 1)   # orders from gender = 1 customers
t.test(x = gender0$value, y = gender1$value, alternative = "two.sided")   # Welch two-sample t-test
##
## Welch Two Sample t-test
##
## data: gender0$value and gender1$value
## t = 1.9761, df = 13445, p-value = 0.04816
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1065113 26.2567783
## sample estimates:
## mean of x mean of y
## 363.8900 350.7084
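The per-gender means computed above in meanOrderByGender were never printed; displaying them should reproduce the sample estimates shown in the t-test output. A one-line sketch:
kable(meanOrderByGender)   # mean order value per gender; should match the t-test sample estimates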
Answer: I wanted to use caret::confusionMatrix(), but the caret package could not be installed under this R version. Since a single-gender prediction is assumed here (predicted gender = 1 for every customer), FN and TN are both zero. The true positive count is TP = 6712 and the remaining FP = 6759.
Therefore the metrics are: Accuracy = (TP + TN) / (TP + FP + FN + TN) = 6712/13471 = 0.4982 (a 49.82% accurate model); Precision = TP / (TP + FP) = 6712/13471 = 0.4982 (equal to accuracy because TN = FN = 0); Recall = TP / (TP + FN) = 6712/6712 = 1.
The prediction is only about 50% accurate because the gender column is roughly balanced, with approximately equal numbers of gender = 0 and gender = 1 rows. Recall = 1 while precision is low, meaning the classifier is biased toward gender = 1: it captures every gender = 1 customer but misclassifies every gender = 0 customer.
data2 <- data %>% select(customer_id, gender)   # predicted_gender does not exist yet, so select only existing columns
data2$predicted_gender <- 1                     # predict gender = 1 for every customer
data2$positiveCount <- ifelse(data2$gender == data2$predicted_gender, 1, 0)   # 1 = true positive
sum(data2$positiveCount)                        # total true positives (TP)
## [1] 6712
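The full confusion-matrix arithmetic from the formulas above can be reproduced in base R without caret; a minimal sketch over data2 (TP, FP, FN, TN here are local variables, not caret output):
TP <- sum(data2$gender == 1 & data2$predicted_gender == 1)   # 6712 true positives
FP <- sum(data2$gender == 0 & data2$predicted_gender == 1)   # 6759 false positives
FN <- sum(data2$gender == 1 & data2$predicted_gender == 0)   # 0 for an all-ones predictor
TN <- sum(data2$gender == 0 & data2$predicted_gender == 0)   # 0 for an all-ones predictor
accuracy  <- (TP + TN) / (TP + FP + FN + TN)                 # 6712/13471 = 0.4982
precision <- TP / (TP + FP)                                  # equals accuracy since TN = FN = 0
recall    <- TP / (TP + FN)                                  # 1
c(accuracy = accuracy, precision = precision, recall = recall)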
Answer: My favorite tool for data analysis and problem solving has always been R. It makes statistical analysis, modelling, cleaning, and visualization of data very easy. As an example, during my Data Science internship at BI I used R and RStudio against a data lake to deliver meaningful insights to the Analytics and Insights team on production sales and campaign data, then built interactive visualizations and deployed the application on an R Shiny server. Merging two data frames on an ID value (with merge() or dplyr joins) worked well, as did text mining with regular expressions. Model building with the H2O library gave a visual display of models, ROC curves, confusion matrices, etc. It was very easy to explain the insights to executives and generate interest in using the tool.
Please have a look at my GitHub: https://github.com/SharathchandraBangaloreMunibairegowda