Libraries used
library(dplyr)
library(knitr)
library(lubridate)
library(ggplot2)
The data is read in for processing:
data <- read.csv("C:/Users/Sharathchandra/Desktop/Skill Assessment/Klaviyo/screening_exercise_orders_v201810.csv",
                 header = TRUE, stringsAsFactors = FALSE)
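A quick structural check right after reading confirms the columns the rest of the notebook relies on (customer_id, gender, date, value); a minimal base-R sketch:
str(data)    # column names and types
head(data)   # first few orders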
The questions are answered below, each with its own code chunk.
Answer: Some customers placed multiple orders, so the data contains several rows per customer with different timestamps. These rows are grouped by customer_id; the order_count column holds the number of orders per customer, and most_recent_order_date holds the timestamp of that customer's latest order. The resulting data frame is sorted in ascending order of customer_id, and only the first 10 rows are displayed in this notebook.
customerOrderCount <- data %>%
  group_by(customer_id) %>%
  mutate(order_count = n()) %>%                    # number of orders per customer
  slice(which.max(as.POSIXct(date))) %>%           # keep only the most recent order row
  select(customer_id, gender, most_recent_order_date = date, order_count)
top10customerOrderCount <- customerOrderCount[1:10, ]
top10customerOrderCount
## # A tibble: 10 x 4
## # Groups:   customer_id [10]
##    customer_id gender most_recent_order_date order_count
##          <int>  <int> <chr>                        <int>
##  1        1000      0 2017-01-01 00:11:31              1
##  2        1001      0 2017-01-01 00:29:56              1
##  3        1002      1 2017-02-19 21:35:31              3
##  4        1003      1 2017-04-26 02:37:20              4
##  5        1004      0 2017-01-01 03:11:54              1
##  6        1005      1 2017-12-16 01:39:27              2
##  7        1006      1 2017-05-09 15:27:20              3
##  8        1007      0 2017-01-01 15:59:50              1
##  9        1008      0 2017-12-17 05:47:48              3
## 10        1009      1 2017-01-01 19:27:17              1
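For reference, the same table can be produced with summarise() instead of mutate() + slice(); a minimal equivalent sketch (customerOrderSummary is just an illustrative name):
customerOrderSummary <- data %>%
  group_by(customer_id, gender) %>%
  summarise(most_recent_order_date = max(as.POSIXct(date)),   # latest order timestamp per customer
            order_count = n()) %>%                            # number of orders per customer
  arrange(customer_id)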
Answer: lubridate's week() maps each date to one of up to 53 seven-day periods counted from January 1, so data$date is mutated into a week-of-the-year number. The orders are then counted per week and displayed as a bar plot.
ordersWeek <- data %>% mutate(week_of_the_year = week(date))   # assign each order to a week of the year
ordersPerWeek <- ordersWeek %>% group_by(week_of_the_year) %>% summarise(number_of_orders = n())   # orders per week
plot <- ggplot(data = ordersPerWeek) +
  geom_col(aes(x = week_of_the_year, y = number_of_orders, fill = number_of_orders)) +   # one bar per week
  geom_text(aes(x = week_of_the_year, y = number_of_orders, label = number_of_orders),
            hjust = -0.5, size = 2) +              # count label at the end of each bar
  labs(x = "Week Number", y = "Number Of Orders Per Week",
       title = "Number Of Orders Per Week (2017)") +
  coord_flip() +
  theme_bw() +                                     # apply the base theme first ...
  theme(axis.text.x = element_text(angle = 90, hjust = 1))   # ... so this tweak is not overridden by it
plot
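knitr is loaded above but not otherwise used; the weekly counts behind the plot can also be rendered as a table with kable(). A short sketch:
kable(head(ordersPerWeek, 10),            # first 10 weeks as a readable table
      col.names = c("Week", "Orders"))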
Answer: The mean order values for gender = 0 and gender = 1 are 363.89 and 350.71 respectively. To test whether this difference is significant, I performed a two-sided Welch t-test of the null hypothesis that the two means are equal, with alpha = 0.05 (95% confidence). The resulting p-value of 0.04816 is below the 0.05 threshold, so the null hypothesis of equal means is rejected: the difference is statistically significant, though only marginally.
meanOrderByGender <- data %>% group_by(gender) %>% summarise(mean_order = mean(value))   # mean order value per gender
gender0 <- data %>% filter(gender == 0)   # orders from gender = 0 customers
gender1 <- data %>% filter(gender == 1)   # orders from gender = 1 customers
t.test(x = gender0$value, y = gender1$value, alternative = "two.sided")   # Welch two-sample t-test
##
## Welch Two Sample t-test
##
## data: gender0$value and gender1$value
## t = 1.9761, df = 13445, p-value = 0.04816
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1065113 26.2567783
## sample estimates:
## mean of x mean of y
## 363.8900 350.7084
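The per-gender means computed above in meanOrderByGender were never printed; displaying them should reproduce the sample estimates shown in the t-test output. A one-line sketch:
kable(meanOrderByGender)   # mean order value per gender; should match the t-test sample estimates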
Answer: I wanted to use caret::confusionMatrix(), but the caret package could not be installed under this R version. Since a single-gender prediction is assumed here (predicted gender = 1 for every customer), FN and TN are both zero. The true positive count is TP = 6712 and the remaining FP = 6759.
Therefore the metrics are: Accuracy = (TP + TN) / (TP + FP + FN + TN) = 6712/13471 = 0.4982 (a 49.82% accurate model); Precision = TP / (TP + FP) = 6712/13471 = 0.4982 (equal to accuracy because TN = FN = 0); Recall = TP / (TP + FN) = 6712/6712 = 1.
The prediction is only about 50% accurate because the gender column is roughly balanced, with approximately equal numbers of gender = 0 and gender = 1 rows. Recall = 1 while precision is low, meaning the classifier is biased toward gender = 1: it captures every gender = 1 customer but misclassifies every gender = 0 customer.
data2 <- data %>% select(customer_id, gender)   # predicted_gender does not exist yet, so select only existing columns
data2$predicted_gender <- 1                     # predict gender = 1 for every customer
data2$positiveCount <- ifelse(data2$gender == data2$predicted_gender, 1, 0)   # 1 = true positive
sum(data2$positiveCount)                        # total true positives (TP)
## [1] 6712
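The full confusion-matrix arithmetic from the formulas above can be reproduced in base R without caret; a minimal sketch over data2 (TP, FP, FN, TN here are local variables, not caret output):
TP <- sum(data2$gender == 1 & data2$predicted_gender == 1)   # 6712 true positives
FP <- sum(data2$gender == 0 & data2$predicted_gender == 1)   # 6759 false positives
FN <- sum(data2$gender == 1 & data2$predicted_gender == 0)   # 0 for an all-ones predictor
TN <- sum(data2$gender == 0 & data2$predicted_gender == 0)   # 0 for an all-ones predictor
accuracy  <- (TP + TN) / (TP + FP + FN + TN)                 # 6712/13471 = 0.4982
precision <- TP / (TP + FP)                                  # equals accuracy since TN = FN = 0
recall    <- TP / (TP + FN)                                  # 1
c(accuracy = accuracy, precision = precision, recall = recall)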
Answer: My favorite tool for data analysis and problem solving has always been R. It makes statistical analysis, modelling, cleaning, and visualization of data very easy. As an example, during my Data Science internship at BI I used R and RStudio against a data lake to deliver meaningful insights to the Analytics and Insights team on production sales and campaign data, then built interactive visualizations and deployed the application on an R Shiny server. Merging two data frames on an ID value (with merge() or dplyr joins) worked well, as did text mining with regular expressions. Model building with the H2O library gave a visual display of models, ROC curves, confusion matrices, etc. It was very easy to explain the insights to executives and generate interest in using the tool.
Please have a look at my GitHub: https://github.com/SharathchandraBangaloreMunibairegowda