Introduction
The purpose of this study is to analyze customers' shopping carts and predict their buying trends, in particular which products will be in a customer's cart in their next order.
The dataset provides roughly 3 million customer order records together with their cart items. This is the combined raw customer data.
################################ Loading the required packages ################################
library(DT) #To display scrollable tables in r markdown
library(data.table) #For import using fread
library(dplyr) #For data manipulation
library(tidyr) #For getting the data in tidy format
library(lubridate) #For extracting and working with dates
library(stringr) #For working with strings
library(ggplot2) #For visualizing data
library(janitor) #For cleaning column names
library(cowplot) #For combining multiple plots
library(knitr) #For presenting tables in the webpage
library(shiny) #For developing the Shiny app
# Load Data
path <- "C:/Users/Sanjiv/Desktop/instaCart/dataset"
aisles <- fread(file.path(path, "aisles.csv"))
departments <- fread(file.path(path, "departments.csv"))
or_pr_prior <- fread(file.path(path, "order_products__prior.csv"))
or_pr_train <- fread(file.path(path, "order_products__train.csv"))
orders <- fread(file.path(path, "orders.csv"))
products <- fread(file.path(path, "products.csv"))
Reshape data
To build the xgboost model and engineer new features, we need to prepare the data sets. We take the following steps. Some columns in the different data frames are of character type, which is a problem for xgboost, which handles only numeric vectors, so we convert the character columns to factors. We create an enriched products view by joining the prior order products with the "products" table via inner_join(); the "aisles" and "departments" tables are dropped later because we no longer need them. We add a "user_id" column to the training order-products table by matching its "order_id" against "order_id" in the "orders" table. Finally, we create a combined orders-products table by joining the prior order products with "orders".
aisles$aisle <- as.factor(aisles$aisle)
departments$department <- as.factor(departments$department)
orders$eval_set <- as.factor(orders$eval_set)
products$product_name <- as.factor(products$product_name)
or_pr_train$user_id <- orders$user_id[match(or_pr_train$order_id, orders$order_id)]
# Products -- Create reorder metrics for the PRODUCTS which are part of prior orders
total_prior_orders <- nrow(orders[orders$eval_set=="prior",])
or_pr_prior <- or_pr_prior %>%
inner_join(products, by= "product_id" )
or_pr_prior$product_name <- NULL
#Metrics on products
product_metrics<- or_pr_prior %>%
group_by(product_id) %>%
summarise(pr_total_orders = n(),
pr_total_orders_ratio = n()/total_prior_orders,
pr_mean_add_to_cart = mean(add_to_cart_order),
pr_reordered_times = sum(reordered)
)
product_metrics$pr_reordered_ratio = product_metrics$pr_reordered_times /
product_metrics$pr_total_orders
Daily and weekly shopping habits
Hour of day:
orders %>%
ggplot(aes(x=order_hour_of_day)) +
geom_histogram(stat="count",fill="skyblue")
# Day of Week:
orders %>%
ggplot(aes(x=order_dow)) +
geom_histogram(stat="count",fill="skyblue")
Although the dataset does not tell us which weekday each order_dow value corresponds to, we can clearly see that most shopping occurs during regular work hours.
When do they purchase again?
orders %>%
ggplot(aes(x=days_since_prior_order)) +
geom_histogram(stat="count",fill="skyblue")
How many items do people buy?
This seems to follow a Poisson Distribution.
or_pr_prior %>%
group_by(order_id) %>%
summarize(n_items = last(add_to_cart_order)) %>%
ggplot(aes(x=n_items))+
geom_histogram(stat="count",fill="skyblue") +
geom_rug()+
coord_cartesian(xlim=c(0,80))
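The Poisson claim above is only an eyeball impression from the histogram. A quick way to check it (a minimal sketch using the same or_pr_prior data; basket_sizes, lambda_hat and pois_fit are names introduced here for illustration only, and ggplot2 >= 3.3 is assumed for after_stat()) is to overlay the Poisson probability mass function with the same mean on a density-scaled histogram:

```r
# Sketch: compare observed basket sizes with a Poisson fit of the same mean
basket_sizes <- or_pr_prior %>%
  group_by(order_id) %>%
  summarize(n_items = last(add_to_cart_order))

lambda_hat <- mean(basket_sizes$n_items)   # Poisson MLE is the sample mean
pois_fit <- data.frame(n_items = 0:80)
pois_fit$density <- dpois(pois_fit$n_items, lambda_hat)

ggplot(basket_sizes, aes(x = n_items)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "skyblue") +
  geom_line(data = pois_fit, aes(y = density), colour = "red") +
  coord_cartesian(xlim = c(0, 80))
```

If the red curve is visibly narrower or wider than the histogram, the Poisson assumption is only approximate; count data like basket sizes are typically over-dispersed.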
Top reorder proportion
Although there is some overlap (notably bananas and milk), many of the high-frequency re-orders are more niche products, such as special kinds of milk and bread.
Top 10
tmp <-or_pr_prior %>%
group_by(product_id) %>%
summarize(proportion_reordered = mean(reordered), n=n()) %>%
filter(n>40) %>%
top_n(10,wt=proportion_reordered) %>%
arrange(desc(proportion_reordered)) %>%
left_join(products,by="product_id")
kable(tmp)
| product_id | proportion_reordered | n | product_name | aisle_id | department_id |
|---|---|---|---|---|---|
| 6433 | 0.941176 | 68 | Raw Veggie Wrappers | 13 | 20 |
| 2075 | 0.931034 | 87 | Serenity Ultimate Extrema Overnight Pads | 126 | 11 |
| 27740 | 0.920792 | 101 | Chocolate Love Bar | 45 | 19 |
| 13875 | 0.911111 | 45 | Simply Sleep Nighttime Sleep Aid | 6 | 2 |
| 31418 | 0.900000 | 60 | Sparking Water | 115 | 7 |
| 35604 | 0.900000 | 100 | Maca Buttercups | 45 | 19 |
| 36543 | 0.895522 | 67 | Bars Peanut Butter | 88 | 13 |
| 26093 | 0.893939 | 66 | Soy Crisps Lightly Salted | 107 | 19 |
| 38251 | 0.891892 | 111 | Benchbreak Chardonnay | 62 | 5 |
| 36801 | 0.885417 | 96 | Organic Blueberry B Mega | 31 | 7 |
# User Metrics -- Create user metrics based on orders and then on products contained in those orders
orders$order_dow_hod <- orders$order_dow * 24 + orders$order_hour_of_day
user_metrics <- orders %>%
filter(eval_set == "prior") %>%
group_by(user_id) %>%
summarise(
user_total_orders = max(order_number),
user_mean_dow = mean(order_dow),
user_mean_hod = mean(order_hour_of_day),
user_mean_dow_hod = mean(order_dow_hod),
user_order_frequency = mean(days_since_prior_order, na.rm=T)
)
test_train_orders <- orders %>%
filter(eval_set != "prior") %>%
select(user_id, order_id, eval_set, days_since_prior_order)
user_metrics <- user_metrics %>%
inner_join(test_train_orders)
User Metrics (continued)
or_pr_prior <- or_pr_prior %>%
inner_join(orders, by = "order_id")
user_metrics2 <- or_pr_prior %>%
group_by(user_id) %>%
summarise(
user_total_products =n(),
user_distinct_products = n_distinct(product_id),
user_total_pr_reorders = sum(reordered)
)
user_metrics2$user_pr_reorder_ratio = user_metrics2$user_total_pr_reorders/
user_metrics2$user_total_products
Create user and product combination metrics
user_product_metrics <- or_pr_prior %>%
group_by(user_id, product_id) %>%
summarise(up_total_orders = n(),
up_mean_add_to_cart= mean(add_to_cart_order),
up_total_reorders = sum(reordered)
)
rm(or_pr_prior)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 976592 52.2 18207705 972.4 22759632 1215.5
## Vcells 78746126 600.8 438413500 3344.9 451345029 3443.5
user_product_metrics <- user_product_metrics %>%
inner_join(product_metrics, by= "product_id") %>%
inner_join(user_metrics, by= "user_id") %>%
inner_join(user_metrics2, by= "user_id")
rm(products, aisles, departments,product_metrics, user_metrics, user_metrics2)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 976639 52.2 11652931 622.4 22759632 1215.5
## Vcells 255915210 1952.5 543736772 4148.4 484661114 3697.7
A few more features…
user_product_metrics$up_ttlOrd_ttlusrOrd_ratio = user_product_metrics$up_total_orders / user_product_metrics$user_total_orders
user_product_metrics$up_ttlOrd_ttlprOrd_ratio = user_product_metrics$up_total_orders / user_product_metrics$pr_total_orders
user_product_metrics$up_ATC_pr_ATC_ratio = user_product_metrics$up_mean_add_to_cart / user_product_metrics$pr_mean_add_to_cart
user_product_metrics$up_reorder_ratio = user_product_metrics$up_total_reorders / user_product_metrics$user_total_orders
user_product_metrics$user_total_pr_reorder_ratio <- user_product_metrics$user_total_pr_reorders/
user_product_metrics$user_total_products
Separate data into training and test sets
train <- user_product_metrics[user_product_metrics$eval_set == "train",]
test <- user_product_metrics[user_product_metrics$eval_set == "test",]
train <- train %>%
left_join(or_pr_train %>% select(user_id, product_id, reordered),
by = c("user_id", "product_id"))
train$reordered[is.na(train$reordered)] <- 0
train$eval_set <- NULL
train$user_id <- NULL
train$product_id <- NULL
train$order_id <- NULL
test$eval_set <- NULL
test$user_id <- NULL
test$reordered <- NULL
rm(or_pr_train, orders, test_train_orders, user_product_metrics)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1052798 56.3 4773040 255.0 22759632 1215.5
## Vcells 290710681 2218.0 939868341 7170.7 827422876 6312.8
Build Model
This step is the most important part of the process for model quality.
We use the train data. For xgboost, the data and the label are supplied together in one object: the data is the matrix of features, and the label is the outcome of our data set, i.e. the binary classification target we will try to predict. In our example, the label we want to predict is the "reordered" column. For xgboost, categorical data must be converted into dummy (numeric) variables before the model is constructed.
The workflow is to wrap the feature matrix and the label in an xgb.DMatrix via xgb.DMatrix(data, label) and then fit the model with xgboost(data = dtrain, ...). Details are here: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
An advantage of decision-tree ensemble methods such as gradient boosting is that they can automatically estimate feature importance from a trained predictive model. In xgboost this is done with xgb.importance(), whose result is a table with all the variables and their Gain, Cover and Frequency. Gain is the improvement in accuracy brought by the splits on that feature; Cover measures the relative number of observations covered by those splits; Frequency simply counts how many times a feature is used across all the grown trees and is only a crude proxy for Gain, so we do not have to rely on it. The Gain column provides the information we are looking for. Finally, we can plot the importances from our model using xgb.plot.importance().
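As a minimal, self-contained sketch of that workflow (using a tiny synthetic data set rather than the Instacart data; the object names traindata, trainlabel, dtrain, bst and imp are placeholders introduced here):

```r
library(xgboost)

# Tiny synthetic example: two numeric features and a binary label
set.seed(1)
traindata  <- matrix(rnorm(200), ncol = 2,
                     dimnames = list(NULL, c("f1", "f2")))
trainlabel <- as.numeric(traindata[, "f1"] + rnorm(100) > 0)

# Bundle features and label into the structure xgboost expects
dtrain <- xgb.DMatrix(data = traindata, label = trainlabel)
bst <- xgboost(data = dtrain, objective = "binary:logistic",
               nrounds = 10, verbose = 0)

# One row per feature with its Gain, Cover and Frequency
imp <- xgb.importance(model = bst)
xgb.plot.importance(imp)
```

The same pattern is used below on the real feature matrix.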
There is a list of parameters which are needed to train our boosted decision-tree model. These are the following:
objective: the type of learner we want, e.g. linear regression, logistic regression, Poisson regression, etc.
eval_metric: the evaluation metric for validation data; a default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
eta: the default value is 0.3. The step-size shrinkage used in each update to prevent overfitting. After each boosting step we can directly get the weights of the new features, and eta shrinks these feature weights to make the boosting process more conservative. Range 0 to 1; a low eta makes the model more robust to overfitting.
max_depth: the default value is 6. The maximum depth (number of splits) of a tree. Range 1 to infinity.
min_child_weight: the default value is 1. The minimum sum of instance weights (hessian) needed in a child. If the tree partition step results in a leaf node with an instance-weight sum below min_child_weight, the building process gives up further partitioning. In linear regression mode this simply corresponds to the minimum number of instances needed in each node. The larger it is, the more conservative the algorithm. Range 0 to infinity.
gamma: the default value is 0. The minimum loss reduction required to make a further partition on a leaf node of the tree. The larger it is, the more conservative the algorithm. Range 0 to infinity.
subsample: the default value is 1. The subsample ratio of the training instances. Setting it to 0.5 means XGBoost randomly samples half of the data instances to grow each tree, which helps prevent overfitting. Range 0 to 1.
colsample_bytree: the default value is 1. The subsample ratio of columns used when constructing each tree. Range 0 to 1.
alpha: L1 regularization term on weights; the default value is 0.
lambda: L2 regularization term on weights; the default value is 1.
nrounds: the number of boosting rounds (trees) in the model.
library(xgboost)
params <- list(
"objective" = "reg:logistic",
"eval_metric" = "logloss",
"eta" = 0.1,
"max_depth" = 6,
"min_child_weight" = 10,
"gamma" = 0.70,
"subsample" = 0.77,
"colsample_bytree" = 0.95,
"alpha" = 2e-05,
"lambda" = 10
)
train <- as.data.frame(train)
test <- as.data.frame(test)
subtrain <- train # %>% sample_frac(0.1)
X <- xgb.DMatrix(as.matrix(subtrain %>% select(-reordered)), label = subtrain$reordered)
model <- xgboost(data = X, params = params, nrounds = 30)
## [12:18:27] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
## [1] train-logloss:0.626671
## [2] train-logloss:0.572368
## [3] train-logloss:0.527316
## [4] train-logloss:0.489568
## [5] train-logloss:0.457691
## [6] train-logloss:0.430599
## [7] train-logloss:0.407456
## [8] train-logloss:0.387610
## [9] train-logloss:0.370535
## [10] train-logloss:0.355820
## [11] train-logloss:0.343101
## [12] train-logloss:0.332121
## [13] train-logloss:0.322606
## [14] train-logloss:0.314357
## [15] train-logloss:0.307211
## [16] train-logloss:0.301041
## [17] train-logloss:0.295682
## [18] train-logloss:0.291027
## [19] train-logloss:0.286999
## [20] train-logloss:0.283512
## [21] train-logloss:0.280478
## [22] train-logloss:0.277865
## [23] train-logloss:0.275595
## [24] train-logloss:0.273631
## [25] train-logloss:0.271942
## [26] train-logloss:0.270475
## [27] train-logloss:0.269208
## [28] train-logloss:0.268118
## [29] train-logloss:0.267173
## [30] train-logloss:0.266367
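As an aside (not part of the original run), the choice of nrounds = 30 could be sanity-checked with xgboost's built-in cross-validation; a short sketch assuming the same X and params objects defined above:

```r
# Hedged sketch: 5-fold cross-validation over 30 boosting rounds
cv <- xgb.cv(params = params, data = X, nrounds = 30,
             nfold = 5, verbose = 0)
# evaluation_log holds the mean train/test logloss per round; the point
# where the test logloss stops improving suggests a reasonable nrounds
tail(cv$evaluation_log)
```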
importance <- xgb.importance(colnames(X), model = model)
xgb.ggplot.importance(importance)
rm(X, importance, subtrain)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2074330 110.8 4773040 255.0 22759632 1215.5
## Vcells 286828963 2188.4 939868341 7170.7 827422876 6312.8
# Apply model -- Make predictions
# Now we use the model we built to classify new data.
# We wrap our test features in an xgb.DMatrix
X <- xgb.DMatrix(as.matrix(test %>% select(-order_id, -product_id)))
# We perform the prediction using predict()
test$reordered <- predict(model, X)
# We apply a threshold so that every prediction above 0.21 is treated as 1
test$reordered <- (test$reordered > 0.21) * 1
# Multiplying by 1 converts the logical column "reordered" to numeric type
# For each order we summarise() the products that were predicted
# as reordered into a single "products" string
submission <- test %>%
filter(reordered == 1) %>%
group_by(order_id) %>%
summarise(
products = paste(product_id, collapse = " ")
)
# We create the table "missing" containing the orders for which,
# according to our prediction, no product will be reordered
# (the %in% operator returns a logical vector indicating, for each element
# of one vector, whether it appears in the other, longer vector)
missing <- data.frame(
order_id = unique(test$order_id[!test$order_id %in% submission$order_id]),
products = "None"
)
# We record the result of the prediction into a file submit.csv
submission <- submission %>% bind_rows(missing) %>% arrange(order_id)
write.csv(submission, file = "submit.csv", row.names = F)
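The 0.21 cutoff above is hard-coded. One way such a threshold could be chosen (a sketch only; it assumes a held-out validation set with true 0/1 labels truth and predicted probabilities pred, which this script does not build) is to scan candidate cutoffs and keep the one that maximizes the F1 score:

```r
# Hypothetical helper: pick the probability cutoff that maximizes F1.
# `truth` and `pred` are assumed to be a 0/1 label vector and a vector of
# predicted probabilities of the same length; they are not created above.
best_f1_threshold <- function(truth, pred, grid = seq(0.05, 0.50, by = 0.01)) {
  f1 <- sapply(grid, function(t) {
    yhat <- as.numeric(pred > t)
    tp <- sum(yhat == 1 & truth == 1)
    precision <- tp / max(sum(yhat == 1), 1)
    recall    <- tp / max(sum(truth == 1), 1)
    if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
  })
  grid[which.max(f1)]
}
```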