Companies’ need for data processing has generated a new science, methodology, and tools to analyze large masses of data such as Big Data or data mining. This science has developed methods and predictive algorithms that allow us to know hidden behavior patterns of the individual that, with other techniques, it is impossible to know. This helps companies understand how their customers behave when making a purchase. Once their behavior pattern is known, companies better adapt to it to achieve maximum customer satisfaction, trust, and commitment.
There are currently many tools to carry out this type of analysis. In this work, we will focus on one of them, the association rules, a methodology that we consider to have great potential. We will use the apriori algorithm.
Our goal is to earn customer loyalty and increase revenue through machine-readable marketing data, which allows us to create a user behavior profile and more accurately predict revenue.
The data analyzed came for kaggle, https://www.kaggle.com/c/instacart-market-basket-analysis/data. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users
Once the dataset is downloaded, it weighs 200.7 MB in zip format. Once unzipped, six files appear:
departments.csv: Collect name and id for different departments.
aisle.csv: That collects the corridors and the different sections.
products.csv: Collect all the products that are for sale in the store.
orders.csv: Collect all the orders made by customers, or which is the same, all the shopping carts (Market_Basket).
order_products_prior.csv: Collect the products that were purchased in an order and all the customer’s purchase history throughout that year.
order_products_train.csv: Collect the products that were purchased in an order reflecting the last basket purchased by each customer.
For this analysis, we will use the order_products_train.csv, which offers a data sample of the order_products_prior.csv
## library(arules) ## Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
## library(arulesViz) ## Extends package 'arules' with various visualization techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.
## library(tidyverse)
## library(ggplot2)
## library(plyr) ## Tools for Splitting, Applying and Combining Data
## library(dplyr)
## library(readxl)
## library(treemap)
## library(d3treeR)
## library(colorspace)
## library(plotly)
orders <- read_csv("C:/Users/karolo/Desktop/Instacart/orders.csv")
order_products <- read_csv("C:/Users/karolo/Desktop/Instacart/order_products__train.csv")
products <- read_csv("C:/Users/karolo/Desktop/Instacart/products.csv")
aisles <- read_csv("C:/Users/karolo/Desktop/Instacart/aisles.csv")
departments <- read_csv("C:/Users/karolo/Desktop/Instacart/departments.csv")
str(products) ## Linked to departments and aisles by the id
## spec_tbl_df [49,688 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ product_id : num [1:49688] 1 2 3 4 5 6 7 8 9 10 ...
## $ product_name : chr [1:49688] "Chocolate Sandwich Cookies" "All-Seasons Salt" "Robust Golden Unsweetened Oolong Tea" "Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce" ...
## $ aisle_id : num [1:49688] 61 104 94 38 5 11 98 116 120 115 ...
## $ department_id: num [1:49688] 19 13 7 1 13 11 7 1 16 7 ...
## - attr(*, "spec")=
## .. cols(
## .. product_id = col_double(),
## .. product_name = col_character(),
## .. aisle_id = col_double(),
## .. department_id = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(departments)
## spec_tbl_df [21 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ department_id: num [1:21] 1 2 3 4 5 6 7 8 9 10 ...
## $ department : chr [1:21] "Frozen" "Other" "Bakery" "Produce" ...
## - attr(*, "spec")=
## .. cols(
## .. department_id = col_double(),
## .. department = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(aisles)
## spec_tbl_df [134 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ aisle_id: num [1:134] 1 2 3 4 5 6 7 8 9 10 ...
## $ aisle : chr [1:134] "prepared soups salads" "specialty cheeses" "energy granola bars" "instant foods" ...
## - attr(*, "spec")=
## .. cols(
## .. aisle_id = col_double(),
## .. aisle = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(order_products) # Connnected to Products and Order by order_id
## spec_tbl_df [1,384,617 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ order_id : num [1:1384617] 1 1 1 1 1 1 1 1 36 36 ...
## $ product_id : num [1:1384617] 49302 11109 10246 49683 43633 ...
## $ add_to_cart_order: num [1:1384617] 1 2 3 4 5 6 7 8 1 2 ...
## $ reordered : num [1:1384617] 1 1 0 0 1 0 0 1 0 1 ...
## - attr(*, "spec")=
## .. cols(
## .. order_id = col_double(),
## .. product_id = col_double(),
## .. add_to_cart_order = col_double(),
## .. reordered = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(orders)
## spec_tbl_df [3,421,083 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ order_id : num [1:3421083] 2539329 2398795 473747 2254736 431534 ...
## $ user_id : num [1:3421083] 1 1 1 1 1 1 1 1 1 1 ...
## $ eval_set : chr [1:3421083] "prior" "prior" "prior" "prior" ...
## $ order_number : num [1:3421083] 1 2 3 4 5 6 7 8 9 10 ...
## $ order_dow : num [1:3421083] 2 3 3 4 4 2 1 1 1 4 ...
## $ order_hour_of_day : chr [1:3421083] "08" "07" "12" "07" ...
## $ days_since_prior_order: num [1:3421083] NA 15 21 29 28 19 20 14 0 30 ...
## - attr(*, "spec")=
## .. cols(
## .. order_id = col_double(),
## .. user_id = col_double(),
## .. eval_set = col_character(),
## .. order_number = col_double(),
## .. order_dow = col_double(),
## .. order_hour_of_day = col_character(),
## .. days_since_prior_order = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
orders$order_hour_of_day <- as.numeric(orders$order_hour_of_day)
products <- products%>% mutate(category = ifelse(grepl("Organic", products$product_name), "Organic", "Non-organic"))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
df <- products %>% group_by(department_id, aisle_id, product_id, product_name, category) %>% dplyr::summarize(n=n())
df <- df %>% left_join(departments,by="department_id")
df<- df %>% left_join(aisles,by="aisle_id")
data <- merge(x= order_products,y= df, by= "product_id")
data <- merge(x= data, y= orders, by= "order_id")
data <- data %>% mutate(product_name = as.factor(product_name))
head(data)
## order_id product_id add_to_cart_order reordered department_id aisle_id
## 1 1 13176 6 0 4 24
## 2 1 49302 1 1 16 120
## 3 1 22035 8 1 16 21
## 4 1 11109 2 1 16 108
## 5 1 47209 7 0 4 24
## 6 1 49683 4 0 4 83
## product_name category n department
## 1 Bag of Organic Bananas Organic 1 Produce
## 2 Bulgarian Yogurt Non-organic 1 Dairy Eggs
## 3 Organic Whole String Cheese Organic 1 Dairy Eggs
## 4 Organic 4% Milk Fat Whole Milk Cottage Cheese Organic 1 Dairy Eggs
## 5 Organic Hass Avocado Organic 1 Produce
## 6 Cucumber Kirby Non-organic 1 Produce
## aisle user_id eval_set order_number order_dow
## 1 fresh fruits 112108 train 4 4
## 2 yogurt 112108 train 4 4
## 3 packaged cheese 112108 train 4 4
## 4 other creams cheeses 112108 train 4 4
## 5 fresh fruits 112108 train 4 4
## 6 fresh vegetables 112108 train 4 4
## order_hour_of_day days_since_prior_order
## 1 10 9
## 2 10 9
## 3 10 9
## 4 10 9
## 5 10 9
## 6 10 9
day_week <- orders %>%
mutate(day = as.factor(order_dow)) %>%
mutate(hour = as.factor(order_hour_of_day)) %>%
group_by(day,hour) %>%
summarise(count = n()) %>%
arrange(desc(count))
day_weekp <-day_week %>%
ggplot(aes(x=day, y=hour))+
geom_tile(aes(fill=count), colour = "white") +
scale_fill_gradient(name= "Number of\nOrders", low = "#fff1e6",high = "#00835C")+
scale_x_discrete( position = "top",
breaks = c("0", "1", "2", "3", "4", "5", "6"),
label = c("Sunday", "Monday", "Tuesday", "Wednesday","Thursday", "Friday", "Saturday"),
expand=c(0,0))+
scale_y_discrete(
breaks = c("0", "6", "12", "18", "23"),
label = c("12am", "6am", "12pm", "6pm", "11pm"),
expand=c(0,0))+
labs(title="Which Day and What Time\nDo Customers Order the Most?",
x="",
y="",
caption = "Maximum number of orders are placed between 9:00am and 4:00pm on Sunday and Monday. There is also a big number of orders during Firday\nand Saturday.")+
theme_classic()+
theme(
axis.line=element_blank(),
axis.ticks=element_line(size=0.4),
axis.text = element_text(size= 10, color= "#00835C"),
axis.line.x = element_line(color= "#00835C" ),
plot.background=element_blank(),
plot.title = element_text(size =10, face = "bold", hjust = 0.50, vjust = 1),
plot.caption = element_text(hjust = 0, size = 7, margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"), color = "#718c9e"),
panel.grid = element_blank(),
legend.position = "bottom",
legend.title = element_text(size= 8),
legend.margin=margin(grid::unit(0,"cm")),
legend.key.width=grid::unit(2,"cm"),
legend.key.height=grid::unit(0.2,"cm")
)
day_weekp
So, from the above graph, its clear that Maximum no of orders are placed between 10:00AM and 5:00PM on Sunday and Monday.
Actions: Develop strategies to increase sales during the quietest days of the week:
Offer discounts on certain items.
Offer free delivery with orders over a specific dollar amount.
Positive note: Customers can shop on a Wednesday or Thursday, and most likely, they will never encounter “out of stock” items or delivery/pick delays.
Prod_portf <- data %>%
group_by(product_id) %>%
dplyr::summarize(count=n()) %>%
left_join(products,by="product_id") %>%
ungroup() %>%
group_by(department_id,aisle_id) %>%
dplyr::summarize(sumcount = sum(count)) %>%
left_join(df, by = c("department_id", "aisle_id")) %>%
mutate(onesize = 1)
## Treemap
p <- treemap(Prod_portf,index=c("department","aisle"),
vSize="n",vColor="department",
#Main
palette= "Set3",
title="Product Portfolio",
sortID="-sumcount",
#Borders
border.lwds=c(0,0),
border.col=c("#343538","transparent"),
type="categorical",
fontsize.legend = 0,
fontsize.labels=c(9,1),
fontsize.title=12,
overlap.labels=0.5,
inflate.labels=F,
bg.labels=c("transparent"),
align.labels=list(
c("center", "center"),
c("right", "bottom")))
d3tree2( p , rootname = "Portfolio" )
type <- data %>%
group_by(product_id)%>%
dplyr::summarize(count = n()) %>%
top_n(20, wt = count) %>%
left_join(select(products, product_id, product_name, category), by="product_id") %>%
arrange(desc(count))
best <- type %>%
ggplot(aes(x=reorder(product_name,count), y=count, color= category, text= paste0(product_name, ", Total Orders:", count)))+ geom_point(size= 2)+
geom_segment(aes(x=reorder(product_name,count),
xend=reorder(product_name,count),
y=0,
yend=count), size=0.8)+
scale_y_continuous(labels = scales::comma) +
labs(title="Bestsellers Products",
subtitle = "Organic vs Non-Organic",
y="",
x="",
legend = "")+
scale_color_manual("", values = c("#FF8200", "#0AAD0A"))+
theme_minimal()+
theme(
axis.text.x= element_text( size= 7),
axis.text.y= element_text( size= 8),
plot.title = element_text(hjust=0.5, size= 12, face = "bold", vjust = 0.5),
plot.subtitle = element_text(hjust=0.5, size= 9, vjust = 0.5),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid = element_line(color = "#e5e5e5"))+
coord_flip()
ggplotly(best, tooltip = "text")
The highest ordered products are Banana, Bag of Organic Bananas, Organic Strawberries and Organic Baby Spinach. So to increase the sale of Strawberries the retailer can put it near the Bananas.
Categorically we see what areas we can improve sales on by addressing various marketing strategies.
library(flextable)
set_flextable_defaults(font.family = "Arial", font.size = 9, padding = 2)
less_p <- data %>%
group_by(product_id)%>%
dplyr::summarize(count = n()) %>%
top_n(-6, wt = count) %>%
left_join(select(products, product_id, product_name, category), by="product_id") %>%
arrange(desc(count))%>%
mutate(Number_of_Orders= count,
Product = product_name,
Category= category)%>%
select(Product, Number_of_Orders, Category)
f <-flextable(head(less_p, n= 5))%>%
align(
j = c("Product", "Number_of_Orders", "Category"),
align = "center", part = "all") %>%
padding(padding = 3, part = "all")%>%
theme_zebra(odd_header = "#F26A58", even_header = "#F26A58",
odd_body = "#fff5f5", even_body = "#f8f9fa") %>%
color(part = "header", color = "white")%>%
align(align = "right", part = "all")%>% autofit()
f
Product | Number_of_Orders | Category |
Green Chile Anytime Sauce | 1 | Non-organic |
Pure Coconut Water With Orange | 1 | Non-organic |
Saline Nasal Mist | 1 | Non-organic |
Fresh Scent Dishwasher Cleaner | 1 | Non-organic |
Mint Chocolate Flavored Syrup | 1 | Non-organic |
less_org <- data %>%
group_by(product_id)%>%
filter(category== "Organic")%>%
dplyr::summarize(count = n()) %>%
top_n(-6, wt = count) %>%
left_join(select(products, product_id, product_name, category), by="product_id") %>%
arrange(desc(count))%>%
mutate(Number_of_Orders= count,
Product = product_name,
Category= category)%>%
select(Product, Number_of_Orders, Category)
f_org <-flextable(head(less_org, n= 5))%>%
align(
j = c("Product", "Number_of_Orders", "Category"),
align = "center", part = "all") %>%
padding(padding = 3, part = "all")%>%
theme_zebra(odd_header = "#0AAD0A", even_header = "#0AAD0A",
odd_body = "#fff5f5" , even_body = "#FAF1E5") %>%
color(part = "header", color = "white")%>%
align(align = "right", part = "all")%>% autofit()
f_org
Product | Number_of_Orders | Category |
Organic Spaghetti Style Pasta | 1 | Organic |
Organic Vegetable Greens & Greens Juice Blend | 1 | Organic |
Organic Peppermint Lemonade | 1 | Organic |
Organic Lemon Gingersnap | 1 | Organic |
Reg; Organic Apple Cider Vinegar 16 Fl Oz | 1 | Organic |
Market Basket Analysis is a technique that identifies the strength of association between pairs of products purchased together and identifies patterns of co-occurrence. A co-occurrence is when two or more things take place concurrently.
To conduct an Association Rule Mining or Market Basket Analysis, we need to build the transaction dataset. We need to gather information about user_id, order_id, and products_name. This last feature needs to be converted into one row, so we can get all the items that have been bought in the same transaction or order.
transactionData <- ddply(data,c("user_id","order_id"),
function(df1)paste(df1$product_name,
collapse = ","))
write.csv(transactionData,"C:/Users/karolo/Desktop/Instacart/market_basket_transactions.csv", quote = FALSE, row.names = FALSE)
With the read.transactions () function you can directly read data from text-type files and store it in an object of type transactions, which is the storage structure used by arules
tr <- read.transactions("C:/Users/karolo/Desktop/Instacart/market_basket_transactions.csv", format = 'basket', sep=',', header = TRUE, rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1
## 1
summary(tr)
## transactions as itemMatrix in sparse format with
## 131209 rows (elements/itemsets/transactions) and
## 306430 columns (items) and a density of 3.867044e-05
##
## most frequent items:
## Banana Bag of Organic Bananas Organic Strawberries
## 17611 14604 10241
## Organic Baby Spinach Large Lemon (Other)
## 9326 7724 1495292
##
## element (itemset/transaction) length distribution:
## sizes
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 7929 8337 8887 8798 9332 8999 8719 8098 7239 6454 5888 5229 4636 4220 3603 3332
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 2893 2521 2279 1877 1656 1461 1271 1118 924 821 726 548 524 448 373 328
## 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 264 237 212 142 134 125 107 66 59 65 42 55 33 28 28 19
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## 16 17 16 17 10 5 5 6 3 3 1 5 1 1 2 4
## 67 68 70 72 74 76 78 82
## 2 2 1 3 1 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 10.00 11.85 15.00 82.00
##
## includes extended item information - examples:
## labels
## 1 #2 Coffee Filters
## 2 #2 Cone White Coffee Filters
## 3 #2 Mechanical Pencils
The summary(tr) is a useful command that gives us information about our transaction object. Let’s take a look at what the above output says:
There are 131,209 transactions (rows) and 306,430 items (columns). Note that 306,433 are the product names involved in the dataset, and 131,209 transactions are collections of these items.
Density tells the percentage of nonzero cells in a sparse matrix. You can say it as the total number of purchased items divided by a possible number of items in that matrix. You can calculate how many items were purchased by using density: 131,209 x 306,430 x3.867004e-05
The summary can also tell you the most frequent items.
Element (itemset/transaction) length distribution: This tells you how many transactions are there for 1-itemset, for 2-itemset, and so on. The first row indicates the number of items, and the second row shows the number of transactions.
For example, there are 7929 transactions for three items, 8337 transactions for four things, and 82 items in one transaction.
One of the first analyses that should be carried out when working with transactions is exploring their content and size. The inspect () function shows the features that make up each transaction.
inspect(tr[1:3])
## items
## [1] {1,
## 112108,
## Bag of Organic Bananas,
## Bulgarian Yogurt,
## Cucumber Kirby,
## Lightly Smoked Sardines in Olive Oil,
## Organic 4% Milk Fat Whole Milk Cottage Cheese,
## Organic Celery Hearts,
## Organic Hass Avocado,
## Organic Whole String Cheese}
## [2] {36,
## 79431,
## Asparagus,
## Cage Free Extra Large Grade AA Eggs,
## Grated Pecorino Romano Cheese,
## Organic Garnet Sweet Potato (Yam),
## Organic Half & Half,
## Prosciutto Americano,
## Spring Water,
## Super Greens Salad}
## [3] {38,
## 42756,
## Bunched Cilantro,
## Flat Parsley Bunch,
## Fresh Dill,
## Green Peas,
## Organic Baby Arugula,
## Organic Biologique Limes,
## Organic Hot House Tomato,
## Organic Raw Unfiltered Apple Cider Vinegar,
## Shelled Pistachios}
It is also possible to display the results in dataframe format with the DATAFRAME () function or with as (transactions, “dataframe”).
df_tr <- as(tr, Class = "data.frame")
as.tibble(df_tr) %>%
head(n= 3)
## # A tibble: 3 x 1
## items
## <chr>
## 1 {1,112108,Bag of Organic Bananas,Bulgarian Yogurt,Cucumber Kirby,Lightly Smok~
## 2 {36,79431,Asparagus,Cage Free Extra Large Grade AA Eggs,Grated Pecorino Roman~
## 3 {38,42756,Bunched Cilantro,Flat Parsley Bunch,Fresh Dill,Green Peas,Organic B~
To extract the size of each transaction, the size () function is used.
tr_size <- size(tr)
summary(tr_size)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 10.00 11.85 15.00 82.00
data.frame(tr_size) %>%
ggplot(aes(x = tr_size)) +
geom_histogram(fill= "orange", bin = 5) +
labs(title = "Transaction Size Distribution",
x = "Size",
y= "Number of Transactions") +
theme_minimal()
itemFrequencyPlot(tr,topN=20,type="relative",col=brewer.pal(8,'Set2'), main="Relative Item Frequency Plot")
The highest ordered products are Banana, Bag of Organic Bananas, Organic Strawberries and Organic Baby Spinach. So to increase the sale of Strawberries the retailer can put it near the Bananas.
The apriori algorithm has been used to search for association rules; a function with the same name in the arules package allows it to be calculated. It is based on the anti monotone property of the support; that is, if an itemset is frequent, then all its subsets must be too.
Apriori was one of the first algorithms developed for the search for association rules, and it is still one of the most used; it has two stages:
Identify all itemsets that occur with a frequency above a specific limit (frequent itemsets).
Convert those frequent itemsets into association rules.
Before going into the details of the algorithm, a series of concepts should be defined:
Support: The fraction of which our item set occurs in our dataset.
Confidence: It’s an indication of how often the rule has been found to be true.
Coverage: It’s the support of the left part of the rule (antecedent). It is interpreted as the frequency with which the antecedent appears in the set of transactions.
Lift: It is the ratio of confidence to expected confidence. Expected confidence is simply the probability that product y is in the basket.Higher lift values indicate stronger associations.
To find a reasonable number of rules, it was necessary to considerably reduce the support and confidence to 0.001 and 0.2, respectively. This is due to the high number of shopping carts and products present in the dataset.
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.2, maxlen=10, target= "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 131
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[306430 item(s), 131209 transaction(s)] done [1.26s].
## sorting and recoding items ... [1729 item(s)] done [0.05s].
## creating transaction tree ... done [0.07s].
## checking subsets of size 1 2 3 4 done [0.07s].
## writing ... [873 rule(s)] done [0.00s].
## creating S4 object ... done [0.05s].
summary(association.rules)
## set of 873 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 380 485 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.574 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001006 Min. :0.2000 Min. :0.001936 Min. : 1.498
## 1st Qu.:0.001136 1st Qu.:0.2261 1st Qu.:0.004344 1st Qu.: 2.167
## Median :0.001372 Median :0.2571 Median :0.005244 Median : 3.120
## Mean :0.001981 Mean :0.2747 Mean :0.007634 Mean : 5.137
## 3rd Qu.:0.001936 3rd Qu.:0.3051 3rd Qu.:0.007599 3rd Qu.: 4.152
## Max. :0.021005 Max. :0.5714 Max. :0.078051 Max. :85.582
## count
## Min. : 132
## 1st Qu.: 149
## Median : 180
## Mean : 260
## 3rd Qu.: 254
## Max. :2756
##
## mining info:
## data ntransactions support confidence
## tr 131209 0.001 0.2
## call
## apriori(data = tr, parameter = list(supp = 0.001, conf = 0.2, maxlen = 10, target = "rules"))
The summary of the rules gives us some information: The number of rules : 873
The distribution of rules by length: a length of 3 items has the most rules.
The summary of quality measures: ranges of support, confidence, and lift.
# Filter rules with confidence greater than 0.2 or 20%
subRules<-association.rules[quality(association.rules)$confidence> 0.2]
Since there are 873 rules, let’s print only top 15 with a confidence level > 20%:
inspect(sort(x= subRules[1:15], decreasing = TRUE, by = "confidence"))
## lhs rhs support confidence coverage lift count
## [1] {Organic Yellow Squash} => {Organic Zucchini} 0.001242293 0.4515235 0.002751336 13.701192 163
## [2] {Nonfat Icelandic Style Strawberry Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001097486 0.4161850 0.002637014 81.625133 144
## [3] {Non Fat Acai & Mixed Berries Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001234671 0.4060150 0.003040950 79.630534 162
## [4] {Sweet Potato Yam} => {Banana} 0.001028893 0.3638814 0.002827550 2.711062 135
## [5] {Organic Whole Grassmilk Milk} => {Bag of Organic Bananas} 0.001196564 0.3512304 0.003406778 3.155614 157
## [6] {Organic Fuji Apples} => {Bag of Organic Bananas} 0.001021271 0.3292383 0.003101921 2.958027 134
## [7] {Packaged Grape Tomatoes} => {Hass Avocados} 0.001173700 0.2558140 0.004588100 15.229171 154
## [8] {Icelandic Style Skyr Blueberry Non-fat Yogurt} => {Non Fat Acai & Mixed Berries Yogurt} 0.001234671 0.2421525 0.005098736 79.630534 162
## [9] {Packaged Grape Tomatoes} => {Strawberries} 0.001105107 0.2408638 0.004588100 5.159755 145
## [10] {Baby Cucumbers} => {Bag of Organic Bananas} 0.001127971 0.2379421 0.004740528 2.137781 148
## [11] {Baby Cucumbers} => {Raspberries} 0.001036514 0.2186495 0.004740528 9.177474 136
## [12] {Baby Cucumbers} => {Hass Avocados} 0.001028893 0.2170418 0.004740528 12.920979 135
## [13] {Baby Cucumbers} => {Strawberries} 0.001028893 0.2170418 0.004740528 4.649443 135
## [14] {Icelandic Style Skyr Blueberry Non-fat Yogurt} => {Nonfat Icelandic Style Strawberry Yogurt} 0.001097486 0.2152466 0.005098736 81.625133 144
## [15] {Seedless Cucumbers} => {Strawberries} 0.001044136 0.2041729 0.005113978 4.373766 137
top5subRules <- head(subRules, n = 5, by = "confidence")
plot(top5subRules, method = "graph", engine = "htmlwidget")
plot(top5subRules , method="paracoord", control=list(reorder=TRUE))
The graph above shows us that when a customer has organic strawberries, organic raspberries and organic Hass avocados in their shopping cart, they are very likely to buy a bag of organic bananas to go along with those items, and so on.
With this information, retailers can influence purchasing decisions by confidently predicting what customers need next. This will allow them to recommend products and design marketing strategies appealing to the customer segment. Another insight, which can be obtained from this analysis, is the positioning of products in specific places and the offering of bundles for less popular products.
You can filter the created rules and restrict the elements that appear on the left or right side of the rules when creating them. Suppose you know that there is interest in understanding the products sold together with the banana, the best-selling product. This means that the banana product should appear on the right side.
bananas <- apriori(data= tr, parameter= list(supp=0.0001, confidence=0.4, target= "rules"), appearance = list(rhs= "Banana"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 1e-04 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 13
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[306430 item(s), 131209 transaction(s)] done [2.86s].
## sorting and recoding items ... [10811 item(s)] done [0.12s].
## creating transaction tree ... done [0.18s].
## checking subsets of size 1 2 3 4 5 6 done [2.30s].
## writing ... [1938 rule(s)] done [0.95s].
## creating S4 object ... done [0.16s].
summary(bananas)
## set of 1938 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 22 1078 790 48
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.446 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.0001067 Min. :0.4000 Min. :0.0001143 Min. :2.980
## 1st Qu.:0.0001143 1st Qu.:0.4239 1st Qu.:0.0002439 1st Qu.:3.158
## Median :0.0001372 Median :0.4565 Median :0.0002972 Median :3.401
## Mean :0.0001814 Mean :0.4753 Mean :0.0003952 Mean :3.541
## 3rd Qu.:0.0001829 3rd Qu.:0.5116 3rd Qu.:0.0004039 3rd Qu.:3.812
## Max. :0.0019358 Max. :0.9333 Max. :0.0047100 Max. :6.954
## count
## Min. : 14.0
## 1st Qu.: 15.0
## Median : 18.0
## Mean : 23.8
## 3rd Qu.: 24.0
## Max. :254.0
##
## mining info:
## data ntransactions support confidence
## tr 131209 1e-04 0.4
## call
## apriori(data = tr, parameter = list(supp = 1e-04, confidence = 0.4, target = "rules"), appearance = list(rhs = "Banana"))
inspect(bananas[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Frozen Unsweetened Whole Strawberries} => {Banana} 0.0001143214 0.4838710 0.0002362643 3.605032 15
## [2] {Chamomile Herbal Tea} => {Banana} 0.0001143214 0.4285714 0.0002667500 3.193029 15
## [3] {Yokids Organic Strawberry Banana Lowfat Yogurt Squeeze} => {Banana} 0.0001143214 0.4687500 0.0002438857 3.492375 15
## [4] {All Natural Whole Strawberries} => {Banana} 0.0001219429 0.4324324 0.0002819929 3.221795 16
## [5] {Caramel Nut Brownie} => {Banana} 0.0001143214 0.4054054 0.0002819929 3.020433 15
## [6] {Chickpea Salad} => {Banana} 0.0001219429 0.4210526 0.0002896143 3.137011 16
## [7] {Greek 100 Calories Mixed Berry Yogurt} => {Banana} 0.0001219429 0.4102564 0.0002972357 3.056574 16
## [8] {Triscuit Reduced Fat Crackers} => {Banana} 0.0001143214 0.4054054 0.0002819929 3.020433 15
## [9] {Natural Uncured Turkey Hot Dog} => {Banana} 0.0001295643 0.4146341 0.0003124786 3.089190 17
## [10] {Broccoli Floret} => {Banana} 0.0001448071 0.4130435 0.0003505857 3.077339 19
Maximum no of orders are placed between 10:00AM and 5:00PM on Sunday and Monday.
The highest ordered products are Banana, Bag of Organic Bananas, Organic Strawberries and Organic Baby Spinach.
The implementation of association rules is an interesting option in the selection process of the technique to use to perform data mining; Its application is essential for the discovery of association relationships of data, being useful in the selection of a marketing strategy, decision-making processes, business management, etc.
When a customer has organic strawberries, organic raspberries and organic Hass avocados in their shopping cart, they are very likely to buy a bag of organic bananas to go along with those items, and so on. With this information, retailers can influence purchasing decisions by confidently predicting what customers need next.
We recommend to do additional work with marketing team, to increase product placement storewide.
We also recommend to increase visibility of products in the aisles, or end of aisle displays.
Conduct a Cluster Analysis to identify and understand the customer’s behavior. Then, the buying behavior of each group can be examined separately on measures such as brand loyalty, price willing to pay, frequency of purchase, etc. To do this analysis, refer to the K-means clustering Analysis, Gap Statistic Method, Average Silhouette or Elbow Method.
Conduct a Basket Analysis by department or aisles and increase the confidence of the model. Refer to the lift metric to find the strong-association and randomly sample out those associations.
Conduct a data analysis including real prices of the products or other additional information that allows business managers to improve their performance and predict other purchasing behaviors.
Domain knowledge - Understand the data well enough to generate questions on how to solve current and future challenges.
Keep your project simple by well defining a manageable portion of the data; then redefine throughout the EDA. Your final steps are for the cool new tricks you have learned.
Instacart Market Basket Analysis. Available online: https://www.kaggle.com/c/instacart-market-basket-analysis/data
Market Basket Analysis using R. https://www.datacamp.com/community/tutorials/market-basket-analysis-r
A Gentle Introduction on Market Basket Analysis.Association Rules. Available online: https://datascienceplus.com/a-gentle-introduction-on-market-basket-analysis%E2%80%8A-%E2%80%8Aassociation-rules/