In this lab, you will work with the Instacart Kaggle competition data and generate frequent patterns and association rules. The link to the competition is: https://www.kaggle.com/c/instacart-market-basket-analysis/
For your convenience, I have uploaded the relevant files to the following locations:
orders - https://an-utd-course.s3-us-west-1.amazonaws.com/kaggle-market-basket/orders.csv
products - https://an-utd-course.s3-us-west-1.amazonaws.com/kaggle-market-basket/products.csv
Complete the following steps:
Load libraries
# library() stops with an error if a package is missing; require() only warns
library(tidyverse)
library(arules)
Load the data using tidyverse functions:
# use read_csv to read orders and products files
orders <- read_csv("orders.csv")
products <- read_csv("products.csv")
Join the orders and products tables using tidyverse syntax
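# use inner_join on product_id (assumed to be the only column the two tables share)
# only the first 100,000 order rows are kept to limit memory use
joined <- inner_join(orders[1:100000, ], products, by = "product_id")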
rm(orders, products)
summary(joined)
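To make the join step concrete, here is a minimal sketch on two tiny, made-up tables. The table contents and the names `order_lines_demo`, `products_demo`, and `joined_demo` are hypothetical; only `inner_join` and the `product_id` key mirror the step above.

```r
library(dplyr)

# Hypothetical order lines: one row per product placed in an order
order_lines_demo <- tibble(
  order_id   = c(1, 1, 2),
  product_id = c(10, 20, 10)
)

# Hypothetical product catalogue
products_demo <- tibble(
  product_id   = c(10, 20, 30),
  product_name = c("Banana", "Milk", "Bread")
)

# Keep only rows whose product_id appears in both tables
joined_demo <- inner_join(order_lines_demo, products_demo, by = "product_id")
print(joined_demo)
```

Naming the key with `by` avoids the "Joining with ..." message and guards against an accidental join on an unrelated column that happens to share a name.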
Check whether you can load this directly into a transactions object in the arules package. You can either coerce the data frame to a transactions object directly, or write the data to disk and read it back in as a transactions object.
# create transactions object
#torderdata <- read.transactions("orders.csv", sep="\t")
tData <- as(joined, "transactions")
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7 not logical or factor. Applying default
## discretization (see '? discretizeDF').
## Warning in discretize(x = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, : The calculated breaks are: 0, 0, 1, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
summary(tData)
inspect(head(tData))
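Coercing a data frame directly treats every row as one transaction and every column value as an item, which is why the numeric columns are discretized in the warnings above. If one basket per order is wanted instead, a possible alternative is to group product names by order before coercing. The sketch below uses made-up data; the names `order_lines`, `baskets`, and `tBaskets` are hypothetical, and it assumes the joined table has `order_id` and `product_name` columns.

```r
library(arules)

# Hypothetical order lines: one row per (order, product) pair
order_lines <- data.frame(
  order_id     = c(1, 1, 1, 2, 2, 3, 3, 3),
  product_name = c("Banana", "Milk", "Bread",
                   "Banana", "Milk",
                   "Banana", "Bread", "Eggs"),
  stringsAsFactors = FALSE
)

# Collect the product names of each order into one basket;
# unique() guards against a product listed twice in the same order
baskets <- lapply(split(order_lines$product_name, order_lines$order_id), unique)

# Coerce the list of baskets to a transactions object (one transaction per order)
tBaskets <- as(baskets, "transactions")
summary(tBaskets)
inspect(tBaskets)
```

With this layout the items are plain product names, so rules read as "products bought together" rather than as relationships between column values.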
Find frequent itemsets using the eclat function. Remember that the dataset is very large, so you may need to start with a very small support value. Make sure to set the minlen parameter to 2.
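# find frequent itemsets containing at least 2 items
frequentItems <- eclat(tData,
                       parameter = list(supp = 0.01, maxlen = 15, minlen = 2))
# show the top 10 itemsets, sorted by support
inspect(head(sort(frequentItems, by = "support"), n = 10))
# plot the 15 most frequent items
itemFrequencyPlot(tData, topN = 15,
                  type = "absolute", main = "Item Frequency")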
Output the top 10 frequent itemsets of length at least 2, sorted by support.
Also, create an item frequency plot for the top 15 items.
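As a small illustration of how support and minlen interact, the sketch below runs eclat on four made-up baskets. The data and the names `baskets`, `trans`, `itemsets`, and `top_itemsets` are hypothetical. The support of an itemset is the fraction of transactions that contain all of its items.

```r
library(arules)

# Four hypothetical baskets
baskets <- list(
  c("Banana", "Milk", "Bread"),
  c("Banana", "Milk"),
  c("Banana", "Bread", "Eggs"),
  c("Milk", "Bread")
)
trans <- as(baskets, "transactions")

# Itemsets of at least 2 items that occur in at least 40% of the baskets
itemsets <- eclat(trans,
                  parameter = list(supp = 0.4, minlen = 2),
                  control = list(verbose = FALSE))

# Sort by support and keep at most the 10 best
top_itemsets <- head(sort(itemsets, by = "support"), n = 10)
inspect(top_itemsets)
```

Each of the three pairs among Banana, Milk, and Bread appears in 2 of the 4 baskets (support 0.5), while the full triple appears in only 1 (support 0.25) and is dropped by the threshold.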
Use the apriori algorithm to generate association rules. Make sure to set the support and confidence values judiciously.
Output the top 10 rules first sorted by support and then by confidence.
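# find association rules
rules <- apriori(tData,
                 parameter = list(supp = 0.001, conf = 0.5))
# top 10 rules by support
rules_supp <- sort(rules, by = "support", decreasing = TRUE)
inspect(head(rules_supp, n = 10))
# top 10 rules by confidence
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_conf, n = 10))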
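To see what the support and confidence thresholds do, here is a minimal sketch on five made-up baskets. The data and the names `baskets`, `trans`, `demo_rules`, `by_support`, and `by_confidence` are hypothetical. For a rule X => Y, confidence is support(X and Y together) divided by support(X).

```r
library(arules)

# Five hypothetical baskets
baskets <- list(
  c("Banana", "Milk", "Bread"),
  c("Banana", "Milk"),
  c("Banana", "Bread", "Eggs"),
  c("Milk", "Bread"),
  c("Banana", "Milk", "Eggs")
)
trans <- as(baskets, "transactions")

# minlen = 2 excludes rules with an empty left-hand side
demo_rules <- apriori(trans,
                      parameter = list(supp = 0.3, conf = 0.6, minlen = 2),
                      control = list(verbose = FALSE))

# Same rule set, ordered two ways
by_support    <- sort(demo_rules, by = "support", decreasing = TRUE)
by_confidence <- sort(demo_rules, by = "confidence", decreasing = TRUE)
inspect(head(by_support, n = 10))
inspect(head(by_confidence, n = 10))
```

Eggs appears in 2 baskets and both also contain Banana, so {Eggs} => {Banana} has confidence 2/2 = 1 even though its support (0.4) is lower than that of the Banana and Milk rules (0.6). The two orderings therefore put different rules first, which is why it is worth inspecting both.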
Choose a product and list the significant rules that have that product on the right-hand side (rhs) of the rule.
# find rules with the chosen product on the rhs
rules <- apriori(data = tData, parameter = list(supp = 0.001, conf = 0.08),
                 appearance = list(default = "lhs", rhs = "product_name=Organic Spring Mix"),
                 control = list(verbose = FALSE))
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_conf))
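There are at least two ways to get rules for a given right-hand side: restrict the rhs while mining through `appearance`, or mine all rules first and filter them with `subset`. The sketch below shows both on made-up baskets. The data and the names `baskets`, `trans`, `rules_rhs`, `all_rules`, and `rules_subset` are hypothetical. Item labels here are plain product names, whereas transactions coerced from a data frame use labels of the form `column=value`.

```r
library(arules)

# Five hypothetical baskets
baskets <- list(
  c("Banana", "Milk", "Bread"),
  c("Banana", "Milk"),
  c("Banana", "Bread", "Eggs"),
  c("Milk", "Bread"),
  c("Banana", "Milk", "Eggs")
)
trans <- as(baskets, "transactions")

# Option 1: only allow "Banana" on the rhs while mining
rules_rhs <- apriori(trans,
                     parameter = list(supp = 0.3, conf = 0.6, minlen = 2),
                     appearance = list(default = "lhs", rhs = "Banana"),
                     control = list(verbose = FALSE))

# Option 2: mine all rules, then keep those with "Banana" on the rhs
all_rules <- apriori(trans,
                     parameter = list(supp = 0.3, conf = 0.6, minlen = 2),
                     control = list(verbose = FALSE))
rules_subset <- subset(all_rules, subset = rhs %in% "Banana")

inspect(sort(rules_rhs, by = "confidence", decreasing = TRUE))
```

On a large dataset, restricting the rhs during mining is usually the cheaper option because rules for other products are never generated.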
Publish your notebook to the RPubs website and submit the final link. Make sure both your code and its output are visible.