Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
I’ll use itemFrequencyPlot for plotting top to features.I found the documentation from https://www.rdocumentation.org/packages/arules/versions/1.5-5/topics/itemFrequencyPlot.
#data <- read.csv("https://raw.githubusercontent.com/omerozeren/DATA624/master/HMW10/GroceryDataSet.csv")
#itemFrequencyPlot(data, topN=10, type="absolute", main="The top 10 features",col="#f2fbd2")
data <- read.transactions("GroceryDataSet.csv", sep=",")
itemFrequencyPlot(data, topN=10, type="absolute", main="The top 10 features",col="#f2fbd2")
The graph above indicates that the most important feature is WholeMilk and other vegetables follows it
For Market Analyisis , I’ll implement Apriori algorithm. Mine frequent itemsets, association rules or association hyperedges using the Apriori algorithm. The Apriori algorithm employs level-wise search for frequent itemsets. Ref : https://www.rdocumentation.org/packages/arules/versions/1.6-6/topics/apriori
top_10_rules<- apriori(data, parameter=list(supp=0.001, conf=0.5) , control=list(verbose=FALSE))
top_10_rules %>%
DATAFRAME() %>%
arrange(desc(lift)) %>%
top_n(10) %>%
kable() %>%
kable_styling()
## Selecting by count
LHS | RHS | support | confidence | coverage | lift | count |
---|---|---|---|---|---|---|
{root vegetables,tropical fruit} | {other vegetables} | 0.0123030 | 0.5845411 | 0.0210473 | 3.020999 | 121 |
{rolls/buns,root vegetables} | {other vegetables} | 0.0122013 | 0.5020921 | 0.0243010 | 2.594890 | 120 |
{root vegetables,yogurt} | {other vegetables} | 0.0129131 | 0.5000000 | 0.0258261 | 2.584078 | 127 |
{root vegetables,yogurt} | {whole milk} | 0.0145399 | 0.5629921 | 0.0258261 | 2.203354 | 143 |
{domestic eggs,other vegetables} | {whole milk} | 0.0123030 | 0.5525114 | 0.0222674 | 2.162336 | 121 |
{rolls/buns,root vegetables} | {whole milk} | 0.0127097 | 0.5230126 | 0.0243010 | 2.046888 | 125 |
{other vegetables,pip fruit} | {whole milk} | 0.0135231 | 0.5175097 | 0.0261312 | 2.025351 | 133 |
{tropical fruit,yogurt} | {whole milk} | 0.0151500 | 0.5173611 | 0.0292832 | 2.024770 | 149 |
{other vegetables,yogurt} | {whole milk} | 0.0222674 | 0.5128806 | 0.0434164 | 2.007235 | 219 |
{other vegetables,whipped/sour cream} | {whole milk} | 0.0146416 | 0.5070423 | 0.0288765 | 1.984385 | 144 |
The basic Clustering graph is generated by using “hclust” from https://www.r-graph-gallery.com/29-basic-dendrogram.html. The graph below alligns with the Top 10 Associative Rules above which indicates Wholemilk and pther vegitables are most important cluster features.
dataframe <- read.transactions("GroceryDataSet.csv", sep=",")
dataframe <- dataframe[ , itemFrequency(dataframe) > 0.05]
d_jaccard <- dissimilarity(dataframe, which = "items")
# plot dendrogram
plot(hclust(d_jaccard, method = "ward.D2"),
main = "Features Clustering", sub = "", xlab = "")