Name: Charles Ugiagbe.

Date: 12/16/23

Homework Intro

The aim of this assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift.

library(tidyverse)
library(kableExtra)
library(arules)
library(igraph)

Load the data

receipt.df <- read.csv("GroceryDataSet.csv")

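Take a look at the first few rows of the data. Because read.csv is called with its default header = TRUE, the first transaction (citrus fruit, semi-finished bread, margarine, ready soups) is used as the column names.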

head(receipt.df)
##       citrus.fruit semi.finished.bread      margarine              ready.soups
## 1   tropical fruit              yogurt         coffee                         
## 2       whole milk                                                            
## 3        pip fruit              yogurt  cream cheese              meat spreads
## 4 other vegetables          whole milk condensed milk long life bakery product
## 5       whole milk              butter         yogurt                     rice
## 6       rolls/buns                                                            
##                  X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14
## 1                                                                              
## 2                                                                              
## 3                                                                              
## 4                                                                              
## 5 abrasive cleaner                                                             
## 6                                                                              
##   X.15 X.16 X.17 X.18 X.19 X.20 X.21 X.22 X.23 X.24 X.25 X.26 X.27
## 1                                                                 
## 2                                                                 
## 3                                                                 
## 4                                                                 
## 5                                                                 
## 6
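The effect of the header default can be checked on a small stand-in for the file. This is only a sketch: the three baskets in csv_text are made up for illustration and are not taken from GroceryDataSet.csv.

```r
# Toy CSV text standing in for the grocery file (assumed contents, three baskets)
csv_text <- "milk,bread,eggs\nbeer\nyogurt,milk\n"

# Default header = TRUE: the first basket is consumed as column names
with_header <- read.csv(text = csv_text)

# header = FALSE: every line is kept as a basket
no_header <- read.csv(text = csv_text, header = FALSE)

names(with_header)  # the items of the first basket
nrow(with_header)   # one row fewer than the number of baskets
nrow(no_header)     # one row per basket
```

For a transaction file without a header row, header = FALSE keeps the first basket in the data.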

Market Analysis

We first read the data in as transactions and explore it by plotting the 15 most frequently purchased items.

data.df <- read.transactions("GroceryDataSet.csv", sep=",")
itemFrequencyPlot(data.df, topN=15, type="absolute", main="Top 15 Items", col=rainbow(15))

Whole milk is the most frequently purchased item.
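The counts behind an absolute item frequency plot can be reproduced by hand. The sketch below uses base R only and a made-up list of four baskets (the baskets object is hypothetical, not the grocery data); it tallies how often each item occurs and sorts the tally.

```r
# Hypothetical baskets used only for illustration
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("soda"),
  c("whole milk", "soda", "yogurt")
)

# Absolute frequency: number of baskets containing each item
item_counts <- sort(table(unlist(baskets)), decreasing = TRUE)

# Relative frequency: share of baskets containing each item
item_share <- item_counts / length(baskets)

head(item_counts, 2)
```

Dividing by the number of baskets gives the relative frequency, which is the same quantity as the support of a single item.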

Association Rules

To complete this market basket analysis, we run the Apriori algorithm via the ‘apriori’ function from the arules package, with a minimum support of 0.001 and a minimum confidence of 0.5, and then print the top 10 rules with their support, confidence, and lift.
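To make the three quality measures concrete, here is a minimal base R sketch that computes them by hand for one rule, {yogurt} => {whole milk}. The five baskets and the helper has() are assumptions for illustration; they are not part of arules or of the grocery data.

```r
# Hypothetical baskets used only for illustration
baskets <- list(
  c("yogurt", "whole milk"),
  c("yogurt", "whole milk", "soda"),
  c("yogurt"),
  c("whole milk"),
  c("soda")
)
n <- length(baskets)

# Hypothetical helper: TRUE for each basket that contains all given items
has <- function(items) sapply(baskets, function(b) all(items %in% b))

lhs <- "yogurt"
rhs <- "whole milk"

# support: share of baskets containing both sides of the rule
support <- sum(has(c(lhs, rhs))) / n

# confidence: of the baskets containing the LHS, the share that also contain the RHS
confidence <- sum(has(c(lhs, rhs))) / sum(has(lhs))

# lift: confidence relative to the baseline share of baskets containing the RHS
lift <- confidence / (sum(has(rhs)) / n)

c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 means the right-hand side appears with the left-hand side more often than it would if the two were independent.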

rules <- apriori(data.df, parameter = list(supp = 0.001, conf = 0.5), control = list(verbose = FALSE))
summary(rules)
## set of 5668 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##   11 1461 3211  939   46 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    4.00    3.92    4.00    6.00 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.5000   Min.   :0.001017   Min.   : 1.957  
##  1st Qu.:0.001118   1st Qu.:0.5455   1st Qu.:0.001729   1st Qu.: 2.464  
##  Median :0.001322   Median :0.6000   Median :0.002135   Median : 2.899  
##  Mean   :0.001668   Mean   :0.6250   Mean   :0.002788   Mean   : 3.262  
##  3rd Qu.:0.001729   3rd Qu.:0.6842   3rd Qu.:0.002949   3rd Qu.: 3.691  
##  Max.   :0.022267   Max.   :1.0000   Max.   :0.043416   Max.   :18.996  
##      count      
##  Min.   : 10.0  
##  1st Qu.: 11.0  
##  Median : 13.0  
##  Mean   : 16.4  
##  3rd Qu.: 17.0  
##  Max.   :219.0  
## 
## mining info:
##     data ntransactions support confidence
##  data.df          9835   0.001        0.5
##                                                                                                  call
##  apriori(data = data.df, parameter = list(supp = 0.001, conf = 0.5), control = list(verbose = FALSE))
apriori(data.df, parameter=list(supp=0.001, conf=0.5) , control=list(verbose=FALSE)) %>%
  DATAFRAME() %>%
  arrange(desc(lift)) %>%
  top_n(10) %>%  # no wt column given, so top_n() selects by the last column (count), not by lift
  kable() %>%
  kable_styling()
## Selecting by count
LHS RHS support confidence coverage lift count
{root vegetables,tropical fruit} {other vegetables} 0.0123030 0.5845411 0.0210473 3.020999 121
{rolls/buns,root vegetables} {other vegetables} 0.0122013 0.5020921 0.0243010 2.594890 120
{root vegetables,yogurt} {other vegetables} 0.0129131 0.5000000 0.0258261 2.584078 127
{root vegetables,yogurt} {whole milk} 0.0145399 0.5629921 0.0258261 2.203354 143
{domestic eggs,other vegetables} {whole milk} 0.0123030 0.5525114 0.0222674 2.162336 121
{rolls/buns,root vegetables} {whole milk} 0.0127097 0.5230126 0.0243010 2.046888 125
{other vegetables,pip fruit} {whole milk} 0.0135231 0.5175097 0.0261312 2.025351 133
{tropical fruit,yogurt} {whole milk} 0.0151500 0.5173611 0.0292832 2.024770 149
{other vegetables,yogurt} {whole milk} 0.0222674 0.5128806 0.0434164 2.007235 219
{other vegetables,whipped/sour cream} {whole milk} 0.0146416 0.5070423 0.0288765 1.984385 144
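The "Selecting by count" message indicates that top_n(10) picked the ten rules with the highest count, which were then shown in descending order of lift. That is why the largest lift in this table is about 3.02, while the rule summary reports a maximum lift of 18.996. One way to select by lift instead is dplyr's slice_max(). The sketch below shows the difference on a made-up four-row data frame (rules_df and its values are hypothetical, not output from apriori).

```r
library(dplyr)

# Hypothetical rule table used only for illustration
rules_df <- data.frame(
  rule  = c("A", "B", "C", "D"),
  lift  = c(2.0, 18.9, 3.1, 11.4),
  count = c(219, 10, 121, 12)
)

# Pattern from the report: without wt, top_n() ranks by the last column (count)
top_by_count <- rules_df %>%
  arrange(desc(lift)) %>%
  top_n(2)

# Alternative: rank explicitly by lift
top_by_lift <- rules_df %>%
  slice_max(lift, n = 2)

top_by_count$rule
top_by_lift$rule
```

Applied to the rule set, replacing top_n(10) with slice_max(lift, n = 10) (or top_n(10, lift)) would return the ten highest-lift rules.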

Cluster Analysis

We now look for item groupings. A network graph can be used to perform the cluster analysis. First, I create a network graph from the transaction data; then I detect communities in the graph using the Louvain algorithm.
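The idea can be tried on a tiny shopper-item graph before running it on the full data. In this sketch the edge list is invented for illustration: shoppers s1 and s2 buy only milk and bread, while s3 and s4 buy only beer and chips, so the graph has two disconnected groups.

```r
library(igraph)

# Hypothetical shopper-item edges used only for illustration
edges <- data.frame(
  from = c("s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"),
  to   = c("milk", "bread", "milk", "bread", "beer", "chips", "beer", "chips")
)

# Undirected graph whose vertices are shoppers and items
g <- graph_from_data_frame(edges, directed = FALSE)

# Louvain community detection; communities() returns a list of vertex names per group
comms <- communities(cluster_louvain(g))

# With two disconnected groups we would expect one community per group
comms
```

Each community mixes shopper vertices and item vertices, which is why the report's loop has to separate item names from shopper ids.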

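# Reshape to long format: one row per (shopper, item) pair
temp <- read.csv("GroceryDataSet.csv", header = FALSE) %>%
  mutate(shoper_id = row_number()) %>%   # each row of the file is one shopper's basket
  pivot_longer(-shoper_id) %>%
  filter(value != "") %>%                # drop the empty padding cells
  select(-name)

# Build an undirected shopper-item graph and detect communities with Louvain
louvain_communities <- temp %>%
  rename(to = value, from = shoper_id) %>%
  graph_from_data_frame(directed = FALSE) %>%
  cluster_louvain() %>%
  communities()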
items <- as.character(unique(temp$value))

# Empty data frame that collects one row per community
cluster_df <- data.frame(name = character(), members = integer())

for (i in seq_along(louvain_communities)) {
  community <- louvain_communities[[i]]
  # Vertices are either item names or shopper ids; tell them apart
  is_item <- community %in% items
  # Label the cluster with its items and count the shoppers assigned to it
  cluster_name <- paste0(i, ": ", paste(community[is_item], collapse = " + "))
  cluster_members <- sum(!is_item)
  cluster_df <- rbind(cluster_df, data.frame(name = cluster_name, members = cluster_members))
}

cluster_df %>%
  arrange(desc(members)) %>%
  kable() %>%
  kable_styling()
name members
5: citrus fruit + other vegetables + rice + abrasive cleaner + beef + chicken + root vegetables + spices + pork + turkey + curd cheese + canned vegetables + onions + herbs + specialty cheese + dog food + frozen fish + salad dressing + vinegar + roll products + rubbing alcohol + jam + toilet cleaner + preservation products 1147
9: chocolate + soda + specialty bar + pastry + waffles + candy + beverages + chocolate marshmallow + frozen potato products + cake bar + snack products + finished products + potato products + baby food + tidbits + bags + sound storage medium 1104
3: margarine + whole milk + butter + cereals + curd + flour + sugar + detergent + whipped/sour cream + baking powder + specialty fat + flower (seeds) + salt + honey + cocoa drinks + skin care + soups + rum + soap + organic sausage + pudding powder + frozen fruits + cooking chocolate 1067
1: ready soups + rolls/buns + frankfurter + sausage + spread cheese + hard cheese + cat food + canned fish + sliced cheese + soft cheese + meat + mustard + mayonnaise + organic products + nut snack + kitchen towels + cream 1022
14: shopping bags + misc. beverages + chewing gum + specialty chocolate + sparkling wine + brandy + liqueur + whisky 517
4: yogurt + cream cheese + meat spreads + packaged fruit/vegetables + butter milk + bathroom cleaner + berries + fish + instant coffee + frozen chicken + kitchen utensil 507
6: liquor (appetizer) + canned beer + candles 433
16: frozen dessert + ice cream + frozen vegetables + frozen meals + cleaner + liver loaf 408
15: bottled beer + red/blush wine + prosecco + liquor 394
18: hygiene articles + domestic eggs + oil + canned fruit + dish cleaner + house keeping products + baby cosmetics + ketchup 382
11: UHT-milk + bottled water + artif. sweetener + white wine + male cosmetics 377
7: long life bakery product + pot plants + fruit/vegetable juice + sweet spreads + pickled vegetables 352
8: tropical fruit + white bread + processed cheese + ham + tea + syrup + specialty vegetables 341
13: dishes + napkins + grapes + zwieback + light bulbs + decalcifier 329
10: brown bread + hamburger meat + pasta + Instant food products + sauces + hair spray 323
19: semi-finished bread + newspapers + pet care + nuts/prunes 291
2: coffee + condensed milk + cling film/bags + female sanitary products 269
17: pip fruit + photo/film + softener + cookware 231
20: salty snack + dental care + popcorn + make up remover 175
12: dessert + seasonal products + flower soil/fertilizer 166