Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

Association rules

Load Data

library(arules)  # read.transactions(), apriori(), itemFrequencyPlot()
grocery <- read.transactions("GroceryDataSet.csv", sep = ",")
summary(grocery)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
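
The headline numbers in this summary are consistent with each other, and that can be checked with a few lines of base R. This is only a sketch that re-uses the figures printed above; the variable names are my own.

```r
n_transactions <- 9835
n_items        <- 169

# Item counts from the "most frequent items" block, including "(Other)"
item_counts     <- c(2513, 1903, 1809, 1715, 1372, 34055)
total_purchases <- sum(item_counts)

# Density = filled cells / all cells of the transaction-by-item matrix
density   <- total_purchases / (n_transactions * n_items)
# Mean basket size = purchased items per transaction
mean_size <- total_purchases / n_transactions

print(total_purchases)
print(round(density, 8))
print(round(mean_size, 3))
```

The density and the mean basket size come out at the values reported by summary(grocery).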

Data Analysis

itemFrequencyPlot(grocery,topN=20,type="absolute",main = "Top 20 Items")

The plot shows the 20 most frequently purchased items across all receipts. The item customers bought most often is whole milk. Other vegetables and rolls/buns are also popular.

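I use apriori() to mine the data for association rules. There are three important measures:

Support: the fraction of transactions in the dataset that contain the item set.

Confidence: the fraction of transactions containing the left-hand-side items that also contain the right-hand-side item, i.e. how often the rule holds when its left-hand side is present.

Lift: the ratio of the rule's confidence to the expected confidence, which is the support of the right-hand side on its own. A lift above 1 means the two sides occur together more often than they would if they were independent.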
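
To make the three measures concrete, here is a minimal base R sketch on five made-up baskets. The data and the helper support() are hypothetical and exist only for illustration; they are not part of arules.

```r
# Toy baskets (hypothetical data, not taken from the Groceries file)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "butter")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under study: {bread} => {butter}
lhs <- "bread"
rhs <- "butter"

supp_rule <- support(c(lhs, rhs), baskets)      # share of baskets with both sides
conf_rule <- supp_rule / support(lhs, baskets)  # share of lhs baskets that also hold rhs
lift_rule <- conf_rule / support(rhs, baskets)  # confidence relative to support of rhs

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy set the rule holds in 2 of the 3 baskets that contain bread, and its lift is slightly above 1 because butter appears in 3 of 5 baskets overall.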

I set the minimum support to 0.001 and the minimum confidence to 0.8.

rules<-apriori(grocery,parameter = list(supp = 0.001, conf = 0.8),control = list(verbose = FALSE))
summary(rules)

## set of 410 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  29 229 140  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.329   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.131  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.312  
##  Median :0.001220   Median :0.8462   Median :0.001322   Median : 3.588  
##  Mean   :0.001247   Mean   :0.8663   Mean   :0.001449   Mean   : 3.951  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.341  
##  Max.   :0.003152   Max.   :1.0000   Max.   :0.003559   Max.   :11.235  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :12.27  
##  3rd Qu.:13.00  
##  Max.   :31.00  
## 
## mining info:
##     data ntransactions support confidence
##  grocery          9835   0.001        0.8

There are 410 rules in total with a confidence of at least 80%, and the mean lift is 3.951.
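
The support and count columns of the summary describe the same thing on two scales. As a rough check, assuming the support threshold is applied as a whole number of transactions, the following base R sketch recovers the minimum count and the printed support range from the counts alone.

```r
n_transactions <- 9835
min_support    <- 0.001

# Smallest whole number of transactions that reaches the support threshold
min_count <- ceiling(min_support * n_transactions)

# Supports implied by the smallest and largest counts in the summary (10 and 31)
support_from_count <- c(10, 31) / n_transactions

print(min_count)
print(round(support_from_count, 6))
```

So even the best-supported of these rules rests on only 31 of the 9835 receipts, which is worth keeping in mind when reading the high confidence values.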

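The top 10 rules by lift are shown below:

library(dplyr)  # arrange(), desc(), %>%
# Sort by lift explicitly; top_n(10) without wt would rank by the last column (count)
rules %>% DATAFRAME() %>%
          arrange(desc(lift)) %>%
          head(10)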

The rules indicate relationships between different items. For example, if a customer buys citrus fruit, root vegetables, tropical fruit and whole milk, there is an 88.5% chance that they will also buy other vegetables.
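
The lift of this example rule can be approximated from numbers already shown in this report. The sketch below uses the rounded confidence of 0.885 and the count of "other vegetables" from summary(grocery), so the result is an approximation and not the exact value arules would print.

```r
n_transactions         <- 9835
count_other_vegetables <- 1903   # from summary(grocery)
confidence             <- 0.885  # rounded value quoted in the text

# Expected confidence = support of the right-hand side on its own
support_rhs <- count_other_vegetables / n_transactions
lift        <- confidence / support_rhs

print(round(support_rhs, 4))
print(round(lift, 2))
```

In other words, customers holding those four items are roughly four and a half times as likely to buy other vegetables as a customer picked at random.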

I plot a graph of the top 20 rules by lift, which makes the relationships between items easy to see.

library(arulesViz)  # plot() method for rules
subrules <- head(rules, n = 20, by = "lift")
plot(subrules, method = "graph")

Cluster Analysis

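Extract the underlying item-by-transaction matrix and convert it to a data frame. Each row is an item and each column is a transaction, so the clustering below groups items, not receipts.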

df_grocery <- grocery@data %>% 
              as.matrix()  %>% 
              as.data.frame() 
     
row.names(df_grocery) <- grocery@itemInfo$labels
df_grocery<-scale(df_grocery)

We’ll need to identify an appropriate value for k, the number of clusters we will group the rows into. Four common techniques are the elbow method, the silhouette method, the gap statistic, and NbClust(). For simplicity, I will use the elbow approach.
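
What the elbow approach computes can be sketched in base R. The example below uses made-up two-dimensional data with three obvious groups, not the grocery data, and records the total within-cluster sum of squares for each k.

```r
set.seed(1)

# Toy data (hypothetical): three well-separated groups of 2-D points
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- vapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss,
              numeric(1))

# The "elbow" is the k after which the curve flattens out
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

On data like this the curve drops sharply up to the true number of groups and then flattens. When no such bend appears, the method gives little guidance on k.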

library(factoextra)  # fviz_nbclust(), fviz_silhouette(); attaches ggplot2
fviz_nbclust(df_grocery, kmeans, method = "wss", k.max = 30) + theme_minimal() + ggtitle("The Elbow Method")

As the plot above shows, the within-cluster sum of squares keeps decreasing as k increases, and the slope remains steep. Therefore, I will fit a K-means model with \(k=30\).

set.seed(123)
km_res <- kmeans(df_grocery, centers = 30,nstart = 20)
summary(km_res)

##              Length Class  Mode   
## cluster         169 -none- numeric
## centers      295050 -none- numeric
## totss             1 -none- numeric
## withinss         30 -none- numeric
## tot.withinss      1 -none- numeric
## betweenss         1 -none- numeric
## size             30 -none- numeric
## iter              1 -none- numeric
## ifault            1 -none- numeric
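
Calling summary() on a kmeans object only lists the length of each component, but those lengths do confirm how the data were oriented. The cluster vector has 169 entries, one per item, and the length of centers can be reproduced as follows (a small check using figures from this report):

```r
k              <- 30
n_items        <- 169
n_transactions <- 9835

# kmeans() stores one centre per cluster, with one coordinate per input column
centers_length <- k * n_transactions
print(centers_length)
```

This matches the 295050 reported for centers, so each of the 169 items is being clustered as a point with 9835 transaction coordinates.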

library(cluster)  # silhouette()
sil <- silhouette(km_res$cluster, dist(df_grocery))

fviz_silhouette(sil)

##    cluster size ave.sil.width
## 1        1    1          0.00
## 2        2  135          0.40
## 3        3    1          0.00
## 4        4    1          0.00
## 5        5    1          0.00
## 6        6    1          0.00
## 7        7    1          0.00
## 8        8    1          0.00
## 9        9    1          0.00
## 10      10    1          0.00
## 11      11    1          0.00
## 12      12    4         -0.19
## 13      13    1          0.00
## 14      14    1          0.00
## 15      15    1          0.00
## 16      16    1          0.00
## 17      17    1          0.00
## 18      18    1          0.00
## 19      19    1          0.00
## 20      20    3         -0.20
## 21      21    1          0.00
## 22      22    1          0.00
## 23      23    1          0.00
## 24      24    1          0.00
## 25      25    1          0.00
## 26      26    1          0.00
## 27      27    1          0.00
## 28      28    1          0.00
## 29      29    1          0.00
## 30      30    1          0.00
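
The 0.00 entries follow from how the silhouette width is defined: an observation that is alone in its cluster gets a width of 0 by convention. The base R sketch below computes the widths by hand on three made-up points; sil_width() is a hypothetical helper written for illustration, not a function from the cluster package.

```r
# Silhouette width s(i) = (b - a) / max(a, b), with s(i) = 0 for singletons
sil_width <- function(d, cl) {
  d <- as.matrix(d)
  vapply(seq_along(cl), function(i) {
    same <- cl == cl[i]
    if (sum(same) == 1) return(0)                # singleton cluster
    a <- sum(d[i, same]) / (sum(same) - 1)       # mean distance within own cluster
    b <- min(vapply(setdiff(unique(cl), cl[i]),  # mean distance to nearest other cluster
                    function(k) mean(d[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Toy example (hypothetical): two nearby points and one isolated point
x  <- c(1, 2, 10)
cl <- c(1, 1, 2)
print(round(sil_width(dist(x), cl), 3))
```

The isolated point scores 0 while the two neighbours score close to 1. With 27 of the 30 clusters holding a single item, the silhouette table says little about cluster quality beyond cluster 2.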

I tried several values of K below 30, but there was always one very large cluster. With K equal to 30, cluster 2 contains 135 items. We can take a look at which items it includes.

km_res$cluster[km_res$cluster==2]

##          abrasive cleaner          artif. sweetener            baby cosmetics 
##                         2                         2                         2 
##                 baby food                      bags             baking powder 
##                         2                         2                         2 
##          bathroom cleaner                   berries                 beverages 
##                         2                         2                         2 
##                    brandy               butter milk                  cake bar 
##                         2                         2                         2 
##                   candles                     candy               canned fish 
##                         2                         2                         2 
##              canned fruit         canned vegetables                  cat food 
##                         2                         2                         2 
##                   cereals               chewing gum     chocolate marshmallow 
##                         2                         2                         2 
##                   cleaner           cling film/bags              cocoa drinks 
##                         2                         2                         2 
##            condensed milk         cooking chocolate                  cookware 
##                         2                         2                         2 
##                     cream              cream cheese               curd cheese 
##                         2                         2                         2 
##               decalcifier               dental care                   dessert 
##                         2                         2                         2 
##                 detergent              dish cleaner                    dishes 
##                         2                         2                         2 
##                  dog food  female sanitary products         finished products 
##                         2                         2                         2 
##                      fish                     flour            flower (seeds) 
##                         2                         2                         2 
##    flower soil/fertilizer            frozen chicken            frozen dessert 
##                         2                         2                         2 
##               frozen fish             frozen fruits              frozen meals 
##                         2                         2                         2 
##    frozen potato products                    grapes                hair spray 
##                         2                         2                         2 
##                       ham            hamburger meat               hard cheese 
##                         2                         2                         2 
##                     herbs                     honey    house keeping products 
##                         2                         2                         2 
##          hygiene articles                 ice cream            instant coffee 
##                         2                         2                         2 
##     Instant food products                       jam                   ketchup 
##                         2                         2                         2 
##            kitchen towels           kitchen utensil               light bulbs 
##                         2                         2                         2 
##                   liqueur                    liquor        liquor (appetizer) 
##                         2                         2                         2 
##                liver loaf  long life bakery product           make up remover 
##                         2                         2                         2 
##            male cosmetics                mayonnaise                      meat 
##                         2                         2                         2 
##              meat spreads           misc. beverages                   mustard 
##                         2                         2                         2 
##                 nut snack               nuts/prunes                       oil 
##                         2                         2                         2 
##                    onions          organic products           organic sausage 
##                         2                         2                         2 
## packaged fruit/vegetables                     pasta                  pet care 
##                         2                         2                         2 
##                photo/film        pickled vegetables                   popcorn 
##                         2                         2                         2 
##                pot plants           potato products     preservation products 
##                         2                         2                         2 
##          processed cheese                  prosecco            pudding powder 
##                         2                         2                         2 
##               ready soups            red/blush wine                      rice 
##                         2                         2                         2 
##             roll products           rubbing alcohol                       rum 
##                         2                         2                         2 
##            salad dressing                      salt                    sauces 
##                         2                         2                         2 
##         seasonal products       semi-finished bread                 skin care 
##                         2                         2                         2 
##             sliced cheese            snack products                      soap 
##                         2                         2                         2 
##               soft cheese                  softener      sound storage medium 
##                         2                         2                         2 
##                     soups            sparkling wine             specialty bar 
##                         2                         2                         2 
##          specialty cheese             specialty fat      specialty vegetables 
##                         2                         2                         2 
##                    spices             spread cheese                     sugar 
##                         2                         2                         2 
##             sweet spreads                     syrup                       tea 
##                         2                         2                         2 
##                   tidbits            toilet cleaner                    turkey 
##                         2                         2                         2 
##                  UHT-milk                   vinegar                    whisky 
##                         2                         2                         2 
##               white bread                white wine                  zwieback 
##                         2                         2                         2

It’s difficult to see what these items have in common.

Data624_HW10

Mengqin Cai

5/8/2021
