Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
library(arules)  # provides read.transactions() and apriori()
grocery <- read.transactions("GroceryDataSet.csv", sep = ",")
summary(grocery)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
itemFrequencyPlot(grocery,topN=20,type="absolute",main = "Top 20 Items")
The plot shows the 20 most frequently purchased items across all receipts. The No. 1 item bought by customers is whole milk. Other vegetables and rolls/buns are also popular items on the list.
I use apriori() to mine the data for association rules. There are three important measures:
Support: the fraction of transactions in the dataset that contain the item set.
Confidence: the probability that a transaction containing the items on the left-hand side also contains the item on the right-hand side.
Lift: the ratio of a rule's confidence to its expected confidence, where the expected confidence is the overall support of the right-hand side.
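To make the three measures concrete, here is a minimal base R sketch that computes them by hand for one rule. The `baskets` list and the rule {bread} => {milk} are made-up illustrations, not taken from the Groceries data.

```r
# Toy baskets (hypothetical data, not taken from the Groceries file)
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "bread", "eggs")
)

# TRUE for each basket that contains every item in `items`
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

lhs <- "bread"
rhs <- "milk"

support_rule <- mean(contains(c(lhs, rhs)))  # share of baskets with lhs and rhs
support_lhs  <- mean(contains(lhs))          # share of baskets with lhs
support_rhs  <- mean(contains(rhs))          # share of baskets with rhs

confidence <- support_rule / support_lhs     # P(rhs | lhs)
lift       <- confidence / support_rhs       # confidence relative to baseline P(rhs)

print(c(support = support_rule, confidence = confidence, lift = lift))
```

In this toy data the lift comes out below 1, meaning bread buyers are slightly less likely than the average customer to buy milk. A lift above 1 would indicate a positive association.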
I set the support at 0.001 and confidence at 0.8.
rules<-apriori(grocery,parameter = list(supp = 0.001, conf = 0.8),control = list(verbose = FALSE))
summary(rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## grocery 9835 0.001 0.8
There are 410 rules in total with a confidence of at least 80%. The mean lift is 3.951.
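The minimum count of 10 in the summary follows from the support threshold. A quick arithmetic check, using only the numbers reported above (the variable names are mine):

```r
n_transactions <- 9835   # number of transactions, from summary(grocery)
min_support    <- 0.001  # support threshold passed to apriori()

# Smallest whole number of transactions whose share reaches the threshold
min_count <- ceiling(min_support * n_transactions)
print(min_count)

# Support of a rule that occurs in exactly that many transactions
print(min_count / n_transactions)
```

So a rule has to appear on at least 10 receipts to be kept, which is consistent with the minimum support of 0.001017 in the summary.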
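The top 10 rules by lift are shown below:
library(dplyr)  # provides %>%, arrange() and desc()
rules %>% DATAFRAME() %>%
  arrange(desc(lift)) %>%  # sort by lift, as the assignment asks
  head(10)                 # top_n(10) without `wt` would rank by the last column (count)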
The rules indicate relationships between different items. For example, if a customer buys citrus fruit, root vegetables, tropical fruit and whole milk, there is an 88.5% chance that they will also buy other vegetables.
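Since the assignment asks for the top 10 rules by lift, here is a sketch of one way to list them with arules' own helpers. To keep it runnable on its own, it assumes the `Groceries` data bundled with arules can stand in for GroceryDataSet.csv; `rules_demo` and `top10_lift` are names chosen for illustration.

```r
library(arules)

# Assumption: the Groceries data shipped with arules stands in for
# GroceryDataSet.csv so that this sketch runs without the attached file.
data("Groceries")

rules_demo <- apriori(
  Groceries,
  parameter = list(supp = 0.001, conf = 0.8),
  control   = list(verbose = FALSE)
)

# head() with `by` orders the rules by that quality measure before truncating
top10_lift <- head(rules_demo, n = 10, by = "lift")

# inspect() prints each rule as {lhs} => {rhs} with its quality measures
inspect(top10_lift)
```

With the real data, the same two calls can be applied to the `rules` object mined above.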
I plot the top 20 rules by lift as a graph, which makes the relationships between items easy to see.
library(arulesViz)  # provides the plot() method for rules
subrules <- head(rules, n = 20, by = "lift")
plot(subrules, method = "graph")
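For the cluster analysis, extract the item-by-transaction matrix, convert it to a data frame, and standardize it.
# grocery@data is a sparse matrix with one row per item and one column per transaction
df_grocery <- grocery@data %>%
  as.matrix() %>%
  as.data.frame()
row.names(df_grocery) <- grocery@itemInfo$labels
df_grocery <- scale(df_grocery)  # center and scale each column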
We need to identify an appropriate value for k, the number of clusters the rows will be grouped into. Four common techniques are the Elbow method, the Silhouette method, the Gap statistic, and NbClust(). For simplicity, I will use the Elbow approach.
fviz_nbclust(df_grocery, kmeans, method = "wss", k.max = 30) + theme_minimal() + ggtitle("the Elbow Method")
As the plot shows, the within-cluster sum of squares keeps decreasing as k increases, and the slope is still very steep at the largest k tried. Therefore, I will fit the K-Means model with \(k=30\).
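The quantity behind the elbow plot can also be computed by hand. The sketch below uses base R only and made-up toy data (three separated blobs), so it illustrates the method rather than reproducing the Groceries result.

```r
set.seed(123)

# Hypothetical toy data: three separated 2-D blobs of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 5),  ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- vapply(
  k_values,
  function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss,
  numeric(1)
)

# The "elbow" is the k after which extra clusters reduce WSS only slightly
print(data.frame(k = k_values, wss = round(wss, 1)))
```

On data with clear groups, the drop in `wss` flattens after the true number of clusters. When the curve keeps falling steeply, as it does for the grocery items, the elbow gives no obvious cut-off.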
set.seed(123)
km_res <- kmeans(df_grocery, centers = 30,nstart = 20)
summary(km_res)
## Length Class Mode
## cluster 169 -none- numeric
## centers 295050 -none- numeric
## totss 1 -none- numeric
## withinss 30 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 30 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
sil <- silhouette(km_res$cluster, dist(df_grocery))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 135 0.40
## 3 3 1 0.00
## 4 4 1 0.00
## 5 5 1 0.00
## 6 6 1 0.00
## 7 7 1 0.00
## 8 8 1 0.00
## 9 9 1 0.00
## 10 10 1 0.00
## 11 11 1 0.00
## 12 12 4 -0.19
## 13 13 1 0.00
## 14 14 1 0.00
## 15 15 1 0.00
## 16 16 1 0.00
## 17 17 1 0.00
## 18 18 1 0.00
## 19 19 1 0.00
## 20 20 3 -0.20
## 21 21 1 0.00
## 22 22 1 0.00
## 23 23 1 0.00
## 24 24 1 0.00
## 25 25 1 0.00
## 26 26 1 0.00
## 27 27 1 0.00
## 28 28 1 0.00
## 29 29 1 0.00
## 30 30 1 0.00
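Most clusters in the table have size 1 and an average silhouette width of 0.00. The base R sketch below computes silhouette widths by hand on made-up toy data to show where that zero comes from. It follows the usual definition s = (b - a) / max(a, b) and the convention that a point alone in its cluster gets width 0. The data and the function name `silhouette_width` are illustrative.

```r
# Hypothetical toy data: 1-D points with hand-assigned cluster labels
x       <- c(1, 2, 3, 10, 11, 30)
cluster <- c(1, 1, 1, 2, 2, 3)   # cluster 3 is a singleton

d <- as.matrix(dist(x))          # pairwise distances between points

silhouette_width <- function(i) {
  same <- cluster == cluster[i]
  # Convention: a point that is alone in its cluster gets width 0
  if (sum(same) == 1) return(0)
  # a: mean distance to the other members of the point's own cluster
  a <- sum(d[i, same]) / (sum(same) - 1)
  # b: smallest mean distance to the members of any other cluster
  others <- setdiff(unique(cluster), cluster[i])
  b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
  (b - a) / max(a, b)
}

widths <- vapply(seq_along(x), silhouette_width, numeric(1))
print(round(widths, 2))
```

A width of 0 for a singleton therefore says nothing about how well that item is placed. It only reflects that there are no other members to compare against.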
I tried different values of k below 30, but there was always one very large cluster. With k equal to 30, cluster 2 contains 135 items. We can take a look at which items it includes.
km_res$cluster[km_res$cluster==2]
## abrasive cleaner artif. sweetener baby cosmetics
## 2 2 2
## baby food bags baking powder
## 2 2 2
## bathroom cleaner berries beverages
## 2 2 2
## brandy butter milk cake bar
## 2 2 2
## candles candy canned fish
## 2 2 2
## canned fruit canned vegetables cat food
## 2 2 2
## cereals chewing gum chocolate marshmallow
## 2 2 2
## cleaner cling film/bags cocoa drinks
## 2 2 2
## condensed milk cooking chocolate cookware
## 2 2 2
## cream cream cheese curd cheese
## 2 2 2
## decalcifier dental care dessert
## 2 2 2
## detergent dish cleaner dishes
## 2 2 2
## dog food female sanitary products finished products
## 2 2 2
## fish flour flower (seeds)
## 2 2 2
## flower soil/fertilizer frozen chicken frozen dessert
## 2 2 2
## frozen fish frozen fruits frozen meals
## 2 2 2
## frozen potato products grapes hair spray
## 2 2 2
## ham hamburger meat hard cheese
## 2 2 2
## herbs honey house keeping products
## 2 2 2
## hygiene articles ice cream instant coffee
## 2 2 2
## Instant food products jam ketchup
## 2 2 2
## kitchen towels kitchen utensil light bulbs
## 2 2 2
## liqueur liquor liquor (appetizer)
## 2 2 2
## liver loaf long life bakery product make up remover
## 2 2 2
## male cosmetics mayonnaise meat
## 2 2 2
## meat spreads misc. beverages mustard
## 2 2 2
## nut snack nuts/prunes oil
## 2 2 2
## onions organic products organic sausage
## 2 2 2
## packaged fruit/vegetables pasta pet care
## 2 2 2
## photo/film pickled vegetables popcorn
## 2 2 2
## pot plants potato products preservation products
## 2 2 2
## processed cheese prosecco pudding powder
## 2 2 2
## ready soups red/blush wine rice
## 2 2 2
## roll products rubbing alcohol rum
## 2 2 2
## salad dressing salt sauces
## 2 2 2
## seasonal products semi-finished bread skin care
## 2 2 2
## sliced cheese snack products soap
## 2 2 2
## soft cheese softener sound storage medium
## 2 2 2
## soups sparkling wine specialty bar
## 2 2 2
## specialty cheese specialty fat specialty vegetables
## 2 2 2
## spices spread cheese sugar
## 2 2 2
## sweet spreads syrup tea
## 2 2 2
## tidbits toilet cleaner turkey
## 2 2 2
## UHT-milk vinegar whisky
## 2 2 2
## white bread white wine zwieback
## 2 2 2
It’s difficult to see what the items in this cluster have in common.