Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
# Loading library
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(cluster)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Loading the data
grocery_data = read.transactions("GroceryDataSet.csv", sep = ",")
library(readr)
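# Note: read_csv() treats the first receipt as a header row, which is why the output below reports 9834 rows rather than 9835.
# Passing col_names = FALSE would keep that first transaction as data.
GroceryDataSet = read_csv("GroceryDataSet.csv")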
## New names:
## Rows: 9834 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (32): citrus fruit, semi-finished bread, margarine, ready soups, ...5, ....
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
View(GroceryDataSet)
In market basket analysis, we observe how often customers buy products together. For example, if a customer buys a bag of chips, how likely are they to also buy a can of soda or a bottle of water? To understand this relationship, we use a technique called association, which estimates how likely one item is to be purchased given that another item is in the basket.
We first check a summary of the data to see the frequencies of the products purchased.
summary(grocery_data)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
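As a quick sanity check on the summary above, the reported density and mean basket size can be reproduced from the item counts it prints. This is only a sketch in base R; the variable names are illustrative and the numbers are copied by hand from the summary output.

```r
# Figures copied from the summary(grocery_data) output above
n_transactions <- 9835
n_items <- 169
# Counts for the five most frequent items plus "(Other)"
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

# Total number of item occurrences across all receipts
total_items <- sum(item_counts)

# Density: share of cells in the transaction-by-item matrix that are filled
density <- total_items / (n_transactions * n_items)

# Mean number of items per receipt
mean_basket_size <- total_items / n_transactions

print(round(density, 5))          # about 0.02609
print(round(mean_basket_size, 3)) # about 4.409
```

Both values agree with the summary, which confirms that the density is simply the average share of the 169 items found on a receipt.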
We can explore the items to understand how frequently customers purchased each one.
# Display the 15 most frequently bought items.
itemFrequencyPlot(grocery_data, topN=15, xlab = "Products", ylab = "Frequency of Items", main = "Top 15 Most Frequently Purchased Grocery Items", col = rainbow(15))
?itemFrequencyPlot
Our plot indicates that whole milk, other vegetables, and rolls/buns were the most popular items purchased in the store.
Relationships among these items can be described with “association rules”. Association rules are “if-then” statements about the products in the grocery store. For example, if you go to the store to purchase bread and jelly, you are more likely to also purchase peanut butter (the makings of a peanut butter and jelly sandwich). This rule can be written as {bread, jelly} -> {peanut butter}. To evaluate a rule, we use three metrics: support, confidence and lift. Support is the share of all transactions that contain every item in the rule, so it shows how common the combination of bread, jelly and peanut butter is. Confidence is the probability that peanut butter is purchased given that bread and jelly are purchased, which measures the strength of the rule. Lastly, lift compares how often the two sides of the rule are purchased together with how often we would expect that if they were purchased independently. A lift greater than 1 indicates a positive association, a lift equal to 1 indicates no association, and a lift less than 1 indicates a negative association.
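To make the three metrics concrete, here is a small sketch in base R that computes them by hand for the rule {bread, jelly} -> {peanut butter}. The five baskets are made up for illustration and are not taken from the grocery data; the helper `has_items` is likewise hypothetical.

```r
# Five hypothetical baskets (illustrative only, not from the grocery data)
baskets <- list(
  c("bread", "jelly", "peanut butter"),
  c("bread", "jelly"),
  c("bread", "peanut butter"),
  c("milk", "bread", "jelly", "peanut butter"),
  c("milk")
)

# TRUE for each basket that contains every item in `items`
has_items <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

lhs <- c("bread", "jelly")
rhs <- "peanut butter"

support_rule <- mean(has_items(c(lhs, rhs)))  # share of baskets with lhs and rhs
support_lhs  <- mean(has_items(lhs))          # share of baskets with lhs
support_rhs  <- mean(has_items(rhs))          # share of baskets with rhs

confidence <- support_rule / support_lhs      # P(rhs | lhs)
lift <- confidence / support_rhs              # confidence relative to P(rhs)

print(c(support = support_rule, confidence = confidence, lift = lift))
```

In this toy data the rule has support 0.4, confidence of about 0.67 and lift of about 1.11, so buying bread and jelly makes peanut butter slightly more likely than it is overall.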
We set the support parameter to 0.01, so we only consider itemsets that occur in at least 1% of all transactions. The minimum confidence of 50% means that, among the transactions containing the antecedent {A, B}, at least 50% must also contain the consequent. Setting “minlen = 2” requires each rule to contain at least two items.
We can mine the rules in the grocery data set with the apriori function from the arules library.
# parameter takes a list holding the thresholds for the rules
grocery_rules = apriori(grocery_data, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(grocery_rules)
## set of 15 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01007 Min. :0.5000 Min. :0.01729 Min. :1.984
## 1st Qu.:0.01174 1st Qu.:0.5151 1st Qu.:0.02089 1st Qu.:2.036
## Median :0.01230 Median :0.5245 Median :0.02430 Median :2.203
## Mean :0.01316 Mean :0.5411 Mean :0.02454 Mean :2.299
## 3rd Qu.:0.01403 3rd Qu.:0.5718 3rd Qu.:0.02598 3rd Qu.:2.432
## Max. :0.02227 Max. :0.5862 Max. :0.04342 Max. :3.030
## count
## Min. : 99.0
## 1st Qu.:115.5
## Median :121.0
## Mean :129.4
## 3rd Qu.:138.0
## Max. :219.0
##
## mining info:
## data ntransactions support confidence
## grocery_data 9835 0.01 0.5
## call
## apriori(data = grocery_data, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))
The apriori call found 15 association rules in our grocery data, and some of them are stronger than others. As mentioned previously, lift indicates how strong an association is, so we sort the rules by lift.
# Sort by lift and look at the top 10 rules
inspect(sort(grocery_rules, by="lift")[1:10])
## lhs rhs support
## [1] {citrus fruit, root vegetables} => {other vegetables} 0.01037112
## [2] {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [3] {rolls/buns, root vegetables} => {other vegetables} 0.01220132
## [4] {root vegetables, yogurt} => {other vegetables} 0.01291307
## [5] {curd, yogurt} => {whole milk} 0.01006609
## [6] {butter, other vegetables} => {whole milk} 0.01148958
## [7] {root vegetables, tropical fruit} => {whole milk} 0.01199797
## [8] {root vegetables, yogurt} => {whole milk} 0.01453991
## [9] {domestic eggs, other vegetables} => {whole milk} 0.01230300
## [10] {whipped/sour cream, yogurt} => {whole milk} 0.01087951
## confidence coverage lift count
## [1] 0.5862069 0.01769192 3.029608 102
## [2] 0.5845411 0.02104728 3.020999 121
## [3] 0.5020921 0.02430097 2.594890 120
## [4] 0.5000000 0.02582613 2.584078 127
## [5] 0.5823529 0.01728521 2.279125 99
## [6] 0.5736041 0.02003050 2.244885 113
## [7] 0.5700483 0.02104728 2.230969 118
## [8] 0.5629921 0.02582613 2.203354 143
## [9] 0.5525114 0.02226741 2.162336 121
## [10] 0.5245098 0.02074225 2.052747 107
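Sorting the rules by lift shows that {citrus fruit, root vegetables} => {other vegetables} is the strongest relationship, with a lift of about 3.03. Customers who buy citrus fruit and root vegetables are about three times as likely to buy other vegetables as customers in general. The weakest of the 15 rules, {other vegetables, whipped/sour cream} => {whole milk}, falls outside this top 10; the summary above reports a minimum lift of 1.984.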
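The figures for the top rule can be checked by hand from numbers already printed in this report. The sketch below uses only base R arithmetic; the variable names are illustrative, and the inputs are copied from the rule table (count and coverage) and from the data summary (1903 purchases of other vegetables in 9835 transactions).

```r
# Inputs copied from the output above
n_transactions <- 9835
count_rule <- 102        # transactions with citrus fruit, root vegetables, other vegetables
coverage   <- 0.01769192 # support of the left-hand side {citrus fruit, root vegetables}
count_rhs  <- 1903       # transactions containing other vegetables

# Support of the whole rule
support_rule <- count_rule / n_transactions

# Recover the left-hand-side count from its coverage (174 transactions)
count_lhs <- round(coverage * n_transactions)

# Confidence: share of left-hand-side transactions that also contain the right-hand side
confidence <- count_rule / count_lhs

# Lift: confidence divided by the overall share of the right-hand side
lift <- confidence / (count_rhs / n_transactions)

print(round(c(support = support_rule, confidence = confidence, lift = lift), 6))
```

The results match the first row of the table: support of about 0.010371, confidence of about 0.586207 and lift of about 3.029608.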
Next, we can visualize the rules with plots.
# scatter plot of the 15 rules by support and confidence, shaded by lift
plot(grocery_rules)
plot(grocery_rules, method = "graph")
This graph visualizes how the association rules connect the purchased food items. It helps us see which items are frequently bought together and how strong each association is. The size of each circle reflects support, and the arrows point from the items on the left-hand side of a rule to the item that tends to be purchased with them. The item with the largest dot is “whole milk”, because it has high support. Customers who purchase curd and yogurt are more likely to buy “whole milk” as well, and the darker red around “curd” indicates a strong association between “curd” and “whole milk”.
# install.packages("rstatix")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
In clustering, we use a technique that groups similar objects together. Objects in the same cluster behave similarly to each other, while objects in different clusters behave more distinctly. Here the objects are grocery items, because each row of the data frame built below is an item.
K-means clustering groups the grocery items with similar purchase patterns by measuring the distance between their feature vectors. Each feature vector records which transactions contain the item.
We will run the algorithm with different numbers of clusters and observe how many items fall into each cluster.
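To illustrate what the distance between feature vectors means here, the sketch below builds three made-up item vectors over eight hypothetical transactions and computes their Euclidean distances with base R's `dist`. The values are invented for illustration; only the item names are borrowed from the data set.

```r
# Rows are items, columns are 8 hypothetical transactions (1 = item was on the receipt)
items <- rbind(
  "whole milk" = c(1, 1, 0, 1, 1, 0, 1, 0),
  "candles"    = c(0, 0, 0, 0, 0, 1, 0, 0),
  "soap"       = c(0, 0, 1, 0, 0, 0, 0, 0)
)

# Euclidean distance between every pair of items.
# For 0/1 vectors this is the square root of the number of
# transactions in which the two items differ.
item_dist <- as.matrix(dist(items))
print(round(item_dist, 3))
```

The two rarely bought items differ in only two transactions, so they are closer to each other than either is to the frequently bought item. Under this distance, items that are seldom purchased can end up in one large cluster simply because their vectors are almost all zeros.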
Before we start clustering, we must convert the grocery data into a data frame.
# character matrix of the raw CSV (not used further below)
groceries_matrix <- as.matrix(GroceryDataSet)
grocery_dataframe = as.data.frame(as.matrix(grocery_data@data))
# set the row names to the item labels so each row can be identified by its grocery item
row.names(grocery_dataframe) = grocery_data@itemInfo$labels
# set a seed before each run so the results are reproducible
set.seed(451)
# split the items into 5 clusters, using 75 random starts (nstart) and at most 15 iterations per start (iter.max)
cluster_1 = kmeans(grocery_dataframe, centers = 5, nstart = 75, iter.max = 15)
table(cluster_1$cluster)
##
## 1 2 3 4 5
## 2 143 1 1 22
fviz_cluster(cluster_1,
data = grocery_dataframe,
geom = "point",
ellipse.type = "convex",
main = "K-means Clustering For Cluster 1")
# list the items assigned to cluster 2
cluster_1$cluster[cluster_1$cluster == 2]
## abrasive cleaner artif. sweetener baby cosmetics
## 2 2 2
## baby food bags baking powder
## 2 2 2
## bathroom cleaner berries beverages
## 2 2 2
## brandy butter milk cake bar
## 2 2 2
## candles candy canned beer
## 2 2 2
## canned fish canned fruit canned vegetables
## 2 2 2
## cat food cereals chewing gum
## 2 2 2
## chicken chocolate chocolate marshmallow
## 2 2 2
## cleaner cling film/bags cocoa drinks
## 2 2 2
## coffee condensed milk cooking chocolate
## 2 2 2
## cookware cream cream cheese
## 2 2 2
## curd cheese decalcifier dental care
## 2 2 2
## dessert detergent dish cleaner
## 2 2 2
## dishes dog food female sanitary products
## 2 2 2
## finished products fish flour
## 2 2 2
## flower (seeds) flower soil/fertilizer frozen chicken
## 2 2 2
## frozen dessert frozen fish frozen fruits
## 2 2 2
## frozen meals frozen potato products frozen vegetables
## 2 2 2
## grapes hair spray ham
## 2 2 2
## hamburger meat hard cheese herbs
## 2 2 2
## honey house keeping products hygiene articles
## 2 2 2
## ice cream instant coffee Instant food products
## 2 2 2
## jam ketchup kitchen towels
## 2 2 2
## kitchen utensil light bulbs liqueur
## 2 2 2
## liquor liquor (appetizer) liver loaf
## 2 2 2
## long life bakery product make up remover male cosmetics
## 2 2 2
## mayonnaise meat meat spreads
## 2 2 2
## misc. beverages mustard nut snack
## 2 2 2
## nuts/prunes oil onions
## 2 2 2
## organic products organic sausage packaged fruit/vegetables
## 2 2 2
## pasta pet care photo/film
## 2 2 2
## pickled vegetables popcorn pot plants
## 2 2 2
## potato products preservation products processed cheese
## 2 2 2
## prosecco pudding powder ready soups
## 2 2 2
## red/blush wine rice roll products
## 2 2 2
## rubbing alcohol rum salad dressing
## 2 2 2
## salt salty snack sauces
## 2 2 2
## seasonal products semi-finished bread skin care
## 2 2 2
## sliced cheese snack products soap
## 2 2 2
## soft cheese softener sound storage medium
## 2 2 2
## soups sparkling wine specialty bar
## 2 2 2
## specialty cheese specialty chocolate specialty fat
## 2 2 2
## specialty vegetables spices spread cheese
## 2 2 2
## sugar sweet spreads syrup
## 2 2 2
## tea tidbits toilet cleaner
## 2 2 2
## turkey UHT-milk vinegar
## 2 2 2
## waffles whisky white bread
## 2 2 2
## white wine zwieback
## 2 2
Our first run groups the grocery items into five clusters, and the graph shows them along two dimensions. Dim1 explains 8.2% of the variation between the items, and Dim2 explains 5.2%. Looking at the table, the largest cluster is cluster 2 with 143 grocery items. Cluster 5 contains 22 items, cluster 1 contains 2, and clusters 3 and 4 contain 1 item each. Cluster 2 is probably the largest because it collects items that appear in few transactions: their feature vectors are mostly zeros, so they lie close together. None of the five most frequent items from the summary (whole milk, other vegetables, rolls/buns, soda, yogurt) appear in its listing. Those items fall in the smaller clusters, whose purchase patterns are distinct enough to stand apart from the rest.
Next, we increase the number of clusters to see how the grouping changes.
set.seed(452)
# split the items into 10 clusters, using 75 random starts and at most 15 iterations per start
cluster_2 = kmeans(grocery_dataframe, centers = 10, nstart = 75, iter.max = 15)
table(cluster_2$cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 21 1 138 2 1 2 1 1 1 1
fviz_cluster(cluster_2,
data = grocery_dataframe,
geom = "point",
ellipse.type = "convex",
main = "K-means Clustering For Cluster 2")
cluster_2$cluster[cluster_2$cluster == 3]
## abrasive cleaner artif. sweetener baby cosmetics
## 3 3 3
## baby food bags baking powder
## 3 3 3
## bathroom cleaner berries beverages
## 3 3 3
## brandy butter milk cake bar
## 3 3 3
## candles candy canned beer
## 3 3 3
## canned fish canned fruit canned vegetables
## 3 3 3
## cat food cereals chewing gum
## 3 3 3
## chocolate marshmallow cleaner cling film/bags
## 3 3 3
## cocoa drinks condensed milk cooking chocolate
## 3 3 3
## cookware cream cream cheese
## 3 3 3
## curd cheese decalcifier dental care
## 3 3 3
## dessert detergent dish cleaner
## 3 3 3
## dishes dog food female sanitary products
## 3 3 3
## finished products fish flour
## 3 3 3
## flower (seeds) flower soil/fertilizer frozen chicken
## 3 3 3
## frozen dessert frozen fish frozen fruits
## 3 3 3
## frozen meals frozen potato products grapes
## 3 3 3
## hair spray ham hamburger meat
## 3 3 3
## hard cheese herbs honey
## 3 3 3
## house keeping products hygiene articles ice cream
## 3 3 3
## instant coffee Instant food products jam
## 3 3 3
## ketchup kitchen towels kitchen utensil
## 3 3 3
## light bulbs liqueur liquor
## 3 3 3
## liquor (appetizer) liver loaf long life bakery product
## 3 3 3
## make up remover male cosmetics mayonnaise
## 3 3 3
## meat meat spreads misc. beverages
## 3 3 3
## mustard nut snack nuts/prunes
## 3 3 3
## oil onions organic products
## 3 3 3
## organic sausage packaged fruit/vegetables pasta
## 3 3 3
## pet care photo/film pickled vegetables
## 3 3 3
## popcorn pot plants potato products
## 3 3 3
## preservation products processed cheese prosecco
## 3 3 3
## pudding powder ready soups red/blush wine
## 3 3 3
## rice roll products rubbing alcohol
## 3 3 3
## rum salad dressing salt
## 3 3 3
## salty snack sauces seasonal products
## 3 3 3
## semi-finished bread skin care sliced cheese
## 3 3 3
## snack products soap soft cheese
## 3 3 3
## softener sound storage medium soups
## 3 3 3
## sparkling wine specialty bar specialty cheese
## 3 3 3
## specialty chocolate specialty fat specialty vegetables
## 3 3 3
## spices spread cheese sugar
## 3 3 3
## sweet spreads syrup tea
## 3 3 3
## tidbits toilet cleaner turkey
## 3 3 3
## UHT-milk vinegar waffles
## 3 3 3
## whisky white wine zwieback
## 3 3 3
The items are now grouped into 10 clusters, where cluster 3 contains 138 items and cluster 1 contains 21. Every cluster contains at least 1 item. Cluster 3 again holds most of the items, and the remaining eight clusters contain only 1 or 2 items each, so they act like outliers. Increasing the number of clusters creates more groups, but some are more meaningful than others. This can lead to overfitting, which makes it harder to explain the shopping behavior of the consumer.
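One common way to judge whether extra clusters are worthwhile is to compare the total within-cluster sum of squares across several values of k and look for the point where the improvement levels off (the "elbow"). The sketch below shows the idea in base R on a small random 0/1 matrix that merely stands in for the item data; the matrix, the range of k and the variable names are all assumptions made for illustration.

```r
set.seed(1)
# Stand-in data: 40 hypothetical items by 30 hypothetical transactions (0/1 entries)
toy_items <- matrix(rbinom(40 * 30, size = 1, prob = 0.3), nrow = 40)

# Candidate numbers of clusters
ks <- 1:6

# Total within-cluster sum of squares for each k
wss <- vapply(
  ks,
  function(k) kmeans(toy_items, centers = k, nstart = 20)$tot.withinss,
  numeric(1)
)

# Plot k against the within-cluster sum of squares and look for an elbow
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The same loop could be pointed at `grocery_dataframe` in place of the stand-in matrix. The factoextra package loaded earlier also offers `fviz_nbclust()` with `method = "wss"` for a similar plot.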
set.seed(453)
# split the items into 20 clusters, using 75 random starts and at most 15 iterations per start
cluster_3 = kmeans(grocery_dataframe, centers = 20, nstart = 75, iter.max = 15)
table(cluster_3$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 130
fviz_cluster(cluster_3,
data = grocery_dataframe,
geom = "point",
ellipse.type = "convex",
main = "K-means Clustering For Cluster 3")
The items are now grouped into 20 clusters, where cluster 4 contains 21 items and cluster 20 contains 130. Every other cluster contains exactly 1 item. Cluster 20 is dominant: it holds 130 of the 169 items, roughly three quarters of the data. Increasing the number of clusters has now caused overfitting, because the extra clusters are almost all single items while most items remain concentrated in cluster 20.
cluster_3$cluster[cluster_3$cluster == 1]
## newspapers
## 1
set.seed(454)
# split the items into 50 clusters, using 75 random starts and at most 15 iterations per start
cluster_4 = kmeans(grocery_dataframe, centers = 50, nstart = 75, iter.max = 15)
table(cluster_4$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 2 1 1 1 1 1 1 1 1 1 1 115 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50
## 1 1 1 1 1 1 1 1 1 1
fviz_cluster(cluster_4,
data = grocery_dataframe,
geom = "point",
ellipse.type = "convex",
main = "K-means Clustering For Cluster 4") +
theme(legend.position = "none")
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '30'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '29'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '31'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '27'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '28'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '26'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '26'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '27'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '28'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '29'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '30'
## Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
## pch value '31'
cluster_4$cluster[cluster_4$cluster == 34]
## abrasive cleaner artif. sweetener baby cosmetics
## 34 34 34
## baby food bags baking powder
## 34 34 34
## bathroom cleaner brandy cake bar
## 34 34 34
## candles canned fish canned fruit
## 34 34 34
## canned vegetables cat food cereals
## 34 34 34
## chewing gum chocolate marshmallow cleaner
## 34 34 34
## cling film/bags cocoa drinks condensed milk
## 34 34 34
## cooking chocolate cookware cream
## 34 34 34
## curd cheese decalcifier dental care
## 34 34 34
## detergent dish cleaner dishes
## 34 34 34
## dog food female sanitary products finished products
## 34 34 34
## fish flour flower (seeds)
## 34 34 34
## flower soil/fertilizer frozen chicken frozen dessert
## 34 34 34
## frozen fish frozen fruits frozen potato products
## 34 34 34
## grapes hair spray herbs
## 34 34 34
## honey house keeping products ice cream
## 34 34 34
## instant coffee Instant food products jam
## 34 34 34
## ketchup kitchen towels kitchen utensil
## 34 34 34
## light bulbs liqueur liquor
## 34 34 34
## liquor (appetizer) liver loaf make up remover
## 34 34 34
## male cosmetics mayonnaise meat
## 34 34 34
## meat spreads mustard nut snack
## 34 34 34
## nuts/prunes organic products organic sausage
## 34 34 34
## packaged fruit/vegetables pasta pet care
## 34 34 34
## photo/film pickled vegetables popcorn
## 34 34 34
## pot plants potato products preservation products
## 34 34 34
## processed cheese prosecco pudding powder
## 34 34 34
## ready soups red/blush wine rice
## 34 34 34
## roll products rubbing alcohol rum
## 34 34 34
## salad dressing salt sauces
## 34 34 34
## seasonal products semi-finished bread skin care
## 34 34 34
## snack products soap soft cheese
## 34 34 34
## softener sound storage medium soups
## 34 34 34
## sparkling wine specialty cheese specialty fat
## 34 34 34
## specialty vegetables spices spread cheese
## 34 34 34
## sweet spreads syrup tea
## 34 34 34
## tidbits toilet cleaner turkey
## 34 34 34
## vinegar whisky white wine
## 34 34 34
## zwieback
## 34
The items in cluster 34 have similar purchase patterns across transactions. The cluster mixes food with non-food products such as abrasive cleaner, light bulbs and soap, so it is better read as a group of items that are likely purchased less often than as a typical basket.
The items are now grouped into 50 clusters, where cluster 34 contains 115 items, cluster 14 contains 5, cluster 23 contains 2, and every other cluster contains a single item. With so many single-item clusters we have overfit the data, producing groups that are not useful for understanding customer behavior.
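The four runs can be compared side by side using the cluster sizes printed in the tables above. This is a base R sketch; the sizes are typed in by hand from those tables, and the object names are illustrative.

```r
# Cluster sizes copied from the table() output of each run
sizes <- list(
  k5  = c(2, 143, 1, 1, 22),
  k10 = c(21, 1, 138, 2, 1, 2, 1, 1, 1, 1),
  k20 = c(rep(1, 3), 21, rep(1, 15), 130),
  k50 = c(rep(1, 13), 5, rep(1, 8), 2, rep(1, 10), 115, rep(1, 16))
)

comparison <- data.frame(
  k = lengths(sizes),
  # share of the 169 items that sit in the largest cluster
  largest_share = vapply(sizes, function(s) max(s) / sum(s), numeric(1)),
  # number of clusters that hold a single item
  singletons = vapply(sizes, function(s) sum(s == 1), integer(1))
)

print(comparison, digits = 3)
```

As k grows from 5 to 50, the largest cluster shrinks only from about 85% to about 68% of the items, while the number of single-item clusters rises from 2 to 47. Most of the added clusters therefore just peel off individual items.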