Hello and welcome to this week in Data Mining. This week I began working with a classic association rule mining technique, market basket analysis. Market basket analysis investigates whether two products are purchased together and whether buying one product makes buying the other more likely. A few measures are worth defining before we start.
Lift: Lift compares the probability of B given A with the probability of B overall. If this ratio is larger than 1, we say that A on the left-hand side (LHS) results in an upward lift on the right-hand side (RHS) B.
Support: Proportion of all transactions that contain every item in the rule (LHS and RHS together)
Confidence: Proportion of the transactions containing the LHS items that also contain the RHS item
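To make the three measures concrete, here is a minimal base R sketch that computes them by hand for one rule. The five baskets and the helper `has_items` are made up for illustration and have nothing to do with the grocery data used below.

```r
# Toy baskets (invented for illustration, not taken from groceries.csv)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread"),
  c("butter")
)

# Hypothetical helper: TRUE for each basket that contains every item in `items`
has_items <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

lhs <- "milk"   # A
rhs <- "bread"  # B

support    <- mean(has_items(c(lhs, rhs)))      # P(A and B)
confidence <- support / mean(has_items(lhs))    # P(B | A)
lift       <- confidence / mean(has_items(rhs)) # P(B | A) / P(B)

print(c(support = support, confidence = confidence, lift = lift))
```

Here milk and bread share 2 of 5 baskets, so support is 0.4, confidence is 0.4 / 0.6, and lift comes out slightly above 1.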
The data this week was grocery store purchase data. This set is unlike other data sets I have worked with, as each row is a grocery store transaction. Therefore, it is critical to read this data in as transactions, not as a standard data frame. First, we load our packages and read in the data.
# Declare packages
library(readr)
library(datasets)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v dplyr 1.0.2
## v tibble 3.0.4 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Loading required package: grid
trans <- read.transactions("groceries.csv", format = 'basket', sep = ',')
str(trans) # 169 different grocery items
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 1 variable:
## .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
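The slot sizes in this output can be turned into a couple of quick summary numbers. The sketch below only uses figures copied from the `str(trans)` output and assumes the usual sparse column layout, where the `p` slot holds one pointer more than the number of columns (here, transactions). The variable names are my own.

```r
# Numbers copied from the str(trans) output above
n_item_entries <- 43367  # length of the i slot: one entry per (item, transaction) pair
n_col_pointers <- 9836   # length of the p slot: one more than the number of transactions

n_transactions  <- n_col_pointers - 1
avg_basket_size <- n_item_entries / n_transactions

print(n_transactions)             # matches the second value of Dim
print(round(avg_basket_size, 2))  # average number of distinct items per basket
```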
There are two things you should take note of in the structure of the transaction data. The Dim line shows 169 items by 9835 individual transactions (the p vector two lines above it has 9836 entries only because it stores one more pointer than there are transactions), and the second-to-last line tells us there are 169 unique grocery store items purchased. Next, I wanted to investigate which things are bought most frequently.
# view item frequency
# store the result under its own name so it does not shadow arules::itemFrequency()
item_freq <- itemFrequency(trans)
sort(item_freq, decreasing = TRUE)
## whole milk other vegetables rolls/buns
## 0.2555160142 0.1934926284 0.1839349263
## soda yogurt bottled water
## 0.1743772242 0.1395017794 0.1105236401
## root vegetables tropical fruit shopping bags
## 0.1089984748 0.1049313676 0.0985256736
## sausage pastry citrus fruit
## 0.0939501779 0.0889679715 0.0827656329
## bottled beer newspapers canned beer
## 0.0805287239 0.0798169802 0.0776817489
## pip fruit fruit/vegetable juice whipped/sour cream
## 0.0756481952 0.0722928317 0.0716827656
## brown bread domestic eggs frankfurter
## 0.0648703610 0.0634468734 0.0589730554
## margarine coffee pork
## 0.0585663447 0.0580579563 0.0576512456
## butter curd beef
## 0.0554143366 0.0532791052 0.0524656838
## napkins chocolate frozen vegetables
## 0.0523640061 0.0496187087 0.0480935435
## chicken white bread cream cheese
## 0.0429079817 0.0420945602 0.0396542959
## waffles salty snack long life bakery product
## 0.0384341637 0.0378240976 0.0374173869
## dessert sugar UHT-milk
## 0.0371123538 0.0338586680 0.0334519573
## berries hamburger meat hygiene articles
## 0.0332486019 0.0332486019 0.0329435689
## onions specialty chocolate candy
## 0.0310116929 0.0304016268 0.0298932384
## frozen meals misc. beverages oil
## 0.0283680732 0.0283680732 0.0280630402
## butter milk specialty bar beverages
## 0.0279613625 0.0273512964 0.0260294865
## ham meat ice cream
## 0.0260294865 0.0258261312 0.0250127097
## hard cheese sliced cheese cat food
## 0.0245043213 0.0245043213 0.0232841891
## grapes chewing gum detergent
## 0.0223690900 0.0210472801 0.0192170819
## red/blush wine white wine pickled vegetables
## 0.0192170819 0.0190137265 0.0178952720
## baking powder semi-finished bread dishes
## 0.0176919166 0.0176919166 0.0175902389
## flour potted plants soft cheese
## 0.0173868836 0.0172852059 0.0170818505
## processed cheese herbs canned fish
## 0.0165734621 0.0162684291 0.0150482969
## pasta seasonal products cake bar
## 0.0150482969 0.0142348754 0.0132180986
## packaged fruit/vegetables mustard frozen fish
## 0.0130147433 0.0119979664 0.0116929334
## cling film/bags spread cheese liquor
## 0.0113879004 0.0111845450 0.0110828673
## canned vegetables frozen dessert salt
## 0.0107778343 0.0107778343 0.0107778343
## dish cleaner flower (seeds) condensed milk
## 0.0104728012 0.0103711235 0.0102694459
## roll products pet care photo/film
## 0.0102694459 0.0094560244 0.0092526690
## mayonnaise chocolate marshmallow sweet spreads
## 0.0091509914 0.0090493137 0.0090493137
## candles dog food specialty cheese
## 0.0089476360 0.0085409253 0.0085409253
## frozen potato products house keeping products turkey
## 0.0084392476 0.0083375699 0.0081342145
## Instant food products liquor (appetizer) rice
## 0.0080325369 0.0079308592 0.0076258261
## instant coffee popcorn zwieback
## 0.0074224708 0.0072191154 0.0069140824
## soups finished products vinegar
## 0.0068124047 0.0065073716 0.0065073716
## female sanitary products kitchen towels dental care
## 0.0061006609 0.0059989832 0.0057956279
## cereals sparkling wine sauces
## 0.0056939502 0.0055922725 0.0054905948
## softener jam spices
## 0.0054905948 0.0053889171 0.0051855618
## cleaner curd cheese liver loaf
## 0.0050838841 0.0050838841 0.0050838841
## male cosmetics rum ketchup
## 0.0045754957 0.0044738180 0.0042704626
## meat spreads brandy light bulbs
## 0.0042704626 0.0041687850 0.0041687850
## tea specialty fat abrasive cleaner
## 0.0038637519 0.0036603965 0.0035587189
## skin care nuts/prunes artif. sweetener
## 0.0035587189 0.0033553635 0.0032536858
## canned fruit syrup nut snack
## 0.0032536858 0.0032536858 0.0031520081
## snack products fish potato products
## 0.0030503305 0.0029486528 0.0028469751
## bathroom cleaner cookware soap
## 0.0027452974 0.0027452974 0.0026436197
## cooking chocolate pudding powder tidbits
## 0.0025419420 0.0023385867 0.0023385867
## cocoa drinks organic sausage prosecco
## 0.0022369090 0.0022369090 0.0020335536
## flower soil/fertilizer ready soups specialty vegetables
## 0.0019318760 0.0018301983 0.0017285206
## organic products decalcifier honey
## 0.0016268429 0.0015251652 0.0015251652
## cream frozen fruits hair spray
## 0.0013218099 0.0012201322 0.0011184545
## rubbing alcohol liqueur make up remover
## 0.0010167768 0.0009150991 0.0008134215
## salad dressing whisky toilet cleaner
## 0.0008134215 0.0008134215 0.0007117438
## baby cosmetics frozen chicken bags
## 0.0006100661 0.0006100661 0.0004067107
## kitchen utensil preservation products baby food
## 0.0004067107 0.0002033554 0.0001016777
## sound storage medium
## 0.0001016777
#Item Frequency Plot
itemFrequencyPlot(trans,topN=20,type="absolute")
Now that we know which items are most common, we can start constructing our association rules. First, we will look at rules with support of at least 0.01 and confidence of at least 0.5.
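Before running the search, it helps to translate the support threshold into a number of baskets. This is a small arithmetic sketch, not part of the arules workflow; the variable names are my own.

```r
n_transactions <- 9835  # from the Dim line of str(trans)
min_support    <- 0.01

# Smallest whole number of baskets whose share reaches min_support
min_count <- ceiling(min_support * n_transactions)
print(min_count)
```

The product is 98.35, so a rule needs at least 99 baskets to clear the 1% bar. The apriori log that follows prints 98, which appears to be the same product truncated rather than rounded up.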
# Build Association Rules
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules)
## lhs rhs support
## [1] {curd,yogurt} => {whole milk} 0.01006609
## [2] {butter,other vegetables} => {whole milk} 0.01148958
## [3] {domestic eggs,other vegetables} => {whole milk} 0.01230300
## [4] {whipped/sour cream,yogurt} => {whole milk} 0.01087951
## [5] {other vegetables,whipped/sour cream} => {whole milk} 0.01464159
## [6] {other vegetables,pip fruit} => {whole milk} 0.01352313
## [7] {citrus fruit,root vegetables} => {other vegetables} 0.01037112
## [8] {root vegetables,tropical fruit} => {other vegetables} 0.01230300
## [9] {root vegetables,tropical fruit} => {whole milk} 0.01199797
## [10] {tropical fruit,yogurt} => {whole milk} 0.01514997
## [11] {root vegetables,yogurt} => {other vegetables} 0.01291307
## [12] {root vegetables,yogurt} => {whole milk} 0.01453991
## [13] {rolls/buns,root vegetables} => {other vegetables} 0.01220132
## [14] {rolls/buns,root vegetables} => {whole milk} 0.01270971
## [15] {other vegetables,yogurt} => {whole milk} 0.02226741
## confidence coverage lift count
## [1] 0.5823529 0.01728521 2.279125 99
## [2] 0.5736041 0.02003050 2.244885 113
## [3] 0.5525114 0.02226741 2.162336 121
## [4] 0.5245098 0.02074225 2.052747 107
## [5] 0.5070423 0.02887646 1.984385 144
## [6] 0.5175097 0.02613116 2.025351 133
## [7] 0.5862069 0.01769192 3.029608 102
## [8] 0.5845411 0.02104728 3.020999 121
## [9] 0.5700483 0.02104728 2.230969 118
## [10] 0.5173611 0.02928317 2.024770 149
## [11] 0.5000000 0.02582613 2.584078 127
## [12] 0.5629921 0.02582613 2.203354 143
## [13] 0.5020921 0.02430097 2.594890 120
## [14] 0.5230126 0.02430097 2.046888 125
## [15] 0.5128806 0.04341637 2.007235 219
Each rule is read as the items on the LHS predicting the item on the RHS. Support is the proportion of all transactions that contain every item in the rule, confidence is the proportion of transactions with the LHS items that also contain the RHS item, and lift is the effect that the items on the LHS have on the likelihood of purchasing the item on the RHS, relative to how often that item is bought overall.
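As a sanity check, the columns for rule [1] can be rebuilt from one another. The sketch below uses only numbers printed above (the rule's count and coverage, plus the item frequency of whole milk); the variable names are my own.

```r
n_transactions <- 9835
count_rule     <- 99            # baskets with curd, yogurt and whole milk
coverage_lhs   <- 0.01728521    # share of baskets with curd and yogurt
support_rhs    <- 0.2555160142  # item frequency of whole milk

support    <- count_rule / n_transactions  # share of baskets containing the whole rule
confidence <- support / coverage_lhs       # P(RHS | LHS)
lift       <- confidence / support_rhs     # P(RHS | LHS) / P(RHS)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

The results agree with the support, confidence and lift that `inspect(rules)` reports for that rule, up to rounding.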
## Warning in plot.rules(rules, method = "graph", interactive = FALSE, shading =
## NA): The parameter interactive is deprecated. Use engine='interactive' instead.
In plot one, most items predict other vegetables and whole milk. This is expected, as they are the two most purchased items in the data set. Plot two shows us the confidence, support and lift for our 15 rules. Several rules are grouped above 0.55 confidence, and one rule stands out on support: {other vegetables, yogurt} => {whole milk}, at about 0.022, well above the 0.010 to 0.015 range of the others.
So our first association was cool, but the RHS consisted of whole milk and other vegetables, our two most purchased items. Next, I wanted to see what items best predict buying bottled beer.
# Rules with bottled beer on the RHS, sorted by confidence
beerrules <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.08),
                     appearance = list(default = "lhs", rhs = "bottled beer"),
                     control = list(verbose = FALSE))
beerrules <- sort(beerrules, decreasing = TRUE, by = "confidence")
inspect(beerrules[1:5])
## lhs rhs support confidence
## [1] {liquor,red/blush wine} => {bottled beer} 0.001931876 0.9047619
## [2] {liquor,soda} => {bottled beer} 0.001220132 0.5714286
## [3] {liquor} => {bottled beer} 0.004677173 0.4220183
## [4] {bottled water,herbs} => {bottled beer} 0.001220132 0.4000000
## [5] {soups,whole milk} => {bottled beer} 0.001118454 0.3793103
## coverage lift count
## [1] 0.002135231 11.235269 19
## [2] 0.002135231 7.095960 12
## [3] 0.011082867 5.240594 46
## [4] 0.003050330 4.967172 12
## [5] 0.002948653 4.710249 11
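The confidence values in this table are easier to judge once they are tied back to raw basket counts. The sketch below copies coverage and count for the top three rules from the output above and recovers how many baskets each LHS appears in; the data frame and its column names are my own.

```r
n_transactions <- 9835

beer <- data.frame(
  lhs      = c("liquor, red/blush wine", "liquor, soda", "liquor"),
  coverage = c(0.002135231, 0.002135231, 0.011082867),
  count    = c(19, 12, 46)
)

# Number of baskets containing the LHS, recovered from the coverage share
beer$lhs_baskets <- round(beer$coverage * n_transactions)
# Confidence is the fraction of those baskets that also contain bottled beer
beer$confidence  <- beer$count / beer$lhs_baskets

print(beer)
```

The 0.90 confidence of the top rule comes from 19 of just 21 baskets, so it is worth reading with some caution.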
These rules tell us that bottled beer often turns up in baskets that already contain liquor, wine or soda. It seems that customers who purchase other types of alcohol frequently add bottled beer as well, although each of these rules rests on fewer than 50 transactions. Now, what about canned beer?
# Same search, now with canned beer on the RHS
cannedbeerrules <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.08),
                           appearance = list(default = "lhs", rhs = "canned beer"),
                           control = list(verbose = FALSE))
cannedbeerrules <- sort(cannedbeerrules, decreasing = TRUE, by = "confidence")
inspect(cannedbeerrules[1:5])
## lhs rhs support confidence
## [1] {rolls/buns,shopping bags,soda} => {canned beer} 0.001525165 0.2419355
## [2] {rolls/buns,sausage,shopping bags} => {canned beer} 0.001423488 0.2372881
## [3] {liquor (appetizer)} => {canned beer} 0.001728521 0.2179487
## [4] {coffee,soda} => {canned beer} 0.001931876 0.1938776
## [5] {chicken,soda} => {canned beer} 0.001525165 0.1829268
## coverage lift count
## [1] 0.006304016 3.114444 15
## [2] 0.005998983 3.054619 14
## [3] 0.007930859 2.805662 17
## [4] 0.009964413 2.495793 19
## [5] 0.008337570 2.354824 15
The result was especially surprising to me! The LHS items for canned beer are quite different from those for bottled beer. It seems that canned beer is bought more with regular groceries like buns, coffee and sausage, whereas bottled beer is usually purchased with other types of alcohol. Also, confidence was much lower for canned beer than for bottled beer.
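Since the same apriori call is repeated with only the RHS item changed, it could be wrapped in a small function. This is a sketch rather than the code used in this post: `rules_for_rhs` is a hypothetical helper name, the toy baskets are invented so the block runs without groceries.csv, and `minlen = 2` is my addition to keep rules with an empty LHS out of the results.

```r
library(arules)

# Hypothetical helper: mine rules whose RHS is one chosen item, sorted by confidence
rules_for_rhs <- function(trans, item, supp = 0.001, conf = 0.08) {
  rules <- apriori(
    data = trans,
    parameter = list(supp = supp, conf = conf, minlen = 2),  # minlen = 2 is an assumption
    appearance = list(default = "lhs", rhs = item),
    control = list(verbose = FALSE)
  )
  sort(rules, decreasing = TRUE, by = "confidence")
}

# Toy baskets standing in for the grocery transactions (invented data)
toy <- as(list(
  c("liquor", "bottled beer"),
  c("liquor", "bottled beer", "soda"),
  c("soda", "canned beer"),
  c("whole milk", "soda"),
  c("liquor", "whole milk")
), "transactions")

toy_rules <- rules_for_rhs(toy, "bottled beer", supp = 0.1, conf = 0.5)
inspect(head(toy_rules, 5))
```

On the real data the call would be `rules_for_rhs(trans, "canned beer")`. Using `head()` instead of `[1:5]` avoids an error when fewer than five rules come back.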
Next, I wanted to look at one of my favourite foods, marshmallows. The data set only listed chocolate marshmallows (which I’ve never had), but I wanted to run it anyway.
# Rules with chocolate marshmallow on the RHS (very low confidence threshold)
marshrules <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.01),
                      appearance = list(default = "lhs", rhs = "chocolate marshmallow"),
                      control = list(verbose = FALSE))
marshrules <- sort(marshrules, decreasing = TRUE, by = "confidence")
inspect(marshrules[1:5])
## lhs rhs support
## [1] {candy} => {chocolate marshmallow} 0.001423488
## [2] {waffles} => {chocolate marshmallow} 0.001423488
## [3] {chocolate} => {chocolate marshmallow} 0.001626843
## [4] {domestic eggs} => {chocolate marshmallow} 0.001830198
## [5] {long life bakery product} => {chocolate marshmallow} 0.001016777
## confidence coverage lift count
## [1] 0.04761905 0.02989324 5.262172 14
## [2] 0.03703704 0.03843416 4.092801 14
## [3] 0.03278689 0.04961871 3.623135 16
## [4] 0.02884615 0.06344687 3.187662 18
## [5] 0.02717391 0.03741739 3.002870 10
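Chocolate marshmallows are bought with candy, waffles and chocolate: the essential food groups. Confidence is below 0.05 for every rule, though, so the high lift mostly reflects how rarely chocolate marshmallows are bought at all. Finally, let us flip our search criteria to see what is on the right-hand side when whipped/sour cream is on the left.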
# Now let's set our left-hand side to whipped/sour cream
sourrules <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.08, minlen = 2),
                     appearance = list(default = "rhs", lhs = "whipped/sour cream"),
                     control = list(verbose = FALSE))
sourrules <- sort(sourrules, decreasing = TRUE, by = "confidence")
inspect(sourrules[1:5])
## lhs rhs support confidence coverage
## [1] {whipped/sour cream} => {whole milk} 0.03223183 0.4496454 0.07168277
## [2] {whipped/sour cream} => {other vegetables} 0.02887646 0.4028369 0.07168277
## [3] {whipped/sour cream} => {yogurt} 0.02074225 0.2893617 0.07168277
## [4] {whipped/sour cream} => {root vegetables} 0.01708185 0.2382979 0.07168277
## [5] {whipped/sour cream} => {rolls/buns} 0.01464159 0.2042553 0.07168277
## lift count
## [1] 1.759754 317
## [2] 2.081924 284
## [3] 2.074251 204
## [4] 2.186250 168
## [5] 1.110476 144
The outcomes here are consistent with our original association rules, with whole milk and other vegetables taking the lead.
Chris