Hello

Hello and welcome to this week in Data Mining. This week I began working with a classic association rule mining technique, market basket analysis. Market basket analysis investigates whether two products are purchased together and whether the purchase of one product increases the likelihood of the other being purchased. Three measures come up throughout this study and are worth defining first.

Lift: Lift compares the probability of B given A with the overall probability of B, that is, lift = P(B|A) / P(B). If this ratio is larger than 1, we say that A on the left-hand side (LHS) results in an upward lift on the right-hand side (RHS) B.

Support: Proportion of all transactions that contain every item in the rule (LHS and RHS together)

Confidence: Proportion of the transactions containing the LHS items that also contain the RHS item, that is, P(B|A)
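
To make the three measures concrete, here is a minimal base R sketch on five made-up baskets. The baskets, the support() helper and the rule {bread} => {butter} are my own illustrations and have nothing to do with the grocery data used below.

```r
# Toy data: five hypothetical baskets (not from the grocery data set)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk"),
  c("bread", "butter", "milk")
)

# Proportion of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "bread"
rhs <- "butter"

rule_support    <- support(c(lhs, rhs))            # P(A and B)
rule_confidence <- rule_support / support(lhs)     # P(B | A)
rule_lift       <- rule_confidence / support(rhs)  # P(B | A) / P(B)

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

Here bread and butter appear together in 3 of 5 baskets (support 0.6), 3 of the 4 bread baskets contain butter (confidence 0.75), and butter is in 60% of all baskets, so the lift is 0.75 / 0.6 = 1.25.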

The data this week was grocery store purchase data. This set is unlike other data sets I have worked with, as each row is a grocery store transaction with a varying number of items. Therefore, it is critical to read this data in as transactions, not as a standard rectangular table. First, we declare our packages and load in our data.

# Declare packages
library(readr)
library(datasets)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v dplyr   1.0.2
## v tibble  3.0.4     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
## Loading required package: grid
trans <-  read.transactions("groceries.csv", format = 'basket', sep = ',')
str(trans) # 169 different grocery items
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  1 variable:
##   .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables

There are two things you should take note of in the structure of the transaction data. The Dim line shows 169 unique grocery store items and 9835 individual transactions. (The p slot has 9836 entries only because a sparse matrix stores one more column pointer than it has columns.) Next, I wanted to investigate which items are bought most frequently.
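
As a side note on reading that str() output, the sketch below builds a tiny pattern matrix with the Matrix package to show why the p slot is one entry longer than the number of columns. The 3 x 4 matrix is hypothetical; in the transaction data the rows are the 169 items and the columns are the transactions.

```r
library(Matrix)

# Hypothetical incidence matrix: 3 items (rows) x 4 transactions (columns)
m <- sparseMatrix(i = c(1, 2, 2, 3, 1),
                  j = c(1, 1, 2, 3, 4),
                  dims = c(3, 4))

dim(m)       # 3 4
m@p          # 0 2 3 4 5: where each column starts, plus a final end marker
length(m@p)  # 5, i.e. ncol(m) + 1
```

The same logic applied to the grocery data gives 9835 transactions from a p slot of length 9836.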

# view item frequency
item_freq <- itemFrequency(trans)  # named item_freq so it does not shadow the function
sort(item_freq, decreasing = TRUE)
##                whole milk          other vegetables                rolls/buns 
##              0.2555160142              0.1934926284              0.1839349263 
##                      soda                    yogurt             bottled water 
##              0.1743772242              0.1395017794              0.1105236401 
##           root vegetables            tropical fruit             shopping bags 
##              0.1089984748              0.1049313676              0.0985256736 
##                   sausage                    pastry              citrus fruit 
##              0.0939501779              0.0889679715              0.0827656329 
##              bottled beer                newspapers               canned beer 
##              0.0805287239              0.0798169802              0.0776817489 
##                 pip fruit     fruit/vegetable juice        whipped/sour cream 
##              0.0756481952              0.0722928317              0.0716827656 
##               brown bread             domestic eggs               frankfurter 
##              0.0648703610              0.0634468734              0.0589730554 
##                 margarine                    coffee                      pork 
##              0.0585663447              0.0580579563              0.0576512456 
##                    butter                      curd                      beef 
##              0.0554143366              0.0532791052              0.0524656838 
##                   napkins                 chocolate         frozen vegetables 
##              0.0523640061              0.0496187087              0.0480935435 
##                   chicken               white bread              cream cheese 
##              0.0429079817              0.0420945602              0.0396542959 
##                   waffles               salty snack  long life bakery product 
##              0.0384341637              0.0378240976              0.0374173869 
##                   dessert                     sugar                  UHT-milk 
##              0.0371123538              0.0338586680              0.0334519573 
##                   berries            hamburger meat          hygiene articles 
##              0.0332486019              0.0332486019              0.0329435689 
##                    onions       specialty chocolate                     candy 
##              0.0310116929              0.0304016268              0.0298932384 
##              frozen meals           misc. beverages                       oil 
##              0.0283680732              0.0283680732              0.0280630402 
##               butter milk             specialty bar                 beverages 
##              0.0279613625              0.0273512964              0.0260294865 
##                       ham                      meat                 ice cream 
##              0.0260294865              0.0258261312              0.0250127097 
##               hard cheese             sliced cheese                  cat food 
##              0.0245043213              0.0245043213              0.0232841891 
##                    grapes               chewing gum                 detergent 
##              0.0223690900              0.0210472801              0.0192170819 
##            red/blush wine                white wine        pickled vegetables 
##              0.0192170819              0.0190137265              0.0178952720 
##             baking powder       semi-finished bread                    dishes 
##              0.0176919166              0.0176919166              0.0175902389 
##                     flour             potted plants               soft cheese 
##              0.0173868836              0.0172852059              0.0170818505 
##          processed cheese                     herbs               canned fish 
##              0.0165734621              0.0162684291              0.0150482969 
##                     pasta         seasonal products                  cake bar 
##              0.0150482969              0.0142348754              0.0132180986 
## packaged fruit/vegetables                   mustard               frozen fish 
##              0.0130147433              0.0119979664              0.0116929334 
##           cling film/bags             spread cheese                    liquor 
##              0.0113879004              0.0111845450              0.0110828673 
##         canned vegetables            frozen dessert                      salt 
##              0.0107778343              0.0107778343              0.0107778343 
##              dish cleaner            flower (seeds)            condensed milk 
##              0.0104728012              0.0103711235              0.0102694459 
##             roll products                  pet care                photo/film 
##              0.0102694459              0.0094560244              0.0092526690 
##                mayonnaise     chocolate marshmallow             sweet spreads 
##              0.0091509914              0.0090493137              0.0090493137 
##                   candles                  dog food          specialty cheese 
##              0.0089476360              0.0085409253              0.0085409253 
##    frozen potato products    house keeping products                    turkey 
##              0.0084392476              0.0083375699              0.0081342145 
##     Instant food products        liquor (appetizer)                      rice 
##              0.0080325369              0.0079308592              0.0076258261 
##            instant coffee                   popcorn                  zwieback 
##              0.0074224708              0.0072191154              0.0069140824 
##                     soups         finished products                   vinegar 
##              0.0068124047              0.0065073716              0.0065073716 
##  female sanitary products            kitchen towels               dental care 
##              0.0061006609              0.0059989832              0.0057956279 
##                   cereals            sparkling wine                    sauces 
##              0.0056939502              0.0055922725              0.0054905948 
##                  softener                       jam                    spices 
##              0.0054905948              0.0053889171              0.0051855618 
##                   cleaner               curd cheese                liver loaf 
##              0.0050838841              0.0050838841              0.0050838841 
##            male cosmetics                       rum                   ketchup 
##              0.0045754957              0.0044738180              0.0042704626 
##              meat spreads                    brandy               light bulbs 
##              0.0042704626              0.0041687850              0.0041687850 
##                       tea             specialty fat          abrasive cleaner 
##              0.0038637519              0.0036603965              0.0035587189 
##                 skin care               nuts/prunes          artif. sweetener 
##              0.0035587189              0.0033553635              0.0032536858 
##              canned fruit                     syrup                 nut snack 
##              0.0032536858              0.0032536858              0.0031520081 
##            snack products                      fish           potato products 
##              0.0030503305              0.0029486528              0.0028469751 
##          bathroom cleaner                  cookware                      soap 
##              0.0027452974              0.0027452974              0.0026436197 
##         cooking chocolate            pudding powder                   tidbits 
##              0.0025419420              0.0023385867              0.0023385867 
##              cocoa drinks           organic sausage                  prosecco 
##              0.0022369090              0.0022369090              0.0020335536 
##    flower soil/fertilizer               ready soups      specialty vegetables 
##              0.0019318760              0.0018301983              0.0017285206 
##          organic products               decalcifier                     honey 
##              0.0016268429              0.0015251652              0.0015251652 
##                     cream             frozen fruits                hair spray 
##              0.0013218099              0.0012201322              0.0011184545 
##           rubbing alcohol                   liqueur           make up remover 
##              0.0010167768              0.0009150991              0.0008134215 
##            salad dressing                    whisky            toilet cleaner 
##              0.0008134215              0.0008134215              0.0007117438 
##            baby cosmetics            frozen chicken                      bags 
##              0.0006100661              0.0006100661              0.0004067107 
##           kitchen utensil     preservation products                 baby food 
##              0.0004067107              0.0002033554              0.0001016777 
##      sound storage medium 
##              0.0001016777
#Item Frequency Plot
itemFrequencyPlot(trans,topN=20,type="absolute")

Now that we know which items are most common, we can start constructing our association rules. First, we will look at rules with support of at least 0.01 and confidence of at least 0.5.

# Build Association Rules
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(rules)
##      lhs                                      rhs                support   
## [1]  {curd,yogurt}                         => {whole milk}       0.01006609
## [2]  {butter,other vegetables}             => {whole milk}       0.01148958
## [3]  {domestic eggs,other vegetables}      => {whole milk}       0.01230300
## [4]  {whipped/sour cream,yogurt}           => {whole milk}       0.01087951
## [5]  {other vegetables,whipped/sour cream} => {whole milk}       0.01464159
## [6]  {other vegetables,pip fruit}          => {whole milk}       0.01352313
## [7]  {citrus fruit,root vegetables}        => {other vegetables} 0.01037112
## [8]  {root vegetables,tropical fruit}      => {other vegetables} 0.01230300
## [9]  {root vegetables,tropical fruit}      => {whole milk}       0.01199797
## [10] {tropical fruit,yogurt}               => {whole milk}       0.01514997
## [11] {root vegetables,yogurt}              => {other vegetables} 0.01291307
## [12] {root vegetables,yogurt}              => {whole milk}       0.01453991
## [13] {rolls/buns,root vegetables}          => {other vegetables} 0.01220132
## [14] {rolls/buns,root vegetables}          => {whole milk}       0.01270971
## [15] {other vegetables,yogurt}             => {whole milk}       0.02226741
##      confidence coverage   lift     count
## [1]  0.5823529  0.01728521 2.279125  99  
## [2]  0.5736041  0.02003050 2.244885 113  
## [3]  0.5525114  0.02226741 2.162336 121  
## [4]  0.5245098  0.02074225 2.052747 107  
## [5]  0.5070423  0.02887646 1.984385 144  
## [6]  0.5175097  0.02613116 2.025351 133  
## [7]  0.5862069  0.01769192 3.029608 102  
## [8]  0.5845411  0.02104728 3.020999 121  
## [9]  0.5700483  0.02104728 2.230969 118  
## [10] 0.5173611  0.02928317 2.024770 149  
## [11] 0.5000000  0.02582613 2.584078 127  
## [12] 0.5629921  0.02582613 2.203354 143  
## [13] 0.5020921  0.02430097 2.594890 120  
## [14] 0.5230126  0.02430097 2.046888 125  
## [15] 0.5128806  0.04341637 2.007235 219

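This table is read as the items on the LHS predicting the item on the RHS. Support is the proportion of all transactions that contain every item in the rule, confidence is the proportion of transactions with the LHS items that also contain the RHS item, and lift is how much more likely the RHS item is in those transactions than in the data set overall. Coverage is the support of the LHS alone, and count is the number of transactions that match the whole rule.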
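
The columns in that table are tied together by simple arithmetic, which can be checked by hand. The sketch below copies the numbers for rule [1], {curd,yogurt} => {whole milk}, and the whole milk frequency from the output above. The variable names are mine.

```r
# Values copied from the output above
n_transactions  <- 9835
rule_support    <- 0.01006609
rule_confidence <- 0.5823529
rhs_support     <- 0.2555160142  # item frequency of whole milk

coverage <- rule_support / rule_confidence  # support of the LHS alone
lift     <- rule_confidence / rhs_support   # P(B | A) / P(B)
count    <- rule_support * n_transactions   # transactions matching the rule

print(round(c(coverage = coverage, lift = lift, count = count), 4))
```

These should reproduce the coverage, lift and count columns reported for rule [1], up to rounding.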

## Warning in plot.rules(rules, method = "graph", interactive = FALSE, shading =
## NA): The parameter interactive is deprecated. Use engine='interactive' instead.

In plot one, most items predict other vegetables and whole milk. This is expected as they are the most purchased items in the data set. Plot two shows us the confidence, support and lift for all 15 rules. Several rules are grouped above 0.55 confidence, and one rule stands out on support: {other vegetables,yogurt} => {whole milk}, at 0.022 against 0.010 to 0.015 for the rest.
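
The plotting code itself is not shown above. Going by the warning, the first call was plot(rules, method = "graph", interactive = FALSE, shading = NA). Below is a rough sketch of how the two plots might be reproduced without the deprecated argument. It assumes the Groceries data set bundled with arules matches groceries.csv, and the scatterplot call for plot two is my guess, not the original code.

```r
library(arules)
library(arulesViz)

# Assumption: the Groceries data bundled with arules stands in for groceries.csv
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5),
                 control = list(verbose = FALSE))

# Plot one: network graph of items and rules (deprecated 'interactive' argument dropped)
plot(rules, method = "graph")

# Plot two (guess): support against confidence, shaded by lift
plot(rules, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift")
```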

So our first association was cool, but the RHS consisted of whole milk and other vegetables, our two most purchased items. Next, I wanted to see what items best predict buying bottled beer.

beerrules<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.08), 
               appearance = list(default="lhs",rhs="bottled beer"),
               control = list(verbose=F))
beerrules<-sort(beerrules, decreasing=TRUE,by="confidence")
inspect(beerrules[1:5])
##     lhs                        rhs            support     confidence
## [1] {liquor,red/blush wine} => {bottled beer} 0.001931876 0.9047619 
## [2] {liquor,soda}           => {bottled beer} 0.001220132 0.5714286 
## [3] {liquor}                => {bottled beer} 0.004677173 0.4220183 
## [4] {bottled water,herbs}   => {bottled beer} 0.001220132 0.4000000 
## [5] {soups,whole milk}      => {bottled beer} 0.001118454 0.3793103 
##     coverage    lift      count
## [1] 0.002135231 11.235269 19   
## [2] 0.002135231  7.095960 12   
## [3] 0.011082867  5.240594 46   
## [4] 0.003050330  4.967172 12   
## [5] 0.002948653  4.710249 11

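These rules tell us that shoppers who buy liquor, especially together with red/blush wine or soda, often buy bottled beer as well: 19 of the 21 baskets containing both liquor and red/blush wine also contained bottled beer. Note the direction, though. The rules say that people buying other alcohol tend to add bottled beer, not that bottled beer buyers usually purchase other alcohol, and the counts are small (11 to 46 baskets). Now, what about canned beer?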
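
Confidence is not symmetric, so it is worth checking the rule in both directions. The sketch below reuses the numbers printed above for rule [1] and the bottled beer item frequency. The variable names are mine.

```r
# Values copied from the output above
rule_support <- 0.001931876   # baskets with liquor, red/blush wine and bottled beer
lhs_coverage <- 0.002135231   # baskets with liquor and red/blush wine
beer_support <- 0.0805287239  # baskets with bottled beer

conf_forward <- rule_support / lhs_coverage  # P(bottled beer | liquor, wine)
conf_reverse <- rule_support / beer_support  # P(liquor, wine | bottled beer)

print(round(c(forward = conf_forward, reverse = conf_reverse), 4))
```

The forward confidence is about 0.90, but the reverse is only about 0.024, so just over 2% of bottled beer baskets also contain both liquor and red/blush wine.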

cannedbeerrules<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.08), 
               appearance = list(default="lhs",rhs="canned beer"),
               control = list(verbose=F))
cannedbeerrules<-sort(cannedbeerrules, decreasing=TRUE,by="confidence")
inspect(cannedbeerrules[1:5])
##     lhs                                   rhs           support     confidence
## [1] {rolls/buns,shopping bags,soda}    => {canned beer} 0.001525165 0.2419355 
## [2] {rolls/buns,sausage,shopping bags} => {canned beer} 0.001423488 0.2372881 
## [3] {liquor (appetizer)}               => {canned beer} 0.001728521 0.2179487 
## [4] {coffee,soda}                      => {canned beer} 0.001931876 0.1938776 
## [5] {chicken,soda}                     => {canned beer} 0.001525165 0.1829268 
##     coverage    lift     count
## [1] 0.006304016 3.114444 15   
## [2] 0.005998983 3.054619 14   
## [3] 0.007930859 2.805662 17   
## [4] 0.009964413 2.495793 19   
## [5] 0.008337570 2.354824 15

The result was especially surprising to me! The items that predict canned beer are different from those that predict bottled beer. It seems that canned beer is bought more with regular groceries like buns, soda, coffee and sausage, whereas bottled beer is usually purchased with other types of alcohol. Confidence is also much lower for canned beer: the best rule reaches 0.24, against 0.90 for bottled beer.
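
Since the same apriori/sort/inspect pattern repeats for each item, it could be wrapped in a small helper. This is only a sketch: top_rules_for() is a hypothetical name of mine, minlen = 2 is an addition to keep out rules with an empty LHS, and the toy baskets stand in for the grocery transactions so the block runs on its own.

```r
library(arules)

# Hypothetical helper (not from the original analysis): top rules that predict `item`
top_rules_for <- function(trans, item, supp = 0.001, conf = 0.08, n = 5) {
  rules <- apriori(
    data = trans,
    parameter = list(supp = supp, conf = conf, minlen = 2),
    appearance = list(default = "lhs", rhs = item),
    control = list(verbose = FALSE)
  )
  # head() returns at most n rules, so it does not fail when fewer are found
  head(sort(rules, by = "confidence", decreasing = TRUE), n)
}

# Toy baskets standing in for the grocery transactions
toy <- as(list(
  c("beer", "chips"),
  c("beer", "chips", "soda"),
  c("chips", "soda"),
  c("beer", "soda"),
  c("milk")
), "transactions")

toy_rules <- top_rules_for(toy, "beer", supp = 0.2, conf = 0.5)
inspect(toy_rules)
```

With the real data the call would look like top_rules_for(trans, "canned beer"). Using head() instead of indexing with [1:5] avoids an error when fewer than five rules come back.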

Next, I wanted to look at one of my favourite foods, marshmallows. The data set only listed chocolate marshmallows (which I’ve never had), but I wanted to run it anyway.

marshrules<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.01), 
                         appearance = list(default="lhs",rhs="chocolate marshmallow"),
                         control = list(verbose=F))
marshrules<-sort(marshrules, decreasing=TRUE,by="confidence")
inspect(marshrules[1:5])
##     lhs                           rhs                     support    
## [1] {candy}                    => {chocolate marshmallow} 0.001423488
## [2] {waffles}                  => {chocolate marshmallow} 0.001423488
## [3] {chocolate}                => {chocolate marshmallow} 0.001626843
## [4] {domestic eggs}            => {chocolate marshmallow} 0.001830198
## [5] {long life bakery product} => {chocolate marshmallow} 0.001016777
##     confidence coverage   lift     count
## [1] 0.04761905 0.02989324 5.262172 14   
## [2] 0.03703704 0.03843416 4.092801 14   
## [3] 0.03278689 0.04961871 3.623135 16   
## [4] 0.02884615 0.06344687 3.187662 18   
## [5] 0.02717391 0.03741739 3.002870 10

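Chocolate marshmallows are bought with candy, waffles and chocolate: the essential food groups. Confidence is below 5% for every rule because chocolate marshmallows appear in under 1% of baskets, but lifts of 3 to 5 show these shoppers are several times more likely than average to buy them. Finally, let us flip our search criteria to see what is on the right-hand side when whipped/sour cream is on the left.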

# Now let's set our left-hand side to whipped/sour cream
sourrules<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.08,minlen=2), 
               appearance = list(default="rhs",lhs="whipped/sour cream"),
               control = list(verbose=F))
sourrules<-sort(sourrules, decreasing=TRUE,by="confidence")
inspect(sourrules[1:5])
##     lhs                     rhs                support    confidence coverage  
## [1] {whipped/sour cream} => {whole milk}       0.03223183 0.4496454  0.07168277
## [2] {whipped/sour cream} => {other vegetables} 0.02887646 0.4028369  0.07168277
## [3] {whipped/sour cream} => {yogurt}           0.02074225 0.2893617  0.07168277
## [4] {whipped/sour cream} => {root vegetables}  0.01708185 0.2382979  0.07168277
## [5] {whipped/sour cream} => {rolls/buns}       0.01464159 0.2042553  0.07168277
##     lift     count
## [1] 1.759754 317  
## [2] 2.081924 284  
## [3] 2.074251 204  
## [4] 2.186250 168  
## [5] 1.110476 144

The outcomes here are consistent with our original association rules, with whole milk and other vegetables taking the lead.
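
Sorting by confidence favours items that are popular anyway, so it can help to rank the same five rules by lift as well. The sketch below retypes the confidence and lift columns from the output above into a plain data frame. The object names are mine.

```r
# Values copied from the whipped/sour cream output above
sour <- data.frame(
  rhs = c("whole milk", "other vegetables", "yogurt",
          "root vegetables", "rolls/buns"),
  confidence = c(0.4496454, 0.4028369, 0.2893617, 0.2382979, 0.2042553),
  lift = c(1.759754, 2.081924, 2.074251, 2.186250, 1.110476)
)

# Re-rank the rules by lift instead of confidence
by_lift <- sour[order(sour$lift, decreasing = TRUE), ]
print(by_lift)
```

Ranked this way, root vegetables come first and whole milk drops to fourth, while rolls/buns, with a lift of about 1.1, are barely associated with whipped/sour cream at all.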

That’s all for this week. Thanks for reading!

Chris