Data 624 - Hw 10

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
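Before mining the real data, it may help to see how the three requested measures are defined. The following is a minimal sketch in base R on a handful of made-up baskets; the `baskets` list and the `support()` helper are hypothetical and exist only for illustration.

```r
# Toy baskets (hypothetical data, not taken from the Groceries file).
baskets <- list(
  c("whole milk", "rice", "sugar"),
  c("whole milk", "yogurt"),
  c("rice", "sugar", "whole milk"),
  c("soda", "rolls/buns"),
  c("whole milk", "soda")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("rice", "sugar")
rhs <- "whole milk"

supp_rule <- support(c(lhs, rhs))      # share of baskets with lhs and rhs together
conf_rule <- supp_rule / support(lhs)  # share of lhs baskets that also hold rhs
lift_rule <- conf_rule / support(rhs)  # confidence relative to the rhs base rate

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy set the rule {rice, sugar} => {whole milk} has support 0.4, confidence 1 and lift 1.25. A lift above 1 means the right-hand item shows up more often alongside the left-hand items than its overall frequency would suggest.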

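library(arules)     # read.transactions(), apriori(), inspect()
library(arulesViz)  # plot() methods for rules

df <- read.transactions("GroceryDataSet.csv", format = "basket", sep = ",")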


summary(df)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Whole milk is the most popular item, followed by other vegetables and rolls/buns. There are 9835 transactions (rows) and 169 items (columns).
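The figures in the summary can be cross-checked with a little arithmetic. The sketch below only re-uses numbers printed by summary(df); the variable names are chosen here for illustration.

```r
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Mean basket size implied by the density of the sparse matrix.
mean_basket_size <- density * n_items
# Total number of item occurrences across all receipts.
total_items <- density * n_transactions * n_items
# Share of receipts that contain whole milk (its support).
milk_support <- 2513 / n_transactions

print(round(c(mean_basket_size, total_items, milk_support), 4))
```

The implied mean basket size agrees with the reported mean of 4.409, and the total of about 43367 item occurrences equals the sum of the counts in the "most frequent items" table. Whole milk appears in roughly a quarter of all receipts.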

Below are the top 10 items.

itemFrequencyPlot(df, topN=10, type="absolute", main="Top 10 Items")

rules <- apriori(df, parameter = list(supp = 0.001, conf = 0.8, maxlen=3))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3

## Warning in apriori(df, parameter = list(supp = 0.001, conf = 0.8, maxlen = 3)):
## Mining stopped (maxlen reached). Only patterns up to a length of 3 returned!

##  done [0.01s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
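The "Absolute minimum support count" line in the log follows from the support threshold and the number of transactions. This is a small worked check; that arules truncates the product when printing is an assumption that is merely consistent with the 9 shown above.

```r
n_transactions <- 9835
min_support    <- 0.001

# Support threshold expressed as a number of receipts.
threshold <- min_support * n_transactions
# Smallest whole-number count that still meets the threshold.
min_count <- ceiling(threshold)

print(c(threshold = threshold, min_count = min_count))
```

The threshold works out to 9.835 receipts, so a rule has to occur in at least 10 receipts to qualify.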

rules <- sort(rules, by = "confidence", decreasing = TRUE)

rules

## set of 29 rules

inspect(head(rules, 10))

##      lhs                          rhs                    support confidence    coverage      lift count
## [1]  {rice,                                                                                            
##       sugar}                   => {whole milk}       0.001220132  1.0000000 0.001220132  3.913649    12
## [2]  {canned fish,                                                                                     
##       hygiene articles}        => {whole milk}       0.001118454  1.0000000 0.001118454  3.913649    11
## [3]  {house keeping products,                                                                          
##       whipped/sour cream}      => {whole milk}       0.001220132  0.9230769 0.001321810  3.612599    12
## [4]  {bottled water,                                                                                   
##       rice}                    => {whole milk}       0.001220132  0.9230769 0.001321810  3.612599    12
## [5]  {bottled beer,                                                                                    
##       soups}                   => {whole milk}       0.001118454  0.9166667 0.001220132  3.587512    11
## [6]  {grapes,                                                                                          
##       onions}                  => {other vegetables} 0.001118454  0.9166667 0.001220132  4.737476    11
## [7]  {hard cheese,                                                                                     
##       oil}                     => {other vegetables} 0.001118454  0.9166667 0.001220132  4.737476    11
## [8]  {cereals,                                                                                         
##       curd}                    => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [9]  {pastry,                                                                                          
##       sweet spreads}           => {whole milk}       0.001016777  0.9090909 0.001118454  3.557863    10
## [10] {liquor,                                                                                          
##       red/blush wine}          => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
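The lift column can be reproduced by hand as confidence divided by the support of the right-hand item. The check below uses only the item counts from summary(df) and the confidence values of rules [1], [3] and [6] in the listing above; the variable names are illustrative.

```r
n_transactions <- 9835

# Item counts taken from the summary(df) output.
rhs_support <- c("whole milk" = 2513, "other vegetables" = 1903) / n_transactions

# Confidence values copied from rules [1], [3] and [6] of the listing.
confidence <- c(1.0000000, 0.9230769, 0.9166667)
rhs        <- c("whole milk", "whole milk", "other vegetables")

# lift = confidence / support(rhs)
lift <- confidence / rhs_support[rhs]
print(round(unname(lift), 4))
```

The results agree with the lift values printed for those three rules (3.913649, 3.612599 and 4.737476). Other vegetables is bought less often than whole milk, so a rule pointing to it earns a higher lift at the same confidence.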

summary(rules)

## set of 29 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 29 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001118   Min.   : 3.131  
##  1st Qu.:0.001118   1st Qu.:0.8125   1st Qu.:0.001220   1st Qu.: 3.261  
##  Median :0.001220   Median :0.8462   Median :0.001525   Median : 3.613  
##  Mean   :0.001473   Mean   :0.8613   Mean   :0.001732   Mean   : 4.000  
##  3rd Qu.:0.001729   3rd Qu.:0.9091   3rd Qu.:0.002135   3rd Qu.: 4.199  
##  Max.   :0.002542   Max.   :1.0000   Max.   :0.003152   Max.   :11.235  
##      count      
##  Min.   :10.00  
##  1st Qu.:11.00  
##  Median :12.00  
##  Mean   :14.48  
##  3rd Qu.:17.00  
##  Max.   :25.00  
## 
## mining info:
##  data ntransactions support confidence
##    df          9835   0.001        0.8
##                                                                        call
##  apriori(data = df, parameter = list(supp = 0.001, conf = 0.8, maxlen = 3))
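The listing above is ordered by confidence, while the assignment asks for the top 10 rules by lift. Re-ranking could look like the minimal sketch below. It assumes that the `Groceries` data set bundled with the arules package can stand in for GroceryDataSet.csv, so the code runs without the attached file; `rules_demo` and `top_lift` are names introduced here for illustration.

```r
library(arules)

# Assumption: the Groceries data bundled with arules stands in for
# GroceryDataSet.csv so that the sketch runs without the attached file.
data("Groceries", package = "arules")

# Same thresholds as in the analysis above, with progress output silenced.
rules_demo <- apriori(
  Groceries,
  parameter = list(supp = 0.001, conf = 0.8, maxlen = 3),
  control = list(verbose = FALSE)
)

# Re-rank by lift instead of confidence and keep the ten strongest rules.
top_lift <- head(sort(rules_demo, by = "lift", decreasing = TRUE), 10)
inspect(top_lift)
```

With the rules object from this report, the same call would be `head(sort(rules, by = "lift"), 10)`.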

plot(rules, method="graph", engine = "igraph", layout = igraph::in_circle(), limit = 10)

plot(rules, method="graph", engine = "igraph", layout = igraph::in_circle(), limit = 20)

The frequency plot shows that the 10 most frequently purchased items are, in decreasing order, whole milk, other vegetables, rolls/buns, soda, yogurt, bottled water and so on. With maxlen=3 specified to avoid overly long rules, apriori() returns 29 association rules for these data. Of the top 10 rules by confidence, seven have whole milk (the #1 most purchased item) on the right-hand side, two have other vegetables (#2) and one has bottled beer. That last rule, {liquor, red/blush wine} => {bottled beer}, has the highest lift of all 29 rules (11.24). In the network plot limited to 10 rules, most of the associations are with other vegetables. After increasing limit to 20, many associations with whole milk appear as well. In both plots most arrows point toward the whole milk and other vegetables items.

References:

https://cran.r-project.org/web/packages/arulesViz/arulesViz.pdf

http://www.salemmarafi.com/code/market-basket-analysis-with-r/

Dominika Markowska-Desvallons

5/7/2022