GitHub repo | portfolio | Blog
Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a snapshot of what went into a customer's basket, hence the term 'Market Basket Analysis'.
That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The data set is attached.
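To make the layout concrete: each row of the raw CSV is just a comma-separated list of items, with shorter receipts padded by empty cells. Based on the first transaction inspected below, the first row looks roughly like:
citrus fruit,semi-finished bread,margarine,ready soups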
Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules sorted by lift.
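For reference, the three measures reported below have standard definitions: support(A => B) = P(A and B), the fraction of receipts containing every item in the rule; confidence(A => B) = supp(A => B) / supp(A), the conditional probability of the RHS given the LHS; and lift(A => B) = confidence(A => B) / supp(B), how much more likely the RHS is when the LHS is present than would be expected if the two were independent.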
# Libraries used throughout
library(dplyr)
library(stringr)
library(reshape2)
library(arules)
# Load Groceries Dataset; blank cells become NA
groceries_df <- read.csv("GroceryDataSet.csv", header = FALSE,
                         na.strings = "",
                         stringsAsFactors = FALSE)
# Add an id column so each receipt keeps its identity after reshaping
groceries_df$id <- seq_len(nrow(groceries_df))
# Trim stray whitespace from item names
groceries_df <- groceries_df %>%
  mutate(across(where(is.character), str_trim))
print(paste(nrow(groceries_df), ncol(groceries_df)))
## [1] "9835 33"
# Reshape from wide (one column per item slot) to long (one row per id/item pair)
groceries_long <- melt(groceries_df, id.vars = "id")
# Split items by receipt id, de-duplicate within each basket, and coerce the
# resulting list into an arules 'transactions' object
groceries_trans <- as(lapply(split(groceries_long$value, groceries_long$id), unique), "transactions")
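As an aside, arules can read basket-format CSVs directly with read.transactions(), which would replace the melt/split steps above; a one-line alternative (left commented so the pipeline above remains the one actually run):
# groceries_trans <- read.transactions("GroceryDataSet.csv", format = "basket",
#                                      sep = ",", rm.duplicates = TRUE)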
inspect(groceries_trans[1:5])
##     items                      transactionID
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread} 1
## [2] {coffee,
## tropical fruit,
## yogurt} 2
## [3] {whole milk} 3
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt} 4
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk} 5
itemFrequencyPlot(groceries_trans,
                  topN = 30,
                  type = "absolute",
                  main = "Absolute Item Frequency Plot",
                  ylab = "Item Frequency (Absolute)")
I am filtering to include only rules with one to four items on the LHS (apriori mines rules with a single-item RHS, so this caps rules at five items in total). Longer rules are probably less useful and would increase noise. Given the large number of receipts, I am using a low support threshold (supp = 0.001, roughly 10 of the 9,835 receipts), and I experimented with different confidence values to get what look like reasonable rules.
Note: I tried to remove redundant rules using examples from the documentation, but the filtering treated ALL rules as redundant. After a fair bit of troubleshooting, I removed that step for now (a possible fix using arules' built-in is.redundant() is sketched below the commented-out attempt). This is certainly something I would circle back to if I were using this analysis for real.
rules <- apriori(groceries_trans,
                 parameter = list(supp = 0.001,
                                  conf = 0.15,
                                  minlen = 2,
                                  maxlen = 5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(groceries_trans, parameter = list(supp = 0.001, conf =
## 0.15, : Mining stopped (maxlen reached). Only patterns up to a length of 5
## returned!
## done [0.01s].
## writing ... [26760 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
# find redundant rules
# subset.matrix <- is.subset(rules, rules)
# subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
# redundant <- colSums(subset.matrix, na.rm=T) >= 1
# remove redundant rules
# rules.pruned <- rules[!redundant]
# rules <- rules.pruned
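# A possible fix (sketched but not run, so the rule counts below are unchanged):
# arules provides is.redundant(), which flags a rule as redundant when a more
# general rule with the same RHS has equal or higher confidence.
# rules <- rules[!is.redundant(rules)]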
rules <- sort(rules, by="lift", decreasing=TRUE)
inspect(rules[1:10])
##      lhs                        rhs                     support confidence coverage lift count
## [1] {bottled beer,
## red/blush wine} => {liquor} 0.0019 0.40 0.0049 36 19
## [2] {hamburger meat,
## soda} => {Instant food products} 0.0012 0.21 0.0058 26 12
## [3] {ham,
## white bread} => {processed cheese} 0.0019 0.38 0.0051 23 19
## [4] {other vegetables,
## root vegetables,
## whole milk,
## yogurt} => {rice} 0.0013 0.17 0.0078 22 13
## [5] {bottled beer,
## liquor} => {red/blush wine} 0.0019 0.41 0.0047 21 19
## [6] {Instant food products,
## soda} => {hamburger meat} 0.0012 0.63 0.0019 19 12
## [7] {curd,
## sugar} => {flour} 0.0011 0.32 0.0035 19 11
## [8] {baking powder,
## sugar} => {flour} 0.0010 0.31 0.0033 18 10
## [9] {processed cheese,
## white bread} => {ham} 0.0019 0.46 0.0042 18 19
## [10] {fruit/vegetable juice,
## ham} => {processed cheese} 0.0011 0.29 0.0039 17 11
Stepping back, if this were being used in a real setting, I would explore generating a score from the geometric mean of support, confidence, and lift, then sorting my rules by that score. Ideally we want rules that are both prevalent and meaningful: a large lift with low support would not be actionable, and high confidence with low support or low lift would not be very actionable either. Ideally, we want rules with both high lift and high support.
Per the instructions, I have sorted by lift for now; however, lift alone probably isn't the best approach.
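As a sketch of what that could look like using arules' quality() accessor (the score column and its name are my own addition, not run as part of this analysis):
# Composite 'score': geometric mean of support, confidence, and lift
q <- quality(rules)
quality(rules)$score <- (q$support * q$confidence * q$lift)^(1/3)
rules_by_score <- sort(rules, by = "score", decreasing = TRUE)
inspect(rules_by_score[1:10])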
To see which items and rules hang together, I also cluster them and plot dendrograms.
# Cluster the common items (> 2% support) using affinity-based dissimilarity
s <- groceries_trans[, itemFrequency(groceries_trans) > 0.02]
d_jaccard <- dissimilarity(s, which = "items", method = "affinity")
plot(hclust(d_jaccard, method = "ward.D2"), main = "Dendrogram for Items")
# Cluster the top 20 rules the same way
d_affinity <- dissimilarity(rules[1:20],
                            method = "affinity",
                            args = list(transactions = groceries_trans))
hc <- hclust(d_affinity, method = "ward.D2")
plot(hc, main = "Dendrogram for Rules (Affinity)")
## set of 26760 rules
## lhs rhs support confidence coverage lift count
## [1] {hamburger meat,
## soda} => {Instant food products} 0.0012 0.21 0.0058 26 12
## [2] {other vegetables,
## root vegetables,
## whole milk,
## yogurt} => {rice} 0.0013 0.17 0.0078 22 13
## [3] {Instant food products,
## soda} => {hamburger meat} 0.0012 0.63 0.0019 19 12
## [4] {curd,
## sugar} => {flour} 0.0011 0.32 0.0035 19 11
## [5] {baking powder,
## sugar} => {flour} 0.0010 0.31 0.0033 18 10
## [6] {margarine,
## sugar} => {flour} 0.0016 0.30 0.0055 17 16
## [7] {root vegetables,
## sugar,
## whole milk} => {flour} 0.0010 0.29 0.0035 17 10
## [8] {popcorn,
## soda} => {salty snack} 0.0012 0.63 0.0019 17 12
## [9] {baking powder,
## flour} => {sugar} 0.0010 0.56 0.0018 16 10
## [10] {sugar,
## whipped/sour cream} => {baking powder} 0.0013 0.27 0.0049 15 13