Market Basket Analysis GroceryData

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

library(arulesViz)

## Loading required package: arules

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## Loading required package: grid

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(RColorBrewer)

library(arules)
grocery_data <- read.transactions("T:/00-624 HH Predictive Analytics/HWK10 Market Analysis Optional/GroceryDataSet.csv", format = "basket", sep = ",")

summary(grocery_data)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

This matrix stores 9835 transactions INR, with columns as items they have purchased. People have purchased from one item up to 32 items per transaction, with medium of three items, Meaning 50 percent of the transactions are below 3 items. The most frequent items are home milk, other vegetables, roll/buns, soda, yogurt.

itemFrequencyPlot(grocery_data, topN = 15,col=brewer.pal(10,'RdBu'))

rules <- apriori(grocery_data, parameter = list(supp = 0.001, confidence = 0.8, minlen = 2), control = list(verbose = FALSE))
length (rules)

## [1] 410

rules.nonredundant <- rules [!is.redundant(rules)]
summary(rules.nonredundant)

## set of 392 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  29 227 130   6 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.288   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.131  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.312  
##  Median :0.001220   Median :0.8462   Median :0.001322   Median : 3.588  
##  Mean   :0.001253   Mean   :0.8667   Mean   :0.001456   Mean   : 3.959  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.357  
##  Max.   :0.003152   Max.   :1.0000   Max.   :0.003559   Max.   :11.235  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :12.33  
##  3rd Qu.:13.00  
##  Max.   :31.00  
## 
## mining info:
##          data ntransactions support confidence
##  grocery_data          9835   0.001        0.8

Using the apriori package, I have calculated the numbers of rules in this grocery data set. I used 80% confidence, minimum lands of two, and 0.001 as supply, and comes up with 410 rules. Looks like after excluding the repeated rules, there are 392 rules which are not repeated rules of each other.

 plot(rules.nonredundant, method='grouped')

Here is the visualization of the three 92 non redundant rules. The number one rule is iquor,red/blush wine.

plot(head(sort(rules.nonredundant, by="lift"), 10), method = "graph")

This graph is another visualization of the top 10 rules. although a little hard to visualize, we can see the number one rule as before on the upper right hand side. The obvious lifting of bottled beer to red/blush wine is represented by this large circle.

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

kable ( inspect(head(rules.nonredundant, 10, by = "lift", decending=TRUE )))

##      lhs                        rhs                   support confidence    coverage      lift count
## [1]  {liquor,                                                                                       
##       red/blush wine}        => {bottled beer}    0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                 
##       fruit/vegetable juice,                                                                        
##       other vegetables,                                                                             
##       soda}                  => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {oil,                                                                                          
##       other vegetables,                                                                             
##       tropical fruit,                                                                               
##       whole milk,                                                                                   
##       yogurt}                => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [4]  {citrus fruit,                                                                                 
##       fruit/vegetable juice,                                                                        
##       grapes}                => {tropical fruit}  0.001118454  0.8461538 0.001321810  8.063879    11
## [5]  {other vegetables,                                                                             
##       rice,                                                                                         
##       whole milk,                                                                                   
##       yogurt}                => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [6]  {oil,                                                                                          
##       other vegetables,                                                                             
##       tropical fruit,                                                                               
##       whole milk}            => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [7]  {ham,                                                                                          
##       other vegetables,                                                                             
##       pip fruit,                                                                                    
##       yogurt}                => {tropical fruit}  0.001016777  0.8333333 0.001220132  7.941699    10
## [8]  {beef,                                                                                         
##       citrus fruit,                                                                                 
##       other vegetables,                                                                             
##       tropical fruit}        => {root vegetables} 0.001016777  0.8333333 0.001220132  7.645367    10
## [9]  {butter,                                                                                       
##       cream cheese,                                                                                 
##       root vegetables}       => {yogurt}          0.001016777  0.9090909 0.001118454  6.516698    10
## [10] {butter,                                                                                       
##       sliced cheese,                                                                                
##       tropical fruit,                                                                               
##       whole milk}            => {yogurt}          0.001016777  0.9090909 0.001118454  6.516698    10

Here are the top rules by the descending order of lifting count. For example, iquor,red/blush wine, the first rule can be lifted by bottled beer by 11. And there were 19 counts of such lift. Below is a scatter plot of the support versus confidence of rules, by its lift (top10).

# install.packages('arulesViz')
library(arulesViz)
plot(rules.nonredundant, method='scatter', engine='htmlwidget')

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

### Cluster Analysis

Find optimal cluster number K

crosstab.grocerydata <- crossTable (grocery_data, sort=TRUE)
dim(crosstab.grocerydata)

## [1] 169 169

df<- as.data.frame (crosstab.grocerydata)
dim(df)

## [1] 169 169

In order for us to use the cluster analysis, we first have to transform the data from its current format to cross tabulation and data frame. As we can see now the dimension is 169 by 169.

withss <- sapply(1:20,
                 function(k) {
                   kmeans(df, k, nstart = 50, iter.max = 20)$tot.withinss})


plot( 1:20, withss, type = "b", pch = 19, frame = FALSE, xlab = "Number of clusters", 
      ylab = "Within Sum of squares" ) 

axis(1, at = 1:20, labels = seq(1, 20, 1))

Then we use K means to calculate the total within sum of squares for each cluster, and plot out the numbers of cluster On X axis and total within sums of squares on Y axis. As we can see, there is a large decrease from total sums of squares from one cluster to two and two 3,. Going from 5 clusters and above, the decreasing of within sums of squares level off.

Next, we are somewha interested in viewing the details of the top small transactions (bought 1 item only) and top multi-item transactions (bought 150 items). It’s difficult to find what these items share in common. The transaction data may not have enough information to distinguish these items into groups.

Here’s what happen when we cluster the items into 20 centers:

set.seed(100)
kmeans(df, centers=20, nstart=50, iter.max=20)$cluster %>%   table()

## .
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  2  9 30  1  1  1  1  1  1  1  1  1 16  1  1  1 97  1  1  1

set.seed(100)
dist_mat <- dist(df, method = 'euclidean')
hcluster <- hclust(dist_mat, method = 'average')
clusters <- cutree(hcluster, 20) %>% sort()

cutree(hcluster, 20) %>% table()

## .
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 150   1

names(clusters[clusters == 1 ])

## [1] "whole milk"

By descending rank, whole milk is the number one item on cluster one, which has only one item in the basket. This is consistent with every other information we have obtained so far. `{r} names(clusters[clusters == 19]) ``` We can see that the 19th cluster has 150 items in the basket. Below is a list of what these items are, what has this customer purchased in the single transaction comma out of curiosity.

Market Basket Analysis GroceryData

Hui (Gracie) Han