Introduction :-

This report applies association analysis (also known as market basket analysis) to the Groceries data set.


Association analysis :-

Association analysis identifies items that have an affinity for each other, that is, interesting relationships between items. It is frequently used on transactional data (also called market baskets) to find items that often appear together in the same transaction.


Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.

Structure of given dataset :-

The given data set contains 9835 transactions covering 169 distinct items. For this analysis, the data were obtained directly as a sparse transaction matrix.


Association Rules Mining :-

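Our transaction data set has 169 distinct items. The number of candidate itemsets containing at least two items, from which association rules (either strong or weak) can be derived, is given by the following formula.

\[ R = 2^{k} - k - 1 \] where k is the number of distinct items (here k = 169).

The number of candidate itemsets in our data set is therefore about 7.48e+50. Since each itemset can be split into an antecedent and a consequent in several ways, the number of possible rules is larger still.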
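As a quick check of the magnitude quoted above, the following Python sketch evaluates the formula for k = 169. It also evaluates 3^k - 2^(k+1) + 1, the standard count of all rules X => Y with disjoint, non-empty sides. That second expression is not part of the original report and is included only for comparison.

```python
# k is the number of distinct items in the Groceries data (169, from the text).
k = 169

# Itemsets with at least two items: all subsets minus the empty set
# and the k single-item sets.
n_itemsets = 2**k - k - 1

# Comparison only (not from the report): number of rules X => Y where
# X and Y are disjoint, non-empty itemsets over k items.
n_rules = 3**k - 2**(k + 1) + 1

print(f"itemsets with >= 2 items: {n_itemsets:.2e}")
print(f"possible rules:           {n_rules:.2e}")
```

Either count is far beyond what can be enumerated, which is why thresholds on rule quality are needed.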

As this number is far too large to analyse exhaustively, we assign a few metrics to each rule that indicate its strength. The measures considered in this analysis are:

Support :-

The support measure indicates how frequently an item or itemset appears across all transactions. It is defined by the following formula.

\[ Support(A,B) = P ( A \cap B ) \]
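A minimal Python sketch of this definition follows. The five baskets and the helper name `support` are hypothetical and chosen only for illustration; they are not taken from the Groceries data.

```python
# Hypothetical toy baskets; not taken from the Groceries data.
transactions = [
    {"milk", "bread"},
    {"milk", "yogurt"},
    {"milk", "bread", "yogurt"},
    {"bread"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"milk"}, transactions))           # milk is in 4 of 5 baskets
print(support({"milk", "bread"}, transactions))  # both are in 2 of 5 baskets
```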

Confidence :-

The confidence measure is the likelihood that the consequent occurs in a cart given that the cart already contains the antecedent. It is defined by the following formula.

\[ Confidence(A,B) = \frac {P ( A \cap B )}{P(A)} \]
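The sketch below illustrates the definition on hypothetical toy baskets (not the Groceries data) and shows that confidence depends on the direction of the rule. The helper names are assumptions made for this example.

```python
# Hypothetical toy baskets; not taken from the Groceries data.
transactions = [
    {"milk", "bread"},
    {"milk", "yogurt"},
    {"milk", "bread", "yogurt"},
    {"bread"},
    {"milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from transaction counts."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

print(confidence({"milk"}, {"bread"}, transactions))  # 2 of the 4 milk baskets
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread baskets
```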

Lift :-

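The lift measure compares how often A and B occur together with how often they would occur together if they were independent. A lift above 1 indicates a positive association and a lift below 1 a negative one. Unlike confidence, whose value depends on the direction of the rule, lift has no direction: lift(A,B) is always equal to lift(B,A). It is defined by the following formula.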

\[ Lift(A,B) = \frac {P ( A \cap B )}{P(A) \cdot P(B)} \]
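The symmetry of lift can be checked with a short Python sketch. The baskets and function names are hypothetical and serve only as an illustration.

```python
# Hypothetical toy baskets; not taken from the Groceries data.
transactions = [
    {"milk", "bread"},
    {"milk", "yogurt"},
    {"milk", "bread", "yogurt"},
    {"bread"},
    {"milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def lift(a, b, transactions):
    """P(A and B) / (P(A) * P(B)), written in terms of counts."""
    n = len(transactions)
    both = count(set(a) | set(b), transactions)
    return n * both / (count(a, transactions) * count(b, transactions))

print(lift({"milk"}, {"bread"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))  # same value: lift has no direction
```

In this toy data the lift of milk and bread is below 1, so the two items appear together less often than independence would suggest.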


APRIORI Algorithm :-

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. With this algorithm, we can set threshold values on rule metrics and filter out weak rules, identifying the best rules out of the ocean of candidates.

Of the available controls for filtering rules, we set the following parameters: minimum support (0.001), minimum confidence (0.15) and minimum rule length (2).
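To illustrate how a support threshold prunes the search, here is a minimal Python sketch of the level-wise frequent-itemset step of Apriori. It is not the implementation used to produce the output in this report; the function name, the toy baskets and the threshold of 0.3 are assumptions made for the example, and rule generation is left out.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search: a candidate of size k is counted only if
    all of its subsets of size k-1 were found frequent."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those that meet the threshold.
        level = {}
        for cand in current:
            supp = sum(1 for t in transactions if cand <= t) / n
            if supp >= min_support:
                level[cand] = supp
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune
        # any candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Hypothetical toy baskets; not taken from the Groceries data.
transactions = [
    {"milk", "bread"},
    {"milk", "yogurt"},
    {"milk", "bread", "yogurt"},
    {"bread"},
    {"milk"},
]

frequent = apriori_frequent_itemsets(transactions, min_support=0.3)
for itemset, supp in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

Because {bread, yogurt} falls below the threshold, the three-item candidate {bread, milk, yogurt} is never counted. This pruning is what keeps the search tractable on 169 items.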


The Specification of defined Algorithm :-

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.15    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [26820 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

We can observe from the specification that 26820 rules satisfy the given threshold values of support, confidence and minimum length.

Summary of Algorithm :-

## set of 26820 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6 
##  1102 12603 11198  1857    60 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.522   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001017   Min.   :0.1500   Min.   :0.001017   Min.   : 0.6212  
##  1st Qu.:0.001118   1st Qu.:0.2138   1st Qu.:0.003050   1st Qu.: 2.0481  
##  Median :0.001322   Median :0.3056   Median :0.004982   Median : 2.6852  
##  Mean   :0.001979   Mean   :0.3534   Mean   :0.007073   Mean   : 2.9368  
##  3rd Qu.:0.001932   3rd Qu.:0.4583   3rd Qu.:0.007524   3rd Qu.: 3.5085  
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 19.46  
##  3rd Qu.: 19.00  
##  Max.   :736.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001       0.15

From the summary, we can observe that the longest rules the algorithm considered contain 6 items (lhs + rhs). Of the 26820 rules under observation, most contain 3 or 4 items.
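The rule length figures in the summary can be cross-checked with a few lines of Python. The counts are copied from the rule length distribution above; the variable names are chosen for this example.

```python
# Rule length distribution copied from the summary output (length -> rules).
length_counts = {2: 1102, 3: 12603, 4: 11198, 5: 1857, 6: 60}

total_rules = sum(length_counts.values())
mean_length = sum(length * n for length, n in length_counts.items()) / total_rules
share_3_or_4 = (length_counts[3] + length_counts[4]) / total_rules

print(total_rules)             # should match the 26820 rules reported
print(round(mean_length, 3))   # should match the reported mean length
print(round(share_3_or_4, 3))  # share of rules with 3 or 4 items
```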

The algorithm computes the following metrics for all 26820 rules.

  • support (mean 0.001979) - We control the minimum support, which indirectly affects the mean support.
  • confidence (mean 0.3534) - We control the minimum confidence, which indirectly affects the mean confidence.
  • lift (mean 2.9368)
  • coverage (mean 0.007073)

Top Rules (or) relationships (w.r.t lift) :-

The top 6 rules (or relationships) with the highest lift are as follows.

##     lhs                        rhs                         support confidence    coverage     lift count
## [1] {bottled beer,                                                                                      
##      red/blush wine}        => {liquor}                0.001931876  0.3958333 0.004880529 35.71579    19
## [2] {hamburger meat,                                                                                    
##      soda}                  => {Instant food products} 0.001220132  0.2105263 0.005795628 26.20919    12
## [3] {ham,                                                                                               
##      white bread}           => {processed cheese}      0.001931876  0.3800000 0.005083884 22.92822    19
## [4] {root vegetables,                                                                                   
##      other vegetables,                                                                                  
##      whole milk,                                                                                        
##      yogurt}                => {rice}                  0.001321810  0.1688312 0.007829181 22.13939    13
## [5] {bottled beer,                                                                                      
##      liquor}                => {red/blush wine}        0.001931876  0.4130435 0.004677173 21.49356    19
## [6] {Instant food products,                                                                             
##      soda}                  => {hamburger meat}        0.001220132  0.6315789 0.001931876 18.99565    12

The 26450th rule, {bottled beer, red/blush wine} => {liquor}, tops the list with the highest lift of 35.71579.
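The quality measures of this top rule are linked by simple arithmetic, which the following Python sketch reproduces from the values printed in the table. The support and implied count of {liquor} are derived here from confidence and lift; they do not appear in the output above.

```python
# Values copied from row [1] of the table above.
n_transactions = 9835
count = 19               # baskets containing all three items
support = 0.001931876
confidence = 0.3958333
coverage = 0.004880529   # support of the left-hand side
lift = 35.71579

support_from_count = count / n_transactions  # support = count / number of baskets
confidence_from_ratio = support / coverage   # confidence = support / coverage
rhs_support = confidence / lift              # lift = confidence / support(rhs)
rhs_count = rhs_support * n_transactions     # implied number of liquor baskets

print(support_from_count)
print(confidence_from_ratio)
print(rhs_support, rhs_count)
```

The high lift comes from the consequent being rare: liquor appears in only about 1.1% of baskets, so a confidence near 0.4 is far above what independence would predict.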

Visualizing the rules :-

The 3-dimensional plot of all the 52 rules is as follows.

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

We can observe that the rules with support near 0.001 and confidence near 0.4 are, interestingly, the ones with the highest lift.

Top Rules (or) relationships (w.r.t support) :-

The top 6 rules (or relationships) with the highest support are as follows.

##     lhs                   rhs                support    confidence coverage 
## [1] {other vegetables} => {whole milk}       0.07483477 0.3867578  0.1934926
## [2] {whole milk}       => {other vegetables} 0.07483477 0.2928770  0.2555160
## [3] {rolls/buns}       => {whole milk}       0.05663447 0.3079049  0.1839349
## [4] {whole milk}       => {rolls/buns}       0.05663447 0.2216474  0.2555160
## [5] {yogurt}           => {whole milk}       0.05602440 0.4016035  0.1395018
## [6] {whole milk}       => {yogurt}           0.05602440 0.2192598  0.2555160
##     lift     count
## [1] 1.513634 736  
## [2] 1.513634 736  
## [3] 1.205032 557  
## [4] 1.205032 557  
## [5] 1.571735 551  
## [6] 1.571735 551
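Rules [1] and [2] above involve the same two items in opposite directions, which makes them a convenient check on how confidence and lift behave. The Python sketch below recomputes both measures from the support and coverage values printed in the table; the variable names are chosen for this example.

```python
# Values copied from rows [1] and [2] of the table above.
support_ab = 0.07483477  # support of {other vegetables, whole milk}
support_a = 0.1934926    # coverage of rule [1]: support of {other vegetables}
support_b = 0.2555160    # coverage of rule [2]: support of {whole milk}

conf_a_to_b = support_ab / support_a  # {other vegetables} => {whole milk}
conf_b_to_a = support_ab / support_b  # {whole milk} => {other vegetables}
lift_ab = support_ab / (support_a * support_b)  # identical for both directions

print(conf_a_to_b, conf_b_to_a, lift_ab)
```

The two confidences differ because the antecedents have different supports, while the lift is shared by both rules.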

Top Rules (or) relationships (w.r.t confidence) :-

The top 6 rules (or relationships) with the highest confidence are as follows.

##     lhs                     rhs                    support confidence    coverage     lift count
## [1] {rice,                                                                                      
##      sugar}              => {whole milk}       0.001220132          1 0.001220132 3.913649    12
## [2] {canned fish,                                                                               
##      hygiene articles}   => {whole milk}       0.001118454          1 0.001118454 3.913649    11
## [3] {root vegetables,                                                                           
##      butter,                                                                                    
##      rice}               => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [4] {root vegetables,                                                                           
##      whipped/sour cream,                                                                        
##      flour}              => {whole milk}       0.001728521          1 0.001728521 3.913649    17
## [5] {butter,                                                                                    
##      soft cheese,                                                                               
##      domestic eggs}      => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [6] {citrus fruit,                                                                              
##      root vegetables,                                                                           
##      soft cheese}        => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
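When confidence equals 1, lift reduces to 1 divided by the support of the consequent, which explains why the first five rules share the same lift. The Python sketch below checks this using the item supports that appear as coverage values in the top-support table; the variable names are chosen for this example.

```python
# Item supports taken from the coverage column of the top-support table.
support_whole_milk = 0.2555160
support_other_vegetables = 0.1934926

confidence = 1.0  # all six rules above have confidence 1

# lift = confidence / support(rhs)
lift_whole_milk = confidence / support_whole_milk
lift_other_vegetables = confidence / support_other_vegetables

print(lift_whole_milk)        # compare with 3.913649 in rows [1] to [5]
print(lift_other_vegetables)  # compare with 5.168156 in row [6]
```

Each of these rules rests on only 10 to 17 baskets, so a confidence of 1 here should be read with that small count in mind.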

Visualizing the rules in a graph (by highest lift) :-

The graph representation of the same rules is as follows. This graph considers only the top 15 rules by lift.

In the above graph, each circle (or node) represents a rule (the graph has only 15 such nodes), the size of a node represents the support of the rule, and the color intensity represents its lift.

Visualizing the rules in a graph (by highest support) :-

The graph representation of the same rules is as follows. This graph considers only the top 15 rules by support.

Visualizing the rules in a graph (by highest confidence) :-

The graph representation of the same rules is as follows. This graph considers only the top 15 rules by confidence.

Similarity between items :-

Cluster analysis on the similarity between items, with the phi coefficient as the distance measure.

From the above dendrogram, we can see which items are very similar to each other (in a high-dimensional vector space). The similarity is measured with the phi coefficient between items.
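For reference, the phi coefficient between two items can be computed from the 2x2 table of their co-occurrence. The Python sketch below shows this on hypothetical toy baskets (not the Groceries data). How the coefficient is converted into a distance for clustering, for example as 1 - phi, is an assumption and is not confirmed by this report.

```python
from math import sqrt

def phi_coefficient(item_a, item_b, transactions):
    """Phi coefficient of two items from their 2x2 co-occurrence table."""
    n11 = sum(1 for t in transactions if item_a in t and item_b in t)
    n10 = sum(1 for t in transactions if item_a in t and item_b not in t)
    n01 = sum(1 for t in transactions if item_a not in t and item_b in t)
    n00 = sum(1 for t in transactions if item_a not in t and item_b not in t)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # An item present in every basket or in none gives a zero denominator;
    # return 0.0 in that case.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical toy baskets; not taken from the Groceries data.
transactions = [
    {"milk", "bread"},
    {"milk", "yogurt"},
    {"milk", "bread", "yogurt"},
    {"bread"},
    {"milk"},
]

print(phi_coefficient("milk", "yogurt", transactions))  # positive association
print(phi_coefficient("milk", "bread", transactions))   # negative association
```

Items with a high positive phi tend to occur in the same baskets, which is why they would be merged early in a dendrogram.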

Conclusion :-

————————————————————- THANK YOU ————————————————————-