Introduction :-
In this report, I perform association analysis (also known as market basket analysis) on the Groceries data set.
Association analysis :-
Association analysis identifies items that have an affinity for each other, that is, interesting relationships between items. It is frequently used to analyze transactional data (also called market baskets) to identify items that often appear together in transactions.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
Structure of given dataset :-
The given dataset has 9835 transactions over 169 distinct items. For this analysis, the data was available directly as a sparse transaction matrix.
Association Rules Mining :-
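Our transaction data set has 169 distinct items. The number of candidate itemsets containing at least two items, from which association rules (either strong or weak) can be formed, is given by the following formula.
\[ R = 2^{k} - k - 1 \] Where,
R = Number of candidate itemsets with at least two items
k = Number of items in the transaction dataset
For our dataset, R is about 7.48e+50. Because each itemset can be split into an antecedent and a consequent in several ways, the number of possible association rules is larger still: \( 3^{k} - 2^{k+1} + 1 \), roughly 4.3e+80 for k = 169.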
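The size of the search space can be checked with exact integer arithmetic. The following is a minimal Python sketch; the function names are illustrative and not part of the original analysis. It evaluates \( 2^{k} - k - 1 \), which counts the itemsets with at least two items, and the standard count of rules with non-empty, disjoint antecedent and consequent, \( 3^{k} - 2^{k+1} + 1 \).

```python
def itemsets_with_two_or_more(k: int) -> int:
    """Number of itemsets containing at least two of k items: 2^k - k - 1."""
    return 2**k - k - 1


def possible_rules(k: int) -> int:
    """Number of rules A => B with A and B non-empty and disjoint: 3^k - 2^(k+1) + 1."""
    return 3**k - 2 ** (k + 1) + 1


k = 169  # distinct items in the Groceries data set
print(f"{itemsets_with_two_or_more(k):.2e}")  # 7.48e+50
print(f"{possible_rules(k):.2e}")
```

Either way, enumerating every candidate is out of the question, which is why the rules are filtered by the measures described next.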
As this number is far too large to analyse exhaustively, we assign a few metrics to each rule that indicate its strength and use them to rank the rules. The measures considered in this analysis are:
Support :-
The support measure indicates how frequently an item or itemset appears across all transactions. It is defined by the following formula.
\[ Support(A,B) = P ( A \cap B ) \]
Confidence :-
The confidence measure is the likelihood that the consequent is in the cart given that the cart already contains the antecedent. It is defined by the following formula.
\[ Confidence(A,B) = \frac {P ( A \cap B )}{P(A)} \]
Lift :-
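The lift measure compares how often A and B occur together with how often they would be expected to occur together if they were independent. Unlike confidence, whose value may vary depending on the direction of the rule, lift has no direction: lift(A,B) is always equal to lift(B,A). A lift above 1 indicates that the items occur together more often than expected under independence. It is defined by the following formula.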
\[ Lift(A,B) = \frac {P ( A \cap B )}{P(A)*P(B)} \]
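The three measures can be made concrete on a handful of baskets. The following is a minimal Python sketch, independent of the software that produced the output shown in this report; the baskets and the function names are hypothetical and chosen only for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence of lhs => rhs divided by the support of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Hypothetical toy baskets, not taken from the Groceries data.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

print(support(baskets, {"milk", "bread"}))       # 2 of 5 baskets
print(confidence(baskets, {"butter"}, {"milk"})) # depends on direction
print(confidence(baskets, {"milk"}, {"butter"}))
print(lift(baskets, {"butter"}, {"milk"}))       # same in both directions
print(lift(baskets, {"milk"}, {"butter"}))
```

In this toy data the confidence of {butter} => {milk} differs from that of {milk} => {butter}, while the lift is identical in both directions.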
APRIORI Algorithm :-
Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. With this algorithm, we can set threshold values for the following controls, filter out weak rules, and identify the best rules out of the ocean of possible rules.
- confidence
- minimum value
- maximum time
- support
- minimum length
- maximum length
Out of the above controls (to filter the best rules), we set the following parameters.
- support = 0.001
- confidence = 0.15
- minimum_length = 2
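To show how these three thresholds interact, here is a minimal Python sketch of the Apriori idea: grow frequent itemsets level by level, prune candidates that have an infrequent subset, then keep the rules that pass the confidence threshold. It is an illustration under simplifying assumptions, not the implementation that produced the output in this report: the function name `apriori_rules` and the toy baskets are hypothetical, only rules with a single-item right-hand side are generated (as in the rules listed in this report), and rules with an empty left-hand side are never produced.

```python
from itertools import combinations


def apriori_rules(transactions, min_support, min_confidence, min_length=2):
    """Return (lhs, rhs, support, confidence) tuples for single-item rhs rules."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]): supp(frozenset([i])) for i in items}
    level = {c: s for c, s in level.items() if s >= min_support}
    frequent = dict(level)

    k = 1
    while level:
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: supp(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)

    rules = []
    for itemset, s in frequent.items():
        if len(itemset) < max(min_length, 2):  # no empty left-hand sides
            continue
        for item in itemset:
            lhs = itemset - {item}
            conf = s / frequent[lhs]  # lhs is frequent by downward closure
            if conf >= min_confidence:
                rules.append((lhs, frozenset([item]), s, conf))
    return rules


# Hypothetical toy baskets, not taken from the Groceries data.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

rules = apriori_rules(baskets, min_support=0.3, min_confidence=0.6, min_length=2)
for lhs, rhs, s, c in sorted(rules, key=lambda r: sorted(r[0])):
    print(sorted(lhs), "=>", sorted(rhs), round(s, 3), round(c, 3))
```

Raising the support threshold shrinks the set of frequent itemsets before any rule is formed, whereas the confidence threshold only filters the rules built from them, which is why a very low support such as 0.001 leaves so many rules to rank.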
The Specification of defined Algorithm :-
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.15    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [26820 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
We can observe from the specification that 26820 rules satisfy the given threshold values of support, confidence and minimum length.
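The "Absolute minimum support count: 9" line in the log can be related to the relative threshold with a little arithmetic. The sketch below is only a consistency check on the reported numbers; that the log value is obtained by rounding down is an assumption based on those numbers, not a statement about how the software computes it.

```python
import math

n_transactions = 9835
min_support = 0.001

# The relative threshold expressed as a number of transactions (about 9.835).
threshold = min_support * n_transactions
print(math.floor(threshold))  # 9, matching the value shown in the log

# Smallest whole transaction count whose relative support reaches 0.001.
min_count = math.ceil(threshold)
print(min_count, round(min_count / n_transactions, 6))
```

A rule therefore needs to occur in at least 10 transactions to pass the filter, since 9 of 9835 transactions falls just short of a support of 0.001.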
Summary of Algorithm :-
## set of 26820 rules
##
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6
##  1102 12603 11198  1857    60
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.000   3.000   3.522   4.000   6.000
##
## summary of quality measures:
##     support           confidence        coverage             lift
##  Min.   :0.001017   Min.   :0.1500   Min.   :0.001017   Min.   : 0.6212
##  1st Qu.:0.001118   1st Qu.:0.2138   1st Qu.:0.003050   1st Qu.: 2.0481
##  Median :0.001322   Median :0.3056   Median :0.004982   Median : 2.6852
##  Mean   :0.001979   Mean   :0.3534   Mean   :0.007073   Mean   : 2.9368
##  3rd Qu.:0.001932   3rd Qu.:0.4583   3rd Qu.:0.007524   3rd Qu.: 3.5085
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158
##      count
##  Min.   : 10.00
##  1st Qu.: 11.00
##  Median : 13.00
##  Mean   : 19.46
##  3rd Qu.: 19.00
##  Max.   :736.00
##
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001       0.15
From the summary, we can observe that the longest rules found contain 6 items (lhs and rhs combined). Of the 26820 rules under observation, most (23801) have a length of 3 or 4.
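The figures in the rule length distribution can be cross-checked against each other. This short Python sketch uses only the counts printed in the summary; the variable names are illustrative.

```python
# Rule length distribution copied from the summary output.
length_counts = {2: 1102, 3: 12603, 4: 11198, 5: 1857, 6: 60}

total = sum(length_counts.values())
mean_length = sum(length * count for length, count in length_counts.items()) / total
share_3_or_4 = (length_counts[3] + length_counts[4]) / total

print(total)                   # 26820
print(round(mean_length, 3))   # 3.522
print(round(share_3_or_4, 3))  # 0.887
```

The counts add up to the 26820 rules reported, the weighted mean reproduces the mean length of 3.522, and close to 89% of the rules have 3 or 4 items.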
The algorithm reports the following metrics for all 26820 rules.
- support (with mean of 0.001979) - We control the minimum support, which indirectly affects the mean support.
- confidence (with mean of 0.3534) - We control the minimum confidence, which indirectly affects the mean confidence.
- lift ( with mean of 2.9368)
- coverage ( with mean of 0.007073 )
Top Rules (or) relationships (w.r.t lift) :-
The top 6 rules (or relationships) with the highest lift are as follows.
##     lhs                        rhs                         support confidence    coverage     lift count
## [1] {bottled beer,
##      red/blush wine}        => {liquor}                0.001931876  0.3958333 0.004880529 35.71579    19
## [2] {hamburger meat,
##      soda}                  => {Instant food products} 0.001220132  0.2105263 0.005795628 26.20919    12
## [3] {ham,
##      white bread}           => {processed cheese}      0.001931876  0.3800000 0.005083884 22.92822    19
## [4] {root vegetables,
##      other vegetables,
##      whole milk,
##      yogurt}                => {rice}                  0.001321810  0.1688312 0.007829181 22.13939    13
## [5] {bottled beer,
##      liquor}                => {red/blush wine}        0.001931876  0.4130435 0.004677173 21.49356    19
## [6] {Instant food products,
##      soda}                  => {hamburger meat}        0.001220132  0.6315789 0.001931876 18.99565    12
Rule number 26450, {bottled beer, red/blush wine} => {liquor}, comes out on top with the highest lift of 35.71579.
Visualizing the rules :-
The 3-dimensional plot (support, confidence and lift) of the 52 rules is as follows.
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
We can observe that rules with support close to 0.001 and confidence around 0.4 interestingly have the highest lift.
Top Rules (or) relationships (w.r.t support) :-
The top 6 rules (or relationships) with the highest support are as follows.
##     lhs                   rhs                   support confidence  coverage     lift count
## [1] {other vegetables} => {whole milk}       0.07483477  0.3867578 0.1934926 1.513634   736
## [2] {whole milk}       => {other vegetables} 0.07483477  0.2928770 0.2555160 1.513634   736
## [3] {rolls/buns}       => {whole milk}       0.05663447  0.3079049 0.1839349 1.205032   557
## [4] {whole milk}       => {rolls/buns}       0.05663447  0.2216474 0.2555160 1.205032   557
## [5] {yogurt}           => {whole milk}       0.05602440  0.4016035 0.1395018 1.571735   551
## [6] {whole milk}       => {yogurt}           0.05602440  0.2192598 0.2555160 1.571735   551
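Rules [1] and [2] above are the two directions of the same item pair, which makes them a convenient check on the definitions of confidence and lift. The Python sketch below recomputes both measures from the support and coverage columns, reading coverage as the support of the left-hand side (consistent with confidence = support / coverage in every row of the table). The variable names are illustrative.

```python
# Values copied from rows [1] and [2] of the table above.
support_ab = 0.07483477         # {other vegetables, whole milk} together
p_other_vegetables = 0.1934926  # coverage of rule [1], i.e. P(lhs)
p_whole_milk = 0.2555160        # coverage of rule [2], i.e. P(lhs)

conf_1 = support_ab / p_other_vegetables  # {other vegetables} => {whole milk}
conf_2 = support_ab / p_whole_milk        # {whole milk} => {other vegetables}
lift = support_ab / (p_other_vegetables * p_whole_milk)

print(round(conf_1, 4), round(conf_2, 4), round(lift, 4))
```

The two confidences differ (about 0.387 and 0.293), while the single lift value of about 1.514 applies to both directions, as the table shows.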
Top Rules (or) relationships (w.r.t confidence) :-
The top 6 rules (or relationships) with the highest confidence are as follows.
##     lhs                     rhs                    support confidence    coverage     lift count
## [1] {rice,
##      sugar}              => {whole milk}       0.001220132          1 0.001220132 3.913649    12
## [2] {canned fish,
##      hygiene articles}   => {whole milk}       0.001118454          1 0.001118454 3.913649    11
## [3] {root vegetables,
##      butter,
##      rice}               => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [4] {root vegetables,
##      whipped/sour cream,
##      flour}              => {whole milk}       0.001728521          1 0.001728521 3.913649    17
## [5] {butter,
##      soft cheese,
##      domestic eggs}      => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [6] {citrus fruit,
##      root vegetables,
##      soft cheese}        => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
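The three rankings (by lift, by support and by confidence) amount to sorting the same rule set on a different column. A minimal Python sketch of that idea follows, using a handful of rules copied from the tables in this report; the tuple layout and the helper `top_rules` are illustrative assumptions, not the functions used to produce the tables.

```python
# A few rules copied from this report: (lhs, rhs, support, confidence, lift).
rules = [
    ({"bottled beer", "red/blush wine"}, {"liquor"}, 0.001931876, 0.3958333, 35.71579),
    ({"ham", "white bread"}, {"processed cheese"}, 0.001931876, 0.3800000, 22.92822),
    ({"other vegetables"}, {"whole milk"}, 0.07483477, 0.3867578, 1.513634),
    ({"yogurt"}, {"whole milk"}, 0.05602440, 0.4016035, 1.571735),
    ({"rice", "sugar"}, {"whole milk"}, 0.001220132, 1.0, 3.913649),
]

# Position of each quality measure inside a rule tuple.
MEASURES = {"support": 2, "confidence": 3, "lift": 4}


def top_rules(rules, by, n=3):
    """Return the n rules with the largest value of the chosen measure."""
    return sorted(rules, key=lambda rule: rule[MEASURES[by]], reverse=True)[:n]


for measure in MEASURES:
    lhs, rhs, *_ = top_rules(rules, by=measure, n=1)[0]
    print(measure, sorted(lhs), "=>", sorted(rhs))
```

Each measure puts a different rule first, which is why the report looks at all three rankings rather than a single one.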
Visualizing rules in a graph (by highest lift) :-
The graph representation of the same rules is as follows. In this graph, we consider only the top 15 rules with the highest lift.
In the above graph, each circle (or node) represents a rule (the graph has only 15 such nodes), the size of the node represents the support of the rule, and the color intensity represents the lift of the rule.
Visualizing rules in a graph (by highest support) :-
The graph representation of the same rules is as follows. In this graph, we consider only the top 15 rules with the highest support.
Visualizing rules in a graph (by highest confidence) :-
The graph representation of the same rules is as follows. In this graph, we consider only the top 15 rules with the highest confidence.
Similarity between items :-
Cluster analysis of the similarity between items, using the phi coefficient as the distance measure.
From the above dendrogram, we can see which items are most similar to each other (in a high-dimensional vector space); the similarity is measured with the phi coefficient between items.
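For reference, the phi coefficient between two items is the correlation between their presence indicators across transactions, computed from the 2x2 table of co-occurrence counts. The Python sketch below shows that calculation on hypothetical toy baskets; the function name is illustrative, and how the coefficient is turned into a distance for the clustering in this report is not shown here.

```python
import math


def phi_coefficient(transactions, a, b):
    """Phi coefficient between the presence of item a and item b."""
    n11 = sum(a in t and b in t for t in transactions)      # both present
    n10 = sum(a in t and b not in t for t in transactions)  # only a
    n01 = sum(a not in t and b in t for t in transactions)  # only b
    n00 = len(transactions) - n11 - n10 - n01               # neither
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If an item is in every basket or in none, phi is undefined; return 0.0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Hypothetical toy baskets, not taken from the Groceries data.
baskets = [
    {"beer", "wine"},
    {"beer", "wine"},
    {"milk"},
    {"milk", "bread"},
    {"bread"},
]

print(phi_coefficient(baskets, "beer", "wine"))  # always bought together
print(phi_coefficient(baskets, "beer", "milk"))  # never bought together
```

Items that always appear together reach a phi of 1, while items that avoid each other get a negative value, so items with high phi end up close together in the dendrogram.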
Conclusion :-
- The given dataset has 9835 transactions over 169 distinct items.
- There are only 26820 rules with a minimum support of 0.001, a minimum confidence of 0.15 and a minimum length of 2.
- Out of those 26820 rules, {bottled beer, red/blush wine} => {liquor} has the highest lift of 35.71579.
- Out of those 26820 rules, {other vegetables} => {whole milk} has the highest support of 0.07483477.
- Out of those 26820 rules, {rice, sugar} => {whole milk} is one of the rules that reach the highest confidence of 1.