Introduction :-
In this report, I perform association analysis (also called market basket analysis) on the Adult data set.
Association analysis :-
Association analysis identifies items that have an affinity for each other, that is, interesting relationships between items. It is frequently used on transactional data (also called market baskets) to find items that often appear together in transactions.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
Structure of given dataset :-
The given dataset has 48842 transactions over 115 distinct items. For this analysis, we were given the sparse transaction matrix directly.
The underlying Adult dataset is a data frame with 48842 observations on the following 15 variables.
age : a numeric vector.
workclass : a factor with levels Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, and Without-pay.
education : an ordered factor with levels Preschool < 1st-4th < 5th-6th < 7th-8th < 9th < 10th < 11th < 12th < HS-grad < Prof-school < Assoc-acdm < Assoc-voc < Some-college < Bachelors < Masters < Doctorate.
education-num : a numeric vector.
marital-status : a factor with levels Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, and Widowed.
occupation : a factor with levels Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, and Transport-moving.
relationship : a factor with levels Husband, Not-in-family, Other-relative, Own-child, Unmarried, and Wife.
race : a factor with levels Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, and White.
sex : a factor with levels Female and Male.
capital-gain : a numeric vector.
capital-loss : a numeric vector.
fnlwgt : a numeric vector.
hours-per-week : a numeric vector.
native-country : a factor with levels Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, and Yugoslavia.
income : an ordered factor with levels small < large.
Association Rules Mining :-
Our transaction data set has 115 items. The number of candidate itemsets containing at least two items, each of which can give rise to association rules (strong or weak), is given by the following formula.
\[ R = 2^{k} - k - 1 \] where
R = number of itemsets with at least two items
k = number of items in the transaction dataset
For k = 115 this gives about 4.15e+34 itemsets. The number of distinct rules is larger still, because each itemset can be split into antecedent and consequent in several ways.
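The count can be checked with exact integer arithmetic. The following is a minimal Python sketch (the report itself was produced in R; the function names are illustrative and not taken from any library). It evaluates the formula above for k = 115 and, for comparison, the standard textbook count of distinct rules with non-empty, disjoint antecedent and consequent, 3^k - 2^(k+1) + 1.

```python
def itemsets_with_two_or_more_items(k: int) -> int:
    """Number of subsets of k items that contain at least two items."""
    return 2**k - k - 1


def possible_rules(k: int) -> int:
    """Number of rules A => B with A and B non-empty and disjoint."""
    return 3**k - 2 ** (k + 1) + 1


k = 115  # number of items in the transaction data set
print(f"{itemsets_with_two_or_more_items(k):.2e}")  # 4.15e+34
print(f"{possible_rules(k):.2e}")

# Sanity check on a tiny case: 3 items give 4 itemsets and 12 rules.
print(itemsets_with_two_or_more_items(3), possible_rules(3))  # 4 12
```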
Because this search space is far too large to analyse exhaustively, we assign a few metrics to each rule that indicate its strength. The measures considered in this analysis are:
Support :-
Support measures how frequently an item or itemset occurs across all transactions. It is defined by the following formula.
\[ Support(A,B) = P ( A \cap B ) \]
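As a small illustration of the definition, the sketch below computes support on a made-up set of five baskets. The data and the helper name `support` are hypothetical and are not part of the Adult analysis.

```python
# Five toy baskets (hypothetical data, not from the Adult data set).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"bread"}, transactions))          # 0.8
print(support({"bread", "milk"}, transactions))  # 0.4
```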
Confidence :-
Confidence measures how likely the consequent is to be in the cart given that the cart already contains the antecedent. It is defined by the following formula.
\[ Confidence(A,B) = \frac {P ( A \cap B )}{P(A)} \]
Lift :-
Lift measures how much more often the antecedent and consequent occur together than would be expected if they were independent; a lift of 1 corresponds to independence. Unlike confidence, whose value depends on the direction of the rule, lift has no direction: lift(A,B) is always equal to lift(B,A). It is defined by the following formula.
\[ Lift(A,B) = \frac {P ( A \cap B )}{P(A)*P(B)} \]
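To make the difference in direction concrete, here is a minimal sketch on hypothetical toy baskets. It works with raw counts rather than probabilities, which is equivalent to the formulas above once every probability is written as count / n. The helper names are illustrative.

```python
# Five toy baskets (hypothetical data, not from the Adult data set).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(a, b, transactions):
    """P(A and B) / P(A), computed from counts."""
    return count(set(a) | set(b), transactions) / count(a, transactions)


def lift(a, b, transactions):
    """P(A and B) / (P(A) * P(B)), computed from counts."""
    n = len(transactions)
    joint = count(set(a) | set(b), transactions)
    return (joint * n) / (count(a, transactions) * count(b, transactions))


# Confidence depends on the direction of the rule ...
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5
print(confidence({"milk"}, {"bread"}, transactions))
# ... while lift is the same in both directions.
print(lift({"bread"}, {"milk"}, transactions) == lift({"milk"}, {"bread"}, transactions))  # True
```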
APRIORI Algorithm :-
Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. With this algorithm, we can set threshold values for the following parameters to filter the rules and identify the best ones out of the ocean of candidates.
- confidence
- minimum value
- maximum time
- support
- minimum length
- maximum length
Of the above controls, we set only the following two parameters.
- support = 0.5
- confidence = 0.9
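The way the two thresholds prune the search can be sketched in a few lines of Python. This is a simplified toy version of Apriori written for illustration, not the implementation that produced the output in this report: the function name, the toy baskets and the thresholds 0.4 and 0.6 are all assumptions, only rules with a single item on the right-hand side are generated, and rules with an empty left-hand side are skipped.

```python
def apriori_rules(transactions, min_support, min_confidence):
    """Toy Apriori: returns (lhs, rhs, support, confidence, lift) tuples."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: single items that meet the support threshold.
    level = {frozenset([i]) for b in baskets for i in b}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        # Join: unions of two frequent k-itemsets that have k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-item subset must itself be frequent.
        candidates = {c for c in candidates if all(c - {i} in frequent for i in c)}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1

    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for rhs_item in itemset:
            lhs = itemset - {rhs_item}
            rhs = frozenset([rhs_item])
            conf = supp / frequent[lhs]
            if conf >= min_confidence:
                rules.append((lhs, rhs, supp, conf, conf / frequent[rhs]))
    return rules


# Hypothetical toy baskets and thresholds.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
]
for lhs, rhs, supp, conf, lift in apriori_rules(baskets, 0.4, 0.6):
    print(sorted(lhs), "=>", sorted(rhs), round(supp, 3), round(conf, 3), round(lift, 3))
```

On this toy data only two rules survive, {milk} => {bread} and {butter} => {bread}; the pair {milk, butter} is dropped at the support step, so no three-item candidate is ever counted.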
Specification of the algorithm run :-
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5     0.5      1
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 24421
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.03s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [52 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
We can observe from the output that 52 rules satisfy the given threshold values of support and confidence.
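The line "Absolute minimum support count: 24421" in the output follows from the relative support threshold and the number of transactions. A minimal check in Python is shown below; rounding up with `ceil` is an assumption about how the threshold is converted to a count, although here the product is a whole number, so the rounding rule does not matter.

```python
import math

n_transactions = 48842
min_support = 0.5

# Smallest number of transactions an itemset must appear in to count as frequent.
min_count = math.ceil(min_support * n_transactions)
print(min_count)  # 24421
```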
Summary of Algorithm :-
## set of 52 rules
##
## rule length distribution (lhs + rhs):sizes
##  1  2  3  4
##  2 13 24 13
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   2.923   3.250   4.000
##
## summary of quality measures:
##     support         confidence        coverage           lift
##  Min.   :0.5084   Min.   :0.9031   Min.   :0.5406   Min.   :0.9844
##  1st Qu.:0.5415   1st Qu.:0.9155   1st Qu.:0.5875   1st Qu.:0.9937
##  Median :0.5974   Median :0.9229   Median :0.6293   Median :0.9997
##  Mean   :0.6436   Mean   :0.9308   Mean   :0.6915   Mean   :1.0036
##  3rd Qu.:0.7426   3rd Qu.:0.9494   3rd Qu.:0.7945   3rd Qu.:1.0057
##  Max.   :0.9533   Max.   :0.9583   Max.   :1.0000   Max.   :1.0586
##      count
##  Min.   :24832
##  1st Qu.:26447
##  Median :29178
##  Mean   :31433
##  3rd Qu.:36269
##  Max.   :46560
##
## mining info:
##   data ntransactions support confidence
##  Adult         48842     0.5        0.9
From the summary, we can observe that the longest rules found contain 4 items (lhs + rhs). Of the 52 rules under observation, most (24) have length 3.
The algorithm reports the following quality measures for all 52 rules.
- support (mean 0.6436): we control the minimum support, and therefore indirectly its mean.
- confidence (mean 0.9308): we control the minimum confidence, and therefore indirectly its mean.
- lift (mean 1.0036)
- coverage (mean 0.6915)
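The rule length statistics in the summary can be reproduced from the length distribution alone. The short Python check below uses only the counts printed in the summary; the variable names are illustrative.

```python
import statistics

# Rule length -> number of rules, copied from the summary output.
length_counts = {1: 2, 2: 13, 3: 24, 4: 13}

# Expand to one entry per rule.
lengths = [length for length, c in length_counts.items() for _ in range(c)]

print(len(lengths))                          # 52
print(round(sum(lengths) / len(lengths), 3))  # 2.923
print(statistics.median(lengths))            # 3.0
```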
Top rules (or relationships) by lift :-
The top 6 rules with the highest lift are as follows.
##     lhs                               rhs                              support confidence  coverage     lift count
## [1] {sex=Male,
##      native-country=United-States} => {race=White}                   0.5415421  0.9051090 0.5983170 1.058554 26450
## [2] {sex=Male,
##      capital-loss=None,
##      native-country=United-States} => {race=White}                   0.5113632  0.9032585 0.5661316 1.056390 24976
## [3] {race=White}                   => {native-country=United-States} 0.7881127  0.9217231 0.8550428 1.027076 38493
## [4] {race=White,
##      capital-loss=None}            => {native-country=United-States} 0.7490480  0.9205626 0.8136849 1.025783 36585
## [5] {race=White,
##      sex=Male}                     => {native-country=United-States} 0.5415421  0.9204803 0.5883256 1.025691 26450
## [6] {race=White,
##      capital-gain=None}            => {native-country=United-States} 0.7194628  0.9202807 0.7817862 1.025469 35140
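Rule [1] is the top rule, with the highest lift of 1.0586 (it holds in 26450 transactions). Because this lift is only slightly above 1, its antecedent and consequent are close to independent.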
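The lift and count of rule [1] can be recomputed from the other numbers in the table. The sketch below relies on the fact that coverage is the support of the left-hand side, so the coverage of rule [3], whose left-hand side is {race=White}, gives the support of race=White. All values are copied from the table; the variable names are illustrative.

```python
n_transactions = 48842

support_rule1 = 0.5415421       # support of rule [1]
confidence_rule1 = 0.9051090    # confidence of rule [1]
support_race_white = 0.8550428  # coverage of rule [3] = support of {race=White}

# lift = confidence / support of the right-hand side
lift_rule1 = confidence_rule1 / support_race_white
# count = support * number of transactions
count_rule1 = round(support_rule1 * n_transactions)

print(round(lift_rule1, 4))  # 1.0586
print(count_rule1)           # 26450
```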
Visualizing the rules :-
The 3-dimensional plot of all 52 rules is as follows.
We can observe that the rules with the highest lift have comparatively low support and confidence.
Visualizing the rules as a graph :-
The graph representation of the same rules is as follows. In this graph, we consider only the top 15 rules by lift.
In the above graph, each circle (node) represents a rule (there are only 15 such nodes), the size of the node represents the support of the rule, and the color intensity represents its lift.
Similarity between items :-
Cluster analysis of the similarity between items, with the phi coefficient as the similarity measure.
From the above dendrogram, we can see which items are most similar to each other (in a high-dimensional vector space), with similarity measured by the phi coefficient between items.
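For reference, the phi coefficient between two items can be computed from their 0/1 occurrence vectors over the transactions. The sketch below is a plain Python illustration on hypothetical vectors; converting the coefficient to a distance as 1 - phi is one common choice and is an assumption here, not necessarily the conversion used to build the dendrogram.

```python
import math


def phi(x, y):
    """Phi coefficient between two equal-length 0/1 vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # undefined when an item is constant; treated as 0 here
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical occurrence vectors: one entry per transaction.
item_a = [1, 1, 0, 0]
item_b = [1, 1, 0, 0]  # always occurs with item_a
item_c = [0, 0, 1, 1]  # never occurs with item_a
item_d = [1, 0, 1, 0]  # unrelated to item_a

print(phi(item_a, item_b))      # 1.0
print(phi(item_a, item_c))      # -1.0
print(phi(item_a, item_d))      # 0.0
print(1 - phi(item_a, item_b))  # assumed distance: 0.0 for identical items
```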
Conclusion :-
- The given dataset has 48842 transactions over 115 distinct items.
- Only 52 rules meet a minimum support of 0.5 and a minimum confidence of 0.9.
- Of those 52 rules, {sex=Male, native-country=United-States} => {race=White} has the highest lift, 1.0586.