Introduction :-
In this report, I perform an association analysis (also known as market basket analysis) on a football data set.
Association analysis :-
Association analysis enables us to identify frequent itemsets, that is, items that have an affinity for each other, and to find interesting relationships between items. It is frequently used to analyze transactional data (also called market baskets) to identify items that often appear together in transactions.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
Structure of given dataset :-
The given dataset has 17994 players, and each player is described by 34 athletic abilities (for example crossing, volleys and dribbling), each recorded as TRUE or FALSE.
A sample of the dataset is as follows.
## crossing finishing heading_accuracy short_passing volleys dribbling curve
## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 3 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## free_kick_accuracy long_passing ball_control acceleration sprint_speed
## 1 TRUE TRUE TRUE TRUE TRUE
## 2 TRUE TRUE TRUE TRUE TRUE
## 3 TRUE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE TRUE TRUE TRUE
## 5 FALSE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE
## agility reactions balance shot_power jumping stamina strength long_shots
## 1 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 3 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## 4 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 5 FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## aggression interceptions positioning vision penalties composure marking
## 1 TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## 2 FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## 3 FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## 4 TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## 5 FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## 6 TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## standing_tackle sliding_tackle gk_diving gk_handling gk_kicking
## 1 FALSE FALSE FALSE FALSE TRUE
## 2 FALSE FALSE FALSE FALSE TRUE
## 3 FALSE FALSE FALSE FALSE TRUE
## 4 FALSE FALSE TRUE TRUE TRUE
## 5 FALSE FALSE TRUE TRUE TRUE
## 6 FALSE FALSE TRUE FALSE TRUE
## gk_positioning gk_reflexes
## 1 TRUE FALSE
## 2 TRUE FALSE
## 3 TRUE FALSE
## 4 TRUE TRUE
## 5 TRUE TRUE
## 6 FALSE FALSE
The summary of the transaction data is as follows.
## transactions as itemMatrix in sparse format with
## 17994 rows (elements/itemsets/transactions) and
## 34 columns (items) and a density of 0.4846289
##
## most frequent items:
## sprint_speed long_shots finishing stamina volleys (Other)
## 8992 8980 8951 8951 8938 251682
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 5 23 46 90 125 979 825 761 634 633 661 659 659 705 672 685 696 708 749 742
## 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 759 757 848 814 747 712 591 534 423 338 222 131 46 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 17.00 16.48 23.00 33.00
##
## includes extended item information - examples:
## labels variables levels
## 1 crossing crossing TRUE
## 2 finishing finishing TRUE
## 3 heading_accuracy heading_accuracy TRUE
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
Frequent Ability-sets Mining :-
Our transaction data set has 34 abilities for each player, so a very large number of different ability-sets (frequent or not) can be formed by combining them: 34 abilities give 2^34 - 1 non-empty subsets, which is more than 17 billion.
Since this number is far too large to analyse set by set, we assign a few metrics to each ability-set that indicate the strength of that set. The measures considered in this analysis are:
Support :-
The support measure indicates how frequently an item (or itemset) occurs across all the transactions. It is defined by the following formula.
\[ Support(A,B) = P ( A \cap B ) \]
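As a minimal sketch of the support formula above, the following Python snippet counts the fraction of transactions that contain every item of an itemset. The four toy players and the helper name `support` are assumptions made for illustration only; they are not taken from the football dataset or from the tooling used in this report.

```python
# Toy transactions (hypothetical): each set lists the abilities one player has.
players = [
    {"marking", "standing_tackle", "sliding_tackle"},
    {"marking", "standing_tackle"},
    {"finishing", "volleys"},
    {"marking", "finishing"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# 2 of the 4 toy players have both abilities.
pair_support = support({"marking", "standing_tackle"}, players)
print(pair_support)  # 0.5
```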
Lift :-
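Lift compares how often the items of a rule occur together with how often they would be expected to occur together if they were independent; a value above 1 indicates that they co-occur more often than expected. Unlike the confidence metric, whose value may vary depending on the direction of the rule, lift has no direction: lift(A,B) is always equal to lift(B,A). It is defined by the following formula.
\[ Lift(A,B) = \frac{P ( A \cap B )}{P(A) \cdot P(B)} \]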
APRIORI Algorithm :-
Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. With this algorithm, we can set threshold values for the following controls, filter out the weaker itemsets and identify the best frequent itemsets out of the ocean of itemsets.
- minimum value (minval)
- maximum time (maxtime)
- support
- minimum length (minlen)
- maximum length (maxlen)
Of the above controls (used to filter the best frequent ability-sets), we set the following parameters.
- support = 0.01
- minimum length = 2
- maximum length = 3
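To illustrate how these three controls interact, here is a simplified level-wise search in the spirit of Apriori, written in plain Python. It is a sketch only, not the implementation that produced the output in this report; the function name `apriori_itemsets`, its parameter names and the five toy players are assumptions.

```python
from itertools import combinations


def apriori_itemsets(transactions, min_support, minlen=1, maxlen=3):
    """Return {itemset: support} for frequent itemsets of length minlen..maxlen."""
    n = len(transactions)

    def support(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        single = frozenset([item])
        s = support(single)
        if s >= min_support:
            current[single] = s

    frequent = {}
    size = 1
    while current:
        if size >= minlen:
            frequent.update(current)
        if size == maxlen:
            break
        size += 1
        # Candidates: unions of two frequent sets that are exactly one item larger.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != size:
                continue
            # Apriori pruning: every (size - 1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, size - 1)):
                candidates.add(union)
        # Keep only the candidates that reach the support threshold.
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
    return frequent


# Toy transactions (hypothetical).
players = [
    {"marking", "standing_tackle", "sliding_tackle"},
    {"marking", "standing_tackle", "sliding_tackle"},
    {"marking", "standing_tackle"},
    {"finishing", "volleys"},
    {"finishing", "volleys", "marking"},
]

result = apriori_itemsets(players, min_support=0.3, minlen=2, maxlen=3)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

On this toy data the search keeps four pairs and one triple; single abilities are found internally but dropped from the result because of `minlen=2`.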
Specification of the configured algorithm :-
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 3 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 179
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[34 item(s), 17994 transaction(s)] done [0.01s].
## sorting and recoding items ... [34 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.05s].
## sorting transactions ... done [0.00s].
## writing ... [6545 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
We can observe from the specification that 6545 ability-sets satisfy the given threshold values of support, minlen and maxlen.
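A short arithmetic check helps put the figure of 6545 in context. The sketch below counts how many 2-ability and 3-ability sets can be formed from 34 abilities, and recomputes the absolute minimum support count from the log above. The variable names are illustrative, and rounding the count down is an assumption that happens to match the reported value of 179.

```python
from math import comb

n_abilities = 34
n_players = 17994
min_support = 0.01

pairs = comb(n_abilities, 2)    # all possible 2-ability sets
triples = comb(n_abilities, 3)  # all possible 3-ability sets
print(pairs, triples, pairs + triples)  # 561 5984 6545

# Smallest number of players an ability-set needs in order to be kept
# (assumption: the relative threshold is converted by rounding down).
min_count = int(min_support * n_players)
print(min_count)  # 179
```

The total matches the 6545 sets written by the algorithm, which suggests that every possible pair and triple of abilities reaches the 0.01 support threshold in this data.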
Summary of Ability-sets :-
## set of 6545 itemsets
##
## most frequent items:
## crossing finishing heading_accuracy short_passing
## 561 561 561 561
## volleys (Other)
## 561 16269
##
## element (itemset/transaction) length distribution:sizes
## 2 3
## 561 5984
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.914 3.000 3.000
##
## summary of quality measures:
## support transIdenticalToItemsets count
## Min. :0.06096 Min. :0.000e+00 Min. :1097
## 1st Qu.:0.13199 1st Qu.:0.000e+00 1st Qu.:2375
## Median :0.16650 Median :0.000e+00 Median :2996
## Mean :0.18265 Mean :1.155e-06 Mean :3287
## 3rd Qu.:0.22852 3rd Qu.:0.000e+00 3rd Qu.:4112
## Max. :0.46804 Max. :2.223e-04 Max. :8422
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## df 17994 0.01 1
From the summary, we can observe that the maximum cardinality (length) considered by the algorithm is 3, which is the limit we set. Of the 6545 ability-sets under observation, most have length 3 (5984 sets, against 561 sets of length 2).
The algorithm reports the following metric for all 6545 itemsets:
- support (with a mean of 0.18265). We control the minimum support directly, and therefore the mean support only indirectly.
It is also important to have a lift value for each ability-set, so we add it manually.
The updated summary of the algorithm is as follows.
## set of 6545 itemsets
##
## most frequent items:
## crossing finishing heading_accuracy short_passing
## 561 561 561 561
## volleys (Other)
## 561 16269
##
## element (itemset/transaction) length distribution:sizes
## 2 3
## 561 5984
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.914 3.000 3.000
##
## summary of quality measures:
## support transIdenticalToItemsets count lift
## Min. :0.06096 Min. :0.000e+00 Min. :1097 Min. :0.5533
## 1st Qu.:0.13199 1st Qu.:0.000e+00 1st Qu.:2375 1st Qu.:1.1071
## Median :0.16650 Median :0.000e+00 Median :2996 Median :1.3784
## Mean :0.18265 Mean :1.155e-06 Mean :3287 Mean :1.4958
## 3rd Qu.:0.22852 3rd Qu.:0.000e+00 3rd Qu.:4112 3rd Qu.:1.8124
## Max. :0.46804 Max. :2.223e-04 Max. :8422 Max. :3.7762
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## df 17994 0.01 1
As we can observe, lift has now been added to every one of the 6545 ability-sets.
- lift (with a mean of 1.4958)
Top ability-sets (or) frequent-ability-sets (w.r.t support) :-
The top 10 ability-sets (the frequent ability-sets with the highest support) are as follows.
## items support
## [1] {marking,standing_tackle} 0.4680449
## [2] {standing_tackle,sliding_tackle} 0.4673780
## [3] {marking,sliding_tackle} 0.4615983
## [4] {interceptions,standing_tackle} 0.4545404
## [5] {interceptions,marking} 0.4520396
## [6] {marking,standing_tackle,sliding_tackle} 0.4520396
## [7] {interceptions,sliding_tackle} 0.4443703
## [8] {interceptions,marking,standing_tackle} 0.4395354
## [9] {interceptions,standing_tackle,sliding_tackle} 0.4367011
## [10] {interceptions,marking,sliding_tackle} 0.4323663
## transIdenticalToItemsets count lift
## [1] 0.000000e+00 8422 1.911068
## [2] 0.000000e+00 8410 1.935251
## [3] 0.000000e+00 8306 1.905319
## [4] 0.000000e+00 8179 1.890046
## [5] 0.000000e+00 8134 1.873746
## [6] 1.667222e-04 8134 3.776217
## [7] 0.000000e+00 7996 1.867926
## [8] 0.000000e+00 7909 3.687273
## [9] 5.557408e-05 7858 3.715148
## [10] 0.000000e+00 7780 3.666723
The ability-set {marking,standing_tackle} is the most frequent, with the highest support of 0.4680449.
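As a quick consistency check, this support value can be recovered from the count column of the table and the number of players:

\[ Support(marking, standing\_tackle) = \frac{8422}{17994} \approx 0.46804 \]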
Visualizing ability-sets in a graph :-
- Filtering by support :-
The graph representation of the top 30 ability-sets is as follows. Only the 30 itemsets with the highest support values are considered in this graph.
In the above graph, each circle (node) represents an ability-set, so the graph has only 30 nodes. The size of a node represents the support of the ability-set, and the color intensity represents its lift.
- Filtering by Lift :-
The graph representation of the top 25 ability-sets is as follows. Only the 25 itemsets with the highest lift values are considered in this graph.
Similarity between items :-
Cluster analysis of the similarity between items, with a distance based on the phi coefficient.
From the above dendrogram, we can see which items are most similar to each other (in a high-dimensional vector space); the similarity between two items is measured by their phi coefficient.
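For readers unfamiliar with the phi coefficient, the following sketch computes it for two TRUE/FALSE columns from their 2x2 contingency counts. The toy columns and the function name `phi_coefficient` are hypothetical, and the clustering step that turns these values into a dendrogram is not shown.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient between two equally long boolean sequences."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Phi is undefined when a column is constant; returning 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Toy columns (hypothetical), one entry per player.
marking = [True, True, True, False, False, False]
standing_tackle = [True, True, True, False, False, True]
gk_diving = [False, False, False, True, True, True]

print(phi_coefficient(marking, standing_tackle))  # positive: often seen together
print(phi_coefficient(marking, gk_diving))        # -1.0: never seen together
```

Values close to 1 mean two abilities tend to occur in the same players, while values close to -1 mean they tend to exclude each other.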
Conclusion :-
- The given dataset has 17994 players, and each player is described by 34 athletic abilities.
- There are 6545 ability-sets (561 of length 2 and 5984 of length 3) with a minimum support of 0.01, minimum length = 2 and maximum length = 3.
- Of those 6545 ability-sets, {marking,standing_tackle} has the highest support, 0.4680449.