Introduction :-

In this report, I perform association analysis (also known as market basket analysis) on a football data set.


Association analysis :-

Association analysis enables us to identify frequent itemsets, that is, items that have an affinity for each other, and to find interesting relationships between items. It is frequently used to analyze transactional data (also called market baskets) to identify items that often appear together in transactions.


Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.

Structure of given dataset :-

The given dataset has 17994 players, and each player is described by 34 athletic abilities (for example crossing, volleys, and dribbling), each recorded as TRUE or FALSE.

A sample of the dataset (the first six rows) is as follows.

##   crossing finishing heading_accuracy short_passing volleys dribbling curve
## 1     TRUE      TRUE             TRUE          TRUE    TRUE      TRUE  TRUE
## 2     TRUE      TRUE             TRUE          TRUE    TRUE      TRUE  TRUE
## 3     TRUE      TRUE             TRUE          TRUE    TRUE      TRUE  TRUE
## 4     TRUE      TRUE             TRUE          TRUE    TRUE      TRUE  TRUE
## 5    FALSE     FALSE            FALSE         FALSE   FALSE     FALSE FALSE
## 6     TRUE      TRUE             TRUE          TRUE    TRUE      TRUE  TRUE
##   free_kick_accuracy long_passing ball_control acceleration sprint_speed
## 1               TRUE         TRUE         TRUE         TRUE         TRUE
## 2               TRUE         TRUE         TRUE         TRUE         TRUE
## 3               TRUE         TRUE         TRUE         TRUE         TRUE
## 4               TRUE         TRUE         TRUE         TRUE         TRUE
## 5              FALSE         TRUE        FALSE        FALSE        FALSE
## 6               TRUE         TRUE         TRUE         TRUE         TRUE
##   agility reactions balance shot_power jumping stamina strength long_shots
## 1    TRUE      TRUE   FALSE       TRUE    TRUE    TRUE     TRUE       TRUE
## 2    TRUE      TRUE    TRUE       TRUE    TRUE    TRUE    FALSE       TRUE
## 3    TRUE      TRUE    TRUE       TRUE   FALSE    TRUE    FALSE       TRUE
## 4    TRUE      TRUE   FALSE       TRUE    TRUE    TRUE     TRUE       TRUE
## 5   FALSE      TRUE   FALSE      FALSE    TRUE   FALSE     TRUE      FALSE
## 6    TRUE      TRUE    TRUE       TRUE    TRUE    TRUE     TRUE       TRUE
##   aggression interceptions positioning vision penalties composure marking
## 1       TRUE         FALSE        TRUE   TRUE      TRUE      TRUE   FALSE
## 2      FALSE         FALSE        TRUE   TRUE      TRUE      TRUE   FALSE
## 3      FALSE         FALSE        TRUE   TRUE      TRUE      TRUE   FALSE
## 4       TRUE         FALSE        TRUE   TRUE      TRUE      TRUE   FALSE
## 5      FALSE         FALSE       FALSE   TRUE     FALSE      TRUE   FALSE
## 6       TRUE         FALSE        TRUE   TRUE      TRUE      TRUE   FALSE
##   standing_tackle sliding_tackle gk_diving gk_handling gk_kicking
## 1           FALSE          FALSE     FALSE       FALSE       TRUE
## 2           FALSE          FALSE     FALSE       FALSE       TRUE
## 3           FALSE          FALSE     FALSE       FALSE       TRUE
## 4           FALSE          FALSE      TRUE        TRUE       TRUE
## 5           FALSE          FALSE      TRUE        TRUE       TRUE
## 6           FALSE          FALSE      TRUE       FALSE       TRUE
##   gk_positioning gk_reflexes
## 1           TRUE       FALSE
## 2           TRUE       FALSE
## 3           TRUE       FALSE
## 4           TRUE        TRUE
## 5           TRUE        TRUE
## 6          FALSE       FALSE
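
Each row can be read as one transaction: the set of abilities flagged TRUE for that player. The following is a minimal Python sketch of that conversion, using a handful of columns from player 5 above. It is only an illustration; the function name `row_to_itemset` is hypothetical and is not taken from the code that produced this report.

```python
def row_to_itemset(row):
    """Return the set of ability names whose flag is True."""
    return {ability for ability, flag in row.items() if flag}

# A few columns of player 5 from the sample above (not the full 34).
player_5 = {
    "crossing": False,
    "long_passing": True,
    "reactions": True,
    "jumping": True,
    "gk_diving": True,
}

itemset = row_to_itemset(player_5)
print(sorted(itemset))
```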

The summary of the transaction data is as follows.

## transactions as itemMatrix in sparse format with
##  17994 rows (elements/itemsets/transactions) and
##  34 columns (items) and a density of 0.4846289 
## 
## most frequent items:
## sprint_speed   long_shots    finishing      stamina      volleys      (Other) 
##         8992         8980         8951         8951         8938       251682 
## 
## element (itemset/transaction) length distribution:
## sizes
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
##   5  23  46  90 125 979 825 761 634 633 661 659 659 705 672 685 696 708 749 742 
##  20  21  22  23  24  25  26  27  28  29  30  31  32  33 
## 759 757 848 814 747 712 591 534 423 338 222 131  46  15 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   17.00   16.48   23.00   33.00 
## 
## includes extended item information - examples:
##             labels        variables levels
## 1         crossing         crossing   TRUE
## 2        finishing        finishing   TRUE
## 3 heading_accuracy heading_accuracy   TRUE
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
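
The figures in this summary can be cross-checked against each other. The sketch below, written in plain Python purely as an illustration, recomputes the number of transactions, the mean transaction length, and the density from the length distribution printed above. The variable names are assumptions made for this example.

```python
# Length distribution copied from the summary above:
# counts[k] is the number of players with exactly k abilities set to TRUE.
counts = [5, 23, 46, 90, 125, 979, 825, 761, 634, 633, 661, 659, 659, 705,
          672, 685, 696, 708, 749, 742, 759, 757, 848, 814, 747, 712, 591,
          534, 423, 338, 222, 131, 46, 15]
n_items = 34

n_transactions = sum(counts)
# Total number of TRUE cells in the player-by-ability matrix.
n_true = sum(size * count for size, count in enumerate(counts))
mean_length = n_true / n_transactions
# Density is the share of TRUE cells among all cells.
density = n_true / (n_transactions * n_items)

print(n_transactions, round(mean_length, 2), round(density, 7))
```

In other words, the density is the mean transaction length divided by the number of items.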

Frequent Ability-sets Mining :-

Our transaction data set has 34 abilities for each player, so by permutations and combinations we can form many different ability-sets (frequent or not). The number of possible ability-sets is huge, because it grows exponentially with the number of abilities.

Because this number is far too large to analyse directly, we assign a few metrics (ranks) to each ability-set that indicate the strength of that set. The measures considered in this analysis are:
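
To give a sense of scale, the number of candidate ability-sets can be counted directly. This is a small illustrative Python sketch; it assumes that every non-empty subset of the 34 abilities counts as a candidate.

```python
from math import comb

n_abilities = 34

# Every non-empty subset of the abilities is a candidate ability-set.
all_sets = 2 ** n_abilities - 1

# The same total, built up size by size.
by_size_total = sum(comb(n_abilities, k) for k in range(1, n_abilities + 1))

print(all_sets)
print(all_sets == by_size_total)
```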

Support :-

The support measure gives an idea of how frequently an item or itemset appears across all the transactions. It is defined by the following formula.

\[ Support(A,B) = P ( A \cap B ) \]
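
As an illustration of the formula above, support can be computed by counting the transactions that contain every item of the set. The sketch below uses a tiny invented data set, not the football data, and the function name `support` is an assumption for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Toy transactions, invented for illustration only.
transactions = [
    {"marking", "standing_tackle", "sliding_tackle"},
    {"marking", "standing_tackle"},
    {"finishing", "volleys"},
    {"marking", "finishing"},
]

# Two of the four transactions contain both abilities.
pair_support = support(transactions, {"marking", "standing_tackle"})
print(pair_support)
```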

Lift :-

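The lift measure checks the confidence from both sides of a relation (or rule). Unlike the confidence metric, whose value may vary depending on direction, lift has no direction: lift(A,B) is always equal to lift(B,A). A lift greater than 1 means the abilities occur together more often than would be expected if they were independent. It is defined by the following formula.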

\[ Lift(A,B) = \frac {P ( A \cap B )}{P(A)*P(B)} \]
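
The symmetry of lift can be seen in a short Python sketch of the formula above. The data is a toy example invented for illustration, and the helper names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(transactions, a, b):
    """Lift(A,B) = P(A and B) / (P(A) * P(B))."""
    joint = support(transactions, {a, b})
    return joint / (support(transactions, {a}) * support(transactions, {b}))

# Toy transactions, invented for illustration only.
transactions = [
    {"marking", "standing_tackle", "sliding_tackle"},
    {"marking", "standing_tackle"},
    {"finishing", "volleys"},
    {"marking", "finishing"},
]

ab = lift(transactions, "marking", "standing_tackle")
ba = lift(transactions, "standing_tackle", "marking")
print(ab, ba)
```

Swapping the two arguments only changes the order of the factors in the denominator, so both calls return the same value.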


APRIORI Algorithm :-

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. With this algorithm, we can set threshold values for the metrics (or ranks) above, filter out the itemsets that fall below them, and identify the best frequent itemsets out of the ocean of itemsets.

Of the available controls (for filtering the best frequent ability-sets), we set the following parameters: minimum support (0.01), minimum length (minlen = 2), and maximum length (maxlen = 3).


The Specification of defined Algorithm :-

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen            target  ext
##       3 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 179 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[34 item(s), 17994 transaction(s)] done [0.01s].
## sorting and recoding items ... [34 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.05s].
## sorting transactions ... done [0.00s].
## writing ... [6545 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

We can observe from the specification that 6545 ability-sets satisfy the given threshold values of support, minlen, and maxlen.
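
The level-wise idea behind Apriori can be sketched in a few lines of Python. This is a simplified illustration on invented toy data, not the implementation that produced the log above; the function name `frequent_itemsets` is hypothetical, and its parameters only mirror the support, minlen, and maxlen controls used here. As a side note, the absolute minimum support count of 179 in the log is consistent with 0.01 × 17994 = 179.94 rounded down.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, minlen=1, maxlen=3):
    """Level-wise (Apriori-style) search for frequent itemsets."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]  # size-1 candidates
    result = {}
    size = 1
    while current and size <= maxlen:
        # Keep the candidates of this size that reach the support threshold.
        level = {}
        for cand in current:
            supp = sum(1 for t in transactions if cand <= t) / n
            if supp >= min_support:
                level[cand] = supp
        if size >= minlen:
            result.update(level)
        # Join: merge frequent sets into candidates one item larger.
        size += 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune: every subset one item smaller must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, size - 1))]
    return result

# Toy transactions, invented for illustration only.
transactions = [
    {"marking", "standing_tackle", "sliding_tackle"},
    {"marking", "standing_tackle"},
    {"marking", "sliding_tackle", "standing_tackle"},
    {"finishing", "volleys"},
    {"finishing", "marking"},
]

found = frequent_itemsets(transactions, min_support=0.3, minlen=2, maxlen=3)
for itemset, supp in sorted(found.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), supp)
```

The prune step is what keeps the search tractable: a set can only be frequent if all of its smaller subsets are frequent, so most candidates never need to be counted.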

Summary of Ability-sets :-

## set of 6545 itemsets
## 
## most frequent items:
##         crossing        finishing heading_accuracy    short_passing 
##              561              561              561              561 
##          volleys          (Other) 
##              561            16269 
## 
## element (itemset/transaction) length distribution:sizes
##    2    3 
##  561 5984 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.914   3.000   3.000 
## 
## summary of quality measures:
##     support        transIdenticalToItemsets     count     
##  Min.   :0.06096   Min.   :0.000e+00        Min.   :1097  
##  1st Qu.:0.13199   1st Qu.:0.000e+00        1st Qu.:2375  
##  Median :0.16650   Median :0.000e+00        Median :2996  
##  Mean   :0.18265   Mean   :1.155e-06        Mean   :3287  
##  3rd Qu.:0.22852   3rd Qu.:0.000e+00        3rd Qu.:4112  
##  Max.   :0.46804   Max.   :2.223e-04        Max.   :8422  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions support confidence
##    df         17994    0.01          1

From the summary, we can observe that the maximum cardinality (length) the algorithm considered is 3, which we controlled. Of the 6545 ability-sets under observation, most (5984) have length 3.

The algorithm defines the following metric for all 6545 itemsets:

  • support (with a mean of 0.18265). We control the minimum support, and so indirectly the mean of support.

Because it is also important to have a lift value for each ability-set, we add it manually.
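
A quick count suggests that the 0.01 support threshold did not remove any ability-set of these sizes: the numbers in the length distribution equal the number of possible 2-ability and 3-ability subsets of 34 abilities. This is also consistent with the smallest support in the summary (0.06096) being well above 0.01. The check below is an illustrative Python sketch.

```python
from math import comb

n_abilities = 34
observed = {2: 561, 3: 5984}  # length distribution from the summary above

for size, count in observed.items():
    possible = comb(n_abilities, size)
    print(size, count, possible, count == possible)

total_observed = sum(observed.values())
print(total_observed)
```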

The updated summary of the algorithm is as follows.

## set of 6545 itemsets
## 
## most frequent items:
##         crossing        finishing heading_accuracy    short_passing 
##              561              561              561              561 
##          volleys          (Other) 
##              561            16269 
## 
## element (itemset/transaction) length distribution:sizes
##    2    3 
##  561 5984 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.914   3.000   3.000 
## 
## summary of quality measures:
##     support        transIdenticalToItemsets     count           lift       
##  Min.   :0.06096   Min.   :0.000e+00        Min.   :1097   Min.   :0.5533  
##  1st Qu.:0.13199   1st Qu.:0.000e+00        1st Qu.:2375   1st Qu.:1.1071  
##  Median :0.16650   Median :0.000e+00        Median :2996   Median :1.3784  
##  Mean   :0.18265   Mean   :1.155e-06        Mean   :3287   Mean   :1.4958  
##  3rd Qu.:0.22852   3rd Qu.:0.000e+00        3rd Qu.:4112   3rd Qu.:1.8124  
##  Max.   :0.46804   Max.   :2.223e-04        Max.   :8422   Max.   :3.7762  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions support confidence
##    df         17994    0.01          1

As we can observe, lift has been added for every ability-set (all 6545 of them).

  • Lift ( with mean of 1.4958 )

Top ability-sets (or) frequent-ability-sets (w.r.t support) :-

The top 10 ability-sets (frequent ability-sets) with the highest support are as follows.

##      items                                          support  
## [1]  {marking,standing_tackle}                      0.4680449
## [2]  {standing_tackle,sliding_tackle}               0.4673780
## [3]  {marking,sliding_tackle}                       0.4615983
## [4]  {interceptions,standing_tackle}                0.4545404
## [5]  {interceptions,marking}                        0.4520396
## [6]  {marking,standing_tackle,sliding_tackle}       0.4520396
## [7]  {interceptions,sliding_tackle}                 0.4443703
## [8]  {interceptions,marking,standing_tackle}        0.4395354
## [9]  {interceptions,standing_tackle,sliding_tackle} 0.4367011
## [10] {interceptions,marking,sliding_tackle}         0.4323663
##      transIdenticalToItemsets count lift    
## [1]  0.000000e+00             8422  1.911068
## [2]  0.000000e+00             8410  1.935251
## [3]  0.000000e+00             8306  1.905319
## [4]  0.000000e+00             8179  1.890046
## [5]  0.000000e+00             8134  1.873746
## [6]  1.667222e-04             8134  3.776217
## [7]  0.000000e+00             7996  1.867926
## [8]  0.000000e+00             7909  3.687273
## [9]  5.557408e-05             7858  3.715148
## [10] 0.000000e+00             7780  3.666723

The ability-set {marking,standing_tackle} is the most frequent, with the highest support of 0.4680449.

Visualizing ability-sets in a graph :-
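
The lift values of the three-ability sets in the table appear to follow the natural extension of the lift formula, that is, the support of the set divided by the product of the individual supports of its abilities. This is an inference from the numbers, not something stated in the report. The Python sketch below checks it for row [6] using only the pair rows [1] to [3].

```python
from math import sqrt

# (support, lift) for the three pairs, rows [1]-[3] of the table above.
pairs = {
    ("marking", "standing_tackle"): (0.4680449, 1.911068),
    ("standing_tackle", "sliding_tackle"): (0.4673780, 1.935251),
    ("marking", "sliding_tackle"): (0.4615983, 1.905319),
}
triple_support = 0.4520396  # row [6]: {marking,standing_tackle,sliding_tackle}

# For a pair, support / lift equals P(A) * P(B).
product_of_pair_terms = 1.0
for supp, pair_lift in pairs.values():
    product_of_pair_terms *= supp / pair_lift

# Each single-ability probability appears twice in that product,
# so its square root is P(A) * P(B) * P(C).
independent = sqrt(product_of_pair_terms)
triple_lift = triple_support / independent

print(round(triple_lift, 3))
```

The result agrees with the lift reported for row [6] to about three decimal places, which also explains why the three-ability sets show much higher lift than the pairs.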

  • Filtering by support :-

The graph representation of the top-30 ability-sets is as follows. In this graph, we consider only the 30 itemsets with the highest support values.

In the above graph, each circle (or node) represents an ability-set (the graph has only 30 nodes), the size of the node represents the support value of the ability-set, and the color intensity represents its lift value.

  • Filtering by Lift :-

The graph representation of the top-25 ability-sets is as follows. In this graph, we consider only the 25 itemsets with the highest lift values.

Similarity between items :-

Cluster analysis of the similarity between items, with the phi coefficient as the distance measure.

From the above dendrogram, we can see which items are very similar to each other (in a high-dimensional vector space); the similarity is measured with the phi coefficient between items.
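
For reference, the phi coefficient between two TRUE/FALSE columns can be computed from their 2x2 contingency table. The Python sketch below is an illustration on the six sample players shown earlier, so the value is not representative of the full data set. The function name `phi` is an assumption, and the report does not state how the coefficient was converted into a distance for clustering.

```python
from math import sqrt

def phi(x, y):
    """Phi coefficient between two equal-length boolean sequences."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    # Product of the four marginal totals; zero if either column is constant.
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

# gk_diving and gk_handling for the six sample players shown earlier.
gk_diving = [False, False, False, True, True, True]
gk_handling = [False, False, False, True, True, False]

value = phi(gk_diving, gk_handling)
print(round(value, 4))
```

A value close to 1 means the two abilities are almost always TRUE or FALSE together, 0 means no association, and negative values mean they tend to exclude each other.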

Conclusion :-

————————————————————- THANK YOU ————————————————————-