Association rules

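Association rules are a data mining technique for measuring how strongly attributes co-occur. The Apriori algorithm finds which items precede a given attribute (chosen by the data scientist) and reports, for each rule, its support (the share of transactions that contain the whole item set), its confidence (the conditional probability of the consequent given the antecedent), its lift (how much more often the antecedent and consequent occur together than would be expected if they were independent) and its coverage (the share of transactions that contain the antecedent).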
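The four measures can be illustrated with a few lines of base R. This is a minimal sketch on made-up transactions (the data and the helper `support_of` are hypothetical and not part of the project); it only shows how the numbers relate to each other for a single rule {high rating} => {many games}.

```r
# Toy transactions (hypothetical data, not taken from the FIFA dataset)
transactions <- list(
  c("high rating", "high price", "many games"),
  c("high rating", "many games"),
  c("high rating", "high price"),
  c("low rating")
)

# Share of transactions that contain every item in `items`
support_of <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

lhs <- "high rating"  # antecedent
rhs <- "many games"   # consequent

support    <- support_of(c(lhs, rhs), transactions)  # lhs and rhs together
coverage   <- support_of(lhs, transactions)          # lhs alone
confidence <- support / coverage                     # P(rhs | lhs)
lift       <- confidence / support_of(rhs, transactions)

print(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift))
```

Here the rule holds in 2 of 4 transactions (support 0.5), the antecedent occurs in 3 of 4 (coverage 0.75), so the confidence is 2/3 and the lift is 4/3.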

Project introduction

The aim of this project is to evaluate to what extent the Apriori algorithm can explain why certain Gold Rare cards are used more frequently in the FIFA’22 game than others. To do so, the data was recoded using a simple method: all values below the 1st quartile were classified as poor/below average, values in the range between the 1st and 3rd quartiles were classified as average, and all values above the 3rd quartile were marked as above average.
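The recoding rule described above can be sketched as a small helper. The function `recode_by_quartile` and the example scores are hypothetical; the project itself uses hand-written thresholds, so this is only an illustration of the idea, assuming that values equal to a quartile count as average.

```r
# Recode a numeric vector into three quartile-based labels.
# Values below Q1 get labels[1], values above Q3 get labels[3],
# everything in between (including the quartiles themselves) gets labels[2].
recode_by_quartile <- function(x,
                               labels = c("poor", "average", "above average")) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  ifelse(x < q[1], labels[1],
         ifelse(x <= q[2], labels[2], labels[3]))
}

# Made-up scores out of 100, for illustration only
scores <- c(53, 70, 74, 80, 80, 86, 90, 97)
print(data.frame(score = scores, class = recode_by_quartile(scores)))
```

Computing the cut points with quantile() keeps the labels in step with the data if the dataset is refreshed, at the cost of less control over how boundary values are treated.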


The dataset covers players from the FIFA’22 computer game (https://www.futbin.com/22/players?page=1&version=gold_rare&sort=version&order=desc, accessed 22-23.01.2022). It was reduced to Gold Rare players only, as these are the most valuable players. It is composed of 19 variables and 107 observations:
- Index -> identifier of each observation
- Player Name -> character string with the name of each player
- Position -> position in which the player plays; can be used for filtering, as there are different requirements for goalkeepers, defenders and strikers
- Version -> the same for all the players in the dataset (“rare”)
- Rating -> rating in the game
- Player’s price -> player’s price in the game
- Skills -> rating out of 5 stars
- Weak foot -> rating out of 5 stars
- Pace -> score out of 100
- Passing -> score out of 100
- Shooting -> score out of 100
- Dribbling -> score out of 100
- Defense -> score out of 100
- Physically -> score out of 100
- Popularity -> number of clicks (profile views of each player) on the Futbin website
- Base statistics -> statistics of the player
- In game statistics -> statistics of the player
- Games played -> number of games played
- Average goals per game -> average goals scored in one game

Loading necessary libraries

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)

Data preparation

The data has 107 observations and 19 attributes. For data preparation, the data points were recoded following the approach explained in the introduction.

db <- read.csv("FIFA_RARE.csv", sep = ";", dec = ",", header = TRUE)
summary(db)
##      Index          Player            Position           Version         
##  Min.   :  1.0   Length:107         Length:107         Length:107        
##  1st Qu.: 27.5   Class :character   Class :character   Class :character  
##  Median : 54.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 54.0                                                           
##  3rd Qu.: 80.5                                                           
##  Max.   :107.0                                                           
##      Rating         PS_price          Skills        Weak_Foot    
##  Min.   :77.00   Min.   :   700   Min.   :1.000   Min.   :2.000  
##  1st Qu.:82.00   1st Qu.:  1000   1st Qu.:2.500   1st Qu.:3.000  
##  Median :84.00   Median :  2600   Median :3.000   Median :4.000  
##  Mean   :84.34   Mean   : 41224   Mean   :3.168   Mean   :3.561  
##  3rd Qu.:86.50   3rd Qu.: 16625   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :93.00   Max.   :969000   Max.   :5.000   Max.   :5.000  
##       Pace         Passing         Shooting      Dribbling        Defense     
##  Min.   :53.0   Min.   :37.00   Min.   :53.0   Min.   :59.00   Min.   :29.00  
##  1st Qu.:73.5   1st Qu.:65.50   1st Qu.:72.5   1st Qu.:78.00   1st Qu.:45.00  
##  Median :80.0   Median :76.00   Median :78.0   Median :82.00   Median :64.00  
##  Mean   :78.9   Mean   :72.49   Mean   :77.2   Mean   :81.35   Mean   :61.78  
##  3rd Qu.:86.0   3rd Qu.:81.00   3rd Qu.:81.0   3rd Qu.:86.50   3rd Qu.:80.00  
##  Max.   :97.0   Max.   :93.00   Max.   :93.0   Max.   :95.00   Max.   :91.00  
##    Physically      Popularity        Base_Stats    In_game_stats 
##  Min.   :50.00   Min.   : -440.0   Min.   :385.0   Min.   : 817  
##  1st Qu.:67.50   1st Qu.:  141.5   1st Qu.:429.0   1st Qu.:2012  
##  Median :76.00   Median :  442.0   Median :446.0   Median :2117  
##  Mean   :74.87   Mean   : 1170.0   Mean   :446.6   Mean   :2015  
##  3rd Qu.:82.00   3rd Qu.: 1405.0   3rd Qu.:462.5   3rd Qu.:2211  
##  Max.   :92.00   Max.   :11196.0   Max.   :502.0   Max.   :2337  
##   Games_played        Avg_goals     
##  Min.   :    4665   Min.   :0.0000  
##  1st Qu.:  164228   1st Qu.:0.0200  
##  Median :  924030   Median :0.1600  
##  Mean   : 3670417   Mean   :0.2604  
##  3rd Qu.: 3665154   3rd Qu.:0.3500  
##  Max.   :26525205   Max.   :1.1100
cat("Number of observations in the dataset:", nrow(db))
## Number of observations in the dataset: 107
cat("Number of variables in the analysis:", ncol(db))
## Number of variables in the analysis: 19
db$Rating<-ifelse(db[,5]<82, "low rating", ifelse(db[,5]<86, "average rating", ifelse(db[,5]<94, "high rating", NA)))
db$PS_price<-ifelse(db[,6]<1001, "low price", ifelse(db[,6]<16625, "average price", ifelse(db[,6]<969001, "high price", NA)))
db$Skills<-ifelse(db[,7]<3, "low skills", ifelse(db[,7]<4, "average skills", ifelse(db[,7]<6, "high skills", NA)))
db$Weak_Foot<-ifelse(db[,8]<3, "poor weak foot", ifelse(db[,8]<4, "average weak foot", ifelse(db[,8]<6, "above average weak foot", NA)))
db$Pace<-ifelse(db[,9]<74, "poor pace", ifelse(db[,9]<87, "average pace", ifelse(db[,9]<98, "above average pace", NA)))
db$Passing<-ifelse(db[,10]<66, "poor passing", ifelse(db[,10]<82, "average passing", ifelse(db[,10]<94, "above average passing", NA)))
db$Shooting<-ifelse(db[,11]<73, "poor shooting", ifelse(db[,11]<82, "average shooting", ifelse(db[,11]<94, "above average shooting", NA)))
db$Dribbling<-ifelse(db[,12]<78, "poor dribbling", ifelse(db[,12]<87, "average dribbling", ifelse(db[,12]<96, "above average dribbling", NA)))
db$Defense<-ifelse(db[,13]<45, "poor defense", ifelse(db[,13]<81, "average defense", ifelse(db[,13]<96, "above average defense", NA)))
db$Physically<-ifelse(db[,14]<68, "poor physique", ifelse(db[,14]<83, "average physique", ifelse(db[,14]<96, "above average physique", NA)))
db$Popularity<-ifelse(db[,15]<142, "poor popularity", ifelse(db[,15]<1406, "average popularity", ifelse(db[,15]<11198, "above average popularity", NA)))
db$Base_Stats<-ifelse(db[,16]<430, "poor base stats", ifelse(db[,16]<463, "average base stats", ifelse(db[,16]<503, "above average base stats", NA)))
db$In_game_stats<-ifelse(db[,17]<2013, "poor in game stats", ifelse(db[,17]<2212, "average in game stats", ifelse(db[,17]<2338, "above average in game stats", NA)))
db$Games_played<-ifelse(db[,18]<164229, "low number of games played", ifelse(db[,18]<3665155, "average number of games played", ifelse(db[,18]<26525206, "above average number of games played", NA)))
db$Avg_goals<-ifelse(db[,19]<0.0201, "low number of goals per game", ifelse(db[,19]<0.3501, "average number of goals per game", ifelse(db[,19]<2, "above average number of goals per game", NA)))

data2<-db[, c(3, 5:19)] 
write.csv(data2, file="FIFA2.csv")

Data after recoding

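Once recoding was completed, the data was saved and imported as transactions. Note that write.csv also stores the header row and the row names, so read.transactions reads the header as one extra transaction (108 instead of 107) and the row numbers as items.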

trans1<-read.transactions("FIFA2.csv", format="basket", sep=",", skip=0) 
trans1
## transactions in sparse format with
##  108 transactions (rows) and
##  183 items (columns)
inspect(trans1[1:10])
##      items                                    
## [1]  {Avg_goals,                              
##       Base_Stats,                             
##       Defense,                                
##       Dribbling,                              
##       Games_played,                           
##       In_game_stats,                          
##       Pace,                                   
##       Passing,                                
##       Physically,                             
##       Popularity,                             
##       Position,                               
##       PS_price,                               
##       Rating,                                 
##       Shooting,                               
##       Skills,                                 
##       Weak_Foot}                              
## [2]  {1,                                      
##       above average dribbling,                
##       above average in game stats,            
##       above average number of games played,   
##       above average number of goals per game, 
##       above average passing,                  
##       above average popularity,               
##       above average shooting,                 
##       above average weak foot,                
##       average base stats,                     
##       average pace,                           
##       high price,                             
##       high rating,                            
##       high skills,                            
##       poor defense,                           
##       poor physique,                          
##       RW}                                     
## [3]  {2,                                      
##       above average in game stats,            
##       above average number of games played,   
##       above average number of goals per game, 
##       above average passing,                  
##       above average popularity,               
##       above average weak foot,                
##       average base stats,                     
##       average dribbling,                      
##       average pace,                           
##       average physique,                       
##       average shooting,                       
##       high price,                             
##       high rating,                            
##       high skills,                            
##       poor defense,                           
##       ST}                                     
## [4]  {3,                                      
##       above average base stats,               
##       above average dribbling,                
##       above average in game stats,            
##       above average number of games played,   
##       above average number of goals per game, 
##       above average pace,                     
##       above average passing,                  
##       above average popularity,               
##       above average weak foot,                
##       average physique,                       
##       average shooting,                       
##       high price,                             
##       high rating,                            
##       high skills,                            
##       poor defense,                           
##       ST}                                     
## [5]  {4,                                      
##       above average dribbling,                
##       above average in game stats,            
##       above average number of games played,   
##       above average number of goals per game, 
##       above average pace,                     
##       above average passing,                  
##       above average popularity,               
##       above average shooting,                 
##       above average weak foot,                
##       average base stats,                     
##       high price,                             
##       high rating,                            
##       high skills,                            
##       LW,                                     
##       poor defense,                           
##       poor physique}                          
## [6]  {5,                                      
##       above average base stats,               
##       above average dribbling,                
##       above average in game stats,            
##       above average number of games played,   
##       above average passing,                  
##       above average popularity,               
##       above average shooting,                 
##       above average weak foot,                
##       average defense,                        
##       average number of goals per game,       
##       average pace,                           
##       average physique,                       
##       CM,                                     
##       high price,                             
##       high rating,                            
##       high skills}                            
## [7]  {6,                                      
##       above average base stats,               
##       above average dribbling,                
##       above average pace,                     
##       above average passing,                  
##       above average physique,                 
##       average defense,                        
##       average number of games played,         
##       average popularity,                     
##       average shooting,                       
##       average weak foot,                      
##       GK,                                     
##       high price,                             
##       high rating,                            
##       low number of goals per game,           
##       low skills,                             
##       poor in game stats}                     
## [8]  {7,                                      
##       above average dribbling,                
##       above average in game stats,            
##       above average number of games played,   
##       above average number of goals per game, 
##       above average pace,                     
##       above average passing,                  
##       above average popularity,               
##       above average shooting,                 
##       above average weak foot,                
##       average base stats,                     
##       average physique,                       
##       high price,                             
##       high rating,                            
##       high skills,                            
##       poor defense,                           
##       ST}                                     
## [9]  {8,                                      
##       above average base stats,               
##       above average dribbling,                
##       above average number of games played,   
##       above average pace,                     
##       above average passing,                  
##       above average physique,                 
##       above average shooting,                 
##       above average weak foot,                
##       average defense,                        
##       average popularity,                     
##       GK,                                     
##       high price,                             
##       high rating,                            
##       low number of goals per game,           
##       low skills,                             
##       poor in game stats}                     
## [10] {9,                                      
##       above average base stats,               
##       above average dribbling,                
##       above average pace,                     
##       above average passing,                  
##       above average physique,                 
##       above average shooting,                 
##       above average weak foot,                
##       average number of games played,         
##       average popularity,                     
##       GK,                                     
##       high price,                             
##       high rating,                            
##       low number of goals per game,           
##       low skills,                             
##       poor defense,                           
##       poor in game stats}
size(trans1) 
##   [1] 16 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
##  [26] 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
##  [51] 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
##  [76] 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
## [101] 17 17 17 17 17 17 17 17
length(trans1)
## [1] 108
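One way to keep the header row and the row numbers out of the transactions is to write the file without them. The sketch below uses a small made-up data frame `demo` in place of data2 (an assumption for illustration) and base R's write.table, since write.csv does not allow the header to be switched off.

```r
# Hypothetical two-row stand-in for data2
demo <- data.frame(
  Position = c("RW", "ST"),
  Rating   = c("high rating", "high rating"),
  stringsAsFactors = FALSE
)

# Write values only: no row names, no header line.
# quote = FALSE is safe here because no label contains a comma.
out_file <- tempfile(fileext = ".csv")
write.table(demo, file = out_file, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Each line now holds only the items of one observation
written <- readLines(out_file)
print(written)
```

A file written this way could then be passed to read.transactions(..., format="basket", sep=",") as before, and each of the 107 players should then correspond to exactly one transaction. The supports and lifts reported in this document are based on 108 transactions and would shift slightly after such a change.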

Frequency plot

The graph below shows the relative frequency of the 25 most frequent items in the dataset. No single frequency is high because each observation carries multiple attributes. However, we can identify the values that appeared most often. The value “above average weak foot” appeared most frequently, while “above average number of games played” is the 25th most frequent item. This is the item we will focus on later in the project.

itemFrequencyPlot(trans1, topN=25, type="relative", main="Frequency Plot") 
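The relative frequency of an item is simply the share of transactions that contain it. A minimal base R sketch on made-up transactions (the list `toy` is hypothetical) shows the calculation that such a plot is based on:

```r
# Toy transactions (hypothetical data)
toy <- list(
  c("high rating", "high price"),
  c("high rating", "low price"),
  c("low rating", "low price"),
  c("high rating", "high price")
)

# Count each item once per transaction, then divide by the number of transactions
item_counts <- table(unlist(lapply(toy, unique)))
rel_freq <- sort(item_counts / length(toy), decreasing = TRUE)
print(rel_freq)
```

For a transactions object, itemFrequency(trans1, type="relative") from arules returns the same kind of values without the manual counting.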

Apriori

Before analysing the attributes connected to the number of games a player plays, the Apriori algorithm was run with its default settings (minimum support 0.1, minimum confidence 0.8) and without any assumptions.

rule1<-apriori(trans1)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 10 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[183 item(s), 108 transaction(s)] done [0.00s].
## sorting and recoding items ... [47 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [2957 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
set.seed(123) 
# The graph method does not recognise 'main' as a control parameter, so no title is passed
plot(rule1, method="graph", measure="support", shading="lift")
## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).

The algorithm found almost 3 thousand association rules (2957). This is too many to present on one graph, so only the 100 rules with the highest lift were plotted.

What determines whether a card gets chosen?

Below, the Apriori algorithm was run to determine which attributes precede an above average number of games played by a football player.

rule2<-apriori(data=trans1, appearance=list(default="lhs", rhs="above average number of games played"), control=list(verbose=F)) 

rule2.byconf<-sort(rule2, by="confidence", decreasing=TRUE)
inspect(head(rule2.byconf))
##     lhs                            rhs                                      support confidence  coverage lift count
## [1] {above average popularity,                                                                                     
##      high price}                => {above average number of games played} 0.1296296          1 0.1296296    4    14
## [2] {above average popularity,                                                                                     
##      high rating}               => {above average number of games played} 0.1481481          1 0.1481481    4    16
## [3] {above average passing,                                                                                        
##      above average popularity,                                                                                     
##      high price}                => {above average number of games played} 0.1111111          1 0.1111111    4    12
## [4] {above average passing,                                                                                        
##      above average popularity,                                                                                     
##      high rating}               => {above average number of games played} 0.1111111          1 0.1111111    4    12
## [5] {above average popularity,                                                                                     
##      high price,                                                                                                   
##      high rating}               => {above average number of games played} 0.1296296          1 0.1296296    4    14
## [6] {above average passing,                                                                                        
##      above average popularity,                                                                                     
##      high price,                                                                                                   
##      high rating}               => {above average number of games played} 0.1111111          1 0.1111111    4    12
plot(rule2, method="graph")

The confidence of all the displayed rules is 1, which means that whenever the antecedent (left-hand side) item set occurs, the consequent occurs as well. The lift is 4 for every rule; a lift above 1 indicates a positive association. Support is between 11% and 15%, which is the share of transactions in the whole dataset that contain all items of the rule; because confidence is 1, it equals the coverage of the antecedent. This is not a large percentage, but not negligible either. An above average number of games played is thus associated with high popularity of the football player, a high rating, an above average passing score, a high price and combinations of these. Association rules describe co-occurrence, so these results alone do not show that the attributes cause the higher number of games played.
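The reported measures can be cross-checked by hand from the count column. The short calculation below uses only numbers taken from the output above (108 transactions, counts of 14, 16 and 12, confidence 1, lift 4); the variable names are chosen for illustration.

```r
n_transactions <- 108                    # from length(trans1)
rule_count     <- c(14, 16, 12, 12, 14, 12)  # 'count' column of the six rules

# Support is the count divided by the number of transactions
support <- rule_count / n_transactions
print(round(support, 7))

# Lift = confidence / support(rhs), so support(rhs) = confidence / lift
confidence  <- 1
lift        <- 4
rhs_support <- confidence / lift             # share of transactions with the rhs item
rhs_count   <- rhs_support * n_transactions  # implied number of such transactions
print(c(rhs_support = rhs_support, rhs_count = rhs_count))
```

A lift of 4 with a confidence of 1 therefore implies that the item “above average number of games played” occurs in a quarter of the transactions, i.e. 27 of 108.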

Conclusion

Association rules make it possible to estimate the probability of various inputs given a certain output, or vice versa. The Apriori algorithm made it possible to identify the characteristics of observations with a specific element fixed on the right-hand side of the rule. Reflecting back on the aim of the paper, the most frequently chosen players were highly popular and had high ratings, above average passing scores and high prices.

Reference

https://www.ibm.com/docs/en/db2/10.5?topic=visualizer-characteristics-association-rules-item-sets