Introduction

Football transfers involve a combination of player attributes, performance indicators, and structural characteristics related to clubs and competitions. Identifying how these features co-occur can provide useful descriptive insights into typical transfer profiles. Association rule mining offers a suitable exploratory framework for this purpose, as it focuses on discovering frequent patterns in categorical data without assuming causal relationships.

This project applies association rule mining to a dataset of football player transfers in order to uncover regularities among player characteristics and transfer-related variables. After appropriate data preprocessing, including discretization and transformation of binary indicators, the Apriori algorithm is used to extract association rules. The discovered rules are evaluated using standard quality measures and visualized to support interpretation of the results.

1. Dataset

The data were scraped from Transfermarkt.com and FBref.com. The dataset includes all player transfers into and out of the Premier League between the 2010 and 2025 seasons. In total, the dataset contains 2,026 transfers (the row count reported by R after loading the file) described by 11 features.

The statistical variables describe player performance metrics recorded in the season preceding the transfer. In cases where a transfer occurred during the winter transfer window, the statistics correspond to the autumn round of the same season.

1.1. Using Python for Data Cleaning

Before executing association rule mining in the R environment, the dataset was preprocessed using Python. Two functions were employed to transform binary variables into categorical factors suitable for association rule analysis.

The first function collapsed groups of mutually exclusive binary columns into a single categorical variable. For example, binary indicators representing player continents (e.g., Europe, Asia, Africa) were merged into one variable named Continent.

The second function transformed individual binary columns into explicit categorical attributes by assigning the column name when the value equaled 1 and prefixing the label with “not_” when the value equaled 0.

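import pandas as pd

def collapse_binary_columns(
    df: pd.DataFrame,
    mapping: dict,
    new_col: str,
    default_value: str | None = None
):
    # Collapse mutually exclusive 0/1 columns (the keys of `mapping`) into one
    # categorical column `new_col`; the source columns are dropped in place.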
    def get_value(row):
        active = [
            label
            for col, label in mapping.items()
            if col in row.index and row[col] == 1
        ]

        if len(active) == 1:
            return active[0]
        elif len(active) == 0:
            return default_value
        else:
            return "Multiple"

    df[new_col] = df.apply(get_value, axis=1)

    drop_cols = [c for c in mapping.keys() if c in df.columns]
    df.drop(columns=drop_cols, inplace=True)

    return df

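def binary_to_explicit_category(df, columns):
    # Replace 1 with the column name and 0 with "not_<column name>";
    # columns missing from the frame are skipped.
    for col in columns: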
        if col not in df.columns:
            continue

        df[col] = df[col].map({
            1: col,
            0: f"not_{col}"
        })

    return df
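
To make the effect of the two transformations concrete, the following stdlib-only sketch mirrors their row-level logic on plain dictionaries. It is an illustration rather than the code used in the project: the helper names collapse_row and explicit_category, the column names is_Europe and is_Africa, and the toy rows are all assumptions.

```python
# Illustrative sketch only: mirrors the row-level logic of the two functions
# above on plain dicts. Helper names, column names, and rows are hypothetical.

def collapse_row(row: dict, mapping: dict, default=None):
    """Return the label of the single active (== 1) indicator in `row`."""
    active = [label for col, label in mapping.items() if row.get(col) == 1]
    if len(active) == 1:
        return active[0]
    if not active:
        return default
    return "Multiple"  # more than one indicator set: flagged, not resolved


def explicit_category(value: int, col: str):
    """Map 1 -> col and 0 -> 'not_<col>'; any other value becomes None."""
    return {1: col, 0: f"not_{col}"}.get(value)


continent_map = {"is_Europe": "Europe", "is_Africa": "Africa"}
rows = [
    {"is_Europe": 1, "is_Africa": 0, "Homegrown": 1},
    {"is_Europe": 0, "is_Africa": 1, "Homegrown": 0},
    {"is_Europe": 0, "is_Africa": 0, "Homegrown": 0},
]

continents = [collapse_row(r, continent_map, default="Unknown") for r in rows]
homegrown = [explicit_category(r["Homegrown"], "Homegrown") for r in rows]
print(continents)
print(homegrown)
```

Rows in which several indicators are set receive the label "Multiple", so it may be worth checking for that value before mining rules.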

For continuous variables, quantile-based discretization was applied using custom Python functions. The number of quantiles was selected individually for each variable based on its empirical distribution, as assessed using distribution plots. This approach yields approximately balanced category sizes (exact balance is not guaranteed when many observations share the same value) while preserving meaningful variation in the data.

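def quantile_age_bins(series, q, prefix):
    # Split the series into up to q quantile bins; duplicate edges are dropped.
    bins = pd.qcut(series, q=q, duplicates="drop")
    # Label each bin as prefix_left_right, truncating both edges to integers.
    return bins.apply(
        lambda x: f"{prefix}_{int(x.left)}_{int(x.right)}"
    )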
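
The idea behind quantile-based binning can be sketched without pandas. The example below is a stdlib-only approximation of what pd.qcut does, not the project's code: the function name quantile_labels and the toy ages are assumptions, and duplicate bin edges are not handled.

```python
# Stdlib-only sketch of quantile-based binning (an approximation of pd.qcut,
# not the project's code). The ages are made-up toy values.
from bisect import bisect_left
from collections import Counter
from statistics import quantiles


def quantile_labels(values, q, prefix):
    """Assign each value to one of q right-closed quantile bins."""
    # method="inclusive" interpolates linearly between observed min and max.
    cuts = quantiles(values, n=q, method="inclusive")  # q - 1 inner edges
    edges = [min(values), *cuts, max(values)]
    names = [
        f"{prefix}_{int(lo)}_{int(hi)}"  # int() truncates both edges
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    # bisect_left keeps a value equal to an edge in the lower bin.
    return [names[bisect_left(cuts, v)] for v in values]


ages = [17, 19, 21, 22, 23, 24, 25, 27, 28, 30, 33, 36]
labels = quantile_labels(ages, q=4, prefix="Age")
print(Counter(labels))
```

Because the edges are truncated to integers, neighbouring labels share an endpoint, which is the same pattern visible in the interval names of the dataset (for example 21–23 followed by 23–25).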

1.2. Description of Features

The dataset is described by 11 features:

Age - discretized into five intervals: 16–21, 21–23, 23–25, 25–28, and 28–36,
Heights - discretized into four intervals: 159–178, 178–183, 183–188, and 188–206,
matches - discretized into four intervals: 0–20, 20–30, 30–37, and 37–56,
goals - discretized into three intervals: 0–1, 1–5, and 5–36,
y_card - discretized into three intervals: 0–2, 2–5, and 5–18,
Homegrown - binary categorical variable with values Homegrown and not_Homegrown,
Position - categorical variable with four playing positions: Forward, Midfielder, Defender, and Goalkeeper,
Transfer_year - categorical variable containing 15 values ranging from 2010 to 2024. Transfers occurring during the winter 2025 window were assigned the value 2024, as the year refers to the start of the season,
Continent - categorical variable with six values: Europe, Africa, South America, North America, Asia, and Oceania,
european_competitions - categorical variable indicating prior participation in European competitions, with values no_participation, champions_league, and europa_league,
Foot - variable indicating the player’s dominant foot: right, left, or both.

1.3. Environment Preparation

library(arules)
library(arulesViz)
library(ggplot2)

transfer_data <- read.csv('transfers_ar.csv')

summary(transfer_data)

##      Age              Heights            matches             goals          
##  Length:2026        Length:2026        Length:2026        Length:2026       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     y_card           Homegrown           Position         Transfer_year     
##  Length:2026        Length:2026        Length:2026        Length:2026       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##   Continent         european_competitions     Foot          
##  Length:2026        Length:2026           Length:2026       
##  Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character      Mode  :character

1.4. Transformation of Data into a Transaction Dataset

transfer_trans <- as(transfer_data, "transactions")

summary(transfer_trans)

## transactions as itemMatrix in sparse format with
##  2026 rows (elements/itemsets/transactions) and
##  53 columns (items) and a density of 0.2075472 
## 
## most frequent items:
##                             Foot=right                       Continent=Europe 
##                                   1454                                   1398 
## european_competitions=no_participation                Homegrown=not_Homegrown 
##                                   1344                                   1343 
##                        goals=goals_0_1                                (Other) 
##                                    858                                  15889 
## 
## element (itemset/transaction) length distribution:
## sizes
##   11 
## 2026 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      11      11      11      11      11      11 
## 
## includes extended item information - examples:
##          labels variables    levels
## 1 Age=Age_16_21       Age Age_16_21
## 2 Age=Age_21_23       Age Age_21_23
## 3 Age=Age_23_25       Age Age_23_25
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

inspect(transfer_trans[1:5])

##     items                                     transactionID
## [1] {Age=Age_25_28,                                        
##      Heights=Heights_178_183,                              
##      matches=matches_0_20,                                 
##      goals=goals_5_36,                                     
##      y_card=y_card_0_2,                                    
##      Homegrown=Homegrown,                                  
##      Position=Forward,                                     
##      Transfer_year=Transfer_year_2010,                     
##      Continent=Europe,                                     
##      european_competitions=no_participation,               
##      Foot=right}                                          1
## [2] {Age=Age_21_23,                                        
##      Heights=Heights_159_178,                              
##      matches=matches_20_30,                                
##      goals=goals_1_5,                                      
##      y_card=y_card_0_2,                                    
##      Homegrown=Homegrown,                                  
##      Position=Midfielder,                                  
##      Transfer_year=Transfer_year_2010,                     
##      Continent=Europe,                                     
##      european_competitions=no_participation,               
##      Foot=right}                                          2
## [3] {Age=Age_25_28,                                        
##      Heights=Heights_159_178,                              
##      matches=matches_0_20,                                 
##      goals=goals_0_1,                                      
##      y_card=y_card_0_2,                                    
##      Homegrown=not_Homegrown,                              
##      Position=Midfielder,                                  
##      Transfer_year=Transfer_year_2010,                     
##      Continent=Africa,                                     
##      european_competitions=champions_league,               
##      Foot=right}                                          3
## [4] {Age=Age_23_25,                                        
##      Heights=Heights_159_178,                              
##      matches=matches_30_37,                                
##      goals=goals_5_36,                                     
##      y_card=y_card_2_5,                                    
##      Homegrown=Homegrown,                                  
##      Position=Midfielder,                                  
##      Transfer_year=Transfer_year_2010,                     
##      Continent=Europe,                                     
##      european_competitions=no_participation,               
##      Foot=right}                                          4
## [5] {Age=Age_23_25,                                        
##      Heights=Heights_183_188,                              
##      matches=matches_0_20,                                 
##      goals=goals_0_1,                                      
##      y_card=y_card_0_2,                                    
##      Homegrown=Homegrown,                                  
##      Position=Defender,                                    
##      Transfer_year=Transfer_year_2010,                     
##      Continent=Africa,                                     
##      european_competitions=no_participation,               
##      Foot=right}                                          5
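
The density reported by summary(transfer_trans) is the share of ones in the binary transaction-by-item matrix. As a rough check, the sketch below computes it for two toy transactions abbreviated from the output above and then for the full dataset, where every transaction holds exactly 11 of the 53 items. The helper name density is hypothetical.

```python
# Sketch: density of a binary transaction-by-item matrix, computed from
# sets of "variable=level" items. The toy transactions are abbreviated
# from the inspect() output; the helper name is hypothetical.

def density(transactions):
    """Share of ones in the transactions x items incidence matrix."""
    items = set().union(*transactions)
    ones = sum(len(t) for t in transactions)
    return ones / (len(transactions) * len(items))


toy = [
    {"Age=Age_25_28", "Position=Forward", "Foot=right"},
    {"Age=Age_21_23", "Position=Midfielder", "Foot=right"},
]
print(density(toy))  # 6 ones in a 2 x 5 matrix

# Full dataset: each transaction contains 11 items out of 53 possible ones.
full_density = 11 / 53
print(round(full_density, 7))
```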

1.5. Item Frequency Analysis

Before applying the association rule mining algorithm, it is useful to examine an item frequency plot in order to identify which variables occur most frequently in the dataset.

itemFrequencyPlot(transfer_trans, topN = 10, col = "skyblue", xlab = "Characteristics", ylab = "Frequency", main = "Item Frequency Distribution in the Transfer Dataset")

The item frequency distribution indicates that several characteristics occur in a large proportion of transfer transactions. The most frequent items include Foot=right, Continent=Europe, european_competitions=no_participation, and Homegrown=not_Homegrown, each appearing in more than 60% of observations. These items represent structural features of the dataset rather than discriminative attributes and reflect general characteristics of the Premier League transfer market.

Moderately frequent items, such as goals=0–1 and y_card=0–2, suggest that a substantial share of transferred players recorded relatively low attacking output and disciplinary involvement in the season preceding the transfer. In contrast, positional categories such as Position=Forward and Position=Defender exhibit lower frequencies, indicating greater heterogeneity across playing roles.
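
The shares quoted above can be recomputed from the absolute counts printed by summary(transfer_trans). The following sketch does so for the five most frequent items only; the variable names are illustrative.

```python
# Sketch: relative item frequencies recomputed from the absolute counts
# printed by summary(transfer_trans). Only the five listed items are used.
N_TRANSACTIONS = 2026

counts = {
    "Foot=right": 1454,
    "Continent=Europe": 1398,
    "european_competitions=no_participation": 1344,
    "Homegrown=not_Homegrown": 1343,
    "goals=goals_0_1": 858,
}

rel_freq = {item: n / N_TRANSACTIONS for item, n in counts.items()}
for item, share in sorted(rel_freq.items(), key=lambda kv: -kv[1]):
    print(f"{item:<40} {share:.3f}")

# Items present in more than 60% of transactions.
above_60 = [item for item, share in rel_freq.items() if share > 0.6]
print(above_60)
```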

2. Application of the Apriori Algorithm

2.1. Applying the Apriori Algorithm

After transforming the dataset into a transaction format, the Apriori algorithm was applied to discover association rules among player and transfer characteristics. To keep the extracted rules both well supported and interpretable, minimum thresholds for support and confidence were specified. A minimum support of 0.1 was selected to exclude infrequent patterns and reduce the risk of identifying spurious associations, while a minimum confidence of 0.6 was used to retain rules with a sufficient level of reliability. Additionally, a minimum rule length of two items (minlen = 2) was imposed to exclude trivial rules with an empty antecedent, so that every rule has at least one item on the left-hand side and one on the right-hand side.

set.seed(123)
rules1 <- apriori(transfer_trans, parameter = list(support = 0.1, confidence = 0.6, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 202 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[53 item(s), 2026 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [416 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules1

## set of 416 rules

The algorithm identified 416 association rules that satisfied the specified criteria.
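
For readers less familiar with the quality measures behind these thresholds, the sketch below computes support, confidence, and lift for a single rule on a handful of made-up transactions. It illustrates the definitions only and is not how arules is implemented; the item names and the helper functions are assumptions.

```python
# Sketch of the three rule-quality measures on toy transactions.
# Item names and data are made up; this is not how arules is implemented.
import math


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence, lift) of the rule lhs => rhs."""
    supp = support(lhs | rhs, transactions)
    conf = supp / support(lhs, transactions)
    lift = conf / support(rhs, transactions)
    return supp, conf, lift


transactions = [
    {"Homegrown", "Europe", "right"},
    {"Homegrown", "Europe", "left"},
    {"not_Homegrown", "Europe", "right"},
    {"not_Homegrown", "Africa", "right"},
    {"Homegrown", "Europe", "right"},
]

supp, conf, lift = rule_metrics({"Homegrown"}, {"Europe"}, transactions)
print(supp, conf, lift)

# With support = 0.1 and 2026 transactions, 0.1 * 2026 = 202.6; truncating
# gives the 202 shown as "Absolute minimum support count" in the log.
min_count = math.floor(0.1 * 2026)
print(min_count)
```

A lift of 1 would mean the consequent is no more likely given the antecedent than it is overall; values above 1 indicate a positive association.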

2.2. Analysis of Apriori Algorithm Results

rules.by.conf<-sort(rules1, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))

##     lhs                                          rhs                  support confidence  coverage     lift count
## [1] {goals=goals_0_1,                                                                                            
##      Homegrown=Homegrown,                                                                                        
##      european_competitions=no_participation}  => {Continent=Europe} 0.1140178  0.9023438 0.1263574 1.307688   231
## [2] {goals=goals_0_1,                                                                                            
##      Homegrown=Homegrown}                     => {Continent=Europe} 0.1426456  0.8975155 0.1589339 1.300691   289
## [3] {goals=goals_0_1,                                                                                            
##      Homegrown=Homegrown,                                                                                        
##      Foot=right}                              => {Continent=Europe} 0.1071076  0.8966942 0.1194472 1.299501   217
## [4] {Homegrown=Homegrown,                                                                                        
##      european_competitions=no_participation}  => {Continent=Europe} 0.2453110  0.8906810 0.2754195 1.290787   497
## [5] {Homegrown=Homegrown}                     => {Continent=Europe} 0.2976308  0.8828697 0.3371175 1.279466   603
## [6] {Homegrown=Homegrown,                                                                                        
##      european_competitions=no_participation,                                                                     
##      Foot=right}                              => {Continent=Europe} 0.1816387  0.8720379 0.2082922 1.263769   368

The rules with the highest confidence predominantly identify associations leading to the outcome Continent = Europe. In particular, players who are homegrown, recorded low goal counts (0–1 goals), and did not participate in European competitions exhibit a very high probability of being European. Confidence values between roughly 0.87 and 0.90 indicate that these antecedent conditions are strongly associated with the European continent category within the dataset.

However, the corresponding lift values, which range from approximately 1.26 to 1.31, suggest that these associations are only moderately stronger than the baseline probability of a player being European. This reflects the structural composition of the Premier League transfer market, where European players constitute the majority, especially among homegrown transfers. Consequently, these rules should be interpreted as descriptive demographic patterns rather than as unexpected or causal relationships.
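
The relationship between confidence, baseline frequency, and lift can be verified for rule [1] using only numbers that appear in the output above. This is a back-of-the-envelope sketch; small deviations from the printed values are expected because the confidence is taken in its rounded form.

```python
# Sketch: recompute the measures of rule [1] from figures shown above.
n = 2026                # number of transactions
count_rule = 231        # transactions matching antecedent and consequent
count_rhs = 1398        # Continent=Europe, from summary(transfer_trans)
confidence = 0.9023438  # as printed for rule [1]

support = count_rule / n
coverage = support / confidence      # support of the antecedent alone
lift = confidence / (count_rhs / n)  # confidence relative to the baseline
print(round(support, 6), round(coverage, 6), round(lift, 5))
```

The baseline share of European players is 1398 / 2026, about 0.69, which is why a confidence of 0.90 translates into a lift of only about 1.31.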

Overall, the confidence-ranked rules highlight dominant structural characteristics of the dataset and provide insight into how eligibility and performance-related attributes co-occur with player origin. While highly reliable, these rules primarily capture background regularities and therefore complement, rather than replace, the more discriminative patterns identified through lift-based rule analysis.

inspect(sort(rules1, by = "lift", decreasing = TRUE)[1:5])

##     lhs                                          rhs                      support confidence  coverage     lift count
## [1] {goals=goals_0_1,                                                                                                
##      y_card=y_card_0_2,                                                                                              
##      european_competitions=no_participation}  => {matches=matches_0_20} 0.1169793  0.7053571 0.1658440 2.686191   237
## [2] {goals=goals_0_1,                                                                                                
##      y_card=y_card_0_2}                       => {matches=matches_0_20} 0.1594274  0.6785714 0.2349457 2.584184   323
## [3] {goals=goals_0_1,                                                                                                
##      y_card=y_card_0_2,                                                                                              
##      Continent=Europe}                        => {matches=matches_0_20} 0.1179664  0.6657382 0.1771964 2.535311   239
## [4] {goals=goals_0_1,                                                                                                
##      y_card=y_card_0_2,                                                                                              
##      Foot=right}                              => {matches=matches_0_20} 0.1135242  0.6647399 0.1707799 2.531509   230
## [5] {goals=goals_5_36,                                                                                               
##      european_competitions=no_participation,                                                                         
##      Foot=right}                              => {Position=Forward}     0.1041461  0.7992424 0.1303060 2.206083   211

The rules ranked highest by lift reveal more distinctive and informative associations than those ranked solely by confidence. In contrast to the confidence-based rules, the consequents in this set are not dominated by a single frequent category, but instead relate primarily to low match participation (matches = 0–20) and, in one case, playing position.

The strongest lift values, exceeding 2.5, indicate that players who recorded low goal counts (0–1 goals), few yellow cards (0–2), and no participation in European competitions are about 2.7 times as likely as the average transferred player to fall into the lowest match category (matches = 0–20) in the season preceding the transfer. This pattern consistently appears across multiple rules with slight variations in the antecedent, such as the inclusion of continent or dominant foot, suggesting a robust association between low goal and card counts and limited playing time.
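
Since lift equals confidence divided by the overall frequency of the consequent, that baseline frequency can be backed out from the printed values. The sketch below does this for the five lift-ranked rules; the derived baselines are implied by the output and are not reported directly in it.

```python
# Sketch: back out the baseline share of each consequent from the reported
# confidence and lift (lift = confidence / support of the consequent).
top_lift_rules = [
    # (consequent, confidence, lift) copied from the lift-ranked output
    ("matches=matches_0_20", 0.7053571, 2.686191),
    ("matches=matches_0_20", 0.6785714, 2.584184),
    ("matches=matches_0_20", 0.6657382, 2.535311),
    ("matches=matches_0_20", 0.6647399, 2.531509),
    ("Position=Forward", 0.7992424, 2.206083),
]

baselines = [(rhs, conf / lift) for rhs, conf, lift in top_lift_rules]
for rhs, base in baselines:
    print(f"{rhs:<22} implied baseline = {base:.3f}")
```

Under these figures, roughly 26% of all transfers fall into the matches 0–20 category, whereas the share rises to about 71% among transfers matching the antecedent of the first rule.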

Additionally, the fifth rule indicates that players with high goal output (5–36 goals), no European competition participation, and a right dominant foot are more likely to occupy the forward position. With a lift value above 2.2 and a confidence close to 0.8, this rule highlights a more role-specific pattern, linking performance characteristics to positional classification rather than general demographic attributes.

Overall, the lift-ranked rules capture meaningful structural relationships within the dataset, particularly those related to player involvement intensity and positional roles. Compared to the confidence-ranked rules, these associations are less influenced by highly prevalent background characteristics and therefore provide more informative insights into typical transfer profiles. Consequently, lift-based evaluation proves essential for identifying non-trivial and substantively relevant patterns in the football transfer market.

2.3. Visualization of Association Rule Results

Visualization techniques are essential for interpreting the results of association rule mining, particularly when a large number of rules are generated. Graphical representations allow for an effective exploration of rule quality measures and facilitate the identification of structural patterns that may not be immediately evident from tabular summaries alone. In this section, two complementary visualization approaches are employed to examine the discovered association rules.

plot(rules1, engine = "ggplot2", main = "Support–Confidence Scatter Plot of Discovered Association Rules")+
  theme_minimal()+
  geom_point(size=3)+
  scale_color_gradient(low = "green", high = "red")

The resulting figure presents a scatter plot of the discovered association rules, with support plotted on the horizontal axis and confidence on the vertical axis. Each point represents a single association rule, while the color (from green for low values to red for high values) corresponds to the lift value, indicating the strength of the association relative to the baseline frequency of the consequent.

The plot reveals a dense concentration of rules at lower support levels, primarily between 0.1 and 0.2, combined with moderate to high confidence values ranging from approximately 0.6 to 0.8. This distribution reflects the exploratory nature of the analysis, where infrequent but reliable patterns are retained by design. Rules exhibiting higher lift values appear less frequently and are scattered across the mid-support range, suggesting that stronger associations tend to involve more specific combinations of characteristics rather than dominant background features.

Overall, the visualization highlights the trade-off between rule frequency and reliability and reinforces the importance of considering multiple quality measures simultaneously when evaluating association rules. The scatter plot thus serves as an effective diagnostic tool for identifying potentially informative rules for further interpretation.

plot(rules1, method="paracoord", control=list(reorder=TRUE), main = "Parallel Coordinates Visualization of Discovered Association Rules")

The parallel coordinates plot provides a detailed view of the internal structure of the discovered association rules by visualizing the relationships between antecedent components and their corresponding consequents. Each line in the plot represents a single association rule, with the vertical axes corresponding to individual items appearing in the left-hand side of the rules and the final axis indicating the consequent.

Summary

This study applied association rule mining to a dataset of Premier League football transfers from the 2010–2025 seasons in order to identify frequent patterns among player characteristics and transfer-related attributes. The analysis was conducted using an exploratory, descriptive approach without making causal claims.

Prior to applying the Apriori algorithm, the data were extensively preprocessed using Python. Continuous variables were discretized, binary indicators were transformed into categorical attributes, and the dataset was converted into transaction format in R. The final dataset consisted of 2,026 transfers described by 11 categorical features.

Using predefined support and confidence thresholds, the Apriori algorithm identified 416 association rules. Evaluation based on support, confidence, and lift revealed that confidence-ranked rules largely reflected dominant structural characteristics of the dataset, whereas lift-ranked rules provided more informative insights related to player involvement and positional roles. Visualization techniques supported the interpretation of these patterns by illustrating the distribution and structure of the discovered rules.

Overall, the results demonstrate that association rule mining can serve as a useful exploratory tool for analyzing football transfer data, provided that appropriate preprocessing and careful interpretation of rule quality measures are applied.

Results of Association Rule Mining on Football Transfer Data

Florian Ficek

2026-01-31