Introduction

This report was produced for the talent detection task of the module. The goal is to work with a dataset of football players from several competitions, focus on midfielders, and identify young players who show strong creative and playmaking qualities.

The idea behind this is quite simple – instead of just looking at goals or assists, we want to build a more complete picture of a midfielder’s ability to create, progress the ball and make things happen in attack. At the same time we also calculate a separate defensive score, so we don’t mix everything into one number and lose the detail.

At the end of the report, we pick Luka Modrić as a reference player – someone who represents a very specific style of midfield play – and use a similarity algorithm to find young players (under 21) who show a comparable profile. The idea is to answer a question like: who plays like Modrić but is still 19 or 20 years old?


1. Reading the File

1.1 Libraries

First we load the libraries needed for this analysis. tidyverse covers most of what we need – data manipulation with dplyr and visualization with ggplot2. fmsb is used later for radar charts.

# Install any missing packages (install.packages() runs only if a package is absent)
for (pkg in c("tidyverse", "fmsb")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(tidyverse)
library(fmsb)

1.2 Loading the data

The file is a CSV that uses semicolons as separators instead of commas, and a dot as the decimal mark (the read.csv default). The encoding is UTF-8, which matters because some player names contain non-ASCII characters (like Modrić).

df_raw <- read.csv(
  "FBREF_BigPlayers_2425.csv",
  sep       = ";",
  header    = TRUE,
  encoding  = "UTF-8",
  stringsAsFactors = FALSE
)
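
As a minimal, self-contained check of these reading options, the sketch below writes a tiny semicolon-separated file to a temporary location and reads it back. The file contents and the object names (tmp, toy_lines, toy) are made up for illustration and are not part of the FBref data.

```r
# Toy file (hypothetical contents): semicolon separator, dot as decimal mark
tmp <- tempfile(fileext = ".csv")
toy_lines <- c("Player;Age;xA.90",
               "Luka Modri\u0107;38;0.21",
               "Adam Hlo\u017eek;22;0.12")

# Write the strings as raw UTF-8 bytes, independent of the local encoding
writeLines(enc2utf8(toy_lines), tmp, useBytes = TRUE)

# Same reading options as for the real file, with the decimal mark made explicit
toy <- read.csv(tmp, sep = ";", dec = ".",
                encoding = "UTF-8", stringsAsFactors = FALSE)

str(toy)      # Age and xA.90 should come back as numeric columns
unlink(tmp)   # remove the temporary file
```

If the separator were wrong (for example the default comma), each line would land in a single column, which is a quick way to spot the problem.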

1.3 First look at the data

Let’s check the basic dimensions and what the first few rows look like.

cat("Rows:", nrow(df_raw), "\n")
## Rows: 3972
cat("Columns:", ncol(df_raw), "\n")
## Columns: 72
head(df_raw, 5)
##            Player         Squad Nation   Pos Age MP  Min Gls G.PK Ast  xG xAG
## 1 Abdoulie Ceesay     St. Pauli    GAM    FW  20  7   60   0    0   0 0.0 0.0
## 2      Adam Aznou Bayern Munich    MAR    DF  18  2   17   0    0   0 0.0 0.0
## 3   Adam Dźwigała     St. Pauli    POL DF,MF  28 16  373   0    0   0 0.4 0.0
## 4     Adam Hložek    Hoffenheim    CZE FW,MF  22 27 1871   8    8   4 5.6 3.5
## 5     Adrian Beck    Heidenheim    GER MF,FW  27 32 1598   4    4   1 3.3 1.2
##   Gls.90 G.PK.90 Ast.90 xG.90 xAG.90 Competition Sh Sh.90 SoT.90 Passes.
## 1   0.00    0.00   0.00  0.00   0.03  Bundesliga  0  0.00   0.00    57.1
## 2   0.00    0.00   0.00  0.00   0.00  Bundesliga  0  0.00   0.00    76.5
## 3   0.00    0.00   0.00  0.09   0.00  Bundesliga  6  1.45   0.00    81.6
## 4   0.38    0.38   0.19  0.27   0.17  Bundesliga 59  2.84   1.15    68.5
## 5   0.23    0.23   0.06  0.18   0.07  Bundesliga 39  2.20   0.73    80.1
##   ShortPasses. MediumPasses. LongPasses. A.xAG TklW.90 Blocks.90 Int.90
## 1         70.0           0.0         0.0   0.0    0.43      0.29   0.00
## 2         90.0          66.7         0.0   0.0    0.00      0.00   0.00
## 3         91.0          91.4        40.0   0.0    0.62      0.38   0.25
## 4         73.8          75.0        54.8   0.5    0.37      0.48   0.15
## 5         83.3          87.3        72.4  -0.2    0.53      0.97   0.28
##   Tkl.Int.90 Clr.90 Touches.90 Dribbles.90 Dribbles. SCA.90 GCA.90 Aerial.
## 1       0.43   0.14       3.71        0.00       0.0   1.48   0.00     0.0
## 2       0.00   0.50       9.00        0.00       0.0   0.00   0.00     0.0
## 3       1.50   1.75      17.50        0.06      50.0   1.21   0.00    76.5
## 4       0.96   0.74      28.19        1.22      51.6   2.74   0.48    50.6
## 5       1.41   0.53      30.06        1.25      53.3   3.16   0.23    59.2
##   Points.90  xGA OG PSxG PSxG.SoT PSxG... GA GA.90 Save. CS. SoTA.90 SoTA.GA
## 1      0.43  1.4  0    0        0       0  0     0     0   0       0       0
## 2      3.00  0.8  0    0        0       0  0     0     0   0       0       0
## 3      0.88  9.0  0    0        0       0  0     0     0   0       0       0
## 4      0.85 34.4  0    0        0       0  0     0     0   0       0       0
## 5      0.91 29.2  0    0        0       0  0     0     0   0       0       0
##   SoT.G G.xG PassesCompleted.90 PassesAttempted.90 ShortPassesCompleted.90
## 1  0.00  0.0               1.14               2.00                    1.00
## 2  0.00  0.0               6.50               8.50                    4.50
## 3  0.00 -0.4              10.81              13.25                    4.44
## 4  3.00  2.4              12.89              18.81                    6.56
## 5  3.25  0.7              18.16              22.66                    8.56
##   MediumPassesCompleted.90 LongPassesCompleted.90 TotDistPasses.90
## 1                     0.00                   0.00            10.00
## 2                     2.00                   0.00            89.00
## 3                     5.31                   0.75           183.44
## 4                     4.67                   0.85           198.19
## 5                     7.28                   1.72           309.41
##   PrgDistPasses.90 xA.90 KP.90 FinalThirdPasses.90 PPA.90 CrsPA.90
## 1             0.29  0.00  0.14                0.00   0.00     0.00
## 2            22.00  0.00  0.00                0.50   0.00     0.00
## 3            75.50  0.01  0.00                0.69   0.06     0.00
## 4            48.19  0.12  0.74                1.37   0.70     0.15
## 5            70.56  0.06  0.66                1.66   0.56     0.12
##   PassesProgressive.90 xGA.90 Recov.90 Fls.90 Fld.90 AerialW.90   xGD xGD.90
## 1                 0.00   0.20     0.57   0.29   0.43       0.00  -1.4  -0.20
## 2                 0.50   0.40     0.50   0.00   0.00       0.00  -0.8  -0.40
## 3                 0.88   0.56     0.81   0.31   0.19       0.81  -8.6  -0.47
## 4                 2.52   1.27     2.44   0.56   1.11       1.48 -28.8  -1.00
## 5                 2.16   0.91     4.09   0.50   0.44       0.91 -25.9  -0.73
##   MP_Squad
## 1       34
## 2       34
## 3       34
## 4       34
## 5       34
glimpse(df_raw)
## Rows: 3,972
## Columns: 72
## $ Player                   <chr> "Abdoulie Ceesay", "Adam Aznou", "Adam Dźwiga…
## $ Squad                    <chr> "St. Pauli", "Bayern Munich", "St. Pauli", "H…
## $ Nation                   <chr> "GAM", "MAR", "POL", "CZE", "GER", "FRA", "ES…
## $ Pos                      <chr> "FW", "DF", "DF,MF", "FW,MF", "MF,FW", "MF,FW…
## $ Age                      <int> 20, 18, 28, 22, 27, 31, 27, 20, 25, 33, 27, 2…
## $ MP                       <dbl> 7, 2, 16, 27, 32, 30, 28, 21, 19, 2, 34, 23, …
## $ Min                      <int> 60, 17, 373, 1871, 1598, 1902, 1455, 1451, 12…
## $ Gls                      <int> 0, 0, 0, 8, 4, 11, 3, 1, 7, 0, 0, 0, 0, 9, 0,…
## $ G.PK                     <int> 0, 0, 0, 8, 4, 10, 3, 1, 7, 0, 0, 0, 0, 9, 0,…
## $ Ast                      <int> 0, 0, 0, 4, 1, 4, 4, 0, 2, 0, 1, 1, 0, 2, 0, …
## $ xG                       <dbl> 0.0, 0.0, 0.4, 5.6, 3.3, 7.4, 1.0, 0.6, 4.1, …
## $ xAG                      <dbl> 0.0, 0.0, 0.0, 3.5, 1.2, 3.1, 4.9, 0.2, 1.3, …
## $ Gls.90                   <dbl> 0.00, 0.00, 0.00, 0.38, 0.23, 0.52, 0.19, 0.0…
## $ G.PK.90                  <dbl> 0.00, 0.00, 0.00, 0.38, 0.23, 0.47, 0.19, 0.0…
## $ Ast.90                   <dbl> 0.00, 0.00, 0.00, 0.19, 0.06, 0.19, 0.25, 0.0…
## $ xG.90                    <dbl> 0.00, 0.00, 0.09, 0.27, 0.18, 0.35, 0.06, 0.0…
## $ xAG.90                   <dbl> 0.03, 0.00, 0.00, 0.17, 0.07, 0.15, 0.30, 0.0…
## $ Competition              <chr> "Bundesliga", "Bundesliga", "Bundesliga", "Bu…
## $ Sh                       <dbl> 0, 0, 6, 59, 39, 41, 15, 9, 27, 0, 0, 12, 0, …
## $ Sh.90                    <dbl> 0.00, 0.00, 1.45, 2.84, 2.20, 1.94, 0.93, 0.5…
## $ SoT.90                   <dbl> 0.00, 0.00, 0.00, 1.15, 0.73, 0.57, 0.31, 0.2…
## $ Passes.                  <dbl> 57.1, 76.5, 81.6, 68.5, 80.1, 72.1, 87.2, 93.…
## $ ShortPasses.             <dbl> 70.0, 90.0, 91.0, 73.8, 83.3, 83.1, 94.3, 96.…
## $ MediumPasses.            <dbl> 0.0, 66.7, 91.4, 75.0, 87.3, 74.2, 89.8, 95.1…
## $ LongPasses.              <dbl> 0.0, 0.0, 40.0, 54.8, 72.4, 63.2, 65.8, 73.1,…
## $ A.xAG                    <dbl> 0.0, 0.0, 0.0, 0.5, -0.2, 0.9, -0.9, -0.2, 0.…
## $ TklW.90                  <dbl> 0.43, 0.00, 0.62, 0.37, 0.53, 0.10, 0.25, 1.0…
## $ Blocks.90                <dbl> 0.29, 0.00, 0.38, 0.48, 0.97, 0.37, 0.43, 0.8…
## $ Int.90                   <dbl> 0.00, 0.00, 0.25, 0.15, 0.28, 0.07, 0.39, 0.2…
## $ Tkl.Int.90               <dbl> 0.43, 0.00, 1.50, 0.96, 1.41, 0.30, 0.86, 2.0…
## $ Clr.90                   <dbl> 0.14, 0.50, 1.75, 0.74, 0.53, 0.23, 0.86, 0.6…
## $ Touches.90               <dbl> 3.71, 9.00, 17.50, 28.19, 30.06, 32.63, 53.68…
## $ Dribbles.90              <dbl> 0.00, 0.00, 0.06, 1.22, 1.25, 0.57, 0.14, 0.2…
## $ Dribbles.                <dbl> 0.0, 0.0, 50.0, 51.6, 53.3, 39.5, 80.0, 100.0…
## $ SCA.90                   <dbl> 1.48, 0.00, 1.21, 2.74, 3.16, 3.41, 3.71, 1.8…
## $ GCA.90                   <dbl> 0.00, 0.00, 0.00, 0.48, 0.23, 0.66, 0.62, 0.1…
## $ Aerial.                  <dbl> 0.0, 0.0, 76.5, 50.6, 59.2, 31.9, 27.3, 54.5,…
## $ Points.90                <dbl> 0.43, 3.00, 0.88, 0.85, 0.91, 1.17, 2.11, 2.3…
## $ xGA                      <dbl> 1.4, 0.8, 9.0, 34.4, 29.2, 35.7, 21.2, 12.6, …
## $ OG                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ PSxG                     <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ PSxG.SoT                 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ PSxG...                  <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ GA                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 53, 0, 13, 0, 0…
## $ GA.90                    <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ Save.                    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ CS.                      <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ SoTA.90                  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ SoTA.GA                  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ SoT.G                    <dbl> 0.00, 0.00, 0.00, 3.00, 3.25, 1.20, 1.67, 4.0…
## $ G.xG                     <dbl> 0.0, 0.0, -0.4, 2.4, 0.7, 3.6, 2.0, 0.4, 2.9,…
## $ PassesCompleted.90       <dbl> 1.14, 6.50, 10.81, 12.89, 18.16, 19.10, 43.29…
## $ PassesAttempted.90       <dbl> 2.00, 8.50, 13.25, 18.81, 22.66, 26.50, 49.64…
## $ ShortPassesCompleted.90  <dbl> 1.00, 4.50, 4.44, 6.56, 8.56, 10.30, 22.86, 3…
## $ MediumPassesCompleted.90 <dbl> 0.00, 2.00, 5.31, 4.67, 7.28, 6.33, 15.14, 26…
## $ LongPassesCompleted.90   <dbl> 0.00, 0.00, 0.75, 0.85, 1.72, 1.60, 4.61, 3.6…
## $ TotDistPasses.90         <dbl> 10.00, 89.00, 183.44, 198.19, 309.41, 303.93,…
## $ PrgDistPasses.90         <dbl> 0.29, 22.00, 75.50, 48.19, 70.56, 100.97, 214…
## $ xA.90                    <dbl> 0.00, 0.00, 0.01, 0.12, 0.06, 0.11, 0.14, 0.0…
## $ KP.90                    <dbl> 0.14, 0.00, 0.00, 0.74, 0.66, 0.83, 1.14, 0.3…
## $ FinalThirdPasses.90      <dbl> 0.00, 0.50, 0.69, 1.37, 1.66, 2.67, 4.82, 8.0…
## $ PPA.90                   <dbl> 0.00, 0.00, 0.06, 0.70, 0.56, 1.17, 0.43, 0.6…
## $ CrsPA.90                 <dbl> 0.00, 0.00, 0.00, 0.15, 0.12, 0.27, 0.14, 0.0…
## $ PassesProgressive.90     <dbl> 0.00, 0.50, 0.88, 2.52, 2.16, 3.87, 4.29, 6.1…
## $ xGA.90                   <dbl> 0.20, 0.40, 0.56, 1.27, 0.91, 1.19, 0.76, 0.6…
## $ Recov.90                 <dbl> 0.57, 0.50, 0.81, 2.44, 4.09, 2.20, 2.46, 3.4…
## $ Fls.90                   <dbl> 0.29, 0.00, 0.31, 0.56, 0.50, 0.57, 0.29, 0.8…
## $ Fld.90                   <dbl> 0.43, 0.00, 0.19, 1.11, 0.44, 1.10, 0.32, 0.4…
## $ AerialW.90               <dbl> 0.00, 0.00, 0.81, 1.48, 0.91, 0.50, 0.11, 0.5…
## $ xGD                      <dbl> -1.4, -0.8, -8.6, -28.8, -25.9, -28.3, -20.2,…
## $ xGD.90                   <dbl> -0.20, -0.40, -0.47, -1.00, -0.73, -0.84, -0.…
## $ MP_Squad                 <int> 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 3…

A quick look at the column names to understand what variables are available:

colnames(df_raw)
##  [1] "Player"                   "Squad"                   
##  [3] "Nation"                   "Pos"                     
##  [5] "Age"                      "MP"                      
##  [7] "Min"                      "Gls"                     
##  [9] "G.PK"                     "Ast"                     
## [11] "xG"                       "xAG"                     
## [13] "Gls.90"                   "G.PK.90"                 
## [15] "Ast.90"                   "xG.90"                   
## [17] "xAG.90"                   "Competition"             
## [19] "Sh"                       "Sh.90"                   
## [21] "SoT.90"                   "Passes."                 
## [23] "ShortPasses."             "MediumPasses."           
## [25] "LongPasses."              "A.xAG"                   
## [27] "TklW.90"                  "Blocks.90"               
## [29] "Int.90"                   "Tkl.Int.90"              
## [31] "Clr.90"                   "Touches.90"              
## [33] "Dribbles.90"              "Dribbles."               
## [35] "SCA.90"                   "GCA.90"                  
## [37] "Aerial."                  "Points.90"               
## [39] "xGA"                      "OG"                      
## [41] "PSxG"                     "PSxG.SoT"                
## [43] "PSxG..."                  "GA"                      
## [45] "GA.90"                    "Save."                   
## [47] "CS."                      "SoTA.90"                 
## [49] "SoTA.GA"                  "SoT.G"                   
## [51] "G.xG"                     "PassesCompleted.90"      
## [53] "PassesAttempted.90"       "ShortPassesCompleted.90" 
## [55] "MediumPassesCompleted.90" "LongPassesCompleted.90"  
## [57] "TotDistPasses.90"         "PrgDistPasses.90"        
## [59] "xA.90"                    "KP.90"                   
## [61] "FinalThirdPasses.90"      "PPA.90"                  
## [63] "CrsPA.90"                 "PassesProgressive.90"    
## [65] "xGA.90"                   "Recov.90"                
## [67] "Fls.90"                   "Fld.90"                  
## [69] "AerialW.90"               "xGD"                     
## [71] "xGD.90"                   "MP_Squad"

2. Description and First Transformations

2.1 What is in the dataset

The dataset comes from FBref and covers the 2024/25 season. It includes players from 7 competitions:

df_raw %>%
  count(Competition, sort = TRUE)
##      Competition   n
## 1        Serie A 634
## 2        La Liga 601
## 3  Primeira Liga 584
## 4 Premier League 574
## 5        Ligue 1 553
## 6     Eredivisie 534
## 7     Bundesliga 492

Note that pooling players from leagues of different strength (for example La Liga and Ligue 1) means the Z-scores are calculated across all of them together. A midfielder with good progressive pass numbers in a weaker league therefore receives the same Z-score as one producing those numbers in a stronger league. This is a known limitation. Ideally we would apply some kind of competition difficulty adjustment, but for this study we keep it simple and normalise across the full sample.
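
To illustrate what one simple alternative could look like, the sketch below computes Z-scores within each competition instead of across the pooled sample. This is not the approach used in this report, and it is not a true difficulty adjustment (it only compares each player with peers in the same league). The toy data and the column names z_within and z_pooled are assumptions made for the example.

```r
library(dplyr)

# Toy data (hypothetical values): two leagues with different typical levels
toy <- data.frame(
  Player      = c("A", "B", "C", "D", "E", "F"),
  Competition = c("League X", "League X", "League X",
                  "League Y", "League Y", "League Y"),
  PassesProgressive.90 = c(4, 6, 8, 2, 3, 4)
)

toy <- toy %>%
  group_by(Competition) %>%
  # Z-score relative to players in the same competition only
  mutate(z_within = (PassesProgressive.90 - mean(PassesProgressive.90)) /
                      sd(PassesProgressive.90)) %>%
  ungroup() %>%
  # Z-score across the pooled sample, as done in this report
  mutate(z_pooled = scale(PassesProgressive.90)[, 1])

toy
```

In this toy case the best player of each league gets the same within-league Z-score, while the pooled Z-score favours the league with the higher raw numbers.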

The positions available in the data:

df_raw %>%
  count(Pos, sort = TRUE)
##      Pos    n
## 1     DF 1161
## 2     MF  804
## 3     FW  542
## 4  FW,MF  474
## 5  MF,FW  327
## 6     GK  294
## 7  DF,MF  155
## 8  MF,DF  114
## 9  DF,FW   62
## 10 FW,DF   39

2.2 Age column

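FBref exports can carry age as a decimal (for example 20.187 for a player who is 20 years and some days old). In this file the Age column is already read as whole years (it is an integer column in the glimpse() output above), so floor() changes nothing here. We still create Age_int as a safeguard, so that filtering and display always work on a clean whole-year age.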

df_raw <- df_raw %>%
  mutate(Age_int = floor(Age))

# Quick check
df_raw %>%
  select(Player, Age, Age_int) %>%
  head(10)
##                    Player Age Age_int
## 1         Abdoulie Ceesay  20      20
## 2              Adam Aznou  18      18
## 3           Adam Dźwigała  28      28
## 4             Adam Hložek  22      22
## 5             Adrian Beck  27      27
## 6           Alassane Pléa  31      31
## 7            Aleix García  27      27
## 8     Aleksandar Pavlovic  20      20
## 9  Alexander Bernhardsson  25      25
## 10        Alexander Meyer  33      33
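
For completeness, here is a small sketch of what floor() would do if ages did arrive with a fractional part. The values are invented for the example.

```r
# Hypothetical fractional ages (years plus a fraction of a year)
age_decimal <- c(20.187, 19.999, 21.000)

# floor() drops the fractional part and never rounds up
age_int <- floor(age_decimal)
age_int                  # 20 19 21

# The U21 filter used later in the report keeps ages of 20 or below
is_u21 <- age_int <= 20
is_u21                   # TRUE TRUE FALSE
```

Using round() instead would turn 19.999 into 20, which still passes the filter here but would misreport the age, so floor() is the safer choice.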

2.3 Missing values overview

Before filtering, let’s check how many NAs exist in the key metrics we plan to use.

key_metrics <- c(
  "xA.90", "KP.90", "FinalThirdPasses.90",
  "PPA.90", "PassesProgressive.90", "PrgDistPasses.90",
  "SCA.90", "Passes.",
  "TklW.90", "Int.90", "Tkl.Int.90", "Recov.90", "Blocks.90"
)

df_raw %>%
  select(all_of(key_metrics)) %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "metric", values_to = "na_count") %>%
  arrange(desc(na_count))
## # A tibble: 13 × 2
##    metric               na_count
##    <chr>                   <int>
##  1 xA.90                       0
##  2 KP.90                       0
##  3 FinalThirdPasses.90         0
##  4 PPA.90                      0
##  5 PassesProgressive.90        0
##  6 PrgDistPasses.90            0
##  7 SCA.90                      0
##  8 Passes.                     0
##  9 TklW.90                     0
## 10 Int.90                      0
## 11 Tkl.Int.90                  0
## 12 Recov.90                    0
## 13 Blocks.90                   0
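
The same count can also be obtained in base R, which is handy as a quick cross-check. The sketch below uses a made-up data frame with a few NAs so that the result is not all zeros. The object names toy and na_counts are assumptions for the example.

```r
# Toy data (hypothetical values) with some missing entries
toy <- data.frame(
  xA.90  = c(0.10, NA,   0.30),
  KP.90  = c(NA,   0.80, NA),
  SCA.90 = c(3.10, 2.00, 1.50)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))

sort(na_counts, decreasing = TRUE)
```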

3. Sample Selection

3.1 Filtering midfielders

We keep only players whose primary (first-listed) position is midfielder. In this dataset that means the positions MF, MF,FW and MF,DF. Players listed as FW,MF or DF,MF have midfield only as a secondary position and are excluded. We also apply a minimum threshold of 900 minutes so that every player has a reasonable sample of data. Finally, we filter out any rows where the age is below 15, since those would clearly be data errors (no professional midfielder plays 900+ minutes at that age).

df_mf <- df_raw %>%
  filter(Pos %in% c("MF", "MF,FW", "MF,DF")) %>%
  filter(Min >= 900) %>%
  filter(Age_int >= 15)

cat("Players in MF sample (900+ min):", nrow(df_mf), "\n")
## Players in MF sample (900+ min): 657
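
The position filter is an exact match on the full Pos string, so it only keeps the three listed values. The sketch below shows this on the position codes from section 2.1, next to a broader pattern match that would also keep players with MF as a secondary position. The broader variant is shown only for comparison and is not used in this report.

```r
# Position codes as they appear in the Pos column
pos <- c("MF", "MF,FW", "MF,DF", "FW,MF", "DF,MF", "FW", "DF", "GK")

# Exact match, as in the filter above: MF must be the first-listed position
primary_mf <- pos %in% c("MF", "MF,FW", "MF,DF")

# Alternative (not used here): MF anywhere in the string
any_mf <- grepl("MF", pos, fixed = TRUE)

data.frame(pos, primary_mf, any_mf)
```

With the exact match, FW,MF and DF,MF are dropped, while the pattern match would keep them.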

Age distribution in the filtered sample:

df_mf %>%
  count(Age_int, sort = FALSE) %>%
  ggplot(aes(x = Age_int, y = n)) +
  geom_col(fill = "#2c7bb6") +
  labs(
    title = "Age distribution – midfielders with 900+ minutes",
    x     = "Age",
    y     = "Number of players"
  ) +
  theme_minimal()

Number of U21 players (aged 20 or younger) in the sample:

df_mf %>%
  filter(Age_int <= 20) %>%
  count(Age_int)
##   Age_int  n
## 1      16  1
## 2      17  2
## 3      18  8
## 4      19 25
## 5      20 42

3.2 Selecting variables of interest

We split the metrics into two groups. The creative / playmaking metrics capture how well a midfielder creates chances and progresses the ball forward:

  • xA.90 – expected assists per 90 minutes
  • KP.90 – key passes per 90 (passes that directly lead to a shot)
  • FinalThirdPasses.90 – passes into the final third per 90
  • PPA.90 – passes into the penalty area per 90
  • PassesProgressive.90 – progressive passes per 90 (passes that move the ball substantially closer to the opponent’s goal)
  • PrgDistPasses.90 – total progressive passing distance per 90
  • SCA.90 – shot-creating actions per 90
  • Passes. – pass completion percentage

The defensive metrics capture out-of-possession work:

  • TklW.90 – tackles won per 90
  • Int.90 – interceptions per 90
  • Tkl.Int.90 – combined tackles + interceptions per 90 (this overlaps with Int.90, so interceptions carry extra weight in the defensive score)
  • Recov.90 – ball recoveries per 90
  • Blocks.90 – blocks per 90

We now reduce the dataset to only the columns we actually need – player info and the two sets of metrics.

df_mf <- df_mf %>%
  select(
    # player info
    Player, Squad, Nation, Pos, Age, Age_int, Min, Competition,

    # creative / playmaking metrics
    xA.90,
    KP.90,
    FinalThirdPasses.90,
    PPA.90,
    PassesProgressive.90,
    PrgDistPasses.90,
    SCA.90,
    Passes.,

    # defensive metrics
    TklW.90,
    Int.90,
    Tkl.Int.90,
    Recov.90,
    Blocks.90
  )

glimpse(df_mf)
## Rows: 657
## Columns: 21
## $ Player               <chr> "Adrian Beck", "Alassane Pléa", "Aleix García", "…
## $ Squad                <chr> "Heidenheim", "Gladbach", "Leverkusen", "Bayern M…
## $ Nation               <chr> "GER", "FRA", "ESP", "GER", "FRA", "GER", "MLI", …
## $ Pos                  <chr> "MF,FW", "MF,FW", "MF", "MF", "MF,FW", "MF", "MF"…
## $ Age                  <int> 27, 31, 27, 20, 26, 19, 26, 33, 25, 23, 22, 38, 2…
## $ Age_int              <dbl> 27, 31, 27, 20, 26, 19, 26, 33, 25, 23, 22, 38, 2…
## $ Min                  <int> 1598, 1902, 1455, 1451, 2109, 922, 1320, 2767, 15…
## $ Competition          <chr> "Bundesliga", "Bundesliga", "Bundesliga", "Bundes…
## $ xA.90                <dbl> 0.06, 0.11, 0.14, 0.09, 0.07, 0.05, 0.04, 0.18, 0…
## $ KP.90                <dbl> 0.66, 0.83, 1.14, 0.33, 0.75, 0.67, 0.18, 1.53, 0…
## $ FinalThirdPasses.90  <dbl> 1.66, 2.67, 4.82, 8.05, 1.36, 1.87, 2.32, 3.16, 1…
## $ PPA.90               <dbl> 0.56, 1.17, 0.43, 0.62, 0.75, 0.40, 0.46, 1.19, 0…
## $ PassesProgressive.90 <dbl> 2.16, 3.87, 4.29, 6.10, 2.18, 2.80, 2.32, 4.47, 2…
## $ PrgDistPasses.90     <dbl> 70.56, 100.97, 214.79, 255.90, 50.39, 112.80, 120…
## $ SCA.90               <dbl> 3.16, 3.41, 3.71, 1.86, 3.29, 2.05, 1.98, 3.32, 1…
## $ Passes.              <dbl> 80.1, 72.1, 87.2, 93.3, 82.9, 75.5, 80.2, 77.7, 7…
## $ TklW.90              <dbl> 0.53, 0.10, 0.25, 1.00, 0.93, 0.27, 0.61, 0.25, 1…
## $ Int.90               <dbl> 0.28, 0.07, 0.39, 0.24, 0.54, 0.40, 0.64, 0.22, 0…
## $ Tkl.Int.90           <dbl> 1.41, 0.30, 0.86, 2.05, 2.00, 1.07, 1.71, 0.69, 2…
## $ Recov.90             <dbl> 4.09, 2.20, 2.46, 3.43, 2.57, 3.87, 3.14, 4.25, 2…
## $ Blocks.90            <dbl> 0.97, 0.37, 0.43, 0.86, 0.57, 0.93, 1.50, 0.50, 1…

3.3 Handling NAs in metrics

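We remove players who have NA in any of the metric columns. The check in section 2.3 found no missing values in these columns, so this step is a safeguard and the sample size is not expected to change.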

df_mf <- df_mf %>%
  drop_na(xA.90, KP.90, FinalThirdPasses.90, PPA.90,
          PassesProgressive.90, PrgDistPasses.90, SCA.90, Passes.,
          TklW.90, Int.90, Tkl.Int.90, Recov.90, Blocks.90)

cat("Final sample size after removing NAs:", nrow(df_mf), "\n")
## Final sample size after removing NAs: 657

4. Data Processing – Score Calculation

4.1 Z-score normalization

For each metric we calculate a Z-score within the full MF sample. This puts all metrics on the same scale (mean = 0, sd = 1) so they can be combined fairly regardless of their original units.

creative_metrics  <- c("xA.90", "KP.90", "FinalThirdPasses.90",
                        "PPA.90", "PassesProgressive.90",
                        "PrgDistPasses.90", "SCA.90", "Passes.")

defensive_metrics <- c("TklW.90", "Int.90", "Tkl.Int.90",
                        "Recov.90", "Blocks.90")

df_mf <- df_mf %>%
  mutate(
    across(all_of(creative_metrics),
           ~ scale(.)[,1],
           .names = "z_{.col}"),
    across(all_of(defensive_metrics),
           ~ scale(.)[,1],
           .names = "z_{.col}")
  )
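
As a small worked check of what scale(.)[,1] does, the sketch below compares it with the textbook formula (value minus mean, divided by the sample standard deviation) on a made-up vector.

```r
# Toy metric values (hypothetical)
x <- c(2, 4, 6, 8)

# Manual Z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; [, 1] extracts it as a plain vector
z_scale <- scale(x)[, 1]

all.equal(z_manual, z_scale)   # TRUE
mean(z_scale)                  # 0 up to floating-point error
sd(z_scale)                    # 1
```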

4.2 Composite scores

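The Creative score is the unweighted mean of the eight creative Z-scores, and the Defensive score is the unweighted mean of the five defensive Z-scores. A positive score means the player is above the sample average in that category, and a negative score means below average, relative to all midfielders in the sample.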

df_mf <- df_mf %>%
  mutate(
    score_creative = rowMeans(cbind(
      z_xA.90, z_KP.90, z_FinalThirdPasses.90,
      z_PPA.90, z_PassesProgressive.90,
      z_PrgDistPasses.90, z_SCA.90, z_Passes.
    )),
    score_defensive = rowMeans(cbind(
      z_TklW.90, z_Int.90, z_Tkl.Int.90,
      z_Recov.90, z_Blocks.90
    ))
  )
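
To make the averaging step concrete, here is a toy version with two Z-score columns and three invented players. It also shows that rowMeans() is the same as a weighted sum with equal weights, which is the place where different weights could be plugged in (an option not used in this report).

```r
# Toy Z-scores (hypothetical values) for three players on two metrics
z <- data.frame(
  z_xA.90 = c(1.0, -0.5,  0.0),
  z_KP.90 = c(2.0,  0.5, -1.0),
  row.names = c("Player A", "Player B", "Player C")
)

# Composite score: plain average of the Z-scores in each row
score <- rowMeans(z)
score

# Equivalent formulation with explicit equal weights
w <- c(0.5, 0.5)
score_weighted <- as.vector(as.matrix(z) %*% w)
```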

4.3 Rankings

df_mf <- df_mf %>%
  mutate(
    rank_creative  = rank(-score_creative,  ties.method = "min"),
    rank_defensive = rank(-score_defensive, ties.method = "min")
  )
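
A short sketch of how rank() behaves here, on invented scores: negating the score makes the highest score rank 1, and ties.method = "min" gives tied players the same (best) rank and then skips the next one.

```r
# Toy scores (hypothetical); A and C are tied
score <- c(A = 1.2, B = 3.4, C = 1.2, D = 0.5)

ranks <- rank(-score, ties.method = "min")
ranks
# A B C D
# 2 1 2 4
```

Rank 3 is skipped because two players share rank 2.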

Top 15 midfielders by Creative score:

df_mf %>%
  arrange(rank_creative) %>%
  select(Player, Squad, Competition, Age_int, Min,
         score_creative, score_defensive,
         rank_creative, rank_defensive) %>%
  head(15)
##             Player          Squad    Competition Age_int  Min score_creative
## 1   Joshua Kimmich  Bayern Munich     Bundesliga      29 2847       3.778199
## 2     Joey Veerman  PSV Eindhoven     Eredivisie      25 1830       3.535609
## 3            Pedri      Barcelona        La Liga      21 2879       2.684584
## 4      Orkun Kökçü        Benfica  Primeira Liga      23 2633       2.532469
## 5   Angelo Stiller      Stuttgart     Bundesliga      23 2741       2.528589
## 6  Bruno Fernandes Manchester Utd Premier League      29 3018       2.511450
## 7     Granit Xhaka     Leverkusen     Bundesliga      31 2888       2.330650
## 8    Romano Schmid  Werder Bremen     Bundesliga      24 2834       2.323731
## 9       Alex Baena     Villarreal        La Liga      23 2595       2.188212
## 10    Nadiem Amiri       Mainz 05     Bundesliga      27 2473       2.177759
## 11 Martin Ødegaard        Arsenal Premier League      25 2325       2.170921
## 12            Isco          Betis        La Liga      32 1547       2.020787
## 13 Pierre Højbjerg      Marseille        Ligue 1      28 2664       1.999850
## 14     Luka Modrić    Real Madrid        La Liga      38 1827       1.833274
## 15     Tiago Silva        Vitória  Primeira Liga      31 2326       1.776778
##    score_defensive rank_creative rank_defensive
## 1       0.42384289             1            180
## 2      -0.13078355             2            347
## 3       0.94320833             3             94
## 4       0.02547731             4            301
## 5       0.34589315             5            203
## 6       1.04267820             6             78
## 7       0.14469326             7            261
## 8       0.29644458             8            220
## 9      -0.08101674             9            338
## 10      0.40730196            10            186
## 11     -1.18745473            11            617
## 12     -0.62272099            12            495
## 13      1.50753297            13             37
## 14     -0.44922576            14            443
## 15      0.44836560            15            177

Top creative players who are U21:

df_mf %>%
  filter(Age_int <= 20) %>%
  arrange(rank_creative) %>%
  select(Player, Squad, Competition, Age_int, Min,
         score_creative, score_defensive,
         rank_creative, rank_defensive) %>%
  head(15)
##                 Player          Squad    Competition Age_int  Min
## 1          Jakob Breum   Go Ahead Eag     Eredivisie      20 2059
## 2        Lamine Camara         Monaco        Ligue 1      20 2054
## 3          Tom Bischof     Hoffenheim     Bundesliga      19 2559
## 4           Arda Güler    Real Madrid        La Liga      19 1250
## 5  Aleksandar Pavlovic  Bayern Munich     Bundesliga      20 1451
## 6           João Neves            PSG        Ligue 1      19 1844
## 7          Nicolás Paz           Como        Serie A      19 2687
## 8         Youri Regeer         Twente     Eredivisie      20 1342
## 9         Adam Wharton Crystal Palace Premier League      20 1318
## 10     Luciano Valente      Groningen     Eredivisie      20 2587
## 11      Geovany Quenda    Sporting CP  Primeira Liga      17 2253
## 12          Levi Smans     Heerenveen     Eredivisie      20 2298
## 13        Djaoui Cissé         Rennes        Ligue 1      20 1121
## 14  Eliesse Ben Seghir         Monaco        Ligue 1      19 1750
## 15       Andrey Santos     Strasbourg        Ligue 1      20 2855
##    score_creative score_defensive rank_creative rank_defensive
## 1       1.0492180      -0.3710345            49            424
## 2       1.0218256       1.2261502            52             58
## 3       0.9326012       2.0717990            62             11
## 4       0.8755228      -0.8927326            72            566
## 5       0.8445543      -0.1386502            77            350
## 6       0.8190397       0.6202001            83            149
## 7       0.7315042       0.3286415            97            211
## 8       0.6847693       0.1463685           105            259
## 9       0.6842372       0.3742037           106            194
## 10      0.6676788      -0.4480592           108            442
## 11      0.6346847      -0.8285452           117            548
## 12      0.4363483      -0.7564290           148            533
## 13      0.4290084       1.7310427           151             22
## 14      0.4116841      -0.6746968           159            507
## 15      0.3633382       2.2350283           174              4

5. Reference Player and Similarity Analysis

5.1 Why Luka Modrić

For this part of the analysis we pick Luka Modrić as a reference player. He is one of the best examples of a creative midfielder of the last decade: excellent at progressing the ball, finding key passes, and controlling the tempo of a game. At 38 years old he is clearly at the end of his career, but his creative score in this dataset still ranks 14th among all midfielders with 900+ minutes. That makes him a very good reference point: we want to find young players who show a similar creative fingerprint.

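The idea is simple: if a 19 or 20 year old already shows a Z-score profile close to Modrić’s across the creative metrics, that is a signal of a similar style of play and a hint that they could develop into a similar type of player. It is not a guarantee, since similarity here is measured on one season of data.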

modric <- df_mf %>%
  filter(Player == "Luka Modrić")

# Check he is in the sample
modric %>%
  select(Player, Squad, Age_int, Min, score_creative, score_defensive,
         rank_creative, rank_defensive)
##        Player       Squad Age_int  Min score_creative score_defensive
## 1 Luka Modrić Real Madrid      38 1827       1.833274      -0.4492258
##   rank_creative rank_defensive
## 1            14            443

5.2 Similarity algorithm

To measure how similar each U21 midfielder is to Modrić, we use Euclidean distance calculated on the Z-scores of the 8 creative metrics.

Euclidean distance measures the straight-line distance between two points in multi-dimensional space – in this case each player is a point defined by 8 coordinates (one per creative metric). The smaller the distance, the more similar the creative profile.

We calculate the distance between Modrić’s Z-score vector and every U21 midfielder’s Z-score vector, then rank by smallest distance.

# Z-score columns for creative metrics
z_creative <- c("z_xA.90", "z_KP.90", "z_FinalThirdPasses.90",
                 "z_PPA.90", "z_PassesProgressive.90",
                 "z_PrgDistPasses.90", "z_SCA.90", "z_Passes.")

# Modrić's Z-score vector
modric_vec <- as.numeric(modric[1, z_creative])

# U21 sample only
df_u21 <- df_mf %>%
  filter(Age_int <= 20)

# Calculate euclidean distance for each U21 player vs Modrić
df_u21 <- df_u21 %>%
  rowwise() %>%
  mutate(
    dist_modric = sqrt(sum(
      (c_across(all_of(z_creative)) - modric_vec)^2
    ))
  ) %>%
  ungroup()

# Top 3 most similar U21 players
top3_similar <- df_u21 %>%
  arrange(dist_modric) %>%
  select(Player, Squad, Competition, Age_int, Min,
         score_creative, score_defensive,
         rank_creative, dist_modric) %>%
  head(3)

top3_similar
## # A tibble: 3 × 9
##   Player        Squad   Competition Age_int   Min score_creative score_defensive
##   <chr>         <chr>   <chr>         <dbl> <int>          <dbl>           <dbl>
## 1 Lamine Camara Monaco  Ligue 1          20  2054          1.02            1.23 
## 2 Tom Bischof   Hoffen… Bundesliga       19  2559          0.933           2.07 
## 3 Adam Wharton  Crysta… Premier Le…      20  1318          0.684           0.374
## # ℹ 2 more variables: rank_creative <int>, dist_modric <dbl>
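
The distance calculation can be checked by hand on a toy example. The sketch below uses three invented candidates with three coordinates each and a vectorised base R formulation (sweep() plus rowSums()) that should give the same result as the rowwise() version above. The names ref, candidates and dists are assumptions for the example.

```r
# Reference profile and three candidate profiles (hypothetical Z-scores)
ref <- c(1, 2, 0)
candidates <- rbind(
  P1 = c(1, 2, 0),   # identical to the reference
  P2 = c(4, 6, 0),   # differs by 3 and 4 on the first two metrics
  P3 = c(1, 0, 0)    # differs by 2 on the second metric
)

# Subtract the reference from every row, square, sum per row, take the root
dists <- sqrt(rowSums(sweep(candidates, 2, ref)^2))

sort(dists)
# P1 P3 P2
#  0  2  5
```

P2 has distance sqrt(3^2 + 4^2) = 5, so it is the least similar, even though all of its values are higher than the reference. Distance measures difference in profile, not quality.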

5.3 Radar charts

Now we build radar charts comparing Modrić with each of the 3 most similar U21 players.

The fmsb package requires a specific format: the first row is the maximum value for each metric, the second row is the minimum, and then the actual data rows follow. We use Z-scores for all metrics so the scale is the same for everyone – a value of 0 means exactly average, positive means above average, negative means below average compared to all midfielders in the sample.

We set the min and max of the radar to -2.5 and +2.5 so the chart always covers a meaningful range.
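
One caveat of a fixed range is that a Z-score beyond the chosen limits would, as far as we understand fmsb, be drawn outside the outer ring. A minimal sketch of an optional clamping step in base R follows. clamp_z is a hypothetical helper, not part of fmsb or of the analysis in this report, and the input values are made up.

```r
# Hypothetical helper: limit Z-scores to the radar range before plotting
clamp_z <- function(x, lower = -2.5, upper = 2.5) {
  pmin(pmax(x, lower), upper)
}

# Made-up Z-scores, two of them outside the range
z_example <- c(-3.1, -0.4, 0.0, 1.8, 2.9)
z_clamped <- clamp_z(z_example)
z_clamped
```

Clamping changes how extreme values look on the chart, so it would be worth mentioning in the caption if it is used.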

# Friendly labels for the radar axes
radar_labels <- c("xA/90", "KP/90", "Final 3rd\nPasses", 
                   "PPA/90", "Prog\nPasses", "Prog\nDist", 
                   "SCA/90", "Pass%")

# Function to build the radar data frame for fmsb
build_radar <- function(players_df, z_cols, labels) {
  # Extract Z-score rows for selected players
  radar_data <- players_df %>%
    select(all_of(z_cols)) %>%
    as.data.frame()
  
  rownames(radar_data) <- players_df$Player
  colnames(radar_data) <- labels
  
  # fmsb needs max row first, then min row, then data
  max_row <- rep(2.5,  length(labels))
  min_row <- rep(-2.5, length(labels))
  
  radar_data <- rbind(max_row, min_row, radar_data)
  rownames(radar_data)[1:2] <- c("Max", "Min")
  
  radar_data
}

# Get the top 3 similar player names
similar_names <- top3_similar$Player

# Combine Modrić + top 3 into one data frame, keeping the right order
players_to_compare <- df_mf %>%
  filter(Player %in% c("Luka Modrić", similar_names)) %>%
  arrange(match(Player, c("Luka Modrić", similar_names)))

# Build radar data
radar_df <- build_radar(players_to_compare, z_creative, radar_labels)

# Colors for the 4 players (reference + top 3 similar)
colors_fill <- c(
  rgb(0.18, 0.47, 0.71, 0.10),  # blue
  rgb(0.84, 0.19, 0.15, 0.10),  # red
  rgb(0.13, 0.63, 0.31, 0.10),  # green
  rgb(0.95, 0.60, 0.07, 0.10)   # orange
)
colors_line <- c(
  rgb(0.18, 0.47, 0.71, 0.9),
  rgb(0.84, 0.19, 0.15, 0.9),
  rgb(0.13, 0.63, 0.31, 0.9),
  rgb(0.95, 0.60, 0.07, 0.9)
)

player_names <- c("Luka Modrić", similar_names)

# Single combined radar chart
par(mar = c(2, 2, 3, 2))

radarchart(
  radar_df,
  axistype    = 1,
  pcol        = colors_line,
  pfcol       = colors_fill,
  plwd        = 2.5,
  cglcol      = "grey70",
  cglty       = 1,
  axislabcol  = "grey40",
  caxislabels = c("-2.5", "-1.25", "0", "1.25", "2.5"),
  cglwd       = 0.8,
  vlcex       = 0.9,
  title       = "Modrić vs most similar U21 midfielders – creative profile"
)

legend(
  "bottomright",
  legend = player_names,
  col    = colors_line,
  lty    = 1,
  lwd    = 2.5,
  bty    = "n",
  cex    = 0.85
)

Looking at the radar chart, we can see that all three U21 players share a similar shape to Modrić’s creative profile, though none of them match him on every metric. That is expected – Modrić at 38 is still producing at an elite level in some areas (like progressive passing distance and key passes), so matching him perfectly at age 19-20 would be unrealistic. The important thing is the general direction: these players are above average on the same dimensions where Modrić is above average, and that pattern is what makes them interesting from a scouting perspective.

It is also interesting that the scatter plot (below) shows all three similar U21 players score higher on defensive metrics than Modrić. This could mean they are being used in slightly deeper or more box-to-box roles at their current clubs, which is common for young midfielders who need to earn their place in the team before being given full creative freedom.

5.4 Summary scatter plot

As a final overview, let’s place all midfielders on a scatter plot of Creative score vs Defensive score, highlighting Modrić and the 3 similar young players. This gives a good picture of where each player sits in the full landscape.

# Labels only for highlighted players
highlight_names <- c("Luka Modrić", similar_names)

df_mf %>%
  mutate(
    highlight = case_when(
      Player == "Luka Modrić"         ~ "Modrić",
      Player %in% similar_names       ~ "Similar U21",
      Age_int <= 20                   ~ "Other U21",
      TRUE                            ~ "All MF"
    ),
    label = ifelse(Player %in% highlight_names, Player, NA)
  ) %>%
  ggplot(aes(x = score_creative, y = score_defensive,
             color = highlight, size = highlight)) +
  geom_point(alpha = 0.6) +
  # na.rm = TRUE drops the unlabeled rows without a missing-value warning
  geom_text(aes(label = label), vjust = -0.8, size = 3.2,
            show.legend = FALSE, na.rm = TRUE) +
  scale_color_manual(values = c(
    "All MF"      = "grey75",
    "Other U21"   = "#91bfdb",
    "Similar U21" = "#d73027",
    "Modrić"      = "#1a6faf"
  )) +
  scale_size_manual(values = c(
    "All MF"      = 1.5,
    "Other U21"   = 2,
    "Similar U21" = 4,
    "Modrić"      = 5
  )) +
  labs(
    title    = "Creative vs Defensive score – all midfielders",
    subtitle = "Modrić and 3 most similar U21 players highlighted",
    x        = "Creative score (Z-score average)",
    y        = "Defensive score (Z-score average)",
    color    = NULL,
    size     = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")


6. Conclusions

This analysis looked at midfielders across 7 competitions in the 2024/25 season, focusing on creative playmaking ability. By separating the creative and defensive scores we avoided penalising pure playmakers for low defensive output – a player like Modrić scores low on defense but that is expected and fine for his role.

The similarity search identified the three U21 midfielders whose creative Z-score profile is closest to Modrić’s across all 8 dimensions. It is worth pointing out that none of these players are trying to be “the next Modrić” – what the algorithm picks up is a statistical fingerprint, not a playing style in the tactical sense. Still, the fact that their creative output distributes in a similar way across passing progression, chance creation and shot-creating actions suggests they could develop into a similar type of midfielder over time.

From a scouting perspective, this type of multi-dimensional comparison is more useful than looking at a single metric like key passes or assists. A player might rank high on xA/90 but be average at ball progression, or the other way around. The Euclidean distance approach captures the full shape of a player’s creative profile, which is where the real value is.

This analysis also has some clear limitations. The data covers different leagues with different levels of competition, and because the Z-scores are calculated over the full sample, league-level differences in the raw per-90 numbers carry over into the scores. A more advanced version could normalise within each competition separately, or apply a competition difficulty weight. The sample only covers one season, so players who had injuries or limited minutes early on might be underrepresented. And of course, statistical similarity does not guarantee tactical similarity: a coach would need to watch the actual footage before making any decision based on these numbers. But for an initial screening step, this approach gives a solid and interpretable starting point.
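
The per-competition normalisation mentioned above could look roughly like the sketch below. It uses a small invented data frame (toy_df, with made-up competitions and xA.90 values) and writes the result to a new hypothetical column zc_xA.90, so it does not reproduce the scores in this report.

```r
library(dplyr)

# Invented toy data: two competitions with different raw output levels
toy_df <- tibble(
  Player      = c("P1", "P2", "P3", "P4", "P5", "P6"),
  Competition = c("League A", "League A", "League A",
                  "League B", "League B", "League B"),
  xA.90       = c(0.1, 0.2, 0.3, 0.4, 0.6, 0.8)
)

# Z-score the metric within its own competition instead of the full sample
toy_z <- toy_df %>%
  group_by(Competition) %>%
  mutate(across(
    all_of("xA.90"),
    ~ (.x - mean(.x)) / sd(.x),
    .names = "zc_{.col}"
  )) %>%
  ungroup()

toy_z
```

In the toy data the top player of each league ends up with the same Z-score even though the raw values differ. That shows both the appeal and the risk of this approach: it removes league-level differences entirely, which is a stronger assumption than the pooled normalisation used in this report.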