This report was prepared for the talent detection task in the module. It works with a dataset of football players from multiple competitions, focuses on midfielders, and tries to identify young players with strong creative and playmaking qualities.
The idea is simple: instead of looking only at goals or assists, we build a more complete picture of a midfielder’s ability to create chances and progress the ball in attack. We also calculate a separate defensive score, so that everything is not collapsed into one number and the detail lost.
At the end of the report, we pick Luka Modrić as a reference player, someone who represents a very specific style of midfield play, and use a similarity measure to find young players (under 21) with a comparable profile. The question we want to answer is: who plays like Modrić but is still 19 or 20 years old?
First we load the libraries needed for this analysis.
tidyverse covers most of what we need: data manipulation with dplyr and visualization with ggplot2. fmsb is used later for radar charts.
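The loading chunk itself is not echoed in the rendered report. A minimal sketch of what it presumably looks like, assuming both packages are installed from CRAN (the availability check is an addition for illustration, not part of the original):

```r
# Packages named in the text above
pkgs <- c("tidyverse", "fmsb")

# Illustrative safeguard: find out which packages are not installed yet
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(tidyverse)  # dplyr, tidyr, ggplot2 and friends
  library(fmsb)       # radarchart()
}
```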
The file is a CSV but uses semicolons as separators, not commas. It also uses a dot for decimals. The encoding is UTF-8 which is important because some player names have special characters (like Modrić).
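The import chunk is not shown either. The sketch below reads a semicolon-separated, dot-decimal, UTF-8 file with base R. The toy file and its contents are made up for illustration, and the use of read.csv is an inference: the dotted column names in the output below (Passes., xA.90) are what base R produces when it sanitises headers, but the original call may differ.

```r
# Toy file standing in for the real export (the real path is not given in the report)
tmp <- tempfile(fileext = ".csv")
writeLines(
  enc2utf8(c(
    "Player;Squad;Age;Passes%;xA/90",
    "Luka Modri\u0107;Real Madrid;38;89.5;0.21",
    "Tom Bischof;Hoffenheim;19;84.2;0.15"
  )),
  tmp, useBytes = TRUE
)

# sep = ";" for the separator, dec = "." for decimals,
# encoding = "UTF-8" marks strings as UTF-8 so names like Modrić survive
df_toy <- read.csv(tmp, sep = ";", dec = ".", encoding = "UTF-8",
                   stringsAsFactors = FALSE)

cat("Rows:", nrow(df_toy), "\n")
cat("Columns:", ncol(df_toy), "\n")
names(df_toy)  # headers are sanitised: "Passes%" -> "Passes.", "xA/90" -> "xA.90"
```

With readr, read_delim(path, delim = ";") would be the closest equivalent, but it keeps the original header text unless told otherwise, so the column names used later in the report would not match.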
Let’s check the basic dimensions and what the first few rows look like.
## Rows: 3972
## Columns: 72
## Player Squad Nation Pos Age MP Min Gls G.PK Ast xG xAG
## 1 Abdoulie Ceesay St. Pauli GAM FW 20 7 60 0 0 0 0.0 0.0
## 2 Adam Aznou Bayern Munich MAR DF 18 2 17 0 0 0 0.0 0.0
## 3 Adam Dźwigała St. Pauli POL DF,MF 28 16 373 0 0 0 0.4 0.0
## 4 Adam Hložek Hoffenheim CZE FW,MF 22 27 1871 8 8 4 5.6 3.5
## 5 Adrian Beck Heidenheim GER MF,FW 27 32 1598 4 4 1 3.3 1.2
## Gls.90 G.PK.90 Ast.90 xG.90 xAG.90 Competition Sh Sh.90 SoT.90 Passes.
## 1 0.00 0.00 0.00 0.00 0.03 Bundesliga 0 0.00 0.00 57.1
## 2 0.00 0.00 0.00 0.00 0.00 Bundesliga 0 0.00 0.00 76.5
## 3 0.00 0.00 0.00 0.09 0.00 Bundesliga 6 1.45 0.00 81.6
## 4 0.38 0.38 0.19 0.27 0.17 Bundesliga 59 2.84 1.15 68.5
## 5 0.23 0.23 0.06 0.18 0.07 Bundesliga 39 2.20 0.73 80.1
## ShortPasses. MediumPasses. LongPasses. A.xAG TklW.90 Blocks.90 Int.90
## 1 70.0 0.0 0.0 0.0 0.43 0.29 0.00
## 2 90.0 66.7 0.0 0.0 0.00 0.00 0.00
## 3 91.0 91.4 40.0 0.0 0.62 0.38 0.25
## 4 73.8 75.0 54.8 0.5 0.37 0.48 0.15
## 5 83.3 87.3 72.4 -0.2 0.53 0.97 0.28
## Tkl.Int.90 Clr.90 Touches.90 Dribbles.90 Dribbles. SCA.90 GCA.90 Aerial.
## 1 0.43 0.14 3.71 0.00 0.0 1.48 0.00 0.0
## 2 0.00 0.50 9.00 0.00 0.0 0.00 0.00 0.0
## 3 1.50 1.75 17.50 0.06 50.0 1.21 0.00 76.5
## 4 0.96 0.74 28.19 1.22 51.6 2.74 0.48 50.6
## 5 1.41 0.53 30.06 1.25 53.3 3.16 0.23 59.2
## Points.90 xGA OG PSxG PSxG.SoT PSxG... GA GA.90 Save. CS. SoTA.90 SoTA.GA
## 1 0.43 1.4 0 0 0 0 0 0 0 0 0 0
## 2 3.00 0.8 0 0 0 0 0 0 0 0 0 0
## 3 0.88 9.0 0 0 0 0 0 0 0 0 0 0
## 4 0.85 34.4 0 0 0 0 0 0 0 0 0 0
## 5 0.91 29.2 0 0 0 0 0 0 0 0 0 0
## SoT.G G.xG PassesCompleted.90 PassesAttempted.90 ShortPassesCompleted.90
## 1 0.00 0.0 1.14 2.00 1.00
## 2 0.00 0.0 6.50 8.50 4.50
## 3 0.00 -0.4 10.81 13.25 4.44
## 4 3.00 2.4 12.89 18.81 6.56
## 5 3.25 0.7 18.16 22.66 8.56
## MediumPassesCompleted.90 LongPassesCompleted.90 TotDistPasses.90
## 1 0.00 0.00 10.00
## 2 2.00 0.00 89.00
## 3 5.31 0.75 183.44
## 4 4.67 0.85 198.19
## 5 7.28 1.72 309.41
## PrgDistPasses.90 xA.90 KP.90 FinalThirdPasses.90 PPA.90 CrsPA.90
## 1 0.29 0.00 0.14 0.00 0.00 0.00
## 2 22.00 0.00 0.00 0.50 0.00 0.00
## 3 75.50 0.01 0.00 0.69 0.06 0.00
## 4 48.19 0.12 0.74 1.37 0.70 0.15
## 5 70.56 0.06 0.66 1.66 0.56 0.12
## PassesProgressive.90 xGA.90 Recov.90 Fls.90 Fld.90 AerialW.90 xGD xGD.90
## 1 0.00 0.20 0.57 0.29 0.43 0.00 -1.4 -0.20
## 2 0.50 0.40 0.50 0.00 0.00 0.00 -0.8 -0.40
## 3 0.88 0.56 0.81 0.31 0.19 0.81 -8.6 -0.47
## 4 2.52 1.27 2.44 0.56 1.11 1.48 -28.8 -1.00
## 5 2.16 0.91 4.09 0.50 0.44 0.91 -25.9 -0.73
## MP_Squad
## 1 34
## 2 34
## 3 34
## 4 34
## 5 34
## Rows: 3,972
## Columns: 72
## $ Player <chr> "Abdoulie Ceesay", "Adam Aznou", "Adam Dźwiga…
## $ Squad <chr> "St. Pauli", "Bayern Munich", "St. Pauli", "H…
## $ Nation <chr> "GAM", "MAR", "POL", "CZE", "GER", "FRA", "ES…
## $ Pos <chr> "FW", "DF", "DF,MF", "FW,MF", "MF,FW", "MF,FW…
## $ Age <int> 20, 18, 28, 22, 27, 31, 27, 20, 25, 33, 27, 2…
## $ MP <dbl> 7, 2, 16, 27, 32, 30, 28, 21, 19, 2, 34, 23, …
## $ Min <int> 60, 17, 373, 1871, 1598, 1902, 1455, 1451, 12…
## $ Gls <int> 0, 0, 0, 8, 4, 11, 3, 1, 7, 0, 0, 0, 0, 9, 0,…
## $ G.PK <int> 0, 0, 0, 8, 4, 10, 3, 1, 7, 0, 0, 0, 0, 9, 0,…
## $ Ast <int> 0, 0, 0, 4, 1, 4, 4, 0, 2, 0, 1, 1, 0, 2, 0, …
## $ xG <dbl> 0.0, 0.0, 0.4, 5.6, 3.3, 7.4, 1.0, 0.6, 4.1, …
## $ xAG <dbl> 0.0, 0.0, 0.0, 3.5, 1.2, 3.1, 4.9, 0.2, 1.3, …
## $ Gls.90 <dbl> 0.00, 0.00, 0.00, 0.38, 0.23, 0.52, 0.19, 0.0…
## $ G.PK.90 <dbl> 0.00, 0.00, 0.00, 0.38, 0.23, 0.47, 0.19, 0.0…
## $ Ast.90 <dbl> 0.00, 0.00, 0.00, 0.19, 0.06, 0.19, 0.25, 0.0…
## $ xG.90 <dbl> 0.00, 0.00, 0.09, 0.27, 0.18, 0.35, 0.06, 0.0…
## $ xAG.90 <dbl> 0.03, 0.00, 0.00, 0.17, 0.07, 0.15, 0.30, 0.0…
## $ Competition <chr> "Bundesliga", "Bundesliga", "Bundesliga", "Bu…
## $ Sh <dbl> 0, 0, 6, 59, 39, 41, 15, 9, 27, 0, 0, 12, 0, …
## $ Sh.90 <dbl> 0.00, 0.00, 1.45, 2.84, 2.20, 1.94, 0.93, 0.5…
## $ SoT.90 <dbl> 0.00, 0.00, 0.00, 1.15, 0.73, 0.57, 0.31, 0.2…
## $ Passes. <dbl> 57.1, 76.5, 81.6, 68.5, 80.1, 72.1, 87.2, 93.…
## $ ShortPasses. <dbl> 70.0, 90.0, 91.0, 73.8, 83.3, 83.1, 94.3, 96.…
## $ MediumPasses. <dbl> 0.0, 66.7, 91.4, 75.0, 87.3, 74.2, 89.8, 95.1…
## $ LongPasses. <dbl> 0.0, 0.0, 40.0, 54.8, 72.4, 63.2, 65.8, 73.1,…
## $ A.xAG <dbl> 0.0, 0.0, 0.0, 0.5, -0.2, 0.9, -0.9, -0.2, 0.…
## $ TklW.90 <dbl> 0.43, 0.00, 0.62, 0.37, 0.53, 0.10, 0.25, 1.0…
## $ Blocks.90 <dbl> 0.29, 0.00, 0.38, 0.48, 0.97, 0.37, 0.43, 0.8…
## $ Int.90 <dbl> 0.00, 0.00, 0.25, 0.15, 0.28, 0.07, 0.39, 0.2…
## $ Tkl.Int.90 <dbl> 0.43, 0.00, 1.50, 0.96, 1.41, 0.30, 0.86, 2.0…
## $ Clr.90 <dbl> 0.14, 0.50, 1.75, 0.74, 0.53, 0.23, 0.86, 0.6…
## $ Touches.90 <dbl> 3.71, 9.00, 17.50, 28.19, 30.06, 32.63, 53.68…
## $ Dribbles.90 <dbl> 0.00, 0.00, 0.06, 1.22, 1.25, 0.57, 0.14, 0.2…
## $ Dribbles. <dbl> 0.0, 0.0, 50.0, 51.6, 53.3, 39.5, 80.0, 100.0…
## $ SCA.90 <dbl> 1.48, 0.00, 1.21, 2.74, 3.16, 3.41, 3.71, 1.8…
## $ GCA.90 <dbl> 0.00, 0.00, 0.00, 0.48, 0.23, 0.66, 0.62, 0.1…
## $ Aerial. <dbl> 0.0, 0.0, 76.5, 50.6, 59.2, 31.9, 27.3, 54.5,…
## $ Points.90 <dbl> 0.43, 3.00, 0.88, 0.85, 0.91, 1.17, 2.11, 2.3…
## $ xGA <dbl> 1.4, 0.8, 9.0, 34.4, 29.2, 35.7, 21.2, 12.6, …
## $ OG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ PSxG <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ PSxG.SoT <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ PSxG... <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ GA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 53, 0, 13, 0, 0…
## $ GA.90 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ Save. <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ CS. <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ SoTA.90 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ SoTA.GA <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ SoT.G <dbl> 0.00, 0.00, 0.00, 3.00, 3.25, 1.20, 1.67, 4.0…
## $ G.xG <dbl> 0.0, 0.0, -0.4, 2.4, 0.7, 3.6, 2.0, 0.4, 2.9,…
## $ PassesCompleted.90 <dbl> 1.14, 6.50, 10.81, 12.89, 18.16, 19.10, 43.29…
## $ PassesAttempted.90 <dbl> 2.00, 8.50, 13.25, 18.81, 22.66, 26.50, 49.64…
## $ ShortPassesCompleted.90 <dbl> 1.00, 4.50, 4.44, 6.56, 8.56, 10.30, 22.86, 3…
## $ MediumPassesCompleted.90 <dbl> 0.00, 2.00, 5.31, 4.67, 7.28, 6.33, 15.14, 26…
## $ LongPassesCompleted.90 <dbl> 0.00, 0.00, 0.75, 0.85, 1.72, 1.60, 4.61, 3.6…
## $ TotDistPasses.90 <dbl> 10.00, 89.00, 183.44, 198.19, 309.41, 303.93,…
## $ PrgDistPasses.90 <dbl> 0.29, 22.00, 75.50, 48.19, 70.56, 100.97, 214…
## $ xA.90 <dbl> 0.00, 0.00, 0.01, 0.12, 0.06, 0.11, 0.14, 0.0…
## $ KP.90 <dbl> 0.14, 0.00, 0.00, 0.74, 0.66, 0.83, 1.14, 0.3…
## $ FinalThirdPasses.90 <dbl> 0.00, 0.50, 0.69, 1.37, 1.66, 2.67, 4.82, 8.0…
## $ PPA.90 <dbl> 0.00, 0.00, 0.06, 0.70, 0.56, 1.17, 0.43, 0.6…
## $ CrsPA.90 <dbl> 0.00, 0.00, 0.00, 0.15, 0.12, 0.27, 0.14, 0.0…
## $ PassesProgressive.90 <dbl> 0.00, 0.50, 0.88, 2.52, 2.16, 3.87, 4.29, 6.1…
## $ xGA.90 <dbl> 0.20, 0.40, 0.56, 1.27, 0.91, 1.19, 0.76, 0.6…
## $ Recov.90 <dbl> 0.57, 0.50, 0.81, 2.44, 4.09, 2.20, 2.46, 3.4…
## $ Fls.90 <dbl> 0.29, 0.00, 0.31, 0.56, 0.50, 0.57, 0.29, 0.8…
## $ Fld.90 <dbl> 0.43, 0.00, 0.19, 1.11, 0.44, 1.10, 0.32, 0.4…
## $ AerialW.90 <dbl> 0.00, 0.00, 0.81, 1.48, 0.91, 0.50, 0.11, 0.5…
## $ xGD <dbl> -1.4, -0.8, -8.6, -28.8, -25.9, -28.3, -20.2,…
## $ xGD.90 <dbl> -0.20, -0.40, -0.47, -1.00, -0.73, -0.84, -0.…
## $ MP_Squad <int> 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 3…
A quick look at the column names to understand what variables are available:
## [1] "Player" "Squad"
## [3] "Nation" "Pos"
## [5] "Age" "MP"
## [7] "Min" "Gls"
## [9] "G.PK" "Ast"
## [11] "xG" "xAG"
## [13] "Gls.90" "G.PK.90"
## [15] "Ast.90" "xG.90"
## [17] "xAG.90" "Competition"
## [19] "Sh" "Sh.90"
## [21] "SoT.90" "Passes."
## [23] "ShortPasses." "MediumPasses."
## [25] "LongPasses." "A.xAG"
## [27] "TklW.90" "Blocks.90"
## [29] "Int.90" "Tkl.Int.90"
## [31] "Clr.90" "Touches.90"
## [33] "Dribbles.90" "Dribbles."
## [35] "SCA.90" "GCA.90"
## [37] "Aerial." "Points.90"
## [39] "xGA" "OG"
## [41] "PSxG" "PSxG.SoT"
## [43] "PSxG..." "GA"
## [45] "GA.90" "Save."
## [47] "CS." "SoTA.90"
## [49] "SoTA.GA" "SoT.G"
## [51] "G.xG" "PassesCompleted.90"
## [53] "PassesAttempted.90" "ShortPassesCompleted.90"
## [55] "MediumPassesCompleted.90" "LongPassesCompleted.90"
## [57] "TotDistPasses.90" "PrgDistPasses.90"
## [59] "xA.90" "KP.90"
## [61] "FinalThirdPasses.90" "PPA.90"
## [63] "CrsPA.90" "PassesProgressive.90"
## [65] "xGA.90" "Recov.90"
## [67] "Fls.90" "Fld.90"
## [69] "AerialW.90" "xGD"
## [71] "xGD.90" "MP_Squad"
The dataset comes from FBref and covers the 2024/25 season. It includes players from 7 competitions:
## Competition n
## 1 Serie A 634
## 2 La Liga 601
## 3 Primeira Liga 584
## 4 Premier League 574
## 5 Ligue 1 553
## 6 Eredivisie 534
## 7 Bundesliga 492
Note that pooling players from leagues of different levels (for example La Liga and Ligue 1) means the Z-scores are calculated across all of them together. A midfielder putting up good progressive pass numbers in a weaker league will look the same as one doing it in a stronger league. This is a known limitation: ideally we would apply some kind of competition difficulty adjustment, but for this study we keep it simple and normalise across the full sample.
The positions available in the data:
## Pos n
## 1 DF 1161
## 2 MF 804
## 3 FW 542
## 4 FW,MF 474
## 5 MF,FW 327
## 6 GK 294
## 7 DF,MF 155
## 8 MF,DF 114
## 9 DF,FW 62
## 10 FW,DF 39
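The Age column in FBref data can come as a decimal (for example 20.187, meaning the player is 20 years old plus some days), so we floor it to get a clean integer age for filtering and display. In this file Age is already stored as an integer, as the structure output above shows, so floor() leaves the values unchanged and acts only as a safeguard.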
df_raw <- df_raw %>%
mutate(Age_int = floor(Age))
# Quick check
df_raw %>%
select(Player, Age, Age_int) %>%
  head(10)
## Player Age Age_int
## 1 Abdoulie Ceesay 20 20
## 2 Adam Aznou 18 18
## 3 Adam Dźwigała 28 28
## 4 Adam Hložek 22 22
## 5 Adrian Beck 27 27
## 6 Alassane Pléa 31 31
## 7 Aleix García 27 27
## 8 Aleksandar Pavlovic 20 20
## 9 Alexander Bernhardsson 25 25
## 10 Alexander Meyer 33 33
Before filtering, let’s check how many NAs exist in the key metrics we plan to use.
key_metrics <- c(
"xA.90", "KP.90", "FinalThirdPasses.90",
"PPA.90", "PassesProgressive.90", "PrgDistPasses.90",
"SCA.90", "Passes.",
"TklW.90", "Int.90", "Tkl.Int.90", "Recov.90", "Blocks.90"
)
df_raw %>%
select(all_of(key_metrics)) %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "metric", values_to = "na_count") %>%
  arrange(desc(na_count))
## # A tibble: 13 × 2
## metric na_count
## <chr> <int>
## 1 xA.90 0
## 2 KP.90 0
## 3 FinalThirdPasses.90 0
## 4 PPA.90 0
## 5 PassesProgressive.90 0
## 6 PrgDistPasses.90 0
## 7 SCA.90 0
## 8 Passes. 0
## 9 TklW.90 0
## 10 Int.90 0
## 11 Tkl.Int.90 0
## 12 Recov.90 0
## 13 Blocks.90 0
We keep only players whose primary (first-listed) position is midfielder. In this dataset that means the positions MF, MF,FW and MF,DF; players listed as FW,MF or DF,MF are not included. We also apply a minimum threshold of 900 minutes to make sure we are only looking at players with a reasonable sample of data. On top of that we filter out any rows where the age is below 15, since those would clearly be data errors (no professional midfielder plays 900+ minutes at that age).
df_mf <- df_raw %>%
filter(Pos %in% c("MF", "MF,FW", "MF,DF")) %>%
filter(Min >= 900) %>%
filter(Age_int >= 15)
cat("Players in MF sample (900+ min):", nrow(df_mf), "\n")
## Players in MF sample (900+ min): 657
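The %in% filter above matches the whole Pos string, so it keeps only players whose first-listed position is MF. A small sketch on toy position strings (not rows from the dataset) makes that explicit and shows, as a hypothetical alternative, how players with MF in any slot could be selected instead:

```r
# Toy position strings covering the patterns seen in the Pos column
pos <- c("MF", "MF,FW", "MF,DF", "FW,MF", "DF,MF", "GK")

# Filter used in the report: exact match on the full position string
in_report_filter <- pos %in% c("MF", "MF,FW", "MF,DF")

# Same selection, written as "first-listed position is MF"
primary_is_mf <- sub(",.*", "", pos) == "MF"

# Hypothetical wider filter: MF appears in any slot
any_slot_mf <- grepl("MF", pos, fixed = TRUE)

data.frame(pos, in_report_filter, primary_is_mf, any_slot_mf)
```

In the full data the wider filter would bring in the 474 FW,MF and 155 DF,MF players before the minutes threshold is applied, which would change the reference population for every Z-score.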
Age distribution in the filtered sample:
df_mf %>%
count(Age_int, sort = FALSE) %>%
ggplot(aes(x = Age_int, y = n)) +
geom_col(fill = "#2c7bb6") +
labs(
title = "Age distribution – midfielders with 900+ minutes",
x = "Age",
y = "Number of players"
) +
  theme_minimal()
How many U21 players are in the sample:
## Age_int n
## 1 16 1
## 2 17 2
## 3 18 8
## 4 19 25
## 5 20 42
We split the metrics into two groups. The creative / playmaking metrics capture how well a midfielder creates chances and progresses the ball forward:
- xA.90 – expected assists per 90 minutes
- KP.90 – key passes per 90 (passes that directly lead to a shot)
- FinalThirdPasses.90 – passes into the final third per 90
- PPA.90 – passes into the penalty area per 90
- PassesProgressive.90 – progressive passes per 90 (passes that move the ball substantially closer to the opponent’s goal)
- PrgDistPasses.90 – total progressive passing distance per 90
- SCA.90 – shot-creating actions per 90
- Passes. – pass completion percentage

The defensive metrics capture out-of-possession work:

- TklW.90 – tackles won per 90
- Int.90 – interceptions per 90
- Tkl.Int.90 – combined tackles + interceptions per 90
- Recov.90 – ball recoveries per 90
- Blocks.90 – blocks per 90

We now reduce the dataset to only the columns we actually need – player info and the two sets of metrics.
df_mf <- df_mf %>%
select(
# player info
Player, Squad, Nation, Pos, Age, Age_int, Min, Competition,
# creative / playmaking metrics
xA.90,
KP.90,
FinalThirdPasses.90,
PPA.90,
PassesProgressive.90,
PrgDistPasses.90,
SCA.90,
Passes.,
# defensive metrics
TklW.90,
Int.90,
Tkl.Int.90,
Recov.90,
Blocks.90
)
glimpse(df_mf)
## Rows: 657
## Columns: 21
## $ Player <chr> "Adrian Beck", "Alassane Pléa", "Aleix García", "…
## $ Squad <chr> "Heidenheim", "Gladbach", "Leverkusen", "Bayern M…
## $ Nation <chr> "GER", "FRA", "ESP", "GER", "FRA", "GER", "MLI", …
## $ Pos <chr> "MF,FW", "MF,FW", "MF", "MF", "MF,FW", "MF", "MF"…
## $ Age <int> 27, 31, 27, 20, 26, 19, 26, 33, 25, 23, 22, 38, 2…
## $ Age_int <dbl> 27, 31, 27, 20, 26, 19, 26, 33, 25, 23, 22, 38, 2…
## $ Min <int> 1598, 1902, 1455, 1451, 2109, 922, 1320, 2767, 15…
## $ Competition <chr> "Bundesliga", "Bundesliga", "Bundesliga", "Bundes…
## $ xA.90 <dbl> 0.06, 0.11, 0.14, 0.09, 0.07, 0.05, 0.04, 0.18, 0…
## $ KP.90 <dbl> 0.66, 0.83, 1.14, 0.33, 0.75, 0.67, 0.18, 1.53, 0…
## $ FinalThirdPasses.90 <dbl> 1.66, 2.67, 4.82, 8.05, 1.36, 1.87, 2.32, 3.16, 1…
## $ PPA.90 <dbl> 0.56, 1.17, 0.43, 0.62, 0.75, 0.40, 0.46, 1.19, 0…
## $ PassesProgressive.90 <dbl> 2.16, 3.87, 4.29, 6.10, 2.18, 2.80, 2.32, 4.47, 2…
## $ PrgDistPasses.90 <dbl> 70.56, 100.97, 214.79, 255.90, 50.39, 112.80, 120…
## $ SCA.90 <dbl> 3.16, 3.41, 3.71, 1.86, 3.29, 2.05, 1.98, 3.32, 1…
## $ Passes. <dbl> 80.1, 72.1, 87.2, 93.3, 82.9, 75.5, 80.2, 77.7, 7…
## $ TklW.90 <dbl> 0.53, 0.10, 0.25, 1.00, 0.93, 0.27, 0.61, 0.25, 1…
## $ Int.90 <dbl> 0.28, 0.07, 0.39, 0.24, 0.54, 0.40, 0.64, 0.22, 0…
## $ Tkl.Int.90 <dbl> 1.41, 0.30, 0.86, 2.05, 2.00, 1.07, 1.71, 0.69, 2…
## $ Recov.90 <dbl> 4.09, 2.20, 2.46, 3.43, 2.57, 3.87, 3.14, 4.25, 2…
## $ Blocks.90 <dbl> 0.97, 0.37, 0.43, 0.86, 0.57, 0.93, 1.50, 0.50, 1…
We remove players who have NA in any of the metric columns. The NA check above found none in the raw data, so this step is a safeguard and the sample size should not change.
df_mf <- df_mf %>%
drop_na(xA.90, KP.90, FinalThirdPasses.90, PPA.90,
PassesProgressive.90, PrgDistPasses.90, SCA.90, Passes.,
TklW.90, Int.90, Tkl.Int.90, Recov.90, Blocks.90)
cat("Final sample size after removing NAs:", nrow(df_mf), "\n")
## Final sample size after removing NAs: 657
For each metric we calculate a Z-score within the full MF sample. This puts all metrics on the same scale (mean = 0, sd = 1) so they can be combined fairly regardless of their original units.
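As a quick illustration of what the Z-score step does, the toy example below (made-up per-90 values, base R only) checks that scale() gives the same result as subtracting the sample mean and dividing by the sample standard deviation:

```r
# Toy per-90 values for five hypothetical players
x <- c(0.05, 0.10, 0.15, 0.20, 0.25)

z_manual <- (x - mean(x)) / sd(x)  # (value - sample mean) / sample sd
z_scale  <- scale(x)[, 1]          # scale() returns a one-column matrix; [, 1] takes it as a vector

all.equal(z_manual, z_scale)
round(c(mean = mean(z_scale), sd = sd(z_scale)), 10)
```

Because the mean and sd come from the filtered midfielder sample, every Z-score is relative to that population; changing the position or minutes filter changes the scores.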
creative_metrics <- c("xA.90", "KP.90", "FinalThirdPasses.90",
"PPA.90", "PassesProgressive.90",
"PrgDistPasses.90", "SCA.90", "Passes.")
defensive_metrics <- c("TklW.90", "Int.90", "Tkl.Int.90",
"Recov.90", "Blocks.90")
df_mf <- df_mf %>%
mutate(
across(all_of(creative_metrics),
~ scale(.)[,1],
.names = "z_{.col}"),
across(all_of(defensive_metrics),
~ scale(.)[,1],
.names = "z_{.col}")
  )
The Creative score is the average of all creative Z-scores, and the Defensive score is the average of all defensive Z-scores. A higher score means the player is above average in that category compared to all midfielders in the sample.
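The chunk that creates score_creative and score_defensive does not appear in the rendered output, although the ranking code that follows relies on both columns. A minimal sketch, assuming the scores are plain unweighted row means of the z_ columns as described (toy data and base R only; the report's actual chunk may differ):

```r
# Toy Z-scores for three hypothetical players (not real data)
toy <- data.frame(
  Player    = c("A", "B", "C"),
  z_xA.90   = c( 1.0, 0.0, -1.0),
  z_KP.90   = c( 2.0, 0.0, -2.0),
  z_TklW.90 = c(-0.5, 0.0,  0.5),
  z_Int.90  = c(-1.5, 0.0,  1.5)
)
creative_z  <- c("z_xA.90", "z_KP.90")
defensive_z <- c("z_TklW.90", "z_Int.90")

# Unweighted mean of the Z-scores in each group, one value per player
toy$score_creative  <- rowMeans(toy[creative_z])
toy$score_defensive <- rowMeans(toy[defensive_z])

toy[c("Player", "score_creative", "score_defensive")]
```

In the report's dplyr style the same idea could probably be written inside mutate() as score_creative = rowMeans(across(all_of(paste0("z_", creative_metrics)))), and likewise for the defensive metrics.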
df_mf <- df_mf %>%
mutate(
rank_creative = rank(-score_creative, ties.method = "min"),
rank_defensive = rank(-score_defensive, ties.method = "min")
  )
Top 15 midfielders by Creative score:
df_mf %>%
arrange(rank_creative) %>%
select(Player, Squad, Competition, Age_int, Min,
score_creative, score_defensive,
rank_creative, rank_defensive) %>%
  head(15)
## Player Squad Competition Age_int Min score_creative
## 1 Joshua Kimmich Bayern Munich Bundesliga 29 2847 3.778199
## 2 Joey Veerman PSV Eindhoven Eredivisie 25 1830 3.535609
## 3 Pedri Barcelona La Liga 21 2879 2.684584
## 4 Orkun Kökçü Benfica Primeira Liga 23 2633 2.532469
## 5 Angelo Stiller Stuttgart Bundesliga 23 2741 2.528589
## 6 Bruno Fernandes Manchester Utd Premier League 29 3018 2.511450
## 7 Granit Xhaka Leverkusen Bundesliga 31 2888 2.330650
## 8 Romano Schmid Werder Bremen Bundesliga 24 2834 2.323731
## 9 Alex Baena Villarreal La Liga 23 2595 2.188212
## 10 Nadiem Amiri Mainz 05 Bundesliga 27 2473 2.177759
## 11 Martin Ødegaard Arsenal Premier League 25 2325 2.170921
## 12 Isco Betis La Liga 32 1547 2.020787
## 13 Pierre Højbjerg Marseille Ligue 1 28 2664 1.999850
## 14 Luka Modrić Real Madrid La Liga 38 1827 1.833274
## 15 Tiago Silva Vitória Primeira Liga 31 2326 1.776778
## score_defensive rank_creative rank_defensive
## 1 0.42384289 1 180
## 2 -0.13078355 2 347
## 3 0.94320833 3 94
## 4 0.02547731 4 301
## 5 0.34589315 5 203
## 6 1.04267820 6 78
## 7 0.14469326 7 261
## 8 0.29644458 8 220
## 9 -0.08101674 9 338
## 10 0.40730196 10 186
## 11 -1.18745473 11 617
## 12 -0.62272099 12 495
## 13 1.50753297 13 37
## 14 -0.44922576 14 443
## 15 0.44836560 15 177
Top creative players who are U21:
df_mf %>%
filter(Age_int <= 20) %>%
arrange(rank_creative) %>%
select(Player, Squad, Competition, Age_int, Min,
score_creative, score_defensive,
rank_creative, rank_defensive) %>%
  head(15)
## Player Squad Competition Age_int Min
## 1 Jakob Breum Go Ahead Eag Eredivisie 20 2059
## 2 Lamine Camara Monaco Ligue 1 20 2054
## 3 Tom Bischof Hoffenheim Bundesliga 19 2559
## 4 Arda Güler Real Madrid La Liga 19 1250
## 5 Aleksandar Pavlovic Bayern Munich Bundesliga 20 1451
## 6 João Neves PSG Ligue 1 19 1844
## 7 Nicolás Paz Como Serie A 19 2687
## 8 Youri Regeer Twente Eredivisie 20 1342
## 9 Adam Wharton Crystal Palace Premier League 20 1318
## 10 Luciano Valente Groningen Eredivisie 20 2587
## 11 Geovany Quenda Sporting CP Primeira Liga 17 2253
## 12 Levi Smans Heerenveen Eredivisie 20 2298
## 13 Djaoui Cissé Rennes Ligue 1 20 1121
## 14 Eliesse Ben Seghir Monaco Ligue 1 19 1750
## 15 Andrey Santos Strasbourg Ligue 1 20 2855
## score_creative score_defensive rank_creative rank_defensive
## 1 1.0492180 -0.3710345 49 424
## 2 1.0218256 1.2261502 52 58
## 3 0.9326012 2.0717990 62 11
## 4 0.8755228 -0.8927326 72 566
## 5 0.8445543 -0.1386502 77 350
## 6 0.8190397 0.6202001 83 149
## 7 0.7315042 0.3286415 97 211
## 8 0.6847693 0.1463685 105 259
## 9 0.6842372 0.3742037 106 194
## 10 0.6676788 -0.4480592 108 442
## 11 0.6346847 -0.8285452 117 548
## 12 0.4363483 -0.7564290 148 533
## 13 0.4290084 1.7310427 151 22
## 14 0.4116841 -0.6746968 159 507
## 15 0.3633382 2.2350283 174 4
For this part of the analysis we pick Luka Modrić as a reference player. He is probably one of the best examples of a pure creative midfielder of the last decade – excellent at progressing the ball, finding key passes, and controlling the tempo of a game. At 38 years old he is clearly at the end of his career, but his creative profile in this dataset is still ranked 14th among all midfielders with 900+ minutes. That makes him a very good reference point: we want to find young players who show a similar creative fingerprint.
The idea is simple: if a 19 or 20 year old already shows a Z-score profile close to Modrić’s across the creative metrics, that suggests they could develop into a similar type of player.
modric <- df_mf %>%
filter(Player == "Luka Modrić")
# Check he is in the sample
modric %>%
select(Player, Squad, Age_int, Min, score_creative, score_defensive,
         rank_creative, rank_defensive)
## Player Squad Age_int Min score_creative score_defensive
## 1 Luka Modrić Real Madrid 38 1827 1.833274 -0.4492258
## rank_creative rank_defensive
## 1 14 443
To measure how similar each U21 midfielder is to Modrić, we use Euclidean distance calculated on the Z-scores of the 8 creative metrics.
Euclidean distance measures the straight-line distance between two points in multi-dimensional space – in this case each player is a point defined by 8 coordinates (one per creative metric). The smaller the distance, the more similar the creative profile.
We calculate the distance between Modrić’s Z-score vector and every U21 midfielder’s Z-score vector, then rank by smallest distance.
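To make the distance concrete, here is a small worked example on toy Z-score profiles with three metrics instead of eight (all values hypothetical). It uses a vectorised base R form rather than the row-wise dplyr version used in the report:

```r
# Toy Z-score profiles on three metrics (hypothetical values)
ref <- c(1, 2, 0)          # reference player
Z <- rbind(
  same    = c(1, 2, 0),    # identical profile
  shifted = c(4, 6, 0)     # differs from the reference by (3, 4, 0)
)

# Subtract the reference from every row, square, sum per row, take the root
dist_ref <- sqrt(rowSums(sweep(Z, 2, ref)^2))
dist_ref
```

The identical profile is at distance 0 and the shifted one at sqrt(3^2 + 4^2) = 5. Because the distance works on Z-score levels, a player whose profile has the same shape but uniformly lower output still ends up far from the reference.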
# Z-score columns for creative metrics
z_creative <- c("z_xA.90", "z_KP.90", "z_FinalThirdPasses.90",
"z_PPA.90", "z_PassesProgressive.90",
"z_PrgDistPasses.90", "z_SCA.90", "z_Passes.")
# Modrić's Z-score vector
modric_vec <- as.numeric(modric[1, z_creative])
# U21 sample only
df_u21 <- df_mf %>%
filter(Age_int <= 20)
# Calculate euclidean distance for each U21 player vs Modrić
df_u21 <- df_u21 %>%
rowwise() %>%
mutate(
dist_modric = sqrt(sum(
(c_across(all_of(z_creative)) - modric_vec)^2
))
) %>%
ungroup()
# Top 3 most similar U21 players
top3_similar <- df_u21 %>%
arrange(dist_modric) %>%
select(Player, Squad, Competition, Age_int, Min,
score_creative, score_defensive,
rank_creative, dist_modric) %>%
head(3)
top3_similar
## # A tibble: 3 × 9
## Player Squad Competition Age_int Min score_creative score_defensive
## <chr> <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 Lamine Camara Monaco Ligue 1 20 2054 1.02 1.23
## 2 Tom Bischof Hoffen… Bundesliga 19 2559 0.933 2.07
## 3 Adam Wharton Crysta… Premier Le… 20 1318 0.684 0.374
## # ℹ 2 more variables: rank_creative <int>, dist_modric <dbl>
Now we build radar charts comparing Modrić with each of the 3 most similar U21 players.
The fmsb package requires a specific format: the first
row is the maximum value for each metric, the second row is the minimum,
and then the actual data rows follow. We use Z-scores for all metrics so
the scale is the same for everyone – a value of 0 means exactly average,
positive means above average, negative means below average compared to
all midfielders in the sample.
We set the min and max of the radar to -2.5 and +2.5 so the chart always covers a meaningful range.
# Friendly labels for the radar axes
radar_labels <- c("xA/90", "KP/90", "Final 3rd\nPasses",
"PPA/90", "Prog\nPasses", "Prog\nDist",
"SCA/90", "Pass%")
# Function to build the radar data frame for fmsb
build_radar <- function(players_df, z_cols, labels) {
# Extract Z-score rows for selected players
radar_data <- players_df %>%
select(all_of(z_cols)) %>%
as.data.frame()
rownames(radar_data) <- players_df$Player
colnames(radar_data) <- labels
# fmsb needs max row first, then min row, then data
max_row <- rep(2.5, length(labels))
min_row <- rep(-2.5, length(labels))
radar_data <- rbind(max_row, min_row, radar_data)
rownames(radar_data)[1:2] <- c("Max", "Min")
radar_data
}
# Get the top 3 similar player names
similar_names <- top3_similar$Player
# Combine Modrić + top 3 into one data frame, keeping the right order
players_to_compare <- df_mf %>%
filter(Player %in% c("Luka Modrić", similar_names)) %>%
arrange(match(Player, c("Luka Modrić", similar_names)))
# Build radar data
radar_df <- build_radar(players_to_compare, z_creative, radar_labels)
# Colors for the 4 players (reference + top 3 similar)
colors_fill <- c(
rgb(0.18, 0.47, 0.71, 0.10), # blue
rgb(0.84, 0.19, 0.15, 0.10), # red
rgb(0.13, 0.63, 0.31, 0.10), # green
rgb(0.95, 0.60, 0.07, 0.10) # orange
)
colors_line <- c(
rgb(0.18, 0.47, 0.71, 0.9),
rgb(0.84, 0.19, 0.15, 0.9),
rgb(0.13, 0.63, 0.31, 0.9),
rgb(0.95, 0.60, 0.07, 0.9)
)
player_names <- c("Luka Modrić", similar_names)
# Single combined radar chart
par(mar = c(2, 2, 3, 2))
radarchart(
radar_df,
axistype = 1,
pcol = colors_line,
pfcol = colors_fill,
plwd = 2.5,
cglcol = "grey70",
cglty = 1,
axislabcol = "grey40",
caxislabels = c("-2.5", "-1.25", "0", "1.25", "2.5"),
cglwd = 0.8,
vlcex = 0.9,
title = "Modrić vs most similar U21 midfielders – creative profile"
)
legend(
"bottomright",
legend = player_names,
col = colors_line,
lty = 1,
lwd = 2.5,
bty = "n",
cex = 0.85
)
Looking at the radar chart, we can see that all three U21 players share a similar shape to Modrić’s creative profile, though none of them match him on every metric. That is expected – Modrić at 38 is still producing at an elite level in some areas (like progressive passing distance and key passes), so matching him perfectly at age 19-20 would be unrealistic. The important thing is the general direction: these players are above average on the same dimensions where Modrić is above average, and that pattern is what makes them interesting from a scouting perspective.
It is also interesting that the scatter plot (below) shows all three similar U21 players score higher on defensive metrics than Modrić. This could mean they are being used in slightly deeper or more box-to-box roles at their current clubs, which is common for young midfielders who need to earn their place in the team before being given full creative freedom.
As a final overview, let’s place all midfielders on a scatter plot of Creative score vs Defensive score, highlighting Modrić and the 3 similar young players. This gives a good picture of where each player sits in the full landscape.
# Labels only for highlighted players
highlight_names <- c("Luka Modrić", similar_names)
df_mf %>%
mutate(
highlight = case_when(
Player == "Luka Modrić" ~ "Modrić",
Player %in% similar_names ~ "Similar U21",
Age_int <= 20 ~ "Other U21",
TRUE ~ "All MF"
),
label = ifelse(Player %in% highlight_names, Player, NA)
) %>%
ggplot(aes(x = score_creative, y = score_defensive,
color = highlight, size = highlight)) +
geom_point(alpha = 0.6) +
geom_text(aes(label = label), vjust = -0.8, size = 3.2,
show.legend = FALSE) +
scale_color_manual(values = c(
"All MF" = "grey75",
"Other U21" = "#91bfdb",
"Similar U21" = "#d73027",
"Modrić" = "#1a6faf"
)) +
scale_size_manual(values = c(
"All MF" = 1.5,
"Other U21" = 2,
"Similar U21" = 4,
"Modrić" = 5
)) +
labs(
title = "Creative vs Defensive score – all midfielders",
subtitle = "Modrić and 3 most similar U21 players highlighted",
x = "Creative score (Z-score average)",
y = "Defensive score (Z-score average)",
color = NULL,
size = NULL
) +
theme_minimal() +
  theme(legend.position = "bottom")
This analysis looked at midfielders across 7 competitions in the 2024/25 season, focusing on creative playmaking ability. By separating the creative and defensive scores we avoided penalising pure playmakers for low defensive output – a player like Modrić scores low on defense but that is expected and fine for his role.
The similarity search identified the three U21 midfielders whose creative Z-score profile is closest to Modrić’s across all 8 dimensions. It is worth pointing out that none of these players are trying to be “the next Modrić” – what the algorithm picks up is a statistical fingerprint, not a playing style in the tactical sense. Still, the fact that their creative output distributes in a similar way across passing progression, chance creation and shot-creating actions suggests they could develop into a similar type of midfielder over time.
From a scouting perspective, this type of multi-dimensional comparison is more useful than looking at a single metric like key passes or assists. A player might rank high on xA/90 but be average at ball progression, or the other way around. The Euclidean distance approach captures the full shape of a player’s creative profile, which is where the real value is.
This analysis also has some clear limitations. The data covers different leagues with different levels of competition, which affects the raw per-90 numbers, and Z-score normalisation within the full sample does not remove that effect. A more advanced version could normalise within each competition separately, or apply a competition difficulty weight. The sample only covers one season, so players who had injuries or limited minutes early on might be underrepresented. And of course, statistical similarity does not guarantee tactical similarity: a coach would need to watch the actual footage before making any decision based on these numbers. But for an initial screening step, this approach gives a solid and interpretable starting point.
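The per-competition normalisation mentioned above could be sketched as follows. This is not something the report does; the league names and values are toy data, and the helper z() is a hypothetical name introduced only for this example:

```r
# Toy data: two hypothetical leagues with very different raw levels
toy <- data.frame(
  Competition = c("League A", "League A", "League A",
                  "League B", "League B", "League B"),
  KP.90       = c(1, 2, 3, 10, 20, 30)
)

# Hypothetical helper: plain Z-score of a numeric vector
z <- function(v) (v - mean(v)) / sd(v)

toy$z_pooled <- z(toy$KP.90)                             # what the report does: one pool
toy$z_within <- ave(toy$KP.90, toy$Competition, FUN = z) # Z-score inside each league

toy
```

Within each league the values become -1, 0 and 1, so a player is compared only to peers in the same competition. This swaps one assumption for another: it treats every league's average midfielder as equally good, which is why a competition difficulty weight might still be needed.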