In this project, I will crunch the numbers across Europe's top 5 leagues and assess Arsenal's options in their pursuit of the perfect number 9.
It is an area where Arsenal actually have quite a few options: Gabriel Jesus, Viktor Gyokeres, Kai Havertz and even Mikel Merino have done solid jobs in the role, and each has produced impressive purple patches. That said, through inconsistent form and injury issues, Arsenal have yet to find the number 9 who, in my opinion, would fire them to domination in England, and perhaps Europe.
Amongst their current options:

* Gabriel Jesus offers versatility, pressing intensity and outstanding link-up play. On the other hand, he is often wasteful in front of goal and has suffered a series of unfortunate injuries over the last 2-3 years.
* Kai Havertz excels in aerial and ground duels, shows very intelligent off-the-ball movement and link-up play, and offers an incredible aerial threat, particularly from set pieces. However, like Gabriel Jesus, his finishing has been hot and cold, and he has also suffered a series of injuries that have left him unavailable for extended periods.
* Viktor Gyokeres is powerful and robust, and offers an exceptional work ethic with an eye for goal (he was Europe's top scorer last season with an impressive 54 goals across all competitions). While there is still potential for him to grow and adapt, he has no doubt struggled to get to grips with the pace of the Premier League.
* Mikel Merino has proved to be a more than capable back-up in the role, but he is a natural central midfielder and his talents are best served there.
Our purpose is to identify the best centre forward to fit Arsenal's style of play, using data from the top 5 leagues in the 2024/25 season.

To address this topic, we will use a 4-tier analytical approach:

* Tier 1 - Elite Players: world-class centre forwards who may be unattainable but are still useful as a comparison.
* Tier 2 - Realistic Targets: proven players who are attainable and aged 23-27.
* Tier 3 - Value Options: short-term solutions from smaller clubs (age 25-27).
* Tier 4 - Young Prospects: high-potential, developing strikers (age 19-23).

For each tier we will:

* Develop a scoring system to evaluate candidates for Arsenal.
* Apply a similarity algorithm (Euclidean and cosine) to find players who are a stylistic match.
* Create charts and visualisations to compare some top candidates.
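To make the two similarity measures concrete, here is a minimal sketch in base R. The three profiles are made-up numbers, not rows from the dataset, and `cosine_distance` is a hypothetical helper written for this illustration (the analysis itself relies on the `lsa` package for cosine similarity). It shows that Euclidean distance reacts to volume, while cosine distance only reacts to the shape of a profile.

```r
# Toy per-90 profiles (made-up numbers, not taken from the dataset)
profiles <- rbind(
  striker_a = c(goals = 0.60, assists = 0.10, dribbles = 0.50),
  striker_b = c(goals = 0.30, assists = 0.05, dribbles = 0.25),  # same shape, half the volume
  creator_c = c(goals = 0.10, assists = 0.40, dribbles = 1.50)
)

# Euclidean distance: straight-line gap between two profiles
euclid <- as.matrix(dist(profiles, method = "euclidean"))

# Cosine distance: 1 minus the cosine of the angle between two profiles
cosine_distance <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
cos_ab <- cosine_distance(profiles["striker_a", ], profiles["striker_b", ])
cos_ac <- cosine_distance(profiles["striker_a", ], profiles["creator_c", ])

print(round(euclid, 3))
cat("Cosine distance A-B:", round(cos_ab, 3), "\n")
cat("Cosine distance A-C:", round(cos_ac, 3), "\n")
```

Striker B is a scaled-down copy of striker A, so the cosine distance between them is effectively zero even though their Euclidean distance is not. This is why it can be worth running both measures: one finds players with similar output, the other finds players with a similar style.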
# Load the data from the same directory as the .Rmd file
data <- read.csv("FBREF_BigPlayers_2425.csv", sep=";", encoding="UTF-8")
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# Next we load the relevant libraries
library(tidyverse) # Data manipulation
library(fmsb) # Radar charts
library(lsa) # Cosine similarity
library(knitr) # Tables
library(DT) # Interactive tables
library(scales) # Color scales
library(ggplot2)
head(data)
## Player Squad Nation Pos Age MP Min Gls G.PK Ast xG xAG
## 1 Abdoulie Ceesay St. Pauli GAM FW 20 7 60 0 0 0 0.0 0.0
## 2 Adam Aznou Bayern Munich MAR DF 18 2 17 0 0 0 0.0 0.0
## 3 Adam Dźwigała St. Pauli POL DF,MF 28 16 373 0 0 0 0.4 0.0
## 4 Adam Hložek Hoffenheim CZE FW,MF 22 27 1871 8 8 4 5.6 3.5
## 5 Adrian Beck Heidenheim GER MF,FW 27 32 1598 4 4 1 3.3 1.2
## 6 Alassane Pléa Gladbach FRA MF,FW 31 30 1902 11 10 4 7.4 3.1
## Gls.90 G.PK.90 Ast.90 xG.90 xAG.90 Competition Sh Sh.90 SoT.90 Passes.
## 1 0.00 0.00 0.00 0.00 0.03 Bundesliga 0 0.00 0.00 57.1
## 2 0.00 0.00 0.00 0.00 0.00 Bundesliga 0 0.00 0.00 76.5
## 3 0.00 0.00 0.00 0.09 0.00 Bundesliga 6 1.45 0.00 81.6
## 4 0.38 0.38 0.19 0.27 0.17 Bundesliga 59 2.84 1.15 68.5
## 5 0.23 0.23 0.06 0.18 0.07 Bundesliga 39 2.20 0.73 80.1
## 6 0.52 0.47 0.19 0.35 0.15 Bundesliga 41 1.94 0.57 72.1
## ShortPasses. MediumPasses. LongPasses. A.xAG TklW.90 Blocks.90 Int.90
## 1 70.0 0.0 0.0 0.0 0.43 0.29 0.00
## 2 90.0 66.7 0.0 0.0 0.00 0.00 0.00
## 3 91.0 91.4 40.0 0.0 0.62 0.38 0.25
## 4 73.8 75.0 54.8 0.5 0.37 0.48 0.15
## 5 83.3 87.3 72.4 -0.2 0.53 0.97 0.28
## 6 83.1 74.2 63.2 0.9 0.10 0.37 0.07
## Tkl.Int.90 Clr.90 Touches.90 Dribbles.90 Dribbles. SCA.90 GCA.90 Aerial.
## 1 0.43 0.14 3.71 0.00 0.0 1.48 0.00 0.0
## 2 0.00 0.50 9.00 0.00 0.0 0.00 0.00 0.0
## 3 1.50 1.75 17.50 0.06 50.0 1.21 0.00 76.5
## 4 0.96 0.74 28.19 1.22 51.6 2.74 0.48 50.6
## 5 1.41 0.53 30.06 1.25 53.3 3.16 0.23 59.2
## 6 0.30 0.23 32.63 0.57 39.5 3.41 0.66 31.9
## Points.90 xGA OG PSxG PSxG.SoT PSxG... GA GA.90 Save. CS. SoTA.90 SoTA.GA
## 1 0.43 1.4 0 0 0 0 0 0 0 0 0 0
## 2 3.00 0.8 0 0 0 0 0 0 0 0 0 0
## 3 0.88 9.0 0 0 0 0 0 0 0 0 0 0
## 4 0.85 34.4 0 0 0 0 0 0 0 0 0 0
## 5 0.91 29.2 0 0 0 0 0 0 0 0 0 0
## 6 1.17 35.7 0 0 0 0 0 0 0 0 0 0
## SoT.G G.xG PassesCompleted.90 PassesAttempted.90 ShortPassesCompleted.90
## 1 0.00 0.0 1.14 2.00 1.00
## 2 0.00 0.0 6.50 8.50 4.50
## 3 0.00 -0.4 10.81 13.25 4.44
## 4 3.00 2.4 12.89 18.81 6.56
## 5 3.25 0.7 18.16 22.66 8.56
## 6 1.20 3.6 19.10 26.50 10.30
## MediumPassesCompleted.90 LongPassesCompleted.90 TotDistPasses.90
## 1 0.00 0.00 10.00
## 2 2.00 0.00 89.00
## 3 5.31 0.75 183.44
## 4 4.67 0.85 198.19
## 5 7.28 1.72 309.41
## 6 6.33 1.60 303.93
## PrgDistPasses.90 xA.90 KP.90 FinalThirdPasses.90 PPA.90 CrsPA.90
## 1 0.29 0.00 0.14 0.00 0.00 0.00
## 2 22.00 0.00 0.00 0.50 0.00 0.00
## 3 75.50 0.01 0.00 0.69 0.06 0.00
## 4 48.19 0.12 0.74 1.37 0.70 0.15
## 5 70.56 0.06 0.66 1.66 0.56 0.12
## 6 100.97 0.11 0.83 2.67 1.17 0.27
## PassesProgressive.90 xGA.90 Recov.90 Fls.90 Fld.90 AerialW.90 xGD xGD.90
## 1 0.00 0.20 0.57 0.29 0.43 0.00 -1.4 -0.20
## 2 0.50 0.40 0.50 0.00 0.00 0.00 -0.8 -0.40
## 3 0.88 0.56 0.81 0.31 0.19 0.81 -8.6 -0.47
## 4 2.52 1.27 2.44 0.56 1.11 1.48 -28.8 -1.00
## 5 2.16 0.91 4.09 0.50 0.44 0.91 -25.9 -0.73
## 6 3.87 1.19 2.20 0.57 1.10 0.50 -28.3 -0.84
## MP_Squad
## 1 34
## 2 34
## 3 34
## 4 34
## 5 34
## 6 34
list.files("data")
## [1] "FBREF_BigClubes_2324.csv" "FBREF_BigClubes_2425.csv"
## [3] "FBREF_BigPlayers_2324.csv" "FBREF_BigPlayers_2425.csv"
# Define the select_players function
select_players <- function(file, Encoding, position, competition, primary, player) {
  # 1. Read the file
  data <- read.csv(file, sep = ";", encoding = Encoding)
  # 2. Filter by competition
  if (length(competition) == 1 && competition == "ALL") {
    cat("We consider all players from all competitions\n")
    data_comp <- data
  } else {
    cat("We filter by competition\n")
    data_comp <- data %>%
      filter(Competition %in% competition)
  }
  # 3. Filter by position
  if (primary){
    # Matches the first listed position, so "FW,MF" and "FW,DF" are kept as well
    cat("We keep players whose main position is:", position, "\n")
    data_players <- data_comp %>%
      filter(substr(Pos, 1, 2) == position)
  } else {
    cat("We keep players whose position contains:", position, "\n")
    data_players <- data_comp %>%
      filter(grepl(position, Pos))
  }
  # 4. Filter by specific player(s) if needed
  #    all(is.na()) also works when `player` is a vector of several names
  if (!all(is.na(player))){
    data_players <- data_players %>% filter(Player %in% player)
  }
  # 5. Return the filtered data
  return(data_players)
}
# Now call the function with all the parameters
df_forwards <- select_players(
  file = "FBREF_BigPlayers_2425.csv",
  Encoding = "UTF-8",
  position = "FW",
  competition = c("Premier League", "La Liga", "Serie A", "Bundesliga", "Ligue 1"),
  primary = TRUE,
  player = NA
)
## We filter by competition
## We keep players whose main position is: FW
cat("\nTotal Centre Forwards in Top 5 Leagues:", nrow(df_forwards), "\n")
##
## Total Centre Forwards in Top 5 Leagues: 723
cat("(Primary position FW; FBref's FW label also covers wingers)\n")
## (Primary position FW; FBref's FW label also covers wingers)
head(df_forwards)
## Player Squad Nation Pos Age MP Min Gls G.PK Ast
## 1 Abdoulie Ceesay St. Pauli GAM FW 20 7 60 0 0 0
## 2 Adam Hložek Hoffenheim CZE FW,MF 22 27 1871 8 8 4
## 3 Alexander Bernhardsson Holstein Kiel SWE FW,MF 25 19 1263 7 7 2
## 4 Andreas Albers St. Pauli DEN FW 34 14 62 1 1 0
## 5 Andreas Skov Olsen Wolfsburg DEN FW,MF 24 12 535 1 1 1
## 6 Andrej Ilic Union Berlin SRB FW 24 16 937 7 6 0
## xG xAG Gls.90 G.PK.90 Ast.90 xG.90 xAG.90 Competition Sh Sh.90 SoT.90
## 1 0.0 0.0 0.00 0.00 0.00 0.00 0.03 Bundesliga 0 0.00 0.00
## 2 5.6 3.5 0.38 0.38 0.19 0.27 0.17 Bundesliga 59 2.84 1.15
## 3 4.1 1.3 0.50 0.50 0.14 0.29 0.10 Bundesliga 27 1.92 1.00
## 4 0.6 0.1 1.45 1.45 0.00 0.90 0.14 Bundesliga 3 4.35 1.45
## 5 0.8 1.9 0.17 0.17 0.17 0.14 0.32 Bundesliga 7 1.18 0.17
## 6 4.9 0.3 0.67 0.58 0.00 0.47 0.03 Bundesliga 27 2.59 1.06
## Passes. ShortPasses. MediumPasses. LongPasses. A.xAG TklW.90 Blocks.90 Int.90
## 1 57.1 70.0 0.0 0.0 0.0 0.43 0.29 0.00
## 2 68.5 73.8 75.0 54.8 0.5 0.37 0.48 0.15
## 3 60.5 71.0 64.9 36.4 0.7 0.79 0.74 0.26
## 4 51.7 55.6 66.7 50.0 -0.1 0.07 0.00 0.00
## 5 80.4 85.7 87.6 55.0 -0.9 0.50 0.33 0.17
## 6 57.6 69.2 46.2 20.0 -0.3 0.12 0.44 0.31
## Tkl.Int.90 Clr.90 Touches.90 Dribbles.90 Dribbles. SCA.90 GCA.90 Aerial.
## 1 0.43 0.14 3.71 0.00 0.0 1.48 0.00 0.0
## 2 0.96 0.74 28.19 1.22 51.6 2.74 0.48 50.6
## 3 1.79 1.32 27.37 1.16 41.5 2.57 0.50 35.8
## 4 0.07 0.07 3.14 0.14 66.7 2.95 0.00 45.5
## 5 0.83 1.00 27.42 0.33 28.6 5.38 0.17 50.0
## 6 0.44 0.38 18.88 0.06 14.3 1.44 0.19 48.1
## Points.90 xGA OG PSxG PSxG.SoT PSxG... GA GA.90 Save. CS. SoTA.90 SoTA.GA
## 1 0.43 1.4 0 0 0 0 0 0 0 0 0 0
## 2 0.85 34.4 0 0 0 0 0 0 0 0 0 0
## 3 0.79 23.3 0 0 0 0 0 0 0 0 0 0
## 4 0.86 2.1 0 0 0 0 0 0 0 0 0 0
## 5 1.17 9.2 0 0 0 0 0 0 0 0 0 0
## 6 1.25 16.3 0 0 0 0 0 0 0 0 0 0
## SoT.G G.xG PassesCompleted.90 PassesAttempted.90 ShortPassesCompleted.90
## 1 0.00 0.0 1.14 2.00 1.00
## 2 3.00 2.4 12.89 18.81 6.56
## 3 2.00 2.9 10.16 16.79 4.89
## 4 1.00 0.4 1.07 2.07 0.71
## 5 1.00 0.2 18.08 22.50 8.50
## 6 1.83 2.1 6.88 11.94 4.50
## MediumPassesCompleted.90 LongPassesCompleted.90 TotDistPasses.90
## 1 0.00 0.00 10.00
## 2 4.67 0.85 198.19
## 3 3.89 0.84 166.53
## 4 0.29 0.07 16.64
## 5 7.08 1.83 306.42
## 6 1.50 0.06 76.44
## PrgDistPasses.90 xA.90 KP.90 FinalThirdPasses.90 PPA.90 CrsPA.90
## 1 0.29 0.00 0.14 0.00 0.00 0.00
## 2 48.19 0.12 0.74 1.37 0.70 0.15
## 3 56.63 0.06 0.89 0.84 0.53 0.21
## 4 3.36 0.01 0.14 0.07 0.00 0.00
## 5 64.00 0.11 1.50 1.00 0.83 0.17
## 6 19.62 0.01 0.38 1.00 0.00 0.00
## PassesProgressive.90 xGA.90 Recov.90 Fls.90 Fld.90 AerialW.90 xGD xGD.90
## 1 0.00 0.20 0.57 0.29 0.43 0.00 -1.4 -0.20
## 2 2.52 1.27 2.44 0.56 1.11 1.48 -28.8 -1.00
## 3 1.74 1.23 2.37 1.11 0.79 1.00 -19.2 -0.94
## 4 0.21 0.15 0.21 0.21 0.14 0.71 -1.5 0.75
## 5 1.83 0.77 1.50 0.17 0.25 0.25 -8.4 -0.63
## 6 0.81 1.02 1.00 0.88 0.50 3.25 -11.4 -0.55
## MP_Squad
## 1 34
## 2 34
## 3 34
## 4 34
## 5 34
## 6 34
The call above uses primary = TRUE to keep only players whose first-listed position is FW across Europe's top 5 leagues. Hybrid profiles such as FW,MF and FW,DF remain in the sample, and FBref's FW label covers wingers as well as centre forwards, so the pool is broader than out-and-out 9s.
unique(df_forwards$Competition)
## [1] "Bundesliga" "La Liga" "Ligue 1" "Premier League"
## [5] "Serie A"
cat("Position distribution:\n")
## Position distribution:
table(df_forwards$Pos)
##
## FW FW,DF FW,MF
## 375 24 324
cat("\nNote: primary = TRUE matches the first listed position, so 'FW,MF' and 'FW,DF' hybrids remain\n")
##
## Note: primary = TRUE matches the first listed position, so 'FW,MF' and 'FW,DF' hybrids remain
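The position table shows three labels because the primary filter only checks the first two characters of Pos. A minimal sketch on a made-up vector of position labels (same comma-separated style as the Pos column) compares that rule with the two alternatives, a "contains" match and an exact match:

```r
# Toy position labels in the same comma-separated style as the Pos column
pos <- c("FW", "FW,MF", "MF,FW", "FW,DF", "DF")

primary_fw  <- substr(pos, 1, 2) == "FW"  # first listed position is FW
contains_fw <- grepl("FW", pos)           # FW appears anywhere in the label
pure_fw     <- pos == "FW"                # FW is the only listed position

print(data.frame(pos, primary_fw, contains_fw, pure_fw))
```

If only out-and-out forwards are wanted, an exact match such as `filter(Pos == "FW")` would be one option; on this dataset it would keep the 375 rows labelled FW. Whether to drop the hybrids is a judgment call, since many forwards are logged with a secondary position.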
filter_players <- function(data, metrics, pct_min_minutes, age_max, age_min = 0){
  # Keep players above the minutes threshold and inside the age range,
  # then select the metrics that define our sample
  data_filter <- data %>%
    filter(
      Min > round((pct_min_minutes * 90 * MP_Squad) / 100),
      Age <= age_max,
      Age >= age_min
    ) %>%
    select(c("Player", "Squad", "Age", "Competition", all_of(metrics)))
  # seq_len() stays valid even if no player passes the filter
  rownames(data_filter) <- seq_len(nrow(data_filter))
  return(data_filter)
}
# Key metrics for Arsenal's CF profile.
list_metrics <- c(
  "G.PK.90",     # Non-penalty goals per 90
  "xG.90",       # xG per 90
  "Ast.90",      # Assists per 90
  "xAG.90",      # Expected assisted goals (xAG) per 90
  "SCA.90",      # Shot-creating actions per 90
  "GCA.90",      # Goal-creating actions per 90
  "TklW.90",     # Tackles won per 90
  "Recov.90",    # Ball recoveries per 90
  "Dribbles.90", # Successful dribbles per 90
  "Aerial.",     # Aerial duel win percentage
  "Sh.90",       # Shots per 90
  "SoT.90"       # Shots on target per 90
)
df_forwards_filter <- filter_players(
data = df_forwards,
metrics = list_metrics,
pct_min_minutes = 50,
age_max = 28,
age_min = 18
)
cat("Centre forwards meeting criteria:", nrow(df_forwards_filter), "\n")
## Centre forwards meeting criteria: 156
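As a quick check on what pct_min_minutes = 50 means in practice, the threshold inside filter_players can be computed on its own. This is only a sketch: `min_minutes` is a hypothetical helper that repeats the same arithmetic, and the 38-match case assumes MP_Squad is 38 for leagues with a 38-game season.

```r
# Minutes threshold used by the filter: pct of (90 minutes x squad matches)
min_minutes <- function(pct_min_minutes, mp_squad) {
  round((pct_min_minutes * 90 * mp_squad) / 100)
}

min_minutes(50, 34)  # 34-match season, as in the MP_Squad values printed earlier
min_minutes(50, 38)  # assumption: a 38-match season
```

So a player needs more than 1530 league minutes in a 34-match season, or more than 1710 in a 38-match season, to enter the sample. Because the threshold scales with MP_Squad, players are held to the same share of their own team's season rather than to a fixed number of minutes.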
duplicated_players <- df_forwards_filter[
duplicated(df_forwards_filter$Player), ]$Player
if(length(duplicated_players) > 0) {
cat("Duplicated players:", paste(duplicated_players, collapse = ", "), "\n")
}else {
cat("No duplicated players found\n")
}
## No duplicated players found
df_forwards_rename <- df_forwards_filter %>%
rename(
'Non Penalty Goals/90' = 'G.PK.90',
'Expected Goals/90' = 'xG.90',
'Assists/90' = 'Ast.90',
'Expected Assists/90' = 'xAG.90',
'Shot Creating Actions/90' = 'SCA.90',
'Goal Creating Actions/90' = 'GCA.90',
'Tackles Won/90' = 'TklW.90',
'Recoveries/90' = 'Recov.90',
'Dribbles/90' = 'Dribbles.90',
'Aerial Win %' = 'Aerial.',
'Shots/90' = 'Sh.90',
'Shots on Target/90' = 'SoT.90'
)
head(df_forwards_rename)
## Player Squad Age Competition Non Penalty Goals/90
## 1 Adam Hložek Hoffenheim 22 Bundesliga 0.38
## 2 Benedict Hollerbach Union Berlin 23 Bundesliga 0.32
## 3 Benjamin Šeško RB Leipzig 21 Bundesliga 0.42
## 4 Deniz Undav Stuttgart 28 Bundesliga 0.47
## 5 Ermedin Demirović Stuttgart 26 Bundesliga 0.73
## 6 Hugo Ekitike Eint Frankfurt 22 Bundesliga 0.49
## Expected Goals/90 Assists/90 Expected Assists/90 Shot Creating Actions/90
## 1 0.27 0.19 0.17 2.74
## 2 0.25 0.04 0.06 2.80
## 3 0.38 0.19 0.08 1.93
## 4 0.61 0.16 0.19 3.02
## 5 0.69 0.05 0.09 2.28
## 6 0.76 0.28 0.24 3.55
## Goal Creating Actions/90 Tackles Won/90 Recoveries/90 Dribbles/90
## 1 0.48 0.37 2.44 1.22
## 2 0.25 0.65 3.26 1.24
## 3 0.34 0.09 1.88 1.18
## 4 0.42 0.37 1.78 0.37
## 5 0.34 0.21 0.97 0.15
## 6 0.42 0.36 2.64 1.58
## Aerial Win % Shots/90 Shots on Target/90
## 1 50.6 2.84 1.15
## 2 30.9 2.69 0.85
## 3 58.8 2.50 1.10
## 4 40.4 3.91 1.62
## 5 42.0 3.11 1.31
## 6 46.8 4.00 1.55
scoring_calculate <- function(sample, metrics, weights){
  # Check that there is one weight per metric
  if (length(weights) != length(metrics)){
    stop("weights and metrics must have the same length")
  }
  # Check that weights sum to 1 (with a tolerance, since decimal
  # weights are stored as floating point numbers)
  if (!isTRUE(all.equal(sum(weights), 1))){
    stop("The sum of weights must be equal to 1")
  }
  # Normalize metrics (0-100 scale)
  data_scaled <- sample
  for (metric in metrics){
    max_value <- max(sample[[metric]], na.rm = TRUE)
    min_value <- min(sample[[metric]], na.rm = TRUE)
    # Scale to 0-100
    data_scaled[[metric]] <- ((sample[[metric]] - min_value) /
      (max_value - min_value)) * 100
  }
  # Calculate weighted score
  data_scaled$Score <- 0
  for (i in seq_along(metrics)){
    data_scaled$Score <- data_scaled$Score +
      (data_scaled[[metrics[i]]] * weights[i])
  }
  # Add ranking
  data_scaled <- data_scaled %>%
    arrange(desc(Score)) %>%
    mutate(Rank = row_number())
  return(data_scaled)
}
weights_arsenal_cf <- c(
0.25, # Non penalty Goals/90 (Increased Weight - Main job is to score)
0.18, # Expected Goals/90 (Increased Weight to show CF getting into positions)
0.08, # Assists/90
0.08, # Expected Assists/90
0.10, # Shot Creating Actions/90
0.06, # Goal Creating Actions/90
0.08, # Tackles Won/90 (Pressing)
0.04, # Recoveries/90
0.04, # Dribbles/90
0.06, # Aerial Win %
0.02, # Shots/90
0.01 # Shots on Target/90
)
# Verify weights sum to 1
cat("Total Weight:", sum(weights_arsenal_cf), "\n\n")
## Total Weight: 1
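Printing the total is a useful sanity check, but the printed value is rounded for display. When a weight check is done in code, comparing with a tolerance is usually safer than exact equality, because decimal fractions are stored in binary floating point. A minimal sketch (`weights_sum_to_one` is a hypothetical helper, not part of the analysis code):

```r
# Decimal fractions are stored in binary floating point, so exact
# equality checks on sums can fail even when the arithmetic looks right.
exact_check     <- (0.1 + 0.2) == 0.3
tolerance_check <- isTRUE(all.equal(0.1 + 0.2, 0.3))

cat("Exact comparison:    ", exact_check, "\n")
cat("Tolerance comparison:", tolerance_check, "\n")

# Hypothetical helper: validate a weight vector with a tolerance
weights_sum_to_one <- function(weights) {
  isTRUE(all.equal(sum(weights), 1))
}
```

The exact comparison returns FALSE and the tolerance comparison returns TRUE, which is why a check written as `sum(weights) != 1` can reject a perfectly valid set of weights.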
# Next, we calculate scores
data_final <- scoring_calculate(
sample = df_forwards_filter,
metrics = list_metrics,
weights = weights_arsenal_cf
)
top_20 <- data_final %>%
select(Rank, Player, Squad, Age, Competition, Score, 'G.PK.90', 'Aerial.') %>%
head(20) %>%
mutate(across(where(is.numeric), ~round(., 2)))
kable(top_20, caption = "Top 20 Strikers for Arsenal")
| Rank | Player | Squad | Age | Competition | Score | G.PK.90 | Aerial. |
|---|---|---|---|---|---|---|---|
| 1 | Ousmane Dembélé | PSG | 27 | Ligue 1 | 75.61 | 97.20 | 59.44 |
| 2 | Michael Olise | Bayern Munich | 22 | Bundesliga | 63.79 | 42.99 | 71.03 |
| 3 | Raphinha | Barcelona | 27 | La Liga | 60.28 | 47.66 | 89.15 |
| 4 | Bradley Barcola | PSG | 21 | Ligue 1 | 59.93 | 54.21 | 99.11 |
| 5 | Kylian Mbappé | Real Madrid | 25 | La Liga | 57.22 | 69.16 | 59.44 |
| 6 | Leroy Sané | Bayern Munich | 28 | Bundesliga | 55.20 | 56.07 | 79.94 |
| 7 | Bukayo Saka | Arsenal | 22 | Premier League | 54.38 | 24.30 | 49.48 |
| 8 | Mateo Retegui | Atalanta | 25 | Serie A | 54.00 | 73.83 | 63.74 |
| 9 | Désiré Doué | PSG | 19 | Ligue 1 | 53.94 | 28.97 | 65.97 |
| 10 | Hugo Ekitike | Eint Frankfurt | 22 | Bundesliga | 53.39 | 45.79 | 69.54 |
| 11 | Rayan Cherki | Lyon | 20 | Ligue 1 | 51.51 | 32.71 | 49.48 |
| 12 | Patrik Schick | Leverkusen | 28 | Bundesliga | 48.19 | 100.00 | 66.86 |
| 13 | Mason Greenwood | Marseille | 22 | Ligue 1 | 47.73 | 42.06 | 79.20 |
| 14 | Nick Woltemade | Stuttgart | 22 | Bundesliga | 47.56 | 51.40 | 64.04 |
| 15 | Vinicius Júnior | Real Madrid | 24 | La Liga | 47.28 | 33.64 | 18.57 |
| 16 | Luis Díaz | Liverpool | 27 | Premier League | 46.90 | 45.79 | 37.89 |
| 17 | Alexander Isak | Newcastle Utd | 24 | Premier League | 46.00 | 57.94 | 47.70 |
| 18 | Serhou Guirassy | Dortmund | 28 | Bundesliga | 45.88 | 57.94 | 78.60 |
| 19 | Riccardo Orsolini | Bologna | 27 | Serie A | 44.82 | 54.21 | 81.43 |
| 20 | Erling Haaland | Manchester City | 24 | Premier League | 43.98 | 57.94 | 79.20 |
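The scoring step can be sketched on toy numbers to show what the Score column represents. Everything below is made up for illustration (the players, the two metrics `goals90` and `aerial`, and the weights); it follows the same two steps as scoring_calculate, min-max scaling to 0-100 and then a weighted sum.

```r
# Toy sample (made-up values) with two metrics
toy <- data.frame(
  Player  = c("A", "B", "C"),
  goals90 = c(0.2, 0.5, 0.8),
  aerial  = c(60, 30, 45)
)
toy_weights <- c(goals90 = 0.7, aerial = 0.3)  # hypothetical weights

# Min-max scale each metric to 0-100 within the sample
scale_0_100 <- function(x) (x - min(x)) / (max(x) - min(x)) * 100
scaled <- as.data.frame(lapply(toy[names(toy_weights)], scale_0_100))

# Weighted sum of the scaled metrics
toy$Score <- as.numeric(as.matrix(scaled) %*% toy_weights)
toy <- toy[order(-toy$Score), ]
print(toy)
```

Player C tops the toy ranking because the heavily weighted metric dominates. A scaled value of 100 only means "best in this sample", so scores are comparable within one filtered pool but not across pools built with different filters.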
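Note that the G.PK.90 and Aerial. columns in the Top 20 table are the 0-100 scaled values produced by the scoring function, not raw per-90 figures or percentages. We will now separate our pool of talent into tiers, based on age, club, profile and the realism of making the signing.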
# Elite clubs and direct rivals, where the chances of a signing are much lower.
# Names must match the Squad column exactly (the data uses short forms such as "Newcastle Utd").
elite_clubs <- c(
  "Manchester City", "Real Madrid", "Barcelona", "Bayern Munich", "Tottenham", "Liverpool", "Chelsea", "Manchester United", "PSG"
)
#Tier 1: Elite Benchmarks (age <= 27, highest Score regardless of team)
tier1_elite <- data_final %>%
  filter(Age <= 27) %>%
  arrange(desc(Score)) %>%
  head(10) %>%
  mutate(Tier = "Tier 1: Elite Benchmark")
#Tier 2: Realistic Targets (age 23-27, not at the elite clubs or rivals) - MAIN FOCUS AREA
tier2_realistic <- data_final %>%
filter(
Age >= 23, Age <= 27,
!Squad %in% elite_clubs
) %>%
mutate(Tier = "Tier 2: Realistic Target")
#Tier 3A: Value Opportunities (Age 24-27, experienced players at smaller clubs)
tier3a_value <- data_final %>%
filter(
Age >= 24, Age <= 27,
!Squad %in% elite_clubs
) %>%
arrange(desc(Score)) %>%
head(10) %>%
mutate(Tier = "Tier 3A: Value Options")
#Tier 3B: Young Prospects (age 19-23, top 10 by Score). No extra position filter is applied here, so wide forwards can appear alongside centre forwards.
tier3b_prospects <- data_final %>%
  filter(Age >= 19, Age <= 23) %>%
  arrange(desc(Score)) %>%
  head(10) %>%
  mutate(Tier = "Tier 3B: Young Prospects")
cat("Tier 3B Young Prospects (age 19-23, top 10 by Score):\n")
## Tier 3B Young Prospects (age 19-23, top 10 by Score):
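The tier code refers to dotted column names such as G.PK.90 both in single quotes and in backticks. Inside select() either form picks the column, but inside arrange(desc(...)) a quoted name is just a string, so the rows are sorted by a constant and the order does not change. A minimal sketch on made-up data, assuming dplyr is installed:

```r
library(dplyr)

# Toy data with a dotted column name, as produced by read.csv
toy <- data.frame(Player = c("A", "B", "C"), G.PK.90 = c(0.2, 0.8, 0.5))

by_string   <- toy %>% arrange(desc("G.PK.90"))           # sorts by a constant: order unchanged
by_backtick <- toy %>% arrange(desc(`G.PK.90`))           # sorts by the column
by_pronoun  <- toy %>% arrange(desc(.data[["G.PK.90"]]))  # column looked up by its name

print(by_string$Player)
print(by_backtick$Player)
```

Backticks are the safer habit for any column whose name is not a plain R identifier. The `.data[["..."]]` form is handy when the column name is held in a character variable.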
tier1_display <- tier1_elite %>%
select(Player, Squad, Age, `G.PK.90`, `xG.90`, `Aerial.`, Score) %>%
mutate(across(where(is.numeric), ~round(., 2)))
kable(tier1_display, caption = "Tier 1: Elite Benchmark Centre Forwards")
| Player | Squad | Age | G.PK.90 | xG.90 | Aerial. | Score |
|---|---|---|---|---|---|---|
| Ousmane Dembélé | PSG | 27 | 97.20 | 100.00 | 59.44 | 75.61 |
| Michael Olise | Bayern Munich | 22 | 42.99 | 42.86 | 71.03 | 63.79 |
| Raphinha | Barcelona | 27 | 47.66 | 70.24 | 89.15 | 60.28 |
| Bradley Barcola | PSG | 21 | 54.21 | 63.10 | 99.11 | 59.93 |
| Kylian Mbappé | Real Madrid | 25 | 69.16 | 92.86 | 59.44 | 57.22 |
| Bukayo Saka | Arsenal | 22 | 24.30 | 40.48 | 49.48 | 54.38 |
| Mateo Retegui | Atalanta | 25 | 73.83 | 82.14 | 63.74 | 54.00 |
| Désiré Doué | PSG | 19 | 28.97 | 29.76 | 65.97 | 53.94 |
| Hugo Ekitike | Eint Frankfurt | 22 | 45.79 | 88.10 | 69.54 | 53.39 |
| Rayan Cherki | Lyon | 20 | 32.71 | 23.81 | 49.48 | 51.51 |
tier2_display <- tier2_realistic %>%
select(Player, Squad, Age, Competition, 'G.PK.90', 'Ast.90', 'Aerial.', Score) %>%
mutate(across(where(is.numeric), ~round(., 2)))
kable(tier2_display, caption = "Tier 2: Realistic Target Centre Forwards")
| Player | Squad | Age | Competition | G.PK.90 | Ast.90 | Aerial. | Score |
|---|---|---|---|---|---|---|---|
| Mateo Retegui | Atalanta | 25 | Serie A | 73.83 | 51.72 | 63.74 | 54.00 |
| Alexander Isak | Newcastle Utd | 24 | Premier League | 57.94 | 34.48 | 47.70 | 46.00 |
| Riccardo Orsolini | Bologna | 27 | Serie A | 54.21 | 32.76 | 81.43 | 44.82 |
| Christian Pulisic | Milan | 25 | Serie A | 27.10 | 56.90 | 30.91 | 43.80 |
| Evann Guessand | Nice | 23 | Ligue 1 | 39.25 | 48.28 | 62.26 | 43.76 |
| Ermedin Demirović | Stuttgart | 26 | Bundesliga | 68.22 | 8.62 | 62.41 | 43.39 |
| Marcus Thuram | Inter | 26 | Serie A | 51.40 | 27.59 | 84.70 | 43.13 |
| Julián Álvarez | Atlético Madrid | 24 | La Liga | 43.93 | 24.14 | 43.83 | 42.77 |
| Bryan Mbeumo | Brentford | 24 | Premier League | 37.38 | 31.03 | 46.81 | 42.70 |
| Rafael Leão | Milan | 25 | Serie A | 28.97 | 53.45 | 83.95 | 42.64 |
| Jonathan Burkardt | Mainz 05 | 24 | Bundesliga | 63.55 | 15.52 | 28.23 | 42.54 |
| Moise Kean | Fiorentina | 24 | Serie A | 56.07 | 17.24 | 76.52 | 41.59 |
| Antoine Semenyo | Bournemouth | 24 | Premier League | 28.97 | 24.14 | 69.09 | 40.98 |
| Breel Embolo | Monaco | 27 | Ligue 1 | 27.10 | 34.48 | 74.29 | 40.66 |
| Yoane Wissa | Brentford | 27 | Premier League | 55.14 | 20.69 | 67.61 | 40.53 |
| Ritsu Doan | Freiburg | 26 | Bundesliga | 28.97 | 37.93 | 45.77 | 39.09 |
| Jarrod Bowen | West Ham | 27 | Premier League | 33.64 | 41.38 | 29.72 | 38.51 |
| Lautaro Martínez | Inter | 26 | Serie A | 39.25 | 18.97 | 71.17 | 38.37 |
| Kaoru Mitoma | Brighton | 27 | Premier League | 32.71 | 24.14 | 72.96 | 37.50 |
| Harvey Barnes | Newcastle Utd | 26 | Premier League | 42.99 | 36.21 | 42.50 | 37.43 |
| Marcus Tavernier | Bournemouth | 25 | Premier League | 13.08 | 39.66 | 65.53 | 37.41 |
| Mohamed Amoura | Wolfsburg | 24 | Bundesliga | 27.10 | 56.90 | 44.13 | 37.11 |
| Valentín Castellanos | Lazio | 25 | Serie A | 28.04 | 18.97 | 78.90 | 35.73 |
| Jonathan David | Lille | 24 | Ligue 1 | 32.71 | 31.03 | 41.01 | 35.45 |
| Dan Ndoye | Bologna | 23 | Serie A | 23.36 | 29.31 | 59.44 | 34.63 |
| Lassine Sinayoko | Auxerre | 24 | Ligue 1 | 14.95 | 53.45 | 44.13 | 34.38 |
| Anthony Gordon | Newcastle Utd | 23 | Premier League | 16.82 | 31.03 | 74.29 | 33.99 |
| Kai Havertz | Arsenal | 25 | Premier League | 40.19 | 24.14 | 66.42 | 33.87 |
| Dušan Vlahović | Juventus | 24 | Serie A | 28.04 | 34.48 | 71.47 | 33.80 |
| Zuriko Davitashvili | Saint-Étienne | 23 | Ligue 1 | 21.50 | 44.83 | 34.18 | 33.79 |
| Dodi Lukebakio | Sevilla | 26 | La Liga | 27.10 | 10.34 | 68.65 | 33.37 |
| Mathias Pereira Lage | Brest | 27 | Ligue 1 | 10.28 | 68.97 | 68.50 | 33.11 |
| Jean-Philippe Mateta | Crystal Palace | 27 | Premier League | 38.32 | 12.07 | 55.13 | 32.63 |
| Mohammed Kudus | West Ham | 23 | Premier League | 15.89 | 17.24 | 32.10 | 32.57 |
| Morgan Guilavogui | St. Pauli | 26 | Bundesliga | 28.04 | 17.24 | 69.54 | 32.55 |
| Evanilson | Bournemouth | 24 | Premier League | 36.45 | 6.90 | 60.03 | 32.07 |
| Robin Hack | Gladbach | 25 | Bundesliga | 15.89 | 50.00 | 61.37 | 31.69 |
| Gabriel Martinelli | Arsenal | 23 | Premier League | 28.97 | 27.59 | 45.17 | 31.58 |
| Keito Nakamura | Reims | 24 | Ligue 1 | 34.58 | 12.07 | 49.48 | 31.34 |
| Shuto Machino | Holstein Kiel | 24 | Bundesliga | 38.32 | 15.52 | 58.10 | 31.11 |
| Issa Soumaré | Le Havre | 23 | Ligue 1 | 22.43 | 32.76 | 89.15 | 30.27 |
| Jonas Wind | Wolfsburg | 25 | Bundesliga | 34.58 | 24.14 | 73.55 | 30.18 |
| Gabriel Strefezza | Como | 27 | Serie A | 19.63 | 24.14 | 43.68 | 29.73 |
| Iliman Ndiaye | Everton | 24 | Premier League | 24.30 | 0.00 | 35.66 | 29.69 |
| Benedict Hollerbach | Union Berlin | 23 | Bundesliga | 29.91 | 6.90 | 45.91 | 29.47 |
| Loïs Openda | RB Leipzig | 24 | Bundesliga | 30.84 | 31.03 | 56.46 | 29.28 |
| Jørgen Strand Larsen | Wolves | 24 | Premier League | 45.79 | 24.14 | 59.58 | 29.21 |
| Marvin Pieringer | Heidenheim | 24 | Bundesliga | 14.95 | 27.59 | 55.13 | 28.33 |
| Nikola Krstović | Lecce | 24 | Serie A | 24.30 | 25.86 | 61.81 | 28.23 |
| Phillip Tietz | Augsburg | 27 | Bundesliga | 29.91 | 18.97 | 71.17 | 27.61 |
| Dennis Man | Parma | 25 | Serie A | 16.82 | 31.03 | 51.41 | 27.60 |
| Jorge de Frutos | Rayo Vallecano | 27 | La Liga | 20.56 | 18.97 | 57.21 | 27.29 |
| Artem Dovbyk | Roma | 27 | Serie A | 34.58 | 12.07 | 61.22 | 27.02 |
| Patrick Cutrone | Como | 26 | Serie A | 28.97 | 31.03 | 38.34 | 26.61 |
| Hugo Duro | Valencia | 24 | La Liga | 39.25 | 13.79 | 54.83 | 26.44 |
| Carlos Vicente | Alavés | 25 | La Liga | 11.21 | 25.86 | 53.79 | 26.07 |
| Farid El Melali | Angers | 27 | Ligue 1 | 9.35 | 24.14 | 32.99 | 25.97 |
| Juan Cruz | Leganés | 24 | La Liga | 18.69 | 27.59 | 65.53 | 25.44 |
| Callum Hudson-Odoi | Nott’ham Forest | 23 | Premier League | 19.63 | 13.79 | 34.32 | 25.14 |
| Javi Puado | Espanyol | 26 | La Liga | 19.63 | 20.69 | 34.32 | 24.96 |
| Bryan Gil | Girona | 23 | La Liga | 14.95 | 27.59 | 0.00 | 24.66 |
| Esteban Lepaul | Angers | 24 | Ligue 1 | 46.73 | 0.00 | 63.45 | 24.66 |
| Gustav Isaksen | Lazio | 23 | Serie A | 14.95 | 13.79 | 52.15 | 24.54 |
| Lorenzo Lucca | Udinese | 23 | Serie A | 39.25 | 6.90 | 69.09 | 24.45 |
| Lucas Beltrán | Fiorentina | 23 | Serie A | 13.08 | 31.03 | 55.87 | 24.44 |
| Nicolás González | Juventus | 26 | Serie A | 14.02 | 17.24 | 82.76 | 24.28 |
| Tete Morente | Lecce | 27 | Serie A | 12.15 | 15.52 | 91.68 | 24.09 |
| Samuel Essende | Augsburg | 26 | Bundesliga | 36.45 | 18.97 | 65.08 | 24.07 |
| Viktor Tsyhankov | Girona | 26 | La Liga | 8.41 | 41.38 | 49.48 | 24.04 |
| Mikel Oyarzabal | Real Sociedad | 27 | La Liga | 18.69 | 20.69 | 45.32 | 23.93 |
| Gorka Guruzeta | Athletic Club | 27 | La Liga | 31.78 | 17.24 | 50.97 | 23.86 |
| Oladapo Afolayan | St. Pauli | 26 | Bundesliga | 14.95 | 8.62 | 44.13 | 23.65 |
| Santiago Pierotti | Lecce | 23 | Serie A | 17.76 | 15.52 | 66.12 | 23.43 |
| Roberto Piccoli | Cagliari | 23 | Serie A | 24.30 | 5.17 | 63.60 | 23.18 |
| Isaac Romero | Sevilla | 24 | La Liga | 15.89 | 13.79 | 53.94 | 21.44 |
| Andrea Pinamonti | Genoa | 25 | Serie A | 29.91 | 5.17 | 63.60 | 20.91 |
| Dany Mota | Monza | 26 | Serie A | 19.63 | 15.52 | 68.05 | 20.61 |
| Junior Adamu | Freiburg | 23 | Bundesliga | 11.21 | 20.69 | 48.89 | 20.32 |
| Amin Sarr | Hellas Verona | 23 | Serie A | 17.76 | 8.62 | 61.96 | 18.86 |
| Johannes Eggestein | St. Pauli | 26 | Bundesliga | 9.35 | 32.76 | 36.85 | 17.85 |
| Jack Harrison | Everton | 27 | Premier League | 3.74 | 0.00 | 18.57 | 16.65 |
| Miguel | Leganés | 24 | La Liga | 0.00 | 27.59 | 34.47 | 14.02 |
| Alessandro Zanoli | Genoa | 23 | Serie A | 4.67 | 0.00 | 74.29 | 13.32 |
## Tier 3A
This tier looks at strikers at smaller clubs who offer an immediate impact but at a more affordable cost.
tier3a_display <- tier3a_value %>%
select(Player, Squad, Age, Competition, 'G.PK.90', Score) %>%
mutate(Score = round(Score, 2), 'G.PK.90' = round(`G.PK.90`, 2))
kable(tier3a_display, caption = "Tier 3A: Value Option Strikers")
| Player | Squad | Age | Competition | G.PK.90 | Score |
|---|---|---|---|---|---|
| Mateo Retegui | Atalanta | 25 | Serie A | 73.83 | 54.00 |
| Alexander Isak | Newcastle Utd | 24 | Premier League | 57.94 | 46.00 |
| Riccardo Orsolini | Bologna | 27 | Serie A | 54.21 | 44.82 |
| Christian Pulisic | Milan | 25 | Serie A | 27.10 | 43.80 |
| Ermedin Demirović | Stuttgart | 26 | Bundesliga | 68.22 | 43.39 |
| Marcus Thuram | Inter | 26 | Serie A | 51.40 | 43.13 |
| Julián Álvarez | Atlético Madrid | 24 | La Liga | 43.93 | 42.77 |
| Bryan Mbeumo | Brentford | 24 | Premier League | 37.38 | 42.70 |
| Rafael Leão | Milan | 25 | Serie A | 28.97 | 42.64 |
| Jonathan Burkardt | Mainz 05 | 24 | Bundesliga | 63.55 | 42.54 |
To build a detailed pipeline of talent, it is also worth looking at younger options with significant upside. If none of our ‘prime’ targets are available, then it is shrewd business to assess the market for younger, lesser-known players.
tier3b_display <- tier3b_prospects %>%
select(Player, Squad, Age, Competition, `G.PK.90`, `Aerial.`, Score) %>%
mutate(across(where(is.numeric), ~round(., 2)))
kable(tier3b_display, caption = "Tier 3B: Young Prospects")
| Player | Squad | Age | Competition | G.PK.90 | Aerial. | Score |
|---|---|---|---|---|---|---|
| Michael Olise | Bayern Munich | 22 | Bundesliga | 42.99 | 71.03 | 63.79 |
| Bradley Barcola | PSG | 21 | Ligue 1 | 54.21 | 99.11 | 59.93 |
| Bukayo Saka | Arsenal | 22 | Premier League | 24.30 | 49.48 | 54.38 |
| Désiré Doué | PSG | 19 | Ligue 1 | 28.97 | 65.97 | 53.94 |
| Hugo Ekitike | Eint Frankfurt | 22 | Bundesliga | 45.79 | 69.54 | 53.39 |
| Rayan Cherki | Lyon | 20 | Ligue 1 | 32.71 | 49.48 | 51.51 |
| Mason Greenwood | Marseille | 22 | Ligue 1 | 42.06 | 79.20 | 47.73 |
| Nick Woltemade | Stuttgart | 22 | Bundesliga | 51.40 | 64.04 | 47.56 |
| Evann Guessand | Nice | 23 | Ligue 1 | 39.25 | 62.26 | 43.76 |
| Maghnes Akliouche | Monaco | 22 | Ligue 1 | 14.02 | 26.60 | 43.13 |
We can now use similarity algorithms to find the strikers who best match Arsenal's desired profile.
We are going to use the top-scoring player in Tier 1 (Ousmane Dembélé) as the reference point for comparison.
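Before the full function, a minimal sketch on made-up numbers may help show why the two distance measures used below can rank players differently. The three "players" and their values are purely illustrative and are not taken from the dataset. Euclidean distance reacts to the size of the gap in output, whereas cosine distance only compares the shape of the profile.

```r
# Toy profiles on two already-scaled metrics (illustrative values only)
ref        <- c(goals = 0.8, creation = 0.4)   # reference player
same_shape <- c(goals = 0.4, creation = 0.2)   # same style, half the output
same_level <- c(goals = 0.6, creation = 0.7)   # closer in output, different balance

euclid_dist <- function(a, b) sqrt(sum((a - b)^2))
cosine_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

euclid_dist(ref, same_shape)   # larger gap in raw output
euclid_dist(ref, same_level)   # smaller gap in raw output
cosine_dist(ref, same_shape)   # essentially 0: identical proportions
cosine_dist(ref, same_level)   # above 0: the goals/creation balance differs
```

In this toy case the Euclidean measure prefers `same_level`, while the cosine measure prefers `same_shape`, which is the "output versus style" distinction the two tables below rely on.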
similiarity_tool <- function(sample, player, metrics, metrics_rename, distance, n){
data_scaled <- sample
# Scale each metric
for (metric in metrics) {
max_value <- max(sample[[metric]], na.rm = TRUE)
min_value <- min(sample[[metric]], na.rm = TRUE)
data_scaled[[metric]] <- (sample[[metric]] - min_value) / (max_value - min_value)
}
# Select only metrics for distance calculation
data_for_dist <- data_scaled[, metrics]
rownames(data_for_dist) <- sample$Player
# Calculate distance matrix
if(distance == "euclidean"){
mat_dist <- as.matrix(dist(data_for_dist, method = "euclidean"))
} else if(distance == "cosine"){
mat_dist <- as.matrix(1 - cosine(t(as.matrix(data_for_dist))))
} else {
stop("Distance method must be 'euclidean' or 'cosine'")
}
# Extract the similarity for our target player
if(!(player %in% rownames(mat_dist))){
stop(paste("Player", player, "not found in sample"))
}
player_sim <- mat_dist[, player]
df_sim <- data.frame(
Player = names(player_sim),
Distance = as.numeric(player_sim)
)
# Drop the Player Themselves (distance = 0)
df_sim <- df_sim[df_sim$Player != player, ]
# Convert the distances to similarity percentage
d95 <- quantile(df_sim$Distance, 0.95, na.rm = TRUE)
df_sim$Similarity <- (1 - (df_sim$Distance / d95)) * 100
# Order by distance (most similar first)
df_sim <- df_sim[order(df_sim$Distance), ]
# Take top n (or fewer, if the sample holds fewer than n other players)
final_df <- head(df_sim[, c("Player", "Similarity")], n)
# Merge with original data for context
data_clean <- sample %>%
select(Player, Age, Squad, Competition, all_of(metrics))
# Rename the metric columns
colnames(data_clean)[colnames(data_clean) %in% metrics] <- metrics_rename
final_df <- merge(
x = final_df, y = data_clean,
by = "Player", all.x = TRUE
)
final_df <- final_df[order(-final_df$Similarity), ]
rownames(final_df) <- seq_len(nrow(final_df))
return(final_df)
}
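The percentage conversion inside the function can be illustrated on a handful of invented distances. This is only a sketch of that one step (the distance values are made up, not computed from the data): the 95th-percentile distance is treated as 0% similarity, so any player further away than that receives a negative score.

```r
# Invented distances from a reference player to six other players
distances <- c(A = 0.10, B = 0.25, C = 0.40, D = 0.55, E = 0.70, F = 1.20)

# The 95th percentile of the distances acts as the "0% similar" anchor
d95 <- quantile(distances, 0.95, na.rm = TRUE)

# Same formula as in the function above
similarity <- (1 - distances / d95) * 100
round(sort(similarity, decreasing = TRUE), 1)
```

Because the anchor depends on the sample, the percentages are only comparable within one run of the function, not across different samples or distance methods.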
metrics_rename <- c(
"npg/90", "xG/90", "Ast/90", "xAG/90", "SCA/90", "GCA/90", "tklw/90", "Recov/90", "Drib/90", "Aerial%", "Sh/90", "SoT/90"
)
# Use the top ranked striker (Dembele) as the gold standard benchmark
reference_player <- tier1_elite$Player[1]
cat("Using", reference_player, "as reference player\n")
## Using Ousmane Dembélé as reference player
cat("(Elite Tier 1 - The Gold Standard)\n")
## (Elite Tier 1 - The Gold Standard)
cat("\nFinding realistic targets who play most similarly to Dembele\n\n")
##
## Finding realistic targets who play most similarly to Dembele
# Euclidean Distance (Similar Absolute Output)
sim_euclidean <- similiarity_tool(
sample = data_final,
player = reference_player,
metrics = list_metrics,
metrics_rename = metrics_rename,
distance = "euclidean",
n = 15
)
kable(sim_euclidean[, 1:6],
caption = paste("Top 15 Similar Centre Forwards to", reference_player, "(Euclidean Distance)"), digits = 2)
| Player | Similarity | Age | Squad | Competition | npg/90 |
|---|---|---|---|---|---|
| Kylian Mbappé | 60.22 | 25 | Real Madrid | La Liga | 69.16 |
| Leroy Sané | 52.94 | 28 | Bayern Munich | Bundesliga | 56.07 |
| Raphinha | 51.53 | 27 | Barcelona | La Liga | 47.66 |
| Hugo Ekitike | 51.41 | 22 | Eint Frankfurt | Bundesliga | 45.79 |
| Bradley Barcola | 50.89 | 21 | PSG | Ligue 1 | 54.21 |
| Mateo Retegui | 46.11 | 25 | Atalanta | Serie A | 73.83 |
| Nick Woltemade | 45.54 | 22 | Stuttgart | Bundesliga | 51.40 |
| Alexander Isak | 43.29 | 24 | Newcastle Utd | Premier League | 57.94 |
| Deniz Undav | 42.80 | 28 | Stuttgart | Bundesliga | 43.93 |
| Mason Greenwood | 42.25 | 22 | Marseille | Ligue 1 | 42.06 |
| Michael Olise | 38.85 | 22 | Bayern Munich | Bundesliga | 42.99 |
| Julián Álvarez | 38.10 | 24 | Atlético Madrid | La Liga | 43.93 |
| Erling Haaland | 36.61 | 24 | Manchester City | Premier League | 57.94 |
| Riccardo Orsolini | 36.57 | 27 | Bologna | Serie A | 54.21 |
| Rafael Leão | 36.12 | 25 | Milan | Serie A | 28.97 |
# Cosine Distance (similar style/profile)
sim_cosine <- similiarity_tool(
sample = data_final,
player = reference_player,
metrics = list_metrics,
metrics_rename = metrics_rename,
distance = "cosine",
n = 15
)
kable(sim_cosine[, 1:6],
caption = paste("Top 15 Similar Strikers to", reference_player, "(Cosine Similarity)"),
digits = 2)
| Player | Similarity | Age | Squad | Competition | npg/90 |
|---|---|---|---|---|---|
| Harvey Barnes | 91.47 | 26 | Newcastle Utd | Premier League | 42.99 |
| Alexander Isak | 89.64 | 24 | Newcastle Utd | Premier League | 57.94 |
| Nick Woltemade | 89.24 | 22 | Stuttgart | Bundesliga | 51.40 |
| Kylian Mbappé | 87.46 | 25 | Real Madrid | La Liga | 69.16 |
| Julián Álvarez | 87.24 | 24 | Atlético Madrid | La Liga | 43.93 |
| Leroy Sané | 85.23 | 28 | Bayern Munich | Bundesliga | 56.07 |
| Deniz Undav | 84.97 | 28 | Stuttgart | Bundesliga | 43.93 |
| Nicolas Jackson | 84.96 | 23 | Chelsea | Premier League | 38.32 |
| Hugo Ekitike | 84.10 | 22 | Eint Frankfurt | Bundesliga | 45.79 |
| Mateo Retegui | 82.68 | 25 | Atalanta | Serie A | 73.83 |
| Jonathan David | 81.58 | 24 | Lille | Ligue 1 | 32.71 |
| Loïs Openda | 80.25 | 24 | RB Leipzig | Bundesliga | 30.84 |
| Raphinha | 80.03 | 27 | Barcelona | La Liga | 47.66 |
| Shuto Machino | 79.63 | 24 | Holstein Kiel | Bundesliga | 38.32 |
| Bradley Barcola | 79.24 | 21 | PSG | Ligue 1 | 54.21 |
comparison <- tibble(
Rank = 1:10,
`Euclidean (Output)` = sim_euclidean$Player[1:10],
`Cosine (Style)` = sim_cosine$Player[1:10]
)
kable(comparison, caption = "Euclidean vs Cosine Top 10 Comparison")
| Rank | Euclidean (Output) | Cosine (Style) |
|---|---|---|
| 1 | Kylian Mbappé | Harvey Barnes |
| 2 | Leroy Sané | Alexander Isak |
| 3 | Raphinha | Nick Woltemade |
| 4 | Hugo Ekitike | Kylian Mbappé |
| 5 | Bradley Barcola | Julián Álvarez |
| 6 | Mateo Retegui | Leroy Sané |
| 7 | Nick Woltemade | Deniz Undav |
| 8 | Alexander Isak | Nicolas Jackson |
| 9 | Deniz Undav | Hugo Ekitike |
| 10 | Mason Greenwood | Mateo Retegui |
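A quick way to read the comparison table is to check which names appear in both top 10 lists. A minimal sketch, with the two name vectors typed in by hand from the table above rather than pulled from `sim_euclidean` and `sim_cosine`:

```r
euclidean_top10 <- c("Kylian Mbappé", "Leroy Sané", "Raphinha", "Hugo Ekitike",
                     "Bradley Barcola", "Mateo Retegui", "Nick Woltemade",
                     "Alexander Isak", "Deniz Undav", "Mason Greenwood")
cosine_top10 <- c("Harvey Barnes", "Alexander Isak", "Nick Woltemade",
                  "Kylian Mbappé", "Julián Álvarez", "Leroy Sané",
                  "Deniz Undav", "Nicolas Jackson", "Hugo Ekitike",
                  "Mateo Retegui")

# Players similar to the reference in both output and style
in_both <- intersect(euclidean_top10, cosine_top10)
# Players flagged by only one of the two measures
euclidean_only <- setdiff(euclidean_top10, cosine_top10)
cosine_only    <- setdiff(cosine_top10, euclidean_top10)

in_both
euclidean_only
cosine_only
```

Seven of the ten names are shared, so the two measures largely agree; the players unique to each list are the ones where output and style point in different directions.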
# Use the absolute Benchmark (Dembele) as our gold standard
elite_reference <- tier1_elite$Player[1]
# Top Scorer from Tier 2 (Best Realistic Target)
top_tier2 <- tier2_realistic$Player[1]
# Most similar player (Euclidean). Of course, we assume that Mbappé is completely untouchable at Real Madrid, so we default to option 2
most_similar_eucl <- sim_euclidean$Player[2]
# Most similar player (cosine - ensuring a different result from Euclidean for a more in-depth comparison). We also want to exclude Barnes, Isak, Woltemade and Mbappé for the reasons explained above
cosine_candidates <- sim_cosine$Player[!sim_cosine$Player %in% most_similar_eucl]
most_similar_cos <- cosine_candidates[5]
#Top Young Prospect
top_prospect <- tier3b_prospects$Player[1]
# Combine these players for radar (include elite benchmark and realistic options)
players_radar <- unique(c(elite_reference, top_tier2, most_similar_eucl, most_similar_cos, top_prospect))
cat("Centre Forwards selected for radar comparison:\n\n")
## Centre Forwards selected for radar comparison:
cat(" 1.", elite_reference, "(Tier 1 - World's Best)\n\n")
## 1. Ousmane Dembélé (Tier 1 - World's Best)
cat("REALISTIC ALTERNATIVES:\n")
## REALISTIC ALTERNATIVES:
for(i in 2:length(players_radar)){
cat(" ", i, ".", players_radar[i], "\n")
}
## 2 . Mateo Retegui
## 3 . Leroy Sané
## 4 . Julián Álvarez
## 5 . Michael Olise
min_max_df <- rbind(
apply(data_final[, list_metrics], 2,
function(x) quantile(x, probs = 0.95, na.rm = TRUE)),
apply(data_final[, list_metrics], 2,
function(x) quantile(x, probs = 0.05, na.rm = TRUE))
)
rownames(min_max_df) <- c("p95", "p5")
min_max_df
## G.PK.90 xG.90 Ast.90 xAG.90 SCA.90 GCA.90 TklW.90
## p95 57.943925 77.67857 62.50000 64.673913 75.188324 57.024793 71.487603
## p5 9.345794 13.09524 3.87931 6.521739 5.178908 8.264463 9.297521
## Recov.90 Dribbles.90 Aerial. Sh.90 SoT.90
## p95 67.533937 70.421245 83.06092 70.29478 58.36820
## p5 7.522624 3.205128 28.04606 16.66667 10.46025
# Filter Data for Selected Players
df_forwards_radar <- data_final[data_final$Player %in% players_radar, ]
# Ensure values are within [p5, p95] boundaries
for (metric in list_metrics) {
for (p in players_radar) {
value_c <- df_forwards_radar[df_forwards_radar$Player == p, metric]
if(length(value_c) > 0 && !is.na(value_c)){
if(value_c < min_max_df["p5", metric]){
df_forwards_radar[df_forwards_radar$Player == p, metric] <- min_max_df["p5", metric]
} else if (value_c > min_max_df["p95", metric]){
df_forwards_radar[df_forwards_radar$Player == p, metric] <- min_max_df["p95", metric]
}
}
}
}
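The nested loop above clamps each value into the [p5, p95] range one cell at a time. The same idea can be written in vectorised form with `pmin()` and `pmax()`. The sketch below uses a small invented data frame (`toy`, `lower` and `upper` are illustrative names and values) rather than `df_forwards_radar`.

```r
# Invented data: three players, two metrics
toy <- data.frame(
  Player  = c("P1", "P2", "P3"),
  G.PK.90 = c(5, 40, 90),
  xG.90   = c(20, 80, 10)
)
metrics <- c("G.PK.90", "xG.90")
lower <- c(G.PK.90 = 10, xG.90 = 15)   # stand-ins for the p5 row
upper <- c(G.PK.90 = 60, xG.90 = 75)   # stand-ins for the p95 row

# Clamp every metric column into [lower, upper] in one pass per column
for (m in metrics) {
  toy[[m]] <- pmin(pmax(toy[[m]], lower[[m]]), upper[[m]])
}
toy
```

Clamping keeps extreme values from stretching the radar axes, at the cost of hiding how far beyond the 95th percentile a player actually sits.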
#Create final radar dataframe
df_forwards_radar <- as.data.frame(df_forwards_radar)
rownames(df_forwards_radar) <- df_forwards_radar$Player
df_final_plot <- rbind(
min_max_df, df_forwards_radar[, list_metrics]
)
df_final_plot
## G.PK.90 xG.90 Ast.90 xAG.90 SCA.90 GCA.90
## p95 57.943925 77.67857 62.50000 64.673913 75.188324 57.024793
## p5 9.345794 13.09524 3.87931 6.521739 5.178908 8.264463
## Ousmane Dembélé 57.943925 77.67857 53.44828 64.673913 75.188324 57.024793
## Michael Olise 42.990654 42.85714 62.50000 64.673913 75.188324 57.024793
## Leroy Sané 56.074766 64.28571 46.55172 58.695652 56.120527 37.190083
## Mateo Retegui 57.943925 77.67857 51.72414 32.608696 31.450094 40.495868
## Julián Álvarez 43.925234 58.33333 24.13793 34.782609 50.282486 41.322314
## TklW.90 Recov.90 Dribbles.90 Aerial. Sh.90 SoT.90
## p95 71.487603 67.533937 70.421245 83.06092 70.29478 58.36820
## p5 9.297521 7.522624 3.205128 28.04606 16.66667 10.46025
## Ousmane Dembélé 11.570248 12.443439 47.619048 59.43536 70.29478 58.36820
## Michael Olise 53.719008 52.036199 70.421245 71.02526 62.13152 51.46444
## Leroy Sané 38.842975 43.891403 42.124542 79.94056 70.29478 58.36820
## Mateo Retegui 18.181818 25.565611 6.593407 63.74443 70.29478 43.93305
## Julián Álvarez 35.537190 28.054299 28.205128 43.83358 43.76417 47.28033
# Define radar chart function
create_radarchart <- function(data, color = "#00AFBB",
vlabels = colnames(data), vlcex = 0.7,
caxislabels = NULL, title = NULL){
radarchart(
data, axistype = 1,
# Polygon
pcol = color, pfcol = scales::alpha(color, 0.5),
plwd = 2, plty = 1,
cglcol = "grey", cglty = 1, cglwd = 0.8,
# Axis
axislabcol = "grey30",
#Labels
vlcex = vlcex, vlabels = vlabels,
caxislabels = caxislabels, title = title
)
}
# Metric Names for radar chart
metrics_name_plot <- c(
"npG/90", "xG/90", "Ast/90", "xAG/90", "SCA/90", "GCA/90", "tklW/90", "Recov/90", "Drib/90", "Aerial%", "Sh/90", "SoT/90"
)
# Colors
colors_radar <- c("#EF0107", "#00AFBB", "#f7d62d", "#8DBF8D", "purple4")[1:(nrow(df_final_plot) - 2)]
# Create Radar chart
op <- par(mar = c(1, 2, 2, 2))
create_radarchart(
data = df_final_plot,
color = colors_radar,
vlabels = metrics_name_plot
)
legend("bottomleft",
legend = rownames(df_final_plot[-c(1,2), ]),
horiz = FALSE,
bty = 'n', pch = 20,
col = colors_radar,
text.col = "black", cex = 0.7, pt.cex = 2)
title(
main = "Arsenal Striker Search\nComparing Realistic Targets to the World's Best (2024/25)",
cex.main = 1.1, col.main = "#5D6D7E"
)
# Restore the graphical parameters saved in `op`
par(op)
The Benchmark - Arsenal would love to be able to sign Ousmane Dembélé, or his French compatriot Kylian Mbappé. However, these players are currently unattainable and would cost astronomical sums.
Realistic Alternatives - Julián Álvarez and Mateo Retegui are excellent players who would fit Arsenal's style, intensity and system well. While both would cost more than 60m Euros, they are more realistic market opportunities. Of the two, I would recommend that Arsenal sign Julián Álvarez. While his metrics are slightly lower than Retegui's, he is slightly younger and, based on his previous successful stint in England (with Man City), he should hit the ground running and could be an instant-impact signing for Arsenal's recruitment team. His metrics are impressive and well rounded, and there is still room to improve the defensive side of his game as well as his creativity. I am confident that under Arteta, playing alongside other world-class players such as Saka and Ødegaard, Álvarez can make that improvement.
Other Options - Leroy Sané represents an excellent tactical fit for Arsenal: he is Premier League proven and has previously worked with Arteta. However, at 29, he does not offer much growth potential. Michael Olise and Bradley Barcola would be other excellent young options to look at. Both would fit Arsenal's style very well and have massive upside; however, it would take a lot of money to prise either player from their clubs.
Age of Data - I only had data from the 2024/25 season available for this study. As a result, players who moved in the summer transfer window (Isak, Ekitike) still appear in the results even though they are no longer available on the market. The analysis also does not take into account current form in the 2025/26 season.
League Difficulty - Goals in the Premier League may be more valuable than goals in a slower league, such as Serie A. This is not taken into account in the analysis, hence my preference for Álvarez over Retegui, as Álvarez has already proven his ability in the Premier League.
Team Quality - Our analysis does not account for players whose statistics may be inflated by playing in stronger attacking teams or alongside better players.
Injury Status - While I tried to account for injury-prone players by including a minimum number of minutes played, injury risk is not fully captured in my analysis.
Transfer Value - Market values could be added to take a deep dive into which players represent the best value in the market.
Trajectory - Is the player trending upwards, or have they reached their peak?
Big Games - Do some of these players have inflated stats in smaller games? Who are the best big-game performers?
Injury History - Assess physical robustness across our talent pool.
In my opinion, Arsenal should already be doing their due diligence on whether a deal for Julián Álvarez is possible this summer. His playing style, age, robustness and proven Premier League ability make him a perfect match for Arsenal, and by signing him, Arsenal could truly take a step towards becoming the best team in Europe.
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_Canada.utf8 LC_CTYPE=English_Canada.utf8
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.utf8
##
## time zone: America/Vancouver
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 DT_0.34.0 knitr_1.51 lsa_0.73.4
## [5] SnowballC_0.7.1 fmsb_0.7.6 lubridate_1.9.4 forcats_1.0.1
## [9] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.1 readr_2.1.6
## [13] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.2 tidyselect_1.2.1
## [5] jquerylib_0.1.4 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
## [9] generics_0.1.4 htmlwidgets_1.6.4 bslib_0.9.0 pillar_1.11.1
## [13] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.1.7 stringi_1.8.7
## [17] cachem_1.1.0 xfun_0.55 sass_0.4.10 S7_0.2.1
## [21] otel_0.2.0 timechange_0.3.0 cli_3.6.5 withr_3.0.2
## [25] magrittr_2.0.4 digest_0.6.39 grid_4.5.2 rstudioapi_0.18.0
## [29] hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.0 evaluate_1.0.5
## [33] glue_1.8.0 farver_2.1.2 rmarkdown_2.30 tools_4.5.2
## [37] pkgconfig_2.0.3 htmltools_0.5.9