Crunching the Numbers on Arsenal's Perfect Centre Forward

Introduction

In this project, I will crunch the numbers across Europe's top 5 leagues and assess Arsenal's options in their pursuit of the perfect number 9.

It is an area where Arsenal actually have quite a few options: Gabriel Jesus, Viktor Gyokeres, Kai Havertz and even Mikel Merino have done solid jobs in the role, and each has produced impressive purple patches. That said, through inconsistent form and injury issues, Arsenal have yet to find the number 9 who, in my opinion, would fire them to domination in England, and perhaps Europe.

Amongst their current options:

  • Gabriel Jesus offers versatility, pressing intensity and outstanding link-up play. On the other hand, he is often wasteful in front of goal and has suffered a series of unfortunate injuries over the last 2-3 years.

  • Kai Havertz excels in aerial and ground duels, has very intelligent off-the-ball movement and link-up play, and offers an incredible aerial threat, particularly from set pieces. However, like Gabriel Jesus, his finishing has been hot and cold, and he has also suffered a series of injuries that have left him unavailable for extended periods.

  • Viktor Gyokeres is powerful, robust, and offers an exceptional work ethic with an eye for goal (he was Europe's top scorer last season with an impressive 54 goals across all competitions). While there is still potential for him to grow and adapt, he has no doubt struggled to get to grips with the pace of the Premier League.

  • Mikel Merino has proved to be a more than capable back-up in the role, but he is a natural central midfielder and his talents are best served there.

Our purpose is to identify the best centre forward to fit Arsenal's style of play, using data from the top 5 leagues in the 2024/25 season.

To address this topic, we will use a four-tier analytical approach:

  • Tier 1 - Elite Players: world-class centre forwards who may be unattainable, but are still useful for comparison.
  • Tier 2 - Realistic Targets: proven players who are attainable and between the ages of 23 and 27.
  • Tier 3 - Value Options: short-term solutions from smaller clubs (age 25-27).
  • Tier 4 - Young Prospects: high-potential, developing strikers (age 19-23).

For each tier we will:

  • Develop a scoring system to evaluate candidates for Arsenal.
  • Apply a similarity algorithm (Euclidean and cosine) to find players who are a stylistic match.
  • Create charts and visualisations to compare some top candidates.
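
As a minimal sketch of the two similarity measures named above, the following base R snippet compares a target profile with two candidates. The player names and numbers are invented for illustration, the helper functions `euclidean_distance` and `cosine_similarity` are hypothetical hand-written versions (not the functions from the lsa package), and in practice the metrics would probably need to be put on a common scale first.

```r
# Hypothetical per-90 profiles; all numbers are invented for illustration only
profiles <- rbind(
  target   = c(goals = 0.60, assists = 0.20, tackles = 0.50),
  player_a = c(goals = 0.55, assists = 0.25, tackles = 0.45),
  player_b = c(goals = 0.20, assists = 0.10, tackles = 1.20)
)

# Euclidean distance: smaller means more similar (sensitive to magnitude)
euclidean_distance <- function(a, b) sqrt(sum((a - b)^2))

# Cosine similarity: closer to 1 means the profiles point in a similar direction
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Compare each candidate with the target profile
for (name in c("player_a", "player_b")) {
  cat(
    name,
    "| Euclidean:", round(euclidean_distance(profiles["target", ], profiles[name, ]), 3),
    "| Cosine:", round(cosine_similarity(profiles["target", ], profiles[name, ]), 3),
    "\n"
  )
}
```

Here player_a should come out closer to the target on both measures, since player_b trades goals for tackles and therefore has a differently shaped profile.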

# First we load the data from the same directory as the .Rmd file
data <- read.csv("FBREF_BigPlayers_2425.csv", sep=";", encoding="UTF-8")

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# Next we load the relevant libraries
library(tidyverse)  # Data manipulation
library(fmsb)       # Radar charts
library(lsa)        # Cosine similarity
library(knitr)      # Tables
library(DT)         # Interactive tables
library(scales)     # Color scales
library(ggplot2)
head(data)
##            Player         Squad Nation   Pos Age MP  Min Gls G.PK Ast  xG xAG
## 1 Abdoulie Ceesay     St. Pauli    GAM    FW  20  7   60   0    0   0 0.0 0.0
## 2      Adam Aznou Bayern Munich    MAR    DF  18  2   17   0    0   0 0.0 0.0
## 3   Adam Dźwigała     St. Pauli    POL DF,MF  28 16  373   0    0   0 0.4 0.0
## 4     Adam Hložek    Hoffenheim    CZE FW,MF  22 27 1871   8    8   4 5.6 3.5
## 5     Adrian Beck    Heidenheim    GER MF,FW  27 32 1598   4    4   1 3.3 1.2
## 6   Alassane Pléa      Gladbach    FRA MF,FW  31 30 1902  11   10   4 7.4 3.1
##   Gls.90 G.PK.90 Ast.90 xG.90 xAG.90 Competition Sh Sh.90 SoT.90 Passes.
## 1   0.00    0.00   0.00  0.00   0.03  Bundesliga  0  0.00   0.00    57.1
## 2   0.00    0.00   0.00  0.00   0.00  Bundesliga  0  0.00   0.00    76.5
## 3   0.00    0.00   0.00  0.09   0.00  Bundesliga  6  1.45   0.00    81.6
## 4   0.38    0.38   0.19  0.27   0.17  Bundesliga 59  2.84   1.15    68.5
## 5   0.23    0.23   0.06  0.18   0.07  Bundesliga 39  2.20   0.73    80.1
## 6   0.52    0.47   0.19  0.35   0.15  Bundesliga 41  1.94   0.57    72.1
##   ShortPasses. MediumPasses. LongPasses. A.xAG TklW.90 Blocks.90 Int.90
## 1         70.0           0.0         0.0   0.0    0.43      0.29   0.00
## 2         90.0          66.7         0.0   0.0    0.00      0.00   0.00
## 3         91.0          91.4        40.0   0.0    0.62      0.38   0.25
## 4         73.8          75.0        54.8   0.5    0.37      0.48   0.15
## 5         83.3          87.3        72.4  -0.2    0.53      0.97   0.28
## 6         83.1          74.2        63.2   0.9    0.10      0.37   0.07
##   Tkl.Int.90 Clr.90 Touches.90 Dribbles.90 Dribbles. SCA.90 GCA.90 Aerial.
## 1       0.43   0.14       3.71        0.00       0.0   1.48   0.00     0.0
## 2       0.00   0.50       9.00        0.00       0.0   0.00   0.00     0.0
## 3       1.50   1.75      17.50        0.06      50.0   1.21   0.00    76.5
## 4       0.96   0.74      28.19        1.22      51.6   2.74   0.48    50.6
## 5       1.41   0.53      30.06        1.25      53.3   3.16   0.23    59.2
## 6       0.30   0.23      32.63        0.57      39.5   3.41   0.66    31.9
##   Points.90  xGA OG PSxG PSxG.SoT PSxG... GA GA.90 Save. CS. SoTA.90 SoTA.GA
## 1      0.43  1.4  0    0        0       0  0     0     0   0       0       0
## 2      3.00  0.8  0    0        0       0  0     0     0   0       0       0
## 3      0.88  9.0  0    0        0       0  0     0     0   0       0       0
## 4      0.85 34.4  0    0        0       0  0     0     0   0       0       0
## 5      0.91 29.2  0    0        0       0  0     0     0   0       0       0
## 6      1.17 35.7  0    0        0       0  0     0     0   0       0       0
##   SoT.G G.xG PassesCompleted.90 PassesAttempted.90 ShortPassesCompleted.90
## 1  0.00  0.0               1.14               2.00                    1.00
## 2  0.00  0.0               6.50               8.50                    4.50
## 3  0.00 -0.4              10.81              13.25                    4.44
## 4  3.00  2.4              12.89              18.81                    6.56
## 5  3.25  0.7              18.16              22.66                    8.56
## 6  1.20  3.6              19.10              26.50                   10.30
##   MediumPassesCompleted.90 LongPassesCompleted.90 TotDistPasses.90
## 1                     0.00                   0.00            10.00
## 2                     2.00                   0.00            89.00
## 3                     5.31                   0.75           183.44
## 4                     4.67                   0.85           198.19
## 5                     7.28                   1.72           309.41
## 6                     6.33                   1.60           303.93
##   PrgDistPasses.90 xA.90 KP.90 FinalThirdPasses.90 PPA.90 CrsPA.90
## 1             0.29  0.00  0.14                0.00   0.00     0.00
## 2            22.00  0.00  0.00                0.50   0.00     0.00
## 3            75.50  0.01  0.00                0.69   0.06     0.00
## 4            48.19  0.12  0.74                1.37   0.70     0.15
## 5            70.56  0.06  0.66                1.66   0.56     0.12
## 6           100.97  0.11  0.83                2.67   1.17     0.27
##   PassesProgressive.90 xGA.90 Recov.90 Fls.90 Fld.90 AerialW.90   xGD xGD.90
## 1                 0.00   0.20     0.57   0.29   0.43       0.00  -1.4  -0.20
## 2                 0.50   0.40     0.50   0.00   0.00       0.00  -0.8  -0.40
## 3                 0.88   0.56     0.81   0.31   0.19       0.81  -8.6  -0.47
## 4                 2.52   1.27     2.44   0.56   1.11       1.48 -28.8  -1.00
## 5                 2.16   0.91     4.09   0.50   0.44       0.91 -25.9  -0.73
## 6                 3.87   1.19     2.20   0.57   1.10       0.50 -28.3  -0.84
##   MP_Squad
## 1       34
## 2       34
## 3       34
## 4       34
## 5       34
## 6       34
list.files("data")
## [1] "FBREF_BigClubes_2324.csv"  "FBREF_BigClubes_2425.csv" 
## [3] "FBREF_BigPlayers_2324.csv" "FBREF_BigPlayers_2425.csv"
# Define the select_players function (complete)
select_players <- function(file, Encoding, position, competition, primary, player) {
  # 1. Read the file
  data <- read.csv(file, sep = ";", encoding = Encoding)
  
  # 2. Filter by competition
  if (length(competition) == 1 && competition == "ALL") {
    cat("We consider all players from all competitions\n")
    data_comp <- data
  } else {
    cat("We filter by competition\n")
    data_comp <- data %>%
      filter(Competition %in% competition)
  }
  
  # 3. Filter by position
  if (primary){
    cat("We keep players whose main position is:", position, "\n")
    data_players <- data_comp %>% 
      filter(substr(Pos, 1, 2) == position)
  } else {
    cat("We keep players whose position contains:", position, "\n")
    data_players <- data_comp %>%
      filter(grepl(position, Pos))
  }
  
  # 4. Filter by specific player(s) if needed
  # all(is.na(...)) also works when `player` is a vector of several names
  if (!all(is.na(player))){
    data_players <- data_players %>% filter(Player %in% player)
  }
  
  # 5. Return the filtered data
  return(data_players)
}

# NOW call the function with all the parameters
df_forwards <- select_players(
  file = "FBREF_BigPlayers_2425.csv",
  Encoding = "UTF-8",
  position = "FW", 
  competition = c("Premier League", "La Liga", "Serie A", "Bundesliga", "Ligue 1"),
  primary = TRUE,
  player = NA
)
## We filter by competition
## We keep players whose main position is: FW
cat("\nTotal Centre Forwards in Top 5 Leagues:", nrow(df_forwards), "\n")
## 
## Total Centre Forwards in Top 5 Leagues: 723
cat("(Excludes attacking midfielders)\n")
## (Excludes attacking midfielders)
head(df_forwards)
##                   Player         Squad Nation   Pos Age MP  Min Gls G.PK Ast
## 1        Abdoulie Ceesay     St. Pauli    GAM    FW  20  7   60   0    0   0
## 2            Adam Hložek    Hoffenheim    CZE FW,MF  22 27 1871   8    8   4
## 3 Alexander Bernhardsson Holstein Kiel    SWE FW,MF  25 19 1263   7    7   2
## 4         Andreas Albers     St. Pauli    DEN    FW  34 14   62   1    1   0
## 5     Andreas Skov Olsen     Wolfsburg    DEN FW,MF  24 12  535   1    1   1
## 6            Andrej Ilic  Union Berlin    SRB    FW  24 16  937   7    6   0
##    xG xAG Gls.90 G.PK.90 Ast.90 xG.90 xAG.90 Competition Sh Sh.90 SoT.90
## 1 0.0 0.0   0.00    0.00   0.00  0.00   0.03  Bundesliga  0  0.00   0.00
## 2 5.6 3.5   0.38    0.38   0.19  0.27   0.17  Bundesliga 59  2.84   1.15
## 3 4.1 1.3   0.50    0.50   0.14  0.29   0.10  Bundesliga 27  1.92   1.00
## 4 0.6 0.1   1.45    1.45   0.00  0.90   0.14  Bundesliga  3  4.35   1.45
## 5 0.8 1.9   0.17    0.17   0.17  0.14   0.32  Bundesliga  7  1.18   0.17
## 6 4.9 0.3   0.67    0.58   0.00  0.47   0.03  Bundesliga 27  2.59   1.06
##   Passes. ShortPasses. MediumPasses. LongPasses. A.xAG TklW.90 Blocks.90 Int.90
## 1    57.1         70.0           0.0         0.0   0.0    0.43      0.29   0.00
## 2    68.5         73.8          75.0        54.8   0.5    0.37      0.48   0.15
## 3    60.5         71.0          64.9        36.4   0.7    0.79      0.74   0.26
## 4    51.7         55.6          66.7        50.0  -0.1    0.07      0.00   0.00
## 5    80.4         85.7          87.6        55.0  -0.9    0.50      0.33   0.17
## 6    57.6         69.2          46.2        20.0  -0.3    0.12      0.44   0.31
##   Tkl.Int.90 Clr.90 Touches.90 Dribbles.90 Dribbles. SCA.90 GCA.90 Aerial.
## 1       0.43   0.14       3.71        0.00       0.0   1.48   0.00     0.0
## 2       0.96   0.74      28.19        1.22      51.6   2.74   0.48    50.6
## 3       1.79   1.32      27.37        1.16      41.5   2.57   0.50    35.8
## 4       0.07   0.07       3.14        0.14      66.7   2.95   0.00    45.5
## 5       0.83   1.00      27.42        0.33      28.6   5.38   0.17    50.0
## 6       0.44   0.38      18.88        0.06      14.3   1.44   0.19    48.1
##   Points.90  xGA OG PSxG PSxG.SoT PSxG... GA GA.90 Save. CS. SoTA.90 SoTA.GA
## 1      0.43  1.4  0    0        0       0  0     0     0   0       0       0
## 2      0.85 34.4  0    0        0       0  0     0     0   0       0       0
## 3      0.79 23.3  0    0        0       0  0     0     0   0       0       0
## 4      0.86  2.1  0    0        0       0  0     0     0   0       0       0
## 5      1.17  9.2  0    0        0       0  0     0     0   0       0       0
## 6      1.25 16.3  0    0        0       0  0     0     0   0       0       0
##   SoT.G G.xG PassesCompleted.90 PassesAttempted.90 ShortPassesCompleted.90
## 1  0.00  0.0               1.14               2.00                    1.00
## 2  3.00  2.4              12.89              18.81                    6.56
## 3  2.00  2.9              10.16              16.79                    4.89
## 4  1.00  0.4               1.07               2.07                    0.71
## 5  1.00  0.2              18.08              22.50                    8.50
## 6  1.83  2.1               6.88              11.94                    4.50
##   MediumPassesCompleted.90 LongPassesCompleted.90 TotDistPasses.90
## 1                     0.00                   0.00            10.00
## 2                     4.67                   0.85           198.19
## 3                     3.89                   0.84           166.53
## 4                     0.29                   0.07            16.64
## 5                     7.08                   1.83           306.42
## 6                     1.50                   0.06            76.44
##   PrgDistPasses.90 xA.90 KP.90 FinalThirdPasses.90 PPA.90 CrsPA.90
## 1             0.29  0.00  0.14                0.00   0.00     0.00
## 2            48.19  0.12  0.74                1.37   0.70     0.15
## 3            56.63  0.06  0.89                0.84   0.53     0.21
## 4             3.36  0.01  0.14                0.07   0.00     0.00
## 5            64.00  0.11  1.50                1.00   0.83     0.17
## 6            19.62  0.01  0.38                1.00   0.00     0.00
##   PassesProgressive.90 xGA.90 Recov.90 Fls.90 Fld.90 AerialW.90   xGD xGD.90
## 1                 0.00   0.20     0.57   0.29   0.43       0.00  -1.4  -0.20
## 2                 2.52   1.27     2.44   0.56   1.11       1.48 -28.8  -1.00
## 3                 1.74   1.23     2.37   1.11   0.79       1.00 -19.2  -0.94
## 4                 0.21   0.15     0.21   0.21   0.14       0.71  -1.5   0.75
## 5                 1.83   0.77     1.50   0.17   0.25       0.25  -8.4  -0.63
## 6                 0.81   1.02     1.00   0.88   0.50       3.25 -11.4  -0.55
##   MP_Squad
## 1       34
## 2       34
## 3       34
## 4       34
## 5       34
## 6       34

Next we filter by position, as we are not interested in defenders or midfielders in this study.

Setting primary = TRUE keeps only players from Europe's top 5 leagues whose first listed position is FW. Players listed as FW,MF or FW,DF are retained; players whose primary position is MF or DF are dropped, even if FW is their secondary position.

We want the analysis to stay flexible, so this also includes wingers who can play in the front 3 and offer a dynamic goal threat.
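
The difference between the two position filters in select_players can be seen on a handful of position strings. This is a small base R sketch; the `pos` vector is made up, but it follows the same "primary,secondary" format as the Pos column.

```r
# Hypothetical position strings in the same format as the Pos column
pos <- c("FW", "FW,MF", "MF,FW", "DF", "FW,DF")

# primary = TRUE: the first two characters must equal the position
primary_fw <- substr(pos, 1, 2) == "FW"

# primary = FALSE: the position may appear anywhere in the string
contains_fw <- grepl("FW", pos)

# Side-by-side view of which rows each filter would keep
data.frame(pos, primary_fw, contains_fw)
```

Only "MF,FW" is treated differently: it is kept by the "contains" filter but dropped by the primary-position filter.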

unique(df_forwards$Competition)
## [1] "Bundesliga"     "La Liga"        "Ligue 1"        "Premier League"
## [5] "Serie A"
cat("Position distribution:\n")
## Position distribution:
table(df_forwards$Pos)
## 
##    FW FW,DF FW,MF 
##   375    24   324
cat("\nNote: Every position should start with 'FW' - Players that play in the front 3 and not AMs\n")
## 
## Note: Every position should start with 'FW' - Players that play in the front 3 and not AMs

Our next step is to filter the dataset to focus on our target sample, based on playing time and age criteria.

filter_players <- function(data, metrics, pct_min_minutes, age_max, age_min = 0){
  # Minimum minutes: pct_min_minutes % of the minutes available to the player's squad
  data_filter <- data %>%
    filter(
      Min > round((pct_min_minutes * 90 * MP_Squad) / 100),
      Age <= age_max,
      Age >= age_min
    ) %>%
    # Keep the identifying columns plus the metrics that define our sample
    select(Player, Squad, Age, Competition, all_of(metrics))
  # Reset row names to 1..n (also safe when no rows remain)
  rownames(data_filter) <- NULL
  return(data_filter)
}

Arsenal try to suffocate opponents high up the field with controlled, possession-based football. Each of the current centre forwards has offered different traits in the role, but Arteta will undoubtedly be looking for someone who offers:

  • A goal threat: we will look at non-penalty goals, xG, and shot volume/efficiency.
  • Great link-up play: we will look at shot-creating actions, xA, and assists.
  • Pressing/work ethic: high-intensity pressing, tackles won, balls recovered.
  • Other traits: dribbling, first touch, aerial duels.
# Key metrics for Arsenal's CF profile

list_metrics <- c(
  "G.PK.90", # Non-penalty goals per 90
  "xG.90", # Expected goals (xG) per 90
  "Ast.90", # Assists per 90
  "xAG.90", # Expected assisted goals (xAG) per 90
  "SCA.90", # Shot-creating actions per 90
  "GCA.90", # Goal-creating actions per 90
  "TklW.90", # Tackles won per 90
  "Recov.90", # Ball recoveries per 90
  "Dribbles.90", # Successful dribbles per 90
  "Aerial.", # Aerial duel win percentage
  "Sh.90", # Shots per 90
  "SoT.90" # Shots on target per 90
)

We will keep centre forwards who have played more than 50% of available minutes and who are aged between 18 and 28 inclusive.

df_forwards_filter <-  filter_players(
  data = df_forwards,
  metrics = list_metrics,
  pct_min_minutes = 50, 
  age_max = 28,
  age_min = 18
)

cat("Centre forwards meeting criteria:", nrow(df_forwards_filter), "\n")
## Centre forwards meeting criteria: 156
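
The minutes threshold behind this filter can be worked through by hand. The helper below, `minutes_threshold`, is a hypothetical name that simply repeats the arithmetic inside filter_players; the 38-game value assumes a league with 20 teams.

```r
# Same arithmetic as the Min condition in filter_players()
minutes_threshold <- function(pct_min_minutes, mp_squad) {
  round((pct_min_minutes * 90 * mp_squad) / 100)
}

# 34-game season (MP_Squad = 34, as in the Bundesliga rows shown earlier)
minutes_threshold(50, 34)  # 1530

# 38-game season (assumed for leagues with 20 teams)
minutes_threshold(50, 38)  # 1710
```

Because the filter uses a strict `>`, a player with exactly 1530 minutes in a 34-game season would not be kept.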

We can now check that there are no duplicate players in our data set.

duplicated_players <- df_forwards_filter[
  duplicated(df_forwards_filter$Player), ]$Player

if(length(duplicated_players) > 0) {
  cat("Duplicated players:", paste(duplicated_players, collapse = ", "), "\n")
}else {
  cat("No duplicated players found\n")
}
## No duplicated players found
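
One detail of the check above is that duplicated() flags only the second and later occurrences of a value. The sketch below uses made-up player names to show this, and how every row of a repeated player could be flagged if, for example, a player appeared once per club after a mid-season move (an assumption about how such a dataset might look, not something observed here).

```r
# Hypothetical player column with one repeated name
players <- c("Player A", "Player B", "Player A", "Player C")

# duplicated() marks only the repeat, not the first occurrence
repeat_flags <- duplicated(players)

# Names that occur more than once
repeated_names <- unique(players[repeat_flags])

# Flag every row belonging to a repeated player
all_rows_flagged <- players %in% repeated_names

data.frame(players, repeat_flags, all_rows_flagged)
```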

To make our data easier to read, we will now rename the metrics.

df_forwards_rename <- df_forwards_filter %>%
  rename(
    'Non Penalty Goals/90' = 'G.PK.90',
    'Expected Goals/90' = 'xG.90',
    'Assists/90' = 'Ast.90',
    'Expected Assists/90' = 'xAG.90',
    'Shot Creating Actions/90' = 'SCA.90',
    'Goal Creating Actions/90' = 'GCA.90',
    'Tackles Won/90' = 'TklW.90',
    'Recoveries/90' = 'Recov.90',
    'Dribbles/90' = 'Dribbles.90',
    'Aerial Win %' = 'Aerial.',
    'Shots/90' = 'Sh.90',
    'Shots on Target/90' = 'SoT.90'
  )

head(df_forwards_rename)
##                Player          Squad Age Competition Non Penalty Goals/90
## 1         Adam Hložek     Hoffenheim  22  Bundesliga                 0.38
## 2 Benedict Hollerbach   Union Berlin  23  Bundesliga                 0.32
## 3      Benjamin Šeško     RB Leipzig  21  Bundesliga                 0.42
## 4         Deniz Undav      Stuttgart  28  Bundesliga                 0.47
## 5   Ermedin Demirović      Stuttgart  26  Bundesliga                 0.73
## 6        Hugo Ekitike Eint Frankfurt  22  Bundesliga                 0.49
##   Expected Goals/90 Assists/90 Expected Assists/90 Shot Creating Actions/90
## 1              0.27       0.19                0.17                     2.74
## 2              0.25       0.04                0.06                     2.80
## 3              0.38       0.19                0.08                     1.93
## 4              0.61       0.16                0.19                     3.02
## 5              0.69       0.05                0.09                     2.28
## 6              0.76       0.28                0.24                     3.55
##   Goal Creating Actions/90 Tackles Won/90 Recoveries/90 Dribbles/90
## 1                     0.48           0.37          2.44        1.22
## 2                     0.25           0.65          3.26        1.24
## 3                     0.34           0.09          1.88        1.18
## 4                     0.42           0.37          1.78        0.37
## 5                     0.34           0.21          0.97        0.15
## 6                     0.42           0.36          2.64        1.58
##   Aerial Win % Shots/90 Shots on Target/90
## 1         50.6     2.84               1.15
## 2         30.9     2.69               0.85
## 3         58.8     2.50               1.10
## 4         40.4     3.91               1.62
## 5         42.0     3.11               1.31
## 6         46.8     4.00               1.55

Now that we have our sample of players and the relevant metrics, it is time to create a weighting system that represents the needs of Arsenal's number 9, one that reflects the club's priorities in the transfer market. We will weight this model slightly in favour of goalscoring ability, given the team's need to find goals in that area.
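
To make the weighting idea concrete, here is a minimal base R sketch on toy data: each metric is min-max scaled to 0-100 and the scaled values are combined with weights that sum to 1. The three players, their values, the weights and the helper `min_max` are all invented for illustration.

```r
# Invented values for three hypothetical strikers
toy <- data.frame(
  Player = c("A", "B", "C"),
  goals  = c(0.20, 0.50, 0.80),
  aerial = c(60, 30, 45)
)

# Illustrative weights (they sum to 1)
toy_weights <- c(goals = 0.7, aerial = 0.3)

# Min-max scale a metric to the 0-100 range
min_max <- function(x) (x - min(x)) / (max(x) - min(x)) * 100

scaled <- data.frame(
  Player = toy$Player,
  goals  = min_max(toy$goals),
  aerial = min_max(toy$aerial)
)

# Weighted sum of the scaled metrics
scaled$Score <- scaled$goals * toy_weights[["goals"]] +
  scaled$aerial * toy_weights[["aerial"]]

# Highest score first
scaled[order(-scaled$Score), ]
```

Player C ranks first with a score of 85 (100 × 0.7 + 50 × 0.3), ahead of B on 35 and A on 30, which shows how a heavy goals weight can outweigh a weaker aerial record.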

scoring_calculate <- function(sample, metrics, weights){
  # Check that there is one weight per metric
  if (length(weights) != length(metrics)){
    stop("The number of weights must match the number of metrics")
  }
  # Check that weights sum to 1 (all.equal tolerates floating-point rounding)
  if (!isTRUE(all.equal(sum(weights), 1))){
    stop("The sum of weights must be equal to 1")
  }
  
  # Normalize metrics (0-100 scale)
  data_scaled <- sample
  for (metric in metrics){
    max_value <- max(sample[[metric]], na.rm = TRUE)
    min_value <- min(sample[[metric]], na.rm = TRUE)
    
    if (max_value == min_value){
      # A metric with no spread would divide by zero, so it contributes 0
      data_scaled[[metric]] <- 0
    } else {
      # Scale to 0-100
      data_scaled[[metric]] <- ((sample[[metric]] - min_value) / 
                                 (max_value - min_value)) * 100
    }
  }
  
  # Calculate weighted score
  data_scaled$Score <- 0
  for (i in seq_along(metrics)){
    data_scaled$Score <- data_scaled$Score + 
      (data_scaled[[metrics[i]]] * weights[i])
  }
  
  # Add ranking
  data_scaled <- data_scaled %>%
    arrange(desc(Score)) %>%
    mutate(Rank = row_number())
  
  return(data_scaled)
}
weights_arsenal_cf <- c(
  0.25, # Non penalty Goals/90 (Increased Weight - Main job is to score)
  0.18, # Expected Goals/90 (Increased Weight to show CF getting into positions)
  0.08, # Assists/90
  0.08, # Expected Assists/90
  0.10, # Shot Creating Actions/90
  0.06, # Goal Creating Actions/90
  0.08, # Tackles Won/90 (Pressing)
  0.04, # Recoveries/90
  0.04, # Dribbles/90
  0.06, # Aerial Win %
  0.02, # Shots/90
  0.01 # Shots on Target/90 
  )
# Verify weights sum to 1
cat("Total Weight:", sum(weights_arsenal_cf), "\n\n")
## Total Weight: 1
# Next, we calculate scores
data_final <- scoring_calculate(
  sample = df_forwards_filter,
  metrics = list_metrics,
  weights = weights_arsenal_cf
)

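Now we can take a look at the top performers up front and begin to build a shortlist of transfer options. Note that the metric columns in the table are shown on the normalised 0-100 scale used for scoring, not as raw per-90 values.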

top_20 <- data_final %>%
  select(Rank, Player, Squad, Age, Competition, Score, 'G.PK.90', 'Aerial.') %>%
  head(20) %>%
  mutate(across(where(is.numeric), ~round(., 2)))
kable(top_20, caption = "Top 20 Strikers for Arsenal")
Top 20 Strikers for Arsenal

| Rank | Player | Squad | Age | Competition | Score | G.PK.90 | Aerial. |
|---|---|---|---|---|---|---|---|
| 1 | Ousmane Dembélé | PSG | 27 | Ligue 1 | 75.61 | 97.20 | 59.44 |
| 2 | Michael Olise | Bayern Munich | 22 | Bundesliga | 63.79 | 42.99 | 71.03 |
| 3 | Raphinha | Barcelona | 27 | La Liga | 60.28 | 47.66 | 89.15 |
| 4 | Bradley Barcola | PSG | 21 | Ligue 1 | 59.93 | 54.21 | 99.11 |
| 5 | Kylian Mbappé | Real Madrid | 25 | La Liga | 57.22 | 69.16 | 59.44 |
| 6 | Leroy Sané | Bayern Munich | 28 | Bundesliga | 55.20 | 56.07 | 79.94 |
| 7 | Bukayo Saka | Arsenal | 22 | Premier League | 54.38 | 24.30 | 49.48 |
| 8 | Mateo Retegui | Atalanta | 25 | Serie A | 54.00 | 73.83 | 63.74 |
| 9 | Désiré Doué | PSG | 19 | Ligue 1 | 53.94 | 28.97 | 65.97 |
| 10 | Hugo Ekitike | Eint Frankfurt | 22 | Bundesliga | 53.39 | 45.79 | 69.54 |
| 11 | Rayan Cherki | Lyon | 20 | Ligue 1 | 51.51 | 32.71 | 49.48 |
| 12 | Patrik Schick | Leverkusen | 28 | Bundesliga | 48.19 | 100.00 | 66.86 |
| 13 | Mason Greenwood | Marseille | 22 | Ligue 1 | 47.73 | 42.06 | 79.20 |
| 14 | Nick Woltemade | Stuttgart | 22 | Bundesliga | 47.56 | 51.40 | 64.04 |
| 15 | Vinicius Júnior | Real Madrid | 24 | La Liga | 47.28 | 33.64 | 18.57 |
| 16 | Luis Díaz | Liverpool | 27 | Premier League | 46.90 | 45.79 | 37.89 |
| 17 | Alexander Isak | Newcastle Utd | 24 | Premier League | 46.00 | 57.94 | 47.70 |
| 18 | Serhou Guirassy | Dortmund | 28 | Bundesliga | 45.88 | 57.94 | 78.60 |
| 19 | Riccardo Orsolini | Bologna | 27 | Serie A | 44.82 | 54.21 | 81.43 |
| 20 | Erling Haaland | Manchester City | 24 | Premier League | 43.98 | 57.94 | 79.20 |

Tier Analysis

We will now separate our pool of talent into tiers, based on age, club, profile and realism of making the signing.

#Elite clubs and rivals: the chances of signing a player from these are much lower.
elite_clubs <- c(
  "Manchester City", "Real Madrid", "Barcelona", "Bayern Munich", "Tottenham",
  "Liverpool", "Chelsea", "Manchester United", "PSG"
)

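#Tier 1: Elite Benchmarks (age <= 27, top overall scores regardless of team)
tier1_elite <- data_final %>%
  filter(Age <= 27) %>%
  # Sort on the Score column: a quoted name such as desc('G.PK.90') is a constant string and sorts nothing
  arrange(desc(Score)) %>%
  head(10) %>%
  mutate(Tier = "Tier 1: Elite Benchmark")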

#Tier 2: Realistic Targets (age 23-27, not at the elite clubs or rivals) - MAIN FOCUS AREA
tier2_realistic <- data_final %>%
  filter(
    Age >= 23, Age <= 27,
    !Squad %in% elite_clubs
  ) %>%
  mutate(Tier = "Tier 2: Realistic Target")

#Tier 3A: Value Opportunities (Age 24-27, experienced players at smaller clubs)
tier3a_value <- data_final %>%
  filter(
    Age >= 24, Age <= 27,
    !Squad %in% elite_clubs
  ) %>%
  arrange(desc(Score)) %>%
  head(10) %>%
  mutate(Tier = "Tier 3A: Value Options")

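#Tier 3B: Young Prospects (age 19-23). Intended for pure centre forwards only, as at a young age
#there is little data to suggest they can be a success in any front 3 position.
#Note: data_final carries no Pos column, so position is not filtered in this step.
tier3b_prospects <- data_final %>%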
  filter(Age >= 19, Age <= 23) %>%
  arrange(desc(Score)) %>%
  head(10) %>%
  mutate(Tier = "Tier 3B: Young Prospects")
cat("Tier 3B Young Prospects - These are Centre Forwards, not wingers:\n")
## Tier 3B Young Prospects - These are Centre Forwards, not wingers:

Tier 1 - Ranking the elite benchmarks among the world's best strikers. These players are perhaps unattainable, but they are still useful for comparison (and perhaps an unexpected opportunity will arise).

tier1_display <- tier1_elite %>%
  select(Player, Squad, Age, `G.PK.90`, `xG.90`, `Aerial.`, Score) %>%
  mutate(across(where(is.numeric), ~round(., 2)))

kable(tier1_display, caption = "Tier 1: Elite Benchmark Centre Forwards")
Tier 1: Elite Benchmark Centre Forwards

| Player | Squad | Age | G.PK.90 | xG.90 | Aerial. | Score |
|---|---|---|---|---|---|---|
| Ousmane Dembélé | PSG | 27 | 97.20 | 100.00 | 59.44 | 75.61 |
| Michael Olise | Bayern Munich | 22 | 42.99 | 42.86 | 71.03 | 63.79 |
| Raphinha | Barcelona | 27 | 47.66 | 70.24 | 89.15 | 60.28 |
| Bradley Barcola | PSG | 21 | 54.21 | 63.10 | 99.11 | 59.93 |
| Kylian Mbappé | Real Madrid | 25 | 69.16 | 92.86 | 59.44 | 57.22 |
| Bukayo Saka | Arsenal | 22 | 24.30 | 40.48 | 49.48 | 54.38 |
| Mateo Retegui | Atalanta | 25 | 73.83 | 82.14 | 63.74 | 54.00 |
| Désiré Doué | PSG | 19 | 28.97 | 29.76 | 65.97 | 53.94 |
| Hugo Ekitike | Eint Frankfurt | 22 | 45.79 | 88.10 | 69.54 | 53.39 |
| Rayan Cherki | Lyon | 20 | 32.71 | 23.81 | 49.48 | 51.51 |

Of course, a lot of these players are probbaly quite unrealistic. These players (particularly the ones from Europes elite clubs) would command enormous transfer fees and their clubs will likely not be willing to let them go. That being said, Arsenal should remain alert to these players situations, as market opportunities can arise unexpectedly. If these players run into contract negotiation issues, or if any of these players feel ready for a new challenge, then Arsenal should be ready to strike as these represent some of the worlds best players who would guarantee goals in Arsenasl team.

Having put together a shortlist of elite benchmarks, we now rank our top targets among the most realistic options.

tier2_display <- tier2_realistic %>%
  select(Player, Squad, Age, Competition, `G.PK.90`, `Ast.90`, `Aerial.`, Score) %>%
  mutate(across(where(is.numeric), ~round(., 2)))

kable(tier2_display, caption = "Tier 2: Realistic Target Centre Forwards")
Tier 2: Realistic Target Centre Forwards
Player Squad Age Competition G.PK.90 Ast.90 Aerial. Score
Mateo Retegui Atalanta 25 Serie A 73.83 51.72 63.74 54.00
Alexander Isak Newcastle Utd 24 Premier League 57.94 34.48 47.70 46.00
Riccardo Orsolini Bologna 27 Serie A 54.21 32.76 81.43 44.82
Christian Pulisic Milan 25 Serie A 27.10 56.90 30.91 43.80
Evann Guessand Nice 23 Ligue 1 39.25 48.28 62.26 43.76
Ermedin Demirović Stuttgart 26 Bundesliga 68.22 8.62 62.41 43.39
Marcus Thuram Inter 26 Serie A 51.40 27.59 84.70 43.13
Julián Álvarez Atlético Madrid 24 La Liga 43.93 24.14 43.83 42.77
Bryan Mbeumo Brentford 24 Premier League 37.38 31.03 46.81 42.70
Rafael Leão Milan 25 Serie A 28.97 53.45 83.95 42.64
Jonathan Burkardt Mainz 05 24 Bundesliga 63.55 15.52 28.23 42.54
Moise Kean Fiorentina 24 Serie A 56.07 17.24 76.52 41.59
Antoine Semenyo Bournemouth 24 Premier League 28.97 24.14 69.09 40.98
Breel Embolo Monaco 27 Ligue 1 27.10 34.48 74.29 40.66
Yoane Wissa Brentford 27 Premier League 55.14 20.69 67.61 40.53
Ritsu Doan Freiburg 26 Bundesliga 28.97 37.93 45.77 39.09
Jarrod Bowen West Ham 27 Premier League 33.64 41.38 29.72 38.51
Lautaro Martínez Inter 26 Serie A 39.25 18.97 71.17 38.37
Kaoru Mitoma Brighton 27 Premier League 32.71 24.14 72.96 37.50
Harvey Barnes Newcastle Utd 26 Premier League 42.99 36.21 42.50 37.43
Marcus Tavernier Bournemouth 25 Premier League 13.08 39.66 65.53 37.41
Mohamed Amoura Wolfsburg 24 Bundesliga 27.10 56.90 44.13 37.11
Valentín Castellanos Lazio 25 Serie A 28.04 18.97 78.90 35.73
Jonathan David Lille 24 Ligue 1 32.71 31.03 41.01 35.45
Dan Ndoye Bologna 23 Serie A 23.36 29.31 59.44 34.63
Lassine Sinayoko Auxerre 24 Ligue 1 14.95 53.45 44.13 34.38
Anthony Gordon Newcastle Utd 23 Premier League 16.82 31.03 74.29 33.99
Kai Havertz Arsenal 25 Premier League 40.19 24.14 66.42 33.87
Dušan Vlahović Juventus 24 Serie A 28.04 34.48 71.47 33.80
Zuriko Davitashvili Saint-Étienne 23 Ligue 1 21.50 44.83 34.18 33.79
Dodi Lukebakio Sevilla 26 La Liga 27.10 10.34 68.65 33.37
Mathias Pereira Lage Brest 27 Ligue 1 10.28 68.97 68.50 33.11
Jean-Philippe Mateta Crystal Palace 27 Premier League 38.32 12.07 55.13 32.63
Mohammed Kudus West Ham 23 Premier League 15.89 17.24 32.10 32.57
Morgan Guilavogui St. Pauli 26 Bundesliga 28.04 17.24 69.54 32.55
Evanilson Bournemouth 24 Premier League 36.45 6.90 60.03 32.07
Robin Hack Gladbach 25 Bundesliga 15.89 50.00 61.37 31.69
Gabriel Martinelli Arsenal 23 Premier League 28.97 27.59 45.17 31.58
Keito Nakamura Reims 24 Ligue 1 34.58 12.07 49.48 31.34
Shuto Machino Holstein Kiel 24 Bundesliga 38.32 15.52 58.10 31.11
Issa Soumaré Le Havre 23 Ligue 1 22.43 32.76 89.15 30.27
Jonas Wind Wolfsburg 25 Bundesliga 34.58 24.14 73.55 30.18
Gabriel Strefezza Como 27 Serie A 19.63 24.14 43.68 29.73
Iliman Ndiaye Everton 24 Premier League 24.30 0.00 35.66 29.69
Benedict Hollerbach Union Berlin 23 Bundesliga 29.91 6.90 45.91 29.47
Loïs Openda RB Leipzig 24 Bundesliga 30.84 31.03 56.46 29.28
Jørgen Strand Larsen Wolves 24 Premier League 45.79 24.14 59.58 29.21
Marvin Pieringer Heidenheim 24 Bundesliga 14.95 27.59 55.13 28.33
Nikola Krstović Lecce 24 Serie A 24.30 25.86 61.81 28.23
Phillip Tietz Augsburg 27 Bundesliga 29.91 18.97 71.17 27.61
Dennis Man Parma 25 Serie A 16.82 31.03 51.41 27.60
Jorge de Frutos Rayo Vallecano 27 La Liga 20.56 18.97 57.21 27.29
Artem Dovbyk Roma 27 Serie A 34.58 12.07 61.22 27.02
Patrick Cutrone Como 26 Serie A 28.97 31.03 38.34 26.61
Hugo Duro Valencia 24 La Liga 39.25 13.79 54.83 26.44
Carlos Vicente Alavés 25 La Liga 11.21 25.86 53.79 26.07
Farid El Melali Angers 27 Ligue 1 9.35 24.14 32.99 25.97
Juan Cruz Leganés 24 La Liga 18.69 27.59 65.53 25.44
Callum Hudson-Odoi Nott’ham Forest 23 Premier League 19.63 13.79 34.32 25.14
Javi Puado Espanyol 26 La Liga 19.63 20.69 34.32 24.96
Bryan Gil Girona 23 La Liga 14.95 27.59 0.00 24.66
Esteban Lepaul Angers 24 Ligue 1 46.73 0.00 63.45 24.66
Gustav Isaksen Lazio 23 Serie A 14.95 13.79 52.15 24.54
Lorenzo Lucca Udinese 23 Serie A 39.25 6.90 69.09 24.45
Lucas Beltrán Fiorentina 23 Serie A 13.08 31.03 55.87 24.44
Nicolás González Juventus 26 Serie A 14.02 17.24 82.76 24.28
Tete Morente Lecce 27 Serie A 12.15 15.52 91.68 24.09
Samuel Essende Augsburg 26 Bundesliga 36.45 18.97 65.08 24.07
Viktor Tsyhankov Girona 26 La Liga 8.41 41.38 49.48 24.04
Mikel Oyarzabal Real Sociedad 27 La Liga 18.69 20.69 45.32 23.93
Gorka Guruzeta Athletic Club 27 La Liga 31.78 17.24 50.97 23.86
Oladapo Afolayan St. Pauli 26 Bundesliga 14.95 8.62 44.13 23.65
Santiago Pierotti Lecce 23 Serie A 17.76 15.52 66.12 23.43
Roberto Piccoli Cagliari 23 Serie A 24.30 5.17 63.60 23.18
Isaac Romero Sevilla 24 La Liga 15.89 13.79 53.94 21.44
Andrea Pinamonti Genoa 25 Serie A 29.91 5.17 63.60 20.91
Dany Mota Monza 26 Serie A 19.63 15.52 68.05 20.61
Junior Adamu Freiburg 23 Bundesliga 11.21 20.69 48.89 20.32
Amin Sarr Hellas Verona 23 Serie A 17.76 8.62 61.96 18.86
Johannes Eggestein St. Pauli 26 Bundesliga 9.35 32.76 36.85 17.85
Jack Harrison Everton 27 Premier League 3.74 0.00 18.57 16.65
Miguel Leganés 24 La Liga 0.00 27.59 34.47 14.02
Alessandro Zanoli Genoa 23 Serie A 4.67 0.00 74.29 13.32

Tier 3A - Value Options

This tier looks at strikers at smaller clubs who offer an immediate impact but at a more affordable cost.

tier3a_display <- tier3a_value %>%
  select(Player, Squad, Age, Competition, `G.PK.90`, Score) %>%
  mutate(Score = round(Score, 2), `G.PK.90` = round(`G.PK.90`, 2))

kable(tier3a_display, caption = "Tier 3A: Value Option Strikers")
Tier 3A: Value Option Strikers
Player Squad Age Competition G.PK.90 Score
Mateo Retegui Atalanta 25 Serie A 73.83 54.00
Alexander Isak Newcastle Utd 24 Premier League 57.94 46.00
Riccardo Orsolini Bologna 27 Serie A 54.21 44.82
Christian Pulisic Milan 25 Serie A 27.10 43.80
Ermedin Demirović Stuttgart 26 Bundesliga 68.22 43.39
Marcus Thuram Inter 26 Serie A 51.40 43.13
Julián Álvarez Atlético Madrid 24 La Liga 43.93 42.77
Bryan Mbeumo Brentford 24 Premier League 37.38 42.70
Rafael Leão Milan 25 Serie A 28.97 42.64
Jonathan Burkardt Mainz 05 24 Bundesliga 63.55 42.54

Tier 3B - Young Prospects

To build a detailed pipeline of talent, it is also worth looking at younger options with significant upside. If none of our ‘prime’ targets are available, then it is shrewd business to assess the market for younger, lesser-known options.

tier3b_display <- tier3b_prospects %>%
  select(Player, Squad, Age, Competition, `G.PK.90`, `Aerial.`, Score) %>%
  mutate(across(where(is.numeric), ~round(., 2)))

kable(tier3b_display, caption = "Tier 3B: Young Prospects")
Tier 3B: Young Prospects
Player Squad Age Competition G.PK.90 Aerial. Score
Michael Olise Bayern Munich 22 Bundesliga 42.99 71.03 63.79
Bradley Barcola PSG 21 Ligue 1 54.21 99.11 59.93
Bukayo Saka Arsenal 22 Premier League 24.30 49.48 54.38
Désiré Doué PSG 19 Ligue 1 28.97 65.97 53.94
Hugo Ekitike Eint Frankfurt 22 Bundesliga 45.79 69.54 53.39
Rayan Cherki Lyon 20 Ligue 1 32.71 49.48 51.51
Mason Greenwood Marseille 22 Ligue 1 42.06 79.20 47.73
Nick Woltemade Stuttgart 22 Bundesliga 51.40 64.04 47.56
Evann Guessand Nice 23 Ligue 1 39.25 62.26 43.76
Maghnes Akliouche Monaco 22 Ligue 1 14.02 26.60 43.13

Similarity Algorithms

We can now use similarity algorithms to find the strikers who match Arsenal’s desired profile.

We are going to use the top-scoring player in Tier 1 (Ousmane Dembele) as the reference point for comparison.

similiarity_tool <- function(sample, player, metrics, metrics_rename, distance, n){
  data_scaled <- sample
  
  # Scale each metric
  for (metric in metrics) {
    max_value <- max(sample[[metric]], na.rm = TRUE)
    min_value <- min(sample[[metric]], na.rm = TRUE)
    data_scaled[[metric]] <- (sample[[metric]] - min_value) / (max_value - min_value)
  }
  
  # Select only metrics for distance calculation
  data_for_dist <- data_scaled[, metrics]
  rownames(data_for_dist) <- sample$Player
  
  # Calculate distance matrix
  if(distance == "euclidean"){
    mat_dist <- as.matrix(dist(data_for_dist, method = "euclidean"))
  } else if(distance == "cosine"){
    mat_dist <- as.matrix(1 - cosine(t(as.matrix(data_for_dist))))
  } else {
    stop("Distance method must be 'euclidean' or 'cosine'")
  }
  
  # Extract the similarity for our target player
  if(!(player %in% rownames(mat_dist))){
    stop(paste("Player", player, "not found in sample"))
  }
  
  player_sim <- mat_dist[, player]
  df_sim <- data.frame(
    Player = names(player_sim),
    Distance = as.numeric(player_sim)
  )
  
  # Drop the Player Themselves (distance = 0)
  df_sim <- df_sim[df_sim$Player != player, ]
  
  # Convert the distances to similarity percentage
  d95 <- quantile(df_sim$Distance, 0.95, na.rm = TRUE)
  df_sim$Similarity <- (1 - (df_sim$Distance / d95)) * 100
  
  # Order by distance (most similar first)
  df_sim <- df_sim[order(df_sim$Distance), ]
  
  # Take top n (head() avoids NA rows when fewer than n players are available)
  final_df <- head(df_sim[, c("Player", "Similarity")], n)
  
  # Merge with original data for context
  data_clean <- sample %>%
    select(Player, Age, Squad, Competition, all_of(metrics))
  
  # Rename the metric columns
  colnames(data_clean)[colnames(data_clean) %in% metrics] <- metrics_rename
  
  final_df <- merge(
    x = final_df, y = data_clean, 
    by = "Player", all.x = TRUE
  )
  
  final_df <- final_df[order(-final_df$Similarity), ]
  rownames(final_df) <- seq_len(nrow(final_df))
  return(final_df)
}
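
As a minimal sketch of the two key steps inside the function (min-max scaling, then converting distance into a similarity percentage), the example below uses three made-up players and two made-up metrics. The names toy, goals and shots are illustrative assumptions and are not columns from the dataset.

```r
# Toy data: three hypothetical players and two metrics (values invented for illustration)
toy <- data.frame(
  Player = c("A", "B", "C"),
  goals  = c(0.2, 0.5, 0.8),
  shots  = c(1.0, 2.0, 5.0)
)

# Min-max scale each metric to the [0, 1] range
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
toy_scaled <- as.data.frame(lapply(toy[, c("goals", "shots")], min_max))
rownames(toy_scaled) <- toy$Player

# Euclidean distances from player "A" to everyone else
d <- as.matrix(dist(toy_scaled, method = "euclidean"))[, "A"]
d <- d[names(d) != "A"]

# Convert distance to a similarity percentage relative to the 95th percentile distance
d95 <- quantile(d, 0.95, na.rm = TRUE)
similarity <- (1 - d / d95) * 100
round(similarity, 2)
```

Note that a player whose distance exceeds the 95th percentile ends up with a negative similarity under this formula, which is one reason only the closest n players are reported.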

It is easy to see why Ousmane Dembele is the current Ballon d’Or holder. After a historic season in which he fired PSG to their first ever Champions League title and a treble, Dembele ranks highest among our Tier 1 forwards. His style, physical profile, output and age would all align perfectly with Arsenal’s needs if they were able to sign a player of his quality. As this is unlikely, we will use him as a benchmark when targeting other, more realistic elite forwards.

metrics_rename <- c(
  "npg/90", "xG/90", "Ast/90", "xAG/90", "SCA/90", "GCA/90", "tklw/90", "Recov/90", "Drib/90", "Aerial%", "Sh/90", "SoT/90"
)
  
# Use the top-ranked forward (Dembele) as the gold-standard benchmark

reference_player <- tier1_elite$Player[1]

cat("Using", reference_player, "as reference player\n")
## Using Ousmane Dembélé as reference player
cat("(Elite Tier 1 - The Gold Standard)\n")
## (Elite Tier 1 - The Gold Standard)
cat("\nFinding realistic targets who play most similarly to Dembele\n\n")
## 
## Finding realistic targets who play most similarly to Dembele
# Euclidean distance (similar absolute output)
sim_euclidean <- similiarity_tool(
  sample = data_final, 
  player = reference_player,
  metrics = list_metrics,
  metrics_rename = metrics_rename,
  distance = "euclidean",
  n = 15
)

kable(sim_euclidean[, 1:6],
      caption = paste("Top 15 Similar Centre Forwards to", reference_player, "(Euclidean Distance)"), digits = 2)
Top 15 Similar Centre Forwards to Ousmane Dembélé (Euclidean Distance)
Player Similarity Age Squad Competition npg/90
Kylian Mbappé 60.22 25 Real Madrid La Liga 69.16
Leroy Sané 52.94 28 Bayern Munich Bundesliga 56.07
Raphinha 51.53 27 Barcelona La Liga 47.66
Hugo Ekitike 51.41 22 Eint Frankfurt Bundesliga 45.79
Bradley Barcola 50.89 21 PSG Ligue 1 54.21
Mateo Retegui 46.11 25 Atalanta Serie A 73.83
Nick Woltemade 45.54 22 Stuttgart Bundesliga 51.40
Alexander Isak 43.29 24 Newcastle Utd Premier League 57.94
Deniz Undav 42.80 28 Stuttgart Bundesliga 43.93
Mason Greenwood 42.25 22 Marseille Ligue 1 42.06
Michael Olise 38.85 22 Bayern Munich Bundesliga 42.99
Julián Álvarez 38.10 24 Atlético Madrid La Liga 43.93
Erling Haaland 36.61 24 Manchester City Premier League 57.94
Riccardo Orsolini 36.57 27 Bologna Serie A 54.21
Rafael Leão 36.12 25 Milan Serie A 28.97

Next, we can take a look at the cosine similarity comparisons, which measure similarity between tactical profiles regardless of total output volume (matching strengths with strengths and weaknesses with weaknesses).

# Cosine distance (similar style/profile)

sim_cosine <- similiarity_tool(
  sample = data_final,
  player = reference_player,
  metrics = list_metrics, 
  metrics_rename = metrics_rename,
  distance = "cosine",
  n = 15
)

kable(sim_cosine[, 1:6],
      caption = paste("Top 15 Similar Strikers to", reference_player, "(Cosine Similarity)"),
      digits = 2)
Top 15 Similar Strikers to Ousmane Dembélé (Cosine Similarity)
Player Similarity Age Squad Competition npg/90
Harvey Barnes 91.47 26 Newcastle Utd Premier League 42.99
Alexander Isak 89.64 24 Newcastle Utd Premier League 57.94
Nick Woltemade 89.24 22 Stuttgart Bundesliga 51.40
Kylian Mbappé 87.46 25 Real Madrid La Liga 69.16
Julián Álvarez 87.24 24 Atlético Madrid La Liga 43.93
Leroy Sané 85.23 28 Bayern Munich Bundesliga 56.07
Deniz Undav 84.97 28 Stuttgart Bundesliga 43.93
Nicolas Jackson 84.96 23 Chelsea Premier League 38.32
Hugo Ekitike 84.10 22 Eint Frankfurt Bundesliga 45.79
Mateo Retegui 82.68 25 Atalanta Serie A 73.83
Jonathan David 81.58 24 Lille Ligue 1 32.71
Loïs Openda 80.25 24 RB Leipzig Bundesliga 30.84
Raphinha 80.03 27 Barcelona La Liga 47.66
Shuto Machino 79.63 24 Holstein Kiel Bundesliga 38.32
Bradley Barcola 79.24 21 PSG Ligue 1 54.21
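
To illustrate why the two measures can rank players differently, here is a small sketch with three invented profiles. The profile names (ref, low, other) and their values are assumptions made purely for illustration, and the cosine distance is computed by hand rather than with the lsa package.

```r
# Hypothetical scaled profiles: "low" has the same shape as "ref" at half the volume,
# while "other" has a similar volume but a different mix of strengths
profiles <- rbind(
  ref   = c(finishing = 0.8, creating = 0.4, defending = 0.2),
  low   = c(finishing = 0.4, creating = 0.2, defending = 0.1),
  other = c(finishing = 0.7, creating = 0.6, defending = 0.3)
)

# Cosine distance computed by hand: 1 - (a . b) / (|a| * |b|)
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Distances from the reference profile to the other two
euc_dist <- as.matrix(dist(profiles, method = "euclidean"))["ref", c("low", "other")]
cos_dist <- c(
  low   = cosine_distance(profiles["ref", ], profiles["low", ]),
  other = cosine_distance(profiles["ref", ], profiles["other", ])
)

round(rbind(euclidean = euc_dist, cosine = cos_dist), 3)
```

In this toy case Euclidean distance treats "other" as the closer match because its raw values are nearer to the reference, whereas cosine distance treats "low" as essentially identical because its strengths are in the same proportions.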

From this, we can generate a side-by-side view of the top 10 players based on both cosine and Euclidean similarity to Dembele.

comparison <- tibble(
  Rank = 1:10,
  `Euclidean (Output)` = sim_euclidean$Player[1:10],
  `Cosine (Style)` = sim_cosine$Player[1:10]
)

kable(comparison, caption = "Euclidean vs Cosine Top 10 Comparison")
Euclidean vs Cosine Top 10 Comparison
Rank Euclidean (Output) Cosine (Style)
1 Kylian Mbappé Harvey Barnes
2 Leroy Sané Alexander Isak
3 Raphinha Nick Woltemade
4 Hugo Ekitike Kylian Mbappé
5 Bradley Barcola Julián Álvarez
6 Mateo Retegui Leroy Sané
7 Nick Woltemade Deniz Undav
8 Alexander Isak Nicolas Jackson
9 Deniz Undav Hugo Ekitike
10 Mason Greenwood Mateo Retegui
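
As a rough sketch of how the two lists could be compared programmatically, the set operations below pick out players who appear in both top 10s and those unique to each. The names are copied from the comparison table; the variable names (euclid_top10, cosine_top10 and so on) are illustrative assumptions.

```r
# Top-10 names copied from the comparison table
euclid_top10 <- c(
  "Kylian Mbappé", "Leroy Sané", "Raphinha", "Hugo Ekitike", "Bradley Barcola",
  "Mateo Retegui", "Nick Woltemade", "Alexander Isak", "Deniz Undav", "Mason Greenwood"
)
cosine_top10 <- c(
  "Harvey Barnes", "Alexander Isak", "Nick Woltemade", "Kylian Mbappé", "Julián Álvarez",
  "Leroy Sané", "Deniz Undav", "Nicolas Jackson", "Hugo Ekitike", "Mateo Retegui"
)

# Players similar to the benchmark in both output and style
in_both <- intersect(euclid_top10, cosine_top10)

# Players flagged by only one of the two measures
only_euclid <- setdiff(euclid_top10, cosine_top10)
only_cosine <- setdiff(cosine_top10, euclid_top10)

in_both
only_euclid
only_cosine
```

In the full workflow the same idea could be applied directly to sim_euclidean$Player[1:10] and sim_cosine$Player[1:10] instead of typed-out names.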

We can now visualize our top targets to compare profiles. Based on our scoring and similarity analysis, we select our most interesting targets for a deeper dive. From there, we can use our football knowledge of each player’s current situation to decide upon the final shortlist.

  • Mbappe and Raphinha are untouchable at their clubs; barring a dramatic change in availability, it is impossible to sign them.
  • Isak and Ekitike have recently signed for Liverpool, and Woltemade for Newcastle.
  • Barnes, while clearly productive, has produced his best attacking output mainly when Newcastle have been hit by injuries, and has himself struggled for form and fitness at times.
# Use the absolute Benchmark (Dembele) as our gold standard
elite_reference <- tier1_elite$Player[1]

# Top Scorer from Tier 2 (Best Realistic Target)
top_tier2 <- tier2_realistic$Player[1]

# Most similar player (Euclidean). We assume that Mbappe is untouchable at Real Madrid, so we default to option 2

most_similar_eucl <- sim_euclidean$Player[2]

# Most similar player (cosine), ensuring a different result from Euclidean for a more in-depth comparison.
# Barnes, Isak, Woltemade and Mbappe are excluded explicitly for the reasons explained above
excluded_players <- c("Harvey Barnes", "Alexander Isak", "Nick Woltemade", "Kylian Mbappé", most_similar_eucl)
cosine_candidates <- sim_cosine$Player[!sim_cosine$Player %in% excluded_players]
most_similar_cos <- cosine_candidates[1]

#Top Young Prospect
top_prospect <- tier3b_prospects$Player[1]

# Combine these players for radar (include elite benchmark and realistic options)
players_radar <- unique(c(elite_reference, top_tier2, most_similar_eucl, most_similar_cos, top_prospect))

cat("Centre Forwards selected for radar comparison:\n\n")
## Centre Forwards selected for radar comparison:
cat("  1.", elite_reference, "(Tier 1 - World's Best)\n\n")
##   1. Ousmane Dembélé (Tier 1 - World's Best)
cat("REALISTIC ALTERNATIVES:\n")
## REALISTIC ALTERNATIVES:
for(i in 2:length(players_radar)){
cat(" ", i, ".", players_radar[i], "\n")
}
##   2 . Mateo Retegui 
##   3 . Leroy Sané 
##   4 . Julián Álvarez 
##   5 . Michael Olise

So there we have it: a five-man shortlist of fantastic forwards for Arsenal to consider. It comprises Dembele (who we know would be perfect but unrealistic), Retegui, Sane, Alvarez and Olise.

Now we calculate the 5th and 95th percentile for each metric to create the radar boundaries.

min_max_df <- rbind(
  apply(data_final[, list_metrics], 2, 
        function(x) quantile(x, probs = 0.95, na.rm = TRUE)),
  apply(data_final[, list_metrics], 2,
        function(x) quantile(x, probs = 0.05, na.rm = TRUE))
)
rownames(min_max_df) <- c("p95", "p5")
min_max_df
##       G.PK.90    xG.90   Ast.90    xAG.90    SCA.90    GCA.90   TklW.90
## p95 57.943925 77.67857 62.50000 64.673913 75.188324 57.024793 71.487603
## p5   9.345794 13.09524  3.87931  6.521739  5.178908  8.264463  9.297521
##      Recov.90 Dribbles.90  Aerial.    Sh.90   SoT.90
## p95 67.533937   70.421245 83.06092 70.29478 58.36820
## p5   7.522624    3.205128 28.04606 16.66667 10.46025

Now we can prepare our radar charts

# Filter Data for Selected Players
df_forwards_radar <- data_final[data_final$Player %in% players_radar, ]

# Ensure values are within [p5, p95] boundaries

for (metric in list_metrics) {
  for (p in players_radar) {
    value_c <- df_forwards_radar[df_forwards_radar$Player == p, metric]
    
    if(length(value_c) > 0 && !is.na(value_c)){
      if(value_c < min_max_df["p5", metric]){
        df_forwards_radar[df_forwards_radar$Player == p, metric] <- min_max_df["p5", metric]
      } else if (value_c > min_max_df["p95", metric]){
        df_forwards_radar[df_forwards_radar$Player == p, metric] <- min_max_df["p95", metric]
      }
    }
  }
}

#Create final radar dataframe

df_forwards_radar <- as.data.frame(df_forwards_radar)
rownames(df_forwards_radar) <- df_forwards_radar$Player
df_final_plot <- rbind(
  min_max_df, df_forwards_radar[, list_metrics]
)
df_final_plot
##                   G.PK.90    xG.90   Ast.90    xAG.90    SCA.90    GCA.90
## p95             57.943925 77.67857 62.50000 64.673913 75.188324 57.024793
## p5               9.345794 13.09524  3.87931  6.521739  5.178908  8.264463
## Ousmane Dembélé 57.943925 77.67857 53.44828 64.673913 75.188324 57.024793
## Michael Olise   42.990654 42.85714 62.50000 64.673913 75.188324 57.024793
## Leroy Sané      56.074766 64.28571 46.55172 58.695652 56.120527 37.190083
## Mateo Retegui   57.943925 77.67857 51.72414 32.608696 31.450094 40.495868
## Julián Álvarez  43.925234 58.33333 24.13793 34.782609 50.282486 41.322314
##                   TklW.90  Recov.90 Dribbles.90  Aerial.    Sh.90   SoT.90
## p95             71.487603 67.533937   70.421245 83.06092 70.29478 58.36820
## p5               9.297521  7.522624    3.205128 28.04606 16.66667 10.46025
## Ousmane Dembélé 11.570248 12.443439   47.619048 59.43536 70.29478 58.36820
## Michael Olise   53.719008 52.036199   70.421245 71.02526 62.13152 51.46444
## Leroy Sané      38.842975 43.891403   42.124542 79.94056 70.29478 58.36820
## Mateo Retegui   18.181818 25.565611    6.593407 63.74443 70.29478 43.93305
## Julián Álvarez  35.537190 28.054299   28.205128 43.83358 43.76417 47.28033
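
The clipping step can also be written in vectorised form with pmin() and pmax(). The sketch below shows the idea on invented data; toy_data and toy_metrics are hypothetical stand-ins for data_final and list_metrics, and this is an alternative formulation rather than the code used to produce the output above.

```r
# Toy stand-ins for data_final and list_metrics (values invented for illustration)
toy_metrics <- c("m1", "m2")
toy_data <- data.frame(
  Player = c("A", "B", "C", "D", "E"),
  m1 = c(1, 5, 10, 50, 100),
  m2 = c(0, 20, 40, 60, 80)
)

# 5th and 95th percentile bounds per metric, named by metric
lower <- vapply(toy_data[toy_metrics],
                function(x) unname(quantile(x, probs = 0.05, na.rm = TRUE)),
                numeric(1))
upper <- vapply(toy_data[toy_metrics],
                function(x) unname(quantile(x, probs = 0.95, na.rm = TRUE)),
                numeric(1))

# Clip every value into [p5, p95]; NA values stay NA
clipped <- toy_data
for (m in toy_metrics) {
  clipped[[m]] <- pmin(pmax(toy_data[[m]], lower[[m]]), upper[[m]])
}
clipped
```

Clipping keeps one extreme value from stretching a radar axis, at the cost of hiding how far beyond the 95th percentile an outlier actually sits.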

We are now ready to create our radar chart

# Define radar chart function
create_radarchart <- function(data, color = "#00AFBB",
                              vlabels = colnames(data), vlcex = 0.7,
                              caxislabels = NULL, title = NULL){
  radarchart(
    data, axistype = 1,
    
    # Polygon
    pcol = color, pfcol = scales::alpha(color, 0.5),
    plwd = 2, plty = 1,
    cglcol = "grey", cglty = 1, cglwd = 0.8,
    
    # Axis
    axislabcol = "grey30",
    
    #Labels
    vlcex = vlcex, vlabels = vlabels,
    caxislabels = caxislabels, title = title
  )
}
# Metric Names for radar chart
metrics_name_plot <- c(
  "npG/90", "xG/90", "Ast/90", "xAG/90", "SCA/90", "GCA/90", "tklW/90", "Recov/90", "Drib/90", "Aerial%", "Sh/90", "SoT/90"
)

# Colors

colors_radar <- c("#EF0107", "#00AFBB", "#f7d62d", "#8DBF8D", "purple4")[1:(nrow(df_final_plot) - 2)]


# Create Radar chart

op <- par(mar = c(1, 2, 2, 2))
create_radarchart(
  data = df_final_plot,
  color = colors_radar,
  vlabels = metrics_name_plot
)

legend("bottomleft",
       legend = rownames(df_final_plot[-c(1,2), ]),
       horiz = FALSE,
       bty = 'n', pch = 20,
       col = colors_radar,
       text.col = "black", cex = 0.7, pt.cex = 2)

title(
  main = "Arsenal Striker Search\nComparing Realistic Targets to the world's best (2024/25)",
  cex.main = 1.1, col.main = "#5D6D7E"
)

# Restore the previous graphical parameters
par(op)

Conclusions and Recommendations

Based on our analysis, I have added further context to the search for Arsenal’s next forward and made recommendations based not just on data, but also on wider football circumstances.

The Benchmark - Arsenal would love to be able to sign Ousmane Dembele or his French compatriot Kylian Mbappe. However, these players are currently unattainable and would command astronomical fees.

Realistic Alternatives - Julian Alvarez and Mateo Retegui are excellent players who would suit Arsenal’s style, intensity and system. While both would cost more than 60m Euros, they are far more realistic market opportunities. Of the two, I would recommend that Arsenal sign Julian Alvarez. While his metrics are slightly lower than Retegui’s, he is slightly younger and, based on his previous successful stint in England (with Man City), he should hit the ground running and make an instant impact. His metrics are impressive and well rounded, and there is still room to improve the defensive side of his game as well as his creativity. I am confident that under Arteta, and playing alongside other world-class players like Saka and Odegaard, Alvarez can make those improvements.

Other Options - Leroy Sane is an excellent tactical fit for Arsenal, is Premier League proven and has previously worked with Arteta. However, at 29, he does not offer a huge amount of growth potential. Michael Olise and Bradley Barcola are other excellent young options to consider. Both would fit Arsenal’s style very well and have massive upside, but it would take a lot of money to prise either player from their clubs.

Limitations

  • Age of Data: I only had data from the 2024/25 season available for this study. As a result, players who moved in the summer transfer window (Isak, Ekitike) appear in the results despite no longer being on the market. The analysis also does not take into account current form in the 2025/26 season.

  • League Difficulty: Goals in the Premier League may be more valuable than goals in a slower league such as Serie A. This is not taken into account in the analysis, hence my preference for Alvarez over Retegui, as Alvarez has already proven his ability in the Premier League.

  • Team Quality: Our analysis does not account for players whose statistics may be inflated by playing in stronger attacking teams or alongside better players.

  • Injury Status: While I tried to account for injury-prone players by including a minimum number of minutes played, this is not fully captured in my analysis.

Future Possibilities - These are the areas where I could enhance this work in the future.

  • Transfer Value: This could be added to take a deep dive into which players represent the best value in the market.

  • Trajectory: Is the player trending upwards, or have they reached their peak?

  • Big Games: Do some of these players have inflated stats in smaller games? Who are the best big-game performers?

  • Injury History: Assess physical robustness across our talent pool.

Final Recommendation

In my opinion, Arsenal should already be doing their due diligence on whether a deal for Julian Alvarez is possible this summer. His playing style, age, robustness and proven Premier League ability make him a perfect match for Arsenal, and by signing him Arsenal could truly take a step towards becoming the best team in Europe.

sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8   
## [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.utf8    
## 
## time zone: America/Vancouver
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0    DT_0.34.0       knitr_1.51      lsa_0.73.4     
##  [5] SnowballC_0.7.1 fmsb_0.7.6      lubridate_1.9.4 forcats_1.0.1  
##  [9] stringr_1.6.0   dplyr_1.1.4     purrr_1.2.1     readr_2.1.6    
## [13] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
##  [9] generics_0.1.4     htmlwidgets_1.6.4  bslib_0.9.0        pillar_1.11.1     
## [13] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        stringi_1.8.7     
## [17] cachem_1.1.0       xfun_0.55          sass_0.4.10        S7_0.2.1          
## [21] otel_0.2.0         timechange_0.3.0   cli_3.6.5          withr_3.0.2       
## [25] magrittr_2.0.4     digest_0.6.39      grid_4.5.2         rstudioapi_0.18.0 
## [29] hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.0        evaluate_1.0.5    
## [33] glue_1.8.0         farver_2.1.2       rmarkdown_2.30     tools_4.5.2       
## [37] pkgconfig_2.0.3    htmltools_0.5.9
