1 Introduction
2 Define workspace
3 Libraries
4 Reading the data
5 Data filtering
- 5.1 Checking of duplicated records
6 Scoring calculation
7 Similarity algorithm
8 Visualisation - Radar graph
9 Replacing Robert Lewandowski – Strategic Forward Options (Season 24/25)
10 Strategic Implication for FC Barcelona
11 Final Recommendation

1 Introduction

The aim of my scouting task is to develop a complete end-to-end scouting process, applying concepts covered in the module and putting them into practice through a real-world case study.

My objective is to identify a suitable replacement for FC Barcelona’s Robert Lewandowski: a younger forward, selected using performance data from the 2024/25 season.

I will shortlist potential candidates and conclude which of them is the strongest option to replace him.

My scouting analysis will be based on the following analytical approaches:

Developing a scoring system to evaluate potential candidate players.
Applying a similarity algorithm to identify and shortlist the players most similar to Robert Lewandowski (RL).
Supporting my analysis with visualisations that help me reach my conclusions.

2 Define workspace

All file paths are specified relative to the project directory to ensure reproducibility.

3 Libraries

As a next step, I import the libraries needed for my scouting analysis. Any library that is not yet installed can be installed with the command install.packages("<library>").

library(tidyverse)
library(fmsb)
library(lsa)
library(knitr)
library(DT)
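A minimal sketch of the install check, assuming the five packages above are the only dependencies. The helper name missing_packages is my own (it is not part of any package). It only reports what is missing, and the actual install.packages() call is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "fmsb", "lsa", "knitr", "DT")
to_install <- missing_packages(required)
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```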

4 Reading the data

Next, I read the CSV file I am going to work with and select the sample (context) of analysis.

# Read the file
data <- read.csv("data/FBREF_BigPlayers_2425.csv", sep = ";", encoding = "UTF-8")

Function

select_players <- function(file, encoding, position, competition, primary, player){
  # 1. Read the file
  data <- read.csv(file, sep = ";", encoding = encoding)
  # 2. Filter by competition (NA keeps every competition)
  if (all(is.na(competition))) {
    cat("We consider all players", "\n")
    data_comp <- data
  } else {
    cat("We filter by competition", "\n")
    data_comp <- data %>%
      filter(Competition %in% competition)
  }
  # 3. Filter by position
  if (primary){
    # Primary position = first two characters of Pos
    cat("We keep the players whose main position is:", position)
    data_players <- data_comp %>% 
      filter(substr(Pos, 1, 2) == position)
  } else {
    # Position listed anywhere in Pos (start from data_comp, which always exists here)
    cat("We keep the players whose position is:", position)
    data_players <- data_comp %>% 
      filter(grepl(position, Pos))
  }
  # 4. Optionally keep only the named player(s)
  if (!all(is.na(player))){
    data_players <- data_players %>% filter(Player %in% player)
  }
  return (data_players)
}

Next, I use this function to keep the players whose primary position is FW (forward), the position of my target player, Robert Lewandowski. I consider all competitions.

df_forwards <- select_players(
  file = "data/FBREF_BigPlayers_2425.csv", 
  encoding = "UTF-8", 
  position = "FW", 
  competition = NA,
  primary = TRUE,
  player = NA
)

## We consider all players 
## We keep the players whose main position is: FW

cat("\nNumber of players:", nrow(df_forwards))

## 
## Number of players: 1055

Competitions included

unique(df_forwards$Competition)

## [1] "Bundesliga"     "Eredivisie"     "La Liga"        "Ligue 1"       
## [5] "Premier League" "Primeira Liga"  "Serie A"

I then narrow the search further to pure forwards, excluding wingers and hybrid profiles: Lewandowski is a classic striker, and my objective is to find players of the same type.

df_forwards <- df_forwards %>%
  filter(Pos == "FW")

cat("Number of pure forwards:", nrow(df_forwards))

## Number of pure forwards: 542
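To make the difference between the three position filters concrete, here is a small base R sketch on toy values. The position strings are an assumption about how the Pos column is formatted (comma-separated, primary position first); they are not taken from the dataset.

```r
# Toy position strings (assumed format: primary position listed first)
pos <- c("FW", "FW,MF", "MF,FW", "DF")

primary_fw <- substr(pos, 1, 2) == "FW"  # first listed position is FW
any_fw     <- grepl("FW", pos)           # FW listed anywhere
pure_fw    <- pos == "FW"                # FW is the only listed position

data.frame(pos, primary_fw, any_fw, pure_fw)
```

Under this assumption, the primary filter keeps "FW" and "FW,MF", while the pure filter keeps only "FW", which is why the sample shrinks between the two steps.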

5 Data filtering

The sample needs to be narrowed down further with additional filters.

I define the filter_players function, which keeps only the players who have played a minimum share of the available minutes. It also filters by age: I am looking for a player younger than RL who can be expected to play several seasons for the club.

My function will have the following parameters:

data: Data set read from a CSV.
metrics: List of metrics to be considered in the analysis.
pct_min_minutes: Minimum percentage of minutes played to be within the sample.
age_max: Maximum age of the player to be considered in the sample.

filter_players <- function(
  data, metrics, pct_min_minutes, age_max){
  # Minimum minutes = pct_min_minutes % of 90 minutes x MP_Squad
  # (assuming MP_Squad is the number of matches played by the squad)
  data_filter <- data %>%
    filter(Min > round((pct_min_minutes*90*MP_Squad) / 100), 
           Age <= age_max) %>%
    # all_of() makes the use of an external character vector explicit
    select(Player, Squad, Age, all_of(metrics))
  rownames(data_filter) <- 1:nrow(data_filter)
  return (data_filter)
}

Players included in my sample are forwards who played more than 50% of their team’s available minutes and who are aged 27 or younger. I selected this subset of metrics because together they describe the complete modern central striker I am looking for.

list_metrics <- c("G.PK.90", "xG.90", "G.xG", "Sh.90", "SoT.90", 
                  "SCA.90", "GCA.90", "KP.90", "Fld.90","FinalThirdPasses.90",
                  "PassesProgressive.90")

df_forwards_filter <- filter_players(
  data = df_forwards, 
  metrics = list_metrics, 
  pct_min_minutes = 50, 
  age_max = 27
)
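As a worked example of the minutes threshold, the sketch below reproduces the arithmetic used inside filter_players. The helper name min_minutes and the season lengths (38 and 34 matches) are illustrative assumptions, not values read from the data.

```r
# Hypothetical helper mirroring the threshold used in filter_players()
min_minutes <- function(pct_min_minutes, matches_played) {
  round((pct_min_minutes * 90 * matches_played) / 100)
}

min_minutes(50, 38)  # 38-match season: 1710 minutes
min_minutes(50, 34)  # 34-match season: 1530 minutes
```

Because the threshold scales with the squad's matches played, players from leagues with shorter seasons are not penalised by a single absolute cut-off.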

5.1 Checking of duplicated records

cat("Duplicated players:", 
      df_forwards_filter[
        duplicated(df_forwards_filter$Player),]$Player)

## Duplicated players:

To keep things tidy I will rename some metrics.

# Rename metrics
df_forwards_rename <- df_forwards_filter %>%
  rename(`Non-Penalty Goals by 90'` = `G.PK.90`,
         `Expected Goals (xG) by 90'` = `xG.90`,
         `Goals minus xG` = `G.xG`,
         `Shots by 90'` = `Sh.90`,
         `Shots on Target by 90'` = `SoT.90`,
         `Shot-Creating Actions by 90'` = `SCA.90`,
         `Goal-Creating Actions by 90'` = `GCA.90`,
         `Key Passes by 90'` = `KP.90`,
         `Fouls Drawn by 90'` = `Fld.90`,
         `Final Third Passes by 90'` = `FinalThirdPasses.90`,
         `Progressive Passes by 90'` = `PassesProgressive.90`)

head(df_forwards_rename)

##               Player          Squad Age Non-Penalty Goals by 90'
## 1  Ermedin Demirović      Stuttgart  26                     0.73
## 2       Hugo Ekitike Eint Frankfurt  22                     0.49
## 3 Johannes Eggestein      St. Pauli  26                     0.10
## 4  Jonathan Burkardt       Mainz 05  24                     0.68
## 5       Junior Adamu       Freiburg  23                     0.12
## 6     Mohamed Amoura      Wolfsburg  24                     0.29
##   Expected Goals (xG) by 90' Goals minus xG Shots by 90' Shots on Target by 90'
## 1                       0.69            0.9         3.11                   1.31
## 2                       0.76           -6.6         4.00                   1.55
## 3                       0.23           -1.7         1.65                   0.49
## 4                       0.63            3.2         2.99                   1.19
## 5                       0.32           -3.5         2.10                   0.64
## 6                       0.32            1.3         2.66                   0.91
##   Shot-Creating Actions by 90' Goal-Creating Actions by 90' Key Passes by 90'
## 1                         2.28                         0.34              0.53
## 2                         3.55                         0.42              1.33
## 3                         2.09                         0.19              0.89
## 4                         2.31                         0.34              0.69
## 5                         1.69                         0.17              0.68
## 6                         2.92                         0.47              1.13
##   Fouls Drawn by 90' Final Third Passes by 90' Progressive Passes by 90'
## 1               0.79                      0.44                      0.91
## 2               0.52                      0.73                      1.61
## 3               0.44                      0.93                      1.59
## 4               0.83                      0.86                      1.45
## 5               0.88                      0.28                      0.56
## 6               0.65                      0.94                      1.97

6 Scoring calculation

For the selected sample, I will calculate a value that summarises the performance of these players. This will serve as my rating.

Since I am working with metrics measured on different scales, the first step is to normalise the variables so that all metrics are on the same scale. Afterwards, I assign a weight to each metric and calculate the final score.

6.1 Data transformation

I will use a MinMax transformation to normalise the values of the performance variables. For that, I define a normalize function:

# Normalization function (MinMax: rescales x to the interval [0, 1])
normalize <- function(x, na.rm=TRUE){
  return((x-min(x, na.rm=na.rm))/(max(x, na.rm=na.rm)-min(x, na.rm=na.rm)))
}
# Normalize numeric columns (from 4 onwards)
df_forwards_norm <- data.frame(df_forwards_rename)

for (i in 4:ncol(df_forwards_rename)){
  df_forwards_norm[, i] <- normalize(df_forwards_rename[, i])
}

summary(df_forwards_norm)

##     Player             Squad                Age       Non.Penalty.Goals.by.90.
##  Length:102         Length:102         Min.   :17.0   Min.   :0.0000          
##  Class :character   Class :character   1st Qu.:22.0   1st Qu.:0.2139          
##  Mode  :character   Mode  :character   Median :24.0   Median :0.3221          
##                                        Mean   :23.6   Mean   :0.3453          
##                                        3rd Qu.:25.0   3rd Qu.:0.4519          
##                                        Max.   :27.0   Max.   :1.0000          
##  Expected.Goals..xG..by.90. Goals.minus.xG    Shots.by.90.   
##  Min.   :0.0000             Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1932             1st Qu.:0.3463   1st Qu.:0.2590  
##  Median :0.2614             Median :0.4797   Median :0.3500  
##  Mean   :0.3155             Mean   :0.4768   Mean   :0.3814  
##  3rd Qu.:0.4205             3rd Qu.:0.5980   3rd Qu.:0.5013  
##  Max.   :1.0000             Max.   :1.0000   Max.   :1.0000  
##  Shots.on.Target.by.90. Shot.Creating.Actions.by.90.
##  Min.   :0.0000         Min.   :0.0000              
##  1st Qu.:0.1878         1st Qu.:0.1487              
##  Median :0.2978         Median :0.2635              
##  Mean   :0.3096         Mean   :0.3024              
##  3rd Qu.:0.3956         3rd Qu.:0.4330              
##  Max.   :1.0000         Max.   :1.0000              
##  Goal.Creating.Actions.by.90. Key.Passes.by.90. Fouls.Drawn.by.90.
##  Min.   :0.0000               Min.   :0.0000    Min.   :0.0000    
##  1st Qu.:0.1565               1st Qu.:0.1595    1st Qu.:0.1548    
##  Median :0.2261               Median :0.2543    Median :0.2429    
##  Mean   :0.2713               Mean   :0.2994    Mean   :0.2827    
##  3rd Qu.:0.3652               3rd Qu.:0.3966    3rd Qu.:0.3897    
##  Max.   :1.0000               Max.   :1.0000    Max.   :1.0000    
##  Final.Third.Passes.by.90. Progressive.Passes.by.90.
##  Min.   :0.0000            Min.   :0.00000          
##  1st Qu.:0.1250            1st Qu.:0.09611          
##  Median :0.1961            Median :0.20302          
##  Mean   :0.2398            Mean   :0.26028          
##  3rd Qu.:0.2876            3rd Qu.:0.36987          
##  Max.   :1.0000            Max.   :1.00000
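The following sketch shows the MinMax transformation on a toy vector. It is a variant of my own, named normalize_safe, that also guards against a constant column, where max and min coincide and the plain formula would divide zero by zero. The toy values are made up for illustration.

```r
# Hypothetical variant of normalize() with a guard for constant columns
normalize_safe <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))  # constant column: avoid 0 / 0 = NaN
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

normalize_safe(c(0.10, 0.45, 0.80))  # smallest value maps to 0, largest to 1
normalize_safe(c(2, 2, 2))           # constant input returns zeros
```

Every column in the summary above has a minimum of 0 and a maximum of 1, which is what the transformation guarantees for non-constant metrics.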

6.2 Scoring calculation

After normalising the data set, I calculate the final score for each player.

I define the function calc_scoring that receives as parameters:

data: Data set transformed using MinMax.
weights: List of weights associated to each variable.
ind_metric: Index of the column where the performance metrics start.
columns_return: indicate which columns we want to return in the final result.
n: number of players to return (according to calculated scoring)

calc_scoring <- function(
  data, weights, ind_metric, columns_return, n){
  # Transform each variable: transformed value * associated weight
  for (i in ind_metric:ncol(data)){
    data[, i] <- data[,i]*weights[i-(ind_metric-1)]
  }
  cat("Weights sum:", sum(weights))
  # Calculating the scoring (sum of each complete record)
  data$`Final Score` <- rowSums(
    data[, c(ind_metric:ncol(data))])
  data$`Final Score` <- round(10*data$`Final Score`, 3)
  # Ordering df by score
  data <- data[order(-data$`Final Score`), 
               c(columns_return, "Final Score")]
  rownames(data) <- 1:nrow(data)
  return(data[1:n,])
}
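The score computed by calc_scoring is a weighted sum of the normalised metrics, multiplied by 10. A compact way to see this is a matrix product, sketched below on toy data (two made-up players, three made-up metrics and weights).

```r
# Toy normalised metrics (values in [0, 1]) for two hypothetical players
scores_norm <- data.frame(
  Player = c("A", "B"),
  m1 = c(1.0, 0.5),
  m2 = c(0.0, 1.0),
  m3 = c(0.5, 0.5)
)
w <- c(0.5, 0.3, 0.2)  # toy weights that sum to 1

# Weighted sum per row, rescaled to a 0-10 score
final_score <- round(10 * as.vector(as.matrix(scores_norm[, -1]) %*% w), 3)
names(final_score) <- scores_norm$Player
final_score
```

Player A scores 10 * (0.5 * 1.0 + 0.3 * 0.0 + 0.2 * 0.5) = 6 and player B scores 10 * (0.25 + 0.3 + 0.1) = 6.5, so a score of 10 would require the best value in the sample on every metric.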

Next, I assign a weight to each performance metric, with the only requirement that the weights sum to 1. I first run this chunk to verify the column order:

colnames(df_forwards_norm)

##  [1] "Player"                       "Squad"                       
##  [3] "Age"                          "Non.Penalty.Goals.by.90."    
##  [5] "Expected.Goals..xG..by.90."   "Goals.minus.xG"              
##  [7] "Shots.by.90."                 "Shots.on.Target.by.90."      
##  [9] "Shot.Creating.Actions.by.90." "Goal.Creating.Actions.by.90."
## [11] "Key.Passes.by.90."            "Fouls.Drawn.by.90."          
## [13] "Final.Third.Passes.by.90."    "Progressive.Passes.by.90."

Now I define the weight vector (its values must sum to 1).

weights_fw <- c(
  0.20, # Non-Penalty Goals by 90'
  0.18, # Expected Goals (xG) by 90'
  0.14, # Goals minus xG
  0.09, # Shots by 90'
  0.09, # Shots on Target by 90'
  0.08, # Shot-Creating Actions by 90'
  0.07, # Goal-Creating Actions by 90'
  0.05, # Key Passes by 90'
  0.03, # Fouls Drawn by 90'
  0.04, # Final Third Passes by 90'
  0.03  # Progressive Passes by 90'
)
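Before scoring, it can help to validate the weight vector explicitly. The helper check_weights below is a sketch of my own: it checks the length, the sign and the sum, using all.equal() rather than == because sums of decimals are subject to floating-point rounding.

```r
# Hypothetical validation helper for a weight vector
check_weights <- function(weights, n_metrics) {
  stopifnot(
    length(weights) == n_metrics,       # one weight per metric
    all(weights >= 0),                  # no negative weights
    isTRUE(all.equal(sum(weights), 1))  # sum to 1 within numerical tolerance
  )
  invisible(TRUE)
}

# Same values as the weight vector above
w <- c(0.20, 0.18, 0.14, 0.09, 0.09, 0.08, 0.07, 0.05, 0.03, 0.04, 0.03)
check_weights(w, n_metrics = 11)
```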

6.3 Calculate scoring + return Top N

df_score_forwards <- calc_scoring(
  data = df_forwards_norm,
  weights = weights_fw,
  ind_metric = 4,
  columns_return = c("Player", "Squad", "Age"),
  n = 10
)

## Weights sum: 1

df_score_forwards

sum(weights_fw)

6.4 Short conclusions with market context

Ousmane Dembélé is an elite performer on these metrics, but he is expensive, not a like-for-like “9”, and a former Barça player, which makes him a strategically unlikely successor. Transfermarkt lists him at €100m.
Viktor Gyökeres looks like the strongest “true striker” replacement on performance metrics (volume + output), but he is also a premium-priced asset: a big investment that could deliver immediate impact. Transfermarkt lists him at €70m.
Kylian Mbappé is effectively ruled out: direct rival (Real Madrid) + extreme cost. Transfermarkt lists him at €200m.
Lamine Yamal appears in the ranking because he is already elite, but he is not a central striker, and he is a long-term internal pillar rather than a Lewandowski-style 9. Transfermarkt lists him at €200m and as a Right Winger.
Vangelis Pavlidis is a strong value-style candidate: productive profile, true CF role, and more attainable pricing than the elite tier. Transfermarkt lists him at €35m, which makes him a clear market opportunity.
Mateo Retegui profiles as a solid “classic 9” with strong striker indicators and a price point that is accessible relative to the superstars. He could be a reasonable option. Transfermarkt lists him at €40m.

Summary:

Gyökeres = best pure sporting fit (high cost).
Pavlidis = best “value opportunity” type.
Retegui = solid striker alternative (mid-to-high cost bracket).
Mbappé/Dembélé/Yamal = unrealistic or not a like-for-like 9.

7 Similarity algorithm

7.1 Creating the forward sample for similarity

# Metrics (same ones used for scoring)
list_metrics_fw <- c("G.PK.90", "xG.90", "G.xG", "Sh.90", "SoT.90",
                     "SCA.90", "GCA.90", "KP.90", "Fld.90",
                     "FinalThirdPasses.90", "PassesProgressive.90")

# Create the final sample for similarity (Player + metrics only)
data_final <- df_forwards_filter %>%
  select(Player, all_of(list_metrics_fw)) %>%
  group_by(Player) %>%
  summarise(across(everything(), mean), .groups = "drop")

(That group_by/summarise step avoids duplicates if a player appears more than once.)
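A small base R sketch of the same de-duplication idea, using aggregate() on made-up values. A player with two rows (for example after a mid-season transfer) is collapsed into a single row holding the mean of each metric.

```r
# Toy data: "Player A" appears twice with different per-90 values
toy <- data.frame(
  Player  = c("Player A", "Player A", "Player B"),
  G.PK.90 = c(0.40, 0.60, 0.30),
  xG.90   = c(0.50, 0.70, 0.35)
)

# One row per player, each metric averaged across that player's rows
toy_unique <- aggregate(. ~ Player, data = toy, FUN = mean)
toy_unique
```

A simple mean treats both rows equally. Weighting by minutes played would arguably be more accurate for players with very uneven spells, but that is a refinement beyond what is done here.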

7.2 Reusing the normalize() function

normalize <- function(x, na.rm=TRUE){
  (x - min(x, na.rm=na.rm)) / (max(x, na.rm=na.rm) - min(x, na.rm=na.rm))
}

7.3 Similarity function

Before running the similarity tool chunk I will run a quick check.

getwd()

## [1] "/Users/hoots/Desktop/MSc Football Data Analytics/TalentDetection"

list.files()

## [1] "data"                                         
## [2] "Talent_Detection_Lewandowski_Replacement.html"
## [3] "Talent_Detection_Lewandowski_Replacement.Rmd" 
## [4] "TalentDetection.html"                         
## [5] "TalentDetection.Rmd"

list.files("data")

## [1] "FBREF_BigClubes_2324.csv"  "FBREF_BigClubes_2425.csv" 
## [3] "FBREF_BigPlayers_2324.csv" "FBREF_BigPlayers_2425.csv"

I now load the full dataset into data_players.

data_players <- read.csv(
  file = "data/FBREF_BigPlayers_2425.csv",
  sep = ";",
  encoding = "UTF-8"
)

# confirm it loaded
dim(data_players)

## [1] 3972   72

Check for Lewandowski spelling.

data_players$Player[grepl("Lew", data_players$Player)]

##  [1] "Jamie Leweling"        "Lewis Holtby"          "Lewis Schouten"       
##  [4] "Robert Lewandowski"    "Dominic Calvert-Lewin" "Keane Lewis-Potter"   
##  [7] "Lewis Cook"            "Lewis Dunk"            "Lewis Hall"           
## [10] "Lewis Miley"           "Lewis Orford"          "Myles Lewis-Skelly"   
## [13] "Rico Lewis"            "Lewis Ferguson"

I check if he is in my similarity sample.

"Robert Lewandowski" %in% data_final$Player

## [1] FALSE

Since Lewandowski is not in my sample (the maximum-age filter of 27 excludes him), I pull his row from the full dataset and append it to my similarity sample.

# define my Metrics (same as in my scoring profile) and set the exact player name
list_metrics_fw <- c("G.PK.90", "xG.90", "G.xG", "Sh.90", "SoT.90",
                     "SCA.90", "GCA.90", "KP.90", "Fld.90",
                     "FinalThirdPasses.90", "PassesProgressive.90")

# Reference player name
player_name <- "Robert Lewandowski"

(This builds the candidate pool I want to compare him against.)

# Build the comparison sample from my filtered forward dataset
data_final <- df_forwards_filter %>%
  select(Player, all_of(list_metrics_fw)) %>%
  group_by(Player) %>%
  summarise(across(everything(), mean), .groups = "drop")

# Check if Lewandowski is already included
player_name %in% data_final$Player

## [1] FALSE

Extract Lewandowski from the FULL dataset (unfiltered) - this grabs his metrics.

# Pull Lewandowski data from the full dataset (unfiltered)
lewa_row <- data_players %>%
  filter(Player == player_name) %>%
  select(Player, all_of(list_metrics_fw)) %>%
  group_by(Player) %>%
  summarise(across(everything(), mean), .groups = "drop")

lewa_row

## # A tibble: 1 × 12
##   Player            G.PK.90 xG.90   G.xG Sh.90 SoT.90 SCA.90 GCA.90 KP.90 Fld.90
##   <chr>               <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
## 1 Robert Lewandows…    0.81  0.92 -0.100  3.75   1.52   1.69    0.1  0.56   1.15
## # ℹ 2 more variables: FinalThirdPasses.90 <dbl>, PassesProgressive.90 <dbl>

# Append Lewandowski if he is missing from the comparison sample
if(!(player_name %in% data_final$Player)){
  data_final <- bind_rows(data_final, lewa_row)
}

# I confirm that he is now included
player_name %in% data_final$Player

## [1] TRUE

Next I create the function similarity_tool that calculates the N players most similar to the player indicated in the player argument.

similarity_tool <- function(
  sample, player, metrics, metrics_rename, distance, n
  ){
  
  # We define seed for reproducibility
  set.seed(123)

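  # Work on the `sample` argument instead of relying on a global object
  data_final <- sample %>% select(Player, all_of(metrics))

  # Scale data (z-scores per metric)
  data_final_norm <- scale(data_final %>% select(-Player))
  rownames(data_final_norm) <- data_final$Player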
  
  # Distances
  if (distance == 'cosine'){
    ## Cosine distance
    ## Transpose data
    players_df <- t(data_final_norm)
    ## Cosine similarity (lsa package)
    sim_cosine <- cosine(players_df)
    
    ## Access to the column player
    player_sim <- sim_cosine[, player]
    
    ## Convert distances to percentages
    ## Normalize data - MinMax [0,1] scale
    df_sim <- as.data.frame(player_sim)
    colnames(df_sim) <- "Similarity"
    df_sim$Similarity <- normalize(df_sim$Similarity)
    ## Multiply by 100 to obtain a value inside the interval [0,100]
    df_sim$Similarity <- round(100*df_sim$Similarity, 3)

    ## Order by similarity and prepare output
    df_sim$Player <- data_final$Player
    final_df <- df_sim[order(-df_sim$Similarity),]
    # Drop player
    final_df <- final_df %>% filter(Player != player)
    rownames(final_df) <- 1:nrow(final_df)
    final_df <- final_df[1:n, c("Player", "Similarity")]
  }
  
  else {
    players_df <- data_final_norm
    ## Euclidean distance: dist(method='euclidean')
    mat_dist <- as.matrix(dist(x = players_df, method = "euclidean"))
    ## We keep player column and his distances between all players
    player_sim <- mat_dist[, player]
    df_sim <- as.data.frame(player_sim)
    colnames(df_sim) <- "Distance"
    df_sim$Player <- data_final$Player
    ## Drop player whose distance between himself is 0
    df_sim <- df_sim[df_sim$Player != player,]
    
    ## Convert distances into percentages
    d95 <- quantile(df_sim$Distance, 0.95) ## p95
    df_sim$Similarity <- (1 - (df_sim$Distance / d95))*100
    
    ## Order dataframe and prepare output data
    final_df <- df_sim[order(-df_sim$Similarity),]
    rownames(final_df) <- 1:nrow(final_df)
    final_df <- final_df[1:n, c("Player", "Similarity")]
  }
  
  # Union with the original data to access to real values of each metric
  # drop duplicates data
  data_clean <- sample %>%
    select(Player, metrics) %>%
    group_by(Player) %>%
    summarise_all("mean") %>% # average all columns
    rename_at(vars(metrics), ~ metrics_rename) # rename
  
  final_df <- merge(
    x = final_df, y = data_clean, 
    by = "Player", all.x = TRUE)
  final_df <- final_df[(order(-final_df$Similarity)), ]
  rownames(final_df) <- 1:n

  return(final_df)
}
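To illustrate what the two measures capture, here is a base R sketch on three made-up profiles. The cosine formula is written out by hand rather than taken from the lsa package, and the vectors are not standardised here, whereas similarity_tool applies scale() first.

```r
# Cosine similarity between two numeric vectors (hand-written formula)
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Toy profiles: three metrics per player (illustrative values only)
toy <- rbind(
  target   = c(0.8, 3.5, 1.5),
  scaled   = c(0.4, 1.75, 0.75),  # same shape as target, half the volume
  opposite = c(0.2, 1.0, 3.0)     # different shape
)

# Cosine looks at the direction of the profile, not its size
cos_scaled   <- cosine_sim(toy["target", ], toy["scaled", ])
cos_opposite <- cosine_sim(toy["target", ], toy["opposite", ])

# Euclidean distance also penalises differences in volume
eucl <- as.matrix(dist(toy, method = "euclidean"))["target", ]

c(cos_scaled = cos_scaled, cos_opposite = cos_opposite)
eucl
```

The "scaled" profile has a cosine similarity of 1 with the target yet a non-zero Euclidean distance, which is why the two methods can rank the same candidates in a different order.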

7.4 Looking for most similar players to Lewandowski (Euclidean + Cosine distances)

Using the previously filtered dataset, I call the function to find out which players performed most similarly to Barcelona’s forward.

metrics_rename_fw <- c("G-PK/90", "xG/90", "G-xG", "Sh/90", "SoT/90",
                       "SCA/90", "GCA/90", "KP/90", "Fld/90",
                       "FinalThirdPasses/90", "ProgPasses/90")

sim_Lewa_EUCL <- similarity_tool(
  sample = data_final,
  player = "Robert Lewandowski",
  metrics = list_metrics_fw,
  metrics_rename = metrics_rename_fw,
  distance = "euclidean",
  n = 10
)

sim_Lewa_EUCL

##               Player Similarity G-PK/90 xG/90 G-xG Sh/90 SoT/90 SCA/90 GCA/90
## 1         Moise Kean   68.39824    0.60  0.65 -0.4  3.43   1.63   1.90   0.23
## 2  Ermedin Demirović   68.16007    0.73  0.69  0.9  3.11   1.31   2.28   0.34
## 3  Jonathan Burkardt   62.48488    0.68  0.63  3.2  2.99   1.19   2.31   0.34
## 4     Erling Haaland   62.06110    0.62  0.72  0.0  3.42   1.81   2.34   0.30
## 5        Yoane Wissa   59.74726    0.59  0.57  0.5  2.77   1.26   2.13   0.31
## 6       Troy Parrott   57.25596    0.50  0.62 -0.9  2.73   1.24   2.73   0.25
## 7     Emanuel Emegha   56.11362    0.55  0.67 -3.0  2.39   1.37   1.26   0.16
## 8     Samu Omorodion   53.80679    0.60  0.56  4.9  3.10   1.15   1.75   0.36
## 9  Vangelis Pavlidis   52.76428    0.64  0.74  0.4  3.23   1.36   3.31   0.60
## 10     Mateo Retegui   50.87713    0.79  0.71  6.1  3.74   1.21   2.87   0.53
##    KP/90 Fld/90 FinalThirdPasses/90 ProgPasses/90
## 1   0.59   1.62                0.47          0.81
## 2   0.53   0.79                0.44          0.91
## 3   0.69   0.83                0.86          1.45
## 4   0.94   0.42                0.32          0.65
## 5   0.77   1.46                1.26          1.91
## 6   1.11   1.11                0.75          1.14
## 7   0.67   1.37                0.15          0.52
## 8   0.57   1.07                0.53          0.93
## 9   1.18   0.76                0.65          1.65
## 10  1.06   0.94                0.81          1.28
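The Euclidean percentages above come from rescaling each distance against the 95th percentile of all distances to the reference player. The sketch below isolates that step in a helper of my own, distance_to_similarity, applied to made-up distances.

```r
# Hypothetical helper isolating the distance-to-percentage conversion
distance_to_similarity <- function(distances, p = 0.95) {
  d_ref <- quantile(distances, probs = p, names = FALSE)  # reference distance
  (1 - distances / d_ref) * 100
}

d <- c(1, 2, 3, 4, 5)  # toy distances to the reference player
sim <- distance_to_similarity(d)
sim
```

A distance of 0 maps to 100, a distance equal to p95 maps to 0, and anything beyond p95 becomes negative. These percentages are therefore relative to the sample and are not on the same scale as the cosine-based values.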

Repeating the analysis with cosine similarity:

sim_Lewa_COS <- similarity_tool(
  sample = data_final,
  player = "Robert Lewandowski",
  metrics = list_metrics_fw,
  metrics_rename = metrics_rename_fw,
  distance = "cosine",
  n = 10
)

sim_Lewa_COS

##               Player Similarity G-PK/90 xG/90 G-xG Sh/90 SoT/90 SCA/90 GCA/90
## 1  Ermedin Demirović     92.846    0.73  0.69  0.9  3.11   1.31   2.28   0.34
## 2         Moise Kean     92.600    0.60  0.65 -0.4  3.43   1.63   1.90   0.23
## 3        Yoane Wissa     91.197    0.59  0.57  0.5  2.77   1.26   2.13   0.31
## 4  Jonathan Burkardt     90.451    0.68  0.63  3.2  2.99   1.19   2.31   0.34
## 5       Troy Parrott     89.392    0.50  0.62 -0.9  2.73   1.24   2.73   0.25
## 6     Erling Haaland     88.149    0.62  0.72  0.0  3.42   1.81   2.34   0.30
## 7     Emanuel Emegha     83.320    0.55  0.67 -3.0  2.39   1.37   1.26   0.16
## 8     Samu Omorodion     80.771    0.60  0.56  4.9  3.10   1.15   1.75   0.36
## 9      Mateo Retegui     80.187    0.79  0.71  6.1  3.74   1.21   2.87   0.53
## 10 Vangelis Pavlidis     79.854    0.64  0.74  0.4  3.23   1.36   3.31   0.60
##    KP/90 Fld/90 FinalThirdPasses/90 ProgPasses/90
## 1   0.53   0.79                0.44          0.91
## 2   0.59   1.62                0.47          0.81
## 3   0.77   1.46                1.26          1.91
## 4   0.69   0.83                0.86          1.45
## 5   1.11   1.11                0.75          1.14
## 6   0.94   0.42                0.32          0.65
## 7   0.67   1.37                0.15          0.52
## 8   0.57   1.07                0.53          0.93
## 9   1.06   0.94                0.81          1.28
## 10  1.18   0.76                0.65          1.65

Observation - Across both measures, Moise Kean and Ermedin Demirović emerge as the players most similar to Robert Lewandowski on the selected performance metrics: Kean ranks first on Euclidean distance and second on cosine similarity, while Demirović ranks second and first respectively. According to Barca Blaugranes and several recent articles in Spain and Italy, Barcelona are rumoured to be considering Moise Kean as Robert Lewandowski’s successor. It is encouraging to see the data point in the same direction.
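One way to quantify how closely the two rankings agree is a Spearman rank correlation. The sketch below types in the two top-10 orders from the tables above by hand; this check is an addition of mine and not part of the original workflow.

```r
# Top-10 order from the Euclidean table
euclidean_order <- c("Moise Kean", "Ermedin Demirović", "Jonathan Burkardt",
                     "Erling Haaland", "Yoane Wissa", "Troy Parrott",
                     "Emanuel Emegha", "Samu Omorodion", "Vangelis Pavlidis",
                     "Mateo Retegui")

# Top-10 order from the cosine table
cosine_order <- c("Ermedin Demirović", "Moise Kean", "Yoane Wissa",
                  "Jonathan Burkardt", "Troy Parrott", "Erling Haaland",
                  "Emanuel Emegha", "Samu Omorodion", "Mateo Retegui",
                  "Vangelis Pavlidis")

# Both methods shortlist the same ten players
same_players <- setequal(euclidean_order, cosine_order)

# Rank of each player under cosine, listed in Euclidean order
rank_cosine <- match(euclidean_order, cosine_order)
rho <- cor(seq_along(euclidean_order), rank_cosine, method = "spearman")

same_players
round(rho, 3)
```

Both methods return the same ten names and the rank correlation is roughly 0.92, so the shortlist is stable even though the exact order shifts slightly.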

8 Visualisation - Radar graph

As a final step, I select the most compelling candidates and plot them on a radar chart to evaluate and visualise potential successors to Robert Lewandowski.

Moise Kean – still a young player and the most similar to Lewa on Euclidean distance, with Barcelona reportedly already interested in him as a potential successor to the Polish striker.
Ermedin Demirović – the VfB Stuttgart forward has metrics similar to Moise Kean’s. He ranks very well in the Bundesliga for shots on target percentage and xG, and makes well-timed runs in behind and into the channels. He is an interesting prospect: a physically robust centre-forward who has developed into a reliable scorer at Stuttgart. With a market value of around €20m, he represents a very affordable option.
Jonathan Burkardt – an emerging Bundesliga scorer, consistent and capable of high involvement in both build-up and finishing. His market value ranges between €15m and €20m. He offers a balanced and dynamic forward profile and could be a high-upside option if the budget is tight or competition for Kean is high.

8.1 Boundary construction

I calculate the 5th and 95th percentiles (p5 and p95) of each metric and store them in the dataframe min_max_df. These bounds define the scale of each radar axis.

min_max_df <- rbind(
  apply(data_final[, list_metrics], 2, 
        function(x) quantile(x, probs=.95)), 
  apply(data_final[, list_metrics], 2, 
        function(x) quantile(x, probs=.05)))
rownames(min_max_df) <- c("p95", "p5")

min_max_df

##     G.PK.90 xG.90  G.xG Sh.90 SoT.90 SCA.90 GCA.90 KP.90 Fld.90
## p95   0.725 0.738  4.99 3.959  1.568  5.039  0.718 1.839  2.318
## p5    0.111 0.150 -3.40 1.359  0.472  1.262  0.091 0.371  0.411
##     FinalThirdPasses.90 PassesProgressive.90
## p95               2.201                3.746
## p5                0.324                0.551
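The reason for using percentiles rather than the raw minimum and maximum is robustness to outliers. The sketch below shows this on made-up numbers: one extreme value moves the maximum a long way but barely moves p95.

```r
x <- 1:101                     # toy metric values
x_outlier <- c(x, 1000)        # same values plus one extreme observation

bounds <- quantile(x, probs = c(0.05, 0.95), names = FALSE)
bounds                         # 6 and 96 with R's default quantile type

max_shift <- max(x_outlier) - max(x)
p95_shift <- quantile(x_outlier, probs = 0.95, names = FALSE) - bounds[2]

c(max_shift = max_shift, p95_shift = p95_shift)
```

With raw min and max as axis limits, a single extreme player would compress everyone else towards the centre of the radar. The p5 and p95 bounds keep the axes readable.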

8.2 Preparation of the dataframe

Next, I append to min_max_df each of the records to be drawn on the radar. If a player’s actual value is above p95 or below p5, it is clipped to the corresponding bound (p95 or p5) so that it stays within the radar axis.

# Players to visualize
players_radar <- c("Robert Lewandowski", "Moise Kean", "Ermedin Demirović", "Jonathan Burkardt")

# Filter data by players
df_forwards_radar <- data_final[
  data_final$Player %in% players_radar, ]

# Ensure values remain inside interval [p5, p95]
# pmax() raises values below p5, pmin() lowers values above p95
for (m in setdiff(colnames(df_forwards_radar), "Player")){
  df_forwards_radar[[m]] <- pmin(
    pmax(df_forwards_radar[[m]], min_max_df["p5", m]),
    min_max_df["p95", m])
}

# Prepare final radar dataframe (player names become row names for the legend)
df_forwards_radar <- as.data.frame(df_forwards_radar)
rownames(df_forwards_radar) <- df_forwards_radar$Player
df_final_plot <- rbind(
  min_max_df, df_forwards_radar[, list_metrics])

df_final_plot

##                    G.PK.90 xG.90  G.xG Sh.90 SoT.90 SCA.90 GCA.90 KP.90 Fld.90
## p95                  0.725 0.738  4.99 3.959  1.568  5.039  0.718 1.839  2.318
## p5                   0.111 0.150 -3.40 1.359  0.472  1.262  0.091 0.371  0.411
## Ermedin Demirović    0.725 0.690  0.90 3.110  1.310  2.280  0.340 0.530  0.790
## Jonathan Burkardt    0.680 0.630  3.20 2.990  1.190  2.310  0.340 0.690  0.830
## Moise Kean           0.600 0.650 -0.40 3.430  1.568  1.900  0.230 0.590  1.620
## Robert Lewandowski   0.725 0.738 -0.10 3.750  1.520  1.690  0.100 0.560  1.150
##                    FinalThirdPasses.90 PassesProgressive.90
## p95                              2.201                3.746
## p5                               0.324                0.551
## Ermedin Demirović                0.440                0.910
## Jonathan Burkardt                0.860                1.450
## Moise Kean                       0.470                0.810
## Robert Lewandowski               0.910                1.410

8.3 Radar representation

I use the fmsb library to create the radar chart. The helper function create_radarchart wraps the radarchart function of fmsb and exposes the arguments I want to control (colours, axis labels, etc.):

# Step 1 — create the radar helper function
create_radarchart <- function(data, color = "#00AFBB", # explicit default; `color = color` is self-referential
                              vlabels = colnames(data), vlcex = 0.7,
                              caxislabels = NULL, title = NULL){
  fmsb::radarchart(
    data, axistype = 1,
    pcol = color, pfcol = scales::alpha(color, 0.5), 
    plwd = 2, plty = 1,
    cglcol = "grey", cglty = 1, cglwd = 0.8,
    axislabcol = "white",
    vlcex = vlcex, vlabels = vlabels,
    caxislabels = caxislabels, title = title
  )
}

# Step 2 — metric labels
metrics_name_plot <- c("G-PK/90", "xG/90", "G-xG", "Sh/90", "SoT/90",
                       "SCA/90", "GCA/90", "KP/90", "Fld/90",
                       "FinalThirdPasses/90", "ProgPasses/90")

# Step 3 — colors for each player
colors_radar <- c("#004D98", "#f7d62d", "#8DBF8D", "#FFA500")

# Step 4 — plot
op <- par(mar = c(1, 2, 2, 2))

create_radarchart(
  data = df_final_plot, 
  color = colors_radar,
  vlabels = metrics_name_plot
)

legend("bottomleft",
       legend = rownames(df_final_plot[-c(1,2), ]),
       horiz = FALSE,
       bty = "n",
       pch = 20,
       col = colors_radar,
       text.col = "black",
       cex = 0.7,
       pt.cex = 1.75)

title(
  main = "Robert Lewandowski Replacement\nForwards, 24/25",
  cex.main = 1,
  col.main = "#2C3E50"
)

# Restore the graphical parameters saved in op
par(op)

9 Replacing Robert Lewandowski – Strategic Forward Options (Season 24/25)

Robert Lewandowski has been FC Barcelona’s attacking reference point:
elite movement, high non-penalty goal production, consistent xG generation, positional intelligence, and penalty-box efficiency.

Based on my radar comparison, similarity algorithm, and scoring analysis, three viable successor profiles emerge.

9.1 Performance Continuity – Moise Kean

If Barcelona’s objective is to preserve immediate goal output and maintain a traditional central striker identity, Moise Kean is the strongest direct replacement.

Age (24/25): 25
Profile: Central No.9, penalty-box striker
Metrics: Strong xG/90, high shot volume, similar non-penalty scoring output
Market Value (24/25 est.): ~€40–45M
Tactical Fit: Operates as a true striker with comparable positioning tendencies

Kean’s radar profile closely mirrors Lewandowski in:

Shot volume
xG generation
Penalty-area presence

He does not significantly exceed Lewandowski in creative metrics but replicates the core scoring function effectively.

Conclusion:
Kean represents the safest plug-and-play replacement if Barcelona prioritizes continuity in the No.9 role.

9.2 Cost-Performance Balance – Ermedin Demirović

If the club aims to maintain central presence while optimizing financial efficiency, Ermedin Demirović offers a compelling alternative.

Age (24/25): 26
Profile: Strong, mobile central forward
Metrics: Competitive G-PK/90, balanced creative contribution (SCA, GCA)
Market Value: ~€20–25M
Tactical Fit: Functions as a focal point while contributing to link-up play

Demirović provides:

Strong goal efficiency
Balanced attacking involvement
Lower acquisition cost relative to output

Conclusion:
Demirović represents the value-efficient performance solution — strong metrics with moderate financial exposure.

9.3 Tactical Evolution & Upside – Jonathan Burkardt

If Barcelona prioritizes long-term evolution rather than direct replication, Jonathan Burkardt offers mobility and growth potential.

Age (24/25): 24
Profile: Dynamic striker with fluid movement
Metrics: High G-xG efficiency, balanced creative involvement
Market Value: ~€25–30M
Tactical Fit: More mobile and adaptable within a fluid front three

Burkardt demonstrates:

Efficient finishing relative to xG
Strong off-ball movement
Greater tactical flexibility

Conclusion:
Burkardt represents a strategic evolution option rather than strict stylistic continuity.

10 Strategic Implication for FC Barcelona

Barcelona has three clear strategic pathways:

1️⃣ Immediate Scoring Stability → Moise Kean
Maintain traditional No.9 structure with minimal tactical adjustment.
2️⃣ Cost-Performance Optimization → Ermedin Demirović
Strong performance metrics at lower cost, reducing financial risk.
3️⃣ Tactical Evolution & Long-Term Upside → Jonathan Burkardt
Transition toward a more dynamic and fluid attacking structure aligned with the younger core (e.g., Yamal, Pedri).

11 Final Recommendation

If the objective is short-term continuity and minimal disruption,
→ Moise Kean is the closest statistical successor.

If financial flexibility is required,
→ Ermedin Demirović offers the strongest value-to-performance ratio.

If the club envisions a tactical evolution beyond the Lewandowski era,
→ Jonathan Burkardt provides the most adaptable long-term solution.

By combining performance metrics, stylistic similarity, and market feasibility,
FC Barcelona can adopt a data-driven, multi-dimensional recruitment strategy to manage the post-Lewandowski transition.

Looking for a new forward at FC Barcelona: Lewandowski replacement

M9. Master Data Analytics in Football

Peter Orosz