Introduction

Soccer, or football as it is known to many, is often described as “The Beautiful Game”. For modern soccer clubs, a clear path to glory lies in data and in the decisions that can be made by wielding it effectively. This report seeks to chart that path using global player metrics from the FIFA 18 database. The visualizations herein focus on identifying top-tier talent, mapping the lifecycle of player performance, and assessing the financial return on investment across position groups. In doing so, the report investigates how clubs can tune their scouting networks and wage structures to build an optimal roster.

Dataset

The FIFA 18 dataset provides comprehensive information on over 17,000 professional soccer players worldwide. Fields include ‘Name’, ‘Age’, ‘Nationality’, ‘Club’ (the player’s current team), ‘Overall’ (current performance rating on a 0-100 scale), ‘Potential’ (projected peak rating), ‘Wage’ (weekly salary), and ‘Preferred Positions’ (where the player typically operates on the field). The data comprise numeric and character (string) variables. Player ratings range from the mid-40s to the mid-90s, and wages scale from entry-level academy stipends to over €500K weekly for global superstars.

# Set the working directory, raise the timeout for any file downloads, and load libraries
setwd("/Users/samsobkov/Documents/R_datafiles")
options(timeout=300)
library(data.table)
library(ggplot2)
library(dplyr)
library(scales)
library(ggrepel)
library(plotly)

filename <- "FIFACompleteDataset.csv"
# Treat both the literal string "NA" and empty strings as missing values
fifa_df <- fread(filename, na.strings = c("NA", ""))

Findings

Building a world-class roster requires precise timing when the transfer market opens. The visualizations below, built in R from the FIFA 18 dataset, highlight factors that clubs should consider when scouting for new talent.

Top 10 Talent Factories

The data show that the top talent-producing clubs are spread across many countries. Because these clubs are highly competitive and can field no more than eleven players at once, smaller clubs should consider reviewing the talent within these organizations; a top club may be amenable to loaning or transferring a potential superstar who is stuck in its reserves.

# 1. Prepare the data by isolating youth prospects (23 and under)
viz1_df <- fifa_df %>%
  select(Name, Age, Club, Potential)

talent_factories <- viz1_df %>%
  filter(!is.na(Club), Club != "") %>%
  filter(!is.na(Potential)) %>%
  filter(Age <= 23) %>%
  group_by(Club) %>%
  summarise(
    Avg_Potential = mean(Potential),
    Player_Count = n(),
    .groups = "keep"
  ) %>%
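  # Keep clubs with more than five players aged 23 or under,
  # so that a club's average does not rest on only a handful of players
  filter(Player_Count > 5) %>%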
  data.frame()

# 2. Sort the data to find the clubs with the highest developmental ceiling
talent_factories <- talent_factories[order(talent_factories$Avg_Potential, decreasing = TRUE), ]
talent_factories <- head(talent_factories, 10)

# 3. Build the bar chart with a color gradient based on Potential
ggplot(talent_factories, aes(x = reorder(Club, Avg_Potential), y = Avg_Potential, fill = Avg_Potential)) +
  geom_bar(stat = "identity", color = "black") +
  coord_flip() +
  labs(
    title = "Top 10 Talent Factories (FIFA 18)",
    subtitle = "Highest Average Potential for Players Aged 23 and Under",
    x = "Football Club",
    y = "Average Potential Rating"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  geom_text(aes(label = round(Avg_Potential, 1)), hjust = -0.2, size = 4) +
  scale_y_continuous(limits = c(0, 100)) +
  scale_fill_continuous(low="lightpink", high = "red")

Rating Progression by Age

When are players at their peak? This chart traces the average overall rating at each age, showing how long it takes players to develop; the ages with the highest and lowest average ratings are highlighted.

# 1. Isolate the core variables for performance lifecycle analysis
viz2_df <- fifa_df %>%
  select(Age, Overall)

# 2. Calculate the average rating for every age in the dataset
age_df <- viz2_df %>%
  filter(!is.na(Age), !is.na(Overall)) %>%
  filter(Age <= 40) %>%
  group_by(Age) %>%
  summarise(Avg_Rating = mean(Overall), .groups = "keep") %>%
  data.frame()

# Create clean axis breaks for every year represented
x_axis_labels <- min(age_df$Age):max(age_df$Age)

# 3. Identify the "Peak" and "Low" points to highlight on the chart
peak_low <- age_df %>%
  filter(Avg_Rating == min(Avg_Rating) | Avg_Rating == max(Avg_Rating)) %>%
  data.frame()

# 4. Build the line chart with gold-highlighted peak/trough points
ggplot(age_df, aes(x = Age, y = Avg_Rating)) +
  geom_line(color = 'black', linewidth = 1) +
  geom_point(shape = 21, size = 3, color = "dodgerblue4", fill = "white") +
  geom_point(data = peak_low, aes(x = Age, y = Avg_Rating),
             shape = 21, size = 4, fill = "gold", color = "gold") +
  labs(x = "Age (Years)",
       y = "Average Overall Rating",
       title = "The Golden Age: Rating Progression by Age") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels, minor_breaks = NULL) +
  # Add repelling labels so the values don't overlap with data points
  geom_label_repel(data = age_df,
                   aes(label = ifelse(Avg_Rating == max(Avg_Rating) | Avg_Rating == min(Avg_Rating),
                                      round(Avg_Rating, 1), "")),
                   box.padding = 1,
                   point.padding = 1,
                   size = 4,
                   color = "grey50",
                   segment.color = "darkblue")
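
The chart highlights the peak and the low point, but a scout will usually want the ages themselves. The following is a minimal sketch of reading those ages off a per-age summary. The data frame `toy_age_df` and its values are hypothetical stand-ins for `age_df`, not results from the FIFA 18 data.

```r
# Hypothetical per-age averages standing in for age_df; the values are made up
toy_age_df <- data.frame(
  Age = c(17, 21, 25, 29, 33, 37),
  Avg_Rating = c(56.1, 63.4, 67.8, 69.2, 68.0, 66.5)
)

# which.max() / which.min() return the row index of the first extreme value
peak_row <- toy_age_df[which.max(toy_age_df$Avg_Rating), ]
low_row  <- toy_age_df[which.min(toy_age_df$Avg_Rating), ]

cat("Peak average rating at age", peak_row$Age, "\n")
cat("Lowest average rating at age", low_row$Age, "\n")
```

Swapping `toy_age_df` for the real `age_df` should give the ages behind the two gold points, assuming no ties at the extremes.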

Heatmap: Player Potential by Age and Position Group

Where on the pitch can the highest developmental ceilings be found? Furthermore, which position groups offer more runway in terms of a player’s potential? This heatmap illustrates the average potential of players categorized by their position and age group.

# 1. Rename column to remove spaces for R functionality
colnames(fifa_df)[colnames(fifa_df) == "Preferred Positions"] <- "Pref_P"

# 2. Prepare the Data
viz3_df <- fifa_df %>%
  filter(!is.na(Age), !is.na(Potential), !is.na(Pref_P)) %>%
  mutate(
    # Group the positions
    Position_Group = ifelse(grepl("GK", Pref_P), "Goalkeeper",
                            ifelse(grepl("CB|LB|RB|LWB|RWB", Pref_P), "Defender",
                                   ifelse(grepl("CM|CDM|CAM|LM|RM", Pref_P), "Midfielder", 
                                          "Forward"))),
    # Age brackets: 15-17, 18-20, 21-24, 25+
    Age_Bracket = ifelse(Age <= 17, "15-17",
                         ifelse(Age <= 20, "18-20",
                                ifelse(Age <= 24, "21-24", "25+")))
  ) %>%
  group_by(Position_Group, Age_Bracket) %>%
  summarise(Avg_Potential = mean(Potential), .groups = 'keep') %>%
  data.frame()

# Lock in factor levels for logical pitch order and chronological age
viz3_df$Position_Group <- factor(viz3_df$Position_Group, levels = c("Forward", "Midfielder", "Defender", "Goalkeeper"))

# Set the age brackets
age_levels <- c("15-17", "18-20", "21-24", "25+")
viz3_df$Age_Bracket <- factor(viz3_df$Age_Bracket, levels = age_levels)

# 3. Build the Heatmap
viz3_heat <- ggplot(viz3_df, aes(x = Age_Bracket, y = Position_Group, fill = Avg_Potential)) +
  geom_tile(color = "white") +
  scale_fill_continuous(low = "lightpink", high = "red") + 
  labs(
    title = "Heatmap: Player Potential by Age and Position Group",
    x = "Age Bracket",
    y = "Position",
    fill = "Avg Potential"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = round(Avg_Potential, 1)), color = "black", size = 4)

# Print the plot
viz3_heat
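
Because a player can list several preferred positions, the order of the checks in the grouping logic decides which group wins. The sketch below wraps the same nested `ifelse()` rules in a helper so the behavior can be inspected on a few strings. The function name `classify_position` and the sample strings are hypothetical and are not taken from the dataset.

```r
# Map a "Preferred Positions" string to one of four groups.
# Checks run in order, so the first matching group wins.
classify_position <- function(pref) {
  ifelse(grepl("GK", pref), "Goalkeeper",
         ifelse(grepl("CB|LB|RB|LWB|RWB", pref), "Defender",
                ifelse(grepl("CM|CDM|CAM|LM|RM", pref), "Midfielder",
                       "Forward")))
}

# Hypothetical position strings for illustration only
toy_positions <- c("GK", "CB RB", "ST CAM", "ST LW", "RB RM")

toy_groups <- data.frame(
  Pref_P = toy_positions,
  Position_Group = classify_position(toy_positions)
)
print(toy_groups)
```

Note that a striker who also lists an attacking-midfield role ("ST CAM") is counted as a Midfielder under these rules, and only players with no defensive or midfield code fall through to Forward.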

Global Talent Powerhouse Nations

Which nations truly dominate the upper echelons of the sport? This nested donut chart breaks down the player populations of Brazil, Germany, and Spain across three quality tiers based on overall rating: Elite (85 and above), Great (75 to 84), and Good (below 75). For a global recruiter, this highlights which national markets are saturated with high-quality talent.

# 1. Prepare data for selected national superpowers
dt_nations <- c("Brazil", "Germany", "Spain")

nat_df <- fifa_df %>%
  filter(!is.na(Nationality), !is.na(Overall)) %>%
  filter(Nationality %in% dt_nations) %>%
  mutate(
    Tier = ifelse(Overall >= 85, "Elite",
                  ifelse(Overall >= 75, "Great", "Good"))
  ) %>%
  group_by(Tier, Nationality) %>%
  summarise(n = length(Nationality), .groups = 'keep') %>%
  data.frame()

# 2. Lock factor levels for consistent color mapping and alphabetical order
nat_df$Nationality <- factor(nat_df$Nationality, levels = c("Brazil", "Germany", "Spain"))
my_colors <- c("blue", "black", "red")

# 3. Build the interactive nested donut chart using Plotly
viz4_donut <- plot_ly(hole = 0.7) %>%
  layout(title = "Top 3 Nations by Player Tier") %>%
  
  # Outer Ring: Good
  add_trace(data = nat_df[nat_df$Tier == "Good",],
            labels = ~Nationality,
            values = ~n,
            type = "pie",
            textposition = "inside",
            marker = list(colors = my_colors), # Applying custom colors here
            hovertemplate = "Tier: Good<br>Nation: %{label}<br>Percent: %{percent}<br>Count: %{value}<extra></extra>") %>%
  
  # Middle Ring: Great
  add_trace(data = nat_df[nat_df$Tier == "Great",],
            labels = ~Nationality,
            values = ~n,
            type = "pie",
            textposition = "inside",
            marker = list(colors = my_colors),
            hovertemplate = "Tier: Great<br>Nation: %{label}<br>Percent: %{percent}<br>Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.16, 0.84),
              y = c(0.16, 0.84))) %>%
  
  # Inner Ring: Elite
  add_trace(data = nat_df[nat_df$Tier == "Elite",],
            labels = ~Nationality,
            values = ~n,
            type = "pie",
            textposition = "inside",
            marker = list(colors = my_colors),
            hovertemplate = "Tier: Elite<br>Nation: %{label}<br>Percent: %{percent}<br>Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.27, 0.73),
              y = c(0.27, 0.73)))
viz4_donut
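
The tier boundaries can also be expressed with base R's `cut()`, which makes the cut-offs explicit in one place. This is an alternative sketch rather than the method used for the chart above. The helper name `tier_of` and the sample ratings are hypothetical.

```r
# Assign a quality tier from an overall rating.
# right = FALSE makes each interval closed on the left, i.e. [lower, upper),
# so a rating of 75 falls in "Great" and a rating of 85 falls in "Elite".
tier_of <- function(overall) {
  cut(overall,
      breaks = c(-Inf, 75, 85, Inf),
      labels = c("Good", "Great", "Elite"),
      right = FALSE)
}

# Hypothetical ratings chosen to sit on and around the boundaries
toy_overall <- c(60, 74, 75, 84, 85, 94)
toy_tiers <- tier_of(toy_overall)

print(data.frame(Overall = toy_overall, Tier = toy_tiers))
print(table(toy_tiers))
```

Since `cut()` returns a factor that carries all three levels, `table()` reports a zero count for any empty tier instead of silently dropping it, which can help when checking that every nation appears in every ring.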

Average Wage by Position and Age

Does the cost of a player align with their peak performance years? This trellis chart analyzes the “financial lifecycle” of a professional athlete by plotting average weekly wage against age for each of the four major position groups. Financial planners can use it to see where wage costs rise, peak, and decline over a career.

# 1. Clean financial data
viz5_df <- fifa_df %>%
  filter(!is.na(Age), !is.na(Wage), !is.na(Pref_P)) %>%
  filter(Age <= 40) %>% 
  mutate(
    Wage_Raw = as.numeric(gsub("[^0-9.]", "", Wage)),
    Wage_Num = ifelse(grepl("M", Wage), Wage_Raw * 1000000,
                      ifelse(grepl("K", Wage), Wage_Raw * 1000, Wage_Raw)),
    Position_Group = ifelse(grepl("GK", Pref_P), "Goalkeeper",
                            ifelse(grepl("CB|LB|RB|LWB|RWB", Pref_P), "Defender",
                                   ifelse(grepl("CM|CDM|CAM|LM|RM", Pref_P), "Midfielder", 
                                          "Forward")))
  ) %>%
  group_by(Age, Position_Group) %>%
  summarise(Avg_Wage = mean(Wage_Num), .groups = 'keep') %>%
  data.frame()

# 2. Lock position factor levels for a logical 2x2 grid layout (back-to-front)
pos_levels <- c("Goalkeeper", "Defender", "Midfielder", "Forward")
viz5_df$Position_Group <- factor(viz5_df$Position_Group, levels = pos_levels)

# 3. Build the Trellis Line Chart
viz5_plot <- ggplot(viz5_df, aes(x = Age, y = Avg_Wage, color = Position_Group)) +
  geom_line(linewidth = 1) + 
  geom_point(shape = 21, size = 3, fill = "white") + 
  labs(
    title = "The ROI of Talent: Wage Curve by Position",
    x = "Age (Years)",
    y = "Average Wage",
    color = "Position Group"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_y_continuous(labels = label_currency(prefix = "€")) +
  facet_wrap(~Position_Group, ncol = 2, nrow = 2)
viz5_plot
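
The wage-cleaning step converts text such as "€565K" into a number before averaging. The sketch below isolates that conversion in a small helper so it can be checked on its own. The function name `parse_wage` and the sample strings are hypothetical, and the sketch assumes wages carry at most one "K" or "M" suffix.

```r
# Convert wage strings with a currency symbol and optional K/M suffix to numbers
parse_wage <- function(wage) {
  # Strip everything except digits and the decimal point
  raw <- as.numeric(gsub("[^0-9.]", "", wage))
  # Scale by the magnitude suffix, if any
  multiplier <- ifelse(grepl("M", wage), 1e6,
                       ifelse(grepl("K", wage), 1e3, 1))
  raw * multiplier
}

# Hypothetical wage strings for illustration only
toy_wages <- c("€565K", "€1.2M", "€900", "€0")
toy_parsed <- parse_wage(toy_wages)

print(data.frame(Wage = toy_wages, Wage_Num = toy_parsed))
```

A string with no digits at all would come back as `NA`, so it may be worth counting missing values after parsing before computing group means.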

Conclusion

The data presented suggest that sustainable success in global soccer requires balancing effective player evaluation with strict financial discipline. By targeting the U-23 “Talent Factories” and securing high-potential prospects before they reach their peak ability, clubs can maximize on-field performance while avoiding bloated wages. Furthermore, recognizing the depth of talent in Spain, Germany, and Brazil allows a club to focus its limited scouting resources on the most productive talent pipelines. This data-driven approach can thus help an organization build a championship-caliber roster that is both athletically superior and financially resilient.