Soccer, or football as it is known to many, is often described as “The Beautiful Game”. For modern soccer clubs, a clear path to glory lies in data and in the decisions that data can inform. This report traces that path using global player metrics from the FIFA 18 database. The visualizations focus on identifying top-tier talent, mapping the lifecycle of player performance, and assessing the financial return on investment across position groups. Together, they examine how clubs can tune their scouting networks and wage structures to build a stronger roster.
The FIFA 18 dataset provides comprehensive information on over 17,000 professional soccer players worldwide. The fields used in this report are ‘Name’, ‘Age’, ‘Nationality’, ‘Club’ (the player’s current team), ‘Overall’ (current performance rating on a 0-100 scale), ‘Potential’ (projected peak rating), ‘Wage’ (weekly salary), and ‘Preferred Positions’ (where the player typically operates on the field). The data consist of numeric and character (string) variables. Player ratings range from the mid-40s to the mid-90s, and wages scale from entry-level academy stipends to over €500K weekly for global superstars.
# Set the working directory, raise the timeout for slow downloads, and load libraries
setwd("/Users/samsobkov/Documents/R_datafiles")
options(timeout=300)
library(data.table)
library(ggplot2)
library(dplyr)
library(scales)
library(ggrepel)
library(plotly)
filename <- "FIFACompleteDataset.csv"
# Treat the literal string "NA" and empty fields as missing values
fifa_df <- fread(filename, na.strings = c("NA", ""))
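Before plotting, it can help to confirm that the fields described above are actually present in the loaded table. The following is a minimal sketch, assuming the column names match the field names listed in the dataset description; `missing_columns()` and `toy_df` are hypothetical names introduced only for illustration, and the toy row is made up rather than taken from the dataset.

```r
# Column names this report relies on (taken from the dataset description above)
required_cols <- c("Name", "Age", "Nationality", "Club", "Overall",
                   "Potential", "Wage", "Preferred Positions")

# Hypothetical helper: return the required columns that are absent from df
missing_columns <- function(df, required = required_cols) {
  setdiff(required, names(df))
}

# Made-up one-row table that deliberately lacks several columns
toy_df <- data.frame(Name = "Player X", Age = 20L, Overall = 70L,
                     check.names = FALSE)
absent <- missing_columns(toy_df)
print(absent)
```

Running `missing_columns(fifa_df)` after loading should return an empty character vector if every expected field is present.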
Building a world-class roster requires precise timing when the transfer market opens. The visualizations below, built in R from the FIFA 18 dataset, highlight factors that clubs should weigh when scouting for new talent.
The data show that the top talent-producing clubs are spread across many countries. Because these clubs are highly competitive and no more than eleven players can be on the pitch at once, smaller clubs should review the talent within these organizations; a top club may be amenable to a loan or transfer of a potential superstar sitting in its reserves.
# 1. Prepare the data by isolating youth prospects (23 and under)
viz1_df <- fifa_df %>%
  select(Name, Age, Club, Potential)
talent_factories <- viz1_df %>%
  filter(!is.na(Club), Club != "", !is.na(Potential), Age <= 23) %>%
  group_by(Club) %>%
  summarise(
    Avg_Potential = mean(Potential),
    Player_Count = n(),
    .groups = "drop"
  ) %>%
  # Keep only clubs with more than five youth players
  filter(Player_Count > 5) %>%
  data.frame()
# 2. Sort the data to find the clubs with the highest developmental ceiling
talent_factories <- talent_factories[order(talent_factories$Avg_Potential, decreasing = TRUE), ]
talent_factories <- head(talent_factories, 10)
# 3. Build the bar chart with a color gradient based on Potential
ggplot(talent_factories, aes(x = reorder(Club, Avg_Potential), y = Avg_Potential, fill = Avg_Potential)) +
  geom_col(color = "black") +
  coord_flip() +
  labs(
    title = "Top 10 Talent Factories (FIFA 18)",
    subtitle = "Highest Average Potential for Players Aged 23 and Under",
    x = "Football Club",
    y = "Average Potential Rating"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  geom_text(aes(label = round(Avg_Potential, 1)), hjust = -0.2, size = 4) +
  scale_y_continuous(limits = c(0, 100)) +
  scale_fill_continuous(low = "lightpink", high = "red")
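The minimum-player filter in step 1 presumably exists so that a club with a single standout prospect does not outrank clubs with deep youth squads. The base R sketch below illustrates the effect on made-up data; the club names, ratings, and the `min_players` threshold of 2 are hypothetical and chosen only to keep the toy example small (the report itself requires more than five players).

```r
# Made-up youth players (not rows from the FIFA 18 dataset)
toy_players <- data.frame(
  Club = c("Club A", "Club A", "Club A", "Club B"),
  Age = c(19, 21, 22, 18),
  Potential = c(80, 78, 82, 91)
)

# Keep players aged 23 and under, then summarise per club
youth <- toy_players[toy_players$Age <= 23, ]
avg_potential <- tapply(youth$Potential, youth$Club, mean)
player_count <- tapply(youth$Potential, youth$Club, length)
club_summary <- data.frame(
  Club = names(avg_potential),
  Avg_Potential = as.numeric(avg_potential),
  Player_Count = as.integer(player_count)
)

# Hypothetical threshold for the toy data; the report uses Player_Count > 5
min_players <- 2
qualified <- club_summary[club_summary$Player_Count > min_players, ]
print(qualified)
```

Here "Club B" has the higher average but only one player, so it is dropped, leaving "Club A" as the only qualifying club.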
When are players considered at their peak? This chart plots the average Overall rating at each age, showing how long it takes for players to develop; the ages with the highest and lowest average ratings are highlighted in gold.
# 1. Isolate the core variables for performance lifecycle analysis
viz2_df <- fifa_df %>%
  select(Age, Overall)
# 2. Calculate the average rating for every age in the dataset
age_df <- viz2_df %>%
  filter(!is.na(Age), !is.na(Overall), Age <= 40) %>%
  group_by(Age) %>%
  summarise(Avg_Rating = mean(Overall), .groups = "drop") %>%
  data.frame()
# Create clean axis breaks for every year represented
x_axis_labels <- min(age_df$Age):max(age_df$Age)
# 3. Identify the "Peak" and "Low" points to highlight on the chart
peak_low <- age_df %>%
  filter(Avg_Rating == min(Avg_Rating) | Avg_Rating == max(Avg_Rating)) %>%
  data.frame()
# 4. Build the line chart with gold-highlighted peak/trough points
ggplot(age_df, aes(x = Age, y = Avg_Rating)) +
  geom_line(color = "black", linewidth = 1) +
  geom_point(shape = 21, size = 3, color = "dodgerblue4", fill = "white") +
  geom_point(data = peak_low, aes(x = Age, y = Avg_Rating),
             shape = 21, size = 4, fill = "gold", color = "gold") +
  labs(x = "Age (Years)",
       y = "Average Overall Rating",
       title = "The Golden Age: Rating Progression by Age") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels, minor_breaks = NULL) +
  # Add repelling labels so the values don't overlap the data points;
  # only the peak and low ages get a non-empty label
  geom_label_repel(data = age_df,
                   aes(label = ifelse(Avg_Rating == max(Avg_Rating) | Avg_Rating == min(Avg_Rating),
                                      round(Avg_Rating, 1), "")),
                   box.padding = 1,
                   point.padding = 1,
                   size = 4,
                   color = "grey50",
                   segment.color = "darkblue")
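The peak and low ages highlighted above can also be read off directly rather than from the chart. This is a small sketch on made-up averages (the numbers are illustrative, not results from the dataset); it uses `which.max()` and `which.min()`, which return only the first match if two ages tie, unlike the equality filter used for `peak_low`.

```r
# Made-up average ratings by age (illustrative only)
toy_age_df <- data.frame(
  Age = c(17, 21, 25, 29, 33, 37),
  Avg_Rating = c(56.0, 63.5, 67.8, 69.2, 68.1, 66.4)
)

# Row with the highest average rating (the "peak")
peak_row <- toy_age_df[which.max(toy_age_df$Avg_Rating), ]
# Row with the lowest average rating (the "low")
low_row <- toy_age_df[which.min(toy_age_df$Avg_Rating), ]

print(peak_row)
print(low_row)
```

Applied to `age_df`, the same two lines would give the ages behind the gold points in the chart.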
Where on the pitch can the highest developmental ceilings be found? Furthermore, which position groups offer more runway in terms of a player’s potential? This heatmap shows the average Potential rating of players for each combination of position group and age bracket.
# 1. Rename column to remove spaces for R functionality
colnames(fifa_df)[colnames(fifa_df) == "Preferred Positions"] <- "Pref_P"
# 2. Prepare the Data
viz3_df <- fifa_df %>%
  filter(!is.na(Age), !is.na(Potential), !is.na(Pref_P)) %>%
  mutate(
    # Group the positions; the first matching group wins, so a player who
    # lists both a defensive and a midfield position counts as a Defender
    Position_Group = ifelse(grepl("GK", Pref_P), "Goalkeeper",
                     ifelse(grepl("CB|LB|RB|LWB|RWB", Pref_P), "Defender",
                     ifelse(grepl("CM|CDM|CAM|LM|RM", Pref_P), "Midfielder",
                            "Forward"))),
    # New Age Brackets: 15-17, 18-20, 21-24, 25+
    Age_Bracket = ifelse(Age <= 17, "15-17",
                  ifelse(Age <= 20, "18-20",
                  ifelse(Age <= 24, "21-24", "25+")))
  ) %>%
  group_by(Position_Group, Age_Bracket) %>%
  summarise(Avg_Potential = mean(Potential), .groups = "drop") %>%
  data.frame()
# Lock in factor levels for logical pitch order and chronological age
viz3_df$Position_Group <- factor(viz3_df$Position_Group, levels = c("Forward", "Midfielder", "Defender", "Goalkeeper"))
# Set the age brackets
age_levels <- c("15-17", "18-20", "21-24", "25+")
viz3_df$Age_Bracket <- factor(viz3_df$Age_Bracket, levels = age_levels)
# 3. Build the Heatmap
viz3_heat <- ggplot(viz3_df, aes(x = Age_Bracket, y = Position_Group, fill = Avg_Potential)) +
  geom_tile(color = "white") +
  scale_fill_continuous(low = "lightpink", high = "red") +
  labs(
    title = "Heatmap: Player Potential by Age and Position Group",
    x = "Age Bracket",
    y = "Position",
    fill = "Avg Potential"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = round(Avg_Potential, 1)), color = "black", size = 4)
# Print the plot
viz3_heat
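Because the position grouping relies on nested `ifelse()` calls, the order of the checks decides where multi-position players land. The sketch below wraps the same patterns in a hypothetical helper, `classify_position()`, and runs it on made-up position strings (not rows from the dataset) to show which group wins.

```r
# Hypothetical helper mirroring the nested ifelse() logic above: the first
# matching group wins, in the order Goalkeeper, Defender, Midfielder, Forward
classify_position <- function(pref_positions) {
  ifelse(grepl("GK", pref_positions), "Goalkeeper",
  ifelse(grepl("CB|LB|RB|LWB|RWB", pref_positions), "Defender",
  ifelse(grepl("CM|CDM|CAM|LM|RM", pref_positions), "Midfielder",
         "Forward")))
}

# Made-up position strings for illustration
examples <- c("GK", "ST LW", "CM CDM", "RM RB", "CF")
groups <- classify_position(examples)
print(data.frame(Pref_P = examples, Position_Group = groups))
```

Note that "RM RB" is classified as a Defender even though a midfield position is listed first, and that "Forward" is the fallback for any string that matches none of the patterns.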
Which nations truly dominate the upper echelons of the sport? This nested donut chart breaks down the player populations of Brazil, Germany, and Spain across three quality tiers based on Overall rating: Good (below 75), Great (75 to 84), and Elite (85 and above). For a global recruiter, this highlights which of these national markets are richest in high-quality talent.
# 1. Prepare data for selected national superpowers
dt_nations <- c("Brazil", "Germany", "Spain")
nat_df <- fifa_df %>%
  filter(!is.na(Nationality), !is.na(Overall)) %>%
  filter(Nationality %in% dt_nations) %>%
  mutate(
    # Elite: 85 and above; Great: 75 to 84; Good: everything below 75
    Tier = ifelse(Overall >= 85, "Elite",
           ifelse(Overall >= 75, "Great", "Good"))
  ) %>%
  group_by(Tier, Nationality) %>%
  summarise(n = n(), .groups = "drop") %>%
  data.frame()
# 2. Lock factor levels for consistent color mapping and alphabetical order
nat_df$Nationality <- factor(nat_df$Nationality, levels = c("Brazil", "Germany", "Spain"))
my_colors <- c("blue", "black", "red")
# 3. Build the interactive nested donut chart using Plotly
viz4_donut <- plot_ly(hole = 0.7) %>%
  layout(title = "Top 3 Nations by Player Tier") %>%
  # Outer Ring: Good
  add_trace(data = nat_df[nat_df$Tier == "Good", ],
            labels = ~Nationality,
            values = ~n,
            type = "pie",
            textposition = "inside",
            marker = list(colors = my_colors), # Apply the custom colors
            hovertemplate = "Tier: Good<br>Nation: %{label}<br>Percent: %{percent}<br>Count: %{value}<extra></extra>") %>%
  # Middle Ring: Great (a smaller domain nests it inside the outer ring)
  add_trace(data = nat_df[nat_df$Tier == "Great", ],
            labels = ~Nationality,
            values = ~n,
            type = "pie",
            textposition = "inside",
            marker = list(colors = my_colors),
            hovertemplate = "Tier: Great<br>Nation: %{label}<br>Percent: %{percent}<br>Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.16, 0.84),
              y = c(0.16, 0.84))) %>%
  # Inner Ring: Elite
  add_trace(data = nat_df[nat_df$Tier == "Elite", ],
            labels = ~Nationality,
            values = ~n,
            type = "pie",
            textposition = "inside",
            marker = list(colors = my_colors),
            hovertemplate = "Tier: Elite<br>Nation: %{label}<br>Percent: %{percent}<br>Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.27, 0.73),
              y = c(0.27, 0.73)))
viz4_donut
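The tier boundaries used for the donut chart can be checked in isolation. The sketch below expresses the same cutoffs with base R's `cut()` instead of nested `ifelse()`; `classify_tier()` is a hypothetical helper, and the ratings are made-up values chosen to sit on either side of each boundary.

```r
# Hypothetical helper: assign a tier from an Overall rating using the same
# cutoffs as the chart (Good below 75, Great 75 to 84, Elite 85 and above)
classify_tier <- function(overall) {
  cut(overall,
      breaks = c(-Inf, 75, 85, Inf),
      labels = c("Good", "Great", "Elite"),
      right = FALSE)  # intervals are closed on the left: [75, 85)
}

# Made-up ratings around each boundary
ratings <- c(60, 74, 75, 84, 85, 94)
tiers <- classify_tier(ratings)
print(data.frame(Overall = ratings, Tier = tiers))
```

One design difference: `cut()` returns an ordered set of factor levels, which keeps the tiers in Good, Great, Elite order when tabulating, whereas the `ifelse()` version returns plain character strings.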
Does the cost of a player align with their peak performance years? This trellis chart traces the “financial lifecycle” of a professional athlete by plotting average weekly wage against age for four major position groups. Financial planners can use it to see where player costs rise and fall over the course of a career.
# 1. Clean financial data
viz5_df <- fifa_df %>%
  filter(!is.na(Age), !is.na(Wage), !is.na(Pref_P)) %>%
  filter(Age <= 40) %>%
  mutate(
    # Strip the currency symbol and unit letter, then scale by K or M
    Wage_Raw = as.numeric(gsub("[^0-9.]", "", Wage)),
    Wage_Num = ifelse(grepl("M", Wage), Wage_Raw * 1000000,
               ifelse(grepl("K", Wage), Wage_Raw * 1000, Wage_Raw)),
    Position_Group = ifelse(grepl("GK", Pref_P), "Goalkeeper",
                     ifelse(grepl("CB|LB|RB|LWB|RWB", Pref_P), "Defender",
                     ifelse(grepl("CM|CDM|CAM|LM|RM", Pref_P), "Midfielder",
                            "Forward")))
  ) %>%
  group_by(Age, Position_Group) %>%
  summarise(Avg_Wage = mean(Wage_Num), .groups = "drop") %>%
  data.frame()
# 2. Lock position factor levels for a logical 2x2 grid layout (back-to-front)
pos_levels <- c("Goalkeeper", "Defender", "Midfielder", "Forward")
viz5_df$Position_Group <- factor(viz5_df$Position_Group, levels = pos_levels)
# 3. Build the Trellis Line Chart
viz5_plot <- ggplot(viz5_df, aes(x = Age, y = Avg_Wage, color = Position_Group)) +
  geom_line(linewidth = 1) +
  geom_point(shape = 21, size = 3, fill = "white") +
  labs(
    title = "The ROI of Talent: Wage Curve by Position",
    x = "Age (Years)",
    y = "Average Wage",
    color = "Position Group"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none") +
  scale_y_continuous(labels = label_currency(prefix = "€")) +
  facet_wrap(~Position_Group, ncol = 2, nrow = 2)
viz5_plot
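The wage-cleaning step above converts strings into numbers before averaging. As a worked example, the sketch below isolates that conversion in a hypothetical helper, `parse_wage()`, assuming wages are written with a currency symbol and an optional trailing "K" or "M" (for example "€565K"); the sample strings are made up for illustration.

```r
# Hypothetical helper: convert wage strings such as "€565K" to numeric euros.
# Assumes the unit suffix is "K" (thousands), "M" (millions), or absent.
parse_wage <- function(wage) {
  wage <- as.character(wage)
  # Strip everything except digits and the decimal point
  amount <- as.numeric(gsub("[^0-9.]", "", wage))
  # Choose the multiplier from the trailing unit letter, if any
  multiplier <- ifelse(grepl("M$", wage), 1e6,
                ifelse(grepl("K$", wage), 1e3, 1))
  amount * multiplier
}

# Made-up wage strings covering each case
sample_wages <- c("€565K", "€1.2M", "€900", "€0")
parsed <- parse_wage(sample_wages)
print(parsed)
```

Anchoring the pattern with `$` is a small deviation from the report's `grepl("M", Wage)`: it only matters if a letter could appear elsewhere in the string, which is an assumption about the data rather than something observed in it.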
The data presented suggest that sustainable success in global soccer requires balancing effective player evaluation with strict financial discipline. By targeting the U-23 “Talent Factories” and securing high-potential prospects before they reach their peak ability, clubs can maximize on-field performance while avoiding bloated wages. Furthermore, recognizing the depth of talent in Spain, Germany, and Brazil allows a club to focus its limited scouting resources on productive talent pipelines. This data-driven approach can help an organization build a championship-caliber roster that is both athletically superior and financially resilient.