Building a Soccer Team

The Beemore Bees

Problem: Lets imagine we have been asked to help set up a new world class football club. Where would we find players? What kind of players should we get? How long would I have to use those players? Using FIFIA 2016-2020 player data we can visualize the problems and find our bearings without being seasoned recruiters.

# First we need to gather libraries. 
library(ggplot2)
library(rnaturalearth)#https://cran.r-project.org/web/packages/rnaturalearth/vignettes/rnaturalearth.html 
library(sf)
library(dplyr)

FIFA as DATA

In order to provide some insight we should produce visualizations to best understand the problem statement, and draw conclusions. For this exercise we will be looking at the following questions:

setwd("~/Fifa Gaming/ FifaGames")#The data can be found here: https://www.kaggle.com/datasets/stefanoleone992/fifa-20-complete-player-dataset
full2020PlayerDf<- read.csv("players_20.csv",header=TRUE,sep=",")

full2016PlayerDf<- read.csv("players_16.csv",header=TRUE,sep=",")#I'll need these later and to expand on if I want an over time view
full2017PlayerDf<- read.csv("players_17.csv",header=TRUE,sep=",")
full2018PlayerDf<- read.csv("players_18.csv",header=TRUE,sep=",")
full2019PlayerDf<- read.csv("players_19.csv",header=TRUE,sep=",")
cleanPlayerDf<- full2020PlayerDf[,-c(79:104)] #We dont necessarily need the video game information so lets remove it.
cleanPlayerDf <- cleanPlayerDf[ , !(names(cleanPlayerDf) %in% c("player_traits"))]
valueOrder<-cleanPlayerDf[order(-cleanPlayerDf$value_eur),]

Who’s the best?

The first step anyone would most likely do is find out who the “best” players are and how much they are worth. So, lets grab the top 10. But superstars are usually the exception, so we should add the average, and use this information to provide initial insight into how much a 11 man squad would be valued at.

average<-mean(valueOrder$value_eur)
average<-trunc(average)
averageValue<-c("Average Player", average)

Player_Names<-c(head(valueOrder$short_name,10))
Euro_Value<-c(head(valueOrder$value_eur,10))
top10PlayersBValue<-data.frame(Player_Names,Euro_Value)#Make the DF

top10PlayersBValue<-rbind(top10PlayersBValue,averageValue)#add the "average" player

top10PlayersBValue$Euro_Value <- as.numeric(top10PlayersBValue$Euro_Value)#The scale is too much for the chart, lets make it in Millions
top10PlayersBValue$Euro_Value <- top10PlayersBValue$Euro_Value / 1000000 
#Ready for Visual 1!
x<-length(top10PlayersBValue)
xlim_range <-c(0,x*.6)
#Ready for Visual 1!
barplot(
  top10PlayersBValue$Euro_Value, names.arg = top10PlayersBValue$Player_Names,
  col ="lightblue",
  main = "Players By est Value (M Euro)",
  width = .1,
  xlab = "Player Name",
  cex.names= .5,
  ylab = "M Euro",
  xlim = xlim_range
)

This shows that the scale of value is heavily weighted on superstars. The average player is well below the headline prices which would indicate that most teams would field a full squad below the cost of one superstar.

Where should I look?

Its one thing to know the top players and the average players expected value, but where are current players coming from? Knowing where the players are allows a team to work with junior clubs and camps to find the best talent.

Country<-c(cleanPlayerDf$nationality)

#I have to do some cleanup, nnaturalearth does not match a few places
Country[Country == "United States"] <- "United States of America" 
Country[Country == "England"] <- "United Kingdom" #rnaturalearth has multiple break outs for the British Isles, but I cant see (in my plot) Northern Ireland/Scotland/Whales so UK and Ireland it is.
Country[Country == "Republic of Ireland"] <- "Ireland"
Country[Country == "China PR"] <- "China"
Country[Country == "Chinese Taipei"] <- "Taiwan"
Country[Country == "Korea DPR"] <- "North Korea" #Didn't expect that, so I HAD to include it.
Country[Country == "Korea Republic"] <- "South Korea"
#This is a great geography lesson too.

nationality_count <-table(Country)
nationality_df<-as.data.frame(nationality_count)
mapdf <- nationality_df[order(-nationality_df$Freq), ]
#the nationality DF is ready!

#must create the world map so I can plot the mapdf to it
world <- ne_countries(scale = "medium", returnclass = "sf")
world_data <- left_join(world, mapdf, by = c("name" = "Country"))#Put said data in the map

#Ready for Visual 2!
ggplot(data = world_data) +
  geom_sf(aes(fill = Freq, geometry = geometry)) +
  scale_fill_gradient(low = "lightyellow", high = "red", na.value = "grey90") +
  labs(title = "Player Count from Each Country", fill = "Num Players") +
  theme_minimal()

It looks like the best source market for talent is in Europe, with England France and Spain having the highest current day density of players. South america has a good density as well, such as Brasil and Argentina, so if I want to place my recruiters in the best locations, I’d be looking in Europe or South America.

What position should I look for?

Its all well and good to find a great athlete but if everyone is used playing in the center position the team will have problems working together. Lets view what positions cost so we can get a feel for value in each of the positions. Note: some positions depend on what the teams formation is.

avg_cost <- aggregate(valueOrder$wage_eur~valueOrder$team_position,data = valueOrder, FUN=mean, na.rm = TRUE)
colnames(avg_cost) <- c("Position","Avg_Wage")
avg_cost<-avg_cost[order(-avg_cost$`Avg_Wage`),]
avg_cost$Wage <- trunc(avg_cost$Avg_Wage)
#ready for visual!
pie(avg_cost$`Avg_Wage`,labels=avg_cost$Position,main="Average Compensation by Position", col = rainbow((length(avg_cost$Position))))

Well Center Forwards (CF) are the most expensive by far, but this chart doesn’t fully capture the picture as CF’s and Striker (ST) can occupy the same role, and the cost is much different. Lets combine positions to a more generic view, which would let the team put people in positions as needed than as specified.

avg_cost<-avg_cost%>%
  mutate(Role = case_when(
    Position %in% c("ST", "CF", "LW", "RW", "LF", "RF", "LS", "RS") ~ "Offense",
    Position %in% c("CDM", "RCM", "LCM", "CAM", "CM", "CB", "LAM", "RAM", "RM", "LM") ~ "Midfield",
    Position %in% c("LDM", "RDM", "LCB", "RCB", "RWB", "LWB", "RB", "LB") ~ "Defense",
    Position %in% c("SUB", "RES") ~ "Reserve",
    Position %in% c("GK") ~ "Goalkeeper",
    TRUE ~ "Other"
  ))

simple_df <- aggregate(Wage ~ Role, data = avg_cost, FUN = mean)
simple_df$Wage <- trunc(simple_df$Wage)#after taking the aggregate i need to 

#Visualization 3 READY
pie(simple_df$Wage,labels=simple_df$Role,main="Average Compensation by Role", col = rainbow((length(simple_df$Role))))

This view cleans up a lot and highlights two things, Offense is the most expensive area, but per player Goalkeepers can demand premium per position price. Thus if I am looking to estimate how much I’ll be spending in each section of the field the Midfield and Defense will cost per player the least, while the Goalkeeper and Offense will consume most of the budget.

How long can I expect a player to be playing?

Well if I find a potential player for a position, how long should I expect them to play for? Some players may be undervalued when they have, on average, a few years left of good play. Lets look at the positions, their average age, and the minimum and maximum ages in those positions.

#For box chart I need, team_position, average age, max age, min age, first quartiles, and third quartiles

# Average age by team_position
avg_age <- aggregate(age ~ team_position, data = valueOrder, FUN = mean, na.rm = TRUE)

# Max age by team_position
max_age <- aggregate(age ~ team_position, data = valueOrder, FUN = max, na.rm = TRUE)

# Min age by team_position
min_age <- aggregate(age ~ team_position, data = valueOrder, FUN = min, na.rm = TRUE)

quartiles_df <- aggregate(age ~ team_position, data = valueOrder, 
                          FUN = function(x) quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))

# Convert the matrix of quartiles into two columns
quartiles_df <- do.call(data.frame, quartiles_df)
names(quartiles_df)[2:3] <- c("Q1", "Q3")


# Merge all into one data frame
summary_df <- merge(avg_age, max_age, by = "team_position", suffixes = c("_avg", "_max"))
summary_df <- merge(summary_df, min_age, by = "team_position")
names(summary_df)[names(summary_df) == "age"] <- "age_min"
summary_df<-merge(summary_df,quartiles_df)
summary_df$age_avg <- trunc(summary_df$age_avg)#Another trunc
summary_df<-summary_df[-1,]#postion NULLs detected in slot1, removing them.
rownames(summary_df)<-NULL

#ready for Vis

#Visual 4 Box chart!
ggplot(summary_df, aes(x = factor(`team_position`))) +
  geom_boxplot(
    aes(
      ymin = age_min,
      lower = Q1,
      middle = age_avg,
      upper = Q3,
      ymax = age_max
    ),
    stat = "identity",
    fill = "lightgreen",
    color = "darkblue"
  ) +
  labs(
    x = "Team Position",
    y = "Age",
    title = "Box Plot of Ages by Team Position"
  ) +
  theme_minimal()

What can we see? Ignoring the Reserve (RES) and Substitutes, Goalkeepers (GK), Center Backs (CB), and Right Attacking Mid (RAM) have careers later into age on average. However the green areas only tell the average, with some GKs and Left Backs (LB) having some of the longest careers. Left Forwards (LF) and Right Forwards (RF) are some of the youngest players, but there are more younger RF players on average. Midfield players have a shorter, more dispersed age range. This means a player in those positions may not command the highest end cost, but are the most flexible, albeit short lived, of the team.

How expensive is this going to be?

So we have some players in mind? But what kind of money will we be expected to spend in the future.
Lets take the top 10 again to use as a bellwether on wage growth. If we use the ratio from the first graphic as a rule of thumb we can determine any average and above average players wage growth in the next 5 years.

# Order by wage_eur descending
top10_2020 <- full2020PlayerDf[order(-full2020PlayerDf$wage_eur), ][1:10, ]
top_players <- top10_2020$short_name #to search in the previous years

#Lets make a function to do this easier
get_wages <- function(df, year, top_players) {
  subset_df <- df[df$short_name %in% top_players, c("short_name", "wage_eur")]
  colnames(subset_df)[2] <- paste0("wage_", year)
  return(subset_df)
}
wage_2016 <- get_wages(full2016PlayerDf, 2016, top_players)
wage_2016<-wage_2016[1:10,]#There are multiple Suarez(s)
wage_2017 <- get_wages(full2017PlayerDf, 2017, top_players)
wage_2017<-wage_2017[1:10,]
wage_2018 <- get_wages(full2018PlayerDf, 2018, top_players)
wage_2018<-wage_2018[1:10,]
wage_2019 <- get_wages(full2019PlayerDf, 2019, top_players)
wage_2019<-wage_2019[1:10,]
wage_2020 <- get_wages(full2020PlayerDf, 2020, top_players)
wage_2020<-wage_2020[1:10,]

# Rename salary columns to distinguish them
df1 <- wage_2016 %>% rename("2016" = wage_2016)
df2 <- wage_2017 %>% rename("2017" = wage_2017)
df3 <- wage_2018 %>% rename("2018" = wage_2018)
df4 <- wage_2019 %>% rename("2019" = wage_2019)
df5 <- wage_2020 %>% rename("2020" = wage_2020)

# Combine all using full_join
combined_df <- df1 %>%
  full_join(df2, by = "short_name") %>%
  full_join(df3, by = "short_name") %>%
  full_join(df4, by = "short_name") %>%
  full_join(df5, by = "short_name")
colnames(combined_df) <-c("Player_Name","2016","2017","2018","2019","2020")
#I now have a combined_df that has the top10 players and their wages from 2016-2020
#I need to transpose the X/Y on wages to make a line chart.
# Remove Name column
scores <- as.matrix(combined_df[ , -1])
# Transpose so years are on x-axis
scores_t <- t(scores)
scores_t<-scores_t/1000
years <- as.numeric(colnames(combined_df)[-1])

# Ready for vis 5.
matplot(
  years, scores_t,
  type = "l",               # line plot
  lty = 3,                  # line type
  col = 1:nrow(combined_df),         # color for each player
  xlab = "Year",
  ylab = "Wage (in K)",
  main = "Player wages from 2016 to 2020"
)

# Add legend 
legend("topleft", legend = combined_df$Player_Name, col = 1:nrow(combined_df), lty = 1)

This graph shows that at the high end, players value can fluctuate in the 100s of millions, but stay in the neighborhood of 300-400m per year. Over 5 years the top end players jumped by 100m (about 1/3 of their 2017 value). This means the team, if it wishes to maintain the highest end talent (on just pay alone), should expect a 1/3 value increase on high performing players over 3-5 years.

What does this all say?

The best players can command a value well in excess of 50x the average value of a player. The hottest locations to find players are from England, France, Spain, Brasil, and Argentina.

A valuable defense can be expected to last (overall) a longer time than an offense, with a goalkeeper being some of the longest career players. If a recruiter can secure valuable players in these positions young, their value can grow and be traded, or remain on the team as experienced pitch players. Offensive players, such as Strikers (ST), Center forwards (CF), and LF/RF may be the most dynamic players which speaks to how the players may start as a LF/RF, but be moved to center or striker if they are found to be of better value. Midfielders should complement the desired play style, and be considered based on their specialized position relative to the Defense and Offense.

So our strategy should be to find young talent for the wings, veteran talent for the center, and fill the midfield with flexibility. A superstar would be a huge investment compared to the squad as a whole and should only be done if the team can afford the expense as talent may be found within the squad capable of performing the roll sufficiently without overburdening the cost/benefit. As expected, a larger budget allows more flexibility, but smart clubs will seek young talent to grow both as players and as a club without taking out a small country sized loan.