Introduction

Soccer, or football, is the world’s most popular sport. Unlike the four major North American sports, each of which has a single premier league (the NBA, NFL, MLB, and NHL), soccer is arguably equally competitive across many leagues around the globe. Each of those four sports has its own video game, so kids, teens, and adults can play on a console with their favorite athletes. Soccer has a game too: the FIFA series, now known as FC, published by EA Sports.

In September of 2017, EA released FIFA 18, their video game for that soccer season. As previously mentioned, soccer has a large variety of leagues.

Here, we have all the player data from the FIFA 18 video game.

With a myriad of players in FIFA’s video game, along with countless respective statistics, there is a ton of information to break down. And although it is a video game and “not real,” a lot of the data are actually real and reflect real-life attributes at the time of the video game’s release, such as country of origin, position, and salary.

Dataset

To get a quick grasp of the data, our highest player overall is 94 and our lowest player overall is 46. There are a total of 17,981 players in the data set.

There are 11 players with an overall in the 90s, 508 players in the 80s, 5,238 players in the 70s, 9,290 players in the 60s, 2,838 players in the 50s, and 96 players in the 40s.

For players who receive a salary (most of them), weekly salaries range from 1,000 euros to 565,000 euros.
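The summary figures above come down to a range, a binned count, and a parsed wage column. Here is a minimal base-R sketch of those three steps on a made-up six-player table; `toy_df` and its values are hypothetical, and the wage strings assume the "€565K" style that the parsing code later in this report expects.

```r
# Hypothetical stand-in for the real data: six made-up players
toy_df = data.frame(
  Name    = c("A", "B", "C", "D", "E", "F"),
  Overall = c(94, 83, 77, 71, 65, 46),
  Wage    = c("€565K", "€120K", "€30K", "€9K", "€1K", "€0"),
  stringsAsFactors = FALSE
)

# Highest and lowest overall rating
overall_range = range(toy_df$Overall)

# Players per ten-point band; right = FALSE makes each band [low, high)
overall_band = cut(
  toy_df$Overall,
  breaks = seq(40, 100, by = 10),
  labels = c("40-49", "50-59", "60-69", "70-79", "80-89", "90-99"),
  right = FALSE)
band_counts = table(overall_band)

# Weekly wage in euros: keep only digits, then scale the "K" values by 1,000
wage_euros = as.numeric(gsub("[^0-9.]", "", toy_df$Wage)) *
  ifelse(grepl("K", toy_df$Wage), 1000, 1)

# Wage range among players who actually receive a wage
paid_range = range(wage_euros[wage_euros > 0])
```

Printing `band_counts` gives one count per band, including bands with no players, because `cut()` keeps every level.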

Through the following five visualizations, we show how a manager can take advantage of publicly available data to create a soccer team that best fits their needs.

Findings

Below are tabs for each of our five visualizations, which the viewer can navigate through. By taking a deep dive into player overall, specific positions and position areas, salaries, and important attributes, we allow the user to make critical decisions as a football manager, whether they are enjoying the comfort of FIFA 18 or have the ambition to take the reins of a real soccer team.

Overall by Position

Our first visualization splits all 17,981 players by their preferred positions. Some players have multiple preferred positions, up to four; each is listed because the player fills that role about as competently as the others attached to their name.

In other words, a player with multiple preferred positions is counted once for each of them (e.g., a player whose preferred positions are ST and CAM is listed once under ST and once under CAM).
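The counting rule can be seen on a tiny made-up example. This is only a base-R sketch of the idea, not the pipeline used for the chart; `toy_df` is hypothetical, and the trailing space in each position string is an assumption about how the raw column is formatted.

```r
# Hypothetical players; positions are space-separated in a single string
toy_df = data.frame(
  Name = c("A", "B", "C"),
  Preferred.Positions = c("ST CAM ", "GK ", "CM CDM CB "),
  stringsAsFactors = FALSE
)

# Trim stray whitespace, then split each string into its positions
position_list = strsplit(trimws(toy_df$Preferred.Positions), "\\s+")

# One row per (player, position) pair: a player with k positions appears k times
long_df = data.frame(
  Name = rep(toy_df$Name, lengths(position_list)),
  Position = unlist(position_list),
  stringsAsFactors = FALSE
)

# Players available at each position
position_counts = table(long_df$Position)
```

Three players become six rows here, which is why the position totals in the chart add up to more than the number of players in the data set.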

After counting the number of players who can play each position, each horizontal bar shows how many of those players fall in each overall range, in increments of 10. With this visualization, one can see how many players play at any given level and how common it is to have a player of that level on a roster.

library(plyr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(plotly)
library(forcats)

fifa18_df = read.csv("C:/Users/samme/OneDrive/Documents/R_datafiles/FIFA18CompleteDataset.csv")

fifa_overall_df = fifa18_df[ , c("Overall", "Preferred.Positions")]

overall_bins = c("40-49", "50-59", "60-69", "70-79", "80-89", "90-99")

# Label each bin in the same ascending order as the breaks, so [40, 50) is "40-49"
fifa18_df$Overall.Range = cut(
  fifa18_df$Overall,
  breaks = c(40, 50, 60, 70, 80, 90, 100),
  labels = overall_bins,
  right = FALSE)

duplicates_fifa18_df = fifa18_df %>%
  mutate(Preferred.Positions = str_squish(Preferred.Positions)) %>%
  separate_rows(Preferred.Positions, sep = " ") %>%
  filter(Preferred.Positions != "")

fifa_position_levels = c("ST","CF","LW","RW","CAM","LM","RM","CM","CDM","LWB","RWB","LB","RB","CB","GK")

position_df = duplicates_fifa18_df %>%
  mutate(Preferred.Positions = factor(
    Preferred.Positions,
    levels = fifa_position_levels,
    ordered = TRUE)) %>%
  dplyr::count(Preferred.Positions) %>%
  arrange(Preferred.Positions)

position_df = position_df %>%
  mutate(Preferred.Positions = factor(Preferred.Positions, levels = rev(fifa_position_levels)))

ggplot(duplicates_fifa18_df %>%
         mutate(Preferred.Positions = factor(Preferred.Positions, levels = rev(fifa_position_levels))), 
       aes(y = Preferred.Positions, fill = Overall.Range)) +
  geom_bar(position=position_stack(reverse=TRUE)) +
  labs(title = "FIFA Overall Range by Position", 
       x = "Count of Players", y = "Preferred Position", fill = "FIFA Overall") +
  theme_light() +
  theme(plot.title=element_text(hjust=0.5)) +
  scale_fill_manual(
    values = c(
      "40-49" = "#b2182b",  # dark red
      "50-59" = "#ef8a62",  # orange-red
      "60-69" = "#fddc00",  # yellow
      "70-79" = "#66c2a5",  # darker green
      "80-89" = "#2ca25f",  # strong green
      "90-99" = "#006d2c"   # deep green
    ),
    limits = rev(c("40-49","50-59","60-69","70-79","80-89","90-99"))) +
  geom_text(data=position_df, aes(x=n, y=Preferred.Positions, label=scales::comma(n), fill=NULL), hjust = -0.1, size = 3.5) +
  scale_x_continuous(labels=comma, limits=c(0, 3900))

Average Wage

Our second visualization displays, for each position, the average wage in each overall range, using the same overall bins as the first visualization.

Players with multiple preferred positions are counted under each of them, just as in the first visualization.

By taking a look at each position’s wages, one can get a grasp of past market transactions and of the average monetary value that has been offered and accepted for a player of a specific caliber at a specific position.

How much more is one willing to spend to get the chance to secure top-level talent at their desired position? Will they follow past market trends or set a new one to increase their leverage?
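To make "average wage per position and overall range" concrete, here is a small base-R sketch on made-up rows. `toy_df` is hypothetical, and `aggregate()` stands in for whatever grouping step one prefers; the point is only that the mean is taken per (position, range) group.

```r
# Hypothetical rows: one per (player, preferred position)
toy_df = data.frame(
  Position = c("ST", "ST", "ST", "GK", "GK"),
  Overall.Range = c("80-89", "80-89", "70-79", "80-89", "80-89"),
  Wage = c("€200K", "€100K", "€40K", "€60K", "€0"),
  stringsAsFactors = FALSE
)

# Convert strings such as "€200K" to a number of euros
toy_df$Wage_Euros = as.numeric(gsub("[^0-9.]", "", toy_df$Wage)) *
  ifelse(grepl("K", toy_df$Wage), 1000, 1)

# Mean weekly wage for every position / overall-range combination
avg_wage = aggregate(Wage_Euros ~ Position + Overall.Range,
                     data = toy_df, FUN = mean)
```

Note that a player listed at "€0" still counts toward the group, so the hypothetical goalkeepers average 30,000 euros rather than 60,000.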

duplicates_fifa18_df = duplicates_fifa18_df %>%
  mutate(Wage_Euros = case_when(Wage == "€0" ~ 0,
                                str_detect(Wage, "K") ~ as.numeric(str_remove_all(Wage, "[^0-9.]")) * 1000,
                                str_detect(Wage, "M") ~ as.numeric(str_remove_all(Wage, "[^0-9.]")) * 1000000,
                                TRUE ~ as.numeric(str_remove_all(Wage, "[^0-9.]"))))

duplicates_fifa18_df = duplicates_fifa18_df %>%
  mutate(Preferred.Positions = factor(Preferred.Positions, levels = fifa_position_levels, ordered = TRUE))

# Re-derive the bins from Overall so every x-axis label matches the players in its bar
wage_plot_df = duplicates_fifa18_df %>%
  mutate(Overall.Range = cut(Overall, breaks = seq(40, 100, by = 10),
                             labels = overall_bins, right = FALSE))

# stat = "summary" with fun = "mean" plots the average wage of each group
# instead of overplotting one bar per player
ggplot(wage_plot_df, aes(x = Overall.Range, y = Wage_Euros, fill = Preferred.Positions)) +
  geom_bar(stat = "summary", fun = "mean") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5, size = 19), axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14), legend.title = element_text(size = 13),
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + 
  labs(title = "Average Wage by Overall Range by Position",
       x = "Overall Range", y = "Average Wage (Weekly)", fill = "Position") +
  facet_wrap(~Preferred.Positions, ncol = 5, nrow = 3) +
  scale_y_continuous(breaks = scales::breaks_width(100000),
                     labels = scales::label_number(prefix = "€", big.mark = ",")) +
  ggthemes::scale_fill_tableau("Tableau 20")

Specific Deal % by Area

Here is a visualization narrowing down our data to players who will see the field more often than others: those who are at least a 70 overall.

Unlike the previous two visualizations, which break players down by specific position, this one groups our 15 positions into four main position areas: Attackers (ST, CF, LW, RW), Midfielders (CAM, LM, RM, CM, CDM), Defenders (LWB, RWB, LB, RB, CB), and Goalkeepers (GK).

In that order, with Attackers on the outermost ring and Goalkeepers on the innermost, we see the percentage of each position area whose salary falls in each salary bin. The bins run in increments of 25k euros up to 100k and then widen (100-150k, 150-200k, and 200k+), where players with large mega-deals become less frequent.

How often are these productive players getting paid a salary in a certain salary range? What salary do you want to offer your players who will see the pitch?
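Each ring is a within-area percentage: the share of that area's players whose wage lands in each bin. Here is a base-R sketch of that calculation on made-up wages. `toy_df` is hypothetical, and `prop.table()` is used only for illustration.

```r
# Hypothetical 70+ overall players with weekly wages in euros
toy_df = data.frame(
  Position.Area = c("Attack", "Attack", "Attack", "Attack",
                    "Goalkeeper", "Goalkeeper"),
  Wage_Euros = c(10000, 20000, 60000, 250000, 5000, 30000),
  stringsAsFactors = FALSE
)

# Left-closed bins: a wage of exactly 25,000 falls in "25-50k"
toy_df$Wage.Range = cut(
  toy_df$Wage_Euros,
  breaks = c(0, 25000, 50000, 75000, 100000, 150000, 200000, Inf),
  labels = c("0-25k", "25-50k", "50-75k", "75-100k",
             "100-150k", "150-200k", "200k+"),
  right = FALSE)

# Count players per area and bin, then convert each row to percentages
counts = table(toy_df$Position.Area, toy_df$Wage.Range)
percent = prop.table(counts, margin = 1) * 100
```

Because the percentages are computed row by row (`margin = 1`), every area sums to 100 regardless of how many players it contains, which is what makes the rings comparable.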

Attack = c("ST", "CF", "LW", "RW")
Midfield = c("CAM", "LM", "RM", "CM", "CDM")
Defense = c("LWB", "RWB", "LB", "RB", "CB")
Goalkeeper = c("GK")

map_position_area = function(pos_string) {
  # Split positions
  positions = unlist(str_split(pos_string, " "))
  
  groups = c()
  if (any(positions %in% Attack)) groups = c(groups, "Attack")
  if (any(positions %in% Midfield)) groups = c(groups, "Midfield")
  if (any(positions %in% Defense)) groups = c(groups, "Defense")
  if (any(positions %in% Goalkeeper)) groups = c(groups, "Goalkeeper")
  
  # Collapse into one string
  paste(groups, collapse = ", ")
}

fifa18_df = fifa18_df %>%
  mutate(Position.Area = sapply(Preferred.Positions, map_position_area))

fifa18_df = fifa18_df %>%
  mutate(Wage_Euros = case_when(Wage == "€0" ~ 0,
                                str_detect(Wage, "K") ~ as.numeric(str_remove_all(Wage, "[^0-9.]")) * 1000,
                                str_detect(Wage, "M") ~ as.numeric(str_remove_all(Wage, "[^0-9.]")) * 1000000,
                                TRUE ~ as.numeric(str_remove_all(Wage, "[^0-9.]"))))

fifa18_df = fifa18_df %>%
  mutate(Wage.Range = cut(
    Wage_Euros,
    breaks = c(0, 25000, 50000, 75000, 100000, 150000, 200000, Inf),
    labels = c("0-25k", "25-50k", "50-75k", "75-100k",
               "100-150k", "150-200k", "200k+"),
    include.lowest = TRUE,
    right = FALSE
  ),
  Wage.Range = factor(Wage.Range,
                      levels = c("0-25k","25-50k","50-75k","75-100k",
                                 "100-150k","150-200k","200k+")))

dups_fifa18_df_position.area = fifa18_df %>%
  separate_rows(Position.Area, sep = ",\\s*")  # splits "Attack, Midfield" into two rows

dups_fifa18_df_position.area = dups_fifa18_df_position.area %>%
  filter(Overall >= 70)

fifa18_df_position.area_wage.range = fifa18_df %>%
  separate_rows(Position.Area, sep = ",\\s*") %>%
  filter(Overall >= 70) %>%
  group_by(Position.Area, Wage.Range) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Position.Area) %>%
  mutate(Percent = n / sum(n) * 100) %>%
  arrange(Position.Area, Wage.Range) 

fifa18_df_position.area_wage.range = fifa18_df_position.area_wage.range %>%
  arrange(Position.Area, Wage.Range)

fifa18_df_position.area_wage.range = fifa18_df_position.area_wage.range %>%
  mutate(Wage.Range = factor(Wage.Range,
                             levels = c("0-25k","25-50k","50-75k","75-100k",
                                        "100-150k","150-200k","200k+")))

attack = fifa18_df_position.area_wage.range %>%
  filter(Position.Area == "Attack") %>% arrange(Wage.Range)
midfield = fifa18_df_position.area_wage.range %>%
  filter(Position.Area == "Midfield") %>% arrange(Wage.Range)
defense = fifa18_df_position.area_wage.range %>%
  filter(Position.Area == "Defense") %>% arrange(Wage.Range)
goalkeeper = fifa18_df_position.area_wage.range %>%
  filter(Position.Area == "Goalkeeper") %>% arrange(Wage.Range)

plot_ly(hole = 0.7) %>%
  layout(title = "Salary Range (Attack, Midfield, Def, GK) 70+ Overall",
         legend = list(title = list(text = "Weekly Salary Range (€)"))) %>%
  add_trace(data = attack,
            labels = ~Wage.Range,
            values = ~n,
            type = "pie",
            textposition = "inside",
            sort = FALSE,  # this tells Plotly to respect row/factor order
            hovertemplate = "Area: Attack<br>Salary Range: %{label}<br>Percent: %{percent}<br>Players: %{value}<extra></extra>"
  ) %>%
  add_trace(data = midfield,
            labels = ~Wage.Range,
            values = ~n,
            type = "pie",
            textposition = "inside",
            sort = FALSE,
            hovertemplate = "Area: Midfield<br>Salary Range: %{label}<br>Percent: %{percent}<br>Players: %{value}<extra></extra>",
            domain = list(
              x = c(0.16, 0.84),
              y = c(0.16, 0.84))) %>%
  add_trace(data = defense,
            labels = ~Wage.Range,
            values = ~n,
            type = "pie",
            textposition = "inside",
            sort = FALSE,
            hovertemplate = "Area: Defense<br>Salary Range: %{label}<br>Percent: %{percent}<br>Players: %{value}<extra></extra>",
            domain = list(
              x = c(0.27, 0.73),
              y = c(0.27, 0.73))) %>%
  add_trace(data = goalkeeper,
            labels = ~Wage.Range,
            values = ~n,
            type = "pie",
            textposition = "inside",
            sort = FALSE,
            hovertemplate = "Area: Goalkeeper<br>Salary Range: %{label}<br>Percent: %{percent}<br>Players: %{value}<extra></extra>",
            domain = list(
              x = c(0.35, 0.65),
              y = c(0.35, 0.65)))

Best Attackers

Here we have a hoverable, interactive heatmap of two attributes. This visualization is tailored to attackers as defined earlier, those who will be in position to score the goals for the club, and is again limited to attackers who are expected to be serviceable, so a 70 overall and above.

Two important attributes for attackers in FIFA 18 are Finishing and Acceleration. These players are the ones closest to the goal and need the finishing skills to score. Acceleration is key because they need a fast burst past defenders to weave their way into scoring positions.

We see our Finishing Rating on our x-axis and Acceleration Rating on our y-axis, both in ranges separated by increments of five. Not only do you see the number of players who fall in each combination of ranges, but if one hovers over any square, you see the names of those who are in that square. Granted you only see so many names at once, but one should only really need to know the names of those who are “one-of-one” players, those in a category of their own, or at least close to it.

Are these the players you are willing to sign or pursue? Which combination of ranges for these two critical attributes will you aim to have on your club?
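Each heatmap square is a pair of five-point bins, with a count and a list of names attached. The base-R sketch below shows that bookkeeping on four made-up attackers; `toy_df` and the helper `bin_by_five()` are hypothetical names introduced for illustration.

```r
# Hypothetical attackers
toy_df = data.frame(
  Name = c("A", "B", "C", "D"),
  Finishing = c(95, 88, 86, 72),
  Acceleration = c(89, 92, 90, 74),
  stringsAsFactors = FALSE
)

bin_breaks = c(50, 55, 60, 65, 70, 75, 80, 85, 90, 100)
bin_labels = c("50-54", "55-59", "60-64", "65-69", "70-74",
               "75-79", "80-84", "85-89", "90-99")

# Left-closed five-point bins; ratings below 50 come back as NA
bin_by_five = function(x) {
  cut(x, breaks = bin_breaks, labels = bin_labels,
      include.lowest = TRUE, right = FALSE)
}

toy_df$Finishing.Range = bin_by_five(toy_df$Finishing)
toy_df$Acceleration.Range = bin_by_five(toy_df$Acceleration)

# Rows are Acceleration bins, columns are Finishing bins
cell_counts = table(toy_df$Acceleration.Range, toy_df$Finishing.Range)

# Names sharing a square, joined into one string (NA where a square is empty)
cell_names = tapply(toy_df$Name,
                    list(toy_df$Acceleration.Range, toy_df$Finishing.Range),
                    paste, collapse = ", ")
```

In this toy table, players B and C share the square for Finishing 85-89 and Acceleration 90-99, while A and D each sit alone in theirs, the kind of "one-of-one" square the hover text is meant to surface.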

fifa18_df_attackers = dups_fifa18_df_position.area %>%
  filter(Position.Area == "Attack", Overall >= 70)

fifa18_df_attackers$Finishing = as.numeric(fifa18_df_attackers$Finishing)
fifa18_df_attackers$Acceleration = as.numeric(fifa18_df_attackers$Acceleration)

# Create ranges
fifa18_df_attackers = fifa18_df_attackers %>%
  mutate(
    Finishing.Range = cut(
      Finishing, breaks = c(50,55,60,65,70,75,80,85,90,100),
      labels = c("50-54","55-59","60-64","65-69","70-74","75-79","80-84","85-89","90-99"),
      include.lowest = TRUE, right = FALSE),
    Acceleration.Range = cut(
      Acceleration, breaks = c(50,55,60,65,70,75,80,85,90,100),
      labels = c("50-54","55-59","60-64","65-69","70-74","75-79","80-84","85-89","90-99"),
      include.lowest = TRUE, right = FALSE))

fifa18_df_attackers = fifa18_df_attackers %>%
  filter(!is.na(Finishing.Range) & !is.na(Acceleration.Range))

fifa18_fin_accel_counts_players = fifa18_df_attackers %>%
  group_by(Finishing.Range, Acceleration.Range) %>%
  summarise(
    n = n(),
    players = paste(Name, collapse = ", "),  # all player names
    .groups = "drop") %>%
  complete(Finishing.Range, Acceleration.Range, fill = list(n = 0, players = ""))

fifa18_fin_accel_counts_players$Acceleration.Range = 
  factor(fifa18_fin_accel_counts_players$Acceleration.Range,
         levels = rev(levels(fifa18_fin_accel_counts_players$Acceleration.Range)))

heatmap_attackers = ggplot(fifa18_fin_accel_counts_players, aes(
  x = Finishing.Range, y = fct_rev(Acceleration.Range), fill = n,
  text = paste("Count:", n, "<br>Players:", players))) +
  geom_tile(color = "black") +
  geom_text(aes(label = n), color = "black") +
  coord_equal(ratio = 1) +
  labs(title = "Heatmap: Player Count by Finishing and Accel Ranges",
       x = "Finishing Rating", y = "Acceleration Rating",
       fill = "Player Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_gradient(low = "white", high = "red",
                      breaks = c(25, 50, 75, 100))
  
ggplotly(heatmap_attackers, tooltip = "text")

Critical Attribute Averages

Our final visualization takes our position areas, this time excluding goalkeeping, and divides each of the three remaining areas (Attack, Midfield, and Defense) into two groups, 70-79 overall and 80+ overall (i.e., serviceable and premium groups, respectively). On the x-axis are eight important attributes for any footballer: Stamina, Acceleration, Sprint speed, Agility, Balance, Strength, Ball control, and Dribbling.

This visualization takes each of the six groups’ average rating for each of the eight attributes, mapping out the disparity, or lack thereof, between players whose overalls are in the 70s and those rated 80 or above.

How much better are these elite players at certain skills on the field? Is that difference worth acquiring at the cost of other potential talent?
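One way to put a number on "how much better" is to average each attribute per group and subtract the 70-79 average from the 80+ average. The sketch below does that in base R for two attributes and one position area; `toy_df` and its ratings are made up.

```r
# Hypothetical attackers in the two overall tiers
toy_df = data.frame(
  Area.Ovr = c("80+ Attacker", "80+ Attacker",
               "70-79 Attacker", "70-79 Attacker"),
  Dribbling = c(88, 84, 74, 70),
  Strength = c(70, 74, 68, 72),
  stringsAsFactors = FALSE
)
attribute_cols = c("Dribbling", "Strength")

# Average rating of every attribute within each group
avg_by_group = aggregate(toy_df[attribute_cols],
                         by = list(Area.Ovr = toy_df$Area.Ovr),
                         FUN = mean)

# Premium minus serviceable, attribute by attribute
premium = unlist(avg_by_group[avg_by_group$Area.Ovr == "80+ Attacker", attribute_cols])
serviceable = unlist(avg_by_group[avg_by_group$Area.Ovr == "70-79 Attacker", attribute_cols])
gap = premium - serviceable
```

In this made-up case the premium group is 14 points ahead on Dribbling but only 2 on Strength, the sort of uneven gap the line chart is meant to expose.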

dups_fifa18_df_position.area2 = dups_fifa18_df_position.area %>%
  mutate(
    Area.Ovr = case_when(
      Position.Area == "Attack" & Overall < 80 ~ "70-79 Attacker",
      Position.Area == "Attack" & Overall >= 80 ~ "80+ Attacker",
      
      Position.Area == "Midfield" & Overall < 80 ~ "70-79 Midfielder",
      Position.Area == "Midfield" & Overall >= 80 ~ "80+ Midfielder",
      
      Position.Area == "Defense" & Overall < 80 ~ "70-79 Defender",
      Position.Area == "Defense" & Overall >= 80 ~ "80+ Defender",
      
      TRUE ~ NA_character_))

dups_fifa18_df_position.area2 = dups_fifa18_df_position.area2 %>%
  filter(!is.na(Area.Ovr))

attributes = c("Stamina", "Acceleration", "Sprint.speed", "Agility", 
                "Balance", "Strength", "Ball.control", "Dribbling")

area_order = c("80+ Attacker", "70-79 Attacker", 
                "80+ Midfielder", "70-79 Midfielder", 
                "80+ Defender", "70-79 Defender")

attributes_df_allrows = dups_fifa18_df_position.area2 %>%
  filter(Area.Ovr %in% area_order) %>%
  select(all_of(c("Area.Ovr", attributes))) %>%
  pivot_longer(
    cols = all_of(attributes),
    names_to = "Attribute",
    values_to = "Rating")

attributes_df_allrows$Rating = as.numeric(attributes_df_allrows$Rating)

attributes_avg_df = attributes_df_allrows %>%
  group_by(Attribute, Area.Ovr) %>%
  summarise(Average.Rating = mean(Rating, na.rm = TRUE), .groups = "drop")

attributes_avg_df = attributes_avg_df %>%
  mutate(Area.Ovr = factor(Area.Ovr, levels = area_order),
    Attribute = factor(Attribute, levels = attributes)) %>%
  arrange(Area.Ovr, Attribute)

area_colors = c(
  "80+ Attacker" = "#A85A1C",    # dark orange
  "70-79 Attacker" = "#F8C17C",  # light orange
  "80+ Midfielder" = "#1F4E79",  # dark blue
  "70-79 Midfielder" = "#95C1E8",# light blue
  "80+ Defender" = "#9C8415",    # dark gold
  "70-79 Defender" = "#F9E79F"   # light gold
)

ggplot(attributes_avg_df, aes(x = Attribute, y = Average.Rating, group=Area.Ovr)) +
  geom_line(aes(color=Area.Ovr), size = 3) +
  labs(title = "Average Attribute Rating by Position Area, 70-79 or 80+ Overall", x = "Attribute", y = "Average Rating") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5, size = 20),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 12),
        axis.text.y = element_text(size = 12),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "white") +
  scale_y_continuous(labels = comma) +
  scale_color_manual(values = area_colors, name = "Area & Overall") +
  guides(color = guide_legend(reverse = FALSE))

Conclusion

From the distribution of positions and overalls, to average salary by position, to heat mapping elite attackers in the sport, to plotting the averages of critical attributes for any soccer player, one can make better-informed managerial decisions for constructing a winning football team. A deep dive into this information is still required, though, to reach actionable decisions that translate into winning and elite productivity.

The question is which elements of these data you will prioritize. How will you make your winning team on a console in FIFA 18 or on the field as a real football manager? What trade-offs are you willing to make with these data in mind? Those are the critical questions one gets to answer with these visualizations at their disposal.