(Image: EA Sports)

1. Overview

Football is a popular sport played and enjoyed by many. The popularity of this sport is not only limited to international games but it also extends to the video gaming world. FIFA is a football simulation video game series released annually by Electronic Arts under the EA Sports label. The game series is a great success, being one of the best-selling video game franchises of all time.

This DataViz uses statistics for the football players featured in FIFA 20, which covers the 2019–2020 season. The dataset, available on Kaggle, contains information on more than 18,000 players with over 70 attributes and was scraped from the website SoFIFA.

This DataViz combines the author’s love of football and data science in a short exploratory analysis of the FIFA 20 dataset using R.

2. Major Data and Design Challenges

2.1. Challenge 1: Visualizing Monetary Values of Players

In the dataset, the monetary values of players are given in Euro (€) and abbreviated in thousands (K) or millions (M), e.g., €95.5M. These values are stored as text rather than numbers, so they cannot be analyzed or visualized directly.

2.2. Challenge 2: Visualizing Various Playing Positions of Players

There are a total of 15 different playing positions in the dataset. To gain a better overview and more meaningful insights, positions with similar characteristics should be grouped together, e.g., LB (left-back), RB (right-back) and CB (center-back) are all defending positions in the game.

3. Suggestions to Overcome Challenges

3.1. Suggestion 1: Visualizing Monetary Values of Players

Convert the text values to actual numeric values: remove the “€” sign and multiply by one thousand or one million according to the “K” or “M” suffix.

However, doing so would cause another challenge for visualization as the actual converted values will have many zeros, e.g., €95.5M will become 95,500,000. To overcome this, create value brackets with labels, e.g., the values within the range 90,000,000 to 100,000,000 will be grouped with the value label 90–100M.

3.2. Suggestion 2: Visualizing Various Playing Positions of Players

The different player positions in the dataset are CAM, CB, CDM, CF, CM, GK, LB, LM, LW, LWB, RB, RM, RW, RWB, ST. Classify these specific positions into four more general roles, namely Striker, Midfielder, Defender, and Goalkeeper, according to the table below.

Positions | Role
CF, ST | Striker
LW, LM, CDM, CM, CAM, RM, RW | Midfielder
LWB, LB, CB, RB, RWB | Defender
GK | Goalkeeper

3.3. Sketch of Proposed DataViz Design

4. Data Preparation Steps

4.1. Load R packages and Data

First, load the necessary R packages in RStudio.

  • tidyverse contains a set of essential packages for data manipulation and exploration.
  • kableExtra to build common complex tables and manipulate table styles.
  • gridExtra to arrange multiple grid-based plots on a page.
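
# Packages required for this analysis
packages <- c('tidyverse', 'kableExtra', 'gridExtra')

# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}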

Load the data.

df <- read_csv('data/fifa20_data.csv')

4.2. Data Wrangling

For visualization, we will use the player attributes such as Name, Age, Country, Overall, Club, Best Position and Value.

df <- tibble::as_tibble(df) %>% 
  select(ID, Name, Age, Country, Overall, Club, BP, Value) %>%
  rename(`Best Position` = BP)

ID | Name | Age | Country | Overall | Club | Best Position | Value
158023 | Lionel Messi | 32 | Argentina | 94 | FC Barcelona | CAM | €95.5M
20801 | C. Ronaldo dos Santos Aveiro | 34 | Portugal | 93 | Juventus | ST | €58.5M
190871 | Neymar da Silva Santos Jr. | 27 | Brazil | 92 | Paris Saint-Germain | CAM | €105.5M
200389 | Jan Oblak | 26 | Slovenia | 91 | Atlético Madrid | GK | €77.5M
192985 | Kevin De Bruyne | 28 | Belgium | 91 | Manchester City | CAM | €90M
183277 | Eden Hazard | 28 | Belgium | 91 | Real Madrid | CAM | €90M
209331 | Mohamed Salah | 27 | Egypt | 90 | Liverpool | LW | €80.5M
203376 | Virgil van Dijk | 27 | Netherlands | 90 | Liverpool | CB | €78M
192448 | Marc-André ter Stegen | 27 | Germany | 90 | FC Barcelona | GK | €67.5M
177003 | Luka Modrić | 33 | Croatia | 90 | Real Madrid | CM | €45M

4.2.1. Creating Value Brackets

We will first convert the Value column to actual currency values. We will use the function below to remove the “€” sign from the column values and apply the appropriate currency conversion either to thousands (K) or millions (M).

toNumberCurrency <- function(vector) {
  # Strip the euro sign and any thousands separators
  vector <- gsub("(€|,)", "", as.character(vector))

  # Pick the multiplier from the suffix: K = thousands, M = millions,
  # no suffix = the value is already in plain units
  multiplier <- ifelse(grepl("K$", vector), 1000,
                ifelse(grepl("M$", vector), 1000000, 1))

  # Drop the suffix before converting, which avoids NA-coercion warnings
  result <- as.numeric(gsub("[KM]$", "", vector)) * multiplier

  return(result)
}

df$Value <- toNumberCurrency(df$Value)
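
As a quick sanity check of the conversion logic, the sketch below parses a few made-up strings. The helper name `parse_euro` and the sample values are hypothetical and only for illustration; it is meant to mirror the idea of the function above, not replace it.

```r
# Hypothetical helper for illustration only; not the article's own function.
parse_euro <- function(x) {
  x <- gsub("€", "", as.character(x), fixed = TRUE)  # drop the euro sign
  x <- gsub(",", "", x, fixed = TRUE)                # drop thousands separators
  # Suffix decides the multiplier: K = thousands, M = millions, none = 1
  mult <- ifelse(grepl("K$", x), 1e3,
          ifelse(grepl("M$", x), 1e6, 1))
  as.numeric(sub("[KM]$", "", x)) * mult
}

# Made-up sample values covering the three cases
sample_values <- c("€95.5M", "€525K", "€0")
parsed <- parse_euro(sample_values)
print(format(parsed, big.mark = ",", scientific = FALSE))
```

Checking a value without a suffix (such as "€0") is worthwhile, since such rows would otherwise be easy to overlook.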

Next, we will bin the converted player values and name the new column as Value Brackets with the labels: 0–10M, 10–20M, 20–30M, 30–40M, 40–50M, 50–60M, 60–70M, 70–80M, 80–90M, 90–100M, 100M+.

# Create value brackets
value_breaks <- c(0, 10000000, 20000000, 30000000, 40000000, 50000000, 60000000, 70000000, 80000000, 90000000, 100000000, Inf)
value_labels <- c("0–10M", "10–20M", "20–30M", "30–40M", "40–50M","50–60M", "60–70M", "70–80M", "80–90M","90–100M","100M+")

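# cut() intervals are right-closed by default, so a value of exactly
# 90,000,000 falls in "80–90M" rather than "90–100M"
df <- mutate(df, `Value Brackets` = cut(Value, breaks = value_breaks,
                                        labels = value_labels,
                                        include.lowest = TRUE))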
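
How `cut()` treats a value that sits exactly on a break is worth a quick check, since it decides whether a 90,000,000 valuation is labelled 80–90M or 90–100M. The sketch below uses a reduced, made-up set of breaks and values (assumptions for illustration) to compare the default `right = TRUE` with `right = FALSE`.

```r
# Reduced set of breaks and labels, for illustration only
breaks <- c(0, 80e6, 90e6, 100e6, Inf)
labels <- c("0–80M", "80–90M", "90–100M", "100M+")
values <- c(90e6, 95.5e6, 100e6)

# Default (right = TRUE): intervals are (lower, upper]
right_closed <- cut(values, breaks = breaks, labels = labels,
                    include.lowest = TRUE)

# right = FALSE: intervals are [lower, upper)
left_closed <- cut(values, breaks = breaks, labels = labels,
                   right = FALSE, include.lowest = TRUE)

print(data.frame(value = values, right_closed, left_closed))
```

Either convention is defensible; the point is to pick one deliberately so that the bracket labels match what readers expect.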

Check and compare the actual Value with the newly created Value Brackets.

ID | Name | Age | Country | Overall | Club | Value | Value Brackets
158023 | Lionel Messi | 32 | Argentina | 94 | FC Barcelona | 95500000 | 90–100M
20801 | C. Ronaldo dos Santos Aveiro | 34 | Portugal | 93 | Juventus | 58500000 | 50–60M
190871 | Neymar da Silva Santos Jr. | 27 | Brazil | 92 | Paris Saint-Germain | 105500000 | 100M+
200389 | Jan Oblak | 26 | Slovenia | 91 | Atlético Madrid | 77500000 | 70–80M
192985 | Kevin De Bruyne | 28 | Belgium | 91 | Manchester City | 90000000 | 80–90M
183277 | Eden Hazard | 28 | Belgium | 91 | Real Madrid | 90000000 | 80–90M
209331 | Mohamed Salah | 27 | Egypt | 90 | Liverpool | 80500000 | 80–90M
203376 | Virgil van Dijk | 27 | Netherlands | 90 | Liverpool | 78000000 | 70–80M
192448 | Marc-André ter Stegen | 27 | Germany | 90 | FC Barcelona | 67500000 | 60–70M
177003 | Luka Modrić | 33 | Croatia | 90 | Real Madrid | 45000000 | 40–50M

4.2.2. Classifying Player Positions

Based on the table in Suggestion 2, we will create another column that will classify the specific player positions into general playing roles using the code chunk below.

x <- as.factor(df$`Best Position`)
levels(x) <- list(Striker = c("CF", "ST"),
                  Midfielder = c("LW","LM","CDM","CM","CAM","RM","RW"),
                  Defender = c("LWB", "LB", "CB", "RB", "RWB"), 
                  Goalkeeper  = c("GK")
                  )
df <- mutate(df, Role = x)
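
One way to double-check the position-to-role mapping is to look for positions that end up without a role. The sketch below is a base-R alternative that uses a named lookup vector built from the table in Suggestion 2; the sample positions, including the deliberately unknown "SW", are made up for illustration.

```r
# Named lookup vector: specific position -> general role
role_lookup <- c(
  CF = "Striker", ST = "Striker",
  LW = "Midfielder", LM = "Midfielder", CDM = "Midfielder",
  CM = "Midfielder", CAM = "Midfielder", RM = "Midfielder", RW = "Midfielder",
  LWB = "Defender", LB = "Defender", CB = "Defender",
  RB = "Defender", RWB = "Defender",
  GK = "Goalkeeper"
)

# Made-up sample; "SW" is deliberately absent from the lookup
positions <- c("CAM", "ST", "GK", "CB", "SW")
roles <- unname(role_lookup[positions])

# Positions without a role come back as NA and can be listed for review
unmapped <- unique(positions[is.na(roles)])
print(data.frame(position = positions, role = roles))
print(unmapped)
```

If `unmapped` is non-empty for the real data, the table of roles needs another entry before plotting.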

Check each player’s Best Position and the general Role.

ID | Name | Age | Country | Overall | Club | Best Position | Role
158023 | Lionel Messi | 32 | Argentina | 94 | FC Barcelona | CAM | Midfielder
20801 | C. Ronaldo dos Santos Aveiro | 34 | Portugal | 93 | Juventus | ST | Striker
190871 | Neymar da Silva Santos Jr. | 27 | Brazil | 92 | Paris Saint-Germain | CAM | Midfielder
200389 | Jan Oblak | 26 | Slovenia | 91 | Atlético Madrid | GK | Goalkeeper
192985 | Kevin De Bruyne | 28 | Belgium | 91 | Manchester City | CAM | Midfielder
183277 | Eden Hazard | 28 | Belgium | 91 | Real Madrid | CAM | Midfielder
209331 | Mohamed Salah | 27 | Egypt | 90 | Liverpool | LW | Midfielder
203376 | Virgil van Dijk | 27 | Netherlands | 90 | Liverpool | CB | Defender
192448 | Marc-André ter Stegen | 27 | Germany | 90 | FC Barcelona | GK | Goalkeeper
177003 | Luka Modrić | 33 | Croatia | 90 | Real Madrid | CM | Midfielder

5. Final Visualization Steps and Insights

The data is now ready for visualization, and we will conduct exploratory data analysis (EDA) using the ggplot() function from the ggplot2 package. Each visualization in this section comes with a short description and, where applicable, the insights drawn from it.

5.1. Distribution by General Player Roles

First off, visualize the distribution of players based on their general playing roles.

ggplot(df, aes(Role)) + 
  geom_bar(aes(fill = after_stat(count)), col = "orange") + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Players based on General Playing Roles") + 
  theme_minimal() + 
  theme(legend.position = 'none')

We see that Midfielders are the most numerous, followed by Defenders, Strikers, and finally Goalkeepers.

5.2. Distribution by Player Positions

Next, visualize the distribution of players by their best positions.

ggplot(df, aes(`Best Position`)) + 
  geom_bar(aes(fill = after_stat(count)), col = "orange") + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Players based on Best Positions") + 
  theme_minimal() +
  theme(legend.position = 'none')

Given the previous plot, we might have expected a Midfielder position to have the highest count. Surprisingly, CB (center-back) is the most common position, followed by ST (striker)! The Midfielder role leads overall only because it pools seven separate positions.

5.3. Distribution by Age

Following that, plot the distribution of players by age.

g_age <- ggplot(data = df, aes(Age))
g_age + 
  geom_histogram(binwidth = 1, col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution based on Age") + 
  theme_minimal() +
  theme(legend.position = 'none')

We see that a large number of players are between 20 and 27 years of age.

5.4. Comparison between Age and Playing Roles

The following plot shows the relation between the age of the players and their general playing role.

g_age + 
  geom_density(col = "orange", aes(fill = Role), alpha = 0.5) +
  facet_grid(.~Role) + 
  ggtitle("Distribution based on Age and Role") + 
  theme_light() +
  theme(legend.position = 'none')

5.5. Distribution by Overall Rating

g_overall <- ggplot(data = df, aes(Overall))
g_overall + 
  geom_histogram(binwidth = 2, col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution based on Overall Rating") + 
  theme_minimal() +
  theme(legend.position = 'none')

From the visualization above, we see that most players have an overall rating of around 65.

5.6. Distribution by Player Value

We will now plot the number of players in each value bracket. The vast majority of players are valued below 50M, so the bars for the lower brackets would dwarf the rest and make the higher brackets unreadable. Hence, we leave them out and display only the players valued above 50M, i.e., the brackets from 50–60M to 100M+.

moreThan50M <- filter(df, Value > 50000000)

ggplot(moreThan50M, aes(x = `Value Brackets`)) + 
  geom_bar(aes(fill = after_stat(count)), col = "orange") + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Value between 50M–100M+") + 
  theme_minimal() +
  theme(legend.position = 'none')

5.7. Age vs Overall Rating of Players divided amongst Value Brackets

g_age_overall <- ggplot(df, aes(Age, Overall))
g_age_overall + 
  geom_point(aes(color = `Value Brackets`)) + 
  geom_smooth(color = "darkblue") + 
  ggtitle("Distribution between Age and Overall Rating of players based on Value bracket") + 
  theme_minimal()

We see that the high valuations are dominated by players with an overall rating of 85+ who are between 23 and 33 years old.

5.8. Distribution of Player Positions by Value Brackets

The visualization below shows the player valuation based on their best playing positions.

gf1 <- filter(df, Value <= 30000000)
g1 <- ggplot(gf1, aes(`Best Position`)) + 
  geom_bar(aes(fill = `Value Brackets`)) + 
  ggtitle("Position based on Value (0–30M)") + 
  theme_minimal()
  
gf2 <- filter(df,Value > 30000000)
g2 <- ggplot(gf2, aes(`Best Position`)) + 
  geom_bar(aes(fill = `Value Brackets`)) + 
  ggtitle("Position based on Value (30M+)") + 
  theme_minimal()
  
grid.arrange(g1, g2, ncol=1)

We see that the most valuable footballers (valued above 80M) play in attacking positions: CAM, LW, RW and ST. This is as expected, since most of the top football stars are attacking midfielders and strikers!

5.9. Top 10 Clubs by Players’ Value

We will also plot the 10 most valuable clubs using the code chunk below. A club’s value is calculated by summing the valuations of its players.

group_clubs <- group_by(df, Club)

club_value <- summarise(group_clubs, `Total Value` = sum(Value))

top_10_valuable_clubs <- top_n(club_value, 10, `Total Value`)

top_10_valuable_clubs$Club <-as.factor(top_10_valuable_clubs$Club)
  
ggplot(top_10_valuable_clubs, aes(x = reorder(Club, `Total Value`), y = `Total Value`)) + 
  labs(x = 'Club') +
  geom_col(aes(fill = `Total Value`), col = "orange") + 
  coord_flip() + 
  scale_y_continuous(labels = scales::unit_format(unit = "M", scale = 1e-6)) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Top 10 Valuable Clubs") + 
  theme_minimal() +
  theme(legend.position = 'none')
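
The group-and-sum step behind the club totals can be sketched in base R on a tiny made-up table (the club names and values below are invented for illustration). Note that `sum()` returns `NA` if any valuation in a group is missing unless `na.rm = TRUE` is set; the formula interface of `aggregate()` used here drops rows with missing values by default.

```r
# Made-up data: one player of club "B" has a missing valuation
players <- data.frame(
  Club  = c("A", "A", "B", "B", "C"),
  Value = c(10e6, 5e6, 20e6, NA, 8e6)
)

# Sum the valuations per club; rows with NA are omitted by default
totals <- aggregate(Value ~ Club, data = players, FUN = sum)

# Order clubs by total value, highest first, and keep the top two
top2 <- head(totals[order(-totals$Value), ], 2)
print(top2)
```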

5.10. Top 10 Countries by Number of Players

And finally, we will plot the top 10 countries with the highest number of players in FIFA 20.

countries_count <- count(df, Country)

top10_countries <- top_n(countries_count, 10, n)

top10_country_names <- top10_countries$Country
  
country <- filter(df, Country %in% top10_country_names)

ggplot(country, aes(x=reorder(Country, Country,
                              function(x)-length(x)))) + 
  labs(x = 'Country') +
  geom_bar(col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Top 10 Countries with the Most Players") + 
  theme_minimal() +
  theme(legend.position = 'none')
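
When rows are filtered against a vector of several names, `%in%` tests membership, whereas `==` compares element by element and silently recycles the shorter vector. A minimal sketch with made-up country names shows the difference:

```r
# Made-up data for illustration
countries <- c("England", "Japan", "Brazil", "England", "Japan", "Brazil")
keep <- c("England", "Brazil")

# == recycles `keep` along `countries` (England, Brazil, England, ...),
# so a row matches only if it happens to line up with the same name
by_equality <- countries[countries == keep]

# %in% asks, for each element, whether it appears anywhere in `keep`
by_membership <- countries[countries %in% keep]

print(length(by_equality))
print(length(by_membership))
```

With `==`, rows are dropped depending on their position in the data, which would undercount players for every country in the plot.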

As expected, most of the professional footballers come from European countries, followed by South American countries. Only one Asian country, Japan, makes the top 10 list. Although there are many African professional footballers, African countries do not dominate the top 10 list in FIFA 20.