FIFA 20: Data Visualization and Key Insights

(Image: EA Sports)

1. Overview

Football is a popular sport played and enjoyed by many. The popularity of this sport is not only limited to international games but it also extends to the video gaming world. FIFA is a football simulation video game series released annually by Electronic Arts under the EA Sports label. The game series is a great success, being one of the best-selling video game franchises of all time.

This DataViz uses the statistical information of the football players featured in FIFA 20, the edition updated for the season that starts in 2019 and ends in 2020. The dataset is available on Kaggle and contains information on 18K+ players with 70+ attributes. It is scraped from the website SoFiFa.

This DataViz combines the author’s love for football and data science in a short exploratory analysis of the FIFA 20 dataset using R.

2. Major Data and Design Challenges

2.1. Challenge 1: Visualizing Monetary Values of Players

In the dataset, the monetary values of players are given in Euro (€) and abbreviated in thousands (K) or millions (M), e.g., €95.5M. These values are stored as text rather than numbers, so they cannot be analyzed or visualized directly.

2.2. Challenge 2: Visualizing Various Playing Positions of Players

There are a total of 15 different playing positions in the dataset. To gain a better overview and more meaningful insights, positions with similar characteristics should be grouped together, e.g., LB (left-back), RB (right-back) and CB (center-back) are all defending positions in the game.

3. Suggestions to Overcome Challenges

3.1. Suggestion 1: Visualizing Monetary Values of Players

Convert the value strings to actual numbers. Remove the “€” sign, then multiply by 1,000 for the “K” notation or by 1,000,000 for the “M” notation.

However, doing so would cause another challenge for visualization as the actual converted values will have many zeros, e.g., €95.5M will become 95,500,000. To overcome this, create value brackets with labels, e.g., the values within the range 90,000,000 to 100,000,000 will be grouped with the value label 90–100M.
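As a rough illustration of the conversion step, the sketch below parses three value strings in base R. The first two strings appear in the dataset; “€500K” and all variable names are made up for illustration, and this is only one possible way to do the conversion.

```r
# Sample value strings; "\u20ac" is the euro sign, written as an escape
# so the snippet does not depend on the file encoding.
values <- c("\u20ac95.5M", "\u20ac58.5M", "\u20ac500K")

# Multiplier for each suffix (hypothetical lookup vector)
multipliers <- c(K = 1e3, M = 1e6)

# The suffix is the last character of each string
suffix <- substring(values, nchar(values))

# Keep only the digits and the decimal point, then scale by the suffix
amount <- as.numeric(gsub("[^0-9.]", "", values)) * unname(multipliers[suffix])

print(amount)
```

This sketch assumes every string ends in K or M; a value without a suffix would need a default multiplier of 1.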

3.2. Suggestion 2: Visualizing Various Playing Positions of Players

The different player positions in the dataset are CAM, CB, CDM, CF, CM, GK, LB, LM, LW, LWB, RB, RM, RW, RWB, ST. Classify these specific positions into four more general roles, namely Striker, Midfielder, Defender, and Goalkeeper, according to the table below.

Positions	Role
CF, ST	Striker
LW, LM, CDM, CM, CAM, RM, RW	Midfielder
LWB, LB, CB, RB, RWB	Defender
GK	Goalkeeper

3.3. Sketch of Proposed DataViz Design

4. Data Preparation Steps

4.1. Load R packages and Data

First, load the necessary R packages in RStudio.

tidyverse contains a set of essential packages for data manipulation and exploration.
kableExtra to build common complex tables and manipulate table styles.
gridExtra to arrange multiple grid-based plots on a page.

packages <- c('tidyverse', 'kableExtra', 'gridExtra')

# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Load the data.

df <- read_csv('data/fifa20_data.csv')

4.2. Data Wrangling

For visualization, we will use the player attributes such as Name, Age, Country, Overall, Club, Best Position and Value.

df <- tibble::as_tibble(df) %>% 
  select(ID, Name, Age, Country, Overall, Club, BP, Value) %>%
  rename(`Best Position` = BP)

ID	Name	Age	Country	Overall	Club	Best Position	Value
158023	Lionel Messi	32	Argentina	94	FC Barcelona	CAM	€95.5M
20801	C. Ronaldo dos Santos Aveiro	34	Portugal	93	Juventus	ST	€58.5M
190871	Neymar da Silva Santos Jr.	27	Brazil	92	Paris Saint-Germain	CAM	€105.5M
200389	Jan Oblak	26	Slovenia	91	Atlético Madrid	GK	€77.5M
192985	Kevin De Bruyne	28	Belgium	91	Manchester City	CAM	€90M
183277	Eden Hazard	28	Belgium	91	Real Madrid	CAM	€90M
209331	Mohamed Salah	27	Egypt	90	Liverpool	LW	€80.5M
203376	Virgil van Dijk	27	Netherlands	90	Liverpool	CB	€78M
192448	Marc-André ter Stegen	27	Germany	90	FC Barcelona	GK	€67.5M
177003	Luka Modrić	33	Croatia	90	Real Madrid	CM	€45M

4.2.1. Creating Value Brackets

We will first convert the Value column to actual currency values. We will use the function below to remove the “€” sign from the column values and apply the appropriate currency conversion either to thousands (K) or millions (M).

toNumberCurrency <- function(vector) {
  # Strip the euro sign and any thousands separators
  vector <- gsub("(€|,)", "", as.character(vector))

  # Pick the multiplier from the K/M suffix (1 when there is no suffix)
  multiplier <- ifelse(grepl("M", vector, fixed = TRUE), 1000000,
                       ifelse(grepl("K", vector, fixed = TRUE), 1000, 1))

  # Drop the suffix before converting, which avoids NA coercion warnings
  result <- as.numeric(gsub("[KM]", "", vector)) * multiplier

  return(result)
}

df$Value <- toNumberCurrency(df$Value)

Next, we will bin the converted player values and name the new column as Value Brackets with the labels: 0–10M, 10–20M, 20–30M, 30–40M, 40–50M, 50–60M, 60–70M, 70–80M, 80–90M, 90–100M, 100M+.

# Create value brackets: breaks at 0, 10M, 20M, ..., 100M, then Inf
value_breaks <- c(seq(0, 100000000, by = 10000000), Inf)
value_labels <- c("0–10M", "10–20M", "20–30M", "30–40M", "40–50M","50–60M", "60–70M", "70–80M", "80–90M","90–100M","100M+")

# cut() uses right-closed intervals by default, e.g., (80M, 90M]
df <- mutate(df, `Value Brackets` = cut(x = Value, breaks = value_breaks,
                                        labels = value_labels,
                                        include.lowest = TRUE))

Check and compare the actual Value with the newly created Value Brackets.

ID	Name	Age	Country	Overall	Club	Value	Value Brackets
158023	Lionel Messi	32	Argentina	94	FC Barcelona	95500000	90–100M
20801	C. Ronaldo dos Santos Aveiro	34	Portugal	93	Juventus	58500000	50–60M
190871	Neymar da Silva Santos Jr.	27	Brazil	92	Paris Saint-Germain	105500000	100M+
200389	Jan Oblak	26	Slovenia	91	Atlético Madrid	77500000	70–80M
192985	Kevin De Bruyne	28	Belgium	91	Manchester City	90000000	80–90M
183277	Eden Hazard	28	Belgium	91	Real Madrid	90000000	80–90M
209331	Mohamed Salah	27	Egypt	90	Liverpool	80500000	80–90M
203376	Virgil van Dijk	27	Netherlands	90	Liverpool	78000000	70–80M
192448	Marc-André ter Stegen	27	Germany	90	FC Barcelona	67500000	60–70M
177003	Luka Modrić	33	Croatia	90	Real Madrid	45000000	40–50M
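The table shows players valued at exactly 90,000,000 in the 80–90M bracket. This follows from how cut() treats interval boundaries: intervals are right-closed by default, so a value sitting on a break belongs to the lower bracket. The sketch below reproduces this on three check values (the variable names are illustrative) and shows the `right = FALSE` alternative for comparison.

```r
# Breaks at 0, 10M, ..., 100M, Inf and the matching 11 labels
value_breaks <- c(seq(0, 1e8, by = 1e7), Inf)
lower <- seq(0, 90, by = 10)
value_labels <- c(paste0(lower, "\u2013", lower + 10, "M"), "100M+")

# Check values: one on a break, one inside a bracket, and zero
check_values <- c(90000000, 95500000, 0)

# Default: right-closed intervals, e.g., (80M, 90M]
right_closed <- cut(check_values, breaks = value_breaks,
                    labels = value_labels, include.lowest = TRUE)

# Alternative: left-closed intervals, e.g., [90M, 100M)
left_closed <- cut(check_values, breaks = value_breaks,
                   labels = value_labels, include.lowest = TRUE,
                   right = FALSE)

print(data.frame(check_values, right_closed, left_closed))
```

Either convention works as long as it is applied consistently; the choice only affects values that fall exactly on a break.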

4.2.2. Classifying Player Positions

Based on the table in Suggestion 2, we will create another column that will classify the specific player positions into general playing roles using the code chunk below.

x <- as.factor(df$`Best Position`)
levels(x) <- list(Striker = c("CF", "ST"),
                  Midfielder = c("LW","LM","CDM","CM","CAM","RM","RW"),
                  Defender = c("LWB", "LB", "CB", "RB", "RWB"), 
                  Goalkeeper  = c("GK")
                  )
df <- mutate(df, Role = x)

Check each player’s Best Position and the general Role.

ID	Name	Age	Country	Overall	Club	Best Position	Role
158023	Lionel Messi	32	Argentina	94	FC Barcelona	CAM	Midfielder
20801	C. Ronaldo dos Santos Aveiro	34	Portugal	93	Juventus	ST	Striker
190871	Neymar da Silva Santos Jr.	27	Brazil	92	Paris Saint-Germain	CAM	Midfielder
200389	Jan Oblak	26	Slovenia	91	Atlético Madrid	GK	Goalkeeper
192985	Kevin De Bruyne	28	Belgium	91	Manchester City	CAM	Midfielder
183277	Eden Hazard	28	Belgium	91	Real Madrid	CAM	Midfielder
209331	Mohamed Salah	27	Egypt	90	Liverpool	LW	Midfielder
203376	Virgil van Dijk	27	Netherlands	90	Liverpool	CB	Defender
192448	Marc-André ter Stegen	27	Germany	90	FC Barcelona	GK	Goalkeeper
177003	Luka Modrić	33	Croatia	90	Real Madrid	CM	Midfielder
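When levels are reassigned with a named list, any position that is missing from the list becomes NA rather than raising an error, so it can be worth checking for unmapped positions. The sketch below uses a small made-up vector; “SW” is a hypothetical position that is not in the mapping.

```r
# Made-up sample of positions; "SW" is deliberately absent from the mapping
positions <- factor(c("ST", "CB", "GK", "CM", "SW"))

role <- positions
levels(role) <- list(Striker = c("CF", "ST"),
                     Midfielder = c("LW", "LM", "CDM", "CM", "CAM", "RM", "RW"),
                     Defender = c("LWB", "LB", "CB", "RB", "RWB"),
                     Goalkeeper = c("GK"))

# Positions that did not match any role end up as NA
unmapped <- unique(as.character(positions[is.na(role)]))

print(unmapped)
```

An empty `unmapped` vector would indicate that every position in the data has been assigned a role.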

5. Final Visualization Steps and Insights

The data is now ready for visualization, and we will conduct exploratory data analysis (EDA) using the ggplot() function of the ggplot2 package. For each visualization, a short description and any useful insight (where applicable) are provided in this section.

5.1. Distribution by General Player Roles

First off, visualize the distribution of players based on their general playing roles.

ggplot(df, aes(Role)) + 
  geom_bar(col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Players based on General Playing Roles") + 
  theme_minimal() + 
  theme(legend.position = 'none')

We see that Midfielders are the most numerous, followed by Defenders, Strikers, and finally Goalkeepers.

5.2. Distribution by Player Positions

Next, visualize the distribution of players by their best positions.

ggplot(df, aes(`Best Position`)) + 
  geom_bar(col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Players based on Best Positions") + 
  theme_minimal() +
  theme(legend.position = 'none')

Based on the previous observation, we might have expected a specific Midfielder position to have the highest count. Surprisingly, CB (center-back) has the highest count, followed by ST (striker). This is consistent with the Midfielder role pooling seven separate positions.

5.3. Distribution by Age

Following that, plot the distribution of players by age.

g_age <- ggplot(data = df, aes(Age))
g_age + 
  geom_histogram(binwidth = 1, col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution based on Age") + 
  theme_minimal() +
  theme(legend.position = 'none')

We see that a large number of players are between 20 and 27 years of age.

5.4. Comparison between Age and Playing Roles

The following plot shows the relation between the age of the players and their general playing role.

g_age + 
  geom_density(col = "orange", aes(fill = Role), alpha = 0.5) +
  facet_grid(.~Role) + 
  ggtitle("Distribution based on Age and Role") + 
  theme_light() +
  theme(legend.position = 'none')

5.5. Distribution by Overall Rating

g_overall <- ggplot(data = df, aes(Overall))
g_overall + 
  geom_histogram(binwidth = 2, col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution based on Overall Rating") + 
  theme_minimal() +
  theme(legend.position = 'none')

From the visualization above, we see that the distribution peaks at an overall rating of around 65.

5.6. Distribution by Player Value

We will plot the players against their values. Examining the dataset, a very large number of players are valued below 50M. Their counts would dwarf those of the higher brackets and flatten the rest of the chart. Hence, we leave them out and display only the players valued above 50M, i.e., the brackets from 50–60M to 100M+.

moreThan50M <- filter(df, Value > 50000000)

ggplot(moreThan50M, aes(x = `Value Brackets`)) + 
  geom_bar(col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Value between 50M–100M+") + 
  theme_minimal() +
  theme(legend.position = 'none')

5.7. Age vs Overall Rating of Players divided amongst Value Brackets

g_age_overall <- ggplot(df, aes(Age, Overall))
g_age_overall + 
  geom_point(aes(color = `Value Brackets`)) + 
  geom_smooth(color = "darkblue") + 
  ggtitle("Distribution between Age and Overall Rating of players based on Value bracket") + 
  theme_minimal()

We see that the high valuations are dominated by players with an overall rating of 85+ and an age between 23 and 33 years.

5.8. Distribution of Player Positions by Value Brackets

The visualization below shows the player valuation based on their best playing positions.

gf1 <- filter(df, Value <= 30000000)
g1 <- ggplot(gf1, aes(`Best Position`)) + 
  geom_bar(aes(fill = `Value Brackets`)) + 
  ggtitle("Position based on Value (0–30M)") + 
  theme_minimal()
  
gf2 <- filter(df,Value > 30000000)
g2 <- ggplot(gf2, aes(`Best Position`)) + 
  geom_bar(aes(fill = `Value Brackets`)) + 
  ggtitle("Position based on Value (30M+)") + 
  theme_minimal()
  
grid.arrange(g1, g2, ncol=1)

We see that the most valuable footballers (valued at 80M+) play in forward positions: CAM, LW, RW and ST. This is as expected, since most of the top football stars are attacking midfielders and strikers.

5.9. Top 10 Clubs by Players’ Value

We will also plot the 10 most valuable clubs using the code chunk below. The club value is calculated by summing the valuations of all players in each club.

group_clubs <- group_by(df, Club)

club_value <- summarise(group_clubs, `Total Value` = sum(Value))

top_10_valuable_clubs <- top_n(club_value, 10, `Total Value`)

top_10_valuable_clubs$Club <- as.factor(top_10_valuable_clubs$Club)
  
ggplot(top_10_valuable_clubs, aes(x = reorder(Club, `Total Value`), y = `Total Value`)) + 
  labs(x = 'Club') +
  geom_bar(stat = "identity", col = "orange", aes(fill = `Total Value`)) + 
  coord_flip() + 
  scale_y_continuous(labels = scales::unit_format(unit = "M", scale = 1e-6)) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Top 10 Valuable Clubs") + 
  theme_minimal() +
  theme(legend.position = 'none')

5.10. Top 10 Countries by Number of Players

And finally, we will plot the top 10 countries with the highest number of players in FIFA 20.

countries_count <- count(df, Country)

top10_countries <- top_n(countries_count, 10, n)

top10_country_names <- top10_countries$Country
  
country <- filter(df, Country %in% top10_country_names)

ggplot(country, aes(x=reorder(Country, Country,
                              function(x)-length(x)))) + 
  labs(x = 'Country') +
  geom_bar(col = "orange", aes(fill = after_stat(count))) + 
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Top 10 Countries with the Most Players") + 
  theme_minimal() +
  theme(legend.position = 'none')
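One thing to watch when filtering rows against a vector of names: `==` compares element by element and recycles the shorter vector, whereas `%in%` tests membership. The base R sketch below shows the difference on made-up data; the country vector and variable names are purely illustrative.

```r
# Made-up data: four players from two countries
country <- c("Spain", "Spain", "Brazil", "Brazil")
top <- c("Spain", "Brazil")

# `==` recycles `top` along `country`, so a row is kept only when its
# country happens to line up with the recycled position
kept_with_equals <- country[country == top]

# `%in%` keeps every row whose country appears anywhere in `top`
kept_with_in <- country[country %in% top]

print(length(kept_with_equals))
print(length(kept_with_in))
```

With `==`, rows are dropped silently, so the resulting counts per country can be far lower than the true counts without any error or warning.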

As expected, the majority of the pro footballers are from European countries, followed by South American countries. We see that only one Asian country, Japan, has made the top 10 list. Although there are many African pro football players, African countries do not dominate the top 10 list in FIFA 20.