Football is a popular sport played and enjoyed by many. Its popularity is not limited to international games; it also extends to the video gaming world. FIFA is a football simulation video game series released annually by Electronic Arts under the EA Sports label. The series is a great success and one of the best-selling video game franchises of all time.
This DataViz uses the statistics of the football players featured in FIFA 20, which is updated for the season that starts in 2019 and ends in 2020. The dataset is available on Kaggle and contains information on 18K+ players with 70+ attributes. It was scraped from the website SoFIFA.
This DataViz combines the author’s love for football and data science in a short exploratory analysis of the FIFA 20 dataset using R.
In the dataset, the monetary values of players are given in euros (€) and abbreviated in thousands (K) or millions (M), e.g., €95.5M. These values are hard to analyze and visualize because they are stored as text rather than as numbers.
There are 15 different playing positions in the dataset. To gain a better overview and more meaningful insights, positions with similar characteristics should be grouped together; e.g., LB (left-back), RB (right-back) and CB (center-back) are all defending positions in the game.
Convert the described values to actual currency values: remove the “€” sign and expand the “K” and “M” notations into thousands and millions, so that the values become plain numbers.
However, doing so would cause another challenge for visualization as the actual converted values will have many zeros, e.g., €95.5M will become 95,500,000. To overcome this, create value brackets with labels, e.g., the values within the range 90,000,000 to 100,000,000 will be grouped with the value label 90–100M.
The player positions in the dataset are CAM, CB, CDM, CF, CM, GK, LB, LM, LW, LWB, RB, RM, RW, RWB, ST. Classify these specific positions into four more general roles, namely Striker, Midfielder, Defender, and Goalkeeper, according to the table below.
| Positions | Role |
|---|---|
| CF, ST | Striker |
| LW, LM, CDM, CM, CAM, RM, RW | Midfielder |
| LWB, LB, CB, RB, RWB | Defender |
| GK | Goalkeeper |
First, load the necessary R packages in RStudio.
packages <- c('tidyverse', 'kableExtra', 'gridExtra')
for (p in packages) {
  # Install the package only if it is not available yet
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Load the data.
df <- read_csv('data/fifa20_data.csv')
For visualization, we will use the player attributes such as Name, Age, Country, Overall, Club, Best Position and Value.
df <- tibble::as_tibble(df) %>%
  select(ID, Name, Age, Country, Overall, Club, BP, Value) %>%
  rename(`Best Position` = BP)
| ID | Name | Age | Country | Overall | Club | Best Position | Value |
|---|---|---|---|---|---|---|---|
| 158023 | Lionel Messi | 32 | Argentina | 94 | FC Barcelona | CAM | €95.5M |
| 20801 | C. Ronaldo dos Santos Aveiro | 34 | Portugal | 93 | Juventus | ST | €58.5M |
| 190871 | Neymar da Silva Santos Jr. | 27 | Brazil | 92 | Paris Saint-Germain | CAM | €105.5M |
| 200389 | Jan Oblak | 26 | Slovenia | 91 | Atlético Madrid | GK | €77.5M |
| 192985 | Kevin De Bruyne | 28 | Belgium | 91 | Manchester City | CAM | €90M |
| 183277 | Eden Hazard | 28 | Belgium | 91 | Real Madrid | CAM | €90M |
| 209331 | Mohamed Salah | 27 | Egypt | 90 | Liverpool | LW | €80.5M |
| 203376 | Virgil van Dijk | 27 | Netherlands | 90 | Liverpool | CB | €78M |
| 192448 | Marc-André ter Stegen | 27 | Germany | 90 | FC Barcelona | GK | €67.5M |
| 177003 | Luka Modrić | 33 | Croatia | 90 | Real Madrid | CM | €45M |
We will first convert the Value column to actual currency values. We will use the function below to remove the “€” sign from the column values and apply the appropriate currency conversion either to thousands (K) or millions (M).
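toNumberCurrency <- function(vector) {
  # Strip the euro sign and any thousands separators
  vector <- gsub("(€|,)", "", as.character(vector))
  # Entries with a K/M suffix become NA here and are filled in below,
  # so the coercion warning is expected and suppressed
  result <- suppressWarnings(as.numeric(vector))
  # Thousands: drop the "K" and multiply by 1,000
  k_positions <- grep("K", vector)
  result[k_positions] <- as.numeric(gsub("K", "", vector[k_positions])) * 1000
  # Millions: drop the "M" and multiply by 1,000,000
  m_positions <- grep("M", vector)
  result[m_positions] <- as.numeric(gsub("M", "", vector[m_positions])) * 1000000
  return(result)
}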
df$Value <- toNumberCurrency(df$Value)
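As a quick sanity check of the conversion idea, here is a minimal sketch of a more compact parser applied to a few sample strings. The helper name `parse_euro` and the sample values are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper (not from the original post): turn strings such as
# "€95.5M", "€750K" or "€0" into plain numbers.
parse_euro <- function(x) {
  x <- gsub("€", "", as.character(x), fixed = TRUE)
  x <- gsub(",", "", x, fixed = TRUE)
  # Multiplier per suffix; values without a suffix are taken at face value
  multiplier <- ifelse(grepl("M$", x), 1e6, ifelse(grepl("K$", x), 1e3, 1))
  as.numeric(sub("[KM]$", "", x)) * multiplier
}

# Made-up sample values in the same notation as the Value column
sample_values <- c("€95.5M", "€750K", "€0")
parsed <- parse_euro(sample_values)
print(parsed)
```

Computing the multiplier first means the numeric conversion only ever sees digits, so no coercion warnings are raised.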
Next, we will bin the converted player values and name the new column as Value Brackets with the labels: 0–10M, 10–20M, 20–30M, 30–40M, 40–50M, 50–60M, 60–70M, 70–80M, 80–90M, 90–100M, 100M+.
# Create value brackets
value_breaks <- c(0, 10000000, 20000000, 30000000, 40000000, 50000000, 60000000, 70000000, 80000000, 90000000, 100000000, Inf)
value_labels <- c("0–10M", "10–20M", "20–30M", "30–40M", "40–50M","50–60M", "60–70M", "70–80M", "80–90M","90–100M","100M+")
# Bin the numeric values and store the result as a new column
df <- mutate(df, `Value Brackets` = cut(x = Value, breaks = value_breaks,
                                        labels = value_labels,
                                        include.lowest = TRUE))
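Check and compare the actual Value with the newly created Value Brackets. Note that cut() closes each interval on the right by default, so a value that sits exactly on a boundary, such as 90,000,000, falls into the lower bracket (80–90M).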
| ID | Name | Age | Country | Overall | Club | Value | Value Brackets |
|---|---|---|---|---|---|---|---|
| 158023 | Lionel Messi | 32 | Argentina | 94 | FC Barcelona | 95500000 | 90–100M |
| 20801 | C. Ronaldo dos Santos Aveiro | 34 | Portugal | 93 | Juventus | 58500000 | 50–60M |
| 190871 | Neymar da Silva Santos Jr. | 27 | Brazil | 92 | Paris Saint-Germain | 105500000 | 100M+ |
| 200389 | Jan Oblak | 26 | Slovenia | 91 | Atlético Madrid | 77500000 | 70–80M |
| 192985 | Kevin De Bruyne | 28 | Belgium | 91 | Manchester City | 90000000 | 80–90M |
| 183277 | Eden Hazard | 28 | Belgium | 91 | Real Madrid | 90000000 | 80–90M |
| 209331 | Mohamed Salah | 27 | Egypt | 90 | Liverpool | 80500000 | 80–90M |
| 203376 | Virgil van Dijk | 27 | Netherlands | 90 | Liverpool | 78000000 | 70–80M |
| 192448 | Marc-André ter Stegen | 27 | Germany | 90 | FC Barcelona | 67500000 | 60–70M |
| 177003 | Luka Modrić | 33 | Croatia | 90 | Real Madrid | 45000000 | 40–50M |
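The table shows 90,000,000 landing in the 80–90M bracket. The sketch below, which assumes only the top three brackets and three values taken from the table, illustrates how the `right` argument of `cut()` decides which side of a boundary a value falls on.

```r
# Top end of the breaks and labels used for the value brackets
breaks <- c(80e6, 90e6, 100e6, Inf)
labels <- c("80–90M", "90–100M", "100M+")
values <- c(90000000, 95500000, 105500000)

# Default (right = TRUE): intervals are (80M, 90M], (90M, 100M], (100M, Inf]
right_closed <- cut(values, breaks = breaks, labels = labels,
                    include.lowest = TRUE)

# right = FALSE: intervals are [80M, 90M), [90M, 100M), [100M, Inf]
left_closed <- cut(values, breaks = breaks, labels = labels,
                   include.lowest = TRUE, right = FALSE)

print(data.frame(values, right_closed, left_closed))
```

Whether a player valued at exactly 90M belongs to 80–90M or 90–100M is a presentation choice; passing `right = FALSE` would move such boundary values into the upper bracket.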
Based on the position-to-role table above, we will create another column that classifies the specific player positions into general playing roles using the code chunk below.
x <- as.factor(df$`Best Position`)
# Collapse the 15 position levels into the four general roles
levels(x) <- list(Striker = c("CF", "ST"),
                  Midfielder = c("LW", "LM", "CDM", "CM", "CAM", "RM", "RW"),
                  Defender = c("LWB", "LB", "CB", "RB", "RWB"),
                  Goalkeeper = c("GK"))
df <- mutate(df, Role = x)
Check each player’s Best Position and the general Role.
| ID | Name | Age | Country | Overall | Club | Best Position | Role |
|---|---|---|---|---|---|---|---|
| 158023 | Lionel Messi | 32 | Argentina | 94 | FC Barcelona | CAM | Midfielder |
| 20801 | C. Ronaldo dos Santos Aveiro | 34 | Portugal | 93 | Juventus | ST | Striker |
| 190871 | Neymar da Silva Santos Jr. | 27 | Brazil | 92 | Paris Saint-Germain | CAM | Midfielder |
| 200389 | Jan Oblak | 26 | Slovenia | 91 | Atlético Madrid | GK | Goalkeeper |
| 192985 | Kevin De Bruyne | 28 | Belgium | 91 | Manchester City | CAM | Midfielder |
| 183277 | Eden Hazard | 28 | Belgium | 91 | Real Madrid | CAM | Midfielder |
| 209331 | Mohamed Salah | 27 | Egypt | 90 | Liverpool | LW | Midfielder |
| 203376 | Virgil van Dijk | 27 | Netherlands | 90 | Liverpool | CB | Defender |
| 192448 | Marc-André ter Stegen | 27 | Germany | 90 | FC Barcelona | GK | Goalkeeper |
| 177003 | Luka Modrić | 33 | Croatia | 90 | Real Madrid | CM | Midfielder |
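The same grouping can also be written as a plain lookup vector, which makes it easy to spot positions that have no role assigned. This is a minimal sketch; the names `role_lookup` and `positions` and the position code "SW" are made up for illustration.

```r
# Lookup from specific position to general role (same grouping as the table)
role_lookup <- c(
  CF = "Striker", ST = "Striker",
  LW = "Midfielder", LM = "Midfielder", CDM = "Midfielder", CM = "Midfielder",
  CAM = "Midfielder", RM = "Midfielder", RW = "Midfielder",
  LWB = "Defender", LB = "Defender", CB = "Defender", RB = "Defender",
  RWB = "Defender",
  GK = "Goalkeeper"
)

# Illustrative positions; "SW" is a made-up code that is not in the lookup
positions <- c("CAM", "ST", "GK", "CB", "SW")
roles <- unname(role_lookup[positions])

# Positions without a role come back as NA, so they are easy to list
unmapped <- unique(positions[is.na(roles)])
print(roles)
print(unmapped)
```

Running such a check before plotting helps confirm that every position in the data was assigned a role.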
The data is now ready for visualization, and we will conduct exploratory data analysis (EDA) using the ggplot() function of the ggplot2 package. Each visualization comes with a short description and, where applicable, any useful insight.
First off, visualize the distribution of players based on their general playing roles.
ggplot(df, aes(Role)) +
  # col is a fixed outline colour, so it is set outside aes()
  geom_bar(col = "orange", aes(fill = after_stat(count))) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Players based on General Playing Roles") +
  theme_minimal() +
  theme(legend.position = 'none')
We see that Midfielders are the most numerous, followed by Defenders, Strikers, and finally Goalkeepers.
Next, visualize the distribution of players by their best positions.
ggplot(df, aes(`Best Position`)) +
  # col is a fixed outline colour, so it is set outside aes()
  geom_bar(col = "orange", aes(fill = after_stat(count))) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Players based on Best Positions") +
  theme_minimal() +
  theme(legend.position = 'none')
Based on the previous observation, we would have expected one of the Midfielder positions to have the highest count. Surprisingly, the number of CB (center-back defender) is the highest, followed by the number of ST (striker)!
Following that, plot the distribution of players based on the age.
g_age <- ggplot(data = df, aes(Age))
g_age +
  geom_histogram(binwidth = 1, col = "orange", aes(fill = after_stat(count))) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution based on Age") +
  theme_minimal() +
  theme(legend.position = 'none')
We see that there is a high number of players between 20 and 27 years of age.
The following plot shows the relation between the age of the players and their general playing role.
g_age +
  geom_density(col = "orange", aes(fill = Role), alpha = 0.5) +
  facet_grid(. ~ Role) +
  ggtitle("Distribution based on Age and Role") +
  theme_light() +
  theme(legend.position = 'none')
g_overall <- ggplot(data = df, aes(Overall))
g_overall +
  geom_histogram(binwidth = 2, col = "orange", aes(fill = after_stat(count))) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution based on Overall Rating") +
  theme_minimal() +
  theme(legend.position = 'none')
From the visualization above, we see that the majority of players have an overall rating of around 65.
Next, we plot the number of players per value bracket. The vast majority of players are valued below 50M, and their counts would dwarf the bars of the higher brackets. Hence, we leave them out and display only the players valued above 50M, up to the 100M+ bracket.
moreThan50M <- filter(df, Value > 50000000)
ggplot(moreThan50M, aes(x = `Value Brackets`)) +
  # col is a fixed outline colour, so it is set outside aes()
  geom_bar(col = "orange", aes(fill = after_stat(count))) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Distribution of Value between 50M–100M+") +
  theme_minimal() +
  theme(legend.position = 'none')
g_age_overall <- ggplot(df, aes(Age, Overall))
g_age_overall +
  geom_point(aes(color = `Value Brackets`)) +
  geom_smooth(color = "darkblue") +
  ggtitle("Distribution between Age and Overall Rating of players based on Value bracket") +
  theme_minimal()
We see that the high valuations are dominated by players with an overall rating of 85+ who are between 23 and 33 years old.
The visualization below shows the player valuation based on their best playing positions.
gf1 <- filter(df, Value <= 30000000)
g1 <- ggplot(gf1, aes(`Best Position`)) +
  geom_bar(aes(fill = `Value Brackets`)) +
  ggtitle("Position based on Value (0–30M)") +
  theme_minimal()
gf2 <- filter(df, Value > 30000000)
g2 <- ggplot(gf2, aes(`Best Position`)) +
  geom_bar(aes(fill = `Value Brackets`)) +
  ggtitle("Position based on Value (30M+)") +
  theme_minimal()
# Stack the two plots vertically
grid.arrange(g1, g2, ncol = 1)
We see that the most valuable footballers (valued at 80M+) play in attacking positions: CAM, LW, RW and ST. This is as expected, since most of the top football stars are attacking midfielders and strikers!
We will also plot the top 10 valuable clubs using the code chunk below. The club value is calculated by summing up the player valuation for each club.
group_clubs <- group_by(df, Club)
club_value <- summarise(group_clubs, `Total Value` = sum(Value))
top_10_valuable_clubs <- top_n(club_value, 10, `Total Value`)
top_10_valuable_clubs$Club <-as.factor(top_10_valuable_clubs$Club)
ggplot(top_10_valuable_clubs, aes(x = reorder(Club, `Total Value`), y = `Total Value`)) +
  labs(x = 'Club') +
  # col is a fixed outline colour, so it is set outside aes()
  geom_bar(stat = "identity", col = "orange", aes(fill = `Total Value`)) +
  coord_flip() +
  scale_y_continuous(labels = scales::unit_format(unit = "M", scale = 1e-6)) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Top 10 Valuable Clubs") +
  theme_minimal() +
  theme(legend.position = 'none')
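The club total is simply a grouped sum followed by a ranking. As a cross-check of that logic, here is a minimal base R sketch on made-up data; the data frame `toy`, its club names and its values are assumptions for illustration only.

```r
# Toy data standing in for the Club and Value columns (values are made up)
toy <- data.frame(
  Club  = c("A", "A", "B", "B", "C"),
  Value = c(10e6, 5e6, 20e6, 1e6, 8e6)
)

# Sum player values per club, then keep the two largest totals
club_totals <- aggregate(Value ~ Club, data = toy, FUN = sum)
club_totals <- club_totals[order(-club_totals$Value), ]
top_clubs <- head(club_totals, 2)
print(top_clubs)
```

On real data, a club with only a few listed players would rank low on this measure regardless of how valuable those players are, since the total depends on squad size as well as on individual valuations.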
And finally, we will plot the top 10 countries with the highest number of players in FIFA 20.
countries_count <- count(df, Country)
top10_countries <- top_n(countries_count, 10, n)
top10_country_names <- top10_countries$Country
# %in% tests membership; == would recycle the ten names and silently drop most rows
country <- filter(df, Country %in% top10_country_names)
ggplot(country, aes(x = reorder(Country, Country,
                                function(x) -length(x)))) +
  labs(x = 'Country') +
  geom_bar(col = "orange", aes(fill = after_stat(count))) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Top 10 Countries with the Most Players") +
  theme_minimal() +
  theme(legend.position = 'none')
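When rows are kept based on a vector of several names, the comparison operator matters. The sketch below uses made-up country vectors to show how `==` and `%in%` differ; the variable names are assumptions for illustration.

```r
# Made-up data: six players and a two-element list of countries to keep
country <- c("England", "Japan", "Brazil", "Japan", "England", "Peru")
top_names <- c("Japan", "England")

# `==` compares element by element, recycling top_names along country
recycled <- country == top_names
# `%in%` asks, for each element of country, whether it appears in top_names
member <- country %in% top_names

print(sum(recycled))
print(sum(member))
```

With `==`, a row is kept only when its country happens to line up with the recycled name at the same position, so the player counts per country would be understated. `%in%` keeps every row whose country is in the list.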
As expected, the majority of the pro footballers come from European countries, followed by South American countries. Only one Asian country, Japan, has made the top 10 list. Although there are many African pro football players, African countries still do not dominate the top 10 list in FIFA 20.