When it comes to soccer, there are never-ending debates about who the best players, nations, leagues, and teams in the world are. The structure of the professional leagues, the unpredictability of international competitions, and the voting systems for individual awards ensure these debates will always happen. Opinions vary greatly on many of these issues, and only a handful of competitions try to rank the top teams in the world. Using data from EA Sports FIFA 20, we can gain some insight into these questions. This analysis looks at the top player-producing nations and which of them make up the top leagues, how physical attributes relate to the quality of a player, and the top clubs and leagues in the world.
This data is collected from EA Sports FIFA 20. The data for FIFA 20 and several other editions of the game can be found at https://www.kaggle.com/stefanoleone992/fifa-21-complete-player-dataset?select=players_20.csv. Looking through the data, I found that many of the 106 columns hold information that supports gameplay but is not very useful for visualization. I therefore removed multiple columns and added an index to help keep track of players. A summary of the cleaned-up version of the data can be found in the Appendix.
One of the first things to look at, to better understand the dataset and the overall landscape of professional soccer players, is the world map. This map shows the number of players from each country, with a heat-map style color legend representing the player count. It can help us understand which nations, and which regions, produce the most professional players in the world. These are interesting points because national-team competition is mostly played within regions such as Europe, Asia, Africa, and the Americas; only the World Cup and Olympics bring nations from different regions together at a high level.
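#packages used throughout: dplyr (pipes, count, filter, arrange), ggplot2 (plots, map_data; map_data also needs the maps package installed), plotly (interactive charts), ggthemes (theme_few)
library(dplyr)
library(ggplot2)
library(plotly)
library(ggthemes)
#dataframe to duplicate fifa data
fifa20_map = fifa20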
fifa20_map$nationality[fifa20_map$nationality=="United States"] = "USA"
fifa20_map$nationality[fifa20_map$nationality=="England"] = "UK"
fifa20_map$nationality[fifa20_map$nationality=="China PR"] = "China"
fifa20_map$nationality[fifa20_map$nationality=="Korea Republic"] = "South Korea"
fifa20_map$nationality[fifa20_map$nationality=="Republic of Ireland"] = "Ireland"
world_map = map_data("world")
names(world_map)[names(world_map) == 'region'] = 'nationality'
numofplayers = world_map %>%
left_join((fifa20_map %>%
dplyr::count(nationality)), by = "nationality")
ggplot(numofplayers, aes(long, lat, group = group))+
geom_polygon(aes(fill = n), color = "black", show.legend = TRUE)+
scale_fill_viridis_c(option = "C")+
theme_void()+
labs(fill = "Number of Players",
title = "Players Per Country")
Using this map, we can easily identify the two regions of the world that produce the most players: Europe and South America. This is not very surprising, as soccer is incredibly popular in both. The reason is also quite evident in the map: South America has Brazil, Argentina, Colombia, Chile, and Uruguay producing players at a high rate, and Europe has England, France, Italy, Germany, Spain, and Portugal doing the same. An interesting thing to note is that England produces the most professionals while being significantly smaller than many of the other nations. Also, keep these top player-producing countries in mind as we look at the nationalities of the top 500 players in the world. Will the distribution of all players match that of the top 500?
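The country-name fixes used for the map can also be written as a single lookup, which makes it easier to see which names are being changed. The sketch below uses base R only; `name_fixes` and the toy `nationality` vector are hypothetical stand-ins for illustration, not part of the original analysis.

```r
# Hypothetical lookup: dataset spelling -> spelling used by the map data
name_fixes <- c(
  "United States"       = "USA",
  "England"             = "UK",
  "China PR"            = "China",
  "Korea Republic"      = "South Korea",
  "Republic of Ireland" = "Ireland"
)

# Toy stand-in for fifa20$nationality (made-up rows, not real data)
nationality <- c("England", "Brazil", "China PR", "England", "United States")

# Replace only the names that appear in the lookup; leave the rest untouched
needs_fix <- nationality %in% names(name_fixes)
recoded <- nationality
recoded[needs_fix] <- unname(name_fixes[nationality[needs_fix]])

# Count players per recoded country, similar to what dplyr::count() returns
players_per_country <- as.data.frame(table(nationality = recoded),
                                     stringsAsFactors = FALSE)
print(players_per_country)
```

One thing to keep in mind with this kind of recoding: any nationality whose spelling still does not match a map region is silently left out of the join, so that country is drawn with no fill. Mapping England to "UK" also means that only English players are counted on the UK polygon if the other home nations are listed separately in the data.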
To better understand the overall dataset, we can look at the preferred foot of all players. With 18483 players in the dataset, I was curious about the distribution of left and right footers. I was expecting significantly more right-footed players.
#barplot preferred foot of all players
p2=barplot(table(fifa20$preferred_foot), ylim = c(0,20000), main = "Preferred Foot Comparison of all Players")
foot = fifa20[,c("short_name", "overall", "preferred_foot")]
head(foot)
##          short_name overall preferred_foot
## 1          L. Messi      94           Left
## 2 Cristiano Ronaldo      93          Right
## 3         Neymar Jr      92          Right
## 4          J. Oblak      91          Right
## 5         E. Hazard      91          Right
## 6      K. De Bruyne      91          Right
Upon inspection of the graph, we can see that there are nearly 3 times as many right-footed players as left-footed players, which puts left footers at roughly a quarter of the dataset. Among the top 6 players listed above, only Messi is left footed, which is 1 of 6, or about 17%.
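Reading shares off a bar chart is error-prone, so it can help to compute them directly. The following is a minimal sketch with a made-up `preferred_foot` vector standing in for `fifa20$preferred_foot`; the numbers it prints describe the toy data only, not the real dataset.

```r
# Toy stand-in for fifa20$preferred_foot (made-up values, not real data)
preferred_foot <- c("Right", "Right", "Left", "Right",
                    "Right", "Left", "Right", "Right")

foot_counts <- table(preferred_foot)      # raw counts per foot
foot_share  <- prop.table(foot_counts)    # counts divided by the total

# Ratio of right footers to left footers
right_to_left <- foot_counts[["Right"]] / foot_counts[["Left"]]

print(round(100 * foot_share, 1))  # percentage of players per foot
print(right_to_left)               # right footers per left footer
```

The ratio and the share are two views of the same counts: a ratio of 3 right footers per left footer corresponds to a left-footed share of 25%.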
It is usually understood that there are 5 top leagues in the world for soccer: the English Premier League, French Ligue 1, German 1. Bundesliga, Italian Serie A, and Spain Primera Division (La Liga). Now that we have seen the number of players each of these nations produces, it will be interesting to find out which nationality makes up the largest share of each league. It is commonly debated whether leagues are biased towards their home nation, and this will give us a better understanding of that. The graph below shows the top nationality in each of the top 5 leagues and the number of players of that nationality.
#dataframe to hold top 5 league and nationality
league_nation = fifa20[, c("league_name", "nationality")]
#updated league_nation to hold only top five leagues and their respective nationality
league_nation = league_nation %>% filter(league_name %in% c("Spain Primera Division", "Italian Serie A", "French Ligue 1", "English Premier League", "German 1. Bundesliga"))
#creating data frame to hold each leagues nationalities
spain = league_nation %>% filter(league_name %in% c("Spain Primera Division"))
italy = league_nation %>% filter(league_name %in% c("Italian Serie A"))
england = league_nation %>% filter(league_name %in% c("English Premier League"))
france = league_nation %>% filter(league_name %in% c("French Ligue 1"))
germany = league_nation %>% filter(league_name %in% c("German 1. Bundesliga"))
#find the mode (most frequent value) of a vector; table() already drops NA values
get_mode = function(x){return(names(sort(table(x), decreasing = TRUE))[1])}
#spain top nationality and count of occurrences
spain.nationality = get_mode(spain$nationality)
spain.count = sum(spain$nationality==spain.nationality)
#italy top nationality and count of occurrences
italy.nationality = get_mode(italy$nationality)
italy.count = sum(italy$nationality==italy.nationality)
#england top nationality and count of occurrences
england.nationality = get_mode(england$nationality)
england.count = sum(england$nationality==england.nationality)
#germany top nationality and count of occurrences
germany.nationality = get_mode(germany$nationality)
germany.count = sum(germany$nationality==germany.nationality)
#france top nationality and count of occurrences
france.nationality = get_mode(france$nationality)
france.count = sum(france$nationality==france.nationality)
#dataframe of top 5 leagues and their top nationality with count
top5_topNationality = data.frame("League" = c("English Premier League", "French Ligue 1", "German 1. Bundesliga", "Italian Serie A", "Spain Primera Division"),
"Nationality" = c(england.nationality, france.nationality, germany.nationality, italy.nationality, spain.nationality),
"n" = c(england.count, france.count, germany.count, italy.count, spain.count))
#changing wording
top5_topNationality$Nationality[top5_topNationality$Nationality=="England"] = "English"
top5_topNationality$Nationality[top5_topNationality$Nationality=="France"] = "French"
top5_topNationality$Nationality[top5_topNationality$Nationality=="Germany"] = "German"
top5_topNationality$Nationality[top5_topNationality$Nationality=="Italy"] = "Italian"
top5_topNationality$Nationality[top5_topNationality$Nationality=="Spain"] = "Spanish"
ggplot(top5_topNationality[1:5,], aes(x = League, y = n, fill=Nationality)) +
geom_bar(color = "white", stat = "identity") +
scale_fill_manual(values = c("English" = "blue4", "French" = "firebrick1", "German" = "black", "Italian" = "springgreen4", "Spanish" = "gold"))+
labs(title = "Top Nationality in Each of the Top 5 Leagues in the World", x = "League", y = "Number of Players")+
theme(plot.title = element_text(hjust = 0.5))+
geom_text(aes(label = Nationality), vjust=1.6, color = "white")+
geom_text(aes(label = n), vjust=5.0, color = "white")
As can be seen above, the bias argument seems to hold, as each league’s top nationality is that of its home nation. Interestingly, La Liga has a much larger number of home-nation players than the other leagues. In addition, it is interesting that England, which produces the most players overall, has the fewest home-nation players of the top 5 leagues. This raises the question of why players stay local more often than not. Also, keep each league’s top nationality in mind as we look at the nationalities of the top 500 players to see which league may hold the best players.
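The per-league mode calculation can also be done in one pass with a cross-tabulation instead of one block per league. This is only a sketch under assumptions: the toy `league_nation` data frame mirrors the column names used in the report but contains made-up leagues and nationalities.

```r
# Toy stand-in for the league_nation data frame (made-up rows, not real data)
league_nation <- data.frame(
  league_name = c("League A", "League A", "League A",
                  "League B", "League B", "League B", "League B"),
  nationality = c("X", "X", "Y", "Y", "Z", "Z", "Z"),
  stringsAsFactors = FALSE
)

# Rows are leagues, columns are nationalities, cells are player counts
counts <- table(league_nation$league_name, league_nation$nationality)

# Column index of the most frequent nationality in each league (row)
top_idx <- apply(counts, 1, which.max)

top_nationality <- data.frame(
  League      = rownames(counts),
  Nationality = colnames(counts)[top_idx],
  n           = counts[cbind(seq_len(nrow(counts)), top_idx)],
  stringsAsFactors = FALSE
)
print(top_nationality)
```

Note that `which.max` returns the first maximum, so a tie between two nationalities would be resolved by alphabetical column order rather than reported as a tie.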
After knowing the top player producing countries and the top nationality per top 5 league, finding out the top nationalities of the top players in the world could help us to understand who the best national teams in the world are. On top of that, it could help to narrow down the top regions of the world as well. This donut chart tells us the top 15 nationalities of the top 500 players in the world with a legend of each country name.
#creating dataframe to hold top 500 players, their overalls, short names, and nationality
top500 = fifa20[, c("Index","short_name", "overall", "nationality")]
top500$overall=as.numeric(top500$overall)
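#Index is assumed to follow the source file's order (highest overall first), so the first 500 rows are the top 500 players
top500 = head(arrange(top500,(Index)), n=500)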
#count of nations occurrence in top 500 players listed in order
s_pain = sum(top500$nationality=="Spain")
f_rance = sum(top500$nationality=="France")
b_razil = sum(top500$nationality=="Brazil")
g_ermany = sum(top500$nationality=="Germany")
i_taly = sum(top500$nationality=="Italy")
p_ortugal = sum(top500$nationality=="Portugal")
a_rgentina = sum(top500$nationality=="Argentina")
e_ngland = sum(top500$nationality=="England")
h_olland = sum(top500$nationality=="Netherlands")
b_elgium = sum(top500$nationality=="Belgium")
u_ruguay = sum(top500$nationality=="Uruguay")
c_roatia = sum(top500$nationality=="Croatia")
s_erbia = sum(top500$nationality=="Serbia")
p_oland = sum(top500$nationality=="Poland")
c_olombia = sum(top500$nationality=="Colombia")
#dataframe containing top 15 nationalities in the top 500 players and the count of nationality
top15_top500 = data.frame("Nationality" = c("Spain", "France", "Brazil", "Germany",
"Italy", "Portugal", "Argentina", "England", "Netherlands", "Belgium",
"Uruguay", "Croatia", "Serbia", "Poland", "Colombia"),
"n" = c(s_pain, f_rance, b_razil, g_ermany,i_taly, p_ortugal,
a_rgentina, e_ngland, h_olland, b_elgium, u_ruguay, c_roatia, s_erbia,
p_oland, c_olombia))
plot_ly(top15_top500, labels = ~Nationality, values = ~n) %>%
add_pie(hole=0.3) %>%
layout(title = "Top 15 Nationalities of the Top 500 Players in the World by Overall")
The above donut chart is very interesting for many reasons. For one, England, the top player-producing country, is 8th on the list. Next, Spain, which has the highest concentration of home-nation players in its own league, is also the number one nation by count among the top 500 players. This means that Spain produces a large number of the top players in the world. It is also important to note that 4 of the 15 countries in the list are from South America and the rest are European. This graph suggests that European and South American players are the cream of the crop at this point in time.
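The list of 15 countries in the chart is typed in by hand, which assumes we already know which nations rank highest. As a rough alternative, the ranking can be derived from the data itself. The sketch below uses a made-up `nationality` vector in place of `top500$nationality` and keeps the top 2 instead of the top 15 to stay short.

```r
# Toy stand-in for top500$nationality (made-up values, not real data)
nationality <- c("Spain", "France", "Spain", "Brazil",
                 "France", "Spain", "Italy")

# Count each nationality and sort from most to least frequent
nat_counts <- sort(table(nationality), decreasing = TRUE)

# Keep the first k entries (k = 15 would match the chart; 2 is used here)
k <- 2
top_k <- nat_counts[seq_len(min(k, length(nat_counts)))]

top_df <- data.frame(
  Nationality = names(top_k),
  n           = as.integer(top_k),
  stringsAsFactors = FALSE
)
print(top_df)
```

A data frame built this way has the same two columns, `Nationality` and `n`, that the donut chart code expects.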
In many sports like basketball, football, and swimming, there are certain physical attributes that are common amongst top players. In these sports the players tend to be long and strong. Soccer is a very interesting game because there doesn’t seem to be a perfect body type for the sport. The top two players in the world, Messi and Ronaldo, are on opposite sides of this spectrum yet have been competing for the Ballon d’Or for over a decade. With Messi being 5’7’’ with a skinny build and Ronaldo being a 6’2’’ physical freak of nature, I wanted to see how height relates to overall rating among the top players in the world. To do this I created a scatter plot of overall vs. height for the top 100 players.
#dataframe to hold top 100 players and their heights
top100_height = fifa20[, c("Index", "short_name", "height_cm", "overall")]
top100_height$overall=as.numeric(top100_height$overall)
#arrange by Index (assumed to be highest overall first) and keep the top 100 players
top100_height = head(arrange(top100_height,(Index)), n=100)
x = list(title = "Overall")
y= list(title = "Height (cm)")
fit = lm(height_cm ~ overall, data = top100_height)
top100_height %>%
plot_ly(x = ~overall, y = ~height_cm, type="scatter", mode="markers", text = ~short_name) %>%
add_lines(x = ~overall, y = fitted(fit))%>%
layout(title = "Overall vs. Height of Top 100 Players in the World")%>%
layout(xaxis = x, yaxis= y)%>%
layout(showlegend = F)
Using the above plot, we can see that height has no clear positive relationship with overall rating among the top 100 players. The fitted line actually slopes slightly downward as overall increases. Though this plot does not identify what makes the best players the best, I think it supports the idea that there is no ideal body type for playing at the highest level in soccer. This is one of the few sports where the intangibles can make the biggest difference, and that’s what makes it so interesting. For example, Messi, the highest rated player in the world, is one of the shortest in the top 100, but he has out-of-this-world talent that cannot be quantified.
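A fitted line on a scatter plot shows direction but not strength, so it can be useful to put numbers on the relationship. The sketch below computes the slope, the correlation coefficient, and R-squared for a made-up `players` data frame; the values are invented for illustration and say nothing about the real top 100.

```r
# Toy stand-in for top100_height (made-up values, not real data)
players <- data.frame(
  overall   = c(90, 89, 88, 87, 86, 85, 84, 83),
  height_cm = c(172, 185, 178, 190, 176, 183, 188, 180)
)

# Same model form as in the report: height explained by overall
fit <- lm(height_cm ~ overall, data = players)

slope     <- unname(coef(fit)["overall"])        # cm of height per rating point
r         <- cor(players$overall, players$height_cm)
r_squared <- summary(fit)$r.squared              # share of variance explained

print(c(slope = slope, r = r, r_squared = r_squared))
```

For a regression with a single predictor, R-squared equals the square of the correlation, so a value near zero would back up the reading that height explains very little of the variation in overall rating.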
The number one debate amongst soccer fans is: who is the best club in the world? If we accept that the European leagues are the strongest, we can use the Champions League as an identifier of the best team in Europe and thus the best in the world. The Champions League is a competition between the top teams from each of the European leagues, played first in a group stage and then in knockout rounds all the way to the final. The final two remaining teams play for the title of Champion of Europe, which effectively means champion of the world. By graphing the top 20 clubs by average player overall, we can get some insight into the top teams in club soccer as well as which leagues contribute the most top teams. The graph below is a horizontal bar chart colored by the league each team plays in.
#top 20 clubs by average overall
avg_overall = fifa20[,c("Index", "club_name", "league_name", "overall")]
avg_overall = aggregate(avg_overall[, 4], list(avg_overall$club_name), mean)
avg_overall = head(arrange(avg_overall,desc(x)), n=20)
#vector of leagues, in the same order as the clubs in avg_overall, to be added to dataframe
vec = c("Bundesliga", "La Liga", "Serie A", "La Liga", "Bundesliga", "Premier League",
"Premier League", "Serie A", "Premier League", "Premier League", "La Liga",
"Serie A", "Serie A", "Ligue 1", "Premier League", "Primeira Liga", "Serie A", "Primeira Liga",
"La Liga", "La Liga")
avg_overall$League = vec
#rounding to make more presentable
avg_overall = avg_overall %>%
mutate_if(is.numeric, round)
ggplot(avg_overall, aes(x = reorder(Group.1, x, sum), y = x, fill = League)) +
geom_bar(stat = "identity") +
coord_flip()+
labs(title = "Top 20 Clubs by Average Player Overall", x = "", y = "Average Overall")+
theme_few()+
geom_text(aes(label = x), hjust=1.0, color = "black")
The above graph can provide us with many insights and predictions. First, we can see that Bayern Munich is the number one team in the world based on average player overall. Second, Bayern is one of two teams in the top 20 from the Bundesliga. Third, two surprising additions are FC Porto and SL Benfica from the Primeira Liga, the Portuguese top flight. This goes to show that even outside the so-called “top 5 leagues” there are quality teams. Fourth, we can see that the Premier League, La Liga, and Serie A each have 5 teams in the top 20, which suggests these leagues are very competitive. Finally, the most interesting part of this graph is that Bayern Munich went on to win the Champions League in the season this FIFA was released. In that sense, this graph, using FIFA data, pointed to the actual best team in the world. What is even more intriguing is that Bayern were not playing well at the time of this game’s release, so the odds on them winning the Champions League would have been long. In theory, this could have been used to place a futures bet on them to win, and we could have made some good money.
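The league labels in the chart are typed in by hand in the same order as the ranked clubs, which is easy to get out of sync if the ranking changes. One possible alternative, sketched here with base R and a made-up `players` data frame, is to carry the league through the aggregation so each club keeps its own label. It assumes every club belongs to exactly one league.

```r
# Toy stand-in for fifa20 (made-up clubs, leagues, and ratings)
players <- data.frame(
  club_name   = c("Club A", "Club A", "Club B", "Club B", "Club C"),
  league_name = c("League 1", "League 1", "League 2", "League 2", "League 1"),
  overall     = c(80, 84, 90, 88, 70),
  stringsAsFactors = FALSE
)

# Average overall per club, keeping the league as a grouping column
club_avg <- aggregate(overall ~ club_name + league_name,
                      data = players, FUN = mean)

# Sort from highest to lowest average and keep the top n clubs
club_avg  <- club_avg[order(-club_avg$overall), ]
top_clubs <- head(club_avg, 2)   # n = 20 would match the chart
print(top_clubs)
```

With the league stored next to each club, the `fill` aesthetic in the bar chart could point at `league_name` directly instead of a separately maintained vector.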
From these visualizations we can learn a few things about the top nations, players, and clubs in the world. By looking at the map of players and using that information alongside the top nationality visualizations we can strongly say Europe and South America produce the best players in the world. In addition, we can say the European leagues are most certainly the best in the world. Also, we can note that there is no ideal body type to be the best in soccer. Finally, we learned that average player overall may be an identifier of the best team in the world. Now, as we look forward and watch the 2020-2021 season unfold, we can use this information to hopefully better predict outcomes and discuss the game. But most importantly, we can be sure that no two seasons are the same and that the beautiful game is complex and unforeseeable, and that’s what keeps us coming back to watch more.
Appendix: Summary of Data
summary(fifa20)
## Index short_name long_name age
## Min. : 1 Length:18483 Length:18483 Min. :16.00
## 1st Qu.: 4622 Class :character Class :character 1st Qu.:22.00
## Median : 9242 Mode :character Mode :character Median :25.00
## Mean : 9242 Mean :25.28
## 3rd Qu.:13862 3rd Qu.:29.00
## Max. :18483 Max. :42.00
##
## dob height_cm weight_kg nationality
## Length:18483 Min. :155.0 Min. : 50.00 Length:18483
## Class :character 1st Qu.:177.0 1st Qu.: 70.00 Class :character
## Mode :character Median :181.0 Median : 75.00 Mode :character
## Mean :181.3 Mean : 75.26
## 3rd Qu.:186.0 3rd Qu.: 80.00
## Max. :205.0 Max. :110.00
##
## club_name league_name overall potential
## Length:18483 Length:18483 Min. :48.0 Min. :49.0
## Class :character Class :character 1st Qu.:62.0 1st Qu.:67.0
## Mode :character Mode :character Median :66.0 Median :71.0
## Mean :66.2 Mean :71.5
## 3rd Qu.:71.0 3rd Qu.:75.0
## Max. :94.0 Max. :95.0
##
## player_positions preferred_foot team_position team_jersey_number
## Length:18483 Length:18483 Length:18483 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 9.00
## Mode :character Mode :character Mode :character Median :17.00
## Mean :20.09
## 3rd Qu.:27.00
## Max. :99.00
## NA's :240
## pace shooting passing dribbling defending
## Min. :24.00 Min. :15.00 Min. :24.00 Min. :23.0 Min. :15.00
## 1st Qu.:61.00 1st Qu.:42.00 1st Qu.:50.00 1st Qu.:57.0 1st Qu.:36.00
## Median :69.00 Median :54.00 Median :58.00 Median :64.0 Median :56.00
## Mean :67.69 Mean :52.26 Mean :57.19 Mean :62.5 Mean :51.52
## 3rd Qu.:75.00 3rd Qu.:63.00 3rd Qu.:64.00 3rd Qu.:69.0 3rd Qu.:65.00
## Max. :96.00 Max. :93.00 Max. :92.00 Max. :96.0 Max. :90.00
## NA's :2061 NA's :2061 NA's :2061 NA's :2061 NA's :2061
## physic gk_diving gk_handling gk_kicking
## Min. :27.00 Min. :44.00 Min. :42.00 Min. :35.00
## 1st Qu.:59.00 1st Qu.:60.00 1st Qu.:58.00 1st Qu.:57.00
## Median :66.00 Median :65.00 Median :63.00 Median :61.00
## Mean :64.85 Mean :65.37 Mean :63.08 Mean :61.77
## 3rd Qu.:72.00 3rd Qu.:70.00 3rd Qu.:68.00 3rd Qu.:66.00
## Max. :90.00 Max. :90.00 Max. :92.00 Max. :93.00
## NA's :2061 NA's :16422 NA's :16422 NA's :16422
## gk_reflexes gk_speed gk_positioning player_traits
## Min. :45.00 Min. :12.00 Min. :41.0 Length:18483
## 1st Qu.:60.00 1st Qu.:29.00 1st Qu.:58.0 Class :character
## Median :66.00 Median :39.00 Median :64.0 Mode :character
## Mean :66.33 Mean :37.75 Mean :63.3
## 3rd Qu.:72.00 3rd Qu.:46.00 3rd Qu.:69.0
## Max. :92.00 Max. :65.00 Max. :91.0
## NA's :16422 NA's :16422 NA's :16422
## attacking_crossing attacking_finishing attacking_heading_accuracy
## Min. : 5.00 Min. : 2.00 Min. : 5.00
## 1st Qu.:38.00 1st Qu.:30.00 1st Qu.:44.00
## Median :54.00 Median :49.00 Median :55.00
## Mean :49.67 Mean :45.56 Mean :52.18
## 3rd Qu.:64.00 3rd Qu.:62.00 3rd Qu.:64.00
## Max. :93.00 Max. :95.00 Max. :93.00
##
## attacking_short_passing attacking_volleys skill_dribbling skill_curve
## Min. : 7.00 Min. : 3.00 Min. : 4.00 Min. : 6.00
## 1st Qu.:54.00 1st Qu.:30.00 1st Qu.:50.00 1st Qu.:34.00
## Median :62.00 Median :44.00 Median :61.00 Median :49.00
## Mean :58.71 Mean :42.77 Mean :55.55 Mean :47.26
## 3rd Qu.:68.00 3rd Qu.:56.00 3rd Qu.:68.00 3rd Qu.:62.00
## Max. :92.00 Max. :90.00 Max. :97.00 Max. :94.00
##
## skill_fk_accuracy skill_long_passing skill_ball_control movement_acceleration
## Min. : 4.00 Min. : 8.00 Min. : 5.00 Min. :12.00
## 1st Qu.:31.00 1st Qu.:43.00 1st Qu.:54.00 1st Qu.:56.00
## Median :41.00 Median :56.00 Median :63.00 Median :67.00
## Mean :42.65 Mean :52.73 Mean :58.42 Mean :64.29
## 3rd Qu.:56.00 3rd Qu.:64.00 3rd Qu.:69.00 3rd Qu.:75.00
## Max. :94.00 Max. :92.00 Max. :96.00 Max. :97.00
##
## movement_sprint_speed movement_agility movement_reactions movement_balance
## Min. :11.0 Min. :11.00 Min. :21.00 Min. :12.00
## 1st Qu.:57.0 1st Qu.:55.00 1st Qu.:56.00 1st Qu.:56.00
## Median :67.0 Median :66.00 Median :62.00 Median :66.00
## Mean :64.4 Mean :63.49 Mean :61.71 Mean :63.86
## 3rd Qu.:75.0 3rd Qu.:74.00 3rd Qu.:68.00 3rd Qu.:74.00
## Max. :96.0 Max. :96.00 Max. :96.00 Max. :97.00
##
## power_shot_power power_jumping power_stamina power_strength
## Min. :14.00 Min. :19.00 Min. :12.00 Min. :20.00
## 1st Qu.:48.00 1st Qu.:58.00 1st Qu.:56.00 1st Qu.:58.00
## Median :59.00 Median :66.00 Median :66.00 Median :66.00
## Mean :58.13 Mean :64.92 Mean :62.86 Mean :65.21
## 3rd Qu.:68.00 3rd Qu.:73.00 3rd Qu.:74.00 3rd Qu.:74.00
## Max. :95.00 Max. :95.00 Max. :97.00 Max. :97.00
##
## power_long_shots mentality_aggression mentality_interceptions
## Min. : 4.00 Min. : 9.00 Min. : 3.00
## 1st Qu.:32.00 1st Qu.:44.00 1st Qu.:25.00
## Median :51.00 Median :58.00 Median :52.00
## Mean :46.77 Mean :55.69 Mean :46.34
## 3rd Qu.:62.00 3rd Qu.:69.00 3rd Qu.:64.00
## Max. :94.00 Max. :95.00 Max. :92.00
##
## mentality_positioning mentality_vision mentality_penalties mentality_composure
## Min. : 2.00 Min. : 9.00 Min. : 7.00 Min. :12.00
## 1st Qu.:39.00 1st Qu.:44.00 1st Qu.:39.00 1st Qu.:51.00
## Median :55.00 Median :55.00 Median :49.00 Median :59.00
## Mean :50.04 Mean :53.57 Mean :48.34 Mean :58.47
## 3rd Qu.:64.00 3rd Qu.:64.00 3rd Qu.:60.00 3rd Qu.:67.00
## Max. :95.00 Max. :94.00 Max. :92.00 Max. :96.00
##
## defending_standing_tackle defending_sliding_tackle goalkeeping_diving
## Min. : 5.0 Min. : 3.00 Min. : 1.00
## 1st Qu.:27.0 1st Qu.:24.00 1st Qu.: 8.00
## Median :55.0 Median :52.00 Median :11.00
## Mean :47.6 Mean :45.57 Mean :16.57
## 3rd Qu.:66.0 3rd Qu.:64.00 3rd Qu.:14.00
## Max. :92.0 Max. :90.00 Max. :90.00
##
## goalkeeping_handling goalkeeping_kicking goalkeeping_positioning
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 8.00
## Median :11.00 Median :11.00 Median :11.00
## Mean :16.35 Mean :16.21 Mean :16.36
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :92.00 Max. :93.00 Max. :91.00
##
## goalkeeping_reflexes
## Min. : 1.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :16.71
## 3rd Qu.:14.00
## Max. :92.00
##