1. Introduction

  • Overview

I have been a huge football fan since my childhood and have grown up loving the sport. With this project, I aim to combine my knowledge of data analytics and passion for the sport to discover some insights utilizing the large number of player attributes available in this dataset. We can try to find answers to some of the questions like -

  • Who are the top 10 defenders/midfielders/strikers in the world?
  • Which soccer club has the players with the most potential to grow in the coming years?
  • Which country has the top rated players?

  • Analytical Approach

We will be utilizing data about the player attributes from the latest EA Sports soccer video game, FIFA 18. I will mainly use exploratory analysis of the different player attributes to search for existing patterns in the data. Along with a graphical approach, I plan to use the techniques learnt in the classroom to manipulate the data and explain the findings clearly.

It will help people like me, who are huge fans of the sport, to understand the science behind it and to think analytically about why a particular event occurs in the game. With the latest player data at hand, we can try to answer our questions by performing appropriate analysis on the data.

2. Packages Required

I have used the following packages for my analysis.

library(tidyverse) # Tidy up the data
library(knitr) # Displaying an entire table on the screen
library(DT) # Display Data on the screen in a scrollable format
library(data.table) # Fast aggregation of large data

3. Data Preparation

The dataset ‘CompleteDataset.csv’ contains information about 17981 players and has 75 attributes for all those players.

This dataset was obtained from Kaggle, where the data had been scraped from a third-party website.

3.1 Data Import

Let's have an initial look at the dataset.

# Data Import
player <- read.csv("https://raw.githubusercontent.com/DeshpandeMohit/Data-Wrangling/master/CompleteDataset.csv")

# Dimensions of the dataset
dim(player)
## [1] 17981    75
# Displaying Column names
names(player)
##  [1] "X"                   "Name"                "Age"                
##  [4] "Photo"               "Nationality"         "Flag"               
##  [7] "Overall"             "Potential"           "Club"               
## [10] "Club.Logo"           "Value"               "Wage"               
## [13] "Special"             "Acceleration"        "Aggression"         
## [16] "Agility"             "Balance"             "Ball.control"       
## [19] "Composure"           "Crossing"            "Curve"              
## [22] "Dribbling"           "Finishing"           "Free.kick.accuracy" 
## [25] "GK.diving"           "GK.handling"         "GK.kicking"         
## [28] "GK.positioning"      "GK.reflexes"         "Heading.accuracy"   
## [31] "Interceptions"       "Jumping"             "Long.passing"       
## [34] "Long.shots"          "Marking"             "Penalties"          
## [37] "Positioning"         "Reactions"           "Short.passing"      
## [40] "Shot.power"          "Sliding.tackle"      "Sprint.speed"       
## [43] "Stamina"             "Standing.tackle"     "Strength"           
## [46] "Vision"              "Volleys"             "CAM"                
## [49] "CB"                  "CDM"                 "CF"                 
## [52] "CM"                  "ID"                  "LAM"                
## [55] "LB"                  "LCB"                 "LCM"                
## [58] "LDM"                 "LF"                  "LM"                 
## [61] "LS"                  "LW"                  "LWB"                
## [64] "Preferred.Positions" "RAM"                 "RB"                 
## [67] "RCB"                 "RCM"                 "RDM"                
## [70] "RF"                  "RM"                  "RS"                 
## [73] "RW"                  "RWB"                 "ST"

3.2 Data Cleaning

After inspecting the summary statistics of the dataset, I came across missing values for some observations in certain columns. However, because the dataset holds attribute values for players at different positions, some players will never be considered for a given position and so may have no value for that attribute.

For instance, we would not include any goalkeeper when looking at the strikers, so a goalkeeper might not have any value for the striker attribute. Following this reasoning, we will not remove any of the observations.
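
A quick way to see where such missing values sit is to count them per column. The following is a minimal sketch on a made-up three-player data frame; the names and values are assumptions for illustration only, not rows from the dataset. With the real data, the same calls would be applied to `player`.

```r
# Toy data: the goalkeeper has no rating for outfield positions (hypothetical values)
toy <- data.frame(
  Name      = c("Keeper A", "Striker B", "Defender C"),
  ST        = c(NA, 85, 60),
  CB        = c(NA, 40, 82),
  GK.diving = c(88, 10, 12),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of players with at least one missing position rating
missing_players <- toy$Name[!complete.cases(toy[, c("ST", "CB")])]
print(missing_players)
```

If the missing values are confined to position columns for players who never play there, keeping all rows is a reasonable choice.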

The dataset has ratings for every playing position, but the analysis here is limited to a subset of positions. So, I will keep only the required columns and remove the remaining ones.

# Keeping only the required columns
player <- player[, -c(1,21,48,50:51,53:54,56:59,61,63,65,67:70,72,74)]

# Dimension of the final dataset
dim(player)
## [1] 17981    55
# Names of the final dataset
names(player)
##  [1] "Name"                "Age"                 "Photo"              
##  [4] "Nationality"         "Flag"                "Overall"            
##  [7] "Potential"           "Club"                "Club.Logo"          
## [10] "Value"               "Wage"                "Special"            
## [13] "Acceleration"        "Aggression"          "Agility"            
## [16] "Balance"             "Ball.control"        "Composure"          
## [19] "Crossing"            "Dribbling"           "Finishing"          
## [22] "Free.kick.accuracy"  "GK.diving"           "GK.handling"        
## [25] "GK.kicking"          "GK.positioning"      "GK.reflexes"        
## [28] "Heading.accuracy"    "Interceptions"       "Jumping"            
## [31] "Long.passing"        "Long.shots"          "Marking"            
## [34] "Penalties"           "Positioning"         "Reactions"          
## [37] "Short.passing"       "Shot.power"          "Sliding.tackle"     
## [40] "Sprint.speed"        "Stamina"             "Standing.tackle"    
## [43] "Strength"            "Vision"              "Volleys"            
## [46] "CB"                  "CM"                  "LB"                 
## [49] "LM"                  "LW"                  "Preferred.Positions"
## [52] "RB"                  "RM"                  "RW"                 
## [55] "ST"

Thus, our final dataset contains 17981 observations with 55 attributes for each observation.
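
Dropping columns by numeric index works, but it silently breaks if the column order of the source file ever changes. A hedged alternative is to drop by name. The sketch below uses a made-up data frame that borrows a few of the column names listed above; the values and the `drop_cols` vector are illustrative assumptions.

```r
# Toy data frame with a few of the column names shown above (values are made up)
toy <- data.frame(
  X     = 0:1,
  Name  = c("A", "B"),
  Curve = c(70, 65),
  CAM   = c(80, 60),
  CB    = c(45, 82),
  ST    = c(88, 50),
  stringsAsFactors = FALSE
)

# Columns to discard, referenced by name instead of position
drop_cols <- c("X", "Curve", "CAM")

# Keep every column whose name is not in drop_cols
toy_small <- toy[, !(names(toy) %in% drop_cols)]
print(names(toy_small))
```

The name-based version also documents which attributes were removed, which the index vector does not.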

3.3 Data Preview

The following table gives a preview of our dataset.

3.4 Data Description

The majority of the attributes, such as Flag, Age and Nationality, are self-explanatory. Attributes like Acceleration, Ball Control and Marking assign a number between 1 and 100 to each player depending on the player's strength in that area. The player positions, however, need a description, as they will not be familiar to people who don't follow soccer. The following are the descriptions of the player positions that I have used in the dataset.

Abbreviation   Meaning
CB             Center Back
LB             Left Back
RB             Right Back
CM             Center Midfield
LM             Left Midfield
RM             Right Midfield
ST             Striker
LW             Left Wing
RW             Right Wing

4. Exploratory Data Analysis

4.1 Initial Explorations

Now, let's have a look at the distribution of the player ages.

# Age distribution (number of players at each age)

ggplot(player,aes(x = Age, fill = factor(Age))) + 
geom_bar() + 
guides(fill = "none") + 
xlab("Player Age") + 
ylab("Number of Players") + 
scale_x_continuous(breaks = c(16:47)) + 
ggtitle("Player Age") + 
theme(plot.title = element_text(hjust = 0.5))

Thus, we can see that the mean player age is about 25.14 while the median player age is 25. Very few players are older than 33.
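
The summary figures quoted above can be computed directly from the Age column. The sketch below shows the calls on a short made-up vector of ages (an assumption for illustration); on the real data the vector would be `player$Age`.

```r
# Hypothetical ages standing in for player$Age
ages <- c(17, 21, 23, 25, 25, 27, 29, 34, 38)

mean_age      <- mean(ages)       # arithmetic mean
median_age    <- median(ages)     # middle value
share_over_33 <- mean(ages > 33)  # fraction of players older than 33

print(c(mean = mean_age, median = median_age, share_over_33 = share_over_33))
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is a compact way to back up a statement such as "very few players are older than 33".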

Our dataset has two important attributes, the Overall rating and the Potential rating. The Overall rating denotes the all-round rating of the player, and the Potential rating denotes the level the player could still grow to. Let's look at the boxplots for these two attributes.

# Boxplot for Overall Rating
ggplot(player,aes(x = "", y = Overall)) + 
geom_boxplot() +
xlab("") + 
ylab("Overall Rating") + 
ggtitle("Boxplot for Overall Rating") + 
theme(plot.title = element_text(hjust = 0.5)) + 
coord_cartesian(ylim = c(40, 100))  

The median Overall rating is 66. The top players in the world have very high overall ratings and therefore appear as outliers above the upper whisker. There are comparatively fewer outliers at the low end of the scale than at the high end.

# Boxplot for Potential Rating
ggplot(player,aes(x = "", y = Potential)) + 
geom_boxplot() +
xlab("") + 
ylab("Potential Rating") + 
ggtitle("Boxplot for Potential Rating") + 
theme(plot.title = element_text(hjust = 0.5)) + 
coord_cartesian(ylim = c(40, 100))  

The median Potential rating is 71. Here the picture differs from the boxplot for the Overall rating: the high-potential and low-potential groups are nearly balanced, with the low-rated players only marginally more numerous than the top-rated ones.
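
Statements about outliers on either side of a boxplot can be checked numerically with base R's `boxplot.stats()`, which returns the five boxplot statistics and the points drawn as outliers. The sketch below uses a made-up vector of ratings (an assumption for illustration); on the real data it would be `player$Overall` or `player$Potential`.

```r
# Hypothetical ratings standing in for player$Overall
ratings <- c(46, 60, 62, 64, 65, 66, 66, 67, 68, 70, 72, 89, 94)

bs <- boxplot.stats(ratings)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Count outliers above the upper whisker and below the lower whisker
high_outliers <- sum(bs$out > bs$stats[5])
low_outliers  <- sum(bs$out < bs$stats[1])
print(c(high = high_outliers, low = low_outliers))
```

Comparing the two counts gives a concrete basis for saying that one tail has more outliers than the other.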

# Trend of Age vs Overall Rating

ggplot(player,aes(x = Age,y = Overall)) +
geom_point(aes(color = factor(Age))) + 
geom_smooth(method = "lm") +
xlab("Player Age") + 
ylab("Overall Rating") + 
ggtitle("Player Age vs Overall Rating") + 
theme(plot.title = element_text(hjust = 0.5))  

We can see that the two quantities are positively related overall. On closer inspection, however, the overall rating of players older than 30 is comparatively lower than that of players younger than 30. Thus, the overall rating tends to increase up to a certain age and then decreases.
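
A straight-line fit cannot show a rise followed by a decline, so the rise-then-fall pattern is easier to verify from the mean rating at each age. The following is a minimal sketch on made-up data (the values are assumptions for illustration); on the real data the formula would be applied to `player`.

```r
# Hypothetical ages and ratings
toy <- data.frame(
  Age     = c(18, 18, 24, 24, 29, 29, 35, 35),
  Overall = c(58, 62, 68, 72, 74, 78, 66, 70)
)

# Mean Overall rating for each age
mean_by_age <- aggregate(Overall ~ Age, data = toy, FUN = mean)
print(mean_by_age)

# Age at which the mean rating peaks
peak_age <- mean_by_age$Age[which.max(mean_by_age$Overall)]
print(peak_age)
```

In the plot itself, a non-linear smoother such as `geom_smooth(method = "loess")` would be one way to make the same pattern visible, at the cost of slower fitting on a large dataset.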

# Trend of Age vs Potential Rating

ggplot(player,aes(x = Age,y = Potential)) +
geom_point(aes(color = factor(Age))) + 
geom_smooth(method = "lm") +
xlab("Player Age") + 
ylab("Potential Rating") + 
ggtitle("Player Age vs Potential Rating") + 
theme(plot.title = element_text(hjust = 0.5))  

Thus, younger players seem to have a higher potential to grow than older players, as shown by the downward slope of the fitted line.

# Trend of Age vs Overall - Potential Rating

ggplot(player,aes(x = Age,y = Overall - Potential)) +
geom_point(aes(color = factor(Age))) + 
geom_smooth(method = "lm") +
xlab("Player Age") + 
ylab("Overall - Potential Rating") + 
ggtitle("Player Age vs Overall - Potential Rating") + 
theme(plot.title = element_text(hjust = 0.5))  

So from this plot, we can say that the players below 30 years of age have a greater potential rating than overall rating. But, this trend changes for the players above 30 years of age as it is observed that there is no major difference between their overall and potential rating.
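
The same comparison can be summarised as the average gap between Overall and Potential in each age band. The sketch below uses made-up players and a cut-off at 30 years; the values and the `AgeBand` labels are assumptions for illustration.

```r
# Hypothetical players
toy <- data.frame(
  Age       = c(19, 22, 27, 31, 34),
  Overall   = c(60, 68, 75, 78, 74),
  Potential = c(78, 76, 77, 78, 74)
)

# Negative gap means the player is still rated below their potential
toy$Gap     <- toy$Overall - toy$Potential
toy$AgeBand <- ifelse(toy$Age < 30, "under 30", "30 and over")

# Mean gap per age band
gap_by_band <- tapply(toy$Gap, toy$AgeBand, mean)
print(gap_by_band)
```

A mean gap close to zero for the older band and clearly negative for the younger band would support the reading of the plot given above.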

4.2 Best Players, Clubs and Countries

We have now observed how player age relates to the overall and potential ratings. Now, let's have a look at the Top 10 players based on these attributes.

# Top 10 Overall Players

player %>% 
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>')) %>%
arrange(desc(Overall)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,Club,ClubImage,Nationality,CountryImage) %>% 
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))

As expected, the two best players of this generation, Portugal’s Cristiano Ronaldo and Argentina’s Lionel Messi, lead the way, with the Brazilian Neymar coming in third.

# Top 10 Potential Players

player %>% 
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>')) %>%
arrange(desc(Potential)) %>% 
top_n(10,wt = Potential) %>% 
select(Photo,Name,Potential,Club,ClubImage,Nationality,CountryImage) %>% 
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))

Cristiano Ronaldo again leads the way with the highest potential rating, showing his hunger and desire to keep improving his game. PSG seem to be well placed for the near future, as they have two great emerging players in Neymar and Mbappe.
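
One caveat with these tables: `top_n()` keeps every row tied at the cut-off, so a "top 10" can contain more than ten rows. The base R sketch below illustrates the difference on made-up data, using a top 3 for brevity; the names and ratings are assumptions for illustration.

```r
# Hypothetical players; C and D are tied on Overall
toy <- data.frame(
  Name    = c("A", "B", "C", "D", "E"),
  Overall = c(94, 93, 92, 92, 90),
  stringsAsFactors = FALSE
)

# Tie-inclusive selection: everything at or above the 3rd-highest rating
cutoff    <- sort(toy$Overall, decreasing = TRUE)[3]
with_ties <- toy[toy$Overall >= cutoff, ]

# Exactly three rows: sort descending, then take the first three
ordered <- toy[order(-toy$Overall), ]
exact   <- head(ordered, 3)

print(nrow(with_ties))
print(exact$Name)
```

In dplyr 1.0.0 and later, `slice_max(Overall, n = 10, with_ties = FALSE)` is the corresponding way to request exactly ten rows.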

# Top 10 Overall Clubs

player %>%
group_by(Club) %>%
summarise(MeanOverallRating = round(x = mean(Overall), digits = 2)) %>%
arrange(desc(MeanOverallRating)) %>%
top_n(10,wt = MeanOverallRating) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  

Rich in talent, FC Barcelona leads the way with the highest mean overall rating, which is consistent with their dominance in domestic seasons as well as in European competitions. Their Spanish league rivals are in third place, with the Italian powerhouse Juventus between these two heavyweights. I am happy to see my favorite team, Manchester United, in the top 10 and hope they climb the rankings in the near future.

# Top 10 Potential Clubs

player %>%
group_by(Club) %>%
summarise(MeanPotentialRating = round(x = mean(Potential), digits = 2)) %>%
arrange(desc(MeanPotentialRating)) %>%
top_n(10,wt = MeanPotentialRating) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))

Real Madrid leads the way here in having the players with the highest potential to develop. FC Barcelona comes in at a close second with the German champions FC Bayern Munich behind them. Manchester United’s similar ranking here compared to their ranking based on overall rating suggests that they have a good mix of emerging and talented players in their team.

# Top 10 Overall Nationality

player %>%
group_by(Nationality) %>%
summarise(MeanOverallRating = round(x = mean(Overall), digits = 2), n = n()) %>%
filter(n > 100) %>%
arrange(desc(MeanOverallRating)) %>%
top_n(10,wt = MeanOverallRating) %>%
select(Nationality,MeanOverallRating) %>%  
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  

It’s no surprise for a fan to see Brazil at the top of the rankings based on overall rating, given the wonderful players they have produced through the years. Chile have developed a good batch of players in recent years and come in second behind Brazil, while the Euro 2016 winners Portugal rank third.

# Top 10 Potential Nationality

player %>%
group_by(Nationality) %>%
summarise(MeanPotentialRating = round(x = mean(Potential), digits = 2), n = n()) %>%
filter(n > 100) %>%
arrange(desc(MeanPotentialRating)) %>%
top_n(10,wt = MeanPotentialRating) %>%
select(Nationality,MeanPotentialRating) %>%  
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  

Portugal have a good crop of young players with great talent and hence top this table based on the potential rating. The World Cup 2010 winners Spain are ranked second, and Serbia follows Spain at rank 3.
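
The `filter(n > 100)` step in the two nationality rankings matters because a country represented by only a handful of highly rated players would otherwise top the table. The sketch below shows the effect on made-up data, with a minimum of 3 players standing in for the threshold of 100; all values are assumptions for illustration.

```r
# Hypothetical data: nation B has a single, highly rated player
toy <- data.frame(
  Nationality = c("A", "A", "A", "B", "C", "C", "C", "C"),
  Overall     = c(70, 72, 74, 90, 65, 75, 70, 70),
  stringsAsFactors = FALSE
)

by_nation   <- split(toy$Overall, toy$Nationality)
mean_rating <- sapply(by_nation, mean)   # mean rating per nation
n_players   <- lengths(by_nation)        # number of players per nation

# Without a minimum sample size, the single-player nation ranks first
unfiltered_top <- names(which.max(mean_rating))

# With a minimum sample size, only well-represented nations are ranked
min_players <- 3
ranked <- sort(mean_rating[n_players >= min_players], decreasing = TRUE)

print(unfiltered_top)
print(ranked)
```

The club rankings earlier in this section apply no such filter, so a similar check on squad size may be worthwhile there.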

4.3 Best players at different positions

Now, let’s have a look at the top 10 players for each playing position. We first subset the data to the players who list that position in the column ‘Preferred.Positions’. We then sort them in descending order of their overall rating and of their rating for that particular position, and keep the top 10.
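
When subsetting on ‘Preferred.Positions’, note that a plain substring match for a short abbreviation also hits longer ones: "LW" is contained in "LWB" and "RB" in "RWB". The sketch below compares a plain match with a word-boundary match. It assumes that the column stores abbreviations separated by a non-letter character such as a space; the example strings are made up.

```r
# Hypothetical Preferred.Positions values
prefs <- c("LW ST", "LWB LB", "RW RM", "RWB", "LB")

# Plain substring match: also flags the left wing-back in element 2
loose <- grepl("LW", prefs)

# Word-boundary match: flags only entries that contain LW as a whole token
strict <- grepl("\\bLW\\b", prefs)

print(data.frame(prefs, loose, strict))
```

The stricter pattern keeps wing-backs out of the winger and full-back tables.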

GK
# Top 10 Goalkeepers

player %>% 
subset(grepl("GK",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(GK.handling)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,GK.handling,Club,ClubImage,Nationality,CountryImage,
       Value,Wage,GK.diving,GK.kicking,GK.positioning,GK.reflexes,Special,
       Stamina,Reactions)%>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))   
LB
# Top 10 Left Backs

player %>% 
# Word boundaries keep "LB" from also matching "LWB"
subset(grepl("\\bLB\\b",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(LB)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,LB,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Aggression,Sliding.tackle,Heading.accuracy,Interceptions,Jumping,
       Standing.tackle,Special,Stamina,Marking,Positioning,Strength,
       Composure,Balance,Agility) %>%  
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  
CB
# Top 10 Center Backs

player %>% 
subset(grepl("CB",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(CB)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,CB,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Aggression,Sliding.tackle,Heading.accuracy,Interceptions,Jumping,
       Standing.tackle,Special,Stamina,Marking,Positioning,Strength,Composure,
       Balance,Agility) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))    
RB
# Top 10 Right Backs

player %>% 
# Word boundaries keep "RB" from also matching "RWB"
subset(grepl("\\bRB\\b",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(RB)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,RB,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Aggression,Sliding.tackle,Heading.accuracy,Interceptions,Jumping,
       Standing.tackle,Special,Stamina,Marking,Positioning,Strength,
       Composure,Balance,Agility) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))    
LM
# Top 10 Left Midfielders

player %>% 
subset(grepl("LM",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(LM)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,LM,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Crossing,Special,Vision,Dribbling,Stamina,Short.passing,Long.passing,
       Long.shots,Sprint.speed,Acceleration,Ball.control,Composure,Balance,
       Agility)  %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  
CM
# Top 10 Center Midfielders

player %>% 
subset(grepl("CM",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(CM)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,CM,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Interceptions,Special,Vision,Stamina,Short.passing,Long.passing,
       Long.shots,Ball.control,Composure,Balance,Agility,Crossing,Dribbling,
       Sprint.speed) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  
RM
# Top 10 Right Midfielders

player %>% 
subset(grepl("RM",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(RM)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,RM,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Crossing,Special,Vision,Dribbling,Stamina,Short.passing,Long.passing,
       Long.shots,Sprint.speed,Acceleration,Ball.control,Composure,Balance,
       Agility) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  
LW
# Top 10 Left Wingers

player %>% 
# Word boundaries keep "LW" from also matching "LWB"
subset(grepl("\\bLW\\b",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(LW)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,LW,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Crossing,Special,Vision,Dribbling,Stamina,Volleys,Free.kick.accuracy,
       Shot.power,Sprint.speed,Acceleration,Ball.control,Composure,Balance,
       Agility) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  
ST
# Top 10 Strikers

player %>% 
subset(grepl("ST",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(ST)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,ST,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Penalties,Finishing,Heading.accuracy,Special,Dribbling,Stamina,Volleys,
       Free.kick.accuracy,Shot.power,Strength,Acceleration,Ball.control,
       Composure,Balance,Agility) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  
RW
# Top 10 Right Wingers

player %>% 
# Word boundaries keep "RW" from also matching "RWB"
subset(grepl("\\bRW\\b",Preferred.Positions)) %>%
mutate(Photo = paste0('<img src="', `Photo`, '"></img>'),
       ClubImage = paste0('<img src="', `Club.Logo`, '"></img>'),
       CountryImage = paste0('<img src="', `Flag`, '"></img>'))%>% 
arrange(desc(Overall),desc(RW)) %>% 
top_n(10,wt = Overall) %>% 
select(Photo,Name,Overall,RW,Club,ClubImage,Nationality,CountryImage,Value,Wage,
       Crossing,Special,Vision,Dribbling,Stamina,Volleys,Free.kick.accuracy,
       Shot.power,Sprint.speed,Acceleration,Ball.control,Composure,Balance,
       Agility) %>%
datatable(class = "nowrap hover row-border", escape = FALSE, 
          options = list(dom = 't',scrollX = TRUE, autoWidth = TRUE))  
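
The position blocks above repeat the same steps with a different abbreviation each time. As a hedged sketch, the shared logic could be wrapped in a small helper. The function name `top_by_position`, its arguments and the toy data are hypothetical and not part of the original analysis; the helper uses base R only and returns a plain data frame that could then be passed on to `select()` and `datatable()`.

```r
# Hypothetical helper: top n players for one position
# data     : data frame with Overall, Preferred.Positions and a rating column
# position : abbreviation to look for in Preferred.Positions, e.g. "LW"
# tie_col  : column used to break ties on Overall (defaults to the position)
top_by_position <- function(data, position, tie_col = position, n = 10) {
  # Match the abbreviation as a whole token so "LW" does not match "LWB"
  pattern <- paste0("\\b", position, "\\b")
  hits <- data[grepl(pattern, data$Preferred.Positions), ]
  # Sort by Overall, then by the tie-break column, both descending
  hits <- hits[order(-hits$Overall, -hits[[tie_col]]), ]
  head(hits, n)
}

# Made-up players for illustration
toy <- data.frame(
  Name                = c("A", "B", "C", "D"),
  Overall             = c(90, 88, 88, 85),
  LW                  = c(89, 84, 87, 80),
  Preferred.Positions = c("LW", "LW ST", "LW", "LWB"),
  stringsAsFactors    = FALSE
)

top_lw <- top_by_position(toy, "LW", n = 2)
print(top_lw$Name)
```

For goalkeepers, where the tables above break ties on GK.handling, the call would pass `tie_col = "GK.handling"`.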

5. Summary

We have analyzed the dataset and drawn several insights from it. The following is a summary of our analysis:

  • Player ages range from around 16 years to 44 years, with the majority of players around 25 years of age.
  • The Overall and Potential ratings are closely related to player age: the overall rating is lower for players above 30 years old, while the potential rating decreases as age increases.
  • Certain players like Cristiano Ronaldo, Lionel Messi and Neymar stand out from the rest, as they appear in the top 10 lists for both the Overall and the Potential rating.
  • European football clubs dominate the rankings for the best players compared with clubs from other leagues around the world.
  • On the international front, however, the rankings are split between the South American and European countries.
  • Clubs like Real Madrid, FC Barcelona and Bayern Munich have many players in the Top 10 lists for the different positions, which helps explain why these clubs are currently among the best in the world.