I have been a huge soccer fan since childhood and have grown up loving the sport. With this project, I aim to combine my knowledge of data analytics with my passion for the sport to discover insights that we may not normally come across while watching a game or discussing it with friends. We can find answers to questions such as: Who is the better player, Ronaldo or Messi? Which club has the most potential to grow in the next 5 years? Which players justify their salary? Who are the top players in the world? So, let’s dive right in.
We will be using data about player attributes from the latest EA Sports FIFA 18 soccer video game. I will mainly apply exploratory analysis to the different player attribute columns to search for existing patterns in the data. Along with a graphical approach, I plan to use the techniques learnt in the classroom to manipulate the data and explain those patterns in a sophisticated way.
It will help fans of the sport understand the science behind it. It will make them think about why a particular event happened and the reasons behind it. With all the latest player data at hand, viewers can become more aware of the game and can try to answer, by analyzing that data, any questions they may have had at any point.
I propose to use the following packages for my analysis. This list will be updated if I use any additional packages.
library(tidyverse) # Tidy up the data
library(dplyr) # Data Manipulation in R
library(ggplot2) # Plotting Charts
library(knitr) # Displaying an entire table on the screen
library(DT) # Display Data on the screen in a scrollable format
library(data.table) # Fast aggregation of large data
library(kableExtra) # Construct complex tables and customize styles
The dataset ‘CompleteDataset.csv’ contains information about 17981 players in total and 75 attributes associated with those players.
This dataset was obtained from Kaggle; the data was originally scraped from the SoFIFA website.
Let’s have a look at the dataset in general.
# Data Import
player <- read.csv("CompleteDataset.csv")
# Dimensions of the dataset
dim(player)
## [1] 17981 75
# Displaying Column names
names(player)
## [1] "X" "Name" "Age"
## [4] "Photo" "Nationality" "Flag"
## [7] "Overall" "Potential" "Club"
## [10] "Club.Logo" "Value" "Wage"
## [13] "Special" "Acceleration" "Aggression"
## [16] "Agility" "Balance" "Ball.control"
## [19] "Composure" "Crossing" "Curve"
## [22] "Dribbling" "Finishing" "Free.kick.accuracy"
## [25] "GK.diving" "GK.handling" "GK.kicking"
## [28] "GK.positioning" "GK.reflexes" "Heading.accuracy"
## [31] "Interceptions" "Jumping" "Long.passing"
## [34] "Long.shots" "Marking" "Penalties"
## [37] "Positioning" "Reactions" "Short.passing"
## [40] "Shot.power" "Sliding.tackle" "Sprint.speed"
## [43] "Stamina" "Standing.tackle" "Strength"
## [46] "Vision" "Volleys" "CAM"
## [49] "CB" "CDM" "CF"
## [52] "CM" "ID" "LAM"
## [55] "LB" "LCB" "LCM"
## [58] "LDM" "LF" "LM"
## [61] "LS" "LW" "LWB"
## [64] "Preferred.Positions" "RAM" "RB"
## [67] "RCB" "RCM" "RDM"
## [70] "RF" "RM" "RS"
## [73] "RW" "RWB" "ST"
After inspecting the summary statistics of the dataset, I came across missing values for some observations in certain columns. However, because we are looking at attributes of players at particular positions, some players will naturally be left out when we analyze other positions.
For instance, we might not include any goalkeepers while looking at the strikers, so we won’t remove any of the observations. Thus, we will go forward with our 17981 observations.
We now have data for all the playing positions, but for the analysis performed here I will limit it to a certain number of positions, so I will keep only the required columns.
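The missing values mentioned above can be counted per column before deciding whether to drop anything. A minimal sketch on a made-up two-row data frame (`toy`, a hypothetical stand-in for `player`), where the goalkeeper row has no outfield position ratings:

```r
# Made-up stand-in for the real data: one outfield player, one goalkeeper
toy <- data.frame(
  Name = c("Outfield player", "Goalkeeper"),
  Overall = c(90, 92),
  ST = c(88, NA),
  CB = c(53, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Show only the columns that actually contain missing values
print(na_per_column[na_per_column > 0])

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)
```

Running the same two calls on `player` would show whether the missing values are confined to the position columns, which is what the reasoning above relies on.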
# Keeping only the required columns
player <- player[, -c(1,48,50:51,53:54,56:59,61,63:65,67:70,72,74)]
# Dimension of the final dataset
dim(player)
## [1] 17981 55
# Names of the final dataset
names(player)
## [1] "Name" "Age" "Photo"
## [4] "Nationality" "Flag" "Overall"
## [7] "Potential" "Club" "Club.Logo"
## [10] "Value" "Wage" "Special"
## [13] "Acceleration" "Aggression" "Agility"
## [16] "Balance" "Ball.control" "Composure"
## [19] "Crossing" "Curve" "Dribbling"
## [22] "Finishing" "Free.kick.accuracy" "GK.diving"
## [25] "GK.handling" "GK.kicking" "GK.positioning"
## [28] "GK.reflexes" "Heading.accuracy" "Interceptions"
## [31] "Jumping" "Long.passing" "Long.shots"
## [34] "Marking" "Penalties" "Positioning"
## [37] "Reactions" "Short.passing" "Shot.power"
## [40] "Sliding.tackle" "Sprint.speed" "Stamina"
## [43] "Standing.tackle" "Strength" "Vision"
## [46] "Volleys" "CB" "CM"
## [49] "LB" "LM" "LW"
## [52] "RB" "RM" "RW"
## [55] "ST"
Thus, our final dataset is ready and it contains 17981 observations with 55 attributes for each observation.
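Dropping columns by numeric index works, but it silently removes the wrong columns if the column order ever changes. As a hedged alternative, the same 20 columns can be dropped by name. The sketch below uses a small made-up data frame (`toy`) to stay self-contained; the names in `drop_cols` are the ones at the dropped index positions in the column listing shown earlier.

```r
# Columns removed in the index-based step, written out by name
drop_cols <- c("X", "CAM", "CDM", "CF", "ID", "LAM", "LCB", "LCM", "LDM", "LF",
               "LS", "LWB", "Preferred.Positions", "RAM", "RCB", "RCM", "RDM",
               "RF", "RS", "RWB")

# Made-up stand-in for the real data, with only a handful of columns
toy <- data.frame(
  X = 0:1,
  Name = c("Player A", "Player B"),
  CAM = c(80, 70),
  CB = c(50, 75),
  ST = c(88, 60),
  stringsAsFactors = FALSE
)

# Keep every column whose name is not in drop_cols
toy_kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_kept))
```

Because `%in%` ignores names that are absent, the same line applies unchanged to the full `player` data frame.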
Let’s preview the first few rows of all our attributes.
| Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club.Logo | Value | Wage | Special | Acceleration | Aggression | Agility | Balance | Ball.control | Composure | Crossing | Curve | Dribbling | Finishing | Free.kick.accuracy | GK.diving | GK.handling | GK.kicking | GK.positioning | GK.reflexes | Heading.accuracy | Interceptions | Jumping | Long.passing | Long.shots | Marking | Penalties | Positioning | Reactions | Short.passing | Shot.power | Sliding.tackle | Sprint.speed | Stamina | Standing.tackle | Strength | Vision | Volleys | CB | CM | LB | LM | LW | RB | RM | RW | ST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cristiano Ronaldo | 32 | https://cdn.sofifa.org/48/18/players/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94 | 94 | Real Madrid CF | https://cdn.sofifa.org/24/18/teams/243.png | €95.5M | €565K | 2228 | 89 | 63 | 89 | 63 | 93 | 95 | 85 | 81 | 91 | 94 | 76 | 7 | 11 | 15 | 14 | 11 | 88 | 29 | 95 | 77 | 92 | 22 | 85 | 95 | 96 | 83 | 94 | 23 | 91 | 92 | 31 | 80 | 85 | 88 | 53 | 82 | 61 | 89 | 91 | 61 | 89 | 91 | 92 |
| L. Messi | 30 | https://cdn.sofifa.org/48/18/players/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 93 | 93 | FC Barcelona | https://cdn.sofifa.org/24/18/teams/241.png | €105M | €565K | 2154 | 92 | 48 | 90 | 95 | 95 | 96 | 77 | 89 | 97 | 95 | 90 | 6 | 11 | 15 | 14 | 8 | 71 | 22 | 68 | 87 | 88 | 13 | 74 | 93 | 95 | 88 | 85 | 26 | 87 | 73 | 28 | 59 | 90 | 85 | 45 | 84 | 57 | 90 | 91 | 57 | 90 | 91 | 88 |
| Neymar | 25 | https://cdn.sofifa.org/48/18/players/190871.png | Brazil | https://cdn.sofifa.org/flags/54.png | 92 | 94 | Paris Saint-Germain | https://cdn.sofifa.org/24/18/teams/73.png | €123M | €280K | 2100 | 94 | 56 | 96 | 82 | 95 | 92 | 75 | 81 | 96 | 89 | 84 | 9 | 9 | 15 | 15 | 11 | 62 | 36 | 61 | 75 | 77 | 21 | 81 | 90 | 88 | 81 | 80 | 33 | 90 | 78 | 24 | 53 | 80 | 83 | 46 | 79 | 59 | 87 | 89 | 59 | 87 | 89 | 84 |
| L. Suárez | 30 | https://cdn.sofifa.org/48/18/players/176580.png | Uruguay | https://cdn.sofifa.org/flags/60.png | 92 | 92 | FC Barcelona | https://cdn.sofifa.org/24/18/teams/241.png | €97M | €510K | 2291 | 88 | 78 | 86 | 60 | 91 | 83 | 77 | 86 | 86 | 94 | 84 | 27 | 25 | 31 | 33 | 37 | 77 | 41 | 69 | 64 | 86 | 30 | 85 | 92 | 93 | 83 | 87 | 38 | 77 | 89 | 45 | 80 | 84 | 88 | 58 | 80 | 64 | 85 | 87 | 64 | 85 | 87 | 88 |
| M. Neuer | 31 | https://cdn.sofifa.org/48/18/players/167495.png | Germany | https://cdn.sofifa.org/flags/21.png | 92 | 92 | FC Bayern Munich | https://cdn.sofifa.org/24/18/teams/21.png | €61M | €230K | 1493 | 58 | 29 | 52 | 35 | 48 | 70 | 15 | 14 | 30 | 13 | 11 | 91 | 90 | 95 | 91 | 89 | 25 | 30 | 78 | 59 | 16 | 10 | 47 | 12 | 85 | 55 | 25 | 11 | 61 | 44 | 10 | 83 | 70 | 11 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| R. Lewandowski | 28 | https://cdn.sofifa.org/48/18/players/188545.png | Poland | https://cdn.sofifa.org/flags/37.png | 91 | 91 | FC Bayern Munich | https://cdn.sofifa.org/24/18/teams/21.png | €92M | €355K | 2143 | 79 | 80 | 78 | 80 | 89 | 87 | 62 | 77 | 85 | 91 | 84 | 15 | 6 | 12 | 8 | 10 | 85 | 39 | 84 | 65 | 83 | 25 | 81 | 91 | 91 | 83 | 88 | 19 | 83 | 79 | 42 | 84 | 78 | 87 | 57 | 78 | 58 | 82 | 84 | 58 | 82 | 84 | 88 |
The majority of the attributes, such as Flag, Age, and Nationality, are self-explanatory. Attributes like Acceleration, Ball control, and Marking simply assign the player a number between 1 and 100 depending on the player’s strength in that area. However, the player positions need descriptions, as they will not be familiar to people who don’t follow soccer. The following are the descriptions of the player positions that I have used in the dataset.
| Abbreviation | Meaning |
|---|---|
| CB | Center Back |
| LB | Left Back |
| RB | Right Back |
| CM | Center Midfield |
| LM | Left Midfield |
| RM | Right Midfield |
| ST | Striker |
| LW | Left Wing |
| RW | Right Wing |
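Not every attribute is ready for numeric analysis as loaded. In the preview, Value and Wage appear as text with a currency prefix and a K or M suffix, so comparing pay with ratings first requires converting them to numbers. Below is a minimal sketch of one way to do that. The helper name `parse_money` is hypothetical, and the sample strings only mirror the format seen in the preview.

```r
# Hypothetical helper: convert strings such as "<currency>95.5M" or
# "<currency>565K" to plain numbers
parse_money <- function(x) {
  x <- as.character(x)
  # Numeric part: drop everything except digits and the decimal point
  number <- as.numeric(gsub("[^0-9.]", "", x))
  # Scale by the suffix: M = millions, K = thousands, otherwise no scaling
  multiplier <- ifelse(grepl("M$", x), 1e6,
                       ifelse(grepl("K$", x), 1e3, 1))
  number * multiplier
}

# "\u20ac" is the euro sign, written as an escape to avoid encoding problems
parsed <- parse_money(c("\u20ac95.5M", "\u20ac565K", "\u20ac0"))
print(parsed)
```

Applied to the real data, this would look like `player$Wage_num <- parse_money(player$Wage)`, where `Wage_num` is an illustrative column name.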
I plan to use data manipulation functions to slice and dice the data into different combinations, which will help me explore some of the hidden patterns in depth.
# Top 6 players based on Potential Rating
player %>%
  select(Name, Club, Age, Potential) %>%
  arrange(desc(Potential)) %>%
  head()
## Name Club Age Potential
## 1 Cristiano Ronaldo Real Madrid CF 32 94
## 2 Neymar Paris Saint-Germain 25 94
## 3 K. Mbappé Paris Saint-Germain 18 94
## 4 G. Donnarumma Milan 18 94
## 5 L. Messi FC Barcelona 30 93
## 6 P. Dybala Juventus 23 93
As we can see, selecting a few columns and sorting them already gives us an interesting insight: which players have the highest potential ratings. Going forward, I intend to use this method to answer many more questions.
The majority of my analysis hinges on making creative plots so that we can understand the data better than by simply looking at rows and rows of tabular data. I will try to incorporate histograms, bar plots, density plots, scatter plots, heat maps, and so on.
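The same select-and-sort idea extends to questions such as which club has the most room to grow. Below is a minimal sketch, assuming growth is measured as Potential minus Overall (an illustrative assumption, not a definition from the dataset). It uses base R's `aggregate()` on a made-up data frame so that it runs without extra packages.

```r
# Made-up players; the real analysis would use the player data frame
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Club = c("Club X", "Club X", "Club Y", "Club Y"),
  Overall = c(80, 70, 85, 60),
  Potential = c(85, 80, 86, 75),
  stringsAsFactors = FALSE
)

# Assumed growth measure: how far each player is below their potential
toy$Growth <- toy$Potential - toy$Overall

# Average growth per club, sorted from largest to smallest
club_growth <- aggregate(Growth ~ Club, data = toy, FUN = mean)
club_growth <- club_growth[order(-club_growth$Growth), ]
print(club_growth)
```

With dplyr, the equivalent would be a `group_by(Club)` followed by `summarise()` and `arrange()`.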
# Histogram of the Player Overall rating
hist(player$Overall, xlab = "Player Overall Rating", main = "Histogram of Player Overall Rating")
# Boxplot of the Player Age
boxplot(player$Age, xlab = "Player Age", main = "Boxplot of Player Age")
The histogram shows that very few players have a very high overall rating, while the majority have an overall rating of about 70. The boxplot shows that the median age of the players in the dataset is around 25, and a few players aged around 40 appear as outliers on the basis of their age.
At this point, I have very little knowledge of the powerful data visualization tools in R such as ggplot and qplot. I plan to learn them quickly, as their visual power will certainly enhance my analysis and make it easier for a person viewing my project to grasp the information I am trying to convey.
My major focus will be on discovering hidden patterns in the dataset through exploratory data analysis, but if I come across a question that a machine learning technique could answer better than exploratory analysis, I will certainly give it a try.
There’s still a long way to go with the analysis, as I have only looked at the tip of the iceberg that is this dataset. With the appropriate visual and advanced analytical operations, I hope to answer all the questions I set out to solve with this dataset and, in the process, understand more about the sport in general.
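The values read off the boxplot can also be checked numerically. Below is a minimal sketch on made-up ages (not the real data). `median()` gives the center line of the box, and `boxplot.stats()` from the grDevices package returns the points that a default boxplot would draw as outliers, that is, those lying more than 1.5 times the box length beyond the hinges.

```r
# Made-up ages standing in for player$Age
ages <- c(18, 21, 23, 24, 25, 25, 26, 27, 29, 31, 40)

# Center line of the boxplot
age_median <- median(ages)
print(age_median)

# Points flagged as outliers by the default boxplot rule
age_outliers <- boxplot.stats(ages)$out
print(age_outliers)
```

On the real data, `median(player$Age)` and `boxplot.stats(player$Age)$out` would confirm or refine the figures quoted from the plot.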