1. Introduction

  • Overview

I have been a huge soccer fan since childhood. With this project, I aim to combine my knowledge of data analytics with my passion for the sport to uncover insights that we would not normally come across while watching a game or discussing it with friends. The data can help answer questions such as: Who is the better player, Ronaldo or Messi? Which club has the most potential to grow in the next 5 years? Which players justify their salary? Who are the top players in the world? So, let’s dive right in.

  • Analytical Approach

We will use data on player attributes from EA Sports FIFA 18, the latest edition of the soccer video game. I will rely mainly on exploratory analysis of the player attribute columns to search for existing patterns in the data. Along with a graphical approach, I plan to use the techniques learnt in the classroom to manipulate the data and explain those patterns in a sophisticated way.

  • Mission

This analysis should help fans of the sport understand the science behind it and think about why a particular event happened. With the latest player data at hand, viewers become more aware of the game and can try to answer, by analyzing that data themselves, any questions they may have had.

2. Packages Required

I propose to use the following packages for my analysis. The list will be updated if I use any additional packages.

library(tidyverse)  # Core tidyverse packages (also attaches dplyr and ggplot2)
library(dplyr)      # Data manipulation
library(ggplot2)    # Plotting charts
library(knitr)      # Displaying an entire table on the screen
library(DT)         # Displaying data in a scrollable format
library(data.table) # Fast aggregation of large data
library(kableExtra) # Constructing complex tables and customizing styles

3. Data Preparation

The dataset ‘CompleteDataset.csv’ contains information about 17981 players in total and 75 attributes associated with those players.

This dataset was obtained from the Kaggle page here; its authors scraped the data from this website.

3.1 Data Import

Let’s have a look at the dataset in general.

# Data Import
player <- read.csv("CompleteDataset.csv")
# Dimensions of the dataset
dim(player)
## [1] 17981    75
# Displaying Column names
names(player)
##  [1] "X"                   "Name"                "Age"                
##  [4] "Photo"               "Nationality"         "Flag"               
##  [7] "Overall"             "Potential"           "Club"               
## [10] "Club.Logo"           "Value"               "Wage"               
## [13] "Special"             "Acceleration"        "Aggression"         
## [16] "Agility"             "Balance"             "Ball.control"       
## [19] "Composure"           "Crossing"            "Curve"              
## [22] "Dribbling"           "Finishing"           "Free.kick.accuracy" 
## [25] "GK.diving"           "GK.handling"         "GK.kicking"         
## [28] "GK.positioning"      "GK.reflexes"         "Heading.accuracy"   
## [31] "Interceptions"       "Jumping"             "Long.passing"       
## [34] "Long.shots"          "Marking"             "Penalties"          
## [37] "Positioning"         "Reactions"           "Short.passing"      
## [40] "Shot.power"          "Sliding.tackle"      "Sprint.speed"       
## [43] "Stamina"             "Standing.tackle"     "Strength"           
## [46] "Vision"              "Volleys"             "CAM"                
## [49] "CB"                  "CDM"                 "CF"                 
## [52] "CM"                  "ID"                  "LAM"                
## [55] "LB"                  "LCB"                 "LCM"                
## [58] "LDM"                 "LF"                  "LM"                 
## [61] "LS"                  "LW"                  "LWB"                
## [64] "Preferred.Positions" "RAM"                 "RB"                 
## [67] "RCB"                 "RCM"                 "RDM"                
## [70] "RF"                  "RM"                  "RS"                 
## [73] "RW"                  "RWB"                 "ST"
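
The column names above include "X" and dotted names such as "Club.Logo". A minimal sketch of how read.csv can produce such names, assuming the raw file has a blank first header field and a header containing a space (the three-line CSV and its index values below are hypothetical stand-ins, not the real file):

```r
# Hypothetical miniature CSV: blank first header field, one header with a space
csv_text <- c(
  ",Name,Club Logo",
  "0,Cristiano Ronaldo,https://cdn.sofifa.org/24/18/teams/243.png",
  "1,L. Messi,https://cdn.sofifa.org/24/18/teams/241.png"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# read.csv passes headers through make.names() by default (check.names = TRUE):
# the empty header becomes "X" and the space becomes a dot
toy <- read.csv(tmp, stringsAsFactors = FALSE)
toy_names <- names(toy)
print(toy_names)

unlink(tmp)
```

Passing stringsAsFactors = FALSE keeps text columns as character vectors regardless of the R version in use.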

3.2 Data Cleaning

After inspecting the summary statistics of the dataset, I came across missing values in certain columns. However, because we will analyze players position by position, some players will simply be left out of a given analysis anyway.

For instance, we would not include goalkeepers while looking at strikers, so there is no need to remove any observations. Thus, we will go forward with all 17981 observations.
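
One way to locate the missing values and to set them aside only for the position under analysis might look like the sketch below. It uses a two-row toy data frame with values copied from the preview in section 3.3; the object names toy, na_counts and strikers are illustrative only.

```r
# Toy data: an outfield player and a goalkeeper (values from the data preview)
toy <- data.frame(
  Name    = c("Cristiano Ronaldo", "M. Neuer"),
  Overall = c(94, 92),
  ST      = c(92, NA),
  CB      = c(53, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column and show only the affected columns
na_counts <- colSums(is.na(toy))
print(na_counts[na_counts > 0])

# Keep every row in the dataset; filter only when a position is analyzed
strikers <- toy[!is.na(toy$ST), ]
print(strikers$Name)
```

The full data frame stays intact, and each position-specific analysis starts from its own filtered view.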

We have ratings for all playing positions, but the analysis here is limited to a subset of positions, so I will keep only the required columns.

# Keeping only the required columns
player <- player[, -c(1,48,50:51,53:54,56:59,61,63:65,67:70,72,74)]
# Dimension of the final dataset
dim(player)
## [1] 17981    55
# Names of the final dataset
names(player)
##  [1] "Name"               "Age"                "Photo"             
##  [4] "Nationality"        "Flag"               "Overall"           
##  [7] "Potential"          "Club"               "Club.Logo"         
## [10] "Value"              "Wage"               "Special"           
## [13] "Acceleration"       "Aggression"         "Agility"           
## [16] "Balance"            "Ball.control"       "Composure"         
## [19] "Crossing"           "Curve"              "Dribbling"         
## [22] "Finishing"          "Free.kick.accuracy" "GK.diving"         
## [25] "GK.handling"        "GK.kicking"         "GK.positioning"    
## [28] "GK.reflexes"        "Heading.accuracy"   "Interceptions"     
## [31] "Jumping"            "Long.passing"       "Long.shots"        
## [34] "Marking"            "Penalties"          "Positioning"       
## [37] "Reactions"          "Short.passing"      "Shot.power"        
## [40] "Sliding.tackle"     "Sprint.speed"       "Stamina"           
## [43] "Standing.tackle"    "Strength"           "Vision"            
## [46] "Volleys"            "CB"                 "CM"                
## [49] "LB"                 "LM"                 "LW"                
## [52] "RB"                 "RM"                 "RW"                
## [55] "ST"

Thus, our final dataset is ready and it contains 17981 observations with 55 attributes for each observation.
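
Dropping columns by numeric index works, but it breaks silently if the column order ever changes. A hedged alternative is to drop by name. In the sketch below, drop_cols lists the 20 names that correspond to the indices removed above; the toy data frame is a placeholder filled with NA and holds only a handful of the real columns.

```r
# Names of the 20 columns removed by index in the cleaning step
drop_cols <- c("X", "CAM", "CDM", "CF", "ID", "LAM", "LCB", "LCM", "LDM", "LF",
               "LS", "LWB", "Preferred.Positions", "RAM", "RCB", "RCM", "RDM",
               "RF", "RS", "RWB")

# Placeholder data frame with a few of the original column names (all NA)
toy_cols <- c("X", "Name", "Overall", "CAM", "CB", "ID", "ST")
toy <- as.data.frame(matrix(NA, nrow = 2, ncol = length(toy_cols),
                            dimnames = list(NULL, toy_cols)))

# setdiff() keeps the original column order of the first argument
kept <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
print(names(kept))
```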

3.3 Data Preview

Let’s preview the first few rows of the final dataset.

Name Age Photo Nationality Flag Overall Potential Club Club.Logo Value Wage Special Acceleration Aggression Agility Balance Ball.control Composure Crossing Curve Dribbling Finishing Free.kick.accuracy GK.diving GK.handling GK.kicking GK.positioning GK.reflexes Heading.accuracy Interceptions Jumping Long.passing Long.shots Marking Penalties Positioning Reactions Short.passing Shot.power Sliding.tackle Sprint.speed Stamina Standing.tackle Strength Vision Volleys CB CM LB LM LW RB RM RW ST
Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Real Madrid CF https://cdn.sofifa.org/24/18/teams/243.png €95.5M €565K 2228 89 63 89 63 93 95 85 81 91 94 76 7 11 15 14 11 88 29 95 77 92 22 85 95 96 83 94 23 91 92 31 80 85 88 53 82 61 89 91 61 89 91 92
L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina https://cdn.sofifa.org/flags/52.png 93 93 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png €105M €565K 2154 92 48 90 95 95 96 77 89 97 95 90 6 11 15 14 8 71 22 68 87 88 13 74 93 95 88 85 26 87 73 28 59 90 85 45 84 57 90 91 57 90 91 88
Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 94 Paris Saint-Germain https://cdn.sofifa.org/24/18/teams/73.png €123M €280K 2100 94 56 96 82 95 92 75 81 96 89 84 9 9 15 15 11 62 36 61 75 77 21 81 90 88 81 80 33 90 78 24 53 80 83 46 79 59 87 89 59 87 89 84
L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay https://cdn.sofifa.org/flags/60.png 92 92 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png €97M €510K 2291 88 78 86 60 91 83 77 86 86 94 84 27 25 31 33 37 77 41 69 64 86 30 85 92 93 83 87 38 77 89 45 80 84 88 58 80 64 85 87 64 85 87 88
M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany https://cdn.sofifa.org/flags/21.png 92 92 FC Bayern Munich https://cdn.sofifa.org/24/18/teams/21.png €61M €230K 1493 58 29 52 35 48 70 15 14 30 13 11 91 90 95 91 89 25 30 78 59 16 10 47 12 85 55 25 11 61 44 10 83 70 11 NA NA NA NA NA NA NA NA NA
R. Lewandowski 28 https://cdn.sofifa.org/48/18/players/188545.png Poland https://cdn.sofifa.org/flags/37.png 91 91 FC Bayern Munich https://cdn.sofifa.org/24/18/teams/21.png €92M €355K 2143 79 80 78 80 89 87 62 77 85 91 84 15 6 12 8 10 85 39 84 65 83 25 81 91 91 83 88 19 83 79 42 84 78 87 57 78 58 82 84 58 82 84 88
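
In this preview, Value and Wage are stored as text such as €95.5M and €565K, so they would need converting to numbers before any salary comparison. A minimal sketch of one possible conversion follows; parse_euro is a hypothetical helper, not part of the dataset or of any package, and it assumes the only suffixes are M (millions) and K (thousands).

```r
# Hypothetical helper: turn strings like "€95.5M" or "€565K" into euros
parse_euro <- function(x) {
  # Multiplier implied by the trailing suffix (assumed to be M, K or nothing)
  multiplier <- ifelse(grepl("M$", x), 1e6,
                       ifelse(grepl("K$", x), 1e3, 1))
  # Strip everything except digits and the decimal point
  number <- as.numeric(gsub("[^0-9.]", "", x))
  number * multiplier
}

# Values and wages taken from the first three preview rows
values <- parse_euro(c("€95.5M", "€105M", "€123M"))
wages  <- parse_euro(c("€565K", "€565K", "€280K"))
print(values)
print(wages)
```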

3.4 Data Description

Most of the attributes, such as Flag, Age and Nationality, are self-explanatory. Attributes such as Acceleration, Ball Control and Marking assign the player a number between 1 and 100 reflecting the player’s strength in that area. The player positions, however, need describing, as they will not be familiar to people who don’t follow soccer. The following are the descriptions of the player positions that I have used in the dataset.

Abbreviation   Meaning
CB             Center Back
LB             Left Back
RB             Right Back
CM             Center Midfield
LM             Left Midfield
RM             Right Midfield
ST             Striker
LW             Left Wing
RW             Right Wing

4. Proposed Exploratory Data Analysis

4.1 Approach

I plan to use data manipulation functions to slice the data into different combinations, which will help me explore some of the hidden patterns in depth.

# Top 6 players based on Potential Rating
head(player %>%
     select(Name,Club,Age,Potential) %>%
     arrange(desc(Potential)))
##                Name                Club Age Potential
## 1 Cristiano Ronaldo      Real Madrid CF  32        94
## 2            Neymar Paris Saint-Germain  25        94
## 3         K. Mbappé Paris Saint-Germain  18        94
## 4     G. Donnarumma               Milan  18        94
## 5          L. Messi        FC Barcelona  30        93
## 6         P. Dybala            Juventus  23        93

As we can see, selecting a few columns and sorting them shows which players have the highest Potential rating. Note that a high Potential alone does not imply room to grow: Cristiano Ronaldo, at 32, already has an Overall rating equal to his Potential of 94, whereas K. Mbappé and G. Donnarumma are only 18. Going forward, I intend to use this method to answer many more questions.
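
To look at room for growth rather than the Potential rating itself, one option is to rank players by the gap between Potential and Overall. The sketch below uses the six players from the data preview; the Growth column is a hypothetical derived measure, not an attribute of the dataset.

```r
# Six players from the data preview (Age, Overall and Potential as shown there)
toy <- data.frame(
  Name      = c("Cristiano Ronaldo", "L. Messi", "Neymar",
                "L. Suárez", "M. Neuer", "R. Lewandowski"),
  Age       = c(32, 30, 25, 30, 31, 28),
  Overall   = c(94, 93, 92, 92, 92, 91),
  Potential = c(94, 93, 94, 92, 92, 91),
  stringsAsFactors = FALSE
)

# Hypothetical measure: how far the player is rated below their potential
toy$Growth <- toy$Potential - toy$Overall

# Sort by the largest gap first, breaking ties by Potential
ranked <- toy[order(-toy$Growth, -toy$Potential), ]
print(ranked[, c("Name", "Age", "Overall", "Potential", "Growth")])
```

In this small sample only Neymar has a positive gap, which shows why the two rankings can differ.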

4.2 Visualizations

Much of my analysis hinges on creative plots that help us understand the data better than looking at rows and rows of tabular data. I will try to incorporate histograms, bar plots, density plots, scatter plots, heat maps, etc.

# Histogram of the Player Overall rating
hist(player$Overall, xlab = "Player Overall Rating", main = "Histogram of Player Overall Rating")

# Boxplot of the Player Age
boxplot(player$Age, xlab = "Player Age", main = "Boxplot of Player Age")

The histogram shows that very few players have a very high overall rating, while the majority have an overall rating of about 70. The boxplot shows that the median age of the players in the dataset is around 25, with a few players aged around 40 appearing as outliers.
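
The boxplot flags as outliers the points lying more than 1.5 times the box height beyond the box edges, and base R exposes the same rule through boxplot.stats. The sketch below applies it to the nine ages that appear in this document's tables, so its median will not match the full dataset; the extra age of 45 is a hypothetical value added only to show a point being flagged.

```r
# Ages shown in this document's tables (a tiny sample, not the full dataset)
ages <- c(32, 30, 25, 30, 31, 28, 18, 18, 23)

sample_median <- median(ages)
sample_out <- boxplot.stats(ages)$out   # points beyond 1.5 * box height

# Hypothetical extra player aged 45, to show how an outlier is flagged
with_veteran <- c(ages, 45)
veteran_out <- boxplot.stats(with_veteran)$out

print(sample_median)
print(sample_out)
print(veteran_out)
```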

4.3 Things to learn

At this point, I have very little knowledge of R’s powerful data visualization tools such as ggplot and qplot. I plan to learn them quickly, as their visual power will certainly enhance my analysis and make it easier for a reader of my project to grasp the information I am trying to convey.
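
As a starting point, the two base R plots from section 4.2 could be rewritten with ggplot2 roughly as follows. This is a sketch under assumptions: it uses a six-row toy data frame built from the data preview, and the bin width of 1 is an arbitrary choice.

```r
library(ggplot2)

# Toy data from the six preview rows; the real analysis would use `player`
toy <- data.frame(
  Overall = c(94, 93, 92, 92, 92, 91),
  Age     = c(32, 30, 25, 30, 31, 28)
)

# Histogram of the overall rating
p_hist <- ggplot(toy, aes(x = Overall)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Player Overall Rating", y = "Count",
       title = "Histogram of Player Overall Rating")

# Boxplot of the player age
p_box <- ggplot(toy, aes(y = Age)) +
  geom_boxplot() +
  labs(y = "Player Age", title = "Boxplot of Player Age")

print(p_hist)
print(p_box)
```

Each plot is built as an object from layers, so a later change such as a different bin width or a split by position only touches one line.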

4.4 Utilization of Machine Learning Techniques

My major focus will be on discovering hidden patterns in the data set through exploratory data analysis, but if I come across a question better answered by a machine learning technique than by exploratory analysis, I will certainly give it a try.

5. Further Analysis

There is still a long way to go with the analysis, as I have only looked at the tip of the iceberg that is this dataset. With the appropriate visual and advanced analytical operations, I intend to answer all the questions I have set out to solve with this dataset and, in the process, understand more about the sport in general.