INTRODUCTION

The FIFA World Cup, held every four years, is a soccer championship for national football teams organized by FIFA. The FIFA18 dataset consists of statistics and attribute information for the players featured in FIFA 2018, including Name, associated Club, Value (how much each player is worth, in million Euros), Wage (earnings per week, in thousand Euros), Nationality, playing style, etc.

The goal of this project is to explore the complete dataset of 75 attributes and conduct a descriptive analysis of the relationships between player attributes using R/RStudio.

The scope of this project is limited to the subset of the 75 attributes that is actually analyzed: Age, Overall, Value, Wage, Club, Preferred Positions, and a selection of skill attributes.

PACKAGES INSTALLED

The packages used in this exploratory analysis include the following:

  • data.table = to import the dataset from the working directory
  • plyr = to split data, manipulate it, and put it back together
  • dplyr = the next iteration of plyr, focused on data frames, for data preparation, mutation, etc.
  • tidyr = to simplify the process of creating tidy data
  • scales = to control axis and legend labels
  • ggthemes = to provide extra themes and geoms
  • ggplot2 = to create complete, complex graphics from data in a data frame
  • ggrepel = to keep text labels from overlapping
  • RColorBrewer = to add ready-to-use color palettes

library(data.table)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(scales)
library(ggthemes)
library(ggplot2)
library(ggrepel)
library(RColorBrewer)

DATA IMPORT & PREVIEW

The FIFA 18 Complete Player dataset was downloaded from Kaggle. The data was read using fread, treating both the literal string NA and zero-length strings (empty fields) as missing values.

setwd("C:/Users/nurab/OneDrive/Documents/Data Viz")
filename <- "CompleteDataset.csv"
# na.strings expects character values, so NA is written as the string "NA"
df <- fread(filename, na.strings=c("NA", ""))

An initial preview of the dataset shows 17981 rows and 75 columns. The data structure shows that some columns are stored as character and need to be converted to numeric before they can be explored.

Basic descriptive statistics before cleaning show, for example, that Age ranges from a minimum of 16 to a maximum of 47, with a mean of approximately 25 years. The Overall rating of the players ranges from a minimum of 46 to a maximum of 94. I would assume the highest earners have the highest ratings; we will explore who these players are as the analysis proceeds.

DATA CLEANSING

The dataset contained some bad data. To visualize its attributes accurately, it had to be cleaned of values that would otherwise cause errors.

The Value and Wage columns contained special characters (for example €275K) that were removed so the columns could be converted to numeric for analysis.

Removing special characters from the Value column and changing the column type to numeric. Note that this pattern strips only the € sign and the M suffix, so a Value written with a K suffix (thousands) is not parsed by as.numeric and becomes NA.

df$Value <- as.numeric(gsub("\\€|M", "", df$Value))

Removing special characters from the Wage column and changing the column type to numeric.

df$Wage <- as.numeric(gsub("\\€|K", "", df$Wage))
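
The Value and Wage strings mix a currency sign with a K or M suffix, so a unit-aware conversion may be safer than stripping a single suffix. The following is a minimal sketch under that assumption; parse_euro and the sample strings are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): convert strings such as
# "€95.5M" or "€275K" to a number expressed in one common unit.
parse_euro <- function(x, unit = c("M", "K")) {
  unit <- match.arg(unit)
  # Multiplier implied by the trailing suffix: M = million, K = thousand, none = euros
  multiplier <- ifelse(grepl("M$", x), 1e6, ifelse(grepl("K$", x), 1e3, 1))
  # Keep only digits and the decimal point, then scale to plain euros
  euros <- as.numeric(gsub("[^0-9.]", "", x)) * multiplier
  # Express the result in the requested unit
  euros / if (unit == "M") 1e6 else 1e3
}

# "\u20ac" is the euro sign, written as an escape so the script stays ASCII
sample_values <- c("\u20ac95.5M", "\u20ac275K", "\u20ac0")
parse_euro(sample_values, unit = "M")
```

With a helper like this, Value could be expressed in millions and Wage in thousands without silently turning mixed-suffix entries into NA.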

Selected columns had to be changed to numeric to allow data analysis.

df <- df %>% mutate_at(c('Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control', 'Composure', 'Crossing'), as.numeric)

The column “Preferred Positions” had multiple entries. This is understandable because a player may prefer more than one position. To explore this, the Preferred Positions column was split into two (2) columns and any extra values were dropped.

df <- df %>% separate('Preferred Positions', c('Preferred1', 'Preferred2'))
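
To make the "keep two, drop the rest" behaviour concrete, here is a minimal sketch that uses base R strsplit instead of tidyr::separate, so it runs without extra packages. The position strings are made up, and whitespace is assumed to be the separator.

```r
# Made-up position strings (illustration only); whitespace is the assumed separator
positions <- c("ST LW", "GK", "CDM CM CB")

# Split each entry into its individual positions
parts <- strsplit(trimws(positions), "\\s+")

# Keep the first and second position; a missing second position becomes NA,
# and any third or later position is dropped
Preferred1 <- vapply(parts, function(p) p[1], character(1))
Preferred2 <- vapply(parts, function(p) p[2], character(1))

data.frame(positions, Preferred1, Preferred2)
```

Players with a single preferred position end up with NA in the second column, which is one source of the NA values handled in the next step.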

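is.null(df) returns FALSE, which confirms only that the data frame object itself is not NULL; it does not inspect individual cells. Missing cells in a data frame are stored as NA, and the dataset did contain NA values that had to be handled.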

(is.null(df))
## [1] FALSE
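
The difference between testing the object and testing its cells can be sketched on a small, made-up data frame (toy and its values are hypothetical):

```r
# Made-up data frame with missing cells (illustration only)
toy <- data.frame(
  ID   = c(1, 2, 3),
  Age  = c(21, NA, 30),
  Wage = c(NA, NA, 15)
)

# is.null() tests the object itself, not the cells inside it
object_is_null <- is.null(toy)

# is.na() tests every cell; colSums() then counts the NA cells per column
na_per_column <- colSums(is.na(toy))
na_per_column
```

Running colSums(is.na(df)) on the real dataset would show which columns carry the missing values.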

Rather than dropping the rows that contain NAs, the NAs were converted to 0. Keep in mind that these zeros are then counted in any mean computed later.

df[is.na(df)] <- 0
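
How a zero fill affects an average can be sketched with three made-up wages. This is an illustration of the trade-off, not a result from the dataset.

```r
wages <- c(10, 20, NA)   # made-up weekly wages, one of them missing

# Option 1: replace NA with 0, as the analysis does; the zero enters the mean
wages_zero <- wages
wages_zero[is.na(wages_zero)] <- 0
mean_with_zero <- mean(wages_zero)

# Option 2: leave the NA in place and skip it when averaging
mean_skipping_na <- mean(wages, na.rm = TRUE)

c(zero_fill = mean_with_zero, na_rm = mean_skipping_na)
```

The zero fill gives the lower figure, so group means computed after this step are pulled down wherever values were missing.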

Duplicate IDs were identified: the dataset has 17929 unique IDs against 17981 rows, so 52 rows are duplicates.

length(unique(df$ID))
## [1] 17929

Duplicates were removed to represent the true number of players.

df1 <- df[!duplicated(df$ID),]
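
The same duplicate check and removal can be sketched on a made-up table (toy is hypothetical); duplicated() flags every row whose ID has already appeared, so the first occurrence of each ID is kept.

```r
# Made-up player table with repeated IDs (illustration only)
toy <- data.frame(
  ID   = c(101, 102, 102, 103, 101),
  Name = c("A", "B", "B", "C", "A")
)

# Rows whose ID was already seen earlier in the table
n_duplicates <- sum(duplicated(toy$ID))

# Keep only the first row for each ID
deduped <- toy[!duplicated(toy$ID), ]
deduped
```

The duplicate count always equals the number of rows minus the number of unique IDs, which is the comparison made above.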

The final cleansed dataset is now ready to be analyzed.

DATA VISUALIZATION

Top 15 FIFA Players

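The top 15 FIFA players data frame was created using the Overall rating as the measure. The plot takes the first 15 rows of this data frame, which relies on the rows already being ordered by Overall.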

fifaplayers15 <- df1[ , c("Name", "Age", "Overall", "Value", "Wage", "Club", "Nationality")]
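
If the row order cannot be taken for granted, the ranking can be made explicit before taking the first rows. A minimal sketch, with a hypothetical helper top_n_players and made-up ratings:

```r
# Made-up players, deliberately not sorted by rating (illustration only)
toy_players <- data.frame(
  Name    = c("P1", "P2", "P3", "P4"),
  Overall = c(80, 94, 88, 91)
)

# Hypothetical helper: sort by Overall (highest first), then keep the first n rows
top_n_players <- function(players, n) {
  ranked <- players[order(-players$Overall), ]
  head(ranked, n)
}

top2 <- top_n_players(toy_players, 2)
top2
```

order() is stable, so players with equal ratings keep their original relative order.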

The visualization shows that the top-ranked players vary considerably in Value, which could be due to various factors, including the club they play for. The Overall ratings of the top 15 players range from 94 (highest) to 84 (lowest). Neymar has the highest Value, followed by Messi, while Sergio Ramos has the lowest Value in this group despite his high rating.

ggplot(fifaplayers15[1:15,], aes(x = reorder (Name, -Value), y=Value)) +
  geom_bar(colour="deeppink", fill="forestgreen", stat="identity") +
  labs(title = "Top 15 FIFA Players Overall by Value", x = "Player Name", y = "Value (Million Euros)") +
  theme(plot.title = element_text(hjust=0.5))

Mean Wage of Players by Age

The second visualization analyzes the relationship between the age and wage of players. The data frame was created using the Age and Wage columns.

Awage_df <- fifaplayers15 %>%
  select(Age, Wage) %>%
  data.frame()

Because the dataset is large, the aggregate function was used to compute the mean wage for each age, which makes the data easier to visualize.

Awage_df <- aggregate(Wage ~ Age, data = Awage_df, mean)
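
What the formula Wage ~ Age does inside aggregate can be sketched on a handful of made-up rows (toy is hypothetical): the rows are grouped by Age and the mean Wage is returned for each group.

```r
# Made-up ages and weekly wages (illustration only)
toy <- data.frame(
  Age  = c(20, 20, 30, 30, 30),
  Wage = c(10, 20, 60, 90, 30)
)

# One row per distinct Age, holding the mean Wage of that group
mean_wage <- aggregate(Wage ~ Age, data = toy, FUN = mean)
mean_wage
```

The result has one row per age, which is why the bar plot needs only one bar per age rather than one per player.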

As shown in the bar plot, younger players earn less on average than players in mid-career. Players between the ages of 25 and 35 earn the most, with 30-year-olds having the highest mean wage, and the older age groups earn less as well.

V2 <- ggplot(Awage_df, aes(x = Age, y = Wage, fill=Wage)) +
  geom_bar(stat="identity") +
  labs(title = "Mean Wage of players by Age", x= "Player Age", y = "Wage (Thousand Euros)") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_x_continuous(breaks = seq(15,45,by=5))
V2 + scale_fill_gradient(low = "blue", high = "deeppink1")

Top 15 FIFA Player’s Attributes

To isolate and analyze the overall attributes of the top 15 players, a subset of the attributes was selected to create the attribute data frame.

Attributes <- df1[ , c("Name","Overall", "Acceleration", "Aggression", "Agility", "Balance", "Ball control", "Composure", "Crossing")]
Attributes <- Attributes[order(-Attributes$Overall),][1:15,]
Attributes <- gather(Attributes, Index, Value, Acceleration:Crossing)

The plot shows that the top 15 players vary in the levels of their attributes. This is understandable because a player's attributes depend on the position they play.

ggplot(Attributes, aes(fill = Index, y = Value, x = Name)) +
  geom_bar(position="stack", stat="identity") +
  coord_flip() +
  labs(title="Top 15 FIFA Player's Attributes", x= "Player Names", y="Attribute Value", fill="Index") +
  theme_solarized_2() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(breaks = seq(0,600,by=100))

Highest Paid Clubs (Top 20)

The intent of creating the clubs data frame is to analyze club wages and determine which clubs rank as the highest and lowest paying.

Fifaclubs <- df1[, c("Club", "Wage")]
Fifaclubs <- aggregate(Wage ~ Club, data = Fifaclubs, mean)
Fifaclubs <- Fifaclubs[order(-Fifaclubs$Wage),][1:20,]

The top 20 highest paid clubs show a huge margin between the highest and the lowest. FC Barcelona pays the highest mean wage, approximately 190 thousand Euros, followed by Real Madrid CF. At the bottom, at #20, is Atlético Madrid with a mean wage of approximately 50 thousand Euros.

ggplot(data = Fifaclubs, aes(x = reorder(Club, Wage), y = Wage)) +
  geom_bar(colour="darkslategray1", fill="darkmagenta", stat="identity") +
  coord_flip() +
  labs(title = "Top 20 Highest Paid Clubs", x = "Club Names", y = "Mean Wage (Thousand Euros)") +
  scale_y_continuous(breaks = seq(0,200,by=20)) +
  theme_clean() +
  theme(plot.title = element_text(hjust=0.5))

Mean Value of Players by Preferred Position

The preferred position data frame was created using the first of the split position columns (Preferred1) and the Value column. The intent is to analyze how players' value relates to their preferred position.

Fifapositions <- df1[, c("Value", "Preferred1")]
Fifapositions <- aggregate(Value ~ Preferred1, data = Fifapositions, mean)

In soccer, players in certain positions tend to be worth more, for various reasons tied to their attributes and to whether the club they play for pays more. Since, according to this data, the majority of these players have at least one and at most three preferred positions, I wonder whether playing a certain position guarantees more value.

A deep dive into the value of the FIFA players by position shows that LW (Left Wingers) and RW (Right Wingers) have the highest mean Value. The GK (Goalkeeper) position has the lowest.

ggplot(data=Fifapositions, aes(x = reorder(Preferred1,Value), y = Value, fill = Value)) +
  geom_bar(stat="identity") +
  labs(title = "Mean Value of Players by Preferred Position", x = "Player Preferred Position", y = "Player Mean Value (Million Euros)") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_fill_gradient(low = "lightcoral", high = "mediumvioletred")

CONCLUSION

The dataset had to be cleansed to arrive at a workable final data frame. This analysis visualized players' overall value, age, wage, and attributes. The players in the dataset play for a club and also for their country.

  • The player with the highest value is Neymar, with L. Messi ranking second.
  • The highest paid club is FC Barcelona, with Real Madrid CF ranking second. I wonder what could make other clubs, such as Liverpool, rank higher; could it be their players' attributes?
  • The levels of the players' attributes vary; however, acceleration and composure appear to be at approximately equal levels across the top players.
  • The younger players (age group 16-26) and older players (age group 33-46) earn less than the mid-aged players (age group 27-32). The highest earners are age 30.