Data Wrangling Assignment 2

Executive Summary
Required Packages
Data
Understand the data
Tidy and manipulate data - 1 & 2
Understand the data continued..
Scan - 1 : Scan for missing, special and obvious errors
Scan - 2 : Scan for outliers
Transformation (Shown on one variable only)

Executive Summary

The data is FIFA players information from 2015 edition of the game to the 2020 edition. It was read using the ‘readr’ package. The variables were examined and changed using base r functions and some variables were changed after they were used to derive another variable for the dataset. No changes to the variable names were done as they are quite understandable. The data was checked against the Tidy data principles. Only one new variable was derived for extra insight. The variables were checked for missing values and only one variable has missing values in the original dataset with an explainable reason and the derived variable has missing values in the same rows because it has been derived from beforementioned variable. No imputation was done for empty rows. Different numerical variables were checked for normality using histograms and qq-plots before choosing the outlier detection methods to find the right method for dealing with outliers. Finally, I tried to transform a non-normal numerical variable to approximately normal variable using appropriate functions.

Required Packages

library(readr) # For reading the data
library(magrittr) # For code understandability and cleanliness
library(dplyr) # For data wrangling
library(tidyr) # For data wrangling
library(ggplot2) # For plotting outliers
library(MVN) # For outlier detection
library(outliers) # For outlier detection
library(forecast) # For data transformation

Data

The data is selected from Kaggle. It consists of player’s information in one of the most played video gaame FIFA. Ranging from physical measures, playing abilities to mental and emotional strength, it consists of 104 variables. For better demonstration of data cleaning process I have selected only 12 variables. The selected variables are just basic personal, demographic and overall abilities information of the player. Player information for each edition of FIFA since 2015 was in a separate file and were merged using cbind function. To directly go to the dataset please click here. The final data will contain following 13 variables: 1. sofifa_id - unique id given to each player. 2. long_name - full name of the player. 3. dob - Date of Birth 4. nationality - Nationality of the player 5. club - the club for which the player plays 6. overall - overall rating of the player 7. value_eur - Market value of the player in Euros 8. wage_eur - Weekly wage/salary of the player in Euros 9. joined - The date on which player joined his current club 10. fifaEdition - Edition of the game 11. age - Age of the player 12. player_positions - contains all the positions the player can play (later changed to one position in each row with variable name ‘player_position’) 13. lengthOfStay - Number of years, the player has been playing for his current club

# Read players data for all editions of FIFA since 2015
players_15 <- read_csv("C:/Users/shekh/Downloads/RMIT/Semester 2/Data Wrangling/Assignment 2/FIFA Dataset/players_15.csv")
players_16 <- read_csv("C:/Users/shekh/Downloads/RMIT/Semester 2/Data Wrangling/Assignment 2/FIFA Dataset/players_16.csv")
players_17 <- read_csv("C:/Users/shekh/Downloads/RMIT/Semester 2/Data Wrangling/Assignment 2/FIFA Dataset/players_17.csv")
players_18 <- read_csv("C:/Users/shekh/Downloads/RMIT/Semester 2/Data Wrangling/Assignment 2/FIFA Dataset/players_18.csv")
players_19 <- read_csv("C:/Users/shekh/Downloads/RMIT/Semester 2/Data Wrangling/Assignment 2/FIFA Dataset/players_19.csv")
players_20 <- read_csv("C:/Users/shekh/Downloads/RMIT/Semester 2/Data Wrangling/Assignment 2/FIFA Dataset/players_20.csv")

# Adding the edition column to all the dataframes
players_15 <- players_15 %>% mutate(fifaEdition = 2015)
players_16 <- players_16 %>% mutate(fifaEdition = 2016)
players_17 <- players_17 %>% mutate(fifaEdition = 2017)
players_18 <- players_18 %>% mutate(fifaEdition = 2018)
players_19 <- players_19 %>% mutate(fifaEdition = 2019)
players_20 <- players_20 %>% mutate(fifaEdition = 2020)

# Selecting columns that are required for analysis
players_15 <- players_15 %>% select(sofifa_id, long_name, dob, nationality, club,
                                    overall, player_positions, value_eur, wage_eur,
                                    joined, fifaEdition, age)

players_16 <- players_16 %>% select(sofifa_id, long_name, dob, nationality, club,
                                    overall, player_positions, value_eur, wage_eur,
                                    joined, fifaEdition, age)

players_17 <- players_17 %>% select(sofifa_id, long_name, dob, nationality, club,
                                    overall, player_positions, value_eur, wage_eur,
                                    joined, fifaEdition, age)

players_18 <- players_18 %>% select(sofifa_id, long_name, dob, nationality, club,
                                    overall, player_positions, value_eur, wage_eur,
                                    joined, fifaEdition, age)

players_19 <- players_19 %>% select(sofifa_id, long_name, dob, nationality, club,
                                    overall, player_positions, value_eur, wage_eur,
                                    joined, fifaEdition, age)

players_20 <- players_20 %>% select(sofifa_id, long_name, dob, nationality, club,
                                    overall, player_positions, value_eur, wage_eur,
                                    joined, fifaEdition, age)

players <- bind_rows(players_15, players_16, players_17, players_18, players_19, players_20)

# Merge the five year's data sets into data frame (satisfies requirement 1)
players <- as.data.frame(bind_rows(players_15, players_16, players_17, players_18, players_19, players_20))

head(players)

##   sofifa_id                           long_name        dob nationality
## 1    158023      Lionel Andrés Messi Cuccittini 1987-06-24   Argentina
## 2     20801 Cristiano Ronaldo dos Santos Aveiro 1985-02-05    Portugal
## 3      9014                        Arjen Robben 1984-01-23 Netherlands
## 4     41236                  Zlatan Ibrahimovic 1981-10-03      Sweden
## 5    167495                        Manuel Neuer 1986-03-27     Germany
## 6    176580            Luis Alberto Suárez Díaz 1987-01-24     Uruguay
##                  club overall player_positions value_eur wage_eur     joined
## 1        FC Barcelona      93               CF         0        0 2004-07-01
## 2         Real Madrid      92           LW, LM         0        0 2009-07-01
## 3   FC Bayern München      90       RM, LM, RW         0        0 2009-08-28
## 4 Paris Saint-Germain      90               ST         0        0 2012-07-01
## 5   FC Bayern München      90               GK         0        0 2011-07-01
## 6        FC Barcelona      89           ST, CF         0        0 2014-07-11
##   fifaEdition age
## 1        2015  27
## 2        2015  29
## 3        2015  30
## 4        2015  32
## 5        2015  28
## 6        2015  27

Understand the data

Lets take a look at the structure of data (satisfies requirement 2). As you’ll see, the data contains multiple data types.

class(players) #Class of the whole data

## [1] "data.frame"

glimpse(players)

## Rows: 100,995
## Columns: 12
## $ sofifa_id        <dbl> 158023, 20801, 9014, 41236, 167495, 176580, 183277...
## $ long_name        <chr> "Lionel Andrés Messi Cuccittini", "Cristiano Ronal...
## $ dob              <date> 1987-06-24, 1985-02-05, 1984-01-23, 1981-10-03, 1...
## $ nationality      <chr> "Argentina", "Portugal", "Netherlands", "Sweden", ...
## $ club             <chr> "FC Barcelona", "Real Madrid", "FC Bayern München"...
## $ overall          <dbl> 93, 92, 90, 90, 90, 89, 88, 88, 88, 88, 88, 87, 87...
## $ player_positions <chr> "CF", "LW, LM", "RM, LM, RW", "ST", "GK", "ST, CF"...
## $ value_eur        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wage_eur         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ joined           <date> 2004-07-01, 2009-07-01, 2009-08-28, 2012-07-01, 2...
## $ fifaEdition      <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20...
## $ age              <dbl> 27, 29, 30, 32, 28, 27, 23, 30, 29, 31, 28, 24, 25...

As you can see the ‘players’ variable is a data frame and contains 12 columns which has 2 date variables, 4 character variables and rest of the variables are double. Also the ‘player_positions variable’ contains all positions a player can play, which violates the Tidy data principles. Therefore before I convert the variable types, I’d like to tidy my data so that all the conversions are done at once.

Tidy and manipulate data - 1 & 2

To get the positions a player can play, I’ll first have to separate the different positions he can play into different variables. A player can play at most 4 positions, accordingly the ‘player_positions’ variable is first broken into 4 variable and then gathered in ‘player_position’ to get the 4 columns as individual rows. This introduces a lot of null values, to avoid all these null values, all the rows having ‘player_position’ as ‘NA’ are filtered out. Then finally first 5 rows are printed to give an example of how the multiple positions have beeb gathered into a single column successfully.

In football it is sometime helpful to look at how long a player has been playing at one club because some players stay at their childhood club and are not likely to move to another club. To get the variable of a player’s length of stay at his current club I’ve added a new variable into the dataset by the name ‘lengthOfStay’. I have used mutate function to get the new variable which is the difference of game’s edition and year of date on which the player joined his current club.

glimpse(players)

## Rows: 100,995
## Columns: 12
## $ sofifa_id        <dbl> 158023, 20801, 9014, 41236, 167495, 176580, 183277...
## $ long_name        <chr> "Lionel Andrés Messi Cuccittini", "Cristiano Ronal...
## $ dob              <date> 1987-06-24, 1985-02-05, 1984-01-23, 1981-10-03, 1...
## $ nationality      <chr> "Argentina", "Portugal", "Netherlands", "Sweden", ...
## $ club             <chr> "FC Barcelona", "Real Madrid", "FC Bayern München"...
## $ overall          <dbl> 93, 92, 90, 90, 90, 89, 88, 88, 88, 88, 88, 87, 87...
## $ player_positions <chr> "CF", "LW, LM", "RM, LM, RW", "ST", "GK", "ST, CF"...
## $ value_eur        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ wage_eur         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ joined           <date> 2004-07-01, 2009-07-01, 2009-08-28, 2012-07-01, 2...
## $ fifaEdition      <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20...
## $ age              <dbl> 27, 29, 30, 32, 28, 27, 23, 30, 29, 31, 28, 24, 25...

# Tidying the data, creating position variable so each row contains one position
players <- players %>% separate(player_positions,
                                into = c("position1", "position2", "position3", "position4"),
                                sep = ", ")

players <- players %>% gather(position1, position2, position3, position4,
                              key = position_number,
                              value = player_position)

players <- players %>% select(-position_number)

players <-  players %>% filter(!is.na(player_position))

# Creating a new column to store player's length of stay at current club
players <- players %>% mutate(lengthOfStay =
                                as.integer(fifaEdition)
                              - as.integer(format(joined, format = "%Y"))
                              )

head(players)

##   sofifa_id                           long_name        dob nationality
## 1    158023      Lionel Andrés Messi Cuccittini 1987-06-24   Argentina
## 2     20801 Cristiano Ronaldo dos Santos Aveiro 1985-02-05    Portugal
## 3      9014                        Arjen Robben 1984-01-23 Netherlands
## 4     41236                  Zlatan Ibrahimovic 1981-10-03      Sweden
## 5    167495                        Manuel Neuer 1986-03-27     Germany
## 6    176580            Luis Alberto Suárez Díaz 1987-01-24     Uruguay
##                  club overall value_eur wage_eur     joined fifaEdition age
## 1        FC Barcelona      93         0        0 2004-07-01        2015  27
## 2         Real Madrid      92         0        0 2009-07-01        2015  29
## 3   FC Bayern München      90         0        0 2009-08-28        2015  30
## 4 Paris Saint-Germain      90         0        0 2012-07-01        2015  32
## 5   FC Bayern München      90         0        0 2011-07-01        2015  28
## 6        FC Barcelona      89         0        0 2014-07-11        2015  27
##   player_position lengthOfStay
## 1              CF           11
## 2              LW            6
## 3              RM            6
## 4              ST            3
## 5              GK            4
## 6              ST            1

Understand the data continued..

After having made the data tidy, we can come back to type conversions to proper data types like character to factors and ordered factors.

Since edition of fifa is just the year for which the game was released, it is converted to ordered factor because 2020 edition is the latest edition and 2015 is the oldest in perspective to current dataset. Other 4 variables namely ‘club’, ‘nationality’, ‘player_position’ and ‘long_name’ are converted to unordered factors.

Also player’s fifa ID is changed to character because it’s just an identifier and not a quatifiable value.

# Converting to unordered factors
factorColumns <- c("club", "nationality", "player_position", "long_name")
players[factorColumns] <- lapply(players[factorColumns], factor)

# Change fifa edition to ordered factor
players$fifaEdition <- factor(players$fifaEdition, ordered = TRUE,
                              levels = c(2015, 2016, 2017, 2018,
                                         2019, 2020))

players$sofifa_id <- as.character(players$sofifa_id)

glimpse(players)

## Rows: 162,751
## Columns: 13
## $ sofifa_id       <chr> "158023", "20801", "9014", "41236", "167495", "1765...
## $ long_name       <fct> Lionel Andrés Messi Cuccittini, Cristiano Ronaldo d...
## $ dob             <date> 1987-06-24, 1985-02-05, 1984-01-23, 1981-10-03, 19...
## $ nationality     <fct> Argentina, Portugal, Netherlands, Sweden, Germany, ...
## $ club            <fct> FC Barcelona, Real Madrid, FC Bayern München, Paris...
## $ overall         <dbl> 93, 92, 90, 90, 90, 89, 88, 88, 88, 88, 88, 87, 87,...
## $ value_eur       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ wage_eur        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ joined          <date> 2004-07-01, 2009-07-01, 2009-08-28, 2012-07-01, 20...
## $ fifaEdition     <ord> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201...
## $ age             <dbl> 27, 29, 30, 32, 28, 27, 23, 30, 29, 31, 28, 24, 25,...
## $ player_position <fct> CF, LW, RM, ST, GK, ST, LM, ST, CM, LM, ST, RM, ST,...
## $ lengthOfStay    <int> 11, 6, 6, 3, 4, 1, 3, 3, 13, 8, NA, 2, 1, 13, 10, 3...

Scan - 1 : Scan for missing, special and obvious errors

First a search for total missing values throughout the dataset was done. A total of around 26.5 thousand values were found to be missing. Then column wise check was done. I found that a total of 13223 values were missing in the ‘joined’ column and an equal number of missing values are present in the ‘lengthOfStay’ column since it was derived from ‘joined’ variable. After a careful inspection, it is only right that these values are left null as these players may have come from their current clubs youth academy and do not have a value in ‘joined’ column. No change was done to the dataset.

Also, it is evident that data has no extreme or impossible values. You can easily look at summary of each column and min-max as well as count to see no special/obvious errors.

print( sum( is.na(players) ) )

## [1] 26446

print( colSums( is.na(players) ) )

##       sofifa_id       long_name             dob     nationality            club 
##               0               0               0               0               0 
##         overall       value_eur        wage_eur          joined     fifaEdition 
##               0               0               0           13223               0 
##             age player_position    lengthOfStay 
##               0               0           13223

print( summary(players) )

##   sofifa_id                                         long_name     
##  Length:162751      Liam Kelly                           :    30  
##  Class :character   Abdullah Mohammed Al Zoari Al Dossary:    27  
##  Mode  :character   Reece Brown                          :    25  
##                     Ben Davies                           :    24  
##                     Brecht Dejaegere                     :    23  
##                     Thomas Müller                        :    23  
##                     (Other)                              :162599  
##       dob                nationality                    club       
##  Min.   :1969-07-19   England  : 15023   Borussia Dortmund:   385  
##  1st Qu.:1988-02-29   Spain    : 11065   FC Schalke 04    :   377  
##  Median :1991-11-15   Germany  :  9857   VfL Bochum 1848  :   369  
##  Mean   :1991-06-30   France   :  8666   1. FSV Mainz 05  :   364  
##  3rd Qu.:1995-02-05   Argentina:  8360   Manchester United:   357  
##  Max.   :2002-08-08   Italy    :  6772   UD Las Palmas    :   357  
##                       (Other)  :103008   (Other)          :160542  
##     overall       value_eur            wage_eur          joined          
##  Min.   :40.0   Min.   :        0   Min.   :     0   Min.   :1991-06-01  
##  1st Qu.:62.0   1st Qu.:   160000   1st Qu.:  1000   1st Qu.:2014-01-01  
##  Median :66.0   Median :   575000   Median :  2000   Median :2015-07-17  
##  Mean   :66.3   Mean   :  2043474   Mean   :  9643   Mean   :2015-05-18  
##  3rd Qu.:71.0   3rd Qu.:  1500000   3rd Qu.:  8000   3rd Qu.:2017-07-01  
##  Max.   :94.0   Max.   :123000000   Max.   :565000   Max.   :2019-08-18  
##                                                      NA's   :13223       
##  fifaEdition       age        player_position  lengthOfStay   
##  2015:25211   Min.   :16.00   CM     :20558   Min.   : 1.000  
##  2016:23966   1st Qu.:22.00   CB     :20425   1st Qu.: 1.000  
##  2017:26502   Median :25.00   ST     :18301   Median : 2.000  
##  2018:28299   Mean   :25.15   CDM    :14812   Mean   : 2.653  
##  2019:29306   3rd Qu.:28.00   LM     :13785   3rd Qu.: 3.000  
##  2020:29467   Max.   :47.00   RM     :13395   Max.   :28.000  
##                               (Other):61475   NA's   :13223

Scan - 2 : Scan for outliers

Before scanning for outlier, I investigated the distribution of each numerical variable and took two different approaches depending upon the distribution of the variables for univariate outlier detection method. If a variable was approximately normal, distance based method was used to detect the outliers, which is, identifying all values which are more than 3 standard deviations away from the mean. If a variable was not normally ditributed, then Tukey’s method of outlier detection was performed.

# Gathering all numerical variables in one dataframe to plot all variables at once
boxPlotTable <- players[, c(6:8, 11, 13)] %>% gather(key = "variableNames", value = "variableValue")

# Plotting histograms of all variables to get a rough estimate of normality
ggplot(data = boxPlotTable, aes(variableValue)) + geom_histogram(bins = 20, colour = "blue") +
  facet_wrap(~variableNames, scales = "free")

# Plotting QQ plot for a better look at normality of variables
ggplot(data = boxPlotTable, aes(sample = variableValue)) +
  stat_qq(color = "blue", alpha = 0.5) +
  stat_qq_line(color = "red") + facet_wrap(~variableNames, scales = "free")

# Plotting box plots to identify all points above and below the whiskers of the box plot
ggplot(data = boxPlotTable, aes(x = variableNames, y=variableValue)) + geom_boxplot() + 
  facet_wrap( ~ variableNames, scales="free")

# Finally plotting scatter plot of all variables to get a rough estimate of multivariate outliers
pairs(players[, c(6:8, 11, 13)])

After looking at the scatter plot, we get an idea that, we might have some outliers and a careful approach is required, therefore the ‘Mahalanobis’ method of outlier detection is to be done.

# Mahalanobis outlier detection

# results <- mvn(data = players[, c(6:8, 11, 13)], multivariateOutlierMethod = "quan", showOutliers = TRUE)

Mahalanobis test couldn’t be run because of the size of the dataset and it is recommended NOT to delete outliers using univariate outlier detection methods in multivariate data therefore no outliers were deleted due to lack of evidence of outliers.

If one still wishes to delete outliers from multivariate data using single variate outlier detection methods like Tukey’s or distance based methods, you can follow the below code. But it is strongly advisable to not delete data points until sufficient evidence is found.

# For approximately normally distributed variables like age and overall
# count of outliers in columns
sapply(players[, c(6, 11)], function(x){sum(abs(scores(x, type = "z"))>3)})

## overall     age 
##     324     279

# Deleting outliers from copy of players object
player_copy <- players %>% filter(abs(scores(age, type = "z")) < 3)
player_copy <- player_copy %>% filter(abs(scores(overall, type = "z")) < 3)

print(nrow(players))

## [1] 162751

print(nrow(player_copy))

## [1] 162155

# For variables which are not approximately normally distributed
#Function to delete outliers
outlier.qr <- function(data, var_name)
{
  data$var_name <- eval(substitute(var_name), data) 
  iqr <- IQR(data$`var_name`, na.rm = TRUE)
  qr <- quantile(data$var_name, c(0.25,0.75))
  Q1 <- qr[[1]] - 1.5*iqr
  Q3 <- qr[[2]] + 1.5*iqr
  data <- data %>% filter(var_name < Q3 & var_name > Q1)
  return(data)
}

# Numerical columns which are not normally ditributed
variableNames <- colnames(players[, c(7,8,13)])
variableNames <- lapply(variableNames, as.name)

# Deleting outliers
# Deleting NA because quantile can not handle NA values
player_copy <- na.omit(player_copy)
 for (i in seq_along(variableNames)) {
   player_copy <- outlier.qr(player_copy, eval(variableNames[[i]]))
 }
nrow(players)

## [1] 162751

nrow(player_copy)

## [1] 106244

Transformation (Shown on one variable only)

If the variables are normally distributed statistical analysis becomes easier. Due to this reason, sometimes transformations are desirable. As we saw earlier from the QQ plots and histograms that ‘lengthOfStay’ is not normally distributed infact it is right skewed, we will try to transform this variable to approximately normal

# Recap of wage distribution
hist(players$lengthOfStay, main = "Player's length of stay")

# Trying with BoxCox transformaation
BoxCox(players$lengthOfStay, lambda = "auto") %>% hist(main = "Players length of stay after BoxCox")

#Trying with log transformation
log_length <- log(players$lengthOfStay)
hist(log_length, main = "Histogram after log transformation")

As you can see clearly sometime a variable can not be transformed to an approximately normal ditribution.