Required packages

library(readr)
library(dplyr)
library(tidyr)
library(outliers)
library(forecast)

Executive Summary

This report preprocesses player data from the video game FIFA 19. The datasets were downloaded from Kaggle (https://www.kaggle.com/karangadiya/fifa19). The first dataset contains general information about the players and the second contains their detailed ability ratings. Each dataset includes a variable of unique identification numbers (ID).

In this report, the two datasets are merged and only players with an overall rating of 80 or above are retained. Variables with incorrect data types are then converted, and the dataset is checked against the principles of tidy data. A variable named “avg_ab” is created from the players’ detailed abilities to show their average ability. In addition, four subsets of the data are created according to player position, and in each subset a variable holding the average rating of the abilities required for that position is created as well.

Missing values and outliers are detected and handled with appropriate methods. Finally, several data transformations are applied to the selected variable “Overall” and compared, and the Box-Cox transformation is found to be the most useful.

Data

Both datasets were downloaded from https://www.kaggle.com/karangadiya/fifa19.

fifa_a <- read_csv("fifa19a.csv")
Parsed with column specification:
cols(
  ID = col_double(),
  Name = col_character(),
  Age = col_double(),
  Nationality = col_character(),
  Overall = col_double(),
  Potential = col_double(),
  Club = col_character(),
  Value = col_double(),
  Wage = col_double(),
  Special = col_double(),
  `Preferred Foot` = col_character(),
  `International Reputation` = col_double(),
  `Weak Foot` = col_double(),
  `Skill Moves` = col_double(),
  Position = col_character(),
  `Jersey Number` = col_double(),
  Weight = col_character()
)
fifa_b <- read_csv("fifa19b.csv")
Parsed with column specification:
cols(
  .default = col_double()
)
See spec(...) for full column specifications.
fifa <- fifa_a %>% left_join(fifa_b, by = "ID")
fifa_fil <- fifa %>% filter(Overall >= 80)
head(fifa_fil)
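
The merge above relies on ID being a unique key in both files. A minimal sketch of how that assumption could be checked before joining is given below. The tables `players` and `abilities` are made-up stand-ins for fifa_a and fifa_b, `is_unique_key` is a hypothetical helper, and base R's `merge()` is used in place of `left_join()` so the example has no package dependencies.

```r
# Hypothetical toy tables standing in for fifa_a and fifa_b
players   <- data.frame(ID = c(1, 2, 3), Overall = c(94, 79, 85))
abilities <- data.frame(ID = c(1, 2, 3), Crossing = c(84, 60, 77))

# A key is safe to join on when it has no missing values and no duplicates
is_unique_key <- function(x) !anyNA(x) && !anyDuplicated(x)
stopifnot(is_unique_key(players$ID), is_unique_key(abilities$ID))

# Base-R equivalent of a left join on ID, followed by the rating filter
merged   <- merge(players, abilities, by = "ID", all.x = TRUE)
filtered <- merged[merged$Overall >= 80, ]
print(filtered)
```

If the key were duplicated in the second table, a left join would silently multiply rows, so comparing `nrow(merged)` with `nrow(players)` is a cheap extra check.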

Understand

str(fifa_fil)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    555 obs. of  51 variables:
 $ ID                      : num  158023 20801 190871 193080 192985 ...
 $ Name                    : chr  "L. Messi" "Cristiano Ronaldo" "Neymar Jr" "De Gea" ...
 $ Age                     : num  31 33 26 27 27 27 32 31 32 25 ...
 $ Nationality             : chr  "Argentina" "Portugal" "Brazil" "Spain" ...
 $ Overall                 : num  94 94 92 91 91 91 91 91 91 90 ...
 $ Potential               : num  94 94 93 93 92 91 91 91 91 93 ...
 $ Club                    : chr  "FC Barcelona" "Juventus" "Paris Saint-Germain" "Manchester United" ...
 $ Value                   : num  1.10e+08 7.70e+07 1.18e+08 7.20e+07 1.02e+08 ...
 $ Wage                    : num  565000 405000 290000 260000 355000 340000 420000 455000 380000 94000 ...
 $ Special                 : num  2202 2228 2143 1471 2281 ...
 $ Preferred Foot          : chr  "Left" "Right" "Right" "Right" ...
 $ International Reputation: num  5 5 5 4 4 4 4 5 4 3 ...
 $ Weak Foot               : num  4 4 5 3 5 4 4 4 3 3 ...
 $ Skill Moves             : num  4 5 5 1 4 4 4 3 3 1 ...
 $ Position                : chr  "RF" "ST" "LW" "GK" ...
 $ Jersey Number           : num  10 7 10 1 7 10 10 9 15 1 ...
 $ Weight                  : chr  "159lbs" "183lbs" "150lbs" "168lbs" ...
 $ Crossing                : num  84 84 79 17 93 81 86 77 66 13 ...
 $ Finishing               : num  95 94 87 13 82 84 72 93 60 11 ...
 $ HeadingAccuracy         : num  70 89 62 21 55 61 55 77 91 15 ...
 $ ShortPassing            : num  90 81 84 50 92 89 93 82 78 29 ...
 $ Volleys                 : num  86 87 84 13 82 80 76 88 66 13 ...
 $ Dribbling               : num  97 88 96 18 86 95 90 87 63 12 ...
 $ Curve                   : num  93 81 88 21 85 83 85 86 74 13 ...
 $ FKAccuracy              : num  94 76 87 19 83 79 78 84 72 14 ...
 $ LongPassing             : num  87 77 78 51 91 83 88 64 77 26 ...
 $ BallControl             : num  96 94 95 42 91 94 93 90 84 16 ...
 $ Acceleration            : num  91 89 94 57 78 94 80 86 76 43 ...
 $ SprintSpeed             : num  86 91 90 58 76 88 72 75 75 60 ...
 $ Agility                 : num  91 87 96 60 79 95 93 82 78 67 ...
 $ Reactions               : num  95 96 94 90 91 90 90 92 85 86 ...
 $ Balance                 : num  95 70 84 43 77 94 94 83 66 49 ...
 $ ShotPower               : num  85 95 80 31 91 82 79 86 79 22 ...
 $ Jumping                 : num  68 95 61 67 63 56 68 69 93 76 ...
 $ Stamina                 : num  72 88 81 43 90 83 89 90 84 41 ...
 $ Strength                : num  59 79 49 64 75 66 58 83 83 78 ...
 $ LongShots               : num  94 93 82 12 91 80 82 85 59 12 ...
 $ Aggression              : num  48 63 56 38 76 54 62 87 88 34 ...
 $ Interceptions           : num  22 29 36 30 61 41 83 41 90 19 ...
 $ Positioning             : num  94 95 89 12 87 87 79 92 60 11 ...
 $ Vision                  : num  94 82 87 68 94 89 92 84 63 70 ...
 $ Penalties               : num  75 85 81 40 79 86 82 85 75 11 ...
 $ Composure               : num  96 95 94 68 88 91 84 85 82 70 ...
 $ Marking                 : num  33 28 27 15 68 34 60 62 87 27 ...
 $ StandingTackle          : num  28 31 24 21 58 27 76 45 92 12 ...
 $ SlidingTackle           : num  26 23 33 13 51 22 73 38 91 18 ...
 $ GKDiving                : num  6 7 9 90 15 11 13 27 11 86 ...
 $ GKHandling              : num  11 11 9 85 13 12 9 25 8 92 ...
 $ GKKicking               : num  15 15 15 87 5 6 7 31 9 78 ...
 $ GKPositioning           : num  14 14 15 88 10 8 14 33 7 88 ...
 $ GKReflexes              : num  8 11 11 94 13 8 9 37 11 89 ...
fifa_fil$ID <- fifa_fil$ID %>% as.character()
class(fifa_fil$ID)
[1] "character"
fifa_fil$`Preferred Foot` <- fifa_fil$`Preferred Foot` %>% factor(levels = c("Left", "Right"))
levels(fifa_fil$`Preferred Foot`)
[1] "Left"  "Right"
fifa_fil <- fifa_fil %>% separate(Weight, into = c("weight", "lbs"), sep = 3)
fifa_fil$weight <- fifa_fil$weight %>% as.numeric()
class(fifa_fil$weight)
[1] "numeric"
# Drop the unit column by name rather than by position
fifa_fil <- fifa_fil %>% select(-lbs)
head(fifa_fil)
fifa_fil$Position <- fifa_fil$Position %>% factor(levels = c("ST","LS","RS","CF","LF","RF","LW","RW","CAM","LAM","RAM","CM","LCM","RCM","LM","RM","CDM","LDM","RDM","CB","LCB","RCB","LWB","RWB","LB","RB","GK"))
levels(fifa_fil$Position)
 [1] "ST"  "LS"  "RS"  "CF"  "LF"  "RF"  "LW"  "RW"  "CAM" "LAM" "RAM" "CM"  "LCM" "RCM" "LM" 
[16] "RM"  "CDM" "LDM" "RDM" "CB"  "LCB" "RCB" "LWB" "RWB" "LB"  "RB"  "GK" 
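
Splitting Weight at character position 3 works here because every weight in the filtered data has three digits. As a hedged alternative, the sketch below strips all non-digit characters instead, which does not depend on the number of digits. The vector `weight_raw` is made up for illustration.

```r
# Made-up weights, including a two-digit value that a split at position 3 would mangle
weight_raw <- c("159lbs", "183lbs", "99lbs")

# Remove every non-digit character, then convert to numeric
weight_num <- as.numeric(gsub("[^0-9]", "", weight_raw))
print(weight_num)
```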

Tidy & Manipulate Data I

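The data is tidy, as it satisfies the principles of tidy data: each variable has its own column, each observation (one player) has its own row, and each value has its own cell.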

Tidy & Manipulate Data II

# Use mutate() to create avg_ab, the mean of all 34 ability ratings
fifa_fil <-  mutate(fifa_fil, avg_ab = (Crossing + Finishing + HeadingAccuracy + ShortPassing + Volleys + Dribbling + Curve + FKAccuracy + LongPassing + BallControl + Acceleration + SprintSpeed + Agility + Reactions + Balance + ShotPower + Jumping + Stamina + Strength + LongShots + Aggression + Interceptions + Positioning + Vision + Penalties + Composure + Marking + StandingTackle + SlidingTackle + GKDiving + GKHandling + GKKicking + GKPositioning + GKReflexes) / 34)
head(fifa_fil$avg_ab)
[1] 67.58824 68.32353 65.79412 45.26471 69.67647 65.67647
# For attackers
fifa_attack <- fifa_fil %>% filter(Position == "ST" | Position == "LS" | Position == "RS" | Position == "CF" | Position == "LF" | Position == "RF" | Position == "LW" | Position == "RW")
fifa_attack <- mutate(fifa_attack, avg_attab = (Crossing + Finishing + HeadingAccuracy + ShortPassing + Volleys + Dribbling + Curve + FKAccuracy+ BallControl + Acceleration + SprintSpeed + Agility + Reactions + Balance + ShotPower + Jumping + Stamina + Strength + LongShots + Positioning + Vision + Penalties + Composure) / 23)
head(fifa_attack)
# For Midfielders
fifa_midf <- fifa_fil %>%  filter(Position == "CAM" | Position == "LAM" | Position == "RAM" | Position == "CM"  | Position == "LCM" | Position == "RCM" | Position == "LM"  | Position == "RM"  | Position == "CDM" | Position == "LDM" | Position =="RDM")
fifa_midf <- mutate(fifa_midf, avg_midab = (Crossing + HeadingAccuracy + ShortPassing + Volleys + Dribbling + Curve + LongPassing + BallControl + Acceleration + SprintSpeed + Agility + Reactions + Balance + ShotPower + Jumping + Stamina + Strength + LongShots + Aggression + Interceptions + Positioning + Vision + StandingTackle) / 23)
head(fifa_midf)
# For Defenders
fifa_def <- fifa_fil %>% filter(Position == "CB" | Position == "LCB" | Position == "RCB" | Position == "LWB" | Position == "RWB" | Position == "LB" | Position == "RB")
fifa_def <- mutate(fifa_def, avg_defab = (HeadingAccuracy + ShortPassing + LongPassing + Acceleration + SprintSpeed + Reactions + Balance + Jumping + Stamina + Strength + Aggression + Interceptions + Positioning + Composure + Marking + StandingTackle + SlidingTackle) / 17)
head(fifa_def)
# For GoalKeepers
fifa_gk <- fifa_fil %>% filter(Position == "GK")
fifa_gk <- mutate(fifa_gk, avg_gk = (Acceleration + Reactions + Jumping + Vision + Composure + GKDiving + GKHandling + GKKicking + GKPositioning + GKReflexes) / 10)
head(fifa_gk)
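
The averages above are written out as explicit sums. A more compact way to express the same idea, sketched here on a made-up table with only three ability columns, is to name the columns once and use `rowMeans()`, and to test group membership with `%in%`. The names `toy`, `ability_cols` and `attack_positions` are assumptions for this example only.

```r
# Toy data with made-up ratings; the real data has 34 ability columns
toy <- data.frame(
  Position  = c("ST", "CM", "GK", "CB"),
  Finishing = c(90, 70, 10, 40),
  Vision    = c(80, 88, 60, 55),
  GKDiving  = c(10, 12, 90, 11)
)

# Average over a named set of columns instead of a hand-written sum
ability_cols <- c("Finishing", "Vision", "GKDiving")
toy$avg_ab <- rowMeans(toy[, ability_cols])

# Group membership with %in% instead of chained == comparisons
attack_positions <- c("ST", "LS", "RS", "CF", "LF", "RF", "LW", "RW")
toy_attack <- toy[toy$Position %in% attack_positions, ]
print(toy)
```

Keeping the column names in a vector also removes the need to keep the divisor in step with the number of terms by hand.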

Scan I

colSums(is.na(fifa_fil))
                      ID                     Name                      Age 
                       0                        0                        0 
             Nationality                  Overall                Potential 
                       0                        0                        0 
                    Club                    Value                     Wage 
                       2                        2                        2 
                 Special           Preferred Foot International Reputation 
                       0                        0                        0 
               Weak Foot              Skill Moves                 Position 
                       0                        0                        0 
           Jersey Number                   weight                 Crossing 
                       0                        0                        0 
               Finishing          HeadingAccuracy             ShortPassing 
                       0                        0                        0 
                 Volleys                Dribbling                    Curve 
                       0                        0                        0 
              FKAccuracy              LongPassing              BallControl 
                       0                        0                        0 
            Acceleration              SprintSpeed                  Agility 
                       0                        0                        0 
               Reactions                  Balance                ShotPower 
                       0                        0                        0 
                 Jumping                  Stamina                 Strength 
                       0                        0                        0 
               LongShots               Aggression            Interceptions 
                       0                        0                        0 
             Positioning                   Vision                Penalties 
                       0                        0                        0 
               Composure                  Marking           StandingTackle 
                       0                        0                        0 
           SlidingTackle                 GKDiving               GKHandling 
                       0                        0                        0 
               GKKicking            GKPositioning               GKReflexes 
                       0                        0                        0 
                  avg_ab 
                       0 
# na.omit() drops every row that contains at least one missing value
fifa_fil <- fifa_fil %>% na.omit()
# Check for missing values again
colSums(is.na(fifa_fil))
                      ID                     Name                      Age 
                       0                        0                        0 
             Nationality                  Overall                Potential 
                       0                        0                        0 
                    Club                    Value                     Wage 
                       0                        0                        0 
                 Special           Preferred Foot International Reputation 
                       0                        0                        0 
               Weak Foot              Skill Moves                 Position 
                       0                        0                        0 
           Jersey Number                   weight                 Crossing 
                       0                        0                        0 
               Finishing          HeadingAccuracy             ShortPassing 
                       0                        0                        0 
                 Volleys                Dribbling                    Curve 
                       0                        0                        0 
              FKAccuracy              LongPassing              BallControl 
                       0                        0                        0 
            Acceleration              SprintSpeed                  Agility 
                       0                        0                        0 
               Reactions                  Balance                ShotPower 
                       0                        0                        0 
                 Jumping                  Stamina                 Strength 
                       0                        0                        0 
               LongShots               Aggression            Interceptions 
                       0                        0                        0 
             Positioning                   Vision                Penalties 
                       0                        0                        0 
               Composure                  Marking           StandingTackle 
                       0                        0                        0 
           SlidingTackle                 GKDiving               GKHandling 
                       0                        0                        0 
               GKKicking            GKPositioning               GKReflexes 
                       0                        0                        0 
                  avg_ab 
                       0 

The result shows that Club, Value and Wage each have two missing values. These missing values are located in the same rows, so those rows are deleted. Had the missing values been spread across different rows, mean imputation would have been used for the numerical variables instead.
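
Column totals alone do not show whether missing values share rows. The sketch below, on made-up data, shows one way to check this, followed by the two handling options mentioned above: listwise deletion and mean imputation. `impute_mean` is a hypothetical helper, not a function from any of the loaded packages.

```r
# Made-up data: two rows are missing Club, Value and Wage together
toy <- data.frame(
  Club  = c("A", NA, "B", NA),
  Value = c(10, NA, 30, NA),
  Wage  = c(1, NA, 3, NA)
)

# Missing values share rows if every incomplete row is missing
# all of the affected columns at once
na_per_row <- rowSums(is.na(toy))
affected   <- colSums(is.na(toy)) > 0
same_rows  <- all(na_per_row %in% c(0, sum(affected)))
print(same_rows)

# Listwise deletion, as used in the report
toy_complete <- toy[complete.cases(toy), ]

# Alternative for scattered NAs: mean imputation of a numeric vector
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
print(impute_mean(c(10, NA, 30)))
```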

Scan II

# For "Overall"
z.score_all <- fifa_fil$Overall %>% scores(type = "z")
z.score_all %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.9726 -0.9726 -0.2307  0.0000  0.5111  4.2204 
# Visualization
fifa_fil$Overall %>% boxplot(main = "Overall", ylab = "Rating")

# Find total number of outliers
length(which( abs(z.score_all) > 3))
[1] 9
# Delete the outliers
clean_overall <- fifa_fil$Overall[abs(z.score_all) <= 3]
head(clean_overall)
[1] 90 90 90 90 90 89
# For "Potential"
z.score_pot <- fifa_fil$Potential %>% scores(type = "z")
z.score_pot %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.2943 -0.7011 -0.1078  0.0000  0.7820  3.1550 
# Visualization
fifa_fil$Potential %>% boxplot(main = "Potential", ylab = "Rating")

# Find the total number of outliers
length(which( abs(z.score_pot)> 3))
[1] 1
# Delete the outlier
clean_pot <- fifa_fil$Potential[abs(z.score_pot) <= 3]
head(clean_pot)
[1] 94 94 93 93 92 91
# For "weight"
z.score_wei <- fifa_fil$weight %>% scores(type = "z")
z.score_wei %>% summary()
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.36436 -0.66693 -0.02307  0.00000  0.73784  2.90354 
# Find the total number of outliers
length(which( abs(z.score_wei)> 3))
[1] 0
# For avg_ab
z.score_avgab <- fifa_fil$avg_ab %>% scores(type = "z")
z.score_avgab %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.2000 -0.1669  0.3157  0.0000  0.6116  1.4923 
# Find the total number of outliers
length(which( abs(z.score_avgab) > 3))
[1] 1
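
The z-score rule used above can be written out in base R, which makes the calculation explicit: each value is centred on the mean and divided by the standard deviation, and values with an absolute score above 3 are flagged. The vector `x` below is made up for illustration.

```r
# Made-up ratings: 20 ordinary values and one extreme value
x <- c(rep(80:83, each = 5), 99)

# z-score: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

# Positions of values more than 3 standard deviations from the mean
outlier_idx <- which(abs(z) > 3)
print(outlier_idx)

# Keep non-outliers with a logical index; this stays correct even when
# no outliers exist, whereas x[-which(...)] would return an empty vector
x_clean <- x[abs(z) <= 3]
print(length(x_clean))
```

Note that in small samples the rule may never trigger, because the largest possible absolute z-score for n values is (n - 1) / sqrt(n), which only exceeds 3 from n = 11 onwards.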

Capping, one of the approaches to handling outliers, is applied to “avg_ab”.

# Define a capping function: values outside the 1.5 * IQR fences
# are replaced by the 5th or 95th percentile
cap <- function(x){
  quantiles <- quantile(x, c(.05, .25, .75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}
avg_ab_capped <- fifa_fil$avg_ab %>% cap()
head(avg_ab_capped)
[1] 67.58824 68.32353 65.79412 38.08235 69.67647 65.67647
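
To see what the capping function does in isolation, the sketch below restates it and applies it to a made-up vector with a single low outlier. Only the value outside the fences changes, and it is replaced by the 5th percentile of the data.

```r
# Capping function as defined in the report
cap <- function(x) {
  quantiles <- quantile(x, c(.05, .25, .75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Made-up data: 19 values between 41 and 59 plus one low outlier
x <- c(1, 41:59)
capped <- cap(x)

# Show which positions were changed and their new values
changed <- which(capped != x)
print(changed)
print(capped[changed])
```

The replacement value depends on where the 5th and 95th percentiles fall, so it is worth inspecting the capped values when a sizeable share of the data sits in one tail.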

Transform

In this section, the variable to be transformed is “Overall”.

# Plot a histogram of the variable to decide which transformations to try
hist(fifa_fil$Overall)

The distribution is right-skewed, so several transformations are applied and compared.

boxcox_overall <- BoxCox(fifa_fil$Overall, lambda = "auto")
# Check the result
hist(boxcox_overall)

log_overall <- log10(fifa_fil$Overall)
# Check the result
hist(log_overall)

ln_overall <- log(fifa_fil$Overall)
# Check the result
hist(ln_overall)

recip_overall <- 1/fifa_fil$Overall
# Check the result
hist(recip_overall)

Comparing the histograms, the Box-Cox transformation works best on this data.
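
The comparison above is made by eye. As a rough numerical complement, the sketch below writes out the Box-Cox family in base R and compares a moment-based skewness measure across transformations. The helpers `box_cox` and `skewness`, the data `x`, and the hand-picked lambda of -2 are all assumptions for illustration; the report itself lets `BoxCox()` choose lambda automatically.

```r
# Box-Cox transform for positive x: (x^lambda - 1) / lambda,
# or log(x) when lambda is 0
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Moment-based sample skewness; values near 0 indicate symmetry
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up right-skewed ratings
x <- c(80, 80, 80, 81, 81, 82, 82, 83, 85, 88, 91, 94)

candidates <- list(
  raw        = x,
  log10      = log10(x),
  ln         = log(x),
  reciprocal = 1 / x,
  boxcox_m2  = box_cox(x, lambda = -2)  # lambda chosen by hand for illustration
)
print(sapply(candidates, skewness))
```

Two details are worth keeping in mind when reading such a comparison. The base-10 and natural logarithms differ only by a constant factor, so their histograms have the same shape and their skewness is identical. The reciprocal reverses the order of the values, so the sign of its skewness flips.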

