Introduction

My data comes from the FIFA 19 player dataset downloaded from Kaggle. The analysis I want to explore will give us better insight into how the decisions made by the creators of this game lead to the breakdown of players. I don’t claim that many of my findings tell us anything about the players in real life; rather, they reflect the opinions and methods of EA Sports in assigning player attributes in this game. Some of my analysis, however, will hold true in the real world when it uses information that is not subjective. These variables include wage, club, position, jersey number, nationality, age, etc. Some of the analysis will also give us insight into how the subjective attributes correlate with the objective attributes of the players.

Clustering

How do goalkeepers cluster by FIFA 19 attributes?

To answer this question, we will use GK attributes, physical and mental attributes, and passing. First we will use PCA plots to explore how the variables group, then we will use k-means clustering and hierarchical clustering to see how the goalkeepers in FIFA 19 cluster.
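
The code throughout assumes a handful of packages loaded in a setup chunk that is not shown; a minimal sketch of that setup:

library(dplyr)        # data manipulation
library(ggplot2)      # plotting
library(factoextra)   # fviz_pca_biplot(), fviz_nbclust(), fviz_dend()
library(DT)           # datatable()
library(randomForest) # randomForest()
library(gridExtra)    # grid.arrange()
library(plotly)       # ggplotly()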

PCA plot

#scale the numeric attributes (gk.df0 is built in an earlier chunk)
gk.df1 <- as.data.frame(
  gk.df0 %>%
    dplyr::select(-Name) %>%
    scale())
#use player names as row names so they label points in the biplot
Players <- gk.df0$Name
row.names(gk.df1) <- Players
#principal component analysis and biplot of players and variables
gk.pca <- prcomp(gk.df1)
fviz_pca_biplot(gk.pca, repel = TRUE)

Here we see the variables grouping: the technical attributes fall largely together, as do the athletic attributes and the physical attributes, while age and height/weight sit fairly separate, which makes sense.

K-means clustering

Using 5 clusters, we can cluster the players based on their attributes and plot them (see the elbow-method sketch below for a check on this choice).

#k-means with 5 centers and multiple random starts for stability
gk.km <- kmeans(gk.df1, 5,
                iter.max = 20,
                nstart = 30)

#attach the cluster assignment to the original data
gk.df0 <- gk.df0 %>%
  mutate(cluster.km = gk.km$cluster)

datatable(gk.df0 %>%
            dplyr::select(Name, cluster.km))
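
The choice of 5 clusters above is not otherwise motivated; a quick elbow-method check is one way to sanity-check it (a sketch, assuming factoextra is loaded):

#within-cluster sum of squares for k = 1..10; look for an "elbow"
fviz_nbclust(gk.df1, kmeans, method = "wss", k.max = 10)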

How do the sizes of the clusters break down?

datatable(
  as.data.frame(
  with(gk.df0, table(cluster.km))) %>%
            rename(Cluster = cluster.km,
                   Number = Freq))

Project the data onto the principal components and build the data frame for plotting

#project the scaled data onto the principal components
#(equivalent to the scores in gk.pca$x)
gk.pca1 <- data.matrix(gk.df1) %*% gk.pca$rotation
pca1 <- gk.pca1[,1]
pca2 <- gk.pca1[,2]
pca3 <- gk.pca1[,3]

gkCluster.df <- data.frame(pca1, pca2,
                           cluster = factor(gk.df0$cluster.km),
                           name = Players)
K-means Cluster Plot
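
The plotting code for this figure is not shown; a minimal ggplot sketch of a cluster plot built from gkCluster.df (my construction, not the original code):

ggplot(gkCluster.df, aes(x = pca1, y = pca2, color = cluster)) +
  geom_point() +
  geom_text(aes(label = name), size = 2, vjust = -1,
            check_overlap = TRUE) +
  labs(title = "Goalkeeper Clusters in PCA Space",
       x = "PC1", y = "PC2", color = "Cluster")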

Hierarchical clustering
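
The hierarchical clustering code and dendrogram are not shown; a minimal sketch of the standard approach, using complete linkage on the scaled goalkeeper data:

#Euclidean distance on the scaled attributes, complete-linkage clustering
gk.hc <- hclust(dist(gk.df1), method = "complete")
plot(gk.hc, cex = 0.5, main = "Goalkeeper Dendrogram")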

How do defenders cluster in FIFA 19?

#keep only the defenders
def.df <- fifa.df %>%
  filter(Position.Type == "Defender")

#drop the position-rating columns (23-48), identifiers, and goalkeeper
#attributes, then keep only elite defenders
def.df0 <- def.df[-23:-48] %>%
  dplyr::select(-Position.Type, -Position,
                -ID, -Nationality,
                -Club, -Preferred.Foot, -Body.Type,
                -Jersey.Number, -Joined, -Skill.Moves,
                -Real.Face, -Contract.Valid.Until, -Off.Work.Rate,
                -Def.Work.Rate, -Special, -GKReflexes,
                -GKDiving, -GKKicking, -GKHandling,
                -GKPositioning, -Release.Clause) %>%
  filter(Overall > 82)

PCA plot

#scale the numeric attributes and label rows with player names
def.df1 <- as.data.frame(
  def.df0 %>%
    dplyr::select(-Name) %>%
    scale())
Players <- def.df0$Name
row.names(def.df1) <- Players
def.pca <- prcomp(def.df1)
fviz_pca_biplot(def.pca, repel = TRUE)

K-means clustering

Using 5 clusters, we can again cluster the players based on their attributes and plot them.

#same k-means setup as for the goalkeepers
def.km <- kmeans(def.df1, 5,
                 iter.max = 20,
                 nstart = 30)

def.df0 <- def.df0 %>%
  mutate(cluster.km = def.km$cluster)

datatable(def.df0 %>%
            dplyr::select(Name, cluster.km))

Project the data onto the principal components and build the data frame for plotting

def.pca1 <- data.matrix(def.df1) %*% def.pca$rotation
pca1 <- def.pca1[,1]
pca2 <- def.pca1[,2]
pca3 <- def.pca1[,3]

defCluster.df <- data.frame(pca1, pca2,
                            cluster = factor(def.df0$cluster.km),
                            name = Players)
K-means Cluster Plot

With this clustering we see many of the wingbacks in the green section, the young, up-and-coming center backs in blue, and older, solid center backs in black and red, while the purple cluster seems to hold the best center backs in the world, that next tier of player.

Hierarchical clustering
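
As with the goalkeepers, the clustering code is not shown; a sketch using fviz_dend from factoextra, cutting the tree into five groups (the group count is an assumption):

def.hc <- hclust(dist(def.df1), method = "complete")
fviz_dend(def.hc, k = 5, cex = 0.5, main = "Defender Dendrogram")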

It is unsurprising to see Sergio Ramos in his own cluster. This certainly reinforces the clear opinion of EA's writers and of fans last year that Sergio Ramos was the best defender in the world in 2018, but it would be interesting to repeat this analysis next season, now that Virgil van Dijk is regarded as the new best defender in the world after winning the Premier League Player of the Season award and taking Liverpool to back-to-back Champions League finals.

Classification

Position Type Classification

How well can we predict a player's position type (attacker, defender, midfielder, GK) based on physical attributes (height, weight, body type, speed, strength, jumping, etc.)?

Error rate with Random Forest Classification

#picking the mtry for the model (mtry = p uses every predictor at each
#split, which makes this bagging rather than a typical random forest)
p <- ncol(pos.pred.df) - 1

#building a function that calculates the error rate
errRF <- function(data.df, kfolds = 8){
  #8-fold cross-validation
  folds <- sample(1:kfolds, nrow(data.df), rep = T)
  #tracking the error across folds
  err <- rep(0, kfolds)
  #training and testing the data multiple times
  for (k in 1:kfolds){
    train.df <- data.df[folds != k,] %>%
      mutate(Position.Type = as.factor(Position.Type))
    test.df <- data.df[folds == k,] %>%
      mutate(Position.Type = as.factor(Position.Type))
    #building the actual model
    mod <- randomForest(Position.Type ~.,
                        data = train.df,
                        mtry = p,
                        ntree = 100)
    #predicting the position type
    preds <- predict(mod, newdata = test.df, type = "response")
    #calculating the misclassification error
    err[k] <- with(test.df, mean((Position.Type != preds)))
  }
  #note: this returns the best (minimum) fold error, which is optimistic;
  #mean(err) would be the usual cross-validation estimate
  min(err)
}

(err.rf.fm <- errRF(pos.pred.df))
## [1] 0.2390244

Interestingly, approximately 3 out of 4 times we can predict a player's position type from Body.Type, Real.Face, Height, Weight, Acceleration, SprintSpeed, Agility, Balance, Jumping, Stamina, Strength, and Aggression. Compared to what, though? Let's look at the null rate for position type.

#Frequency breakdown of position types
with(pos.pred.df, table(Position.Type))
## Position.Type
##   Attacker   Defender     Keeper Midfielder 
##        325        453        156        637
#Calculating the null rate (the share of the most common class)
(null.err <- max(with(pos.pred.df, 
                     table(Position.Type)))/
  sum(with(pos.pred.df,
           table(Position.Type))))
## [1] 0.4054742

Our model is right about 76% of the time, while always guessing the most common position type (Midfielder) would be right only about 41% of the time, so the random forest with physical attributes is a large improvement over that baseline.
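
It also helps to see where the model goes wrong. A quick confusion-matrix sketch on a single train/test split (not part of the original analysis; the split is illustrative and also supplies the train.df used below):

#one illustrative 80/20 split
set.seed(1)
idx <- sample(nrow(pos.pred.df), floor(0.8 * nrow(pos.pred.df)))
train.df <- pos.pred.df[idx,] %>%
  mutate(Position.Type = as.factor(Position.Type))
test.df <- pos.pred.df[-idx,] %>%
  mutate(Position.Type = as.factor(Position.Type))
mod <- randomForest(Position.Type ~ ., data = train.df,
                    mtry = p, ntree = 100)
#rows = predicted class, columns = actual class
table(predicted = predict(mod, newdata = test.df),
      actual = test.df$Position.Type)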

Variable Importance of this Model

#Building one model to track variable importance (train.df is the
#split created in the sketch above)
mod.rf <- randomForest(Position.Type ~.,
                       data = train.df,
                       mtry = p,
                       ntree = 100,
                       importance = T)
#p1 and p2 are importance plots; see the sketch below for one way to build them
grid.arrange(p1, p2, ncol = 2,
             top = "Variable Importance Plot")

This finding is very interesting. Our data and our model show that, of these variables, the most important predictor of position type is Stamina. Stamina is a very subjective attribute in FIFA, so we cannot conclude that stamina has a strong relationship with position type in real life, only in this game. I wonder whether the assignment of Stamina is heavily correlated with position because of coincidence, bias, or actual statistics and observations.
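
One way to look at that relationship directly is a simple distribution plot (a quick sketch, not part of the original analysis):

#Stamina distributions by position type
pos.pred.df %>%
  mutate(Position.Type = as.factor(Position.Type)) %>%
  ggplot(aes(x = Position.Type, y = Stamina)) +
  geom_boxplot() +
  labs(title = "Stamina by Position Type",
       x = "Position Type", y = "Stamina")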

Position type based on Wage, Value, and Jersey.Number

(err.rf.1 <- errRF(fifa.df %>%
                    dplyr::select(Position.Type,
                                  Wage, Value,
                                  Jersey.Number)))
## [1] 0.4480874

Interestingly, predicting Position.Type from Wage, Value, and Jersey.Number improves only modestly on the baseline: an error rate of about 45% means we are right only 55% of the time, versus 41% for always guessing Midfielder, and far worse than the physical-attributes model. I would have thought that those variables say a lot about a player's position, but this model clearly needs more variables before it predicts well.

International Reputation Classification problem

We will classify International.Reputation using various predictors, including only the top 50 nations in terms of World Cup appearances, according to Wikipedia. We will use Position.Type, Real.Face, Nationality, Overall, Wage, Jersey.Number, and Age to attempt to classify International.Reputation.

set.seed(234)
#Same function as before, now classifying International.Reputation
p <- ncol(int.rep.df) - 1

errRF <- function(data.df, kfolds = 8){
  folds <- sample(1:kfolds, nrow(data.df), rep = T)
  err <- rep(0, kfolds)
  for (k in 1:kfolds){
    train.df <- data.df[folds != k,] %>%
      mutate(International.Reputation = 
               as.factor(International.Reputation))
    test.df <- data.df[folds == k,] %>%
      mutate(International.Reputation = 
               as.factor(International.Reputation))
    mod <- randomForest(International.Reputation ~.,
                        data = train.df,
                        mtry = p,
                        ntree = 100)
    preds <- predict(mod, newdata = test.df, type = "response")
    err[k] <- with(test.df, mean((International.Reputation != preds)))
  }
  #minimum fold error again, as before
  min(err)
}

(err.rf.int <- errRF(int.rep.df))
## [1] 0.3505155

Our error rate is about 35%, but what does that compare to? Let’s check the null rate.

with(int.rep.df, table(International.Reputation))
## International.Reputation
##   1   2   3   4   5 
## 500 600 267  50   6
max.freq <- max(with(int.rep.df, table(International.Reputation)))
tot <- sum(with(int.rep.df, table(International.Reputation)))
(null <- max.freq/tot)
## [1] 0.4216444

The most common reputation level accounts for about 42% of these players, so always guessing that level would already be right 42% of the time, while our model is right about 65% of the time. That is an improvement, but it suggests that the relationships between these variables and International.Reputation do not lead to especially strong predictions.

Variable Importance

It is still interesting to observe the variables at play.

#train.df is assumed to hold a training split of int.rep.df built in an
#unshown chunk
mod.rf <- randomForest(International.Reputation ~.,
                       data = train.df,
                       mtry = p,
                       ntree = 100,
                       importance = T)
#p1 and p2 built from importance(mod.rf) as in the earlier sketch
grid.arrange(p1, p2, ncol = 2,
             top = "Variable Importance Plot")

From our variable importance plots we see that the top predictors in this model are Overall, Wage, Age, and Nationality.

Visualizing Work Rate-Position relationships

#theme_plot() is a custom theme helper defined in an earlier chunk
ggplotly(
  theme_plot(fifa.df %>%
    dplyr::select(Name, Position.Type, Off.Work.Rate) %>%
    mutate(Off.Work.Rate = as.factor(Off.Work.Rate)) %>%
    filter(Position.Type != "Keeper") %>%
    group_by(Position.Type, Off.Work.Rate) %>%
    summarise(count = n()) %>%
    mutate(frequency = count/sum(count)) %>%
    ggplot(aes(x = Position.Type,
               y = frequency, fill = as.factor(Off.Work.Rate))) +
    geom_col(position = "fill") +
    labs(title = "Offensive Work Rate by Position Type",
         fill = "Work Rate",
         x = "Position Type",
         y = "Proportion") +
    scale_fill_manual(values = c("mediumpurple4",
                                 "indianred1",
                                 "lightpink4"))),
  tooltip = "frequency")

ggplotly(
  theme_plot(fifa.df %>%
    dplyr::select(Name, Position.Type, Def.Work.Rate) %>%
    mutate(Def.Work.Rate = as.factor(Def.Work.Rate)) %>%
    filter(Position.Type != "Keeper") %>%
    group_by(Position.Type, Def.Work.Rate) %>%
    summarise(count = n()) %>%
    mutate(frequency = count/sum(count)) %>%
    ggplot(aes(x = Position.Type,
               y = frequency, fill = as.factor(Def.Work.Rate))) +
    geom_col(position = "fill") +
    labs(title = "Defensive Work Rate by Position Type",
         fill = "Work Rate",
         x = "Position Type",
         y = "Proportion") +
    scale_fill_manual(values = c("mediumpurple4",
                                 "indianred1",
                                 "lightpink4"))),
  tooltip = "frequency")

Regression

How well can we predict Potential using Age, Position.Type, Wage, and Composure?

First we will fit a multiple linear regression model and report its MSE, then compare that to a random forest regression.

set.seed(234)
#cross-validated MSE for a linear model (returns the mean across folds)
errLM <- function(data.df, kfolds = 8){
  folds <- sample(1:kfolds, nrow(data.df), rep = T)
  err <- rep(0, kfolds)
  for (k in 1:kfolds){
    train.df <- data.df[folds != k,]
    test.df <- data.df[folds == k,]
    mod <- lm(Potential ~.,
                        data = train.df)
    preds <- predict(mod, newdata = test.df)
    err[k] <- with(test.df, mean((Potential-preds)^2))
  }
  mean(err)
}

(mse.lm <- errLM(pot.df0))
## [1] 5.798617

Our MSE for the linear model is about 5.8, an RMSE of roughly 2.4 rating points. How does that compare to a random forest?

set.seed(234)
#Same cross-validation setup as before, now for a random forest regression
p <- ncol(pot.df0) - 1

mseRF <- function(data.df, kfolds = 8){
  folds <- sample(1:kfolds, nrow(data.df), rep = T)
  err <- rep(0, kfolds)
  for (k in 1:kfolds){
    train.df <- data.df[folds != k,]
    test.df <- data.df[folds == k,]
    mod <- randomForest(Potential ~.,
                        data = train.df,
                        mtry = p,
                        ntree = 100)
    preds <- predict(mod, newdata = test.df)
    err[k] <- with(test.df, mean((Potential-preds)^2))
  }
  #again the minimum fold MSE; mean(err) would be the usual CV estimate
  min(err)
}

(mse.rf <- mseRF(pot.df0))
## [1] 3.952835

The MSE for the random forest model is about 3.95 (RMSE roughly 2.0), which is much lower than the linear model's.

In these models we are aiming to predict players' Potential, but as we can see in our visualization against Age, there are some outliers among fairly old players. This is probably because they have already reached their potential.
To reframe our question, let's make a new variable that tracks a player's potential growth (Potential minus Overall) and attempt to predict that using these same predictors.

Perhaps this will lower our MSE.
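
The first version of pot.df1 is built in a chunk that is not shown; presumably something like the following, mirroring the later construction but with only the original four predictors (an assumption):

#assumed construction: potential growth plus the original predictors
pot.df1 <- fifa.df %>%
  mutate(Pot.Growth = Potential - Overall) %>%
  dplyr::select(Pot.Growth, Age, Position.Type, Wage, Composure)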

set.seed(234)
#Same function as before, now predicting Pot.Growth
p <- ncol(pot.df1) - 1

mseRF <- function(data.df, kfolds = 8){
  folds <- sample(1:kfolds, nrow(data.df), rep = T)
  err <- rep(0, kfolds)
  for (k in 1:kfolds){
    train.df <- data.df[folds != k,]
    test.df <- data.df[folds == k,]
    mod <- randomForest(Pot.Growth ~.,
                        data = train.df,
                        mtry = p,
                        ntree = 100)
    preds <- predict(mod, newdata = test.df)
    err[k] <- with(test.df, mean((Pot.Growth-preds)^2))
  }
  min(err)
}

(mse.rf <- mseRF(pot.df1))
## [1] 0.8746883

Our MSE drops drastically to 0.87. Using this model's predictions, we could simply add the predicted growth to a player's Overall, so this reframing proves to be a much more effective way of predicting Potential.

Are there other variables we could include that could improve this model?

#rebuild pot.df1 with additional predictors
pot.df1 <- fifa.df %>%
  mutate(Pot.Growth = Potential - Overall) %>%
  mutate(Nationality = as.character(Nationality)) %>%
  filter(Nationality %in% World.Cup.countries) %>%
  mutate(Nationality = as.factor(Nationality)) %>%
  dplyr::select(Pot.Growth, Position.Type,
                Age, Wage, Composure, Nationality, Value, 
                Skill.Moves, ShortPassing, Real.Face)

p <- ncol(pot.df1) - 1

(mse.rf <- mseRF(pot.df1))
## [1] 0.6351726

By adding Nationality, Value, Skill.Moves, ShortPassing, and Real.Face, we decrease the MSE to about 0.64.

Predicting Overall using International.Reputation, Real.Face, Wage, Composure, Jersey.Number, Age, Skill.Moves, Weak.Foot, and Value
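
The construction of over.df is not shown; presumably something like this sketch:

#assumed construction of over.df: Overall plus the listed predictors
over.df <- fifa.df %>%
  dplyr::select(Overall, International.Reputation, Real.Face,
                Wage, Composure, Jersey.Number, Age,
                Skill.Moves, Weak.Foot, Value)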

Random Forest Model to find the MSE

set.seed(234)
p <- ncol(over.df) - 1

mseRF <- function(data.df, kfolds = 8){
  folds <- sample(1:kfolds, nrow(data.df), rep = T)
  err <- rep(0, kfolds)
  for (k in 1:kfolds){
    train.df <- data.df[folds != k,]
    test.df <- data.df[folds == k,]
    mod <- randomForest(Overall ~.,
                        data = train.df,
                        mtry = p,
                        ntree = 100)
    preds <- predict(mod, newdata = test.df)
    err[k] <- with(test.df, mean((Overall-preds)^2))
  }
  min(err)
}

(mse.rf <- mseRF(over.df))
## [1] 1.143589

Our MSE is about 1.14. More interestingly, let's look at the variable importance plots.

Variable Importance of the model

#train.df is assumed to hold a training split of over.df built in an
#unshown chunk
mod.rf <- randomForest(Overall ~.,
                       data = train.df,
                       mtry = p,
                       ntree = 100,
                       importance = T)
##     %IncMSE IncNodePurity                 variable
## 1 13.302644     319.92971 International.Reputation
## 2  1.331913      19.72421                Real.Face
## 3  9.296683     444.62052                     Wage
## 4 12.821787     180.05895                Composure
## 5  5.383163     620.18453            Jersey.Number
## 6 56.157590    6410.39069                    Value
#p1 and p2 built from the importance table, as in the earlier sketch
grid.arrange(p1, p2, ncol = 2,
             top = "Variable Importance Plot")

Not surprisingly, we see that Value is by far the strongest variable in our model predicting Overall rating. More surprisingly, Jersey.Number appears as the second strongest variable in the node purity plot.

Conclusion

In this exploration we see some powerful clustering of different types of players using FIFA 19 attributes. We also see some interesting trends relating various predictors to various response variables, including the strong relationship between Value and Overall rating. From here there are nearly unlimited questions that could be asked and explored using this data, and these functions could serve as a template for exploring them.