Introduction
My data comes from the players in the game FIFA 19, downloaded from Kaggle. The analysis I want to explore will give us better insight into how the decisions made by the creators of this game lead to the breakdown of player ratings. I don't claim that many of my findings tell us anything about the players in real life; rather, they reflect the opinions and methods of EA Sports in their creation of player attributes for this game. Some of my analysis, however, does hold in the real world, since it uses information that is not subjective: variables like wage, club, position, jersey number, nationality, and age. Some of the analysis will also give us insight into how the subjective attributes correlate with these objective attributes.
Clustering
How do goalkeepers cluster by FIFA 19 attributes?
To answer this question, we will use goalkeeper attributes, physical and mental attributes, and passing attributes. First we will use PCA plots to explore how the variables group; then we will use k-means clustering and hierarchical clustering to see how the goalkeepers in FIFA 19 cluster.
PCA plot
gk.df1 <- as.data.frame(
gk.df0 %>%
dplyr::select(-Name) %>%
scale())
Players <- gk.df0$Name
row.names(gk.df1) <- Players
gk.pca <- prcomp(gk.df1)
fviz_pca_biplot(gk.pca, repel = TRUE)
Here we see the variables grouping largely into technical attributes, athletic attributes, and physical attributes, while age and height/weight stand fairly separate, which makes sense.
K-means clustering
Using 5 clusters, we can group the players by their attributes and plot them.
gk.km <- kmeans(gk.df1, 5,
iter.max = 20,
nstart = 30)
gk.df0 <- gk.df0 %>%
mutate(cluster.km = gk.km$cluster)
datatable(gk.df0 %>%
dplyr::select(Name, cluster.km))
How do the sizes of the clusters break down?
datatable(
as.data.frame(
with(gk.df0, table(cluster.km))) %>%
rename(Cluster = cluster.km,
Number = Freq))
Rotate the matrix and build the data frame for plotting.
#project the scaled data onto the principal components (equivalent to gk.pca$x)
gk.pca1 <- data.matrix(gk.df1) %*% gk.pca$rotation
pca1 <- gk.pca1[,1]
pca2 <- gk.pca1[,2]
pca3 <- gk.pca1[,3]
gkCluster.df <- data.frame(pca1, pca2,
cluster = factor(gk.df0$cluster.km),
name = Players)
K-means Cluster Plot
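The cluster plot itself can be drawn from gkCluster.df. A minimal sketch using ggplot2 (the original plotting code is not shown, so the aesthetics here are assumptions):
#sketch: goalkeepers in PCA space, colored by k-means cluster (uses ggplot2, loaded above)
ggplot(gkCluster.df, aes(x = pca1, y = pca2, color = cluster)) +
geom_point() +
geom_text(aes(label = name), size = 2.5, vjust = -0.7) +
labs(title = "K-means Clusters of Goalkeepers",
x = "PC1", y = "PC2", color = "Cluster")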
Hierarchical clustering
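The hierarchical clustering code is not shown here; a minimal sketch on the same scaled goalkeeper data, assuming Euclidean distance and complete linkage:
#sketch: hierarchical clustering of the goalkeepers (fviz_dend is from factoextra, loaded above)
gk.dist <- dist(gk.df1)
gk.hc <- hclust(gk.dist, method = "complete")
fviz_dend(gk.hc, k = 5, cex = 0.5)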
How do defenders cluster in FIFA 19?
def.df <- fifa.df %>%
filter(Position.Type == "Defender")
def.df0 <- def.df[-23:-48] %>%
dplyr::select(-Position.Type, -Position,
-ID, -Nationality,
-Club, -Preferred.Foot, -Body.Type,
-Jersey.Number, -Joined, -Skill.Moves,
-Real.Face, -Contract.Valid.Until, -Off.Work.Rate,
-Def.Work.Rate, -Special, -GKReflexes,
-GKDiving, -GKKicking, -GKHandling,
-GKPositioning, -Release.Clause) %>%
filter(Overall > 82)
PCA plot
def.df1 <- as.data.frame(
def.df0 %>%
dplyr::select(-Name) %>%
scale())
Players <- def.df0$Name
row.names(def.df1) <- Players
def.pca <- prcomp(def.df1)
fviz_pca_biplot(def.pca, repel = TRUE)
K-means clustering
Using 5 clusters, we can group the players by their attributes and plot them.
def.km <- kmeans(def.df1, 5,
iter.max = 20,
nstart = 30)
def.df0 <- def.df0 %>%
mutate(cluster.km = def.km$cluster)
datatable(def.df0 %>%
dplyr::select(Name, cluster.km))
Rotate the matrix and build the data frame for plotting.
def.pca1 <- data.matrix(def.df1) %*% def.pca$rotation
pca1 <- def.pca1[,1]
pca2 <- def.pca1[,2]
pca3 <- def.pca1[,3]
defCluster.df <- data.frame(pca1, pca2,
cluster = factor(def.df0$cluster.km),
name = Players)
K-means Cluster Plot
With this clustering we see many of the wingbacks in the green section, the young up-and-coming centerbacks in blue, older, solid centerbacks in black and red, and a purple cluster that seems to hold the best centerbacks in the world, that next tier of player.
Hierarchical clustering
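As above, the clustering code is not shown; a sketch under the same assumptions, this time extracting cluster memberships with cutree:
#sketch: hierarchical clustering of the defenders
def.hc <- hclust(dist(def.df1), method = "complete")
def.clusters <- cutree(def.hc, k = 5)
#cluster sizes; a cluster of size 1 is a player in his own cluster
table(def.clusters)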
It is unsurprising to see Sergio Ramos in his own cluster; this reinforces the clear opinion of the EA writers and fans from last year that Sergio Ramos was the best defender in the world in 2018. It would be interesting to repeat this analysis next season, now that Virgil van Dijk has come to be regarded as the new best defender in the world after winning the Premier League Player of the Season award and taking Liverpool to back-to-back Champions League finals.
Classification
Position Type Classification
How well can we predict a player's position type (attacker, defender, midfielder, keeper) based on physical attributes (height, weight, body type, speed, strength, jumping, etc.)?
Error rate with Random Forest Classification
#picking the mtry for the model (mtry = p uses every predictor at each split, i.e. bagging)
p <- ncol(pos.pred.df) - 1
#building a function that calculates the cross-validated error rate
errRF <- function(data.df, kfolds = 8){
#8 fold cross validation
folds <- sample(1:kfolds, nrow(data.df), rep = T)
#tracking the error across the folds
err <- rep(0, kfolds)
#training and testing the data multiple times
for (k in 1:kfolds){
train.df <- data.df[folds != k,] %>%
mutate(Position.Type = as.factor(Position.Type))
test.df <- data.df[folds == k,] %>%
mutate(Position.Type = as.factor(Position.Type))
#building the actual model
mod <- randomForest(Position.Type ~.,
data = train.df,
mtry = p,
ntree = 100)
#predicting the position type
preds <- predict(mod, newdata = test.df, type = "response")
#calculating the error
err[k] <- with(test.df, mean((Position.Type != preds)))
}
#return the lowest fold error (mean(err) would give the average CV error)
min(err)
}
(err.rf.fm <- errRF(pos.pred.df))
## [1] 0.2390244
Interestingly, approximately 3 out of 4 times we can correctly predict a player's position type from Body.Type, Real.Face, Height, Weight, Acceleration, SprintSpeed, Agility, Balance, Jumping, Stamina, Strength, and Aggression. Compared to what, though? Let's look at the null rate of position type.
#Frequency breakdown of position types
with(pos.pred.df, table(Position.Type))
## Position.Type
## Attacker Defender Keeper Midfielder
## 325 453 156 637
#Calculating the null error rate
(null.err <- max(with(pos.pred.df,
table(Position.Type)))/
sum(with(pos.pred.df,
table(Position.Type))))
## [1] 0.4054742
We can improve on the null error rate by about 17 percentage points (from roughly 41% down to 24%) by predicting with a random forest model that uses physical attributes as predictors.
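To see which position types the model confuses, we could tabulate predictions against the truth on a held-out fold. A sketch, assuming a fitted mod and a test.df like those inside errRF:
#sketch: confusion matrix on one held-out fold
preds <- predict(mod, newdata = test.df, type = "response")
table(Predicted = preds, Actual = test.df$Position.Type)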
Variable Importance of this Model
#Building one model to track variable importance
#(assumes train.df and the importance plots p1/p2 exist; their construction is not shown)
mod.rf <- randomForest(Position.Type ~.,
data = train.df,
mtry = p,
ntree = 100,
importance = T)
grid.arrange(p1, p2, ncol = 2,
top = "Variable Importance Plot")
This finding is very interesting. Our data and our model show us that, of these variables, the most important predictor of position type is Stamina. Stamina is a very subjective attribute in FIFA, so we cannot conclude that stamina has a strong relationship with position type in real life, only in this game. I wonder whether the assignment of Stamina is heavily correlated with position because of coincidence, bias, or actual statistics and observations.
Position type based on Wage, Value, and Jersey.Number
(err.rf.1 <- errRF(fifa.df %>%
dplyr::select(Position.Type,
Wage, Value,
Jersey.Number)))
## [1] 0.4480874
Interestingly, we don't beat the null rate when predicting Position.Type using Wage, Value, and Jersey.Number. I would have thought those variables say a lot about a player's position, but evidently this model needs more informative predictors before it can make strong predictions.
International Reputation Classification problem
International Reputation classification with various predictors. We will include only the top 50 nations in terms of World Cup appearances, according to Wikipedia. We will use Position.Type, Real.Face, Nationality, Overall, Wage, Jersey.Number, and Age to attempt to classify International.Reputation.
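The construction of int.rep.df is not shown; a sketch, assuming World.Cup.countries holds those top-50 nations (the same vector is used later for pot.df1):
#sketch: build the International.Reputation data frame
int.rep.df <- fifa.df %>%
filter(as.character(Nationality) %in% World.Cup.countries) %>%
dplyr::select(International.Reputation, Position.Type, Real.Face,
Nationality, Overall, Wage, Jersey.Number, Age)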
set.seed(234)
#Same function as before to calculate the new model's error
p <- ncol(int.rep.df) - 1
errRF <- function(data.df, kfolds = 8){
folds <- sample(1:kfolds, nrow(data.df), rep = T)
err <- rep(0, kfolds)
for (k in 1:kfolds){
train.df <- data.df[folds != k,] %>%
mutate(International.Reputation =
as.factor(International.Reputation))
test.df <- data.df[folds == k,] %>%
mutate(International.Reputation =
as.factor(International.Reputation))
mod <- randomForest(International.Reputation ~.,
data = train.df,
mtry = p,
ntree = 100)
preds <- predict(mod, newdata = test.df, type = "response")
err[k] <- with(test.df, mean((International.Reputation != preds)))
}
min(err)
}
(err.rf.int <- errRF(int.rep.df))
## [1] 0.3505155
Our error rate is about 35%, but what does that compare to? Let’s check the null rate.
with(int.rep.df, table(International.Reputation))
## International.Reputation
## 1 2 3 4 5
## 500 600 267 50 6
max.freq <- max(with(int.rep.df, table(International.Reputation)))
tot <- sum(with(int.rep.df, table(International.Reputation)))
(null <- max.freq/tot)
## [1] 0.4216444
The null error rate is about 42%, so our model improves on it by only about 7 percentage points. This suggests that the relationships between these variables and International.Reputation do not lead to strong predictions.
Variable Importance
It is still interesting to observe the variables at play.
mod.rf <- randomForest(International.Reputation ~.,
data = train.df,
mtry = p,
ntree = 100,
importance = T)
grid.arrange(p1, p2, ncol = 2,
top = "Variable Importance Plot")
From our variable importance plots we see that the top predictors in this model include Overall, Wage, Age, and Nationality.
Visualizing Work Rate-Position relationships
ggplotly(
theme_plot(fifa.df %>%
dplyr::select(Name, Position.Type, Off.Work.Rate) %>%
mutate(Off.Work.Rate = as.factor(Off.Work.Rate)) %>%
filter(Position.Type != "Keeper") %>%
group_by(Position.Type, Off.Work.Rate) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count)) %>%
ggplot(aes(x = Position.Type,
y = frequency, fill = as.factor(Off.Work.Rate))) +
geom_col(position = "fill") +
labs(title = "Offensive Work Rate by Postion Type",
fill = "Work Rate",
x = "Position Type",
y = "Proportion") +
scale_fill_manual(values = c("mediumpurple4",
"indianred1",
"lightpink4"))),
tooltip = "frequency")ggplotly(
theme_plot(fifa.df %>%
dplyr::select(Name, Position.Type, Def.Work.Rate) %>%
mutate(Def.Work.Rate = as.factor(Def.Work.Rate)) %>%
filter(Position.Type != "Keeper") %>%
group_by(Position.Type, Def.Work.Rate) %>%
summarise(count = n()) %>%
mutate(frequency = count/sum(count)) %>%
ggplot(aes(x = Position.Type,
y = frequency, fill = as.factor(Def.Work.Rate))) +
geom_col(position = "fill") +
labs(title = "Defensive Work Rate by Postion Type",
fill = "Work Rate",
x = "Position Type",
y = "Proportion") +
scale_fill_manual(values = c("mediumpurple4",
"indianred1",
"lightpink4"))),
tooltip = "frequency")Regression
How well can we predict Potential using Age, Position.Type, Wage, and Composure?
First we will fit a multiple linear regression model and report its MSE, then compare it to a random forest regression.
set.seed(234)
errLM <- function(data.df, kfolds = 8){
folds <- sample(1:kfolds, nrow(data.df), rep = T)
err <- rep(0, kfolds)
for (k in 1:kfolds){
train.df <- data.df[folds != k,]
test.df <- data.df[folds == k,]
mod <- lm(Potential ~.,
data = train.df)
preds <- predict(mod, newdata = test.df)
err[k] <- with(test.df, mean((Potential-preds)^2))
}
mean(err)
}
(mse.lm <- errLM(pot.df0))
## [1] 5.798617
Our MSE for the linear model is about 5.80. How does that compare to a random forest?
set.seed(234)
#Same function as before to calculate the new model's error, now using regression
p <- ncol(pot.df0) - 1
mseRF <- function(data.df, kfolds = 8){
folds <- sample(1:kfolds, nrow(data.df), rep = T)
err <- rep(0, kfolds)
for (k in 1:kfolds){
train.df <- data.df[folds != k,]
test.df <- data.df[folds == k,]
mod <- randomForest(Potential ~.,
data = train.df,
mtry = p,
ntree = 100)
preds <- predict(mod, newdata = test.df)
err[k] <- with(test.df, mean((Potential-preds)^2))
}
#returning the lowest fold error here, while errLM above reports the average
min(err)
}
(mse.rf <- mseRF(pot.df0))
## [1] 3.952835
The MSE for the random forest model is about 3.95, much lower than that of the linear model.
In these models we are aiming to predict the Potential of the players, but as we can see in our visualization using Age, there are some outliers among the older players. This probably has to do with the fact that they have already reached their potential.
To reframe our question, let's make a new variable that tracks a player's potential growth (Potential minus Overall) and attempt to predict that using these same predictors.
Perhaps this will lower our MSE.
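The first version of pot.df1 is not shown; a sketch, assuming it uses the same predictors as pot.df0 (Age, Position.Type, Wage, and Composure):
#sketch: add the growth variable and keep the original predictors
pot.df1 <- fifa.df %>%
mutate(Pot.Growth = Potential - Overall) %>%
dplyr::select(Pot.Growth, Age, Position.Type, Wage, Composure)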
set.seed(234)
#Same function as before to calculate the new model's error using regression
p <- ncol(pot.df1) - 1
mseRF <- function(data.df, kfolds = 8){
folds <- sample(1:kfolds, nrow(data.df), rep = T)
err <- rep(0, kfolds)
for (k in 1:kfolds){
train.df <- data.df[folds != k,]
test.df <- data.df[folds == k,]
mod <- randomForest(Pot.Growth ~.,
data = train.df,
mtry = p,
ntree = 100)
preds <- predict(mod, newdata = test.df)
err[k] <- with(test.df, mean((Pot.Growth-preds)^2))
}
min(err)
}
(mse.rf <- mseRF(pot.df1))
## [1] 0.8746883
Our MSE dropped drastically, to about 0.87. Using the results from this model we could simply add the predicted growth to each player's Overall, so this model proves to be a much more effective way of predicting Potential.
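A sketch of that last step, where mod is a random forest fit on Pot.Growth and new.players is a hypothetical data frame of players with their Overall ratings attached:
#sketch: turn predicted growth back into a Potential prediction
pred.potential <- new.players$Overall + predict(mod, newdata = new.players)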
Are there other variables we could include that could improve this model?
pot.df1 <- fifa.df %>%
mutate(Pot.Growth = Potential - Overall) %>%
mutate(Nationality = as.character(Nationality)) %>%
filter(Nationality %in% World.Cup.countries) %>%
mutate(Nationality = as.factor(Nationality)) %>%
dplyr::select(Pot.Growth, Position.Type,
Age, Wage, Composure, Nationality, Value,
Skill.Moves, ShortPassing, Real.Face)
p <- ncol(pot.df1) - 1
(mse.rf <- mseRF(pot.df1))
## [1] 0.6351726
By adding Nationality, Value, Skill.Moves, ShortPassing, and Real.Face we can decrease the MSE further, to about 0.64.
Predicting Overall using International.Reputation, Real.Face, Wage, Composure, Jersey.Number, Age, Skill.Moves, Weak.Foot, and Value
Random Forest Model to find error rate
set.seed(234)
p <- ncol(over.df) - 1
mseRF <- function(data.df, kfolds = 8){
folds <- sample(1:kfolds, nrow(data.df), rep = T)
err <- rep(0, kfolds)
for (k in 1:kfolds){
train.df <- data.df[folds != k,]
test.df <- data.df[folds == k,]
mod <- randomForest(Overall ~.,
data = train.df,
mtry = p,
ntree = 100)
preds <- predict(mod, newdata = test.df)
err[k] <- with(test.df, mean((Overall-preds)^2))
}
min(err)
}
(mse.rf <- mseRF(over.df))
## [1] 1.143589
Our MSE is about 1.14. More interestingly, let’s observe the variable importance plots.
Variable Importance of the model
mod.rf <- randomForest(Overall ~.,
data = train.df,
mtry = p,
ntree = 100,
importance = T)
## %IncMSE IncNodePurity variable
## 1 13.302644 319.92971 International.Reputation
## 2 1.331913 19.72421 Real.Face
## 3 9.296683 444.62052 Wage
## 4 12.821787 180.05895 Composure
## 5 5.383163 620.18453 Jersey.Number
## 6 56.157590 6410.39069 Value
grid.arrange(p1, p2, ncol = 2,
top = "Variable Importance Plot")Not surprislingly we see that Value is by far the strongest variable in our model predicting Overall rating. Surprisingly, we see Jersey.Number make an appearance as the second strongest variable in our Node Purity plot.
Conclusion
In this exploration we see some powerful clustering of different types of players using FIFA 19 attributes. We also see some interesting trends relating various predictors to various response variables, including the strong relationship between Value and Overall rating. From here there are nearly unlimited questions that could be asked of this data, and these functions could serve as a template for exploring them.