FIFA is a popular video game series by Electronic Arts, in which you can play as many club and national soccer teams. It also features various story and career modes that allow you to form custom teams with your favorite players or recruit the strongest players to create the best team. Each player has many statistics that measure things such as preferred foot, body type, and skills in passing, shooting, and goalkeeping, that all go into a player’s overall rating.
This analysis focuses on player data from FIFA 19 to determine which player attributes are the most significant predictors of a player’s overall rating (Overall). The dataset used is available on Kaggle here. Multiple linear regression (MLR) and a random forest are compared for their predictive power, though MLR has the distinct advantage of providing estimates of each attribute’s relative contribution to a player’s rating.
As is often the case with Kaggle datasets, this data was relatively clean, but certain variables (columns) were removed before analysis. Some are statistics explained by Overall rather than the other way around, such as Wage and Release.Clause. In other words, a player’s Overall rating determines their Wage and Release.Clause, so they cannot be predictors of Overall. The variable Potential serves more as a proxy for Overall (as described here), so it was also removed. Some columns have no bearing on the analysis, such as a player’s name, photo, club, nationality, and jersey number. Other columns are positional ratings reported as two numbers with a plus sign in between (e.g., 87+3). Since the odd format makes the data hard to work with, and since we are only interested in evaluating players at the position they play, those columns were excluded. Lastly, the dataset includes a variable named Special that has no definition on either Kaggle or the FIFA website, so it was also excluded.
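The column removal described above can be sketched in base R. This is only an illustration on a made-up three-row data frame: the column names follow the article, but the values, the object names (fifa, fifa_clean), and the regular expression used to spot the "87+3" style columns are assumptions, not the author's code.

```r
# Toy stand-in for the Kaggle data (values are invented for illustration)
fifa <- data.frame(
  Name = c("A", "B", "C"),
  Overall = c(88, 75, 69),
  Potential = c(90, 80, 75),
  Wage = c(200, 40, 10),
  Release.Clause = c(150, 20, 5),
  Special = c(2100, 1800, 1600),
  ST = c("87+3", "70+2", "60+2"),
  Dribbling = c(90, 72, 65),
  stringsAsFactors = FALSE
)

# Columns dropped by name: identifiers, consequences or proxies of Overall,
# and the undefined Special column
drop_by_name <- c("Name", "Potential", "Wage", "Release.Clause", "Special")

# Character columns whose values all look like "87+3" (positional ratings)
is_plus_rating <- vapply(
  fifa,
  function(col) is.character(col) && all(grepl("^[0-9]+\\+[0-9]+$", col)),
  logical(1)
)

keep <- setdiff(names(fifa)[!is_plus_rating], drop_by_name)
fifa_clean <- fifa[, keep, drop = FALSE]
names(fifa_clean)
```

Detecting the positional columns by pattern rather than by name avoids typing out every position code, at the cost of assuming that no other column uses the same format.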
Not much needs doing here other than some cosmetic revisions and data type conversion from factor to numeric (or vice-versa).
The most notable detail is the distinct separation of goalkeepers from the rest of the positions. Two examples, Stamina and BallControl, suffice to demonstrate this. Goalkeepers have a much narrower range of player statistics since they’re mainly confined to the goal and penalty areas, and exercise a limited range of motion compared to other players. For these reasons, goalkeepers should be analyzed separately.
Importantly, manual selection must be done. Automatic variable selection procedures like LASSO cannot remedy issues like multicollinearity or outliers. LASSO can inform but not replace human judgment in decision making! Further, while the goal is prediction it’s worthwhile choosing a model with low multicollinearity to preserve the usual interpretation of coefficients. This is in spite of the fact that multicollinearity isn’t an issue if prediction is the only objective. However, its presence obscures an explanatory variable’s effect on the response variable since the x-variable would be related to other x-variables! We can examine a correlation plot of some player characteristics. If two predictors have an \(|r| > 0.7\) they’ll be considered collinear.
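As a rough sketch of the \(|r| > 0.7\) screening rule, the snippet below lists every pair of predictors whose absolute correlation exceeds the threshold. The data are simulated (two columns built from a shared latent skill, one independent column), so the column values and the object names are assumptions made purely for illustration.

```r
set.seed(1)
n <- 200
skill <- rnorm(n)  # shared latent factor, simulated

toy <- data.frame(
  Dribbling = skill + rnorm(n, sd = 0.3),
  BallControl = skill + rnorm(n, sd = 0.3),
  Strength = rnorm(n)  # unrelated to the other two by construction
)

r <- cor(toy)

# Look only above the diagonal so each pair is reported once
high <- which(abs(r) > 0.7 & upper.tri(r), arr.ind = TRUE)

collinear_pairs <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r = r[high]
)
collinear_pairs
```

A list like this only flags candidates. Deciding which member of each pair to keep is still the manual step the paragraph above argues for.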
Dribbling is highly correlated with numerous other variables, except for:
Weight
HeadingAccuracy
Reactions
Balance
Jumping
Strength
Stamina
Aggression
Interceptions
Composure
Marking
StandingTackle
SlidingTackle
Let’s examine a correlation plot for those variables.
Using Interceptions as the baseline, the following will be kept:
Weight
HeadingAccuracy
Reactions
Balance
Jumping
Stamina
Strength
Composure
We choose to keep Interceptions for a couple of reasons. First, it can be thought of as the result of the other variables (e.g., good tackling results in more interceptions). Second, it is a clear indicator of success or failure, instead of just a subjective measure of tackling ability or aggression.
To summarize, these variables will be included in the baseline model:
Age
Weight
Dribbling
Interceptions
HeadingAccuracy
Reactions
Balance
Jumping
Stamina
Strength
Composure
Preferred.Foot
International.Reputation
Work.Rate
Position
First, we’ll recategorize the 27 total player positions to just 6:
The simplified player positions are adapted from Nitin Datta’s kernel. It’s a useful but imperfect categorization that simplifies analysis and reduces processing time. It’s best to exclude goalkeepers given their different distribution relative to other positions.
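One way to collapse detailed positions into a handful of groups is a named lookup vector. The mapping below is hypothetical and deliberately partial: it is not the mapping from the kernel mentioned above, and the raw position codes are assumed examples. Only the group labels (AM, DF, DM, MF, ST, plus GK) mirror the ones that appear in the model output.

```r
# Hypothetical, partial lookup (NOT the kernel's actual mapping)
position_map <- c(
  GK = "GK",
  CB = "DF", LB = "DF", RB = "DF",
  CDM = "DM",
  CM = "MF", LM = "MF", RM = "MF",
  CAM = "AM",
  ST = "ST", CF = "ST"
)

# Example raw positions (invented)
raw_position <- c("ST", "GK", "CB", "CAM", "CDM", "RM")

# Look up each raw code, then store the result as a factor
simple_position <- factor(unname(position_map[raw_position]))

# Goalkeepers are set aside before fitting the MLR model
outfield <- simple_position != "GK"
table(droplevels(simple_position[outfield]))
```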
To compare the relative importance of each variable, the predictors need to be standardized since units of measure differ (e.g., Age and Weight). For a continuous predictor (column), the column mean \(\overline{x}\) is subtracted from each observation \(x_i\), and the difference is then divided by the column’s standard deviation \(sd(x)\), as shown in the formula below.
\[\frac{x_i - \overline{x}}{sd(x)}\]
While it’s standard (pun intended) to perform this calculation on both (continuous) predictors and the response variable, standardizing just the predictors keeps the response variable in its original units. Check the University of Notre Dame’s summary on standardizing variables and their interpretations.
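A minimal base R sketch of this standardization, applied to the predictors only so that Overall stays in its original units. The four-row data frame and the object names are made up for illustration.

```r
# Toy data (invented values)
toy <- data.frame(
  Overall = c(70, 75, 82, 66),
  Age = c(21, 25, 30, 19),
  Weight = c(150, 165, 180, 160)
)

predictors <- c("Age", "Weight")

# (x_i - mean(x)) / sd(x), column by column; Overall is left untouched
toy_std <- toy
toy_std[predictors] <- lapply(
  toy[predictors],
  function(x) (x - mean(x)) / sd(x)
)

# Each standardized column now has mean ~0 and standard deviation 1
sapply(toy_std[predictors], mean)
sapply(toy_std[predictors], sd)
```

Base R's scale() performs the same centering and scaling and returns a matrix, so scale(toy[predictors]) would give matching values.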
Anova Table (Type III tests)

Response: Overall
                  Sum Sq    Df     F value                Pr(>F)
(Intercept)      2069425     1 283279.8391 < 0.00000000000000022 ***
Age                    4     1      0.5692              0.450590
Weight               245     1     33.6026        0.000000006919 ***
Dribbling          14011     1   1917.9128 < 0.00000000000000022 ***
Interceptions         74     1     10.0663              0.001514 **
HeadingAccuracy     5923     1    810.7256 < 0.00000000000000022 ***
Reactions          32767     1   4485.4614 < 0.00000000000000022 ***
Balance               71     1      9.7076              0.001839 **
Jumping                8     1      1.0438              0.306966
Stamina             1103     1    150.9605 < 0.00000000000000022 ***
Strength             975     1    133.4316 < 0.00000000000000022 ***
Composure          15949     1   2183.2197 < 0.00000000000000022 ***
Work.Rate           1330     7     26.0049 < 0.00000000000000022 ***
Position            5354     4    183.2165 < 0.00000000000000022 ***
Residuals          94048 12874
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All variables except Age and Jumping are statistically significant at the 1% level. To assess this model’s quality, we examine how well it aligns with linear regression assumptions.
Residuals appear to be normally distributed with constant variance, as shown in the QQ plot and the Residuals vs Fitted plot. There are no problematic leverage or influential points in the Residuals vs Leverage plot. With large datasets, it’s not uncommon to see some observations fall outside 3 standard deviations (under normality about 0.3% of them, or roughly 35 of the ~12,900 training observations), so observations near \(\pm\) 4 standard deviations in the Residuals vs Leverage plot are not necessarily problematic. Since this is cross-sectional data from a single season, we do not have to worry about serial correlation. Clustering could be an issue given that players on the same team can help or hurt each other’s statistics. Nevertheless, independence is assumed for this analysis.
                     GVIF  Df  GVIF^(1/(2*Df))
Age              1.465875   1         1.210733
Weight           2.572048   1         1.603761
Dribbling        2.854910   1         1.689648
Interceptions    3.747066   1         1.935734
HeadingAccuracy  2.532135   1         1.591268
Reactions        3.019605   1         1.737701
Balance          2.692676   1         1.640937
Jumping          1.393673   1         1.180539
Stamina          1.634045   1         1.278298
Strength         3.399674   1         1.843821
Composure        2.769842   1         1.664284
Work.Rate        1.816716   7         1.043567
Position         5.056235   4         1.224555
GVIF is used instead of VIF when a (categorical) variable has more than two levels, or when a quadratic term exists. This is the case with the FIFA data: for example, Work.Rate has 8 levels (categories), or 7 degrees of freedom. GVIF reduces to the VIF for continuous predictors, so squaring the last column of this output, GVIF^(1/(2*Df)), recovers the usual VIF for those predictors. See Section 4.5 of Practical Econometrics for more details. All GVIF values are moderately low, evidence that the MLR model doesn’t suffer from multicollinearity.
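The relationship between the GVIF columns can be checked by hand. The short sketch below copies four rows of the output above and recomputes the adjusted column; the object names are assumptions chosen for illustration.

```r
# GVIF and Df values copied from the output above
gvif <- c(Age = 1.465875, Weight = 2.572048, Work.Rate = 1.816716, Position = 5.056235)
df <- c(Age = 1, Weight = 1, Work.Rate = 7, Position = 4)

# Adjusted GVIF, comparable across terms with different degrees of freedom
adj <- gvif^(1 / (2 * df))
round(adj, 6)

# For Df = 1 terms, squaring the adjusted value returns the GVIF (the usual VIF)
adj[c("Age", "Weight")]^2
```

For Work.Rate and Position the adjusted values land close to 1, which is why those multi-level factors do not raise a multicollinearity concern despite Position's larger raw GVIF.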
Here’s the coefficient summary for the MLR model.
Call:
lm(formula = Overall ~ ., data = fifa_train)

Residuals:
     Min       1Q   Median       3Q      Max
-12.3542  -1.8035  -0.0193   1.7967  10.9269

Coefficients:
                      Estimate Std. Error t value             Pr(>|t|)
(Intercept)           66.57991    0.12509 532.240 < 0.0000000000000002 ***
Age                    0.02175    0.02882   0.754              0.45059
Weight                 0.22291    0.03845   5.797  0.00000000691861179 ***
Dribbling              1.76679    0.04034  43.794 < 0.0000000000000002 ***
Interceptions          0.14618    0.04607   3.173              0.00151 **
HeadingAccuracy        1.07973    0.03792  28.473 < 0.0000000000000002 ***
Reactions              2.75961    0.04120  66.974 < 0.0000000000000002 ***
Balance                0.12200    0.03916   3.116              0.00184 **
Jumping                0.02876    0.02815   1.022              0.30697
Stamina                0.37617    0.03062  12.287 < 0.0000000000000002 ***
Strength               0.50999    0.04415  11.551 < 0.0000000000000002 ***
Composure              1.85165    0.03963  46.725 < 0.0000000000000002 ***
Work.RateHighLow       1.09083    0.15422   7.073  0.00000000000159492 ***
Work.RateHighMedium    0.48086    0.11183   4.300  0.00001722409939677 ***
Work.RateLowHigh       1.38909    0.17827   7.792  0.00000000000000709 ***
Work.RateLowMedium     1.23578    0.17729   6.970  0.00000000000331873 ***
Work.RateMediumHigh    0.61760    0.12458   4.957  0.00000072365048378 ***
Work.RateMediumLow     0.74133    0.14800   5.009  0.00000055396648836 ***
Work.RateMediumMedium  0.20971    0.10656   1.968              0.04908 *
PositionDF             0.31754    0.11180   2.840              0.00452 **
PositionDM            -1.25205    0.12622  -9.920 < 0.0000000000000002 ***
PositionMF            -0.93235    0.08982 -10.380 < 0.0000000000000002 ***
PositionST            -1.81311    0.10731 -16.896 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.703 on 12874 degrees of freedom
Multiple R-squared: 0.8426, Adjusted R-squared: 0.8423
F-statistic: 3133 on 22 and 12874 DF, p-value: < 0.00000000000000022
An interesting result is that a player’s Age isn’t a significant predictor when all other model variables are accounted for, suggesting that skill takes precedence. There’s not enough evidence to suggest Jumping skills are beneficial either, which makes sense given that soccer is concerned more with kicking and running. The most likely exception is goalkeepers, who jump quite often but were not included in the model.
The Work.Rate baseline category is HighHigh. Compared to it, the most effective Work.Rate profiles appear to be high on attack and low on defense, and low on attack and high on defense, suggesting players are better off specializing in one area rather than both.
The Position baseline category is PositionAM (equivalently, attacking midfielders). Compared to them, only defenders (PositionDF) fare better.
Overall, the most important quantitative variables affecting the Overall score are Dribbling and Reactions. Dribbling is defined as “a player’s ability to carry the ball and past an opponent while being in control”. Reactions is defined as a player’s speed in responding to events and situations around them. These characteristics agree with our intuition on what makes a great soccer player. One can even argue that, taken together, these two variables encompass the other skills-based variables.
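One strength of linear regression models is their high interpretability, so this should be taken advantage of. Here is a template for interpreting the variables from the MLR model. Because the continuous predictors are standardized, the y-intercept (66.58) is the expected Overall score for an attacking midfielder with a HighHigh work rate whose continuous attributes all sit at their sample means, rather than for a player who is 0 years old and weighs nothing.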
For the continuous variables Age through Composure the construct looks like this:
A one standard deviation increase in Weight is associated with a mean increase in Overall score of 0.22, holding all other variables constant.
Another way to phrase this is:
A 14.81-unit increase in Weight is associated with a mean increase in Overall score of 0.22, holding all other variables constant.
or
A one-unit increase in Weight is associated with a mean increase in Overall score of 0.0149, holding all other variables constant.
The standard deviation of Weight in the dataset is 14.80742, or ~ 14.81. In keeping with the usual interpretation in regression (“a one unit increase…”), we can divide 0.22 by 14.81.
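The conversion from a per-standard-deviation effect to a per-unit effect is a single division, sketched below with the numbers quoted in the text. The object names are assumptions; the unrounded coefficient is the Weight estimate from the coefficient summary.

```r
sd_weight <- 14.80742   # standard deviation of Weight, as quoted above
beta_std <- 0.22291     # Weight estimate from the coefficient summary

# Per-unit effect using the rounded figures from the text
per_unit_rounded <- round(0.22 / 14.81, 4)

# Per-unit effect using the unrounded figures
per_unit_exact <- beta_std / sd_weight

per_unit_rounded
per_unit_exact
```

The rounded inputs reproduce the 0.0149 quoted above, while the unrounded inputs give a value closer to 0.0151, so the third decimal place should not be over-read.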
For Work.Rate the construct looks like this:
Compared to players with HighHigh profiles, players with HighLow profiles are expected to have a mean increase in Overall score of 1.09, holding all other variables constant.
For Position the construct looks like this:
Compared to attacking midfielders (AM), defenders (DF) are expected to have a mean increase in Overall score of 0.32, holding all other variables constant.
95% confidence intervals are easily obtained.
                              2.5 %      97.5 %
(Intercept)           66.3347096948 66.82511403
Age                   -0.0347520055  0.07824370
Weight                 0.1475339621  0.29828525
Dribbling              1.6877103072  1.84586758
Interceptions          0.0558693298  0.23649343
HeadingAccuracy        1.0053974173  1.15405815
Reactions              2.6788473572  2.84038128
Balance                0.0452475772  0.19875292
Jumping               -0.0264183776  0.08393637
Stamina                0.3161606961  0.43618683
Strength               0.4234505602  0.59653290
Composure              1.7739714063  1.92932775
Work.RateHighLow       0.7885234258  1.39313008
Work.RateHighMedium    0.2616469983  0.70006729
Work.RateLowHigh       1.0396557173  1.73852304
Work.RateLowMedium     0.8882602451  1.58330914
Work.RateMediumHigh    0.3734027092  0.86180542
Work.RateMediumLow     0.4512398000  1.03142832
Work.RateMediumMedium  0.0008428747  0.41858326
PositionDF             0.0983891025  0.53669250
PositionDM            -1.4994488231 -1.00464399
PositionMF            -1.1084118627 -0.75628092
PositionST            -2.0234532813 -1.60277161
Taking Stamina as an example, the interpretation is: we’re 95% confident that the true value of Stamina’s coefficient, using standardized data, lies between 0.316 and 0.436.
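Intervals like these come from calling confint() on a fitted lm object. The sketch below uses simulated data with two standardized predictors, so the coefficients, sample size, and object names are assumptions and will not reproduce the article's numbers.

```r
set.seed(7)
n <- 100

# Simulated, already-standardized predictors (illustrative only)
toy <- data.frame(Stamina = rnorm(n), Reactions = rnorm(n))
toy$Overall <- 66 + 0.4 * toy$Stamina + 2.8 * toy$Reactions + rnorm(n, sd = 2.7)

# Same formula style as the article: response regressed on all other columns
fit <- lm(Overall ~ ., data = toy)

# 95% confidence intervals for every coefficient
ci <- confint(fit, level = 0.95)
ci["Stamina", ]
```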
The predicted values and actual values for the test set have a correlation of approximately 0.91, which suggests a relatively good fit.
To summarize the results:
For prediction, RMSE and MAPE are the more relevant metrics. RMSE is used for providing prediction intervals that quantify the margin of error for a predicted value. For large samples, a 95% prediction interval (PI) takes the form \(\hat y \pm 2*RMSE\), where \(\hat y\) is the predicted value from the regression model. For the MLR model, the margin of error for a 95% PI is \(2*2.73 = 5.46\). If, for example, we predict a player’s Overall to be 75, the lower bound is 75-5.46=70 and the upper bound is 75+5.46=80, rounding to the nearest whole number since FIFA scores are integer values.
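The rule-of-thumb interval is easy to wrap in a small helper. The function name prediction_interval and its arguments are hypothetical, introduced only for this sketch; the multiplier of 2 is the large-sample approximation used in the text.

```r
# Approximate large-sample prediction interval: y_hat +/- k * RMSE
prediction_interval <- function(y_hat, rmse, k = 2) {
  c(lower = y_hat - k * rmse, upper = y_hat + k * rmse)
}

# Worked example from the text: predicted Overall of 75, RMSE of 2.73
pi_75 <- prediction_interval(75, 2.73)
pi_75          # 69.54 and 80.46
round(pi_75)   # 70 and 80 after rounding to whole ratings
```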
The MAPE quantifies how far off the predictions were from the actual values, as a proportion of the actual values. The MLR model implies an accuracy of ~ 97%. So while point predictions are accurate, the margin of error might be a bit wide.
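Both metrics can be written in a line each. The sketch below uses four invented ratings, and it assumes that the "accuracy" quoted above means one minus the MAPE, which the article does not state explicitly.

```r
# Root mean squared error: typical size of a prediction error, in rating points
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute percentage error, returned as a proportion
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))

# Invented ratings for illustration
actual <- c(70, 80, 65, 90)
predicted <- c(72, 78, 66, 87)

toy_rmse <- rmse(actual, predicted)
toy_mape <- mape(actual, predicted)
toy_accuracy <- 1 - toy_mape  # assumed definition of "accuracy"

c(rmse = toy_rmse, mape = toy_mape, accuracy = toy_accuracy)
```

Because Overall ratings are far from zero, MAPE is well behaved here. It would be a poor choice for a response that can be close to 0.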
Earlier it was mentioned goalkeepers were excluded from the MLR model due to their distinct pattern from the rest of positions. A decision tree can easily handle such abnormalities and non-linearity since it isn’t forced to conform to linear assumptions about the data. We will try improving the accuracy of our predictions (lowering the RMSE) by using a random forest, an ensemble of decision trees.
We use a random forest of 250 trees (250 bootstrapped samples). Standardization isn’t necessary for random forests because, as this Stack Overflow post explains, they don’t have a similar metric for explaining the relationship between a predictor and response variable as do MLR models, namely the coefficients. The one metric included in random forest output is Importance, measuring how much each predictor reduces the residual sum of squares (SSR).
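A hedged sketch of such a fit, assuming the randomForest package is what produced the %IncMSE and IncNodePurity columns shown further down (the article does not name the package). It uses the built-in mtcars data rather than the FIFA data, so the split, the response, and the object names are stand-ins, and the block is skipped when the package is not installed.

```r
rf_ran <- FALSE  # flag recording whether the optional package was available

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)

  # Simple train/test split on a built-in dataset (stand-in for fifa_train)
  idx <- sample(nrow(mtcars), 24)
  train <- mtcars[idx, ]
  test <- mtcars[-idx, ]

  # 250 trees; importance = TRUE requests the permutation-based %IncMSE
  rf <- randomForest::randomForest(
    mpg ~ ., data = train, ntree = 250, importance = TRUE
  )

  # Test-set RMSE and the variable importance table
  pred <- predict(rf, newdata = test)
  print(sqrt(mean((test$mpg - pred)^2)))
  print(randomForest::importance(rf))

  rf_ran <- TRUE
}
```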
Here’s a summary of the results:
A random forest better captures the variability in the data, even with the goalkeepers included, by approximately 6 percentage points. The RMSE decreases modestly (2.71 to 2.20). We have to keep in mind, however, that the MLR model didn’t include goalkeepers. If goalkeepers are removed, the test set RMSE for the random forest decreases to ~ 1.9.
                   %IncMSE IncNodePurity
Age              1.4099309     50560.928
Weight           0.5068813      8197.697
Dribbling       11.4324899     90320.777
Interceptions    5.2887394     52796.588
HeadingAccuracy  3.5888335     30998.888
Reactions       18.1111568    259676.113
Balance          0.5318262      7846.675
Jumping          0.3480535      8676.450
Stamina          1.6217141     18999.577
Strength         0.9273811     14032.001
Composure        6.4814500    129137.541
Work.Rate        0.1798603      4841.686
Position         2.7123281      8395.659
The random forest, as did the MLR model, indicates that Reactions and Dribbling are the most important indicators of a player’s Overall score.
An MLR model and a random forest regression model were compared for their predictive power. The MLR model strikes a balance between predictive accuracy and explainability since it limits multicollinearity while using a minimally sufficient subset of predictor variables. The random forest doesn’t provide a large margin of improvement over MLR, illustrating the power of linear regression modeling. Nevertheless, both models indicate that Reactions and Dribbling are the most important indicators of a player’s success. Further research can focus on modeling goalkeepers separately with MLR or adding more variables to improve the current MLR predictive accuracy, hopefully without sacrificing explainability.
The complete R Markdown code and the csv file used for this analysis can be found on my Fifa19 Github repository.