Introduction

FIFA is a popular video game series by Electronic Arts, in which you can play as many club and national soccer teams. It also features various story and career modes that allow you to form custom teams with your favorite players or recruit the strongest players to create the best team. Each player has many statistics that measure things such as preferred foot, body type, and skills in passing, shooting, and goalkeeping, that all go into a player’s overall rating.

This analysis focuses on player data from FIFA 19 to determine which player attributes are the most significant predictors of a player’s overall rating (Overall). The dataset used is available on Kaggle. Multiple linear regression (MLR) and a random forest are compared for their predictive power, though MLR has the distinct advantage of providing estimates of each attribute’s relative contribution to a player’s rating.

Data Profile

As is often the case with Kaggle datasets, this data was relatively clean, but certain variables (columns) are removed before analysis. Some are statistics explained by Overall rather than the other way around, such as Wage and Release.Clause: a player’s Overall rating determines their Wage and Release.Clause, so they cannot serve as predictors of Overall. The variable Potential serves more as a proxy for Overall (as described here), so it is also removed. Some columns have no bearing on the analysis, such as a player’s name, photo, club, nationality, and jersey number. Other columns are positional ratings reported as two numbers with a plus sign in between (e.g., 87+3). Since the odd format makes the data hard to work with, and since we are only interested in evaluating players at the position they play, those columns are excluded. Lastly, the dataset includes a variable named Special that isn’t defined on either Kaggle or the FIFA website, so it’s also excluded.
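
The column removal described above might look roughly like the following base R sketch. The toy data frame, the column names Name and LS, and the regular expression are assumptions made for illustration; the real dataset has many more columns.

```r
# Toy stand-in for the raw data (all values are made up)
toy <- data.frame(
  Name      = c("Player A", "Player B"),
  Overall   = c(88, 75),
  Wage      = c("200K", "15K"),
  LS        = c("87+3", "70+2"),   # positional rating in "number+number" form
  Dribbling = c(90, 72),
  stringsAsFactors = FALSE
)

# Columns dropped by name (hypothetical subset of the full drop list)
drop_named <- c("Name", "Wage")

# Detect character columns whose every value looks like "87+3"
is_plus_format <- vapply(
  toy,
  function(col) is.character(col) && all(grepl("^[0-9]+\\+[0-9]+$", col)),
  logical(1)
)

keep <- !(names(toy) %in% drop_named) & !is_plus_format
cleaned <- toy[, keep, drop = FALSE]
print(names(cleaned))
```

Detecting the plus-sign columns by pattern, rather than listing them by hand, is one possible design choice; an explicit list of names would work equally well.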

Data cleaning

Not much needs doing here other than some cosmetic revisions and data type conversion from factor to numeric (or vice-versa).
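
One pitfall in the factor-to-numeric conversion is worth a brief sketch. The values below are made up for illustration; the point is that calling as.numeric() directly on a factor returns the internal level codes rather than the numbers the labels represent.

```r
# Toy factor standing in for a numeric column that was read as a factor
stamina_f <- factor(c("72", "85", "60", "85"))

# Pitfall: as.numeric() on a factor returns level codes, not the labels
codes <- as.numeric(stamina_f)

# Safer: convert to character first, then to numeric
stamina_num <- as.numeric(as.character(stamina_f))

# Character-to-factor conversion for a categorical column
foot <- factor(c("Left", "Right", "Right", "Left"))

print(codes)
print(stamina_num)
print(levels(foot))
```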

Exploratory Data Analysis

The most notable detail is the distinct separation of goalkeepers from the other positions. Two examples, Stamina and BallControl, suffice to demonstrate this. Goalkeepers have a much narrower range of player statistics since they’re mainly confined to the goal and penalty areas and exercise a limited range of motion compared to other players. For these reasons, goalkeepers should be analyzed separately.

Manual Variable Selection

Manual variable selection is still necessary. Automatic variable selection procedures like LASSO cannot remedy issues like multicollinearity or outliers; LASSO can inform but not replace human judgment. Further, although the goal is prediction, and multicollinearity isn’t an issue when prediction is the only objective, it’s worthwhile choosing a model with low multicollinearity to preserve the usual interpretation of coefficients. Its presence obscures an explanatory variable’s effect on the response variable, since that x-variable would be related to other x-variables. We can examine a correlation plot of some player characteristics. If two predictors have \(|r| > 0.7\), they’ll be considered collinear.
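
The \(|r| > 0.7\) rule can also be applied programmatically alongside the plot. The sketch below uses simulated columns (the values and the way BallControl is constructed are assumptions, not the real data) and lists each flagged pair once.

```r
set.seed(1)
n <- 200
dribbling <- rnorm(n, 60, 10)
toy <- data.frame(
  Dribbling   = dribbling,
  BallControl = dribbling + rnorm(n, 0, 3),  # built to track Dribbling closely
  Strength    = rnorm(n, 65, 10)             # generated independently
)

r <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| > 0.7
idx <- which(abs(r) > 0.7 & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(flagged)
```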

Dribbling is highly correlated with numerous other variables, except for:

  • Weight
  • HeadingAccuracy
  • Reactions
  • Balance
  • Jumping
  • Strength
  • Stamina
  • Aggression
  • Interceptions
  • Composure
  • Marking
  • StandingTackle
  • SlidingTackle

Let’s examine a correlation plot for those variables.

Using Interceptions as the baseline, the following will be kept:

  • Weight
  • HeadingAccuracy
  • Reactions
  • Balance
  • Jumping
  • Stamina
  • Strength
  • Composure

We choose to keep Interceptions for a couple of reasons. First, it can be thought of as the result of the other variables (e.g., good tackling results in more interceptions). Second, it is a clear indicator of success or failure, instead of just a subjective measure of tackling ability or aggression.

To summarize, these variables will be included in the baseline model:

  • Age
  • Weight
  • Dribbling
  • Interceptions
  • HeadingAccuracy
  • Reactions
  • Balance
  • Jumping
  • Stamina
  • Strength
  • Composure
  • Preferred.Foot
  • International.Reputation
  • Work.Rate
  • Position

Predicting Overall Rating

Multiple Linear Regression (MLR)

First, we’ll recategorize the 27 total player positions to just 6:

  • Defenders (DF)
  • Defensive Midfielders (DM)
  • Midfielders (MF)
  • Attacking Midfielders (AM)
  • Strikers (ST)
  • Goalkeepers (GK)

The simplified player positions are adapted from Nitin Datta’s kernel. It’s a useful but imperfect categorization that simplifies analysis and reduces processing time. It’s best to exclude goalkeepers given their different distribution relative to other positions.
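
A recoding of this kind can be done with a named lookup vector. The mapping below covers only a handful of position codes and is purely illustrative; it is not the mapping from the kernel mentioned above, and the toy data frame is made up.

```r
# Illustrative lookup only: a few positions, not the full 27-position mapping
position_map <- c(CB = "DF", LB = "DF", RB = "DF",
                  CDM = "DM", CM = "MF", CAM = "AM",
                  ST = "ST", GK = "GK")

players <- data.frame(Position = c("CB", "CAM", "GK", "ST", "CDM"),
                      stringsAsFactors = FALSE)

# Look up each player's simplified group by position code
players$Group <- unname(position_map[players$Position])

# Drop goalkeepers before fitting the MLR model, then convert to a factor
outfield <- players[players$Group != "GK", ]
outfield$Group <- factor(outfield$Group)
print(outfield)
```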

To compare the relative importance of each variable, the predictors need to be standardized since units of measure differ (e.g., Age and Weight). For a continuous predictor (column), the column mean \(\overline{x}\) is subtracted from each observation \(x_i\), and the difference is then divided by the column’s standard deviation \(sd(x)\), as shown in the formula below.

\[\frac{x_i - \overline{x}}{sd(x)}\]

While it’s standard (pun intended) to perform this calculation on both (continuous) predictors and the response variable, standardizing just the predictors keeps the response variable in its original units. Check the University of Notre Dame’s summary on standardizing variables and their interpretations.
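
Standardizing only the predictors can be sketched in base R with scale(). The data frame below is simulated and its values are assumptions; the manual calculation is included to confirm that scale() matches the formula.

```r
set.seed(42)
toy <- data.frame(
  Overall = round(rnorm(50, 66, 7)),   # response, left in original units
  Age     = round(runif(50, 17, 38)),
  Weight  = rnorm(50, 75, 15)
)

predictors <- c("Age", "Weight")

# Standardize the continuous predictors only: (x - mean) / sd, column-wise
toy_std <- toy
toy_std[predictors] <- as.data.frame(scale(toy[predictors]))

# Manual version of the same formula for one column
manual_age <- (toy$Age - mean(toy$Age)) / sd(toy$Age)

print(all.equal(toy_std$Age, manual_age))
print(round(colMeans(toy_std[predictors]), 10))
print(apply(toy_std[predictors], 2, sd))
```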

Anova Table (Type III tests)

Response: Overall
                 Sum Sq    Df     F value                Pr(>F)    
(Intercept)     2069425     1 283279.8391 < 0.00000000000000022 ***
Age                   4     1      0.5692              0.450590    
Weight              245     1     33.6026        0.000000006919 ***
Dribbling         14011     1   1917.9128 < 0.00000000000000022 ***
Interceptions        74     1     10.0663              0.001514 ** 
HeadingAccuracy    5923     1    810.7256 < 0.00000000000000022 ***
Reactions         32767     1   4485.4614 < 0.00000000000000022 ***
Balance              71     1      9.7076              0.001839 ** 
Jumping               8     1      1.0438              0.306966    
Stamina            1103     1    150.9605 < 0.00000000000000022 ***
Strength            975     1    133.4316 < 0.00000000000000022 ***
Composure         15949     1   2183.2197 < 0.00000000000000022 ***
Work.Rate          1330     7     26.0049 < 0.00000000000000022 ***
Position           5354     4    183.2165 < 0.00000000000000022 ***
Residuals         94048 12874                                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All variables except Age and Jumping are statistically significant at the 1% level. To assess this model’s quality, we examine how well it aligns with linear regression assumptions.

Assumption Checking

Residuals appear to be normally distributed with constant variance, as shown in the QQ plot and the Residuals vs Fitted plot. There are no problematic leverage or influential points in the Residuals vs Leverage plot. Under normality, roughly 5% of standardized residuals fall outside 2 standard deviations and about 0.3% fall outside 3, so with a dataset this large a handful of observations near \(\pm\) 4 standard deviations in the Residuals vs Leverage plot are not necessarily problematic. Since this is cross-sectional data from a single season, we do not have to worry about serial correlation. Clustering could be an issue given that players on the same team can help or hurt each other’s statistics. Nevertheless, independence is assumed for this analysis.
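
As a rough check on how many large standardized residuals to expect, the normal tail probabilities can be multiplied by the training sample size. The sketch assumes exactly normal residuals; the sample size is inferred from the model output (12874 residual degrees of freedom plus 23 estimated coefficients).

```r
# Training sample size inferred from the model output
n <- 12874 + 23

# Two-sided tail probabilities under a standard normal distribution
p_beyond <- c(
  beyond_2_sd = 2 * pnorm(-2),
  beyond_3_sd = 2 * pnorm(-3),
  beyond_4_sd = 2 * pnorm(-4)
)

# Expected number of residuals beyond each threshold
expected <- n * p_beyond

print(round(p_beyond, 5))
print(round(expected, 1))
```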

Examining Potential Multicollinearity

                    GVIF Df GVIF^(1/(2*Df))
Age             1.465875  1        1.210733
Weight          2.572048  1        1.603761
Dribbling       2.854910  1        1.689648
Interceptions   3.747066  1        1.935734
HeadingAccuracy 2.532135  1        1.591268
Reactions       3.019605  1        1.737701
Balance         2.692676  1        1.640937
Jumping         1.393673  1        1.180539
Stamina         1.634045  1        1.278298
Strength        3.399674  1        1.843821
Composure       2.769842  1        1.664284
Work.Rate       1.816716  7        1.043567
Position        5.056235  4        1.224555

GVIF is used instead of VIF when a (categorical) variable has more than two levels, or when a quadratic term exists. This is the case with the FIFA data: for example, Work.Rate has 8 levels (categories), or 7 degrees of freedom. GVIF reduces to the VIF for continuous predictors, so for those rows, squaring the last column of this output, GVIF^(1/(2*Df)), recovers the usual VIF. See Section 4.5 of Practical Econometrics for more details. All GVIF values are moderately low, evidence that the MLR model doesn’t suffer from serious multicollinearity.
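
For continuous predictors, the VIF can be computed by hand as \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. The sketch below uses simulated variables (x1, x2, and x3 are hypothetical) and base R only; the table above appears to have the format produced by vif() in the car package, which this sketch deliberately avoids depending on.

```r
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # built to be correlated with x1
x3 <- rnorm(n)                       # generated independently
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), regressing each predictor on all the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
# Square roots are on the same scale as GVIF^(1/(2*Df)) when Df = 1
print(round(sqrt(vif_manual), 2))
```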

MLR Analysis

Here’s the coefficient summary for the MLR model.


Call:
lm(formula = Overall ~ ., data = fifa_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.3542  -1.8035  -0.0193   1.7967  10.9269 

Coefficients:
                      Estimate Std. Error t value             Pr(>|t|)    
(Intercept)           66.57991    0.12509 532.240 < 0.0000000000000002 ***
Age                    0.02175    0.02882   0.754              0.45059    
Weight                 0.22291    0.03845   5.797  0.00000000691861179 ***
Dribbling              1.76679    0.04034  43.794 < 0.0000000000000002 ***
Interceptions          0.14618    0.04607   3.173              0.00151 ** 
HeadingAccuracy        1.07973    0.03792  28.473 < 0.0000000000000002 ***
Reactions              2.75961    0.04120  66.974 < 0.0000000000000002 ***
Balance                0.12200    0.03916   3.116              0.00184 ** 
Jumping                0.02876    0.02815   1.022              0.30697    
Stamina                0.37617    0.03062  12.287 < 0.0000000000000002 ***
Strength               0.50999    0.04415  11.551 < 0.0000000000000002 ***
Composure              1.85165    0.03963  46.725 < 0.0000000000000002 ***
Work.RateHighLow       1.09083    0.15422   7.073  0.00000000000159492 ***
Work.RateHighMedium    0.48086    0.11183   4.300  0.00001722409939677 ***
Work.RateLowHigh       1.38909    0.17827   7.792  0.00000000000000709 ***
Work.RateLowMedium     1.23578    0.17729   6.970  0.00000000000331873 ***
Work.RateMediumHigh    0.61760    0.12458   4.957  0.00000072365048378 ***
Work.RateMediumLow     0.74133    0.14800   5.009  0.00000055396648836 ***
Work.RateMediumMedium  0.20971    0.10656   1.968              0.04908 *  
PositionDF             0.31754    0.11180   2.840              0.00452 ** 
PositionDM            -1.25205    0.12622  -9.920 < 0.0000000000000002 ***
PositionMF            -0.93235    0.08982 -10.380 < 0.0000000000000002 ***
PositionST            -1.81311    0.10731 -16.896 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.703 on 12874 degrees of freedom
Multiple R-squared:  0.8426,    Adjusted R-squared:  0.8423 
F-statistic:  3133 on 22 and 12874 DF,  p-value: < 0.00000000000000022

An interesting result is that a player’s Age isn’t a significant predictor when all other model variables are accounted for, suggesting that skill takes precedence. There’s not enough evidence to suggest Jumping skills are beneficial either, which makes sense given that soccer is concerned more with kicking and running. The most likely exception is goalkeepers, who jump quite often but were not included in the model.

The Work.Rate baseline category is HighHigh. Compared to it, the most effective Work.Rates appear to be high on attack and low on defense, and low on attack and high on defense, suggesting players are better off specializing in one area rather than both.

The Position baseline category is PositionAM (equivalently, Attacking Midfielders). Compared to them, only defenders (PositionDF) fare better.
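
The baseline category is simply the first level of the factor, which by default is the first in alphabetical order. A short sketch with made-up values shows how to inspect it and, if a different comparison group is preferred, change it with relevel().

```r
# Toy position factor (values are made up)
pos <- factor(c("DF", "AM", "ST", "MF", "DM", "AM"))

# The reference (baseline) level is the first level
baseline_default <- levels(pos)[1]

# Choose a different baseline, e.g. midfielders
pos_mf <- relevel(pos, ref = "MF")
baseline_new <- levels(pos_mf)[1]

print(baseline_default)
print(baseline_new)
```

Changing the baseline shifts the intercept and the category coefficients but leaves the fitted values unchanged.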

Overall, the most important quantitative variables affecting Overall score are Reactions and Dribbling, although Composure’s standardized coefficient (1.85) is comparable to Dribbling’s (1.77). Dribbling is defined as “a player’s ability to carry the ball and past an opponent while being in control”. Reactions is defined as a player’s speed in responding to events and situations around them. These characteristics agree with our intuition on what makes a great soccer player. One can even argue that, taken together, these two variables encompass the other skills-based variables.

Parameter Interpretation

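One strength of linear regression models is their high interpretability, so this should be taken advantage of. Here is a template for interpreting the variables from the MLR model. Because the continuous predictors are standardized, the y-intercept (66.58) is the expected Overall for a player at the mean of every continuous predictor and in the baseline categories of Work.Rate and Position; without standardization it would have no logical meaning (can someone be 0 years old and weigh 0 kg?).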

For the continuous variables Age through Composure the construct looks like this:

  • An increase of one standard deviation in Weight is associated with a mean increase in Overall score of 0.22, holding all other variables constant.

Another way to phrase this is:

  • A 14.81 kg increase in Weight is associated with a mean increase in Overall score of 0.22, holding all other variables constant.

or

  • A 1 kg increase in Weight is associated with a mean increase in Overall score of 0.0149, holding all other variables constant.

The standard deviation of Weight in the dataset is 14.80742, or ~ 14.81. In keeping with the usual interpretation in regression (“a one unit increase…”), we can divide 0.22 by 14.81 to get 0.0149. Using the unrounded coefficient (0.22291) gives approximately 0.0151.
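
The conversion from a per-standard-deviation effect to a per-unit effect is a single division. The sketch below uses the figures quoted above and shows that rounding the inputs first changes the fourth decimal place.

```r
# Weight coefficient from the model summary (effect of a 1 SD increase)
beta_std  <- 0.22291
# Standard deviation of Weight as reported in the text
sd_weight <- 14.80742

# Per-unit effect using unrounded inputs
beta_per_unit <- beta_std / sd_weight

# Per-unit effect using the rounded inputs quoted in the text
beta_per_unit_rounded <- 0.22 / 14.81

print(round(beta_per_unit, 4))
print(round(beta_per_unit_rounded, 4))
```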

For Work.Rate the construct looks like this:

  • Relative to players with HighHigh profiles, players with HighLow profiles are expected to have a mean increase in Overall score of 1.09, holding all other variables constant.

For Position the construct looks like this:

  • Relative to attacking midfielders (AM), defenders (DF) are expected to have a mean increase in Overall score of 0.32, holding all other variables constant.

95% confidence intervals are easily obtained.

                              2.5 %      97.5 %
(Intercept)           66.3347096948 66.82511403
Age                   -0.0347520055  0.07824370
Weight                 0.1475339621  0.29828525
Dribbling              1.6877103072  1.84586758
Interceptions          0.0558693298  0.23649343
HeadingAccuracy        1.0053974173  1.15405815
Reactions              2.6788473572  2.84038128
Balance                0.0452475772  0.19875292
Jumping               -0.0264183776  0.08393637
Stamina                0.3161606961  0.43618683
Strength               0.4234505602  0.59653290
Composure              1.7739714063  1.92932775
Work.RateHighLow       0.7885234258  1.39313008
Work.RateHighMedium    0.2616469983  0.70006729
Work.RateLowHigh       1.0396557173  1.73852304
Work.RateLowMedium     0.8882602451  1.58330914
Work.RateMediumHigh    0.3734027092  0.86180542
Work.RateMediumLow     0.4512398000  1.03142832
Work.RateMediumMedium  0.0008428747  0.41858326
PositionDF             0.0983891025  0.53669250
PositionDM            -1.4994488231 -1.00464399
PositionMF            -1.1084118627 -0.75628092
PositionST            -2.0234532813 -1.60277161

Taking Stamina as an example, the interpretation is: we’re 95% confident that the true value of Stamina’s coefficient, using standardized data, lies between 0.316 and 0.436.
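
The interval for Stamina can be reproduced from the coefficient table as estimate \(\pm\) critical value \(\times\) standard error. The sketch takes the estimate, standard error, and residual degrees of freedom from the summary output above; with a fitted lm object, confint() would return the same kind of interval directly.

```r
est <- 0.37617   # Stamina estimate from the coefficient table
se  <- 0.03062   # its standard error
df  <- 12874     # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df)

ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 3))
```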

Test set results

The predicted values and actual values for the test set have a correlation of approximately 0.91, which suggests a relatively good fit.

To summarize the results:

  • Adj \(R^{2}\) (Training Set) = 84.2%
  • Adj \(R^{2}\) (Test Set) = 83.7%
  • RMSE (Training Set) = 2.71
  • RMSE (Test Set) = 2.73
  • MAPE (Training Set) = 3.3%
  • MAPE (Test Set) = 3.3%

For prediction, RMSE and MAPE are the more relevant metrics. RMSE is used for providing prediction intervals that quantify the margin of error for a predicted value. For large samples, a 95% prediction interval (PI) takes the form \(\hat y \pm 2 \times RMSE\), where \(\hat y\) is the predicted value from the regression model. For the MLR model, the margin of error for a 95% PI is \(2 \times 2.73 = 5.46\). If, for example, we predict a player’s Overall to be 75, the lower bound is 75 - 5.46 ≈ 70 and the upper bound is 75 + 5.46 ≈ 80, rounding to the nearest whole number since FIFA scores are integer values.
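
The approximate interval above can be written as a few lines of arithmetic. The sketch uses the test set RMSE and the example prediction of 75 from the text; the factor of 2 is the large-sample approximation described above, not an exact quantile.

```r
rmse_test <- 2.73   # test set RMSE of the MLR model
y_hat     <- 75     # example predicted Overall

# Approximate 95% margin of error for large samples
margin <- 2 * rmse_test

# Bounds rounded to whole numbers, since Overall is an integer score
pi_bounds <- round(y_hat + c(-1, 1) * margin)

print(margin)
print(pi_bounds)
```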

The MAPE quantifies how far off the predictions were from the actual values, as a percentage. With a MAPE of 3.3%, the MLR model implies an accuracy of ~ 97%. So while point predictions are accurate, the margin of error might be a bit wide.
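
Both metrics are short functions in base R. The helper names rmse and mape and the four example values are assumptions for illustration; they are not taken from the analysis code.

```r
# Root mean squared error: typical size of a prediction error, in rating points
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute percentage error: average absolute error relative to the actual value
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Made-up example values
actual    <- c(70, 65, 80, 75)
predicted <- c(72, 64, 77, 75)

rmse_value <- rmse(actual, predicted)
mape_value <- mape(actual, predicted)

print(rmse_value)
print(mape_value)
```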

Improving Prediction with a random forest

Earlier it was mentioned that goalkeepers were excluded from the MLR model because their pattern is distinct from that of the other positions. A decision tree can easily handle such abnormalities and non-linearity since it isn’t forced to conform to linear assumptions about the data. We will try improving the accuracy of our predictions (lowering the RMSE) by using a random forest, an ensemble of decision trees.

We use a random forest of 250 trees, each grown on a bootstrapped sample. Standardization isn’t necessary for random forests because, as this Stack Overflow post explains, they have no counterpart to the MLR coefficients for describing the relationship between a predictor and the response variable. The output instead reports variable importance in two forms: %IncMSE, the increase in mean squared error when a predictor’s values are permuted, and IncNodePurity, the total reduction in the residual sum of squares (RSS) from splits on that predictor.
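
A minimal sketch of such a fit, assuming the randomForest package is installed, is shown below. The simulated data and the coefficients used to generate Overall are made up; only the number of trees (250) comes from the text.

```r
# Requires the randomForest package; the block is skipped if it is not installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(123)
  n <- 300

  # Simulated predictors on their original (unstandardized) scales
  toy <- data.frame(
    Reactions = rnorm(n, 62, 9),
    Dribbling = rnorm(n, 58, 15),
    Jumping   = rnorm(n, 65, 12)
  )
  # Made-up relationship: Jumping plays no role in the response
  toy$Overall <- 0.5 * toy$Reactions + 0.2 * toy$Dribbling + rnorm(n, 0, 2)

  # 250 trees; importance = TRUE requests the permutation-based measure as well
  rf <- randomForest::randomForest(Overall ~ ., data = toy,
                                   ntree = 250, importance = TRUE)

  # Matrix with one row per predictor and both importance measures as columns
  imp <- randomForest::importance(rf)
  print(imp)
}
```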

Here’s a summary of the results:

  • \(R^{2}\) (training set) = 90.1%
  • \(R^{2}\) (test set) = 90.3%
  • RMSE (training set) = 2.17
  • RMSE (test set) = 2.20
  • MAPE (training set) = 2.5%
  • MAPE (test set) = 2.5%

A random forest better captures the variability in the data, even with the goalkeepers included, improving \(R^{2}\) by approximately 6 percentage points. The test set RMSE decreases modestly (2.73 to 2.20). We have to keep in mind, however, that the MLR model didn’t include goalkeepers. If goalkeepers are removed, the test set RMSE for the random forest decreases to ~ 1.9.

                   %IncMSE IncNodePurity
Age              1.4099309     50560.928
Weight           0.5068813      8197.697
Dribbling       11.4324899     90320.777
Interceptions    5.2887394     52796.588
HeadingAccuracy  3.5888335     30998.888
Reactions       18.1111568    259676.113
Balance          0.5318262      7846.675
Jumping          0.3480535      8676.450
Stamina          1.6217141     18999.577
Strength         0.9273811     14032.001
Composure        6.4814500    129137.541
Work.Rate        0.1798603      4841.686
Position         2.7123281      8395.659

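By %IncMSE, the random forest, as did the MLR model, indicates that Reactions and Dribbling are the most important indicators of a player’s Overall score. By IncNodePurity, Reactions still ranks first, with Composure second and Dribbling third.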

Conclusion

An MLR model and a random forest regression model were compared for their predictive power. The MLR model strikes a balance between predictive accuracy and explainability, since it limits multicollinearity by using a minimally sufficient subset of predictor variables. The random forest doesn’t provide a large margin of improvement over MLR, illustrating the power of linear regression modeling. Nevertheless, both models indicate that Reactions and Dribbling are the most important indicators of a player’s success. Further research can focus on modeling goalkeepers separately with MLR or adding more variables to improve the current MLR predictive accuracy, hopefully without sacrificing explainability.

Addendum

The complete R Markdown code and the CSV file used for this analysis can be found on my Fifa19 GitHub repository.